I found this talk yesterday on YouTube, which discussed the possibility of using multi-compartment neurons in place of the classical point neuron (PN) to gain more expressive power. The idea traces back to an old NIPS paper, https://www.dropbox.com/s/xyzqbkfpqoygmhz/NIPS-1991-the-clusteron-toward-a-simple-abstraction-for-a-complex-neuron-Paper.pdf?dl=0, in which the author proposed a model called the Clusteron.

Model structure

In the PN model, the output could be written as

\[y = \phi(\sum_i w_i x_i) \]

while in the Clusteron it becomes:

\[\begin{aligned} y &= \phi(\sum_i a_i)\\ a_i &= w_i x_i (\sum_{j \in D_i} w_j x_j) \end{aligned} \]

A schematic view of a Clusteron.

The main difference between the Clusteron and a PN is that in a Clusteron the contribution of a single synapse \(i\) is modulated by other synapses \(j\in D_i\), where \(D_i\) is the set of synapses that are spatially close to synapse \(i\).
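To make the structural difference concrete, here is a minimal numpy sketch of both models; the neighborhood lists, the tanh nonlinearity, and whether \(D_i\) contains synapse \(i\) itself are illustrative choices, not details fixed by the paper.

```python
import numpy as np

def point_neuron(x, w, phi=np.tanh):
    """Classical point neuron: y = phi(sum_i w_i x_i)."""
    return phi(np.dot(w, x))

def clusteron(x, w, neighbors, phi=np.tanh):
    """Clusteron: the contribution of synapse i is gated by the summed
    activity of its spatial neighborhood D_i (given as index lists)."""
    s = w * x                                    # per-synapse activity w_i x_i
    a = np.array([s[i] * s[neighbors[i]].sum()   # a_i = w_i x_i * sum_{j in D_i} w_j x_j
                  for i in range(len(x))])
    return phi(a.sum())

# toy example: 6 synapses, each neighborhood = the adjacent synapses on a line
rng = np.random.default_rng(0)
x, w = rng.standard_normal(6), rng.standard_normal(6)
neighbors = [[j for j in (i - 1, i + 1) if 0 <= j < 6] for i in range(6)]
print(point_neuron(x, w), clusteron(x, w, neighbors))
```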

This operation reminds me of the attention mechanism. In attention, each query Q is compared with a set of keys K to compute scores, and the scores then weight the values that are summed into the final output. In the Clusteron, the query for synapse \(i\) is its local activity \(w_i x_i\), the keys are the other synapses, and the score function is an indicator that outputs 1 for synapses in the neighborhood \(D_i\) and zero otherwise. To generalize the model, one can replace this boolean codomain with a continuous one, e.g., letting the spatial distance between synapses act as a weight, so that the equation for \(a_i\) becomes:

\[a_i = w_i x_i \sum_{j\neq i} K(\|p_i - p_j\|) \, w_j x_j \]

Here \(p_i\) denotes the spatial position of synapse \(i\) on the dendrite and \(K\) is a decaying kernel of the distance (e.g., a Gaussian). The activation of a single synapse \(i\) is now modulated, in principle, by its distance to every other synapse \(j\). From a spatial perspective, this is akin to a graph convolution operation.
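A sketch of this continuous relaxation, assuming scalar synapse positions along the dendrite and a Gaussian kernel of the distance (both choices are mine, for illustration):

```python
import numpy as np

def clusteron_generalized(x, w, pos, sigma=1.0, phi=np.tanh):
    """Generalized Clusteron: the hard neighborhood D_i is replaced by a
    smooth kernel K of the pairwise spatial distance between synapses."""
    s = w * x                                    # w_i x_i
    d = np.abs(pos[:, None] - pos[None, :])      # pairwise distances ||p_i - p_j||
    K = np.exp(-d**2 / (2 * sigma**2))           # Gaussian kernel, N x N
    np.fill_diagonal(K, 0.0)                     # enforce j != i as in the sum
    a = s * (K @ s)                              # a_i = w_i x_i * sum_j K_ij w_j x_j
    return phi(a.sum())

rng = np.random.default_rng(0)
x, w = rng.standard_normal(8), rng.standard_normal(8)
pos = np.sort(rng.random(8))                     # synapse positions along the dendrite
print(clusteron_generalized(x, w, pos))
```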

Learning rule

Deriving a learning rule for the Clusteron is easy: the first part is the same as the one for a perceptron, while the second part is symmetric to the first, as in the original Clusteron paper. In our generalized version the derivation is also simple: we have just replaced the neighborhood set \(D_i\) with a distance-based weighting. The nice part is that, if we allow the synapses to move spatially, we can further tune the model by adjusting the pairwise distances between them. Computationally, this amounts to adjusting an \(N \times N\) connectivity matrix, so the total number of parameters becomes \(N \times (N+1)\).
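As a hedged sketch of what tuning these \(N \times (N+1)\) parameters could look like with a gradient rule (the squared loss, tanh nonlinearity, and direct parameterization of the matrix entries are assumptions for the example, not the paper's rule):

```python
import numpy as np

def grad_step(x, w, K, target, lr=0.01):
    """One gradient step for the generalized Clusteron under a squared loss,
    treating the N x N connectivity matrix K itself as a parameter.
    Assumes K is symmetric with zero diagonal."""
    s = w * x
    z = s @ K @ s                                 # z = sum_i a_i
    y = np.tanh(z)
    dL_dz = (y - target) * (1.0 - y**2)           # chain rule through tanh
    grad_w = dL_dz * 2.0 * x * (K @ s)            # perceptron-like part, doubled by symmetry of K
    grad_K = dL_dz * np.outer(s, s)               # dz/dK_ij = s_i s_j
    np.fill_diagonal(grad_K, 0.0)                 # keep the j != i constraint
    return w - lr * grad_w, K - lr * grad_K

rng = np.random.default_rng(0)
N = 8
x, w = rng.standard_normal(N), rng.standard_normal(N)
K = rng.random((N, N)); K = (K + K.T) / 2.0; np.fill_diagonal(K, 0.0)
w, K = grad_step(x, w, K, target=1.0)
```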

Can we further extend this approach?

From the above we can see that the advantage of the Clusteron over the perceptron is that it exploits some parameters twice to obtain nonlinear computational power. We can push this approach further. If we view the computation in the Clusteron as a temporal process, the complexity can be extended dramatically, since the same parameter could be reused an arbitrary number of times (not just twice), which in principle gives us a recurrent neural network. This is reminiscent of the dense associative memory recently proposed by Hopfield et al. However, in that case the sensitivity of a single parameter becomes very high and the derivative can't be computed in a biologically plausible way. But is this really the case? What if we reduced the complexity of the connectivity matrix to a sparse or otherwise structured matrix? Wouldn't that reduce the interference between different synapses? What about a hierarchical aggregation process?