Is this really the case? In this project, the researchers re-evaluated these findings, as shown in the figure. On the left, we can observe that Mamba (one of the most popular RNNs today) scales almost as well as a strong Transformer, which shows great progress since earlier RNNs such as the LSTM. On the right, however, we can observe the same old problem with Mamba. On average, later tokens in a sequence should be easier to predict, because they are conditioned on more information. This is true for the Transformer, where the average perplexity at each token index keeps decreasing over its entire context.
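To make the quantity in such a figure concrete, here is a minimal sketch (not from the paper) of how the average negative log-likelihood at each token index can be measured over many evaluation sequences. The `model` interface, batching, and shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def nll_per_token_index(model, token_batches):
    """Average NLL at each position t, aggregated over all sequences.

    Assumes `model(tokens)` returns next-token logits of shape
    (batch, seq_len, vocab) and `token_batches` yields (batch, seq_len) ints.
    """
    total, count = None, 0
    for tokens in token_batches:
        logits = model(tokens[:, :-1])        # predict token t from tokens < t
        nll = F.cross_entropy(
            logits.transpose(1, 2),           # (batch, vocab, seq_len - 1)
            tokens[:, 1:],                    # targets shifted by one
            reduction="none",
        )                                     # (batch, seq_len - 1)
        batch_sum = nll.sum(dim=0)
        total = batch_sum if total is None else total + batch_sum
        count += tokens.shape[0]
    return total / count                      # should keep decreasing with t
```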
In contrast, Mamba's average perplexity plateaus after roughly 16k tokens. This result represents an awkward reality for existing RNN layers. As a compression heuristic, the update rule needs to discover the underlying structure and relationships among thousands or even millions of tokens. The researchers first observed that self-supervised learning can compress a large training set into the weights of a model, and that such a model often exhibits a deep understanding of the semantic connections among its training data, which is exactly what they need. Inspired by this, the researchers designed a new class of sequence modeling layers in which the hidden state is itself a model and the update rule is a step of self-supervised learning.
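The following is a minimal sketch of that idea, assuming a toy linear inner model, a random-masking corruption, and a fixed learning rate `eta` (all illustrative choices, not the paper's exact design): the hidden state is a weight matrix `W`, and processing each token performs one gradient step on a self-supervised reconstruction loss.

```python
import torch

def naive_ttt_scan(tokens, eta=0.1):
    """tokens: (seq_len, dim) float tensor. Returns the per-token outputs."""
    _, dim = tokens.shape
    W = torch.zeros(dim, dim)                 # hidden state: weights of a linear model
    outputs = []
    for x_t in tokens:
        # Self-supervised task (illustrative): reconstruct x_t from a corrupted view.
        x_tilde = x_t * (torch.rand(dim) > 0.2).float()
        pred = W @ x_tilde
        grad = 2.0 * torch.outer(pred - x_t, x_tilde)   # d/dW ||W x_tilde - x_t||^2
        W = W - eta * grad                    # update rule: one gradient step
        outputs.append(W @ x_t)               # output uses the updated hidden state
    return torch.stack(outputs)
```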
Since the process of updating the hidden state on the test sequence is equivalent to training a model at test time, this new class of layers is called Test-Time Training (TTT) layers. The researchers introduced two simple instances, TTT-Linear and TTT-MLP, whose hidden states are a linear model and a two-layer MLP, respectively. A TTT layer can be integrated into any network architecture and optimized end-to-end, just like RNN layers and self-attention. TTT layers are already efficient in terms of FLOPs; the researchers go a step further and make two innovations to keep them efficient in terms of actual wall-clock runtime as well.
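A hedged sketch of how the two instances differ: only the form of the inner model whose weights serve as the hidden state changes. The dimensions, corruption, and learning rate below are illustrative assumptions, and this is a forward-pass illustration rather than the paper's efficient implementation.

```python
import torch
import torch.nn as nn

def make_inner_model(kind: str, dim: int) -> nn.Module:
    if kind == "linear":                       # TTT-Linear: hidden state = one weight matrix
        return nn.Linear(dim, dim, bias=False)
    if kind == "mlp":                          # TTT-MLP: hidden state = two weight matrices
        return nn.Sequential(
            nn.Linear(dim, 4 * dim, bias=False),
            nn.GELU(),
            nn.Linear(4 * dim, dim, bias=False),
        )
    raise ValueError(kind)

def ttt_layer(tokens, inner, eta=0.1):
    """Sequentially update `inner` on each token of (seq_len, dim) and emit outputs."""
    outputs = []
    for x_t in tokens:
        x_tilde = x_t * (torch.rand_like(x_t) > 0.2).float()  # assumed corruption
        loss = ((inner(x_tilde) - x_t) ** 2).sum()             # self-supervised loss
        grads = torch.autograd.grad(loss, list(inner.parameters()))
        with torch.no_grad():                  # one gradient step = hidden-state update
            for p, g in zip(inner.parameters(), grads):
                p -= eta * g
        outputs.append(inner(x_t))
    return torch.stack(outputs)
```

A naive loop like this is strictly sequential over tokens; the runtime innovations mentioned above are aimed at exactly this kind of bottleneck.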