The importance of simplicity as a principle in human learning has been demonstrated experimentally in cognitive science through concept learning studies [1] and in linguistics through grammar learning work [2]. Simplicity also has theoretical promise in explaining how humans can acquire arbitrary computable grammars inductively [3, 4], although it leaves open the question of why humans have the priors over those concepts that they do [5].
But there remains an unexplored corollary to the simplicity principle: two things (concepts, ideas, tasks, etc.) will be more easily learnable if they are mutually compressible. Put another way: if a learner has already acquired an idea $c_1$, and the conditional Kolmogorov complexities satisfy $K(c_2 \mid c_1) < K(c_3 \mid c_1)$, then the idea $c_2$ will be easier to learn than $c_3$. Some form of this theory has been around for a long time, and it is quite intuitive: all it really says is that if knowing one thing makes another thing simpler, then that second thing will be easier to learn. However, it has not been formalized as above. I call this principle, really an extension of the simplicity principle, the schematicity corollary, after Frederic Bartlett's notion of schemata in his influential framework for understanding this idea [6].
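Since $K$ is uncomputable, any concrete illustration has to substitute a real compressor for it. The following is a minimal sketch under that assumption: the function name and toy data are mine, and `zlib` merely stands in for an ideal compressor. It approximates $K(c_2 \mid c_1)$ as the number of extra compressed bytes $c_2$ costs once $c_1$ has already been compressed.

```python
import zlib
import random

def approx_cond_complexity(target: bytes, context: bytes) -> int:
    """Crude estimate of K(target | context): the extra compressed bytes
    the target costs on top of an already-compressed context."""
    c_context = len(zlib.compress(context, 9))
    c_joint = len(zlib.compress(context + target, 9))
    return c_joint - c_context

c1 = b"the cat sat on the mat. " * 20               # an already-learned idea
c2 = b"the cat sat on the hat. " * 20               # shares structure with c1
rng = random.Random(0)
c3 = bytes(rng.randrange(256) for _ in range(480))  # unrelated noise, same size

print(approx_cond_complexity(c2, c1) < approx_cond_complexity(c3, c1))  # True
```

Under this proxy, the structure-sharing $c_2$ is cheaper given $c_1$ than the unrelated $c_3$, which is exactly the ordering the corollary predicts for learnability.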
Schematicity
The initial notion of schemata—as put forward in Bartlett's work—was adapted from earlier work by neurologist Sir Henry Head, who attempted to understand how past motor experiences contribute to present understanding of the body (p. 187) [7]. Bartlett extended the term to more abstract concepts.
In several influential experiments, Bartlett presented participants with stories and drawings, and then had them recall (or re-draw) the stimuli [6]. Participants routinely altered the stories or drawings upon recreation. Some of these alterations involved deleting elements—in other words, straightforward simplification-by-forgetting—but others involved changes to the events themselves, such as reordering or even changes in content. Likewise, redrawn images not only omitted elements but also rendered elements in more recognizable forms. Bartlett attributed these alterations to the influence of participants' expectations, and he dubbed the expectations exerting this influence “schemata.”
Bartlett's experiments and his notion of schemata have had an enduring legacy. In cognitive science, they inspired constructive memory theory, and the concept of the schema has since proliferated across various psychological literatures, including education, development, cognitive science, and, of course, linguistics [8]. We will not attempt a full review of all these fields, but it does benefit us to take a deeper dive into the linguistics literature.
Schemata Within Language Change and Language Learning
Admittedly, the idea that similarity of material to previous experience facilitates its learnability did not require a single originator, and it is unwise to credit Bartlett for every later researcher who posited similar effects. For example, Uriel Weinreich in Languages in Contact proposed that morphological borrowing is facilitated by “similarity in patterns” (p. 44) [9]. Something like schemata could also be attributed to the even older notions of analogical change and analogical extension, in which patterns found in some forms are extended to others—a process very much like the schema-based reconstructions Bartlett proposed, but pre-dating his work by almost 60 years [10, 11]. There are surely more researchers who have attributed phenomena to this force, and a full literature review of such a general concept would be nigh impossible.
But Bartlett's notion of schemata almost certainly has had a direct effect on modern linguistics. This influence is traceable via two separate usages common in linguistics. The first is traceable to at least Charles Fillmore and colleagues, who introduced the terms “schema” and “schemata” to discuss abstract similarities across utterances [12], replacing Fillmore's earlier use of the term “frame” [13]. Fillmore and collaborators did not discuss the provenance of the term, so it is not clear whether their choice was motivated by direct knowledge of Bartlett or by osmotic exposure; however, the term is used in such a similar way that coincidence seems unlikely. It has since been adopted much more generally, especially in the construction grammar community. A narrower, distinct use, the image schema, is almost exclusive to cognitive linguistics; it seems to have been devised by George Lakoff and Mark Johnson in 1987 and was subsequently picked up by other cognitive linguists [14, 15, 16]. The idea is very similar to the construction grammarians' schema, sharing with it the notion of some abstract structure; but an image schema's structure spans modalities (mental representations of space and time and how they inform language use), whereas the construction grammarians' notion spans linguistic elements. The construction grammarian notion of the schema has survived to the present day.
Also important to the notion of schemata in cognitive science broadly are several theories of language change positing interactions between elements in a language, which necessarily include the pressures for simplicity and therefore schematicity. This idea has been termed, for example, intraference [17], memetic pressure [18], and intra-systemic factors [19], among surely many more. This way of thinking about language change is actually more general than the notion we would like to capture with schematicity—presumably other pressures also operate between elements—but it is worth mentioning because it subsumes our own. The point is that, under the assumption that cognitive pressures exert a force toward simplification (by whatever means), we should expect the influence that elements (or groups of elements) exert on one another through mutual compressibility to be an important source of these interactions.
Finally, it would be impossible to end this section without a note on what I think is an exciting development in linguistics: what I would consider the first actual evidence of the schematicity corollary as I construe it, a relation that will become clearer below. In recent work, it has been found that common patterns can increase the learnability of rarer ones [20]. The AANN pattern (“a long four weeks”) is relatively rare (having both an unusual adjective/quantifier order and an indefinite article used with a plural) yet is nevertheless considered grammatical. It was found that large language models prefer AANN patterns over permuted variants even when their training data is ablated of AANN patterns, and that additionally ablating similar (but not identical) patterns from the data results in accuracy drops. Such results show that, given language data without AANNs, the models still prefer the AANN pattern, which implies a lower code length for those patterns given natural data.
Relation with Algorithmic Information Theory
It is clear that the schemata above—of both linguists and cognitive scientists—provide ways of talking about shared structure. Memories are influenced by the structure of past events, and coerced into familiar schemata or structures that the learner has previously encountered. When acquiring a language, the construction grammarian views the acquisition of abstract categories as the development of a schema—the recognition of a shared structure. Analogical change—be it extension or leveling—involves changing the behavior of some linguistic element to fit a familiar schema by transplanting structure.
If a structure is repeated across many elements that need to be later recalled, then it is useful to save the structure only a single time and reuse it across elements to save space. This shared structure in an algorithmic framework has a very straightforward interpretation: it amounts to the mutual algorithmic information. We can therefore measure schematicity by measuring the amount of shared information in a dataset (or between datasets). In other words, given two datasets $x_1$ and $x_2$, the schematicity is equivalent to $I_k(x_1; x_2) = K(x_1) + K(x_2) - K(x_1 x_2)$ (an identity that holds up to logarithmic terms).
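As a rough, hedged sketch of how this quantity could be estimated in practice (all names and toy data here are my own, with `zlib` again standing in for an ideal compressor), the identity $I_k(x_1; x_2) = K(x_1) + K(x_2) - K(x_1 x_2)$ can be mirrored directly with compressed lengths:

```python
import zlib
import random

def C(data: bytes) -> int:
    """Compressed length: a crude, computable upper bound on K(data)."""
    return len(zlib.compress(data, 9))

def approx_mutual_info(x: bytes, y: bytes) -> int:
    """Estimate I_k(x; y) = K(x) + K(y) - K(x y) using compressed lengths."""
    return C(x) + C(y) - C(x + y)

# Two datasets built on the same sentence frame (a shared "schema"),
# differing only in the words that fill the frame's slots.
x1 = b"once upon a time, the dog chased the cat around the garden. " * 10
x2 = b"once upon a time, the owl chased the vole around the garden. " * 10
rng = random.Random(1)
x3 = bytes(rng.randrange(256) for _ in range(len(x2)))  # unrelated noise

print(approx_mutual_info(x1, x2) > approx_mutual_info(x1, x3))  # True
```

The frame-sharing pair compresses much better jointly than separately, so its estimated $I_k$ is substantially higher than that of the unrelated pair—a compression-based reading of schematicity.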
This definition allows us to translate the above empirical findings into two general statements about learning and recall in algorithmic terms:
- All else being equal, it is easier to learn a highly schematic set of data (ease of learning datasets of high $I_k$).
- All else being equal, when we change data during imperfect recall, there is a tendency to make that data more schematic (pressure to alter datasets to increase $I_k$).
The caveat “all else being equal” is used very specifically: increases to $I_k$ should come about from increases in shared structure, not from increases in overall complexity. For this reason, it is useful to use normalized variants of $I_k$. I aim to introduce some of these in later articles.
References
- Feldman J. The simplicity principle in human concept learning. Current Directions in Psychological Science 2003;12(6):227--232.
- Hsu AS, Chater N. The logical problem of language acquisition: A probabilistic perspective. Cognitive Science 2010;34(6):972--1016.
- Yang Y, Piantadosi ST. One model for the learning of language. Proceedings of the National Academy of Sciences 2022;119(5):e2021865119.
- Chater N, Vitányi P. ‘Ideal learning’ of natural language: Positive results about learning from positive evidence. Journal of Mathematical Psychology 2007;51(3):135--163.
- Heinz J, Idsardi W. What complexity differences reveal about domains in language. Topics in Cognitive Science 2013;5(1):111--131.
- Bartlett FC. Remembering: A study in experimental and social psychology. 1932.
- Head H, Holmes G. Sensory disturbances from cerebral lesions. Brain 1911;34(2-3):102--254.
- Ghosh VE, Gilboa A. What is a memory schema? A historical perspective on current neuroscience literature. Neuropsychologia 2014;53:104--114.
- Weinreich U. Languages in Contact: Findings and Problems. 1954.
- Blevins J, Garrett A. Analogical morphophonology. The nature of the word: Essays in honor of Paul Kiparsky 2009:527--546.
- Osthoff H, Brugmann K. Morphologische Untersuchungen auf dem Gebiete der indogermanischen Sprachen. 1878;3.
- Fillmore CJ, Kay P, O'Connor MC. Regularity and Idiomaticity in Grammatical Constructions: The Case of Let Alone. Language 1988:501--538.
- Fillmore CJ. The case for case. Texas Symposium on Linguistic Universals 1967.
- Lakoff G. Position paper on metaphor. Theoretical Issues in Natural Language Processing 3 1987.
- Johnson M. The body in the mind: The bodily basis of meaning, imagination, and reason. 1987.
- Sweetser EE. Grammaticalization and semantic bleaching. Annual Meeting of the Berkeley Linguistics Society 1988:389--405.
- Croft W. Explaining language change: An evolutionary approach. 2000.
- Ritt N. Selfish sounds and linguistic evolution: A Darwinian approach to language change. 2004.
- Zehentner E. Competition in language change: The rise of the English dative alternation. 2019;103.
- Misra K, Mahowald K. Language Models Learn Rare Phenomena from Less Rare Phenomena: The Case of the Missing AANNs. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing 2024:913--929.