DOI: 10.3724/SP.J.1042.2019.00475

Advances in Psychological Science (心理科学进展) 2019/27:3 PP.475-489

Cross-modal integration of audiovisual information in language processing

In daily life, language is often used in a visual context. A large body of cognitive science research has shown that the visual and linguistic processing modules do not work independently but interact in complex ways. This paper focuses on the impact of visual information on language processing. It first reviews research progress on how visual information affects speech comprehension, speech production, and verbal communication. It then discusses the mechanisms by which visual information affects language processing. Finally, it reviews computational models of visually situated language processing and outlines directions for future research.

Key words: visual information processing, language processing, speech comprehension, speech production, verbal communication

ReleaseDate:2019-03-01 06:48:05
