High-Level Theory Of Vision Research Paper

The concept of ‘high-level vision’ is used to denote aspects of vision that are influenced by information stored in memory such as an object’s identity, its name, or the context in which it belongs. It is usually distinguished from low-level visual processing, which is driven primarily by the visual input. High-level vision involves functions such as object recognition, face perception, mental imagery, picture naming, visual categorization, or scene perception.

1. Background

In hierarchical models of vision (e.g., Marr 1982, see also Marr, David (1945–80)), higher levels of visual processing operate on the building blocks delivered by more primitive visual mechanisms. In Marr’s approach to perception, each stage has its own algorithms and its own format for representing processing output. The primal sketch represents the basic features and their spatial relations (low-level vision), and the 2.5-D sketch represents surfaces and shapes from the observer’s viewpoint (intermediate-level vision). Recognition, that is, the matching of object descriptions stored in visual memory with the visual input, was in Marr’s view possible only when the full 3-D structure of an object and its components is represented in a so-called 3-D object model (high-level vision).

Marr’s approach has been highly influential as a framework for integrating disparate work on the processing of, among others, luminance, texture, shading, color, stereo, and motion (see Palmer 1999). However, his assumption of strictly bottom-up processing in sequential stages has gradually lost its appeal. One reason for this theoretical evolution has been the increased popularity in the 1980s of so-called interactive activation models, in which different levels of processing interact with one another: partial outputs from lower-level processes initiate higher-level processes, and the outputs of higher-level processes feed back into the lower-level processes.

When different levels are distinguished in more recent theories, great care is taken to avoid the assumption of strict sequential processing. However, the issues of where early vision ends, where high-level vision starts, and how to demarcate the two have remained important in current theorizing. Pylyshyn (1999) has proposed the notion of cognitive impenetrability as the major criterion for separating early (or low-level) vision from later (or higher-level) vision. Early visual processes run autonomously; they cannot be influenced by what we know about the world or expect in a specific situation. Most aspects of visual processing that seem cognitive (e.g., recovering unique 3-D percepts from ambiguous 2-D inputs) are attributed to embodied natural constraints derived from principles of optics and projective geometry, internal to the visual mechanisms themselves. Only effects of object-specific beliefs on earlier visual processing would count as cognitive penetration. A large number of so-called ‘top-down effects’ of higher-level processes on early vision are then explained as effects of attention (preceding early vision) or as post-perceptual effects on the decision stage operating on the perceptual output. For example, perceptual learning, perceptual expertise, and effects of categorization on discrimination might be attributed to learning to focus attention on the important features or attributes of a stimulus. Semantic effects of the context of a whole scene (e.g., a kitchen) on the identification of the objects embedded within it (e.g., bread) have been shown to result from response bias rather than from genuine perceptual effects (e.g., Henderson and Hollingworth 1999).

Implicit in Pylyshyn’s (1999) effort to delineate early vision from cognitive influences is his belief—perhaps shared by many vision scientists—that real scientific progress is possible only in early vision. Arguably, however, in recent years considerable progress has been made in the area of high-level vision as well (e.g., Edelman 1999, Ullman 1996). Most of this progress has been made in object recognition, and, thus, this topic will be covered predominantly.

2. Object Recognition

2.1 Biederman (1987)

Object recognition concerns the identification of an object as a specific entity (i.e., semantic recognition) or the ability to tell that one has seen the object before (i.e., episodic recognition). Interest in object recognition was at least partly sparked by the development of a new theory of human object recognition by Biederman (1987). Building on Marr’s recognition ideas, in which perceptually derived 3-D volumetric descriptions are matched to those stored in memory, Biederman additionally incorporated Gestalt principles of perceptual grouping and newer ideas from computer vision into his theory.

Biederman proposed that objects are identified at their entry level on the basis of a structural description of their components and their spatial relations. All object parts are modeled by a limited number of simple primitives called ‘geons,’ a specific subset of generalized cones. A generalized cone is the volume swept out by a cross section moving along an axis. Geons are defined by the categorical attributes on the following dimensions: the axis (straight or curved), the shape of the cross-section (straight or curved, symmetric or asymmetric), and the size of the cross-section (constant, expanded, or expanded and contracted).

One major claim in Biederman’s Recognition-by-Components (RBC) theory is that geons themselves are viewpoint-invariant because the underlying dimensions can be distinguished on the basis of so-called ‘nonaccidental properties’ or NAPs (a notion developed in computer vision). These are special features in the image (such as collinearity, curvilinearity, cotermination, parallelism, and symmetry) that are reliable in the sense that they are most likely caused by similar characteristics in 3-D space (under the assumption of a general viewpoint). For example, nonparallel structures in 3-D space rarely project to parallel structures in the image. Biederman argues that Gestalt principles of grouping support the extraction of these NAPs and that they are thus detected fast and reliably enough to support bottom-up identification of geons.
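The idea of a nonaccidental property can be made concrete with a small sketch. The snippet below is an illustration only, not Biederman’s or any actual vision system’s detector; the line segments and the 5-degree tolerance are assumptions. It groups two image segments as ‘parallel’ when their orientations agree within the tolerance, the kind of image measurement that would license the inference of parallelism in 3-D:

```python
import math

# Toy detector for one nonaccidental property: parallelism.
# Segments and tolerance are illustrative assumptions.
def orientation(seg):
    """Direction of a segment, taken modulo 180 degrees."""
    (x1, y1), (x2, y2) = seg
    return math.atan2(y2 - y1, x2 - x1) % math.pi

def parallel(seg_a, seg_b, tol=math.radians(5)):
    """True when the two orientations agree within tol (with wrap-around)."""
    diff = abs(orientation(seg_a) - orientation(seg_b))
    return min(diff, math.pi - diff) < tol

a = ((0, 0), (4, 1))
b = ((0, 2), (4, 3))   # same direction as a, shifted upward
c = ((0, 0), (1, 4))   # steep segment, not parallel to a

print(parallel(a, b))  # True
print(parallel(a, c))  # False
```

Under a general viewpoint, such image parallelism is unlikely to be accidental, which is what makes it a useful cue for the geon’s underlying 3-D attributes.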

The spatial relations between object components also have to be specified by a number of parameters, and in later versions of the theory a few coarse metric attributes of the geons themselves must be specified too. The combinatorial possibilities of even only two or three geons are then enormous. This allows Biederman to argue that the recovery of two or three geons is enough to recognize complex objects quickly, even when they are occluded, rotated in depth, degraded, or lacking color or texture.
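The combinatorial argument can be illustrated with a back-of-the-envelope count. The figure of 36 geons is from Biederman (1987); the number of categorical relations linking each pair of parts is a made-up placeholder, used here only to show how quickly the space of structural descriptions grows:

```python
# Illustrative count of geon-based structural descriptions.
# N_GEONS follows Biederman (1987); N_RELATIONS is a hypothetical
# number of categorical spatial relations per pair of parts.
N_GEONS = 36
N_RELATIONS = 57

def n_descriptions(n_parts: int) -> int:
    """Ordered choices of geons, with one relation linking each added part."""
    total = N_GEONS
    for _ in range(n_parts - 1):
        total *= N_GEONS * N_RELATIONS
    return total

print(n_descriptions(2))  # two-geon descriptions: 36 * 36 * 57 = 73,872
print(n_descriptions(3))  # three-geon descriptions: over 150 million
```

Even with these toy numbers, two or three parts already yield tens of thousands to hundreds of millions of distinct descriptions, which is the intuition behind Biederman’s claim.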

This claim is supported empirically by experiments in which parts of objects (in fact, line drawings of objects) were deleted until only two or three parts were left. For most moderately complex objects consisting of six or nine parts (e.g., an airplane, an elephant) three parts were still enough to recognize the object. Other supporting evidence comes from recognition of contour-deleted line drawings (which is possible as long as the geons are recoverable), priming with contour-deleted pictures (which is situated at the level of the parts, not the image fragments and not the whole picture), and viewpoint-invariance of 3-D object recognition (under certain limited conditions).

While Biederman and his associates were accumulating evidence for his RBC theory, others were accumulating evidence for strong effects of viewpoint changes on 3-D object recognition (see, e.g., Bulthoff et al. 1995). This has given rise to considerable controversy between viewpoint-dependent and viewpoint-independent theories of object recognition, often centered around the relevance of specific types of objects used as experimental stimuli (e.g., line drawings vs. rendered 3-D objects, all sorts of novel objects such as paperclips, amoebas, and ‘greebles’) or specific types of experimental tasks (e.g., matching, naming, priming). Because the focus in this review is on theory rather than empirical results, this debate will not be reviewed here. Instead, two promising new theoretical developments will be highlighted.

2.2 Ullman (1996)

Ullman’s (1996) book High-level Vision: Object Recognition and Visual Cognition brings together his achievements in three diverse subdomains of high-level vision: (a) object recognition (alignment, combination of views), (b) visual cognition (classification, visual routines), and (c) sequence-seeking and counter-streams.

In the area of object recognition, Ullman proposed two different theories. According to his alignment theory, object recognition consists of a search for a particular object model in memory, Mi, and a particular transformation, Tij, that will maximize the fit between (Mi, Tij) and the viewed object, V. The set of the allowed transformations, such as changes in scale, position, or 3-D orientation, can be applied either to the incoming image (more bottom-up) or to the stored model (more top-down). The simplest version of this theory works well for simple, rigid objects. Because the set of relevant transformations is restricted in that case, only three points are needed to be able to establish a correspondence between the image and the model. This so-called ‘three-point alignment scheme’ explicitly uses 3-D models but works well only when the features that are used for alignment are visible from different views. For all other cases, his second theory, recognition by the combination of views, works better. This does not explicitly use 3-D models but a representation of 3-D objects as a small collection of 2-D views. Object recognition then consists of matching the viewed object against a linear combination of these 2-D views. Extensions of this method can even handle cases with partial views due to occlusion and cases with nonrigid transformations.
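A minimal sketch of the combination-of-views idea follows, under strong simplifying assumptions: orthographic projection, rotation about a single axis, known point correspondences, and one shared coefficient vector (Ullman and colleagues’ full scheme treats x and y coordinates with separate coefficients). Stored 2-D views of a rigid object are stacked as basis vectors, and a novel view is ‘recognized’ when it is well approximated by a linear combination of them:

```python
import numpy as np

# Recognition by combination of views: a novel orthographic view of a
# rigid object is (approximately) a linear combination of a few stored
# 2-D views. Object, angles, and point count are illustrative.
rng = np.random.default_rng(0)
points3d = rng.normal(size=(3, 8))           # 8 feature points of a rigid object

def view(yaw):
    """Orthographic projection of the object rotated about the y-axis."""
    c, s = np.cos(yaw), np.sin(yaw)
    R = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
    return (R @ points3d)[:2].ravel()        # stacked x and y coordinates

stored = np.stack([view(a) for a in (0.0, 0.4, 0.8)], axis=1)  # basis views
novel = view(0.6)                            # an unseen intermediate view

coef, *_ = np.linalg.lstsq(stored, novel, rcond=None)
residual = np.linalg.norm(stored @ coef - novel)
print(residual)                              # near zero: novel view matched
```

A view of a different object would leave a large residual, so the residual itself can serve as the match score.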

Object recognition is concerned with specific shapes undergoing well-defined transformations, which are relatively easy to specify and compensate for. Object classification, on the other hand, is more difficult because it is not obvious what the critical shape features of object classes like ‘dog’ or ‘chair’ are. Ullman demonstrated that relatively simple pictorial comparisons based on extensions of the alignment method can begin to achieve useful classification. Another contribution of Ullman in the domain of visual cognition concerns the visual analysis of shape properties and spatial relations. For example, judging whether an X lies inside or outside a closed, arbitrarily curved shape, which of two ellipses is more elongated, or whether two dots lie on a common curve is notoriously difficult for artificial vision systems, while humans can make such judgments remarkably well. Ullman believes that the flexibility of human spatial perception and cognition depends on so-called ‘visual routines,’ efficient sequences of basic operations that are hardwired into the visual system. For different spatial tasks, different visual routines are compiled, consisting of different sequences of the same basic operations such as shifting the processing focus, boundary tracing, and marking.
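The inside/outside judgment mentioned above can be sketched as a composed routine: trace a ray from the probe location and count how often it crosses the boundary. The even-odd crossing algorithm below is a standard computational-geometry illustration of the task, not Ullman’s proposed neural mechanism, and the polygon and probe points are assumptions:

```python
# Toy "visual routine" for the inside/outside judgment: compose ray
# tracing from the probe point with counting boundary crossings.
def inside(point, polygon):
    """Even-odd rule: cast a ray to the right, count edge crossings."""
    x, y = point
    crossings = 0
    n = len(polygon)
    for i in range(n):
        (x1, y1), (x2, y2) = polygon[i], polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):                      # edge spans the ray's height
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x_cross > x:                           # crossing lies to the right
                crossings += 1
    return crossings % 2 == 1                         # odd count means inside

curve = [(0, 0), (4, 0), (4, 3), (0, 3)]              # a closed boundary

print(inside((2, 1), curve))  # True: the probe lies inside the curve
print(inside((5, 1), curve))  # False: the probe lies outside
```

An odd number of crossings means the probe is enclosed, which mirrors the intuition that a point inside a closed curve cannot reach infinity without crossing the boundary an odd number of times.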

Ullman’s most general contribution to the theory of high-level vision is his proposal for a general model of information flow in the visual cortex. At the heart of the model are a process (called ‘sequence-seeking’) and a structure (called ‘counter-streams’). Sequence-seeking is a search for a sequence of transformations that link an input image with a stored object description. It works in two directions, bottom-up as well as top-down, and it explores a large number of alternative sequences in parallel. Counter-streams are two complementary pathways, an ascending one from low- to high-level visual areas and a descending one from high- to low-level visual areas. The basic idea is that bottom-up processing (starting from the input image) is performed in the ascending pathway and that top-down processing (starting from the stored models) is performed in the descending pathway. The interactions between the two complementary pathways support the integration of bottom-up and top-down processing.

With this model it is possible to match images and models by simultaneously exploring multiple alternatives instead of testing them sequentially. The model also does justice to many well-known anatomical and physiological facts about the visual cortex. For example, the visual cortex has a massively parallel architecture, and descending or feedback projections are as abundant as their ascending or feedforward counterparts. Finally, Ullman’s model offers a detailed account of many well-known psychological findings such as priming, context effects, and perceptual learning.

2.3 Edelman (1999)

Another important theoretical step forward in high-level vision is Edelman’s (1999) book Representation and Recognition in Vision. Edelman argues that the attempt to base recognition on geometrically reconstructed representations of distal objects (like Marr’s 2.5-D sketch) suffers from serious theoretical and practical problems. First, it would lead to the homunculus problem and, more pragmatically, it has thus far not been made to work, not even with the most sophisticated computer vision techniques. The major reason for this failure might be that previous approaches have tried to achieve veridicality of object representation by aiming for isomorphisms between objects in the world and their corresponding internal representations (i.e., first-order isomorphism or representation by similarity). Instead, Edelman proposes to aim for the formally just as adequate isomorphism between the relations among several external objects and the relations among their corresponding internal representations (i.e., second-order isomorphism or representation of similarity).

Edelman then introduces the notion of a shape space, a formalism borrowed from mathematical statistics. Shape space is a metric space in which each point corresponds to a particular shape and in which geometric similarity between shapes can be defined rigorously as proximity, a quantity inversely related to distance. Edelman argues that shape similarity, so defined, can escape the objections commonly raised against metric-space approaches to similarity (e.g., arbitrary choice of features, context dependence, and asymmetries of comparisons). Although a large number of parameters are needed to represent distal shape space fully, fewer may be sufficient if, as in second-order isomorphism, only relations between objects must be represented (e.g., a blending coefficient when morphing the shape of a cow into that of a pig).

A novel shape (or a novel view of a previously seen shape) can be localized in shape space based on its similarity to known objects (or stored views of known objects). A tuned unit (or module of units) that responds optimally to some shape (i.e., the landmark) and progressively less to progressively less similar shapes is enough to implement this mechanism. Edelman uses a radial basis function approximation network that can be trained and that generalizes its response from a series of given views of an object to other views of that object. As a byproduct of this learning, the network will also respond progressively less to progressively less similar objects, precisely what is needed for it to work as an active landmark in internal shape space. This scheme works not only for known objects but also for novel ones: a new object is represented by a vector of proximities to several reference shapes or landmarks. For example, even if one has never seen a giraffe before, it is possible to compute its proximities to known animals such as a camel, a goat, a pig, or even a leopard.
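The proximity-vector idea can be sketched in a few lines. The landmark ‘animals,’ their parameter vectors, and the Gaussian tuning width below are all invented for illustration; only the principle, Gaussian radial basis responses to landmark prototypes, follows Edelman:

```python
import numpy as np

# Sketch of representation by proximities to landmark prototypes.
# The 3-parameter "shape space" and its contents are assumptions.
prototypes = {
    "camel": np.array([0.9, 0.2, 0.7]),
    "goat":  np.array([0.4, 0.3, 0.5]),
    "pig":   np.array([0.2, 0.8, 0.3]),
}

def chorus(shape, width=0.5):
    """Each unit's Gaussian response falls off with distance in shape
    space, so it fires most for its landmark and less for less similar
    shapes. The response vector represents the input shape."""
    return {name: float(np.exp(-np.sum((shape - p) ** 2) / (2 * width ** 2)))
            for name, p in prototypes.items()}

giraffe = np.array([0.8, 0.25, 0.65])   # a novel shape, never seen before
print(chorus(giraffe))                  # largest proximity: "camel"
```

The novel shape is thereby located in shape space without any unit dedicated to it, which is the core of the second-order-isomorphism strategy.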

Edelman calls this system a ‘Chorus of Prototypes’ to stress that the prototypical landmarks act together in representing shape, in contrast to a winner-take-all scheme such as Selfridge’s Pandemonium. Initial simulation results with this system are encouraging, especially in light of the wide variety of tasks: identification of novel views of familiar objects, categorization of novel object views, discrimination among views of novel objects, local viewpoint invariance for novel objects, recovery of a standard view and of pose for novel objects, prediction of a novel view for a novel object, etc. None of its theoretical competitors (Biederman’s RBC, Ullman’s alignment) can deal with this variety of tasks.

There is also considerable neurobiological and psychophysical evidence that is compatible with Edelman’s system. In terms of neurobiology, Edelman discusses properties of the receptive fields at lower levels of processing (broad tuning and graded response) and at higher levels of processing such as inferotemporal cortex (for a review, see Logothetis and Sheinberg 1996): selectivity to objects, ensemble encoding (i.e., sparse distributed representations), selective invariance, plasticity and learning, and speed of processing (which is important because Chorus is essentially a feedforward model). In terms of psychophysics, Edelman discusses the evidence for the well-known effects of viewpoint on object recognition (see Sect. 2.1).

3. Other Topics

A number of other topics could be covered under the umbrella notion of ‘high-level vision,’ but some are addressed more extensively elsewhere in this encyclopedia. Because faces are ‘objects’ with special significance, face perception has become a research topic of its own, somewhat isolated from more general object perception. Indeed, some argue that faces are processed differently, even in a separate brain area (the so-called ‘fusiform face area’), and that face processing is subject to a specific disorder, prosopagnosia, after damage to this and surrounding regions.

A second topic of special interest, which could be regarded as belonging to high-level vision, is mental imagery, the ability to create mental images and operate on them in tasks requiring visual or spatial judgments in the absence of visual input. In his Image and Brain, Kosslyn (1994) has defended the analogue nature of mental images and the use of visual processing mechanisms in mental imagery tasks on the basis of classic psychological experiments and more recent neuroimaging evidence obtained with Positron Emission Tomography (PET) and functional Magnetic Resonance Imaging (fMRI). In addition, he has developed a detailed psychological and neuroanatomical theory of high-level vision, consisting of seven major subsystems: visual buffer, attention window, encoding of object properties, encoding of spatial properties, associative memory, information look-up, and attention shifting. Each of these components is worked out further into its more detailed operations and then localized neuroanatomically on the basis of neuropsychological case studies of brain-damaged patients with specific deficits and available evidence from PET and fMRI studies.

Two other topics of high-level vision, not addressed elsewhere in this encyclopedia, are picture naming and inattentional blindness. Picture naming has always been an interesting task for cognitive psychology because it obviously involves a number of distinct representations and processes. Traditional cognitive theories have proposed different so-called ‘box-and-arrow’ models to chart the intermediate representations and information flow between them, and traditional experimental methods have been used to test hypotheses derived from these models on the basis of performance measures like error rates and response times. More recently, as with mental imagery, the models have become more detailed and neuroanatomically specific thanks to neuroimaging and patient work (e.g., Humphreys et al. 1999).

In the typical experimental paradigm to study inattentional blindness (e.g., Mack and Rock 1998), subjects receive a rather demanding task at fixation (e.g., reading a small letter) while other visual stimuli are present in the visual field. When asked about the irrelevant (distractor) stimuli afterwards, subjects recall very little about them, except perhaps some basic features like color or orientation. In a more spectacular demonstration along the same lines, sudden changes to even large and central parts of a rich visual display, such as a colored picture of a scene, go unnoticed when attention is drawn away from the change. This so-called ‘change blindness,’ like the other forms of inattentional blindness, suggests that our conscious visual experience of the world is really quite limited, despite the illusion that it is rich and visually detailed. These phenomena constitute interesting examples of how visual attention, visual perception, visual memory, and visual consciousness interact with one another in everyday circumstances, but little theoretical progress has been made in understanding this intricate interplay of separate psychological functions.

4. Conclusion And Future Directions

In the area of high-level vision a number of trends emerge. One is the application of computational methods (previously applied with some success to lower-level visual processing) to higher-level visual processing. A second trend is the increased consideration of the neural basis of high-level vision. Theories about high-level vision should be constrained by what we know about their possible implementation in the brain on the basis of monkey work, neuropsychological studies with brain-damaged patients, and neuroimaging studies with healthy subjects.

High-level vision is an active area of interdisciplinary research and much progress may be expected yet.


  1. Biederman I 1987 Recognition-by-components: A theory of human image understanding. Psychological Review 94: 115–47
  2. Bulthoff H H, Edelman S Y, Tarr M J 1995 How are three-dimensional objects represented in the brain? Cerebral Cortex 5: 247–60
  3. Edelman S 1999 Representation and Recognition in Vision. MIT Press, Cambridge, MA
  4. Henderson J M, Hollingworth A 1999 High-level scene perception. Annual Review of Psychology 50: 243–71
  5. Humphreys G W, Price C J, Riddoch M J 1999 From objects to names: A cognitive neuroscience approach. Psychological Research 62: 118–30
  6. Kosslyn S M 1994 Image and Brain: The Resolution of the Imagery Debate. MIT Press, Cambridge, MA
  7. Logothetis N K, Sheinberg D L 1996 Visual object recognition. Annual Review of Neuroscience 19: 577–621
  8. Mack A, Rock I 1998 Inattentional Blindness. MIT Press, Cambridge, MA
  9. Marr D 1982 Vision: A Computational Investigation into the Human Representation and Processing of Visual Information. Freeman, San Francisco
  10. Palmer S E 1999 Vision Science: Photons to Phenomenology. MIT Press, Cambridge, MA
  11. Pylyshyn Z W 1999 Is vision continuous with cognition? The case for cognitive impenetrability of visual perception. Behavioral and Brain Sciences 22: 341–423
  12. Ullman S 1996 High-level Vision: Object Recognition and Visual Cognition. MIT Press, Cambridge, MA