Layered representations for vision and video

Edward H. Adelson
Department of Brain and Cognitive Sciences
and
Media Laboratory
Massachusetts Institute of Technology
Cambridge, MA 02139

Human vision, machine vision, and image coding all demand representations that are useful and efficient. The best-established techniques today are based on low-level processing. To advance to a new generation of architectures for image analysis and image coding, we need to work with new image representations that involve such concepts as surfaces, lighting, and transparency. These representations fall in the domain of "mid-level" vision, and there is accumulating evidence of their importance in human vision. By representing images with these more sophisticated vocabularies we can increase the flexibility and efficiency of our vision and image coding systems. We are developing systems that decompose image sequences into overlapping layers, rather like the "cels" used by a traditional animator. These layers are ordered in depth, sliding over one another and combining according to the rules of transparency and occlusion. Using the layered representation we can achieve greatly improved motion analysis and image segmentation. By applying layers to image coding we can achieve data compression far better than MPEG, with frame-rate independence as a side benefit. Moreover, the image sequence is decomposed in a meaningful way, which allows flexible image editing and access.
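
A minimal sketch of the combination rule (our illustration, not the authors' implementation): if each layer carries an intensity map and an opacity map with values in [0, 1], depth-ordered layers combine back to front by the standard "over" operation, as in the following Python sketch.

    import numpy as np

    def composite_layers(layers):
        """Combine depth-ordered layers back to front.

        `layers` is a list of (intensity, alpha) pairs, ordered from the
        deepest layer to the frontmost; intensity and alpha are float
        arrays of the same shape, with alpha in [0, 1].
        (Illustrative sketch only; names are hypothetical.)
        """
        frame = np.zeros_like(layers[0][0])
        for intensity, alpha in layers:  # back to front
            frame = alpha * intensity + (1.0 - alpha) * frame
        return frame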

Representation of Scenes from Collections of Images

Rakesh Kumar, P. Anandan, Michal Irani, James Bergen, Keith Hanna
David Sarnoff Research Center
CN5300, Princeton, NJ 08543
EMAIL: kumar@sarnoff.com

The goal of computer vision is to extract information about the world from collections of images. This information might be used to recognize or manipulate objects, to control movement through the environment, to measure or determine the condition of objects, and for many other purposes. The goal of this paper is to consider the representation of information derived from a collection of images and how it may support some of these tasks. By ``collection of images'' we mean any set of images relevant to a given scene, including video sequences, multiple images from a single still camera, and multiple images from different cameras. The central thesis of this paper is that the traditional approach of representing information about scenes by relating each image to an abstract three-dimensional coordinate system may not always be appropriate. An approach that more directly represents the relationships among the collection of images has a number of advantages, and these relationships can be computed using practical and efficient algorithms.

This paper presents a hierarchical framework for scene representation. Each level in the hierarchy supports additional types of tasks, so that the overall structure grows in capability as more information about the scene is acquired. The proposed hierarchy of representations is as follows: (1) the images themselves; (2) two-dimensional image mosaics; (3) image mosaics with parallax; and (4) layers and tiles with parallax. We develop the algorithms used to build these representations and demonstrate results on real image sequences. Finally, the application of these representations to real-world problems is discussed.
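
As a rough sketch of level (2) of this hierarchy (our illustration, with implementation details assumed, not the authors' code): a two-dimensional mosaic can be assembled by warping each frame into a common reference frame with an estimated homography and blending the overlapping pixels.

    import cv2
    import numpy as np

    def build_mosaic(frames, homographies, mosaic_shape):
        """Warp each grayscale frame into a common frame and average.

        `homographies[i]` is the 3x3 matrix mapping frame i into the
        mosaic's coordinate system (assumed estimated elsewhere, e.g.
        by direct image alignment). Illustrative sketch only.
        """
        acc = np.zeros(mosaic_shape, dtype=np.float64)
        weight = np.zeros(mosaic_shape, dtype=np.float64)
        size = (mosaic_shape[1], mosaic_shape[0])  # (width, height)
        for frame, H in zip(frames, homographies):
            acc += cv2.warpPerspective(frame.astype(np.float64), H, size)
            weight += cv2.warpPerspective(
                np.ones_like(frame, np.float64), H, size)
        return acc / np.maximum(weight, 1e-8)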


Physically-Valid View Synthesis by Image Interpolation

Steven M. Seitz and Charles R. Dyer
Department of Computer Sciences
University of Wisconsin
Madison, WI 53706

Image warping is a popular tool for smoothly transforming one image to another. ``Morphing'' techniques based on geometric image interpolation create compelling visual effects, but the validity of such transformations has not been established. In particular, does 2D interpolation of two views of the same scene produce a sequence of physically valid in-between views of that scene? In this paper, we describe a simple image rectification procedure which guarantees that interpolation does in fact produce valid views, under generic assumptions about visibility and the projection process. Towards this end, it is first shown that two basis views are sufficient to predict the appearance of the scene within a specific range of new viewpoints. Second, it is demonstrated that interpolation of the rectified basis images produces exactly this range of views. Finally, it is shown that generating this range of views is a theoretically well-posed problem, requiring neither knowledge of camera positions nor 3D scene reconstruction. A scanline algorithm for view interpolation is presented that requires only four user-provided feature correspondences to produce valid orthographic views. The quality of the resulting images is demonstrated with interpolations of real imagery.
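The key property can be stated in one line (our paraphrase of the construction): once the two basis images have been rectified so that corresponding points lie on the same scanline, linear interpolation of corresponding positions,

    x_s = (1 - s) x_0 + s x_1,   y_s = y_0 = y_1,   0 <= s <= 1,

traces out the projections of each scene point in exactly the range of physically valid in-between views.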

Direct Methods for Visual Scene Reconstruction

Richard Szeliski and Sing Bing Kang
Digital Equipment Corporation
Cambridge Research Lab, One Kendall Square, Bldg. 700
Cambridge, MA 02139

There has recently been considerable activity surrounding the reconstruction of photorealistic 3-D scenes and high-resolution images from video sequences. In this paper, we present some of our recent work in this area, which is based on the registration of multiple images (views) in a projective framework. Unlike most other techniques, we do not rely on special features to form a projective basis. Instead, we directly solve a least-squares estimation problem in the unknown structure and motion parameters, which leads to statistically optimal estimates. We discuss algorithms both for constructing planar and panoramic mosaics and for recovering projective depth. We also speculate about the ultimate usefulness of projective approaches to visual scene reconstruction.
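
The flavor of the direct estimation can be captured in a single objective (our paraphrase, not the paper's exact notation): the unknown parameters p are chosen to minimize the intensity error between a warped image and a reference image over all pixels,

    E(p) = sum_i [ I_1(w(x_i; p)) - I_0(x_i) ]^2,

where for planar and panoramic mosaics w(x; p) is a global projective (homography) warp, and for projective depth recovery p additionally contains a per-pixel projective depth.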

About the correspondence of points between N images

O. Faugeras and B. Mourrain

We analyze the correspondence of points between an arbitrary number of images from an algebraic and a geometric point of view. We use the formalism of the Grassmann-Cayley algebra as the simplest way to make both geometric and algebraic statements in a synthetic and effective way (i.e., allowing actual computation when needed). We propose a systematic way to describe the algebraic relations that are satisfied by the coordinates of the images of a 3-D point.

These relations fall into three types: bilinear relations, arising when we consider pairs of images among the N, which are the well-known epipolar constraints; trilinear relations, arising when we consider triples of images among the N; and quadrilinear relations, arising when we consider four-tuples of images among the N. Moreover, we show how two trilinear relations imply the bilinear ones (i.e., the epipolar constraints). We also show how the trilinear constraints can be used to predict the image coordinates of a point in a third image, given its coordinates in the other two images, even in cases where prediction from the epipolar constraints fails (points in the trifocal plane, or aligned optical centers).
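
For concreteness, in homogeneous coordinates (our notation): the bilinear relation between images 1 and 2 is the epipolar constraint

    m_2^T F_12 m_1 = 0,

where F_12 is the fundamental matrix; it constrains m_2 only to a line in image 2. The trilinear relations, being linear in the coordinates of m_3 once m_1 and m_2 are fixed, generically determine the third image point itself, which is why transfer by trilinearities succeeds where epipolar transfer degenerates.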

Finally, we show that the quadrilinear relations lie in the ideal generated by the bilinear and trilinear ones, and so bring in no new information. This completes the algebraic description of the correspondence between any number of cameras.


Duality of Reconstruction and Positioning from Projective Views

Stefan Carlsson
Computational Vision and Active Perception Laboratory
NADA-KTH, Stockholm, Sweden
Email: stefanc@bion.kth.se

Given multiple image data from a set of points in 3-D, there are two fundamental questions that can be addressed: the reconstruction of the scene points, and the positioning of the cameras. In this paper we derive constraint relations between linearly invariant image structure, space point structure, and camera positions. These relations can be written as canonical ``projection'' equations in which cameras are represented by their positions in space only. In these relations, space points and camera positions occur in a reciprocal way, which means that there is a duality between the problems of computing scene structure and camera positions from image data: they can be solved with the same kind of algorithm, depending on the number of space points and camera views. The problem of computing camera positions from m points in n views can be solved with the same algorithm as the problem of directly reconstructing n+4 points in m-4 views. This unifies different approaches to projective reconstruction, epipolar-based vs. direct methods.
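
A worked instance of the duality (our example, simply instantiating the formula above): computing camera positions from m = 7 points in n = 2 views maps to reconstructing n + 4 = 6 points in m - 4 = 3 views; for m = 6 points in n = 2 views the dual problem is again 6 points in 2 views, so that problem is its own dual.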

Shape Tensors for Efficient and Learnable Indexing

Michael Werman and Daphna Weinshall
Institute of Computer Science
The Hebrew University of Jerusalem
91904 Jerusalem, Israel

Object recognition starts from a set of image measurements (including locations of points, lines, and surfaces, color, and shading), which provides access into a database where representations of objects are stored. We describe a COMPLEXITY THEORY OF INDEXING, a meta-analysis which identifies the best set of measurements (up to algebraic transformations) such that: (1) the representations of objects are linear subspaces, and thus easy to LEARN; (2) direct indexing is EFFICIENT, since the linear subspaces are of minimal rank. Index complexity is determined via a simple process, equivalent to computing the rank of a matrix. We readily rederive the index complexity of the few previously analyzed cases. We then compute the best index for new and more interesting cases: 6 points in one perspective image, 6 directions in one para-perspective image, and 2 perspective images of 7 points. For color we obtain the following result: 4 color sensors are sufficient for color constancy at a point, and the sensor-output index is irreducible; the most efficient representation of a color is a plane in 3-D space. For future application to any vision problem where the relations between shape and image measurements can be written down, we give an automatic process to construct the most efficient database that can be directly obtained by learning from examples.
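
The rank computation behind index complexity can be made concrete with a small sketch (our illustration in Python; the matrix entries below are hypothetical, since the real ones come from the algebraic shape-to-measurement relations of a specific problem):

    import numpy as np

    # Stack the linear(ized) relations between shape parameters and
    # image measurements as the rows of a matrix; the index complexity
    # is then read off as the rank of this matrix.
    relations = np.array([
        [1.0, 0.0, -1.0, 2.0],
        [0.0, 1.0,  1.0, 0.0],
        [1.0, 1.0,  0.0, 2.0],  # sum of the first two rows: dependent
    ])
    print(np.linalg.matrix_rank(relations))  # prints 2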

Relation Between 3D Invariants and 2D Invariants

S.J. Maybank
Robotics Research Laboratory
Department of Engineering Science
University of Oxford
OX1 3PJ, UK

GEC-Marconi Hirst Research Centre
Elstree Way, Borehamwood
Herts WD6 1RX, UK
email: maybank@robots.oxford.ac.uk

A polynomial relation is established between the invariants of certain mixed sets of points and lines and the invariants of their projected images. The relation is obtained using the properties of a twisted cubic, which is a covariant of the given set of points and lines.

Virtualized Reality: Being mobile in a visual scene

Takeo Kanade, P. J. Narayanan, Peter Rander
Robotics Institute
Carnegie Mellon University
Pittsburgh, PA 15213

The visual medium evolved from the earliest paintings to realistic paintings to photographs. The medium of moving imagery started with motion pictures. Television and video recording advanced it, showing action "live" or capturing it for later playback. In all of these media, the view of the scene is determined at recording, or transcription, time, independently of the viewer.

We have been developing a new visual medium named virtualized reality. It delays the selection of the viewing angle until viewing time, using techniques from computer vision and computer graphics. The visual event is captured using many cameras that cover the action from all sides. The 3D structure of the event, aligned with the pixels of the image, is computed for a few selected directions using a stereo technique. Triangulation and texture mapping enable the reconstruction of the event from any viewpoint on graphics workstations. When two views of the event are reconstructed, one for each of the user's eyes, and supplied to a stereo-viewing system, the viewer can experience being in the scene rather than just watching it. Virtualized reality, then, allows the viewer to move freely in the scene, independent of the transcription angles used to record it.
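
A minimal sketch of the reconstruction step (our simplification; the system described uses triangulated meshes with texture mapping on graphics workstations): given a depth value for every pixel of a reference camera, each pixel can be lifted to 3-D and reprojected into an arbitrary virtual camera.

    import numpy as np

    def reproject(depth, K_ref, K_virt, R, t):
        """Map each pixel of a reference view into a virtual view.

        depth:  HxW array of depths along the reference camera's rays.
        K_ref, K_virt: 3x3 intrinsic matrices (assumed known).
        R, t:   rotation and translation from reference to virtual camera.
        Returns an HxW map of (x, y) positions in the virtual image.
        (Illustrative sketch; names and interfaces are hypothetical.)
        """
        h, w = depth.shape
        ys, xs = np.mgrid[0:h, 0:w]
        pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3).T
        rays = np.linalg.inv(K_ref) @ pix            # back-project pixels
        pts = rays * depth.reshape(1, -1)            # lift to 3-D
        proj = K_virt @ (R @ pts + t.reshape(3, 1))  # reproject
        return (proj[:2] / proj[2]).T.reshape(h, w, 2)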

Virtualized reality has significant advantages over virtual reality. The virtual reality world is typically described using simplistic, artificially created CAD models. Virtualized reality, in contrast, starts with the real-world scene and virtualizes it. It is a fully 3D medium, as it knows the 3D structure of every point in the image.

The applications of virtualized reality are many. Training can become safer and more effective by enabling the trainee to move about freely in a virtualized environment. A whole new era of entertainment programming can open by allowing the viewer to watch a basketball game while standing on the court or while running with a particular player. In this paper, we describe the hardware and software setup in our "studio" to make virtualized reality movies. Examples are provided to demonstrate the effectiveness of the system.


Multiframe Structure from Motion in Perspective

John Oliensis
NEC Research Institute
4 Independence Way
Princeton, N.J. 08540

A new approach to multiframe structure from motion for point features is presented. Unlike previous approaches, it gives robust reconstruction in situations commonly encountered in outdoor robot navigation, for general motion and with large perspective effects. Under appropriate conditions, the algorithm provably gives the correct reconstruction, and typical computation times are on the order of seconds. It is argued that the new approach, combined with previous algorithms valid in other domains (e.g., Tomasi's algorithm), gives a general method for reconstructing structure from motion.

A Canonical Framework for Sequences of Images

Anders Heyden, Kalle Åström
Dept of Mathematics, Lund University
Box 118, S-221 00 Lund, Sweden
email: andersp@maths.lth.se kalle@maths.lth.se

This paper deals with the problem of analysing sequences of images of rigid objects with distinguishable points taken by uncalibrated cameras. It is assumed that the correspondences between the points in the different images are known. The paper introduces a new framework for this problem. Corresponding points in a sequence of n images are related to each other by a fixed n-linear form. This form is an object invariant property, closely linked to the motion of the camera relative to the fixed world. We first describe a reduced setting in which these multilinear forms are easier to understand and analyse. This new formulation of the multilinear forms is then extended to the traditional uncalibrated formulation and to the calibrated case. The formulation makes apparent the connection between camera motion, camera matrices and multilinear forms and the similarities between the calibrated and uncalibrated cases. The new ideas are then used to derive simple linear methods for extracting camera motion from sequences of images. This is illustrated in a few experiments.
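
To illustrate the flavor of such linear methods (this is the classical linear estimation of a two-view bilinear form, shown only as an example, not the paper's n-view algorithm): the bilinear constraint x_2^T F x_1 = 0 is linear in the entries of F, so F can be estimated from point correspondences by homogeneous least squares.

    import numpy as np

    def estimate_bilinear_form(x1, x2):
        """Estimate the 3x3 bilinear form F with x2^T F x1 = 0.

        x1, x2: Nx2 arrays of corresponding image points (N >= 8).
        Returns F up to scale, via homogeneous least squares (SVD).
        (Coordinate normalization is omitted for brevity.)
        """
        n = x1.shape[0]
        h1 = np.hstack([x1, np.ones((n, 1))])
        h2 = np.hstack([x2, np.ones((n, 1))])
        # Each correspondence gives one equation in the 9 entries of F.
        A = np.einsum('ni,nj->nij', h2, h1).reshape(n, 9)
        _, _, vt = np.linalg.svd(A)
        return vt[-1].reshape(3, 3)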

Metric Calibration of a Stereo Rig

Andrew Zisserman, Paul A Beardsley and Ian Reid
Robotics Research Group
Department of Engineering Science
University of Oxford
Oxford, OX1 3PJ, UK

We describe a method to determine affine and metric calibration for a stereo rig. The method does not involve the use of calibration objects or special motions, but simply a single general motion of the rig with fixed parameters (i.e. camera parameters and relative orientation of the camera pair). The novel aspects of this work are: first, relating the distinguished objects of Euclidean geometry to fixed entities of a Euclidean transformation matrix; second, showing that these fixed entities are accessible from the conjugate Euclidean transformation arising from the projective transformation of the structure under a motion of the fixed stereo rig; third, a robust and automatic implementation of the method.
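
The conjugacy at the heart of the method can be written compactly (our notation): if T is the unknown projective-to-Euclidean upgrade of the structure, the projective transformation H_P induced by a motion of the fixed rig is conjugate to a Euclidean transformation H_E,

    H_P = T^-1 H_E T,

so the two share eigenvalues, and the fixed entities of Euclidean geometry (the plane at infinity for affine calibration, the absolute conic for metric calibration) appear among the fixed elements of H_P.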

Results are included of affine and metric calibration and structure recovery using images of real scenes.