Perception Pyramid

From TinyCog

The "perception pyramid" refers to a number of subsystems that convert sensor (camera and other) data into a representation that consists of Agents executing their Plans in a given environment. The term "pyramid" is used because the massive amount of sensor data at the base of the pyramid is reduced successively in several architectural layers, creating a pyramid-like shape.

The Action Pyramid is the equivalent concept for executing actions, converting compact information about plans into actuator (robot motor) signals controlled by several closed-loop circuits.

The perception pyramid consists of the subsystems described in the sections below.

The "One World Problem"

How does TinyCog create a single coherent world-model based on ambiguous and possibly contradictory sensor information?

TinyCog proposes a new approach to perception that is capable of unifying data from multiple sensors. The 3D Reconstruction page includes a figure with a schematic of the algorithm. Instead of aggregating the sensor data into increasingly higher-level structures (from textures and edges to surfaces and objects), we propose an iterative refinement algorithm:

  • start with an "initial scene" composed of a number of objects,
  • "render" this scene for each sensor and
  • compare the rendered scene with the actual sensor data in order to calculate a "delta".

This "delta" can be based on multiple sensors, for example:

  • Camera (stereo) sensors
  • Time-of-flight 3D point clouds
  • LIDAR 3D point clouds
  • 3D audio
  • ...

The actual "perceived" world status is defined as the scene with the least sum of deltas from all sensors.
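The render-and-compare loop above can be sketched in a few lines. This is a minimal illustration, not the actual TinyCog implementation: here a "sensor" is just a function projecting scene objects into measurements, and the delta is a sum of squared differences.

```python
# Sketch of the render/compare/delta loop. A "scene" is a list of
# object parameters and a "sensor" is a hypothetical projection
# function; both are illustrative stand-ins.

def render(scene, sensor):
    """Render the scene from one sensor's point of view."""
    return [sensor(obj) for obj in scene]

def delta(rendered, observed):
    """Sum of squared differences between rendering and sensor data."""
    return sum((r - o) ** 2 for r, o in zip(rendered, observed))

def total_delta(scene, sensors, observations):
    """Sum of deltas over all sensors for one candidate scene."""
    return sum(delta(render(scene, s), obs)
               for s, obs in zip(sensors, observations))

def best_scene(candidate_scenes, sensors, observations):
    """The 'perceived' world: the scene with the least sum of deltas."""
    return min(candidate_scenes,
               key=lambda sc: total_delta(sc, sensors, observations))
```

With two toy sensors (identity and a doubling projection), the candidate scene that reproduces both sets of readings wins.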

Several things are key to making this algorithm work effectively:

  • Previous knowledge and "expectations" about the scene,
  • configuring plausible object variants,
  • constructing plausible scene variants and
  • defining useful deltas on various levels.

Previous Knowledge and Expectations

The "normal" operation of the perception pyramid assumes detailed previous knowledge about all objects to be recognized. It is not possible to directly perceive an unknown object. The algorithm requires:

  • the 3D shapes of all objects,
  • the surface textures of all objects,
  • the frequencies of object co-occurrence,
  • the sounds and smells emitted by objects (for integration with audio and olfactory sensors) and
  • the mechanical model and possible movements of objects (for analysis of movements).

All of these items include multiple variants of observed real-world objects.
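The prior-knowledge record per object type might look like the following data structure. All field names and example values are illustrative assumptions, not the actual TinyCog schema.

```python
# Illustrative per-object-type knowledge record covering the items
# listed above: shapes, textures, co-occurrence frequencies,
# emissions and mechanics. Field names are assumptions.
from dataclasses import dataclass, field

@dataclass
class ObjectModel:
    name: str
    shapes: list          # 3D shape variants (e.g. mesh identifiers)
    textures: list        # surface texture variants
    co_occurrence: dict   # other object name -> observed frequency
    emissions: dict = field(default_factory=dict)  # sounds/smells
    mechanics: dict = field(default_factory=dict)  # possible movements

fork = ObjectModel(
    name="fork",
    shapes=["fork_mesh_v1", "fork_mesh_v2"],   # multiple variants
    textures=["steel", "plastic"],
    co_occurrence={"spoon": 0.8, "plate": 0.6},
)
```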

Identifying Individual Objects

The first step in constructing "one world" is to identify all objects that appear in the scene. To do so, an algorithm splits the sensor data into multiple chunks and identifies the objects included in each chunk. The algorithm calculates features of the sensor data (for example edges, types of textures, surfaces, ...) and compares these features with the calculated features of objects in the Object Configurator. The result is a list of possible objects for each sensor data chunk, together with a confidence measure.

There are many options for implementing this object identification, ranging from classical image processing with statistical analysis to deep learning. It is not important for TinyCog which algorithm is actually used.

Constructing a Plausible Scene

Once a number of potential objects are identified individually, they are combined into a single Scene using information about:

  • Knowledge about the context of the scene (e.g. the robot running TinyCog has just entered a kitchen),
  • Previous knowledge about the scene (because the robot has seen this kitchen before, or other kitchens and the objects they include),
  • Occurrence of objects in a given context (e.g. forks and spoons are frequent in kitchens but not in a parking lot) and
  • Co-occurrences of objects in a scene (e.g. if there is a fork, then it is likely that there will also be a spoon).

The result of the algorithm is multiple sets of objects, each with a confidence score. All of these sets are checked later in detail. These "plausible scenes" are the first guess for interpreting sensor data after "switching on the camera".
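A simple way to score a candidate object set against the four information sources above is to multiply per-object confidences by context frequencies and pairwise co-occurrence factors. The probabilities below are illustrative placeholders:

```python
# Sketch of plausible-scene scoring: combine identification
# confidence, context frequency and pairwise co-occurrence into
# one plausibility score. All numbers are illustrative.
from itertools import combinations

def scene_score(objects, confidences, context_freq, co_occurrence):
    """Higher score = more plausible object set for this context."""
    score = 1.0
    for obj in objects:
        score *= confidences.get(obj, 0.0) * context_freq.get(obj, 0.01)
    for a, b in combinations(sorted(objects), 2):
        # Factor > 1 boosts pairs that often appear together.
        score *= co_occurrence.get((a, b), 1.0)
    return score
```

In a kitchen context, "fork + spoon" should outscore "fork + car".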

Configuring Individual Objects

The Object Configurator subsystem maintains a database of 3D shapes and textures per object type, and may also include noise, smell and other sensor data related to an object, as well as details about its mechanics. Sensor data about object variants are stored together with the parameters characterizing the variant, particularly internal states and sub-types of the object. This information is used by the Episodic Memory, which works with abstracted representations of scenes.

When comparing sensor data with an actual object, the system calculates:

  • Edge delta - extract and compare sensor data edges vs. the edges from the current "guess" object configuration using classical computer vision edge extraction algorithms,
  • Texture deltas - extract and compare sensor data textures vs. textures known from previously recognized objects and
  • Other deltas - for example based on abstract shapes or point clouds.

Once the basic object type is determined (or guessed), the delta measures are fed to a classical "gradient parameter search" in order to optimize parameters such as position, rotation, scale, inner object state and is-part-of composition for best fit. In the case of a mismatch, the algorithm is restarted with other guesses for the object type. If there is no match at all (after all known objects have been tested against the sensor data), the algorithm "escalates" to the algorithm for learning new object configurations below.
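The gradient parameter search can be sketched as plain numeric gradient descent on the delta as a function of the object parameters. The two-parameter delta function below (standing in for position and scale) is purely illustrative:

```python
# Sketch of "gradient parameter search": minimize the delta as a
# function of object parameters via central-difference gradients
# and gradient descent. The delta function is a hypothetical
# stand-in supplied by the caller.

def numeric_grad(f, params, eps=1e-5):
    """Central-difference gradient of f at params."""
    grad = []
    for i in range(len(params)):
        p_hi = list(params); p_hi[i] += eps
        p_lo = list(params); p_lo[i] -= eps
        grad.append((f(p_hi) - f(p_lo)) / (2 * eps))
    return grad

def fit_object(delta_fn, params, lr=0.1, steps=200):
    """Descend on delta_fn to find the best-fitting parameters."""
    params = list(params)
    for _ in range(steps):
        grad = numeric_grad(delta_fn, params)
        params = [p - lr * g for p, g in zip(params, grad)]
    return params
```

For a quadratic delta this converges to the minimizing parameters; a real system would use a more robust optimizer and restart on mismatch, as described above.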

As a result, the algorithm returns for each object in a "plausible scene":

  • Object type and variant,
  • Object position and
  • The state of the object (if observable from the outside and available for the object configurator variant)


Learning processes are triggered every time sensor data cannot be completely explained by a scene composed of multiple known objects.

Learning New Object Configurations

A failure of the algorithm above for configuring object variants indicates that the sensor data show a new object. The first step in learning a new object consists of trying to compose it from multiple known constituents. For example, the system may identify a new car model composed of wheels, doors, windows and other features known from other cars, but in a new composition. The steps are:

  • Focus the Attention on the new object,
  • obtain higher quality sensor data from the object (possibly using camera "zoom"),
  • classify each of the constituents and compare with constituents from the object configurator and
  • use co-occurrence bi-grams for guessing unknown constituents when located beside known constituents.

Identifying an object composition with multiple identified constituents leads to a learning event: the newly recognized object is stored in the Object Configurator and the Episodic Memory. If identification fails, the system "escalates" to learning a completely new object.
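The bi-gram guessing step can be illustrated as follows. Constituents are given as a sequence with `None` for parts that could not be classified; the bi-gram counts and part names are illustrative assumptions:

```python
# Sketch of co-occurrence bi-gram guessing: fill in unclassified
# constituents (None) using the most frequent bi-gram partner of
# their identified neighbours.

def guess_unknown(parts, bigrams):
    """Replace None entries using bi-gram counts of neighbours."""
    guessed = list(parts)
    for i, part in enumerate(parts):
        if part is not None:
            continue
        neighbours = [p for p in (parts[i - 1] if i > 0 else None,
                                  parts[i + 1] if i + 1 < len(parts) else None)
                      if p is not None]
        candidates = {}
        for n in neighbours:
            for (a, b), count in bigrams.items():
                if a == n:
                    candidates[b] = candidates.get(b, 0) + count
        if candidates:
            guessed[i] = max(candidates, key=candidates.get)
    return guessed
```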

Learning a Completely New Object

TinyCog needs a wealth of data for a complete model of a new object:

  • 3D shapes,
  • textures (optical surface characteristics),
  • tactile surface characteristics,
  • component (is-part-of) structure,
  • a mechanical model for object dynamics,
  • co-occurrences of the object with other objects in scenes and
  • possibly the sounds and smells emitted by objects.

In order to obtain these data, TinyCog needs to investigate any completely new object in detail. Investigation consists of "getting closer", obtaining sensor data from multiple angles for shape and texture identification, and "playing around" with the object in order to determine its mechanical characteristics.

Defining Useful Deltas on Various Levels