This glossary provides definitions of various terms which are often encountered when working in immersive media.
Immersive video is video that surrounds the viewer in 360 or 180 degrees when viewed in a VR headset. Although viewers are surrounded by video and can look around freely, they cannot move within the video. Thus, immersive videos are a 3DoF (3 degrees of freedom) experience.
360 videos are full spheres that wrap all the way around viewers, while 180 videos are hemispherical, covering only half of the world. Both 360 and 180 videos can be 2D (monoscopic) or 3D (stereoscopic). 3D immersive videos impart a strong sense of immersion and presence, and can make viewers feel like they are in the scene.
Equirectangular projection allows 360 images to be represented in 2:1 aspect-ratio rectangular frames. The top and bottom of the spherical images are unwrapped and stretched to the corners of the frame. You might recognize this projection because it’s commonly used in maps of the world. When in a compatible VR video player, equirectangular frames are reprojected to look natural in 360-degree immersion. 180 videos are also commonly distributed in equirectangular projection, but cropped to half the width. Equirectangular projection is also used in stereoscopic 3D 360 and 180 videos; one image is required for each eye and the resulting two images are tiled horizontally or vertically into a single frame.
A spherical image in equirectangular projection, with the front, left, back, right, zenith (up) and nadir (down) positions marked. Image: Steve Cooper/Keith Martin
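The mapping from a view direction to a position in the equirectangular frame can be sketched in a few lines. This is an illustrative sketch only; the function names and the 4096×2048 frame size are assumptions, not from the article.

```python
def equirect_uv(yaw_deg, pitch_deg):
    """Map a view direction (yaw, pitch in degrees) to normalized
    equirectangular coordinates (u, v) in [0, 1].
    Yaw -180..180 spans the full width; pitch +90 (zenith) maps to the
    top edge and -90 (nadir) to the bottom edge."""
    u = (yaw_deg / 360.0) + 0.5
    v = 0.5 - (pitch_deg / 180.0)
    return u, v

def uv_to_pixel(u, v, width, height):
    """Convert normalized coordinates to pixel indices in a 2:1 frame."""
    return int(u * (width - 1)), int(v * (height - 1))

# The forward direction (yaw 0, pitch 0) lands at the center of the frame.
u, v = equirect_uv(0, 0)
print(uv_to_pixel(u, v, 4096, 2048))  # (2047, 1023)
```

Note how the zenith and nadir each collapse to a single row of the frame, which is why they appear stretched across the full width, just as the poles do on a world map.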
Rectilinear content is the flat rectangular media we are most familiar with seeing day to day, such as in broadcast television, movies or photographs. Rectilinear images or video frames are normally viewed on a traditional display or screen, but this form of media can also be experienced within VR on a virtual screen. Rectilinear images and footage can be cropped and extracted from equirectangular content in a process known as reframing.
Image: Steve Cooper
Monoscopic content, also referred to as mono or 2D, consists of a single identical image shown to both left and right eyes at the same time. In VR, monoscopic content does not impart a sense of depth and appears to be flat, whether the content is rectilinear 2D or immersive 360/180.
A monoscopic 2D image seen through the left and right lenses in a headset. The same image is displayed in both lenses, so there is no parallax to provide a sense of stereoscopic depth. See below for a stereo example. Image: Steve Cooper/Keith Martin
Stereoscopic content, also referred to as stereo or 3D, contains a pair of offset left-right images or video frames captured simultaneously by two side-by-side lenses. Ideally these lenses would be separated by the average human interpupillary distance (IPD) of 63mm. Each captured image is shown to the corresponding eye, emulating how humans see with two eyes. In VR, stereoscopic videos impart a strong sense of depth and provide the illusion of real 3D (when in reality only two viewpoints are represented).
A stereoscopic 3D image seen through the headset’s left and right lenses. Note the disparity (offset left to right) in the image pairs; the disparity in this illustration is slightly exaggerated to emphasize the parallax shift between the left eye and right eye views. Image: Steve Cooper/Keith Martin
A stereoscopic spherical image (3D-360) with the left top and right bottom views tiled vertically as a top/bottom equirectangular pair. Image: Steve Cooper/Keith Martin
A stereoscopic 180-degree image (3D-180) with the left and right views tiled horizontally as a side-by-side, cropped equirectangular pair. Image: Steve Cooper/Keith Martin
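The two tiling layouts above can be described as simple crop rectangles. A minimal sketch (the function name and the 5760×2880 frame size are assumptions for illustration):

```python
def split_stereo_frame(width, height, layout):
    """Return (left_eye, right_eye) crop rectangles (x, y, w, h) for a
    tiled stereoscopic frame.
    'side_by_side' is typical for 3D-180; 'top_bottom' for 3D-360."""
    if layout == "side_by_side":
        half = width // 2
        return (0, 0, half, height), (half, 0, half, height)
    if layout == "top_bottom":
        half = height // 2
        return (0, 0, width, half), (0, half, width, half)
    raise ValueError("layout must be 'side_by_side' or 'top_bottom'")

# A side-by-side 3D-180 frame: each eye gets a square half of the frame.
left, right = split_stereo_frame(5760, 2880, "side_by_side")
print(left, right)  # (0, 0, 2880, 2880) (2880, 0, 2880, 2880)
```

A compatible player performs exactly this split before reprojecting each half for the matching eye.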
Media Content Types
2D video is the conventional ‘flat’ rectangular video we are all used to seeing. Although “2D video” is commonly used, it can be confusing because “2D” is also an adjective that usually means “monoscopic” and can be applied to many forms of content. See the entry for “rectilinear”.
Traditional 2D video watched in a Meta Quest 2 (using the Bigscreen app). It works well but there is no sense of depth or immersion in the video itself. Image: Keith Martin
3D video is stereoscopic rectilinear video like what is presented in a 3D movie. 3D movies in theaters require 3D glasses, but no such glasses are required in VR because VR headsets, by design, already show each eye its own independent image. See the entry for “stereoscopic”.
3D videos watched in a VR headset present a similar experience to being at a 3D movie theater. Image: Keith Martin
Monoscopic 180 video is also referred to as ‘2D-180 video’, and is a single monoscopic fisheye image with a 180-degree field of view (half the world). It can be easily captured using traditional video cameras and circular fisheye lenses.
Image: Steve Cooper/Keith Martin
Stereoscopic 180 video is widely referred to as 3D-180 or VR180 video. The video is usually captured with a left/right pair of horizontally-offset circular fisheye lenses. 3D-180 is a powerful form of immersive video, offering a good balance between viewer immersion and ease of capture and post-processing.
Image: Steve Cooper/Keith Martin
Monoscopic 360 video is widely referred to as 2D 360 video (or just 360º video with no qualifier). The video is typically captured using a dedicated 360 camera or camera rig with a minimum of two lenses which together can view and record the entire 360 scene. The captured video is converted into an equirectangular frame for display in VR, either in camera or with software stitching tools. Affordable consumer cameras such as the Insta360 One X2, Ricoh Theta and GoPro Max are commonly used to produce basic 360 video for VR. 2D 360 is a good form of immersive video, with presence conveyed via the full scene being viewable in 360.
Image: Steve Cooper/Keith Martin
Stereoscopic 360 video is often referred to as 3D-360 video. The video is typically captured using a dedicated 3D-360 camera or custom camera rig with multiple fisheye lenses arranged in a cylinder. The captured video must be “stitched” using software, which can be challenging and requires specialized knowledge. 3D-360 video done well is very convincing in viewer immersion, instilling a sense of presence conveyed both by stereoscopic presentation and full 360 coverage.
Image: Steve Cooper/Keith Martin
Three degrees of freedom (3DOF, pronounced “three doff”) refers to rotation around three axes: X, Y and Z. Viewers are free to look around (roll, pitch and yaw their head) but they are unable to change their point of view (i.e., no walking). In VR, all immersive video content is 3DOF, allowing the viewer to look around at will. It’s important to note that 3D 360 and 180 are only effectively 2DOF or 2.5DOF because rolling one’s head makes the stereoscopic video very uncomfortable. The left and right images become separated vertically (“vertical disparity”), which is a situation not found in nature.
Six degrees of freedom (6DOF, pronounced “six doff”) includes the three degrees of rotation (3DOF) plus an additional three degrees of translation along the same X, Y and Z axes. Up-down movement along the vertical axis (Z) is referred to as heave. Side-to-side movement along the horizontal axis (Y) is referred to as sway or slide. Movement in and out along the front-to-back axis (X) is referred to as surge. In VR, both game engine experiences and experimental volumetric media experiences support full 6DOF, allowing the viewer to both look and move in every direction.
The six degrees of freedom are 1. yaw rotation about the Z axis, 2. pitch rotation about the Y axis, 3. roll rotation about the X axis, 4. side-to-side slide movement along the Y axis, 5. up-down heave movement along the Z axis, and 6. front-to-back surge movement along the X axis. Image: Steve Cooper
Augmented Reality or AR refers to technology that superimposes computer-generated images on a user's view of the real world, providing an overlay with additional information. This is often used in conjunction with geolocation and image-recognizing algorithms in order to enable the overlaid graphics to provide contextual information relating to the real world location and things that are visible within the scene. Smartphones have many “AR” apps, but these bring in the real world through passthrough cameras and display those images on the phone’s display.
Convergence is the angle formed between the viewer’s left and right eyes and an observed object. The greater the angle, the closer the brain perceives that object to be. In immersive video, when stereoscopic content is captured through a pair of lenses separated by a distance close to human IPD, viewing that content in VR provides a strong hint of 3D similar to seeing in the physical world. This makes stereoscopic content very immersive.
The angle of convergence is greater when objects are viewed closer as shown at the top in this illustration. Farther away objects have a narrower angle of convergence. Convergence conveys a perception of 3D, depth and distance to the brain. Image: Steve Cooper
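The geometry above can be made concrete. Assuming the 63 mm average IPD mentioned elsewhere in this glossary, the convergence angle for an object at a given distance is (an illustrative sketch; the function name is an assumption):

```python
import math

def convergence_angle_deg(distance_m, ipd_m=0.063):
    """Convergence angle between the two eyes' lines of sight for an
    object straight ahead at distance_m, given an eye separation of
    ipd_m. Closer objects yield a larger angle."""
    return math.degrees(2 * math.atan((ipd_m / 2) / distance_m))

for d in (0.5, 2.0, 10.0):
    print(f"{d} m -> {convergence_angle_deg(d):.2f} degrees")
```

The angle falls off quickly with distance, which is why stereoscopic depth cues are strongest for nearby subjects and nearly absent for far-away scenery.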
Dewarping is a mathematical process that corrects the distortions caused by lenses. In the immersive video context, this is usually done to correct the fisheye distortion introduced by two or more fisheye lenses and camera sensors. The post-processing of 3D-180 media from the original captured fisheye images to the final side-by-side ‘cropped equirectangular’ images is commonly referred to as stitching, but is in fact dewarping; no stitching is involved.
In the immersive media context, disparity is the positional offset of a given subject between the left and right views of a stereoscopic image. Horizontal disparity drives convergence when viewing stereoscopic content. Vertical disparity is very uncomfortable and can often cause nausea; it can be caused by sloppy camera positioning in stereoscopic capture as well as poor stitching or dewarping in post production.
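For rectilinear stereo pairs, horizontal disparity relates to depth through the classic pinhole-stereo relation. A sketch, where the focal length and baseline values are illustrative assumptions rather than figures from the article:

```python
def depth_from_disparity(disparity_px, baseline_m=0.063, focal_px=1000.0):
    """Pinhole stereo relation: depth = focal_length * baseline / disparity.
    A larger horizontal disparity means the subject is closer."""
    return focal_px * baseline_m / disparity_px

# A subject with 10 px of horizontal disparity, a 63 mm baseline and a
# focal length of 1000 px sits about 6.3 m from the camera.
print(depth_from_disparity(10))
```

This inverse relationship is why small alignment errors are far more visible (and more uncomfortable) on nearby subjects than on distant ones.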
Eye tracking is a feature that uses cameras to track a user's eyes. Eye tracking can be used as an interface for displays or in VR, and can also be used in foveated rendering, in which small details are only rendered where a user is looking (reducing the data bandwidth required to represent detailed worlds).
A VR headset’s guardian system displays visual wall and floor markers when users get near physical boundaries defined during setup. When the user gets too close to the edge of that boundary, translucent markers and/or pass-through video from the headset’s external cameras are displayed in a layer superimposed over the game or experience. The guardian helps users in VR avoid collisions with real-world objects such as furniture and walls.
Hand tracking uses a Meta Quest or Meta Quest 2 headset’s outward-facing cameras to accurately track a user’s hands and fingers. This information is used to replace physical controllers in VR and allows users to make gestures which drive actions in VR, such as picking up virtual objects and interacting with virtual interfaces.
Head tracking uses the VR headset’s outward-facing cameras to detect the movements of the user's head. This technology provides 3DOF and 6DOF support in VR headsets such as the Meta Quest 2.
IAD - Interaxial distance
This refers to the distance between the two lenses in a stereoscopic VR camera. An interaxial distance roughly matching human IPD (interpupillary distance) results in human-scale stereoscopic 3D experiences when viewed in VR.
IPD - Interpupillary distance
This refers to the distance between the pupils of a human’s left and right eyes. The average human IPD is 63mm. The Meta Quest 2 headset supports IPDs ranging from 56 to 70mm, which accounts for around 95% of adults. To produce human-scale immersive stereoscopic content for VR, productions should use camera rigs with lens separations as close to the average human IPD as possible. A larger-than-typical lens separation can exaggerate 3D and make objects appear miniaturized (“hyperstereo”). A smaller-than-typical lens separation will reduce stereoscopic 3D and make subjects appear larger than normal (“hypostereo”).
Image: Steve Cooper
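The hyperstereo/hypostereo effect can be approximated with a simple rule of thumb (an approximation for illustration, not an exact perceptual model): the scene appears scaled by roughly the ratio of human IPD to the camera’s lens separation.

```python
def apparent_scale(camera_iad_mm, human_ipd_mm=63.0):
    """Rule-of-thumb apparent scale factor for stereoscopic capture.
    < 1: the scene looks miniaturized (hyperstereo, wide lens separation).
    > 1: the scene looks larger than life (hypostereo, narrow separation)."""
    return human_ipd_mm / camera_iad_mm

print(apparent_scale(63.0))   # 1.0: human scale
print(apparent_scale(126.0))  # 0.5: hyperstereo, miniature look
print(apparent_scale(31.5))   # 2.0: hypostereo, giant look
```

Filmmakers sometimes exploit this deliberately, using wide separations to shrink landscapes or narrow separations to make small subjects feel monumental.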
Nadir refers to the bottom of a spherical scene. Both 180 and 360 content, whether monoscopic 2D or stereoscopic 3D, will have a nadir. When capturing immersive content for VR, care should be taken to minimize the camera’s tripod footprint, and/or remove it from the nadir in post production. Hiding all capture elements can increase the sense of immersion in VR media experiences.
Parallax is the difference in the position of an object when viewed from different positions. Subjects in stereoscopic images have parallax which the brain uses to do 3D scene reconstruction. Close objects have a larger parallax than do objects that are farther away. When a viewpoint is moving, the difference between apparent movements of objects at different distances is called motion parallax.
Presence refers to the level of embodiment felt in a VR experience. Presence is achieved when users feel fully immersed in the virtual experience, losing awareness of anything outside it. In immersive video, stereoscopic 180 and 360 videos can impart a strong sense of presence. Monoscopic 360 and 180 videos can also provide a sense of presence, but the lack of 3D makes it more challenging to achieve.
Stereo misalignment occurs when left eye and right eye images are poorly positioned, out of focus with each other or offset in time. Misalignment results in stereoscopic content that is uncomfortable to view in VR due to vertical disparity between the eyes. When a viewer feels “strange” or nauseated in a VR media experience, one of the main culprits is stereo misalignment. This can be a byproduct of many things including lenses being misfocused, the left-right images not being accurately aligned, footage being out of sync from left to right, and more.
Misaligned stereoscopic image with a reference grid, seen through the headset’s left and right lenses. Note the rotation and vertical drop in the left image compared to the right image. This misalignment between the left and right images is extremely uncomfortable in 3D VR. Image: Steve Cooper/Keith Martin
Stitching is the process of overlapping, aligning, and merging multiple images to create a single larger one. This is typically done to create panoramas from multiple pictures. In the immersive media context, stitching is used to create a spherical 360-degree image from the shots taken by a 360 camera’s multiple lenses. 360 camera lenses are typically arranged back to back (monoscopic 360) or as six or more lenses facing outward in a cylindrical arrangement (stereoscopic 360). Stitching is sometimes done in camera, but for highest quality, it is a standard step in the post production workflow using software tools provided by the manufacturer, or specialist third-party software.
Virtual reality is the reproduction of 3D environments that can be seen and interacted with in a way that simulates some sort of reality, usually via a head-mounted display. Common VR experiences include games and the watching of immersive videos.
Zenith refers to the top of a spherical scene. Both 180 and 360 content, whether monoscopic 2D or stereoscopic 3D, will have a zenith. The zenith of immersive videos is usually clean, but if cameras are suspended from above, removing the rigging in post production can increase the sense of immersion.
Ambisonic audio is a full-sphere surround sound format commonly used to represent spatial audio in immersive video playback. Ambisonic audio exists in multiple formats and “orders”. Meta Quest 2 supports various ambisonics formats including Facebook 360 Audio, a “hybrid higher-order ambisonics” format.
Ambisonic audio can be used to represent sound in a full sphere around a user. Image: Steve Cooper
Binaural audio is a form of stereo recording historically created by recording from within the ears of a physical model of a head. When played back, this is an effective way to reproduce the way we naturally hear sounds. In VR, spatial audio is rendered in real time as binaural audio, which allows two speakers to reproduce sounds that appear to come from all around a user.
A binaural microphone system records natural audio using a pair of mics mounted inside an ear-shaped form. Image: Steve Cooper/3DIO
Facebook 360 Audio
Facebook 360 Audio is a form of hybrid higher order ambisonics similar in quality to 2nd-order ambisonics. The format is also called Two Big Ears (TBE), and consists of 8 channels of ambisonics and 2 channels of headlocked stereo or binaural audio.
Headlocked audio combined with spatial audio allows sound designers to reproduce 360 degree ambient sounds from the environment with non-directional audio for narration or music. Image: Steve Cooper
Headlocked Audio / Non-Spatial Audio
Headlocked audio is non-spatial–it doesn’t shift its apparent position when the user turns their head. Headlocked audio is commonly used for voiceovers, narration, and background stereo tracks such as music.
Headlocked audio, whether stereo or mono, is fixed to the headphone speakers; as the listener’s head rotates, the audio follows the head position. Image: Steve Cooper
Monaural audio, also referred to as mono, is a single channel of audio. It is typically rendered to both ears at once, but can also be used in audio mixing and placed in space by an audio designer.
Spatial Audio, also referred to as positional audio, allows a user to experience sound spatially, in three dimensions, where sounds in the virtual environment can have their own positions and qualities.
Spatial audio in VR plays through the device speakers; as the listener’s head rotates, the audio is rendered in real time as binaural audio, giving listeners a sense of a convincing 3D audio landscape. Image: Steve Cooper
Stereo audio uses two different audio channels, one per ear, to reproduce a one-dimensional sense of space (supporting panning from left to right in the audio mix itself, unrelated to head movements in VR). Stereo audio is usually rendered headlocked, but can also be “played” out of virtual speakers in VR environments.