Introduction

Matroska aims to become THE standard of multimedia container formats. It was derived from a project called MCF, but differentiates from it significantly because it is based on EBML (Extensible Binary Meta Language), a binary derivative of XML. EBML enables significant advantages in terms of future format extensibility, without breaking file support in old parsers.

First, it is essential to clarify exactly “What an Audio/Video container is”, to avoid any misunderstandings:

  • It is NOT a video or audio compression format (codec)
  • It is an envelope for which there can be many audio, video and subtitles streams, allowing the user to store a complete movie or CD in a single file.

Matroska is designed with the future in mind. It incorporates features like:

  • Fast seeking in the file
  • Chapter entries
  • Full metadata (tags) support
  • Selectable subtitle/audio/video streams
  • Modularly expandable
  • Error resilience (can recover playback even when the stream is damaged)
  • Streamable over the internet and local networks (HTTP, CIFS, FTP, etc)
  • Menus (like DVDs have)

Matroska is an open standards project. This means for personal use it is absolutely free to use and that the technical specifications describing the bitstream are open to everybody, even to companies that would like to support it in their products.

Status of this document

This document is a work-in-progress specification defining the Matroska file format as part of the IETF Cellar working group. But since it’s quite complete it is used as a reference for the development of libmatroska. A legacy version of the specification can be found here (PDF doc by Alexander Noé – outdated).

For a simplified diagram of the layout of a Matroska file, see the Diagram page.

A more refined and detailed version of the EBML specifications is being worked on here.

The table found below is now generated from the “source” of the Matroska specification. This XML file is also used to generate the semantic data used in libmatroska and libmatroska2. We encourage anyone to use and monitor its changes so your code is spec-proof and always up to date.

Note that versions 1, 2 and 3 have been finalized. Version 4 is currently work in progress. There MAY be further additions to v4.

Security Considerations

Matroska inherits security considerations from EBML.

Attacks on a Matroska Reader could include:

  • Storage of a arbitrary and potentially executable data within an Attachment Element. Matroska Readers that extract or use data from Matroska Attachments SHOULD check that the data adheres to expectations.
  • A Matroska Attachment with an inaccurate mime-type.

IANA Considerations

To be determined.

Notation and Conventions

The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “NOT RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be interpreted as described in BCP 14 [@!RFC2119] [@!RFC8174] when, and only when, they appear in all capitals, as shown here.

This document defines specific terms in order to define the format and application of Matroska. Specific terms are defined below:

Matroska: a multimedia container format based on EBML (Extensible Binary Meta Language)

Matroska Reader: A Matroska Reader is a data parser that interprets the semantics of a Matroska document and creates a way for programs to use Matroska.

Matroska Player: A Matroska Player is a Matroska Reader with a primary purpose of playing audiovisual files, including Matroska documents.

Basis in EBML

Matroska is a Document Type of EBML (Extensible Binary Meta Language). This specification is dependent on the EBML Specification. For an understanding of Matroska’s EBML Schema, see in particular the sections of the EBML Specification covering EBML Element Types, EBML Schema, and EBML Structure.

Added Constraints on EBML

As an EBML Document Type, Matroska adds the following constraints to the EBML specification.

  • The docType of the EBML Header MUST be ‘matroska’.
  • The EBMLMaxIDLength of the EBML Header MUST be 4.
  • The EBMLMaxSizeLength of the EBML Header MUST be between 1 and 8 inclusive.

Matroska Design

All top-levels elements (Segment and direct sub-elements) are coded on 4 octets, i.e. class D elements.

Language Codes

Matroska from version 1 through 3 uses language codes that can be either the 3 letters bibliographic ISO-639-2 form (like “fre” for french), or such a language code followed by a dash and a country code for specialities in languages (like “fre-ca” for Canadian French). The ISO 639-2 Language Elements are “Language Element”, “TagLanguage Element”, and “ChapLanguage Element”.

Starting in Matroska version 4, either ISO 639-2 or BCP 47 MAY be used, although BCP 47 is RECOMMENDED. The BCP 47 Language Elements are “LanguageIETF Element”, “TagLanguageIETF Element”, and “ChapLanguageIETF Element”. If a BCP 47 Language Element and an ISO 639-2 Language Element are used within the same Parent Element, then the ISO 639-2 Language Element MUST be ignored and precedence given to the BCP 47 Language Element.

Country codes are the same as used for internet domains.

Physical Types

Each level can have different meanings for audio and video. The ORIGINAL_MEDIUM tag can be used to specify a string for ChapterPhysicalEquiv = 60. Here is the list of possible levels for both audio and video :

ChapterPhysicalEquiv Audio Video Comment
70 SET / PACKAGE SET / PACKAGE the collection of different media
60 CD / 12” / 10” / 7” / TAPE / MINIDISC / DAT DVD / VHS / LASERDISC the physical medium like a CD or a DVD
50 SIDE SIDE when the original medium (LP/DVD) has different sides
40 - LAYER another physical level on DVDs
30 SESSION SESSION as found on CDs and DVDs
20 TRACK - as found on audio CDs
10 INDEX - the first logical level of the side/medium

Block Structure

Bit 0 is the most significant bit.

Frames using references SHOULD be stored in “coding order”. That means the references first and then the frames referencing them. A consequence is that timestamps might not be consecutive. But a frame with a past timestamp MUST reference a frame already known, otherwise it’s considered bad/void.

Block Header

Offset Player Description
0x00+ MUST Track Number (Track Entry). It is coded in EBML like form (1 octet if the value is < 0x80, 2 if < 0x4000, etc) (most significant bits set to increase the range).
0x01+ MUST Timestamp (relative to Cluster timestamp, signed int16)

Block Header Flags

Offset Bit Player Description
0x03+ 0-3 - Reserved, set to 0
0x03+ 4 - Invisible, the codec SHOULD decode this frame but not display it
0x03+ 5-6 MUST Lacing
      * 00 : no lacing
      * 01 : Xiph lacing
      * 11 : EBML lacing
      * 10 : fixed-size lacing
0x03+ 7 - not used

Lacing

Lacing is a mechanism to save space when storing data. It is typically used for small blocks of data (referred to as frames in Matroska). There are 3 types of lacing:

  1. Xiph, inspired by what is found in the Ogg container
  2. EBML, which is the same with sizes coded differently
  3. fixed-size, where the size is not coded

For example, a user wants to store 3 frames of the same track. The first frame is 800 octets long, the second is 500 octets long and the third is 1000 octets long. As these data are small, they can be stored in a lace to save space. They will then be stored in the same block as follows:

Xiph lacing

  • Block head (with lacing bits set to 01)
  • Lacing head: Number of frames in the lace -1, i.e. 2 (the 800 and 500 octets one)
  • Lacing sizes: only the 2 first ones will be coded, 800 gives 255;255;255;35, 500 gives 255;245. The size of the last frame is deduced from the total size of the Block.
  • Data in frame 1
  • Data in frame 2
  • Data in frame 3

A frame with a size multiple of 255 is coded with a 0 at the end of the size, for example 765 is coded 255;255;255;0.

EBML lacing

In this case, the size is not coded as blocks of 255 bytes, but as a difference with the previous size and this size is coded as in EBML. The first size in the lace is unsigned as in EBML. The others use a range shifting to get a sign on each value:

Bit Representation Value
1xxx xxxx value -(2^6-1) to 2^6-1 (ie 0 to 2^7-2 minus 2^6-1, half of the range)
01xx xxxx xxxx xxxx value -(2^13-1) to 2^13-1
001x xxxx xxxx xxxx xxxx xxxx value -(2^20-1) to 2^20-1
0001 xxxx xxxx xxxx xxxx xxxx xxxx xxxx value -(2^27-1) to 2^27-1
0000 1xxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx value -(2^34-1) to 2^34-1
0000 01xx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx value -(2^41-1) to 2^41-1
0000 001x xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx value -(2^48-1) to 2^48-1
  • Block head (with lacing bits set to 11)
  • Lacing head: Number of frames in the lace -1, i.e. 2 (the 800 and 500 octets one)
  • Lacing sizes: only the 2 first ones will be coded, 800 gives 0x320 0x4000 = 0x4320, 500 is coded as -300 : - 0x12C + 0x1FFF + 0x4000 = 0x5ED3. The size of the last frame is deduced from the total size of the Block.
  • Data in frame 1
  • Data in frame 2
  • Data in frame 3

Fixed-size lacing

In this case, only the number of frames in the lace is saved, the size of each frame is deduced from the total size of the Block. For example, for 3 frames of 800 octets each:

  • Block head (with lacing bits set to 10)
  • Lacing head: Number of frames in the lace -1, i.e. 2
  • Data in frame 1
  • Data in frame 2
  • Data in frame 3

SimpleBlock Structure

The SimpleBlock is inspired by the Block structure. The main differences are the added Keyframe flag and Discardable flag. Otherwise everything is the same.

Bit 0 is the most significant bit.

Frames using references SHOULD be stored in “coding order”. That means the references first and then the frames referencing them. A consequence is that timestamps might not be consecutive. But a frame with a past timestamp MUST reference a frame already known, otherwise it’s considered bad/void.

SimpleBlock Header
Offset Player Description
0x00+ MUST Track Number (Track Entry). It is coded in EBML like form (1 octet if the value is < 0x80, 2 if < 0x4000, etc) (most significant bits set to increase the range).
0x01+ MUST Timestamp (relative to Cluster timestamp, signed int16)
SimpleBlock Header Flags
Offset Bit Player Description
0x03+ 0 - Keyframe, set when the Block contains only keyframes
0x03+ 1-3 - Reserved, set to 0
0x03+ 4 - Invisible, the codec SHOULD decode this frame but not display it
0x03+ 5-6 MUST Lacing
      * 00 : no lacing
      * 01 : Xiph lacing
      * 11 : EBML lacing
      * 10 : fixed-size lacing
0x03+ 7 - Discardable, the frames of the Block can be discarded during playing if needed

Laced Data

When lacing bit is set.

Offset Player Description
0x00 MUST Number of frames in the lace-1 (uint8)
0x01 / 0xXX MUST* Lace-coded size of each frame of the lace, except for the last one (multiple uint8). *This is not used with Fixed-size lacing as it is calculated automatically from (total size of lace) / (number of frames in lace).

For (possibly) Laced Data

Offset Player Description
0x00 MUST Consecutive laced frames