OSM File Formats Manual

Table of Contents

1. Introduction
2. File Types
3. File Formats
4. XML Format
5. PBF Format
6. O5M/O5C Format
7. OPL ("Object Per Line") Format
8. Debug Format
9. Blackhole Format
10. Format Comparison

1. Introduction

OpenStreetMap uses several different types of files containing different types of data and it uses different formats to “encode” this data into bits and bytes on your disk.

This manual gives an overview of the different file formats and encodings and explains what they have in common and where they differ. It has been written for users of the Libosmium library or any of the tools built on top of this library, but it is useful beyond that.

If you are an Osmium user, read this first to get an overview of OSM files, possibly together with the Osmium Concepts Manual. After you have understood the information in here, you can read the other documentation for the details.

Seeing what’s in an OSM file

If you have an OSM file and want to take a quick look at its content, the osmium command line tool is your friend.

Use the fileinfo command to get a quick overview of the file. This will only read the metadata available from the file system and the header of the file, so it is very fast:

osmium fileinfo OSMFILE

Use the -e option to get more information about the file contents. This reads the complete file and reports some statistics about its contents:

osmium fileinfo -e OSMFILE

If you want to look at the actual contents, use the show command:

osmium show OSMFILE

It will convert the file to the DEBUG format and pipe the result into your favorite pager program.

2. File Types

OSM uses three types of files for its main data:

Data files: These are the most common files containing the OSM data at a specific point in time. This can either be a planet file containing all OSM data or some kind of extract. At most one version of every object (node, way, or relation) is contained in this file. Deleted objects are not in this file. The usual suffix used is .osm.
History files: These files contain not only the current version of each object, but its history, too. So for any object (node, way, or relation) there can be zero or more versions in this file. Deleted objects can also be in this file. The usual suffix used is .osm or .osh. Because sometimes the same suffix is used as for normal data files (.osm) and because there is no clear indicator in the header, it is not always clear what type of file you have in front of you.
Change files: Sometimes called diff files or replication diffs, these files contain the changes between one state of the OSM database and another state. Change files can contain several versions of an object. The usual suffix used is .osc.

All these files have in common that they contain OSM objects (nodes, ways, and relations). History files and change files can contain several versions of the same object and also deleted objects, data files can’t.

Osmium handles all these files in the same way. It knows about the different ways those files are formatted, but semantically all these files produce the same internal objects. The only difference is that the visible flag on OSM objects is always true for data files, but not for history and change files.

(Note that this is different from how Osmosis handles these files: Osmosis differentiates between “entity streams” and “change streams”.)

XML change files have each object in a section called <create>, <modify>, or <delete>. When reading change files, Osmium gives you normal OSM objects and sets the visible flag to false for objects in <delete> sections. When writing OSM objects into change files, deleted objects are written to <delete> sections and all other objects are marked as <create> if their version is 1 or as <modify> if their version is greater than 1. (This is technically correct, because in OSM all objects are created at version 1 and all later versions are necessarily modifications of this first version. Other software interprets the details differently and uses create/modify in slightly different circumstances. Any software using change files must handle both cases (create/modify) anyway, so this shouldn’t make a difference.)
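
The rule for choosing a section when writing change files can be sketched in a few lines of Python. This is only an illustration of the rule described above, not Libosmium code; the OsmObject class and the change_section function are hypothetical names made up for this example.

```python
from dataclasses import dataclass


@dataclass
class OsmObject:
    """Hypothetical stand-in for an OSM object; not a Libosmium type."""
    id: int
    version: int
    visible: bool = True


def change_section(obj: OsmObject) -> str:
    """Return the name of the change-file section an object is written to."""
    if not obj.visible:
        return "delete"   # deleted objects go into <delete>
    if obj.version == 1:
        return "create"   # every object starts at version 1
    return "modify"       # any later version is a modification


sections = [change_section(o) for o in (
    OsmObject(id=1, version=1),
    OsmObject(id=2, version=5),
    OsmObject(id=3, version=6, visible=False),
)]
print(sections)
```

The visible flag is checked first, so a deleted object ends up in the delete section regardless of its version.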

You can also see a change file as a partial history file with a strange format.

And then there are changeset files. They don’t contain OSM objects, but changesets. Some changeset files contain the discussion comments together with the changesets, some files don’t have the comments (the num_comments attribute is always set, though). Changeset files can be combined with OSM data or history files into one. So there can be one file that contains both the OSM objects and the changesets.

Don’t mix up “change files” and “changeset files”. Those are completely different concepts. The “change files” contain the new versions of OSM objects and describe the changes that way. The “changeset files” contain changesets, which hold the metadata about changes.

While Osmium itself is mostly file type agnostic, applications built on top of Osmium usually only handle specific types of files for their use cases.

3. File Formats

There are several different OSM file formats in common use. File formats describe the way the content is encoded in bits and bytes on disk or on the wire. Osmium can read and write most of these formats. Here is an overview; later chapters go into more detail.

XML: The original XML-based OSM format. This format is rather verbose and working with it is slow, but it is still used often and in some cases there is no alternative. The main OSM database API also returns its data in this format. More information about this format on the OSM Wiki.
PBF: The binary format based on the Protocol Buffers encoding. This is the most compact format. More information on the OSM Wiki.
O5M/O5C: This binary format is simpler than the PBF format but not used as widely. Osmium can read this format to be compatible with other software, but not write it. O5m is the format for data files, O5c the version for change files. More information on the OSM Wiki (https://wiki.openstreetmap.org/wiki/O5m).
OPL: A simple format similar to CSV-files with one OSM entity per line. This format is intended for easy use with standard UNIX command line tools such as grep, cut, and awk. See the OPL File Format Manual for details.
DEBUG: A nicely formatted text-based format that is easier to read for a human than the XML or OPL formats. As the name implies this is intended for debugging. The format can only be written by Osmium, not read.
BLACKHOLE: A “dummy” format that throws away all data. Can only be written to, not read from.

See below for more detailed descriptions.

Compression

Files in the text-based formats (XML, OPL, Debug) can optionally be compressed using gzip or bzip2.

Osmium will handle this internally. Just use the right file name suffix (.osm.gz or .opl.bz2, for instance) for this to work.

Ordering of objects in files

All OSM files can have the entities they contain in any order. This is independent of the type or format of the file. Usually the entities are sorted in a specific way, but whether the entities are sorted or not and in what way is not part of the file format itself.

When you tell Osmium to read a file, it will always give you the entities in the order they are in the file. And when you write to a file, you give the entities to Osmium in a certain order and they will end up in the file in that order. To be consistent and performant, Osmium doesn’t re-order anything for you. If it enforced some kind of order, it might have to do extra work that you might not need or want.

All of this being said, OSM files are almost always ordered in a specific way: First nodes, then ways, then relations. Each group ordered by ascending ID (and ascending version in history files). Changeset files are usually ordered by changeset ID.
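
If your program depends on this usual order, it can check for it while reading. The following Python sketch shows one way to express the order as a sort key. It assumes entities are represented as plain (type, id, version) tuples, which is a simplification chosen for this example and not how Libosmium represents objects.

```python
# Nodes come first, then ways, then relations.
TYPE_RANK = {"node": 0, "way": 1, "relation": 2}


def sort_key(entity):
    """Key for the usual order: type, then ID, then version."""
    kind, obj_id, version = entity
    return (TYPE_RANK[kind], obj_id, version)


def is_usual_order(entities):
    """True if the (type, id, version) tuples are in the usual order."""
    keys = [sort_key(e) for e in entities]
    return all(a <= b for a, b in zip(keys, keys[1:]))


ordered = [("node", 1, 1), ("node", 1, 2), ("node", 7, 1),
           ("way", 3, 1), ("relation", 2, 1)]
unordered = [("way", 3, 1), ("node", 1, 1)]

print(is_usual_order(ordered), is_usual_order(unordered))
```

The same key can be passed to sorted() to bring a small in-memory list into this order.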

If you write software built on Osmium you have to decide whether you impose any restrictions on the internal order of input files and whether you want to guarantee any order when writing out files. This mostly has something to do with performance and ease of programming. Ordered files are often easier and faster to work with, but not necessarily so. You should always think about this issue and document what your programs expect or generate.

While reading and writing files with Osmium is independent of entity order, some other parts of Osmium might expect certain orders or guarantee to generate data in certain orders. Look for those details in the rest of the documentation.

Header data

Some file formats (XML, PBF, O5M, and Debug, but not OPL) have a file header that contains metadata about the file. Which data is available differs widely between formats and most of the data is optional and often not available or inaccurate.

The Osmium library gives you access to the header data when reading files and you can set header fields when writing a file.

Accessing Files

Usually Osmium-based programs will allow you to tell them the name of an input or output file and, optionally, a format description. Osmium detects the format of a file from the file name suffix, so usually you do not have to set the format explicitly.

Osmium knows about the following suffixes:

Format	Suffix	Description
XML	.osm	XML data or changeset file, can also be a history file
XML	.osh	XML history file
XML	.osc	XML change file
PBF	.pbf	PBF
OPL	.opl	OPL
O5M	.o5m	o5m data file
O5C	.o5c	o5c change file
DEBUG	.debug	DEBUG

You can stack formats: For example .osm.pbf is the same as .pbf, .osh.pbf is a history file in PBF format.

The change file format (.osc) is only available in the XML version. Use .osh instead for other formats.

Osmium supports compression and decompression of XML, OPL, and DEBUG files internally using the GZIP and BZIP2 formats. As usual, these files have an additional suffix .gz or .bz2.

So a typical PBF file will be named planet.pbf or planet.osm.pbf, and a compressed history file in XML format could be named history.osh.bz2.
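
To make the suffix rules more concrete, here is a small Python sketch that derives the format, the history flag, and the compression from a file name. It only illustrates the conventions described in this chapter and is not the detection code used by Libosmium; the function name describe_file_name is made up for this example.

```python
import os

FORMATS = {"osm": "XML", "osh": "XML", "osc": "XML", "pbf": "PBF",
           "opl": "OPL", "o5m": "O5M", "o5c": "O5C", "debug": "DEBUG"}


def describe_file_name(name):
    """Guess (format, is_history, compression) from the suffixes of a name."""
    # Everything after the first dot of the base name is treated as suffixes.
    parts = os.path.basename(name).lower().split(".")[1:]

    compression = "none"
    if parts and parts[-1] in ("gz", "bz2"):
        compression = parts.pop()

    file_format = None
    # The last known suffix wins, so ".osm.pbf" and ".osh.pbf" are both PBF.
    for suffix in parts:
        if suffix in FORMATS:
            file_format = FORMATS[suffix]

    is_history = "osh" in parts
    return file_format, is_history, compression


print(describe_file_name("planet.osm.pbf"))
print(describe_file_name("history.osh.bz2"))
```

A name without any known suffix yields None as the format, which corresponds to the case where a format string has to be given explicitly.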

If the file name does not end in the suffix needed for autodetection, you have to supply a format string to Osmium describing the format. Just use the suffix the file name would have as a format string:

File name: foobar, Format: .osm.opl

This is needed most often when referring to STDIN or STDOUT. To refer to STDIN or STDOUT use an empty filename or a single hyphen (-).

File name: -, Format: .osm.pbf

File Format Options

Some file formats allow different options to be set. Options follow the format description in a comma-separated list. For instance, the PBF format allows two different ways of writing nodes to the file. By default the dense format is used, but you can disable it like this:

File name: foo.pbf, Format: .pbf,pbf_dense_nodes=false

Note that, if a format is given, it must always start with the format description, even if the file name has the correct suffix.
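
The structure of such a format string can be illustrated with a short Python sketch that splits it into the format description and a dictionary of options. This is not how Libosmium parses the string internally; parse_format_string is a hypothetical helper, and an option given without a value is simply stored as an empty string here.

```python
def parse_format_string(spec):
    """Split a format string into its format description and its options."""
    # The first comma-separated element is always the format description.
    description, *raw_options = spec.split(",")
    options = {}
    for item in raw_options:
        # Split "key=value" at the first "="; value is "" if there is none.
        key, _, value = item.partition("=")
        options[key] = value
    return description, options


print(parse_format_string(".pbf,pbf_dense_nodes=false"))
print(parse_format_string(".osm.opl"))
```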

Here is a list of optional settings currently supported:

Format	Option	Default	Description
PBF	`pbf_dense_nodes`	true	Use DenseNodes (more space efficient)
PBF	`pbf_compression`	zlib	Compression for PBF blocks (`none`, `zlib`, `lz4`)
PBF	`pbf_compression_level`		Compression level for PBF blobs
XML	`xml_change_format`	false	Set change format, can also be set by using `osc` instead of `osm` suffix
XML	`force_visible_flag`	false	Write out `visible` flag on each object, also set if `osh` instead of `osm` suffix used
all	`add_metadata`	true	see below
PBF, XML, OPL	`locations_on_ways`	false	Add node locations to way nodes (libosmium-specific extension)
DEBUG	`use_color`	false	Output with ANSI colors
DEBUG	`add_crc32`	false	Add CRC32 checksum to all objects

Writing metadata on OSM objects (`add_metadata`)

There are several metadata attributes on OSM objects:

id
version
timestamp
changeset
uid
user

Usually all these attributes are written out to a file, but you can decide which attributes you want and which you want to leave out. The id attribute will always be added.

You can set the file format option to these values:

true, yes, all: Add all attributes. This is the default.
false, no, none: Only add id attribute
A list of one or more attributes separated by + (plus sign): Only add those attributes (and the id attribute). Example: add_metadata=version+timestamp.
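
The following Python sketch shows how the three kinds of values map to a set of attributes. It is an illustration of the rules listed above, not Libosmium code. The function name metadata_attributes is made up, and rejecting unknown attribute names with an error is an assumption of this sketch.

```python
ALL_ATTRIBUTES = ("id", "version", "timestamp", "changeset", "uid", "user")


def metadata_attributes(value):
    """Return the set of attributes written for an add_metadata value."""
    if value in ("true", "yes", "all"):
        return set(ALL_ATTRIBUTES)
    if value in ("false", "no", "none"):
        return {"id"}
    # Otherwise the value is a list of attribute names separated by "+".
    selected = {"id"} | set(value.split("+"))   # id is always added
    unknown = selected - set(ALL_ATTRIBUTES)
    if unknown:
        # Assumption of this sketch: unknown names are treated as an error.
        raise ValueError(f"unknown attribute(s): {sorted(unknown)}")
    return selected


print(sorted(metadata_attributes("version+timestamp")))
```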

Note: In libosmium versions up to 2.13.x it was only possible to set this option to true or false. Adding only some attributes to OSM files but not others was not possible.

Note that some programs reading OSM files might not work correctly if no or only some of the attributes are present.

4. XML Format

There are several different XML formats in use in the OSM project. The main formats are the one used for planet files, extracts, and API responses (suffix .osm), the format used for change files (suffix .osc) and the history format (suffixes .osm or .osh).

Some variants are also used, such as the JOSM format, which is similar to the normal OSM format but has some additions. Support for the features of these formats varies.

When reading, the OSM change format (.osc) is detected automatically. When writing, you have to set it using the format specifier osc or the format parameter xml_change_format=true.

5. PBF Format

The PBF file format is based on Google’s Protocol Buffers. PBF files are very space efficient and faster to use than XML files. PBF files can contain normal OSM data or OSM history data, but there is no equivalent to the XML .osc format.

Osmium supports reading and writing of nodes in DenseNodes and non-DenseNodes formats. Default is DenseNodes, as this is much more space-efficient. Add the format parameter pbf_dense_nodes=false to disable DenseNodes.

Osmium usually will compress PBF blocks using zlib. To disable this, use the format parameter pbf_compression=none. This makes reading and writing faster, but the resulting files are larger.

From Libosmium 2.16, the compression type LZ4 is also supported (pbf_compression=lz4). Compression and decompression with LZ4 are much faster than with zlib, but the compression ratio is not quite as good. Note that LZ4 compression is optional and only available if it was compiled in. Most other programs reading PBF files will not be able to read files compressed with LZ4.

Also from Libosmium version 2.16 you can set the compression level with the file format option pbf_compression_level. Allowed values depend on the PBF compression used.

PBF compression	Option	Level
No compression	`none`	n/a
ZLIB	`zlib`	0 - 9
LZ4	`lz4`	1 - 65537
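
A program that accepts these options from its users might want to check the level before passing it on. Here is a minimal Python sketch of such a check based on the ranges in the table above. The function check_compression_level is hypothetical and not part of Libosmium, and raising an error for a level combined with no compression is a choice made for this example.

```python
# Allowed level ranges per compression type, taken from the table above.
LEVEL_RANGES = {"none": None, "zlib": (0, 9), "lz4": (1, 65537)}


def check_compression_level(compression, level):
    """Return level if it is allowed for compression, else raise ValueError."""
    if compression not in LEVEL_RANGES:
        raise ValueError(f"unknown PBF compression: {compression}")
    allowed = LEVEL_RANGES[compression]
    if allowed is None:
        raise ValueError("a level cannot be set without compression")
    low, high = allowed
    if not low <= level <= high:
        raise ValueError(f"{compression} level must be in {low} - {high}")
    return level


print(check_compression_level("zlib", 9))
```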

PBF files contain a string table in each data block. Some programs sort this string table for slightly better compression. Osmium does not do this to make writing of PBF files faster.

Usually PBF files contain all the metadata for objects such as changeset id, username, etc. To save some space you can disable writing of metadata with the format option add_metadata=false.

6. O5M/O5C Format

The o5m and o5c formats were invented and are mainly used by the osmconvert, osmfilter, and osmupdate tools. The two versions are for data files (.o5m) and change files (.o5c). History files are not supported.

Osmium can read those files to be compatible with other tools, but it can’t write the file format.

O5M/O5C files are larger than PBF files (unless you compress them again, which is possible, but makes them slower to read and write of course).

7. OPL ("Object Per Line") Format

See the OPL File Format Manual.

8. Debug Format

The DEBUG format is only used for displaying the data to the user in a way that is readable to a human. Osmium can only write this format; it cannot read it.

9. Blackhole Format

The BLACKHOLE format is special. All data written to a blackhole “file” is thrown away without being encoded. This is useful in some cases, for instance when you are benchmarking. Unlike writing to /dev/null which will encode the data before throwing it away, the “blackhole” file type doesn’t have any overhead.

It is not possible to read the “blackhole” file format. Combinations like “osc.blackhole” etc. are possible.

10. Format Comparison

Which format should I use?

In many cases you can’t choose which format to use, because you get a file in a specific format and have to work with it. Osmium can read all popular formats, so you are covered here. Osmium can also create most popular formats, so if some other software needs a specific format, you should be okay.

But sometimes you can decide. Here are some guidelines:

Usually PBF is the right format. The files are very small and reading and writing is fast in Osmium, because it uses multithreading. The only drawback is that you can’t easily look inside those files because of the binary format.
You can use LZ4 compression (option pbf_compression=lz4) on PBF files or disable compression altogether (option pbf_compression=none) which makes them larger but faster to read. This might make sense if you will read those files very often and aren’t concerned about disk usage. You have to experiment. Note that most other PBF-reading programs will not support the LZ4 compression.
The OSM API uses the XML format, so if you interact with that API, you’ll want to use XML. Also OSM change files only come in XML format, so most software can only use them in that format.
O5M files are about the same size as PBF files or slightly larger, but if you have multiple CPUs they are slower to read than PBF, because they can only be read in a single thread.
OPL files are reasonably fast to read or write, but they are much bigger than files in one of the binary formats. You can use compression, but that makes reading and writing slower and you lose the advantage that you can easily read the contents. Use OPL if you want to filter or manipulate the OSM data with scripting languages or command line tools.
The debug format is nice for a quick glance at the contents of a file.