OpenStreetMap uses several different types of files containing different types of data and it uses different formats to “encode” this data into bits and bytes on your disk.
This manual gives an overview of the different file formats and encodings and explains what they have in common and how they differ. It was written for users of the Libosmium library and of the tools built on top of it, but it is useful beyond that.
If you are an Osmium user, read this first to get an overview of OSM files, possibly together with the Osmium Concepts Manual. Once you have understood the information here, you can read the other documentation for the details.
Seeing what’s in an OSM file
If you have an OSM file and want to take a quick look at its content, the osmium command line tool is your friend.
Use the fileinfo command to get a quick overview of the file. This will only
read the metadata available from the file system and the header of the file,
so it is very fast:
osmium fileinfo OSMFILE
Use the -e option to get more information about the file contents. This will
actually read the complete file and give you some statistics:
osmium fileinfo -e OSMFILE
If you want to look at the actual contents, use the show command:
osmium show OSMFILE
It will convert the file to the DEBUG format and pipe the result into your favorite pager program.
2. File Types
OSM uses three types of files for its main data:
- Data files
- These are the most common files, containing the OSM data at a specific point
in time. This can either be a planet file containing all OSM data or some
kind of extract. At most one version of every object (node, way, or
relation) is contained in this file. Deleted objects are not in this
file. The usual suffix used is .osm.
- History files
- These files contain not only the current version of an object, but its
history, too. So for any object (node, way, or relation) there can be zero
or more versions in this file. Deleted objects can also be in this file.
The usual suffix used is .osh. Because sometimes the same suffix is used as for normal data files (.osm) and because there is no clear indicator in the header, it is not always clear what type of file you have in front of you.
- Change files
- Sometimes called diff files or replication diffs, these files
contain the changes between one state of the OSM database and another
state. Change files can contain several versions of an object.
The usual suffix used is .osc.
All these files have in common that they contain OSM objects (nodes, ways, and relations). History files and change files can contain several versions of the same object and also deleted objects; data files can’t.
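Because neither the suffix nor the header reliably tells a data file from a history file, one option is to look at the objects themselves. The following Python sketch encodes the rule stated above. The tuple layout (type, id, version, visible) and the function name are assumptions made for this illustration; they are not part of Osmium's API.

```python
def could_be_data_file(objects):
    """Return True if no object is deleted and no (type, id) appears twice.

    `objects` is an iterable of (type, id, version, visible) tuples, a
    hypothetical layout chosen for this example.
    """
    seen = set()
    for obj_type, obj_id, _version, visible in objects:
        if not visible:
            return False  # deleted objects only occur in history/change files
        if (obj_type, obj_id) in seen:
            return False  # a second version of the same object
        seen.add((obj_type, obj_id))
    return True


extract = [("node", 1, 3, True), ("node", 2, 1, True), ("way", 1, 2, True)]
history = [("node", 1, 1, True), ("node", 1, 2, True), ("node", 2, 1, False)]
```

A True result does not prove that the input is a data file: a history file in which every object happens to have exactly one visible version passes the check as well.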
Osmium handles all these files in the same way. It knows about the different
ways those files are formatted, but semantically all these files produce the
same internal objects. The only difference is that the visible flag on OSM
objects is always true for data files, but not for history and change files.
(Note that this is different from how Osmosis handles these files: Osmosis differentiates between “entity streams” and “change streams”.)
XML change files have each object in a section called <create>, <modify>, or
<delete>. When reading change files, Osmium gives you normal OSM objects
and sets the visible flag to false for objects in <delete> sections.
When writing out OSM objects into change files, deleted objects are marked
as such, and all other objects are marked as <create> if their version is 1
or <modify> if their version is greater than 1. (This is technically
correct, because in OSM all objects are created at version 1 and all other
versions are necessarily modifications of this first version. Other software
interprets the details differently and uses create/modify in slightly different
circumstances. Any software using change files must handle both cases
(create/modify) anyway, so this shouldn’t make a difference.)
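The mapping between change-file sections and the visible flag can be sketched in a few lines of Python. This is an illustration of the rule described above, not Osmium's implementation; the function names and the sample document are made up for the example, and only the Python standard library is used.

```python
import xml.etree.ElementTree as ET


def section_for(version, visible):
    """Pick the change-file section for an object when writing."""
    if not visible:
        return "delete"
    return "create" if version == 1 else "modify"


def read_change(xml_text):
    """Yield (type, id, version, visible) for every object in a change document."""
    root = ET.fromstring(xml_text)
    for section in root:
        # Objects in <create> and <modify> are visible, those in <delete> are not.
        visible = section.tag != "delete"
        for obj in section:
            yield obj.tag, int(obj.get("id")), int(obj.get("version")), visible


SAMPLE = """<osmChange version="0.6">
  <create><node id="1" version="1" lat="1.0" lon="2.0"/></create>
  <modify><node id="2" version="3" lat="1.5" lon="2.5"/></modify>
  <delete><way id="7" version="4"/></delete>
</osmChange>"""

objects = list(read_change(SAMPLE))
sections = [section_for(version, visible) for _, _, version, visible in objects]
```

Note that the section an object was read from is not preserved: an object that another program wrote to <modify> at version 1 would be written back to <create> under this rule.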
You can also see a change file as a partial history file with a strange format.
And then there are changeset files. They don’t contain OSM objects, but
changesets. Some changeset files contain the discussion comments together
with the changesets, some files don’t have the comments (the
attribute is always set, though). Changeset files can be combined with OSM
data or history files into one. So there can be one file that contains both
the OSM objects and the changesets.
Don’t mix up “change files” and “changeset files”; they are completely different concepts. Change files contain the new versions of OSM objects and describe the changes that way. Changeset files contain changesets, which hold the metadata about changes.
While Osmium itself is mostly file type agnostic, applications built on top of Osmium usually only handle specific types of files for their use cases.
3. File Formats
There are several different OSM file formats in common use. File formats describe the way the content is encoded in bits and bytes on disk or on the wire. Osmium can read and write most of these formats. Here is an overview; later chapters go into more detail.
- XML
- The original XML-based OSM format. This format is rather verbose and working with it is slow, but it is still used often and in some cases there is no alternative. The main OSM database API also returns its data in this format. More information about this format on the OSM Wiki.
- PBF
- The binary format based on the Protocol Buffers encoding. This is the most compact format. More information on the OSM Wiki.
- O5M/O5C
- This binary format is simpler than the PBF format but not used as widely. Osmium can read this format to be compatible with other software, but not write it. O5m is the format for data files, O5c the version for change files. More information on the OSM Wiki (https://wiki.openstreetmap.org/wiki/O5m).
- OPL
- A simple format similar to CSV files with one OSM entity per line. This
format is intended for easy use with standard UNIX command line tools such
as awk. See the OPL File Format Manual for details.
- Debug
- A nicely formatted text-based format that is easier for a human to read than the XML or OPL formats. As the name implies, this is intended for debugging. The format can only be written by Osmium, not read.
See below for more detailed descriptions.
Files in the text-based formats (XML, OPL, Debug) can optionally be compressed
with gzip or bzip2. Osmium will handle this internally. Just use the right
file name suffix (.opl.bz2, for instance) for this to work.
Ordering of objects in files
All OSM files can have the entities they contain in any order. This is independent of the type or format of the file. Usually the entities are sorted in a specific way, but whether the entities are sorted or not and in what way is not part of the file format itself.
When you tell Osmium to read a file, it always gives you the entities in the order they are in the file. And when you write to a file, you give the entities to Osmium in a certain order and they will end up in the file in that order. To be consistent and performant, Osmium doesn’t re-order anything for you. If it enforced some kind of order, it might have to do extra work that you might not need or want.
All of this being said, OSM files are almost always ordered in a specific way: first nodes, then ways, then relations, with each group ordered by ascending ID (and ascending version in history files). Changeset files are usually ordered by changeset ID.
If you write software built on Osmium you have to decide whether you impose any restrictions on the internal order of input files and whether you want to guarantee any order when writing out files. This mostly has something to do with performance and ease of programming. Ordered files are often easier and faster to work with, but not necessarily so. You should always think about this issue and document what your programs expect or generate.
While reading and writing files with Osmium is independent of entity order, some other parts of Osmium might expect certain orders or guarantee to generate data in certain orders. Look for those details in the rest of the documentation.
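If your program relies on the usual order, it can help to verify that assumption instead of trusting the input. The Python sketch below expresses the order described above (nodes, then ways, then relations, each by ascending ID and version) as a sort key. The tuple layout (type, id, version) and the function names are assumptions for this example, and only positive IDs are considered.

```python
TYPE_RANK = {"node": 0, "way": 1, "relation": 2}


def sort_key(obj):
    """Sort key for the usual OSM file order: type, then ID, then version."""
    obj_type, obj_id, version = obj
    return (TYPE_RANK[obj_type], obj_id, version)


def is_in_usual_order(objects):
    """Return True if every object sorts at or after its predecessor."""
    keys = [sort_key(obj) for obj in objects]
    return all(a <= b for a, b in zip(keys, keys[1:]))


usual = [("node", 1, 1), ("node", 1, 2), ("node", 5, 1), ("way", 2, 1), ("relation", 9, 3)]
shuffled = [("way", 2, 1), ("node", 1, 1)]
```

The same key can be passed to `sorted()` to bring a small unordered input into the usual order, at the cost of holding all objects in memory.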
Some file formats (XML, PBF, O5M, and Debug, but not OPL) have a file header that contains metadata about the file. Which data is available differs widely between formats and most of the data is optional and often not available or inaccurate.
The Osmium library gives you access to the header data when reading files and you can set header fields when writing a file.
Usually Osmium-based programs will allow you to tell them the name of an input or output file and, optionally, a format description. Osmium detects the format of a file from the file name suffix, so usually you do not have to set the format explicitly.
Osmium knows about the following suffixes:
| Format | Suffix | Description |
|--------|--------|-------------|
| XML | .osm | XML data or changeset file, can also be a history file |
| XML | .osh | XML history file |
| XML | .osc | XML change file |
| O5M | .o5m | o5m data file |
| O5C | .o5c | o5c change file |
You can stack formats: for example, .osm.pbf is the same as .pbf, and .osh.pbf
is a history file in PBF format.
The change file format (.osc) is only available in the XML version; use
.osh instead for other formats.
Osmium supports compression and decompression of XML, OPL, and DEBUG files
internally using the GZIP and BZIP2 formats. As usual, these files have an
additional suffix (.gz or .bz2). So a typical PBF file name ends in .osm.pbf,
and a bzip2-packed history file in XML format could end in .osh.bz2.
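The suffix rules above can be summarised in a short Python sketch. It is an illustration only and not Libosmium's actual detection code: the function name and return values are made up for the example, only suffixes mentioned in this manual are covered, and it does not check that compression is limited to the text-based formats.

```python
TYPE_BY_SUFFIX = {"osm": "data", "osh": "history", "osc": "change"}


def detect_format(filename):
    """Return (format, file_type, compression) guessed from the suffixes."""
    parts = filename.lower().split(".")[1:]  # everything after the base name
    compression = "none"
    if parts and parts[-1] in ("gz", "bz2"):
        compression = parts.pop()
    if not parts:
        raise ValueError(f"no format suffix in {filename!r}")
    last = parts.pop()
    if last in ("pbf", "opl"):
        # Stacked form: an optional .osm or .osh in front selects the file type.
        prefix = parts[-1] if parts else "osm"
        if prefix == "osc":
            raise ValueError("the change format only exists in XML")
        return last, TYPE_BY_SUFFIX.get(prefix, "data"), compression
    if last in TYPE_BY_SUFFIX:
        return "xml", TYPE_BY_SUFFIX[last], compression
    if last in ("o5m", "o5c"):
        return "o5m", ("data" if last == "o5m" else "change"), compression
    raise ValueError(f"unknown suffix .{last}")
```

For example, `detect_format("history.osh.bz2")` yields the XML format, the history file type, and bzip2 compression, while a name without a known suffix raises an error and would need an explicit format string.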
If the file name does not end in the suffix needed for autodetection, you have to supply a format string to Osmium describing the format. Just use the suffix the file name would have as the format string, for example osm.bz2 for a bzip2-compressed XML file.
This is needed most often when referring to STDIN or STDOUT. To refer to
STDIN or STDOUT, use an empty file name or a single hyphen (-).
File Format Options
Some file formats allow different options to be set. Options follow the format name in a comma-separated list. For instance, the PBF format allows two different ways of writing nodes to the file. By default the dense format is used, but you can disable it like this: pbf,pbf_dense_nodes=false
Note that, if a format is given, it must always start with the format description, even if the file name has the correct suffix.
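To make the structure of such a format string concrete, here is a minimal Python sketch that splits it into the format description and its options. It is not Libosmium's parser. The function name is hypothetical, and treating an option without a value as "true" is an assumption of this example.

```python
def parse_format_string(spec):
    """Split 'format,key=value,...' into (format, options dict)."""
    fmt, *raw_options = spec.split(",")
    options = {}
    for raw in raw_options:
        key, _, value = raw.partition("=")
        # Assumption for this sketch: a bare option name means "true".
        options[key] = value if value else "true"
    return fmt, options


fmt, options = parse_format_string("pbf,pbf_dense_nodes=false")
```

The first element is always the format description, which is why a string consisting of options alone is not valid.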
Here is a list of optional settings currently supported:
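| Format | Option | Default | Description |
|--------|--------|---------|-------------|
| PBF | pbf_dense_nodes | true | Use DenseNodes (more space efficient) |
| PBF | pbf_compression | gzip | Compress blocks using gzip (use “none” to disable) |
| XML | | false | Set change format; can also be set by using the osc format specifier |
| | | true | Add metadata (version, timestamp, etc.) to objects |
| PBF, XML, OPL | | false | Add node locations to way nodes (libosmium-specific extension) |
| | | false | Output with ANSI colors |
| | | false | Add CRC32 checksum to all objects |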
4. XML Format
There are several different XML formats in use in the OSM project. The main
formats are the one used for planet files, extracts, and API responses (suffix
.osm), the format used for change files (suffix .osc), and the history
format (suffix .osh).
Some variants are also used, such as the JOSM format, which is similar to the normal OSM format but has some additions. Support for the features of these formats varies.
When reading, the OSM change format (.osc) is detected automatically. When
writing, you have to set it using the format specifier osc or the
corresponding format option.
5. PBF Format
The PBF file format is based on Google Protocol Buffers. PBF files are very space efficient and faster to use than XML files. PBF files can contain normal OSM data or OSM history data, but there is no equivalent to the XML .osc format.
Osmium supports reading and writing of nodes in DenseNodes and non-DenseNodes
formats. The default is DenseNodes, as this is much more space-efficient. Add the
format option pbf_dense_nodes=false to disable DenseNodes.
Osmium will usually compress PBF blocks using zlib. To disable this, use the
format option pbf_compression=none. This makes reading and writing faster,
but the resulting files are larger.
PBF files contain a string table in each data block. Some programs sort this string table for slightly better compression. Osmium skips this step to make writing PBF files faster.
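The idea of a per-block string table can be sketched as follows: every distinct string is stored once, and objects refer to it by index. The Python code below builds such a table in first-seen order, that is, without the sorting step. It only illustrates the concept and does not produce real PBF data. The function name is made up, and keeping index 0 as an empty entry follows my reading of the PBF description on the OSM Wiki, so treat that detail as an assumption.

```python
def build_string_table(strings):
    """Return (table, indexes): unique strings in first-seen order plus one index per input."""
    table = [""]  # index 0 kept as an empty entry (assumption, see text above)
    position = {}
    indexes = []
    for s in strings:
        if s not in position:
            position[s] = len(table)
            table.append(s)
        indexes.append(position[s])
    return table, indexes


tags = ["highway", "residential", "name", "Main Street", "highway", "primary"]
table, indexes = build_string_table(tags)
```

A program that sorts the table would reorder the entries, for instance by how often each string occurs, and renumber the indexes accordingly, which takes an extra pass over the block.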
Usually PBF files contain all the metadata for objects such as changeset id,
username, etc. To save some space you can disable writing of metadata with the
corresponding format option.
6. O5M/O5C Format
The o5m and o5c formats were invented and are mainly used by the osmconvert, osmfilter, and osmupdate tools. The two versions are for data files (.o5m) and change files (.o5c). History files are not supported.
Osmium can read those files to be compatible with other tools, but it can’t write the file format.
O5M/O5C files are larger than PBF files (unless you compress them again, which is possible, but makes them slower to read and write of course).
7. OPL ("Object Per Line") Format
See the OPL File Format Manual.
8. Debug Format
The DEBUG format is only used for displaying the data to the user in a way that is readable to a human. It cannot be read programmatically.
9. Format Comparison
Which format should I use?
In many cases you can’t choose which format to use, because you get a file in a specific format and have to work with it. Osmium can read all popular formats, so you are covered here. Osmium can also create most popular formats, so if some other software needs a specific format, you should be okay.
But sometimes you can decide. Here are some guidelines:
- Usually PBF is the right format. The files are very small and reading and writing is fast in Osmium, because it uses multithreading. The only drawback is that you can’t easily look inside those files because of the binary format.
- You can disable compression (option pbf_compression=none) on PBF files, which makes them larger but faster to read. This might make sense if you will read those files very often and aren’t concerned about disk usage. You have to experiment.
- The OSM API uses the XML format, so if you interact with that API, you’ll want to use XML. Also OSM change files only come in XML format, so most software can only use them in that format.
- O5M files are about the same size as PBF files or slightly larger, but on machines with multiple CPUs they are slower to read than PBF, because they can only be read in a single thread.
- OPL files are reasonably fast to read or write, but they are much bigger than files in one of the binary formats. You can use compression, but that makes reading and writing slower and you lose the advantage that you can easily read the contents. Use OPL if you want to filter or manipulate the OSM data with scripting languages or command line tools.
- The debug format is nice for a quick glance at the contents of a file.