Osmium Concepts Manual

Table of Contents

1. Introduction
2. OSM Entities
3. OSM Files
4. Buffers
5. Handlers
6. Indexes
7. Areas

1. Introduction

“Simple things should be simple and complex things should be possible.” - Alan Kay

This manual introduces some high-level concepts that users of the Libosmium C++ library, the PyOsmium Python bindings, and the node-osmium Node.js bindings need to understand to work effectively. Read it before you dive into the details of how to use the libraries.

While this manual was written for users of those libraries, it can also be helpful for users of the Osmium command line tool and many other OSM tools, because it contains a lot of general information about the OpenStreetMap data model.

2. OSM Entities

When working with OSM data you will always encounter the three basic types of objects: Nodes, Ways, and Relations (in Osmium collectively known as OSM Objects). Osmium also supports Areas. These are not native OSM objects, but Osmium can create them from closed ways and multipolygon relations and then treat them almost like the other, real OSM objects. Sometimes you also want to work with Changesets, which Osmium supports as well. OSM Objects and Changesets together are called OSM Entities in Osmium.

OSM Objects

The OSM data is in OSM objects: nodes, ways, and relations. All OSM objects have a set of attributes and zero or more tags.

Attribute: ID

All OSM objects have a unique integer ID. Each object type has its own ID space, so the node with ID 17 is a different object from the way with ID 17. OSM uses only positive IDs starting with 1, but some software (notably JOSM) uses negative IDs internally.

Osmium can handle any 64-bit signed integer ID in most operations. Some parts only work with positive IDs, notably the node location indexes. See the chapter on Indexes for details.

If not explicitly set, objects in Osmium have ID 0.

Note that Osmium uses the same data type for the IDs of all types of objects. This simplifies handling, but it also means that some memory is wasted. Nodes need 64-bit IDs (because there are so many of them), but the IDs of ways and relations would fit into a 32-bit signed integer.

Attribute: Version

Objects are created in OSM with version 1 and each change to an object increments this version.

If not explicitly set, objects in Osmium have version 0.

Deleting an object also creates a new object version, with the visible flag (see below) set to false.

Attribute: Visible Flag

Objects can be visible or not visible (deleted).

If not explicitly set, objects in Osmium are flagged as visible.

Usual OSM data files contain only visible objects, but OSM history files also contain deleted objects.

Attribute: Timestamp

Each object has a timestamp giving the date and time when this version of the object was created.

Timestamps have one second resolution. They are always in UTC.

If not explicitly set, objects in Osmium have an “invalid” timestamp.

Attribute: Changeset ID

Changes to the OSM database are done in Changesets (see below), so each version of an object belongs to a changeset. The ID of this changeset is stored in the object.

The changeset ID is a 32-bit unsigned integer.

If not explicitly set, objects in Osmium have changeset ID 0.

Attribute: User ID

The ID of the user who created or changed this object. It is a 32-bit unsigned integer.

The user ID 0 isn’t used for a real user and usually marks anonymous users. It used to be possible to mark your edits in OSM as anonymous, but this is no longer possible. So user ID 0 will show up in the data, but new changes will always have a valid user.

If not explicitly set, objects in Osmium have user ID 0.

Attribute: User Name

A UTF-8 string with a maximum length of 256 characters. Note that these are characters, not bytes, and that all valid Unicode characters can be used.

If not explicitly set, objects in Osmium have an empty user name.
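The defaults listed for the attributes above can be collected in one place. The following is a minimal sketch in plain Python, not the Osmium API: the class OSMObjectSketch and the function delete are hypothetical names used only for illustration, and None stands in for the "invalid" timestamp.

```python
from dataclasses import dataclass, field, replace
from datetime import datetime
from typing import Dict, Optional


@dataclass
class OSMObjectSketch:
    """Hypothetical stand-in for an OSM object; not an Osmium class."""

    id: int = 0                           # 0 if not explicitly set
    version: int = 0                      # 0 if not set; OSM versions start at 1
    visible: bool = True                  # objects are flagged visible by default
    timestamp: Optional[datetime] = None  # None models the "invalid" timestamp
    changeset: int = 0                    # 0 if not explicitly set
    uid: int = 0                          # 0 is also used for anonymous users
    user: str = ""                        # empty user name by default
    tags: Dict[str, str] = field(default_factory=dict)


def delete(obj: OSMObjectSketch) -> OSMObjectSketch:
    """Model a deletion: a new version with the visible flag set to false."""
    return replace(obj, version=obj.version + 1, visible=False)


node = OSMObjectSketch(id=17, version=3)
deleted = delete(node)
```

Here deleted keeps the ID of node but carries the next version number and is no longer visible, which is the kind of object an OSM history file contains.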

Nodes

In addition to the attributes and tags, nodes have a location (Location), a position on the planet. A location consists of two coordinates, a longitude and a latitude.

Coordinates are stored internally as 32-bit signed integers. This gives us a resolution of about 1cm or better. This is the same storage format as used internally in the main OSM database.

Historically it was possible to add invalid locations (outside the ranges -180 <= longitude <= 180, -90 <= latitude <= 90) to the OSM database. Today this isn’t possible any more, but Osmium can still handle those invalid locations.
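To see why 32-bit integers are enough, here is a small sketch of fixed-point coordinates. It assumes a scale factor of 10^7 (one integer step per 1e-7 degrees), which this manual does not state, and it assumes the valid ranges are inclusive. All names are hypothetical and not part of the Osmium API.

```python
COORDINATE_PRECISION = 10_000_000    # assumed scale: 1e-7 degrees per step
EARTH_CIRCUMFERENCE_M = 40_075_017   # approximate equatorial circumference


def to_fixed(degrees: float) -> int:
    """Convert degrees to the integer representation."""
    return round(degrees * COORDINATE_PRECISION)


def to_degrees(value: int) -> float:
    """Convert the integer representation back to degrees."""
    return value / COORDINATE_PRECISION


def is_valid(lon: float, lat: float) -> bool:
    """Check that a location lies within the valid coordinate ranges."""
    return -180.0 <= lon <= 180.0 and -90.0 <= lat <= 90.0


# Size of one integer step in metres along the equator (roughly 1.1 cm).
step_m = EARTH_CIRCUMFERENCE_M / 360 / COORDINATE_PRECISION

# The largest valid coordinate still fits into a signed 32-bit integer.
fits_in_int32 = to_fixed(180.0) <= 2**31 - 1
```

Note that the functions take the longitude first and the latitude second, the same order Osmium uses.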

In function calls etc. Osmium always uses the coordinates in the order first longitude, then latitude, because of the usual order of coordinates in mathematics (first x, then y) and professional GIS use.

Ways

In addition to the attributes and tags ways have an ordered list of node references (NodeRef).

In Osmium, ways can optionally also have a location for each node reference. This will usually be empty but can be filled, for instance using the NodeLocationsForWays handler (see below). This is very convenient for many use cases.
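The idea of filling in locations for node references can be sketched in a few lines. This is not the NodeLocationsForWays handler itself, only an illustration of the concept with a plain dictionary as the node location index; the function name locations_for_way is hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Location = Tuple[float, float]  # (longitude, latitude), in that order


def locations_for_way(
    node_refs: List[int], index: Dict[int, Location]
) -> List[Optional[Location]]:
    """Look up the location of every node reference of a way.

    References to nodes that are not in the index yield None.
    """
    return [index.get(ref) for ref in node_refs]


# Nodes are seen first and stored in the index ...
node_index: Dict[int, Location] = {1: (13.0, 52.0), 2: (13.1, 52.1)}

# ... then each way can be given the locations of its nodes.
way_locations = locations_for_way([1, 2, 3], node_index)
```

Node 3 is missing from the index in this example, so its slot stays empty. This is the situation you get when a way in an extract references nodes outside the extract.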

Ways with zero, one or more node references are allowed. In current OSM data ways have a maximum length of 2000 nodes, but this limit is not enforced by Osmium. Historical OSM data might contain longer node lists.

Relations

In addition to the attributes and tags, relations have an ordered list of members (RelationMember). Each member has a type (node, way, or relation), a reference to an object ID of the given type, and a role. The role is a UTF-8 string of up to 256 characters and can be empty.

Relations with zero, one or more members are allowed. There is no upper limit on the number of members.

Areas

Areas are “synthetic OSM objects”. They can be created from closed ways and multipolygon relations. Areas have all the same attributes as real OSM objects and they have tags, too. In addition they have a set of outer and inner rings describing the MultiPolygon geometry. See the chapter on Areas for details.

Changesets

Changesets describe a set of associated changes in the OSM database. They have some attributes, an optional list of tags, and an optional list of comments (“discussion”).

Attribute: Id

Unique ID of this changeset.

The changeset ID is a 32-bit unsigned integer.

If not explicitly set, changesets in Osmium have ID 0.

Attribute: Bounds

Bounding box of this changeset. Can be empty. Osmium doesn’t check the validity of the coordinates in the bounding box.

If not explicitly set, changesets in Osmium have invalid bounds.

Attribute: Created at

The timestamp when the changeset was opened.

If not explicitly set, timestamps in Osmium are invalid.

Attribute: Closed at

The timestamp when the changeset was closed. This is the invalid timestamp if the changeset is still open.

If not explicitly set, timestamps in Osmium are invalid.

Attribute: Num changes

The number of changes in this changeset.

If not explicitly set, Osmium sets this to 0.

Attribute: Num comments

The number of comments in this changeset.

If not explicitly set, Osmium sets this to 0.

Attribute: Uid

The ID of the user who created this changeset. It is a 32-bit unsigned integer.

If not explicitly set, changesets in Osmium have user ID 0.

Attribute: User Name

A UTF-8 string with a maximum length of 256 characters. Note that these are characters, not bytes, and that all valid Unicode characters can be used.

If not explicitly set, changesets in Osmium have an empty user name.

Discussion Comments

Changesets can have zero or more comments. Each contains a timestamp when the comment was made, the user ID and user name of the user making the comment and the text of the comment. Comments are usually, but not necessarily, ordered by the timestamp.
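Two details of changesets are easy to miss: an open changeset is recognized by its invalid "closed at" timestamp, and comments are not guaranteed to arrive in timestamp order. The sketch below models both in plain Python. ChangesetSketch and CommentSketch are hypothetical classes, not Osmium types, and None stands in for the invalid timestamp.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class CommentSketch:
    date: datetime
    uid: int
    user: str
    text: str


@dataclass
class ChangesetSketch:
    id: int = 0
    created_at: Optional[datetime] = None  # None models the invalid timestamp
    closed_at: Optional[datetime] = None   # stays invalid while still open
    num_changes: int = 0
    uid: int = 0
    user: str = ""
    comments: List[CommentSketch] = field(default_factory=list)

    def is_open(self) -> bool:
        """A changeset without a valid "closed at" timestamp is still open."""
        return self.closed_at is None

    def comments_by_time(self) -> List[CommentSketch]:
        """Return the comments ordered by timestamp, whatever order they came in."""
        return sorted(self.comments, key=lambda c: c.date)


cs = ChangesetSketch(id=1, created_at=datetime(2020, 1, 1, tzinfo=timezone.utc))
cs.comments.append(
    CommentSketch(datetime(2020, 1, 3, tzinfo=timezone.utc), 2, "b", "second"))
cs.comments.append(
    CommentSketch(datetime(2020, 1, 2, tzinfo=timezone.utc), 1, "a", "first"))
```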

3. OSM Files

OSM uses several different types of files containing different types of data and it uses different formats to “encode” this data into bits and bytes in the files.

Most programs using OSM data will need to read OSM files and/or write to OSM files. Osmium supports most common types and formats.

Please read the File Formats Manual for the details of these formats and how they are used in Osmium.

I/O Multithreading

Osmium uses multithreading behind the scenes to speed up reading and writing files. This is something the user usually doesn’t have to be concerned with. Whether you use the command line tools or the library, it looks as if the file is simply read sequentially, while internally Osmium does the work in parallel where it can. This works better for some file types than for others, and it might influence your choice of file type. Try different file types to get an idea of their relative speeds. Generally, XML can’t be parallelized and is slow (both reading and writing). PBF can be parallelized well, and reading it with many CPUs is especially fast. O5M cannot be parallelized but is fast even on a single CPU. OPL can be parallelized and is reasonably fast.

URLs

If a file name looks like a URL (i.e. if it starts with http: or https:), Osmium will fork and execute curl to get the file for you. This happens transparently and will work for all programs using Osmium.

You need to have curl installed on your system for this to work. On Windows this feature is not available.

Note that if there is an error during download, Osmium might not be able to detect it. So use caution if you use this feature.
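The mechanism can be illustrated with a short sketch. This is not how Osmium implements it. It is a plain Python approximation that assumes curl is on the PATH, and the function names and the chosen curl options (--silent and --fail) are choices made for this example. Unlike what is described above, the sketch waits for curl to finish and checks its exit status, so a failed download raises an exception.

```python
import subprocess


def looks_like_url(filename: str) -> bool:
    """True if the name starts with "http:" or "https:"."""
    return filename.startswith(("http:", "https:"))


def read_bytes(filename: str) -> bytes:
    """Read a local file directly, or fetch a URL by running curl."""
    if looks_like_url(filename):
        # "--fail" makes curl exit with a non-zero status on HTTP errors,
        # and check=True turns that status into a CalledProcessError.
        result = subprocess.run(
            ["curl", "--silent", "--fail", filename],
            stdout=subprocess.PIPE,
            check=True,
        )
        return result.stdout
    with open(filename, "rb") as f:
        return f.read()
```

A program that streams the curl output while the download is still running cannot know the exit status until the end, which is one reason download errors are hard to detect reliably.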

4. Buffers

5. Handlers

6. Indexes

Osmium is built around the idea that a lot of the things you want to do with OSM data can be done one OSM object at a time, without having all (or large parts of) the OSM data in memory or in some kind of database. But there are many things you cannot do this way. For those you need some kind of storage to hold the data and some indexes to access it efficiently. Osmium provides several class templates that implement different types of indexes.

Index Types

Osmium provides indexes modelled after the STL map and multimap classes. These classes can be found in the osmium::index::map and osmium::index::multimap namespaces, respectively.

Map Index

Often we need some small, fixed amount of data stored for each OSM object. Read and write access is by ID only. Typical use cases include…

storage of node locations where for each node ID we store the longitude and latitude of that node.
storing the offset of an OSM object in a buffer.
a lookup table that gives you for each node ID all IDs of the way (or ways) that include this node.
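The difference between the first and the last use case is that a node has exactly one location but can be part of several ways. A minimal sketch with Python's built-in containers shows both shapes. It only illustrates the concept and does not use the Osmium index classes; all names and sample values are made up.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Map: exactly one value per ID, here (longitude, latitude) per node ID.
node_locations: Dict[int, Tuple[float, float]] = {
    1: (13.0, 52.0),
    2: (13.1, 52.1),
    3: (13.2, 52.2),
    4: (13.3, 52.3),
}

# Sample ways given as way ID -> ordered list of node references.
ways: Dict[int, List[int]] = {100: [1, 2, 3], 101: [3, 4]}

# Multimap: zero or more values per ID, here all way IDs per node ID.
node_to_ways: Dict[int, List[int]] = defaultdict(list)
for way_id, node_refs in ways.items():
    for ref in node_refs:
        node_to_ways[ref].append(way_id)
```

After the loop, node 3 maps to both ways because it is shared between them, while the other nodes map to a single way each.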

Storage types

There are different strategies of storing this data efficiently and there are several sub-classes of the Map and Multimap classes that use different strategies. It is important that you understand the differences and use the class that is most appropriate for your case.

The differences can be understood along different axes:

First, the question is whether the ID space is dense or not. If you are using the full planet data or large portions of it (such as entire continents), your ID space is dense, i.e. most of the possible IDs are actually present in the index. If you are only using small extracts (even with whole countries in them), your ID space is sparse, i.e. most of the possible IDs are not present in the index. For dense indexes, data is often best stored in a kind of array indexed by the ID. For sparse indexes there are several other possibilities. The first component of the index type is either dense or sparse to show for which data it is suitable.
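The cost of a dense array on sparse data is easy to demonstrate. The sketch below is a simplified model in plain Python and not one of the Osmium index classes: DenseArraySketch, fill_ratio and the EMPTY marker are hypothetical names. The array needs one slot for every possible ID up to the largest one stored, whether or not that ID is used.

```python
from array import array

EMPTY = -1  # assumed marker for "no value stored"; not an Osmium constant


class DenseArraySketch:
    """Values stored in an array indexed directly by ID."""

    def __init__(self) -> None:
        self._slots = array("q")  # signed 64-bit integers

    def set(self, id_: int, value: int) -> None:
        if id_ < 0:
            raise ValueError("an array index cannot handle negative IDs")
        if id_ >= len(self._slots):
            # Grow the array so that the slot for this ID exists.
            self._slots.extend([EMPTY] * (id_ + 1 - len(self._slots)))
        self._slots[id_] = value

    def get(self, id_: int) -> int:
        if 0 <= id_ < len(self._slots) and self._slots[id_] != EMPTY:
            return self._slots[id_]
        raise KeyError(id_)

    def slot_count(self) -> int:
        return len(self._slots)


def fill_ratio(ids) -> float:
    """Fraction of the slots 0..max(ids) that would actually be used."""
    unique = set(ids)
    return len(unique) / (max(unique) + 1)


dense = DenseArraySketch()
for node_id in (5, 90_000, 200_000):  # three IDs from a small "extract"
    dense.set(node_id, node_id * 10)
```

Three stored values occupy 200,001 slots here, a fill ratio far below one percent. With planet-like data, where almost every ID is present, the same layout wastes very little and offers lookups without any searching.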

The second question is whether you have enough RAM to hold all the data in the index. Of course it is more efficient to keep the index in RAM, but if you don’t have enough, you need to use a disk-based index. The second component of the index type is either mem for in-memory storage or file for storage on disk. The third option is mmap, which also stores the data in memory but uses the mmap and mremap system calls. This allows for dynamic resizing of the storage area without the overhead of copying data around and without the need for twice the memory while data is copied into another, larger buffer. This option is only available on Linux systems, not on OSX and Windows, which don’t provide the necessary mremap system call.

Another issue to keep in mind is whether your input data is sorted and whether you need to interleave reads and writes to the index. Some indexes keep their data sorted automatically. This makes adding items to the index more expensive, but works better when the input data is not sorted or when you are dealing with updates. OSM files normally come pre-sorted: first all nodes sorted by ID, then all ways sorted by ID, then all relations sorted by ID. In that case you can use an index that doesn’t sort its data, which is probably faster. But if you ever do need to sort the data, that is an extra, expensive step.
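The trade-off around sorting can be sketched with an index that stores (ID, value) pairs in a plain list. This is an illustration in Python, not an Osmium class, and the name SortedPairsSketch is hypothetical. Appending is cheap, lookups use binary search, and an explicit sort step is only needed when the input did not arrive in ID order.

```python
import bisect
from typing import Any, List, Tuple


class SortedPairsSketch:
    """Append-only list of (ID, value) pairs, searched by bisection."""

    def __init__(self) -> None:
        self._pairs: List[Tuple[int, Any]] = []
        self._sorted = True

    def append(self, id_: int, value: Any) -> None:
        # Cheap append; remember whether the IDs arrived out of order.
        if self._pairs and id_ < self._pairs[-1][0]:
            self._sorted = False
        self._pairs.append((id_, value))

    def sort(self) -> None:
        """The extra, expensive step needed for unsorted input."""
        self._pairs.sort(key=lambda pair: pair[0])
        self._sorted = True

    def get(self, id_: int) -> Any:
        if not self._sorted:
            raise RuntimeError("index is not sorted; call sort() first")
        # (id_,) compares lower than any (id_, value) pair, so this finds
        # the first pair with the requested ID, if there is one.
        pos = bisect.bisect_left(self._pairs, (id_,))
        if pos < len(self._pairs) and self._pairs[pos][0] == id_:
            return self._pairs[pos][1]
        raise KeyError(id_)


presorted = SortedPairsSketch()
for node_id, value in [(10, "a"), (20, "b"), (30, "c")]:
    presorted.append(node_id, value)   # usable immediately

unsorted = SortedPairsSketch()
for node_id, value in [(30, "c"), (10, "a"), (20, "b")]:
    unsorted.append(node_id, value)    # needs sort() before the first get()
```

An index that keeps itself sorted on every insert avoids the separate sort step and allows reads and writes to be interleaved, at the price of slower inserts.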

List of map index classes

Different index formats are suitable for different sized OSM files. The descriptions below use the following sizes. Note that these are only rough numbers, given as an indication. If you are not sure, try out which index format works best for your specific case, as many factors play into this.

Small OSM files: city or small country sized extracts (<500 MBytes PBF)
Medium OSM files: medium or large country sized extracts (<5 GBytes PBF)
Large OSM files: planet file or continent sized extracts (>5 GBytes PBF)

sparse_mem_map: Uses the STL std::map class. Use for unsorted data.

sparse_mem_table: Uses the sparsetable class from the Google SparseHash library. This uses a lot of RAM for small files, but is very space efficient for medium sized extracts (for instance countries). It is slower than all other (memory based) formats.

sparse_mem_array: Use instead of sparse_mmap_array, if you can’t use that (i.e. on OSX and Windows).

dense_mem_array: Use instead of dense_mmap_array, if you can’t use that (i.e. on OSX and Windows). You’ll need a lot of memory!

sparse_mmap_array: Stores the data in a (sorted) array with (ID, value) pairs. Most space efficient format for small or medium sized OSM files.

dense_mmap_array: Best format for large OSM files if you have enough memory.

sparse_file_array: Use if you don’t have much memory or if you need persistent storage.

dense_file_array: Use for large OSM files if you don’t have enough memory.

flex_mem: Automatically uses a sparse or dense index based on the input data. Good as a default value. Works a bit like sparse_mmap_array for small input data and dense_mmap_array for large inputs.
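The recommendations in this list can be condensed into a rough decision helper. The sketch below is only a heuristic derived from the descriptions above. It is not an Osmium function, the names are hypothetical, and it assumes decimal megabytes. It leaves out sparse_mem_map, sparse_mem_table and flex_mem, and it is no substitute for trying out what works best for your data.

```python
MBYTE = 10**6  # assumption: sizes are counted in decimal megabytes


def size_class(pbf_bytes: int) -> str:
    """Classify a PBF file using the rough size limits given above."""
    if pbf_bytes < 500 * MBYTE:
        return "small"
    if pbf_bytes < 5000 * MBYTE:
        return "medium"
    return "large"


def suggest_index(pbf_bytes: int, enough_ram: bool, has_mremap: bool) -> str:
    """Suggest a map index name from file size, RAM and platform support.

    has_mremap should be True on Linux and False on OSX and Windows.
    """
    density = "dense" if size_class(pbf_bytes) == "large" else "sparse"
    if not enough_ram:
        return f"{density}_file_array"
    storage = "mmap" if has_mremap else "mem"
    return f"{density}_{storage}_array"


city_on_linux = suggest_index(100 * MBYTE, enough_ram=True, has_mremap=True)
planet_low_ram = suggest_index(60_000 * MBYTE, enough_ram=False, has_mremap=True)
```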
