GeoDesk: A fast and compact database engine for OSM features

Jochen_Topf · January 12, 2023, 3:38pm

Osmium apply-changes and pyosmium-up-to-date do essentially the same thing internally. And yes, as @GeoDeskTeam mentions, there is some multithreading involved.

Multithreading works well for reading PBFs, but writing in multiple threads is not straightforward. The reason is not so much the ordering requirement as the way PBFs are encoded in blocks. Ideally you want blocks to contain a “reasonable” number of objects and/or bytes. But objects have widely different sizes, so you basically have to write them into a block before you know whether the block has a “good” size. And you can’t start on the next block until you know where the previous block ends, so you can’t really have a different thread start on the next block while you encode the current one. You can use a fixed number of objects per block to solve this. The blocks will then have different sizes, but that might not matter in every case. It might make reading somewhat less efficient, though, due to the extra per-block overhead. And you have to make sure that blocks stay below the maximum size defined by the PBF format, so you need to cover that case somehow.
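To make the fixed-count idea concrete, here is a rough Python sketch. It is not how Osmium does it, just an illustration. Objects are stand-in byte strings rather than real PBF-encoded primitives, and `split_fixed`, `encode_block`, `encode_all`, `per_block` and `max_bytes` are all made-up names and values. Block boundaries are known up front, so each block can be encoded in its own thread, and a block that turns out too large is simply halved.

```python
from concurrent.futures import ThreadPoolExecutor
import zlib


def split_fixed(objects, per_block):
    """Cut the object list into blocks of a fixed object count."""
    return [objects[i:i + per_block] for i in range(0, len(objects), per_block)]


def encode_block(block, max_bytes):
    """Encode one block; if the raw payload is too big, halve it and retry.

    Returns a list of (raw_size, compressed_bytes) tuples in input order.
    A single object larger than max_bytes is still emitted as-is here;
    a real writer would have to treat that as an error.
    """
    raw = b"".join(block)
    if len(raw) > max_bytes and len(block) > 1:
        mid = len(block) // 2
        return encode_block(block[:mid], max_bytes) + encode_block(block[mid:], max_bytes)
    return [(len(raw), zlib.compress(raw))]


def encode_all(objects, per_block=8000, max_bytes=16 * 1024 * 1024, workers=4):
    """Encode all blocks in parallel while keeping the original order."""
    blocks = split_fixed(objects, per_block)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order, so object order is preserved.
        results = pool.map(lambda b: encode_block(b, max_bytes), blocks)
        return [blob for part in results for blob in part]
```

The price is the one mentioned above: blocks end up with uneven byte sizes, and oversized blocks need the extra split step.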

And the more stuff you are doing at the same time, the more memory you need for all the “in-flight” data that’s in the process of being assembled. This can amount to many GBs of data that you keep in memory while processing. That’s why multithreading is limited for writing in Osmium. If anybody has an idea how to make this better, please implement it and tell me.
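The memory trade-off can be sketched like this, again in Python and again only as an illustration, not as Osmium's actual code. `write_bounded`, `sink` and `max_in_flight` are hypothetical names. The writer caps the number of blocks being compressed at any moment, so memory use is bounded by roughly `max_in_flight` blocks, at the cost of making the producer wait.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor
import zlib


def write_bounded(blocks, sink, workers=4, max_in_flight=8):
    """Compress blocks in parallel, holding at most max_in_flight at once.

    blocks: iterable of bytes (already-serialized block payloads)
    sink:   callable that receives each compressed block, in input order
    """
    pending = deque()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for block in blocks:
            if len(pending) >= max_in_flight:
                # Wait for the oldest block first; this keeps output order
                # and stops reading more input until a slot is free.
                sink(pending.popleft().result())
            pending.append(pool.submit(zlib.compress, block))
        # Drain whatever is still in flight.
        while pending:
            sink(pending.popleft().result())
```

A larger `max_in_flight` keeps more threads busy but holds more data in memory, which is exactly the tension described above.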