Paul Houle said:

> That said, my new strategy for dealing with "large dump files" is
> to cut the file into segments (like 'split') and recompress the
> fragments. If your processing chain allows it, this can be a powerful
> way to get a concurrency speedup. If more dump files were published in
> this format, we could get the benefits of "parallel compression"
> without the cost.
This reminds me of an excellent solution to a similar problem
<http://users.softlab.ece.ntua.gr/~ttsiod/buildWikipediaOffline.html>
that may be applicable to working with the planet.osm file efficiently.
It comes from dealing with the similarly sized English-language
Wikipedia bzip2 dump. Basically, you split the file into chunks, as
you've already done, but in addition you build an index recording the
first complete entry in each chunk. That gives you fast, index-based
lookup into a huge compressed file instead of a linear scan: piping the
output of bzcat to osmarender means decompressing and seeking through
the entire file every time. For Wikipedia, at least, the entries are
self-contained and in alphabetical order, so this works. It's a great
idea, and it enables a really fast offline Wikipedia reader built
entirely from open-source tools. Conceivably someone could adapt the
concept to work with the planet.osm file more quickly.

Now, there are probably several big reasons this wouldn't work with
planet.osm; I don't know anything about its internal organization, so I
can't say. But perhaps there is some amount of data locality that can
be exploited to make it work. If there is at least one type of
information we can use to seek through the file and find, say, a
country or a boundary of some sort, then it could be possible.

Your post reminded me of the Wikipedia dump solution, so I thought I'd
mention it.

Regards,
-DC
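To make the idea concrete, here is a minimal sketch in Python of the
chunk-plus-index scheme described above. All names here are
hypothetical illustrations, not part of the Wikipedia reader or any OSM
tool: sorted records are compressed into independent bzip2 chunks, an
index stores each chunk's first key, and a lookup decompresses only the
one chunk that can contain the key.

```python
# Sketch of index-based lookup into independently bzip2-compressed chunks.
# Assumes records are sorted by key, as in the Wikipedia dump case.
import bz2
import bisect

def build_chunks(records, chunk_size):
    """Compress sorted (key, value) records into fixed-size chunks.

    Returns (index, chunks), where index[i] is the first key in chunks[i].
    """
    index, chunks = [], []
    for start in range(0, len(records), chunk_size):
        part = records[start:start + chunk_size]
        index.append(part[0][0])
        payload = "\n".join(f"{k}\t{v}" for k, v in part)
        chunks.append(bz2.compress(payload.encode()))
    return index, chunks

def lookup(key, index, chunks):
    """Find a record by decompressing only the chunk that could hold it."""
    # Last chunk whose first key is <= the search key.
    i = bisect.bisect_right(index, key) - 1
    if i < 0:
        return None
    for line in bz2.decompress(chunks[i]).decode().splitlines():
        k, v = line.split("\t", 1)
        if k == key:
            return v
    return None

records = sorted((f"entry{n:04d}", f"text {n}") for n in range(1000))
index, chunks = build_chunks(records, chunk_size=100)
print(lookup("entry0542", index, chunks))  # decompresses one chunk of ten
```

The same principle applies whether the chunks are in-memory blobs, as
here, or separate files produced with 'split': the index turns a full
decompression pass into a single small one.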
_______________________________________________
talk mailing list
[email protected]
http://lists.openstreetmap.org/listinfo/talk

