Re: [osmosis-dev] Reading OSM History dumps
On Wed, Aug 25, 2010 at 11:14 PM, Peter Körner osm-li...@mazdermind.de wrote:

> Brett, the pgsql tasks currently write (in COPY mode) all data to temp files first. The process seems to be PlanetFile -> NodeStoreTempFile -> CopyFormatTempFile -> PgsqlCopyImport. In osm2pgsql the copy data is pushed to pgsql via unix pipes (5 or 6 COPY transactions running at the same time in different connections). This approach skips the CopyFormatTempFile stage. Is there any special reason this approach isn't used in the pgsnapshot package?

Not too sure now :-) I think it was the simplest way to share code between the --write-pgsql-dump task and what was then the --fast-write-pgsql (now simply --write-pgsql) task. In practice the COPY file creation and loading is fairly fast; the biggest downside is the extra disk space. The slowest parts of the whole process are the way geometry creation, the index building, and the CLUSTER statements (in the newest schema). On relatively low-end hardware it takes many days to import an entire planet, and only a small part of that is the COPY processing. In most cases I create the COPY files using --write-pgsql-dump and load them via the provided load script, so that I can better monitor progress and resume if processing is interrupted. In short, it just hasn't been a high priority to change it.

While I'm on the topic, I've mostly completed the changes to the schema now. Performance is drastically improved over the old version for bounding box query processing. The --read-pgsql --dataset-bounding-box task combination previously took approximately an hour to retrieve a 1x1 degree box in a populated area; now it is down to around 5 minutes due to a far better disk layout. The biggest downside is that the table clustering takes a long time during initial database creation.
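A minimal sketch of what the osm2pgsql-style pipe approach could look like in Java, using the PostgreSQL JDBC driver's CopyManager. CopyManager and getCopyAPI() are real driver APIs; the connection details, table and columns are invented for illustration. osm2pgsql runs 5 or 6 such COPYs in parallel on separate connections, i.e. one producer/consumer pair like this per table.

import java.io.OutputStreamWriter;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.sql.Connection;
import java.sql.DriverManager;
import org.postgresql.PGConnection;
import org.postgresql.copy.CopyManager;

public class PipedCopyExample {
    public static void main(String[] args) throws Exception {
        // Invented connection details, for illustration only.
        Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/osmosis", "osmosis", "secret");
        CopyManager copyManager = ((PGConnection) conn).getCopyAPI();

        // In-process pipe replacing the CopyFormatTempFile stage.
        PipedOutputStream out = new PipedOutputStream();
        PipedInputStream in = new PipedInputStream(out);

        // Producer thread: writes tab-separated COPY rows into the pipe.
        Thread producer = new Thread(() -> {
            try (Writer w = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
                w.write("1\t52.5\t13.4\n"); // id, lat, lon -- invented columns
                w.write("2\t48.1\t11.6\n");
            } catch (Exception e) {
                e.printStackTrace();
            }
        });
        producer.start();

        // Consumer: the driver reads from the pipe and streams it to the server.
        copyManager.copyIn("COPY nodes (id, lat, lon) FROM STDIN", in);
        producer.join();
        conn.close();
    }
}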
Re: [osmosis-dev] Reading OSM History dumps
On 25.08.2010 15:16, Marco Lechner - FOSSGIS e.V. wrote:

> Hi Peter, I'm very interested in your history extension and I'm going to test it as soon as a first snapshot is available. Will it be possible to consume a --bound-polygon stream from osmosis? Or will it just import the whole history planet?

You will be able to add a bbox or bound-polygon task before pushing things into the database. But without special tasks that handle bounding boxes with regard to history dumps, you will get problems with nodes moving into and out of your bounding box.

The plugin will, for the time being, also not be able to handle change streams, so it will not be possible to keep the database updated.

This is still work in progress at its earliest stage, so please don't expect it to solve any real problems.

Peter
Re: [osmosis-dev] Reading OSM History dumps
On 25.08.2010 15:26, Brett Henderson wrote:

> In short, it just hasn't been a high priority to change it.

I was planning to share at the FileInputStream/FileOutputStream level. You can feed a FileInputStream into the CopyManager as well as into a file, can't you? Maybe we can copy the relevant bits over to pgsnapshot later.

Peter
Re: [osmosis-dev] Reading OSM History dumps
Hi Marco

The first snapshot is out. Unfortunately the hstore migration Brett is still in the middle of makes the pgsnapshot tests fail, which is why hudson is not providing nightly builds anymore. Because of that you'll need to compile osmosis yourself. I attached instructions to this mail that also include the concrete plugin usage.

The following tasks are available: --write-pgsql-history and --write-pgsql-history-dump. They correlate closely to --write-pgsql and --write-pgsql-dump. All features that are marked as experimental may or may not work, and of course they will be painfully memory intensive on larger datasets because of the lack of a good store implementation.

Peter

On 25.08.2010 15:16, Marco Lechner - FOSSGIS e.V. wrote:

> Hi Peter, I'm very interested in your history extension and I'm going to test it as soon as a first snapshot is available. Will it be possible to consume a --bound-polygon stream from osmosis? Or will it just import the whole history planet? Marco

> On 25.08.2010 15:14, Peter Körner wrote:

>> Hi all

>> After a little playing around I now have an idea of how I'm going to implement everything. I'll stay as close as possible to the regular simple schema and to the way the pgsql tasks work. Just as with the optional linestring/bbox builder, the history import tasks will serve more than one schema. I'm leaving relations out, again.

>> - the regular simple schema: the basis of everything, but not capable of holding history data
>> + history columns: create and populate an extra column in way_nodes to store the way version, and change the PKs of way_nodes to allow more than one version of an element
>> + way_nodes version builder: create and populate an extra column in way_nodes that holds the node version corresponding to the way's timestamp
>> + minor version builder: create and populate an extra column in ways and way_nodes to store the way's minor versions, which are generated by changes to the nodes of the way between version changes of the way itself (see the sketch after the quoted message)
>> + from-to-timestamp builder: create and populate an extra column in the nodes and ways tables that specifies the date until which a version of an item was the current one. After that time, the next version of the same item was current (or the item was deleted). The tstamp field, in contrast, contains the starting date from which an item was current.
>> + linestring/bbox builder: just the same as with the regular simple schema; works for all version and minor-version rows

>> By the end of the week I'll get a pre-snapshot out that can populate the history table with version columns and changed PKs. The database created from this can be used to test Scott's SQL-only solution [1]. It will also contain a first implementation of the way_nodes version builder, but only with an example implementation of the NodeStore that performs badly on bigger files.

>> Brett, the pgsql tasks currently write (in COPY mode) all data to temp files first. The process seems to be PlanetFile -> NodeStoreTempFile -> CopyFormatTempFile -> PgsqlCopyImport. In osm2pgsql the copy data is pushed to pgsql via unix pipes (5 or 6 COPY transactions running at the same time in different connections). This approach skips the CopyFormatTempFile stage. Is there any special reason this approach isn't used in the pgsnapshot package?

>> Peter

>> [1] http://lists.openstreetmap.org/pipermail/dev/2010-August/020308.html
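To make the minor-version idea from the quoted list concrete, a hypothetical sketch; every type and name here is invented for illustration, not part of the plugin. Between two major versions of a way, each edit to one of the way's member nodes produces one minor version.

import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class MinorVersionBuilder {

    /**
     * Between two major versions of a way (valid from fromTs until toTs),
     * every edit to one of its member nodes creates a minor version.
     * Returns {version, minorVersion, timestamp} triples, as they might be
     * written to the ways table. nodeEdits holds {nodeId, timestamp} pairs.
     */
    static List<long[]> buildMinorVersions(
            long wayVersion, long fromTs, long toTs,
            Set<Long> memberNodeIds, List<long[]> nodeEdits) {

        // Collect edit timestamps of member nodes inside the validity window.
        TreeSet<Long> editTimes = new TreeSet<>();
        for (long[] edit : nodeEdits) {
            if (memberNodeIds.contains(edit[0]) && edit[1] > fromTs && edit[1] < toTs) {
                editTimes.add(edit[1]);
            }
        }

        List<long[]> result = new ArrayList<>();
        long minor = 0;
        result.add(new long[] {wayVersion, minor, fromTs}); // the major version itself
        for (long ts : editTimes) {
            result.add(new long[] {wayVersion, ++minor, ts});
        }
        return result;
    }
}

Whether minor numbering starts at 0 (the major version itself) or at 1 is an open design choice; the sketch uses 0.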
# download osmosis
svn export http://svn.openstreetmap.org/applications/utils/osmosis/trunk/ osmosis-trunk

# enter the source directory
cd osmosis-trunk

# download the history plugin
svn export http://svn.toolserver.org/svnroot/mazder/osmhist/osmosis-plugin/history/

# enable the history plugin
patch -p0 < history/script/source-activation.patch

# compile
ant clean publish

# create a postgis user, if not already done
sudo -u postgres createuser osmosis

# create an empty database with hstore and postgis capabilities, if not already done
sudo -u postgres createdb -E UTF8 -O osmosis osmosis-history
sudo -u postgres createlang plpgsql osmosis-history

# create the simple schema database
psql -U osmosis osmosis-history < package/script/pgsql_simple_schema_0.6.sql

# add the history extension to the database
psql -U osmosis osmosis-history < history/script/pgsql_simple_schema_0.6_history.sql

# the following lines add extra features to the database
# execute them before the import
# they are experimental and very memory intensive
# use only with small data sets

# enable the node version builder
#psql -U osmosis osmosis-history < history/script/pgsql_simple_schema_0.6_history_way_nodes_version.sql

# enable
Re: [osmosis-dev] Reading OSM History dumps
On Wed, Aug 25, 2010 at 11:33 PM, Peter Körner osm-li...@mazdermind.de wrote:

> On 25.08.2010 15:26, Brett Henderson wrote:
>> In short, it just hasn't been a high priority to change it.

> I was planning to share at the FileInputStream/FileOutputStream level. You can feed a FileInputStream into the CopyManager as well as into a file, can't you?

Sorry, I'm not sure what you mean. I think the only way to feed data into the CopyManager is via an InputStream. That InputStream can be a FileInputStream or a piped input stream or whatever you wish. But there are also classes like PGCopyOutputStream, so perhaps you can use those directly to avoid using multiple threads. It's been a while since I looked at it.

> Maybe we can copy the relevant bits over to pgsnapshot later.

Yep, sure.
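A minimal sketch of the PGCopyOutputStream approach Brett mentions: a single thread writes COPY data straight to the server, with no pipe, extra thread, or temp file. PGCopyOutputStream is a real class in the PostgreSQL JDBC driver; the connection details and table layout are invented for illustration.

import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.sql.Connection;
import java.sql.DriverManager;
import org.postgresql.PGConnection;
import org.postgresql.copy.PGCopyOutputStream;

public class DirectCopyExample {
    public static void main(String[] args) throws Exception {
        // Invented connection details, for illustration only.
        Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/osmosis", "osmosis", "secret");

        // Writes go directly into an open COPY operation on this connection;
        // closing the stream ends the COPY.
        try (Writer w = new OutputStreamWriter(
                new PGCopyOutputStream((PGConnection) conn,
                        "COPY nodes (id, lat, lon) FROM STDIN"),
                StandardCharsets.UTF_8)) {
            w.write("1\t52.5\t13.4\n"); // invented sample rows
            w.write("2\t48.1\t11.6\n");
        }
        conn.close();
    }
}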
Re: [osmosis-dev] Reading OSM History dumps
On Thu, Aug 26, 2010 at 8:19 AM, Peter Körner osm-li...@mazdermind.de wrote:

> Hi Marco
> The first snapshot is out. Unfortunately the hstore migration Brett is still in the middle of makes the pgsnapshot tests fail, which is why hudson is not providing nightly builds anymore.

I hope to have this fixed over the next few days. I'm working with the server admins to get hstore support added to the database.

> Because of that you'll need to compile osmosis yourself. I attached instructions to this mail that also include the concrete plugin usage.

If you wish to avoid compiling yourself you can also get nightly builds from:
http://bretth.dev.openstreetmap.org/osmosis-build/
The 0.37.SNAPSHOT version in the above location is built via a cron job. No tests are run during the build.

Brett
Re: [osmosis-dev] Reading OSM History dumps
On Tue, Aug 24, 2010 at 1:09 AM, Peter Körner osm-li...@mazdermind.de wrote:

>> To create your own store implementation you can build on the Osmosis persistence support. All classes that are persistable implement the Storeable interface and have a constructor with StoreReader sr, StoreClassRegister scr arguments. The existing IndexedObjectStore assumes that the key is a long but provides a good example to start from. The underlying IndexStore it uses can support any type of key as long as it has a fixed width (ie. always persists to the same number of bytes).

> It would need a key of 96 bits (id long + version int). I was not aware of any type wider than 64 bits in Java, so I'm not sure how I could build a store with a 96-bit index, but I think I have to take a deeper look into the IndexStore code.

IndexStore just requires an IndexElement implementation that holds both the key and the value. You can define a key implementation class that holds as many individual long or int values as you like, so long as it persists through the Storeable interface to a fixed number of bytes. You also have to provide the IndexStore with a comparator that knows how to compare the order of two keys. (A sketch of such a key follows below.)

> The timestamp is just a 64-bit long value, so the only problem here is to do the comparison, but this is the easy part, I think.

It may be possible to make the existing IndexedObjectStore more generic but I'd need to experiment with it.

> I'll try to keep the whole changes local to my project. Once it's finished you can take classes over to core as they're needed.

>> Hmm, but thinking more about your problem it may make more sense to stick with the IndexedObjectStore and store a list of Nodes as each element instead of single Nodes. I suspect in most cases you won't know the exact version you're looking for when you're loading a Node.

> In the first phase, when selecting the versions of the nodes used to create a version of a way, I'll have a lot of timestamp searches (find the oldest node that is younger than the timestamp of the way) that need the timestamp index. Later on, when the intermediate versions are calculated, I'll need a lookup for all versions of an id. A direct request for a known id/version will, as far as I can see at this early stage, not be used too often (maybe during linestring building; you'll only know node ids when looking at a way, after all), and there I will only know a timestamp range.

>> When looking up a specific node/version/timestamp combination you would have to load all versions of a node from the IndexedObjectStore then linearly search for a match in the (usually fairly limited) list of objects. You will possibly need to create your own Storeable list type to hold all versions of a particular Node because I don't think one exists.

> The main problem I see is that such a list won't be of fixed size. When I write it to the store and later on add another version, it will grow bigger and have to be re-allocated in the store file, freeing up space at the beginning. Basically a malloc/realloc/free in files.

If you need the ability to write values randomly then it won't work. But if you have sorted input (ie. all versions of a node are together on input) then you can write them all to the store at once. IndexedObjectStore will allow you to write variable-length objects to the store, which is already necessary to hold entities with variable numbers of tags.
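A hypothetical sketch of such a fixed-width 96-bit (12-byte) key plus comparator. Storeable and the (StoreReader, StoreClassRegister) constructor convention are as described above, and the package name is the one mentioned later in this thread, but the exact StoreReader/StoreWriter method names (readLong, writeInteger, ...) are assumptions and may differ from the real Osmosis API.

import java.util.Comparator;
import org.openstreetmap.osmosis.core.store.StoreClassRegister;
import org.openstreetmap.osmosis.core.store.StoreReader;
import org.openstreetmap.osmosis.core.store.StoreWriter;
import org.openstreetmap.osmosis.core.store.Storeable;

// Fixed-width 96-bit IndexStore key: 8 bytes id + 4 bytes version.
public class NodeVersionKey implements Storeable {
    private final long id;
    private final int version;

    public NodeVersionKey(long id, int version) {
        this.id = id;
        this.version = version;
    }

    // Persistence constructor, following the Osmosis store convention.
    public NodeVersionKey(StoreReader sr, StoreClassRegister scr) {
        this.id = sr.readLong();       // assumed method name
        this.version = sr.readInteger(); // assumed method name
    }

    @Override
    public void store(StoreWriter sw, StoreClassRegister scr) {
        // Always 12 bytes, satisfying the fixed-width requirement of IndexStore.
        sw.writeLong(id);
        sw.writeInteger(version);
    }

    public long getId() { return id; }
    public int getVersion() { return version; }

    // Comparator ordering keys by id, then version.
    public static final Comparator<NodeVersionKey> COMPARATOR = (a, b) -> {
        int c = Long.compare(a.id, b.id);
        return c != 0 ? c : Integer.compare(a.version, b.version);
    };
}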
Just keep in mind that Osmosis stores aren't particularly fast to query because they're based on very simple data structures. They tend to result in huge amounts of disk seeks when processing, so there may be libraries out there that perform better. The main reason they were originally developed was to minimise external library dependencies, and I haven't revisited that decision since Osmosis put on weight (ie. it now relies on many third-party jars).

> Thinking about all this I find that we're re-inventing the wheel. I'll try to use JavaDB as the backend store. It is entirely written in Java and thus cross-platform compatible, supports btree indexes on multiple fields, and can reside both in memory and on disk. If it turns out to be fast enough, it may be a good alternative to a custom binary file/memory store.

I hope it works out, because I've been down a similar path here. After I gave up on custom stores I tried Berkeley DB Java Edition and performance was horrible. I finally bit the bullet, went the PostgreSQL path, and created the pgsql tasks. I hope JavaDB works out though, because requiring a full database server really complicates usage. Be very careful with btree indexes on multiple fields, because they usually only work well when you're looking up specific values of the indexed fields. If you ever need to use range queries (eg. a timestamp range) involving multiple fields they tend to fall down. I suspect you'll be just as well off
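For reference, a minimal sketch of what the JavaDB idea could look like with embedded Derby (JavaDB). The table layout and queries are invented for illustration; the in-memory JDBC subprotocol and FETCH FIRST ROW ONLY assume a reasonably recent Derby, and derby.jar must be on the classpath. Note that the lookup shown here only range-scans the timestamp for a single fixed id, which is the pattern composite btree indexes handle well, per Brett's caveat.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.Timestamp;

public class JavaDbNodeStoreSketch {
    public static void main(String[] args) throws Exception {
        // Embedded, in-memory Derby instance; use "jdbc:derby:nodestore;create=true"
        // for an on-disk database instead.
        Connection conn = DriverManager.getConnection("jdbc:derby:memory:nodestore;create=true");

        try (Statement st = conn.createStatement()) {
            st.execute("CREATE TABLE node_versions ("
                    + "id BIGINT NOT NULL, version INT NOT NULL, "
                    + "tstamp TIMESTAMP NOT NULL, lat DOUBLE, lon DOUBLE, "
                    + "PRIMARY KEY (id, version))");
            // Composite index for the "which version was current at time X" lookup.
            st.execute("CREATE INDEX idx_node_ts ON node_versions (id, tstamp)");
            st.execute("INSERT INTO node_versions VALUES "
                    + "(1, 1, TIMESTAMP('2008-03-01 12:00:00'), 52.5, 13.4), "
                    + "(1, 2, TIMESTAMP('2009-06-15 09:30:00'), 52.6, 13.5)");
        }

        // The node version current at a given way timestamp: the newest
        // version whose timestamp is not after it.
        try (PreparedStatement ps = conn.prepareStatement(
                "SELECT version, lat, lon FROM node_versions "
                + "WHERE id = ? AND tstamp <= ? "
                + "ORDER BY tstamp DESC FETCH FIRST ROW ONLY")) {
            ps.setLong(1, 1L);
            ps.setTimestamp(2, Timestamp.valueOf("2009-01-01 00:00:00"));
            try (ResultSet rs = ps.executeQuery()) {
                if (rs.next()) {
                    System.out.println("version " + rs.getInt(1)
                            + " at " + rs.getDouble(2) + "/" + rs.getDouble(3));
                }
            }
        }
        conn.close();
    }
}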
Re: [osmosis-dev] Reading OSM History dumps
Hi Peter,

This all sounds very interesting and will no doubt have many uses that I can't anticipate. I can't give you much assistance but will try to answer any specific questions you have. My wife is going to give birth sometime within the next month, which means my priorities are about to change drastically ;-)

You seem to have thought about most of the complexities of the problem already, so you know what you're dealing with.

You mentioned the problem of obtaining test data. I'd suggest using:
http://planet.openstreetmap.org/history/
That is a full history from day one of the project up until now. It is already in the OSM change format that Osmosis understands.

Cutting bounding boxes out of full history data is a difficult (but not impossible) problem that you may have to solve in order to move forward. In order to build way linestrings for all way versions, and for all node versions impacting a way, you will have to solve a similar problem to cutting bbox data, so you may be able to kill a couple of birds with one stone.

One thing to note is that I'm currently changing the simple schema a bit to improve performance. I've moved the tags into hstore columns and have duplicated the way_nodes table info into a nodes array column on the ways table. This improves bounding-box style query performance by several times on large datasets. I don't think it will impact you too much.

Good luck!

Cheers, Brett

On Sun, Aug 22, 2010 at 12:18 AM, Peter Körner osm-li...@mazdermind.de wrote:

> Hi
>
> During the last week I thought intensively about the new full history dump and how we could use it. I wrote some kind of paper and also some demo code to check how we could get OSM history information into a PostGIS database, with linestrings and all these delicate features Osmosis offers. I've put it on the wiki at
> http://wiki.openstreetmap.org/wiki/User:MaZderMind/Reading_OSM_History_dumps
>
> I'd love to hear some comments about it.
>
> Peter
Re: [osmosis-dev] Reading OSM History dumps
On 22.08.2010 08:26, Brett Henderson wrote:

> Hi Peter, This all sounds very interesting and will no doubt have many uses that I can't anticipate. I can't give you much assistance but will try to answer any specific questions you have. My wife is going to give birth sometime within the next month which means my priorities are about to change drastically ;-)

Oh, congratulations on this!

> You seem to have thought about most of the complexities of the problem already so you know what you're dealing with.

I think it is all solvable using just enough logic :) I did the demo implementation in PHP to see if this is possible, and I think I know the OSM data structure well enough to know what it means. But I don't know Osmosis and Java well enough to know how to implement the simple multi-level arrays from PHP in a way that will work with those really big files.

What I need is a store that can
- store all versions of a Node*
- access a specific version of a node
- access all versions of a node
- access the oldest version of a node that has been created before date X

*not only the Node's location but also the meta info (timestamp, user, user id), because you would want to have this as the meta info on the generated intermediate way versions.

I looked into the three implementations of NodeLocationStore (especially the InMemoryNodeLocationStore) and thought about how I could extend the really simple fixed-size memory store to be able to store a complete Node and index it by id and version at the same time. Because there is no fixed number of versions per Node, I can't go with a simple offset = NodeID * NodeSize calculation; I have to write the nodes one after another just as they come in and save the offsets in a list. But I'm not sure how to build a list in Java that allows random access to the offsets of all versions of a node, as well as to a specific version. (A sketch of this idea follows at the end of this message.)

I also found the IndexedObjectStore class in org.openstreetmap.osmosis.core.store and thought about extending it to track three indexes (NodeID, Version and Timestamp). Do you know if this would be workable?

> You mentioned the problem of obtaining test data. I'd suggest using:
> http://planet.openstreetmap.org/history/

They are in .osc format, but I need a task to convert from .osc to history-.osm and back, too.

> That is a full history from day one of the project up until now. It is already in the OSM change format that Osmosis understands. Cutting bounding boxes out of full history data is a difficult (but not impossible)

In regard to the node-moved-in/-out problem, yes. At the moment I'm working with self-including history files that contain all referenced items from version 1 on. When I start to convert .osc files into history-.osm files I will have to deal with objects with incomplete histories (when a node has been moved I only know its new position). There will be a need to feed in a second data source, like an already existing database.

> problem that you may have to solve in order to move forward. In order to build way linestrings for all way versions and for all node versions impacting the way you will have to solve a similar problem to understanding how to cut bbox data so you may be able to kill a couple of birds with one stone.

I'm not really sure if this will work, as all I'm focusing on now is to get a complete dump analyzed, but we may get closer to this goal.

> One thing to note is that I'm currently changing the simple schema a bit to improve performance.

Yes, I tracked that, and I like the step towards hstore, as I have already used it a lot with osm2pgsql.
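A hypothetical sketch of the append-and-offset-list idea referenced above: node versions are appended to a data file as they arrive, and a per-node list of file offsets provides random access to all versions or to a specific one. All class and method names are invented, and the record is reduced to lat/lon for brevity; a real store would also persist the meta info and build on the Osmosis store classes discussed in this thread.

import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NodeHistoryStoreSketch {
    private final RandomAccessFile data;
    private final Map<Long, List<Long>> offsetsById = new HashMap<>();

    public NodeHistoryStoreSketch(String fileName) throws IOException {
        this.data = new RandomAccessFile(fileName, "rw");
    }

    // Nodes are appended one after another, just as they come off the dump.
    public void add(long id, int version, double lat, double lon) throws IOException {
        long offset = data.length();
        data.seek(offset);
        data.writeInt(version);
        data.writeDouble(lat);
        data.writeDouble(lon);
        offsetsById.computeIfAbsent(id, k -> new ArrayList<>()).add(offset);
    }

    // Random access to a specific version. Versions are assumed to start at 1
    // and be dense here; sparse version numbers would require scanning the
    // stored version fields instead.
    public double[] get(long id, int version) throws IOException {
        long offset = offsetsById.get(id).get(version - 1);
        data.seek(offset);
        data.readInt(); // stored version number, skipped in this sketch
        return new double[] {data.readDouble(), data.readDouble()};
    }

    public int versionCount(long id) {
        return offsetsById.get(id).size();
    }
}

For a planet-sized file the in-memory offset map would itself outgrow RAM, which is where the IndexStore/IndexedObjectStore machinery discussed in this thread would have to take over.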
Peter
[osmosis-dev] Reading OSM History dumps
Hi

During the last week I thought intensively about the new full history dump and how we could use it. I wrote some kind of paper and also some demo code to check how we could get OSM history information into a PostGIS database, with linestrings and all these delicate features Osmosis offers. I've put it on the wiki at
http://wiki.openstreetmap.org/wiki/User:MaZderMind/Reading_OSM_History_dumps

I'd love to hear some comments about it.

Peter