[OSM-dev] Distributed Data Store Follow-Up
Hi all,

Earlier I posted about how my friend and I were creating a distributed data store for OSM data. We've finished our project and gotten the most difficult queries going. All of our code is freely available, along with a report on our design and findings, on our GitHub wiki at http://wiki.github.com/tannewt/menzies. As the report says, we were able to serve bounding-box and regular get queries faster than the production 0.5 OSM server. However, we did not manage to get our own instance of the OSM API running on our machines because of a number of planet import errors. Thus, we have only a rough idea of how well we do latency-wise and no idea how the two solutions differ under varying loads.

Please let us know what you think. We firmly believe that distributing the data over a number of computers is a far better solution than one single supercomputer.

Thanks,
Scott

___
dev mailing list
dev@openstreetmap.org
http://lists.openstreetmap.org/listinfo/dev
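The post doesn't include the query code itself; as a rough sketch of the scatter-gather shape a distributed bounding-box query implies, in Python (the `NodePartition` class and all names here are illustrative, not taken from the menzies code):

```python
# Sketch of a scatter-gather bounding-box query over several node
# partitions. Partition layout and names are hypothetical.

class NodePartition:
    """One machine's slice of the node table, keyed by node id."""
    def __init__(self, nodes):
        # nodes: {node_id: (lat, lon)}
        self.nodes = nodes

    def bbox(self, min_lat, min_lon, max_lat, max_lon):
        # Each partition scans only its own slice of the data.
        return {nid for nid, (lat, lon) in self.nodes.items()
                if min_lat <= lat <= max_lat and min_lon <= lon <= max_lon}

def distributed_bbox(partitions, min_lat, min_lon, max_lat, max_lon):
    # Fan the query out to every partition and union the results.
    hits = set()
    for p in partitions:
        hits |= p.bbox(min_lat, min_lon, max_lat, max_lon)
    return hits

parts = [
    NodePartition({1: (47.6, -122.3), 2: (47.7, -122.2)}),
    NodePartition({3: (48.9, -122.5), 4: (47.65, -122.35)}),
]
print(sorted(distributed_bbox(parts, 47.5, -122.4, 47.8, -122.1)))  # [1, 2, 4]
```

With geographic partitioning only some shards would need to be consulted per box; the sketch queries all of them for simplicity.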
Re: [OSM-dev] Distributed Data Store Follow-Up
Scott Shawcroft wrote:
> Please let us know what you think. We firmly believe that distributing the data over a number of computers is a far better solution than one single supercomputer.

This conclusion (divide and conquer) is right for fetch. What was your update performance? Did you explore the performance of within queries?

Stefan
Re: [OSM-dev] Distributed Data Store Follow-Up
Stefan,

Our update performance shouldn't be too different. We simply send the update request to all the node machines.

By within do you mean a bounding box query? Could you be more specific?

Thanks,
Scott

Stefan de Konink wrote:
> Scott Shawcroft wrote:
>> Please let us know what you think. We firmly believe that distributing the data over a number of computers is a far better solution than one single supercomputer.
>
> This conclusion (divide and conquer) is right for fetch. What was your update performance? Did you explore the performance of within queries?
>
> Stefan
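The write path Scott describes can be sketched as a broadcast: every node machine applies the same update, so any replica can afterwards serve reads. The `Machine` class and the shape of an update here are illustrative, not the project's actual code:

```python
# Sketch of "send the update request to all the node machines".
# Every replica applies the same write; names are hypothetical.

class Machine:
    def __init__(self, name):
        self.name = name
        self.store = {}  # node_id -> (lat, lon)

    def apply(self, node_id, lat, lon):
        self.store[node_id] = (lat, lon)

def broadcast_update(machines, node_id, lat, lon):
    # Full replication of writes: latency is bounded by the slowest
    # machine, but reads can then go to any replica.
    for m in machines:
        m.apply(node_id, lat, lon)

cluster = [Machine("a"), Machine("b"), Machine("c")]
broadcast_update(cluster, 42, 47.6, -122.3)
print(all(m.store[42] == (47.6, -122.3) for m in cluster))  # True
```

Note this is full replication; under pure geographic partitioning a write would instead go only to the shard owning that region.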
Re: [OSM-dev] Distributed Data Store Follow-Up
Scott Shawcroft wrote:
> Our update performance shouldn't be too different. We simply send the update request to all the node machines.

And your node machines do not cache their partition results? (Thus, is a scan always required?)

> By within do you mean a bounding box query? Could you be more specific?

For bbox you will have results for this:

  ||
  | o-+--o
  ||

For within/touches you will have results for this:

  ||
  o--++--o
  ||

Now, the above example is trivial to support; the interesting case is diagonal lines. This would allow perfect viewport calls.

Stefan
Re: [OSM-dev] Distributed Data Store Follow-Up
Stefan de Konink wrote:
> Scott Shawcroft wrote:
>> Our update performance shouldn't be too different. We simply send the update request to all the node machines.
>
> And your node machines do not cache their partition results? (Thus, is a scan always required?)

We don't do any caching ourselves, but the underlying BerkeleyDB does. Therefore, we can update as we please.

>> By within do you mean a bounding box query? Could you be more specific?
>
> For bbox you will have results for this:
>
>   ||
>   | o-+--o
>   ||
>
> For within/touches you will have results for this:
>
>   ||
>   o--++--o
>   ||
>
> Now, the above example is trivial to support; the interesting case is diagonal lines. This would allow perfect viewport calls.

We don't do within. It is purely node-based. I suppose a spatial way index could be built to do within queries, though.

> Stefan
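The bbox-versus-within distinction above can be stated as two predicates over a way's node list. This is a node-based sketch (matching the "purely node-based" store described here); proper within/touches support for diagonal ways would additionally need segment/box intersection tests, which are omitted:

```python
# Two spatial predicates over a way, represented as a list of
# (lat, lon) nodes. A box is (min_lat, min_lon, max_lat, max_lon).

def in_box(pt, box):
    (lat, lon), (min_lat, min_lon, max_lat, max_lon) = pt, box
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon

def bbox_hit(way_nodes, box):
    # bbox / intersects semantics: any node inside is enough.
    return any(in_box(p, box) for p in way_nodes)

def within(way_nodes, box):
    # within semantics: every node must be inside.
    return all(in_box(p, box) for p in way_nodes)

box = (0.0, 0.0, 1.0, 1.0)
crossing = [(0.5, 0.5), (0.5, 2.0)]   # one endpoint leaves the box
assert bbox_hit(crossing, box) and not within(crossing, box)
```

A way whose segment crosses the box without placing a node inside it would be missed by both predicates here; that is exactly Stefan's "diagonal lines" case.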
Re: [OSM-dev] Distributed Data Store
On Thu, Jan 22, 2009 at 12:02 AM, Stefan de Konink ste...@konink.de wrote:
> Scott Shawcroft wrote:
>> We're interested in what kind of computing resources to design for (how many machines) and whether we can get access logs in order to test our implementation against.

To simulate OSM API traffic you can use the anonymized access logs. Problems are: the data is VERY old, and it doesn't contain the calls made by Potlatch.

http://wiki.openstreetmap.org/wiki/Database#Access_logs
http://steve.dev.openstreetmap.org/osm-api.anony.gz

You can also use the minutely diffs to create write traffic; this is recent data and contains what is inserted by Potlatch. But it doesn't contain the read traffic, so you will lose important data.

http://planet.openstreetmap.org/minute/
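A replay harness for such a log could look like the sketch below. The actual line format of osm-api.anony.gz isn't shown in this thread, so the one-request-path-per-line layout here is an assumption; a real harness would parse whatever fields the log contains:

```python
# Sketch: replay a gzipped access log as read traffic by feeding
# each logged request path to a callable (e.g. an HTTP client).
import gzip
import os
import tempfile

def replay(path, send):
    """Feed each non-empty logged line to send(line); return the count."""
    count = 0
    with gzip.open(path, "rt") as log:
        for line in log:
            line = line.strip()
            if line:
                send(line)
                count += 1
    return count

# Tiny demo log standing in for the real (much larger) file.
fd, path = tempfile.mkstemp(suffix=".gz")
os.close(fd)
with gzip.open(path, "wt") as f:
    f.write("/api/0.5/node/1\n/api/0.5/map?bbox=0,0,1,1\n")

seen = []
n = replay(path, seen.append)
os.unlink(path)
print(n)  # 2
```

In use, `send` would issue the request against the system under test, e.g. `lambda p: urllib.request.urlopen(base_url + p)`.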
Re: [OSM-dev] Distributed Data Store
Hi Scott,

Scott Shawcroft wrote:
> Stefan de Konink wrote:
>> - Admins don't want to maintain multiple systems
>> - The fear of anything new not developed by the devs (especially if it is not built in Ruby)
>
> Who are the admins for the systems?

Tom Hughes is a factor to take into account. All your base...

> We're open to particular solutions, and if there is a bias towards Ruby we'd look closer at it. However, it may be that there is a better solution. Who are the designated devs?

That is basically a 'free for all'. Read the history of SVN and/or this list to find out which people are working on OSM. Personally, I am working on a C implementation of the API. Other people tend to work on the official Ruby on Rails one.

> Also, Amazon Web Services could be used to have virtual machines instead of real ones which need maintenance.

If Amazon wants to sponsor OSM, that is a great thing ;)

>> Technical problems might be more interesting:
>> - Synchronization issues, even for a proxy solution; single or multiple write databases should distribute their data. Out-of-sync scenarios etc.
>> - Especially geo-related issues: how to distribute a real geoquery.
>
> Totally, synchronization is important. Simple partitioning wouldn't have this problem, but if multiple copies will be shared then we could get into trouble. I think the geo element is what makes this more interesting than the standard data storage issue.

The main point is that OSM is by design not a GIS database. We can make it one, but the current features approach the dataset in a 'traditional' way. This is not bad per se, though some problems would tend to love GIS solutions.

>>> We're interested in trying our hand at creating a better system for storing OSM data. We're interested in what kind of computing resources to design for (how many machines) and whether we can get access logs in order to test our implementation against.
>>
>> Related to access logs I found a long brick wall; it might be a better thing to use a requester that just makes random requests. Sources are available for that.
>
> Well, randomness is probably not the best model. I imagine that the server's traffic patterns are also geo-related. For example, people are more likely to work on areas they are near, and areas of the earth in daylight or evening are more likely to have those people accessing the site. Or perhaps a mapping party has a number of people working on the same area all at once. A simple geo partitioning would drive all of this traffic to one particular server. This simple access pattern does work better when retrieving data, because it will utilize all the different machines.

Like Erik pointed out, diffs will give you writes. I think reads are more interesting.

>>> Also, we'd love to have OSM community members involved since we're new to the organization. Lastly, I think we plan to donate our code to the community with the hope that it is useful. What do you think?
>>
>> I love to brainstorm with you :) The next month I want to spend on my MSc thesis about improving native geospatial support in MonetDB, and the OSM data in it. It would of course be great if the ideas coming out of such a session can make it to State of the Map 2009.
>>
>> It would be good to point you at DBslayer (the standard implementation or the Cherokee one); it will balance requests, but with a better balancer it could do geobalancing too :)
>
> I'll have to take a look at it. Existing solutions are good, but we are really looking at laying down some code too, I think.

Creating, for example, a specific SQL-based scheduler that can handle partitions was a thing I was thinking about in the night: http://code.google.com/p/cherokee/issues/detail?id=328

Stefan
Re: [OSM-dev] Distributed Data Store
2009/1/22 Scott Shawcroft scott.shawcr...@gmail.com:
> Stefan,
>
> My thoughts are below.
>
> Stefan de Konink wrote:
>> Hey,
>>
>> Scott Shawcroft wrote:
>>> My friend Jason (cced) and I are seniors at the University of Washington in Computer Science and Engineering. On your FAQ you say people interested in distributing the database across multiple computers should email the list. Well, here we are. We are currently in a distributed systems capstone course during which we need to spend the quarter (until mid March) on a single substantial project.
>>
>> Sounds fun :) There are a lot of 'ideas' around here: geographical balancing, the standard divide-and-conquer methods in databases, etc. The main problems in OSM:
>>
>> - Admins don't want to maintain multiple systems
>> - The fear of anything new not developed by the devs (especially if it is not built in Ruby)
>
> Who are the admins for the systems? We're open to particular solutions, and if there is a bias towards Ruby we'd look closer at it. However, it may be that there is a better solution. Who are the designated devs?

The Ruby thing is just that the current API is written in Ruby on Rails. Obviously any work you do is most useful if it's applicable to the systems and data we currently have. If there's reasonable evidence of a better way of doing something then there's generally no problem in implementing it. A great example is the GPX importer daemon, which was rewritten in C to make it a lot faster.

The important points are ensuring any new stuff is actually better, that the admins can reasonably maintain it, and that there's a sensible strategy to get the data from where it currently is to where it needs to be.

Dave
[OSM-dev] Distributed Data Store
Hi all,

My friend Jason (cced) and I are seniors at the University of Washington in Computer Science and Engineering. On your FAQ you say people interested in distributing the database across multiple computers should email the list. Well, here we are.

We are currently in a distributed systems capstone course during which we need to spend the quarter (until mid-March) on a single substantial project. We're interested in trying our hand at creating a better system for storing OSM data. We're interested in what kind of computing resources to design for (how many machines) and whether we can get access logs in order to test our implementation against. Also, we'd love to have OSM community members involved since we're new to the organization. Lastly, I think we plan to donate our code to the community with the hope that it is useful.

What do you think?

~Scott Shawcroft
Re: [OSM-dev] Distributed Data Store
Hey,

Scott Shawcroft wrote:
> My friend Jason (cced) and I are seniors at the University of Washington in Computer Science and Engineering. On your FAQ you say people interested in distributing the database across multiple computers should email the list. Well, here we are. We are currently in a distributed systems capstone course during which we need to spend the quarter (until mid March) on a single substantial project.

Sounds fun :) There are a lot of 'ideas' around here: geographical balancing, the standard divide-and-conquer methods in databases, etc. The main problems in OSM:

- Admins don't want to maintain multiple systems
- The fear of anything new not developed by the devs (especially if it is not built in Ruby)

Technical problems might be more interesting:

- Synchronization issues, even for a proxy solution; single or multiple write databases should distribute their data. Out-of-sync scenarios etc.
- Especially geo-related issues: how to distribute a real geoquery.

> We're interested in trying our hand at creating a better system for storing OSM data. We're interested in what kind of computing resources to design for (how many machines) and whether we can get access logs in order to test our implementation against.

Related to access logs I found a long brick wall; it might be a better thing to use a requester that just makes random requests. Sources are available for that.

> Also, we'd love to have OSM community members involved since we're new to the organization. Lastly, I think we plan to donate our code to the community with the hope that it is useful. What do you think?

I love to brainstorm with you :) The next month I want to spend on my MSc thesis about improving native geospatial support in MonetDB, and the OSM data in it. It would of course be great if the ideas coming out of such a session can make it to State of the Map 2009.

It would be good to point you at DBslayer (the standard implementation or the Cherokee one); it will balance requests, but with a better balancer it could do geobalancing too :)

Yours Sincerely,
Stefan de Konink
Re: [OSM-dev] Distributed Data Store
Stefan,

My thoughts are below.

Stefan de Konink wrote:
> Hey,
>
> Scott Shawcroft wrote:
>> My friend Jason (cced) and I are seniors at the University of Washington in Computer Science and Engineering. On your FAQ you say people interested in distributing the database across multiple computers should email the list. Well, here we are. We are currently in a distributed systems capstone course during which we need to spend the quarter (until mid March) on a single substantial project.
>
> Sounds fun :) There are a lot of 'ideas' around here: geographical balancing, the standard divide-and-conquer methods in databases, etc. The main problems in OSM:
>
> - Admins don't want to maintain multiple systems
> - The fear of anything new not developed by the devs (especially if it is not built in Ruby)

Who are the admins for the systems? We're open to particular solutions, and if there is a bias towards Ruby we'd look closer at it. However, it may be that there is a better solution. Who are the designated devs? Also, Amazon Web Services could be used to have virtual machines instead of real ones which need maintenance.

> Technical problems might be more interesting:
>
> - Synchronization issues, even for a proxy solution; single or multiple write databases should distribute their data. Out-of-sync scenarios etc.
> - Especially geo-related issues: how to distribute a real geoquery.

Totally, synchronization is important. Simple partitioning wouldn't have this problem, but if multiple copies will be shared then we could get into trouble. I think the geo element is what makes this more interesting than the standard data storage issue.

>> We're interested in trying our hand at creating a better system for storing OSM data. We're interested in what kind of computing resources to design for (how many machines) and whether we can get access logs in order to test our implementation against.
>
> Related to access logs I found a long brick wall; it might be a better thing to use a requester that just makes random requests. Sources are available for that.

Well, randomness is probably not the best model. I imagine that the server's traffic patterns are also geo-related. For example, people are more likely to work on areas they are near, and areas of the earth in daylight or evening are more likely to have those people accessing the site. Or perhaps a mapping party has a number of people working on the same area all at once. A simple geo partitioning would drive all of this traffic to one particular server. This simple access pattern does work better when retrieving data, because it will utilize all the different machines.

>> Also, we'd love to have OSM community members involved since we're new to the organization. Lastly, I think we plan to donate our code to the community with the hope that it is useful. What do you think?
>
> I love to brainstorm with you :) The next month I want to spend on my MSc thesis about improving native geospatial support in MonetDB, and the OSM data in it. It would of course be great if the ideas coming out of such a session can make it to State of the Map 2009.
>
> It would be good to point you at DBslayer (the standard implementation or the Cherokee one); it will balance requests, but with a better balancer it could do geobalancing too :)

I'll have to take a look at it. Existing solutions are good, but we are really looking at laying down some code too, I think.

~Scott

> Yours Sincerely,
> Stefan de Konink
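The hotspot Scott describes is easy to see with a toy partitioner: under a simple geographic split (the four-way longitude banding below is an illustrative choice, not anything from the thread), every edit from one mapping party lands on the same shard:

```python
# Sketch of mapping-party skew under naive geographic partitioning.
# The 4-shard longitude-band scheme is purely illustrative.

def shard_for(lat, lon, n_shards=4):
    # Partition the world into equal longitude bands.
    band = int((lon + 180.0) / 360.0 * n_shards)
    return min(band, n_shards - 1)

# A mapping party: many editors working in one city at once.
party = [(47.60, -122.33), (47.61, -122.34), (47.62, -122.30)]
shards = {shard_for(lat, lon) for lat, lon in party}
print(shards)  # a single shard absorbs the whole party's traffic
```

Spreading ordinary, globally distributed reads across shards is exactly where this scheme helps; the correlated, localized write bursts are where it hurts.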