[OSM-dev] Distributed Data Store Follow-Up

2009-03-24 Thread Scott Shawcroft
Hi all,
Earlier I posted about how my friend and I were creating a distributed 
data store for OSM data.  We've finished our project and gotten the most 
difficult queries working.  All of our code is freely available, along 
with a report on our design and findings, on our GitHub wiki at 
http://wiki.github.com/tannewt/menzies.

As our report says, we were able to run bounding box and regular get 
queries faster than the production 0.5 OSM server.  However, we did not 
manage to get our own instance of the OSM API running on the machines we 
had because of a number of planet import errors.  Thus, we only have a 
rough idea of how well we do latency-wise and no idea how the two 
solutions differ under varying loads.

Please let us know what you think.  We firmly believe that distributing 
the data over a number of computers is a far better solution than one 
single supercomputer.

Thanks,
Scott

___
dev mailing list
dev@openstreetmap.org
http://lists.openstreetmap.org/listinfo/dev


Re: [OSM-dev] Distributed Data Store Follow-Up

2009-03-24 Thread Stefan de Konink
Scott Shawcroft wrote:
 Please let us know what you think.  We firmly believe that distributing 
 the data over a number of computers is a far better solution than one 
 single supercomputer.

This conclusion (divide and conquer) is right for fetch. What was your 
update performance?

Did you explore the performance of within queries?


Stefan



Re: [OSM-dev] Distributed Data Store Follow-Up

2009-03-24 Thread Scott Shawcroft
Stefan,
Our update performance shouldn't be too different.  We simply send the 
update request to all the node machines.
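For concreteness, a toy sketch of that broadcast scheme (the class and function names here are invented for illustration and not from our actual code; a real deployment would send the request over the network rather than in-process):

```python
# Toy sketch of broadcasting one node update to every partition machine.
# NodeStore and broadcast_update are hypothetical names for illustration.

class NodeStore:
    """Stand-in for one partition machine's node table."""
    def __init__(self):
        self.nodes = {}

    def apply_update(self, node_id, lat, lon):
        self.nodes[node_id] = (lat, lon)

def broadcast_update(machines, node_id, lat, lon):
    """Send the same update request to all node machines."""
    for store in machines:
        store.apply_update(node_id, lat, lon)

machines = [NodeStore() for _ in range(3)]
broadcast_update(machines, 42, 47.6097, -122.3331)
```

Every machine ends up with the same copy of node 42, so reads can be served from any of them.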

By within do you mean a bounding box query?  Could you be more specific?
Thanks,
Scott

Stefan de Konink wrote:
 Scott Shawcroft wrote:
 Please let us know what you think.  We firmly believe that 
 distributing the data over a number of computers is a far better 
 solution than one single supercomputer.

 This conclusion (divide and conquer) is right for fetch. What was your 
 update performance?

 Did you explore the performance of within queries?


 Stefan




Re: [OSM-dev] Distributed Data Store Follow-Up

2009-03-24 Thread Stefan de Konink
Scott Shawcroft wrote:
 Our update performance shouldn't be too different.  We simply send the 
 update request to all the node machines.

And your node machines do not cache their partition results? (Thus, is 
a scan always required?)

 By within do you mean a bounding box query?  Could you be more specific?

For bbox you will have results for this:

   +------+
   |      |
   |  o---+--o
   |      |
   +------+

for within/touches you will have results for this:

   +------+
   |      |
o--+------+--o
   |      |
   +------+

Now, the above example is trivial to support; the interesting case is 
diagonal lines.  Handling those would allow perfect viewport calls.
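To make the distinction concrete, here is a minimal sketch of the two predicates (my own illustration, not tied to any particular implementation): a bbox hit only needs a point-in-box test on nodes, while within/touches needs a segment-box intersection test, which also covers the diagonal case.

```python
def node_in_bbox(pt, bbox):
    """True if a node lies inside the bounding box (the 'bbox' case)."""
    (x, y), (minx, miny, maxx, maxy) = pt, bbox
    return minx <= x <= maxx and miny <= y <= maxy

def segment_intersects_bbox(p, q, bbox):
    """True if segment p-q touches the box even when neither endpoint is
    inside (the 'within/touches' case, including diagonal lines).
    Uses the Liang-Barsky style parametric clipping idea."""
    minx, miny, maxx, maxy = bbox
    (x0, y0), (x1, y1) = p, q
    t0, t1 = 0.0, 1.0
    for d, lo, hi, s in ((x1 - x0, minx, maxx, x0),
                         (y1 - y0, miny, maxy, y0)):
        if d == 0:
            if s < lo or s > hi:        # parallel and outside the slab
                return False
        else:
            ta, tb = (lo - s) / d, (hi - s) / d
            if ta > tb:
                ta, tb = tb, ta
            t0, t1 = max(t0, ta), min(t1, tb)
            if t0 > t1:                 # entry after exit: no overlap
                return False
    return True
```

A way that crosses the box diagonally with both endpoints outside fails `node_in_bbox` for every node but still passes `segment_intersects_bbox`.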


Stefan



Re: [OSM-dev] Distributed Data Store Follow-Up

2009-03-24 Thread Scott Shawcroft
Stefan de Konink wrote:
 Scott Shawcroft wrote:
 Our update performance shouldn't be too different.  We simply send 
 the update request to all the node machines.

 And your node machines do not cache their partition results? (Thus is 
 a scan always required?)
We don't do any caching ourselves, but the underlying BerkeleyDB does.  
Therefore, we can update as we please.


 By within do you mean a bounding box query?  Could you be more specific?

 For bbox you will have results for this:

    +------+
    |      |
    |  o---+--o
    |      |
    +------+

 for within/touches you will have results for this:

    +------+
    |      |
 o--+------+--o
    |      |
    +------+

 Now, the above example is trivial to support; the interesting case is 
 diagonal lines.  Handling those would allow perfect viewport calls.
We don't do within; our store is purely node-based.  I suppose a 
spatial way index could be built to support within queries, though.
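As a rough illustration of what such a spatial way index might look like (entirely hypothetical, not part of menzies), a simple grid mapping cells to way ids:

```python
from collections import defaultdict

class GridWayIndex:
    """Toy grid-based spatial index over ways: each cell records the ids
    of ways with a node in that cell.  Hypothetical illustration only."""
    def __init__(self, cell=1.0):
        self.cell = cell
        self.cells = defaultdict(set)

    def _key(self, lon, lat):
        return (int(lon // self.cell), int(lat // self.cell))

    def insert(self, way_id, coords):
        """Index a way by the grid cells its node coordinates fall in."""
        for lon, lat in coords:
            self.cells[self._key(lon, lat)].add(way_id)

    def query(self, minlon, minlat, maxlon, maxlat):
        """Return way ids indexed in any cell overlapping the bbox."""
        hits = set()
        x0, y0 = self._key(minlon, minlat)
        x1, y1 = self._key(maxlon, maxlat)
        for x in range(x0, x1 + 1):
            for y in range(y0, y1 + 1):
                hits |= self.cells.get((x, y), set())
        return hits
```

Note that indexing only node coordinates still misses exactly Stefan's diagonal case (a way crossing a cell with no node in it); a real index would insert each segment into every cell it crosses, or use an R-tree.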


 Stefan




Re: [OSM-dev] Distributed Data Store

2009-01-22 Thread Erik Johansson
On Thu, Jan 22, 2009 at 12:02 AM, Stefan de Konink ste...@konink.de wrote:
 Scott Shawcroft wrote:
 We're interested in what kind of computing resources
 to design for (how many machines) and whether we can get access logs in
 order to test our implementation against.

To simulate OSM API traffic you can use the anonymized access logs.
The problems are that the data is very old and that it doesn't contain
the calls made by Potlatch.

http://wiki.openstreetmap.org/wiki/Database#Access_logs
http://steve.dev.openstreetmap.org/osm-api.anony.gz


You can also use the minutely diffs to create write traffic; this is
recent data and contains what is inserted by Potlatch.  But it doesn't
contain the reads, so you will lose important information.

http://planet.openstreetmap.org/minute/
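As a starting point for replaying that write traffic: the minutely diffs are osmChange XML documents. A sketch that tallies the write operations in a made-up fragment (the sample below only imitates the osmChange shape, it is not real planet data):

```python
import xml.etree.ElementTree as ET

# A made-up fragment in the osmChange shape, for illustration only.
sample = """\
<osmChange version="0.6" generator="example">
  <create>
    <node id="1" lat="47.6" lon="-122.3"/>
  </create>
  <modify>
    <node id="2" lat="47.7" lon="-122.4"/>
    <node id="3" lat="47.8" lon="-122.5"/>
  </modify>
  <delete>
    <node id="4" lat="47.9" lon="-122.6"/>
  </delete>
</osmChange>
"""

def count_writes(xml_text):
    """Count elements under each top-level action (create/modify/delete)."""
    root = ET.fromstring(xml_text)
    return {action.tag: len(list(action)) for action in root}

print(count_writes(sample))  # {'create': 1, 'modify': 2, 'delete': 1}
```

A replayer would walk the elements in order and turn each one into an API write against the system under test.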



Re: [OSM-dev] Distributed Data Store

2009-01-22 Thread Stefan de Konink
Hi Scott,

Scott Shawcroft wrote:
 Stefan de Konink wrote:
 - Admins don't want to maintain multiple systems
 - The fear of anything new not developed by the devs (especially if it 
 is not built in Ruby)
 Who are the admins for the systems?

Tom Hughes is a factor to take into account. All your base...

  We're open to particular solutions 
 and if there is a bias towards Ruby we'd look closer at it.  However, it 
 may be that there is a better solution.  Who are the designated devs?

That is basically a 'free for all'. Read the history of SVN and/or this 
list to find out which people are working on OSM. Personally I am 
working on a C implementation of the API. Other people tend to work on 
the official RubyOnRails one.

 Also, Amazon Web Services could be used to provide virtual machines 
 instead of real ones which need maintenance.

If Amazon wants to sponsor OSM, that is a great thing ;)

 Technical problems might be more interesting:

 - Synchronization issues, even for a proxy solution; single or 
 multiple write databases should distribute their data. Out of sync 
 scenarios etc.
 - Especially geo related issues, how to distribute a real geoquery.
 Totally, synchronization is important.  Simple partitioning wouldn't 
 have this problem, but if multiple copies are shared then we could 
 get into trouble.
 
 I think the geo element is what makes this more interesting than the 
 standard data storage issue.

The main point is that OSM is, by design, not a GIS database.  We can 
make it one, but the current features approach the dataset in a 
'traditional' way.  This is not bad per se, though some problems would 
lend themselves to GIS solutions.

 We're interested in trying our hand at creating a better system for 
 storing OSM data.  We're interested in what kind of computing 
 resources to design for (how many machines) and whether we can get 
 access logs in order to test our implementation against.

 Regarding access logs, I ran into a brick wall; it might be better to 
 use a requester that just makes random requests.  Sources are 
 available for that.
 Well, randomness is probably not the best model.  I imagine that the 
 server's traffic patterns are also geo-related.  For example, people 
 are more likely to work on areas near them, and areas of the earth in 
 daylight or evening are more likely to have people accessing the 
 site.  Or perhaps a mapping party has a number of people working on 
 the same area all at once.  A simple geo partitioning would drive all 
 of this traffic to one particular server.  This access pattern does 
 work better when retrieving data because it will utilize all the 
 different machines.

Like Erik pointed out, diffs will give you writes. I think reads are 
more interesting.

 Also, we'd love to have OSM community members involved since we're 
 new to the organization.

 Lastly, I think we plan to donate our code to the community with the 
 hope that it is useful.

 What do you think?

 I'd love to brainstorm with you :) I want to spend the next month on 
 my MSc thesis about improving native geospatial support in MonetDB, 
 and on getting the OSM data into it.  It would of course be great if 
 the ideas coming out of such a session made it to State of the Map 2009.

 It would be good to point you at DBslayer (the standard implementation 
 or the Cherokee one); it will balance requests, and with a better 
 balancer it could do geo-balancing too :)
I'll have to take a look at it.  Existing solutions are good, but I 
think we are really looking at writing some code ourselves too.

Creating, for example, a specific SQL-based scheduler that can handle 
partitions is something I was thinking about last night:

http://code.google.com/p/cherokee/issues/detail?id=328


Stefan



Re: [OSM-dev] Distributed Data Store

2009-01-22 Thread Dave Stubbs
2009/1/22 Scott Shawcroft scott.shawcr...@gmail.com:
 Stefan,
 My thoughts are below.

 Stefan de Konink wrote:
 Hey,

 Scott Shawcroft wrote:
 My friend Jason (cced) and I are seniors at the University of
 Washington in Computer Science and Engineering.  On your FAQ you say
 people interested in distributing the database across multiple
 computers should email the list.  Well, here we are.  We are
 currently in a distributed systems capstone course during which we
 need to spend the quarter (until mid March) on a single substantial
 project.

 Sounds fun :) There are a lot of 'ideas' floating around here: 
 geographical balancing, the standard divide and conquer methods in 
 databases, etc.  The main problems in OSM:

 - Admins don't want to maintain multiple systems
 - The fear of anything new not developed by the devs (especially if it
 is not built in Ruby)
 Who are the admins for the systems?  We're open to particular solutions
 and if there is a bias towards Ruby we'd look closer at it.  However, it
 may be that there is a better solution.  Who are the designated devs?


The Ruby thing is just that the current API is written in Ruby on Rails.

Obviously any work you do is most useful if it's applicable to the
systems and data we currently have. If there's reasonable evidence of
a better way of doing something then there's generally no problem in
implementing it. A great example is the GPX importer daemon which was
rewritten in C to make it a lot faster.

The important points are ensuring any new stuff is actually better,
that the admins can reasonably maintain it, and that there's a
sensible strategy to get the data from where it currently is to where
it needs to be.


Dave



[OSM-dev] Distributed Data Store

2009-01-21 Thread Scott Shawcroft
Hi all,
My friend Jason (cced) and I are seniors at the University of Washington 
in Computer Science and Engineering.  On your FAQ you say people 
interested in distributing the database across multiple computers should 
email the list.  Well, here we are.  We are currently in a distributed 
systems capstone course during which we need to spend the quarter (until 
mid March) on a single substantial project.

We're interested in trying our hand at creating a better system for 
storing OSM data.  We're interested in what kind of computing resources 
to design for (how many machines) and whether we can get access logs in 
order to test our implementation against.

Also, we'd love to have OSM community members involved since we're new 
to the organization.

Lastly, I think we plan to donate our code to the community with the 
hope that it is useful.

What do you think?

~Scott Shawcroft



Re: [OSM-dev] Distributed Data Store

2009-01-21 Thread Stefan de Konink
Hey,

Scott Shawcroft wrote:
 My friend Jason (cced) and I are seniors at the University of Washington 
 in Computer Science and Engineering.  On your FAQ you say people 
 interested in distributing the database across multiple computers should 
 email the list.  Well, here we are.  We are currently in a distributed 
 systems capstone course during which we need to spend the quarter (until 
 mid March) on a single substantial project.

Sounds fun :) There are a lot of 'ideas' floating around here: 
geographical balancing, the standard divide and conquer methods in 
databases, etc.  The main problems in OSM:

- Admins don't want to maintain multiple systems
- The fear of anything new not developed by the devs (especially if it 
is not built in Ruby)


Technical problems might be more interesting:

- Synchronization issues, even for a proxy solution; single or multiple 
write databases should distribute their data. Out of sync scenarios etc.
- Especially geo related issues, how to distribute a real geoquery.

 We're interested in trying our hand at creating a better system for 
 storing OSM data.  We're interested in what kind of computing resources 
 to design for (how many machines) and whether we can get access logs in 
 order to test our implementation against.

Regarding access logs, I ran into a brick wall; it might be better to 
use a requester that just makes random requests.  Sources are available 
for that.

 Also, we'd love to have OSM community members involved since we're new 
 to the organization.
 
 Lastly, I think we plan to donate our code to the community with the 
 hope that it is useful.
 
 What do you think?

I'd love to brainstorm with you :) I want to spend the next month on my 
MSc thesis about improving native geospatial support in MonetDB, and on 
getting the OSM data into it.  It would of course be great if the ideas 
coming out of such a session made it to State of the Map 2009.

It would be good to point you at DBslayer (the standard implementation 
or the Cherokee one); it will balance requests, and with a better 
balancer it could do geo-balancing too :)


Yours Sincerely,

Stefan de Konink



Re: [OSM-dev] Distributed Data Store

2009-01-21 Thread Scott Shawcroft
Stefan,
My thoughts are below.

Stefan de Konink wrote:
 Hey,

 Scott Shawcroft wrote:
 My friend Jason (cced) and I are seniors at the University of 
 Washington in Computer Science and Engineering.  On your FAQ you say 
 people interested in distributing the database across multiple 
 computers should email the list.  Well, here we are.  We are 
 currently in a distributed systems capstone course during which we 
 need to spend the quarter (until mid March) on a single substantial 
 project.

 Sounds fun :) There are a lot of 'ideas' floating around here: 
 geographical balancing, the standard divide and conquer methods in 
 databases, etc.  The main problems in OSM:

 - Admins don't want to maintain multiple systems
 - The fear of anything new not developed by the devs (especially if it 
 is not built in Ruby)
Who are the admins for the systems?  We're open to particular solutions 
and if there is a bias towards Ruby we'd look closer at it.  However, it 
may be that there is a better solution.  Who are the designated devs?

Also, Amazon Web Services could be used to provide virtual machines 
instead of real ones which need maintenance.


 Technical problems might be more interesting:

 - Synchronization issues, even for a proxy solution; single or 
 multiple write databases should distribute their data. Out of sync 
 scenarios etc.
 - Especially geo related issues, how to distribute a real geoquery.
Totally, synchronization is important.  Simple partitioning wouldn't 
have this problem, but if multiple copies are shared then we could get 
into trouble.

I think the geo element is what makes this more interesting than the 
standard data storage issue.

 We're interested in trying our hand at creating a better system for 
 storing OSM data.  We're interested in what kind of computing 
 resources to design for (how many machines) and whether we can get 
 access logs in order to test our implementation against.

 Regarding access logs, I ran into a brick wall; it might be better to 
 use a requester that just makes random requests.  Sources are 
 available for that.
Well, randomness is probably not the best model.  I imagine that the 
server's traffic patterns are also geo-related.  For example, people are 
more likely to work on areas near them, and areas of the earth in 
daylight or evening are more likely to have people accessing the site.  
Or perhaps a mapping party has a number of people working on the same 
area all at once.  A simple geo partitioning would drive all of this 
traffic to one particular server.  This access pattern does work better 
when retrieving data because it will utilize all the different machines.
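The hotspot argument can be illustrated with a toy simulation (all names and numbers here are invented): under a crude geographic split, a mapping party's edits all land on one machine, while an id-hash split spreads them evenly at the cost of losing locality for reads.

```python
from collections import Counter

N_MACHINES = 4

def geo_partition(lon, lat):
    """Crude geographic split: each machine owns a 90-degree longitude band."""
    return int((lon + 180) // 90) % N_MACHINES

def hash_partition(node_id):
    """Id-hash split: ignores geography entirely."""
    return hash(node_id) % N_MACHINES

# A "mapping party": 1000 edits clustered around one city.
edits = [(-122.3, 47.6, node_id) for node_id in range(1000)]

geo_load = Counter(geo_partition(lon, lat) for lon, lat, _ in edits)
hash_load = Counter(hash_partition(nid) for _, _, nid in edits)

print(max(geo_load.values()))   # all 1000 writes hit a single machine
print(max(hash_load.values()))  # the hash split spreads them evenly
```

The same locality that makes geo partitioning a write hotspot is what makes it attractive for bounding box reads, so a real design has to balance the two.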

 Also, we'd love to have OSM community members involved since we're 
 new to the organization.

 Lastly, I think we plan to donate our code to the community with the 
 hope that it is useful.

 What do you think?

 I'd love to brainstorm with you :) I want to spend the next month on 
 my MSc thesis about improving native geospatial support in MonetDB, 
 and on getting the OSM data into it.  It would of course be great if 
 the ideas coming out of such a session made it to State of the Map 2009.

 It would be good to point you at DBslayer (the standard implementation 
 or the Cherokee one); it will balance requests, and with a better 
 balancer it could do geo-balancing too :)
I'll have to take a look at it.  Existing solutions are good, but I 
think we are really looking at writing some code ourselves too.

~Scott


 Yours Sincerely,

 Stefan de Konink

