Re: Geospatial + Partitioned Index

Josh Elser Fri, 16 Jan 2015 12:20:12 -0800

Russ Weeks wrote:

Hey, all,


I'm looking at switching my geospatial index to a partitioned index to
smooth out some hotspots. So for any query, I'll have a bunch of ranges
representing intervals on a Hilbert curve, plus a bunch of partitions,
each of which needs to be scanned for every range.

The way that the (excellent!) Accumulo Recipes geospatial store
addresses this is to take the product of the partitions and the curve
intervals[1]. It seems like an alternative would be to encode the curve
intervals as a property of a custom iterator (I need one anyways to
filter out extraneous points from the search area) and then the client
would just scan (-inf, +inf), which I think is more typical when
querying a partitioned index?

I'm no expert on storing geo-spatial data, but having to scan (-inf, +inf) on a table for every query is typically what people are trying to avoid when they put up with the pain of hot-spotting, although it is the easiest approach to implement.

If you can be "tricky" in how you're encoding your data in the row such that you can reduce the search space over your partitioned index, you can try to get the best of both worlds (avoid reading all data and still get a good distribution).

Since that was extremely vague, here's an example: say you had a text index and wanted to look up the word "the" and your index had 100 partitions, [0,99]. If you knew that it was only possible for "the" to show up on partitions 5, 27 and 83 (typically by use of some hashing function), you could drastically reduce your search space while still avoiding hot spotting on a single server.
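
To make the hashing idea a little more concrete, here is a rough sketch in plain Java (no Accumulo dependency). Everything in it is an assumption for illustration: the class name TermPartitioner, the constants NUM_PARTITIONS and SPREAD, and the use of String.hashCode() with a salt are hypothetical choices, not something taken from Accumulo or Accumulo Recipes. The point is only that the writer and the reader derive the same small set of candidate partitions from the term, so a query touches at most SPREAD partitions instead of all 100.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

class TermPartitioner {
    // Hypothetical settings: 100 partitions [0,99], each term spread over at most 3.
    static final int NUM_PARTITIONS = 100;
    static final int SPREAD = 3;

    /** Partitions on which index entries for this term may live. */
    static SortedSet<Integer> candidatePartitions(String term) {
        SortedSet<Integer> partitions = new TreeSet<>();
        for (int i = 0; i < SPREAD; i++) {
            // Salt the term so that each i gives a (usually) different partition.
            int hash = (term + "#" + i).hashCode();
            partitions.add(Math.floorMod(hash, NUM_PARTITIONS));
        }
        // May hold fewer than SPREAD entries if two salted hashes collide.
        return partitions;
    }

    /** Partition an index entry for (term, docId) is written to. */
    static int partitionForWrite(String term, String docId) {
        List<Integer> candidates = new ArrayList<>(candidatePartitions(term));
        // Spread documents containing the same term across its candidates.
        return candidates.get(Math.floorMod(docId.hashCode(), candidates.size()));
    }
}
```

A query for "the" would then scan only candidatePartitions("the") rather than every partition, while writes for a very common term are still spread over several servers.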

Can anybody comment on which approach is preferred? Is it common to
expose the number of partitions in the index and the encoding of those
partitions to client code? Am I needlessly worried that taking the
product of the curve intervals and the partitions will produce too many
ranges?

In the trivial sense, the client doesn't need to know the partitions and would just scan the entire index like you said earlier. You could also track the partitions that you have created in a separate table and the client could read that table to know ahead of time (if you have a reason to do so in your implementation).

Depending on the amount of data you have, lots of ranges to check could take some time. YMMV.
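
For a sense of how quickly the range count grows, here is a small sketch of the "product of partitions and curve intervals" approach in plain Java. The row-key layout (zero-padded partition id, an underscore, then the curve index as 16 hex digits) and the names PartitionedCurveRanges, RowRange and rowKey are assumptions made up for this example; the Accumulo Recipes store may well encode its rows differently. Curve indexes are assumed to be non-negative.

```java
import java.util.ArrayList;
import java.util.List;

class PartitionedCurveRanges {

    /** An inclusive [startRow, endRow] pair of row keys. */
    static final class RowRange {
        final String startRow;
        final String endRow;

        RowRange(String startRow, String endRow) {
            this.startRow = startRow;
            this.endRow = endRow;
        }
    }

    /** Hypothetical row layout: "<partition>_<curve index>", padded so rows sort correctly. */
    static String rowKey(int partition, long curveIndex) {
        return String.format("%03d_%016x", partition, curveIndex);
    }

    /** One row range per (partition, curve interval) pair. */
    static List<RowRange> product(List<Integer> partitions, long[][] curveIntervals) {
        List<RowRange> ranges = new ArrayList<>(partitions.size() * curveIntervals.length);
        for (int partition : partitions) {
            for (long[] interval : curveIntervals) {
                // interval[0] and interval[1] are the inclusive bounds on the curve.
                ranges.add(new RowRange(rowKey(partition, interval[0]),
                                        rowKey(partition, interval[1])));
            }
        }
        return ranges;
    }
}
```

With P partitions and K curve intervals this yields P * K ranges, so 100 partitions and 50 intervals already means 5,000 ranges per query. In Accumulo those would typically be turned into Range objects and handed to a BatchScanner via setRanges; whether that many ranges is a problem depends on your data and cluster.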

Thanks,
-Russ

1:
https://github.com/calrissian/accumulo-recipes/blob/master/store/geospatial-store/src/main/java/org/calrissian/accumulorecipes/geospatialstore/impl/AccumuloGeoSpatialStore.java#L190
