Hi, Josh,

Thanks for your response. I think I should clarify something. When I said,
"the client would just scan (-inf, +inf)", I didn't mean that the net effect
would be to read all data. I just meant that my custom Iterator would seek()
to ranges which are a function of its configuration and its knowledge of the
partitioning scheme, just like the IntersectingIterator, except that instead
of its configuration defining a set of keyword terms, it would define a set
of disjoint intervals on a space-filling curve.
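To make that concrete, here's roughly the shape of the iterator I have in
mind. The row layout ("<2-digit partition>_<fixed-width hex Hilbert value>"),
the option names, and the class name below are placeholders for illustration,
not my actual schema:

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;

import org.apache.accumulo.core.data.ByteSequence;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.iterators.IteratorEnvironment;
import org.apache.accumulo.core.iterators.SortedKeyValueIterator;
import org.apache.accumulo.core.iterators.WrappingIterator;
import org.apache.hadoop.io.Text;

// Skip-scans rows of the (assumed) form "<2-digit partition>_<fixed-width hex Hilbert value>".
// The "intervals" option is a comma-separated list of lo:hi hex pairs, assumed sorted and
// disjoint; "numPartitions" is the partition count (<= 100 given the 2-digit padding).
public class HilbertSkipIterator extends WrappingIterator {

  private List<String[]> intervals; // each element is {lo, hi}
  private int numPartitions;
  private Range seekRange;
  private Collection<ByteSequence> seekCFs;
  private boolean seekInclusive;
  private boolean done;

  @Override
  public void init(SortedKeyValueIterator<Key,Value> source, Map<String,String> options,
      IteratorEnvironment env) throws IOException {
    super.init(source, options, env);
    numPartitions = Integer.parseInt(options.get("numPartitions"));
    intervals = new ArrayList<String[]>();
    for (String iv : options.get("intervals").split(","))
      intervals.add(iv.split(":"));
  }

  @Override
  public void seek(Range range, Collection<ByteSequence> cfs, boolean inclusive) throws IOException {
    seekRange = range;
    seekCFs = cfs;
    seekInclusive = inclusive;
    done = false;
    super.seek(range, cfs, inclusive);
    skipToMatch();
  }

  @Override
  public void next() throws IOException {
    super.next();
    skipToMatch();
  }

  @Override
  public boolean hasTop() {
    return !done && super.hasTop();
  }

  // Advance the source until its top key's Hilbert value falls inside one of the
  // configured intervals, re-seeking to jump over non-matching runs of rows.
  private void skipToMatch() throws IOException {
    while (getSource().hasTop()) {
      String row = getSource().getTopKey().getRow().toString();
      String partition = row.substring(0, 2);
      String hilbert = row.substring(3);
      String seekRow = null;
      for (String[] iv : intervals) {
        if (hilbert.compareTo(iv[0]) >= 0 && hilbert.compareTo(iv[1]) <= 0)
          return; // inside an interval: expose this key
        if (hilbert.compareTo(iv[0]) < 0) {
          seekRow = partition + "_" + iv[0]; // next interval, same partition
          break;
        }
      }
      if (seekRow == null) { // past the last interval: jump to the next partition
        int p = Integer.parseInt(partition) + 1;
        if (p >= numPartitions) {
          done = true;
          return;
        }
        seekRow = String.format("%02d_%s", p, intervals.get(0)[0]);
      }
      Key seekKey = new Key(new Text(seekRow));
      if (seekRange.afterEndKey(seekKey)) {
        done = true;
        return;
      }
      getSource().seek(new Range(seekKey, true, seekRange.getEndKey(),
          seekRange.isEndKeyInclusive()), seekCFs, seekInclusive);
    }
  }
}

The idea is that, like the IntersectingIterator, it only ever asks its source
for keys that could possibly match, so scanning (-inf, +inf) shouldn't mean
reading every key. The client would attach it per-query with an
IteratorSetting carrying the intervals.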
My understanding is that setting the scan range to (-inf,+inf) in this case
is just a way to tell Accumulo, "run this scan across all tablets".

-Russ

On Fri, Jan 16, 2015 at 12:17 PM, Josh Elser <[email protected]> wrote:

> Russ Weeks wrote:
>
>> Hey, all,
>>
>> I'm looking at switching my geospatial index to a partitioned index to
>> smooth out some hotspots. So for any query, I'll have a bunch of ranges
>> representing intervals on a Hilbert curve, plus a bunch of partitions,
>> each of which needs to be scanned for every range.
>>
>> The way that the (excellent!) Accumulo Recipes geospatial store
>> addresses this is to take the product of the partitions and the curve
>> intervals[1]. It seems like an alternative would be to encode the curve
>> intervals as a property of a custom iterator (I need one anyways to
>> filter out extraneous points from the search area) and then the client
>> would just scan (-inf, +inf), which I think is more typical when
>> querying a partitioned index?
>
> I'm no expert on storing geo-spatial data, but having to scan (-inf,+inf)
> on a table for a query is typically the reason people deal with the pain of
> hot-spotting, although it is the easiest to implement.
>
> If you can be "tricky" in how you're encoding your data in the row such
> that you can reduce the search space over your partitioned index, you can
> try to get the best of both worlds (avoid reading all data and still get a
> good distribution).
>
> Since that was extremely vague, here's an example: say you had a text
> index and wanted to look up the word "the" and your index had 100
> partitions, [0,99]. If you knew that it was only possible for "the" to show
> up on partitions 5, 27 and 83 (typically by use of some hashing function),
> you could drastically reduce your search space while still avoiding hot
> spotting on a single server.
>
>> Can anybody comment on which approach is preferred? Is it common to
>> expose the number of partitions in the index and the encoding of those
>> partitions to client code? Am I needlessly worried that taking the
>> product of the curve intervals and the partitions will produce too many
>> ranges?
>
> In the trivial sense, the client doesn't need to know the partitions and
> would just scan the entire index like you said earlier. You could also
> track the partitions that you have created in a separate table and the
> client could read that table to know ahead of time (if you have a reason to
> do so in your implementation).
>
> Depending on the amount of data you have, lots of ranges to check could
> take some time. YMMV
>
>> Thanks,
>> -Russ
>>
>> 1:
>> https://github.com/calrissian/accumulo-recipes/blob/master/store/geospatial-store/src/main/java/org/calrissian/accumulorecipes/geospatialstore/impl/AccumuloGeoSpatialStore.java#L190
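P.S. For comparison, the client-side cross-product approach I'm weighing
against the iterator would look something like this, using the same
placeholder partition encoding and row layout as above:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import org.apache.accumulo.core.client.BatchScanner;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;

// One Range per (partition, curve interval) pair, handed to a BatchScanner.
// Same placeholder row layout as above: "<2-digit partition>_<fixed-width hex Hilbert value>".
public class CrossProductQuery {

  static List<Range> buildRanges(int numPartitions, List<String[]> intervals) {
    List<Range> ranges = new ArrayList<Range>();
    for (int p = 0; p < numPartitions; p++) {
      String prefix = String.format("%02d_", p);
      for (String[] iv : intervals)
        ranges.add(new Range(prefix + iv[0], prefix + iv[1])); // iv = {lo, hi}
    }
    return ranges;
  }

  static void query(Connector conn, String table, List<Range> ranges) throws Exception {
    BatchScanner bs = conn.createBatchScanner(table, Authorizations.EMPTY, 8);
    try {
      bs.setRanges(ranges);
      for (Map.Entry<Key,Value> entry : bs) {
        // apply the fine-grained spatial filter to entry.getKey(), then use the result
        System.out.println(entry.getKey());
      }
    } finally {
      bs.close();
    }
  }
}

With 100 partitions and, say, a few hundred curve intervals per query, that's
tens of thousands of Ranges handed to the BatchScanner, which is what's
behind my question about the product getting too large.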
