Good to know, thanks Josh!

-Russ

On Fri, Jan 16, 2015 at 2:40 PM, Josh Elser <[email protected]> wrote:

> Thanks for the clarification, Russ, I assumed something of the sort was
> the case.
>
> It's important to remember that there is still benefit to "partition
> elimination". Doing the entire table, while it won't read all the data on
> the backend, you'll likely incur extra RPC to servers, file opens, iterator
> creation, etc. If your query is only going to match a few records, this can
> turn out to be a significant portion of your execution time. Something to
> keep in mind :)
>
> Russ Weeks wrote:
>
>> Hi, Josh,
>>
>> Thanks for your response. I think I should clarify something. When I
>> said, "the client would just scan (-inf, +inf)", I didn't mean that the
>> net effect would be to read all data. I just meant that my custom
>> Iterator would seek() to ranges which are a function of its
>> configuration and its knowledge of the partitioning scheme, just like
>> the IntersectingIterator. Except that instead of its configuration
>> defining a set of keyword terms, it would define a set of disjoint
>> intervals on a space-filling curve.
>>
>> My understanding is that setting the scan range to (-inf,+inf) in this
>> case is just a way to tell Accumulo, "run this scan across all tablets".
>>
>> -Russ
>>
>> On Fri, Jan 16, 2015 at 12:17 PM, Josh Elser <[email protected]> wrote:
>>
>>     Russ Weeks wrote:
>>
>>         Hey, all,
>>
>>         I'm looking at switching my geospatial index to a partitioned
>>         index to smooth out some hotspots. So for any query, I'll have
>>         a bunch of ranges representing intervals on a Hilbert curve,
>>         plus a bunch of partitions, each of which needs to be scanned
>>         for every range.
>>
>>         The way that the (excellent!) Accumulo Recipes geospatial store
>>         addresses this is to take the product of the partitions and the
>>         curve intervals[1]. It seems like an alternative would be to
>>         encode the curve intervals as a property of a custom iterator
>>         (I need one anyways to filter out extraneous points from the
>>         search area) and then the client would just scan (-inf, +inf),
>>         which I think is more typical when querying a partitioned index?
>>
>>     I'm no expert on storing geo-spatial data, but having to scan
>>     (-inf,+inf) on a table for a query is typically the reason people
>>     deal with the pain of hot-spotting, although it is the easiest to
>>     implement.
>>
>>     If you can be "tricky" in how you're encoding your data in the row
>>     such that you can reduce the search space over your partitioned
>>     index, you can try to get the best of both worlds (avoid reading all
>>     data and still get a good distribution).
>>
>>     Since that was extremely vague, here's an example: say you had a
>>     text index and wanted to look up the word "the" and your index had
>>     100 partitions, [0,99]. If you knew that it was only possible for
>>     "the" to show up on partitions 5, 27 and 83 (typically by use of
>>     some hashing function), you could drastically reduce your search
>>     space while still avoiding hot spotting on a single server.
>>
>>         Can anybody comment on which approach is preferred? Is it common
>>         to expose the number of partitions in the index and the encoding
>>         of those partitions to client code? Am I needlessly worried that
>>         taking the product of the curve intervals and the partitions
>>         will produce too many ranges?
>>
>>     In the trivial sense, the client doesn't need to know the partitions
>>     and would just scan the entire index like you said earlier. You
>>     could also track the partitions that you have created in a separate
>>     table and the client could read that table to know ahead of time (if
>>     you have a reason to do so in your implementation).
>>
>>     Depending on the amount of data you have, lots of ranges to check
>>     could take some time. YMMV
>>
>>         Thanks,
>>         -Russ
>>
>>         1: https://github.com/calrissian/accumulo-recipes/blob/master/store/geospatial-store/src/main/java/org/calrissian/accumulorecipes/geospatialstore/impl/AccumuloGeoSpatialStore.java#L190
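
For anyone reading this thread later, here is a rough sketch of the range arithmetic being discussed. It is plain Python, not Accumulo client code, and it is only an illustration under assumptions: the row encoding (`row_prefix`, a zero-padded partition id followed by `_`), the 10-digit curve index padding, and the hashing scheme in `candidate_partitions` are all hypothetical and do not come from Accumulo Recipes or from either poster's implementation. It shows how taking the product of partitions and curve intervals multiplies the number of ranges, and how limiting the candidate partitions (the "partition elimination" idea from the "the" example) shrinks that product.

```python
import hashlib
from itertools import product

NUM_PARTITIONS = 100  # mirrors the [0,99] example in the thread


def row_prefix(partition: int) -> str:
    """Hypothetical row encoding: zero-padded partition id plus '_'."""
    return f"{partition:02d}_"


def product_ranges(partitions, curve_intervals):
    """Build one (start_row, end_row) pair per (partition, interval) pair.

    curve_intervals is a list of (lo, hi) integer positions on the
    space-filling curve; each is repeated under every partition prefix.
    """
    return [
        (f"{row_prefix(p)}{lo:010d}", f"{row_prefix(p)}{hi:010d}")
        for p, (lo, hi) in product(partitions, curve_intervals)
    ]


def candidate_partitions(term: str, replicas: int = 3):
    """Hypothetical hashing scheme: a term can only land on a few partitions.

    Hashes the term with `replicas` different salts and maps each digest
    to a partition id, so a reader can recompute the same small set.
    """
    found = set()
    for i in range(replicas):
        digest = hashlib.sha256(f"{term}:{i}".encode()).digest()
        found.add(int.from_bytes(digest[:8], "big") % NUM_PARTITIONS)
    return sorted(found)


if __name__ == "__main__":
    intervals = [(100, 250), (4000, 4100)]  # made-up curve intervals

    # Full product: every interval under every partition.
    all_ranges = product_ranges(range(NUM_PARTITIONS), intervals)
    print(len(all_ranges), "ranges when every partition is scanned")

    # Reduced product: only the partitions the key could hash to.
    few = candidate_partitions("the")
    print(len(product_ranges(few, intervals)), "ranges for partitions", few)
```

With two intervals and 100 partitions the full product is 200 ranges; with at most three candidate partitions it is at most six. Whether a geospatial key can be hashed to a small partition set this way depends on how the index is written, which the thread leaves open.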
