Re: Geospatial + Partitioned Index

Russ Weeks Fri, 16 Jan 2015 16:58:16 -0800

Good to know, thanks Josh!
-Russ

On Fri, Jan 16, 2015 at 2:40 PM, Josh Elser <[email protected]> wrote:


> Thanks for the clarification, Russ, I assumed something of the sort was
> the case.
>
> It's important to remember that there is still benefit to "partition
> elimination". When you scan the entire table, even though you won't read
> all the data on the backend, you'll likely incur extra RPCs to servers,
> file opens, iterator creation, etc. If your query is only going to match
> a few records, this can turn out to be a significant portion of your
> execution time. Something to keep in mind :)
>
> Russ Weeks wrote:
>
>> Hi, Josh,
>>
>> Thanks for your response. I think I should clarify something. When I
>> said, "the client would just scan (-inf, +inf)", I didn't mean that the
>> net effect would be to read all data. I just meant that my custom
>> Iterator would seek() to ranges which are a function of its
>> configuration and its knowledge of the partitioning scheme, just like
>> the IntersectingIterator. Except that instead of its configuration
>> defining a set of keyword terms, it would define a set of disjoint
>> intervals on a space-filling curve.
>>
>> My understanding is that setting the scan range to (-inf,+inf) in this
>> case is just a way to tell Accumulo, "run this scan across all tablets".
>>
>> -Russ
>>
>> On Fri, Jan 16, 2015 at 12:17 PM, Josh Elser <[email protected]> wrote:
>>
>>     Russ Weeks wrote:
>>
>>         Hey, all,
>>
>>         I'm looking at switching my geospatial index to a partitioned
>>         index to smooth out some hotspots. So for any query, I'll have a
>>         bunch of ranges representing intervals on a Hilbert curve, plus a
>>         bunch of partitions, each of which needs to be scanned for every
>>         range.
>>
>>         The way that the (excellent!) Accumulo Recipes geospatial store
>>         addresses this is to take the product of the partitions and the
>>         curve intervals[1]. It seems like an alternative would be to
>>         encode the curve intervals as a property of a custom iterator (I
>>         need one anyways to filter out extraneous points from the search
>>         area) and then the client would just scan (-inf, +inf), which I
>>         think is more typical when querying a partitioned index?
>>
>>
>>     I'm no expert on storing geo-spatial data, but avoiding a
>>     (-inf,+inf) scan on a table for every query is typically the reason
>>     people put up with the pain of hot-spotting, even though the full
>>     scan is the easiest to implement.
>>
>>     If you can be "tricky" in how you're encoding your data in the row
>>     such that you can reduce the search space over your partitioned
>>     index, you can try to get the best of both worlds (avoid reading all
>>     data and still get a good distribution).
>>
>>     Since that was extremely vague, here's an example: say you had a
>>     text index and wanted to look up the word "the" and your index had
>>     100 partitions, [0,99]. If you knew that it was only possible for
>>     "the" to show up on partitions 5, 27 and 83 (typically by use of
>>     some hashing function), you could drastically reduce your search
>>     space while still avoiding hot spotting on a single server.
>>
>>         Can anybody comment on which approach is preferred? Is it common
>>         to expose the number of partitions in the index and the encoding
>>         of those partitions to client code? Am I needlessly worried that
>>         taking the product of the curve intervals and the partitions
>>         will produce too many ranges?
>>
>>
>>     In the trivial sense, the client doesn't need to know the partitions
>>     and would just scan the entire index like you said earlier. You
>>     could also track the partitions that you have created in a separate
>>     table and the client could read that table to know ahead of time (if
>>     you have a reason to do so in your implementation).
>>
>>     Depending on the amount of data you have, lots of ranges to check
>>     could take some time. YMMV
>>
>>
>>         Thanks,
>>         -Russ
>>
>>         1:
>>         https://github.com/calrissian/accumulo-recipes/blob/master/store/geospatial-store/src/main/java/org/calrissian/accumulorecipes/geospatialstore/impl/AccumuloGeoSpatialStore.java#L190
>>
>>
>>
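For readers following the thread, here is a rough sketch of the two
range-planning ideas discussed above: taking the product of partitions and
curve intervals, and using a hash to limit which partitions a term can land
on. It is plain Python and does not use the Accumulo API. The row layout
("<partition>_<curve index>"), the zero-padding widths, the fanout of 3, and
the function names product_ranges and candidate_partitions are all
assumptions made for illustration; they are not taken from Accumulo Recipes
or from anyone's actual schema.

```python
import hashlib
from itertools import product


def product_ranges(partitions, curve_intervals, width=2, curve_digits=8):
    """Build one (start_row, end_row) pair per (partition, interval) combination.

    Assumed row layout: zero-padded partition id, an underscore, then the
    zero-padded position on the space-filling curve.
    """
    ranges = []
    for partition, (lo, hi) in product(partitions, curve_intervals):
        prefix = f"{partition:0{width}d}"
        start_row = f"{prefix}_{lo:0{curve_digits}d}"
        end_row = f"{prefix}_{hi:0{curve_digits}d}"
        ranges.append((start_row, end_row))
    return ranges


def candidate_partitions(term, num_partitions=100, fanout=3):
    """Return the partitions a term could have been written to.

    Assumes the writer picks one of `fanout` salted hashes of the term, so a
    reader only has to visit those partitions instead of all of them.
    """
    parts = set()
    for salt in range(fanout):
        digest = hashlib.sha256(f"{salt}:{term}".encode("utf-8")).digest()
        parts.add(int.from_bytes(digest[:8], "big") % num_partitions)
    return sorted(parts)


if __name__ == "__main__":
    # 4 partitions x 2 curve intervals gives 8 row ranges to scan.
    for start_row, end_row in product_ranges(range(4), [(10, 20), (35, 40)]):
        print(start_row, end_row)
    # At most `fanout` partitions need to be visited for this term.
    print(candidate_partitions("the"))
```

The number of ranges grows as partitions times intervals, so 100 partitions
and 50 curve intervals would already mean 5000 ranges per query. That is the
cost Russ is asking about, and the hashing idea is one way to shrink the
partition side of that product.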
