Re: Geospatial + Partitioned Index

Josh Elser Fri, 16 Jan 2015 14:42:21 -0800

Thanks for the clarification, Russ, I assumed something of the sort was the case.

It's important to remember that there is still benefit to "partition elimination". Scanning the entire table won't read all the data on the backend, but you'll likely incur extra RPCs to servers, file opens, iterator creation, etc. If your query is only going to match a few records, this can turn out to be a significant portion of your execution time. Something to keep in mind :)


Russ Weeks wrote:

Hi, Josh,

Thanks for your response. I think I should clarify something. When I
said, "the client would just scan (-inf, +inf)", I didn't mean that the
net effect would be to read all data. I just meant that my custom
Iterator would seek() to ranges which are a function of its
configuration and its knowledge of the partitioning scheme, just like
the IntersectingIterator. Except that instead of its configuration
defining a set of keyword terms, it would define a set of disjoint
intervals on a space-filling curve.

My understanding is that setting the scan range to (-inf,+inf) in this
case is just a way to tell Accumulo, "run this scan across all tablets".
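
A minimal sketch of the seeking behaviour described above, assuming each partition holds its curve positions in sorted order and that the configured intervals are inclusive. It is plain Python, not the Accumulo iterator API: `scan_partition` is a hypothetical name, and `bisect_left` only stands in for what a real iterator's seek() would do.

```python
from bisect import bisect_left


def scan_partition(sorted_positions, intervals):
    """Return (hits, seeks) for one partition.

    sorted_positions: curve positions stored in this partition, ascending.
    intervals: disjoint, inclusive (lo, hi) pairs on the curve (assumed layout).
    """
    hits = []
    seeks = 0
    for lo, hi in sorted(intervals):
        # Jump straight to the first stored position >= lo,
        # skipping everything in the gap before this interval.
        i = bisect_left(sorted_positions, lo)
        seeks += 1
        # Read forward only while we are still inside the interval.
        while i < len(sorted_positions) and sorted_positions[i] <= hi:
            hits.append(sorted_positions[i])
            i += 1
    return hits, seeks


def scan_all_partitions(partitions, intervals):
    """Run the same interval set against every partition, as a full-range scan would."""
    return {name: scan_partition(rows, intervals) for name, rows in partitions.items()}
```

Every partition is visited, but within each one only the entries inside the configured intervals are read; the gaps between intervals are skipped by seeking.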

-Russ

On Fri, Jan 16, 2015 at 12:17 PM, Josh Elser wrote:

    Russ Weeks wrote:

        Hey, all,

        I'm looking at switching my geospatial index to a partitioned index
        to smooth out some hotspots. So for any query, I'll have a bunch of
        ranges representing intervals on a Hilbert curve, plus a bunch of
        partitions, each of which needs to be scanned for every range.

        The way that the (excellent!) Accumulo Recipes geospatial store
        addresses this is to take the product of the partitions and the
        curve intervals[1]. It seems like an alternative would be to encode
        the curve intervals as a property of a custom iterator (I need one
        anyways to filter out extraneous points from the search area) and
        then the client would just scan (-inf, +inf), which I think is more
        typical when querying a partitioned index?
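
A rough sketch contrasting the two approaches, under stated assumptions: rows are laid out as `<partition>_<16-digit hex curve position>`, and the iterator option is a single string of `lo-hi` pairs joined by `;`. Both the row layout and the option encoding are hypothetical, chosen only for illustration, and none of the function names come from Accumulo or Accumulo Recipes.

```python
def product_ranges(partitions, intervals):
    """Approach 1: one (start_row, end_row) range per (partition, interval) pair."""
    return [
        (f"{p}_{lo:016x}", f"{p}_{hi:016x}")
        for p in partitions
        for lo, hi in intervals
    ]


def encode_intervals(intervals):
    """Approach 2: pack the intervals into one string for an iterator option."""
    return ";".join(f"{lo:x}-{hi:x}" for lo, hi in intervals)


def decode_intervals(option):
    """Inverse of encode_intervals, as the server-side iterator would run it."""
    pairs = []
    for part in option.split(";"):
        if part:
            lo, hi = part.split("-")
            pairs.append((int(lo, 16), int(hi, 16)))
    return pairs
```

With P partitions and K intervals, the first approach hands the client P * K ranges, while the second sends one full-table range plus a single option string whose size grows only with K.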


    I'm no expert on storing geo-spatial data, but having to scan
    (-inf,+inf) on a table for a query is typically the reason people
    deal with the pain of hot-spotting, although it is the easiest to
    implement.

    If you can be "tricky" in how you're encoding your data in the row
    such that you can reduce the search space over your partitioned
    index, you can try to get the best of both worlds (avoid reading all
    data and still get a good distribution).

    Since that was extremely vague, here's an example: say you had a
    text index and wanted to look up the word "the" and your index had
    100 partitions, [0,99]. If you knew that it was only possible for
    "the" to show up on partitions 5, 27 and 83 (typically by use of
    some hashing function), you could drastically reduce your search
    space while still avoiding hot spotting on a single server.
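
One way the hashing idea above could be sketched, assuming a term-routed layout in which each term may only be written to a small, hash-derived set of partitions. The partition count matches the example (100); the three candidates per term, the `<partition>_<term>` row prefix and all function names are assumptions made for illustration, not a description of any existing index.

```python
import hashlib

NUM_PARTITIONS = 100      # partitions [0, 99], as in the example above
CANDIDATES_PER_TERM = 3   # assumed fan-out per term


def _stable_hash(text):
    """Hash that is stable across processes (unlike Python's built-in hash())."""
    digest = hashlib.sha1(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def candidate_partitions(term):
    """The only partitions this term may live on (at most CANDIDATES_PER_TERM)."""
    return sorted(
        {_stable_hash(f"{term}#{i}") % NUM_PARTITIONS for i in range(CANDIDATES_PER_TERM)}
    )


def partition_for_write(term, doc_id):
    """Writer side: spread a term's entries across its candidate partitions."""
    candidates = candidate_partitions(term)
    return candidates[_stable_hash(doc_id) % len(candidates)]


def row_prefixes_for_query(term):
    """Reader side: scan only the candidate partitions, not all 100."""
    return [f"{p:02d}_{term}" for p in candidate_partitions(term)]
```

A lookup then touches at most three partitions instead of all 100, while the writes for a popular term are still spread over more than one server.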

        Can anybody comment on which approach is preferred? Is it common to
        expose the number of partitions in the index and the encoding of
        those partitions to client code? Am I needlessly worried that taking
        the product of the curve intervals and the partitions will produce
        too many ranges?


    In the trivial sense, the client doesn't need to know the partitions
    and would just scan the entire index like you said earlier. You
    could also track the partitions that you have created in a separate
    table and the client could read that table to know ahead of time (if
    you have a reason to do so in your implementation).

    Depending on the amount of data you have, lots of ranges to check
    could take some time. YMMV


        Thanks,
        -Russ

        1:
        https://github.com/calrissian/accumulo-recipes/blob/master/store/geospatial-store/src/main/java/org/calrissian/accumulorecipes/geospatialstore/impl/AccumuloGeoSpatialStore.java#L190
