On Calrissian, we had once considered making a lexicoder for lat/long that transformed the two dimensions (lat/lon) down into a geohash based on the z-curve.
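The core of that encoding is just bit interleaving. As an illustration only (this is a from-scratch sketch, not actual Calrissian code, and the class/method names are made up), it would look roughly like:

    // Sketch: map lat/lon onto a single z-value by normalizing each dimension
    // to an unsigned integer and interleaving the bits.
    public class ZCurveSketch {

        // Normalize a value in [min, max] to an unsigned integer of 'bits' bits.
        static long normalize(double value, double min, double max, int bits) {
            double scaled = (value - min) / (max - min);   // 0.0 .. 1.0
            long buckets = 1L << bits;                     // number of cells in this dimension
            return Math.min((long) (scaled * buckets), buckets - 1);
        }

        // Interleave two 'bits'-bit values: x goes to even bit positions, y to odd.
        static long interleave(long x, long y, int bits) {
            long z = 0;
            for (int i = 0; i < bits; i++) {
                z |= ((x >> i) & 1L) << (2 * i);
                z |= ((y >> i) & 1L) << (2 * i + 1);
            }
            return z;
        }

        // lat in [-90, 90], lon in [-180, 180], bitsPerDimension <= 31
        public static long encode(double lat, double lon, int bitsPerDimension) {
            long y = normalize(lat, -90.0, 90.0, bitsPerDimension);
            long x = normalize(lon, -180.0, 180.0, bitsPerDimension);
            return interleave(x, y, bitsPerDimension);
        }
    }

The resulting long would then still have to be encoded so that its sorted byte order matches its numeric order (the same job the existing long lexicoders do) before it is useful as a key.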
The reason we decided against a first-class data type for this is exactly the same reason that Anthony brings up in his previous comment: the geohash only makes sense as a query tool when you have a good way to generate the (possibly many) non-sequential ranges you would need in order to find all the overlapping regions, with as few false positives as possible. (I've put a rough sketch of that decomposition idea at the bottom of this mail.)

On Mon, Jul 28, 2014 at 2:00 PM, Jared Winick <[email protected]> wrote:

> As several people have commented, a single range for a query can produce a lot of false positives that need to be filtered out. I made this visualization a while back that lets you interactively (click-drag a bounding box) see this behavior:
>
> http://bl.ocks.org/jaredwinick/5073432
>
> On Sun, Jul 27, 2014 at 2:15 PM, Anthony Fox <[email protected]> wrote:
>
>> My first thought was just something simple for a first pass - lat/lon -> a single lexicographic dimension - as it would cover most of the basic use cases. Precision (number of bits in the encoding) could be an arg or a config variable. For WITHIN/INTERSECTS topological predicates, we need to decompose the query geometry into the (possibly/probably non-contiguous) 1D ranges that cover the region in question. GeoMesa has an algorithm to quickly perform this decomposition: it computes the minimum number of geohash prefixes, at different resolutions, that fully cover the query polygon. Basically, it recursively traverses through levels of geohash resolution, prioritizing rectangles that intersect the region and discarding non-intersecting rectangles at the lowest precisions, to produce a sequence of ranges to scan over. Fully contained rectangles are discovered at their lowest resolution, at which point the algorithm pops the stack and searches the next prioritized prefix. I think something like this would definitely need to be ported over and included in a lexicoder implementation to make it useful. Also, rather than materialize the entire set of ranges in memory, we can either return a lazy iterator of prefixes that can be fed into a scanner in batches, or we can have a short-circuit config that tunes the amount of slop that's tolerable and cuts off the traversal at a certain level of precision. GeoMesa uses something like the former on attribute indexes to coordinate parallel scanners on separate index and record tables.
>>
>> Thoughts? I'm inclined to keep the implementation to the bare minimum necessary for the basic use cases (lat/lon and bbox queries), though I do think a general dimensionality-reducing lexicoder would be very useful.
>>
>> On Fri, Jul 25, 2014 at 6:51 PM, Chris Bennight <[email protected]> wrote:
>>
>>> A couple of related issues come up when considering implementing a dimensionality-reducing encoding -- I just want to toss those out to see what people think the interface might look like.
>>>
>>> There are a couple of aspects that could be brought in here, but let's keep it simple and consider the original question: (lat/lon) -> number.
>>>
>>> --Desired precision of the binning process
>>> The more bits we add to the z-curve, the more precise our comparison - i.e. a 63-bit key would have more "locations" to sort by than a 24-bit key.
>>> Would you see a reasonable default getting picked, this made user-configurable, or both? (i.e. default to a value, extended options with a new constructor?)
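For what it's worth on the precision question above: the bit budget maps pretty directly onto cell size, since each extra bit per dimension halves the cell. A back-of-the-envelope sketch, assuming simple uniform binning of the WGS84 bounds (illustrative numbers and names only):

    // Sketch: how bits-per-dimension translates into cell size.
    class PrecisionSketch {
        static double lonCellDegrees(int bitsPerDim) { return 360.0 / (1L << bitsPerDim); }
        static double latCellDegrees(int bitsPerDim) { return 180.0 / (1L << bitsPerDim); }

        public static void main(String[] args) {
            // 31 bits/dim (62-bit z-value): ~1.7e-7 degrees of longitude, about 2 cm at the equator
            System.out.println(lonCellDegrees(31));
            // 12 bits/dim (24-bit z-value): ~0.09 degrees, roughly 10 km
            System.out.println(lonCellDegrees(12));
        }
    }

So the choice of bit count mostly affects how much post-filtering a consumer has to do.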
>>>
>>> --Semantics for turning two lat/long pairs into a range
>>> I'm extrapolating here, but the only reason I see that locality matters is if we want to preserve locality for range searches. The internal implementation of the encoding/lexicoding process is going to directly impact the implementation of the range query.
>>> Now sure, someone could encode the lower-left point, encode the upper-right point, and construct a range out of that to pass to a scan, but that's going to be wildly inefficient in most cases. See: https://dl.dropboxusercontent.com/u/6649380/bbox.png
>>> If we just lexicode the lower left and upper right, we traverse across the entire curve - hitting lots of areas that aren't actually in the original range.
>>> Instead, we can turn a single 2D range into a set of 1D ranges. There is some potential tuning here, as the algorithm has a tradeoff between the time to compute the ranges (and the number of ranges) vs. "slop" (the inclusion of ranges which aren't actually in the original query).
>>> Would you see, perhaps, a static method on the z-curve lexicoder that returns a series of ranges based on an input window? Some other mechanism? And in the case of "slop" - would we just document that the ranges could actually include values not expected, or would we always fully decompose?
>>>
>>> ------------------
>>>
>>> The other, "making it more complicated" questions would probably revolve around generalizing this to a multi-dimensional lexicoder. I would expect the WGS84 (lat/long) encoder would just be a 2D instance with -180/+180, -90/+90 bounds. There are probably completely different cases where it would be useful to have a locality-sensitive hash.
>>> But the big caveat is that this requires more methods - I now need a way of defining a range for each of the dimensions (which before were defined as lat/lon), I need an interface/class to pass those dimensions to the encoder, and I should be able to use those same definitions to decode my values on the way out.
>>>
>>> I'm not trying to make a mountain out of a molehill - one could definitely pick some sane defaults, document behaviour, and put in a WGS84-specific implementation. I'm just thinking about what the minimum viable bit is that's needed for this to be useful - and I suspect that also includes the range decomposition piece, as I suspect *everyone* would really need that (if they cared about locality in the first place).
>>>
>>> My main question here would be: to what extent would you see this going? A one-off for WGS84, or a more generic multi-dimensional lexicoder? And in either case, what would the thoughts/consensus be on the implementation of the additional methods (and possibly classes) required?
>>>
>>> On Fri, Jul 25, 2014 at 12:37 PM, Sean Busbey <[email protected]> wrote:
>>>
>>>> On Fri, Jul 25, 2014 at 8:45 AM, Anthony F <[email protected]> wrote:
>>>>
>>>>> GeoMesa plans to update to 1.6 as soon as we get past a 1.0 RELEASE (which is waiting on approval by the Eclipse Foundation's IP review process and should happen in a month or so). I think we could break out a z-curve lexicoder and contribute it before then, though. I'll poke around at the lexicoder stuff in 1.6 and see what it would take to use that interface now.
>>>>
>>>> That sounds great Anthony. Let us know if we can help with anything!
>>>>
>>>> --
>>>> Sean
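Coming back to the range-decomposition piece, since that is really the crux: what I would picture is a recursive quadrant traversal along these lines. This is only a rough sketch of the idea (not the GeoMesa algorithm, and all names here are made up), with a maxDepth knob to cut off the traversal and accept some slop:

    import java.util.ArrayList;
    import java.util.List;

    // Sketch: decompose a lat/lon bounding box into z-value ranges by recursively
    // splitting the world into quadrants. maxDepth caps the recursion, trading
    // fewer, coarser ranges (more slop) against tighter coverage.
    class ZRangeSketch {

        static final int BITS_PER_DIM = 31;   // pairs with a 62-bit z-value

        static class Range {
            final long start, end;            // inclusive z-value bounds
            Range(long start, long end) { this.start = start; this.end = end; }
        }

        final double qLatMin, qLatMax, qLonMin, qLonMax;   // the query box
        final int maxDepth;                                // must be <= BITS_PER_DIM
        final List<Range> ranges = new ArrayList<>();

        ZRangeSketch(double latMin, double latMax, double lonMin, double lonMax, int maxDepth) {
            this.qLatMin = latMin; this.qLatMax = latMax;
            this.qLonMin = lonMin; this.qLonMax = lonMax;
            this.maxDepth = maxDepth;
        }

        List<Range> decompose() {
            visit(-90, 90, -180, 180, 0L, 0);
            return ranges;
        }

        private void visit(double latMin, double latMax, double lonMin, double lonMax,
                           long prefix, int depth) {
            // Discard quadrants that miss the query box entirely.
            if (latMax < qLatMin || latMin > qLatMax || lonMax < qLonMin || lonMin > qLonMax) return;

            boolean contained = latMin >= qLatMin && latMax <= qLatMax
                             && lonMin >= qLonMin && lonMax <= qLonMax;
            if (contained || depth == maxDepth) {
                // Emit the contiguous z-range this quadrant covers; at maxDepth a
                // partially overlapping quadrant is emitted whole - that is the slop.
                int shift = 2 * (BITS_PER_DIM - depth);
                ranges.add(new Range(prefix << shift, ((prefix + 1) << shift) - 1));
                return;
            }
            double midLat = (latMin + latMax) / 2, midLon = (lonMin + lonMax) / 2;
            // Child index puts the lat bit in the high position and the lon bit in the
            // low position, matching a z-value that interleaves lon into even bits.
            visit(latMin, midLat, lonMin, midLon, (prefix << 2),     depth + 1);
            visit(latMin, midLat, midLon, lonMax, (prefix << 2) | 1, depth + 1);
            visit(midLat, latMax, lonMin, midLon, (prefix << 2) | 2, depth + 1);
            visit(midLat, latMax, midLon, lonMax, (prefix << 2) | 3, depth + 1);
        }
    }

Something like new ZRangeSketch(38.0, 39.0, -78.0, -77.0, 8).decompose() would hand back a modest list of z-ranges (adjacent ones could be merged); each would still need to be lexicoded into byte ranges before being fed to a batch scanner, and raising maxDepth trades more, tighter ranges for less slop. Whether that lives as a static method on the lexicoder or as a lazy iterator of prefixes, per Anthony's suggestion, is the interface question.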
