On Nov 17, 2009, at 10:35 AM, Yonik Seeley wrote:

> On Mon, Nov 16, 2009 at 9:20 AM, Yonik Seeley
> <yo...@lucidimagination.com> wrote:
>> On Mon, Nov 16, 2009 at 8:23 AM, Grant Ingersoll <gsing...@apache.org> wrote:
>>> One of the other things I think we are going to need is a cache for 
>>> functions that are used this way.  For instance, in the geo case, it is 
>>> likely that we would both filter and score by distance,
>> 
>> Filtering (bounding box) should be a separate, more efficient
>> operation than calculating distance, so I don't think any sort of
>> generic cache is needed for geo.
> 
> Actually, you're right.
> I was thinking of filtering by a bounding box, but people will also
> want to filter by a radius (which should presumably use bounding boxes
> first to limit the number of points that we calculate the distance
> for).

Yep, I think frange actually works quite nicely for this case.

> 
> If someone then also sorts, the distance calculation won't be reused.
> I don't know a good way around that currently... a full cache would be
> pretty expensive memory-wise.

Right, we don't want a full cache that lives on like the other caches.  We 
could likely either attach the info to the document or put a Map on the 
Request object itself.  Going back to my servlet days, I often just used 
ServletRequest attributes for this kind of thing, or some other 
request-specific context.
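
A minimal sketch of the request-scoped idea, just to make it concrete.  The 
class and method names here (RequestDistanceCache, distance) are hypothetical 
and not an existing Solr API; the assumption is that one instance is created 
per request and thrown away with it, so nothing lives on like the other caches:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntToDoubleFunction;

/** Hypothetical per-request memo of distances, keyed by document id. */
final class RequestDistanceCache {
    private final Map<Integer, Double> distances = new HashMap<>();
    private final IntToDoubleFunction distanceFn;
    private int computations = 0;

    /** distanceFn stands in for whatever actually computes the distance. */
    RequestDistanceCache(IntToDoubleFunction distanceFn) {
        this.distanceFn = distanceFn;
    }

    /** Computes the distance on first use, then reuses it for this request. */
    double distance(int docId) {
        Double cached = distances.get(docId);
        if (cached == null) {
            cached = distanceFn.applyAsDouble(docId);
            computations++;
            distances.put(docId, cached);
        }
        return cached;
    }

    /** Number of times the underlying function was actually invoked. */
    int computations() {
        return computations;
    }
}
```

The filter, the scorer and the sort would all call distance(docId) on the same 
instance, so each document pays for the calculation at most once per request.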

> 
> Actually, perhaps there wouldn't be too much wasted calculation after all?
> Seems like additional optimizations could limit how many points need
> distance calculated for filtering?
> 
> Consider a bounding box for a particular radius... one could also find
> a box that lies completely within that radius.  Only points inside the
> bigger box but outside the smaller box need to have a distance
> calculated.
> 
> Also, if one is sorting by distance anyway, a straight bounding box
> filter may be sufficient (i.e. users should have the option of the
> cheaper or more expensive filter).
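
The two-box idea above can be sketched roughly as follows.  This assumes flat 
(planar) x/y coordinates to keep it short; real lat/lon data would need a 
proper great-circle distance and box calculation.  RadiusFilter and its 
methods are hypothetical names, not existing code:

```java
/** Hypothetical radius filter that skips the distance calc where it can. */
final class RadiusFilter {
    private final double cx, cy, radius;
    private final double innerHalf; // half-side of the box inscribed in the circle
    private int distanceCalcs = 0;

    RadiusFilter(double cx, double cy, double radius) {
        this.cx = cx;
        this.cy = cy;
        this.radius = radius;
        // A square whose corners touch the circle has half-side r / sqrt(2).
        this.innerHalf = radius / Math.sqrt(2.0);
    }

    boolean accepts(double x, double y) {
        double dx = Math.abs(x - cx);
        double dy = Math.abs(y - cy);
        if (dx > radius || dy > radius) {
            return false; // outside the outer bounding box: cannot match
        }
        if (dx <= innerHalf && dy <= innerHalf) {
            return true;  // inside the inner box: must match
        }
        // Only points between the two boxes need the real distance check.
        distanceCalcs++;
        return dx * dx + dy * dy <= radius * radius;
    }

    /** How many points actually needed a distance calculation. */
    int distanceCalcs() {
        return distanceCalcs;
    }
}
```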


It's not just sorting, though.  You could also want that function calculation 
for faceting and scoring.

In a real spatial application, I think it is fairly common to ask for all of 
the following in one request:
1. Filter by distance/bounding box
2. Within the box, boost the score based on distance from center point and 
return the score
3. Return the distance from the center point as a field value 
(pseudo-fields)
4. Facet by function (i.e. distance) and put the docs in buckets (all docs in 
walking distance, cycling distance, driving distance, everything else) 
5. Sort by distance (in many cases, this one and #2 will be mutually exclusive, 
but not in all cases)
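
As a rough illustration of #4, here is a sketch of bucketing already-computed 
distances into facet counts.  The cutoffs (1, 5 and 30 miles) and the names 
DistanceFacets, bucket and facet are made up for the example:

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical facet-by-distance helper; bucket limits are illustrative. */
final class DistanceFacets {
    /** Maps a distance in miles to a bucket label (assumed cutoffs). */
    static String bucket(double miles) {
        if (miles <= 1.0) return "walking";
        if (miles <= 5.0) return "cycling";
        if (miles <= 30.0) return "driving";
        return "everything else";
    }

    /** Counts docs per bucket from distances that were computed once. */
    static Map<String, Integer> facet(double[] distances) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (double d : distances) {
            counts.merge(bucket(d), 1, Integer::sum);
        }
        return counts;
    }
}
```

The same array of distances could then feed the sort and the pseudo-field, 
which is the whole point of not recalculating.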

If you take a very dense geographical area, like Manhattan, you could still 
have hundreds of thousands, if not millions, of points all within a radius of 
10 or 20 miles, so calculating that distance no more than once per point is 
going to be paramount to success.

-Grant

