Re: Improving performance for SOLR geo queries?

2012-02-14 Thread Matthias Käppler
hey thanks all for the suggestions, didn't have time to look into them
yet as we're feature-sprinting for MWC, but will report back with some
feedback over the next weeks (we will have a few more performance
sprints in March)

Best,
Matthias

On Mon, Feb 13, 2012 at 2:32 AM, Yonik Seeley
yo...@lucidimagination.com wrote:
 On Thu, Feb 9, 2012 at 1:46 PM, Yonik Seeley yo...@lucidimagination.com 
 wrote:
 One way to speed up numeric range queries (at the cost of increased
 index size) is to lower the precisionStep.  You could try changing
 this from 8 to 4 and then re-indexing to see how that affects your
 query speed.

 Your issue, and the fact that I had been looking at the post-filtering
 code again for another client, reminded me that I had been planning on
 implementing post-filtering for spatial.  It's now checked into trunk.

 If you have the ability to use trunk, you can add a high cost (like
 cost=200) along with cache=false to trigger it.

 More details here:
 http://www.lucidimagination.com/blog/2012/02/10/advanced-filter-caching-in-solr/

 -Yonik
 lucidimagination.com



-- 
Matthias Käppler
Lead Developer API & Mobile

Qype GmbH
Großer Burstah 50-52
20457 Hamburg
Telephone: +49 (0)40 - 219 019 2 - 160
Skype: m_kaeppler
Email: matth...@qype.com

Managing Director: Ian Brotherston
Amtsgericht Hamburg
HRB 95913



Re: Improving performance for SOLR geo queries?

2012-02-14 Thread Bill Bell
Can we get this backported to 3.x?

Bill Bell
Sent from mobile


On Feb 14, 2012, at 3:45 AM, Matthias Käppler matth...@qype.com wrote:

 hey thanks all for the suggestions, didn't have time to look into them
 yet as we're feature-sprinting for MWC, but will report back with some
 feedback over the next weeks (we will have a few more performance
 sprints in March)
 
 Best,
 Matthias
 
 On Mon, Feb 13, 2012 at 2:32 AM, Yonik Seeley
 yo...@lucidimagination.com wrote:
 On Thu, Feb 9, 2012 at 1:46 PM, Yonik Seeley yo...@lucidimagination.com 
 wrote:
 One way to speed up numeric range queries (at the cost of increased
 index size) is to lower the precisionStep.  You could try changing
 this from 8 to 4 and then re-indexing to see how that affects your
 query speed.
 
 Your issue, and the fact that I had been looking at the post-filtering
 code again for another client, reminded me that I had been planning on
 implementing post-filtering for spatial.  It's now checked into trunk.
 
 If you have the ability to use trunk, you can add a high cost (like
 cost=200) along with cache=false to trigger it.
 
 More details here:
 http://www.lucidimagination.com/blog/2012/02/10/advanced-filter-caching-in-solr/
 
 -Yonik
 lucidimagination.com
 
 
 


Re: Improving performance for SOLR geo queries?

2012-02-12 Thread David Smiley (@MITRE.org)
Matthias,

For what it's worth, someone using LSP (Lucene Spatial Playground)'s
RecursivePrefixTreeFieldType (which uses geohash encoding by default) quoted
a 2x performance increase over Solr's built-in LatLonType.  To boost the
performance further, there are a couple parameters you can tweak.  One is
distPrec which defaults to 0.025 and reflects an acceptable imprecision of
the border of the query shape -- 2.5% of the approximate radius.  Maybe
you'd be comfortable with 5% or 10% or somewhere in-between.  This figure
can be specified at query time.  There is another internal number,
prefixGridScanLevel, that I've yet to make configurable at the Solr
level.  If you are interested in exploring LSP further, go check it out
and look at the README.  It's a bit skimpy right now, but I'm happy to
elaborate on the performance tuning.

FYI, don't bother using the GeoHashField field type built into Solr.  The
search algorithm for this particular field is a dumb brute-force scan;
it gives geohashes a bad name.

~ David

-
 Author: http://www.packtpub.com/apache-solr-3-enterprise-search-server/book


Re: Improving performance for SOLR geo queries?

2012-02-12 Thread Yonik Seeley
On Thu, Feb 9, 2012 at 1:46 PM, Yonik Seeley yo...@lucidimagination.com wrote:
 One way to speed up numeric range queries (at the cost of increased
 index size) is to lower the precisionStep.  You could try changing
 this from 8 to 4 and then re-indexing to see how that affects your
 query speed.

Your issue, and the fact that I had been looking at the post-filtering
code again for another client, reminded me that I had been planning on
implementing post-filtering for spatial.  It's now checked into trunk.

If you have the ability to use trunk, you can add a high cost (like
cost=200) along with cache=false to trigger it.
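
For example, taking the bbox filter from the debug output earlier in
this thread, the post-filtered version would look something like:

fq={!bbox cache=false cost=200 d=50 sfield=location_ll pt=54.1434,-0.452322}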

More details here:
http://www.lucidimagination.com/blog/2012/02/10/advanced-filter-caching-in-solr/

-Yonik
lucidimagination.com


Re: Improving performance for SOLR geo queries?

2012-02-09 Thread Matthias Käppler
Hi Ryan,

 I'm trying to understand how you have your data indexed so we can give
 reasonable direction.

 What field type are you using for your locations?  Is it using the
 solr spatial field types?  What do you see when you look at the debug
 information from debugQuery=true?

We query against a LatLonType field using plain latitudes and longitudes
and the bbox function. We send the bbox filter in an uncached filter
query (we had to do this to bring the eviction rate in the filter cache
down; we were having problems with that). Our filter cache is set up
as follows:

Concurrent LRU Cache(maxSize=32768, initialSize=8192, minSize=29491,
acceptableSize=31129, cleanupThread=false, autowarmCount=8192,
regenerator=org.apache.solr.search.SolrIndexSearcher$2@2fd1fc5c)
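
(For reference, that instance corresponds to a filterCache entry in
solrconfig.xml along these lines -- a sketch, assuming the stock
FastLRUCache, which is the implementation backed by ConcurrentLRUCache:)

   <filterCache class="solr.FastLRUCache"
                size="32768"
                initialSize="8192"
                autowarmCount="8192"/>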

We restarted the slaves only 30 minutes ago, so these values don't give
away much yet, but we see a hit rate of up to 97% on the filter caches:

lookups : 13003
hits : 12440
hitratio : 0.95
inserts : 563
evictions : 0
size : 8927
warmupTime : 116891
cumulative_lookups : 9990103
cumulative_hits : 9583913
cumulative_hitratio : 0.95
cumulative_inserts : 406191
cumulative_evictions : 0

The warmup time looks a bit worrying; is that a high value in your experience?

As for debugQuery, here's the relevant snippet for the kind of geo
queries we send:

<arr name="filter_queries">
  <str>{!bbox cache=false d=50 sfield=location_ll pt=54.1434,-0.452322}</str>
</arr>
<arr name="parsed_filter_queries">
  <str>WrappedQuery({!cache=false cost=0}
    +location_ll_0_coordinate:[53.69373983225355 TO 54.59306016774645]
    +location_ll_1_coordinate:[-1.2199462259963294 TO 0.31530222599632934])</str>
</arr>


 From my experience, there is no single best practice for spatial
queries -- it will depend on your data density and distribution.

You may also want to look at:
http://code.google.com/p/lucene-spatial-playground/
but note this is off lucene trunk -- the geohash queries are super fast
though

Thanks, I will look into that! I still haven't really considered
geohashes. As far as I understand, documents with a lat/lon are already
assigned a geohash upon indexing, is that correct? In what way does a
query get faster, though, when I query by a geohash rather than a
lat/lon? Doesn't local lucene already map documents to a cartesian grid
upon indexing, thus reducing lookup time? Moreover, will this mean the
results get less accurate, since different lat/lons may collapse into
the same hash?
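
As far as I can tell, the encoding itself works roughly like this (a
minimal Python sketch of plain geohash encoding, not of how Solr or
local lucene implement it):

def geohash(lat, lon, precision=5):
    # Interleave longitude/latitude bisection bits and emit base32
    # digits. Nearby points share a common prefix, and every point
    # falling inside one cell collapses to the same string at that
    # precision.
    base32 = "0123456789bcdefghjkmnpqrstuvwxyz"
    lat_rng, lon_rng = [-90.0, 90.0], [-180.0, 180.0]
    result, bits, ch, even = "", 0, 0, True
    while len(result) < precision:
        rng, val = (lon_rng, lon) if even else (lat_rng, lat)
        mid = (rng[0] + rng[1]) / 2
        ch = (ch << 1) | (val >= mid)
        rng[0 if val >= mid else 1] = mid
        even, bits = not even, bits + 1
        if bits == 5:
            result, bits, ch = result + base32[ch], 0, 0
    return result

At precision 5 a cell is a few kilometres across, so two places a few
dozen metres apart would indeed collapse to the identical string, if
the search operated at that precision.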

Thanks!



Re: Improving performance for SOLR geo queries?

2012-02-09 Thread Yonik Seeley
2012/2/9 Matthias Käppler matth...@qype.com:
 <arr name="filter_queries">
   <str>{!bbox cache=false d=50 sfield=location_ll pt=54.1434,-0.452322}</str>
 </arr>
 <arr name="parsed_filter_queries">
   <str>WrappedQuery({!cache=false cost=0}
     +location_ll_0_coordinate:[53.69373983225355 TO 54.59306016774645]
     +location_ll_1_coordinate:[-1.2199462259963294 TO 0.31530222599632934])</str>
 </arr>

Yep, bbox normally just evaluates to two range queries.
In the example schema, *_coordinate uses tdouble:

   <!-- Type used to index the lat and lon components for the
        "location" FieldType -->
   <dynamicField name="*_coordinate" type="tdouble" indexed="true"
                 stored="false"/>

And tdouble is defined to be:

   <fieldType name="tdouble" class="solr.TrieDoubleField"
              precisionStep="8" omitNorms="true" positionIncrementGap="0"/>

One way to speed up numeric range queries (at the cost of increased
index size) is to lower the precisionStep.  You could try changing
this from 8 to 4 and then re-indexing to see how that affects your
query speed.
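
For example, the re-indexed definition would then look like this (all
other attributes unchanged):

   <fieldType name="tdouble" class="solr.TrieDoubleField"
              precisionStep="4" omitNorms="true" positionIncrementGap="0"/>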

-Yonik
lucidimagination.com


Re: Improving performance for SOLR geo queries?

2012-02-08 Thread Matthias Käppler
Hi Erick,

If we're not doing geo searches, we filter by location tags that we
attach to places. This is simply a hierarchical regional id, which is
simple to filter for, but much less flexible. We use that a lot on the
Web, but not on mobile, where we want to perform searches in arbitrary
radii around arbitrary positions. For those location-tag kinds of
queries, the average time spent in SOLR is 43 msec (I'm looking at the
New Relic snapshot of the last 12 hours). I disabled our optimization
again just yesterday, so for the bbox queries we're now at an avg of
220 ms (same time window). That's a 5-fold increase in response time,
and in peak hours it's worse than that.

I've also found a blog post from 3 years ago which outlines the inner
workings of the SOLR spatial indexing and searching:
http://www.searchworkings.org/blog/-/blogs/23842
From that it seems as if SOLR already performs a similar optimization
we had in mind during the index step, so if I understand correctly, it
doesn't even search over all records, only those that were mapped to
the grid box identified during indexing.

What I would love to see is what the suggested way is to perform a geo
query on SOLR, considering that they're so difficult to cache and
expensive to run. Is the best approach to restrict the candidate set
as much as possible using cheap filter queries, so that SOLR merely
has to do the geo search against these subsets? How does the query
planner work here? I see there's a cost attached to a filter query,
but one can only set it when cache is set to false? Are cached geo
queries executed last when there are cheaper filter queries to cut
down on documents? If you have a real world practical setup to share,
one that performs well in a production environment that serves
requests in the Millions per day, that would be great.

I'd love to contribute documentation by the way, if you knew me you'd
know I'm an avid open source contributor and actually run several open
source projects myself. But tell me, how can I possibly contribute
answer to questions I don't have an answer to? That's why I'm here,
remember :) So please, these kinds of snippy replies are not helping
anyone.

Thanks
-Matthias

On Tue, Feb 7, 2012 at 3:06 PM, Erick Erickson erickerick...@gmail.com wrote:
 So the obvious question is what is your
 performance like without the distance filters?

 Without that knowledge, we have no clue whether
 the modifications you've made had any hope of
 speeding up your response times

 As for the docs, any improvements you'd like to
 contribute would be happily received

 Best
 Erick

 2012/2/6 Matthias Käppler matth...@qype.com:
 Hi,

 we need to perform fast geo lookups on an index of ~13M places, and
 we're running into performance problems here with SOLR. We haven't done
 a lot of query optimization / SOLR tuning up until now so there's
 probably a lot of things we're missing. I was wondering if you could
 give me some feedback on the way we do things, whether they make
 sense, and especially why a supposed optimization we implemented
 recently seems to have no effect, when we actually thought it would
 help a lot.

 What we do is this: our API is built on a Rails stack and talks to
 SOLR via a Ruby wrapper. We have a few filters that almost always
 apply, which we put in filter queries. Filter cache hit rate is
 excellent, about 97%, and cache size caps at 10k filters (max size is
 32k, but it never seems to reach that many, probably because we
 replicate / delta update every few minutes). Still, geo queries are
 slow, about 250-500msec on average. We send them with cache=false, so
 as to not flood the fq cache and cause undesirable evictions.

 Now our idea was this: while the actual geo queries are poorly
 cacheable, we could clearly identify geographical regions which are
 more often queried than others (naturally, since we're a user driven
 service). Therefore, we dynamically partition Earth into a static grid
 of overlapping boxes, where the grid size (the distance of the nodes)
 depends on the maximum allowed search radius. That way, for every user
 query, we would always be able to identify a single bounding box that
 covers it. This larger bounding box (200km edge length) we would send
 to SOLR as a cached filter query, along with the actual user query
 which would still be sent uncached. Ex:

 User asks for places in 10km around 49.14839,8.5691, then what we will
 send to SOLR is something like this:

 fq={!bbox cache=false d=10 sfield=location_ll pt=49.14839,8.5691}
 fq={!bbox cache=true d=100.0 sfield=location_ll
 pt=49.4684836290799,8.31165802979391} -- this one we derive
 automatically

 That way SOLR would intersect the two filters and return the same
 results as when only looking at the smaller bounding box, but keep the
 larger box in cache and speed up subsequent geo queries in the same
 regions. Or so we thought; unfortunately, this approach did not
 improve query execution times at all.

 

Re: Improving performance for SOLR geo queries?

2012-02-08 Thread Ryan McKinley
Hi Matthias-

I'm trying to understand how you have your data indexed so we can give
reasonable direction.

What field type are you using for your locations?  Is it using the
solr spatial field types?  What do you see when you look at the debug
information from debugQuery=true?

From my experience, there is no single best practice for spatial
queries -- it will depend on your data density and distribution.

You may also want to look at:
http://code.google.com/p/lucene-spatial-playground/
but note this is off lucene trunk -- the geohash queries are super fast though

ryan




2012/2/8 Matthias Käppler matth...@qype.com:
 Hi Erick,

 If we're not doing geo searches, we filter by location tags that we
 attach to places. This is simply a hierarchical regional id, which is
 simple to filter for, but much less flexible. We use that a lot on the
 Web, but not on mobile, where we want to perform searches in arbitrary
 radii around arbitrary positions. For those location-tag kinds of
 queries, the average time spent in SOLR is 43 msec (I'm looking at the
 New Relic snapshot of the last 12 hours). I disabled our optimization
 again just yesterday, so for the bbox queries we're now at an avg of
 220 ms (same time window). That's a 5-fold increase in response time,
 and in peak hours it's worse than that.

 I've also found a blog post from 3 years ago which outlines the inner
 workings of the SOLR spatial indexing and searching:
 http://www.searchworkings.org/blog/-/blogs/23842
 From that it seems as if SOLR already performs a similar optimization
 we had in mind during the index step, so if I understand correctly, it
 doesn't even search over all records, only those that were mapped to
 the grid box identified during indexing.

 What I would love to see is what the suggested way is to perform a geo
 query on SOLR, considering that they're so difficult to cache and
 expensive to run. Is the best approach to restrict the candidate set
 as much as possible using cheap filter queries, so that SOLR merely
 has to do the geo search against these subsets? How does the query
 planner work here? I see there's a cost attached to a filter query,
 but one can only set it when cache is set to false? Are cached geo
 queries executed last when there are cheaper filter queries to cut
 down on documents? If you have a real world practical setup to share,
 one that performs well in a production environment that serves
 requests in the Millions per day, that would be great.

 I'd love to contribute documentation by the way, if you knew me you'd
 know I'm an avid open source contributor and actually run several open
 source projects myself. But tell me, how can I possibly contribute
 answer to questions I don't have an answer to? That's why I'm here,
 remember :) So please, these kinds of snippy replies are not helping
 anyone.

 Thanks
 -Matthias

 On Tue, Feb 7, 2012 at 3:06 PM, Erick Erickson erickerick...@gmail.com 
 wrote:
 So the obvious question is what is your
 performance like without the distance filters?

 Without that knowledge, we have no clue whether
 the modifications you've made had any hope of
 speeding up your response times

 As for the docs, any improvements you'd like to
 contribute would be happily received

 Best
 Erick

 2012/2/6 Matthias Käppler matth...@qype.com:
 Hi,

 we need to perform fast geo lookups on an index of ~13M places, and
 we're running into performance problems here with SOLR. We haven't done
 a lot of query optimization / SOLR tuning up until now so there's
 probably a lot of things we're missing. I was wondering if you could
 give me some feedback on the way we do things, whether they make
 sense, and especially why a supposed optimization we implemented
 recently seems to have no effect, when we actually thought it would
 help a lot.

 What we do is this: our API is built on a Rails stack and talks to
 SOLR via a Ruby wrapper. We have a few filters that almost always
 apply, which we put in filter queries. Filter cache hit rate is
 excellent, about 97%, and cache size caps at 10k filters (max size is
 32k, but it never seems to reach that many, probably because we
 replicate / delta update every few minutes). Still, geo queries are
 slow, about 250-500msec on average. We send them with cache=false, so
 as to not flood the fq cache and cause undesirable evictions.

 Now our idea was this: while the actual geo queries are poorly
 cacheable, we could clearly identify geographical regions which are
 more often queried than others (naturally, since we're a user driven
 service). Therefore, we dynamically partition Earth into a static grid
 of overlapping boxes, where the grid size (the distance of the nodes)
 depends on the maximum allowed search radius. That way, for every user
 query, we would always be able to identify a single bounding box that
 covers it. This larger bounding box (200km edge length) we would send
 to SOLR as a cached filter query, along with the actual user query
 

Re: Improving performance for SOLR geo queries?

2012-02-08 Thread Nicolas Flacco
I compared locallucene to spatial search and saw a performance
degradation, even using geohash queries, though perhaps I indexed things
wrong? Locallucene across 6 machines handles 150 queries per second fine,
but using geofilt and geohash I got lots of timeouts even when I was doing
only 50 queries per second. Has anybody done a formal comparison of
locallucene with spatial search and LatLonType, PointType, and geohash?

On 2/8/12 2:20 PM, Ryan McKinley ryan...@gmail.com wrote:

Hi Matthias-

I'm trying to understand how you have your data indexed so we can give
reasonable direction.

What field type are you using for your locations?  Is it using the
solr spatial field types?  What do you see when you look at the debug
information from debugQuery=true?

From my experience, there is no single best practice for spatial
queries -- it will depend on your data density and distribution.

You may also want to look at:
http://code.google.com/p/lucene-spatial-playground/
but note this is off lucene trunk -- the geohash queries are super fast
though

ryan




2012/2/8 Matthias Käppler matth...@qype.com:
 Hi Erick,

 If we're not doing geo searches, we filter by location tags that we
 attach to places. This is simply a hierarchical regional id, which is
 simple to filter for, but much less flexible. We use that a lot on the
 Web, but not on mobile, where we want to perform searches in arbitrary
 radii around arbitrary positions. For those location-tag kinds of
 queries, the average time spent in SOLR is 43 msec (I'm looking at the
 New Relic snapshot of the last 12 hours). I disabled our optimization
 again just yesterday, so for the bbox queries we're now at an avg of
 220 ms (same time window). That's a 5-fold increase in response time,
 and in peak hours it's worse than that.

 I've also found a blog post from 3 years ago which outlines the inner
 workings of the SOLR spatial indexing and searching:
 http://www.searchworkings.org/blog/-/blogs/23842
 From that it seems as if SOLR already performs a similar optimization
 we had in mind during the index step, so if I understand correctly, it
 doesn't even search over all records, only those that were mapped to
 the grid box identified during indexing.

 What I would love to see is what the suggested way is to perform a geo
 query on SOLR, considering that they're so difficult to cache and
 expensive to run. Is the best approach to restrict the candidate set
 as much as possible using cheap filter queries, so that SOLR merely
 has to do the geo search against these subsets? How does the query
 planner work here? I see there's a cost attached to a filter query,
 but one can only set it when cache is set to false? Are cached geo
 queries executed last when there are cheaper filter queries to cut
 down on documents? If you have a real world practical setup to share,
 one that performs well in a production environment that serves
 requests in the Millions per day, that would be great.

 I'd love to contribute documentation by the way, if you knew me you'd
 know I'm an avid open source contributor and actually run several open
 source projects myself. But tell me, how can I possibly contribute
 answer to questions I don't have an answer to? That's why I'm here,
 remember :) So please, these kinds of snippy replies are not helping
 anyone.

 Thanks
 -Matthias

 On Tue, Feb 7, 2012 at 3:06 PM, Erick Erickson
erickerick...@gmail.com wrote:
 So the obvious question is what is your
 performance like without the distance filters?

 Without that knowledge, we have no clue whether
 the modifications you've made had any hope of
 speeding up your response times

 As for the docs, any improvements you'd like to
 contribute would be happily received

 Best
 Erick

 2012/2/6 Matthias Käppler matth...@qype.com:
 Hi,

 we need to perform fast geo lookups on an index of ~13M places, and
 we're running into performance problems here with SOLR. We haven't done
 a lot of query optimization / SOLR tuning up until now so there's
 probably a lot of things we're missing. I was wondering if you could
 give me some feedback on the way we do things, whether they make
 sense, and especially why a supposed optimization we implemented
 recently seems to have no effect, when we actually thought it would
 help a lot.

 What we do is this: our API is built on a Rails stack and talks to
 SOLR via a Ruby wrapper. We have a few filters that almost always
 apply, which we put in filter queries. Filter cache hit rate is
 excellent, about 97%, and cache size caps at 10k filters (max size is
 32k, but it never seems to reach that many, probably because we
 replicate / delta update every few minutes). Still, geo queries are
 slow, about 250-500msec on average. We send them with cache=false, so
 as to not flood the fq cache and cause undesirable evictions.

 Now our idea was this: while the actual geo queries are poorly
 cacheable, we could clearly identify geographical regions which are
 more often 

Re: Improving performance for SOLR geo queries?

2012-02-07 Thread Erick Erickson
So the obvious question is what is your
performance like without the distance filters?

Without that knowledge, we have no clue whether
the modifications you've made had any hope of
speeding up your response times

As for the docs, any improvements you'd like to
contribute would be happily received

Best
Erick

2012/2/6 Matthias Käppler matth...@qype.com:
 Hi,

 we need to perform fast geo lookups on an index of ~13M places, and
 we're running into performance problems here with SOLR. We haven't done
 a lot of query optimization / SOLR tuning up until now so there's
 probably a lot of things we're missing. I was wondering if you could
 give me some feedback on the way we do things, whether they make
 sense, and especially why a supposed optimization we implemented
 recently seems to have no effect, when we actually thought it would
 help a lot.

 What we do is this: our API is built on a Rails stack and talks to
 SOLR via a Ruby wrapper. We have a few filters that almost always
 apply, which we put in filter queries. Filter cache hit rate is
 excellent, about 97%, and cache size caps at 10k filters (max size is
 32k, but it never seems to reach that many, probably because we
 replicate / delta update every few minutes). Still, geo queries are
 slow, about 250-500msec on average. We send them with cache=false, so
 as to not flood the fq cache and cause undesirable evictions.

 Now our idea was this: while the actual geo queries are poorly
 cacheable, we could clearly identify geographical regions which are
 more often queried than others (naturally, since we're a user driven
 service). Therefore, we dynamically partition Earth into a static grid
 of overlapping boxes, where the grid size (the distance of the nodes)
 depends on the maximum allowed search radius. That way, for every user
 query, we would always be able to identify a single bounding box that
 covers it. This larger bounding box (200km edge length) we would send
 to SOLR as a cached filter query, along with the actual user query
 which would still be sent uncached. Ex:

 User asks for places in 10km around 49.14839,8.5691, then what we will
 send to SOLR is something like this:

 fq={!bbox cache=false d=10 sfield=location_ll pt=49.14839,8.5691}
 fq={!bbox cache=true d=100.0 sfield=location_ll
 pt=49.4684836290799,8.31165802979391} -- this one we derive
 automatically

 That way SOLR would intersect the two filters and return the same
 results as when only looking at the smaller bounding box, but keep the
 larger box in cache and speed up subsequent geo queries in the same
 regions. Or so we thought; unfortunately, this approach did not
 improve query execution times at all.

 Question is: why does it not help? Shouldn't it be faster to search on
 a cached bbox with only a few hundred thousand places? Is it a good
 idea to make these kinds of optimizations in the app layer (we do this
 as part of resolving the SOLR query in Ruby), and does it make sense
 at all? We're not sure what kind of optimizations SOLR already does in
 its query planner. The documentation is (sorry) miserable, and
 debugQuery yields no insight into which optimizations are performed.
 So this has been a hit and miss game for us, which is very ineffective
 considering that it takes considerable time to build these kinds of
 optimizations in the app layer.

 Would be glad to hear your opinions / experience around this.

 Thanks!



Improving performance for SOLR geo queries?

2012-02-06 Thread Matthias Käppler
Hi,

we need to perform fast geo lookups on an index of ~13M places, and
we're running into performance problems here with SOLR. We haven't done
a lot of query optimization / SOLR tuning up until now so there's
probably a lot of things we're missing. I was wondering if you could
give me some feedback on the way we do things, whether they make
sense, and especially why a supposed optimization we implemented
recently seems to have no effect, when we actually thought it would
help a lot.

What we do is this: our API is built on a Rails stack and talks to
SOLR via a Ruby wrapper. We have a few filters that almost always
apply, which we put in filter queries. Filter cache hit rate is
excellent, about 97%, and cache size caps at 10k filters (max size is
32k, but it never seems to reach that many, probably because we
replicate / delta update every few minutes). Still, geo queries are
slow, about 250-500msec on average. We send them with cache=false, so
as to not flood the fq cache and cause undesirable evictions.

Now our idea was this: while the actual geo queries are poorly
cacheable, we could clearly identify geographical regions which are
more often queried than others (naturally, since we're a user driven
service). Therefore, we dynamically partition Earth into a static grid
of overlapping boxes, where the grid size (the distance of the nodes)
depends on the maximum allowed search radius. That way, for every user
query, we would always be able to identify a single bounding box that
covers it. This larger bounding box (200km edge length) we would send
to SOLR as a cached filter query, along with the actual user query
which would still be sent uncached. Ex:

User asks for places in 10km around 49.14839,8.5691, then what we will
send to SOLR is something like this:

fq={!bbox cache=false d=10 sfield=location_ll pt=49.14839,8.5691}
fq={!bbox cache=true d=100.0 sfield=location_ll
pt=49.4684836290799,8.31165802979391} -- this one we derive
automatically
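
The derivation is essentially a snap-to-grid. A minimal sketch of the
idea in Python (a hypothetical helper; the real spacing must also widen
with latitude so the longitude step stays ~100 km, and the boxes must
overlap):

def grid_center(lat, lon, cell_deg=0.9):
    # Snap the query point to the nearest node of a roughly 100 km
    # grid (0.9 degrees of latitude is about 100 km). With 200 km box
    # edges centered on the nodes, a 10 km-radius query around any
    # point fits comfortably inside the box at its nearest node.
    return (round(lat / cell_deg) * cell_deg,
            round(lon / cell_deg) * cell_deg)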

That way SOLR would intersect the two filters and return the same
results as when only looking at the smaller bounding box, but keep the
larger box in cache and speed up subsequent geo queries in the same
regions. Or so we thought; unfortunately, this approach did not
improve query execution times at all.

Question is: why does it not help? Shouldn't it be faster to search on
a cached bbox with only a few hundred thousand places? Is it a good
idea to make these kinds of optimizations in the app layer (we do this
as part of resolving the SOLR query in Ruby), and does it make sense
at all? We're not sure what kind of optimizations SOLR already does in
its query planner. The documentation is (sorry) miserable, and
debugQuery yields no insight into which optimizations are performed.
So this has been a hit and miss game for us, which is very ineffective
considering that it takes considerable time to build these kinds of
optimizations in the app layer.

Would be glad to hear your opinions / experience around this.

Thanks!
