Re: Solr 3.2 filter cache warming taking longer than 1.4.1

2011-07-03 Thread Shawn Heisey

On 7/2/2011 12:34 PM, Yonik Seeley wrote:

> OK, I tried a quick test of 1.4.1 vs 3x on optimized indexes
> (unoptimized had different numbers of segments so I didn't try that).
> 3x (as of today) was 28% faster at a large filter query (300 terms in
> one big disjunction, with each term matching ~1000 docs).


A lot of the terms used in my filter queries may match hundreds of 
thousands or even millions of documents.  The largest search group 
(sg:stdp) matches about 1.4 million out of 9.5 million docs on each 
shard, and is probably present in most filter queries.


Right now I have the default termIndexInterval of 128, and a 
setTermIndexDivisor of 8.  I think this probably has the same memory 
footprint as a termIndexInterval of 1024, but because it can do seeks in 
the tii file (taking good advantage of disk cache) before it ultimately 
seeks in the tis file, there are probably fewer seeks.  My warm time is 
slightly better than it was with the interval at 1024, and my average 
query speed hasn't changed much.  I am going to try an interval of 64 
and a divisor of 16.
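
A rough sketch of the arithmetic behind that comparison, assuming the divisor simply subsamples the on-disk term index so that only every Nth entry is held in memory. The function name is made up for illustration, and this says nothing about actual bytes used, only about how far apart the in-memory entries are.

```python
def in_memory_spacing(term_index_interval, divisor):
    """Number of terms between two index entries held in memory.

    Assumption: the divisor keeps every divisor-th entry of the on-disk
    term index, so the two settings multiply.
    """
    return term_index_interval * divisor


# (termIndexInterval, setTermIndexDivisor) pairs mentioned in this thread.
configs = {
    "interval 1024, divisor 1": (1024, 1),
    "interval 128, divisor 8": (128, 8),
    "interval 64, divisor 16": (64, 16),
}

spacings = {
    name: in_memory_spacing(interval, divisor)
    for name, (interval, divisor) in configs.items()
}

for name, spacing in spacings.items():
    print(f"{name}: one in-memory index entry per {spacing} terms")
```

Under that assumption all three settings keep the same number of entries in memory; they differ only in how dense the on-disk tii file is.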


I'm interested in other performance enhancing ideas that don't involve 
tweaking tons of options all at the same time.  I think my best bet for 
performance is adding more memory, of course.


Shawn



Re: Solr 3.2 filter cache warming taking longer than 1.4.1

2011-07-02 Thread Yonik Seeley
OK, I tried a quick test of 1.4.1 vs 3x on optimized indexes
(unoptimized had different numbers of segments so I didn't try that).
3x (as of today) was 28% faster at a large filter query (300 terms in
one big disjunction, with each term matching ~1000 docs).

-Yonik
http://www.lucidimagination.com


On Thu, Jun 30, 2011 at 3:30 PM, Shawn Heisey wrote:
> On 6/29/2011 10:16 PM, Shawn Heisey wrote:
>>
>> I was thinking perhaps I might actually decrease the termIndexInterval
>> value below the default of 128.  I know from reading the Hathi Trust blog
>> that memory usage for the tii file is much more than the size of the file
>> would indicate, but if I increase it from 13MB to 26MB, it probably would
>> still be OK.
>
> Decreasing the termIndexInterval to 64 almost doubled the tii file size, as
> expected.  It made the filterCache warming much faster, but made the
> queryResultCache warming very very slow.  Regular queries also seem like
> they're slower.
>
> I am trying again with 256.  I may go back to the default before I'm done.
>  I'm guessing that a lot of trial and error was put into choosing the
> default value.
>
> It's been fun having a newer index available on my backup servers.  I've
> been able to do a lot of trials, learned a lot of things that don't work and
> a few that do.  I might do some experiments with trunk once I've moved off
> 1.4.1.
>
> Thanks,
> Shawn
>
>


Re: Solr 3.2 filter cache warming taking longer than 1.4.1

2011-06-30 Thread Shawn Heisey

On 6/29/2011 10:16 PM, Shawn Heisey wrote:
> I was thinking perhaps I might actually decrease the termIndexInterval
> value below the default of 128.  I know from reading the Hathi Trust
> blog that memory usage for the tii file is much more than the size of
> the file would indicate, but if I increase it from 13MB to 26MB, it
> probably would still be OK.


Decreasing the termIndexInterval to 64 almost doubled the tii file size, 
as expected.  It made the filterCache warming much faster, but made the 
queryResultCache warming very very slow.  Regular queries also seem like 
they're slower.


I am trying again with 256.  I may go back to the default before I'm 
done.  I'm guessing that a lot of trial and error was put into choosing 
the default value.


It's been fun having a newer index available on my backup servers.  I've 
been able to do a lot of trials, learned a lot of things that don't work 
and a few that do.  I might do some experiments with trunk once I've 
moved off 1.4.1.


Thanks,
Shawn



Re: Solr 3.2 filter cache warming taking longer than 1.4.1

2011-06-29 Thread Shawn Heisey

On 6/29/2011 7:50 PM, Yonik Seeley wrote:

> OK, your filter queries have hundreds of terms in them (and that means
> hundreds of term lookups, which use the term index).
> Thus, your termIndexInterval change is the leading suspect for the
> slowdown.  A termIndexInterval of 1024 means that a term lookup will
> seek to the closest 1024th term and then call next() until the desired
> term is found.  Hence instead of calling next() an average of 64 times
> internally, it's now 512 times.
>
> Of course there is still a mystery about why your tii (which is the
> term index) would be so much bigger instead of smaller...


It turns out I got the two indexes backwards: the smaller one was the
new index.  I may have mixed up the indexes on some of the other files
too, but they weren't much different, so I'm not going to try to figure
out where any mistakes might be.


Earlier in the afternoon I figured this out, removed termIndexInterval 
from my config, and rebuilt the index.  I had originally put this in to 
speed up indexing.  The evidence I had available at the time told me 
that this goal was accomplished, but the rebuild actually went faster 
without the statement.  Warming times are now averaging under 10 seconds 
even with the warmup count back up to 8.  This is still slower than I 
would like, but it is a major improvement.  Even more important, I 
understand what happened.


I was thinking perhaps I might actually decrease the termIndexInterval 
value below the default of 128.  I know from reading the Hathi Trust 
blog that memory usage for the tii file is much more than the size of 
the file would indicate, but if I increase it from 13MB to 26MB, it 
probably would still be OK.


Are any index intervals for the other Lucene files configurable in a 
similar manner?  I know that screwing too much with the defaults can 
make things much worse, so I would be very careful with any adjustments, 
and try to fully understand why any performance gain or loss occurred.


Thanks,
Shawn



Re: Solr 3.2 filter cache warming taking longer than 1.4.1

2011-06-29 Thread Yonik Seeley
On Wed, Jun 29, 2011 at 3:28 PM, Yonik Seeley wrote:
>
> On Wed, Jun 29, 2011 at 1:43 PM, Shawn Heisey wrote:
> > Just now, three of the six shards had documents deleted, and they took
> > 29.07, 27.57, and 28.66 seconds to warm.  The 1.4.1 counterpart to the 29.07
> > second one only took 4.78 seconds, and it did twice as many autowarm
> > queries.
>
> Can you post the logs at the INFO level that covers the warming period?

OK, your filter queries have hundreds of terms in them (and that means
hundreds of term lookups, which use the term index).
Thus, your termIndexInterval change is the leading suspect for the
slowdown.  A termIndexInterval of 1024 means that a term lookup will
seek to the closest 1024th term and then call next() until the desired
term is found.  Hence instead of calling next() an average of 64 times
internally, it's now 512 times.
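
To make the seek-then-scan cost concrete, here is a small simulation. It is only a toy model with a sorted Python list standing in for the tis file and a sampled list standing in for the tii; the function names are invented and none of this is Lucene code.

```python
import bisect


def build_term_index(sorted_terms, interval):
    """Sample every interval-th term, remembering its position (toy tii)."""
    positions = list(range(0, len(sorted_terms), interval))
    return [sorted_terms[p] for p in positions], positions


def lookup(sorted_terms, index_terms, index_positions, target):
    """Seek to the closest indexed term at or before target, then scan.

    Returns (found, steps), where steps counts the next()-style advances.
    """
    slot = bisect.bisect_right(index_terms, target) - 1
    if slot < 0:
        return False, 0
    pos = index_positions[slot]
    steps = 0
    while pos < len(sorted_terms) and sorted_terms[pos] < target:
        pos += 1
        steps += 1
    found = pos < len(sorted_terms) and sorted_terms[pos] == target
    return found, steps


def average_steps(num_terms, interval):
    """Average scan length when every term is looked up exactly once."""
    terms = [f"term{n:07d}" for n in range(num_terms)]
    index_terms, index_positions = build_term_index(terms, interval)
    total = sum(lookup(terms, index_terms, index_positions, t)[1] for t in terms)
    return total / num_terms


avg_128 = average_steps(4096, 128)
avg_1024 = average_steps(4096, 1024)
print(f"interval 128:  {avg_128} steps per lookup on average")
print(f"interval 1024: {avg_1024} steps per lookup on average")
```

With uniformly distributed lookups the average scan length comes out at about half the interval, which is where the 64 and 512 figures come from. A filter query with hundreds of terms pays that cost once per term.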

Of course there is still a mystery about why your tii (which is the
term index) would be so much bigger instead of smaller...

-Yonik
http://www.lucidimagination.com


Re: Solr 3.2 filter cache warming taking longer than 1.4.1

2011-06-29 Thread Yonik Seeley
On Wed, Jun 29, 2011 at 1:43 PM, Shawn Heisey wrote:
> Just now, three of the six shards had documents deleted, and they took
> 29.07, 27.57, and 28.66 seconds to warm.  The 1.4.1 counterpart to the 29.07
> second one only took 4.78 seconds, and it did twice as many autowarm
> queries.

Can you post the logs at the INFO level that covers the warming period?

-Yonik
http://www.lucidimagination.com


Re: Solr 3.2 filter cache warming taking longer than 1.4.1

2011-06-29 Thread Shawn Heisey

On 6/29/2011 11:27 AM, Shawn Heisey wrote:

> On 6/29/2011 9:17 AM, Yonik Seeley wrote:
>> Hmmm, you could comment out the query and filter caches on both 1.4.1
>> and 3.2 and then run some of the queries to see if you can figure out
>> which are slower?
>>
>> Do any of the queries have stopwords in fields where you now index
>> those?  If so, that could entirely account for the difference.
>
> The query cache warms very quickly; it's the filter cache that's
> taking forever.  I'm not intimately familiar with what is being put in
> our filter queries by our webapp, but I'd be a little surprised if
> there are stopwords there.  A quick grep through solr logs (when I've
> turned it up to INFO) for the really common ones didn't reveal any.
> People do type them in fairly frequently, but those go into q=.  The fq
> values are constructed internally, not from what a user types, and as
> far as I know, they involve fields that have never had stopwords removed.


I should add that this happens only after the index has had at least a 
few hundred queries, when deletes are committed.  The delete process 
runs every ten minutes, and checks for document presence before issuing 
the delete, which avoids unnecessary commits.
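
A minimal sketch of that check-before-delete pattern, assuming nothing about the real delete job. The three callables are hypothetical stand-ins for whatever Solr client calls the application uses, and a plain Python set plays the part of a shard so the example runs without a server.

```python
def delete_if_present(doc_ids, is_present, delete_by_id, commit):
    """Delete only documents that exist; commit only if something was deleted.

    is_present, delete_by_id and commit are hypothetical callables that
    would wrap the real client calls.
    """
    present = [doc_id for doc_id in doc_ids if is_present(doc_id)]
    for doc_id in present:
        delete_by_id(doc_id)
    if present:
        commit()  # a commit opens a new searcher and triggers cache warming
    return len(present)


# In-memory stand-in for a shard.
index = {"doc1", "doc2", "doc3"}
commits = []


def record_commit():
    commits.append(1)


deleted_first = delete_if_present(
    ["doc2", "doc9"], index.__contains__, index.discard, record_commit
)
deleted_second = delete_if_present(
    ["doc9"], index.__contains__, index.discard, record_commit
)
print(deleted_first, deleted_second, len(commits))
```

The second run finds nothing to delete, so no commit is issued and that shard is spared a warming cycle.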


Just now, three of the six shards had documents deleted, and they took 
29.07, 27.57, and 28.66 seconds to warm.  The 1.4.1 counterpart to the 
29.07 second one only took 4.78 seconds, and it did twice as many 
autowarm queries.  I know it's not my single *:* sorted warming query 
(firstSearcher and newSearcher), because on solr startup with either 
version, warm time is 0.01 seconds.  I have useColdSearcher set to false.


Thanks,
Shawn



Re: Solr 3.2 filter cache warming taking longer than 1.4.1

2011-06-29 Thread Shawn Heisey

On 6/29/2011 9:17 AM, Yonik Seeley wrote:

> Hmmm, you could comment out the query and filter caches on both 1.4.1 and 3.2
> and then run some of the queries to see if you can figure out which are slower?
>
> Do any of the queries have stopwords in fields where you now index
> those?  If so, that could entirely account for the difference.


The query cache warms very quickly; it's the filter cache that's taking
forever.  I'm not intimately familiar with what is being put in our
filter queries by our webapp, but I'd be a little surprised if there are
stopwords there.  A quick grep through solr logs (when I've turned it up
to INFO) for the really common ones didn't reveal any.  People do type
them in fairly frequently, but those go into q=.  The fq values are
constructed internally, not from what a user types, and as far as I
know, they involve fields that have never had stopwords removed.


I will do some experimentation with your suggestions.

Thanks,
Shawn



Re: Solr 3.2 filter cache warming taking longer than 1.4.1

2011-06-29 Thread Yonik Seeley
Hmmm, you could comment out the query and filter caches on both 1.4.1 and 3.2
and then run some of the queries to see if you can figure out which are slower?

Do any of the queries have stopwords in fields where you now index
those?  If so, that could entirely account for the difference.

-Yonik
http://www.lucidimagination.com

On Wed, Jun 29, 2011 at 10:59 AM, Shawn Heisey wrote:
> I have noticed a significant difference in filter cache warming times on my
> shards between 3.2 and 1.4.1.  What can I do to troubleshoot this?  Please
> let me know what additional information you might need to look deeper.  I
> know this isn't enough.
>
> It takes about 3 seconds to do an autowarm count of 8 on 1.4.1 and 10-15
> seconds to do an autowarm count of 4 on 3.2.  The only explicit warming
> query is *:*, sorted descending by post_date, a tlong field containing a
> UNIX timestamp, precisionStep 16.  The indexes are not entirely identical,
> but the new one did evolve from the old one.  Perhaps one of the experts
> might spot something that makes for much slower filter cache warming, or
> some way to look deeper if this seems wrong?  Is there a way to see the
> search URL bits that populated the cache?
>
> Index differences: The new index has four extra small fields, is no longer
> removing stopwords, and has omitTermFreqAndPositions enabled on a
> significant number of fields.  Most of the fields are tokenized text, and
> now more than half of those don't have tf and tp enabled.  Naturally the
> largest text field where most of the matches happen still does have them
> enabled.
>
> To increase reindex speed, the new index has a termIndexInterval of 1024,
> the old one is at the default of 128.  In terms of raw size, the new index
> is less than one percent larger than the old one.  The old shards average
> out to 17.22GB, the new ones to 17.41GB.  Here's an overview of the
> differences of each type of file (comparing the huge optimized segment only,
> not the handful of tiny ones created since) on the index with the largest
> size gap, old value listed first:
>
> fdt: 6317180127/6055634923 (4.1% decrease)
> fdx: 76447972/75647412 (1% decrease)
> fnm: 382/338 (44 bytes!  woohoo!)
> frq: 2828400926/2873249038 (1.5% increase)
> nrm: 28367782/38223988 (35% increase)
> prx: 2449154203/2684249069 (9.5% increase)
> tii: 1686298/13329832 (690% increase, about 7.9x)
> tis: 923045932/999294109 (8% increase)
> tvd: 18910972/19111840 (1% increase)
> tvf: 5867309063/5640332282 (3.9% decrease)
> tvx: 151294820/152895940 (1% increase)
>
> The tii and nrm files are the only ones that saw a significant size
> increase, but the tii file is MUCH bigger.
>
> Thanks,
> Shawn
>
>


Solr 3.2 filter cache warming taking longer than 1.4.1

2011-06-29 Thread Shawn Heisey
I have noticed a significant difference in filter cache warming times on 
my shards between 3.2 and 1.4.1.  What can I do to troubleshoot this?  
Please let me know what additional information you might need to look 
deeper.  I know this isn't enough.


It takes about 3 seconds to do an autowarm count of 8 on 1.4.1 and 10-15 
seconds to do an autowarm count of 4 on 3.2.  The only explicit warming 
query is *:*, sorted descending by post_date, a tlong field containing a 
UNIX timestamp, precisionStep 16.  The indexes are not entirely 
identical, but the new one did evolve from the old one.  Perhaps one of 
the experts might spot something that makes for much slower filter cache 
warming, or some way to look deeper if this seems wrong?  Is there a way 
to see the search URL bits that populated the cache?


Index differences: The new index has four extra small fields, is no 
longer removing stopwords, and has omitTermFreqAndPositions enabled on a 
significant number of fields.  Most of the fields are tokenized text, 
and now more than half of those don't have tf and tp enabled.  Naturally 
the largest text field where most of the matches happen still does have 
them enabled.


To increase reindex speed, the new index has a termIndexInterval of 
1024, the old one is at the default of 128.  In terms of raw size, the 
new index is less than one percent larger than the old one.  The old 
shards average out to 17.22GB, the new ones to 17.41GB.  Here's an 
overview of the differences of each type of file (comparing the huge
optimized segment only, not the handful of tiny ones created since) on
the index with the largest size gap, old value listed first:


fdt: 6317180127/6055634923 (4.1% decrease)
fdx: 76447972/75647412 (1% decrease)
fnm: 382/338 (44 bytes!  woohoo!)
frq: 2828400926/2873249038 (1.5% increase)
nrm: 28367782/38223988 (35% increase)
prx: 2449154203/2684249069 (9.5% increase)
tii: 1686298/13329832 (690% increase, about 7.9x)
tis: 923045932/999294109 (8% increase)
tvd: 18910972/19111840 (1% increase)
tvf: 5867309063/5640332282 (3.9% decrease)
tvx: 151294820/152895940 (1% increase)

The tii and nrm files are the only ones that saw a significant size 
increase, but the tii file is MUCH bigger.
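
For anyone who wants to recheck the percentages, a short script along these lines recomputes them from the byte counts listed above. The helper name is arbitrary; note that a size ratio of r corresponds to a change of (r - 1) x 100 percent.

```python
# Byte counts copied from the list above: extension -> (old, new).
sizes = {
    "fdt": (6317180127, 6055634923),
    "fdx": (76447972, 75647412),
    "fnm": (382, 338),
    "frq": (2828400926, 2873249038),
    "nrm": (28367782, 38223988),
    "prx": (2449154203, 2684249069),
    "tii": (1686298, 13329832),
    "tis": (923045932, 999294109),
    "tvd": (18910972, 19111840),
    "tvf": (5867309063, 5640332282),
    "tvx": (151294820, 152895940),
}


def percent_change(old, new):
    """Signed change from old to new, as a percentage of old."""
    return (new - old) / old * 100.0


changes = {ext: percent_change(old, new) for ext, (old, new) in sizes.items()}

# Largest relative changes first.
for ext, pct in sorted(changes.items(), key=lambda kv: abs(kv[1]), reverse=True):
    old, new = sizes[ext]
    print(f"{ext}: {old} -> {new}  {pct:+.1f}%  ({new / old:.2f}x)")
```

Sorting by relative change puts tii and nrm at the top, which matches the observation that those two files moved the most.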


Thanks,
Shawn