RE: facet method=enum and uninvertedfield limitations

2013-11-20 Thread Lemke, Michael SZ/HZA-ZSW
On Wednesday, November 20, 2013 7:37 AM, Dmitry Kan wrote:

Thanks for your reply.


Since you are faceting on a text field (is this correct?) you deal with a
lot of unique values in it.

Yes, this is a text field and we experimented with reducing the index.  As
I said in my original question, the stripped-down index had 178,000 terms
and it (fc) still didn't work.  Is the number of terms the relevant quantity?

So your best bet is enum method. 

Hm, yes, that works but I have to wait 4 minutes for the answer (with the
original data).  Not good.

Also if you
are on solr 4x try building doc values in the index: this suits faceting
well.

We are on Solr 1.4, so, no.


Otherwise start from your spec once again. Can you use shingles instead?

Possibly but I don't know shingles.  Although I'd prefer to use our original
index we are trying to build a specialized index just for this sort of
query but still don't know what to look for.

A query like

 
q=word&facet.field=CONTENT&facet=true&facet.limit=10&facet.mincount=1&facet.method=fc&facet.prefix=a&rows=0

would give me the top ten results containing 'word' and something starting
with 'a'.  That's what I want.  An empty facet.prefix should also work.
Eventually, the query will be more complex containing other fields and
filter queries but the basic function should be exactly like this.  How
can we achieve this?
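
For reference, the request above can be assembled with proper parameter
separators as in the sketch below.  The host, port and path in BASE_URL and
the helper name facet_query are placeholders for illustration, not taken from
our setup; only the parameter names and values come from the query above.

```python
from urllib.parse import urlencode

# Assumption: placeholder endpoint, adjust to the actual Solr instance.
BASE_URL = "http://localhost:8983/solr/select"


def facet_query(word, prefix="", method="enum", limit=10):
    """Build the faceting request URL; an empty prefix is sent as 'facet.prefix='."""
    params = [
        ("q", word),
        ("facet", "true"),
        ("facet.field", "CONTENT"),
        ("facet.prefix", prefix),
        ("facet.limit", limit),
        ("facet.mincount", 1),
        ("facet.method", method),
        ("rows", 0),
    ]
    # urlencode joins the pairs with '&' and escapes values where needed.
    return BASE_URL + "?" + urlencode(params)


if __name__ == "__main__":
    print(facet_query("word", prefix="a", method="fc"))
    print(facet_query("word"))
```

The second call shows the empty-prefix variant that should also work.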

Thanks,
Michael


On 19 Nov 2013 17:44, Lemke, Michael SZ/HZA-ZSW lemke...@schaeffler.com
wrote:

 On Friday, November 15, 2013 11:22 AM, Lemke, Michael SZ/HZA-ZSW wrote:

 Judging from numerous replies this seems to be a tough question.
 Nevertheless, I'd really appreciate any help as we are stuck.
 We'd really like to know what in our index causes the facet.method=fc
 query to fail.

 Thanks,
 Michael

 On Thu, November 14, 2013 7:26 PM, Yonik Seeley wrote:
 On Thu, Nov 14, 2013 at 12:03 PM, Lemke, Michael  SZ/HZA-ZSW
 lemke...@schaeffler.com wrote:
  I am running into performance problems with faceted queries.
  If I do a
 
 
 q=word&facet.field=CONTENT&facet=true&facet.limit=10&facet.mincount=1&facet.method=fc&facet.prefix=a&rows=0
 
  I am getting an exception:
  org.apache.solr.common.SolrException: Too many values for
 UnInvertedField faceting on field CONTENT
  at
 org.apache.solr.request.UnInvertedField.uninvert(UnInvertedField.java:384)
  at
 org.apache.solr.request.UnInvertedField.<init>(UnInvertedField.java:178)
  at
 org.apache.solr.request.UnInvertedField.getUnInvertedField(UnInvertedField.java:839)
  ...
 
  I understand it's got something to do with a 24bit limit somewhere
  in the code but I don't understand enough of it to be able to construct
  a specialized index that can be queried with facet.method=enum.
 
 You shouldn't need to do anything differently to try facet.method=enum
 (just replace facet.method=fc with facet.method=enum)
 
 This is true and facet.method=enum does work indeed.  The problem is
 runtime.  In particular queries with an empty facet.prefix= run many
 seconds if not minutes.  I initially asked about this here:
 
 http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201310.mbox/%3c33ec3398272fbe47b64ee3b3e98f69a761427...@de011521.schaeffler.com%3E
 
 It was suggested that fc is much faster than enum and I'd like to
 test that.  We are still fairly free to design the index such that
 it performs well.  But to do that we need to understand what is
 killing it.
 
 
 You may also want to add the parameter
 facet.enum.cache.minDf=10
 to lower memory usage by only using the filter cache for terms that
 match more than 100K docs.
 
 That helped a little, cut down my particular test from 10 sec to 5 sec.
 But still too slow.  Mind you this is for an autosuggest feature.
 
 Thanks for your reply.
 
 Michael
 
 





RE: facet method=enum and uninvertedfield limitations

2013-11-19 Thread Lemke, Michael SZ/HZA-ZSW
On Friday, November 15, 2013 11:22 AM, Lemke, Michael SZ/HZA-ZSW wrote:

Judging from numerous replies this seems to be a tough question.
Nevertheless, I'd really appreciate any help as we are stuck.
We'd really like to know what in our index causes the facet.method=fc
query to fail.

Thanks,
Michael

On Thu, November 14, 2013 7:26 PM, Yonik Seeley wrote:
On Thu, Nov 14, 2013 at 12:03 PM, Lemke, Michael  SZ/HZA-ZSW
lemke...@schaeffler.com wrote:
 I am running into performance problems with faceted queries.
 If I do a

 q=word&facet.field=CONTENT&facet=true&facet.limit=10&facet.mincount=1&facet.method=fc&facet.prefix=a&rows=0

 I am getting an exception:
 org.apache.solr.common.SolrException: Too many values for UnInvertedField 
 faceting on field CONTENT
 at 
 org.apache.solr.request.UnInvertedField.uninvert(UnInvertedField.java:384)
 at 
 org.apache.solr.request.UnInvertedField.<init>(UnInvertedField.java:178)
 at 
 org.apache.solr.request.UnInvertedField.getUnInvertedField(UnInvertedField.java:839)
 ...

 I understand it's got something to do with a 24bit limit somewhere
 in the code but I don't understand enough of it to be able to construct
 a specialized index that can be queried with facet.method=enum.

You shouldn't need to do anything differently to try facet.method=enum
(just replace facet.method=fc with facet.method=enum)

This is true and facet.method=enum does work indeed.  The problem is
runtime.  In particular queries with an empty facet.prefix= run many
seconds if not minutes.  I initially asked about this here:
http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201310.mbox/%3c33ec3398272fbe47b64ee3b3e98f69a761427...@de011521.schaeffler.com%3E

It was suggested that fc is much faster than enum and I'd like to
test that.  We are still fairly free to design the index such that
it performs well.  But to do that we need to understand what is
killing it.


You may also want to add the parameter
facet.enum.cache.minDf=10
to lower memory usage by only using the filter cache for terms that
match more than 100K docs.

That helped a little, cut down my particular test from 10 sec to 5 sec.
But still too slow.  Mind you this is for an autosuggest feature.

Thanks for your reply.

Michael





RE: facet method=enum and uninvertedfield limitations

2013-11-15 Thread Lemke, Michael SZ/HZA-ZSW
On Thu, November 14, 2013 7:26 PM, Yonik Seeley wrote:
On Thu, Nov 14, 2013 at 12:03 PM, Lemke, Michael  SZ/HZA-ZSW
lemke...@schaeffler.com wrote:
 I am running into performance problems with faceted queries.
 If I do a

 q=word&facet.field=CONTENT&facet=true&facet.limit=10&facet.mincount=1&facet.method=fc&facet.prefix=a&rows=0

 I am getting an exception:
 org.apache.solr.common.SolrException: Too many values for UnInvertedField 
 faceting on field CONTENT
 at 
 org.apache.solr.request.UnInvertedField.uninvert(UnInvertedField.java:384)
 at 
 org.apache.solr.request.UnInvertedField.<init>(UnInvertedField.java:178)
 at 
 org.apache.solr.request.UnInvertedField.getUnInvertedField(UnInvertedField.java:839)
 ...

 I understand it's got something to do with a 24bit limit somewhere
 in the code but I don't understand enough of it to be able to construct
 a specialized index that can be queried with facet.method=enum.

You shouldn't need to do anything differently to try facet.method=enum
(just replace facet.method=fc with facet.method=enum)

This is true and facet.method=enum does work indeed.  The problem is
runtime.  In particular queries with an empty facet.prefix= run many
seconds if not minutes.  I initially asked about this here:
http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201310.mbox/%3c33ec3398272fbe47b64ee3b3e98f69a761427...@de011521.schaeffler.com%3E

It was suggested that fc is much faster than enum and I'd like to
test that.  We are still fairly free to design the index such that
it performs well.  But to do that we need to understand what is
killing it.


You may also want to add the parameter
facet.enum.cache.minDf=10
to lower memory usage by only using the filter cache for terms that
match more than 100K docs.

That helped a little, cut down my particular test from 10 sec to 5 sec.
But still too slow.  Mind you this is for an autosuggest feature.
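
To make the effect of that parameter concrete, here is a toy sketch of how a
document-frequency threshold would split terms between the filter cache and
direct counting.  It assumes that a term goes through the filter cache when
its document frequency is at least the threshold; the function name
split_by_min_df and the two rare terms are made up for illustration.  Both
thresholds mentioned above (10 and 100K) are shown.

```python
def split_by_min_df(doc_freqs, min_df):
    """Partition terms by a document-frequency threshold.

    Assumption: terms with df >= min_df would be counted via the filter
    cache, rarer terms by walking their postings directly.
    """
    cached = sorted(t for t, df in doc_freqs.items() if df >= min_df)
    direct = sorted(t for t, df in doc_freqs.items() if df < min_df)
    return cached, direct


# word1/word2 use the Luke counts reported elsewhere in this thread;
# rare1/rare2 are hypothetical low-frequency terms.
DOC_FREQS = {"word1": 413950, "word2": 321223, "rare1": 7, "rare2": 42}

if __name__ == "__main__":
    for min_df in (10, 100_000):
        cached, direct = split_by_min_df(DOC_FREQS, min_df)
        print(f"minDf={min_df}: filter cache {cached}, direct {direct}")
```

A higher threshold keeps rare terms out of the cache, which saves memory but
does nothing for the time spent enumerating the terms themselves.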

Thanks for your reply.

Michael



facet method=enum and uninvertedfield limitations

2013-11-14 Thread Lemke, Michael SZ/HZA-ZSW
I am running into performance problems with faceted queries.
If I do a 

q=word&facet.field=CONTENT&facet=true&facet.limit=10&facet.mincount=1&facet.method=fc&facet.prefix=a&rows=0

I am getting an exception:
org.apache.solr.common.SolrException: Too many values for UnInvertedField 
faceting on field CONTENT
at 
org.apache.solr.request.UnInvertedField.uninvert(UnInvertedField.java:384)
at 
org.apache.solr.request.UnInvertedField.<init>(UnInvertedField.java:178)
at 
org.apache.solr.request.UnInvertedField.getUnInvertedField(UnInvertedField.java:839)
...

I understand it's got something to do with a 24bit limit somewhere
in the code but I don't understand enough of it to be able to construct
a specialized index that can be queried with facet.method=enum.

A stripped down index still doesn't work.  It has exactly one
field CONTENT with 178,000 Terms and ~1 mio documents.  The top
ranking terms according to Luke are

1 413950 CONTENT word1
2 321223 CONTENT word2
3 299036 CONTENT word3
4 276757 CONTENT word4
...

How would we have to strip the index?

Thanks,
Michael



RE: Facet performance

2013-10-23 Thread Lemke, Michael SZ/HZA-ZSW
On Tue, October 22, 2013 5:23 PM Michael Lemke wrote:
On Tue, October 22, 2013 9:23 AM Toke Eskildsen wrote:
On Mon, 2013-10-21 at 16:57 +0200, Lemke, Michael SZ/HZA-ZSW wrote:
 QTime fc:
never returns, webserver restarts itself after 30 min with 100% CPU 
 load

It might be because it dies due to garbage collection. But since more
memory (as your test server presumably has) just leads to the too many
values-error, there isn't much to do.

Essentially, fc is out then.


 QTime=41205   facet.prefix=    q=frequent_word  numFound=44532

 Same query repeated:
 QTime=225810  facet.prefix=    q=ottomotor      numFound=909
 QTime=199839  facet.prefix=    q=ottomotor      numFound=909

I am stumped on this, sorry. I do not understand why the 'ottomotor'
query can take 5 times as long as the 'frequent_word'-one.

I looked into this some more this morning.  I noticed the java process was
doing a lot of I/O as shown in Process Explorer.  For the frequent_word it
read about 180MB, for ottomotor it was about seven times as much, ~ 1,200 MB.


Got another observation today.  The response time for q=ottomotor depends on 
facet.limit:

QTime=59300  facet.limit=2
QTime=69395  facet.limit=4
QTime=85208  facet.limit=6
QTime=158150 facet.limit=8
QTime=186276 facet.limit=10
QTime=231763 facet.limit=15
QTime=260437 facet.limit=20
QTime=312268 facet.limit=30

For q=frequent_word the result is much less pronounced and shows only
for facet.limit = 15 :

QTime=0  facet.limit=0
QTime=20535  facet.limit=1
QTime=13456  facet.limit=2
QTime=13925  facet.limit=4
QTime=13705  facet.limit=6
QTime=13924  facet.limit=8
QTime=13799  facet.limit=10
QTime=14361  facet.limit=15
QTime=14704  facet.limit=20
QTime=15189  facet.limit=30
QTime=16783  facet.limit=50
QTime=57128  facet.limit=500

It looks to me as if Solr, in order to collect enough facets to fulfill the
limit constraint, has to read much more of the index in the case of the
infrequent word.
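
A quick way to put numbers on this is to compare how much each query slows
down between facet.limit=2 and facet.limit=30.  The sketch below only uses the
QTime values listed above; the helper name slowdown is just for illustration.

```python
# QTime in milliseconds keyed by facet.limit, copied from the two tables above.
OTTOMOTOR = {2: 59300, 4: 69395, 6: 85208, 8: 158150,
             10: 186276, 15: 231763, 20: 260437, 30: 312268}
FREQUENT_WORD = {2: 13456, 4: 13925, 6: 13705, 8: 13924,
                 10: 13799, 15: 14361, 20: 14704, 30: 15189}


def slowdown(qtimes, low=2, high=30):
    """Ratio of QTime at the higher facet.limit to QTime at the lower one."""
    return qtimes[high] / qtimes[low]


if __name__ == "__main__":
    for name, qtimes in (("ottomotor", OTTOMOTOR), ("frequent_word", FREQUENT_WORD)):
        print(f"{name}: x{slowdown(qtimes):.2f} from facet.limit=2 to 30")
```

For ottomotor the factor is a bit over five, for frequent_word it stays close
to one.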

jconsole didn't show anything unusual according to our more experienced Java 
experts here.  Nor was the machine swapping.

Is it possible to screw up an index such that this sort of faceting leads to
constant reading of the index?  Something like full table scans in a db?


Michael




RE: Facet performance

2013-10-22 Thread Lemke, Michael SZ/HZA-ZSW
On Tue, October 22, 2013 9:23 AM Toke Eskildsen wrote:
On Mon, 2013-10-21 at 16:57 +0200, Lemke, Michael SZ/HZA-ZSW wrote:
 QTime fc:
never returns, webserver restarts itself after 30 min with 100% CPU 
 load

It might be because it dies due to garbage collection. But since more
memory (as your test server presumably has) just leads to the too many
values-error, there isn't much to do.

Essentially, fc is out then.


 QTime=41205   facet.prefix=    q=frequent_word  numFound=44532

 Same query repeated:
 QTime=225810  facet.prefix=    q=ottomotor      numFound=909
 QTime=199839  facet.prefix=    q=ottomotor      numFound=909

I am stumped on this, sorry. I do not understand why the 'ottomotor'
query can take 5 times as long as the 'frequent_word'-one.

I looked into this some more this morning.  I noticed the java process was doing
a lot of I/O as shown in Process Explorer.  For the frequent_word it read about 
180MB, for ottomotor it was about seven times as much, ~ 1,200 MB.

jconsole didn’t show anything unusual according to our more experienced Java 
experts here.  Nor was the machine swapping.

Is it possible to screw up an index such that this sort of faceting leads to
constant reading of the index?  Something like full table scans in a db?

Michael


RE: Facet performance

2013-10-22 Thread Lemke, Michael SZ/HZA-ZSW
On Tue, October 22, 2013 11:54 AM Andre Bois-Crettez wrote:

 This is with Solr 1.4.
Really?
This sounds really outdated to me.
Have you tried a more recent version? 4.5 just came out.

Sorry, can't.  Too much `grown' stuff.

Michael


RE: Facet performance

2013-10-21 Thread Lemke, Michael SZ/HZA-ZSW
On Mon, October 21, 2013 10:04 AM, Toke Eskildsen wrote:
On Fri, 2013-10-18 at 18:30 +0200, Lemke, Michael SZ/HZA-ZSW wrote:
 Toke Eskildsen wrote:
  Unfortunately the enum-solution is normally quite slow when there
  are enough unique values to trigger the too many  values-exception.
  [...]
 
 [...] And yes, the fc method was terribly slow in a case where it did
 work.  Something like 20 minutes whereas enum returned within a few
 seconds.

Err.. What? That sounds _very_ strange. You have millions of unique
values so fc should be a lot faster than enum, not the other way around.

I assume the 20 minutes was for the first call. How fast do subsequent
calls return for fc?

QTime enum:
 1st call: 1200
 subsequent calls: 200

QTime fc:
   never returns, webserver restarts itself after 30 min with 100% CPU load


This is on the test system, the production system managed to return with
... Too many values for UnInvertedField faceting 

However, I also have different faceting queries I played with today.

One complete example:

q=ottomotor&facet.field=CONTENT&facet=true&facet.prefix=&facet.limit=10&facet.mincount=1&facet.method=enum&rows=0

These are the results, all with facet.method=enum (fc doesn't work).  They
were executed in the sequence shown on an otherwise unused server:

QTime=41205   facet.prefix=    q=frequent_word       numFound=44532

Same query repeated:
QTime=225810  facet.prefix=    q=ottomotor           numFound=909
QTime=199839  facet.prefix=    q=ottomotor           numFound=909

QTime=0       facet.prefix=    q=ottomotor jkdhwjfh  numFound=0
QTime=0       facet.prefix=    q=jkdhwjfh            numFound=0

QTime=185948  facet.prefix=    q=ottomotor           numFound=909

QTime=3344    facet.prefix=d   q=ottomotor           numFound=909
QTime=3078    facet.prefix=d   q=ottomotor           numFound=909
QTime=3141    facet.prefix=d   q=ottomotor           numFound=909

The response time is obviously not dependent on the number of documents found.
Caching doesn't kick in either.



Maybe you could provide some approximate numbers?

I'll try, see below.  Thanks for asking and having a closer look.


- Documents in your index
13,434,414

- Unique values in the CONTENT field
Not sure how to get this.  In luke I find
21,797,514 term count CONTENT

Is that what you mean?

- Hits are returned from a typical query
Hm, that can be anything between 0 and 40,000 or more.
Or do you mean from the facets?  Or do my tests above
answer it?

- Xmx
The maximum the system allows me to get: 1612m


Maybe I have a hopelessly under-dimensioned server for this sort of thing?

Thanks a lot for your help,
Michael


Facet performance

2013-10-18 Thread Lemke, Michael SZ/HZA-ZSW
I am working with Solr facet fields and have come across a
performance problem I don't understand. Consider these
two queries:

1.
q=word&facet.field=CONTENT&facet=true&facet.prefix=&facet.limit=10&facet.mincount=1&facet.method=enum&rows=0

2.
q=word&facet.field=CONTENT&facet=true&facet.prefix=a&facet.limit=10&facet.mincount=1&facet.method=enum&rows=0

The only difference is an empty facet.prefix in the first query.

The first query returns after some 20 seconds (QTime 2 in the result) while 
the second one takes only 80 msec (QTime 80). Why is this?

And as a side note: facet.method=fc makes the queries run 'forever' and
eventually fail with org.apache.solr.common.SolrException: Too many values
for UnInvertedField faceting on field CONTENT.

This is with Solr 1.4.




RE: Facet performance

2013-10-18 Thread Lemke, Michael SZ/HZA-ZSW
Toke Eskildsen [mailto:t...@statsbiblioteket.dk] wrote:
Lemke, Michael  SZ/HZA-ZSW [lemke...@schaeffler.com] wrote:
 1.
 q=word&facet.field=CONTENT&facet=true&facet.prefix=&facet.limit=10&facet.mincount=1&facet.method=enum&rows=0
 2.
 q=word&facet.field=CONTENT&facet=true&facet.prefix=a&facet.limit=10&facet.mincount=1&facet.method=enum&rows=0

 The only difference is an empty facet.prefix in the first query.

 The first query returns after some 20 seconds (QTime 2 in the result) 
 while
 the second one takes only 80 msec (QTime 80). Why is this?

If your index was just opened when you issued your queries, the first request
will be notably slower than the second as the facet values might not be in
the disk cache.

I know but it shouldn't be orders of magnitudes as in this example, should it?


Furthermore, for enum the difference between no prefix and some prefix is
huge. As enum iterates values first (as opposed to fc that iterates hits
first), limiting to only the values that start with 'a' ought to speed up
retrieval by a factor of 10 or more.
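
The difference between the two iteration orders can be illustrated with a toy
model.  This is only a sketch of the idea described above, not Solr's actual
implementation: the index, the function names facet_enum and facet_fc, and the
data are all made up.

```python
from collections import Counter

# Hypothetical inverted index: term -> set of ids of documents containing it.
INDEX = {
    "apple": {1, 2, 3},
    "axle": {2},
    "brake": {1, 4},
    "motor": {3, 4, 5},
}


def facet_enum(index, hits, prefix=""):
    """Values first: visit every term with the prefix, intersect with the hits.

    Returns the counts and the number of terms that had to be visited.
    """
    counts = Counter()
    visited = 0
    for term in sorted(index):
        if not term.startswith(prefix):
            continue
        visited += 1
        matching = len(index[term] & hits)
        if matching:
            counts[term] = matching
    return counts, visited


def facet_fc(index, hits, prefix=""):
    """Hits first: build a doc -> terms map once, then walk only the hits."""
    doc_terms = {}
    for term, docs in index.items():
        for doc in docs:
            doc_terms.setdefault(doc, []).append(term)
    counts = Counter()
    for doc in hits:
        for term in doc_terms.get(doc, []):
            if term.startswith(prefix):
                counts[term] += 1
    return counts


if __name__ == "__main__":
    hits = {2, 3}
    print(facet_enum(INDEX, hits))              # all terms visited
    print(facet_enum(INDEX, hits, prefix="a"))  # only the 'a' terms visited
    print(facet_fc(INDEX, hits, prefix="a"))
```

In this model the cost of the values-first variant grows with the number of
terms visited, which is why a prefix cuts it down, while the hits-first
variant pays for building the doc -> terms map up front.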

Thanks.  That is what we sort of figured but it's good to know for sure.  Of 
course it begs the question if there is a way to speed this up?


 And as side note: facet.method=fc makes the queries run 'forever' and 
 eventually
 fail with org.apache.solr.common.SolrException: Too many values for 
 UnInvertedField faceting on field CONTENT.

An internal memory structure optimization in Solr limits the amount of 
possible unique values when using fc. It is not a bug as such, but more a 
consequence of a choice. Unfortunately the enum-solution is normally quite 
slow when there are enough unique values to trigger the too many 
values-exception. I know too little about the structures for DocValues to say 
if they will help here, but you might want to take a look at those.

What is DocValues?  Haven't heard of it yet.  And yes, the fc method was 
terribly slow in a case where it did work.  Something like 20 minutes whereas 
enum returned within a few seconds.

Michael