Re: DocValue on Strings slow and OOM

2013-11-16 Thread Per Steffensen
Please note, for now, that this problem is not relevant for us anymore, and we will change our c-field from being of type string (docValue) to being of type long (docValue). And faceting on huge numbers of long docValues seems to perform very well - except for

Re: DocValue on Strings slow and OOM

2013-11-14 Thread Per Steffensen
If anyone is following this one, just an update. We are not going to upgrade to 4.5.1 in order to see if the String facet performance problem has been fixed. Instead we have made a few hacks around our data so that we can store the c-field (c_dstr_doc_sto) as long instead (c_dlng_doc_sto). So

Re: DocValue on Strings slow and OOM

2013-11-14 Thread Joel Bernstein
Per, As you are seeing there are different implementations for calculating facets for numeric fields and string fields. The numeric fields I believe are using an int-to-int or long-to-int hashmap to hold the facet counts. This map grows as values are added to it. The String version uses an int
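
A minimal sketch of the hash-map counting idea described above for numeric fields, in plain Java. This is not the Lucene/Solr code: the class name NumericFacetSketch, the method countFacets, and the use of java.util.HashMap are assumptions made for illustration only. It shows a map from value to count that only grows as distinct values are encountered, so its size follows the number of distinct values among the matching documents rather than the size of the index.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not the Solr NumericFacets implementation.
public class NumericFacetSketch {

    // Counts how often each long value occurs among the matching documents.
    // The map gains an entry only when a value is seen for the first time.
    static Map<Long, Integer> countFacets(long[] matchingValues) {
        Map<Long, Integer> counts = new HashMap<>();
        for (long value : matchingValues) {
            counts.merge(value, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up field values for six matching documents.
        long[] matchingValues = {7L, 42L, 7L, 7L, 42L, 99L};
        System.out.println(countFacets(matchingValues));
    }
}
```

With only a handful of hits, such a map stays tiny no matter how many documents the index holds.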

Re: DocValue on Strings slow and OOM

2013-11-06 Thread Per Steffensen
Thanks for all the help, guys! Just to clarify. Everything is working functionality-wise - we have tests showing that. It is just that two similar queries (hitting the same number of rows (only 6 among 12 billion in this example) and resulting in the same number of facet-groups etc.) is

Re: DocValue on Strings slow and OOM

2013-11-06 Thread Per Steffensen
Forget about the quoted comment at the bottom below. It is not true. Both the fast/efficient and the slow/memory-consuming query follow the getTermCounts-path. But I have identified another place where they take different paths in the code. In SimpleFacets.getTermCounts you will find the code

Re: DocValue on Strings slow and OOM

2013-11-06 Thread Robert Muir
Before Lucene 4.5 docvalues were loaded entirely into RAM. I'm not going to waste time debugging any old code releases here, you should upgrade to the latest release! On Wed, Nov 6, 2013 at 4:58 AM, Per Steffensen st...@designware.dk wrote: Forget about the quoted comment at the bottom below. It

Re: DocValue on Strings slow and OOM

2013-11-06 Thread Per Steffensen
It seems like NumericFacets.getCounts is using the FieldCache. This is what we wanted to avoid by using doc-values in the first place - because we have experienced so many times that the FieldCache makes us go OOM. We were told that if we used doc-values the FieldCache would not be used. But

Re: DocValue on Strings slow and OOM

2013-11-06 Thread Per Steffensen
On 11/6/13 11:43 AM, Robert Muir wrote: Before Lucene 4.5 docvalues were loaded entirely into RAM. I'm not going to waste time debugging any old code releases here, you should upgrade to the latest release! Ok, thanks! I do not consider it a bug (just a performance issue), so no debugging

DocValue on Strings slow and OOM

2013-11-05 Thread Per Steffensen
Hi, we have a 6-Solr-node (release 4.4.0) setup with 12 billion small documents loaded. The documents have the following fields * a_dlng_doc_sto * b_dlng_doc_sto * c_dstr_doc_sto * timestamp_lng_ind_sto * d_lng_ind_sto From schema.xml dynamicField name=*_dstr_doc_sto type=dstring

Re: DocValue on Strings slow and OOM

2013-11-05 Thread Per Steffensen
Looking at thread dumps, it seems like one of the major differences in what is done for c_dstr_doc_sto vs a_dlng_doc_sto is in SimpleFacets.getFacetFieldCounts, where c_dstr_doc_sto takes the getTermCounts-path and a_dlng_doc_sto takes the getListedTermCounts-path. String termList

Re: DocValue on Strings slow and OOM

2013-11-05 Thread Robert Muir
If you are querying on a field, you should index it! On Tue, Nov 5, 2013 at 5:47 AM, Per Steffensen st...@designware.dk wrote: Hi, we have a 6-Solr-node (release 4.4.0) setup with 12 billion small documents loaded. The documents have the following fields * a_dlng_doc_sto * b_dlng_doc_sto *

Re: DocValue on Strings slow and OOM

2013-11-05 Thread Per Steffensen
On 11/5/13 3:30 PM, Robert Muir wrote: If you are querying on a field, you should index it! Believe I do that. Query looks like this timestamp_lng_ind_sto:[x TO y] AND d_lng_ind_sto:(a OR b OR ... OR n) and both timestamp_lng_ind_sto and d_lng_ind_sto are indexed. Please elaborate! I
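
A small sketch of how a query string of the shape quoted above could be assembled, assuming numeric values for x, y and a..n. Only the two field names come from the message; the class FacetQuerySketch and the helper buildQuery are hypothetical, and no Solr client API is involved (this is plain string building).

```java
import java.util.StringJoiner;

// Hypothetical helper; only the field names are taken from the thread.
public class FacetQuerySketch {

    // Builds "timestamp_lng_ind_sto:[from TO to] AND d_lng_ind_sto:(a OR b OR ...)".
    static String buildQuery(long from, long to, long[] dValues) {
        StringJoiner ors = new StringJoiner(" OR ", "(", ")");
        for (long v : dValues) {
            ors.add(Long.toString(v));
        }
        return "timestamp_lng_ind_sto:[" + from + " TO " + to + "] AND d_lng_ind_sto:" + ors;
    }

    public static void main(String[] args) {
        // Made-up range bounds and values, for illustration only.
        System.out.println(buildQuery(10L, 20L, new long[]{1L, 2L, 3L}));
    }
}
```

Both fields in the generated query are the indexed ones (the _ind_ names), which is the point made in the reply.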

Re: DocValue on Strings slow and OOM

2013-11-05 Thread Robert Muir
On Tue, Nov 5, 2013 at 9:42 AM, Per Steffensen st...@designware.dk wrote: On 11/5/13 3:30 PM, Robert Muir wrote: If you are querying on a field, you should index it! Believe I do that. Query looks like this timestamp_lng_ind_sto:[x TO y] AND d_lng_ind_sto:(a OR b OR ... OR n) and both

Re: DocValue on Strings slow and OOM

2013-11-05 Thread Erick Erickson
H. I was just looking at the DocValues Wiki page. Should I add a bit about docValuesFormat supporting Disk as a 4.5 plus feature? Currently it kind of looks like you can do that with 4.2. Or am I off base here? I'm going from CHANGES.txt about LUCENE-5124 Erick On Tue, Nov 5, 2013 at

Re: DocValue on Strings slow and OOM

2013-11-05 Thread Cassandra Targett
On Tue, Nov 5, 2013 at 3:27 PM, Erick Erickson erickerick...@gmail.com wrote: H. I was just looking at the DocValues Wiki page. Should I add a bit about docValuesFormat supporting Disk as a 4.5 plus feature? Currently it kind of looks like you can do that with 4.2. It's in the Solr Ref

Re: DocValue on Strings slow and OOM

2013-11-05 Thread Erick Erickson
Hmmm, what I'm referring to is this bit: <fieldType name="string_ondisk" class="solr.StrField" docValuesFormat="Disk" /> The docValuesFormat=Disk bit isn't supported until 4.5, which doesn't seem clear in either place. Feel free to disagree of course :). On Tue, Nov 5, 2013 at 11:43 AM, Cassandra

Re: DocValue on Strings slow and OOM

2013-11-05 Thread Shawn Heisey
On 11/5/2013 11:56 AM, Erick Erickson wrote: Hmmm, what I'm referring to is this bit: <fieldType name="string_ondisk" class="solr.StrField" docValuesFormat="Disk" /> The docValuesFormat=Disk bit isn't supported until 4.5, which doesn't seem clear in either place. Feel free to