Solr 4.3: Recovering from "Too many values for UnInvertedField faceting on field"

2013-09-03 Thread Dennis Schafroth
We are harvesting and indexing bibliographic data, thus having many distinct 
author names in our index. While testing Solr 4 I believe I had pushed a single 
core to 100 million records (91GB of data) and everything was working fine and 
fast. After adding a little more to the index, the following started to happen:

17328668 [searcherExecutor-4-thread-1] WARN org.apache.solr.core.SolrCore – 
Approaching too many values for UnInvertedField faceting on field 
'author_exact' : bucket size=16726546
17328701 [searcherExecutor-4-thread-1] INFO org.apache.solr.core.SolrCore – 
UnInverted multi-valued field 
{field=author_exact,memSize=336715415,tindexSize=5001903,time=31595,phase1=31465,nTerms=12048027,bigTerms=0,termInstances=57751332,uses=0}
18103757 [searcherExecutor-4-thread-1] ERROR org.apache.solr.core.SolrCore – 
org.apache.solr.common.SolrException: Too many values for UnInvertedField 
faceting on field author_exact
at org.apache.solr.request.UnInvertedField.<init>(UnInvertedField.java:181)
at org.apache.solr.request.UnInvertedField.getUnInvertedField(UnInvertedField.java:664)

I can see that we reached a limit on the bucket size. Is there a way to adjust
this? The index also seems to have exploded in size (217GB).

Thinking that I had reached the limit of what a single core could handle in
terms of faceting, I deleted records from the index, but even now, at 1/3 of
the size (32 million), it still fails with the above error. I have optimised
with expungeDeletes=true. The index is somewhat larger (76GB) than I would
have expected.

While we can still use the index and get facets back using the enum method on
that field, I would still like a way to fix the index if possible. Any
suggestions?

cheers, 
:-Dennis
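
For readers who want to try the enum workaround mentioned above, here is a minimal sketch of building such a request. It assumes a core reachable at a hypothetical base URL (`http://localhost:8983/solr/collection1` is only a placeholder) and uses the per-field form `f.<field>.facet.method`, so that other facet fields keep the default method. It only builds the URL; it does not repair the index.

```python
from urllib.parse import urlencode


def facet_query_url(base_url, query, field, method="enum", limit=10):
    """Build a /select URL that facets on `field` with an explicit facet method."""
    params = [
        ("q", query),
        ("rows", "0"),  # only the facet counts are of interest here
        ("facet", "true"),
        ("facet.field", field),
        # Per-field override: only this field switches away from the default method.
        (f"f.{field}.facet.method", method),
        (f"f.{field}.facet.limit", str(limit)),
    ]
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical base URL; the field name is the one from the log above.
url = facet_query_url("http://localhost:8983/solr/collection1", "*:*", "author_exact")
print(url)
```

The enum method avoids un-inverting the field, which is why it still answers when the UnInvertedField limit is hit, at the cost of slower faceting on fields with many distinct terms.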

Re: Slow first searcher with facet on bibliographic data in Master - Slave

2012-03-29 Thread Dennis Schafroth
I was wrong! It does seem to work! 

Thanks a bunch! 

cheers,
:-Dennis

On Mar 29, 2012, at 15:52 , fbrisbart wrote:

> I had the same issue months ago.
> 'newSearcher' fixed the problem for me.
> I also remember that I had to upgrade solr (3.1) because it didn't work
> with release 1.4 
> But, I suppose you already have a solr 3.x or more.
> So I'm afraid I can't help you more :o(
> 
> Franck
> 
> 
> On Thursday, 29 March 2012 at 15:41 +0200, Dennis Schafroth wrote:
>> On Mar 29, 2012, at 14:49 , fbrisbart wrote:
>> 
>>> Arf, I didn't see your attached tgz.
>>> 
>>> In your slave solrconfig.xml, only the 'firstSearcher' contains the
>>> query. Add it also in the 'newSearcher', so that the new search
>>> instances will wait also after a new index is replicated.
>> 
>> Did that now, but I believe my case is mostly a first searcher issue. Anyway 
>> it didn't seem to change anything. 
>> 
>>> 
>>> The first request is long because the default faceting method uses the
>>> FieldCache for your facet fields.
>> 
>> Jup, i know. 
>> 
>>> You may also choose to use the facet.method=enum  The performance is
>>> globally worse
>> 
>> You say. This means that every search with facets is now 20 seconds instead 
>> of 2. Then I prefer the field cache with one bad first search. 
>> 
>>> than the 'fc' method, but you will avoid the very slow
>>> first request. Btw, it's far better to use the default 'enum' facet
>>> method.
> I meant "the default 'fc' method" of course :o)
> 
>> 
>> Thanks for the input so far. 
>> 
>>> 
>>> Hope this helps,
>>> Franck
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> On Thursday, 29 March 2012 at 13:57 +0200, fbrisbart wrote:
>>>> If you add your query to the firstSearcher and/or newSearcher event
>>>> listeners in the slave
>>>> 'solrconfig.xml' ( 
>>>> http://wiki.apache.org/solr/SolrCaching#newSearcher_and_firstSearcher_Event_Listeners
>>>>  ),
>>>> 
>>>> each new search instance will wait before accepting queries.
>>>> 
>>>> Example to load the FieldCache for 'your_facet_field' field :
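>>>> ...
>>>> <listener event="firstSearcher" class="solr.QuerySenderListener">
>>>>   <arr name="queries">
>>>>     <lst>
>>>>       <str name="q">*:*</str>
>>>>       <str name="facet">true</str>
>>>>       <str name="facet.field">your_facet_field</str>
>>>>     </lst>
>>>>   </arr>
>>>> </listener>
>>>> ...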
>>>> 
>>>> 
>>>> Franck
>>>> 
>>>> On Thursday, 29 March 2012 at 13:30 +0200, Dennis Schafroth wrote:
>>>>> Hi 
>>>>>   
>>>>> I am running indexing and faceted searching on bibliographic data, which
>>>>> is known not to perform too well due to the high facet count. Actually
>>>>> it's just the first search that is horribly slow, 200+ seconds. After
>>>>> that, I am getting okay times (1 second), at least in the few-user
>>>>> scenario we have now.
>>>>> 
>>>>> The current index is 54 million records with approx. 10 million unique
>>>>> authors. The facets (… _exact) use the string type.
>>>>> 
>>>>> I had hoped that a master (indexing) and slave (searching) would have
>>>>> solved the issue, but I am still seeing the issue on the slave, so I
>>>>> guess I must have misunderstood (or perhaps misconfigured) something.
>>>>> 
>>>>> I had thought that the slave would not switch to the new index until the
>>>>> auto warming was completed. Is such behavior possible?
>>>>> 
>>>>> I guess an alternative solution could be to have multiple slaves and
>>>>> take a slave off-line when doing replication, but if it is possible to
>>>>> do it more simply (and using 1/3 less space) that would be great. Then
>>>>> again, we might need multiple slaves once we get more requests.
>>>>> 
>>>>> Attached are the configuration files.
>>>>> 
>>>>> Let me know if there is missing information. 
>>>>> 
>>>>> cheers, 
>>>>> :-Dennis Schafroth
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>>> 
>> 
> 
> 
> 
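
What worked in this thread was registering the same warming query under both the firstSearcher and the newSearcher events. As a rough sketch, the snippet below generates the two listener blocks for solrconfig.xml. The element layout follows the usual `solr.QuerySenderListener` convention and the field name `author_exact` is only an example; both are assumptions, since the attached configuration is not part of the archive.

```python
import xml.etree.ElementTree as ET


def warming_listener(event, facet_field):
    """Build a QuerySenderListener element that warms facets for one field."""
    listener = ET.Element(
        "listener", {"event": event, "class": "solr.QuerySenderListener"}
    )
    queries = ET.SubElement(listener, "arr", {"name": "queries"})
    query = ET.SubElement(queries, "lst")
    # The warming query: match everything and facet on the expensive field.
    for name, value in (("q", "*:*"), ("facet", "true"), ("facet.field", facet_field)):
        ET.SubElement(query, "str", {"name": name}).text = value
    return listener


# Emit one block per event so that both the first searcher and every searcher
# opened after a replication run the same warming query.
for event in ("firstSearcher", "newSearcher"):
    print(ET.tostring(warming_listener(event, "author_exact"), encoding="unicode"))
```

The printed fragments would go inside the `<query>` section of the slave's solrconfig.xml.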



Re: Slow first searcher with facet on bibliographic data in Master - Slave

2012-03-29 Thread Dennis Schafroth

On Mar 29, 2012, at 14:49 , fbrisbart wrote:

> Arf, I didn't see your attached tgz.
> 
> In your slave solrconfig.xml, only the 'firstSearcher' contains the
> query. Add it also in the 'newSearcher', so that the new search
> instances will wait also after a new index is replicated.

Did that now, but I believe my case is mostly a first searcher issue. Anyway it 
didn't seem to change anything. 

> 
> The first request is long because the default faceting method uses the
> FieldCache for your facet fields.

Jup, i know. 

> You may also choose to use the facet.method=enum  The performance is
> globally worse

You say. This means that every search with facets is now 20 seconds instead of 
2. Then I prefer the field cache with one bad first search. 

> than the 'fc' method, but you will avoid the very slow
> first request. Btw, it's far better to use the default 'enum' facet
> method.

Thanks for the input so far. 

> 
> Hope this helps,
> Franck
> 
> 
> 
> 
> 
> 
> On Thursday, 29 March 2012 at 13:57 +0200, fbrisbart wrote:
>> If you add your query to the firstSearcher and/or newSearcher event
>> listeners in the slave
>> 'solrconfig.xml' ( 
>> http://wiki.apache.org/solr/SolrCaching#newSearcher_and_firstSearcher_Event_Listeners
>>  ),
>> 
>> each new search instance will wait before accepting queries.
>> 
>> Example to load the FieldCache for 'your_facet_field' field :
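>> ...
>> <listener event="firstSearcher" class="solr.QuerySenderListener">
>>   <arr name="queries">
>>     <lst>
>>       <str name="q">*:*</str>
>>       <str name="facet">true</str>
>>       <str name="facet.field">your_facet_field</str>
>>     </lst>
>>   </arr>
>> </listener>
>> ...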
>> 
>> 
>> Franck
>> 
>> On Thursday, 29 March 2012 at 13:30 +0200, Dennis Schafroth wrote:
>>> Hi 
>>> 
>>> I am running indexing and faceted searching on bibliographic data, which
>>> is known not to perform too well due to the high facet count. Actually it's
>>> just the first search that is horribly slow, 200+ seconds. After that, I
>>> am getting okay times (1 second), at least in the few-user scenario we have
>>> now.
>>> 
>>> The current index is 54 million records with approx. 10 million unique
>>> authors. The facets (… _exact) use the string type.
>>> 
>>> I had hoped that a master (indexing) and slave (searching) would have
>>> solved the issue, but I am still seeing the issue on the slave, so I guess
>>> I must have misunderstood (or perhaps misconfigured) something.
>>> 
>>> I had thought that the slave would not switch to the new index until the
>>> auto warming was completed. Is such behavior possible?
>>> 
>>> I guess an alternative solution could be to have multiple slaves and take
>>> a slave off-line when doing replication, but if it is possible to do it
>>> more simply (and using 1/3 less space) that would be great. Then again, we
>>> might need multiple slaves once we get more requests.
>>> 
>>> Attached are the configuration files.
>>> 
>>> Let me know if there is missing information. 
>>> 
>>> cheers, 
>>> :-Dennis Schafroth
>>> 
>> 
>> 
> 
> 
> 



Re: Slow first searcher with facet on bibliographic data in Master - Slave

2012-03-29 Thread Dennis Schafroth

I do have a firstSearcher, but currently useColdSearcher is set to true. But
doesn't this just mean that any searches will block while the first searcher
is running? This is how the comment describes the first searcher. It would
have almost the same effect: some searches take a long time.

What I am looking for is this: after receiving replicated data, run the first
searcher and only then switch to the new index.

I will try with useColdSearcher set to false, but I actually think I have
already tried this.

cheers, 
:-Dennis

On Mar 29, 2012, at 13:57 , fbrisbart wrote:

> If you add your query to the firstSearcher and/or newSearcher event
> listeners in the slave
> 'solrconfig.xml' ( 
> http://wiki.apache.org/solr/SolrCaching#newSearcher_and_firstSearcher_Event_Listeners
>  ),
> 
> each new search instance will wait before accepting queries.
> 
> Example to load the FieldCache for 'your_facet_field' field :
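> ...
> <listener event="firstSearcher" class="solr.QuerySenderListener">
>   <arr name="queries">
>     <lst>
>       <str name="q">*:*</str>
>       <str name="facet">true</str>
>       <str name="facet.field">your_facet_field</str>
>     </lst>
>   </arr>
> </listener>
> ...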
> 
> 
> Franck
> 
> On Thursday, 29 March 2012 at 13:30 +0200, Dennis Schafroth wrote:
>> Hi 
>>  
>> I am running indexing and faceted searching on bibliographic data, which is
>> known not to perform too well due to the high facet count. Actually it's just
>> the first search that is horribly slow, 200+ seconds. After that, I am
>> getting okay times (1 second), at least in the few-user scenario we have
>> now.
>> 
>> The current index is 54 million records with approx. 10 million unique
>> authors. The facets (… _exact) use the string type.
>> 
>> I had hoped that a master (indexing) and slave (searching) would have solved
>> the issue, but I am still seeing the issue on the slave, so I guess I must
>> have misunderstood (or perhaps misconfigured) something.
>> 
>> I had thought that the slave would not switch to the new index until the
>> auto warming was completed. Is such behavior possible?
>> 
>> I guess an alternative solution could be to have multiple slaves and take a
>> slave off-line when doing replication, but if it is possible to do it more
>> simply (and using 1/3 less space) that would be great. Then again, we might
>> need multiple slaves once we get more requests.
>> 
>> Attached are the configuration files.
>> 
>> Let me know if there is missing information. 
>> 
>> cheers, 
>> :-Dennis Schafroth
>> 
> 
> 
> 



Re: Solr memory consumption

2011-06-01 Thread Dennis Schafroth

I ran out of memory on some big indexes when using Solr 1.4. I found out that
increasing

termInfosIndexDivisor

in solrconfig.xml could help a lot.

It may slow down searches against your index.

cheers,
:-Dennis
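
To give a feel for why the divisor helps, here is a back-of-the-envelope sketch. It assumes the usual Lucene behaviour of that era: every `termIndexInterval`-th term (128 by default) goes into the term index, and a divisor of N loads only every N-th of those entries into memory. The term count is a made-up figure, and the function is an illustration only, not a Solr API.

```python
def term_index_entries_in_memory(total_terms, index_interval=128, divisor=1):
    """Approximate number of term-index entries an index reader keeps in RAM."""
    return total_terms // (index_interval * divisor)


total_terms = 100_000_000  # hypothetical number of unique terms in the index
for divisor in (1, 4, 12):
    entries = term_index_entries_in_memory(total_terms, divisor=divisor)
    print(f"divisor={divisor}: about {entries} entries in memory")
```

Memory for the term index shrinks roughly in proportion to the divisor, while each term lookup has to scan further on disk, which is the slowdown mentioned above.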


On 02/06/2011, at 01.16, Alexey Serba wrote:

> Hey Denis,
> 
> * How big is your index in terms of number of documents and index size?
> * Is it production system where you have many search requests?
> * Is there any pattern for OOM errors? I.e. right after you start your
> Solr app, after some search activity or specific Solr queries, etc?
> * What are 1) cache settings 2) facets and sort-by fields 3) commit
> frequency and warmup queries?
> etc
> 
> Generally you might want to connect to your jvm using jconsole tool
> and monitor your heap usage (and other JVM/Solr numbers)
> 
> * http://java.sun.com/developer/technicalArticles/J2SE/jconsole.html
> * http://wiki.apache.org/solr/SolrJmx#Remote_Connection_to_Solr_JMX
> 
> HTH,
> Alexey
> 
> 2011/6/1 Denis Kuzmenok :
>> There were no parameters at all, and Java hit "out of memory"
>> almost every day; then I tried to add parameters, but nothing changed.
>> Xms/Xmx did not solve the problem either. Now I am trying MaxPermSize,
>> because it's the last thing I haven't tried yet :(
>> 
>> 
>> Wednesday, June 1, 2011, 9:00:56 PM, you wrote:
>> 
>>> Could be related to your crazy high MaxPermSize like Marcus said.
>> 
>>> I'm no JVM tuning expert either. Few people are, it's confusing. So if
>>> you don't understand it either, why are you trying to throw in very
>>> non-standard parameters you don't understand?  Just start with whatever
>>> the Solr example jetty has, and only change things if you have a reason
>>> to (that you understand).
>> 
>>> On 6/1/2011 1:19 PM, Denis Kuzmenok wrote:
>>>> Overall memory on the server is 24G, plus 24G of swap; almost all the time
>>>> swap is free and not used at all, which is why "no free swap" sounds
>>>> strange to me.
>> 
>> 
>> 
>> 
>> 
> 



Re: solrj issue: SocketTimeout: read timed out, but commit succed on server.

2011-05-17 Thread Dennis Schafroth

It also happens when adding records.

Putting a proxy in between client and server revealed that the server writes
zero bytes back on the update, so what the client says is correct. So I guess I
have to dig into the server code.

Limiting to fewer updates before each commit does seem to make the chance of
success higher.

Any input will be greatly appreciated.

cheers, 
:-Dennis
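
The observation that fewer updates per commit improve the odds can be sketched as follows. This is not the harvester's code: `add` and `commit` are hypothetical callables standing in for whatever client is used (SolrJ in the original setup), and the batch size is arbitrary.

```python
from itertools import islice


def batches(records, size):
    """Yield consecutive lists of at most `size` records."""
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def index_in_batches(records, size, add, commit):
    """Send records with add() in chunks and commit() after each chunk.

    Returns the number of commits issued.
    """
    commits = 0
    for chunk in batches(records, size):
        add(chunk)
        commit()  # smaller commits leave the server less work per request
        commits += 1
    return commits
```

Raising the client's read timeout (for example `setSoTimeout` on `CommonsHttpSolrServer`) is the other obvious knob; it is not shown here.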

On 17/05/2011, at 14.43, Dennis Schafroth wrote:

> Hi
> 
> I can see that others are having the same issue, but I haven't seen any fixes
> or workarounds.
> 
> 
> I am adding and deleting records in a mixed fashion, in batches of up to 1000
> records. On the commit I see the following in the client:
> 
> 2011-05-17 13:42:41 ERROR - harvester 
> [main/com.indexdata.masterkey.localindices.harvest.storage.SolrRecordStorage] 
> - Commit failed when adding 39900 and deleting 11666.
> org.apache.solr.client.solrj.SolrServerException: 
> java.net.SocketTimeoutException: Read timed out
>   at 
> org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:483)
>   at 
> org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
>   at 
> org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
>   at org.apache.solr.client.solrj.SolrServer.commit(SolrServer.java:86)
>   at org.apache.solr.client.solrj.SolrServer.commit(SolrServer.java:75)
>   at 
> com.indexdata.masterkey.localindices.harvest.storage.SolrRecordStorage.commit(SolrRecordStorage.java:47)
>   at 
> com.indexdata.masterkey.localindices.harvest.storage.BulkSolrRecordStorage.commit(BulkSolrRecordStorage.java:101)
>   at 
> com.indexdata.masterkey.localindices.harvest.job.OAIRecordHarvestJob.run(OAIRecordHarvestJob.java:146)
>   at 
> com.indexdata.masterkey.localindices.harvest.job.TestOAIRecordHarvestJob.TestCleanFullBulkHarvestJob(TestOAIRecordHarvestJob.java:65)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>   at java.lang.reflect.Method.invoke(Method.java:597)
>   at junit.framework.TestCase.runTest(TestCase.java:164)
>   at junit.framework.TestCase.runBare(TestCase.java:130)
>   at junit.framework.TestResult$1.protect(TestResult.java:106)
>   at junit.framework.TestResult.runProtected(TestResult.java:124)
>   at junit.framework.TestResult.run(TestResult.java:109)
>   at junit.framework.TestCase.run(TestCase.java:120)
>   at 
> org.eclipse.jdt.internal.junit.runner.junit3.JUnit3TestReference.run(JUnit3TestReference.java:130)
>   at 
> org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
>   at 
> org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
>   at 
> org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
>   at 
> org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
>   at 
> org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)
> Caused by: java.net.SocketTimeoutException: Read timed out
>   at java.net.SocketInputStream.socketRead0(Native Method)
>   at java.net.SocketInputStream.read(SocketInputStream.java:129)
>   at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>   at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
>   at 
> org.apache.commons.httpclient.HttpParser.readRawLine(HttpParser.java:78)
>   at 
> org.apache.commons.httpclient.HttpParser.readLine(HttpParser.java:106)
>   at 
> org.apache.commons.httpclient.HttpConnection.readLine(HttpConnection.java:1116)
>   at 
> org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$HttpConnectionAdapter.readLine(MultiThreadedHttpConnectionManager.java:1413)
>   at 
> org.apache.commons.httpclient.HttpMethodBase.readStatusLine(HttpMethodBase.java:1973)
>   at 
> org.apache.commons.httpclient.HttpMethodBase.readResponse(HttpMethodBase.java:1735)
>   at 
> org.apache.commons.httpclient.HttpMethodBase.execute(HttpMethodBase.java:1098)
>   at 
> org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:398)
>   at 
> org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171)
>   at 
> org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
>   at 
> org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)
>   at 
> or

solrj issue: SocketTimeout: read timed out, but commit succed on server.

2011-05-17 Thread Dennis Schafroth
Hi

I can see that others are having the same issue, but I haven't seen any fixes
or workarounds.


I am adding and deleting records in a mixed fashion, in batches of up to 1000
records. On the commit I see the following in the client:

2011-05-17 13:42:41 ERROR - harvester 
[main/com.indexdata.masterkey.localindices.harvest.storage.SolrRecordStorage] - 
Commit failed when adding 39900 and deleting 11666.
org.apache.solr.client.solrj.SolrServerException: 
java.net.SocketTimeoutException: Read timed out
at 
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:483)
at 
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
at 
org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
at org.apache.solr.client.solrj.SolrServer.commit(SolrServer.java:86)
at org.apache.solr.client.solrj.SolrServer.commit(SolrServer.java:75)
at 
com.indexdata.masterkey.localindices.harvest.storage.SolrRecordStorage.commit(SolrRecordStorage.java:47)
at 
com.indexdata.masterkey.localindices.harvest.storage.BulkSolrRecordStorage.commit(BulkSolrRecordStorage.java:101)
at 
com.indexdata.masterkey.localindices.harvest.job.OAIRecordHarvestJob.run(OAIRecordHarvestJob.java:146)
at 
com.indexdata.masterkey.localindices.harvest.job.TestOAIRecordHarvestJob.TestCleanFullBulkHarvestJob(TestOAIRecordHarvestJob.java:65)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at junit.framework.TestCase.runTest(TestCase.java:164)
at junit.framework.TestCase.runBare(TestCase.java:130)
at junit.framework.TestResult$1.protect(TestResult.java:106)
at junit.framework.TestResult.runProtected(TestResult.java:124)
at junit.framework.TestResult.run(TestResult.java:109)
at junit.framework.TestCase.run(TestCase.java:120)
at 
org.eclipse.jdt.internal.junit.runner.junit3.JUnit3TestReference.run(JUnit3TestReference.java:130)
at 
org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)
Caused by: java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:129)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
at 
org.apache.commons.httpclient.HttpParser.readRawLine(HttpParser.java:78)
at 
org.apache.commons.httpclient.HttpParser.readLine(HttpParser.java:106)
at 
org.apache.commons.httpclient.HttpConnection.readLine(HttpConnection.java:1116)
at 
org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$HttpConnectionAdapter.readLine(MultiThreadedHttpConnectionManager.java:1413)
at 
org.apache.commons.httpclient.HttpMethodBase.readStatusLine(HttpMethodBase.java:1973)
at 
org.apache.commons.httpclient.HttpMethodBase.readResponse(HttpMethodBase.java:1735)
at 
org.apache.commons.httpclient.HttpMethodBase.execute(HttpMethodBase.java:1098)
at 
org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:398)
at 
org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171)
at 
org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
at 
org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)
at 
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:427)
... 24 more
 

But the server seems pretty happy anyway: 

17-05-2011 13:42:40 org.apache.solr.update.DirectUpdateHandler2 commit
INFO: start 
commit(optimize=false,waitFlush=true,waitSearcher=true,expungeDeletes=false)
17-05-2011 13:42:41 org.apache.solr.core.SolrDeletionPolicy onCommit
INFO: SolrDeletionPolicy.onCommit: commits:num=2

commit{dir=/Users/dennis/solr/solr/data/index,segFN=segments_28,version=1303911462910,generation=80,filenames=[_26.frq,
 _2c.nrm, _26.fnm, _2c.fdx, _2c.prx, _2b.fnm, _2c.fdt, _2d.tis, _2e.frq, 
_2e.prx, _2b.frq, _2b_1.del, _2a.tii, _2c.fnm, _2e.nrm, _2b.prx, _2a.tis, 
_2a.nrm, _2a.fdx, _2c.tis, _2e.tii, _2c.frq, _2e.fdx, _

Re: Import Handler for tokenizing facet string into multi-valued solr.StrField..

2011-01-27 Thread Dennis Schafroth
Thanks for the hints! 

Sorry about stealing the thread "query range in multivalued date field".
I mistakenly responded to it.

cheers,
:-Dennis 

On 27/01/2011, at 16.48, Erik Hatcher wrote:

> Beyond what Erick said, I'll add that it is often better to "do this from the 
> outside" and send in multiple actual end-user displayable facet values.  When 
> you send in a field like "Water -- Irrigation ; Water -- Sewage", that is 
> what will get stored (if you have it set to stored), but what you might 
> rather want is each individual value stored, which can only be done by the 
> indexer sending in multiple values, not through just tokenization.
> 
>   Erik
> 
> On Jan 27, 2011, at 09:09 , Dennis Schafroth wrote:
> 
>> Hi, 
>> 
>> I'm pretty much a novice at Solr coding, but I'm looking for hints about how
>> (if not already done) to implement a PatternTokenizer that would index this
>> into multi-valued fields of solr.StrField for faceting. Example:
>> 
>> Water -- Irrigation ; Water -- Sewage
>> 
>> should be tokenized into
>> 
>> Water
>> Irrigation
>> Sewage
>> 
>> in multi-valued non-tokenized fields, for performance reasons. I could do it
>> from the outside, but I would like to use this as an opportunity to learn
>> about Solr.
>> 
>> It "works" as I want with the PatternTokenizerFactory when I am using
>> solr.TextField, but not when I am using the non-tokenized solr.StrField. But
>> according to what I have read, facet performance is better on non-tokenized
>> fields. We need better performance on our faceted searches on these
>> multi-valued fields (25 million documents, three multi-valued facets).
>> 
>> I would also need a filter that filters out identical values, as the
>> feeds have redundant data as shown above.
>> 
>> Can anyone point me in the right direction?
>> 
>> cheers, 
>> :-Dennis
> 
> 



Import Handler for tokenizing facet string into multi-valued solr.StrField..

2011-01-27 Thread Dennis Schafroth
Hi, 

I'm pretty much a novice at Solr coding, but I'm looking for hints about how
(if not already done) to implement a PatternTokenizer that would index this
into multi-valued fields of solr.StrField for faceting. Example:

Water -- Irrigation ; Water -- Sewage

should be tokenized into

Water
Irrigation
Sewage

in multi-valued non-tokenized fields, for performance reasons. I could do it
from the outside, but I would like to use this as an opportunity to learn
about Solr.

It "works" as I want with the PatternTokenizerFactory when I am using
solr.TextField, but not when I am using the non-tokenized solr.StrField. But
according to what I have read, facet performance is better on non-tokenized
fields. We need better performance on our faceted searches on these
multi-valued fields (25 million documents, three multi-valued facets).

I would also need a filter that filters out identical values, as the
feeds have redundant data as shown above.

Can anyone point me in the right direction?

cheers, 
:-Dennis 
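
The "do it from the outside" route suggested in the reply to this message can be sketched in a few lines on the indexing client. The delimiters (`--` and `;`) are taken from the example above; the function name is made up, and how the resulting values are sent to Solr (one value per field instance of a multi-valued solr.StrField) is left to the client library.

```python
import re


def split_subjects(raw):
    """Split a subject string on '--' and ';' and drop duplicate values.

    The order of first occurrence is kept.
    """
    seen = set()
    values = []
    for part in re.split(r"\s*(?:--|;)\s*", raw):
        value = part.strip()
        if value and value not in seen:
            seen.add(value)
            values.append(value)
    return values


print(split_subjects("Water -- Irrigation ; Water -- Sewage"))
# prints ['Water', 'Irrigation', 'Sewage']
```

Because the splitting and de-duplication happen before indexing, both the stored and the indexed values are the individual terms, which tokenization alone would not give.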

Re: Garbled facets even in a zero hit search

2010-09-09 Thread Dennis Schafroth

Thanks, that did it. 

On 09/09/2010, at 16.14, Markus Jelsma wrote:

> That's normal behavior if you haven't configured facet.mincount. Check the 
> wiki.
> 
> On Thursday 09 September 2010 16:05:01 Dennis Schafroth wrote:
>> I am definitely not ruling out that the index is garbled, but that doesn't
>> explain why I get facets on a zero-hit search.
>> 
>> The schema is as follows:
>> 
> 
> Markus Jelsma - Technisch Architect - Buyways BV
> http://www.linkedin.com/in/markus17
> 050-8536620 / 06-50258350
> 
> 



Re: Garbled facets even in a zero hit search

2010-09-09 Thread Dennis Schafroth

I am definitely not ruling out that the index is garbled, but that doesn't
explain why I get facets on a zero-hit search.

The schema is as follows:


satay.xml
Description: XML document


where I have copy fields for the facets (author_exact, subject_exact,
title_exact), as I don't want tokenization on these.

the request has the following parameters: 
facet=true&start=0&q=title:xyzy&f.date.facet.limit=10&f.subject_exact.facet.limit=10&facet.field=author_exact&facet.field=date&facet.field=subject_exact&f.author_exact.facet.limit=10&rows=20

I haven't been able to reproduce it in a test index yet, but I do have two
different indexes that show a similar problem (facets on zero hits).

cheers, 
:-Dennis Schafroth
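
The reply that resolved this thread points at facet.mincount: without it, Solr returns facet values with a count of 0, which is what shows up on a zero-hit search. As a small sketch, the helper below adds `facet.mincount=1` to the request parameters quoted in this message; the helper name is made up for illustration.

```python
from urllib.parse import parse_qsl, urlencode


def with_mincount(query_string, mincount=1):
    """Return the query string with facet.mincount set, keeping repeated keys."""
    params = [(k, v) for k, v in parse_qsl(query_string) if k != "facet.mincount"]
    params.append(("facet.mincount", str(mincount)))
    return urlencode(params)


original = (
    "facet=true&start=0&q=title:xyzy&f.date.facet.limit=10"
    "&f.subject_exact.facet.limit=10&facet.field=author_exact"
    "&facet.field=date&facet.field=subject_exact"
    "&f.author_exact.facet.limit=10&rows=20"
)
print(with_mincount(original))
```

With a minimum count of 1, facet values that do not occur in the result set are left out, so a zero-hit search returns empty facet lists.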

On 09/09/2010, at 15.10, Erick Erickson wrote:

> That looks...er...unfortunate. The very first thing I'd do is check your
> index
> and see if there are such weird values in your facet fields. My guess is
> that SOLR is working fine, but you somehow have garbage values
> in your index, but that's only a guess. I'd try that before trying to
> debug, GIGO.
> 
> Which wouldn't answer the question of how the garbage got in there
> in the first place, posting your field type definition for your
> faceted fields would help with that question.
> 
> Best
> Erick
> 
> On Thu, Sep 9, 2010 at 8:19 AM, Dennis Schafroth wrote:
> 
>> Hi,
>> 
>> Running on a Debian 5.0.5 64bit box. Using
>> solr-1.4.1 with Java version "1.6.0_20"
>> 
>> I am seeing weird facets results along with the "right" looking ones.
>> Garbled data, stuff that looks like a buffer overflow / index off by ...
>> 
>> And I even get them when I do a zero hit search. I wouldn't expect any
>> facets:
>> 
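>> 
>> <?xml version="1.0" encoding="UTF-8"?>
>> <response>
>>   <lst name="responseHeader">
>>     <int name="status">0</int>
>>     <int name="QTime">56</int>
>>     <lst name="params">
>>       <str name="facet">true</str>
>>       <str name="shards">satay:8985/solr</str>
>>       <str name="start">0</str>
>>       <str name="q">title:xzyzx</str>
>>       <str name="f.date.facet.limit">10</str>
>>       <str name="f.subject_exact.facet.limit">10</str>
>>       <arr name="facet.field">
>>         <str>author_exact</str>
>>         <str>date</str>
>>         <str>subject_exact</str>
>>       </arr>
>>       <str name="f.author_exact.facet.limit">10</str>
>>       <str name="rows">20</str>
>>     </lst>
>>   </lst>
>>   <result name="response" numFound="0" start="0"/>
>>   <lst name="facet_counts">
>>     <lst name="facet_queries"/>
>>     <lst name="facet_fields">
>>       <lst name="author_exact">
>>         <int name=" ">0</int>
>>         <int name=" !;;!">0</int>
>>         <int name=" (Domingo, Juan); Imprenta Tormentaria (Córdoba)">0</int>
>>         <int name=" (Supervisor)">0</int>
>>         <int name=" *">0</int>
>>         <int name=" * ">0</int>
>>         <int name=" * (μτφρ.)">0</int>
>>         <int name=" * * * ">0</int>
>>         <int name=" * * * (μτφρ.)">0</int>
>>         <int name=" * * * *">0</int>
>>       </lst>
>>       <lst name="date">
>>         <int name="">0</int>
>>         <int name="0001">0</int>
>>         <int name="0002">0</int>
>>         <int name="0003">0</int>
>>         <int name="0004">0</int>
>>         <int name="0005">0</int>
>>         <int name="0006">0</int>
>>         <int name="0007">0</int>
>>         <int name="0008">0</int>
>>         <int name="0009">0</int>
>>       </lst>
>>       <lst name="subject_exact">
>>         <int name=" ">0</int>
>>         <int name=" ! ! 
>> R P R">0</int>
>>         <int name=" !!rrqqyyhqhqwwllrqrqdd!!vvddvv">0</int>
>>         <int name=" !""$%"( )*+,($"(">0</int>
>>         <int name=" !()+, -./01 23456">0</int>
>>         <int name=" !-decidable and decidable deductive procedures for a restricted FTL with Unless">0</int>
>>         <int name=" !<f87.03...">0</int>
>>         <int name=" ")338-8570">0</int>
>>         <int name=" "-Optimization Schemes and L-Bit Precision: Alternative Perspectives in Combinatorial Optimization">0</int>
>>         <int name=" "A picture is worth 1K words"">0</int>
>>       </lst>
>>     </lst>
>>     <lst name="facet_dates"/>
>>   </lst>
>> </response>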
>> 
>> 
>> 
>> I tried to look for a bug report, but haven't been able to find one that
>> matches. I will try to set up a debug session to get closer, but would love
>> to get feedback if this is a known issue.
>> 
>> cheers,
>> 
>> :-Dennis Schafroth
>> 
>> 
>> 



Garbled facets even in a zero hit search

2010-09-09 Thread Dennis Schafroth
Hi,

Running on a Debian 5.0.5 64bit box. Using
solr-1.4.1 with Java version "1.6.0_20"

I am seeing weird facets results along with the "right" looking ones.
Garbled data, stuff that looks like a buffer overflow / index off by ...

And I even get them when I do a zero hit search. I wouldn't expect any
facets:

<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">56</int>
    <lst name="params">
      <str name="facet">true</str>
      <str name="shards">satay:8985/solr</str>
      <str name="start">0</str>
      <str name="q">title:xzyzx</str>
      <str name="f.date.facet.limit">10</str>
      <str name="f.subject_exact.facet.limit">10</str>
      <arr name="facet.field">
        <str>author_exact</str>
        <str>date</str>
        <str>subject_exact</str>
      </arr>
      <str name="f.author_exact.facet.limit">10</str>
      <str name="rows">20</str>
    </lst>
  </lst>
  <result name="response" numFound="0" start="0"/>
  <lst name="facet_counts">
    <lst name="facet_queries"/>
    <lst name="facet_fields">
      <lst name="author_exact">
        <int name=" ">0</int>
        <int name=" !;;!">0</int>
        <int name=" (Domingo, Juan); Imprenta Tormentaria (Córdoba)">0</int>
        <int name=" (Supervisor)">0</int>
        <int name=" *">0</int>
        <int name=" * ">0</int>
        <int name=" * (μτφρ.)">0</int>
        <int name=" * * * ">0</int>
        <int name=" * * * (μτφρ.)">0</int>
        <int name=" * * * *">0</int>
      </lst>
      <lst name="date">
        <int name="">0</int>
        <int name="0001">0</int>
        <int name="0002">0</int>
        <int name="0003">0</int>
        <int name="0004">0</int>
        <int name="0005">0</int>
        <int name="0006">0</int>
        <int name="0007">0</int>
        <int name="0008">0</int>
        <int name="0009">0</int>
      </lst>
      <lst name="subject_exact">
        <int name=" ">0</int>
        <int name=" ! ! 
R P R">0</int>
        <int name=" !!rrqqyyhqhqwwllrqrqdd!!vvddvv">0</int>
        <int name=" !""$%"( )*+,($"(">0</int>
        <int name=" !()+, -./01 23456">0</int>
        <int name=" !-decidable and decidable deductive procedures for a restricted FTL with Unless">0</int>
        <int name=" !<f87.03...">0</int>
        <int name=" ")338-8570">0</int>
        <int name=" "-Optimization Schemes and L-Bit Precision: Alternative Perspectives in Combinatorial Optimization">0</int>
        <int name=" "A picture is worth 1K words"">0</int>
      </lst>
    </lst>
    <lst name="facet_dates"/>
  </lst>
</response>

response_formated.xml
Description: XML document

I tried to look for a bug report, but haven't been able to find one that
matches. I will try to set up a debug session to get closer, but would love
to get feedback if this is a known issue.

cheers,
:-Dennis Schafroth