You also need to find out _why_ you're trying to index such huge tokens; they indicate that something you're ingesting isn't reasonable.
Just truncating the input will index things, true. But a 32K token is unexpected, and it indicates that what's in your index may not be what you expect and may not be useful. You know what you're indexing best, though; this is just a general statement. (Config sketches for the LengthFilterFactory and update-processor approaches mentioned below are appended after the quoted thread.)

Erick

On Fri, Aug 5, 2016 at 12:55 PM, Musshorn, Kris T CTR USARMY RDECOM ARL (US) <kris.t.musshorn....@mail.mil> wrote:
> CLASSIFICATION: UNCLASSIFIED
>
> What I did was force Nutch to truncate content to 32765 max before
> indexing into Solr, and it solved my problem.
>
> Thanks,
> Kris
>
> ~~~~~~~~~~~~~~~~~~~~~~~~~~
> Kris T. Musshorn
> FileMaker Developer - Contractor - Catapult Technology Inc.
> US Army Research Lab
> Aberdeen Proving Ground
> Application Management & Development Branch
> 410-278-7251
> kris.t.musshorn....@mail.mil
> ~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> -----Original Message-----
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: Friday, August 05, 2016 3:29 PM
> To: solr-user <solr-user@lucene.apache.org>
> Subject: [Non-DoD Source] Re: Solr 6.1.0 issue (UNCLASSIFIED)
>
> What that error is telling you is that you have an unanalyzed term that
> is, well, huge (i.e. > 32K). Is your "content" field by chance a
> "string" type? It's very rare that a term > 32K is actually useful. You
> can't search on it except with, say, wildcards; there's no stemming,
> etc. So the first question is whether the "content" field is
> appropriately defined in your schema for your use case.
>
> If your content field is some kind of text-based field (i.e.
> solr.TextField), then the second issue may be that you just have wonky
> data coming in, say a base64-encoded image or something scraped from
> somewhere. In that case you need to NOT index it. You can try
> LengthFilterFactory, see:
> https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.LengthFilterFactory
>
> This is a fundamental limitation enforced at the Lucene layer, so if
> that doesn't work, the only real solution is "don't do that". You'll
> have to intercept the doc and omit that data, perhaps with a custom
> update processor that throws out huge fields or the like.
>
> Best,
> Erick
>
> On Fri, Aug 5, 2016 at 10:59 AM, Musshorn, Kris T CTR USARMY RDECOM ARL (US) <kris.t.musshorn....@mail.mil> wrote:
>> CLASSIFICATION: UNCLASSIFIED
>>
>> I am trying to index from Nutch 1.12 to Solr 6.1.0.
>> Got this error:
>>
>> java.lang.Exception: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
>> Error from server at http://localhost:8983/solr/ARLInside: Exception
>> writing document id
>> https://emcstage.arl.army.mil/inside/fellows/corner/research.vol.3.2/index.cfm
>> to the index; possible analysis error: Document contains at least one
>> immense term in field="content" (whose UTF8 encoding is longer than
>> the max length 32766
>>
>> How to correct?
>>
>> Thanks,
>> Kris
>>
>> ~~~~~~~~~~~~~~~~~~~~~~~~~~
>> Kris T. Musshorn
>> FileMaker Developer - Contractor - Catapult Technology Inc.
>> US Army Research Lab
>> Aberdeen Proving Ground
>> Application Management & Development Branch
>> 410-278-7251
>> kris.t.musshorn....@mail.mil
>> ~~~~~~~~~~~~~~~~~~~~~~~~~~
>>
>> CLASSIFICATION: UNCLASSIFIED
>
> CLASSIFICATION: UNCLASSIFIED
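
For readers hitting the same error, here is a minimal sketch of the
LengthFilterFactory approach mentioned above, assuming a schema.xml or
managed schema you can edit. The fieldType name, tokenizer choice, and
the 255-character cap are illustrative, not from the thread. Note that
LengthFilterFactory's min/max count characters, while Lucene's 32766
limit counts UTF-8 bytes, so leave headroom if your content contains
multi-byte characters:

  <fieldType name="text_capped" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <!-- silently drop any token outside 1..255 characters (illustrative cap) -->
      <filter class="solr.LengthFilterFactory" min="1" max="255"/>
    </analyzer>
  </fieldType>

  <!-- point the problem field at the tokenized type; a plain "string" field
       indexes the whole value as one term, which is what trips the limit -->
  <field name="content" type="text_capped" indexed="true" stored="true"/>

If "content" is currently a string type, simply switching it to a
tokenized type like this usually makes the immense-term error go away,
since individual tokens are far shorter than the whole value.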
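
And a sketch of the "intercept the doc" option: rather than writing a
custom update processor from scratch, Solr ships a stock
TruncateFieldUpdateProcessorFactory that caps a field's raw value before
it reaches analysis, roughly the Solr-side equivalent of the Nutch
truncation Kris used. The chain name and the 32000 limit below are
illustrative; maxLength counts characters, so a value full of multi-byte
characters can still exceed 32766 bytes in a string field, and the cap
should be chosen accordingly:

  <updateRequestProcessorChain name="truncate-content">
    <!-- cut the raw "content" value to 32000 characters before analysis -->
    <processor class="solr.TruncateFieldUpdateProcessorFactory">
      <str name="fieldName">content</str>
      <int name="maxLength">32000</int>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory"/>
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>

The chain only runs when indexing requests select it, e.g. by adding
update.chain=truncate-content to update requests or by making it the
default for the /update handler in solrconfig.xml.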