Hi,

you are right: it looks like the field "content" is indexed as one single term
and is not split ("tokenized") into words.  The best fix is to use the
schema.xml shipped with Nutch ($NUTCH_HOME/conf/schema.xml), see
https://wiki.apache.org/nutch/NutchTutorial#Integrate_Solr_with_Nutch
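
For reference, the relevant definitions in that schema look roughly like
the sketch below (type names and attributes are approximate, please check
the schema.xml of your Nutch version for the exact ones): "content" uses a
tokenized text type instead of a plain string type, so the page text is
split into short word tokens that stay well below Lucene's 32766-byte
term limit.

  <!-- sketch only: a tokenized text type for the "content" field -->
  <fieldType name="text_general" class="solr.TextField"
             positionIncrementGap="100">
    <analyzer>
      <!-- split the text into word tokens instead of one huge term -->
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>
  <field name="content" type="text_general" indexed="true" stored="false"/>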

> BTW, am using nutch 1.11 and solr 6.0.0
Nutch 1.11 requires Solr 4.10.2; other versions may work (or may not!).

Sebastian


On 06/21/2016 08:04 PM, shakiba davari wrote:
> Hi guys, I'm trying to index my Nutch-crawled data by running:
> 
> bin/nutch index -D solr.server.url="http://localhost:8983/solr/carerate"
> crawl/crawldb -linkdb crawl/linkdb crawl/segments/2016*
> 
> At first it was working totally OK. I indexed my data, sent a few queries,
> and received good results. But then I ran the crawl again so that it
> crawls to a bigger depth and fetches more pages; the last time, the
> crawler's status showed 1051 unfetched and 151 fetched pages in my db. And
> now when I run the nutch index command, I get "java.io.IOException:
> Job failed!"
> Here is my log:
> 
> java.lang.Exception:
> org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
> Exception writing document id
> http://www.cs.toronto.edu/~frank/About_Me/about_me.html to the index;
> possible analysis error: Document contains at least one immense term in
> field="content" (whose UTF8 encoding is longer than the max length 32766),
> all of which were skipped.  Please correct the analyzer to not produce such
> terms.  The prefix of the first immense term is: '[70, 114, 97, 110, 107,
> 32, 82, 117, 100, 122, 105, 99, 122, 32, 45, 32, 65, 98, 111, 117, 116, 32,
> 77, 101, 32, 97, 98, 111, 117, 116]...', original message: bytes can be at
> most 32766 in length; got 40063. Perhaps the document has an indexed string
> field (solr.StrField) which is too large
> at
> org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
> at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
> Caused by:
> org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
> Exception writing document id
> http://www.cs.toronto.edu/~frank/About_Me/about_me.html to the index;
> possible analysis error: Document contains at least one immense term in
> field="content" (whose UTF8 encoding is longer than the max length 32766),
> all of which were skipped.  Please correct the analyzer to not produce such
> terms.  The prefix of the first immense term is: '[70, 114, 97, 110, 107,
> 32, 82, 117, 100, 122, 105, 99, 122, 32, 45, 32, 65, 98, 111, 117, 116, 32,
> 77, 101, 32, 97, 98, 111, 117, 116]...', original message: bytes can be at
> most 32766 in length; got 40063. Perhaps the document has an indexed string
> field (solr.StrField) which is too large
> at
> org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:552)
> at
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
> at
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
> at
> org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
> at
> org.apache.nutch.indexwriter.solr.SolrIndexWriter.close(SolrIndexWriter.java:153)
> at org.apache.nutch.indexer.IndexWriters.close(IndexWriters.java:115)
> at
> org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.java:44)
> at
> org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.close(ReduceTask.java:502)
> at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:456)
> at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
> at
> org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> 2016-06-21 13:27:37,994 ERROR indexer.IndexingJob - Indexer:
> java.io.IOException: Job failed!
> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
> at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:145)
> at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:222)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:231)
> 
> 
> I realize that the mentioned page contains a really long term, so in
> schema.xml and managed-schema I changed the type of "id", "content", and
> "text" from "strings" to "text_general":
> <field name="id" type="text_general">
> but it didn't solve the problem.
> I'm no expert, so I'm not sure how to correct the analyzer without screwing
> up something else. I've read somewhere that I can
> 1. use (in the index analyzer) a LengthFilterFactory in order to filter out
> those tokens that don't fall within a requested length range, or
> 2. use (in the index analyzer) a TruncateTokenFilterFactory to cap the max
> length of indexed tokens (roughly as sketched below).
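> 
> (If I understand the Solr docs correctly, either filter would go into the
> index analyzer of the field type; the min/max and prefixLength values
> below are just placeholders I made up.)
> 
>   <analyzer type="index">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <!-- option 1: drop tokens outside this length range -->
>     <filter class="solr.LengthFilterFactory" min="1" max="1000"/>
>     <!-- option 2: cut long tokens down to a fixed prefix length -->
>     <filter class="solr.TruncateTokenFilterFactory" prefixLength="100"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>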
> 
> but there are so many analyzers in the schema. Should I change the analyzer
> defined for <fieldType name="text_general"...>? If yes, since the content
> and other fields' type is text_general, isn't that going to affect all of
> them too?
> 
> I would really appreciate any help.
> BTW, am using nutch 1.11 and solr 6.0.0
> 
> Shakiba Davari <https://ca.linkedin.com/pub/shakiba-davari/84/417/b57>
> 
