Hi guys, I'm trying to index my Nutch-crawled data by running:

bin/nutch index -D solr.server.url="http://localhost:8983/solr/carerate" crawl/crawldb -linkdb crawl/linkdb crawl/segments/2016*
At first it was working totally OK: I indexed my data, sent a few queries, and received good results. But then I ran the crawl again so that it crawls to a greater depth and fetches more pages; the last crawl's status showed 1051 unfetched and 151 fetched pages in my db. Now when I run the nutch index command, I get "java.io.IOException: Job failed!". Here is my log:

java.lang.Exception: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Exception writing document id http://www.cs.toronto.edu/~frank/About_Me/about_me.html to the index; possible analysis error: Document contains at least one immense term in field="content" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[70, 114, 97, 110, 107, 32, 82, 117, 100, 122, 105, 99, 122, 32, 45, 32, 65, 98, 111, 117, 116, 32, 77, 101, 32, 97, 98, 111, 117, 116]...', original message: bytes can be at most 32766 in length; got 40063. Perhaps the document has an indexed string field (solr.StrField) which is too large
        at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
Caused by: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Exception writing document id http://www.cs.toronto.edu/~frank/About_Me/about_me.html to the index; possible analysis error: Document contains at least one immense term in field="content" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[70, 114, 97, 110, 107, 32, 82, 117, 100, 122, 105, 99, 122, 32, 45, 32, 65, 98, 111, 117, 116, 32, 77, 101, 32, 97, 98, 111, 117, 116]...', original message: bytes can be at most 32766 in length; got 40063. Perhaps the document has an indexed string field (solr.StrField) which is too large
        at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:552)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
        at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
        at org.apache.nutch.indexwriter.solr.SolrIndexWriter.close(SolrIndexWriter.java:153)
        at org.apache.nutch.indexer.IndexWriters.close(IndexWriters.java:115)
        at org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.java:44)
        at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.close(ReduceTask.java:502)
        at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:456)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
        at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
2016-06-21 13:27:37,994 ERROR indexer.IndexingJob - Indexer: java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
        at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:145)
        at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:222)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:231)

I realize that the page mentioned in the error really does contain a very long term, so in schema.xml and managed-schema I changed the type of "id", "content", and "text" from "strings" to "text_general", e.g. <field name="id" type="text_general">, but it didn't solve the problem.
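For reference, the changed field definitions now look roughly like this (I'm quoting from memory, so the stored/indexed attributes here may not match my actual schema.xml exactly):

  <field name="id" type="text_general" stored="true" indexed="true" required="true"/>
  <field name="content" type="text_general" stored="false" indexed="true"/>
  <field name="text" type="text_general" stored="false" indexed="true" multiValued="true"/>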
I'm no expert, so I'm not sure how to correct the analyzer without breaking something else. I've read somewhere that I can:
1. use a LengthFilterFactory (in the index analyzer) to filter out tokens that don't fall within a requested length range; or
2. use a TruncateTokenFilterFactory (in the index analyzer) to cap the length of indexed tokens.
But there are so many analyzers in the schema. Should I change the analyzer defined for <fieldType name="text_general"...>? If so, since "content" and the other fields are of type text_general, isn't that going to affect all of them too? My rough idea of the change is sketched below.
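To be clear, this is only a sketch of what I mean, not my actual schema: the tokenizer and other filters are simplified rather than copied from the default text_general definition, and the min/max and prefixLength values are arbitrary numbers I made up.

  <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <!-- option 1: drop tokens outside a length range -->
      <filter class="solr.LengthFilterFactory" min="1" max="1000"/>
      <!-- option 2 (instead of the filter above): keep only a prefix of very long tokens -->
      <!-- <filter class="solr.TruncateTokenFilterFactory" prefixLength="100"/> -->
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>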
I would really appreciate any help. BTW, I'm using Nutch 1.11 and Solr 6.0.0.

Shakiba Davari
https://ca.linkedin.com/pub/shakiba-davari/84/417/b57