Hi,

I solved this, at least for now, by changing the type of the "content" field
from string to text_general:

  <field name="content" type="text_general"/>
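
For reference, this is roughly the stock text_general type from the default
Solr schema (quoted from memory, so details such as the stopwords.txt file
name may differ in your install); the point is that it runs a real tokenizer
instead of indexing the whole page content as a single term:

  <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <!-- splits the text on word boundaries, so no single term can reach 40063 bytes -->
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>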

José-Marcio

On 06/22/2016 02:05 PM, Markus Jelsma wrote:
Yes, this happens if you use a recent Solr with the managed schema; it
apparently treats text as string types. There is a ticket open to change
that to TextField, though.
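
If you are stuck with the managed schema, the guessed type can also be
replaced through the Schema API instead of editing files by hand; a sketch,
assuming the core is named carerate as in the original command:

  curl -X POST -H 'Content-type:application/json' \
    'http://localhost:8983/solr/carerate/schema' -d '{
      "replace-field": {"name":"content", "type":"text_general", "stored":true}
    }'

Documents indexed before the change keep their old analysis, so a reindex is
still needed afterwards.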
Markus



-----Original message-----
From: Sebastian Nagel <[email protected]>
Sent: Tuesday 21st June 2016 23:15
To: [email protected]
Subject: Re: immense term, Correcting analyzer

Hi,

you are right: it looks like the field "content" is indexed as one single term
and is not split ("tokenized") into words. The best way would be
to use the schema.xml shipped with Nutch ($NUTCH_HOME/conf/schema.xml),
see https://wiki.apache.org/nutch/NutchTutorial#Integrate_Solr_with_Nutch
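
Concretely, that means putting the Nutch schema into the core's conf
directory and reloading the core; a sketch, assuming a default Solr layout
and a core named carerate (your paths may differ):

  # with a managed schema you may also have to switch solrconfig.xml
  # to ClassicIndexSchemaFactory so that schema.xml is actually read
  cp $NUTCH_HOME/conf/schema.xml $SOLR_HOME/server/solr/carerate/conf/schema.xml
  curl 'http://localhost:8983/solr/admin/cores?action=RELOAD&core=carerate'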

> BTW, I am using Nutch 1.11 and Solr 6.0.0

Nutch 1.11 requires Solr 4.10.2; other versions may work (or may not!)

Sebastian


On 06/21/2016 08:04 PM, shakiba davari wrote:
Hi guys, I'm trying to index my Nutch-crawled data by running:

bin/nutch index -D solr.server.url="http://localhost:8983/solr/carerate" \
  crawl/crawldb -linkdb crawl/linkdb crawl/segments/2016*

At first it was working totally fine: I indexed my data, sent a few queries,
and received good results. But then I ran the crawl again so that it would
crawl to a greater depth and fetch more pages; after that run the crawler's
status showed 1051 unfetched and 151 fetched pages in my db. Now, when I run
the nutch index command, I get "java.io.IOException: Job failed!"
Here is my log:

java.lang.Exception: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Exception writing document id http://www.cs.toronto.edu/~frank/About_Me/about_me.html to the index; possible analysis error: Document contains at least one immense term in field="content" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped.  Please correct the analyzer to not produce such terms.  The prefix of the first immense term is: '[70, 114, 97, 110, 107, 32, 82, 117, 100, 122, 105, 99, 122, 32, 45, 32, 65, 98, 111, 117, 116, 32, 77, 101, 32, 97, 98, 111, 117, 116]...', original message: bytes can be at most 32766 in length; got 40063. Perhaps the document has an indexed string field (solr.StrField) which is too large
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
Caused by: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Exception writing document id http://www.cs.toronto.edu/~frank/About_Me/about_me.html to the index; possible analysis error: Document contains at least one immense term in field="content" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped.  Please correct the analyzer to not produce such terms.  The prefix of the first immense term is: '[70, 114, 97, 110, 107, 32, 82, 117, 100, 122, 105, 99, 122, 32, 45, 32, 65, 98, 111, 117, 116, 32, 77, 101, 32, 97, 98, 111, 117, 116]...', original message: bytes can be at most 32766 in length; got 40063. Perhaps the document has an indexed string field (solr.StrField) which is too large
at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:552)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
at org.apache.nutch.indexwriter.solr.SolrIndexWriter.close(SolrIndexWriter.java:153)
at org.apache.nutch.indexer.IndexWriters.close(IndexWriters.java:115)
at org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.java:44)
at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.close(ReduceTask.java:502)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:456)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
2016-06-21 13:27:37,994 ERROR indexer.IndexingJob - Indexer: java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:145)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:222)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:231)


I realize that the mentioned page does contain a really long term, so in
schema.xml and managed-schema I changed the type of "id", "content", and
"text" from "strings" to "text_general":
<field name="id" type="text_general"/>
but it didn't solve the problem.
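
For what it's worth, the Schema API should show whether my edit actually made
it into the live schema (a sketch, with my core name carerate assumed and the
core reloaded after the edit):

  curl 'http://localhost:8983/solr/carerate/schema/fields/content'

If this still reports solr.StrField, the edit never took effect; as far as I
understand, a managed-schema file usually takes precedence over schema.xml.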
I'm no expert, so I'm not sure how to correct the analyzer without breaking
something else. I've read somewhere that I can
1. use, in the index analyzer, a LengthFilterFactory in order to filter out
tokens that don't fall within a requested length range, or
2. use, in the index analyzer, a TruncateTokenFilterFactory to cap the
length of indexed tokens (see the sketch after this list).
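
Something like this is what I have in mind for the index analyzer (a sketch
only; the 4096 limit is an arbitrary number I picked, well under Lucene's
32766-byte cap):

  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- option 1: silently drop tokens longer than 4096 characters -->
    <filter class="solr.LengthFilterFactory" min="1" max="4096"/>
    <!-- option 2, instead of 1: keep only the first 4096 characters of each token -->
    <!-- <filter class="solr.TruncateTokenFilterFactory" prefixLength="4096"/> -->
  </analyzer>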

But there are so many analyzers in the schema. Should I change the analyzer
defined for <fieldType name="text_general"...>? If yes, since "content" and
other fields have the type text_general, isn't that going to affect all of
them too?
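
One idea I had to avoid that: give "content" its own copy of the type, so
the other text_general fields keep the stock analysis (the type name
text_content below is made up by me):

  <fieldType name="text_content" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.LengthFilterFactory" min="1" max="4096"/>
    </analyzer>
  </fieldType>
  <field name="content" type="text_content" indexed="true" stored="true"/>

But I'm not sure whether that is the right approach either.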

I would really appreciate any help.
BTW, I am using Nutch 1.11 and Solr 6.0.0

Shakiba Davari <https://ca.linkedin.com/pub/shakiba-davari/84/417/b57>





--

 Sent from my typewriter.
 ---------------------------------------------------------------
  Spam : Classement statistique de messages électroniques -
         Une approche pragmatique
  At Amazon.fr: http://amzn.to/LEscRu or http://bit.ly/SpamJM
 ---------------------------------------------------------------
 Jose Marcio MARTINS DA CRUZ            http://www.j-chkmail.org
 Ecole des Mines de Paris                   http://bit.ly/SpamJM
 60, bd Saint Michel                      75272 - PARIS CEDEX 06
