Thanks for your comments, guys. As I said, I've tried all of these things, but none of them worked. Anyway, I don't have more time to spend on it, and I was able to get the job done another way. So, at least for now, I guess I won't need it anymore. Thanks again.

Shakiba Davari
https://ca.linkedin.com/pub/shakiba-davari/84/417/b57

On Wed, Jun 22, 2016 at 9:24 AM, Jose-Marcio Martins da Cruz <[email protected]> wrote:

> Hi,
>
> I solved this, at least for now, by changing the type of the "content" field
> from string to text_general:
>
>     <field name="content" type="text_general"/>
>
> José-Marcio
>
> On 06/22/2016 02:05 PM, Markus Jelsma wrote:
>
>> Yes, this happens if you use a recent Solr with the managed schema: it
>> apparently treats text as string types. There's a ticket to change that to
>> TextField, though.
>> Markus
>>
>> -----Original message-----
>>
>>> From: Sebastian Nagel <[email protected]>
>>> Sent: Tuesday 21st June 2016 23:15
>>> To: [email protected]
>>> Subject: Re: immense term, Correcting analyzer
>>>
>>> Hi,
>>>
>>> You are right: it looks like the field "content" is indexed as one single
>>> term and is not split ("tokenized") into words. The best way would be to
>>> use the schema.xml shipped with Nutch ($NUTCH_HOME/conf/schema.xml), see
>>> https://wiki.apache.org/nutch/NutchTutorial#Integrate_Solr_with_Nutch
>>>
>>>> BTW, I am using Nutch 1.11 and Solr 6.0.0.
>>>
>>> Nutch 1.11 requires Solr 4.10.2; other versions may work (or may not!).
>>>
>>> Sebastian
>>>
>>> On 06/21/2016 08:04 PM, shakiba davari wrote:
>>>
>>>> Hi guys, I'm trying to index my Nutch-crawled data by running:
>>>>
>>>>     bin/nutch index -D solr.server.url="http://localhost:8983/solr/carerate"
>>>>       crawl/crawldb -linkdb crawl/linkdb crawl/segments/2016*
>>>>
>>>> At first it was working fine: I indexed my data, sent a few queries, and
>>>> received good results. But then I ran the crawl again so that it crawls
>>>> to a greater depth and fetches more pages. After that, the crawler's
>>>> status showed 1051 unfetched and 151 fetched pages in my db, and now when
>>>> I run the nutch index command it fails with "java.io.IOException: Job
>>>> failed!". Here is my log:
>>>>
>>>> java.lang.Exception: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
>>>> Exception writing document id http://www.cs.toronto.edu/~frank/About_Me/about_me.html
>>>> to the index; possible analysis error: Document contains at least one
>>>> immense term in field="content" (whose UTF8 encoding is longer than the
>>>> max length 32766), all of which were skipped. Please correct the analyzer
>>>> to not produce such terms. The prefix of the first immense term is:
>>>> '[70, 114, 97, 110, 107, 32, 82, 117, 100, 122, 105, 99, 122, 32, 45, 32,
>>>> 65, 98, 111, 117, 116, 32, 77, 101, 32, 97, 98, 111, 117, 116]...',
>>>> original message: bytes can be at most 32766 in length; got 40063.
>>>> Perhaps the document has an indexed string field (solr.StrField) which is
>>>> too large
>>>>     at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
>>>>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
>>>> Caused by: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
>>>> Exception writing document id http://www.cs.toronto.edu/~frank/About_Me/about_me.html
>>>> to the index; possible analysis error: Document contains at least one
>>>> immense term in field="content" (whose UTF8 encoding is longer than the
>>>> max length 32766), all of which were skipped. Please correct the analyzer
>>>> to not produce such terms.
>>>> The prefix of the first immense term is: '[70, 114, 97, 110, 107, 32, 82,
>>>> 117, 100, 122, 105, 99, 122, 32, 45, 32, 65, 98, 111, 117, 116, 32, 77,
>>>> 101, 32, 97, 98, 111, 117, 116]...', original message: bytes can be at
>>>> most 32766 in length; got 40063. Perhaps the document has an indexed
>>>> string field (solr.StrField) which is too large
>>>>     at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:552)
>>>>     at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
>>>>     at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
>>>>     at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
>>>>     at org.apache.nutch.indexwriter.solr.SolrIndexWriter.close(SolrIndexWriter.java:153)
>>>>     at org.apache.nutch.indexer.IndexWriters.close(IndexWriters.java:115)
>>>>     at org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.java:44)
>>>>     at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.close(ReduceTask.java:502)
>>>>     at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:456)
>>>>     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
>>>>     at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
>>>>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>>>>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>>>     at java.lang.Thread.run(Thread.java:745)
>>>> 2016-06-21 13:27:37,994 ERROR indexer.IndexingJob - Indexer: java.io.IOException: Job failed!
>>>>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
>>>>     at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:145)
>>>>     at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:222)
>>>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>>>>     at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:231)
>>>>
>>>> I realize that the page mentioned above contains a really long term, so
>>>> in schema.xml and managed-schema I changed the type of "id", "content",
>>>> and "text" from "strings" to "text_general":
>>>>
>>>>     <field name="id" type="text_general">
>>>>
>>>> but it didn't solve the problem.
>>>> I'm no expert, so I'm not sure how to correct the analyzer without
>>>> screwing up something else. I've read somewhere that I can
>>>> 1. use (in the index analyzer) a LengthFilterFactory in order to filter
>>>>    out tokens that don't fall within a requested length range, or
>>>> 2. use (in the index analyzer) a TruncateTokenFilterFactory to fix the
>>>>    maximum length of indexed tokens.
>>>>
>>>> But there are so many analyzers in the schema. Should I change the
>>>> analyzer defined for <fieldType name="text_general"...>? If so, since
>>>> "content" and other fields are of type text_general, won't it affect all
>>>> of them too?
>>>>
>>>> I would really appreciate any help.
>>>>
>>>> BTW, I am using Nutch 1.11 and Solr 6.0.0.
>>>>
>>>> Shakiba Davari
>>>> https://ca.linkedin.com/pub/shakiba-davari/84/417/b57
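
For anyone hitting the same error: José-Marcio's fix above amounts to declaring "content" as a tokenized text field rather than a string field, so the analyzer splits the page text into words instead of indexing it as one immense term. A minimal sketch of the relevant schema.xml entries is below; the field attributes and the text_general definition follow the stock Solr example schema, so your own (managed) schema may differ slightly:

    <!-- "content" declared as a tokenized text field instead of a string field.
         indexed/stored values here are illustrative; keep whatever your schema uses. -->
    <field name="content" type="text_general" indexed="true" stored="true"/>

    <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <!-- The tokenizer splits content into word-sized tokens, so no single
             term comes anywhere near the 32766-byte limit. -->
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>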
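
And for reference, the two analyzer options mentioned in the original post would look roughly like the sketch below, as a guard against over-long tokens. The field type name "text_content" and the length limits are made up for illustration, and either filter on its own is normally enough; defining a separate field type like this and using it only for "content" also avoids changing the behaviour of every other field that uses text_general:

    <fieldType name="text_content" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <!-- Option 1: drop any token shorter than 1 or longer than 256 characters. -->
        <filter class="solr.LengthFilterFactory" min="1" max="256"/>
        <!-- Option 2: truncate every token to at most 256 characters instead of dropping it. -->
        <filter class="solr.TruncateTokenFilterFactory" prefixLength="256"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

    <field name="content" type="text_content" indexed="true" stored="true"/>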

