Thanks for your comments, guys. As I said, I've tried all of these things, but none of them worked. Anyway, I don't have more time to spend on it, and I was able to get the job done another way. So, at least for now, I guess I won't need it anymore. Thanks again.

Shakiba Davari
https://ca.linkedin.com/pub/shakiba-davari/84/417/b57

On Wed, Jun 22, 2016 at 9:24 AM, Jose-Marcio Martins da Cruz <[email protected]> wrote:

> Hi,
>
> I solved this, at least for now, by changing the type of the "content" field
> from string to text_general:
>
>     <field name="content" type="text_general"/>
>
> José-Marcio
>
> On 06/22/2016 02:05 PM, Markus Jelsma wrote:
>
>> Yes, this happens if you use a recent Solr with the managed schema: it
>> apparently treats text as string types. There's a ticket to change that to
>> TextField, though.
>> Markus
>>
>> -----Original message-----
>>
>>> From: Sebastian Nagel <[email protected]>
>>> Sent: Tuesday 21st June 2016 23:15
>>> To: [email protected]
>>> Subject: Re: immense term, Correcting analyzer
>>>
>>> Hi,
>>>
>>> You are right: it looks like the field "content" is indexed as one single
>>> term and is not split ("tokenized") into words. The best way would be to
>>> use the schema.xml shipped with Nutch ($NUTCH_HOME/conf/schema.xml), see
>>> https://wiki.apache.org/nutch/NutchTutorial#Integrate_Solr_with_Nutch
>>>
>>>> BTW, I am using Nutch 1.11 and Solr 6.0.0.
>>>
>>> Nutch 1.11 requires Solr 4.10.2; other versions may work (or may not!).
>>>
>>> Sebastian
>>>
>>> On 06/21/2016 08:04 PM, shakiba davari wrote:
>>>
>>>> Hi guys, I'm trying to index my Nutch-crawled data by running:
>>>>
>>>>     bin/nutch index -D solr.server.url="http://localhost:8983/solr/carerate"
>>>>       crawl/crawldb -linkdb crawl/linkdb crawl/segments/2016*
>>>>
>>>> At first it was working fine: I indexed my data, sent a few queries, and
>>>> received good results. But then I ran the crawl again so that it crawls
>>>> to a greater depth and fetches more pages. After that, the crawler's
>>>> status showed 1051 unfetched and 151 fetched pages in my db, and now when
>>>> I run the nutch index command it fails with "java.io.IOException: Job
>>>> failed!". Here is my log:
>>>>
>>>> java.lang.Exception: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
>>>> Exception writing document id http://www.cs.toronto.edu/~frank/About_Me/about_me.html
>>>> to the index; possible analysis error: Document contains at least one
>>>> immense term in field="content" (whose UTF8 encoding is longer than the
>>>> max length 32766), all of which were skipped. Please correct the analyzer
>>>> to not produce such terms. The prefix of the first immense term is:
>>>> '[70, 114, 97, 110, 107, 32, 82, 117, 100, 122, 105, 99, 122, 32, 45, 32,
>>>> 65, 98, 111, 117, 116, 32, 77, 101, 32, 97, 98, 111, 117, 116]...',
>>>> original message: bytes can be at most 32766 in length; got 40063.
>>>> Perhaps the document has an indexed string field (solr.StrField) which is
>>>> too large
>>>>     at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
>>>>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
>>>> Caused by: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
>>>> Exception writing document id http://www.cs.toronto.edu/~frank/About_Me/about_me.html
>>>> to the index; possible analysis error: Document contains at least one
>>>> immense term in field="content" (whose UTF8 encoding is longer than the
>>>> max length 32766), all of which were skipped. Please correct the analyzer
>>>> to not produce such terms.
>>>> The prefix of the first immense term is: '[70, 114, 97, 110, 107, 32, 82,
>>>> 117, 100, 122, 105, 99, 122, 32, 45, 32, 65, 98, 111, 117, 116, 32, 77,
>>>> 101, 32, 97, 98, 111, 117, 116]...', original message: bytes can be at
>>>> most 32766 in length; got 40063. Perhaps the document has an indexed
>>>> string field (solr.StrField) which is too large
>>>>     at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:552)
>>>>     at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
>>>>     at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
>>>>     at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
>>>>     at org.apache.nutch.indexwriter.solr.SolrIndexWriter.close(SolrIndexWriter.java:153)
>>>>     at org.apache.nutch.indexer.IndexWriters.close(IndexWriters.java:115)
>>>>     at org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.java:44)
>>>>     at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.close(ReduceTask.java:502)
>>>>     at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:456)
>>>>     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
>>>>     at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
>>>>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>>>>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>>>     at java.lang.Thread.run(Thread.java:745)
>>>> 2016-06-21 13:27:37,994 ERROR indexer.IndexingJob - Indexer: java.io.IOException: Job failed!
>>>>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
>>>>     at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:145)
>>>>     at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:222)
>>>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>>>>     at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:231)
>>>>
>>>> I realize that the page mentioned above contains a really long term, so
>>>> in schema.xml and managed-schema I changed the type of "id", "content",
>>>> and "text" from "strings" to "text_general":
>>>>
>>>>     <field name="id" type="text_general">
>>>>
>>>> but it didn't solve the problem.
>>>> I'm no expert, so I'm not sure how to correct the analyzer without
>>>> screwing up something else. I've read somewhere that I can
>>>> 1. use (in the index analyzer) a LengthFilterFactory in order to filter
>>>>    out tokens that don't fall within a requested length range, or
>>>> 2. use (in the index analyzer) a TruncateTokenFilterFactory to fix the
>>>>    maximum length of indexed tokens.
>>>>
>>>> But there are so many analyzers in the schema. Should I change the
>>>> analyzer defined for <fieldType name="text_general"...>? If so, since
>>>> "content" and other fields are of type text_general, won't it affect all
>>>> of them too?
>>>>
>>>> I would really appreciate any help.
>>>>
>>>> BTW, I am using Nutch 1.11 and Solr 6.0.0.
>>>>
>>>> Shakiba Davari
>>>> https://ca.linkedin.com/pub/shakiba-davari/84/417/b57
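
For anyone hitting the same error: José-Marcio's fix above amounts to declaring "content" as a tokenized text field rather than a string field, so the analyzer splits the page text into words instead of indexing it as one immense term. A minimal sketch of the relevant schema.xml entries is below; the field attributes and the text_general definition follow the stock Solr example schema, so your own (managed) schema may differ slightly:

    <!-- "content" declared as a tokenized text field instead of a string field.
         indexed/stored values here are illustrative; keep whatever your schema uses. -->
    <field name="content" type="text_general" indexed="true" stored="true"/>

    <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <!-- The tokenizer splits content into word-sized tokens, so no single
             term comes anywhere near the 32766-byte limit. -->
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>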
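
And for reference, the two analyzer options mentioned in the original post would look roughly like the sketch below, as a guard against over-long tokens. The field type name "text_content" and the length limits are made up for illustration, and either filter on its own is normally enough; defining a separate field type like this and using it only for "content" also avoids changing the behaviour of every other field that uses text_general:

    <fieldType name="text_content" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <!-- Option 1: drop any token shorter than 1 or longer than 256 characters. -->
        <filter class="solr.LengthFilterFactory" min="1" max="256"/>
        <!-- Option 2: truncate every token to at most 256 characters instead of dropping it. -->
        <filter class="solr.TruncateTokenFilterFactory" prefixLength="256"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

    <field name="content" type="text_content" indexed="true" stored="true"/>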

