Hi guys, I'm trying to index my Nutch-crawled data by running:

bin/nutch index -D solr.server.url="http://localhost:8983/solr/carerate" crawl/crawldb -linkdb crawl/linkdb crawl/segments/2016*
At first it was working totally OK: I indexed my data, sent a few queries, and received good results. But then I ran the crawl again so that it crawls to a greater depth and fetches more pages; the last crawl's status showed 1051 unfetched and 151 fetched pages in my db. Now when I run the nutch index command, I get "java.io.IOException: Job failed!". Here is my log:

java.lang.Exception: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Exception writing document id http://www.cs.toronto.edu/~frank/About_Me/about_me.html to the index; possible analysis error: Document contains at least one immense term in field="content" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[70, 114, 97, 110, 107, 32, 82, 117, 100, 122, 105, 99, 122, 32, 45, 32, 65, 98, 111, 117, 116, 32, 77, 101, 32, 97, 98, 111, 117, 116]...', original message: bytes can be at most 32766 in length; got 40063. Perhaps the document has an indexed string field (solr.StrField) which is too large
        at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
Caused by: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Exception writing document id http://www.cs.toronto.edu/~frank/About_Me/about_me.html to the index; possible analysis error: Document contains at least one immense term in field="content" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[70, 114, 97, 110, 107, 32, 82, 117, 100, 122, 105, 99, 122, 32, 45, 32, 65, 98, 111, 117, 116, 32, 77, 101, 32, 97, 98, 111, 117, 116]...', original message: bytes can be at most 32766 in length; got 40063. Perhaps the document has an indexed string field (solr.StrField) which is too large
        at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:552)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
        at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
        at org.apache.nutch.indexwriter.solr.SolrIndexWriter.close(SolrIndexWriter.java:153)
        at org.apache.nutch.indexer.IndexWriters.close(IndexWriters.java:115)
        at org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.java:44)
        at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.close(ReduceTask.java:502)
        at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:456)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
        at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
2016-06-21 13:27:37,994 ERROR indexer.IndexingJob - Indexer: java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
        at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:145)
        at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:222)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:231)

I realize that the page mentioned in the error really does contain a very long term, so in schema.xml and managed-schema I changed the type of "id", "content", and "text" from "strings" to "text_general", e.g. <field name="id" type="text_general">, but it didn't solve the problem.
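For reference, the changed field definitions now look roughly like this (I'm quoting from memory, so the stored/indexed attributes here may not match my actual schema.xml exactly):

  <field name="id" type="text_general" stored="true" indexed="true" required="true"/>
  <field name="content" type="text_general" stored="false" indexed="true"/>
  <field name="text" type="text_general" stored="false" indexed="true" multiValued="true"/>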
I'm no expert, so I'm not sure how to correct the analyzer without breaking something else. I've read somewhere that I can:
1. use a LengthFilterFactory (in the index analyzer) to filter out tokens that don't fall within a requested length range; or
2. use a TruncateTokenFilterFactory (in the index analyzer) to cap the length of indexed tokens.
But there are so many analyzers in the schema. Should I change the analyzer defined for <fieldType name="text_general"...>? If so, since "content" and the other fields are of type text_general, isn't that going to affect all of them too? My rough idea of the change is sketched below.
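To be clear, this is only a sketch of what I mean, not my actual schema: the tokenizer and other filters are simplified rather than copied from the default text_general definition, and the min/max and prefixLength values are arbitrary numbers I made up.

  <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <!-- option 1: drop tokens outside a length range -->
      <filter class="solr.LengthFilterFactory" min="1" max="1000"/>
      <!-- option 2 (instead of the filter above): keep only a prefix of very long tokens -->
      <!-- <filter class="solr.TruncateTokenFilterFactory" prefixLength="100"/> -->
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>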
I would really appreciate any help. BTW, I'm using Nutch 1.11 and Solr 6.0.0.

Shakiba Davari
https://ca.linkedin.com/pub/shakiba-davari/84/417/b57