Hello Shakiba,

Look carefully:
Caused by: org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server at: https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/sc784d2ca9_1b69_4718_a3cb_0cce46c846cf/solr/admin/collections/example_collection

and

Caused by: org.apache.http.client.NonRepeatableRequestException: Cannot retry request with a non-repeatable request entity.

See http://stackoverflow.com/questions/25852164/solrj-authentication-fails-on-write-operation

Markus

-----Original message-----
> From: shakiba davari <[email protected]>
> Sent: Tuesday 21st June 2016 19:26
> To: [email protected]
> Subject: Re: Indexing nutch crawled data in “Bluemix” solr
>
> Hi Markus,
> Thanks for your response. Here is my hadoop log:
>
> 2016-06-21 13:21:54,461 INFO indexer.IndexerMapReduce - IndexerMapReduce: crawldb: crawl/crawldb
> 2016-06-21 13:21:54,461 INFO indexer.IndexerMapReduce - IndexerMapReduce: linkdb: crawl/linkdb
> 2016-06-21 13:21:54,461 INFO indexer.IndexerMapReduce - IndexerMapReduces: adding segment: crawl/segments/20160620191357
> 2016-06-21 13:21:54,616 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 2016-06-21 13:21:54,725 INFO indexer.IndexerMapReduce - IndexerMapReduces: adding segment: crawl/segments/20160620191450
> 2016-06-21 13:21:54,726 INFO indexer.IndexerMapReduce - IndexerMapReduces: adding segment: crawl/segments/20160620191633
> 2016-06-21 13:21:55,520 WARN conf.Configuration - file:/tmp/hadoop-sdavari/mapred/staging/sdavari82799737/.staging/job_local82799737_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
> 2016-06-21 13:21:55,524 WARN conf.Configuration - file:/tmp/hadoop-sdavari/mapred/staging/sdavari82799737/.staging/job_local82799737_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
> 2016-06-21 13:21:55,633 WARN conf.Configuration - file:/tmp/hadoop-sdavari/mapred/local/localRunner/sdavari/job_local82799737_0001/job_local82799737_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
> 2016-06-21 13:21:55,637 WARN conf.Configuration - file:/tmp/hadoop-sdavari/mapred/local/localRunner/sdavari/job_local82799737_0001/job_local82799737_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
> 2016-06-21 13:21:55,944 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
> 2016-06-21 13:21:57,956 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
> 2016-06-21 13:21:57,971 INFO solr.SolrUtils - Authenticating as: 571d02db-d20c-4ab5-a807-a6324a0570b9
> 2016-06-21 13:21:58,189 INFO solr.SolrMappingReader - source: content dest: content
> 2016-06-21 13:21:58,189 INFO solr.SolrMappingReader - source: title dest: title
> 2016-06-21 13:21:58,189 INFO solr.SolrMappingReader - source: host dest: host
> 2016-06-21 13:21:58,189 INFO solr.SolrMappingReader - source: segment dest: segment
> 2016-06-21 13:21:58,189 INFO solr.SolrMappingReader - source: boost dest: boost
> 2016-06-21 13:21:58,189 INFO solr.SolrMappingReader - source: digest dest: digest
> 2016-06-21 13:21:58,189 INFO solr.SolrMappingReader - source: tstamp dest: tstamp
> 2016-06-21 13:21:58,666 INFO solr.SolrIndexWriter - Indexing 150 documents
> 2016-06-21 13:21:59,344 INFO solr.SolrIndexWriter - Indexing 150 documents
> 2016-06-21 13:21:59,802 WARN mapred.LocalJobRunner - job_local82799737_0001
> java.lang.Exception: java.io.IOException
>         at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
>         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
> Caused by: java.io.IOException
>         at org.apache.nutch.indexwriter.solr.SolrIndexWriter.makeIOException(SolrIndexWriter.java:171)
>         at org.apache.nutch.indexwriter.solr.SolrIndexWriter.close(SolrIndexWriter.java:157)
>         at org.apache.nutch.indexer.IndexWriters.close(IndexWriters.java:115)
>         at org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.java:44)
>         at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.close(ReduceTask.java:502)
>         at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:456)
>         at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
>         at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server at: https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/sc784d2ca9_1b69_4718_a3cb_0cce46c846cf/solr/admin/collections/example_collection
>         at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:566)
>         at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
>         at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
>         at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
>         at org.apache.nutch.indexwriter.solr.SolrIndexWriter.close(SolrIndexWriter.java:153)
>         ... 11 more
> Caused by: org.apache.http.client.ClientProtocolException
>         at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:186)
>         at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
>         at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106)
>         at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
>         at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:448)
>         ... 15 more
> Caused by: org.apache.http.client.NonRepeatableRequestException: Cannot retry request with a non-repeatable request entity.
>         at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:208)
>         at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:195)
>         at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:86)
>         at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:108)
>         at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184)
>         ... 19 more
> 2016-06-21 13:22:00,803 ERROR indexer.IndexingJob - Indexer: java.io.IOException: Job failed!
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
>         at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:145)
>         at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:222)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>         at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:231)
>
> It's really taking so long. I appreciate any help and thoughts.
>
> Shakiba Davari <https://ca.linkedin.com/pub/shakiba-davari/84/417/b57>
>
> On Tue, Jun 21, 2016 at 7:56 AM, Markus Jelsma <[email protected]> wrote:
>
> > Hello Shakiba - please check Nutch' logs. The error is reported there.
> >
> > Markus
> >
> > -----Original message-----
> > > From: shakiba davari <[email protected]>
> > > Sent: Thursday 16th June 2016 23:04
> > > To: [email protected]
> > > Subject: Re: Indexing nutch crawled data in “Bluemix” solr
> > >
> > > Thanks so much Lewis. It really helped me. At least now I know that there is a way to make it work.
> > > I did use the command as you said:
> > >
> > > bin/nutch index -D solr.server.url="https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/CLUSTER-ID/solr/admin/collections" -D solr.auth=true -D solr.auth.username="USERNAME" -D solr.auth.password="PASS" Crawl/crawldb -linkdb Crawl/linkdb Crawl/segments/2016*
> > >
> > > and now the result is:
> > >
> > > Indexing 153 documents
> > > Indexing 153 documents
> > > Indexer: java.io.IOException: Job failed!
> > >         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
> > >         at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:145)
> > >         at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:222)
> > >         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> > >         at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:231)
> > >
> > > I guess it has something to do with the solr.server.url address, maybe the end of it. I changed it in different ways, e.g. "https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/CLUSTER-ID/solr/example_collection/update" (since that is the URL used for feeding JSON files to the Bluemix Solr), but no luck so far.
> > >
> > > Any idea what's happening now?
> > >
> > > Shakiba Davari <https://ca.linkedin.com/pub/shakiba-davari/84/417/b57>
> > >
> > > On Tue, Jun 14, 2016 at 4:58 PM, Lewis John Mcgibbney <[email protected]> wrote:
> > >
> > > > Hi shakiba,
> > > >
> > > > On Sat, Jun 11, 2016 at 1:48 PM, <[email protected]> wrote:
> > > >
> > > > > From: shakiba davari <[email protected]>
> > > > > To: [email protected]
> > > > > Date: Thu, 9 Jun 2016 13:11:43 -0400
> > > > > Subject: Indexing nutch crawled data in “Bluemix” solr
> > > > > <http://stackoverflow.com/questions/37731716/indexing-nutch-crawled-data-in-bluemix-solr#>
> > > > >
> > > > > I'm trying to index the Nutch-crawled data with Bluemix Solr and I cannot find any way to do it. My main question is: is there anybody who can help me do so? What should I do to send the result of my Nutch crawl to my Bluemix Solr?
> > > > >
> > > > > For the crawling I used Nutch 1.11, and here is part of what I have done so far and the problems I faced. I thought there might be two possible solutions:
> > > > >
> > > > > 1. By nutch command:
> > > > >
> > > > > “NUTCH_PATH/bin/nutch index crawl/crawldb -linkdb crawl/linkdb crawl/ -Dsolr.server.url="OURSOLRURL"”
> > > > >
> > > > > I can index the Nutch-crawled data with OURSOLR. However, I found some problems with that:
> > > > >
> > > > > a. Though it sounds really odd, it could not accept the URL. I could handle that by using the URL’s encoding instead.
> > > > >
> > > > > b. Since I have to connect with a specific username and password, Nutch could not connect to my Solr.
> > > > > Considering this:
> > > > >
> > > > > Active IndexWriters :
> > > > > SolrIndexWriter
> > > > >     solr.server.type : Type of SolrServer to communicate with (default 'http' however options include 'cloud', 'lb' and 'concurrent')
> > > > >     solr.server.url : URL of the Solr instance (mandatory)
> > > > >     solr.zookeeper.url : URL of the Zookeeper URL (mandatory if 'cloud' value for solr.server.type)
> > > > >     solr.loadbalance.urls : Comma-separated string of Solr server strings to be used (madatory if 'lb' value for solr.server.type)
> > > > >     solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
> > > > >     solr.commit.size : buffer size when sending to Solr (default 1000)
> > > > >     solr.auth : use authentication (default false)
> > > > >     solr.auth.username : username for authentication
> > > > >     solr.auth.password : password for authentication
> > > > >
> > > > > in the command-line output, I tried to manage this problem by adding the authentication parameters solr.auth=true solr.auth.username="SOLR-UserName" solr.auth.password="Pass" to the command.
> > > > >
> > > > > So up to now I’ve got to the point of using this command:
> > > > >
> > > > > ”bin/nutch index crawl/crawldb -linkdb crawl/linkdb crawl/segments/2016* solr.server.url="https%3A%2F%2Fgateway.watsonplatform.net%2Fretrieve-and-rank%2Fapi%2Fv1%2Fsolr_clusters%2FCLUSTER-ID%2Fsolr%2Fadmin%2Fcollections" solr.auth=true solr.auth.username="USERNAME" solr.auth.password="PASS"“
> > > > >
> > > > > But for some reason I haven’t figured out yet, the command treats the authentication parameters as crawled-data directories and does not work. So I guess this is not the right way to activate the IndexWriters. Can anyone tell me how I can?
> > > >
> > > > Please enter the command line parameters IN FRONT of the Tool arguments, e.g.
> > > > bin/nutch index -D solr.server.url="https%3A%2F%2Fgateway.watsonplatform.net%2Fretrieve-and-rank%2Fapi%2Fv1%2Fsolr_clusters%2FCLUSTER-ID%2Fsolr%2Fadmin%2Fcollections" -D solr.auth=true -D solr.auth.username="USERNAME" -D solr.auth.password="PASS" crawl/crawldb -linkdb crawl/linkdb crawl/segments/
> > > >
> > > > > 2. By curl command:
> > > > >
> > > > > “curl -X POST -H "Content-Type: application/json" -u "BLUEMIXSOLR-USERNAME":"BLUEMIXSOLR-PASS" "https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/CLUSTERS-ID/solr/example_collection/update" --data-binary @{/path_to_file}/FILE.json”
> > > > >
> > > > > I thought maybe I could feed it the JSON files created by this command:
> > > > >
> > > > > bin/nutch commoncrawldump -outputDir finalcrawlResult/ -segment crawl/segments -gzip -extension json -SimpleDateFormat -epochFilename -jsonArray -reverseKey
> > > > >
> > > > > but there are some problems here:
> > > > >
> > > > > a. This command produces so many files in complicated paths that it would take a very long time to manually post all of them. I guess for big crawls it may even be impossible. Is there any way to POST all the files in a directory and its subdirectories at once, with just one command?
> > > >
> > > > Unfortunately, right now AFAIK you cannot prevent the tool from creating the directory hell. You might be better off using the FileDumper tool instead:
> > > > ./bin/nutch dump
> > > >
> > > > > b. There is a weird name "ÙÙ÷y œ" at the start of the JSON files created by commoncrawldump.
> > > >
> > > > The data is encoded as CBOR. This is why the bytes exist.
> > > > >
> > > > > c. I removed the weird name and tried to POST just one of these files, but here is the result:
> > > > >
> > > > > {"responseHeader":{"status":400,"QTime":23},"error":{"metadata":["error-class","org.apache.solr.common.SolrException","root-error-class","org.apache.solr.common.SolrException"],"msg":"Unknown command 'url' at [9]","code":400}}
> > > >
> > > > No, it just means that you are not using the index tool correctly and that possibly your input data is not in the correct format.
> > > > Hope this helps.
> > > > Lewis
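On question (a) above, posting every JSON file under the commoncrawldump output tree can be driven by `find` feeding a `curl` loop. A minimal sketch, reusing the placeholder URL and credentials from the thread; the loop only echoes each curl command (drop the `echo` to actually send), and the demo directory below is a stand-in for the real dump tree:

```shell
# post_all_json DIR URL USER:PASS — echo a curl POST for every .json under DIR.
# Remove the leading 'echo' inside the loop to really send the requests.
# (Paths containing whitespace would need a 'find -exec' variant instead.)
post_all_json() {
  dir="$1"; url="$2"; creds="$3"
  find "$dir" -type f -name '*.json' | while read -r f; do
    echo curl -X POST -H "Content-Type: application/json" \
      -u "$creds" "$url" --data-binary @"$f"
  done
}

# Tiny stand-in for the commoncrawldump output tree, just to show the traversal:
mkdir -p demo_dump/sub
printf '{}' > demo_dump/a.json
printf '{}' > demo_dump/sub/b.json

post_all_json demo_dump \
  "https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/CLUSTER-ID/solr/example_collection/update" \
  "BLUEMIXSOLR-USERNAME:BLUEMIXSOLR-PASS"
```

Note this only automates the posting; as Lewis says above, the files themselves still have to be in a format the Solr update handler accepts.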

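On the NonRepeatableRequestException at the top of the thread: the Stack Overflow link Markus gives attributes it to HttpClient receiving a 401 challenge only after it has already streamed the (non-repeatable) update body, so the fix discussed there is preemptive authentication, i.e. sending the Basic Authorization header on the very first request. A small sketch of the header such a request carries, with USERNAME:PASS as a placeholder for the real service credentials (note that curl's -u, as used earlier in the thread, already sends Basic credentials preemptively):

```shell
# Build the HTTP Basic Authorization header that preemptive auth sends up front.
# USERNAME:PASS is a placeholder for the real Bluemix service credentials.
credentials="USERNAME:PASS"
printf 'Authorization: Basic %s\n' "$(printf '%s' "$credentials" | base64)"
```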
