Hi Markus,
Thanks for your response. Here is my Hadoop log:

2016-06-21 13:21:54,461 INFO  indexer.IndexerMapReduce - IndexerMapReduce:
crawldb: crawl/crawldb
2016-06-21 13:21:54,461 INFO  indexer.IndexerMapReduce - IndexerMapReduce:
linkdb: crawl/linkdb
2016-06-21 13:21:54,461 INFO  indexer.IndexerMapReduce - IndexerMapReduces:
adding segment: crawl/segments/20160620191357
2016-06-21 13:21:54,616 WARN  util.NativeCodeLoader - Unable to load
native-hadoop library for your platform... using builtin-java classes where
applicable
2016-06-21 13:21:54,725 INFO  indexer.IndexerMapReduce - IndexerMapReduces:
adding segment: crawl/segments/20160620191450
2016-06-21 13:21:54,726 INFO  indexer.IndexerMapReduce - IndexerMapReduces:
adding segment: crawl/segments/20160620191633
2016-06-21 13:21:55,520 WARN  conf.Configuration -
file:/tmp/hadoop-sdavari/mapred/staging/sdavari82799737/.staging/job_local82799737_0001/job.xml:an
attempt to override final parameter:
mapreduce.job.end-notification.max.retry.interval;  Ignoring.
2016-06-21 13:21:55,524 WARN  conf.Configuration -
file:/tmp/hadoop-sdavari/mapred/staging/sdavari82799737/.staging/job_local82799737_0001/job.xml:an
attempt to override final parameter:
mapreduce.job.end-notification.max.attempts;  Ignoring.
2016-06-21 13:21:55,633 WARN  conf.Configuration -
file:/tmp/hadoop-sdavari/mapred/local/localRunner/sdavari/job_local82799737_0001/job_local82799737_0001.xml:an
attempt to override final parameter:
mapreduce.job.end-notification.max.retry.interval;  Ignoring.
2016-06-21 13:21:55,637 WARN  conf.Configuration -
file:/tmp/hadoop-sdavari/mapred/local/localRunner/sdavari/job_local82799737_0001/job_local82799737_0001.xml:an
attempt to override final parameter:
mapreduce.job.end-notification.max.attempts;  Ignoring.
2016-06-21 13:21:55,944 INFO  anchor.AnchorIndexingFilter - Anchor
deduplication is: off
2016-06-21 13:21:57,956 INFO  indexer.IndexWriters - Adding
org.apache.nutch.indexwriter.solr.SolrIndexWriter
2016-06-21 13:21:57,971 INFO  solr.SolrUtils - Authenticating as:
571d02db-d20c-4ab5-a807-a6324a0570b9
2016-06-21 13:21:58,189 INFO  solr.SolrMappingReader - source: content
dest: content
2016-06-21 13:21:58,189 INFO  solr.SolrMappingReader - source: title dest:
title
2016-06-21 13:21:58,189 INFO  solr.SolrMappingReader - source: host dest:
host
2016-06-21 13:21:58,189 INFO  solr.SolrMappingReader - source: segment
dest: segment
2016-06-21 13:21:58,189 INFO  solr.SolrMappingReader - source: boost dest:
boost
2016-06-21 13:21:58,189 INFO  solr.SolrMappingReader - source: digest dest:
digest
2016-06-21 13:21:58,189 INFO  solr.SolrMappingReader - source: tstamp dest:
tstamp
2016-06-21 13:21:58,666 INFO  solr.SolrIndexWriter - Indexing 150 documents
2016-06-21 13:21:59,344 INFO  solr.SolrIndexWriter - Indexing 150 documents
2016-06-21 13:21:59,802 WARN  mapred.LocalJobRunner - job_local82799737_0001
java.lang.Exception: java.io.IOException
at
org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
Caused by: java.io.IOException
at
org.apache.nutch.indexwriter.solr.SolrIndexWriter.makeIOException(SolrIndexWriter.java:171)
at
org.apache.nutch.indexwriter.solr.SolrIndexWriter.close(SolrIndexWriter.java:157)
at org.apache.nutch.indexer.IndexWriters.close(IndexWriters.java:115)
at
org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.java:44)
at
org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.close(ReduceTask.java:502)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:456)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
at
org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.solr.client.solrj.SolrServerException: IOException
occured when talking to server at:
https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/sc784d2ca9_1b69_4718_a3cb_0cce46c846cf/solr/admin/collections/example_collection
at
org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:566)
at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
at
org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
at
org.apache.nutch.indexwriter.solr.SolrIndexWriter.close(SolrIndexWriter.java:153)
... 11 more
Caused by: org.apache.http.client.ClientProtocolException
at
org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:186)
at
org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
at
org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106)
at
org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
at
org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:448)
... 15 more
Caused by: org.apache.http.client.NonRepeatableRequestException: Cannot
retry request with a non-repeatable request entity.
at
org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:208)
at
org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:195)
at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:86)
at
org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:108)
at
org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184)
... 19 more
2016-06-21 13:22:00,803 ERROR indexer.IndexingJob - Indexer:
java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:145)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:222)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:231)
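For what it's worth, the failing URL in the trace ends with /admin/collections/example_collection. Below is a sketch of the command with solr.server.url pointed at the collection itself instead; the cluster ID is just the one from the trace and USERNAME/PASS are placeholders, so this is an untested guess rather than a known fix:

```shell
# Untested sketch: build solr.server.url from the cluster ID and collection
# name (both taken from the stack trace as placeholders), targeting the
# collection itself rather than the /admin/collections endpoint.
CLUSTER_ID="sc784d2ca9_1b69_4718_a3cb_0cce46c846cf"
COLLECTION="example_collection"
SOLR_URL="https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/${CLUSTER_ID}/solr/${COLLECTION}"

# Echo the full command first so the URL can be inspected before running it.
echo bin/nutch index -D solr.server.url="${SOLR_URL}" \
  -D solr.auth=true -D solr.auth.username="USERNAME" \
  -D solr.auth.password="PASS" \
  crawl/crawldb -linkdb crawl/linkdb crawl/segments/2016*
```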

This is taking me quite a while to figure out. I appreciate any help and thoughts.

Shakiba Davari <https://ca.linkedin.com/pub/shakiba-davari/84/417/b57>


On Tue, Jun 21, 2016 at 7:56 AM, Markus Jelsma <[email protected]>
wrote:

> Hello Shakiba - please check Nutch's logs. The error is reported there.
>
> Markus
>
>
>
> -----Original message-----
> > From:shakiba davari <[email protected]>
> > Sent: Thursday 16th June 2016 23:04
> > To: [email protected]
> > Subject: Re: Indexing nutch crawled data in “Bluemix” solr
> >
> > Thanks so much, Lewis. It really helped me; at least now I know that
> > there is a way to make it work.
> > I used the command as you suggested:
> >
> > bin/nutch index -D
> > solr.server.url="https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/CLUSTER-ID/solr/admin/collections"
> > -D solr.auth=true -D solr.auth.username="USERNAME" -D
> > solr.auth.password="PASS" Crawl/crawldb -linkdb Crawl/linkdb
> > Crawl/segments/2016*
> >
> > and now the result is:
> >
> > Indexing 153 documents
> > Indexing 153 documents
> > Indexer: java.io.IOException: Job failed!
> >         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
> >         at
> org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:145)
> >         at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:222)
> >         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> >         at
> org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:231)
> >
> >
> > I guess it has something to do with the solr.server.url address, maybe
> > the end of it. I changed it in different ways, e.g.
> > "https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/CLUSTER-ID/solr/example_collection/update"
> > (since that is the URL used for feeding JSON files to the Bluemix Solr),
> > but no luck so far.
> >
> > Any idea what's happening now?
> >
> >
> > Shakiba Davari <https://ca.linkedin.com/pub/shakiba-davari/84/417/b57>
> >
> >
> > On Tue, Jun 14, 2016 at 4:58 PM, Lewis John Mcgibbney <
> > [email protected]> wrote:
> >
> > > Hi Shakiba,
> > >
> > > On Sat, Jun 11, 2016 at 1:48 PM, <[email protected]>
> > > wrote:
> > >
> > > > From: shakiba davari <[email protected]>
> > > > To: [email protected]
> > > > Cc:
> > > > Date: Thu, 9 Jun 2016 13:11:43 -0400
> > > > Subject: Indexing nutch crawled data in “Bluemix” solr
> > > > <http://stackoverflow.com/questions/37731716/indexing-nutch-crawled-data-in-bluemix-solr>
> > > >
> > > > I'm trying to index the Nutch-crawled data with Bluemix Solr and I
> > > > cannot find any way to do it. My main question is: is there anybody
> > > > who can help me do so? What should I do to send the results of my
> > > > Nutch crawl to my Bluemix Solr?
> > > >
> > > > For the crawling I used Nutch 1.11, and here is part of what I have
> > > > done so far and the problems I faced. I thought there might be two
> > > > possible solutions:
> > > >
> > > >    1. By nutch command:
> > > >
> > > > “NUTCH_PATH/bin/nutch index crawl/crawldb -linkdb crawl/linkdb crawl/
> > > > -Dsolr.server.url="OURSOLRURL"”
> > > >
> > > > I can index the Nutch-crawled data with OURSOLR. However, I found
> > > > some problems with that.
> > > >
> > > > a. Though it sounds really odd, it could not accept the URL. I could
> > > > handle that by using the URL-encoded form instead.
> > > >
> > > > b. Since I have to connect with a specific username and password,
> > > > Nutch could not connect to my Solr. Considering this:
> > > >
> > > >  Active IndexWriters :
> > > >  SolrIndexWriter
> > > >     solr.server.type : Type of SolrServer to communicate with
> (default
> > > > 'http' however options include 'cloud', 'lb' and 'concurrent')
> > > >     solr.server.url : URL of the Solr instance (mandatory)
> > > >     solr.zookeeper.url : URL of the Zookeeper URL (mandatory if
> > > > 'cloud' value for solr.server.type)
> > > >     solr.loadbalance.urls : Comma-separated string of Solr server
> > > > strings to be used (madatory if 'lb' value for solr.server.type)
> > > >     solr.mapping.file : name of the mapping file for fields (default
> > > > solrindex-mapping.xml)
> > > >     solr.commit.size : buffer size when sending to Solr (default
> 1000)
> > > >     solr.auth : use authentication (default false)
> > > >     solr.auth.username : username for authentication
> > > >     solr.auth.password : password for authentication
> > > >
> > > > in the command-line output, I tried to handle this problem by adding
> > > > the authentication parameters solr.auth=true
> > > > solr.auth.username="SOLR-UserName" solr.auth.password="Pass" to it.
> > > >
> > > > So up to now I’ve ended up with this command:
> > > >
> > > > "bin/nutch index crawl/crawldb -linkdb crawl/linkdb crawl/segments/2016*
> > > > solr.server.url="https%3A%2F%2Fgateway.watsonplatform.net%2Fretrieve-and-rank%2Fapi%2Fv1%2Fsolr_clusters%2FCLUSTER-ID%2Fsolr%2Fadmin%2Fcollections"
> > > > solr.auth=true solr.auth.username="USERNAME" solr.auth.password="PASS"".
> > > >
> > > > But for some reason I haven’t been able to figure out yet, the
> > > > command treats the authentication parameters as crawled-data
> > > > directories and does not work. So I guess this is not the right way
> > > > to use the "Active IndexWriters"; can anyone tell me how I can?
> > > >
> > >
> > > Please enter the command line parameters IN FRONT of the Tool arguments
> > > e.g. bin/nutch index -D
> > > solr.server.url="https%3A%2F%2Fgateway.watsonplatform.net%2Fretrieve-and-rank%2Fapi%2Fv1%2Fsolr_clusters%2FCLUSTER-ID%2Fsolr%2Fadmin%2Fcollections"
> > > -D solr.auth=true -D solr.auth.username="USERNAME" -D
> > > solr.auth.password="PASS" crawl/crawldb -linkdb crawl/linkdb
> > > crawl/segments/
> > >
> > >
> > > >
> > > >    2. By curl command:
> > > >
> > > > "curl -X POST -H "Content-Type: application/json" -u
> > > > "BLUEMIXSOLR-USERNAME":"BLUEMIXSOLR-PASS"
> > > > "https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/CLUSTERS-ID/solr/example_collection/update"
> > > > --data-binary @{/path_to_file}/FILE.json"
> > > >
> > > > I thought maybe I could feed the JSON files created by this command:
> > > >
> > > > bin/nutch commoncrawldump -outputDir finalcrawlResult/ -segment
> > > > crawl/segments -gzip -extension json -SimpleDateFormat -epochFilename
> > > > -jsonArray -reverseKey
> > > >
> > > > but there are some problems here.
> > > >
> > > > a. This command produces so many files in complicated paths that it
> > > > will take a long time to manually POST all of them; I guess for big
> > > > crawls it may even be impossible. Is there any way to POST all the
> > > > files in a directory and its subdirectories at once with just one
> > > > command?
> > > >
> > >
> > > Unfortunately, right now AFAIK you cannot prevent the tool from
> > > creating the directory hell. You might be better off using the
> > > FileDumper tool instead:
> > > ./bin/nutch dump
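On the question of POSTing everything at once: a shell loop over find's output might do it. The sketch below creates a couple of dummy dump files so it runs standalone, and it only echoes the curl commands rather than executing them; the URL and USERNAME:PASS are placeholders, and it is untested against Bluemix:

```shell
# Demo setup so the loop has something to find: two dummy dump files in
# nested directories, standing in for commoncrawldump output.
mkdir -p finalcrawlResult/sub
echo '{}' > finalcrawlResult/a.json
echo '{}' > finalcrawlResult/sub/b.json

# POST every .json file under the directory tree in one go. The URL and
# USERNAME:PASS are placeholders; curl is echoed rather than executed so
# the commands can be inspected before anything is actually sent.
SOLR_URL="https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/CLUSTER-ID/solr/example_collection/update"
find finalcrawlResult/ -type f -name '*.json' | while read -r f; do
  echo curl -X POST -H "Content-Type: application/json" \
    -u "USERNAME:PASS" "$SOLR_URL" --data-binary @"$f"
done
```

Dropping the echo would run the uploads for real once the URL and credentials are filled in.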
> > >
> > >
> > > >
> > > > b. There is a weird string "ÙÙ÷y œ" at the start of the JSON files
> > > > created by commoncrawldump.
> > > >
> > >
> > > The data is encoded as CBOR. That is why those bytes exist.
> > >
> > >
> > > >
> > > > c. I removed the weird string and tried to POST just one of these
> > > > files, but here is the result:
> > > >
> > > >
> > > >
> > >
> {"responseHeader":{"status":400,"QTime":23},"error":{"metadata":["error-class","org.apache.solr.common.SolrException","root-error-class","org.apache.solr.common.SolrException"],"msg":"Unknown
> > > > command 'url' at [9]","code":400}}
> > > >
> > > >
> > > No, it just means that you are not using the index tool correctly and
> > > that possibly your input data is not in the correct format.
> > > Hope this helps.
> > > Lewis
> > >
> >
>
