
I'm trying to index Nutch-crawled data in Bluemix Solr and I cannot find any
way to do it. My main question is: can anybody help me do this? What should
I do to send the results of my Nutch crawl to my Bluemix Solr?

For the crawling I used Nutch 1.11. Here is part of what I have done so far
and the problems I faced. I thought there might be two possible solutions:

   1. By nutch command:

   NUTCH_PATH/bin/nutch index crawl/crawldb -linkdb crawl/linkdb crawl/ \
     -Dsolr.server.url="OURSOLRURL"

I can index the Nutch-crawled data into OURSOLR with it. However, I ran into
some problems with that.

a. Though it sounds really odd, it would not accept the URL as-is. I could
work around that by passing the URL-encoded form of the URL instead.
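
(For anyone reproducing this, the encoded form can be produced with something
like the one-liner below; the raw URL here is only an example and the python3
call is just one way of doing the percent-encoding.)

   # percent-encode the Solr base URL before handing it to Nutch
   RAW_URL='https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/CLUSTER-ID/solr'
   python3 -c "import sys, urllib.parse; print(urllib.parse.quote(sys.argv[1], safe=''))" "$RAW_URL"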

b. Since I have to connect with a specific username and password, Nutch could
not connect to my Solr. Seeing this

 Active IndexWriters :
 SolrIndexWriter
    solr.server.type : Type of SolrServer to communicate with (default
'http' however options include 'cloud', 'lb' and 'concurrent')
    solr.server.url : URL of the Solr instance (mandatory)
    solr.zookeeper.url : URL of the Zookeeper URL (mandatory if
'cloud' value for solr.server.type)
    solr.loadbalance.urls : Comma-separated string of Solr server
strings to be used (madatory if 'lb' value for solr.server.type)
    solr.mapping.file : name of the mapping file for fields (default
solrindex-mapping.xml)
    solr.commit.size : buffer size when sending to Solr (default 1000)
    solr.auth : use authentication (default false)
    solr.auth.username : username for authentication
    solr.auth.password : password for authentication

in the command-line output, I tried to deal with the problem by adding the
authentication parameters solr.auth=true solr.auth.username="SOLR-UserName"
solr.auth.password="Pass" to the command.

So up to now I have got to the point of using this command:

   bin/nutch index crawl/crawldb -linkdb crawl/linkdb crawl/segments/2016* \
     solr.server.url="https%3A%2F%2Fgateway.watsonplatform.net%2Fretrieve-and-rank%2Fapi%2Fv1%2Fsolr_clusters%2FCLUSTER-ID%2Fsolr%2Fadmin%2Fcollections" \
     solr.auth=true solr.auth.username="USERNAME" solr.auth.password="PASS"

But for some reason I have not figured out yet, the command treats the
authentication parameters as crawled-data (segment) directories and does not
work. So I guess this is not the right way to activate the IndexWriters. Can
anyone tell me how I should do it?
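
What I am planning to try next, in case it helps, is passing the properties
as -D options instead of plain arguments, assuming the index job picks them
up as configuration the way stock Nutch 1.x does; this is only a sketch I
have not verified against Bluemix:

   bin/nutch index -Dsolr.server.url="OURSOLRURL" \
     -Dsolr.auth=true -Dsolr.auth.username="USERNAME" -Dsolr.auth.password="PASS" \
     crawl/crawldb -linkdb crawl/linkdb crawl/segments/2016*

(The same properties could presumably also be set in conf/nutch-site.xml
rather than on the command line.)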

   2. By curl command:

   curl -X POST -H "Content-Type: application/json" \
     -u "BLUEMIXSOLR-USERNAME":"BLUEMIXSOLR-PASS" \
     "https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/CLUSTERS-ID/solr/example_collection/update" \
     --data-binary @{/path_to_file}/FILE.json

I thought maybe I could feed it the JSON files created by this command:

   bin/nutch commoncrawldump -outputDir finalcrawlResult/ -segment crawl/segments \
     -gzip -extension json -SimpleDateFormat -epochFilename -jsonArray -reverseKey

But there are some problems here:

a. This command produces a lot of files in complicated, deeply nested paths,
and manually POSTing all of them would take a long time; for big crawls I
guess it may even be impossible. Is there any way to POST all the files in a
directory and its subdirectories at once, with just one command?
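
(Something along these lines is the kind of one-command loop I mean, assuming
every dump file is a JSON payload the update handler accepts as-is; I have
not verified that it works:)

   # walk the dump directory tree and POST every .json file to the collection
   find finalcrawlResult/ -type f -name '*.json' -print0 | \
     xargs -0 -I{} curl -X POST -H "Content-Type: application/json" \
       -u "BLUEMIXSOLR-USERNAME":"BLUEMIXSOLR-PASS" \
       "https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/CLUSTERS-ID/solr/example_collection/update" \
       --data-binary @{}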

b. There is a weird string, "ÙÙ÷y œ", at the start of the JSON files created
by commoncrawldump.
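
(To see what those leading bytes actually are, something like this dumps them
in hex; the path is only an example:)

   head -c 32 finalcrawlResult/SOME_SUBDIR/FILE.json | xxd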

c. I removed that weird string and tried to POST just one of these files, but
here is the result:

 
{"responseHeader":{"status":400,"QTime":23},"error":{"metadata":["error-class","org.apache.solr.common.SolrException","root-error-class","org.apache.solr.common.SolrException"],"msg":"Unknown
command 'url' at [9]","code":400}}

Does this mean these files cannot be fed to Bluemix Solr at all, and that
this approach is useless for me?
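
As far as I understand, the /update handler reads the top-level JSON keys as
commands ("add", "commit", "delete", ...), so a raw dump record whose first
key is "url" would trigger exactly this error. A payload shaped like the
sketch below is what I think the handler would accept; the field names are
only placeholders and would have to exist in the collection's schema:

   curl -X POST -H "Content-Type: application/json" \
     -u "BLUEMIXSOLR-USERNAME":"BLUEMIXSOLR-PASS" \
     "https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/CLUSTERS-ID/solr/example_collection/update" \
     --data-binary '{"add": {"doc": {"id": "http://example.com/", "content": "example page text"}}, "commit": {}}'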
