Hi, I am trying to crawl some URLs with Apache Nutch and then index them
with the Bluemix Retrieve and Rank service. To do so, I crawl my data with
Nutch and dump the crawled data as files (mostly HTML files) into a directory:

bin/nutch dump -flatdir -outputDir dumpData/ -segment crawl/segments/

Then I send these files to the Bluemix Document Conversion service to create
a JSON file compatible with the Bluemix R&R service, and post this JSON file
to my R&R service. My question is: how can I map each of these dumped files
back to the URL it was crawled from?

The commoncrawldump command creates JSON files that have a URL field in them,
but I couldn't use these files for my purpose for various reasons, the first
being that they contain some binary data.

Best,
Shakiba Davari
