Hi, I am trying to crawl some URLs with Apache Nutch and then index them with the Bluemix Retrieve and Rank service. To do so, I crawl my data with Nutch and dump the crawled data as files (mostly HTML files) into a directory:
bin/nutch dump -flatdir -outputDir dumpData/ -segment crawl/segments/

Then I send these files to the Bluemix Document Converter Service to create a JSON file "compatible with Bluemix R&R Service", and post this JSON file to my R&R Service.

My question is: how can I map each of these dumped files to the URL it was crawled from? The commoncrawldump command creates JSON files that have a URL field in them, but I couldn't use these files for my purpose for various reasons, the first being that they contain some binary characters.

Best,
Shakiba Davari

