nutch 1.8 pdf crawl issue

A Laxmi Sun, 28 Sep 2014 15:14:50 -0700

Hi,

I crawled a bunch of PDF URLs using Nutch 1.8. It returned an empty
"title" and "content" for some of them. When I opened one such URL, the
text was easily selectable, and the file does *not* consist of scanned
images (as a non-OCR'd PDF would), so I am confused about why Nutch
returned empty values for "title" and "content" for it. An example URL for
which Nutch returned an empty title and content:
http://www.fs.fed.us/global/iitf/pubs/ja_iitf_2012_holm001.pdf


I found out that the title and content were empty through Solr Admin.
After the URL was crawled and indexed in Solr, I searched for it in the
Solr Admin UI, and it had these values for the title, content, url, and
type fields:

From Solr Admin:

title,content,url,type
"",,http://www.fs.fed.us/global/iitf/pubs/ja_iitf_2012_holm001.pdf,"application/pdf,application,pdf";


Any thoughts, please?

Thanks
