Nutch 2.1 pdf parsing

Adriana Farina Thu, 23 May 2013 08:15:07 -0700

Hi,

I'm using Nutch 2.1 in distributed mode on top of Hadoop 1.0.4, with HBase
0.90.4 as the database.


I wrote a Java class that runs the crawling cycle. The code that implements
the cycle is the following:

    for (int i = 0; i < depth; i++) {
        batchid = generator.generate((Long) args.get(Nutch.ARG_TOPN),
                System.currentTimeMillis(), false, false);
        fetcher.fetch(batchid, 1, false, -1);
        parser.parse(batchid, false, true);
        updater.run(new String[0]);
    }

The problem is that the PDF files are not being parsed: no PDF content ends
up in HBase. The strange thing is that I get one row with the following
content:

    column=p:parsestat, timestamp=1369316742871,
    value=\x04\x90\x03\x02\x96\x01org.apache.nutch.parse.ParseException: Unable to successfully parse content\x00

As far as I can tell, I have configured all the Nutch property files
correctly. Can anybody help me?

Thank you very much.


-- 
Adriana Farina
