Hi,
I'm trying to rebuild Nutch so that the parse-pdf plugin is compiled with the
external libraries (jai_core.jar and jai_codec.jar).
So I downloaded the two jars and put them in the lib/ directory of
src/plugins/parse-pdf/.
I uncommented the two lines in plugin.xml (both in
src/plugins/parse-pdf/ and in plugins/parse-
"Again, this procedure does NOT work when using HDFS - you won't even see
the partial output (without some serious hacking)"
Got it!
"You can simply set the fetcher.parsing config option to false."
Found it!
Thanks for the help.
2010/5/3 Andrzej Bialecki
> On 2010-05-03 22:58, Emmanuel de
I have a problem with Nutch. My project is link analysis. I crawled
"www.mersin.edu.tr", analysed the linkdb, and saw all the links within
mersin.edu.tr. But I also need to find the links to other sites, for example
www.tubitak.gov.tr, and I cannot find them. How can I find these links?
Please help me.
Did you check crawl-urlfilter.txt?
All the domain names that you'd like to crawl have to be mentioned there.
e.g.
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*mersin\.edu\.tr/
+^http://([a-z0-9]*\.)*tubitak\.gov\.tr/
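If you want to check quickly which URLs patterns like these let through, a
small script outside Nutch can help. This is only a rough sketch in Python, not
Nutch's own code: the accepts() helper and the RULES list are made up for
illustration, and it assumes the filter works on a "first matching rule
decides, no match means rejected" basis.

```python
import re

# The two rules from the example above, as (sign, pattern) pairs.
# "+" means accept; a "-" rule would mean reject.
RULES = [
    ("+", r"^http://([a-z0-9]*\.)*mersin\.edu\.tr/"),
    ("+", r"^http://([a-z0-9]*\.)*tubitak\.gov\.tr/"),
]


def accepts(url, rules=RULES):
    """Return True if the first rule that matches the URL is a '+' rule.

    Assumption: a URL that matches no rule at all is rejected.
    """
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False


if __name__ == "__main__":
    for url in (
        "http://www.mersin.edu.tr/",
        "http://www.tubitak.gov.tr/index.htm",
        "http://www.example.org/",
    ):
        print(url, accepts(url))
```

Note that both patterns end in "/", so a URL written without a trailing slash
after the host name would not match them.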
Also check the db.ignore.external.links property in nutch-default.xml. Should be
se