Two observations using the nutch 1.1. nightly build
nutch-2010-04-14_04-00-47:
1) Previously I was using nutch 1.0 to crawl successfully, but had
problems w/ parse-pdf. I decided to try nutch 1.1. w/ parse-tika, which
appears to parse all of the 'problem' pdfs that parse-pdf could not
handle. The
/indexes crawl/crawldb/ crawl/linkdb
crawl/segments/20100415163946 crawl/segments/20100415164106
This seems to work for me. You may have already tried this workaround, but
just in case.
-Harry
On Fri, Apr 16, 2010 at 3:34 AM, matthew a. grisius
mgris...@comcast.net wrote:
Two observations using the nutch 1.1. nightly build
nutch-2010-04-14_04-00-47:
1) I was using nutch 1.0 to crawl successfully, but had problems w/
parse-pdf. I decided to try nutch 1.1. w/ parse-tika, which appears to
parse all of the 'problem' pdfs that parse-pdf could not handle. The
crawldb
using Nutch nightly build nutch-2010-04-27_04-00-28:
I am trying to bin/nutch crawl a single html file generated by javadoc
and no links are followed. I verified this with bin/nutch readdb and
bin/nutch readlinkdb, and also with luke-1.0.1. Only the single base
seed doc specified is processed.
I
java
On Wed, 2010-04-28 at 00:39 -0400, matthew a. grisius wrote:
using Nutch nightly build nutch-2010-04-27_04-00-28:
I am trying to bin/nutch crawl a single html file generated by javadoc
and no links are followed. I verified this with bin/nutch readdb and
bin/nutch readlinkdb, and also
I also share many of Phil's sentiments. I really want the project
(bin/nutch crawl) to work for me as well and I want to help somehow. I
would like to share a 5gb 'intranet' web site with ~50 people. And I
have not graduated to making the 'deepcrawl' script work yet either, as
I'm thinking that
In nutch-site.xml I modified plugin.includes:
parse-(html) works
parse-(tika) does not
I also need to parse pdfs, so I need both features. I tried
parse-(html|tika) to see if html would be selected before tika, and that
did not work.
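
For anyone following along, here is a rough sketch of how a
plugin.includes-style pattern picks plugin ids. It assumes Nutch matches
the whole plugin id against the regular expression, which is my
understanding of the plugin loader but is not confirmed in this thread.
Only the parse-(html|tika) part comes from the message above; the other
pattern entries, the plugin id list and the function name are
hypothetical, for illustration only.

```python
import re

# Hypothetical plugin.includes value. Only parse-(html|tika) is from the
# thread; the other entries are placeholders for illustration.
PLUGIN_INCLUDES = r"protocol-http|urlfilter-regex|parse-(html|tika)|index-basic"

# Illustrative plugin ids to check against the pattern.
CANDIDATE_PLUGINS = [
    "protocol-http",
    "parse-html",
    "parse-tika",
    "parse-pdf",
    "index-basic",
]


def included_plugins(pattern, plugin_ids):
    """Return the ids matched in full by the pattern (assumed semantics)."""
    compiled = re.compile(pattern)
    return [pid for pid in plugin_ids if compiled.fullmatch(pid)]


selected = included_plugins(PLUGIN_INCLUDES, CANDIDATE_PLUGINS)
print(selected)
```

Under that assumption, parse-(html|tika) includes both parser plugins,
and the order of the names inside the parentheses only affects how the
regex is evaluated. It does not by itself say which parser should handle
a given document.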
On Wed, 2010-04-28 at 00:39 -0400, matthew a. grisius wrote:
Hi Chris,
Yes, that worked. I caught up on email and noticed that Arpit also
mentioned the same thing. Sorry I missed it, thanks to both of you!
-m.
On Sat, 2010-05-01 at 21:06 -0700, Mattmann, Chris A (388J) wrote:
Hi Matthew,
There is an open issue with Tika (e.g.
, matthew a. grisius mgris...@comcast.net wrote:
Hi Chris,
Yes, that worked. I caught up on email and noticed that Arpit also
mentioned the same thing. Sorry I missed it, thanks to both of you!
-m.
On Sat, 2010-05-01 at 21:06 -0700, Mattmann, Chris A (388J) wrote:
Hi Matthew
those jars in your
parse-tika/lib folder, replacing their originals. Then, try your nutch crawl
again.
See if that works. In the meanwhile, I'll inspect Julien's patch.
Thanks!
Cheers,
Chris
On 5/4/10 9:02 PM, matthew a. grisius mgris...@comcast.net wrote:
Hi Chris,
It appears