nutch 1.1 crawl d/n complete issue

2010-04-15 Thread matthew a. grisius
Two observations using the nutch 1.1 nightly build nutch-2010-04-14_04-00-47: 1) Previously I was using nutch 1.0 to crawl successfully, but had problems w/ parse-pdf. I decided to try nutch 1.1 w/ parse-tika, which appears to parse all of the 'problem' pdfs that parse-pdf could not handle. The

Re: nutch 1.1 crawl d/n complete issue

2010-04-15 Thread matthew a. grisius
/indexes crawl/crawldb/ crawl/linkdb crawl/segments/20100415163946 crawl/segments/20100415164106 This seems to work for me. You may have already tried this workaround, but just in case. -Harry On Fri, Apr 16, 2010 at 3:34 AM, matthew a. grisius mgris...@comcast.net wrote: Two observations

nutch 1.1 crawl d/n complete issue

2010-04-16 Thread matthew a. grisius
Two observations using the nutch 1.1 nightly build nutch-2010-04-14_04-00-47: 1) I was using nutch 1.0 to crawl successfully, but had problems w/ parse-pdf. I decided to try nutch 1.1 w/ parse-tika, which appears to parse all of the 'problem' pdfs that parse-pdf could not handle. The crawldb

nutch crawl issue

2010-04-27 Thread matthew a. grisius
using Nutch nightly build nutch-2010-04-27_04-00-28: I am trying to bin/nutch crawl a single html file generated by javadoc and no links are followed. I verified this with bin/nutch readdb and bin/nutch readlinkdb, and also with luke-1.0.1. Only the single base seed doc specified is processed. I

Re: nutch crawl issue

2010-04-28 Thread matthew a. grisius
java On Wed, 2010-04-28 at 00:39 -0400, matthew a. grisius wrote: using Nutch nightly build nutch-2010-04-27_04-00-28: I am trying to bin/nutch crawl a single html file generated by javadoc and no links are followed. I verified this with bin/nutch readdb and bin/nutch readlinkdb, and also

Re: [VOTE] Apache Nutch 1.1 Release Candidate #2

2010-04-28 Thread matthew a. grisius
I also share many of Phil's sentiments. I really want the project (bin/nutch crawl) to work for me as well and I want to help somehow. I would like to share a 5gb 'intranet' web site with ~50 people. And I have not graduated to making the 'deepcrawl' script work yet either, as I'm thinking that

Re: nutch crawl issue

2010-04-29 Thread matthew a. grisius
In nutch-site.xml I modified plugin.includes: parse-(html) works, parse-(tika) does not. I also need to parse pdfs, so I need both features. I tried parse-(html| tika) to see if html would be selected before tika, and that did not work. On Wed, 2010-04-28 at 00:39 -0400, matthew a. grisius wrote

Re: nutch crawl issue

2010-05-01 Thread matthew a. grisius
Hi Julien, On Thu, 2010-04-29 at 18:36 +0100, Julien Nioche wrote: Hi Matthew, There is an open issue with Tika (e.g. https://issues.apache.org/jira/browse/TIKA-379) that could explain the differences between parse-html and parse-tika. Note that you can specify: parse-(html|pdf) in order

Re: nutch crawl issue

2010-05-03 Thread matthew a. grisius
Hi Chris, Yes, that worked. I caught up on email and noticed that Arpit also mentioned the same thing. Sorry I missed it, thanks to both of you! -m. On Sat, 2010-05-01 at 21:06 -0700, Mattmann, Chris A (388J) wrote: Hi Matthew, Hi Matthew, There is an open issue with Tika (e.g.

Re: nutch crawl issue

2010-05-04 Thread matthew a. grisius
, matthew a. grisius mgris...@comcast.net wrote: Hi Chris, Yes, that worked. I caught up on email and noticed that Arpit also mentioned the same thing. Sorry I missed it, thanks to both of you! -m. On Sat, 2010-05-01 at 21:06 -0700, Mattmann, Chris A (388J) wrote: Hi Matthew

Re: nutch crawl issue

2010-05-05 Thread matthew a. grisius
those jars in your parse-tika/lib folder, replacing their originals. Then, try your nutch crawl again. See if that works. In the meanwhile, I'll inspect Julien's patch. Thanks! Cheers, Chris On 5/4/10 9:02 PM, matthew a. grisius mgris...@comcast.net wrote: Hi Chris, It appears