Quite an episode indeed! Is it possible to further this on the Dev list?
New thread, focused subject... relative outcomes (fingers crossed) :0)

On Fri, Nov 25, 2011 at 4:49 PM, Markus Jelsma <[email protected]> wrote:

> On Friday 25 November 2011 17:36:36 Mattmann, Chris A (388J) wrote:
>> Hey Ken,
>>
>> On Nov 25, 2011, at 7:58 AM, Ken Krugler wrote:
>>> From my experience with Nutch and now Bixo, I think it's important for
>>> the tools to support a -debug mode that dumps out info about all
>>> decisions being made on URLs, as otherwise tracking down what's going
>>> wrong with a crawl (especially when doing test crawls) can be very
>>> painful.
>>
>> +1, agreed. If you're like me you resort to inserting
>> System.out.printlns everywhere :-)
>
> I do that all the time when producing code. Setting the log level to
> debug is too overwhelming in some cases.
>
>>> I have no idea where Nutch stands in this regard as of today, but I
>>> would assume that it would be possible to generate information that
>>> would have answered all of the "is it X" questions that came up during
>>> Chris's crawl. E.g.:
>>>
>>> - which URLs were put on the fetch list, versus skipped
>>> - which fetched documents were truncated
>>> - which URLs in a parsed page were skipped, due to the max outlinks
>>>   per page limit
>>> - which URLs got filtered by regex
>>
>> These are great requirements for a debug tool. I've created a page on
>> the Wiki for folks to contribute to/discuss:
>>
>> http://wiki.apache.org/nutch/DebugTool
>>
>> Thanks, Ken!
>>
>> Cheers,
>> Chris

On Nov 25, 2011, at 7:49am, Mattmann, Chris A (388J) wrote:

Hey Guys,

Yep, that was it. I had to use -topN 10000 -depth 10, and now I'm getting
all the at_download links.

Phew! Who would have thought. Well, glad Nutch is doing its thing, and
doing it correctly! :-)

Thanks guys.

Cheers,
Chris

On Nov 24, 2011, at 10:46 PM, Mattmann, Chris A (388J) wrote:

Hey Guys,

Here is the latest red herring. I think I was using too small a -topN
parameter in my crawl, which was limiting the whole fetch. I was using
-depth 10 and -topN 10 which, thinking about it now, capped the crawl at
roughly 100 pages in total. That is far too limited, since most pages have
well over 100 outlinks. So parsing, regex, everything was working fine; it
just wasn't following the links down, because the crawl exceeded
-topN * -depth.

I'm running a new crawl now and it seems to be getting a TON more URLs.
Full crawls for me were limited to around ~5k URLs before, which I think
was the problem. Fingers crossed!

Cheers,
Chris
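For reference, the -topN and -depth knobs discussed above belong to the
one-shot crawl command. A minimal sketch of the kind of invocation being
described, with illustrative seed-list and output directory names, is:

  ./bin/nutch crawl urls -dir crawl -depth 10 -topN 10000

Since -topN caps how many URLs the generator selects per round and -depth
caps the number of rounds, a crawl run with -depth 10 -topN 10 can never
fetch more than about 10 * 10 = 100 pages in total, which is exactly the
ceiling described above.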
On Nov 24, 2011, at 10:55 AM, Mattmann, Chris A (388J) wrote:

Hey Markus,

On Nov 24, 2011, at 10:23 AM, Markus Jelsma wrote:
> Hi Chris
>
> https://issues.apache.org/jira/browse/NUTCH-1087

Thanks for the pointer. I'll check it out.

> Use the org.apache.nutch.net.URLFilterChecker to test.

Sweet, I didn't know about this tool. OK, I tried it out; check it out
(note that this includes my instrumented stuff, hence the printlns):

echo "http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file" | ./bin/nutch org.apache.nutch.net.URLFilterChecker -filterName org.apache.nutch.urlfilter.regex.RegexURLFilter
Checking URLFilter org.apache.nutch.urlfilter.regex.RegexURLFilter
@#((#(#@ EVALUATING at_download LINK!: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file]: matched? [false]
@#((#(#@ EVALUATING at_download LINK!: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file]: matched? [false]
@#((#(#@ EVALUATING at_download LINK!: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file]: matched? [true]
+http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file

So it looks like the URL didn't match the first 2 rules, but matched the
3rd one, and thus the filter actually includes the URL fine. So, watch
this, here are my 3 relevant rules:

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

+^http://([a-z0-9]*\.)*vault.fbi.gov/

So, that makes perfect sense. RegexURLFilter appears to be working
normally, so that's fine.

So... what's the deal, then? ParserChecker works fine; it shows that an
outlink from this URL:

http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view

is in fact the at_download link:

[chipotle:local/nutch/framework] mattmann% ./bin/nutch org.apache.nutch.parse.ParserChecker http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view | grep "download"
outlink: toUrl: http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file anchor: watergat1summary.pdf
[chipotle:local/nutch/framework] mattmann%

RegexURLFilter takes in either of those URLs and says they are fine:

[chipotle:local/nutch/framework] mattmann% ./bin/nutch org.apache.nutch.parse.ParserChecker http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view | grep download | awk '{print $3}' | ./bin/nutch org.apache.nutch.net.URLFilterChecker -filterName org.apache.nutch.urlfilter.regex.RegexURLFilter
Checking URLFilter org.apache.nutch.urlfilter.regex.RegexURLFilter
@#((#(#@ EVALUATING at_download LINK!: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file]: matched? [false]
@#((#(#@ EVALUATING at_download LINK!: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file]: matched? [false]
@#((#(#@ EVALUATING at_download LINK!: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file]: matched? [true]
+http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file
[chipotle:local/nutch/framework] mattmann%

[chipotle:local/nutch/framework] mattmann% echo "http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view" | ./bin/nutch org.apache.nutch.net.URLFilterChecker -filterName org.apache.nutch.urlfilter.regex.RegexURLFilter
Checking URLFilter org.apache.nutch.urlfilter.regex.RegexURLFilter
URL: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view] doesn't have at_download in it!
URL: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view] doesn't have at_download in it!
URL: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view] doesn't have at_download in it!
+http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view
[chipotle:local/nutch/framework] mattmann%

Any idea why I wouldn't be getting the at_download URLs downloaded, then?
Here are http.content.limit and db.max.outlinks from my Nutch conf:

[chipotle:local/nutch/framework] mattmann% egrep -i -A1 "db\.max\.outlinks|http\.content\.limit" conf/nutch-default.xml
confuse this setting with the http.content.limit setting.
</description>
--
<name>http.content.limit</name>
<value>-1</value>
--
<name>db.max.outlinks.per.page</name>
<value>-1</value>
--
If this value is nonnegative (>=0), at most db.max.outlinks.per.page
outlinks will be processed for a page; otherwise, all outlinks will be
processed.
[chipotle:local/nutch/framework] mattmann%

Cheers,
Chris
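Two quick checks can help answer that question. This is a sketch assuming
a local crawl directory named crawl/ (adjust the paths to your setup): the
first asks the crawldb what it knows about the URL, and the second, if
your copy of URLFilterChecker supports the -allCombined option, runs the
URL through every enabled filter rather than just urlfilter-regex:

  ./bin/nutch readdb crawl/crawldb -stats
  ./bin/nutch readdb crawl/crawldb -url http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file
  echo "http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file" | ./bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined

If the URL is missing from the crawldb entirely, it was never injected or
discovered; if it is present but still listed as db_unfetched, it was
discovered but never selected by the generator (for example because of the
-topN cap), which is what turned out to be happening here.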
Hey Markus,

On Nov 24, 2011, at 9:50 AM, Markus Jelsma wrote:
>> Hey Markus,
>>
>> On Nov 24, 2011, at 9:16 AM, Markus Jelsma wrote:
>>> I think Marek is right, the crawl-filter _is_ used in the crawl
>>> command. I don't know what happens if it isn't there.
>>
>> Interesting. Where is the crawl-urlfilter.txt? It's not in my built
>> conf directory in runtime/local/conf from 1.4?
>
> It's gone! I checked and last saw it in 1.2. Strange.
>
>>> Good reasons to get rid of the crawl command and stuff in 1.5, if you
>>> ask me.
>>
>> I'd be in favor of replacing the current Crawl command with a simple
>> Java driver that just calls the underlying Inject, Generate, and Fetch
>> tools. Would that work?
>
> There's an open issue to replace it with a basic crawl shell script.
> It's easier to understand and uses the same commands. Non-Java users
> should be able to deal with it better, and provide us with better
> problem descriptions.

+1, that would be cool indeed. Do you know what issue it is?

BTW, I'm currently instrumenting urlfilter-regex to see if I can figure
out whether it's dropping the at_download URLs for whatever reason. Sigh.

Cheers,
Chris
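For what it's worth, a bare-bones version of such a script would just
chain the existing tools. Roughly, as a sketch (directory names and the
number of rounds are illustrative, and all error handling is omitted):

  #!/bin/sh
  # seed the crawldb from the urls/ seed directory
  ./bin/nutch inject crawl/crawldb urls
  for round in 1 2 3 4 5; do
    # select the top-scoring unfetched URLs for this round
    ./bin/nutch generate crawl/crawldb crawl/segments -topN 10000
    # newest segment sorts last because segment names are timestamps
    SEGMENT=`ls -d crawl/segments/* | tail -1`
    ./bin/nutch fetch $SEGMENT
    ./bin/nutch parse $SEGMENT
    # fold fetch results and newly discovered outlinks back into the crawldb
    ./bin/nutch updatedb crawl/crawldb $SEGMENT
  done

Running those same commands by hand is also the easiest way to see exactly
which step is dropping a URL.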
On Thursday 24 November 2011 16:59:42 Mattmann, Chris A (388J) wrote:

Hi Marek,

On Nov 24, 2011, at 7:46 AM, Marek Bachmann wrote:
> I think when you use the crawl command instead of the single commands,
> you have to specify the regex rules in the crawl-urlfilter.txt file. But
> I don't know if that is still the case in 1.4.
>
> Could that be the problem?

Doesn't look like there's a crawl-urlfilter.txt in the conf dir. Also, it
looks like urlfilter-regex is the one that's enabled by default and
shipped with the basic config.

Thanks for trying to help though. I'm going to figure this out! Or,
someone is probably going to tell me what I'm doing wrong. We'll see what
happens first :-)

Cheers,
Chris

On 24.11.2011 16:20, Mattmann, Chris A (388J) wrote:

On Nov 24, 2011, at 3:21 AM, Julien Nioche wrote:
>> OK, nm. This *is* different behavior from 1.3 apparently, but I figured
>> out how to make it work in 1.4 (instead of editing the global,
>> top-level conf/nutch-default.xml, I needed to edit
>> runtime/local/conf/nutch-default.xml). Crawling is forging ahead.
>
> Yep, I think this is documented on the Wiki. It is partially why I
> suggested that we deliver the content of runtime/local as our binary
> release next time. Most people use Nutch in local mode, so this would
> make their lives easier; advanced users (read: pseudo or real
> distributed) need to recompile the job file anyway, and I'd expect them
> to use the src release.

+1, I'll be happy to edit build.xml and make that happen for 1.5.

In the meanwhile, time to figure out why I still can't get it to crawl
the PDFs... :(

Cheers,
Chris
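A note on the runtime/local/conf point above, since it is easy to trip
over: in a 1.4 source checkout the local runtime reads its configuration
from runtime/local/conf, and that directory is repopulated from the
top-level conf/ when the runtime is rebuilt. A sketch, assuming the stock
build.xml targets; by convention, overrides go in nutch-site.xml rather
than nutch-default.xml:

  # after editing the top-level conf/ files, rebuild the local runtime
  ant runtime
  # ...or edit the copy that the local runtime actually reads
  vi runtime/local/conf/nutch-site.xml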
--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

--
Markus Jelsma - CTO - Openindex

--
*Lewis*

