On Friday 25 November 2011 17:36:36 Mattmann, Chris A (388J) wrote:
> Hey Ken,
>
> On Nov 25, 2011, at 7:58 AM, Ken Krugler wrote:
> > From my experience with Nutch and now Bixo, I think it's important to
> > support a -debug mode with tools that dump out info about all decisions
> > being made on URLs, as otherwise tracking down what's going wrong with a
> > crawl (especially when doing test crawls) can be very painful.
>
> +1, agreed. If you're like me you resort to inserting System.out.printlns
> everywhere :-)

I do that all the time when producing code. Setting the log level to debug is
too overwhelming in some cases.
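(As an illustration of the kind of per-URL decision logging Ken describes, here
is a minimal, self-contained Java sketch. The URLFilter interface and the chain
class below are hypothetical stand-ins for illustration only, not Nutch's
actual plugin API.)

import java.util.LinkedHashMap;
import java.util.Map;

public class DebugURLFilterChain {

    // Stand-in for a URL filter: return the URL to accept it, null to reject it.
    // This is an illustration only, not the real Nutch URLFilter plugin interface.
    interface URLFilter {
        String filter(String url);
    }

    private final Map<String, URLFilter> filters = new LinkedHashMap<>();
    private final boolean debug;

    DebugURLFilterChain(boolean debug) {
        this.debug = debug;
    }

    void add(String name, URLFilter filter) {
        filters.put(name, filter);
    }

    // Run every filter in order; in debug mode, report exactly which filter
    // rejected a URL instead of silently dropping it.
    String filter(String url) {
        for (Map.Entry<String, URLFilter> entry : filters.entrySet()) {
            String result = entry.getValue().filter(url);
            if (result == null) {
                if (debug) System.out.println("REJECTED by " + entry.getKey() + ": " + url);
                return null;
            }
            url = result;
        }
        if (debug) System.out.println("ACCEPTED: " + url);
        return url;
    }

    public static void main(String[] args) {
        DebugURLFilterChain chain = new DebugURLFilterChain(true);
        chain.add("skip-image-suffixes", u -> u.matches(".*\\.(gif|jpg|png)$") ? null : u);
        chain.add("vault-only", u -> u.startsWith("http://vault.fbi.gov/") ? u : null);

        chain.filter("http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file");
        chain.filter("http://vault.fbi.gov/images/logo.gif");
    }
}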
> > I have no idea where Nutch stands in this regard as of today, but I would
> > assume that it would be possible to generate information that would have
> > answered all of the "is it X" questions that came up during Chris's
> > crawl. E.g.
> >
> > - which URLs were put on the fetch list, versus skipped.
> > - which fetched documents were truncated.
> > - which URLs in a parsed page were skipped, due to the max outlinks per
> >   page limit.
> > - which URLs got filtered by regex.
>
> These are great requirements for a debug tool. I've created a page on the
> Wiki for folks to contribute to/discuss:
>
> http://wiki.apache.org/nutch/DebugTool
>
> Thanks, Ken!
>
> Cheers,
> Chris
>
> > On Nov 25, 2011, at 7:49am, Mattmann, Chris A (388J) wrote:
> >> Hey Guys,
> >>
> >> Yep, that was it. I had to use -topN 10000 -depth 10, and now I'm getting
> >> all the at_download links.
> >>
> >> Phew! Who would have thought. Well, glad Nutch is doing its thing, and
> >> doing it correctly! :-)
> >>
> >> Thanks guys.
> >>
> >> Cheers,
> >> Chris
> >>
> >> On Nov 24, 2011, at 10:46 PM, Mattmann, Chris A (388J) wrote:
> >>> Hey Guys,
> >>>
> >>> Here is the latest red herring. I think I was using too small a -topN
> >>> parameter in my crawl, which was limiting the whole fetch. I was using
> >>> -depth 10 and -topN 10, which, thinking about it now, was limiting the
> >>> crawl to 100 pages total across all depth levels. That was too limited,
> >>> I think, since most pages have more than 100 outlinks. So parsing,
> >>> regex, everything was working fine; it just wasn't following the links
> >>> down, because the crawl exceeded -topN * -depth.
> >>>
> >>> I'm running a new crawl now and it seems to be getting a TON more URLs.
> >>> Full crawls for me were limited to around ~5k URLs before, which I
> >>> think was the problem. Fingers crossed!
> >>>
> >>> Cheers,
> >>> Chris
> >>>
> >>> On Nov 24, 2011, at 10:55 AM, Mattmann, Chris A (388J) wrote:
> >>>> Hey Markus,
> >>>>
> >>>> On Nov 24, 2011, at 10:23 AM, Markus Jelsma wrote:
> >>>>> Hi Chris
> >>>>>
> >>>>> https://issues.apache.org/jira/browse/NUTCH-1087
> >>>>
> >>>> Thanks for the pointer. I'll check it out.
> >>>>
> >>>>> Use the org.apache.nutch.net.URLFilterChecker to test.
> >>>>
> >>>> Sweet, I didn't know about this tool. OK, I tried it out, check it out
> >>>> (note that this includes my instrumented stuff, hence the printlns):
> >>>>
> >>>> echo "http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file" | ./bin/nutch org.apache.nutch.net.URLFilterChecker -filterName org.apache.nutch.urlfilter.regex.RegexURLFilter
> >>>> Checking URLFilter org.apache.nutch.urlfilter.regex.RegexURLFilter
> >>>> @#((#(#@ EVALUATING at_download LINK!: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file]: matched? [false]
> >>>> @#((#(#@ EVALUATING at_download LINK!: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file]: matched? [false]
> >>>> @#((#(#@ EVALUATING at_download LINK!: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file]: matched? [true]
> >>>> +http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file
> >>>>
> >>>> So, it looks like the URL didn't match the first 2 rules, but matched
> >>>> the 3rd one, and thus the filter actually includes the URL fine.
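(As an aside, the -topN * -depth arithmetic Chris describes above can be made
concrete with a rough, hypothetical sketch of a crawl driver loop, of the kind
discussed further down the thread: each generate round selects at most topN
URLs and there are at most depth rounds, so the number of fetched pages is
bounded by topN * depth no matter how many outlinks each page yields. The class
and step logic below are placeholders for illustration, not Nutch's actual
Inject/Generate/Fetch tools.)

import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class CrawlLoopSketch {

    public static void main(String[] args) {
        int topN = 10;   // max URLs selected per generate round
        int depth = 10;  // number of generate/fetch rounds

        Set<String> crawlDb = new LinkedHashSet<>();
        crawlDb.add("http://vault.fbi.gov/");          // "inject": seed URL(s)

        Set<String> fetched = new LinkedHashSet<>();
        for (int round = 0; round < depth; round++) {
            // "generate": pick at most topN not-yet-fetched URLs from the crawl db
            List<String> segment = new ArrayList<>();
            for (String url : crawlDb) {
                if (segment.size() >= topN) break;
                if (!fetched.contains(url)) segment.add(url);
            }
            if (segment.isEmpty()) break;

            // "fetch" + "parse" + "updatedb": mark as fetched, feed outlinks back in
            for (String url : segment) {
                fetched.add(url);
                crawlDb.addAll(discoverOutlinks(url));
            }
        }

        // No matter how many outlinks each page contributes, the number of
        // fetched pages is bounded by topN * depth (here 10 * 10 = 100).
        System.out.println("fetched " + fetched.size() + " of " + crawlDb.size()
                + " known URLs (bound: " + (topN * depth) + ")");
    }

    // Placeholder for parsing a fetched page and URL-filtering its outlinks.
    static List<String> discoverOutlinks(String url) {
        return List.of(url + "a", url + "b", url + "c");
    }
}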
> >>>> So, watch this, here are my 3 relevant rules:
> >>>>
> >>>> # skip file: ftp: and mailto: urls
> >>>> -^(file|ftp|mailto):
> >>>>
> >>>> # skip image and other suffixes we can't yet parse
> >>>> # for a more extensive coverage use the urlfilter-suffix plugin
> >>>> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
> >>>>
> >>>> +^http://([a-z0-9]*\.)*vault.fbi.gov/
> >>>>
> >>>> So, that makes perfect sense. RegexURLFilter appears to be working
> >>>> normally, so that's fine.
> >>>>
> >>>> So, .... what's the deal, then? ParserChecker works fine; it shows
> >>>> that an outlink from this URL:
> >>>>
> >>>> http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view
> >>>>
> >>>> is in fact the at_download link:
> >>>>
> >>>> [chipotle:local/nutch/framework] mattmann% ./bin/nutch org.apache.nutch.parse.ParserChecker http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view | grep "download"
> >>>> outlink: toUrl: http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file anchor: watergat1summary.pdf
> >>>> [chipotle:local/nutch/framework] mattmann%
> >>>>
> >>>> RegexURLFilter takes in either of those URLs and says they are fine:
> >>>>
> >>>> [chipotle:local/nutch/framework] mattmann% ./bin/nutch org.apache.nutch.parse.ParserChecker http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view | grep download | awk '{print $3}' | ./bin/nutch org.apache.nutch.net.URLFilterChecker -filterName org.apache.nutch.urlfilter.regex.RegexURLFilter
> >>>> Checking URLFilter org.apache.nutch.urlfilter.regex.RegexURLFilter
> >>>> @#((#(#@ EVALUATING at_download LINK!: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file]: matched? [false]
> >>>> @#((#(#@ EVALUATING at_download LINK!: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file]: matched? [false]
> >>>> @#((#(#@ EVALUATING at_download LINK!: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file]: matched? [true]
> >>>> +http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file
> >>>> [chipotle:local/nutch/framework] mattmann%
> >>>>
> >>>> [chipotle:local/nutch/framework] mattmann% echo "http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view" | ./bin/nutch org.apache.nutch.net.URLFilterChecker -filterName org.apache.nutch.urlfilter.regex.RegexURLFilter
> >>>> Checking URLFilter org.apache.nutch.urlfilter.regex.RegexURLFilter
> >>>> URL: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view] doesn't have at_download in it!
> >>>> URL: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view] doesn't have at_download in it!
> >>>> URL: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view] doesn't have at_download in it!
> >>>> +http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view
> >>>> [chipotle:local/nutch/framework] mattmann%
> >>>>
> >>>> Any idea why I wouldn't be getting the at_download URLs downloaded, then?
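(For reference, rule evaluation in this style of filter is first-match-wins:
each regex is tried in order, a leading "+" accepts and a leading "-" rejects,
and a URL that matches no rule is dropped. The at_download URL ends in /file,
so it gets past the suffix rule and is accepted by the vault.fbi.gov rule,
which is consistent with the output above. Below is a minimal, self-contained
Java sketch of that behavior, not the actual RegexURLFilter code, with a
trimmed-down suffix list.)

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class RegexRuleSketch {

    // One "+regex" or "-regex" line from the filter file.
    static class Rule {
        final boolean accept;
        final Pattern pattern;
        Rule(String line) {
            this.accept = line.charAt(0) == '+';
            this.pattern = Pattern.compile(line.substring(1));
        }
    }

    // The first matching rule decides; a URL that matches no rule is rejected.
    static boolean accepts(List<Rule> rules, String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Rule> rules = new ArrayList<>();
        rules.add(new Rule("-^(file|ftp|mailto):"));
        rules.add(new Rule("-\\.(gif|GIF|jpg|JPG|png|PNG|zip|ZIP|exe|EXE)$"));
        rules.add(new Rule("+^http://([a-z0-9]*\\.)*vault.fbi.gov/"));

        // Ends in /file, so the suffix rule does not match and the
        // vault.fbi.gov rule accepts it: prints true.
        System.out.println(accepts(rules,
            "http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file"));

        // Hits the image-suffix rule first: prints false.
        System.out.println(accepts(rules, "http://vault.fbi.gov/images/logo.gif"));
    }
}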
> >>>> Here's http.content.limit, db.max.outlinks from my Nutch conf:
> >>>>
> >>>> [chipotle:local/nutch/framework] mattmann% egrep -i -A1 "db\.max\.outlinks|http\.content\.limit" conf/nutch-default.xml
> >>>> confuse this setting with the http.content.limit setting.
> >>>> </description>
> >>>> --
> >>>> <name>http.content.limit</name>
> >>>> <value>-1</value>
> >>>> --
> >>>> <name>db.max.outlinks.per.page</name>
> >>>> <value>-1</value>
> >>>> --
> >>>> If this value is nonnegative (>=0), at most db.max.outlinks.per.page
> >>>> outlinks will be processed for a page; otherwise, all outlinks will
> >>>> be processed.
> >>>> [chipotle:local/nutch/framework] mattmann%
> >>>>
> >>>> Cheers,
> >>>> Chris
> >>>>
> >>>>>> Hey Markus,
> >>>>>>
> >>>>>> On Nov 24, 2011, at 9:50 AM, Markus Jelsma wrote:
> >>>>>>>> Hey Markus,
> >>>>>>>>
> >>>>>>>> On Nov 24, 2011, at 9:16 AM, Markus Jelsma wrote:
> >>>>>>>>> I think Marek is right, the crawl-filter _is_ used in the crawl
> >>>>>>>>> command. I don't know what happens if it isn't there.
> >>>>>>>>
> >>>>>>>> Interesting. Where is the crawl-urlfilter.txt? It's not in my
> >>>>>>>> built conf directory in runtime/local/conf from 1.4?
> >>>>>>>
> >>>>>>> It's gone! I checked and last saw it in 1.2. Strange.
> >>>>>>>
> >>>>>>>>> Good reasons to get rid of the crawl command and stuff in 1.5 if
> >>>>>>>>> you ask me.
> >>>>>>>>
> >>>>>>>> I'd be in favor of replacing the current Crawl command with a
> >>>>>>>> simple Java driver that just calls the underlying Inject,
> >>>>>>>> Generate, and Fetch tools. Would that work?
> >>>>>>>
> >>>>>>> There's an open issue to replace it with a basic crawl shell
> >>>>>>> script. It's easier to understand and uses the same commands.
> >>>>>>> Non-Java users should be able to deal with it better, and provide
> >>>>>>> us with better problem descriptions.
> >>>>>>
> >>>>>> +1, that would be cool indeed. Do you know which issue it is?
> >>>>>>
> >>>>>> BTW, I'm currently instrumenting urlfilter-regex to see if I can
> >>>>>> figure out whether it's dropping the at_download URLs for whatever
> >>>>>> reason. Sigh.
> >>>>>>
> >>>>>> Cheers,
> >>>>>> Chris
> >>>>>>
> >>>>>>>> Cheers,
> >>>>>>>> Chris
> >>>>>>>>
> >>>>>>>>> On Thursday 24 November 2011 16:59:42 Mattmann, Chris A (388J) wrote:
> >>>>>>>>>> Hi Marek,
> >>>>>>>>>>
> >>>>>>>>>> On Nov 24, 2011, at 7:46 AM, Marek Bachmann wrote:
> >>>>>>>>>>> I think when you use the crawl command instead of the single
> >>>>>>>>>>> commands, you have to specify the regex rules in the
> >>>>>>>>>>> crawl-urlfilter.txt file. But I don't know if that is still
> >>>>>>>>>>> the case in 1.4.
> >>>>>>>>>>>
> >>>>>>>>>>> Could that be the problem?
> >>>>>>>>>>
> >>>>>>>>>> Doesn't look like there's a crawl-urlfilter.txt in the conf dir.
> >>>>>>>>>> Also, it looks like urlfilter-regex is the one that's enabled by
> >>>>>>>>>> default and shipped with the basic config.
> >>>>>>>>>>
> >>>>>>>>>> Thanks for trying to help, though. I'm going to figure this out!
> >>>>>>>>>> Or someone is probably going to tell me what I'm doing wrong.
> >>>>>>>>>> We'll see what happens first :-)
> >>>>>>>>>>
> >>>>>>>>>> Cheers,
> >>>>>>>>>> Chris
> >>>>>>>>>>
> >>>>>>>>>>> On 24.11.2011 16:20, Mattmann, Chris A (388J) wrote:
> >>>>>>>>>>>> On Nov 24, 2011, at 3:21 AM, Julien Nioche wrote:
> >>>>>>>>>>>>>> OK, nm.
> >>>>>>>>>>>>>> This *is* different behavior from 1.3 apparently, but I
> >>>>>>>>>>>>>> figured out how to make it work in 1.4 (instead of editing
> >>>>>>>>>>>>>> the global, top-level conf/nutch-default.xml, I needed to
> >>>>>>>>>>>>>> edit runtime/local/conf/nutch-default.xml). Crawling is
> >>>>>>>>>>>>>> forging ahead.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Yep, I think this is documented on the Wiki. It is partially
> >>>>>>>>>>>>> why I suggested that we deliver the content of runtime/local
> >>>>>>>>>>>>> as our binary release next time. Most people use Nutch in
> >>>>>>>>>>>>> local mode, so this would make their lives easier; as for the
> >>>>>>>>>>>>> advanced users (read: pseudo or real distributed), they need
> >>>>>>>>>>>>> to recompile the job file anyway and I'd expect them to use
> >>>>>>>>>>>>> the src release.
> >>>>>>>>>>>>
> >>>>>>>>>>>> +1, I'll be happy to edit build.xml and make that happen for
> >>>>>>>>>>>> 1.5.
> >>>>>>>>>>>>
> >>>>>>>>>>>> In the meanwhile, time to figure out why I still can't get it
> >>>>>>>>>>>> to crawl the PDFs... :(
> >>>>>>>>>>>>
> >>>>>>>>>>>> Cheers,
> >>>>>>>>>>>> Chris

> > --------------------------
> > Ken Krugler
> > http://www.scaleunlimited.com
> > custom big data solutions & training
> > Hadoop, Cascading, Mahout & Solr

--
Markus Jelsma - CTO - Openindex

