Right! I've been pulling my hair out on similar occasions! In my opinion this 
is another argument for getting rid of the crawl command: the -depth parameter 
makes little sense in the long run, because there is no real depth information 
behind it.

I would recommend that you, and anyone else, learn to use the separate 
commands, as Lewis describes in the latest tutorial.
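
For reference, one round with the separate commands looks roughly like this 
(the crawldb/segments paths and the -topN value here are just example choices):

  bin/nutch inject crawl/crawldb urls
  bin/nutch generate crawl/crawldb crawl/segments -topN 10000
  s=`ls -d crawl/segments/* | tail -1`
  bin/nutch fetch $s
  bin/nutch parse $s
  bin/nutch updatedb crawl/crawldb $s

Repeat the generate/fetch/parse/updatedb steps for as many rounds as you would 
otherwise pass to -depth.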

On Friday 25 November 2011 16:49:43 Mattmann, Chris A (388J) wrote:
> Hey Guys,
> 
> Yep that was it. I had to use -topN 10000 -depth 10, and now I'm getting
> all the at_download links.
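> 
> For the record, the invocation that worked is along these lines (the seed
> and output directory names are just my local choices):
> 
>   bin/nutch crawl urls -dir crawl -depth 10 -topN 10000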
> 
> Phew! Who would have thought? Well, glad Nutch is doing its thing, and
> doing it correctly! :-)
> 
> Thanks guys.
> 
> Cheers,
> Chris
> 
> On Nov 24, 2011, at 10:46 PM, Mattmann, Chris A (388J) wrote:
> > Hey Guys,
> > 
> > Here is the latest red herring. I think I was using too small a -topN
> > parameter in my crawl, which was limiting the whole fetch. I was using
> > -depth 10 and -topN 10, which, thinking about it now, was limiting the
> > whole crawl to roughly 100 pages across all depth levels. That was far
> > too small, since most of these pages have more than 100 outlinks. So
> > parsing, regex, everything was working fine; it just wasn't following
> > the links down, because the crawl exceeded -topN * -depth.
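> > 
> > (Rough back-of-the-envelope: -topN caps how many URLs are generated per
> > fetch round and -depth is the number of rounds, so the crawl tops out at
> > roughly topN * depth pages: 10 * 10 = 100 before, versus 10000 * 10 =
> > 100,000 with the new settings.)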
> > 
> > I'm running a new crawl now and it seems to be getting a TON more URLs.
> > Full crawls for me were limited to around 5k URLs before, which I think
> > was the problem. Fingers crossed!
> > 
> > Cheers,
> > Chris
> > 
> > On Nov 24, 2011, at 10:55 AM, Mattmann, Chris A (388J) wrote:
> >> Hey Markus,
> >> 
> >> On Nov 24, 2011, at 10:23 AM, Markus Jelsma wrote:
> >>> Hi Chris
> >>> 
> >>> https://issues.apache.org/jira/browse/NUTCH-1087
> >> 
> >> Thanks for the pointer. I'll check it out.
> >> 
> >>> Use the org.apache.nutch.net.URLFilterChecker to test.
> >> 
> >> Sweet, I didn't know about this tool. OK, I tried it out, check it (note
> >> that this includes my instrumented stuff, hence the printlns):
> >> 
> >> echo "http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file" | ./bin/nutch org.apache.nutch.net.URLFilterChecker -filterName org.apache.nutch.urlfilter.regex.RegexURLFilter
> >> Checking URLFilter org.apache.nutch.urlfilter.regex.RegexURLFilter
> >> @#((#(#@ EVALUATING at_download LINK!: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file]: matched? [false]
> >> @#((#(#@ EVALUATING at_download LINK!: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file]: matched? [false]
> >> @#((#(#@ EVALUATING at_download LINK!: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file]: matched? [true]
> >> +http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file
> >> 
> >> So, it looks like the URL didn't match the first 2 rules, but matched
> >> the 3rd one, and thus it is actually accepted just fine. So, watch this,
> >> here are my 3 relevant rules:
> >> 
> >> # skip file: ftp: and mailto: urls
> >> -^(file|ftp|mailto):
> >> 
> >> # skip image and other suffixes we can't yet parse
> >> # for a more extensive coverage use the urlfilter-suffix plugin
> >> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|
> >> ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|J
> >> PEG|bmp|BMP|js|JS)$
> >> 
> >> +^http://([a-z0-9]*\.)*vault.fbi.gov/
> >> 
> >> So, that makes perfect sense. RegexURLFilter appears to be working
> >> normally, so that's fine.
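> >> 
> >> (If I remember the semantics right, urlfilter-regex applies the rules top
> >> to bottom and the first match wins: '-' rejects, '+' accepts, and a URL
> >> that matches no rule at all is dropped. So the evaluation above is,
> >> roughly:
> >> 
> >>   -^(file|ftp|mailto):                     no match, keep going
> >>   -\.(gif|GIF|...|js|JS)$                  no match, keep going
> >>   +^http://([a-z0-9]*\.)*vault.fbi.gov/    match, accept
> >> )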
> >> 
> >> So... what's the deal, then? ParserChecker works fine; it shows that
> >> an outlink from this URL:
> >> 
> >> http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view
> >> 
> >> is in fact the at_download link:
> >> 
> >> [chipotle:local/nutch/framework] mattmann% ./bin/nutch org.apache.nutch.parse.ParserChecker http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view | grep "download"
> >> outlink: toUrl: http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file anchor: watergat1summary.pdf
> >> [chipotle:local/nutch/framework] mattmann%
> >> 
> >> RegexURLFilter takes in either of those URLs, and says they are fine:
> >> 
> >> [chipotle:local/nutch/framework] mattmann% ./bin/nutch org.apache.nutch.parse.ParserChecker http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view | grep download | awk '{print $3}' | ./bin/nutch org.apache.nutch.net.URLFilterChecker -filterName org.apache.nutch.urlfilter.regex.RegexURLFilter
> >> Checking URLFilter org.apache.nutch.urlfilter.regex.RegexURLFilter
> >> @#((#(#@ EVALUATING at_download LINK!: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file]: matched? [false]
> >> @#((#(#@ EVALUATING at_download LINK!: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file]: matched? [false]
> >> @#((#(#@ EVALUATING at_download LINK!: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file]: matched? [true]
> >> +http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file
> >> [chipotle:local/nutch/framework] mattmann%
> >> 
> >> [chipotle:local/nutch/framework] mattmann% echo "http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view" | ./bin/nutch org.apache.nutch.net.URLFilterChecker -filterName org.apache.nutch.urlfilter.regex.RegexURLFilter
> >> Checking URLFilter org.apache.nutch.urlfilter.regex.RegexURLFilter
> >> URL: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view] doesn't have at_download in it!
> >> URL: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view] doesn't have at_download in it!
> >> URL: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view] doesn't have at_download in it!
> >> +http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view
> >> [chipotle:local/nutch/framework] mattmann%
> >> 
> >> Any idea why I wouldn't be getting the at_download URLs downloaded,
> >> then? Here are http.content.limit and db.max.outlinks.per.page from my
> >> Nutch conf:
> >> 
> >> [chipotle:local/nutch/framework] mattmann% egrep -i -A1 "db\.max\.outlinks|http\.content\.limit" conf/nutch-default.xml
> >> confuse this setting with the http.content.limit setting.
> >> </description>
> >> --
> >> <name>http.content.limit</name>
> >> <value>-1</value>
> >> --
> >> <name>db.max.outlinks.per.page</name>
> >> <value>-1</value>
> >> --
> >> If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks will be processed for a page; otherwise, all outlinks will be processed.
> >> [chipotle:local/nutch/framework] mattmann%
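> >> 
> >> (For completeness: if you wanted to set those explicitly instead of
> >> relying on the defaults, the overrides would go in conf/nutch-site.xml,
> >> along these lines; just a sketch, since the values above already come
> >> from nutch-default.xml:
> >> 
> >>   <property>
> >>     <name>http.content.limit</name>
> >>     <value>-1</value>
> >>   </property>
> >>   <property>
> >>     <name>db.max.outlinks.per.page</name>
> >>     <value>-1</value>
> >>   </property>
> >> )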
> >> 
> >> 
> >> Cheers,
> >> Chris
> >> 
> >>>> Hey Markus,
> >>>> 
> >>>> On Nov 24, 2011, at 9:50 AM, Markus Jelsma wrote:
> >>>>>> Hey Markus,
> >>>>>> 
> >>>>>> On Nov 24, 2011, at 9:16 AM, Markus Jelsma wrote:
> >>>>>>> I think Marek is right: the crawl-filter _is_ used in the crawl
> >>>>>>> command. I don't know what happens if it isn't there.
> >>>>>> 
> >>>>>> Interesting. Where is the crawl-urlfilter.txt? It's not in my built
> >>>>>> conf directory in runtime/local/conf from 1.4?
> >>>>> 
> >>>>> It's gone! I checked, and I last saw it in 1.2. Strange.
> >>>>> 
> >>>>>>> Good reasons to get rid of the crawl command and stuff in 1.5 if
> >>>>>>> you ask me.
> >>>>>> 
> >>>>>> I'd be in favor of replacing the current Crawl command with a simple
> >>>>>> Java driver that just calls the underlying Inject, Generate, and
> >>>>>> Fetch tools. Would that work?
> >>>>> 
> >>>>> There's an open issue to replace it with a basic crawl shell script.
> >>>>> It's easier to understand and uses the same commands. Non-Java users
> >>>>> should be able to deal with it better, and provide us with better
> >>>>> problem descriptions.
> >>>> 
> >>>> +1, that would be cool indeed. Do you know which issue it is?
> >>>> 
> >>>> BTW, I'm currently instrumenting urlfilter-regex to see if I can
> >>>> figure out whether it's dropping the at_download URLs for whatever
> >>>> reason. Sigh.
> >>>> 
> >>>> Cheers,
> >>>> Chris
> >>>> 
> >>>>>> Cheers,
> >>>>>> Chris
> >>>>>> 
> >>>>>>> On Thursday 24 November 2011 16:59:42 Mattmann, Chris A (388J) wrote:
> >>>>>>>> Hi Marek,
> >>>>>>>> 
> >>>>>>>> On Nov 24, 2011, at 7:46 AM, Marek Bachmann wrote:
> >>>>>>>>> I think when you use the crawl command instead of the single
> >>>>>>>>> commands, you have to specify the regex rules in the
> >>>>>>>>> crawl-urlfilter.txt file. But I don't know if that is still the
> >>>>>>>>> case in 1.4.
> >>>>>>>>> 
> >>>>>>>>> Could that be the problem?
> >>>>>>>> 
> >>>>>>>> Doesn't look like there's a crawl-urlfilter.txt in the conf dir.
> >>>>>>>> Also it looks like urlfilter-regex is the one that's enabled by
> >>>>>>>> default and shipped with the basic config.
> >>>>>>>> 
> >>>>>>>> Thanks for trying to help though. I'm going to figure this out!
> >>>>>>>> Or, someone is going to probably tell me what I'm doing wrong.
> >>>>>>>> We'll see what happens first :-)
> >>>>>>>> 
> >>>>>>>> Cheers,
> >>>>>>>> Chris
> >>>>>>>> 
> >>>>>>>>> On 24.11.2011 16:20, Mattmann, Chris A (388J) wrote:
> >>>>>>>>>> On Nov 24, 2011, at 3:21 AM, Julien Nioche wrote:
> >>>>>>>>>>>> OK, nm. This *is* different behavior from 1.3 apparently, but
> >>>>>>>>>>>> I figured out how to make it work in 1.4 (instead of editing
> >>>>>>>>>>>> the global, top-level conf/nutch-default.xml,
> >>>>>>>>>>>> I needed to edit runtime/local/conf/nutch-default.xml).
> >>>>>>>>>>>> Crawling is forging ahead.
> >>>>>>>>>>> 
> >>>>>>>>>>> yep, I think this is documented on the Wiki. It is partially
> >>>>>>>>>>> why I suggested that we deliver the content of runtime/local
> >>>>>>>>>>> as our binary release next time. Most people use Nutch in local
> >>>>>>>>>>> mode, so this would make their lives easier. As for the
> >>>>>>>>>>> advanced users (read: pseudo- or fully-distributed), they need
> >>>>>>>>>>> to recompile the job file anyway, and I'd expect them to use
> >>>>>>>>>>> the src release.
> >>>>>>>>>> 
> >>>>>>>>>> +1, I'll be happy to edit build.xml and make that happen for
> >>>>>>>>>> 1.5.
> >>>>>>>>>> 
> >>>>>>>>>> In the meanwhile, time to figure out why I still can't get it to
> >>>>>>>>>> crawl the PDFs... :(
> >>>>>>>>>> 
> >>>>>>>>>> Cheers,
> >>>>>>>>>> Chris
> >>>>>>>>>> 
> 
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: [email protected]
> WWW:   http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-- 
Markus Jelsma - CTO - Openindex
