One comment, from watching this email thread at a distance. From my experience with Nutch and now Bixo, I think it's important for the tools to support a -debug mode that dumps out info about every decision being made on a URL, as otherwise tracking down what's going wrong with a crawl (especially when doing test crawls) can be very painful.
I have no idea where Nutch stands in this regard as of today, but I would assume that it would be possible to generate information that would have answered all of the "is it X" questions that came up during Chris's crawl, e.g.:

- which URLs were put on the fetch list, versus skipped
- which fetched documents were truncated
- which URLs in a parsed page were skipped, due to the max outlinks per page limit
- which URLs got filtered by regex

and so on.

-- Ken

On Nov 25, 2011, at 7:49am, Mattmann, Chris A (388J) wrote:

> Hey Guys,
>
> Yep that was it. I had to use -topN 10000 -depth 10, and now I'm getting all the at_download links.
>
> Phew! Who would have thought. Well, glad Nutch is doing its thing, and doing it correctly! :-)
>
> Thanks guys.
>
> Cheers,
> Chris
>
> On Nov 24, 2011, at 10:46 PM, Mattmann, Chris A (388J) wrote:
>
>> Hey Guys,
>>
>> Here is the latest red herring. I think I was using too small a -topN parameter in my crawl, which was limiting the whole fetch. I was using -depth 10 and -topN 10, which thinking about it now was limiting the whole crawl to about 100 pages (-topN * -depth), which was too limited since most pages include > 100 outlinks and so forth. So parsing, regex, everything was working fine; it just wasn't following the links down, because the crawl exceeded -topN * -depth.
>>
>> I'm running a new crawl now and it seems to be getting a TON more URLs. Full crawls for me were limited to around ~5k URLs before, which I think was the problem. Fingers crossed!
>>
>> Cheers,
>> Chris
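A minimal sketch of the kind of invocation being described above; the seed directory (urls) and output directory (crawl) names are placeholders, not taken from the thread:

  # each generate/fetch round selects at most -topN URLs and -depth sets the number
  # of rounds, so the whole crawl is bounded by roughly depth * topN pages;
  # -depth 10 -topN 10 caps it near 100 pages, while -topN 10000 raises that bound to 100,000
  ./bin/nutch crawl urls -dir crawl -depth 10 -topN 10000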
>> On Nov 24, 2011, at 10:55 AM, Mattmann, Chris A (388J) wrote:
>>
>>> Hey Markus,
>>>
>>> On Nov 24, 2011, at 10:23 AM, Markus Jelsma wrote:
>>>
>>>> Hi Chris
>>>>
>>>> https://issues.apache.org/jira/browse/NUTCH-1087
>>>
>>> Thanks for the pointer. I'll check it out.
>>>
>>>> Use the org.apache.nutch.net.URLFilterChecker to test.
>>>
>>> Sweet, I didn't know about this tool. OK, I tried it out, check it (note that this includes my instrumented stuff, hence the printlns):
>>>
>>> echo "http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file" | ./bin/nutch org.apache.nutch.net.URLFilterChecker -filterName org.apache.nutch.urlfilter.regex.RegexURLFilter
>>> Checking URLFilter org.apache.nutch.urlfilter.regex.RegexURLFilter
>>> @#((#(#@ EVALUATING at_download LINK!: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file]: matched? [false]
>>> @#((#(#@ EVALUATING at_download LINK!: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file]: matched? [false]
>>> @#((#(#@ EVALUATING at_download LINK!: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file]: matched? [true]
>>> +http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file
>>>
>>> So, it looks like the URL didn't match the first 2 rules, but matched the 3rd one, and thus the filter actually includes it fine. So, watch this, here are my 3 relevant rules:
>>>
>>> # skip file: ftp: and mailto: urls
>>> -^(file|ftp|mailto):
>>>
>>> # skip image and other suffixes we can't yet parse
>>> # for a more extensive coverage use the urlfilter-suffix plugin
>>> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
>>>
>>> +^http://([a-z0-9]*\.)*vault.fbi.gov/
>>>
>>> So, that makes perfect sense. RegexURLFilter appears to be working normally, so that's fine.
>>>
>>> So, .... what's the deal, then? ParserChecker works fine; it shows that an outlink from this URL:
>>>
>>> http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view
>>>
>>> is in fact the at_download link:
>>>
>>> [chipotle:local/nutch/framework] mattmann% ./bin/nutch org.apache.nutch.parse.ParserChecker http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view | grep "download"
>>> outlink: toUrl: http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file anchor: watergat1summary.pdf
>>> [chipotle:local/nutch/framework] mattmann%
>>>
>>> RegexURLFilter takes in either of those URLs, and says they are fine:
>>>
>>> [chipotle:local/nutch/framework] mattmann% ./bin/nutch org.apache.nutch.parse.ParserChecker http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view | grep download | awk '{print $3}' | ./bin/nutch org.apache.nutch.net.URLFilterChecker -filterName org.apache.nutch.urlfilter.regex.RegexURLFilter
>>> Checking URLFilter org.apache.nutch.urlfilter.regex.RegexURLFilter
>>> @#((#(#@ EVALUATING at_download LINK!: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file]: matched? [false]
>>> @#((#(#@ EVALUATING at_download LINK!: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file]: matched? [false]
>>> @#((#(#@ EVALUATING at_download LINK!: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file]: matched? [true]
>>> +http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file
>>> [chipotle:local/nutch/framework] mattmann%
>>>
>>> [chipotle:local/nutch/framework] mattmann% echo "http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view" | ./bin/nutch org.apache.nutch.net.URLFilterChecker -filterName org.apache.nutch.urlfilter.regex.RegexURLFilter
>>> Checking URLFilter org.apache.nutch.urlfilter.regex.RegexURLFilter
>>> URL: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view] doesn't have at_download in it!
>>> URL: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view] doesn't have at_download in it!
>>> URL: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view] doesn't have at_download in it!
>>> +http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view
>>> [chipotle:local/nutch/framework] mattmann%
>>>
>>> Any idea why I wouldn't be getting the at_download URLs downloaded, then? Here are http.content.limit and db.max.outlinks from my Nutch conf:
>>>
>>> [chipotle:local/nutch/framework] mattmann% egrep -i -A1 "db\.max\.outlinks|http\.content\.limit" conf/nutch-default.xml
>>> confuse this setting with the http.content.limit setting.
>>> </description>
>>> --
>>> <name>http.content.limit</name>
>>> <value>-1</value>
>>> --
>>> <name>db.max.outlinks.per.page</name>
>>> <value>-1</value>
>>> --
>>> If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks will be processed for a page; otherwise, all outlinks will be processed.
>>> [chipotle:local/nutch/framework] mattmann%
>>>
>>> Cheers,
>>> Chris
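As comes out further down the thread, in 1.4 local mode it is the copies of these files under runtime/local/conf that the crawl actually reads, not the top-level conf/ directory, so the same sanity check is only meaningful against the runtime copy. A sketch of that check, using the same property names as above:

  # inspect the config the local crawl actually uses; the top-level conf/ is only the
  # source that gets copied into runtime/local/conf when the runtime is built
  egrep -i -A1 "db\.max\.outlinks|http\.content\.limit" runtime/local/conf/nutch-*.xml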
>>>>
>>>>> Hey Markus,
>>>>>
>>>>> On Nov 24, 2011, at 9:50 AM, Markus Jelsma wrote:
>>>>>
>>>>>>> Hey Markus,
>>>>>>>
>>>>>>> On Nov 24, 2011, at 9:16 AM, Markus Jelsma wrote:
>>>>>>>
>>>>>>>> I think Marek is right, the crawl-filter _is_ used in the crawl command. I don't know what happens if it isn't there.
>>>>>>>
>>>>>>> Interesting. Where is the crawl-urlfilter.txt? It's not in my built conf directory in runtime/local/conf from 1.4?
>>>>>>
>>>>>> It's gone! I checked and last saw it in 1.2. Strange.
>>>>>>
>>>>>>>> Good reasons to get rid of the crawl command and stuff in 1.5 if you ask me.
>>>>>>>
>>>>>>> I'd be in favor of replacing the current Crawl command with a simple Java driver that just calls the underlying Inject, Generate, and Fetch tools. Would that work?
>>>>>>
>>>>>> There's an open issue to replace it with a basic crawl shell script. It's easier to understand and uses the same commands. Non-Java users should be able to deal with it better, and provide us with better problem descriptions.
>>>>>
>>>>> +1, that would be cool indeed. Do you know what issue it is?
>>>>>
>>>>> BTW, I'm currently instrumenting urlfilter-regex to see if I can figure out if it's dropping the at_download URLs for whatever reason. Sigh.
>>>>>
>>>>> Cheers,
>>>>> Chris
>>>>>
>>>>>>> Cheers,
>>>>>>> Chris
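A minimal sketch of what such a crawl script could look like, chaining the standard bin/nutch sub-commands; the directory names, depth, and topN values here are placeholders and error handling is omitted:

  #!/bin/bash
  # skeleton of a crawl loop in the spirit of the issue mentioned above:
  # inject the seeds once, then run generate/fetch/parse/updatedb rounds
  CRAWLDB=crawl/crawldb
  SEGMENTS=crawl/segments
  SEEDS=urls
  DEPTH=10
  TOPN=10000

  ./bin/nutch inject $CRAWLDB $SEEDS
  for ((i = 1; i <= DEPTH; i++)); do
    ./bin/nutch generate $CRAWLDB $SEGMENTS -topN $TOPN
    segment=$(ls -d $SEGMENTS/* | tail -1)   # newest segment, i.e. this round's
    ./bin/nutch fetch $segment
    ./bin/nutch parse $segment
    ./bin/nutch updatedb $CRAWLDB $segment
  done

This mirrors the generate/fetch/parse/updatedb loop that the crawl command runs internally, minus the link inversion and indexing steps.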
>>>>>>>
>>>>>>>> On Thursday 24 November 2011 16:59:42 Mattmann, Chris A (388J) wrote:
>>>>>>>>> Hi Marek,
>>>>>>>>>
>>>>>>>>> On Nov 24, 2011, at 7:46 AM, Marek Bachmann wrote:
>>>>>>>>>
>>>>>>>>>> I think when you use the crawl command instead of the single commands, you have to specify the regEx rules in the crawl-urlfilter.txt file. But I don't know if it is still the case in 1.4.
>>>>>>>>>>
>>>>>>>>>> Could that be the problem?
>>>>>>>>>
>>>>>>>>> Doesn't look like there's a crawl-urlfilter.txt in the conf dir. Also it looks like urlfilter-regex is the one that's enabled by default and shipped with the basic config.
>>>>>>>>>
>>>>>>>>> Thanks for trying to help though. I'm going to figure this out! Or, someone is going to probably tell me what I'm doing wrong. We'll see what happens first :-)
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Chris
>>>>>>>>>
>>>>>>>>>> On 24.11.2011 16:20, Mattmann, Chris A (388J) wrote:
>>>>>>>>>>> On Nov 24, 2011, at 3:21 AM, Julien Nioche wrote:
>>>>>>>>>>>>> OK, nm. This *is* different behavior from 1.3 apparently, but I figured out how to make it work in 1.4 (instead of editing the global, top-level conf/nutch-default.xml, I needed to edit runtime/local/conf/nutch-default.xml). Crawling is forging ahead.
>>>>>>>>>>>>
>>>>>>>>>>>> yep, I think this is documented on the Wiki. It is partially why I suggested that we deliver the content of runtime/local as our binary release for next time. Most people use Nutch in local mode so this would make their lives easier; as for the advanced users (read pseudo or real distributed), they need to recompile the job file anyway and I'd expect them to use the src release.
>>>>>>>>>>>
>>>>>>>>>>> +1, I'll be happy to edit build.xml and make that happen for 1.5.
>>>>>>>>>>>
>>>>>>>>>>> In the meanwhile, time to figure out why I still can't get it to crawl the PDFs... :(
>>>>>>>>>>>
>>>>>>>>>>> Cheers,
>>>>>>>>>>> Chris
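One practical footnote on the runtime/local point above, offered as a sketch (the ant target name is an assumption from memory of the 1.4 build, not something stated in the thread): edits to the top-level conf/ only take effect in local mode once the runtime directory is rebuilt.

  # rebuild runtime/local (and runtime/deploy) after editing the top-level conf/,
  # or edit the copies under runtime/local/conf directly, as described above
  ant runtime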
--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr

