Hey Ken,

On Nov 25, 2011, at 7:58 AM, Ken Krugler wrote:
> From my experience with Nutch and now Bixo, I think it's important to support
> a -debug mode with tools that dumps out info about all decisions being made
> on URLs, as otherwise tracking down what's going wrong with a crawl
> (especially when doing test crawls) can be very painful.

+1, agreed. If you're like me you resort to inserting System.out.printlns
everywhere :-)

> I have no idea where Nutch stands in this regard as of today, but I would
> assume that it would be possible to generate information that would have
> answered all of the "is it X" questions that came up during Chris's crawl.
> E.g.
>
> - which URLs were put on the fetch list, versus skipped.
> - which fetched documents were truncated.
> - which URLs in a parsed page were skipped, due to the max outlinks per page
>   limit.
> - which URLs got filtered by regex

These are great requirements for a debug tool. I've created a page on the Wiki
for folks to contribute to/discuss:

http://wiki.apache.org/nutch/DebugTool
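To help seed that discussion, here's a very rough, untested sketch of the sort
of thing I have in mind: a wrapper filter that logs every accept/reject
decision. The package, class name, and delegate wiring below are made up for
illustration -- it only leans on the URLFilter contract where filter(url)
returns the URL to keep it, or null to drop it:

package org.apache.nutch.urlfilter.debug;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.net.URLFilter;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/** Wraps another URLFilter and logs every accept/reject decision it makes. */
public class LoggingURLFilter implements URLFilter {

  private static final Logger LOG =
      LoggerFactory.getLogger(LoggingURLFilter.class);

  private final URLFilter delegate;   // e.g. an instance of RegexURLFilter
  private Configuration conf;

  public LoggingURLFilter(URLFilter delegate) {
    this.delegate = delegate;
  }

  public String filter(String urlString) {
    // URLFilter contract: return the URL to keep it, null to reject it.
    String result = delegate.filter(urlString);
    LOG.info("{} {} url={}", new Object[] {
        delegate.getClass().getSimpleName(),
        result == null ? "REJECTED" : "ACCEPTED", urlString });
    return result;
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
    delegate.setConf(conf);
  }

  public Configuration getConf() {
    return conf;
  }
}

Hooking something like that into the filter/normalizer chains for a -debug run
is the part worth hashing out on the wiki page.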
Thanks, Ken!

Cheers,
Chris

> On Nov 25, 2011, at 7:49am, Mattmann, Chris A (388J) wrote:
>
>> Hey Guys,
>>
>> Yep that was it. I had to use -topN 10000 -depth 10, and now I'm getting
>> all the at_download links.
>>
>> Phew! Who would have thought. Well glad Nutch is doing its thing, and
>> doing it correctly! :-)
>>
>> Thanks guys.
>>
>> Cheers,
>> Chris
>>
>> On Nov 24, 2011, at 10:46 PM, Mattmann, Chris A (388J) wrote:
>>
>>> Hey Guys,
>>>
>>> Here is the latest red herring. I think I was using too small a -topN
>>> parameter in my crawl, which was limiting the whole fetch. I was using
>>> -depth 10 and -topN 10, which, thinking about it now, caps each generate
>>> round at 10 URLs, so at most ~100 pages over the whole crawl. That's far
>>> too few, since a single page can easily have more than 100 outlinks. So
>>> parsing, regex, everything was working fine; it just wasn't following the
>>> links down, because the crawl ran out of its -topN * -depth budget.
>>>
>>> I'm running a new crawl now and it seems to be getting a TON more URLs. Full
>>> crawls for me were limited to around ~5k URLs before, which I think was the
>>> problem. Fingers crossed!
>>>
>>> Cheers,
>>> Chris
>>>
>>> On Nov 24, 2011, at 10:55 AM, Mattmann, Chris A (388J) wrote:
>>>
>>>> Hey Markus,
>>>>
>>>> On Nov 24, 2011, at 10:23 AM, Markus Jelsma wrote:
>>>>
>>>>> Hi Chris
>>>>>
>>>>> https://issues.apache.org/jira/browse/NUTCH-1087
>>>>
>>>> Thanks for the pointer. I'll check it out.
>>>>
>>>>> Use the org.apache.nutch.net.URLFilterChecker to test.
>>>>
>>>> Sweet, I didn't know about this tool. OK, I tried it out, check it (note
>>>> that this includes my instrumented stuff, hence the printlns):
>>>>
>>>> echo "http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file" | ./bin/nutch org.apache.nutch.net.URLFilterChecker -filterName org.apache.nutch.urlfilter.regex.RegexURLFilter
>>>> Checking URLFilter org.apache.nutch.urlfilter.regex.RegexURLFilter
>>>> @#((#(#@ EVALUATING at_download LINK!: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file]: matched? [false]
>>>> @#((#(#@ EVALUATING at_download LINK!: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file]: matched? [false]
>>>> @#((#(#@ EVALUATING at_download LINK!: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file]: matched? [true]
>>>> +http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file
>>>>
>>>> So, it looks like it didn't match the first 2 rules, but matched the 3rd
>>>> one and thus it actually includes the URL fine. So, watch this, here are
>>>> my 3 relevant rules:
>>>>
>>>> # skip file: ftp: and mailto: urls
>>>> -^(file|ftp|mailto):
>>>>
>>>> # skip image and other suffixes we can't yet parse
>>>> # for a more extensive coverage use the urlfilter-suffix plugin
>>>> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
>>>>
>>>> +^http://([a-z0-9]*\.)*vault.fbi.gov/
>>>>
>>>> So, that makes perfect sense. RegexURLFilter appears to be working
>>>> normally, so that's fine.
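Interjecting in my own quoted mail here, since this tripped me up: the
[false]/[false]/[true] above is just regex-urlfilter's first-match-wins
evaluation, at least as I understand it -- each rule's pattern is tried in
order, a leading '-' rejects, a leading '+' accepts, and a URL that matches
nothing is rejected. A tiny standalone illustration (plain java.util.regex,
nothing Nutch-specific, and the suffix rule abbreviated to keep the line
short):

import java.util.regex.Pattern;

public class RegexRuleCheck {

  public static void main(String[] args) {
    String url = "http://vault.fbi.gov/watergate/"
        + "watergate-summary-part-01-of-02/at_download/file";

    // the three rules above, in order: "-" = reject, "+" = accept
    String[][] rules = {
      { "-", "^(file|ftp|mailto):" },
      { "-", "\\.(gif|GIF|jpg|JPG|png|PNG|zip|ZIP|exe|EXE|js|JS)$" }, // abbreviated
      { "+", "^http://([a-z0-9]*\\.)*vault.fbi.gov/" },
    };

    for (String[] rule : rules) {
      // RegexURLFilter uses find() semantics: a partial match anywhere counts
      boolean matched = Pattern.compile(rule[1]).matcher(url).find();
      System.out.println("rule " + rule[0] + rule[1] + " matched? " + matched);
      if (matched) {
        System.out.println((rule[0].equals("+") ? "ACCEPT " : "REJECT ") + url);
        return;
      }
    }
    System.out.println("no rule matched, REJECT " + url);
  }
}

Running that prints matched? false, false, true and then ACCEPT for the
at_download URL, which lines up with the checker output.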
>>>> So, .... what's the deal, then? ParserChecker works fine, it shows that an
>>>> outlink from this URL:
>>>>
>>>> http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view
>>>>
>>>> is in fact the at_download link:
>>>>
>>>> [chipotle:local/nutch/framework] mattmann% ./bin/nutch org.apache.nutch.parse.ParserChecker http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view | grep "download"
>>>> outlink: toUrl: http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file anchor: watergat1summary.pdf
>>>> [chipotle:local/nutch/framework] mattmann%
>>>>
>>>> RegexURLFilter takes in either of those URLs, and says they are fine:
>>>>
>>>> [chipotle:local/nutch/framework] mattmann% ./bin/nutch org.apache.nutch.parse.ParserChecker http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view | grep download | awk '{print $3}' | ./bin/nutch org.apache.nutch.net.URLFilterChecker -filterName org.apache.nutch.urlfilter.regex.RegexURLFilter
>>>> Checking URLFilter org.apache.nutch.urlfilter.regex.RegexURLFilter
>>>> @#((#(#@ EVALUATING at_download LINK!: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file]: matched? [false]
>>>> @#((#(#@ EVALUATING at_download LINK!: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file]: matched? [false]
>>>> @#((#(#@ EVALUATING at_download LINK!: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file]: matched? [true]
>>>> +http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file
>>>> [chipotle:local/nutch/framework] mattmann%
>>>>
>>>> [chipotle:local/nutch/framework] mattmann% echo "http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view" | ./bin/nutch org.apache.nutch.net.URLFilterChecker -filterName org.apache.nutch.urlfilter.regex.RegexURLFilter
>>>> Checking URLFilter org.apache.nutch.urlfilter.regex.RegexURLFilter
>>>> URL: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view] doesn't have at_download in it!
>>>> URL: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view] doesn't have at_download in it!
>>>> URL: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view] doesn't have at_download in it!
>>>> +http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view
>>>> [chipotle:local/nutch/framework] mattmann%
>>>>
>>>> Any idea why I wouldn't be getting the at_download URLs downloaded, then?
>>>> Here's http.content.limit and db.max.outlinks from my Nutch conf:
>>>>
>>>> [chipotle:local/nutch/framework] mattmann% egrep -i -A1 "db\.max\.outlinks|http\.content\.limit" conf/nutch-default.xml
>>>> confuse this setting with the http.content.limit setting.
>>>> </description>
>>>> --
>>>> <name>http.content.limit</name>
>>>> <value>-1</value>
>>>> --
>>>> <name>db.max.outlinks.per.page</name>
>>>> <value>-1</value>
>>>> --
>>>> If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
>>>> will be processed for a page; otherwise, all outlinks will be processed.
>>>> [chipotle:local/nutch/framework] mattmann%
>>>>
>>>> Cheers,
>>>> Chris
>>>>
>>>>>> Hey Markus,
>>>>>>
>>>>>> On Nov 24, 2011, at 9:50 AM, Markus Jelsma wrote:
>>>>>>>> Hey Markus,
>>>>>>>>
>>>>>>>> On Nov 24, 2011, at 9:16 AM, Markus Jelsma wrote:
>>>>>>>>> I think Marek is right, the crawl-filter _is_ used in the crawl
>>>>>>>>> command. I don't know what happens if it isn't there.
>>>>>>>>
>>>>>>>> Interesting. Where is the crawl-urlfilter.txt? It's not in my built
>>>>>>>> conf directory in runtime/local/conf from 1.4?
>>>>>>>
>>>>>>> It's gone! I checked and last saw it in 1.2. Strange.
>>>>>>>
>>>>>>>>> Good reasons to get rid of the crawl command and stuff in 1.5 if you
>>>>>>>>> ask me.
>>>>>>>>
>>>>>>>> I'd be in favor of replacing the current Crawl command with a simple
>>>>>>>> Java driver that just calls the underlying Inject, Generate, and Fetch
>>>>>>>> tools. Would that work?
>>>>>>>
>>>>>>> There's an open issue to replace it with a basic crawl shell script. It's
>>>>>>> easier to understand and uses the same commands. Non-Java users should
>>>>>>> be able to deal with it better, and provide us with better problem
>>>>>>> descriptions.
>>>>>>
>>>>>> +1, that would be cool indeed. Do you know what issue it is?
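For what it's worth, something along these lines is roughly what I meant by a
simple Java driver -- completely untested, the class name is made up, and the
per-tool argument forms are from memory, so take it as a sketch rather than
working code:

package org.apache.nutch.crawl;

import java.io.File;
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;
import org.apache.nutch.fetcher.Fetcher;
import org.apache.nutch.parse.ParseSegment;
import org.apache.nutch.util.NutchConfiguration;

/** Rough sketch of a crawl driver that just chains the individual tools. */
public class SimpleCrawlDriver {

  public static void main(String[] args) throws Exception {
    String urlDir = "urls";               // seed list directory
    String crawlDb = "crawl/crawldb";
    String segmentsDir = "crawl/segments";
    int depth = 10;
    String topN = "10000";

    Configuration conf = NutchConfiguration.create();

    // inject the seed URLs into the crawldb
    ToolRunner.run(conf, new Injector(), new String[] { crawlDb, urlDir });

    for (int i = 0; i < depth; i++) {
      // generate a fetch list; this creates a new timestamped segment
      ToolRunner.run(conf, new Generator(),
          new String[] { crawlDb, segmentsDir, "-topN", topN });

      // pick the newest segment the generator just created
      File[] segs = new File(segmentsDir).listFiles();
      Arrays.sort(segs);
      String segment = segs[segs.length - 1].getPath();

      // fetch, parse (assumes fetcher.parse=false), then fold the results
      // back into the crawldb
      ToolRunner.run(conf, new Fetcher(), new String[] { segment });
      ToolRunner.run(conf, new ParseSegment(), new String[] { segment });
      ToolRunner.run(conf, new CrawlDb(), new String[] { crawlDb, segment });
    }
  }
}

The real Crawl command also does invertlinks and (optionally) indexing, and a
bash version of the same loop would read about the same, which is probably why
the shell-script route is attractive.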
>>>>>>
>>>>>> BTW, I'm currently instrumenting urlfilter-regex to see if I can figure
>>>>>> out if it's dropping the at_download URLs for whatever reason. Sigh.
>>>>>>
>>>>>> Cheers,
>>>>>> Chris
>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Chris
>>>>>>>>
>>>>>>>>> On Thursday 24 November 2011 16:59:42 Mattmann, Chris A (388J) wrote:
>>>>>>>>>> Hi Marek,
>>>>>>>>>>
>>>>>>>>>> On Nov 24, 2011, at 7:46 AM, Marek Bachmann wrote:
>>>>>>>>>>> I think when you use the crawl command instead of the single
>>>>>>>>>>> commands, you have to specify the regEx rules in the
>>>>>>>>>>> crawl-urlfilter.txt file. But I don't know if it is still the case
>>>>>>>>>>> in 1.4.
>>>>>>>>>>>
>>>>>>>>>>> Could that be the problem?
>>>>>>>>>>
>>>>>>>>>> Doesn't look like there's a crawl-urlfilter.txt in the conf dir.
>>>>>>>>>> Also it looks like urlfilter-regex is the one that's enabled by
>>>>>>>>>> default and shipped with the basic config.
>>>>>>>>>>
>>>>>>>>>> Thanks for trying to help though. I'm going to figure this out! Or
>>>>>>>>>> someone is probably going to tell me what I'm doing wrong.
>>>>>>>>>> We'll see what happens first :-)
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>> Chris
>>>>>>>>>>
>>>>>>>>>>> On 24.11.2011 16:20, Mattmann, Chris A (388J) wrote:
>>>>>>>>>>>> On Nov 24, 2011, at 3:21 AM, Julien Nioche wrote:
>>>>>>>>>>>>>> OK, nm. This *is* different behavior from 1.3 apparently, but I
>>>>>>>>>>>>>> figured out how to make it work in 1.4 (instead of editing the
>>>>>>>>>>>>>> global, top-level conf/nutch-default.xml,
>>>>>>>>>>>>>> I needed to edit runtime/local/conf/nutch-default.xml). Crawling
>>>>>>>>>>>>>> is forging ahead.
>>>>>>>>>>>>>
>>>>>>>>>>>>> yep, I think this is documented on the Wiki. It is partially why
>>>>>>>>>>>>> I suggested that we deliver the content of runtime/local as our
>>>>>>>>>>>>> binary release for next time. Most people use Nutch in local
>>>>>>>>>>>>> mode, so this would make their lives easier; as for the advanced
>>>>>>>>>>>>> users (read: pseudo or real distributed), they need to recompile
>>>>>>>>>>>>> the job file anyway and I'd expect them to use the src release.
>>>>>>>>>>>>
>>>>>>>>>>>> +1, I'll be happy to edit build.xml and make that happen for 1.5.
>>>>>>>>>>>>
>>>>>>>>>>>> In the meanwhile, time to figure out why I still can't get it to
>>>>>>>>>>>> crawl the PDFs... :(
>>>>>>>>>>>>
>>>>>>>>>>>> Cheers,
>>>>>>>>>>>> Chris
>
> --------------------------
> Ken Krugler
> http://www.scaleunlimited.com
> custom big data solutions & training
> Hadoop, Cascading, Mahout & Solr


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

