Uh...oh...I think I might have figured it out: http://wiki.apache.org/nutch/FAQ#Nutch_doesn.27t_crawl_relative_URLs.3F_Some_pages_are_not_indexed_but_my_regex_file_and_everything_else_is_okay_-_what_is_going_on.3F
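
That FAQ entry is about db.max.outlinks.per.page: by default only the first
100 outlinks per page are kept, and these view pages have more than that
(see the count below). A minimal override sketch, assuming local overrides
live in conf/nutch-site.xml (which wins over nutch-default.xml):

<property>
  <name>db.max.outlinks.per.page</name>
  <!-- -1 = keep all outlinks; the default of 100 silently drops the rest -->
  <value>-1</value>
</property>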

Check this:

[chipotle:local/nutch/framework] mattmann% ./bin/nutch org.apache.nutch.parse.ParserChecker http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view | grep outlink | wc -l
169

Hmmm...running test crawl right now with db.max.outlinks.per.page set to -1....

Cheers,
Chris

On Nov 23, 2011, at 8:52 PM, Mattmann, Chris A (388J) wrote:

> Here's a real use case too:
>
> ./bin/nutch org.apache.nutch.parse.ParserChecker http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view
>
> That produces, as one of its outlinks:
>
> [chipotle:local/nutch/framework] mattmann% ./bin/nutch org.apache.nutch.parse.ParserChecker http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view | grep download
> outlink: toUrl: http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file anchor: watergat1summary.pdf
> [chipotle:local/nutch/framework] mattmann%
>
> That's correct. However, it doesn't seem like this outlink is being read, at least during the fetch/generate/crawl cycle, as I never get it picked up in my crawl. Nutch (and parse-tika) seem to parse the URL just fine b/c if I run ParserChecker direct to that URL, I see:
>
> [chipotle:local/nutch/framework] mattmann% ./bin/nutch org.apache.nutch.parse.ParserChecker http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file
> fetching: http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file
> parsing: http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file
> contentType: application/pdf
> ---------
> Url
> ---------------
> http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file
> ---------
> ParseData
> ---------
> Version: 5
> Status: success(1,0)
> Title: Watergate Summary Part 01 of 02
> Outlinks: 2
>   outlink: toUrl: Li:92 anchor:
>   outlink: toUrl: u92.:n. anchor:
> Content Metadata: Date=Thu, 24 Nov 2011 04:49:42 GMT Content-Length=6354860 Expires=Thu, 01 Dec 2011 04:46:57 GMT Content-Disposition=attachment; filename="watergat1summary.pdf" Last-Modified=Fri, 08 Jul 2011 17:46:08 GMT Connection=close Accept-Ranges=bytes Content-Type=application/pdf Server=HTML Cache-Control=max-age=604800
> Parse Metadata: xmpTPg:NPages=123 Creation-Date=2000-02-16T22:44:25Z created=Wed Feb 16 14:44:25 PST 2000 Author=FBI producer=Acrobat PDFWriter 2.01 for Windows; modified using iText 2.1.7 by 1T3XT Last-Modified=2011-11-08T01:41:01Z Content-Type=application/pdf creator=FBI
> [chipotle:local/nutch/framework] mattmann%
>
> I'll keep digging. I wonder if it's a regex thing. I commented out *everything* in my regex-urlfilter.txt besides:
>
> +^http://([a-z0-9]*\.)*vault.fbi.gov/
>
> It seems to get EVERYTHING on the site *but* these dang at_download URLs.
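>
> To rule the filters out, a quick sanity check (a sketch; if I recall, URLFilterChecker reads URLs on stdin and prints each back prefixed with + or - depending on whether the active filters accept it):
>
> echo "http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file" | \
>   ./bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined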
>
> Cheers,
> Chris
>
> On Nov 23, 2011, at 5:48 PM, Mattmann, Chris A (388J) wrote:
>
>> OK, it didn't work again: here are the URLs from a full crawl cycle:
>>
>> http://pastebin.com/Jx3Ar6Md
>>
>> When run independently, where I seed it with an *at_download* URL, direct to the PDF, it parses the PDF. But when I run it like normal with topN 10 and depth 10, it doesn't pick them up.
>>
>> /me stumped
>>
>> I'll poke around in the code but was just wondering if I was doing something wrong.
>>
>> Cheers,
>> Chris
>>
>> On Nov 23, 2011, at 4:27 PM, Mattmann, Chris A (388J) wrote:
>>
>>> OK, nm. This *is* different behavior from 1.3 apparently, but I figured out how to make it work in 1.4 (instead of editing the global, top-level conf/nutch-default.xml, I needed to edit runtime/local/conf/nutch-default.xml). Crawling is forging ahead.
>>>
>>> I'll report back on whether I'm able to grab the PDFs or not, using 1.4...
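>>>
>>> For the record, the cleaner fix is probably a runtime/local/conf/nutch-site.xml override rather than editing nutch-default.xml in place, since nutch-site.xml takes precedence. A sketch, with a made-up agent string:
>>>
>>> <property>
>>>   <name>http.agent.name</name>
>>>   <!-- any non-empty name satisfies the Fetcher's check -->
>>>   <value>vault-test-crawler</value>
>>> </property>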
>>>
>>> Cheers,
>>> Chris
>>>
>>> On Nov 23, 2011, at 4:10 PM, Mattmann, Chris A (388J) wrote:
>>>
>>>> *really* weird.
>>>>
>>>> With 1.4, even though I have my http.agent.name property set in conf/nutch-default.xml, it keeps telling me this when I try and crawl:
>>>>
>>>> Fetcher: No agents listed in 'http.agent.name' property.
>>>> Exception in thread "main" java.lang.IllegalArgumentException: Fetcher: No agents listed in 'http.agent.name' property.
>>>>     at org.apache.nutch.fetcher.Fetcher.checkConfiguration(Fetcher.java:1261)
>>>>     at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1166)
>>>>     at org.apache.nutch.crawl.Crawl.run(Crawl.java:136)
>>>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>>     at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
>>>> [chipotle:local/nutch/framework] mattmann%
>>>>
>>>> Is nutch-default.xml not read by the crawl command in 1.4?
>>>>
>>>> Cheers,
>>>> Chris
>>>>
>>>> On Nov 23, 2011, at 3:54 PM, Markus Jelsma wrote:
>>>>
>>>>> Can you also try with trunk or 1.4? I get different output with parsechecker, such as a proper title.
>>>>>
>>>>> markus@midas:~/projects/apache/nutch/trunk/runtime/local$ bin/nutch parsechecker http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>>>>> fetching: http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>>>>> parsing: http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>>>>> contentType: application/pdf
>>>>> signature: 818fd03d7f9011b4f7000657e2aaf966
>>>>> ---------
>>>>> Url
>>>>> ---------------
>>>>> http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>>>>> ---------
>>>>> ParseData
>>>>> ---------
>>>>> Version: 5
>>>>> Status: success(1,0)
>>>>> Title: Watergate Summary Part 02 of 02
>>>>> Outlinks: 0
>>>>> Content Metadata: Date=Wed, 23 Nov 2011 23:53:03 GMT Content-Length=1228493 Expires=Wed, 30 Nov 2011 23:53:03 GMT Content-Disposition=attachment; filename="watergat2.pdf" Last-Modified=Fri, 08 Jul 2011 17:46:08 GMT Connection=close Accept-Ranges=bytes Content-Type=application/pdf Server=HTML Cache-Control=max-age=604800
>>>>> Parse Metadata: xmpTPg:NPages=20 Creation-Date=2000-02-16T13:54:36Z created=Wed Feb 16 14:54:36 CET 2000 Author=FBI producer=Acrobat PDFWriter 2.01 for Windows; modified using iText 2.1.7 by 1T3XT Last-Modified=2011-11-08T03:25:49Z language=en Content-Type=application/pdf creator=FBI
>>>>>
>>>>>> Hey Markus,
>>>>>>
>>>>>> I set the http.content.limit to -1, so it shouldn't have a limit.
>>>>>>
>>>>>> I'll try injecting that single URL and see if I can get it to download using separate commands and see what happens! :-)
>>>>>>
>>>>>> Cheers,
>>>>>> Chris
>>>>>>
>>>>>> On Nov 23, 2011, at 3:29 PM, Markus Jelsma wrote:
>>>>>>
>>>>>>> What's your http.content.limit set to? Does it allow for a 1.2MB file? Can you also check without merging segments? Or as a last resort, inject that single URL in an empty crawl db and do a single crawl cycle, preferably by using separate commands instead of the crawl command?
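>>>>>>>
>>>>>>> Roughly this, as a sketch (the crawldb/segments paths are just examples):
>>>>>>>
>>>>>>> bin/nutch inject crawldb urls
>>>>>>> bin/nutch generate crawldb segments -topN 1
>>>>>>> s=`ls -d segments/2* | tail -1`
>>>>>>> bin/nutch fetch $s
>>>>>>> bin/nutch parse $s
>>>>>>> bin/nutch updatedb crawldb $s
>>>>>>> bin/nutch readdb crawldb -stats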
>>>>>>>
>>>>>>>> Hey Guys,
>>>>>>>>
>>>>>>>> I'm using Nutch 1.3, and trying to get it to crawl:
>>>>>>>>
>>>>>>>> http://vault.fbi.gov/
>>>>>>>>
>>>>>>>> My regex-urlfilter.txt diff is:
>>>>>>>>
>>>>>>>> # accept anything else
>>>>>>>> #+.
>>>>>>>>
>>>>>>>> +^http://([a-z0-9]*\.)*vault.fbi.gov/
>>>>>>>>
>>>>>>>> I'm trying to get it to parse PDFs like:
>>>>>>>>
>>>>>>>> http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>>>>>>>>
>>>>>>>> I see that, with my config, ParserChecker lets me parse it OK:
>>>>>>>>
>>>>>>>> [chipotle:local/nutch/framework] mattmann% ./runtime/local/bin/nutch org.apache.nutch.parse.ParserChecker http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>>>>>>>> fetching: http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>>>>>>>> parsing: http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>>>>>>>> contentType: application/pdf
>>>>>>>> ---------
>>>>>>>> Url
>>>>>>>> ---------------
>>>>>>>> http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>>>>>>>> ---------
>>>>>>>> ParseData
>>>>>>>> ---------
>>>>>>>> Version: 5
>>>>>>>> Status: success(1,0)
>>>>>>>> Title:
>>>>>>>> Outlinks: 0
>>>>>>>> Content Metadata: Date=Wed, 23 Nov 2011 21:55:46 GMT Content-Length=1228493 Expires=Wed, 30 Nov 2011 21:55:46 GMT Content-Disposition=attachment; filename="watergat2.pdf" Last-Modified=Fri, 08 Jul 2011 17:46:08 GMT Connection=close Accept-Ranges=bytes Content-Type=application/pdf Server=HTML Cache-Control=max-age=604800
>>>>>>>> Parse Metadata: xmpTPg:NPages=0 Content-Type=application/pdf
>>>>>>>>
>>>>>>>> I didn't change conf/parse-plugins.xml or conf/nutch-default.xml in terms of the plugin.includes, as it looks like parse-tika is included and handles the * contentType.
>>>>>>>>
>>>>>>>> If I merge the segs from my crawl, dump them, and then grep for the URL, I see it getting to URLs like:
>>>>>>>>
>>>>>>>> http://vault.fbi.gov/watergate/watergate-part-36-37-of-1/view
>>>>>>>>
>>>>>>>> but then not grabbing the PDF once it parses that page, or adding it to the outlinks, as I never see a:
>>>>>>>>
>>>>>>>> http://vault.fbi.gov/watergate/watergate-part-36-37-of-1/at_download/file
>>>>>>>>
>>>>>>>> in the URL list.
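>>>>>>>>
>>>>>>>> The merge/dump/grep step looks roughly like this (a sketch; the merged and dumpdir paths are just examples):
>>>>>>>>
>>>>>>>> ./runtime/local/bin/nutch mergesegs crawl/merged -dir crawl/segments
>>>>>>>> ./runtime/local/bin/nutch readseg -dump crawl/merged/* crawl/dumpdir
>>>>>>>> grep at_download crawl/dumpdir/dump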
>>>>>>>>
>>>>>>>> I'm running this command to crawl:
>>>>>>>>
>>>>>>>> ./runtime/local/bin/nutch crawl urls -dir crawl -depth 10 -topN 10
>>>>>>>>
>>>>>>>> Any idea what I'm doing wrong?
>>>>>>>>
>>>>>>>> Cheers
>>>>>>>> Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

