Re: Can't get Nutch to crawl PDFs

Mattmann, Chris A (388J) Wed, 23 Nov 2011 21:27:23 -0800

Umm...sigh, that didn't solve it.

I'll keep looking.


Cheers,
Chris

On Nov 23, 2011, at 9:11 PM, Mattmann, Chris A (388J) wrote:

> Uh...oh...I think I might have figured it out:
>
> http://wiki.apache.org/nutch/FAQ#Nutch_doesn.27t_crawl_relative_URLs.3F_Some_pages_are_not_indexed_but_my_regex_file_and_everything_else_is_okay_-_what_is_going_on.3F
>
> Check this:
>
> [chipotle:local/nutch/framework] mattmann% ./bin/nutch 
> org.apache.nutch.parse.ParserChecker 
> http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view | grep 
> outlink | wc -l
>     169
>
> Hmmm...running test crawl right now with db.max.outlinks.per.page set to 
> -1....
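
A minimal Python sketch of the truncation that db.max.outlinks.per.page is suspected of causing here. It assumes a stock limit of 100 and that outlinks past the limit are dropped in page order; the cap_outlinks helper and the placeholder outlink names are hypothetical, not Nutch code.

```python
def cap_outlinks(outlinks, max_per_page):
    """Keep at most max_per_page outlinks; a negative limit keeps them all."""
    if max_per_page < 0:
        return list(outlinks)
    return list(outlinks[:max_per_page])


# 169 placeholder outlinks, matching the count ParserChecker reported above.
outlinks = [f"outlink-{i}" for i in range(169)]

kept_default = cap_outlinks(outlinks, 100)   # assumed stock limit
kept_unlimited = cap_outlinks(outlinks, -1)  # the setting being tested

print(len(kept_default), len(kept_unlimited))
```

If the at_download link sits past the cutoff on the view page, it would be lost with a non-negative limit and kept with -1.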
>
> Cheers,
> Chris
>
> On Nov 23, 2011, at 8:52 PM, Mattmann, Chris A (388J) wrote:
>
>> Here's a real use case too:
>>
>> ./bin/nutch org.apache.nutch.parse.ParserChecker 
>> http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view
>>
>> That produces, as one of its outlinks:
>>
>> [chipotle:local/nutch/framework] mattmann% ./bin/nutch 
>> org.apache.nutch.parse.ParserChecker 
>> http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view | grep 
>> download
>> outlink: toUrl: 
>> http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file
>>  anchor: watergat1summary.pdf
>> [chipotle:local/nutch/framework] mattmann%
>>
>> That's correct. However, this outlink doesn't seem to be followed during the
>> generate/fetch/update cycle, as it never gets picked up in my crawl. Nutch
>> (and parse-tika) seem to parse the URL just fine, because if I run
>> ParserChecker directly on that URL, I see:
>>
>> [chipotle:local/nutch/framework] mattmann% ./bin/nutch 
>> org.apache.nutch.parse.ParserChecker 
>> http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file
>>   fetching: 
>> http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file
>> parsing: 
>> http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file
>> contentType: application/pdf
>> ---------
>> Url
>> ---------------
>> http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file---------
>> ParseData
>> ---------
>> Version: 5
>> Status: success(1,0)
>> Title: Watergate Summary Part 01 of 02
>> Outlinks: 2
>> outlink: toUrl: Li:92 anchor:
>> outlink: toUrl: u92.:n. anchor:
>> Content Metadata: Date=Thu, 24 Nov 2011 04:49:42 GMT Content-Length=6354860 
>> Expires=Thu, 01 Dec 2011 04:46:57 GMT Content-Disposition=attachment; 
>> filename="watergat1summary.pdf" Last-Modified=Fri, 08 Jul 2011 17:46:08 GMT 
>> Connection=close Accept-Ranges=bytes Content-Type=application/pdf 
>> Server=HTML Cache-Control=max-age=604800
>> Parse Metadata: xmpTPg:NPages=123 Creation-Date=2000-02-16T22:44:25Z 
>> created=Wed Feb 16 14:44:25 PST 2000 Author=FBI producer=Acrobat PDFWriter 
>> 2.01 for Windows; modified using iText 2.1.7 by 1T3XT 
>> Last-Modified=2011-11-08T01:41:01Z Content-Type=application/pdf creator=FBI
>> [chipotle:local/nutch/framework] mattmann%
>>
>> I'll keep digging. I wonder if it's a regex thing. I commented out 
>> *everything* in my regex-urlfilter.txt besides:
>>
>> +^http://([a-z0-9]*\.)*vault.fbi.gov/
>>
>> It seems to get EVERYTHING on the site *but* these dang at_download URLs.
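
One way to rule the filter in or out is to test the accept rule against an at_download URL directly. This is a minimal sketch that assumes Python's re module treats this simple pattern the same way Java's regex engine does, and that the rule is applied as a search from the start of the URL; it is not the Nutch filter itself.

```python
import re

# The accept rule from regex-urlfilter.txt, without the leading '+'.
ACCEPT = re.compile(r"^http://([a-z0-9]*\.)*vault.fbi.gov/")

url = ("http://vault.fbi.gov/watergate/"
       "watergate-summary-part-01-of-02/at_download/file")

accepted = ACCEPT.search(url) is not None
print(accepted)
```

If this reports a match, the accept rule on its own is unlikely to be what drops these URLs.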
>>
>> Cheers,
>> Chris
>>
>> On Nov 23, 2011, at 5:48 PM, Mattmann, Chris A (388J) wrote:
>>
>>> OK, it didn't work again: here are the URLs from a full crawl cycle:
>>>
>>> http://pastebin.com/Jx3Ar6Md
>>>
>>> When run independently, where I seed it with an *at_download* URL,
>>> direct to the PDF, it parses the PDF. But when I run it like normal with 
>>> topN 10 and
>>> depth 10, it doesn't pick them up.
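
One thing worth ruling out is the crawl budget itself. A rough sketch, assuming -topN caps the number of URLs selected in each round and -depth is the number of rounds; max_fetched is a hypothetical helper, not part of Nutch.

```python
def max_fetched(depth, top_n):
    """Upper bound on pages fetched: at most top_n URLs in each of depth rounds."""
    return depth * top_n


budget = max_fetched(depth=10, top_n=10)
print(budget)
```

Under that assumption the whole crawl fetches at most 100 URLs, so on a large site many discovered links may never be selected, whatever the parser finds.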
>>>
>>> /me stumped
>>>
>>> I'll poke around in the code but was just wondering if I was doing something
>>> wrong.
>>>
>>> Cheers,
>>> Chris
>>>
>>> On Nov 23, 2011, at 4:27 PM, Mattmann, Chris A (388J) wrote:
>>>
>>>> OK, nm. This *is* different behavior from 1.3 apparently, but I figured out
>>>> how to make it work in 1.4 (instead of editing the global, top-level 
>>>> conf/nutch-default.xml,
>>>> I needed to edit runtime/local/conf/nutch-default.xml). Crawling is 
>>>> forging ahead.
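
A quick way to see which copy of the file actually carries the value is to read the property out of each one. This is a minimal sketch: read_property is a hypothetical helper, and the two XML strings are made-up stand-ins for conf/nutch-default.xml and runtime/local/conf/nutch-default.xml.

```python
import xml.etree.ElementTree as ET


def read_property(xml_text, name):
    """Return the <value> of the named <property>, or None if it is absent."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return None


# Hypothetical contents standing in for the two copies of the config file.
top_level = """<configuration>
  <property><name>http.agent.name</name><value>mycrawler</value></property>
</configuration>"""
runtime_local = """<configuration>
  <property><name>http.agent.name</name><value></value></property>
</configuration>"""

print(repr(read_property(top_level, "http.agent.name")))
print(repr(read_property(runtime_local, "http.agent.name")))
```

An empty value in the copy under runtime/local/conf would be consistent with the "No agents listed" error even though the top-level copy is set.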
>>>>
>>>> I'll report back on if I'm able to grab the PDFs or not, using 1.4...
>>>>
>>>> Cheers,
>>>> Chris
>>>>
>>>> On Nov 23, 2011, at 4:10 PM, Mattmann, Chris A (388J) wrote:
>>>>
>>>>> *really* weird.
>>>>>
>>>>> With 1.4, even though I have my http.agent.name property set in 
>>>>> conf/nutch-default.xml,
>>>>> it keeps telling me this:
>>>>>
>>>>> Fetcher: No agents listed in 'http.agent.name' property.
>>>>> Exception in thread "main" java.lang.IllegalArgumentException: Fetcher: 
>>>>> No agents listed in 'http.agent.name' property.
>>>>>   at 
>>>>> org.apache.nutch.fetcher.Fetcher.checkConfiguration(Fetcher.java:1261)
>>>>>   at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1166)
>>>>>   at org.apache.nutch.crawl.Crawl.run(Crawl.java:136)
>>>>>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>>>   at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
>>>>> [chipotle:local/nutch/framework] mattmann%
>>>>>
>>>>> When I try and crawl.
>>>>>
>>>>> Is nutch-default.xml not read by the crawl command in 1.4?
>>>>>
>>>>> Cheers,
>>>>> Chris
>>>>>
>>>>>
>>>>> On Nov 23, 2011, at 3:54 PM, Markus Jelsma wrote:
>>>>>
>>>>>> Can you also try with trunk or 1.4? I get different output with
>>>>>> parsechecker, such as a proper title.
>>>>>>
>>>>>>
>>>>>> markus@midas:~/projects/apache/nutch/trunk/runtime/local$ bin/nutch
>>>>>> parsechecker http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>>>>>> fetching: http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>>>>>> parsing: http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>>>>>> contentType: application/pdf
>>>>>> signature: 818fd03d7f9011b4f7000657e2aaf966
>>>>>> ---------
>>>>>> Url
>>>>>> ---------------
>>>>>> http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file---------
>>>>>> ParseData
>>>>>> ---------
>>>>>> Version: 5
>>>>>> Status: success(1,0)
>>>>>> Title: Watergate Summary Part 02 of 02
>>>>>> Outlinks: 0
>>>>>> Content Metadata: Date=Wed, 23 Nov 2011 23:53:03 GMT 
>>>>>> Content-Length=1228493
>>>>>> Expires=Wed, 30 Nov 2011 23:53:03 GMT Content-Disposition=attachment;
>>>>>> filename="watergat2.pdf" Last-Modified=Fri, 08 Jul 2011 17:46:08 GMT
>>>>>> Connection=close Accept-Ranges=bytes Content-Type=application/pdf 
>>>>>> Server=HTML
>>>>>> Cache-Control=max-age=604800
>>>>>> Parse Metadata: xmpTPg:NPages=20 Creation-Date=2000-02-16T13:54:36Z
>>>>>> created=Wed Feb 16 14:54:36 CET 2000 Author=FBI producer=Acrobat PDFWriter
>>>>>> 2.01 for Windows; modified using iText 2.1.7 by 1T3XT
>>>>>> Last-Modified=2011-11-08T03:25:49Z language=en Content-Type=application/pdf
>>>>>> creator=FBI
>>>>>>
>>>>>>
>>>>>>
>>>>>>> Hey Markus,
>>>>>>>
>>>>>>> I set the http.content.limit to -1, so it shouldn't have a limit.
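
A small sketch of why the limit matters for a file this size. It assumes a stock http.content.limit of 65536 bytes and that a negative value disables truncation; is_truncated is a hypothetical helper, not Nutch code.

```python
def is_truncated(content_length, content_limit):
    """True if a download of content_length bytes would be cut at content_limit.

    A negative limit means no truncation, as with http.content.limit = -1.
    """
    return content_limit >= 0 and content_length > content_limit


PDF_BYTES = 1228493      # Content-Length reported by ParserChecker
ASSUMED_DEFAULT = 65536  # assumed stock http.content.limit

print(is_truncated(PDF_BYTES, ASSUMED_DEFAULT))
print(is_truncated(PDF_BYTES, -1))
```

With the limit at -1 the 1.2MB PDF should arrive whole, so truncation is unlikely to be the problem here.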
>>>>>>>
>>>>>>> I'll try injecting that single URL and see if I can get it to download
>>>>>>> using separate commands and see what happens! :-)
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Chris
>>>>>>>
>>>>>>> On Nov 23, 2011, at 3:29 PM, Markus Jelsma wrote:
>>>>>>>> What's your http.content.limit set to? Does it allow for a 1.2MB file?
>>>>>>>> Can you also check without merging segments? Or as a last resort, 
>>>>>>>> inject
>>>>>>>> that single URL in an empty crawl db and do a single crawl cycle,
>>>>>>>> preferably by using separate commands instead of the crawl command?
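
The separate-commands cycle suggested above might look roughly like this. It is a sketch that only builds and prints the command lines: the subcommand order and arguments are assumed from the usual Nutch 1.x step-by-step workflow, and the directory names and segment path are hypothetical placeholders.

```python
def single_cycle_commands(crawldb, seed_dir, segments_dir, segment):
    """Build the argv lists for one inject/generate/fetch/parse/updatedb cycle.

    'segment' is the directory that the generate step creates under
    segments_dir; its real name is only known after generate has run.
    """
    return [
        ["bin/nutch", "inject", crawldb, seed_dir],
        ["bin/nutch", "generate", crawldb, segments_dir],
        ["bin/nutch", "fetch", segment],
        ["bin/nutch", "parse", segment],
        ["bin/nutch", "updatedb", crawldb, segment],
    ]


commands = single_cycle_commands(
    crawldb="testcrawl/crawldb",            # hypothetical empty crawl db
    seed_dir="urls-single",                 # holds only the at_download URL
    segments_dir="testcrawl/segments",
    segment="testcrawl/segments/SEGMENT",   # placeholder for the real name
)
for argv in commands:
    print(" ".join(argv))
```

Running the steps one at a time makes it easier to see at which stage the PDF URL disappears.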
>>>>>>>>
>>>>>>>>> Hey Guys,
>>>>>>>>>
>>>>>>>>> I'm using Nutch 1.3, and trying to get it to crawl:
>>>>>>>>>
>>>>>>>>> http://vault.fbi.gov/
>>>>>>>>>
>>>>>>>>> My regex-url filter diff is:
>>>>>>>>>
>>>>>>>>> # accept anything else
>>>>>>>>> #+.
>>>>>>>>>
>>>>>>>>> +^http://([a-z0-9]*\.)*vault.fbi.gov/
>>>>>>>>>
>>>>>>>>> I'm trying to get it to parse PDFs like:
>>>>>>>>>
>>>>>>>>> http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>>>>>>>>>
>>>>>>>>> I see that my config ParserChecker lets me parse it OK:
>>>>>>>>>
>>>>>>>>> [chipotle:local/nutch/framework] mattmann% ./runtime/local/bin/nutch
>>>>>>>>> org.apache.nutch.parse.ParserChecker
>>>>>>>>> http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>>>>>>>>> fetching: http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>>>>>>>>> parsing: http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
>>>>>>>>> contentType: application/pdf
>>>>>>>>> ---------
>>>>>>>>> Url
>>>>>>>>> ---------------
>>>>>>>>> http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file---------
>>>>>>>>> ParseData
>>>>>>>>> ---------
>>>>>>>>> Version: 5
>>>>>>>>> Status: success(1,0)
>>>>>>>>> Title:
>>>>>>>>> Outlinks: 0
>>>>>>>>> Content Metadata: Date=Wed, 23 Nov 2011 21:55:46 GMT
>>>>>>>>> Content-Length=1228493 Expires=Wed, 30 Nov 2011 21:55:46 GMT
>>>>>>>>> Content-Disposition=attachment; filename="watergat2.pdf"
>>>>>>>>> Last-Modified=Fri, 08 Jul 2011 17:46:08 GMT Connection=close
>>>>>>>>> Accept-Ranges=bytes Content-Type=application/pdf Server=HTML
>>>>>>>>> Cache-Control=max-age=604800
>>>>>>>>> Parse Metadata: xmpTPg:NPages=0
>>>>>>>>> Content-Type=application/pdf
>>>>>>>>>
>>>>>>>>> I didn't change conf/parse-plugins.xml or the plugin.includes
>>>>>>>>> property in conf/nutch-default.xml, as it looks like parse-tika is
>>>>>>>>> already included and handles the * contentType.
>>>>>>>>>
>>>>>>>>> If I merge the segments, dump them, and then grep for URL, I see
>>>>>>>>> the crawl getting to pages like:
>>>>>>>>>
>>>>>>>>> http://vault.fbi.gov/watergate/watergate-part-36-37-of-1/view
>>>>>>>>>
>>>>>>>>> That type of URL, but then not grabbing the PDF once it parses it, or
>>>>>>>>> adding it to the outlinks, as I never see a:
>>>>>>>>>
>>>>>>>>> http://vault.fbi.gov/watergate/watergate-part-36-37-of-1/at_download/file
>>>>>>>>>
>>>>>>>>> In the URL list.
>>>>>>>>>
>>>>>>>>> I'm running this command to crawl:
>>>>>>>>>
>>>>>>>>> ./runtime/local/bin/nutch crawl urls -dir crawl -depth 10 -topN 10
>>>>>>>>>
>>>>>>>>> Any idea what I'm doing wrong?
>>>>>>>>>
>>>>>>>>> Cheers
>>>>>>>>> Chris
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>>>>> Chris Mattmann, Ph.D.
>>>>>>>>> Senior Computer Scientist
>>>>>>>>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>>>>>>>> Office: 171-266B, Mailstop: 171-246
>>>>>>>>> Email: [email protected]
>>>>>>>>> WWW:   http://sunset.usc.edu/~mattmann/
>>>>>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>>>>> Adjunct Assistant Professor, Computer Science Department
>>>>>>>>> University of Southern California, Los Angeles, CA 90089 USA
>>>>>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


