Can you also try with trunk or 1.4? I get different output with parsechecker, such as a proper title:
markus@midas:~/projects/apache/nutch/trunk/runtime/local$ bin/nutch parsechecker http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
fetching: http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
parsing: http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
contentType: application/pdf
signature: 818fd03d7f9011b4f7000657e2aaf966
---------
Url
---------------
http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
---------
ParseData
---------
Version: 5
Status: success(1,0)
Title: Watergate Summary Part 02 of 02
Outlinks: 0
Content Metadata: Date=Wed, 23 Nov 2011 23:53:03 GMT Content-Length=1228493 Expires=Wed, 30 Nov 2011 23:53:03 GMT Content-Disposition=attachment; filename="watergat2.pdf" Last-Modified=Fri, 08 Jul 2011 17:46:08 GMT Connection=close Accept-Ranges=bytes Content-Type=application/pdf Server=HTML Cache-Control=max-age=604800
Parse Metadata: xmpTPg:NPages=20 Creation-Date=2000-02-16T13:54:36Z created=Wed Feb 16 14:54:36 CET 2000 Author=FBI producer=Acrobat PDFWriter 2.01 for Windows; modified using iText 2.1.7 by 1T3XT Last-Modified=2011-11-08T03:25:49Z language=en Content-Type=application/pdf creator=FBI

> Hey Markus,
>
> I set http.content.limit to -1, so it shouldn't have a limit.
>
> I'll try injecting that single URL using separate commands and see what happens! :-)
>
> Cheers,
> Chris
>
> On Nov 23, 2011, at 3:29 PM, Markus Jelsma wrote:
> > What's your http.content.limit set to? Does it allow for a 1.2MB file?
> > Can you also check without merging segments? Or, as a last resort, inject
> > that single URL into an empty crawl db and do a single crawl cycle,
> > preferably using separate commands instead of the crawl command?
> >
> >> Hey Guys,
> >>
> >> I'm using Nutch 1.3, and trying to get it to crawl:
> >>
> >> http://vault.fbi.gov/
> >>
> >> My regex-url filter diff is:
> >>
> >> # accept anything else
> >> #+.
> >>
> >> +^http://([a-z0-9*\.)*vault.fbi.gov/
> >>
> >> I'm trying to get it to parse PDFs like:
> >>
> >> http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
> >>
> >> I see that ParserChecker with my config parses it OK:
> >>
> >> [chipotle:local/nutch/framework] mattmann% ./runtime/local/bin/nutch org.apache.nutch.parse.ParserChecker http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
> >> fetching: http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
> >> parsing: http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
> >> contentType: application/pdf
> >> ---------
> >> Url
> >> ---------------
> >> http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file
> >> ---------
> >> ParseData
> >> ---------
> >> Version: 5
> >> Status: success(1,0)
> >> Title:
> >> Outlinks: 0
> >> Content Metadata: Date=Wed, 23 Nov 2011 21:55:46 GMT Content-Length=1228493 Expires=Wed, 30 Nov 2011 21:55:46 GMT Content-Disposition=attachment; filename="watergat2.pdf" Last-Modified=Fri, 08 Jul 2011 17:46:08 GMT Connection=close Accept-Ranges=bytes Content-Type=application/pdf Server=HTML Cache-Control=max-age=604800
> >> Parse Metadata: xmpTPg:NPages=0 Content-Type=application/pdf
> >>
> >> I didn't change conf/parse-plugins.xml or conf/nutch-default.xml in terms of
> >> plugin.includes, as it looks like parse-tika is included and handles the *
> >> contentType.
> >>
> >> If I merge the segments, dump them, and grep the dump for URLs, I see the
> >> crawl getting to URLs like:
> >>
> >> http://vault.fbi.gov/watergate/watergate-part-36-37-of-1/view
> >>
> >> but once that page is parsed it never grabs the PDF or adds it to the
> >> outlinks; I never see a
> >>
> >> http://vault.fbi.gov/watergate/watergate-part-36-37-of-1/at_download/file
> >>
> >> in the URL list.
> >>
> >> I'm running this command to crawl:
> >>
> >> ./runtime/local/bin/nutch crawl urls -dir crawl -depth 10 -topN 10
> >>
> >> Any idea what I'm doing wrong?
> >>
> >> Cheers,
> >> Chris
> >>
> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >> Chris Mattmann, Ph.D.
> >> Senior Computer Scientist
> >> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> >> Office: 171-266B, Mailstop: 171-246
> >> Email: [email protected]
> >> WWW: http://sunset.usc.edu/~mattmann/
> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >> Adjunct Assistant Professor, Computer Science Department
> >> University of Southern California, Los Angeles, CA 90089 USA
> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: [email protected]
> WWW: http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
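For completeness, a few concrete sketches of the steps discussed above. First, the content limit: it is overridden in conf/nutch-site.xml, inside the <configuration> element, with something like the following (the description text is only a paraphrase, not copied from nutch-default.xml):

  <property>
    <name>http.content.limit</name>
    <!-- -1 disables truncation of fetched content -->
    <value>-1</value>
    <description>Maximum number of bytes downloaded per page; -1 means no limit.</description>
  </property>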

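Next, a rough single-URL cycle with separate commands, run from runtime/local; the seed directory and the crawl-test/ paths are placeholders, not anything taken from the thread:

  # seed list containing just the one PDF URL
  mkdir -p seed
  echo 'http://vault.fbi.gov/watergate/watergate-summary-part-02-of-02/at_download/file' > seed/urls.txt

  # inject into an empty crawldb, generate one segment, then fetch/parse/update it
  bin/nutch inject crawl-test/crawldb seed
  bin/nutch generate crawl-test/crawldb crawl-test/segments -topN 1
  SEGMENT=`ls -d crawl-test/segments/* | tail -1`
  bin/nutch fetch $SEGMENT
  bin/nutch parse $SEGMENT
  bin/nutch updatedb crawl-test/crawldb $SEGMENT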

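Finally, to check a single segment without merging, readseg can dump just the parse output so the outlinks and metadata can be grepped directly; the segment timestamp below is a placeholder for whatever generate created:

  # dump only the parse data/text of one segment into dump_dir/dump
  bin/nutch readseg -dump crawl/segments/20111123150102 dump_dir -nocontent -nofetch -nogenerate
  grep at_download dump_dir/dump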