Re: how are CSV/TXT files handled
Hi,

Tika is parsing properly. I think it was some kind of proxy issue, together with the http.content.limit setting.

Thanks!
Remi

On Fri, Feb 10, 2012 at 11:16 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote:

Hi Remi,

Please ensure that your http.content.limit is sufficient. What are your URL filters? Any other configuration that could be knocking you off?

lewismc@lewismc-HP-Mini-110-3100:~/ASF/trunk/runtime/local$ bin/nutch parsechecker http://avis.free.fr/livret_278_recettes.pdf
fetching: http://avis.free.fr/livret_278_recettes.pdf
parsing: http://avis.free.fr/livret_278_recettes.pdf
contentType: application/pdf
signature: aa6e668dca553598a943d8abeb0e9f83
- Url ---
http://avis.free.fr/livret_278_recettes.pdf
- ParseData -
Version: 5
Status: success(1,0)
Title: Microsoft Word - RECETTES V.doc
Outlinks: 3
  outlink: toUrl: http://avis.free.fr anchor:
  outlink: toUrl: http://avis.free.fr anchor:
  outlink: toUrl: http://avea.net/cvg/ anchor:
Content Metadata: ETag=2a11be-535d2-43e29257 Date=Fri, 10 Feb 2012 21:11:23 GMT Content-Length=341458 Last-Modified=Thu, 02 Feb 2006 23:14:31 GMT Content-Type=application/pdf Accept-Ranges=bytes Connection=close Server=Apache/ProXad [Aug 9 2008 02:45:09]
Parse Metadata: xmpTPg:NPages=32 Creation-Date=2006-01-02T00:36:06Z created=Mon Jan 02 00:36:06 GMT 2006 Author=CARREFOUR producer=Acrobat Distiller 6.0 (Windows) Last-Modified=2006-01-02T00:36:06Z Content-Type=application/pdf creator=PScript5.dll Version 5.2

On Wed, Feb 8, 2012 at 2:04 PM, remi tassing tassingr...@gmail.com wrote:

$ bin/nutch parsechecker http://avis.free.fr/livret_278_recettes.pdf
fetching: http://avis.free.fr/livret_278_recettes.pdf
Can't fetch URL successfully

lewismc@lewismc-HP-Mini-110-3100:~/ASF/trunk/runtime/local$ bin/nutch parsechecker http://spreadsheetpage.com/downloads/xl/keno.xls
fetching: http://spreadsheetpage.com/downloads/xl/keno.xls
parsing: http://spreadsheetpage.com/downloads/xl/keno.xls
contentType: application/vnd.ms-excel
signature: d3f1d947dfe727e33669dad44957be19
- Url ---
http://spreadsheetpage.com/downloads/xl/keno.xls
- ParseData -
Version: 5
Status: success(1,0)
Title:
Outlinks: 0
Content Metadata: ETag=a22003-17c00-4531a9cb1dd80 Date=Fri, 10 Feb 2012 21:14:40 GMT Content-Length=97280 Last-Modified=Mon, 28 Jul 2008 19:34:30 GMT Content-Type=application/vnd.ms-excel Connection=close Accept-Ranges=bytes Server=Apache/2.2.3 (Red Hat)
Parse Metadata: Creation-Date=1998-06-23T16:20:19Z Last-Author=John Walkenbach Application-Name=Microsoft Excel Author=John Walkenbach Company=JWalk And Associates Content-Type=application/vnd.ms-excel

$ bin/nutch parsechecker http://spreadsheetpage.com/downloads/xl/keno.xls
fetching: http://spreadsheetpage.com/downloads/xl/keno.xls
Can't fetch URL successfully
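
The http.content.limit point can be checked against the Content-Length values in Lewis's output. Here is a minimal Python sketch, assuming the 65536-byte default that Nutch 1.x ships in nutch-default.xml (check your own copy) and assuming a negative value disables truncation. The helper name is made up for illustration and is not part of Nutch:

```python
# Hypothetical helper, not a Nutch API: predicts whether a fetched document
# would be cut off by http.content.limit.
ASSUMED_DEFAULT_LIMIT = 65536  # assumption: Nutch 1.x default, in bytes


def would_truncate(content_length: int, limit: int = ASSUMED_DEFAULT_LIMIT) -> bool:
    """Return True if content of this size exceeds a non-negative limit."""
    return limit >= 0 and content_length > limit


# Content-Length values reported in the parsechecker output in this thread.
for name, size in [("livret_278_recettes.pdf", 341458), ("keno.xls", 97280)]:
    print(name, "truncated" if would_truncate(size) else "fetched whole")
```

Both files are larger than the assumed default, so a truncated download is plausible unless the limit is raised in nutch-site.xml. This does not explain the "Can't fetch URL successfully" message by itself; Remi attributes that at least partly to the proxy.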
Re: how are CSV/TXT files handled
Ok I just did (it's great, but I've been reluctant because recompiling always gives me errors). However, I'm still getting a similar error:

$ bin/nutch parsechecker http://URL
fetching: http://URL
parsing: http://URL
contentType: application/ms-excel
- Url ---
http://URL
- ParseData -
Version: 5
Status: failed(2,0): Can't retrieve Tika parser for mime-type application/ms-excel
Title:
Outlinks: 0
Content Metadata:
Parse Metadata:

Both my nutch-default.xml and nutch-site.xml have:

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to include. Any plugin not matching this expression is excluded. In any case you need at least include the nutch-extensionpoints plugin. By default Nutch includes crawling just HTML and plain text via HTTP, and basic indexing and search plugins. In order to use HTTPS please enable protocol-httpclient, but be aware of possible intermittent problems with the underlying commons-httpclient library.</description>
</property>

Remi

On Tue, Feb 7, 2012 at 11:17 AM, Markus Jelsma mar...@apache.org wrote:

Upgrade to 1.4.

With the nutch parsechecker command I get the following error message:

Error: Could not find or load main class parsechecker

This doesn't sound good!

On Tue, Feb 7, 2012 at 9:58 AM, remi tassing tassingr...@gmail.com wrote:

What got me thinking is this error message:

failed(2,0): Can't retrieve Tika parser for mime-type application/ms-excel

I'm using Nutch-1.2 and my nutch-site.xml has:

<property>
<name>plugin.includes</name>
<value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|tika)|index-(basic|anchor)|q...

Remi

On Tue, Feb 7, 2012 at 9:16 AM, remi tassing tassingr...@gmail.com wrote:

Hey guys,

I checked the mailing-list archive but couldn't get an answer on this. I think CSV and TXT don't need any kind of parsing, but how are they handled by default?

Remi
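
Since plugin.includes is a regular expression over plugin directory names, the value can be tested outside Nutch. Here is a rough Python sketch, assuming the whole directory name has to match (my reading of the property description, not something confirmed in this thread). Python's `re` stands in for Java's regex engine, which should behave the same for a pattern this simple:

```python
import re

# The plugin.includes value quoted in the 1.4 configuration in this thread.
PLUGIN_INCLUDES = (
    "protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)"
    "|scoring-opic|urlnormalizer-(pass|regex|basic)"
)


def is_included(plugin_id: str, pattern: str = PLUGIN_INCLUDES) -> bool:
    """Assumption: a plugin is enabled only if its whole name matches."""
    return re.fullmatch(pattern, plugin_id) is not None


for plugin in ["parse-tika", "parse-html", "parse-text", "protocol-httpclient"]:
    print(plugin, is_included(plugin))
```

Under that assumption parse-tika is enabled by this value, which suggests the "Can't retrieve Tika parser" status has more to do with the reported mime type than with a missing plugin.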
Re: how are CSV/TXT files handled
With the nutch parsechecker command I get the following error message:

Error: Could not find or load main class parsechecker

This doesn't sound good!

On Tue, Feb 7, 2012 at 9:58 AM, remi tassing tassingr...@gmail.com wrote:

What got me thinking is this error message:

failed(2,0): Can't retrieve Tika parser for mime-type application/ms-excel

I'm using Nutch-1.2 and my nutch-site.xml has:

<property>
<name>plugin.includes</name>
<value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|tika)|index-(basic|anchor)|q...

Remi

On Tue, Feb 7, 2012 at 9:16 AM, remi tassing tassingr...@gmail.com wrote:

Hey guys,

I checked the mailing-list archive but couldn't get an answer on this. I think CSV and TXT don't need any kind of parsing, but how are they handled by default?

Remi
how are CSV/TXT files handled
Hey guys,

I checked the mailing-list archive but couldn't get an answer on this. I think CSV and TXT don't need any kind of parsing, but how are they handled by default?

Remi
Re: how are CSV/TXT files handled
Upgrade to 1.4.

With the nutch parsechecker command I get the following error message:

Error: Could not find or load main class parsechecker

This doesn't sound good!

On Tue, Feb 7, 2012 at 9:58 AM, remi tassing tassingr...@gmail.com wrote:

What got me thinking is this error message:

failed(2,0): Can't retrieve Tika parser for mime-type application/ms-excel

I'm using Nutch-1.2 and my nutch-site.xml has:

<property>
<name>plugin.includes</name>
<value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|tika)|index-(basic|anchor)|q...

Remi

On Tue, Feb 7, 2012 at 9:16 AM, remi tassing tassingr...@gmail.com wrote:

Hey guys,

I checked the mailing-list archive but couldn't get an answer on this. I think CSV and TXT don't need any kind of parsing, but how are they handled by default?

Remi
Re: how are CSV/TXT files handled
What got me thinking is this error message:

failed(2,0): Can't retrieve Tika parser for mime-type application/ms-excel

I'm using Nutch-1.2 and my nutch-site.xml has:

<property>
<name>plugin.includes</name>
<value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|tika)|index-(basic|anchor)|q...

Remi

On Tue, Feb 7, 2012 at 9:16 AM, remi tassing tassingr...@gmail.com wrote:

Hey guys,

I checked the mailing-list archive but couldn't get an answer on this. I think CSV and TXT don't need any kind of parsing, but how are they handled by default?

Remi
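
The failing status names application/ms-excel, while Lewis's successful run in this thread reports application/vnd.ms-excel for an .xls file, so the server-supplied type looks like a nonstandard alias. The sketch below only illustrates the idea of normalizing such aliases before a parser lookup. The alias table and function are hypothetical and are not how Nutch or Tika expose this; the single entry is the pair seen in this thread:

```python
# Hypothetical alias table for illustration; neither Nutch nor Tika exposes
# this function. Only the one pair seen in this thread is listed.
MIME_ALIASES = {
    "application/ms-excel": "application/vnd.ms-excel",
}


def normalize_mime(content_type: str) -> str:
    """Drop parameters, lowercase, and map known aliases to one canonical type."""
    base = content_type.split(";", 1)[0].strip().lower()
    return MIME_ALIASES.get(base, base)


print(normalize_mime("application/ms-excel"))
print(normalize_mime("Application/PDF; charset=binary"))
```

Whether a given Nutch or Tika release already knows this alias is something to verify against its own mime-type configuration.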