Re: how are CSV/TXT files handled

2012-02-15 Thread remi tassing
Hi,

Tika is parsing properly now. I think the cause was some kind of proxy issue,
together with the http.content.limit setting.

Thanks!

Remi
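
For readers who land on this thread with the same symptoms, here is a minimal sketch of the nutch-site.xml overrides involved. It is an illustration under assumptions, not the exact configuration used in this thread: proxy.example.com and port 8080 are placeholder values, and setting the limit to -1 (no truncation) is only one possible choice. The property names http.content.limit, http.proxy.host and http.proxy.port come from nutch-default.xml. The stock limit is 65536 bytes, which is smaller than the Content-Length=341458 reported for the PDF in Lewis's output, so a larger value or -1 is needed for that file to be fetched whole.

```xml
<?xml version="1.0"?>
<!-- nutch-site.xml: local overrides for values defined in nutch-default.xml -->
<configuration>
  <property>
    <name>http.content.limit</name>
    <!-- Size in bytes; longer content is truncated. A negative value
         disables truncation (assumed choice for this sketch). -->
    <value>-1</value>
  </property>
  <property>
    <name>http.proxy.host</name>
    <!-- Placeholder host; set only when fetching through a proxy. -->
    <value>proxy.example.com</value>
  </property>
  <property>
    <name>http.proxy.port</name>
    <!-- Placeholder port for the proxy above. -->
    <value>8080</value>
  </property>
</configuration>
```

After changing these values, rerunning bin/nutch parsechecker on the same URL is a quick way to check whether the fetch and parse now succeed.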

On Fri, Feb 10, 2012 at 11:16 PM, Lewis John Mcgibbney 
lewis.mcgibb...@gmail.com wrote:

 Hi Remi,

 Please ensure that your http.content.limit is sufficient. What are your URL
 filters? Is there any other configuration that could be knocking you off?

 lewismc@lewismc-HP-Mini-110-3100:~/ASF/trunk/runtime/local$ bin/nutch
 parsechecker http://avis.free.fr/livret_278_recettes.pdf
 fetching: http://avis.free.fr/livret_278_recettes.pdf
 parsing: http://avis.free.fr/livret_278_recettes.pdf
 contentType: application/pdf
 signature: aa6e668dca553598a943d8abeb0e9f83
 -
 Url
 ---
 http://avis.free.fr/livret_278_recettes.pdf
 -
 ParseData
 -
 Version: 5
 Status: success(1,0)
 Title: Microsoft Word - RECETTES V.doc
 Outlinks: 3
  outlink: toUrl: http://avis.free.fr anchor:
  outlink: toUrl: http://avis.free.fr anchor:
  outlink: toUrl: http://avea.net/cvg/ anchor:
 Content Metadata: ETag=2a11be-535d2-43e29257 Date=Fri, 10 Feb 2012
 21:11:23 GMT Content-Length=341458 Last-Modified=Thu, 02 Feb 2006 23:14:31
 GMT Content-Type=application/pdf Accept-Ranges=bytes Connection=close
 Server=Apache/ProXad [Aug  9 2008 02:45:09]
 Parse Metadata: xmpTPg:NPages=32 Creation-Date=2006-01-02T00:36:06Z
 created=Mon Jan 02 00:36:06 GMT 2006 Author=CARREFOUR producer=Acrobat
 Distiller 6.0 (Windows) Last-Modified=2006-01-02T00:36:06Z
 Content-Type=application/pdf creator=PScript5.dll Version 5.2

 On Wed, Feb 8, 2012 at 2:04 PM, remi tassing tassingr...@gmail.com
 wrote:

 
  $ bin/nutch parsechecker http://avis.free.fr/livret_278_recettes.pdf
  fetching: http://avis.free.fr/livret_278_recettes.pdf
  Can't fetch URL successfully
 

 lewismc@lewismc-HP-Mini-110-3100:~/ASF/trunk/runtime/local$ bin/nutch
 parsechecker http://spreadsheetpage.com/downloads/xl/keno.xls
 fetching: http://spreadsheetpage.com/downloads/xl/keno.xls
 parsing: http://spreadsheetpage.com/downloads/xl/keno.xls
 contentType: application/vnd.ms-excel
 signature: d3f1d947dfe727e33669dad44957be19
 -
 Url
 ---
 http://spreadsheetpage.com/downloads/xl/keno.xls
 -
 ParseData
 -
 Version: 5
 Status: success(1,0)
 Title:
 Outlinks: 0
 Content Metadata: ETag=a22003-17c00-4531a9cb1dd80 Date=Fri, 10 Feb 2012
 21:14:40 GMT Content-Length=97280 Last-Modified=Mon, 28 Jul 2008 19:34:30
 GMT Content-Type=application/vnd.ms-excel Connection=close
 Accept-Ranges=bytes Server=Apache/2.2.3 (Red Hat)
 Parse Metadata: Creation-Date=1998-06-23T16:20:19Z Last-Author=John
 Walkenbach Application-Name=Microsoft Excel Author=John Walkenbach
 Company=JWalk And Associates Content-Type=application/vnd.ms-excel



 
  $ bin/nutch parsechecker
 http://spreadsheetpage.com/downloads/xl/keno.xls
  fetching: http://spreadsheetpage.com/downloads/xl/keno.xls
  Can't fetch URL successfully
 



Re: how are CSV/TXT files handled

2012-02-08 Thread remi tassing
OK, I just did (it's great, but I had been reluctant because recompiling
always gives me errors).

However, I'm still having a similar error:
$ bin/nutch parsechecker http://URL
fetching: http://URL
parsing: http://URL
contentType: application/ms-excel
-
Url
---
http://URL-
ParseData
-
Version: 5
Status: failed(2,0): Can't retrieve Tika parser for mime-type
application/ms-excel
Title:
Outlinks: 0
Content Metadata:
Parse Metadata:

My nutch-default.xml and nutch-site.xml both have:
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library.
  </description>
</property>

Remi

On Tue, Feb 7, 2012 at 11:17 AM, Markus Jelsma mar...@apache.org wrote:

 Upgrade to 1.4.

  With the nutch parsechecker command I get the following error message:
 
  Error: Could not find or load main class parsechecker, this doesn't
 sound
  good!
 
  On Tue, Feb 7, 2012 at 9:58 AM, remi tassing tassingr...@gmail.com
 wrote:
   What got me wondering was this error message:
  
   failed(2,0): Can't retrieve Tika parser for mime-type
   application/ms-excel
  
   I'm using Nutch-1.2 and my nutch-site.xml has:
  
   <property>
     <name>plugin.includes</name>
     <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|tika)|index-(basic|anchor)|q...
  
   Remi
  
   On Tue, Feb 7, 2012 at 9:16 AM, remi tassing tassingr...@gmail.com
 wrote:
   Hey guys,
  
   I checked the mailing-list archive but couldn't get an answer on this. I
   think CSV and TXT don't need any kind of parsing, but how are they handled
   by default?
  
   Remi



Re: how are CSV/TXT files handled

2012-02-07 Thread remi tassing
With the nutch parsechecker command I get the following error message:

Error: Could not find or load main class parsechecker, this doesn't sound
good!

On Tue, Feb 7, 2012 at 9:58 AM, remi tassing tassingr...@gmail.com wrote:

 What got me wondering was this error message:

 failed(2,0): Can't retrieve Tika parser for mime-type
 application/ms-excel

 I'm using Nutch-1.2 and my nutch-site.xml has:

 <property>
   <name>plugin.includes</name>
   <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|tika)|index-(basic|anchor)|q...

 Remi

 On Tue, Feb 7, 2012 at 9:16 AM, remi tassing tassingr...@gmail.com wrote:

 Hey guys,

 I checked the mailing-list archive but couldn't get an answer on this. I
 think CSV and TXT don't need any kind of parsing, but how are they handled
 by default?

 Remi





how are CSV/TXT files handled

2012-02-07 Thread remi tassing
Hey guys,

I checked the mailing-list archive but couldn't get an answer on this. I
think CSV and TXT don't need any kind of parsing, but how are they handled
by default?

Remi


Re: how are CSV/TXT files handled

2012-02-07 Thread Markus Jelsma
Upgrade to 1.4.

 With the nutch parsechecker command I get the following error message:
 
 Error: Could not find or load main class parsechecker, this doesn't sound
 good!
 
 On Tue, Feb 7, 2012 at 9:58 AM, remi tassing tassingr...@gmail.com wrote:
  What got me wondering was this error message:
  
  failed(2,0): Can't retrieve Tika parser for mime-type
  application/ms-excel
  
  I'm using Nutch-1.2 and my nutch-site.xml has:
  
  <property>
    <name>plugin.includes</name>
    <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|tika)|index-(basic|anchor)|q...
  
  Remi
  
  On Tue, Feb 7, 2012 at 9:16 AM, remi tassing tassingr...@gmail.com wrote:
  Hey guys,
  
  I checked the mailing-list archive but couldn't get an answer on this. I
  think CSV and TXT don't need any kind of parsing, but how are they handled
  by default?
  
  Remi


how are CSV/TXT files handled

2012-02-06 Thread remi tassing
Hey guys,

I checked the mailing-list archive but couldn't get an answer on this. I
think CSV and TXT don't need any kind of parsing, but how are they handled
by default?

Remi


Re: how are CSV/TXT files handled

2012-02-06 Thread remi tassing
What got me wondering was this error message:

failed(2,0): Can't retrieve Tika parser for mime-type application/ms-excel

I'm using Nutch-1.2 and my nutch-site.xml has:

<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|tika)|index-(basic|anchor)|q...

Remi

On Tue, Feb 7, 2012 at 9:16 AM, remi tassing tassingr...@gmail.com wrote:

 Hey guys,

 I checked the mailing-list archive but couldn't get an answer on this. I
 think CSV and TXT don't need any kind of parsing, but how are they handled
 by default?

 Remi