Well, I guess that hunch was only partially right. I did add the image/jpeg
MIME type to parse-plugins.xml and pointed it to the parse-tika alias.
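
For anyone finding this thread later, the mapping looks roughly like the
fragment below. This is a sketch from memory of the parse-plugins.xml layout,
not a copy of my file; in particular, the extension-id on the alias is my
assumption of what the shipped file uses, so check it against your own
conf/parse-plugins.xml.

```xml
<parse-plugins>
  <!-- Send image/jpeg content to the Tika parser via its alias. -->
  <mimeType name="image/jpeg">
    <plugin id="parse-tika" />
  </mimeType>

  <aliases>
    <!-- Assumed to match the alias already shipped in parse-plugins.xml. -->
    <alias name="parse-tika"
           extension-id="org.apache.nutch.parse.tika.TikaParser" />
  </aliases>
</parse-plugins>
```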

The next issue was that the Tika plugin was not registered in nutch-default,
so I added it to nutch-site, where I already had an overriding
plugin.includes directive.
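
A minimal sketch of such an override in nutch-site.xml follows. The plugin
list in the value is only an illustration, not my actual setting; the
relevant part is that parse-tika appears in the expression.

```xml
<property>
  <name>plugin.includes</name>
  <!-- Hypothetical plugin list; adjust to your crawl. parse-tika must be included. -->
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic</value>
</property>
```

Because nutch-site.xml overrides nutch-default.xml, the value here replaces
the default plugin list entirely, so every plugin you still need has to be
listed as well.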

Thanks for the suggestions!


On Wednesday 26 May 2010 12:04:42 Markus Jelsma wrote:
> Hi,
> 
> I got the following exception:
> Exception in thread "main" org.apache.nutch.parse.ParseException: parser
>  not found for contentType=image/jpeg
> 
> It now does fetch the content and does not complain when I parse the
>  fetched segment. When I send it over for indexing in Solr, I get just 2
>  documents instead of 3; the third should be the image. I also tried to get
>  information on the segment:
> 
> # bin/nutch readseg -list crawl/segments/20100526114113/
> NAME            GENERATED  FETCHER START        FETCHER END          FETCHED  PARSED
> 20100526114113  3          2010-05-26T11:41:32  2010-05-26T11:41:32  3        2
> 
> Perhaps a little unclear in the e-mail, but it tells me it has fetched 3 but
> parsed only 2. If I force parsing of the segment using bin/nutch parse
>  SEGDIR, it tells me it has already parsed the entire segment:
> Exception in thread "main" java.io.IOException: Segment already parsed!
> 
> I have the shipped tika-mimetypes.xml configuration file, which has
>  definitions for JPEG and other files. But the shipped parse-plugins.xml
>  [1] file does not define any image MIME types.
> 
> The first block of MIME types is commented out:
>   The following mimetype are now handled by the default parser
>  (parse-tika). You can uncomment the associations below to override
>  parse-tika
>   and chose which plugin should be used for a given content type
> 
> Further on, there is no JPEG MIME type defined, nor does anything point to
>  the parse-tika alias. Could this be the problem? That Nutch simply does
>  not have Tika registered as a parsing plugin, although Tika itself is
>  configured to handle JPEGs?
> 
> 
> [1]: http://svn.apache.org/viewvc/nutch/trunk/conf/parse-plugins.xml?view=markup
> 
> On Monday 17 May 2010 16:04:00 Julien Nioche wrote:
> > Hi,
> >
> > I tried *bin/nutch org.apache.nutch.parse.ParserChecker
> > http://www.fcgroningen.nl/uploads/media/hollabovenplaat_01.jpg*
> > using the latest trunk from SVN and I am getting
> >
> > ---------
> >
> > > Version: 5
> > > Status: success(1,0)
> > > Title:
> > > Outlinks: 0
> > > Content Metadata: ETag="15dab-8280a1c0" Date=Mon, 17 May 2010 13:55:16
> > > GMT Content-Length=89515 Expires=Mon, 26 Jul 2010 13:55:16 GMT
> > > Last-Modified=Mon, 26 Jan 2009 13:13:51 GMT Content-Type=image/jpeg
> > > Connection=close Accept-Ranges=bytes Server=Apache/2.2.3 (Debian)
> > > PHP/5.2.0-8+etch16 Cache-Control=max-age=6048000
> > > Parse Metadata: Software=Adobe Photoshop CS2 Windows Number of
> > > Components=3 Orientation=Top, left side (Horizontal / normal) Color
> > > Space=sRGB Image Height=156 pixels Data Precision=8 bits Exif Image
> > > Width=992 pixels Component 1=Y component: Quantization table 0,
> > > Sampling factors 1 horiz/1 vert Component 2=Cb component: Quantization
> > > table 1, Sampling factors 1 horiz/1 vert Compression=JPEG (old-style)
> > > Component 3=Cr component: Quantization table 1, Sampling factors 1
> > > horiz/1 vert Date/Time=2009:01:26 14:05:22 X Resolution=72 dots per
> > > inch Thumbnail Offset=302 bytes Exif Image Height=156 pixels Thumbnail
> > > Length=3259 bytes Resolution Unit=Inch Image Width=992 pixels Thumbnail
> > > Data=[3259 bytes of thumbnail data] Y Resolution=72 dots per inch
> >
> > could you try the command above?
> >
> > J.
> >
> > > Hi,
> > >
> > >
> > > It seems it still doesn't work after all. I updated all config files and
> > > the JPEG (and more new as it looks like). But the log still tells me it
> > > cannot find a suitable parser.
> > >
> > > ---------------
> > > 2010-05-17 15:20:06,636 WARN  parse.ParseUtil - No suitable parser
> > > found when
> > > trying to parse content
> > > http://www.fcgroningen.nl/uploads/media/hollabovenplaat_01.jpg of type
> > > image/jpeg
> > > 2010-05-17 15:20:06,637 WARN  parse.Parser - Error parsing:
> > > http://www.fcgroningen.nl/uploads/media/hollabovenplaat_01.jpg:
> > > org.apache.nutch.parse.ParseException: parser not found for
> > > contentType=image/jpeg
> > > url=http://www.fcgroningen.nl/uploads/media/hollabovenplaat_01.jpg
> > >        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:74)
> > >        at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:85)
> > >        at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:41)
> > >        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
> > >        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
> > >        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
> > >        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
> > > ---------------
> > >
> > >
> > > Cheers,
> > >
> > > On Monday 17 May 2010 14:37:54 Markus Jelsma wrote:
> > > > Hi,
> > > >
> > > >
> > > > I've got a copy of the nutch-2010-05-11_04-34-41 nightly build
> > > > because I need Tika to parse JPEG images, and that should be in 1.1,
> > > > as I read somewhere [1].
> > > >
> > > > ---------------
> > > > 2010-05-17 14:36:13,074 WARN  parse.ParseUtil - No suitable parser
> > > > found when trying to parse content
> > > > http://www.fcgroningen.nl/uploads/media/hollabovenplaat_01.jpg of
> > > > type image/jpeg
> > > > 2010-05-17 14:36:13,075 WARN  parse.Parser - Error parsing:
> > > > http://www.fcgroningen.nl/uploads/media/hollabovenplaat_01.jpg:
> > > > org.apache.nutch.parse.ParseException: parser not found for
> > > > contentType=image/jpeg
> > > > url=http://www.fcgroningen.nl/uploads/media/hollabovenplaat_01.jpg
> > > > ---------------
> > > >
> > > >
> > > > [1]: http://lucene.472066.n3.nabble.com/Adding-jpeg-parser-to-nutch-td710135.html
> > > >
> > > > Cheers,
> > > >
> > > > Markus Jelsma - Technisch Architect - Buyways BV
> > > > http://www.linkedin.com/in/markus17
> > > > 050-8536620 / 06-50258350
> > >
> > > Markus Jelsma - Technisch Architect - Buyways BV
> > > http://www.linkedin.com/in/markus17
> > > 050-8536620 / 06-50258350
> 
> Markus Jelsma - Technisch Architect - Buyways BV
> http://www.linkedin.com/in/markus17
> 050-8536620 / 06-50258350
> 

Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
