Good to know! Thanks!

On Wednesday 26 May 2010 13:19:38 Julien Nioche wrote:
> Hi Markus,
>
> Glad you got it to work. You should not need to specify the association
> between Tika and the mime-type in parse-plugins.xml. If Tika is activated
> via plugin.includes it will be used by default for any MimeType.
> Parse-plugins.xml is now meant mostly to specify parsers for types not
> parsed by Tika or override the Tika parser.
>
> J.
>
> > Well, I guess that hunch was just partially right. I did add the
> > image/jpeg MIME type to parse-plugins.xml and have it point to the
> > parse-tika alias.
> >
> > The next issue was that the Tika plugin was not registered in
> > nutch-default, so I added it to nutch-site, where I already had an
> > overriding plugin.includes directive.
> >
> > Thanks for the suggestions!
> >
> > On Wednesday 26 May 2010 12:04:42 Markus Jelsma wrote:
> > > Hi,
> > >
> > > I got the following exception:
> > > Exception in thread "main" org.apache.nutch.parse.ParseException:
> > > parser not found for contentType=image/jpeg
> > >
> > > It now does fetch the content and does not complain when I parse the
> > > fetched segment. When I send it over for indexing in Solr I just get 2
> > > documents instead of 3; the third should be the image. I also tried to
> > > get information on the segment:
> > >
> > > # bin/nutch readseg -list crawl/segments/20100526114113/
> > > NAME            GENERATED  FETCHER START        FETCHER END          FETCHED  PARSED
> > > 20100526114113  3          2010-05-26T11:41:32  2010-05-26T11:41:32  3        2
> > >
> > > Perhaps a little unclear in the e-mail, but it tells me it has fetched 3
> > > but parsed only 2. If I force parsing of the segment using bin/nutch
> > > parse SEGDIR, it tells me it already parsed the entire segment:
> > > Exception in thread "main" java.io.IOException: Segment already parsed!
> > >
> > > I got the shipped tika-mimetypes.xml configuration file, which has
> > > definitions for jpeg and other files.
> > > But the shipped parse-plugins.xml [1] file does not define any image
> > > MIME types.
> > >
> > > The first block of MIME types is commented out:
> > > The following mimetype are now handled by the default parser
> > > (parse-tika). You can uncomment the associations below to override
> > > parse-tika
> > > and chose which plugin should be used for a given content type
> > >
> > > And further on there is no JPEG MIME type defined, nor does anything
> > > point to the parse-tika alias. Could this be the problem? That Nutch
> > > simply does not have Tika registered as a parsing plugin, although Tika
> > > itself is configured to handle JPEGs?
> > >
> > > [1]: http://svn.apache.org/viewvc/nutch/trunk/conf/parse-plugins.xml?view=markup
> > >
> > > On Monday 17 May 2010 16:04:00 Julien Nioche wrote:
> > > > Hi,
> > > >
> > > > I tried
> > > > *bin/nutch org.apache.nutch.parse.ParserChecker http://www.fcgroningen.nl/uploads/media/hollabovenplaat_01.jpg*
> > > > using the latest trunk from SVN and I am getting
> > > >
> > > > ---------
> > > >
> > > > > Version: 5
> > > > > Status: success(1,0)
> > > > > Title:
> > > > > Outlinks: 0
> > > > > Content Metadata: ETag="15dab-8280a1c0" Date=Mon, 17 May 2010 13:55:16
> > > > > GMT Content-Length=89515 Expires=Mon, 26 Jul 2010 13:55:16 GMT
> > > > > Last-Modified=Mon, 26 Jan 2009 13:13:51 GMT Content-Type=image/jpeg
> > > > > Connection=close Accept-Ranges=bytes Server=Apache/2.2.3 (Debian)
> > > > > PHP/5.2.0-8+etch16 Cache-Control=max-age=6048000
> > > > > Parse Metadata: Software=Adobe Photoshop CS2 Windows Number of
> > > > > Components=3 Orientation=Top, left side (Horizontal / normal) Color
> > > > > Space=sRGB Image Height=156 pixels Data Precision=8 bits Exif Image
> > > > > Width=992 pixels Component 1=Y component: Quantization table 0,
> > > > > Sampling factors 1 horiz/1 vert Component 2=Cb component: Quantization
> > > > > table 1, Sampling factors 1 horiz/1 vert Compression=JPEG
> > > > > (old-style) Component 3=Cr component: Quantization table 1,
> > > > > Sampling factors 1 horiz/1 vert Date/Time=2009:01:26 14:05:22 X
> > > > > Resolution=72 dots per inch Thumbnail Offset=302 bytes Exif Image
> > > > > Height=156 pixels Thumbnail Length=3259 bytes Resolution Unit=Inch
> > > > > Image Width=992 pixels Thumbnail Data=[3259 bytes of thumbnail data]
> > > > > Y Resolution=72 dots per inch
> > > >
> > > > could you try the command above?
> > > >
> > > > J.
> > > >
> > > > > Hi,
> > > > >
> > > > > It seems it still doesn't work after all. I updated all config files
> > > > > and the JPEG (and more new as it looks like). But the log still tells
> > > > > me it cannot find a suitable parser.
> > > > >
> > > > > ---------------
> > > > > 2010-05-17 15:20:06,636 WARN parse.ParseUtil - No suitable parser
> > > > > found when trying to parse content
> > > > > http://www.fcgroningen.nl/uploads/media/hollabovenplaat_01.jpg of type
> > > > > image/jpeg
> > > > > 2010-05-17 15:20:06,637 WARN parse.Parser - Error parsing:
> > > > > http://www.fcgroningen.nl/uploads/media/hollabovenplaat_01.jpg:
> > > > > org.apache.nutch.parse.ParseException: parser not found for
> > > > > contentType=image/jpeg
> > > > > url=http://www.fcgroningen.nl/uploads/media/hollabovenplaat_01.jpg
> > > > >     at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:74)
> > > > >     at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:85)
> > > > >     at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:41)
> > > > >     at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
> > > > >     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
> > > > >     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
> > > > >     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
> > > > > ---------------
> > > > >
> > > > > Cheers,
> > > > >
> > > > > On Monday 17 May 2010 14:37:54 Markus Jelsma wrote:
> > > > > > Hi,
> > > > > >
> > > > > > I've got a copy of the nutch-2010-05-11_04-34-41 nightly build
> > > > > > because I need Tika to parse JPEG images, and that would be in 1.1
> > > > > > as I read somewhere [1].
> > > > > >
> > > > > > ---------------
> > > > > > 2010-05-17 14:36:13,074 WARN parse.ParseUtil - No suitable
> > > > > > parser found when trying to parse content
> > > > > > http://www.fcgroningen.nl/uploads/media/hollabovenplaat_01.jpg of
> > > > > > type image/jpeg
> > > > > > 2010-05-17 14:36:13,075 WARN parse.Parser - Error parsing:
> > > > > > http://www.fcgroningen.nl/uploads/media/hollabovenplaat_01.jpg:
> > > > > > org.apache.nutch.parse.ParseException: parser not found for
> > > > > > contentType=image/jpeg
> > > > > > url=http://www.fcgroningen.nl/uploads/media/hollabovenplaat_01.jpg
> > > > > > ---------------
> > > > > >
> > > > > > [1]: http://lucene.472066.n3.nabble.com/Adding-jpeg-parser-to-nutch-td710135.html
> > > > > >
> > > > > > Cheers,
> > > > > >
> > > > > > Markus Jelsma - Technisch Architect - Buyways BV
> > > > > > http://www.linkedin.com/in/markus17
> > > > > > 050-8536620 / 06-50258350

Markus Jelsma - Technisch Architect - Buyways BV http://www.linkedin.com/in/markus17 050-8536620 / 06-50258350
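
For anyone hitting the same "parser not found for contentType=image/jpeg" error, the fix described in this thread (adding the Tika plugin to the plugin.includes override in nutch-site) might look roughly like the sketch below. The plugin list shown is an assumption for illustration only and is not taken from the thread; keep whatever your own override already lists and just make sure parse-tika is part of it.

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml: site-specific overrides of nutch-default.xml -->
<configuration>
  <property>
    <name>plugin.includes</name>
    <!-- Illustrative value (an assumption, not from the thread).
         The relevant part is that "tika" appears in the parse-(...) group,
         so the parse-tika plugin is activated. -->
    <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
</configuration>
```

As Julien notes, once Tika is activated this way, no image/jpeg entry in parse-plugins.xml should be needed. Rerunning the ParserChecker command quoted earlier in the thread against the image URL is one way to confirm that a parser is now found.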

