Dear Nutch users,

I would like to share the work we have done with Nutch. It may help anyone facing the same crawl problem: *crawling and extracting media from webpages*.
As a supporting part of our project LinkedTV <http://linkedtv.eu> at the University of Economics, Prague <http://www.vse.cz/index-en.php>, within the IRAPI search engine, we have created a plugin for Nutch called *"Media-extractor"*. Its purpose is to *extract media information* (url, title, description, width, ...) from a webpage while preserving the links between the page and its media (M:1 binding). The following media types are recognized: *image, video (multiple formats), audio*. One limitation of the plugin is that it requires a (small) hack of the Nutch core code.

Would it be possible to *link the plugin* (https://github.com/KIZI/IRAPI/tree/master/nutch-plugin) from http://wiki.apache.org/nutch/PluginCentral ?

Documentation can be found on the IRAPI wiki:
https://github.com/KIZI/IRAPI/wiki
https://github.com/KIZI/IRAPI/wiki/Media-extractor-plugin---installation&usage
https://github.com/KIZI/IRAPI/wiki/Media-extractor-plugin----developer-perspective

Plugin template following the Nutch style:

plugin name : media-extractor
plugin version : 2.8
provider : The University of Economics, Prague (VŠE), LinkedTV project
plugin home url : https://github.com/KIZI/IRAPI/tree/master/nutch-plugin
plugin download url : https://github.com/KIZI/IRAPI
license : Apache 2.0
short description : Plugin for extracting media (image, video, audio)
long description : Media-extractor is a plugin for Apache Nutch 2.3, created within the LinkedTV project at the University of Economics, Prague (VŠE). Its purpose is to extract media information (url, title, description, width, ...) from webpages while preserving the links between the page and its media (M:1 binding). The following media types are recognized: image, video, audio.
configurable parameters : to prepare and install the plugin, see https://github.com/KIZI/IRAPI/wiki/Media-extractor-plugin---installation&usage
meta data added to index : media information; see schema.xml for all cores at https://github.com/KIZI/IRAPI/tree/master/solr-example-conf/cores
required jars : jsoup-1.7.2.jar
plugin extension points : ParseFilter, IndexingFilter, IndexWriter
plugin extension point interface : MediaExtractorParser, MediaIndexingFilter, MediaSOLRIndexWriter

Best regards,
Barbora Červenková

PS: If this is not the right place to post this message, could you recommend where I should post it instead?

--
View this message in context: http://lucene.472066.n3.nabble.com/Nutch-media-extractor-plugin-proposal-tp4207382.html
Sent from the Nutch - User mailing list archive at Nabble.com.

