I've browsed around the plugin a little. One question: why do you need "that
NutchDocument return Media"? Which specific use case requires this interaction?

Regards,

----- Original Message -----
From: "cervenkovab" <[email protected]>
To: [email protected]
Sent: Sunday, May 24, 2015 10:16:55 AM
Subject: [MASSMAIL]Nutch - media extractor plugin proposal

Dear Nutchers,
I would like to share with you the work we have done with Nutch. It
could be a good help for anyone solving the same crawl problem:
*crawling and extracting media from webpages*.

As a supporting part of our project LinkedTV <http://linkedtv.eu> at the
University of Economics, Prague <http://www.vse.cz/index-en.php>, within
the IRAPI search engine, we have created a plugin, *"Media-extractor"*, for
Nutch.
Its purpose is to *extract media information* (url, title, description,
width, ...) from a webpage while preserving the links between the page and
its media (M:1 binding). The following media types are recognized: *image,
video (multiple formats), audio*.
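
As a rough illustration of the idea only, the sketch below collects media
elements from one page and keeps a back-reference to that page, so that many
media records point to one page (M:1). It is not the plugin's code: the plugin
is a Java Nutch plugin that uses jsoup, while this is standard-library Python.
The names MediaCollector and extract_media, the record fields, the use of the
alt attribute as the description, and the example URL are all assumptions made
for the illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class MediaCollector(HTMLParser):
    """Collects <img>, <video> and <audio> elements found on a single page.

    Hypothetical helper for illustration; not part of Nutch or the plugin.
    """

    MEDIA_TAGS = {"img": "image", "video": "video", "audio": "audio"}

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.media = []

    def handle_starttag(self, tag, attrs):
        media_type = self.MEDIA_TAGS.get(tag)
        if media_type is None:
            return
        attributes = dict(attrs)
        src = attributes.get("src")
        if not src:
            # Media declared through nested <source> tags is ignored here.
            return
        self.media.append({
            "type": media_type,
            "url": urljoin(self.page_url, src),   # resolve relative links
            "title": attributes.get("title"),
            "description": attributes.get("alt"),  # assumption: alt text
            "width": attributes.get("width"),
            "page": self.page_url,                 # the M:1 back-reference
        })


def extract_media(page_url, html):
    """Return one record per media element, each bound to page_url."""
    collector = MediaCollector(page_url)
    collector.feed(html)
    return collector.media


html = ('<img src="/a.png" alt="Logo" width="120">'
        '<video src="clip.mp4" title="Demo"></video>')
items = extract_media("http://example.org/news/index.html", html)
for item in items:
    print(item["type"], item["url"], "->", item["page"])
```

Every record carries the URL of the page it was found on, which is what lets
an index treat the media as separate documents without losing the binding.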

One limitation of the plugin is that it requires a (small) hack of the Nutch
core code.

Would it be possible to *link the plugin*
(https://github.com/KIZI/IRAPI/tree/master/nutch-plugin) from
http://wiki.apache.org/nutch/PluginCentral ?

Documentation can be found on its wiki:
IRAPI - https://github.com/KIZI/IRAPI/wiki
https://github.com/KIZI/IRAPI/wiki/Media-extractor-plugin---installation&usage
https://github.com/KIZI/IRAPI/wiki/Media-extractor-plugin----developer-perspective

Plugin template following the Nutch style:

plugin name                      : media-extractor
plugin version                   : 2.8
provider                         : The University of Economics, Prague (VŠE),
                                   LinkedTV project
plugin home url                  : https://github.com/KIZI/IRAPI/tree/master/nutch-plugin
plugin download url              : https://github.com/KIZI/IRAPI
license                          : Apache 2.0
short description                : Plugin for extracting media (image, video, audio)
long description                 : Media-extractor is a plugin for Apache Nutch 2.3,
                                   created within the LinkedTV project at the
                                   University of Economics, Prague (VŠE). Its purpose
                                   is to extract media information (url, title,
                                   description, width, ...) from webpages while
                                   preserving the links between the page and the
                                   media (M:1 binding). The following media types
                                   are recognized: image, video, audio.
configurable parameters          : to prepare and install the plugin, see
                                   https://github.com/KIZI/IRAPI/wiki/Media-extractor-plugin---installation&usage
meta data added to index         : media information; see schema.xml for all cores at
                                   https://github.com/KIZI/IRAPI/tree/master/solr-example-conf/cores
required jars                    : jsoup-1.7.2.jar
plugin extension points          : ParseFilter, IndexingFilter, IndexWriter
plugin extension point interface : MediaExtractorParser, MediaIndexingFilter,
                                   MediaSOLRIndexWriter

Best regards

Barbora Červenková

PS: If this is not the right place to post this message, could you
recommend where I should post it instead?



