One more question: from what I can see, the media files themselves are not actually crawled? Yet in some of the screenshots you detect the width and height properties of some documents. Is this done by reading the corresponding HTML attributes with the extractors, or by fetching and parsing the actual media (image, video) file?
Regards,

----- Original Message -----
From: "cervenkovab" <[email protected]>
To: [email protected]
Sent: Sunday, May 24, 2015 2:57:59 PM
Subject: Re: [MASSMAIL]Nutch - media extractor plugin proposal

Wow, thanks for such a quick reply...

It is a simplification for data transmission at index time. The class MediaSOLRIndexWriter.java <https://github.com/KIZI/IRAPI/blob/master/nutch-plugin/media-extractor/src/java/org/apache/nutch/indexwriter/media/MediaSOLRIndexWriter.java> implements IndexWriter, so it must override public void write(final NutchDocument doc) throws IOException. There is a 1:M relation between a webpage and its internal media URLs, so when the webpage is indexed, its media have to be indexed as well. That was the easiest way to achieve it.

--
View this message in context: http://lucene.472066.n3.nabble.com/Nutch-media-extractor-plugin-proposal-tp4207382p4207402.html
Sent from the Nutch - User mailing list archive at Nabble.com.
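
For readers following the thread, the 1:M relation mentioned in the quoted message can be pictured with a minimal Java sketch. This is not the code from MediaSOLRIndexWriter and it does not use the Nutch API: the class MediaFanOutSketch, the method expand, and the field names "id", "type" and "parent_url" are all hypothetical, chosen only to show how one write of a webpage could fan out into one record for the page plus one record per media URL.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: plain Java collections stand in for
// NutchDocument and for the index backend.
final class MediaFanOutSketch {

    // Turns one webpage and its M media URLs into 1 + M index records.
    static List<Map<String, String>> expand(String pageUrl, List<String> mediaUrls) {
        List<Map<String, String>> records = new ArrayList<>();

        // Record for the webpage itself.
        Map<String, String> page = new LinkedHashMap<>();
        page.put("id", pageUrl);
        page.put("type", "webpage");
        records.add(page);

        // One record per media URL, each pointing back to its page.
        for (String mediaUrl : mediaUrls) {
            Map<String, String> media = new LinkedHashMap<>();
            media.put("id", mediaUrl);
            media.put("type", "media");
            media.put("parent_url", pageUrl);
            records.add(media);
        }
        return records;
    }

    public static void main(String[] args) {
        List<Map<String, String>> records = expand(
                "http://example.com/",
                Arrays.asList("http://example.com/a.png", "http://example.com/b.mp4"));
        // Print every record that would be sent to the index.
        for (Map<String, String> record : records) {
            System.out.println(record);
        }
    }
}
```

In a real index writer the same loop would presumably sit inside the overridden write method, so that indexing the page and indexing its media happen in the same call.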

