Re: [MASSMAIL]Nutch - media extractor plugin proposal

Jorge Luis Betancourt González Fri, 29 May 2015 08:16:06 -0700

One more question: from what I can see, the media itself is not actually
crawled? But in some of the screenshots I can see that you detect the width
and height properties of some documents. Is this done by detecting the
corresponding HTML attributes with the extractors, or by fetching/parsing the
actual media (image, video) file?

Regards,

----- Original Message -----
From: "cervenkovab" <[email protected]>
To: [email protected]
Sent: Sunday, May 24, 2015 2:57:59 PM
Subject: Re: [MASSMAIL]Nutch - media extractor plugin proposal

Wow, thanks for such a quick reply...

It is a simplification for passing data at index time.
The class MediaSOLRIndexWriter.java
<https://github.com/KIZI/IRAPI/blob/master/nutch-plugin/media-extractor/src/java/org/apache/nutch/indexwriter/media/MediaSOLRIndexWriter.java>

implements IndexWriter, so it must override
public void write(final NutchDocument doc) throws IOException.
There is a 1:M relation between a webpage and its internal media URLs, so when
the webpage is indexed, its media have to be indexed as well.

That was the easiest way to achieve it.
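
To illustrate the 1:M fan-out, here is a minimal sketch. It is not the
MediaSOLRIndexWriter code and does not use the Nutch API: the class name
MediaFanOutSketch, the plain Map used as a document, and the field names
"media_urls", "type" and "parent_id" are all hypothetical stand-ins. It only
shows the idea of a single write call receiving one page document and emitting
the page plus one document per media URL.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in: NOT Nutch's IndexWriter or NutchDocument.
class MediaFanOutSketch {

    // Expands one page document into the page itself plus one
    // document per media URL found in the assumed "media_urls" field.
    static List<Map<String, Object>> expand(Map<String, Object> page) {
        List<Map<String, Object>> out = new ArrayList<>();
        out.add(page); // the page is always indexed first

        Object media = page.get("media_urls"); // assumed field name
        if (media instanceof List<?>) {
            for (Object url : (List<?>) media) {
                Map<String, Object> mediaDoc = new LinkedHashMap<>();
                mediaDoc.put("id", String.valueOf(url));
                mediaDoc.put("type", "media");               // assumed field name
                mediaDoc.put("parent_id", page.get("id"));   // link back to the page
                out.add(mediaDoc);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Object> page = new LinkedHashMap<>();
        page.put("id", "http://example.com/");
        page.put("media_urls", Arrays.asList(
                "http://example.com/a.png",
                "http://example.com/b.mp4"));

        // Print the id of every document that would be sent to the index.
        for (Map<String, Object> doc : expand(page)) {
            System.out.println(doc.get("id"));
        }
    }
}
```

In a real index writer, the loop body would hand each resulting document to
the search backend instead of collecting it in a list.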

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Nutch-media-extractor-plugin-proposal-tp4207382p4207402.html
Sent from the Nutch - User mailing list archive at Nabble.com.
