I've got the HtmlParseFilter and the IndexingFilter down.

I added the video links to the metatags, and I extract them so they get
added to the NutchDocument as a new field.
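
To illustrate the kind of extraction involved, here is a minimal standalone
sketch. It is not the actual plugin code: it runs a regex over raw HTML
instead of walking the DOM that Nutch hands to an HtmlParseFilter, the match
on the YouTube host is deliberately loose, and the class and method names
(YoutubeIframeExtractor, extract) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: collects the src URLs of
// YouTube embed iframes found in a chunk of raw HTML.
public class YoutubeIframeExtractor {

    // Matches <iframe ... src="..."> where the src value contains a
    // youtube.com (or youtube-nocookie.com) /embed/ path. This is a loose,
    // illustrative pattern, not a full HTML parser.
    private static final Pattern IFRAME_SRC = Pattern.compile(
        "<iframe\\b[^>]*?\\bsrc\\s*=\\s*[\"']"
            + "([^\"']*youtube(?:-nocookie)?\\.com/embed/[^\"']*)[\"']",
        Pattern.CASE_INSENSITIVE);

    // Returns every matching embed URL in document order.
    public static List<String> extract(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = IFRAME_SRC.matcher(html);
        while (m.find()) {
            urls.add(m.group(1));
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<p>intro</p>"
            + "<iframe width=\"560\" src=\"https://www.youtube.com/embed/abc123\"></iframe>"
            + "<iframe src='https://example.com/other'></iframe>";
        // Only the YouTube embed is reported; the second iframe is skipped.
        for (String url : extract(html)) {
            System.out.println(url);
        }
    }
}
```

In a real plugin the resulting URLs would presumably go into the parse
metadata under some key, so that the IndexingFilter can read them back and
add them to the NutchDocument.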

I have to call an API (not HTTP) to push the data to Solr, so I need to
write an IndexWriter plugin for this. But I noticed that the IndexWriter
plugin only takes a NutchDocument as input.


Does that mean I have to add the content metadata and parse metadata from
the Parse object into the NutchDocument in the IndexingFilter if I want to
index meta tags? Or is there another way to do this?


On Wed, May 28, 2014 at 12:14 AM, Jorge Luis Betancourt Gonzalez <
[email protected]> wrote:

> I’ve done something similar, not with iframes but with other custom
> elements, but the same logic applies. Implement a custom HtmlParseFilter and
> an IndexingFilter; this way you can control how you want the data to be
> indexed. You’re on the right track, though perhaps not by overriding
> parse-html but by implementing a new plugin just for your logic.
>
> Greetings!
>
> On May 27, 2014, at 9:46 AM, Alan Francis <[email protected]> wrote:
>
> > I have a use case in which we want to separate out pages which have an
> > iframe embed tag from YouTube, and add it as an additional field for
> > indexing.
> >
> > I am using Apache Nutch 1.8 with Solr 4.8.
> >
> > What I have done so far is to override the "parse-html" plugin, identify
> > iframe tags with YouTube URLs in DOMContentUtils.getTextHelper(), and
> > append them to "content" with some special tags.
> >
> > I then receive the content in a custom IndexingFilter plugin, extract
> > the URLs from the content, and add them as a new field in the NutchDocument.
> >
> > Is there a better way to do this?
> >
> >
> >
> > --
> > -Alan Francis
>
> VII International Summer School at the UCI, June 30 to July 11, 2014.
> See www.uci.cu
>



-- 
-Alan Francis
