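I've done something similar, not with iframes but with other custom elements we needed, and the same logic applies. Implement a custom HtmlParseFilter and an IndexingFilter; that way you control how the data gets indexed. The HtmlParseFilter is handed the parsed DOM, so it can pick out the iframes and store their URLs in the parse metadata, and the IndexingFilter can then copy them into a new field instead of fishing them back out of "content". You're on the right track, but rather than overriding parse-html I'd implement a new plugin just for your logic.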
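As a rough sketch of the parse-side logic only: the class below walks a DOM tree and collects the src of every iframe that points at a YouTube embed. It is plain Java on org.w3c.dom with no Nutch classes, and the names (YoutubeIframeFinder, findYoutubeEmbeds) as well as the URL patterns it matches are my own assumptions for illustration, not anything shipped with Nutch.

```java
import java.util.ArrayList;
import java.util.List;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Nutch: finds YouTube iframe embeds in a DOM tree.
public class YoutubeIframeFinder {

    /** Returns the src of every iframe under root whose URL looks like a YouTube embed. */
    public static List<String> findYoutubeEmbeds(Node root) {
        List<String> urls = new ArrayList<>();
        collect(root, urls);
        return urls;
    }

    // Depth-first walk over the tree, checking each element on the way down.
    private static void collect(Node node, List<String> urls) {
        // HTML parsers may report element names in upper case, so compare case-insensitively.
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "iframe".equalsIgnoreCase(node.getNodeName())) {
            // getAttribute returns an empty string when the attribute is missing.
            String src = ((Element) node).getAttribute("src");
            if (isYoutube(src)) {
                urls.add(src);
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collect(children.item(i), urls);
        }
    }

    // Assumed patterns: the usual /embed/ URLs on youtube.com and youtube-nocookie.com.
    private static boolean isYoutube(String src) {
        String s = src.toLowerCase();
        return s.contains("youtube.com/embed/")
                || s.contains("youtube-nocookie.com/embed/");
    }
}
```

In the plugin you would call something like this on the DocumentFragment your HtmlParseFilter receives, put the URLs into the parse metadata, and read them back in the IndexingFilter to add the field to the NutchDocument. That wiring is not shown here, and the field name is up to you.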
Greetings!

On May 27, 2014, at 9:46 AM, Alan Francis <[email protected]> wrote:

> I have a use case in which we want to separate pages which have an iframe
> embed tag from youtube. and add it as a additional field for indexing.
>
> I am using apache Nutch 1.8 with Solr 4.8
>
> What I have done so far is to over-ride the "parse-html" plugin and
> identify iframe tags with youtube urls in ComContentUtils.getTextHelper()
> and append it in "content" with some special tags
>
> I then receive the content in an Custom Indexing filter plugin to extract
> the urls from the content and add it as a new field in NutchDocument.
>
> Is there a better way to do this?
>
>
>
> --
> -Alan Francis

