Thanks for the prompt answer!

On Tue, Jun 25, 2013 at 5:35 PM, Markus Jelsma
<[email protected]> wrote:

> Hi,
>
> Do I understand correctly that you want all iframe src attributes on a
> given page stored in the iframe field?
>
> The src attributes are not extracted, and there is no facility to do so
> right now. You should create your own HtmlParseFilter, loop through the
> document looking for iframe tags, and collect the src attributes. Then add
> those as parse metadata. You can then index them with the index-metadata
> plugin. I'm not sure it supports multi-valued metadata fields in Nutch 1.6,
> but it will in 1.7.
>
> Use the bin/nutch parsechecker and indexchecker tools to check if your
> plugin works.
>
> Cheers
>
>
>
> -----Original message-----
> > From: Amit Sela <[email protected]>
> > Sent: Tuesday 25th June 2013 16:26
> > To: [email protected]
> > Subject: Fetch iframe from HTML (if exists)
> >
> > Hi all,
> >
> > I'm using Nutch 1.6 with Solr 3.6.2 and I would like to index the iframe
> > src attribute into Solr.
> > i.e.,
> > <iframe src="something" scrolling="" frameborder="".......>
> > So I want to extract the iframe src and index it in an iframe field, so
> > that I can find URLs by iframe src.
> >
> > I'm crawling with no depth over a seed list, and I don't want to crawl to
> > the iframe src, just to index and store it.
> >
> > I tried adding
> > <name>urlmeta.tags</name> <value>iframe</value> to nutch-site.xml
> >
> > and
> > <field name="iframe" type="text_general" stored="true" indexed="true"
> > multiValued="true"/> to schema.xml
> >
> > and
> > <field dest="iframe" source="iframe"/> to solrindex-mapping.xml.
> >
> > What am I missing?
> >
> > Thanks,
> >
> > Amit.
> >
>
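
For readers following the thread, here is a minimal sketch of the DOM walk
Markus describes (loop through the document, find iframe tags, collect the
src attribute). It uses only the JDK's org.w3c.dom API so it can be run on
its own. The class name IframeSrcCollector and its methods are hypothetical
and not part of Nutch. In a real plugin, the collect() logic would presumably
be called on the DOM handed to your parse filter, and each value added to the
parse metadata under a key such as "iframe". That wiring is an assumption
and is not shown here.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Nutch: collects iframe src values from a DOM tree.
final class IframeSrcCollector {

    // Recursively visits every node; records the trimmed, non-empty src of each iframe element.
    static void collect(Node node, List<String> out) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "iframe".equalsIgnoreCase(node.getNodeName())) {
            Node src = node.getAttributes().getNamedItem("src");
            if (src != null && !src.getNodeValue().trim().isEmpty()) {
                out.add(src.getNodeValue().trim());
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collect(children.item(i), out);
        }
    }

    // Returns all iframe src values found under the given root, in document order.
    static List<String> iframeSources(Node root) {
        List<String> out = new ArrayList<>();
        collect(root, out);
        return out;
    }

    // Demo-only convenience: parses well-formed XHTML with the JDK parser.
    // (A crawler would receive an already-parsed DOM instead.)
    static List<String> iframeSourcesFromXml(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return iframeSources(doc);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse input", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body>"
                + "<iframe src=\"http://example.com/a\"/>"
                + "<p><iframe src=\"http://example.com/b\"></iframe></p>"
                + "<iframe/>" // no src attribute: skipped
                + "</body></html>";
        System.out.println(iframeSourcesFromXml(page));
    }
}
```

Once such a filter adds the values to the parse metadata, the
bin/nutch parsechecker and indexchecker tools mentioned above are the way to
confirm that the values actually show up for a given URL.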
