Thanks for the prompt answer!
On Tue, Jun 25, 2013 at 5:35 PM, Markus Jelsma <[email protected]> wrote:

> Hi,
>
> Do i understand you correctly if you want all iframe src attributes on a
> given page stored in the iframe field?
>
> The src attributes are not extracted and there is no facility to do so
> right now. You should create your own HTMLParseFilter, loop through the
> document looking for iframe tags and collect the src attribute. Then add
> those as parse metadata. You can then index them with the index-metadata
> plugin. I'm not sure it supports multi valued metafields in Nutch 1.6, it
> sure will in 1.7.
>
> Use the bin/nutch parsechecker and indexchecker tools to check if your
> plugin works.
>
> Cheers
>
>
>
> -----Original message-----
> > From:Amit Sela <[email protected]>
> > Sent: Tuesday 25th June 2013 16:26
> > To: [email protected]
> > Subject: Fetch iframe from HTML (if exists)
> >
> > Hi all,
> >
> > I'm using nutch 1.6 with Solr 3.6.2 and I would like to index the iframe
> > src field into Solr.
> > i.e.,
> > <iframe src="something" scrolling="" frameborder="".......>
> > So i want to fetch the iframe and index it as iframe so that I could find
> > URLS by iframe src.
> >
> > I'm crawling with no depth over a seed list, and I don't want to crawl to
> > the iframe src, just to index and store it.
> >
> > I tried adding
> > <name>urlmeta.tags</name> <value>iframe</value> to nutch-site.xml
> >
> > and
> > <field name="iframe" type="text_general" stored="true" indexed="true"
> > multiValued="true"/> to schema.xml
> >
> > and
> > <field dest="iframe" source="iframe"/> to solrindex-mapping.xml.
> >
> > What am I missing ?
> >
> > Thanks,
> >
> > Amit.
> >
>
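
A rough sketch of the DOM walk suggested above, in case it helps others reading this thread. It uses only the JDK's org.w3c.dom classes, so it runs without Nutch on the classpath. The class name IframeSrcCollector and the sample URLs are made up for illustration. Inside a real HTMLParseFilter the same walk would presumably be run on the DocumentFragment handed to filter(), with each collected value added to the parse metadata; that wiring is only described in a comment here and has not been tested against Nutch 1.6. Tag and attribute names are compared case-insensitively on the assumption that the HTML parser may upper-case element names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: collects iframe src values from a DOM tree.
public class IframeSrcCollector {

    /** Returns the non-empty src attribute of every iframe element under root. */
    public static List<String> collect(Node root) {
        List<String> out = new ArrayList<String>();
        walk(root, out);
        return out;
    }

    // Depth-first traversal; document order is preserved in the result.
    private static void walk(Node node, List<String> out) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "iframe".equalsIgnoreCase(node.getNodeName())) {
            NamedNodeMap attrs = node.getAttributes();
            for (int i = 0; i < attrs.getLength(); i++) {
                Node attr = attrs.item(i);
                String value = attr.getNodeValue().trim();
                if ("src".equalsIgnoreCase(attr.getNodeName()) && !value.isEmpty()) {
                    out.add(value);
                }
            }
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            walk(c, out);
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a parsed page. In a Nutch plugin the DOM would come from
        // the DocumentFragment argument of HTMLParseFilter.filter(), and each
        // value would be added to the parse metadata under a key such as
        // "iframe" (assumption, not verified against 1.6).
        String xhtml = "<html><body>"
                + "<iframe src=\"http://example.com/a\"></iframe>"
                + "<div><IFRAME SRC=\"http://example.com/b\"></IFRAME></div>"
                + "</body></html>";
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        for (String src : collect(doc)) {
            System.out.println(src);
        }
    }
}
```

The collection step is kept separate from any Nutch types so it can be checked on its own before running bin/nutch parsechecker against the full plugin.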

