No, order does not matter. Try adding iframe to the ignore_tags configuration directive (parser.html.outlinks.ignore_tags) in your nutch-site.xml.
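For reference, a minimal sketch of what that override might look like in nutch-site.xml. It assumes the property takes a comma-separated list of tag names, as its description in nutch-default.xml suggests, and that you have no other tags listed there already; adjust the value to your own setup.

```xml
<!-- nutch-site.xml: sketch only; the value shown is an assumption for this thread -->
<property>
  <name>parser.html.outlinks.ignore_tags</name>
  <!-- comma-separated tag names; append iframe to any tags you already ignore -->
  <value>iframe</value>
</property>
```

The bin/nutch parsechecker tool mentioned further down the thread is a quick way to see what effect the change has on a single URL.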
-----Original message-----
> From:Amit Sela <am...@infolinks.com>
> Sent: Wednesday 26th June 2013 16:03
> To: user@nutch.apache.org
> Subject: Re: Fetch iframe from HTML (if exists)
>
> In nutch-site.xml plugin.includes my custom filter is last and I have
> no htmlparsefilter.order so my filter should be applied last, right ?
>
> On Wed, Jun 26, 2013 at 5:00 PM, Amit Sela <am...@infolinks.com> wrote:
>
> > So I managed to create and deploy my plugin, which initially used
> > content.getContent() and it worked.
> > Then, I wanted to parse the fetched content as DocumentFragment (by
> > iterating over the child nodes).
> > This doesn't work. I logged DocumentFragment.toString() in my
> > MyCustomHtmlParseFilter in filter method, and in the Parse MapReduce logs I
> > see: [#document-fragment: null] for all URLS.
> >
> > How do I get nutch to pass the parsed html as DocumentFragment ? Should I
> > state htmlparsefilter.order in nutch-site.xml ? if so, in what order ?
> >
> > Thanks.
> >
> > On Tue, Jun 25, 2013 at 5:38 PM, Amit Sela <am...@infolinks.com> wrote:
> >
> >> Thanks for the prompt answer!
> >>
> >> On Tue, Jun 25, 2013 at 5:35 PM, Markus Jelsma <
> >> markus.jel...@openindex.io> wrote:
> >>
> >>> Hi,
> >>>
> >>> Do i understand you correctly if you want all iframe src attributes on a
> >>> given page stored in the iframe field?
> >>>
> >>> The src attributes are not extracted and there is no facility to do so
> >>> right now. You should create your own HTMLParseFilter, loop through the
> >>> document looking for iframe tags and collect the src attribute. Then add
> >>> those as parse metadata. You can then index them with the index-metadata
> >>> plugin. I'm not sure it supports multi valued metafields in Nutch 1.6, it
> >>> sure will in 1.7.
> >>>
> >>> Use the bin/nutch parsechecker and indexchecker tools to check if your
> >>> plugin works.
> >>>
> >>> Cheers
> >>>
> >>> -----Original message-----
> >>> > From:Amit Sela <am...@infolinks.com>
> >>> > Sent: Tuesday 25th June 2013 16:26
> >>> > To: user@nutch.apache.org
> >>> > Subject: Fetch iframe from HTML (if exists)
> >>> >
> >>> > Hi all,
> >>> >
> >>> > I'm using nutch 1.6 with Solr 3.6.2 and I would like to index the iframe
> >>> > src field into Solr.
> >>> > i.e.,
> >>> > <iframe src="something" scrolling="" frameborder="".......>
> >>> > So i want to fetch the iframe and index it as iframe so that I could find
> >>> > URLS by iframe src.
> >>> >
> >>> > I'm crawling with no depth over a seed list, and I don't want to crawl to
> >>> > the iframe src, just to index and store it.
> >>> >
> >>> > I tried adding
> >>> > <name>urlmeta.tags</name> <value>iframe</value> to nutch-site.xml
> >>> >
> >>> > and
> >>> > <field name="iframe" type="text_general" stored="true" indexed="true"
> >>> > multiValued="true"/> to schema.xml
> >>> >
> >>> > and
> >>> > <field dest="iframe" source="iframe"/> to solrindex-mapping.xml.
> >>> >
> >>> > What am I missing ?
> >>> >
> >>> > Thanks,
> >>> >
> >>> > Amit.