So I managed to create and deploy my plugin, which initially used content.getContent(), and it worked. Then I wanted to work with the fetched content as a DocumentFragment instead (by iterating over its child nodes). This doesn't work. I logged DocumentFragment.toString() in the filter method of MyCustomHtmlParseFilter, and in the Parse MapReduce logs I see [#document-fragment: null] for all URLs.
How do I get Nutch to pass the parsed HTML as a DocumentFragment? Should I set htmlparsefilter.order in nutch-site.xml? If so, in what order?

Thanks.

On Tue, Jun 25, 2013 at 5:38 PM, Amit Sela <[email protected]> wrote:

> Thanks for the prompt answer!
>
>
> On Tue, Jun 25, 2013 at 5:35 PM, Markus Jelsma <[email protected]> wrote:
>
>> Hi,
>>
>> Do i understand you correctly if you want all iframe src attributes on a
>> given page stored in the iframe field?
>>
>> The src attributes are not extracted and there is no facility to do so
>> right now. You should create your own HTMLParseFilter, loop through the
>> document looking for iframe tags and collect the src attribute. Then add
>> those as parse metadata. You can then index them with the index-metadata
>> plugin. I'm not sure it supports multi valued metafields in Nutch 1.6, it
>> sure will in 1.7.
>>
>> Use the bin/nutch parsechecker and indexchecker tools to check if your
>> plugin works.
>>
>> Cheers
>>
>>
>>
>> -----Original message-----
>> > From: Amit Sela <[email protected]>
>> > Sent: Tuesday 25th June 2013 16:26
>> > To: [email protected]
>> > Subject: Fetch iframe from HTML (if exists)
>> >
>> > Hi all,
>> >
>> > I'm using nutch 1.6 with Solr 3.6.2 and I would like to index the iframe
>> > src field into Solr.
>> > i.e.,
>> > <iframe src="something" scrolling="" frameborder="".......>
>> > So i want to fetch the iframe and index it as iframe so that I could find
>> > URLS by iframe src.
>> >
>> > I'm crawling with no depth over a seed list, and I don't want to crawl to
>> > the iframe src, just to index and store it.
>> >
>> > I tried adding
>> > <name>urlmeta.tags</name> <value>iframe</value> to nutch-site.xml
>> >
>> > and
>> > <field name="iframe" type="text_general" stored="true" indexed="true"
>> > multiValued="true"/> to schema.xml
>> >
>> > and
>> > <field dest="iframe" source="iframe"/> to solrindex-mapping.xml.
>> >
>> > What am I missing ?
>> >
>> > Thanks,
>> >
>> > Amit.
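
For reference, here is a minimal sketch of the DOM walk described in the quoted reply (loop through the document looking for iframe tags and collect the src attribute). It uses only the JDK's org.w3c.dom classes and does not touch the Nutch API. The class name IframeSrcCollector, the helper sampleFragment() and the example URL are hypothetical, made up for illustration; in a real filter the fragment would be the one handed to the filter method. Note also that in the DOM API the node value of a DocumentFragment is always null, so a toString() of the form [#document-fragment: null] may not by itself mean the fragment is empty; hasChildNodes() is a more direct check.

```java
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical helper class, not part of Nutch.
public class IframeSrcCollector {

    /** Walks the DOM under root and returns the src of every iframe element found. */
    public static List<String> collectIframeSrcs(Node root) {
        List<String> srcs = new ArrayList<>();
        walk(root, srcs);
        return srcs;
    }

    private static void walk(Node node, List<String> srcs) {
        // HTML parsers may report tag names in upper case, so compare case-insensitively.
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "iframe".equalsIgnoreCase(node.getNodeName())) {
            // getAttribute returns an empty string when the attribute is absent.
            String src = ((Element) node).getAttribute("src");
            if (!src.isEmpty()) {
                srcs.add(src);
            }
        }
        // Recurse into all children, depth first.
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            walk(child, srcs);
        }
    }

    /** Builds a tiny fragment by hand, standing in for the one a parser would supply. */
    public static DocumentFragment sampleFragment() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        DocumentFragment fragment = doc.createDocumentFragment();
        Element html = doc.createElement("html");
        Element body = doc.createElement("body");
        Element iframe = doc.createElement("IFRAME");
        iframe.setAttribute("src", "http://example.com/frame"); // made-up URL
        body.appendChild(iframe);
        html.appendChild(body);
        fragment.appendChild(html);
        return fragment;
    }

    public static void main(String[] args) throws Exception {
        DocumentFragment fragment = sampleFragment();
        System.out.println("has children: " + fragment.hasChildNodes());
        System.out.println("iframe srcs: " + collectIframeSrcs(fragment));
    }
}
```

The collected values could then be added as parse metadata, as suggested in the quoted reply, and checked with bin/nutch parsechecker.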

