Hi Amit,

> [#document-fragment: null]

that does not mean that your DocumentFragment is empty.
DocumentFragment.toString() does not print the DOM as XML.
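
To make this concrete, here is a small standalone sketch using only the JDK's DOM API. It is not Nutch code: the class name FragmentDemo, the helper names and the example URL are made up for illustration. It builds a non-empty fragment, prints its toString() next to its child count, and walks the tree to collect iframe src attributes, which is roughly what your filter would have to do with the fragment it receives.

```java
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical demo class, not part of Nutch.
public class FragmentDemo {

    // Builds a fragment equivalent to: <div><iframe src="http://example.com/frame"/></div>
    static DocumentFragment sampleFragment() throws ParserConfigurationException {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        DocumentFragment fragment = doc.createDocumentFragment();
        Element div = doc.createElement("div");
        Element iframe = doc.createElement("iframe");
        iframe.setAttribute("src", "http://example.com/frame");
        div.appendChild(iframe);
        fragment.appendChild(div);
        return fragment;
    }

    // Returns the non-empty src attributes of all iframe elements below the given node.
    static List<String> collectIframeSrc(Node node) {
        List<String> result = new ArrayList<>();
        collect(node, result);
        return result;
    }

    private static void collect(Node node, List<String> result) {
        // Compare case-insensitively: HTML parsers may report element names in upper case.
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "iframe".equalsIgnoreCase(node.getNodeName())) {
            String src = ((Element) node).getAttribute("src"); // "" if the attribute is absent
            if (!src.isEmpty()) {
                result.add(src);
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collect(children.item(i), result);
        }
    }

    public static void main(String[] args) throws Exception {
        DocumentFragment fragment = sampleFragment();
        // toString() gives a generic node description, not the XML content.
        System.out.println(fragment);
        // The fragment is not empty: it has one child (the div).
        System.out.println(fragment.getChildNodes().getLength());
        // Prints [http://example.com/frame]
        System.out.println(collectIframeSrc(fragment));
    }
}
```

In a real filter the fragment would come from the parser instead of sampleFragment(), and the collected values would go into the parse metadata.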

How to do this? Have a look at serializeToXML in
http://svn.apache.org/viewvc/any23/trunk/core/src/main/java/org/apache/any23/extractor/html/DomUtils.java?view=markup

To serialize a DocumentFragment you have to iterate over all child nodes, e.g.:

    NodeList nodes = doc.getChildNodes();
    for (int i = 0; i < nodes.getLength(); i++) {
      LOG.info(serializeToXML(nodes.item(i), true));
    }

> How do I get nutch to pass the parsed html as DocumentFragment ? Should I
> state htmlparsefilter.order in nutch-site.xml ? if so, in what order ?

As said by others: the DOM should be there!

Sebastian

On 06/26/2013 04:00 PM, Amit Sela wrote:
> So I managed to create and deploy my plugin, which initially used
> content.getContent() and it worked.
> Then, I wanted to parse the fetched content as DocumentFragment (by
> iterating over the child nodes).
> This doesn't work. I logged DocumentFragment.toString() in my
> MyCustomHtmlParseFilter in filter method, and in the Parse MapReduce logs I
> see: [#document-fragment: null] for all URLS.
>
> How do I get nutch to pass the parsed html as DocumentFragment ? Should I
> state htmlparsefilter.order in nutch-site.xml ? if so, in what order ?
>
> Thanks.
>
> On Tue, Jun 25, 2013 at 5:38 PM, Amit Sela <am...@infolinks.com> wrote:
>
>> Thanks for the prompt answer!
>>
>> On Tue, Jun 25, 2013 at 5:35 PM, Markus Jelsma <markus.jel...@openindex.io> wrote:
>>
>>> Hi,
>>>
>>> Do i understand you correctly if you want all iframe src attributes on a
>>> given page stored in the iframe field?
>>>
>>> The src attributes are not extracted and there is no facility to do so
>>> right now. You should create your own HTMLParseFilter, loop through the
>>> document looking for iframe tags and collect the src attribute. Then add
>>> those as parse metadata. You can then index them with the index-metadata
>>> plugin. I'm not sure it supports multi valued metafields in Nutch 1.6, it
>>> sure will in 1.7.
>>>
>>> Use the bin/nutch parsechecker and indexchecker tools to check if your
>>> plugin works.
>>>
>>> Cheers
>>>
>>> -----Original message-----
>>>> From:Amit Sela <am...@infolinks.com>
>>>> Sent: Tuesday 25th June 2013 16:26
>>>> To: user@nutch.apache.org
>>>> Subject: Fetch iframe from HTML (if exists)
>>>>
>>>> Hi all,
>>>>
>>>> I'm using nutch 1.6 with Solr 3.6.2 and I would like to index the iframe
>>>> src field into Solr.
>>>> i.e.,
>>>> <iframe src="something" scrolling="" frameborder="".......>
>>>> So i want to fetch the iframe and index it as iframe so that I could find
>>>> URLS by iframe src.
>>>>
>>>> I'm crawling with no depth over a seed list, and I don't want to crawl to
>>>> the iframe src, just to index and store it.
>>>>
>>>> I tried adding
>>>> <name>urlmeta.tags</name> <value>iframe</value> to nutch-site.xml
>>>>
>>>> and
>>>> <field name="iframe" type="text_general" stored="true" indexed="true"
>>>> multiValued="true"/> to schema.xml
>>>>
>>>> and
>>>> <field dest="iframe" source="iframe"/> to solrindex-mapping.xml.
>>>>
>>>> What am I missing ?
>>>>
>>>> Thanks,
>>>>
>>>> Amit.