I managed to create and deploy my plugin. It initially used
content.getContent(), and that worked.
Then I wanted to work with the fetched content as a DocumentFragment instead
(by iterating over its child nodes).
This doesn't work. I logged DocumentFragment.toString() in the filter method
of MyCustomHtmlParseFilter, and in the Parse MapReduce logs I
see [#document-fragment: null] for all URLs.

How do I get Nutch to pass the parsed HTML as a DocumentFragment? Should I
set htmlparsefilter.order in nutch-site.xml? If so, in what order?
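
For reference, here is a minimal sketch of the traversal I have in mind,
written against the plain org.w3c.dom API only so that it runs outside Nutch.
The class name IframeSrcCollector, the helper sampleFragment() and the
example.com URL are made up for illustration; in the real plugin the fragment
would be the DocumentFragment argument of the filter method. One caveat, as
far as I can tell: with Xerces-based DOM implementations, toString() on a node
prints only the node name and node value, so [#document-fragment: null] may
not by itself prove that the fragment is empty. Checking hasChildNodes() is
probably a more reliable test.

```java
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

public class IframeSrcCollector {

    /** Depth-first walk that records the src of every iframe element under node. */
    static void collect(Node node, List<String> out) {
        // HTML parsers may report element names in upper case, so compare case-insensitively.
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "iframe".equalsIgnoreCase(node.getNodeName())) {
            String src = ((Element) node).getAttribute("src"); // "" when the attribute is absent
            if (!src.isEmpty()) {
                out.add(src);
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collect(child, out);
        }
    }

    /** Returns all iframe src values found in the fragment, in document order. */
    static List<String> iframeSources(DocumentFragment doc) {
        List<String> out = new ArrayList<>();
        collect(doc, out);
        return out;
    }

    /** Hypothetical stand-in for the fragment that the parser would hand to a filter. */
    static DocumentFragment sampleFragment() throws Exception {
        Document d = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        DocumentFragment frag = d.createDocumentFragment();
        Element body = d.createElement("body");
        Element iframe = d.createElement("iframe");
        iframe.setAttribute("src", "http://example.com/frame");
        body.appendChild(iframe);
        frag.appendChild(body);
        return frag;
    }

    public static void main(String[] args) throws Exception {
        DocumentFragment frag = sampleFragment();
        // Check for children directly instead of relying on toString().
        System.out.println("has children: " + frag.hasChildNodes());
        System.out.println("iframe sources: " + iframeSources(frag));
    }
}
```

Inside the plugin, the collected values would then be added to the parse
metadata, as suggested in the quoted reply below.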

Thanks.




On Tue, Jun 25, 2013 at 5:38 PM, Amit Sela <[email protected]> wrote:

> Thanks for the prompt answer!
>
>
> On Tue, Jun 25, 2013 at 5:35 PM, Markus Jelsma <[email protected]> wrote:
>
>> Hi,
>>
>> Do I understand correctly that you want all iframe src attributes on a
>> given page stored in the iframe field?
>>
>> The src attributes are not extracted and there is no facility to do so
>> right now. You should create your own HTMLParseFilter, loop through the
>> document looking for iframe tags and collect the src attribute. Then add
>> those as parse metadata. You can then index them with the index-metadata
>> plugin. I'm not sure it supports multi-valued metadata fields in Nutch 1.6,
>> but it certainly will in 1.7.
>>
>> Use the bin/nutch parsechecker and indexchecker tools to check if your
>> plugin works.
>>
>> Cheers
>>
>>
>>
>> -----Original message-----
>> > From:Amit Sela <[email protected]>
>> > Sent: Tuesday 25th June 2013 16:26
>> > To: [email protected]
>> > Subject: Fetch iframe from HTML (if exists)
>> >
>> > Hi all,
>> >
>> > I'm using Nutch 1.6 with Solr 3.6.2 and I would like to index the iframe
>> > src attribute into Solr.
>> > i.e.,
>> > <iframe src="something" scrolling="" frameborder="".......>
>> > So I want to extract the iframe src and index it as iframe so that I can
>> > find URLs by iframe src.
>> >
>> > I'm crawling with no depth over a seed list, and I don't want to crawl
>> > to the iframe src, just to index and store it.
>> >
>> > I tried adding
>> > <name>urlmeta.tags</name> <value>iframe</value> to nutch-site.xml
>> >
>> > and
>> > <field name="iframe" type="text_general" stored="true" indexed="true"
>> > multiValued="true"/> to schema.xml
>> >
>> > and
>> > <field dest="iframe" source="iframe"/> to solrindex-mapping.xml.
>> >
>> > What am I missing ?
>> >
>> > Thanks,
>> >
>> > Amit.
>> >
>>
>
>
