Re: Fetch iframe from HTML (if exists)

Sebastian Nagel Thu, 27 Jun 2013 14:37:23 -0700

Hi Amit,

> [#document-fragment: null]
That does not mean that your DocumentFragment is empty:
DocumentFragment.toString() does not print the DOM as XML.


How to do this? Have a look at serializeToXML in
  
http://svn.apache.org/viewvc/any23/trunk/core/src/main/java/org/apache/any23/extractor/html/DomUtils.java?view=markup
To serialize a DocumentFragment you have to iterate over all child nodes, e.g.:

    NodeList nodes = doc.getChildNodes();
    for (int i = 0; i < nodes.getLength(); i++) {
      LOG.info(serializeToXML(nodes.item(i), true));
    }
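
If you do not want to depend on Any23, here is a minimal, self-contained
sketch of the same idea using only the JDK. It is an illustration, not the
Any23 code: the class FragmentDump and its methods are hypothetical names,
the identity Transformer from javax.xml.transform stands in for
serializeToXML, and sampleFragment() only builds a tiny DOM by hand so that
the example runs outside Nutch.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class FragmentDump {

    // Hypothetical helper: serialize one DOM node with the JDK identity transformer.
    static String nodeToXml(Node node) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(node), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("could not serialize node", e);
        }
    }

    // Iterate over the fragment's child nodes and serialize each one in turn.
    static String fragmentToXml(DocumentFragment fragment) {
        StringBuilder sb = new StringBuilder();
        NodeList nodes = fragment.getChildNodes();
        for (int i = 0; i < nodes.getLength(); i++) {
            sb.append(nodeToXml(nodes.item(i)));
        }
        return sb.toString();
    }

    // Assumption: a hand-built fragment standing in for what the parser would pass in.
    static DocumentFragment sampleFragment() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            DocumentFragment fragment = doc.createDocumentFragment();
            Element body = doc.createElement("body");
            Element iframe = doc.createElement("iframe");
            iframe.setAttribute("src", "http://example.com/frame");
            body.appendChild(iframe);
            fragment.appendChild(body);
            return fragment;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("could not create a DOM document", e);
        }
    }

    public static void main(String[] args) {
        DocumentFragment fragment = sampleFragment();
        // toString() only names the node; it does not show the children.
        System.out.println(fragment);
        // The serialized form shows the actual content of the fragment.
        System.out.println(fragmentToXml(fragment));
    }
}
```

In a parse filter you would log fragmentToXml(doc) instead of doc.toString().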

> How do I get nutch to pass the parsed html as DocumentFragment ? Should I
> state htmlparsefilter.order in nutch-site.xml ? if so, in what order ?

As said by others: the DOM should be there!
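
Once you can see the DOM, the approach Markus suggested earlier in this
thread (loop through the document looking for iframe tags and collect the src
attribute) could look roughly like the sketch below. It is only an
illustration under assumptions: IframeSrcCollector, collect, iframeSources
and sampleFragment are hypothetical names, the sample DOM is built by hand,
and the Nutch-specific part (adding the values to the parse metadata inside
your filter method) is left out on purpose. Tag names are compared
case-insensitively because their case can differ depending on the HTML
parser in use.

```java
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

public class IframeSrcCollector {

    // Depth-first walk: record the src of every iframe element under 'node'.
    static void collect(Node node, List<String> out) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "iframe".equalsIgnoreCase(node.getNodeName())) {
            String src = ((Element) node).getAttribute("src");
            if (!src.isEmpty()) {   // getAttribute returns "" when the attribute is absent
                out.add(src);
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collect(child, out);
        }
    }

    // Convenience wrapper for a whole fragment.
    static List<String> iframeSources(DocumentFragment fragment) {
        List<String> sources = new ArrayList<>();
        collect(fragment, sources);
        return sources;
    }

    // Assumption: a hand-built fragment standing in for the parsed page.
    static DocumentFragment sampleFragment() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            DocumentFragment fragment = doc.createDocumentFragment();
            Element body = doc.createElement("body");
            Element div = doc.createElement("div");
            Element first = doc.createElement("iframe");
            first.setAttribute("src", "http://example.com/a");
            Element second = doc.createElement("IFRAME");   // upper-case tag name
            second.setAttribute("src", "http://example.com/b");
            Element noSrc = doc.createElement("iframe");    // no src, should be skipped
            div.appendChild(first);
            body.appendChild(div);
            body.appendChild(second);
            body.appendChild(noSrc);
            fragment.appendChild(body);
            return fragment;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("could not create a DOM document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(iframeSources(sampleFragment()));
    }
}
```

The collected values are what you would then add as parse metadata so that
the index-metadata plugin can pick them up.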

Sebastian


On 06/26/2013 04:00 PM, Amit Sela wrote:
> So I managed to create and deploy my plugin, which initially used
> content.getContent() and it worked.
> Then, I wanted to parse the fetched content as DocumentFragment (by
> iterating over the child nodes).
> This doesn't work. I logged DocumentFragment.toString() in the filter
> method of my MyCustomHtmlParseFilter, and in the Parse MapReduce logs I
> see: [#document-fragment: null] for all URLs.
> 
> How do I get nutch to pass the parsed html as DocumentFragment ? Should I
> state htmlparsefilter.order in nutch-site.xml ? if so, in what order ?
> 
> Thanks.
> 
> 
> 
> 
> On Tue, Jun 25, 2013 at 5:38 PM, Amit Sela <[email protected]> wrote:
> 
>> Thanks for the prompt answer!
>>
>>
>> On Tue, Jun 25, 2013 at 5:35 PM, Markus Jelsma <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> Do I understand you correctly that you want all iframe src attributes on a
>>> given page stored in the iframe field?
>>>
>>> The src attributes are not extracted and there is no facility to do so
>>> right now. You should create your own HTMLParseFilter, loop through the
>>> document looking for iframe tags and collect the src attribute. Then add
>>> those as parse metadata. You can then index them with the index-metadata
>>> plugin. I'm not sure it supports multi-valued metafields in Nutch 1.6, but
>>> it will in 1.7.
>>>
>>> Use the bin/nutch parsechecker and indexchecker tools to check if your
>>> plugin works.
>>>
>>> Cheers
>>>
>>>
>>>
>>> -----Original message-----
>>>> From:Amit Sela <[email protected]>
>>>> Sent: Tuesday 25th June 2013 16:26
>>>> To: [email protected]
>>>> Subject: Fetch iframe from HTML (if exists)
>>>>
>>>> Hi all,
>>>>
>>>> I'm using nutch 1.6 with Solr 3.6.2 and I would like to index the iframe
>>>> src field into Solr.
>>>> i.e.,
>>>> <iframe src="something" scrolling="" frameborder="".......>
>>>> So I want to fetch the iframe and index it as iframe so that I could find
>>>> URLs by iframe src.
>>>>
>>>> I'm crawling with no depth over a seed list, and I don't want to crawl to
>>>> the iframe src, just to index and store it.
>>>>
>>>> I tried adding
>>>> <name>urlmeta.tags</name> <value>iframe</value> to nutch-site.xml
>>>>
>>>> and
>>>> <field name="iframe" type="text_general" stored="true" indexed="true"
>>>> multiValued="true"/> to schema.xml
>>>>
>>>> and
>>>> <field dest="iframe" source="iframe"/> to solrindex-mapping.xml.
>>>>
>>>> What am I missing ?
>>>>
>>>> Thanks,
>>>>
>>>> Amit.
>>>>
>>>
>>
>>
>
