RE: Fetch iframe from HTML (if exists)

Markus Jelsma Wed, 26 Jun 2013 07:08:53 -0700

No, order does not matter. Try adding iframe to the ignore_tags configuration 
directive in your nutch-site.xml:
parser.html.outlinks.ignore_tags
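
On the DocumentFragment question quoted below: the DOM walk suggested earlier in the thread (loop through the document, find iframe tags, collect src) could look roughly like the sketch here. It uses only the JDK's org.w3c.dom classes so it runs without Nutch. The names IframeSrcCollector, collect and sampleFragment are hypothetical, and wiring the result into a filter's parse metadata is not shown. Tag names are compared case-insensitively on the assumption that the HTML parser may report them in upper case. With the JDK's bundled DOM implementation, toString() on a node typically prints the node name and node value, and a fragment has no node value, so a log line like [#document-fragment: null] does not by itself show that the fragment has no children.

```java
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical helper, not part of Nutch: collects iframe src values from a DOM tree.
final class IframeSrcCollector {

    private IframeSrcCollector() {}

    /** Depth-first walk from root, returning every non-empty iframe src in document order. */
    static List<String> collect(Node root) {
        List<String> found = new ArrayList<>();
        walk(root, found);
        return found;
    }

    private static void walk(Node node, List<String> found) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "iframe".equalsIgnoreCase(node.getNodeName())) {
            // getAttribute returns an empty string when the attribute is absent.
            String src = ((Element) node).getAttribute("src");
            if (!src.isEmpty()) {
                found.add(src);
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            walk(child, found);
        }
    }

    /** Builds a small fragment by hand, standing in for the one a parse filter would receive. */
    static DocumentFragment sampleFragment() {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
        DocumentFragment fragment = doc.createDocumentFragment();
        Element body = doc.createElement("body");
        Element withSrc = doc.createElement("IFRAME");
        withSrc.setAttribute("src", "http://example.com/frame");
        body.appendChild(withSrc);
        body.appendChild(doc.createElement("iframe")); // no src attribute, skipped
        fragment.appendChild(body);
        return fragment;
    }

    public static void main(String[] args) {
        DocumentFragment fragment = sampleFragment();
        System.out.println(fragment);          // toString() output, not the child content
        System.out.println(collect(fragment)); // the collected src values
    }
}
```

Inside a real filter the same walk would start from the DocumentFragment argument that Nutch passes in, and bin/nutch parsechecker can then be used to confirm the values end up in the parse metadata.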


 
 
-----Original message-----
> From:Amit Sela <am...@infolinks.com>
> Sent: Wednesday 26th June 2013 16:03
> To: user@nutch.apache.org
> Subject: Re: Fetch iframe from HTML (if exists)
> 
> In nutch-site.xml, my custom filter is last in plugin.includes and I have
> no htmlparsefilter.order, so my filter should be applied last, right?
> 
> 
> 
> On Wed, Jun 26, 2013 at 5:00 PM, Amit Sela <am...@infolinks.com> wrote:
> 
> > So I managed to create and deploy my plugin, which initially used
> > content.getContent() and it worked.
> > Then, I wanted to parse the fetched content as DocumentFragment (by
> > iterating over the child nodes).
> > This doesn't work. I logged DocumentFragment.toString() in the filter method of
> > MyCustomHtmlParseFilter, and in the Parse MapReduce logs I see
> > [#document-fragment: null] for all URLs.
> >
> > How do I get Nutch to pass the parsed HTML as a DocumentFragment? Should I
> > set htmlparsefilter.order in nutch-site.xml? If so, in what order?
> >
> > Thanks.
> >
> >
> >
> >
> > On Tue, Jun 25, 2013 at 5:38 PM, Amit Sela <am...@infolinks.com> wrote:
> >
> >> Thanks for the prompt answer!
> >>
> >>
> >> On Tue, Jun 25, 2013 at 5:35 PM, Markus Jelsma <
> >> markus.jel...@openindex.io> wrote:
> >>
> >>> Hi,
> >>>
> >>> Do I understand correctly that you want all iframe src attributes on a
> >>> given page stored in the iframe field?
> >>>
> >>> The src attributes are not extracted and there is no facility to do so
> >>> right now. You should create your own HTMLParseFilter, loop through the
> >>> document looking for iframe tags and collect the src attribute. Then add
> >>> those as parse metadata. You can then index them with the index-metadata
> >>> plugin. I'm not sure it supports multi-valued meta fields in Nutch 1.6, but
> >>> it will in 1.7.
> >>>
> >>> Use the bin/nutch parsechecker and indexchecker tools to check if your
> >>> plugin works.
> >>>
> >>> Cheers
> >>>
> >>>
> >>>
> >>> -----Original message-----
> >>> > From:Amit Sela <am...@infolinks.com>
> >>> > Sent: Tuesday 25th June 2013 16:26
> >>> > To: user@nutch.apache.org
> >>> > Subject: Fetch iframe from HTML (if exists)
> >>> >
> >>> > Hi all,
> >>> >
> >>> > I'm using Nutch 1.6 with Solr 3.6.2 and I would like to index the iframe
> >>> > src field into Solr.
> >>> > i.e.,
> >>> > <iframe src="something" scrolling="" frameborder="".......>
> >>> > So I want to fetch the iframe and index it as iframe so that I could find
> >>> > URLs by iframe src.
> >>> >
> >>> > I'm crawling with no depth over a seed list, and I don't want to crawl to
> >>> > the iframe src, just to index and store it.
> >>> >
> >>> > I tried adding
> >>> > <name>urlmeta.tags</name> <value>iframe</value> to nutch-site.xml
> >>> >
> >>> > and
> >>> > <field name="iframe" type="text_general" stored="true" indexed="true"
> >>> > multiValued="true"/> to schema.xml
> >>> >
> >>> > and
> >>> > <field dest="iframe" source="iframe"/> to solrindex-mapping.xml.
> >>> >
> >>> > What am I missing ?
> >>> >
> >>> > Thanks,
> >>> >
> >>> > Amit.
> >>> >
> >>>
> >>
> >>
> >
>
