Re: Fetch iframe from HTML (if exists)

Lewis John Mcgibbney Wed, 26 Jun 2013 09:21:48 -0700

It looks like you're on a pre-1.3 version of Nutch here.
It is highly recommended to upgrade.
Thanks
Lewis


On Wednesday, June 26, 2013, Amit Sela <[email protected]> wrote:
> I did succeed in parsing using the content and iterating over every line, but
> I'd prefer to do it with DocumentFragment.
> My plugin.includes has:
>
> protocol-httpclient|urlfilter-regex|parse-(text|html|tika|metatags|js)|index-(basic|anchor|metadata)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass)|iframemeta
> So I use parse-html but also tika, text, metatags and js. Maybe it's too much?
> I copied this configuration from an example I saw. I do know that I use
> metatags (I index keywords and description) but I'm not sure about the
> rest...
>
>
> On Wed, Jun 26, 2013 at 5:21 PM, Markus Jelsma
> <[email protected]> wrote:
>
>> Of course, forget it. What parser do you use? Maybe the old parse-html
>> doesn't report it back. You can also try to print every element you loop
>> over and check if it's there or not.
>>
>>
>>
>> -----Original message-----
>> > From:Amit Sela <[email protected]>
>> > Sent: Wednesday 26th June 2013 16:11
>> > To: [email protected]
>> > Subject: Re: Fetch iframe from HTML (if exists)
>> >
>> > How will it affect things? I crawl with no depth (depth 1), so outlinks don't
>> > matter, and it seems that the fetched URLs don't get parsed, or am I
>> > misunderstanding something?
>> >
>> >
>> > On Wed, Jun 26, 2013 at 5:06 PM, Markus Jelsma
>> > <[email protected]> wrote:
>> >
>> > > No, order does not matter. Try adding iframe to the ignore_tags
>> > > configuration directive in your nutch-site.xml:
>> > > parser.html.outlinks.ignore_tags
>> > >
>> > >
>> > >
>> > > -----Original message-----
>> > > > From:Amit Sela <[email protected]>
>> > > > Sent: Wednesday 26th June 2013 16:03
>> > > > To: [email protected]
>> > > > Subject: Re: Fetch iframe from HTML (if exists)
>> > > >
>> > > > In nutch-site.xml plugin.includes my custom filter is last and I have
>> > > > no htmlparsefilter.order, so my filter should be applied last, right?
>> > > >
>> > > >
>> > > >
>> > > > On Wed, Jun 26, 2013 at 5:00 PM, Amit Sela <[email protected]> wrote:
>> > > >
>> > > > > So I managed to create and deploy my plugin, which initially used
>> > > > > content.getContent(), and it worked.
>> > > > > Then I wanted to parse the fetched content as a DocumentFragment (by
>> > > > > iterating over the child nodes).
>> > > > > This doesn't work. I logged DocumentFragment.toString() in the filter
>> > > > > method of my MyCustomHtmlParseFilter, and in the Parse MapReduce logs I
>> > > > > see: [#document-fragment: null] for all URLs.
>> > > > >
>> > > > > How do I get Nutch to pass the parsed HTML as a DocumentFragment? Should I
>> > > > > state htmlparsefilter.order in nutch-site.xml? If so, in what order?
>> > > > >
>> > > > > Thanks.
>> > > > >
>> > > > >
>> > > > >
>> > > > >
>> > > > > On Tue, Jun 25, 2013 at 5:38 PM, Amit Sela <[email protected]> wrote:
>> > > > >
>> > > > >> Thanks for the prompt answer!
>> > > > >>
>> > > > >>
>> > > > >> On Tue, Jun 25, 2013 at 5:35 PM, Markus Jelsma <
>> > > > >> [email protected]> wrote:
>> > > > >>
>> > > > >>> Hi,
>> > > > >>>
>> > > > >>> Do I understand you correctly that you want all iframe src attributes on a
>> > > > >>> given page stored in the iframe field?
>> > > > >>>
>> > > > >>> The src attributes are not extracted and there is no facility to do so
>> > > > >>> right now. You should create your own HTMLParseFilter, loop through the
>> > > > >>> document looking for iframe tags and collect the src attribute. Then add
>> > > > >>> those as

-- 
*Lewis*
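
For reference, the approach Markus describes in the quoted thread (loop through
the parsed document looking for iframe tags and collect the src attribute) can
be sketched with plain org.w3c.dom calls. This is only a sketch under stated
assumptions, not the code from Amit's plugin: the class name IframeSrcCollector
and the helper demoFragment() are hypothetical, the demo fragment is built by
hand instead of coming from Nutch, and the Nutch HTMLParseFilter wiring is left
out on purpose because its signature depends on the Nutch version in use.

```java
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical helper class; not part of Nutch.
class IframeSrcCollector {

    /** Walks the DOM below root and returns every non-empty iframe src value. */
    static List<String> collect(Node root) {
        List<String> result = new ArrayList<>();
        walk(root, result);
        return result;
    }

    private static void walk(Node node, List<String> result) {
        // HTML parsers may report tag names in upper case, so compare ignoring case.
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "iframe".equalsIgnoreCase(node.getNodeName())) {
            // getAttribute returns "" when the attribute is absent.
            // Assumption: the attribute name is lower-case "src".
            String src = ((Element) node).getAttribute("src");
            if (!src.isEmpty()) {
                result.add(src);
            }
        }
        // Recurse into all children, in document order.
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            walk(child, result);
        }
    }

    /** Builds a small hand-made fragment that stands in for a parsed page. */
    static DocumentFragment demoFragment() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            DocumentFragment fragment = doc.createDocumentFragment();
            Element body = doc.createElement("body");
            Element iframe = doc.createElement("IFRAME");
            iframe.setAttribute("src", "http://example.com/embed");
            body.appendChild(doc.createElement("p"));
            body.appendChild(iframe);
            fragment.appendChild(body);
            return fragment;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        DocumentFragment fragment = demoFragment();
        // Checking for children says more than logging the fragment itself.
        System.out.println("has children: " + fragment.hasChildNodes());
        System.out.println("iframe src values: " + collect(fragment));
    }
}
```

A note on the logged value: as far as I know, DOM implementations commonly
print only the node name and node value from toString(), so a line such as
[#document-fragment: null] does not by itself show that the fragment is empty.
Calling hasChildNodes(), or printing each element while looping as Markus
suggests, is a more direct check.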
