Nope. Using nutch 1.6.
I ended up using
org.jsoup.nodes.Document document = Jsoup.parse(content.
List<org.jsoup.nodes.Node> childNodes = document.childNodes();
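
For the DocumentFragment route discussed further down the thread (loop through the document looking for iframe tags and collect the src attribute), here is a minimal sketch that uses only JDK DOM classes. The class name IframeSrcCollector and the parseXhtml helper are hypothetical and not part of Nutch or jsoup. The helper only accepts well-formed XHTML, so real pages would still need a lenient HTML parser in front of it.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of Nutch or jsoup.
final class IframeSrcCollector {

    /** Walks the tree under root (Document, DocumentFragment or Element)
     *  and returns the non-empty src values of all iframe elements, in document order. */
    static List<String> collect(Node root) {
        List<String> sources = new ArrayList<>();
        walk(root, sources);
        return sources;
    }

    private static void walk(Node node, List<String> sources) {
        if (node == null) {
            return;
        }
        // Tag names may be upper or lower case depending on the parser that built the tree.
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "iframe".equalsIgnoreCase(node.getNodeName())) {
            // getAttribute returns an empty string when the attribute is absent.
            String src = ((Element) node).getAttribute("src");
            if (!src.isEmpty()) {
                sources.add(src);
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            walk(child, sources);
        }
    }

    /** Demo-only helper (an assumption for this sketch): parses well-formed XHTML with the JDK parser. */
    static Node parseXhtml(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XHTML", e);
        }
    }
}
```

Inside a parse filter, the same collect method could be called on the DocumentFragment that the filter receives. As far as I know, DOM implementations such as Xerces print a node as [name: value] in toString(), and a document fragment has no node value, so a log line like [#document-fragment: null] does not by itself show that the fragment is empty. Logging the number of child nodes would be a more reliable check.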



On Wed, Jun 26, 2013 at 7:19 PM, Lewis John Mcgibbney <
[email protected]> wrote:

> It looks like you're on a pre-1.3 version of Nutch here.
> It is highly recommended to upgrade.
> Thanks
> Lewis
>
> On Wednesday, June 26, 2013, Amit Sela <[email protected]> wrote:
> > I did succeed in parsing using content and iterating over every line, but
> > I'd prefer to do it with DocumentFragment.
> > My plugin.includes has:
> >
> > protocol-httpclient|urlfilter-regex|parse-(text|html|tika|metatags|js)|index-(basic|anchor|metadata)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass)|iframemeta
> >
> > So I use parse-html but also tika, text, metatags and js. Maybe it's too
> > much? I copied this configuration from an example I saw. I do know that I use
> > metatags (I index keywords and description) but I'm not sure about the
> > rest...
> >
> >
> > On Wed, Jun 26, 2013 at 5:21 PM, Markus Jelsma
> > <[email protected]>wrote:
> >
> >> Of course, forget it. What parser do you use? Maybe the old parse-html
> >> doesn't report it back. You can also try to print every element you loop
> >> over and check if it's there or not.
> >>
> >>
> >>
> >> -----Original message-----
> >> > From:Amit Sela <[email protected]>
> >> > Sent: Wednesday 26th June 2013 16:11
> >> > To: [email protected]
> >> > Subject: Re: Fetch iframe from HTML (if exists)
> >> >
> >> > How will it affect this? I crawl with no depth (depth 1), so outlinks don't
> >> > matter, and it seems that the fetched URLs don't get parsed. Or am I
> >> > misunderstanding something?
> >> >
> >> >
> >> > On Wed, Jun 26, 2013 at 5:06 PM, Markus Jelsma
> >> > <[email protected]>wrote:
> >> >
> >> > > No, order does not matter. Try adding iframe to the ignore_tags
> >> > > configuration directive in your nutch-site.xml:
> >> > > parser.html.outlinks.ignore_tags
> >> > >
> >> > >
> >> > >
> >> > > -----Original message-----
> >> > > > From:Amit Sela <[email protected]>
> >> > > > Sent: Wednesday 26th June 2013 16:03
> >> > > > To: [email protected]
> >> > > > Subject: Re: Fetch iframe from HTML (if exists)
> >> > > >
> >> > > > In nutch-site.xml plugin.includes my custom filter is last and I have
> >> > > > no htmlparsefilter.order, so my filter should be applied last, right?
> >> > > >
> >> > > >
> >> > > >
> >> > > > On Wed, Jun 26, 2013 at 5:00 PM, Amit Sela <[email protected]> wrote:
> >> > > >
> >> > > > > So I managed to create and deploy my plugin, which initially used
> >> > > > > content.getContent(), and it worked.
> >> > > > > Then I wanted to parse the fetched content as a DocumentFragment (by
> >> > > > > iterating over the child nodes).
> >> > > > > This doesn't work. I logged DocumentFragment.toString() in the filter
> >> > > > > method of MyCustomHtmlParseFilter, and in the Parse MapReduce logs I
> >> > > > > see: [#document-fragment: null] for all URLs.
> >> > > > >
> >> > > > > How do I get Nutch to pass the parsed HTML as a DocumentFragment? Should I
> >> > > > > state htmlparsefilter.order in nutch-site.xml? If so, in what order?
> >> > > > >
> >> > > > > Thanks.
> >> > > > >
> >> > > > >
> >> > > > >
> >> > > > >
> >> > > > > On Tue, Jun 25, 2013 at 5:38 PM, Amit Sela <[email protected]> wrote:
> >> > > > >
> >> > > > >> Thanks for the prompt answer!
> >> > > > >>
> >> > > > >>
> >> > > > >> On Tue, Jun 25, 2013 at 5:35 PM, Markus Jelsma <
> >> > > > >> [email protected]> wrote:
> >> > > > >>
> >> > > > >>> Hi,
> >> > > > >>>
> >> > > > >>> Do I understand correctly that you want all iframe src attributes on a
> >> > > > >>> given page stored in the iframe field?
> >> > > > >>>
> >> > > > >>> The src attributes are not extracted and there is no facility to do so
> >> > > > >>> right now. You should create your own HTMLParseFilter, loop through the
> >> > > > >>> document looking for iframe tags and collect the src attribute. Then add
> >> > > > >>> those as
>
> --
> *Lewis*
>
