Re: Fetch iframe from HTML (if exists)

Lewis John Mcgibbney Wed, 26 Jun 2013 09:59:59 -0700

Unless you have custom parse-text and query-* plugins, you can safely
remove these entries from plugin.includes, as they were dropped some time
ago.


Good to hear you got it working.
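
For anyone who still wants the DocumentFragment route mentioned further down
the thread, here is a minimal sketch of walking the DOM to collect iframe src
values. It is an illustration only: the class IframeFinder and the methods
collectIframeSrcs and sampleFragment are hypothetical names, and the fragment
is built by hand rather than taken from a real parse, so it says nothing
about what a given Nutch parser actually hands to a filter. Tag names are
compared case-insensitively because their case can differ between parsers.
Note also that in Xerces-based DOM implementations Node.toString() prints
only the node name and value, so it may be worth checking getFirstChild()
before concluding that a fragment is empty.

```java
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

public class IframeFinder {

    /** Recursively collects the src attribute of every iframe element under root. */
    public static List<String> collectIframeSrcs(Node root) {
        List<String> srcs = new ArrayList<>();
        walk(root, srcs);
        return srcs;
    }

    private static void walk(Node node, List<String> srcs) {
        // Match iframe elements regardless of tag-name case.
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "iframe".equalsIgnoreCase(node.getNodeName())) {
            // getAttribute returns an empty string when the attribute is absent.
            String src = ((Element) node).getAttribute("src");
            if (!src.isEmpty()) {
                srcs.add(src);
            }
        }
        // Depth-first traversal over all children.
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            walk(child, srcs);
        }
    }

    /** Builds a small hand-made fragment standing in for a parsed page (assumption). */
    public static DocumentFragment sampleFragment() throws ParserConfigurationException {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        DocumentFragment fragment = doc.createDocumentFragment();
        Element body = doc.createElement("BODY");
        Element div = doc.createElement("DIV");
        Element first = doc.createElement("IFRAME");
        first.setAttribute("src", "http://example.com/a");
        Element second = doc.createElement("iframe");
        second.setAttribute("src", "http://example.com/b");
        Element noSrc = doc.createElement("iframe"); // skipped: has no src attribute
        div.appendChild(first);
        body.appendChild(div);
        body.appendChild(second);
        body.appendChild(noSrc);
        fragment.appendChild(body);
        return fragment;
    }

    public static void main(String[] args) throws ParserConfigurationException {
        System.out.println(collectIframeSrcs(sampleFragment()));
    }
}
```

Inside a parse filter, the same helper could be called on the DocumentFragment
argument instead of the hand-built sample.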

On Wednesday, June 26, 2013, Amit Sela <[email protected]> wrote:
> Nope. Using nutch 1.6.
> I ended up using
> org.jsoup.nodes.Document document = Jsoup.parse(content.
> List<org.jsoup.nodes.Node> childNodes = document.childNodes();
>
>
>
> On Wed, Jun 26, 2013 at 7:19 PM, Lewis John Mcgibbney <
> [email protected]> wrote:
>
>> It looks like you're on a pre-1.3 version of Nutch here.
>> It is highly recommended to upgrade.
>> Thanks
>> Lewis
>>
>> On Wednesday, June 26, 2013, Amit Sela <[email protected]> wrote:
>> > I did succeed in parsing using content and iterating over every line, but
>> > I'd prefer to do it with DocumentFragment.
>> > my plugin.includes has:
>> >
>> > protocol-httpclient|urlfilter-regex|parse-(text|html|tika|metatags|js)|index-(basic|anchor|metadata)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass)|iframemeta
>> > So I use parse-html but also tika, text, metatags and js. Maybe it's too
>> > much? I copied this configuration from an example I saw. I do know that I
>> > use metatags (I index keywords and description), but I'm not sure about the
>> > rest...
>> >
>> >
>> > On Wed, Jun 26, 2013 at 5:21 PM, Markus Jelsma
>> > <[email protected]> wrote:
>> >
>> >> Of course, forget it. What parser do you use? Maybe the old parse-html
>> >> doesn't report it back. You can also try to print every element you loop
>> >> over and check if it's there or not.
>> >>
>> >>
>> >>
>> >> -----Original message-----
>> >> > From:Amit Sela <[email protected]>
>> >> > Sent: Wednesday 26th June 2013 16:11
>> >> > To: [email protected]
>> >> > Subject: Re: Fetch iframe from HTML (if exists)
>> >> >
>> >> > How would that affect it? I crawl with no depth (depth 1), so outlinks
>> >> > don't matter, and it seems that the URLs fetched don't get parsed, or am I
>> >> > misunderstanding something?
>> >> >
>> >> >
>> >> > On Wed, Jun 26, 2013 at 5:06 PM, Markus Jelsma
>> >> > <[email protected]> wrote:
>> >> >
>> >> > > No, order does not matter. Try adding iframe to the ignore_tags
>> >> > > configuration directive in your nutch-site.xml:
>> >> > > parser.html.outlinks.ignore_tags
>> >> > >
>> >> > >
>> >> > >
>> >> > > -----Original message-----
>> >> > > > From:Amit Sela <[email protected]>
>> >> > > > Sent: Wednesday 26th June 2013 16:03
>> >> > > > To: [email protected]
>> >> > > > Subject: Re: Fetch iframe from HTML (if exists)
>> >> > > >
>> >> > > > In nutch-site.xml plugin.includes my custom filter is last and I have
>> >> > > > no htmlparsefilter.order, so my filter should be applied last, right?
>> >> > > >
>> >> > > >
>> >> > > >
>> >> > > > On Wed, Jun 26, 2013 at 5:00 PM, Amit Sela <[email protected]> wrote:
>> >> > > >
>> >> > > > > So I managed to create and deploy my plugin, which initially used
>> >> > > > > content.getContent(), and it worked.
>> >> > > > > Then I wanted to parse the fetched content as a DocumentFragment (by
>> >> > > > > iterating over the child nodes).
>> >> > > > > This doesn't work. I logged DocumentFragment.toString() in the filter
>> >> > > > > method of my MyCustomHtmlParseFilter, and in the Parse MapReduce logs I
>> >> > > > > see: [#document-fragment: null] for all URLs.
>> *Lewis*
>>
>

-- 
*Lewis*
