We got it to work on a few million pages so in terms of volume, yes. These
pages all belonged to the same site though.
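
To make the approach from the quoted message below more concrete (a custom protocol implementation that lets a browser resolve the JavaScript and then returns byte content for the parser), here is a minimal sketch in Java. It is only an illustration under stated assumptions, not the code we ran: the names PageRenderer, RenderedContent and RenderingFetcher are hypothetical, and a real Nutch protocol plugin would return Nutch's own types instead. The renderer is stubbed so the sketch runs without a Selenium server; a Selenium-backed renderer would call WebDriver.get(url) followed by WebDriver.getPageSource() against the running server.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical abstraction: executes the page's JavaScript and returns the resulting HTML. */
interface PageRenderer {
    String render(String url);
}

/** Hypothetical holder for what the parser would receive; not a Nutch class. */
final class RenderedContent {
    final String url;
    final byte[] bytes;
    final String contentType;

    RenderedContent(String url, byte[] bytes, String contentType) {
        this.url = url;
        this.bytes = bytes;
        this.contentType = contentType;
    }
}

/** Hypothetical protocol-like wrapper: delegates rendering, then hands back raw bytes. */
final class RenderingFetcher {
    private final PageRenderer renderer;

    RenderingFetcher(PageRenderer renderer) {
        this.renderer = renderer;
    }

    RenderedContent fetch(String url) {
        String html = renderer.render(url);
        if (html == null) {
            html = ""; // treat a failed render as empty content
        }
        // The parser works on bytes, so encode the rendered DOM explicitly.
        byte[] bytes = html.getBytes(StandardCharsets.UTF_8);
        return new RenderedContent(url, bytes, "text/html; charset=UTF-8");
    }
}

class RenderingFetcherDemo {
    public static void main(String[] args) {
        // Stand-in renderer (assumption): replace with a Selenium-backed one that
        // calls driver.get(url) and returns driver.getPageSource().
        PageRenderer stub = url -> "<html><body><p>rendered " + url + "</p></body></html>";
        RenderedContent content = new RenderingFetcher(stub).fetch("http://example.com/");
        System.out.println(content.contentType + ", " + content.bytes.length + " bytes");
    }
}
```

Keeping the renderer behind a small interface means the part that depends on an external Selenium service stays isolated from the part that feeds the parser.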

On 19 December 2013 14:16, Markus Jelsma <[email protected]> wrote:

> Is that something that could work on a massive scale? If not, I'd prefer a
> JavaScript engine and a DOM environment such as EnvJS that it can run in.
> It is only very unfortunate that EnvJS hasn't been worked on for quite some
> time now.
>
> -----Original message-----
> > From:Julien Nioche <[email protected]>
> > Sent: Thursday 19th December 2013 15:05
> > To: [email protected]
> > Subject: Re: In reference to
> > http://www.mail-archive.com/[email protected]/msg09999.html (Get HTML
> > content generated by Javascript)
> >
> > One option is to write a custom protocol implementation which uses the
> > Selenium API to navigate / resolve the JavaScript and return some byte
> > content for the parser to process. You do indeed need to have a Selenium
> > server running. We used ChromeDriver as a Selenium-compatible server to
> > do some bespoke navigation from a page and that worked fine.
> >
> >
> > On 19 December 2013 14:00, Markus Jelsma <[email protected]> wrote:
> >
> > > From what I understand, Selenium needs to run as a service somewhere
> > > outside MapReduce, which is a problem in itself. Please correct me if
> > > I am wrong. If Selenium could emulate the DOM as just a library, we
> > > could indeed process AJAX websites.
> > >
> > > I once ran tests in Nutch with SpiderMonkey and Rhino but didn't get it
> > > to work at the time. Using SpiderMonkey or another JavaScript engine is
> > > quite easy, but without the DOM we're helpless.
> > >
> > >
> > > -----Original message-----
> > > > From:Lewis John Mcgibbney <[email protected]>
> > > > Sent: Thursday 19th December 2013 14:31
> > > > To: [email protected]
> > > > Subject: Re: In reference to
> > > > http://www.mail-archive.com/[email protected]/msg09999.html (Get HTML
> > > > content generated by Javascript)
> > > >
> > > > Hi Nibal,
> > > >
> > > > On Sun, Dec 15, 2013 at 11:26 PM, <[email protected]> wrote:
> > > >
> > > > >
> > > > > of Single Page Web-apps and JavaScript-only web-applications is
> > > > > sky-rocketing.....well, isn't this a high priority issue????
> > > > >
> > > >
> > > > It would appear not. Core contributors have not got around to
> > > > addressing this particular issue, so it depends on folk providing patches.
> > > >
> > > >
> > > > > If I had the technical knowledge, I would have contributed, but I
> > > > > don't think I have fully got my head around Nutch yet.
> > > > >
> > > >
> > > > That is a real shame. It's always nice to get contributions :)
> > > >
> > > >
> > > > >
> > > > > Note: my small research led me to a lot of Java-based implementations,
> > > > > including Selenium, HttpUnit and CrawlAjax, as alternatives.
> > > > > I was wondering, in case this does not appear to be a high priority,
> > > > > does someone have any guidance to offer regarding this matter?
> > > > >
> > > >
> > > > Personally, no, I don't, but maybe others do.
> > > >
> > > > Lewis
> > > >
> > >
> >
> >
> >
> > --
> >
> > Open Source Solutions for Text Engineering
> >
> > http://digitalpebble.blogspot.com/
> > http://www.digitalpebble.com
> > http://twitter.com/digitalpebble
> >
>



