We got it to work on a few million pages, so in terms of volume, yes. Those pages all belonged to the same site, though.
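
For illustration only, the custom-protocol idea quoted below (let a Selenium-compatible server resolve the JavaScript, then hand the resulting bytes to the parser) could be sketched roughly as follows. This is not the code used for the crawl mentioned above, and it is not Nutch's plugin API: `FetchedContent`, `fetch_rendered` and `selenium_renderer` are hypothetical names, and the Selenium part assumes the third-party `selenium` Python package plus a server such as ChromeDriver already running at the given address.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class FetchedContent:
    """Hypothetical stand-in for what a protocol plugin hands to the parser."""
    url: str
    content: bytes
    content_type: str


def fetch_rendered(url: str, render: Callable[[str], str]) -> FetchedContent:
    """Resolve the page through `render` and return the markup as bytes."""
    html = render(url)  # the renderer is expected to execute the page's scripts
    return FetchedContent(
        url=url,
        content=html.encode("utf-8"),
        content_type="text/html; charset=utf-8",
    )


def selenium_renderer(server_url: str) -> Callable[[str], str]:
    """Build a renderer backed by a remote Selenium-compatible server.

    Assumption: the `selenium` package is installed and a server
    (for example ChromeDriver) is listening at `server_url`.
    """
    def render(url: str) -> str:
        from selenium import webdriver  # imported lazily: optional dependency

        driver = webdriver.Remote(
            command_executor=server_url,
            options=webdriver.ChromeOptions(),
        )
        try:
            driver.get(url)            # load the page; the browser runs its scripts
            return driver.page_source  # DOM serialized at the time of the call
        finally:
            driver.quit()              # always release the browser session
    return render


if __name__ == "__main__":
    # Stub renderer so the sketch can be exercised without any server running.
    stub = lambda u: "<html><body>rendered " + u + "</body></html>"
    print(fetch_rendered("http://example.com/", stub).content)
```

Keeping the renderer behind a plain callable means the browser server stays outside the fetch logic, which matches the constraint discussed below that Selenium runs as a separate service. Pages that load data asynchronously would probably need an explicit wait before `page_source` is read.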

On 19 December 2013 14:16, Markus Jelsma <[email protected]> wrote:

> Is that something that could work on a massive scale? If not, I'd
> prefer a Javascript engine and a DOM environment such as EnvJS in
> which it can run. It is only very unfortunate that EnvJS hasn't been
> worked on for quite some time now.
>
> -----Original message-----
> > From:Julien Nioche <[email protected]>
> > Sent: Thursday 19th December 2013 15:05
> > To: [email protected]
> > Subject: Re: In reference to
> > http://www.mail-archive.com/[email protected]/msg09999.html
> > (Get HTML content generated by Javascript)
> >
> > One option is to write a custom protocol implementation which uses
> > the Selenium API to navigate / resolve the javascript and return
> > some byte content for the parser to process. You do indeed need to
> > have a Selenium server running. We used ChromeDriver as a
> > Selenium-compatible server to do some bespoke navigation from a
> > page and that worked fine.
> >
> > On 19 December 2013 14:00, Markus Jelsma <[email protected]>
> > wrote:
> >
> > > What I understood about Selenium is that it has to run as a
> > > service somewhere outside MapReduce, which is a problem in
> > > itself. Please correct me if I am wrong. If Selenium can emulate
> > > the DOM as just a library we could indeed process AJAX websites.
> > >
> > > I did tests once in Nutch with SpiderMonkey and Rhino but didn't
> > > get it to work that time. Using SpiderMonkey or another
> > > Javascript engine is quite easy but without the DOM we're
> > > helpless.
> > >
> > > -----Original message-----
> > > > From:Lewis John Mcgibbney <[email protected]>
> > > > Sent: Thursday 19th December 2013 14:31
> > > > To: [email protected]
> > > > Subject: Re: In reference to
> > > > http://www.mail-archive.com/[email protected]/msg09999.html
> > > > (Get HTML content generated by Javascript)
> > > >
> > > > Hi Nibal,
> > > >
> > > > On Sun, Dec 15, 2013 at 11:26 PM, <[email protected]> wrote:
> > > >
> > > > > of Single Page Web-apps and JavaScript-only web-applications
> > > > > is sky-rocketing.....well, isn't this a high priority
> > > > > issue????
> > > >
> > > > It would appear not. Unless folk provide patches then core
> > > > contributors have not got around to addressing this particular
> > > > issue.
> > > >
> > > > > If I had the technical knowledge, I would have contributed,
> > > > > but I don't think I have clearly gotten my head around
> > > > > understanding Nutch fully yet.
> > > >
> > > > That is a real shame. It's always nice to get contributions :)
> > > >
> > > > > Note: my small research led me to a lot of Java-based
> > > > > implementations including Selenium, HttpUnit and CrawlAjax
> > > > > being alternatives.
> > > > > I was wondering if in case this does not appear to be a high
> > > > > priority, does someone have any guidance to offer regarding
> > > > > this matter?
> > > >
> > > > Personally, no, I don't, but maybe others do.
> > > >
> > > > Lewis
> >
> > --
> > Open Source Solutions for Text Engineering
> >
> > http://digitalpebble.blogspot.com/
> > http://www.digitalpebble.com
> > http://twitter.com/digitalpebble

--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

