Hi Patrick,

It looks like PhantomJS is Qt/C++ based, so I don't think we can use it from Nutch's HtmlParser implementation. Please correct me again if I am wrong :) I think the solution must either be entirely Java based, or we need a DOM environment written in JavaScript, such as EnvJS, that we can run inside SpiderMonkey together with the page's JavaScript.

Cheers

-----Original message-----
> From: Patrick Kirsch <[email protected]>
> Sent: Thursday 19th December 2013 15:19
> To: [email protected]
> Subject: Re: In reference to http://www.mail-archive.com/[email protected]/msg09999.html (Get HTML content generated by Javascript)
>
> Hey,
> On 19.12.2013 15:00, Markus Jelsma wrote:
> > From what I understood, Selenium has to run as a service somewhere
> > outside MapReduce, which is a problem in itself.
> > Please correct me if I am wrong. If Selenium can emulate the DOM as just a
> > library, we could indeed process AJAX websites.
> Selenium is intended for click automation and for simulating
> (pre-defined) workflows usually done by users (e.g. in a testing process).
> So I'm not sure how this would work.
> Given a random single-page site, it is not (definitely) clear which click
> will produce AJAX/JSON requests that change the DOM significantly.
>
> > I did tests once in Nutch with SpiderMonkey and Rhino but didn't get it
> > to work at that time. Using SpiderMonkey or another JavaScript engine is quite
> > easy, but without the DOM we're helpless.
> Usually I use PhantomJS, did you also try that?
>
> At least Selenium has waitFor() events (e.g. with XPaths or IDs), so it
> is possible to trigger AJAX/JSON events and collect the rendered (HTML)
> result.
>
> > -----Original message-----
> >> From: Lewis John Mcgibbney <[email protected]>
> >> Sent: Thursday 19th December 2013 14:31
> >> To: [email protected]
> >> Subject: Re: In reference to http://www.mail-archive.com/[email protected]/msg09999.html (Get HTML content generated by Javascript)
> >>
> >> Hi Nibal,
> >>
> >> On Sun, Dec 15, 2013 at 11:26 PM, <[email protected]> wrote:
> >>
> >>> of Single Page Web-apps and JavaScript-only web-applications is
> >>> sky-rocketing.....well, isn't this a high priority issue????
> >>
> >> It would appear not. Unless folk provide patches, core contributors
> >> have not got around to addressing this particular issue.
> >>
> >>> If I had the technical knowledge, I would have contributed, but I don't
> >>> think I have clearly gotten my head around understanding
> >>> Nutch fully yet.
> >>
> >> That is a real shame. It's always nice to get contributions :)
> >>
> >>> Note: my small research led me to a lot of Java based implementations,
> >>> including Selenium, HttpUnit and CrawlAjax, being alternatives.
> >>> I was wondering, in case this does not appear to be a high priority,
> >>> does someone have any guidance to offer regarding this matter?
> >>
> >> Personally no, I don't, but maybe others do.
> >>
> >> Lewis
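
To make the underlying problem in this thread concrete, here is a minimal sketch of why a purely static parse misses JavaScript-generated content. It is not Nutch code: the page, the `StaticTextExtractor` class and the `extract_static_text` helper are all hypothetical, and Python's standard `html.parser` merely stands in for any parser that reads the markup without running scripts or building a live DOM.

```python
from html.parser import HTMLParser

# Hypothetical page: the paragraph only exists after the script has run
# against a real DOM. A static parser never sees it.
PAGE = """<html><body>
<h1>Static title</h1>
<div id="content"></div>
<script>
document.getElementById("content").innerHTML = "<p>Loaded via JavaScript</p>";
</script>
</body></html>"""


class StaticTextExtractor(HTMLParser):
    """Collects visible text from the raw markup, ignoring <script> bodies."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        # Script source is delivered as data too, so skip it explicitly.
        text = data.strip()
        if text and not self.in_script:
            self.chunks.append(text)


def extract_static_text(html):
    """Return the text a script-unaware parser would index for this page."""
    parser = StaticTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


if __name__ == "__main__":
    print(extract_static_text(PAGE))
```

Only the heading is extracted; the paragraph the script would insert never appears, because nothing executed the script and there was no `document` object for it to modify. That is the gap a DOM emulation (or a full browser) would have to fill.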

