Hi Patrick,

Looks like PhantomJS is Qt/C++ based, which is not something I think we can use 
from Nutch's HtmlParser implementation. Please correct me again if I am wrong :) 
I think it must be entirely Java based, or we need a DOM environment written in 
JavaScript, such as EnvJS, that we can run inside SpiderMonkey together with the 
page's JavaScript.

Cheers

 
 
-----Original message-----
> From:Patrick Kirsch <[email protected]>
> Sent: Thursday 19th December 2013 15:19
> To: [email protected]
> Subject: Re: In reference to 
> http://www.mail-archive.com/[email protected]/msg09999.html (Get HTML 
> content generated by Javascript)
> 
> Hey,
> Am 19.12.2013 15:00, schrieb Markus Jelsma:
> >  From what I understood about Selenium, it needs to run 
> > as a service somewhere outside MapReduce, which is a problem in itself. 
> > Please correct me if I am wrong. If Selenium can emulate the DOM as just a 
> > library we could indeed process AJAX websites.
> Selenium is intended for click automation and for simulating 
> (pre-defined) workflows usually done by users (e.g. in a testing process).
> So I'm not sure how this will work.
> Given a random single-page site, it is not definitely clear which click 
> will produce AJAX/JSON requests that result in significant changes to the DOM.
> 
> >
> > I once ran tests in Nutch with SpiderMonkey and Rhino but didn't get it 
> > to work at that time. Using SpiderMonkey or another JavaScript engine is quite 
> > easy, but without the DOM we're helpless.
> Usually I use PhantomJS; did you also try that?
> 
> At least Selenium has waitFor() events (e.g. with XPATHs or IDs), so it 
> is possible to trigger ajax/json events and collect the rendered (html) 
> result.
> >
> >
> > -----Original message-----
> >> From:Lewis John Mcgibbney <[email protected]>
> >> Sent: Thursday 19th December 2013 14:31
> >> To: [email protected]
> >> Subject: Re: In reference to 
> >> http://www.mail-archive.com/[email protected]/msg09999.html (Get HTML 
> >> content generated by Javascript)
> >>
> >> Hi Nibal,
> >>
> >> On Sun, Dec 15, 2013 at 11:26 PM, <[email protected]> 
> >> wrote:
> >>
> >>>
> >>> of Single Page Web-apps and JavaScript-only web-applications is
> >>> sky-rocketing.....well, isn't this a high priority issue????
> >>>
> >>
> >> It would appear not. Short of folks providing patches, core contributors
> >> have not got around to addressing this particular issue.
> >>
> >>
> >>> If I had the technical knowledge, I would have contributed, but I don't
> >>> think I have clearly gotten my head around understanding
> >>> Nutch fully yet.
> >>>
> >>
> >> That is a real shame. It's always nice to get contributions :)
> >>
> >>
> >>>
> >>> Note: my small research led me to a lot of Java based implementations,
> >>> including Selenium, HttpUnit and CrawlAjax, as alternatives.
> >>> I was wondering, in case this does not appear to be a high priority, 
> >>> does someone have any guidance to offer regarding this matter?
> >>>
> >>
> >> Personally, no, I don't, but maybe others do.
> >>
> >> Lewis
> >>
> >
> 
> 
