msg09999.html (Get HTML content generated by Javascript)

Markus Jelsma Thu, 19 Dec 2013 06:44:20 -0800
All I did with Rhino was try to get it up and running inside a ParseFilter 
plugin. I did not succeed at the time and didn't try again. The Rhino website 
is quite confusing. I should work on it again some time. 
 
-----Original message-----
> From:Patrick Kirsch <[email protected]>
> Sent: Thursday 19th December 2013 15:39
> To: [email protected]
> Subject: Re: In reference to 
> http://www.mail-archive.com/[email protected]/msg09999.html (Get HTML 
> content generated by Javascript)
> 
> Hey Markus,
> 
> On 19.12.2013 15:25, Markus Jelsma wrote:
> > It looks like PhantomJS is Qt/C++ based, which I don't think we can use 
> > from Nutch's HtmlParser implementation. Please correct me again if I am 
> > wrong :) I think it must be entirely Java based, or we need a DOM 
> > environment written in JavaScript, such as EnvJS, that we can run inside 
> > SpiderMonkey together with the page's JavaScript.
> Rhino is Java based. Did you try it, and with what results?
> Can you share that experience?
> 
> 
> >
> > Cheers
> >
> >
> Regards
> >
> > -----Original message-----
> >> From:Patrick Kirsch <[email protected]>
> >> Sent: Thursday 19th December 2013 15:19
> >> To: [email protected]
> >> Subject: Re: In reference to 
> >> http://www.mail-archive.com/[email protected]/msg09999.html (Get HTML 
> >> content generated by Javascript)
> >>
> >> Hey,
> >> On 19.12.2013 15:00, Markus Jelsma wrote:
> >>> From what I understand, Selenium has to run as a service somewhere 
> >>> outside MapReduce, which is a problem in itself. Please correct me if 
> >>> I am wrong. If Selenium can emulate the DOM as just a library, we 
> >>> could indeed process AJAX websites.
> >> Selenium is intended for click automation and for simulating
> >> (pre-defined) workflows usually done by users (e.g. in a testing process),
> >> so I'm not sure how this would work.
> >> Given a random single-page site, it is not clear in advance which click
> >> will produce AJAX/JSON requests that result in significant changes to
> >> the DOM.
> >>
> >>>
> >>> I once ran tests in Nutch with SpiderMonkey and Rhino but didn't get 
> >>> them to work at the time. Using SpiderMonkey or another JavaScript engine 
> >>> is quite easy, but without the DOM we're helpless.
> >> Usually I use PhantomJS. Did you also try that?
> >>
> >> At least Selenium has waitFor() events (e.g. with XPaths or IDs), so it
> >> is possible to trigger AJAX/JSON events and collect the rendered (HTML)
> >> result.
> >>>
> >>>
> >>> -----Original message-----
> >>>> From:Lewis John Mcgibbney <[email protected]>
> >>>> Sent: Thursday 19th December 2013 14:31
> >>>> To: [email protected]
> >>>> Subject: Re: In reference to 
> >>>> http://www.mail-archive.com/[email protected]/msg09999.html (Get 
> >>>> HTML content generated by Javascript)
> >>>>
> >>>> Hi Nibal,
> >>>>
> >>>> On Sun, Dec 15, 2013 at 11:26 PM, <[email protected]> 
> >>>> wrote:
> >>>>
> >>>>>
> >>>>> of Single Page Web-apps and JavaScript-only web-applications is
> >>>>> sky-rocketing.....well, isn't this a high priority issue????
> >>>>>
> >>>>
> >>>> It would appear not. Core contributors have not got around to
> >>>> addressing this particular issue, so it depends on folks providing patches.
> >>>>
> >>>>
> >>>>> If I had the technical knowledge, I would have contributed, but I don't
> >>>>> think I have fully gotten my head around Nutch yet.
> >>>>>
> >>>>
> >>>> That is a real shame. It's always nice to get contributions :)
> >>>>
> >>>>
> >>>>>
> >>>>> Note: my small research led me to a lot of Java based implementations,
> >>>>> including Selenium, HttpUnit and CrawlAjax, as alternatives.
> >>>>> I was wondering: in case this does not appear to be a high priority,
> >>>>> does someone have any guidance to offer regarding this matter?
> >>>>>
> >>>>
> >>>> Personally, no, I don't, but maybe others do.
> >>>>
> >>>> Lewis
> >>>>
> >>>
> >>
> >>
> >
> 
>