Hey Markus,
On 19.12.2013 15:25, Markus Jelsma wrote:
Looks like PhantomJS is Qt/C++ based; I don't think that is something we can
use from Nutch's HtmlParser implementation. Please correct me again if I am
wrong :) I think it must be entirely Java based, or we need a DOM environment
written in JavaScript, such as EnvJS, that we can run inside SpiderMonkey
together with the page's JavaScript.
Rhino is Java based; did you try it, and with what results?
Can you share that experience?
Cheers
Regards
-----Original message-----
From: Patrick Kirsch <[email protected]>
Sent: Thursday 19th December 2013 15:19
To: [email protected]
Subject: Re: In reference to
http://www.mail-archive.com/[email protected]/msg09999.html (Get HTML
content generated by Javascript)
Hey,
On 19.12.2013 15:00, Markus Jelsma wrote:
From what I understand, Selenium has to run as a service somewhere outside
MapReduce, which is a problem in itself. Please correct me if I am wrong. If
Selenium can emulate the DOM as just a library, we could indeed process AJAX
websites.
Selenium is intended for click automation and for simulating
(pre-defined) workflows usually performed by users (e.g. in testing).
So I'm not sure how this would work.
Given a random single-page site, it is not (definitely) clear which click
will produce AJAX/JSON requests that result in significant DOM changes.
I once ran tests in Nutch with SpiderMonkey and Rhino but didn't get it to
work at that time. Using SpiderMonkey or another JavaScript engine is quite
easy, but without the DOM we're helpless.
Usually I use PhantomJS; did you also try that?
At least Selenium has waitFor() events (e.g. with XPaths or IDs), so it
is possible to trigger AJAX/JSON events and collect the rendered (HTML)
result.
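The wait-then-collect idea can be sketched roughly as a polling loop. This is
plain Java and not Selenium's actual API: the class name WaitFor, the helper
waitFor(), and the stand-in condition are all hypothetical and only illustrate
the pattern of polling until the rendered content shows up or a timeout hits.

```java
import java.util.function.Supplier;

public class WaitFor {
    // Hypothetical helper (not part of Selenium): polls `condition` until it
    // returns a non-null value or `timeoutMillis` has elapsed.
    public static <T> T waitFor(Supplier<T> condition, long timeoutMillis, long pollMillis)
            throws InterruptedException {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            T value = condition.get();
            if (value != null) {
                return value; // condition met, hand back the collected result
            }
            if (System.nanoTime() >= deadline) {
                throw new IllegalStateException(
                        "condition not met within " + timeoutMillis + " ms");
            }
            Thread.sleep(pollMillis); // back off before checking again
        }
    }

    public static void main(String[] args) throws InterruptedException {
        long start = System.currentTimeMillis();
        // Stand-in for "the AJAX call has filled in the element": the supplier
        // yields the rendered markup only after roughly 200 ms.
        String html = waitFor(
                () -> System.currentTimeMillis() - start >= 200
                        ? "<div id=\"result\">loaded</div>"
                        : null,
                2000, 50);
        System.out.println(html);
    }
}
```

In a real setup the supplier would ask the browser for the element (by XPath
or ID) instead of checking the clock.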
-----Original message-----
From: Lewis John Mcgibbney <[email protected]>
Sent: Thursday 19th December 2013 14:31
To: [email protected]
Subject: Re: In reference to
http://www.mail-archive.com/[email protected]/msg09999.html (Get HTML
content generated by Javascript)
Hi Nibal,
On Sun, Dec 15, 2013 at 11:26 PM, <[email protected]> wrote:
of Single Page Web-apps and JavaScript-only web-applications is
sky-rocketing.....well, isn't this a high priority issue????
It would appear not. Unless folk provide patches, it will stay open, as the
core contributors have not got around to addressing this particular issue.
If I had the technical knowledge, I would have contributed, but I don't
think I have fully gotten my head around Nutch yet.
That is a real shame. It's always nice to get contributions :)
Note: my brief research led me to a number of Java-based implementations,
including Selenium, HttpUnit and CrawlAjax, as alternatives.
If this does not appear to be a high priority, does anyone have any guidance
to offer regarding this matter?
Personally, no, I don't, but maybe others do.
Lewis