We did this using Selenium: we turned off the protocol-http plugin and used a custom protocol-selenium plugin, with 'http' bound to it.
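In nutch-site.xml terms, the swap looks roughly like the fragment below. Treat it as a sketch rather than an exact configuration: `plugin.includes` is the standard Nutch property, the plugin id `protocol-selenium` is assumed to match the custom plugin named above, and the rest of the plugin list is only illustrative. The binding of 'http' to the plugin is declared in the plugin's own plugin.xml, which is not shown here.

```xml
<!-- nutch-site.xml: overrides the default plugin list from nutch-default.xml -->
<property>
  <name>plugin.includes</name>
  <!-- protocol-http is left out; the custom protocol-selenium takes its place.
       The remaining plugin ids are assumptions for illustration only. -->
  <value>protocol-selenium|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```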
The simple way is to let the page render and take the entire text, which in Nutch terminology becomes the ParseText.

On Sat, Jun 22, 2013 at 1:28 AM, Julien Nioche <[email protected]> wrote:

> One way around this is to have a custom protocol implementation and get it
> to fetch via Selenium
>
> J.
>
> On 21 June 2013 19:54, Lewis John Mcgibbney <[email protected]> wrote:
>
> > Hi,
> > Nearly all of this page is generated by JS right?
> > Right now my answer is no. We fetch then parse page source... which in this
> > case is mostly all JS. The magic happens in the browser.
> > ...
> > Lewis
> >
> > On Tue, Jun 18, 2013 at 10:59 PM, Deals Collect <[email protected]> wrote:
> >
> > > Hi all,
> > >
> > > Can Nutch get the HTML content generated by Javascript? For example, this
> > > job site
> > >
> > > https://schneiderele.taleo.net/careersection/2/jobdetail.ftl?job=72522&lang=en
> > >
> > > Many thanks,
> >
> > --
> > *Lewis*
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble

