Hi Lewis,

Thanks for your email. Following the links you provided, I tried all the 
options without success before reaching out to you again.

I am trying to crawl websites whose content is rendered at runtime, so that I 
can extract and parse it.
I downloaded the Nutch version referred to in the email below and added 
protocol-interactiveselenium and protocol-selenium along with 
protocol-httpclient (as the websites use https).
Selenium with Firefox is configured and working properly, and Selenium is 
running while the crawl is in progress.
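
A setup like the one described is usually expressed in nutch-site.xml roughly 
as follows. This is only a sketch: apart from the three protocol plugins named 
above, the rest of the plugin list and the selenium.driver value are 
assumptions for illustration, and the attached nutch-site.xml remains the 
actual configuration.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Assumed plugin list: only the three protocol-* entries come from the
       description above; the remaining entries are illustrative. -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-httpclient|protocol-selenium|protocol-interactiveselenium|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>

  <!-- Assumed value: selects the Firefox WebDriver for the Selenium plugins. -->
  <property>
    <name>selenium.driver</name>
    <value>firefox</value>
  </property>
</configuration>
```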

However, I am still not able to get the rendered content. I have attached my 
nutch-site.xml for reference, in case any configuration is missing.

I am wondering whether the issue is that the Tika parsers are not able to parse 
the fetched runtime-rendered content, or that Solr (I am using Apache Solr to 
index the parsed data) does not have an indexed field to hold the data (the 
schema has content, url, title and id).

Any input to help resolve the issue would be really appreciated.

Environment: CentOS-7
Firefox: 60.3 oesr (64 bit)
Selenium: v3.4.0
Geckodriver: 0.23.0 (2018-10-04)
Apache Nutch: 1.x

Thanks & Regards
Venkata MR
+91 98455 77125

-----Original Message-----
From: Lewis John McGibbney <lewi...@apache.org> 
Sent: 09 December 2018 02:11
To: user@nutch.apache.org
Subject: Re: Apache Nutch 2.3.1 not able to fetch content rendered by ajax

Hi Venkata,
This functionality is not available in 2.X at the moment.
The functionality is available in the 1.x primary branch. You can learn about 
the implementation at both

https://github.com/apache/nutch/tree/master/src/plugin/protocol-selenium,
 and

https://github.com/apache/nutch/tree/master/src/plugin/protocol-interactiveselenium

Lewis

On 2018/12/07 07:03:30, Venkata MR <venkata...@hcl.com> wrote: 
> Hi,
> 
> I was trying to fetch content rendered by an Ajax call using Apache Nutch 
> 2.3.1.
> It seems it is not able to get the actual rendered content; it only gets the 
> view-source page (as part of the protocol-js plugin).
> Has anyone been able to fetch the rendered content from an Ajax call using 
> Nutch 2.3.1, or does anyone have any suggestions?
> 
> Thanks & Regards
> Venkata MR
> +91 98455 77125

Attachment: nutch-site.xml
