Hi,

> protocol-httpclient (as the websites are with https).

With Nutch 1.15, protocol-selenium supports https. If protocol-httpclient
is also active, it may be used instead of protocol-selenium. There is
no need to activate protocol-httpclient; the description in
nutch-default.xml needs to be fixed, see
https://issues.apache.org/jira/browse/NUTCH-2678

Note that protocol-interactiveselenium will support https in 1.16.
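
A minimal sketch of what this could look like in nutch-site.xml, assuming the
other plugins roughly follow the stock plugin.includes value (that part of the
list is an assumption; only the protocol entry matters here):

```xml
<!-- Sketch only: plugin.includes with protocol-selenium as the sole protocol
     plugin. protocol-httpclient is left out on purpose so that it cannot be
     picked instead of protocol-selenium for https URLs. The non-protocol
     plugins listed are an assumption; keep whatever your crawl already uses. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-selenium|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```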

Best,
Sebastian

On 12/16/18 1:40 PM, Venkata MR wrote:
> Hi Lewis,
> 
> Thanks for your email. Referring to the links you provided, I tried all 
> the options without success before reaching out to you again.
> 
> I am trying to crawl websites whose content is rendered at runtime, and to 
> extract and parse that content.
> I downloaded the Nutch mentioned in the email below and added 
> protocol-interactiveselenium, protocol-selenium along with 
> protocol-httpclient (as the websites are with https).
> Selenium with Firefox is configured and working properly, and Selenium is 
> running while the crawl is performed.
> 
> Yet I am not able to get the rendered content. I have attached the 
> nutch-site.xml for reference, in case any configuration is missing.
> 
> I am wondering whether the issue is that the Tika parsers are not able to 
> parse the extracted runtime-rendered content, or that Solr (I am using 
> Apache Solr to index the parsed data) does not have an indexed field to 
> represent the data (the schema has content, url, title and id).
> 
> Any input to help resolve the issue would be much appreciated.
> 
> Environment: CentOS-7
> Firefox: 60.3.0esr (64-bit)
> Selenium : v3.4.0
> Geckodriver: 0.23.0 ( 2018-10-04)
> Apache Nutch: 1.x
> 
> Thanks & Regards
> Venkata MR
> +91 98455 77125
> 
> -----Original Message-----
> From: Lewis John McGibbney <lewi...@apache.org> 
> Sent: 09 December 2018 02:11
> To: user@nutch.apache.org
> Subject: Re: Apache Nutch 2.3.1 not able to fetch content rendered by ajax
> 
> Hi Venkata,
> This functionality is not available in 2.X at the moment.
> The functionality is available in the 1.x primary branch. You can learn about 
> the implementation at both
> 
> https://github.com/apache/nutch/tree/master/src/plugin/protocol-selenium
>  and
> 
> https://github.com/apache/nutch/tree/master/src/plugin/protocol-interactiveselenium
> 
> Lewis
> 
> On 2018/12/07 07:03:30, Venkata MR <venkata...@hcl.com> wrote: 
>> Hi,
>>
>> I was trying to fetch content rendered by an Ajax call using Apache Nutch 
>> 2.3.1.
>> It seems unable to get the actual rendered content; it only gets the 
>> view-source page (as part of the protocol-js plugin).
>> Has anyone been able to fetch the rendered content from an Ajax call using 
>> Nutch 2.3.1, or does anyone have any suggestions?
>>
>> Thanks & Regards
>> Venkata MR
>> +91 98455 77125
>>
