[jira] [Commented] (NUTCH-2676) Update to the latest selenium and add code to use chrome and firefox headless mode with the remote web driver

2018-12-10 Thread Sebastian Nagel (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16714510#comment-16714510
 ] 

Sebastian Nagel commented on NUTCH-2676:


[~virt], thanks for the update. There is already an option to [white list 
hosts|https://wiki.apache.org/nutch/WhiteListRobots/] (NUTCH-1927). After a 
longer discussion we agreed on this approach: it makes it easy to ignore 
robots.txt for a list of hosts where you are allowed to do so, but anybody 
who wants to ignore the robots.txt standard in general would still have to 
change the source code. It's implemented in lib-http and should be available 
for protocol-selenium as well (but I have never tested it with that plugin).
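
As a rough illustration of the white list idea described above, a minimal 
Java sketch might look like the following. The class name RobotsWhitelist 
and the method skipsRobotsCheck are hypothetical and are not taken from 
lib-http; the sketch only assumes that the white list is a comma-separated 
list of host names and that matching is done on the exact host.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

/** Hypothetical helper, not the lib-http implementation. */
public class RobotsWhitelist {
  private final Set<String> hosts;

  /** Parses a comma-separated list of host names (assumed input format). */
  public RobotsWhitelist(String commaSeparatedHosts) {
    hosts = Arrays.stream(commaSeparatedHosts.split(","))
        .map(h -> h.trim().toLowerCase(Locale.ROOT))
        .filter(h -> !h.isEmpty())
        .collect(Collectors.toSet());
  }

  /** Returns true only if the URL's host is on the white list. */
  public boolean skipsRobotsCheck(String url) {
    String host = URI.create(url).getHost();
    return host != null && hosts.contains(host.toLowerCase(Locale.ROOT));
  }

  public static void main(String[] args) {
    RobotsWhitelist whitelist = new RobotsWhitelist("intranet.example.org, docs.example.org");
    // Host on the white list: robots.txt rules would be skipped.
    System.out.println(whitelist.skipsRobotsCheck("http://docs.example.org/index.html"));
    // Any other host: robots.txt rules still apply.
    System.out.println(whitelist.skipsRobotsCheck("http://www.example.com/"));
  }
}
```

Every host that is not listed keeps the normal robots.txt handling, which is 
the point of the compromise: there is no switch that disables the check 
globally.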

> Update to the latest selenium and add code to use chrome and firefox headless 
> mode with the remote web driver
> -----------------------------------------------------------------------------
>
> Key: NUTCH-2676
> URL: https://issues.apache.org/jira/browse/NUTCH-2676
> Project: Nutch
> Issue Type: New Feature
> Components: protocol
> Affects Versions: 1.15
> Reporter: Stas Batururimi
> Priority: Major
> Fix For: 1.16
>
> Attachments: Screenshot 2018-11-16 at 18.15.52.png
>
>
> * Selenium needs to be updated
> * missing remote web driver for chrome
> * necessity to add headless mode for both remote WebDriverBase Firefox & 
> Chrome
> * use case with Selenium grid using docker (1 hub docker container, several 
> nodes in different docker containers, Nutch in another docker container, 
> streaming to Apache Solr in docker container, that is at least 4 different 
> docker containers)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-2676) Update to the latest selenium and add code to use chrome and firefox headless mode with the remote web driver

2018-12-10 Thread Stas Batururimi (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16714531#comment-16714531
 ] 

Stas Batururimi commented on NUTCH-2676:


I have already made a patch for this in the source code of 
FetcherThread.java for our needs, so I could push it some time later in a 
separate issue if necessary. Let me know what you think.



[jira] [Created] (NUTCH-2678) Allow for per-host configurable protocol plugin

2018-12-10 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-2678:


             Summary: Allow for per-host configurable protocol plugin
                 Key: NUTCH-2678
                 URL: https://issues.apache.org/jira/browse/NUTCH-2678
             Project: Nutch
          Issue Type: Improvement
          Components: protocol
    Affects Versions: 1.15
            Reporter: Markus Jelsma
            Assignee: Markus Jelsma
             Fix For: 1.16
         Attachments: NUTCH-2678.patch

Introduces a new parameter for protocol plugins called host. It takes a 
comma-separated set of host names. Protocols are resolved by host name first, 
then by protocol, as is done now.

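{code}
<extension ...
           name="HttpProtocol"
           point="org.apache.nutch.protocol.Protocol">
  <implementation ...
                  class="org.apache.nutch.protocol.http.Http">
    ...
  </implementation>
</extension>
{code}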
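
The lookup order described above (host name first, then protocol) could be 
sketched in Java roughly as follows. This is only an illustration under 
stated assumptions, not the code from NUTCH-2678.patch: the class 
ProtocolResolver, its methods, and the example host names are hypothetical, 
and plugins are represented by plain id strings.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical resolver: host-specific mappings win over scheme mappings. */
public class ProtocolResolver {
  private final Map<String, String> byHost = new HashMap<>();
  private final Map<String, String> byScheme = new HashMap<>();

  /** Registers a plugin id for a scheme and/or a comma-separated host list (either may be null). */
  public void register(String pluginId, String scheme, String hosts) {
    if (scheme != null && !scheme.isEmpty()) {
      byScheme.put(scheme.toLowerCase(Locale.ROOT), pluginId);
    }
    if (hosts != null) {
      for (String host : hosts.split(",")) {
        String h = host.trim().toLowerCase(Locale.ROOT);
        if (!h.isEmpty()) {
          byHost.put(h, pluginId);
        }
      }
    }
  }

  /** Returns the plugin id for the URL, or null if neither host nor scheme matches. */
  public String resolve(String url) {
    URI uri = URI.create(url);
    String host = uri.getHost();
    if (host != null) {
      String plugin = byHost.get(host.toLowerCase(Locale.ROOT));
      if (plugin != null) {
        return plugin; // a host-specific mapping takes precedence
      }
    }
    String scheme = uri.getScheme();
    return scheme == null ? null : byScheme.get(scheme.toLowerCase(Locale.ROOT));
  }

  public static void main(String[] args) {
    ProtocolResolver resolver = new ProtocolResolver();
    resolver.register("protocol-http", "http", null);
    resolver.register("protocol-selenium", null, "js.example.org, app.example.org");
    System.out.println(resolver.resolve("http://js.example.org/page"));    // protocol-selenium
    System.out.println(resolver.resolve("http://plain.example.org/page")); // protocol-http
  }
}
```

With a fallback like this, a crawl can keep a lightweight plugin as the 
default for a protocol and route only selected hosts to a heavier one.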






[jira] [Updated] (NUTCH-2678) Allow for per-host configurable protocol plugin

2018-12-10 Thread Markus Jelsma (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2678:
-
Attachment: NUTCH-2678.patch

> Allow for per-host configurable protocol plugin
> -----------------------------------------------
>
> Key: NUTCH-2678
> URL: https://issues.apache.org/jira/browse/NUTCH-2678
> Project: Nutch
> Issue Type: Improvement
> Components: protocol
> Affects Versions: 1.15
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Major
> Fix For: 1.16
>
> Attachments: NUTCH-2678.patch
>
>
> Introduces a new parameter for protocol plugins called host. It takes a 
> comma-separated set of host names. Protocols are resolved by host name 
> first, then by protocol, as is done now.
> {code}
> <extension ...
>            name="HttpProtocol"
>            point="org.apache.nutch.protocol.Protocol">
>   <implementation ...
>                   class="org.apache.nutch.protocol.http.Http">
>     ...
>   </implementation>
> </extension>
> {code}


