[jira] [Commented] (NUTCH-2676) Update to the latest selenium and add code to use chrome and firefox headless mode with the remote web driver
    [ https://issues.apache.org/jira/browse/NUTCH-2676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16714510#comment-16714510 ]

Sebastian Nagel commented on NUTCH-2676:

[~virt], thanks for the update.

There is already an option to [white list hosts|https://wiki.apache.org/nutch/WhiteListRobots/] (NUTCH-1927). After a longer discussion we agreed on this approach: it makes it easy to ignore robots.txt for a list of hosts where you are allowed to, but ignoring the robots.txt standard in general would still require a change in the source code. It's implemented in lib-http and should be available for protocol-selenium as well (but I have never tested it there).

> Update to the latest selenium and add code to use chrome and firefox headless
> mode with the remote web driver
> -
>
>                 Key: NUTCH-2676
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2676
>             Project: Nutch
>          Issue Type: New Feature
>          Components: protocol
>    Affects Versions: 1.15
>            Reporter: Stas Batururimi
>            Priority: Major
>             Fix For: 1.16
>
>         Attachments: Screenshot 2018-11-16 at 18.15.52.png
>
>
> * Selenium needs to be updated
> * missing remote web driver for chrome
> * necessity to add headless mode for both remote WebDriverBase Firefox &
> Chrome
> * use case with Selenium grid using docker (1 hub docker container, several
> nodes in different docker containers, Nutch in another docker container,
> streaming to Apache Solr in docker container, that is at least 4 different
> docker containers)

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Commented] (NUTCH-2676) Update to the latest selenium and add code to use chrome and firefox headless mode with the remote web driver
    [ https://issues.apache.org/jira/browse/NUTCH-2676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16714531#comment-16714531 ]

Stas Batururimi commented on NUTCH-2676:

I have already made a patch for this in the source code of FetcherThread.java for our needs. I could push it some time later in a separate issue if necessary. Let me know what you think.
[jira] [Created] (NUTCH-2678) Allow for per-host configurable protocol plugin
Markus Jelsma created NUTCH-2678:

             Summary: Allow for per-host configurable protocol plugin
                 Key: NUTCH-2678
                 URL: https://issues.apache.org/jira/browse/NUTCH-2678
             Project: Nutch
          Issue Type: Improvement
          Components: protocol
    Affects Versions: 1.15
            Reporter: Markus Jelsma
            Assignee: Markus Jelsma
             Fix For: 1.16
         Attachments: NUTCH-2678.patch

Introduces a new parameter for protocol plugins called host. It takes a comma-separated set of host names. Protocols are resolved by host name first, then by protocol, as is done now.

{code}
{code}
[jira] [Updated] (NUTCH-2678) Allow for per-host configurable protocol plugin
    [ https://issues.apache.org/jira/browse/NUTCH-2678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-2678:
    Attachment: NUTCH-2678.patch

> Allow for per-host configurable protocol plugin
> ---
>
>                 Key: NUTCH-2678
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2678
>             Project: Nutch
>          Issue Type: Improvement
>          Components: protocol
>    Affects Versions: 1.15
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Major
>             Fix For: 1.16
>
>         Attachments: NUTCH-2678.patch
>
>
> Introduces new parameter for protocol plugins called host. It takes a comma
> separated set of host names. Protocols are resolved by hostname first, then
> by protocol as it is now.
> {code}
> name="HttpProtocol"
> point="org.apache.nutch.protocol.Protocol">
> class="org.apache.nutch.protocol.http.Http">
>
>
>
>
> {code}
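
The host-first lookup described in NUTCH-2678 can be sketched roughly as follows. This is only a minimal illustration and not the code from NUTCH-2678.patch: the class HostFirstProtocolResolver, its methods, and the example.org host names are hypothetical, and plugin implementations are represented by their class names as plain strings instead of being loaded through the Nutch plugin repository.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

/**
 * Hypothetical sketch of "resolve by host name first, then by protocol".
 * Not taken from NUTCH-2678.patch; all names here are assumptions.
 */
public class HostFirstProtocolResolver {
    // host name (lower case) -> plugin class name
    private final Map<String, String> byHost = new HashMap<>();
    // URL scheme (lower case) -> plugin class name
    private final Map<String, String> byScheme = new HashMap<>();

    /** Registers a plugin for a comma-separated set of host names. */
    public void registerHosts(String commaSeparatedHosts, String pluginClass) {
        for (String host : commaSeparatedHosts.split(",")) {
            String normalized = host.trim().toLowerCase(Locale.ROOT);
            if (!normalized.isEmpty()) {
                byHost.put(normalized, pluginClass);
            }
        }
    }

    /** Registers the fallback plugin for a URL scheme such as "http". */
    public void registerScheme(String scheme, String pluginClass) {
        byScheme.put(scheme.toLowerCase(Locale.ROOT), pluginClass);
    }

    /** Looks up the plugin by host first and falls back to the scheme. */
    public Optional<String> resolve(String url) {
        URI uri = URI.create(url);
        String host = uri.getHost();
        if (host != null) {
            String hostPlugin = byHost.get(host.toLowerCase(Locale.ROOT));
            if (hostPlugin != null) {
                return Optional.of(hostPlugin);
            }
        }
        String scheme = uri.getScheme();
        if (scheme == null) {
            return Optional.empty();
        }
        return Optional.ofNullable(byScheme.get(scheme.toLowerCase(Locale.ROOT)));
    }

    public static void main(String[] args) {
        HostFirstProtocolResolver resolver = new HostFirstProtocolResolver();
        resolver.registerScheme("http", "org.apache.nutch.protocol.http.Http");
        // Hypothetical hosts that should be fetched with a different plugin.
        resolver.registerHosts("js.example.org, app.example.org",
                "org.apache.nutch.protocol.selenium.Http");

        System.out.println(resolver.resolve("http://js.example.org/page"));
        System.out.println(resolver.resolve("http://static.example.org/page"));
    }
}
```

With this arrangement, a host listed for a plugin always wins over the scheme-level default, and URLs for unlisted hosts behave exactly as before, which matches the "by hostname first, then by protocol as it is now" wording of the issue.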