[jira] [Commented] (NUTCH-2789) Documentation: update links to point to cwiki

2020-06-10 Thread Hudson (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17132676#comment-17132676 ] Hudson commented on NUTCH-2789: --- SUCCESS: Integrated in Jenkins build Nutch-trunk #3686 (See

[jira] [Commented] (NUTCH-2788) ParseData: improve presentation of Metadata in method toString()

2020-06-10 Thread Hudson (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17132637#comment-17132637 ] Hudson commented on NUTCH-2788: --- SUCCESS: Integrated in Jenkins build Nutch-trunk #3685 (See

[jira] [Commented] (NUTCH-2787) CrawlDb JSON dump does not export metadata primitive data types correctly

2020-06-10 Thread Hudson (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17132638#comment-17132638 ] Hudson commented on NUTCH-2787: --- SUCCESS: Integrated in Jenkins build Nutch-trunk #3685 (See

[jira] [Commented] (NUTCH-2790) CSVIndexWriter does not escape leading quotes properly

2020-06-10 Thread Hudson (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17132639#comment-17132639 ] Hudson commented on NUTCH-2790: --- SUCCESS: Integrated in Jenkins build Nutch-trunk #3685 (See

[jira] [Updated] (NUTCH-2791) domainstats, protocolstats and crawlcomplete do not handle GCS URLs

2020-06-10 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2791: --- Affects Version/s: (was: 1.17) 1.16 > domainstats, protocolstats

[jira] [Updated] (NUTCH-2791) domainstats, protocolstats and crawlcomplete do not handle GCS URLs

2020-06-10 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2791: --- Fix Version/s: 1.17 > domainstats, protocolstats and crawlcomplete do not handle GCS URLs >

[jira] [Resolved] (NUTCH-2789) Documentation: update links to point to cwiki

2020-06-10 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2789. Resolution: Fixed > Documentation: update links to point to cwiki >

[jira] [Resolved] (NUTCH-2788) ParseData: improve presentation of Metadata in method toString()

2020-06-10 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2788. Resolution: Implemented Committed/merged for 1.17 - thanks for the reviews! > ParseData:

[GitHub] [nutch] sebastian-nagel merged pull request #529: NUTCH-2788 ParseData: improve presentation of Metadata in method toString()

2020-06-10 Thread GitBox
sebastian-nagel merged pull request #529: URL: https://github.com/apache/nutch/pull/529 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to

[jira] [Commented] (NUTCH-2788) ParseData: improve presentation of Metadata in method toString()

2020-06-10 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17132621#comment-17132621 ] ASF GitHub Bot commented on NUTCH-2788: --- sebastian-nagel merged pull request #529: URL:

[jira] [Updated] (NUTCH-2788) ParseData: improve presentation of Metadata in method toString()

2020-06-10 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2788: --- Fix Version/s: (was: 1.18) 1.17 > ParseData: improve presentation of

[jira] [Resolved] (NUTCH-2787) CrawlDb JSON dump does not export metadata primitive data types correctly

2020-06-10 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2787. Resolution: Fixed Fixed/merged. Thanks for the reviews! > CrawlDb JSON dump does not

[jira] [Resolved] (NUTCH-2790) CSVIndexWriter does not escape leading quotes properly

2020-06-10 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2790. Resolution: Fixed Thanks, [~pmezard]! > CSVIndexWriter does not escape leading quotes

[jira] [Commented] (NUTCH-2790) CSVIndexWriter does not escape leading quotes properly

2020-06-10 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17132603#comment-17132603 ] ASF GitHub Bot commented on NUTCH-2790: --- sebastian-nagel merged pull request #532: URL:

[jira] [Updated] (NUTCH-2790) CSVIndexWriter does not escape leading quotes properly

2020-06-10 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2790: --- Fix Version/s: 1.17 > CSVIndexWriter does not escape leading quotes properly >

[jira] [Updated] (NUTCH-2790) CSVIndexWriter does not escape leading quotes properly

2020-06-10 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2790: --- Component/s: plugin > CSVIndexWriter does not escape leading quotes properly >

[GitHub] [nutch] sebastian-nagel merged pull request #532: NUTCH-2790 indexer-csv: escape field leading quote character

2020-06-10 Thread GitBox
sebastian-nagel merged pull request #532: URL: https://github.com/apache/nutch/pull/532 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to

[GitHub] [nutch] sebastian-nagel commented on a change in pull request #533: NUTCH-2791 Handle GCS URLs in stats commands

2020-06-10 Thread GitBox
sebastian-nagel commented on a change in pull request #533: URL: https://github.com/apache/nutch/pull/533#discussion_r438319215 ## File path: src/java/org/apache/nutch/util/CrawlCompletionStats.java ## @@ -153,9 +153,7 @@ public int run(String[] args) throws Exception {

[jira] [Commented] (NUTCH-2791) domainstats, protocolstats and crawlcomplete do not handle GCS URLs

2020-06-10 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17132600#comment-17132600 ] ASF GitHub Bot commented on NUTCH-2791: --- sebastian-nagel commented on a change in pull request

[GitHub] [nutch] sebastian-nagel commented on pull request #534: NUTCH-2793 indexer-csv: make it work in distributed mode

2020-06-10 Thread GitBox
sebastian-nagel commented on pull request #534: URL: https://github.com/apache/nutch/pull/534#issuecomment-642132183 > Is it OK to just change the interface and implement what you suggest? Yes, that's ok. We'll put a notice about a breaking change to the release notes, so that users

[jira] [Commented] (NUTCH-2793) CSV indexer does not work in distributed mode

2020-06-10 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17130921#comment-17130921 ] ASF GitHub Bot commented on NUTCH-2793: --- sebastian-nagel commented on pull request #534: URL:

[jira] [Commented] (NUTCH-2793) CSV indexer does not work in distributed mode

2020-06-10 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17130918#comment-17130918 ] ASF GitHub Bot commented on NUTCH-2793: --- sebastian-nagel commented on a change in pull request

[GitHub] [nutch] sebastian-nagel commented on a change in pull request #534: NUTCH-2793 indexer-csv: make it work in distributed mode

2020-06-10 Thread GitBox
sebastian-nagel commented on a change in pull request #534: URL: https://github.com/apache/nutch/pull/534#discussion_r438267053 ## File path: src/plugin/indexer-csv/README.md ## @@ -39,4 +39,4 @@ escapechar | Escape character used to escape a quote character |

[GitHub] [nutch] pmezard commented on a change in pull request #534: NUTCH-2793 indexer-csv: make it work in distributed mode

2020-06-10 Thread GitBox
pmezard commented on a change in pull request #534: URL: https://github.com/apache/nutch/pull/534#discussion_r438258817 ## File path: src/plugin/indexer-csv/README.md ## @@ -39,4 +39,4 @@ escapechar | Escape character used to escape a quote character | maxfieldlength | Max.

[jira] [Commented] (NUTCH-2793) CSV indexer does not work in distributed mode

2020-06-10 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17130906#comment-17130906 ] ASF GitHub Bot commented on NUTCH-2793: --- pmezard commented on a change in pull request #534: URL:

[jira] [Commented] (NUTCH-2793) CSV indexer does not work in distributed mode

2020-06-10 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17130905#comment-17130905 ] ASF GitHub Bot commented on NUTCH-2793: --- pmezard commented on pull request #534: URL:

[GitHub] [nutch] pmezard commented on pull request #534: NUTCH-2793 indexer-csv: make it work in distributed mode

2020-06-10 Thread GitBox
pmezard commented on pull request #534: URL: https://github.com/apache/nutch/pull/534#issuecomment-642122887 What are the backward compatibility requirements for nutch? Is it OK to just change the interface and implement what you suggest? Should it be best-effort to keep things BC? Or is

[jira] [Commented] (NUTCH-2792) nutch index -params is only used in Solr indexer

2020-06-10 Thread Jira
[ https://issues.apache.org/jira/browse/NUTCH-2792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17130814#comment-17130814 ] Patrick Mézard commented on NUTCH-2792: --- What solution would you favor then [2], [3], something

[jira] [Commented] (NUTCH-2792) nutch index -params is only used in Solr indexer

2020-06-10 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17130803#comment-17130803 ] Sebastian Nagel commented on NUTCH-2792: Agreed, the -params option should be used by all index

[jira] [Updated] (NUTCH-2792) nutch index -params is only used in Solr indexer

2020-06-10 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2792: --- Fix Version/s: 1.18 > nutch index -params is only used in Solr indexer >

[jira] [Updated] (NUTCH-2793) CSV indexer does not work in distributed mode

2020-06-10 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2793: --- Fix Version/s: 1.18 > CSV indexer does not work in distributed mode >

[jira] [Updated] (NUTCH-2793) CSV indexer does not work in distributed mode

2020-06-10 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2793: --- Component/s: plugin > CSV indexer does not work in distributed mode >

[jira] [Commented] (NUTCH-2793) CSV indexer does not work in distributed mode

2020-06-10 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17130783#comment-17130783 ] ASF GitHub Bot commented on NUTCH-2793: --- sebastian-nagel commented on a change in pull request

[GitHub] [nutch] sebastian-nagel commented on a change in pull request #534: NUTCH-2793 indexer-csv: make it work in distributed mode

2020-06-10 Thread GitBox
sebastian-nagel commented on a change in pull request #534: URL: https://github.com/apache/nutch/pull/534#discussion_r438197577 ## File path: src/plugin/indexer-csv/src/java/org/apache/nutch/indexwriter/csv/CSVIndexWriter.java ## @@ -192,7 +189,7 @@ protected int find(String

[jira] [Commented] (NUTCH-2755) Remove obsolete plugin indexer-elastic-rest

2020-06-10 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17130718#comment-17130718 ] Sebastian Nagel commented on NUTCH-2755: Hi [~mfeltscher], should be possible but I've never

[jira] [Commented] (NUTCH-2501) allow to set Java heap size when using crawl script in distributed mode

2020-06-10 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17130713#comment-17130713 ] ASF GitHub Bot commented on NUTCH-2501: --- sebastian-nagel commented on a change in pull request

[GitHub] [nutch] sebastian-nagel commented on a change in pull request #279: NUTCH-2501: Take NUTCH_HEAPSIZE into account when crawling using crawl script

2020-06-10 Thread GitBox
sebastian-nagel commented on a change in pull request #279: URL: https://github.com/apache/nutch/pull/279#discussion_r438150802 ## File path: src/bin/crawl ## @@ -171,6 +175,8 @@ fi CRAWL_PATH="$1" LIMIT="$2" +JAVA_CHILD_HEAP_MB=`expr "$NUTCH_HEAP_MB" / "$NUM_TASKS"`
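
The added line in the diff above splits a total heap budget across tasks with `expr`, which performs integer division. A minimal Python sketch of the same arithmetic follows; the function name `child_heap_mb` and the guard against a non-positive task count are illustrative assumptions, not part of the crawl script.

```python
def child_heap_mb(total_heap_mb: int, num_tasks: int) -> int:
    """Hypothetical helper mirroring `expr "$NUTCH_HEAP_MB" / "$NUM_TASKS"`.

    For non-negative inputs, integer division discards the remainder,
    so the per-task share is rounded down.
    """
    if num_tasks <= 0:
        # Assumption: the sketch rejects this case outright instead of
        # attempting the division.
        raise ValueError("num_tasks must be positive")
    return total_heap_mb // num_tasks
```

Because the remainder is discarded, the per-task shares can add up to slightly less than the total budget.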

[jira] [Commented] (NUTCH-2793) CSV indexer does not work in distributed mode

2020-06-10 Thread Jira
[ https://issues.apache.org/jira/browse/NUTCH-2793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17130607#comment-17130607 ] Patrick Mézard commented on NUTCH-2793: --- PR sent here https://github.com/apache/nutch/pull/534 >

[jira] [Issue Comment Deleted] (NUTCH-2793) CSV indexer does not work in distributed mode

2020-06-10 Thread Jira
[ https://issues.apache.org/jira/browse/NUTCH-2793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Mézard updated NUTCH-2793: -- Comment: was deleted (was: PR sent here https://github.com/apache/nutch/pull/534) > CSV

[jira] [Commented] (NUTCH-2793) CSV indexer does not work in distributed mode

2020-06-10 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17130593#comment-17130593 ] ASF GitHub Bot commented on NUTCH-2793: --- pmezard opened a new pull request #534: URL:

[GitHub] [nutch] pmezard opened a new pull request #534: NUTCH-2793 indexer-csv: make it work in distributed mode

2020-06-10 Thread GitBox
pmezard opened a new pull request #534: URL: https://github.com/apache/nutch/pull/534 Before the change, the output file name was hard-coded to "nutch.csv". When running in distributed mode, multiple reducers would clobber each other's output. After the change, the filename is
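
The clobbering problem described in this message can be illustrated with a small sketch. This is not the code from the pull request, whose actual naming scheme is cut off in the excerpt above; the helper `part_file_name` and its zero-padded suffix are assumptions chosen only to show why a fixed name collides across reducers while a per-task name does not.

```python
def part_file_name(base: str, task_id: int) -> str:
    """Hypothetical helper: derive a distinct output name per reducer task."""
    stem, dot, ext = base.rpartition(".")
    if not dot:
        # No extension present: append the zero-padded task id.
        return f"{base}-{task_id:05d}"
    # Insert the task id between the stem and the extension.
    return f"{stem}-{task_id:05d}.{ext}"


# With a hard-coded name, every reducer targets the same file.
fixed = {"nutch.csv" for _ in range(3)}

# With a per-task name, each reducer writes to its own file.
distinct = {part_file_name("nutch.csv", task) for task in range(3)}
```

Here `fixed` collapses to a single path shared by all three tasks, while `distinct` holds three separate paths.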

[jira] [Updated] (NUTCH-2793) CSV indexer does not work in distributed mode

2020-06-10 Thread Jira
[ https://issues.apache.org/jira/browse/NUTCH-2793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Mézard updated NUTCH-2793: -- Description: Reasons are discussed in

[jira] [Created] (NUTCH-2793) CSV indexer does not work in distributed mode

2020-06-10 Thread Jira
Patrick Mézard created NUTCH-2793: - Summary: CSV indexer does not work in distributed mode Key: NUTCH-2793 URL: https://issues.apache.org/jira/browse/NUTCH-2793 Project: Nutch Issue Type:

RE: [PROPOSAL] Replace whitelist blacklist with allowlist denylist

2020-06-10 Thread Markus Jelsma
Hello Lewis, I understand the proposal. As an engineer, however, I have some points I would like to address: * The proposed change is not backward compatible, which weighs heavily because it is also not a technical necessity. * Our users, myself included, have to make a small or, depending on

[jira] [Updated] (NUTCH-2792) nutch index -params is only used in Solr indexer

2020-06-10 Thread Jira
[ https://issues.apache.org/jira/browse/NUTCH-2792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Mézard updated NUTCH-2792: -- Description: `nutch index` help displays: {code:java} General options: ... -params

[jira] [Commented] (NUTCH-2792) nutch index -params is only used in Solr indexer

2020-06-10 Thread Jira
[ https://issues.apache.org/jira/browse/NUTCH-2792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17130421#comment-17130421 ] Patrick Mézard commented on NUTCH-2792: ---

[jira] [Created] (NUTCH-2792) nutch index -params is only used in Solr indexer

2020-06-10 Thread Jira
Patrick Mézard created NUTCH-2792: - Summary: nutch index -params is only used in Solr indexer Key: NUTCH-2792 URL: https://issues.apache.org/jira/browse/NUTCH-2792 Project: Nutch Issue Type:

[jira] [Commented] (NUTCH-2787) CrawlDb JSON dump does not export metadata primitive data types correctly

2020-06-10 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17130289#comment-17130289 ] ASF GitHub Bot commented on NUTCH-2787: --- pmezard commented on pull request #531: URL:

[GitHub] [nutch] pmezard commented on pull request #531: NUTCH-2787 CrawlDb JSON dump does not export metadata primitive data types correctly

2020-06-10 Thread GitBox
pmezard commented on pull request #531: URL: https://github.com/apache/nutch/pull/531#issuecomment-641768990 +1, the change fixes my issue. This is an automated message from the Apache Git Service. To respond to the message,