[jira] [Commented] (NUTCH-2508) Misleading documentation about http.proxy.exception.list
[ https://issues.apache.org/jira/browse/NUTCH-2508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347837#comment-16347837 ] Hudson commented on NUTCH-2508: --- SUCCESS: Integrated in Jenkins build Nutch-trunk #3501 (See [https://builds.apache.org/job/Nutch-trunk/3501/]) fix for NUTCH-2508 contributed by mfeltscher (moreno: [https://github.com/apache/nutch/commit/4f82d8f2355a87c779e14cd6abde40a095c3349b]) * (edit) conf/nutch-default.xml > Misleading documentation about http.proxy.exception.list > > > Key: NUTCH-2508 > URL: https://issues.apache.org/jira/browse/NUTCH-2508 > Project: Nutch > Issue Type: Bug >Reporter: Moreno Feltscher >Assignee: Moreno Feltscher >Priority: Major > Fix For: 1.15 > > > The description about {{http.proxy.exception.list}} states that domains as > well as URLs can be configured to be excluded from being routed through a > pre-configured proxy. This is misleading since only hosts are being checked > when using this feature. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (NUTCH-2466) Sitemap processor to follow redirects
[ https://issues.apache.org/jira/browse/NUTCH-2466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347762#comment-16347762 ] Markus Jelsma commented on NUTCH-2466: -- Another note, curious to see browser developers allow over ten redirects. I never observed any fruition to follow more than a few. Stranger even is IE's choice to jump from eleven to 120! If anyone reading this can clarify the usefulness of following more than ten redirects? Or even 120? That made bad choices, or i haven't seen their views about the variety of crap on the web. Probably the latter is true. > Sitemap processor to follow redirects > - > > Key: NUTCH-2466 > URL: https://issues.apache.org/jira/browse/NUTCH-2466 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.13 >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Fix For: 1.15 > > Attachments: NUTCH-2466.patch, NUTCH-2466.patch, NUTCH-2466.patch > > > It does follow http > https, but not the following redirect, e.g. > sitemap_index.xml that some websites have. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (NUTCH-2466) Sitemap processor to follow redirects
[ https://issues.apache.org/jira/browse/NUTCH-2466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347762#comment-16347762 ] Markus Jelsma edited comment on NUTCH-2466 at 1/31/18 11:14 PM: Another note, curious to see browser developers allow over ten redirects. I never observed any fruition to follow more than a few. Stranger even is IE's choice to jump from eleven to 120! If anyone reading this can clarify the usefulness of following more than ten redirects? Or even 120? They made bad choices, or i haven't seen their views about the variety of crap on the web. Probably the latter is true. was (Author: markus17): Another note, curious to see browser developers allow over ten redirects. I never observed any fruition to follow more than a few. Stranger even is IE's choice to jump from eleven to 120! If anyone reading this can clarify the usefulness of following more than ten redirects? Or even 120? That made bad choices, or i haven't seen their views about the variety of crap on the web. Probably the latter is true. > Sitemap processor to follow redirects > - > > Key: NUTCH-2466 > URL: https://issues.apache.org/jira/browse/NUTCH-2466 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.13 >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Fix For: 1.15 > > Attachments: NUTCH-2466.patch, NUTCH-2466.patch, NUTCH-2466.patch > > > It does follow http > https, but not the following redirect, e.g. > sitemap_index.xml that some websites have. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (NUTCH-2466) Sitemap processor to follow redirects
[ https://issues.apache.org/jira/browse/NUTCH-2466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347749#comment-16347749 ] Markus Jelsma commented on NUTCH-2466: -- Glad to hear this will work for you! > Sitemap processor to follow redirects > - > > Key: NUTCH-2466 > URL: https://issues.apache.org/jira/browse/NUTCH-2466 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.13 >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Fix For: 1.15 > > Attachments: NUTCH-2466.patch, NUTCH-2466.patch, NUTCH-2466.patch > > > It does follow http > https, but not the following redirect, e.g. > sitemap_index.xml that some websites have. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (NUTCH-2466) Sitemap processor to follow redirects
[ https://issues.apache.org/jira/browse/NUTCH-2466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347742#comment-16347742 ] Moreno Feltscher commented on NUTCH-2466: - I absolutely get your point and I'm a 100% with you on this - forever is not a good idea in any scenario :-) Just wanted to make sure I understand this change correctly. FYI, Google Chrome treats 21 redirects as "too many" - I'm going to use 20 for {{sitemap.redir.max}} in my setup => https://stackoverflow.com/a/36041063/5884584 > Sitemap processor to follow redirects > - > > Key: NUTCH-2466 > URL: https://issues.apache.org/jira/browse/NUTCH-2466 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.13 >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Fix For: 1.15 > > Attachments: NUTCH-2466.patch, NUTCH-2466.patch, NUTCH-2466.patch > > > It does follow http > https, but not the following redirect, e.g. > sitemap_index.xml that some websites have. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (NUTCH-2508) Misleading documentation about http.proxy.exception.list
[ https://issues.apache.org/jira/browse/NUTCH-2508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-2508. - Resolution: Fixed Thank you [~mfeltscher] > Misleading documentation about http.proxy.exception.list > > > Key: NUTCH-2508 > URL: https://issues.apache.org/jira/browse/NUTCH-2508 > Project: Nutch > Issue Type: Bug >Reporter: Moreno Feltscher >Assignee: Moreno Feltscher >Priority: Major > Fix For: 1.15 > > > The description about {{http.proxy.exception.list}} states that domains as > well as URLs can be configured to be excluded from being routed through a > pre-configured proxy. This is misleading since only hosts are being checked > when using this feature. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (NUTCH-2508) Misleading documentation about http.proxy.exception.list
[ https://issues.apache.org/jira/browse/NUTCH-2508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347737#comment-16347737 ]

ASF GitHub Bot commented on NUTCH-2508:
---

lewismc closed pull request #283: fix for NUTCH-2508 contributed by mfeltscher
URL: https://github.com/apache/nutch/pull/283

This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic):

diff --git a/conf/nutch-default.xml b/conf/nutch-default.xml
index 550ed48a4..87c405883 100644
--- a/conf/nutch-default.xml
+++ b/conf/nutch-default.xml
@@ -280,7 +280,7 @@
 http.proxy.exception.list
- A comma separated list of URL's and hosts that don't use the proxy
+ A comma separated list of hosts that don't use the proxy
 (e.g. intranets). Example: www.apache.org

This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org

> Misleading documentation about http.proxy.exception.list
>
> Key: NUTCH-2508
> URL: https://issues.apache.org/jira/browse/NUTCH-2508
> Project: Nutch
> Issue Type: Bug
> Reporter: Moreno Feltscher
> Assignee: Moreno Feltscher
> Priority: Major
> Fix For: 1.15
>
> The description about {{http.proxy.exception.list}} states that domains as
> well as URLs can be configured to be excluded from being routed through a
> pre-configured proxy. This is misleading since only hosts are being checked
> when using this feature.

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (NUTCH-2508) Misleading documentation about http.proxy.exception.list
[ https://issues.apache.org/jira/browse/NUTCH-2508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-2508: Fix Version/s: 1.15 > Misleading documentation about http.proxy.exception.list > > > Key: NUTCH-2508 > URL: https://issues.apache.org/jira/browse/NUTCH-2508 > Project: Nutch > Issue Type: Bug >Reporter: Moreno Feltscher >Assignee: Moreno Feltscher >Priority: Major > Fix For: 1.15 > > > The description about {{http.proxy.exception.list}} states that domains as > well as URLs can be configured to be excluded from being routed through a > pre-configured proxy. This is misleading since only hosts are being checked > when using this feature. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (NUTCH-2466) Sitemap processor to follow redirects
[ https://issues.apache.org/jira/browse/NUTCH-2466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347735#comment-16347735 ] Markus Jelsma commented on NUTCH-2466: -- Hello Moreno, Well, we obviously could allow a -1 setting and treat that as forever, but forever is infinite and it would hang the Nutch task until Hadoop treats it as timed out, usually within ten minutes. The setting is an int, so if you want, you can set it to the maximum positive integer and handle just over two billion consecutive redirects. I believe that would justify the meaning of forever in this context, do you agree? As a side note, having dealt with the crudeness of the www for many years, I consider any sequence of more than four redirects as the root of a whole other problem. Our (company, not ASF Nutch) maximum setting is always three; higher than that has, so far, always led to circular redirects. > Sitemap processor to follow redirects > - > > Key: NUTCH-2466 > URL: https://issues.apache.org/jira/browse/NUTCH-2466 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.13 >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Fix For: 1.15 > > Attachments: NUTCH-2466.patch, NUTCH-2466.patch, NUTCH-2466.patch > > > It does follow http > https, but not the following redirect, e.g. > sitemap_index.xml that some websites have. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
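The redirect limit discussed in this comment can be illustrated with a small, self-contained sketch. It is not the code from the NUTCH-2466 patch: the class name `RedirectLimitSketch`, the method `resolve`, and the in-memory redirect table are all hypothetical stand-ins, and no HTTP requests are made. The only name taken from the thread is the `sitemap.redir.max` setting, which the `maxRedirects` parameter is meant to mirror. The loop also stops when the URL becomes null, as suggested elsewhere in the thread for URLs rejected by filters or normalizers.

```java
import java.util.HashMap;
import java.util.Map;

public class RedirectLimitSketch {

  /**
   * Follows redirects taken from a lookup table (a stand-in for real HTTP
   * responses) and gives up after maxRedirects hops.
   *
   * @return the final URL, or null if the limit was reached or the URL
   *         became null along the way
   */
  public static String resolve(String start, Map<String, String> redirects,
      int maxRedirects) {
    String url = start;
    int followed = 0;
    // Stop as soon as the URL is null or no longer redirects anywhere.
    while (url != null && redirects.containsKey(url)) {
      if (followed >= maxRedirects) {
        return null; // limit reached, e.g. because of a circular redirect
      }
      url = redirects.get(url);
      followed++;
    }
    return url;
  }

  public static void main(String[] args) {
    // Hypothetical redirect table: one http > https > index chain, one loop.
    Map<String, String> redirects = new HashMap<>();
    redirects.put("http://example.org/sitemap.xml",
        "https://example.org/sitemap.xml");
    redirects.put("https://example.org/sitemap.xml",
        "https://example.org/sitemap_index.xml");
    redirects.put("https://example.org/a", "https://example.org/b");
    redirects.put("https://example.org/b", "https://example.org/a");

    // Two hops, within a limit of three: the final URL is returned.
    System.out.println(resolve("http://example.org/sitemap.xml", redirects, 3));
    // Circular redirect: the limit is hit and null is returned.
    System.out.println(resolve("https://example.org/a", redirects, 3));
  }
}
```

With a finite limit the circular case ends after a bounded number of hops instead of running until the task times out, which is the trade-off described above.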
[jira] [Commented] (NUTCH-2501) Take into account $NUTCH_HEAPSIZE when crawling using crawl script
[ https://issues.apache.org/jira/browse/NUTCH-2501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347729#comment-16347729 ]

ASF GitHub Bot commented on NUTCH-2501:
---

mfeltscher commented on a change in pull request #279: NUTCH-2501: Take NUTCH_HEAPSIZE into account when crawling using crawl script
URL: https://github.com/apache/nutch/pull/279#discussion_r165213301

## File path: src/bin/crawl ##
@@ -171,6 +175,8 @@ fi
 CRAWL_PATH="$1"
 LIMIT="$2"
+JAVA_CHILD_HEAP_MB=`expr "$NUTCH_HEAP_MB" / "$NUM_TASKS"`

Review comment:
@sebastian-nagel Any comments on this? :)

This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org

> Take into account $NUTCH_HEAPSIZE when crawling using crawl script
>
> Key: NUTCH-2501
> URL: https://issues.apache.org/jira/browse/NUTCH-2501
> Project: Nutch
> Issue Type: Improvement
> Reporter: Moreno Feltscher
> Assignee: Lewis John McGibbney
> Priority: Major

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (NUTCH-2508) Misleading documentation about http.proxy.exception.list
[ https://issues.apache.org/jira/browse/NUTCH-2508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347726#comment-16347726 ] ASF GitHub Bot commented on NUTCH-2508: --- mfeltscher opened a new pull request #283: fix for NUTCH-2508 contributed by mfeltscher URL: https://github.com/apache/nutch/pull/283 This is a small documentation fix since the description of `http.proxy.exception.list` is misleading. Only hosts can be defined as you can see here: https://github.com/apache/nutch/blob/master/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java#L370 This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Misleading documentation about http.proxy.exception.list > > > Key: NUTCH-2508 > URL: https://issues.apache.org/jira/browse/NUTCH-2508 > Project: Nutch > Issue Type: Bug >Reporter: Moreno Feltscher >Assignee: Moreno Feltscher >Priority: Major > > The description about {{http.proxy.exception.list}} states that domains as > well as URLs can be configured to be excluded from being routed through a > pre-configured proxy. This is misleading since only hosts are being checked > when using this feature. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
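To make the point of this pull request concrete, here is a minimal sketch of a host-only exception check. It is not the code in `HttpBase.java`: the class `ProxyExceptionSketch` and its methods `parseHosts` and `bypassesProxy` are hypothetical names chosen for illustration, and the lower-casing of hosts is an assumption of this sketch. It only shows why a full URL placed in `http.proxy.exception.list` would never match if the check compares hosts.

```java
import java.net.URI;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class ProxyExceptionSketch {

  /** Splits a comma separated list into a set of lower-cased entries. */
  public static Set<String> parseHosts(String list) {
    Set<String> hosts = new HashSet<>();
    for (String entry : list.split(",")) {
      String host = entry.trim().toLowerCase(Locale.ROOT);
      if (!host.isEmpty()) {
        hosts.add(host);
      }
    }
    return hosts;
  }

  /** Returns true if the host part of the URL is in the exception set. */
  public static boolean bypassesProxy(String url, Set<String> exceptions) {
    // Only the host is compared; path and scheme play no role.
    String host = URI.create(url).getHost();
    return host != null && exceptions.contains(host.toLowerCase(Locale.ROOT));
  }

  public static void main(String[] args) {
    Set<String> hostEntries = parseHosts("www.apache.org, intranet.example.com");
    // The host is listed, so the proxy is skipped.
    System.out.println(bypassesProxy("http://www.apache.org/index.html", hostEntries));

    Set<String> urlEntries = parseHosts("http://www.apache.org/docs");
    // A full URL in the list never equals a bare host, so it has no effect.
    System.out.println(bypassesProxy("http://www.apache.org/docs", urlEntries));
  }
}
```

Under these assumptions, listing `www.apache.org` works while listing `http://www.apache.org/docs` silently does nothing, which is the behaviour the old description failed to warn about.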
[jira] [Commented] (NUTCH-2466) Sitemap processor to follow redirects
[ https://issues.apache.org/jira/browse/NUTCH-2466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347718#comment-16347718 ] Moreno Feltscher commented on NUTCH-2466: - Is there any way to configure this so that nutch follows redirects forever (which was the case before this patch)? > Sitemap processor to follow redirects > - > > Key: NUTCH-2466 > URL: https://issues.apache.org/jira/browse/NUTCH-2466 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.13 >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Fix For: 1.15 > > Attachments: NUTCH-2466.patch, NUTCH-2466.patch, NUTCH-2466.patch > > > It does follow http > https, but not the following redirect, e.g. > sitemap_index.xml that some websites have. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (NUTCH-2508) Misleading documentation about http.proxy.exception.list
Moreno Feltscher created NUTCH-2508: --- Summary: Misleading documentation about http.proxy.exception.list Key: NUTCH-2508 URL: https://issues.apache.org/jira/browse/NUTCH-2508 Project: Nutch Issue Type: Bug Reporter: Moreno Feltscher Assignee: Moreno Feltscher The description about {{http.proxy.exception.list}} states that domains as well as URLs can be configured to be excluded from being routed through a pre-configured proxy. This is misleading since only hosts are being checked when using this feature. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (NUTCH-2466) Sitemap processor to follow redirects
[ https://issues.apache.org/jira/browse/NUTCH-2466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16346947#comment-16346947 ] Hudson commented on NUTCH-2466: --- SUCCESS: Integrated in Jenkins build Nutch-trunk #3500 (See [https://builds.apache.org/job/Nutch-trunk/3500/]) NUTCH-2466 (markus: [https://github.com/apache/nutch/commit/2b66cdaf8a18123c4e33c55a5c3b2cd863385896]) * (edit) conf/nutch-default.xml * (edit) src/java/org/apache/nutch/util/SitemapProcessor.java > Sitemap processor to follow redirects > - > > Key: NUTCH-2466 > URL: https://issues.apache.org/jira/browse/NUTCH-2466 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.13 >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Fix For: 1.15 > > Attachments: NUTCH-2466.patch, NUTCH-2466.patch, NUTCH-2466.patch > > > It does follow http > https, but not the following redirect, e.g. > sitemap_index.xml that some websites have. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (NUTCH-2466) Sitemap processor to follow redirects
[ https://issues.apache.org/jira/browse/NUTCH-2466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-2466. -- Resolution: Fixed > Sitemap processor to follow redirects > - > > Key: NUTCH-2466 > URL: https://issues.apache.org/jira/browse/NUTCH-2466 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.13 >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Fix For: 1.15 > > Attachments: NUTCH-2466.patch, NUTCH-2466.patch, NUTCH-2466.patch > > > It does follow http > https, but not the following redirect, e.g. > sitemap_index.xml that some websites have. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (NUTCH-2466) Sitemap processor to follow redirects
[ https://issues.apache.org/jira/browse/NUTCH-2466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16346862#comment-16346862 ] Markus Jelsma commented on NUTCH-2466: -- Thanks! remote: Sending notification emails to: ['"comm...@nutch.apache.org"'] remote: To git@github:apache/nutch.git remote:87c7a2e..2b66cda 2b66cdaf8a18123c4e33c55a5c3b2cd863385896 -> master remote: Syncing refs/heads/master... To https://gitbox.apache.org/repos/asf/nutch.git 87c7a2e5..2b66cdaf master -> master > Sitemap processor to follow redirects > - > > Key: NUTCH-2466 > URL: https://issues.apache.org/jira/browse/NUTCH-2466 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.13 >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Fix For: 1.15 > > Attachments: NUTCH-2466.patch, NUTCH-2466.patch, NUTCH-2466.patch > > > It does follow http > https, but not the following redirect, e.g. > sitemap_index.xml that some websites have. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (NUTCH-2466) Sitemap processor to follow redirects
[ https://issues.apache.org/jira/browse/NUTCH-2466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16346821#comment-16346821 ] Sebastian Nagel commented on NUTCH-2466: +1 > Sitemap processor to follow redirects > - > > Key: NUTCH-2466 > URL: https://issues.apache.org/jira/browse/NUTCH-2466 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.13 >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Fix For: 1.15 > > Attachments: NUTCH-2466.patch, NUTCH-2466.patch, NUTCH-2466.patch > > > It does follow http > https, but not the following redirect, e.g. > sitemap_index.xml that some websites have. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (NUTCH-2466) Sitemap processor to follow redirects
[ https://issues.apache.org/jira/browse/NUTCH-2466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16346768#comment-16346768 ] Markus Jelsma commented on NUTCH-2466: -- New patch! > Sitemap processor to follow redirects > - > > Key: NUTCH-2466 > URL: https://issues.apache.org/jira/browse/NUTCH-2466 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.13 >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Fix For: 1.15 > > Attachments: NUTCH-2466.patch, NUTCH-2466.patch, NUTCH-2466.patch > > > It does follow http > https, but not the following redirect, e.g. > sitemap_index.xml that some websites have. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (NUTCH-2466) Sitemap processor to follow redirects
[ https://issues.apache.org/jira/browse/NUTCH-2466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2466: - Attachment: NUTCH-2466.patch > Sitemap processor to follow redirects > - > > Key: NUTCH-2466 > URL: https://issues.apache.org/jira/browse/NUTCH-2466 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.13 >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Fix For: 1.15 > > Attachments: NUTCH-2466.patch, NUTCH-2466.patch, NUTCH-2466.patch > > > It does follow http > https, but not the following redirect, e.g. > sitemap_index.xml that some websites have. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (NUTCH-2466) Sitemap processor to follow redirects
[ https://issues.apache.org/jira/browse/NUTCH-2466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16346744#comment-16346744 ] Sebastian Nagel commented on NUTCH-2466: It may be safer to break the loop in case the URL is set to null by filters/normalizers. +1 otherwise! > Sitemap processor to follow redirects > - > > Key: NUTCH-2466 > URL: https://issues.apache.org/jira/browse/NUTCH-2466 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.13 >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Fix For: 1.15 > > Attachments: NUTCH-2466.patch, NUTCH-2466.patch > > > It does follow http > https, but not the following redirect, e.g. > sitemap_index.xml that some websites have. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (NUTCH-2466) Sitemap processor to follow redirects
[ https://issues.apache.org/jira/browse/NUTCH-2466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16346730#comment-16346730 ] Markus Jelsma commented on NUTCH-2466: -- Will commit shortly unless objections. > Sitemap processor to follow redirects > - > > Key: NUTCH-2466 > URL: https://issues.apache.org/jira/browse/NUTCH-2466 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.13 >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Fix For: 1.15 > > Attachments: NUTCH-2466.patch, NUTCH-2466.patch > > > It does follow http > https, but not the following redirect, e.g. > sitemap_index.xml that some websites have. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (NUTCH-2507) NutchTutorial wiki pages as a lot of outdated command line calls when it starts with the solr interaction
artodeto created NUTCH-2507:
---

Summary: NutchTutorial wiki pages as a lot of outdated command line calls when it starts with the solr interaction
Key: NUTCH-2507
URL: https://issues.apache.org/jira/browse/NUTCH-2507
Project: Nutch
Issue Type: Bug
Components: documentation
Affects Versions: 1.14
Reporter: artodeto

h2. Section "Step-by-Step: Indexing into Apache Solr"

replace:
{code:java}
Example: bin/nutch index http://localhost:8983/solr crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/20131108063838/ -filter -normalize -deleteGone
{code}
with:
{code:java}
Example: bin/nutch index -Dsolr.server.url=http://localhost:8983/solr/nutch ${NUTCH_RUNTIME_HOME}/crawl/crawldb/ -linkdb ${NUTCH_RUNTIME_HOME}/crawl/linkdb/ ${NUTCH_RUNTIME_HOME}/crawl/segments/20131108063838/ -filter -normalize -deleteGo
{code}

h2. Section "Step-by-Step: Deleting Duplicates"

replace:
{code:java}
Usage: bin/nutch dedup
Example: /bin/nutch dedup http://localhost:8983/solr
{code}
with:
{code:java}
Usage: bin/nutch dedup
Example: /bin/nutch dedup ${NUTCH_RUNTIME_HOME}/crawl/crawldb/ http://localhost:8983/sol
{code}

h2. Section "Step-by-Step: Cleaning Solr"

replace:
{code:java}
Usage: bin/nutch clean -Dsolr.server.url=
Example: /bin/nutch clean -Dsolr.server.url=http://localhost:8983/solr/nutch crawl/crawldb/
{code}
with:
{code}
Usage: bin/nutch clean -Dsolr.server.url=
Example: /bin/nutch clean -Dsolr.server.url=http://localhost:8983/solr/nutch ${NUTCH_RUNTIME_HOME}/crawl/crawldb/
{code}

-- This message was sent by Atlassian JIRA (v7.6.3#76005)