[jira] [Commented] (NUTCH-2508) Misleading documentation about http.proxy.exception.list

2018-01-31 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347837#comment-16347837
 ] 

Hudson commented on NUTCH-2508:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3501 (See 
[https://builds.apache.org/job/Nutch-trunk/3501/])
fix for NUTCH-2508 contributed by mfeltscher (moreno: 
[https://github.com/apache/nutch/commit/4f82d8f2355a87c779e14cd6abde40a095c3349b])
* (edit) conf/nutch-default.xml


> Misleading documentation about http.proxy.exception.list
> 
>
> Key: NUTCH-2508
> URL: https://issues.apache.org/jira/browse/NUTCH-2508
> Project: Nutch
>  Issue Type: Bug
>Reporter: Moreno Feltscher
>Assignee: Moreno Feltscher
>Priority: Major
> Fix For: 1.15
>
>
> The description about {{http.proxy.exception.list}} states that domains as 
> well as URLs can be configured to be excluded from being routed through a 
> pre-configured proxy. This is misleading since only hosts are being checked 
> when using this feature.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-2466) Sitemap processor to follow redirects

2018-01-31 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347762#comment-16347762
 ] 

Markus Jelsma commented on NUTCH-2466:
--

Another note, curious to see browser developers allow over ten redirects. I 
never observed any fruition to follow more than a few. Stranger even is IE's 
choice to jump from eleven to 120!

If anyone reading this can clarify the usefulness of following more than ten 
redirects? Or even 120? 

That made bad choices, or i haven't seen their views about the variety of crap 
on the web. Probably the latter is true.

> Sitemap processor to follow redirects
> -
>
> Key: NUTCH-2466
> URL: https://issues.apache.org/jira/browse/NUTCH-2466
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.13
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.15
>
> Attachments: NUTCH-2466.patch, NUTCH-2466.patch, NUTCH-2466.patch
>
>
> It does follow http > https, but not the following redirect, e.g. 
> sitemap_index.xml that some websites have.





[jira] [Comment Edited] (NUTCH-2466) Sitemap processor to follow redirects

2018-01-31 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347762#comment-16347762
 ] 

Markus Jelsma edited comment on NUTCH-2466 at 1/31/18 11:14 PM:


Another note, curious to see browser developers allow over ten redirects. I 
never observed any fruition to follow more than a few. Stranger even is IE's 
choice to jump from eleven to 120!

If anyone reading this can clarify the usefulness of following more than ten 
redirects? Or even 120? 

They made bad choices, or i haven't seen their views about the variety of crap 
on the web. Probably the latter is true.


was (Author: markus17):
Another note, curious to see browser developers allow over ten redirects. I 
never observed any fruition to follow more than a few. Stranger even is IE's 
choice to jump from eleven to 120!

If anyone reading this can clarify the usefulness of following more than ten 
redirects? Or even 120? 

That made bad choices, or i haven't seen their views about the variety of crap 
on the web. Probably the latter is true.



[jira] [Commented] (NUTCH-2466) Sitemap processor to follow redirects

2018-01-31 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347749#comment-16347749
 ] 

Markus Jelsma commented on NUTCH-2466:
--

Glad to hear this will work for you!



[jira] [Commented] (NUTCH-2466) Sitemap processor to follow redirects

2018-01-31 Thread Moreno Feltscher (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347742#comment-16347742
 ] 

Moreno Feltscher commented on NUTCH-2466:
-

I absolutely get your point and I'm 100% with you on this - forever is not a 
good idea in any scenario :-) Just wanted to make sure I understand this change 
correctly.
FYI, Google Chrome treats 21 redirects as "too many" - I'm going to use 20 for 
{{sitemap.redir.max}} in my setup => 
https://stackoverflow.com/a/36041063/5884584



[jira] [Resolved] (NUTCH-2508) Misleading documentation about http.proxy.exception.list

2018-01-31 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-2508.
-
Resolution: Fixed

Thank you [~mfeltscher]



[jira] [Commented] (NUTCH-2508) Misleading documentation about http.proxy.exception.list

2018-01-31 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347737#comment-16347737
 ] 

ASF GitHub Bot commented on NUTCH-2508:
---

lewismc closed pull request #283: fix for NUTCH-2508 contributed by mfeltscher
URL: https://github.com/apache/nutch/pull/283
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/conf/nutch-default.xml b/conf/nutch-default.xml
index 550ed48a4..87c405883 100644
--- a/conf/nutch-default.xml
+++ b/conf/nutch-default.xml
@@ -280,7 +280,7 @@
 <property>
   <name>http.proxy.exception.list</name>
   <value></value>
-  <description>A comma separated list of URL's and hosts that don't use the proxy 
+  <description>A comma separated list of hosts that don't use the proxy 
   (e.g. intranets). Example: www.apache.org</description>
 </property>
 


 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (NUTCH-2508) Misleading documentation about http.proxy.exception.list

2018-01-31 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-2508:

Fix Version/s: 1.15



[jira] [Commented] (NUTCH-2466) Sitemap processor to follow redirects

2018-01-31 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347735#comment-16347735
 ] 

Markus Jelsma commented on NUTCH-2466:
--

Hello Moreno,

Well, we obviously could allow a -1 setting and treat that as forever, but 
forever is infinite and it would hang the Nutch task until Hadoop treats it as 
timed out, usually within ten minutes.

The setting is an int, so if you want, you can set it to the maximum positive 
integer and handle just over two billion consecutive redirects.

I believe that would justify the meaning of forever in this context, do you 
agree?

As a side note, having dealt with the crudeness of the WWW for many years, I 
consider any sequence of more than four redirects a sign of a whole other 
problem. Our (company, not ASF Nutch) maximum setting is always three; anything 
higher has, so far, always led to circular redirects.
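
To illustrate the point about a hard limit, here is a minimal sketch of a bounded redirect loop. It is not the SitemapProcessor code from the patch: the class name, the resolve() helper and the in-memory redirect table (standing in for real HTTP fetches) are all assumptions made up for the example. It only shows why a finite maximum ends a circular chain that "forever" would never leave.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not the actual Nutch implementation.
class RedirectLimitSketch {

  /**
   * Follows redirects starting at 'start', using a simulated redirect
   * table (URL -> redirect target) instead of real HTTP requests.
   * Returns the final URL, or null if more than maxRedirects hops
   * would be needed.
   */
  static String resolve(Map<String, String> redirects, String start, int maxRedirects) {
    String url = start;
    int followed = 0;
    while (redirects.containsKey(url)) {
      if (followed >= maxRedirects) {
        return null; // limit reached: give up instead of looping forever
      }
      url = redirects.get(url);
      followed++;
    }
    return url;
  }

  public static void main(String[] args) {
    Map<String, String> redirects = new HashMap<>();
    // http > https, then on to a sitemap index, as in the issue description
    redirects.put("http://example.org/sitemap.xml", "https://example.org/sitemap.xml");
    redirects.put("https://example.org/sitemap.xml", "https://example.org/sitemap_index.xml");
    // a circular pair that would never terminate without a limit
    redirects.put("http://example.org/a", "http://example.org/b");
    redirects.put("http://example.org/b", "http://example.org/a");

    System.out.println(resolve(redirects, "http://example.org/sitemap.xml", 3));
    System.out.println(resolve(redirects, "http://example.org/a", 3));
  }
}
```

With a limit of three, the two-hop sitemap chain resolves and the circular pair is abandoned. Passing Integer.MAX_VALUE as the limit would make the circular case spin for just over two billion iterations before giving up, which is the "forever" described above.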




[jira] [Commented] (NUTCH-2501) Take into account $NUTCH_HEAPSIZE when crawling using crawl script

2018-01-31 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347729#comment-16347729
 ] 

ASF GitHub Bot commented on NUTCH-2501:
---

mfeltscher commented on a change in pull request #279: NUTCH-2501: Take 
NUTCH_HEAPSIZE into account  when crawling using crawl script
URL: https://github.com/apache/nutch/pull/279#discussion_r165213301
 
 

 ##
 File path: src/bin/crawl
 ##
 @@ -171,6 +175,8 @@ fi
 
 CRAWL_PATH="$1"
 LIMIT="$2"
+JAVA_CHILD_HEAP_MB=`expr "$NUTCH_HEAP_MB" / "$NUM_TASKS"`
 
 Review comment:
   @sebastian-nagel Any comments on this? :)


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Take into account $NUTCH_HEAPSIZE when crawling using crawl script
> --
>
> Key: NUTCH-2501
> URL: https://issues.apache.org/jira/browse/NUTCH-2501
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Moreno Feltscher
>Assignee: Lewis John McGibbney
>Priority: Major
>






[jira] [Commented] (NUTCH-2508) Misleading documentation about http.proxy.exception.list

2018-01-31 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347726#comment-16347726
 ] 

ASF GitHub Bot commented on NUTCH-2508:
---

mfeltscher opened a new pull request #283: fix for NUTCH-2508 contributed by 
mfeltscher
URL: https://github.com/apache/nutch/pull/283
 
 
   This is a small documentation fix since the description of 
`http.proxy.exception.list` is misleading. Only hosts can be defined as you can 
see here: 
https://github.com/apache/nutch/blob/master/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java#L370
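
To make the host-only behaviour concrete, here is a minimal sketch of such a check. It is not the HttpBase code linked above: the class and method names are made up for illustration, and lower-casing entries and sending URLs without a host through the proxy are assumptions. The point is only that when the lookup key is the host of the URL being fetched, a full URL placed in the exception list can never match.

```java
import java.net.URI;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch of a host-only proxy exception check.
class ProxyExceptionSketch {

  /** Splits a comma separated list into a set of trimmed, lower-cased entries. */
  static Set<String> parseExceptions(String list) {
    Set<String> hosts = new HashSet<>();
    for (String entry : list.split(",")) {
      String host = entry.trim().toLowerCase(Locale.ROOT);
      if (!host.isEmpty()) {
        hosts.add(host);
      }
    }
    return hosts;
  }

  /** Returns true if the request should go through the proxy. */
  static boolean useProxy(Set<String> exceptions, String url) {
    String host = URI.create(url).getHost();
    // Assumption: URLs without a host are sent through the proxy.
    if (host == null) {
      return true;
    }
    // Only the host is looked up, never the full URL.
    return !exceptions.contains(host.toLowerCase(Locale.ROOT));
  }

  public static void main(String[] args) {
    Set<String> exceptions =
        parseExceptions("www.apache.org, http://intranet.example/path");
    // The host entry matches, so the proxy is bypassed.
    System.out.println(useProxy(exceptions, "http://www.apache.org/foo"));
    // The URL entry never equals a host, so the proxy is still used.
    System.out.println(useProxy(exceptions, "http://intranet.example/path"));
  }
}
```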


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Misleading documentation about http.proxy.exception.list
> 
>
> Key: NUTCH-2508
> URL: https://issues.apache.org/jira/browse/NUTCH-2508
> Project: Nutch
>  Issue Type: Bug
>Reporter: Moreno Feltscher
>Assignee: Moreno Feltscher
>Priority: Major
>
> The description about {{http.proxy.exception.list}} states that domains as 
> well as URLs can be configured to be excluded from being routed through a 
> pre-configured proxy. This is misleading since only hosts are being checked 
> when using this feature.





[jira] [Commented] (NUTCH-2466) Sitemap processor to follow redirects

2018-01-31 Thread Moreno Feltscher (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347718#comment-16347718
 ] 

Moreno Feltscher commented on NUTCH-2466:
-

Is there any way to configure this so that nutch follows redirects forever 
(which was the case before this patch)?



[jira] [Created] (NUTCH-2508) Misleading documentation about http.proxy.exception.list

2018-01-31 Thread Moreno Feltscher (JIRA)
Moreno Feltscher created NUTCH-2508:
---

 Summary: Misleading documentation about http.proxy.exception.list
 Key: NUTCH-2508
 URL: https://issues.apache.org/jira/browse/NUTCH-2508
 Project: Nutch
  Issue Type: Bug
Reporter: Moreno Feltscher
Assignee: Moreno Feltscher


The description about {{http.proxy.exception.list}} states that domains as well 
as URLs can be configured to be excluded from being routed through a 
pre-configured proxy. This is misleading since only hosts are being checked 
when using this feature.





[jira] [Commented] (NUTCH-2466) Sitemap processor to follow redirects

2018-01-31 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16346947#comment-16346947
 ] 

Hudson commented on NUTCH-2466:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3500 (See 
[https://builds.apache.org/job/Nutch-trunk/3500/])
NUTCH-2466 (markus: 
[https://github.com/apache/nutch/commit/2b66cdaf8a18123c4e33c55a5c3b2cd863385896])
* (edit) conf/nutch-default.xml
* (edit) src/java/org/apache/nutch/util/SitemapProcessor.java




[jira] [Resolved] (NUTCH-2466) Sitemap processor to follow redirects

2018-01-31 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma resolved NUTCH-2466.
--
Resolution: Fixed



[jira] [Commented] (NUTCH-2466) Sitemap processor to follow redirects

2018-01-31 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16346862#comment-16346862
 ] 

Markus Jelsma commented on NUTCH-2466:
--

Thanks!

remote: Sending notification emails to: ['"comm...@nutch.apache.org" 
']
remote: To git@github:apache/nutch.git
remote:87c7a2e..2b66cda  2b66cdaf8a18123c4e33c55a5c3b2cd863385896 -> master
remote: Syncing refs/heads/master...
To https://gitbox.apache.org/repos/asf/nutch.git
   87c7a2e5..2b66cdaf  master -> master




[jira] [Commented] (NUTCH-2466) Sitemap processor to follow redirects

2018-01-31 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16346821#comment-16346821
 ] 

Sebastian Nagel commented on NUTCH-2466:


+1



[jira] [Commented] (NUTCH-2466) Sitemap processor to follow redirects

2018-01-31 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16346768#comment-16346768
 ] 

Markus Jelsma commented on NUTCH-2466:
--

New patch!



[jira] [Updated] (NUTCH-2466) Sitemap processor to follow redirects

2018-01-31 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2466:
-
Attachment: NUTCH-2466.patch



[jira] [Commented] (NUTCH-2466) Sitemap processor to follow redirects

2018-01-31 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16346744#comment-16346744
 ] 

Sebastian Nagel commented on NUTCH-2466:


It may be safer to break the loop in case the URL is set to null by 
filters/normalizers.
+1 otherwise!

> Sitemap processor to follow redirects
> -
>
> Key: NUTCH-2466
> URL: https://issues.apache.org/jira/browse/NUTCH-2466
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.13
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.15
>
> Attachments: NUTCH-2466.patch, NUTCH-2466.patch
>
>
> It does follow http > https, but not the following redirect, e.g. 
> sitemap_index.xml that some websites have.





[jira] [Commented] (NUTCH-2466) Sitemap processor to follow redirects

2018-01-31 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16346730#comment-16346730
 ] 

Markus Jelsma commented on NUTCH-2466:
--

Will commit shortly unless objections.



[jira] [Created] (NUTCH-2507) NutchTutorial wiki pages as a lot of outdated command line calls when it starts with the solr interaction

2018-01-31 Thread artodeto (JIRA)
artodeto created NUTCH-2507:
---

 Summary: NutchTutorial wiki pages as a lot of outdated command 
line calls when it starts with the solr interaction
 Key: NUTCH-2507
 URL: https://issues.apache.org/jira/browse/NUTCH-2507
 Project: Nutch
  Issue Type: Bug
  Components: documentation
Affects Versions: 1.14
Reporter: artodeto


h2. Section "Step-by-Step: Indexing into Apache Solr"

replace:
{code:java}
Example: bin/nutch index http://localhost:8983/solr crawl/crawldb/ -linkdb 
crawl/linkdb/ crawl/segments/20131108063838/ -filter -normalize 
-deleteGone{code}
with:
{code:java}
Example: bin/nutch index -Dsolr.server.url=http://localhost:8983/solr/nutch 
${NUTCH_RUNTIME_HOME}/crawl/crawldb/ -linkdb ${NUTCH_RUNTIME_HOME}/crawl/linkdb/ 
${NUTCH_RUNTIME_HOME}/crawl/segments/20131108063838/ -filter -normalize -deleteGo{code}
 
h2. Section "Step-by-Step: Deleting Duplicates"

replace:
{code:java}
 Usage: bin/nutch dedup 
 Example: /bin/nutch dedup http://localhost:8983/solr
{code}
with:
{code:java}
 Usage: bin/nutch dedup  
 Example: /bin/nutch dedup ${NUTCH_RUNTIME_HOME}/crawl/crawldb/ 
http://localhost:8983/sol
{code}

h2. Section "Step-by-Step: Cleaning Solr"

replace:
{code:java}
 Usage: bin/nutch clean -Dsolr.server.url= 
 Example: /bin/nutch clean 
-Dsolr.server.url=http://localhost:8983/solr/nutch crawl/crawldb/
{code}
with:
{code}
 Usage: bin/nutch clean -Dsolr.server.url= 
 Example: /bin/nutch clean 
-Dsolr.server.url=http://localhost:8983/solr/nutch 
${NUTCH_RUNTIME_HOME}/crawl/crawldb/
{code}


