date:20230108

[jira] [Commented] (NUTCH-2634) Some links marked as "nofollow" are followed anyway.

2023-01-08 Thread Hudson (Jira)



[ 
https://issues.apache.org/jira/browse/NUTCH-2634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17655842#comment-17655842
 ] 

Hudson commented on NUTCH-2634:
---

FAILURE: Integrated in Jenkins build Nutch » Nutch-trunk #91 (See 
[https://ci-builds.apache.org/job/Nutch/job/Nutch-trunk/91/])
NUTCH-2634 Some links marked as "nofollow" are followed anyway (snagel: 
[https://github.com/apache/nutch/commit/dfdd00f3189839b6ed7d60651e5daa33f0038265])
* (edit) 
src/plugin/parse-html/src/test/org/apache/nutch/parse/html/TestDOMContentUtils.java
* (edit) 
src/plugin/parse-tika/src/test/org/apache/nutch/parse/tika/TestDOMContentUtils.java
* (edit) 
src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java
* (edit) 
src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java


> Some links marked as "nofollow" are followed anyway.
> 
>
> Key: NUTCH-2634
> URL: https://issues.apache.org/jira/browse/NUTCH-2634
> Project: Nutch
>  Issue Type: Bug
>Reporter: Gerard Bouchar
>Priority: Major
> Fix For: 1.20
>
>
> In order to check if an outlink in an <a> tag can be followed, nutch checks 
> whether the value of its rel attribute is the exact string "nofollow".
> However, [the rel attribute can contain a list of link 
> types|https://html.spec.whatwg.org/multipage/links.html#attr-hyperlink-rel], 
> all of which should be respected.
> So nutch rightfully doesn't follow a link like:
> {code:html}
> DO NOT FOLLOW THIS LINK
> {code}
> but wrongfully follows :
> {code:html}
> DO NOT FOLLOW THIS 
> LINK
> {code}
> Because of the code duplication in nutch's html parsers, this should be fixed 
> in two places:
> # 
> [parse/html/DOMContentUtils.java|https://github.com/apache/nutch/blob/3ada351a26b653b307c19e25b17e0e611a9bd59a/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java#L437]
> # 
> [parse/tika/DOMContentUtils.java|https://github.com/apache/nutch/blob/f02110f42c53e77450835776cf41f22c23f030ec/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java#L410]
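
The check asked for in the description can be sketched as follows. This is only an illustration, not the code from the merged commit: the class name RelAttribute, the method hasNoFollow and the sample rel values are assumptions made for this example. The idea is to treat rel as a whitespace-separated token list and to look for a "nofollow" token instead of comparing the whole attribute value.

```java
final class RelAttribute {
    private RelAttribute() {}

    /**
     * Returns true if any whitespace-separated token of the rel value
     * equals "nofollow", ignoring case. (Hypothetical helper.)
     */
    static boolean hasNoFollow(String rel) {
        if (rel == null) {
            return false;
        }
        // rel holds space-separated tokens; split on runs of ASCII whitespace
        for (String token : rel.trim().split("[ \\t\\n\\f\\r]+")) {
            if (token.equalsIgnoreCase("nofollow")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(hasNoFollow("nofollow"));          // true
        System.out.println(hasNoFollow("noopener nofollow")); // true
        System.out.println(hasNoFollow("nofollowing"));       // false
    }
}
```

A plain substring test such as rel.contains("nofollow") would also match unrelated tokens like "nofollowing", which is why the sketch compares whole tokens.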



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Build failed in Jenkins: Nutch » Nutch-trunk #91

2023-01-08 Thread Apache Jenkins Server

See 


Changes:

[Sebastian Nagel] NUTCH-2634 Some links marked as "nofollow" are followed anyway


--
[...truncated 1.71 MB...]
[dependency-check] Download Started for NVD CVE - 2003
[dependency-check] Download Complete for NVD CVE - 2003  (574 ms)
[dependency-check] Processing Started for NVD CVE - 2003
[dependency-check] Processing Complete for NVD CVE - 2003  (2308 ms)
[dependency-check] Processing Complete for NVD CVE - 2002  (7945 ms)
[dependency-check] Download Started for NVD CVE - 2004
[dependency-check] Download Complete for NVD CVE - 2004  (631 ms)
[dependency-check] Processing Started for NVD CVE - 2004
[dependency-check] Processing Complete for NVD CVE - 2004  (2752 ms)
[dependency-check] Download Started for NVD CVE - 2005
[dependency-check] Download Complete for NVD CVE - 2005  (647 ms)
[dependency-check] Processing Started for NVD CVE - 2005
[dependency-check] Download Started for NVD CVE - 2006
[dependency-check] Processing Complete for NVD CVE - 2005  (4664 ms)
[dependency-check] Download Complete for NVD CVE - 2006  (697 ms)
[dependency-check] Processing Started for NVD CVE - 2006
[dependency-check] Download Started for NVD CVE - 2007
[dependency-check] Download Complete for NVD CVE - 2007  (698 ms)
[dependency-check] Processing Started for NVD CVE - 2007
[dependency-check] Processing Complete for NVD CVE - 2006  (6931 ms)
[dependency-check] Download Started for NVD CVE - 2008
[dependency-check] Download Complete for NVD CVE - 2008  (706 ms)
[dependency-check] Processing Started for NVD CVE - 2008
[dependency-check] Processing Complete for NVD CVE - 2007  (6622 ms)
[dependency-check] Download Started for NVD CVE - 2009
[dependency-check] Download Complete for NVD CVE - 2009  (753 ms)
[dependency-check] Processing Started for NVD CVE - 2009
[dependency-check] Download Started for NVD CVE - 2010
[dependency-check] Processing Complete for NVD CVE - 2008  (8799 ms)
[dependency-check] Download Complete for NVD CVE - 2010  (658 ms)
[dependency-check] Processing Started for NVD CVE - 2010
[dependency-check] Download Started for NVD CVE - 2011
[dependency-check] Processing Complete for NVD CVE - 2009  (9278 ms)
[dependency-check] Download Complete for NVD CVE - 2011  (674 ms)
[dependency-check] Processing Started for NVD CVE - 2011
[dependency-check] Download Started for NVD CVE - 2012
[dependency-check] Download Complete for NVD CVE - 2012  (753 ms)
[dependency-check] Processing Started for NVD CVE - 2012
[dependency-check] Processing Complete for NVD CVE - 2010  (11323 ms)
[dependency-check] Download Started for NVD CVE - 2013
[dependency-check] Download Complete for NVD CVE - 2013  (710 ms)
[dependency-check] Processing Started for NVD CVE - 2013
[dependency-check] Processing Complete for NVD CVE - 2011  (11889 ms)
[dependency-check] Download Started for NVD CVE - 2014
[dependency-check] Download Complete for NVD CVE - 2014  (678 ms)
[dependency-check] Processing Started for NVD CVE - 2014
[dependency-check] Download Started for NVD CVE - 2015
[dependency-check] Processing Complete for NVD CVE - 2012  (14072 ms)
[dependency-check] Download Complete for NVD CVE - 2015  (723 ms)
[dependency-check] Processing Started for NVD CVE - 2015
[dependency-check] Download Started for NVD CVE - 2016
[dependency-check] Download Complete for NVD CVE - 2016  (731 ms)
[dependency-check] Processing Started for NVD CVE - 2016
[dependency-check] Processing Complete for NVD CVE - 2013  (14204 ms)
[dependency-check] Processing Complete for NVD CVE - 2014  (12944 ms)
[dependency-check] Download Started for NVD CVE - 2017
[dependency-check] Download Complete for NVD CVE - 2017  (767 ms)
[dependency-check] Processing Started for NVD CVE - 2017
[dependency-check] Processing Complete for NVD CVE - 2015  (10318 ms)
[dependency-check] Download Started for NVD CVE - 2018
[dependency-check] Download Complete for NVD CVE - 2018  (767 ms)
[dependency-check] Processing Started for NVD CVE - 2018
[dependency-check] Processing Complete for NVD CVE - 2016  (10937 ms)
[dependency-check] Download Started for NVD CVE - 2019
[dependency-check] Download Complete for NVD CVE - 2019  (751 ms)
[dependency-check] Processing Started for NVD CVE - 2019
[dependency-check] Processing Complete for NVD CVE - 2017  (11148 ms)
[dependency-check] Download Started for NVD CVE - 2020
[dependency-check] Download Complete for NVD CVE - 2020  (784 ms)
[dependency-check] Processing Started for NVD CVE - 2020
[dependency-check] Processing Complete for NVD CVE - 2018  (10346 ms)
[dependency-check] Download Started for NVD CVE - 2021
[dependency-check] Download Complete for NVD CVE - 2021  (760 ms)
[dependency-check] Processing Started for NVD CVE - 2021
[dependency-check] Processing Complete for NVD CVE - 2019  (13422 ms)
[dependency-check] Download Started for NVD CVE - 2022
[dependency-check] Download Complete for NVD CVE -

[jira] [Closed] (NUTCH-1429) CrawlDBReader to dump on exception and HTTP code

2023-01-08 Thread Sebastian Nagel (Jira)



 [ 
https://issues.apache.org/jira/browse/NUTCH-1429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel closed NUTCH-1429.
--

> CrawlDBReader to dump on exception and HTTP code
> 
>
> Key: NUTCH-1429
> URL: https://issues.apache.org/jira/browse/NUTCH-1429
> Project: Nutch
>  Issue Type: Improvement
>  Components: crawldb
>Reporter: Markus Jelsma
>Priority: Minor
>
> The CrawlDBReader tool can dump based on status and URL regex but not on 
> status db_gone combined with an HTTP exception and HTTP response code.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Resolved] (NUTCH-1429) CrawlDBReader to dump on exception and HTTP code

2023-01-08 Thread Sebastian Nagel (Jira)



 [ 
https://issues.apache.org/jira/browse/NUTCH-1429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-1429.

Resolution: Implemented

Closing this issue as it is implemented by NUTCH-1980.

> CrawlDBReader to dump on exception and HTTP code
> 
>
> Key: NUTCH-1429
> URL: https://issues.apache.org/jira/browse/NUTCH-1429
> Project: Nutch
>  Issue Type: Improvement
>  Components: crawldb
>Reporter: Markus Jelsma
>Priority: Minor
>
> The CrawlDBReader tool can dump based on status and URL regex but not on 
> status db_gone combined with an HTTP exception and HTTP response code.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Resolved] (NUTCH-849) different versions of the same library in nutch-2.0-dev.job and local\lib directory

2023-01-08 Thread Sebastian Nagel (Jira)



 [ 
https://issues.apache.org/jira/browse/NUTCH-849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-849.
---
Resolution: Abandoned

(closing this old dependency issue) - thanks anyway, [~phamtuanminh2004] !

> different versions of the same library in nutch-2.0-dev.job and local\lib 
> directory 
> 
>
> Key: NUTCH-849
> URL: https://issues.apache.org/jira/browse/NUTCH-849
> Project: Nutch
>  Issue Type: Task
>Affects Versions: 1.4, nutchgora
> Environment: Windows XP SP3, Cygwin
>Reporter: Pham Tuan Minh
>Priority: Minor
>
> Hi,
> I found that after building the runtime, nutch-2.0-dev.job and the local\lib 
> directory contain different versions of the same library:
> ant-1.7.1.jar
> ant-1.6.5.jar
> servlet-api-2.5-20081211.jar
> servlet-api-2.5-6.1.14.jar
> I suspect these libraries come from different dependency branches. Can anyone 
> help me fix it?
> Thanks,



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Closed] (NUTCH-849) different versions of the same library in nutch-2.0-dev.job and local\lib directory

2023-01-08 Thread Sebastian Nagel (Jira)



 [ 
https://issues.apache.org/jira/browse/NUTCH-849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel closed NUTCH-849.
-

> different versions of the same library in nutch-2.0-dev.job and local\lib 
> directory 
> 
>
> Key: NUTCH-849
> URL: https://issues.apache.org/jira/browse/NUTCH-849
> Project: Nutch
>  Issue Type: Task
>Affects Versions: 1.4, nutchgora
> Environment: Windows XP SP3, Cygwin
>Reporter: Pham Tuan Minh
>Priority: Minor
>
> Hi,
> I found that after building the runtime, nutch-2.0-dev.job and the local\lib 
> directory contain different versions of the same library:
> ant-1.7.1.jar
> ant-1.6.5.jar
> servlet-api-2.5-20081211.jar
> servlet-api-2.5-6.1.14.jar
> I suspect these libraries come from different dependency branches. Can anyone 
> help me fix it?
> Thanks,



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Updated] (NUTCH-1942) Remove TopLevelDomain

2023-01-08 Thread Sebastian Nagel (Jira)



 [ 
https://issues.apache.org/jira/browse/NUTCH-1942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-1942:
---
Fix Version/s: 1.20

> Remove TopLevelDomain 
> --
>
> Key: NUTCH-1942
> URL: https://issues.apache.org/jira/browse/NUTCH-1942
> Project: Nutch
>  Issue Type: Task
>Reporter: Julien Nioche
>Priority: Minor
>  Labels: crawler-commons, newbie
> Fix For: 1.20
>
>
> We should leverage the domain related utilities from crawler-commons instead 
> of duplicating them in the `org.apache.nutch.util.domain` package. For 
> instance we could deprecate TopLevelDomain and call the corresponding class 
> in CC instead. The resources in CC are more up-to-date and it is less code to 
> maintain.
> This would be a good task for someone willing to get to know the Nutch 
> codebase better and impress us all with the extent of his/her skills.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Resolved] (NUTCH-1687) Pick queue in Round Robin

2023-01-08 Thread Sebastian Nagel (Jira)



 [ 
https://issues.apache.org/jira/browse/NUTCH-1687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-1687.

Resolution: Implemented

Thanks, [~tiennm]! Closing this issue as picking from the queues round robin 
was implemented along with NUTCH-2767.

> Pick queue in Round Robin
> -
>
> Key: NUTCH-1687
> URL: https://issues.apache.org/jira/browse/NUTCH-1687
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Reporter: Tien Nguyen Manh
>Priority: Minor
> Attachments: NUTCH-1687-2.patch, NUTCH-1687.patch, 
> NUTCH-1687.tejasp.v1.patch
>
>
> Currently we choose the queue to pick a URL from by always starting at the 
> beginning of the queues list, so queues at the start of the list have a higher 
> chance of being picked first. That can cause a long-tail problem, where only a 
> few queues holding many URLs are left at the end.
> public synchronized FetchItem getFetchItem() {
>   final Iterator<Map.Entry<String, FetchItemQueue>> it =
>     queues.entrySet().iterator(); // ==> always resets to search for a queue from the start
>   while (it.hasNext()) {
> 
> I think it is better to pick queues in round robin. That can reduce the time 
> needed to find an available queue, makes sure all queues are picked in turn, 
> and, if we use TopN in the generator, leaves no long-tail queues at the end.
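
The round-robin idea described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation from NUTCH-2767 or from the attached patches: the class RoundRobinQueues and its methods are hypothetical, queues hold plain URL strings, and politeness delays and in-progress counters are left out. The point is only that the search for a non-empty queue resumes after the queue served last instead of restarting at the head of the list.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class RoundRobinQueues {
    // queue id (e.g. host name) -> pending URLs, in insertion order
    private final Map<String, Deque<String>> queues = new LinkedHashMap<>();
    // index of the queue to try first on the next call
    private int next = 0;

    synchronized void add(String queueId, String url) {
        queues.computeIfAbsent(queueId, k -> new ArrayDeque<>()).add(url);
    }

    /** Returns the next URL in round-robin order, or null if all queues are empty. */
    synchronized String poll() {
        // Copying the values keeps the sketch short; a real fetcher would keep a cursor.
        List<Deque<String>> all = new ArrayList<>(queues.values());
        int n = all.size();
        for (int i = 0; i < n; i++) {
            int idx = (next + i) % n;
            String url = all.get(idx).poll();
            if (url != null) {
                next = (idx + 1) % n; // resume after the queue just served
                return url;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        RoundRobinQueues q = new RoundRobinQueues();
        q.add("a.example", "a1");
        q.add("a.example", "a2");
        q.add("b.example", "b1");
        System.out.println(q.poll()); // a1
        System.out.println(q.poll()); // b1
        System.out.println(q.poll()); // a2
    }
}
```

With a restart-from-the-head strategy the same input would be served as a1, a2, b1, so the first queue is drained before the second one is touched.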



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Closed] (NUTCH-1687) Pick queue in Round Robin

2023-01-08 Thread Sebastian Nagel (Jira)



 [ 
https://issues.apache.org/jira/browse/NUTCH-1687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel closed NUTCH-1687.
--

> Pick queue in Round Robin
> -
>
> Key: NUTCH-1687
> URL: https://issues.apache.org/jira/browse/NUTCH-1687
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Reporter: Tien Nguyen Manh
>Priority: Minor
> Attachments: NUTCH-1687-2.patch, NUTCH-1687.patch, 
> NUTCH-1687.tejasp.v1.patch
>
>
> Currently we choose the queue to pick a URL from by always starting at the 
> beginning of the queues list, so queues at the start of the list have a higher 
> chance of being picked first. That can cause a long-tail problem, where only a 
> few queues holding many URLs are left at the end.
> public synchronized FetchItem getFetchItem() {
>   final Iterator<Map.Entry<String, FetchItemQueue>> it =
>     queues.entrySet().iterator(); // ==> always resets to search for a queue from the start
>   while (it.hasNext()) {
> 
> I think it is better to pick queues in round robin. That can reduce the time 
> needed to find an available queue, makes sure all queues are picked in turn, 
> and, if we use TopN in the generator, leaves no long-tail queues at the end.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Closed] (NUTCH-1538) tuning of loaded fields during fetcherJob start-up

2023-01-08 Thread Sebastian Nagel (Jira)



 [ 
https://issues.apache.org/jira/browse/NUTCH-1538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel closed NUTCH-1538.
--

> tuning of loaded fields during fetcherJob start-up
> --
>
> Key: NUTCH-1538
> URL: https://issues.apache.org/jira/browse/NUTCH-1538
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Affects Versions: 2.1
> Environment: nutch 2.1 / cassandra 1.2.1 / gora-cassandra 0.2 / 
> gora-core 0.2.1 
> running fetch with parse=true
>Reporter: Roland von Herget
>Priority: Major
> Attachments: NUTCH-1538-FetcherJob-v1.patch
>
>
> Main problem is, nutch is loading nearly every row & column from DB during 
> startup of a fetcherJob when fetcher.parse=true.
> A parserJob needs e.g. the CONTENT field from db, to parse.
> The fetcherJob adds all fields of the parserJob to its needed fields if 
> running with fetcher.parse=true. [FetcherJob.getFields()]
> If the nutch configuration saves all fetched data to DB 
> (fetcher.store.content=true) you'll end up loading GBs of unused content 
> during fetcherJob start-up.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Resolved] (NUTCH-1538) tuning of loaded fields during fetcherJob start-up

2023-01-08 Thread Sebastian Nagel (Jira)



 [ 
https://issues.apache.org/jira/browse/NUTCH-1538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-1538.

Resolution: Won't Fix

(closing this issue as the 2.x branch isn't maintained anymore)

> tuning of loaded fields during fetcherJob start-up
> --
>
> Key: NUTCH-1538
> URL: https://issues.apache.org/jira/browse/NUTCH-1538
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Affects Versions: 2.1
> Environment: nutch 2.1 / cassandra 1.2.1 / gora-cassandra 0.2 / 
> gora-core 0.2.1 
> running fetch with parse=true
>Reporter: Roland von Herget
>Priority: Major
> Attachments: NUTCH-1538-FetcherJob-v1.patch
>
>
> Main problem is, nutch is loading nearly every row & column from DB during 
> startup of a fetcherJob when fetcher.parse=true.
> A parserJob needs e.g. the CONTENT field from db, to parse.
> The fetcherJob adds all fields of the parserJob to its needed fields if 
> running with fetcher.parse=true. [FetcherJob.getFields()]
> If the nutch configuration saves all fetched data to DB 
> (fetcher.store.content=true) you'll end up loading GBs of unused content 
> during fetcherJob start-up.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Closed] (NUTCH-1574) Crawling parent directories for http(s) protocol

2023-01-08 Thread Sebastian Nagel (Jira)



 [ 
https://issues.apache.org/jira/browse/NUTCH-1574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel closed NUTCH-1574.
--

> Crawling parent directories for http(s) protocol 
> -
>
> Key: NUTCH-1574
> URL: https://issues.apache.org/jira/browse/NUTCH-1574
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.6
>Reporter: Antoinette
>Priority: Major
>
> I am looking for a fix to prevent indexing the list of files crawled via 
> http(s) protocol. For example: I have 10 files in a directory. Nutch finds 
> and Solr indexes 11, the first being a list of the other 10 files.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Resolved] (NUTCH-1574) Crawling parent directories for http(s) protocol

2023-01-08 Thread Sebastian Nagel (Jira)



 [ 
https://issues.apache.org/jira/browse/NUTCH-1574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-1574.

Resolution: Not A Bug

(sorry, this issue went unnoticed for a long time) I can think of two solutions:
 * use a different URL filter configuration during indexing to filter out the 
file listings (assuming they're identifiable by URL)
 * use exchanges (NUTCH-2412)

> Crawling parent directories for http(s) protocol 
> -
>
> Key: NUTCH-1574
> URL: https://issues.apache.org/jira/browse/NUTCH-1574
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.6
>Reporter: Antoinette
>Priority: Major
>
> I am looking for a fix to prevent indexing the list of files crawled via 
> http(s) protocol. For example: I have 10 files in a directory. Nutch finds 
> and Solr indexes 11, the first being a list of the other 10 files.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Closed] (NUTCH-1813) Use \u.... escapes for non-ASCII chars in TestURLUtil

2023-01-08 Thread Sebastian Nagel (Jira)



 [ 
https://issues.apache.org/jira/browse/NUTCH-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel closed NUTCH-1813.
--

> Use \u.... escapes for non-ASCII chars in TestURLUtil
> -
>
> Key: NUTCH-1813
> URL: https://issues.apache.org/jira/browse/NUTCH-1813
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 2.3
> Environment: java version "1.7.0_51"
> Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)
> Mac OSX 10.9.4, MacBookPro 64 Bit.
>Reporter: Valerio Schiavoni
>Priority: Major
> Attachments: NUTCH-1813-2x-v1.patch, NUTCH-1813-trunk-v1.patch
>
>
> To reproduce, git clone the latest 2.x branch and execute the TestURLUtil 
> tests.
> There are 4 test failures and 1 error.
> Failing tests:
> testToUNICODE:org.junit.ComparisonFailure: expected: 
> but was:
>   at org.junit.Assert.assertEquals(Assert.java:115)
>   at org.junit.Assert.assertEquals(Assert.java:144)
>   at org.apache.nutch.util.TestURLUtil.testToUNICODE(TestURLUtil.java:263)
> testChooseRepr:org.junit.ComparisonFailure: expected: but 
> was:
>   at org.junit.Assert.assertEquals(Assert.java:115)
>   at org.junit.Assert.assertEquals(Assert.java:144)
>   at 
> org.apache.nutch.util.TestURLUtil.testChooseRepr(TestURLUtil.java:179)
> testGetDomainName:
> org.junit.ComparisonFailure: expected:<[apache.]org> but was:<[]org>
>   at org.junit.Assert.assertEquals(Assert.java:115)
>   at org.junit.Assert.assertEquals(Assert.java:144)
>   at 
> org.apache.nutch.util.TestURLUtil.testGetDomainName(TestURLUtil.java:35)
> testToASCII:
> java.lang.AssertionError: expected: but 
> was:
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:743)
>   at org.junit.Assert.assertEquals(Assert.java:118)
>   at org.junit.Assert.assertEquals(Assert.java:144)
>   at org.apache.nutch.util.TestURLUtil.testToASCII(TestURLUtil.java:273)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Resolved] (NUTCH-1813) Use \u.... escapes for non-ASCII chars in TestURLUtil

2023-01-08 Thread Sebastian Nagel (Jira)



 [ 
https://issues.apache.org/jira/browse/NUTCH-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-1813.

Resolution: Won't Fix

Closing this issue as "won't fix" - the properties file "default.properties" 
defines the encoding of Java source files as UTF-8. A search for Java source 
files including non-ASCII characters ({{git grep -P '\P{Ascii}' **.java}}) 
shows that they're widely used in the Nutch source code. Using 
escapes makes the code less readable. Thanks anyway!

> Use \u.... escapes for non-ASCII chars in TestURLUtil
> -
>
> Key: NUTCH-1813
> URL: https://issues.apache.org/jira/browse/NUTCH-1813
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 2.3
> Environment: java version "1.7.0_51"
> Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)
> Mac OSX 10.9.4, MacBookPro 64 Bit.
>Reporter: Valerio Schiavoni
>Priority: Major
> Attachments: NUTCH-1813-2x-v1.patch, NUTCH-1813-trunk-v1.patch
>
>
> To reproduce, git clone the latest 2.x branch and execute the TestURLUtil 
> tests.
> There are 4 test failures and 1 error.
> Failing tests:
> testToUNICODE:org.junit.ComparisonFailure: expected: 
> but was:
>   at org.junit.Assert.assertEquals(Assert.java:115)
>   at org.junit.Assert.assertEquals(Assert.java:144)
>   at org.apache.nutch.util.TestURLUtil.testToUNICODE(TestURLUtil.java:263)
> testChooseRepr:org.junit.ComparisonFailure: expected: but 
> was:
>   at org.junit.Assert.assertEquals(Assert.java:115)
>   at org.junit.Assert.assertEquals(Assert.java:144)
>   at 
> org.apache.nutch.util.TestURLUtil.testChooseRepr(TestURLUtil.java:179)
> testGetDomainName:
> org.junit.ComparisonFailure: expected:<[apache.]org> but was:<[]org>
>   at org.junit.Assert.assertEquals(Assert.java:115)
>   at org.junit.Assert.assertEquals(Assert.java:144)
>   at 
> org.apache.nutch.util.TestURLUtil.testGetDomainName(TestURLUtil.java:35)
> testToASCII:
> java.lang.AssertionError: expected: but 
> was:
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:743)
>   at org.junit.Assert.assertEquals(Assert.java:118)
>   at org.junit.Assert.assertEquals(Assert.java:144)
>   at org.apache.nutch.util.TestURLUtil.testToASCII(TestURLUtil.java:273)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Resolved] (NUTCH-1822) Page outlinks clearance is not appropriate

2023-01-08 Thread Sebastian Nagel (Jira)



 [ 
https://issues.apache.org/jira/browse/NUTCH-1822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-1822.

Resolution: Won't Fix

(closing 2.x issue as this version isn't maintained anymore)

> Page outlinks  clearance is not appropriate
> ---
>
> Key: NUTCH-1822
> URL: https://issues.apache.org/jira/browse/NUTCH-1822
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 2.1
> Environment: Nutch-2.1
> Hadoop-0.20.205
> HBase-0.90.6
> hbase-gora-0.2.1
>Reporter: Riyaz Shaik
>Priority: Major
>
> 1. When a page is re-crawled and new outlink URLs are found along with the 
> existing ones, the old outlinks are removed and only the new URLs are 
> written to HBase.
> Ex:
> Crawl cycle 1 for www.123.com, identified outlinks are
> ol --> abc.com
> ol --> pqr.com
> Crawl cycle 2 of the same www.123.com, the outlinks are
> (note that abc.com was removed and xyz.com was added)
> ol --> pqr.com
> ol --> xyz.com
> At the end of crawl cycle 2, HBase has only xyz.com as outlink:
> ol --> xyz.com
> Expected:
> ol --> pqr.com
> ol --> xyz.com
> 2. If some of the outlinks of a page were removed and no new outlinks were 
> added, re-crawling the page does not clear the obsolete/removed 
> outlinks from HBase.
> Ex: Cycle 1 crawled page www.test.com, identified outlinks are
> ol --> link1
> ol --> link2
> ol --> link3
> Cycle 2, same page (www.test.com) re-crawled, identified outlinks are
> (Note: only link2 was removed, no new links were added)
> ol --> link1
> ol --> link3
> but at the end of cycle 2, all 3 outlinks are still in HBase.
> In HBase:
> ol --> link1
> ol --> link2
> ol --> link3
> Expected:
> ol --> link1
> ol --> link3
> According to the code in ParseUtil.java, it seems to remove the old links and 
> insert only the new links:
> if (page.getOutlinks() != null) { page.getOutlinks().clear(); }
> http://lucene.472066.n3.nabble.com/Nutch-New-outlinks-removes-old-valid-outlinks-td4146676.html
> Thanks
> Riyaz
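
In both scenarios of the report, the expected result is that the stored outlinks equal the outlinks found by the latest parse. A minimal sketch of that expectation follows. It is not Nutch 2.x code: the class OutlinkSync, the method sync and the use of a plain URL-to-anchor map are assumptions made for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class OutlinkSync {
    private OutlinkSync() {}

    /**
     * Makes the stored outlinks (URL -> anchor text) mirror the freshly parsed ones:
     * obsolete links are dropped, surviving links are kept, new links are added.
     */
    static void sync(Map<String, String> stored, Map<String, String> parsed) {
        stored.keySet().retainAll(parsed.keySet()); // drop outlinks no longer on the page
        stored.putAll(parsed);                      // add new outlinks, refresh anchors
    }

    public static void main(String[] args) {
        Map<String, String> stored = new LinkedHashMap<>();
        stored.put("abc.com", "ol");
        stored.put("pqr.com", "ol");

        Map<String, String> parsed = new LinkedHashMap<>();
        parsed.put("pqr.com", "ol");
        parsed.put("xyz.com", "ol");

        sync(stored, parsed);
        System.out.println(stored.keySet()); // [pqr.com, xyz.com]
    }
}
```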



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Closed] (NUTCH-1822) Page outlinks clearance is not appropriate

2023-01-08 Thread Sebastian Nagel (Jira)



 [ 
https://issues.apache.org/jira/browse/NUTCH-1822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel closed NUTCH-1822.
--

> Page outlinks  clearance is not appropriate
> ---
>
> Key: NUTCH-1822
> URL: https://issues.apache.org/jira/browse/NUTCH-1822
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 2.1
> Environment: Nutch-2.1
> Hadoop-0.20.205
> HBase-0.90.6
> hbase-gora-0.2.1
>Reporter: Riyaz Shaik
>Priority: Major
>
> 1. When a page is re-crawled and new outlink URLs are found along with the 
> existing ones, the old outlinks are removed and only the new URLs are 
> written to HBase.
> Ex:
> Crawl cycle 1 for www.123.com, identified outlinks are
> ol --> abc.com
> ol --> pqr.com
> Crawl cycle 2 of the same www.123.com, the outlinks are
> (note that abc.com was removed and xyz.com was added)
> ol --> pqr.com
> ol --> xyz.com
> At the end of crawl cycle 2, HBase has only xyz.com as outlink:
> ol --> xyz.com
> Expected:
> ol --> pqr.com
> ol --> xyz.com
> 2. If some of the outlinks of a page were removed and no new outlinks were 
> added, re-crawling the page does not clear the obsolete/removed 
> outlinks from HBase.
> Ex: Cycle 1 crawled page www.test.com, identified outlinks are
> ol --> link1
> ol --> link2
> ol --> link3
> Cycle 2, same page (www.test.com) re-crawled, identified outlinks are
> (Note: only link2 was removed, no new links were added)
> ol --> link1
> ol --> link3
> but at the end of cycle 2, all 3 outlinks are still in HBase.
> In HBase:
> ol --> link1
> ol --> link2
> ol --> link3
> Expected:
> ol --> link1
> ol --> link3
> According to the code in ParseUtil.java, it seems to remove the old links and 
> insert only the new links:
> if (page.getOutlinks() != null) { page.getOutlinks().clear(); }
> http://lucene.472066.n3.nabble.com/Nutch-New-outlinks-removes-old-valid-outlinks-td4146676.html
> Thanks
> Riyaz



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Resolved] (NUTCH-2634) Some links marked as "nofollow" are followed anyway.

2023-01-08 Thread Sebastian Nagel (Jira)



 [ 
https://issues.apache.org/jira/browse/NUTCH-2634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2634.

Resolution: Fixed

> Some links marked as "nofollow" are followed anyway.
> 
>
> Key: NUTCH-2634
> URL: https://issues.apache.org/jira/browse/NUTCH-2634
> Project: Nutch
>  Issue Type: Bug
>Reporter: Gerard Bouchar
>Priority: Major
> Fix For: 1.20
>
>
> In order to check if an outlink in an <a> tag can be followed, nutch checks 
> whether the value of its rel attribute is the exact string "nofollow".
> However, [the rel attribute can contain a list of link 
> types|https://html.spec.whatwg.org/multipage/links.html#attr-hyperlink-rel], 
> all of which should be respected.
> So nutch rightfully doesn't follow a link like:
> {code:html}
> DO NOT FOLLOW THIS LINK
> {code}
> but wrongfully follows :
> {code:html}
> DO NOT FOLLOW THIS 
> LINK
> {code}
> Because of the code duplication in nutch's html parsers, this should be fixed 
> in two places:
> # 
> [parse/html/DOMContentUtils.java|https://github.com/apache/nutch/blob/3ada351a26b653b307c19e25b17e0e611a9bd59a/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java#L437]
> # 
> [parse/tika/DOMContentUtils.java|https://github.com/apache/nutch/blob/f02110f42c53e77450835776cf41f22c23f030ec/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java#L410]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (NUTCH-2634) Some links marked as "nofollow" are followed anyway.

2023-01-08 Thread Sebastian Nagel (Jira)



[ 
https://issues.apache.org/jira/browse/NUTCH-2634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17655826#comment-17655826
 ] 

Sebastian Nagel commented on NUTCH-2634:


Thanks, [~markus17]!

> Some links marked as "nofollow" are followed anyway.
> 
>
> Key: NUTCH-2634
> URL: https://issues.apache.org/jira/browse/NUTCH-2634
> Project: Nutch
>  Issue Type: Bug
>Reporter: Gerard Bouchar
>Priority: Major
> Fix For: 1.20
>
>
> In order to check if an outlink in an <a> tag can be followed, nutch checks 
> whether the value of its rel attribute is the exact string "nofollow".
> However, [the rel attribute can contain a list of link 
> types|https://html.spec.whatwg.org/multipage/links.html#attr-hyperlink-rel], 
> all of which should be respected.
> So nutch rightfully doesn't follow a link like:
> {code:html}
> DO NOT FOLLOW THIS LINK
> {code}
> but wrongfully follows :
> {code:html}
> DO NOT FOLLOW THIS 
> LINK
> {code}
> Because of the code duplication in nutch's html parsers, this should be fixed 
> in two places:
> # 
> [parse/html/DOMContentUtils.java|https://github.com/apache/nutch/blob/3ada351a26b653b307c19e25b17e0e611a9bd59a/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java#L437]
> # 
> [parse/tika/DOMContentUtils.java|https://github.com/apache/nutch/blob/f02110f42c53e77450835776cf41f22c23f030ec/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java#L410]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (NUTCH-2634) Some links marked as "nofollow" are followed anyway.

2023-01-08 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/NUTCH-2634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17655827#comment-17655827
 ] 

ASF GitHub Bot commented on NUTCH-2634:
---

sebastian-nagel merged PR #751:
URL: https://github.com/apache/nutch/pull/751




> Some links marked as "nofollow" are followed anyway.
> 
>
> Key: NUTCH-2634
> URL: https://issues.apache.org/jira/browse/NUTCH-2634
> Project: Nutch
>  Issue Type: Bug
>Reporter: Gerard Bouchar
>Priority: Major
> Fix For: 1.20
>
>
> In order to check if an outlink in an <a> tag can be followed, nutch checks 
> whether the value of its rel attribute is the exact string "nofollow".
> However, [the rel attribute can contain a list of link 
> types|https://html.spec.whatwg.org/multipage/links.html#attr-hyperlink-rel], 
> all of which should be respected.
> So nutch rightfully doesn't follow a link like:
> {code:html}
> DO NOT FOLLOW THIS LINK
> {code}
> but wrongfully follows :
> {code:html}
> DO NOT FOLLOW THIS 
> LINK
> {code}
> Because of the code duplication in nutch's html parsers, this should be fixed 
> in two places:
> # 
> [parse/html/DOMContentUtils.java|https://github.com/apache/nutch/blob/3ada351a26b653b307c19e25b17e0e611a9bd59a/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java#L437]
> # 
> [parse/tika/DOMContentUtils.java|https://github.com/apache/nutch/blob/f02110f42c53e77450835776cf41f22c23f030ec/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java#L410]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[GitHub] [nutch] sebastian-nagel merged pull request #751: NUTCH-2634 Some links marked as "nofollow" are followed anyway

2023-01-08 Thread GitBox



sebastian-nagel merged PR #751:
URL: https://github.com/apache/nutch/pull/751


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@nutch.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
