[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2018-03-12 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16394940#comment-16394940
 ] 

Sebastian Nagel commented on NUTCH-1465:


The feature is already ported to 2.x, see NUTCH-1741, but using a different 
approach.

> Support sitemaps in Nutch
> -
>
> Key: NUTCH-1465
> URL: https://issues.apache.org/jira/browse/NUTCH-1465
> Project: Nutch
> Issue Type: New Feature
> Components: parser
> Reporter: Lewis John McGibbney
> Assignee: Markus Jelsma
> Priority: Major
> Fix For: 1.14
>
> Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, 
> NUTCH-1465-trunk.v1.patch, NUTCH-1465-trunk.v2.patch, 
> NUTCH-1465-trunk.v3.patch, NUTCH-1465-trunk.v4.patch, 
> NUTCH-1465-trunk.v5.patch, NUTCH-1465.patch, NUTCH-1465.patch, 
> NUTCH-1465.patch, NUTCH-1465.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 
> licensed and appears to have been used successfully to parse sitemaps as per 
> the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] 
> http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2018-03-09 Thread Ben Vachon (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16393274#comment-16393274
 ] 

Ben Vachon commented on NUTCH-1465:
---

Is there any plan to pull this to 2.x?



[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2017-10-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16209562#comment-16209562
 ] 

ASF GitHub Bot commented on NUTCH-1465:
---

sebastian-nagel commented on issue #189: NUTCH-1465 Support sitemaps in Nutch
URL: https://github.com/apache/nutch/pull/189#issuecomment-337636717
 
 
   Hi @marconett, please ask for help on the [Nutch user mailing 
list](http://nutch.apache.org/mailing_lists.html) or report the problem at 
https://issues.apache.org/jira/projects/NUTCH. That's a closed pull request, 
and nothing will be fixed here. Thanks!


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2017-10-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16209547#comment-16209547
 ] 

ASF GitHub Bot commented on NUTCH-1465:
---

marconett commented on issue #189: NUTCH-1465 Support sitemaps in Nutch
URL: https://github.com/apache/nutch/pull/189#issuecomment-337633586
 
 
   I'm running into the same problem and am unable to inject sitemap content 
into the db. Here are the commands I used (output omitted; it's the same as 
above):
   
   ```
   bin/nutch inject crawl/crawldb urls/
   bin/nutch sitemap crawl/crawldb -sitemapUrls sitemaps/ -noStrict -noFilter -noNormalize
   bin/nutch readdb crawl/crawldb -stats
   ```
   
   where `urls/seed.txt` contains "https://www.linux.com/" and 
`sitemaps/seed.txt` contains "https://www.linux.com/sitemap.xml".
   
   I see (tcpdump) that there are https connections being established to 
linux.com while `bin/nutch sitemap` is running. But nothing gets injected into 
the crawldb.
   
   Is there any info on this? Should this be fixed?
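
One way to narrow down a report like this is to check, outside Nutch, that the sitemap itself parses and actually lists URLs. The following is a minimal Java sketch using only the JDK's XML parser. It is not how `bin/nutch sitemap` or crawler-commons works internally; the class name `SitemapLocLister` and the inline sample document are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not part of Nutch or crawler-commons.
final class SitemapLocLister {

    /** Returns the trimmed text of every <loc> element in the given sitemap XML. */
    static List<String> listLocs(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        // Refuse DOCTYPE declarations so no external entities are resolved.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        Document doc = factory.newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        // "*" matches <loc> regardless of the namespace the sitemap declares.
        NodeList nodes = doc.getElementsByTagNameNS("*", "loc");
        List<String> locs = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            locs.add(nodes.item(i).getTextContent().trim());
        }
        return locs;
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample; replace with the content of the sitemap under test.
        String sample =
            "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">"
            + "<url><loc>https://example.com/a</loc></url>"
            + "<url><loc>https://example.com/b</loc></url>"
            + "</urlset>";
        listLocs(sample).forEach(System.out::println);
    }
}
```

If the list comes back empty for the real file, the cause is more likely the sitemap document (or how it is served) than the CrawlDb update step.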




[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2017-08-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16127556#comment-16127556
 ] 

ASF GitHub Bot commented on NUTCH-1465:
---

sebastian-nagel closed pull request #202: NUTCH-1465 Support for sitemaps
URL: https://github.com/apache/nutch/pull/202
 
 
   
 



[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2017-08-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16127555#comment-16127555
 ] 

ASF GitHub Bot commented on NUTCH-1465:
---

sebastian-nagel commented on issue #202: NUTCH-1465 Support for sitemaps
URL: https://github.com/apache/nutch/pull/202#issuecomment-322529338
 
 
   Yes, of course.
 



[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2017-08-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16127483#comment-16127483
 ] 

ASF GitHub Bot commented on NUTCH-1465:
---

lewismc commented on issue #202: NUTCH-1465 Support for sitemaps
URL: https://github.com/apache/nutch/pull/202#issuecomment-322518126
 
 
   @sebastian-nagel ping
 



[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2017-07-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101785#comment-16101785
 ] 

ASF GitHub Bot commented on NUTCH-1465:
---

lewismc closed pull request #195: NUTCH-1465 Support sitemaps in Nutch
URL: https://github.com/apache/nutch/pull/195
 
 
   
 



[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2017-07-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101783#comment-16101783
 ] 

ASF GitHub Bot commented on NUTCH-1465:
---

lewismc commented on issue #202: NUTCH-1465 Support for sitemaps
URL: https://github.com/apache/nutch/pull/202#issuecomment-318086684
 
 
   @sebastian-nagel, Markus' patch made it into master branch... is this 
correct? If so then we can close this issue.
 



[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2017-07-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101784#comment-16101784
 ] 

ASF GitHub Bot commented on NUTCH-1465:
---

lewismc closed pull request #189: NUTCH-1465 Support sitemaps in Nutch
URL: https://github.com/apache/nutch/pull/189
 
 
   
 



[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2017-07-19 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16092964#comment-16092964
 ] 

Hudson commented on NUTCH-1465:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3435 (See 
[https://builds.apache.org/job/Nutch-trunk/3435/])
NUTCH-1465 (markus: 
[https://github.com/apache/nutch/commit/b58d6cd9111b2d25b8f6f009015ac214bac4006d])
* (edit) conf/log4j.properties
* (add) src/java/org/apache/nutch/util/SitemapProcessor.java
* (edit) ivy/ivy.xml
* (edit) conf/nutch-default.xml
* (edit) src/bin/nutch




[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2017-07-19 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16092945#comment-16092945
 ] 

Markus Jelsma commented on NUTCH-1465:
--

Crap! I was probably looking without seeing! Got it!



[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2017-07-19 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16092938#comment-16092938
 ] 

Sebastian Nagel commented on NUTCH-1465:


I've modified the description of the properties 
[sitemap.strict.parsing|https://github.com/sebastian-nagel/nutch/commit/de92a387ba5314b202da9fc006979927fe697be0#diff-d45b2920590dbd66188eb546753d1834R2555]
 and 
[sitemap.url.overwrite.existing|https://github.com/sebastian-nagel/nutch/commit/de92a387ba5314b202da9fc006979927fe697be0#diff-d45b2920590dbd66188eb546753d1834R2589].
 But feel free to add your modifications/additions. I just tried to make it 
understandable by anyone who does not know the gory details.



[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2017-07-19 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16092909#comment-16092909
 ] 

Markus Jelsma commented on NUTCH-1465:
--

Sebastian, your patch also touches CrawlDatum and IndexingFilterChecker, just 
for the newline at the tail. No problem, but I do miss your updated 
description of the properties. I cannot find them in 
https://github.com/apache/nutch/pull/202.patch



[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2017-07-17 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16090460#comment-16090460
 ] 

Markus Jelsma commented on NUTCH-1465:
--

Thanks! Will grab 202.patch and see if it fits tomorrow!



[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2017-07-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16090044#comment-16090044
 ] 

ASF GitHub Bot commented on NUTCH-1465:
---

sebastian-nagel opened a new pull request #202: NUTCH-1465 Support for sitemaps
URL: https://github.com/apache/nutch/pull/202
 
 
   (applied Markus' patch as of 2017-07-05)
   - add SitemapProcessor
   - upgrade dependency crawler-commons to 0.8
 



[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2017-07-17 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16090038#comment-16090038
 ] 

Sebastian Nagel commented on NUTCH-1465:


Thanks, [~markus.jel...@openindex.io]! Tested on a small set of sitemaps. Looks 
good to me, I've only improved the description of properties and did some code 
clean-up (patch / pull-request to follow). Please, go ahead and commit it! We 
can later improve it to make it more robust or to make sophisticated use of 
last modified time and priorities provided in sitemaps. Thanks!
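
As an illustration of what using the last modified time would involve: the sitemap protocol writes `<lastmod>` in W3C Datetime format, which may be a bare date or a full timestamp with an offset. The sketch below, in plain Java, normalizes both forms to an `Instant`. It is not taken from the committed SitemapProcessor; the class name `LastModParser` is hypothetical, treating a bare date as midnight UTC is an assumption, and the year-only and year-month forms are deliberately left unhandled.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.OffsetDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeParseException;

// Hypothetical helper for illustration; not part of Nutch or crawler-commons.
final class LastModParser {

    /**
     * Parses a sitemap lastmod value given either as a full timestamp with
     * offset (2017-07-05T12:30:00+02:00) or as a bare date (2017-07-05).
     * Assumption: a bare date is interpreted as midnight UTC.
     */
    static Instant parseLastMod(String value) {
        String v = value.trim();
        try {
            return OffsetDateTime.parse(v).toInstant();
        } catch (DateTimeParseException notATimestamp) {
            // Falls through to the date-only form; throws if that fails too.
            return LocalDate.parse(v).atStartOfDay(ZoneOffset.UTC).toInstant();
        }
    }

    public static void main(String[] args) {
        System.out.println(parseLastMod("2017-07-05T12:30:00+02:00"));
        System.out.println(parseLastMod("2017-07-05"));
    }
}
```

A value that matches neither form raises `DateTimeParseException`, so a caller would have to decide whether to skip the entry or fall back to the fetch time.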



[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2017-07-07 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16078124#comment-16078124
 ] 

Markus Jelsma commented on NUTCH-1465:
--

I think this is committable. Does anyone disagree? If not, I'll get this in 
early next week.



[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2017-07-05 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16074554#comment-16074554
 ] 

Markus Jelsma commented on NUTCH-1465:
--

Hi Lewis, 0.8 doesn't deal with this sitemap at autotrader either.



[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2017-07-04 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16074046#comment-16074046
 ] 

Lewis John McGibbney commented on NUTCH-1465:
-

[~markus17] can we also update the version of crawler commons to 0.8 which is 
the latest version available in Maven Central? I'll take a look into the 
processing logic once the update has been made. Thanks Markus.



[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2017-07-04 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16073338#comment-16073338
 ] 

Markus Jelsma commented on NUTCH-1465:
--

Ah, I see. The autotrader sitemap points to an index of sitemaps. Everything is 
fine except that it does not pass if(sitemap.isIndex()). When printing its 
getType() I get null. So something is wrong with either the sitemap index, 
crawler-commons, or myself.
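
One way to narrow this down is to check what the fetched document actually declares as its root element, independently of the parsing library. The sitemap protocol uses a `sitemapindex` root for index files and a `urlset` root for plain URL lists. The following is a minimal sketch using only the JDK's XML parser; the class `SitemapKind` and its `classify` method are hypothetical names for illustration and are not part of Nutch or crawler-commons.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical helper for illustration; not part of Nutch or crawler-commons.
final class SitemapKind {
  enum Kind { INDEX, URLSET, UNKNOWN }

  // Classifies a sitemap document by the local name of its root element.
  static Kind classify(String xml) {
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      factory.setNamespaceAware(true); // so getLocalName() ignores any prefix
      Document doc = factory.newDocumentBuilder()
          .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
      String root = doc.getDocumentElement().getLocalName();
      if ("sitemapindex".equals(root)) return Kind.INDEX;
      if ("urlset".equals(root)) return Kind.URLSET;
      return Kind.UNKNOWN;
    } catch (Exception e) {
      return Kind.UNKNOWN; // unparsable input: report unknown rather than fail
    }
  }
}
```

If such a check reports an index for the fetched bytes while the library still returns a null type, the document itself is probably not at fault.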



[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2017-07-04 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16073288#comment-16073288
 ] 

Markus Jelsma commented on NUTCH-1465:
--

Hello Lewis, I am positive I took the latest pieces. And checking out the GH 
page, that problem wasn't solved in the first place, right? Or am I missing 
something? https://github.com/apache/nutch/pull/189#discussion_r113578491



[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2017-07-03 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16072959#comment-16072959
 ] 

Lewis John McGibbney commented on NUTCH-1465:
-

[~markus17] when attempting to process the following sitemap, 
http://www.autotrader.com/sitemap.xml, it appears the new processor is not able 
to process anything. Although the crawldb data structures are produced, no 
entries are added. Can you please rescope the patch and ensure it is the most 
up-to-date one you are working with? Thanks

{code}
2017-07-03 15:32:09,213 INFO  util.SitemapProcessor - SitemapProcessor: Total records rejected by filters: 0
2017-07-03 15:32:09,213 INFO  util.SitemapProcessor - SitemapProcessor: Total sitemaps from HostDb: 0
2017-07-03 15:32:09,213 INFO  util.SitemapProcessor - SitemapProcessor: Total sitemaps from seed urls: 1
2017-07-03 15:32:09,213 INFO  util.SitemapProcessor - SitemapProcessor: Total failed sitemap fetches: 0
2017-07-03 15:32:09,213 INFO  util.SitemapProcessor - SitemapProcessor: Total new sitemap entries added: 0
2017-07-03 15:32:09,213 INFO  util.SitemapProcessor - SitemapProcessor: Finished at 2017-07-03 15:32:09, elapsed: 00:00:19
{code}



[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2017-07-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16072953#comment-16072953
 ] 

ASF GitHub Bot commented on NUTCH-1465:
---

lewismc opened a new pull request #195: NUTCH-1465 Support sitemaps in Nutch
URL: https://github.com/apache/nutch/pull/195
 
 
   Hi folks, this issue is a mirror of Markus' latest patch over on 
https://issues.apache.org/jira/browse/NUTCH-1465; it is merely for improved 
review.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2017-06-30 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16070485#comment-16070485
 ] 

Markus Jelsma commented on NUTCH-1465:
--

Hi Lewis!

It appears to be working fine now and bug-free, because the input no longer 
overwrites the fetch interval and modified time of existing CrawlDb entries. 
The reasons for not overwriting them:
* that is messy in Nutch
* websites tend to set bad values, almost always, such as 100k large websites 
signaling to refetch everything daily

We have it deployed but not activated; activation is the plan for early next 
week.

The patch is based on the mess in this thread's latest comments and the most 
recent scraps I found on GitHub. It should include the most recent 
contributions you guys added.



[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2017-06-30 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16070468#comment-16070468
 ] 

Lewis John McGibbney commented on NUTCH-1465:
-

Fantastic [~markus17], is this working well for you? I am going to try this out. 
Out of curiosity, is this based on the GitHub PR or on the various patches 
which are associated with this issue? I am curious as I've seen quite a lot of 
variability in the implementations.



[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2017-06-30 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16070314#comment-16070314
 ] 

Markus Jelsma commented on NUTCH-1465:
--

There is an oddity going on when a sitemap.xml entry is listed twice. It then 
assumes the db_status INJECTED and overwrites the existing CrawlDatum completely.
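
The behaviour one would want instead can be sketched as a reduce step that collapses all values for a URL: an entry already in the CrawlDb always wins, and repeated sitemap listings count once. This is only an illustration under assumptions. `SitemapMerge`, `Entry` and `Status` are hypothetical stand-ins, not Nutch's CrawlDatum or its actual reducer, and the sketch assumes sitemap-derived values are marked as injected in the map phase.

```java
import java.util.List;

// Hypothetical stand-ins; Nutch's real CrawlDatum and reducer classes differ.
final class SitemapMerge {
  enum Status { INJECTED, DB_UNFETCHED, DB_FETCHED }

  static final class Entry {
    final Status status;
    final int fetchIntervalSecs;

    Entry(Status status, int fetchIntervalSecs) {
      this.status = status;
      this.fetchIntervalSecs = fetchIntervalSecs;
    }
  }

  // Collapses every value seen for one URL into the single entry to store.
  static Entry reduce(List<Entry> values) {
    Entry existing = null;
    Entry fromSitemap = null;
    for (Entry e : values) {
      if (e.status == Status.INJECTED) {
        if (fromSitemap == null) fromSitemap = e; // ignore repeated listings
      } else {
        existing = e; // an entry that is already in the CrawlDb
      }
    }
    if (existing != null) return existing; // never overwritten by the sitemap
    if (fromSitemap == null) throw new IllegalArgumentException("no values for key");
    return new Entry(Status.DB_UNFETCHED, fromSitemap.fetchIntervalSecs);
  }
}
```

With this policy, a URL listed twice in a sitemap yields one new unfetched entry, and a URL that is already known keeps its stored state untouched.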

> Support sitemaps in Nutch
> -
>
> Key: NUTCH-1465
> URL: https://issues.apache.org/jira/browse/NUTCH-1465
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 1.14
>
> Attachments: NUTCH-1465.patch, NUTCH-1465.patch, 
> NUTCH-1465-sitemapinjector-trunk-v1.patch, NUTCH-1465-trunk.v1.patch, 
> NUTCH-1465-trunk.v2.patch, NUTCH-1465-trunk.v3.patch, 
> NUTCH-1465-trunk.v4.patch, NUTCH-1465-trunk.v5.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 
> licensed and appears to have been used successfully to parse sitemaps as per 
> the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] 
> http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html





[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2017-05-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16016943#comment-16016943
 ] 

ASF GitHub Bot commented on NUTCH-1465:
---

lewismc commented on issue #189: NUTCH-1465 Support sitemaps in Nutch
URL: https://github.com/apache/nutch/pull/189#issuecomment-302617703
 
 
   @sebastian-nagel I've addressed all but two of your comments and responded. 
I've also implemented parameterized logging. In addition, I've dropped 
STATUS_SITEMAP, replacing all instances with STATUS_INJECTED.
   N.B. when I run this as follows, I am not currently able to inject any URLs 
into the CrawlDB:
   ```
   // First I inject a random URL to create a CrawlDB
   
   lmcgibbn@LMC-056430 /usr/local/nutch(NUTCH-1465) $ ./runtime/local/bin/nutch inject crawl urls/
   Injector: starting at 2017-05-18 23:01:14
   Injector: crawlDb: crawl
   Injector: urlDir: urls
   Injector: Converting injected urls to crawl db entries.
   Injector: overwrite: false
   Injector: update: false
   Injector: Total urls rejected by filters: 0
   Injector: Total urls injected after normalization and filtering: 1
   Injector: Total urls injected but already in CrawlDb: 0
   Injector: Total new urls injected: 1
   Injector: finished at 2017-05-18 23:01:15, elapsed: 00:00:01
   
   // I then attempt to process the sitemap at http://www.autotrader.com/sitemap.xml, which I've added to a file located in a 'sitemaps' directory
   
   lmcgibbn@LMC-056430 /usr/local/nutch(NUTCH-1465) $ ./runtime/local/bin/nutch sitemap crawl -sitemapUrls sitemaps
   SitemapProcessor: sitemap urls dir: sitemaps
   SitemapProcessor: Starting at 2017-05-18 23:06:38
   robots.txt whitelist not configured.
   SitemapProcessor: Total records rejected by filters: 0
   SitemapProcessor: Total sitemaps from HostDb: 0
   SitemapProcessor: Total sitemaps from seed urls: 1
   SitemapProcessor: Total failed sitemap fetches: 0
   SitemapProcessor: Total new sitemap entries added: 0
   SitemapProcessor: Finished at 2017-05-18 23:06:48, elapsed: 00:00:10
   
   // Let's read the DB
   
   lmcgibbn@LMC-056430 /usr/local/nutch(NUTCH-1465) $ ./runtime/local/bin/nutch readdb crawl -stats
   CrawlDb statistics start: crawl
   Statistics for CrawlDb: crawl
   TOTAL urls:  1
   shortest fetch interval: 30 days, 00:00:00
   avg fetch interval:  30 days, 00:00:00
   longest fetch interval:  30 days, 00:00:00
   earliest fetch time: Thu May 18 23:01:00 PDT 2017
   avg of fetch times:  Thu May 18 23:01:00 PDT 2017
   latest fetch time:   Thu May 18 23:01:00 PDT 2017
   retry 0: 1
   min score:   1.0
   avg score:   1.0
   max score:   1.0
   status 1 (db_unfetched): 1
   CrawlDb statistics: done
   ```
   As you can see, no URLs seem to be processed: the count of new sitemap 
entries added is zero, and this is confirmed by the readdb output.
   I need to do some more debugging to see where the bug(s) are. If anyone with 
an interest in Sitemap support in Nutch master is able to try this patch out, 
it would be highly appreciated.
 



> Support sitemaps in Nutch
> -
>
> Key: NUTCH-1465
> URL: https://issues.apache.org/jira/browse/NUTCH-1465
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 1.14
>
> Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, 
> NUTCH-1465-trunk.v1.patch, NUTCH-1465-trunk.v2.patch, 
> NUTCH-1465-trunk.v3.patch, NUTCH-1465-trunk.v4.patch, 
> NUTCH-1465-trunk.v5.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 
> licensed and appears to have been used successfully to parse sitemaps as per 
> the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] 
> http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2017-05-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16016914#comment-16016914
 ] 

ASF GitHub Bot commented on NUTCH-1465:
---

lewismc commented on a change in pull request #189: NUTCH-1465 Support sitemaps 
in Nutch
URL: https://github.com/apache/nutch/pull/189#discussion_r117406027
 
 

 ##
 File path: src/java/org/apache/nutch/util/SitemapProcessor.java
 ##
 @@ -0,0 +1,436 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.nutch.util;
+
+import java.io.IOException;
+import java.net.URL;
+import java.text.SimpleDateFormat;
+import java.util.Collection;
+import java.util.LinkedList;
+import java.util.List;
+import java.util.Random;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.conf.Configured;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.io.Text;
+import org.apache.hadoop.io.Writable;
+import org.apache.hadoop.mapreduce.Job;
+import org.apache.hadoop.mapreduce.Mapper;
+import org.apache.hadoop.mapreduce.Reducer;
+import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
+import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
+import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
+import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;
+import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
+import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
+import org.apache.hadoop.util.StringUtils;
+import org.apache.hadoop.util.Tool;
+import org.apache.hadoop.util.ToolRunner;
+
+import org.apache.nutch.crawl.CrawlDatum;
+import org.apache.nutch.hostdb.HostDatum;
+import org.apache.nutch.net.URLFilters;
+import org.apache.nutch.net.URLNormalizers;
+import org.apache.nutch.protocol.Content;
+import org.apache.nutch.protocol.Protocol;
+import org.apache.nutch.protocol.ProtocolFactory;
+import org.apache.nutch.protocol.ProtocolOutput;
+import org.apache.nutch.protocol.ProtocolStatus;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import crawlercommons.robots.BaseRobotRules;
+import crawlercommons.sitemaps.AbstractSiteMap;
+import crawlercommons.sitemaps.SiteMap;
+import crawlercommons.sitemaps.SiteMapIndex;
+import crawlercommons.sitemaps.SiteMapParser;
+import crawlercommons.sitemaps.SiteMapURL;
+
+/**
+ * Performs Sitemap processing by fetching sitemap links, parsing the content and merging
+ * the urls from Sitemap (with the metadata) with the existing crawldb.
+ *
+ * There are two use cases supported in Nutch's Sitemap processing:
+ *
+ *  Sitemaps are considered as "remote seed lists". Crawl administrators can prepare a
+ * list of sitemap links and get only those sitemap pages. This suits well for targeted
+ * crawl of specific hosts.
+ *  For open web crawl, it is not possible to track each host and get the sitemap links
+ * manually. Nutch would automatically get the sitemaps for all the hosts seen in the
+ * crawls and inject the urls from sitemap to the crawldb.
+ *
+ *
+ * For more details see:
+ *  https://wiki.apache.org/nutch/SitemapFeature
+ */
+public class SitemapProcessor extends Configured implements Tool {
+  public static final Logger LOG = LoggerFactory.getLogger(SitemapProcessor.class);
+  public static final SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
+
+  public static final String CURRENT_NAME = "current";
 
 Review comment:
   What is your suggestion here, @sebastian-nagel?
 


[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2017-04-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15986615#comment-15986615
 ] 

ASF GitHub Bot commented on NUTCH-1465:
---

sebastian-nagel commented on a change in pull request #189: NUTCH-1465 Support 
sitemaps in Nutch
URL: https://github.com/apache/nutch/pull/189#discussion_r113689552
 
 

 ##
 File path: src/java/org/apache/nutch/util/SitemapProcessor.java
 ##
 @@ -0,0 +1,436 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.nutch.util;
+
+import java.io.IOException;
+import java.net.URL;
+import java.text.SimpleDateFormat;
+import java.util.Collection;
+import java.util.LinkedList;
+import java.util.List;
+import java.util.Random;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.conf.Configured;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.io.Text;
+import org.apache.hadoop.io.Writable;
+import org.apache.hadoop.mapreduce.Job;
+import org.apache.hadoop.mapreduce.Mapper;
+import org.apache.hadoop.mapreduce.Reducer;
+import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
+import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
+import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
+import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;
+import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
+import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
+import org.apache.hadoop.util.StringUtils;
+import org.apache.hadoop.util.Tool;
+import org.apache.hadoop.util.ToolRunner;
+
+import org.apache.nutch.crawl.CrawlDatum;
+import org.apache.nutch.hostdb.HostDatum;
+import org.apache.nutch.net.URLFilters;
+import org.apache.nutch.net.URLNormalizers;
+import org.apache.nutch.protocol.Content;
+import org.apache.nutch.protocol.Protocol;
+import org.apache.nutch.protocol.ProtocolFactory;
+import org.apache.nutch.protocol.ProtocolOutput;
+import org.apache.nutch.protocol.ProtocolStatus;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import crawlercommons.robots.BaseRobotRules;
+import crawlercommons.sitemaps.AbstractSiteMap;
+import crawlercommons.sitemaps.SiteMap;
+import crawlercommons.sitemaps.SiteMapIndex;
+import crawlercommons.sitemaps.SiteMapParser;
+import crawlercommons.sitemaps.SiteMapURL;
+
+/**
+ * Performs Sitemap processing by fetching sitemap links, parsing the content and merging
+ * the urls from Sitemap (with the metadata) with the existing crawldb.
+ *
+ * There are two use cases supported in Nutch's Sitemap processing:
+ *
+ *  Sitemaps are considered as "remote seed lists". Crawl administrators can prepare a
+ * list of sitemap links and get only those sitemap pages. This suits well for targeted
+ * crawl of specific hosts.
+ *  For open web crawl, it is not possible to track each host and get the sitemap links
+ * manually. Nutch would automatically get the sitemaps for all the hosts seen in the
+ * crawls and inject the urls from sitemap to the crawldb.
+ *
+ *
+ * For more details see:
+ *  https://wiki.apache.org/nutch/SitemapFeature
+ */
+public class SitemapProcessor extends Configured implements Tool {
+  public static final Logger LOG = LoggerFactory.getLogger(SitemapProcessor.class);
+  public static final SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
+
+  public static final String CURRENT_NAME = "current";
+  public static final String LOCK_NAME = ".locked";
+  public static final String SITEMAP_STRICT_PARSING = "sitemap.strict.parsing";
+  public static final String SITEMAP_URL_FILTERING = "sitemap.url.filter";
+  public static final String SITEMAP_URL_NORMALIZING = "sitemap.url.normalize";
+
+  private static class SitemapMapper extends Mapper {
+    private ProtocolFactory protocolFactory = null;
+    private boolean strict = true;
+    private boolean filter = true;
+    private boolean normalize = true;
+    private URLFilters filters = null;
+    private URLNormalizers normalizers = null;
+    private CrawlDatum datum = 

[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2017-04-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15986614#comment-15986614
 ] 

ASF GitHub Bot commented on NUTCH-1465:
---

sebastian-nagel commented on a change in pull request #189: NUTCH-1465 Support 
sitemaps in Nutch
URL: https://github.com/apache/nutch/pull/189#discussion_r113689082
 
 

 ##
 File path: src/java/org/apache/nutch/util/SitemapProcessor.java
 ##
 @@ -0,0 +1,436 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.nutch.util;
+
+import java.io.IOException;
+import java.net.URL;
+import java.text.SimpleDateFormat;
+import java.util.Collection;
+import java.util.LinkedList;
+import java.util.List;
+import java.util.Random;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.conf.Configured;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.io.Text;
+import org.apache.hadoop.io.Writable;
+import org.apache.hadoop.mapreduce.Job;
+import org.apache.hadoop.mapreduce.Mapper;
+import org.apache.hadoop.mapreduce.Reducer;
+import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
+import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
+import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
+import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;
+import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
+import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
+import org.apache.hadoop.util.StringUtils;
+import org.apache.hadoop.util.Tool;
+import org.apache.hadoop.util.ToolRunner;
+
+import org.apache.nutch.crawl.CrawlDatum;
+import org.apache.nutch.hostdb.HostDatum;
+import org.apache.nutch.net.URLFilters;
+import org.apache.nutch.net.URLNormalizers;
+import org.apache.nutch.protocol.Content;
+import org.apache.nutch.protocol.Protocol;
+import org.apache.nutch.protocol.ProtocolFactory;
+import org.apache.nutch.protocol.ProtocolOutput;
+import org.apache.nutch.protocol.ProtocolStatus;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import crawlercommons.robots.BaseRobotRules;
+import crawlercommons.sitemaps.AbstractSiteMap;
+import crawlercommons.sitemaps.SiteMap;
+import crawlercommons.sitemaps.SiteMapIndex;
+import crawlercommons.sitemaps.SiteMapParser;
+import crawlercommons.sitemaps.SiteMapURL;
+
+/**
+ * Performs Sitemap processing by fetching sitemap links, parsing the content and merging
+ * the urls from Sitemap (with the metadata) with the existing crawldb.
+ *
+ * There are two use cases supported in Nutch's Sitemap processing:
+ *
+ *  Sitemaps are considered as "remote seed lists". Crawl administrators can prepare a
+ * list of sitemap links and get only those sitemap pages. This suits well for targeted
+ * crawl of specific hosts.
+ *  For open web crawl, it is not possible to track each host and get the sitemap links
+ * manually. Nutch would automatically get the sitemaps for all the hosts seen in the
+ * crawls and inject the urls from sitemap to the crawldb.
+ *
+ *
+ * For more details see:
+ *  https://wiki.apache.org/nutch/SitemapFeature
+ */
+public class SitemapProcessor extends Configured implements Tool {
+  public static final Logger LOG = LoggerFactory.getLogger(SitemapProcessor.class);
+  public static final SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
+
+  public static final String CURRENT_NAME = "current";
 
 Review comment:
   But in 
[ReadHostDb](https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/hostdb/ReadHostDb.java#L182)
 and 
[UpdateHostDb](https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/hostdb/UpdateHostDb.java#L107)
 a String literal `"current"` is still used.
 


[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2017-04-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15986617#comment-15986617
 ] 

ASF GitHub Bot commented on NUTCH-1465:
---

sebastian-nagel commented on a change in pull request #189: NUTCH-1465 Support 
sitemaps in Nutch
URL: https://github.com/apache/nutch/pull/189#discussion_r113684593
 
 

 ##
 File path: conf/nutch-default.xml
 ##
 @@ -2529,7 +2529,33 @@ visit 
https://wiki.apache.org/nutch/SimilarityScoringFilter-->
   
   
 Default is 'fanout.key'
-The routingKey used by publisher to publish messages to specific queues. 
If the exchange type is "fanout", then this property is ignored.
+The routingKey used by publisher to publish messages to specific queues. 
+If the exchange type is "fanout", then this property is ignored.
+  
+
+
+
 
 Review comment:
   These 3 properties are used to transfer command-line options from the Hadoop 
client to the tasks. The values are always overwritten, so it doesn't make 
sense to set them here or in nutch-site.xml.
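
   To illustrate why file-based values would never take effect, here is a minimal sketch of a client that always writes all three settings before the job starts. It assumes the three properties are the ones named by the SitemapProcessor constants (`sitemap.strict.parsing`, `sitemap.url.filter`, `sitemap.url.normalize`). The class `SitemapOptions`, the flag names `-noStrict`, `-noFilter` and `-noNormalize`, and the use of a plain `Map` in place of Hadoop's Configuration are assumptions made for this example only.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; a plain Map stands in for Hadoop's Configuration.
final class SitemapOptions {
  // Builds the per-job settings from command-line arguments.
  static Map<String, Boolean> fromArgs(String[] args) {
    Map<String, Boolean> conf = new LinkedHashMap<>();
    // The client writes all three values on every run, so anything read
    // from configuration files would be replaced before tasks see it.
    conf.put("sitemap.strict.parsing", true);
    conf.put("sitemap.url.filter", true);
    conf.put("sitemap.url.normalize", true);
    for (String arg : args) {
      switch (arg) {
        case "-noStrict": // assumed flag name
          conf.put("sitemap.strict.parsing", false);
          break;
        case "-noFilter": // assumed flag name
          conf.put("sitemap.url.filter", false);
          break;
        case "-noNormalize": // assumed flag name
          conf.put("sitemap.url.normalize", false);
          break;
        default:
          break; // other arguments are not modelled in this sketch
      }
    }
    return conf;
  }
}
```

   Because every key is set unconditionally, the tasks only ever observe what the command line asked for.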
 



[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2017-04-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15986618#comment-15986618
 ] 

ASF GitHub Bot commented on NUTCH-1465:
---

sebastian-nagel commented on a change in pull request #189: NUTCH-1465 Support 
sitemaps in Nutch
URL: https://github.com/apache/nutch/pull/189#discussion_r113687939
 
 

 ##
 File path: src/java/org/apache/nutch/crawl/CrawlDatum.java
 ##
 @@ -90,6 +90,8 @@
   public static final byte STATUS_LINKED = 0x43;
   /** Page got metadata from a parser */
   public static final byte STATUS_PARSE_META = 0x44;
+  /** Page was discovered from sitemap */
+  public static final byte STATUS_SITEMAP = 0x45;
 
 Review comment:
   Do we really need a new status? STATUS_INJECTED could also be used: both are 
assigned in the mapper (SitemapMapper and InjectMapper, respectively) and replaced by 
STATUS_DB_UNFETCHED in the reducer (SitemapReducer/InjectReducer).
 



[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2017-04-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15986616#comment-15986616
 ] 

ASF GitHub Bot commented on NUTCH-1465:
---

sebastian-nagel commented on a change in pull request #189: NUTCH-1465 Support 
sitemaps in Nutch
URL: https://github.com/apache/nutch/pull/189#discussion_r113693977
 
 

 ##
 File path: src/java/org/apache/nutch/util/SitemapProcessor.java
 ##
 @@ -0,0 +1,436 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.nutch.util;
+
+import java.io.IOException;
+import java.net.URL;
+import java.text.SimpleDateFormat;
+import java.util.Collection;
+import java.util.LinkedList;
+import java.util.List;
+import java.util.Random;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.conf.Configured;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.io.Text;
+import org.apache.hadoop.io.Writable;
+import org.apache.hadoop.mapreduce.Job;
+import org.apache.hadoop.mapreduce.Mapper;
+import org.apache.hadoop.mapreduce.Reducer;
+import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
+import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
+import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
+import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;
+import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
+import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
+import org.apache.hadoop.util.StringUtils;
+import org.apache.hadoop.util.Tool;
+import org.apache.hadoop.util.ToolRunner;
+
+import org.apache.nutch.crawl.CrawlDatum;
+import org.apache.nutch.hostdb.HostDatum;
+import org.apache.nutch.net.URLFilters;
+import org.apache.nutch.net.URLNormalizers;
+import org.apache.nutch.protocol.Content;
+import org.apache.nutch.protocol.Protocol;
+import org.apache.nutch.protocol.ProtocolFactory;
+import org.apache.nutch.protocol.ProtocolOutput;
+import org.apache.nutch.protocol.ProtocolStatus;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import crawlercommons.robots.BaseRobotRules;
+import crawlercommons.sitemaps.AbstractSiteMap;
+import crawlercommons.sitemaps.SiteMap;
+import crawlercommons.sitemaps.SiteMapIndex;
+import crawlercommons.sitemaps.SiteMapParser;
+import crawlercommons.sitemaps.SiteMapURL;
+
+/**
+ * Performs Sitemap processing by fetching sitemap links, parsing the content and merging
+ * the urls from Sitemap (with the metadata) with the existing crawldb.
+ *
+ * There are two use cases supported in Nutch's Sitemap processing:
+ * 
+ *  Sitemaps are considered as "remote seed lists". Crawl administrators can prepare a
+ * list of sitemap links and get only those sitemap pages. This suits well for targeted
+ * crawl of specific hosts.
+ *  For open web crawl, it is not possible to track each host and get the sitemap links
+ * manually. Nutch would automatically get the sitemaps for all the hosts seen in the
+ * crawls and inject the urls from sitemap to the crawldb.
+ * 
+ *
+ * For more details see:
+ *  https://wiki.apache.org/nutch/SitemapFeature 
+ */
+public class SitemapProcessor extends Configured implements Tool {
+  public static final Logger LOG = LoggerFactory.getLogger(SitemapProcessor.class);
+  public static final SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
+
+  public static final String CURRENT_NAME = "current";
+  public static final String LOCK_NAME = ".locked";
+  public static final String SITEMAP_STRICT_PARSING = "sitemap.strict.parsing";
+  public static final String SITEMAP_URL_FILTERING = "sitemap.url.filter";
+  public static final String SITEMAP_URL_NORMALIZING = "sitemap.url.normalize";
+
+  private static class SitemapMapper extends Mapper {
+    private ProtocolFactory protocolFactory = null;
+    private boolean strict = true;
+    private boolean filter = true;
+    private boolean normalize = true;
+    private URLFilters filters = null;
+    private URLNormalizers normalizers = null;
+    private CrawlDatum datum = 

[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2017-04-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985680#comment-15985680
 ] 

ASF GitHub Bot commented on NUTCH-1465:
---

lewismc commented on issue #189: NUTCH-1465 Support sitemaps in Nutch
URL: https://github.com/apache/nutch/pull/189#issuecomment-297560764
 
 
   We could also improve this with parameterized logging in due course. I wanted to 
post this patch as a way of rekindling interest in sitemap parsing 
on the master branch.
 



[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2017-04-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985678#comment-15985678
 ] 

ASF GitHub Bot commented on NUTCH-1465:
---

lewismc commented on a change in pull request #189: NUTCH-1465 Support sitemaps 
in Nutch
URL: https://github.com/apache/nutch/pull/189#discussion_r113578673
 
 

 ##
 File path: src/java/org/apache/nutch/util/SitemapProcessor.java
 ##
 @@ -0,0 +1,436 @@
+  public static final String CURRENT_NAME = "current";
 
 Review comment:
   I also introduced this constant to mimic what is done in the CrawlDb and LinkDb 
classes. It is meant to represent the current HostDb. Of course, we don't 
have a HostDb class in the codebase right now, which is why this constant has been 
introduced here.
 


[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2017-04-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985676#comment-15985676
 ] 

ASF GitHub Bot commented on NUTCH-1465:
---

lewismc commented on a change in pull request #189: NUTCH-1465 Support sitemaps 
in Nutch
URL: https://github.com/apache/nutch/pull/189#discussion_r113578491
 
 

 ##
 File path: src/java/org/apache/nutch/util/SitemapProcessor.java
 ##
 @@ -0,0 +1,436 @@
+  private static class SitemapMapper extends Mapper {
+    private ProtocolFactory protocolFactory = null;
+    private boolean strict = true;
+    private boolean filter = true;
+    private boolean normalize = true;
+    private URLFilters filters = null;
+    private URLNormalizers normalizers = null;
+    private CrawlDatum datum = new 

[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2017-04-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985671#comment-15985671
 ] 

ASF GitHub Bot commented on NUTCH-1465:
---

lewismc opened a new pull request #189: NUTCH-1465 Support sitemaps in Nutch
URL: https://github.com/apache/nutch/pull/189
 
 
   Hi folks, this pull request addresses 
[NUTCH-1465](https://issues.apache.org/jira/browse/NUTCH-1465). I have an issue 
with some of the code, which I will point out separately.
 



[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2017-04-21 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15979161#comment-15979161
 ] 

Sebastian Nagel commented on NUTCH-1465:


Hi Lewis, a couple of months ago I applied the latest patch here 
(NUTCH-1465-trunk.v5.patch) to master, see 
https://github.com/sebastian-nagel/nutch/tree/NUTCH-1465. But I had to port 
this to the Common Crawl fork of Nutch (https://github.com/commoncrawl/nutch), 
so I chose the 
[SitemapInjector|https://github.com/commoncrawl/nutch/blob/cc/src/java/org/apache/nutch/crawl/SitemapInjector.java]
 from an older patch which was still based on the old mapred API.



[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2017-04-21 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15979078#comment-15979078
 ] 

Lewis John McGibbney commented on NUTCH-1465:
-

I'm going to take this on. We want full sitemap support in our current crawlers, 
so I am making this my priority. I'll submit a pull request for the current patches, 
then we can take it from there.



[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2014-01-31 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13887588#comment-13887588
 ] 

Sebastian Nagel commented on NUTCH-1465:


filters and normalizers: -noFilter is not really an option if sitemaps are 
used and gzipped documents (e.g. software packages) are to be excluded. In 
customized crawls URL filter rules are often complex, and I want to avoid 
having two sets of rules in the end. Sitemaps are different from normal docs/URLs 
(robots.txt is also different): they are not stored in CrawlDb and may require 
other filter rules. What about an option -noFilterSitemap? 

Fetch intervals of 1 second or 1 hour may cause trouble:
> We are blindly accepting user's custom information in inject.
Yes, because the user (crawl administrator) can change the seed list (it's a 
file/directory on local disk or HDFS). Sitemaps are not necessarily under the 
control of the user. If we (optionally) adjust the fetch interval by (configurable) 
min/max limits, that would help to avoid unreasonable values and, e.g., re-fetching a 
bunch of pages every cycle.

SitemapReducer overwriting:
In a continuous crawl we know when pages are modified and have heuristics to 
estimate the change frequency of a page (AdaptiveFetchSchedule). The question 
is whether we trust those values, which are obtained from crawling, or prefer 
(possibly bogus) values from sitemaps. Using the sitemap values for new URLs 
found in sitemaps is less critical.

> (a) score : Crawler commons assigns a default score of 0.5 if there was none 
> provided in sitemap.
This needs an upgrade of crawler-commons (0.2 is still used, which sets priority to 
0).

 Support sitemaps in Nutch
 -

 Key: NUTCH-1465
 URL: https://issues.apache.org/jira/browse/NUTCH-1465
 Project: Nutch
  Issue Type: New Feature
  Components: parser
Reporter: Lewis John McGibbney
Assignee: Tejas Patil
 Fix For: 1.8

 Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, 
 NUTCH-1465-trunk.v1.patch, NUTCH-1465-trunk.v2.patch, 
 NUTCH-1465-trunk.v3.patch, NUTCH-1465-trunk.v4.patch, 
 NUTCH-1465-trunk.v5.patch


 I recently came across this rather stagnant codebase[0] which is ASL v2.0 
 licensed and appears to have been used successfully to parse sitemaps as per 
 the discussion here[1].
 [0] http://sourceforge.net/projects/sitemap-parser/
 [1] 
 http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2014-01-31 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13887763#comment-13887763
 ] 

Tejas Patil commented on NUTCH-1465:


Re filters and normalizers: +1.

Re fetch intervals and reducer overwriting: I have never encountered bogus 
sitemaps, but that was for an intranet crawl, and it would be better to take care 
of that in this jira. Here is what I conclude from the discussion so far:
(1)  _fetch interval_: For old entries, don't use the value from the sitemap. For 
new ones, use the value from the sitemap provided 
(db.fetch.schedule.adaptive.min_interval <= interval <= db.fetch.interval.max)
(2) _score_: Never use the value from the sitemap. For new ones, use scoring filters. 
Keep the value of old entries as it is.
(3) _modified time_: Always use the value from the sitemap provided it's not a date 
in the future.

Did I get it right?
 
Re score: I missed that the jar is old. I will file a jira to upgrade CC to 
v0.3 in Nutch.
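
A minimal sketch of the three rules as summarized in this comment (a proposal under discussion, not committed behaviour). `SketchEntry` and `SitemapMergeSketch` are hypothetical stand-ins, not Nutch's CrawlDatum or SitemapReducer; the `initialScore` parameter stands for whatever the scoring filters would return, and falling back to a default interval for out-of-range sitemap values is an assumption.

```java
/** Hypothetical, simplified stand-in for a CrawlDb entry (not Nutch's CrawlDatum). */
final class SketchEntry {
  final int fetchInterval;  // seconds
  final float score;
  final long modifiedTime;  // epoch millis, 0 = unknown

  SketchEntry(int fetchInterval, float score, long modifiedTime) {
    this.fetchInterval = fetchInterval;
    this.score = score;
    this.modifiedTime = modifiedTime;
  }
}

final class SitemapMergeSketch {
  /** Merges a sitemap entry into an existing entry; existing == null means a new URL. */
  static SketchEntry merge(SketchEntry existing, SketchEntry fromSitemap,
      int minInterval, int maxInterval, int defaultInterval,
      float initialScore, long now) {
    boolean isNew = (existing == null);

    // (1) fetch interval: old entries keep theirs; new ones take the sitemap
    // value only if it lies within [minInterval, maxInterval].
    int interval;
    if (!isNew) {
      interval = existing.fetchInterval;
    } else if (fromSitemap.fetchInterval >= minInterval
        && fromSitemap.fetchInterval <= maxInterval) {
      interval = fromSitemap.fetchInterval;
    } else {
      interval = defaultInterval; // assumption: fall back to a default
    }

    // (2) score: never taken from the sitemap.
    float score = isNew ? initialScore : existing.score;

    // (3) modified time: taken from the sitemap unless unknown or in the future.
    long modified = isNew ? 0L : existing.modifiedTime;
    if (fromSitemap.modifiedTime > 0L && fromSitemap.modifiedTime <= now) {
      modified = fromSitemap.modifiedTime;
    }
    return new SketchEntry(interval, score, modified);
  }
}
```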



[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2014-01-31 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13888220#comment-13888220
 ] 

Sebastian Nagel commented on NUTCH-1465:


??(1) fetch interval: ...??
+1, sounds plausible.

??(2) score: Never use value from sitemap. For new ones, use scoring filters. 
Keep the value of old entries as it is.??
That means use {{ScoringFilter.initialScore(...)}} for new ones?
Why not use the priority for newly found URLs? If the site owner takes it 
seriously the score can be useful. We could make it configurable, e.g. by a 
factor {{sitemap.priority.factor}}. If it's 0.0, the priority is not used. Usually, 
the factor should be low to avoid the total score in the web graph (cf. 
[FixingOpicScoring|http://wiki.apache.org/nutch/FixingOpicScoring]) getting too 
high when injecting 50,000 URLs from sitemaps, each with 1.0 priority. 
Alternatively, we could just put the values from the sitemap in CrawlDatum's meta data 
and delegate any actions to set the score to scoring filters or FetchSchedule 
implementations. Users can then more easily adapt any sitemap logic to their 
needs (cf. below).
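
To make the {{sitemap.priority.factor}} idea concrete, here is a small sketch. The class `SitemapPriorityScore` is hypothetical, and combining the priority additively with a base score is an assumption; the comment only says that a factor of 0.0 disables the priority.

```java
/** Hypothetical helper illustrating a configurable priority factor. */
final class SitemapPriorityScore {

  /**
   * Returns the initial score of a newly found URL.
   * A factor of 0.0 means the sitemap priority is ignored.
   */
  static float initialScore(float baseScore, double sitemapPriority, float priorityFactor) {
    if (priorityFactor == 0.0f) {
      return baseScore; // priority not used
    }
    // The sitemap protocol defines priority in the range 0.0 to 1.0; clamp stray values.
    double priority = Math.max(0.0, Math.min(1.0, sitemapPriority));
    return baseScore + (float) (priorityFactor * priority);
  }
}
```

With 50,000 URLs at priority 1.0, a factor of 0.001 would add 50 to the total score in this sketch, instead of 50,000 with a factor of 1.0.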

??(3) modified time: Always use the value from sitemap provided its not a date 
in future.??
Um, it seems that this is conceptually wrong (and it also was in SitemapInjector).
The modified time in CrawlDb must indicate the time of the last fetch or the 
modified time sent by the server when a page was fetched. If we overwrite the 
modified time, the server may just answer not-modified to an If-Modified-Since 
request and we'll never get the current version of a page. So we must not touch 
the modified time, even for newly discovered pages, where it must be 0. If it's not 
zero, an If-Modified-Since header field is sent although the page has never 
been fetched, cf. HttpResponse.java. 
If we can trust the sitemap, the desired behaviour would be to set the fetch time 
(in CrawlDb = time when the next fetch should happen) to now (or the sitemap modified 
time) if (and only if) sitemap.modif > crawldb.modif. This would make sure that 
changed pages are fetched asap. If the sitemap is not 100% trustworthy we 
should be more careful. 
Could we again delegate this decision (trustworthy or not) to scoring filter or 
FetchSchedule implementations? Whether we can trust a sitemap may depend on 
concrete crawler config/project and should be configurable. Would this require 
a new method in scoring/schedule interfaces?
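
The fetch-time rule described above could look roughly like the following sketch. `SitemapFetchTime` and the `trustSitemap` flag are hypothetical; ignoring sitemap dates in the future and never postponing an already due fetch are assumptions added here. Note that the modified time stored in CrawlDb is not changed.

```java
/** Hypothetical helper; times are epoch milliseconds. */
final class SitemapFetchTime {

  /**
   * Returns the time of the next fetch. The CrawlDb modified time itself is
   * left untouched; only the fetch time is pulled forward.
   */
  static long nextFetchTime(long currentFetchTime, long crawlDbModified,
      long sitemapModified, long now, boolean trustSitemap) {
    // Page counts as changed only if the sitemap reports a newer, non-future date.
    boolean changed = sitemapModified > crawlDbModified && sitemapModified <= now;
    if (trustSitemap && changed) {
      // Fetch as soon as possible, but do not postpone a fetch that is already due.
      return Math.min(currentFetchTime, now);
    }
    return currentFetchTime;
  }
}
```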

More open questions than before!? Comments are welcome!



[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2014-01-30 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13886453#comment-13886453
 ] 

Sebastian Nagel commented on NUTCH-1465:


Thanks, [~tejasp] for the improvements! Testing continued...

Sitemaps are treated the same as ordinary URLs/docs. But there are some 
differences. Shouldn't we relax default limits and filters and trust the 
restrictions specified in sitemap protocol?
* URL filters and normalizers: maybe you want to exclude .gz docs per suffix 
filter but still fetch gzipped sitemaps. That's not possible. Is it really 
necessary to normalize/filter sitemap URLs? If yes, this should be optional.
* default content limits {http,ftp,file}.content.limit (64 kB) are quite small 
even for mid-size sitemaps. Ok, you could set it per {{-D...}} but why not 
increase it to SiteMapParser.MAX_BYTES_ALLOWED?
* maybe we also want to increase the fetch timeout

Processing sitemap indexes fails:
* the check sitemap.isIndex() skips all referenced sitemaps
* protocol for sitemap index and referenced sub-sitemaps may be different (eg., 
one sub-sitemap could be https while others are http)
* if processing one of the referenced sitemaps fails, the remaining 
sub-sitemaps are not processed

Fetch intervals are taken unchecked from changefreq. Should we limit them to 
reasonable values (db.fetch.schedule.adaptive.min_interval <= interval <= 
db.fetch.interval.max)? Fetch intervals of 1 second or 1 hour may cause 
trouble. [[1|http://www.sitemaps.org/protocol.html#xmlTagDefinitions]] 
explicitly says that changefreq is considered a hint and not a command.
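
A minimal sketch of such a limit, assuming changefreq is mapped to seconds and then clamped. {{ChangeFreqIntervals}} and its methods are hypothetical names; the mapping of "always" to 1 second and of "monthly"/"yearly" to 30/365 days is an assumption for illustration, and the bounds are passed in rather than read from the Nutch configuration.

```java
import java.util.Locale;

// Hypothetical helper for illustration only; not part of Nutch or crawler-commons.
final class ChangeFreqIntervals {
  private ChangeFreqIntervals() {}

  /** Maps a sitemap changefreq value to seconds; returns -1 if absent or unknown. */
  static long toSeconds(String changeFreq) {
    if (changeFreq == null) {
      return -1L;
    }
    switch (changeFreq.trim().toLowerCase(Locale.ROOT)) {
      case "always":  return 1L;
      case "hourly":  return 3600L;
      case "daily":   return 86400L;
      case "weekly":  return 7L * 86400L;
      case "monthly": return 30L * 86400L;
      case "yearly":  return 365L * 86400L;
      case "never":   return Long.MAX_VALUE; // clamped to maxInterval by the caller
      default:        return -1L;
    }
  }

  /** Treats changefreq as a hint: clamps it into [minInterval, maxInterval]. */
  static long intervalFor(String changeFreq, long defaultInterval,
      long minInterval, long maxInterval) {
    long hinted = toSeconds(changeFreq);
    if (hinted < 0) {
      return defaultInterval; // no usable hint: fall back to the default interval
    }
    return Math.max(minInterval, Math.min(maxInterval, hinted));
  }
}
```

With a minimum of 60 seconds and a maximum of 7776000 seconds (illustrative values), "always" is raised to 60 and "yearly" is lowered to 7776000, while "hourly" passes through unchanged.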




[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2014-01-30 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13886489#comment-13886489
 ] 

Sebastian Nagel commented on NUTCH-1465:


SitemapReducer overwrites score, modified time, and fetch interval of existing 
CrawlDb entries with the values from sitemap. Is this the desired behavior? 
What about forgotten, hopelessly outdated sitemaps? Or bogus values (last mod in 
the future)?
If a sitemap does not specify one of score, modified time, or fetch interval 
this value is set to zero. In this case, we should definitely not overwrite 
existing values. Newly added entries should get assigned 
db.fetch.interval.default and a reasonable score, eg. 0.5 as recommended by 
[[2|http://www.sitemaps.org/protocol.html#xmlTagDefinitions]]. But that may 
depend on scoring plugins. Comments?



[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2014-01-30 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13886677#comment-13886677
 ] 

Tejas Patil commented on NUTCH-1465:


Interesting comments [~wastl-nagel].

Re filters and normalizers : By default I have kept those ON but can be 
disabled by using -noFilter and -noNormalize.
Re default content limits and fetch timeout: +1. Agree with you.
Re Processing sitemap indexes fails : +1. Nice catch.
Re Fetch intervals of 1 second or 1 hour may cause troubles : Currently, 
Injector allows users to provide a custom fetch interval with any value eg. 1 
sec. It makes sense not to correct it, as the user wants Nutch to use that custom 
fetch interval. If we view sitemaps as a custom seed list given by a content 
owner, then it would make sense to follow the intervals. But as you said, 
sitemaps can be wrongly set or outdated, so the intervals might be incorrect. The 
question boils down to: We are blindly accepting user's custom information in 
inject. Should we blindly assume that sitemaps are correct or not ? I have no 
strong opinion about either side of the argument. 

(PS : Default 'db.fetch.schedule.adaptive.min_interval' is 1 min so would allow 
1 hr as per db.fetch.schedule.adaptive.min_interval <= interval)

Re SitemapReducer overwriting : 
> _If a sitemap does not specify one of score, modified time, or fetch 
> interval this value is set to zero._
Nope. See 
[SiteMapURL.java|https://code.google.com/p/crawler-commons/source/browse/trunk/src/main/java/crawlercommons/sitemaps/SiteMapURL.java]

 (a) score : Crawler commons assigns a default score of 0.5 if there was none 
provided in sitemap. 
We can do this: If an old entry has a score other than 0.5, it can be preserved, 
else updated. For a new entry, use scoring plugins if the score equals 0.5, else 
preserve it. 
Limitation: It's not possible to distinguish whether a score of 0.5 comes from the 
sitemap or is the default one used when priority was absent.
 (b) fetch interval : Crawler commons does NOT set fetch interval if there was 
none provided in sitemap. So we are sure that whatever value is used is coming 
from changefreq. Validation might be needed as per comments above.
 (c) modified time : Same as fetch interval, unless parsed from sitemap file, 
modified time is set to NULL. Only possible validation is to drop values 
greater than current time.
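
The three rules (a) to (c) can be sketched as small, side-effect-free checks. This is a rough illustration only: {{SitemapValueMerge}} and its method names are hypothetical and do not reflect the actual SitemapReducer code; an absent fetch interval is represented here as a value <= 0 and an absent modified time as null.

```java
// Hypothetical helper for illustration only; not the SitemapReducer implementation.
final class SitemapValueMerge {
  // Default score assigned when the sitemap provides none, as described in (a).
  static final float DEFAULT_SCORE = 0.5f;

  private SitemapValueMerge() {}

  /** (a) Preserve an existing score unless it is still the 0.5 default. */
  static float mergeScore(float existing, float fromSitemap) {
    return existing != DEFAULT_SCORE ? existing : fromSitemap;
  }

  /** (b) Only take the interval if the sitemap actually provided one. */
  static int mergeInterval(int existing, int fromSitemap) {
    return fromSitemap > 0 ? fromSitemap : existing;
  }

  /** (c) A lastmod value is usable only if present and not in the future. */
  static boolean isUsableLastModified(Long fromSitemap, long now) {
    return fromSitemap != null && fromSitemap <= now;
  }
}
```

The limitation noted in (a) is visible in {{mergeScore}}: an existing score that happens to be exactly 0.5 for other reasons would still be replaced.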



[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2014-01-27 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13882968#comment-13882968
 ] 

Sebastian Nagel commented on NUTCH-1465:


Great, looks good and is really compact while providing a lot of functionality. 
I've just started to test SitemapProcessor, here are my first comments:
* SitemapProcessor.java has no Apache license header
* would be nice to see counters in log output
* regarding Lewis' point #3: doesn't a comment "a hacky way" mean: try to 
avoid that? Why not set isHost inside map(...) by {{isHost = (value instanceof 
HostDatum)}} and pass it as a parameter to filterNormalize()? This would avoid 
any errors due to incomplete heuristics, here when testing with sitemaps 
accessed per file protocol:
{code}
INFO  api.HttpRobotRulesParser - Couldn't get robots.txt for 
http://file:/tmp/sitemap1.xml/: java.net.UnknownHostException: file
{code}
* concurrency: returning the value of isHost from filterNormalize() to map() 
via a member variable is not thread-safe and will cause problems in combination 
with MultithreadedMapper. One more argument for passing it from map() to 
filterNormalize() as a parameter.


 Support sitemaps in Nutch
 -

 Key: NUTCH-1465
 URL: https://issues.apache.org/jira/browse/NUTCH-1465
 Project: Nutch
  Issue Type: New Feature
  Components: parser
Reporter: Lewis John McGibbney
Assignee: Tejas Patil
 Fix For: 1.8

 Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, 
 NUTCH-1465-trunk.v1.patch, NUTCH-1465-trunk.v2.patch, 
 NUTCH-1465-trunk.v3.patch, NUTCH-1465-trunk.v4.patch




[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2014-01-27 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13883204#comment-13883204
 ] 

Tejas Patil commented on NUTCH-1465:


Hi [~wastl-nagel],
Thanks a lot for your comments. First two were straightforward and I agree 
with those.

Re hacky way : For hosts from the HostDb, we don't know which protocol they 
belong to. In the code I was checking if http:// is a match and if that was a 
bad guess then try with https://. I didn't handle ftp:// and file:/ 
schemes. By "hacky" I meant this approach of trial-and-error till a suitable 
match is found and we create a homepage url for the host. I have thought about 
your comment and will have a better (yet hacky) way in the coming patch.
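
The trial-and-error approach described here can be sketched as follows. This is a simplified illustration, not the patch code: {{HostUrlGuesser}} and {{homepageFor}} are hypothetical names, and the URL filters are represented by a plain {{Predicate}} rather than Nutch's URLFilters.

```java
import java.util.function.Predicate;

// Hypothetical helper for illustration only; not taken from the patch.
final class HostUrlGuesser {
  // Schemes tried in order; ftp:// and file:/ are not handled, as noted above.
  private static final String[] SCHEMES = {"http://", "https://"};

  private HostUrlGuesser() {}

  /**
   * Builds candidate homepage URLs for a bare host name and returns the first
   * one accepted by the filter, or null if none passes.
   */
  static String homepageFor(String host, Predicate<String> urlFilter) {
    for (String scheme : SCHEMES) {
      String candidate = scheme + host + "/";
      if (urlFilter.test(candidate)) {
        return candidate;
      }
    }
    return null;
  }
}
```

For example, with a filter that only accepts https URLs, {{homepageFor("example.org", ...)}} skips the http candidate and returns {{https://example.org/}}.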

Re concurrency: I had thought of this and had searched the internet for the 
internals of MultithreadedMapper. All I could find is that it has an internal 
thread pool and each input record is handed over to a thread in this pool to 
run map() over it. I wrote this code to check if thread safety was ensured in 
MultithreadedMapper:

{noformat}
  private static class SitemapMapper extends Mapper<Text, Writable, Text, CrawlDatum> {
    // Member variable written in foo(); shared if several threads use one instance
    private String myurl = null;

    public void map(Text key, Writable value, Context context)
        throws IOException, InterruptedException {
      if (value instanceof Text) {
        String url = key.toString();
        // If another thread overwrote myurl in the meantime, the values differ
        if (foo(url).compareTo(url) != 0) {
          LOG.warn("Race condition found !!!");
        }
      }
    }

    private String foo(String url) {
      myurl = url;
      // Delay threads with odd ids to widen the window for a race
      if (Thread.currentThread().getId() % 2 == 1) {
        try {
          Thread.sleep(1);
        } catch (InterruptedException e) {
          LOG.warn(e.getMessage());
        }
      }
      return myurl;
    }
  }
{noformat}

I ran it multiple times with threads set to 10, 100, 1000 and 2000 but never 
hit the race condition in the code. Is the code snippet above a good way to 
reveal any race condition in the code ? It won't be a formal conclusion, 
more of an experimental one. How do I get a concrete conclusion whether 
MultithreadedMapper is thread safe or not ?
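
A stress test like the one above can show that a race exists, but not that it is absent: the bad interleaving may simply never occur during the run. The following sketch forces that interleaving with latches instead of relying on timing. It is an illustration only; {{SharedFieldRaceDemo}} is a hypothetical class, and it assumes that two threads share a single object, which this thread does not establish for MultithreadedMapper.

```java
import java.util.concurrent.CountDownLatch;

// Illustration only: forces the interleaving that a stress test may never hit.
final class SharedFieldRaceDemo {
  private String myurl; // shared mutable member, like myurl in the test mapper

  private final CountDownLatch firstWrite = new CountDownLatch(1);
  private final CountDownLatch secondWrite = new CountDownLatch(1);

  /** Writes the field, waits until the other thread overwrote it, reads it back. */
  String writeThenRead(String url) {
    myurl = url;
    firstWrite.countDown();
    awaitQuietly(secondWrite);
    return myurl;
  }

  /** Overwrites the field once the first thread has written its value. */
  void overwrite(String url) {
    awaitQuietly(firstWrite);
    myurl = url;
    secondWrite.countDown();
  }

  private static void awaitQuietly(CountDownLatch latch) {
    try {
      latch.await();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
  }

  /** Returns what the first thread read back after writing "http://a.example/". */
  static String run() {
    final SharedFieldRaceDemo shared = new SharedFieldRaceDemo();
    final String[] readBack = new String[1];
    Thread first = new Thread(() -> readBack[0] = shared.writeThenRead("http://a.example/"));
    Thread second = new Thread(() -> shared.overwrite("http://b.example/"));
    first.start();
    second.start();
    try {
      first.join();
      second.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return readBack[0];
  }
}
```

Here {{run()}} returns {{http://b.example/}} although the first thread wrote {{http://a.example/}}: whenever an instance is shared, a value passed through a member variable can be replaced between the write and the read. Passing the value as a method parameter avoids the question entirely.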



[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2014-01-27 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13883523#comment-13883523
 ] 

Sebastian Nagel commented on NUTCH-1465:


Sorry, you're right: the comment "hacky way" applies to trying http and https 
to check which host-URL would pass the filters. That's ok, there is no better 
solution for that.
But what about the decision whether a string passed to filterNormalize() is a 
host from HostDb or a URL from a list of sitemaps? This decision could be made 
without any heuristics: inside map() we know the type (host or sitemap Url) 
from the class of the value:
{code}
boolean isHost = (value instanceof HostDatum);
String url = filterNormalize(key.toString(), isHost);
{code}
The method filterNormalize() could be then simplified and the member variable 
isHost would be obsolete.
Regarding concurrency: the javadoc of 
[[MultithreadedMapper.java|http://hadoop.apache.org/docs/stable/api/src-html/org/apache/hadoop/mapreduce/lib/map/MultithreadedMapper.html]]
 states that "Mapper implementations using this MapRunnable must be 
thread-safe". If in doubt, it may be better to follow this advice and not to look 
at the (current) implementation. If SitemapParser is thread-safe (at first 
glance, it is) it should be easy to make SitemapMapper safe.



[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2014-01-23 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13879955#comment-13879955
 ] 

Lewis John McGibbney commented on NUTCH-1465:
-

Hey [~tejasp]. Again, great work! Some minor comments

* Class level Javadoc in SitemapProcessor would be more legible if it used a 
format similar to
{code:title=SitemapProcessor.java|borderStyle=solid}
/**
 * <p>Performs Sitemap processing by fetching sitemap links, parsing the content and merging
 * the urls from Sitemap (with the metadata) with the existing crawldb.</p>
 *
 * <p>There are two use cases supported in Nutch's Sitemap processing:</p>
 * <ol>
 *  <li>Sitemaps are considered as remote seed lists. Crawl administrators can prepare a
 *  list of sitemap links and get only those sitemap pages. This suits well for targeted
 *  crawl of specific hosts.</li>
 *  <li>For open web crawl, it is not possible to track each host and get the sitemap links
 *  manually. Nutch would automatically get the sitemaps for all the hosts seen in the
 *  crawls and inject the urls from sitemap to the crawldb.</li>
 * </ol>
 * <p>For more details see:
 *  https://wiki.apache.org/nutch/SitemapFeature </p>
 */
{code}
* I think that the following logging line should be changed to WARN or ERROR
{code:title=SitemapProcessor.java|borderStyle=solid}
} catch (Exception e) {
+  LOG.info("Exception for url " + key.toString() + " : " + StringUtils.stringifyException(e));
{code}
* This is merely a suggestion, but in SitemapProcessor#filterNormalize(String 
u), could we not use one of the methods from URLUtil.java instead?
{code:title=SitemapProcessor.java|borderStyle=solid}
  if(!u.startsWith("http://") && !u.startsWith("https://")) {
    // We received a hostname here so let's make a URL
    url = "http://" + u + "/";
    isHost = true;
  }
{code}

That's about it from me mate. This looks like an excellent addition to Nutch 
again. I made a trivial update to the wiki page to drop in some links and 
background to your work on this one.

 Support sitemaps in Nutch
 -

 Key: NUTCH-1465
 URL: https://issues.apache.org/jira/browse/NUTCH-1465
 Project: Nutch
  Issue Type: New Feature
  Components: parser
Reporter: Lewis John McGibbney
Assignee: Tejas Patil
 Fix For: 1.9

 Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, 
 NUTCH-1465-trunk.v1.patch, NUTCH-1465-trunk.v2.patch, 
 NUTCH-1465-trunk.v3.patch




[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2014-01-23 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13880295#comment-13880295
 ] 

Tejas Patil commented on NUTCH-1465:


Hi [~lewismc],
+1 for the first two suggestions. For #3: I skimmed through the methods inside 
URLUtil.java and nothing came to my notice that I could use in the Sitemap code 
you pointed to. Can you please confirm ?

A big thanks mate for trying out the feature. Hopefully we get this into 1.8 
release.
Cheers !!


 Support sitemaps in Nutch
 -

 Key: NUTCH-1465
 URL: https://issues.apache.org/jira/browse/NUTCH-1465
 Project: Nutch
  Issue Type: New Feature
  Components: parser
Reporter: Lewis John McGibbney
Assignee: Tejas Patil
 Fix For: 1.8

 Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, 
 NUTCH-1465-trunk.v1.patch, NUTCH-1465-trunk.v2.patch, 
 NUTCH-1465-trunk.v3.patch




[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2014-01-23 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13880305#comment-13880305
 ] 

Lewis John McGibbney commented on NUTCH-1465:
-

hey [~tejasp] no probs. RE: #3, I was just curious to see if we could reuse 
some of the methods we had in URLUtil. Now that I've looked I feel you're right. 
This patch reminds me of pushing out filtering and normalization to crawler 
commons anyway but that is another can of worms :)
I'll let others comment here. Right now I am +1 on this patch. 



[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2014-01-03 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13862237#comment-13862237
 ] 

Tejas Patil commented on NUTCH-1465:


Hi [~wastl-nagel],
Yes. I think that it should be there too. I will be working on the patch this 
weekend and will post an update. Thanks for your input and suggestions so far; 
they were super helpful in chalking out the right specs for this feature.

 Support sitemaps in Nutch
 -

 Key: NUTCH-1465
 URL: https://issues.apache.org/jira/browse/NUTCH-1465
 Project: Nutch
  Issue Type: New Feature
  Components: parser
Reporter: Lewis John McGibbney
Assignee: Tejas Patil
 Fix For: 1.9

 Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, 
 NUTCH-1465-trunk.v1.patch




[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2013-12-16 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13848998#comment-13848998
 ] 

Sebastian Nagel commented on NUTCH-1465:


Let's add use case C:
*(C) inject URLs from given sitemap(s)*
i. user configures list of known and trusted sitemaps
ii. URLs are extracted from sitemaps and injected into CrawlDb
Use case: small/medium size customized crawls

Is C a common use case, worth integrating?



[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2013-12-15 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13848561#comment-13848561
 ] 

Tejas Patil commented on NUTCH-1465:


Revisited this Jira after a long time and gave a thought how this can be done 
cleanly. Two ways for implementing this:

*(A) Do the sitemap stuff in the fetch phase of nutch cycle.*
This was my original approach which the (in-progress) patch addresses. This 
would involve tweaking core nutch classes at several locations.

Pros:
- Sitemaps are nothing but normal pages with several outlinks. Fits well in the 
'fetch' cycle.

Cons:
- Sitemaps can be huge. Fetching them needs large size and time 
limits. Fetch code must have a special case to take into account that the url 
is a sitemap url and use custom limits => leads to hacky coding style.
- Outlink class cannot hold extra information contained in sitemaps (like 
lastmod, changefreq). Modify it to hold this information too. This would be 
specific for sitemaps only yet we end up making all outlinks to hold this info. 
We could create a special type of outlink and take care of this.
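
The "special type of outlink" idea might look roughly like the sketch below. It is purely illustrative: {{SitemapOutlink}} is a hypothetical standalone class and does not extend or modify Nutch's Outlink; the sitemap fields are kept as a generic string map keyed by tag name.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical type for illustration only; not Nutch's Outlink class.
final class SitemapOutlink {
  private final String toUrl;
  private final Map<String, String> sitemapInfo; // e.g. lastmod, changefreq

  SitemapOutlink(String toUrl, Map<String, String> sitemapInfo) {
    this.toUrl = toUrl;
    // Defensive copy so later changes to the caller's map do not leak in
    this.sitemapInfo =
        Collections.unmodifiableMap(new LinkedHashMap<>(sitemapInfo));
  }

  String getToUrl() {
    return toUrl;
  }

  /** Returns the sitemap value for the given tag name, or null if absent. */
  String get(String tagName) {
    return sitemapInfo.get(tagName);
  }
}
```

Ordinary outlinks would not need to carry this map, which addresses the concern that a sitemap-specific change would otherwise affect every outlink.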

*(B) Have separate job for the sitemap stuff and merge its output into the 
crawldb.*
i. User populates a list of hosts (or uses HostDB from NUTCH-1325). Now we got 
all the hosts to be processed.
ii. Run a map-reduce job: for each host, 
  - get the robots page, extract sitemap urls, 
  - get xml content of these sitemap pages
  - create crawl datums with the required info and write this to a 
sitemapDB

iii. Use CrawlDbMerger utility to merge the sitemapDB and crawldb
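
The "extract sitemap urls from the robots page" part of step ii can be sketched in isolation. This is a minimal illustration that only scans for {{Sitemap:}} lines; {{RobotsSitemapExtractor}} is a hypothetical name, and a real job would presumably rely on an existing robots.txt parser rather than this hand-rolled scan.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration only; not the patch implementation.
final class RobotsSitemapExtractor {
  private static final String PREFIX = "sitemap:";

  private RobotsSitemapExtractor() {}

  /** Returns the URLs of all "Sitemap:" lines found in the robots.txt content. */
  static List<String> sitemapUrls(String robotsTxt) {
    List<String> urls = new ArrayList<>();
    for (String line : robotsTxt.split("\\r?\\n")) {
      String trimmed = line.trim();
      // The directive name is matched case-insensitively
      if (trimmed.toLowerCase(Locale.ROOT).startsWith(PREFIX)) {
        String url = trimmed.substring(PREFIX.length()).trim();
        if (!url.isEmpty()) {
          urls.add(url);
        }
      }
    }
    return urls;
  }
}
```

Each returned URL would then be fetched and parsed, and the resulting entries written to the temporary sitemapDB before the merge in step iii.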

Pros:
- Cleaner code. 
- Users have control when to perform sitemap extraction. This is better than 
(A) wherein sitemap urls are sitting in the crawldb and get fetched along with 
normal pages (thus, eating up fetch time of every fetch phase). We can have a 
sitemap_frequency used inside the crawl script so that users can say that after 'x' 
nutch cycles, run sitemap processing.

Cons:
- Additional map-reduce jobs are needed. I think that this must be reasonable. 
Running sitemap job 1-5 times in a month on a production level crawl would work 
out well.

I am inclined towards implementing (B)

 Support sitemaps in Nutch
 -

 Key: NUTCH-1465
 URL: https://issues.apache.org/jira/browse/NUTCH-1465
 Project: Nutch
  Issue Type: New Feature
  Components: parser
Reporter: Lewis John McGibbney
Assignee: Tejas Patil
 Fix For: 1.9

 Attachments: NUTCH-1465-trunk.v1.patch




[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2013-12-15 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13848723#comment-13848723
 ] 

Tejas Patil commented on NUTCH-1465:


Hi [~wastl-nagel],

Nice share. The only grudge I have with that approach is that users will have 
to pick up sitemap urls for hosts *manually* and feed to the sitemap injector. 
It would fit well where users are performing targeted crawling.
For a large scale, open web crawl use case:
(i) the number of initial hosts can be large : one time burden for users
(ii) crawler discovers new hosts with time : constant pain for users to look 
out for the new hosts discovered and then get sitemaps from robots.txt 
manually. With HostDB from NUTCH-1325 and B, users won't suffer here.

> do we really need an extra DB?
I should have been clear with the explanation. sitemapDB is some temporary 
location where all crawl datums of sitemap entries would be written. This can 
be deleted after merge with the main crawlDB. Quite analogous to what inject 
operation does.

> NUTCH-1622 would enable solution A: outlinks now can hold extra info.
I didn't know that. Still I would go in favor of B as it is clean and A would 
involve messing around with existing codebase at several places.



[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2013-07-26 Thread Brian (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13721045#comment-13721045
 ] 

Brian commented on NUTCH-1465:
--

Is a separate issue needed for support in 2.X? 


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2013-01-28 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564274#comment-13564274
 ] 

Sebastian Nagel commented on NUTCH-1465:


Hi Tejas,
thanks and a few comments on the patch:

“??for a given host, sitemaps are processed just once??” But they are not 
cached over cycles because the cache is bound to the protocol object. Is this 
correct? So a sitemap is fetched and processed every cycle for every host? If 
yes and sitemaps are large (they can be!) this would cause a lot of extra traffic.

Shouldn't sitemap URLs be handled the same way as any other URL: add them to 
CrawlDb, fetch and parse once, add found links to CrawlDb, cf. [Ken's post at 
CC|https://groups.google.com/forum/?fromgroups#!topic/crawler-commons/DrAX4Th1A4I].
 There are some complications:
- due to their size, sitemaps may require larger values regarding size and time 
limits
- sitemaps may require more frequent re-fetching (eg. by 
MimeAdaptiveFetchSchedule)
- the current Outlink class cannot hold extra information contained in sitemaps 
(lastmod, changefreq, etc.)

There is another way which we use for several customers: A SitemapInjector 
fetches the sitemaps, extracts URLs and injects them with all extra 
information. It's a simple use case for a customized site-search: there is a 
sitemap and it shall be used as seed list or even exclusive list of documents 
to be crawled. Is there any interest in this solution? It's not a general 
solution and not adaptable to a large web crawl. 


 Support sitemaps in Nutch
 -

 Key: NUTCH-1465
 URL: https://issues.apache.org/jira/browse/NUTCH-1465
 Project: Nutch
  Issue Type: New Feature
  Components: parser
Reporter: Lewis John McGibbney
Assignee: Tejas Patil
 Fix For: 1.7

 Attachments: NUTCH-1465-trunk.v1.patch




[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2013-01-28 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564768#comment-13564768
 ] 

Sebastian Nagel commented on NUTCH-1465:




Yes, SitemapInjector is a map-reduce job. The scenario for its use is the 
following:
- a small set of sites to be crawled (eg, to feed a site-search index)
- you can think of sitemaps as remote seed lists. Because many content 
management systems can generate sitemaps it is convenient for the site owners 
to publish seeds. The URLs contained in the sitemap can be also the complete 
and exclusive set of URLs to be crawled (you can use the plugin scoring-depth 
to limit the crawl to seed URLs).
- because you can trust the sitemap's content
-* checks for cross submissions are not necessary
-* extra information (lastmod, changefreq, priority) can be used
That's how we use sitemaps: remote seed lists, maintained by customers, quite 
convenient if you run a crawler as a service.

For large web crawls there is also another aspect: detection of sitemaps which 
is bound to processing of robots.txt. Processing of sitemaps can (and should?) 
be done the usual Nutch way:
- detection is done in the protocol plugin (see Tejas' patch)
- record in CrawlDb: done by Fetcher (cross submission information can be added)
- fetch (if not yet done), parse (a plugin parse-sitemap based on 
crawler-commons?) and extract outlinks: sitemaps may require special treatment 
here because they can be large in size and usually contain many outlinks. Also 
the Outlink class needs to be extended to deal with the extra info relevant for 
scheduling
Using an extra tool (such as the SitemapInjector) to process the sitemaps has 
the disadvantage that we must first get all sitemap URLs out of the CrawlDb. On 
the other hand, special treatment can easily be realized in a separate 
map-reduce job.
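
As an illustration of the detection step bound to robots.txt, a small Python sketch follows. It is neither the Nutch protocol plugin nor the crawler-commons parser; `extract_sitemap_urls` is a hypothetical helper that only shows which lines carry the information.

```python
def extract_sitemap_urls(robots_txt):
    """Collect the URLs of all 'Sitemap:' lines, in order, without duplicates."""
    urls = []
    for line in robots_txt.splitlines():
        # Drop trailing comments. Simplification: a '#' inside a URL is cut too.
        line = line.split("#", 1)[0].strip()
        # Split at the first colon only, so the colon in "http://" stays intact.
        key, sep, value = line.partition(":")
        url = value.strip()
        if sep and key.strip().lower() == "sitemap" and url and url not in urls:
            urls.append(url)
    return urls


# Hypothetical robots.txt content, only for illustration.
ROBOTS = """User-agent: *
Disallow: /private/
Sitemap: http://example.com/sitemap.xml
SITEMAP: http://example.com/news-sitemap.xml  # field names are case-insensitive
Sitemap: http://example.com/sitemap.xml
"""

if __name__ == "__main__":
    print(extract_sitemap_urls(ROBOTS))
```

The Sitemap line is independent of any User-agent group, so the sketch scans the whole file rather than a single rule block.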

Comments?!

Thanks, Tejas: the feature is moving forward thanks to your initiative!



[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2013-01-28 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564836#comment-13564836
 ] 

Markus Jelsma commented on NUTCH-1465:
--

Thanks all for your interesting comments.

It's a complicated issue. On one hand, host data should be stored in the 
hostdb (NUTCH-1325), but that would require additional logic and sending each 
segment's output to the hostdb in case a sitemap was crawled. On the other 
hand, it's ideal to store host data. It's also easy to use in jobs such as the 
indexer and generator.

I don't yet favour a specific approach but storing sitemap data in a hostdb may 
be something to think about.

Cheers



[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2013-01-28 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564883#comment-13564883
 ] 

Tejas Patil commented on NUTCH-1465:


Hi Sebastian,

So we are looking at 2 things here:
- a standalone utility for injecting sitemaps to crawldb: 
-# User starts off with urls to sitemap pages
-# SitemapInjector fetches these seeds and parses them (with a parse plugin 
based on CC)
-# SitemapInjector updates the crawldb with the sitemap entries.

- handling of sitemap within the nutch cycle: fetch, parse and update phases
-# Robots parsing will populate a table of host: _list of links to sitemap 
pages_
-# These will be added to the fetcher queue and will be fetched
-# A parser plugin based on CC will parse the sitemap page
-# Outlink class needs to be extended to store the meta obtained from sitemap
-# Write this into the segment
-# Update phase needs to update the crawl frequency of already existing urls in 
crawldb based on what we got from the sitemap. Otherwise, just add new entries 
to the crawldb.
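
The last step of the second list (update existing entries, else add new ones) could look roughly like the Python sketch below. This is an assumption-laden illustration, not Nutch code: the crawldb is modelled as a plain dict, and the changefreq-to-interval table, the default interval and the record fields are all made up here.

```python
# Assumed mapping from sitemap <changefreq> values to a fetch interval in seconds.
# "always" and "never" are left out on purpose; they need a policy decision.
CHANGEFREQ_SECONDS = {
    "hourly": 3600,
    "daily": 86400,
    "weekly": 7 * 86400,
    "monthly": 30 * 86400,
    "yearly": 365 * 86400,
}
DEFAULT_INTERVAL = 30 * 86400  # assumed default, not taken from the thread


def update_crawldb(crawldb, sitemap_entries):
    """Merge sitemap entries into a dict-based stand-in for the crawldb."""
    for entry in sitemap_entries:
        interval = CHANGEFREQ_SECONDS.get(entry.get("changefreq"))
        record = crawldb.get(entry["loc"])
        if record is None:
            # New URL: add it as unfetched with the hinted (or default) interval.
            crawldb[entry["loc"]] = {
                "status": "unfetched",
                "fetch_interval": interval or DEFAULT_INTERVAL,
            }
        elif interval is not None:
            # Known URL: only the crawl frequency is adjusted.
            record["fetch_interval"] = interval
    return crawldb


if __name__ == "__main__":
    db = {"http://example.com/": {"status": "fetched", "fetch_interval": DEFAULT_INTERVAL}}
    update_crawldb(db, [
        {"loc": "http://example.com/", "changefreq": "daily"},
        {"loc": "http://example.com/new"},
    ])
    print(db)
```

Unknown or missing changefreq values leave an existing record untouched, so a sloppy sitemap cannot reset intervals the crawler has already learned.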

I am not clear about extending the Outlink class. The normal outlink extraction 
need not be done, as CC will already do that for us. The sitemap parser plugin 
must do this and create objects of our specialized sitemap link. While writing, 
where is the CrawlDatum generated from the outlink?

The mime type that we get is text/xml, which can also mean a normal XML file. 
How will Nutch identify whether it's a sitemap page and invoke the correct 
parser plugin? (I know that this magic is done by the feed parser, but I'm not 
sure which part of the code does it. Just point me to that code.)
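
One possible heuristic for the text/xml question, sketched in Python: look only at the root element. This is not how Nutch or the feed parser decides; `looks_like_sitemap` is a hypothetical helper, and the two root names come from the sitemaps.org protocol.

```python
import io
import xml.etree.ElementTree as ET

# Root elements defined by the sitemaps.org protocol.
SITEMAP_ROOTS = {"urlset", "sitemapindex"}


def looks_like_sitemap(xml_bytes):
    """Heuristic: True if the document's root element is a sitemap root."""
    try:
        for _event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("start",)):
            # The first "start" event is the root; strip a "{namespace}" prefix.
            local_name = elem.tag.rsplit("}", 1)[-1]
            return local_name in SITEMAP_ROOTS
    except ET.ParseError:
        return False
    return False


if __name__ == "__main__":
    doc = b'<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"></urlset>'
    print(looks_like_sitemap(doc))
    print(looks_like_sitemap(b'<rss version="2.0"><channel/></rss>'))
```

A stricter check could also compare the namespace, at the cost of rejecting sitemaps that omit or misspell it.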




[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2013-01-27 Thread Ken Krugler (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564019#comment-13564019
 ] 

Ken Krugler commented on NUTCH-1465:


Hi Tejas - I thought the current CC robots parsing code was already extracting 
the sitemap links. Or is the above comment ("modified the robots parsing code 
to extract the links to sitemap pages") a change to the current Nutch robots 
parsing code?

I do remember thinking that the CC version would need to change to support 
multiple Sitemap links, even though it wasn't clear whether that was actually 
valid.

-- Ken

 Support sitemaps in Nutch
 -

 Key: NUTCH-1465
 URL: https://issues.apache.org/jira/browse/NUTCH-1465
 Project: Nutch
  Issue Type: New Feature
  Components: parser
Reporter: Lewis John McGibbney
 Fix For: 1.7

 Attachments: NUTCH-1465-trunk.v1.patch


 I recently came across this rather stagnant codebase[0] which is ASL v2.0 
 licensed and appears to have been used successfully to parse sitemaps as per 
 the discussion here[1].
 [0] http://sourceforge.net/projects/sitemap-parser/
 [1] 
 http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html



[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2013-01-27 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564040#comment-13564040
 ] 

Tejas Patil commented on NUTCH-1465:


Hi Ken, 
As the CC robots integration jira is not closed, I made this change on the 
current trunk. 

I did not understand this: "CC version would need to change to support multiple 
Sitemap links". Do you mean that CC doesn't allow multiple sitemap links in a 
robots file (like 
[this|http://stackoverflow.com/questions/2594179/multiple-sitemap-entries-in-robots-txt])
 or in a sitemap index file?
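
For the second case in the question, a sitemap index file is itself an XML document that lists further sitemaps. A minimal Python sketch of reading one follows; `parse_sitemap_index` and the sample are hypothetical and say nothing about what CC does internally.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol (version 0.9).
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def parse_sitemap_index(xml_text):
    """Return the child sitemap URLs listed in a <sitemapindex> document."""
    root = ET.fromstring(xml_text)
    if root.tag != NS + "sitemapindex":
        return []  # not an index file (e.g. a plain <urlset>)
    return [
        loc.text.strip()
        for loc in root.findall(NS + "sitemap/" + NS + "loc")
        if loc.text and loc.text.strip()
    ]


# Hypothetical index file, only for illustration.
SAMPLE_INDEX = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>http://example.com/sitemap-1.xml</loc>
    <lastmod>2013-01-20</lastmod>
  </sitemap>
  <sitemap>
    <loc>http://example.com/sitemap-2.xml</loc>
  </sitemap>
</sitemapindex>"""

if __name__ == "__main__":
    print(parse_sitemap_index(SAMPLE_INDEX))
```

Either way the crawler ends up with a list of sitemap URLs per host; the index file just adds one more fetch before the actual URL entries are reached.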



[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2013-01-27 Thread Ken Krugler (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564057#comment-13564057
 ] 

Ken Krugler commented on NUTCH-1465:


Hi Tejas - the original code didn't, but I checked and now remember that I 
added support for multiple sitemap URLs to BaseRobotRules in CC.
