[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16394940#comment-16394940 ]

Sebastian Nagel commented on NUTCH-1465:
----------------------------------------

The feature is already ported to 2.x, see NUTCH-1741, but using a different approach.

> Support sitemaps in Nutch
> -------------------------
>
> Key: NUTCH-1465
> URL: https://issues.apache.org/jira/browse/NUTCH-1465
> Project: Nutch
> Issue Type: New Feature
> Components: parser
> Reporter: Lewis John McGibbney
> Assignee: Markus Jelsma
> Priority: Major
> Fix For: 1.14
>
> Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch,
> NUTCH-1465-trunk.v1.patch, NUTCH-1465-trunk.v2.patch,
> NUTCH-1465-trunk.v3.patch, NUTCH-1465-trunk.v4.patch,
> NUTCH-1465-trunk.v5.patch, NUTCH-1465.patch, NUTCH-1465.patch,
> NUTCH-1465.patch, NUTCH-1465.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0
> licensed and appears to have been used successfully to parse sitemaps as per
> the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16393274#comment-16393274 ]

Ben Vachon commented on NUTCH-1465:
-----------------------------------

Is there any plan to pull this to 2.x?
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16209562#comment-16209562 ]

ASF GitHub Bot commented on NUTCH-1465:
---------------------------------------

sebastian-nagel commented on issue #189: NUTCH-1465 Support sitemaps in Nutch
URL: https://github.com/apache/nutch/pull/189#issuecomment-337636717

Hi @marconett, please ask for help on the [Nutch user mailing list](http://nutch.apache.org/mailing_lists.html) or report the problem at https://issues.apache.org/jira/projects/NUTCH. That's a closed pull request, and nothing will be fixed here. Thanks!

This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16209547#comment-16209547 ]

ASF GitHub Bot commented on NUTCH-1465:
---------------------------------------

marconett commented on issue #189: NUTCH-1465 Support sitemaps in Nutch
URL: https://github.com/apache/nutch/pull/189#issuecomment-337633586

I'm running into the same problem and am unable to inject sitemap content into the db. Here are the commands I used (not including output, it's the same as above):

```
bin/nutch inject crawl/crawldb urls/
bin/nutch sitemap crawl/crawldb -sitemapUrls sitemaps/ -noStrict -noFilter -noNormalize
bin/nutch readdb crawl/crawldb -stats
```

where `urls/seed.txt` contains "https://www.linux.com/" and `sitemaps/seed.txt` contains "https://www.linux.com/sitemap.xml".

I see (tcpdump) that there are https connections being established to linux.com while `bin/nutch sitemap` is running. But nothing gets injected into the crawldb.

Is there any info on this? Should this be fixed?
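The setup in the report above can be recreated roughly as follows. This is only a minimal sketch: it assumes it is run from the Nutch runtime directory (where `bin/nutch` lives), it only creates the two seed files described in the comment, and it prints the three reported commands instead of running them.

```shell
# Recreate the two seed directories described in the report.
set -eu

mkdir -p urls sitemaps

# Seed list for the regular injector.
printf '%s\n' 'https://www.linux.com/' > urls/seed.txt

# Seed list for the sitemap processor (passed via -sitemapUrls).
printf '%s\n' 'https://www.linux.com/sitemap.xml' > sitemaps/seed.txt

# The commands from the report, in the order they were run.
# They are only printed here; run them by hand from the Nutch
# runtime directory.
cat <<'EOF'
bin/nutch inject crawl/crawldb urls/
bin/nutch sitemap crawl/crawldb -sitemapUrls sitemaps/ -noStrict -noFilter -noNormalize
bin/nutch readdb crawl/crawldb -stats
EOF
```

Comparing the `readdb -stats` output before and after the `sitemap` step is one way to confirm whether any entries were added.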
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16127556#comment-16127556 ]

ASF GitHub Bot commented on NUTCH-1465:
---------------------------------------

sebastian-nagel closed pull request #202: NUTCH-1465 Support for sitemaps
URL: https://github.com/apache/nutch/pull/202
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16127555#comment-16127555 ]

ASF GitHub Bot commented on NUTCH-1465:
---------------------------------------

sebastian-nagel commented on issue #202: NUTCH-1465 Support for sitemaps
URL: https://github.com/apache/nutch/pull/202#issuecomment-322529338

Yes, of course.
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16127483#comment-16127483 ]

ASF GitHub Bot commented on NUTCH-1465:
---------------------------------------

lewismc commented on issue #202: NUTCH-1465 Support for sitemaps
URL: https://github.com/apache/nutch/pull/202#issuecomment-322518126

@sebastian-nagel ping
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101785#comment-16101785 ]

ASF GitHub Bot commented on NUTCH-1465:
---------------------------------------

lewismc closed pull request #195: NUTCH-1465 Support sitemaps in Nutch
URL: https://github.com/apache/nutch/pull/195
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101783#comment-16101783 ]

ASF GitHub Bot commented on NUTCH-1465:
---------------------------------------

lewismc commented on issue #202: NUTCH-1465 Support for sitemaps
URL: https://github.com/apache/nutch/pull/202#issuecomment-318086684

@sebastian-nagel, Markus' patch made it into master branch... is this correct? If so then we can close this issue.
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101784#comment-16101784 ]

ASF GitHub Bot commented on NUTCH-1465:
---------------------------------------

lewismc closed pull request #189: NUTCH-1465 Support sitemaps in Nutch
URL: https://github.com/apache/nutch/pull/189
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16092964#comment-16092964 ]

Hudson commented on NUTCH-1465:
-------------------------------

SUCCESS: Integrated in Jenkins build Nutch-trunk #3435 (See [https://builds.apache.org/job/Nutch-trunk/3435/])
NUTCH-1465 (markus: [https://github.com/apache/nutch/commit/b58d6cd9111b2d25b8f6f009015ac214bac4006d])
* (edit) conf/log4j.properties
* (add) src/java/org/apache/nutch/util/SitemapProcessor.java
* (edit) ivy/ivy.xml
* (edit) conf/nutch-default.xml
* (edit) src/bin/nutch
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16092945#comment-16092945 ]

Markus Jelsma commented on NUTCH-1465:
--------------------------------------

Crap! I was probably looking without seeing! Got it!
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16092938#comment-16092938 ]

Sebastian Nagel commented on NUTCH-1465:
----------------------------------------

I've modified the description of the properties [sitemap.strict.parsing|https://github.com/sebastian-nagel/nutch/commit/de92a387ba5314b202da9fc006979927fe697be0#diff-d45b2920590dbd66188eb546753d1834R2555] and [sitemap.url.overwrite.existing|https://github.com/sebastian-nagel/nutch/commit/de92a387ba5314b202da9fc006979927fe697be0#diff-d45b2920590dbd66188eb546753d1834R2589]. But feel free to add your modifications/additions. I just tried to make it understandable by anyone who does not know the gory details.
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16092909#comment-16092909 ]

Markus Jelsma commented on NUTCH-1465:
--------------------------------------

Sebastian, your patch has CrawlDatum and IndexingFilterChecker in it as well, just for the newline at the tail. No problem, but I do miss your updated description of the properties. I cannot find them in https://github.com/apache/nutch/pull/202.patch
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16090460#comment-16090460 ]

Markus Jelsma commented on NUTCH-1465:
--------------------------------------

Thanks! Will grab 202.patch and see if it fits tomorrow!
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16090044#comment-16090044 ]

ASF GitHub Bot commented on NUTCH-1465:
---------------------------------------

sebastian-nagel opened a new pull request #202: NUTCH-1465 Support for sitemaps
URL: https://github.com/apache/nutch/pull/202

(applied Markus' patch as of 2017-07-05)
- add SitemapProcessor
- upgrade dependency crawler-commons to 0.8
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16090038#comment-16090038 ]

Sebastian Nagel commented on NUTCH-1465:
----------------------------------------

Thanks, [~markus.jel...@openindex.io]! Tested on a small set of sitemaps. Looks good to me, I've only improved the description of properties and did some code clean-up (patch / pull-request to follow). Please, go ahead and commit it! We can later improve it to make it more robust or to make sophisticated use of last modified time and priorities provided in sitemaps. Thanks!
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16078124#comment-16078124 ]

Markus Jelsma commented on NUTCH-1465:
--------------------------------------

I think this is committable, anyone to disagree? If not, i'll get this in early next week.
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16074554#comment-16074554 ]

Markus Jelsma commented on NUTCH-1465:
--------------------------------------

Hi Lewis, 0.8 doesn't deal with this sitemap at autotrader either.
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16074046#comment-16074046 ]

Lewis John McGibbney commented on NUTCH-1465:
---------------------------------------------

[~markus17] can we also update the version of crawler commons to 0.8 which is the latest version available in Maven Central? I'll take a look into the processing logic once the update has been made. Thanks Markus.
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16073338#comment-16073338 ]

Markus Jelsma commented on NUTCH-1465:
--------------------------------------

Ah, I see. The autotrader sitemap points to an index of sitemaps. Everything is fine except it does not pass if(sitemap.isIndex()). When printing its getType() I get null. So something is wrong with either the sitemapindex, crawler commons, or myself.
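One way to rule out the file itself is to check by hand which root element a downloaded sitemap has: the sitemap protocol uses `<sitemapindex>` for an index of sitemaps and `<urlset>` for a plain sitemap. The following is a rough sketch only. `sitemap_kind` is a hypothetical helper, the sample file and its URL are made up, and the `grep` test is a crude textual check that says nothing about how crawler commons decides `isIndex()` (it would, for instance, miss namespace-prefixed element names).

```shell
# Hypothetical helper: classify a local sitemap file by its root element.
# This is a plain text match, not an XML parse.
sitemap_kind() {
  if grep -q '<sitemapindex' "$1"; then
    echo index
  elif grep -q '<urlset' "$1"; then
    echo urlset
  else
    echo unknown
  fi
}

# Tiny sample sitemap index (made-up URL, for illustration only).
cat > sample-index.xml <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemap-1.xml</loc></sitemap>
</sitemapindex>
EOF

sitemap_kind sample-index.xml
```

If a real file reports `index` here but the processor still does not treat it as one, the problem is more likely on the parsing side than in the file.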
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16073288#comment-16073288 ]

Markus Jelsma commented on NUTCH-1465:
--------------------------------------

Hello Lewis, I am positive i took the latest pieces. And checking out the GH page, that problem wasn't solved in the first place right? Or am i missing something?
https://github.com/apache/nutch/pull/189#discussion_r113578491
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16072959#comment-16072959 ]

Lewis John McGibbney commented on NUTCH-1465:

[~markus17] when attempting to process the following sitemap, http://www.autotrader.com/sitemap.xml, the new processor does not appear to process anything. Although the CrawlDb data structures are produced, no entries are added. Can you please rescope the patch and ensure it is the most up-to-date one you are working with? Thanks

{code}
2017-07-03 15:32:09,213 INFO util.SitemapProcessor - SitemapProcessor: Total records rejected by filters: 0
2017-07-03 15:32:09,213 INFO util.SitemapProcessor - SitemapProcessor: Total sitemaps from HostDb: 0
2017-07-03 15:32:09,213 INFO util.SitemapProcessor - SitemapProcessor: Total sitemaps from seed urls: 1
2017-07-03 15:32:09,213 INFO util.SitemapProcessor - SitemapProcessor: Total failed sitemap fetches: 0
2017-07-03 15:32:09,213 INFO util.SitemapProcessor - SitemapProcessor: Total new sitemap entries added: 0
2017-07-03 15:32:09,213 INFO util.SitemapProcessor - SitemapProcessor: Finished at 2017-07-03 15:32:09, elapsed: 00:00:19
{code}
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16072953#comment-16072953 ]

ASF GitHub Bot commented on NUTCH-1465:

lewismc opened a new pull request #195: NUTCH-1465 Support sitemaps in Nutch
URL: https://github.com/apache/nutch/pull/195

Hi folks, this issue is a mirror of Markus' latest patch over on https://issues.apache.org/jira/browse/NUTCH-1465, this is merely for improved review.

This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16070485#comment-16070485 ]

Markus Jelsma commented on NUTCH-1465:

Hi Lewis! It appears to be working fine now and bug-free, because the input no longer overwrites the interval and modified time of existing CrawlDb entries. The reasons for not overwriting them:

* that is messy in Nutch
* websites tend to set bad values, almost always, such as 100k large websites signaling to refetch everything daily

We have it deployed but not activated; activating it is the plan for early next week. The patch is based on the mess in this thread's latest comments and the most recent scraps I found on GitHub. It should include the most recent contributions you guys added.
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16070468#comment-16070468 ]

Lewis John McGibbney commented on NUTCH-1465:

Fantastic [~markus17], is this working well for you? I am going to try this out. Out of curiosity, is this based on the GitHub PR or on the various patches attached to this issue? I ask because I've seen quite a lot of variability in the implementations.
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16070314#comment-16070314 ]

Markus Jelsma commented on NUTCH-1465:

There is an oddity when a sitemap.xml entry is listed twice: it then assumes the db_status INJECTED and overwrites the existing CrawlDatum completely.
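To make the duplicate-entry problem concrete, here is a minimal sketch of a reduce-side merge that tolerates a URL being listed several times in a sitemap. It is an assumption about what a robust merge could look like, not the code in any of the attached patches: the Datum class, the Status enum and the reduce method are simplified stand-ins for Nutch's CrawlDatum and reducer, and the policy shown (an existing CrawlDb entry always wins, duplicates of the injected entry are ignored) is only one possible choice.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified model of the merge step; not Nutch code.
class SitemapMergeSketch {

  enum Status { INJECTED, DB_UNFETCHED, DB_FETCHED }

  static final class Datum {
    final Status status;
    final int fetchIntervalSeconds;
    final long modifiedTime;

    Datum(Status status, int fetchIntervalSeconds, long modifiedTime) {
      this.status = status;
      this.fetchIntervalSeconds = fetchIntervalSeconds;
      this.modifiedTime = modifiedTime;
    }
  }

  /** Merges all values grouped under one URL key and returns the entry to store. */
  static Datum reduce(List<Datum> values) {
    Datum existing = null;
    Datum injected = null;
    for (Datum d : values) {
      if (d.status == Status.INJECTED) {
        if (injected == null) {
          injected = d; // further copies of the same sitemap entry are ignored
        }
      } else {
        existing = d;   // entry that was already in the CrawlDb
      }
    }
    if (existing != null) {
      return existing;  // never overwrite interval or modified time of a known URL
    }
    if (injected == null) {
      throw new IllegalArgumentException("no values for key");
    }
    // A URL seen for the first time becomes an unfetched CrawlDb entry.
    return new Datum(Status.DB_UNFETCHED, injected.fetchIntervalSeconds, injected.modifiedTime);
  }

  public static void main(String[] args) {
    Datum existing = new Datum(Status.DB_FETCHED, 2592000, 1000L);
    Datum fromSitemap = new Datum(Status.INJECTED, 86400, 2000L);

    // Known URL listed twice in the sitemap: the stored entry is kept as-is.
    Datum kept = reduce(Arrays.asList(fromSitemap, existing, fromSitemap));
    System.out.println(kept.status + " " + kept.fetchIntervalSeconds);

    // New URL listed twice: a single unfetched entry results.
    Datum added = reduce(Arrays.asList(fromSitemap, fromSitemap));
    System.out.println(added.status + " " + added.fetchIntervalSeconds);
  }
}
```

The point of scanning all values before deciding is that the decision no longer depends on how many copies of the sitemap entry arrive or in which order they are seen.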
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16016943#comment-16016943 ]

ASF GitHub Bot commented on NUTCH-1465:

lewismc commented on issue #189: NUTCH-1465 Support sitemaps in Nutch
URL: https://github.com/apache/nutch/pull/189#issuecomment-302617703

@sebastian-nagel I've addressed all but two of your comments and responded. I've also implemented parameterized logging. In addition, I've dropped STATUS_SITEMAP, replacing instances with STATUS_INJECTED.

N.B. when I run this as follows, I am not currently able to inject any URLs into the CrawlDb:

```
// First I inject a random URL to create a CrawlDB
lmcgibbn@LMC-056430 /usr/local/nutch(NUTCH-1465) $ ./runtime/local/bin/nutch inject crawl urls/
Injector: starting at 2017-05-18 23:01:14
Injector: crawlDb: crawl
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: overwrite: false
Injector: update: false
Injector: Total urls rejected by filters: 0
Injector: Total urls injected after normalization and filtering: 1
Injector: Total urls injected but already in CrawlDb: 0
Injector: Total new urls injected: 1
Injector: finished at 2017-05-18 23:01:15, elapsed: 00:00:01

// I then attempt to process a sitemap at http://www.autotrader.com/sitemap.xml which I've added to a file located in a 'sitemaps' directory
lmcgibbn@LMC-056430 /usr/local/nutch(NUTCH-1465) $ ./runtime/local/bin/nutch sitemap crawl -sitemapUrls sitemaps
SitemapProcessor: sitemap urls dir: sitemaps
SitemapProcessor: Starting at 2017-05-18 23:06:38
robots.txt whitelist not configured.
SitemapProcessor: Total records rejected by filters: 0
SitemapProcessor: Total sitemaps from HostDb: 0
SitemapProcessor: Total sitemaps from seed urls: 1
SitemapProcessor: Total failed sitemap fetches: 0
SitemapProcessor: Total new sitemap entries added: 0
SitemapProcessor: Finished at 2017-05-18 23:06:48, elapsed: 00:00:10

// Let's read the DB
lmcgibbn@LMC-056430 /usr/local/nutch(NUTCH-1465) $ ./runtime/local/bin/nutch readdb crawl -stats
CrawlDb statistics start: crawl
Statistics for CrawlDb: crawl
TOTAL urls: 1
shortest fetch interval: 30 days, 00:00:00
avg fetch interval: 30 days, 00:00:00
longest fetch interval: 30 days, 00:00:00
earliest fetch time: Thu May 18 23:01:00 PDT 2017
avg of fetch times: Thu May 18 23:01:00 PDT 2017
latest fetch time: Thu May 18 23:01:00 PDT 2017
retry 0: 1
min score: 1.0
avg score: 1.0
max score: 1.0
status 1 (db_unfetched): 1
CrawlDb statistics: done
```

As you can see, no URLs seem to be processed: the number of new sitemap entries added is zero, and this is confirmed by the readdb output. I need to do some more debugging to see where the bug(s) are. If anyone with an interest in sitemap support in Nutch master is able to try this patch out, it would be highly appreciated.
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16016914#comment-16016914 ]

ASF GitHub Bot commented on NUTCH-1465:

lewismc commented on a change in pull request #189: NUTCH-1465 Support sitemaps in Nutch
URL: https://github.com/apache/nutch/pull/189#discussion_r117406027

## File path: src/java/org/apache/nutch/util/SitemapProcessor.java ##
@@ -0,0 +1,436 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.nutch.util;
+
+import java.io.IOException;
+import java.net.URL;
+import java.text.SimpleDateFormat;
+import java.util.Collection;
+import java.util.LinkedList;
+import java.util.List;
+import java.util.Random;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.conf.Configured;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.io.Text;
+import org.apache.hadoop.io.Writable;
+import org.apache.hadoop.mapreduce.Job;
+import org.apache.hadoop.mapreduce.Mapper;
+import org.apache.hadoop.mapreduce.Reducer;
+import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
+import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
+import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
+import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;
+import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
+import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
+import org.apache.hadoop.util.StringUtils;
+import org.apache.hadoop.util.Tool;
+import org.apache.hadoop.util.ToolRunner;
+
+import org.apache.nutch.crawl.CrawlDatum;
+import org.apache.nutch.hostdb.HostDatum;
+import org.apache.nutch.net.URLFilters;
+import org.apache.nutch.net.URLNormalizers;
+import org.apache.nutch.protocol.Content;
+import org.apache.nutch.protocol.Protocol;
+import org.apache.nutch.protocol.ProtocolFactory;
+import org.apache.nutch.protocol.ProtocolOutput;
+import org.apache.nutch.protocol.ProtocolStatus;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import crawlercommons.robots.BaseRobotRules;
+import crawlercommons.sitemaps.AbstractSiteMap;
+import crawlercommons.sitemaps.SiteMap;
+import crawlercommons.sitemaps.SiteMapIndex;
+import crawlercommons.sitemaps.SiteMapParser;
+import crawlercommons.sitemaps.SiteMapURL;
+
+/**
+ * Performs Sitemap processing by fetching sitemap links, parsing the content and merging
+ * the urls from Sitemap (with the metadata) with the existing crawldb.
+ *
+ * There are two use cases supported in Nutch's Sitemap processing:
+ *
+ * Sitemaps are considered as "remote seed lists". Crawl administrators can prepare a
+ * list of sitemap links and get only those sitemap pages. This suits well for targeted
+ * crawl of specific hosts.
+ * For open web crawl, it is not possible to track each host and get the sitemap links
+ * manually. Nutch would automatically get the sitemaps for all the hosts seen in the
+ * crawls and inject the urls from sitemap to the crawldb.
+ *
+ *
+ * For more details see:
+ * https://wiki.apache.org/nutch/SitemapFeature
+ */
+public class SitemapProcessor extends Configured implements Tool {
+  public static final Logger LOG = LoggerFactory.getLogger(SitemapProcessor.class);
+  public static final SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
+
+  public static final String CURRENT_NAME = "current";

Review comment: What is your suggestion here @sebastian-nagel ?
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15986615#comment-15986615 ]

ASF GitHub Bot commented on NUTCH-1465:

sebastian-nagel commented on a change in pull request #189: NUTCH-1465 Support sitemaps in Nutch
URL: https://github.com/apache/nutch/pull/189#discussion_r113689552

## File path: src/java/org/apache/nutch/util/SitemapProcessor.java ##
@@ -0,0 +1,436 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.nutch.util;
+
+import java.io.IOException;
+import java.net.URL;
+import java.text.SimpleDateFormat;
+import java.util.Collection;
+import java.util.LinkedList;
+import java.util.List;
+import java.util.Random;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.conf.Configured;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.io.Text;
+import org.apache.hadoop.io.Writable;
+import org.apache.hadoop.mapreduce.Job;
+import org.apache.hadoop.mapreduce.Mapper;
+import org.apache.hadoop.mapreduce.Reducer;
+import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
+import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
+import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
+import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;
+import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
+import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
+import org.apache.hadoop.util.StringUtils;
+import org.apache.hadoop.util.Tool;
+import org.apache.hadoop.util.ToolRunner;
+
+import org.apache.nutch.crawl.CrawlDatum;
+import org.apache.nutch.hostdb.HostDatum;
+import org.apache.nutch.net.URLFilters;
+import org.apache.nutch.net.URLNormalizers;
+import org.apache.nutch.protocol.Content;
+import org.apache.nutch.protocol.Protocol;
+import org.apache.nutch.protocol.ProtocolFactory;
+import org.apache.nutch.protocol.ProtocolOutput;
+import org.apache.nutch.protocol.ProtocolStatus;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import crawlercommons.robots.BaseRobotRules;
+import crawlercommons.sitemaps.AbstractSiteMap;
+import crawlercommons.sitemaps.SiteMap;
+import crawlercommons.sitemaps.SiteMapIndex;
+import crawlercommons.sitemaps.SiteMapParser;
+import crawlercommons.sitemaps.SiteMapURL;
+
+/**
+ * Performs Sitemap processing by fetching sitemap links, parsing the content and merging
+ * the urls from Sitemap (with the metadata) with the existing crawldb.
+ *
+ * There are two use cases supported in Nutch's Sitemap processing:
+ *
+ * Sitemaps are considered as "remote seed lists". Crawl administrators can prepare a
+ * list of sitemap links and get only those sitemap pages. This suits well for targeted
+ * crawl of specific hosts.
+ * For open web crawl, it is not possible to track each host and get the sitemap links
+ * manually. Nutch would automatically get the sitemaps for all the hosts seen in the
+ * crawls and inject the urls from sitemap to the crawldb.
+ *
+ *
+ * For more details see:
+ * https://wiki.apache.org/nutch/SitemapFeature
+ */
+public class SitemapProcessor extends Configured implements Tool {
+  public static final Logger LOG = LoggerFactory.getLogger(SitemapProcessor.class);
+  public static final SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
+
+  public static final String CURRENT_NAME = "current";
+  public static final String LOCK_NAME = ".locked";
+  public static final String SITEMAP_STRICT_PARSING = "sitemap.strict.parsing";
+  public static final String SITEMAP_URL_FILTERING = "sitemap.url.filter";
+  public static final String SITEMAP_URL_NORMALIZING = "sitemap.url.normalize";
+
+  private static class SitemapMapper extends Mapper<Text, Writable, Text, CrawlDatum> {
+    private ProtocolFactory protocolFactory = null;
+    private boolean strict = true;
+    private boolean filter = true;
+    private boolean normalize = true;
+    private URLFilters filters = null;
+    private URLNormalizers normalizers = null;
+    private CrawlDatum datum =
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15986614#comment-15986614 ]

ASF GitHub Bot commented on NUTCH-1465:

sebastian-nagel commented on a change in pull request #189: NUTCH-1465 Support sitemaps in Nutch
URL: https://github.com/apache/nutch/pull/189#discussion_r113689082

## File path: src/java/org/apache/nutch/util/SitemapProcessor.java ##
@@ -0,0 +1,436 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.nutch.util;
+
+import java.io.IOException;
+import java.net.URL;
+import java.text.SimpleDateFormat;
+import java.util.Collection;
+import java.util.LinkedList;
+import java.util.List;
+import java.util.Random;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.conf.Configured;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.io.Text;
+import org.apache.hadoop.io.Writable;
+import org.apache.hadoop.mapreduce.Job;
+import org.apache.hadoop.mapreduce.Mapper;
+import org.apache.hadoop.mapreduce.Reducer;
+import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
+import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
+import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
+import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;
+import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
+import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
+import org.apache.hadoop.util.StringUtils;
+import org.apache.hadoop.util.Tool;
+import org.apache.hadoop.util.ToolRunner;
+
+import org.apache.nutch.crawl.CrawlDatum;
+import org.apache.nutch.hostdb.HostDatum;
+import org.apache.nutch.net.URLFilters;
+import org.apache.nutch.net.URLNormalizers;
+import org.apache.nutch.protocol.Content;
+import org.apache.nutch.protocol.Protocol;
+import org.apache.nutch.protocol.ProtocolFactory;
+import org.apache.nutch.protocol.ProtocolOutput;
+import org.apache.nutch.protocol.ProtocolStatus;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import crawlercommons.robots.BaseRobotRules;
+import crawlercommons.sitemaps.AbstractSiteMap;
+import crawlercommons.sitemaps.SiteMap;
+import crawlercommons.sitemaps.SiteMapIndex;
+import crawlercommons.sitemaps.SiteMapParser;
+import crawlercommons.sitemaps.SiteMapURL;
+
+/**
+ * Performs Sitemap processing by fetching sitemap links, parsing the content and merging
+ * the urls from Sitemap (with the metadata) with the existing crawldb.
+ *
+ * There are two use cases supported in Nutch's Sitemap processing:
+ *
+ * Sitemaps are considered as "remote seed lists". Crawl administrators can prepare a
+ * list of sitemap links and get only those sitemap pages. This suits well for targeted
+ * crawl of specific hosts.
+ * For open web crawl, it is not possible to track each host and get the sitemap links
+ * manually. Nutch would automatically get the sitemaps for all the hosts seen in the
+ * crawls and inject the urls from sitemap to the crawldb.
+ *
+ *
+ * For more details see:
+ * https://wiki.apache.org/nutch/SitemapFeature
+ */
+public class SitemapProcessor extends Configured implements Tool {
+  public static final Logger LOG = LoggerFactory.getLogger(SitemapProcessor.class);
+  public static final SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
+
+  public static final String CURRENT_NAME = "current";

Review comment: But in [ReadHostDb](https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/hostdb/ReadHostDb.java#L182) and [UpdateHostDb](https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/hostdb/UpdateHostDb.java#L107) a String literal `"current"` is still used.
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15986617#comment-15986617 ]

ASF GitHub Bot commented on NUTCH-1465:

sebastian-nagel commented on a change in pull request #189: NUTCH-1465 Support sitemaps in Nutch
URL: https://github.com/apache/nutch/pull/189#discussion_r113684593

## File path: conf/nutch-default.xml ##
@@ -2529,7 +2529,33 @@ visit https://wiki.apache.org/nutch/SimilarityScoringFilter-->
 Default is 'fanout.key'
-The routingKey used by publisher to publish messages to specific queues. If the exchange type is "fanout", then this property is ignored.
+The routingKey used by publisher to publish messages to specific queues.
+If the exchange type is "fanout", then this property is ignored.
+
+
+
+

Review comment: These 3 properties are used to transfer command-line options from the Hadoop client to the tasks. The values are always overwritten, so it doesn't make sense to set them here or in nutch-site.xml.
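The review comment above can be illustrated with a small sketch of how command-line options end up in the job configuration. This is an assumption-laden illustration, not the SitemapProcessor code: a plain Map stands in for Hadoop's Configuration, the flag names (-noStrict, -noFilter, -noNormalize) are hypothetical, and the three keys are assumed to be the sitemap.* constants quoted in the SitemapProcessor diff elsewhere in this thread. What it shows is that the client always writes all three values before submitting the job, so anything set for them in a configuration file is replaced.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; a Map stands in for the Hadoop Configuration object.
class SitemapOptionsSketch {
  static final String SITEMAP_STRICT_PARSING = "sitemap.strict.parsing";
  static final String SITEMAP_URL_FILTERING = "sitemap.url.filter";
  static final String SITEMAP_URL_NORMALIZING = "sitemap.url.normalize";

  /** Returns the configuration the tasks would see for the given file values and arguments. */
  static Map<String, String> applyArgs(Map<String, String> fromFiles, String[] args) {
    boolean strict = true;
    boolean filter = true;
    boolean normalize = true;
    for (String arg : args) {
      switch (arg) {
        case "-noStrict":    strict = false;    break; // hypothetical flag name
        case "-noFilter":    filter = false;    break; // hypothetical flag name
        case "-noNormalize": normalize = false; break; // hypothetical flag name
        default: break; // other arguments are not modeled in this sketch
      }
    }
    Map<String, String> conf = new HashMap<>(fromFiles);
    // Always written by the client, so values coming from config files never survive.
    conf.put(SITEMAP_STRICT_PARSING, Boolean.toString(strict));
    conf.put(SITEMAP_URL_FILTERING, Boolean.toString(filter));
    conf.put(SITEMAP_URL_NORMALIZING, Boolean.toString(normalize));
    return conf;
  }

  public static void main(String[] args) {
    Map<String, String> fromFiles = new HashMap<>();
    fromFiles.put(SITEMAP_URL_FILTERING, "false"); // value set in a config file

    Map<String, String> conf = applyArgs(fromFiles, new String[] {"-noStrict"});
    System.out.println(conf.get(SITEMAP_STRICT_PARSING));  // prints "false"
    System.out.println(conf.get(SITEMAP_URL_FILTERING));   // prints "true"
    System.out.println(conf.get(SITEMAP_URL_NORMALIZING)); // prints "true"
  }
}
```

In the example run the file sets the filtering key to false, yet the tasks see true because no matching flag was passed. That is the situation the review comment describes as making default entries for these keys pointless.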
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15986618#comment-15986618 ]

ASF GitHub Bot commented on NUTCH-1465:

sebastian-nagel commented on a change in pull request #189: NUTCH-1465 Support sitemaps in Nutch
URL: https://github.com/apache/nutch/pull/189#discussion_r113687939

## File path: src/java/org/apache/nutch/crawl/CrawlDatum.java ##
@@ -90,6 +90,8 @@
   public static final byte STATUS_LINKED = 0x43;
   /** Page got metadata from a parser */
   public static final byte STATUS_PARSE_META = 0x44;
+  /** Page was discovered from sitemap */
+  public static final byte STATUS_SITEMAP = 0x45;

Review comment: Do we really need a new status? STATUS_INJECTED could also be used: both are assigned in the mapper (SitemapMapper and InjectMapper, respectively) and replaced by STATUS_DB_UNFETCHED in the reducer (SitemapReducer/InjectReducer).
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15986616#comment-15986616 ]

ASF GitHub Bot commented on NUTCH-1465:

sebastian-nagel commented on a change in pull request #189: NUTCH-1465 Support sitemaps in Nutch
URL: https://github.com/apache/nutch/pull/189#discussion_r113693977

## File path: src/java/org/apache/nutch/util/SitemapProcessor.java ##
@@ -0,0 +1,436 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.nutch.util;
+
+import java.io.IOException;
+import java.net.URL;
+import java.text.SimpleDateFormat;
+import java.util.Collection;
+import java.util.LinkedList;
+import java.util.List;
+import java.util.Random;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.conf.Configured;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.io.Text;
+import org.apache.hadoop.io.Writable;
+import org.apache.hadoop.mapreduce.Job;
+import org.apache.hadoop.mapreduce.Mapper;
+import org.apache.hadoop.mapreduce.Reducer;
+import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
+import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
+import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
+import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;
+import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
+import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
+import org.apache.hadoop.util.StringUtils;
+import org.apache.hadoop.util.Tool;
+import org.apache.hadoop.util.ToolRunner;
+
+import org.apache.nutch.crawl.CrawlDatum;
+import org.apache.nutch.hostdb.HostDatum;
+import org.apache.nutch.net.URLFilters;
+import org.apache.nutch.net.URLNormalizers;
+import org.apache.nutch.protocol.Content;
+import org.apache.nutch.protocol.Protocol;
+import org.apache.nutch.protocol.ProtocolFactory;
+import org.apache.nutch.protocol.ProtocolOutput;
+import org.apache.nutch.protocol.ProtocolStatus;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import crawlercommons.robots.BaseRobotRules;
+import crawlercommons.sitemaps.AbstractSiteMap;
+import crawlercommons.sitemaps.SiteMap;
+import crawlercommons.sitemaps.SiteMapIndex;
+import crawlercommons.sitemaps.SiteMapParser;
+import crawlercommons.sitemaps.SiteMapURL;
+
+/**
+ * Performs Sitemap processing by fetching sitemap links, parsing the content and merging
+ * the urls
from Sitemap (with the metadata) with the existing crawldb. + * + * There are two use cases supported in Nutch's Sitemap processing: + * + * Sitemaps are considered as "remote seed lists". Crawl administrators can prepare a + * list of sitemap links and get only those sitemap pages. This suits well for targeted + * crawl of specific hosts. + * For open web crawl, it is not possible to track each host and get the sitemap links + * manually. Nutch would automatically get the sitemaps for all the hosts seen in the + * crawls and inject the urls from sitemap to the crawldb. + * + * + * For more details see: + * https://wiki.apache.org/nutch/SitemapFeature + */ +public class SitemapProcessor extends Configured implements Tool { + public static final Logger LOG = LoggerFactory.getLogger(SitemapProcessor.class); + public static final SimpleDateFormat sdf = new SimpleDateFormat("-MM-dd HH:mm:ss"); + + public static final String CURRENT_NAME = "current"; + public static final String LOCK_NAME = ".locked"; + public static final String SITEMAP_STRICT_PARSING = "sitemap.strict.parsing"; + public static final String SITEMAP_URL_FILTERING = "sitemap.url.filter"; + public static final String SITEMAP_URL_NORMALIZING = "sitemap.url.normalize"; + + private static class SitemapMapper extends Mapper{ +private ProtocolFactory protocolFactory = null; +private boolean strict = true; +private boolean filter = true; +private boolean normalize = true; +private URLFilters filters = null; +private URLNormalizers normalizers = null; +private CrawlDatum datum =
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985680#comment-15985680 ] ASF GitHub Bot commented on NUTCH-1465: --- lewismc commented on issue #189: NUTCH-1465 Support sitemaps in Nutch URL: https://github.com/apache/nutch/pull/189#issuecomment-297560764 We could also improve with parameterized logging in due course. I wanted to post this patch as a mechanism for relighting the interest in Sitemap parsing with master branch. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Support sitemaps in Nutch > - > > Key: NUTCH-1465 > URL: https://issues.apache.org/jira/browse/NUTCH-1465 > Project: Nutch > Issue Type: New Feature > Components: parser >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney > Fix For: 1.14 > > Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, > NUTCH-1465-trunk.v1.patch, NUTCH-1465-trunk.v2.patch, > NUTCH-1465-trunk.v3.patch, NUTCH-1465-trunk.v4.patch, > NUTCH-1465-trunk.v5.patch > > > I recently came across this rather stagnant codebase[0] which is ASL v2.0 > licensed and appears to have been used successfully to parse sitemaps as per > the discussion here[1]. > [0] http://sourceforge.net/projects/sitemap-parser/ > [1] > http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985678#comment-15985678 ] ASF GitHub Bot commented on NUTCH-1465: --- lewismc commented on a change in pull request #189: NUTCH-1465 Support sitemaps in Nutch URL: https://github.com/apache/nutch/pull/189#discussion_r113578673 ## File path: src/java/org/apache/nutch/util/SitemapProcessor.java ## @@ -0,0 +1,436 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.nutch.util; + +import java.io.IOException; +import java.net.URL; +import java.text.SimpleDateFormat; +import java.util.Collection; +import java.util.LinkedList; +import java.util.List; +import java.util.Random; + +import org.apache.hadoop.conf.Configuration; +import org.apache.hadoop.conf.Configured; +import org.apache.hadoop.fs.FileSystem; +import org.apache.hadoop.fs.Path; +import org.apache.hadoop.io.Text; +import org.apache.hadoop.io.Writable; +import org.apache.hadoop.mapreduce.Job; +import org.apache.hadoop.mapreduce.Mapper; +import org.apache.hadoop.mapreduce.Reducer; +import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat; +import org.apache.hadoop.mapreduce.lib.input.MultipleInputs; +import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat; +import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper; +import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat; +import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat; +import org.apache.hadoop.util.StringUtils; +import org.apache.hadoop.util.Tool; +import org.apache.hadoop.util.ToolRunner; + +import org.apache.nutch.crawl.CrawlDatum; +import org.apache.nutch.hostdb.HostDatum; +import org.apache.nutch.net.URLFilters; +import org.apache.nutch.net.URLNormalizers; +import org.apache.nutch.protocol.Content; +import org.apache.nutch.protocol.Protocol; +import org.apache.nutch.protocol.ProtocolFactory; +import org.apache.nutch.protocol.ProtocolOutput; +import org.apache.nutch.protocol.ProtocolStatus; + +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import crawlercommons.robots.BaseRobotRules; +import crawlercommons.sitemaps.AbstractSiteMap; +import crawlercommons.sitemaps.SiteMap; +import crawlercommons.sitemaps.SiteMapIndex; +import crawlercommons.sitemaps.SiteMapParser; +import crawlercommons.sitemaps.SiteMapURL; + +/** + * Performs Sitemap processing by fetching sitemap links, parsing the content and merging + * the urls 
from Sitemap (with the metadata) with the existing crawldb. + * + * There are two use cases supported in Nutch's Sitemap processing: + * + * Sitemaps are considered as "remote seed lists". Crawl administrators can prepare a + * list of sitemap links and get only those sitemap pages. This suits well for targeted + * crawl of specific hosts. + * For open web crawl, it is not possible to track each host and get the sitemap links + * manually. Nutch would automatically get the sitemaps for all the hosts seen in the + * crawls and inject the urls from sitemap to the crawldb. + * + * + * For more details see: + * https://wiki.apache.org/nutch/SitemapFeature + */ +public class SitemapProcessor extends Configured implements Tool { + public static final Logger LOG = LoggerFactory.getLogger(SitemapProcessor.class); + public static final SimpleDateFormat sdf = new SimpleDateFormat("-MM-dd HH:mm:ss"); + + public static final String CURRENT_NAME = "current"; Review comment: I also introduced this constant to mimic what is done in the CrawlDb and LinkDb classes. It is meant to represent the current HostDb. Of course, we don't have a HostDb class in the codebase right now, so this constant has been introduced here. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Support sitemaps in Nutch > - > > Key: NUTCH-1465
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985676#comment-15985676 ] ASF GitHub Bot commented on NUTCH-1465: --- lewismc commented on a change in pull request #189: NUTCH-1465 Support sitemaps in Nutch URL: https://github.com/apache/nutch/pull/189#discussion_r113578491 ## File path: src/java/org/apache/nutch/util/SitemapProcessor.java ## @@ -0,0 +1,436 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.nutch.util; + +import java.io.IOException; +import java.net.URL; +import java.text.SimpleDateFormat; +import java.util.Collection; +import java.util.LinkedList; +import java.util.List; +import java.util.Random; + +import org.apache.hadoop.conf.Configuration; +import org.apache.hadoop.conf.Configured; +import org.apache.hadoop.fs.FileSystem; +import org.apache.hadoop.fs.Path; +import org.apache.hadoop.io.Text; +import org.apache.hadoop.io.Writable; +import org.apache.hadoop.mapreduce.Job; +import org.apache.hadoop.mapreduce.Mapper; +import org.apache.hadoop.mapreduce.Reducer; +import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat; +import org.apache.hadoop.mapreduce.lib.input.MultipleInputs; +import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat; +import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper; +import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat; +import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat; +import org.apache.hadoop.util.StringUtils; +import org.apache.hadoop.util.Tool; +import org.apache.hadoop.util.ToolRunner; + +import org.apache.nutch.crawl.CrawlDatum; +import org.apache.nutch.hostdb.HostDatum; +import org.apache.nutch.net.URLFilters; +import org.apache.nutch.net.URLNormalizers; +import org.apache.nutch.protocol.Content; +import org.apache.nutch.protocol.Protocol; +import org.apache.nutch.protocol.ProtocolFactory; +import org.apache.nutch.protocol.ProtocolOutput; +import org.apache.nutch.protocol.ProtocolStatus; + +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import crawlercommons.robots.BaseRobotRules; +import crawlercommons.sitemaps.AbstractSiteMap; +import crawlercommons.sitemaps.SiteMap; +import crawlercommons.sitemaps.SiteMapIndex; +import crawlercommons.sitemaps.SiteMapParser; +import crawlercommons.sitemaps.SiteMapURL; + +/** + * Performs Sitemap processing by fetching sitemap links, parsing the content and merging + * the urls 
from Sitemap (with the metadata) with the existing crawldb. + * + * There are two use cases supported in Nutch's Sitemap processing: + * + * Sitemaps are considered as "remote seed lists". Crawl administrators can prepare a + * list of sitemap links and get only those sitemap pages. This suits well for targeted + * crawl of specific hosts. + * For open web crawl, it is not possible to track each host and get the sitemap links + * manually. Nutch would automatically get the sitemaps for all the hosts seen in the + * crawls and inject the urls from sitemap to the crawldb. + * + * + * For more details see: + * https://wiki.apache.org/nutch/SitemapFeature + */ +public class SitemapProcessor extends Configured implements Tool { + public static final Logger LOG = LoggerFactory.getLogger(SitemapProcessor.class); + public static final SimpleDateFormat sdf = new SimpleDateFormat("-MM-dd HH:mm:ss"); + + public static final String CURRENT_NAME = "current"; + public static final String LOCK_NAME = ".locked"; + public static final String SITEMAP_STRICT_PARSING = "sitemap.strict.parsing"; + public static final String SITEMAP_URL_FILTERING = "sitemap.url.filter"; + public static final String SITEMAP_URL_NORMALIZING = "sitemap.url.normalize"; + + private static class SitemapMapper extends Mapper{ +private ProtocolFactory protocolFactory = null; +private boolean strict = true; +private boolean filter = true; +private boolean normalize = true; +private URLFilters filters = null; +private URLNormalizers normalizers = null; +private CrawlDatum datum = new
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985671#comment-15985671 ] ASF GitHub Bot commented on NUTCH-1465: --- lewismc opened a new pull request #189: NUTCH-1465 Support sitemaps in Nutch URL: https://github.com/apache/nutch/pull/189 Hi Folks this issue addresses [NUTCH-1465](https://issues.apache.org/jira/browse/NUTCH-1465), I have an issue with some code which I will point out separately. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Support sitemaps in Nutch > - > > Key: NUTCH-1465 > URL: https://issues.apache.org/jira/browse/NUTCH-1465 > Project: Nutch > Issue Type: New Feature > Components: parser >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney > Fix For: 1.14 > > Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, > NUTCH-1465-trunk.v1.patch, NUTCH-1465-trunk.v2.patch, > NUTCH-1465-trunk.v3.patch, NUTCH-1465-trunk.v4.patch, > NUTCH-1465-trunk.v5.patch > > > I recently came across this rather stagnant codebase[0] which is ASL v2.0 > licensed and appears to have been used successfully to parse sitemaps as per > the discussion here[1]. > [0] http://sourceforge.net/projects/sitemap-parser/ > [1] > http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15979161#comment-15979161 ] Sebastian Nagel commented on NUTCH-1465: Hi Lewis, a couple of months ago I applied the latest patch here (NUTCH-1465-trunk.v5.patch) to master, see https://github.com/sebastian-nagel/nutch/tree/NUTCH-1465. But I had to port this to the Common Crawl fork of Nutch (https://github.com/commoncrawl/nutch), so I chose the [SitemapInjector|https://github.com/commoncrawl/nutch/blob/cc/src/java/org/apache/nutch/crawl/SitemapInjector.java] from an older patch which was still based on the old mapred API. > Support sitemaps in Nutch > - > > Key: NUTCH-1465 > URL: https://issues.apache.org/jira/browse/NUTCH-1465 > Project: Nutch > Issue Type: New Feature > Components: parser >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney > Fix For: 1.14 > > Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, > NUTCH-1465-trunk.v1.patch, NUTCH-1465-trunk.v2.patch, > NUTCH-1465-trunk.v3.patch, NUTCH-1465-trunk.v4.patch, > NUTCH-1465-trunk.v5.patch > > > I recently came across this rather stagnant codebase[0] which is ASL v2.0 > licensed and appears to have been used successfully to parse sitemaps as per > the discussion here[1]. > [0] http://sourceforge.net/projects/sitemap-parser/ > [1] > http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15979078#comment-15979078 ] Lewis John McGibbney commented on NUTCH-1465: - I'm going to take this on. We want full sitemap support in our current crawlers so I am making this my priority. I'll submit a pull request for current patches then we can take it from there. > Support sitemaps in Nutch > - > > Key: NUTCH-1465 > URL: https://issues.apache.org/jira/browse/NUTCH-1465 > Project: Nutch > Issue Type: New Feature > Components: parser >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney > Fix For: 1.14 > > Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, > NUTCH-1465-trunk.v1.patch, NUTCH-1465-trunk.v2.patch, > NUTCH-1465-trunk.v3.patch, NUTCH-1465-trunk.v4.patch, > NUTCH-1465-trunk.v5.patch > > > I recently came across this rather stagnant codebase[0] which is ASL v2.0 > licensed and appears to have been used successfully to parse sitemaps as per > the discussion here[1]. > [0] http://sourceforge.net/projects/sitemap-parser/ > [1] > http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13887588#comment-13887588 ] Sebastian Nagel commented on NUTCH-1465:
filters and normalizers: -noFilter is not really an option if sitemaps are used and gzipped documents (e.g. software packages) shall be excluded. In customized crawls URL filter rules are often complex, and I want to avoid having two sets of rules in the end. Sitemaps are different from normal docs/URLs (robots.txt is also different): they are not stored in CrawlDb and may require other filter rules. What about an option -noFilterSitemap?
Fetch intervals of 1 second or 1 hour may cause troubles: We are blindly accepting user's custom information in inject. Yes, because the user (crawl administrator) can change the seed list (it's a file/directory on local disk or HDFS). Sitemaps are not necessarily under the control of the user. If we (optionally) adjust the fetch interval by (configurable) min/max limits, that would help to avoid unreasonable values and, e.g., re-fetching a bunch of pages every cycle.
SitemapReducer overwriting : In a continuous crawl we know when pages are modified and have heuristics to estimate the change frequency of a page (AdaptiveFetchSchedule). The question is whether we trust those values which are obtained from crawling or prefer (possibly bogus) values from sitemaps. Using the sitemap values for new URLs found in sitemaps is less critical.
(a) score : Crawler commons assigns a default score of 0.5 if there was none provided in sitemap. Needs an upgrade of crawler-commons (0.2 is still used which sets priority to 0).
Support sitemaps in Nutch - Key: NUTCH-1465 URL: https://issues.apache.org/jira/browse/NUTCH-1465 Project: Nutch Issue Type: New Feature Components: parser Reporter: Lewis John McGibbney Assignee: Tejas Patil Fix For: 1.8 Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, NUTCH-1465-trunk.v1.patch, NUTCH-1465-trunk.v2.patch, NUTCH-1465-trunk.v3.patch, NUTCH-1465-trunk.v4.patch, NUTCH-1465-trunk.v5.patch I recently came across this rather stagnant codebase[0] which is ASL v2.0 licensed and appears to have been used successfully to parse sitemaps as per the discussion here[1]. [0] http://sourceforge.net/projects/sitemap-parser/ [1] http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html -- This message was sent by Atlassian JIRA (v6.1.5#6160)
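The proposal to adjust sitemap-derived fetch intervals by configurable min/max limits could look roughly like the sketch below. It is a minimal illustration, not code from any of the attached patches: the class and method names are hypothetical, intervals are taken to be in seconds, and the caller is assumed to read the two limits from configuration (for example the db.fetch.schedule.adaptive.min_interval and db.fetch.interval.max properties mentioned elsewhere in this thread).

```java
/** Hypothetical helper: keeps a sitemap-derived fetch interval within configured bounds. */
final class FetchIntervalClamp {
  /**
   * @param sitemapIntervalSec interval derived from changefreq, in seconds
   * @param minSec lower limit, in seconds (assumed to come from configuration)
   * @param maxSec upper limit, in seconds (assumed to come from configuration)
   */
  static int clamp(int sitemapIntervalSec, int minSec, int maxSec) {
    if (minSec > maxSec) {
      throw new IllegalArgumentException("min limit exceeds max limit");
    }
    // Too-small values are raised to the minimum, too-large ones lowered to the maximum.
    return Math.max(minSec, Math.min(maxSec, sitemapIntervalSec));
  }
}
```

With limits of 60 seconds and 90 days, a changefreq-derived interval of 1 second would be raised to 60 seconds, while an interval of 1 hour would pass through unchanged.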
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13887763#comment-13887763 ] Tejas Patil commented on NUTCH-1465:
Re filters and normalizers: +1.
Re fetch intervals and reducer overwriting: I have never encountered bogus sitemaps, but that was for an intranet crawl, and it would be better to take care of that in this jira. Here is what I conclude from the discussion till now:
(1) _fetch interval_: For old entries, don't use the value from sitemap. For new ones, use the value from sitemap provided (db.fetch.schedule.adaptive.min_interval <= interval <= db.fetch.interval.max)
(2) _score_: Never use value from sitemap. For new ones, use scoring filters. Keep the value of old entries as it is.
(3) _modified time_: Always use the value from sitemap provided it's not a date in the future.
Did I get it right?
Re score: I missed that the jar is old. I will file a jira to upgrade CC to v0.3 in Nutch.
Support sitemaps in Nutch - Key: NUTCH-1465 URL: https://issues.apache.org/jira/browse/NUTCH-1465 Project: Nutch Issue Type: New Feature Components: parser Reporter: Lewis John McGibbney Assignee: Tejas Patil Fix For: 1.8 Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, NUTCH-1465-trunk.v1.patch, NUTCH-1465-trunk.v2.patch, NUTCH-1465-trunk.v3.patch, NUTCH-1465-trunk.v4.patch, NUTCH-1465-trunk.v5.patch I recently came across this rather stagnant codebase[0] which is ASL v2.0 licensed and appears to have been used successfully to parse sitemaps as per the discussion here[1]. [0] http://sourceforge.net/projects/sitemap-parser/ [1] http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13888220#comment-13888220 ] Sebastian Nagel commented on NUTCH-1465: ??(1) fetch interval: ...?? +1, sounds plausible. ??(2) score: Never use value from sitemap. For new ones, use scoring filters. Keep the value of old entries as it is.?? That means use {{ScoringFilter.initialScore(...)}} for new ones? Why not use the priority for newly found URLs? If the site owner takes it seriously the score can be useful. We could make it configurable, e.g. by a factor {{sitemap.priority.factor}}. If it's 0.0 priority is not used. Usually, the factor should be low to avoid that the total score in the web graph (cf. [FixingOpicScoring|http://wiki.apache.org/nutch/FixingOpicScoring]) gets too high when injecting 50,000 URLs from sitemaps each with 1.0 priority. Alternatively, we could just put values from sitemap in CrawlDatum's meta data and delegate any actions to set the score to scoring filters or FetchSchedule implementations. Users then can more easily adapt any sitemap logic to their needs (cf. below). ??(3) modified time: Always use the value from sitemap provided its not a date in future.?? Um, seems that this way is conceptually wrong (and was also in SitemapInjector). The modified time in CrawlDb must indicate the time of the last fetch or the modified time sent by the server when a page was fetched. If we overwrite the modified time, the server may just answer not-modified on an if-modified-since request and we'll never get the current version of a page. So we must not touch modified time, even for newly discovered pages, where it must be 0. If it's not zero, the if-modified-since header field is sent although the page has never been fetched, cf. HttpResponse.java. 
If we can trust the sitemap the desired behaviour would be to set fetch time (in CrawlDb = time when next fetch should happen) to now (or sitemap modified time) if (and only if) sitemap.modif > crawldb.modif. This would make sure that changed pages are fetched asap. If the sitemap is not 100% trustworthy we should be more careful. Could we again delegate this decision (trustworthy or not) to scoring filter or FetchSchedule implementations? Whether we can trust a sitemap may depend on concrete crawler config/project and should be configurable. Would this require a new method in scoring/schedule interfaces? More open questions than before!? Comments are welcome! Support sitemaps in Nutch - Key: NUTCH-1465 URL: https://issues.apache.org/jira/browse/NUTCH-1465 Project: Nutch Issue Type: New Feature Components: parser Reporter: Lewis John McGibbney Assignee: Tejas Patil Fix For: 1.8 Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, NUTCH-1465-trunk.v1.patch, NUTCH-1465-trunk.v2.patch, NUTCH-1465-trunk.v3.patch, NUTCH-1465-trunk.v4.patch, NUTCH-1465-trunk.v5.patch I recently came across this rather stagnant codebase[0] which is ASL v2.0 licensed and appears to have been used successfully to parse sitemaps as per the discussion here[1]. [0] http://sourceforge.net/projects/sitemap-parser/ [1] http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html -- This message was sent by Atlassian JIRA (v6.1.5#6160)
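The two ideas in the comment above (scaling the sitemap priority by a factor, and pulling the next fetch forward only when the sitemap reports a newer modification) might be sketched as follows. This is a hedged illustration only: `SitemapHints` and its methods are hypothetical, times are epoch milliseconds, the factor stands for the proposed {{sitemap.priority.factor}} setting, and the extra check against dates in the future is borrowed from an earlier point in the thread rather than from this comment. The modified time stored in the CrawlDb is deliberately left untouched.

```java
/** Hypothetical helper for using sitemap hints without overwriting the CrawlDb modified time. */
final class SitemapHints {
  /** Initial score for a newly found URL; a factor of 0.0 means the priority is ignored. */
  static float initialScore(float sitemapPriority, float priorityFactor) {
    return sitemapPriority * priorityFactor;
  }

  /**
   * Returns the time of the next fetch. The fetch is pulled forward to "now" only if the
   * sitemap claims a modification newer than the one recorded in the CrawlDb.
   */
  static long nextFetchTime(long currentFetchTime, long crawlDbModified,
      long sitemapModified, long now) {
    boolean newerInSitemap = sitemapModified > crawlDbModified;
    boolean notInFuture = sitemapModified <= now; // assumption: ignore bogus future dates
    return (newerInSitemap && notInFuture) ? now : currentFetchTime;
  }
}
```

Whether such a rule is applied at all could still be delegated to a scoring filter or FetchSchedule implementation, as the comment suggests; the sketch only shows the "trusted sitemap" case.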
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13886453#comment-13886453 ] Sebastian Nagel commented on NUTCH-1465:
Thanks, [~tejasp] for the improvements! Testing continued...
Sitemaps are treated the same as ordinary URLs/docs. But there are some differences. Shouldn't we relax default limits and filters and trust the restrictions specified in the sitemap protocol?
* URL filters and normalizers: maybe you want to exclude .gz docs per suffix filter but still fetch gzipped sitemaps. That's not possible. Is it really necessary to normalize/filter sitemap URLs? If yes, this should be optional.
* default content limits {http,ftp,file}.content.limit (64 kB) are quite small even for mid-size sitemaps. Ok, you could set it per {{-D...}} but why not increase it to SiteMapParser.MAX_BYTES_ALLOWED?
* maybe we also want to increase the fetch timeout
Processing sitemap indexes fails:
* the check sitemap.isIndex() skips all referenced sitemaps
* protocol for sitemap index and referenced sub-sitemaps may be different (e.g., one sub-sitemap could be https while others are http)
* if processing one of the referenced sitemaps fails, the remaining sub-sitemaps are not processed
Fetch intervals are taken unchecked from changefreq. Should we limit them to reasonable values (db.fetch.schedule.adaptive.min_interval <= interval <= db.fetch.interval.max)? Fetch intervals of 1 second or 1 hour may cause troubles. [[1|http://www.sitemaps.org/protocol.html#xmlTagDefinitions]] explicitly says that changefreq is considered a hint and not a command.
Support sitemaps in Nutch - Key: NUTCH-1465 URL: https://issues.apache.org/jira/browse/NUTCH-1465 Project: Nutch Issue Type: New Feature Components: parser Reporter: Lewis John McGibbney Assignee: Tejas Patil Fix For: 1.8 Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, NUTCH-1465-trunk.v1.patch, NUTCH-1465-trunk.v2.patch, NUTCH-1465-trunk.v3.patch, NUTCH-1465-trunk.v4.patch, NUTCH-1465-trunk.v5.patch I recently came across this rather stagnant codebase[0] which is ASL v2.0 licensed and appears to have been used successfully to parse sitemaps as per the discussion here[1]. [0] http://sourceforge.net/projects/sitemap-parser/ [1] http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html -- This message was sent by Atlassian JIRA (v6.1.5#6160)
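The three sitemap-index problems listed above suggest a traversal in which every referenced sub-sitemap is fetched by its own URL (so each may use a different protocol) and a failure in one does not abort the rest. The following is a sketch under stated assumptions, not the SitemapProcessor code: `SitemapIndexWalk`, `Parsed` and the `fetchAndParse` function are hypothetical stand-ins for the real fetching and crawler-commons parsing, and the depth limit is an added safeguard against indexes that reference each other.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Hypothetical traversal of a sitemap index that tolerates broken sub-sitemaps. */
final class SitemapIndexWalk {
  /** Stand-in parse result: an index lists sub-sitemaps, a plain sitemap lists page URLs. */
  static final class Parsed {
    final boolean isIndex;
    final List<String> entries;

    Parsed(boolean isIndex, List<String> entries) {
      this.isIndex = isIndex;
      this.entries = entries;
    }
  }

  /** Collects page URLs reachable from the given sitemap or sitemap index. */
  static List<String> collectUrls(String sitemapUrl,
      Function<String, Parsed> fetchAndParse, int maxDepth) {
    List<String> pageUrls = new ArrayList<>();
    walk(sitemapUrl, fetchAndParse, maxDepth, pageUrls);
    return pageUrls;
  }

  private static void walk(String url, Function<String, Parsed> fetchAndParse,
      int depthLeft, List<String> out) {
    if (depthLeft < 0) {
      return; // guard against indexes referencing each other
    }
    Parsed parsed;
    try {
      parsed = fetchAndParse.apply(url); // each URL is fetched on its own, whatever its protocol
    } catch (RuntimeException e) {
      return; // one broken sitemap must not stop its siblings
    }
    if (parsed == null) {
      return;
    }
    if (parsed.isIndex) {
      for (String subSitemap : parsed.entries) {
        walk(subSitemap, fetchAndParse, depthLeft - 1, out);
      }
    } else {
      out.addAll(parsed.entries);
    }
  }
}
```

In a real job the skipped sitemaps would presumably be logged or counted rather than silently dropped; the sketch leaves that out to stay short.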
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13886489#comment-13886489 ] Sebastian Nagel commented on NUTCH-1465: SitemapReducer overwrites score, modified time, and fetch interval of existing CrawlDb entries with the values from sitemap. Is this the desired behavior? What about forgotten, hopelessly outdated sitemaps? Or bogus values (last mod in the future)? If a sitemap does not specify one of score, modified time, or fetch interval this value is set to zero. In this case, we should definitely not overwrite existing values. Newly added entries should get assigned db.fetch.interval.default and a reasonable score, e.g. 0.5 as recommended by [[2|http://www.sitemaps.org/protocol.html#xmlTagDefinitions]]. But that may depend on scoring plugins. Comments? Support sitemaps in Nutch - Key: NUTCH-1465 URL: https://issues.apache.org/jira/browse/NUTCH-1465 Project: Nutch Issue Type: New Feature Components: parser Reporter: Lewis John McGibbney Assignee: Tejas Patil Fix For: 1.8 Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, NUTCH-1465-trunk.v1.patch, NUTCH-1465-trunk.v2.patch, NUTCH-1465-trunk.v3.patch, NUTCH-1465-trunk.v4.patch, NUTCH-1465-trunk.v5.patch I recently came across this rather stagnant codebase[0] which is ASL v2.0 licensed and appears to have been used successfully to parse sitemaps as per the discussion here[1]. [0] http://sourceforge.net/projects/sitemap-parser/ [1] http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13886677#comment-13886677 ] Tejas Patil commented on NUTCH-1465: Interesting comments [~wastl-nagel]. Re filters and normalizers : By default I have kept those ON but they can be disabled by using -noFilter and -noNormalize. Re default content limits and fetch timeout: +1. Agree with you. Re Processing sitemap indexes fails : +1. Nice catch. Re Fetch intervals of 1 second or 1 hour may cause troubles : Currently, Injector allows users to provide a custom fetch interval with any value, e.g. 1 sec. It makes sense not to correct it, as the user wants Nutch to use that custom fetch interval. If we view sitemaps as custom seed lists given by a content owner, then it would make sense to follow the intervals. But as you said that sitemaps can be wrongly set or outdated, the intervals might be incorrect. The question boils down to: We are blindly accepting user's custom information in inject. Should we blindly assume that sitemaps are correct or not? I have no strong opinion about either side of the argument. (PS : Default 'db.fetch.schedule.adaptive.min_interval' is 1 min so would allow 1 hr as per db.fetch.schedule.adaptive.min_interval <= interval) Re SitemapReducer overwriting : _If a sitemap does not specify one of score, modified time, or fetch interval this value is set to zero._ Nope. See [SiteMapURL.java|https://code.google.com/p/crawler-commons/source/browse/trunk/src/main/java/crawlercommons/sitemaps/SiteMapURL.java] (a) score : Crawler commons assigns a default score of 0.5 if there was none provided in sitemap. We can do this: If an old entry has a score other than 0.5, it can be preserved, else update. For a new entry, use scoring plugins for a score equal to 0.5, else preserve the same. Limitation: It's not possible to distinguish if the score of 0.5 is from sitemap or the default one if changefreq was absent. 
(b) fetch interval : Crawler commons does NOT set fetch interval if there was none provided in sitemap. So we are sure that whatever value is used is coming from changefreq. Validation might be needed as per comments above. (c) modified time : Same as fetch interval, unless parsed from sitemap file, modified time is set to NULL. Only possible validation is to drop values greater than current time. Support sitemaps in Nutch - Key: NUTCH-1465 URL: https://issues.apache.org/jira/browse/NUTCH-1465 Project: Nutch Issue Type: New Feature Components: parser Reporter: Lewis John McGibbney Assignee: Tejas Patil Fix For: 1.8 Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, NUTCH-1465-trunk.v1.patch, NUTCH-1465-trunk.v2.patch, NUTCH-1465-trunk.v3.patch, NUTCH-1465-trunk.v4.patch, NUTCH-1465-trunk.v5.patch I recently came across this rather stagnant codebase[0] which is ASL v2.0 licensed and appears to have been used successfully to parse sitemaps as per the discussion here[1]. [0] http://sourceforge.net/projects/sitemap-parser/ [1] http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html -- This message was sent by Atlassian JIRA (v6.1.5#6160)
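The merge rules (a) to (c) proposed in the comment above can be sketched roughly as follows. This is only an illustration and not the actual SitemapReducer: the class name, method names and parameters are hypothetical, a null argument stands for "not present in the sitemap", and clamping the interval to a configured minimum is an assumed form of the validation that the comment only mentions as possibly needed.

```java
/** Hypothetical sketch of the proposed merge rules; not Nutch code. */
final class SitemapMergeSketch {
    /** Default priority that crawler-commons reports when the sitemap gives none (per the comment above). */
    static final float DEFAULT_PRIORITY = 0.5f;

    /** Rule (a), existing entry: keep an old score that differs from 0.5, otherwise take the sitemap value. */
    static float mergeScore(float oldScore, float sitemapPriority) {
        return oldScore != DEFAULT_PRIORITY ? oldScore : sitemapPriority;
    }

    /**
     * Rule (b): a null interval means the sitemap had no changefreq, so the old value stays.
     * Clamping to minIntervalSec is an assumption, not something the thread has decided.
     */
    static int mergeInterval(int oldIntervalSec, Integer sitemapIntervalSec, int minIntervalSec) {
        if (sitemapIntervalSec == null) {
            return oldIntervalSec;
        }
        return Math.max(sitemapIntervalSec, minIntervalSec);
    }

    /** Rule (c): ignore a missing lastmod and drop values that lie in the future. */
    static long mergeModifiedTime(long oldModifiedMs, Long sitemapLastModMs, long nowMs) {
        if (sitemapLastModMs == null || sitemapLastModMs > nowMs) {
            return oldModifiedMs;
        }
        return sitemapLastModMs;
    }
}
```

The limitation noted under (a) remains visible here: mergeScore cannot tell an explicit 0.5 in the sitemap apart from the crawler-commons default.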
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13882968#comment-13882968 ]

Sebastian Nagel commented on NUTCH-1465:

Great, looks good and is really compact while providing a lot of functionality. I've just started to test SitemapProcessor, here are my first comments:
* SitemapProcessor.java has no Apache license header
* it would be nice to see counters in the log output
* regarding Lewis' point #3: doesn't a comment "a hacky way" mean: try to avoid that? Why not set isHost inside map(...) by {{isHost = (value instanceof HostDatum)}} and pass it as a parameter to filterNormalize()? This would avoid any errors due to incomplete heuristics, here when testing with sitemaps accessed via the file protocol:
{code}
INFO api.HttpRobotRulesParser - Couldn't get robots.txt for http://file:/tmp/sitemap1.xml/: java.net.UnknownHostException: file
{code}
* concurrency: returning the value of isHost from filterNormalize() to map() via a member variable is not thread-safe and will cause problems in combination with MultithreadedMapper. One more argument for passing it from map() to filterNormalize() as a parameter.

Support sitemaps in Nutch - Key: NUTCH-1465 URL: https://issues.apache.org/jira/browse/NUTCH-1465 Project: Nutch Issue Type: New Feature Components: parser Reporter: Lewis John McGibbney Assignee: Tejas Patil Fix For: 1.8 Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, NUTCH-1465-trunk.v1.patch, NUTCH-1465-trunk.v2.patch, NUTCH-1465-trunk.v3.patch, NUTCH-1465-trunk.v4.patch I recently came across this rather stagnant codebase[0] which is ASL v2.0 licensed and appears to have been used successfully to parse sitemaps as per the discussion here[1]. [0] http://sourceforge.net/projects/sitemap-parser/ [1] http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13883204#comment-13883204 ]

Tejas Patil commented on NUTCH-1465:

Hi [~wastl-nagel],

Thanks a lot for your comments. The first two were straightforward and I agree with those.

Re hacky way: For hosts from the HostDb, we don't know which protocol they belong to. In the code I was checking if http:// is a match, and if that was a bad guess then trying with https://. I didn't handle the ftp:// and file:/ schemes. By hacky I meant this approach of trial and error until a suitable match is found and we create a homepage url for the host. I have thought about your comment and will have a better (yet hacky) way in the coming patch.

Re concurrency: I had thought of this and had searched the internet for the internals of MultithreadedMapper. All I could find is that it has an internal thread pool and each input record is handed over to a thread in this pool to run map() over it. I wrote this code to check if thread safety was ensured in MultithreadedMapper:
{noformat}
private static class SitemapMapper extends Mapper<Text, Writable, Text, CrawlDatum> {
  // member variable written and read by foo(), to provoke a race if map() calls overlap
  private String myurl = null;

  public void map(Text key, Writable value, Context context) throws IOException, InterruptedException {
    if (value instanceof Text) {
      String url = key.toString();
      if (foo(url).compareTo(url) != 0) {
        LOG.warn("Race condition found !!!");
      }
    }
  }

  private String foo(String url) {
    myurl = url;
    // let every other thread sleep between the write and the read
    if (Thread.currentThread().getId() % 2 == 1) {
      try {
        Thread.sleep(1);
      } catch (InterruptedException e) {
        LOG.warn(e.getMessage());
      }
    }
    return myurl;
  }
}
{noformat}
I ran it multiple times with threads set to 10, 100, 1000 and 2000 but never hit the race condition in the code. Is the code snippet above a good way to reveal any race condition in the code? It won't be a formal conclusion, more of an experimental one.

How do I get a concrete conclusion whether MultithreadedMapper is thread safe or not?

Support sitemaps in Nutch - Key: NUTCH-1465 URL: https://issues.apache.org/jira/browse/NUTCH-1465 Project: Nutch Issue Type: New Feature Components: parser Reporter: Lewis John McGibbney Assignee: Tejas Patil Fix For: 1.8 Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, NUTCH-1465-trunk.v1.patch, NUTCH-1465-trunk.v2.patch, NUTCH-1465-trunk.v3.patch, NUTCH-1465-trunk.v4.patch I recently came across this rather stagnant codebase[0] which is ASL v2.0 licensed and appears to have been used successfully to parse sitemaps as per the discussion here[1]. [0] http://sourceforge.net/projects/sitemap-parser/ [1] http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html -- This message was sent by Atlassian JIRA (v6.1.5#6160)
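A sleep-based probe like the one above only reveals a race when the scheduler happens to interleave the threads in the right way, so a clean run proves little. A minimal sketch outside Hadoop, assuming nothing about MultithreadedMapper itself, can instead force the problematic interleaving with latches whenever one object really is shared between two threads. The class, method and latch names are hypothetical.

```java
import java.util.concurrent.CountDownLatch;

/** Hypothetical stand-alone sketch: forces the write/write/read interleaving on a shared field. */
final class SharedFieldRaceSketch {
    private String myurl; // shared mutable state, same role as in the snippet above

    /** Same shape as foo(): write the field, pause, then read it back. */
    String foo(String url, CountDownLatch wrote, CountDownLatch mayRead) {
        myurl = url;
        wrote.countDown();     // signal: the field now holds this caller's URL
        awaitQuietly(mayRead); // pause until the other thread has overwritten it
        return myurl;
    }

    /** Returns what the first thread reads back after a second thread wrote in between. */
    static String demonstrate() {
        SharedFieldRaceSketch shared = new SharedFieldRaceSketch();
        CountDownLatch firstWrote = new CountDownLatch(1);
        CountDownLatch secondWrote = new CountDownLatch(1);
        String[] seenByFirst = new String[1];

        Thread first = new Thread(() ->
                seenByFirst[0] = shared.foo("http://a.example/", firstWrote, secondWrote));
        first.start();

        awaitQuietly(firstWrote);
        shared.myurl = "http://b.example/"; // the calling thread plays the second mapper thread
        secondWrote.countDown();

        try {
            first.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seenByFirst[0]; // "http://b.example/", not the URL the first thread passed in
    }

    private static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}
```

The sketch says nothing about whether Hadoop shares one mapper object across its worker threads. It only shows that, if an object is shared, passing results through a member variable can return another thread's value, which is why passing the value as a parameter or return value is the safer design.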
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13883523#comment-13883523 ]

Sebastian Nagel commented on NUTCH-1465:

Sorry, you're right: the comment "hacky way" applies to trying http and https to check which host URL would pass the filters. That's ok, there is no better solution for that.

But what about the decision whether a string passed to filterNormalize() is a host from HostDb or a URL from a list of sitemaps? This decision could be made without any heuristics: inside map() we know the type (host or sitemap URL) from the class of the value:
{code}
boolean isHost = (value instanceof HostDatum);
String url = filterNormalize(key.toString(), isHost);
{code}
The method filterNormalize() could then be simplified and the member variable isHost would be obsolete.

Regarding concurrency: the javadoc of [MultithreadedMapper.java|http://hadoop.apache.org/docs/stable/api/src-html/org/apache/hadoop/mapreduce/lib/map/MultithreadedMapper.html] states that "Mapper implementations using this MapRunnable must be thread-safe". When in doubt, it may be better to follow this advice and not to look at the (current) implementation. If SitemapParser is thread-safe (at first glance, it is), it should be easy to make SitemapMapper safe.

Support sitemaps in Nutch - Key: NUTCH-1465 URL: https://issues.apache.org/jira/browse/NUTCH-1465 Project: Nutch Issue Type: New Feature Components: parser Reporter: Lewis John McGibbney Assignee: Tejas Patil Fix For: 1.8 Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, NUTCH-1465-trunk.v1.patch, NUTCH-1465-trunk.v2.patch, NUTCH-1465-trunk.v3.patch, NUTCH-1465-trunk.v4.patch I recently came across this rather stagnant codebase[0] which is ASL v2.0 licensed and appears to have been used successfully to parse sitemaps as per the discussion here[1].
[0] http://sourceforge.net/projects/sitemap-parser/ [1] http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html -- This message was sent by Atlassian JIRA (v6.1.5#6160)
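The suggestion in the comment above, passing isHost into filterNormalize() instead of keeping it in a member variable, could look roughly like the sketch below. It is a simplified stand-in, not the patch: the Predicate parameter is a hypothetical replacement for Nutch's URL filters and normalizers, and the http-then-https trial follows the approach described earlier in the thread.

```java
import java.util.function.Predicate;

/** Hypothetical stateless variant of filterNormalize(); the Predicate stands in for Nutch's filters. */
final class FilterNormalizeSketch {
    /**
     * @param key     a hostname (from the HostDb) or a full sitemap URL
     * @param isHost  decided by the caller, e.g. from the class of the map() value
     * @param accepts stand-in for the URL filter chain
     * @return the accepted URL, or null if nothing passes the filter
     */
    static String filterNormalize(String key, boolean isHost, Predicate<String> accepts) {
        if (!isHost) {
            return accepts.test(key) ? key : null;
        }
        // For a bare hostname, try building a homepage URL with http first, then https.
        for (String scheme : new String[] {"http://", "https://"}) {
            String homepage = scheme + key + "/";
            if (accepts.test(homepage)) {
                return homepage;
            }
        }
        return null;
    }
}
```

Because the method keeps no state between calls, concurrent calls from several mapper threads cannot interfere with each other, and a file:/ sitemap URL is never mistaken for a hostname.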
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13879955#comment-13879955 ]

Lewis John McGibbney commented on NUTCH-1465:

Hey [~tejasp]. Again, great work! Some minor comments:
* The class-level Javadoc in SitemapProcessor would be more legible if it used a format similar to
{code:title=SitemapProcessor.java|borderStyle=solid}
/**
 * <p>Performs Sitemap processing by fetching sitemap links, parsing the content and merging
 * the urls from Sitemap (with the metadata) with the existing crawldb.</p>
 *
 * <p>There are two use cases supported in Nutch's Sitemap processing:</p>
 * <ol>
 * <li>Sitemaps are considered as remote seed lists. Crawl administrators can prepare a
 * list of sitemap links and get only those sitemap pages. This suits well for targeted
 * crawl of specific hosts.</li>
 * <li>For open web crawl, it is not possible to track each host and get the sitemap links
 * manually. Nutch would automatically get the sitemaps for all the hosts seen in the
 * crawls and inject the urls from sitemap to the crawldb.</li>
 * </ol>
 * <p>For more details see:
 * https://wiki.apache.org/nutch/SitemapFeature </p>
 */
{code}
* I think that the following logging line should be changed to WARN or ERROR
{code:title=SitemapProcessor.java|borderStyle=solid}
} catch (Exception e) {
+ LOG.info("Exception for url " + key.toString() + " : " + StringUtils.stringifyException(e));
{code}
* This is merely a suggestion, but in SitemapProcessor#filterNormalize(String u), could we not use one of the methods from URLUtil.java instead?
{code:title=SitemapProcessor.java|borderStyle=solid}
if(!u.startsWith("http://") && !u.startsWith("https://")) {
  // We received a hostname here so let's make a URL
  url = "http://" + u + "/";
  isHost = true;
}
{code}
That's about it from me, mate. This looks like an excellent addition to Nutch again. I made a trivial update to the wiki page to drop in some links and background to your work on this one.
Support sitemaps in Nutch - Key: NUTCH-1465 URL: https://issues.apache.org/jira/browse/NUTCH-1465 Project: Nutch Issue Type: New Feature Components: parser Reporter: Lewis John McGibbney Assignee: Tejas Patil Fix For: 1.9 Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, NUTCH-1465-trunk.v1.patch, NUTCH-1465-trunk.v2.patch, NUTCH-1465-trunk.v3.patch I recently came across this rather stagnant codebase[0] which is ASL v2.0 licensed and appears to have been used successfully to parse sitemaps as per the discussion here[1]. [0] http://sourceforge.net/projects/sitemap-parser/ [1] http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13880295#comment-13880295 ]

Tejas Patil commented on NUTCH-1465:

Hi [~lewismc],

+1 for the first two suggestions. For #3: I skimmed through the methods inside URLUtil.java and found nothing that I could use in the Sitemap code you pointed to. Can you please confirm?

A big thanks, mate, for trying out the feature. Hopefully we get this into the 1.8 release. Cheers!!

Support sitemaps in Nutch - Key: NUTCH-1465 URL: https://issues.apache.org/jira/browse/NUTCH-1465 Project: Nutch Issue Type: New Feature Components: parser Reporter: Lewis John McGibbney Assignee: Tejas Patil Fix For: 1.8 Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, NUTCH-1465-trunk.v1.patch, NUTCH-1465-trunk.v2.patch, NUTCH-1465-trunk.v3.patch I recently came across this rather stagnant codebase[0] which is ASL v2.0 licensed and appears to have been used successfully to parse sitemaps as per the discussion here[1]. [0] http://sourceforge.net/projects/sitemap-parser/ [1] http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13880305#comment-13880305 ]

Lewis John McGibbney commented on NUTCH-1465:

Hey [~tejasp], no probs.

RE: #3, I was just curious to see if we could reuse some of the methods we had in URLUtil. Now that I've looked, I feel you're right. This patch reminds me of pushing filtering and normalization out to crawler commons anyway, but that is another can of worms :) I'll let others comment here. Right now I am +1 on this patch.

Support sitemaps in Nutch - Key: NUTCH-1465 URL: https://issues.apache.org/jira/browse/NUTCH-1465 Project: Nutch Issue Type: New Feature Components: parser Reporter: Lewis John McGibbney Assignee: Tejas Patil Fix For: 1.8 Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, NUTCH-1465-trunk.v1.patch, NUTCH-1465-trunk.v2.patch, NUTCH-1465-trunk.v3.patch I recently came across this rather stagnant codebase[0] which is ASL v2.0 licensed and appears to have been used successfully to parse sitemaps as per the discussion here[1]. [0] http://sourceforge.net/projects/sitemap-parser/ [1] http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13862237#comment-13862237 ]

Tejas Patil commented on NUTCH-1465:

Hi [~wastl-nagel],

Yes. I think that it should be there too. I will be working on the patch this weekend and will post an update. Thanks for your inputs and suggestions so far; they were super helpful in chalking out the right specs for this feature.

Support sitemaps in Nutch - Key: NUTCH-1465 URL: https://issues.apache.org/jira/browse/NUTCH-1465 Project: Nutch Issue Type: New Feature Components: parser Reporter: Lewis John McGibbney Assignee: Tejas Patil Fix For: 1.9 Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, NUTCH-1465-trunk.v1.patch I recently came across this rather stagnant codebase[0] which is ASL v2.0 licensed and appears to have been used successfully to parse sitemaps as per the discussion here[1]. [0] http://sourceforge.net/projects/sitemap-parser/ [1] http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13848998#comment-13848998 ]

Sebastian Nagel commented on NUTCH-1465:

Let's add use case C:

*(C) inject URLs from given sitemap(s)*
i. user configures a list of known and trusted sitemaps
ii. URLs are extracted from the sitemaps and injected into CrawlDb

Use case: small/medium size customized crawls

Is C a common use case, worth integrating?

Support sitemaps in Nutch - Key: NUTCH-1465 URL: https://issues.apache.org/jira/browse/NUTCH-1465 Project: Nutch Issue Type: New Feature Components: parser Reporter: Lewis John McGibbney Assignee: Tejas Patil Fix For: 1.9 Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, NUTCH-1465-trunk.v1.patch I recently came across this rather stagnant codebase[0] which is ASL v2.0 licensed and appears to have been used successfully to parse sitemaps as per the discussion here[1]. [0] http://sourceforge.net/projects/sitemap-parser/ [1] http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13848561#comment-13848561 ]

Tejas Patil commented on NUTCH-1465:

Revisited this Jira after a long time and gave some thought to how this can be done cleanly. There are two ways of implementing this:

*(A) Do the sitemap stuff in the fetch phase of the nutch cycle.*
This was my original approach, which the (in-progress) patch addresses. This would involve tweaking core nutch classes at several locations.
Pros:
- Sitemaps are nothing but normal pages with several outlinks. Fits well in the 'fetch' cycle.
Cons:
- Sitemaps can be very large. Fetching them needs large size and time limits. The fetch code must have a special case to take into account that the url is a sitemap url and use custom limits => leads to a hacky coding style.
- The Outlink class cannot hold the extra information contained in sitemaps (like lastmod, changefreq). It would have to be modified to hold this information too. This would be specific to sitemaps only, yet we would end up making all outlinks hold this info. We could create a special type of outlink to take care of this.

*(B) Have a separate job for the sitemap stuff and merge its output into the crawldb.*
i. User populates a list of hosts (or uses HostDB from NUTCH-1325). Now we have all the hosts to be processed.
ii. Run a map-reduce job: for each host,
- get the robots page, extract sitemap urls,
- get the xml content of these sitemap pages
- create crawl datums with the required info and write this to a sitemapDB
iii. Use the CrawlDbMerger utility to merge the sitemapDB and the crawldb
Pros:
- Cleaner code.
- Users have control over when to perform sitemap extraction. This is better than (A), wherein sitemap urls are sitting in the crawldb and get fetched along with normal pages (thus eating up fetch time of every fetch phase). We can have a sitemap_frequency used inside the crawl script so that users can say that after 'x' nutch cycles, sitemap processing should run.
Cons:
- Additional map-reduce jobs are needed. I think that this must be reasonable. Running the sitemap job 1-5 times a month on a production level crawl would work out well.

I am inclined towards implementing (B).

Support sitemaps in Nutch - Key: NUTCH-1465 URL: https://issues.apache.org/jira/browse/NUTCH-1465 Project: Nutch Issue Type: New Feature Components: parser Reporter: Lewis John McGibbney Assignee: Tejas Patil Fix For: 1.9 Attachments: NUTCH-1465-trunk.v1.patch I recently came across this rather stagnant codebase[0] which is ASL v2.0 licensed and appears to have been used successfully to parse sitemaps as per the discussion here[1]. [0] http://sourceforge.net/projects/sitemap-parser/ [1] http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html -- This message was sent by Atlassian JIRA (v6.1.4#6159)
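The first part of step ii in approach (B) above, extracting sitemap urls from a robots page, can be illustrated in plain Java. This is only a sketch with hypothetical names. Elsewhere in this thread the crawler-commons robots parser is mentioned as already collecting sitemap links, and a real job would more likely rely on that than on hand-written parsing.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: collect the URLs of all "Sitemap:" lines from robots.txt content. */
final class RobotsSitemapSketch {
    private static final String DIRECTIVE = "sitemap:";

    static List<String> sitemapUrls(String robotsTxt) {
        List<String> urls = new ArrayList<>();
        for (String rawLine : robotsTxt.split("\\r?\\n")) {
            String line = rawLine.trim();
            // The directive name is matched case-insensitively; lines shorter than it never match.
            if (line.regionMatches(true, 0, DIRECTIVE, 0, DIRECTIVE.length())) {
                String url = line.substring(DIRECTIVE.length()).trim();
                if (!url.isEmpty()) {
                    urls.add(url);
                }
            }
        }
        return urls;
    }
}
```

A robots.txt may carry several Sitemap lines, so the method returns a list. Each URL in that list would then be fetched and turned into crawl datums by the rest of step ii.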
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13848723#comment-13848723 ]

Tejas Patil commented on NUTCH-1465:

Hi [~wastl-nagel],

Nice share. The only grudge I have with that approach is that users will have to pick up sitemap urls for hosts *manually* and feed them to the sitemap injector. It would fit well where users are performing targeted crawling. For the large scale, open web crawl use case:
(i) the number of initial hosts can be large: one-time burden for users
(ii) the crawler discovers new hosts over time: constant pain for users to look out for the newly discovered hosts and then get the sitemaps from robots.txt manually.
With HostDB from NUTCH-1325 and B, users won't suffer here.

"do we really need an extra DB?"
I should have been clearer in the explanation. sitemapDB is a temporary location where all crawl datums of sitemap entries would be written. It can be deleted after the merge with the main crawlDB. Quite analogous to what the inject operation does.

"NUTCH-1622 would enable solution A: outlinks now can hold extra info."
I didn't know that. Still, I would go in favor of B as it is clean, and A would involve messing around with the existing codebase at several places.

Support sitemaps in Nutch - Key: NUTCH-1465 URL: https://issues.apache.org/jira/browse/NUTCH-1465 Project: Nutch Issue Type: New Feature Components: parser Reporter: Lewis John McGibbney Assignee: Tejas Patil Fix For: 1.9 Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, NUTCH-1465-trunk.v1.patch I recently came across this rather stagnant codebase[0] which is ASL v2.0 licensed and appears to have been used successfully to parse sitemaps as per the discussion here[1]. [0] http://sourceforge.net/projects/sitemap-parser/ [1] http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13721045#comment-13721045 ]

Brian commented on NUTCH-1465:

Is a separate issue needed for support in 2.X?

Support sitemaps in Nutch - Key: NUTCH-1465 URL: https://issues.apache.org/jira/browse/NUTCH-1465 Project: Nutch Issue Type: New Feature Components: parser Reporter: Lewis John McGibbney Assignee: Tejas Patil Fix For: 1.9 Attachments: NUTCH-1465-trunk.v1.patch I recently came across this rather stagnant codebase[0] which is ASL v2.0 licensed and appears to have been used successfully to parse sitemaps as per the discussion here[1]. [0] http://sourceforge.net/projects/sitemap-parser/ [1] http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564274#comment-13564274 ]

Sebastian Nagel commented on NUTCH-1465:

Hi Tejas, thanks, and a few comments on the patch:

??for a given host, sitemaps are processed just once??
But they are not cached over cycles because the cache is bound to the protocol object. Is this correct? So a sitemap is fetched and processed every cycle for every host? If yes, and sitemaps are large (they can be!), this would cause a lot of extra traffic.

Shouldn't sitemap URLs be handled the same way as any other URL: add them to CrawlDb, fetch and parse once, add found links to CrawlDb, cf. [Ken's post at CC|https://groups.google.com/forum/?fromgroups#!topic/crawler-commons/DrAX4Th1A4I]. There are some complications:
- due to their size, sitemaps may require larger values regarding size and time limits
- sitemaps may require more frequent re-fetching (e.g. by MimeAdaptiveFetchSchedule)
- the current Outlink class cannot hold extra information contained in sitemaps (lastmod, changefreq, etc.)

There is another way, which we use for several customers: a SitemapInjector fetches the sitemaps, extracts URLs and injects them with all extra information. It's a simple use case for a customized site-search: there is a sitemap and it shall be used as seed list or even as the exclusive list of documents to be crawled. Is there any interest in this solution? It's not a general solution and not adaptable to a large web crawl.

Support sitemaps in Nutch - Key: NUTCH-1465 URL: https://issues.apache.org/jira/browse/NUTCH-1465 Project: Nutch Issue Type: New Feature Components: parser Reporter: Lewis John McGibbney Assignee: Tejas Patil Fix For: 1.7 Attachments: NUTCH-1465-trunk.v1.patch I recently came across this rather stagnant codebase[0] which is ASL v2.0 licensed and appears to have been used successfully to parse sitemaps as per the discussion here[1].
[0] http://sourceforge.net/projects/sitemap-parser/ [1] http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564768#comment-13564768 ]

Sebastian Nagel commented on NUTCH-1465:

Yes, SitemapInjector is a map-reduce job. The scenario for its use is the following:
- a small set of sites to be crawled (e.g., to feed a site-search index)
- you can think of sitemaps as remote seed lists. Because many content management systems can generate sitemaps, it is convenient for the site owners to publish seeds. The URLs contained in the sitemap can also be the complete and exclusive set of URLs to be crawled (you can use the plugin scoring-depth to limit the crawl to seed URLs).
- because you can trust the sitemap's content
-* checks for cross submissions are not necessary
-* extra information (lastmod, changefreq, priority) can be used

That's how we use sitemaps: remote seed lists, maintained by customers, quite convenient if you run a crawler as a service.

For large web crawls there is also another aspect: detection of sitemaps, which is bound to the processing of robots.txt. Processing of sitemaps can (and should?) be done the usual Nutch way:
- detection is done in the protocol plugin (see Tejas' patch)
- record in CrawlDb: done by Fetcher (cross submission information can be added)
- fetch (if not yet done), parse (a plugin parse-sitemap based on crawler-commons?) and extract outlinks: sitemaps may require special treatment here because they can be large in size and usually contain many outlinks. Also, the Outlink class needs to be extended to deal with the extra info relevant for scheduling

Using an extra tool (such as the SitemapInjector) for processing the sitemaps has the disadvantage that we first must get all sitemap URLs out of the CrawlDb. On the other hand, special treatment can easily be realized in a separate map-reduce job.

Comments?! Thanks, Tejas: the feature is moving forward thanks to your initiative!
Support sitemaps in Nutch - Key: NUTCH-1465 URL: https://issues.apache.org/jira/browse/NUTCH-1465 Project: Nutch Issue Type: New Feature Components: parser Reporter: Lewis John McGibbney Assignee: Tejas Patil Fix For: 1.7 Attachments: NUTCH-1465-trunk.v1.patch I recently came across this rather stagnant codebase[0] which is ASL v2.0 licensed and appears to have been used successfully to parse sitemaps as per the discussion here[1]. [0] http://sourceforge.net/projects/sitemap-parser/ [1] http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
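The extra sitemap information mentioned in the comment above (changefreq in particular) has to be turned into a fetch interval before it can be used for scheduling. A minimal sketch of such a mapping follows. The changefreq keywords are the ones defined by the sitemap protocol, but the class name and the concrete second values are assumptions made for illustration, not values taken from Nutch or crawler-commons.

```java
import java.util.Locale;

/** Hypothetical sketch: map a sitemap changefreq keyword to a fetch interval in seconds. */
final class ChangeFreqSketch {
    private static final int DAY = 24 * 60 * 60;

    /** Returns the assumed interval, or null when no interval should be derived. */
    static Integer intervalSeconds(String changefreq) {
        if (changefreq == null) {
            return null; // sitemap entry carried no changefreq
        }
        switch (changefreq.trim().toLowerCase(Locale.ROOT)) {
            case "always":  return 1;          // assumed: the smallest possible interval
            case "hourly":  return 60 * 60;
            case "daily":   return DAY;
            case "weekly":  return 7 * DAY;
            case "monthly": return 30 * DAY;   // assumed month length
            case "yearly":  return 365 * DAY;  // assumed year length
            default:        return null;       // "never" or an unknown keyword
        }
    }
}
```

Whether very small results such as 1 second should be accepted as they are or raised to some minimum is a policy question that this mapping deliberately leaves open.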
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564836#comment-13564836 ]

Markus Jelsma commented on NUTCH-1465:

Thanks all for your interesting comments. It's a complicated issue. On one hand, host data should be stored in NUTCH-1325, but that would require additional logic and sending each segment output to the hostdb in case there's a sitemap crawled. On the other hand, it's ideal to store host data. It's also easy to use in jobs such as the indexer and generator.

I don't yet favour a specific approach, but storing sitemap data in a hostdb may be something to think about.

Cheers

Support sitemaps in Nutch - Key: NUTCH-1465 URL: https://issues.apache.org/jira/browse/NUTCH-1465 Project: Nutch Issue Type: New Feature Components: parser Reporter: Lewis John McGibbney Assignee: Tejas Patil Fix For: 1.7 Attachments: NUTCH-1465-trunk.v1.patch I recently came across this rather stagnant codebase[0] which is ASL v2.0 licensed and appears to have been used successfully to parse sitemaps as per the discussion here[1]. [0] http://sourceforge.net/projects/sitemap-parser/ [1] http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564883#comment-13564883 ]

Tejas Patil commented on NUTCH-1465:

Hi Sebastian,

So we are looking at 2 things here:
- a standalone utility for injecting sitemaps into the crawldb:
-# User starts off with urls to sitemap pages
-# SitemapInjector fetches these seeds and parses them (with a parse plugin based on CC)
-# SitemapInjector updates the crawldb with the sitemap entries.
- handling of sitemaps within the nutch cycle: fetch, parse and update phases
-# Robots parsing will populate a table of host: _list of links to sitemap pages_
-# These will be added to the fetcher queue and will be fetched
-# A parser plugin based on CC will parse the sitemap page
-# The Outlink class needs to be extended to store the metadata obtained from the sitemap
-# Write this into the segment
-# The update phase needs to update the crawl frequency of already existing urls in the crawldb based on what we got from the sitemap. Else just add new entries to the crawldb.

I am not clear about the extending-Outlink part. The normal outlink extraction need not be done, as CC will already do that for us. The sitemap parser plugin must do this and create objects of our specialized sitemap link. While writing, where is the CrawlDatum generated from the outlink?

The mime type that we get is text/xml, which can also mean a normal xml file. How will nutch identify whether it's a sitemap page and invoke the correct parser plugin? (I know that this magic is done for the feed parser but I am not sure which part of the code is doing that. Just point me to that code.)
Support sitemaps in Nutch - Key: NUTCH-1465 URL: https://issues.apache.org/jira/browse/NUTCH-1465 Project: Nutch Issue Type: New Feature Components: parser Reporter: Lewis John McGibbney Assignee: Tejas Patil Fix For: 1.7 Attachments: NUTCH-1465-trunk.v1.patch I recently came across this rather stagnant codebase[0] which is ASL v2.0 licensed and appears to have been used successfully to parse sitemaps as per the discussion here[1]. [0] http://sourceforge.net/projects/sitemap-parser/ [1] http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
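On the question above of telling a sitemap apart from an ordinary text/xml document: one possible heuristic, not necessarily what Nutch or its feed parser does, is to look at the root element, since the sitemap protocol uses urlset for sitemaps and sitemapindex for sitemap index files. The sketch below uses the JDK's StAX API and stops at the first start tag, so a large sitemap does not have to be parsed completely. The class and method names are hypothetical.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical sketch: decide from the root element whether XML content looks like a sitemap. */
final class SitemapSniffSketch {
    static boolean looksLikeSitemap(String xml) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Sitemaps need no DTD; switching DTD support off avoids resolving external entities.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        // Only the first start tag (the root element) is inspected.
                        String root = reader.getLocalName();
                        return "urlset".equals(root) || "sitemapindex".equals(root);
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            return false; // not well-formed XML, so not treated as a sitemap
        }
        return false;
    }
}
```

A check like this only inspects the content. How it would be wired into parser selection is a separate question that the sketch does not answer.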
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564019#comment-13564019 ]

Ken Krugler commented on NUTCH-1465:

Hi Tejas - I thought the current CC robots parsing code was already extracting the sitemap links. Or is the above comment ("modified the robots parsing code to extract the links to sitemap pages") a change to the current Nutch robots parsing code?

I do remember thinking that the CC version would need to change to support multiple Sitemap links, even though it wasn't clear whether that was actually valid.

-- Ken

Support sitemaps in Nutch - Key: NUTCH-1465 URL: https://issues.apache.org/jira/browse/NUTCH-1465 Project: Nutch Issue Type: New Feature Components: parser Reporter: Lewis John McGibbney Fix For: 1.7 Attachments: NUTCH-1465-trunk.v1.patch I recently came across this rather stagnant codebase[0] which is ASL v2.0 licensed and appears to have been used successfully to parse sitemaps as per the discussion here[1]. [0] http://sourceforge.net/projects/sitemap-parser/ [1] http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564040#comment-13564040 ] Tejas Patil commented on NUTCH-1465: Hi Ken, As the CC robots integration jira is not closed, I made this change on the current trunk. I did not understand this: "CC version would need to change to support multiple Sitemap links". Do you mean that CC doesn't allow multiple sitemap links in a robots file (like [this|http://stackoverflow.com/questions/2594179/multiple-sitemap-entries-in-robots-txt]), or that it doesn't handle a sitemap index file?
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564057#comment-13564057 ] Ken Krugler commented on NUTCH-1465: Hi Tejas - the original code didn't, but I checked and now remember that I added support for multiple sitemap URLs to BaseRobotRules in CC.
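For readers following the exchange above, here is a rough illustration of what "multiple Sitemap links in a robots file" means in practice. It is only a minimal sketch using the Java standard library. It is not the CC (crawler-commons) parser and not the code in the attached patch, and the class name SitemapLinkExtractor and method extractSitemaps are hypothetical names chosen for this example. It assumes each sitemap is declared on its own "Sitemap:" line, that the directive name is case-insensitive, and that anything after a "#" is a comment (so a URL containing "#" would be cut short).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, for illustration only: collects every sitemap URL
// declared in the text of a robots.txt file.
final class SitemapLinkExtractor {

    private static final String DIRECTIVE = "sitemap:";

    /** Returns the URL from each "Sitemap:" line, in the order the lines appear. */
    static List<String> extractSitemaps(String robotsTxt) {
        List<String> sitemaps = new ArrayList<>();
        for (String rawLine : robotsTxt.split("\\r?\\n")) {
            // Assumption: "#" starts a comment, so drop it and anything after it.
            int hash = rawLine.indexOf('#');
            String line = (hash >= 0 ? rawLine.substring(0, hash) : rawLine).trim();
            // Match the directive name without regard to case.
            if (line.toLowerCase(Locale.ROOT).startsWith(DIRECTIVE)) {
                String url = line.substring(DIRECTIVE.length()).trim();
                if (!url.isEmpty()) {
                    // A robots.txt may name several sitemaps, so keep them all.
                    sitemaps.add(url);
                }
            }
        }
        return sitemaps;
    }
}
```

Calling SitemapLinkExtractor.extractSitemaps on a robots.txt body with two "Sitemap:" lines yields a two-element list. A sitemap index file is a separate case: it is an XML document that itself lists further sitemaps, and this sketch does not fetch or parse it.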