[jira] [Resolved] (NUTCH-2009) Fetcher does not work with batchID

2015-09-15 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-2009. - Resolution: Duplicate These MongoDB issues have been resolved in Gora 0.6.1 and

[jira] [Resolved] (NUTCH-2080) Eclipse compilation issue

2015-09-15 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-2080. - Resolution: Invalid This has to do with ivy/ivy.xml configuration and should be

[jira] [Created] (NUTCH-2101) Upgrade Nutch 2.X to Hadoop 2.4.0

2015-09-15 Thread Lewis John McGibbney (JIRA)
Lewis John McGibbney created NUTCH-2101: --- Summary: Upgrade Nutch 2.X to Hadoop 2.4.0 Key: NUTCH-2101 URL: https://issues.apache.org/jira/browse/NUTCH-2101 Project: Nutch Issue Type:

[jira] [Commented] (NUTCH-1679) UpdateDb using batchId, link may override crawled page.

2015-09-15 Thread Hudson (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14746866#comment-14746866 ] Hudson commented on NUTCH-1679: --- SUCCESS: Integrated in Nutch-nutchgora #1535 (See

[jira] [Resolved] (NUTCH-2029) Mark.checkMark returns empty string when null is expected with mongodb storage

2015-09-15 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-2029. - Resolution: Fixed This issue has been resolved as it was fixed over in GORA-423.

[jira] [Commented] (NUTCH-1679) UpdateDb using batchId, link may override crawled page.

2015-09-15 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14746807#comment-14746807 ] Lewis John McGibbney commented on NUTCH-1679: - I've tested this with Nutch 2.X HEAD, Gora 0.5

[jira] [Resolved] (NUTCH-1922) DbUpdater overwrites fetch status for URLs from previous batches, causes repeated re-fetches

2015-09-15 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-1922. - Resolution: Duplicate This issue is a clone of NUTCH-1679 for which I just

[jira] [Updated] (NUTCH-1572) Nutch 2.x should use o.a.g.mem.store.MemStore for testing

2015-09-15 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1572: Fix Version/s: (was: 2.4) 2.3.1 > Nutch 2.x should use

[jira] [Updated] (NUTCH-1679) UpdateDb using batchId, link may override crawled page.

2015-09-15 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1679: Attachment: NUTCH-1679_4.patch Patch which sorts out some trivial formatting and

[jira] [Resolved] (NUTCH-1679) UpdateDb using batchId, link may override crawled page.

2015-09-15 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-1679. - Resolution: Fixed Committed @revision 1703331 in 2.X HEAD > UpdateDb using

[jira] [Assigned] (NUTCH-1572) Nutch 2.x should use o.a.g.mem.store.MemStore for testing

2015-09-15 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney reassigned NUTCH-1572: --- Assignee: Lewis John McGibbney > Nutch 2.x should use

[jira] [Closed] (NUTCH-2100) Nutch dump command doesnt dump anything

2015-09-15 Thread Kim Whitehall (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kim Whitehall closed NUTCH-2100. Resolution: Invalid The command was used incorrectly. There is no bug. > Nutch dump command

Re: [ANNOUNCE] New Nutch committer and PMC - Sujen Shah

2015-09-15 Thread Sujen Shah
Hi everyone, I would like to thank the members of the Apache Nutch PMC for bringing me on board and giving me the opportunity to become a member and committer. I am a graduate student at the University of Southern California, majoring in Computer Science. I have been working with Chris Mattmann

[jira] [Comment Edited] (NUTCH-2097) Proposal for Nutch 3.x

2015-09-15 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14744364#comment-14744364 ] Lewis John McGibbney edited comment on NUTCH-2097 at 9/15/15 6:51 AM: --

[jira] [Resolved] (NUTCH-2093) Indexing filters have no signature in CrawlDatum if crawled via FreeGenerator

2015-09-15 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-2093. -- Resolution: Fixed Assignee: Markus Jelsma Committed to trunk in revision 1703111. >

[jira] [Commented] (NUTCH-2097) Proposal for Nutch 3.x

2015-09-15 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14744983#comment-14744983 ] Lewis John McGibbney commented on NUTCH-2097: - Hi [~markus17] thanks for initial comments. I

[jira] [Resolved] (NUTCH-2094) When stopping a crawl in Nutch 2.3, I was having trouble when I start an already stopped crawl and then stop it again.

2015-09-15 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-2094. - Resolution: Not A Problem This issue is already resolved in 2.X branch

[jira] [Commented] (NUTCH-2064) URLNormalizer basic to properly encode non-ASCII characters

2015-09-15 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14744959#comment-14744959 ] Markus Jelsma commented on NUTCH-2064: -- I think having it in CC makes sense indeed. I shall commit

[jira] [Commented] (NUTCH-2097) Proposal for Nutch 3.x

2015-09-15 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14744953#comment-14744953 ] Markus Jelsma commented on NUTCH-2097: -- Interesting! What does 'Complete Ant + Ivy build system

[jira] [Comment Edited] (NUTCH-2097) Proposal for Nutch 3.x

2015-09-15 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14744953#comment-14744953 ] Markus Jelsma edited comment on NUTCH-2097 at 9/15/15 6:50 AM: --- Interesting!

[GitHub] nutch pull request: 2.x

2015-09-15 Thread prernasatija
Github user prernasatija closed the pull request at: https://github.com/apache/nutch/pull/57

[Nutch Wiki] Update of "AdvancedAjaxInteraction" by MichaelJoyce

2015-09-15 Thread Apache Wiki
The "AdvancedAjaxInteraction" page has been changed by MichaelJoyce: https://wiki.apache.org/nutch/AdvancedAjaxInteraction?action=diff=4=5 Comment: Updates regarding available selenium

[GitHub] nutch pull request: WARC exporter for the CommonCrawlDataDumper

2015-09-15 Thread jnioche
Github user jnioche commented on a diff in the pull request: https://github.com/apache/nutch/pull/55#discussion_r39492460 --- Diff: src/java/org/apache/nutch/tools/CommonCrawlFormatWARC.java --- @@ -0,0 +1,337 @@ +package org.apache.nutch.tools; + +import

[jira] [Updated] (NUTCH-1932) Automatically remove orphaned pages

2015-09-15 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1932: - Attachment: NUTCH-1932.patch Eeh, patch with the scoring filter itself. Apparently it is possible

[jira] [Commented] (NUTCH-2097) Proposal for Nutch 3.x

2015-09-15 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14745061#comment-14745061 ] Sebastian Nagel commented on NUTCH-2097: Yes, looks promising. - maven could simplify the build,

[jira] [Updated] (NUTCH-1932) Automatically remove orphaned pages

2015-09-15 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1932: - Attachment: NUTCH-1932.patch New and much simpler patch. This relies on a scoring filter to mark

[GitHub] nutch pull request: WARC exporter for the CommonCrawlDataDumper

2015-09-15 Thread jnioche
Github user jnioche commented on a diff in the pull request: https://github.com/apache/nutch/pull/55#discussion_r39492479 --- Diff: src/java/org/apache/nutch/tools/CommonCrawlFormatWARC.java --- @@ -0,0 +1,337 @@ +package org.apache.nutch.tools; + +import

[jira] [Commented] (NUTCH-2099) Refactoring the REST endpoints for integration with webui

2015-09-15 Thread ASF GitHub Bot (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14745663#comment-14745663 ] ASF GitHub Bot commented on NUTCH-2099: --- GitHub user sujen1412 opened a pull request:

[jira] [Created] (NUTCH-2099) Refactoring the REST endpoints for integration with webui

2015-09-15 Thread Sujen Shah (JIRA)
Sujen Shah created NUTCH-2099: - Summary: Refactoring the REST endpoints for integration with webui Key: NUTCH-2099 URL: https://issues.apache.org/jira/browse/NUTCH-2099 Project: Nutch Issue

[jira] [Commented] (NUTCH-2097) Proposal for Nutch 3.x

2015-09-15 Thread Nadeem Douba (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14745313#comment-14745313 ] Nadeem Douba commented on NUTCH-2097: - I'm not entirely married to the package structure to be honest.

[jira] [Commented] (NUTCH-2097) Proposal for Nutch 3.x

2015-09-15 Thread Nadeem Douba (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14745330#comment-14745330 ] Nadeem Douba commented on NUTCH-2097: - Re: maven migration Would building each tool into a separate

[jira] [Updated] (NUTCH-1932) Automatically remove orphaned pages

2015-09-15 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1932: - Attachment: NUTCH-1932.patch > Automatically remove orphaned pages >

[jira] [Comment Edited] (NUTCH-2097) Proposal for Nutch 3.x

2015-09-15 Thread Nadeem Douba (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14745330#comment-14745330 ] Nadeem Douba edited comment on NUTCH-2097 at 9/15/15 12:23 PM: --- Re: maven

[jira] [Comment Edited] (NUTCH-2097) Proposal for Nutch 3.x

2015-09-15 Thread Nadeem Douba (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14745330#comment-14745330 ] Nadeem Douba edited comment on NUTCH-2097 at 9/15/15 12:22 PM: --- Re: maven

[jira] [Updated] (NUTCH-1932) Automatically remove orphaned pages

2015-09-15 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1932: - Attachment: NUTCH-1932.patch First proper working patch. Tests pass > Automatically remove

[jira] [Updated] (NUTCH-1932) Automatically remove orphaned pages

2015-09-15 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1932: - Description: Orphan scoring filter that determines whether a page has become orphaned, e.g. it

[jira] [Commented] (NUTCH-2097) Proposal for Nutch 3.x

2015-09-15 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14745322#comment-14745322 ] Markus Jelsma commented on NUTCH-2097: -- Yes, having them as separate mapper and reducer class files,

[jira] [Commented] (NUTCH-2093) Indexing filters have no signature in CrawlDatum if crawled via FreeGenerator

2015-09-15 Thread Hudson (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14745755#comment-14745755 ] Hudson commented on NUTCH-2093: --- SUCCESS: Integrated in Nutch-trunk #3271 (See

[GitHub] nutch pull request: Fix for NUTCH-2099 Contributed by Sujen Shah

2015-09-15 Thread sujen1412
GitHub user sujen1412 opened a pull request: https://github.com/apache/nutch/pull/59 Fix for NUTCH-2099 Contributed by Sujen Shah You can merge this pull request into a Git repository by running: $ git pull https://github.com/sujen1412/nutch NUTCH-2099 Alternatively you can

[jira] [Updated] (NUTCH-1932) Automatically remove orphaned pages

2015-09-15 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1932: --- Attachment: NUTCH-1932-add.patch > Automatically remove orphaned pages >

[jira] [Commented] (NUTCH-1932) Automatically remove orphaned pages

2015-09-15 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14746004#comment-14746004 ] Sebastian Nagel commented on NUTCH-1932: Hi Markus, understood. - didn't we have the problem

[jira] [Commented] (NUTCH-1932) Automatically remove orphaned pages

2015-09-15 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14746034#comment-14746034 ] Markus Jelsma commented on NUTCH-1932: -- Hello Sebastian. I am not sure about that being on the list.

[jira] [Created] (NUTCH-2100) Nutch dump command doesnt dump anything

2015-09-15 Thread Kim Whitehall (JIRA)
Kim Whitehall created NUTCH-2100: Summary: Nutch dump command doesnt dump anything Key: NUTCH-2100 URL: https://issues.apache.org/jira/browse/NUTCH-2100 Project: Nutch Issue Type: Bug

[ANNOUNCE] New Nutch committer and PMC - Sujen Shah

2015-09-15 Thread Sebastian Nagel
Dear all, on behalf of the Nutch PMC it is my pleasure to announce that Sujen Shah has been voted in as committer and member of the Nutch PMC. Sujen, would you mind introducing yourself to the Nutch community and telling us in just a few words about your interests and your plans regarding Nutch?

[jira] [Commented] (NUTCH-1932) Automatically remove orphaned pages

2015-09-15 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14746062#comment-14746062 ] Sebastian Nagel commented on NUTCH-1932: Correct, it was about 404 pages not about duplicates, see

[jira] [Assigned] (NUTCH-2100) Nutch dump command doesnt dump anything

2015-09-15 Thread Chris A. Mattmann (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann reassigned NUTCH-2100: Assignee: Chris A. Mattmann > Nutch dump command doesnt dump anything >

[jira] [Commented] (NUTCH-2100) Nutch dump command doesnt dump anything

2015-09-15 Thread Chris A. Mattmann (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14746141#comment-14746141 ] Chris A. Mattmann commented on NUTCH-2100: -- Kim I think that the directory expects a path to the

[jira] [Commented] (NUTCH-2100) Nutch dump command doesnt dump anything

2015-09-15 Thread Kim Whitehall (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14746210#comment-14746210 ] Kim Whitehall commented on NUTCH-2100: -- LOL! how dumb of me! yeap, it works. Of all the things ... Do

[jira] [Updated] (NUTCH-2098) Add null SeedUrl constructor

2015-09-15 Thread Aron Ahmadia (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aron Ahmadia updated NUTCH-2098: Attachment: 0001-Default-SeedURL-constructor.patch > Add null SeedUrl constructor >

[jira] [Created] (NUTCH-2098) Add null SeedUrl constructor

2015-09-15 Thread Aron Ahmadia (JIRA)
Aron Ahmadia created NUTCH-2098: --- Summary: Add null SeedUrl constructor Key: NUTCH-2098 URL: https://issues.apache.org/jira/browse/NUTCH-2098 Project: Nutch Issue Type: Bug

[jira] [Commented] (NUTCH-1932) Automatically remove orphaned pages

2015-09-15 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14745205#comment-14745205 ] Sebastian Nagel commented on NUTCH-1932: Hi Markus, that looks quite simple - do we still need a

[GitHub] nutch pull request: WARC exporter for the CommonCrawlDataDumper

2015-09-15 Thread jorgelbg
Github user jorgelbg commented on a diff in the pull request: https://github.com/apache/nutch/pull/55#discussion_r39509063 --- Diff: src/java/org/apache/nutch/tools/CommonCrawlFormatWARC.java --- @@ -0,0 +1,337 @@ +package org.apache.nutch.tools; + +import

[GitHub] nutch pull request: WARC exporter for the CommonCrawlDataDumper

2015-09-15 Thread jorgelbg
Github user jorgelbg commented on a diff in the pull request: https://github.com/apache/nutch/pull/55#discussion_r39509273 --- Diff: src/java/org/apache/nutch/tools/CommonCrawlFormatWARC.java --- @@ -0,0 +1,337 @@ +package org.apache.nutch.tools; + +import

[GitHub] nutch pull request: WARC exporter for the CommonCrawlDataDumper

2015-09-15 Thread jnioche
Github user jnioche commented on a diff in the pull request: https://github.com/apache/nutch/pull/55#discussion_r39509421 --- Diff: src/java/org/apache/nutch/tools/CommonCrawlFormatWARC.java --- @@ -0,0 +1,337 @@ +package org.apache.nutch.tools; + +import