[jira] [Updated] (NUTCH-2034) CrawlDB filtered documents counter.

2016-02-11 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-2034: Fix Version/s: 1.12 > CrawlDB filtered documents coun

[jira] [Updated] (NUTCH-2032) Plugin to index the raw content of a readable document.

2016-02-11 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-2032: Fix Version/s: 1.12 > Plugin to index the raw content of a readable docum

[jira] [Updated] (NUTCH-2046) The crawl script should be able to skip an initial injection.

2016-02-11 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-2046: Fix Version/s: 1.12 > The crawl script should be able to skip an initial inject

[jira] [Assigned] (NUTCH-2046) The crawl script should be able to skip an initial injection.

2016-02-11 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney reassigned NUTCH-2046: --- Assignee: Lewis John McGibbney > The crawl script should be able to skip

[jira] [Updated] (NUTCH-2005) Implement HTrace'ing in Nutch

2016-02-10 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-2005: Labels: gsoc2016 (was: ) > Implement HTrace'ing

[jira] [Commented] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs

2016-02-10 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15141296#comment-15141296 ] Lewis John McGibbney commented on NUTCH-2144: - bq. [~chrismattmann] I am

[jira] [Commented] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs

2016-02-10 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15141213#comment-15141213 ] Lewis John McGibbney commented on NUTCH-2144: - Hi [~thammegowda], limitat

[jira] [Updated] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs

2016-02-10 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-2144: Fix Version/s: 1.12 > Plugin to override db.ignore.external to exempt interest

[jira] [Commented] (NUTCH-1314) Impose a limit on the length of outlink target urls

2016-02-08 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137375#comment-15137375 ] Lewis John McGibbney commented on NUTCH-1314: - Committed @ revisions 172

[jira] [Assigned] (NUTCH-1314) Impose a limit on the length of outlink target urls

2016-02-08 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney reassigned NUTCH-1314: --- Assignee: Lewis John McGibbney > Impose a limit on the length of outl

Fwd: private Digest 5 Feb 2016 18:05:43 -0000 Issue 354

2016-02-05 Thread Lewis John Mcgibbney
Assistance Applications now open! 1271 by: lewis john mcgibbney Administrivia: - To post to the list, e-mail: priv...@nutch.apache.org To unsubscribe, e-mail: private-digest-unsubscr...@nutch.apache.org For additional

[jira] [Commented] (NUTCH-1314) Impose a limit on the length of outlink target urls

2016-02-02 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15129575#comment-15129575 ] Lewis John McGibbney commented on NUTCH-1314: - Yep, if someone

Re: need suggestion for GSoC 2016

2016-01-26 Thread Lewis John Mcgibbney
Hi Ammar, I've given you write permissions for the wiki. Feel free to create a page for your proposed work at the URL below https://wiki.apache.org/nutch/GoogleSummerOfCode#A2016 On Fri, Jan 22, 2016 at 4:49 PM, Lewis John Mcgibbney < lewis.mcgibb...@gmail.com> wrote: > Hi A

[jira] [Commented] (NUTCH-2206) Provide example scoring.similarity.stopword.file

2016-01-26 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15118286#comment-15118286 ] Lewis John McGibbney commented on NUTCH-2206: - +1 [~sujenshah], th

[jira] [Commented] (NUTCH-2206) Provide example scoring.similarity.stopword.file

2016-01-26 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15117800#comment-15117800 ] Lewis John McGibbney commented on NUTCH-2206: - We should most likely

[jira] [Resolved] (NUTCH-1741) Support of Sitemaps in Nutch 2.x

2016-01-26 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-1741. - Resolution: Fixed Committed revision 1726853 in 2.X Thank you to everyone that

[jira] [Updated] (NUTCH-2208) Fix 4 skipped tests in TestGenerator

2016-01-26 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-2208: Attachment: TEST-org.apache.nutch.crawl.TestGenerator.txt Attached is full test log

[jira] [Created] (NUTCH-2208) Fix 4 skipped tests in TestGenerator

2016-01-26 Thread Lewis John McGibbney (JIRA)
Lewis John McGibbney created NUTCH-2208: --- Summary: Fix 4 skipped tests in TestGenerator Key: NUTCH-2208 URL: https://issues.apache.org/jira/browse/NUTCH-2208 Project: Nutch Issue Type

[jira] [Updated] (NUTCH-1741) Support of Sitemaps in Nutch 2.x

2016-01-26 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1741: Attachment: NUTCH-1741v7.patch Managed to update this at the weekend and forgot to

[jira] [Created] (NUTCH-2207) Remove class duplication and smarten-up scoring-similarity plugin

2016-01-25 Thread Lewis John McGibbney (JIRA)
Lewis John McGibbney created NUTCH-2207: --- Summary: Remove class duplication and smarten-up scoring-similarity plugin Key: NUTCH-2207 URL: https://issues.apache.org/jira/browse/NUTCH-2207

[jira] [Created] (NUTCH-2206) Provide example scoring.similarity.stopword.file

2016-01-25 Thread Lewis John McGibbney (JIRA)
Lewis John McGibbney created NUTCH-2206: --- Summary: Provide example scoring.similarity.stopword.file Key: NUTCH-2206 URL: https://issues.apache.org/jira/browse/NUTCH-2206 Project: Nutch

[jira] [Commented] (NUTCH-2206) Provide example scoring.similarity.stopword.file

2016-01-25 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15116491#comment-15116491 ] Lewis John McGibbney commented on NUTCH-2206: - CC [~sujenshah] >

[jira] [Updated] (NUTCH-2184) Enable IndexingJob to function with no crawldb

2016-01-25 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-2184: Attachment: NUTCH-2184v2.patch Updated patch for trunk. [~markus17], working to

[jira] [Commented] (NUTCH-1741) Support of Sitemaps in Nutch 2.x

2016-01-22 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15113423#comment-15113423 ] Lewis John McGibbney commented on NUTCH-1741: - I'm nearly finished

[jira] [Updated] (NUTCH-1741) Support of Sitemaps in Nutch 2.x

2016-01-22 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1741: Assignee: cihad güzel > Support of Sitemaps in Nutch

Re: need suggestion for GSoC 2016

2016-01-22 Thread Lewis John Mcgibbney
83.html) and > doesn't have any reply so far. > I would appreciate use your suggestion. > > Warmest regards > Ammar Shadiq > > On Tue, Nov 3, 2015 at 3:28 AM, Lewis John Mcgibbney < > lewis.mcgibb...@gmail.com> wrote: > >> Hi Ammar, >> I have a few s

[jira] [Commented] (NUTCH-2171) Upgrade Nutch Trunk to Java 1.8

2016-01-22 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15113380#comment-15113380 ] Lewis John McGibbney commented on NUTCH-2171: - Hey [~jorgelbg] feel fre

[ANNOUNCE] Apache Nutch 2.3.1 Release

2016-01-21 Thread lewis john mcgibbney
Hi Folks, !!Apologies for cross posting!! The Apache Nutch PMC are pleased to announce the immediate release of Apache Nutch v2.3.1, we advise all current users and developers of the 2.X series to upgrade to this release. Nutch is a well matured, production ready Web crawler. Nutch 2.X branch is

[jira] [Commented] (NUTCH-2202) Integration of Anthelion (Focused Crawling Module) into Nutch

2016-01-21 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15110867#comment-15110867 ] Lewis John McGibbney commented on NUTCH-2202: - I agree [~robertmeusel],

[RESULT] WAS Re: [VOTE] Release Apache Nutch 2.3.1rc2

2016-01-21 Thread Lewis John Mcgibbney
Hi Folks, I am bringing this VOTE to a close with the following results [3] +1 Release this package as Apache Nutch 2.3.1. Lewis John McGibbney* Sebastian Nagel* Chris Mattmann* [0] -1 Do not release this package because… *Nutch PMC Member I am really happy to therefore announce that the VOTE

[jira] [Commented] (NUTCH-1325) HostDB for Nutch

2016-01-21 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15110733#comment-15110733 ] Lewis John McGibbney commented on NUTCH-1325: - Nice Markus, the conversa

[jira] [Commented] (NUTCH-1325) HostDB for Nutch

2016-01-21 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15110702#comment-15110702 ] Lewis John McGibbney commented on NUTCH-1325: - What a patch. Real nic

Re: [VOTE] Release Apache Nutch 2.3.1rc2

2016-01-20 Thread Lewis John Mcgibbney
Hi user@, dev@, PING on the Nutch 2.3.1 RC#2 Would really appreciate anyone who is able to review this release candidate. It would mean a lot for our 2.X user base. Thank you Lewis On Sun, Jan 10, 2016 at 7:01 AM, Lewis John Mcgibbney < lewis.mcgibb...@gmail.com> wrote: > Hi Folks, >

[jira] [Created] (NUTCH-2200) Establish process for publishing Docker containers

2016-01-16 Thread Lewis John McGibbney (JIRA)
Lewis John McGibbney created NUTCH-2200: --- Summary: Establish process for publishing Docker containers Key: NUTCH-2200 URL: https://issues.apache.org/jira/browse/NUTCH-2200 Project: Nutch

Re: [VOTE] Release Apache Nutch 2.3.1rc2

2016-01-13 Thread Lewis John Mcgibbney
Any others able to review please? On Sun, Jan 10, 2016 at 7:01 AM, Lewis John Mcgibbney < lewis.mcgibb...@gmail.com> wrote: > Hi Folks, > > A second candidate for the Nutch 2.3.1 release is available at: > > https://dist.apache.org/repos/dist/dev/nutch/2.3.1rc2/ > >

Re: [VOTE] Release Apache Nutch 2.3.1rc2

2016-01-13 Thread Lewis John Mcgibbney
Hi Seb, Thanks for taking the time to review the release candidate. Replies inline On Tue, Jan 12, 2016 at 10:17 AM, wrote: > +1 > > - good signatures > - tests pass > - I've successfully run a test crawl (bin/crawl) using HBase 0.98.8 > > Two minor points: > > - CHANGES.txt mentions the rc1 rel

[jira] [Commented] (NUTCH-1186) FreeGenerator always normalizes

2016-01-10 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15091300#comment-15091300 ] Lewis John McGibbney commented on NUTCH-1186: - Hi [~markus17] I have sc

[VOTE] Release Apache Nutch 2.3.1rc2

2016-01-10 Thread Lewis John Mcgibbney
Hi Folks, A second candidate for the Nutch 2.3.1 release is available at: https://dist.apache.org/repos/dist/dev/nutch/2.3.1rc2/ The release candidate is a zip and tar.gz sources archive of the sources in: http://svn.apache.org/repos/asf/nutch/tags/release-2.3.1rc2/ In addition, a staged maven

[jira] [Created] (NUTCH-2199) Documentation for Nutch 2.X REST API

2016-01-10 Thread Lewis John McGibbney (JIRA)
Lewis John McGibbney created NUTCH-2199: --- Summary: Documentation for Nutch 2.X REST API Key: NUTCH-2199 URL: https://issues.apache.org/jira/browse/NUTCH-2199 Project: Nutch Issue Type

[jira] [Updated] (NUTCH-1800) Documentation for Nutch 1.X REST API

2016-01-10 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1800: Summary: Documentation for Nutch 1.X REST API (was: Documentation for Nutch 1.X

[jira] [Updated] (NUTCH-1800) Documentation for Nutch 1.X REST API

2016-01-10 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1800: Fix Version/s: (was: 2.3.1) > Documentation for Nutch 1.X REST

[jira] [Updated] (NUTCH-2094) Stopping and Restarting a crawl has issues in the Web UI

2016-01-08 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-2094: Fix Version/s: (was: 2.4) 2.3.1 > Stopping and Restartin

[jira] [Updated] (NUTCH-2165) FileDumper Util hard codes part-# folder name

2016-01-08 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-2165: Fix Version/s: (was: 2.4) > FileDumper Util hard codes part-# folder n

[jira] [Updated] (NUTCH-2166) Add reverse URL format to dump tool

2016-01-08 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-2166: Fix Version/s: (was: 2.4) > Add reverse URL format to dump t

[jira] [Comment Edited] (NUTCH-2168) Parse-tika fails to retrieve parser

2016-01-08 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15090337#comment-15090337 ] Lewis John McGibbney edited comment on NUTCH-2168 at 1/9/16 2:0

[jira] [Commented] (NUTCH-2168) Parse-tika fails to retrieve parser

2016-01-08 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15090337#comment-15090337 ] Lewis John McGibbney commented on NUTCH-2168: - +1 for commit [~wastl-n

[jira] [Commented] (NUTCH-2143) GeneratorJob ignores batch id passed as argument

2016-01-07 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15087804#comment-15087804 ] Lewis John McGibbney commented on NUTCH-2143: - Tested v3 and confirmed to

[jira] [Commented] (NUTCH-1186) FreeGenerator always normalizes

2016-01-05 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15083138#comment-15083138 ] Lewis John McGibbney commented on NUTCH-1186: - Will scope and test [~mark

[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb

2015-12-29 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15074453#comment-15074453 ] Lewis John McGibbney commented on NUTCH-2184: - [~markus17] coming bac

[jira] [Commented] (NUTCH-1946) Upgrade to Gora 0.6.1

2015-12-29 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15074319#comment-15074319 ] Lewis John McGibbney commented on NUTCH-1946: - Hi [~kalanya] bq. Hey

[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb

2015-12-16 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15060155#comment-15060155 ] Lewis John McGibbney commented on NUTCH-2184: - Ack On Wednesday, Decembe

[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb

2015-12-16 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15060023#comment-15060023 ] Lewis John McGibbney commented on NUTCH-2184: - Excellent points Markus th

[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb

2015-12-15 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15059489#comment-15059489 ] Lewis John McGibbney commented on NUTCH-2184: - No, just the following h

[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb

2015-12-15 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15059459#comment-15059459 ] Lewis John McGibbney commented on NUTCH-2184: - I've tested this on

[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb

2015-12-15 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15058977#comment-15058977 ] Lewis John McGibbney commented on NUTCH-2184: - To describe what this p

[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb

2015-12-15 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15058962#comment-15058962 ] Lewis John McGibbney commented on NUTCH-2184: - Issue is logged at NUTCH-

[jira] [Created] (NUTCH-2186) -addBinaryContent flag can cause "String length must be a multiple of four" error in IndexingJob

2015-12-15 Thread Lewis John McGibbney (JIRA)
Lewis John McGibbney created NUTCH-2186: --- Summary: -addBinaryContent flag can cause "String length must be a multiple of four" error in IndexingJob Key: NUTCH-2186 URL: https://issues.apache.org/j

[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb

2015-12-15 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15058955#comment-15058955 ] Lewis John McGibbney commented on NUTCH-2184: - I am going to open ano

[jira] [Updated] (NUTCH-2184) Enable IndexingJob to function with no crawldb

2015-12-15 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-2184: Attachment: NUTCH-2184.patch Patch for trunk. During testing this patch against

[jira] [Updated] (NUTCH-2184) Enable IndexingJob to function with no crawldb

2015-12-15 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-2184: Flags: Patch Patch Info: Patch Available > Enable IndexingJob to funct

[jira] [Work stopped] (NUTCH-2184) Enable IndexingJob to function with no crawldb

2015-12-15 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on NUTCH-2184 stopped by Lewis John McGibbney. --- > Enable IndexingJob to function with no craw

[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb

2015-12-14 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15056690#comment-15056690 ] Lewis John McGibbney commented on NUTCH-2184: - This issue also impr

[jira] [Created] (NUTCH-2185) protocol-soda-consumer plugin

2015-12-13 Thread Lewis John McGibbney (JIRA)
Lewis John McGibbney created NUTCH-2185: --- Summary: protocol-soda-consumer plugin Key: NUTCH-2185 URL: https://issues.apache.org/jira/browse/NUTCH-2185 Project: Nutch Issue Type: Bug

[jira] [Work started] (NUTCH-2184) Enable IndexingJob to function with no crawldb

2015-12-11 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on NUTCH-2184 started by Lewis John McGibbney. --- > Enable IndexingJob to function with no craw

[jira] [Created] (NUTCH-2184) Enable IndexingJob to function with no crawldb

2015-12-11 Thread Lewis John McGibbney (JIRA)
Lewis John McGibbney created NUTCH-2184: --- Summary: Enable IndexingJob to function with no crawldb Key: NUTCH-2184 URL: https://issues.apache.org/jira/browse/NUTCH-2184 Project: Nutch

[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb

2015-12-11 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15053975#comment-15053975 ] Lewis John McGibbney commented on NUTCH-2184: - Working on this right

[jira] [Resolved] (NUTCH-2183) Improvement to SegmentChecker for skipping non-segments present in segments directory

2015-12-09 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-2183. - Resolution: Fixed Committed @revision 1719006 in trunk. Thank you [~mjoyce] for

[jira] [Resolved] (NUTCH-2180) FileDumper dumps data, but breaks midway on corrupt segments

2015-12-09 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-2180. - Resolution: Fixed Committed @revision 1719004 in trunk > FileDumper dumps d

[jira] [Commented] (NUTCH-2183) Improvement to SegmentChecker for skipping non-segments present in segments directory

2015-12-09 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15049698#comment-15049698 ] Lewis John McGibbney commented on NUTCH-2183: - Would like to commit toda

[jira] [Commented] (NUTCH-2180) FileDumper dumps data, but breaks midway on corrupt segments

2015-12-09 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15048999#comment-15048999 ] Lewis John McGibbney commented on NUTCH-2180: - Harsha do you know

[jira] [Updated] (NUTCH-2183) Improvement to SegmentChecker for skipping non-segments present in segments directory

2015-12-08 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-2183: Description: The scenario is that you have a bunch of Nutch data which has been

[jira] [Updated] (NUTCH-2183) Improvement to SegmentChecker for skipping non-segments present in segments directory

2015-12-08 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-2183: Attachment: NUTCH-2183.patch Patch for trunk. > Improvement to SegmentChecker

[jira] [Created] (NUTCH-2183) Improvement to SegmentChecker for skipping non-segments present in segments directory

2015-12-08 Thread Lewis John McGibbney (JIRA)
Lewis John McGibbney created NUTCH-2183: --- Summary: Improvement to SegmentChecker for skipping non-segments present in segments directory Key: NUTCH-2183 URL: https://issues.apache.org/jira/browse/NUTCH-2183

Fwd: ApacheCon NA 2015 Travel Assistance Applications now open!

2015-12-07 Thread Lewis John Mcgibbney
-- -- Forwarded message -- From: lewis john mcgibbney To: Cc: "travel-assista...@apache.org" Date: Mon, 7 Dec 2015 20:15:50 -0800 Subject: ApacheCon NA 2015 Travel Assistance Applications now open! Hi pmcs@,

[jira] [Updated] (NUTCH-2181) Add Webpage for 3rd Party Connectors/Libraries to Apache Nutch

2015-12-07 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-2181: Issue Type: Task (was: Bug) > Add Webpage for 3rd Party Connectors/Libraries

[jira] [Created] (NUTCH-2181) Add Webpage for 3rd Party Connectors/Libraries to Apache Nutch

2015-12-07 Thread Lewis John McGibbney (JIRA)
Lewis John McGibbney created NUTCH-2181: --- Summary: Add Webpage for 3rd Party Connectors/Libraries to Apache Nutch Key: NUTCH-2181 URL: https://issues.apache.org/jira/browse/NUTCH-2181 Project

[RELEASE] Apache Nutch 1.11

2015-12-07 Thread lewis john mcgibbney
Hello Folks, 07 December 2015 - Nutch 1.11 Release The Apache Nutch PMC are pleased to announce the immediate release of Apache Nutch v1.11, we advise all current users and developers of the 1.X series to upgrade to this release. What is Apache Nutch? Nutch is a well matured, production ready W

[RESULT] WAS Re: [VOTE] Release Apache Nutch 1.11 RC#2

2015-12-07 Thread Lewis John Mcgibbney
Hi user@ dev@, 72hrs have elapsed so I would like to bring this thread to a close! VOTEs were cast with the following RESULT [7] +1 Release this package as Apache Nutch 1.11 Lewis John Mcgibbney* Roannel Fernández Hernández Sujen Shah* Chris A Mattmann* Julien Nioche* Sebastian Nagel* Jorge

[VOTE] Release Apache Nutch 1.11 RC#2

2015-12-04 Thread Lewis John Mcgibbney
-1.11-rc2/ All artifacts have been signed with the following signature as present within KEYS 48BAEBF6 2013-10-28 Lewis John McGibbney (CODE SIGNING KEY) < lewi...@apache.org> In addition, a staged maven repository is available here: https://repository.apache.org/content/repositories/orgapach

Dropping Nutch 1.11RC#1 Artifacts

2015-12-03 Thread Lewis John Mcgibbney
Hi Chris, Can you please drop the Nutch 1.11RC#1 artifacts from repository.a.o and from https://dist.apache.org/repos/dist/dev/nutch/1.11/ Thanks very much Lewis -- *Lewis*

[jira] [Updated] (NUTCH-2178) DeduplicationJob to optionally group on host or domain

2015-12-03 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-2178: Fix Version/s: (was: 1.11) 1.12 > DeduplicationJob

[jira] [Updated] (NUTCH-2128) Refactor configuration end point

2015-12-03 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-2128: Fix Version/s: (was: 1.12) 1.11 > Refactor configuration

[jira] [Updated] (NUTCH-2149) REST endpoint to read Nutch sequence files

2015-12-03 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-2149: Fix Version/s: (was: 1.12) 1.11 > REST endpoint to r

[jira] [Commented] (NUTCH-2172) Parsing whitespace not just tabs in contenttype-mapping.txt

2015-12-03 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15038801#comment-15038801 ] Lewis John McGibbney commented on NUTCH-2172: - +1 > Parsing whitesp

[jira] [Commented] (NUTCH-2172) Parsing whitespace not just tabs in contenttype-mapping.txt

2015-12-03 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15037471#comment-15037471 ] Lewis John McGibbney commented on NUTCH-2172: - [~wastl-nagel] this is a

[jira] [Commented] (NUTCH-2172) Parsing whitespace not just tabs in contenttype-mapping.txt

2015-12-03 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15037470#comment-15037470 ] Lewis John McGibbney commented on NUTCH-2172: - I think that is the point

[jira] [Comment Edited] (NUTCH-2158) Upgrade to Tika 1.11

2015-11-23 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15023141#comment-15023141 ] Lewis John McGibbney edited comment on NUTCH-2158 at 11/23/15 9:4

[jira] [Commented] (NUTCH-2158) Upgrade to Tika 1.11

2015-11-23 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15023141#comment-15023141 ] Lewis John McGibbney commented on NUTCH-2158: - I am +1 for this. If we

[jira] [Commented] (NUTCH-2158) Upgrade to Tika 1.11

2015-11-20 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15018544#comment-15018544 ] Lewis John McGibbney commented on NUTCH-2158: - Hi [~jnioche], I repro

[jira] [Resolved] (NUTCH-2058) Indexer plugin that allows RegEx replacements on the NutchDocument field values

2015-11-20 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-2058. - Resolution: Fixed Tests are not failing as per recent local builds https

[DISCUSS] Release Nutch 1.11?

2015-11-20 Thread Lewis John Mcgibbney
Hi Folks, Title says it all. There is only one pending issue for 1.11. https://issues.apache.org/jira/browse/NUTCH-2158 I am testing out the Tika 1.11 patch right now. Do you guys want me to push a release if we can get the Tika committed? I can do this tonight when I get home. Ta Lewis -- *Lewis

[jira] [Updated] (NUTCH-2162) Nutch Webapp Crawl fails as it tries to index

2015-11-20 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-2162: Fix Version/s: (was: 1.11) 1.12 > Nutch Webapp Crawl fa

[jira] [Updated] (NUTCH-2069) Ignore external links based on domain

2015-11-19 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-2069: Fix Version/s: (was: 1.12) 1.11 > Ignore external li

[jira] [Commented] (NUTCH-2069) Ignore external links based on domain

2015-11-19 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15015387#comment-15015387 ] Lewis John McGibbney commented on NUTCH-2069: - +1 for patch. Sorry a

[jira] [Updated] (NUTCH-2069) Ignore external links based on domain

2015-11-19 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-2069: Fix Version/s: 1.12 > Ignore external links based on dom

[jira] [Created] (NUTCH-2171) Upgrade Nutch Trunk to Java 1.8

2015-11-16 Thread Lewis John McGibbney (JIRA)
Lewis John McGibbney created NUTCH-2171: --- Summary: Upgrade Nutch Trunk to Java 1.8 Key: NUTCH-2171 URL: https://issues.apache.org/jira/browse/NUTCH-2171 Project: Nutch Issue Type: Task

Upgrade of mapred --> mapreduce in trunk e.g. Nutch 3.X

2015-11-14 Thread Lewis John Mcgibbney
Hi Folks, Mike Joyce and myself have been working on a Tinkerpop implementation of Node and NodeDB (generated through WebGraph) which builds a Vertex input, used by Tinkerpop, subsequently Gremlin and persisted into a graph database such as TitanDB. We have analyzed the problem quite a bit and cam

[jira] [Commented] (NUTCH-2157) Parent Issue for Addressing Miredot REST API Warnings

2015-11-13 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15005130#comment-15005130 ] Lewis John McGibbney commented on NUTCH-2157: - +1 commit, this looks

[jira] [Closed] (NUTCH-2170) When i am crawling the URL http://www.aossama.com/. it is crawling url like this com.aossama.www.http/

2015-11-13 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney closed NUTCH-2170. --- Resolution: Fixed Hi prabhakar please go to our mailing lists and we can help you

[jira] [Updated] (NUTCH-2130) copyField rawcontent creates error within schema.xml

2015-11-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-2130: Fix Version/s: (was: 2.4) 2.3.1 > copyField rawcont
