[jira] [Updated] (NUTCH-1284) Add site fetcher.max.crawl.delay as log output by default.

2012-12-22 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil updated NUTCH-1284: --- Attachment: NUTCH-1284.patch Patch for the fix Add site fetcher.max.crawl.delay as

[jira] [Commented] (NUTCH-1284) Add site fetcher.max.crawl.delay as log output by default.

2012-12-22 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13538725#comment-13538725 ] Tejas Patil commented on NUTCH-1284: I searched for the relevant mail thread[0] to get

[jira] [Comment Edited] (NUTCH-1284) Add site fetcher.max.crawl.delay as log output by default.

2012-12-22 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13538725#comment-13538725 ] Tejas Patil edited comment on NUTCH-1284 at 12/22/12 10:54 AM:

[jira] [Updated] (NUTCH-1118) JUnit test for index-basic

2012-12-22 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil updated NUTCH-1118: --- Attachment: NUTCH-1118.patch Wrote a test case which checks the following: 1. basic searchable fields

[jira] [Updated] (NUTCH-1119) JUnit test for index-static

2012-12-23 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil updated NUTCH-1119: --- Attachment: NUTCH-1119.patch Wrote a test case which checks the following: 1. static data fields are

[jira] [Updated] (NUTCH-1224) Migrate FreeGenerator to MapReduce API

2012-12-29 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil updated NUTCH-1224: --- Attachment: NUTCH-1224.1.patch First attempt. Only remaining question is: Should I create a separate

[jira] [Updated] (NUTCH-1127) JUnit test for urlfilter-validator

2012-12-29 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil updated NUTCH-1127: --- Attachment: NUTCH-1127.patch Wrote a test case capturing a few scenarios. Attached the patch. Please let

[jira] [Commented] (NUTCH-1494) RSS feed plugin seems broken

2013-01-03 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13542798#comment-13542798 ] Tejas Patil commented on NUTCH-1494: I was working on

[jira] [Commented] (NUTCH-1053) Parsing of RSS feeds fails

2013-01-03 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13542805#comment-13542805 ] Tejas Patil commented on NUTCH-1053: The exception seen by Lewis wrt command line way

[jira] [Updated] (NUTCH-1274) Fix [cast] javac warnings

2013-01-03 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil updated NUTCH-1274: --- Attachment: NUTCH-1274-trunk.patch NUTCH-1274-2.x.patch PFA the patches for trunk

[jira] [Commented] (NUTCH-1513) Support Robots.txt for Ftp urls

2013-01-04 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13543720#comment-13543720 ] Tejas Patil commented on NUTCH-1513: For this to be supported, I have 2 approaches:

[jira] [Created] (NUTCH-1514) Phase out the deprecated configuration properties (if possible)

2013-01-06 Thread Tejas Patil (JIRA)
Tejas Patil created NUTCH-1514: -- Summary: Phase out the deprecated configuration properties (if possible) Key: NUTCH-1514 URL: https://issues.apache.org/jira/browse/NUTCH-1514 Project: Nutch

[jira] [Updated] (NUTCH-1514) Phase out the deprecated configuration properties (if possible)

2013-01-06 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil updated NUTCH-1514: --- Attachment: NUTCH-1514.patch Attached the patch for changes in nutch trunk. Please let me know your

[jira] [Updated] (NUTCH-1514) Phase out the deprecated configuration properties (if possible)

2013-01-06 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil updated NUTCH-1514: --- Attachment: NUTCH-1514-v2.patch Thanks Sebastian !! I removed those references in nutch-default.xml

[jira] [Commented] (NUTCH-1513) Support Robots.txt for Ftp urls

2013-01-07 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13545691#comment-13545691 ] Tejas Patil commented on NUTCH-1513: Hi Lewis, Thanks for your suggestion. I think

[jira] [Commented] (NUTCH-1494) RSS feed plugin seems broken

2013-01-07 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13546454#comment-13546454 ] Tejas Patil commented on NUTCH-1494: Hi Lewis, I could not run nutch with rome

[jira] [Updated] (NUTCH-1494) RSS feed plugin seems broken

2013-01-07 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil updated NUTCH-1494: --- Attachment: NUTCH-1494.3.patch @Lewis: it worked :) I have attached the patch. Please let me know

[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons

2013-01-07 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13546639#comment-13546639 ] Tejas Patil commented on NUTCH-1031: The current nutch robots parsing logic uses

[jira] [Commented] (NUTCH-1274) Fix [cast] javac warnings

2013-01-11 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13551809#comment-13551809 ] Tejas Patil commented on NUTCH-1274: Hi Lewis, I will do those changes. You can assign

[jira] [Commented] (NUTCH-1274) Fix [cast] javac warnings

2013-01-11 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13551815#comment-13551815 ] Tejas Patil commented on NUTCH-1274: Hi Lewis, I took a fresh checkout of trunk and

[jira] [Updated] (NUTCH-1274) Fix [cast] javac warnings

2013-01-11 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil updated NUTCH-1274: --- Attachment: NUTCH-1274-2.x.v2.patch NUTCH-1274-trunk.v2.patch Hi Lewis, PFA the

[jira] [Updated] (NUTCH-1284) Add site fetcher.max.crawl.delay as log output by default.

2013-01-12 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil updated NUTCH-1284: --- Assignee: Tejas Patil Add site fetcher.max.crawl.delay as log output by default.

[jira] [Commented] (NUTCH-1284) Add site fetcher.max.crawl.delay as log output by default.

2013-01-12 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13551857#comment-13551857 ] Tejas Patil commented on NUTCH-1284: Can anyone kindly review the patch ?

[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons

2013-01-18 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13557930#comment-13557930 ] Tejas Patil commented on NUTCH-1031: After waiting for more than a week, I think that

[jira] [Updated] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons

2013-01-20 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil updated NUTCH-1031: --- Attachment: CC.robots.multiple.agents.patch I looked at the source code of CC to understand how it

[jira] [Assigned] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons

2013-01-20 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil reassigned NUTCH-1031: -- Assignee: Tejas Patil (was: Julien Nioche) Delegate parsing of robots.txt to

[jira] [Assigned] (NUTCH-1513) Support Robots.txt for Ftp urls

2013-01-20 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil reassigned NUTCH-1513: -- Assignee: Tejas Patil Support Robots.txt for Ftp urls ---

[jira] [Updated] (NUTCH-1284) Add site fetcher.max.crawl.delay as log output by default.

2013-01-20 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil updated NUTCH-1284: --- Attachment: NUTCH-1284-trunk.v1.patch Hi Lewis, If I recall correctly, we want the crawl delay for

[jira] [Assigned] (NUTCH-1042) Fetcher.max.crawl.delay property not taken into account correctly when set to -1

2013-01-20 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil reassigned NUTCH-1042: -- Assignee: Tejas Patil Fetcher.max.crawl.delay property not taken into account correctly

[jira] [Commented] (NUTCH-1042) Fetcher.max.crawl.delay property not taken into account correctly when set to -1

2013-01-20 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13558225#comment-13558225 ] Tejas Patil commented on NUTCH-1042: The patch for

[jira] [Commented] (NUTCH-1329) parser not extract outlinks to external web sites

2013-01-20 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13558228#comment-13558228 ] Tejas Patil commented on NUTCH-1329: I am not able to reproduce this bug with the

[jira] [Commented] (NUTCH-1042) Fetcher.max.crawl.delay property not taken into account correctly when set to -1

2013-01-20 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13558321#comment-13558321 ] Tejas Patil commented on NUTCH-1042: linked with NUTCH-1284

[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons

2013-01-20 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13558349#comment-13558349 ] Tejas Patil commented on NUTCH-1031: Hi Ken, Thanks for reviewing the patch. I will

[jira] [Updated] (NUTCH-1284) Add site fetcher.max.crawl.delay as log output by default.

2013-01-21 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil updated NUTCH-1284: --- Attachment: NUTCH-1284-2.x.v1.patch Hi Lewis, Thanks for reminding about 2.x. I have attached the

[jira] [Updated] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons

2013-01-21 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil updated NUTCH-1031: --- Attachment: CC.robots.multiple.agents.v2.patch Hi Ken, I have added a test case to CC for the

[jira] [Updated] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons

2013-01-21 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil updated NUTCH-1031: --- Attachment: NUTCH-1031-trunk.v2.patch Added a patch for nutch trunk (NUTCH-1031-trunk.v2.patch). If

[jira] [Comment Edited] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons

2013-01-21 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13559332#comment-13559332 ] Tejas Patil edited comment on NUTCH-1031 at 1/22/13 3:18 AM: -

[jira] [Updated] (NUTCH-1465) Support sitemaps in Nutch

2013-01-27 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil updated NUTCH-1465: --- Attachment: NUTCH-1465-trunk.v1.patch This is a work in progress. So far I have done the following: -

[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2013-01-27 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13564040#comment-13564040 ] Tejas Patil commented on NUTCH-1465: Hi Ken, As the CC robots integration jira is not

[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-01-27 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13564095#comment-13564095 ] Tejas Patil commented on NUTCH-1047: Hi Lufeng, You are right. There was a problem

[jira] [Resolved] (NUTCH-1284) Add site fetcher.max.crawl.delay as log output by default.

2013-01-28 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil resolved NUTCH-1284. Resolution: Fixed Add site fetcher.max.crawl.delay as log output by default.

[jira] [Commented] (NUTCH-1284) Add site fetcher.max.crawl.delay as log output by default.

2013-01-28 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13564107#comment-13564107 ] Tejas Patil commented on NUTCH-1284: Committed @revision 1439289 in trunk Committed

[jira] [Resolved] (NUTCH-1042) Fetcher.max.crawl.delay property not taken into account correctly when set to -1

2013-01-28 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil resolved NUTCH-1042. Resolution: Fixed The fix for NUTCH-1284 takes care of this.

[jira] [Assigned] (NUTCH-1465) Support sitemaps in Nutch

2013-01-28 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil reassigned NUTCH-1465: -- Assignee: Tejas Patil Support sitemaps in Nutch -

[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-01-28 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13564187#comment-13564187 ] Tejas Patil commented on NUTCH-1047: Hi Julien, After reply from @lufeng, I was able

[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-01-28 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13564252#comment-13564252 ] Tejas Patil commented on NUTCH-1047: Hi Julien, The solrindex command and crawl

[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2013-01-28 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13564883#comment-13564883 ] Tejas Patil commented on NUTCH-1465: Hi Sebastian, So we are looking at 2 things

[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-01-29 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13565202#comment-13565202 ] Tejas Patil commented on NUTCH-1047: Hi Julien, As you suggested, I tried to run

[jira] [Commented] (NUTCH-1521) CrawlDbFilter pass null url to urlNormailzers

2013-02-03 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13570050#comment-13570050 ] Tejas Patil commented on NUTCH-1521: Hi Lufeng, In 2.x, some classes are given

[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-02-19 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13581964#comment-13581964 ] Tejas Patil commented on NUTCH-1047: Hi Julien, The crawl command (with solr option)

[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-02-20 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13582163#comment-13582163 ] Tejas Patil commented on NUTCH-1047: Hey Julien, While running the solrclean command,

[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-02-21 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13583011#comment-13583011 ] Tejas Patil commented on NUTCH-1047: Hi Julien, One small change in Java class will

[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons

2013-02-21 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13583013#comment-13583013 ] Tejas Patil commented on NUTCH-1031: Hey Ken, A gentle reminder for releasing CC.

[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons

2013-02-21 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13583662#comment-13583662 ] Tejas Patil commented on NUTCH-1031: Hi Lewis, I should have checked on the main page

[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons

2013-02-21 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13583664#comment-13583664 ] Tejas Patil commented on NUTCH-1031: @Dev: I am planning to commit this change in

[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons

2013-02-24 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13585467#comment-13585467 ] Tejas Patil commented on NUTCH-1031: Hi Sebastian, Thanks for your time and suggesting

[jira] [Commented] (NUTCH-1529) Port nutch-mongdb-parser to trunk

2013-02-28 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13589946#comment-13589946 ] Tejas Patil commented on NUTCH-1529: There is no harm in adding such support. Mongodb

[jira] [Commented] (NUTCH-1529) Port nutch-mongdb-parser to trunk

2013-02-28 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13589977#comment-13589977 ] Tejas Patil commented on NUTCH-1529: @Lewis, I am a rookie in terms of mongodb so

[jira] [Commented] (NUTCH-1529) Port nutch-mongdb-parser to trunk

2013-02-28 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13590243#comment-13590243 ] Tejas Patil commented on NUTCH-1529: @Lufeng The earlier patch had some references to

[jira] [Commented] (NUTCH-1447) Nutch 2.x with Cloudera CDH 4 get Error: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected

2013-03-04 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13593073#comment-13593073 ] Tejas Patil commented on NUTCH-1447: This error is possibly due to code refactoring

[jira] [Commented] (NUTCH-1447) Nutch 2.x with Cloudera CDH 4 get Error: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected

2013-03-04 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13593076#comment-13593076 ] Tejas Patil commented on NUTCH-1447: As per discussion over user group [0], I agree

[jira] [Commented] (NUTCH-1454) parsing chm failed

2013-03-05 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13594411#comment-13594411 ] Tejas Patil commented on NUTCH-1454: Few observations about this issue: 1. Nutch is

[jira] [Updated] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons

2013-03-05 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil updated NUTCH-1031: --- Attachment: NUTCH-1031-trunk.v4.patch Hey Lewis, Thanks for pointing that out :) I have updated the

[jira] [Commented] (NUTCH-842) AutoGenerate WebPage code

2013-03-05 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13594442#comment-13594442 ] Tejas Patil commented on NUTCH-842: --- Hi Lewis, Can you kindly upload the changes that you

[jira] [Updated] (NUTCH-1542) adddays param for generator not present in 2.x

2013-03-06 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil updated NUTCH-1542: --- Summary: adddays param for generator not present in 2.x (was: -adddays param for generator not

[jira] [Created] (NUTCH-1542) -adddays param for generator not present in 2.x

2013-03-06 Thread Tejas Patil (JIRA)
Tejas Patil created NUTCH-1542: -- Summary: -adddays param for generator not present in 2.x Key: NUTCH-1542 URL: https://issues.apache.org/jira/browse/NUTCH-1542 Project: Nutch Issue Type: Bug

[jira] [Updated] (NUTCH-1542) adddays param for generator not present in 2.x

2013-03-06 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil updated NUTCH-1542: --- Attachment: NUTCH-1542.patch Patch for changes in GeneratorJob and the crawl script.

[jira] [Updated] (NUTCH-1542) adddays param for generator not present in 2.x

2013-03-08 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil updated NUTCH-1542: --- Attachment: NUTCH-1542.v2.patch Updated the patch as per NUTCH-1393 adddays param

[jira] [Updated] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons

2013-03-08 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil updated NUTCH-1031: --- Attachment: NUTCH-1031-trunk.v5.patch Thanks Lewis :) I have corrected the usage message.

[jira] [Commented] (NUTCH-1542) adddays param for generator not present in 2.x

2013-03-10 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13598482#comment-13598482 ] Tejas Patil commented on NUTCH-1542: Committed @revision 1454974

[jira] [Resolved] (NUTCH-1542) adddays param for generator not present in 2.x

2013-03-10 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil resolved NUTCH-1542. Resolution: Fixed adddays param for generator not present in 2.x

[jira] [Commented] (NUTCH-1544) Nutch crawls only first site from seed list

2013-03-12 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13600295#comment-13600295 ] Tejas Patil commented on NUTCH-1544: I don't know what seeds you had and what urls you

[jira] [Comment Edited] (NUTCH-1544) Nutch crawls only first site from seed list

2013-03-12 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13600295#comment-13600295 ] Tejas Patil edited comment on NUTCH-1544 at 3/12/13 6:58 PM: -

[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons

2013-04-05 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13624194#comment-13624194 ] Tejas Patil commented on NUTCH-1031: I have removed the @author tag and ported the

[jira] [Commented] (NUTCH-1447) Nutch 2.x with Cloudera CDH 4 get Error: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected

2013-04-24 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13641243#comment-13641243 ] Tejas Patil commented on NUTCH-1447: I agree with you [~lewismc]. There are and will

[jira] [Commented] (NUTCH-1565) Proper downloads page for Nutch

2013-04-24 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13641349#comment-13641349 ] Tejas Patil commented on NUTCH-1565: Hi Lewis, I tried to build the docs using steps

[jira] [Commented] (NUTCH-1565) Proper downloads page for Nutch

2013-04-24 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13641380#comment-13641380 ] Tejas Patil commented on NUTCH-1565: No problem bro.. i am close enough to get around

[jira] [Updated] (NUTCH-1565) Proper downloads page for Nutch

2013-04-24 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil updated NUTCH-1565: --- Attachment: NUTCH-1565.v2.patch downloads.html So far I could fix the errors, but

[jira] [Resolved] (NUTCH-1565) Proper downloads page for Nutch

2013-04-24 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil resolved NUTCH-1565. Resolution: Fixed Changes pushed to SVN @ revision 1475631. Here is the [new downloads

[jira] [Updated] (NUTCH-829) duplicate hadoop temp files

2013-04-27 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil updated NUTCH-829: -- Attachment: NUTCH-829.v2.patch Hi Lewis, There was one more place in Generator where this change could

[jira] [Resolved] (NUTCH-829) duplicate hadoop temp files

2013-04-27 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil resolved NUTCH-829. --- Resolution: Fixed Thanks Lewis for pointing that out. Committed @ revision 1476702

[jira] [Commented] (NUTCH-1528) Port nutch-mongodb-indexer to Nutch

2013-04-27 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13643858#comment-13643858 ] Tejas Patil commented on NUTCH-1528: As this change ain't going to the repo, should we

[jira] [Commented] (NUTCH-346) Improve readability of logs/hadoop.log

2013-04-27 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13643863#comment-13643863 ] Tejas Patil commented on NUTCH-346: --- I think that this will be a good addition as

[jira] [Commented] (NUTCH-1528) Port nutch-mongodb-indexer to Nutch

2013-04-27 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13643879#comment-13643879 ] Tejas Patil commented on NUTCH-1528: I think that [~jnioche] and [~wastl-nagel] might

[jira] [Closed] (NUTCH-1528) Port nutch-mongodb-indexer to Nutch

2013-04-28 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil closed NUTCH-1528. -- Resolution: Won't Fix Port nutch-mongodb-indexer to Nutch ---

[jira] [Closed] (NUTCH-346) Improve readability of logs/hadoop.log

2013-04-28 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil closed NUTCH-346. - Resolution: Fixed Pushed to svn. (trunk: rev 1476859, 2.x: 1476861) Improve readability

[jira] [Updated] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons

2013-04-28 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil updated NUTCH-1031: --- Attachment: NUTCH-1031-2.x.v1.patch Patch for 2.x. If there are no objections, would commit in

[jira] [Updated] (NUTCH-649) Log list of files found but not crawled.

2013-04-28 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil updated NUTCH-649: -- Attachment: NUTCH-649.trunk.patch NUTCH-649.2.x.patch patches for trunk and 2.x

[jira] [Commented] (NUTCH-1314) Impose a limit on the length of outlink target urls

2013-04-28 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13644269#comment-13644269 ] Tejas Patil commented on NUTCH-1314: Hi Lewis, I tried to test both the patches.

[jira] [Commented] (NUTCH-1329) parser not extract outlinks to external web sites

2013-04-28 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13644272#comment-13644272 ] Tejas Patil commented on NUTCH-1329: Should we close this one? I had tried to

[jira] [Resolved] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons

2013-04-29 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil resolved NUTCH-1031. Resolution: Fixed Fix Version/s: 2.2 Thanks Lewis :) Changes committed to 2.x (revision

[jira] [Closed] (NUTCH-1455) RobotRulesParser to match multi-word user-agent names

2013-04-29 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil closed NUTCH-1455. -- Resolution: Fixed Fix Version/s: 2.2 Assignee: Tejas Patil We have migrated to CC for

[jira] [Closed] (NUTCH-342) Nutch commands log to nutch/logs/hadoop.logs by default

2013-04-30 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil closed NUTCH-342. - Resolution: Won't Fix I agree with Lewis wrt closing this issue as won't fix. Nutch

[jira] [Commented] (NUTCH-213) checkstyle

2013-04-30 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13645400#comment-13645400 ] Tejas Patil commented on NUTCH-213: --- IMHO, I don't think that we are in dire need to have

[jira] [Commented] (NUTCH-427) protocol-smb: plugin protocol implementing the CIFS/SMB protocol. This protocol allows Nutch to crawl Microsoft Windows Shares remotely using the CIFS/SMB protocol implm

2013-04-30 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13645403#comment-13645403 ] Tejas Patil commented on NUTCH-427: --- As [~ab] mentioned earlier This plugin uses an LGPL

[jira] [Closed] (NUTCH-449) Format of junit output should be configurable

2013-04-30 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil closed NUTCH-449. - Resolution: Implemented I have verified that current trunk and 2.x build files already have this change.

[jira] [Updated] (NUTCH-1514) Phase out the deprecated configuration properties (if possible)

2013-04-30 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil updated NUTCH-1514: --- Attachment: NUTCH-1514.2.x.patch Here is a corresponding patch for 2.x. Unless there are any

[jira] [Resolved] (NUTCH-802) Problems managing outlinks with large url length

2013-04-30 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil resolved NUTCH-802. --- Resolution: Won't Fix Agree with Markus and Lewis. Hence marking this one as won't fix. If someone

[jira] [Updated] (NUTCH-1543) Display consistent usage of DBUpdaterJob with 1.X

2013-04-30 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil updated NUTCH-1543: --- Attachment: NUTCH-1543.v2.patch Hi [~amuseme], The patch will kill the current behavior wherein if

[jira] [Commented] (NUTCH-1273) Fix [deprecation] javac warnings

2013-04-30 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13645434#comment-13645434 ] Tejas Patil commented on NUTCH-1273: Hey [~lewismc], I still see deprecation warnings
