[jira] Created: (NUTCH-835) document deduplication (exact duplicates) failed using MD5Signature

2010-06-25 Thread Sebastian Nagel (JIRA)
document deduplication (exact duplicates) failed using MD5Signature --- Key: NUTCH-835 URL: https://issues.apache.org/jira/browse/NUTCH-835 Project: Nutch Issue Type: Bug

[jira] Created: (NUTCH-862) HttpClient null pointer exception

2010-07-27 Thread Sebastian Nagel (JIRA)
HttpClient null pointer exception - Key: NUTCH-862 URL: https://issues.apache.org/jira/browse/NUTCH-862 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 1.0.0

[jira] Updated: (NUTCH-862) HttpClient null pointer exception

2010-07-27 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-862: -- Attachment: NUTCH-862.patch patch HttpClient null pointer exception

[jira] Commented: (NUTCH-933) Fetcher does not save a pages Last-Modified value in CrawlDatum

2010-11-10 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12930588#action_12930588 ] Sebastian Nagel commented on NUTCH-933: --- The modifiedTime stored in a CrawlDatum

[jira] Updated: (NUTCH-962) max. redirects not handled correctly: fetcher stops at max-1 redirects

2011-01-26 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-962: -- Attachment: Fetcher_redir.patch patch for 1.3 to respect count of redirects literally:

[jira] [Created] (NUTCH-1344) BasicURLNormalizer to normalize https same as http

2012-04-21 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-1344: -- Summary: BasicURLNormalizer to normalize https same as http Key: NUTCH-1344 URL: https://issues.apache.org/jira/browse/NUTCH-1344 Project: Nutch Issue

[jira] [Updated] (NUTCH-1344) BasicURLNormalizer to normalize https same as http

2012-04-21 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1344: --- Attachment: NUTCH-1344.patch BasicURLNormalizer to normalize https same as http

[jira] [Commented] (NUTCH-1339) Default URL normalization rules to remove page anchors completely

2012-04-21 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13258827#comment-13258827 ] Sebastian Nagel commented on NUTCH-1339: BasicURLNormalizer does not remove the

[jira] [Commented] (NUTCH-1293) IndexingFiltersChecker to store detected content type in crawldatum metadata

2012-04-26 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13263124#comment-13263124 ] Sebastian Nagel commented on NUTCH-1293: The content type should be added to

[jira] [Commented] (NUTCH-1323) AjaxNormalizer

2012-05-12 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13273954#comment-13273954 ] Sebastian Nagel commented on NUTCH-1323: After a small test crawl on

[jira] [Created] (NUTCH-1383) IndexingFiltersChecker to show error message instead of null pointer exception

2012-06-09 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-1383: -- Summary: IndexingFiltersChecker to show error message instead of null pointer exception Key: NUTCH-1383 URL: https://issues.apache.org/jira/browse/NUTCH-1383

[jira] [Updated] (NUTCH-1383) IndexingFiltersChecker to show error message instead of null pointer exception

2012-06-09 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1383: --- Attachment: NUTCH-1383.patch patch for both null pointer exceptions

[jira] [Created] (NUTCH-1389) parsechecker and indexchecker to report truncated content

2012-06-12 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-1389: -- Summary: parsechecker and indexchecker to report truncated content Key: NUTCH-1389 URL: https://issues.apache.org/jira/browse/NUTCH-1389 Project: Nutch

[jira] [Created] (NUTCH-1415) release packages to contain top level folder apache-nutch-x.x

2012-06-30 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-1415: -- Summary: release packages to contain top level folder apache-nutch-x.x Key: NUTCH-1415 URL: https://issues.apache.org/jira/browse/NUTCH-1415 Project: Nutch

[jira] [Updated] (NUTCH-1415) release packages to contain top level folder apache-nutch-x.x

2012-06-30 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1415: --- Attachment: NUTCH-1415.patch Fix ant targets tar-src, tar-bin, zip-src, zip-bin Also set

[jira] [Updated] (NUTCH-1415) release packages to contain top level folder apache-nutch-x.x

2012-07-03 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1415: --- Attachment: NUTCH-1415-2.patch Hi Lewis, you are completely right: the tarfileset /

[jira] [Updated] (NUTCH-1421) RegexURLNormalizer to only skip rules with invalid patterns

2012-07-05 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1421: --- Attachment: NUTCH-1421-1.patch RegexURLNormalizer to only skip rules with invalid

[jira] [Created] (NUTCH-1422) reset signature for redirects

2012-07-06 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-1422: -- Summary: reset signature for redirects Key: NUTCH-1422 URL: https://issues.apache.org/jira/browse/NUTCH-1422 Project: Nutch Issue Type: Bug

[jira] [Updated] (NUTCH-1422) reset signature for redirects

2012-07-06 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1422: --- Attachment: NUTCH-1422_redir_notmodified_log.txt reset signature for redirects

[jira] [Commented] (NUTCH-1328) a problem with regex-normalize.xml

2012-07-10 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13410905#comment-13410905 ] Sebastian Nagel commented on NUTCH-1328: Duplicate of NUTCH-706

[jira] [Created] (NUTCH-1436) bin/nutch absent in zip package

2012-07-23 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-1436: -- Summary: bin/nutch absent in zip package Key: NUTCH-1436 URL: https://issues.apache.org/jira/browse/NUTCH-1436 Project: Nutch Issue Type: Bug

[jira] [Updated] (NUTCH-1436) bin/nutch absent in zip package

2012-07-23 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1436: --- Attachment: NUTCH-1436.patch Patch for branch-1.5.1 (if a new bin package is desired). For

[jira] [Created] (NUTCH-1454) parsing chm failed

2012-08-14 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-1454: -- Summary: parsing chm failed Key: NUTCH-1454 URL: https://issues.apache.org/jira/browse/NUTCH-1454 Project: Nutch Issue Type: Bug Components:

[jira] [Created] (NUTCH-1455) RobotRulesParser to match multi-word user-agent names

2012-08-14 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-1455: -- Summary: RobotRulesParser to match multi-word user-agent names Key: NUTCH-1455 URL: https://issues.apache.org/jira/browse/NUTCH-1455 Project: Nutch

[jira] [Commented] (NUTCH-1467) nutch 1.5.1 not able to parse mutliValued metatags

2012-09-12 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13454282#comment-13454282 ] Sebastian Nagel commented on NUTCH-1467: Since nutch.metadata.Metadata,

[jira] [Assigned] (NUTCH-1415) release packages to contain top level folder apache-nutch-x.x

2012-09-18 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel reassigned NUTCH-1415: -- Assignee: Sebastian Nagel release packages to contain top level folder

[jira] [Commented] (NUTCH-1415) release packages to contain top level folder apache-nutch-x.x

2012-09-18 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13457753#comment-13457753 ] Sebastian Nagel commented on NUTCH-1415: This has been fixed only for 1.5.1 and

[jira] [Resolved] (NUTCH-1415) release packages to contain top level folder apache-nutch-x.x

2012-09-18 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-1415. Resolution: Fixed Fix Version/s: 2.1 1.6 committed to trunk

[jira] [Commented] (NUTCH-706) Url regex normalizer

2012-10-02 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13467990#comment-13467990 ] Sebastian Nagel commented on NUTCH-706: --- Are there objections to apply and commit the

[jira] [Created] (NUTCH-1476) SegmentReader getStats should set parsed = -1 if no parsing took place

2012-10-08 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-1476: -- Summary: SegmentReader getStats should set parsed = -1 if no parsing took place Key: NUTCH-1476 URL: https://issues.apache.org/jira/browse/NUTCH-1476 Project:

[jira] [Updated] (NUTCH-1476) SegmentReader getStats should set parsed = -1 if no parsing took place

2012-10-08 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1476: --- Attachment: NUTCH-1476.patch SegmentReader getStats should set parsed = -1 if no

[jira] [Assigned] (NUTCH-1252) SegmentReader -get shows wrong data

2012-10-08 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel reassigned NUTCH-1252: -- Assignee: Sebastian Nagel SegmentReader -get shows wrong data

[jira] [Commented] (NUTCH-1344) BasicURLNormalizer to normalize https same as http

2012-10-08 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13471915#comment-13471915 ] Sebastian Nagel commented on NUTCH-1344: Is there any reason why https should be

[jira] [Updated] (NUTCH-706) Url regex normalizer: default pattern for session id removal not to match newsId

2012-10-10 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-706: -- Fix Version/s: 2.2 Summary: Url regex normalizer: default pattern for session id

[jira] [Resolved] (NUTCH-706) Url regex normalizer: default pattern for session id removal not to match newsId

2012-10-10 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-706. --- Resolution: Fixed committed to trunk (revision 1396796) and 2.x (revision 1396795)

[jira] [Resolved] (NUTCH-1344) BasicURLNormalizer to normalize https same as http

2012-10-10 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-1344. Resolution: Fixed Fix Version/s: 2.2 1.6 committed to trunk

[jira] [Commented] (NUTCH-706) Url regex normalizer: default pattern for session id removal not to match newsId

2012-10-10 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13473599#comment-13473599 ] Sebastian Nagel commented on NUTCH-706: --- First commit erroneously with wrong patch.

[jira] [Commented] (NUTCH-1475) Nutch 2.1 Index-More Plugin -- A better fall back value for date field

2012-10-11 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13474460#comment-13474460 ] Sebastian Nagel commented on NUTCH-1475: Indeed, a modified time in the future is

[jira] [Resolved] (NUTCH-1252) SegmentReader -get shows wrong data

2012-10-11 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-1252. Resolution: Fixed committed to trunk (revision 1397281) SegmentReader

[jira] [Resolved] (NUTCH-1476) SegmentReader getStats should set parsed = -1 if no parsing took place

2012-10-11 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-1476. Resolution: Fixed committed to trunk (revision 1397298) SegmentReader

[jira] [Resolved] (NUTCH-1383) IndexingFiltersChecker to show error message instead of null pointer exception

2012-10-11 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-1383. Resolution: Fixed committed to trunk (revision 1397308)

[jira] [Commented] (NUTCH-1467) nutch 1.5.1 not able to parse mutliValued metatags

2012-10-23 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13482644#comment-13482644 ] Sebastian Nagel commented on NUTCH-1467: Hi Kiran, thanks for the patch. After a

[jira] [Updated] (NUTCH-1467) nutch 1.5.1 not able to parse mutliValued metatags

2012-10-23 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1467: --- Attachment: NUTCH-1467-TEST-1.patch nutch 1.5.1 not able to parse mutliValued metatags

[jira] [Resolved] (NUTCH-1421) RegexURLNormalizer to only skip rules with invalid patterns

2012-10-23 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-1421. Resolution: Fixed Fix Version/s: 2.2 1.6 committed to trunk

[jira] [Updated] (NUTCH-1245) URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again

2012-10-29 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1245: --- Attachment: NUTCH-1245-578-TEST-1.patch JUnit test to catch this problem and NUTCH-578: a

[jira] [Updated] (NUTCH-1245) URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again

2012-10-29 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1245: --- Attachment: NUTCH-1245-1.patch FetchSchedule.setPageGoneSchedule is called exclusively for a

[jira] [Commented] (NUTCH-1482) Rename HTMLParseFilter

2012-10-29 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13486290#comment-13486290 ] Sebastian Nagel commented on NUTCH-1482: Markus, you are right: I remember the API

[jira] [Updated] (NUTCH-1245) URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again

2012-10-29 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1245: --- Attachment: NUTCH-1245-2.patch NUTCH-1245-578-TEST-2.patch Improved patches

[jira] [Commented] (NUTCH-578) URL fetched with 403 is generated over and over again

2012-10-29 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13486484#comment-13486484 ] Sebastian Nagel commented on NUTCH-578: --- NUTCH-1245 provides a test to catch this

[jira] [Updated] (NUTCH-578) URL fetched with 403 is generated over and over again

2012-10-29 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-578: -- Attachment: NUTCH-578_v5.patch URL fetched with 403 is generated over and over again

[jira] [Commented] (NUTCH-1370) Expose exact number of urls injected @runtime

2012-10-30 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13487316#comment-13487316 ] Sebastian Nagel commented on NUTCH-1370: +1 Would be nice to see also the number

[jira] [Commented] (NUTCH-578) URL fetched with 403 is generated over and over again

2012-10-30 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13487318#comment-13487318 ] Sebastian Nagel commented on NUTCH-578: --- Resetting the retry counter in

[jira] [Commented] (NUTCH-1483) Can't crawl filesystem with protocol-file plugin

2012-10-31 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13488146#comment-13488146 ] Sebastian Nagel commented on NUTCH-1483: Confirmed. The problem is caused by the

[jira] [Updated] (NUTCH-1483) Can't crawl filesystem with protocol-file plugin

2012-10-31 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1483: --- Affects Version/s: 1.6 Can't crawl filesystem with protocol-file plugin

[jira] [Commented] (NUTCH-1483) Can't crawl filesystem with protocol-file plugin

2012-10-31 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13488200#comment-13488200 ] Sebastian Nagel commented on NUTCH-1483: I tried with 1.x/trunk. For 2.x URLs with

[jira] [Updated] (NUTCH-1483) Can't crawl filesystem with protocol-file plugin

2012-10-31 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1483: --- Attachment: NUTCH-1483.patch StringUtils.split(String, char) does not preserve empty parts:

[jira] [Commented] (NUTCH-1483) Can't crawl filesystem with protocol-file plugin

2012-10-31 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13488254#comment-13488254 ] Sebastian Nagel commented on NUTCH-1483: Rogério, can you apply the patch,

[jira] [Created] (NUTCH-1484) TableUtil unreverseURL fails on file:// URLs

2012-11-01 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-1484: -- Summary: TableUtil unreverseURL fails on file:// URLs Key: NUTCH-1484 URL: https://issues.apache.org/jira/browse/NUTCH-1484 Project: Nutch Issue Type:

[jira] [Commented] (NUTCH-1483) Can't crawl filesystem with protocol-file plugin

2012-11-01 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13488558#comment-13488558 ] Sebastian Nagel commented on NUTCH-1483: Thanks! Issue with un-reversing URLs

[jira] [Comment Edited] (NUTCH-1483) Can't crawl filesystem with protocol-file plugin

2012-11-01 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13488558#comment-13488558 ] Sebastian Nagel edited comment on NUTCH-1483 at 11/1/12 8:55 AM:

[jira] [Created] (NUTCH-1485) TableUtil reverseURL to keep userinfo part

2012-11-01 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-1485: -- Summary: TableUtil reverseURL to keep userinfo part Key: NUTCH-1485 URL: https://issues.apache.org/jira/browse/NUTCH-1485 Project: Nutch Issue Type:

[jira] [Commented] (NUTCH-1461) Problem with TableUtil

2012-11-01 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13488585#comment-13488585 ] Sebastian Nagel commented on NUTCH-1461: Cf. NUTCH-1484: same error with file://

[jira] [Commented] (NUTCH-1245) URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again

2012-11-01 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13488935#comment-13488935 ] Sebastian Nagel commented on NUTCH-1245: They are not duplicates but the effects

[jira] [Created] (NUTCH-1488) bin/nutch to run junit from any directory

2012-11-01 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-1488: -- Summary: bin/nutch to run junit from any directory Key: NUTCH-1488 URL: https://issues.apache.org/jira/browse/NUTCH-1488 Project: Nutch Issue Type:

[jira] [Updated] (NUTCH-1488) bin/nutch to run junit from any directory

2012-11-01 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1488: --- Attachment: NUTCH-1488.patch bin/nutch to run junit from any directory

[jira] [Updated] (NUTCH-1484) TableUtil unreverseURL fails on file:// URLs

2012-11-11 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1484: --- Attachment: NUTCH-1484.patch Revised patch: replaced

[jira] [Comment Edited] (NUTCH-1484) TableUtil unreverseURL fails on file:// URLs

2012-11-11 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13494952#comment-13494952 ] Sebastian Nagel edited comment on NUTCH-1484 at 11/11/12 7:56 PM:

[jira] [Resolved] (NUTCH-1484) TableUtil unreverseURL fails on file:// URLs

2012-11-12 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-1484. Resolution: Fixed Committed to 2.x (rev. 1408465) TableUtil unreverseURL

[jira] [Updated] (NUTCH-1370) Expose exact number of urls injected @runtime

2012-11-12 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1370: --- Attachment: NUTCH-1370-1.x.patch Ferdy is right: custom counters are more transparent. Patch

[jira] [Updated] (NUTCH-1370) Expose exact number of urls injected @runtime

2012-11-13 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1370: --- Attachment: NUTCH-1370-2.x-v3.patch Hi Lewis, yes, the 1.x patch is not easily transferred

[jira] [Commented] (NUTCH-1499) Usage of multiple ipv4 addresses and network cards on fetcher machines

2012-11-26 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13504136#comment-13504136 ] Sebastian Nagel commented on NUTCH-1499: Short and precise patch. However, is

[jira] [Created] (NUTCH-1500) bin/crawl fails on step solrindex with wrong path to segment

2012-11-28 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-1500: -- Summary: bin/crawl fails on step solrindex with wrong path to segment Key: NUTCH-1500 URL: https://issues.apache.org/jira/browse/NUTCH-1500 Project: Nutch

[jira] [Updated] (NUTCH-1500) bin/crawl fails on step solrindex with wrong path to segment

2012-11-28 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1500: --- Attachment: NUTCH-1500.patch bin/crawl fails on step solrindex with wrong path to

[jira] [Commented] (NUTCH-1499) Usage of multiple ipv4 addresses and network cards on fetcher machines

2012-12-01 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13507944#comment-13507944 ] Sebastian Nagel commented on NUTCH-1499: Thanks! That's a plausible reason: (let's

[jira] [Updated] (NUTCH-1038) Port IndexingFiltersChecker to 2.0

2012-12-05 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1038: --- Attachment: NUTCH-1038.patch Port IndexingFiltersChecker to 2.0

[jira] [Created] (NUTCH-1501) Harmonize behavior of parsechecker and indexchecker

2012-12-05 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-1501: -- Summary: Harmonize behavior of parsechecker and indexchecker Key: NUTCH-1501 URL: https://issues.apache.org/jira/browse/NUTCH-1501 Project: Nutch Issue

[jira] [Commented] (NUTCH-1245) URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again

2012-12-06 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13525439#comment-13525439 ] Sebastian Nagel commented on NUTCH-1245: @kiran: yes, 2.x is affected since fetch

[jira] [Commented] (NUTCH-1503) Configuration properties not in sync between FetcherReducer and nutch-default.xml

2012-12-11 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13529497#comment-13529497 ] Sebastian Nagel commented on NUTCH-1503: Hi Lewis, both time limit properties are

[jira] [Updated] (NUTCH-1038) Port IndexingFiltersChecker to 2.0

2012-12-12 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1038: --- Attachment: NUTCH-1038v2.patch Hi Lewis, it's a problem of the patch: the fetch time of a

[jira] [Commented] (NUTCH-1514) Phase out the deprecated configuration properties (if possible)

2013-01-06 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13545480#comment-13545480 ] Sebastian Nagel commented on NUTCH-1514: +1 But do we need a reference to the

[jira] [Commented] (NUTCH-1499) Usage of multiple ipv4 addresses and network cards on fetcher machines

2013-01-12 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552028#comment-13552028 ] Sebastian Nagel commented on NUTCH-1499: So, a vote for won't fix. Comments?

[jira] [Resolved] (NUTCH-813) Repetitive crawl 403 status page

2013-01-12 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-813. --- Resolution: Duplicate The described problem is identical to that of NUTCH-578. The provided

[jira] [Commented] (NUTCH-1345) JAVA_HOME should not be required

2013-01-12 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552082#comment-13552082 ] Sebastian Nagel commented on NUTCH-1345: JAVA_HOME (or NUTCH_JAVA_HOME) is

[jira] [Commented] (NUTCH-1087) Deprecate crawl command and replace with example script

2013-01-15 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13554353#comment-13554353 ] Sebastian Nagel commented on NUTCH-1087: Hi Tristan, thanks for the patch! The

[jira] [Resolved] (NUTCH-1500) bin/crawl fails on step solrindex with wrong path to segment

2013-01-15 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-1500. Resolution: Fixed committed to trunk (rev. 1433658) bin/crawl fails on

[jira] [Commented] (NUTCH-1520) SegmentMerger looses records

2013-01-17 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13556093#comment-13556093 ] Sebastian Nagel commented on NUTCH-1520: Hi Markus, have a look at NUTCH-1113. An

[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2013-01-28 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13564274#comment-13564274 ] Sebastian Nagel commented on NUTCH-1465: Hi Tejas, thanks and a few comments on

[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2013-01-28 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13564768#comment-13564768 ] Sebastian Nagel commented on NUTCH-1465: Yes, SitemapInjector is a map-reduce

[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-01-28 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13564827#comment-13564827 ] Sebastian Nagel commented on NUTCH-1047: As some test for the interface started to

[jira] [Commented] (NUTCH-1535) Crawl crashes with java.io.exception

2013-02-22 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13584550#comment-13584550 ] Sebastian Nagel commented on NUTCH-1535: Presumably, this is caused by

[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons

2013-02-22 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13584824#comment-13584824 ] Sebastian Nagel commented on NUTCH-1031: Hi Tejas, a test of

[jira] [Resolved] (NUTCH-1535) Crawl crashes with java.io.exception

2013-02-23 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-1535. Resolution: Not A Problem Great! Crawl crashes with java.io.exception

[jira] [Closed] (NUTCH-1535) Crawl crashes with java.io.exception

2013-02-23 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel closed NUTCH-1535. -- Crawl crashes with java.io.exception

[jira] [Commented] (NUTCH-1537) Legacy metadata package needs to take advantage of Apache Tika metadata package more.

2013-03-01 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13591045#comment-13591045 ] Sebastian Nagel commented on NUTCH-1537: Removing stuff could be done in a few

[jira] [Commented] (NUTCH-1467) nutch 1.5.1 not able to parse mutliValued metatags

2013-03-01 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13591062#comment-13591062 ] Sebastian Nagel commented on NUTCH-1467: Hi Kiran, any updates regarding the unit

[jira] [Commented] (NUTCH-1467) nutch 1.5.1 not able to parse mutliValued metatags

2013-03-03 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13591773#comment-13591773 ] Sebastian Nagel commented on NUTCH-1467: Hi Kiran, my suggestion was only about

[jira] [Created] (NUTCH-1541) Indexer plugin to write CSV

2013-03-06 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-1541: -- Summary: Indexer plugin to write CSV Key: NUTCH-1541 URL: https://issues.apache.org/jira/browse/NUTCH-1541 Project: Nutch Issue Type: New Feature

[jira] [Updated] (NUTCH-1541) Indexer plugin to write CSV

2013-03-06 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1541: --- Attachment: NUTCH-1541-v1.patch First version. NOTE: NUTCH-1047 is required, the targets for

[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-03-06 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13595252#comment-13595252 ] Sebastian Nagel commented on NUTCH-1047: Hi Julien, in overall, all looks good. A

[jira] [Commented] (NUTCH-1541) Indexer plugin to write CSV

2013-03-06 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13595263#comment-13595263 ] Sebastian Nagel commented on NUTCH-1541: Yes, the fields dumped are configurable.
