[jira] Created: (NUTCH-835) document deduplication (exact duplicates) failed using MD5Signature

2010-06-25 Thread Sebastian Nagel (JIRA)
document deduplication (exact duplicates) failed using MD5Signature --- Key: NUTCH-835 URL: https://issues.apache.org/jira/browse/NUTCH-835 Project: Nutch Issue Type: Bug Af

[jira] Created: (NUTCH-862) HttpClient null pointer exception

2010-07-27 Thread Sebastian Nagel (JIRA)
HttpClient null pointer exception - Key: NUTCH-862 URL: https://issues.apache.org/jira/browse/NUTCH-862 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 1.0.0 Environ

[jira] Updated: (NUTCH-862) HttpClient null pointer exception

2010-07-27 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-862: -- Attachment: NUTCH-862.patch patch > HttpClient null pointer exception > ---

[jira] Commented: (NUTCH-933) Fetcher does not save a pages Last-Modified value in CrawlDatum

2010-11-10 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930588#action_12930588 ] Sebastian Nagel commented on NUTCH-933: --- The modifiedTime stored in a CrawlDatum recor

[jira] Created: (NUTCH-962) max. redirects not handled correctly: fetcher stops at max-1 redirects

2011-01-26 Thread Sebastian Nagel (JIRA)
max. redirects not handled correctly: fetcher stops at max-1 redirects -- Key: NUTCH-962 URL: https://issues.apache.org/jira/browse/NUTCH-962 Project: Nutch Issue Type: Bug

[jira] Updated: (NUTCH-962) max. redirects not handled correctly: fetcher stops at max-1 redirects

2011-01-26 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-962: -- Attachment: Fetcher_redir.patch patch for 1.3 to respect count of redirects literally: http.red

[jira] [Created] (NUTCH-1344) BasicURLNormalizer to normalize https same as http

2012-04-21 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-1344: -- Summary: BasicURLNormalizer to normalize https same as http Key: NUTCH-1344 URL: https://issues.apache.org/jira/browse/NUTCH-1344 Project: Nutch Issue T

[jira] [Updated] (NUTCH-1344) BasicURLNormalizer to normalize https same as http

2012-04-21 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1344: --- Attachment: NUTCH-1344.patch > BasicURLNormalizer to normalize https same as http >

[jira] [Commented] (NUTCH-1339) Default URL normalization rules to remove page anchors completely

2012-04-21 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13258827#comment-13258827 ] Sebastian Nagel commented on NUTCH-1339: BasicURLNormalizer does not remove the an

[jira] [Commented] (NUTCH-1293) IndexingFiltersChecker to store detected content type in crawldatum metadata

2012-04-26 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13263124#comment-13263124 ] Sebastian Nagel commented on NUTCH-1293: The content type should be added to metad

[jira] [Commented] (NUTCH-1323) AjaxNormalizer

2012-05-12 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13273954#comment-13273954 ] Sebastian Nagel commented on NUTCH-1323: After a small test crawl on http://si.dra

[jira] [Created] (NUTCH-1383) IndexingFiltersChecker to show error message instead of null pointer exception

2012-06-09 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-1383: -- Summary: IndexingFiltersChecker to show error message instead of null pointer exception Key: NUTCH-1383 URL: https://issues.apache.org/jira/browse/NUTCH-1383 Proj

[jira] [Updated] (NUTCH-1383) IndexingFiltersChecker to show error message instead of null pointer exception

2012-06-09 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1383: --- Attachment: NUTCH-1383.patch patch for both null pointer exceptions > Indexi

[jira] [Created] (NUTCH-1389) parsechecker and indexchecker to report truncated content

2012-06-12 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-1389: -- Summary: parsechecker and indexchecker to report truncated content Key: NUTCH-1389 URL: https://issues.apache.org/jira/browse/NUTCH-1389 Project: Nutch I

[jira] [Created] (NUTCH-1415) release packages to contain top level folder apache-nutch-x.x

2012-06-30 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-1415: -- Summary: release packages to contain top level folder apache-nutch-x.x Key: NUTCH-1415 URL: https://issues.apache.org/jira/browse/NUTCH-1415 Project: Nutch

[jira] [Updated] (NUTCH-1415) release packages to contain top level folder apache-nutch-x.x

2012-06-30 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1415: --- Attachment: NUTCH-1415.patch Fix ant targets tar-src, tar-bin, zip-src, zip-bin Also set appr

[jira] [Updated] (NUTCH-1415) release packages to contain top level folder apache-nutch-x.x

2012-07-03 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1415: --- Attachment: NUTCH-1415-2.patch Hi Lewis, you are completely right: the tarfileset / zipfilese

[jira] [Created] (NUTCH-1419) parsechecker and indexchecker to report protocol status

2012-07-03 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-1419: -- Summary: parsechecker and indexchecker to report protocol status Key: NUTCH-1419 URL: https://issues.apache.org/jira/browse/NUTCH-1419 Project: Nutch Iss

[jira] [Updated] (NUTCH-1419) parsechecker and indexchecker to report protocol status

2012-07-03 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1419: --- Attachment: NUTCH-1419-1.patch Simple patch: in case of a protocol status other than 200 (suc

[jira] [Created] (NUTCH-1421) RegexURLNormalizer to only skip rules with invalid patterns

2012-07-05 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-1421: -- Summary: RegexURLNormalizer to only skip rules with invalid patterns Key: NUTCH-1421 URL: https://issues.apache.org/jira/browse/NUTCH-1421 Project: Nutch

[jira] [Updated] (NUTCH-1421) RegexURLNormalizer to only skip rules with invalid patterns

2012-07-05 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1421: --- Attachment: NUTCH-1421-1.patch > RegexURLNormalizer to only skip rules with invalid patte

[jira] [Created] (NUTCH-1422) reset signature for redirects

2012-07-06 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-1422: -- Summary: reset signature for redirects Key: NUTCH-1422 URL: https://issues.apache.org/jira/browse/NUTCH-1422 Project: Nutch Issue Type: Bug Com

[jira] [Updated] (NUTCH-1422) reset signature for redirects

2012-07-06 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1422: --- Attachment: NUTCH-1422_redir_notmodified_log.txt > reset signature for redirects > --

[jira] [Commented] (NUTCH-1328) a problem with regex-normalize.xml

2012-07-10 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13410905#comment-13410905 ] Sebastian Nagel commented on NUTCH-1328: Duplicate of NUTCH-706 >

[jira] [Updated] (NUTCH-706) Url regex normalizer

2012-07-10 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-706: -- Attachment: NUTCH-706.patch - fix the pattern by adding an anchor prohibiting inner-word matches

[jira] [Created] (NUTCH-1436) bin/nutch absent in zip package

2012-07-23 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-1436: -- Summary: bin/nutch absent in zip package Key: NUTCH-1436 URL: https://issues.apache.org/jira/browse/NUTCH-1436 Project: Nutch Issue Type: Bug C

[jira] [Updated] (NUTCH-1436) bin/nutch absent in zip package

2012-07-23 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1436: --- Attachment: NUTCH-1436.patch Patch for branch-1.5.1 (if a new bin package is desired). For tr

[jira] [Updated] (NUTCH-706) Url regex normalizer

2012-08-08 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-706: -- Attachment: NUTCH-706-2.patch Second trial for patch. The first one does not remove: {code} ?_se

[jira] [Created] (NUTCH-1454) parsing chm failed

2012-08-14 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-1454: -- Summary: parsing chm failed Key: NUTCH-1454 URL: https://issues.apache.org/jira/browse/NUTCH-1454 Project: Nutch Issue Type: Bug Components: pa

[jira] [Created] (NUTCH-1455) RobotRulesParser to match multi-word user-agent names

2012-08-14 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-1455: -- Summary: RobotRulesParser to match multi-word user-agent names Key: NUTCH-1455 URL: https://issues.apache.org/jira/browse/NUTCH-1455 Project: Nutch Issue

[jira] [Commented] (NUTCH-1467) nutch 1.5.1 not able to parse mutliValued metatags

2012-09-12 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13454282#comment-13454282 ] Sebastian Nagel commented on NUTCH-1467: Since nutch.metadata.Metadata, NutchField

[jira] [Assigned] (NUTCH-1415) release packages to contain top level folder apache-nutch-x.x

2012-09-18 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel reassigned NUTCH-1415: -- Assignee: Sebastian Nagel > release packages to contain top level folder apache-nut

[jira] [Commented] (NUTCH-1415) release packages to contain top level folder apache-nutch-x.x

2012-09-18 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13457753#comment-13457753 ] Sebastian Nagel commented on NUTCH-1415: This has been fixed only for 1.5.1 and 2.

[jira] [Resolved] (NUTCH-1415) release packages to contain top level folder apache-nutch-x.x

2012-09-18 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-1415. Resolution: Fixed Fix Version/s: 2.1 1.6 committed to trunk (revi

[jira] [Commented] (NUTCH-706) Url regex normalizer

2012-10-02 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13467990#comment-13467990 ] Sebastian Nagel commented on NUTCH-706: --- Are there objections to apply and commit the

[jira] [Created] (NUTCH-1476) SegmentReader getStats should set parsed = -1 if no parsing took place

2012-10-08 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-1476: -- Summary: SegmentReader getStats should set parsed = -1 if no parsing took place Key: NUTCH-1476 URL: https://issues.apache.org/jira/browse/NUTCH-1476 Project: Nut

[jira] [Updated] (NUTCH-1476) SegmentReader getStats should set parsed = -1 if no parsing took place

2012-10-08 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1476: --- Attachment: NUTCH-1476.patch > SegmentReader getStats should set parsed = -1 if no parsin

[jira] [Assigned] (NUTCH-1252) SegmentReader -get shows wrong data

2012-10-08 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel reassigned NUTCH-1252: -- Assignee: Sebastian Nagel > SegmentReader -get shows wrong data > -

[jira] [Commented] (NUTCH-1344) BasicURLNormalizer to normalize https same as http

2012-10-08 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13471915#comment-13471915 ] Sebastian Nagel commented on NUTCH-1344: Is there any reason why https should be t

[jira] [Updated] (NUTCH-706) Url regex normalizer: default pattern for session id removal not to match "newsId"

2012-10-10 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-706: -- Fix Version/s: 2.2 Summary: Url regex normalizer: default pattern for session id remova

[jira] [Resolved] (NUTCH-706) Url regex normalizer: default pattern for session id removal not to match "newsId"

2012-10-10 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-706. --- Resolution: Fixed committed to trunk (revision 1396796) and 2.x (revision 1396795)

[jira] [Resolved] (NUTCH-1344) BasicURLNormalizer to normalize https same as http

2012-10-10 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-1344. Resolution: Fixed Fix Version/s: 2.2 1.6 committed to trunk (revi

[jira] [Commented] (NUTCH-706) Url regex normalizer: default pattern for session id removal not to match "newsId"

2012-10-10 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13473599#comment-13473599 ] Sebastian Nagel commented on NUTCH-706: --- First commit erroneously with wrong patch. C

[jira] [Commented] (NUTCH-1475) Nutch 2.1 Index-More Plugin -- A better fall back value for date field

2012-10-11 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13474460#comment-13474460 ] Sebastian Nagel commented on NUTCH-1475: Indeed, a modified time in the future is

[jira] [Resolved] (NUTCH-1252) SegmentReader -get shows wrong data

2012-10-11 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-1252. Resolution: Fixed committed to trunk (revision 1397281) > SegmentReader -g

[jira] [Resolved] (NUTCH-1476) SegmentReader getStats should set parsed = -1 if no parsing took place

2012-10-11 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-1476. Resolution: Fixed committed to trunk (revision 1397298) > SegmentReader ge

[jira] [Resolved] (NUTCH-1383) IndexingFiltersChecker to show error message instead of null pointer exception

2012-10-11 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-1383. Resolution: Fixed committed to trunk (revision 1397308) > IndexingFiltersC

[jira] [Commented] (NUTCH-1467) nutch 1.5.1 not able to parse mutliValued metatags

2012-10-23 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13482644#comment-13482644 ] Sebastian Nagel commented on NUTCH-1467: Hi Kiran, thanks for the patch. After a l

[jira] [Updated] (NUTCH-1467) nutch 1.5.1 not able to parse mutliValued metatags

2012-10-23 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1467: --- Attachment: NUTCH-1467-TEST-1.patch > nutch 1.5.1 not able to parse mutliValued metatags

[jira] [Resolved] (NUTCH-1421) RegexURLNormalizer to only skip rules with invalid patterns

2012-10-23 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-1421. Resolution: Fixed Fix Version/s: 2.2 1.6 committed to trunk (rev.

[jira] [Updated] (NUTCH-1245) URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again

2012-10-29 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1245: --- Attachment: NUTCH-1245-578-TEST-1.patch JUnit test to catch this problem and NUTCH-578: a lar

[jira] [Commented] (NUTCH-1482) Rename HTMLParseFilter

2012-10-29 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13486144#comment-13486144 ] Sebastian Nagel commented on NUTCH-1482: +1 > Rename HTMLParseFil

[jira] [Updated] (NUTCH-1245) URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again

2012-10-29 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1245: --- Attachment: NUTCH-1245-1.patch FetchSchedule.setPageGoneSchedule is called exclusively for a

[jira] [Commented] (NUTCH-1482) Rename HTMLParseFilter

2012-10-29 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13486290#comment-13486290 ] Sebastian Nagel commented on NUTCH-1482: Markus, you are right: I remember the API

[jira] [Updated] (NUTCH-1245) URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again

2012-10-29 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1245: --- Attachment: NUTCH-1245-2.patch NUTCH-1245-578-TEST-2.patch Improved patches

[jira] [Commented] (NUTCH-578) URL fetched with 403 is generated over and over again

2012-10-29 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13486484#comment-13486484 ] Sebastian Nagel commented on NUTCH-578: --- NUTCH-1245 provides a test to catch this pro

[jira] [Updated] (NUTCH-578) URL fetched with 403 is generated over and over again

2012-10-29 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-578: -- Attachment: NUTCH-578_v5.patch > URL fetched with 403 is generated over and over again > ---

[jira] [Commented] (NUTCH-1370) Expose exact number of urls injected @runtime

2012-10-30 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13487316#comment-13487316 ] Sebastian Nagel commented on NUTCH-1370: +1 Would be nice to see also the number o

[jira] [Commented] (NUTCH-578) URL fetched with 403 is generated over and over again

2012-10-30 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13487318#comment-13487318 ] Sebastian Nagel commented on NUTCH-578: --- Resetting the retry counter in setPageGoneSc

[jira] [Commented] (NUTCH-1483) Can't crawl filesystem with protocol-file plugin

2012-10-31 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488146#comment-13488146 ] Sebastian Nagel commented on NUTCH-1483: Confirmed. The problem is caused by the r

[jira] [Updated] (NUTCH-1483) Can't crawl filesystem with protocol-file plugin

2012-10-31 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1483: --- Affects Version/s: 1.6 > Can't crawl filesystem with protocol-file plugin > -

[jira] [Commented] (NUTCH-1483) Can't crawl filesystem with protocol-file plugin

2012-10-31 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488200#comment-13488200 ] Sebastian Nagel commented on NUTCH-1483: I tried with 1.x/trunk. For 2.x URLs with

[jira] [Updated] (NUTCH-1483) Can't crawl filesystem with protocol-file plugin

2012-10-31 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1483: --- Attachment: NUTCH-1483.patch StringUtils.split(String, char) does not preserve empty parts: h

[jira] [Commented] (NUTCH-1483) Can't crawl filesystem with protocol-file plugin

2012-10-31 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488254#comment-13488254 ] Sebastian Nagel commented on NUTCH-1483: Rogério, can you apply the patch, re-comp

[jira] [Created] (NUTCH-1484) TableUtil unreverseURL fails on file:// URLs

2012-11-01 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-1484: -- Summary: TableUtil unreverseURL fails on file:// URLs Key: NUTCH-1484 URL: https://issues.apache.org/jira/browse/NUTCH-1484 Project: Nutch Issue Type: Bu

[jira] [Commented] (NUTCH-1483) Can't crawl filesystem with protocol-file plugin

2012-11-01 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488558#comment-13488558 ] Sebastian Nagel commented on NUTCH-1483: Thanks! Issue with un-reversing URLs pull

[jira] [Comment Edited] (NUTCH-1483) Can't crawl filesystem with protocol-file plugin

2012-11-01 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488558#comment-13488558 ] Sebastian Nagel edited comment on NUTCH-1483 at 11/1/12 8:55 AM: ---

[jira] [Created] (NUTCH-1485) TableUtil reverseURL to keep userinfo part

2012-11-01 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-1485: -- Summary: TableUtil reverseURL to keep userinfo part Key: NUTCH-1485 URL: https://issues.apache.org/jira/browse/NUTCH-1485 Project: Nutch Issue Type: Impr

[jira] [Commented] (NUTCH-1461) Problem with TableUtil

2012-11-01 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488585#comment-13488585 ] Sebastian Nagel commented on NUTCH-1461: Cf. NUTCH-1484: same error with file:// U

[jira] [Commented] (NUTCH-1245) URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again

2012-11-01 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488935#comment-13488935 ] Sebastian Nagel commented on NUTCH-1245: They are not duplicates but the effects a

[jira] [Created] (NUTCH-1488) bin/nutch to run junit from any directory

2012-11-01 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-1488: -- Summary: bin/nutch to run junit from any directory Key: NUTCH-1488 URL: https://issues.apache.org/jira/browse/NUTCH-1488 Project: Nutch Issue Type: Impro

[jira] [Updated] (NUTCH-1488) bin/nutch to run junit from any directory

2012-11-01 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1488: --- Attachment: NUTCH-1488.patch > bin/nutch to run junit from any directory > --

[jira] [Commented] (NUTCH-1496) ParserJob logs skipped urls with level info

2012-11-11 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13494950#comment-13494950 ] Sebastian Nagel commented on NUTCH-1496: +1 > ParserJob logs skip

[jira] [Updated] (NUTCH-1484) TableUtil unreverseURL fails on file:// URLs

2012-11-11 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1484: --- Attachment: NUTCH-1484.patch Revised patch: replaced StringUtils.splitByWholeSeparatorPreserv

[jira] [Comment Edited] (NUTCH-1484) TableUtil unreverseURL fails on file:// URLs

2012-11-11 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13494952#comment-13494952 ] Sebastian Nagel edited comment on NUTCH-1484 at 11/11/12 7:56 PM: --

[jira] [Resolved] (NUTCH-1484) TableUtil unreverseURL fails on file:// URLs

2012-11-12 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-1484. Resolution: Fixed Committed to 2.x (rev. 1408465) > TableUtil unreverseURL

[jira] [Updated] (NUTCH-1370) Expose exact number of urls injected @runtime

2012-11-12 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1370: --- Attachment: NUTCH-1370-1.x.patch Ferdy is right: custom counters are more transparent. Patch

[jira] [Updated] (NUTCH-1370) Expose exact number of urls injected @runtime

2012-11-13 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1370: --- Attachment: NUTCH-1370-2.x-v3.patch Hi Lewis, yes, the 1.x patch is not easily transferred fo

[jira] [Commented] (NUTCH-1499) Usage of multiple ipv4 addresses and network cards on fetcher machines

2012-11-26 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13504136#comment-13504136 ] Sebastian Nagel commented on NUTCH-1499: Short and precise patch. However, is ther

[jira] [Created] (NUTCH-1500) bin/crawl fails on step solrindex with wrong path to segment

2012-11-28 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-1500: -- Summary: bin/crawl fails on step solrindex with wrong path to segment Key: NUTCH-1500 URL: https://issues.apache.org/jira/browse/NUTCH-1500 Project: Nutch

[jira] [Updated] (NUTCH-1500) bin/crawl fails on step solrindex with wrong path to segment

2012-11-28 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1500: --- Attachment: NUTCH-1500.patch > bin/crawl fails on step solrindex with wrong path to segme

[jira] [Commented] (NUTCH-1499) Usage of multiple ipv4 addresses and network cards on fetcher machines

2012-12-01 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13507944#comment-13507944 ] Sebastian Nagel commented on NUTCH-1499: Thanks! That's a plausible reason: (let's

[jira] [Updated] (NUTCH-1038) Port IndexingFiltersChecker to 2.0

2012-12-05 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1038: --- Patch Info: Patch Available > Port IndexingFiltersChecker to 2.0 > --

[jira] [Updated] (NUTCH-1038) Port IndexingFiltersChecker to 2.0

2012-12-05 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1038: --- Attachment: NUTCH-1038.patch > Port IndexingFiltersChecker to 2.0 > -

[jira] [Created] (NUTCH-1501) Harmonize behavior of parsechecker and indexchecker

2012-12-05 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-1501: -- Summary: Harmonize behavior of parsechecker and indexchecker Key: NUTCH-1501 URL: https://issues.apache.org/jira/browse/NUTCH-1501 Project: Nutch Issue T

[jira] [Created] (NUTCH-1502) Test for CrawlDatum state transitions

2012-12-06 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-1502: -- Summary: Test for CrawlDatum state transitions Key: NUTCH-1502 URL: https://issues.apache.org/jira/browse/NUTCH-1502 Project: Nutch Issue Type: Improveme

[jira] [Commented] (NUTCH-1245) URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again

2012-12-06 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13525439#comment-13525439 ] Sebastian Nagel commented on NUTCH-1245: @kiran: yes, 2.x is affected since fetch

[jira] [Commented] (NUTCH-1503) Configuration properties not in sync between FetcherReducer and nutch-default.xml

2012-12-11 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13529497#comment-13529497 ] Sebastian Nagel commented on NUTCH-1503: Hi Lewis, both time limit properties are

[jira] [Updated] (NUTCH-1038) Port IndexingFiltersChecker to 2.0

2012-12-12 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1038: --- Attachment: NUTCH-1038v2.patch Hi Lewis, it's a problem of the patch: the fetch time of a Web

[jira] [Commented] (NUTCH-1514) Phase out the deprecated configuration properties (if possible)

2013-01-06 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13545480#comment-13545480 ] Sebastian Nagel commented on NUTCH-1514: +1 But do we need a reference to the remo

[jira] [Commented] (NUTCH-1499) Usage of multiple ipv4 addresses and network cards on fetcher machines

2013-01-12 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13552028#comment-13552028 ] Sebastian Nagel commented on NUTCH-1499: So, a vote for "won't fix". Comments?

[jira] [Resolved] (NUTCH-813) Repetitive crawl 403 status page

2013-01-12 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-813. --- Resolution: Duplicate The described problem is identical to that of NUTCH-578. The provided pa

[jira] [Commented] (NUTCH-1345) JAVA_HOME should not be required

2013-01-12 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13552082#comment-13552082 ] Sebastian Nagel commented on NUTCH-1345: JAVA_HOME (or NUTCH_JAVA_HOME) is current

[jira] [Commented] (NUTCH-1087) Deprecate crawl command and replace with example script

2013-01-15 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13554353#comment-13554353 ] Sebastian Nagel commented on NUTCH-1087: Hi Tristan, thanks for the patch! The seg

[jira] [Resolved] (NUTCH-1500) bin/crawl fails on step solrindex with wrong path to segment

2013-01-15 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-1500. Resolution: Fixed committed to trunk (rev. 1433658) > bin/crawl fails on s

[jira] [Commented] (NUTCH-1087) Deprecate crawl command and replace with example script

2013-01-15 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13554381#comment-13554381 ] Sebastian Nagel commented on NUTCH-1087: yes, of course, but currently there is al

[jira] [Commented] (NUTCH-1520) SegmentMerger looses records

2013-01-17 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13556093#comment-13556093 ] Sebastian Nagel commented on NUTCH-1520: Hi Markus, have a look at NUTCH-1113. An

[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2013-01-28 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564274#comment-13564274 ] Sebastian Nagel commented on NUTCH-1465: Hi Tejas, thanks and a few comments on th

[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2013-01-28 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564768#comment-13564768 ] Sebastian Nagel commented on NUTCH-1465: Yes, SitemapInjector is a map-reduce jo

[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-01-28 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564827#comment-13564827 ] Sebastian Nagel commented on NUTCH-1047: As some test for the interface started to
