[jira] [Updated] (NUTCH-1646) IndexerMapReduce to consider DB status

2014-03-03 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1646: --- Fix Version/s: (was: 1.9) 1.8 IndexerMapReduce to consider DB status

[jira] [Updated] (NUTCH-1706) IndexerMapReduce does not remove db_redir_temp etc

2014-03-03 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1706: --- Fix Version/s: (was: 1.9) 1.8 IndexerMapReduce does not remove

[jira] [Reopened] (NUTCH-1113) Merging segments causes URLs to vanish from crawldb/index?

2014-03-03 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel reopened NUTCH-1113: Tests fail also on Jenkins Merging segments causes URLs to vanish from crawldb/index?

[jira] [Commented] (NUTCH-1706) IndexerMapReduce does not remove db_redir_temp etc

2014-03-03 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13917958#comment-13917958 ] Sebastian Nagel commented on NUTCH-1706: Must be included into 1.8, will commit

[jira] [Updated] (NUTCH-1113) Merging segments causes URLs to vanish from crawldb/index?

2014-03-06 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1113: --- Attachment: NUTCH-1113-trunk-junit-fail.patch Fixed also second problem in junit test:

[jira] [Updated] (NUTCH-1113) Merging segments causes URLs to vanish from crawldb/index?

2014-03-06 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1113: --- Attachment: (was: NUTCH-1113-trunk-junit-fail.patch) Merging segments causes URLs to

[jira] [Commented] (NUTCH-1706) IndexerMapReduce does not remove db_redir_temp etc

2014-03-07 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13924148#comment-13924148 ] Sebastian Nagel commented on NUTCH-1706: Committed to trunk r1575351

[jira] [Resolved] (NUTCH-1706) IndexerMapReduce does not remove db_redir_temp etc

2014-03-07 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-1706. Resolution: Fixed IndexerMapReduce does not remove db_redir_temp etc

[jira] [Comment Edited] (NUTCH-1646) IndexerMapReduce to consider DB status

2014-03-07 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13924149#comment-13924149 ] Sebastian Nagel edited comment on NUTCH-1646 at 3/7/14 6:18 PM:

[jira] [Resolved] (NUTCH-1646) IndexerMapReduce to consider DB status

2014-03-07 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-1646. Resolution: Fixed fix with NUTCH-1706 IndexerMapReduce to consider DB status

[jira] [Created] (NUTCH-1733) parse-html to support HTML5 charset definitions

2014-03-11 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-1733: -- Summary: parse-html to support HTML5 charset definitions Key: NUTCH-1733 URL: https://issues.apache.org/jira/browse/NUTCH-1733 Project: Nutch Issue

[jira] [Updated] (NUTCH-1733) parse-html to support HTML5 charset definitions

2014-03-11 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1733: --- Attachment: charset_html5.html charset_bom_html5.html parse-html to support

[jira] [Commented] (NUTCH-1253) Incompatible neko and xerces versions

2014-03-11 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13930573#comment-13930573 ] Sebastian Nagel commented on NUTCH-1253: Also committed patch to trunk r1576422.

[jira] [Updated] (NUTCH-1733) parse-html to support HTML5 charset definitions

2014-03-13 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1733: --- Attachment: NUTCH-1733-trunk.patch patch for trunk including unit test parse-html to

[jira] [Commented] (NUTCH-1712) Use MultipleInputs in Injector to make it a single mapreduce job

2014-03-14 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13935823#comment-13935823 ] Sebastian Nagel commented on NUTCH-1712: Thanks. Looks good in general, +1 for the

[jira] [Updated] (NUTCH-685) Content-level redirect status lost in ParseSegment

2014-03-14 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-685: -- Fix Version/s: 1.9 Content-level redirect status lost in ParseSegment

[jira] [Updated] (NUTCH-1735) code dedup fetcher queue redirects

2014-03-14 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1735: --- Attachment: NUTCH-1735.patch code dedup fetcher queue redirects

[jira] [Created] (NUTCH-1735) code dedup fetcher queue redirects

2014-03-14 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-1735: -- Summary: code dedup fetcher queue redirects Key: NUTCH-1735 URL: https://issues.apache.org/jira/browse/NUTCH-1735 Project: Nutch Issue Type: Improvement

[jira] [Commented] (NUTCH-1645) Junit Test Case for Adaptive Fetch Schedule class

2014-03-15 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13936163#comment-13936163 ] Sebastian Nagel commented on NUTCH-1645: Committed to 2.x, r1577834. Port to trunk

[jira] [Updated] (NUTCH-1737) Upgrade to recent JUnit 4.x

2014-03-15 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1737: --- Attachment: NUTCH-1737-trivial.patch Upgrade to recent JUnit 4.x

[jira] [Created] (NUTCH-1737) Upgrade to recent JUnit 4.x

2014-03-15 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-1737: -- Summary: Upgrade to recent JUnit 4.x Key: NUTCH-1737 URL: https://issues.apache.org/jira/browse/NUTCH-1737 Project: Nutch Issue Type: Improvement

[jira] [Updated] (NUTCH-1645) Junit Test Case for Adaptive Fetch Schedule class

2014-03-15 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1645: --- Attachment: NUTCH-1645-trunk-v1.patch Patch for trunk, requires NUTCH-1737. Junit Test

[jira] [Commented] (NUTCH-1647) protocol-http throws unzipBestEffort returned null for some pages

2014-03-16 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13937257#comment-13937257 ] Sebastian Nagel commented on NUTCH-1647: Seems to be a duplicate of NUTCH-1736:

[jira] [Commented] (NUTCH-1736) Can't fetch page if http response header contains Transfer-Encoding：chunked

2014-03-16 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13937269#comment-13937269 ] Sebastian Nagel commented on NUTCH-1736: Thanks, [~yangshangchuan] for taking the

[jira] [Updated] (NUTCH-1671) indexchecker to add digest field

2014-03-17 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1671: --- Fix Version/s: 2.3 indexchecker to add digest field

[jira] [Resolved] (NUTCH-1671) indexchecker to add digest field

2014-03-17 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-1671. Resolution: Fixed Committed to trunk r1578616 and 2.x r1578620. indexchecker to add

[jira] [Comment Edited] (NUTCH-1739) ExecutorService field in ParseUtil.java not be right use and cause memory leak

2014-03-18 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13938930#comment-13938930 ] Sebastian Nagel edited comment on NUTCH-1739 at 3/18/14 7:43 AM:

[jira] [Commented] (NUTCH-1739) ExecutorService field in ParseUtil.java not be right use and cause memory leak

2014-03-18 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13938930#comment-13938930 ] Sebastian Nagel commented on NUTCH-1739: Thanks, [~yangshangchuan]. But isn't this

[jira] [Updated] (NUTCH-1733) parse-html to support HTML5 charset definitions

2014-03-19 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1733: --- Attachment: charset_bom_utf16_html5.html Hi [~jlafitte], yes that's true: in a hex-editor it

[jira] [Updated] (NUTCH-1605) mime type detector recognizes xlsx as zip file

2014-03-20 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1605: --- Attachment: NUTCH-1605-trunk-v2.patch Improved patch (also applies to 2.x): - simplified

[jira] [Commented] (NUTCH-1605) mime type detector recognizes xlsx as zip file

2014-03-21 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13943491#comment-13943491 ] Sebastian Nagel commented on NUTCH-1605: Patch also applies to 2.x mime type

[jira] [Updated] (NUTCH-1605) mime type detector recognizes xlsx as zip file

2014-03-21 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1605: --- Affects Version/s: 2.2.1 mime type detector recognizes xlsx as zip file

[jira] [Updated] (NUTCH-1605) mime type detector recognizes xlsx as zip file

2014-03-21 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1605: --- Fix Version/s: 1.9 2.3 mime type detector recognizes xlsx as zip file

[jira] [Updated] (NUTCH-1733) parse-html to support HTML5 charset definitions

2014-03-21 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1733: --- Attachment: NUTCH-1733-2.x.patch Committed to trunk r1580046. Attached patch for 2.x.

[jira] [Assigned] (NUTCH-1742) Please delete old releases from mirroring system

2014-03-22 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel reassigned NUTCH-1742: -- Assignee: Sebastian Nagel Please delete old releases from mirroring system

[jira] [Commented] (NUTCH-1742) Please delete old releases from mirroring system

2014-03-22 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13944167#comment-13944167 ] Sebastian Nagel commented on NUTCH-1742: Links and references on web site

[jira] [Resolved] (NUTCH-1742) Please delete old releases from mirroring system

2014-03-22 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-1742. Resolution: Fixed Updated web site ([Downloads|http://nutch.apache.org/downloads.html]).

[jira] [Commented] (NUTCH-1742) Please delete old releases from mirroring system

2014-03-22 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13944191#comment-13944191 ] Sebastian Nagel commented on NUTCH-1742: Added to release instructions in wiki

[jira] [Created] (NUTCH-1743) parsechecker to show outlinks

2014-03-26 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-1743: -- Summary: parsechecker to show outlinks Key: NUTCH-1743 URL: https://issues.apache.org/jira/browse/NUTCH-1743 Project: Nutch Issue Type: Improvement

[jira] [Updated] (NUTCH-1743) parsechecker to show outlinks

2014-03-26 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1743: --- Attachment: NUTCH-1743.patch parsechecker to show outlinks -

[jira] [Updated] (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a ?

2014-03-26 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-797: -- Attachment: test_nutch_797.html Tested using parsechecker (cf. NUTCH-1743) with attached sample

[jira] [Updated] (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a ?

2014-03-26 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-797: -- Attachment: NUTCH-797-2x.patch Patch for 2.x: - port URLUtil.resolveURL() from 1.x (including

[jira] [Commented] (NUTCH-1708) use same id when indexing and deleting redirects

2014-03-28 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13951566#comment-13951566 ] Sebastian Nagel commented on NUTCH-1708: Hi [~markus17], another way would be to

[jira] [Commented] (NUTCH-1321) IDNNormalizer

2014-03-28 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13951585#comment-13951585 ] Sebastian Nagel commented on NUTCH-1321: In BasicURLNormalizer URLs are already

[jira] [Commented] (NUTCH-1737) Upgrade to recent JUnit 4.x

2014-03-29 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13954446#comment-13954446 ] Sebastian Nagel commented on NUTCH-1737: Thanks a lot, [~lewismc], for taking on

[jira] [Resolved] (NUTCH-1645) Junit Test Case for Adaptive Fetch Schedule class

2014-03-30 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-1645. Resolution: Fixed Fix Version/s: 1.9 Committed to trunk r1583193. Thanks! Junit

[jira] [Commented] (NUTCH-1741) Support of Sitemaps in Nutch 2.x

2014-03-30 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13954836#comment-13954836 ] Sebastian Nagel commented on NUTCH-1741: Hi [~alparslan.avci], the plan for

[jira] [Resolved] (NUTCH-1735) code dedup fetcher queue redirects

2014-04-05 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-1735. Resolution: Fixed Committed to trunk r1585144. code dedup fetcher queue redirects

[jira] [Updated] (NUTCH-1182) fetcher should track and shut down hung threads

2014-04-05 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1182: --- Fix Version/s: 1.9 fetcher should track and shut down hung threads

[jira] [Updated] (NUTCH-1182) fetcher should track and shut down hung threads

2014-04-05 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1182: --- Attachment: NUTCH-1182-trunk-v1.patch From time to time this problem is reported by users

[jira] [Commented] (NUTCH-1747) Use AtomicInteger as semaphore in Fetcher

2014-04-05 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13961211#comment-13961211 ] Sebastian Nagel commented on NUTCH-1747: +1 Looks like inProgress was intended to

[jira] [Commented] (NUTCH-1615) Implementing A Feature for Fetching From Websites Dump

2014-04-06 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13961376#comment-13961376 ] Sebastian Nagel commented on NUTCH-1615: No question, reading an entire [Wikimedia

[jira] [Commented] (NUTCH-1750) Improvement of Fetcher's reportStatus

2014-04-09 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13963921#comment-13963921 ] Sebastian Nagel commented on NUTCH-1750: +1 Improvement of Fetcher's

[jira] [Created] (NUTCH-1752) cache robots.txt rules per protocol:host:port

2014-04-09 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-1752: -- Summary: cache robots.txt rules per protocol:host:port Key: NUTCH-1752 URL: https://issues.apache.org/jira/browse/NUTCH-1752 Project: Nutch Issue Type:

[jira] [Updated] (NUTCH-1752) cache robots.txt rules per protocol:host:port

2014-04-09 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1752: --- Attachment: NUTCH-1752-v1.patch Patch for trunk and 2.x cache robots.txt rules per

[jira] [Updated] (NUTCH-1748) urlfilter-validator to allow .. (two dots) inside file names (path elements)

2014-04-09 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1748: --- Summary: urlfilter-validator to allow .. (two dots) inside file names (path elements) (was:

[jira] [Commented] (NUTCH-1748) urlfilter-validator to allow .. (two dots) inside file names (path elements)

2014-04-09 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13964127#comment-13964127 ] Sebastian Nagel commented on NUTCH-1748: Hi [~alexmc], you're absolutely right: the

[jira] [Commented] (NUTCH-710) Support for rel=canonical attribute

2014-04-09 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13964202#comment-13964202 ] Sebastian Nagel commented on NUTCH-710: --- Thanks, [~Sertac Turkel]! My comments: *

[jira] [Commented] (NUTCH-1752) cache robots.txt rules per protocol:host:port

2014-04-09 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13964632#comment-13964632 ] Sebastian Nagel commented on NUTCH-1752: Yep: Apache httpd and Tomcat on same host

[jira] [Commented] (NUTCH-1751) Empty anchors should not index

2014-04-09 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13964672#comment-13964672 ] Sebastian Nagel commented on NUTCH-1751: +1 Trunk is not affected:

[jira] [Created] (NUTCH-1754) remove BOM from extracted plain text

2014-04-09 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-1754: -- Summary: remove BOM from extracted plain text Key: NUTCH-1754 URL: https://issues.apache.org/jira/browse/NUTCH-1754 Project: Nutch Issue Type: Bug

[jira] [Resolved] (NUTCH-1733) parse-html to support HTML5 charset definitions

2014-04-09 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-1733. Resolution: Fixed Committed to 2.x r1586162. Opened NUTCH-1754 to remove the leading BOM

[jira] [Closed] (NUTCH-1454) parsing chm failed

2014-04-13 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel closed NUTCH-1454. -- parsing chm failed -- Key: NUTCH-1454 URL:

[jira] [Resolved] (NUTCH-1454) parsing chm failed

2014-04-13 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-1454. Resolution: Fixed Hi [~tejasp], confirmed: fixed with Tika 1.5 and NUTCH-1729. parsing

[jira] [Commented] (NUTCH-1708) use same id when indexing and deleting redirects

2014-04-14 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13968762#comment-13968762 ] Sebastian Nagel commented on NUTCH-1708: ??need to get rid of the repr_url?? Not

[jira] [Commented] (NUTCH-1748) urlfilter-validator to allow .. (two dots) inside file names (path elements)

2014-04-14 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13968882#comment-13968882 ] Sebastian Nagel commented on NUTCH-1748: Hi [~Sertac Turkel], thanks, +1 for the

[jira] [Updated] (NUTCH-1566) bin/nutch to allow whitespace in paths

2014-04-16 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1566: --- Attachment: NUTCH-1566-2x.patch NUTCH-1566-v3-trunk.patch * patch updated

[jira] [Updated] (NUTCH-1308) Unnecessary truncate content configuration, and logging in parse-zip/ZipParser

2014-04-16 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1308: --- Attachment: NUTCH-1308-ZipParser-main-trunk.patch Hi [~lewismc], is this fixed with

[jira] [Commented] (NUTCH-1605) mime type detector recognizes xlsx as zip file

2014-04-16 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13972053#comment-13972053 ] Sebastian Nagel commented on NUTCH-1605: Changes to MIME magic may result in

[jira] [Commented] (NUTCH-1762) project web site's search (provided by lucid) is broken

2014-04-23 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13978810#comment-13978810 ] Sebastian Nagel commented on NUTCH-1762: +1 Thanks! If no objections: will commit

[jira] [Assigned] (NUTCH-1762) project web site's search (provided by lucid) is broken

2014-04-23 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel reassigned NUTCH-1762: -- Assignee: Sebastian Nagel project web site's search (provided by lucid) is broken

[jira] [Updated] (NUTCH-1762) project web site's search (provided by lucid) is broken

2014-04-23 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1762: --- Attachment: NUTCH-1762-v2.patch completed patch: * contains changes to generated HTML files

[jira] [Updated] (NUTCH-1762) project web site's search (provided by lucid) is broken

2014-04-24 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1762: --- Fix Version/s: (was: 1.9) (was: 2.3) project web site's search

[jira] [Resolved] (NUTCH-1762) project web site's search (provided by lucid) is broken

2014-04-24 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-1762. Resolution: Fixed committed r1589810. Thanks, [~iorixxx]! project web site's search

[jira] [Assigned] (NUTCH-1182) fetcher should track and shut down hung threads

2014-04-24 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel reassigned NUTCH-1182: -- Assignee: Sebastian Nagel fetcher should track and shut down hung threads

[jira] [Updated] (NUTCH-1182) fetcher should track and shut down hung threads

2014-04-24 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1182: --- Attachment: NUTCH-1182-2x.patch Patch for 2.x. fetcher should track and shut down hung

[jira] [Commented] (NUTCH-1182) fetcher to log hung threads

2014-04-24 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13980389#comment-13980389 ] Sebastian Nagel commented on NUTCH-1182: Changed title: shutting down hung threads

[jira] [Updated] (NUTCH-1182) fetcher to log hung threads

2014-04-24 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1182: --- Summary: fetcher to log hung threads (was: fetcher should track and shut down hung threads)

[jira] [Assigned] (NUTCH-1752) cache robots.txt rules per protocol:host:port

2014-04-25 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel reassigned NUTCH-1752: -- Assignee: Sebastian Nagel cache robots.txt rules per protocol:host:port

[jira] [Updated] (NUTCH-1752) cache robots.txt rules per protocol:host:port

2014-04-25 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1752: --- Attachment: NUTCH-1752-v2.patch Attached reviewed patch v2. Changed/fixed caching of robot

[jira] [Resolved] (NUTCH-566) Sun's URL class has bug in creation of relative query URLs

2014-04-26 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-566. --- Resolution: Fixed Was fixed by NUTCH-797 with version 1.4 (2.x will be patched soon), the

[jira] [Updated] (NUTCH-952) fix outlink which started with '?' in html parser

2014-04-26 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-952: -- Attachment: test_nutch_952.html Was fixed by NUTCH-797 for v 1.4 (2.x will follow soon).

[jira] [Resolved] (NUTCH-952) fix outlink which started with '?' in html parser

2014-04-26 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-952. --- Resolution: Fixed fix outlink which started with '?' in html parser

[jira] [Updated] (NUTCH-566) Sun's URL class has bug in creation of relative query URLs

2014-04-26 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-566: -- Fix Version/s: (was: 1.9) Sun's URL class has bug in creation of relative query URLs

[jira] [Updated] (NUTCH-952) fix outlink which started with '?' in html parser

2014-04-26 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-952: -- Fix Version/s: (was: 1.9) fix outlink which started with '?' in html parser

[jira] [Commented] (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a ?

2014-04-26 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13982116#comment-13982116 ] Sebastian Nagel commented on NUTCH-797: --- Hi [~jnioche], is there anything left

[jira] [Updated] (NUTCH-1764) readdb to show command-line help if no action (-stats, -dump, etc.) given

2014-04-26 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1764: --- Summary: readdb to show command-line help if no action (-stats, -dump, etc.) given (was:

[jira] [Resolved] (NUTCH-1764) readdb to show command-line help if no action (-stats, -dump, etc.) given

2014-04-26 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-1764. Resolution: Fixed Fix Version/s: (was: 1.8) +1 Thanks, [~diaa_abdallah]!

[jira] [Commented] (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a ?

2014-04-27 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13982271#comment-13982271 ] Sebastian Nagel commented on NUTCH-797: --- Ok, then I'll take over to patch 2.x and

[jira] [Updated] (NUTCH-797) URL not properly constructed when link target begins with a ?

2014-04-27 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-797: -- Summary: URL not properly constructed when link target begins with a ? (was: parse-tika is not

[jira] [Commented] (NUTCH-797) URL not properly constructed when link target begins with a ?

2014-04-27 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13982272#comment-13982272 ] Sebastian Nagel commented on NUTCH-797: --- Changed title: it's also a problem of

[jira] [Created] (NUTCH-1767) remove special treatment of params in relative links

2014-04-27 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-1767: -- Summary: remove special treatment of params in relative links Key: NUTCH-1767 URL: https://issues.apache.org/jira/browse/NUTCH-1767 Project: Nutch Issue

[jira] [Updated] (NUTCH-1767) remove special treatment of params in relative links

2014-04-27 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1767: --- Attachment: test_nutch_1767-2.html test_nutch_1767-1.html Test documents.

[jira] [Updated] (NUTCH-797) URL not properly constructed when link target begins with a ?

2014-04-27 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-797: -- Attachment: NUTCH-797-2x-v2.patch Simplified patch for 2.x, without changes to

[jira] [Updated] (NUTCH-797) URL not properly constructed when link target begins with a ?

2014-04-28 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-797: -- Fix Version/s: (was: 1.9) 2.3 URL not properly constructed when link

[jira] [Resolved] (NUTCH-797) URL not properly constructed when link target begins with a ?

2014-04-28 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-797. --- Resolution: Fixed Committed to 2.x, r1590796. URL not properly constructed when link target

[jira] [Assigned] (NUTCH-797) URL not properly constructed when link target begins with a ?

2014-04-28 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel reassigned NUTCH-797: - Assignee: Sebastian Nagel (was: Julien Nioche) URL not properly constructed when link

[jira] [Updated] (NUTCH-1767) remove special treatment of params in relative links

2014-04-29 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1767: --- Attachment: NUTCH-1767-2x.patch NUTCH-1767-1x.patch patches for trunk and

[jira] [Updated] (NUTCH-1767) remove special treatment of params in relative links

2014-04-29 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1767: --- Patch Info: Patch Available remove special treatment of params in relative links

[jira] [Commented] (NUTCH-207) Bandwidth target for fetcher rather than a thread count

2014-05-03 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988655#comment-13988655 ] Sebastian Nagel commented on NUTCH-207: --- Looks good. - there are some
