[jira] [Commented] (NUTCH-1872) enables control over how injected metadata is propagated

2014-10-12 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14168776#comment-14168776 ] Sebastian Nagel commented on NUTCH-1872: Thanks, [~jcoopere]! Nice patch! A

[jira] [Commented] (NUTCH-1872) enables control over how injected metadata is propagated

2014-10-13 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14169195#comment-14169195 ] Sebastian Nagel commented on NUTCH-1872: Hi Jonathan, you are right. Sorry :) For

[jira] [Commented] (NUTCH-1872) enables control over how injected metadata is propagated

2014-10-14 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14171632#comment-14171632 ] Sebastian Nagel commented on NUTCH-1872: The way the injected URL is set for the

[jira] [Commented] (NUTCH-1870) Generic xsl parser plugin

2014-10-16 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174212#comment-14174212 ] Sebastian Nagel commented on NUTCH-1870: 1. I'll regenerate one patch soon. If

[jira] [Commented] (NUTCH-1877) Suffix URL filter doesn't ignore query strings

2014-10-17 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174819#comment-14174819 ] Sebastian Nagel commented on NUTCH-1877: From conf/suffix-urlfilter.txt.template:

[jira] [Comment Edited] (NUTCH-1483) Can't crawl filesystem with protocol-file plugin

2014-10-18 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488558#comment-13488558 ] Sebastian Nagel edited comment on NUTCH-1483 at 10/18/14 7:24 PM:

[jira] [Commented] (NUTCH-1483) Can't crawl filesystem with protocol-file plugin

2014-10-18 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14176111#comment-14176111 ] Sebastian Nagel commented on NUTCH-1483: You'll get it working if (1) the above

[jira] [Updated] (NUTCH-1483) Can't crawl filesystem with protocol-file plugin

2014-10-18 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1483: --- Fix Version/s: 2.3 Can't crawl filesystem with protocol-file plugin

[jira] [Updated] (NUTCH-1483) Can't crawl filesystem with protocol-file plugin

2014-10-18 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1483: --- Attachment: (was: NUTCH-1483.patch) Can't crawl filesystem with protocol-file plugin

[jira] [Created] (NUTCH-1878) urlnormalizer-regex to keep third slash in file:///path/index.html

2014-10-18 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-1878: -- Summary: urlnormalizer-regex to keep third slash in file:///path/index.html Key: NUTCH-1878 URL: https://issues.apache.org/jira/browse/NUTCH-1878 Project: Nutch

[jira] [Updated] (NUTCH-1878) urlnormalizer-regex to keep third slash in file:///path/index.html

2014-10-18 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1878: --- Attachment: NUTCH-1878-v1.patch Patch which adds additional negative context/look-behind to

[jira] [Comment Edited] (NUTCH-1483) Can't crawl filesystem with protocol-file plugin

2014-10-18 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14176111#comment-14176111 ] Sebastian Nagel edited comment on NUTCH-1483 at 10/18/14 9:56 PM:

[jira] [Commented] (NUTCH-1483) Can't crawl filesystem with protocol-file plugin

2014-10-18 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14176160#comment-14176160 ] Sebastian Nagel commented on NUTCH-1483: The reason for (2) is best explained with

[jira] [Commented] (NUTCH-1483) Can't crawl filesystem with protocol-file plugin

2014-10-18 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14176172#comment-14176172 ] Sebastian Nagel commented on NUTCH-1483: But URI.toString(),

[jira] [Created] (NUTCH-1879) Regex URL normalizer should remove multiple slashes after file: protocol

2014-10-19 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-1879: -- Summary: Regex URL normalizer should remove multiple slashes after file: protocol Key: NUTCH-1879 URL: https://issues.apache.org/jira/browse/NUTCH-1879 Project:

[jira] [Updated] (NUTCH-1879) Regex URL normalizer should remove multiple slashes after file: protocol

2014-10-19 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1879: --- Attachment: NUTCH-1879-v1.patch Patch which adds regex rule and tests. Regex URL normalizer

[jira] [Created] (NUTCH-1880) URLUtil should not add additional slashes for file URLs

2014-10-19 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-1880: -- Summary: URLUtil should not add additional slashes for file URLs Key: NUTCH-1880 URL: https://issues.apache.org/jira/browse/NUTCH-1880 Project: Nutch

[jira] [Updated] (NUTCH-1880) URLUtil should not add additional slashes for file URLs

2014-10-19 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1880: --- Attachment: NUTCH-1880-2x-v1.patch NUTCH-1880-trunk-v1.patch URLUtil should

[jira] [Created] (NUTCH-1881) ant target resolve-default to keep test libs

2014-10-19 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-1881: -- Summary: ant target resolve-default to keep test libs Key: NUTCH-1881 URL: https://issues.apache.org/jira/browse/NUTCH-1881 Project: Nutch Issue Type:

[jira] [Updated] (NUTCH-1881) ant target resolve-default to keep test libs

2014-10-19 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1881: --- Attachment: NUTCH-1881-v1.patch Patch which splits clean-lib into clean-default-lib and

[jira] [Created] (NUTCH-1882) ant eclipse target to add output path to src/test

2014-10-19 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-1882: -- Summary: ant eclipse target to add output path to src/test Key: NUTCH-1882 URL: https://issues.apache.org/jira/browse/NUTCH-1882 Project: Nutch Issue

[jira] [Updated] (NUTCH-1882) ant eclipse target to add output path to src/test

2014-10-19 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1882: --- Fix Version/s: 2.3 ant eclipse target to add output path to src/test

[jira] [Updated] (NUTCH-1882) ant eclipse target to add output path to src/test

2014-10-19 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1882: --- Attachment: NUTCH-1882-trunk-v1.patch NUTCH-1882-2x-v1.patch ant eclipse

[jira] [Resolved] (NUTCH-1827) Port NUTCH-1467 and NUTCH-1561 to 2.x

2014-10-20 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-1827. Resolution: Fixed Committed to 2.x, r1633222. Thanks, [~lewismc]! Port NUTCH-1467 and

[jira] [Resolved] (NUTCH-1882) ant eclipse target to add output path to src/test

2014-10-21 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-1882. Resolution: Fixed Committed to trunk and 2.x, r1633426. ant eclipse target to add output

[jira] [Commented] (NUTCH-1644) Should have a parser that uses xpath

2014-10-23 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14181079#comment-14181079 ] Sebastian Nagel commented on NUTCH-1644: What's the exact objective? # extract

[jira] [Commented] (NUTCH-1644) Should have a parser that uses xpath

2014-10-24 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14182671#comment-14182671 ] Sebastian Nagel commented on NUTCH-1644: NUTCH-1870 definitely does extract

[jira] [Commented] (NUTCH-1342) Read time out protocol-http

2014-10-24 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14182960#comment-14182960 ] Sebastian Nagel commented on NUTCH-1342: Hi [~angela_wang], I also get a timeout

[jira] [Commented] (NUTCH-1483) Can't crawl filesystem with protocol-file plugin

2014-10-24 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14183389#comment-14183389 ] Sebastian Nagel commented on NUTCH-1483: Hi, the log looks not really wrong. In

[jira] [Commented] (NUTCH-1825) protocol-http may hang for certain web pages

2014-10-24 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14183490#comment-14183490 ] Sebastian Nagel commented on NUTCH-1825: Comments and reviews welcome! The problem

[jira] [Updated] (NUTCH-1825) protocol-http may hang for certain web pages

2014-10-24 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1825: --- Fix Version/s: 1.10 2.3 protocol-http may hang for certain web pages

[jira] [Updated] (NUTCH-1825) protocol-http may hang for certain web pages

2014-10-24 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1825: --- Affects Version/s: 2.2.1 protocol-http may hang for certain web pages

[jira] [Updated] (NUTCH-1825) protocol-http may hang for certain web pages

2014-10-24 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1825: --- Attachment: NUTCH-1825-2x-v3.patch protocol-http may hang for certain web pages

[jira] [Commented] (NUTCH-1483) Can't crawl filesystem with protocol-file plugin

2014-10-25 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14184038#comment-14184038 ] Sebastian Nagel commented on NUTCH-1483: Not everything is ok: the url appears in

[jira] [Created] (NUTCH-1883) bin/crawl: use function to run bin/nutch and check exit value

2014-10-25 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-1883: -- Summary: bin/crawl: use function to run bin/nutch and check exit value Key: NUTCH-1883 URL: https://issues.apache.org/jira/browse/NUTCH-1883 Project: Nutch

[jira] [Updated] (NUTCH-1883) bin/crawl: use function to run bin/nutch and check exit value

2014-10-25 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1883: --- Attachment: NUTCH-1883-trunk-v1.patch bin/crawl: use function to run bin/nutch and check

[jira] [Updated] (NUTCH-1883) bin/crawl: use function to run bin/nutch and check exit value

2014-10-25 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1883: --- Attachment: NUTCH-1883-2x-v1.patch bin/crawl: use function to run bin/nutch and check exit

[jira] [Resolved] (NUTCH-1883) bin/crawl: use function to run bin/nutch and check exit value

2014-10-27 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-1883. Resolution: Fixed Committed to trunk and 2.x, r1634694. Thanks! bin/crawl: use function

[jira] [Updated] (NUTCH-1873) Solr IndexWriter/Job to report number of docs indexed.

2014-10-29 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1873: --- Summary: Solr IndexWriter/Job to report number of docs indexed. (was: Solr IndexWriter/Job

[jira] [Updated] (NUTCH-1873) Solr IndexWriter/Job to report number of docs indexed.

2014-10-29 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1873: --- Attachment: NUTCH-1873-trunk-v1.patch Patch for trunk: - incrementing counters and logging is

[jira] [Updated] (NUTCH-1873) Solr IndexWriter/Job to report number of docs indexed.

2014-10-30 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1873: --- Patch Info: Patch Available Solr IndexWriter/Job to report number of docs indexed.

[jira] [Created] (NUTCH-1885) Protocol-file should treat symbolic links as redirects

2014-10-31 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-1885: -- Summary: Protocol-file should treat symbolic links as redirects Key: NUTCH-1885 URL: https://issues.apache.org/jira/browse/NUTCH-1885 Project: Nutch

[jira] [Updated] (NUTCH-1884) Java.lang.NullPointerException when using the parsechecker and indexchecker

2014-10-31 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1884: --- Attachment: NUTCH-1884-trunk-v1.patch There is rarely a good excuse for a NPE (opened

[jira] [Updated] (NUTCH-1884) NullPointerException in parsechecker and indexchecker with symlinks in file URL

2014-10-31 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1884: --- Affects Version/s: 2.2.1 Fix Version/s: 1.10 2.3

[jira] [Updated] (NUTCH-1885) Protocol-file should treat symbolic links as redirects

2014-10-31 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1885: --- Attachment: NUTCH-1885-trunk-v1.patch Patch for trunk (works probably also for 2.x): old

[jira] [Commented] (NUTCH-1483) Can't crawl filesystem with protocol-file plugin

2014-10-31 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14192405#comment-14192405 ] Sebastian Nagel commented on NUTCH-1483: Opened subtasks for better treatment of

[jira] [Updated] (NUTCH-1885) Protocol-file should treat symbolic links as redirects

2014-11-04 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1885: --- Attachment: NUTCH-1885-2x-v1.patch equivalent patch for 2.x Protocol-file should treat

[jira] [Resolved] (NUTCH-1483) Can't crawl filesystem with protocol-file plugin

2014-11-04 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-1483. Resolution: Fixed Fix Version/s: (was: 2.4) 2.3 Committed

[jira] [Resolved] (NUTCH-1879) Regex URL normalizer should remove multiple slashes after file: protocol

2014-11-04 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-1879. Resolution: Fixed Fix Version/s: (was: 2.4) 2.3 Fixed, see

[jira] [Resolved] (NUTCH-1878) urlnormalizer-regex to keep third slash in file:///path/index.html

2014-11-04 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-1878. Resolution: Won't Fix Solution from NUTCH-1879 is preferred because URL.toString() returns

[jira] [Resolved] (NUTCH-1880) URLUtil should not add additional slashes for file URLs

2014-11-04 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-1880. Resolution: Fixed Fix Version/s: (was: 2.4) 2.3 Committed,

[jira] [Resolved] (NUTCH-1885) Protocol-file should treat symbolic links as redirects

2014-11-04 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-1885. Resolution: Fixed Fix Version/s: (was: 2.4) 2.3 Committed to

[jira] [Resolved] (NUTCH-1825) protocol-http may hang for certain web pages

2014-11-06 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-1825. Resolution: Fixed Patch v3 tested for 1 week on a production server: no regressions seen.

[jira] [Updated] (NUTCH-1884) NullPointerException in parsechecker and indexchecker with symlinks in file URL

2014-11-06 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1884: --- Affects Version/s: (was: 2.2.1) NullPointerException in parsechecker and indexchecker

[jira] [Updated] (NUTCH-1884) NullPointerException in parsechecker and indexchecker with symlinks in file URL

2014-11-06 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1884: --- Fix Version/s: (was: 2.4) NullPointerException in parsechecker and indexchecker with

[jira] [Resolved] (NUTCH-1884) NullPointerException in parsechecker and indexchecker with symlinks in file URL

2014-11-06 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-1884. Resolution: Fixed Committed to trunk/1.x, r1637237. Nutch 2.x is not affected because there

[jira] [Commented] (NUTCH-1887) Specify HTMLMapper to use in TikaParser

2014-11-06 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14201069#comment-14201069 ] Sebastian Nagel commented on NUTCH-1887: +1 good idea: could be used to make

[jira] [Reopened] (NUTCH-1883) bin/crawl: use function to run bin/nutch and check exit value

2014-11-09 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel reopened NUTCH-1883: Assignee: Sebastian Nagel Hi [~jnioche], you're definitely right. Thanks! Patch for the

[jira] [Updated] (NUTCH-1883) bin/crawl: use function to run bin/nutch and check exit value

2014-11-09 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1883: --- Attachment: NUTCH-1883-trunk-v2.patch NUTCH-1883-trunk-v2.patch bin/crawl:

[jira] [Updated] (NUTCH-1829) Generator : unable to distinguish real errors

2014-11-09 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1829: --- Attachment: NUTCH-1829-2x-v2.patch Generator : unable to distinguish real errors

[jira] [Comment Edited] (NUTCH-1829) Generator : unable to distinguish real errors

2014-11-09 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14204012#comment-14204012 ] Sebastian Nagel edited comment on NUTCH-1829 at 11/9/14 5:36 PM:

[jira] [Updated] (NUTCH-1870) Generic xsl parser plugin

2014-11-10 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1870: --- Attachment: NUTCH-1870-trunk-v3.patch Generic xsl parser plugin -

[jira] [Commented] (NUTCH-1870) Generic xsl parser plugin

2014-11-10 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205385#comment-14205385 ] Sebastian Nagel commented on NUTCH-1870: Hi [~Albinscode], simple and funny

[jira] [Resolved] (NUTCH-1883) bin/crawl: use function to run bin/nutch and check exit value

2014-11-11 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-1883. Resolution: Fixed Patch v2 committed to trunk and 2.x, r1638203. bin/crawl: use function

[jira] [Updated] (NUTCH-1877) Suffix URL to ignore query string by default

2014-11-30 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1877: --- Attachment: NUTCH-1877.patch Patch for 1.x and 2.x Suffix URL to ignore query string by

[jira] [Updated] (NUTCH-1877) Suffix URL to ignore query string by default

2014-11-30 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1877: --- Patch Info: Patch Available Suffix URL to ignore query string by default

[jira] [Updated] (NUTCH-1877) Suffix URL to ignore query string by default

2014-11-30 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1877: --- Fix Version/s: 2.3 Suffix URL to ignore query string by default

[jira] [Commented] (NUTCH-1778) Generator not logging number of URLs in batch correctly

2014-12-03 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14233572#comment-14233572 ] Sebastian Nagel commented on NUTCH-1778: The value of the static variable

[jira] [Updated] (NUTCH-1778) Generator not logging number of URLs in batch correctly

2014-12-04 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1778: --- Attachment: NUTCH-1778.patch Generator not logging number of URLs in batch correctly

[jira] [Updated] (NUTCH-1778) Generator not logging number of URLs in batch correctly

2014-12-04 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1778: --- Patch Info: Patch Available Generator not logging number of URLs in batch correctly

[jira] [Updated] (NUTCH-1877) Suffix URL filter to ignore query string by default

2014-12-04 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1877: --- Summary: Suffix URL filter to ignore query string by default (was: Suffix URL to ignore

[jira] [Updated] (NUTCH-1877) Suffix URL filter to ignore query string by default

2014-12-05 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1877: --- Affects Version/s: 2.2.1 Suffix URL filter to ignore query string by default

[jira] [Resolved] (NUTCH-1877) Suffix URL filter to ignore query string by default

2014-12-05 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-1877. Resolution: Fixed Committed to trunk and 2.x, r1643412. Suffix URL filter to ignore query

[jira] [Updated] (NUTCH-1778) Generator not logging number of URLs in batch correctly

2014-12-05 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1778: --- Attachment: (was: NUTCH-1778.patch) Generator not logging number of URLs in batch

[jira] [Updated] (NUTCH-1778) Generator not logging number of URLs in batch correctly

2014-12-05 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1778: --- Attachment: NUTCH-1778-v2.patch Clean patch without (not working) fixes for NUTCH-1829.

[jira] [Comment Edited] (NUTCH-1778) Generator not logging number of URLs in batch correctly

2014-12-05 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14236064#comment-14236064 ] Sebastian Nagel edited comment on NUTCH-1778 at 12/5/14 8:37 PM:

[jira] [Updated] (NUTCH-1829) Generator : unable to distinguish real errors

2014-12-05 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1829: --- Attachment: NUTCH-1829-2x-v3.patch New patch which depends on fix of NUTCH-1778. Tested

[jira] [Assigned] (NUTCH-1778) Generator not logging number of URLs in batch correctly

2014-12-08 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel reassigned NUTCH-1778: -- Assignee: Sebastian Nagel Generator not logging number of URLs in batch correctly

[jira] [Resolved] (NUTCH-1778) Generator not logging number of URLs in batch correctly

2014-12-08 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-1778. Resolution: Fixed Committed to 2.x, r1643899. Generator not logging number of URLs in

[jira] [Resolved] (NUTCH-1829) Generator : unable to distinguish real errors

2014-12-08 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-1829. Resolution: Fixed Committed to 2.x, r1643899. Generator : unable to distinguish real

[jira] [Commented] (NUTCH-1895) run() method in Crawler.java doesn't put Nutch.ARG_BATCH in argMap

2014-12-10 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14240889#comment-14240889 ] Sebastian Nagel commented on NUTCH-1895: Hi [~FeiTian], usage of the class

[jira] [Commented] (NUTCH-1895) run() method in Crawler.java doesn't put Nutch.ARG_BATCH in argMap

2014-12-11 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14243112#comment-14243112 ] Sebastian Nagel commented on NUTCH-1895: To run Nutch on Windows requires Cygwin

[jira] [Comment Edited] (NUTCH-1895) run() method in Crawler.java doesn't put Nutch.ARG_BATCH in argMap

2014-12-11 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14243112#comment-14243112 ] Sebastian Nagel edited comment on NUTCH-1895 at 12/11/14 8:50 PM:

[jira] [Commented] (NUTCH-1898) Add -dumpRawHTML parameter to parsechecker tool

2014-12-11 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14243194#comment-14243194 ] Sebastian Nagel commented on NUTCH-1898: The raw document could be also viewed by

[jira] [Commented] (NUTCH-1797) remove unused package o.a.n.html

2014-12-15 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14246427#comment-14246427 ] Sebastian Nagel commented on NUTCH-1797: Yes, of course, you are welcome to

[jira] [Created] (NUTCH-1899) upgrade restlet lib to prevent build failure

2014-12-15 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-1899: -- Summary: upgrade restlet lib to prevent build failure Key: NUTCH-1899 URL: https://issues.apache.org/jira/browse/NUTCH-1899 Project: Nutch Issue Type:

[jira] [Commented] (NUTCH-1899) upgrade restlet lib to prevent build failure

2014-12-15 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14247234#comment-14247234 ] Sebastian Nagel commented on NUTCH-1899: Upgrade to org.restlet dependencies from

[jira] [Commented] (NUTCH-1899) upgrade restlet lib to prevent build failure

2014-12-16 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14248871#comment-14248871 ] Sebastian Nagel commented on NUTCH-1899: +1 (tests pass, tested also REST API and

[jira] [Assigned] (NUTCH-1797) remove unused package o.a.n.html

2014-12-16 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel reassigned NUTCH-1797: -- Assignee: Sebastian Nagel remove unused package o.a.n.html

[jira] [Resolved] (NUTCH-1797) remove unused package o.a.n.html

2014-12-16 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-1797. Resolution: Fixed Fix Version/s: (was: 2.4) 2.3 Committed to

[jira] [Comment Edited] (NUTCH-1797) remove unused package o.a.n.html

2014-12-16 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14248929#comment-14248929 ] Sebastian Nagel edited comment on NUTCH-1797 at 12/16/14 9:01 PM:

[jira] [Commented] (NUTCH-1903) Resolve-default failed with branch 2.x

2014-12-19 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14254042#comment-14254042 ] Sebastian Nagel commented on NUTCH-1903: Not reproducible (2.x, r1646875):

[jira] [Commented] (NUTCH-1902) Missing nekohtml.jar

2014-12-19 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14254055#comment-14254055 ] Sebastian Nagel commented on NUTCH-1902: The nekohtml jar must not be placed in

[jira] [Resolved] (NUTCH-1831) compiling against gora-0.5 fails

2014-12-19 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-1831. Resolution: Not a Problem Compilation with Gora 0.5 and recent 2.x works. Please, reopen if

[jira] [Updated] (NUTCH-1810) Duplicate jdom dependency

2014-12-19 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1810: --- Fix Version/s: 2.4 Duplicate jdom dependency --

[jira] [Updated] (NUTCH-1810) Duplicate jdom dependency

2014-12-19 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1810: --- Attachment: NUTCH-1810-2x.patch Only affects 2.x. Trivial fix: jdom is mentioned twice in

[jira] [Updated] (NUTCH-1810) Duplicate jdom dependency

2014-12-19 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1810: --- Priority: Trivial (was: Major) Duplicate jdom dependency --

[jira] [Updated] (NUTCH-1810) Duplicate jdom dependency

2014-12-19 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1810: --- Patch Info: Patch Available Duplicate jdom dependency --

[jira] [Resolved] (NUTCH-1638) SolrWriter Bad String comparision

2014-12-19 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-1638. Resolution: Not a Problem Resolve, it's not a problem. SolrWriter Bad String comparision

[jira] [Resolved] (NUTCH-1895) run() method in Crawler.java doesnt put Nutch.ARG_BATCH in argMap

2014-12-19 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-1895. Resolution: Won't Fix We cannot fix removed code. run() method in Crawler.java doesnt put
