[jira] [Updated] (NUTCH-2124) redirect following same link again and again , max redirect exceed and went db_gone

2015-09-28 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2124: --- Priority: Blocker (was: Major) > redirect following same link again and again , max redirect

[jira] [Commented] (NUTCH-2110) Create the capability to provide seeds in the form of "url+xpath(including option to enter seach terms).selenium"

2015-09-20 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14899934#comment-14899934 ] Sebastian Nagel commented on NUTCH-2110: Hi Asitang, the Injector is already able to store

[jira] [Commented] (NUTCH-2110) Create the capability to provide seeds in the form of "url+xpath(including option to enter seach terms).selenium"

2015-09-22 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14903524#comment-14903524 ] Sebastian Nagel commented on NUTCH-2110: Ok, understood. One point to consider: shall all

[jira] [Commented] (NUTCH-2106) Runtime to contain Selenium and dependencies only once

2015-09-18 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14847281#comment-14847281 ] Sebastian Nagel commented on NUTCH-2106: Avoiding conflicting dependencies is the reason for the

[jira] [Assigned] (NUTCH-2106) Runtime to contain Selenium and dependencies only once

2015-09-21 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel reassigned NUTCH-2106: -- Assignee: Sebastian Nagel > Runtime to contain Selenium and dependencies only once >

[jira] [Resolved] (NUTCH-2106) Runtime to contain Selenium and dependencies only once

2015-09-21 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2106. Resolution: Fixed Committed to trunk, r1704425. Thanks, Lewis! > Runtime to contain

[jira] [Commented] (NUTCH-2124) redirect following same link again and again , max redirect exceed and went db_gone

2015-10-05 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14943609#comment-14943609 ] Sebastian Nagel commented on NUTCH-2124: I've tested the patch with the mentioned URL as only seed

[jira] [Commented] (NUTCH-2132) Publisher/Subscriber model for Nutch to emit events

2015-10-05 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14943637#comment-14943637 ] Sebastian Nagel commented on NUTCH-2132: No question, this is a significant improvement over

[jira] [Commented] (NUTCH-2179) Cleanup job for SOLR Performance Boost

2015-12-01 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15034511#comment-15034511 ] Sebastian Nagel commented on NUTCH-2179: +1: SolrIndexWriter should queue the deletions the same

[jira] [Updated] (NUTCH-2172) Parsing whitespace not just tabs in contenttype-mapping.txt

2015-12-01 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2172: --- Attachment: NUTCH-2172-1.patch Patch to add a template for conf/contenttype-mapping.txt

[jira] [Assigned] (NUTCH-2107) plugin.xml to validate against plugin.dtd

2015-12-01 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel reassigned NUTCH-2107: -- Assignee: Sebastian Nagel > plugin.xml to validate against plugin.dtd >

[jira] [Resolved] (NUTCH-2107) plugin.xml to validate against plugin.dtd

2015-12-01 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2107. Resolution: Fixed Fix Version/s: (was: 1.12) (was: 2.4)

[jira] [Updated] (NUTCH-2172) index-more: document format of contenttype-mapping.txt

2015-12-06 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2172: --- Component/s: indexer > index-more: document format of contenttype-mapping.txt >

[jira] [Assigned] (NUTCH-2172) Parsing whitespace not just tabs in contenttype-mapping.txt

2015-12-06 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel reassigned NUTCH-2172: -- Assignee: Sebastian Nagel > Parsing whitespace not just tabs in

[jira] [Updated] (NUTCH-2172) Parsing whitespace not just tabs in contenttype-mapping.txt

2015-12-06 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2172: --- Fix Version/s: 1.12 > Parsing whitespace not just tabs in contenttype-mapping.txt >

[jira] [Updated] (NUTCH-2172) Parsing whitespace not just tabs in contenttype-mapping.txt

2015-12-06 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2172: --- Issue Type: Improvement (was: Bug) > Parsing whitespace not just tabs in

[jira] [Updated] (NUTCH-2172) index-more: document format of contenttype-mapping.txt

2015-12-06 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2172: --- Summary: index-more: document format of contenttype-mapping.txt (was: Parsing whitespace not

[jira] [Resolved] (NUTCH-2172) index-more: document format of contenttype-mapping.txt

2015-12-06 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2172. Resolution: Fixed Committed to trunk, r1718223. Thanks, [~nicola.tonellotto]! Although

[jira] [Commented] (NUTCH-2076) exceptions are not handled when using method waitForCompletion in a try block

2015-12-08 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15047481#comment-15047481 ] Sebastian Nagel commented on NUTCH-2076: After a second look: the problem is the return statement

[jira] [Updated] (NUTCH-2172) Parsing whitespace not just tabs in contenttype-mapping.txt

2015-12-03 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2172: --- Attachment: NUTCH-2172-2.patch It is about MIME types which are already normalized either by

[jira] [Commented] (NUTCH-2172) Parsing whitespace not just tabs in contenttype-mapping.txt

2015-12-01 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15034352#comment-15034352 ] Sebastian Nagel commented on NUTCH-2172: This could be an improvement if we assume that MIME types

[jira] [Updated] (NUTCH-2193) Upgrade feed parser plugin to use rome 1.5

2016-01-04 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2193: --- Attachment: NUTCH-2193.patch > Upgrade feed parser plugin to use rome 1.5 >

[jira] [Created] (NUTCH-2193) Upgrade feed parser plugin to use rome 1.5

2016-01-04 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-2193: -- Summary: Upgrade feed parser plugin to use rome 1.5 Key: NUTCH-2193 URL: https://issues.apache.org/jira/browse/NUTCH-2193 Project: Nutch Issue Type:

[jira] [Commented] (NUTCH-2143) GeneratorJob ignores batch id passed as argument

2016-01-06 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15085327#comment-15085327 ] Sebastian Nagel commented on NUTCH-2143: Excellent! Please, attach a

[jira] [Commented] (NUTCH-2168) Parse-tika fails to retrieve parser

2016-01-05 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083285#comment-15083285 ] Sebastian Nagel commented on NUTCH-2168: Hi [~kalanya], looks like the indexed raw content of the

[jira] [Commented] (NUTCH-2191) Add protocol-htmlunit

2016-01-05 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083603#comment-15083603 ] Sebastian Nagel commented on NUTCH-2191: As [~haraldk] mentioned in [this

[jira] [Updated] (NUTCH-2143) GeneratorJob ignores batch id passed as argument

2016-01-07 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2143: --- Attachment: NUTCH-2143-v3.patch Ok, with the patch applied the unit testFetch() fails because

[jira] [Resolved] (NUTCH-2168) Parse-tika fails to retrieve parser

2016-01-09 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2168. Resolution: Fixed Committed to 2.x, r1723851. Opened NUTCH-2198 to track the problem when

[jira] [Created] (NUTCH-2198) Indexing binary content by index-html causes Solr Exception

2016-01-09 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-2198: -- Summary: Indexing binary content by index-html causes Solr Exception Key: NUTCH-2198 URL: https://issues.apache.org/jira/browse/NUTCH-2198 Project: Nutch

[jira] [Commented] (NUTCH-2198) Indexing binary content by index-html causes Solr Exception

2016-01-09 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15090625#comment-15090625 ] Sebastian Nagel commented on NUTCH-2198: Tried to reproduce the Solr exception by indexing one of

[jira] [Updated] (NUTCH-2198) Indexing binary content by index-html causes Solr Exception

2016-01-09 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2198: --- Description: (reported by [~kalanya] in NUTCH-2168) If raw binary is indexed using the plugin

[jira] [Resolved] (NUTCH-2169) Integrate index-html into Nutch build

2016-01-08 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2169. Resolution: Fixed Assignee: Sebastian Nagel Committed to 2.x, r1723794. > Integrate

[jira] [Resolved] (NUTCH-2143) GeneratorJob ignores batch id passed as argument

2016-01-07 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2143. Resolution: Fixed Committed to 2.x, r1723626. Thanks! > GeneratorJob ignores batch id

[jira] [Commented] (NUTCH-2189) Domain filter must deactivate if no rules are present

2015-12-22 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15068723#comment-15068723 ] Sebastian Nagel commented on NUTCH-2189: +1 makes the urlfilter-domain more robust, patch looks

[jira] [Commented] (NUTCH-2065) Domain URL filter to support protocols

2015-12-22 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15068757#comment-15068757 ] Sebastian Nagel commented on NUTCH-2065: * in general: wouldn't a URL normalizer be preferable? If

[jira] [Commented] (NUTCH-2189) Domain filter must deactivate if no rules are present

2015-12-25 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15071661#comment-15071661 ] Sebastian Nagel commented on NUTCH-2189: Yes, you're right! > Domain filter must deactivate if no

[jira] [Commented] (NUTCH-2189) Domain filter must deactivate if no rules are present

2015-12-25 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15071662#comment-15071662 ] Sebastian Nagel commented on NUTCH-2189: Yes, you're right! > Domain filter must deactivate if no

[jira] [Comment Edited] (NUTCH-2158) Upgrade to Tika 1.11

2015-11-25 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15023131#comment-15023131 ] Sebastian Nagel edited comment on NUTCH-2158 at 11/26/15 7:28 AM: -- Patch

[jira] [Assigned] (NUTCH-2175) Typos in property descriptions in nutch-default.xml

2015-11-24 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel reassigned NUTCH-2175: -- Assignee: Sebastian Nagel > Typos in property descriptions in nutch-default.xml >

[jira] [Commented] (NUTCH-2177) Generator produces only one partition even in distributed mode

2015-11-30 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15032108#comment-15032108 ] Sebastian Nagel commented on NUTCH-2177: Rely on {{mapred.job.tracker}}, cf.

[jira] [Commented] (NUTCH-2177) Generator produces only one partition even in distributed mode

2015-12-01 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15033595#comment-15033595 ] Sebastian Nagel commented on NUTCH-2177: Yes, of course, I was just unable to copy-paste the right

[jira] [Resolved] (NUTCH-2158) Upgrade to Tika 1.11

2015-11-26 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2158. Resolution: Fixed Thanks! Committed to trunk, r1716573. > Upgrade to Tika 1.11 >

[jira] [Commented] (NUTCH-2158) Upgrade to Tika 1.11

2015-11-23 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15023123#comment-15023123 ] Sebastian Nagel commented on NUTCH-2158: We need to pass the rendered HTML, returned by the

[jira] [Updated] (NUTCH-2158) Upgrade to Tika 1.11

2015-11-23 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2158: --- Attachment: NUTCH-2158-test-protocol-http.patch Patch to adjust tests of protocol-http: -

[jira] [Updated] (NUTCH-2175) Typos in property descriptions in nutch-default.xml

2015-11-24 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2175: --- Issue Type: Improvement (was: Bug) > Typos in property descriptions in nutch-default.xml >

[jira] [Resolved] (NUTCH-2175) Typos in property descriptions in nutch-default.xml

2015-11-24 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2175. Resolution: Fixed And a spell checker detected some more obvious misspellings... Committed

[jira] [Updated] (NUTCH-2175) Typos in property descriptions in nutch-default.xml

2015-11-24 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2175: --- Summary: Typos in property descriptions in nutch-default.xml (was: Misspelling at word

[jira] [Work started] (NUTCH-1712) Use MultipleInputs in Injector to make it a single mapreduce job

2016-01-11 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on NUTCH-1712 started by Sebastian Nagel. -- > Use MultipleInputs in Injector to make it a single mapreduce job >

[jira] [Commented] (NUTCH-1712) Use MultipleInputs in Injector to make it a single mapreduce job

2016-01-11 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15092924#comment-15092924 ] Sebastian Nagel commented on NUTCH-1712: The merging is done together with minor improvements

[jira] [Commented] (NUTCH-2272) Index checker server to optionally keep client connection open

2016-06-15 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15331434#comment-15331434 ] Sebastian Nagel commented on NUTCH-2272: Not included in [1.12 release

[jira] [Commented] (NUTCH-827) HTTP POST Authentication

2016-06-13 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15328204#comment-15328204 ] Sebastian Nagel commented on NUTCH-827: --- Hi [~stevegy], would you mind to open a new Jira for this

[jira] [Created] (NUTCH-2281) Support non-default FileSystem

2016-06-17 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-2281: -- Summary: Support non-default FileSystem Key: NUTCH-2281 URL: https://issues.apache.org/jira/browse/NUTCH-2281 Project: Nutch Issue Type: Improvement

[jira] [Commented] (NUTCH-2281) Support non-default FileSystem

2016-06-21 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15341680#comment-15341680 ] Sebastian Nagel commented on NUTCH-2281: I tried to fix all tools but haven't tested all of them

[jira] [Created] (NUTCH-2286) CrawlDbReader -stats fetch time and interval

2016-06-23 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-2286: -- Summary: CrawlDbReader -stats fetch time and interval Key: NUTCH-2286 URL: https://issues.apache.org/jira/browse/NUTCH-2286 Project: Nutch Issue Type:

[jira] [Updated] (NUTCH-2272) Index checker server to optionally keep client connection open

2016-06-23 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2272: --- Fix Version/s: (was: 1.12) 1.13 > Index checker server to optionally

[jira] [Updated] (NUTCH-2286) CrawlDbReader -stats to show fetch time and interval

2016-06-23 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2286: --- Summary: CrawlDbReader -stats to show fetch time and interval (was: CrawlDbReader -stats

[jira] [Commented] (NUTCH-2272) Index checker server to optionally keep client connection open

2016-06-23 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15346585#comment-15346585 ] Sebastian Nagel commented on NUTCH-2272: Not included in released 1.12: removed from CHANGES.txt,

[jira] [Commented] (NUTCH-2269) Clean not working after crawl

2016-06-27 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15351824#comment-15351824 ] Sebastian Nagel commented on NUTCH-2269: Thanks for reporting the problems. Afaics, they can be

[jira] [Issue Comment Deleted] (NUTCH-2269) Clean not working after crawl

2016-06-27 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2269: --- Comment: was deleted (was: The message {noformat} WARN output.FileOutputCommitter - Output

[jira] [Issue Comment Deleted] (NUTCH-2269) Clean not working after crawl

2016-06-27 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2269: --- Comment: was deleted (was: The message {noformat} WARN output.FileOutputCommitter - Output

[jira] [Issue Comment Deleted] (NUTCH-2269) Clean not working after crawl

2016-06-27 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2269: --- Comment: was deleted (was: The message {noformat} WARN output.FileOutputCommitter - Output

[jira] [Updated] (NUTCH-1314) Impose a limit on the length of outlink target urls

2016-02-03 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1314: --- Fix Version/s: 1.12 > Impose a limit on the length of outlink target urls >

[jira] [Commented] (NUTCH-2228) index-replace unit test fails

2016-02-22 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15157655#comment-15157655 ] Sebastian Nagel commented on NUTCH-2228: The name of the failing test "testInvalidPatterns"

[jira] [Updated] (NUTCH-2228) index-replace unit test fails

2016-02-22 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2228: --- Attachment: NUTCH-2228.patch > index-replace unit test fails > -

[jira] [Comment Edited] (NUTCH-2228) index-replace unit test fails

2016-02-22 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15157655#comment-15157655 ] Sebastian Nagel edited comment on NUTCH-2228 at 2/22/16 8:38 PM: - The name

[jira] [Updated] (NUTCH-2228) index-replace unit test fails

2016-02-22 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2228: --- Patch Info: Patch Available > index-replace unit test fails > - >

[jira] [Commented] (NUTCH-2228) index-replace unit test fails

2016-02-22 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15157632#comment-15157632 ] Sebastian Nagel commented on NUTCH-2228: That's only a problem if Nutch is built with Java 8.

[jira] [Commented] (NUTCH-2220) Rename db.* options used only by the linkdb to linkdb.*

2016-02-22 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15157831#comment-15157831 ] Sebastian Nagel commented on NUTCH-2220: 0 / +1 Since this breaks existing crawl configurations: a

[jira] [Commented] (NUTCH-2221) Introduce db.ignore.internal.links to FetcherThread

2016-02-22 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15157816#comment-15157816 ] Sebastian Nagel commented on NUTCH-2221: +1 Just to consider: the additional argument to

[jira] [Commented] (NUTCH-2216) db.ignore.*.links to optionally follow internal redirects

2016-02-22 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1515#comment-1515 ] Sebastian Nagel commented on NUTCH-2216: * this was the case before, but shouldn't

[jira] [Resolved] (NUTCH-1712) Use MultipleInputs in Injector to make it a single mapreduce job

2016-02-25 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-1712. Resolution: Fixed Fix Version/s: 1.12 Committed to trunk (f5e430e). > Use

[jira] [Updated] (NUTCH-2204) remove junit lib from runtime

2016-01-22 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2204: --- Attachment: NUTCH-2204.patch > remove junit lib from runtime > -

[jira] [Created] (NUTCH-2204) remove junit lib from runtime

2016-01-22 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-2204: -- Summary: remove junit lib from runtime Key: NUTCH-2204 URL: https://issues.apache.org/jira/browse/NUTCH-2204 Project: Nutch Issue Type: Improvement

[jira] [Updated] (NUTCH-2204) Remove junit lib from runtime

2016-01-22 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2204: --- Summary: Remove junit lib from runtime (was: remove junit lib from runtime) > Remove junit

[jira] [Resolved] (NUTCH-2204) remove junit lib from runtime

2016-01-22 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2204. Resolution: Fixed Committed to trunk, r1726318. > remove junit lib from runtime >

[jira] [Commented] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs

2016-02-14 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15146685#comment-15146685 ] Sebastian Nagel commented on NUTCH-2144: Hi [~thammegowda], thanks! Everything looks good with the

[jira] [Commented] (NUTCH-2060) dedup is removing entries with status db_gone

2016-03-01 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15174628#comment-15174628 ] Sebastian Nagel commented on NUTCH-2060: Afaics from the mentioned thread on the user mailing

[jira] [Commented] (NUTCH-2242) lastModified not always set

2016-03-24 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210136#comment-15210136 ] Sebastian Nagel commented on NUTCH-2242: Hi Jurian, thanks for reporting this problem. This is

[jira] [Commented] (NUTCH-2237) DeduplicationJob: Add extra order criteria based on slug

2016-03-03 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15178587#comment-15178587 ] Sebastian Nagel commented on NUTCH-2237: Good idea! Nice patch, including unit tests. A few

[jira] [Updated] (NUTCH-2237) DeduplicationJob: Add extra order criteria based on slug

2016-03-03 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2237: --- Fix Version/s: 1.12 > DeduplicationJob: Add extra order criteria based on slug >

[jira] [Assigned] (NUTCH-2256) Inconsistent log level practice

2016-04-29 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel reassigned NUTCH-2256: -- Assignee: Sebastian Nagel > Inconsistent log level practice >

[jira] [Commented] (NUTCH-2256) Inconsistent log level practice

2016-04-29 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15264274#comment-15264274 ] Sebastian Nagel commented on NUTCH-2256: Good catch, will fix right now. Thanks, [~songwang]! >

[jira] [Updated] (NUTCH-2256) Inconsistent log level practice

2016-04-29 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2256: --- Fix Version/s: 2.3.2 1.12 2.4 > Inconsistent log level

[jira] [Updated] (NUTCH-2256) Inconsistent log level practice

2016-04-29 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2256: --- Affects Version/s: 1.11 > Inconsistent log level practice > --- >

[jira] [Resolved] (NUTCH-2254) Charset issues when using -addBinaryContent and -base64 options

2016-04-27 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2254. Resolution: Fixed Committed, r6d2bfa9. Thanks, [~fedechicco]! > Charset issues when using

[jira] [Commented] (NUTCH-2254) Charset issues when using -addBinaryContent and -base64 options

2016-04-25 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15256225#comment-15256225 ] Sebastian Nagel commented on NUTCH-2254: Hi [~fedechicco], the patch should work. Thanks! I'll add

[jira] [Assigned] (NUTCH-2254) Charset issues when using -addBinaryContent and -base64 options

2016-04-25 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel reassigned NUTCH-2254: -- Assignee: Sebastian Nagel > Charset issues when using -addBinaryContent and -base64

[jira] [Resolved] (NUTCH-2256) Inconsistent log level practice

2016-04-29 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2256. Resolution: Fixed Fix Version/s: (was: 2.3.2) Fixed and committed to 1.x

[jira] [Closed] (NUTCH-2256) Inconsistent log level practice

2016-04-29 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel closed NUTCH-2256. -- Also did a grep on all Java files for errors of the same kind - nothing found. Thanks,

[jira] [Updated] (NUTCH-2164) Inconsistent 'Modified Time' in crawl db

2016-05-19 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2164: --- Fix Version/s: 1.13 > Inconsistent 'Modified Time' in crawl db >

[jira] [Commented] (NUTCH-1858) Migrate Nutch documentation from Moin Moin to Confluence

2016-05-19 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15291591#comment-15291591 ] Sebastian Nagel commented on NUTCH-1858: It's hardly a work for a single person. First steps could

[jira] [Reopened] (NUTCH-2252) Allow phantomjs as a browser for selenium options

2016-05-08 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel reopened NUTCH-2252: Tests fail to compile [[1|https://builds.apache.org/job/Nutch-trunk/3365/console]]: {noformat}

[jira] [Commented] (NUTCH-2242) lastModified not always set

2016-05-11 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15280076#comment-15280076 ] Sebastian Nagel commented on NUTCH-2242: Opened pull request

[jira] [Commented] (NUTCH-2242) lastModified not always set

2016-05-11 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15279942#comment-15279942 ] Sebastian Nagel commented on NUTCH-2242: [~markus17]: Sorry, I didn't upload a final patch, simply

[jira] [Commented] (NUTCH-1785) Ability to index raw content

2016-04-20 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15250812#comment-15250812 ] Sebastian Nagel commented on NUTCH-1785: The class o.a.n.indexer.NutchField supports only a couple

[jira] [Resolved] (NUTCH-2191) Add protocol-htmlunit

2016-04-18 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2191. Resolution: Fixed Merged pull request #105. Build should succeed now. Thanks,

[jira] [Reopened] (NUTCH-2191) Add protocol-htmlunit

2016-04-18 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel reopened NUTCH-2191: Build fails because protocol-htmlunit's build.xml claims to have unit tests but there aren't

[jira] [Commented] (NUTCH-2297) CrawlDbReader -stats wrong values for earliest fetch time and shortest interval

2016-08-08 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15411716#comment-15411716 ] Sebastian Nagel commented on NUTCH-2297: The wrong values are already in the temporary output of

[jira] [Created] (NUTCH-2297) CrawlDbReader -stats wrong values for earliest fetch time and shortest interval

2016-08-08 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-2297: -- Summary: CrawlDbReader -stats wrong values for earliest fetch time and shortest interval Key: NUTCH-2297 URL: https://issues.apache.org/jira/browse/NUTCH-2297

[jira] [Created] (NUTCH-2291) Fix mrunit dependencies

2016-06-30 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-2291: -- Summary: Fix mrunit dependencies Key: NUTCH-2291 URL: https://issues.apache.org/jira/browse/NUTCH-2291 Project: Nutch Issue Type: Bug
