[jira] [Commented] (SOLR-4451) Upgrade to httpclient 4.2.x and take advantage of SystemDefaultHttpClient

2013-08-14 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/SOLR-4451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13739687#comment-13739687 ] Ken Krugler commented on SOLR-4451: --- Grant Ingersoll yes we got it to work (this was in

[jira] [Commented] (SOLR-4451) Upgrade to httpclient 4.2.x and take advantage of SystemDefaultHttpClient

2013-04-14 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/SOLR-4451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13631321#comment-13631321 ] Ken Krugler commented on SOLR-4451: --- One of my developers also ran into what seems like

[jira] [Commented] (SOLR-4451) Upgrade to httpclient 4.2.x and take advantage of SystemDefaultHttpClient

2013-04-14 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/SOLR-4451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13631326#comment-13631326 ] Ken Krugler commented on SOLR-4451: --- One related question - if I'm using embedded Solr,

[jira] Assigned: (TIKA-420) [PATCH] Integration of boilerpipe: Boilerplate Removal and Fulltext Extraction from HTML pages

2010-05-07 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ken Krugler reassigned TIKA-420: Assignee: Ken Krugler [PATCH] Integration of boilerpipe: Boilerplate Removal and Fulltext

[jira] Commented: (TIKA-420) [PATCH] Integration of boilerpipe: Boilerplate Removal and Fulltext Extraction from HTML pages

2010-05-07 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12865302#action_12865302 ] Ken Krugler commented on TIKA-420: -- Hi Christian, I'll take a look at the patch, and also

[jira] Commented: (NUTCH-706) Url regex normalizer

2010-03-31 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851923#action_12851923 ] Ken Krugler commented on NUTCH-706: --- Two comments about this: 1. From my experiences with

[jira] Commented: (TIKA-359) Calls to Charset.isSupported() will throw exceptions for invalid charset names

2010-03-31 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12852075#action_12852075 ] Ken Krugler commented on TIKA-359: -- Hi Chris, Sorry for the delay - yes, go ahead and defer

[jira] Commented: (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a ?

2010-03-17 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846424#action_12846424 ] Ken Krugler commented on NUTCH-797: --- I thought this same issue (relative URL with leading

[jira] Commented: (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a ?

2010-03-17 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846459#action_12846459 ] Ken Krugler commented on NUTCH-797: --- Agreed re crawler-commons...feels like there's a

[jira] Assigned: (TIKA-387) htmlparser throws IllegalCharsetNameException

2010-03-17 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ken Krugler reassigned TIKA-387: Assignee: Ken Krugler htmlparser throws IllegalCharsetNameException

[jira] Closed: (TIKA-387) htmlparser throws IllegalCharsetNameException

2010-03-17 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ken Krugler closed TIKA-387. Resolution: Duplicate I knew I'd seen this before :) It's a dup of the issue I'd previously filed...see

[jira] Updated: (TIKA-387) htmlparser throws IllegalCharsetNameException

2010-03-11 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ken Krugler updated TIKA-387: - Attachment: CharsetUtils.java Piotr - thanks for reporting this. I'd run into the same issue, and created

[jira] Updated: (TIKA-354) ProfilingHandler should take a length-limiting parameter

2010-02-22 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ken Krugler updated TIKA-354: - Attachment: TIKA-354-2.patch Additional improvement for language identification. This patch has to be

[jira] Updated: (TIKA-354) ProfilingHandler should take a length-limiting parameter

2010-02-19 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ken Krugler updated TIKA-354: - Attachment: TIKA-354.patch ProfilingHandler should take a length-limiting parameter

[jira] Commented: (TIKA-381) HtmlParser should strip linefeeds out of links

2010-02-19 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12835976#action_12835976 ] Ken Krugler commented on TIKA-381: -- Things have changed w/the switch to TagSoup. Now the

[jira] Commented: (TIKA-379) Lang attribute on html tag skipped

2010-02-16 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12834228#action_12834228 ] Ken Krugler commented on TIKA-379: -- I think this is part of a bigger issue re attributes

[jira] Commented: (TIKA-378) TikaConfig should notify users if it cannot initialize some parser

2010-02-16 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12834464#action_12834464 ] Ken Krugler commented on TIKA-378: -- Would it be sufficient to add a method that forces

[jira] Commented: (NUTCH-786) Better list of suffix domains

2010-02-05 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830109#action_12830109 ] Ken Krugler commented on NUTCH-786: --- Is this something that should also be applied to

[jira] Updated: (TIKA-369) Improve accuracy of language detection

2010-01-25 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ken Krugler updated TIKA-369: - Attachment: (was: dunning94-trimmed.pdf) Improve accuracy of language detection

[jira] Updated: (TIKA-369) Improve accuracy of language detection

2010-01-25 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ken Krugler updated TIKA-369: - Description: Currently the LanguageProfile code uses 3-grams to find the best language profile using

[jira] Updated: (TIKA-369) Improve accuracy of language detection

2010-01-25 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ken Krugler updated TIKA-369: - Attachment: Surprise and Coincidence.pdf Attaching another paper from Ted that makes it clearer why the

[jira] Commented: (TIKA-370) Tika pom.xml is missing dependencies on bouncycastle jars needed by PDFBox

2010-01-25 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12804704#action_12804704 ] Ken Krugler commented on TIKA-370: -- On the list, Jukka said: {quote} Yep. I think the

[jira] Created: (TIKA-370) Tika pom.xml is missing dependencies on bouncycastle jars needed by PDFBox

2010-01-25 Thread Ken Krugler (JIRA)
Tika pom.xml is missing dependencies on bouncycastle jars needed by PDFBox -- Key: TIKA-370 URL: https://issues.apache.org/jira/browse/TIKA-370 Project: Tika Issue

[jira] Issue Comment Edited: (TIKA-370) Tika pom.xml is missing dependencies on bouncycastle jars needed by PDFBox

2010-01-25 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12804704#action_12804704 ] Ken Krugler edited comment on TIKA-370 at 1/25/10 8:53 PM: --- On the

[jira] Commented: (LUCENE-826) Language detector

2010-01-24 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12804285#action_12804285 ] Ken Krugler commented on LUCENE-826: I think Nutch (and eventually Mahout) plan to use

[jira] Assigned: (TIKA-354) ProfilingHandler should take a length-limiting parameter

2010-01-24 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ken Krugler reassigned TIKA-354: Assignee: Ken Krugler ProfilingHandler should take a length-limiting parameter

[jira] Created: (TIKA-369) Improve accuracy of language detection

2010-01-24 Thread Ken Krugler (JIRA)
Improve accuracy of language detection -- Key: TIKA-369 URL: https://issues.apache.org/jira/browse/TIKA-369 Project: Tika Issue Type: Improvement Components: languageidentifier Affects

[jira] Updated: (TIKA-369) Improve accuracy of language detection

2010-01-24 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ken Krugler updated TIKA-369: - Attachment: dunning94-trimmed.pdf Improve accuracy of language detection

[jira] Issue Comment Edited: (TIKA-369) Improve accuracy of language detection

2010-01-24 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12804288#action_12804288 ] Ken Krugler edited comment on TIKA-369 at 1/24/10 7:39 PM: --- Karl

[jira] Issue Comment Edited: (TIKA-369) Improve accuracy of language detection

2010-01-24 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12804288#action_12804288 ] Ken Krugler edited comment on TIKA-369 at 1/24/10 7:39 PM: --- Karl

[jira] Commented: (TIKA-369) Improve accuracy of language detection

2010-01-24 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12804288#action_12804288 ] Ken Krugler commented on TIKA-369: -- Karl Wettin had contributed a language detector to

[jira] Updated: (TIKA-357) Increase buffer size for meta tag sniffing

2010-01-18 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ken Krugler updated TIKA-357: - Attachment: big-preamble.html TIKA-357-2.patch TIKA-357-2.patch should be applied on top

[jira] Commented: (NUTCH-751) Upgrade version of HttpClient

2010-01-11 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798890#action_12798890 ] Ken Krugler commented on NUTCH-751: --- i agree that this should be in crawler-commons. E.g.

[jira] Commented: (TIKA-359) Calls to Charset.isSupported() will throw exceptions for invalid charset names

2010-01-08 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798090#action_12798090 ] Ken Krugler commented on TIKA-359: -- Given the junk that can be found inside of meta

[jira] Created: (TIKA-359) Calls to Charset.isSupported() will throw exceptions for invalid charset names

2010-01-06 Thread Ken Krugler (JIRA)
Calls to Charset.isSupported() will throw exceptions for invalid charset names -- Key: TIKA-359 URL: https://issues.apache.org/jira/browse/TIKA-359 Project: Tika

[jira] Updated: (TIKA-357) Increase buffer size for meta tag sniffing

2009-12-21 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ken Krugler updated TIKA-357: - Attachment: makler.html From http://www.makler.su/ - example of file with meta tags more than 4K into

[jira] Created: (TIKA-357) Increase buffer size for meta tag sniffing

2009-12-21 Thread Ken Krugler (JIRA)
Increase buffer size for meta tag sniffing -- Key: TIKA-357 URL: https://issues.apache.org/jira/browse/TIKA-357 Project: Tika Issue Type: Improvement Affects Versions: 0.6 Reporter:

[jira] Created: (TIKA-349) HtmlParser's http-equiv code needs to be more flexible

2009-12-15 Thread Ken Krugler (JIRA)
HtmlParser's http-equiv code needs to be more flexible -- Key: TIKA-349 URL: https://issues.apache.org/jira/browse/TIKA-349 Project: Tika Issue Type: Improvement Affects Versions: 0.6

[jira] Created: (TIKA-350) HtmlParser's content-type handling code needs to be more flexible

2009-12-15 Thread Ken Krugler (JIRA)
HtmlParser's content-type handling code needs to be more flexible - Key: TIKA-350 URL: https://issues.apache.org/jira/browse/TIKA-350 Project: Tika Issue Type: Improvement

[jira] Created: (TIKA-351) MediaType.parse should be more forgiving of broken input

2009-12-15 Thread Ken Krugler (JIRA)
MediaType.parse should be more forgiving of broken input Key: TIKA-351 URL: https://issues.apache.org/jira/browse/TIKA-351 Project: Tika Issue Type: Improvement Reporter:

[jira] Updated: (TIKA-351) MediaType.parse should be more forgiving of broken input

2009-12-15 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ken Krugler updated TIKA-351: - Attachment: TIKA-351.patch This patch also moves MediaTypeTest.java from tika-parsers to tika-core, since

[jira] Created: (TIKA-352) Use MediaType.parse when extracting charset from content-type metadata in parsers

2009-12-15 Thread Ken Krugler (JIRA)
Use MediaType.parse when extracting charset from content-type metadata in parsers - Key: TIKA-352 URL: https://issues.apache.org/jira/browse/TIKA-352 Project: Tika

[jira] Updated: (TIKA-352) Use MediaType.parse when extracting charset from content-type metadata in parsers

2009-12-15 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ken Krugler updated TIKA-352: - Attachment: TIKA-352.patch Use MediaType.parse when extracting charset from content-type metadata in

[jira] Commented: (TIKA-344) Charset hint in metadata

2009-12-12 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789806#action_12789806 ] Ken Krugler commented on TIKA-344: -- It would be useful for various detectors of charset

[jira] Commented: (LUCENE-1343) A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.

2009-12-06 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786712#action_12786712 ] Ken Krugler commented on LUCENE-1343: - Just to make sure this point doesn't get lost

[jira] Commented: (TIKA-340) Provide full Tika bundle

2009-12-02 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784839#action_12784839 ] Ken Krugler commented on TIKA-340: -- Funny, I was just looking at the size of the Hadoop job

[jira] Updated: (TIKA-332) Use http-equiv meta tag charset info when processing HTML documents

2009-12-02 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ken Krugler updated TIKA-332: - Attachment: TIKA-332-2.patch Additional cleanup to new test, plus others - include head tags around

[jira] Updated: (TIKA-341) Use charset in CONTENT_TYPE metadata when detecting the character encoding

2009-12-02 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ken Krugler updated TIKA-341: - Priority: Minor (was: Major) Use charset in CONTENT_TYPE metadata when detecting the character encoding

[jira] Created: (TIKA-341) Use charset in CONTENT_TYPE metadata when detecting the character encoding

2009-12-02 Thread Ken Krugler (JIRA)
Use charset in CONTENT_TYPE metadata when detecting the character encoding -- Key: TIKA-341 URL: https://issues.apache.org/jira/browse/TIKA-341 Project: Tika Issue

[jira] Updated: (TIKA-341) Use charset in CONTENT_TYPE metadata when detecting the character encoding

2009-12-02 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ken Krugler updated TIKA-341: - Attachment: TIKA-341.patch Use charset in CONTENT_TYPE metadata when detecting the character encoding

[jira] Commented: (TIKA-339) HtmlParser & TXTParser should not use language returned by CharsetDetector if language hint has been provided

2009-12-01 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784442#action_12784442 ] Ken Krugler commented on TIKA-339: -- There's another issue here. If you add the detected

[jira] Updated: (TIKA-335) TXTParser should use incoming charset

2009-11-30 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ken Krugler updated TIKA-335: - Attachment: TIKA-335-2.patch Minor improvement to test case - avoid use of UTF-8 chars in strings (use

[jira] Created: (TIKA-333) Improve accuracy of charset detection for HTML pages

2009-11-25 Thread Ken Krugler (JIRA)
Improve accuracy of charset detection for HTML pages Key: TIKA-333 URL: https://issues.apache.org/jira/browse/TIKA-333 Project: Tika Issue Type: Improvement Affects Versions: 0.5

[jira] Commented: (TIKA-332) Use http-equiv meta tag charset info when processing HTML documents

2009-11-25 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782550#action_12782550 ] Ken Krugler commented on TIKA-332: -- It turns out the HtmlParser code doesn't even use the

[jira] Closed: (TIKA-333) Improve accuracy of charset detection for HTML pages

2009-11-25 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ken Krugler closed TIKA-333. Resolution: Not A Problem In actually walking the parse code, I see that the real problem is that the

[jira] Updated: (TIKA-332) Use http-equiv meta tag charset info when processing HTML documents

2009-11-25 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ken Krugler updated TIKA-332: - Description: Currently Tika doesn't use the charset info that's optionally present in HTML documents, via

[jira] Created: (TIKA-334) HtmlParser should use CharsetDetector whenever no charset is specified via meta http-equiv tag

2009-11-25 Thread Ken Krugler (JIRA)
HtmlParser should use CharsetDetector whenever no charset is specified via meta http-equiv tag -- Key: TIKA-334 URL: https://issues.apache.org/jira/browse/TIKA-334

[jira] Created: (TIKA-335) TXTParser use of CharsetDetector has several bugs

2009-11-25 Thread Ken Krugler (JIRA)
TXTParser use of CharsetDetector has several bugs - Key: TIKA-335 URL: https://issues.apache.org/jira/browse/TIKA-335 Project: Tika Issue Type: Bug Affects Versions: 0.5

[jira] Updated: (TIKA-334) HtmlParser should use CharsetDetector whenever no charset is specified via meta http-equiv tag

2009-11-25 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ken Krugler updated TIKA-334: - Attachment: TIKA-334.patch HtmlParser should use CharsetDetector whenever no charset is specified via

[jira] Updated: (TIKA-335) TXTParser should use incoming charset

2009-11-25 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ken Krugler updated TIKA-335: - Attachment: TIKA-335.patch This patch also cleans up some generics warnings (sorry about mixing the two, I

[jira] Commented: (TIKA-331) Windings font recognition in Tika parsing + spacing issue

2009-11-24 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782100#action_12782100 ] Ken Krugler commented on TIKA-331: -- I believe this is an issue for the PDF parser (PDFBox)

[jira] Commented: (TIKA-295) Rough cut of mbox parser

2009-10-14 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12765578#action_12765578 ] Ken Krugler commented on TIKA-295: -- Hi Thilo - I also looked at mstor, but trying to figure

[jira] Commented: (TIKA-295) Rough cut of mbox parser

2009-10-14 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12765579#action_12765579 ] Ken Krugler commented on TIKA-295: -- Hi Alex - thanks for looking into the formatting issues.

[jira] Commented: (TIKA-298) CompositeParser.getParser() should use mimetype hierarchy when falling back

2009-10-11 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12764471#action_12764471 ] Ken Krugler commented on TIKA-298: -- Jukka said on the mailing list:

[jira] Commented: (TIKA-295) Rough cut of mbox parser

2009-10-11 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12764475#action_12764475 ] Ken Krugler commented on TIKA-295: -- Hi Jukka, Is there an Eclipse formatter file that

[jira] Commented: (TIKA-288) Support override parsers in AutoDetectParser

2009-10-11 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12764477#action_12764477 ] Ken Krugler commented on TIKA-288: -- Hi Jukka, If overriding in TikaConfig, would you

[jira] Updated: (TIKA-295) Rough cut of mbox parser

2009-09-28 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ken Krugler updated TIKA-295: - Attachment: tika-295.patch Rough cut of mbox parser Key:

[jira] Updated: (TIKA-296) Automatically set the supertype for +xml mimetypes

2009-09-28 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ken Krugler updated TIKA-296: - Attachment: tika-296.patch Automatically set the supertype for +xml mimetypes

[jira] Commented: (TIKA-285) Update media type registry to the latest httpd mime type database

2009-09-27 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12760023#action_12760023 ] Ken Krugler commented on TIKA-285: -- The file command line utility also has a pretty good set

[jira] Created: (TIKA-287) HtmlParser should resolve relative paths in <a href="xxx"> elements

2009-09-27 Thread Ken Krugler (JIRA)
HtmlParser should resolve relative paths in <a href="xxx"> elements --- Key: TIKA-287 URL: https://issues.apache.org/jira/browse/TIKA-287 Project: Tika Issue Type: Improvement

[jira] Commented: (NUTCH-751) Upgrade version of HttpClient

2009-09-09 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12753069#action_12753069 ] Ken Krugler commented on NUTCH-751: --- I'm using HttpClient 4.0 in Bixo, and I agree that

[jira] Commented: (SOLR-1301) Solr + Hadoop

2009-07-31 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12737568#action_12737568 ] Ken Krugler commented on SOLR-1301: --- Hi Jason, Re Katta, you're right that it doesn't

[jira] Commented: (NUTCH-731) Redirection of robots.txt in RobotRulesParser

2009-06-20 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722242#action_12722242 ] Ken Krugler commented on NUTCH-731: --- This is definitely an issue - I've been pinging

[jira] Commented: (NUTCH-739) SolrDeleteDuplications too slow when using hadoop

2009-05-28 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714277#action_12714277 ] Ken Krugler commented on NUTCH-739: --- There's another approach that works well here, and

[jira] Commented: (SOLR-1044) Use Hadoop RPC for inter Solr communication

2009-03-02 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/SOLR-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12678108#action_12678108 ] Ken Krugler commented on SOLR-1044: --- I agree with both of Yonik's points: # We'd first

[jira] Commented: (LUCENE-1343) A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.

2008-08-14 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12622746#action_12622746 ] Ken Krugler commented on LUCENE-1343: - Hi Robert, So given that you and the Unicode

[jira] Commented: (LUCENE-1343) A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.

2008-08-13 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12622432#action_12622432 ] Ken Krugler commented on LUCENE-1343: - Hi Robert, FWIW, the issues being discussed

[jira] Commented: (NUTCH-25) needs 'character encoding' detector

2007-05-21 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12497525 ] Ken Krugler commented on NUTCH-25: -- I use [ICU|http://krugle.com/kse/projects/BYfaaku] for most issues like this.

[jira] Commented: (SOLR-69) PATCH:MoreLikeThis support

2007-05-02 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/SOLR-69?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12493198 ] Ken Krugler commented on SOLR-69: - Ryan & Brian's comments above are (I think) indicative of how most people want to

[jira] Commented: (SOLR-214) deficit of InputStreamReader support in anonymous class of ContentStream

2007-04-25 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/SOLR-214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12491746 ] Ken Krugler commented on SOLR-214: -- There's some complex interplay of the content-type in the request, the charset

[jira] Commented: (NUTCH-353) pages that serverside forwards will be refetched every time

2007-01-20 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12466260 ] Ken Krugler commented on NUTCH-353: --- Another small note about this (see NUTCH-411 for a related but different

[jira] Commented: (NUTCH-353) pages that serverside forwards will be refetched every time

2007-01-20 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12466261 ] Ken Krugler commented on NUTCH-353: --- Wait, looks like maybe change 490607 (fix for NUTCH-273) might fix the issue I

[jira] Commented: (NUTCH-385) Server delay feature conflicts with maxThreadsPerHost

2006-10-23 Thread Ken Krugler (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-385?page=comments#action_12444162 ] Ken Krugler commented on NUTCH-385: --- There is a middle ground, though we don't know yet how important it is to address. When we crawl partner sites, we

[jira] Commented: (NUTCH-353) pages that serverside forwards will be refetched every time

2006-10-02 Thread Ken Krugler (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-353?page=comments#action_12439304 ] Ken Krugler commented on NUTCH-353: --- +1 that the redirect target is not always the real URL that we want to keep. For example,

[jira] Created: (SOLR-9) In would be handy if StandardRequestHandler returned an error when a query requested sorting on a non-indexed field

2006-04-13 Thread Ken Krugler (JIRA)
In would be handy if StandardRequestHandler returned an error when a query requested sorting on a non-indexed field --- Key: SOLR-9 URL:

[jira] Commented: (NUTCH-230) OPIC score for outlinks should be based on # of valid links, not total # of links.

2006-03-14 Thread Ken Krugler (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-230?page=comments#action_12370424 ] Ken Krugler commented on NUTCH-230: --- So Doug beat me to this comment :) I was going to describe the two cases we'd run into... 1. There's a great page, but most of the

[jira] Created: (NUTCH-230) OPIC score for outlinks should be based on # of valid links, not total # of links.

2006-03-13 Thread Ken Krugler (JIRA)
OPIC score for outlinks should be based on # of valid links, not total # of links. -- Key: NUTCH-230 URL: http://issues.apache.org/jira/browse/NUTCH-230 Project: Nutch Type: Improvement