[
https://issues.apache.org/jira/browse/SOLR-4451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13739687#comment-13739687
]
Ken Krugler commented on SOLR-4451:
---
Grant Ingersoll yes we got it to work (this was in
[
https://issues.apache.org/jira/browse/SOLR-4451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13631321#comment-13631321
]
Ken Krugler commented on SOLR-4451:
---
One of my developers also ran into what seems like
[
https://issues.apache.org/jira/browse/SOLR-4451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13631326#comment-13631326
]
Ken Krugler commented on SOLR-4451:
---
One related question - if I'm using embedded Solr,
[
https://issues.apache.org/jira/browse/TIKA-420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ken Krugler reassigned TIKA-420:
Assignee: Ken Krugler
[PATCH] Integration of boilerpipe: Boilerplate Removal and Fulltext
[
https://issues.apache.org/jira/browse/TIKA-420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12865302#action_12865302
]
Ken Krugler commented on TIKA-420:
--
Hi Christian,
I'll take a look at the patch, and also
[
https://issues.apache.org/jira/browse/NUTCH-706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12851923#action_12851923
]
Ken Krugler commented on NUTCH-706:
---
Two comments about this:
1. From my experiences with
[
https://issues.apache.org/jira/browse/TIKA-359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12852075#action_12852075
]
Ken Krugler commented on TIKA-359:
--
Hi Chris,
Sorry for the delay - yes, go ahead and defer
[
https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846424#action_12846424
]
Ken Krugler commented on NUTCH-797:
---
I thought this same issue (relative URL with leading
[
https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846459#action_12846459
]
Ken Krugler commented on NUTCH-797:
---
Agreed re crawler-commons...feels like there's a
[
https://issues.apache.org/jira/browse/TIKA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ken Krugler reassigned TIKA-387:
Assignee: Ken Krugler
htmlparser throws IllegalCharsetNameException
[
https://issues.apache.org/jira/browse/TIKA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ken Krugler closed TIKA-387.
Resolution: Duplicate
I knew I'd seen this before :)
It's a dup of the issue I'd previously filed...see
[
https://issues.apache.org/jira/browse/TIKA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ken Krugler updated TIKA-387:
-
Attachment: CharsetUtils.java
Piotr - thanks for reporting this. I'd run into the same issue, and created
[
https://issues.apache.org/jira/browse/TIKA-354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ken Krugler updated TIKA-354:
-
Attachment: TIKA-354-2.patch
Additional improvement for language identification. This patch has to be
[
https://issues.apache.org/jira/browse/TIKA-354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ken Krugler updated TIKA-354:
-
Attachment: TIKA-354.patch
ProfilingHandler should take a length-limiting parameter
[
https://issues.apache.org/jira/browse/TIKA-381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12835976#action_12835976
]
Ken Krugler commented on TIKA-381:
--
Things have changed w/the switch to TagSoup. Now the
[
https://issues.apache.org/jira/browse/TIKA-379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12834228#action_12834228
]
Ken Krugler commented on TIKA-379:
--
I think this is part of a bigger issue re attributes
[
https://issues.apache.org/jira/browse/TIKA-378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12834464#action_12834464
]
Ken Krugler commented on TIKA-378:
--
Would it be sufficient to add a method that forces
[
https://issues.apache.org/jira/browse/NUTCH-786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12830109#action_12830109
]
Ken Krugler commented on NUTCH-786:
---
Is this something that should also be applied to
[
https://issues.apache.org/jira/browse/TIKA-369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ken Krugler updated TIKA-369:
-
Attachment: (was: dunning94-trimmed.pdf)
Improve accuracy of language detection
[
https://issues.apache.org/jira/browse/TIKA-369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ken Krugler updated TIKA-369:
-
Description:
Currently the LanguageProfile code uses 3-grams to find the best language
profile using
[
https://issues.apache.org/jira/browse/TIKA-369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ken Krugler updated TIKA-369:
-
Attachment: Surprise and Coincidence.pdf
Attaching another paper from Ted that makes it clearer why the
[
https://issues.apache.org/jira/browse/TIKA-370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12804704#action_12804704
]
Ken Krugler commented on TIKA-370:
--
On the list, Jukka said:
{quote}
Yep. I think the
Tika pom.xml is missing dependencies on bouncycastle jars needed by PDFBox
--
Key: TIKA-370
URL: https://issues.apache.org/jira/browse/TIKA-370
Project: Tika
Issue
[
https://issues.apache.org/jira/browse/TIKA-370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12804704#action_12804704
]
Ken Krugler edited comment on TIKA-370 at 1/25/10 8:53 PM:
---
On the
[
https://issues.apache.org/jira/browse/LUCENE-826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12804285#action_12804285
]
Ken Krugler commented on LUCENE-826:
I think Nutch (and eventually Mahout) plan to use
[
https://issues.apache.org/jira/browse/TIKA-354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ken Krugler reassigned TIKA-354:
Assignee: Ken Krugler
ProfilingHandler should take a length-limiting parameter
Improve accuracy of language detection
--
Key: TIKA-369
URL: https://issues.apache.org/jira/browse/TIKA-369
Project: Tika
Issue Type: Improvement
Components: languageidentifier
Affects
[
https://issues.apache.org/jira/browse/TIKA-369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ken Krugler updated TIKA-369:
-
Attachment: dunning94-trimmed.pdf
Improve accuracy of language detection
[
https://issues.apache.org/jira/browse/TIKA-369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12804288#action_12804288
]
Ken Krugler edited comment on TIKA-369 at 1/24/10 7:39 PM:
---
Karl
[
https://issues.apache.org/jira/browse/TIKA-369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12804288#action_12804288
]
Ken Krugler commented on TIKA-369:
--
Karl Wettin had contributed a language detector to
[
https://issues.apache.org/jira/browse/TIKA-357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ken Krugler updated TIKA-357:
-
Attachment: big-preamble.html
TIKA-357-2.patch
TIKA-357-2.patch should be applied on top
[
https://issues.apache.org/jira/browse/NUTCH-751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12798890#action_12798890
]
Ken Krugler commented on NUTCH-751:
---
I agree that this should be in crawler-commons. E.g.
[
https://issues.apache.org/jira/browse/TIKA-359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12798090#action_12798090
]
Ken Krugler commented on TIKA-359:
--
Given the junk that can be found inside of meta
Calls to Charset.isSupported() will throw exceptions for invalid charset names
--
Key: TIKA-359
URL: https://issues.apache.org/jira/browse/TIKA-359
Project: Tika
[
https://issues.apache.org/jira/browse/TIKA-357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ken Krugler updated TIKA-357:
-
Attachment: makler.html
From http://www.makler.su/ - example of file with meta tags more than 4K into
Increase buffer size for meta tag sniffing
--
Key: TIKA-357
URL: https://issues.apache.org/jira/browse/TIKA-357
Project: Tika
Issue Type: Improvement
Affects Versions: 0.6
Reporter:
HtmlParser's http-equiv code needs to be more flexible
--
Key: TIKA-349
URL: https://issues.apache.org/jira/browse/TIKA-349
Project: Tika
Issue Type: Improvement
Affects Versions: 0.6
HtmlParser's content-type handling code needs to be more flexible
-
Key: TIKA-350
URL: https://issues.apache.org/jira/browse/TIKA-350
Project: Tika
Issue Type: Improvement
MediaType.parse should be more forgiving of broken input
Key: TIKA-351
URL: https://issues.apache.org/jira/browse/TIKA-351
Project: Tika
Issue Type: Improvement
Reporter:
[
https://issues.apache.org/jira/browse/TIKA-351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ken Krugler updated TIKA-351:
-
Attachment: TIKA-351.patch
This patch also moves MediaTypeTest.java from tika-parsers to tika-core, since
Use MediaType.parse when extracting charset from content-type metadata in
parsers
-
Key: TIKA-352
URL: https://issues.apache.org/jira/browse/TIKA-352
Project: Tika
[
https://issues.apache.org/jira/browse/TIKA-352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ken Krugler updated TIKA-352:
-
Attachment: TIKA-352.patch
Use MediaType.parse when extracting charset from content-type metadata in
[
https://issues.apache.org/jira/browse/TIKA-344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12789806#action_12789806
]
Ken Krugler commented on TIKA-344:
--
It would be useful for various detectors of charset
[
https://issues.apache.org/jira/browse/LUCENE-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12786712#action_12786712
]
Ken Krugler commented on LUCENE-1343:
-
Just to make sure this point doesn't get lost
[
https://issues.apache.org/jira/browse/TIKA-340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12784839#action_12784839
]
Ken Krugler commented on TIKA-340:
--
Funny, I was just looking at the size of the Hadoop job
[
https://issues.apache.org/jira/browse/TIKA-332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ken Krugler updated TIKA-332:
-
Attachment: TIKA-332-2.patch
Additional cleanup to new test, plus others - include head tags around
[
https://issues.apache.org/jira/browse/TIKA-341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ken Krugler updated TIKA-341:
-
Priority: Minor (was: Major)
Use charset in CONTENT_TYPE metadata when detecting the character encoding
Use charset in CONTENT_TYPE metadata when detecting the character encoding
--
Key: TIKA-341
URL: https://issues.apache.org/jira/browse/TIKA-341
Project: Tika
Issue
[
https://issues.apache.org/jira/browse/TIKA-341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ken Krugler updated TIKA-341:
-
Attachment: TIKA-341.patch
Use charset in CONTENT_TYPE metadata when detecting the character encoding
[
https://issues.apache.org/jira/browse/TIKA-339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12784442#action_12784442
]
Ken Krugler commented on TIKA-339:
--
There's another issue here. If you add the detected
[
https://issues.apache.org/jira/browse/TIKA-335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ken Krugler updated TIKA-335:
-
Attachment: TIKA-335-2.patch
Minor improvement to test case - avoid use of UTF-8 chars in strings (use
Improve accuracy of charset detection for HTML pages
Key: TIKA-333
URL: https://issues.apache.org/jira/browse/TIKA-333
Project: Tika
Issue Type: Improvement
Affects Versions: 0.5
[
https://issues.apache.org/jira/browse/TIKA-332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12782550#action_12782550
]
Ken Krugler commented on TIKA-332:
--
It turns out the HtmlParser code doesn't even use the
[
https://issues.apache.org/jira/browse/TIKA-333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ken Krugler closed TIKA-333.
Resolution: Not A Problem
In actually walking the parse code, I see that the real problem is that the
[
https://issues.apache.org/jira/browse/TIKA-332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ken Krugler updated TIKA-332:
-
Description:
Currently Tika doesn't use the charset info that's optionally present in HTML
documents, via
HtmlParser should use CharsetDetector whenever no charset is specified via meta
http-equiv tag
--
Key: TIKA-334
URL: https://issues.apache.org/jira/browse/TIKA-334
TXTParser use of CharsetDetector has several bugs
-
Key: TIKA-335
URL: https://issues.apache.org/jira/browse/TIKA-335
Project: Tika
Issue Type: Bug
Affects Versions: 0.5
[
https://issues.apache.org/jira/browse/TIKA-334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ken Krugler updated TIKA-334:
-
Attachment: TIKA-334.patch
HtmlParser should use CharsetDetector whenever no charset is specified via
[
https://issues.apache.org/jira/browse/TIKA-335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ken Krugler updated TIKA-335:
-
Attachment: TIKA-335.patch
This patch also cleans up some generics warnings (sorry about mixing the two, I
[
https://issues.apache.org/jira/browse/TIKA-331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12782100#action_12782100
]
Ken Krugler commented on TIKA-331:
--
I believe this is an issue for the PDF parser (PDFBox)
[
https://issues.apache.org/jira/browse/TIKA-295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12765578#action_12765578
]
Ken Krugler commented on TIKA-295:
--
Hi Thilo - I also looked at mstor, but trying to figure
[
https://issues.apache.org/jira/browse/TIKA-295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12765579#action_12765579
]
Ken Krugler commented on TIKA-295:
--
Hi Alex - thanks for looking into the formatting issues.
[
https://issues.apache.org/jira/browse/TIKA-298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12764471#action_12764471
]
Ken Krugler commented on TIKA-298:
--
Jukka said on the mailing list:
[
https://issues.apache.org/jira/browse/TIKA-295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12764475#action_12764475
]
Ken Krugler commented on TIKA-295:
--
Hi Jukka,
Is there an Eclipse formatter file that
[
https://issues.apache.org/jira/browse/TIKA-288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12764477#action_12764477
]
Ken Krugler commented on TIKA-288:
--
Hi Jukka,
If overriding in TikaConfig, would you
[
https://issues.apache.org/jira/browse/TIKA-295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ken Krugler updated TIKA-295:
-
Attachment: tika-295.patch
Rough cut of mbox parser
Key:
[
https://issues.apache.org/jira/browse/TIKA-296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ken Krugler updated TIKA-296:
-
Attachment: tika-296.patch
Automatically set the supertype for +xml mimetypes
[
https://issues.apache.org/jira/browse/TIKA-285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12760023#action_12760023
]
Ken Krugler commented on TIKA-285:
--
The file command line utility also has a pretty good set
HtmlParser should resolve relative paths in a href=xxx elements
---
Key: TIKA-287
URL: https://issues.apache.org/jira/browse/TIKA-287
Project: Tika
Issue Type: Improvement
[
https://issues.apache.org/jira/browse/NUTCH-751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12753069#action_12753069
]
Ken Krugler commented on NUTCH-751:
---
I'm using HttpClient 4.0 in Bixo, and I agree that
[
https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12737568#action_12737568
]
Ken Krugler commented on SOLR-1301:
---
Hi Jason,
Re Katta, you're right that it doesn't
[
https://issues.apache.org/jira/browse/NUTCH-731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12722242#action_12722242
]
Ken Krugler commented on NUTCH-731:
---
This is definitely an issue - I've been pinging
[
https://issues.apache.org/jira/browse/NUTCH-739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12714277#action_12714277
]
Ken Krugler commented on NUTCH-739:
---
There's another approach that works well here, and
[
https://issues.apache.org/jira/browse/SOLR-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12678108#action_12678108
]
Ken Krugler commented on SOLR-1044:
---
I agree with both of Yonik's points:
# We'd first
[
https://issues.apache.org/jira/browse/LUCENE-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12622746#action_12622746
]
Ken Krugler commented on LUCENE-1343:
-
Hi Robert,
So given that you and the Unicode
[
https://issues.apache.org/jira/browse/LUCENE-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12622432#action_12622432
]
Ken Krugler commented on LUCENE-1343:
-
Hi Robert,
FWIW, the issues being discussed
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12497525
]
Ken Krugler commented on NUTCH-25:
--
I use [ICU|http://krugle.com/kse/projects/BYfaaku] for most issues like this.
[
https://issues.apache.org/jira/browse/SOLR-69?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12493198
]
Ken Krugler commented on SOLR-69:
-
Ryan & Brian's comments above are (I think) indicative of how most people want
to
[
https://issues.apache.org/jira/browse/SOLR-214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12491746
]
Ken Krugler commented on SOLR-214:
--
There's some complex interplay of the content-type in the request, the charset
[
https://issues.apache.org/jira/browse/NUTCH-353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12466260
]
Ken Krugler commented on NUTCH-353:
---
Another small note about this (see NUTCH-411 for a related but different
[
https://issues.apache.org/jira/browse/NUTCH-353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12466261
]
Ken Krugler commented on NUTCH-353:
---
Wait, looks like maybe change 490607 (fix for NUTCH-273) might fix the issue I
[
http://issues.apache.org/jira/browse/NUTCH-385?page=comments#action_12444162 ]
Ken Krugler commented on NUTCH-385:
---
There is a middle ground, though we don't know yet how important it is to
address.
When we crawl partner sites, we
[
http://issues.apache.org/jira/browse/NUTCH-353?page=comments#action_12439304 ]
Ken Krugler commented on NUTCH-353:
---
+1 that the redirect target is not always the real URL that we want to keep.
For example,
It would be handy if StandardRequestHandler returned an error when a query
requested sorting on a non-indexed field
---
Key: SOLR-9
URL:
[
http://issues.apache.org/jira/browse/NUTCH-230?page=comments#action_12370424 ]
Ken Krugler commented on NUTCH-230:
---
So Doug beat me to this comment :)
I was going to describe the two cases we'd run into...
1. There's a great page, but most of the
OPIC score for outlinks should be based on # of valid links, not total # of
links.
--
Key: NUTCH-230
URL: http://issues.apache.org/jira/browse/NUTCH-230
Project: Nutch
Type: Improvement