[jira] [Commented] (TIKA-1599) Switch from TagSoup to JSoup

2019-08-23 Thread Ken Krugler (Jira)
[ https://issues.apache.org/jira/browse/TIKA-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16914482#comment-16914482 ] Ken Krugler commented on TIKA-1599: --- From TIKA-2928, an example of text that fails with TagSoup but

[jira] [Updated] (TIKA-1599) Switch from TagSoup to JSoup

2019-08-23 Thread Ken Krugler (Jira)
[ https://issues.apache.org/jira/browse/TIKA-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ken Krugler updated TIKA-1599: -- Priority: Major (was: Minor) > Switch from TagSoup to JSoup > > >

[jira] [Commented] (TIKA-2928) Less than sign within tag boundaries considered as start of a new tag.

2019-08-23 Thread Ken Krugler (Jira)
[ https://issues.apache.org/jira/browse/TIKA-2928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16914481#comment-16914481 ] Ken Krugler commented on TIKA-2928: --- Hi [~Sargent_D] - thanks for trying this out! I'm going to bump the

[jira] [Updated] (TIKA-2928) Less than sign within tag boundaries considered as start of a new tag.

2019-08-22 Thread Ken Krugler (Jira)
[ https://issues.apache.org/jira/browse/TIKA-2928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ken Krugler updated TIKA-2928: -- Issue Type: Improvement (was: Bug) Priority: Minor (was: Major) > Less than sign within tag

[jira] [Commented] (TIKA-2928) Less than sign within tag boundaries considered as start of a new tag.

2019-08-22 Thread Ken Krugler (Jira)
[ https://issues.apache.org/jira/browse/TIKA-2928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16913382#comment-16913382 ] Ken Krugler commented on TIKA-2928: --- The issue isn't that this is "somewhat non-standard" HTML - it's

[jira] [Commented] (TIKA-2790) Consider switching lang-detection in tika-eval to open-nlp

2019-06-20 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16869004#comment-16869004 ] Ken Krugler commented on TIKA-2790: --- Hi [~talli...@apache.org] - I finally got around to looking at your

[jira] [Commented] (TIKA-2790) Consider switching lang-detection in tika-eval to open-nlp

2019-06-04 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16856107#comment-16856107 ] Ken Krugler commented on TIKA-2790: --- [~talli...@apache.org] - I'd have to look at the code used to

[jira] [Commented] (TIKA-2790) Consider switching lang-detection in tika-eval to open-nlp

2019-06-04 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16856052#comment-16856052 ] Ken Krugler commented on TIKA-2790: --- Yalder processes the entire string. I thought Optimaize's version

[jira] [Commented] (TIKA-2790) Consider switching lang-detection in tika-eval to open-nlp

2019-05-09 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16836738#comment-16836738 ] Ken Krugler commented on TIKA-2790: --- Hi [~talli...@apache.org] - thanks for running the comparisons.

[jira] [Commented] (TIKA-2849) TikaInputStream copies the input stream locally

2019-04-08 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16812492#comment-16812492 ] Ken Krugler commented on TIKA-2849: --- Hi [~boris-petrov] - two things here. First, do you have the call

[jira] [Commented] (TIKA-2794) Tika extracts text from pdf on MacBook, but not windows server.,

2018-12-05 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16710767#comment-16710767 ] Ken Krugler commented on TIKA-2794: --- Hi [~phallett] - it's better if you first post something like this

[jira] [Commented] (TIKA-2790) Consider switching lang-detection in tika-eval to open-nlp

2018-12-03 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16707521#comment-16707521 ] Ken Krugler commented on TIKA-2790: --- Yalder is about 2-2.5x faster than language-detector, depending on

[jira] [Commented] (TIKA-2790) Consider switching lang-detection in tika-eval to open-nlp

2018-12-03 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16707343#comment-16707343 ] Ken Krugler commented on TIKA-2790: --- My concern with OpenNLP is that during a web crawl, even with the

[jira] [Commented] (TIKA-2790) Consider switching lang-detection in tika-eval to open-nlp

2018-12-03 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16707292#comment-16707292 ] Ken Krugler commented on TIKA-2790: --- Hi [~talli...@apache.org] - Is there an issue with the current

[jira] [Commented] (TIKA-2758) Possible error charset detection

2018-10-20 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16658028#comment-16658028 ] Ken Krugler commented on TIKA-2758: --- [~markus17] - My comment above was about the previous change (from

[jira] [Comment Edited] (TIKA-2758) Possible error charset detection

2018-10-20 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16657976#comment-16657976 ] Ken Krugler edited comment on TIKA-2758 at 10/20/18 7:51 PM: - At least for the

[jira] [Commented] (TIKA-2758) Possible error charset detection

2018-10-20 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16657976#comment-16657976 ] Ken Krugler commented on TIKA-2758: --- At least for the "detroidnews.html" file, I believe the reason why

[jira] [Resolved] (TIKA-2683) Missing space and inappropriate new-line in Boilerpipe extracted text

2018-07-18 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ken Krugler resolved TIKA-2683. --- Resolution: Fixed Fixed via [PR

[jira] [Assigned] (TIKA-2683) Missing space and inappropriate new-line in Boilerpipe extracted text

2018-07-18 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ken Krugler reassigned TIKA-2683: - Assignee: Ken Krugler > Missing space and inappropriate new-line in Boilerpipe extracted text >

[jira] [Commented] (TIKA-2648) mime detection based on resource name detects resources as "text/x-php" instead of "text/html"

2018-07-08 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16536396#comment-16536396 ] Ken Krugler commented on TIKA-2648: --- [~wastl-nagel] - you mentioned that you thought this solution was

[jira] [Updated] (TIKA-2671) HtmlEncodingDetector doesnt take provided metadata into account

2018-06-18 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ken Krugler updated TIKA-2671: -- Description: org.apache.tika.parser.html.HtmlEncodingDetector ignores the document's metadata. So when

[jira] [Updated] (TIKA-2671) HtmlEncodingDetector doesnt take provided metadata into account

2018-06-18 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ken Krugler updated TIKA-2671: -- Component/s: detector > HtmlEncodingDetector doesnt take provided metadata into account >

[jira] [Commented] (TIKA-2671) HtmlEncodingDetector doesnt take provided metadata into account

2018-06-18 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16516644#comment-16516644 ] Ken Krugler commented on TIKA-2671: --- Hi [~gbouchar] - I'm curious how much testing you did, and with

[jira] [Commented] (TIKA-2671) HtmlEncodingDetector doesnt take provided metadata into account

2018-06-15 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16514355#comment-16514355 ] Ken Krugler commented on TIKA-2671: --- Unfortunately there's no great solution here. Ideally we'd have a

[jira] [Commented] (TIKA-2654) Installation issue

2018-05-29 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16493927#comment-16493927 ] Ken Krugler commented on TIKA-2654: --- Hi Ankit - for problems encountered while building/using Tika, it's

[jira] [Commented] (TIKA-2643) Tika call hangs when processes a pdf on Cloudera Hadoop

2018-05-21 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16482586#comment-16482586 ] Ken Krugler commented on TIKA-2643: --- When you've got conflicting jars on the classpath, you often run

[jira] [Commented] (TIKA-2643) Tika call hangs when processes a pdf on Cloudera Hadoop

2018-05-19 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16481791#comment-16481791 ] Ken Krugler commented on TIKA-2643: --- Looking at the crash log, I see the following duplicate jars

[jira] [Commented] (TIKA-2643) Tika call hangs when processes a pdf on Cloudera Hadoop

2018-05-19 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16481786#comment-16481786 ] Ken Krugler commented on TIKA-2643: --- Hi [~fyemaple] - how do you know that Tika 1.5 (or any of the jars

[jira] [Commented] (TIKA-2643) Tika call hangs when processes a pdf on Cloudera Hadoop

2018-05-17 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16479468#comment-16479468 ] Ken Krugler commented on TIKA-2643: --- [~fyemaple] - yes, but note that {{kill -QUIT doesn't kill the

[jira] [Commented] (TIKA-2643) Tika call hangs when processes a pdf on Cloudera Hadoop

2018-05-16 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16477811#comment-16477811 ] Ken Krugler commented on TIKA-2643: --- [~talli...@apache.org] - different versions of framework jars, I'd

[jira] [Commented] (TIKA-2643) Tika call hangs when processes a pdf on Cloudera Hadoop

2018-05-16 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16477513#comment-16477513 ] Ken Krugler commented on TIKA-2643: --- If I was going to guess, it's that your Cloudera installation has

[jira] [Commented] (TIKA-2592) HTML with charset unicode handled as utf-16 instead utf-8

2018-03-02 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16384242#comment-16384242 ] Ken Krugler commented on TIKA-2592: --- [~AndreasMeier] - I assume when you said: {quote}I don't think we

[jira] [Updated] (TIKA-2592) HTML with charset unicode handled as utf-16 instead utf-8

2018-03-02 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ken Krugler updated TIKA-2592: -- Attachment: IANA Charset names.txt > HTML with charset unicode handled as utf-16 instead utf-8 >

[jira] [Updated] (TIKA-2592) HTML with charset unicode handled as utf-16 instead utf-8

2018-03-02 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ken Krugler updated TIKA-2592: -- Priority: Minor (was: Major) > HTML with charset unicode handled as utf-16 instead utf-8 >

[jira] [Updated] (TIKA-2592) HTML with charset unicode handled as utf-16 instead utf-8

2018-03-02 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ken Krugler updated TIKA-2592: -- Issue Type: Improvement (was: Bug) > HTML with charset unicode handled as utf-16 instead utf-8 >

[jira] [Commented] (TIKA-2592) HTML with charset unicode handled as utf-16 instead utf-8

2018-03-01 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382330#comment-16382330 ] Ken Krugler commented on TIKA-2592: --- Before making this kind of change (default "unicode" to UTF-8), 

[jira] [Commented] (TIKA-2592) HTML with charset unicode handled as utf-16 instead utf-8

2018-02-28 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16380874#comment-16380874 ] Ken Krugler commented on TIKA-2592: --- Hi [~AndreasMeier] - actually "unicode" is a supported charset name

[jira] [Commented] (TIKA-2576) Add application/zstd detection and parser

2018-02-27 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16379747#comment-16379747 ] Ken Krugler commented on TIKA-2576: --- [~talli...@mitre.org] - After some grepping, I found the Jira issue

[jira] [Commented] (TIKA-2576) Add application/zstd detection and parser

2018-02-26 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16377744#comment-16377744 ] Ken Krugler commented on TIKA-2576: --- Is this going to trigger more warnings in the logs? :) {code:java}

[jira] [Resolved] (TIKA-2539) TagSoup HTML parser is project EOL

2018-01-05 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ken Krugler resolved TIKA-2539. --- Resolution: Duplicate > TagSoup HTML parser is project EOL > -- > >

[jira] [Commented] (TIKA-2478) MBOX import includes redundant copies of the text

2017-10-23 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16215838#comment-16215838 ] Ken Krugler commented on TIKA-2478: --- Hi [~talli...@apache.org] - I've attached two mixed examples I'd

[jira] [Updated] (TIKA-2478) MBOX import includes redundant copies of the text

2017-10-23 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ken Krugler updated TIKA-2478: -- Attachment: mixed-simple mixed-with-pdf-inline > MBOX import includes redundant copies

[jira] [Commented] (TIKA-2478) MBOX import includes redundant copies of the text

2017-10-22 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16214491#comment-16214491 ] Ken Krugler commented on TIKA-2478: --- I recently had to dig into extracting text from emails, and it isn't

[jira] [Commented] (TIKA-2471) Tab-prefixed message body lines in Mbox interpreted as headers

2017-10-20 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16213150#comment-16213150 ] Ken Krugler commented on TIKA-2471: --- Hi [~talli...@apache.org] - I don't think using MBoxIterator is the

[jira] [Commented] (TIKA-2482) java.lang.NoSuchMethodError at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:124)

2017-10-20 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16212870#comment-16212870 ] Ken Krugler commented on TIKA-2482: --- Hi [~cermar] - in general it's best to first post this type of issue

[jira] [Commented] (TIKA-2472) Implement Metadata.hashCode

2017-10-06 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16195386#comment-16195386 ] Ken Krugler commented on TIKA-2472: --- I had to deal with this before in another project - FWIR, I

[jira] [Commented] (TIKA-2056) Installing exiftool causes ForkParserIntegration test errors

2016-08-16 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15423280#comment-15423280 ] Ken Krugler commented on TIKA-2056: --- Hi [~chrismattmann] - I haven't actually dealt with the ForkParser

[jira] [Updated] (TIKA-2038) A more accurate facility for detecting Charset Encoding of HTML documents

2016-07-21 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ken Krugler updated TIKA-2038: -- Description: Currently, Tika uses icu4j for detecting charset encoding of HTML documents as well as the

[jira] [Commented] (TIKA-2033) Value attributes of input elements not extracted from HTML

2016-07-14 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15378434#comment-15378434 ] Ken Krugler commented on TIKA-2033: --- Yes, of course...I was thinking of whether we'd want to extract it

[jira] [Commented] (TIKA-2033) Value attributes of input elements not extracted from HTML

2016-07-14 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15378358#comment-15378358 ] Ken Krugler commented on TIKA-2033: --- Do you have a suggestion for how the text should appear in the

[jira] [Commented] (TIKA-2010) Unable to get value when header is incorrect

2016-06-15 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15331829#comment-15331829 ] Ken Krugler commented on TIKA-2010: --- Would it be possible for you to try this broken HTML with JSoup?

[jira] [Closed] (TIKA-1938) HtmlParser drops

2016-05-10 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ken Krugler closed TIKA-1938. - Resolution: Fixed Fix with commit da5bbbe..46d5775. Thanks Joseph! > HtmlParser drops elements found

[jira] [Assigned] (TIKA-1938) HtmlParser drops

2016-05-10 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ken Krugler reassigned TIKA-1938: - Assignee: Ken Krugler > HtmlParser drops elements found inside >

[jira] [Commented] (TIKA-1835) LinkContentHandler skips iframe and rel tags

2016-04-05 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15227078#comment-15227078 ] Ken Krugler commented on TIKA-1835: --- I’d rolled in Markus’s patch directly to support these other link

[jira] [Commented] (TIKA-1896) Invalid closing script tag not handled gracefully by HtmlParser

2016-03-30 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15218412#comment-15218412 ] Ken Krugler commented on TIKA-1896: --- Hi Tim - hmm, changing the type of the script tag from cdata to

[jira] [Commented] (TIKA-1855) TIka 2.0 - Move shared test-code back to tika-core and distribute test files to parser modules

2016-02-25 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15167891#comment-15167891 ] Ken Krugler commented on TIKA-1855: --- The things I don't like about this approach are that (a) core

[jira] [Commented] (TIKA-1855) TIka 2.0 - Move shared test-code back to tika-core and distribute test files to parser modules

2016-02-24 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15165642#comment-15165642 ] Ken Krugler commented on TIKA-1855: --- I'm ok with having some duplicated test files - though for most of

[jira] [Commented] (TIKA-1858) Unable to extract content from chunked portion of large file

2016-02-17 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15150618#comment-15150618 ] Ken Krugler commented on TIKA-1858: --- Hi Raghu, This is a great question for the user mailing list (see

[jira] [Commented] (TIKA-1851) Tika 2.0 - Move test resources from core to test-resources

2016-02-12 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15145135#comment-15145135 ] Ken Krugler commented on TIKA-1851: --- +1 for the proposal. Let me know if you want me to take a swing at

[jira] [Commented] (TIKA-1851) Tika 2.0 - Move test resources from core to test-resources

2016-02-10 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15141632#comment-15141632 ] Ken Krugler commented on TIKA-1851: --- Hi [~talli...@apache.org] - thanks for generating this output.

[jira] [Commented] (TIKA-1851) Tika 2.0 - Move test resources from core to test-resources

2016-02-06 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15136003#comment-15136003 ] Ken Krugler commented on TIKA-1851: --- I got a clean build w/o any pre-installed modules, so much better,

[jira] [Commented] (TIKA-1723) Integrate language-detector into Tika

2016-02-06 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15136077#comment-15136077 ] Ken Krugler commented on TIKA-1723: --- OK, I've committed this code to a new tika-langdetect module in the

[jira] [Commented] (TIKA-1851) Tika 2.0 - Move test resources from core to test-resources

2016-02-06 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15136079#comment-15136079 ] Ken Krugler commented on TIKA-1851: --- After poking around a bit, my vote would be to (a) move the test

[jira] [Commented] (TIKA-1851) Tika 2.0 - Move test resources from core to test-resources

2016-02-05 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15135342#comment-15135342 ] Ken Krugler commented on TIKA-1851: --- Hmm, now the top-level build fails on the tika parser text module,

[jira] [Commented] (TIKA-1851) Tika 2.0 - Move test resources from core to test-resources

2016-02-05 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15135336#comment-15135336 ] Ken Krugler commented on TIKA-1851: --- I did a top-level "mvn clean install", which failed with: [ERROR]

[jira] [Commented] (TIKA-1851) Tika 2.0 - Move test resources from core to test-resources

2016-02-04 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15133624#comment-15133624 ] Ken Krugler commented on TIKA-1851: --- Hi [~talli...@apache.org] - I'm also getting a local build failure

[jira] [Commented] (TIKA-1851) Tika 2.0 - Move test resources from core to test-resources

2016-02-04 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15133629#comment-15133629 ] Ken Krugler commented on TIKA-1851: --- I'm also curious why we have Groovy code and shell scripts inside of

[jira] [Commented] (TIKA-1723) Integrate language-detector into Tika

2016-02-04 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15132961#comment-15132961 ] Ken Krugler commented on TIKA-1723: --- Good idea re gathering input - I just emailed the dev list. >

[jira] [Commented] (TIKA-1723) Integrate language-detector into Tika

2016-02-03 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15130676#comment-15130676 ] Ken Krugler commented on TIKA-1723: --- [~talli...@apache.org] I must admit, focusing on this change in 2.0,

[jira] [Commented] (TIKA-1824) Tika 2.0 - Create Initial Parser Modules

2016-02-03 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15131749#comment-15131749 ] Ken Krugler commented on TIKA-1824: --- As someone who regularly deals with 100s of jars in the dependency

[jira] [Commented] (TIKA-1848) Address issues with Tika 1.12rc#1

2016-02-03 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15130666#comment-15130666 ] Ken Krugler commented on TIKA-1848: --- Unless I'm not understanding the issues properly, I agree with the

[jira] [Resolved] (TIKA-1835) LinkContentHandler skips iframe and rel tags

2016-01-21 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ken Krugler resolved TIKA-1835. --- Resolution: Fixed Git commit 489ab93..fe841bc > LinkContentHandler skips iframe and rel tags >

[jira] [Comment Edited] (TIKA-1835) LinkContentHandler skips iframe and rel tags

2016-01-21 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1558#comment-1558 ] Ken Krugler edited comment on TIKA-1835 at 1/21/16 7:36 PM: Git commit

[jira] [Assigned] (TIKA-1835) LinkContentHandler skips iframe and rel tags

2016-01-21 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ken Krugler reassigned TIKA-1835: - Assignee: Ken Krugler > LinkContentHandler skips iframe and rel tags >

[jira] [Commented] (TIKA-1838) Just a quick question regarding compatibility

2016-01-20 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15109054#comment-15109054 ] Ken Krugler commented on TIKA-1838: --- Hi Raymond - this is a question that you should post on the Tika

[jira] [Commented] (TIKA-1836) Convertion DOC->TXT failed due to POI issue

2016-01-19 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15106908#comment-15106908 ] Ken Krugler commented on TIKA-1836: --- This seems to be an issue for POI, as per the message in the stack

[jira] [Commented] (TIKA-1599) Switch from TagSoup to JSoup

2015-12-09 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048773#comment-15048773 ] Ken Krugler commented on TIKA-1599: --- I'm hoping we could use one or the other, as I don't know how a Tika

[jira] [Commented] (TIKA-1599) Switch from TagSoup to JSoup

2015-12-09 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048806#comment-15048806 ] Ken Krugler commented on TIKA-1599: --- Hi [~markus.jel...@openindex.io] - I was actually talking about how

[jira] [Commented] (TIKA-1599) Switch from TagSoup to JSoup

2015-12-09 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048819#comment-15048819 ] Ken Krugler commented on TIKA-1599: --- I think we'd be wanting to parse the raw crawl results (with both

[jira] [Commented] (TIKA-1808) Head section closed too eager

2015-12-08 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15047029#comment-15047029 ] Ken Krugler commented on TIKA-1808: --- Hi Markus - I don't think this is actually a bug. I created a

[jira] [Commented] (TIKA-1794) TXTParser removes form feed characters

2015-11-16 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15006797#comment-15006797 ] Ken Krugler commented on TIKA-1794: --- Tika uses XHTML 1.0, which doesn't allow the form-feed character.

[jira] [Commented] (TIKA-1794) TXTParser removes form feed characters

2015-11-16 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15006743#comment-15006743 ] Ken Krugler commented on TIKA-1794: --- The output of the Tika parse process is XHTML, and I don't believe a

[jira] [Commented] (TIKA-1443) Add a junk text detector to Tika

2015-10-31 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14984111#comment-14984111 ] Ken Krugler commented on TIKA-1443: --- Hi [~talli...@apache.org] - I did look at it, and realized I wanted

[jira] [Commented] (TIKA-1726) Augment public methods that use a java.io.File with methods that use a java.nio.file.Path

2015-09-21 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901434#comment-14901434 ] Ken Krugler commented on TIKA-1726: --- [~talli...@apache.org] had asked for input on this - I don't have

[jira] [Commented] (TIKA-568) Language Detection isReasonablyCertain() hides valuable information

2015-09-03 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14729414#comment-14729414 ] Ken Krugler commented on TIKA-568: -- The new LanguageDetector API has a getRawScore() call on the result,

[jira] [Commented] (TIKA-1723) Integrate language-detector into Tika

2015-09-03 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14729588#comment-14729588 ] Ken Krugler commented on TIKA-1723: --- Hi Tim, 1. Not sure about "Make language detection configurable via

[jira] [Commented] (TIKA-1723) Integrate language-detector into Tika

2015-09-03 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14729595#comment-14729595 ] Ken Krugler commented on TIKA-1723: --- Biggest remaining issue before I commit is how to deal with language

[jira] [Assigned] (TIKA-568) Language Detection isReasonablyCertain() hides valuable information

2015-09-03 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ken Krugler reassigned TIKA-568: Assignee: Ken Krugler > Language Detection isReasonablyCertain() hides valuable information >

[jira] [Assigned] (TIKA-856) Support CJK (Chinese, Japanese and Korean) language detection

2015-08-29 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ken Krugler reassigned TIKA-856: Assignee: Ken Krugler Support CJK (Chinese, Japanese and Korean) language detection

[jira] [Commented] (TIKA-369) Improve accuracy of language detection

2015-08-29 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14721159#comment-14721159 ] Ken Krugler commented on TIKA-369: -- Initial results from integrating language-detector (see

[jira] [Commented] (TIKA-1723) Integrate language-detector into Tika

2015-08-28 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720702#comment-14720702 ] Ken Krugler commented on TIKA-1723: --- I've also been thinking about how to use lang=xx and

[jira] [Created] (TIKA-1723) Integrate language-detector into Tika

2015-08-27 Thread Ken Krugler (JIRA)
Ken Krugler created TIKA-1723: - Summary: Integrate language-detector into Tika Key: TIKA-1723 URL: https://issues.apache.org/jira/browse/TIKA-1723 Project: Tika Issue Type: Improvement

[jira] [Updated] (TIKA-1723) Integrate language-detector into Tika

2015-08-27 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ken Krugler updated TIKA-1723: -- Attachment: TIKA-1723.patch Integrate language-detector into Tika

[jira] [Commented] (TIKA-1723) Integrate language-detector into Tika

2015-08-27 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14717772#comment-14717772 ] Ken Krugler commented on TIKA-1723: --- The above work added the language-detector

[jira] [Updated] (TIKA-1723) Integrate language-detector into Tika

2015-08-27 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ken Krugler updated TIKA-1723: -- Component/s: languageidentifier Integrate language-detector into Tika

[jira] [Commented] (TIKA-1696) Language Identification with Text Processing Toolkit from MITLL

2015-07-23 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14639257#comment-14639257 ] Ken Krugler commented on TIKA-1696: --- Hi Paul - see

[jira] [Commented] (TIKA-1675) please avoid xmlbeans dependency

2015-07-08 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14619365#comment-14619365 ] Ken Krugler commented on TIKA-1675: --- Not sure why the above discussion is being

[jira] [Closed] (TIKA-1624) Syntax error in DOAP file release section

2015-05-14 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ken Krugler closed TIKA-1624. - Resolution: Done With Tyler's change to the release procedure doc on the wiki

[jira] [Commented] (TIKA-1624) Syntax error in DOAP file release section

2015-05-14 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14544193#comment-14544193 ] Ken Krugler commented on TIKA-1624: --- As per Chris Mattmann's email, You should only have

[jira] [Assigned] (TIKA-1624) Syntax error in DOAP file release section

2015-05-08 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ken Krugler reassigned TIKA-1624: - Assignee: Ken Krugler Syntax error in DOAP file release section
