[jira] [Updated] (TIKA-682) Creative Suite formats support

2015-03-01 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich updated TIKA-682: - Affects Version/s: (was: 0.9) 1.8 Creative Suite formats support

[jira] [Updated] (TIKA-682) Creative Suite formats support

2015-03-01 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich updated TIKA-682: - Component/s: (was: metadata) parser Creative Suite formats support

[jira] [Commented] (TIKA-712) Master slide text isn't extracted

2015-03-01 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342518#comment-14342518 ] Tyler Palsulich commented on TIKA-712: -- Is there any update on this? Otherwise, I'll

[jira] [Resolved] (TIKA-713) Tika can not parse all of the persian pdf files

2015-03-01 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich resolved TIKA-713. -- Resolution: Fixed Tika can not parse all of the persian pdf files

[jira] [Commented] (TIKA-539) Encoding detection is too biased by encoding in meta tag

2015-03-01 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342530#comment-14342530 ] Tyler Palsulich commented on TIKA-539: -- Hi [~kkrugler]. I didn't have a specific fix

[jira] [Updated] (TIKA-768) Parser for EDF files

2015-03-01 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich updated TIKA-768: - Labels: edf new-parser (was: edf) Parser for EDF files

[jira] [Commented] (TIKA-766) Trim down the NetCDF dependency

2015-03-01 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342602#comment-14342602 ] Tyler Palsulich commented on TIKA-766: -- Do we need to look into this more? Now

[jira] [Commented] (TIKA-770) New ODF metadata keys

2015-03-01 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342603#comment-14342603 ] Tyler Palsulich commented on TIKA-770: -- [~gagravarr], 3 years later, is it time? New

[jira] [Resolved] (TIKA-821) Support detecting old Microsoft Works Word Processor formats

2015-03-01 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich resolved TIKA-821. -- Resolution: Fixed Marking fixed based on committed comment. Support detecting old Microsoft

[jira] [Resolved] (TIKA-630) Dealing with PDF documents from scanning programs

2015-03-01 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich resolved TIKA-630. -- Resolution: Fixed Dealing with PDF documents from scanning programs

[jira] [Commented] (TIKA-675) PackageExtractor should track names of recursively nested resources

2015-03-01 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342507#comment-14342507 ] Tyler Palsulich commented on TIKA-675: -- Is this still worth implementing? [~gagravarr

[jira] [Commented] (TIKA-369) Improve accuracy of language detection

2015-03-01 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342533#comment-14342533 ] Tyler Palsulich commented on TIKA-369: -- Thanks, Ken! In that case, I definitely agree

[jira] [Commented] (TIKA-774) ExifTool Parser

2015-03-01 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342605#comment-14342605 ] Tyler Palsulich commented on TIKA-774: -- Do we still want to integrate

[jira] [Commented] (TIKA-807) PHP version of Tika

2015-03-01 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342610#comment-14342610 ] Tyler Palsulich commented on TIKA-807: -- [Here|https://github.com/pierroweb

[jira] [Closed] (TIKA-648) Parsing HTML anchors with embedded div faulty

2015-03-01 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich closed TIKA-648. Resolution: Won't Fix Parsing HTML anchors with embedded div faulty

[jira] [Commented] (TIKA-591) Separate launcher process for forking JVMs

2015-03-01 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342521#comment-14342521 ] Tyler Palsulich commented on TIKA-591: -- I bring up tika-batch (from [~talli

[jira] [Updated] (TIKA-715) Some parsers produce non-well-formed XHTML SAX events

2015-03-01 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich updated TIKA-715: - Labels: newbie (was: ) Some parsers produce non-well-formed XHTML SAX events

[jira] [Commented] (TIKA-465) LanguageIdentifier API enhancements

2015-03-01 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342528#comment-14342528 ] Tyler Palsulich commented on TIKA-465: -- [~kkrugler], I commented in case someone else

[jira] [Updated] (TIKA-774) ExifTool Parser

2015-03-01 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich updated TIKA-774: - Labels: features new-parser newbie patch (was: features newbie patch,) ExifTool Parser

[jira] [Comment Edited] (TIKA-774) ExifTool Parser

2015-03-01 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342605#comment-14342605 ] Tyler Palsulich edited comment on TIKA-774 at 3/2/15 1:45 AM

[jira] [Updated] (TIKA-443) Geographic Information Parser

2015-03-01 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich updated TIKA-443: - Labels: new-parser (was: ) Geographic Information Parser

[jira] [Commented] (TIKA-715) Some parsers produce non-well-formed XHTML SAX events

2015-03-01 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342526#comment-14342526 ] Tyler Palsulich commented on TIKA-715: -- List of parser tests that fail after applying

[jira] [Commented] (TIKA-723) Rotated text isn't extracted correctly from PDFs

2015-03-01 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342532#comment-14342532 ] Tyler Palsulich commented on TIKA-723: -- The default behavior of Tika still prints

[jira] [Closed] (TIKA-765) add icu dependency

2015-03-01 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich closed TIKA-765. Resolution: Won't Fix Closing as Won't Fix since the Persian character issues seem to be solved

[jira] [Updated] (TIKA-852) Quicktime / MP4 Metadata Parser

2015-03-01 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich updated TIKA-852: - Labels: new-parser (was: ) Quicktime / MP4 Metadata Parser

[jira] [Updated] (TIKA-879) Detection problem: message/rfc822 file is detected as text/plain.

2015-03-01 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich updated TIKA-879: - Labels: new-parser (was: ) Detection problem: message/rfc822 file is detected as text/plain

[jira] [Commented] (TIKA-879) Detection problem: message/rfc822 file is detected as text/plain.

2015-03-01 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342670#comment-14342670 ] Tyler Palsulich commented on TIKA-879: -- [~lfcnassif], that seems like a reasonable

[jira] [Commented] (TIKA-893) Tika-server bundle includes wrong META-INF/services/org.apache.tika.parser.Parser, doesn't work

2015-03-01 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342705#comment-14342705 ] Tyler Palsulich commented on TIKA-893: -- Is this still an issue? From what I understand

[jira] [Updated] (TIKA-891) Use POST in addition to PUT on method calls in tika-server

2015-03-01 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich updated TIKA-891: - Labels: newbie (was: ) Use POST in addition to PUT on method calls in tika-server

[jira] [Closed] (TIKA-903) NPE thrown with password protected Pages file

2015-03-01 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich closed TIKA-903. Resolution: Fixed No exception is thrown with Tika 1.8-SNAPSHOT. So, closing as fixed. NPE thrown

[jira] [Closed] (TIKA-836) parsing really slow on some documents

2015-03-01 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich closed TIKA-836. Resolution: Cannot Reproduce We can't reproduce this without the problem files. If you still have

[jira] [Resolved] (TIKA-862) JPSS HDF5 files not being detected appropriately

2015-03-01 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich resolved TIKA-862. -- Resolution: Fixed Marking as fixed. The output from the above file is {code} <?xml version="1.0"

[jira] [Updated] (TIKA-849) Identify and parse the Apple iBooks format

2015-03-01 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich updated TIKA-849: - Labels: new-parser (was: ) Identify and parse the Apple iBooks format

[jira] [Updated] (TIKA-858) Tika add parsing support for ANPA-1312 news wire feeds

2015-03-01 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich updated TIKA-858: - Labels: new-parser (was: ) Tika add parsing support for ANPA-1312 news wire feeds

[jira] [Commented] (TIKA-858) Tika add parsing support for ANPA-1312 news wire feeds

2015-03-01 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342664#comment-14342664 ] Tyler Palsulich commented on TIKA-858: -- Does anyone have an ANPA file we can use

[jira] [Commented] (TIKA-880) while integrating microsoft parser it is giving error

2015-03-01 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342672#comment-14342672 ] Tyler Palsulich commented on TIKA-880: -- Hi [~som.mukhopadhyay]. Thank you for raising

[jira] [Resolved] (TIKA-887) Tika fails to parse some MP3 tags correctly and produces null characters in value

2015-03-01 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich resolved TIKA-887. -- Resolution: Fixed No objection and the linked file seemed to have valid metadata. So I'm marking

[jira] [Closed] (TIKA-888) NetCDF parser uses Java 6 JAR file and test/compilation fails with Java 1.5, although TIKA is Java 1.5

2015-03-01 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich closed TIKA-888. Resolution: Fixed Tika is now using Java 1.6 (talking about 1.7) and there were some Java 1.5

[jira] [Commented] (TIKA-891) Use POST in addition to PUT on method calls in tika-server

2015-03-01 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342703#comment-14342703 ] Tyler Palsulich commented on TIKA-891: -- I made a couple changes related

[jira] [Commented] (TIKA-894) Add webapp mode for Tika Server, simplifies deployment

2015-03-01 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342709#comment-14342709 ] Tyler Palsulich commented on TIKA-894: -- [~lewismc], if you have the time, this would

[jira] [Closed] (TIKA-897) UTF-8 encoded XML is detected as text/plain because of UTF-8 BOM

2015-03-01 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich closed TIKA-897. Resolution: Fixed Closing as fixed per Nick's comment above. We can open a new issue if someone

[jira] [Updated] (TIKA-911) Converted PDF document contains question marks in place of spaces and inconsistent case

2015-03-01 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich updated TIKA-911: - Affects Version/s: (was: 1.1) 1.8 Converted PDF document contains

[jira] [Commented] (TIKA-911) Converted PDF document contains question marks in place of spaces and inconsistent case

2015-03-01 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342716#comment-14342716 ] Tyler Palsulich commented on TIKA-911: -- Still seeing this issue (question marks instead

[jira] [Commented] (TIKA-885) Possible ConcurrentModificationException while accessing Metadata produced by ParsingReader

2015-03-01 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342675#comment-14342675 ] Tyler Palsulich commented on TIKA-885: -- [~lfcnassif], is this issue superseded by TIKA

[jira] [Closed] (TIKA-899) [Jackrabbit 2.4 - Tika Parser 1.0] How to configure AutoDetectParser for not detecting content when using files without extension

2015-03-01 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich closed TIKA-899. Resolution: Duplicate [Jackrabbit 2.4 - Tika Parser 1.0] How to configure AutoDetectParser

[jira] [Closed] (TIKA-898) [Jackrabbit 2.4 - Tika Parser 1.0] How to configure AutoDetectParser for not detecting content when using files without extension

2015-03-01 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich closed TIKA-898. Resolution: Cannot Reproduce There are a few ways to configure available Parsers. You can use the new

[jira] [Commented] (TIKA-891) Use POST in addition to PUT on method calls in tika-server

2015-03-01 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342722#comment-14342722 ] Tyler Palsulich commented on TIKA-891: -- There are only 3 -- getText, getXML, getHTML

Re: Curating Issues

2015-03-01 Thread Tyler Palsulich
++ -Original Message- From: Nick Burch apa...@gagravarr.org Reply-To: dev@tika.apache.org dev@tika.apache.org Date: Sunday, March 1, 2015 at 8:14 PM To: dev@tika.apache.org dev@tika.apache.org Subject: Re: Curating Issues On Sun, 1 Mar 2015, Tyler Palsulich wrote: I've started labeling some

[jira] [Updated] (TIKA-912) Response charset encoding not declared, and depends on host OS (Windows/Linux)

2015-03-01 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich updated TIKA-912: - Attachment: TIKA-912.palsulich.patch Attached an updated patch which adds charset info to each

[jira] [Closed] (TIKA-613) PDF parser is changing letters positions

2015-03-01 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich closed TIKA-613. Resolution: Fixed Significant PDF updates within Tika and PDFBox since this issue. Can reopen

[jira] [Commented] (TIKA-94) Speech recognition

2015-03-01 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-94?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342395#comment-14342395 ] Tyler Palsulich commented on TIKA-94: - Sphinx actually seems really straightforward

[jira] [Commented] (TIKA-289) Add magic byte patterns from file(1)

2015-03-01 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342372#comment-14342372 ] Tyler Palsulich commented on TIKA-289: -- The dir is the one you linked on GH above

[jira] [Closed] (TIKA-539) Encoding detection is too biased by encoding in meta tag

2015-02-28 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich closed TIKA-539. Resolution: Fixed Encoding detection is too biased by encoding in meta tag

[jira] [Closed] (TIKA-307) Better handling of partial/truncated input data to parsers

2015-02-28 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich closed TIKA-307. Resolution: Fixed Zip and other type Parsers are much more robust at this point. Can reopen if still

[jira] [Closed] (TIKA-89) Rename MimeType and MimeTypes

2015-02-28 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-89?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich closed TIKA-89. --- Resolution: Fixed Rename MimeType and MimeTypes - Key

[jira] [Commented] (TIKA-465) LanguageIdentifier API enhancements

2015-02-28 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342023#comment-14342023 ] Tyler Palsulich commented on TIKA-465: -- Is there still interest in implementing

[jira] [Closed] (TIKA-590) Create facility for deeper introspection of media files

2015-02-28 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich closed TIKA-590. Resolution: Won't Fix Create facility for deeper introspection of media files

[jira] [Updated] (TIKA-579) DcXMLParser: DC metadata text in extracted body

2015-02-28 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich updated TIKA-579: - Affects Version/s: (was: 0.8) 1.8 DcXMLParser: DC metadata text

[jira] [Commented] (TIKA-579) DcXMLParser: DC metadata text in extracted body

2015-02-28 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342031#comment-14342031 ] Tyler Palsulich commented on TIKA-579: -- +1. DC tags should be put into the Metadata

[jira] [Resolved] (TIKA-577) IndexOutOfBounds Exception looking for Picture in Word 03 doc that has no pictures

2015-02-28 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich resolved TIKA-577. -- Resolution: Not a Problem The document is corrupted. The POI error is now {{Caused

[jira] [Commented] (TIKA-291) Adobe InDesign support

2015-02-28 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14341982#comment-14341982 ] Tyler Palsulich commented on TIKA-291: -- We still don't have support

[jira] [Updated] (TIKA-90) Allow thumbnails as document metadata

2015-02-28 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-90?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich updated TIKA-90: Priority: Minor (was: Major) Allow thumbnails as document metadata

[jira] [Closed] (TIKA-354) ProfilingHandler should take a length-limiting parameter

2015-02-28 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich closed TIKA-354. Resolution: Not a Problem Closing this off, unless you're still interested in getting

[jira] [Commented] (TIKA-375) Improve code quality metrics

2015-02-28 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14341999#comment-14341999 ] Tyler Palsulich commented on TIKA-375: -- This is a great candidate for any new

[jira] [Updated] (TIKA-375) Improve code quality metrics

2015-02-28 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich updated TIKA-375: - Labels: newbie (was: ) Improve code quality metrics

[jira] [Commented] (TIKA-381) HtmlParser should strip linefeeds out of links

2015-02-28 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342009#comment-14342009 ] Tyler Palsulich commented on TIKA-381: -- This is still an issue in 1.8-SNAPSHOT. {code

[jira] [Updated] (TIKA-381) HtmlParser should strip linefeeds out of links

2015-02-28 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich updated TIKA-381: - Affects Version/s: (was: 0.6) 1.8 HtmlParser should strip linefeeds out

[jira] [Closed] (TIKA-272) Expose characters offsets information while parsing text-based inputs.

2015-02-28 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich closed TIKA-272. Resolution: Won't Fix Expose characters offsets information while parsing text-based inputs

[jira] [Closed] (TIKA-288) Support override parsers in AutoDetectParser

2015-02-28 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich closed TIKA-288. Resolution: Duplicate Support override parsers in AutoDetectParser

[jira] [Commented] (TIKA-289) Add magic byte patterns from file(1)

2015-02-28 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14341980#comment-14341980 ] Tyler Palsulich commented on TIKA-289: -- Does anyone know if Tika integrated the magic

[jira] [Commented] (TIKA-94) Speech recognition

2015-02-28 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-94?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14341978#comment-14341978 ] Tyler Palsulich commented on TIKA-94: - This is similar to machine text translation

[jira] [Commented] (TIKA-369) Improve accuracy of language detection

2015-02-28 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14341994#comment-14341994 ] Tyler Palsulich commented on TIKA-369: -- Is there any update on this? Language detection

[jira] [Commented] (TIKA-524) Unification of HTML output from Office, OOXML and Open Document parsers

2015-02-28 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342026#comment-14342026 ] Tyler Palsulich commented on TIKA-524: -- Is there still interest/a possibility

[jira] [Closed] (TIKA-497) HtmlHandler should fix up incorrect capitalization of names in meta http-equiv=xxx attributes before putting into metadata

2015-02-28 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich closed TIKA-497. Resolution: Fixed HtmlHandler should fix up incorrect capitalization of names in meta http-equiv

[jira] [Closed] (TIKA-289) Add magic byte patterns from file(1)

2015-02-28 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich closed TIKA-289. Resolution: Won't Fix I agree, [~gagravarr]. Let's consider {{file}} as a reference when we need help

[jira] [Commented] (TIKA-289) Add magic byte patterns from file(1)

2015-02-28 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342005#comment-14342005 ] Tyler Palsulich commented on TIKA-289: -- Sounds great! Add magic byte patterns from

[jira] [Commented] (TIKA-591) Separate launcher process for forking JVMs

2015-02-28 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342034#comment-14342034 ] Tyler Palsulich commented on TIKA-591: -- Is there still interest

[jira] [Closed] (TIKA-100) Structured PDF parsing

2015-02-28 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich closed TIKA-100. Resolution: Fixed Structured PDF parsing -- Key: TIKA-100

[jira] [Commented] (TIKA-89) Rename MimeType and MimeTypes

2015-02-28 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-89?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14341976#comment-14341976 ] Tyler Palsulich commented on TIKA-89: - Is there still interest in renaming

[jira] [Commented] (TIKA-1558) Create a Parser Blacklist

2015-02-27 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14340624#comment-14340624 ] Tyler Palsulich commented on TIKA-1558: --- [~gagravarr], that sounds good! IMO

[jira] [Commented] (TIKA-1509) Create configurable strategies for composite parsers

2015-02-27 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14340662#comment-14340662 ] Tyler Palsulich commented on TIKA-1509: --- Just to reiterate the above and be clear

[jira] [Commented] (TIKA-1558) Create a Parser Blacklist

2015-02-21 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14331975#comment-14331975 ] Tyler Palsulich commented on TIKA-1558: --- This has the added benefit of working

[jira] [Comment Edited] (TIKA-1558) Create a Parser Blacklist

2015-02-21 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14331975#comment-14331975 ] Tyler Palsulich edited comment on TIKA-1558 at 2/22/15 12:53 AM

[jira] [Created] (TIKA-1558) Create a Parser Blacklist

2015-02-20 Thread Tyler Palsulich (JIRA)
Tyler Palsulich created TIKA-1558: - Summary: Create a Parser Blacklist Key: TIKA-1558 URL: https://issues.apache.org/jira/browse/TIKA-1558 Project: Tika Issue Type: New Feature

[jira] [Updated] (TIKA-1557) Create TesseractOCR Option to Never Run

2015-02-20 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich updated TIKA-1557: -- Issue Type: New Feature (was: Bug) Create TesseractOCR Option to Never Run

[jira] [Closed] (TIKA-1557) Create TesseractOCR Option to Never Run

2015-02-20 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich closed TIKA-1557. - Resolution: Won't Fix Fix Version/s: (was: 1.8) Closing this as Won't Fix for a clean

[jira] [Closed] (TIKA-1187) java.lang.OutOfMemoryError: Java heap space

2015-02-20 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich closed TIKA-1187. - Resolution: Cannot Reproduce java.lang.OutOfMemoryError: Java heap space

[jira] [Closed] (TIKA-1250) Process loops infinitely processing a CHM file

2015-02-20 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich closed TIKA-1250. - Resolution: Cannot Reproduce We can't reproduce this without the file. And, there were some

[jira] [Commented] (TIKA-1194) Missing text from MS Word (DOC) file

2015-02-20 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14330021#comment-14330021 ] Tyler Palsulich commented on TIKA-1194: --- [~tssk], were you ever able to create a safe

[jira] [Closed] (TIKA-1239) Using Spring and Tika together. Need to extract the content and metadata.

2015-02-20 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich closed TIKA-1239. - Resolution: Cannot Reproduce Using Spring and Tika together. Need to extract the content

[jira] [Resolved] (TIKA-1558) Create a Parser Blacklist

2015-02-20 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich resolved TIKA-1558. --- Resolution: Fixed Fix Version/s: 1.8 Assignee: Tyler Palsulich Above strategy

[jira] [Commented] (TIKA-1437) encoding issue in AutoDetectReader

2015-02-20 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14330017#comment-14330017 ] Tyler Palsulich commented on TIKA-1437: --- [~Lukeliush], can you make a couple updates

[jira] [Commented] (TIKA-1460) Could not parse predefined CMAP file for 'Adobe-GBK1-UCS2'

2015-02-20 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14330031#comment-14330031 ] Tyler Palsulich commented on TIKA-1460: --- Hi [~onyas]. The dialog isn't in a very

[jira] [Resolved] (TIKA-1521) Handle password protected 7zip files

2015-02-20 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich resolved TIKA-1521. --- Resolution: Fixed Thanks for finding a workaround, Tim! Closing this now that Jenkins is happy

[jira] [Closed] (TIKA-1555) posix_spawn is not a supported process launch mechanism on this platform

2015-02-20 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich closed TIKA-1555. - Resolution: Duplicate Assignee: Tyler Palsulich posix_spawn is not a supported process

[jira] [Commented] (TIKA-1555) posix_spawn is not a supported process launch mechanism on this platform

2015-02-20 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14329438#comment-14329438 ] Tyler Palsulich commented on TIKA-1555: --- You can also disable OCR by setting

[jira] [Commented] (TIKA-1555) posix_spawn is not a supported process launch mechanism on this platform

2015-02-20 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14329470#comment-14329470 ] Tyler Palsulich commented on TIKA-1555: --- My mistake. Please see [this test|https

[jira] [Created] (TIKA-1557) Create TesseractOCR Option to Never Run

2015-02-20 Thread Tyler Palsulich (JIRA)
Tyler Palsulich created TIKA-1557: - Summary: Create TesseractOCR Option to Never Run Key: TIKA-1557 URL: https://issues.apache.org/jira/browse/TIKA-1557 Project: Tika Issue Type: Bug

[jira] [Commented] (TIKA-1555) posix_spawn is not a supported process launch mechanism on this platform

2015-02-20 Thread Tyler Palsulich (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14329486#comment-14329486 ] Tyler Palsulich commented on TIKA-1555: --- [~thetaphi], that's true. Please see TIKA

Re: Gsoc2015

2015-02-17 Thread Tyler Palsulich
Hi Abhinav, Have you tried reading the 5 minute Parse guide on the website ( http://tika.apache.org/1.7/parser_guide.html)? That should help give you an idea of how to create a new Parser. Tika is split into multiple components. Each component is responsible for a different feature of Tika.
