[jira] [Commented] (TIKA-1334) Add presentation layer for results of each run

2017-05-03 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15994941#comment-15994941
 ] 

Tyler Palsulich commented on TIKA-1334:
---

The format should probably be in the form:

{noformat}
[
  {
"mime-type": "something",
"count": 1234,
"version": "a"
  },
  {
"mime-type": "something",
"count": 4321,
"version": "b"
  },
  ...
]
{noformat}
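
To make the proposed layout concrete, here is a minimal sketch of how one run's counts could be rendered in that shape. It is only an illustration: the `MimeCount` and `RunStatsJson` classes and the sample values are hypothetical and not part of Tika, and the escaping covers only quotes and backslashes.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical holder for one row of per-run results; not part of Tika. */
final class MimeCount {
    final String mimeType;
    final long count;
    final String version;

    MimeCount(String mimeType, long count, String version) {
        this.mimeType = mimeType;
        this.count = count;
        this.version = version;
    }
}

/** Hypothetical serializer for the layout proposed in the comment above. */
final class RunStatsJson {
    /** Escapes backslashes and double quotes only (control characters are not handled). */
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    /** Renders the rows as a JSON array of {mime-type, count, version} objects. */
    static String toJson(List<MimeCount> rows) {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < rows.size(); i++) {
            MimeCount r = rows.get(i);
            if (i > 0) {
                sb.append(',');
            }
            sb.append("{\"mime-type\":\"").append(escape(r.mimeType))
              .append("\",\"count\":").append(r.count)
              .append(",\"version\":\"").append(escape(r.version)).append("\"}");
        }
        return sb.append(']').toString();
    }

    public static void main(String[] args) {
        List<MimeCount> rows = new ArrayList<>();
        rows.add(new MimeCount("application/pdf", 1234, "a"));
        rows.add(new MimeCount("application/pdf", 4321, "b"));
        System.out.println(toJson(rows));
    }
}
```

A real implementation would more likely use a JSON library; the hand-rolled builder only keeps the sketch self-contained.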

> Add presentation layer for results of each run
> --
>
> Key: TIKA-1334
> URL: https://issues.apache.org/jira/browse/TIKA-1334
> Project: Tika
>  Issue Type: Sub-task
>  Components: cli, general, server
>Reporter: Tim Allison
> Attachments: static_stats.zip
>
>
> If I'm doing this, it'll probably be vintage mid-90s html.  If someone with 
> some .js kung-fu wants to take this, please do.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (TIKA-1743) NetworkParser can create Unbounded Number of Threads

2015-09-22 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903878#comment-14903878
 ] 

Tyler Palsulich commented on TIKA-1743:
---

[Copied from the list]

This sounds like a great idea! We should make the size of the pool configurable 
with TikaConfig.

> NetworkParser can create Unbounded Number of Threads
> 
>
> Key: TIKA-1743
> URL: https://issues.apache.org/jira/browse/TIKA-1743
> Project: Tika
>  Issue Type: Bug
>Reporter: Bob Paulin
>
> The current NetworkParser class creates a new instance of the Thread class 
> with each call to parse.  This could create an unbounded number of threads.  
> I'd suggest replacing this logic with a 
> ThreadPoolExecutor and a configurable number of threads.  This will help 
> prevent creating an unbounded number of threads and allow the user to tune 
> performance to the hardware.
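
The suggestion above can be sketched roughly as follows. This is not the NetworkParser code: `BoundedParsePool`, `peakConcurrency`, and the sleeping task are hypothetical stand-ins, and the pool size is passed in directly where a real change would presumably read it from configuration. `Executors.newFixedThreadPool` returns a `ThreadPoolExecutor` with a fixed number of worker threads, so extra tasks queue instead of spawning new threads.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical illustration of bounding parse threads with a fixed-size pool. */
final class BoundedParsePool {
    /**
     * Runs taskCount dummy "parse" tasks on a pool of at most poolSize threads
     * and returns the peak number of tasks that were running at the same time.
     */
    static int peakConcurrency(int poolSize, int taskCount) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        final AtomicInteger running = new AtomicInteger();
        final AtomicInteger peak = new AtomicInteger();
        List<Future<Void>> futures = new ArrayList<>();
        try {
            for (int i = 0; i < taskCount; i++) {
                futures.add(pool.submit(new Callable<Void>() {
                    @Override
                    public Void call() throws Exception {
                        int now = running.incrementAndGet();
                        // Record the high-water mark with a compare-and-set loop
                        int prev;
                        while (now > (prev = peak.get()) && !peak.compareAndSet(prev, now)) {
                            // retry until the peak is at least 'now'
                        }
                        Thread.sleep(5); // stand-in for real parsing work
                        running.decrementAndGet();
                        return null;
                    }
                }));
            }
            for (Future<Void> f : futures) {
                f.get(); // wait for completion and surface any task failure
            }
        } finally {
            pool.shutdown();
            pool.awaitTermination(10, TimeUnit.SECONDS);
        }
        return peak.get();
    }
}
```

With a pool size of 3 and 20 submitted tasks, the returned peak never exceeds 3, which is the bound the issue asks for.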



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1672) Integrate tika-java7 component

2015-08-30 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14722705#comment-14722705
 ] 

Tyler Palsulich commented on TIKA-1672:
---

Hmm. Maybe we should rename the module? Right now, it doesn't make sense to 
have a java7 component when the entire project depends on Java 7.

 Integrate tika-java7 component
 --

 Key: TIKA-1672
 URL: https://issues.apache.org/jira/browse/TIKA-1672
 Project: Tika
  Issue Type: Improvement
Reporter: Tyler Palsulich
 Fix For: 1.11


 Code requiring Java 7 doesn't need to be in a separate module now that 
 TIKA-1536 (upgrade to Java 7) is done.





[jira] [Commented] (TIKA-1362) Add GoogleTranslate implementation of Translation API

2015-07-10 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14623246#comment-14623246
 ] 

Tyler Palsulich commented on TIKA-1362:
---

If you have a pressing need for better configuration abilities for the Google 
Translator, feel free to open up a new issue and upload a patch! :) We'd be 
happy to help you get started. Check out the [contributing 
page|https://tika.apache.org/contribute.html] for some general information.

 Add GoogleTranslate implementation of Translation API
 -

 Key: TIKA-1362
 URL: https://issues.apache.org/jira/browse/TIKA-1362
 Project: Tika
  Issue Type: Bug
  Components: translation
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
 Fix For: 1.6


 Add an implementation of the Translation API that uses the Google Translate 
 v2 API and Apache CXF: 
 https://www.googleapis.com/language/translate/v2





[jira] [Created] (TIKA-1672) Integrate tika-java7 component

2015-07-02 Thread Tyler Palsulich (JIRA)
Tyler Palsulich created TIKA-1672:
-

 Summary: Integrate tika-java7 component
 Key: TIKA-1672
 URL: https://issues.apache.org/jira/browse/TIKA-1672
 Project: Tika
  Issue Type: Improvement
Reporter: Tyler Palsulich
 Fix For: 1.10


Code requiring Java 7 doesn't need to be in a separate module now that 
TIKA-1536 (upgrade to Java 7) is done.





[jira] [Resolved] (TIKA-1536) Upgrade compiler definition in pom's to Java 7

2015-07-02 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich resolved TIKA-1536.
---
Resolution: Fixed

Upgraded in  r1688779. Thanks, all. Will open a new issue regarding integrating 
tika-java7.

 Upgrade compiler definition in pom's to Java 7
 --

 Key: TIKA-1536
 URL: https://issues.apache.org/jira/browse/TIKA-1536
 Project: Tika
  Issue Type: Improvement
  Components: packaging
Affects Versions: 1.7
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 1.10

 Attachments: TIKA-1536.patch


 Since we committed TIKA-1423 it would appear through [mailing 
 list|http://www.mail-archive.com/dev%40tika.apache.org/msg11542.html] 
 commentary that there is a willingness to drop support for Java 1.6 in favour 
 of >= Java 1.7.
 This issue simply addresses this.





[jira] [Commented] (TIKA-1536) Upgrade compiler definition in pom's to Java 7

2015-06-29 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605772#comment-14605772
 ] 

Tyler Palsulich commented on TIKA-1536:
---

Yep, see http://apache.markmail.org/thread/7oubuh4hp6rdlbch.

 Upgrade compiler definition in pom's to Java 7
 --

 Key: TIKA-1536
 URL: https://issues.apache.org/jira/browse/TIKA-1536
 Project: Tika
  Issue Type: Improvement
  Components: packaging
Affects Versions: 1.7
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 1.10

 Attachments: TIKA-1536.patch


 Since we committed TIKA-1423 it would appear through [mailing 
 list|http://www.mail-archive.com/dev%40tika.apache.org/msg11542.html] 
 commentary that there is a willingness to drop support for Java 1.6 in favour 
 of >= Java 1.7.
 This issue simply addresses this.





[jira] [Closed] (TIKA-1481) TikaJAXRS get metadata calls give different results

2015-06-29 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich closed TIKA-1481.
-
Resolution: Not A Problem

Hi [~arbuzovada]. Sorry for the trouble! Did you make sure to respond to the 
automated response, confirming your subscription?

I'm closing this issue as not a problem. But, don't hesitate to let us know if 
you have any more issues.

 TikaJAXRS get metadata calls give different results
 ---

 Key: TIKA-1481
 URL: https://issues.apache.org/jira/browse/TIKA-1481
 Project: Tika
  Issue Type: Bug
  Components: server
Affects Versions: 1.6
 Environment: Windows 8, JDK 1.8
Reporter: Darya Arbuzova
Priority: Minor
 Attachments: sample.csv


 Hello!
 I'm trying to use Tika in server mode.
 I downloaded tika-server-1.6.jar from http://mirror.vorboss.net/apache/tika/.
 I have tried to get file metadata in 2 different ways (as explained here: 
 http://wiki.apache.org/tika/TikaJAXRS ):
 {{ curl -T sample.csv http://localhost:9998/meta --header "Content-Type: text/csv"}}
 {{Content-Encoding,windows-1252}}
 {{Content-Type,text/plain; charset=windows-1252}}
 and
 {{ curl -X PUT -d @sample.csv http://localhost:9998/meta --header "Content-Type: text/csv"}}
 {{Content-Encoding,ISO-8859-1}}
 {{Content-Type,text/plain; charset=ISO-8859-1}}
 How come they give different results in encoding if I call the same 
 {{http://localhost:9998/meta}}?
 What other differences could appear, and which is the preferable way to 
 get metadata?
 Many thanks!
 Best regards,
 Darya Arbuzova





[jira] [Resolved] (TIKA-756) XMP output from Tika CLI

2015-06-29 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich resolved TIKA-756.
--
Resolution: Fixed

Marking this as Fixed, since there are a few more references to tika-parser 
components (see TikaToXMP). Feel free to reopen if you disagree.

 XMP output from Tika CLI
 

 Key: TIKA-756
 URL: https://issues.apache.org/jira/browse/TIKA-756
 Project: Tika
  Issue Type: New Feature
  Components: cli, metadata
Reporter: Jukka Zitting
Assignee: Jörg Ehrlich
  Labels: metadata, xmp
 Attachments: tika-xmp.patch, tika-xmp_styleAndHeader.patch


 It would be great if the Tika CLI could output metadata also in the XMP 
 format.





[jira] [Closed] (TIKA-1429) Unable to View a 9mb file even after setting a large Heap Size of 3GB while TIKA GUI

2015-06-29 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich closed TIKA-1429.
-
Resolution: Not A Problem

Closing this as not a problem. The file needs to be kept in memory for the GUI 
to work. So, the problem should be fixed with a higher heap limit.

 Unable to  View a 9mb file even after setting  a large Heap Size of 3GB  
 while TIKA GUI
 ---

 Key: TIKA-1429
 URL: https://issues.apache.org/jira/browse/TIKA-1429
 Project: Tika
  Issue Type: Bug
  Components: gui
Affects Versions: 1.6
 Environment: Windows 8
Reporter: Gautham Gowrishankar
Priority: Minor

 We seem to have found an issue while running the tika 1.6 jar as a GUI (-g 
 option). It seems to work for smaller .tsv files, but we are running into a 
 GC overhead exception while opening one of the files in your DataSet. 
 Strangely, it seems to work with the -x option.
 There might be an issue at 
 org.apache.tika.gui.TikaGUI.openFile(TikaGUI.java:284). 
 Just bringing it to your notice.
 Below are the logs.
 =
 Exception in thread "AWT-EventQueue-0" java.lang.OutOfMemoryError: GC overhead limit exceeded
 at java.util.Arrays.copyOfRange(Unknown Source)
 at java.lang.String.<init>(Unknown Source)
 at java.lang.StringBuilder.toString(Unknown Source)
 at java.lang.StackTraceElement.toString(Unknown Source)
 at java.lang.String.valueOf(Unknown Source)
 at java.lang.StringBuilder.append(Unknown Source)
 at java.lang.Throwable.printStackTrace(Unknown Source)
 at java.lang.Throwable.printStackTrace(Unknown Source)
 at org.apache.tika.gui.TikaGUI.handleError(TikaGUI.java:351)
 at org.apache.tika.gui.TikaGUI.openFile(TikaGUI.java:284)
 at org.apache.tika.gui.TikaGUI.actionPerformed(TikaGUI.java:238)
 at javax.swing.AbstractButton.fireActionPerformed(Unknown Source)
 at javax.swing.AbstractButton$Handler.actionPerformed(Unknown Source)
 at javax.swing.DefaultButtonModel.fireActionPerformed(Unknown Source)
 at javax.swing.DefaultButtonModel.setPressed(Unknown Source)
 at javax.swing.AbstractButton.doClick(Unknown Source)
 at javax.swing.plaf.basic.BasicMenuItemUI.doClick(Unknown Source)
 at javax.swing.plaf.basic.BasicMenuItemUI$Handler.mouseReleased(Unknown Source)
 at java.awt.Component.processMouseEvent(Unknown Source)
 at javax.swing.JComponent.processMouseEvent(Unknown Source)
 at java.awt.Component.processEvent(Unknown Source)
 at java.awt.Container.processEvent(Unknown Source)
 at java.awt.Component.dispatchEventImpl(Unknown Source)
 at java.awt.Container.dispatchEventImpl(Unknown Source)
 at java.awt.Component.dispatchEvent(Unknown Source)
 at java.awt.LightweightDispatcher.retargetMouseEvent(Unknown Source)
 at java.awt.LightweightDispatcher.processMouseEvent(Unknown Source)
 at java.awt.LightweightDispatcher.dispatchEvent(Unknown Source)
 at java.awt.Container.dispatchEventImpl(Unknown Source)
 at java.awt.Window.dispatchEventImpl(Unknown Source)
 at java.awt.Component.dispatchEvent(Unknown Source)
 at java.awt.EventQueue.dispatchEventImpl(Unknown Source)
 Exception in thread "AWT-EventQueue-0" java.lang.OutOfMemoryError: GC overhead limit exceeded
 at java.security.ProtectionDomain$1.doIntersectionPrivilege(Unknown Source)
 at java.security.ProtectionDomain$1.doIntersectionPrivilege(Unknown Source)
 at java.awt.EventQueue$4.run(Unknown Source)
 at java.awt.EventQueue$4.run(Unknown Source)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.security.ProtectionDomain$1.doIntersectionPrivilege(Unknown Source)
 at java.awt.EventQueue.dispatchEvent(Unknown Source)
 at java.awt.EventDispatchThread.pumpOneEventForFilters(Unknown Source)
 at java.awt.EventDispatchThread.pumpEventsForFilter(Unknown Source)
 at java.awt.EventDispatchThread.pumpEventsForHierarchy(Unknown Source)
 at java.awt.EventDispatchThread.pumpEvents(Unknown Source)
 at java.awt.EventDispatchThread.pumpEvents(Unknown Source)
 at java.awt.EventDispatchThread.run(Unknown Source)
 Exception in thread "AWT-EventQueue-0" java.lang.OutOfMemoryError: GC overhead limit exceeded
 at java.lang.StringBuilder.toString(Unknown Source)
 at com.sun.java.swing.plaf.windows.TMSchema$Part.getControlName(Unknown Source)
 at com.sun.java.swing.plaf.windows.XPStyle.isSkinDefined(Unknown Source)
 at 
 

[jira] [Commented] (TIKA-1493) Update for JAXRS page with details on passing password

2015-06-29 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605292#comment-14605292
 ] 

Tyler Palsulich commented on TIKA-1493:
---

Can someone familiar with the latest in passing a password to Tika server 
update the wiki page? Or, is setting the environment variable enough?

 Update for JAXRS page with details on passing password
 --

 Key: TIKA-1493
 URL: https://issues.apache.org/jira/browse/TIKA-1493
 Project: Tika
  Issue Type: Improvement
  Components: documentation
Reporter: Peter Bowyer
Priority: Minor
  Labels: documentation, newbie

 I signed up for a wiki account to make the edit, but the page is immutable :(
 It would be really helpful to put on https://wiki.apache.org/tika/TikaJAXRS 
 information about passing the password for encrypted PDFs into TikaJAXRS. In 
 Changelog.txt I discovered the TIKA_PASSWORD environment variable which has 
 worked for me, and it'd be nice to save others having to hunt around.
 I'd also like to know if there's a way to pass it in per-request (a HTTP 
 header? Useful when many different passwords) - not found anything in the 
 source code for that though.





[jira] [Closed] (TIKA-1552) Pdf document parser

2015-06-29 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich closed TIKA-1552.
-
Resolution: Not A Problem

Marking this as not a problem, since Adobe Reader also adds white space.

 Pdf document parser
 ---

 Key: TIKA-1552
 URL: https://issues.apache.org/jira/browse/TIKA-1552
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.7
Reporter: Konstantin
 Attachments: 2014_US_Federal_Budget.pdf, issue.jpg


 Hello,
 We found that when a PDF document has marked text inside a frame (table), 
 Tika inserts tabs between words after parsing.
 Original text from attached file:
 Provides $17.7 billion in discretionary funding for the National Aeronautics 
 and Space
 Parsed text (JIRA removed the tabs, so I will add - symbols instead):
 •Provides - $17.7 - 
 billion-in-discretionary-funding-for-the-National-Aeronautics-and-Space
 Please take a look at the attached screenshot.
 On the left side is the parsed text in a text editor.
 Thank you.





[jira] [Closed] (TIKA-1452) parser.parse() throws exception after which the procesed file is not getting renamed/moved/deleted

2015-06-29 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich closed TIKA-1452.
-
Resolution: Not A Problem

I'm closing this as not a problem. But, please feel free to reopen if you're 
still having this issue!

 parser.parse() throws exception after which the procesed file is not getting 
 renamed/moved/deleted
 --

 Key: TIKA-1452
 URL: https://issues.apache.org/jira/browse/TIKA-1452
 Project: Tika
  Issue Type: Bug
  Components: detector, metadata, parser
Affects Versions: 1.6
 Environment: jre6
Reporter: Abhishek

 I am passing a file as an input stream to the parser.parse() method while 
 using the Apache Tika library to convert a file to text. The method throws an 
 exception (displayed below), but the input stream is closed in the finally 
 block successfully. Then, while renaming the file, the File.renameTo method 
 from java.io returns false. I am not able to rename/delete/move the file 
 despite successfully closing the inputStream. I am afraid another instance of 
 the file is created while the parser.parse() method processes the file, and 
 that it doesn't get closed by the time the exception is thrown. Is that 
 possible? If so, what should I do to rename or delete the file?
 The exception thrown while checking the content type is:
 java.lang.NoClassDefFoundError: Could not initialize class 
 com.adobe.xmp.impl.XMPMetaParser
 at com.adobe.xmp.XMPMetaFactory.parseFromBuffer(XMPMetaFactory.java:160)
 at com.adobe.xmp.XMPMetaFactory.parseFromBuffer(XMPMetaFactory.java:144)
 at com.drew.metadata.xmp.XmpReader.extract(XmpReader.java:106)
 at com.drew.imaging.jpeg.JpegMetadataReader.extractMetadataFromJpegSegmentReader(JpegMetadataReader.java:112)
 at com.drew.imaging.jpeg.JpegMetadataReader.readMetadata(JpegMetadataReader.java:71)
 at org.apache.tika.parser.image.ImageMetadataExtractor.parseJpeg(ImageMetadataExtractor.java:91)
 at org.apache.tika.parser.jpeg.JpegParser.parse(JpegParser.java:56)
 at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
 at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
 at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:121) 
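
One way to check whether a leaked handle is really what blocks the rename is sketched below. It is a hypothetical illustration, not the reporter's code and not a confirmed diagnosis: `ParseThenMove` is an invented name, the thrown `NoClassDefFoundError` merely simulates the failure in the trace above, and `java.nio.file` requires Java 7 or later (the report lists jre6). Try-with-resources closes the stream even when an `Error` is thrown, and `Files.move` throws an `IOException` describing the cause where `File.renameTo` would only return false.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical illustration: close the stream reliably, then move the file. */
final class ParseThenMove {
    /**
     * Opens source, runs a simulated parse step that fails, then moves the file.
     * Returns true when the file ends up at target and no longer exists at source.
     */
    static boolean parseThenMove(Path source, Path target) throws IOException {
        try (InputStream in = Files.newInputStream(source)) {
            in.read();
            // Stand-in for parser.parse(...) failing with an Error rather than an Exception
            throw new NoClassDefFoundError("simulated parser failure");
        } catch (LinkageError e) {
            // The stream has already been closed by try-with-resources at this point
        }
        // Unlike File.renameTo, Files.move reports the reason for a failure
        Files.move(source, target, StandardCopyOption.REPLACE_EXISTING);
        return Files.exists(target) && !Files.exists(source);
    }
}
```

If the move still fails in a setup like this, the exception message usually narrows down whether another process or another open handle holds the file.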





[jira] [Closed] (TIKA-1439) PDF embeded with document can not parse.

2015-06-29 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich closed TIKA-1439.
-
Resolution: Duplicate

 PDF embeded with document can not parse.
 

 Key: TIKA-1439
 URL: https://issues.apache.org/jira/browse/TIKA-1439
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.6
 Environment: windows7
Reporter: sunxingzhe
  Labels: pdfbox
 Attachments: PDF2XHTML.java_diff.html


 I inserted an Excel file into the PDF file,
 but Tika cannot extract the embedded Excel resource.
 The attached file PDF2XHTML.java_diff.html is the diff file.
 Please confirm it.





[jira] [Updated] (TIKA-1233) PDFBox can throw StringIndexOutOfBoundsException on some dates

2015-06-29 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich updated TIKA-1233:
--
Fix Version/s: (was: 1.6)
   1.10

 PDFBox can throw StringIndexOutOfBoundsException on some dates
 --

 Key: TIKA-1233
 URL: https://issues.apache.org/jira/browse/TIKA-1233
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.5
Reporter: Tim Allison
Priority: Trivial
  Labels: easyfix
 Fix For: 1.10


 PDFBOX's date parser can throw a StringIndexOutOfBoundsException if a date 
 string for parsing is empty or contains only spaces.  A few of my test pdfs 
 have this feature.
 Until PDFBOX-1803 is resolved, we can add an extra catch to prevent this from 
 causing problems in TIKA
 {noformat}
 @@ -171,6 +171,9 @@
  addMetadata(metadata, TikaCoreProperties.CREATED, 
 info.getCreationDate());
  } catch (IOException e) {
  // Invalid date format, just ignore
 +} catch (StringIndexOutOfBoundsException e){
 +//remove after PDFBOX-1883 is fixed
 +// Invalid date format, just ignore
  }
  try {
  Calendar modified = info.getModificationDate();
 @@ -178,6 +181,9 @@
  addMetadata(metadata, TikaCoreProperties.MODIFIED, modified);
  } catch (IOException e) {
  // Invalid date format, just ignore
 +} catch (StringIndexOutOfBoundsException e){
 +//remove after PDFBOX-1883 is fixed
 +// Invalid date format, just ignore
  }
 {noformat}
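
 The guard added by the patch can be shown in isolation. The following is a minimal sketch, not PDFBox or Tika code: `LenientDate`, `strictYear`, and `yearOrNull` are hypothetical names, and the "parser" only reads a four-digit year prefix. It shows why an empty or blank date string needs a catch for an unchecked exception in addition to the checked one the original code already handles.

```java
/** Hypothetical illustration of tolerating empty or blank date strings. */
final class LenientDate {
    /** Stand-in for a strict parser: reads a 4-digit year prefix from raw. */
    static int strictYear(String raw) {
        // substring throws StringIndexOutOfBoundsException when raw has fewer than 4 chars
        return Integer.parseInt(raw.substring(0, 4));
    }

    /** Returns the parsed year, or null when the input is empty, blank, or malformed. */
    static Integer yearOrNull(String raw) {
        try {
            return strictYear(raw);
        } catch (StringIndexOutOfBoundsException | NumberFormatException e) {
            // Invalid date format, just ignore (mirrors the comment in the patch)
            return null;
        }
    }
}
```

 Both exceptions are unchecked, so without the extra catch they would propagate out of metadata extraction and fail the whole parse.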





[jira] [Resolved] (TIKA-1585) Create Example Website with Form Submission

2015-06-29 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich resolved TIKA-1585.
---
Resolution: Fixed

Good idea, [~lewismc]. I added it to 
http://people.apache.org/~tpalsulich/tika.html. The server is down right now. 
If/when another one is started, we'll need to start it with the right CORS 
argument (http://people.apache.org) and I'll update the page with the right IP 
address.

 Create Example Website with Form Submission
 ---

 Key: TIKA-1585
 URL: https://issues.apache.org/jira/browse/TIKA-1585
 Project: Tika
  Issue Type: New Feature
  Components: example, server
Reporter: Tyler Palsulich
Assignee: Tyler Palsulich

 It would be great to have a website where we can direct people who ask what 
 Tika can do for [filetype] without needing them to actually download Tika.
 Some initial work to do that is 
 [here|http://tpalsulich.github.io/TikaExamples/].
 I'm far from a design guru, but I imagine the site as having a form where you 
 can upload a file at the top, checkboxes for if you want metadata, content, 
 or both, and a submit button. The request should be sent with AJAX and the 
 result should populate a {{div}}.
 One issue with AJAX requests is that Tika Server doesn't currently allow 
 Cross-Origin-Resource-Sharing (CORS). So, we either need to maintain a 
 slightly updated tika-server, or update the server to allow configuration.





[jira] [Commented] (TIKA-1536) Upgrade compiler definition in pom's to Java 7

2015-06-29 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605300#comment-14605300
 ] 

Tyler Palsulich commented on TIKA-1536:
---

Now that 1.9 is released, are there any blockers for upgrading to Java 1.7?

 Upgrade compiler definition in pom's to Java 7
 --

 Key: TIKA-1536
 URL: https://issues.apache.org/jira/browse/TIKA-1536
 Project: Tika
  Issue Type: Improvement
  Components: packaging
Affects Versions: 1.7
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 1.10

 Attachments: TIKA-1536.patch


 Since we committed TIKA-1423 it would appear through [mailing 
 list|http://www.mail-archive.com/dev%40tika.apache.org/msg11542.html] 
 commentary that there is a willingness to drop support for Java 1.6 in favour 
 of >= Java 1.7.
 This issue simply addresses this.





[jira] [Closed] (TIKA-1199) Tika extracts weird signs instead of text

2015-06-09 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich closed TIKA-1199.
-
Resolution: Not A Problem

 Tika extracts weird signs instead of text
 -

 Key: TIKA-1199
 URL: https://issues.apache.org/jira/browse/TIKA-1199
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
 Environment: MacOSX, Linux
Reporter: Marc Teutelink
 Attachments: gaat fout.pdf, 
 plain_text_tika_output_from_gaat_fout_pdf.txt, 
 structured_text_tika_output_from_gaat_fout_pdf.xml


 Tika extracts completely bogus text from the attached document. I have 
 attached the .PDF in question and also added the plain and structured text 
 output from Tika.





[jira] [Resolved] (TIKA-1630) Mention APK support in List of Supported Formats

2015-06-09 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich resolved TIKA-1630.
---
   Resolution: Fixed
Fix Version/s: 1.9
 Assignee: Tyler Palsulich

Bolded the "Please note" paragraph for version 1.9. Hopefully that will help 
clear things up.

[~flowlo], thank you for reporting this! Please let us know if you run into any 
other issues or have any other suggested improvements.

 Mention APK support in List of Supported Formats
 

 Key: TIKA-1630
 URL: https://issues.apache.org/jira/browse/TIKA-1630
 Project: Tika
  Issue Type: Improvement
  Components: documentation
Affects Versions: 1.8
Reporter: Lorenz Leutgeb
Assignee: Tyler Palsulich
Priority: Trivial
 Fix For: 1.9


 http://tika.apache.org/1.8/formats.html claims to offer a full list of 
 supported formats but does not mention support for APK files at all.
 I trusted that source and only found that Tika supports APK files and their 
 respective MIME types from looking at Tika's codebase, which is suboptimal.
 Please add APK files to that list as appropriate (at least include the MIME 
 type Tika understands).
 Consider reevaluating the list to find out whether other formats are missing 
 (this is not covered by this ticket).





[jira] [Commented] (TIKA-1652) Tika Server should allow config file override from the command line like Tika App

2015-06-06 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14575986#comment-14575986
 ] 

Tyler Palsulich commented on TIKA-1652:
---

I think this is a duplicate of TIKA-1426?

 Tika Server should allow config file override from the command line like Tika 
 App
 -

 Key: TIKA-1652
 URL: https://issues.apache.org/jira/browse/TIKA-1652
 Project: Tika
  Issue Type: Bug
  Components: server
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
 Fix For: 1.9


 Tika-app's TikaCLI allows a command line parameter, --config, to override the 
 Tika config at the command line. For whatever reason, Tika-server doesn't. It 
 should, since it causes a different control flow for things to get created. I 
 first saw this when testing the CTAKESParser (TIKA-1645) in Tika-server.





[jira] [Commented] (TIKA-1624) Syntax error in DOAP file release section

2015-05-20 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14553281#comment-14553281
 ] 

Tyler Palsulich commented on TIKA-1624:
---

Thanks, Ken. I published the file a few minutes ago.

 Syntax error in DOAP file release section
 -

 Key: TIKA-1624
 URL: https://issues.apache.org/jira/browse/TIKA-1624
 Project: Tika
  Issue Type: Bug
 Environment: 
 http://svn.apache.org/repos/asf/tika/site/src/site/resources/doap.rdf
Reporter: Sebb
Assignee: Ken Krugler

 DOAP files can contain details of multiple release Versions, however each 
 must be listed in a separate release section, for example:
 <release>
   <Version>
     <name>Apache XYZ</name>
     <created>2015-02-16</created>
     <revision>1.6.2</revision>
   </Version>
 </release>
 <release>
   <Version>
     <name>Apache XYZ</name>
     <created>2014-09-24</created>
     <revision>1.6.1</revision>
   </Version>
 </release>
 Please can the project DOAP be corrected accordingly?
 Thanks.





[jira] [Commented] (TIKA-1630) Mention APK support in List of Supported Formats

2015-05-20 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14553272#comment-14553272
 ] 

Tyler Palsulich commented on TIKA-1630:
---

That is a very good point. There is a paragraph on the formats page which 
explains in a little bit more detail:
bq. (Please note that Apache Tika is able to detect a much wider range of 
formats than those listed below, this page only documents those formats from 
which Tika is able to extract metadata and/or textual content)

Would it help if we included a link to the mimetypes file (which has all 
filetypes Tika can detect)?

 Mention APK support in List of Supported Formats
 

 Key: TIKA-1630
 URL: https://issues.apache.org/jira/browse/TIKA-1630
 Project: Tika
  Issue Type: Improvement
  Components: documentation
Affects Versions: 1.8
Reporter: Lorenz Leutgeb
Priority: Trivial

 http://tika.apache.org/1.8/formats.html claims to offer a full list of 
 supported formats but does not mention support for APK files at all.
 I trusted that source and only found that Tika supports APK files and their 
 respective MIME types from looking at Tika's codebase, which is suboptimal.
 Please add APK files to that list as appropriate (at least include the MIME 
 type Tika understands).
 Consider reevaluating the list to find out whether other formats are missing 
 (this is not covered by this ticket).





[jira] [Commented] (TIKA-1630) Mention APK support in List of Supported Formats

2015-05-14 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14544104#comment-14544104
 ] 

Tyler Palsulich commented on TIKA-1630:
---

Hi. Thanks for reporting this! Can you be a little more specific about which 
file is supported? What in the Tika codebase indicates support for APK formats? 
Also, just to be clear, are you referring to android application packages?

 Mention APK support in List of Supported Formats
 

 Key: TIKA-1630
 URL: https://issues.apache.org/jira/browse/TIKA-1630
 Project: Tika
  Issue Type: Improvement
  Components: documentation
Affects Versions: 1.8
Reporter: Lorenz Leutgeb
Priority: Trivial

 http://tika.apache.org/1.8/formats.html claims to offer a full list of 
 supported formats but does not mention support for APK files at all.
 I trusted that source and only found that Tika supports APK files and their 
 respective MIME types from looking at Tika's codebase, which is suboptimal.
 Please add APK files to that list as appropriate (at least include the MIME 
 type Tika understands).
 Consider reevaluating the list to find out whether other formats are missing 
 (this is not covered by this ticket).





[jira] [Commented] (TIKA-1624) Syntax error in DOAP file release section

2015-05-14 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14544150#comment-14544150
 ] 

Tyler Palsulich commented on TIKA-1624:
---

[~kkrugler], yes. I just updated the release instructions.

 Syntax error in DOAP file release section
 -

 Key: TIKA-1624
 URL: https://issues.apache.org/jira/browse/TIKA-1624
 Project: Tika
  Issue Type: Bug
 Environment: 
 http://svn.apache.org/repos/asf/tika/site/src/site/resources/doap.rdf
Reporter: Sebb
Assignee: Ken Krugler

 DOAP files can contain details of multiple release Versions, however each 
 must be listed in a separate release section, for example:
 <release>
   <Version>
     <name>Apache XYZ</name>
     <created>2015-02-16</created>
     <revision>1.6.2</revision>
   </Version>
 </release>
 <release>
   <Version>
     <name>Apache XYZ</name>
     <created>2014-09-24</created>
     <revision>1.6.1</revision>
   </Version>
 </release>
 Please can the project DOAP be corrected accordingly?
 Thanks.



[jira] [Commented] (TIKA-1585) Create Example Website with Form Submission

2015-04-22 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14507259#comment-14507259
 ] 

Tyler Palsulich commented on TIKA-1585:
---

Is there an Apache-hosted location where we'd like to stand this up? If not, 
I'll close this issue off.

http://tpalsulich.github.io/TikaExamples/

 Create Example Website with Form Submission
 ---

 Key: TIKA-1585
 URL: https://issues.apache.org/jira/browse/TIKA-1585
 Project: Tika
  Issue Type: New Feature
  Components: example, server
Reporter: Tyler Palsulich
Assignee: Tyler Palsulich

 It would be great to have a website where we can direct people who ask what 
 Tika can do for [filetype] without needing them to actually download Tika.
 Some initial work to do that is 
 [here|http://tpalsulich.github.io/TikaExamples/].
 I'm far from a design guru, but I imagine the site as having a form where you 
 can upload a file at the top, checkboxes for if you want metadata, content, 
 or both, and a submit button. The request should be sent with AJAX and the 
 result should populate a {{div}}.
 One issue with AJAX requests is that Tika Server doesn't currently allow 
 Cross-Origin-Resource-Sharing (CORS). So, we either need to maintain a 
 slightly updated tika-server, or update the server to allow configuration.



[jira] [Commented] (TIKA-1607) Introduce new HashMap<String, Object> data structure for persistence of Tika Metadata

2015-04-20 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14503778#comment-14503778
 ] 

Tyler Palsulich commented on TIKA-1607:
---

Good idea! What if you created a subclass of {{Metadata}} 
({{ExtendedMetadata}}?) which supports mapping to a {{List<Map<String, 
Object>>}}? Then, when populating the metadata with a phone number, you can 
check if {{metadata instanceof ExtendedMetadata}} and respond accordingly.

Any drastic changes would be a good candidate for Tika 2.0.
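
A minimal sketch of the subclass idea suggested above, under loud assumptions: 
the nested Metadata class here is a tiny stand-in written for this example (not 
org.apache.tika.metadata.Metadata), and ExtendedMetadata, addStructured, 
getStructured and recordPhoneNumber are hypothetical names that do not exist in 
Tika. It only shows how an instanceof check could let a parser keep writing 
plain String values while offering richer entries when the caller supports them.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ExtendedMetadataSketch {

    // Stand-in for Tika's Metadata (NOT the real class): multi-valued Strings only.
    static class Metadata {
        private final Map<String, List<String>> values = new HashMap<>();

        void add(String name, String value) {
            values.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
        }

        String[] getValues(String name) {
            return values.getOrDefault(name, Collections.emptyList()).toArray(new String[0]);
        }
    }

    // Hypothetical subclass: additionally keeps a List<Map<String, Object>> per name.
    static class ExtendedMetadata extends Metadata {
        private final Map<String, List<Map<String, Object>>> structured = new HashMap<>();

        void addStructured(String name, Map<String, Object> entry) {
            structured.computeIfAbsent(name, k -> new ArrayList<>()).add(entry);
        }

        List<Map<String, Object>> getStructured(String name) {
            return structured.getOrDefault(name, Collections.emptyList());
        }
    }

    // Parser-side helper: always writes the flat String value so existing callers
    // keep working, and adds the structured entry only when it is supported.
    static void recordPhoneNumber(Metadata metadata, String number, String countryCode) {
        metadata.add("phonenumbers", number);
        if (metadata instanceof ExtendedMetadata) {
            Map<String, Object> details = new LinkedHashMap<>();
            details.put("number", number);
            details.put("LibPN-CountryCode", countryCode);
            ((ExtendedMetadata) metadata).addStructured("phonenumbers", details);
        }
    }

    public static void main(String[] args) {
        Metadata plain = new Metadata();
        ExtendedMetadata extended = new ExtendedMetadata();
        recordPhoneNumber(plain, "+162648743476", "US");
        recordPhoneNumber(extended, "+162648743476", "US");
        System.out.println(plain.getValues("phonenumbers").length);
        System.out.println(extended.getStructured("phonenumbers"));
    }
}
```

Writing the flat value in both cases is a design choice of this sketch: it keeps 
the String[] view backwards compatible while the structured view stays optional.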

 Introduce new HashMap<String, Object> data structure for persistence of Tika 
 Metadata
 -

 Key: TIKA-1607
 URL: https://issues.apache.org/jira/browse/TIKA-1607
 Project: Tika
  Issue Type: Improvement
  Components: core, metadata
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Critical
 Fix For: 1.9


 I am currently working on implementing more comprehensive extraction and 
 enhancement of the Tika support for Phone number extraction and metadata 
 modeling.
 Right now we utilize the String[] multivalued support available within Tika 
 to persist phone numbers as 
 {code}
 Metadata: String: String[]
 Metadata: phonenumbers: number1, number2, number3, ...
 {code}
 I would like to propose we extend multi-valued support outside of the 
 String[] paradigm by implementing a more abstract Collection of Objects such 
 that we could consider and implement the phone number use case as follows
 {code}
 Metadata: String:  List<HashMap<String,String>>
 {code}
 Where Object could be a Collection<HashMap<String/Property, String/int/long>>, 
 e.g.
 {code}
 Metadata: phonenumbers: [(+162648743476: (LibPN-CountryCode : US), 
 (LibPN-NumberType: International), (etc: etc)...), (+1292611054: 
 (LibPN-CountryCode : UK), (LibPN-NumberType: International), (etc: etc)...), 
 (etc)] 
 {code}
 There are obvious backwards compatibility issues with this approach... 
 additionally it is a fundamental change to the core Metadata API. I hope that 
 the <String, Object> mapping however is flexible enough to allow me to model 
 Tika Metadata the way I want.
 Any comments folks? Thanks
 Lewis



[jira] [Closed] (TIKA-1266) Tika OSGI Bundle needs Bundle-ClassPath to work in Equinox

2015-04-16 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich closed TIKA-1266.
-
Resolution: Not A Problem

Thanks, [~bobpaulin]!

 Tika OSGI Bundle needs Bundle-ClassPath to work in Equinox
 --

 Key: TIKA-1266
 URL: https://issues.apache.org/jira/browse/TIKA-1266
 Project: Tika
  Issue Type: Improvement
  Components: packaging
Affects Versions: 1.4, 1.5
Reporter: pm

 The tika-bundle currently has the Embed-Dependency header filled with 
 embedded dependencies. 
 Embed-Dependency is not defined in the OSGi spec; Bundle-ClassPath is.
 Please add Bundle-ClassPath with the list of embedded JAR names prefixed with 
 ".,".



[jira] [Commented] (TIKA-1593) Doco: Broken link to Parser Quick Start Guide

2015-04-13 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492638#comment-14492638
 ] 

Tyler Palsulich commented on TIKA-1593:
---

See https://svn.apache.org/repos/asf/tika/site/src/site/apt/download.apt.vm -- 
you need the vm extension. Then, you can use 
{code}${project.parent.version}{code} to get the current version of the 
project. Then, when we update the site for a new release, you just have to 
change the version number in the site's pom.xml file.

I'll fix this right now.

 Doco: Broken link to Parser Quick Start Guide
 ---

 Key: TIKA-1593
 URL: https://issues.apache.org/jira/browse/TIKA-1593
 Project: Tika
  Issue Type: Bug
  Components: documentation
Affects Versions: 1.7
Reporter: Dan Rollo
Priority: Minor

 The Tika web page: https://tika.apache.org/contribute.html, under the 
 Section: New Parsers, Detectors and Mime Types, there is a link with the 
 text: Parser Quick Start Guide. The link URL is: 
 https://tika.apache.org/parser_guide.apt, and does not work. 
 The .apt extension seems odd. I don't know what the link should be.



[jira] [Resolved] (TIKA-1593) Doco: Broken link to Parser Quick Start Guide

2015-04-13 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich resolved TIKA-1593.
---
Resolution: Fixed
  Assignee: Tyler Palsulich

Fixed in r1673240. Thank you [~bhamail]! Please let us know if you find any 
more.

 Doco: Broken link to Parser Quick Start Guide
 ---

 Key: TIKA-1593
 URL: https://issues.apache.org/jira/browse/TIKA-1593
 Project: Tika
  Issue Type: Bug
  Components: documentation
Affects Versions: 1.7
Reporter: Dan Rollo
Assignee: Tyler Palsulich
Priority: Minor




[jira] [Comment Edited] (TIKA-1593) Doco: Broken link to Parser Quick Start Guide

2015-04-13 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492662#comment-14492662
 ] 

Tyler Palsulich edited comment on TIKA-1593 at 4/13/15 5:02 PM:


Fixed in r1673240 and r1673241. Thank you [~bhamail]! Please let us know if you 
find any more.


was (Author: tpalsulich):
Fixed in r1673240. Thank you [~bhamail]! Please let us know if you find any 
more.

 Doco: Broken link to Parser Quick Start Guide
 ---

 Key: TIKA-1593
 URL: https://issues.apache.org/jira/browse/TIKA-1593
 Project: Tika
  Issue Type: Bug
  Components: documentation
Affects Versions: 1.7
Reporter: Dan Rollo
Assignee: Tyler Palsulich
Priority: Minor




[jira] [Resolved] (TIKA-1600) Unable to parse ODT files because of failed to close temporary resources

2015-04-13 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich resolved TIKA-1600.
---
Resolution: Fixed
  Assignee: Hong-Thai Nguyen

Thanks, [~thaichat04]! I just updated it -- reformatted the ODF parsing files 
(they were all a bit odd with whitespace) and moved the test into the existing 
test file.

Marking this as fixed and will cut a new release shortly.

 Unable to parse ODT files because of failed to close temporary resources
 

 Key: TIKA-1600
 URL: https://issues.apache.org/jira/browse/TIKA-1600
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.8
 Environment: Windows
Reporter: Hong-Thai Nguyen
Assignee: Hong-Thai Nguyen
 Attachments: Manuel_koha.odt


 Many ODT files fail to parse because of this exception. A sample file is 
 attached.
 {code}
 Apache Tika was unable to parse the document
 at C:\Users\hong-thai.nguyen\Downloads\Manuel_koha.odt.
 The full exception stack trace is included below:
 org.apache.tika.exception.TikaException: Failed to close temporary resources
   at org.apache.tika.io.TemporaryResources.dispose(TemporaryResources.java:152)
   at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:127)
   at org.apache.tika.gui.TikaGUI.handleStream(TikaGUI.java:342)
   at org.apache.tika.gui.TikaGUI.openFile(TikaGUI.java:299)
   at org.apache.tika.gui.TikaGUI.actionPerformed(TikaGUI.java:256)
   at javax.swing.AbstractButton.fireActionPerformed(Unknown Source)
   at javax.swing.AbstractButton$Handler.actionPerformed(Unknown Source)
   at javax.swing.DefaultButtonModel.fireActionPerformed(Unknown Source)
   at javax.swing.DefaultButtonModel.setPressed(Unknown Source)
   at javax.swing.AbstractButton.doClick(Unknown Source)
   at javax.swing.plaf.basic.BasicMenuItemUI.doClick(Unknown Source)
   at javax.swing.plaf.basic.BasicMenuItemUI$Handler.mouseReleased(Unknown Source)
   at java.awt.Component.processMouseEvent(Unknown Source)
   at javax.swing.JComponent.processMouseEvent(Unknown Source)
   at java.awt.Component.processEvent(Unknown Source)
   at java.awt.Container.processEvent(Unknown Source)
   at java.awt.Component.dispatchEventImpl(Unknown Source)
   at java.awt.Container.dispatchEventImpl(Unknown Source)
   at java.awt.Component.dispatchEvent(Unknown Source)
   at java.awt.LightweightDispatcher.retargetMouseEvent(Unknown Source)
   at java.awt.LightweightDispatcher.processMouseEvent(Unknown Source)
   at java.awt.LightweightDispatcher.dispatchEvent(Unknown Source)
   at java.awt.Container.dispatchEventImpl(Unknown Source)
   at java.awt.Window.dispatchEventImpl(Unknown Source)
   at java.awt.Component.dispatchEvent(Unknown Source)
   at java.awt.EventQueue.dispatchEventImpl(Unknown Source)
   at java.awt.EventQueue.access$400(Unknown Source)
   at java.awt.EventQueue$3.run(Unknown Source)
   at java.awt.EventQueue$3.run(Unknown Source)
   at java.security.AccessController.doPrivileged(Native Method)
   at java.security.ProtectionDomain$1.doIntersectionPrivilege(Unknown Source)
   at java.security.ProtectionDomain$1.doIntersectionPrivilege(Unknown Source)
   at java.awt.EventQueue$4.run(Unknown Source)
   at java.awt.EventQueue$4.run(Unknown Source)
   at java.security.AccessController.doPrivileged(Native Method)
   at java.security.ProtectionDomain$1.doIntersectionPrivilege(Unknown Source)
   at java.awt.EventQueue.dispatchEvent(Unknown Source)
   at java.awt.EventDispatchThread.pumpOneEventForFilters(Unknown Source)
   at java.awt.EventDispatchThread.pumpEventsForFilter(Unknown Source)
   at java.awt.EventDispatchThread.pumpEventsForHierarchy(Unknown Source)
   at java.awt.EventDispatchThread.pumpEvents(Unknown Source)
   at java.awt.EventDispatchThread.pumpEvents(Unknown Source)
   at java.awt.EventDispatchThread.run(Unknown Source)
 Caused by: java.io.IOException: Could not delete temporary file C:\Users\HONG-T~1.NGU\AppData\Local\Temp\apache-tika-2891340188156641845.tmp
   at org.apache.tika.io.TemporaryResources$1.close(TemporaryResources.java:70)
   at org.apache.tika.io.TemporaryResources.close(TemporaryResources.java:121)
   at org.apache.tika.io.TemporaryResources.dispose(TemporaryResources.java:150)
   ... 42 more
 {code}



[jira] [Updated] (TIKA-1600) Unable to parse ODT files because of failed to close temporary resources

2015-04-13 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich updated TIKA-1600:
--
Priority: Blocker  (was: Major)

 Unable to parse ODT files because of failed to close temporary resources
 

 Key: TIKA-1600
 URL: https://issues.apache.org/jira/browse/TIKA-1600
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.8
 Environment: Windows
Reporter: Hong-Thai Nguyen
Assignee: Hong-Thai Nguyen
Priority: Blocker
 Attachments: Manuel_koha.odt





[jira] [Closed] (TIKA-1592) It seems dbus and x11 server are invoked, and fails for some reason too

2015-04-03 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich closed TIKA-1592.
-
Resolution: Invalid

Closing as Invalid. Feel free to create additional issues if you run into other 
problems with Tika!

Thank you for updating with the solution! I'm glad you found it. :) (I'm also 
glad this wasn't a Tika issue... Ha.)

 It seems dbus and x11 server are invoked, and fails for some reason too
 ---

 Key: TIKA-1592
 URL: https://issues.apache.org/jira/browse/TIKA-1592
 Project: Tika
  Issue Type: Bug
Affects Versions: 1.7
 Environment: CentOs 6.6, Java 1.7
Reporter: Michael Couck

 Exception running unit tests:
 GConf Error: Failed to contact configuration server; some possible causes are 
 that you need to enable TCP/IP networking for ORBit, or you have stale NFS 
 locks due to a system crash. See http://projects.gnome.org/gconf/ for 
 information. (Details -  1: Not running within active session)
 Is Tika trying to start an x11 server using dbus? Why? This breaks the unit 
 tests, the logging is a gig for each run, and even a 64 core server is 100% 
 cpu during the failure. I am completely confounded. Any ideas?



[jira] [Commented] (TIKA-1592) It seems dbus and x11 server are invoked, and fails for some reason too

2015-04-02 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14393246#comment-14393246
 ] 

Tyler Palsulich commented on TIKA-1592:
---

I tried building ikube on a Mac, but I ran into multiple test failures.
{code}
Tests in error:
  analyze(ikube.analytics.weka.WekaClassifierIntegration)
  initializationError(ikube.action.rule.IsRemoteIndexCurrentIntegration)
  initializationError(ikube.analytics.weka.WekaForecastClassifierIntegration)
  initializationError(ikube.database.DataBaseIntegration)
  initializationError(ikube.action.index.handler.database.TableResourceProviderIntegration)
  initializationError(ikube.web.service.AnalyzerIntegration)
  initializationError(ikube.analytics.AnalyticsServiceIntegration)
  initializationError(ikube.scheduling.SnapshotScheduleIntegration)
  initializationError(ikube.web.service.SearcherJsonIntegration)
  initializationError(ikube.scheduling.PruneScheduleIntegration)
  initializationError(ikube.action.index.handler.email.IndexableEmailHandlerIntegration)
  initializationError(ikube.action.index.handler.strategy.GeospatialEnrichmentStrategyIntegration)
  initializationError(ikube.action.index.handler.filesystem.IndexableFilesystemHandlerIntegration)
  initializationError(ikube.web.service.SearcherXmlIntegration)
  initializationError(ikube.action.ResetIntegration)
  initializationError(ikube.action.index.handler.internet.SvnHandlerIntegration)
  initializationError(ikube.toolkit.DatabaseUtilitiesIntegration)
  initializationError(ikube.action.rule.RulesIntegration)
  initializationError(ikube.analytics.neuroph.NeurophAnalyzerIntegration)
  initializationError(ikube.database.EntityIntegration)
  initializationError(ikube.cluster.hzc.ClusterManagerCacheSearchIntegration)
  initializationError(ikube.action.index.handler.database.IndexableTableHandlerIntegration)
{code}

Is Linux required?

Can you give some context on how you're using Tika in the failing unit test? 
Tika should have little to no OS-specific code, so it isn't clear why anything 
would try to start x11. But a dependency could definitely be up to something 
fishy.

 It seems dbus and x11 server are invoked, and fails for some reason too
 ---

 Key: TIKA-1592
 URL: https://issues.apache.org/jira/browse/TIKA-1592
 Project: Tika
  Issue Type: Bug
Affects Versions: 1.7
 Environment: CentOs 6.6, Java 1.7
Reporter: Michael Couck




[jira] [Commented] (TIKA-1592) It seems dbus and x11 server are invoked, and fails for some reason too

2015-04-02 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14393184#comment-14393184
 ] 

Tyler Palsulich commented on TIKA-1592:
---

Thanks for reporting this, [~michaelcouck]! Just to be clear, you're building 
Tika 1.7 from source?

Which test case causes this? After a quick {{grep}}, I don't see any gconf or 
dbus references (don't know why there would be any, off the top of my head...). 
When you say the logging is a gig, is that what is sent to stdout when doing 
{{mvn install}}? Or something else?

 It seems dbus and x11 server are invoked, and fails for some reason too
 ---

 Key: TIKA-1592
 URL: https://issues.apache.org/jira/browse/TIKA-1592
 Project: Tika
  Issue Type: Bug
Affects Versions: 1.7
 Environment: CentOs 6.6, Java 1.7
Reporter: Michael Couck




[jira] [Comment Edited] (TIKA-1592) It seems dbus and x11 server are invoked, and fails for some reason too

2015-04-02 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14393184#comment-14393184
 ] 

Tyler Palsulich edited comment on TIKA-1592 at 4/2/15 7:09 PM:
---

Thanks for reporting this, [~michaelcouck]! Just to be clear, you're building 
Tika 1.7 from source?

Which test case causes this? -After a quick {{grep}}, I don't see any gconf or 
dbus references (don't know why there would be any, off the top of my head...)- 
See {{grep}} output below. When you say the logging is a gig, is that what is 
sent to stdout when doing {{mvn install}}? Or something else?

{code}
➜  trunk  grep -Ri dbus .
Binary file ./tika-parsers/src/test/resources/test-documents/testTIFF.tif matches
Binary file ./tika-parsers/target/test-classes/test-documents/testTIFF.tif matches
Binary file ./tika-parsers/target/tika-parsers-1.8-SNAPSHOT-tests.jar matches
Binary file ./tika-server/target/tika-server-1.8-SNAPSHOT.jar matches
➜  trunk  grep -Ri gconf .
Binary file ./tika-app/target/tika-app-1.8-SNAPSHOT.jar matches
Binary file ./tika-server/target/tika-server-1.8-SNAPSHOT.jar matches
{code}


was (Author: tpalsulich):
Thanks for reporting this, [~michaelcouck]! Just to be clear, you're building 
Tika 1.7 from source?

Which test case causes this? After a quick {{grep}}, I don't see any gconf or 
dbus references (don't know why there would be any, off the top of my head...). 
When you say the logging is a gig, is that what is sent to stdout when doing 
{{mvn install}}? Or something else?

 It seems dbus and x11 server are invoked, and fails for some reason too
 ---

 Key: TIKA-1592
 URL: https://issues.apache.org/jira/browse/TIKA-1592
 Project: Tika
  Issue Type: Bug
Affects Versions: 1.7
 Environment: CentOs 6.6, Java 1.7
Reporter: Michael Couck




[jira] [Comment Edited] (TIKA-1585) Create Example Website with Form Submission

2015-04-01 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14390841#comment-14390841
 ] 

Tyler Palsulich edited comment on TIKA-1585 at 4/1/15 3:51 PM:
---

Done. It works. -I'll see if I can shut 9997 down right now.- Port 9997 is now 
closed.


was (Author: tpalsulich):
Done. It works. I'll see if I can shut 9997 down right now.

 Create Example Website with Form Submission
 ---

 Key: TIKA-1585
 URL: https://issues.apache.org/jira/browse/TIKA-1585
 Project: Tika
  Issue Type: New Feature
  Components: example, server
Reporter: Tyler Palsulich
Assignee: Tyler Palsulich




[jira] [Updated] (TIKA-1558) Create a Parser Blacklist

2015-03-31 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich updated TIKA-1558:
--
Description: 
As talked about in TIKA-1555 and TIKA-1557, it would be nice to be able to 
disable Parsers without pulling their dependencies out. In some cases (e.g. 
disable all ExternalParsers), there may not be an easy way to exclude the 
dependencies via Maven.

-So, an initial design would be to include another file like 
{{META-INF/services/org.apache.tika.parser.Parser.blacklist}}. We create a new 
method {{ServiceLoader#loadServiceProviderBlacklist}}. Then, in 
{{ServiceLoader#loadServiceProviders}}, we remove all elements of the list that 
are assignable to an element in {{ServiceLoader#loadServiceProviderBlacklist}}.-

  was:
As talked about in TIKA-1555 and TIKA-1557, it would be nice to be able to 
disable Parsers without pulling their dependencies out. In some cases (e.g. 
disable all ExternalParsers), there may not be an easy way to exclude the 
dependencies via Maven.

So, an initial design would be to include another file like 
{{META-INF/services/org.apache.tika.parser.Parser.blacklist}}. We create a new 
method {{ServiceLoader#loadServiceProviderBlacklist}}. Then, in 
{{ServiceLoader#loadServiceProviders}}, we remove all elements of the list that 
are assignable to an element in {{ServiceLoader#loadServiceProviderBlacklist}}.


 Create a Parser Blacklist
 -

 Key: TIKA-1558
 URL: https://issues.apache.org/jira/browse/TIKA-1558
 Project: Tika
  Issue Type: New Feature
Reporter: Tyler Palsulich
Assignee: Tyler Palsulich
 Fix For: 1.8


 As talked about in TIKA-1555 and TIKA-1557, it would be nice to be able to 
 disable Parsers without pulling their dependencies out. In some cases (e.g. 
 disable all ExternalParsers), there may not be an easy way to exclude the 
 dependencies via Maven.
 -So, an initial design would be to include another file like 
 {{META-INF/services/org.apache.tika.parser.Parser.blacklist}}. We create a 
 new method {{ServiceLoader#loadServiceProviderBlacklist}}. Then, in 
 {{ServiceLoader#loadServiceProviders}}, we remove all elements of the list 
 that are assignable to an element in 
 {{ServiceLoader#loadServiceProviderBlacklist}}.-



[jira] [Comment Edited] (TIKA-1558) Create a Parser Blacklist

2015-03-31 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1432#comment-1432
 ] 

Tyler Palsulich edited comment on TIKA-1558 at 3/31/15 9:41 PM:


-Above strategy added in r1661284. You can now blacklist Parsers by adding 
names to {{META-INF/services/org.apache.tika.parser.Parser.blacklist}} with the 
same format as the normal services file. If a class is blacklisted, all of its 
subclasses are automatically blacklisted.-

Edit: Service loading blacklisting disabled in r1670487. Use a custom 
TikaConfig like [this 
one|https://svn.apache.org/repos/asf/tika/trunk/tika-parsers/src/test/resources/org/apache/tika/config/TIKA-1558-blacklistsub.xml]
 to disable a Parser. Any subclasses of that Parser will also be excluded.
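
The "subclasses are excluded too" rule mentioned above can be sketched with 
Class#isAssignableFrom. This is only an illustration under stated assumptions: 
the Parser interface and the three parser classes below are stand-ins defined 
for this example, and withoutExcluded is a hypothetical helper, not Tika's 
ServiceLoader or TikaConfig code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ParserExclusionSketch {

    // Stand-in types for illustration only (not the Tika classes).
    interface Parser {}
    static class ExternalParser implements Parser {}
    static class CustomExternalParser extends ExternalParser {}
    static class TextParser implements Parser {}

    // Hypothetical helper: drops every loaded parser whose class is, or extends,
    // one of the excluded classes.
    static List<Parser> withoutExcluded(List<Parser> loaded,
                                        List<Class<? extends Parser>> excluded) {
        List<Parser> kept = new ArrayList<>();
        for (Parser parser : loaded) {
            boolean blocked = false;
            for (Class<? extends Parser> type : excluded) {
                // True when parser's class is the excluded type or a subclass of it.
                if (type.isAssignableFrom(parser.getClass())) {
                    blocked = true;
                    break;
                }
            }
            if (!blocked) {
                kept.add(parser);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Parser> loaded = Arrays.asList(
                new ExternalParser(), new CustomExternalParser(), new TextParser());
        List<Class<? extends Parser>> excluded = new ArrayList<>();
        excluded.add(ExternalParser.class);
        // Only the TextParser instance survives the filter.
        System.out.println(withoutExcluded(loaded, excluded).size());
    }
}
```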


was (Author: tpalsulich):
Above strategy added in r1661284. You can now blacklist Parsers by adding names 
to {{META-INF/services/org.apache.tika.parser.Parser.blacklist}} with the same 
format as the normal services file. If a class is blacklisted, all of its 
subclasses are automatically blacklisted.

 Create a Parser Blacklist
 -

 Key: TIKA-1558
 URL: https://issues.apache.org/jira/browse/TIKA-1558
 Project: Tika
  Issue Type: New Feature
Reporter: Tyler Palsulich
Assignee: Tyler Palsulich
 Fix For: 1.8







[jira] [Commented] (TIKA-1587) ForkParser::setJavaCommand should take List<String>

2015-03-30 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386685#comment-14386685
 ] 

Tyler Palsulich commented on TIKA-1587:
---

Thank you for reporting this! It seems like a definite problem. Is there any 
way you can provide a patch?

 ForkParser::setJavaCommand should take List<String>
 ---

 Key: TIKA-1587
 URL: https://issues.apache.org/jira/browse/TIKA-1587
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.7
Reporter: Oleg Oshmyan

 ForkParser::setJavaCommand currently takes a string and splits it on 
 whitespace. This makes it impossible to use commands with paths that contain 
 spaces. In particular, it makes it impossible to reliably use 
 System.getProperty("java.home") in order to launch the same Java that the 
 current process is running in, because it might contain spaces. If it would 
 just take a List<String> and pass (a clone of) it directly to ProcessBuilder, 
 this wouldn't be a problem.
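
The difference can be sketched with the JDK alone. This is a minimal illustration, not ForkParser's code: the method names and the install path are hypothetical, and no process is actually started.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class JavaCommandSketch {
    /** Mimics the whitespace splitting described in the report. */
    static List<String> splitOnWhitespace(String command) {
        return Arrays.asList(command.trim().split("\\s+"));
    }

    /** Builds the command from a list so each argument stays intact. */
    static ProcessBuilder fromList(List<String> command) {
        // Copy the list so later changes by the caller do not affect the builder.
        return new ProcessBuilder(new ArrayList<>(command));
    }

    public static void main(String[] args) {
        // Hypothetical install path that contains a space.
        String java = "/opt/my jdk/bin/java";
        // Splitting breaks the path in two: 3 tokens instead of 2.
        System.out.println(splitOnWhitespace(java + " -Xmx32m").size());
        // The list form keeps the path as a single argument: 2 tokens.
        System.out.println(fromList(Arrays.asList(java, "-Xmx32m")).command().size());
    }
}
```

With the list form, the first element of {{command()}} is the full path, spaces included, which is what ProcessBuilder needs.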





[jira] [Comment Edited] (TIKA-1584) Tika 1.7 possible regression (nested attachment files not getting parsed)

2015-03-30 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386906#comment-14386906
 ] 

Tyler Palsulich edited comment on TIKA-1584 at 3/30/15 4:05 PM:


Yup! The 1.8 release process should start this week. Ideally, it will hit the 
mirrors some time next week.

[edit: 1.8, not 1.7!]


was (Author: tpalsulich):
Yup! The 1.7 release process should start this week. Ideally, it will hit the 
mirrors some time next week.

 Tika 1.7 possible regression (nested attachment files not getting parsed)
 -

 Key: TIKA-1584
 URL: https://issues.apache.org/jira/browse/TIKA-1584
 Project: Tika
  Issue Type: Bug
  Components: server
Affects Versions: 1.7
Reporter: Rob Tulloh
Assignee: Tim Allison
Priority: Blocker
 Fix For: 1.8


 I tried to send this to the tika user list, but got a qmail failure so I am 
 opening a jira to see if I can get help with this.
 There appears to be a change in the behavior of tika since 1.5 (the last 
 version we have used). In 1.5, if we pass a file with content type of rfc822 
 which contains a zip that contains a docx file, the entire content would get 
 recursed and the text returned. In 1.7, tika only unwinds as far as the zip 
 file and ignores the content of the contained docx file. This is causing a 
 regression failure in our search tests because the contents of the docx file 
 are not found when searched for.
  
 We are testing with tika-server if this helps. If we ask the meta service to 
 just characterize the test data, it correctly determines the input is of type 
 rfc822. However, on extract, the contents of the attachment are not extracted 
 as expected.
 curl -X PUT -T test.eml -q -H Content-Type:application/octet-stream  
 http://localhost:9998/meta 2>/dev/null | grep Content-Type
 Content-Type,message/rfc822
 curl -X PUT -T test.eml -q -H Content-Type:application/octet-stream  
 http://localhost:9998/tika 2>/dev/null | grep docx
 sign.docx   <--- this is not expected, need contents of this extracted
 We can easily reproduce this problem with a simple eml file with an 
 attachment. Can someone please comment if this seems like a problem or 
 perhaps we need to change something in our call to get the old behavior?
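
For anyone reproducing this from Java rather than curl, the same PUT can be sketched with the JDK's {{java.net.http}} client (Java 11+). The class and method names are hypothetical; the URL and file name are the ones from the curl commands above, and the sketch assumes a tika-server is already listening on localhost:9998.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Paths;

class TikaServerPutSketch {
    /** Builds a PUT request with the same header and body as the curl call. */
    static HttpRequest buildPut(String url, byte[] body) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("Content-Type", "application/octet-stream")
                .PUT(HttpRequest.BodyPublishers.ofByteArray(body))
                .build();
    }

    public static void main(String[] args) throws Exception {
        // Assumes test.eml exists in the working directory and the server is up.
        byte[] eml = Files.readAllBytes(Paths.get("test.eml"));
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<String> response = client.send(
                buildPut("http://localhost:9998/tika", eml),
                HttpResponse.BodyHandlers.ofString());
        // The extracted text; searching it for the docx contents shows the regression.
        System.out.println(response.body());
    }
}
```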





[jira] [Updated] (TIKA-1575) Upgrade to PDFBox 1.8.9 when available

2015-03-29 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich updated TIKA-1575:
--
Fix Version/s: 1.8

 Upgrade to PDFBox 1.8.9 when available
 --

 Key: TIKA-1575
 URL: https://issues.apache.org/jira/browse/TIKA-1575
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison
Priority: Minor
 Fix For: 1.8

 Attachments: 005937.pdf.json, 005937_1_8_9-SNAPSHOT.pdf.json, 
 10-814_Appendix B_v3.pdf, 524276_719128_diffs.zip, 
 PDFBox_1_8_8VPDFBox_1_8_9-SNAPSHOT.xlsx, 
 PDFBox_1_8_8VPDFBox_1_8_9-SNAPSHOT_reports.zip, 
 PDFBox_1_8_8Vs1_8_9_20150316.zip, caught_ex_1_8_9.zip, 
 content_diffs_20150316.xlsx, diffs_1_8_9_multithread_vs_single_thread.xlsx, 
 reports_1_8_9_multithread_vs_single.zip


 The PDFBox community is about to release 1.8.9.  Let's use this issue to 
 track discussions before the release and to track Tika's upgrade to PDFBox 
 1.8.9





[jira] [Resolved] (TIKA-1579) Add file type to NetCDFParser

2015-03-29 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich resolved TIKA-1579.
---
Resolution: Fixed

 Add file type to NetCDFParser
 -

 Key: TIKA-1579
 URL: https://issues.apache.org/jira/browse/TIKA-1579
 Project: Tika
  Issue Type: Improvement
  Components: parser
Reporter: Ann Burgess
Assignee: Ann Burgess
 Attachments: TIKA-1579.abburgess.190315.patch.txt


 [~gostep] explains that there are three versions of NetCDF (classic format, 
 64-bit offset, and netCDF-4/HDF5 format). When opening an existing netCDF 
 file, the netCDF library transparently detects its format, so we do not 
 need to adjust according to the detected format.
 That said, it would be good to know the file type, as each can have the .nc 
 extension.  This patch adds the file type to the metadata.
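
As a rough illustration of how the three formats differ on disk, the leading bytes can be checked directly. This is a sketch under stated assumptions, not what the attached patch does: the class and method names are hypothetical, it only looks at offset 0, and an HDF5 signature alone does not prove a file is netCDF-4.

```java
class NetcdfFormatSketch {
    // First eight bytes of an HDF5 file, the container used by netCDF-4.
    private static final byte[] HDF5_SIGNATURE =
            {(byte) 0x89, 'H', 'D', 'F', '\r', '\n', 0x1a, '\n'};

    /** Guesses the NetCDF variant from the first bytes of a file. */
    static String detect(byte[] header) {
        if (header.length >= 4 && header[0] == 'C' && header[1] == 'D' && header[2] == 'F') {
            if (header[3] == 1) return "classic";
            if (header[3] == 2) return "64-bit offset";
        }
        if (header.length >= HDF5_SIGNATURE.length) {
            boolean match = true;
            for (int i = 0; i < HDF5_SIGNATURE.length; i++) {
                if (header[i] != HDF5_SIGNATURE[i]) {
                    match = false;
                    break;
                }
            }
            if (match) return "netCDF-4/HDF5";
        }
        return "unknown";
    }

    public static void main(String[] args) {
        System.out.println(detect(new byte[] {'C', 'D', 'F', 1})); // prints "classic"
        System.out.println(detect(new byte[] {'C', 'D', 'F', 2})); // prints "64-bit offset"
    }
}
```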





[jira] [Commented] (TIKA-1584) Tika 1.7 possible regression (nested attachment files not getting parsed)

2015-03-28 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385483#comment-14385483
 ] 

Tyler Palsulich commented on TIKA-1584:
---

We now have two major issues which need a quick release. So, I would say go for 
1.8. Tim, can you chime in on the current discuss thread?

 Tika 1.7 possible regression (nested attachment files not getting parsed)
 -

 Key: TIKA-1584
 URL: https://issues.apache.org/jira/browse/TIKA-1584
 Project: Tika
  Issue Type: Bug
  Components: server
Affects Versions: 1.7
Reporter: Rob Tulloh
Assignee: Tim Allison
Priority: Blocker

 I tried to send this to the tika user list, but got a qmail failure so I am 
 opening a jira to see if I can get help with this.
 There appears to be a change in the behavior of tika since 1.5 (the last 
 version we have used). In 1.5, if we pass a file with content type of rfc822 
 which contains a zip that contains a docx file, the entire content would get 
 recursed and the text returned. In 1.7, tika only unwinds as far as the zip 
 file and ignores the content of the contained docx file. This is causing a 
 regression failure in our search tests because the contents of the docx file 
 are not found when searched for.
  
 We are testing with tika-server if this helps. If we ask the meta service to 
 just characterize the test data, it correctly determines the input is of type 
 rfc822. However, on extract, the contents of the attachment are not extracted 
 as expected.
 curl -X PUT -T test.eml -q -H Content-Type:application/octet-stream  
 http://localhost:9998/meta 2>/dev/null | grep Content-Type
 Content-Type,message/rfc822
 curl -X PUT -T test.eml -q -H Content-Type:application/octet-stream  
 http://localhost:9998/tika 2>/dev/null | grep docx
 sign.docx   <--- this is not expected, need contents of this extracted
 We can easily reproduce this problem with a simple eml file with an 
 attachment. Can someone please comment if this seems like a problem or 
 perhaps we need to change something in our call to get the old behavior?





[jira] [Created] (TIKA-1585) Create Example Website with Form Submission

2015-03-28 Thread Tyler Palsulich (JIRA)
Tyler Palsulich created TIKA-1585:
-

 Summary: Create Example Website with Form Submission
 Key: TIKA-1585
 URL: https://issues.apache.org/jira/browse/TIKA-1585
 Project: Tika
  Issue Type: New Feature
  Components: example, server
Reporter: Tyler Palsulich
Assignee: Tyler Palsulich


It would be great to have a website where we can direct people who ask what 
Tika can do for [filetype] without needing them to actually download Tika.

Some initial work to do that is 
[here|http://tpalsulich.github.io/TikaExamples/].

I'm far from a design guru, but I imagine the site as having a form where you 
can upload a file at the top, checkboxes for if you want metadata, content, or 
both, and a submit button. The request should be sent with AJAX and the result 
should populate a {{div}}.

One issue with AJAX requests is that Tika Server doesn't currently allow 
Cross-Origin-Resource-Sharing (CORS). So, we either need to maintain a slightly 
updated tika-server, or update the server to allow configuration.





[jira] [Resolved] (TIKA-1526) ExternalParser should trap/ignore/workarround JDK-8047340 & JDK-8055301 so Turkish Tika users can still use non-external parsers

2015-03-28 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich resolved TIKA-1526.
---
Resolution: Fixed

Marking this as Fixed, per the above comments. [~thetaphi] or [~hossman] or 
anyone else, please reopen this if you find any other cases.

Thank you everyone for the help!

 ExternalParser should trap/ignore/workarround JDK-8047340 & JDK-8055301 so 
 Turkish Tika users can still use non-external parsers
 

 Key: TIKA-1526
 URL: https://issues.apache.org/jira/browse/TIKA-1526
 Project: Tika
  Issue Type: Wish
Reporter: Hoss Man

 the JDK has numerous pain points regarding the Turkish locale, posix_spawn 
 lowercasing being one of them...
 https://bugs.openjdk.java.net/browse/JDK-8047340
 https://bugs.openjdk.java.net/browse/JDK-8055301
 As of Tika 1.7, the TesseractOCRParser (which is an ExternalParser) is 
 enabled & configured by default in Tika, and uses ExternalParser.check to see 
 if tesseract is available -- but because of the JDK bug, this means that Tika 
 fails fast for Turkish users on BSD/UNIX variants (including MacOSX) like 
 so...
 {noformat}
   [junit4] Throwable #1: java.lang.Error: posix_spawn is not a supported 
 process launch mechanism on this platform.
   [junit4]   at java.lang.UNIXProcess$1.run(UNIXProcess.java:105)
   [junit4]   at java.lang.UNIXProcess$1.run(UNIXProcess.java:94)
   [junit4]   at java.security.AccessController.doPrivileged(Native 
 Method)
   [junit4]   at java.lang.UNIXProcess.<clinit>(UNIXProcess.java:92)
   [junit4]   at java.lang.ProcessImpl.start(ProcessImpl.java:130)
   [junit4]   at 
 java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
   [junit4]   at java.lang.Runtime.exec(Runtime.java:620)
   [junit4]   at java.lang.Runtime.exec(Runtime.java:485)
   [junit4]   at 
 org.apache.tika.parser.external.ExternalParser.check(ExternalParser.java:344)
   [junit4]   at 
 org.apache.tika.parser.ocr.TesseractOCRParser.hasTesseract(TesseractOCRParser.java:117)
   [junit4]   at 
 org.apache.tika.parser.ocr.TesseractOCRParser.getSupportedTypes(TesseractOCRParser.java:90)
   [junit4]   at 
 org.apache.tika.parser.CompositeParser.getParsers(CompositeParser.java:81)
   [junit4]   at 
 org.apache.tika.parser.DefaultParser.getParsers(DefaultParser.java:95)
   [junit4]   at 
 org.apache.tika.parser.CompositeParser.getSupportedTypes(CompositeParser.java:229)
   [junit4]   at 
 org.apache.tika.parser.CompositeParser.getParsers(CompositeParser.java:81)
   [junit4]   at 
 org.apache.tika.parser.CompositeParser.getParser(CompositeParser.java:209)
   [junit4]   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
   [junit4]   at 
 org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
 {noformat}
 ...unless they go out of their way to white list only the parsers they 
 need/want so TesseractOCRParser (and any other ExternalParsers) will never 
 even be check()ed.
 It would be nice if Tika's ExternalParser class added a hack/workaround 
 similar to what was done in SOLR-6387 to trap these types of errors. 
  In Solr we just propagate a better error explaining why Java hates the 
 Turkish language...
 {code}
 } catch (Error err) {
   if (err.getMessage() != null && (err.getMessage().contains("posix_spawn")
       || err.getMessage().contains("UNIXProcess"))) {
     log.warn("Error forking command due to JVM locale bug "
         + "(see https://issues.apache.org/jira/browse/SOLR-6387): " + err.getMessage());
     return "(error executing: " + cmd + ")";
   }
 }
 {code}
 ...but with Tika, it might be better for all ExternalParsers to just opt 
 out as if they don't recognize the filetype when they detect this type of 
 error from the check method (or perhaps it would be better if 
 AutoDetectParser handled this? ... I'm not really sure how it would best fit 
 into Tika's architecture)
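
The "opt out" idea could look roughly like the sketch below. It is not Tika's ExternalParser.check: the class and method names are hypothetical, it needs Java 9+ for {{Redirect.DISCARD}}, and it assumes an exit code of 0 means the tool is available. The message test is the same one used in the Solr snippet above.

```java
import java.io.IOException;

class ExternalCheckSketch {
    /** True when the error message matches the JVM locale bug described above. */
    static boolean isSpawnLocaleBug(Throwable t) {
        String msg = t.getMessage();
        return msg != null && (msg.contains("posix_spawn") || msg.contains("UNIXProcess"));
    }

    /** Returns true only if the command starts and exits with code 0. */
    static boolean check(String... command) {
        try {
            Process process = new ProcessBuilder(command)
                    .redirectErrorStream(true)
                    // Discard output so the child cannot block on a full pipe.
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                    .start();
            return process.waitFor() == 0;
        } catch (IOException e) {
            return false; // command missing or not executable
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } catch (Error err) {
            if (isSpawnLocaleBug(err)) {
                return false; // treat the tool as unavailable instead of failing fast
            }
            throw err; // unrelated errors still propagate
        }
    }

    public static void main(String[] args) {
        // Prints true or false depending on whether tesseract is installed.
        System.out.println(check("tesseract", "--version"));
    }
}
```

Returning false here means a parser built on such a check would simply report no supported types, so the other parsers keep working.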





[jira] [Commented] (TIKA-1581) jhighlight license concerns

2015-03-28 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385337#comment-14385337
 ] 

Tyler Palsulich commented on TIKA-1581:
---

Hi [~kkrugler]. Thanks. The comment is now
bq. Tika-parsers component uses CDDL/LGPL dual-licensed dependency: jhighlight 
(https://github.com/codelibs/jhighlight)

If this looks good, I'll start a \[DISCUSS\] thread on the list about a new 
version.

 jhighlight license concerns
 ---

 Key: TIKA-1581
 URL: https://issues.apache.org/jira/browse/TIKA-1581
 Project: Tika
  Issue Type: Bug
Affects Versions: 1.7
Reporter: Karl Wright
 Fix For: 1.8


 jhighlight jar is a Tika dependency.  The Lucene team discovered that, while 
 it claims to be a CDDL/LGPL dual-license, some of its functionality is LGPL 
 only:
 {code}
 Solr's contrib/extraction contains jhighlight-1.0.jar which declares itself 
 as dual CDDL or LGPL license. However, some of its classes are distributed 
 only under LGPL, e.g.
 com.uwyn.jhighlight.highlighter.
   CppHighlighter.java
   GroovyHighlighter.java
   JavaHighlighter.java
   XmlHighlighter.java
 I downloaded the sources from Maven 
 (http://search.maven.org/remotecontent?filepath=com/uwyn/jhighlight/1.0/jhighlight-1.0-sources.jar)
  to confirm that, and also found this SVN repo: 
 http://svn.rifers.org/jhighlight/tags/release-1.0, though the project's 
 website seems to not exist anymore (https://jhighlight.dev.java.net/).
 I didn't find any direct usage of it in our code, so I guess it's probably 
 needed by a 3rd party dependency, such as Tika. Therefore if we e.g. omit it, 
 things will compile, but may fail at runtime.
 {code}
 Is it possible to remove this dependency for future releases, or allow only 
 optional inclusion of this package?  It is of concern to the ManifoldCF 
 project because we distribute a binary package that includes Tika and its 
 required dependencies, which currently includes jHighlight.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TIKA-1586) Enable CORS on Tika Server

2015-03-28 Thread Tyler Palsulich (JIRA)
Tyler Palsulich created TIKA-1586:
-

 Summary: Enable CORS on Tika Server
 Key: TIKA-1586
 URL: https://issues.apache.org/jira/browse/TIKA-1586
 Project: Tika
  Issue Type: New Feature
  Components: server
Reporter: Tyler Palsulich
Assignee: Tyler Palsulich


Tika Server should allow configuration of CORS requests (for uses like 
TIKA-1585). See [this example|http://cxf.apache.org/docs/jax-rs-cors.html] from 
CXF for how to add it.

The only change from that site is that we will need to add a 
{{CrossOriginResourceSharingFilter}} as a provider.

Ideally, this is configurable (limit which resources have CORS, and which 
origins are allowed). But, I'm not thinking of any general methods of how to do 
that...
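
One possible shape for the "which origins are allowed" part is a simple allow-list check. The sketch below uses only the JDK's built-in {{com.sun.net.httpserver}} package to show where the header goes; it does not use CXF's filter, and the class and method names are hypothetical.

```java
import com.sun.net.httpserver.HttpHandler;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Optional;
import java.util.Set;

class CorsSketch {
    /** Returns the Access-Control-Allow-Origin value, or empty if the origin is not allowed. */
    static Optional<String> allowOriginFor(String requestOrigin, Set<String> allowed) {
        if (requestOrigin == null) return Optional.empty();
        if (allowed.contains("*")) return Optional.of("*");
        return allowed.contains(requestOrigin) ? Optional.of(requestOrigin) : Optional.empty();
    }

    /** Serves a fixed text body and adds the CORS header when the origin is allowed. */
    static HttpHandler handler(Set<String> allowed, String body) {
        return exchange -> {
            String origin = exchange.getRequestHeaders().getFirst("Origin");
            allowOriginFor(origin, allowed).ifPresent(value ->
                    exchange.getResponseHeaders().add("Access-Control-Allow-Origin", value));
            byte[] bytes = body.getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, bytes.length);
            exchange.getResponseBody().write(bytes);
            exchange.close();
        };
    }

    public static void main(String[] args) {
        Set<String> allowed = new HashSet<>(Arrays.asList("http://tpalsulich.github.io"));
        System.out.println(allowOriginFor("http://tpalsulich.github.io", allowed).isPresent()); // true
        System.out.println(allowOriginFor("http://example.org", allowed).isPresent());          // false
    }
}
```

Limiting which resources get the header would then be a matter of registering this handler only on the chosen paths.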





[jira] [Resolved] (TIKA-1586) Enable CORS on Tika Server

2015-03-28 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich resolved TIKA-1586.
---
Resolution: Fixed

Fixed in r1669799.

 Enable CORS on Tika Server
 --

 Key: TIKA-1586
 URL: https://issues.apache.org/jira/browse/TIKA-1586
 Project: Tika
  Issue Type: New Feature
  Components: server
Reporter: Tyler Palsulich
Assignee: Tyler Palsulich

 Tika Server should allow configuration of CORS requests (for uses like 
 TIKA-1585). See [this example|http://cxf.apache.org/docs/jax-rs-cors.html] 
 from CXF for how to add it.
 The only change from that site is that we will need to add a 
 {{CrossOriginResourceSharingFilter}} as a provider.
 Ideally, this is configurable (limit which resources have CORS, and which 
 origins are allowed). But, I'm not thinking of any general methods of how to 
 do that...





[jira] [Commented] (TIKA-1585) Create Example Website with Form Submission

2015-03-28 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385411#comment-14385411
 ] 

Tyler Palsulich commented on TIKA-1585:
---

CORS work is now integrated. [~talli...@mitre.org], can you restart the server 
on 162.242.228.174:9998 with the --cors "http://tpalsulich.github.io" option?

Then, we can close off the 9997 port (my github.io site is querying 9997, 
though, so I'll need to update that).

Is there an official place we'd like to host the above site?

 Create Example Website with Form Submission
 ---

 Key: TIKA-1585
 URL: https://issues.apache.org/jira/browse/TIKA-1585
 Project: Tika
  Issue Type: New Feature
  Components: example, server
Reporter: Tyler Palsulich
Assignee: Tyler Palsulich

 It would be great to have a website where we can direct people who ask what 
 Tika can do for [filetype] without needing them to actually download Tika.
 Some initial work to do that is 
 [here|http://tpalsulich.github.io/TikaExamples/].
 I'm far from a design guru, but I imagine the site as having a form where you 
 can upload a file at the top, checkboxes for if you want metadata, content, 
 or both, and a submit button. The request should be sent with AJAX and the 
 result should populate a {{div}}.
 One issue with AJAX requests is that Tika Server doesn't currently allow 
 Cross-Origin-Resource-Sharing (CORS). So, we either need to maintain a 
 slightly updated tika-server, or update the server to allow configuration.





[jira] [Commented] (TIKA-1586) Enable CORS on Tika Server

2015-03-28 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385372#comment-14385372
 ] 

Tyler Palsulich commented on TIKA-1586:
---

Can someone take a look at the above PR and make sure I'm not doing anything 
bone-headed? Thanks!

 Enable CORS on Tika Server
 --

 Key: TIKA-1586
 URL: https://issues.apache.org/jira/browse/TIKA-1586
 Project: Tika
  Issue Type: New Feature
  Components: server
Reporter: Tyler Palsulich
Assignee: Tyler Palsulich

 Tika Server should allow configuration of CORS requests (for uses like 
 TIKA-1585). See [this example|http://cxf.apache.org/docs/jax-rs-cors.html] 
 from CXF for how to add it.
 The only change from that site is that we will need to add a 
 {{CrossOriginResourceSharingFilter}} as a provider.
 Ideally, this is configurable (limit which resources have CORS, and which 
 origins are allowed). But, I'm not thinking of any general methods of how to 
 do that...





[jira] [Closed] (TIKA-1354) ForkParser doesn't work in OSGI container

2015-03-27 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich closed TIKA-1354.
-
   Resolution: Fixed
Fix Version/s: 1.7

Marking as Fixed.

 ForkParser doesn't work in OSGI container
 -

 Key: TIKA-1354
 URL: https://issues.apache.org/jira/browse/TIKA-1354
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.6
Reporter: Michal Hlavac
 Fix For: 1.7


 I can't find way to run ForkParser in OSGI container.





[jira] [Created] (TIKA-1583) Convert Module Level READMEs to Markdown

2015-03-27 Thread Tyler Palsulich (JIRA)
Tyler Palsulich created TIKA-1583:
-

 Summary: Convert Module Level READMEs to Markdown
 Key: TIKA-1583
 URL: https://issues.apache.org/jira/browse/TIKA-1583
 Project: Tika
  Issue Type: Improvement
Reporter: Tyler Palsulich
Assignee: Tyler Palsulich
Priority: Minor








[jira] [Resolved] (TIKA-1583) Convert Module Level READMEs to Markdown

2015-03-27 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich resolved TIKA-1583.
---
Resolution: Done

Done in r1669644 and r1669645.

 Convert Module Level READMEs to Markdown
 

 Key: TIKA-1583
 URL: https://issues.apache.org/jira/browse/TIKA-1583
 Project: Tika
  Issue Type: Improvement
Reporter: Tyler Palsulich
Assignee: Tyler Palsulich
Priority: Minor







[jira] [Commented] (TIKA-1273) old tika-server jar artifact contains no manifest so not able to invoke from shell

2015-03-23 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14376796#comment-14376796
 ] 

Tyler Palsulich commented on TIKA-1273:
---

{{original-tika-server-1.8-SNAPSHOT.jar}}? I don't see any flags in the 
tika-server pom.xml. So, I'm not sure where the activation is.

 old tika-server jar artifact contains no manifest so not able to invoke from 
 shell
 --

 Key: TIKA-1273
 URL: https://issues.apache.org/jira/browse/TIKA-1273
 Project: Tika
  Issue Type: Bug
  Components: server
Affects Versions: 1.5
Reporter: Lewis John McGibbney
Priority: Minor
 Fix For: 1.8


 I've never ever used the old tika-server artifact which is generated when one 
 installs the server module. It needs to contain a manifest otherwise it 
 cannot be invoked from the shell.
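
A quick way to see whether a given jar can be launched with {{java -jar}} is to read its manifest. The sketch below uses only {{java.util.jar}}; the class and method names are hypothetical, and it is offered as a diagnostic aid rather than part of any fix.

```java
import java.io.File;
import java.io.IOException;
import java.util.jar.Attributes;
import java.util.jar.JarFile;
import java.util.jar.Manifest;

class ManifestCheckSketch {
    /** Returns the Main-Class entry of the jar, or null if the manifest or entry is missing. */
    static String mainClassOf(File jar) throws IOException {
        try (JarFile jarFile = new JarFile(jar)) {
            Manifest manifest = jarFile.getManifest();
            if (manifest == null) {
                return null;
            }
            return manifest.getMainAttributes().getValue(Attributes.Name.MAIN_CLASS);
        }
    }

    public static void main(String[] args) throws IOException {
        // Pass one or more jar paths on the command line.
        for (String path : args) {
            System.out.println(path + " -> " + mainClassOf(new File(path)));
        }
    }
}
```

A null result for a jar means {{java -jar}} has no entry point to run, which matches the symptom in this issue.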





[jira] [Resolved] (TIKA-1565) image/gif parse error

2015-03-22 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich resolved TIKA-1565.
---
   Resolution: Fixed
Fix Version/s: (was: 1.7)
   1.8
 Assignee: Tyler Palsulich

Marking as Fixed for 1.8. The file is now parsed without an Exception. Please 
reopen if you are still running into this issue with Trunk or 1.8 (when it is 
released some time in the future).

 image/gif parse error
 -

 Key: TIKA-1565
 URL: https://issues.apache.org/jira/browse/TIKA-1565
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.7
 Environment: win7 x64  jdk1.7
Reporter: lixin
Assignee: Tyler Palsulich
 Fix For: 1.8

 Attachments: JNK16-1309-173.mht


 I am getting an exception parsing the following mht File
 {code}
 org.apache.tika.exception.TikaException: image/gif parse error
   at org.apache.tika.parser.image.ImageParser.parse(ImageParser.java:115)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:239)
   at 
 org.apache.tika.parser.mail.MailContentHandler.body(MailContentHandler.java:102)
   at 
 org.apache.james.mime4j.parser.MimeStreamParser.parse(MimeStreamParser.java:133)
   at org.apache.tika.parser.mail.RFC822Parser.parse(RFC822Parser.java:76)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:239)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:239)
   at 
 org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:105)
   at org.apache.tika.example.MyTest.test1(MyTest.java:31)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
   at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
   at java.lang.reflect.Method.invoke(Unknown Source)
   at 
 org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45)
   at 
 org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
   at 
 org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42)
   at 
 org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
   at 
 org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:263)
   at 
 org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:68)
   at 
 org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:47)
   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231)
   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60)
   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229)
   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:50)
   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:222)
   at org.junit.runners.ParentRunner.run(ParentRunner.java:300)
   at 
 org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50)
   at 
 org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
   at 
 org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:459)
   at 
 org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:675)
   at 
 org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:382)
   at 
 org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:192)
 Caused by: javax.imageio.IIOException: Unexpected block type 1!
   at com.sun.imageio.plugins.gif.GIFImageReader.readMetadata(Unknown 
 Source)
   at com.sun.imageio.plugins.gif.GIFImageReader.getWidth(Unknown Source)
   at org.apache.tika.parser.image.ImageParser.parse(ImageParser.java:92)
   ... 32 more
 {code}
 my test code:
 {code}
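 // Parse the file with the auto-detecting parser and print the extracted text.
 AutoDetectParser parser = new AutoDetectParser();
 BodyContentHandler handler = new BodyContentHandler();
 Metadata metadata = new Metadata();
 ParseContext context = new ParseContext();
 // try-with-resources closes the stream even if parsing throws
 try (InputStream stream = new FileInputStream(new File(file))) {
     parser.parse(stream, handler, metadata, context);
 }
 System.out.println(handler.toString());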
 {code}





[jira] [Closed] (TIKA-1543) TesseractOCRParser.setTesseractPath() doesn't work on Linux

2015-03-22 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich closed TIKA-1543.
-
Resolution: Fixed

This isn't actually a problem. I just tested locally -- it works.

We have unit tests for the path, but it's difficult to test that extraction 
works with a non-standard path, since we don't know what the path is...

I think the problem is either:
* The path you set is not the directory that contains the executable, or
* The path doesn't have a tessdata directory inside it.

You can see all of the Tesseract debugging messages by enabling {{debug}} level 
logging (put a 
[log4j.properties|https://github.com/apache/tika/blob/10298692cb27d1ad3732589930987e2fe2681ee8/tika-parsers/src/test/resources/log4j.properties]
 file on your classpath and set the output level to {{debug}}).

I'd be happy to help you debug further.
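
Both conditions can be checked before handing the path to the parser. A minimal sketch, assuming the executable is named {{tesseract}} (it would be {{tesseract.exe}} on Windows) and that {{tessdata}} sits directly inside the configured directory; the class and method names are hypothetical.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class TesseractPathSketch {
    /** Checks the two conditions named above for a candidate Tesseract directory. */
    static boolean looksUsable(Path dir) {
        Path executable = dir.resolve("tesseract"); // assumed executable name
        boolean hasExecutable = Files.isRegularFile(executable) && Files.isExecutable(executable);
        boolean hasTessdata = Files.isDirectory(dir.resolve("tessdata"));
        return hasExecutable && hasTessdata;
    }

    public static void main(String[] args) {
        // Example path taken from the issue description.
        Path dir = Paths.get(args.length > 0 ? args[0] : "/root/tesseract");
        System.out.println(dir + " usable: " + looksUsable(dir));
    }
}
```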

 TesseractOCRParser.setTesseractPath() doesn't work on Linux
 ---

 Key: TIKA-1543
 URL: https://issues.apache.org/jira/browse/TIKA-1543
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.7
Reporter: Sean Zhao
 Fix For: 1.8

   Original Estimate: 168h
  Remaining Estimate: 168h

 After calling setTesseractPath() to set the Tesseract path to a non-default 
 path, such as /root/tesseract, a call to TesseractOCRParser.parse() returns 
 nothing.
 Not sure if this is related to TIKA-1421.





[jira] [Closed] (TIKA-1460) Could not parse predefined CMAP file for 'Adobe-GBK1-UCS2'

2015-03-22 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich closed TIKA-1460.
-
Resolution: Cannot Reproduce

Closing as Cannot Reproduce, since it's been a month since my last comment and 
we don't have the file which reproduces the issue. Please reopen if you're 
still running into this!

 Could not parse predefined CMAP file for 'Adobe-GBK1-UCS2'
 --

 Key: TIKA-1460
 URL: https://issues.apache.org/jira/browse/TIKA-1460
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.3
 Environment: win7,myeclipse8.5
Reporter: onyas
Priority: Critical

 For some reason I could not upload the file, so here is the info.
 I also checked every version in the directory 
 \org\apache\pdfbox\resources\cmap and did not find the 'Adobe-GBK1-UCS2' file.
 org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
 org.apache.tika.parser.microsoft.OfficeParser@d640af
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   at 
 org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
 Caused by: java.lang.IllegalArgumentException: Position 66048 past the end of 
 the file
   at 
 org.apache.poi.poifs.nio.FileBackedDataSource.read(FileBackedDataSource.java:50)
   at 
 org.apache.poi.poifs.filesystem.NPOIFSFileSystem.getBlockAt(NPOIFSFileSystem.java:420)
   at 
 org.apache.poi.poifs.filesystem.NPOIFSFileSystem.readBAT(NPOIFSFileSystem.java:397)
   at 
 org.apache.poi.poifs.filesystem.NPOIFSFileSystem.readCoreContents(NPOIFSFileSystem.java:356)
   at 
 org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>(NPOIFSFileSystem.java:202)
   at 
 org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>(NPOIFSFileSystem.java:184)
   at 
 org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:156)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   ... 21 more
 The main code is:
 Parser parser = new AutoDetectParser();
 // getNum() supplies the write limit for the body handler
 ContentHandler handler = new BodyContentHandler(getNum());
 Metadata metadata = new Metadata();
 ParseContext context = new ParseContext();
 StringBuffer content = new StringBuffer();
 // try-with-resources closes the stream even if parse() throws
 try (InputStream stream = new FileInputStream(file)) {
     parser.parse(stream, handler, metadata, context);
     content.append(handler);
     if (StringUtils.isNotBlank(content.toString())) {
         hasContent = true;
     }
 }
 The exception is thrown at this line: parser.parse(stream, handler, 
 metadata, context);



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TIKA-1565) image/gif parse error

2015-03-22 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich updated TIKA-1565:
--
Description: 
I am getting an exception parsing the following mht File
{code}
org.apache.tika.exception.TikaException: image/gif parse error
at org.apache.tika.parser.image.ImageParser.parse(ImageParser.java:115)
at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:239)
at 
org.apache.tika.parser.mail.MailContentHandler.body(MailContentHandler.java:102)
at 
org.apache.james.mime4j.parser.MimeStreamParser.parse(MimeStreamParser.java:133)
at org.apache.tika.parser.mail.RFC822Parser.parse(RFC822Parser.java:76)
at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:239)
at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:239)
at 
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:105)
at org.apache.tika.example.MyTest.test1(MyTest.java:31)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42)
at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:263)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:68)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:47)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:50)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:222)
at org.junit.runners.ParentRunner.run(ParentRunner.java:300)
at 
org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50)
at 
org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:459)
at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:675)
at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:382)
at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:192)
Caused by: javax.imageio.IIOException: Unexpected block type 1!
at com.sun.imageio.plugins.gif.GIFImageReader.readMetadata(Unknown 
Source)
at com.sun.imageio.plugins.gif.GIFImageReader.getWidth(Unknown Source)
at org.apache.tika.parser.image.ImageParser.parse(ImageParser.java:92)
... 32 more
{code}
my test code:
{code}
AutoDetectParser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
parser.parse(new FileInputStream(new File(file)), handler, 
metadata,context);
System.out.println(handler.toString());
{code}


[jira] [Updated] (TIKA-1543) TesseractOCRParser.setTesseractPath() doesn't work on Linux

2015-03-22 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich updated TIKA-1543:
--
Fix Version/s: (was: 1.7)
   1.8

 TesseractOCRParser.setTesseractPath() doesn't work on Linux
 ---

 Key: TIKA-1543
 URL: https://issues.apache.org/jira/browse/TIKA-1543
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.7
Reporter: Sean Zhao
 Fix For: 1.8

   Original Estimate: 168h
  Remaining Estimate: 168h

 After calling setTesseractPath() to set the Tesseract path to a non-default 
 path, such as /root/tesseract, a call to TesseractOCRParser.parse() returns 
 nothing.
 Not sure if this is related to TIKA-1421.





[jira] [Commented] (TIKA-1543) TesseractOCRParser.setTesseractPath() doesn't work on Linux

2015-03-22 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14375173#comment-14375173
 ] 

Tyler Palsulich commented on TIKA-1543:
---

(I just added the logging in r1668477 a few minutes ago. See [this 
commit|https://github.com/apache/tika/commit/84825f035069d572f155f86fa4c18d5a79b48028]
 on GitHub.)

 TesseractOCRParser.setTesseractPath() doesn't work on Linux
 ---

 Key: TIKA-1543
 URL: https://issues.apache.org/jira/browse/TIKA-1543
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.7
Reporter: Sean Zhao
 Fix For: 1.8

   Original Estimate: 168h
  Remaining Estimate: 168h

 After calling setTesseractPath() to set the Tesseract path to a non-default 
 path, such as /root/tesseract, a call to TesseractOCRParser.parse() returns 
 nothing.
 Not sure if this is related to TIKA-1421.





[jira] [Commented] (TIKA-1273) old tika-server jar artifact contains no manifest so not able to invoke from shell

2015-03-21 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14372909#comment-14372909
 ] 

Tyler Palsulich commented on TIKA-1273:
---

Where exactly is the old jar? The one I ran above?

 old tika-server jar artifact contains no manifest so not able to invoke from 
 shell
 --

 Key: TIKA-1273
 URL: https://issues.apache.org/jira/browse/TIKA-1273
 Project: Tika
  Issue Type: Bug
  Components: server
Affects Versions: 1.5
Reporter: Lewis John McGibbney
Priority: Minor
 Fix For: 1.8


 I've never ever used the old tika-server artifact which is generated when one 
 installs the server module. It needs to contain a manifest otherwise it 
 cannot be invoked from the shell.
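
For illustration only: running a jar with `java -jar` depends on the manifest naming a `Main-Class`. The sketch below uses only the JDK's `java.util.jar.Manifest` to check for that entry. The class name `ManifestCheckSketch` and the main class `com.example.ServerCli` are hypothetical and are not taken from the tika-server artifact.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.jar.Attributes;
import java.util.jar.Manifest;

final class ManifestCheckSketch {
    /** Returns the Main-Class entry of the manifest text, or null if it has none. */
    static String mainClassOf(String manifestText) {
        try {
            Manifest manifest = new Manifest(
                    new ByteArrayInputStream(manifestText.getBytes(StandardCharsets.UTF_8)));
            return manifest.getMainAttributes().getValue(Attributes.Name.MAIN_CLASS);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical manifest contents, not copied from any real artifact.
        String withMain = "Manifest-Version: 1.0\nMain-Class: com.example.ServerCli\n";
        String withoutMain = "Manifest-Version: 1.0\n";
        System.out.println(mainClassOf(withMain));
        System.out.println(mainClassOf(withoutMain));
    }
}
```

A jar whose manifest yields null here would need the main class passed explicitly on the command line.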





[jira] [Commented] (TIKA-1344) Ability to generate self-contained HTML with images

2015-03-20 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14372082#comment-14372082
 ] 

Tyler Palsulich commented on TIKA-1344:
---

[~gagravarr], can we close this one off? Thank you, [~skibaa]!

 Ability to generate self-contained HTML with images
 ---

 Key: TIKA-1344
 URL: https://issues.apache.org/jira/browse/TIKA-1344
 Project: Tika
  Issue Type: Improvement
  Components: parser
Reporter: Andrew Skiba
  Labels: easyfix, patch
 Attachments: word.patch

   Original Estimate: 1h
  Remaining Estimate: 1h

 In the current code, the images from Word documents are referenced by 
 embedded:xxx links in the generated HTML. This causes browsers to display 
 an x icon instead of the image.
 The proposed patch encodes the images using Data URIs if the 
 -Dtika.parsers.urlimages system property is set. 
 http://en.wikipedia.org/wiki/Data_URI_scheme
 So the default behavior is the same, but users of the library can optionally 
 generate self-contained HTML with correct images.
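
A minimal sketch of the Data URI idea, assuming base64 encoding with the JDK's `java.util.Base64`. This is not the code from word.patch; the class and method names are hypothetical, and the six-byte "image" is just the GIF header used as sample data.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class DataUriSketch {
    /** Builds a base64 data URI for the given bytes and media type. */
    static String toDataUri(String mediaType, byte[] data) {
        return "data:" + mediaType + ";base64," + Base64.getEncoder().encodeToString(data);
    }

    /** Builds an img element whose src carries the image inline instead of an embedded:xxx link. */
    static String imgTag(String mediaType, byte[] data) {
        return "<img src=\"" + toDataUri(mediaType, data) + "\"/>";
    }

    public static void main(String[] args) {
        // Sample bytes only: the ASCII GIF header, not a complete image.
        byte[] sample = "GIF89a".getBytes(StandardCharsets.US_ASCII);
        System.out.println(imgTag("image/gif", sample));
    }
}
```

The trade-off is size: base64 inflates the image data by roughly a third, which is one reason to keep the behavior opt-in.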





[jira] [Commented] (TIKA-1354) ForkParser doesn't work in OSGI container

2015-03-20 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14372107#comment-14372107
 ] 

Tyler Palsulich commented on TIKA-1354:
---

[~chrismattmann] and [~hlavki], are there any other updates needed for this 
issue? The build failure just got pruned from Jenkins.

 ForkParser doesn't work in OSGI container
 -

 Key: TIKA-1354
 URL: https://issues.apache.org/jira/browse/TIKA-1354
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.6
Reporter: Michal Hlavac

 I can't find a way to run ForkParser in an OSGi container.





[jira] [Updated] (TIKA-1356) degraded performance OOXMLParser with WriteOutContentHandler

2015-03-20 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich updated TIKA-1356:
--
Description: 
If OOXMLParser is used with WriteOutContentHandler as the destination of the 
result, performance can degrade. The cause is that SAXException is ignored in 
org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.SheetTextAsHTML.endRow()
 and other methods of this class.
For example: the source document has many empty rows at the end of the table 
(about 100). Once WriteOutContentHandler is full, WriteLimitReachedException is 
raised many times.
Below is the stack trace of the long-running process:

{code}
org.apache.tika.sax.ContentHandlerDecorator.ignorableWhitespace(ContentHandlerDecorator.java:157)
   
org.apache.tika.sax.SafeContentHandler.access$101(SafeContentHandler.java:46)
   
org.apache.tika.sax.SafeContentHandler$2.write(SafeContentHandler.java:94)
   
org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:140)
   
org.apache.tika.sax.SafeContentHandler.ignorableWhitespace(SafeContentHandler.java:293)
   
org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:242)
   
org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:275)
   
org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator$SheetTextAsHTML.cell(XSSFExcelExtractorDecorator.java:203)
   
org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler.endElement(XSSFSheetXMLHandler.java:295)
   
org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator$XSSFSheetInterestingPartsCapturer.endElement(XSSFExcelExtractorDecorator.java:287)
   org.apache.xerces.parsers.AbstractSAXParser.endElement(Unknown Source)
   
org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanEndElement(Unknown 
Source)
   
org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown
 Source)
   
org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown 
Source)
   org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
   org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
   org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
   org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
   org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
   
org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.processSheet(XSSFExcelExtractorDecorator.java:164)
   
org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.buildXHTML(XSSFExcelExtractorDecorator.java:120)
   
org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:105)
   
org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
   
org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:82)
   org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
   org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
   
org.elasticsearch.index.mapper.attachment.AttachmentMapper$RecursiveMetadataParser.parse(AttachmentMapper.java:104)
   org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
   
org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)
   
org.apache.tika.parser.pkg.PackageParser.parseEntry(PackageParser.java:169)
   org.apache.tika.parser.pkg.PackageParser.parse(PackageParser.java:135)
   org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
   org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
{code}
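
To illustrate the effect described above, here is a small sketch using only the JDK's SAX classes. `LimitedHandler` is a hypothetical stand-in for a handler with a write limit and is not Tika's WriteOutContentHandler. It compares a caller that swallows the SAXException per row with one that lets the first exception end the loop.

```java
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class WriteLimitSketch {
    /** Hypothetical stand-in for a handler that enforces a write limit. */
    static final class LimitedHandler extends DefaultHandler {
        private final int limit;
        private int written = 0;
        int exceptionsThrown = 0;

        LimitedHandler(int limit) {
            this.limit = limit;
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            if (written + length > limit) {
                exceptionsThrown++;
                throw new SAXException("write limit reached");
            }
            written += length;
        }
    }

    /** Ignores the exception for every row, so it is raised again and again. */
    static int runSwallowing(int limit, int rows) {
        LimitedHandler handler = new LimitedHandler(limit);
        char[] cell = {'x'};
        for (int i = 0; i < rows; i++) {
            try {
                handler.characters(cell, 0, cell.length);
            } catch (SAXException ignored) {
                // Swallowed: the loop keeps calling a handler that is already full.
            }
        }
        return handler.exceptionsThrown;
    }

    /** Lets the first exception end the loop, so it is raised only once. */
    static int runPropagating(int limit, int rows) {
        LimitedHandler handler = new LimitedHandler(limit);
        char[] cell = {'x'};
        try {
            for (int i = 0; i < rows; i++) {
                handler.characters(cell, 0, cell.length);
            }
        } catch (SAXException stop) {
            // The limit was reached; stop emitting rows.
        }
        return handler.exceptionsThrown;
    }

    public static void main(String[] args) {
        System.out.println(runSwallowing(5, 100));   // exceptions raised while swallowing
        System.out.println(runPropagating(5, 100));  // exceptions raised while propagating
    }
}
```

With a limit of 5 and 100 one-character rows, the swallowing variant raises the exception 95 times and the propagating variant raises it once.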


[jira] [Updated] (TIKA-1358) Add support for newer iWork file formats

2015-03-20 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich updated TIKA-1358:
--
Labels: new-parser newbie  (was: newbie)

 Add support for newer iWork file formats
 

 Key: TIKA-1358
 URL: https://issues.apache.org/jira/browse/TIKA-1358
 Project: Tika
  Issue Type: Wish
  Components: parser
Affects Versions: 1.5
Reporter: Jelle Kastelein
  Labels: new-parser, newbie
 Attachments: iwork13-testdocs-zips.zip, iwork13-testfiles-2014-11.zip


 iWork 2013 uses a revised file format which replaces the XML files that hold 
 the content with .iwa files (a binary format). This file format is becoming 
 increasingly relevant as more and more people are using Apple products. 
 However, it does not appear to work with the current IWorkPackageParser 
 (tested with several of the example .pages files one can get from the 
 iCloud). 





[jira] [Commented] (TIKA-1367) Tika documentation should list tika-parsers parser dependencies

2015-03-20 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14372123#comment-14372123
 ] 

Tyler Palsulich commented on TIKA-1367:
---

This is still worth doing, but it needs to be better than the dependency tree 
idea I gave above. Still not sure about a good solution. Should this be a page 
on the website?

 Tika documentation should list tika-parsers parser dependencies
 ---

 Key: TIKA-1367
 URL: https://issues.apache.org/jira/browse/TIKA-1367
 Project: Tika
  Issue Type: Improvement
  Components: documentation
Reporter: Sergey Beryozkin
 Fix For: 1.8


 The tika-parsers module has many strong transitive parser dependencies. Maven 
 users of tika-parsers have to exclude all the transitive dependencies 
 manually. Documenting the list of the existing transitive dependencies and 
 keeping the list up to date will help developers exclude the libraries not 
 needed for a given project.





[jira] [Updated] (TIKA-1379) error in Tika().detect for xml files with xades signature

2015-03-20 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich updated TIKA-1379:
--
Description: 
We tried to get the MIME type of an XML file with an embedded XAdES signature. 
The result is text/html and not the expected text/xml or application/xml.

Here is an example of the XML file:
{code}
<VERBALI ad_cod="D69017" batch_id="0" cds_cod="D69" data_app="2013-09-23">
<VERBALE Id="1" tipologia="Verbale esame">
<VERB_NUM>00094853 0003 2</VERB_NUM>
<DATA_APP>2013-09-23</DATA_APP>
<DATA_ESA>2013-09-23</DATA_ESA>
<AD_COD>D69017</AD_COD>
<AD>FILOSOFIA DELLA SCIENZA</AD>
<CDS_COD>D69</CDS_COD>
<CDS>TEATRO E ARTI VISIVE</CDS>
<TIPO_ESA></TIPO_ESA>
<MAT>1233456</MAT>
<NOME>PAOLINO</NOME>
<COGNOME>PAPERINO</COGNOME>
<VOTO>23.0</VOTO>
<VOTODECOD>23</VOTODECOD>
<CAUSALE></CAUSALE>
<TIPO_MODULO></TIPO_MODULO>
<IMG_PATH></IMG_PATH>
<AA_SES_ID>2012</AA_SES_ID>
<AD_CFU>6.0</AD_CFU>
<NOTA></NOTA>
<ATENEO>9</ATENEO>
<ATENEO_DES>جامعة البندقية - TEST</ATENEO_DES>
<TIPO_DOCUMENTO>Verbale_3</TIPO_DOCUMENTO>
<TITOLARE_PROCEDIMENTO>QUI QUO QUA</TITOLARE_PROCEDIMENTO>
<AD_STU_COD>D69017</AD_STU_COD>
<AD_STU>FILOSOFIA DELLA SCIENZA</AD_STU>
<CDS_STU_COD>D69</CDS_STU_COD>
<CDS_STU>TEATRO E ARTI VISIVE</CDS_STU>
<DOCENTE>QUI QUO QUA</DOCENTE>
<DATA_DOCUMENTO>26-09-2013 09:55:53 CEST(+0200)</DATA_DOCUMENTO>
<SOFTWARE_DI_CREAZIONE>
<NOME>3</NOME>
<VERSIONE>11.09.03</VERSIONE>
</SOFTWARE_DI_CREAZIONE>
</VERBALE><ds:Signature xmlns:ds="http://www.w3.org/2000/09/xmldsig#" Id="sig08744308748201048377">
<ds:SignedInfo>
<ds:CanonicalizationMethod Algorithm="http://www.w3.org/2006/12/xml-c14n11"></ds:CanonicalizationMethod>
<ds:SignatureMethod Algorithm="http://www.w3.org/2001/04/xmldsig-more#rsa-sha256"></ds:SignatureMethod>
<ds:Reference URI="">
<ds:Transforms>
<ds:Transform Algorithm="http://www.w3.org/2002/06/xmldsig-filter2">
<dsig-xpath:XPath xmlns:dsig-xpath="http://www.w3.org/2002/06/xmldsig-filter2" Filter="subtract">/descendant::ds:Signature</dsig-xpath:XPath>
</ds:Transform>
<ds:Transform Algorithm="http://www.w3.org/TR/1999/REC-xslt-19991116">
<xsl:stylesheet xmlns:kion="http://www.kion.it/webesse3/multilingua" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" exclude-result-prefixes="kion" version="1.0">
<kion:ml module="FirmaDigitale" target="kion"></kion:ml>
<xsl:output method="xml"></xsl:output>

<xsl:variable name="mostra_ad_figlie" select="1"></xsl:variable>
<xsl:variable name="verbale_root" select="/VERBALI/VERBALE"></xsl:variable>
<xsl:variable name="sostituzione_root" select="/VERBALI/VERBALE/SOSTITUZIONE_DOCUMENTO"></xsl:variable>
<xsl:variable name="RAGG_ROOT" select="/VERBALI/VERBALE/RAGGRUPPAMENTO"></xsl:variable>
<xsl:variable name="COMM_ROOT" select="/VERBALI/VERBALE/COMMISSIONE"></xsl:variable>

<xsl:template match="/">
<html>
<head>
<meta content="text/html;charset=UTF-8" http-equiv="Content-Type"></meta>
<xsl:choose>
<xsl:when test="$sostituzione_root">
<title>Dichiarazione conformità Verbale Esame</title>
</xsl:when>
<xsl:otherwise>
<title>Verbalizzazione esame</title>
</xsl:otherwise>
</xsl:choose>
<style type="text/css">
 td  {font-family: Arial; font-size:10pt;}
 div {font-family: Arial; font-size:10pt;}
 pre {font-family: Arial; font-size:10pt;}
</style>
</head>
<body>
<table>
<xsl:choose>
<xsl:when test="$sostituzione_root">
<tr><td align="center" colspan="2"><big><strong><xsl:value-of select="$verbale_root/ATENEO_DES"></xsl:value-of></strong></big><br></br></td></tr>
<tr><td align="center" colspan="2"><big><strong>DICHIARAZIONE DI CONFORMITÀ</strong></big><br></br></td></tr>
<tr><td align="left" colspan="2"><strong>Il sottoscritto <xsl:value-of select="$verbale_root/TITOLARE_PROCEDIMENTO"></xsl:value-of>, docente di <xsl:value-of select="$verbale_root/AD"></xsl:value-of></strong><br></br>
   </td>
</tr>
<tr>
  

[jira] [Commented] (TIKA-1379) error in Tika().detect for xml files with xades signature

2015-03-20 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14372133#comment-14372133
 ] 

Tyler Palsulich commented on TIKA-1379:
---

The file is still detected as text/html. Should we update the magic to detect 
it as xml?
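
One possible direction, sketched here under assumptions and not as Tika's detection logic: instead of matching on `html` tokens that appear somewhere inside the file, look at the root element. The example uses the JDK's StAX API; the class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class RootElementSketch {
    /** Returns the local name of the first start element, or null if the input does not parse as XML. */
    static String rootElementName(byte[] content) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Do not load DTDs or external entities while sniffing untrusted input.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
        try {
            XMLStreamReader reader =
                    factory.createXMLStreamReader(new ByteArrayInputStream(content));
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        return reader.getLocalName();
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            return null;
        }
        return null;
    }

    public static void main(String[] args) {
        // Shortened stand-in for the reported file: an XML root with html nested inside.
        String xml = "<VERBALI><VERBALE><html><head></head></html></VERBALE></VERBALI>";
        System.out.println(rootElementName(xml.getBytes(StandardCharsets.UTF_8)));
    }
}
```

A root element other than `html` would then argue for application/xml even when an embedded stylesheet contains HTML markup.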

 error in Tika().detect for xml files with xades signature
 -

 Key: TIKA-1379
 URL: https://issues.apache.org/jira/browse/TIKA-1379
 Project: Tika
  Issue Type: Bug
  Components: detector
Affects Versions: 1.4
Reporter: Alessandro De Angelis
  Labels: new-parser
 Fix For: 1.8


 We tried to get the MIME type of an XML file with an embedded XAdES signature. 
 The result is text/html and not the expected text/xml or 
 application/xml.
 Here is an example of the XML file:
 {code}
 <VERBALI ad_cod="D69017" batch_id="0" cds_cod="D69" data_app="2013-09-23">
 <VERBALE Id="1" tipologia="Verbale esame">
    <VERB_NUM>00094853 0003 2</VERB_NUM>
    <DATA_APP>2013-09-23</DATA_APP>
    <DATA_ESA>2013-09-23</DATA_ESA>
    <AD_COD>D69017</AD_COD>
    <AD>FILOSOFIA DELLA SCIENZA</AD>
    <CDS_COD>D69</CDS_COD>
    <CDS>TEATRO E ARTI VISIVE</CDS>
    <TIPO_ESA></TIPO_ESA>
    <MAT>1233456</MAT>
    <NOME>PAOLINO</NOME>
    <COGNOME>PAPERINO</COGNOME>
    <VOTO>23.0</VOTO>
    <VOTODECOD>23</VOTODECOD>
    <CAUSALE></CAUSALE>
    <TIPO_MODULO></TIPO_MODULO>
    <IMG_PATH></IMG_PATH>
    <AA_SES_ID>2012</AA_SES_ID>
    <AD_CFU>6.0</AD_CFU>
    <NOTA></NOTA>
    <ATENEO>9</ATENEO>
    <ATENEO_DES>جامعة البندقية - TEST</ATENEO_DES>
    <TIPO_DOCUMENTO>Verbale_3</TIPO_DOCUMENTO>
    <TITOLARE_PROCEDIMENTO>QUI QUO QUA</TITOLARE_PROCEDIMENTO>
    <AD_STU_COD>D69017</AD_STU_COD>
    <AD_STU>FILOSOFIA DELLA SCIENZA</AD_STU>
    <CDS_STU_COD>D69</CDS_STU_COD>
    <CDS_STU>TEATRO E ARTI VISIVE</CDS_STU>
    <DOCENTE>QUI QUO QUA</DOCENTE>
 <DATA_DOCUMENTO>26-09-2013 09:55:53 CEST(+0200)</DATA_DOCUMENTO>
 <SOFTWARE_DI_CREAZIONE>
    <NOME>3</NOME>
    <VERSIONE>11.09.03</VERSIONE>
 </SOFTWARE_DI_CREAZIONE>
 </VERBALE><ds:Signature xmlns:ds="http://www.w3.org/2000/09/xmldsig#" Id="sig08744308748201048377">
 <ds:SignedInfo>
 <ds:CanonicalizationMethod Algorithm="http://www.w3.org/2006/12/xml-c14n11"></ds:CanonicalizationMethod>
 <ds:SignatureMethod Algorithm="http://www.w3.org/2001/04/xmldsig-more#rsa-sha256"></ds:SignatureMethod>
 <ds:Reference URI="">
 <ds:Transforms>
 <ds:Transform Algorithm="http://www.w3.org/2002/06/xmldsig-filter2">
 <dsig-xpath:XPath xmlns:dsig-xpath="http://www.w3.org/2002/06/xmldsig-filter2" Filter="subtract">/descendant::ds:Signature</dsig-xpath:XPath>
 </ds:Transform>
 <ds:Transform Algorithm="http://www.w3.org/TR/1999/REC-xslt-19991116">
 <xsl:stylesheet xmlns:kion="http://www.kion.it/webesse3/multilingua" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" exclude-result-prefixes="kion" version="1.0">
    <kion:ml module="FirmaDigitale" target="kion"></kion:ml>
    <xsl:output method="xml"></xsl:output>
    <xsl:variable name="mostra_ad_figlie" select="1"></xsl:variable>
    <xsl:variable name="verbale_root" select="/VERBALI/VERBALE"></xsl:variable>
    <xsl:variable name="sostituzione_root" select="/VERBALI/VERBALE/SOSTITUZIONE_DOCUMENTO"></xsl:variable>
    <xsl:variable name="RAGG_ROOT" select="/VERBALI/VERBALE/RAGGRUPPAMENTO"></xsl:variable>
    <xsl:variable name="COMM_ROOT" select="/VERBALI/VERBALE/COMMISSIONE"></xsl:variable>
   
    <xsl:template match="/">
    <html>
    <head>
    <meta content="text/html;charset=UTF-8" http-equiv="Content-Type"></meta>
    <xsl:choose>
    <xsl:when test="$sostituzione_root">
    <title>Dichiarazione conformità Verbale Esame</title>
    </xsl:when>
    <xsl:otherwise>
    <title>Verbalizzazione esame</title>
    </xsl:otherwise>
    </xsl:choose>
    <style type="text/css">
 td  {font-family: Arial; font-size:10pt;}
 div {font-family: Arial; font-size:10pt;}
 pre {font-family: Arial; font-size:10pt;}
    </style>
    </head>
    <body>
    <table>
    <xsl:choose>
    <xsl:when test="$sostituzione_root">
    <tr><td align="center" colspan="2"><big><strong><xsl:value-of select="$verbale_root/ATENEO_DES"></xsl:value-of></strong></big><br></br></td></tr>
 

[jira] [Commented] (TIKA-1266) Tika OSGI Bundle needs Bundle-ClassPath to work in Equinox

2015-03-20 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14371882#comment-14371882
 ] 

Tyler Palsulich commented on TIKA-1266:
---

After a quick Google, I don't think this is actually a problem? I really don't 
know, though.

 Tika OSGI Bundle needs Bundle-ClassPath to work in Equinox
 --

 Key: TIKA-1266
 URL: https://issues.apache.org/jira/browse/TIKA-1266
 Project: Tika
  Issue Type: Improvement
  Components: packaging
Affects Versions: 1.4, 1.5
Reporter: pm

 The tika-bundle currently has the Embed-Dependency header filled with 
 embedded dependencies. 
 Embed-Dependency is not defined in the OSGi spec; Bundle-ClassPath is.
 Please add Bundle-ClassPath with the list of embedded JAR names, prefixed with 
 ".,".
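
As an illustration of the requested header, the sketch below reads a `Bundle-ClassPath` value with the JDK's `java.util.jar.Manifest` and splits it into entries. The jar names `lib-a.jar` and `lib-b.jar` are hypothetical placeholders and are not the actual embedded dependencies of tika-bundle.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.jar.Manifest;

final class BundleClassPathSketch {
    /** Returns the entries of the Bundle-ClassPath header, or an empty list if it is absent. */
    static List<String> bundleClassPath(String manifestText) {
        Manifest manifest;
        try {
            manifest = new Manifest(
                    new ByteArrayInputStream(manifestText.getBytes(StandardCharsets.UTF_8)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String value = manifest.getMainAttributes().getValue("Bundle-ClassPath");
        if (value == null) {
            return Collections.emptyList();
        }
        List<String> entries = new ArrayList<>();
        for (String entry : value.split(",")) {
            entries.add(entry.trim());
        }
        return entries;
    }

    public static void main(String[] args) {
        // "." refers to the bundle's own classes; the jar names are made up.
        String manifest = "Manifest-Version: 1.0\n"
                + "Bundle-ClassPath: .,lib-a.jar,lib-b.jar\n";
        System.out.println(bundleClassPath(manifest));
    }
}
```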





[jira] [Commented] (TIKA-1276) Missing embedded dependencies in tika-bundle

2015-03-20 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14371890#comment-14371890
 ] 

Tyler Palsulich commented on TIKA-1276:
---

Is there anything else keeping this issue open? From the above, I don't think 
so. Please correct me if I'm wrong.

 Missing embedded dependencies in tika-bundle
 

 Key: TIKA-1276
 URL: https://issues.apache.org/jira/browse/TIKA-1276
 Project: Tika
  Issue Type: Bug
  Components: packaging
Affects Versions: 1.5
 Environment: OSGI, Apache Felix via Apache Sling Launcher
Reporter: Rupert Westenthaler
 Fix For: 1.8

 Attachments: TIKA-1276_20140423_rwesten.diff, 
 TIKA-1276_20140428_2_rwesten.diff, TIKA-1276_20140428_3_rwesten.diff, 
 TIKA-1276_20140428_rwesten.diff


 While updating from tika 1.2 to 1.5 I noticed that the 
 `org.apache.tika:tika-bundle:1.5` module has some missing dependencies.
 1. `com.uwyn:jhighlight:1.0` is not embedded
 Because of that installing the bundle results in the following exception
 {code}
 org.osgi.framework.BundleException: Unresolved constraint in bundle 
 org.apache.tika.bundle [103]: Unable to resolve 103.0: missing requirement 
 [103.0] osgi.wiring.package; 
 (osgi.wiring.package=com.uwyn.jhighlight.renderer))
 org.osgi.framework.BundleException: Unresolved constraint in bundle 
 org.apache.tika.bundle [103]: Unable to resolve 103.0: missing requirement 
 [103.0] osgi.wiring.package; 
 (osgi.wiring.package=com.uwyn.jhighlight.renderer)
   at 
 org.apache.felix.framework.Felix.resolveBundleRevision(Felix.java:3962)
   at org.apache.felix.framework.Felix.startBundle(Felix.java:2025)
   at org.apache.felix.framework.Felix.setActiveStartLevel(Felix.java:1279)
   at 
 org.apache.felix.framework.FrameworkStartLevelImpl.run(FrameworkStartLevelImpl.java:304)
   at java.lang.Thread.run(Thread.java:744)
 {code}
 2. `org.ow2.asm:asm:4.1` is not embedded because 
 `org.apache.tika:tika-core:1.5` uses `org.ow2.asm:asm-debug-all:4.1` and 
 therefore the `Embed-Dependency` directive `asm` does not match any 
 dependency. 
 Because of that one gets the following exception (after fixing (1))
 {code}
 org.osgi.framework.BundleException: Unresolved constraint in bundle 
 org.apache.tika.bundle [96]: Unable to resolve 96.0: missing requirement 
 [96.0] osgi.wiring.package; 
 ((osgi.wiring.package=org.objectweb.asm)(version=4.1.0)(!(version=5.0.0
 org.osgi.framework.BundleException: Unresolved constraint in bundle 
 org.apache.tika.bundle [96]: Unable to resolve 96.0: missing requirement 
 [96.0] osgi.wiring.package; 
 ((osgi.wiring.package=org.objectweb.asm)(version=4.1.0)(!(version=5.0.0)))
   at 
 org.apache.felix.framework.Felix.resolveBundleRevision(Felix.java:3962)
   at org.apache.felix.framework.Felix.startBundle(Felix.java:2025)
   at org.apache.felix.framework.Felix.setActiveStartLevel(Felix.java:1279)
   at 
 org.apache.felix.framework.FrameworkStartLevelImpl.run(FrameworkStartLevelImpl.java:304)
   at java.lang.Thread.run(Thread.java:744)
 {code}
 There are two possibilities to fix this: (a) change the `Embed-Dependency` to 
 `asm-debug-all`, or (b) add a dependency on `org.ow2.asm:asm:4.1` to the 
 tika-bundle pom file.
 3. `edu.ucar:netcdf:4.2-min` is not embedded
 Because of that one does get the following exception (after fixing (1) and 
 (2))
 {code}
 org.osgi.framework.BundleException: Unresolved constraint in bundle 
 org.apache.tika.bundle [96]: Unable to resolve 96.0: missing requirement 
 [96.0] osgi.wiring.package; (osgi.wiring.package=ucar.ma2))
 org.osgi.framework.BundleException: Unresolved constraint in bundle 
 org.apache.tika.bundle [96]: Unable to resolve 96.0: missing requirement 
 [96.0] osgi.wiring.package; (osgi.wiring.package=ucar.ma2)
   at 
 org.apache.felix.framework.Felix.resolveBundleRevision(Felix.java:3962)
   at org.apache.felix.framework.Felix.startBundle(Felix.java:2025)
   at org.apache.felix.framework.Felix.setActiveStartLevel(Felix.java:1279)
   at 
 org.apache.felix.framework.FrameworkStartLevelImpl.run(FrameworkStartLevelImpl.java:304)
   at java.lang.Thread.run(Thread.java:744)
 {code}
 4. The `com.adobe.xmp:xmpcore:5.1.2` dependency is required at runtime
 After fixing the above issues the tika-bundle was started successfully. 
 However, when extracting EXIF metadata from a JPEG image I got the following 
 exception.
 {code}
 java.lang.NoClassDefFoundError: com/adobe/xmp/XMPException
   at 
 com.drew.imaging.jpeg.JpegMetadataReader.extractMetadataFromJpegSegmentReader(JpegMetadataReader.java:112)
   at 
 com.drew.imaging.jpeg.JpegMetadataReader.readMetadata(JpegMetadataReader.java:71)
   at 
 

[jira] [Commented] (TIKA-1579) Add file type to NetCDFParser

2015-03-20 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14371894#comment-14371894
 ] 

Tyler Palsulich commented on TIKA-1579:
---

+1, ship it! You don't need a review board for small changes. :)

 Add file type to NetCDFParser
 -

 Key: TIKA-1579
 URL: https://issues.apache.org/jira/browse/TIKA-1579
 Project: Tika
  Issue Type: Improvement
  Components: parser
Reporter: Ann Burgess
Assignee: Ann Burgess
 Attachments: TIKA-1579.abburgess.190315.patch.txt


 [~gostep] explains that there are three versions of NetCDF (classic format, 
 64-bit offset, and netCDF-4/HDF5 format). When opening an existing netCDF 
 file, the netCDF library will transparently detect its format, so we do not 
 need to adjust according to the detected format.
 That said, it would be good to know the file type, as each can have the .nc 
 extension. This patch will add the file type to the metadata.
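
For context, the three formats can be told apart by their leading bytes: the classic format starts with `CDF` followed by the byte 1, the 64-bit offset format with `CDF` followed by the byte 2, and netCDF-4 files carry the HDF5 signature. The sketch below is an assumption-laden illustration of that check and is not the attached patch, which may well rely on the netCDF library instead. The class name is hypothetical, and only offset 0 is inspected.

```java
final class NetcdfFormatSketch {
    // HDF5 signature: 0x89 'H' 'D' 'F' '\r' '\n' 0x1a '\n'
    private static final byte[] HDF5_SIGNATURE =
            {(byte) 0x89, 'H', 'D', 'F', '\r', '\n', 0x1a, '\n'};

    /** Describes the format suggested by the first bytes of a file. */
    static String describe(byte[] header) {
        if (header.length >= 4 && header[0] == 'C' && header[1] == 'D' && header[2] == 'F') {
            if (header[3] == 1) {
                return "netCDF classic format";
            }
            if (header[3] == 2) {
                return "netCDF 64-bit offset format";
            }
        }
        if (startsWith(header, HDF5_SIGNATURE)) {
            // A plain HDF5 file has the same signature, so this is only a hint.
            return "netCDF-4/HDF5 format";
        }
        return "unknown";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(describe(new byte[] {'C', 'D', 'F', 1}));
        System.out.println(describe(new byte[] {'C', 'D', 'F', 2}));
        System.out.println(describe(HDF5_SIGNATURE));
    }
}
```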





[jira] [Commented] (TIKA-1578) Add file type description to HDFParsers

2015-03-20 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14371895#comment-14371895
 ] 

Tyler Palsulich commented on TIKA-1578:
---

+1!

 Add file type description to HDFParsers
 ---

 Key: TIKA-1578
 URL: https://issues.apache.org/jira/browse/TIKA-1578
 Project: Tika
  Issue Type: Improvement
  Components: parser
Reporter: Ann Burgess
Assignee: Ann Burgess
 Attachments: TIKA-1578.abburgess.150319.patch.txt


 [~gostep] explains that there are three versions of NetCDF (classic format, 
 64-bit offset, and netCDF-4/HDF5 format). When opening an existing netCDF 
 file, the netCDF library will transparently detect its format so we do not 
 need to adjust according to the detected format. 
 That said, it would be good to know the file type, as each can have the .nc 
 extension.  This patch adds the file type to the metadata. 





[jira] [Resolved] (TIKA-1154) Tika hangs on format detection of malformed HTML file.

2015-03-20 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich resolved TIKA-1154.
---
Resolution: Fixed

Marking as Fixed, since the file is detected and parsed without issue. Not sure 
what was happening before! Thanks!

 Tika hangs on format detection of malformed HTML file.
 --

 Key: TIKA-1154
 URL: https://issues.apache.org/jira/browse/TIKA-1154
 Project: Tika
  Issue Type: Bug
  Components: mime
Affects Versions: 1.4
Reporter: Andrew Jackson
Priority: Minor
 Attachments: tika-breaker.html


 We are using Tika on large web archives, which also happen to contain some 
 malformed files. In particular, we found a HTML file with binary characters 
 in the DOCTYPE declaration. This hangs Tika, either embedded or from the 
 command line, during format detection.
 An example file is attached.





[jira] [Commented] (TIKA-1296) Add case insensitive matching for text/html mime type

2015-03-20 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14372038#comment-14372038
 ] 

Tyler Palsulich commented on TIKA-1296:
---

The only mimetype definition that uses {{stringignorecase}} is rfc822. Are 
there any (other than HTML) that could benefit from this?

 Add case insensitive matching for text/html mime type
 -

 Key: TIKA-1296
 URL: https://issues.apache.org/jira/browse/TIKA-1296
 Project: Tika
  Issue Type: Improvement
  Components: mime
Affects Versions: 1.5
Reporter: Phil Lester
Assignee: Ken Krugler

 Currently in tika-mimetypes.xml for the mime type text/html (and possibly 
 others) matches in a couple different cases are provided for the elements so 
 that varying HTML writing styles are matched. As of version 1.5 of Tika the 
 ability exists to make these case insensitive using the stringignorecase 
 type. This would allow consolidation of some matches and improve detection of 
 poorly-formed HTML that would be rendered by most browsers regardless of case.
 For example:
   <match value="&lt;BODY" type="string" offset="0"/>
   <match value="&lt;body" type="string" offset="0"/>
 could become:
   <match value="&lt;BODY" type="stringignorecase" offset="0"/>
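
 To make the consolidation concrete, here is a minimal sketch of what a 
 case-insensitive match at a fixed offset amounts to. The class and method 
 names are hypothetical and this is not how Tika's magic matching is 
 implemented; it also assumes the magic string is plain ASCII.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tika: illustrates a case-insensitive
// magic match at a fixed byte offset (ASCII magic strings assumed).
class IgnoreCaseMagic {
    static boolean matches(byte[] data, String magic, int offset) {
        // Reject offsets that would read past the end of the data.
        if (offset < 0 || offset + magic.length() > data.length) {
            return false;
        }
        // Decode only the window that the magic string would occupy.
        String window = new String(data, offset, magic.length(), StandardCharsets.US_ASCII);
        return window.equalsIgnoreCase(magic);
    }

    public static void main(String[] args) {
        byte[] upper = "<BODY>".getBytes(StandardCharsets.US_ASCII);
        byte[] mixed = "<Body>".getBytes(StandardCharsets.US_ASCII);
        // One pattern covers every casing of the tag.
        System.out.println(matches(upper, "<body", 0));
        System.out.println(matches(mixed, "<body", 0));
    }
}
```

 With a helper like this, a single pattern replaces the separate upper-case 
 and lower-case entries.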





[jira] [Updated] (TIKA-1307) Jenkins Java7 job requires a profile in order to build 'tika-java7' module.

2015-03-20 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich updated TIKA-1307:
--
Labels: build  (was: )

 Jenkins Java7 job requires a profile in order to build 'tika-java7' module.
 ---

 Key: TIKA-1307
 URL: https://issues.apache.org/jira/browse/TIKA-1307
 Project: Tika
  Issue Type: Bug
  Components: packaging
Affects Versions: 1.5
Reporter: Lewis John McGibbney
  Labels: build
 Fix For: 1.8


 N.B. Can someone please create a *build* tag in Admin area? Then assign it to 
 this issue?
 This issue was flagged up by Hong-Thai during the DISCUSS nightly builds 
 thread recently
 http://www.mail-archive.com/dev%40tika.apache.org/msg07963.html





[jira] [Resolved] (TIKA-1307) Jenkins Java7 job requires a profile in order to build 'tika-java7' module.

2015-03-20 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich resolved TIKA-1307.
---
Resolution: Done

Marking this as Done, since the Java7 component is now tested. See 
https://builds.apache.org/view/Tika/job/tika-trunk-jdk1.7/lastStableBuild/org.apache.tika$tika-java7/.

 Jenkins Java7 job requires a profile in order to build 'tika-java7' module.
 ---

 Key: TIKA-1307
 URL: https://issues.apache.org/jira/browse/TIKA-1307
 Project: Tika
  Issue Type: Bug
  Components: packaging
Affects Versions: 1.5
Reporter: Lewis John McGibbney
  Labels: build
 Fix For: 1.8


 N.B. Can someone please create a *build* tag in Admin area? Then assign it to 
 this issue?
 This issue was flagged up by Hong-Thai during the DISCUSS nightly builds 
 thread recently
 http://www.mail-archive.com/dev%40tika.apache.org/msg07963.html





[jira] [Commented] (TIKA-1308) Support in memory parse mode(don't create temp file): to support run Tika in GAE

2015-03-20 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14372060#comment-14372060
 ] 

Tyler Palsulich commented on TIKA-1308:
---

This would be a great feature! It would need to be disabled by default, though, 
since some files are larger than memory. And, as mentioned above, some parsers 
require writing the output to a file in order to use the external parsing 
library.
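
To illustrate the memory concern, here is a minimal sketch of buffering a 
stream in memory with a hard size cap, using only the JDK. The class name, 
method name and cap are hypothetical; this is not an existing Tika API and not 
a proposal for the actual patch.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not Tika API: buffers a stream in memory up to a cap
// so it can be re-read without creating a temporary file.
class InMemoryBuffer {
    static byte[] readFully(InputStream in, int maxBytes) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            // Fail fast instead of exhausting the heap on oversized input.
            if (out.size() + n > maxBytes) {
                throw new IOException("input exceeds in-memory limit of " + maxBytes + " bytes");
            }
            out.write(chunk, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = readFully(new ByteArrayInputStream(new byte[] {1, 2, 3}), 1024);
        // Each pass gets a fresh stream over the same bytes.
        InputStream first = new ByteArrayInputStream(data);
        InputStream second = new ByteArrayInputStream(data);
        System.out.println(first.read() + " " + second.read());
    }
}
```

A cap like this is one way to keep an opt-in memory mode from taking down the 
JVM on large inputs; parsers that need a real file would still need a fallback.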

 Support in memory parse mode(don't create temp file): to support run Tika in 
 GAE
 

 Key: TIKA-1308
 URL: https://issues.apache.org/jira/browse/TIKA-1308
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.5
Reporter: jefferyyuan
  Labels: gae
 Fix For: 1.8


 I am trying to use Tika in GAE and write a simple servlet to extract meta 
 data info from jpeg:
 {code}
 String urlStr = req.getParameter("imageUrl");
 byte[] oldImageData = IOUtils.toByteArray(new URL(urlStr));
 ByteArrayInputStream bais = new ByteArrayInputStream(oldImageData);
 Metadata metadata = new Metadata();
 BodyContentHandler ch = new BodyContentHandler();
 AutoDetectParser parser = new AutoDetectParser();
 parser.parse(bais, ch, metadata, new ParseContext());
 bais.close();
 {code}
 This fails with exception:
 {code}
 Caused by: java.lang.SecurityException: Unable to create temporary file
   at java.io.File.createTempFile(File.java:1986)
   at 
 org.apache.tika.io.TemporaryResources.createTemporaryFile(TemporaryResources.java:66)
   at org.apache.tika.io.TikaInputStream.getFile(TikaInputStream.java:533)
   at org.apache.tika.parser.jpeg.JpegParser.parse(JpegParser.java:56)
   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
 {code}
 Checked the code, in 
 org.apache.tika.parser.jpeg.JpegParser.parse(InputStream, ContentHandler, 
 Metadata, ParseContext), it creates a temp file from the input stream.
 I can understand why Tika creates a temp file from the stream: so Tika can parse 
 it multiple times.
 But as GAE and other cloud servers are getting more popular, is it possible 
 to avoid creating a temp file? Instead, we could copy the original stream to a 
 byte array stream, so Tika can still parse it multiple times.
 -- This will limit the file size, as Tika keeps the whole file in 
 memory, but it would make Tika work in GAE and maybe on other cloud servers.
 We could add a parameter to parser.parse to indicate whether to parse in memory 
 only.
  





[jira] [Updated] (TIKA-1308) Support in memory parse mode(don't create temp file): to support run Tika in GAE

2015-03-20 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich updated TIKA-1308:
--
Description: 
I am trying to use Tika in GAE and write a simple servlet to extract meta data 
info from jpeg:
{code}
String urlStr = req.getParameter("imageUrl");
byte[] oldImageData = IOUtils.toByteArray(new URL(urlStr));

ByteArrayInputStream bais = new ByteArrayInputStream(oldImageData);
Metadata metadata = new Metadata();
BodyContentHandler ch = new BodyContentHandler();
AutoDetectParser parser = new AutoDetectParser();
parser.parse(bais, ch, metadata, new ParseContext());
bais.close();
{code}
This fails with exception:
{code}
Caused by: java.lang.SecurityException: Unable to create temporary file
at java.io.File.createTempFile(File.java:1986)
at 
org.apache.tika.io.TemporaryResources.createTemporaryFile(TemporaryResources.java:66)
at org.apache.tika.io.TikaInputStream.getFile(TikaInputStream.java:533)
at org.apache.tika.parser.jpeg.JpegParser.parse(JpegParser.java:56)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
{code}
Checked the code, in org.apache.tika.parser.jpeg.JpegParser.parse(InputStream, 
ContentHandler, Metadata, ParseContext), it creates a temp file from the input 
stream.

I can understand why Tika creates a temp file from the stream: so Tika can parse 
it multiple times.

But as GAE and other cloud servers are getting more popular, is it possible to 
avoid creating a temp file? Instead, we could copy the original stream to a byte 
array stream, so Tika can still parse it multiple times.
-- This will limit the file size, as Tika keeps the whole file in 
memory, but it would make Tika work in GAE and maybe on other cloud servers.

We could add a parameter to parser.parse to indicate whether to parse in memory 
only.
 

  was:
I am trying to use Tika in GAE and write a simple servlet to extract meta data 
info from jpeg:

String urlStr = req.getParameter(imageUrl);
byte[] oldImageData = IOUtils.toByteArray(new URL(urlStr));

ByteArrayInputStream bais = new ByteArrayInputStream(oldImageData);
Metadata metadata = new Metadata();
BodyContentHandler ch = new BodyContentHandler();
AutoDetectParser parser = new AutoDetectParser();
parser.parse(bais, ch, metadata, new ParseContext());
bais.close();

This fails with exception:
Caused by: java.lang.SecurityException: Unable to create temporary file
at java.io.File.createTempFile(File.java:1986)
at 
org.apache.tika.io.TemporaryResources.createTemporaryFile(TemporaryResources.java:66)
at org.apache.tika.io.TikaInputStream.getFile(TikaInputStream.java:533)
at org.apache.tika.parser.jpeg.JpegParser.parse(JpegParser.java:56)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242

Checked the code, in org.apache.tika.parser.jpeg.JpegParser.parse(InputStream, 
ContentHandler, Metadata, ParseContext), it creates a temp file from the input 
stream.

I can understand why tika create temp file from the stream: so tika can parse 
it multiple times.

But as GAE and other cloud servers are getting more popular, is it possible to 
avoid create temp file: instead we can copy the origin stream to a byteArray 
stream, so tika can also parse it multiple times.
-- This will have a limit on the file size, as tika keeps the whole file in 
memory, but this can make tika work in GAE and maybe other cloud server.

We can add a parameter in parser.parse to indicate whether do in memory parse 
only.
 


 Support in memory parse mode(don't create temp file): to support run Tika in 
 GAE
 

 Key: TIKA-1308
 URL: https://issues.apache.org/jira/browse/TIKA-1308
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.5
Reporter: jefferyyuan
  Labels: gae
 Fix For: 1.8


 I am trying to use Tika in GAE and write a simple servlet to extract meta 
 data info from jpeg:
 {code}
 String urlStr = req.getParameter("imageUrl");
 byte[] oldImageData = IOUtils.toByteArray(new URL(urlStr));
 ByteArrayInputStream bais = new ByteArrayInputStream(oldImageData);
 Metadata metadata = new Metadata();
 BodyContentHandler ch = new BodyContentHandler();
 AutoDetectParser parser = new AutoDetectParser();
 parser.parse(bais, ch, metadata, new ParseContext());
 bais.close();
 {code}
 This fails with exception:
 {code}
 Caused by: java.lang.SecurityException: Unable to create temporary file
   at java.io.File.createTempFile(File.java:1986)
   at 
 org.apache.tika.io.TemporaryResources.createTemporaryFile(TemporaryResources.java:66)
   at org.apache.tika.io.TikaInputStream.getFile(TikaInputStream.java:533)
   at org.apache.tika.parser.jpeg.JpegParser.parse(JpegParser.java:56)
 

[jira] [Commented] (TIKA-1314) An inappropriate comment of CharsetDetector.detect()

2015-03-20 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14372066#comment-14372066
 ] 

Tyler Palsulich commented on TIKA-1314:
---

This is still an issue in Tika 1.8-SNAPSHOT. See 
[here|https://github.com/apache/tika/blob/4096059da7f6d50e3d6e018681b8c02a96d3933a/tika-parsers/src/main/java/org/apache/tika/parser/txt/CharsetDetector.java#L141-L172].
 Any input on whether we should update the comment or throw an Exception?
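
The two options can be sketched side by side. Everything below is a 
hypothetical stand-in written against the JDK only; none of the names are 
Tika's, and the "detection" is a placeholder.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for a detector; none of these names are Tika's.
class DetectContract {
    // Option 1: keep the current behaviour and document the null return.
    static Charset detectOrNull(byte[] input) {
        if (input == null || input.length == 0) {
            return null; // no input text has been provided
        }
        return StandardCharsets.UTF_8; // placeholder for real detection
    }

    // Option 2: match the existing javadoc and raise an exception instead.
    static Charset detectOrThrow(byte[] input) {
        Charset charset = detectOrNull(input);
        if (charset == null) {
            throw new IllegalStateException("no charset matches the input");
        }
        return charset;
    }

    public static void main(String[] args) {
        // Callers of option 1 must remember the null check.
        Charset c = detectOrNull(new byte[0]);
        System.out.println(c == null ? "no match" : c.name());
    }
}
```

Option 1 only needs a javadoc change; option 2 changes behaviour for any 
existing caller that already checks for null.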

 An inappropriate comment of CharsetDetector.detect()
 

 Key: TIKA-1314
 URL: https://issues.apache.org/jira/browse/TIKA-1314
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.5
Reporter: Yi EungJun
Priority: Minor

 According to the javadoc of CharsetDetector.detect(), it raises an
 exception if no charset appears to match the data:
  * Raise an exception if
  *  <ul>
  *    <li>no charsets appear to match the input data.</li>
  *    <li>no input text has been provided</li>
  *  </ul>
 But it seems to me that in such cases the method returns null but does not 
 raise any exception.





[jira] [Commented] (TIKA-1114) sgml mime type is not detected when passed in as byte stream

2015-03-20 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14371945#comment-14371945
 ] 

Tyler Palsulich commented on TIKA-1114:
---

See http://en.wikipedia.org/wiki/Standard_Generalized_Markup_Language. It seems 
like there isn't really a dedicated way to know whether a file is SGML or 
not...

 sgml mime type is not detected when passed in as byte stream
 

 Key: TIKA-1114
 URL: https://issues.apache.org/jira/browse/TIKA-1114
 Project: Tika
  Issue Type: Bug
  Components: mime
Reporter: Vikas Garg

 When passing sgml files as  TikaInputStream (created from byte[]) to 
 Detector.detect(), it returns text/plain as mediatype and not 
 application/sgml or text/sgml. But when I provide the file name to metadata, 
 then it gives me correct mime-type, i.e., text/sgml.
 Is it because Tika is missing any designated parser for sgml files OR am I 
 missing something? I am on Tika-1.3.





[jira] [Commented] (TIKA-1267) Improve Mbox file detection

2015-03-20 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14371957#comment-14371957
 ] 

Tyler Palsulich commented on TIKA-1267:
---

The current definition is:
{code}
  <mime-type type="application/mbox">
    <sub-class-of type="text/plain"/>
    <glob pattern="*.mbox"/>
  </mime-type>
{code}

I think it would be too general to identify every text file that starts with 
{{From }} as {{application/mbox}}.
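
For comparison, a stricter check could require the rest of the separator line 
as well. The sketch below assumes the common mbox layout of "From ", an 
envelope sender, and an asctime-style date; that layout is an assumption (real 
mailboxes vary, for example with time zones), and the class is hypothetical, 
not a Tika detector.

```java
import java.util.regex.Pattern;

// Hypothetical check, not a Tika detector. Assumes the first line looks like:
//   From sender@example.com Fri Mar 20 15:53:04 2015
class MboxHeuristic {
    private static final Pattern FROM_LINE = Pattern.compile(
            "^From \\S+ +[A-Z][a-z]{2} [A-Z][a-z]{2} [ \\d]\\d \\d{2}:\\d{2}:\\d{2} \\d{4}");

    static boolean looksLikeMbox(String firstLine) {
        return FROM_LINE.matcher(firstLine).find();
    }

    public static void main(String[] args) {
        // A real separator line versus ordinary prose that starts with "From ".
        System.out.println(looksLikeMbox("From alice@example.com Fri Mar 20 15:53:04 2015"));
        System.out.println(looksLikeMbox("From here on, the report covers detection."));
    }
}
```

A plain {{From }} prefix would accept both inputs above, which is the false 
positive I'm worried about.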

 Improve Mbox file detection
 ---

 Key: TIKA-1267
 URL: https://issues.apache.org/jira/browse/TIKA-1267
 Project: Tika
  Issue Type: Improvement
  Components: mime
Affects Versions: 1.5
Reporter: Luis Filipe Nassif
Priority: Minor

 Could we add the code below to the application/mbox mime-type definition:
 {code}
 <magic priority="70">
 <match value="From " type="string" offset="0"/>
 </magic>
 {code}
 Or is it too common out there?





[jira] [Resolved] (TIKA-1287) Update NetCDF .jar file on Maven Central

2015-03-20 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich resolved TIKA-1287.
---
Resolution: Fixed

Marking this as Fixed, since [~annieburgess] and [~lewismc] have pushed to 
Central.

 Update NetCDF .jar file on Maven Central
 

 Key: TIKA-1287
 URL: https://issues.apache.org/jira/browse/TIKA-1287
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.5
Reporter: Ann Burgess
  Labels: jar, maven, netcdf, tika, unit-test, update

 I am working to update the NetCDFParser file.  When using the most-recent 
 .jar file available from http://www.unidata.ucar.edu/ at the command line I 
 receive a note about a deprecated API: 
 javac -classpath 
 ../../../../tika-core/target/tika-core-1.6-SNAPSHOT.jar:../../../../toolsUI-4.3.jar
  org/apache/tika/parser/netcdf/NetCDFParser.java
 Note: org/apache/tika/parser/netcdf/NetCDFParser.java uses or overrides a 
 deprecated API.
 Note: Recompile with -Xlint:deprecation for details.
 After updating the NetCDFParser file with non-deprecated methods (e.g. 
 changing dimension.getName() to dimension.getFullName()) however, I get 
 failed unit tests in maven, which I assume is because the Maven Central Repo 
 has the lapsed version of the .jar file needed for NetCDF files (
 http://search.maven.org/#search%7Cgav%7C1%7Cg%3A%22edu.ucar%22%20AND%20a%3A%22netcdf%22)
  .
 Can anyone provide insight into how I get the updated .jar file into the 
 Maven Central Repository? Is there an alternative method to update Tika so I 
 can run my unit tests in Maven?





[jira] [Commented] (TIKA-1325) Move the font metadata definitions to properties

2015-03-20 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14372076#comment-14372076
 ] 

Tyler Palsulich commented on TIKA-1325:
---

Does anyone know of an external standard for these metadata keys? Or, can we 
close this off?

 Move the font metadata definitions to properties
 

 Key: TIKA-1325
 URL: https://issues.apache.org/jira/browse/TIKA-1325
 Project: Tika
  Issue Type: Improvement
  Components: metadata, parser
Affects Versions: 1.5, 1.6
Reporter: Nick Burch
 Attachments: TIKA-1325_TimeZone.patch


 As noticed while working on TIKA-1182, the AFM font parser has a bunch of 
 hard coded strings it uses as metadata keys, while the TTF font parser 
 doesn't have many
 We should switch these to being proper Properties, with definitions from a 
 well known standard (+ compatibility fallbacks), and have both use largely 
 the same set





[jira] [Comment Edited] (TIKA-1154) Tika hangs on format detection of malformed HTML file.

2015-03-20 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14371900#comment-14371900
 ] 

Tyler Palsulich edited comment on TIKA-1154 at 3/20/15 7:19 PM:


Marking as Fixed, since the file is detected and parsed without issue. Not sure 
what was happening before! Thank you, [~anjackson]!


was (Author: tpalsulich):
Marking as Fixed, since the file is detected and parsed without issue. Not sure 
what was happening before! Thanks!

 Tika hangs on format detection of malformed HTML file.
 --

 Key: TIKA-1154
 URL: https://issues.apache.org/jira/browse/TIKA-1154
 Project: Tika
  Issue Type: Bug
  Components: mime
Affects Versions: 1.4
Reporter: Andrew Jackson
Priority: Minor
 Attachments: tika-breaker.html


 We are using Tika on large web archives, which also happen to contain some 
 malformed files. In particular, we found a HTML file with binary characters 
 in the DOCTYPE declaration. This hangs Tika, either embedded or from the 
 command line, during format detection.
 An example file is attached.





[jira] [Commented] (TIKA-1194) Missing text from MS Word (DOC) file

2015-03-20 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14371939#comment-14371939
 ] 

Tyler Palsulich commented on TIKA-1194:
---

Thank you, [~tssk]! Is there any way you can create a patch from {{svn diff}}, 
instead of (I think) just regular {{diff}}? Then, we can hopefully integrate 
this into trunk. :)

 Missing text from MS Word (DOC) file
 

 Key: TIKA-1194
 URL: https://issues.apache.org/jira/browse/TIKA-1194
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
Reporter: Tomas Safarik
Priority: Critical
 Attachments: OP-06-015.doc, apache-tika-1.5.patch


 Hello,
 we noticed that filtered text from some MS Word DOC files is missing one line 
 (in table cell) in the original document.
 - If you add or remove one character anywhere before the problematic 
 line/cell then the filtered text is correct. If you get the text back to 
 original the filtering problem is back.
 - If the file is resaved as DOCX filtering works fine.
 I will provide sample document. And please let me know if more information is 
 needed.
 Regards,
 Tomas





[jira] [Resolved] (TIKA-1273) old tika-server jar artifact contains no manifest so not able to invoke from shell

2015-03-20 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich resolved TIKA-1273.
---
Resolution: Fixed

The server now starts as expected. So, I'm marking this as Fixed.

{code}
➜  trunk  java -jar tika-server/target/tika-server-1.8-SNAPSHOT.jar
Mar 20, 2015 3:53:04 PM org.apache.tika.server.TikaServerCli main
INFO: Starting Apache Tika 1.8-SNAPSHOT server
Mar 20, 2015 3:53:04 PM org.apache.cxf.endpoint.ServerImpl initDestination
INFO: Setting the server's publish address to be http://localhost:9998/
Mar 20, 2015 3:53:04 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: jetty-8.y.z-SNAPSHOT
Mar 20, 2015 3:53:04 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: Started SelectChannelConnector@localhost:9998
Mar 20, 2015 3:53:04 PM org.apache.tika.server.TikaServerCli main
INFO: Started
{code}

 old tika-server jar artifact contains no manifest so not able to invoke from 
 shell
 --

 Key: TIKA-1273
 URL: https://issues.apache.org/jira/browse/TIKA-1273
 Project: Tika
  Issue Type: Bug
  Components: server
Affects Versions: 1.5
Reporter: Lewis John McGibbney
Priority: Minor
 Fix For: 1.8


 I've never ever used the old tika-server artifact which is generated when one 
 installs the server module. It needs to contain a manifest otherwise it 
 cannot be invoked from the shell.





[jira] [Resolved] (TIKA-1289) Ligatures convert on text extraction

2015-03-20 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich resolved TIKA-1289.
---
   Resolution: Fixed
Fix Version/s: 1.7

I'm marking this as Fixed, since the first sentence now seems to be valid:
{code}
<p>Replace this file with prentcsmacro.sty for your meeting,
or with entcsmacro.sty for your meeting. Both can be
found at the ENTCS Macro Home Page.
</p>
{code}

 Ligatures convert on text extraction
 

 Key: TIKA-1289
 URL: https://issues.apache.org/jira/browse/TIKA-1289
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.5
 Environment: win 8, jre 1.5
Reporter: Alex Andrushchak
 Fix For: 1.7

 Attachments: PDF_text that can be copied is over the picture.pdf


 According to tika sources review, it uses pdfbox to parse pdf files. 
 I found that pdfbox itself uses icu4j to handle ligatures.
 Unfortunately, when I added the icu4j jar to my classpath nothing changed, 
 ligatures are still not converted. Sample pdf file is attached.





[jira] [Updated] (TIKA-1580) ISA-Tab parsers

2015-03-20 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich updated TIKA-1580:
--
Labels: new-parser  (was: )

 ISA-Tab parsers
 ---

 Key: TIKA-1580
 URL: https://issues.apache.org/jira/browse/TIKA-1580
 Project: Tika
  Issue Type: New Feature
  Components: parser
Reporter: Giuseppe Totaro
Priority: Minor
  Labels: new-parser
 Attachments: TIKA-1580.patch


 We are going to add parsers for ISA-Tab data formats.
 ISA-Tab files are related to [ISA Tools|http://www.isa-tools.org/] which help 
 to manage an increasingly diverse set of life science, environmental and 
 biomedical experiments that employ one or a combination of technologies.
 The ISA tools are built upon the _Investigation_, _Study_, and _Assay_ tabular 
 format. Therefore, the ISA-Tab data format includes three types of file: 
 Investigation file ({{i_.txt}}), Study file ({{s_.txt}}), Assay file 
 ({{a_.txt}}). These files are organized as a [top-down 
 hierarchy|http://www.isa-tools.org/format/specification/]: an Investigation 
 file includes one or more Study files, and each Study file includes one or more 
 Assay files.
 Essentially, the Investigation file contains high-level information about 
 the related study, so it provides only metadata about ISA-Tab files.
 More details on file format specification are [available 
 online|http://isatab.sourceforge.net/docs/ISA-TAB_release-candidate-1_v1.0_24nov08.pdf].
 The patch in attachment provides a preliminary version of ISA-Tab parsers 
 (there are three parsers; one parser for each ISA-Tab filetype):
 * {{ISATabInvestigationParser.java}}: parses Investigation files. It extracts 
 only metadata.
 * {{ISATabStudyParser.java}}: parses Study files.
 * {{ISATabAssayParser.java}}: parses Assay files.
 The most important improvements are:
 * Combine these three parsers in order to parse an ISArchive
 * Provide a better mapping of both study and assay data on XHTML. Currently, 
 {{ISATabStudyParser}} and {{ISATabAssayParser}} provide a naive mapping 
 function relying on [Apache Commons 
 CSV|https://commons.apache.org/proper/commons-csv/].
 Thanks for supporting me on this work [~chrismattmann]. 





[jira] [Commented] (TIKA-1293) Netscape bookmark files are not being detected as HTML

2015-03-20 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14372020#comment-14372020
 ] 

Tyler Palsulich commented on TIKA-1293:
---

Looks good to me. Any objections to adding this magic for HTML Netscape 
bookmark files?

 Netscape bookmark files are not being detected as HTML
 --

 Key: TIKA-1293
 URL: https://issues.apache.org/jira/browse/TIKA-1293
 Project: Tika
  Issue Type: Bug
  Components: detector, mime
Reporter: Phil Lester
 Attachments: bookmarks.txt


 We are able to circumvent the HTML file type detection using the standard 
 Netscape bookmark file doctype (<!DOCTYPE NETSCAPE-Bookmark-file-1>) and 
 renaming the file extension to .txt. Standard HTML elements can then be 
 included in the file. Some browsers (such as Firefox) will detect the .txt 
 file as HTML and display it accordingly when downloading.
 We were able to resolve this by adding a custom mime-type for text/html that 
 included a match pattern for the Netscape doctype:
 <match value="&lt;!DOCTYPE NETSCAPE-Bookmark-file-1" type="string" 
 offset="0:64"/>





[jira] [Commented] (TIKA-1304) Implement Metadata Property with PropertyType ALT

2015-03-20 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14372029#comment-14372029
 ] 

Tyler Palsulich commented on TIKA-1304:
---

Did you ever have an update on this, [~talli...@mitre.org]?

 Implement Metadata Property with PropertyType ALT
 -

 Key: TIKA-1304
 URL: https://issues.apache.org/jira/browse/TIKA-1304
 Project: Tika
  Issue Type: Improvement
  Components: metadata
Reporter: Tim Allison
Priority: Trivial

 PropertyType Alt has been available for a while, but it doesn't appear to 
 have been implemented.  I'd like to implement it to fix TIKA-1295.
 If I've missed the implementation or if there is a preferred workaround, 
 please let me know, and I'll close this issue and use that to fix TIKA-1295.





[jira] [Commented] (TIKA-1382) Output document outlinks

2015-03-20 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14372141#comment-14372141
 ] 

Tyler Palsulich commented on TIKA-1382:
---

This patch looks good to commit. But, [~chrismattmann], do you have any update?

 Output document outlinks
 

 Key: TIKA-1382
 URL: https://issues.apache.org/jira/browse/TIKA-1382
 Project: Tika
  Issue Type: New Feature
  Components: cli
Affects Versions: 1.5
Reporter: Greg Padiasek
Assignee: Chris A. Mattmann
 Attachments: outlinks.patch


 Would you consider adding CLI options to output document outlinks?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TIKA-1401) occured infinite loop using tika library

2015-03-20 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich updated TIKA-1401:
--
Description: 
Hi

1. Save the file with the following content as errorfile.xml
{code}
<?xml version="1.0"?>
<!DOCTYPE billion [
<!ELEMENT billion (#PCDATA)>
<!ENTITY laugh0 

[jira] [Commented] (TIKA-1402) Insert chart content in PPTX, the graph information cannot be extracted

2015-03-20 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14372175#comment-14372175
 ] 

Tyler Palsulich commented on TIKA-1402:
---

Still not extracted with Tika 1.8-SNAPSHOT.

 Insert chart content in PPTX, the graph information cannot be extracted
 ---

 Key: TIKA-1402
 URL: https://issues.apache.org/jira/browse/TIKA-1402
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.5
 Environment: win7
Reporter: Ben Gao
 Attachments: bug.pptx


 I inserted a chart in bug.pptx, the chart contains AAA, BBB, CCC, DDD, 1,2,3 
 and other information, graph information cannot be extracted





[jira] [Resolved] (TIKA-1408) Fix version for tikadotnet to be tracked along with trunk and release version

2015-03-20 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich resolved TIKA-1408.
---
Resolution: Fixed

Fixed in r1668165. Will make a note in the release notes now.

 Fix version for tikadotnet to be tracked along with trunk and release version
 -

 Key: TIKA-1408
 URL: https://issues.apache.org/jira/browse/TIKA-1408
 Project: Tika
  Issue Type: Bug
  Components: packaging
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
 Fix For: 1.8


 As reported by [~thaichat04] the tikadotnet versioning doesn't match up with 
 trunk. This is because we aren't releasing this code yet and it's not part of 
 the pom.xml file.





[jira] [Closed] (TIKA-1398) Dependency in tika-parsers 1.5 contains variables

2015-03-20 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich closed TIKA-1398.
-
Resolution: Not a Problem

Closing as Not a Problem. If you have a project which fails to build because of 
this, please reopen! If you'd like an example of using Tika as a Maven 
dependency in an external project, please see 
[here|https://github.com/tpalsulich/phone_numbers].

 Dependency in tika-parsers 1.5 contains variables
 -

 Key: TIKA-1398
 URL: https://issues.apache.org/jira/browse/TIKA-1398
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.5
Reporter: Eddie Olsson

 In tika-parsers, the dependency for tika-core contains two variables that 
 resolve differently depending on the local project in which it is used, thus 
 breaking the dependency.
 From org/apache/tika/tika-parsers/1.5/tika-parsers-1.5.pom:
 <dependency>
   <groupId>${project.groupId}</groupId>
   <artifactId>tika-core</artifactId>
   <version>${project.version}</version>
 </dependency>





[jira] [Commented] (TIKA-1401) occured infinite loop using tika library

2015-03-20 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14372172#comment-14372172
 ] 

Tyler Palsulich commented on TIKA-1401:
---

Still loops infinitely with Tika 1.8-SNAPSHOT.
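
For what it's worth, a common guard against this kind of entity-expansion input is to refuse DOCTYPE declarations at the parser level. A minimal sketch, assuming the JDK's built-in Xerces-based SAX parser; the class name DoctypeGuardSketch is hypothetical, and this is not necessarily how Tika handles (or should handle) the issue:

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, for illustration only.
public class DoctypeGuardSketch {

    // Parses the XML and discards all events. With the feature below set,
    // any document that contains a DOCTYPE is rejected with a SAXException
    // before its entities can be expanded.
    public static void parse(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        // Standard JAXP switch that enables the implementation's processing limits.
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        // Xerces-specific feature; other parser implementations may not
        // recognize it and would throw from setFeature instead.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        SAXParser parser = factory.newSAXParser();
        parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler());
    }
}
```

Rejecting DOCTYPE outright is the bluntest option; it also refuses legitimate documents that declare a DTD, so it may be too strict for a general-purpose detector.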

 occured infinite loop using tika library
 

 Key: TIKA-1401
 URL: https://issues.apache.org/jira/browse/TIKA-1401
 Project: Tika
  Issue Type: Bug
  Components: detector
Affects Versions: 1.5
Reporter: Robin.Hwang

 Hi
 1. Save the file with the following content as errorfile.xml
 {code}
 <?xml version="1.0"?>
 <!DOCTYPE billion [
 <!ELEMENT billion (#PCDATA)>
 <!ENTITY laugh0 
 

[jira] [Updated] (TIKA-1405) Uppercase content detected as Estonian

2015-03-20 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich updated TIKA-1405:
--
Summary: Uppercase content detected as Estonian  (was: German content 
detected as French)

 Uppercase content detected as Estonian
 --

 Key: TIKA-1405
 URL: https://issues.apache.org/jira/browse/TIKA-1405
 Project: Tika
  Issue Type: Bug
  Components: languageidentifier
Affects Versions: 1.4
 Environment: Linux
Reporter: Zaheer Beig
  Labels: newbie

 Hi,
 We are using Apache Tika 1.4 for document conversion to text and language 
 detection in one of our projects. We are facing the issue below with language 
 detection:
 1. When the text is in all UPPER CASE, even though the language is English, 
 it gets detected as Estonian.
 Any update on this will be very helpful.





[jira] [Commented] (TIKA-1406) Problems in TXT encoding detection

2015-03-20 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14372183#comment-14372183
 ] 

Tyler Palsulich commented on TIKA-1406:
---

Can someone familiar with the UniversalEncodingDetector take a look at this 
patch? Sorry no one ever got back to you, [~almson]! Do you have a couple of 
test files which demonstrate these issues?

 Problems in TXT encoding detection
 --

 Key: TIKA-1406
 URL: https://issues.apache.org/jira/browse/TIKA-1406
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.6, 1.7
Reporter: Aleksandr Dubinsky
 Attachments: 0001-fix-TXT-encoding-detection.patch

   Original Estimate: 0h
  Remaining Estimate: 0h

 The detection of TXT file encoding often makes mistakes. Two things can 
 improve the detection in many cases:
 - Increase the lookahead from 16k bytes to a larger number, such as 128k or 
 larger.
 - Improve on the brain-dead heuristic that if a file doesn't have a \r then 
 it must be ISO 8859-1(5) instead of Windows-1252. (For one, it mis-detects 
 files that don't have any newlines.)
 A remaining problem that doesn't have an immediate solution is the frequent 
 mis-detection of Windows-1252 as Shift-JIS. A flag to forbid Shift-JIS is 
 desirable.
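
 One possible alternative to the \r heuristic, sketched below, is to look at the bytes themselves: 0x80-0x9F are C1 control codes in ISO 8859-1 but mostly printable characters (curly quotes, the euro sign, dashes) in Windows-1252. This is only an illustration under that assumption, not what the attached patch does; the class and method names are hypothetical.

```java
// Hypothetical helper, for illustration only.
public class Latin1VsCp1252Sketch {

    // Returns "windows-1252" if any byte falls in 0x80-0x9F, otherwise
    // "ISO-8859-1". Newlines play no part, so files without any line
    // breaks are treated the same as any other file.
    public static String guess(byte[] data) {
        for (byte b : data) {
            int u = b & 0xFF; // convert the signed byte to 0..255
            if (u >= 0x80 && u <= 0x9F) {
                return "windows-1252";
            }
        }
        return "ISO-8859-1";
    }
}
```

 Text with no bytes in that range decodes identically under both charsets, so the fallback label makes no difference to the extracted text.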





[jira] [Resolved] (TIKA-1416) Refactor Translator Exception Handling

2015-03-20 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich resolved TIKA-1416.
---
Resolution: Fixed

Fixed in r1668166. {{Translator.translate()}} can now throw a TikaException or 
an IOException.

 Refactor Translator Exception Handling
 --

 Key: TIKA-1416
 URL: https://issues.apache.org/jira/browse/TIKA-1416
 Project: Tika
  Issue Type: Bug
  Components: translation
Reporter: Tyler Palsulich
 Fix For: 1.8


 `Translator.translate()` currently throws `Exception`. We should make it more 
 specific. The only real limitation comes from MicrosoftTranslator -- the 
 library used throws `Exception`, but that shouldn't mean Tika does too.
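
 The narrowing described above could look roughly like the following sketch. The nested TikaException is a stand-in for org.apache.tika.exception.TikaException so the example compiles on its own, and BroadClient is a hypothetical placeholder for a third-party library that declares the broad Exception; neither is the actual code committed for this issue.

```java
import java.io.IOException;

// Hypothetical sketch, for illustration only.
public class TranslatorExceptionSketch {

    // Stand-in for org.apache.tika.exception.TikaException.
    public static class TikaException extends Exception {
        public TikaException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    // Hypothetical third-party client whose API declares "throws Exception".
    public interface BroadClient {
        String call(String text) throws Exception;
    }

    // Exposes only TikaException and IOException to callers, even though
    // the underlying client may throw anything.
    public static String translate(BroadClient client, String text)
            throws TikaException, IOException {
        try {
            return client.call(text);
        } catch (IOException e) {
            throw e; // I/O failures pass through unchanged
        } catch (Exception e) {
            // Everything else, including runtime exceptions, is wrapped
            // with the original kept as the cause.
            throw new TikaException("Error with translation", e);
        }
    }
}
```

 Callers can then catch the two specific types instead of a blanket Exception, and the original failure remains available through getCause().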




