from:"Tim Allison \(JIRA\)"

[jira] [Created] (TIKA-1124) Nested documents not extracted if a PDF file is in the chain

2013-05-23 Thread Tim Allison (JIRA)

Tim Allison created TIKA-1124:
-

 Summary: Nested documents not extracted if a PDF file is in the 
chain
 Key: TIKA-1124
 URL: https://issues.apache.org/jira/browse/TIKA-1124
 Project: Tika
  Issue Type: Bug
  Components: general
Affects Versions: 1.3
Reporter: Tim Allison
Priority: Minor


Tika 1.3 is not able to get attachments from the attached PDF.
The trunk is able to get attachments from the PDF.  However, if that PDF is 
then embedded in another document, the docs embedded in the PDF are not 
extracted.

I'm not sure of a solution, but I found two things that might help with the 
diagnosis:
1) If you modify the code in PDFParser so that it doesn't wrap the handler in a 
BodyContentHandler, everything works (in trunk).
2) If you modify BodyContentHandler to use my toy 
SimpleBodyMatchingContentHandler, the problem is also solved.

The cause may be in the MatchingContentHandler.
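
For anyone trying to reproduce the diagnosis, here is a minimal sketch of one way a simple body-matching handler could work: it counts element depth from the first body element instead of matching paths, so a second html/body pair emitted by a nested document is still passed through. This is an illustration only. BodyOnlyTextHandler and extractBodyText are hypothetical names; this is neither Tika's BodyContentHandler or MatchingContentHandler nor the toy handler mentioned above.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical illustration; not part of Tika. Collects text found inside body. */
class BodyOnlyTextHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private int bodyDepth = 0; // greater than 0 while inside the outermost body

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Start counting at the first body; afterwards count every element,
        // including any nested html/body emitted by an embedded document.
        if (bodyDepth > 0 || "body".equals(qName)) {
            bodyDepth++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (bodyDepth > 0) {
            bodyDepth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (bodyDepth > 0) {
            text.append(ch, start, length);
        }
    }

    String text() {
        return text.toString();
    }

    /** Parses the given XHTML string and returns only the text inside body. */
    static String extractBodyText(String xhtml) {
        try {
            BodyOnlyTextHandler handler = new BodyOnlyTextHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.text();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The depth counter is the whole design: text from the head is skipped, while everything below the outermost body is kept regardless of how deeply documents are nested.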

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (TIKA-1124) Nested documents not extracted if a PDF file is in the chain

2013-05-23 Thread Tim Allison (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1124:
--

Attachment: pdf_attachment_issues.zip

outer.docx contains the attached.pdf, which itself contains an attachment.  Toy 
examples of avoiding the use of MatchingContentHandler also attached.

 Nested documents not extracted if a PDF file is in the chain
 

 Key: TIKA-1124
 URL: https://issues.apache.org/jira/browse/TIKA-1124
 Project: Tika
  Issue Type: Bug
  Components: general
Affects Versions: 1.3
Reporter: Tim Allison
Priority: Minor
 Attachments: pdf_attachment_issues.zip


 Tika 1.3 is not able to get attachments from the attached PDF.
 The trunk is able to get attachments from the PDF.  However, if that PDF is 
 then embedded in another document, the docs embedded in the PDF are not 
 extracted.
 I'm not sure of a solution, but I found two things that might help with the 
 diagnosis:
 1) If you modify the code in PDFParser so that it doesn't wrap the handler in 
 a BodyContentHandler, everything works (in trunk).
 2) If you modify BodyContentHandler to use my toy 
 SimpleBodyMatchingContentHandler, the problem is also solved.
 The cause may be in the MatchingContentHandler.

[jira] [Commented] (TIKA-1130) .docx text extract leaves out some portions of text

2013-06-05 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13676495#comment-13676495
 ] 

Tim Allison commented on TIKA-1130:
---

I've submitted a patch to POI for this 
(https://issues.apache.org/bugzilla/show_bug.cgi?id=54849).  I haven't gotten 
any feedback after my initial trivial fix.  The issue is that sdt/content 
controls can stand alone as the equivalent of a paragraph or table.  POI isn't 
currently picking those up. 

 .docx text extract leaves out some portions of text
 ---

 Key: TIKA-1130
 URL: https://issues.apache.org/jira/browse/TIKA-1130
 Project: Tika
  Issue Type: Bug
Affects Versions: 1.2, 1.3
 Environment: OpenJDK x86_64
Reporter: Daniel Gibby
Priority: Critical
 Attachments: Resume 6.4.13.docx


 When parsing a Microsoft Word .docx 
 (application/vnd.openxmlformats-officedocument.wordprocessingml.document), 
 certain portions of text remain unextracted.
 I have attached a .docx file that can be tested against. The 'gray' portions 
 of text are what are not extracted, while the darker colored text extracts 
 fine.
 Looking at the document.xml portion of the .docx zip file shows the text is 
 all there.

[jira] [Commented] (TIKA-1130) .docx text extract leaves out some portions of text

2013-06-07 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13677957#comment-13677957
 ] 

Tim Allison commented on TIKA-1130:
---

I'll try to submit the Tika portion of the POI-54849 patch by early next week 
in case anyone wants to apply both patches at home.

 .docx text extract leaves out some portions of text
 ---

 Key: TIKA-1130
 URL: https://issues.apache.org/jira/browse/TIKA-1130
 Project: Tika
  Issue Type: Bug
Affects Versions: 1.2, 1.3
 Environment: OpenJDK x86_64
Reporter: Daniel Gibby
Priority: Critical
 Attachments: Resume 6.4.13.docx


 When parsing a Microsoft Word .docx 
 (application/vnd.openxmlformats-officedocument.wordprocessingml.document), 
 certain portions of text remain unextracted.
 I have attached a .docx file that can be tested against. The 'gray' portions 
 of text are what are not extracted, while the darker colored text extracts 
 fine.
 Looking at the document.xml portion of the .docx zip file shows the text is 
 all there.

[jira] [Commented] (TIKA-1132) Parsing some XLS documents hangs entire JVM, requires kill -9

2013-06-13 Thread Tim Allison (JIRA)

[
https://issues.apache.org/jira/browse/TIKA-1132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13682628#comment-13682628
]

Tim Allison commented on TIKA-1132:
---

Tika gui took longer than I was willing to wait, too. tika.parseToString()
returned a value in about 30 seconds. As you both suggested, the fraction
formatter was likely the culprit. I just submitted a patch to poi 54686.

Parsing some XLS documents hangs entire JVM, requires kill -9
-

Key: TIKA-1132
URL: https://issues.apache.org/jira/browse/TIKA-1132
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.2, 1.3
Environment: Linux Suse:
java version 1.7.0
Java(TM) SE Runtime Environment (build 1.7.0-b147)
Java HotSpot(TM) 64-Bit Server VM (build 21.0-b17, mixed mode)
OSX 10.8.3:
java version 1.7.0_06
Java(TM) SE Runtime Environment (build 1.7.0_06-b24)
Java HotSpot(TM) 64-Bit Server VM (build 23.2-b09, mixed mode)
Reporter: Ryan Krueger
Fix For: 1.1

Attachments: mod3.xlsx, mod.xls

Some XLS documents hang the entire JVM. A control-C or regular kill won't
stop the JVM, a kill -9 is required.
We're running within an email server application parsing documents to extract
text of all attachments. When we hit a message with the affected attachment
the entire JVM hangs and we mark the message to skip extracting the text from
the affected message the next attempt. Unfortunately, it kills all email
processing on the server until the internal watchdogs kill -9 the application.
We have seen the issue for several months with different documents, but they
are always Excel files. Some get complaints from Excel when opening but not
all.
In addition to experiencing the problem on our Linux servers I have tested on
OSX and experienced the same problems. I ran the Tika UI and select the
affected file or run the CLI. The problem is the same.
Tested with java -jar /path/to/tika-app-1.3.jar -t /path/to/file.xls
When running on multi-CPU machines there are two threads running at 100%
every time.
I have attached a document that triggers the error.
I have tested with 1.2 and 1.3 with the same result. Running 1.1 the text is
accurately extracted.

[jira] [Commented] (TIKA-1130) .docx text extract leaves out some portions of text

2013-06-24 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13692110#comment-13692110
 ] 

Tim Allison commented on TIKA-1130:
---

Nick,  
  I think I have to make modifications to Tika to execute the new SDT 
components.  Should my patch be to Tika trunk?

 .docx text extract leaves out some portions of text
 ---

 Key: TIKA-1130
 URL: https://issues.apache.org/jira/browse/TIKA-1130
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.2, 1.3
 Environment: OpenJDK x86_64
Reporter: Daniel Gibby
Priority: Critical
 Attachments: Resume 6.4.13.docx


 When parsing a Microsoft Word .docx 
 (application/vnd.openxmlformats-officedocument.wordprocessingml.document), 
 certain portions of text remain unextracted.
 I have attached a .docx file that can be tested against. The 'gray' portions 
 of text are what are not extracted, while the darker colored text extracts 
 fine.
 Looking at the document.xml portion of the .docx zip file shows the text is 
 all there.

[jira] [Commented] (TIKA-1130) .docx text extract leaves out some portions of text

2013-06-24 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13692517#comment-13692517
 ] 

Tim Allison commented on TIKA-1130:
---

Maven proxy setting in my settings.xml file is working for grabbing 
dependencies, but the proxy info isn't being transferred to testUrlOnly's 
url.openStream() in MimeDetectionTest.  The proxy props appear correctly in the 
surefire-report for MimeDetectionTest, but the proxy settings are null when I 
insert this into testUrlOnly:

System.out.println("HOST: " + System.getProperty("http.proxyHost"));
System.out.println("PORT: " + System.getProperty("http.proxyPort"));

Will likely find the answer as soon as I post this...
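
A minimal sketch for checking what the test JVM actually sees. http.proxyHost and http.proxyPort are the standard Java networking properties; they are null unless something sets them for that JVM (a -D flag or System.setProperty). ProxyPropsCheck and the proxy.example.org / 8080 values are placeholders for illustration, not settings from this build.

```java
/** Hypothetical helper for illustration; prints the JVM-wide HTTP proxy properties. */
class ProxyPropsCheck {

    /** Returns the proxy host and port as this JVM sees them ("null" when unset). */
    static String describeProxy() {
        return "HOST: " + System.getProperty("http.proxyHost")
                + ", PORT: " + System.getProperty("http.proxyPort");
    }

    public static void main(String[] args) {
        // Whatever the JVM was started with, e.g. via -Dhttp.proxyHost=... on the command line.
        System.out.println(describeProxy());

        // Setting the properties programmatically; the values are placeholders.
        System.setProperty("http.proxyHost", "proxy.example.org");
        System.setProperty("http.proxyPort", "8080");
        System.out.println(describeProxy());
    }
}
```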

 .docx text extract leaves out some portions of text
 ---

 Key: TIKA-1130
 URL: https://issues.apache.org/jira/browse/TIKA-1130
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.2, 1.3
 Environment: OpenJDK x86_64
Reporter: Daniel Gibby
Priority: Critical
 Attachments: Resume 6.4.13.docx


 When parsing a Microsoft Word .docx 
 (application/vnd.openxmlformats-officedocument.wordprocessingml.document), 
 certain portions of text remain unextracted.
 I have attached a .docx file that can be tested against. The 'gray' portions 
 of text are what are not extracted, while the darker colored text extracts 
 fine.
 Looking at the document.xml portion of the .docx zip file shows the text is 
 all there.

[jira] [Commented] (TIKA-973) PDF form data isn't included in extracted content.

2013-06-25 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13693068#comment-13693068
 ] 

Tim Allison commented on TIKA-973:
--

Will submit patch and tests by end of the week.

 PDF form data isn't included in extracted content.
 --

 Key: TIKA-973
 URL: https://issues.apache.org/jira/browse/TIKA-973
 Project: Tika
  Issue Type: Bug
  Components: general
Affects Versions: 1.2
Reporter: Michael Graessle
Priority: Minor

 When extracting content from PDFs, PDF form data isn't extracted. 
 The following code extracts this data via PDF box, but it seems like 
 something Tika should be doing.
 PDDocumentCatalog docCatalog = load.getDocumentCatalog();
 if (docCatalog != null) {
   PDAcroForm acroForm = docCatalog.getAcroForm();
   if (acroForm != null) {
     @SuppressWarnings("unchecked")
     List<PDField> fields = acroForm.getFields();
     if (fields != null && fields.size() > 0) {
       documentContent.append(" ");
       for (PDField field : fields) {
         if (field.getValue() != null) {
           documentContent.append(field.getValue());
           documentContent.append(" ");
         }
       }
     }
   }
 }

[jira] [Updated] (TIKA-1130) .docx text extract leaves out some portions of text

2013-06-25 Thread Tim Allison (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1130:
--

Attachment: TIKA-1130.patch

Ray's initial test restored after POI-55142 was committed.  Thank you, Nick!

 .docx text extract leaves out some portions of text
 ---

 Key: TIKA-1130
 URL: https://issues.apache.org/jira/browse/TIKA-1130
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.2, 1.3
 Environment: OpenJDK x86_64
Reporter: Daniel Gibby
Priority: Critical
 Attachments: Resume 6.4.13.docx, TIKA-1130.patch, TIKA-1130.patch


 When parsing a Microsoft Word .docx 
 (application/vnd.openxmlformats-officedocument.wordprocessingml.document), 
 certain portions of text remain unextracted.
 I have attached a .docx file that can be tested against. The 'gray' portions 
 of text are what are not extracted, while the darker colored text extracts 
 fine.
 Looking at the document.xml portion of the .docx zip file shows the text is 
 all there.

[jira] [Updated] (TIKA-973) PDF form data isn't included in extracted content.

2013-06-26 Thread Tim Allison (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-973:
-

Attachment: TIKA-973-patch.tar.gz

Patch attached.  Dumps contents of pdf forms at end of document.  

AcroForm field name metadata is in attribute values.  Basic format is <ol>.

Let me know how this looks.

Thank you, Ben Litchfield, for org.apache.pdfbox.examples.fdf.PrintFields


 PDF form data isn't included in extracted content.
 --

 Key: TIKA-973
 URL: https://issues.apache.org/jira/browse/TIKA-973
 Project: Tika
  Issue Type: Bug
  Components: general
Affects Versions: 1.2
Reporter: Michael Graessle
Priority: Minor
 Attachments: TIKA-973-patch.tar.gz


 When extracting content from PDFs, PDF form data isn't extracted. 
 The following code extracts this data via PDF box, but it seems like 
 something Tika should be doing.
 PDDocumentCatalog docCatalog = load.getDocumentCatalog();
 if (docCatalog != null) {
   PDAcroForm acroForm = docCatalog.getAcroForm();
   if (acroForm != null) {
     @SuppressWarnings("unchecked")
     List<PDField> fields = acroForm.getFields();
     if (fields != null && fields.size() > 0) {
       documentContent.append(" ");
       for (PDField field : fields) {
         if (field.getValue() != null) {
           documentContent.append(field.getValue());
           documentContent.append(" ");
         }
       }
     }
   }
 }

[jira] [Commented] (TIKA-973) PDF form data isn't included in extracted content.

2013-06-27 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13694774#comment-13694774
 ] 

Tim Allison commented on TIKA-973:
--

Agree on both.  Also would appreciate feedback on what the output should be.  
The current code extracts this unseemly xhtml:

<div class="acroform">
 <ol>   <li partialName="form1[0]" fullName="form1[0]"/>
 <ol>   <li partialName="#subform[6]" fullName="form1[0].#subform[6]"/>
 <li partialName="MiddleInitial[0]" fullName="form1[0].#subform[6].MiddleInitial[0]" altName="Enter Middle Initial (MI)">X</li>
 <li partialName="FamilyName[0]" fullName="form1[0].#subform[6].FamilyName[0]" altName="Section 1. Employee Information and Attestation.  Family Name (Last Name)">Doe</li>
 <li partialName="GivenName[0]" fullName="form1[0].#subform[6].GivenName[0]" altName="Given Name (First Name)">John</li>
 <li partialName="OtherNamesUsed[0]" fullName="form1[0].#subform[6].OtherNamesUsed[0]" altName="Maiden Name">Mr. Doe</li>
 <li partialName="StreetNumberName[0]" fullName="form1[0].#subform[6].StreetNumberName[0]" altName=" Street Number and Name">123 Main St.</li>


...

Another idea I had was to include the partialName in the contents and not fill 
out the attrs:
<li>StreetNumberName[0]: 123 Main St</li>

More unit tests on way...
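
To make the second idea concrete, here is a minimal sketch of how one such list item could be built, with the partial name written into the element content rather than into attributes. AcroFormListItem, toListItem and escape are hypothetical names for illustration; this is not the code in the attached patch.

```java
/** Hypothetical helper for illustration; formats one form field as an XHTML list item. */
class AcroFormListItem {

    /** Escapes the characters that are unsafe in XHTML element content. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Builds an li element of the form: partialName, a colon and a space, then the value. */
    static String toListItem(String partialName, String value) {
        return "<li>" + escape(partialName) + ": " + escape(value) + "</li>";
    }

    public static void main(String[] args) {
        System.out.println(toListItem("StreetNumberName[0]", "123 Main St"));
    }
}
```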

 PDF form data isn't included in extracted content.
 --

 Key: TIKA-973
 URL: https://issues.apache.org/jira/browse/TIKA-973
 Project: Tika
  Issue Type: Bug
  Components: general
Affects Versions: 1.2
Reporter: Michael Graessle
Priority: Minor
 Attachments: TIKA-973-patch.tar.gz


 When extracting content from PDFs, PDF form data isn't extracted. 
 The following code extracts this data via PDF box, but it seems like 
 something Tika should be doing.
 PDDocumentCatalog docCatalog = load.getDocumentCatalog();
 if (docCatalog != null) {
   PDAcroForm acroForm = docCatalog.getAcroForm();
   if (acroForm != null) {
     @SuppressWarnings("unchecked")
     List<PDField> fields = acroForm.getFields();
     if (fields != null && fields.size() > 0) {
       documentContent.append(" ");
       for (PDField field : fields) {
         if (field.getValue() != null) {
           documentContent.append(field.getValue());
           documentContent.append(" ");
         }
       }
     }
   }
 }

[jira] [Updated] (TIKA-973) PDF form data isn't included in extracted content.

2013-06-27 Thread Tim Allison (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-973:
-

Attachment: i-9_screenshot.png

Screenshot attached.  Thanks again to: 
http://benlitchfield.sys-con.com/node/48543?page=0,1 for the code example and 
example doc.

The middle ground that you recommend makes sense.



 PDF form data isn't included in extracted content.
 --

 Key: TIKA-973
 URL: https://issues.apache.org/jira/browse/TIKA-973
 Project: Tika
  Issue Type: Bug
  Components: general
Affects Versions: 1.2
Reporter: Michael Graessle
Priority: Minor
 Attachments: i-9_screenshot.png, TIKA-973-patch.tar.gz


 When extracting content from PDFs, PDF form data isn't extracted. 
 The following code extracts this data via PDF box, but it seems like 
 something Tika should be doing.
 PDDocumentCatalog docCatalog = load.getDocumentCatalog();
 if (docCatalog != null) {
   PDAcroForm acroForm = docCatalog.getAcroForm();
   if (acroForm != null) {
     @SuppressWarnings("unchecked")
     List<PDField> fields = acroForm.getFields();
     if (fields != null && fields.size() > 0) {
       documentContent.append(" ");
       for (PDField field : fields) {
         if (field.getValue() != null) {
           documentContent.append(field.getValue());
           documentContent.append(" ");
         }
       }
     }
   }
 }

[jira] [Created] (TIKA-1139) Modify Tika-1129 to test against a local file

2013-07-02 Thread Tim Allison (JIRA)

Tim Allison created TIKA-1139:
-

 Summary: Modify Tika-1129 to test against a local file
 Key: TIKA-1139
 URL: https://issues.apache.org/jira/browse/TIKA-1139
 Project: Tika
  Issue Type: Improvement
  Components: mime
Affects Versions: 1.3
Reporter: Tim Allison
Priority: Trivial
 Fix For: 1.5


Would prefer to avoid requiring a network call in test unless necessary.  The 
website that was causing the initial issue (Tika-367) has modified their 
content and it now causes no problems for Tika 0.5 (the version against which 
the original issue was raised).

I simplified Tika-367's evilhtml.html (took out all content and truncated most 
of it).  The modified test file causes the original problem in Tika 0.5, but it 
causes no problems in Tika 0.6 or trunk.
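
A minimal sketch of the idea, assuming the test only needs a URL it can call openStream() on: a file: URL is read the same way as an http: one but needs no network. LocalFixtureUrl, its method names and the fixture content are hypothetical, not taken from the patch; in a real test the fixture would more likely be a classpath resource than a temp file.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper for illustration; reads a test fixture through a local file: URL. */
class LocalFixtureUrl {

    /** Reads every byte behind the URL; the same call works for file: and http: URLs. */
    static byte[] readAll(URL url) {
        try (InputStream in = url.openStream()) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes stand-in fixture content to a temp file and returns its file: URL. */
    static URL writeFixture(String content) {
        try {
            Path file = Files.createTempFile("fixture", ".html");
            file.toFile().deleteOnExit();
            Files.write(file, content.getBytes(StandardCharsets.UTF_8));
            return file.toUri().toURL();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        URL url = writeFixture("<html><body>fixture</body></html>");
        System.out.println(url.getProtocol() + ": " + readAll(url).length + " bytes");
    }
}
```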

[jira] [Updated] (TIKA-1139) Modify Tika-1129 to test against a local file

2013-07-02 Thread Tim Allison (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1139:
--

Attachment: TIKA-1139.patch.tar.gz

Patch attached.

 Modify Tika-1129 to test against a local file
 -

 Key: TIKA-1139
 URL: https://issues.apache.org/jira/browse/TIKA-1139
 Project: Tika
  Issue Type: Improvement
  Components: mime
Affects Versions: 1.3
Reporter: Tim Allison
Priority: Trivial
 Fix For: 1.5

 Attachments: TIKA-1139.patch.tar.gz


 Would prefer to avoid requiring a network call in test unless necessary.  The 
 website that was causing the initial issue (Tika-367) has modified their 
 content and it now causes no problems for Tika 0.5 (the version against which 
 the original issue was raised).
 I simplified Tika-367's evilhtml.html (took out all content and truncated 
 most of it).  The modified test file causes the original problem in Tika 0.5, 
 but it causes no problems in Tika 0.6 or trunk.

[jira] [Updated] (TIKA-973) PDF form data isn't included in extracted content.

2013-07-02 Thread Tim Allison (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-973:
-

Attachment: TIKA-973.patch.tar.gz

Middle-road change made.  The alternate name is an attribute and partial name 
is added to content followed by a ":".

I also added a few more tests.

 PDF form data isn't included in extracted content.
 --

 Key: TIKA-973
 URL: https://issues.apache.org/jira/browse/TIKA-973
 Project: Tika
  Issue Type: Bug
  Components: general
Affects Versions: 1.2
Reporter: Michael Graessle
Priority: Minor
 Attachments: i-9_screenshot.png, TIKA-973-patch.tar.gz, 
 TIKA-973.patch.tar.gz


 When extracting content from PDFs, PDF form data isn't extracted. 
 The following code extracts this data via PDF box, but it seems like 
 something Tika should be doing.
 PDDocumentCatalog docCatalog = load.getDocumentCatalog();
 if (docCatalog != null) {
   PDAcroForm acroForm = docCatalog.getAcroForm();
   if (acroForm != null) {
     @SuppressWarnings("unchecked")
     List<PDField> fields = acroForm.getFields();
     if (fields != null && fields.size() > 0) {
       documentContent.append(" ");
       for (PDField field : fields) {
         if (field.getValue() != null) {
           documentContent.append(field.getValue());
           documentContent.append(" ");
         }
       }
     }
   }
 }

[jira] [Commented] (TIKA-1130) .docx text extract leaves out some portions of text

2013-07-02 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13698009#comment-13698009
 ] 

Tim Allison commented on TIKA-1130:
---

That was fast.  Thank you!

 .docx text extract leaves out some portions of text
 ---

 Key: TIKA-1130
 URL: https://issues.apache.org/jira/browse/TIKA-1130
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.2, 1.3
 Environment: OpenJDK x86_64
Reporter: Daniel Gibby
Priority: Critical
 Fix For: 1.5

 Attachments: Resume 6.4.13.docx, TIKA-1130.patch, TIKA-1130.patch


 When parsing a Microsoft Word .docx 
 (application/vnd.openxmlformats-officedocument.wordprocessingml.document), 
 certain portions of text remain unextracted.
 I have attached a .docx file that can be tested against. The 'gray' portions 
 of text are what are not extracted, while the darker colored text extracts 
 fine.
 Looking at the document.xml portion of the .docx zip file shows the text is 
 all there.

[jira] [Commented] (TIKA-1130) .docx text extract leaves out some portions of text

2013-07-10 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13704699#comment-13704699
 ] 

Tim Allison commented on TIKA-1130:
---

Haven't had a chance to build from trunk today, but the latest attachment seems 
to work on my local build of Tika.  Which portions are missing?

 .docx text extract leaves out some portions of text
 ---

 Key: TIKA-1130
 URL: https://issues.apache.org/jira/browse/TIKA-1130
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.2, 1.3
 Environment: OpenJDK x86_64
Reporter: Daniel Gibby
Priority: Critical
 Fix For: 1.5

 Attachments: OwenResume.docx, Resume 6.4.13.docx, tee internal 
 resme.docx, TIKA-1130.patch, TIKA-1130.patch


 When parsing a Microsoft Word .docx 
 (application/vnd.openxmlformats-officedocument.wordprocessingml.document), 
 certain portions of text remain unextracted.
 I have attached a .docx file that can be tested against. The 'gray' portions 
 of text are what are not extracted, while the darker colored text extracts 
 fine.
 Looking at the document.xml portion of the .docx zip file shows the text is 
 all there.

[jira] [Commented] (TIKA-1130) .docx text extract leaves out some portions of text

2013-07-10 Thread Tim Allison (JIRA)

[
https://issues.apache.org/jira/browse/TIKA-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13704711#comment-13704711
]

Tim Allison commented on TIKA-1130:
---

Tested with freshly built trunk, and the text looks good to me. Let me know if
you don't find the same.

There is a case that this bug fix didn't cover: if a content control takes up
an entire table cell/is the equivalent of a table cell, then that content is
not currently being pulled by POI.

That is on my todo list.

.docx text extract leaves out some portions of text
---

Key: TIKA-1130
URL: https://issues.apache.org/jira/browse/TIKA-1130
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.2, 1.3
Environment: OpenJDK x86_64
Reporter: Daniel Gibby
Priority: Critical
Fix For: 1.5

Attachments: OwenResume.docx, Resume 6.4.13.docx, tee internal
resme.docx, TIKA-1130.patch, TIKA-1130.patch

When parsing a Microsoft Word .docx
(application/vnd.openxmlformats-officedocument.wordprocessingml.document),
certain portions of text remain unextracted.
I have attached a .docx file that can be tested against. The 'gray' portions
of text are what are not extracted, while the darker colored text extracts
fine.
Looking at the document.xml portion of the .docx zip file shows the text is
all there.

[jira] [Created] (TIKA-1150) Extract text from textbox in XLSX

2013-07-22 Thread Tim Allison (JIRA)

Tim Allison created TIKA-1150:
-

 Summary: Extract text from textbox in XLSX
 Key: TIKA-1150
 URL: https://issues.apache.org/jira/browse/TIKA-1150
 Project: Tika
  Issue Type: New Feature
  Components: parser
Affects Versions: 1.4
Reporter: Tim Allison
Priority: Minor


Underlying POI library doesn't appear to support easy extraction of text from 
text boxes in XLSX files. Personal preference would be to wait for 
modifications in POI and then make a few small changes to Tika to run 
XSSFTextBox code.

[jira] [Updated] (TIKA-1150) Extract text from textbox in XLSX

2013-07-22 Thread Tim Allison (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1150:
--

Attachment: testEXCEL_textbox.xlsx

Simple file that shows issue.

 Extract text from textbox in XLSX
 -

 Key: TIKA-1150
 URL: https://issues.apache.org/jira/browse/TIKA-1150
 Project: Tika
  Issue Type: New Feature
  Components: parser
Affects Versions: 1.4
Reporter: Tim Allison
Priority: Minor
 Attachments: testEXCEL_textbox.xlsx


 Underlying POI library doesn't appear to support easy extraction of text from 
 text boxes in XLSX files. Personal preference would be to wait for 
 modifications in POI and then make a few small changes to Tika to run 
 XSSFTextBox code.

[jira] [Commented] (TIKA-1150) Extract text from textbox in XLSX

2013-07-23 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13716429#comment-13716429
 ] 

Tim Allison commented on TIKA-1150:
---

Duplicate of http://issues.apache.org/jira/browse/TIKA-1100. Closing this one.

 Extract text from textbox in XLSX
 -

 Key: TIKA-1150
 URL: https://issues.apache.org/jira/browse/TIKA-1150
 Project: Tika
  Issue Type: New Feature
  Components: parser
Affects Versions: 1.4
Reporter: Tim Allison
Priority: Minor
 Attachments: testEXCEL_textbox.xlsx


 Underlying POI library doesn't appear to support easy extraction of text from 
 text boxes in XLSX files. Personal preference would be to wait for 
 modifications in POI and then make a few small changes to Tika to run 
 XSSFTextBox code.

[jira] [Closed] (TIKA-1150) Extract text from textbox in XLSX

2013-07-23 Thread Tim Allison (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison closed TIKA-1150.
-

Resolution: Duplicate

 Extract text from textbox in XLSX
 -

 Key: TIKA-1150
 URL: https://issues.apache.org/jira/browse/TIKA-1150
 Project: Tika
  Issue Type: New Feature
  Components: parser
Affects Versions: 1.4
Reporter: Tim Allison
Priority: Minor
 Attachments: testEXCEL_textbox.xlsx



[jira] [Commented] (TIKA-1100) cannot extract text in text-box for Excel 2007 file(.xlsx, .xlsm)

2013-07-23 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13716432#comment-13716432
 ] 

Tim Allison commented on TIKA-1100:
---

Waiting for improvements in POI-55292.  Will make Tika-side upgrades when the 
next version of POI is released.

Reference: http://issues.apache.org/bugzilla/show_bug.cgi?id=55292

 cannot extract text in text-box for Excel 2007 file(.xlsx, .xlsm)
 -

 Key: TIKA-1100
 URL: https://issues.apache.org/jira/browse/TIKA-1100
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.3
 Environment: Windows7 64bit
Reporter: Kazuaki Matsuba

 When I launch the Tika GUI from the command line and drag and drop an .xlsx 
 file that has a text box, no text in the text box is extracted.
 When I drag and drop an .xls file, the text in the text box is extracted.


[jira] [Updated] (TIKA-1100) cannot extract text in text-box for Excel 2007 file(.xlsx, .xlsm)

2013-07-23 Thread Tim Allison (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1100:
--

Attachment: testEXCEL_textbox.xlsx

Simple example file attached for now.  Will fill out with test cases when POI 
is ready.

 cannot extract text in text-box for Excel 2007 file(.xlsx, .xlsm)
 -

 Key: TIKA-1100
 URL: https://issues.apache.org/jira/browse/TIKA-1100
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.3
 Environment: Windows7 64bit
Reporter: Kazuaki Matsuba
 Attachments: testEXCEL_textbox.xlsx



[jira] [Updated] (TIKA-792) NoSuchMethodException CTMarkupImpl.<init>(org.apache.xmlbeans.SchemaType, boolean) processing a OOXML document

2013-08-05 Thread Tim Allison (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-792:
-

Attachment: test10.docx

Example document that triggers NoSuchMethodException for CTMarkupRangeImpl, 
CTMarkupImpl and CTBookmarkRangeImpl.

 NoSuchMethodException CTMarkupImpl.<init>(org.apache.xmlbeans.SchemaType, 
 boolean) processing a OOXML document
 

 Key: TIKA-792
 URL: https://issues.apache.org/jira/browse/TIKA-792
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.0
 Environment: Linux, JDK 1.6, Jetty 8.x, Tomcat 6.x
Reporter: Torsten Krah
 Fix For: 1.2

 Attachments: test10.docx


 Parsing some OOXML documents, this stacktrace is logged many times:
 java.lang.NoSuchMethodException: org.openxmlformats.schemas.wordprocessingml.x2006.main.impl.CTMarkupImpl.<init>(org.apache.xmlbeans.SchemaType, boolean)
   at java.lang.Class.getConstructor0(Class.java:2723)
   at java.lang.Class.getDeclaredConstructor(Class.java:2002)
   at org.apache.xmlbeans.impl.schema.SchemaTypeImpl.getJavaImplConstructor2(SchemaTypeImpl.java:1749)
   at org.apache.xmlbeans.impl.schema.SchemaTypeImpl.createUnattachedSubclass(SchemaTypeImpl.java:1886)
   at org.apache.xmlbeans.impl.schema.SchemaTypeImpl.createUnattachedNode(SchemaTypeImpl.java:1875)
   at org.apache.xmlbeans.impl.schema.SchemaTypeImpl.createElementType(SchemaTypeImpl.java:1021)
   at org.apache.xmlbeans.impl.values.XmlObjectBase.create_element_user(XmlObjectBase.java:893)
   at org.apache.xmlbeans.impl.store.Xobj.getUser(Xobj.java:1657)
   at org.apache.xmlbeans.impl.store.Cur.getUser(Cur.java:2654)
   at org.apache.xmlbeans.impl.store.Cur.getObject(Cur.java:2647)
   at org.apache.xmlbeans.impl.store.Cursor._getObject(Cursor.java:995)
   at org.apache.xmlbeans.impl.store.Cursor.getObject(Cursor.java:2904)
   at org.apache.poi.xwpf.usermodel.XWPFParagraph.<init>(XWPFParagraph.java:83)
   at org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:145)
   at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:159)
   at org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:115)
   at org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:53)
   at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:180)
   at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:63)
   at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:69)
   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
 Looking at the POI code, Java is right here: there is no constructor taking a 
 SchemaType and a boolean, only one taking a SchemaType.
 My guess is that this one was missed during the upgrade to POI beta4, but 
 that is only a guess. Either way, it needs a fix :-).
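
For readers unfamiliar with this failure mode, here is a minimal sketch of how a reflective constructor lookup produces NoSuchMethodException when the requested signature does not exist. OneArgBean and ConstructorLookupDemo are hypothetical names for illustration only; they are not XMLBeans or POI classes, and String stands in for SchemaType.

```java
// Hypothetical stand-in for a generated implementation class that only
// declares a single-argument constructor.
class OneArgBean {
    OneArgBean(String type) {
    }
}

class ConstructorLookupDemo {
    // Returns true if the class declares a constructor with exactly
    // these parameter types, false if the lookup fails.
    static boolean hasConstructor(Class<?> clazz, Class<?>... paramTypes) {
        try {
            clazz.getDeclaredConstructor(paramTypes);
            return true;
        } catch (NoSuchMethodException e) {
            // Same exception type as in the reported stack trace.
            return false;
        }
    }

    public static void main(String[] args) {
        // One-argument constructor exists.
        System.out.println(hasConstructor(OneArgBean.class, String.class));
        // Two-argument (type, boolean) constructor does not.
        System.out.println(hasConstructor(OneArgBean.class, String.class, boolean.class));
    }
}
```

If the caller logs the exception and falls back to another constructor, the document still parses but the trace is printed each time, which would match the "logged many times" symptom. That fallback behaviour is an assumption here, not something confirmed in this thread.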


[jira] [Commented] (TIKA-1124) Nested documents not extracted if a PDF file is in the chain

2013-08-05 Thread Tim Allison (JIRA)

[
https://issues.apache.org/jira/browse/TIKA-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13729781#comment-13729781
]

Tim Allison commented on TIKA-1124:
---

If anyone has a chance to look into this, I'd appreciate it. I suspect
something is going awry with the recursion in the triggering documents + xpath
query in MatchingContentHandler. Thank you!

Nested documents not extracted if a PDF file is in the chain

Key: TIKA-1124
URL: https://issues.apache.org/jira/browse/TIKA-1124
Project: Tika
Issue Type: Bug
Components: general
Affects Versions: 1.3
Reporter: Tim Allison
Priority: Minor
Attachments: pdf_attachment_issues.zip

Tika 1.3 is not able to get attachments from the attached PDF.
The trunk is able to get attachments from the PDF. However, if that PDF is
then embedded in another document, the docs embedded in the PDF are not
extracted.
I'm not sure of a solution, but I found two things that might help with the
diagnosis:
1) If you modify the code in PDFParser so that it doesn't wrap the handler in
a BodyContentHandler, everything works (in trunk).
2) If you modify BodyContentHandler to use my toy
SimpleBodyMatchingContentHandler, the problem is also solved.
The cause may be in the MatchingContentHandler.

[jira] [Commented] (TIKA-1124) Nested documents not extracted if a PDF file is in the chain

2013-08-05 Thread Tim Allison (JIRA)

[
https://issues.apache.org/jira/browse/TIKA-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13729903#comment-13729903
]

Tim Allison commented on TIKA-1124:
---

OK, I think I figured this out. AbstractOOXML includes contents from embedded
documents before calling handler.endDocument().
PDFParser, however, calls handler.endDocument() and then tries to append
content from embedded documents.
I think this means that the parent handler sees an end of body and therefore
does not process the contents of the embedded document.

Trivial fix: move handler.endDocument() out of PDF2XHTML and call it after
processing the embedded documents in PDFParser.
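
The ordering problem described above can be sketched with the JDK's SAX classes alone. This is an illustration under assumptions, not Tika code: BodyOnlyHandler is a hypothetical stand-in for a body-matching wrapper that keeps character data only while a body element is open, and the demo closes body explicitly where the real parser would do so as part of ending the document.

```java
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for a body-matching wrapper: it keeps character
// data only while a <body> element is open.
class BodyOnlyHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private int bodyDepth = 0;

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        if ("body".equals(local)) {
            bodyDepth++;
        }
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if ("body".equals(local)) {
            bodyDepth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (bodyDepth > 0) {
            text.append(ch, start, length);
        }
    }

    String text() {
        return text.toString();
    }
}

class EndDocumentOrderDemo {
    private static void chars(ContentHandler h, String s) throws SAXException {
        h.characters(s.toCharArray(), 0, s.length());
    }

    // embeddedFirst == true:  embedded content is emitted before </body>.
    // embeddedFirst == false: </body> is closed first, then embedded content.
    static String run(boolean embeddedFirst) {
        BodyOnlyHandler h = new BodyOnlyHandler();
        try {
            h.startDocument();
            h.startElement("", "body", "body", new AttributesImpl());
            chars(h, "container text;");
            if (embeddedFirst) {
                chars(h, "embedded text;");
            }
            h.endElement("", "body", "body");
            if (!embeddedFirst) {
                chars(h, "embedded text;");
            }
            h.endDocument();
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return h.text();
    }

    public static void main(String[] args) {
        System.out.println(run(true));   // container text;embedded text;
        System.out.println(run(false));  // container text;
    }
}
```

In the second ordering the embedded text arrives after the body has closed, so the wrapper silently drops it. That is the behaviour suspected here for a PDF nested inside another container.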

Unless I hear otherwise, I'll commit this over the next few days.

Nested documents not extracted if a PDF file is in the chain

[jira] [Updated] (TIKA-1124) Nested documents not extracted if a PDF file is in the chain

2013-08-05 Thread Tim Allison (JIRA)

[
https://issues.apache.org/jira/browse/TIKA-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Tim Allison updated TIKA-1124:
--

Attachment: TIKA-1124.patch

Chose to move embedded file code into PDF2XHTML. This allows the proper
closing of </body> with PDF2XHTML's XHTMLContentHandler. Will strip
Windows noise before committing, but I wanted to submit this draft in case
anyone wants to review it.

Nested documents not extracted if a PDF file is in the chain

[jira] [Closed] (TIKA-1124) Nested documents not extracted if a PDF file is in the chain

2013-08-08 Thread Tim Allison (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison closed TIKA-1124.
-

   Resolution: Fixed
Fix Version/s: 1.5

Added tests (thanks to Nick's advice to use POIContainerExtractionTest as a 
model).  Committed r1511901 and r1511908.

 Nested documents not extracted if a PDF file is in the chain
 

 Key: TIKA-1124
 URL: https://issues.apache.org/jira/browse/TIKA-1124
 Project: Tika
  Issue Type: Bug
  Components: general
Affects Versions: 1.3
Reporter: Tim Allison
Priority: Minor
 Fix For: 1.5

 Attachments: pdf_attachment_issues.zip, TIKA-1124.patch



[jira] [Commented] (TIKA-792) NoSuchMethodException CTMarkupImpl.<init>(org.apache.xmlbeans.SchemaType, boolean) processing a OOXML document

2013-08-08 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733804#comment-13733804
 ] 

Tim Allison commented on TIKA-792:
--

Committed in POI.  Once POI 3.10-beta2 is released, I'll increment POI's 
version in Tika's build file and confirm that this is taken care of.  There 
may be other sources of this than the one that my test document triggered.

 NoSuchMethodException CTMarkupImpl.<init>(org.apache.xmlbeans.SchemaType, 
 boolean) processing a OOXML document
 

 Key: TIKA-792
 URL: https://issues.apache.org/jira/browse/TIKA-792
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.0
 Environment: Linux, JDK 1.6, Jetty 8.x, Tomcat 6.x
Reporter: Torsten Krah
 Fix For: 1.2

 Attachments: test10.docx



[jira] [Commented] (TIKA-1153) Upgrade pdfbox to latest 1.8.2 version

2013-08-12 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13736875#comment-13736875
 ] 

Tim Allison commented on TIKA-1153:
---

Fellow Tika committers, I made this change locally and all tests passed.  Is 
there something else I should do before committing this change?

 Upgrade pdfbox to latest 1.8.2 version
 --

 Key: TIKA-1153
 URL: https://issues.apache.org/jira/browse/TIKA-1153
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.4
 Environment: Windows/Linux
Reporter: Hong-Thai Nguyen
Priority: Critical
 Fix For: 1.5


 Current version is 1.8.1


[jira] [Updated] (TIKA-1001) tika no longer seems to honor HTTP meta tag for arabic text in ISO-8859-6 charset

2013-08-12 Thread Tim Allison (JIRA)

[
https://issues.apache.org/jira/browse/TIKA-1001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Tim Allison updated TIKA-1001:
--

Attachment: TIKA-1001v1.tar.gz

This is a draft that simplifies the extraction of the charset attribute within
a meta tag (both older HTML and HTML5) and should make the charset extraction
more robust to noisy meta headers.

The strategy is:
1) find the meta tags
2) find charset=x within the meta tag
3) return the first valid charset
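
The three steps above can be sketched as follows. This is not the attached patch: MetaCharsetSniffer and both regular expressions are assumptions made for illustration, and "valid" is interpreted here as "accepted by java.nio.charset.Charset.isSupported".

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MetaCharsetSniffer {
    // Step 1: any <meta ...> tag, case-insensitive.
    private static final Pattern META_TAG =
            Pattern.compile("<meta\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // Step 2: charset=x inside the tag, with an optional quote before x.
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)",
                    Pattern.CASE_INSENSITIVE);

    // Step 3: return the first name the JVM accepts, or null if none does.
    static String firstValidCharset(String html) {
        Matcher meta = META_TAG.matcher(html);
        while (meta.find()) {
            Matcher cs = CHARSET.matcher(meta.group());
            while (cs.find()) {
                String name = cs.group(1);
                try {
                    if (Charset.isSupported(name)) {
                        return name;
                    }
                } catch (IllegalArgumentException e) {
                    // Syntactically illegal charset name: keep looking.
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String html = "<meta http-equiv=\"Content-Type\" "
                + "content=\"text/html; charset=iso-8859-6\">";
        // Whether iso-8859-6 is accepted depends on the charsets
        // installed in the running JVM.
        System.out.println(firstValidCharset(html));
    }
}
```

On the false-positive question: because the search is limited to text inside meta tags, a stray "charset=" elsewhere in the page is ignored, but any meta attribute that happens to contain "charset=" would still match.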

Is the proposed strategy too broad? Will there be false positives?

Will commit in a few days if there is no feedback. Thank you!

P.S. Ignore the patch.xml file, of course. :)

tika no longer seems to honor HTTP meta tag for arabic text in ISO-8859-6
charset
-

Key: TIKA-1001
URL: https://issues.apache.org/jira/browse/TIKA-1001
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.2
Reporter: david lemon
Attachments: badarabic.html, TIKA-1001v1.tar.gz

The attached document extracts correctly in Tika 1.1.
The attached document extracts incorrectly in Tika 1.2.
The difference appears to be that Tika 1.1 honors the http meta content-type
tag, which specifies the charset as iso-8859-6, and correctly converts the
output to UTF-8.
Tika 1.2 appears to ignore the charset specified in the meta tag.
Some noodling seems to indicate that the problem is the charset.
It doesn't matter what mode Tika is used in (server, app mode, etc.); even if
content-type is specified with a charset, the output is still garbage.

[jira] [Commented] (TIKA-1001) tika no longer seems to honor HTTP meta tag for arabic text in ISO-8859-6 charset

2013-08-14 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13740580#comment-13740580
 ] 

Tim Allison commented on TIKA-1001:
---

Fixed as of r1514126. Thank you for submitting this issue with test file!

 tika no longer seems to honor HTTP meta tag for arabic text in ISO-8859-6 
 charset
 -

 Key: TIKA-1001
 URL: https://issues.apache.org/jira/browse/TIKA-1001
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.2
Reporter: david lemon
 Attachments: badarabic.html, TIKA-1001v1.tar.gz



[jira] [Commented] (TIKA-1153) Upgrade pdfbox to latest 1.8.2 version

2013-08-15 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13741785#comment-13741785
 ] 

Tim Allison commented on TIKA-1153:
---

Committed as of r1514551.

 Upgrade pdfbox to latest 1.8.2 version
 --

 Key: TIKA-1153
 URL: https://issues.apache.org/jira/browse/TIKA-1153
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.4
 Environment: Windows/Linux
Reporter: Hong-Thai Nguyen
Priority: Critical
 Fix For: 1.5


 Current version is 1.8.1


[jira] [Closed] (TIKA-1001) tika no longer seems to honor HTTP meta tag for arabic text in ISO-8859-6 charset

2013-08-16 Thread Tim Allison (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison closed TIKA-1001.
-


 tika no longer seems to honor HTTP meta tag for arabic text in ISO-8859-6 
 charset
 -

 Key: TIKA-1001
 URL: https://issues.apache.org/jira/browse/TIKA-1001
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.2
Reporter: david lemon
 Attachments: badarabic.html, TIKA-1001v1.tar.gz



[jira] [Closed] (TIKA-1153) Upgrade pdfbox to latest 1.8.2 version

2013-08-16 Thread Tim Allison (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison closed TIKA-1153.
-

Resolution: Fixed

 Upgrade pdfbox to latest 1.8.2 version
 --

 Key: TIKA-1153
 URL: https://issues.apache.org/jira/browse/TIKA-1153
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.4
 Environment: Windows/Linux
Reporter: Hong-Thai Nguyen
Priority: Critical
 Fix For: 1.5


 Current version is 1.8.1


[jira] [Resolved] (TIKA-1001) tika no longer seems to honor HTTP meta tag for arabic text in ISO-8859-6 charset

2013-08-16 Thread Tim Allison (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-1001.
---

Resolution: Fixed

 tika no longer seems to honor HTTP meta tag for arabic text in ISO-8859-6 
 charset
 -

 Key: TIKA-1001
 URL: https://issues.apache.org/jira/browse/TIKA-1001
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.2
Reporter: david lemon
 Attachments: badarabic.html, TIKA-1001v1.tar.gz



[jira] [Commented] (TIKA-1162) content-type/charset problem with RFC822Parser

2013-08-16 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13742129#comment-13742129
 ] 

Tim Allison commented on TIKA-1162:
---

Would you be willing to attach a document/test case that triggers this issue?

 content-type/charset problem with RFC822Parser
 --

 Key: TIKA-1162
 URL: https://issues.apache.org/jira/browse/TIKA-1162
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Maciej Lizewski

 RFC822Parser (MIME mail) uses MailContentHandler, which internally uses 
 AutoDetectParser to handle each MIME part. The problem is that 
 MailContentHandler reads the MIME part headers, sets the CONTENT_TYPE and 
 CONTENT_ENCODING metadata properly, and passes this metadata to the 
 AutoDetectParser::parse method, but that method ignores those headers and 
 overwrites them:
 MediaType type = this.getDetector().detect(tis, metadata);
 metadata.set(Metadata.CONTENT_TYPE, type.toString());
 This leads to some additional recursion loops (the Detector returns the 
 message/rfc822 MIME type instead of the proper MIME type for the current 
 part). Eventually it somehow exits the loop, but without proper content-type 
 and content-encoding headers.
 My proposal is to check in AutoDetectParser::parse whether the metadata 
 already contains CONTENT_TYPE and, if so, not override it. If this is not 
 valid behavior in general, then RFC822Parser should use a custom parser in 
 MailContentHandler that respects the passed content-type.
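
A rough sketch of the proposed guard, kept independent of Tika's classes. Everything here is a stand-in assumed for illustration: a plain Map replaces the Metadata object, and detect() is a dummy detector that always answers message/rfc822, mimicking the misdetection described in the report.

```java
import java.util.HashMap;
import java.util.Map;

class ContentTypeGuardDemo {
    static final String CONTENT_TYPE = "Content-Type";

    // Dummy detector (assumption): always reports message/rfc822.
    static String detect(byte[] data) {
        return "message/rfc822";
    }

    // Behaviour described in the report: detection always overwrites
    // whatever the caller put into the metadata.
    static String resolveOverwriting(Map<String, String> metadata, byte[] data) {
        String type = detect(data);
        metadata.put(CONTENT_TYPE, type);
        return type;
    }

    // Proposed behaviour: keep a caller-supplied content type and only
    // fall back to detection when none was provided.
    static String resolvePreserving(Map<String, String> metadata, byte[] data) {
        String declared = metadata.get(CONTENT_TYPE);
        if (declared != null && !declared.trim().isEmpty()) {
            return declared;
        }
        String type = detect(data);
        metadata.put(CONTENT_TYPE, type);
        return type;
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new HashMap<>();
        metadata.put(CONTENT_TYPE, "text/plain; charset=ISO-8859-2");
        System.out.println(resolvePreserving(metadata, new byte[0]));
        System.out.println(resolveOverwriting(metadata, new byte[0]));
    }
}
```

One trade-off: trusting a declared type means a wrong header from the sender is never corrected by detection, which may be why the report offers a custom parser inside MailContentHandler as the alternative.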


[jira] [Commented] (TIKA-1001) tika no longer seems to honor HTTP meta tag for arabic text in ISO-8859-6 charset

2013-08-16 Thread Tim Allison (JIRA)

[
https://issues.apache.org/jira/browse/TIKA-1001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13742266#comment-13742266
]

Tim Allison commented on TIKA-1001:
---

David,

Thank you for submitting this. I fixed the issue triggered by your file and a
few other variants that occurred to me. I wouldn't be surprised if we'll need
to make more modifications. Please submit any other issues you find. Thank
you, again.

tika no longer seems to honor HTTP meta tag for arabic text in ISO-8859-6
charset
-

[jira] [Reopened] (TIKA-1132) Parsing some XLS documents hangs entire JVM, requires kill -9

2013-09-19 Thread Tim Allison (JIRA)

[
https://issues.apache.org/jira/browse/TIKA-1132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Tim Allison reopened TIKA-1132:
---

Assignee: Tim Allison

Will add test case in Tika.

Parsing some XLS documents hangs entire JVM, requires kill -9
-

Attachments: mod3.xlsx, mod.xls

[jira] [Created] (TIKA-1173) Upgrade to POI-3.10-beta2

2013-09-19 Thread Tim Allison (JIRA)

Tim Allison created TIKA-1173:
-

 Summary: Upgrade to POI-3.10-beta2
 Key: TIKA-1173
 URL: https://issues.apache.org/jira/browse/TIKA-1173
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison
Assignee: Tim Allison
Priority: Trivial





[jira] [Resolved] (TIKA-1173) Upgrade to POI-3.10-beta2

2013-09-19 Thread Tim Allison (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-1173.
---

Resolution: Fixed

 Upgrade to POI-3.10-beta2
 -

 Key: TIKA-1173
 URL: https://issues.apache.org/jira/browse/TIKA-1173
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison
Assignee: Tim Allison
Priority: Trivial




[jira] [Comment Edited] (TIKA-1132) Parsing some XLS documents hangs entire JVM, requires kill -9

2013-09-20 Thread Tim Allison (JIRA)

[
https://issues.apache.org/jira/browse/TIKA-1132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13772116#comment-13772116
]

Tim Allison edited comment on TIKA-1132 at 9/20/13 5:35 PM:

Any recommendations for a test? The underlying problem was that POI was doing
on the order of 10^18 division calculations...so not infinite, but exceedingly
slow. Would a jUnit timeout of, say, 10 seconds be reasonable?

was (Author: talli...@mitre.org):
Will add test case in Tika.

Parsing some XLS documents hangs entire JVM, requires kill -9
-

Attachments: mod3.xlsx, mod.xls

[jira] [Comment Edited] (TIKA-1132) Parsing some XLS documents hangs entire JVM, requires kill -9

2013-09-20 Thread Tim Allison (JIRA)

[
https://issues.apache.org/jira/browse/TIKA-1132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13772116#comment-13772116
]

Tim Allison edited comment on TIKA-1132 at 9/20/13 5:36 PM:

Any recommendations for a test? The underlying problem was that POI was doing
on the order of 10^24 division calculations...so not infinite, but exceedingly
slow. Would a jUnit timeout of, say, 10 seconds be reasonable?
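
One way such a time-limited check could look, sketched with only java.util.concurrent so that it runs without a test framework. ParseTimeoutDemo and finishesWithin are hypothetical names, and the Runnable stands in for the actual parse call. In JUnit 4 the same idea is available directly as @Test(timeout = 10000).

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class ParseTimeoutDemo {
    // Returns true if the task finished within the limit, false if it
    // had to be abandoned.
    static boolean finishesWithin(Runnable task, long millis) {
        ExecutorService pool = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "parse-under-test");
            t.setDaemon(true); // a stuck task must not keep the JVM alive
            return t;
        });
        Future<?> result = pool.submit(task);
        try {
            result.get(millis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            result.cancel(true); // interrupts the worker thread
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Placeholder workload; a real test would parse the attached file.
        boolean ok = finishesWithin(() -> Math.sqrt(42.0), 10_000);
        System.out.println(ok ? "finished in time" : "timed out");
    }
}
```

A caveat on this design: interruption only stops code that checks for it. A tight arithmetic loop like the one described would keep running on the daemon thread after the timeout, so the test would fail quickly but the thread would only go away when the JVM exits.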

was (Author: talli...@mitre.org):
Any recommendations for a test? The underlying problem was that POI was
doing on the order of 10^18 division calculations...so not infinite, but
exceedingly slow. Would a jUnit timeout of, say, 10 seconds be reasonable?

Parsing some XLS documents hangs entire JVM, requires kill -9
-

Attachments: mod3.xlsx, mod.xls

[jira] [Commented] (TIKA-792) NoSuchMethodException CTMarkupImpl.<init>(org.apache.xmlbeans.SchemaType, boolean) processing a OOXML document

2013-09-20 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13773212#comment-13773212
 ] 

Tim Allison commented on TIKA-792:
--

This is now fixed by TIKA-1173.  

Can anyone recommend a more obvious test of the solution to this than kicking 
off a process to extract text from the document and capturing stderr?  It 
would be nice to have something that we can generalize to other documents that 
trigger this issue because of a different set of missing beans.
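
As one possible in-process alternative to spawning a separate process, System.err can be redirected temporarily and the captured text inspected. This is only a sketch under assumptions: StdErrCapture is a hypothetical helper, the Runnable stands in for the parse call, and it only sees output written through System.err while the redirect is active.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

class StdErrCapture {
    // Runs the action with System.err redirected into a buffer and
    // returns everything that was written to it.
    static String capture(Runnable action) {
        PrintStream original = System.err;
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try {
            PrintStream redirected = new PrintStream(buffer, true, "UTF-8");
            System.setErr(redirected);
            action.run();
            redirected.flush();
            return buffer.toString("UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        } finally {
            System.setErr(original); // always restore the real stream
        }
    }

    public static void main(String[] args) {
        // Placeholder action; a real test would parse test10.docx here.
        String err = capture(() ->
                System.err.println("java.lang.NoSuchMethodException: example"));
        System.out.println(err.contains("NoSuchMethodException"));
    }
}
```

A test could then assert that the captured text does not contain "NoSuchMethodException" for any document in a list, which would generalize to other missing beans. Loggers that cached the original stream before the redirect would bypass this capture.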

 NoSuchMethodException CTMarkupImpl.<init>(org.apache.xmlbeans.SchemaType, 
 boolean) processing a OOXML document
 

 Key: TIKA-792
 URL: https://issues.apache.org/jira/browse/TIKA-792
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.0
 Environment: Linux, JDK 1.6, Jetty 8.x, Tomcat 6.x
Reporter: Torsten Krah
 Fix For: 1.2

 Attachments: test10.docx


 Parsing some OOXML documents, this stacktrace is logged many times:
 java.lang.NoSuchMethodException: org.openxmlformats.schemas.wordprocessingml.x2006.main.impl.CTMarkupImpl.<init>(org.apache.xmlbeans.SchemaType, boolean)
   at java.lang.Class.getConstructor0(Class.java:2723)
   at java.lang.Class.getDeclaredConstructor(Class.java:2002)
   at org.apache.xmlbeans.impl.schema.SchemaTypeImpl.getJavaImplConstructor2(SchemaTypeImpl.java:1749)
   at org.apache.xmlbeans.impl.schema.SchemaTypeImpl.createUnattachedSubclass(SchemaTypeImpl.java:1886)
   at org.apache.xmlbeans.impl.schema.SchemaTypeImpl.createUnattachedNode(SchemaTypeImpl.java:1875)
   at org.apache.xmlbeans.impl.schema.SchemaTypeImpl.createElementType(SchemaTypeImpl.java:1021)
   at org.apache.xmlbeans.impl.values.XmlObjectBase.create_element_user(XmlObjectBase.java:893)
   at org.apache.xmlbeans.impl.store.Xobj.getUser(Xobj.java:1657)
   at org.apache.xmlbeans.impl.store.Cur.getUser(Cur.java:2654)
   at org.apache.xmlbeans.impl.store.Cur.getObject(Cur.java:2647)
   at org.apache.xmlbeans.impl.store.Cursor._getObject(Cursor.java:995)
   at org.apache.xmlbeans.impl.store.Cursor.getObject(Cursor.java:2904)
   at org.apache.poi.xwpf.usermodel.XWPFParagraph.<init>(XWPFParagraph.java:83)
   at org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:145)
   at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:159)
   at org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:115)
   at org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:53)
   at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:180)
   at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:63)
   at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:69)
   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
 Looking at the POI code, Java is right here: there is no constructor taking a 
 SchemaType and a boolean, only one taking a SchemaType.
 My guess is that this one was missed during the upgrade to POI beta4, but that 
 is only a guess. Either way, it needs a fix :-).


[jira] [Commented] (TIKA-1100) cannot extract text in text-box for Excel 2007 file(.xlsx, .xlsm)

2013-09-26 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13778801#comment-13778801
 ] 

Tim Allison commented on TIKA-1100:
---

Updated XSSFExcelExtractorDecorator and added test as of r1526489.

 cannot extract text in text-box for Excel 2007 file(.xlsx, .xlsm)
 -

 Key: TIKA-1100
 URL: https://issues.apache.org/jira/browse/TIKA-1100
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.3
 Environment: Windows7 64bit
Reporter: Kazuaki Matsuba
 Attachments: testEXCEL_textbox.xlsx


 When I launch the Tika GUI from the command line and drag and drop an .xlsx 
 file that has a text box, no text in the text box is extracted.
 When I drag and drop an .xls file, the text in the text box is extracted.


[jira] [Resolved] (TIKA-1100) cannot extract text in text-box for Excel 2007 file(.xlsx, .xlsm)

2013-09-26 Thread Tim Allison (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-1100.
---

   Resolution: Fixed
Fix Version/s: 1.5

r1526498

 cannot extract text in text-box for Excel 2007 file(.xlsx, .xlsm)
 -

 Key: TIKA-1100
 URL: https://issues.apache.org/jira/browse/TIKA-1100
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.3
 Environment: Windows7 64bit
Reporter: Kazuaki Matsuba
 Fix For: 1.5

 Attachments: testEXCEL_textbox.xlsx


 When I launch the Tika GUI from the command line and drag and drop an .xlsx 
 file that has a text box, no text in the text box is extracted.
 When I drag and drop an .xls file, the text in the text box is extracted.


[jira] [Reopened] (TIKA-792) NoSuchMethodException CTMarkupImpl.init(org.apache.xmlbeans.SchemaType, boolean) processing a OOXML document

2013-09-26 Thread Tim Allison (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison reopened TIKA-792:
--


added test that catches stderr.
r1526570.
reopening just to record this.

 NoSuchMethodException CTMarkupImpl.init(org.apache.xmlbeans.SchemaType, 
 boolean) processing a OOXML document
 

 Key: TIKA-792
 URL: https://issues.apache.org/jira/browse/TIKA-792
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.0
 Environment: Linux, JDK 1.6, Jetty 8.x, Tomcat 6.x
Reporter: Torsten Krah
 Fix For: 1.2

 Attachments: test10.docx




[jira] [Resolved] (TIKA-792) NoSuchMethodException CTMarkupImpl.init(org.apache.xmlbeans.SchemaType, boolean) processing a OOXML document

2013-09-26 Thread Tim Allison (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-792.
--

Resolution: Fixed

 NoSuchMethodException CTMarkupImpl.init(org.apache.xmlbeans.SchemaType, 
 boolean) processing a OOXML document
 

 Key: TIKA-792
 URL: https://issues.apache.org/jira/browse/TIKA-792
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.0
 Environment: Linux, JDK 1.6, Jetty 8.x, Tomcat 6.x
Reporter: Torsten Krah
 Fix For: 1.2

 Attachments: test10.docx




[jira] [Resolved] (TIKA-1132) Parsing some XLS documents hangs entire JVM, requires kill -9

2013-09-26 Thread Tim Allison (JIRA)

[
https://issues.apache.org/jira/browse/TIKA-1132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Tim Allison resolved TIKA-1132.
---

Resolution: Fixed

Resolved with upgrade to poi-3.10-beta2.
Could use help getting jUnit's timeout to work.
Currently no unit tests for this.
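
A rough sketch of one way to bound a parse call without relying on jUnit's own timeout, using only java.util.concurrent. The names ParseWithTimeout and callWithTimeout are hypothetical. Note the caveat: if the hang really does freeze the whole JVM (as the issue title says), no in-process timeout will fire, and a forked process would still be needed.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class ParseWithTimeout {

    /**
     * Runs the task on a daemon worker thread and waits at most timeoutMillis.
     * Throws TimeoutException if the task has not finished in time.
     */
    public static <T> T callWithTimeout(Callable<T> task, long timeoutMillis) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor(runnable -> {
            Thread worker = new Thread(runnable, "parse-with-timeout");
            worker.setDaemon(true); // a stuck worker must not keep the JVM alive
            return worker;
        });
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // best effort: interrupts the worker
            throw e;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a parse call that returns promptly.
        System.out.println(callWithTimeout(() -> "parsed", 1000));
    }
}
```

Interrupting the worker only helps if the parsing code checks for interruption; a tight loop inside a library will keep spinning on the daemon thread until the JVM exits.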

Parsing some XLS documents hangs entire JVM, requires kill -9
-

Attachments: mod3.xlsx, mod.xls

[jira] [Resolved] (TIKA-1076) Upgrade to Apache POI 3.9

2013-09-27 Thread Tim Allison (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-1076.
---

Resolution: Fixed

Added some code similar to the fix to POI-54722 to HSLFExtractor.  Uncommented 
old test.  Text is now extracted from tables in HSLF.

 Upgrade to Apache POI 3.9
 -

 Key: TIKA-1076
 URL: https://issues.apache.org/jira/browse/TIKA-1076
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.3
Reporter: Nick Burch
 Fix For: 1.5


 We should upgrade to Apache POI 3.9, which is the latest version


[jira] [Resolved] (TIKA-817) (PPT/PPTX) Missing date/time in text content.

2013-09-27 Thread Tim Allison (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-817.
--

Resolution: Fixed

As mentioned above, this was fixed a while ago.  I added test documents from 
POI-52367 and POI-52368, and I created simple tests to confirm behavior 
described in POI issues.

 (PPT/PPTX) Missing date/time in text content.
 -

 Key: TIKA-817
 URL: https://issues.apache.org/jira/browse/TIKA-817
 Project: Tika
  Issue Type: Bug
  Components: general
Affects Versions: 1.0
 Environment: Win7-64 + java version 1.6.0_26
Reporter: Albert L.
 Fix For: 1.5


 Missing date/time text in text content for PPT and PPTX files.
 The date and time are missing from the text content.  This occurs when one 
 chooses the following with MS-PowerPoint 2010:
 1) Insert
 2) Date & Time
 3) Update automatically
 4) save to PPT or PPTX


[jira] [Commented] (TIKA-1162) content-type/charset problem with RFC822Parser

2013-09-30 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13781922#comment-13781922
 ] 

Tim Allison commented on TIKA-1162:
---

Dear Colleague,
  I'm on paternity leave.  Will be back part time on October 14.

   Best,

Tim



 content-type/charset problem with RFC822Parser
 --

 Key: TIKA-1162
 URL: https://issues.apache.org/jira/browse/TIKA-1162
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Maciej Lizewski

 RFC822Parser (mime mail) uses MailContentHandler, which internally uses 
 AutoDetectParser to handle each mime part. The problem is that 
 MailContentHandler reads the mime part headers, sets the CONTENT_TYPE and 
 CONTENT_ENCODING metadata properly, and passes this metadata to the 
 AutoDetectParser::parse method. But that method ignores those headers and 
 overwrites them:
 MediaType type = this.getDetector().detect(tis, metadata);
 metadata.set(Metadata.CONTENT_TYPE, type.toString());
 This leads to some additional recursion loops (the Detector returns the 
 message/rfc822 mime type instead of the proper mime type for the current mime 
 part), and eventually it somehow exits the loop, but without the proper 
 content-type and content-encoding headers...
 My proposal is to add a check in AutoDetectParser::parse for whether the 
 metadata already contains CONTENT_TYPE and, in that case, not to override it. 
 If this is not valid behavior in general, then RFC822Parser should use a 
 custom parser in MailContentHandler that respects the passed content-type...
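
 The proposed check could look roughly like the following. This is only a sketch of the idea, not Tika code: a plain Map stands in for Tika's Metadata, a Supplier stands in for the Detector call, and ContentTypeGuard and resolveContentType are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

public class ContentTypeGuard {

    // Stand-in for the metadata key; the real constant lives in Tika's Metadata class.
    public static final String CONTENT_TYPE = "Content-Type";

    /**
     * Returns the content type already present in the metadata, if any.
     * Only when none was supplied is the detector consulted and its result stored.
     */
    public static String resolveContentType(Map<String, String> metadata, Supplier<String> detector) {
        String declared = metadata.get(CONTENT_TYPE);
        if (declared != null && !declared.trim().isEmpty()) {
            return declared; // respect what the caller (e.g. the mail handler) passed in
        }
        String detected = detector.get();
        metadata.put(CONTENT_TYPE, detected);
        return detected;
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new HashMap<>();
        metadata.put(CONTENT_TYPE, "text/plain; charset=ISO-8859-2");
        // The detector would have answered message/rfc822, but the declared type is kept.
        System.out.println(resolveContentType(metadata, () -> "message/rfc822"));
    }
}
```

 Whether trusting a caller-supplied type is safe in general is exactly the open question raised above; the sketch only shows the mechanics.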



--
This message was sent by Atlassian JIRA
(v6.1#6144)

[jira] [Commented] (TIKA-817) (PPT/PPTX) Missing date/time in text content.

2013-11-01 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13811201#comment-13811201
 ] 

Tim Allison commented on TIKA-817:
--

Thank you!

 (PPT/PPTX) Missing date/time in text content.
 -

 Key: TIKA-817
 URL: https://issues.apache.org/jira/browse/TIKA-817
 Project: Tika
  Issue Type: Bug
  Components: general
Affects Versions: 1.0
 Environment: Win7-64 + java version 1.6.0_26
Reporter: Albert L.
 Fix For: 1.5


 Missing date/time text in text content for PPT and PPTX files.
 The date and time are missing from the text content.  This occurs when one 
 chooses the following with MS-PowerPoint 2010:
 1) Insert
 2) Date & Time
 3) Update automatically
 4) save to PPT or PPTX




[jira] [Resolved] (TIKA-1200) Upgrade pdfbox 1.8.3

2013-12-02 Thread Tim Allison (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-1200.
---

Resolution: Fixed

Fixed in r1547037.  Waiting for Jenkins to pick up change to confirm.  Thank 
you!

 Upgrade pdfbox 1.8.3
 

 Key: TIKA-1200
 URL: https://issues.apache.org/jira/browse/TIKA-1200
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.4
 Environment: all
Reporter: Hong-Thai Nguyen
Priority: Critical
 Fix For: 1.5


 pdfbox just released new 1.8.3 version
 http://www.apache.org/dist/pdfbox/1.8.3/RELEASE-NOTES.txt




[jira] [Assigned] (TIKA-1201) Add possibility for switching to pdfbox NonSequentialPDFParser

2013-12-02 Thread Tim Allison (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison reassigned TIKA-1201:
-

Assignee: Tim Allison

 Add possibility for switching to pdfbox NonSequentialPDFParser
 --

 Key: TIKA-1201
 URL: https://issues.apache.org/jira/browse/TIKA-1201
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.4
 Environment: all
Reporter: Hong-Thai Nguyen
Assignee: Tim Allison
Priority: Critical

 As discussed, we can improve PDF extraction by 45% with this new 
 NonSequentialPDFParser, which also fits the PDF specification more closely. 
 This parser will be integrated by default in pdfbox 2.0.
 ref.: 
 https://issues.apache.org/jira/browse/PDFBOX-1104
 http://pdfbox.apache.org/ideas.html
 We should provide an extended parser, or add a parameter to the current 
 PDFParser, to call:
 {code}
 PDDocument.loadNonSeq(file, scratchFile);
 {code}




[jira] [Updated] (TIKA-1201) Add possibility for switching to pdfbox NonSequentialPDFParser

2013-12-02 Thread Tim Allison (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1201:
--

Attachment: TIKA-1201.patch

Trivial patch

 Add possibility for switching to pdfbox NonSequentialPDFParser
 --

 Key: TIKA-1201
 URL: https://issues.apache.org/jira/browse/TIKA-1201
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.4
 Environment: all
Reporter: Hong-Thai Nguyen
Assignee: Tim Allison
Priority: Critical
 Attachments: TIKA-1201.patch


 As discussed, we can improve PDF extraction by 45% with this new 
 NonSequentialPDFParser, which also fits the PDF specification more closely. 
 This parser will be integrated by default in pdfbox 2.0.
 ref.: 
 https://issues.apache.org/jira/browse/PDFBOX-1104
 http://pdfbox.apache.org/ideas.html
 We should provide an extended parser, or add a parameter to the current 
 PDFParser, to call:
 {code}
 PDDocument.loadNonSeq(file, scratchFile);
 {code}




[jira] [Resolved] (TIKA-1201) Add possibility for switching to pdfbox NonSequentialPDFParser

2013-12-02 Thread Tim Allison (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-1201.
---

   Resolution: Fixed
Fix Version/s: 1.5

Basic parameter-based capability added in r1547250.  User beware that there may 
be differences in metadata processing between the NonSequentialPDFParser and 
the traditional parser.  Will open issue to track failure to extract metadata 
from testAnnotations.pdf.

 Add possibility for switching to pdfbox NonSequentialPDFParser
 --

 Key: TIKA-1201
 URL: https://issues.apache.org/jira/browse/TIKA-1201
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.4
 Environment: all
Reporter: Hong-Thai Nguyen
Assignee: Tim Allison
Priority: Critical
 Fix For: 1.5

 Attachments: TIKA-1201.patch


 As discussed, we can improve PDF extraction by 45% with this new 
 NonSequentialPDFParser, which also fits the PDF specification more closely. 
 This parser will be integrated by default in pdfbox 2.0.
 ref.: 
 https://issues.apache.org/jira/browse/PDFBOX-1104
 http://pdfbox.apache.org/ideas.html
 We should provide an extended parser, or add a parameter to the current 
 PDFParser, to call:
 {code}
 PDDocument.loadNonSeq(file, scratchFile);
 {code}




[jira] [Created] (TIKA-1202) Refactor PDFParser to enable easier parameter setting

2013-12-02 Thread Tim Allison (JIRA)

Tim Allison created TIKA-1202:
-

 Summary: Refactor PDFParser to enable easier parameter setting
 Key: TIKA-1202
 URL: https://issues.apache.org/jira/browse/TIKA-1202
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.5
Reporter: Tim Allison
Assignee: Tim Allison
Priority: Trivial


It would be handy to be able to set PDFParser parameters 
(extractAnnotationText, etc) in a config file and via ParseContext.
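
As a rough illustration of the idea, the sketch below reads parser settings from a properties file and lets a per-call configuration take precedence over the file-based defaults. It uses only the JDK; PdfParserConfigSketch, the useNonSequentialParser key, and the default values are assumptions made for the example, and a plain method argument stands in for looking the configuration up in a ParseContext.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Properties;

public class PdfParserConfigSketch {

    // Default values are assumptions for this sketch, not Tika's documented defaults.
    private boolean extractAnnotationText = true;
    private boolean useNonSequentialParser = false;

    /** Builds a config from properties text; missing keys keep their defaults. */
    public static PdfParserConfigSketch fromProperties(Reader source) throws IOException {
        Properties props = new Properties();
        props.load(source);
        PdfParserConfigSketch config = new PdfParserConfigSketch();
        config.extractAnnotationText =
                readBoolean(props, "extractAnnotationText", config.extractAnnotationText);
        config.useNonSequentialParser =
                readBoolean(props, "useNonSequentialParser", config.useNonSequentialParser);
        return config;
    }

    private static boolean readBoolean(Properties props, String key, boolean fallback) {
        String raw = props.getProperty(key);
        return raw == null ? fallback : Boolean.parseBoolean(raw.trim());
    }

    /** A per-call config (e.g. one supplied by the caller) wins over the defaults. */
    public static PdfParserConfigSketch resolve(PdfParserConfigSketch perCall,
                                                PdfParserConfigSketch defaults) {
        return perCall != null ? perCall : defaults;
    }

    public boolean isExtractAnnotationText() {
        return extractAnnotationText;
    }

    public boolean isUseNonSequentialParser() {
        return useNonSequentialParser;
    }

    public static void main(String[] args) throws IOException {
        PdfParserConfigSketch defaults = fromProperties(new StringReader(""));
        PdfParserConfigSketch perCall = fromProperties(new StringReader("extractAnnotationText=false\n"));
        System.out.println(resolve(perCall, defaults).isExtractAnnotationText());
        System.out.println(resolve(null, defaults).isExtractAnnotationText());
    }
}
```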




[jira] [Updated] (TIKA-1202) Refactor PDFParser to enable easier parameter setting

2013-12-02 Thread Tim Allison (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1202:
--

Attachment: TIKA-1202.patch

Would appreciate community feedback on this before I commit it (December 6?).  
Is it ok to deprecate the setters and getters for the PDFParser parameters?  Is 
the use of a simple properties file and integration via ParseContext consistent 
with design principles of Tika?

Thank you!

 Refactor PDFParser to enable easier parameter setting
 -

 Key: TIKA-1202
 URL: https://issues.apache.org/jira/browse/TIKA-1202
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.5
Reporter: Tim Allison
Assignee: Tim Allison
Priority: Trivial
 Attachments: TIKA-1202.patch


 It would be handy to be able to set PDFParser parameters 
 (extractAnnotationText, etc) in a config file and via ParseContext.




[jira] [Created] (TIKA-1203) Some metadata not extracted from PDF files when NonSequentialPDFParser is used

2013-12-03 Thread Tim Allison (JIRA)

Tim Allison created TIKA-1203:
-

 Summary: Some metadata not extracted from PDF files when 
NonSequentialPDFParser is used
 Key: TIKA-1203
 URL: https://issues.apache.org/jira/browse/TIKA-1203
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Tim Allison
Priority: Minor


While working on TIKA-1201, I noticed that metadata was not being extracted 
from the testAnnotations.pdf file when the NonSequentialPDFParser was being 
used.  I opened PDFBOX-1792.  This TIKA issue is a placeholder.  When 
PDFBOX-1792 is fixed, we can stop skipping testAnnotations.pdf in 
PDFParserTest.




[jira] [Comment Edited] (TIKA-1201) Add possibility for switching to pdfbox NonSequentialPDFParser

2013-12-03 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13837169#comment-13837169
 ] 

Tim Allison edited comment on TIKA-1201 at 12/3/13 4:25 PM:


Basic parameter-based capability added in r1547250.  User beware that there may 
be differences in metadata processing between the NonSequentialPDFParser and 
the traditional parser.  (See TIKA-1203 for failure of NonSequentialPDFParser 
to extract metadata from testAnnotations.pdf).


was (Author: talli...@mitre.org):
Basic parameter-based capability added in r1547250.  User beware that there may 
be differences in metadata processing between the NonSequentialPDFParser and 
the traditional parser.  Will open issue to track failure to extract metadata 
from testAnnotations.pdf.

 Add possibility for switching to pdfbox NonSequentialPDFParser
 --

 Key: TIKA-1201
 URL: https://issues.apache.org/jira/browse/TIKA-1201
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.4
 Environment: all
Reporter: Hong-Thai Nguyen
Assignee: Tim Allison
Priority: Critical
 Fix For: 1.5

 Attachments: TIKA-1201.patch


 As discussed, we can improve PDF extraction by 45% with this new 
 NonSequentialPDFParser, which also fits the PDF specification more closely. 
 This parser will be integrated by default in pdfbox 2.0.
 ref.: 
 https://issues.apache.org/jira/browse/PDFBOX-1104
 http://pdfbox.apache.org/ideas.html
 We should provide an extended parser, or add a parameter to the current 
 PDFParser, to call:
 {code}
 PDDocument.loadNonSeq(file, scratchFile);
 {code}




[jira] [Commented] (TIKA-1199) Tika extracts weird signs instead of text

2013-12-04 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13838856#comment-13838856
 ] 

Tim Allison commented on TIKA-1199:
---

Doh!  Duplicated Marc's PDFBOX-1783.  Sorry about that.

 Tika extracts weird signs instead of text
 -

 Key: TIKA-1199
 URL: https://issues.apache.org/jira/browse/TIKA-1199
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
 Environment: MacOSX, Linux
Reporter: Marc Teutelink
 Attachments: gaat fout.pdf, 
 plain_text_tika_output_from_gaat_fout_pdf.txt, 
 structured_text_tika_output_from_gaat_fout_pdf.xml


 Tika extracts complete bogus text from the attached document. I have attached 
 the .PDF in question and also added the plain and structured text output from 
 Tika.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

[jira] [Resolved] (TIKA-1202) Refactor PDFParser to enable easier parameter setting

2013-12-06 Thread Tim Allison (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-1202.
---

   Resolution: Fixed
Fix Version/s: 1.5

Committed in r1548700.  Thank you, Mike and Hong-Thai for feedback. More 
parameters on the way...

 Refactor PDFParser to enable easier parameter setting
 -

 Key: TIKA-1202
 URL: https://issues.apache.org/jira/browse/TIKA-1202
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.5
Reporter: Tim Allison
Assignee: Tim Allison
Priority: Trivial
 Fix For: 1.5

 Attachments: TIKA-1202.patch


 It would be handy to be able to set PDFParser parameters 
 (extractAnnotationText, etc) in a config file and via ParseContext.




[jira] [Reopened] (TIKA-1202) Refactor PDFParser to enable easier parameter setting

2013-12-09 Thread Tim Allison (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison reopened TIKA-1202:
---


Small bug in using default vs config.

 Refactor PDFParser to enable easier parameter setting
 -

 Key: TIKA-1202
 URL: https://issues.apache.org/jira/browse/TIKA-1202
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.5
Reporter: Tim Allison
Assignee: Tim Allison
Priority: Trivial
 Fix For: 1.5

 Attachments: TIKA-1202.patch


 It would be handy to be able to set PDFParser parameters 
 (extractAnnotationText, etc) in a config file and via ParseContext.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)

[jira] [Resolved] (TIKA-1202) Refactor PDFParser to enable easier parameter setting

2013-12-09 Thread Tim Allison (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-1202.
---

Resolution: Fixed

r1549646

 Refactor PDFParser to enable easier parameter setting
 -

 Key: TIKA-1202
 URL: https://issues.apache.org/jira/browse/TIKA-1202
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.5
Reporter: Tim Allison
Assignee: Tim Allison
Priority: Trivial
 Fix For: 1.5

 Attachments: TIKA-1202.patch


 It would be handy to be able to set PDFParser parameters 
 (extractAnnotationText, etc) in a config file and via ParseContext.




[jira] [Created] (TIKA-1205) Allow PDFParser to fallback to other parser if there is an exception

2013-12-11 Thread Tim Allison (JIRA)

Tim Allison created TIKA-1205:
-

 Summary: Allow PDFParser to fallback to other parser if there is 
an exception
 Key: TIKA-1205
 URL: https://issues.apache.org/jira/browse/TIKA-1205
 Project: Tika
  Issue Type: Improvement
  Components: parser
Reporter: Tim Allison
Assignee: Tim Allison
Priority: Trivial
 Fix For: 1.5


With TIKA-1201, there is now an option to use PDFBox's NonSequentialPDFParser 
instead of the traditional parser for parsing PDF files.  Following the 
description in PDFBOX-1199, it would be useful to allow fallback to the classic 
parser if NonSequentialPDFParser encounters an IOException.  For the sake of 
symmetry, I propose a boolean useParserFallbackOnException parameter.  If this 
parameter is true, and if Tika's PDFParser is using the classic parser, Tika 
will fall back to the NonSequentialPDFParser if there is an IOException; if 
this parameter is true and if Tika's PDFParser is using the 
NonSequentialPDFParser it will fall back to the classic parser if there is an 
IOException.
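
The symmetric behavior described above can be sketched with plain JDK types. Nothing here is Tika or PDFBox API: ParserFallbackSketch, Loader, and loadWithFallback are hypothetical names, and the two loaders stand in for "load with the classic parser" and "load with the NonSequentialPDFParser", in whichever order is configured.

```java
import java.io.IOException;

public class ParserFallbackSketch {

    /** Stand-in for "load the document with one particular parser". */
    public interface Loader<T> {
        T load() throws IOException;
    }

    /**
     * Tries the primary loader. On an IOException, and only if the flag is set,
     * tries the secondary loader. If both fail, the secondary failure is thrown
     * with the primary failure attached as a suppressed exception.
     */
    public static <T> T loadWithFallback(Loader<T> primary,
                                         Loader<T> secondary,
                                         boolean useParserFallbackOnException) throws IOException {
        try {
            return primary.load();
        } catch (IOException primaryFailure) {
            if (!useParserFallbackOnException) {
                throw primaryFailure;
            }
            try {
                return secondary.load();
            } catch (IOException secondaryFailure) {
                secondaryFailure.addSuppressed(primaryFailure);
                throw secondaryFailure;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Loader<String> failing = () -> { throw new IOException("primary parser failed"); };
        Loader<String> working = () -> "document";
        System.out.println(loadWithFallback(failing, working, true));
    }
}
```

Swapping the two loader arguments gives the other direction of the fallback, so a single flag covers both cases.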

Many thanks to Hong-Thai for championing the addition of the added 
NonSequentialPDFParser capability in TIKA-1201, and many thanks to Timo for 
PDFBox's NonSequentialPDFParser (PDFBOX-1199)!




[jira] [Commented] (TIKA-1205) Allow PDFParser to fallback to other parser if there is an exception

2013-12-11 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13845429#comment-13845429
 ] 

Tim Allison commented on TIKA-1205:
---

Thank you for your feedback!  TIKA-456 is the existing issue for general 
timeout capability.  I agree that it would be great to add.  TIKA-1205 is a 
very narrowly defined improvement for PDFParser.

 Allow PDFParser to fallback to other parser if there is an exception
 

 Key: TIKA-1205
 URL: https://issues.apache.org/jira/browse/TIKA-1205
 Project: Tika
  Issue Type: Improvement
  Components: parser
Reporter: Tim Allison
Assignee: Tim Allison
Priority: Trivial
 Fix For: 1.5


 With TIKA-1201, there is now an option to use PDFBox's NonSequentialPDFParser 
 instead of the traditional parser for parsing PDF files.  Following the 
 description in PDFBOX-1199, it would be useful to allow fallback to the 
 classic parser if NonSequentialPDFParser throws an IOException.  For the sake 
 of symmetry, I propose a boolean useParserFallbackOnException parameter.  If 
 this parameter is true, and if Tika's PDFParser is using the classic parser, 
 Tika will fall back to the NonSequentialPDFParser if there is an IOException; 
 if this parameter is true and if Tika's PDFParser is using the 
 NonSequentialPDFParser it will fall back to the classic parser if there is an 
 IOException.
 Many thanks to Hong-Thai for championing the addition of the added 
 NonSequentialPDFParser capability in TIKA-1201, and many thanks to Timo for 
 PDFBox's NonSequentialPDFParser (PDFBOX-1199)!




[jira] [Reopened] (TIKA-973) PDF form data isn't included in extracted content.

2013-12-13 Thread Tim Allison (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison reopened TIKA-973:
--

  Assignee: Tim Allison

In hindsight, would prefer to use test documents that are unequivocally 
consistent with Apache License.  I've removed docs from trunk and commented out 
test cases (r1550725).  If anyone would like to contribute an example doc that 
is unequivocally consistent with Apache License 2.0, I'll modify the test case 
for that doc.  I'll be on the lookout for test docs and will leave this open 
until test cases are turned back on.  The functionality within Tika is still 
available, of course.

 PDF form data isn't included in extracted content.
 --

 Key: TIKA-973
 URL: https://issues.apache.org/jira/browse/TIKA-973
 Project: Tika
  Issue Type: Bug
  Components: general
Affects Versions: 1.2
Reporter: Michael Graessle
Assignee: Tim Allison
Priority: Minor
 Fix For: 1.5

 Attachments: TIKA-973-patch.tar.gz, TIKA-973.patch.tar.gz, 
 i-9_screenshot.png


 When extracting content from PDFs, PDF form data isn't extracted. 
 The following code extracts this data via PDF box, but it seems like 
 something Tika should be doing.
 PDDocumentCatalog docCatalog = load.getDocumentCatalog();
 if (docCatalog != null) {
   PDAcroForm acroForm = docCatalog.getAcroForm();
   if (acroForm != null) {
     @SuppressWarnings("unchecked")
     List<PDField> fields = acroForm.getFields();
     if (fields != null && fields.size() > 0) {
       documentContent.append(" ");
       for (PDField field : fields) {
         if (field.getValue() != null) {
           documentContent.append(field.getValue());
           documentContent.append(" ");
         }
       }
     }
   }
 }



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)

[jira] [Commented] (TIKA-1212) Recursive Extraction of Archive File

2013-12-19 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13852938#comment-13852938
 ] 

Tim Allison commented on TIKA-1212:
---

On first issue: do you mean that you'd like to have a parameter that would 
unzip the abc.zip file but not unzip the pqr.zip file?  Or do you want to be 
able to select embedded document types that you don't want to recurse through?


 Recursive Extraction of Archive File
 

 Key: TIKA-1212
 URL: https://issues.apache.org/jira/browse/TIKA-1212
 Project: Tika
  Issue Type: Bug
Reporter: Vikram
Priority: Critical

 Please refer to the code: 
 http://wiki.apache.org/tika/RecursiveMetadata#Main_from_Jukka.27s_Example
 Requirement:
 -
 abc.zip
    --- a.doc
    --- b.xls
    --- pqr.zip
          --- m.ppt
 There are two issues with Tika:
 1. How can extraction of embedded documents optionally be blocked?
 2. When I extract recursively, the file name or resourceKeyName is not 
 reported properly. For example:
 -- a.doc should have the value abc.zip/a.doc, and similarly for b.xls. This 
 is fine, BUT m.ppt has the resource file name pqr/m.ppt, which is WRONG. It 
 should have the value abc.zip/pqr.zip/m.ppt.
 -- Even for the embedded doc, only a random name is reported, not the 
 proper file path.
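
The path the reporter expects (abc.zip/pqr.zip/m.ppt) amounts to keeping a stack of the containers currently being unpacked. A minimal sketch, assuming a hypothetical ResourcePathTracker helper that is not part of Tika; a caller would invoke enter() before recursing into an archive and leave() afterwards:

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical helper: tracks the chain of containers during recursive extraction. */
final class ResourcePathTracker {
    // Outermost container first, innermost container last.
    private final Deque<String> containers = new ArrayDeque<>();

    /** Call when starting to unpack a container such as abc.zip. */
    void enter(String containerName) {
        containers.addLast(containerName);
    }

    /** Call when the current container has been fully processed. */
    void leave() {
        containers.removeLast();
    }

    /** Builds the full path of a resource inside the current container chain. */
    String pathFor(String resourceName) {
        StringBuilder sb = new StringBuilder();
        for (String container : containers) {
            sb.append(container).append('/');
        }
        return sb.append(resourceName).toString();
    }
}
```

With abc.zip and then pqr.zip entered, pathFor("m.ppt") yields the full chain rather than only the innermost archive name.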



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)

[jira] [Updated] (TIKA-1212) Recursive Extraction of Archive File

2013-12-19 Thread Tim Allison (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1212:
--

Attachment: abc.zip

Does this test file meet your description?

 Recursive Extraction of Archive File
 

 Key: TIKA-1212
 URL: https://issues.apache.org/jira/browse/TIKA-1212
 Project: Tika
  Issue Type: Bug
Reporter: Vikram
Priority: Critical
 Attachments: abc.zip





--
This message was sent by Atlassian JIRA
(v6.1.4#6159)

[jira] [Updated] (TIKA-1205) Allow PDFParser to fallback to other parser if there is an exception

2013-12-20 Thread Tim Allison (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1205:
--

Due Date: 17/Jan/14  (was: 20/Dec/13)

 Allow PDFParser to fallback to other parser if there is an exception
 

 Key: TIKA-1205
 URL: https://issues.apache.org/jira/browse/TIKA-1205
 Project: Tika
  Issue Type: Improvement
  Components: parser
Reporter: Tim Allison
Assignee: Tim Allison
Priority: Trivial
 Fix For: 1.5





--
This message was sent by Atlassian JIRA
(v6.1.4#6159)

[jira] [Commented] (TIKA-1216) parse method of Mp3Parser doesn't work for few mp3 files

2014-01-07 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13864393#comment-13864393
 ] 

Tim Allison commented on TIKA-1216:
---

Give this a shot:
https://builds.apache.org/job/Tika-trunk/org.apache.tika$tika-app/lastSuccessfulBuild/artifact/org.apache.tika/tika-app/1.5-20131229.024202-48/tika-app-1.5-20131229.024202-48.jar
 

 parse method of Mp3Parser doesn't work for few mp3 files
 

 Key: TIKA-1216
 URL: https://issues.apache.org/jira/browse/TIKA-1216
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
 Environment: Windows 7 ultimate 32-bit OS, Java 1.7
Reporter: Sumeet Gorab
Priority: Blocker
  Labels: patch
 Attachments: 05 - Dharti - Sarkaaran [www.DJMaza.Com].mp3


 I tried to parse an MP3 file, but the parse method of the Mp3Parser class is 
 not able to parse it. The parse method is not able to complete its execution; 
 there is some issue in that method.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

[jira] [Resolved] (TIKA-1216) parse method of Mp3Parser doesn't work for few mp3 files

2014-01-07 Thread Tim Allison (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-1216.
---

   Resolution: Fixed
Fix Version/s: 1.5

Following reporter's comment, this looks to be fixed in 1.5-SNAPSHOT.  If it 
turns out to be a duplicate of TIKA-1215, I'll switch resolution to 
duplicate.  Thank you for reporting this!

 parse method of Mp3Parser doesn't work for few mp3 files
 

 Key: TIKA-1216
 URL: https://issues.apache.org/jira/browse/TIKA-1216
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
 Environment: Windows 7 ultimate 32-bit OS, Java 1.7
Reporter: Sumeet Gorab
Priority: Blocker
  Labels: patch
 Fix For: 1.5

 Attachments: 05 - Dharti - Sarkaaran [www.DJMaza.Com].mp3





--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

[jira] [Commented] (TIKA-1216) parse method of Mp3Parser doesn't work for few mp3 files

2014-01-09 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13866916#comment-13866916
 ] 

Tim Allison commented on TIKA-1216:
---

Agreed.  I didn't think this was a duplicate.  It is fixed, though, in trunk?  
If so, let's close this issue.

 parse method of Mp3Parser doesn't work for few mp3 files
 

 Key: TIKA-1216
 URL: https://issues.apache.org/jira/browse/TIKA-1216
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
 Environment: Windows 7 ultimate 32-bit OS, Java 1.7
Reporter: Sumeet Gorab
Priority: Blocker
  Labels: patch
 Fix For: 1.5

 Attachments: 05 - Dharti - Sarkaaran [www.DJMaza.Com].mp3





--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

[jira] [Commented] (TIKA-1215) Regression: Unable to parse a mp3 file on 1.5 which parsed successfully on 1.4

2014-01-13 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13869528#comment-13869528
 ] 

Tim Allison commented on TIKA-1215:
---

[~thaichat04] thank you for sending a clean patch. This area of the code base 
is not exceedingly familiar to me, but if I understand Tika's history and your 
code correctly, your if statement wasn't necessary in 1.4, and (based on a very 
quick look) it looks like nothing else in the relevant lines of the MP3 parser 
changed between 1.4 and trunk.  Are you able to determine what changed between 1.4 
and trunk that led to this regression?  Thank you!

 Regression: Unable to parse a mp3 file on 1.5 which parsed successfully on 1.4
 --

 Key: TIKA-1215
 URL: https://issues.apache.org/jira/browse/TIKA-1215
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.5
Reporter: Hong-Thai Nguyen
Priority: Critical
 Attachments: Centres 080805@0650 RTBF Matin Première - A propos des 
 rues de Dublin et Dubreucq.mp3, TIKA-1215-fix-prefix-namespaces.patch, 
 tika-1215-without-wildcard.patch


 With attached file, 1.5 raises this exception on parsing. This file has no 
 problem on 1.4
 {code}
 ...
 Caused by: org.xml.sax.SAXException: Namespace http://www.w3.org/1999/xhtml 
 not declared
   at 
 org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getPrefix(ToXMLContentHandler.java:62)
   at 
 org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getQName(ToXMLContentHandler.java:68)
   at 
 org.apache.tika.sax.ToXMLContentHandler.startElement(ToXMLContentHandler.java:148)
   at 
 org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
   at 
 org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
   at 
 org.apache.tika.sax.xpath.MatchingContentHandler.startElement(MatchingContentHandler.java:60)
   at 
 org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
   at 
 org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
   at 
 org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
   at 
 org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
   at 
 org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
   at 
 org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:284)
   at 
 org.apache.tika.sax.XHTMLContentHandler.element(XHTMLContentHandler.java:323)
   at org.apache.tika.parser.mp3.Mp3Parser.parse(Mp3Parser.java:107)
   at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   at 
 com.polyspot.document.converter.DocumentConverter.realizeTikaConversion(DocumentConverter.java:221)
   ... 15 more
 {code}
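
As background for the "Namespace ... not declared" message in the trace: in SAX, a namespace is announced with startPrefixMapping before the first startElement that uses it. The sketch below only illustrates that contract. StrictNamespaceHandler and NamespaceDemo are toy classes written for this note; they are not Tika's ToXMLContentHandler, and the sketch does not claim to identify the actual cause of this regression.

```java
import java.util.HashSet;
import java.util.Set;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/** Toy handler that rejects elements whose namespace was never declared. */
final class StrictNamespaceHandler extends DefaultHandler {
    private final Set<String> declared = new HashSet<>();
    int elements = 0;

    @Override
    public void startPrefixMapping(String prefix, String uri) {
        declared.add(uri);
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) throws SAXException {
        if (!uri.isEmpty() && !declared.contains(uri)) {
            throw new SAXException("Namespace " + uri + " not declared");
        }
        elements++;
    }
}

final class NamespaceDemo {
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    /** Emits a single <p/> element, optionally declaring the XHTML namespace first. */
    static void emitParagraph(DefaultHandler handler, boolean declareFirst)
            throws SAXException {
        if (declareFirst) {
            handler.startPrefixMapping("", XHTML);
        }
        handler.startElement(XHTML, "p", "p", new AttributesImpl());
        handler.endElement(XHTML, "p", "p");
        if (declareFirst) {
            handler.endPrefixMapping("");
        }
    }
}
```

Calling emitParagraph with declareFirst set to false makes the toy handler fail with the same kind of message as in the trace; with true, the element is accepted.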



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

[jira] [Assigned] (TIKA-1226) PDFTextStripper fails while getting data of PDF form fields of type PDSignatureField

2014-01-23 Thread Tim Allison (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison reassigned TIKA-1226:
-

Assignee: Tim Allison

 PDFTextStripper fails while getting data of PDF form fields of type 
 PDSignatureField
 

 Key: TIKA-1226
 URL: https://issues.apache.org/jira/browse/TIKA-1226
 Project: Tika
  Issue Type: Bug
Affects Versions: 1.5
Reporter: Eric Knauel
Assignee: Tim Allison

 I have a PDF document that contains a filled in form. Among the various 
 fields of type text and radio button there are multiple fields for digital 
 signatures. When I load this document into tika-app I get the following 
 exception:
 {noformat}
 Caused by: java.lang.RuntimeException: Can't get signature as String, use 
 getSignature() instead.
   at 
 org.apache.pdfbox.pdmodel.interactive.form.PDSignatureField.getValue(PDSignatureField.java:131)
   at 
 org.apache.tika.parser.pdf.PDF2XHTML.addFieldString(PDF2XHTML.java:467)
   at 
 org.apache.tika.parser.pdf.PDF2XHTML.processAcroField(PDF2XHTML.java:425)
   at 
 org.apache.tika.parser.pdf.PDF2XHTML.extractAcroForm(PDF2XHTML.java:411)
   at org.apache.tika.parser.pdf.PDF2XHTML.endDocument(PDF2XHTML.java:184)
   at 
 org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:330)
   at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:95)
   at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:143)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   ... 43 more
 {noformat}
 The problem seems to be that PDF2XHTML expects that it can call 
 getValue() on all PDField objects. According to the PDFBox 1.8.3 Javadoc, 
 this is not true for the subclass PDSignatureField:
 http://pdfbox.apache.org/docs/1.8.3/javadocs/org/apache/pdfbox/pdmodel/interactive/form/PDSignatureField.html
 The Javadoc says that getSignature() should be called instead. 
 Assuming that the information inside the signature is not relevant for the 
 extraction process and can be discarded, the following patch helps:
 {noformat}
 Index: tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java
 IDEA additional info:
 Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
 +UTF-8
 ===
 --- tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java  
 (revision 1560617)
 +++ tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java  
 (revision )
 @@ -40,6 +40,7 @@
  import 
 org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineNode;
  import org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm;
  import org.apache.pdfbox.pdmodel.interactive.form.PDField;
 +import org.apache.pdfbox.pdmodel.interactive.form.PDSignatureField;
  import org.apache.pdfbox.util.PDFTextStripper;
  import org.apache.pdfbox.util.TextPosition;
  import org.apache.tika.exception.TikaException;
 @@ -464,7 +465,9 @@
}
String value = "";
try {
 +  if (!(field instanceof PDSignatureField)) {
 -  value = field.getValue();
 +  value = field.getValue();
 +  }
} catch (IOException e) {
 //swallow
}
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

[jira] [Commented] (TIKA-1226) PDFTextStripper fails while getting data of PDF form fields of type PDSignatureField

2014-01-23 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13880130#comment-13880130
 ] 

Tim Allison commented on TIKA-1226:
---

Eric,
  Thank you for reporting this.  I'll make the fix shortly.  Are you able to 
share your document as a test case?  Thank you, again.

 PDFTextStripper fails while getting data of PDF form fields of type 
 PDSignatureField
 

 Key: TIKA-1226
 URL: https://issues.apache.org/jira/browse/TIKA-1226
 Project: Tika
  Issue Type: Bug
Affects Versions: 1.5
Reporter: Eric Knauel
Assignee: Tim Allison




--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

[jira] [Commented] (TIKA-1226) PDFTextStripper fails while getting data of PDF form fields of type PDSignatureField

2014-01-23 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13880273#comment-13880273
 ] 

Tim Allison commented on TIKA-1226:
---

How about we grab the name? 
{noformat}
  if (field instanceof PDSignatureField){
  PDSignature sig = ((PDSignatureField)field).getSignature();
  if (sig != null){
  value = sig.getName();
  }
  } else {
  value = field.getValue();
  }
{noformat}

Should we also grab the contactinfo, location, the date or the reason?
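
The branch above can be tried out without PDFBox on the classpath. The sketch below uses toy stand-ins: ToyField, ToySignatureField, ToySignature and FieldText are hypothetical classes that only mimic the behaviour seen in the stack trace (getValue() throwing a RuntimeException for signature fields). It is not the actual PDF2XHTML code.

```java
import java.io.IOException;

/** Toy stand-in for a form field whose value can be read as a String. */
class ToyField {
    private final String value;
    ToyField(String value) { this.value = value; }
    String getValue() throws IOException { return value; }
}

/** Toy stand-in for the signature dictionary attached to a signature field. */
class ToySignature {
    private final String name;
    ToySignature(String name) { this.name = name; }
    String getName() { return name; }
}

/** Toy signature field: getValue() fails, as reported in the stack trace. */
class ToySignatureField extends ToyField {
    private final ToySignature signature;
    ToySignatureField(ToySignature signature) {
        super(null);
        this.signature = signature;
    }
    @Override
    String getValue() {
        throw new RuntimeException(
            "Can't get signature as String, use getSignature() instead.");
    }
    ToySignature getSignature() { return signature; }
}

final class FieldText {
    /** Returns the signer name for signature fields, the plain value otherwise. */
    static String valueOf(ToyField field) {
        String value = "";
        try {
            if (field instanceof ToySignatureField) {
                ToySignature sig = ((ToySignatureField) field).getSignature();
                if (sig != null && sig.getName() != null) {
                    value = sig.getName();
                }
            } else {
                String raw = field.getValue();
                if (raw != null) {
                    value = raw;
                }
            }
        } catch (IOException e) {
            // swallowed, as in the patch quoted in the issue
        }
        return value;
    }
}
```

An unsigned signature field (getSignature() returning null) simply yields an empty string here, so the RuntimeException from getValue() is never triggered.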

 PDFTextStripper fails while getting data of PDF form fields of type 
 PDSignatureField
 

 Key: TIKA-1226
 URL: https://issues.apache.org/jira/browse/TIKA-1226
 Project: Tika
  Issue Type: Bug
Affects Versions: 1.5
Reporter: Eric Knauel
Assignee: Tim Allison




--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

[jira] [Commented] (TIKA-1226) PDFTextStripper fails while getting data of PDF form fields of type PDSignatureField

2014-01-24 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13881383#comment-13881383
 ] 

Tim Allison commented on TIKA-1226:
---

Thank you for the test file.  I'll use that in the formal test.  I used another 
doc for dev that I unfortunately can't share.  Does this format look good?  My 
dev doc only had name and date, but the other info would also show up if it 
existed...

{noformat}
<div class="acroform">
<ol>
<li altName="name">Name: my name</li>
<li>
<ol type="signaturedata">
<li signdata="date">2014-01-17T11:57:26-0500</li>
<li signdata="name">my name</li>
</ol>
</li>
</ol>
</div>
{noformat}

 PDFTextStripper fails while getting data of PDF form fields of type 
 PDSignatureField
 

 Key: TIKA-1226
 URL: https://issues.apache.org/jira/browse/TIKA-1226
 Project: Tika
  Issue Type: Bug
Affects Versions: 1.5
Reporter: Eric Knauel
Assignee: Tim Allison
 Attachments: pdf-form-with-signature-field-empty.pdf





--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

[jira] [Comment Edited] (TIKA-1226) PDFTextStripper fails while getting data of PDF form fields of type PDSignatureField

2014-01-24 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13881383#comment-13881383
 ] 

Tim Allison edited comment on TIKA-1226 at 1/24/14 8:22 PM:


Thank you for the test file.  I'll use that in the formal test.  I used another 
doc for dev that I unfortunately can't share.  Does this format look good?  My 
dev doc only had name and date, but the other info would also show up if it 
existed...

{noformat}
<div class="acroform">
<ol>
<li altName="name">Name: my name</li>
<li>
<ol type="signaturedata">
<li signdata="date">2014-01-17T11:57:26-0500</li>
<li signdata="name">my name</li>
</ol>
</li>
</ol>
</div>
{noformat}


was (Author: talli...@mitre.org):
Thank you for the test file.  I'll use that in the formal test.  I used another 
doc for dev that I unfortunately can't share.  Does this format look good?  My 
dev doc only had name and date, but the other info would also show up if it 
existed...

{noformat}
<div class="acroform">
<ol>
<li altName="name">Name: my name</li>
<li>
<ol type="signaturedata">
<li signdata="date">2014-01-17T11:57:26-0500</li>
<li signdata="name">my name</li>
</ol>
</li>
</ol>
</div>
{noformat}

 PDFTextStripper fails while getting data of PDF form fields of type 
 PDSignatureField
 

 Key: TIKA-1226
 URL: https://issues.apache.org/jira/browse/TIKA-1226
 Project: Tika
  Issue Type: Bug
Affects Versions: 1.5
Reporter: Eric Knauel
Assignee: Tim Allison
 Attachments: pdf-form-with-signature-field-empty.pdf





--
This message was sent

[jira] [Commented] (TIKA-1228) Embedded files not extracted properly from PDF

2014-02-03 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13889697#comment-13889697
 ] 

Tim Allison commented on TIKA-1228:
---

I won't have time to fix this for a week or so, but it looks like the client 
(Tika) needs to look through the kids of embeddedFiles recursively (well, in 
this file, just one level down) to get the non-null embeddedFileNames.

Something like this does pull out the .doc file:

{no-format}
Map<String, COSObjectable> embeddedFileNames = embeddedFiles.getNames();
List<PDNameTreeNode> kids = embeddedFiles.getKids();
for (PDNameTreeNode n : kids){
    // use a separate variable; embeddedFileNames is already declared above
    Map<String, COSObjectable> kidNames = n.getNames();
    processEmbedded(kidNames, embeddedExtractor);
}

{no-format}

where processEmbedded is shorthand for the existing code:
{no-format}
if (embeddedFileNames != null){
...
}
{no-format}

We can fix this at the Tika level in the short term.  I'm not sure if this is 
the expected behavior in PDFBox.  At the least we might want to request that 
this line in the javadoc to PDDocumentNameDictionary: ("The value in this name 
tree will be PDComplexFileSpecification objects.") be changed to ("The value in 
this name tree or its children will be PDComplexFileSpecification objects.")
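
The "look through the kids recursively" idea can be sketched without PDFBox. ToyNameTreeNode and NameTreeWalker below are hypothetical stand-ins written for this note; they only assume what the comment describes, namely that a node may have a null names map, a null kids list, or both populated at different levels.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy stand-in for a name tree node; names or kids may be null. */
final class ToyNameTreeNode {
    private final Map<String, String> names;
    private final List<ToyNameTreeNode> kids;

    ToyNameTreeNode(Map<String, String> names, List<ToyNameTreeNode> kids) {
        this.names = names;
        this.kids = kids;
    }

    Map<String, String> getNames() { return names; }
    List<ToyNameTreeNode> getKids() { return kids; }
}

final class NameTreeWalker {
    /** Collects the names of the node and of all descendants, in tree order. */
    static Map<String, String> collect(ToyNameTreeNode root) {
        Map<String, String> out = new LinkedHashMap<>();
        collect(root, out);
        return out;
    }

    private static void collect(ToyNameTreeNode node, Map<String, String> out) {
        if (node == null) {
            return;
        }
        if (node.getNames() != null) {
            out.putAll(node.getNames());
        }
        if (node.getKids() != null) {
            for (ToyNameTreeNode kid : node.getKids()) {
                collect(kid, out); // descend as deep as the tree goes
            }
        }
    }
}
```

Unlike the one-level loop in the comment, this version keeps descending, so it would also cover files where the names sit more than one level below the root.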

 Embedded files not extracted properly from PDF
 --

 Key: TIKA-1228
 URL: https://issues.apache.org/jira/browse/TIKA-1228
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
 Environment: CentOS 6.5 VM
Reporter: Jason Sherman
  Labels: easyfix
 Attachments: pdf_with_doc_and_text_attached.pdf


 IAW pdfbox example here:
 http://svn.apache.org/repos/asf/pdfbox/trunk/examples/src/main/java/org/apache/pdfbox/examples/pdmodel/ExtractEmbeddedFiles.java
 the PDF parser does not check for additional entries under Kids node when 
 Names node does not exist.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

[jira] [Comment Edited] (TIKA-1228) Embedded files not extracted properly from PDF

2014-02-03 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13889697#comment-13889697
 ] 

Tim Allison edited comment on TIKA-1228 at 2/3/14 6:09 PM:
---

I won't have time to fix this for a week or so, but it looks like the client 
(Tika) needs to look through the kids of embeddedFiles recursively (well, in 
this file, just one level down) to get the non-null embeddedFileNames.

Something like this does pull out the .doc file:

{noformat}
Map<String, COSObjectable> embeddedFileNames = embeddedFiles.getNames();
List<PDNameTreeNode> kids = embeddedFiles.getKids();
for (PDNameTreeNode n : kids){
    Map<String, COSObjectable> kidNames = n.getNames();
    processEmbedded(kidNames, embeddedExtractor);
}
{noformat}

where processEmbedded is shorthand for the existing code:
{noformat}
if (embeddedFileNames != null){
...
}
{noformat}

We can fix this at the Tika level in the short term.  I'm not sure if this is 
the expected behavior in PDFBox.  At the least we might want to request that 
this line in the javadoc to PDDocumentNameDictionary: "The value in this name 
tree will be PDComplexFileSpecification objects." be changed to "The value in 
this name tree or its children will be PDComplexFileSpecification objects."


was (Author: talli...@mitre.org):
I won't have time to fix this for a week or so, but it looks like the client 
(Tika) needs to look through the kids of embeddedFiles recursively (well, in 
this file, just one level down) to get the non-null embeddedFileNames.

Something like this does pull out the .doc file:

{no-format}
Map<String, COSObjectable> embeddedFileNames = embeddedFiles.getNames();
List<PDNameTreeNode> kids = embeddedFiles.getKids();
for (PDNameTreeNode n : kids){
    Map<String, COSObjectable> kidNames = n.getNames();
    processEmbedded(kidNames, embeddedExtractor);
}
{no-format}

where processEmbedded is shorthand for the existing code:
{no-format}
if (embeddedFileNames != null){
...
}
{no-format}

We can fix this at the Tika level in the short term.  I'm not sure if this is 
the expected behavior in PDFBox.  At the least we might want to request that 
this line in the javadoc to PDDocumentNameDictionary: "The value in this name 
tree will be PDComplexFileSpecification objects." be changed to "The value in 
this name tree or its children will be PDComplexFileSpecification objects."

 Embedded files not extracted properly from PDF
 --

 Key: TIKA-1228
 URL: https://issues.apache.org/jira/browse/TIKA-1228
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
 Environment: CentOS 6.5 VM
Reporter: Jason Sherman
  Labels: easyfix
 Attachments: pdf_with_doc_and_text_attached.pdf


 IAW pdfbox example here:
 http://svn.apache.org/repos/asf/pdfbox/trunk/examples/src/main/java/org/apache/pdfbox/examples/pdmodel/ExtractEmbeddedFiles.java
 the PDF parser does not check for additional entries under Kids node when 
 Names node does not exist.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

[jira] [Comment Edited] (TIKA-1228) Embedded files not extracted properly from PDF

2014-02-03 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13889697#comment-13889697
 ] 

Tim Allison edited comment on TIKA-1228 at 2/3/14 6:11 PM:
---

I won't have time to fix this for a week or so, but, I'll take this unless 
another committer has time sooner.


was (Author: talli...@mitre.org):
I won't have time to fix this for a week or so, but it looks like the client 
(Tika) needs to look through the kids of embeddedFiles recursively (well, in 
this file, just one level down) to get the non-null embeddedFileNames.

Something like this does pull out the .doc file:

{noformat}
Map<String, COSObjectable> embeddedFileNames = embeddedFiles.getNames();
List<PDNameTreeNode> kids = embeddedFiles.getKids();
for (PDNameTreeNode n : kids){
    Map<String, COSObjectable> kidNames = n.getNames();
    processEmbedded(kidNames, embeddedExtractor);
}
{noformat}

where processEmbedded is shorthand for the existing code:
{noformat}
if (embeddedFileNames != null){
...
}
{noformat}

We can fix this at the Tika level in the short term.  I'm not sure if this is 
the expected behavior in PDFBox.  At the least we might want to request that 
this line in the javadoc to PDDocumentNameDictionary: "The value in this name 
tree will be PDComplexFileSpecification objects." be changed to "The value in 
this name tree or its children will be PDComplexFileSpecification objects."

 Embedded files not extracted properly from PDF
 --

 Key: TIKA-1228
 URL: https://issues.apache.org/jira/browse/TIKA-1228
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
 Environment: CentOS 6.5 VM
Reporter: Jason Sherman
  Labels: easyfix
 Attachments: pdf_with_doc_and_text_attached.pdf


 IAW pdfbox example here:
 http://svn.apache.org/repos/asf/pdfbox/trunk/examples/src/main/java/org/apache/pdfbox/examples/pdmodel/ExtractEmbeddedFiles.java
 the PDF parser does not check for additional entries under Kids node when 
 Names node does not exist.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

[jira] [Resolved] (TIKA-1228) Embedded files not extracted properly from PDF

2014-02-03 Thread Tim Allison (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-1228.
---

   Resolution: Fixed
Fix Version/s: 1.5

Fixed in r1564042.

Thank you, [~agi20dla], for reporting this and diagnosing the cause and 
solution for this bug!

I'm resolving this for now.  I'm waiting to hear back from users@pdfbox to see 
if we should search recursively for non-null attachment data.  The example that 
you provided does show only checking the children.  I'll reopen this issue if 
we need to switch to full recursion.

Thank you, again.

 Embedded files not extracted properly from PDF
 --

 Key: TIKA-1228
 URL: https://issues.apache.org/jira/browse/TIKA-1228
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
 Environment: CentOS 6.5 VM
Reporter: Jason Sherman
  Labels: easyfix
 Fix For: 1.5

 Attachments: pdf_with_doc_and_text_attached.pdf


 IAW pdfbox example here:
 http://svn.apache.org/repos/asf/pdfbox/trunk/examples/src/main/java/org/apache/pdfbox/examples/pdmodel/ExtractEmbeddedFiles.java
 the PDF parser does not check for additional entries under Kids node when 
 Names node does not exist.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

[jira] [Issue Comment Deleted] (TIKA-1228) Embedded files not extracted properly from PDF

2014-02-03 Thread Tim Allison (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1228:
--

Comment: was deleted

(was: I won't have time to fix this for a week or so, but, I'll take this 
unless another committer has time sooner.)

 Embedded files not extracted properly from PDF
 --

 Key: TIKA-1228
 URL: https://issues.apache.org/jira/browse/TIKA-1228
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
 Environment: CentOS 6.5 VM
Reporter: Jason Sherman
  Labels: easyfix
 Attachments: pdf_with_doc_and_text_attached.pdf


 IAW pdfbox example here:
 http://svn.apache.org/repos/asf/pdfbox/trunk/examples/src/main/java/org/apache/pdfbox/examples/pdmodel/ExtractEmbeddedFiles.java
 the PDF parser does not check for additional entries under Kids node when 
 Names node does not exist.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

[jira] [Commented] (TIKA-1228) Embedded files not extracted properly from PDF

2014-02-04 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13890605#comment-13890605
 ] 

Tim Allison commented on TIKA-1228:
---

Not sure I understand.  Is this the snippet that you refer to in PDNameTreeNode:
{noformat}
public Map<String, COSObjectable> getNames() throws IOException
{
    COSArray namesArray = (COSArray)node.getDictionaryObject( COSName.NAMES );
{noformat}

The above throws a class cast exception, but the code that you show doesn't?

Are you getting a class cast exception on the document that you submitted with 
this issue or is it a different document?

Thank you, again.

 Embedded files not extracted properly from PDF
 --

 Key: TIKA-1228
 URL: https://issues.apache.org/jira/browse/TIKA-1228
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
 Environment: CentOS 6.5 VM
Reporter: Jason Sherman
  Labels: easyfix
 Fix For: 1.5

 Attachments: pdf_with_doc_and_text_attached.pdf


 IAW pdfbox example here:
 http://svn.apache.org/repos/asf/pdfbox/trunk/examples/src/main/java/org/apache/pdfbox/examples/pdmodel/ExtractEmbeddedFiles.java
 the PDF parser does not check for additional entries under Kids node when 
 Names node does not exist.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

[jira] [Commented] (TIKA-1228) Embedded files not extracted properly from PDF

2014-02-04 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13890610#comment-13890610
 ] 

Tim Allison commented on TIKA-1228:
---

Y.  That's the point of open source. :)  Enjoy!

Now that I'm looking at this issue again, I dragged out some of my pre-Tika 
code for pdf attachments using a different pdf library.  It looks like the pdf 
files I was coding against could have the file name in a parent node and the 
actual bytes in a child or more distant descendant node.

Will see if I can dig up the triggering files and see if Tika needs any more 
mods on PDF attachment extraction.

{noformat}
private MyPDFAttachment lookForByteStream(COSDictionary dict, MyPDFAttachment attach, int recursiveDepth){

    COSName fCOSName = COSName.create("F");
    COSName efCOSName = COSName.create("EF");
    COSObject fObj = dict.get(fCOSName);
    COSObject efObj = dict.get(efCOSName);
    if (null != fObj){
        if (fObj.getClass() == COSString.class){
            attach.setName(fObj.stringValue());
        } else if (fObj.getClass() == COSStream.class){
            attach.setBytes(((COSStream)fObj).getDecodedBytes());
            return attach;
        }
    }
    if (null != efObj && efObj.getClass() == COSDictionary.class){
        int tmpI = recursiveDepth;
        tmpI++;
        return lookForByteStream((COSDictionary)efObj, attach, tmpI);
    }
    return null;
}
{noformat}
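
The method above passes a recursion depth along but never checks it. As a rough sketch of the same idea with the depth actually enforced, the version below uses plain java.util.Map objects as stand-ins for PDF dictionaries so that it runs without any PDF library. The MAX_DEPTH value, the Attachment holder, and the use of byte[] for a stream are assumptions made for illustration only.

```java
import java.util.HashMap;
import java.util.Map;

class ByteStreamSearchSketch {
    static final int MAX_DEPTH = 10; // arbitrary cap chosen for this sketch

    // Holds what has been found so far: a name from an ancestor, bytes from a descendant.
    static class Attachment {
        String name;
        byte[] bytes;
    }

    // Maps stand in for dictionaries: "F" is a String (name) or byte[] (stream),
    // and "EF" is a nested map to descend into.
    static Attachment lookForByteStream(Map<String, Object> dict, Attachment attach, int depth) {
        if (dict == null || depth > MAX_DEPTH) {
            return null; // give up rather than recurse without bound
        }
        Object f = dict.get("F");
        if (f instanceof String) {
            attach.name = (String) f;
        } else if (f instanceof byte[]) {
            attach.bytes = (byte[]) f;
            return attach;
        }
        Object ef = dict.get("EF");
        if (ef instanceof Map) {
            @SuppressWarnings("unchecked")
            Map<String, Object> child = (Map<String, Object>) ef;
            return lookForByteStream(child, attach, depth + 1);
        }
        return null;
    }

    public static void main(String[] args) {
        // Name in the parent, bytes in the child, as in the files described above.
        Map<String, Object> child = new HashMap<>();
        child.put("F", new byte[] {1, 2, 3});
        Map<String, Object> parent = new HashMap<>();
        parent.put("F", "report.doc");
        parent.put("EF", child);
        Attachment found = lookForByteStream(parent, new Attachment(), 0);
        System.out.println(found.name + " has " + found.bytes.length + " bytes");
    }
}
```

With the cap in place, a malformed file whose dictionaries refer back to themselves ends the search with null instead of overflowing the stack.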

 Embedded files not extracted properly from PDF
 --

 Key: TIKA-1228
 URL: https://issues.apache.org/jira/browse/TIKA-1228
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
 Environment: CentOS 6.5 VM
Reporter: Jason Sherman
  Labels: easyfix
 Fix For: 1.5

 Attachments: pdf_with_doc_and_text_attached.pdf


 IAW pdfbox example here:
 http://svn.apache.org/repos/asf/pdfbox/trunk/examples/src/main/java/org/apache/pdfbox/examples/pdmodel/ExtractEmbeddedFiles.java
 the PDF parser does not check for additional entries under Kids node when 
 Names node does not exist.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

[jira] [Commented] (TIKA-1228) Embedded files not extracted properly from PDF

2014-02-04 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13890613#comment-13890613
 ] 

Tim Allison commented on TIKA-1228:
---

Ok, to confirm, the PDNameTreeNode class cast exception is a non-issue?

Thanks again.

 Embedded files not extracted properly from PDF
 --

 Key: TIKA-1228
 URL: https://issues.apache.org/jira/browse/TIKA-1228
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
 Environment: CentOS 6.5 VM
Reporter: Jason Sherman
  Labels: easyfix
 Fix For: 1.5

 Attachments: pdf_with_doc_and_text_attached.pdf


 IAW pdfbox example here:
 http://svn.apache.org/repos/asf/pdfbox/trunk/examples/src/main/java/org/apache/pdfbox/examples/pdmodel/ExtractEmbeddedFiles.java
 the PDF parser does not check for additional entries under Kids node when 
 Names node does not exist.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

[jira] [Created] (TIKA-1230) Update PDFBox to v1.8.4

2014-02-04 Thread Tim Allison (JIRA)

Tim Allison created TIKA-1230:
-

 Summary: Update PDFBox to v1.8.4
 Key: TIKA-1230
 URL: https://issues.apache.org/jira/browse/TIKA-1230
 Project: Tika
  Issue Type: Improvement
Affects Versions: 1.5
Reporter: Tim Allison
Priority: Trivial
 Fix For: 1.5






--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

[jira] [Resolved] (TIKA-1230) Update PDFBox to v1.8.4

2014-02-04 Thread Tim Allison (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-1230.
---

Resolution: Fixed

r1564335

 Update PDFBox to v1.8.4
 ---

 Key: TIKA-1230
 URL: https://issues.apache.org/jira/browse/TIKA-1230
 Project: Tika
  Issue Type: Improvement
Affects Versions: 1.5
Reporter: Tim Allison
Priority: Trivial
 Fix For: 1.5






--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

[jira] [Created] (TIKA-1231) Safely handle null embedded files in PDFs

2014-02-04 Thread Tim Allison (JIRA)

Tim Allison created TIKA-1231:
-

 Summary: Safely handle null embedded files in PDFs
 Key: TIKA-1231
 URL: https://issues.apache.org/jira/browse/TIKA-1231
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.5
Reporter: Tim Allison
Assignee: Tim Allison
Priority: Minor
 Fix For: 1.5


I filed a potential fix, unit test and test doc for this in PDFBOX-1884.  We'll 
need to add one test for null in the Tika PDFParser to handle this change once 
it is fixed in PDFBox.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

[jira] [Assigned] (TIKA-1232) Add PDF version to PDFParser output

2014-02-05 Thread Tim Allison (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison reassigned TIKA-1232:
-

Assignee: Tim Allison

 Add PDF version to PDFParser output
 ---

 Key: TIKA-1232
 URL: https://issues.apache.org/jira/browse/TIKA-1232
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.5
 Environment: JDK6
Reporter: William Palmer
Assignee: Tim Allison
Priority: Minor
 Attachments: pdfversion.patch


 I'd like to identify the PDF version of files, this is not currently reported 
 by the PDFParser although the information is available via PDFBox.  I have 
 attached a patch that adds the format version to the Metadata object.
 However, I am not familiar enough with the Tika source to know if an 
 alternative metadata key should be used, or this new one added.
 Comments welcome.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

[jira] [Commented] (TIKA-1232) Add PDF version to PDFParser output

2014-02-05 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13892146#comment-13892146
 ] 

Tim Allison commented on TIKA-1232:
---

How about Application-Version to follow the deprecated example in 
org.apache.tika.metadata.MSOffice?

Tika Community,
  Is there a more appropriate label for this?  I didn't find anything relevant 
in TikaCoreProperties.  Thank you.
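
For a sense of what value such a key would carry, here is a small sketch that reads the version from the "%PDF-M.m" header at the start of a file. This is not the approach in the attached patch, which reads the version through PDFBox; the class and method names are made up for illustration. The header is also only one place a version is recorded, since the document catalog can carry a Version entry that overrides it.

```java
import java.nio.charset.StandardCharsets;

class PdfHeaderVersionSketch {
    // Returns e.g. "1.4" for data beginning with "%PDF-1.4", or null if no header is found.
    static String headerVersion(byte[] firstBytes) {
        int len = Math.min(firstBytes.length, 1024); // only look near the start of the file
        String head = new String(firstBytes, 0, len, StandardCharsets.ISO_8859_1);
        int start = head.indexOf("%PDF-");
        if (start < 0) {
            return null;
        }
        int pos = start + "%PDF-".length();
        int end = pos;
        // Accept digits and dots only, so trailing comment bytes or newlines are excluded.
        while (end < head.length()
                && (Character.isDigit(head.charAt(end)) || head.charAt(end) == '.')) {
            end++;
        }
        return end > pos ? head.substring(pos, end) : null;
    }

    public static void main(String[] args) {
        byte[] sample = "%PDF-1.4\n% sample".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(headerVersion(sample));
    }
}
```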

 Add PDF version to PDFParser output
 ---

 Key: TIKA-1232
 URL: https://issues.apache.org/jira/browse/TIKA-1232
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.5
 Environment: JDK6
Reporter: William Palmer
Assignee: Tim Allison
Priority: Minor
 Attachments: pdfversion.patch


 I'd like to identify the PDF version of files, this is not currently reported 
 by the PDFParser although the information is available via PDFBox.  I have 
 attached a patch that adds the format version to the Metadata object.
 However, I am not familiar enough with the Tika source to know if an 
 alternative metadata key should be used, or this new one added.
 Comments welcome.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

[jira] [Created] (TIKA-1233) PDFBox can throw StringIndexOutOfBoundsException on some dates

2014-02-05 Thread Tim Allison (JIRA)

Tim Allison created TIKA-1233:
-

 Summary: PDFBox can throw StringIndexOutOfBoundsException on some 
dates
 Key: TIKA-1233
 URL: https://issues.apache.org/jira/browse/TIKA-1233
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.5
Reporter: Tim Allison
Priority: Trivial
 Fix For: 1.6


PDFBOX's date parser can throw a StringIndexOutOfBoundsException if a date 
string for parsing is empty or contains only spaces.  A few of my test pdfs 
have this feature.

I've raised PDFBOX-1883.  Until that is resolved, we can add an extra catch to 
prevent this from causing problems in TIKA

{noformat}
@@ -171,6 +171,9 @@
 addMetadata(metadata, TikaCoreProperties.CREATED, 
info.getCreationDate());
 } catch (IOException e) {
 // Invalid date format, just ignore
+} catch (StringIndexOutOfBoundsException e){
+//remove after PDFBOX-1883 is fixed
+// Invalid date format, just ignore
 }
 try {
 Calendar modified = info.getModificationDate();
@@ -178,6 +181,9 @@
 addMetadata(metadata, TikaCoreProperties.MODIFIED, modified);
 } catch (IOException e) {
 // Invalid date format, just ignore
+} catch (StringIndexOutOfBoundsException e){
+//remove after PDFBOX-1883 is fixed
+// Invalid date format, just ignore
 }

{noformat}

I'd commit now, but I don't want to interfere with cutting of 1.5.  Let me know 
if I should commit, or please do it for me if appropriate.
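
As an alternative to the extra catch, blank date strings could be screened out before the parser is called at all. A minimal sketch of that guard follows; fragileParse is a toy stand-in that only mimics unchecked indexing into the string, not the actual PDFBox date parser, and parseOrNull is a hypothetical helper name.

```java
import java.util.Calendar;
import java.util.GregorianCalendar;

class SafeDateSketch {
    // Toy stand-in for a fragile parser: it indexes into the string without a length check.
    static Calendar fragileParse(String s) {
        int year = Integer.parseInt(s.substring(0, 4)); // throws on "" or other short input
        return new GregorianCalendar(year, Calendar.JANUARY, 1);
    }

    // Returns null for blank or malformed input instead of letting the exception escape.
    static Calendar parseOrNull(String s) {
        if (s == null || s.trim().isEmpty()) {
            return null; // empty or whitespace-only date string: nothing to parse
        }
        try {
            return fragileParse(s.trim());
        } catch (StringIndexOutOfBoundsException | NumberFormatException e) {
            return null; // invalid date format, just ignore
        }
    }

    public static void main(String[] args) {
        System.out.println(parseOrNull("   "));                            // blank input
        System.out.println(parseOrNull("2014-02-05").get(Calendar.YEAR)); // well-formed input
    }
}
```

The catch is kept as a second line of defence, because short non-blank strings would still reach the parser.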



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

[jira] [Commented] (TIKA-1232) Add PDF version to PDFParser output

2014-02-06 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13893380#comment-13893380
 ] 

Tim Allison commented on TIKA-1232:
---

Interesting.  Thank you, [~johanvanderknijff] and [~anjackson].  I personally 
like Extended-Content-Type, but following 
(http://wiki.apache.org/tika/MetadataRoadmap), is there someone more familiar 
with Dublin Core and/or XMP who could recommend appropriate tags?  Many 
apologies if either one of those recommends Extended-Content-Type :).

 Add PDF version to PDFParser output
 ---

 Key: TIKA-1232
 URL: https://issues.apache.org/jira/browse/TIKA-1232
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.5
 Environment: JDK6
Reporter: William Palmer
Assignee: Tim Allison
Priority: Minor
 Attachments: pdfversion.patch


 I'd like to identify the PDF version of files, this is not currently reported 
 by the PDFParser although the information is available via PDFBox.  I have 
 attached a patch that adds the format version to the Metadata object.
 However, I am not familiar enough with the Tika source to know if an 
 alternative metadata key should be used, or this new one added.
 Comments welcome.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

[jira] [Comment Edited] (TIKA-1232) Add PDF version to PDFParser output

2014-02-06 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13893380#comment-13893380
 ] 

Tim Allison edited comment on TIKA-1232 at 2/6/14 2:31 PM:
---

Interesting.  Thank you, [~johanvanderknijff] and [~anjackson].  I personally 
like Extended-Content-Type, but following 
(http://wiki.apache.org/tika/MetadataRoadmap), is there someone more familiar 
(than I am) with Dublin Core and/or XMP who could recommend appropriate tags?  
Many apologies if either one of those recommends Extended-Content-Type :).


was (Author: talli...@mitre.org):
Interesting.  Thank you, [~johanvanderknijff] and [~anjackson].  I personally 
like Extended-Content-Type, but following 
(http://wiki.apache.org/tika/MetadataRoadmap), is there someone more familiar 
with Dublin Core and/or XMP who could recommend appropriate tags?  Many 
apologies if either one of those recommends Extended-Content-Type :).

 Add PDF version to PDFParser output
 ---

 Key: TIKA-1232
 URL: https://issues.apache.org/jira/browse/TIKA-1232
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.5
 Environment: JDK6
Reporter: William Palmer
Assignee: Tim Allison
Priority: Minor
 Attachments: pdfversion.patch


 I'd like to identify the PDF version of files, this is not currently reported 
 by the PDFParser although the information is available via PDFBox.  I have 
 attached a patch that adds the format version to the Metadata object.
 However, I am not familiar enough with the Tika source to know if an 
 alternative metadata key should be used, or this new one added.
 Comments welcome.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

[jira] [Commented] (TIKA-1232) Add PDF version to PDFParser output

2014-02-06 Thread Tim Allison (JIRA)

[
https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13893426#comment-13893426
]

Tim Allison commented on TIKA-1232:
---

[~anjackson], y, I'd like to add your code if others agree that it would be
useful. No need for a formal patch. I'll take your github code nearly
directly.

Two items:
1) Would you be interested in contributing your extension-level extraction
code to PDFBox if it doesn't currently exist there (I haven't checked but I
assume you wouldn't reinvent the wheel). I think that would be more at home
within PDFBox.
2) How much testing have you done for potential exceptions thrown by PDFBox
on pdfs in the wild when grabbing this new metadata (cf. null pointer checks
around date parsing in current metadata code and TIKA-1226, TIKA-1232,
TIKA-1233)?

Thank you, again.

Add PDF version to PDFParser output
---

Key: TIKA-1232
URL: https://issues.apache.org/jira/browse/TIKA-1232
Project: Tika
Issue Type: Improvement
Components: parser
Affects Versions: 1.5
Environment: JDK6
Reporter: William Palmer
Assignee: Tim Allison
Priority: Minor
Attachments: pdfversion.patch

I'd like to identify the PDF version of files, this is not currently reported
by the PDFParser although the information is available via PDFBox. I have
attached a patch that adds the format version to the Metadata object.
However, I am not familiar enough with the Tika source to know if an
alternative metadata key should be used, or this new one added.
Comments welcome.

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

[jira] [Updated] (TIKA-1233) PDFBox can throw StringIndexOutOfBoundsException on some dates

2014-02-09 Thread Tim Allison (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1233:
--

Description: 
PDFBOX's date parser can throw a StringIndexOutOfBoundsException if a date 
string for parsing is empty or contains only spaces.  A few of my test pdfs 
have this feature.

Until PDFBOX-1803 is resolved, we can add an extra catch to prevent this from 
causing problems in TIKA

{noformat}
@@ -171,6 +171,9 @@
 addMetadata(metadata, TikaCoreProperties.CREATED, 
info.getCreationDate());
 } catch (IOException e) {
 // Invalid date format, just ignore
+} catch (StringIndexOutOfBoundsException e){
+//remove after PDFBOX-1883 is fixed
+// Invalid date format, just ignore
 }
 try {
 Calendar modified = info.getModificationDate();
@@ -178,6 +181,9 @@
 addMetadata(metadata, TikaCoreProperties.MODIFIED, modified);
 } catch (IOException e) {
 // Invalid date format, just ignore
+} catch (StringIndexOutOfBoundsException e){
+//remove after PDFBOX-1883 is fixed
+// Invalid date format, just ignore
 }

{noformat}


  was:
PDFBOX's date parser can throw a StringIndexOutOfBoundsException if a date 
string for parsing is empty or contains only spaces.  A few of my test pdfs 
have this feature.

I've raised PDFBOX-1883.  Until that is resolved, we can add an extra catch to 
prevent this from causing problems in TIKA

{noformat}
@@ -171,6 +171,9 @@
 addMetadata(metadata, TikaCoreProperties.CREATED, 
info.getCreationDate());
 } catch (IOException e) {
 // Invalid date format, just ignore
+} catch (StringIndexOutOfBoundsException e){
+//remove after PDFBOX-1883 is fixed
+// Invalid date format, just ignore
 }
 try {
 Calendar modified = info.getModificationDate();
@@ -178,6 +181,9 @@
 addMetadata(metadata, TikaCoreProperties.MODIFIED, modified);
 } catch (IOException e) {
 // Invalid date format, just ignore
+} catch (StringIndexOutOfBoundsException e){
+//remove after PDFBOX-1883 is fixed
+// Invalid date format, just ignore
 }

{noformat}

I'd commit now, but I don't want to interfere with cutting of 1.5.  Let me know 
if I should commit, or please do it for me if appropriate.


 PDFBox can throw StringIndexOutOfBoundsException on some dates
 --

 Key: TIKA-1233
 URL: https://issues.apache.org/jira/browse/TIKA-1233
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.5
Reporter: Tim Allison
Priority: Trivial
  Labels: easyfix
 Fix For: 1.6


 PDFBOX's date parser can throw a StringIndexOutOfBoundsException if a date 
 string for parsing is empty or contains only spaces.  A few of my test pdfs 
 have this feature.
 Until PDFBOX-1803 is resolved, we can add an extra catch to prevent this from 
 causing problems in TIKA
 {noformat}
 @@ -171,6 +171,9 @@
  addMetadata(metadata, TikaCoreProperties.CREATED, 
 info.getCreationDate());
  } catch (IOException e) {
  // Invalid date format, just ignore
 +} catch (StringIndexOutOfBoundsException e){
 +//remove after PDFBOX-1883 is fixed
 +// Invalid date format, just ignore
  }
  try {
  Calendar modified = info.getModificationDate();
 @@ -178,6 +181,9 @@
  addMetadata(metadata, TikaCoreProperties.MODIFIED, modified);
  } catch (IOException e) {
  // Invalid date format, just ignore
 +} catch (StringIndexOutOfBoundsException e){
 +//remove after PDFBOX-1883 is fixed
 +// Invalid date format, just ignore
  }
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
