RE: Command line utilities issue a document display warning

2017-10-05 Thread Allison, Timothy B.
the text. I only suggest using the command line utilities because they exhibit the problem that happens in my own code. -Original Message- From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Thursday, October 05, 2017 12:21 PM To: users@pdfbox.apache.org Subject: RE: Command line

RE: Command line utilities issue a document display warning

2017-10-05 Thread Allison, Timothy B.
If you want to extract the text from a document with an XFA, Apache Tika (which relies on PDFBox) should be able to extract the text. -Original Message- From: John Liston [mailto:list...@asconline.com] Sent: Thursday, October 5, 2017 11:56 AM To: users@pdfbox.apache.org Subject: Command

RE: Extracting rotated text

2017-09-25 Thread Allison, Timothy B.
, Timothy B.: > Colleagues, > Any recommendations for extracting rotated text such as: > https://www.fsis.usda.gov/wps/wcm/connect/896bf55c-0d78-44a0-adfb-94f893eb0f72/GallagherEbelKause_74.pdf?MOD=AJPERES > ? > > Adobe DC gets reasonable text with "save as text". PDFBox's E

Extracting rotated text

2017-09-25 Thread Allison, Timothy B.
Colleagues, Any recommendations for extracting rotated text such as: https://www.fsis.usda.gov/wps/wcm/connect/896bf55c-0d78-44a0-adfb-94f893eb0f72/GallagherEbelKause_74.pdf?MOD=AJPERES ? Adobe DC gets reasonable text with "save as text". PDFBox's ExtractText (and Tika) get something like

RE: tracking missing Unicode mappings?

2017-09-21 Thread Allison, Timothy B.
there. But this is all just a thought. I did not implement anything. Tilman Am 21.09.2017 um 22:07 schrieb Allison, Timothy B.: > All, > > How much effort would it be to track/calculate a ratio of characters with > missing Unicode mappings to those with mappings for a given page? It would >

tracking missing Unicode mappings?

2017-09-21 Thread Allison, Timothy B.
All, How much effort would it be to track/calculate a ratio of characters with missing Unicode mappings to those with mappings for a given page? It would be neat after trying to extract text from a page to be able to tell how many characters are lost. We could use this info on Tika to
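
As a rough illustration of the bookkeeping described above, here is a minimal sketch that tallies mapped and unmapped glyphs for one page. It is not PDFBox code: the class `UnmappedRatio`, its nested `PageTally`, and the convention that a null or empty string means "no Unicode mapping" are all assumptions made for the example. It reports the unmapped share of all glyphs rather than the unmapped-to-mapped ratio, which avoids dividing by zero on a page where nothing maps.

```java
import java.util.Arrays;
import java.util.List;

class UnmappedRatio {
    /** Hypothetical per-page tally: glyphs with a Unicode mapping vs. without. */
    static final class PageTally {
        int mapped;
        int unmapped;

        /** Record one glyph; null or empty stands for "no Unicode mapping" (an assumption). */
        void record(String unicode) {
            if (unicode == null || unicode.isEmpty()) {
                unmapped++;
            } else {
                mapped++;
            }
        }

        /** Fraction of glyphs on the page with no mapping; 0.0 for an empty page. */
        double unmappedFraction() {
            int total = mapped + unmapped;
            return total == 0 ? 0.0 : (double) unmapped / total;
        }
    }

    /** Tally a page given the Unicode string (or null) resolved for each glyph. */
    static PageTally tally(List<String> glyphs) {
        PageTally tally = new PageTally();
        for (String glyph : glyphs) {
            tally.record(glyph);
        }
        return tally;
    }

    public static void main(String[] args) {
        // Four glyphs, two of which have no mapping.
        PageTally t = tally(Arrays.asList("a", null, "b", ""));
        System.out.println("mapped=" + t.mapped + " unmapped=" + t.unmapped
                + " fraction=" + t.unmappedFraction());
    }
}
```

A caller could compare `unmappedFraction()` against a threshold of its own choosing to decide whether the extracted text of a page is trustworthy.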

RE: Make PDFBox fail on bad pdf

2017-03-30 Thread Allison, Timothy B.
If you have any recommendations for the more general case, let us know on TIKA-1443 [1]. [1] https://issues.apache.org/jira/browse/TIKA-1443 -Original Message- From: Wouter De Borger [mailto:wouter.debor...@inmanta.com] Sent: Thursday, March 30, 2017 6:00 AM To: users@pdfbox.apache.org

RE: Fwd: Trouble reading IEEE pdf

2017-02-02 Thread Allison, Timothy B.
I think I'm getting most of the text with pdfbox app's ExtractText... What text are you missing, specifically? Or, if you're missing the entire body, perhaps look at ExtractText to grab more content? -Original Message- From: pulkit@gmail.com [mailto:pulkit@gmail.com] On Behalf

RE: Extracting/rendering jp2?

2016-11-10 Thread Allison, Timothy B.
Thank you! -Original Message- From: Tilman Hausherr [mailto:thaush...@t-online.de] Sent: Thursday, November 10, 2016 12:33 AM To: users@pdfbox.apache.org Subject: Re: Extracting/rendering jp2? Am 10.11.2016 um 04:29 schrieb Allison, Timothy B.: >>> private static final

RE: Extracting/rendering jp2?

2016-11-09 Thread Allison, Timothy B.
Got it. Thank you! -Original Message- From: Tilman Hausherr [mailto:thaush...@t-online.de] Sent: Wednesday, November 9, 2016 3:38 PM To: users@pdfbox.apache.org Subject: Re: Extracting/rendering jp2? Am 09.11.2016 um 21:26 schrieb Allison, Timothy B.: >> what do you need? The

RE: OOMs extracting inline images

2016-11-09 Thread Allison, Timothy B.
>IMHO we should have a look if this is a known issue/expected behaviour or >something new. Could you provide at least one pdf for every case? Y. Will open an issue in the next few days so that we can share files and determine if this is a problem on our side or expected behavior. I have to wait

RE: Extracting/rendering jp2?

2016-11-09 Thread Allison, Timothy B.
>what do you need? The image in any format (e.g. png), or the image in the >original JP2 compression? Ideally the original JP2 compression. >And if you're using ImageIOUtil.writeImage(), what is the parameter in suffix? >If it is JP2, then you'd need to have some plugin for it. jpx >So it

RE: Tika calling exiftool and ffmpeg?

2016-09-01 Thread Allison, Timothy B.
Forwarded to users@tika From: Chris Bamford [mailto:cbamf...@mimecast.com] Sent: Thursday, September 1, 2016 7:03 AM To: Subject: Tika calling exiftool and ffmpeg? Hi I recently noticed on my linux box in the auditd logs that my JVM is making

RE: associating text with a PDActionURI?

2016-07-07 Thread Allison, Timothy B.
> Hmm, because it's you, I'll try it myself :-) Thank you, Tilman! > You can't really know for sure with the classic text extraction, but you > could use the extractTextByArea example with the rect coordinates. Based on your example, though, I think this should work. If I cache the

RE: associating text with a PDActionURI?

2016-07-07 Thread Allison, Timothy B.
tDir: 5.52 x: 146.30115 xDirAdj: 146.30115 y: 425.93 yDirAdj: 425.93 height: 5.52 heightDir: 5.52 x: 151.22 xDirAdj: 151.22 y: 425.93 yDirAdj: 425.93 height: 5.52 heightDir: 5.52 -Original Message----- From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Thursday, July 7, 2016 8:04 AM

associating text with a PDActionURI?

2016-07-07 Thread Allison, Timothy B.
All, Is there a recipe for associating a hyperlink to text on the page? Over on Tika, we're dumping these as at the end of each page. If it isn't too hard, it would be great to associate these links with text, e.g. <a href="http://tika.apache.org">tika</a>. This is related to PDFBOX-1143 and TIKA-2029.

RE: OCRing extracted inline images vs. fully rendered pages?

2016-05-17 Thread Allison, Timothy B.
>We have an experimental integration with Tesseract which was created a while >ago by a GSoC student. Because it requires building C++ we’ve not integrated >it into trunk, but do have it on the todo list for 2.1. Ah, very cool. Y, I'd trust you all to do a better job of integrating OCR for

OCRing extracted inline images vs. fully rendered pages?

2016-05-17 Thread Allison, Timothy B.
All, On Tika, users can choose to run OCR on inline images (and attached images, of course). Would it be better for us to render each full page and then run OCR on that? Best, Tim

RE: is it possible to batch extract text from pdf files within a tree of folders within a zip file ?

2016-05-02 Thread Allison, Timothy B.
>> While PDFBox is a part of TIKA and the two projects are kind of "best friends >> forever" Thank you, Tilman! :) -Original Message- From: Tilman Hausherr [mailto:thaush...@t-online.de] Sent: Saturday, April 30, 2016 5:24 PM To: users@pdfbox.apache.org Subject: Re: is it possible to

RE: is it possible to batch extract text from pdf files within a tree of folders within a zip file ?

2016-04-20 Thread Allison, Timothy B.
Might want to look at Tika (which uses PDFBox) for that. Let's say you have an <input_dir> that contains your zips. java -jar tika-app.jar -J -t -i <input_dir> -o <output_dir> See if that gets you close enough. -Original Message- From: davidgreen.co...@gmail.com [mailto:davidgreen.co...@gmail.com] On Behalf Of David

RE: C# Version of PDFBox?

2016-03-31 Thread Allison, Timothy B.
Could PDFBox's webapp or tika-server, which wraps PDFBox, be of any use? -Original Message- From: Neil Pitman [mailto:neil.pit...@aquaforest.com] Sent: Wednesday, March 30, 2016 11:06 AM To: users@pdfbox.apache.org Subject: RE: C# Version of PDFBox? It could but would require some

RE: Issues with extraction content of PDF files

2015-12-18 Thread Allison, Timothy B.
Colleagues, So that you don't have to do the initial diagnosis at least. From [0]: >>That said, PDFBox 2.0-RC2 extracts no text and warns: WARNING: No Unicode >>mapping for CID+71 (71) in font 505Eddc6Arial >>So, if the file has no Unicode mapping for the font, I doubt they'll be able >>to

Subclassing BaseParser?

2015-10-03 Thread Allison, Timothy B.
All, I'm probably suffering from the same failure that led to (https://issues.apache.org/jira/browse/TIKA-1678?focusedCommentId=14640370&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14640370), but is it possible to subclass BaseParser outside of the oap.pdfparser

TIKA-1678 PDF metadata extraction and UTF-16 encodings in the xmp

2015-07-15 Thread Allison, Timothy B.
All, Andrew Jackson recently opened TIKA-1678. Tika tries to use Dublin Core items from the xmp, and if those don't exist, it takes what it can find from the regular metadata. Andrew found that for ~200k out of 21 million files, the UTF-16 is incorrectly (? doubly?) encoded in the xmp:

per page processing?

2015-07-15 Thread Allison, Timothy B.
All, Raymond Wu recently opened TIKA-1679 and recommended that we switch to per-page processing so that if there's an exception on one page, we'll still be able to extract contents from other pages. The proposed fix is along these lines: int nop = document.getNumberOfPages();
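
The per-page idea can be sketched without PDFBox on the classpath. In the sketch below, `PageExtractor` is a hypothetical stand-in for whatever extracts one page (in PDFBox this would plausibly be a `PDFTextStripper` restricted to a single page via `setStartPage` and `setEndPage`), and `PerPageExtraction` and `Result` are names invented for the example. This is not the fix proposed on TIKA-1679, only an illustration of isolating failures to the page that caused them.

```java
import java.util.ArrayList;
import java.util.List;

class PerPageExtraction {
    /** Hypothetical stand-in for extracting the text of one 1-based page. */
    interface PageExtractor {
        String extract(int pageNumber) throws Exception;
    }

    /** Text of the pages that succeeded, plus the numbers of the pages that failed. */
    static final class Result {
        final StringBuilder text = new StringBuilder();
        final List<Integer> failedPages = new ArrayList<>();
    }

    /** Extracts pages one at a time so an exception on one page does not lose the rest. */
    static Result extractAll(int numberOfPages, PageExtractor extractor) {
        Result result = new Result();
        for (int page = 1; page <= numberOfPages; page++) {
            try {
                result.text.append(extractor.extract(page));
            } catch (Exception e) {
                // Record the failure and continue with the next page.
                result.failedPages.add(page);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Simulate a three-page document whose second page throws.
        Result r = extractAll(3, page -> {
            if (page == 2) {
                throw new IllegalStateException("simulated bad page");
            }
            return "page " + page + "\n";
        });
        System.out.print(r.text);
        System.out.println("failed pages: " + r.failedPages);
    }
}
```

Catching `Exception` per page is deliberately broad here; a real integration would probably want to log the cause and decide which exception types are safe to swallow.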

RE: per page processing?

2015-07-15 Thread Allison, Timothy B.
Onward. Thank you! -Original Message- From: John Hewson [mailto:j...@jahewson.com] Sent: Wednesday, July 15, 2015 5:09 PM To: users@pdfbox.apache.org Subject: Re: per page processing? On 15 Jul 2015, at 04:52, Allison, Timothy B. talli...@mitre.org wrote: All, Raymond Wu

FW: xmp parsing issue -- xmp should start with a processing instruction

2015-07-07 Thread Allison, Timothy B.
All, This is a separate issue from the one I raised in PDFBox-2855. This, too, was initially noted by Jeremy Anderson on TIKA-1285. I'm not sure if this is a problem with the way our xmp was generated or with the xmp parser. I'm fairly confident it is the former, but wanted to check. In our test suite,

RE: setting permissions on a new document

2015-02-23 Thread Allison, Timothy B.
Alright. After the exorcism, all is working. I have no idea why it wasn't working before. Thank you, Tilman! -Original Message- From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Friday, February 20, 2015 6:42 PM To: users@pdfbox.apache.org Subject: RE: setting permissions

RE: setting permissions on a new document

2015-02-20 Thread Allison, Timothy B.
[mailto:thaush...@t-online.de] Sent: Friday, February 20, 2015 5:25 PM To: users@pdfbox.apache.org Subject: Re: setting permissions on a new document Hi Tim, add a page to the document. PDPage page = new PDPage(); document.addPage(page); Tilman Am 20.02.2015 um 22:12 schrieb Allison, Timothy B

setting permissions on a new document

2015-02-20 Thread Allison, Timothy B.
All, I'm trying to create a test doc for permission checking over on Tika, when I try the most basic program: public static void main(String[] args) throws Exception { File f = new File("C:/temp/testPDF_protected.pdf"); PDDocument document = new PDDocument();

extracting embedded documents -- will getEmbeddedFile() alone miss embedded DOS/Unix/Mac files?

2014-07-23 Thread Allison, Timothy B.
All, Over on Tika, it looks like we copied org.apache.pdfbox.examples.pdmodel.ExtractEmbeddedFiles to extract embedded files. As I look at the source code for PDComplexFileSpecification, I notice that getEmbeddedFile() does not behave like getFilename(); that is, it doesn't iterate through

RE: Radio Groups

2014-06-23 Thread Allison, Timothy B.
Maybe a newbie answer...are you seeing any different values on the PDRadioCollection or its kids with: getAlternateNameField() getFullyQualifiedName() getPartialName() Or did your client really use the same name for all three field name types for the two buttons? Is setValue(String value) on

Eyebrow-raising memory consumption exporting PDXObjectImages in PDFBox 1.8

2014-05-23 Thread Allison, Timothy B.
All, Over on Tika, we recently added the ability to export PDXObjectImages (TIKA-1268) as we do now with regular attachments. Some users have noticed some eyebrow-raising memory consumption after we made the change with some files. We're currently using PDFBox 1.8.5. This 4MB file shows

attachments

2014-02-03 Thread Allison, Timothy B.
All, According to the code in this example (examples/src/main/java/org/apache/pdfbox/examples/pdmodel/ExtractEmbeddedFiles.java), the names for the embedded files can exist in efTree or in children of efTree. Does anyone happen to know if client code needs to check further descendants than
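
One way to avoid depending on how deep the names sit is to walk the tree recursively. The sketch below uses a hypothetical `Node` type with direct `names` and child `kids`; it is not the PDFBox name-tree API, and the class `NameTreeWalk` is invented for the example. The assumption behind it is that a name tree may contain intermediate nodes, so names could appear below the first level of children.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class NameTreeWalk {
    /** Hypothetical stand-in for a name-tree node: direct names plus child nodes. */
    static final class Node {
        final Map<String, String> names = new LinkedHashMap<>();
        final List<Node> kids = new ArrayList<>();
    }

    /** Collects names from the node and from all of its descendants, depth first. */
    static Map<String, String> collect(Node root) {
        Map<String, String> out = new LinkedHashMap<>();
        collectInto(root, out);
        return out;
    }

    private static void collectInto(Node node, Map<String, String> out) {
        if (node == null) {
            return;
        }
        out.putAll(node.names);
        for (Node kid : node.kids) {
            collectInto(kid, out);
        }
    }

    public static void main(String[] args) {
        // Root -> kid -> grandkid, with names at the first and third levels.
        Node root = new Node();
        Node kid = new Node();
        Node grandkid = new Node();
        root.names.put("a.txt", "file A");
        grandkid.names.put("b.txt", "file B");
        kid.kids.add(grandkid);
        root.kids.add(kid);
        System.out.println(collect(root).keySet());
    }
}
```

The values are plain strings here only to keep the example self-contained; in practice they would be whatever file-specification object the library returns.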