tracking missing Unicode mappings?

2017-09-21 Thread Allison, Timothy B.
All, How much effort would it be to track/calculate a ratio of characters with missing Unicode mappings to those with mappings for a given page? It would be neat after trying to extract text from a page to be able to tell how many characters are lost. We could use this info on Tika to determi
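A minimal sketch of the bookkeeping such a ratio would need, assuming the text extractor can report for each glyph whether a Unicode mapping was found. The class and method names are hypothetical and are not part of PDFBox or Tika.

```java
/** Hypothetical per-page counter; not part of PDFBox or Tika. */
public class UnicodeMappingStats {
    private long mapped;
    private long unmapped;

    /** Record one glyph; hasMapping is whatever the extractor reports for it. */
    public void record(boolean hasMapping) {
        if (hasMapping) {
            mapped++;
        } else {
            unmapped++;
        }
    }

    /** Fraction of glyphs on the page with no Unicode mapping; 0.0 for an empty page. */
    public double missingFraction() {
        long total = mapped + unmapped;
        return total == 0 ? 0.0 : (double) unmapped / total;
    }
}
```

The sketch reports the missing share of all glyphs rather than missing-to-mapped, so a page with no mapped characters at all does not divide by zero.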

RE: tracking missing Unicode mappings?

2017-09-21 Thread Allison, Timothy B.
here. But this is all just a thought. I did not implement anything. Tilman Am 21.09.2017 um 22:07 schrieb Allison, Timothy B.: > All, > > How much effort would it be to track/calculate a ratio of characters with > missing Unicode mappings to those with mappings for a given page? It

Extracting rotated text

2017-09-25 Thread Allison, Timothy B.
Colleagues, Any recommendations for extracting rotated text such as: https://www.fsis.usda.gov/wps/wcm/connect/896bf55c-0d78-44a0-adfb-94f893eb0f72/GallagherEbelKause_74.pdf?MOD=AJPERES ? Adobe DC gets reasonable text with "save as text". PDFBox's ExtractText (and Tika) get something like this

RE: Extracting rotated text

2017-09-25 Thread Allison, Timothy B.
eb Allison, Timothy B.: > Colleagues, > Any recommendations for extracting rotated text such as: > https://www.fsis.usda.gov/wps/wcm/connect/896bf55c-0d78-44a0-adfb-94f893eb0f72/GallagherEbelKause_74.pdf?MOD=AJPERES > ? > > Adobe DC gets reasonable text with "save as text"

RE: Command line utilities issue a document display warning

2017-10-05 Thread Allison, Timothy B.
If you want to extract the text from a document with an XFA, Apache Tika (which relies on PDFBox) should be able to extract the text. -Original Message- From: John Liston [mailto:list...@asconline.com] Sent: Thursday, October 5, 2017 11:56 AM To: users@pdfbox.apache.org Subject: Command

RE: Command line utilities issue a document display warning

2017-10-05 Thread Allison, Timothy B.
t the text. I only suggest using the command line utilities because they exhibit the problem that happens in my own code. -Original Message- From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Thursday, October 05, 2017 12:21 PM To: users@pdfbox.apache.org Subject: RE: Command

setting permissions on a new document

2015-02-20 Thread Allison, Timothy B.
All, I'm trying to create a test doc for permission checking over on Tika, when I try the most basic program: public static void main(String[] args) throws Exception { File f = new File("C:/temp/testPDF_protected.pdf"); PDDocument document = new PDDocument(); Access

RE: setting permissions on a new document

2015-02-20 Thread Allison, Timothy B.
ument.addPage(page); Tilman Am 20.02.2015 um 22:12 schrieb Allison, Timothy B.: > All, >I'm trying to create a test doc for permission checking over on Tika, > when I try the most basic program: > > public static void main(String[] args) throws Exception { >

RE: setting permissions on a new document

2015-02-23 Thread Allison, Timothy B.
Alright. After the exorcism, all is working. I have no idea why it wasn't working before. Thank you, Tilman! -Original Message- From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Friday, February 20, 2015 6:42 PM To: users@pdfbox.apache.org Subject: RE: setting permis

FW: xmp parsing issue -- xmp should start with a processing instruction

2015-07-07 Thread Allison, Timothy B.
All, This is a separate issue from the one I raised in PDFBOX-2855. This, too, was initially noted by Jeremy Anderson on TIKA-1285. I'm not sure if this is a problem with the way our xmp was generated or with the xmp parser. I'm fairly confident it is the former, but wanted to check. In our test suite,

RE: FW: xmp parsing issue -- xmp should start with a processing instruction

2015-07-07 Thread Allison, Timothy B.
Thank you, Tilman. Will regenerate new test file. -Original Message- From: Tilman Hausherr [mailto:thaush...@t-online.de] Sent: Tuesday, July 07, 2015 2:11 PM To: users@pdfbox.apache.org Subject: Re: FW: xmp parsing issue -- xmp should start with a processing instruction Hi, We got mo

DomXmpParser: namespace not found

2015-07-08 Thread Allison, Timothy B.
All, Apologies for the idiocy I'm about to reveal (well, that won't be a revelation to anyone, really), but is there an obvious solution for this kind of error: Caused by: org.apache.xmpbox.xml.XmpParsingException: Cannot find a definition for the namespace http://ns.adobe.com/lightroom/1.0/

RE: DomXmpParser: namespace not found

2015-07-09 Thread Allison, Timothy B.
ilto:sahy...@fileaffairs.de] Sent: Thursday, July 09, 2015 4:56 AM To: users@pdfbox.apache.org Subject: Re: DomXmpParser: namespace not found Hi, > Am 08.07.2015 um 22:42 schrieb Tilman Hausherr : > > Am 08.07.2015 um 17:22 schrieb Allison, Timothy B.: >> All, >> Apologies fo

TIKA-1678 PDF metadata extraction and UTF-16 encodings in the xmp

2015-07-15 Thread Allison, Timothy B.
All, Andrew Jackson recently opened TIKA-1678. Tika tries to use Dublin Core items from the xmp, and if that doesn't exist, it takes what it can find from the "regular" metadata. Andrew found that for ~200k out of 21million files, the UTF-16 is incorrectly (? doubly?) encoded in the xmp :

per page processing?

2015-07-15 Thread Allison, Timothy B.
All, Raymond Wu recently opened TIKA-1679 and recommended that we switch to per-page processing so that if there's an exception on one page, we'll still be able to extract contents from other pages. The proposed fix is along these lines: int nop = document.getNumberOfPages();
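The per-page idea can be sketched independently of PDFBox. In the sketch below, `PageExtractor` is a hypothetical stand-in for whatever extracts the text of one page; it is not a PDFBox or Tika interface, and the fix actually proposed on TIKA-1679 may differ.

```java
import java.util.ArrayList;
import java.util.List;

public class PerPageExtraction {
    /** Hypothetical stand-in for the code that extracts one page's text. */
    public interface PageExtractor {
        String extract(int pageNumber) throws Exception;
    }

    /** Text gathered from the pages that succeeded, plus the pages that failed. */
    public static class Result {
        public final StringBuilder text = new StringBuilder();
        public final List<Integer> failedPages = new ArrayList<>();
    }

    /** Process pages 1..numberOfPages, continuing past any page that throws. */
    public static Result extractAll(int numberOfPages, PageExtractor extractor) {
        Result result = new Result();
        for (int page = 1; page <= numberOfPages; page++) {
            try {
                result.text.append(extractor.extract(page));
            } catch (Exception e) {
                // Remember the failure but keep going with the remaining pages.
                result.failedPages.add(page);
            }
        }
        return result;
    }
}
```

Recording the failed page numbers lets the caller report partial extraction instead of silently dropping content.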

RE: per page processing?

2015-07-15 Thread Allison, Timothy B.
Onward. Thank you! -Original Message- From: John Hewson [mailto:j...@jahewson.com] Sent: Wednesday, July 15, 2015 5:09 PM To: users@pdfbox.apache.org Subject: Re: per page processing? > On 15 Jul 2015, at 04:52, Allison, Timothy B. wrote: > > All, > Raymond Wu recently

Subclassing BaseParser?

2015-10-03 Thread Allison, Timothy B.
All, I'm probably suffering from the same failure that led to (https://issues.apache.org/jira/browse/TIKA-1678?focusedCommentId=14640370&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14640370), but is it possible to subclass BaseParser outside of the oap.pdfpars

RE: Issues with extraction content of PDF files

2015-12-18 Thread Allison, Timothy B.
Colleagues, So that you don't have to do the initial diagnosis at least. From [0]:
>> That said, PDFBox 2.0-RC2 extracts no text and warns: WARNING: No Unicode mapping for CID+71 (71) in font 505Eddc6Arial
>> So, if the file has no Unicode mapping for the font, I doubt they'll be able to fi

RE: C# Version of PDFBox?

2016-03-31 Thread Allison, Timothy B.
Could PDFBox's webapp or tika-server, which wraps PDFBox, be of any use? -Original Message- From: Neil Pitman [mailto:neil.pit...@aquaforest.com] Sent: Wednesday, March 30, 2016 11:06 AM To: users@pdfbox.apache.org Subject: RE: C# Version of PDFBox? It could but would require some re-arc

RE: is it possible to batch extract text from pdf files within a tree of folders within a zip file ?

2016-04-20 Thread Allison, Timothy B.
Might want to look at Tika (which uses PDFBox) for that. Let's say you have an that contains your zips. java -jar tika-app.jar -J -t -i -o See if that gets you close enough. -Original Message- From: davidgreen.co...@gmail.com [mailto:davidgreen.co...@gmail.com] On Behalf Of David Gr

RE: is it possible to batch extract text from pdf files within a tree of folders within a zip file ?

2016-05-02 Thread Allison, Timothy B.
>> While PDFBox is a part of TIKA and the two projects are kind of "best friends forever"
Thank you, Tilman! :) -Original Message- From: Tilman Hausherr [mailto:thaush...@t-online.de] Sent: Saturday, April 30, 2016 5:24 PM To: users@pdfbox.apache.org Subject: Re: is it possible to bat

RE: PdfParser giving garbage character

2016-05-13 Thread Allison, Timothy B.
> Are you sure that you are using PDFBox? The code doesn't look like ours.
That’s Tika. -Original Message- From: Andreas Lehmkühler [mailto:andr...@lehmi.de] Sent: Friday, May 13, 2016 5:53 AM To: Mohit Goyal ; users@pdfbox.apache.org Subject: Re: PdfParser giving garbage character > Mo

OCRing extracted inline images vs. fully rendered pages?

2016-05-17 Thread Allison, Timothy B.
All, On Tika, users can choose to run OCR on inline images (and attached images, of course). Would it be better for us to render each full page and then run OCR on that? Best, Tim

RE: OCRing extracted inline images vs. fully rendered pages?

2016-05-17 Thread Allison, Timothy B.
> We have an experimental integration with Tesseract which was created a while ago by a GSoC student. Because it requires building C++ we’ve not integrated it into trunk, but do have it on the todo list for 2.1.
Ah, very cool. Y, I'd trust you all to do a better job of integrating OCR for

associating text with a PDActionURI?

2016-07-07 Thread Allison, Timothy B.
All, Is there a recipe for associating a hyperlink with text on the page? Over on Tika, we're dumping these as at the end of each page. If it isn't too hard, it would be great to associate these links with text, e.g. <a href="http://tika.apache.org">tika</a>. This is related to PDFBOX-1143 and TIKA-2029.

RE: associating text with a PDActionURI?

2016-07-07 Thread Allison, Timothy B.
.52 heightDir: 5.52 x: 146.30115 xDirAdj: 146.30115 y: 425.93 yDirAdj: 425.93 height: 5.52 heightDir: 5.52 x: 151.22 xDirAdj: 151.22 y: 425.93 yDirAdj: 425.93 height: 5.52 heightDir: 5.52 -Original Message- From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Thursday, July 7, 20

RE: associating text with a PDActionURI?

2016-07-07 Thread Allison, Timothy B.
> Hmm, because it's you, I'll try it myself :-)
Thank you, Tilman!
> You can't really know for sure with the classic text extraction, but you could use the extractTextByArea example with the rect coordinates.
Based on your example, though, I think this should work. If I cache the rectangl
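One way to picture the rectangle-based matching is the sketch below. It assumes the link rectangles and the glyph positions are already in the same coordinate space, which has to be checked for a real PDF. The `Glyph` class and `match` method are hypothetical illustrations, not PDFBox API.

```java
import java.awt.geom.Rectangle2D;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class LinkTextMatcher {
    /** Hypothetical glyph record: one character and its position on the page. */
    public static class Glyph {
        final String text;
        final double x;
        final double y;

        public Glyph(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** For each link URI, concatenate the glyphs whose position lies inside its rectangle. */
    public static Map<String, String> match(Map<String, Rectangle2D> linkRects, List<Glyph> glyphs) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, Rectangle2D> link : linkRects.entrySet()) {
            StringBuilder sb = new StringBuilder();
            for (Glyph g : glyphs) {
                if (link.getValue().contains(g.x, g.y)) {
                    sb.append(g.text);
                }
            }
            out.put(link.getKey(), sb.toString());
        }
        return out;
    }
}
```

The glyphs are appended in list order, so the caller is assumed to pass them in reading order.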

RE: Tika calling exiftool and ffmpeg?

2016-09-01 Thread Allison, Timothy B.
Forwarded to users@tika From: Chris Bamford [mailto:cbamf...@mimecast.com] Sent: Thursday, September 1, 2016 7:03 AM To: Subject: Tika calling exiftool and ffmpeg? Hi I recently noticed on my linux box in the auditd logs that my JVM is making repeated attempts to call exiftool and ffmpeg. Wh

RE: [Tika] I have a question. --> "Exception : org.apache.pdfbox.cos.COSArray cannot be cast to org.apache.pdfbox.cos.COSDictionary"

2016-09-16 Thread Allison, Timothy B.
-Original Message- From: question.answer...@gmail.com [mailto:question.answer...@gmail.com] Sent: Friday, September 16, 2016 8:11 AM To: u...@tika.apache.org Subject: Re: [Tika] I have a question. --> "Exception : org.apache.pdfbox.cos.COSArray cannot be cast to org.apache.pdfbox.cos.C

OOMs extracting inline images

2016-11-09 Thread Allison, Timothy B.
All, I kicked off a run against our regression corpus in which I extracted inline images from PDFs. I'm seeing quite a few OOMs, some caused by the JBIG2Filter, some by PDDeviceGray and some by the (unsupported, I know) jaiimageio TIFFWriter. Should I open issues for these or is this expecte

Extracting/rendering jp2?

2016-11-09 Thread Allison, Timothy B.
Thanks to Tilman for pointing me to PDFBOX-3246, I now have 2 pdfs with embedded jp2 to work with! How can I extract those? I've effectively copied/pasted PDFBox's ExtractImages into Tika, and I'm using ImageIOUtil.writeImage(image, suffix, out) to write non-jpeg images. When I run this agai

RE: Extracting/rendering jp2?

2016-11-09 Thread Allison, Timothy B.
> what do you need? The image in any format (e.g. png), or the image in the original JP2 compression?
Ideally the original JP2 compression.
> And if you're using ImageIOUtil.writeImage(), what is the parameter in suffix? If it is JP2, then you'd need to have some plugin for it.
jpx
> So it m

RE: OOMs extracting inline images

2016-11-09 Thread Allison, Timothy B.
> IMHO we should have a look if this is a known issue/expected behaviour or something new. Could you provide at least one pdf for every case?
Y. Will open an issue in the next few days so that we can share files and determine if this is a problem on our side or expected behavior. I have to wait

RE: Extracting/rendering jp2?

2016-11-09 Thread Allison, Timothy B.
Got it. Thank you! -Original Message- From: Tilman Hausherr [mailto:thaush...@t-online.de] Sent: Wednesday, November 9, 2016 3:38 PM To: users@pdfbox.apache.org Subject: Re: Extracting/rendering jp2? Am 09.11.2016 um 21:26 schrieb Allison, Timothy B.: >> what do you need? The im

RE: Extracting/rendering jp2?

2016-11-09 Thread Allison, Timothy B.
>> private static final List JP2 = Arrays.asList(COSName.JPX_DECODE);
Apologies for the ignorance of how inline images are stored, but is there an equivalent for png or tiff? Or, do we have to decode and re-encode those?

RE: Extracting/rendering jp2?

2016-11-10 Thread Allison, Timothy B.
Thank you! -Original Message- From: Tilman Hausherr [mailto:thaush...@t-online.de] Sent: Thursday, November 10, 2016 12:33 AM To: users@pdfbox.apache.org Subject: Re: Extracting/rendering jp2? Am 10.11.2016 um 04:29 schrieb Allison, Timothy B.: >>> private static final

RE: Fwd: Trouble reading IEEE pdf

2017-02-02 Thread Allison, Timothy B.
I think I'm getting most of the text with pdfbox app's ExtractText... What text are you missing, specifically? Or, if you're missing the entire body, perhaps look at ExtractText to grab more content? -Original Message- From: pulkit@gmail.com [mailto:pulkit@gmail.com] On Behalf O

RE: Extracting vector graphics from pdf

2017-02-27 Thread Allison, Timothy B.
PDFBox Colleagues, Any recommendations? Best, Tim -Original Message- From: Andisa Dewi [mailto:theknight...@yahoo.com] Sent: Monday, February 27, 2017 5:32 AM To: u...@tika.apache.org Subject: Extracting vector graphics from pdf Hello guys, I'm currently e

RE: Extracting layout information and text from searchable PDF

2017-02-27 Thread Allison, Timothy B.
Might be relevant: https://github.com/JonathanLink/PDFLayoutTextStripper This might be helpful: https://github.com/apache/tika/pull/152 If you want to extract tables, take a look at Tabula: http://tabula.technology/ -Original Message- From: viraf.bankwa...@yahoo.com.INVALID [mailto:

RE: Extracting vector graphics from pdf

2017-02-28 Thread Allison, Timothy B.
allows to collect the lines. However it won't output an image. Tilman Am 27.02.2017 um 13:20 schrieb Allison, Timothy B.: > PDFBox Colleagues, >Any recommendations? > >Best, > > Tim > > -Original Message- > From: Andisa Dew

RE: Make PDFBox fail on bad pdf

2017-03-30 Thread Allison, Timothy B.
If you have any recommendations for the more general case, let us know on TIKA-1443 [1]. [1] https://issues.apache.org/jira/browse/TIKA-1443 -Original Message- From: Wouter De Borger [mailto:wouter.debor...@inmanta.com] Sent: Thursday, March 30, 2017 6:00 AM To: users@pdfbox.apache.org

attachments

2014-02-03 Thread Allison, Timothy B.
All, According to the code in this example (examples/src/main/java/org/apache/pdfbox/examples/pdmodel/ExtractEmbeddedFiles.java), the "names" for the embedded files can exist in "efTree" or in children of "efTree." Does anyone happen to know if client code needs to check further descendants
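If names can sit at any depth of the tree, a recursive walk avoids having to decide how many levels to check. The sketch below uses a hypothetical `Node` class to stand in for a name-tree node; it is not the PDFBox name-tree API, and the values are plain strings only for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class NameTreeWalk {
    /** Hypothetical name-tree node: it may hold names, kids, or both. */
    public static class Node {
        public final Map<String, String> names = new LinkedHashMap<>();
        public final List<Node> kids = new ArrayList<>();
    }

    /** Collect the names from the root and from every descendant, at any depth. */
    public static Map<String, String> collect(Node root) {
        Map<String, String> out = new LinkedHashMap<>();
        walk(root, out);
        return out;
    }

    private static void walk(Node node, Map<String, String> out) {
        if (node == null) {
            return;
        }
        out.putAll(node.names);
        for (Node kid : node.kids) {
            walk(kid, out);
        }
    }
}
```

A malformed file could contain a cycle among the kids, so production code would likely also track visited nodes or cap the depth.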

Eyebrow-raising memory consumption exporting PDXObjectImages in PDFBox 1.8

2014-05-23 Thread Allison, Timothy B.
All, Over on Tika, we recently added the ability to export PDXObjectImages (TIKA-1268) as we do now with regular attachments. Some users have noticed some eyebrow-raising memory consumption after we made the change with some files. We're currently using PDFBox 1.8.5. This 4MB file shows the

RE: Eyebrow-raising memory consumption exporting PDXObjectImages in PDFBox 1.8

2014-05-28 Thread Allison, Timothy B.
The "resources" object, is it the one of a single page, or of the whole PDF file? Tilman Am 23.05.2014 18:18, schrieb Allison, Timothy B.: > I get an OOM when trying to write the embedded images to disk with straight > PDFBox (no Tika) with -Xmx2g (tested on Java 1.7). My wr

RE: Radio Groups

2014-06-23 Thread Allison, Timothy B.
Maybe a newbie answer...are you seeing any different values on the PDRadioCollection or its kids with: getAlternateNameField() getFullyQualifiedName() getPartialName() Or did your client really use the same name for all three field name types for the two buttons? Is setValue(String value) on

extracting embedded documents -- will getEmbeddedFile() alone miss embedded DOS/Unix/Mac files?

2014-07-23 Thread Allison, Timothy B.
All, Over on Tika, it looks like we copied org.apache.pdfbox.examples.pdmodel.ExtractEmbeddedFiles to extract embedded files. As I look at the source code for PDComplexFileSpecification, I notice that getEmbeddedFile() does not behave like getFilename(); that is, it doesn't iterate through