the text. I only suggest using the
command line utilities because they exhibit the problem that happens in my own
code.
-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Thursday, October 05, 2017 12:21 PM
To: users@pdfbox.apache.org
Subject: RE: Command line
If you want to extract the text from a document with an XFA, Apache Tika (which
relies on PDFBox) should be able to extract the text.
-Original Message-
From: John Liston [mailto:list...@asconline.com]
Sent: Thursday, October 5, 2017 11:56 AM
To: users@pdfbox.apache.org
Subject: Command
, Timothy B.:
> Colleagues,
> Any recommendations for extracting rotated text such as:
> https://www.fsis.usda.gov/wps/wcm/connect/896bf55c-0d78-44a0-adfb-94f893eb0f72/GallagherEbelKause_74.pdf?MOD=AJPERES
> ?
>
> Adobe DC gets reasonable text with "save as text". PDFBox's E
Colleagues,
Any recommendations for extracting rotated text such as:
https://www.fsis.usda.gov/wps/wcm/connect/896bf55c-0d78-44a0-adfb-94f893eb0f72/GallagherEbelKause_74.pdf?MOD=AJPERES
?
Adobe DC gets reasonable text with "save as text". PDFBox's ExtractText (and
Tika) get something like
there.
But this is all just a thought. I did not implement anything.
Tilman
Am 21.09.2017 um 22:07 schrieb Allison, Timothy B.:
> All,
>
> How much effort would it be to track/calculate a ratio of characters with
> missing Unicode mappings to those with mappings for a given page? It would
>
All,
How much effort would it be to track/calculate a ratio of characters with
missing Unicode mappings to those with mappings for a given page? It would be
neat after trying to extract text from a page to be able to tell how many
characters are lost. We could use this info on Tika to
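A minimal sketch of the bookkeeping asked about above, assuming the caller can already tell, glyph by glyph, whether a Unicode mapping was found. The class and method names are hypothetical and are not part of PDFBox or Tika. It reports the unmapped share of all glyphs rather than unmapped-to-mapped, so a page with no mapped glyphs does not divide by zero.

```java
/** Hypothetical helper; not a PDFBox or Tika API. */
final class UnmappedCharStats {
    private int mapped;
    private int unmapped;

    /** Record one glyph; hasUnicode is false when no Unicode mapping was found. */
    void record(boolean hasUnicode) {
        if (hasUnicode) {
            mapped++;
        } else {
            unmapped++;
        }
    }

    /** Fraction of recorded glyphs with no Unicode mapping, or 0.0 if nothing was recorded. */
    double unmappedRatio() {
        int total = mapped + unmapped;
        return total == 0 ? 0.0 : (double) unmapped / total;
    }
}
```

One instance per page would be enough; a caller such as Tika could then compare the ratio to a threshold of its own choosing.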
If you have any recommendations for the more general case, let us know on
TIKA-1443 [1].
[1] https://issues.apache.org/jira/browse/TIKA-1443
-Original Message-
From: Wouter De Borger [mailto:wouter.debor...@inmanta.com]
Sent: Thursday, March 30, 2017 6:00 AM
To: users@pdfbox.apache.org
I think I'm getting most of the text with pdfbox app's ExtractText...
What text are you missing, specifically?
Or, if you're missing the entire body, perhaps look at ExtractText to grab more
content?
-Original Message-
From: pulkit@gmail.com [mailto:pulkit@gmail.com] On Behalf
Thank you!
-Original Message-
From: Tilman Hausherr [mailto:thaush...@t-online.de]
Sent: Thursday, November 10, 2016 12:33 AM
To: users@pdfbox.apache.org
Subject: Re: Extracting/rendering jp2?
Am 10.11.2016 um 04:29 schrieb Allison, Timothy B.:
>>> private static final
Got it. Thank you!
-Original Message-
From: Tilman Hausherr [mailto:thaush...@t-online.de]
Sent: Wednesday, November 9, 2016 3:38 PM
To: users@pdfbox.apache.org
Subject: Re: Extracting/rendering jp2?
Am 09.11.2016 um 21:26 schrieb Allison, Timothy B.:
>> what do you need? The
>IMHO we should have a look if this is a known issue/expected behaviour or
>something new. Could you provide at least one pdf for every case?
Y. Will open an issue in the next few days so that we can share files and
determine if this is a problem on our side or expected behavior. I have to wait
>what do you need? The image in any format (e.g. png), or the image in the
>original JP2 compression?
Ideally the original JP2 compression.
>And if you're using ImageIOUtil.writeImage(), what is the parameter in suffix?
>If it is JP2, then you'd need to have some plugin for it.
jpx
>So it
Forwarded to users@tika
From: Chris Bamford [mailto:cbamf...@mimecast.com]
Sent: Thursday, September 1, 2016 7:03 AM
To:
Subject: Tika calling exiftool and ffmpeg?
Hi
I recently noticed on my linux box in the auditd logs that my JVM is making
> Hmm, because it's you, I'll try it myself :-)
Thank you, Tilman!
> You can't really know for sure with the classic text extraction, but you
> could use the extractTextByArea example with the rect coordinates.
Based on your example, though, I think this should work. If I cache the
tDir: 5.52
x: 146.30115 xDirAdj: 146.30115 y: 425.93 yDirAdj: 425.93 height: 5.52
heightDir: 5.52
x: 151.22 xDirAdj: 151.22 y: 425.93 yDirAdj: 425.93 height: 5.52 heightDir: 5.52
-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Thursday, July 7, 2016 8:04 AM
All,
Is there a recipe for associating a hyperlink with text on the page? Over on
Tika, we're dumping these at the end of each page. If it isn't
too hard, it would be great to associate these links with text, e.g. <a href="http://tika.apache.org">tika</a>.
This is related to PDFBOX-1143 and TIKA-2029.
>We have an experimental integration with Tesseract which was created a while
>ago by a GSoC student. Because it requires building C++ we’ve not integrated
>it into trunk, but do have it on the todo list for 2.1.
Ah, very cool. Y, I'd trust you all to do a better job of integrating OCR for
All,
On Tika, users can choose to run OCR on inline images (and attached images,
of course). Would it be better for us to render each full page and then run
OCR on that?
Best,
Tim
>> While PDFBox is a part of TIKA and the two projects are kindof "best friends
>> forever"
Thank you, Tilman! :)
-Original Message-
From: Tilman Hausherr [mailto:thaush...@t-online.de]
Sent: Saturday, April 30, 2016 5:24 PM
To: users@pdfbox.apache.org
Subject: Re: is it possible to
Might want to look at Tika (which uses PDFBox) for that.
Let's say you have an <input_dir> that contains your zips.
java -jar tika-app.jar -J -t -i <input_dir> -o <output_dir>
See if that gets you close enough.
-Original Message-
From: davidgreen.co...@gmail.com [mailto:davidgreen.co...@gmail.com] On Behalf
Of David
Could PDFBox's webapp or tika-server, which wraps PDFBox, be of any use?
-Original Message-
From: Neil Pitman [mailto:neil.pit...@aquaforest.com]
Sent: Wednesday, March 30, 2016 11:06 AM
To: users@pdfbox.apache.org
Subject: RE: C# Version of PDFBox?
It could but would require some
Colleagues,
So that you don't have to do the initial diagnosis at least. From [0]:
>>That said, PDFBox 2.0-RC2 extracts no text and warns: WARNING: No Unicode
>>mapping for CID+71
(71) in font 505Eddc6Arial
>>So, if the file has no Unicode mapping for the font, I doubt they'll be able
>>to
All,
I'm probably suffering from the same failure that led to
(https://issues.apache.org/jira/browse/TIKA-1678?focusedCommentId=14640370&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14640370),
but is it possible to subclass BaseParser outside of the oap.pdfparser
All,
Andrew Jackson recently opened TIKA-1678. Tika tries to use Dublin Core
items from the xmp, and if that doesn't exist, it takes what it can find from
the regular metadata.
Andrew found that for ~200k out of 21 million files, the UTF-16 is incorrectly
(doubly?) encoded in the xmp:
All,
Raymond Wu recently opened TIKA-1679 and recommended that we switch to
per-page processing so that if there's an exception on one page, we'll still be
able to extract contents from other pages.
The proposed fix is along these lines:
int nop = document.getNumberOfPages();
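The idea of isolating failures per page can be sketched as follows. This is not the fix proposed on TIKA-1679; `PerPageExtraction` and `PageReader` are hypothetical names, and `PageReader` stands in for whatever actually pulls text from a single page.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of per-page extraction; not Tika's or PDFBox's code. */
final class PerPageExtraction {
    /** Assumed callback that returns the text of one 1-based page and may throw. */
    interface PageReader {
        String read(int pageNumber) throws Exception;
    }

    /** Extract each page independently so one failing page does not lose the rest. */
    static List<String> extractAll(int numberOfPages, PageReader reader,
                                   List<Exception> failures) {
        List<String> pages = new ArrayList<>();
        for (int i = 1; i <= numberOfPages; i++) {
            try {
                pages.add(reader.read(i));
            } catch (Exception e) {
                failures.add(e); // remember the problem and keep going
                pages.add("");   // placeholder so list positions still match page numbers
            }
        }
        return pages;
    }
}
```

With PDFBox, the reader could plausibly be built on a PDFTextStripper whose setStartPage and setEndPage are both set to the current page before calling getText, though that wiring is left out here.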
Onward. Thank you!
-Original Message-
From: John Hewson [mailto:j...@jahewson.com]
Sent: Wednesday, July 15, 2015 5:09 PM
To: users@pdfbox.apache.org
Subject: Re: per page processing?
On 15 Jul 2015, at 04:52, Allison, Timothy B. talli...@mitre.org wrote:
All,
Raymond Wu
All,
This is a separate issue from the one I raised in PDFBOX-2855. This, too, was
initially noted by Jeremy Anderson on TIKA-1285. I'm not sure if this is a
problem with the way our xmp was generated or with the xmp parser. I'm fairly
confident the former, but wanted to check.
In our test suite,
Alright. After the exorcism, all is working. I have no idea why it wasn't
working before. Thank you, Tilman!
-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Friday, February 20, 2015 6:42 PM
To: users@pdfbox.apache.org
Subject: RE: setting permissions
[mailto:thaush...@t-online.de]
Sent: Friday, February 20, 2015 5:25 PM
To: users@pdfbox.apache.org
Subject: Re: setting permissions on a new document
Hi Tim,
add a page to the document.
PDPage page = new PDPage();
document.addPage(page);
Tilman
Am 20.02.2015 um 22:12 schrieb Allison, Timothy B
All,
I'm trying to create a test doc for permission checking over on Tika, when I
try the most basic program:
public static void main(String[] args) throws Exception {
File f = new File("C:/temp/testPDF_protected.pdf");
PDDocument document = new PDDocument();
All,
Over on Tika, it looks like we copied
org.apache.pdfbox.examples.pdmodel.ExtractEmbeddedFiles to extract embedded
files. As I look at the source code for PDComplexFileSpecification, I notice
that getEmbeddedFile() does not behave like getFilename(); that is, it doesn't
iterate through
Maybe a newbie answer...are you seeing any different values on the
PDRadioCollection or its kids with:
getAlternateNameField()
getFullyQualifiedName()
getPartialName()
Or did your client really use the same name for all three field name types for
the two buttons?
Is setValue(String value) on
All,
Over on Tika, we recently added the ability to export PDXObjectImages
(TIKA-1268) as we do now with regular attachments. Some users have noticed
some eyebrow-raising memory consumption after we made the change with some
files. We're currently using PDFBox 1.8.5.
This 4MB file shows
All,
According to the code in this example
(examples/src/main/java/org/apache/pdfbox/examples/pdmodel/ExtractEmbeddedFiles.java),
the names for the embedded files can exist in efTree or in children of
efTree. Does anyone happen to know if client code needs to check further
descendants than
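Since PDF name trees may contain intermediate nodes between the root and the leaves, a walk that does not depend on depth is one way to sidestep the question. The sketch below uses a hypothetical `NameNode` class as a stand-in; it is not the PDFBox name tree class, and the String values are placeholders for file specifications.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a name tree node; not the PDFBox class. */
final class NameNode {
    final Map<String, String> names = new LinkedHashMap<>(); // entries held directly by this node
    final List<NameNode> kids = new ArrayList<>();            // child nodes, possibly nested

    /** Collect names from this node and every descendant, depth first. */
    static Map<String, String> collect(NameNode node) {
        Map<String, String> all = new LinkedHashMap<>(node.names);
        for (NameNode kid : node.kids) {
            all.putAll(collect(kid));
        }
        return all;
    }
}
```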