All,
How much effort would it be to track/calculate a ratio of characters with
missing Unicode mappings to those with mappings for a given page? It would be
neat after trying to extract text from a page to be able to tell how many
characters are lost. We could use this info on Tika to determi
here.
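As a rough illustration of the bookkeeping being asked about, here is a minimal sketch that tallies glyphs with and without a Unicode mapping. The class name UnicodeMappingStats and the convention that a null or empty string means "no mapping" are assumptions made for this example. How the per-glyph strings would be obtained from PDFBox is deliberately left open.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: tallies glyphs with and without a Unicode mapping for one page. */
final class UnicodeMappingStats {
    private int mapped;
    private int unmapped;

    /** Records one glyph; null or empty is taken to mean "no Unicode mapping" (an assumption). */
    void record(String unicode) {
        if (unicode == null || unicode.isEmpty()) {
            unmapped++;
        } else {
            mapped++;
        }
    }

    /** Fraction of all recorded glyphs that had no mapping; 0.0 when nothing was recorded. */
    double unmappedFraction() {
        int total = mapped + unmapped;
        return total == 0 ? 0.0 : (double) unmapped / total;
    }

    /** Convenience factory that records every entry of the given list. */
    static UnicodeMappingStats of(List<String> glyphs) {
        UnicodeMappingStats stats = new UnicodeMappingStats();
        for (String glyph : glyphs) {
            stats.record(glyph);
        }
        return stats;
    }

    public static void main(String[] args) {
        UnicodeMappingStats stats = of(Arrays.asList("T", "i", null, "k", "a", null));
        System.out.println("unmapped fraction: " + stats.unmappedFraction());
    }
}
```

The sketch reports unmapped glyphs as a fraction of all glyphs rather than as a ratio to mapped glyphs, so that a page with no mapped glyphs at all does not divide by zero.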
But this is all just a thought. I did not implement anything.
Tilman
On 21.09.2017 at 22:07, Allison, Timothy B. wrote:
> All,
>
> How much effort would it be to track/calculate a ratio of characters with
> missing Unicode mappings to those with mappings for a given page? It
Colleagues,
Any recommendations for extracting rotated text such as:
https://www.fsis.usda.gov/wps/wcm/connect/896bf55c-0d78-44a0-adfb-94f893eb0f72/GallagherEbelKause_74.pdf?MOD=AJPERES
?
Adobe DC gets reasonable text with "save as text". PDFBox's ExtractText (and
Tika) get something like this
eb Allison, Timothy B.:
> Colleagues,
> Any recommendations for extracting rotated text such as:
> https://www.fsis.usda.gov/wps/wcm/connect/896bf55c-0d78-44a0-adfb-94f893eb0f72/GallagherEbelKause_74.pdf?MOD=AJPERES
> ?
>
> Adobe DC gets reasonable text with "save as text"
If you want to extract the text from a document with an XFA, Apache Tika (which
relies on PDFBox) should be able to extract the text.
-Original Message-
From: John Liston [mailto:list...@asconline.com]
Sent: Thursday, October 5, 2017 11:56 AM
To: users@pdfbox.apache.org
Subject: Command
t the text. I only suggest using the
command line utilities because they exhibit the problem that happens in my own
code.
-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Thursday, October 05, 2017 12:21 PM
To: users@pdfbox.apache.org
Subject: RE: Command
All,
I'm trying to create a test doc for permission checking over on Tika, when I
try the most basic program:
public static void main(String[] args) throws Exception {
File f = new File("C:/temp/testPDF_protected.pdf");
PDDocument document = new PDDocument();
Access
ument.addPage(page);
Tilman
On 20.02.2015 at 22:12, Allison, Timothy B. wrote:
> All,
>I'm trying to create a test doc for permission checking over on Tika,
> when I try the most basic program:
>
> public static void main(String[] args) throws Exception {
>
Alright. After the exorcism, all is working. I have no idea why it wasn't
working before. Thank you, Tilman!
-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Friday, February 20, 2015 6:42 PM
To: users@pdfbox.apache.org
Subject: RE: setting permis
All,
This is a separate issue from the one I raised in PDFBox-2855. This, too, was
initially noted by Jeremy Anderson on TIKA-1285. I'm not sure if this is a
problem with the way our xmp was generated or with the xmp parser. I'm fairly
confident the former, but wanted to check.
In our test suite,
Thank you, Tilman. Will regenerate new test file.
-Original Message-
From: Tilman Hausherr [mailto:thaush...@t-online.de]
Sent: Tuesday, July 07, 2015 2:11 PM
To: users@pdfbox.apache.org
Subject: Re: FW: xmp parsing issue -- xmp should start with a processing
instruction
Hi,
We got mo
All,
Apologies for the idiocy I'm about to reveal (well, that won't be a revelation
to anyone, really), but is there an obvious solution for this kind of error:
Caused by: org.apache.xmpbox.xml.XmpParsingException: Cannot find a definition
for the namespace http://ns.adobe.com/lightroom/1.0/
ilto:sahy...@fileaffairs.de]
Sent: Thursday, July 09, 2015 4:56 AM
To: users@pdfbox.apache.org
Subject: Re: DomXmpParser: namespace not found
Hi,
> On 08.07.2015 at 22:42, Tilman Hausherr wrote:
>
> On 08.07.2015 at 17:22, Allison, Timothy B. wrote:
>> All,
>> Apologies fo
All,
Andrew Jackson recently opened TIKA-1678. Tika tries to use Dublin Core
items from the xmp, and if that doesn't exist, it takes what it can find from
the "regular" metadata.
Andrew found that for ~200k out of 21million files, the UTF-16 is incorrectly
(? doubly?) encoded in the xmp :
All,
Raymond Wu recently opened TIKA-1679 and recommended that we switch to
per-page processing so that if there's an exception on one page, we'll still be
able to extract contents from other pages.
The proposed fix is along these lines:
int nop = document.getNumberOfPages();
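The general shape of such a per-page loop might look like the sketch below. It is not the proposed Tika fix itself. PageExtractor is a hypothetical stand-in for whatever PDFBox call pulls the text of one page (for instance a PDFTextStripper restricted to a single page via setStartPage and setEndPage), and the Result holder is likewise invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: extract page by page so one bad page does not lose the others. */
final class PerPageExtraction {

    /** Hypothetical stand-in for the PDFBox call that extracts one page (1-based). */
    interface PageExtractor {
        String extract(int pageNumber) throws Exception;
    }

    /** Text gathered from the pages that worked, plus the numbers of the pages that failed. */
    static final class Result {
        final StringBuilder text = new StringBuilder();
        final List<Integer> failedPages = new ArrayList<>();
    }

    static Result extractAll(int numberOfPages, PageExtractor extractor) {
        Result result = new Result();
        for (int page = 1; page <= numberOfPages; page++) {
            try {
                result.text.append(extractor.extract(page));
            } catch (Exception e) {
                // Remember the failure and carry on with the next page.
                result.failedPages.add(page);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Result r = extractAll(3, page -> {
            if (page == 2) {
                throw new IllegalStateException("simulated failure on page 2");
            }
            return "text of page " + page + "\n";
        });
        System.out.print(r.text);
        System.out.println("failed pages: " + r.failedPages);
    }
}
```

Only Exception is caught here. Errors such as OutOfMemoryError would still abort the whole document, which may or may not be what a caller wants.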
Onward. Thank you!
-Original Message-
From: John Hewson [mailto:j...@jahewson.com]
Sent: Wednesday, July 15, 2015 5:09 PM
To: users@pdfbox.apache.org
Subject: Re: per page processing?
> On 15 Jul 2015, at 04:52, Allison, Timothy B. wrote:
>
> All,
> Raymond Wu recently
All,
I'm probably suffering from the same failure that led to
(https://issues.apache.org/jira/browse/TIKA-1678?focusedCommentId=14640370&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14640370),
but is it possible to subclass BaseParser outside of the oap.pdfpars
Colleagues,
So that you don't have to do the initial diagnosis at least. From [0]:
>>That said, PDFBox 2.0-RC2 extracts no text and warns: WARNING: No Unicode
>>mapping for CID+71 (71) in font 505Eddc6Arial
>>So, if the file has no Unicode mapping for the font, I doubt they'll be able
>>to fi
Could PDFBox's webapp or tika-server, which wraps PDFBox, be of any use?
-Original Message-
From: Neil Pitman [mailto:neil.pit...@aquaforest.com]
Sent: Wednesday, March 30, 2016 11:06 AM
To: users@pdfbox.apache.org
Subject: RE: C# Version of PDFBox?
It could but would require some re-arc
Might want to look at Tika (which uses PDFBox) for that.
Let's say you have an <input_dir> that contains your zips.
java -jar tika-app.jar -J -t -i <input_dir> -o <output_dir>
See if that gets you close enough.
-Original Message-
From: davidgreen.co...@gmail.com [mailto:davidgreen.co...@gmail.com] On Behalf
Of David Gr
>> While PDFBox is a part of TIKA and the two projects are kind of "best friends
>> forever"
Thank you, Tilman! :)
-Original Message-
From: Tilman Hausherr [mailto:thaush...@t-online.de]
Sent: Saturday, April 30, 2016 5:24 PM
To: users@pdfbox.apache.org
Subject: Re: is it possible to bat
> Are you sure that you are using PDFBox. The code doesn't look like ours.
That’s Tika.
-Original Message-
From: Andreas Lehmkühler [mailto:andr...@lehmi.de]
Sent: Friday, May 13, 2016 5:53 AM
To: Mohit Goyal ; users@pdfbox.apache.org
Subject: Re: PdfParser giving garbage character
> Mo
All,
On Tika, users can choose to run OCR on inline images (and attached images,
of course). Would it be better for us to render each full page and then run
OCR on that?
Best,
Tim
>We have an experimental integration with Tesseract which was created a while
>ago by a GSoC student. Because it requires building C++ we’ve not integrated
>it into trunk, but do have it on the todo list for 2.1.
Ah, very cool. Y, I'd trust you all to do a better job of integrating OCR for
All,
Is there a recipe for associating a hyperlink to text on the page? Over on
Tika, we're dumping these as at the end of each page. If it isn't
too hard, it would be great to associate these links with text, e.g. <a href="http://tika.apache.org">tika</a>.
This is related to PDFBOX-1143 and TIKA-2029.
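One way to picture the association, purely as a sketch: if each link annotation's rectangle and each glyph's position are available in the same coordinate space, the link text is whatever glyphs fall inside the rectangle. The Glyph and Link types below are hypothetical and are not PDFBox classes. The sketch also assumes the coordinates have already been brought into a common space, which in practice may need a conversion because annotation rectangles and text positions can use different origins.

```java
import java.util.Arrays;
import java.util.List;

/** Sketch: pair a link rectangle with the glyphs positioned inside it. */
final class LinkTextMatcher {

    /** Hypothetical glyph record: the text plus the x/y position reported for it. */
    static final class Glyph {
        final String text;
        final double x;
        final double y;

        Glyph(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Hypothetical link record: target URI plus its rectangle, in the glyphs' coordinate space. */
    static final class Link {
        final String uri;
        final double minX;
        final double minY;
        final double maxX;
        final double maxY;

        Link(String uri, double minX, double minY, double maxX, double maxY) {
            this.uri = uri;
            this.minX = minX;
            this.minY = minY;
            this.maxX = maxX;
            this.maxY = maxY;
        }

        boolean contains(Glyph g) {
            return g.x >= minX && g.x <= maxX && g.y >= minY && g.y <= maxY;
        }
    }

    /** Concatenates, in input order, the glyphs whose position lies inside the link rectangle. */
    static String textUnder(Link link, List<Glyph> glyphs) {
        StringBuilder sb = new StringBuilder();
        for (Glyph g : glyphs) {
            if (link.contains(g)) {
                sb.append(g.text);
            }
        }
        return sb.toString();
    }

    /** Builds an anchor element; no HTML escaping is done in this sketch. */
    static String toAnchor(Link link, List<Glyph> glyphs) {
        return "<a href=\"" + link.uri + "\">" + textUnder(link, glyphs) + "</a>";
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = Arrays.asList(
                new Glyph("t", 146.3, 425.93),
                new Glyph("i", 151.22, 425.93),
                new Glyph("x", 300.0, 425.93));
        Link link = new Link("http://tika.apache.org", 140, 420, 160, 430);
        System.out.println(toAnchor(link, glyphs));
    }
}
```

Testing only the glyph origin against the rectangle is the simplest possible rule. A more forgiving variant could test for overlap between the glyph box and the link rectangle.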
.52
heightDir: 5.52
x: 146.30115 xDirAdj: 146.30115 y: 425.93 yDirAdj: 425.93 height: 5.52
heightDir: 5.52
x: 151.22 xDirAdj: 151.22 y: 425.93 yDirAdj: 425.93 height: 5.52 heightDir: 5.52
-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Thursday, July 7, 20
> Hmm, because it's you, I'll try it myself :-)
Thank you, Tilman!
> You can't really know for sure with the classic text extraction, but you
> could use the extractTextByArea example with the rect coordinates.
Based on your example, though, I think this should work. If I cache the
rectangl
Forwarded to users@tika
From: Chris Bamford [mailto:cbamf...@mimecast.com]
Sent: Thursday, September 1, 2016 7:03 AM
To:
Subject: Tika calling exiftool and ffmpeg?
Hi
I recently noticed on my linux box in the auditd logs that my JVM is making
repeated attempts to call exiftool and ffmpeg. Wh
-Original Message-
From: question.answer...@gmail.com [mailto:question.answer...@gmail.com]
Sent: Friday, September 16, 2016 8:11 AM
To: u...@tika.apache.org
Subject: Re: [Tika] I have a question. --> "Exception :
org.apache.pdfbox.cos.COSArray cannot be cast to
org.apache.pdfbox.cos.C
All,
I kicked off a run against our regression corpus in which I extracted inline
images from PDFs. I'm seeing quite a few OOMs, some caused by the JBIG2Filter,
some by PDDeviceGray and some by the (unsupported, I know) jaiimageio
TIFFWriter. Should I open issues for these or is this expecte
Thanks to Tilman for pointing me to PDFBOX-3246, I now have 2 pdfs with
embedded jp2 to work with!
How can I extract those? I've effectively copied/pasted PDFBox's ExtractImages
into Tika, and I'm using ImageIOUtil.writeImage(image, suffix, out) to write
non-jpeg images.
When I run this agai
>what do you need? The image in any format (e.g. png), or the image in the
>original JP2 compression?
Ideally the original JP2 compression.
>And if you're using ImageIOUtil.writeImage(), what is the parameter in suffix?
>If it is JP2, then you'd need to have some plugin for it.
jpx
>So it m
>IMHO we should have a look if this is a known issue/expected behaviour or
>something new. Could you provide at least one pdf for every case?
Y. Will open an issue in the next few days so that we can share files and
determine if this is a problem on our side or expected behavior. I have to wait
Got it. Thank you!
-Original Message-
From: Tilman Hausherr [mailto:thaush...@t-online.de]
Sent: Wednesday, November 9, 2016 3:38 PM
To: users@pdfbox.apache.org
Subject: Re: Extracting/rendering jp2?
On 09.11.2016 at 21:26, Allison, Timothy B. wrote:
>> what do you need? The im
>> private static final List JP2 =
>> Arrays.asList(COSName.JPX_DECODE);
Apologies for the ignorance of how inline images are stored, but is there an
equivalent for png or tiff? Or, do we have to decode and re-encode those?
Thank you!
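For context on the png/tiff question: the PDF filter set includes DCTDecode (JPEG) and JPXDecode (JPEG 2000) but no PNG filter, so image data stored with, say, FlateDecode is not a ready-made PNG file. A decision helper along those lines could be sketched as follows. The class ImageExportPlan, its rule of copying raw bytes only when the filter chain is exactly one of those two filters, and the png fallback suffix are all assumptions made for illustration, not PDFBox behaviour.

```java
import java.util.Arrays;
import java.util.List;

/** Sketch: decide whether an image stream can be copied as-is or must be re-encoded. */
final class ImageExportPlan {

    enum Mode { COPY_RAW_STREAM, DECODE_AND_REENCODE }

    final Mode mode;
    final String suffix;

    private ImageExportPlan(Mode mode, String suffix) {
        this.mode = mode;
        this.suffix = suffix;
    }

    /**
     * Chooses a plan from the stream's filter names. Raw copying is only chosen
     * when the chain is a single DCTDecode or JPXDecode filter (an assumed rule).
     */
    static ImageExportPlan forFilters(List<String> filterNames) {
        if (filterNames.size() == 1) {
            String only = filterNames.get(0);
            if ("DCTDecode".equals(only)) {
                return new ImageExportPlan(Mode.COPY_RAW_STREAM, "jpg");
            }
            if ("JPXDecode".equals(only)) {
                return new ImageExportPlan(Mode.COPY_RAW_STREAM, "jpx");
            }
        }
        // Everything else is decoded to pixels and written in a new format.
        return new ImageExportPlan(Mode.DECODE_AND_REENCODE, "png");
    }

    public static void main(String[] args) {
        ImageExportPlan plan = forFilters(Arrays.asList("JPXDecode"));
        System.out.println(plan.mode + " -> ." + plan.suffix);
    }
}
```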
-Original Message-
From: Tilman Hausherr [mailto:thaush...@t-online.de]
Sent: Thursday, November 10, 2016 12:33 AM
To: users@pdfbox.apache.org
Subject: Re: Extracting/rendering jp2?
On 10.11.2016 at 04:29, Allison, Timothy B. wrote:
>>> private static final
I think I'm getting most of the text with pdfbox app's ExtractText...
What text are you missing, specifically?
Or, if you're missing the entire body, perhaps look at ExtractText to grab more
content?
-Original Message-
From: pulkit@gmail.com [mailto:pulkit@gmail.com] On Behalf O
PDFBox Colleagues,
Any recommendations?
Best,
Tim
-Original Message-
From: Andisa Dewi [mailto:theknight...@yahoo.com]
Sent: Monday, February 27, 2017 5:32 AM
To: u...@tika.apache.org
Subject: Extracting vector graphics from pdf
Hello guys,
I'm currently e
Might be relevant:
https://github.com/JonathanLink/PDFLayoutTextStripper
This might be helpful:
https://github.com/apache/tika/pull/152
If you want to extract tables, take a look at Tabula:
http://tabula.technology/
-Original Message-
From: viraf.bankwa...@yahoo.com.INVALID
[mailto:
allows to collect the lines. However it won't output an image.
Tilman
On 27.02.2017 at 13:20, Allison, Timothy B. wrote:
> PDFBox Colleagues,
>Any recommendations?
>
>Best,
>
> Tim
>
> -Original Message-
> From: Andisa Dew
If you have any recommendations for the more general case, let us know on
TIKA-1443 [1].
[1] https://issues.apache.org/jira/browse/TIKA-1443
-Original Message-
From: Wouter De Borger [mailto:wouter.debor...@inmanta.com]
Sent: Thursday, March 30, 2017 6:00 AM
To: users@pdfbox.apache.org
All,
According to the code in this example
(examples/src/main/java/org/apache/pdfbox/examples/pdmodel/ExtractEmbeddedFiles.java),
the "names" for the embedded files can exist in "efTree" or in children of
"efTree." Does anyone happen to know if client code needs to check further
descendants
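The name tree structure in the PDF specification allows intermediate nodes whose kids are themselves intermediate nodes, so a traversal that wants to be safe at any depth can simply recurse. The sketch below shows that idea on a hypothetical Node type that mirrors the names/kids split. It is not the PDFBox class, and the String values stand in for file specifications.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Sketch: collect the names from a name tree at any depth. */
final class NameTreeWalk {

    /** Hypothetical stand-in for a name tree node; either collection may be empty. */
    static final class Node {
        final Map<String, String> names = new LinkedHashMap<>();
        final List<Node> kids = new ArrayList<>();
    }

    /** Returns every name found in the root, its kids, and all further descendants. */
    static Map<String, String> collectNames(Node root) {
        Map<String, String> found = new LinkedHashMap<>();
        // Identity-based set guards against a malformed tree that loops back on itself.
        Set<Node> seen = Collections.newSetFromMap(new IdentityHashMap<>());
        walk(root, found, seen);
        return found;
    }

    private static void walk(Node node, Map<String, String> found, Set<Node> seen) {
        if (node == null || !seen.add(node)) {
            return;
        }
        found.putAll(node.names);
        for (Node kid : node.kids) {
            walk(kid, found, seen);
        }
    }

    public static void main(String[] args) {
        Node leaf = new Node();
        leaf.names.put("attachment.txt", "file spec placeholder");
        Node middle = new Node();
        middle.kids.add(leaf);
        Node root = new Node();
        root.kids.add(middle);
        System.out.println(collectNames(root).keySet());
    }
}
```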
All,
Over on Tika, we recently added the ability to export PDXObjectImages
(TIKA-1268) as we do now with regular attachments. Some users have noticed
some eyebrow-raising memory consumption after we made the change with some
files. We're currently using PDFBox 1.8.5.
This 4MB file shows the
The "resources" object, is it the one of a single page, or of the whole
PDF file?
Tilman
On 23.05.2014 at 18:18, Allison, Timothy B. wrote:
> I get an OOM when trying to write the embedded images to disk with straight
> PDFBox (no Tika) with -Xmx2g (tested on Java 1.7). My wr
Maybe a newbie answer...are you seeing any different values on the
PDRadioCollection or its kids with:
getAlternateNameField()
getFullyQualifiedName()
getPartialName()
Or did your client really use the same name for all three field name types for
the two buttons?
Is setValue(String value) on
All,
Over on Tika, it looks like we copied
org.apache.pdfbox.examples.pdmodel.ExtractEmbeddedFiles to extract embedded
files. As I look at the source code for PDComplexFileSpecification, I notice
that getEmbeddedFile() does not behave like getFilename(); that is, it doesn't
iterate through
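The fallback behaviour being contrasted here can be sketched generically: try a list of accessors in order and return the first non-null result. In the sketch below the suppliers are placeholders. With PDFBox they might wrap the platform-specific embedded-file accessors on the file specification, but which accessors to use, and in what order, is an assumption to check against the PDFBox version in use.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;

/** Sketch: return the first non-null value from an ordered list of accessors. */
final class FirstNonNull {

    /** Calls each supplier in order and stops at the first non-null value; null if none. */
    static <T> T of(List<Supplier<T>> candidates) {
        for (Supplier<T> candidate : candidates) {
            T value = candidate.get();
            if (value != null) {
                return value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Placeholder suppliers; real ones would call accessors on a file specification.
        List<Supplier<String>> candidates = Arrays.asList(
                () -> null,
                () -> "embedded file found by the second accessor",
                () -> "never reached");
        System.out.println(of(candidates));
    }
}
```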
46 matches