[jira] [Commented] (PDFBOX-1975) Improve TestImageIOUtils unit tests to check image resolution and compression
[ https://issues.apache.org/jira/browse/PDFBOX-1975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13937518#comment-13937518 ]

Tilman Hausherr commented on PDFBOX-1975:

I added an error log output if writeImage() returns false in rev 1578259.

Improve TestImageIOUtils unit tests to check image resolution and compression

Key: PDFBOX-1975
URL: https://issues.apache.org/jira/browse/PDFBOX-1975
Project: PDFBox
Issue Type: Task
Components: Utilities
Affects Versions: 2.0.0
Reporter: Tilman Hausherr
Assignee: Tilman Hausherr
Priority: Minor
Labels: imageio, test, tiff
Fix For: 2.0.0

Because of the problems with recent changes (see PDFBOX-1963), I will improve the unit tests so that image resolution and compression are checked. I found out that JPEGs don't have a resolution and that BMP had the wrong resolution. The fault wasn't in the Java TIFF writer as I thought before; it is in the Java PNG writer, which uses the PixelSize values wrongly, i.e. it interprets them as pixels per mm instead of mm per pixel as per the specification. The JPEG writer throws an exception: "JFIF APP0 must be first marker after SOI". The BMP writer can set the resolution, but the BMP reader doesn't read it. (Some of this might be different depending on the version.)

-- This message was sent by Atlassian JIRA (v6.2#6252)
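The unit mix-up described in this issue (pixels per mm versus mm per pixel) can be made concrete with a little arithmetic. The following is only a sketch, assuming the resolution is given in dots per inch; the class and method names are hypothetical and are not part of PDFBox or ImageIO.

```java
// Hypothetical helper, not part of PDFBox or ImageIO: shows the two
// quantities that the issue description says were confused.
final class PixelSizeUnits {
    private static final double MM_PER_INCH = 25.4;

    // Millimetres per pixel: the physical width of one pixel.
    static double mmPerPixel(double dpi) {
        return MM_PER_INCH / dpi;
    }

    // Pixels per millimetre: the reciprocal quantity.
    static double pixelsPerMm(double dpi) {
        return dpi / MM_PER_INCH;
    }

    // The DPI a reader would recover if a correctly written "mm per pixel"
    // value were mistakenly interpreted as "pixels per mm".
    static double dpiIfMisread(double dpi) {
        return mmPerPixel(dpi) * MM_PER_INCH;
    }
}
```

The two interpretations only agree at 25.4 dpi, where one pixel is exactly one millimetre wide. At any other resolution a misreading yields a wrong value, which is the kind of error a resolution check in a unit test can catch.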
Re: Visible signature image
I have just updated pdfbox and tested this feature. Everything works well.

On Sat, Mar 15, 2014 at 10:35 AM, Tilman Hausherr thaush...@t-online.de wrote:

I believe that somebody mentioned somewhere that creating the signature image didn't work properly, but I just can't find out who it was. While working on a test for JPEGFactory (PDFBOX-1969) I noticed that JPEGFactory.createFromImage() was temporarily broken (now hopefully no more), and this method is only used by PDVisibleSigBuilder.createSignatureImage(). I see now that this was created in PDFBOX-1766 by Thomas and Vakhtang - please test whether it still works.

Tilman
[jira] [Commented] (PDFBOX-1988) PDFBox ExtractText issue of PDF with no embedded fonts
[ https://issues.apache.org/jira/browse/PDFBOX-1988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13937894#comment-13937894 ]

Craig Strong commented on PDFBOX-1988:

Thank you John and Tilman. That was very quick and effective work.

PDFBox ExtractText issue of PDF with no embedded fonts

Key: PDFBOX-1988
URL: https://issues.apache.org/jira/browse/PDFBOX-1988
Project: PDFBox
Issue Type: Bug
Components: Rendering, Text extraction
Affects Versions: 1.8.4
Environment: Windows 7. Also, PASE on IBM i
Reporter: Craig Strong
Labels: patch
Fix For: 1.8.5, 2.0.0
Attachments: Test1.pdf
Original Estimate: 120h
Remaining Estimate: 120h

I have been using PDFBox 1.8.4 to extract text from several different PDF files without problems. I use the latest PDFBox app with the ExtractText command line. There is one PDF from which PDFBox (and iText) fail to extract any text, even though I can extract the text with Adobe Reader and also with pdftotext.exe, part of XPdf.

java -jar pdfbox-app-1.8.4.jar ExtractText Test1.pdf Out.txt

I don't want to have to rely on using pdftotext.exe from a PC since this is part of an automated application. I think the error relates to an unknown font type and having to use the few fonts installed in the jar file. I tried running the API classes and trying to force a font from a certain location but I still got errors. I thought I loaded the font with the loadTTF method but I don't know if that did anything with the font. I would really like to have this working straight from the ExtractText class anyway.

Here are the errors I am getting. I tried this from both a Windows 7 PC and our IBM i in the PASE environment but I get the same errors. The section starting at processEncodedText repeats a few times, so I just included the first entries.

Mar 10, 2014 3:50:44 PM org.apache.pdfbox.pdmodel.font.PDFontFactory createFont
WARNING: Substituting TrueType for unknown font subtype=
Mar 10, 2014 3:50:45 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
WARNING: java.lang.NullPointerException
Throwable occurred: java.lang.NullPointerException
    at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.loadDescriptorDictionary(PDTrueTypeFont.java:375)
    at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.ensureFontDescriptor(PDTrueTypeFont.java:221)
    at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.<init>(PDTrueTypeFont.java:119)
    at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:121)
    at org.apache.pdfbox.pdmodel.PDResources.getFonts(PDResources.java:204)
    at org.apache.pdfbox.util.PDFStreamEngine.getFonts(PDFStreamEngine.java:604)
    at org.apache.pdfbox.util.operator.SetTextFont.process(SetTextFont.java:54)
    at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:554)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
    at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
    at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:456)
    at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:381)
    at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:340)
    at org.apache.pdfbox.ExtractText.startExtraction(ExtractText.java:275)
    at org.apache.pdfbox.ExtractText.main(ExtractText.java:85)
    at org.apache.pdfbox.PDFBox.main(PDFBox.java:58)
Mar 10, 2014 3:50:45 PM org.apache.pdfbox.util.PDFStreamEngine processEncodedText
WARNING: java.lang.NullPointerException
Throwable occurred: java.lang.NullPointerException
    at org.apache.pdfbox.util.PDFStreamEngine.processEncodedText(PDFStreamEngine.java:355)
    at org.apache.pdfbox.util.operator.ShowText.process(ShowText.java:45)
    at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:554)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
[jira] [Created] (PDFBOX-1989) Save LZW and other encoded PDImageXObject resources
Tilman Hausherr created PDFBOX-1989:

Summary: Save LZW and other encoded PDImageXObject resources
Key: PDFBOX-1989
URL: https://issues.apache.org/jira/browse/PDFBOX-1989
Project: PDFBox
Issue Type: Improvement
Components: PDModel
Affects Versions: 2.0.0
Reporter: Tilman Hausherr
Assignee: Tilman Hausherr
Priority: Minor
Fix For: 2.0.0

The logo image of the file from PDFBOX-1147.png isn't extracted because PDImageXObject.getSuffix() returns null. Changing getSuffix() so that it returns "png" gives us a correct file. With some other images, e.g. in the raw_image_demo.pdf file, getSuffix() throws an NPE when getPDStream().getFilters() returns null. This happens with images that are uncompressed. Returning "png" for this case also gives us a nice image.
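The null handling described in this issue can be sketched as follows. This is not the code committed to PDFBox: the class ImageSuffixSketch and the method suffixFor are hypothetical, and the mapping from filter names to suffixes is an assumption made for illustration. Only the behaviour "return png when the filter list is null or when the filter has no native image format" is taken from the description.

```java
import java.util.List;

// Hypothetical stand-in for the suffix decision; not the PDFBox implementation.
final class ImageSuffixSketch {
    // 'filters' plays the role of the list returned by getFilters(),
    // which the issue says can be null for uncompressed images.
    static String suffixFor(List<String> filters) {
        if (filters == null || filters.isEmpty()) {
            return "png"; // uncompressed image data: save losslessly as PNG
        }
        if (filters.contains("DCTDecode")) {
            return "jpg"; // assumed mapping: JPEG data
        }
        if (filters.contains("CCITTFaxDecode")) {
            return "tiff"; // assumed mapping: fax-encoded data
        }
        // LZWDecode, FlateDecode and anything else without a native image
        // format: fall back to PNG instead of returning null.
        return "png";
    }
}
```

Checking for null before calling any method on the list is what avoids the NPE, and the final fallback is what avoids the null suffix for LZW-encoded images.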
[jira] [Resolved] (PDFBOX-1989) Save LZW and other encoded PDImageXObject resources
[ https://issues.apache.org/jira/browse/PDFBOX-1989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr resolved PDFBOX-1989.

Resolution: Fixed

Done in rev 1578481.
[jira] [Created] (PDFBOX-1990) Support creating PDF from lossless encoded images
Tilman Hausherr created PDFBOX-1990:

Summary: Support creating PDF from lossless encoded images
Key: PDFBOX-1990
URL: https://issues.apache.org/jira/browse/PDFBOX-1990
Project: PDFBox
Issue Type: Improvement
Reporter: Tilman Hausherr
Priority: Minor

Currently we support the insertion of TIFF and JPEG into a PDF, but not PNG. We can pass a BufferedImage, but this one will be JPEG compressed, which is not a good thing for graphics with sharp edges. I suggest that we support PNG as well. It is possible because the Flate Filter supports both directions. My implementation (coming in a few minutes) is just an RGB based start that begs for improvement.
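The idea of storing RGB samples losslessly with Flate compression can be sketched with the JDK alone. This is not the implementation referred to in this issue: the class LosslessRgbSketch and its methods are hypothetical, and the sketch assumes 8 bits per component with no alpha channel. It only shows that the pixel data survives a deflate and inflate round trip unchanged, which is what makes this approach suitable for graphics with sharp edges.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical sketch, not PDFBox code.
final class LosslessRgbSketch {
    // Packs the image as 8-bit R, G, B samples, row by row, dropping alpha.
    static byte[] toRgbBytes(BufferedImage image) {
        int width = image.getWidth();
        int height = image.getHeight();
        byte[] rgb = new byte[width * height * 3];
        int i = 0;
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int pixel = image.getRGB(x, y); // 0xAARRGGBB
                rgb[i++] = (byte) (pixel >> 16);
                rgb[i++] = (byte) (pixel >> 8);
                rgb[i++] = (byte) pixel;
            }
        }
        return rgb;
    }

    // Compresses with zlib/deflate, the encoding direction.
    static byte[] deflate(byte[] raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DeflaterOutputStream deflater = new DeflaterOutputStream(out)) {
            deflater.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Decompresses again, the decoding direction.
    static byte[] inflate(byte[] compressed) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (InflaterInputStream in =
                 new InflaterInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
}
```

Calling inflate(deflate(toRgbBytes(image))) returns exactly the bytes produced by toRgbBytes(image), in contrast to JPEG compression, which changes sample values.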
[jira] [Commented] (PDFBOX-1990) Support creating PDF from lossless encoded images
[ https://issues.apache.org/jira/browse/PDFBOX-1990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13938188#comment-13938188 ]

Tilman Hausherr commented on PDFBOX-1990:

Done in rev 1578489 and 1578492 and 1578503. I also added a NullOutputStream.
[jira] [Comment Edited] (PDFBOX-1990) Support creating PDF from lossless encoded images
[ https://issues.apache.org/jira/browse/PDFBOX-1990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13938188#comment-13938188 ]

Tilman Hausherr edited comment on PDFBOX-1990 at 3/17/14 6:43 PM:

Done in rev 1578489 and 1578492 and 1578503 and 1578505. I also added a NullOutputStream.

was (Author: tilman): Done in rev 1578489 and 1578492 and 1578503. I also added a NullOutputStream.
[jira] [Commented] (PDFBOX-1975) Improve TestImageIOUtils unit tests to check image resolution and compression
[ https://issues.apache.org/jira/browse/PDFBOX-1975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13938322#comment-13938322 ]

Tilman Hausherr commented on PDFBOX-1975:

I added a test to save PDImageXObject objects from PDF within TestImageIOUtils in rev 1578544.
PDFTextStripper.pageSeparator has no effect
Hi, I tried to use the parameter pageSeparator on PDFTextStripper and noticed that it has no effect. I checked the sources and discovered that in all versions up to the current trunk, the setting is simply not used anywhere. The only method using a set separator is writePageSeperator(), which also includes a typo worth fixing, but this method isn’t called anywhere. It should probably be called in processPages(). However, and this is why I didn’t go ahead and submit a patch myself, what does happen is that the pageEnd marker is written, which is initialized to the value of pageSeparator. So if both get used, this will probably end up in the same marker emitted twice on each page break. As a result, I’m unsure what to do about this and thought I’d leave it to the core team maintaining this, so I’m just reporting it here. Regards Maik
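To make the reported behaviour concrete, here is a small self-contained model of it. It is not PDFBox code: the class PageSeparatorModel and its render method are hypothetical, and the "\n" default is an assumption. Only the field names pageSeparator and pageEnd, and the fact that pageEnd is initialized from pageSeparator, come from the report above.

```java
import java.util.List;

// Hypothetical model of the reported behaviour; not the PDFTextStripper source.
final class PageSeparatorModel {
    private String pageSeparator = "\n"; // assumed default for this sketch
    // Copied once at construction; later changes to pageSeparator
    // do not reach this field.
    private String pageEnd = pageSeparator;

    void setPageSeparator(String separator) {
        this.pageSeparator = separator;
    }

    String getPageSeparator() {
        return pageSeparator;
    }

    // Mirrors the report: only the pageEnd marker is written after each page,
    // and nothing ever consults pageSeparator.
    String render(List<String> pages) {
        StringBuilder out = new StringBuilder();
        for (String page : pages) {
            out.append(page).append(pageEnd);
        }
        return out.toString();
    }
}
```

In this model, calling setPageSeparator("---") before render leaves the output unchanged, which matches the "no effect" observation. It also shows why simply adding a call that writes pageSeparator in addition to pageEnd would emit two markers per page break when both hold the same value.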
[jira] [Commented] (PDFBOX-1847) TSA Time Signature
[ https://issues.apache.org/jira/browse/PDFBOX-1847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13938516#comment-13938516 ]

John Hewson commented on PDFBOX-1847:

[~v.koroghlishvili] Ok, I applied the changes discussed in revision 1578650. I made some significant changes to the patch so that the signing functionality can be moved into pdfbox proper, rather than being part of the examples. Currently the code remains part of the examples until we're sure it works. Can you test out the new code and see if signing is working as you expected?

*Technical Notes*

Revision 1578650 includes changes to various other files. COSStandardOutputStream assumed that the OutputStream was always a FileOutputStream, which is obviously an unsafe assumption; in fact, output streams do not generally have a position at all, so I removed all code which broke that contract. COSWriter was treating its incremental update streams in a strange manner: it wanted the InputStream and OutputStream to be backed by the same underlying data, which is not generally possible, so I had to write new code to perform incremental writing in order not to break the Input/Output stream contract. This allows the incremental file to be written to a different stream from the one which was read. I also added some new loading and saving methods to PDDocument to make incremental updating easier, and to automatically keep track of File objects, when relevant.

TSA Time Signature

Key: PDFBOX-1847
URL: https://issues.apache.org/jira/browse/PDFBOX-1847
Project: PDFBox
Issue Type: Improvement
Components: Signing
Affects Versions: 2.0.0
Reporter: vakhtang koroghlishvili
Assignee: John Hewson
Fix For: 2.0.0
Attachments: CreateSignature-updated.java.patch, TSATimeSignature.patch, resultOfSigning.jpg

When we were signing a document, we were using our own local time. For more security we can use a Time Stamp server. Trusted timestamping is the process of securely keeping track of the creation and modification time of a document. Security here means that no one, not even the owner of the document, should be able to change it once it has been recorded, provided that the timestamper's integrity is never compromised. (wiki)
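The point in the technical notes that a general OutputStream has no position can be illustrated with a common workaround: count the bytes as they pass through a wrapper instead of asking the underlying stream. The sketch below is not the code from revision 1578650; the class name PositionTrackingOutputStream and its getPosition method are hypothetical.

```java
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical wrapper: tracks how many bytes were written so far, so a
// writer can know its offset without assuming a FileOutputStream.
final class PositionTrackingOutputStream extends FilterOutputStream {
    private long position;

    PositionTrackingOutputStream(OutputStream out) {
        super(out);
    }

    @Override
    public void write(int b) throws IOException {
        out.write(b);
        position++;
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
        out.write(b, off, len);
        position += len;
    }

    // Number of bytes written through this wrapper.
    long getPosition() {
        return position;
    }
}
```

Because the count lives in the wrapper, the same writer code works whether the target is a file, a byte array, or a network stream. The inherited write(byte[]) delegates to the three-argument overload, so each byte is counted once.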
[jira] [Comment Edited] (PDFBOX-1847) TSA Time Signature
[ https://issues.apache.org/jira/browse/PDFBOX-1847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13938516#comment-13938516 ]

John Hewson edited comment on PDFBOX-1847 at 3/17/14 10:55 PM:

[~v.koroghlishvili] Ok, I applied the changes discussed in revision 1578650. I made some significant changes to the patch so that the signing functionality can be moved into pdfbox proper, rather than being part of the examples. Currently the code remains part of the examples until we're sure it works. Can you test out the new code and see if signing is working as you expected?

I've added a command line flag to CreateSignature to allow passing a TSA server URL:

{code}
usage: java org.apache.pdfbox.examples.signature.CreateSignature <pkcs12_keystore> <password> <pdf_to_sign>
options:
  -tsa <url>    sign timestamp using the given TSA server
{code}

*Technical Notes*

Revision 1578650 includes changes to various other files. COSStandardOutputStream assumed that the OutputStream was always a FileOutputStream, which is obviously an unsafe assumption; in fact, output streams do not generally have a position at all, so I removed all code which broke that contract. COSWriter was treating its incremental update streams in a strange manner: it wanted the InputStream and OutputStream to be backed by the same underlying data, which is not generally possible, so I had to write new code to perform incremental writing in order not to break the Input/Output stream contract. This allows the incremental file to be written to a different stream from the one which was read. I also added some new loading and saving methods to PDDocument to make incremental updating easier, and to automatically keep track of File objects, when relevant.
[jira] [Commented] (PDFBOX-1983) Unable to add TIF images, CCITTFactory not working
[ https://issues.apache.org/jira/browse/PDFBOX-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13938547#comment-13938547 ]

John Hewson commented on PDFBOX-1983:

Cool, it looks like PDMemoryStream is the weak link, it's not really doing what it says it is.

Unable to add TIF images, CCITTFactory not working

Key: PDFBOX-1983
URL: https://issues.apache.org/jira/browse/PDFBOX-1983
Project: PDFBox
Issue Type: Bug
Components: PDModel
Affects Versions: 2.0.0
Reporter: Joel Kääpä
Assignee: Tilman Hausherr
Fix For: 2.0.0
Attachments: G4.tif, huhu.pdf

As used in the AddImageToPDF example, the following line generates an error with a tif image:

PDImageXObject ximage = CCITTFactory.createFromRandomAccess(document, new RandomAccessFile(new File(imagePath), "r"));

java.io.IOException: Stream was not read
    at org.apache.pdfbox.cos.COSStream.getDecodeResult(COSStream.java:235)
    at org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject.<init>(PDImageXObject.java:80)
    at org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject.<init>(PDImageXObject.java:70)
    at org.apache.pdfbox.pdmodel.graphics.image.CCITTFactory.createFromRandomAccess(CCITTFactory.java:50)
[jira] [Commented] (PDFBOX-1987) Provide a PDF Lexer as a base for PDF parsing
[ https://issues.apache.org/jira/browse/PDFBOX-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13938560#comment-13938560 ]

John Hewson commented on PDFBOX-1987:

{quote}
An area which I kept out is how to handle malformed tokens such as strings which have an unbalanced number of parentheses.
{quote}

Do you have any sample PDF files with this problem?

Provide a PDF Lexer as a base for PDF parsing

Key: PDFBOX-1987
URL: https://issues.apache.org/jira/browse/PDFBOX-1987
Project: PDFBox
Issue Type: Improvement
Components: Parsing
Reporter: Maruan Sahyoun
Priority: Minor
Fix For: 2.0.0
Attachments: src.zip

In order to enhance the parsing process, and as a foundation for a combination of the different parsers, a PDF lexer should be provided.
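For readers unfamiliar with the problem mentioned in the quote: PDF literal strings may contain unescaped parentheses as long as they are balanced, so a lexer has to count nesting depth to find the end of a string. The sketch below shows that counting and how an unbalanced string is detected. It is not taken from the attached src.zip; the class LiteralStringSketch and its method are hypothetical, and what a lexer should do after detecting the problem is deliberately left open.

```java
// Hypothetical sketch of literal-string scanning; not the proposed PDF lexer.
final class LiteralStringSketch {
    // Returns the index just past the ')' that closes the literal string
    // starting at 'start', or -1 if the parentheses never balance.
    static int endOfLiteralString(String data, int start) {
        if (start >= data.length() || data.charAt(start) != '(') {
            throw new IllegalArgumentException("no literal string at index " + start);
        }
        int depth = 0;
        for (int i = start; i < data.length(); i++) {
            char c = data.charAt(i);
            if (c == '\\') {
                i++; // skip the escaped character, e.g. \( or \)
            } else if (c == '(') {
                depth++;
            } else if (c == ')') {
                depth--;
                if (depth == 0) {
                    return i + 1;
                }
            }
        }
        return -1; // ran out of input: unbalanced parentheses
    }
}
```

For the input `(a(b)c) Tj` the method returns 7, the index just past the outer closing parenthesis. For `(a(b` it returns -1, which is the malformed case a recovery strategy would have to handle.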
[jira] [Commented] (PDFBOX-1969) JPEGFactory bug
[ https://issues.apache.org/jira/browse/PDFBOX-1969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13938571#comment-13938571 ]

John Hewson commented on PDFBOX-1969:

Ok, well if someone really wants support for JPEGs which use ARGB we can follow up on this, given that it has probably never worked (quite a bit of the 1.8 image parsing code was like that).

JPEGFactory bug

Key: PDFBOX-1969
URL: https://issues.apache.org/jira/browse/PDFBOX-1969
Project: PDFBox
Issue Type: Bug
Affects Versions: 2.0.0
Reporter: Steven Burg
Fix For: 2.0.0

Attempted to run the RubberStampWithImage sample and received the following errors:

Exception in thread "main" java.lang.NullPointerException
    at org.apache.pdfbox.pdmodel.graphics.image.JPEGFactory.createFromStream(JPEGFactory.java:72)
    at org.apache.pdfbox.examples.pdmodel.RubberStampWithImage.doIt(RubberStampWithImage.java:93)
    at org.apache.pdfbox.examples.pdmodel.RubberStampWithImage.main(RubberStampWithImage.java:185)

This happens with any jpg I tested with.
[jira] [Commented] (PDFBOX-1969) JPEGFactory bug
[ https://issues.apache.org/jira/browse/PDFBOX-1969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13938572#comment-13938572 ]

John Hewson commented on PDFBOX-1969:

Shall we close this issue?
[jira] [Commented] (PDFBOX-1594) Add support for AES256 Encryption
[ https://issues.apache.org/jira/browse/PDFBOX-1594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13938576#comment-13938576 ]

John Hewson commented on PDFBOX-1594:

The problem is that this patch has been made against 1.8.4 rather than the trunk, and there are differences between the two. [~neon1] is it possible for you to make a new patch against the trunk?

Add support for AES256 Encryption

Key: PDFBOX-1594
URL: https://issues.apache.org/jira/browse/PDFBOX-1594
Project: PDFBox
Issue Type: Improvement
Reporter: Maruan Sahyoun
Fix For: 2.0.0
Attachments: pdfbox-1.8.4-aes256.diff

Adobe 9 added support for AES 256 encryption. Further information is available at http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/adobe_supplement_iso32000.pdf (especially 3.5.1) or ISO 32000-2.
[jira] [Commented] (PDFBOX-1512) TextPositionComparator is not compatible with Java 7
[ https://issues.apache.org/jira/browse/PDFBOX-1512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13938585#comment-13938585 ]

John Hewson commented on PDFBOX-1512:

Perhaps we should migrate away from using Collections.sort altogether and use some other sorting algorithm?

TextPositionComparator is not compatible with Java 7

Key: PDFBOX-1512
URL: https://issues.apache.org/jira/browse/PDFBOX-1512
Project: PDFBox
Issue Type: Bug
Components: Text extraction
Affects Versions: 1.7.1
Environment: Java 7
Reporter: Benjamin Papez
Assignee: Andreas Lehmkühler
Attachments: FOP-2252.pdf, TextPositionComparator.java, WFI_PDFParser_TextPostionComparator.txt, immo-kurier_arsenal_93x62.pdf

The TextPositionComparator causes the following exception when running on Java 7:

Unexpected RuntimeException from org.apache.tika.parser.ParserDecorator$1@9007fa2
Original cause: Comparison method violates its general contract!

I think the problem is with this check:

if ( yDifference < .1 || (pos2YBottom >= pos1YTop && pos2YBottom <= pos1YBottom) || (pos1YBottom >= pos2YTop && pos1YBottom <= pos2YBottom))

as it violates the contract requirement:

The implementor must also ensure that the relation is transitive: ((compare(x, y)>0) && (compare(y, z)>0)) implies compare(x, z)>0. Finally, the implementor must ensure that compare(x, y)==0 implies that sgn(compare(x, z))==sgn(compare(y, z)) for all z.

Java 7 now is strict and throws exceptions when the contract is violated.
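The contract violation described in this issue can be reproduced in miniature with plain numbers. The sketch below is not the PDFBox comparator: the class TolerantCompareSketch and both methods are hypothetical. It shows why "treat values closer than a tolerance as equal" breaks the second contract clause quoted above, and one possible alternative (comparing fixed-size buckets), named here only as an illustration and not as the fix chosen by the project.

```java
import java.util.Comparator;

// Hypothetical illustration of the contract problem; not TextPositionComparator.
final class TolerantCompareSketch {
    // Values closer than 'tolerance' compare as equal. This is NOT a valid
    // ordering: a==b and b==c do not imply a==c.
    static Comparator<Double> tolerant(double tolerance) {
        return (a, b) -> Math.abs(a - b) < tolerance ? 0 : Double.compare(a, b);
    }

    // Assigns each value to a fixed-size bucket and compares bucket numbers.
    // Equality is then an equivalence relation, so the contract holds.
    static Comparator<Double> bucketed(double bucketSize) {
        return (a, b) -> Long.compare(bucketOf(a, bucketSize), bucketOf(b, bucketSize));
    }

    private static long bucketOf(double value, double bucketSize) {
        return (long) Math.floor(value / bucketSize);
    }
}
```

With tolerant(0.1), the values 0.0 and 0.06 compare as equal, and so do 0.06 and 0.12, yet 0.0 compares as less than 0.12. That is exactly the situation in which compare(x, y)==0 does not imply sgn(compare(x, z))==sgn(compare(y, z)), and it is the kind of inconsistency the Java 7 sort implementation can detect and report.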
Re: Visible signature image
I just made a unit test for CreateSignature and I’ll add one for visible signatures soon.

-- John

On 17 Mar 2014, at 07:31, Vakhtang koroghlishvili vakhtang.koroghlishv...@gmail.com wrote:

I have just updated pdfbox and tested this feature. Everything works well.

On Sat, Mar 15, 2014 at 10:35 AM, Tilman Hausherr thaush...@t-online.de wrote:

I believe that somebody mentioned somewhere that creating the signature image didn't work properly, but I just can't find out who it was. While working on a test for JPEGFactory (PDFBOX-1969) I noticed that JPEGFactory.createFromImage() was temporarily broken (now hopefully no more), and this method is only used by PDVisibleSigBuilder.createSignatureImage(). I see now that this was created in PDFBOX-1766 by Thomas and Vakhtang - please test whether it still works.

Tilman
[jira] [Commented] (PDFBOX-1989) Save LZW and other encoded PDImageXObject resources
[ https://issues.apache.org/jira/browse/PDFBOX-1989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13938596#comment-13938596 ]

John Hewson commented on PDFBOX-1989:

+1
[jira] [Commented] (PDFBOX-1988) PDFBox ExtractText issue of PDF with no embedded fonts
[ https://issues.apache.org/jira/browse/PDFBOX-1988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13938606#comment-13938606 ]

Craig Strong commented on PDFBOX-1988:

I tested the fix on the 2.0.0 build and it worked. Thanks again.
[jira] [Closed] (PDFBOX-1988) PDFBox ExtractText issue of PDF with no embedded fonts
[ https://issues.apache.org/jira/browse/PDFBOX-1988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Craig Strong closed PDFBOX-1988.

Closing the issue.