RE: Strange performance problem with certain PDF files

2016-03-21 Thread Stahle, Patrick
Hi John / Tilman,

I have narrowed it down to a difference in how PDDocument.save() is called: 
the problem occurs when I pass a FileOutputStream and does not occur when I 
pass a java.io.File instead. We have only been able to reproduce it on some 
larger PDF files, and only in certain environments. On my Linux virtual 
machine I have not been able to reproduce it at all; it does reproduce on 
Windows and on our Solaris server (3PAR drive cluster). I have some simple 
sample code that reproduces the problem, but I don't think I can send you 
either of the two PDF files I have at hand. One is a 3D PDF of ours (TE 
Classified) and the other, ironically, is the iText v1 manual in PDF form. The 
times differ drastically: on Windows the 3D PDF takes about 3 seconds with a 
File versus 29 seconds with a FileOutputStream. The iText manual is not as 
bad, at 2 versus 20 seconds. 

Anyway, we have a workaround: we changed our code to pass a java.io.File to 
PDFBox instead of a stream. If I can find a suitable PDF that reproduces the 
problem, I will send it your way.
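
One possible explanation, which nothing in this thread confirms, is that a 
bare FileOutputStream is unbuffered: if the writer emits many small writes, 
each one goes to the operating system separately, and that tends to be costly 
on shared or network-backed drives. The JDK-only sketch below compares many 
small writes through a raw and a buffered stream. The class and method names 
are illustrative and not part of PDFBox, and the chunk size and count are 
arbitrary assumptions.

```java
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Illustrative sketch only: compares raw vs. buffered file output for many tiny writes.
final class BufferedSaveSketch {

    /** Emits `count` 16-byte writes, mimicking a writer that outputs small tokens. */
    static void writeSmallChunks(OutputStream out, int count) throws IOException {
        byte[] chunk = "0123456789abcdef".getBytes(StandardCharsets.US_ASCII);
        for (int i = 0; i < count; i++) {
            out.write(chunk);
        }
    }

    /** Writes to `target` through a raw or buffered stream; returns elapsed nanoseconds. */
    static long timeWrite(File target, int count, boolean buffered) throws IOException {
        long start = System.nanoTime();
        OutputStream raw = new FileOutputStream(target);
        // Assumption: a 64 KiB buffer; the right size depends on the environment.
        try (OutputStream out = buffered ? new BufferedOutputStream(raw, 64 * 1024) : raw) {
            writeSmallChunks(out, count);
        } // close() flushes the buffer (if any) and closes the file
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws IOException {
        File rawFile = File.createTempFile("raw", ".bin");
        File bufFile = File.createTempFile("buffered", ".bin");
        try {
            int count = 200000;
            long rawNs = timeWrite(rawFile, count, false);
            long bufNs = timeWrite(bufFile, count, true);
            System.out.println("unbuffered: " + rawNs / 1000000 + " ms");
            System.out.println("buffered:   " + bufNs / 1000000 + " ms");
        } finally {
            rawFile.delete();
            bufFile.delete();
        }
    }
}
```

If buffering does turn out to be the cause, wrapping the FileOutputStream in a 
BufferedOutputStream before passing it to save() might be an alternative to 
switching to a File; I have not verified that against our PDFs.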

Thanks,
Patrick

-----Original Message-----
From: John Hewson [mailto:j...@jahewson.com] 
Sent: Friday, March 18, 2016 4:45 PM
To: users@pdfbox.apache.org
Subject: Re: Strange performance problem with certain PDF files


> On 18 Mar 2016, at 12:01, Stahle, Patrick wrote:
> 
> Hi all,
> 
> I am running into a lot of strange performance issues with certain PDF files.
> 
> Background info:
> The strange thing is that I can't reproduce this consistently across 
> environments, although within a particular environment the behaviour for a 
> given PDF seems consistent. I do most of my development inside a VirtualBox 
> virtual machine running Fedora. The PDF files I am having problems with 
> never show performance issues on my virtual machine's local drive, but if I 
> use a VirtualBox shared drive as the source / destination for the PDF, I see 
> the problem. A co-worker working in a pure Windows environment also 
> experiences the performance problem, and we are seeing the same issue on our 
> dev Solaris servers. The performance range can be quite drastic. One of our 
> 3D PDFs (12 MB) can be opened, stamped with some text, encrypted, and saved 
> in around 8 seconds in my local environment. The same job pointing to a 
> VirtualBox shared drive, or on our Solaris server, takes minutes. In my 
> co-worker's Windows environment it takes around 30 seconds. We have really 
> only reproduced this consistently with the 12 MB 3D PDF. I have a much 
> smaller PDF (non-3D, converted from MS Office) that shows a similar 
> performance issue, but there the times range from 200 ms locally to 8 seconds.

You need to isolate the problem; you’ve got too many variables to make any 
sense of it all. Get a reproducible problem on one non-virtualised JVM first.

— John

> The one thing the two files have in common is that I see a lot of the 
> following messages on the console.
> Taking the output from the 12 MB 3D PDF file as an example:
> :
> :
> 1787 [main] DEBUG org.apache.pdfbox.pdfparser.PDFObjectStreamParser  - 
> parsed=COSObject{13166, 0}
> 
> These messages seem to be logged when the PDDocument is opened, and from 
> what I can tell I get 13,166 of them for this example PDF.
> The slowness does not start until the following line:
> document.save(outputPDFStream);
> 
> With other PDFs, including some quite large ones, I see neither this 
> performance issue nor those log messages.
> 
> I know this is not much to go on. I am working on narrowing this down to 
> something more concrete and reproducible, but I thought I would send this 
> out to see whether anyone has any ideas or has seen similar issues. 
> Suggestions?
> 
> Thanks,
> Patrick
> 


-
To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: users-h...@pdfbox.apache.org



Fwd: The Apache® Software Foundation announces Apache PDFBox™ v2.0

2016-03-21 Thread Andreas Lehmkühler



-------- Original Message --------
From: Sally Khudairi
Sent: 21 March 2016 12:44:18 CET
To: Apache Announce List
Subject: The Apache® Software Foundation announces Apache PDFBox™ v2.0

>> this announcement is available online at https://s.apache.org/Ly9B

Milestone release of Open Source Java tool for working with PDF documents 
features dozens of improvements and enhancements

Forest Hill, MD —21 March 2016— The Apache Software Foundation (ASF), the 
all-volunteer developers, stewards, and incubators of more than 350 Open Source 
projects and initiatives, announced today the availability of Apache® PDFBox™ 
v2.0, the Open Source Java tool for working with Portable Document Format (PDF) 
documents. 

PDF was first released by Adobe Systems in 1993, and became an ISO 
International Standard - ISO 32000-1 in 2008. Apache PDFBox allows for the 
creation of new PDF documents, manipulation, rendering, signing of existing 
documents and the ability to extract content from documents. In addition, 
PDFBox includes several command line utilities. In February 2015, the project 
became the first Open Source Partner Organization of the PDF Association. 

"PDF is a very popular and easy to use format for document exchange. It is used 
by millions of people every day, however the format itself is quite complicated 
and a real challenge to write a piece of software to work with it," said 
Andreas Lehmkühler, Vice President of Apache PDFBox. "This new major release of 
PDFBox includes a lot of improvements, fixes and new features which should make 
the life easier for our users." 

Under The Hood 
The Apache PDFBox library enables users to create new PDF documents, manipulate 
existing documents, extract content, digitally sign, print, and validate files 
against the PDF/A-1b standard. Its command line utilities include encrypt, 
decrypt, overlay, debugger, merger, PDFToImage, and TextToPDF. 

PDFBox v2.0 reflects 1,167 solved issues, 418 of which were back-ported to 
v1.8, as well as dozens of improvements and enhancements. Highlights include: 

 - improved rendering and text extraction 
 - Unicode support for PDF creation 
 - overhauled interactive forms support 
 - extended signing and encryption support 
 - overhauled parser including a self-healing mechanism for malformed or 
corrupted PDFs 
 - reduced memory/resources footprint including fine grained control of memory 
usage 
 - enhanced preflight module for PDF/A-1b conformance checking 
 - rearranged package structure to allow smaller runtime environments 

A guide to migrating to v2.0 is available at 
http://pdfbox.apache.org/2.0/migration.html , with community support at 
http://pdfbox.apache.org/mailinglists.html 

"We thank all the people from our small but fine community for their support," 
explained Lehmkühler. "Special thanks also goes to our fellow colleagues from 
the Apache Tika project for their cooperation in stress-testing with a corpus 
of 250,000 PDF files." 

"We are grateful for the Google Summer of Code program," said PDFBox committer 
Tilman Hausherr. "The project allowed us to hire students to improve 3D 
rendering and the PDFDebugger stand-alone application, which also sped up our 
own bug finding." 

"Apache PDFBox v2.0 is a significant milestone as it took us several years to 
complete," added Lehmkühler. "This long-awaited release is the collective 
achievement of more than 150 individuals who have contributed code to date. 
Without their frequent contributions it wouldn't be possible to drive a project 
like PDFBox." 

Availability and Oversight 
Apache PDFBox software is released under the Apache License v2.0 and is 
overseen by a self-selected team of active contributors to the project. A 
Project Management Committee (PMC) guides the Project's day-to-day operations, 
including community development and product releases. For downloads, 
documentation, and ways to become involved with Apache PDFBox, visit 
http://pdfbox.apache.org/ 

About The Apache Software Foundation (ASF) 
Established in 1999, the all-volunteer Foundation oversees more than 350 
leading Open Source projects, including Apache HTTP Server --the world's most 
popular Web server software. Through the ASF's meritocratic process known as 
"The Apache Way," more than 550 individual Members and 5,300 Committers 
successfully collaborate to develop freely available enterprise-grade software, 
benefiting millions of users worldwide: thousands of software solutions are 
distributed under the Apache License; and the community actively participates 
in ASF mailing lists, mentoring initiatives, and ApacheCon, the Foundation's 
official user conference, trainings, and expo. The ASF is a US 501(c)(3) 
charitable organization, funded by individual donations and corporate sponsors 
including Alibaba Cloud Computing, ARM, Bloomberg, Budget Direct, Cerner, 
Cloudera, Comcast, Confluent, Facebook, Google, Hortonworks, HP, Huawei, 

Aw: Re: JBIG2 Images

2016-03-21 Thread Felix Hermann
I opened an issue on GitHub (https://github.com/levigo/jbig2-imageio/issues/9).


 

Sent: Saturday, 19 March 2016 at 08:35
From: "Tilman Hausherr"
To: users@pdfbox.apache.org
Subject: Re: JBIG2 Images
On 14.03.2016 at 10:24, Felix Hermann wrote:
> My interpretation: The compiler finds 
> org.jpedal.jbig2.jai.JBIG2ImageReaderSpi. However, it does not realize that 
> there is an ImageReader ...

That's the jpedal plugin. We tried to use that one a few years ago and
failed. It's the levigo plugin that works (at least for us). I'm
wondering why you haven't opened an issue on their site, re the deadlock
you got.
https://github.com/levigo/jbig2-imageio/issues
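
To check which JBIG2 plugin ImageIO actually picks up from the classpath, a 
small sketch along these lines may help. It uses only the standard 
javax.imageio API; the class and method names are illustrative, and "JBIG2" as 
the registered format name is an assumption about how the plugins register 
themselves.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.spi.ImageReaderSpi;

// Illustrative sketch only: lists the ImageIO reader providers for a format name.
final class ListImageReaders {

    /** Returns the SPI class names of all ImageIO readers registered for the format. */
    static List<String> readerProviders(String formatName) {
        ImageIO.scanForPlugins(); // re-scan the classpath for plugin jars
        List<String> names = new ArrayList<>();
        Iterator<ImageReader> readers = ImageIO.getImageReadersByFormatName(formatName);
        while (readers.hasNext()) {
            ImageReaderSpi spi = readers.next().getOriginatingProvider();
            names.add(spi == null ? "(unknown provider)" : spi.getClass().getName());
        }
        return names;
    }

    public static void main(String[] args) {
        // Assumption: the plugin registers under the format name "JBIG2".
        String format = args.length > 0 ? args[0] : "JBIG2";
        List<String> providers = readerProviders(format);
        if (providers.isEmpty()) {
            System.out.println("No ImageIO reader registered for " + format);
        }
        for (String provider : providers) {
            System.out.println(provider);
        }
    }
}
```

If more than one provider is listed, both plugins are on the classpath and 
either could be selected for decoding.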

