Jenkins build became unstable: PDFBox » PDFBox-1.8.x #13

2021-03-11 Thread Apache Jenkins Server
See 



-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org



Jenkins build became unstable: PDFBox » PDFBox-1.8.x » Apache PDFBox #13

2021-03-11 Thread Apache Jenkins Server
See 






[jira] [Commented] (PDFBOX-4892) Improve code quality (4)

2021-03-11 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-4892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17300091#comment-17300091
 ] 

ASF subversion and git services commented on PDFBOX-4892:
-

Commit 1887530 from Tilman Hausherr in branch 'pdfbox/branches/1.8'
[ https://svn.apache.org/r1887530 ]

PDFBOX-4892: set individual initial ArrayList size and simplify code, as 
suggested by valerybokov
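
For readers unfamiliar with the kind of change the commit message describes, the following is a minimal sketch of what setting an initial ArrayList size usually means in Java. It is not code from the commit; the class and method names (`InitialCapacityExample`, `copyWithKnownSize`) are hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not taken from the PDFBox sources.
class InitialCapacityExample {

    // When the final element count is known up front, passing it to the
    // ArrayList constructor sizes the backing array once, instead of
    // letting it grow step by step while elements are added.
    static List<String> copyWithKnownSize(String[] source) {
        List<String> result = new ArrayList<>(source.length);
        for (String s : source) {
            result.add(s);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> names = copyWithKnownSize(new String[] {"a", "b", "c"});
        System.out.println(names);
    }
}
```

The initial capacity is only a sizing hint: the list still grows if more elements are added than were announced.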

> Improve code quality (4)
> 
>
> Key: PDFBOX-4892
> URL: https://issues.apache.org/jira/browse/PDFBOX-4892
> Project: PDFBox
>  Issue Type: Improvement
>Affects Versions: 2.0.20
>Reporter: Tilman Hausherr
>Priority: Minor
>
> This is a longterm issue for the task to improve code quality, by using the 
> [SonarQube report|https://sonarcloud.io/project/issues?id=pdfbox-reactor], 
> hints in different IDEs, the FindBugs tool and other code quality tools.
> This is a follow-up of PDFBOX-4071, which was getting too long.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)




[jira] [Commented] (PDFBOX-4892) Improve code quality (4)

2021-03-11 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-4892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17300089#comment-17300089
 ] 

ASF subversion and git services commented on PDFBOX-4892:
-

Commit 1887528 from Tilman Hausherr in branch 'pdfbox/branches/2.0'
[ https://svn.apache.org/r1887528 ]

PDFBOX-4892: set individual initial ArrayList size and simplify code, as 
suggested by valerybokov

> Improve code quality (4)
> 
>
> Key: PDFBOX-4892
> URL: https://issues.apache.org/jira/browse/PDFBOX-4892
> Project: PDFBox
>  Issue Type: Improvement
>Affects Versions: 2.0.20
>Reporter: Tilman Hausherr
>Priority: Minor
>
> This is a longterm issue for the task to improve code quality, by using the 
> [SonarQube report|https://sonarcloud.io/project/issues?id=pdfbox-reactor], 
> hints in different IDEs, the FindBugs tool and other code quality tools.
> This is a follow-up of PDFBOX-4071, which was getting too long.






[jira] [Commented] (PDFBOX-4892) Improve code quality (4)

2021-03-11 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-4892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17300090#comment-17300090
 ] 

ASF subversion and git services commented on PDFBOX-4892:
-

Commit 1887529 from Tilman Hausherr in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1887529 ]

PDFBOX-4892: set individual initial ArrayList size and simplify code, as 
suggested by valerybokov

> Improve code quality (4)
> 
>
> Key: PDFBOX-4892
> URL: https://issues.apache.org/jira/browse/PDFBOX-4892
> Project: PDFBox
>  Issue Type: Improvement
>Affects Versions: 2.0.20
>Reporter: Tilman Hausherr
>Priority: Minor
>
> This is a longterm issue for the task to improve code quality, by using the 
> [SonarQube report|https://sonarcloud.io/project/issues?id=pdfbox-reactor], 
> hints in different IDEs, the FindBugs tool and other code quality tools.
> This is a follow-up of PDFBOX-4071, which was getting too long.






Re: 2.0.22 vs 2.0.23

2021-03-11 Thread Tilman Hausherr

On 11.03.2021 at 09:00, sahy...@fileaffairs.de wrote:

> > The three new exceptions weren't in earlier reports.
> >
> > IIRC the reason Tika uses Jempbox is because Xmpbox fails when there is
> > a non standard schema.
>
> Would it make sense to add that support? If yes, could we get samples of
> various schemas to support that development? I could look into that if we
> think that's worth the effort.

Here's an example:

https://issues.apache.org/jira/browse/PDFBOX-3440


Tilman



[jira] [Closed] (PDFBOX-5125) Slightly slanted line with right side higher than the left confuses PDFTextStripper with sortByPosition=true

2021-03-11 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-5125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr closed PDFBOX-5125.
---
Resolution: Not A Bug

> Slightly slanted line with right side higher than the left confuses 
> PDFTextStripper with sortByPosition=true
> 
>
> Key: PDFBOX-5125
> URL: https://issues.apache.org/jira/browse/PDFBOX-5125
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 2.0.22
>Reporter: Gábor Stefanik
>Priority: Major
> Attachments: BB-8541-1-ocr.pdf
>
>
> The attached PDF, when run through PDFTextStripper with sortByPosition=true, 
> yields improperly ordered text: the beginnings of lines are printed after the 
> ends of the same lines, after a superfluous linebreak. There are also some 
> additional erroneous linebreaks that do not result in the text reversing, 
> like the one in "keretmegállapodásos".
> PDFBox extracts:
> {noformat}
> lőállító eszközök szállítása és kapcsolódó szolgáltatások 2013”
> „Nyomat e
> árgyban lefolytatott központosított közbeszerzési keretmegállapodáso
> s eljárás 2. része
> t
> (Általános Multifunkciós eszközök) eredményeképpen a Beszerző és El
> adó között
> keretmegállapodás jött létre (továbbiakban: KM).{noformat}
> The same PDF opened in Adobe Reader, and all the text in it copied out:
> {noformat}
> „Nyomat előállító eszközök szállítása és kapcsolódó szolgáltatások 2013”
> tárgyban lefolytatott központosított közbeszerzési keretmegállapodásos 
> eljárás 2. része
> (Általános Multifunkciós eszközök) eredményeképpen a Beszerző és Eladó között
> keretmegállapodás jött létre (továbbiakban: KM).{noformat}
> (The word "teljesítése" is missing in both extractions due to an OCR error;
> that's an issue with Tesseract and unrelated to this issue.)
> In Firefox (pdf.js), we get:
> {noformat}
> „Nyomatelőállítóeszközökszállításaés 
> kapcsolódószolgáltatások2013”tárgybanlefolytatottközpontosítottközbeszerzésikeretmegállapodásoseljárás2.
>   része(ÁltalánosMultifunkcióseszközök)eredményeképpena  Beszerzőés  
> Eladóközöttkeretmegállapodásjöttlétre(továbbiakban:KM).{noformat}
> (The missing spaces are a well-known incompatibility between Tesseract 4.0
> and pdf.js, worked around in Tesseract 4.1, but the order of the text remains
> correct.)
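
One way to picture the symptom reported above is a toy model in which text fragments are grouped into horizontal bands by their vertical position and then read left to right within each band. This is only an assumption made for illustration: it is not PDFTextStripper's actual comparator, and the names `SlantedLineSortSketch`, `Fragment`, `band` and `readingOrder` are hypothetical. In such a model, a line that rises slightly to the right can straddle a band boundary, so its right-hand part is emitted before its beginning.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Toy model only; NOT the sorting logic used by PDFBox.
class SlantedLineSortSketch {

    // A piece of text and the position where it starts.
    // y grows upwards, as in PDF user space.
    static final class Fragment {
        final String text;
        final double x;
        final double y;

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Index of the horizontal band that contains the given y coordinate.
    static long band(double y, double bandHeight) {
        return (long) Math.floor(y / bandHeight);
    }

    // Reads fragments top band first, and left to right inside a band.
    static String readingOrder(List<Fragment> fragments, double bandHeight) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator
                .comparingLong((Fragment f) -> band(f.y, bandHeight))
                .reversed()
                .thenComparingDouble(f -> f.x));
        StringBuilder out = new StringBuilder();
        for (Fragment f : sorted) {
            out.append(f.text);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // One visual line whose right side is slightly higher than its left.
        List<Fragment> slanted = Arrays.asList(
                new Fragment("be", 0, 99.0),
                new Fragment("gin", 10, 99.6),
                new Fragment("ning", 20, 100.2),
                new Fragment("-end", 30, 100.8));
        System.out.println(readingOrder(slanted, 1.0));
    }
}
```

With a band height of 1.0 the right half of the slanted line lands in a higher band than the left half, so the end of the line is printed before its beginning, which resembles the ordering described in the report. The same fragments on a level baseline come out in their natural order.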






Re: asking for help regarding ClassNotFoundException: PreflightParser

2021-03-11 Thread Tilman Hausherr
Likely a classpath problem. Are there any other external jars that you 
use? Do these work? If yes, look at how these are used in the pom.xml file.
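
A quick way to test the classpath hypothesis is to ask the running application whether it can see the class at all, and from which location it was loaded. The sketch below uses only JDK calls; the helper name `ClasspathCheck.locate` is hypothetical, and the class name passed to it is the one from the reported exception.

```java
import java.security.CodeSource;

// Diagnostic sketch using only the JDK; the helper names are hypothetical.
class ClasspathCheck {

    // Returns the location the class was loaded from, or null if the
    // given class loader cannot find or link it.
    static String locate(String className, ClassLoader loader) {
        try {
            // 'false' avoids running static initializers during the check.
            Class<?> c = Class.forName(className, false, loader);
            CodeSource src = c.getProtectionDomain().getCodeSource();
            return src != null
                    ? String.valueOf(src.getLocation())
                    : "(bootstrap or unknown location)";
        } catch (ClassNotFoundException | LinkageError e) {
            return null;
        }
    }

    public static void main(String[] args) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        String name = "org.apache.pdfbox.preflight.parser.PreflightParser";
        String where = locate(name, loader);
        System.out.println(where == null
                ? name + " is not visible to this class loader"
                : name + " was loaded from " + where);
    }
}
```

To be meaningful, the check would need to run inside the deployed application, so that the class loader is the one the application server actually uses; a result of "not visible" there would suggest the preflight jar did not end up in the deployed archive.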


Tilman

On 11.03.2021 at 14:15, Szendi, Bence wrote:

Hello,

I would like to ask for your help regarding the following problem I am facing.
I am getting this exception, at runtime: java.lang.ClassNotFoundException: 
org.apache.pdfbox.preflight.parser.PreflightParser
The code is based on Java 8 and runs on a Weblogic 12c, built with maven 3.6.0.

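I have the following dependencies in my project:

 <dependency>
     <groupId>org.apache.pdfbox</groupId>
     <artifactId>pdfbox</artifactId>
     <version>2.0.22</version>
     <scope>compile</scope>
 </dependency>

 <dependency>
     <groupId>org.apache.pdfbox</groupId>
     <artifactId>preflight</artifactId>
     <version>2.0.22</version>
     <scope>compile</scope>
 </dependency>

 <dependency>
     <groupId>org.apache.pdfbox</groupId>
     <artifactId>fontbox</artifactId>
     <version>2.0.22</version>
     <scope>compile</scope>
 </dependency>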

My code:
import org.apache.pdfbox.preflight.Format;
import org.apache.pdfbox.preflight.PreflightDocument;
import org.apache.pdfbox.preflight.ValidationResult;
import org.apache.pdfbox.preflight.ValidationResult.ValidationError;
import org.apache.pdfbox.preflight.parser.PreflightParser;
...
public void validatePDF(DataHandler pdf) ...
{
...
   PreflightParser parser = new PreflightParser(pdf.getDataSource());
   parser.parse(Format.PDF_A1A);
   PreflightDocument document = parser.getPreflightDocument();
   document.validate();
   ValidationResult result = document.getResult();
...

( The exception is thrown at the line PreflightParser parser = new 
PreflightParser(pdf.getDataSource()); )

Remark: I have added fontbox to the dependencies based on an answer in this 
post: 
https://stackoverflow.com/questions/18503159/getting-java-lang-noclassdeffounderror-org-pdfbox-pdfparser,
 but it did not solve the problem.



Can you help me find the solution?



Thanks in advance and best regards,
Bence Szendi




This message is for the designated recipient only and may contain privileged, 
proprietary, or otherwise confidential information. If you have received it in 
error, please notify the sender immediately and delete the original. Any other 
use of the e-mail by you is prohibited. Where allowed by local law, electronic 
communications with Accenture and its affiliates, including e-mail and instant 
messaging (including content), may be scanned by our systems for the purposes 
of information security and assessment of internal compliance with Accenture 
policy. Your privacy is important to us. Accenture uses your personal data only 
in compliance with data protection laws. For further information on how 
Accenture processes your personal data, please see our privacy statement at 
https://www.accenture.com/us-en/privacy-policy.
__

www.accenture.com







[GitHub] [pdfbox] valerybokov edited a comment on pull request #107: potential memory leaks and small performance improvements

2021-03-11 Thread GitBox


valerybokov edited a comment on pull request #107:
URL: https://github.com/apache/pdfbox/pull/107#issuecomment-796882339


   Thanks. I've been busy too



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org






[GitHub] [pdfbox] valerybokov commented on pull request #107: potential memory leaks and small performance improvements

2021-03-11 Thread GitBox


valerybokov commented on pull request #107:
URL: https://github.com/apache/pdfbox/pull/107#issuecomment-796882339


   Thanks. I've been busy for too






asking for help regarding ClassNotFoundException: PreflightParser

2021-03-11 Thread Szendi, Bence
Hello,

I would like to ask for your help regarding the following problem I am facing.
I am getting this exception, at runtime: java.lang.ClassNotFoundException: 
org.apache.pdfbox.preflight.parser.PreflightParser
The code is based on Java 8 and runs on a Weblogic 12c, built with maven 3.6.0.

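I have the following dependencies in my project:

<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>2.0.22</version>
    <scope>compile</scope>
</dependency>

<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>preflight</artifactId>
    <version>2.0.22</version>
    <scope>compile</scope>
</dependency>

<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>fontbox</artifactId>
    <version>2.0.22</version>
    <scope>compile</scope>
</dependency>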

My code:
import org.apache.pdfbox.preflight.Format;
import org.apache.pdfbox.preflight.PreflightDocument;
import org.apache.pdfbox.preflight.ValidationResult;
import org.apache.pdfbox.preflight.ValidationResult.ValidationError;
import org.apache.pdfbox.preflight.parser.PreflightParser;
...
public void validatePDF(DataHandler pdf) ...
{
...
  PreflightParser parser = new PreflightParser(pdf.getDataSource());
  parser.parse(Format.PDF_A1A);
  PreflightDocument document = parser.getPreflightDocument();
  document.validate();
  ValidationResult result = document.getResult();
...

( The exception is thrown at the line PreflightParser parser = new 
PreflightParser(pdf.getDataSource()); )

Remark: I have added fontbox to the dependencies based on an answer in this 
post: 
https://stackoverflow.com/questions/18503159/getting-java-lang-noclassdeffounderror-org-pdfbox-pdfparser,
 but it did not solve the problem.



Can you help me find the solution?



Thanks in advance and best regards,
Bence Szendi






Re: 2.0.22 vs 2.0.23

2021-03-11 Thread sahy...@fileaffairs.de
On Thursday, 11.03.2021 at 07:56 +0100, Tilman Hausherr wrote:
> On 11.03.2021 at 07:46, Andreas Lehmkuehler wrote:
> > On 11.03.21 at 07:24, Tilman Hausherr wrote:
> > > new report
> > > http://home.snafu.de/tilman/tmp/reports_pdfbox_2.0.22_vs_2.0.23_5.tar.xz
> > >
> > > The content differences part is now the smallest ever, likely due to
> > > my change in tika-eval (TIKA-3314) and restoring a PDFBox code
> > > segment I accidentally deleted (PDFBOX-5115).
> > Cool!!
> >
> > > There are three new exceptions. Two are in jempbox and one is in tika
> > > itself so I suspect PDFBox isn't to blame. I'll look at it too if I
> > > have the time.
> > As far as I remember the jempbox issue isn't new, Tim mentioned it
> > some time ago. Just out of curiosity, does it make sense to use an old
> > lib to extract metadata? Is there anything missing in xmpbox but
> > available in jempbox?
> >
> The three new exceptions weren't in earlier reports.
>
> IIRC the reason Tika uses Jempbox is because Xmpbox fails when there is
> a non standard schema.

Would it make sense to add that support? If yes, could we get samples of
various schemas to support that development? I could look into that if we
think that's worth the effort.

Maruan


> 
> Tilman

-- 
Maruan Sahyoun

FileAffairs GmbH
Josef-Schappe-Straße 21
40882 Ratingen

Tel: +49 (2102) 89497 88
Fax: +49 (2102) 89497 91
sahy...@fileaffairs.de
www.fileaffairs.de

Managing Director: Maruan Sahyoun
Commercial Register: AG Düsseldorf, HRB 53837
VAT ID: DE248275827

