Re: Hello, I have a question in extracting Texts from PDF file.

Tilman Hausherr Wed, 18 May 2016 00:11:53 -0700

On 18.05.2016 at 04:21, Kay_Lee wrote:

Hello,
I live in South Korea, in Far-East Asia, and I'm using Apache PDFBox to extract text from PDF files.
Name: Su-Sang, Lee (English name: Kay Lee)
Cell Phone: +82-10-3180-7976
Residence: Seoul, South Korea, Asia
E-mail: [email protected] (or [email protected])
My software development environment is Windows 10, Visual Studio 2015, C#, and PDFBox version 1.1.1 (a build of the Apache PDFBox library for .NET binaries, available as a NuGet package).
I can extract text (in our Korean language) from PDF files, with many thanks to the Apache Foundation.
However, my main concern is that PDFBox takes a little longer to extract text than iTextSharp and other competitors.
All I need is to extract Korean text from PDF files, nothing more.
I searched the internet, including Google and Stack Overflow, but found no specific
solution and only a few relevant cases.

1) How can I extract text faster?


You can't. Unless you have a "turbo" or "nitro" button on the computer.

Make sure you're opening the files as files and not as streams. But I see below that you already do that, i.e. your code is good.

2) And do I need the whole library, with more than 30 MB of files, if I only need to
extract text?

Of PDFBox itself, you need pdfbox and fontbox and logging. If files are encrypted, then also bouncy castle. You won't need xmp and the image libraries. See also here:

https://pdfbox.apache.org/1.8/dependencies.html

If I only need some specific DLL files among all the PDFBox DLL
files, could you please let me know which ones?

3) Is it still OK to use PDFBox 1.1.1? There seem to be more recent versions, such as 1.8.12
and 2.0.1.

Indeed. However, there is no official .NET release, i.e. none of the "very active developers" is currently using that one (an older release is here: http://pdfbox.lehmi.de/ ). And I doubt the newer versions will be faster. However, they'll extract better.


There is a guide from 2012 to create the dlls:
https://web.archive.org/web/20120204060917/http://pdfbox.apache.org/userguide/dot_net.html
but I don't know if it works.

See also this: http://www.squarepdf.net/pdfbox-in-net
https://stackoverflow.com/questions/8441991/how-to-build-pdfbox-for-net

I don't belong to any company or organization; I'm just a private person developing software to be distributed and used for free for 5 years, for the public benefit. As my major was not software-related but biochemistry, please bear with me and explain in as much detail as you can.

If you're non-profit and willing to distribute the source code, you can use iText, see here: http://itextpdf.com/AGPL


My simple code to extract text from a PDF file is:

internal static string ExtractTextFromPdf(string path)
{
    PDDocument doc = null;
    try
    {
        doc = PDDocument.load(path);
        PDFTextStripper stripper = new PDFTextStripper();
        stripper.setSuppressDuplicateOverlappingText(false);
        return stripper.getText(doc);
    }
    finally
    {
        if (doc != null)
        {
            doc.close();
        }
    }
}


Yes that code is fine.

Tilman

I hope for your kind and excellent support.


Thank you so much!

Mr. Su-Sang, Lee (Kay Lee)
+82-10-3180-7976
[email protected]




