Hello, I have a question about extracting text from a PDF file.

Kay_Lee Tue, 17 May 2016 19:21:27 -0700

Hello,
 
I live in South Korea, in East Asia, and I am using Apache PDFBox to 
extract text from PDF files.
Name: Su-Sang, Lee (English name: Kay Lee)
Cell Phone: +82-10-3180-7976
Residence: Seoul, South Korea, Asia
E-mail: [email protected] (or [email protected])
 
My software development environment is:
 
Windows 10, Visual Studio 2015, C#, PDFBox version 1.1.1 (a build of the Apache 
PDFBox library for .NET, available as a NuGet package).
 
I can extract text (in Korean, our language) from PDF files, with many thanks 
to the Apache Software Foundation.
 
However, my main concern is that PDFBox takes a little longer to extract text 
than iTextSharp and other competitors.
 
All I need is to extract Korean text from PDF files, nothing more.


I searched the internet, including Google and Stack Overflow, but found only a 
limited number of cases and no specific solution.

1) How can I extract text faster?
 
2) Do I need the whole library, which is more than 30 MB of files, if I only 
need to extract text?
If I only need some specific DLL files among all the PDFBox DLL files, could 
you please let me know which ones?

3) Is it still OK to use PDFBox 1.1.1? There seem to be more recent versions, 
such as 1.8.12 and 2.0.1.
 
I don't belong to any company or organization. I am a private individual 
developing software that will be distributed and used free of charge for 5 
years, for the public benefit. As my major is not software-related but 
biochemistry, please kindly understand and explain in as much detail as you can.

My simple code to extract text from a PDF file is:

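using org.apache.pdfbox.pdmodel;   // PDDocument
using org.apache.pdfbox.util;      // PDFTextStripper (PDFBox 1.x namespace)

// Loads the PDF at the given path and returns all of its text.
internal static string ExtractTextFromPdf(string path)
{
    PDDocument doc = null;
    try
    {
        doc = PDDocument.load(path);
        PDFTextStripper stripper = new PDFTextStripper();
        // Do not filter out duplicate overlapping text.
        stripper.setSuppressDuplicateOverlappingText(false);
        return stripper.getText(doc);
    }
    finally
    {
        // Always release the document, even if extraction fails.
        if (doc != null)
        {
            doc.close();
        }
    }
}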
 
I hope for your kind support.

Thank you so much!

Mr. Su-Sang, Lee (Kay Lee)
+82-10-3180-7976
[email protected]
