Hello,
I live in South Korea, in East Asia, and I am using Apache PDFBox to
extract text from PDF files.
Name: Su-Sang, Lee (English name: Kay Lee)
Cell Phone: +82-10-3180-7976
Residence: Seoul, South Korea, Asia
E-mail: [email protected] (or [email protected])
My software development environment is:
Windows 10, Visual Studio 2015, C#, PDFBox version 1.1.1 (a build of the Apache
PDFBox library for .NET, available as a NuGet package).
I can extract text (in Korean) from PDF files, with many thanks to the
Apache Software Foundation.
However, my main concern is that PDFBox takes somewhat longer to extract text
than iTextSharp and other competitors.
All I need is to extract Korean text from PDF files, nothing more.
I searched the internet, including Google and Stack Overflow, but found only a
few related cases and no specific solution.
1) How can I extract text faster?
2) Do I need the whole library, which is more than 30 MB of files, if I only
need to extract text?
If I only need some specific DLL files among all the PDFBox DLL files, could
you please let me know which ones?
3) Is it still OK to use PDFBox 1.1.1? There seem to be more recent versions,
such as 1.8.12 and 2.0.1.
I do not belong to any company or organization. I am a private individual
developing software to be distributed and used for free for 5 years, for the
public benefit. As my major is biochemistry rather than anything
software-related, I would be grateful if you could explain in as much detail
as possible.
My simple code to extract text from a PDF file is:
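internal static string ExtractTextFromPdf(string path)
{
    PDDocument doc = null;
    try
    {
        // Load the whole PDF document from the given file path.
        doc = PDDocument.load(path);
        PDFTextStripper stripper = new PDFTextStripper();
        // Do not filter out duplicate overlapping text.
        stripper.setSuppressDuplicateOverlappingText(false);
        return stripper.getText(doc);
    }
    finally
    {
        // Always close the document, even if extraction fails.
        if (doc != null)
        {
            doc.close();
        }
    }
}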
I hope for your kind support.
Thank you very much!
Mr. Su-Sang, Lee (Kay Lee)