Hi, we are students from Korea.
We have been trying to extract text from PDF files.
English text is extracted correctly, but when we extract Korean text,
the order of the text is broken.
This does not mean that the Korean text is missing: every character is extracted,
but the sequence is wrong.
The problem occurs when
1. the PDF contains a chart (a table), or
2. the PDF contains text in different sizes.
When the PDF contains a chart, the text in the lowest row is extracted
first and the text in the highest row is extracted last.
ex) | 가 | 나 |
When the PDF contains multiple text sizes and fonts,
the smallest text in the simplest font is extracted first and
the largest text in the less simple font is extracted last.
Please check whether this is a bug in extracting Korean text.
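import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import javafx.stage.FileChooser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper; // PDFBox 2.x; org.apache.pdfbox.util in 1.8

public static void extractStringfromPDF() throws IOException {
    final FileChooser filechooser = new FileChooser();
    File file = filechooser.showOpenDialog(null);
    if (file == null) {
        return; // the dialog was cancelled, nothing to extract
    }
    // try-with-resources closes the document even if extraction fails
    try (PDDocument document = PDDocument.load(file)) {
        PDFTextStripper pdfStripper = new PDFTextStripper();
        String text = pdfStripper.getText(document);
        File txtFile = new File(file.getPath() + ".txt");
        // append the extracted text to <input file>.txt
        try (FileWriter fw = new FileWriter(txtFile, true)) {
            fw.write(text);
        }
        System.out.println(text);
    } catch (Exception e) {
        e.printStackTrace();
    }
}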
The code above is what we use in our program.
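
To make the order we expected concrete, here is a minimal sketch. It does not
use PDFBox at all: the Fragment class, the inReadingOrder method and the
coordinates are hypothetical names and values made up for illustration, and we
assume that y grows downward from the top of the page. The second row (다, 라)
is also invented so that the example has two rows.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class ReadingOrderSketch {

    // Hypothetical holder for one piece of text and its position on the page.
    // This is NOT a PDFBox class. Assumption: y = 0 is the top of the page.
    static final class Fragment {
        final String text;
        final double x;
        final double y;

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Returns the fragment texts ordered top-to-bottom, then left-to-right.
    static List<String> inReadingOrder(List<Fragment> fragments) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator
                .comparingDouble((Fragment f) -> f.y)
                .thenComparingDouble(f -> f.x));
        List<String> result = new ArrayList<>();
        for (Fragment f : sorted) {
            result.add(f.text);
        }
        return result;
    }

    public static void main(String[] args) {
        // Two table rows listed bottom row first, like the output we get.
        List<Fragment> extracted = Arrays.asList(
                new Fragment("다", 10, 40),
                new Fragment("라", 60, 40),
                new Fragment("가", 10, 20),
                new Fragment("나", 60, 20));
        // Prints the rows in the order we expected: [가, 나, 다, 라]
        System.out.println(inReadingOrder(extracted));
    }
}
```

In other words, we expected the top row of the chart to come first and each
row to be read from left to right, whatever the language of the text is.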