Hi, we are students from Korea.
We have been trying to extract text from PDF files.
English text is extracted correctly, but when we extract Korean text,
the order of the text is broken.
This does not mean that the Korean text is missing: every character is
extracted, but the sequence is wrong.
The problem occurs when
1. the PDF contains a chart (table), or
2. the text on the page has different sizes.

When we extract text from a PDF that contains a chart, the text in the
lowest row comes out first and the text in the highest row comes out last.
ex) | 가 | 나 |
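
To make the order we expected concrete, here is a minimal sketch in plain
Java. It does not use PDFBox and is not how PDFBox works internally:
ReadingOrderSketch, Fragment and inReadingOrder are hypothetical names, the
cell labels and coordinates are made up, and we assume PDF-style coordinates
where y grows upward, so the highest row has the largest y.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ReadingOrderSketch {

    /** Hypothetical holder for one piece of text and its position on the page. */
    static final class Fragment {
        final String text;
        final double x;
        final double y;

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Joins the fragments top row first, and left to right within a row. */
    static String inReadingOrder(List<Fragment> fragments) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort((a, b) -> {
            // Assumption: y grows upward, so a larger y means a higher row.
            // Rows are matched by exact y here; real pages would need a tolerance.
            int byRow = Double.compare(b.y, a.y);
            return byRow != 0 ? byRow : Double.compare(a.x, b.x);
        });
        StringBuilder out = new StringBuilder();
        for (Fragment f : sorted) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(f.text);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // A 2x2 chart listed bottom row first, like the output we are seeing.
        List<Fragment> cells = Arrays.asList(
                new Fragment("r2c2", 60, 680),
                new Fragment("r2c1", 10, 680),
                new Fragment("r1c2", 60, 700),
                new Fragment("r1c1", 10, 700));
        System.out.println(inReadingOrder(cells)); // prints "r1c1 r1c2 r2c1 r2c2"
    }
}
```

The input list is in the bottom-row-first order that we get for Korean
charts, and the printed line is the top-row-first order that we expected.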
    

When a PDF uses several text sizes and fonts, the text in the smallest and
simplest font is extracted first, and the text in the largest and least
simple font is extracted last.

Please check whether this is a bug in the extraction of Korean text.


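import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import javafx.stage.FileChooser;
import org.apache.pdfbox.pdmodel.PDDocument;
// Assumption: PDFBox 2.x package; in 1.8.x the class is org.apache.pdfbox.util.PDFTextStripper.
import org.apache.pdfbox.text.PDFTextStripper;

public static void extractStringfromPDF() throws IOException {
    // Let the user pick the PDF to read.
    final FileChooser filechooser = new FileChooser();
    File file = filechooser.showOpenDialog(null);
    if (file == null) {
        return; // the dialog was cancelled, nothing to extract
    }
    // try-with-resources closes the document even if extraction fails.
    try (PDDocument document = PDDocument.load(file)) {
        PDFTextStripper pdfStripper = new PDFTextStripper();
        String text = pdfStripper.getText(document);

        // Append the extracted text to "<original file name>.txt" next to the PDF.
        File txtFile = new File(file.getPath() + ".txt");
        try (FileWriter fw = new FileWriter(txtFile, true)) {
            fw.write(text);
        }
        System.out.println(text);
    } catch (Exception e) {
        e.printStackTrace();
    }
}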
The code above is what we use in our program.
