Hi,
I am trying to extract the text content of PDF files from my Java code.
I am using PDFBox 0.7.3, and the examples I have found online so
far are rather limited. Basically, I did something like this:
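    // PDFBox 0.7.3 package names
    import org.pdfbox.pdmodel.PDDocument;
    import org.pdfbox.util.PDFTextStripper;

    PDDocument doc = null;
    try {
        doc = PDDocument.load("sample.pdf");
        PDFTextStripper stripper = new PDFTextStripper();
        // Extract the text of the whole document as a single String
        String text = stripper.getText(doc);
    } finally {
        // Always release the document, even if extraction fails
        if (doc != null) {
            doc.close();
        }
    }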
Unfortunately, the results are mixed. For most PDF files the extracted text
is fine, including the Chinese. For some files, however, the Chinese
characters come out as "□", and for others as "?".
My guess is that the Chinese characters come out wrong because the TTF font
files are missing. Is that right? Why do some files work and others not?
I would really like to know the reason.
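One thing I can check on my side is whether the "?" comes from my own output encoding rather than from PDFBox. Below is a minimal sketch of that check. It does not use PDFBox at all, the class and method names (EncodingCheck, roundTrip, countReplacementChars) are made up for illustration, and it assumes Java 7 or later for StandardCharsets. It only shows that encoding Chinese text with a charset that cannot represent it yields "?", and how to count U+FFFD replacement characters in an extracted string.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of PDFBox: checks how a string survives
// a given output charset.
public class EncodingCheck {

    // Encode the text to bytes with the given charset and decode it back.
    // Characters the charset cannot represent are replaced by the charset's
    // replacement byte, which is '?' for ISO-8859-1.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    // Count U+FFFD (the Unicode replacement character) in the text.
    // Some fonts draw it, or any glyph they lack, as an empty box.
    static int countReplacementChars(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == '\uFFFD') {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Two Chinese characters, written as escapes so the source file
        // encoding does not matter.
        String sample = "\u4e2d\u6587";

        // UTF-8 can represent the characters, ISO-8859-1 cannot.
        System.out.println(roundTrip(sample, StandardCharsets.UTF_8).equals(sample));
        System.out.println(roundTrip(sample, StandardCharsets.ISO_8859_1));
        System.out.println(countReplacementChars(sample));
    }
}
```

If the extracted string still holds the right characters in memory and only turns into "?" when printed or written to a file, then the output encoding would be the culprit, not the PDF. If the string itself is already wrong, the cause would have to be in the PDF or in the extraction.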
PS: Sorry for my bad English :)
Thanks.