chinese invalid charset

liyg Fri, 11 Dec 2009 01:44:27 -0800

Hi,
  I am trying to extract the text content of PDF files from my Java code.
I am using PDFBox 0.7.3, and the examples I have found online so
far are rather limited. Basically, I did something like this:


   PDDocument doc = null;
   try {
       doc = PDDocument.load("sample.pdf");
       PDFTextStripper stripper = new PDFTextStripper();
       String text = stripper.getText(doc);
   } finally {
       if (doc != null) {
           doc.close();
       }
   }

Unfortunately, the results are mixed. For most PDFs the extracted text is
fine, including the Chinese. But for some PDF files the Chinese characters
come out as "□", and for others as "?".
My guess is that the Chinese text comes out invalid because the TTF font
files are missing. Is that the reason? Why do some files work and others
not? I really want to know the reason.
PS: I'm sorry for my bad English :)
Thanks.
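
To narrow down where the characters get lost, a small check along these lines might help. It is only a sketch in plain Java: the class and method names (ExtractedTextCheck, countSuspicious, encodable) are made up for illustration and are not part of PDFBox. The idea is to see whether the placeholder characters are already in the extracted String, or whether the String is fine and the charset used for output simply cannot encode Chinese, in which case Java typically writes "?" instead.

```java
import java.nio.charset.Charset;

class ExtractedTextCheck {

    // Counts characters that often signal a failed mapping inside the String itself:
    // U+FFFD (replacement character), U+25A1 (white square) and a literal '?'.
    // Note: a literal '?' can also be a genuine question mark in the document.
    static int countSuspicious(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\uFFFD' || c == '\u25A1' || c == '?') {
                count++;
            }
        }
        return count;
    }

    // Returns true if every character of the text can be encoded in the named charset.
    static boolean encodable(String text, String charsetName) {
        return Charset.forName(charsetName).newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        // Stand-in for the String returned by stripper.getText(doc); here "中文".
        String text = "\u4E2D\u6587";

        System.out.println(countSuspicious(text));          // prints 0
        System.out.println(encodable(text, "ISO-8859-1"));  // prints false
        System.out.println(encodable(text, "UTF-8"));       // prints true
    }
}
```

If countSuspicious is already above zero for the text from a bad file, the problem is probably on the extraction side. If it is zero but the output still shows "?", the charset used for printing or writing the file is the more likely cause.
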
