Hello,

I tried to extract text from the PDF below.
http://jpdb.nihs.go.jp/jp17e/000217651.pdf

On the first page, I could retrieve the locations of some characters, but not of others. What is the reason for this?

========================
my source to retrieve character locations
========================

=====================
// class extends PDFTextStripper
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;

class PDFTextCordinateStripper extends PDFTextStripper {
    // Every TextPosition seen while the current page is processed
    public List<TextPosition> list_text = new ArrayList<TextPosition>();

    public PDFTextCordinateStripper() throws IOException {
        super();
    }

    @Override
    protected void processTextPosition(TextPosition text) {
        super.processTextPosition(text);
        list_text.add(text);
    }
}

=====================
// main (omitted; pg_w and pg_h are set in the omitted part)
PDFTextCordinateStripper stripper = new PDFTextCordinateStripper();
int len_page = doc.getNumberOfPages();
for (int ind = 1; ind <= len_page; ind++) {
    PDPage pg = doc.getPage(ind - 1);
    String str_page_num = "PageNum: " + ind;
    String str_page_size = "Width: " + pg_w + "\tHeight: " + pg_h;
    System.out.println(str_page_num + "\t" + str_page_size);

    // Extract one page at a time so list_text only holds this page
    stripper.list_text.clear();
    stripper.setStartPage(ind);
    stripper.setEndPage(ind);
    String p_text = stripper.getText(doc);

    Iterator<String> it_str = Arrays.asList(p_text.split("")).iterator();
    int ind_tp = 0;
    List<TextPosition> list_tp = stripper.list_text;
    int len_list_tp = list_tp.size();

    // Walk the page text character by character and pair each character
    // with the next unused TextPosition when the two are equal
    while (it_str.hasNext()) {
        String ch = it_str.next();
        String str_rec = "Text: " + ch;
        if (ind_tp < len_list_tp) {
            TextPosition tp = list_tp.get(ind_tp);
            if (ch.equals(tp.toString())) {
                str_rec += "\tx: " + tp.getX() + "\ty: " + tp.getY()
                        + "\tw: " + tp.getWidth() + "\th: " + tp.getHeight()
                        + "\tfont_size: " + tp.getFontSizeInPt();
                ind_tp++;
            }
        }
        System.out.println(str_rec);
    }
}
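
One possible reason, not confirmed for this PDF: the matching loop above only advances ind_tp when the current character equals the current TextPosition. If a TextPosition's text never appears as a single character at that point in the page text (for example, if its text were more than one character long, or if the list order differed from the text order), the index would stay there and every later character would be printed without a location. The sketch below reproduces that behaviour in plain Java without PDFBox. Glyph is a hypothetical stand-in for TextPosition, and lookahead is an illustrative workaround, not a PDFBox feature.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Pure-Java model of the matching loop; no PDFBox required.
class AlignmentSketch {

    // Hypothetical stand-in for PDFBox's TextPosition: text plus a location.
    static final class Glyph {
        final String text;
        final float x;
        final float y;

        Glyph(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Same strategy as the loop in the question: the glyph index advances
    // only on an exact match. Returns one entry per character, null if unmatched.
    static List<Glyph> lockstep(String pageText, List<Glyph> glyphs) {
        List<Glyph> out = new ArrayList<>();
        int g = 0;
        for (int i = 0; i < pageText.length(); i++) {
            String ch = String.valueOf(pageText.charAt(i));
            if (g < glyphs.size() && ch.equals(glyphs.get(g).text)) {
                out.add(glyphs.get(g));
                g++;
            } else {
                out.add(null);
            }
        }
        return out;
    }

    // Tolerant variant: look up to `window` glyphs ahead for a match and
    // give up on the glyphs that were skipped over.
    static List<Glyph> lookahead(String pageText, List<Glyph> glyphs, int window) {
        List<Glyph> out = new ArrayList<>();
        int g = 0;
        for (int i = 0; i < pageText.length(); i++) {
            String ch = String.valueOf(pageText.charAt(i));
            Glyph found = null;
            int limit = Math.min(glyphs.size(), g + window);
            for (int k = g; k < limit; k++) {
                if (ch.equals(glyphs.get(k).text)) {
                    found = glyphs.get(k);
                    g = k + 1;
                    break;
                }
            }
            out.add(found);
        }
        return out;
    }

    // Number of characters that received a location.
    static int countMatched(List<Glyph> matches) {
        int n = 0;
        for (Glyph m : matches) {
            if (m != null) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        // The middle glyph carries two characters, so it never equals a single character.
        List<Glyph> glyphs = Arrays.asList(
                new Glyph("A", 10f, 20f),
                new Glyph("fi", 16f, 20f),
                new Glyph("x", 24f, 20f));
        String pageText = "Afix";
        System.out.println("lockstep matched:  " + countMatched(lockstep(pageText, glyphs)));
        System.out.println("lookahead matched: " + countMatched(lookahead(pageText, glyphs, 3)));
    }
}
```

In this made-up input the lockstep loop loses "x" because it is stuck on the two-character glyph, while the lookahead variant recovers it. The lookahead can still pair a character with the wrong glyph when the same character occurs twice inside the window. In PDFBox 2.x, overriding writeString(String, List<TextPosition>) may avoid the pairing problem altogether, since the stripper passes the positions that belong to each string it writes.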

