On 04.06.2017 at 01:49, 二川村田 wrote:
Hello

I tried to extract the text from the PDF below.
http://jpdb.nihs.go.jp/jp17e/000217650.pdf
http://jpdb.nihs.go.jp/jp17e/

But the result has no spaces between words.
What should I do to extract the text correctly?

=================
I searched Google for a solution using the words below.
"pdfbox space between words"
But I couldn't find a solution.

=================
I used the method below, but the result did not change.
PDFTextStripper.setSpacingTolerance(float spacingToleranceValue)

Here's what I got with PDFBox 2.0.6 using the ExtractText command line application:

JP XVII
THE JAPANESE PHARMACOPOEIA
SEVENTEENTH EDITION
Official from April 1, 2016
English Version
THE MINISTRY OF HEALTH, LABOUR AND WELFARE
Notice: This English Version of the Japanese Pharmacopoeia is published
for the convenience of users unfamiliar with the Japanese language. When
and if any discrepancy arises between the Japanese original and its English
translation, the former is authentic.
The Ministry of Health, Labour and
Welfare Ministerial Notification No. 64
Pursuant to Paragraph 1, Article 41 of the Law on Securing Quality, Efficacy and Safety of Products including Pharmaceuticals and Medical Devices (Law No. 145, 1960), the Japanese Pharmacopoeia (Ministerial Notification No. 65, 2011), which has been established as follows*, shall be applied on April 1, 2016. However, in the case of drugs which are listed in the Pharmacopoeia (hereinafter referred to as ``previ- ous Pharmacopoeia'') [limited to those listed in the Japanese Pharmacopoeia whose standards are changed in accordance with this notification (hereinafter referred to as ``new Pharmacopoeia'')] and have been approved as of April 1, 2016 as prescribed under Paragraph 1, Article 14 of the same law [including drugs the Minister of Health, Labour and Welfare specifies (the Ministry of Health and Welfare Ministerial Notification No. 104, 1994) as of March 31, 2016 as those exempted from marketing approval pursuant to Paragraph 1, Article 14 of the Same Law (hereinafter referred to as ``drugs exempted from approval'')], the Name and Standards established in the previous Pharmacopoeia (limited to part of the Name and Standards for the drugs concerned) may be accepted to conform to the Name and Standards established in the new Pharmacopoeia before and on September 30, 2017. In the case of drugs which are listed in the new Pharmacopoeia (excluding those listed in the previous Phar- macopoeia) and drugs which have been approved as of April 1, 2016 as prescribed
under Paragraph 1, Article 14 of the same law (including those exempted from
approval), they may be accepted as those being not listed in the new Pharmacopoeia
before and on September 30, 2017.

(...)

What code did you use?

Or did you use code to get the TextPosition objects directly? In that case, you'll get spaces only if there are any in the PDF itself, and that PDF doesn't have any, at least not on page 3. The spaces you see in the text above are created by PDFBox using heuristics. Most PDF files don't contain any space characters; they just position the non-space glyphs. Example from the content stream:

  BT
    0 Tr
    0.0011 Tc
    /TT2 1 Tf
    21.8597 0 0 21.8653 151.92 746.5394 Tm
    [ (The) -331.3 (M) -1.5 (inistry) -330.6 (o) -0.4 (f) -329.4 (H) -2.6 (ealth,) -329.6 (Labour) -332.1 (and) ] TJ
  ET
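
To illustrate the idea behind such a heuristic, here is a minimal sketch in plain Java. It is not PDFBox's actual algorithm, which works on glyph positions and font metrics. It only looks at the numbers in the TJ array above, which are position adjustments in thousandths of a text space unit, and inserts a space when the gap is larger than a threshold. The class name TjSpacingSketch, the method joinTjArray, the assumed space width of 250 and the tolerance values are all hypothetical and chosen for this example only.

```java
import java.util.Arrays;
import java.util.List;

// Simplified illustration only; NOT how PDFBox is implemented.
final class TjSpacingSketch {

    // Assumption: width of a space glyph in thousandths of a text space unit.
    // In a real PDF this depends on the font.
    static final double ASSUMED_SPACE_WIDTH = 250.0;

    /**
     * Joins the operands of a TJ array into a string.
     * String operands are appended as they are. A numeric operand is a
     * position adjustment: a negative value moves the next glyph to the
     * right, so it opens a gap. A space is inserted when that gap exceeds
     * ASSUMED_SPACE_WIDTH * tolerance.
     */
    static String joinTjArray(List<Object> operands, double tolerance) {
        double threshold = ASSUMED_SPACE_WIDTH * tolerance;
        StringBuilder text = new StringBuilder();
        for (Object operand : operands) {
            if (operand instanceof String) {
                text.append((String) operand);
            } else if (operand instanceof Number) {
                double gap = -((Number) operand).doubleValue();
                // Only treat the gap as a word break if some text precedes it.
                if (gap > threshold && text.length() > 0) {
                    text.append(' ');
                }
            }
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // The TJ array from the content stream shown above.
        List<Object> operands = Arrays.<Object>asList(
                "The", -331.3, "M", -1.5, "inistry", -330.6, "o", -0.4,
                "f", -329.4, "H", -2.6, "ealth,", -329.6, "Labour", -332.1, "and");

        // Threshold 125: the gaps of about 330 count as word breaks.
        System.out.println(joinTjArray(operands, 0.5));
        // prints: The Ministry of Health, Labour and

        // Threshold 500: no gap is large enough, so the words run together.
        System.out.println(joinTjArray(operands, 2.0));
        // prints: TheMinistryofHealth,Labourand
    }
}
```

The second call shows how a threshold that is too high produces the "no spaces between words" symptom, while the small adjustments such as -1.5 are ordinary kerning and never produce a space with either setting.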


Tilman


