On 04.06.2017 at 01:49, 二川村田 wrote:
Hello,
I tried to extract the text of the PDF below:
http://jpdb.nihs.go.jp/jp17e/000217650.pdf
http://jpdb.nihs.go.jp/jp17e/
But the result has no spaces between words.
What should I do to extract the text correctly?
=================
I searched Google for a solution using the following words:
"pdfbox space between words"
But I couldn't find one.
=================
I tried the method below, but the result did not change:
PDFTextStripper.setSpacingTolerance(float spacingToleranceValue)
Here's what I got with 2.0.6 using the ExtractText command line application:
JP XVII
THE JAPANESE PHARMACOPOEIA
SEVENTEENTH EDITION
Official from April 1, 2016
English Version
THE MINISTRY OF HEALTH, LABOUR AND WELFARE
Notice: This English Version of the Japanese Pharmacopoeia is published
for the convenience of users unfamiliar with the Japanese language. When
and if any discrepancy arises between the Japanese original and its English
translation, the former is authentic.
The Ministry of Health, Labour and
Welfare Ministerial Notification No. 64
Pursuant to Paragraph 1, Article 41 of the Law on Securing Quality, Efficacy and
Safety of Products including Pharmaceuticals and Medical Devices (Law No. 145,
1960), the Japanese Pharmacopoeia (Ministerial Notification No. 65, 2011), which
has been established as follows*, shall be applied on April 1, 2016. However, in the
case of drugs which are listed in the Pharmacopoeia (hereinafter referred to as ``previ-
ous Pharmacopoeia'') [limited to those listed in the Japanese Pharmacopoeia whose
standards are changed in accordance with this notification (hereinafter referred to as
``new Pharmacopoeia'')] and have been approved as of April 1, 2016 as prescribed
under Paragraph 1, Article 14 of the same law [including drugs the Minister of
Health, Labour and Welfare specifies (the Ministry of Health and Welfare Ministerial
Notification No. 104, 1994) as of March 31, 2016 as those exempted from marketing
approval pursuant to Paragraph 1, Article 14 of the Same Law (hereinafter referred
to as ``drugs exempted from approval'')], the Name and Standards established in the
previous Pharmacopoeia (limited to part of the Name and Standards for the drugs
concerned) may be accepted to conform to the Name and Standards established in the
new Pharmacopoeia before and on September 30, 2017. In the case of drugs which
are listed in the new Pharmacopoeia (excluding those listed in the previous Phar-
macopoeia) and drugs which have been approved as of April 1, 2016 as prescribed
under Paragraph 1, Article 14 of the same law (including those exempted from
approval), they may be accepted as those being not listed in the new Pharmacopoeia
before and on September 30, 2017.
(...)
What code did you use?
Or did you use code to get the TextPosition objects directly? In that
case, you'll get spaces only if there are any in the PDF itself. But that
PDF doesn't have any, at least not on page 3. The spaces you see in the
text above are created by PDFBox using heuristics. Most PDF files
don't have any spaces; they just position the non-space glyphs. Example
from the content stream:
BT
0 Tr
0.0011 Tc
/TT2 1 Tf
21.8597 0 0 21.8653 151.92 746.5394 Tm
[ (The) -331.3 (M) -1.5 (inistry) -330.6 (o) -0.4 (f) -329.4 (H)
-2.6 (ealth,) -329.6 (Labour) -332.1 (and) ] TJ
ET
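
To illustrate why a heuristic is needed, here is a minimal sketch that turns the TJ array above back into readable text by treating any large negative adjustment as a word gap. This is not how PDFBox does it internally (as far as I know its heuristics work on the positions and widths of the glyphs); the class name TjSpaceSketch, the method decode and the threshold of 200 are assumptions made up for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox.
final class TjSpaceSketch {
    // Assumed threshold, in thousandths of a text space unit.
    // Kerning in the sample stays below 3, word gaps are around 330.
    static final double GAP_THRESHOLD = 200.0;

    // Matches either a literal string "(...)" or a number in a TJ array.
    // Escaped parentheses and hex strings are not handled in this sketch.
    private static final Pattern TOKEN =
            Pattern.compile("\\(([^)]*)\\)|(-?\\d+(?:\\.\\d+)?)");

    static String decode(String tjArray) {
        StringBuilder out = new StringBuilder();
        Matcher m = TOKEN.matcher(tjArray);
        while (m.find()) {
            if (m.group(1) != null) {
                // String operand: the glyphs themselves.
                out.append(m.group(1));
            } else {
                // Numeric operand: a negative value moves the next glyph
                // further along the line, i.e. it widens the gap.
                double adjustment = Double.parseDouble(m.group(2));
                if (-adjustment > GAP_THRESHOLD) {
                    out.append(' ');
                }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String tj = "[ (The) -331.3 (M) -1.5 (inistry) -330.6 (o) -0.4 (f) -329.4 (H) "
                + "-2.6 (ealth,) -329.6 (Labour) -332.1 (and) ] TJ";
        // prints: The Ministry of Health, Labour and
        System.out.println(decode(tj));
    }
}
```

A fixed threshold like this only works for this particular line. With other fonts, sizes or character spacing the gaps look different, which is why tolerances such as setSpacingTolerance exist.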
Tilman