Re: space between words

Tilman Hausherr Sat, 03 Jun 2017 22:18:07 -0700

On 04.06.2017 at 01:49, 二川村田 wrote:

Hello


I tried to extract text from the PDF below.
http://jpdb.nihs.go.jp/jp17e/000217650.pdf
http://jpdb.nihs.go.jp/jp17e/

But the extracted text has no spaces between words.
What should I do to extract it correctly?

=================
I searched Google for solutions using the words below.
"pdfbox space between words"
But I couldn't find a solution.

=================
I used the method below, but the result did not change.
PDFTextStripper.setSpacingTolerance(float spacingToleranceValue)


Here's what I got with 2.0.6 using the ExtractText command line application:

JP XVII
THE JAPANESE PHARMACOPOEIA
SEVENTEENTH EDITION
Official from April 1, 2016
English Version
THE MINISTRY OF HEALTH, LABOUR AND WELFARE
Notice: This English Version of the Japanese Pharmacopoeia is published
for the convenience of users unfamiliar with the Japanese language. When
and if any discrepancy arises between the Japanese original and its English
translation, the former is authentic.
The Ministry of Health, Labour and
Welfare Ministerial Notification No. 64

Pursuant to Paragraph 1, Article 41 of the Law on Securing Quality,Efficacy andSafety of Products including Pharmaceuticals and Medical Devices (LawNo. 145,1960), the Japanese Pharmacopoeia (Ministerial Notification No. 65,2011), whichhas been established as follows*, shall be applied on April 1, 2016.However, in thecase of drugs which are listed in the Pharmacopoeia (hereinafterreferred to as ``previ-ous Pharmacopoeia'') [limited to those listed in the JapanesePharmacopoeia whosestandards are changed in accordance with this notification (hereinafterreferred to as``new Pharmacopoeia'')] and have been approved as of April 1, 2016 asprescribedunder Paragraph 1, Article 14 of the same law [including drugs theMinister ofHealth, Labour and Welfare specifies (the Ministry of Health and WelfareMinisterialNotification No. 104, 1994) as of March 31, 2016 as those exempted frommarketingapproval pursuant to Paragraph 1, Article 14 of the Same Law(hereinafter referredto as ``drugs exempted from approval'')], the Name and Standardsestablished in theprevious Pharmacopoeia (limited to part of the Name and Standards forthe drugsconcerned) may be accepted to conform to the Name and Standardsestablished in thenew Pharmacopoeia before and on September 30, 2017. In the case of drugswhichare listed in the new Pharmacopoeia (excluding those listed in theprevious Phar-macopoeia) and drugs which have been approved as of April 1, 2016 asprescribed

under Paragraph 1, Article 14 of the same law (including those exempted from

approval), they may be accepted as those being not listed in the newPharmacopoeia

before and on September 30, 2017.

(...)

What code did you use?

Or did you use code to get the TextPosition objects directly? In that case, you'll get spaces only if there are any in the PDF itself. But that PDF doesn't have any, at least not on page 3. The spaces you see in the text above are created by PDFBox using heuristics. Most PDF files don't have any spaces; they just position the non-space glyphs. Example from the content stream:


  BT
    0 Tr
    0.0011 Tc
    /TT2 1 Tf
    21.8597 0 0 21.8653 151.92 746.5394 Tm

    [ (The) -331.3 (M) -1.5 (inistry) -330.6 (o) -0.4 (f) -329.4 (H) -2.6 (ealth,) -329.6 (Labour) -332.1 (and) ] TJ

  ET
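
To make the idea of such a heuristic concrete, here is a minimal sketch in plain Java. It is not PDFBox's actual algorithm, and it uses no PDFBox classes. It only looks at the numbers in the TJ array above: they are position adjustments in thousandths of a text space unit, and a negative value moves the next glyph to the right. The class name TjSpacingSketch, the method names and the threshold of 200 are assumptions made up for this example; the sketch also ignores Tc, the font size and the real glyph widths, which a real extractor has to take into account.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not PDFBox code: rebuilds a text line from the
// elements of a TJ array by guessing where the word gaps are.
final class TjSpacingSketch {

    // Assumed threshold in thousandths of a text space unit.
    // Picked for this example only; it is not a PDFBox default.
    static final double WORD_GAP_THRESHOLD = 200.0;

    // Each element is either a String (a run of glyphs) or a Number
    // (a TJ position adjustment).
    static String join(List<Object> tjArray) {
        StringBuilder out = new StringBuilder();
        for (Object element : tjArray) {
            if (element instanceof String) {
                out.append((String) element);
            } else if (element instanceof Number) {
                // A negative adjustment shifts the next glyph to the right,
                // so the visible gap is the negated value.
                double gap = -((Number) element).doubleValue();
                if (gap > WORD_GAP_THRESHOLD && out.length() > 0) {
                    out.append(' ');
                }
            }
        }
        return out.toString();
    }

    // The TJ array from the content stream excerpt above.
    static List<Object> sampleLine() {
        return Arrays.<Object>asList(
            "The", -331.3, "M", -1.5, "inistry", -330.6, "o", -0.4, "f",
            -329.4, "H", -2.6, "ealth,", -329.6, "Labour", -332.1, "and");
    }

    public static void main(String[] args) {
        System.out.println(join(sampleLine()));
    }
}
```

With the assumed threshold, the adjustments around -330 are treated as word gaps and the small ones (-0.4 to -2.6) as kerning, so the sketch prints "The Ministry of Health, Labour and". A fixed threshold like this will misfire on other fonts and layouts, which is why the result of any such heuristic depends on the tolerance settings.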


Tilman


