PDFBox - Problem with fragmented extraction result

Heike Johannsen Thu, 30 Sep 2010 23:26:19 -0700

Hello everybody!

Can someone help me with the following problem?

I'm trying to extract text from a PDF document with PDFBox, but the result
is highly fragmented.

Is there a way to overcome this?

The problem can be reproduced with this JUnit test:

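    @Test
    public void testTryThings() throws Exception
    {
        final String filename = "http://www.junkers.com/de/pmdb/brochures/Brennwert_7_181_465_853.pdf";

        // Load the PDF directly from its URL.
        final PDDocument document = PDDocument.load(new URL(filename));
        try
        {
            // Extract the text of all pages with the default stripper settings.
            final PDFTextStripper stripper = new PDFTextStripper();
            final String text = stripper.getText(document);

            System.out.println(text);
        }
        finally
        {
            // Release the resources held by the document.
            document.close();
        }
    }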

See console output:

Wärme fürs Leben
 Gas-Brennwertheizungen für Etagen,
Ein- und Mehrfamilienhäuser
Energiesparende Behaglichkeit zum Rundum-Wohlfühlen
 Gas-Brennwert-Programm

Für
  Bauhe
 rr
 en
  und
  R
 en
 o
 vie
 re
 r
Lieber Leser,
 wir bieten Ihnen für jede Wohnsituation und für jeden Komfortbedarf die 
passende
 Heiz- und Warmwasserlösung. Unsere Auswahl ist daher genauso vielfältig wie die
 verschiedenen Wünsche unserer Kunden. Um Ihnen den Überblick trotzdem
ganz leicht zu machen, haben wir für Sie Piktogramme entworfen - einprägsame
Abbildungen, die wichtige Produktmerkmale auf einen Blick zeigen.
 Was das genau bedeutet, erfahren Sie auf der Innenseite dieser Klappe.
 Unser Tipp: Lassen Sie die Leiste aufgeschlagen, wenn Sie sich unsere
 Broschüre ansehen. Dann haben Sie alle wichtigen Infos stets vor Augen.
 2

Für
  Bauhe
 rr
 en
  und
  R
 en
 o
 vie
 re
 r


Parts of the output are highly fragmented. In other documents in my
collection, the fragmentation affects most of the text. If there is no
setting that fixes this, do you perhaps have an explanation for the
phenomenon (e.g. the input file using some exotic encoding)?

Thanks in advance!

Heike
