Thanks for the reply, Til. So I need to find a way to group the TextPosition objects into words myself, using the text received from ExtractText as a guide. Is there another way to fetch each word together with its TextPosition objects?
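To make the question concrete, here is a rough sketch of the kind of grouping I have in mind, based on the gap heuristic you described. It is not PDFBox code: `Glyph` is a hypothetical stand-in for the values I would read from each TextPosition (getUnicode(), getXDirAdj(), getWidthDirAdj(), getWidthOfSpace()), and the 0.5 gap factor is just my assumption, not necessarily what PDFBox uses internally. It also assumes the glyphs come from a single left-to-right line, already in reading order.

```java
import java.util.ArrayList;
import java.util.List;

final class WordGrouper {

    /** Hypothetical stand-in for the TextPosition values used below. */
    static final class Glyph {
        final String text;       // would come from TextPosition.getUnicode()
        final float x;           // would come from TextPosition.getXDirAdj()
        final float width;       // would come from TextPosition.getWidthDirAdj()
        final float spaceWidth;  // would come from TextPosition.getWidthOfSpace()

        Glyph(String text, float x, float width, float spaceWidth) {
            this.text = text;
            this.x = x;
            this.width = width;
            this.spaceWidth = spaceWidth;
        }
    }

    // Assumed tolerance: a gap wider than half a space starts a new word.
    static final float GAP_FACTOR = 0.5f;

    /** Splits one line of glyphs into words at space glyphs and at wide gaps. */
    static List<List<Glyph>> groupIntoWords(List<Glyph> glyphs) {
        List<List<Glyph>> words = new ArrayList<>();
        List<Glyph> current = new ArrayList<>();
        Glyph previous = null;
        for (Glyph g : glyphs) {
            // Some PDFs do contain real space glyphs; treat them as separators.
            boolean isSpace = g.text.trim().isEmpty();
            // Otherwise infer a separator from the distance to the previous glyph.
            boolean wideGap = previous != null
                    && g.x - (previous.x + previous.width) > GAP_FACTOR * g.spaceWidth;
            if ((isSpace || wideGap) && !current.isEmpty()) {
                words.add(current);
                current = new ArrayList<>();
            }
            if (!isSpace) {
                current.add(g);
            }
            previous = g;
        }
        if (!current.isEmpty()) {
            words.add(current);
        }
        return words;
    }

    /** Concatenates the text of all glyphs in one word. */
    static String textOf(List<Glyph> word) {
        StringBuilder sb = new StringBuilder();
        for (Glyph g : word) {
            sb.append(g.text);
        }
        return sb.toString();
    }
}
```

For example, glyphs "H" (x=10, width=6), "i" (x=16, width=2), "P" (x=22, width=5), "D" (x=27, width=6) and "F" (x=33, width=5), each with a space width of 3, would be grouped into the two words "Hi" and "PDF", because only the gap before "P" (4 units) exceeds half a space. Each word keeps its glyph list, so font and size would still be available per character. Does this look like a reasonable direction?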
Thanks
Ankit

On Sat, 20 Oct 2018 at 11:08 PM, Tilman Hausherr <thaush...@t-online.de> wrote:

> You get the space in ExtractText but the spaces are often not in the PDF
> itself so they won't be in the TextPosition objects. PDFBox uses
> heuristics to insert spaces in the final extracted text, i.e. assume
> there is a space due to the distance between glyphs.
>
> Tilman
>
> On 20.10.2018 at 19:35, Ankit Inkollu wrote:
> > *Scenario:*
> > To get each word's details such as 'Text', 'Font', 'Size' etc. from a PDF.
> >
> > *Approach:*
> > *1.* Get 'charactersByArticle', available in the PDFTextStripper class, for
> > each page in the PDF.
> > *2.* It returns a list of TextPosition objects which contain each
> > character's text, font, font size etc.
> >
> > *Query:*
> > I am able to get the TextPosition object for each character in the PDF text,
> > but in order to define words I need the default word separator (" ")
> > from 'charactersByArticle'. Why doesn't it print the space character, or is
> > there a flag which I can set in the PDFTextStripper so that it prints the
> > text along with the space character?
> >
> > Thanks
> > Ankit