Sure Til, will do. Thanks for your time.

On Sat, 20 Oct 2018 at 11:47 PM, Tilman Hausherr <thaush...@t-online.de> wrote:

> On 20.10.2018 at 20:02, Ankit Inkollu wrote:
> > Thanks for the reply, Til. Then I need to find a way to group the
> > TextPosition objects into words, based on the text received from
> > ExtractText. Is there any other way to fetch a word as a
> > TextPosition object?
>
> No, you'd need to write your own logic, just grab the source code of
> PDFTextStripper. There may have been an answer on Stack Overflow some
> time ago but I can't find it.
>
> An older one is here:
>
> https://stackoverflow.com/questions/13971656/how-to-avoid-pdfbox-appending-separate-words
>
> IMHO it is oversimplified - the problem is that the space width is
> relative.
>
> You can easily group the words obtained from the text extraction, but
> that one doesn't have the positions.
>
> Tilman
>
> > Thanks
> > Ankit
> >
> > On Sat, 20 Oct 2018 at 11:08 PM, Tilman Hausherr <thaush...@t-online.de>
> > wrote:
> >
> >> You get the space in ExtractText but the spaces are often not in the PDF
> >> itself so they won't be in the TextPosition objects. PDFBox uses
> >> heuristics to insert spaces in the final extracted text, i.e. it assumes
> >> there is a space due to the distance between glyphs.
> >>
> >> Tilman
> >>
> >> On 20.10.2018 at 19:35, Ankit Inkollu wrote:
> >>> *Scenario:*
> >>> To get the details of each word, such as 'Text', 'Font', 'Size' etc.,
> >>> from a PDF.
> >>>
> >>> *Approach:*
> >>> *1. *Get 'charactersByArticle', available in the PDFTextStripper class,
> >>> for each page in the PDF.
> >>> *2. *It returns a list of TextPosition objects which contain each
> >>> character's text, font, font size etc.
> >>>
> >>> *Query:*
> >>> I am able to get the TextPosition object for each character in the PDF
> >>> text, but in order to define words I need the default word separator
> >>> (" ") from 'charactersByArticle'. Why doesn't it include the space
> >>> character, or is there a flag which I can set in the PDFTextStripper
> >>> so that it returns the text along with the space character?
> >>>
> >>> Thanks
> >>> Ankit
> >>>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
> >> For additional commands, e-mail: users-h...@pdfbox.apache.org
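
For anyone finding this thread later, here is a rough sketch of the kind of "own logic" suggested above: group glyphs into words wherever the horizontal gap is large relative to the font's space width. It is not PDFTextStripper's actual algorithm. The `Glyph` and `Word` classes, the `group` method and the `GAP_FACTOR` value are all hypothetical names and assumptions made up for illustration; in real code the `Glyph` fields would presumably be copied from `TextPosition.getUnicode()`, `getXDirAdj()`, `getWidthDirAdj()` and `getWidthOfSpace()`. It also assumes the glyphs belong to one line and arrive in left-to-right reading order.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: group per-glyph data into words using a relative gap threshold. */
class WordGrouper {

    /** Hypothetical stand-in for values copied out of a PDFBox TextPosition. */
    static final class Glyph {
        final String text;      // assumed source: TextPosition.getUnicode()
        final float x;          // assumed source: TextPosition.getXDirAdj()
        final float width;      // assumed source: TextPosition.getWidthDirAdj()
        final float spaceWidth; // assumed source: TextPosition.getWidthOfSpace()

        Glyph(String text, float x, float width, float spaceWidth) {
            this.text = text;
            this.x = x;
            this.width = width;
            this.spaceWidth = spaceWidth;
        }
    }

    /** A word that keeps its glyphs, so per-glyph font data stays reachable. */
    static final class Word {
        final List<Glyph> glyphs = new ArrayList<>();

        String text() {
            StringBuilder sb = new StringBuilder();
            for (Glyph g : glyphs) {
                sb.append(g.text);
            }
            return sb.toString();
        }
    }

    // Assumed tuning value: a gap wider than this fraction of the space
    // width is treated as a word break. It is relative, not absolute.
    static final float GAP_FACTOR = 0.5f;

    /** Groups the glyphs of one line (in reading order) into words. */
    static List<Word> group(List<Glyph> line) {
        List<Word> words = new ArrayList<>();
        Word current = null;
        Glyph previous = null;
        for (Glyph g : line) {
            boolean isSpace = g.text.trim().isEmpty();
            boolean wideGap = previous != null
                    && g.x - (previous.x + previous.width) > GAP_FACTOR * g.spaceWidth;
            if (isSpace || wideGap) {
                current = null; // close the word in progress
            }
            if (!isSpace) {
                if (current == null) {
                    current = new Word();
                    words.add(current);
                }
                current.glyphs.add(g);
            }
            previous = g;
        }
        return words;
    }
}
```

Calling `WordGrouper.group(glyphs)` on the glyphs of a single line returns one `Word` per run of closely spaced glyphs. Multi-line text, rotated text and right-to-left scripts would need extra handling that this sketch leaves out.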