Greetings,

This is not a direct PDFBox question, but I'm hoping someone here knows the
answer.

I've noticed that many PDF viewers, such as the built-in viewers in Chrome
and Firefox, seem to have a concept of a text block when you start selecting
text. In a visually tabular layout, a multi-row selection often stays within
one column and only spills over into the next column at a certain point.

This makes me think the viewers are somehow able to detect the block. I've
examined a few PDFs, and text blocks or rows are made up of multiple text
operations: usually one per row, but sometimes one per word or, in a few
cases, one per character. Hopefully the attached picture shows what I mean.
In this case, only the left-hand entries are selected; the amounts are not
selected even though they are part of the same visual row.

How does the viewer sense that this is one text block? I think this is
useful functionality, and I would like to grab interesting blocks from a PDF
using PDFBox. I know about the area-based text stripper
(PDFTextStripperByArea), but it feels like the viewers are extending that
concept further.
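
To make the question concrete, here is a rough sketch of the kind of
grouping I imagine a viewer might be doing. This is only my guess and not
how Chrome or Firefox actually work. The Fragment class, the
groupIntoBlocks name and the gap threshold are all made up for
illustration, and it assumes each fragment is already one line of one
column with a known bounding box (in practice the coordinates would have to
come from the PDF, e.g. from the text positions PDFBox reports).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class BlockSketch {

    // Hypothetical stand-in for one line-sized piece of positioned text.
    // Assumption: (x, y) is the top-left corner and y grows downward.
    static final class Fragment {
        final String text;
        final double x, y, width, height;

        Fragment(String text, double x, double y, double width, double height) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }
    }

    // True if the horizontal extents of the two fragments intersect.
    static boolean overlapsHorizontally(Fragment a, Fragment b) {
        return a.x < b.x + b.width && b.x < a.x + a.width;
    }

    // Greedy grouping: a fragment joins the first block whose last line it
    // overlaps horizontally and sits closely below; otherwise it starts a
    // new block. maxGapFactor is the allowed vertical gap in line heights.
    static List<List<String>> groupIntoBlocks(List<Fragment> fragments, double maxGapFactor) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y)
                .thenComparingDouble(f -> f.x));

        List<List<Fragment>> blocks = new ArrayList<>();
        for (Fragment f : sorted) {
            List<Fragment> target = null;
            for (List<Fragment> block : blocks) {
                Fragment last = block.get(block.size() - 1);
                double gap = f.y - (last.y + last.height);
                if (overlapsHorizontally(last, f) && gap <= maxGapFactor * last.height) {
                    target = block;
                    break;
                }
            }
            if (target == null) {
                target = new ArrayList<>();
                blocks.add(target);
            }
            target.add(f);
        }

        // Return only the text of each block, in reading order.
        List<List<String>> result = new ArrayList<>();
        for (List<Fragment> block : blocks) {
            List<String> lines = new ArrayList<>();
            for (Fragment f : block) {
                lines.add(f.text);
            }
            result.add(lines);
        }
        return result;
    }

    public static void main(String[] args) {
        // Two visual rows: a label on the left, an amount on the right.
        List<Fragment> rows = Arrays.asList(
                new Fragment("Rent", 50, 100, 80, 10),
                new Fragment("100.00", 400, 100, 40, 10),
                new Fragment("Groceries", 50, 114, 80, 10),
                new Fragment("250.00", 400, 114, 40, 10));
        System.out.println(groupIntoBlocks(rows, 1.0));
    }
}
```

With the sample data, the labels end up in one block and the amounts in
another, which matches the selection behaviour in the picture. It would not
handle rows that are split into words or characters without first merging
those into lines.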

[image: image.png]

Regards,

Niranjan
