A while back I sent an email to this list with a code snippet that makes it
possible to extract text from table-formatted columns. I'll re-include the
code below, to be inserted in TextStripper:

Insert at line 438:
/**
 * This sort handles text aligned into columns by using column-based
 * alignment to determine the text ordering.
 * Specifically, vertically adjacent items are grouped into sets,
 * where each set contains adjacent items with the same x (left horizontal)
 * coordinate. If a horizontally left-adjacent text item is part of a set
 * containing other vertically adjacent text items at the same x coordinate,
 * then the items in the first set are a separate column and are all added to
 * the list first, followed by the horizontally adjacent set.
 *
 * @author George Van Treeck
 *
 * @param textList the text items to reorder in place
 */
@SuppressWarnings("unchecked")
protected void sortByPosition(List<TextPosition> textList) {
    /**
     * A map of sets, keyed by x coordinate, each set containing a sublist
     * of text items all starting at the same column border.
     */
    final HashMap<Float, ArrayList<TextPosition>> set_map =
        new HashMap<Float, ArrayList<TextPosition>>();
    final int TEXT_LIST_SIZE = textList.size();
    if (TEXT_LIST_SIZE <= 1)
        return; // nothing to sort

    // Group into sets.
    Iterator<TextPosition> textIter = textList.iterator();
    while ( textIter.hasNext() )
    {
        TextPosition position = textIter.next();
        float positionX = position.getXDirAdj();
        ArrayList<TextPosition> set = set_map.get( positionX );
        if (set == null)
        {
            set = new ArrayList<TextPosition>();
            set_map.put( positionX, set );
        }
        set.add( position );
    }

    // Sort each set
    final int MAP_SIZE = set_map.size();
    if (MAP_SIZE > 0) {
        // First, sort the sets.
        Iterator<Float> mapIter = set_map.keySet().iterator();
        final ArrayList<Float> map_index = new ArrayList<Float>(MAP_SIZE);
        while ( mapIter.hasNext() )
            map_index.add( mapIter.next() );
        // Sort by x coordinate of column margin.
        Collections.sort(map_index);

        // Second, sort within each set.
        for (int i = 0; i < MAP_SIZE; i++)
        {
            ArrayList<TextPosition> set = set_map.get( map_index.get(i) );
            if (set.size() > 1)
            {
                TextPositionComparator comparator = new TextPositionComparator();
                Collections.sort( set, comparator );
            }
        }

        // Third, coalesce horizontally adjacent text items.
        // (Not implemented in this snippet.)

        // Fourth, re-order the textList.
        // Clear the list first; otherwise the sorted items would be
        // appended after the original, unsorted ones.
        textList.clear();
        for (int i = 0; i < MAP_SIZE; i++)
        {
            ArrayList<TextPosition> set = set_map.get( map_index.get(i) );
            Iterator<TextPosition> setIter = set.iterator();
            while ( setIter.hasNext() )
                textList.add( setIter.next() );
        }
    }
}

Lines 462, 463:
<< TextPositionComparator comparator = new TextPositionComparator();
<< Collections.sort( textList, comparator );
>> sortByPosition(textList);
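
For anyone who wants to experiment with the idea outside PDFBox, here is a
minimal standalone sketch of the same column-grouping approach. It is not the
patch above: Item is a hypothetical stand-in for TextPosition, y is assumed to
grow downward, and the tolerance parameter is an addition for illustration,
since grouping on exact float x values can split one column in two when the
coordinates differ slightly.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeMap;

/** Hypothetical stand-in for a text item: a label plus its x/y position. */
final class Item {
    final String text;
    final float x;
    final float y;

    Item(String text, float x, float y) {
        this.text = text;
        this.x = x;
        this.y = y;
    }
}

final class ColumnSortSketch {
    /**
     * Returns the items ordered column by column: left-most column first,
     * top to bottom within each column. An item joins an existing column
     * when its x is within `tolerance` of that column's key (the x of the
     * first item that opened the column); otherwise it opens a new column.
     */
    static List<Item> sortByColumn(List<Item> items, float tolerance) {
        // TreeMap keeps the column keys sorted by x coordinate.
        TreeMap<Float, List<Item>> columns = new TreeMap<Float, List<Item>>();
        for (Item item : items) {
            // Greatest existing key that is not more than `tolerance` to the right.
            Float key = columns.floorKey(item.x + tolerance);
            if (key == null || key < item.x - tolerance) {
                key = item.x;
                columns.put(key, new ArrayList<Item>());
            }
            columns.get(key).add(item);
        }

        List<Item> ordered = new ArrayList<Item>(items.size());
        for (List<Item> column : columns.values()) {
            // Assumption: smaller y means higher on the page.
            column.sort(Comparator.comparingDouble((Item i) -> i.y));
            ordered.addAll(column);
        }
        return ordered;
    }
}
```

With a tolerance of 0 this groups on exact x values, as the patch does. The
sketch returns a new list rather than reordering in place, which sidesteps the
need to clear the input before refilling it.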
George Van Treeck
----- Original Message ----
From: Andreas Lehmkühler <[email protected]>
To: [email protected]
Sent: Fri, January 29, 2010 12:34:46 AM
Subject: Re: trying to do better text extraction
Hi,
Sent: Fri, 29 Jan 2010 From: Ted Dunning<[email protected]>
> I am working on text extraction from some text. As you might expect,
> results are pretty good for very simple documents and very bad for some
> fancy ones.
>
> Two column documents with headers and footers and text insets are
> particularly ugly. Using the -sort option to TextExtract makes things much
> worse since lines from the insets and columns are all mixed together.
>
> I have an idea that I could build a classifier using simple machine learning
> that would quickly get the idea of what is a header and footer and would be
> able to block columns together. Given a set of non-header blocks of text,
> it should be pretty simple to discern the text flow.
>
> Thus my problem is how to find out the locations and rough presentation
> information about blocks of text in a PDF document. If there is an easy way
> to hook in during the text extraction process, that would be great. Also,
> if there is a way to get more verbose structural information out of the
> textExtract system that would be great.
>
> Does anybody have any suggestions?
Have a look at [1]. Mel and some others implemented an alternative version of
the TextStripper with additional features concerning the text structure.
> Ted Dunning, CTO
> DeepDyve
BR
Andreas Lehmkühler
[1] https://issues.apache.org/jira/browse/PDFBOX-521