Re: trying to do better text extraction

George Van Treeck Sat, 30 Jan 2010 14:16:57 -0800

A while back I sent an email to this list with a code snippet that makes it
easier to extract text from table-formatted columns. I'll re-include the code
below, to be inserted in TextStripper:


Insert at line 438
    /**
     * This sorting handles text aligned into columns by using
     * column-based alignment to determine the text ordering.
     * Specifically, vertically adjacent items are grouped into sets,
     * where each set contains adjacent items with the same x (left horizontal)
     * coordinate. If a horizontally left-adjacent text item is part of a set
     * containing other vertically adjacent text items at the same x coordinate,
     * then the items in the first set are a separate column and are all added to
     * the list first, followed by the horizontally adjacent set.
     *
     * @author George Van Treeck
     *
     * @param textList the text positions to reorder in place
     */
    @SuppressWarnings("unchecked")
    protected void sortByPosition(List<TextPosition> textList) {
      /**
       * A map of sets, each set containing a sublist of text items
       * all starting at the same column border.
       */
      final HashMap<Float, ArrayList<TextPosition>> set_map =
        new HashMap<Float, ArrayList<TextPosition>>();
      
      final int TEXT_LIST_SIZE = textList.size();
      if (TEXT_LIST_SIZE <= 1)
        return; // nothing to sort
      
      // Group into sets.
      Iterator<TextPosition> textIter = textList.iterator();
      while( textIter.hasNext() )
      {
          TextPosition position = textIter.next();
          float positionX = position.getXDirAdj();
          ArrayList<TextPosition> set = set_map.get( positionX );
          if (set == null)
          {
            set = new ArrayList<TextPosition>();
            set_map.put( positionX, set );
          }
          set.add( position );
      }
      
      // Sort each set
      final int MAP_SIZE = set_map.size();
      if (MAP_SIZE > 0) {
        // First, sort the sets.
        Iterator<Float> mapIter = set_map.keySet().iterator();
        final ArrayList<Float> map_index = new ArrayList<Float>(MAP_SIZE);
        while ( mapIter.hasNext() )
          map_index.add( mapIter.next() );
        // Sort by x coordinate of column margin.
        Collections.sort(map_index);
        // Second, sort within each set.
        for (int i = 0; i < MAP_SIZE; i++)
        {
          ArrayList<TextPosition> set = set_map.get( map_index.get(i) );
          if (set.size() > 1)
          {
            TextPositionComparator comparator = new TextPositionComparator();
            Collections.sort( set, comparator );
          }
        }
        // Third, coalesce horizontally adjacent text items.
        // Fourth, re-order the textList.
        // Clear the list first; otherwise the sorted items would be
        // appended after the original, unsorted ones.
        textList.clear();
        for (int i = 0; i < MAP_SIZE; i++)
        {
          ArrayList<TextPosition> set = set_map.get( map_index.get(i) );
          Iterator<TextPosition> setIter = set.iterator();
          while ( setIter.hasNext() )
            textList.add( setIter.next() );
        }
      }
    }

Lines 462, 463:
<< TextPositionComparator comparator = new TextPositionComparator();
<< Collections.sort( textList, comparator );
>> sortByPosition(textList);
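
For trying the grouping idea outside PDFBox, here is a minimal, self-contained
sketch of the same approach. It is not the code above: Item is a hypothetical
stand-in for TextPosition, the class and method names are made up for
illustration, items within a column are ordered by y only (assuming y grows
downward) rather than by TextPositionComparator, and it uses Java 8 features.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ColumnSortSketch {

    /** Hypothetical stand-in for TextPosition: left x, top y, and the text. */
    public static final class Item {
        public final float x;
        public final float y;
        public final String text;

        public Item(float x, float y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /** Reorders items in place: columns left to right, top to bottom within each. */
    public static void sortByColumn(List<Item> items) {
        if (items.size() <= 1) {
            return; // nothing to sort
        }
        // Group items by exact left x coordinate. A TreeMap keeps the keys
        // sorted, so the columns come out left to right without a second sort.
        Map<Float, List<Item>> columns = new TreeMap<>();
        for (Item item : items) {
            columns.computeIfAbsent(item.x, k -> new ArrayList<>()).add(item);
        }
        // Rebuild the list. Clearing first avoids appending duplicates.
        items.clear();
        for (List<Item> column : columns.values()) {
            // Assumption: smaller y means higher on the page.
            column.sort(Comparator.comparingDouble(i -> i.y));
            items.addAll(column);
        }
    }

    public static void main(String[] args) {
        List<Item> items = new ArrayList<>();
        items.add(new Item(300f, 10f, "B1"));
        items.add(new Item(50f, 10f, "A1"));
        items.add(new Item(300f, 30f, "B2"));
        items.add(new Item(50f, 30f, "A2"));
        sortByColumn(items);
        StringBuilder out = new StringBuilder();
        for (Item item : items) {
            out.append(item.text).append(' ');
        }
        System.out.println(out.toString().trim()); // prints "A1 A2 B1 B2"
    }
}
```

Note that grouping on the exact float x value means two items only land in
the same column if their x coordinates are bit-for-bit equal; text that is
offset by even a fraction of a point ends up in its own set.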

George Van Treeck



----- Original Message ----
From: Andreas Lehmkühler <[email protected]>
To: [email protected]
Sent: Fri, January 29, 2010 12:34:46 AM
Subject: Re: trying to do better text extraction

Hi,

Sent: Fri, 29 Jan 2010 From: Ted Dunning<[email protected]>

> I am working on text extraction from some text.  As you might expect,
> results are pretty good for very simple documents and very bad for some fancy
> ones.
> 
> Two column documents with headers and footers and text insets are
> particularly ugly.  Using the -sort option to TextExtract makes things much
> worse since lines from the insets and columns are all mixed together.
> 
> I have an idea that I could build a classifier using simple machine learning
> that would quickly get the idea of what is a header and footer and would be
> able to block columns together.  Given a set of non-header blocks of text,
> it should be pretty simple to discern the text flow.
> 
> Thus my problem is how to find out the locations and rough presentation
> information about blocks of text in a PDF document.  If there is an easy way
> to hook in during the text extraction process, that would be great.  Also,
> if there is a way to get more verbose structural information out of the
> textExtract system that would be great.
> 
> Does anybody have any suggestions?
Have a look at [1]. Mel and some others implemented an alternative version of
the TextStripper with additional features concerning the text structure.

> Ted Dunning, CTO
> DeepDyve

BR
Andreas Lehmkühler


[1] https://issues.apache.org/jira/browse/PDFBOX-521
