Howdy!

So I am working with Tika to parse Office file types and pull out their
text. I would like to preserve the structure of the text as much as possible.

When I run the Tika jar on a simple Excel file, I get something like the
output below. Code I wrote to do the parsing pulls out something similar.
The data is generally correct, but the position of the cells is completely
lost in the parsing. For example, the .xls I used here had its cells in the
middle of the spreadsheet, but that positioning is lost in the output. I
would like to be able to say that I found a bit of text at a specific
row/column/sheet.
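
To make the problem concrete, here is a small sketch (plain JDK, no Tika
involved) that walks a trimmed copy of the table from the output further
down and reports the only coordinates I can recover from it. The class and
method names are made up for illustration, and the DOCTYPE and namespace are
left out to keep it short. The indices are relative to the <table>, so the
first cell always comes out as 0,0 no matter where it sat in the sheet.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class TableCoordinates {
    // Trimmed copy of the <table> from the Tika output (illustrative only).
    public static final String XHTML =
        "<table><tbody>"
      + "<tr><td>A</td><td>B</td><td>D</td></tr>"
      + "<tr><td>1</td><td>2</td><td>3</td></tr>"
      + "</tbody></table>";

    /** Returns one "row,col=text" entry per td, counted inside the table. */
    public static List<String> locate(String xhtml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xhtml)));
        List<String> cells = new ArrayList<>();
        NodeList rows = doc.getElementsByTagName("tr");
        for (int r = 0; r < rows.getLength(); r++) {
            // Only the <td> elements nested under this particular <tr>.
            NodeList cols = ((Element) rows.item(r)).getElementsByTagName("td");
            for (int c = 0; c < cols.getLength(); c++) {
                cells.add(r + "," + c + "=" + cols.item(c).getTextContent());
            }
        }
        return cells;
    }

    public static void main(String[] args) throws Exception {
        for (String cell : locate(XHTML)) {
            System.out.println(cell);
        }
    }
}
```

Nothing in those indices tells me which spreadsheet row or column a value
came from, which is the information I am after.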

Is this possible with Tika? I have googled around and have not found much.
Do I have to drop down to POI to do this?

Thanks! Sorry if this is a super obvious question.

-Matt

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
  <meta name="cp:revision" content="1" />
  <meta name="date" content="2013-08-10T00:26:18Z" />
  <meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser" />
  <meta name="X-Parsed-By" content="org.apache.tika.parser.microsoft.OfficeParser" />
  <meta name="meta:creation-date" content="2013-08-10T00:25:37Z" />
  <meta name="Last-Printed" content="1601-01-01T00:00:00Z" />
  <meta name="Creation-Date" content="2013-08-10T00:25:37Z" />
  <meta name="meta:print-date" content="1601-01-01T00:00:00Z" />
  <meta name="resourceName" content="spreadsheet.xls" />
  <meta name="dcterms:created" content="2013-08-10T00:25:37Z" />
  <meta name="dcterms:modified" content="2013-08-10T00:26:18Z" />
  <meta name="Last-Modified" content="2013-08-10T00:26:18Z" />
  <meta name="Last-Save-Date" content="2013-08-10T00:26:18Z" />
  <meta name="Revision-Number" content="1" />
  <meta name="meta:save-date" content="2013-08-10T00:26:18Z" />
  <meta name="modified" content="2013-08-10T00:26:18Z" />
  <meta name="Content-Length" content="5632" />
  <meta name="Content-Type" content="application/vnd.ms-excel" />

  <title></title>
</head>

<body>
  <div class="page">
    <h1>Sheet1</h1>

    <table>
      <tbody>
        <tr>
          <td>A</td>

          <td>B</td>

          <td>D</td>
        </tr>

        <tr>
          <td>1</td>

          <td>2</td>

          <td>3</td>
        </tr>
      </tbody>
    </table>
  </div>
</body>
</html>
