Howdy!
So I am working with TIKA to help me parse Office types to pull out text. I
would like to preserve the structure of the text as much as possible.
When I play with the TIKA jar file with a simple excel file I get something
like what I have below. Code I write to do the parsing pulls out something
similar. The data is generally correct. But, in the parsing the position
of cells is completely lost. For example, the xls that I used here
contained the cells in the middle of the spreadsheet. But that positioning
is lost in the output here. I would like to say I found a bit of text at a
specific row/column/sheet.
Is this possible with TIKA? I have google around and have not found much.
Do I have to drop down to POI to do this?
Thanks! Sorry if this is a super obvious question.
-Matt
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="cp:revision" content="1" />
<meta name="date" content="2013-08-10T00:26:18Z" />
<meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser" />
<meta name="X-Parsed-By"
content="org.apache.tika.parser.microsoft.OfficeParser" />
<meta name="meta:creation-date" content="2013-08-10T00:25:37Z" />
<meta name="Last-Printed" content="1601-01-01T00:00:00Z" />
<meta name="Creation-Date" content="2013-08-10T00:25:37Z" />
<meta name="meta:print-date" content="1601-01-01T00:00:00Z" />
<meta name="resourceName" content="spreadsheet.xls" />
<meta name="dcterms:created" content="2013-08-10T00:25:37Z" />
<meta name="dcterms:modified" content="2013-08-10T00:26:18Z" />
<meta name="Last-Modified" content="2013-08-10T00:26:18Z" />
<meta name="Last-Save-Date" content="2013-08-10T00:26:18Z" />
<meta name="Revision-Number" content="1" />
<meta name="meta:save-date" content="2013-08-10T00:26:18Z" />
<meta name="modified" content="2013-08-10T00:26:18Z" />
<meta name="Content-Length" content="5632" />
<meta name="Content-Type" content="application/vnd.ms-excel" />
<title></title>
</head>
<body>
<div class="page">
<h1>Sheet1</h1>
<table>
<tbody>
<tr>
<td>A</td>
<td>B</td>
<td>D</td>
</tr>
<tr>
<td>1</td>
<td>2</td>
<td>3</td>
</tr>
</tbody>
</table>
</div>
</body>
</html>