On Wed, 8 Sep 2010, Sergiy Karpenko wrote:
While testing content and metadata extraction with Tika, I ran into the
following use cases:
- Date in metadata (DublinCore.DATE, MSOffice.LAST_SAVED,
MSOffice.CREATION_DATE)
The date is returned as a String, but the format differs between document
types. You are probably already working on this problem (I saw a Date
object in the metadata in Tika 0.8), but if not, how can I configure Tika
to use a single date format?
This has only recently been fixed:
https://issues.apache.org/jira/browse/TIKA-451
You'll want to upgrade to a recent svn checkout / nightly build to get
these improvements.
- Date in Excel file content.
As we know, Excel has date fields, and Tika extracts them well. But the
format is not acceptable for me.
For example
I have field 03/10/2005
Tika extracts it as 10/03/2005
But, I need "yyyy-MM-dd HH:mm:ss.SSSZ" - 2005-10-03 00:00:00.000+0300
Tika does its best to return the dates in the format that they show up in
Excel.
If you want the dates to be in ISO 8601 format, you have two options:
* Set all your date cells in Excel to be formatted as ISO 8601, rather
than whatever they currently are
* Write your own Excel parser for Tika, which ignores the date formatting
set for cells, and always uses ISO 8601
For the latter, you'd probably start with Tika's ExcelExtractor, then in
the NumberRecord switch case, use POI's DateUtil (or HSSFDateUtil) class
to detect whether the cell is a date cell. If it is, have the cell value
turned into a Date object, and format it as you require. If it isn't, then
let the default "format this like Excel does" logic kick in.
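
As a rough sketch of the "turn the value into a date and format it" step:
the code below uses only the JDK, so it runs without POI on the classpath.
In a real parser you would more likely call POI's DateUtil.isADateFormat()
and DateUtil.getJavaDate() instead of doing the arithmetic by hand. The
names ExcelDateSketch, fromSerial and format are made up for illustration,
and it assumes the workbook uses the 1900 date system. Excel stores no time
zone, so the zone is simply whatever you pass in.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not part of Tika or POI.
class ExcelDateSketch {
    // The output pattern requested in the question.
    private static final DateTimeFormatter OUT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSSZ");

    // Converts an Excel serial number (1900 date system assumed) to a
    // zone-less date-time. Negative serials are not handled.
    static LocalDateTime fromSerial(double serial) {
        long wholeDays = (long) Math.floor(serial);
        long millisInDay = Math.round((serial - wholeDays) * 86_400_000L);
        // Excel counts a fictitious 1900-02-29 (serial 60), so serials
        // from 61 onwards need a base date one day earlier.
        LocalDate base = wholeDays < 61
                ? LocalDate.of(1899, 12, 31)
                : LocalDate.of(1899, 12, 30);
        return base.plusDays(wholeDays)
                .atStartOfDay()
                .plusNanos(millisInDay * 1_000_000L);
    }

    // Interprets the serial as local time in the given zone and formats it.
    static String format(double serial, ZoneId zone) {
        return fromSerial(serial).atZone(zone).format(OUT);
    }
}
```

For the example in the question, ExcelDateSketch.format(38628,
ZoneOffset.ofHours(3)) gives "2005-10-03 00:00:00.000+0300", since 38628 is
the serial number Excel uses for 3 October 2005.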
For the former, if your users don't want to reformat all their date cells
for you, you could probably pre-process the file with POI and change all
the formats.
Nick