On Wed, 8 Sep 2010, Sergiy Karpenko wrote:
> While testing content and metadata extraction with Tika, I ran into the
> following use cases:
> - Dates in metadata (DublinCore.DATE, MSOffice.LAST_SAVED,
>   MSOffice.CREATION_DATE)
> Dates are returned as Strings, but the format differs between document
> types. You are probably already working on this (I saw a Date object in
> the metadata in Tika 0.8), but if not, how can I configure Tika to use a
> single date format?

This has only recently been fixed:
        https://issues.apache.org/jira/browse/TIKA-451

You'll want to upgrade to a recent svn checkout / nightly build to get these improvements.


> - Dates in Excel file content.
> As we know, Excel has date fields, and Tika extracts them well. But the
> format is not acceptable for me.
>
> For example:
> I have the field 03/10/2005
> Tika extracts it as 10/03/2005
> But I need "yyyy-MM-dd HH:mm:ss.SSSZ" - 2005-10-03 00:00:00.000+0300

Tika does its best to return the dates in the format that they show up in Excel.

If you want the dates to be in ISO 8601 format, you have two options:
* Set all your date cells in Excel to be formatted as ISO 8601, rather
  than whatever they currently are
* Write your own Excel parser for Tika, which ignores the date formatting
  set for cells, and always uses ISO 8601

For the latter, you'd probably start with Tika's ExcelExtractor. Then, in the NumberRecord switch case, use POI's DateUtil class to detect whether the cell is a date cell. If it is, have the cell value turned into a date object and format it as you require. If it isn't, let the default "format this like Excel does" logic kick in.
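As a rough illustration of the "turn the cell value into a date object and format it" step, here is a minimal sketch using only the JDK. It assumes the cell holds a serial number in Excel's 1900 date system and that the parser has already decided the cell is a date; in a real parser POI would handle both the detection and the conversion. The class name ExcelDateSketch, the method formatSerial, and the choice of UTC offset are made up for this example and are not part of Tika or POI.

```java
import java.time.LocalDateTime;
import java.time.OffsetDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

public class ExcelDateSketch {
    // Excel's 1900 date system counts a non-existent 1900-02-29, so for any
    // serial from March 1900 onwards the effective day zero is 1899-12-30.
    private static final LocalDateTime EPOCH = LocalDateTime.of(1899, 12, 30, 0, 0);

    // The output pattern asked for in the original question.
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSSZ", Locale.ROOT);

    // Hypothetical helper: converts a numeric cell value to the target string.
    // Excel serials carry no time zone, so the caller must supply the offset.
    public static String formatSerial(double serial, ZoneOffset offset) {
        long days = (long) Math.floor(serial);
        // The fractional part of the serial is the time of day.
        long millis = Math.round((serial - days) * 86_400_000L);
        LocalDateTime local = EPOCH.plusDays(days).plusNanos(millis * 1_000_000L);
        return OffsetDateTime.of(local, offset).format(FORMAT);
    }

    public static void main(String[] args) {
        // 38628 is the serial for 3 October 2005.
        System.out.println(formatSerial(38628.0, ZoneOffset.ofHours(3)));
        // prints 2005-10-03 00:00:00.000+0300
    }
}
```

Note that the offset is an input here, not something recovered from the spreadsheet, so the "+0300" in the desired output has to come from your own configuration.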

For the former, if your users don't want to reformat all their date cells for you, you could probably pre-process the file with POI and change all the formats.

Nick
