Thanks for the quick response. I will write my own Excel parser.

2010/9/8 Nick Burch <[email protected]>

> On Wed, 8 Sep 2010, Sergiy Karpenko wrote:
>
>> While testing content and metadata extraction with Tika, I ran into the
>> following use cases:
>> - Date in metadata (DublinCore.DATE, MSOffice.LAST_SAVED,
>> MSOffice.CREATION_DATE)
>> The date is returned as a String, but the format differs between document
>> types. You are probably already working on this (I saw a Date object in
>> the metadata in Tika 0.8), but if not, how can I configure Tika to use a
>> single Date format?
>>
>
> This has only recently been fixed:
>        https://issues.apache.org/jira/browse/TIKA-451
>
> You'll want to upgrade to a recent svn checkout / nightly build to get
> these improvements
>
>
>
>> - Date in Excel file content.
>> As we know, Excel has date fields, and Tika extracts them well. But the
>> format is not acceptable for me.
>>
>> For example
>> I have field 03/10/2005
>> Tika extracts it as  10/03/2005
>> But, I need "yyyy-MM-dd HH:mm:ss.SSSZ"   - 2005-10-03 00:00:00.000+0300
>>
>
> Tika does its best to return the dates in the format that they show up in
> Excel.
>
> If you want the dates to be in ISO 8601 format, you have two options:
> * Set all your date cells in Excel to be formatted as ISO 8601, rather
>   than whatever they currently are
> * Write your own Excel parser for Tika, which ignores the date formatting
>   set for cells, and always uses ISO 8601
>
> For the latter, you'd probably start with Tika's ExcelExtractor, then in
> the NumberRecord switch case, use POI's DateUtil class to detect if the
> cell is a date cell or not. If it is, have the cell value turned into a date
> object, and format it as you require. If it isn't, then let the default
> "format this like excel does" logic kick in.
>
> For the former, if your users don't want to reformat all their date cells
> for you, you could probably pre-process the file with POI and change all the
> formats.
>
> Nick
>
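
The "turn the cell value into a date object and format it as you require"
step from the second option could look roughly like the sketch below. It is
not Tika's or POI's code: it uses only java.time, assumes the workbook uses
Excel's default 1900 date system (not the 1904 one), assumes the serial is 61
or greater (after Excel's phantom 1900-02-29), and leaves out the "is this a
date cell?" check, which would come from POI. The class and method names
(ExcelDateSketch, formatSerial) are made up for illustration.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;

public class ExcelDateSketch {
    // Day zero for Excel serials >= 61 in the 1900 date system (assumption:
    // the workbook does not use the 1904 date system).
    private static final LocalDate EPOCH = LocalDate.of(1899, 12, 30);

    // The target format requested in the thread.
    private static final DateTimeFormatter TARGET =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSSZ");

    // Hypothetical helper: converts the numeric value of a date cell into
    // the target string, interpreting the wall-clock time in the given zone.
    static String formatSerial(double serial, ZoneId zone) {
        long wholeDays = (long) Math.floor(serial);
        // The fractional part of the serial is the time of day.
        long millis = Math.round((serial - wholeDays) * 86_400_000L);
        LocalDateTime local = EPOCH.plusDays(wholeDays)
                .atStartOfDay()
                .plus(millis, ChronoUnit.MILLIS);
        return local.atZone(zone).format(TARGET);
    }

    public static void main(String[] args) {
        // 38628 is the serial Excel stores for 3 October 2005.
        System.out.println(formatSerial(38628.0, ZoneOffset.ofHours(3)));
        // prints 2005-10-03 00:00:00.000+0300
    }
}
```

In a custom parser, a helper like this would only be called for numeric cells
that POI reports as date-formatted; all other cells would keep the default
"format this like Excel does" path.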
