Thanks for the quick response. I will write my own Excel parser.

2010/9/8 Nick Burch <[email protected]>
> On Wed, 8 Sep 2010, Sergiy Karpenko wrote:
>
>> When I test content and metadata extraction by Tika, I met next usecases:
>> - Date in metadata (DublinCore.DATE, MSOffice.LAST_SAVED,
>> MSOffice.CREATION_DATE)
>> Date returned as String, but format is different for different document
>> types. Probably you already working on this problem (I saw Date object in
>> metadata in Tika 0.8) but if not, how can I configure Tika to use single
>> Date format?
>>
>
> This has only recently been fixed:
> https://issues.apache.org/jira/browse/TIKA-451
>
> You'll want to upgrade to a recent svn checkout / nightly build to get
> these improvements
>
>
>> - Date in Excel file content.
>> As we know, Excel have Date fields, and Tika extract it well. But format
>> is not acceptable for me.
>>
>> For example
>> I have field 03/10/2005
>> Tika extracts it as 10/03/2005
>> But, I need "yyyy-MM-dd HH:mm:ss.SSSZ" - 2005-10-03 00:00:00.000+0300
>>
>
> Tika does its best to return the dates in the format that they show up in
> Excel.
>
> If you want the dates to be in ISO8601 format, you have two options:
> * Set all your date cells in excel to be formatted as iso8601, rather
>   than whatever they currently are
> * Write your own excel parser for Tika, which ignores the date formatting
>   set for cells, and always uses iso8601
>
> For the latter, you'd probably start with Tika's ExcelExtractor, then in
> the NumberRecord switch case, use POI's DateUtils class to detect if the
> cell is a date cell or not. If it is, have the cell value turned into a date
> object, and format it as you require. If it isn't, then let the default
> "format this like excel does" logic kick in
>
> For the former, if your users don't want to reformat all their date cells
> for you, you could probably pre-process the file with POI and change all the
> formats.
>
> Nick
>
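For anyone following the thread, the "turn the cell value into a date object and format it as you require" step from the second option could look roughly like the sketch below. It uses only the JDK so it runs without POI on the classpath. In a real parser the detection and conversion would more likely go through POI's `DateUtil` class (`isADateFormat`, `getJavaDate`) rather than the hand-rolled arithmetic here. The class name `ExcelDateSketch`, the method `formatSerial`, and the fixed +03:00 offset are made up for illustration, and the code assumes the workbook uses the 1900 date system and that the cell has already been identified as a date cell.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not part of Tika or POI.
class ExcelDateSketch {
    // For the 1900 date system, serial values after 1900-02-28 count days
    // from 1899-12-30 (this absorbs Excel's phantom 1900-02-29).
    private static final LocalDate EPOCH = LocalDate.of(1899, 12, 30);

    // The pattern requested in the thread.
    private static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSSZ");

    // Convert a numeric cell value (already known to be a date) to text.
    static String formatSerial(double serial, ZoneOffset offset) {
        long days = (long) Math.floor(serial);
        // The fractional part of the serial value is the time of day.
        long millis = Math.round((serial - days) * 86_400_000L);
        LocalDateTime local = EPOCH.atStartOfDay()
                .plusDays(days)
                .plusNanos(millis * 1_000_000L);
        return local.atOffset(offset).format(FMT);
    }

    public static void main(String[] args) {
        // 38628 is the serial value Excel stores for 3 October 2005.
        System.out.println(formatSerial(38628.0, ZoneOffset.ofHours(3)));
        // prints 2005-10-03 00:00:00.000+0300
    }
}
```

Excel stores no time zone with the value, so the offset has to be supplied by the caller. Cells that are not dates would skip this path and fall through to the default "format this like excel does" logic.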
