[ https://issues.apache.org/jira/browse/TIKA-103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dave Meikle resolved TIKA-103. ------------------------------ Resolution: Fixed As stated above, initial support has been applied for this release therefore marking as 'resolved'. Outstanding formatting issues have been moved to TIKA-360 for progression, via POI, in a later version. Cheers, Dave > Excel parsing ignores cell formating > ------------------------------------ > > Key: TIKA-103 > URL: https://issues.apache.org/jira/browse/TIKA-103 > Project: Tika > Issue Type: Improvement > Components: parser > Reporter: Niall Pemberton > Assignee: Dave Meikle > Fix For: 0.6 > > Attachments: testEXCEL-formats.xls, tika-103_initial_patch.diff > > > Unfortunately Excel stores dates as the number of days since 1900 (or 1904, > but ignore that atm) with the time element being stored in the fractional > part of the numeric value. So for example 19 Jan 2008 04:35:01 is stored as > Double value 39466.190980358806. The only way to make sense of the data is > to look at the formatting on the cell. Although dates are the worst case, it > also affects other numeric values - currencies, percentages, scientific, > fractions and worst of all custom formats. > POI recognises 49 "built in" formats of excel and for those it has the > limited capability of determining whether a numeric cell is a date or not and > if it is, a utility to convert to a java date, something like: > if (HSSFDateUtil.isCellDateFormatted(cell)) { > Date date = HSSFDateUtil.getJavaDate(cell.getNumericCellValue()); > } > The current ExcelParser implementation takes no account of the data format > and IMO is going to severly limit how useful that implementation is. I'm also > think that the above while improving the situation slightly is still not > great. I asked about this on the POI dev list a couple of days ago[1] and the > only light is someone posted a format parser a few months back. It sounds > like POI will accept that contribution if it has unit tests. So I'm going to > try and find time to do that. If the data format can be properly parsed then > it means being able to extract it in the format the users sees it within > Excel - which IMO would be the ideal situation. > [1] http://www.mail-archive.com/d...@poi.apache.org/msg00582.html -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.