[ https://issues.apache.org/jira/browse/TIKA-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12796474#action_12796474 ]
Dave Meikle commented on TIKA-103:
----------------------------------

I am not sure how others feel about this issue, but it is one I would like addressed: I have some applications that parse Excel files containing various types of formatting, and it would be good to have the 'as-is' value within the parsed content.

As can be seen above, I have attached an initial patch that has TikaHSSFListener use the FormatTrackingHSSFListener proxy class to handle the formatting POI currently supports. This moves us on somewhat, leaving the outstanding support as something to progress within POI[1].

The reason I have attached the patch instead of committing directly is that I would like to propose the following:
* This initial support is included in the upcoming 0.6 release.
* An issue is raised against POI, and any fixes to support the other formatting are progressed there.

Not sure what you all think? I am going to track through to see how Niall got on, if he managed to get the time.

Cheers,
Dave

[1] I will continue to see if there are any user model features we can use to add further support from the current POI code base.

> Excel parsing ignores cell formating
> ------------------------------------
>
>                 Key: TIKA-103
>                 URL: https://issues.apache.org/jira/browse/TIKA-103
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Niall Pemberton
>         Attachments: testEXCEL-formats.xls, tika-103_initial_patch.diff
>
>
> Unfortunately Excel stores dates as the number of days since 1900 (or 1904,
> but ignore that for now), with the time element stored in the fractional
> part of the numeric value. So, for example, 19 Jan 2008 04:35:01 is stored
> as the Double value 39466.190980358806. The only way to make sense of the
> data is to look at the formatting on the cell. Although dates are the worst
> case, it also affects other numeric values - currencies, percentages,
> scientific, fractions and, worst of all, custom formats.
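
For illustration, here is a minimal sketch of the arithmetic described in the paragraph above, using only java.time rather than POI. The class and method names (ExcelSerialDate, toLocalDateTime) are hypothetical and not part of Tika or POI. It assumes the 1900 date system and rounds to the nearest second, so it is not a substitute for POI's own conversion utilities.

```java
import java.time.LocalDateTime;

public class ExcelSerialDate {
    // Assumption: 1900 date system. Excel counts a non-existent 29 Feb 1900,
    // so for serial values of 61 and above the effective epoch is 1899-12-30.
    private static final LocalDateTime EPOCH_1900 =
            LocalDateTime.of(1899, 12, 30, 0, 0);

    // Whole part = days since the epoch, fractional part = fraction of a day.
    public static LocalDateTime toLocalDateTime(double serial) {
        long wholeDays = (long) Math.floor(serial);
        long secondsOfDay = Math.round((serial - wholeDays) * 86_400);
        return EPOCH_1900.plusDays(wholeDays).plusSeconds(secondsOfDay);
    }

    public static void main(String[] args) {
        // The example value from the issue description.
        System.out.println(toLocalDateTime(39466.190980358806));
        // prints 2008-01-19T04:35:01
    }
}
```

Nothing in the number itself says it is a date rather than a quantity or a currency amount, which is why the cell's format has to be consulted before a conversion like this is applied.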
> POI recognises 49 "built in" Excel formats, and for those it has the
> limited capability of determining whether a numeric cell is a date or not
> and, if it is, a utility to convert it to a Java date, something like:
>     if (HSSFDateUtil.isCellDateFormatted(cell)) {
>         Date date = HSSFDateUtil.getJavaDate(cell.getNumericCellValue());
>     }
> The current ExcelParser implementation takes no account of the data format
> and IMO that is going to severely limit how useful the implementation is. I
> also think that the above, while improving the situation slightly, is still
> not great. I asked about this on the POI dev list a couple of days ago[1]
> and the only light is that someone posted a format parser a few months back.
> It sounds like POI will accept that contribution if it has unit tests, so
> I'm going to try and find time to do that. If the data format can be
> properly parsed then it means being able to extract the value in the format
> the user sees within Excel - which IMO would be the ideal situation.
> [1] http://www.mail-archive.com/d...@poi.apache.org/msg00582.html

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.