I have created a ticket : https://issues.apache.org/bugzilla/show_bug.cgi?id=50779 I'll update more there.
On Tue, Feb 15, 2011 at 3:39 PM, Taro App <[email protected]> wrote: > Okay, I understand what was going on now. Consulting to > OpenOffice.org's documentation of Excel file format > (http://sc.openoffice.org/excelfileformat.pdf) chapter 2.5.3 Unicode > Strings (BIFF8), I checked the data carefully with a binary editor. I > confirmed that CONTINUE(003Ch) records can appear right in between two > bytes that construct a single unicode character. Here is an example: > > ---------------------------------------------------------------------- > Binary Data: > 0B 00 05 54 00 00 00 AC 4E FD 90 02 5E F3 53 AC > 4E 3A 53 AC 4E 17 53 64 8D F3 77 3A 75 01 00 50 > 00 09 00 35 00 05 00 14 00 14 00 AD 30 E7 30 A6 > 30 C8 30 B7 30 A6 30 AD 30 E7 30 A6 3C 00 1E 20 > 30 AF 30 B1 30 A4 30 DB 30 AF 30 A2 30 AB 30 A4 > 30 B7 30 DE 30 C1 30 00 00 00 00 03 00 05 00 03 > 00 03 00 0A 00 06 00 02 00 0E 00 08 00 02 00 12 > 00 0A 00 01 00 > > UnicodeString: > 0B 00 -> 11 characters > 05 -> not compressed, extended > 54 00 00 00 -> length of ExtRst is 0x54 bytes > AC 4E FD 90 02 5E F3 53 AC 4E 3A 53 AC 4E 17 53 > 64 8D F3 77 3A 75 -> string data (11 characters) > > ExtRst: > 01 00 -> reserved = 1 > 50 00 -> length of phonetic string is 0x50 bytes > 09 00 35 00 -> formatting infomation > 05 00 -> number of phonetic text runs is 5 > 14 00 -> number of characters is 20 > 14 00 -> number of characters is 20 > AD 30 E7 30 A6 30 C8 30 B7 30 A6 30 AD 30 E7 30 > A6 [3C 00 1E 20] 30 AF 30 B1 30 A4 30 DB 30 AF 30 > A2 30 AB 30 A4 30 B7 30 DE 30 C1 30 > -> phonetic data (20 characters) > * 0x30A6 is a unicode character, but a CONTINUE record is in the middle > 00 00 00 00 03 00 -> portion 1 > 05 00 03 00 03 00 -> portion 2 > 0A 00 06 00 02 00 -> portion 3 > 0E 00 08 00 02 00 -> portion 4 > 12 00 0A 00 01 00 -> portion 5 > ---------------------------------------------------------------------- > > I also found that CONTINUE records can appear in the middle of portion > data. Here is an example: > > ---------------------------------------------------------------------- > Binary Data: > 09 00 05 36 00 00 00 F3 77 DD 5D 0C 77 7D 76 71 > 5C 02 5E 0A 4E CE 91 3A 75 01 00 32 00 09 00 35 > 00 03 00 0B 00 0B 00 A4 30 B7 30 AB 30 EF 30 B1 > 30 F3 30 CF 30 AF 30 B5 30 F3 30 B7 30 00 00 00 > 00 03 00 06 00 03 00 02 3C 00 20 20 00 0A 00 05 > 00 01 00 > > UnicodeString: > 09 00 -> 9 characters > 05 -> not compressed, extended > 36 00 00 00 -> length of ExtRst is 0x36 bytes > F3 77 DD 5D 0C 77 7D 76 71 5C 02 5E 0A 4E CE 91 > 3A 75 -> string data (9 characters) > > ExtRst: > 01 00 -> reserved = 1 > 32 00 -> length of phonetic string is 0x32 bytes > 09 00 35 00 -> formatting infomation > 03 00 -> number of phonetic text runs is 3 > 0B 00 -> number of characters is 11 > 0B 00 -> number of characters is 11 > A4 30 B7 30 AB 30 EF 30 B1 30 F3 30 CF 30 AF 30 > B5 30 F3 30 B7 30 -> phonetic data (11 characters) > 00 00 00 00 03 00 -> portion 1 > 06 00 03 00 02 {3C 00 20 20} 00 -> portion 2 > * a CONTINUE record is in the middle > 0A 00 05 00 01 00 -> portion 3 > ---------------------------------------------------------------------- > > Excel files in the above patterns are not readable by POI 3.7. > Now I know the cause of the issue, so maybe I can create a Excel file > to duplicate the POI error. > But I don't think I can fix POI myself. I hope someone can help with that. > Nick? > > apptaro > > > > On Tue, Feb 15, 2011 at 12:21 PM, Taro App <[email protected]> wrote: >> Hi Nick, >> >> I tested if "Save As" fixes the issue, and found out very interesting >> results. >> For example, there is an Excel file A, which POI can't read at a string ABC. >> Open the file A in Excel and save as a file B. Then, POI error occurs >> at a different string DEF when reading the file B. >> Open the file B in Excel and save as a file C. Then, POI error occurs >> at a different string GHI when reading the file C. >> Open the file C in Excel and save as a file D. Then, POI error occurs >> at a different string JKL when reading the file D. >> Open the file A again in Excel and save as a file B2. Then, POI error >> occurs at the string DEF when reading the file B2. >> Open the file B again in Excel and save as a file C2. Then, POI error >> occurs at the string GHI when reading the file C2. >> Open the file C again in Excel and save as a file D2. Then, POI error >> occurs at the string JKL when reading the file D2. >> >> So, "Save As" creates a file which POI can't read at a different >> string. Its pattern is reproducible, but I have not been able to >> create new such file from scratch. My problematic Excel file contains >> about 50 sheets, and each sheet has 300-1000 non-empty cells. Maybe >> this issue occurs only for a large file. >> >> Only workaround is to use debugger to find strings that causes the >> error, and re-input those strings in Excel and save. There are many >> strings that cause errors, so I have to repeat debug and edit until >> POI can successfully read the file. Another very interesting fact is >> that, after I "clean" the problematic file until POI can read it, open >> the cleaned file in Excel and do "Save As," then POI can't read the >> newly saved file! >> >> Of course all the files I explained above can be opened in Excel, and >> text and phonetic is not corrupted at all. (I checked phonetic with >> =PHONETIC(cell) function.) >> >> apptaro >> >> >> On Mon, Feb 14, 2011 at 9:04 PM, Nick Burch <[email protected]> wrote: >>> On Thu, 10 Feb 2011, Taro App wrote: >>>> >>>> Oops, there's a comment that isContinueNext "Should never be called >>>> before end of current record" so the code must be correct. >>> >>> Or at least correct to the best of our knowledge... >>> >>>> I'm not sure how it happens, but the exception is raised when POI tries to >>>> read phonetic text in ExtRst of UnicodeString. POI tries to read 18 >>>> double-byte characters, but RecordInputStream has only 29 bytes (14 >>>> characters + 1 byte.) If I check the Excel file with binary editor, >>>> everything up to here seems correct. >>> >>> Hmm, the phonetic text stuff went in much later to the file format. It's not >>> impossible that the team working on it had different ideas about continue >>> records to the team who did the original work >>> >>>> When I open the excel file and deleted sheets which do not contain the >>>> problematic string, then the problem disappeared. I tried deleting >>>> different >>>> sheets, then the problem sometimes disappeared and sometimes not. I also >>>> tried re-inputting the problematic string by cut & pasting in Excel, then >>>> the problem always disappeared. My best guess is that Excel sometimes saves >>>> corrupted data. Corruption occurs in phonetic texts which is not usually >>>> visible to users, it is not very obvious. >>> >>> If you open the file in Excel and do "Save As", does it fix the issue, or is >>> it only changing text / removing sheets that fixes it? >>> >>> Nick >>> >>> --------------------------------------------------------------------- >>> To unsubscribe, e-mail: [email protected] >>> For additional commands, e-mail: [email protected] >>> >>> >> > --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
