Okay, I understand what was going on now. Consulting to
OpenOffice.org's documentation of Excel file format
(http://sc.openoffice.org/excelfileformat.pdf) chapter 2.5.3 Unicode
Strings (BIFF8), I checked the data carefully with a binary editor. I
confirmed that CONTINUE(003Ch) records can appear right in between two
bytes that construct a single unicode character. Here is an example:
----------------------------------------------------------------------
Binary Data:
0B 00 05 54 00 00 00 AC 4E FD 90 02 5E F3 53 AC
4E 3A 53 AC 4E 17 53 64 8D F3 77 3A 75 01 00 50
00 09 00 35 00 05 00 14 00 14 00 AD 30 E7 30 A6
30 C8 30 B7 30 A6 30 AD 30 E7 30 A6 3C 00 1E 20
30 AF 30 B1 30 A4 30 DB 30 AF 30 A2 30 AB 30 A4
30 B7 30 DE 30 C1 30 00 00 00 00 03 00 05 00 03
00 03 00 0A 00 06 00 02 00 0E 00 08 00 02 00 12
00 0A 00 01 00
UnicodeString:
0B 00 -> 11 characters
05 -> not compressed, extended
54 00 00 00 -> length of ExtRst is 0x54 bytes
AC 4E FD 90 02 5E F3 53 AC 4E 3A 53 AC 4E 17 53
64 8D F3 77 3A 75 -> string data (11 characters)
ExtRst:
01 00 -> reserved = 1
50 00 -> length of phonetic string is 0x50 bytes
09 00 35 00 -> formatting infomation
05 00 -> number of phonetic text runs is 5
14 00 -> number of characters is 20
14 00 -> number of characters is 20
AD 30 E7 30 A6 30 C8 30 B7 30 A6 30 AD 30 E7 30
A6 [3C 00 1E 20] 30 AF 30 B1 30 A4 30 DB 30 AF 30
A2 30 AB 30 A4 30 B7 30 DE 30 C1 30
-> phonetic data (20 characters)
* 0x30A6 is a unicode character, but a CONTINUE record is in the middle
00 00 00 00 03 00 -> portion 1
05 00 03 00 03 00 -> portion 2
0A 00 06 00 02 00 -> portion 3
0E 00 08 00 02 00 -> portion 4
12 00 0A 00 01 00 -> portion 5
----------------------------------------------------------------------
I also found that CONTINUE records can appear in the middle of portion
data. Here is an example:
----------------------------------------------------------------------
Binary Data:
09 00 05 36 00 00 00 F3 77 DD 5D 0C 77 7D 76 71
5C 02 5E 0A 4E CE 91 3A 75 01 00 32 00 09 00 35
00 03 00 0B 00 0B 00 A4 30 B7 30 AB 30 EF 30 B1
30 F3 30 CF 30 AF 30 B5 30 F3 30 B7 30 00 00 00
00 03 00 06 00 03 00 02 3C 00 20 20 00 0A 00 05
00 01 00
UnicodeString:
09 00 -> 9 characters
05 -> not compressed, extended
36 00 00 00 -> length of ExtRst is 0x36 bytes
F3 77 DD 5D 0C 77 7D 76 71 5C 02 5E 0A 4E CE 91
3A 75 -> string data (9 characters)
ExtRst:
01 00 -> reserved = 1
32 00 -> length of phonetic string is 0x32 bytes
09 00 35 00 -> formatting infomation
03 00 -> number of phonetic text runs is 3
0B 00 -> number of characters is 11
0B 00 -> number of characters is 11
A4 30 B7 30 AB 30 EF 30 B1 30 F3 30 CF 30 AF 30
B5 30 F3 30 B7 30 -> phonetic data (11 characters)
00 00 00 00 03 00 -> portion 1
06 00 03 00 02 {3C 00 20 20} 00 -> portion 2
* a CONTINUE record is in the middle
0A 00 05 00 01 00 -> portion 3
----------------------------------------------------------------------
Excel files in the above patterns are not readable by POI 3.7.
Now I know the cause of the issue, so maybe I can create a Excel file
to duplicate the POI error.
But I don't think I can fix POI myself. I hope someone can help with that. Nick?
apptaro
On Tue, Feb 15, 2011 at 12:21 PM, Taro App <[email protected]> wrote:
> Hi Nick,
>
> I tested if "Save As" fixes the issue, and found out very interesting results.
> For example, there is an Excel file A, which POI can't read at a string ABC.
> Open the file A in Excel and save as a file B. Then, POI error occurs
> at a different string DEF when reading the file B.
> Open the file B in Excel and save as a file C. Then, POI error occurs
> at a different string GHI when reading the file C.
> Open the file C in Excel and save as a file D. Then, POI error occurs
> at a different string JKL when reading the file D.
> Open the file A again in Excel and save as a file B2. Then, POI error
> occurs at the string DEF when reading the file B2.
> Open the file B again in Excel and save as a file C2. Then, POI error
> occurs at the string GHI when reading the file C2.
> Open the file C again in Excel and save as a file D2. Then, POI error
> occurs at the string JKL when reading the file D2.
>
> So, "Save As" creates a file which POI can't read at a different
> string. Its pattern is reproducible, but I have not been able to
> create new such file from scratch. My problematic Excel file contains
> about 50 sheets, and each sheet has 300-1000 non-empty cells. Maybe
> this issue occurs only for a large file.
>
> Only workaround is to use debugger to find strings that causes the
> error, and re-input those strings in Excel and save. There are many
> strings that cause errors, so I have to repeat debug and edit until
> POI can successfully read the file. Another very interesting fact is
> that, after I "clean" the problematic file until POI can read it, open
> the cleaned file in Excel and do "Save As," then POI can't read the
> newly saved file!
>
> Of course all the files I explained above can be opened in Excel, and
> text and phonetic is not corrupted at all. (I checked phonetic with
> =PHONETIC(cell) function.)
>
> apptaro
>
>
> On Mon, Feb 14, 2011 at 9:04 PM, Nick Burch <[email protected]> wrote:
>> On Thu, 10 Feb 2011, Taro App wrote:
>>>
>>> Oops, there's a comment that isContinueNext "Should never be called
>>> before end of current record" so the code must be correct.
>>
>> Or at least correct to the best of our knowledge...
>>
>>> I'm not sure how it happens, but the exception is raised when POI tries to
>>> read phonetic text in ExtRst of UnicodeString. POI tries to read 18
>>> double-byte characters, but RecordInputStream has only 29 bytes (14
>>> characters + 1 byte.) If I check the Excel file with binary editor,
>>> everything up to here seems correct.
>>
>> Hmm, the phonetic text stuff went in much later to the file format. It's not
>> impossible that the team working on it had different ideas about continue
>> records to the team who did the original work
>>
>>> When I open the excel file and deleted sheets which do not contain the
>>> problematic string, then the problem disappeared. I tried deleting different
>>> sheets, then the problem sometimes disappeared and sometimes not. I also
>>> tried re-inputting the problematic string by cut & pasting in Excel, then
>>> the problem always disappeared. My best guess is that Excel sometimes saves
>>> corrupted data. Corruption occurs in phonetic texts which is not usually
>>> visible to users, it is not very obvious.
>>
>> If you open the file in Excel and do "Save As", does it fix the issue, or is
>> it only changing text / removing sheets that fixes it?
>>
>> Nick
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
>>
>>
>
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]