Re: About reading and writing Chinese Text into word2003 using POI library

Scott Zhang Tue, 09 Aug 2011 22:35:58 -0700

Hi Sergey and all,
    This is interesting. I used the empty.doc from the POI test-data and rebuilt
the POI svn code again. Now the Chinese words are inserted into the document
correctly.
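
A quick way to check what "correctly" means at the byte level is to compare
the document bytes against Java's own UTF-16LE encoding of the same text. This
is only a sketch using the standard library (no POI calls); the class name
Utf16Dump and the helper toHex are made up for illustration. The Chinese text
is written with Unicode escapes so the result does not depend on the source
file encoding.

```java
import java.nio.charset.StandardCharsets;

public class Utf16Dump {

    /** Formats bytes as lowercase two-digit hex values separated by spaces. */
    public static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The two Chinese characters from this thread (U+4F60, U+597D).
        String chinese = "\u4f60\u597d";
        byte[] utf16le = chinese.getBytes(StandardCharsets.UTF_16LE);
        System.out.println(toHex(utf16le)); // 60 4f 7d 59

        // ASCII letters get a zero high byte in UTF-16LE.
        System.out.println(toHex("hello".getBytes(StandardCharsets.UTF_16LE)));
        // 68 00 65 00 6c 00 6c 00 6f 00

        // Decoding the same bytes as UTF-16LE gives the original text back.
        String roundTrip = new String(utf16le, StandardCharsets.UTF_16LE);
        System.out.println(roundTrip.equals(chinese)); // true
    }
}
```

The first line of output matches the bytes seen in the hex editor for the
text typed directly in Word.
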
    So it works, although I'm not sure why it doesn't work any other way.
For the moment, only empty.doc can be used as a template to generate a correct
doc file containing Chinese characters.
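
One possible reason the template matters is the fCompressed flag described in
the specification excerpt quoted further down: it decides whether a piece of
text is stored as one or two bytes per character. The rule from that excerpt
can be sketched as below. This is an illustration of the quoted text only, not
POI's actual code. The class and method names are hypothetical, and the masks
assume that fCompressed is bit 30 of the 32-bit Pcd.Fc value (bit 46 of the
Pcd, as the excerpt says) with the offset in the low 30 bits.

```java
public class PieceLocation {

    // Assumed layout of the 32-bit Pcd.Fc value:
    // low 30 bits = fc, bit 30 = fCompressed.
    static final int FC_MASK = 0x3FFFFFFF;
    static final int COMPRESSED_FLAG = 0x40000000;

    /** Reads Pcd.Fc (bytes 2-5 of an 8-byte Pcd) as a little-endian int. */
    public static int readRawFc(byte[] pcd) {
        return (pcd[2] & 0xff)
                | (pcd[3] & 0xff) << 8
                | (pcd[4] & 0xff) << 16
                | (pcd[5] & 0xff) << 24;
    }

    /** True if the piece holds 8-bit ANSI text, false if 16-bit Unicode. */
    public static boolean isCompressed(int rawFc) {
        return (rawFc & COMPRESSED_FLAG) != 0;
    }

    /** Byte offset of the piece's text in the Word Document stream. */
    public static int streamOffset(int rawFc) {
        int fc = rawFc & FC_MASK;
        // ANSI pieces start at half the stored value; Unicode pieces at the value itself.
        return isCompressed(rawFc) ? fc / 2 : fc;
    }

    /** Number of bytes the piece occupies, given its CP and the next CP. */
    public static int byteLength(int rawFc, int cpStart, int cpNext) {
        int chars = cpNext - cpStart;
        return isCompressed(rawFc) ? chars : chars * 2;
    }

    public static void main(String[] args) {
        // Hypothetical Pcd for a Unicode piece: fc = 0x800, fCompressed = 0.
        byte[] unicodePcd = {0, 0, 0x00, 0x08, 0x00, 0x00, 0, 0};
        // Hypothetical Pcd for an ANSI piece: fc = 0x1000, fCompressed = 1.
        byte[] ansiPcd = {0, 0, 0x00, 0x10, 0x00, 0x40, 0, 0};

        for (byte[] pcd : new byte[][] {unicodePcd, ansiPcd}) {
            int rawFc = readRawFc(pcd);
            System.out.println("compressed=" + isCompressed(rawFc)
                    + " offset=" + streamOffset(rawFc)
                    + " bytes=" + byteLength(rawFc, 0, 13));
        }
    }
}
```

If 16-bit text ends up in a piece whose flag says 8-bit, a reader following
this rule would treat every byte as its own character. That would be
consistent with the "wide h" symptom described below, though I have not
confirmed that this is what happens with empty1.doc.
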



Regards.
Scott

2011/8/10 Scott Zhang <[email protected]>

> Hello all,
>     I am wondering whether POI supports reading non-ASCII Word files now.
>     I tried with the empty.doc provided by the POI test-data. Using that file
> as a template, inserting English words is OK. I can insert "hello, world" etc.
> into the doc correctly. But when I insert Chinese characters, the result is
> wrong.
>     Now I have created another file, empty1.doc, with the Chinese edition of
> Office 2007. Using this file as a template, even the inserted English words
> are garbled. So I think there is some property parsing issue. I checked the
> generated binary data. It looks like the encoding of the generated data is
> correct; Word just doesn't display it correctly. So I think there is a wrong
> CP or something similar in the header.
>
>
>
> Regards.
> Scott
>
>
> 2011/8/10 Scott Zhang <[email protected]>
>
>> I read the specification:
>>
>> http://msdn.microsoft.com/en-us/library/gg615596.aspx
>> I guess it should set *Pcd.Fc.fCompressed*.
>> "
>>
>>    1. For each *Pcd* structure in *PlcPcd.aPcd*:
>>
>>       1. Read the value of the *Pcd.Fc.fCompressed* field at bit 46 of the
>>          current Pcd structure. If 0, the *Pcd* structure refers to a 16-bit
>>          Unicode character. If 1, it refers to an 8-bit ANSI character.
>>
>>       2. Read the value of *Pcd.Fc*, which is bytes 2-5 of the current Pcd,
>>          and the corresponding CP value.
>>
>>          - If Unicode, the text at the character position specified by the
>>            current CP value starts at an offset equal to the value of
>>            *Pcd.Fc* in the Word Document stream, and occupies two bytes per
>>            character.
>>
>>          - If ANSI, the text at the current CP starts at an offset of
>>            half the value of *Pcd.Fc*, and occupies one byte per character.
>>
>>          In either case, the number of characters specified by the
>>          current CP is equal to the value of the next CP in the array minus
>>          that of the current CP.
>> "
>>
>>
>>
>> 2011/8/9 Scott Zhang <[email protected]>
>>
>>> Hi Sergey and all,
>>>
>>>   I have checked out the code from svn and built it myself. The insert
>>> function is working, but Chinese text is still not working.
>>>   So I did the following checks on the document data POI generated.
>>>
>>> Here is what I found. I think we are not far off; it just needs a little
>>> more effort.
>>> 1. I typed "hello, world 你好" into a doc in Word, then opened the doc in a
>>> hex editor.
>>> I found the text is saved as
>>> 68 00 65 00 6c 00 6c 00 6f 00 2c 00 77 00
>>> h     e     l     l     o     ,     w
>>> 60 4f 7d 59
>>> 你    好
>>>
>>> So the truth is simple: the doc internally uses UTF-16LE to save its
>>> content. I tried manually entering 60 4f 7d 59 after the text I had typed
>>> in the doc, then saved it and opened it in Office again. The 60 4f 7d 59 is
>>> correctly displayed as "你好".
>>>
>>> 2. When I use POI to insert text into Word with
>>> range.insertAfter("hello,world");
>>>
>>> the binary data POI generates is
>>> 68 65 6c 6c 6f
>>> h  e  l  l  o
>>> And if I use range.insertAfter("hello,world你好"), the "你好" is translated
>>> to a code I can't figure out.
>>> So I am using
>>> range.insertAfter(new String("hello,world你好".getBytes("UTF-16LE")));
>>> The good news is that it is generated correctly in the doc as
>>> 68 00 65 00 6c 00 6c 00
>>> The binary is the same as expected. But Word displays 'h' as a wide
>>> 'h' and displays "你好" as garbled characters.
>>>
>>> So what I am thinking is: since we have generated the correct binary
>>> representation of the characters, there should be a setting somewhere for
>>> the default encoding of characters in Word.
>>>
>>> Can anyone point it out?
>>> I think we have nearly solved this now.
>>>
>>>
>>> Regards.
>>> Scott
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Tue, Aug 9, 2011 at 2:13 PM, Scott Zhang <[email protected]> wrote:
>>>
>>>> Hi Sergey,
>>>>
>>>> Checking out svn code now.
>>>>
>>>> Thanks.
>>>> Regards.
>>>> Scott
>>>>
>>>>
>>>> On Tue, Aug 9, 2011 at 1:23 PM, Sergey Vladimirov 
>>>> <[email protected]>wrote:
>>>>
>>>>> Hi, Scott.
>>>>>
>>>>> I've just fixed a text editing issue in trunk. Please check using the
>>>>> latest code from SVN trunk, or wait until tomorrow to test with
>>>>> beta4-20110810 :)
>>>>>
>>>>> Best regards,
>>>>> Sergey
>>>>>
>>>>> On Tue, Aug 9, 2011 at 9:08 AM, Scott Zhang <[email protected]>
>>>>> wrote:
>>>>> > Hi Sergey,
>>>>> >
>>>>> >  I downloaded the latest jar files:
>>>>> >
>>>>> > poi-scratchpad-3.8-beta4-20110808.jar
>>>>> > poi-3.8-beta4-20110808.jar
>>>>> > poi-excelant-3.8-beta4-20110808.jar
>>>>> >
>>>>> > and replaced my existing jars with them.
>>>>> >
>>>>> > Now reading is still correct. But with writing,
>>>>> > range.insertAfter("helloworld");
>>>>> > even English words can't be inserted into the doc file. Nothing was
>>>>> > inserted into the document.
>>>>> >
>>>>> > Can you check this too?
>>>>> >
>>>>> >
>>>>> > I don't link with
>>>>> >
>>>>> > poi-dependencies-<version>-<date>.jar
>>>>> > because I see that my compilation works fine without it. Could that be
>>>>> > the issue?
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> > Regards.
>>>>> > Scott
>>>>> >
>>>>> > On Tue, Aug 9, 2011 at 12:32 PM, Scott Zhang <[email protected]>
>>>>> wrote:
>>>>> >
>>>>> >> Sure.
>>>>> >>
>>>>> >> Doing
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >> On Tue, Aug 9, 2011 at 12:29 PM, Sergey Vladimirov <
>>>>> [email protected]>wrote:
>>>>> >>
>>>>> >>> Hi,
>>>>> >>>
>>>>> >>> Could you try to do it using latest version of POI, available at
>>>>> >>> http://encore.torchbox.com/poi-cvs-build/ ?
>>>>> >>>
>>>>> >>> Best regards,
>>>>> >>> Sergey
>>>>> >>>
>>>>> >>> On Tue, Aug 9, 2011 at 7:41 AM, Scott Zhang <
>>>>> [email protected]>
>>>>> >>> wrote:
>>>>> >>> > Hello.
>>>>> >>> >    I am using the POI library to read/write text from Word 2003 files.
>>>>> >>> >    I use
>>>>> >>> >    range = document.getRange();
>>>>> >>> >    System.out.println(range.text());
>>>>> >>> >
>>>>> >>> > and the Chinese characters are output correctly.
>>>>> >>> >   But when I try to insert the same text back with
>>>>> >>> >    range.insertAfter(range.text());
>>>>> >>> >
>>>>> >>> > I can only see garbled characters in the output doc. How can I solve this?
>>>>> >>> >
>>>>> >>> >
>>>>> >>> > Thanks.
>>>>> >>> > Regards.
>>>>> >>> > Scott
>>>>> >>> >
>>>>> >>>
>>>>> >>>
>>>>> >>>
>>>>> >>> --
>>>>> >>> Sergey Vladimirov
>>>>> >>>
>>>>> >>>
>>>>> ---------------------------------------------------------------------
>>>>> >>> To unsubscribe, e-mail: [email protected]
>>>>> >>> For additional commands, e-mail: [email protected]
>>>>> >>>
>>>>> >>>
>>>>> >>
>>>>> >
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Sergey Vladimirov
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
