Hello all,
I am wondering whether POI supports non-ASCII Word files now.
I tried with the empty.doc provided in the POI test-data. Using that file
as a template, inserting English text works: I can insert "hello, world"
etc. into the doc correctly. But when I insert Chinese characters, the
result is wrong.
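
To know which bytes to expect in a hex editor for such text, a minimal
sketch like the following can help. It uses only the standard Java library,
not POI. The class name Utf16Dump and the helper hex are made up for
illustration, and it assumes Word stores this text as UTF-16LE.

```java
import java.nio.charset.StandardCharsets;

public class Utf16Dump {
    /** Returns the UTF-16LE bytes of s as space-separated lowercase hex. */
    static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_16LE)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask to avoid sign extension of bytes >= 0x80.
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // 'h' followed by U+4F60 and U+597D (the two Chinese characters
        // used in this thread), written as escapes so the source file
        // encoding does not matter.
        System.out.println(hex("h\u4f60\u597d")); // prints "68 00 60 4f 7d 59"
    }
}
```

Comparing this output with the bytes in the generated .doc shows whether the
text itself was written as expected, independent of how Word displays it.
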
Then I created another file, empty1.doc, with the Chinese edition of Office
2007. Using this file as a template, even the inserted English text is
garbled. So I think there is a problem in how some properties are parsed.
I checked the generated binary data, and the encoding of the generated text
looks correct; Word just doesn't display it correctly. So I think a CP or
something similar in the header is wrong.
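
One header field that may be involved is Pcd.Fc.fCompressed from the
specification excerpt quoted further down. The following is only a sketch of
how that rule reads, assuming the 4 bytes at offsets 2-5 of a Pcd form a
little-endian value whose bit 30 is fCompressed (bit 46 of the whole
structure) and whose low 30 bits are the offset. PcdSketch, TextRun and
decode are hypothetical names, not part of POI's API.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class PcdSketch {
    /** Where a piece of text lives: stream offset and bytes per character. */
    static final class TextRun {
        final int offset;
        final int bytesPerChar;

        TextRun(int offset, int bytesPerChar) {
            this.offset = offset;
            this.bytesPerChar = bytesPerChar;
        }
    }

    /** Decodes bytes 2-5 of an 8-byte Pcd structure (assumed layout). */
    static TextRun decode(byte[] pcd) {
        int raw = ByteBuffer.wrap(pcd, 2, 4)
                .order(ByteOrder.LITTLE_ENDIAN)
                .getInt();
        boolean compressed = ((raw >>> 30) & 1) == 1; // fCompressed
        int fc = raw & 0x3FFFFFFF;                    // low 30 bits
        // ANSI text: half the fc value, one byte per character.
        // Unicode text: fc as is, two bytes per character.
        return compressed ? new TextRun(fc / 2, 1) : new TextRun(fc, 2);
    }

    public static void main(String[] args) {
        // fc = 0x800 with fCompressed set: raw value 0x40000800.
        byte[] pcd = {0, 0, 0x00, 0x08, 0x00, 0x40, 0, 0};
        TextRun run = decode(pcd);
        System.out.println(run.offset + " " + run.bytesPerChar); // prints "1024 1"
    }
}
```

If the bit is set for a piece that actually holds UTF-16LE bytes, a reader
would treat every byte as its own character, which may be one way to end up
with correct bytes that display wrongly.
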
Regards.
Scott
2011/8/10 Scott Zhang <[email protected]>
> I read the specification:
>
> http://msdn.microsoft.com/en-us/library/gg615596.aspx
>
> I guess it should set *Pcd.Fc.fCompressed*:
>
> "
> 1. For each *Pcd* structure in *PlcPcd.aPcd*:
>    1. Read the value of the *Pcd.Fc.fCompressed* field at bit 46 of the
>       current Pcd structure. If 0, the *Pcd* structure refers to a 16-bit
>       Unicode character. If 1, it refers to an 8-bit ANSI character.
>    2. Read the value of *Pcd.Fc*, which is bytes 2-5 of the current Pcd,
>       and the corresponding CP value.
>       - If Unicode, the text at the character position specified by the
>         current CP value starts at an offset equal to the value of
>         Pcd.Fc in the Word Document stream, and occupies two bytes per
>         character.
>       - If ANSI, the text at the current CP starts at an offset of half
>         the value of *Pcd.Fc*, and occupies one byte per character.
>
>       In either case, the number of characters specified by the current
>       CP is equal to the value of the next CP in the array minus that of
>       the current CP.
> "
>
>
>
> 2011/8/9 Scott Zhang <[email protected]>
>
>> Hi Sergey and all,
>>
>> I have checked out the code from svn and built it myself. The insert
>> function is working now, but Chinese text still does not work.
>> So I examined the document data that POI generated.
>>
>> Here is what I found. I think we are not far off; it just needs a little
>> more effort.
>> 1. I typed "hello, world 你好" into a doc in Word, then opened the doc in
>> a hex editor.
>> I found the text is saved as
>> 68 00 65 00 6c 00 6c 00 6f 00 2c 00 77 00
>> h     e     l     l     o     ,     w
>> 60 4f 7d 59
>> 你    好
>>
>> So the truth is simple: the doc internally uses UTF-16LE to store its
>> content. I tried manually entering 60 4f 7d 59 after the text I had typed
>> in the doc, then saved it and opened it in Office again. The 60 4f 7d 59
>> is correctly displayed as "你好".
>>
>> 2. When I use POI to insert text into the Word document with
>> range.insertAfter("hello,world");
>>
>> the binary data POI generates is
>> 68 65 6c 6c 6f
>> h  e  l  l  o
>> And if I use range.insertAfter("hello,world你好"), the "你好" is translated
>> to a code I can't figure out.
>> So I am using
>> range.insertAfter(new String("hello,world你好").getBytes("UTF-16LE"));
>> The good news is that it is generated correctly in the doc as
>> 68 00 65 00 6c 00 6c 00
>> The binary is the same as expected. But Word displays 'h' as a wide 'h'
>> and displays "你好" as garbled characters.
>>
>> So what I am thinking is: since we have generated the correct binary
>> representation of the characters, there should be a setting somewhere for
>> the default encoding of characters in Word.
>>
>> Can anyone point it out?
>> I think we are close to solving this now.
>>
>>
>> Regards.
>> Scott
>>
>>
>>
>>
>>
>>
>> On Tue, Aug 9, 2011 at 2:13 PM, Scott Zhang <[email protected]> wrote:
>>
>>> Hi Sergey,
>>>
>>> I am checking out the svn code now.
>>>
>>> Thanks.
>>> Regards.
>>> Scott
>>>
>>>
>>> On Tue, Aug 9, 2011 at 1:23 PM, Sergey Vladimirov <[email protected]> wrote:
>>>
>>>> Hi, Scott.
>>>>
>>>> I've just fixed the text editing issue in trunk. Please check using the
>>>> latest code from SVN trunk, or wait until tomorrow to test with
>>>> beta4-20110810 :)
>>>>
>>>> Best regards,
>>>> Sergey
>>>>
>>>> On Tue, Aug 9, 2011 at 9:08 AM, Scott Zhang <[email protected]>
>>>> wrote:
>>>> > Hi Sergey,
>>>> >
>>>> > I downloaded the latest jar files:
>>>> >
>>>> > poi-scratchpad-3.8-beta4-20110808.jar
>>>> > poi-3.8-beta4-20110808.jar
>>>> > poi-excelant-3.8-beta4-20110808.jar
>>>> >
>>>> > and replaced my existing jars with them.
>>>> >
>>>> > Reading is still correct. But with writing,
>>>> > range.insertAfter("helloworld");
>>>> > even English text can't be inserted into the doc file; nothing
>>>> > was inserted into the document.
>>>> >
>>>> > Can you check this too?
>>>> >
>>>> >
>>>> > I don't link against
>>>> >
>>>> > poi-dependencies-<version>-<date>.jar
>>>> > because my compilation works fine without it. Could that be the
>>>> > issue?
>>>> >
>>>> >
>>>> >
>>>> >
>>>> > Regards.
>>>> > Scott
>>>> >
>>>> > On Tue, Aug 9, 2011 at 12:32 PM, Scott Zhang <[email protected]> wrote:
>>>> >
>>>> >> Sure.
>>>> >>
>>>> >> Doing it now.
>>>> >>
>>>> >>
>>>> >>
>>>> >>
>>>> >> On Tue, Aug 9, 2011 at 12:29 PM, Sergey Vladimirov <[email protected]> wrote:
>>>> >>
>>>> >>> Hi,
>>>> >>>
>>>> >>> Could you try it with the latest version of POI, available at
>>>> >>> http://encore.torchbox.com/poi-cvs-build/ ?
>>>> >>>
>>>> >>> Best regards,
>>>> >>> Sergey
>>>> >>>
>>>> >>> On Tue, Aug 9, 2011 at 7:41 AM, Scott Zhang <[email protected]> wrote:
>>>> >>> > Hello,
>>>> >>> > I am using the POI library to read/write text from Word 2003 files.
>>>> >>> > I use
>>>> >>> > range = document.getRange();
>>>> >>> > System.out.println(range.text());
>>>> >>> >
>>>> >>> > and the Chinese characters are output correctly.
>>>> >>> > But when I try to insert the same text back with
>>>> >>> > range.insertAfter(range.text());
>>>> >>> >
>>>> >>> > I only see garbled characters in the output doc. How can I solve this?
>>>> >>> >
>>>> >>> >
>>>> >>> > Thanks.
>>>> >>> > Regards.
>>>> >>> > Scott
>>>> >>> >
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>> --
>>>> >>> Sergey Vladimirov
>>>> >>>
>>>> >>>
>>>> ---------------------------------------------------------------------
>>>> >>> To unsubscribe, e-mail: [email protected]
>>>> >>> For additional commands, e-mail: [email protected]
>>>> >>>
>>>> >>>
>>>> >>
>>>> >
>>>>
>>>>
>>>>
>>>> --
>>>> Sergey Vladimirov
>>>>
>>>>
>>>>
>>>
>>
>