Serialization problem

Sripathy Subramania 9 Apr 2001 15:21:41 -0000

> Hi,
> 
> Could one of the developers please read this email and take the
> necessary action? I was only trying to help the development team create a
> good product.
> If any of you feel that the report is incorrect, please say so.
> 
> Regards,
> -sripathy
> 
> 
>  -----Original Message-----
> From:         Sripathy Subramania  
> Sent: Wednesday, April 04, 2001 2:03 PM
> To:   '[email protected]'
> Cc:   '[EMAIL PROTECTED]'
> Subject:      BaseMarkupSerializer bug
> 
> Hi,
> 
> In xerces-1_1_3, BaseMarkupSerializer.characters(char[], int, int)
> inserts the escape sequence "]]><![CDATA[" for an embedded "]]>"
> pattern at the wrong location.
> This results in incorrect XML being serialized from the DOM.
> 
> I have proposed a fix in this mail.
> 
> Xerces version : 1.1.3
> JDK version : 1.3
> 
> I needed to serialize a DOM conforming to the
> following DTD:
> 
> <?xml version="1.0" encoding="UTF-8"?>
> <!ELEMENT Sample (Id, Messages+)>
> <!ELEMENT Id (#PCDATA)>
> <!ELEMENT Messages (MsgId, MsgDesc?, Msg)>
> <!ELEMENT MsgId (#PCDATA)>
> <!ELEMENT MsgDesc (#PCDATA)>
> <!ELEMENT Msg (#PCDATA)>
> 
> An XML file conforming to this DTD might be:
> <?xml version="1.0" encoding="UTF-8"?>
> <!DOCTYPE Sample SYSTEM "Sample.dtd">
> <Sample>
>   <Id>Doc 1</Id>
>   <Messages>
>     <MsgId>Msg 1</MsgId>
>     <MsgDesc>Testing document</MsgDesc>
>     <Msg><![CDATA[This is a test message having patterns ]]>. This message
> may contain multiple occurrences of patterns ]]>. The End]]></Msg>
>   </Messages>
> </Sample>
>   
> In the above DTD, the 'Msg' element value will be a
> CDATA section. This value may contain the string "]]>"
> embedded in it (as shown in the sample XML document above).
> BaseMarkupSerializer identifies this pattern and escapes it by
> inserting the string "]]><![CDATA[" between "]]" and ">", so that the
> pattern is split across two CDATA sections. But the
> code logic for escaping seems to have a bug.
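
For reference, the intended escaping can be sketched on a plain String. This is only an illustration of the splitting technique, not Xerces code; the class and method names are hypothetical.

```java
// Hypothetical helper, not part of Xerces: wraps text in a CDATA section and
// splits every "]]>" so that "]]" ends one section and ">" starts the next.
final class CDataSplitSketch {
    static String toCData(String text) {
        StringBuilder out = new StringBuilder("<![CDATA[");
        int from = 0;
        int hit;
        while ((hit = text.indexOf("]]>", from)) >= 0) {
            // Emit everything up to and including "]]", then close and reopen.
            out.append(text, from, hit + 2);
            out.append("]]><![CDATA[");
            from = hit + 2; // the '>' becomes the first char of the new section
        }
        out.append(text, from, text.length());
        return out.append("]]>").toString();
    }

    public static void main(String[] args) {
        // prints <![CDATA[a]]]]><![CDATA[>b]]>
        System.out.println(toCData("a]]>b"));
    }
}
```

A parser reading the two adjacent sections back sees "a]]" followed by ">b", which is the original text.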
> 
> Original source from
> Xerces-1_1_3\src\org\apache\xml\serialize\BaseMarkupSerializer
> (Lines 457~491)
> *********************************************************
>     public void characters( char[] chars, int start, int length )
>     {
>         ElementState state;
>         
>         state = content();
>         // Check if text should be print as CDATA section or unescaped
>         // based on elements listed in the output format (the element
>         // state) or whether we are inside a CDATA section or entity.
>         
>         if ( state.inCData || state.doCData ) {
>             int          saveIndent;
>             
>             // Print a CDATA section. The text is not escaped, but ']]>'
>             // appearing in the code must be identified and dealt with.
>             // The contents of a text node is considered space
>             // preserving.
>             if ( ! state.inCData ) {
>                 _printer.printText( "<![CDATA[" );
>                 state.inCData = true;
>             }
>             saveIndent = _printer.getNextIndent();
>             _printer.setNextIndent( 0 );
>             for ( int index = 0 ; index < length ; ++index ) {
>                 if ( index + 2 < length && chars[ index ] == ']' && 
>                      chars[ index + 1 ] == ']' &&
>                      chars[ index + 2 ] == '>') {
>                     
>                     printText( chars, start, index + 2, true, true );
>                     _printer.printText( "]]><![CDATA[" );
>                     start += index + 2;
>                     length -= index + 2;
>                     index = 0;
>                 }
>             }
>             if ( length > 0 )
>                 printText( chars, start, length, true, true );
>             _printer.setNextIndent( saveIndent );
> *************************************************************
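
To see the effect of the loop above in isolation, here is a minimal stand-alone sketch. It is not the Xerces class: printText( chars, start, n, true, true ) is modelled as StringBuilder.append( chars, start, n ), and the class and method names are hypothetical. Under those assumptions, the loop tests chars[ index ] rather than chars[ start + index ], so it inspects the wrong characters whenever start is non-zero, which includes every iteration after the first match has advanced start.

```java
// Hypothetical stand-alone port of the original loop (assumption: printText
// with start/length arguments behaves like StringBuilder.append).
final class OriginalLoopSketch {
    static String original(char[] chars, int start, int length) {
        StringBuilder out = new StringBuilder();
        for (int index = 0; index < length; ++index) {
            // Bug: indexes chars relative to 0, not relative to start.
            if (index + 2 < length && chars[index] == ']'
                    && chars[index + 1] == ']' && chars[index + 2] == '>') {
                out.append(chars, start, index + 2);
                out.append("]]><![CDATA[");
                start += index + 2;
                length -= index + 2;
                index = 0;
            }
        }
        if (length > 0) {
            out.append(chars, start, length);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Second "]]>" is missed: prints ]]]]><![CDATA[>a]]>b
        System.out.println(original("]]>a]]>b".toCharArray(), 0, 8));
        // Non-zero start, text is "]]>cd": prints ]]>c]]><![CDATA[d
        System.out.println(original("ab]]>cd".toCharArray(), 2, 5));
    }
}
```

In both cases a raw "]]>" survives in the output, so the enclosing CDATA section would be terminated early.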
> Proposed changes for the above block
> 
>     public void characters( char[] chars, int start, int length )
>     {
>         ElementState state;
>         
>         state = content();
>         // Check if text should be print as CDATA section or unescaped
>         // based on elements listed in the output format (the element
>         // state) or whether we are inside a CDATA section or entity.
>         
>         if ( state.inCData || state.doCData ) {
>             int          saveIndent;
>             int          index = 0;
>             int          endIndex = 0;
>             
>             // Print a CDATA section. The text is not escaped, but ']]>'
>             // appearing in the code must be identified and dealt with.
>             // The contents of a text node is considered space
>             // preserving.
>             if ( ! state.inCData ) {
>                 _printer.printText( "<![CDATA[" );
>                 state.inCData = true;
>             }
>             saveIndent = _printer.getNextIndent();
>             _printer.setNextIndent( 0 );
>             endIndex = start + length;
>             for ( index = start ; index < endIndex ; ++index ) {
>                 if ( index + 2 < endIndex && chars[ index ] == ']' && 
>                      chars[ index + 1 ] == ']' &&
>                      chars[ index + 2 ] == '>') {
>                     
>                     printText( chars, start, index + 2 - start,
>                                true, true);
>                     _printer.printText( "]]><![CDATA[" );
>                     start = index + 2;
>                     index = start;
>                 }
>             }
>             if ( index > start )
>                 printText( chars, start, index-start, true, true );
>             _printer.setNextIndent( saveIndent );
> ********************************************************************
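
A stand-alone sketch of the proposed loop, for anyone who wants to try it outside the serializer. It assumes that printText( chars, start, n, true, true ) behaves like StringBuilder.append( chars, start, n ); the class and method names are hypothetical and not part of Xerces.

```java
// Hypothetical stand-alone port of the proposed loop (assumption: printText
// with start/length arguments behaves like StringBuilder.append).
final class ProposedLoopSketch {
    static String proposed(char[] chars, int start, int length) {
        StringBuilder out = new StringBuilder();
        int index;
        int endIndex = start + length;
        for (index = start; index < endIndex; ++index) {
            if (index + 2 < endIndex && chars[index] == ']'
                    && chars[index + 1] == ']' && chars[index + 2] == '>') {
                // Print up to and including "]]", then close and reopen CDATA.
                out.append(chars, start, index + 2 - start);
                out.append("]]><![CDATA[");
                start = index + 2; // remaining text begins at the '>'
                index = start;
            }
        }
        if (index > start) {
            out.append(chars, start, index - start);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints ]]]]><![CDATA[>a]]]]><![CDATA[>b
        System.out.println(proposed("]]>a]]>b".toCharArray(), 0, 8));
        // Non-zero start, text is "]]>cd": prints ]]]]><![CDATA[>cd
        System.out.println(proposed("ab]]>cd".toCharArray(), 2, 5));
    }
}
```

Every "]]>" in the input ends up split between "]]" and ">", including when start is non-zero.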
> 
> NOTE : However, this fix does not handle the case where the string
>        pattern "]]>" is split across a buffer boundary, i.e. across
>        two successive calls to characters(). This might require more
>        changes.
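
One possible way to cover the boundary case in the NOTE is to hold back trailing ']' characters between calls and decide what to do with them once the next character arrives. The following is only a sketch under that assumption. The class, its finish() method, and the use of a StringBuilder in place of the printer are all hypothetical and not part of Xerces.

```java
// Hypothetical sketch: escapes "]]>" even when it is split across calls.
final class CDataChunkWriterSketch {
    private final StringBuilder out = new StringBuilder();
    private int brackets = 0; // trailing ']' characters held back, at most 2

    void characters(char[] chars, int start, int length) {
        for (int i = start; i < start + length; i++) {
            char c = chars[i];
            if (c == ']') {
                if (brackets < 2) {
                    brackets++;      // hold it back until we know what follows
                } else {
                    out.append(']'); // a third ']' makes the oldest one safe
                }
            } else {
                boolean split = (c == '>' && brackets == 2);
                flushBrackets();
                if (split) {
                    out.append("]]><![CDATA["); // close after "]]", reopen before '>'
                }
                out.append(c);
            }
        }
    }

    // Assumed to be called when the CDATA content is complete.
    String finish() {
        flushBrackets();
        return out.toString();
    }

    private void flushBrackets() {
        for (; brackets > 0; brackets--) {
            out.append(']');
        }
    }

    public static void main(String[] args) {
        CDataChunkWriterSketch w = new CDataChunkWriterSketch();
        w.characters("a]".toCharArray(), 0, 2);
        w.characters("]>b".toCharArray(), 0, 3);
        // prints a]]]]><![CDATA[>b
        System.out.println(w.finish());
    }
}
```

The cost of this design is that up to two characters stay pending until the next call or until finish(), so the serializer would need a hook at the end of the CDATA content to flush them.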
> 
> I checked the source for Xerces-1_2_3 and observed that this bug is
> not fixed yet. Moreover, I couldn't find any mails discussing this problem
> or a fix on the 'xerces-j-dev' or 'xerces-j-user' mailing lists.
> I don't know whether this bug has already been identified by the
> development team.
> 
> I would appreciate it if someone familiar with the code could verify the
> bug and baseline the changes. I would be glad to provide more
> information in this regard.
> 
> Thanks,
> -sripathy
>

