On 24 Dec 2007, at 17:04, Paul Fremantle wrote:

One more improvement.... I think we should make it possible to change
the default size that triggers a file using a config file (e.g.
synapse.properties).

I agree. Please raise a JIRA :)

Anything else?

I had a look at the code that handles the case where the output of the transformation is text rather than XML. I think there are multiple issues:

1) There are multiple places where character streams are converted to byte streams and vice versa:

* Since the XSLT processor is configured with a StreamResult writing to an OutputStream (ByteArrayOutputStream or FileOutputStream), it will convert the output to a byte stream.
* The output is then converted back to a character stream using ByteArrayOutputStream#toString or using TextFileDataSource.
* In VFSTransportSender it is converted back to a byte stream using String#getBytes or OMNode#serializeAndConsume.

The problem is that none of this code takes into account the character encoding used in these conversions. I opened SYNAPSE-215 to describe the issue with ByteArrayOutputStream#toString. In many cases the different issues probably compensate for each other so that the end result is correct. For example, ByteArrayOutputStream#toString and String#getBytes both use the platform's default encoding, so that the original byte stream is reconstructed. However, this will fail if the byte stream contains sequences that are not valid in the default encoding (this may happen e.g. when the default encoding is UTF-8). Anyway, Synapse should be fixed to handle character encodings properly from end to end.
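To illustrate the round-trip problem, here is a minimal sketch that is not taken from the Synapse code. The class and method names (EncodingRoundTrip, roundTrip, decode) are made up for the example, and the charset is passed explicitly only to simulate different platform defaults:

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of Synapse.
class EncodingRoundTrip {

    // Decode bytes to a String and encode them again with the same charset.
    // This mimics ByteArrayOutputStream#toString() followed by String#getBytes(),
    // which both fall back to the platform's default encoding.
    static byte[] roundTrip(byte[] bytes, Charset simulatedDefault) {
        return new String(bytes, simulatedDefault).getBytes(simulatedDefault);
    }

    // Decode the buffer with an explicitly chosen charset instead of the default.
    // ByteArrayOutputStream#toString(String charsetName) would do the same.
    static String decode(ByteArrayOutputStream out, Charset charset) {
        return new String(out.toByteArray(), charset);
    }

    public static void main(String[] args) {
        // 0xE9 is 'é' in ISO-8859-1, but on its own it is not a valid UTF-8 sequence.
        byte[] latin1 = {'c', 'a', 'f', (byte) 0xE9};

        // Single-byte default encoding: every byte maps to a character and back.
        byte[] viaLatin1 = roundTrip(latin1, StandardCharsets.ISO_8859_1);
        System.out.println("ISO-8859-1 lossless: " + Arrays.equals(latin1, viaLatin1));

        // UTF-8 default encoding: the invalid byte is replaced during decoding,
        // so the original byte stream cannot be reconstructed.
        byte[] viaUtf8 = roundTrip(latin1, StandardCharsets.UTF_8);
        System.out.println("UTF-8 lossless: " + Arrays.equals(latin1, viaUtf8));

        // Decoding with the encoding the bytes were actually written in.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(latin1, 0, latin1.length);
        System.out.println(decode(out, StandardCharsets.ISO_8859_1));
    }
}
```

The point is only that the compensation works by accident: it depends on the default encoding being able to represent every byte sequence that the XSLT processor emits.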

2) There are specific issues with TextFileDataSource:

* When a ByteArrayOutputStream is used, the result is parsed as plain text (since an OMText object is created directly from the result of ByteArrayOutputStream#toString). On the other hand, when TextFileDataSource is used, the result is parsed as XML (more precisely as an external parsed general entity). For example, the ampersand (&) is considered as the start character for an XML entity. I opened SYNAPSE-216 for this issue. Note however that when the data is consumed by VFSTransportSender, this problem is circumvented by the fact that the serialize method bypasses the XML parsing...

* TextFileDataSource implements OMDataSource but doesn't respect the contract (the Javadoc of OMDataSource is not very explicit but this can be seen from various examples in the Axis 2 source code):
  - serialize(OutputStream, OMOutputFormat) doesn't output the <text> wrapper element (actually the code is commented out) and doesn't take into account the character encoding specified by the OMOutputFormat.
  - serialize(Writer, OMOutputFormat) only outputs an empty <text> element. While this is exactly what is expected by VFSTransportSender, this might lead to unexpected results in other situations.
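The difference between treating the result as plain text and parsing it as XML (first point above) can be sketched with the JDK's StAX parser. This is only an approximation: it wraps the content in a <text> element instead of going through TextFileDataSource, and the class and method names are made up for the example:

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical demo class, not part of Synapse or Axiom.
class AmpersandParsing {

    // Wrap the content in a <text> element and parse it as XML.
    // Returns the character data seen by the parser, or null if the
    // content is rejected as not well formed.
    static String parseAsXmlContent(String content) {
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            // Merge adjacent character events so entity boundaries don't split the text.
            factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
            XMLStreamReader reader = factory.createXMLStreamReader(
                    new StringReader("<text>" + content + "</text>"));
            StringBuilder text = new StringBuilder();
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.CHARACTERS) {
                    text.append(reader.getText());
                }
            }
            reader.close();
            return text.toString();
        } catch (XMLStreamException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // A bare ampersand is fine as plain text but starts an entity reference in XML.
        System.out.println(parseAsXmlContent("AT&T"));
        // An escaped ampersand is resolved by the XML parser.
        System.out.println(parseAsXmlContent("AT&amp;T"));
    }
}
```

So the same transformation output is interpreted differently depending on which of the two code paths is taken.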

The purpose of TextFileDataSource is actually to implement a text node (+ <text> wrapper element) that is backed by a temporary file rather than a String/char[] object, so that the entire file does not have to be loaded into memory. Maybe we should consider another solution that avoids the problems described above. The idea would be to use a custom implementation of OMText (that again is backed by a temporary file). I think that if the custom implementation extends OMNodeImpl and implements OMText, an instance can be added to the Axiom tree without problems, provided that the Axiom code never casts OMText to OMTextImpl. An alternative (but less clean) solution would be to extend OMTextImpl.

3) Before solving some of the issues described above, another question needs to be addressed. XSLTMediator actually uses the following strategy to handle text output: it first tries to parse the output as an XML document and when this fails it will consider the output as text. There are however two different cases where this happens:

* The stylesheet specifies "text" as output method. In this case the output is plain text and will be parsed correctly when a ByteArrayOutputStream is used, but not when a TextFileDataSource is used (see above).
* The stylesheet specifies "xml" as output method (or doesn't specify an output method at all), but produces output that is not well formed (typically text only). In this case XSLTMediator will also consider the output as text. However, since the output method is XML, some characters are replaced by their corresponding XML entities (such as &amp;), which will be parsed correctly by TextFileDataSource, but not when ByteArrayOutputStream is used.
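The two cases can be reproduced outside Synapse with plain JAXP. The following is a minimal sketch, assuming the XSLT processor bundled with the JDK; the stylesheet, the input document and the class name are invented for the example:

```java
import java.io.ByteArrayOutputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical demo class, not part of Synapse.
class OutputMethodDemo {

    // Run a text-only stylesheet with the given output method ("text" or "xml")
    // and return what the processor wrote to the byte stream.
    static String transform(String method) {
        String stylesheet =
              "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
            + "<xsl:output method='" + method + "' encoding='UTF-8' omit-xml-declaration='yes'/>"
            + "<xsl:template match='/'><xsl:value-of select='/name'/></xsl:template>"
            + "</xsl:stylesheet>";
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(stylesheet)));
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            transformer.transform(
                    new StreamSource(new StringReader("<name>AT&amp;T</name>")),
                    new StreamResult(out));
            // Decode with the encoding requested in xsl:output, not the platform default.
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Output method "text": the ampersand is written as is.
        System.out.println(transform("text"));
        // Output method "xml": the ampersand is written as an entity,
        // although the result is not a well formed document.
        System.out.println(transform("xml"));
    }
}
```

Whatever consumes the result therefore needs to know which output method was used in order to interpret the ampersand correctly.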

While it is probably possible to handle both cases correctly, this would introduce unnecessary complexity into the code. I think it is not necessary to support the second case. I would expect that when a stylesheet specifies XML as output method but fails to produce a well formed XML document, the mediation fails with an error. However, some of the example stylesheets (see java/repository/conf/sample/resources/transform/transform_load.xml) that come with the Synapse source code don't specify "text" as output method but produce text only.


Regards,

Andreas

