AFAIK the source of this problem is a "bug" in the File.toURL method of
the Java API: it does not escape characters that are illegal in URLs.
(In the javadoc of JDK 1.4 I see this limitation is now documented and an
alternative method, File.toURI, is provided.)
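
To illustrate, here is a minimal sketch (not code from Cocoon) comparing
File.toURL with File.toURI, which as far as I can tell is the alternative
the javadoc means. The class name, method names and file name are made up
for the example.

```java
import java.io.File;
import java.net.MalformedURLException;

class FileUrlDemo {
    // Legacy conversion: characters that are illegal in URLs are left as-is.
    @SuppressWarnings("deprecation")
    static String legacyUrl(File f) {
        try {
            return f.toURL().toString();
        } catch (MalformedURLException e) {
            throw new IllegalStateException(e);
        }
    }

    // Conversion via URI: illegal characters such as spaces are
    // percent-escaped, and toASCIIString() additionally escapes
    // non-ASCII characters as UTF-8 byte sequences.
    static String escapedUri(File f) {
        return f.toURI().toASCIIString();
    }

    public static void main(String[] args) {
        File f = new File("my caf\u00e9.xml"); // hypothetical file name
        System.out.println(legacyUrl(f));
        System.out.println(escapedUri(f));
    }
}
```

The first line printed keeps the space and the accented character
unescaped; the second one ends in my%20caf%C3%A9.xml.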

The trouble with fixing this is that there are probably already people
depending on this incorrect behaviour and doing the encoding themselves,
so fixing it would lead to double-encoding for them.
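
To make the double-encoding risk concrete, here is a small sketch. It uses
the multi-argument java.net.URI constructor as a stand-in for a
hypothetically "fixed" conversion, since that constructor escapes illegal
characters itself and always escapes '%'. The class and method names are
invented for the example.

```java
import java.net.URI;
import java.net.URISyntaxException;

class DoubleEncodingDemo {
    // Builds a file: URI from a path. The multi-argument constructor quotes
    // illegal characters in the path and always quotes the '%' character.
    static String toFileUri(String path) {
        try {
            return new URI("file", null, path, null).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // Raw path: the space is escaped once.
        System.out.println(toFileUri("/docs/my file.xml"));
        // Path already encoded by the caller: the '%' is escaped again.
        System.out.println(toFileUri("/docs/my%20file.xml"));
    }
}
```

This prints file:/docs/my%20file.xml for the raw path and
file:/docs/my%2520file.xml for the pre-encoded one, which no longer points
at the intended file.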

But all this doesn't immediately help you of course...

I guess trying to avoid filenames containing non-ASCII characters is a
bad suggestion? ;-)
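
For what it's worth, the encodeURI() approach described in the quoted
message below could look roughly like this in plain Java. This is only a
sketch: I am assuming it is applied to the path part only (so a scheme such
as "file:" is not touched), and the class and method names are made up.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class PathEncoder {
    // Encodes each path segment separately so that '/' is preserved,
    // then turns the '+' that URLEncoder emits for a space into "%20".
    static String encodePath(String path) {
        String[] parts = path.split("/", -1); // -1 keeps empty segments
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) {
                out.append('/');
            }
            try {
                out.append(URLEncoder.encode(parts[i], "UTF-8").replace("+", "%20"));
            } catch (UnsupportedEncodingException e) {
                // UTF-8 is required to be supported by every JVM.
                throw new IllegalStateException(e);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encodePath("/docs/my caf\u00e9.xml"));
    }
}
```

A literal '+' in a name comes out of URLEncoder as %2B, so the replace step
only touches the '+' that stood for a space.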

On Mon, 2004-02-16 at 12:04, Jan Hoskens wrote:
> Nobody has any remarks about this? Or is it because it was posted at the end
> of the week;-)
> 
> Or should I ask dev list?
> 
> Kind Regards,
> Jan
> 
> ----- Original Message ----- 
> From: "Jan Hoskens" <[EMAIL PROTECTED]>
> To: <[EMAIL PROTECTED]>
> Sent: Friday, February 13, 2004 11:32 AM
> Subject: Bug? Reading File Source
> 
> 
> > Hi,
> >
> > I've had some problems concerning special characters in my flow, but could
> > fix it. One of my problems occurred when loading a document. When I used the
> > proposed way of loading (in woody binding sample):
> >
> >         source = resolver.resolveURI(uri); // resolve is ok
> >         var is = new Packages.org.xml.sax.InputSource(source.getInputStream());
> >         is.setSystemId(source.getURI());
> >         return parser.parseDocument(is); // crashes here
> >
> > I got an error concerning special characters. When a 'Ü' appeared in the
> > filename I got an exception concerning UTF-8 illegal characters. I created
> > this workaround with an encoding function to make sure that the string is in
> > UTF-8:
> >
> >         source = resolver.resolveURI(uri);
> >         // just another way to access the file
> >         var file = new java.io.File(new java.net.URI(encodeURI(source.getURI())));
> >         var is = new Packages.org.xml.sax.InputSource(new java.io.FileReader(file));
> >         return parser.parseDocument(is);
> >
> > The encodeURI() function essentially does this:
> >     split up the uri so that e.g. '/' is preserved, take the pieces (thus the
> > directories and filenames) and do java.net.URLEncoder.encode(part,"UTF-8"),
> > then replace the '+' (stands for whitespace) with '%20'
> >
> > This does work and my file is loaded correctly.
> > I thought that I had overcome this special character problem, but no, I
> > hadn't! I tried to read a directory with xml files and aggregate them into one
> > big xml so I can create one pdf file, but again this failed because of the
> > special character 'Ü' appearing in my filename. I tried two combinations:
> > A) dir generator with xsl that creates includes and then include transformer
> > B) easy way: XPathDirectoryGenerator
> >
> > The first combination just crashes on the include, the second one ignores
> > the problem:
> >
> > XPathDirectoryGenerator: Warning: Problem while reading the file AYGÜL.xml.
> > Ignoring.
> > java.io.UTFDataFormatException: Invalid byte 2 of 2-byte UTF-8 sequence.
> >  at org.apache.xerces.impl.io.UTF8Reader.invalidByte(Unknown Source)
> >
> > It seems to me that the same method of reading a file is used as I get the
> > same UTF error (that would be logical, reusing parts). So I think that the
> > inputSource doesn't take the special characters into account and when trying
> > to set an inputstream, it simply crashes because no conversion is done.
> > Isn't this a bug? Isn't it the responsibility of the InputSource object to
> > give a valid inputstream, even when special characters are used? (Or maybe
> > the Source gives an incorrect InputSource?)
> >
> > Greetings,
> > Jan
> >

-- 
Bruno Dumon                             http://outerthought.org/
Outerthought - Open Source, Java & XML Competence Support Center
[EMAIL PROTECTED]                          [EMAIL PROTECTED]

