Re: Create FileInputStream in servlet from remote file with accentuated character name

2009-09-24 Thread Sylvie Perrin

Christopher, André,

Christopher Schultz wrote:



And (just to anticipate the next issue), Sylvie, does your program
actually need to read the content of the file and do something with that
content ?



Yeah, remember to use a Reader and specify the character encoding.
  
Yes, my program needs to do something with the content of the files in
the shared Windows directory.
Actually, the main action is to parse each file and read its content
through an InputStreamReader(new FileInputStream(file)).


According to what Christopher says, I always need to specify the
character encoding, i.e. use InputStreamReader(new
FileInputStream(file), encoding).
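
A minimal sketch of that pattern, assuming the caller already knows the encoding of the file. The class and method names (EncodedFileReader, readAll) are made up for illustration; only the InputStreamReader(InputStream, String) constructor comes from the discussion.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;

class EncodedFileReader {
    /** Reads the whole file as text, decoding its bytes with the named charset. */
    static String readAll(File file, String encoding) throws IOException {
        BufferedReader in = new BufferedReader(
            new InputStreamReader(new FileInputStream(file), encoding));
        try {
            StringBuilder text = new StringBuilder();
            char[] buf = new char[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                text.append(buf, 0, n);
            }
            return text.toString();
        } finally {
            in.close(); // also closes the wrapped FileInputStream
        }
    }
}
```

A call would look like EncodedFileReader.readAll(file, "UTF-8"). An unknown encoding name is reported as an UnsupportedEncodingException, which is a kind of IOException.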


Thanks for your help.

Sylvie.


-
To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
For additional commands, e-mail: users-h...@tomcat.apache.org



RE: Create FileInputStream in servlet from remote file with accentuated character name

2009-09-24 Thread Martin Gainty

Use the same charset that your CIFS drive is configured for.

Martin 




  

Re: Create FileInputStream in servlet from remote file with accentuated character name

2009-09-24 Thread Christopher Schultz

Martin,

On 9/24/2009 8:04 AM, Martin Gainty wrote:
 implement the same charset that your CIFS drive is configured for

No filesystem that I know of has a standard encoding for file contents.

-chris



Re: Create FileInputStream in servlet from remote file with accentuated character name

2009-09-24 Thread André Warnier

Sylvie Perrin wrote:

Christopher, André,

Christopher Schultz wrote:



And (just to anticipate the next issue), Sylvie, does your program
actually need to read the content of the file and do something with that
content ?



Yeah, remember to use a Reader and specify the character encoding.
  
Yes, my program needs to do something with the content of the files in
the shared Windows directory.
Actually, the main action is to parse each file and read its content
through an InputStreamReader(new FileInputStream(file)).


According to what Christopher says, I always need to specify the
character encoding, i.e. use InputStreamReader(new
FileInputStream(file), encoding).



Yes.
If you know that all the files dropped there will be UTF-8 encoded, then 
specify UTF-8 as the encoding.
The problem is that, if you do not control who puts files there or how,
then at some point you may encounter a file whose content is encoded in,
say, iso-8859-1 instead of UTF-8.  In that case, your InputStreamReader
will run into bytes that are not valid UTF-8 and, depending on how its
decoder is configured, will either throw an exception or silently
substitute replacement characters.

You have to be prepared to deal with that.
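
One way to be prepared, sketched under the assumption (made only for illustration) that the two candidate encodings are UTF-8 and ISO-8859-1. An InputStreamReader built from a charset name does not throw on malformed bytes; it substitutes U+FFFD. To get an actual error, the decoder has to be configured with CodingErrorAction.REPORT, as below. The class name StrictDecode and the fallback policy are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

class StrictDecode {
    /** Decodes data as strict UTF-8; falls back to ISO-8859-1 if it is malformed. */
    static String decode(byte[] data) {
        CharsetDecoder utf8 = Charset.forName("UTF-8").newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return utf8.decode(ByteBuffer.wrap(data)).toString();
        } catch (CharacterCodingException e) {
            // Assumption: anything that is not valid UTF-8 is treated as Latin-1.
            // Every byte sequence is valid ISO-8859-1, so this never fails,
            // but it can still be the wrong guess.
            return new String(data, Charset.forName("ISO-8859-1"));
        }
    }
}
```

The fallback is a guess, not a detection: it only works if the set of possible encodings really is that small.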

The general point of all this is: until the whole computing world
agrees to use Unicode/UTF-8 everywhere (in directories, in text files,
in URLs, in program source code, ...), dealing with directory entries
and text files whose encoding is not known in advance is messy. Without
additional constraints on the clients, or additional information
provided separately, there is no 100% sure way to determine what you are
going to get.


If, as you indicate above, you are being asked to parse these files,
then I suppose they must have some pre-defined format.  Does that format
also impose a given character set and encoding? If not, I strongly
suggest that you try to add this to the requirements, because otherwise
the application will be unreliable.  Not because your programs would be
bad, but because it is simply impossible to be 100% reliable in such
cases.







Re: Create FileInputStream in servlet from remote file with accentuated character name

2009-09-23 Thread André Warnier

Christopher Schultz wrote:
...


I dunno. This is pretty ugly. Again, setting everything to UTF-8
dramatically reduces headaches in these areas.


Thanks, Christopher.
I fully agree.




Re: Create FileInputStream in servlet from remote file with accentuated character name

2009-09-22 Thread André Warnier

Christopher Schultz wrote:
...


What is the source of that file name? Is it hard-coded into your Java
code? If so, how? Did you just type fichié.txt into your .java file,
or did you use \uxyz syntax to specify the UNICODE character you intended?

If you are reading the filename from a remote client, then all the
request URI encodings and all that stuff are definitely relevant (in
spite of my previous statements to the contrary).


...
Honestly, I think the above should not be a problem. 

...
Christopher,

what I am trying to say is that such matters are horrible, because 
*everything* matters.


One cannot even be sure that the logfile message, as seen by the user 
and as pasted in the email to the list, and as further seen by the 
reader on this list, is really how the message is physically stored in 
the logfile.  That's because in-between, there can be umpteen layers of 
decoding/encoding which can make matters really confusing.
(Even the encoding used by the process which writes the logfile may 
matter, because fichié.txt may already have been re-encoded right there.)


Your note about making sure, in the source code of the program, that the 
filename is really made out of the bytes which the OP thinks it is made 
of, is a good example. If, to create this program source, one uses an 
editor which is set to save its files in the iso-latin-1 charset, then 
fichié.txt will be saved, in the program source, as a string of 10 
bytes.  Conversely, if one uses an editor set to save its files in 
Unicode/UTF-8, then this same string will be saved as 11 bytes (the é 
occupying 2 bytes).
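
The 10-versus-11 byte count can be checked with a few lines of Java. This is only an illustration; the class name NameBytes is made up, and it assumes the é is the single precomposed character U+00E9.

```java
import java.nio.charset.Charset;

class NameBytes {
    /** Returns how many bytes the string occupies in the named charset. */
    static int length(String s, String charsetName) {
        return s.getBytes(Charset.forName(charsetName)).length;
    }

    public static void main(String[] args) {
        // The e-acute is written as an escape so that this source file
        // stays pure ASCII whatever the editor encoding is.
        String name = "fichi\u00e9.txt";
        System.out.println(length(name, "ISO-8859-1")); // prints 10
        System.out.println(length(name, "UTF-8"));      // prints 11
    }
}
```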

Then comes the compiler..
I don't know how a Java compiler handles source code respectively saved 
as an iso-8859-1 encoded file, or as a UTF-8 encoded file. How does it 
tell the difference ? does it make assumptions based on the locale it is 
running under ?


About the creation and subsequent finding of a file :
Generally-speaking, filesystems are encoding agnostic, in the precise 
sense that :
- if on a given platform and with a given programming language, you 
arrange for a string variable S to contain a precise series of bytes 
(for example, the UTF-8 encoding of the string fichié.txt, 11 bytes long)
- if you then use that variable as the name of a file which you create 
on disk
- then no matter where this file directory ultimately resides, the name 
of the file in it will generally be these same exact 11 bytes.
- if you then, from the same platform and using the same programming
language, use this same variable S as the name of a file which you try
to open, it will work.


However, as soon as you deviate from the strict case above, what looks 
to you like fichié.txt /may/ not be the same series of bytes anymore, 
and that's where the problems start.


What the filename will look like is however another matter, depending
on what you use to display it and from where you do it.


In the case of Sylvie (and I am talking here about the final issue she 
is trying to handle, not just about the test case)


- presumably, some (other) users and/or applications, running on some 
(other) platform and using some (other) tools, are creating files inside 
of a Windows host's directory.
One item of interest here would be to know how these files are created, 
and if that process is consistent (meaning, are these files always 
created by the same programs, running always on the same platform, using 
the same encoding etc..).  That is to make sure that when a file named 
fichié.txt is created there by whatever, it will always be created the 
same way, with a name of either 10 or 11 bytes (it does not matter 
which, just that it be consistent).


- then, some program created by Sylvie, has to access that directory, 
and pick up files from there.  So this program may have to know how a 
filename fichié.txt will be encoded in that directory (either as 10 or 
11 bytes). It also does not matter which, as long as Sylvie's program 
has a way to consistently spell this name correctly.


The problem is generally unsolvable, if the original entry in the 
directory can be created in several ways, because there are multiple 
agents capable of creating it, and these agents use inconsistent encodings.


The issue can be simpler, if Sylvie's program just opens the directory, 
reads the filenames that it finds there (whatever their encoding is), 
into some variable, and then just uses this variable as the filename to 
open the file and that's it.
But if, in Sylvie's program, the filename itself has to be compared to 
some pre-defined other string stored in the program, and some action 
taken or not whether it is considered equal or not, then there may be a 
problem.


Yet another aspect to consider, is whether Sylvie is really testing the 
right thing.
For instance, when Sylvie runs her Java test program, she does this from 
inside a Linux session, which is set for a specific locale.
However, the Tomcat server may well be started under a different 

Re: Create FileInputStream in servlet from remote file with accentuated character name

2009-09-22 Thread Sylvie Perrin

André,

I followed your tutorial, and the file names are displayed consistently
in Windows Explorer, the DOS command window and the Linux terminal.


For locale set under Linux, here is the output:

LANG=fr_FR.UTF-8
LC_CTYPE=fr_FR.UTF-8
LC_NUMERIC=fr_FR.UTF-8
LC_TIME=fr_FR.UTF-8
LC_COLLATE=fr_FR.UTF-8
LC_MONETARY=fr_FR.UTF-8
LC_MESSAGES=fr_FR.UTF-8
LC_PAPER=fr_FR.UTF-8
LC_NAME=fr_FR.UTF-8
LC_ADDRESS=fr_FR.UTF-8
LC_TELEPHONE=fr_FR.UTF-8
LC_MEASUREMENT=fr_FR.UTF-8
LC_IDENTIFICATION=fr_FR.UTF-8
LC_ALL=

As a reminder, I have these lines in my Tomcat auto-start script:
LC_ALL=fr_FR
export LC_ALL

André Warnier wrote:


The problem is generally unsolvable, if the original entry in the 
directory can be created in several ways, because there are multiple 
agents capable of creating it, and these agents use inconsistent 
encodings.

That's my case.
Actually, entries in the Windows share can come from anywhere, with, I
suppose, various encodings. In fact, the files I need to process are
stored on external media (CD, USB...), and under Windows I share the
corresponding drive. This shared drive then becomes the directory I
mount on my Linux system.
Note that it is a key requirement that the external media be loaded
under the Windows system ONLY.

The issue can be simpler, if Sylvie's program just opens the 
directory, reads the filenames that it finds there (whatever their 
encoding is), into some variable, and then just uses this variable as 
the filename to open the file and that's it.

I don't understand your point.
I am just trying to open my file and read it with a FileInputStream.


Sylvie





Re: Create FileInputStream in servlet from remote file with accentuated character name

2009-09-22 Thread André Warnier

Sylvie Perrin wrote:

André,

I followed your tutorial, and the file names are displayed consistently
in Windows Explorer, the DOS command window and the Linux terminal.

That's good.



For locale set under Linux, here is the output:

LANG=fr_FR.UTF-8
LC_CTYPE=fr_FR.UTF-8
LC_NUMERIC=fr_FR.UTF-8
LC_TIME=fr_FR.UTF-8
LC_COLLATE=fr_FR.UTF-8
LC_MONETARY=fr_FR.UTF-8
LC_MESSAGES=fr_FR.UTF-8
LC_PAPER=fr_FR.UTF-8
LC_NAME=fr_FR.UTF-8
LC_ADDRESS=fr_FR.UTF-8
LC_TELEPHONE=fr_FR.UTF-8
LC_MEASUREMENT=fr_FR.UTF-8
LC_IDENTIFICATION=fr_FR.UTF-8
LC_ALL=


That's good too.



As a reminder, I have these lines in my Tomcat auto-start script:
LC_ALL=fr_FR
export LC_ALL

This you should probably change, so that it matches your own locale
fr_FR.UTF-8 above.




André Warnier wrote:


The problem is generally unsolvable, if the original entry in the 
directory can be created in several ways, because there are multiple 
agents capable of creating it, and these agents use inconsistent 
encodings.

That's my case.
Actually, entries in the Windows share can come from anywhere, with, I
suppose, various encodings. In fact, the files I need to process are
stored on external media (CD, USB...), and under Windows I share the
corresponding drive. This shared drive then becomes the directory I
mount on my Linux system.
Note that it is a key requirement that the external media be loaded
under the Windows system ONLY.

The issue can be simpler, if Sylvie's program just opens the 
directory, reads the filenames that it finds there (whatever their 
encoding is), into some variable, and then just uses this variable as 
the filename to open the file and that's it.

I don't understand your point ?
I just try to open my file and read it with a FileInputStream.

All right.  Let me see if I understand your basic issue correctly (not
the test program, but the real application you need to create).


- miscellaneous agents create files, on some media, which is later 
connected to a Windows system and becomes a shared directory.
You do not control these agents, nor the file names that they choose to 
put there.


- your application, running (later) under Tomcat, is supposed to read 
these files and do something with them.


I suppose that you do not know in advance, what the names of these files 
will be, and you just have to take what is there. Is that correct ?







Re: Create FileInputStream in servlet from remote file with accentuated character name

2009-09-22 Thread Christopher Schultz

André,

On 9/22/2009 4:00 AM, André Warnier wrote:
 what I am trying to say is that such matters are horrible, because
 *everything* matters.

Eh.. well, yeah. :)

 Your note about making sure, in the source code of the program, that the
 filename is really made out of the bytes which the OP thinks it is made
 of, is a good example. If, to create this program source, one uses an
 editor which is set to save its files in the iso-latin-1 charset, then
 fichié.txt will be saved, in the program source, as a string of 10
 bytes.  Conversely, if one uses an editor set to save its files in
 Unicode/UTF-8, then this same string will be saved as 11 bytes (the é
 occupying 2 bytes).
 Then comes the compiler..
 I don't know how a Java compiler handles source code respectively saved
 as an iso-8859-1 encoded file, or as a UTF-8 encoded file. How does it
 tell the difference ? does it make assumptions based on the locale it is
 running under ?

javac is documented to use the platform default encoding (for /Java/),
which may not be the default encoding of your editor. :(

http://java.sun.com/javase/6/docs/technotes/tools/windows/javac.html

Without any interference from me, my compiler chooses ANSI_X3.4-1968,
which is plain 7-bit US-ASCII (not even Latin-1), so any funny business
in there like Thử nghiệm Tiếng Việt isn't going to fly. It's always best
in Java source files to stick to ASCII and use the \uXXXX escape for any
special Unicode characters.
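
To illustrate why the escape is safer, the sketch below simulates what happens when source bytes saved in one charset are read back in another, which is roughly the situation of an editor and a compiler disagreeing. It is a simulation with ordinary String conversions, not a call into javac, and the class name SourceEncodingDemo is made up. The alternative to escapes is to tell the compiler the real encoding with the javac -encoding option.

```java
import java.nio.charset.Charset;

class SourceEncodingDemo {
    /** Encodes the literal as an editor would save it, then decodes it as a compiler would read it. */
    static String misread(String literal, String savedAs, String readAs) {
        byte[] onDisk = literal.getBytes(Charset.forName(savedAs));
        return new String(onDisk, Charset.forName(readAs));
    }

    public static void main(String[] args) {
        // Written with an escape, so this file itself is pure ASCII.
        String intended = "fichi\u00e9.txt";
        String seen = misread(intended, "UTF-8", "ISO-8859-1");
        System.out.println(intended.length()); // prints 10
        System.out.println(seen.length());     // prints 11: the accent became two characters
    }
}
```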

The OP won't cough-up the source code, though, so we don't even know if
this is a source code problem or an HTTP-request-parameter
interpretation problem.

 One item of interest here would be to know how these files are created,
 and if that process is consistent (meaning, are these files always
 created by the same programs, running always on the same platform, using
 the same encoding etc..).  That is to make sure that when a file named
 fichié.txt is created there by whatever, it will always be created the
 same way, with a name of either 10 or 11 bytes (it does not matter
 which, just that it be consistent).

+1

 The problem is generally unsolvable, if the original entry in the
 directory can be created in several ways, because there are multiple
 agents capable of creating it, and these agents use inconsistent encodings.

Yup. Unless you read the directory entry from the filesystem and guess
at the right file (ha!), you might not get the one you want.

 However, the Tomcat server may well be started under a different locale
 setting, and this may have an impact as to how each one of them looks at
 the filename fichié.txt.

Unfortunately, the Java API says nothing about the encoding used to read
and write filenames. :(

 Then of course, after the above trivial matter of the filename is
 resolved, one may have to tackle the matter of how the file contents are
 encoded.

At least the programmer has some measure of control over that.

-chris



Re: Create FileInputStream in servlet from remote file with accentuated character name

2009-09-22 Thread Sylvie Perrin

André,

Thanks to you, my testcase is now running without any exception.

André Warnier wrote:

Sylvie Perrin wrote:

As a reminder, I have these lines in my Tomcat auto-start script:
LC_ALL=fr_FR
export LC_ALL

This you should probably change, so that it matches your own locale
fr_FR.UTF-8 above.



The cause was the LC_ALL variable in my script starting tomcat.
I set it to fr_FR.UTF-8 as you suggest and now, my test is OK !
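
To confirm which encoding the JVM actually picked up from the start script, something like the following could be logged from inside the web application. This is a sketch with a made-up class name (EncodingReport); file.encoding is the property discussed in this thread, while sun.jnu.encoding is specific to Sun-derived JVMs and may be null elsewhere.

```java
import java.nio.charset.Charset;

class EncodingReport {
    /** Summarises the encodings the running JVM derived from its environment. */
    static String describe() {
        return "file.encoding=" + System.getProperty("file.encoding")
            + " sun.jnu.encoding=" + System.getProperty("sun.jnu.encoding")
            + " defaultCharset=" + Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Running it once from the shell and once from within Tomcat shows whether the two environments really agree.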

All right.  Let me see if I understand your basic issue correctly (not
the test program, but the real application you need to create).


- miscellaneous agents create files, on some media, which is later 
connected to a Windows system and becomes a shared directory.
You do not control these agents, nor the file names that they choose 
to put there.


- your application, running (later) under Tomcat, is supposed to read 
these files and do something with them.


I suppose that you do not know in advance, what the names of these 
files will be, and you just have to take what is there. Is that correct ?

You have understood the requirements of my real application perfectly.
I know that I can expect other wonderful problems :-)

Thank you again for your great help.

Sylvie.





Re: Create FileInputStream in servlet from remote file with accentuated character name

2009-09-22 Thread Christopher Schultz

Sylvie,

On 9/22/2009 11:01 AM, Sylvie Perrin wrote:
 The cause was the LC_ALL variable in my script starting tomcat.
 I set it to fr_FR.UTF-8 as you suggest and now, my test is OK !

I wonder if Java uses the file.encoding system property (which is set by
the portion of $LC_ALL after the .) to convert bytes returned from the
filesystem into filenames and vice versa.

Yeah, that appears to be the case:

import java.io.*;

public class FileEncodingTest
{
    public static void main(String[] args)
        throws Exception
    {
        System.out.println("Using file.encoding=" +
                           System.getProperty("file.encoding"));

        File file = new File("\u03c0"); // That's a lowercase Greek pi
        Writer out = new FileWriter(file);
        out.write("A test file\n");
        out.close();

        file = new File(".");

        // List every entry of the current directory
        File[] files = file.listFiles();

        for(int i=0; i<files.length; ++i)
        {
            file = files[i];

            System.out.print(file.getName());
            System.out.print("\tunicode: ");

            // UnicodeBigUnmarked is UTF-16BE without a byte-order mark (trust me)
            byte[] bytes =
                file.getName().getBytes("UnicodeBigUnmarked");

            for(int j=0; j<bytes.length; ++j)
            {
                // Mask to 0..255 so negative bytes do not print as ffffffxx
                String hex = Integer.toHexString(bytes[j] & 0xff);
                if(1 == hex.length())
                    System.out.print("0");
                System.out.print(hex);
                System.out.print(" ");
            }

            System.out.println();
        }
    }
}

Output on my system:

$ java FileEncodingTest
Using file.encoding=ANSI_X3.4-1968
FileEncodingTest.class  unicode: 00 46 00 69 00 6c 00 65 00 45 00 6e 00
63 00 6f 00 64 00 69 00 6e 00 67 00 54 00 65 00 73 00 74 00 2e 00 63 00
6c 00 61 00 73 00 73
FileEncodingTest.java   unicode: 00 46 00 69 00 6c 00 65 00 45 00 6e 00
63 00 6f 00 64 00 69 00 6e 00 67 00 54 00 65 00 73 00 74 00 2e 00 6a 00
61 00 76 00 61
?   unicode: 00 3f

$ LC_ALL=en_US.UTF-8 java FileEncodingTest
Using file.encoding=UTF-8
FileEncodingTest.class  unicode: 00 46 00 69 00 6c 00 65 00 45 00 6e 00
63 00 6f 00 64 00 69 00 6e 00 67 00 54 00 65 00 73 00 74 00 2e 00 63 00
6c 00 61 00 73 00 73
FileEncodingTest.java   unicode: 00 46 00 69 00 6c 00 65 00 45 00 6e 00
63 00 6f 00 64 00 69 00 6e 00 67 00 54 00 65 00 73 00 74 00 2e 00 6a 00
61 00 76 00 61
?   unicode: 00 3f
?   unicode: 03 c0  (/this correctly emitted the glyph for pi/)

Then, for good measure:

$ java FileEncodingTest
Using file.encoding=ANSI_X3.4-1968
FileEncodingTest.class  unicode: 00 46 00 69 00 6c 00 65 00 45 00 6e 00
63 00 6f 00 64 00 69 00 6e 00 67 00 54 00 65 00 73 00 74 00 2e 00 63 00
6c 00 61 00 73 00 73
FileEncodingTest.java   unicode: 00 46 00 69 00 6c 00 65 00 45 00 6e 00
63 00 6f 00 64 00 69 00 6e 00 67 00 54 00 65 00 73 00 74 00 2e 00 6a 00
61 00 76 00 61
?   unicode: 00 3f
??  unicode: ff fd ff fd (/this did not/)

So, when running in ANSI_X3.4-1968-mode, Java takes the codepoint for pi
(0x03c0) and destroys it (note the two-character filename where the
first byte is NUL). I'm not really even sure how it does that... I'd
have expected some broken sign-extension or something but I have no idea
how 0x03c0 becomes 0x003f.

When running in UTF-8 mode, the correct code point is used for the
filename and read-back correctly using listFiles.

When running again in ANSI mode, the original (incorrect) filename is
(predictably) read back in the same way as the original, but the
filename with the correct code point is again garbled (0x03c0 ->
0xfffdfffd).
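
A plausible explanation for the mangling, offered as an assumption rather than something confirmed in this thread: the JVM converts filenames with an ordinary charset encoder and decoder. ANSI_X3.4-1968 is US-ASCII, and encoding U+03C0 to US-ASCII yields the substitution byte 0x3F, the question mark, so the 00 3f in the dump is a single UTF-16 code unit rather than two characters. Decoding the two UTF-8 bytes of pi (CF 80) as US-ASCII yields two U+FFFD replacement characters. The sketch reproduces both effects with plain String conversions; the names are illustrative and nothing here touches the filesystem.

```java
import java.nio.charset.Charset;

class AsciiMangling {
    static final Charset ASCII = Charset.forName("US-ASCII");
    static final Charset UTF8 = Charset.forName("UTF-8");

    /** What a name becomes on its way to the filesystem in the given charset. */
    static byte[] nameToBytes(String name, Charset cs) {
        return name.getBytes(cs); // unmappable characters become the charset's replacement byte
    }

    /** What raw directory-entry bytes become when read back in the given charset. */
    static String bytesToName(byte[] raw, Charset cs) {
        return new String(raw, cs); // malformed bytes become U+FFFD
    }

    public static void main(String[] args) {
        byte[] written = nameToBytes("\u03c0", ASCII);
        System.out.println(Integer.toHexString(written[0] & 0xff)); // prints 3f

        String readBack = bytesToName(nameToBytes("\u03c0", UTF8), ASCII);
        System.out.println(readBack.length()); // prints 2
    }
}
```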

Somebody needs to write a virus that just converts everything to UTF-8
so we can be done with it.

-chris



Re: Create FileInputStream in servlet from remote file with accentuated character name

2009-09-22 Thread Len Popp


On 2009-09-22, at 11:33, Christopher Schultz ch...@christopherschultz.net 
 wrote:

Somebody needs to write a virus that just converts everything to UTF-8
so we can be done with it.


I hear you can contract out that sort of work these days. :-)
--
Len





Re: Create FileInputStream in servlet from remote file with accentuated character name

2009-09-22 Thread André Warnier

Christopher Schultz wrote:
...



Then of course, after the above trivial matter of the filename is
resolved, one may have to tackle the matter of how the file contents are
encoded.


At least the programmer has some measure of control over that.


Not if she doesn't know what they have been created with though.
But let's leave that for a later stage, and first deal with the filenames.




Re: Create FileInputStream in servlet from remote file with accentuated character name

2009-09-22 Thread André Warnier

Sylvie Perrin wrote:
...
- your application, running (later) under Tomcat, is supposed to read 
these files and do something with them.


I suppose that you do not know in advance, what the names of these 
files will be, and you just have to take what is there. Is that correct ?

You have understood the requirements of my real application perfectly.
I know that I can expect other wonderful problems :-)


Ok, then we need Christopher's Java knowledge now.
Christopher, how does one, in Java, read a directory item by item ?
We need this kind of thing :

- open the directory
- while (variable fn = next directory item) {
   - next if item is not a regular file
   - open the file named fn
   - do something to that file
   - close the file
   - delete the file ?
   }
- close the directory
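
The first steps of the outline can be sketched in Java as follows. The class name DirectoryWalker is made up, and processing or deleting each file is left to the caller. The point of the design is that the File objects returned by listFiles() are passed on as they are, so the program never has to spell a filename itself.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

class DirectoryWalker {
    /** Returns the regular files directly inside dir, skipping subdirectories. */
    static List<File> regularFiles(File dir) {
        List<File> result = new ArrayList<File>();
        File[] entries = dir.listFiles(); // null if dir is not a readable directory
        if (entries == null) {
            return result;
        }
        for (File entry : entries) {
            if (entry.isFile()) {
                result.add(entry); // keep the File as returned, do not rebuild its name
            }
        }
        return result;
    }
}
```

Each returned File can be handed straight to new FileInputStream(file). Java has no explicit "close the directory" step here, since listFiles() reads all entries at once.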

And (just to anticipate the next issue), Sylvie, does your program 
actually need to read the content of the file and do something with that 
content ?







Re: Create FileInputStream in servlet from remote file with accentuated character name

2009-09-22 Thread Christopher Schultz

André,

On 9/22/2009 3:24 PM, André Warnier wrote:
 Ok, then we need Christopher's Java knowledge now.

Or you could look at the API ;)

 Christopher, how does one, in Java, read a directory item by item ?

See my other message on this thread which includes source code to do
just that.

 And (just to anticipate the next issue), Sylvie, does your program
 actually need to read the content of the file and do something with that
 content ?

Yeah, remember to use a Reader and specify the character encoding.

-chris



Re: Create FileInputStream in servlet from remote file with accentuated character name

2009-09-22 Thread André Warnier

Christopher Schultz wrote:
...


I wonder if Java uses the file.encoding system property (which is set by
the portion of $LC_ALL after the .) to convert bytes returned from the
filesystem into filenames and vice versa.

Yeah, that appears to be the case:


Christopher,
your detailed analysis is impressive and undoubtedly accurate, but 
beyond what I can swallow right now in Java and after 2 glasses of 
Spanish wine.

So let me ask a simple question :
- a file named fichié.txt has been created in a directory, by a 
process that spoke iso-8859-1 (so the filename is 10 bytes long).
- a Tomcat runs in a process whose locale is set to UTF-8, and an 
application inside this Tomcat reads the filename from the directory 
into a Java String variable S.

What happens ?
- does the application get an exception due to invalid encoding ?
- if not, why not ?
- if not, what is now the content, in bytes, of variable S ?







Re: Create FileInputStream in servlet from remote file with accentuated character name

2009-09-22 Thread Christopher Schultz

André,

On 9/22/2009 3:58 PM, André Warnier wrote:
 your detailed analysis is impressive and undoubtedly accurate, but
 beyond what I can swallow right now in Java and after 2 glasses of
 Spanish wine.

It's probably better than having 2 pints of Belgian beer. Wow.

 So let me ask a simple question :
 - a file named fichié.txt has been created in a directory, by a
 process that spoke iso-8859-1 (so the filename is 10 bytes long).

Ok.

 - a Tomcat runs in a process whose locale is set to UTF-8, and an
 application inside this Tomcat reads the filename from the directory
 into a Java String variable S.
 What happens ?
 - does the application get an exception due to invalid encoding ?

No. The results of my other test suggest that you basically just get
garbage characters in the filename.

 - if not, why not ?

Good question. Maybe the JVM authors decided that garbage characters
were better than an inaccessible file (and I tend to agree with that
trade-off).

 - if not, what is now the content, in bytes, of variable S ?

Heh. Beats me. I couldn't understand how the UTF-8 filename had been
mangled when in ANSI mode, so I'm not sure if such mangling is reversible.

I wonder if you could re-encode the filename something like this:

String encoding = System.getProperty("file.encoding");

String filename = file.getName(); // gets you junk
String recoded = new String(filename.getBytes(encoding), "UTF-8");

Of course, this only works if:

1. the file was originally written in UTF-8 mode
2. The ANSI mangling that has occurred is reversible using
   the above method (duh)
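
Condition 2 can be explored without touching a filesystem. The sketch below (class name NameRecoder is made up) simulates the mangling with String conversions, under the assumption that the JVM decodes filename bytes the same way new String(bytes, charset) does. It suggests the trick is reversible when the wrong charset was ISO-8859-1, where every byte maps to a character, and not when it was US-ASCII, where the original bytes are already lost.

```java
import java.nio.charset.Charset;

class NameRecoder {
    /** Re-encodes a name that was decoded with the wrong charset, then decodes it with the right one. */
    static String recode(String garbled, String decodedAs, String actual) {
        byte[] raw = garbled.getBytes(Charset.forName(decodedAs));
        return new String(raw, Charset.forName(actual));
    }

    public static void main(String[] args) {
        String original = "fichi\u00e9.txt";
        byte[] onDisk = original.getBytes(Charset.forName("UTF-8"));

        String viaLatin1 = new String(onDisk, Charset.forName("ISO-8859-1"));
        System.out.println(recode(viaLatin1, "ISO-8859-1", "UTF-8").equals(original)); // prints true

        String viaAscii = new String(onDisk, Charset.forName("US-ASCII"));
        System.out.println(recode(viaAscii, "US-ASCII", "UTF-8").equals(original));    // prints false
    }
}
```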

If you have some suspicion as to the encoding used to encode the
filename in the first place, you could re-code the filename several
times and attempt a match (using String.equals).

Better yet, you could re-code the filename you /think/ you have into a
String and then use that to check against the filesystem.

I dunno. This is pretty ugly. Again, setting everything to UTF-8
dramatically reduces headaches in these areas.

-chris



Re: Create FileInputStream in servlet from remote file with accentuated character name

2009-09-21 Thread Sylvie Perrin




Christopher,

Here is the stack trace of the FileNotFoundException:

java.io.FileNotFoundException: /home/me/mountDir/fichi��.txt (No such file or directory)
    at java.io.FileInputStream.open(Native Method)
    at java.io.FileInputStream.<init>(FileInputStream.java:106)
    at SambaMountServletTest.doGet(SambaMountServletTest.java:102)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:617)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:717)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:290)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at filters.SetCharacterEncodingFilter.doFilter(SetCharacterEncodingFilter.java:127)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
    at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:433)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286)
    at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:845)
    at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
    at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
    at java.lang.Thread.run(Thread.java:619)


Note that last week I also tried setting the fileEncoding init parameter
to UTF-8 (on the default servlet in my conf/web.xml), without positive
results.

Thanks,

Sylvie.

Christopher Schultz wrote:

 Sylvie,

 On 9/18/2009 8:35 AM, Sylvie Perrin wrote:
  So, I change this property in the servlet test, by adding
  JAVA_OPTS=-Dsun.jnu.encoding=UTF-8

 [snip]

  But my issue is still here, ie. the FileNotFoundException.

 I wonder if it has nothing to do with the encoding. Can you post the
 entire stack trace of the exception, including any "caused by" clauses?

 There are other reasons to get FileNotFoundException...

 -chris


-- 





Sylvie Perrin
T :  +33.4.77.23.78.12 
  





Re: Create FileInputStream in servlet from remote file with accentuated character name

2009-09-21 Thread André Warnier

Sylvie Perrin wrote:

Christopher,

Here is the stack trace of the FileNotFoundException:

java.io.FileNotFoundException: /home/me/mountDir/fichi��.txt (No such file or 
directory)


Sylvie,

maybe what appears above shows the origin of the problem, and explains
what I was trying previously to tell you.
It is difficult to be sure, because (again) there are several layers of
encoding/decoding between your logfile, and how it may show up in this 
email.


The problem is not your problem per se.  You are not necessarily doing 
anything wrong. The problem is basically in the lack of a common 
standard between different OS'es and filesystem types, about how to 
represent filenames containing non-US-ASCII characters.


Below, I am trying to explain the root of the problem, concisely but 
fully.  It *is* a complex matter, that's why it is confusing.  But you 
are not alone in being confused or puzzled.  Unless one has had to deal 
with such issues many times, it is really easy to get confused, because 
in this case, what one sees is not necessarily what one gets.


Assuming that what I see above is also what you see in the logfile 
(fichi + 2 strange characters + .txt) :


- java is trying to open a file named fichi + 2 strange characters +
.txt
- these two characters *may* be the Unicode/UTF-8 encoding of the
character é (e with acute accent)
- but java is not finding that file (obviously)

Furthermore :
The file is really located on a Windows server.
The Windows directory where the file is located, is mounted through 
the CIFS filesystem, onto a local mountpoint on your (Linux) Java and 
Tomcat host.

On your Java/Tomcat host, Java is seeing the contents of this directory
*through* this CIFS filesystem mount.
In principle (but that is only an assumption here), the CIFS filesystem 
code (running on the localhost) shows this (remote) directory content to 
a local application as is, without making any character set translation.


Now Java (on your local system) is trying to find this file 
fichiXX.txt, and not finding it. (XX being the two unknown bytes)
That means that, on the remote system, this file fichiXX.txt does not 
exist.


If you connect to that remote system via, for instance, a Remote Desktop 
or a VNC console (or even from your local station, just browse this 
share through the Windows Explorer), and examine the content of that 
directory, you probably see a file named fichié.txt.


But that is only what you *see*, through whatever interface you use.
In reality, the é in this filename may (or may not) be encoded, in the 
Windows directory entry, as 2 bytes. Or it may be encoded with (for 
instance) a Windows 8-bit codepage, as a single byte.
If so, that is why Java, which is trying to find this é as 2 bytes, 
does not find it.


Now comes the difficult part :

To solve your problem thus, you have to make sure that when Java is 
looking for a filename which, from the Java point of view, contains an 
é character, this Java é *character* (whatever its representation is 
as bytes in Java), matches the byte representation of the é character, 
in the filesystem of the remote host where the file actually resides.


And the problem is, that these two systems (Java and your current 
platform) and the remote OS, do not necessarily agree on what this byte 
representation of an é character is.


For example, suppose you find the right set of measures that make your 
Java program find the file in the end.
Then, you replace the Windows fileserver by a Linux server, sharing its 
files through Samba.
Well, the problem may then show up again, because the encoding may be 
different again.
That is why I was recommending to stick to US-ASCII names.  It was not a 
joke.
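
To make the "same character, different bytes" point concrete, here is a small sketch. The class name AccentBytes and the helper hex are hypothetical, and the three charsets are chosen only as illustrations; which encoding a given server, mount or share actually uses is exactly the open question in this thread.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class AccentBytes {

    /** Returns the bytes of s in the given charset as space-separated hex. */
    public static String hex(String s, Charset cs) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(cs)) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String eAcute = "\u00e9"; // the character é, written as an escape

        System.out.println(hex(eAcute, StandardCharsets.ISO_8859_1)); // E9
        System.out.println(hex(eAcute, StandardCharsets.UTF_8));      // C3 A9
        System.out.println(hex(eAcute, StandardCharsets.UTF_16LE));   // E9 00
    }
}
```

A lookup for the two-byte form will not match a directory entry stored in the one-byte form, which is one plausible way to end up with "No such file or directory" for a file that visibly exists.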









Re: Create FileInputStream in servlet from remote file with accentuated character name

2009-09-21 Thread Sylvie Perrin

André,

Thank you for your help but I can't follow your main recommendation, i.e. 
avoiding non-US-ASCII names.
Actually, the file names are part of the information my servlet has to 
process, and they cannot be changed.

I am not the owner of these names and I must deal with them.

Sylvie.

André Warnier wrote:
That is why I was recommending to stick to US-ASCII names.  It was not 
a joke.








Re: Create FileInputStream in servlet from remote file with accentuated character name

2009-09-21 Thread ramzi khlil
Sylvie,

I suggest you create a mapping until you find a solution.

In your application, keep the original file name as a friendly name and save
the file under a name without accents.
So, when a user lists the files, you show them the friendly name, but when you
load a file, you use the mapping entry to get the stored file name.
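
A minimal sketch of such a mapping, assuming the application is free to choose the stored name (which may not be the case here). FriendlyNameMap, toAscii, register and storedNameFor are hypothetical names introduced for illustration; accents are stripped with java.text.Normalizer and any remaining non-ASCII character is replaced by an underscore.

```java
import java.text.Normalizer;
import java.util.LinkedHashMap;
import java.util.Map;

public class FriendlyNameMap {

    // friendly (original) name -> ASCII-only name used on disk
    private final Map<String, String> storedByFriendly = new LinkedHashMap<>();

    /** Strips combining accents, then replaces any other non-ASCII char. */
    public static String toAscii(String name) {
        String decomposed = Normalizer.normalize(name, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "")
                         .replaceAll("[^\\x20-\\x7E]", "_");
    }

    /** Records the mapping and returns the name to use on disk. */
    public String register(String friendlyName) {
        String stored = toAscii(friendlyName);
        storedByFriendly.put(friendlyName, stored);
        return stored;
    }

    /** Looks up the on-disk name for a friendly name, or null if unknown. */
    public String storedNameFor(String friendlyName) {
        return storedByFriendly.get(friendlyName);
    }

    public static void main(String[] args) {
        FriendlyNameMap names = new FriendlyNameMap();
        names.register("fichi\u00e9.txt");
        System.out.println(names.storedNameFor("fichi\u00e9.txt")); // fichie.txt
    }
}
```

Note that two different friendly names can collapse to the same stored name (fichié.txt and fichie.txt, for instance), so a real mapping would need some way to keep stored names unique.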

Ramzi




On Mon, Sep 21, 2009 at 10:45 AM, Sylvie Perrin
<sylvie.per...@continew.fr> wrote:

 André,

 Thank you for your help but I can't follow your main recommendation, ie.
 avoid using non US-ASCII names.
 Actually, file names are part of information my servlet have to process and
 they cannot be changed.
 I am not the owner of these names and I must deal with them.

 Sylvie.

 André Warnier wrote:

 That is why I was recommending to stick to US-ASCII names.  It was not a
 joke.







Re: Create FileInputStream in servlet from remote file with accentuated character name

2009-09-21 Thread André Warnier

Sylvie Perrin wrote:

André,

Thank you for your help but I can't follow your main recommendation, ie. 
avoid using non US-ASCII names.
Actually, file names are part of information my servlet have to process 
and they cannot be changed.

I am not the owner of these names and I must deal with them.

Ok, then : who is creating those files inside the Windows directory, and 
how do they create them ? (using which tool ?).
This is important, to figure out if the process(es) creating these files 
are consistent, and if you can always expect a specific character 
encoding scheme for those file names.


Let me give you an example, as a tutorial :
- with Windows Explorer :
  - inside your shared directory, create a test sub-directory
  - in this directory, use the right mouse click to create a new text 
document. Name it, for example, fichié.txt. Notice that you did this, 
from your workstation, using your keyboard, and under Windows Explorer. 
 The file in the directory looks like it has the name fichié.txt, right ?
- now on that same server, open a Windows Command Window (the black 
"DOS" window).  In that command window, use "cd" to navigate to your 
test directory. When you are there, enter "dir" and look at the file 
list.  What does your file name look like ?

- in the same command window, create a new file by using this command :
echo Hello André > fichié-deux.txt
- do a "dir". What does that one look like ?
- then go back to Explorer and compare the two filenames. Do they look 
the same ? (as far as the "é"s are concerned)
- now go back to your Tomcat host, and using "cd", navigate to your 
Windows test directory (should be /mnt/).

- use the same command
echo Hello André > fichié-trois.txt
to create a file (from Linux) on the Windows server.  Do an "ls -l" to 
see what it looks like from Linux.


Then again, compare the names in (1) the Windows Explorer, (2) the DOS 
command window and (3) your Linux window.  Is everything still consistent ?
If not (you see different names depending on the interface), make a 
table showing what the filenames look like in the 3 cases.

Also, under Linux, enter the command locale and note the result.

The above is the first step, and concerns only the filenames.  Next, you 
should have a look at file contents, and check if accented text words in 
the contents also look consistent or not.










Re: Create FileInputStream in servlet from remote file with accentuated character name

2009-09-21 Thread Christopher Schultz

André,

On 9/21/2009 5:45 AM, André Warnier wrote:
 Sylvie Perrin wrote:
 Christopher,

 Here is the stack trace of the FileNotFoundException:

 java.io.FileNotFoundException: /home/me/mountDir/fichi��.txt (No such
 file or directory)

[snip]

 Assuming that what I see above is also what you see in the logfile
 (fichi + 2 strange characters + .txt) :

+1

What is the source of that file name? Is it hard-coded into your Java
code? If so, how? Did you just type fichié.txt into your .java file,
or did you use \uxyz syntax to specify the UNICODE character you intended?
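
For reference, a Java Unicode escape is a backslash, the letter u and four hexadecimal digits, and é is U+00E9. A small sketch (the class name UnicodeEscape is hypothetical) of a hard-coded name that does not depend on the encoding of the .java file:

```java
public class UnicodeEscape {

    // The accented letter is written as an escape, so the source file
    // itself contains only US-ASCII bytes.
    static final String NAME = "fichi\u00e9.txt";

    public static void main(String[] args) {
        System.out.println(NAME.length());          // 10
        System.out.println((int) NAME.charAt(5));   // 233, i.e. 0xE9
    }
}
```

If the literal were typed directly as é and the compiler read the file with a different encoding than the editor wrote it in, the String could end up holding other characters; passing an explicit -encoding option to javac is the usual way to avoid that.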

If you are reading the filename from a remote client, then all the
request URI encodings and all that stuff are definitely relevant (in
spite of my previous statements to the contrary).

Can you post your servlet code?

 Furthermore :
 The file is really located on a Windows server.
 The Windows directory where the file is located, is mounted through
 the CIFS filesystem, onto a local mountpoint on your (Linux) Java and
 Tomcat host.
 On your Java/Tomcat host, Java is seeing the contents of this directory
 *through* this CIFS filesystem mount.
 In principle (but that is only an assumption here), the CIFS filesystem
 code (running on the localhost) shows this (remote) directory content to
 a local application as is, without making any character set translation.

Honestly, I think the above should not be a problem. I am the semi-proud
owner of the soundtrack to the film π (that's pi to those whose
email agents don't understand UTF-8 encoding). I used to have all my
albums in MP3 (or Ogg Vorbis) format on an ext3 partition shared using
Samba over a network to a Windows machine. The directory was created by
Windows over the network, and the directory name always showed correctly
in Windows. On my server, however, xterm and/or ls always showed it as
some unrecognizable garbage characters. But Windows consistently showed
it correctly.

I suppose that doesn't really prove anything, but it's probably worth
noting.

Furthermore, the OP is capable of opening said file with a non-servlet
Java example program, so it's not a
Java/java.io.File/CIFS/Samba/whatever problem.

I suspect your servlet is either misinterpreting the file name from a
remote client (most likely) or you have done something like use a
non-standard encoding for your .java files. The answers to the above
questions will definitely help.

- -chris



Re: Create FileInputStream in servlet from remote file with accentuated character name

2009-09-18 Thread Sylvie Perrin

Christopher,

Thank you for your help and see inline the results of the test you suggest.
It shows that sun.jnu.encoding hasn't got the same value in standalone 
and servlet runs.


So I changed this property for the servlet test, by adding
JAVA_OPTS=-Dsun.jnu.encoding=UTF-8
to my Tomcat startup file, and then restarted Tomcat.

I verified that sun.jnu.encoding is now equal to UTF-8 in the printed output.

But my issue is still there, i.e. the FileNotFoundException.

Any other idea ?

Sylvie.


Christopher Schultz wrote:

Sylvie,

On 9/17/2009 9:12 AM, Sylvie Perrin wrote:
  

I have a shared directory on a windows system named SHAREDDIR and
containing one file named fichié.txt
I mount this shared directory on my Linux system with the following
command:


mount -t cifs -o iocharset=utf8 //IpWindows/SHAREDDIR /home/me/mountDir/
  

In a standalone Java application running on my Linux system, I can
create a FileInputStream from the file located in the remote directory
like this:

String mountPath = "/home/me/mountDir";
File[] list = new File(mountPath).listFiles();
File file = list[0];
try {
   FileInputStream fStream = new FileInputStream(file);
}
catch (FileNotFoundException e) {
   e.printStackTrace();
}



Can you have your standalone Java program print the following information:

1. The full path of the file
2. The values for these system properties:
   a. file.encoding
   b. sun.jnu.encoding

  

For Standalone Java program :
File full path = /home/me/mountDir/fichié.txt
file.encoding System property = UTF-8
sun.jnu.encoding System property = UTF-8


When I execute the same code in a servlet running on the same machine,
the call to the FileInputStream constructor always throws a
FileNotFoundException because it doesn't recognize the é character in
the path of the file.



Please post the above values within your servlet environment, too.
  

For Servlet program :
File full path = /home/me/mountDir/fichi��.txt
file.encoding System property = UTF-8
sun.jnu.encoding System property = ANSI_X3.4-1968

Note that my Firefox encoding display is set to UTF-8

Are you sure that it's because of the é, or is it because the user that
Tomcat is running under does not have permission to read that file?
Under what user /is/ Tomcat running?

  
I am sure that it's not user permission issue because when I rename my 
file, I can execute my servlet and create my FileInputStream without any 
exception.

Since I don't know what the problem is I have had a hard time tracking
down a solution online. I took special care to follow all the steps
described in the FAQ/CharacterEncoding part of the wiki. Here is my
configuration:

I set URIEncoding in my port 8080 connector to UTF-8 (I use this port to
execute my servlet)
<Connector port="8080" protocol="HTTP/1.1"
  connectionTimeout="2"
  redirectPort="8443"
  URIEncoding="UTF-8"
  useBodyEncodingForURI="true" />



None of these settings matter. These are only relevant for HTTP
communication, and your code is not reading anything from the request.

  

I use a filter to set the default encoding to UTF-8 and my first line of
my doFilter method is
request.setCharacterEncoding("UTF-8");



Your filter sets /what/ default encoding? What does it set it to?

Setting the encoding of the request will not affect your code above.

  

I add in my servlet the set of content-type for responses to UTF-8 and
my first line of my doGet method is
response.setContentType("text/html;charset=UTF-8");



This will also have no effect.

  
I agree with that, but I was so lost that I applied all Character 
encoding tutorials I have found !

My tomcat is started with CATALINA_OPTS=-Dfile.encoding=UTF-8



Okay. Let's see what your command-line program reports for
file.encoding, etc.

- -chris



Re: Create FileInputStream in servlet from remote file with accentuated character name

2009-09-18 Thread Christopher Schultz

Sylvie,

On 9/18/2009 8:35 AM, Sylvie Perrin wrote:
 So, I change this property in the servlet test, by adding
 JAVA_OPTS=-Dsun.jnu.encoding=UTF-8

[snip]

 But my issue is still here, ie. the FileNotFoundException.

I wonder if it has nothing to do with the encoding. Can you post the
entire stack trace of the exception, including any "caused by" clauses?

There are other reasons to get FileNotFoundException...
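
FileInputStream throws FileNotFoundException when the file does not exist, is a directory, or cannot be opened for reading for some other reason. A rough way to tell these apart before opening is sketched below; OpenDiagnosis and diagnose are hypothetical names, and the checks are only as reliable as what java.io.File reports for the mounted filesystem.

```java
import java.io.File;

public class OpenDiagnosis {

    /** Returns a short label for the most likely reason an open would fail. */
    public static String diagnose(File f) {
        if (!f.exists()) {
            // Either the file is really absent, or the name as Java encodes
            // it does not match any directory entry.
            return "missing";
        }
        if (f.isDirectory()) {
            return "directory";
        }
        if (!f.canRead()) {
            return "unreadable";
        }
        return "ok";
    }

    public static void main(String[] args) {
        for (String path : args) {
            System.out.println(path + " -> " + diagnose(new File(path)));
        }
    }
}
```

Running it with the path from the stack trace, under the same user and environment as Tomcat, should show whether the failure looks like a naming problem ("missing") or a permission problem ("unreadable").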

- -chris



Create FileInputStream in servlet from remote file with accentuated character name

2009-09-17 Thread Sylvie Perrin
I have a problem with Tomcat 6.0 on Linux and I haven't been able to 
determine the cause or solution.


I have a shared directory on a windows system named SHAREDDIR and 
containing one file named fichié.txt

I mount this shared directory on my Linux system with the following command:
 mount -t cifs -o iocharset=utf8 //IpWindows/SHAREDDIR /home/me/mountDir/

In a standalone Java application running on my Linux system, I can 
create a FileInputStream from the file located in the remote directory 
like this:


String mountPath = "/home/me/mountDir";
File[] list = new File(mountPath).listFiles();
File file = list[0];
try {
   FileInputStream fStream = new FileInputStream(file);
}
catch (FileNotFoundException e) {
   e.printStackTrace();
}

When I execute the same code in a servlet running on the same machine, 
the call to the FileInputStream constructor always throws a 
FileNotFoundException because it doesn't recognize the é character in 
the path of the file.
When I rename my file on my Windows shared directory to fichie.txt, 
the servlet is executed without any errors.


Since I don't know what the problem is I have had a hard time tracking 
down a solution online. I took special care to follow all the steps 
described in the FAQ/CharacterEncoding part of the wiki. Here is my 
configuration:


I set URIEncoding in my port 8080 connector to UTF-8 (I use this port to 
execute my servlet)

<Connector port="8080" protocol="HTTP/1.1"
  connectionTimeout="2"
  redirectPort="8443"
  URIEncoding="UTF-8"
  useBodyEncodingForURI="true" />

I use a filter to set the default encoding to UTF-8 and my first line of 
my doFilter method is

request.setCharacterEncoding("UTF-8");

I add in my servlet the set of content-type for responses to UTF-8 and 
my first line of my doGet method is

response.setContentType("text/html;charset=UTF-8");

My tomcat is started with CATALINA_OPTS=-Dfile.encoding=UTF-8

My servlet displays some debug traces and I verify that Servlet response 
getCharacterEncoding = UTF-8 and Servlet request getCharacterEncoding = 
UTF-8


Any idea why FileInputStream doesn't work in a servlet?

Thanks






Re: Create FileInputStream in servlet from remote file with accentuated character name

2009-09-17 Thread Christopher Schultz

Sylvie,

On 9/17/2009 9:12 AM, Sylvie Perrin wrote:
 I have a shared directory on a windows system named SHAREDDIR and
 containing one file named fichié.txt
 I mount this shared directory on my Linux system with the following
 command:
 mount -t cifs -o iocharset=utf8 //IpWindows/SHAREDDIR /home/me/mountDir/
 
 In a standalone Java application running on my Linux system, I can
 create a FileInputStream from the file located in the remote directory
 like this:
 
 String mountPath = "/home/me/mountDir";
 File[] list = new File(mountPath).listFiles();
 File file = list[0];
 try {
FileInputStream fStream = new FileInputStream(file);
 }
 catch (FileNotFoundException e) {
e.printStackTrace();
 }

Can you have your standalone Java program print the following information:

1. The full path of the file
2. The values for these system properties:
   a. file.encoding
   b. sun.jnu.encoding
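
One possible way to print these values, as a sketch: EncodingReport and report are hypothetical names, the mount path is the one quoted above, and sun.jnu.encoding is an internal JVM property that may be absent on some JVMs (in which case null is printed).

```java
import java.io.File;

public class EncodingReport {

    /** Builds the three lines of diagnostic output for one file. */
    public static String report(File file) {
        return "File full path = " + file.getAbsolutePath() + "\n"
             + "file.encoding System property = "
             + System.getProperty("file.encoding") + "\n"
             + "sun.jnu.encoding System property = "
             + System.getProperty("sun.jnu.encoding") + "\n";
    }

    public static void main(String[] args) {
        // Path taken from the original post; adjust as needed.
        File[] list = new File("/home/me/mountDir").listFiles();
        if (list == null || list.length == 0) {
            System.out.println("Directory missing, unreadable or empty");
            return;
        }
        System.out.print(report(list[0]));
    }
}
```

The same report method can be called from the servlet's doGet, so that the standalone and servlet outputs can be compared line by line.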

 When I execute the same code in a servlet running on the same machine,
 the call to the FileInputStream constructor always throws a
 FileNotFoundException because it doesn't recognize the é character in
 the path of the file.

Please post the above values within your servlet environment, too.

Are you sure that it's because of the é, or is it because the user that
Tomcat is running under does not have permission to read that file?
Under what user /is/ Tomcat running?

 Since I don't know what the problem is I have had a hard time tracking
 down a solution online. I took special care to follow all the steps
 described in the FAQ/CharacterEncoding part of the wiki. Here is my
 configuration:
 
 I set URIEncoding in my port 8080 connector to UTF-8 (I use this port to
 execute my servlet)
 <Connector port="8080" protocol="HTTP/1.1"
   connectionTimeout="2"
   redirectPort="8443"
   URIEncoding="UTF-8"
   useBodyEncodingForURI="true" />

None of these settings matter. These are only relevant for HTTP
communication, and your code is not reading anything from the request.

 I use a filter to set the default encoding to UTF-8 and my first line of
 my doFilter method is
 request.setCharacterEncoding("UTF-8");

Your filter sets /what/ default encoding? What does it set it to?

Setting the encoding of the request will not affect your code above.

 I add in my servlet the set of content-type for responses to UTF-8 and
 my first line of my doGet method is
 response.setContentType("text/html;charset=UTF-8");

This will also have no effect.

 My tomcat is started with CATALINA_OPTS=-Dfile.encoding=UTF-8

Okay. Let's see what your command-line program reports for
file.encoding, etc.

- -chris



Re: Create FileInputStream in servlet from remote file with accentuated character name

2009-09-17 Thread André Warnier

Christopher Schultz wrote:


Sylvie,

On 9/17/2009 9:12 AM, Sylvie Perrin wrote:

I have a shared directory on a windows system named SHAREDDIR and
containing one file named fichié.txt


Sylvie,
why do you not name your file fichier.txt, like it should be written 
in French ?  That would solve your problem immediately, save a lot of 
ink on this thread, and save you a lot of time in the end.


Seriously.

There are so many pieces that play their part between on the one side a 
browser that you do not control, on a workstation that you do not 
control, in the middle HTML and HTTP for which the default character set 
is iso-8859-1 and Java for which the internal character set is Unicode, 
a local Linux filesystem which is charset-agnostic, and on the other 
side a Windows system which stores its filenames in directories as 
Unicode, that you will never get a solution that is totally foolproof.
If you have to play with a web application which involves files on 
different platforms, stick with filenames that are purely made of 
US-ASCII characters.


André




Seriously now, let's start at the beginning.
You are, like many of us, the victim of these horrible English-speaking 
imperialists in the computer industry. They just don't understand 
alphabets with more than 27 letters, and get totally confused by our és 
and às and cédilles and sharfe s'eses. But since they got there first 
(mainly because of all the anti-competitive subsidies they gave to 
Boeing and GM), we are the ones who have to adapt.


So, you have a file, which on your Unix/Linux system looks like
/home/me/mountDir/fichié.txt.
Or, does it really ?

Try the following :
- open a console window on your Linux system
- enter the command locale -a, and find 2 result lines like :
fr_FR.iso8859-1
fr_FR.utf8
(or something similar, the point being to have one looking like it 
contains 8859-1 and the other looking like it contains utf8).


- now enter export LC_CTYPE=fr_FR.iso8859-1
(adapt this according to what you found above with locale -a)

- now enter ls -l /home/me/mountDir/
What does the filename look like ?

- now enter export LC_CTYPE=fr_FR.utf8
(adapt this according to what you found above with locale -a)

- now enter ls -l /home/me/mountDir/ again
What does the filename look like now ?

I would bet the file name looks different.

Now go to your Windows system, open the Windows Explorer, and look at 
what this filename looks like.
Then on your Windows system, open a command window, navigate to the same 
directory, do a dir, and look at what the filename looks like.

A difference, also ?

Why is that ?
The filename itself did not change in the directory of your Windows system.

But the name of that file is going to look different, depending on how 
many layers of software there are between that directory entry and the 
process that uses that filename, and on the settings of each of these 
layers.
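
The effect of one such layer can be imitated in a few lines. In this sketch (SameBytesDifferentViews and view are hypothetical names), the bytes of the name never change; only the charset used to turn them into characters does, which is roughly what switching LC_CTYPE does for ls. It assumes, for the sake of the example, that the name is stored as UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class SameBytesDifferentViews {

    /** Decodes the same raw bytes with the charset of a given "layer". */
    public static String view(byte[] raw, Charset layerCharset) {
        return new String(raw, layerCharset);
    }

    public static void main(String[] args) {
        // Assumed on-disk form of the name: UTF-8 bytes.
        byte[] raw = "fichi\u00e9.txt".getBytes(StandardCharsets.UTF_8);

        // A UTF-8 layer shows the accented letter as intended.
        System.out.println(view(raw, StandardCharsets.UTF_8));
        // An ISO-8859-1 layer shows two characters in its place.
        System.out.println(view(raw, StandardCharsets.ISO_8859_1));
        // A US-ASCII layer shows two replacement characters.
        System.out.println(view(raw, StandardCharsets.US_ASCII));
    }
}
```

What the console finally displays also depends on the terminal's own encoding, which is one more layer on top of these.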


The above are simple cases, involving just a few layers : the original 
directory, the CIFS filesystem drivers on your Linux machine, the ls 
program itself, and the display interface between that program and your 
console.
Now you add Java and Tomcat on top of that, and you add HTTP, and you 
add URI encoding/decoding, and you add the browser, and you add the 
encoding of your html pages.


In other words, give it up.



I mount this shared directory on my Linux system with the following
command:

mount -t cifs -o iocharset=utf8 //IpWindows/SHAREDDIR /home/me/mountDir/

In a standalone Java application running on my Linux system, I can
create a FileInputStream from the file located in the remote directory
like this:

String mountPath = "/home/me/mountDir";
File[] list = new File(mountPath).listFiles();
File file = list[0];
try {
   FileInputStream fStream = new FileInputStream(file);
}
catch (FileNotFoundException e) {
   e.printStackTrace();
}


Can you have your standalone Java program print the following information:

1. The full path of the file
2. The values for these system properties:
   a. file.encoding
   b. sun.jnu.encoding


When I execute the same code in a servlet running on the same machine,
the call to the FileInputStream constructor always throws a
FileNotFoundException because it doesn't recognize the é character in
the path of the file.


Please post the above values within your servlet environment, too.

Are you sure that it's because of the é, or is it because the user that
Tomcat is running under does not have permission to read that file?
Under what user /is/ Tomcat running?


Since I don't know what the problem is I have had a hard time tracking
down a solution online. I took special care to follow all the steps
described in the FAQ/CharacterEncoding part of the wiki. Here is my
configuration:

I set URIEncoding in my port 8080 connector to UTF-8 (I use this port to
execute my servlet)
Connector port=8080 protocol=HTTP/1.1
  connectionTimeout=2
  redirectPort=8443
  URIEncoding=UTF-8