Hi,

When Taverna runs under Windows, the system I/O functions Read Text File
and Write Text File do not behave as expected with UTF-8 files.

The reason is that the readers and writers fall back to the platform's
default character encoding when nothing else is specified. Under Windows
this is CP1252, so the UTF-8 characters come out broken.
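
To illustrate the effect, here is a minimal standalone sketch (plain Java, not
Taverna code). The class and method names are made up for this example, and it
assumes the JVM provides the windows-1252 charset, which common JDKs do.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingMismatchDemo {
    // Encode the text as UTF-8, then decode the bytes with another charset,
    // which is what a default-encoding reader effectively does on Windows.
    static String decodeAs(String text, Charset readerCharset) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, readerCharset);
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");
        // "caf\u00e9" is "cafe" with an acute accent on the e.
        // Its two UTF-8 bytes (C3 A9) are decoded as two separate characters.
        System.out.println(decodeAs("caf\u00e9", cp1252));
    }
}
```

The accented character is stored as two bytes in UTF-8, and a CP1252 reader
turns each byte into its own character, so one letter becomes two.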

To avoid this, we modified the two scripts for reading and writing text
files. Instead of (in Read Text File):

reader = new FileReader(fileUrl);
...
reader = new InputStreamReader (url.openStream());

we wrote:

reader = new InputStreamReader(new FileInputStream(fileUrl),"UTF-8");
...
reader = new InputStreamReader (url.openStream(), "UTF-8");

and instead of (in Write Text File):

BufferedWriter out = new BufferedWriter(new FileWriter(outputFile));

we wrote:

BufferedWriter out = new BufferedWriter(
    new OutputStreamWriter(new FileOutputStream(outputFile), "UTF-8"));
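
For reference, the same idea as a self-contained round trip. This is only a
sketch in plain Java 7+ (try-with-resources may not be accepted by Beanshell as
is), and the class and method names are hypothetical, not Taverna's.

```java
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class Utf8FileRoundTrip {
    // Write text with an explicit UTF-8 encoder instead of FileWriter,
    // which would use the platform default encoding.
    static void writeText(File outputFile, String text) throws IOException {
        try (BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
                new FileOutputStream(outputFile), StandardCharsets.UTF_8))) {
            out.write(text);
        }
    }

    // Read the whole file with an explicit UTF-8 decoder instead of FileReader.
    static String readText(File inputFile) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(
                new FileInputStream(inputFile), StandardCharsets.UTF_8)) {
            char[] buffer = new char[4096];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                sb.append(buffer, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        File tmp = File.createTempFile("utf8-demo", ".txt");
        tmp.deleteOnExit();
        String original = "ma\u00f1ana"; // contains one non-ASCII character
        writeText(tmp, original);
        // Check that the text survives the write/read cycle unchanged.
        System.out.println(original.equals(readText(tmp)));
    }
}
```

With both sides pinned to UTF-8, the result no longer depends on the default
encoding of the machine the workflow runs on.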

This can be done explicitly wherever the Beanshell code can be edited, but
not where an input port is used, because there is no option to edit the
underlying code.
Modifying the batch script (executeworkflow.bat) by adding the line
set ARGS=%ARGS% -Dfile.encoding=UTF-8 did not help either.
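
One way to check whether such a setting actually reaches the JVM is to print
the values the JVM itself reports. The sketch below is plain Java with a
made-up class name. Whether the Taverna launcher passes ARGS on to the JVM is
something we have not verified.

```java
import java.nio.charset.Charset;

public class DefaultCharsetCheck {
    // Report the file.encoding system property and the charset that
    // FileReader, FileWriter and similar classes use by default.
    static String describe() {
        return "file.encoding=" + System.getProperty("file.encoding")
                + ", defaultCharset=" + Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Running it once as java DefaultCharsetCheck and once as
java -Dfile.encoding=UTF-8 DefaultCharsetCheck shows whether the option changes
the default. The same two expressions could also be evaluated inside a
Beanshell step to see what the workflow's own JVM is using.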

It would be nice if Taverna were adapted to enforce UTF-8 processing, e.g.
by adopting the character encoding settings shown above, or by any other means.
Maybe we are missing something...

We work on natural language processing projects, and character
encoding is crucial in our domain.
(For example, under Windows we cannot read a file and send its content to a
named entity recognizer or a translation system, and we get unexpected
results when typing inputs.)

Thanks!

-- 
Marta Villegas
[email protected]
_______________________________________________
taverna-users mailing list
[email protected]
Web site: http://www.taverna.org.uk
Mailing lists: http://www.taverna.org.uk/about/contact-us/
