Re: CSV files with UTF8 BOM

Felix Schumacher Thu, 09 Aug 2018 03:02:25 -0700



On 20.07.2018 at 01:47, Andrew Burton wrote:

Hi list,

Is there any appetite for handling UTF-8 with BOM markers automatically
when loading CSV input files? These currently fail silently since the first
character in the file is the BOM marker, which means CSV files with headers
don't create the correct variable name.
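
To make the failure mode concrete, here is a minimal plain-JDK sketch (not JMeter code). It assumes a CSV whose header row is `user,password`; the class and method names are hypothetical. Decoding the bytes as UTF-8 with InputStreamReader leaves the BOM in place as U+FEFF, so the first column name is no longer the string "user".

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
class BomHeaderDemo {
    // Decodes the bytes as UTF-8 and returns the first column name of the first line.
    static String firstHeader(byte[] csvBytes) {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(csvBytes), StandardCharsets.UTF_8))) {
            String line = reader.readLine();
            return line == null ? "" : line.split(",")[0];
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Prepends the three-byte UTF-8 BOM (EF BB BF) to the encoded text.
    static byte[] withBom(String csv) {
        byte[] body = csv.getBytes(StandardCharsets.UTF_8);
        byte[] out = new byte[body.length + 3];
        out[0] = (byte) 0xEF;
        out[1] = (byte) 0xBB;
        out[2] = (byte) 0xBF;
        System.arraycopy(body, 0, out, 3, body.length);
        return out;
    }

    public static void main(String[] args) {
        String header = firstHeader(withBom("user,password\nalice,secret\n"));
        System.out.println(header.equals("user"));   // false
        System.out.println(header.length());         // 5, not 4
        System.out.println((int) header.charAt(0));  // 65279, i.e. U+FEFF
    }
}
```

Because U+FEFF is invisible in most editors, the header looks correct on screen while the name derived from it does not match.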

I *know* that technically, the BOM variant isn't an official UTF variant,
but it is commonplace when exporting from MS SQL Server (which for a lot of
Windows-based users might be their way of generating data).

I know we can convert the encoding from UTF8 BOM to UTF8 using, e.g.
Notepad++ or dos2unix but this adds an extra step to fix a problem that a
lot of users would struggle to identify in the first place ("My data file
is not working, and it looks fine when I open it in Notepad!")

(SQL Server does provide an option to output Unicode but this is UTF16, not
UTF8, which is a whole other story).

I'd propose an additional step of identifying the file's encoding (using
the getEncoding() method in InputStreamReader) and, if it is UTF8, checking
whether it has a BOM marker and, if so, handling it with the BOMInputStream
class in Apache commons-io (ref
https://commons.apache.org/proper/commons-io/javadocs/api-release/org/apache/commons/io/input/BOMInputStream.html
).

One other thing that might be useful is changing the input field of the
CSVDataSet for encoding to be a drop down list with only the charset values
supported by InputStreamReader (ref
https://docs.oracle.com/javase/8/docs/api/?java/io/InputStreamReader.html
and https://docs.oracle.com/javase/8/docs/api/java/nio/charset/Charset.html).
The documentation doesn't list which encodings are valid (I had to dig
through the code to find the relevant handling class) and there's always
the risk of a typo.
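
As a rough sketch of how such a list could be produced and a typed value checked, the JDK's Charset class exposes both operations. The class and method names below are hypothetical, and nothing here is taken from the JMeter code base.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only.
class CharsetChoices {
    // Canonical names of every charset this JVM supports, in the order
    // returned by Charset.availableCharsets() (a sorted map).
    static List<String> dropDownValues() {
        return new ArrayList<>(Charset.availableCharsets().keySet());
    }

    // True if the name (canonical or alias) resolves to a supported charset.
    // Syntactically illegal names such as "UTF 8" are reported as unusable
    // instead of raising an exception.
    static boolean isUsable(String name) {
        if (name == null || name.isEmpty()) {
            return false;
        }
        try {
            return Charset.isSupported(name);
        } catch (IllegalCharsetNameException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isUsable("UTF-8"));  // true
        System.out.println(isUsable("UFT-8"));  // false, a typo
        System.out.println(dropDownValues().size() + " charsets available");
    }
}
```

The set of available charsets depends on the JVM, so a drop-down built this way would differ slightly between installations.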

I'm happy to spend some time on this if it was something that core devs
would find useful.

Looks like a good idea, especially since we already have commons-io on the classpath. If I read it correctly, it could be enough to use BOMInputStream and let it automatically decide the encoding based on the presence of the BOM.
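
The idea of choosing the encoding from the BOM can be sketched with the plain JDK. Note that this is a stand-in technique using PushbackInputStream, not BOMInputStream itself, and it only recognises the UTF-8, UTF-16BE and UTF-16LE marks. The class name, method names and the fallback parameter are assumptions made for illustration.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
class BomAwareReader {
    // Opens a reader that skips a leading BOM and uses the charset it implies;
    // without a recognised BOM the caller's fallback charset is used.
    static BufferedReader open(InputStream in, Charset fallback) {
        try {
            PushbackInputStream pin = new PushbackInputStream(in, 3);
            byte[] head = new byte[3];
            int n = 0;
            while (n < head.length) {            // read up to three bytes
                int r = pin.read(head, n, head.length - n);
                if (r < 0) break;
                n += r;
            }
            Charset charset = fallback;
            int bomLength = 0;
            if (n >= 3 && (head[0] & 0xFF) == 0xEF && (head[1] & 0xFF) == 0xBB
                    && (head[2] & 0xFF) == 0xBF) {
                charset = StandardCharsets.UTF_8;
                bomLength = 3;
            } else if (n >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
                charset = StandardCharsets.UTF_16BE;
                bomLength = 2;
            } else if (n >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
                charset = StandardCharsets.UTF_16LE;
                bomLength = 2;
            }
            if (n > bomLength) {                 // push back everything after the BOM
                pin.unread(head, bomLength, n - bomLength);
            }
            return new BufferedReader(new InputStreamReader(pin, charset));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Convenience wrapper: first line of the data, or "" if the data is empty.
    static String firstLine(byte[] data, Charset fallback) {
        try (BufferedReader reader = open(new ByteArrayInputStream(data), fallback)) {
            String line = reader.readLine();
            return line == null ? "" : line;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With this approach the configured encoding only matters for files without a BOM, which keeps existing test plans working unchanged.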

Just open a Bugzilla entry with an enhancement request and add a patch to it.


Regards,
 Felix

Regards

Andrew



