Re: [CSV] Invalid char between encapsulated token and delimiter

Gary Gregory Fri, 03 Jan 2025 03:23:14 -0800

Chris,

If your input has a BOM in the middle, it likely means it was created by
concatenating more than one source as BOMs usually occur only at the start
of documents.


How the BOM got into the document is for you to track down; it's possible,
for example, that an XSL processor copied it from XML input.

Commons CSV can be made to ignore a BOM at the _start_ of a document using
a BOMInputStream from Commons IO as described in the user guide here:

https://commons.apache.org/proper/commons-csv/user-guide.html

under the heading "Handling Byte Order Marks". This technique does not
apply to input with BOMs randomly occurring in the middle.

final URL url = ...;
try (final Reader reader = new InputStreamReader(
         new BOMInputStream(url.openStream()), StandardCharsets.UTF_8);
     final CSVParser parser = CSVFormat.EXCEL.builder()
         .setHeader()
         .build()
         .parse(reader)) {
    for (final CSVRecord record : parser) {
        final String string = record.get("SomeColumn");
        ...
    }
}
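
For input where BOMs do turn up in the middle, one option (my assumption,
not something Commons CSV or Commons IO provides) is to filter U+FEFF out of
the decoded character stream before the parser sees it. A minimal sketch,
with BomStrippingReader being a made-up name:

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;

/**
 * Hypothetical helper (not part of Commons CSV or Commons IO): a Reader that
 * drops every U+FEFF character, wherever it occurs in the stream.
 */
final class BomStrippingReader extends FilterReader {

    private static final char BOM = '\uFEFF';

    BomStrippingReader(final Reader in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int c;
        do {
            c = super.read();
        } while (c == BOM);
        return c; // -1 at end of stream
    }

    @Override
    public int read(final char[] buf, final int off, final int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        int kept;
        do {
            final int n = super.read(buf, off, len);
            if (n == -1) {
                return -1;
            }
            // Compact the chunk in place, keeping everything except U+FEFF.
            kept = 0;
            for (int i = 0; i < n; i++) {
                final char c = buf[off + i];
                if (c != BOM) {
                    buf[off + kept++] = c;
                }
            }
        } while (kept == 0); // chunk held only BOMs: read again rather than return 0
        return kept;
    }
}
```

You would wrap it around the InputStreamReader and hand the result to
parse(). Be aware that this also drops any U+FEFF inside field values, so
only use it if that character never carries meaning in your data.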


You might find it handy to create something like this:

/**
 * Creates a reader capable of handling BOMs.
 */
public InputStreamReader newReader(final InputStream inputStream) {
    return new InputStreamReader(
        new BOMInputStream(inputStream), StandardCharsets.UTF_8);
}
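
If you would rather not pull in Commons IO just for this, a rough
stdlib-only equivalent for UTF-8 input could look like the following.
Utf8BomSkipper is a hypothetical name, it only knows about the UTF-8 BOM
(EF BB BF), and readNBytes needs Java 9 or later:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

/** Hypothetical stdlib-only helper; handles the UTF-8 BOM only. */
final class Utf8BomSkipper {

    private Utf8BomSkipper() {
    }

    /** Returns a UTF-8 reader positioned after a leading EF BB BF, if present. */
    static Reader newReader(final InputStream in) throws IOException {
        final PushbackInputStream pushback = new PushbackInputStream(in, 3);
        final byte[] head = new byte[3];
        final int n = pushback.readNBytes(head, 0, 3);
        final boolean hasBom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!hasBom && n > 0) {
            pushback.unread(head, 0, n); // not a BOM: put the bytes back
        }
        return new InputStreamReader(pushback, StandardCharsets.UTF_8);
    }
}
```

Input without a BOM passes through unchanged, so the same code path works
for both kinds of file.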

HTH,
Gary


On Thu, Jan 2, 2025 at 8:56 PM Christopher Dodunski (Apache Tomcat) <
chrisfromsquir...@christopher.net.nz> wrote:

> Hi all,
>
> I have a web application which uses Apache Commons CSV for processing
> uploaded CSV files.
>
> Occasionally, users are experiencing an error when using this feature:
>
>      "IOException reading next record: java.io.IOException: (line 6)
> invalid char between encapsulated token and delimiter"
>
> On inspecting the problematic CSV file, however, line 6 looks just fine.
>
> By gradually modifying this CSV along with another which uploaded fine,
> I eventually had two CSV files that were visually identical.  Though one
> would still throw the above error.
>
> Whilst both appeared identical, I noticed that one was 3 bytes larger.
> It turns out that the problematic CSV begins with "<EF><BB><BF>"
> (discovered using Linux 'less' command).  That's a byte-order mark
> (BOM).
>
> Here is my section of code that reads these CSVs:
>
>      Reader in = new FileReader(crewList);
>      CSVFormat csvFormat = CSVFormat.DEFAULT.builder().build();
>      Iterable<CSVRecord> records = csvFormat.parse(in);
>      Iterator<CSVRecord> iterator = records.iterator();
>
> 1) I'm puzzled as to why the presence of a BOM seems to have resulted in
> an erroneous error directed at line 6.
>
> 2) If the presence of a BOM is indeed the culprit, how best to resolve
> this without creating a problem for CSVs not containing a BOM.
>
> Your suggestions are much appreciated.
>
> Kind regards,
>
> Chris.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@commons.apache.org
> For additional commands, e-mail: user-h...@commons.apache.org
>
>
