File Encoding falls back to default encoding while grouping after split using tokenize

Karthick K R Mon, 17 Apr 2017 11:55:09 -0700

I am quite new to Apache Camel, but after using it for a month I really
feel it is a great integration framework that solves various enterprise
problems effectively and with minimal effort.


Coming to the issue: I have been working on splitting a huge CSV file using
the splitter with tokenize and grouping of N lines, and ran into encoding
issues with the grouped content.
A similar issue was raised on Stack Overflow: Camel: UTF-8 Encoding is
lost after using Group
<http://stackoverflow.com/questions/36075063/camel-utf-8-encoding-is-lost-after-using-group>
  

I have also commented on that question with my use case and observations.
Including the same text here:

Sample CSV file (delimiter '|'):

CandidateNumber|CandidateLastName|CandidateFirstName|EducationLevel
CAND123C001|Wells|Jimmy|Bachelor's Degree (±16 years)
CAND123C002|Wells|Tom|Bachelor's Degree (±16 years)
CAND123C003|Wells|James|Bachelor's Degree (±16 years)
CAND123C004|Wells|Tim|Bachelor's Degree (±16 years)

The ± character is corrupted after tokenize with grouping. I initially
assumed the problem was that the proper file encoding was not being set
for the split, but the exchange does have the right value for the
property CamelCharsetName=ISO-8859-1.

from("file://<dir with csv files>?noop=true&charset=ISO-8859-1")
.split(body().tokenize("\n",2,true)).streaming()
.log("body: ${body}");

The same route works fine when grouping is not used:

from("file://<dir with csv files>?noop=true&charset=ISO-8859-1")
.split(body().tokenize("\n")).streaming()
.log("body: ${body}");

Looking at GroupTokenIterator
<https://github.com/apache/camel/blob/master/camel-core/src/main/java/org/apache/camel/util/GroupTokenIterator.java>
in the Camel code base, the problem seems to be with the way the
TypeConverter is used to convert the String to an InputStream:

// convert to input stream
InputStream is =
camelContext.getTypeConverter().mandatoryConvertTo(InputStream.class, data);
...
Note: mandatoryConvertTo() has an overload that also takes the exchange:

<T> T mandatoryConvertTo(Class<T> type, Exchange exchange, Object value)

As the exchange is not passed as an argument, the conversion always falls
back to the default charset, which is set using the system property
"org.apache.camel.default.charset".
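
To illustrate the kind of corruption a charset mismatch produces, here is a minimal plain-Java sketch. It does not use Camel and is not the GroupTokenIterator code; the class name CharsetMismatchDemo and the helper roundTrip are made up for illustration, and it assumes the default charset is UTF-8 while the file is ISO-8859-1.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Camel.
class CharsetMismatchDemo {

    // Encode text with one charset and decode the bytes with another,
    // mimicking a String -> bytes -> String round trip.
    static String roundTrip(String text, Charset encodeWith, Charset decodeWith) {
        byte[] bytes = text.getBytes(encodeWith);
        return new String(bytes, decodeWith);
    }

    public static void main(String[] args) {
        // \u00B1 is the plus-minus sign from the sample CSV.
        String field = "(\u00B116 years)";

        // Same charset on both sides: the text survives unchanged.
        System.out.println(roundTrip(field,
                StandardCharsets.ISO_8859_1, StandardCharsets.ISO_8859_1));

        // Encoded as UTF-8 (assumed default), decoded as ISO-8859-1:
        // the sign turns into two characters, U+00C2 followed by U+00B1.
        System.out.println(roundTrip(field,
                StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1));

        // Encoded as ISO-8859-1, decoded as UTF-8:
        // the single byte 0xB1 is malformed UTF-8 and becomes U+FFFD.
        System.out.println(roundTrip(field,
                StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8));
    }
}
```

Either mismatch direction damages the ± character, which is consistent with what I see once a conversion step ignores the charset carried on the exchange.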

Potential Fix:

// convert to input stream
InputStream is =
camelContext.getTypeConverter().mandatoryConvertTo(InputStream.class,
exchange, data);
...
As this fix would have to be made in camel-core, another option is to use
split without grouping and then an AggregationStrategy with
completionSize() and completionTimeout().

It would still be great to get this fixed in camel-core.
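
As a rough picture of what grouping N lines should yield when the charset is honoured, here is a small plain-Java sketch. It is only an analogy and not Camel API: LineGrouper and group are hypothetical names, groupSize plays a role similar to completionSize(), and the trailing partial group is flushed at end of input, roughly where a completionTimeout() would fire in a route.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Camel.
class LineGrouper {

    // Decode the stream with an explicit charset and join every
    // groupSize lines into one newline-separated chunk.
    static List<String> group(InputStream in, Charset charset, int groupSize) {
        List<String> groups = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int count = 0;
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (count > 0) {
                    current.append('\n');
                }
                current.append(line);
                count++;
                if (count == groupSize) {
                    groups.add(current.toString());
                    current.setLength(0);
                    count = 0;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (count > 0) {
            groups.add(current.toString()); // trailing partial group
        }
        return groups;
    }

    public static void main(String[] args) {
        String csv =
              "CAND123C001|Wells|Jimmy|Bachelor's Degree (\u00B116 years)\n"
            + "CAND123C002|Wells|Tom|Bachelor's Degree (\u00B116 years)\n"
            + "CAND123C003|Wells|James|Bachelor's Degree (\u00B116 years)\n";
        // Simulate a file stored as ISO-8859-1.
        byte[] bytes = csv.getBytes(StandardCharsets.ISO_8859_1);
        for (String body : group(new ByteArrayInputStream(bytes),
                                 StandardCharsets.ISO_8859_1, 2)) {
            System.out.println("body: " + body);
        }
    }
}
```

Because the charset is passed explicitly at the single point where bytes become characters, the ± sign reaches every group intact.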

Kindly let me know your thoughts, and whether this can be handled in a
different way.




