I am quite new to Apache Camel, but after using it for a month I feel it is a great integration framework that solves various enterprise problems effectively and with minimal effort.
Coming to the issue: I have been splitting a huge csv file using the splitter with tokenize and grouping of N lines, and I ran into encoding issues with the grouped content.

A similar issue has been raised on StackOverflow:
Camel: UTF-8 Encoding is lost after using Group
<http://stackoverflow.com/questions/36075063/camel-utf-8-encoding-is-lost-after-using-group>

I have also commented on that question with my use case and observations. Including the same text here:

Sample csv file (with delimiter '|'):

    CandidateNumber|CandidateLastName|CandidateFirstName|EducationLevel
    CAND123C001|Wells|Jimmy|Bachelor's Degree (±16 years)
    CAND123C002|Wells|Tom|Bachelor's Degree (±16 years)
    CAND123C003|Wells|James|Bachelor's Degree (±16 years)
    CAND123C004|Wells|Tim|Bachelor's Degree (±16 years)

The ± character is corrupted after tokenize with grouping. I initially assumed the problem was that the proper file encoding was not being set for the split, but the exchange has the right value for the property CamelCharsetName=ISO-8859-1.

    from("file://<dir with csv files>?noop=true&charset=ISO-8859-1")
        .split(body().tokenize("\n",2,true)).streaming()
        .log("body: ${body}");

The same route works fine when grouping is not used:

    from("file://<dir with csv files>?noop=true&charset=ISO-8859-1")
        .split(body().tokenize("\n")).streaming()
        .log("body: ${body}");

Looking at GroupTokenIterator
<https://github.com/apache/camel/blob/master/camel-core/src/main/java/org/apache/camel/util/GroupTokenIterator.java>
in the Camel code base, the problem seems to be in the way the TypeConverter is used to convert the String to an InputStream:

    // convert to input stream
    InputStream is = camelContext.getTypeConverter().mandatoryConvertTo(InputStream.class, data);
    ...
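To make the symptom concrete, here is a minimal plain-Java sketch (no Camel involved) of what I assume is going on: the text is turned into bytes with one charset and later read back with another. The class name CharsetMismatchDemo and the helper roundTrip are made up for illustration, and UTF-8 is only a stand-in for whatever the default charset happens to be on a given system.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Camel.
class CharsetMismatchDemo {

    // Encode the text with one charset and decode the bytes with another,
    // mimicking a String -> InputStream -> String round trip.
    static String roundTrip(String text, Charset writeCharset, Charset readCharset) {
        byte[] bytes = text.getBytes(writeCharset);
        return new String(bytes, readCharset);
    }

    public static void main(String[] args) {
        // \u00B1 is the plus-minus sign from the sample csv.
        String line = "CAND123C001|Wells|Jimmy|Bachelor's Degree (\u00B116 years)";

        // Same charset on both sides: the line survives unchanged.
        System.out.println(roundTrip(line, StandardCharsets.ISO_8859_1, StandardCharsets.ISO_8859_1));

        // Assumed failure mode: bytes written with a default charset (UTF-8 here)
        // but read back as ISO-8859-1, so the single character becomes two.
        System.out.println(roundTrip(line, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1));
    }
}
```

With matching charsets the ± comes back intact. With the mismatch, the two UTF-8 bytes of ± are each decoded as a separate ISO-8859-1 character, which is the kind of corruption I see in the grouped body.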
Note: mandatoryConvertTo() has an overloaded method that takes the exchange:

    <T> T mandatoryConvertTo(Class<T> type, Exchange exchange, Object value)

As the exchange is not passed as an argument, the conversion always falls back to the default charset, which is set using the system property "org.apache.camel.default.charset".

Potential fix:

    // convert to input stream
    InputStream is = camelContext.getTypeConverter().mandatoryConvertTo(InputStream.class, exchange, data);
    ...

As this fix would have to be made in camel-core, another potential option is to use split without grouping and then use an AggregationStrategy with completionSize() and completionTimeout(). It would still be great to get this fixed in camel-core.

Kindly let me know your thoughts, and whether this can be handled in a different way.
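For reference, the idea behind the workaround can be sketched in plain Java: join every N already-decoded lines as Strings, so no String to InputStream conversion (and therefore no charset) is involved in the grouping step. This is not Camel code; LineGrouper and group are hypothetical names, and in a real route the joining would live inside the aggregation strategy, with completionSize() playing the role of the size argument and completionTimeout() flushing the last partial group.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Camel.
class LineGrouper {

    // Join consecutive lines into groups of at most 'size' lines,
    // separated by '\n', without ever leaving the String domain.
    static List<String> group(List<String> lines, int size) {
        if (size < 1) {
            throw new IllegalArgumentException("size must be >= 1");
        }
        List<String> groups = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int count = 0;
        for (String line : lines) {
            if (count > 0) {
                current.append('\n');
            }
            current.append(line);
            count++;
            if (count == size) {
                // Group is complete (analogous to reaching the completion size).
                groups.add(current.toString());
                current.setLength(0);
                count = 0;
            }
        }
        if (count > 0) {
            // Flush the trailing partial group (analogous to a completion timeout).
            groups.add(current.toString());
        }
        return groups;
    }
}
```

Since the lines are only ever concatenated as Strings, a character such as ± that was decoded correctly when the file was read stays intact in the grouped result.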
