Hi,

I have a source (.csv) with mixed encodings (it's [bs]ad, but I can't change that). When I try to apply regexp_replace to a field (like ... regexp_replace(`myfield`,'...','...') ...), I get this error:

Error: SYSTEM ERROR: MalformedInputException: Input length = 1
For example, one case is caused by an "ö" encoded in ISO-8859-1 (\xF6) in the .csvh file. When Drill tries to apply regexp_replace(), since it works in UTF-8, it probably reasons: this byte is between F0 and F7 (11110xxx), so it starts a 4-byte UTF-8 sequence. But "unfortunately" the next bytes are ordinary characters, so the second byte is not of the form 10xxxxxx, and the sequence is not valid UTF-8.

I can't simply convert the file from ISO-8859-1 to UTF-8, because some lines could be in ISO-8859-1 and others in ISO-8859-5 or any other existing encoding (single-byte, multi-byte or variable-length).

I don't want to strip the "problematic" characters either, because I hope a human can sometimes make a decision based on this information, or at least be helped by it.

So, is there any way to use the regexp_replace function without getting an error? Typically:

- use regexp_replace in US-ASCII mode (like LC_ALL=C sed ...), or
- an option to continue even when an error occurs, or
- a Drill function that detects invalid UTF-8 sequences, so that regexp_replace is not applied to such strings.

Thanks for any ideas,
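
To make the situation concrete, here is a minimal sketch in Python (not Drill SQL, and not an existing Drill function) of the two behaviours asked about: detecting an invalid UTF-8 sequence, and applying a regex byte by byte without any charset decoding, much like LC_ALL=C sed. The helper names is_valid_utf8 and replace_bytewise and the sample value "Köln" are assumptions made up for illustration.

```python
import re


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the bytes decode as strict UTF-8, False otherwise."""
    try:
        raw.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


def replace_bytewise(raw: bytes, pattern: bytes, repl: bytes) -> bytes:
    """Apply a regex directly on raw bytes, with no charset decoding."""
    return re.sub(pattern, repl, raw)


# Hypothetical field value "Köln" ("\u00f6" is "ö").
# In ISO-8859-1 the "ö" is the single byte 0xF6, followed by plain ASCII.
latin1_field = "K\u00f6ln".encode("iso-8859-1")
# The same text in UTF-8, where "ö" is the two bytes 0xC3 0xB6.
utf8_field = "K\u00f6ln".encode("utf-8")

print(is_valid_utf8(latin1_field))   # the 0xF6 byte makes strict decoding fail
print(is_valid_utf8(utf8_field))
# The byte-wise replacement leaves the 0xF6 byte untouched.
print(replace_bytewise(latin1_field, rb"ln$", b"LN"))
```

The byte-wise variant never interprets 0xF6 as a character, so the original byte survives for a human to inspect later. The trade-off is that patterns can then only match ASCII reliably.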
