[ https://issues.apache.org/jira/browse/DRILL-7588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
benj updated DRILL-7588:
------------------------
    Description: 
With a TSV file ([#demo.tsv.gz] in attachment) generated on _Windows_ (EOL = \r\n).

The file contains some special characters like
{noformat}
http://bouzbal-fans.blogspot.com/search/label/Ã\230£Ã\230®Ã\230¨Ã\230§Ã\230± Ã\230¨Ã\231Ë\206Ã\230²Ã\230¨Ã\230§Ã\231â\200\236
{noformat}
The next request sometimes eats the first char of a line
{code:sql}
--CREATE TABLE dfs.test.`result_pqt` AS (
SELECT
  columns[0] as d
  ,CAST(to_timestamp(columns[0],'MM/dd/yy HH:mm:ss a') AS TIMESTAMP)
FROM TABLE(dfs.test.`demo.tsv` (type => 'text', extractHeader => false, fieldDelimiter => '\t', lineDelimiter => '\r\n'))
--)

java.sql.SQLException: SYSTEM ERROR: IllegalArgumentException: Invalid format: "/19/2015 9:33:39 AM"
{code}
The string "^/19/2015 9:33:39 AM" doesn't exist in the file. The month is present in this field in the TSV (the file demo.tsv contains "3/19/2015 9:33:39 AM").

If '\r\n' is replaced by '\n' with _sed_ before the request, the result is correct with *lineDelimiter => '\r\n'*, with *lineDelimiter => '\n'*, or without the TABLE function (there is no error and the date is correctly converted by the to_timestamp function / column d is correct in result_pqt).

Keeping '\r\n' and moving the line that produces the error to another position in demo.tsv can prevent the error (why?).

Keeping '\r\n' and removing or modifying one or more special characters (like in "thá»\235i trang jean") can prevent the error (why?).

I didn't manage to reduce the file demo.tsv any further while keeping the problem.

    Environment:     (was: the same text as the new Description)

> Function TABLE + option lineDelimiter = '\r\n' eats sometime first char of a row
> --------------------------------------------------------------------------------
>
>                 Key: DRILL-7588
>                 URL: https://issues.apache.org/jira/browse/DRILL-7588
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Functions - Drill
>    Affects Versions: 1.17.0
>            Reporter: benj
>            Priority: Major
>         Attachments: demo.tsv.gz, drill_json_profile_tsv.log, drill_tsv.log

--
This message was sent by Atlassian Jira
(v8.3.4#803005)