[jira] [Commented] (DRILL-8481) Ability to query XML root attributes
[ https://issues.apache.org/jira/browse/DRILL-8481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17821298#comment-17821298 ] benj commented on DRILL-8481: - Hi [~cgivre], It's just a bug report, or rather, a request for an enhancement. > Ability to query XML root attributes > > > Key: DRILL-8481 > URL: https://issues.apache.org/jira/browse/DRILL-8481 > Project: Apache Drill > Issue Type: Improvement > Components: Storage - XML >Affects Versions: 1.21.1 >Reporter: benj >Priority: Major > > Hi, > It is possible to retrieve the field attributes, except those of the root. > It would be interesting to be able to retrieve the attributes found in the > root node of XML files. > In my common use cases, I have many XML files, each containing a single XML > frame, often with one or more attributes in the root tag. > To recover this value, I am currently forced to preprocess the files to > "copy" this attribute into the fields of the XML record. > Even with multiple XML records under the root, it would be useful to make > the root attributes accessible for each record. > Example (file aaa.xml): > {noformat} > > > blue > > {noformat} > With the query: > {code:sql} > SELECT * FROM (SELECT filename, * FROM TABLE(dfs.test.`/aaa.xml`(type=>'xml', > dataLevel=>1)) as xml) AS x; > {code} > I can access: > * P1_SubVersion > * P1_MID > * P1_PN > * P1_SL > * P2_SubVersion > * P2.Color > But I can't access: > * PPP_Version > * PPP_TimeStamp > and changing the dataLevel does not solve the problem. > Regards, -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (DRILL-8481) Ability to query XML root attributes
[ https://issues.apache.org/jira/browse/DRILL-8481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] benj updated DRILL-8481: Summary: Ability to query XML root attributes (was: Ability to query root attributes) > Ability to query XML root attributes > > > Key: DRILL-8481 > URL: https://issues.apache.org/jira/browse/DRILL-8481 > Project: Apache Drill > Issue Type: Improvement > Components: Storage - XML >Affects Versions: 1.21.1 >Reporter: benj >Priority: Major > > Hi, > It is possible to retrieve the field attributes, except those of the root. > It would be interesting to be able to retrieve the attributes found in the > root node of XML files. > In my common use cases, I have many XML files, each containing a single XML > frame, often with one or more attributes in the root tag. > To recover this value, I am currently forced to preprocess the files to > "copy" this attribute into the fields of the XML record. > Even with multiple XML records under the root, it would be useful to make > the root attributes accessible for each record. > Example (file aaa.xml): > {noformat} > > > blue > > {noformat} > With the query: > {code:sql} > SELECT * FROM (SELECT filename, * FROM TABLE(dfs.test.`/aaa.xml`(type=>'xml', > dataLevel=>1)) as xml) AS x; > {code} > I can access: > * P1_SubVersion > * P1_MID > * P1_PN > * P1_SL > * P2_SubVersion > * P2.Color > But I can't access: > * PPP_Version > * PPP_TimeStamp > and changing the dataLevel does not solve the problem. > Regards, -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (DRILL-8481) Ability to query root attributes
benj created DRILL-8481: --- Summary: Ability to query root attributes Key: DRILL-8481 URL: https://issues.apache.org/jira/browse/DRILL-8481 Project: Apache Drill Issue Type: Improvement Components: Storage - XML Affects Versions: 1.21.1 Reporter: benj Hi, It is possible to retrieve the field attributes, except those of the root. It would be interesting to be able to retrieve the attributes found in the root node of XML files. In my common use cases, I have many XML files, each containing a single XML frame, often with one or more attributes in the root tag. To recover this value, I am currently forced to preprocess the files to "copy" this attribute into the fields of the XML record. Even with multiple XML records under the root, it would be useful to make the root attributes accessible for each record. Example (file aaa.xml): {noformat} blue {noformat} With the query: {code:sql} SELECT * FROM (SELECT filename, * FROM TABLE(dfs.test.`/aaa.xml`(type=>'xml', dataLevel=>1)) as xml) AS x; {code} I can access: * P1_SubVersion * P1_MID * P1_PN * P1_SL * P2_SubVersion * P2.Color But I can't access: * PPP_Version * PPP_TimeStamp and changing the dataLevel does not solve the problem. Regards, -- This message was sent by Atlassian Jira (v8.20.10#820010)
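The preprocessing workaround described in the report ("copy" the root attribute into the fields of each record) can be sketched in a few lines of Python. This is a hedged illustration, not part of the issue: the element and attribute names (`PPP`, `Version`, `TimeStamp`, `P1`, `P2`) are hypothetical, chosen only to echo the field names mentioned above.

```python
# Sketch of the preprocessing workaround: copy the root element's attributes
# down onto every child record so a level-1 reader can see them.
# Tag/attribute names are hypothetical, not taken from the (elided) sample file.
import xml.etree.ElementTree as ET

def push_root_attributes_down(xml_text: str) -> str:
    """Return XML where every child of the root also carries the root's attributes."""
    root = ET.fromstring(xml_text)
    for record in root:
        for name, value in root.attrib.items():
            # Prefix with the root tag to avoid clashing with record attributes.
            record.set(f"{root.tag}_{name}", value)
    return ET.tostring(root, encoding="unicode")

sample = '<PPP Version="1.0" TimeStamp="2024-01-01"><P1 MID="m1"/><P2 Color="blue"/></PPP>'
result = push_root_attributes_down(sample)
print(result)
```

After this rewrite, each record exposes `PPP_Version` and `PPP_TimeStamp` alongside its own attributes, which is exactly what the query with `dataLevel=>1` could then pick up without any engine change.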
[jira] [Comment Edited] (DRILL-7740) LEAST and GREATEST do not work well with date in embedded mode
[ https://issues.apache.org/jira/browse/DRILL-7740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17382023#comment-17382023 ] benj edited comment on DRILL-7740 at 7/16/21, 12:13 PM: The problem seems to have been fixed in version 1.19 {noformat} $ bash ./apache-drill-1.19.0/bin/drill-embedded apache drill> SELECT a, b, LEAST(a,b) AS min_a_b, GREATEST(a,b) AS max_a_b 2..semicolon> FROM (select to_date('2018-02-26','yyyy-MM-dd') AS a, to_date('2018-02-28','yyyy-MM-dd') AS b); +------------+------------+------------+------------+ | a | b | min_a_b | max_a_b | +------------+------------+------------+------------+ | 2018-02-26 | 2018-02-28 | 2018-02-26 | 2018-02-28 | +------------+------------+------------+------------+ {noformat} And there is no warning when using LEAST or GREATEST was (Author: benj641): The problem seems to have been fixed in version 1.19 {noformat} $ bash ./apache-drill-1.19.0/bin/drill-embedded apache drill> SELECT a, b, LEAST(a,b) AS min_a_b, GREATEST(a,b) AS max_a_b 2..semicolon> FROM (select to_date('2018-02-26','yyyy-MM-dd') AS a, to_date('2018-02-28','yyyy-MM-dd') AS b); +------------+------------+------------+------------+ | a | b | min_a_b | max_a_b | +------------+------------+------------+------------+ | 2018-02-26 | 2018-02-28 | 2018-02-26 | 2018-02-28 | +------------+------------+------------+------------+ {noformat} > LEAST and GREATEST do not work well with date in embedded mode > > > Key: DRILL-7740 > URL: https://issues.apache.org/jira/browse/DRILL-7740 > Project: Apache Drill > Issue Type: Bug > Components: Functions - Drill, Functions - Hive >Affects Versions: 1.17.0 >Reporter: benj >Priority: Major > > There seems to be a huge problem with the LEAST and GREATEST functions in > embedded mode when using them with the DATE type > {code:sql} > bash bin/drill-embedded > apache drill> SELECT a, b, LEAST(a,b) AS min_a_b, GREATEST(a,b) AS max_a_b > FROM (select to_date('2018-02-26','yyyy-MM-dd') AS a, > to_date('2018-02-28','yyyy-MM-dd') AS b); > +------------+------------+------------+------------+ > | a | b | min_a_b | max_a_b | > +------------+------------+------------+------------+ > | 2018-02-26 | 2018-02-28 | 2018-02-25 | 2018-02-27 | > +------------+------------+------------+------------+ > {code} > min_a_b = 2018-02-25 instead of 2018-02-26 > max_a_b = 2018-02-27 instead of 2018-02-28 > Please note that the first time I use LEAST or GREATEST I get a warning: > 
{noformat} > WARNING: An illegal reflective access operation has occurred > WARNING: Illegal reflective access by > org.apache.hadoop.hive.common.StringInternUtils > (file:.../apache-drill-1.17.0/jars/drill-hive-exec-shaded-1.17.0.jar) to > field java.net.URI.string > WARNING: Please consider reporting this to the maintainers of > org.apache.hadoop.hive.common.StringInternUtils > WARNING: Use --illegal-access=warn to enable warnings of further illegal > reflective access operations > WARNING: All illegal access operations will be denied in a future release > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (DRILL-7740) LEAST and GREATEST do not work well with date in embedded mode
[ https://issues.apache.org/jira/browse/DRILL-7740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17382023#comment-17382023 ] benj commented on DRILL-7740: - The problem seems to have been fixed in version 1.19 {noformat} $ bash ./apache-drill-1.19.0/bin/drill-embedded apache drill> SELECT a, b, LEAST(a,b) AS min_a_b, GREATEST(a,b) AS max_a_b 2..semicolon> FROM (select to_date('2018-02-26','yyyy-MM-dd') AS a, to_date('2018-02-28','yyyy-MM-dd') AS b); +------------+------------+------------+------------+ | a | b | min_a_b | max_a_b | +------------+------------+------------+------------+ | 2018-02-26 | 2018-02-28 | 2018-02-26 | 2018-02-28 | +------------+------------+------------+------------+ {noformat} > LEAST and GREATEST do not work well with date in embedded mode > > > Key: DRILL-7740 > URL: https://issues.apache.org/jira/browse/DRILL-7740 > Project: Apache Drill > Issue Type: Bug > Components: Functions - Drill, Functions - Hive >Affects Versions: 1.17.0 >Reporter: benj >Priority: Major > > There seems to be a huge problem with the LEAST and GREATEST functions in > embedded mode when using them with the DATE type > {code:sql} > bash bin/drill-embedded > apache drill> SELECT a, b, LEAST(a,b) AS min_a_b, GREATEST(a,b) AS max_a_b > FROM (select to_date('2018-02-26','yyyy-MM-dd') AS a, > to_date('2018-02-28','yyyy-MM-dd') AS b); > +------------+------------+------------+------------+ > | a | b | min_a_b | max_a_b | > +------------+------------+------------+------------+ > | 2018-02-26 | 2018-02-28 | 2018-02-25 | 2018-02-27 | > +------------+------------+------------+------------+ > {code} > min_a_b = 2018-02-25 instead of 2018-02-26 > max_a_b = 2018-02-27 instead of 2018-02-28 > Please note that the first time I use LEAST or GREATEST I get a warning: > {noformat} > WARNING: An illegal reflective access operation has occurred > WARNING: Illegal reflective access by > org.apache.hadoop.hive.common.StringInternUtils > (file:.../apache-drill-1.17.0/jars/drill-hive-exec-shaded-1.17.0.jar) to > field java.net.URI.string > WARNING: Please consider reporting this to the maintainers of > org.apache.hadoop.hive.common.StringInternUtils > WARNING: Use --illegal-access=warn to enable warnings of further illegal > reflective access operations > WARNING: All 
illegal access operations will be denied in a future release > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (DRILL-2388) support extract(epoch)
[ https://issues.apache.org/jira/browse/DRILL-2388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17368733#comment-17368733 ] benj commented on DRILL-2388: - It's painful to do {noformat} SELECT extract(hour FROM a) * 3600 + extract(minute FROM a) * 60 + extract(second FROM a) AS seconds FROM (VALUES(CAST('00:02:03' AS TIME))) AS t(a) {noformat} instead of {noformat} SELECT extract(epoch FROM a) AS seconds FROM (VALUES(CAST('00:02:03' AS TIME))) AS t(a) {noformat} maybe epoch is a good solution, or why not seconds (with an "s") > support extract(epoch) > --- > > Key: DRILL-2388 > URL: https://issues.apache.org/jira/browse/DRILL-2388 > Project: Apache Drill > Issue Type: Improvement > Components: SQL Parser >Affects Versions: 0.8.0 >Reporter: Chun Chang >Priority: Minor > Fix For: Future > > > Postgres supports the following: > {code} > SELECT extract(epoch FROM now()); > {code} > Drill will error: > {code} > 0: jdbc:drill:schema=dfs.drillTestDirComplexJ> SELECT extract(epoch FROM > now()) from sys.drillbits; > Query failed: ParseException: Encountered "epoch" at line 1, column 16. > Was expecting one of: > "YEAR" ... > "MONTH" ... > "DAY" ... > "HOUR" ... > "MINUTE" ... > "SECOND" ... > Error: exception while executing query: Failure while executing query. > (state=,code=0) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
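The hour/minute/second workaround in the comment above boils down to simple arithmetic. Expressed in plain Python for a time-of-day value (a sketch of the semantics, not of Drill internals):

```python
# The extract(hour)*3600 + extract(minute)*60 + extract(second) workaround
# from the comment, applied to a time-of-day value.
from datetime import time

def seconds_since_midnight(t: time) -> int:
    # Equivalent of extract(epoch FROM <time-of-day>) in engines that support it.
    return t.hour * 3600 + t.minute * 60 + t.second

print(seconds_since_midnight(time(0, 2, 3)))  # 123
```

A single `extract(epoch FROM ...)` would replace the three extracts and two multiplications with one call, which is the whole point of the request.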
[jira] [Created] (DRILL-7954) XML: ability not to concatenate fields and attributes - change presentation of data
benj created DRILL-7954: --- Summary: XML: ability not to concatenate fields and attributes - change presentation of data Key: DRILL-7954 URL: https://issues.apache.org/jira/browse/DRILL-7954 Project: Apache Drill Issue Type: Improvement Affects Versions: 1.19.0 Reporter: benj With an XML file containing this data: {noformat} x y z a {noformat} {noformat} apache drill> SELECT * FROM TABLE(dfs.test.`attributetest.xml`(type=>'xml', dataLevel=>1)) as x; +-----------------------------------------------+----------------+ | attributes | attr | +-----------------------------------------------+----------------+ | {"attr_set_num":"0123","attr_set_val":"12ab"} | {"set":"xyza"} | +-----------------------------------------------+----------------+ apache drill> SELECT * FROM TABLE(dfs.test.`attributetest.xml`(type=>'xml', dataLevel=>2)) as x; +---------------------------------+-----+ | attributes | set | +---------------------------------+-----+ | {"set_num":"01","set_val":"12"} | xy | | {"set_num":"23","set_val":"ab"} | za | +---------------------------------+-----+ apache drill> SELECT * FROM TABLE(dfs.test.`attributetest.xml`(type=>'xml', dataLevel=>3)) as x; +------------+ | attributes | +------------+ | {} | | {} | | {} | | {} | +------------+ {noformat} Attributes and fields with the same name are concatenated and remain unusable _(maybe the possibility of adding a separator would help, but that's not the point here)_ In fact, what we really need is the ability to obtain something like this _(depending on the data level)_ : {noformat} +---------------------------------------------------------------------------------------------------+ | attr | +---------------------------------------------------------------------------------------------------+ | [{"set":"x","_attributes":{"num":"0","val":"1"}},{"set":"y","_attributes":{"num":"1","val":"2"}}] | | [{"set":"z","_attributes":{"num":"2","val":"a"}},{"set":"a","_attributes":{"num":"3","val":"b"}}] | +---------------------------------------------------------------------------------------------------+ +-------------------------------------------------+ | set | +-------------------------------------------------+ | {"set":"x","_attributes":{"num":"0","val":"1"}} | | {"set":"y","_attributes":{"num":"1","val":"2"}} | | {"set":"z","_attributes":{"num":"2","val":"a"}} | | {"set":"a","_attributes":{"num":"3","val":"b"}} | +-------------------------------------------------+ {noformat} _attributes fields could be generated at each level instead of being generated with the path from the top level => that would allow working with the data at each level without losing information -- This message was sent by Atlassian Jira (v8.3.4#803005)
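The per-level layout requested above (each element becomes one record carrying its own `_attributes` map instead of attributes being concatenated across levels) can be sketched with the standard library XML parser. The tag and attribute names here are illustrative reconstructions, since the sample XML in the issue was stripped by the mailing-list formatter:

```python
# Sketch of the requested per-level attribute layout: each <set> element
# becomes one record with its text plus its own _attributes map.
# Tag/attribute names are illustrative, inferred from the desired output above.
import xml.etree.ElementTree as ET

xml_text = """
<attr>
  <set num="0" val="1">x</set>
  <set num="1" val="2">y</set>
  <set num="2" val="a">z</set>
  <set num="3" val="b">a</set>
</attr>
"""

def records_with_level_attributes(text: str):
    root = ET.fromstring(text)
    # One record per element at this level; attributes stay attached to it.
    return [{"set": el.text, "_attributes": dict(el.attrib)} for el in root]

recs = records_with_level_attributes(xml_text)
for rec in recs:
    print(rec)
```

This produces records shaped like `{"set": "x", "_attributes": {"num": "0", "val": "1"}}`, matching the second desired table in the issue.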
[jira] [Commented] (DRILL-4660) TextReader should support multibyte field delimiters
[ https://issues.apache.org/jira/browse/DRILL-4660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126798#comment-17126798 ] benj commented on DRILL-4660: - Any news/progress/hope for this functionality? > TextReader should support multibyte field delimiters > > > Key: DRILL-4660 > URL: https://issues.apache.org/jira/browse/DRILL-4660 > Project: Apache Drill > Issue Type: Improvement > Components: Storage - Text CSV >Affects Versions: 1.6.0 >Reporter: Arina Ielchiieva >Priority: Minor > Fix For: Future > > > Data file /tmp/foo.txt contents: > {noformat} > 0::2::3 > 0::3::1 > 0::5::2 > 0::9::4 > 0::11::1 > 0::12::2 > 0::15::1 > {noformat} > Query: > {code} > select > columns > from > table(dfs.`/tmp/foo.txt`(type => 'text', fieldDelimiter => '::')) > {code} > Results in an error message: > {noformat} > PARSE ERROR: > Expected single character but was String: :: > table /tmp/foo.txt > parameter fieldDelimiter SQL Query null > {noformat} > It would be nice if fieldDelimiter accepted text of any length. -- This message was sent by Atlassian Jira (v8.3.4#803005)
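What a multibyte `fieldDelimiter` would mean for the sample file above is just splitting each line on the full `::` string rather than on a single character, as this minimal Python sketch shows:

```python
# Splitting the sample rows from /tmp/foo.txt on the multibyte "::" delimiter.
lines = ["0::2::3", "0::3::1", "0::5::2", "0::9::4"]

# str.split accepts an arbitrary-length separator, which is exactly the
# behavior the issue asks fieldDelimiter to support.
rows = [line.split("::") for line in lines]
print(rows)
```

Note that Python's own `csv` module has the same single-character restriction on `delimiter` that Drill's TextReader does, which is why the split-based approach is used here.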
[jira] [Updated] (DRILL-7747) Function to determine unknown fields / on-the-fly generated missing fields
[ https://issues.apache.org/jira/browse/DRILL-7747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] benj updated DRILL-7747: Description: it would be really useful to have a function that tells whether a field comes from an existing column or not. With this data: {code:sql} apache drill 1.17> SELECT * FROM dfs.test.`f1.parquet`; +---+--------+-------+ | a | b | c | +---+--------+-------+ | 1 | test-1 | other | | 2 | test-2 | null | | 3 | test-3 | old | +---+--------+-------+ apache drill 1.17> SELECT * FROM dfs.test.`f2.parquet`; +----+---------+ | a | b | +----+---------+ | 10 | test-10 | | 20 | test-20 | | 30 | test-30 | +----+---------+ apache drill 1.17> SELECT *, drilltypeof(c), modeof(c) FROM dfs.test.`f*.parquet`; +------------+----+---------+-------+---------+----------+ | dir0 | a | b | c | EXPR$1 | EXPR$2 | +------------+----+---------+-------+---------+----------+ | f1.parquet | 1 | test-1 | other | VARCHAR | NULLABLE | | f1.parquet | 2 | test-2 | null | VARCHAR | NULLABLE | | f1.parquet | 3 | test-3 | old | VARCHAR | NULLABLE | | f2.parquet | 10 | test-10 | null | VARCHAR | NULLABLE | | f2.parquet | 20 | test-20 | null | VARCHAR | NULLABLE | | f2.parquet | 30 | test-30 | null | VARCHAR | NULLABLE | +------------+----+---------+-------+---------+----------+ {code} It would be nice to know whether the 'c' data is present because the column exists in the Parquet (or other file type) or whether the NULL value was generated because the column was missing. 
For example, a function 'origin' that takes a column name and returns, for each row, whether the value was 'generated' or 'original' (another/better keyword could be chosen (exist(column)=>true/false)). Hypothetical example with the previous data: {code:sql} apache drill> SELECT *, drilltypeof(c), modeof(c), origin(c) AS origin FROM dfs.test.`f*.parquet`; +------------+----+---------+-------+---------+----------+-----------+ | dir0 | a | b | c | EXPR$1 | EXPR$2 | origin | +------------+----+---------+-------+---------+----------+-----------+ | f1.parquet | 1 | test-1 | other | VARCHAR | NULLABLE | original | | f1.parquet | 2 | test-2 | null | VARCHAR | NULLABLE | original | | f1.parquet | 3 | test-3 | old | VARCHAR | NULLABLE | original | | f2.parquet | 10 | test-10 | null | VARCHAR | NULLABLE | generated | | f2.parquet | 20 | test-20 | null | VARCHAR | NULLABLE | generated | | f2.parquet | 30 | test-30 | null | VARCHAR | NULLABLE | generated | +------------+----+---------+-------+---------+----------+-----------+ {code} Or maybe another way would be to have an implicit column (like filename, filepath, ...) that contains the list of available "columns" was: it would be really useful to have a function that tells whether a field comes from an existing column or not. 
With this data: {code:sql} apache drill 1.17> SELECT * FROM dfs.test.`f1.parquet`; +---+--------+-------+ | a | b | c | +---+--------+-------+ | 1 | test-1 | other | | 2 | test-2 | null | | 3 | test-3 | old | +---+--------+-------+ apache drill 1.17> SELECT * FROM dfs.test.`f2.parquet`; +----+---------+ | a | b | +----+---------+ | 10 | test-10 | | 20 | test-20 | | 30 | test-30 | +----+---------+ apache drill 1.17> SELECT *, drilltypeof(c), modeof(c) FROM dfs.test.`f*.parquet`; +------------+----+---------+-------+---------+----------+ | dir0 | a | b | c | EXPR$1 | EXPR$2 | +------------+----+---------+-------+---------+----------+ | f1.parquet | 1 | test-1 | other | VARCHAR | NULLABLE | | f1.parquet | 2 | test-2 | null | VARCHAR | NULLABLE | | f1.parquet | 3 | test-3 | old | VARCHAR | NULLABLE | | f2.parquet | 10 | test-10 | null | VARCHAR | NULLABLE | | f2.parquet | 20 | test-20 | null | VARCHAR | NULLABLE | | f2.parquet | 30 | test-30 | null | VARCHAR | NULLABLE | +------------+----+---------+-------+---------+----------+ {code} It would be nice to know whether the 'c' data is present because the column exists in the Parquet (or other file type) or whether the NULL value was generated because the column was missing. For example, a function 'origin' that takes a column name and returns, for each row, whether the value was 'generated' or 'original' (another/better keyword could be chosen (exist(column)=>true/false)). Hypothetical example with the previous data: {code:sql} apache drill> SELECT *, drilltypeof(c), modeof(c), origin(c) AS origin FROM dfs.test.`f*.parquet`; +------------+----+---------+-------+---------+----------+-----------+ | dir0 | a | b | c | EXPR$1 | EXPR$2 | origin | +------------+----+---------+-------+---------+----------+-----------+ | f1.parquet | 1 | test-1 | other | VARCHAR | NULLABLE | original | | f1.parquet | 2 | test-2 | null | VARCHAR | NULLABLE | original | | f1.parquet | 3 | test-3 | old | VARCHAR | NULLABLE | original | | f2.parquet | 10 | test-10 | null | VARCHAR |
[jira] [Updated] (DRILL-7747) Function to determine unknown fields / on-the-fly generated missing fields
[ https://issues.apache.org/jira/browse/DRILL-7747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] benj updated DRILL-7747: Description: it would be really useful to have a function that tells whether a field comes from an existing column or not. With this data: {code:sql} apache drill 1.17> SELECT * FROM dfs.test.`f1.parquet`; +---+--------+-------+ | a | b | c | +---+--------+-------+ | 1 | test-1 | other | | 2 | test-2 | null | | 3 | test-3 | old | +---+--------+-------+ apache drill 1.17> SELECT * FROM dfs.test.`f2.parquet`; +----+---------+ | a | b | +----+---------+ | 10 | test-10 | | 20 | test-20 | | 30 | test-30 | +----+---------+ apache drill 1.17> SELECT *, drilltypeof(c), modeof(c) FROM dfs.test.`f*.parquet`; +------------+----+---------+-------+---------+----------+ | dir0 | a | b | c | EXPR$1 | EXPR$2 | +------------+----+---------+-------+---------+----------+ | f1.parquet | 1 | test-1 | other | VARCHAR | NULLABLE | | f1.parquet | 2 | test-2 | null | VARCHAR | NULLABLE | | f1.parquet | 3 | test-3 | old | VARCHAR | NULLABLE | | f2.parquet | 10 | test-10 | null | VARCHAR | NULLABLE | | f2.parquet | 20 | test-20 | null | VARCHAR | NULLABLE | | f2.parquet | 30 | test-30 | null | VARCHAR | NULLABLE | +------------+----+---------+-------+---------+----------+ {code} It would be nice to know whether the 'c' data is present because the column exists in the Parquet (or other file type) or whether the NULL value was generated because the column was missing. 
For example, a function 'origin' that takes a column name and returns, for each row, whether the value was 'generated' or 'original' (another/better keyword could be chosen (exist(column)=>true/false)). Hypothetical example with the previous data: {code:sql} apache drill> SELECT *, drilltypeof(c), modeof(c), origin(c) AS origin FROM dfs.test.`f*.parquet`; +------------+----+---------+-------+---------+----------+-----------+ | dir0 | a | b | c | EXPR$1 | EXPR$2 | origin | +------------+----+---------+-------+---------+----------+-----------+ | f1.parquet | 1 | test-1 | other | VARCHAR | NULLABLE | original | | f1.parquet | 2 | test-2 | null | VARCHAR | NULLABLE | original | | f1.parquet | 3 | test-3 | old | VARCHAR | NULLABLE | original | | f2.parquet | 10 | test-10 | null | VARCHAR | NULLABLE | generated | | f2.parquet | 20 | test-20 | null | VARCHAR | NULLABLE | generated | | f2.parquet | 30 | test-30 | null | VARCHAR | NULLABLE | generated | +------------+----+---------+-------+---------+----------+-----------+ {code} was: it would be really useful to have a function that tells whether a field comes from an existing column or not. With this data: {code:sql} apache drill 1.17> SELECT * FROM dfs.test.`f1.parquet`; +---+--------+-------+ | a | b | c | +---+--------+-------+ | 1 | test-1 | other | | 2 | test-2 | null | | 3 | test-3 | old | +---+--------+-------+ apache drill 1.17> SELECT * FROM dfs.test.`f2.parquet`; +----+---------+ | a | b | +----+---------+ | 10 | test-10 | | 20 | test-20 | | 30 | test-30 | +----+---------+ apache drill 1.17> SELECT *, drilltypeof(c), modeof(c) FROM dfs.test.`f*.parquet`; +------------+----+---------+-------+---------+----------+ | dir0 | a | b | c | EXPR$1 | EXPR$2 | +------------+----+---------+-------+---------+----------+ | f1.parquet | 1 | test-1 | other | VARCHAR | NULLABLE | | f1.parquet | 2 | test-2 | null | VARCHAR | NULLABLE | | f1.parquet | 3 | test-3 | old | VARCHAR | NULLABLE | | f2.parquet | 10 | test-10 | null | VARCHAR | NULLABLE | | f2.parquet | 20 | test-20 | null | VARCHAR | NULLABLE | | f2.parquet | 30 | test-30 | null | VARCHAR | NULLABLE | +------------+----+---------+-------+---------+----------+ {code} It would be nice to know whether the 'c' data is present because the column exists in the Parquet (or other file type) or whether the NULL value was generated because the column was missing. 
For example, a function 'origin' that takes a column name and returns, for each row, whether the value was 'generated' or 'original' (another/better keyword could be chosen). Hypothetical example with the previous data: {code:sql} apache drill> SELECT *, drilltypeof(c), modeof(c), origin(c) AS origin FROM dfs.test.`f*.parquet`; +------------+----+---------+-------+---------+----------+-----------+ | dir0 | a | b | c | EXPR$1 | EXPR$2 | origin | +------------+----+---------+-------+---------+----------+-----------+ | f1.parquet | 1 | test-1 | other | VARCHAR | NULLABLE | original | | f1.parquet | 2 | test-2 | null | VARCHAR | NULLABLE | original | | f1.parquet | 3 | test-3 | old | VARCHAR | NULLABLE | original | | f2.parquet | 10 | test-10 | null | VARCHAR | NULLABLE | generated | | f2.parquet | 20 | test-20 | null | VARCHAR | NULLABLE | generated | | f2.parquet | 30 | test-30 | null | VARCHAR | NULLABLE | generated |
[jira] [Created] (DRILL-7747) Function to determine unknown fields / on-the-fly generated missing fields
benj created DRILL-7747: --- Summary: Function to determine unknown fields / on-the-fly generated missing fields Key: DRILL-7747 URL: https://issues.apache.org/jira/browse/DRILL-7747 Project: Apache Drill Issue Type: Wish Components: Functions - Drill Affects Versions: 1.17.0 Reporter: benj it would be really useful to have a function that tells whether a field comes from an existing column or not. With this data: {code:sql} apache drill 1.17> SELECT * FROM dfs.test.`f1.parquet`; +---+--------+-------+ | a | b | c | +---+--------+-------+ | 1 | test-1 | other | | 2 | test-2 | null | | 3 | test-3 | old | +---+--------+-------+ apache drill 1.17> SELECT * FROM dfs.test.`f2.parquet`; +----+---------+ | a | b | +----+---------+ | 10 | test-10 | | 20 | test-20 | | 30 | test-30 | +----+---------+ apache drill 1.17> SELECT *, drilltypeof(c), modeof(c) FROM dfs.test.`f*.parquet`; +------------+----+---------+-------+---------+----------+ | dir0 | a | b | c | EXPR$1 | EXPR$2 | +------------+----+---------+-------+---------+----------+ | f1.parquet | 1 | test-1 | other | VARCHAR | NULLABLE | | f1.parquet | 2 | test-2 | null | VARCHAR | NULLABLE | | f1.parquet | 3 | test-3 | old | VARCHAR | NULLABLE | | f2.parquet | 10 | test-10 | null | VARCHAR | NULLABLE | | f2.parquet | 20 | test-20 | null | VARCHAR | NULLABLE | | f2.parquet | 30 | test-30 | null | VARCHAR | NULLABLE | +------------+----+---------+-------+---------+----------+ {code} It would be nice to know whether the 'c' data is present because the column exists in the Parquet (or other file type) or whether the NULL value was generated because the column was missing. 
For example, a function 'origin' that takes a column name and returns, for each row, whether the value was 'generated' or 'original' (another/better keyword could be chosen). Hypothetical example with the previous data: {code:sql} apache drill> SELECT *, drilltypeof(c), modeof(c), origin(c) AS origin FROM dfs.test.`f*.parquet`; +------------+----+---------+-------+---------+----------+-----------+ | dir0 | a | b | c | EXPR$1 | EXPR$2 | origin | +------------+----+---------+-------+---------+----------+-----------+ | f1.parquet | 1 | test-1 | other | VARCHAR | NULLABLE | original | | f1.parquet | 2 | test-2 | null | VARCHAR | NULLABLE | original | | f1.parquet | 3 | test-3 | old | VARCHAR | NULLABLE | original | | f2.parquet | 10 | test-10 | null | VARCHAR | NULLABLE | generated | | f2.parquet | 20 | test-20 | null | VARCHAR | NULLABLE | generated | | f2.parquet | 30 | test-30 | null | VARCHAR | NULLABLE | generated | +------------+----+---------+-------+---------+----------+-----------+ {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
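The proposed `origin()` semantics can be illustrated outside of Drill with plain dict records, where a key that is absent from the source record corresponds to a column the engine had to materialize as NULL. This is a sketch of the idea only; the function name `origin` comes from the issue, everything else is illustrative:

```python
# Illustration of the proposed origin() semantics over dict records:
# "original"  -> the key exists in the source record,
# "generated" -> the engine materialized a missing column as NULL.
rows = [
    {"a": 1, "b": "test-1", "c": "other"},   # like f1.parquet: column c exists
    {"a": 10, "b": "test-10"},               # like f2.parquet: no column c
]

def origin(row: dict, column: str) -> str:
    return "original" if column in row else "generated"

for row in rows:
    print(row.get("c"), origin(row, "c"))
```

The key point the sketch makes is that `row.get("c")` returns `None` in both the "NULL stored in the file" and "column absent" cases, so the value alone cannot distinguish them; an explicit `origin`-style predicate (or an implicit column listing the available columns) is needed.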
[jira] [Commented] (DRILL-3014) Casting unknown field yields different result from casting null, and bad error message
[ https://issues.apache.org/jira/browse/DRILL-3014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17125847#comment-17125847 ] benj commented on DRILL-3014: - This problem seems to have been corrected (it's not possible to reproduce in 1.17) > Casting unknown field yields different result from casting null, and bad > error message > -- > > Key: DRILL-3014 > URL: https://issues.apache.org/jira/browse/DRILL-3014 > Project: Apache Drill > Issue Type: Bug > Components: Execution - Relational Operators >Reporter: Daniel Barclay >Priority: Minor > Fix For: Future > > > Casting null to INTEGER works as expected like this: > {noformat} > 0: jdbc:drill:zk=local> select cast(NULL AS INTEGER) from > `dfs.tmp`.`simple.csv`; > ++ > | EXPR$0 | > ++ > | null | > ++ > 1 row selected (0.15 seconds) > 0: jdbc:drill:zk=local> > {noformat} > (File "{{simple.csv}}" contains one line containing simply "{{a,b,c,d}}".) > However, casting an unknown column yields an error: > {noformat} > 0: jdbc:drill:zk=local> select cast(noSuchField AS INTEGER) from > `dfs.tmp`.`simple.csv`; > Error: SYSTEM ERROR: null > Fragment 0:0 > [Error Id: a0b348ec-f2c5-4f66-9f05-591399f3c315 on dev-linux2:31010] > (state=,code=0) > 0: jdbc:drill:zk=local> > {noformat} > This looks like a JDK {{NumberFormatException}} that wasn't handled > properly*, and looks like the logical null from the non-existent column was > turned into the string "{{null}}" before the cast to {{INTEGER}}. > Is that a bug or is it intentional that the non-existent field in this case > is not actually treated as being all nulls (as non-existent fields are in at > least some other places)? > (*For most NumberFormatExceptions, the message text does not contain the > information that the kind of exception was a number-format exception--that > information is only in the class name. In particular that information is not > in the message text returned by getMessage(). 
> Drill code that can throw a {{NumberFormatException}} (e.g., cast functions > and other code that calls, e.g., {{Integer.parse(...)}}) should either > immediately wrap it in a {{UserException}}, or at least wrap it in another > {{NumberFormatException}} with fuller message text.) > This seems to confirm that it's a {{NumberFormatException}} (note the > first-column value "{{a}}"): > {noformat} > select cast(columns[0] AS INTEGER) from `dfs.tmp`.`simple.csv`; > Error: SYSTEM ERROR: a > Fragment 0:0 > [Error Id: 9d6107dc-dc2a-40ce-9676-6387ab427098 on dev-linux2:31010] > (state=,code=0) > 0: jdbc:drill:zk=local> > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (DRILL-7104) Change of data type when parquet has multiple fragments
[ https://issues.apache.org/jira/browse/DRILL-7104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] benj updated DRILL-7104: Affects Version/s: 1.16.0 1.17.0 > Change of data type when parquet has multiple fragments > --- > > Key: DRILL-7104 > URL: https://issues.apache.org/jira/browse/DRILL-7104 > Project: Apache Drill > Issue Type: Bug > Components: Storage - Parquet >Affects Versions: 1.15.0, 1.16.0, 1.17.0 >Reporter: benj >Priority: Major > Attachments: DRILL-7104_ErrorNumberFormatException_20190322.log > > > When creating a Parquet with a column filled only with "CAST(NULL AS > VARCHAR)", if the parquet has several fragments, the type is read as INT > instead of VARCHAR. > First, create a +Parquet with only one fragment+ - all is fine (the type of > "demo" is correct). > {code:java} > CREATE TABLE `nobug` AS > (SELECT CAST(NULL AS VARCHAR) AS demo > , md5(CAST(rand() AS VARCHAR)) AS jam > FROM `onebigfile` LIMIT 100); > +----------+---------------------------+ > | Fragment | Number of records written | > +----------+---------------------------+ > | 0_0 | 1000 | > SELECT drilltypeof(demo) AS goodtype FROM `nobug` LIMIT 1; > +----------+ > | goodtype | > +----------+ > | VARCHAR | > {code} > Second, create a +Parquet with at least 2 fragments+ - the type of "demo" > changes to INT > {code:java} > CREATE TABLE `bug` AS > ((SELECT CAST(NULL AS VARCHAR) AS demo > ,md5(CAST(rand() AS VARCHAR)) AS jam > FROM `onebigfile` LIMIT 100) > UNION > (SELECT CAST(NULL AS VARCHAR) AS demo > ,md5(CAST(rand() AS VARCHAR)) AS jam > FROM `onebigfile` LIMIT 100)); > +----------+---------------------------+ > | Fragment | Number of records written | > +----------+---------------------------+ > | 1_1 | 1000276 | > | 1_0 | 999724 | > SELECT drilltypeof(demo) AS badtype FROM `bug` LIMIT 1; > +---------+ > | badtype | > +---------+ > | INT |{code} > The change of type is really terrible... > > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (DRILL-7740) LEAST and GREATEST do not work well with date in embedded mode
benj created DRILL-7740: --- Summary: LEAST and GREATEST do not work well with date in embedded mode Key: DRILL-7740 URL: https://issues.apache.org/jira/browse/DRILL-7740 Project: Apache Drill Issue Type: Bug Components: Functions - Drill, Functions - Hive Affects Versions: 1.17.0 Reporter: benj There seems to be a huge problem with the LEAST and GREATEST functions in embedded mode when using them with the DATE type {code:sql} bash bin/drill-embedded apache drill> SELECT a, b, LEAST(a,b) AS min_a_b, GREATEST(a,b) AS max_a_b FROM (select to_date('2018-02-26','yyyy-MM-dd') AS a, to_date('2018-02-28','yyyy-MM-dd') AS b); +------------+------------+------------+------------+ | a | b | min_a_b | max_a_b | +------------+------------+------------+------------+ | 2018-02-26 | 2018-02-28 | 2018-02-25 | 2018-02-27 | +------------+------------+------------+------------+ {code} min_a_b = 2018-02-25 instead of 2018-02-26 max_a_b = 2018-02-27 instead of 2018-02-28 Please note that the first time I use LEAST or GREATEST I get a warning: {noformat} WARNING: An illegal reflective access operation has occurred WARNING: Illegal reflective access by org.apache.hadoop.hive.common.StringInternUtils (file:.../apache-drill-1.17.0/jars/drill-hive-exec-shaded-1.17.0.jar) to field java.net.URI.string WARNING: Please consider reporting this to the maintainers of org.apache.hadoop.hive.common.StringInternUtils WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations WARNING: All illegal access operations will be denied in a future release {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
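The expected behavior for the dates in this report can be checked against Python's date type: the least and greatest of two dates should be the operands themselves, never values shifted by one day as in the buggy embedded-mode output above.

```python
# Expected LEAST/GREATEST semantics for the dates in the report: min/max
# return the operands unchanged, with no off-by-one-day shift.
from datetime import date

a, b = date(2018, 2, 26), date(2018, 2, 28)
print(min(a, b), max(a, b))  # 2018-02-26 2018-02-28
```

The one-day offset in the reported output suggests a timezone-related conversion somewhere in the embedded-mode code path, though the report itself does not identify the cause.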
[jira] [Updated] (DRILL-6975) TO_CHAR does not seem to work well depending on LOCALE
[ https://issues.apache.org/jira/browse/DRILL-6975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] benj updated DRILL-6975: Affects Version/s: 1.17.0 > TO_CHAR does not seem to work well depending on LOCALE > -- > > Key: DRILL-6975 > URL: https://issues.apache.org/jira/browse/DRILL-6975 > Project: Apache Drill > Issue Type: Bug > Components: Functions - Drill >Affects Versions: 1.14.0, 1.15.0, 1.16.0, 1.17.0 >Reporter: benj >Priority: Major > > Strange results from the TO_CHAR function when using different LOCALEs. > {code:java} > SELECT TO_CHAR((CAST('2008-2-23' AS DATE)), 'yyyy-MMM-dd') FROM (VALUES(1)); > 2008-Feb-23 (in documentation (en_US.UTF-8)) > 2008-févr.-2 (fr_FR.UTF-8) > {code} > surprisingly, by adding a space ('yyyy-MMM-dd ') (or any character) at the end > of the format, the result becomes correct (so there is no problem when > formatting a timestamp with 'yyyy MMM dd HH:mm:ss') > {code:java} > SELECT TO_CHAR(1256.789383, '#,###.###') FROM (VALUES(1)); > 1,256.789 (in documentation (en_US.UTF-8)) > 1 256,78 (fr_FR.UTF-8) > {code} > Even worse results can be obtained > {code:java} > SELECT TO_CHAR(12567,'#,###.###'); > 12,567 (en_US.UTF-8) > 12 56 (fr_FR.UTF-8) > {code} > Again, with the addition of a space/char at the end, we get a better result. > I haven't tested all the locales, but for the last example, the result is > right with de_DE.UTF-8 : 12.567 > The situation is identical in 1.14 > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (DRILL-1755) Add support for arrays and scalars as first level elements in JSON files
[ https://issues.apache.org/jira/browse/DRILL-1755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17066614#comment-17066614 ] benj edited comment on DRILL-1755 at 3/25/20, 11:19 AM: The problem appears to have been corrected in 1.17. The error previously shown for 1.16 (see above) no longer appears in 1.17 (see below):
{code:sql}
drill-embedded 1.17> SELECT 'justfortest' AS mytext FROM dfs.tmp.`example.json`;
+-------------+
|   mytext    |
+-------------+
| justfortest |
| justfortest |
+-------------+
{code}
was (Author: benj641): The problem appears to have been corrected in 1.17. The error previously shown for 1.16 (see above) no longer appears in 1.17 (see below):
{code:sql}
drill-embedded 1.17> SELECT 'justfortest' AS mytext FROM dfs.tmp.`tmp.json`;
+-------------+
|   mytext    |
+-------------+
| justfortest |
| justfortest |
+-------------+
{code}
> Add support for arrays and scalars as first level elements in JSON files > > > Key: DRILL-1755 > URL: https://issues.apache.org/jira/browse/DRILL-1755 > Project: Apache Drill > Issue Type: Improvement >Reporter: Abhishek Girish >Priority: Major > Fix For: Future > > Attachments: drillbit.log > > > Publicly available JSON data sometimes has the following structure (arrays > as first level elements): > [ > {"accessLevel": "public" > }, > {"accessLevel": "private" > } > ] > Drill currently does not support arrays or scalars as first level elements; > only maps are supported. We should add support for arrays and scalars. > Log attached. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (DRILL-1755) Add support for arrays and scalars as first level elements in JSON files
[ https://issues.apache.org/jira/browse/DRILL-1755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17066614#comment-17066614 ] benj commented on DRILL-1755: - The problem appears to have been corrected in 1.17. The error previously shown for 1.16 (see above) no longer appears in 1.17 (see below):
{code:sql}
drill-embedded 1.17> SELECT 'justfortest' AS mytext FROM dfs.tmp.`tmp.json`;
+-------------+
|   mytext    |
+-------------+
| justfortest |
| justfortest |
+-------------+
{code}
> Add support for arrays and scalars as first level elements in JSON files > > > Key: DRILL-1755 > URL: https://issues.apache.org/jira/browse/DRILL-1755 > Project: Apache Drill > Issue Type: Improvement >Reporter: Abhishek Girish >Priority: Major > Fix For: Future > > Attachments: drillbit.log > > > Publicly available JSON data sometimes has the following structure (arrays > as first level elements): > [ > {"accessLevel": "public" > }, > {"accessLevel": "private" > } > ] > Drill currently does not support arrays or scalars as first level elements; > only maps are supported. We should add support for arrays and scalars. > Log attached. -- This message was sent by Atlassian Jira (v8.3.4#803005)
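For reference, the contents of the JSON test file queried above are not shown in the comment; for this issue it would need an array as the first-level element, e.g. a minimal hypothetical two-record file:
{noformat}
[
  {"accessLevel": "public"},
  {"accessLevel": "private"}
]
{noformat}
A constant select over such a file returning two rows, as above, indicates the top-level array is now read as two records.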
[jira] [Updated] (DRILL-7602) Possibility to force repartition on read/select
[ https://issues.apache.org/jira/browse/DRILL-7602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] benj updated DRILL-7602: Description: It would be nice and useful in certain situations to have the capacity to repartition like in Spark ([https://spark.apache.org/docs/latest/rdd-programming-guide.html]): either automatic repartitioning within certain limits, or the possibility to indicate the desired repartitioning, or both options. The only way (that I know of now) to do that with Drill is to change _store.parquet.block-size_ and regenerate the input file, then run the request (but it would be nice to have the ability to do that on read). Illustration: with 2 Parquet files, _file1_ of 50 MB (1 million rows) and _file2_ of 1 MB (5000 rows) {code:sql} CREATE TABLE dfs.test.`result_from_1_parquet` AS (SELECT * FROM dfs.test.`file2` INNER JOIN dfs.test.`file1` ON ...) => ~ 50min -- Today we have to change the parquet block size to force multiple parquet files ALTER SESSION SET `store.parquet.block-size` = 1048576; -- Repartition data CREATE TABLE dfs.test.`file1_bis` AS (SELECT * FROM dfs.test.`file1`); -- then launch the request CREATE TABLE dfs.test.`result_from_1_parquet` AS (SELECT * FROM dfs.test.`file2` INNER JOIN dfs.test.`file1_bis` ON ...) => ~ 1min {code} So it's possible to save a lot of time (depending on the cluster configuration) simply by forcing more input files. It would be useful not to have to regenerate the files with the ideal fragmentation before the query.
This situation easily appears when doing an inequality JOIN (for example, to look up an IP in an IP range) on a not-so-big dataset: {code:java} ALTER SESSION SET `planner.enable_nljoin_for_scalar_only` = false; SELECT * FROM dfs.test.`a_pqt` AS a INNER JOIN dfs.test.`b_pqt` AS b ON inet_aton(b.ip) >= inet_aton(a.ip_first) AND inet_aton(b.ip) <= inet_aton(a.ip_last); {code} was: It will be nice and usefull ion certain situations to have the capacity to do repartition like in spark (https://spark.apache.org/docs/latest/rdd-programming-guide.html) either an automatically repartition in certain limit or possibility to indicate the desired repartition or both options. The only way (that I know now) to do that with Drill is to change _store.parquet.block-size_ and regenerate the input file then do the request (but it will be nice to have the ability to do that on read) illustration : with 2 Parquets files _file1_ of 50Mo (1 million rows) and _file2_ of 1Mo (5000 rows) {code:sql} CREATE TABLE dfs.test.`result_from_1_parquet` AS (SELECT * FROM dfs.test.`file2` INNER JOIN dfs.test.`file1` ON ...) => ~ 50min -- Tody we have to change the parquet block size to force multiple parquet files ALTER SESSION SET `store.parquet.block-size` = 1048576; -- Repartition data CREATE TABLE dfs.test.`file1_bis` AS (SELECT * FROM dfs.test.`file1`); -- then Launch the request CREATE TABLE dfs.test.`result_from_1_parquet` AS (SELECT * FROM dfs.test.`file2` INNER JOIN dfs.test.`file1_bis` ON ...) => ~ 1min {code} So it's possible to save a lot of time (depending of configuration of cluster) by simply forcing more input file. it would be useful not to have to regenerate the files with the ideal fragmentation before request.
> Possibility to force repartition on read/select > --- > > Key: DRILL-7602 > URL: https://issues.apache.org/jira/browse/DRILL-7602 > Project: Apache Drill > Issue Type: Improvement > Components: Execution - Flow, Query Planning Optimization >Affects Versions: 1.17.0 >Reporter: benj >Priority: Major > > It will be nice and usefull in certain situations to have the capacity to do > repartition like in spark > ([https://spark.apache.org/docs/latest/rdd-programming-guide.html]) > either an automatically repartition in certain limit or possibility to > indicate the desired repartition or both options. > The only way (that I know now) to do that with Drill is to change > _store.parquet.block-size_ and regenerate the input file then do the request > (but it will be nice to have the ability to do that on read) > illustration : with 2 Parquets files _file1_ of 50Mo (1 million rows) and > _file2_ of 1Mo (5000 rows) > {code:sql} > CREATE TABLE dfs.test.`result_from_1_parquet` AS > (SELECT * FROM dfs.test.`file2` INNER JOIN dfs.test.`file1` ON ...) > => ~ 50min > -- Today we have to change the parquet block size to force multiple parquet > files > ALTER SESSION SET `store.parquet.block-size` = 1048576; > -- Repartition data > CREATE TABLE dfs.test.`file1_bis` AS (SELECT * FROM dfs.test.`file1`); > -- then Launch the request > CREATE TABLE dfs.test.`result_from_1_parquet` AS > (SELECT * FROM dfs.test.`file2` INNER JOIN dfs.test.`file1_bis` ON ...) > => ~ 1min > {code} > So it's possible to save a lot of time (depending of configuration of >
[jira] [Comment Edited] (DRILL-7371) DST/UTC cast/to_timestamp problem
[ https://issues.apache.org/jira/browse/DRILL-7371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16927527#comment-16927527 ] benj edited comment on DRILL-7371 at 3/12/20, 4:19 PM: --- [~volodymyr], the problem occurs on all Daylight Saving Time dates for Europe/Paris. From investigation after your message, it appears that the problem shows up in console mode (/bin/sqlline -u jdbc:drill:zk=...:2181;schema=myhdfs). The result is also wrong when executed via Zeppelin (JDBC too). But there is no problem with the request launched directly in the Apache Drill web interface ([http://...:8047/query]) and no problem with the request in drill-embedded. So it seems that the problem probably comes from JDBC. was (Author: benj641): [~volodymyr], the problem occurs on all Daylight Saving Time dates for Europe/Paris. From investigation after your message, it appears that the problem shows up in console mode (/bin/sqlline -u jdbc:drill:zk=...:2181;schema=myhdfs). The result is also wrong when executed via Zeppelin (JDBC too). But there is no problem with the request launched directly in the Apache Drill web interface ([http://...:8047/query).] So it seems that the problem probably comes from JDBC. > DST/UTC cast/to_timestamp problem > - > > Key: DRILL-7371 > URL: https://issues.apache.org/jira/browse/DRILL-7371 > Project: Apache Drill > Issue Type: Bug > Components: Client - JDBC, Functions - Drill >Affects Versions: 1.16.0 >Reporter: benj >Priority: Major > > With LC_TIME=fr_FR.UTF-8 and +drillbits configured in UTC+ (as specified in > [http://www.openkb.info/2015/05/understanding-drills-timestamp-and.html#.VUzhotpVhHw] > found via [https://drill.apache.org/docs/data-type-conversion/#to_timestamp])
> {code:sql}
> SELECT TIMEOFDAY();
> +-----------------------------+
> |           EXPR$0            |
> +-----------------------------+
> | 2019-09-11 08:20:12.247 UTC |
> +-----------------------------+
> {code}
> Problems appear when _cast/to_timestamp_ handles dates related to the DST (Daylight Saving Time) switch of some countries.
> To illustrate, all the next requests give the same +wrong+ results:
> {code:sql}
> SELECT to_timestamp('2018-03-25 02:22:40 UTC','yyyy-MM-dd HH:mm:ss z');
> SELECT to_timestamp('2018-03-25 02:22:40','yyyy-MM-dd HH:mm:ss');
> SELECT cast('2018-03-25 02:22:40' as timestamp);
> SELECT cast('2018-03-25 02:22:40 +' as timestamp);
> +-----------------------+
> |        EXPR$0         |
> +-----------------------+
> | 2018-03-25 03:22:40.0 |
> +-----------------------+
> {code}
> while the result should be "2018-03-25 +02+:22:40.0"
> A UTC date and time in a string shouldn't change when cast to a UTC timestamp.
> To illustrate, the next requests produce +good+ results:
> {code:java}
> SELECT to_timestamp('2018-03-26 02:22:40 UTC','yyyy-MM-dd HH:mm:ss z');
> +-----------------------+
> |        EXPR$0         |
> +-----------------------+
> | 2018-03-26 02:22:40.0 |
> +-----------------------+
> SELECT CAST('2018-03-24 02:22:40' AS timestamp);
> +-----------------------+
> |        EXPR$0         |
> +-----------------------+
> | 2018-03-24 02:22:40.0 |
> +-----------------------+
> {code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
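Independently of the JDBC-side investigation, a commonly suggested mitigation for DST-related timestamp shifts (described in the openkb article linked in the issue) is to force the drillbit JVMs to UTC; a configuration sketch for conf/drill-env.sh:
{noformat}
# conf/drill-env.sh -- run the drillbit JVM in UTC so string<->timestamp
# conversions are not shifted by the server's local DST rules
export DRILL_JAVA_OPTS="$DRILL_JAVA_OPTS -Duser.timezone=UTC"
{noformat}
This changes the server-side timezone only; a JDBC client JVM may need the same -Duser.timezone=UTC flag.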
[jira] [Created] (DRILL-7602) Possibility to force repartition on read/select
benj created DRILL-7602: --- Summary: Possibility to force repartition on read/select Key: DRILL-7602 URL: https://issues.apache.org/jira/browse/DRILL-7602 Project: Apache Drill Issue Type: Improvement Components: Execution - Flow, Query Planning Optimization Affects Versions: 1.17.0 Reporter: benj It would be nice and useful in certain situations to have the capacity to repartition like in Spark (https://spark.apache.org/docs/latest/rdd-programming-guide.html): either automatic repartitioning within certain limits, or the possibility to indicate the desired repartitioning, or both options. The only way (that I know of now) to do that with Drill is to change _store.parquet.block-size_ and regenerate the input file, then run the request (but it would be nice to have the ability to do that on read). Illustration: with 2 Parquet files, _file1_ of 50 MB (1 million rows) and _file2_ of 1 MB (5000 rows) {code:sql} CREATE TABLE dfs.test.`result_from_1_parquet` AS (SELECT * FROM dfs.test.`file2` INNER JOIN dfs.test.`file1` ON ...) => ~ 50min -- Today we have to change the parquet block size to force multiple parquet files ALTER SESSION SET `store.parquet.block-size` = 1048576; -- Repartition data CREATE TABLE dfs.test.`file1_bis` AS (SELECT * FROM dfs.test.`file1`); -- then launch the request CREATE TABLE dfs.test.`result_from_1_parquet` AS (SELECT * FROM dfs.test.`file2` INNER JOIN dfs.test.`file1_bis` ON ...) => ~ 1min {code} So it's possible to save a lot of time (depending on the cluster configuration) simply by forcing more input files. It would be useful not to have to regenerate the files with the ideal fragmentation before the query. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (DRILL-7595) Change of data type from bigint to int when parquet with multiple fragment
[ https://issues.apache.org/jira/browse/DRILL-7595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] benj updated DRILL-7595: Description: like on DRILL-7104, there is a bug that change the type from BIGINT to INT where a parquet have multiple fragment With a file containing few row (all is fine (we store a BIGINT and really have a BIGINT in the Parquet) {code:sql} apache drill> CREATE TABLE dfs.tmp.`out_pqt` AS (SELECT CAST(0 as BIGINT) AS d FROM dfs.tmp.`fewrowfile`; +--+---+ | Fragment | Number of records written | +--+---+ | 1_0 | 1500 | +--+---+ apache drill> SELECT typeof(d) FROM dfs.tmp.`out_pqt`; ++ | EXPR$0 | ++ | BIGINT | ++ {code} With a file containing "enough" row (there is a problem (we store a BIGINT but we unfortunatly have an INT in the Parquet) {code:sql} apache drill> CREATE TABLE dfs.tmp.`out_pqt` AS (SELECT CAST(0 as BIGINT) AS d FROM dfs.tmp.`manyrowfile`; +--+---+ | Fragment | Number of records written | +--+---+ | 1_1 | 934111| | 1_0 | 1488743 | +--+---+ apache drill> SELECT typeof(d) FROM dfs.tmp.`out_pqt`; ++ | EXPR$0 | ++ | INT| ++ {code} It's not really satisfactory but please note that there is a Trick to avoid this problem: using a CAST('0' AS BIGINT) instead of a CAST(0 AS BIGINT) {code:sql} apache drill> CREATE TABLE dfs.tmp.`out_pqt` AS (SELECT CAST('0' as BIGINT) AS d FROM dfs.tmp.`manyrowfile`; +--+---+ | Fragment | Number of records written | +--+---+ | 1_1 | 934111| | 1_0 | 1488743 | +--+---+ apache drill> SELECT typeof(d) FROM dfs.tmp.`out_pqt`; ++ | EXPR$0 | ++ | BIGINT | ++ {code} was: like on DRILL-7104, there is a bug that change the type from BIGINT to INT where a parquet have multiple fragment With a file containing few row (all is fine (we store a BIGINT and really have a BIGINT in the Parquet) {code:sql} apache drill> CREATE TABLE dfs.tmp.`out_pqt` AS (SELECT CAST(0 as BIGINT) AS d FROM dfs.tmp.`fewrowfile`; +--+---+ | Fragment | Number of records written | +--+---+ | 1_0 | 1500 | +--+---+ apache 
drill> SELECT typeof(d) FROM dfs.tmp.`out_pqt`; ++ | EXPR$0 | ++ | BIGINT | ++ {code} With a file containing "enough" row (there is a problem (we store a BIGINT but we unfortunatly have an INT in the Parquet) {code:sql} apache drill> CREATE TABLE dfs.tmp.`out_pqt` AS (SELECT CAST(0 as BIGINT) AS d FROM dfs.tmp.`fewrowfile`; +--+---+ | Fragment | Number of records written | +--+---+ | 1_1 | 934111| | 1_0 | 1488743 | +--+---+ apache drill> SELECT typeof(d) FROM dfs.tmp.`out_pqt`; ++ | EXPR$0 | ++ | INT| ++ {code} It's not really satisfactory but please note that there is a Trick to avoid this problem: using a CAST('0' AS BIGINT) instead of a CAST(0 AS BIGINT) {code:sql} apache drill> CREATE TABLE dfs.tmp.`out_pqt` AS (SELECT CAST('0' as BIGINT) AS d FROM dfs.tmp.`fewrowfile`; +--+---+ | Fragment | Number of records written | +--+---+ | 1_1 | 934111| | 1_0 | 1488743 | +--+---+ apache drill> SELECT typeof(d) FROM dfs.tmp.`out_pqt`; ++ | EXPR$0 | ++ | BIGINT | ++ {code} > Change of data type from bigint to int when parquet with multiple fragment > -- > > Key: DRILL-7595 > URL: https://issues.apache.org/jira/browse/DRILL-7595 > Project: Apache Drill > Issue Type: Bug > Components: Storage - Parquet >Affects Versions: 1.17.0 >Reporter: benj >Priority: Major > > like on DRILL-7104, there is a bug that change the type from BIGINT to INT > where a parquet have multiple fragment > With a file containing few row (all is fine (we store a BIGINT and really > have a BIGINT in the Parquet) > {code:sql} > apache drill> CREATE TABLE dfs.tmp.`out_pqt` AS (SELECT CAST(0 as BIGINT) AS > d FROM dfs.tmp.`fewrowfile`; > +--+---+ > | Fragment | Number of records written | > +--+---+ > | 1_0 | 1500
[jira] [Commented] (DRILL-7595) Change of data type from bigint to int when parquet with multiple fragment
[ https://issues.apache.org/jira/browse/DRILL-7595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17040919#comment-17040919 ] benj commented on DRILL-7595: - Another trick to avoid the problem (subtract two equal values that are bigger than an INTEGER):
{code:sql}
apache drill> CREATE TABLE dfs.tmp.`out_pqt` AS (SELECT CAST(2147483648 - 2147483648 as BIGINT) AS d FROM dfs.tmp.`manyrowfile`);
+----------+---------------------------+
| Fragment | Number of records written |
+----------+---------------------------+
| 1_1      | 934111                    |
| 1_0      | 1488743                   |
+----------+---------------------------+
apache drill> SELECT typeof(d) FROM dfs.tmp.`out_pqt`;
+--------+
| EXPR$0 |
+--------+
| BIGINT |
+--------+
{code}
> Change of data type from bigint to int when parquet with multiple fragment > -- > > Key: DRILL-7595 > URL: https://issues.apache.org/jira/browse/DRILL-7595 > Project: Apache Drill > Issue Type: Bug > Components: Storage - Parquet >Affects Versions: 1.17.0 >Reporter: benj >Priority: Major > > Like in DRILL-7104, there is a bug that changes the type from BIGINT to INT > when a parquet has multiple fragments. > With a file containing few rows, all is fine (we store a BIGINT and really > have a BIGINT in the Parquet):
> {code:sql}
> apache drill> CREATE TABLE dfs.tmp.`out_pqt` AS (SELECT CAST(0 as BIGINT) AS d FROM dfs.tmp.`fewrowfile`);
> +----------+---------------------------+
> | Fragment | Number of records written |
> +----------+---------------------------+
> | 1_0      | 1500                      |
> +----------+---------------------------+
> apache drill> SELECT typeof(d) FROM dfs.tmp.`out_pqt`;
> +--------+
> | EXPR$0 |
> +--------+
> | BIGINT |
> +--------+
> {code}
> With a file containing "enough" rows, there is a problem (we store a BIGINT > but unfortunately get an INT in the Parquet):
> {code:sql}
> apache drill> CREATE TABLE dfs.tmp.`out_pqt` AS (SELECT CAST(0 as BIGINT) AS d FROM dfs.tmp.`manyrowfile`);
> +----------+---------------------------+
> | Fragment | Number of records written |
> +----------+---------------------------+
> | 1_1      | 934111                    |
> | 1_0      | 1488743                   |
> +----------+---------------------------+
> apache drill> SELECT typeof(d) FROM dfs.tmp.`out_pqt`;
> +--------+
> | EXPR$0 |
> +--------+
> | INT    |
> +--------+
> {code}
> > It's not really satisfactory, but please note that there is a trick to avoid > this problem: use CAST('0' AS BIGINT) instead of CAST(0 AS BIGINT)
> {code:sql}
> apache drill> CREATE TABLE dfs.tmp.`out_pqt` AS (SELECT CAST('0' as BIGINT) AS d FROM dfs.tmp.`manyrowfile`);
> +----------+---------------------------+
> | Fragment | Number of records written |
> +----------+---------------------------+
> | 1_1      | 934111                    |
> | 1_0      | 1488743                   |
> +----------+---------------------------+
> apache drill> SELECT typeof(d) FROM dfs.tmp.`out_pqt`;
> +--------+
> | EXPR$0 |
> +--------+
> | BIGINT |
> +--------+
> {code}
> -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (DRILL-7595) Change of data type from bigint to int when parquet with multiple fragment
benj created DRILL-7595: --- Summary: Change of data type from bigint to int when parquet with multiple fragment Key: DRILL-7595 URL: https://issues.apache.org/jira/browse/DRILL-7595 Project: Apache Drill Issue Type: Bug Components: Storage - Parquet Affects Versions: 1.17.0 Reporter: benj Like in DRILL-7104, there is a bug that changes the type from BIGINT to INT when a parquet has multiple fragments. With a file containing few rows, all is fine (we store a BIGINT and really have a BIGINT in the Parquet):
{code:sql}
apache drill> CREATE TABLE dfs.tmp.`out_pqt` AS (SELECT CAST(0 as BIGINT) AS d FROM dfs.tmp.`fewrowfile`);
+----------+---------------------------+
| Fragment | Number of records written |
+----------+---------------------------+
| 1_0      | 1500                      |
+----------+---------------------------+
apache drill> SELECT typeof(d) FROM dfs.tmp.`out_pqt`;
+--------+
| EXPR$0 |
+--------+
| BIGINT |
+--------+
{code}
With a file containing "enough" rows, there is a problem (we store a BIGINT but unfortunately get an INT in the Parquet):
{code:sql}
apache drill> CREATE TABLE dfs.tmp.`out_pqt` AS (SELECT CAST(0 as BIGINT) AS d FROM dfs.tmp.`manyrowfile`);
+----------+---------------------------+
| Fragment | Number of records written |
+----------+---------------------------+
| 1_1      | 934111                    |
| 1_0      | 1488743                   |
+----------+---------------------------+
apache drill> SELECT typeof(d) FROM dfs.tmp.`out_pqt`;
+--------+
| EXPR$0 |
+--------+
| INT    |
+--------+
{code}
It's not really satisfactory, but please note that there is a trick to avoid this problem: use CAST('0' AS BIGINT) instead of CAST(0 AS BIGINT)
{code:sql}
apache drill> CREATE TABLE dfs.tmp.`out_pqt` AS (SELECT CAST('0' as BIGINT) AS d FROM dfs.tmp.`manyrowfile`);
+----------+---------------------------+
| Fragment | Number of records written |
+----------+---------------------------+
| 1_1      | 934111                    |
| 1_0      | 1488743                   |
+----------+---------------------------+
apache drill> SELECT typeof(d) FROM dfs.tmp.`out_pqt`;
+--------+
| EXPR$0 |
+--------+
| BIGINT |
+--------+
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (DRILL-7588) Function TABLE + option lineDelimiter = '\r\n' eats sometime first char of a row
[ https://issues.apache.org/jira/browse/DRILL-7588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17040793#comment-17040793 ] benj commented on DRILL-7588: - As another possible solution: in the case of a Windows file with \r\n EOL, it is possible to use '\n' as the line delimiter to avoid the problem described above. In this case the last field will have a \r included at the end, but if we know which field is last it does not matter, because it's possible to do a REGEXP_REPLACE(last_field,'\r$',''). Still, it's not really satisfactory. > Function TABLE + option lineDelimiter = '\r\n' eats sometime first char of a > row > > > Key: DRILL-7588 > URL: https://issues.apache.org/jira/browse/DRILL-7588 > Project: Apache Drill > Issue Type: Bug > Components: Functions - Drill >Affects Versions: 1.17.0 >Reporter: benj >Priority: Major > Attachments: demo.tsv.gz, drill_json_profile_tsv.log, drill_tsv.log > > > With a TSV file ([#demo.tsv.gz] in attachment) generated on _Windows_ (EOL = > \r\n). > The file contains some special chars like > {noformat} > http://bouzbal-fans.blogspot.com/search/label/Ã\230£Ã\230®Ã\230¨Ã\230§Ã\230± > Ã\230¨Ã\231Ë\206Ã\230²Ã\230¨Ã\230§Ã\231â\200\236 > {noformat} > The next request sometimes eats the first char of a line > {code:sql} > --CREATE TABLE dfs.test.`result_pqt` AS ( > SELECT > columns[0] as d > ,CAST(to_timestamp(columns[0],'MM/dd/yy HH:mm:ss a') AS TIMESTAMP) > FROM TABLE(dfs.test.`demo.tsv` (type => 'text', extractHeader => false, > fieldDelimiter => '\t', lineDelimiter => '\r\n')) > --) > java.sql.SQLException: SYSTEM ERROR: IllegalArgumentException: Invalid > format: "/19/2015 9:33:39 AM" > {code} > The string "^/19/2015 9:33:39 AM" doesn't exist. The month is already present > in this field in the TSV (so here there is "3/19/2015 9:33:39 AM" in the file > demo.tsv). > If '\r\n' is replaced by '\n' with _sed_ before the request, the result is > correct with *lineDelimiter => '\r\n'* as well as with *lineDelimiter => '\n'* > or without the TABLE function (there is no error and the date is correctly > converted with the to_timestamp function / column d is correct in result_pqt). > Keeping '\r\n', moving the line that produces the error elsewhere in demo.tsv > can prevent the error (why?). > Keeping '\r\n', removing/modifying one or more special chars (like in > "thá»\235i trang jean") can prevent the error (why?). > I didn't manage to reduce the file demo.tsv further while keeping the problem. -- This message was sent by Atlassian Jira (v8.3.4#803005)
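The workaround described in the comment above can be sketched as follows (hypothetical column layout: columns[2] stands for whatever the last field of the file is):
{code:sql}
-- Read with '\n' so no line start is eaten, then strip the trailing \r
-- that stays glued to the last field of each record
SELECT columns[0], columns[1],
       REGEXP_REPLACE(columns[2], '\r$', '') AS last_field
FROM TABLE(dfs.test.`demo.tsv` (type => 'text', extractHeader => false,
     fieldDelimiter => '\t', lineDelimiter => '\n'));
{code}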
[jira] [Commented] (DRILL-6096) Provide mechanisms to specify field delimiters and quoted text for TextRecordWriter
[ https://issues.apache.org/jira/browse/DRILL-6096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17038357#comment-17038357 ] benj commented on DRILL-6096: - Just trying to use this new functionality. Some points (tested in 1.17 and last 1.18 @ 2020-02-17) : * Should at least add _"write_text"_ in description of allowed values for option _store.format_ * Why _write_text_ doesn't appears in default storage configuration ? * Try to create write_text or equivalent in storage configuration but use of _"fieldDelimiter"_ produce _"Please retry: Error (invalid JSON mapping)"_ - need a new ticket ? > Provide mechanisms to specify field delimiters and quoted text for > TextRecordWriter > --- > > Key: DRILL-6096 > URL: https://issues.apache.org/jira/browse/DRILL-6096 > Project: Apache Drill > Issue Type: Improvement > Components: Storage - Text CSV >Affects Versions: 1.12.0 >Reporter: Kunal Khatua >Assignee: Arina Ielchiieva >Priority: Major > Labels: doc-impacting, ready-to-commit > Fix For: 1.17.0 > > > Currently, there is no way for a user to specify the field delimiter for the > writing records as a text output. Further more, if the fields contain the > delimiter, we have no mechanism of specifying quotes. > By default, quotes should be used to enclose non-numeric fields being written. > *Description of the implemented changes:* > 2 options are added to control text writer output: > {{store.text.writer.add_header}} - indicates if header should be added in > created text file. Default is true. > {{store.text.writer.force_quotes}} - indicates if all value should be quoted. > Default is false. It means only values that contain special characters (line > / field separators) will be quoted. > Line / field separators, quote / escape characters can be configured using > text format configuration using Web UI. User can create special format only > for writing data and then use it when creating files. 
Though such format can > be always used to read back written data. > {noformat} > "formats": { > "write_text": { > "type": "text", > "extensions": [ > "txt" > ], > "lineDelimiter": "\n", > "fieldDelimiter": "!", > "quote": "^", > "escape": "^", > } >}, > ... > {noformat} > Next set specified format and create text file: > {noformat} > alter session set `store.format` = 'write_text'; > create table dfs.tmp.t as select 1 as id from (values(1)); > {noformat} > Notes: > 1. To write data univocity-parsers are used, they limit line separator length > to not more than 2 characters, though Drill allows setting more 2 chars as > line separator since Drill can read data splitting by line separator of any > length, during data write exception will be thrown. > 2. {{extractHeader}} in text format configuration does not affect if header > will be written to text file, only {{store.text.writer.add_header}} controls > this action. {{extractHeader}} is used only when reading the data. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (DRILL-7588) Function TABLE + option lineDelimiter = '\r\n' eats sometime first char of a row
[ https://issues.apache.org/jira/browse/DRILL-7588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] benj updated DRILL-7588: Description: With a TSV file ([#demo.tsv.gz] in attachment) generated on _Windows_ (EOL = \r\n). The file contains some special char like {noformat} http://bouzbal-fans.blogspot.com/search/label/Ã\230£Ã\230®Ã\230¨Ã\230§Ã\230± Ã\230¨Ã\231Ë\206Ã\230²Ã\230¨Ã\230§Ã\231â\200\236 {noformat} The next request sometimes eat the first char of a line {code:sql} --CREATE TABLE dfs.test.`result_pqt` AS ( SELECT columns[0] as d ,CAST(to_timestamp(columns[0],'MM/dd/yy HH:mm:ss a') AS TIMESTAMP) FROM TABLE(dfs.test.`demo.tsv` (type => 'text', extractHeader => false, fieldDelimiter => '\t', lineDelimiter => '\r\n')) --) java.sql.SQLException: SYSTEM ERROR: IllegalArgumentException: Invalid format: "/19/2015 9:33:39 AM" {code} The string "^/19/2015 9:33:39 AM" doesn't exists. Month is already present in this field in the TSV (so here there is "3/19/2015 9:33:39 AM" in the file demo.tsv). If '\r\n' are replaced by '\n' with _sed_ before the request, the result is correct as well with *lineDelimiter => '\r\n'* as *lineDelimiter => '\n'* or without function TABLE (there is no error and the date is correctly converted with to_timestamp function / columns d is correct in the result_pqt) keeping '\r\n' and trying to move (in another line in demo.tsv) the line that produce error can prevent error (why ?) keeping '\r\n' and trying to remove/modify one or more special char (like in "thá»\235i trang jean") can prevent error (why ?) Didn't manage to reduce more the file demo.tsv while keeping the problem. Environment: (was: With a TSV file ([#demo.tsv.gz] in attachment) generated on _Windows_ (EOL = \r\n). 
The file contains some special char like {noformat} http://bouzbal-fans.blogspot.com/search/label/Ã\230£Ã\230®Ã\230¨Ã\230§Ã\230± Ã\230¨Ã\231Ë\206Ã\230²Ã\230¨Ã\230§Ã\231â\200\236 {noformat} The next request sometimes eat the first char of a line {code:sql} --CREATE TABLE dfs.test.`result_pqt` AS ( SELECT columns[0] as d ,CAST(to_timestamp(columns[0],'MM/dd/yy HH:mm:ss a') AS TIMESTAMP) FROM TABLE(dfs.test.`demo.tsv` (type => 'text', extractHeader => false, fieldDelimiter => '\t', lineDelimiter => '\r\n')) --) java.sql.SQLException: SYSTEM ERROR: IllegalArgumentException: Invalid format: "/19/2015 9:33:39 AM" {code} The string "^/19/2015 9:33:39 AM" doesn't exists. Month is already present in this field in the TSV (so here there is "3/19/2015 9:33:39 AM" in the file demo.tsv). If '\r\n' are replaced by '\n' with _sed_ before the request, the result is correct as well with *lineDelimiter => '\r\n'* as *lineDelimiter => '\n'* or without function TABLE (there is no error and the date is correctly converted with to_timestamp function / columns d is correct in the result_pqt) keeping '\r\n' and trying to move (in another line in demo.tsv) the line that produce error can prevent error (why ?) keeping '\r\n' and trying to remove/modify one or more special char (like in "thá»\235i trang jean") can prevent error (why ?) Didn't manage to reduce more the file demo.tsv while keeping the problem. ) > Function TABLE + option lineDelimiter = '\r\n' eats sometime first char of a > row > > > Key: DRILL-7588 > URL: https://issues.apache.org/jira/browse/DRILL-7588 > Project: Apache Drill > Issue Type: Bug > Components: Functions - Drill >Affects Versions: 1.17.0 >Reporter: benj >Priority: Major > Attachments: demo.tsv.gz, drill_json_profile_tsv.log, drill_tsv.log > > > With a TSV file ([#demo.tsv.gz] in attachment) generated on _Windows_ (EOL = > \r\n). 
> The file contains some special char like > {noformat} > http://bouzbal-fans.blogspot.com/search/label/Ã\230£Ã\230®Ã\230¨Ã\230§Ã\230± > Ã\230¨Ã\231Ë\206Ã\230²Ã\230¨Ã\230§Ã\231â\200\236 > {noformat} > The next request sometimes eat the first char of a line > {code:sql} > --CREATE TABLE dfs.test.`result_pqt` AS ( > SELECT > columns[0] as d > ,CAST(to_timestamp(columns[0],'MM/dd/yy HH:mm:ss a') AS TIMESTAMP) > FROM TABLE(dfs.test.`demo.tsv` (type => 'text', extractHeader => false, > fieldDelimiter => '\t', lineDelimiter => '\r\n')) > --) > java.sql.SQLException: SYSTEM ERROR: IllegalArgumentException: Invalid > format: "/19/2015 9:33:39 AM" > {code} > The string "^/19/2015 9:33:39 AM" doesn't exists. Month is already present in > this field in the TSV (so here there is "3/19/2015 9:33:39 AM" in the file > demo.tsv). > If '\r\n' are replaced by '\n' with _sed_ before the request, the result is > correct as well with *lineDelimiter => '\r\n'* as *lineDelimiter => '\n'* or > without function TABLE (there is no error and the date is correctly converted >
[jira] [Created] (DRILL-7588) Function TABLE + option lineDelimiter = '\r\n' eats sometime first char of a row
benj created DRILL-7588: --- Summary: Function TABLE + option lineDelimiter = '\r\n' eats sometime first char of a row Key: DRILL-7588 URL: https://issues.apache.org/jira/browse/DRILL-7588 Project: Apache Drill Issue Type: Bug Components: Functions - Drill Affects Versions: 1.17.0 Environment: With a TSV file ([#demo.tsv.gz] in attachment) generated on _Windows_ (EOL = \r\n). The file contains some special chars like {noformat} http://bouzbal-fans.blogspot.com/search/label/Ã\230£Ã\230®Ã\230¨Ã\230§Ã\230± Ã\230¨Ã\231Ë\206Ã\230²Ã\230¨Ã\230§Ã\231â\200\236 {noformat} The next request sometimes eats the first char of a line {code:sql} --CREATE TABLE dfs.test.`result_pqt` AS ( SELECT columns[0] as d ,CAST(to_timestamp(columns[0],'MM/dd/yy HH:mm:ss a') AS TIMESTAMP) FROM TABLE(dfs.test.`demo.tsv` (type => 'text', extractHeader => false, fieldDelimiter => '\t', lineDelimiter => '\r\n')) --) java.sql.SQLException: SYSTEM ERROR: IllegalArgumentException: Invalid format: "/19/2015 9:33:39 AM" {code} The string "^/19/2015 9:33:39 AM" doesn't exist. The month is already present in this field in the TSV (so here there is "3/19/2015 9:33:39 AM" in the file demo.tsv). If '\r\n' is replaced by '\n' with _sed_ before the request, the result is correct with *lineDelimiter => '\r\n'* as well as with *lineDelimiter => '\n'* or without the TABLE function (there is no error and the date is correctly converted with the to_timestamp function / column d is correct in result_pqt). Keeping '\r\n', moving the line that produces the error elsewhere in demo.tsv can prevent the error (why?). Keeping '\r\n', removing/modifying one or more special chars (like in "thá»\235i trang jean") can prevent the error (why?). I didn't manage to reduce the file demo.tsv further while keeping the problem. Reporter: benj Attachments: demo.tsv.gz, drill_json_profile_tsv.log, drill_tsv.log -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (DRILL-7569) dir0 problem reader - when path with wilcard and column named dir0
benj created DRILL-7569:
---
Summary: dir0 problem reader - when path with wilcard and column named dir0
Key: DRILL-7569
URL: https://issues.apache.org/jira/browse/DRILL-7569
Project: Apache Drill
Issue Type: Bug
Affects Versions: 1.17.0
Reporter: benj

If a file with named columns (like csvh, parquet, json) contains a column named *dir0* (or any dir[0-9]+), it can cause problems when querying with a wildcard in the path.
{code:sql}
apache drill> SELECT * FROM dfs.tmp.`REP/exa.csvh`;
+---------+------+
| dir0    | a    |
+---------+------+
| coldir0 | cola |
+---------+------+
apache drill> SELECT * FROM dfs.tmp.`R*/exa.csvh`;
Error: INTERNAL_ERROR ERROR: Failure while setting up text reader for file file:/tmp/REP/exa.csvh
{code}
The error messages differ depending on the input file type:
{noformat}
CSVH    => Error: INTERNAL_ERROR ERROR: Failure while setting up text reader for file file:...
PARQUET => Error: INTERNAL_ERROR ERROR: Error in parquet record reader. Message: Failure in setting up reader Parquet Metadata:...
JSON    => Error: INTERNAL_ERROR ERROR: org.apache.drill.exec.exception.SchemaChangeException: It's not allowed to have regular field and implicit field share common name dir0. Either change regular field name in datasource, or change the default implicit field names.
{noformat}
Note that the JSON error message is the most relevant and allows faster identification of the problem (even if (to my knowledge) dir* is not modifiable among the default implicit field names).
I know you should avoid using dir0 as a column name. But when creating a table it is "easy" to use a "SELECT *" which will include dir0 (and the other dir*) if the path contains a wildcard.
I have no good idea how to solve this problem, but it would be interesting to find a method to avoid falling into this trap.
Maybe *dir** should not appear automatically with _SELECT *_ but should require an explicit call like _SELECT dir0, dir1, *_ (maybe directed by an option).
Maybe the error messages should be improved.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
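One way to sidestep the collision today could be to enumerate the directories instead of using a wildcard path, so the implicit dir0 column is never generated (untested sketch; `REP2` is a hypothetical second directory, not from the report):
{code:sql}
/* untested sketch: no wildcard in the path, so the data column dir0
   does not collide with Drill's implicit dir0; the directory name is
   carried explicitly instead */
SELECT 'REP'  AS src_dir, t.* FROM dfs.tmp.`REP/exa.csvh`  AS t
UNION ALL
SELECT 'REP2' AS src_dir, t.* FROM dfs.tmp.`REP2/exa.csvh` AS t;
{code}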
[jira] [Created] (DRILL-7568) Strange renaming of duplicate column name
benj created DRILL-7568:
---
Summary: Strange renaming of duplicate column name
Key: DRILL-7568
URL: https://issues.apache.org/jira/browse/DRILL-7568
Project: Apache Drill
Issue Type: Bug
Affects Versions: 1.17.0, 1.16.0, 1.15.0
Reporter: benj

Explicitly listed duplicate column names are automatically renamed by Drill:
{code:java}
apache drill> SELECT 1 a, 2 a, 3 a, 4 a, 5 a, 6 a;
+---+----+----+----+----+----+
| a | a0 | a1 | a2 | a3 | a4 |
+---+----+----+----+----+----+
| 1 | 2  | 3  | 4  | 5  | 6  |
+---+----+----+----+----+----+
{code}
That's OK, this rule seems "logical". BUT with a csvh containing columns a, b and c:
{code:java}
SELECT *, a, a, a, a FROM dfs.tmp.`example.csvh`;
+------+------+------+------+------+------+------+
| a    | b    | c    | a0   | a00  | a1   | a2   |
+------+------+------+------+------+------+------+
| cola | colb | colc | cola | cola | cola | cola |
+------+------+------+------+------+------+------+
{code}
the renaming rule is not applied in the same way. The first duplicate a is renamed *a0* as expected, but the second is renamed *a00* (instead of *a1*). Note that the third is renamed a1 (with an offset of 1 from the expected name), and so on.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
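A way to avoid depending on the automatic renaming at all (untested sketch, using the same example.csvh) is to alias each repeated column explicitly:
{code:sql}
/* untested sketch: with explicit aliases, the output column names no
   longer depend on Drill's duplicate-column renaming rule */
SELECT *, a AS a_1, a AS a_2, a AS a_3, a AS a_4
FROM dfs.tmp.`example.csvh`;
{code}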
[jira] [Comment Edited] (DRILL-7449) memory leak parse_url function
[ https://issues.apache.org/jira/browse/DRILL-7449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17014216#comment-17014216 ] benj edited comment on DRILL-7449 at 1/28/20 11:28 AM:
---
I did a full check and in reality we haven't used drill-url-tools, because it sometimes produces incorrect values on big datasets (due to a memory problem caught inside the UDF?).
EDIT 28/01/2020: Just found a bug in drill-url-tools and proposed a correction. It may correct the problem.
After some other tests, the standard Drill *parse_url* works well (no memory leak) +if the ORDER BY clause is removed+. And note that the memory leak can already appear with url_parse (from drill-url-tools) when using an ORDER BY clause.
The only code that does not cause any critical problem for our use is a regexp of the type:
{code:sql}
SELECT REGEXP_REPLACE(Activity,'^(?:.*:.*@)?([^:]*)(?::.*)?$','$1') As Host
FROM (SELECT REGEXP_REPLACE(NULLIF(Url, ''),'^(?:(?:[^:/?#]+):)?(?://([^/?#]*))(?:[^?#]*)?(?:.*)?','$1') AS Activity FROM ...)
{code}
I don't know why, but from observation the ORDER BY clause produces a number of errors in different contexts with complex requests, and it is sometimes necessary to split the request into 2 distinct requests (one for the SELECT with computations and one for the SELECT with ORDER BY). Note that with the regexp there is no error even with the ORDER BY clause.

was (Author: benj641): I have had a full check and in reality we havn't used the drill-url-tools because it sometimes produce incorrect values on big dataset (due to memory problem catch into UDF ?) . After some other tests, the standard Drill *parse_url* works well (no Memory leak) +if remove the ORDER BY clause+. And note that Memory leaked can already appears with url_parse (from drill-url-tools) if using ORDER BY clause produce already. 
The only code that does not cause any critical problem for our use is regexp of the type: {code:sql} SELECT REGEXP_REPLACE(Activity,'^(?:.*:.*@)?([^:]*)(?::.*)?$','$1') As Host FROM (SELECT REGEXP_REPLACE(NULLIF(Url, ''),'^(?:(?:[^:/?#]+):)?(?://([^/?#]*))(?:[^?#]*)?(?:.*)?','$1') AS Activity FROM ...) {code} Don't know why, but in terms of observation, ORDER BY clause produce number of error of different contexts with complex request and it's sometimes necessary to split the request into 2 distinct requests (one for the SELECT with computations and one for the SELECT with ORDER BY) Note that with the regexp there is no error even with ORDER BY clause. > memory leak parse_url function > -- > > Key: DRILL-7449 > URL: https://issues.apache.org/jira/browse/DRILL-7449 > Project: Apache Drill > Issue Type: Bug > Components: Functions - Drill >Affects Versions: 1.16.0 >Reporter: benj >Assignee: Igor Guzenko >Priority: Major > Attachments: embedded_FullJsonProfile.txt, embedded_sqlline.log.txt, > embedded_sqlline_with_enable_debug_logging.log.txt > > > Requests with *parse_url* works well when the number of treated rows is low > but produce memory leak when number of rows grows (~ between 500 000 and 1 > million) (and for certain number of row sometimes the request works and > sometimes it failed with memory leaks) > Extract from dataset tested: > {noformat} > {"Attributable":true,"Description":"Website has been identified as malicious > by > Bing","FirstReportedDateTime":"2018-03-12T18:49:38Z","IndicatorExpirationDateTime":"2018-04-11T23:33:13Z","IndicatorProvider":"Bing","IndicatorThreatType":"MaliciousUrl","IsPartnerShareable":true,"IsProductLicensed":true,"LastReportedDateTime":"2018-03-12T18:49:38Z","NetworkDestinationAsn":15169,"NetworkDestinationIPv4":"172.217.8.193","NetworkDestinationPort":80,"Tags":["us"],"ThreatDetectionProduct":"ES","TLPLevel":"Amber","Url":"http://pasuruanbloggers.blogspot.ru/2012/12/beginilah-cara-orang-jepang-berpacaran.html","Version":1.5} 
> {"Attributable":true,"Description":"Website has been identified as malicious > by > Bing","FirstReportedDateTime":"2018-03-12T18:14:51Z","IndicatorExpirationDateTime":"2018-04-11T23:33:13Z","IndicatorProvider":"Bing","IndicatorThreatType":"MaliciousUrl","IsPartnerShareable":true,"IsProductLicensed":true,"LastReportedDateTime":"2018-03-12T18:14:51Z","NetworkDestinationAsn":15169,"NetworkDestinationIPv4":"216.58.192.193","NetworkDestinationPort":80,"Tags":["us"],"ThreatDetectionProduct":"ES","TLPLevel":"Amber","Url":"http://pasuruanbloggers.blogspot.ru/2012/12/cara-membuat-widget-slideshow-postingan.html","Version":1.5} > {noformat} > Request tested: > {code:sql} > ALTER SESSION SET `store.format`='parquet'; > ALTER SESSION SET `store.parquet.use_new_reader` = true; > ALTER SESSION SET `store.parquet.compression` = 'snappy'; > ALTER SESSION SET `drill.exec.functions.cast_empty_string_to_null`= true;
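The split into 2 distinct requests mentioned in the comment above could be sketched as follows (untested; `stage_pqt` is a hypothetical intermediate table name, the other names come from the original request):
{code:sql}
/* untested sketch: the first request does the parse_url computation
   without any ORDER BY */
CREATE TABLE dfs.test.`stage_pqt` AS (
  SELECT R.parsed.host AS Domain
  FROM (SELECT parse_url(T.Url) AS parsed FROM dfs.test.`file.json` AS T) AS R
);
/* the second request only sorts the materialized result */
CREATE TABLE dfs.test.`output_pqt` AS (
  SELECT Domain FROM dfs.test.`stage_pqt` ORDER BY Domain
);
{code}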
[jira] [Commented] (DRILL-7449) memory leak parse_url function
[ https://issues.apache.org/jira/browse/DRILL-7449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17019540#comment-17019540 ] benj commented on DRILL-7449:
---
[~arina], I would like to, but it's not possible: it's not a problem of size but a regulatory content issue.
[jira] [Commented] (DRILL-7449) memory leak parse_url function
[ https://issues.apache.org/jira/browse/DRILL-7449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17019508#comment-17019508 ] benj commented on DRILL-7449: - Hi [~IhorHuzenko] I realized that the problem may from input passed to _parse_url_. With the strict repetition of 2 extracted from beginning I can't produce the problem. But I have isolated typical row (from big original data) that can produce the problem when they are many. Others example of possible rows: {noformat} {"Attributable":true,"Description":"Website has been identified as malicious by Bing","FirstReportedDateTime":"2018-03-12T17:40:01Z","IndicatorExpirationDateTime":"2018-04-11T23:39:23Z","IndicatorProvider":"Bing","IndicatorThreatType":"MaliciousUrl","IsPartnerShareable":true,"IsProductLicensed":true,"LastReportedDateTime":"2018-03-12T17:40:01Z","NetworkDestinationAsn":0,"NetworkDestinationIPv4":"255.255.255.255","NetworkDestinationPort":80,"Tags":["??"],"ThreatDetectionProduct":"ES","TLPLevel":"Amber","Url":"http://www.guruvittal.org/lzp/gets.php?hl=Ã%C2%83Â%C2%83Ã%C2%82Â%C2%83Ã%C2%83Â%C2%82Ã%C2%82Â%C2%83Ã%C2%83Â%C2%83Ã%C2%82Â%C2%82Ã%C2%83Â%C2%82Ã%C2%82Â%C2%98Ã%C2%83Â%C2%83Ã%C2%82Â%C2%83Ã%C2%83Â%C2%82Ã%C2%82Â%C2%82Ã%C2%83Â%C2%83Ã%C2%82Â%C2%82Ã%C2%83Â%C2%82Ã%C2%82¹Ã%C2%83Â%C2%83Ã%C2%82Â%C2%83Ã%C2%83Â%C2%82Ã%C2%82Â%C2%83Ã%C2%83Â%C2%83Ã%C2%82Â%C2%82Ã%C2%83Â%C2%82Ã%C2%82Â%C2%9AÃ%C2%83Â%C2%83Ã%C2%82Â%C2%83Ã%C2%83Â%C2%82Ã%C2%82Â%C2%82Ã%C2%83Â%C2%83Ã%C2%82Â%C2%82Ã%C2%83Â%C2%82Ã%C2%82©Ã%C2%83Â%C2%83Ã%C2%82Â%C2%83Ã%C2%83Â%C2%82Ã%C2%82Â%C2%83Ã%C2%83Â%C2%83Ã%C2%82Â%C2%82Ã%C2%83Â%C2%82Ã%C2%82Â%C2%98Ã%C2%83Â%C2%83Ã%C2%82Â%C2%83Ã%C2%83Â%C2%82Ã%C2%82Â%C2%82Ã%C2%83Â%C2%83Ã%C2%82Â%C2%82Ã%C2%83Â%C2%82Ã%C2%82³-Ã%C2%83Â%C2%83Ã%C2%82Â%C2%83Ã%C2%83Â%C2%82Ã%C2%82Â%C2%83Ã%C2%83Â%C2%83Ã%C2%82Â%C2%82Ã%C2%83Â%C2%82Ã%C2%82Â%C2%9AÃ%C2%83Â%C2%83Ã%C2%82Â%C2%83Ã%C2%83Â%C2%82Ã%C2%82Â%C2%82Ã%C2%83Â%C2%83Ã%C2%82Â%C2%82Ã%C2%83Â%C2%82Ã%C2%82©Ã%C2%83Â%C2%83Ã%C2%82Â%C2%83Ã%C2%83Â%C2%82Ã%C2%82Â%C2%83Ã%C2%83Â%C2%83
Ã%C2%82Â%C2%82Ã%C2%83Â%C2%82Ã%C2%82Â%C2%98Ã%C2%83Â%C2%83Ã%C2%82Â%C2%83Ã%C2%83Â%C2%82Ã%C2%82Â%C2%82Ã%C2%83Â%C2%83Ã%C2%82Â%C2%82Ã%C2%83Â%C2%82Ã%C2%82³-Ã%C2%83Â%C2%83Ã%C2%82Â%C2%83Ã%C2%83Â%C2%82Ã%C2%82Â%C2%83Ã%C2%83Â%C2%83Ã%C2%82Â%C2%82Ã%C2%83Â%C2%82Ã%C2%82Â%C2%98Ã%C2%83Â%C2%83Ã%C2%82Â%C2%83Ã%C2%83Â%C2%82Ã%C2%82Â%C2%82Ã%C2%83Â%C2%83Ã%C2%82Â%C2%82Ã%C2%83Â%C2%82Ã%C2%82²Ã%C2%83Â%C2%83Ã%C2%82Â%C2%83Ã%C2%83Â%C2%82Ã%C2%82Â%C2%83Ã%C2%83Â%C2%83Ã%C2%82Â%C2%82Ã%C2%83Â%C2%82Ã%C2%82Â%C2%99Ã%C2%83Â%C2%83Ã%C2%82Â%C2%83Ã%C2%83Â%C2%82Ã%C2%82Â%C2%82Ã%C2%83Â%C2%83Ã%C2%82Â%C2%82Ã%C2%83Â%C2%82Ã%C2%82Â%C2%86","Version":1.5} {"Attributable":true,"Description":"Website has been identified as malicious by Bing","FirstReportedDateTime":"2018-03-12T17:54:33Z","IndicatorExpirationDateTime":"2018-04-11T23:39:23Z","IndicatorProvider":"Bing","IndicatorThreatType":"MaliciousUrl","IsPartnerShareable":true,"IsProductLicensed":true,"LastReportedDateTime":"2018-03-12T17:54:33Z","NetworkDestinationAsn":0,"NetworkDestinationIPv4":"255.255.255.255","NetworkDestinationPort":80,"Tags":["??"],"ThreatDetectionProduct":"ES","TLPLevel":"Amber","Url":"http://www.guruvittal.org/lzp/gets.php?hl=Ã%C2%83Â%C2%83Ã%C2%82Â%C2%83Ã%C2%83Â%C2%82Ã%C2%82Â%C2%98Ã%C2%83Â%C2%83Ã%C2%82Â%C2%82Ã%C2%83Â%C2%82Ã%C2%82·Ã%C2%83Â%C2%83Ã%C2%82Â%C2%83Ã%C2%83Â%C2%82Ã%C2%82¯Ã%C2%83Â%C2%83Ã%C2%82Â%C2%82Ã%C2%83Â%C2%82Ã%C2%82¿Ã%C2%83Â%C2%83Ã%C2%82Â%C2%82Ã%C2%83Â%C2%82Ã%C2%82½?Ã%C2%83Â%C2%83Ã%C2%82Â%C2%83Ã%C2%83Â%C2%82Ã%C2%82Â%C2%98Ã%C2%83Â%C2%83Ã%C2%82Â%C2%82Ã%C2%83Â%C2%82Ã%C2%82²-Ã%C2%83Â%C2%83Ã%C2%82Â%C2%83Ã%C2%83Â%C2%82Ã%C2%82¯Ã%C2%83Â%C2%83Ã%C2%82Â%C2%82Ã%C2%83Â%C2%82Ã%C2%82¿Ã%C2%83Â%C2%83Ã%C2%82Â%C2%82Ã%C2%83Â%C2%82Ã%C2%82½?Ã%C2%83Â%C2%83Ã%C2%82Â%C2%83Ã%C2%83Â%C2%82Ã%C2%82Â%C2%98Ã%C2%83Â%C2%83Ã%C2%82Â%C2%82Ã%C2%83Â%C2%82Ã%C2%82¨Ã%C2%83Â%C2%83Ã%C2%82Â%C2%83Ã%C2%83Â%C2%82Ã%C2%82¯Ã%C2%83Â%C2%83Ã%C2%82Â%C2%82Ã%C2%83Â%C2%82Ã%C2%82¿Ã%C2%83Â%C2%83Ã%C2%82Â%C2%82Ã%C2%83Â%C2%82Ã%C2%82½?Ã%C2%83Â%C2%83Ã%C2%82Â%C2%83Ã%C2%83Â%C2%82Ã%C2%82Â%C2%98Ã%C2%83Â
%C2%83Ã%C2%82Â%C2%82Ã%C2%83Â%C2%82Ã%C2%82±Ã%C2%83Â%C2%83Ã%C2%82Â%C2%83Ã%C2%83Â%C2%82Ã%C2%82¯Ã%C2%83Â%C2%83Ã%C2%82Â%C2%82Ã%C2%83Â%C2%82Ã%C2%82¿Ã%C2%83Â%C2%83Ã%C2%82Â%C2%82Ã%C2%83Â%C2%82Ã%C2%82½?-Ã%C2%83Â%C2%83Ã%C2%82Â%C2%83Ã%C2%83Â%C2%82Ã%C2%82Â%C2%98Ã%C2%83Â%C2%83Ã%C2%82Â%C2%82Ã%C2%83Â%C2%82Ã%C2%82¹Ã%C2%83Â%C2%83Ã%C2%82Â%C2%83Ã%C2%83Â%C2%82Ã%C2%82Â%C2%98Ã%C2%83Â%C2%83Ã%C2%82Â%C2%82Ã%C2%83Â%C2%82Ã%C2%82±Ã%C2%83Â%C2%83Ã%C2%82Â%C2%83Ã%C2%83Â%C2%82Ã%C2%82Â%C2%98Ã%C2%83Â%C2%83Ã%C2%82Â%C2%82Ã%C2%83Â%C2%82Ã%C2%82¨","Version":1.5} {"Attributable":true,"Description":"Website has been identified as malicious by
[jira] [Commented] (DRILL-7539) Aggregate expression is illegal in GROUP BY clause
[ https://issues.apache.org/jira/browse/DRILL-7539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17019310#comment-17019310 ] benj commented on DRILL-7539:
---
Please note that it is also possible to bypass the problem by fully prefixing the columns used in GROUP BY. Example (in the same way as before):
{code:sql}
/* OK because the GROUP BY is on x.b (not only b) */
apache drill 1.17> SELECT a, any_value(b) AS b FROM (SELECT 'a' a, 1 b) x GROUP BY a, x.b;
+---+---+
| a | b |
+---+---+
| a | 1 |
+---+---+
{code}

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (DRILL-7539) Aggregate expression is illegal in GROUP BY clause
benj created DRILL-7539:
---
Summary: Aggregate expression is illegal in GROUP BY clause
Key: DRILL-7539
URL: https://issues.apache.org/jira/browse/DRILL-7539
Project: Apache Drill
Issue Type: Bug
Components: SQL Parser
Affects Versions: 1.17.0
Reporter: benj

When using a grouped field in an aggregate function, it works unless the aggregate is aliased with the original name of the field.
Example (a minimalist example with no real sense, but based on a structure actually used (with a more complex GROUP BY part)):
{code:sql}
/* OK because the aggregate is on b, which is not a grouped field */
apache drill 1.17> SELECT a, any_value(b) AS b FROM (SELECT 'a' a, 1 b) x GROUP BY a;
+---+---+
| a | b |
+---+---+
| a | 1 |
+---+---+

/* NOK because the aggregate on grouped field b is aliased to b (the name used in the GROUP BY) */
apache drill 1.17> SELECT a, any_value(b) AS b FROM (SELECT 'a' a, 1 b) x GROUP BY a, b;
Error: VALIDATION ERROR: From line 1, column 11 to line 1, column 16: Aggregate expression is illegal in GROUP BY clause

/* OK as the aggregate on grouped field b is aliased to c */
apache drill 1.17> SELECT a, any_value(b) AS c FROM (SELECT 'a' a, 1 b) x GROUP BY a, b;
+---+---+
| a | c |
+---+---+
| a | 1 |
+---+---+
{code}
This is a problem that is easy to work around, but it's also easy to get caught by. And the bypass sometimes requires an additional level of SELECT, which is rarely desired.
Tested against postgres, which doesn't have this problem.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
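The "additional level of SELECT" bypass mentioned above could be sketched like this (untested): alias the aggregate to a temporary name inside, then rename it back outside:
{code:sql}
/* untested sketch: the inner alias c avoids the collision with the
   grouped column b; the outer SELECT restores the desired name */
SELECT a, c AS b
FROM (
  SELECT a, any_value(b) AS c
  FROM (SELECT 'a' a, 1 b) x
  GROUP BY a, b
) y;
{code}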
[jira] [Commented] (DRILL-7449) memory leak parse_url function
[ https://issues.apache.org/jira/browse/DRILL-7449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17016797#comment-17016797 ] benj commented on DRILL-7449:
---
Hi [~IhorHuzenko], I have enabled debug logging and the result is here: [^embedded_sqlline_with_enable_debug_logging.log.txt]
[jira] [Updated] (DRILL-7449) memory leak parse_url function
[ https://issues.apache.org/jira/browse/DRILL-7449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] benj updated DRILL-7449:
---
Attachment: embedded_sqlline_with_enable_debug_logging.log.txt
[jira] [Comment Edited] (DRILL-7449) memory leak parse_url function
[ https://issues.apache.org/jira/browse/DRILL-7449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17016064#comment-17016064 ] benj edited comment on DRILL-7449 at 1/15/20 3:07 PM:
---
[~IhorHuzenko], please find attached (execution with leak from my local machine, Drill 1.17 embedded on xubuntu 18.04):
- [^embedded_FullJsonProfile.txt]
- [^embedded_sqlline.log.txt]

The physical plan:
{noformat}
00-00 Screen : rowType = RecordType(VARCHAR(255) Fragment, BIGINT Number of records written): rowcount = 5845417.0, cumulative cost = {7.07295457E7 rows, 7.424585516618018E8 cpu, 5.985707595E9 io, 4.7885656064E10 network, 4.6763336E7 memory}, id = 739
00-01   Project(Fragment=[$0], Number of records written=[$1]) : rowType = RecordType(VARCHAR(255) Fragment, BIGINT Number of records written): rowcount = 5845417.0, cumulative cost = {7.0145004E7 rows, 7.418740099618018E8 cpu, 5.985707595E9 io, 4.7885656064E10 network, 4.6763336E7 memory}, id = 738
00-02     Writer : rowType = RecordType(VARCHAR(255) Fragment, BIGINT Number of records written): rowcount = 5845417.0, cumulative cost = {6.4299587E7 rows, 7.301831759618018E8 cpu, 5.985707595E9 io, 4.7885656064E10 network, 4.6763336E7 memory}, id = 737
00-03       ProjectAllowDup(Domain=[$0]) : rowType = RecordType(ANY Domain): rowcount = 5845417.0, cumulative cost = {5.845417E7 rows, 7.243377589618018E8 cpu, 5.985707595E9 io, 4.7885656064E10 network, 4.6763336E7 memory}, id = 736
00-04         Project(Domain=[$0]) : rowType = RecordType(ANY Domain): rowcount = 5845417.0, cumulative cost = {5.2608753E7 rows, 7.184923419618018E8 cpu, 5.985707595E9 io, 4.7885656064E10 network, 4.6763336E7 memory}, id = 735
00-05           SingleMergeExchange(sort0=[0]) : rowType = RecordType(ANY Domain): rowcount = 5845417.0, cumulative cost = {4.6763336E7 rows, 7.126469249618018E8 cpu, 5.985707595E9 io, 4.7885656064E10 network, 4.6763336E7 memory}, id = 734
01-01             OrderedMuxExchange(sort0=[0]) : rowType = RecordType(ANY Domain): rowcount = 5845417.0, cumulative cost = {4.0917919E7 rows, 6.658835889618018E8 cpu, 5.985707595E9 io, 2.3942828032E10 network, 4.6763336E7 memory}, id = 733
02-01               SelectionVectorRemover : rowType = RecordType(ANY Domain): rowcount = 5845417.0, cumulative cost = {3.5072502E7 rows, 6.600381719618018E8 cpu, 5.985707595E9 io, 2.3942828032E10 network, 4.6763336E7 memory}, id = 732
02-02                 Sort(sort0=[$0], dir0=[ASC]) : rowType = RecordType(ANY Domain): rowcount = 5845417.0, cumulative cost = {2.9227085E7 rows, 6.541927549618018E8 cpu, 5.985707595E9 io, 2.3942828032E10 network, 4.6763336E7 memory}, id = 731
02-03                   HashToRandomExchange(dist0=[[$0]]) : rowType = RecordType(ANY Domain): rowcount = 5845417.0, cumulative cost = {2.3381668E7 rows, 1.28599174E8 cpu, 5.985707595E9 io, 2.3942828032E10 network, 0.0 memory}, id = 730
03-01                     Project(Domain=[ITEM($0, 'host')]) : rowType = RecordType(ANY Domain): rowcount = 5845417.0, cumulative cost = {1.7536251E7 rows, 3.5072502E7 cpu, 5.985707595E9 io, 0.0 network, 0.0 memory}, id = 729
03-02                       Project(parsed=[PARSE_URL($0)]) : rowType = RecordType(ANY parsed): rowcount = 5845417.0, cumulative cost = {1.1690834E7 rows, 2.9227085E7 cpu, 5.985707595E9 io, 0.0 network, 0.0 memory}, id = 728
03-03                         Scan(table=[[dfs, tmp, fbingredagg.bigcopy.json]], groupscan=[EasyGroupScan [selectionRoot=file:/tmp/fbingredagg.bigcopy.json, numFiles=1, columns=[`Url`], files=[file:/tmp/fbingredagg.bigcopy.json], schema=null]]) : rowType = RecordType(ANY Url): rowcount = 5845417.0, cumulative cost = {5845417.0 rows, 5845417.0 cpu, 5.985707595E9 io, 0.0 network, 0.0 memory}, id = 727
{noformat}
And the operator profile. Note that Rows is 8 695 808 although there are 8 999 940 rows in the file:
{noformat}
Operator ID | Type    | Avg Setup Time | Max Setup Time | Avg Process Time | Max Process Time | Min Wait Time | Avg Wait Time | Max Wait Time | % Fragment Time | % Query Time | Rows | Avg Peak Memory | Max Peak Memory
00-xx-00    | SCREEN  | 0,000s | 0,000s | 0,000s | 0,000s | 0,000s | 0,000s | 0,000s | 0,94% | 0,00% | 0 | - | -
00-xx-01    | PROJECT | 0,000s
0,000s 0,000s 0,000s 0,000s 0,000s 0,000s 2,37% 0,00% 0 - - 00-xx-02PARQUET_WRITER 0,000s 0,000s 0,000s 0,000s 0,000s 0,000s 0,000s 6,08% 0,00% 0 - - 00-xx-03PROJECT_ALLOW_DUP 0,000s 0,000s 0,000s 0,000s 0,000s 0,000s 0,000s 16,61% 0,00% 0 52KB52KB 00-xx-04PROJECT 0,001s 0,001s 0,000s 0,000s 0,000s 0,000s 0,000s 35,03% 0,00% 0 52KB52KB 00-xx-05MERGING_RECEIVER0,000s 0,000s 0,000s 0,000s 40,382s 40,382s 40,382s 38,96% 0,00% 0 52KB52KB 01-xx-00
[jira] [Updated] (DRILL-7449) memory leak parse_url function
[ https://issues.apache.org/jira/browse/DRILL-7449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] benj updated DRILL-7449: Attachment: embedded_sqlline.log.txt embedded_FullJsonProfile.txt > memory leak parse_url function > -- > > Key: DRILL-7449 > URL: https://issues.apache.org/jira/browse/DRILL-7449 > Project: Apache Drill > Issue Type: Bug > Components: Functions - Drill >Affects Versions: 1.16.0 >Reporter: benj >Assignee: Igor Guzenko >Priority: Major > Attachments: embedded_FullJsonProfile.txt, embedded_sqlline.log.txt > > > Requests with *parse_url* works well when the number of treated rows is low > but produce memory leak when number of rows grows (~ between 500 000 and 1 > million) (and for certain number of row sometimes the request works and > sometimes it failed with memory leaks) > Extract from dataset tested: > {noformat} > {"Attributable":true,"Description":"Website has been identified as malicious > by > Bing","FirstReportedDateTime":"2018-03-12T18:49:38Z","IndicatorExpirationDateTime":"2018-04-11T23:33:13Z","IndicatorProvider":"Bing","IndicatorThreatType":"MaliciousUrl","IsPartnerShareable":true,"IsProductLicensed":true,"LastReportedDateTime":"2018-03-12T18:49:38Z","NetworkDestinationAsn":15169,"NetworkDestinationIPv4":"172.217.8.193","NetworkDestinationPort":80,"Tags":["us"],"ThreatDetectionProduct":"ES","TLPLevel":"Amber","Url":"http://pasuruanbloggers.blogspot.ru/2012/12/beginilah-cara-orang-jepang-berpacaran.html","Version":1.5} > {"Attributable":true,"Description":"Website has been identified as malicious > by > 
Bing","FirstReportedDateTime":"2018-03-12T18:14:51Z","IndicatorExpirationDateTime":"2018-04-11T23:33:13Z","IndicatorProvider":"Bing","IndicatorThreatType":"MaliciousUrl","IsPartnerShareable":true,"IsProductLicensed":true,"LastReportedDateTime":"2018-03-12T18:14:51Z","NetworkDestinationAsn":15169,"NetworkDestinationIPv4":"216.58.192.193","NetworkDestinationPort":80,"Tags":["us"],"ThreatDetectionProduct":"ES","TLPLevel":"Amber","Url":"http://pasuruanbloggers.blogspot.ru/2012/12/cara-membuat-widget-slideshow-postingan.html","Version":1.5} > {noformat} > Request tested: > {code:sql} > ALTER SESSION SET `store.format`='parquet'; > ALTER SESSION SET `store.parquet.use_new_reader` = true; > ALTER SESSION SET `store.parquet.compression` = 'snappy'; > ALTER SESSION SET `drill.exec.functions.cast_empty_string_to_null`= true; > ALTER SESSION SET `store.json.all_text_mode` = true; > ALTER SESSION SET `exec.enable_union_type` = true; > ALTER SESSION SET `store.json.all_text_mode` = true; > CREATE TABLE dfs.test.`output_pqt` AS > ( > SELECT R.parsed.host AS Domain > FROM ( > SELECT parse_url(T.Url) AS parsed > FROM dfs.test.`file.json` AS T > ) AS R > ORDER BY Domain > ); > {code} > > Result when memory leak: > {noformat} > Error: SYSTEM ERROR: IllegalStateException: Memory was leaked by query. > Memory leaked: (256) > Allocator(frag:3:0) 300/256/9337280/300 (res/actual/peak/limit) > Fragment 3:0 > Please, refer to logs for more information. > [Error Id: 3ffa5b43-0dde-4518-bb5a-ea3aab97f3d4 on servor01:31010] > (java.lang.IllegalStateException) Memory was leaked by query. 
Memory > leaked: (256) > Allocator(frag:3:0) 300/256/9337280/300 (res/actual/peak/limit) > org.apache.drill.exec.memory.BaseAllocator.close():520 > org.apache.drill.exec.ops.FragmentContextImpl.suppressingClose():552 > org.apache.drill.exec.ops.FragmentContextImpl.close():546 > > org.apache.drill.exec.work.fragment.FragmentExecutor.closeOutResources():386 > org.apache.drill.exec.work.fragment.FragmentExecutor.cleanup():214 > org.apache.drill.exec.work.fragment.FragmentExecutor.run():329 > org.apache.drill.common.SelfCleaningRunnable.run():38 > java.util.concurrent.ThreadPoolExecutor.runWorker():1149 > java.util.concurrent.ThreadPoolExecutor$Worker.run():624 > java.lang.Thread.run():748 (state=,code=0) > java.sql.SQLException: SYSTEM ERROR: IllegalStateException: Memory was leaked > by query. Memory leaked: (256) > Allocator(frag:3:0) 300/256/9337280/300 (res/actual/peak/limit) > Fragment 3:0 > Please, refer to logs for more information. > [Error Id: 3ffa5b43-0dde-4518-bb5a-ea3aab97f3d4 on servor01:31010] > (java.lang.IllegalStateException) Memory was leaked by query. Memory > leaked: (256) > Allocator(frag:3:0) 300/256/9337280/300 (res/actual/peak/limit) > org.apache.drill.exec.memory.BaseAllocator.close():520 > org.apache.drill.exec.ops.FragmentContextImpl.suppressingClose():552 > org.apache.drill.exec.ops.FragmentContextImpl.close():546 > > org.apache.drill.exec.work.fragment.FragmentExecutor.closeOutResources():386 >
[jira] [Commented] (DRILL-7449) memory leak parse_url function
[ https://issues.apache.org/jira/browse/DRILL-7449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17014960#comment-17014960 ] benj commented on DRILL-7449: - Hi [~IhorHuzenko], the problem doesn't appear on every run; sometimes (with exactly the same data) it works 5 times before crashing. Tested with the official 1.17 on a small 3-node cluster (each ~48 procs / 128 GB, DRILL_HEAP=15G, DRILL_MAX_DIRECT_MEMORY=80G) with a file of 688 MB / 1 118 320 JSON records. On the cluster, comparing the profiles of correct and crashed executions, I can see that: - the crash appears at the "02-xx-02 - EXTERNAL_SORT" level - on "02-xx-03 - UNORDERED_RECEIVER": in a correct execution, 99% of the Max Records are concentrated on 1 of the 8 minor fragments and the cumulative total is correct; in a crashed execution, Max Records are distributed roughly evenly over the 8 minor fragments and the cumulative total is incorrect (lower) (already incorrect in "03-xx-02 - PROJECT" and "03-xx-00 - JSON_SUB_SCAN"). On my local machine (1.17 too, 8 procs / 32 GB), in embedded mode, comparing the profiles of correct and crashed executions, I can see that: - the crash appears at the "02-xx-02 - EXTERNAL_SORT" level - the difference is on "03-xx-00 - JSON_SUB_SCAN": the crashed execution doesn't have the right number of Max Records - for "02-xx-03 - UNORDERED_RECEIVER", in both correct and crashed runs, Max Records are distributed roughly evenly over the 6 minor fragments. Example of log data from a crashed execution on the cluster: {noformat} 2020-01-14 08:22:33,681 [21e285b6-4d53-58fd-8a4d-dedc0cbfb86a:foreman] INFO o.a.drill.exec.work.foreman.Foreman - Query text for query with id 21e285b6-4d53-58fd-8a4d-dedc0cbfb86a issued by anonymous: CREATE TABLE dfs.test.`output_pqt` AS ( SELECT R.parsed.host AS D FROM (SELECT parse_url(T.Url) AS parsed FROM dfs.test.`demo2.big.json` AS T) AS R ORDER BY D ) 2020-01-14 08:22:33,724 [21e285b6-4d53-58fd-8a4d-dedc0cbfb86a:foreman] INFO o.a.d.e.p.s.h.CreateTableHandler - Creating persistent
table [output_pqt]. 2020-01-14 08:22:33,779 [21e285b6-4d53-58fd-8a4d-dedc0cbfb86a:frag:2:3] INFO o.a.d.e.w.fragment.FragmentExecutor - 21e285b6-4d53-58fd-8a4d-dedc0cbfb86a:2:3: State change requested AWAITING_ALLOCATION --> RUNNING 2020-01-14 08:22:33,779 [21e285b6-4d53-58fd-8a4d-dedc0cbfb86a:frag:2:7] INFO o.a.d.e.w.fragment.FragmentExecutor - 21e285b6-4d53-58fd-8a4d-dedc0cbfb86a:2:7: State change requested AWAITING_ALLOCATION --> RUNNING 2020-01-14 08:22:33,779 [21e285b6-4d53-58fd-8a4d-dedc0cbfb86a:frag:2:5] INFO o.a.d.e.w.fragment.FragmentExecutor - 21e285b6-4d53-58fd-8a4d-dedc0cbfb86a:2:5: State change requested AWAITING_ALLOCATION --> RUNNING 2020-01-14 08:22:33,780 [21e285b6-4d53-58fd-8a4d-dedc0cbfb86a:frag:2:7] INFO o.a.d.e.w.f.FragmentStatusReporter - 21e285b6-4d53-58fd-8a4d-dedc0cbfb86a:2:7: State to report: RUNNING 2020-01-14 08:22:33,780 [21e285b6-4d53-58fd-8a4d-dedc0cbfb86a:frag:2:3] INFO o.a.d.e.w.f.FragmentStatusReporter - 21e285b6-4d53-58fd-8a4d-dedc0cbfb86a:2:3: State to report: RUNNING 2020-01-14 08:22:33,780 [21e285b6-4d53-58fd-8a4d-dedc0cbfb86a:frag:2:5] INFO o.a.d.e.w.f.FragmentStatusReporter - 21e285b6-4d53-58fd-8a4d-dedc0cbfb86a:2:5: State to report: RUNNING 2020-01-14 08:22:33,782 [21e285b6-4d53-58fd-8a4d-dedc0cbfb86a:frag:1:2] INFO o.a.d.e.w.fragment.FragmentExecutor - 21e285b6-4d53-58fd-8a4d-dedc0cbfb86a:1:2: State change requested AWAITING_ALLOCATION --> RUNNING 2020-01-14 08:22:33,782 [21e285b6-4d53-58fd-8a4d-dedc0cbfb86a:frag:1:2] INFO o.a.d.e.w.f.FragmentStatusReporter - 21e285b6-4d53-58fd-8a4d-dedc0cbfb86a:1:2: State to report: RUNNING 2020-01-14 08:22:33,787 [21e285b6-4d53-58fd-8a4d-dedc0cbfb86a:frag:0:0] INFO o.a.d.e.w.fragment.FragmentExecutor - 21e285b6-4d53-58fd-8a4d-dedc0cbfb86a:0:0: State change requested AWAITING_ALLOCATION --> RUNNING 2020-01-14 08:22:33,787 [21e285b6-4d53-58fd-8a4d-dedc0cbfb86a:frag:0:0] INFO o.a.d.e.w.f.FragmentStatusReporter - 21e285b6-4d53-58fd-8a4d-dedc0cbfb86a:0:0: State to report: RUNNING 2020-01-14 
08:22:41,672 [BitServer-2] INFO o.a.d.e.w.fragment.FragmentExecutor - 21e285b6-4d53-58fd-8a4d-dedc0cbfb86a:0:0: State change requested RUNNING --> CANCELLATION_REQUESTED 2020-01-14 08:22:41,673 [BitServer-2] INFO o.a.d.e.w.f.FragmentStatusReporter - 21e285b6-4d53-58fd-8a4d-dedc0cbfb86a:0:0: State to report: CANCELLATION_REQUESTED 2020-01-14 08:22:41,674 [BitServer-2] INFO o.a.d.e.w.fragment.FragmentExecutor - 21e285b6-4d53-58fd-8a4d-dedc0cbfb86a:1:2: State change requested RUNNING --> CANCELLATION_REQUESTED 2020-01-14 08:22:41,674 [BitServer-2] INFO o.a.d.e.w.f.FragmentStatusReporter - 21e285b6-4d53-58fd-8a4d-dedc0cbfb86a:1:2: State to report: CANCELLATION_REQUESTED 2020-01-14 08:22:41,675 [BitServer-2] INFO o.a.d.e.w.fragment.FragmentExecutor - 21e285b6-4d53-58fd-8a4d-dedc0cbfb86a:2:3: State change requested RUNNING --> CANCELLATION_REQUESTED 2020-01-14 08:22:41,675
[jira] [Created] (DRILL-7524) Distinct on array with any_value
benj created DRILL-7524: --- Summary: Distinct on array with any_value Key: DRILL-7524 URL: https://issues.apache.org/jira/browse/DRILL-7524 Project: Apache Drill Issue Type: Bug Components: Functions - Drill Affects Versions: 1.17.0 Reporter: benj Attachments: IndexOutOfBoundsException.txt, NegativeArraySizeException.txt

As Drill doesn't allow GROUP BY, DISTINCT or ORDER BY on complex types, using the any_value aggregate function may appear to be a solution. But some problems appear. With a dataset of 223664 rows like:
{code:sql}
SELECT Url, Tags FROM dfs.tmp.`data.json` LIMIT 1;
+-----------------------------------------+--------+
| Url                                     | Tags   |
+-----------------------------------------+--------+
| http://000.dijiushipindian.com/feed.rss | ["us"] |
+-----------------------------------------+--------+
{code}
and with our own UDF to_string that only does
{code:java}
@Param FieldReader input;
...
String rowString = input.readObject().toString();
...
{code}
{code:sql}
SELECT any_value(T.Tags) Tags FROM dfs.tmp.`data.json` GROUP BY NULLIF(UPPER(to_string(T.Tags)),'') /* WORKS WELL */;
+--------+
| Tags   |
+--------+
| ["us"] |
| ["cn"] |
...

SELECT Url, any_value(T.Tags) Tags FROM dfs.tmp.`data.json` GROUP BY Url, NULLIF(UPPER(to_string(T.Tags)),'') /* NOK */;
java.lang.NegativeArraySizeException
{code}
Sometimes the error can be different (details in attachment): java.lang.IndexOutOfBoundsException: index: 1634787136, length: 7629168 (expected: range(0, 8388608)) And before producing the error, the output shows some results like the ones below:
{code}
+---------------------------------------------------------------------------------------+------+
| Url                                                                                   | Tags |
+---------------------------------------------------------------------------------------+------+
| http://everythiing4u.blogspot.com.es/2013/04/omg-proposal-fail.html                   | []   |
| http://everythiing4u.blogspot.com.es/2013/04/omg-this-dude-just-owned-his-friend.html | []   |
{code}
This result is not correct because the field Tags is empty, although that is never the case in the source file. So maybe there is a problem with the aggregate function any_value. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (DRILL-7449) memory leak parse_url function
[ https://issues.apache.org/jira/browse/DRILL-7449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17014216#comment-17014216 ] benj commented on DRILL-7449: - I have done a full check and in reality we haven't used drill-url-tools, because it sometimes produces incorrect values on big datasets (due to a memory problem caught inside the UDF?). After some other tests, the standard Drill *parse_url* works well (no memory leak) +if the ORDER BY clause is removed+. And note that the memory leak can also appear with url_parse (from drill-url-tools) when using an ORDER BY clause. The only code that does not cause any critical problem for our use is a regexp of the type:
{code:sql}
SELECT REGEXP_REPLACE(Activity,'^(?:.*:.*@)?([^:]*)(?::.*)?$','$1') AS Host
FROM (SELECT REGEXP_REPLACE(NULLIF(Url, ''),'^(?:(?:[^:/?#]+):)?(?://([^/?#]*))(?:[^?#]*)?(?:.*)?','$1') AS Activity FROM ...)
{code}
I don't know why, but from observation, the ORDER BY clause produces a number of errors in different contexts with complex requests, and it's sometimes necessary to split the request into 2 distinct requests (one for the SELECT with the computations and one for the SELECT with the ORDER BY). Note that with the regexp there is no error even with the ORDER BY clause.
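The "split into 2 distinct requests" workaround described in this issue (compute first, sort separately) can be sketched as follows; this is only an illustration following the pattern of the queries quoted in this issue, and the intermediate table name `hosts_unsorted_pqt` is hypothetical, not from the original report:
{code:sql}
-- Step 1: compute the domains without any ORDER BY (no External Sort involved)
CREATE TABLE dfs.test.`hosts_unsorted_pqt` AS
(
  SELECT R.parsed.host AS Domain
  FROM (SELECT parse_url(T.Url) AS parsed FROM dfs.test.`file.json` AS T) AS R
);
-- Step 2: sort the materialized intermediate table in a separate request
SELECT Domain FROM dfs.test.`hosts_unsorted_pqt` ORDER BY Domain;
{code}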
[jira] [Updated] (DRILL-7519) Error on case when different branches are arrays of same type but built differently
[ https://issues.apache.org/jira/browse/DRILL-7519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] benj updated DRILL-7519: Attachment: full_log_DRILL7519.log -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (DRILL-7449) memory leak parse_url function
[ https://issues.apache.org/jira/browse/DRILL-7449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17012874#comment-17012874 ] benj commented on DRILL-7449: - In the meantime, and in case it helps someone, I have found this page [https://www.r-bloggers.com/two-new-apache-drill-udfs-for-processing-urils-and-internet-domain-names/] and am now using the function _url_parse_ from [https://github.com/hrbrmstr/drill-url-tools], which uses [http://galimatias.mola.io/]
[jira] [Created] (DRILL-7519) Error on case when different branches are arrays of same type but built differently
benj created DRILL-7519: --- Summary: Error on CASE when different branches are arrays of the same type but built differently Key: DRILL-7519 URL: https://issues.apache.org/jira/browse/DRILL-7519 Project: Apache Drill Issue Type: Bug Affects Versions: 1.17.0 Reporter: benj With 3 arrays built like:
{code:sql}
SELECT T.s, typeof(T.s), modeof(T.s)
     , T.j, typeof(T.j), modeof(T.j)
     , T.j2.tag, typeof(T.j2.tag), modeof(T.j2.tag)
FROM (
  SELECT split('a,b',',') AS s
       , convert_fromJSON('["c","d"]') AS j
       , convert_fromJSON('{"tag":["e","f"]}') AS j2
) AS T;
+-----------+---------+--------+-----------+---------+--------+-----------+---------+--------+
|     s     | EXPR$1  | EXPR$2 |     j     | EXPR$4  | EXPR$5 |  EXPR$6   | EXPR$7  | EXPR$8 |
+-----------+---------+--------+-----------+---------+--------+-----------+---------+--------+
| ["a","b"] | VARCHAR | ARRAY  | ["c","d"] | VARCHAR | ARRAY  | ["e","f"] | VARCHAR | ARRAY  |
+-----------+---------+--------+-----------+---------+--------+-----------+---------+--------+
{code}
it is possible to use *s* and *j* as branches of the same CASE, but mixing *s* or *j* with *j2.tag* behaves inconsistently:
{code:sql}
SELECT CASE WHEN true THEN T.s ELSE T.j END
     , CASE WHEN false THEN T.s ELSE T.j END
FROM (
  SELECT split('a,b',',') AS s
       , convert_fromJSON('["c","d"]') AS j
       , convert_fromJSON('{"tag":["e","f"]}') AS j2
) AS T;
+-----------+-----------+
|  EXPR$0   |  EXPR$1   |
+-----------+-----------+
| ["a","b"] | ["c","d"] |
+-----------+-----------+

SELECT CASE WHEN true THEN T.j2.tag ELSE T.s /*idem with T.j*/ END
     , CASE WHEN false THEN T.j2.tag ELSE T.s /*idem with T.j*/ END
FROM (SELECT split('a,b',',') AS s, convert_fromJSON('["c","d"]') AS j, convert_fromJSON('{"tag":["e","f"]}') AS j2) AS T;
+-----------+-----------+
|  EXPR$0   |  EXPR$1   |
+-----------+-----------+
| ["e","f"] | ["a","b"] |
+-----------+-----------+

/* But surprisingly */
SELECT CASE WHEN false THEN T.j2.tag ELSE T.s /*idem with T.j*/ END
FROM (SELECT split('a,b',',') AS s, convert_fromJSON('["c","d"]') AS j, convert_fromJSON('{"tag":["e","f"]}') AS j2) AS T;
Error: SYSTEM ERROR: NullPointerException

/* and */
SELECT CASE WHEN true THEN T.j2.tag ELSE T.s /*idem with T.j*/ END
FROM (SELECT split('a,b',',') AS s, convert_fromJSON('["c","d"]') AS j, convert_fromJSON('{"tag":["e","f"]}') AS j2) AS T;
+-----------+
|  EXPR$0   |
+-----------+
| ["e","f"] |
+-----------+
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
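The DRILL-7519 report above expects the three arrays to be interchangeable because they share the same surface type (VARCHAR ARRAY). As a toy illustration of that expectation (a Python sketch, not Drill's actual planner; `surface_type` and `unify_case_branches` are hypothetical helpers), a type checker that unifies CASE branches on (type, mode) accepts all three, regardless of whether the array came from split or convert_fromJSON:

```python
import json

def surface_type(value):
    # Roughly what Drill's typeof()/modeof() report: element type plus mode.
    if isinstance(value, list):
        elem = {type(v).__name__ for v in value}
        return ("VARCHAR" if elem == {"str"} else "OTHER", "ARRAY")
    return (type(value).__name__.upper(), "SCALAR")

def unify_case_branches(*branches):
    # CASE requires every branch to share one result type; return it or fail.
    types = {surface_type(b) for b in branches}
    if len(types) != 1:
        raise TypeError("CASE branches disagree: %s" % types)
    return types.pop()

s = "a,b".split(",")                             # like split('a,b', ',')
j = json.loads('["c","d"]')                      # like convert_fromJSON('["c","d"]')
j2_tag = json.loads('{"tag":["e","f"]}')["tag"]  # like convert_fromJSON(...).tag

unified = unify_case_branches(s, j, j2_tag)      # all three unify
```

Under this surface-type view the NullPointerException is unexpected; the failure presumably comes from internal representation differences that typeof()/modeof() do not expose.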
[jira] [Created] (DRILL-7516) count(*) on empty JSON produces nothing
benj created DRILL-7516: --- Summary: count(*) on empty JSON produces nothing Key: DRILL-7516 URL: https://issues.apache.org/jira/browse/DRILL-7516 Project: Apache Drill Issue Type: Bug Components: Storage - JSON Affects Versions: 1.17.0 Reporter: benj With 2 empty files:
{code:bash}
touch 0.csv
touch 0.json
{code}
count(*) does not produce the same result:
{code:sql}
apache drill> select count(*) from dfs.TEST.`0.json`;
+--------+
| EXPR$0 |
+--------+
+--------+
No rows selected (0.151 seconds)

apache drill> select count(*) from dfs.TEST.`0.csv`;
+--------+
| EXPR$0 |
+--------+
| 0      |
+--------+
1 row selected (0.415 seconds)
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
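SQL defines COUNT(*) as an aggregate that returns exactly one row, 0, over empty input, which is what the CSV reader above does and the JSON reader does not. A minimal Python sketch of the expected semantics (`count_records` is an illustrative helper, not Drill code; the StringIO objects stand in for the empty files):

```python
import io

def count_records(lines):
    # COUNT(*) semantics: an aggregate over zero input rows yields 0,
    # never "no rows".
    return sum(1 for line in lines if line.strip())

n_json = count_records(io.StringIO(""))  # stands in for the empty 0.json
n_csv = count_records(io.StringIO(""))   # stands in for the empty 0.csv
```

Both readers should agree on 0 here.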
[jira] [Created] (DRILL-7515) ORDER BY clause produces error on GROUP BY with array field managed with any_value
benj created DRILL-7515: --- Summary: ORDER BY clause produce error on GROUP BY with array field manager with any_value Key: DRILL-7515 URL: https://issues.apache.org/jira/browse/DRILL-7515 Project: Apache Drill Issue Type: Bug Components: Execution - Data Types Affects Versions: 1.17.0 Reporter: benj With a parquet containing an array field, for example: {code:sql} apache drill 1.17> CREATE TABLE dfs.TEST.`example_any_pqt` AS (SELECT 'foo' AS a, 'bar' b, split('foo,bar',',') as c); apache drill 1.17> SELECT *, typeof(c) AS type, sqltypeof(c) AS sql_type FROM dfs.TEST.`example_any_pqt`; +-+-+---+-+--+ | a | b | c | type | sql_type | +-+-+---+-+--+ | foo | bar | ["foo","bar"] | VARCHAR | ARRAY| +-+-+---+-+--+ {code} The next request work well {code:sql} apache drill 1.17> SELECT * FROM (SELECT a, any_value(c) FROM dfs.TEST.`example_any_pqt` GROUP BY a) ORDER BY a; +-+---+ | a |EXPR$1 | +-+---+ | foo | ["foo","bar"] | +-+---+ {code} But the next request (with the same struct as the previous request) failed {code:sql} apache drill 1.17> SELECT * FROM (SELECT a, b, any_value(c) FROM dfs.TEST.`example_any_pqt` GROUP BY a, b) ORDER BY a; Error: UNSUPPORTED_OPERATION ERROR: Schema changes not supported in External Sort. Please enable Union type. Previous schema BatchSchema [fields=[[`a` (VARCHAR:OPTIONAL)], [`b` (VARCHAR:OPTIONAL)], [`EXPR$2` (NULL:OPTIONAL)]], selectionVector=NONE] Incoming schema BatchSchema [fields=[[`a` (VARCHAR:OPTIONAL)], [`b` (VARCHAR:OPTIONAL)], [`EXPR$2` (VARCHAR:REPEATED), children=([`$data$` (VARCHAR:REQUIRED)])]], selectionVector=NONE] Fragment 0:0 {code} Note that the same request +without the order by+ works well. It's also possible to use intermediate table and apply the ORDER BY in a second time. 
{code:sql} apache drill 1.17> SELECT * FROM (SELECT a, b, any_value(c) FROM dfs.TEST.`example_any_pqt` GROUP BY a, b); +-+-+---+ | a | b |EXPR$2 | +-+-+---+ | foo | bar | ["foo","bar"] | +-+-+---+ apache drill 1.17> CREATE TABLE dfs.TEST.`ok_pqt` AS (SELECT * FROM (SELECT a, b, any_value(c) FROM dfs.TEST.`example_any_pqt` GROUP BY a, b)); +--+---+ | Fragment | Number of records written | +--+---+ | 0_0 | 1 | +--+---+ apache drill 1.17> SELECT * FROM dfs.TEST.`ok_pqt` ORDER BY a; +-+-+---+ | a | b |EXPR$2 | +-+-+---+ | foo | bar | ["foo","bar"] | +-+-+---+ {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (DRILL-7493) convert_fromJSON and unicode
benj created DRILL-7493: --- Summary: convert_fromJSON and unicode Key: DRILL-7493 URL: https://issues.apache.org/jira/browse/DRILL-7493 Project: Apache Drill Issue Type: Bug Components: Functions - Drill Affects Versions: 1.16.0 Reporter: benj transform a json string (with \u char) into json struct {code:sql} apache drill> SELECT x_str, convert_fromJSON(x_str) AS x_array FROM (SELECT '["test=\u0014=test"]' x_str); +--+--+ |x_str | x_array| +--+--+ | ["test=\u0014=test"] | ["test=\u0014=test"] | +--+--+ {code} Use json struct : {code:sql} apache drill> SELECT x_str , x_array , x_array[0] AS x_array0 FROM(SELECT x_str, convert_fromJSON(x_str) AS x_array FROM (SELECT '["test=\u0014=test"]' x_str)); +--+--+-+ |x_str | x_array| x_array0 | +--+--+-+ | ["test=\u0014=test"] | ["test=\u0014=test"] | test==test | +--+--+-+ {code} Note that the char \u0014 is interpreted in x_array0 if using split function on x_array0, an array is built with non interpreted \u {code:sql} apache drill> SELECT x_str , x_array , x_array[0] AS x_array0 , split(x_array[0],',') AS x_array0_split FROM(SELECT x_str, convert_fromJSON(x_str) AS x_array FROM (SELECT '["test=\u0014=test"]' x_str)); +--+--+-+--+ |x_str | x_array| x_array0 |x_array0_split | +--+--+-+--+ | ["test=\u0014=test"] | ["test=\u0014=test"] | test==test | ["test=\u0014=test"] | +--+--+-+--+ {code} It's not possible to use convert_fromJSON on the interpreted \u {code:sql} SELECT x_str , x_array , x_array[0] AS x_array0 , split(x_array[0],',') AS x_array0_split , convert_fromJSON('["' || x_array[0] || '"]') AS convertJSONerror FROM(SELECT x_str, convert_fromJSON(x_str) AS x_array FROM (SELECT '["test=\u0014=test"]' x_str)); Error: DATA_READ ERROR: Illegal unquoted character ((CTRL-CHAR, code 20)): has to be escaped using backslash to be included in string value at [Source: (org.apache.drill.exec.vector.complex.fn.DrillBufInputStream); line: 1, column: 9] {code} don't work although the string is the same as the origin but \u is 
unfortunately interpreted -- This message was sent by Atlassian Jira (v8.3.4#803005)
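The DRILL-7493 behaviour above is reproducible with any strict JSON parser: once the \u0014 escape has been decoded to a raw control character, the string rebuilt by '["' || x_array[0] || '"]' is no longer valid JSON, and the value must be re-escaped before re-parsing. A Python sketch (Python's json module stands in for convert_fromJSON):

```python
import json

raw = '["test=\\u0014=test"]'   # the literal passed to convert_fromJSON
arr = json.loads(raw)           # \u0014 is decoded to a raw control char
assert arr[0] == "test=\x14=test"

# Rebuilding a JSON literal around the *decoded* value, as the failing query
# does, embeds an unescaped control character, which strict parsers reject:
try:
    json.loads('["' + arr[0] + '"]')
    failed = False
except json.JSONDecodeError:
    failed = True

# Re-escaping first restores the \u0014 notation and parses again:
roundtrip = json.loads(json.dumps([arr[0]]))
```

Re-escaping (json.dumps here) is the generic workaround: it regenerates the backslash notation that the parser, like Drill's DATA_READ check, requires.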
[jira] [Commented] (DRILL-6963) create/aggregate/work with array
[ https://issues.apache.org/jira/browse/DRILL-6963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16985067#comment-16985067 ] benj commented on DRILL-6963: - For the second point (arry_agg), in attempt of an eventual official function, here is a simple implementation that can do that (without possibility to _DISTINCT_ or _ORDER BY_) {code:java} package org.apache.drill.contrib.function; import io.netty.buffer.DrillBuf; import org.apache.drill.exec.expr.DrillAggFunc; import org.apache.drill.exec.expr.annotations.FunctionTemplate; import org.apache.drill.exec.expr.annotations.FunctionTemplate.FunctionScope; import org.apache.drill.exec.expr.annotations.FunctionTemplate.NullHandling; import org.apache.drill.exec.expr.annotations.Output; import org.apache.drill.exec.expr.annotations.Param; import org.apache.drill.exec.expr.annotations.Workspace; import org.apache.drill.exec.expr.holders.*; import javax.inject.Inject; // If dataset is too large, need : ALTER SESSION SET `planner.enable_hashagg` = false public class ArrayAgg { // STRING NULLABLE // @FunctionTemplate( name = "array_agg", scope = FunctionScope.POINT_AGGREGATE, nulls = NullHandling.INTERNAL) public static class NullableVarChar_ArrayAgg implements DrillAggFunc { @Param NullableVarCharHolder input; @Workspace ObjectHolder agg; @Output org.apache.drill.exec.vector.complex.writer.BaseWriter.ComplexWriter out; @Inject DrillBuf buffer; @Override public void setup() { agg = new ObjectHolder(); } @Override public void reset() { agg = new ObjectHolder(); } @Override public void add() { org.apache.drill.exec.vector.complex.writer.BaseWriter.ListWriter listWriter; if (agg.obj == null) { agg.obj = out.rootAsList(); } if ( input.isSet == 0 ) return; org.apache.drill.exec.expr.holders.VarCharHolder rowHolder = new org.apache.drill.exec.expr.holders.VarCharHolder(); byte[] inputBytes = org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.getStringFromVarCharHolder( input ).getBytes( 
com.google.common.base.Charsets.UTF_8 ); buffer.reallocIfNeeded(inputBytes.length); buffer.setBytes(0, inputBytes); rowHolder.start = 0; rowHolder.end = inputBytes.length; rowHolder.buffer = buffer; listWriter = (org.apache.drill.exec.vector.complex.writer.BaseWriter.ListWriter) agg.obj; listWriter.varChar().write( rowHolder ); } @Override public void output() { ((org.apache.drill.exec.vector.complex.writer.BaseWriter.ListWriter) agg.obj).endList(); } } // INTEGER NULLABLE // @FunctionTemplate( name = "array_agg", scope = FunctionScope.POINT_AGGREGATE, nulls = NullHandling.INTERNAL) public static class NullableInt_ArrayAgg implements DrillAggFunc { @Param NullableIntHolder input; @Workspace ObjectHolder agg; @Output org.apache.drill.exec.vector.complex.writer.BaseWriter.ComplexWriter out; @Inject DrillBuf buffer; @Override public void setup() { agg = new ObjectHolder(); } @Override public void reset() { agg = new ObjectHolder(); } @Override public void add() { org.apache.drill.exec.vector.complex.writer.BaseWriter.ListWriter listWriter; if (agg.obj == null) { agg.obj = out.rootAsList(); } if ( input.isSet == 0 ) return; listWriter = (org.apache.drill.exec.vector.complex.writer.BaseWriter.ListWriter) agg.obj; listWriter.integer().writeInt( input.value ); } @Override public void output() { ((org.apache.drill.exec.vector.complex.writer.BaseWriter.ListWriter) agg.obj).endList(); } } // ... } {code} > create/aggregate/work with array > > > Key: DRILL-6963 > URL: https://issues.apache.org/jira/browse/DRILL-6963 > Project: Apache Drill > Issue Type: Wish > Components: Functions - Drill >Reporter: benj >Priority: Major > > * Add the possibility to build array (like : SELECT array[a1,a2,a3...]) - > ideally work with all types > * Add a default array_agg (like : SELECT col1, array_agg(col2), > array_agg(DISTINCT col2) FROM ... 
GROUP BY col1) ; - ideally work with all > types > * Add function/facilities/operator to work with array -- This message was sent by Atlassian Jira (v8.3.4#803005)
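As a semantic reference for the DRILL-6963 UDF above: array_agg collects one list of values per group, skipping NULL inputs (the isSet == 0 check). A Python sketch of that contract (the `array_agg` helper and the sample rows are illustrative, not Drill code):

```python
from collections import defaultdict

def array_agg(rows, key, value):
    # Mirrors SELECT col1, array_agg(col2) ... GROUP BY col1:
    # one list of collected values per group key.
    groups = defaultdict(list)
    for row in rows:
        if row[value] is not None:   # the UDF skips NULL inputs (isSet == 0)
            groups[row[key]].append(row[value])
    return dict(groups)

rows = [
    {"col1": "a", "col2": "x"},
    {"col1": "a", "col2": "y"},
    {"col1": "b", "col2": None},
    {"col1": "b", "col2": "z"},
]
agg = array_agg(rows, "col1", "col2")
```

DISTINCT and ORDER BY variants, which the posted UDF does not support, would dedupe or sort each collected list before output.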
[jira] [Commented] (DRILL-1755) Add support for arrays and scalars as first level elements in JSON files
[ https://issues.apache.org/jira/browse/DRILL-1755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16982491#comment-16982491 ] benj commented on DRILL-1755: - This problems seems partially OK, With a file containing the previous description {noformat}[{"accessLevel": "public"},{"accessLevel": "private"}]{noformat} {code:sql} apache drill (1.16)> SELECT *, 'justfortest' AS mytext FROM dfs.tmp.`example.json`; +-+-+ | accessLevel | mytext| +-+-+ | public | justfortest | | private | justfortest | +-+-+ 2 rows selected (0.127 seconds) {code} But some problems subsists, like {code:sql} apache drill (1.16)> SELECT 'justfortest' As mytext FROM dfs.tmp.`example.json`; Error: DATA_READ ERROR: Error parsing JSON - Cannot read from the middle of a record. Current token was START_ARRAY File /tmp/example.json Record 1 Column 2 Fragment 0:0 {code} > Add support for arrays and scalars as first level elements in JSON files > > > Key: DRILL-1755 > URL: https://issues.apache.org/jira/browse/DRILL-1755 > Project: Apache Drill > Issue Type: Improvement >Reporter: Abhishek Girish >Priority: Major > Fix For: Future > > Attachments: drillbit.log > > > Publicly available JSON data sometimes have the following structure (arrays > as first level elements): > [ > {"accessLevel": "public" > }, > {"accessLevel": "private" > } > ] > Drill currently does not support Arrays or Scalars as first level elements. > Only maps are supported. We should add support for the arrays and scalars. > Log attached. -- This message was sent by Atlassian Jira (v8.3.4#803005)
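For comparison, ordinary JSON parsers already treat a top-level array as a sequence of records, which is the mapping the remaining DRILL-1755 case would need; a minimal Python illustration with the sample document from the description:

```python
import json

# A top-level JSON array of maps, as in the issue description:
doc = '[{"accessLevel": "public"},{"accessLevel": "private"}]'
records = json.loads(doc)   # each array element becomes one record/row
```

Drill's partial fix handles SELECT * over such files; the failing constant-projection query suggests the reader still assumes a map at the top level in some code paths.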
[jira] [Created] (DRILL-7452) Support comparison operator for Array
benj created DRILL-7452: --- Summary: Support comparison operator for Array Key: DRILL-7452 URL: https://issues.apache.org/jira/browse/DRILL-7452 Project: Apache Drill Issue Type: Wish Components: Functions - Drill Affects Versions: 1.16.0 Reporter: benj Attachments: example_array.parquet It would be useful to have a comparison operator for nested types, at least for Array. Sample file in attachment: example_array.parquet
{code:sql}
/* It's possible to do */
apache drill(1.16)> SELECT id, tags FROM `example_array.parquet`;
+--------+-----------+
|   id   |   tags    |
+--------+-----------+
| 7b8808 | [1,2,3]   |
| 7b8808 | [1,20,3]  |
| 55a4be | [1,3,5,6] |
+--------+-----------+

/* But it's not possible to use DISTINCT or ORDER BY on the field tags (ARRAY) */
/* https://drill.apache.org/docs/nested-data-limitations/ */
apache drill(1.16)> SELECT DISTINCT id, tags FROM `example_array.parquet` ORDER BY tags;
Error: SYSTEM ERROR: UnsupportedOperationException: Map, Array, Union or repeated scalar type should not be used in group by, order by or in a comparison operator. Drill does not support compare between BIGINT:REPEATED and BIGINT:REPEATED.
{code}
It's possible to do that in Postgres:
{code:sql}
SELECT DISTINCT id, tags
FROM (
  SELECT '7b8808' AS id, ARRAY[1,2,3] tags
  UNION SELECT '7b8808', ARRAY[1,20,3]
  UNION SELECT '55a4be', ARRAY[1,3,5,6]
) x ORDER BY tags;

7b8808;{1,2,3}
55a4be;{1,3,5,6}
7b8808;{1,20,3}
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
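The Postgres ordering shown above is plain element-wise (lexicographic) array comparison. Python defines the same ordering for lists, which makes the expected DISTINCT/ORDER BY result easy to state (sample rows copied from the output printed above):

```python
rows = [
    ("7b8808", [1, 2, 3]),
    ("7b8808", [1, 20, 3]),
    ("55a4be", [1, 3, 5, 6]),
]
# Lexicographic comparison: [1,2,3] < [1,3,5,6] < [1,20,3] because the first
# differing element decides (2 < 3, then 3 < 20).
ordered = sorted(rows, key=lambda r: r[1])
```

This is the comparison semantics the wish asks Drill to adopt for REPEATED types.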
[jira] [Updated] (DRILL-7379) Planning error
[ https://issues.apache.org/jira/browse/DRILL-7379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] benj updated DRILL-7379: Description: sample file: [^example.parquet] With data as: {code:sql} SELECT id, tags FROM `example_parquet`; +++ | id |tags| +++ | 7b8808 | ["peexe","signed","overlay"] | | 55a4ae | ["peexe","signed","upx","overlay"] | +++ {code} The next request is OK {code:sql} SELECT id, flatten(tags) tag FROM ( SELECT id, any_value(tags) tags FROM `example_parquet` GROUP BY id ) LIMIT 2; +++ | id | tag | +++ | 55a4ae | peexe | | 55a4ae | signed | +++ {code} But unexpectedly, the next query failed: {code:sql} SELECT tag, count(*) FROM ( SELECT flatten(tags) tag FROM ( SELECT id, any_value(tags) tags FROM `example_parquet` GROUP BY id ) ) GROUP BY tag; Error: SYSTEM ERROR: UnsupportedOperationException: Map, Array, Union or repeated scalar type should not be used in group by, order by or in a comparison operator. Drill does not support compare between MAP:REPEATED and MAP:REPEATED. /* Or other error with another set of data : Error: SYSTEM ERROR: SchemaChangeException: Failure while trying to materialize incoming schema. Errors: Error in expression at index 0. Error: Missing function implementation: [hash32asdouble(MAP-REPEATED, INT-REQUIRED)]. Full expression: null.. */ {code} These errors are incomprehensible because, the aggregate is on VARCHAR. 
More, the request works if decomposed in 2 request with with the creation of an intermediate table like below: {code:sql} CREATE TABLE `tmp.parquet` AS ( SELECT id, flatten(tags) tag FROM ( SELECT id, any_value(tags) tags FROM `example_parquet` GROUP BY id )); SELECT tag, count(*) c FROM `tmp_parquet` GROUP BY tag; +-+---+ | tag | c | +-+---+ | overlay | 2 | | peexe | 2 | | signed | 2 | | upx | 1 | +-+---+ {code} was: With data as: {code:sql} SELECT id, tags FROM `example_parquet`; +++ | id |tags| +++ | 7b8808 | ["peexe","signed","overlay"] | | 55a4ae | ["peexe","signed","upx","overlay"] | +++ {code} The next request is OK {code:sql} SELECT id, flatten(tags) tag FROM ( SELECT id, any_value(tags) tags FROM `example_parquet` GROUP BY id ) LIMIT 2; +++ | id | tag | +++ | 55a4ae | peexe | | 55a4ae | signed | +++ {code} But unexpectedly, the next query failed: {code:sql} SELECT tag, count(*) FROM ( SELECT flatten(tags) tag FROM ( SELECT id, any_value(tags) tags FROM `example_parquet` GROUP BY id ) ) GROUP BY tag; Error: SYSTEM ERROR: UnsupportedOperationException: Map, Array, Union or repeated scalar type should not be used in group by, order by or in a comparison operator. Drill does not support compare between MAP:REPEATED and MAP:REPEATED. /* Or other error with another set of data : Error: SYSTEM ERROR: SchemaChangeException: Failure while trying to materialize incoming schema. Errors: Error in expression at index 0. Error: Missing function implementation: [hash32asdouble(MAP-REPEATED, INT-REQUIRED)]. Full expression: null.. */ {code} These errors are incomprehensible because, the aggregate is on VARCHAR. 
More, the request works if decomposed in 2 request with with the creation of an intermediate table like below: {code:sql} CREATE TABLE `tmp.parquet` AS ( SELECT id, flatten(tags) tag FROM ( SELECT id, any_value(tags) tags FROM `example_parquet` GROUP BY id )); SELECT tag, count(*) c FROM `tmp_parquet` GROUP BY tag; +-+---+ | tag | c | +-+---+ | overlay | 2 | | peexe | 2 | | signed | 2 | | upx | 1 | +-+---+ {code} > Planning error > -- > > Key: DRILL-7379 > URL: https://issues.apache.org/jira/browse/DRILL-7379 > Project: Apache Drill > Issue Type: Bug > Components: Functions - Drill >Affects Versions: 1.16.0 >Reporter: benj >Priority: Major > Attachments: example.parquet > > > sample file: [^example.parquet] > With data as: > {code:sql} > SELECT id, tags FROM `example_parquet`; > +++ > | id |tags| > +++ > | 7b8808 | ["peexe","signed","overlay"] | > | 55a4ae | ["peexe","signed","upx","overlay"] | > +++ >
[jira] [Updated] (DRILL-7379) Planning error
[ https://issues.apache.org/jira/browse/DRILL-7379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] benj updated DRILL-7379: Attachment: example.parquet > Planning error > -- > > Key: DRILL-7379 > URL: https://issues.apache.org/jira/browse/DRILL-7379 > Project: Apache Drill > Issue Type: Bug > Components: Functions - Drill >Affects Versions: 1.16.0 >Reporter: benj >Priority: Major > Attachments: example.parquet > > > With data as: > {code:sql} > SELECT id, tags FROM `example_parquet`; > +++ > | id |tags| > +++ > | 7b8808 | ["peexe","signed","overlay"] | > | 55a4ae | ["peexe","signed","upx","overlay"] | > +++ > {code} > The next request is OK > {code:sql} > SELECT id, flatten(tags) tag > FROM ( > SELECT id, any_value(tags) tags > FROM `example_parquet` > GROUP BY id > ) LIMIT 2; > +++ > | id | tag | > +++ > | 55a4ae | peexe | > | 55a4ae | signed | > +++ > {code} > But unexpectedly, the next query failed: > {code:sql} > SELECT tag, count(*) > FROM ( > SELECT flatten(tags) tag > FROM ( > SELECT id, any_value(tags) tags > FROM `example_parquet` > GROUP BY id > ) > ) GROUP BY tag; > Error: SYSTEM ERROR: UnsupportedOperationException: Map, Array, Union or > repeated scalar type should not be used in group by, order by or in a > comparison operator. Drill does not support compare between MAP:REPEATED and > MAP:REPEATED. > /* Or other error with another set of data : > Error: SYSTEM ERROR: SchemaChangeException: Failure while trying to > materialize incoming schema. Errors: > > Error in expression at index 0. Error: Missing function implementation: > [hash32asdouble(MAP-REPEATED, INT-REQUIRED)]. Full expression: null.. > */ > {code} > These errors are incomprehensible because, the aggregate is on VARCHAR. 
> More, the request works if decomposed in 2 request with with the creation of > an intermediate table like below: > {code:sql} > CREATE TABLE `tmp.parquet` AS ( > SELECT id, flatten(tags) tag > FROM ( > SELECT id, any_value(tags) tags > FROM `example_parquet` > GROUP BY id > )); > SELECT tag, count(*) c FROM `tmp_parquet` GROUP BY tag; > +-+---+ > | tag | c | > +-+---+ > | overlay | 2 | > | peexe | 2 | > | signed | 2 | > | upx | 1 | > +-+---+ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
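Whatever the planner issue, the intended result of the failing DRILL-7379 query is easy to state: flatten each id's tags, then count per tag. A Python sketch using the two rows shown above reproduces the workaround's output table:

```python
from collections import Counter

rows = {
    "7b8808": ["peexe", "signed", "overlay"],
    "55a4ae": ["peexe", "signed", "upx", "overlay"],
}
# flatten(tags) followed by GROUP BY tag / count(*):
counts = Counter(tag for tags in rows.values() for tag in tags)
```

Each counted key is a VARCHAR, which is why the MAP:REPEATED comparison error is surprising.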
[jira] [Created] (DRILL-7449) memory leak parse_url function
benj created DRILL-7449: --- Summary: memory leak parse_url function Key: DRILL-7449 URL: https://issues.apache.org/jira/browse/DRILL-7449 Project: Apache Drill Issue Type: Bug Components: Functions - Drill Affects Versions: 1.16.0 Reporter: benj Requests with *parse_url* works well when the number of treated rows is low but produce memory leak when number of rows grows (~ between 500 000 and 1 million) (and for certain number of row sometimes the request works and sometimes it failed with memory leaks) Extract from dataset tested: {noformat} {"Attributable":true,"Description":"Website has been identified as malicious by Bing","FirstReportedDateTime":"2018-03-12T18:49:38Z","IndicatorExpirationDateTime":"2018-04-11T23:33:13Z","IndicatorProvider":"Bing","IndicatorThreatType":"MaliciousUrl","IsPartnerShareable":true,"IsProductLicensed":true,"LastReportedDateTime":"2018-03-12T18:49:38Z","NetworkDestinationAsn":15169,"NetworkDestinationIPv4":"172.217.8.193","NetworkDestinationPort":80,"Tags":["us"],"ThreatDetectionProduct":"ES","TLPLevel":"Amber","Url":"http://pasuruanbloggers.blogspot.ru/2012/12/beginilah-cara-orang-jepang-berpacaran.html","Version":1.5} {"Attributable":true,"Description":"Website has been identified as malicious by Bing","FirstReportedDateTime":"2018-03-12T18:14:51Z","IndicatorExpirationDateTime":"2018-04-11T23:33:13Z","IndicatorProvider":"Bing","IndicatorThreatType":"MaliciousUrl","IsPartnerShareable":true,"IsProductLicensed":true,"LastReportedDateTime":"2018-03-12T18:14:51Z","NetworkDestinationAsn":15169,"NetworkDestinationIPv4":"216.58.192.193","NetworkDestinationPort":80,"Tags":["us"],"ThreatDetectionProduct":"ES","TLPLevel":"Amber","Url":"http://pasuruanbloggers.blogspot.ru/2012/12/cara-membuat-widget-slideshow-postingan.html","Version":1.5} {noformat} Request tested: {code:sql} ALTER SESSION SET `store.format`='parquet'; ALTER SESSION SET `store.parquet.use_new_reader` = true; ALTER SESSION SET `store.parquet.compression` = 'snappy'; ALTER 
SESSION SET `drill.exec.functions.cast_empty_string_to_null`= true; ALTER SESSION SET `store.json.all_text_mode` = true; ALTER SESSION SET `exec.enable_union_type` = true; ALTER SESSION SET `store.json.all_text_mode` = true; CREATE TABLE dfs.test.`output_pqt` AS ( SELECT R.parsed.host AS Domain FROM ( SELECT parse_url(T.Url) AS parsed FROM dfs.test.`file.json` AS T ) AS R ORDER BY Domain ); {code} Result when memory leak: {noformat} Error: SYSTEM ERROR: IllegalStateException: Memory was leaked by query. Memory leaked: (256) Allocator(frag:3:0) 300/256/9337280/300 (res/actual/peak/limit) Fragment 3:0 Please, refer to logs for more information. [Error Id: 3ffa5b43-0dde-4518-bb5a-ea3aab97f3d4 on servor01:31010] (java.lang.IllegalStateException) Memory was leaked by query. Memory leaked: (256) Allocator(frag:3:0) 300/256/9337280/300 (res/actual/peak/limit) org.apache.drill.exec.memory.BaseAllocator.close():520 org.apache.drill.exec.ops.FragmentContextImpl.suppressingClose():552 org.apache.drill.exec.ops.FragmentContextImpl.close():546 org.apache.drill.exec.work.fragment.FragmentExecutor.closeOutResources():386 org.apache.drill.exec.work.fragment.FragmentExecutor.cleanup():214 org.apache.drill.exec.work.fragment.FragmentExecutor.run():329 org.apache.drill.common.SelfCleaningRunnable.run():38 java.util.concurrent.ThreadPoolExecutor.runWorker():1149 java.util.concurrent.ThreadPoolExecutor$Worker.run():624 java.lang.Thread.run():748 (state=,code=0) java.sql.SQLException: SYSTEM ERROR: IllegalStateException: Memory was leaked by query. Memory leaked: (256) Allocator(frag:3:0) 300/256/9337280/300 (res/actual/peak/limit) Fragment 3:0 Please, refer to logs for more information. [Error Id: 3ffa5b43-0dde-4518-bb5a-ea3aab97f3d4 on servor01:31010] (java.lang.IllegalStateException) Memory was leaked by query. 
Memory leaked: (256) Allocator(frag:3:0) 300/256/9337280/300 (res/actual/peak/limit) org.apache.drill.exec.memory.BaseAllocator.close():520 org.apache.drill.exec.ops.FragmentContextImpl.suppressingClose():552 org.apache.drill.exec.ops.FragmentContextImpl.close():546 org.apache.drill.exec.work.fragment.FragmentExecutor.closeOutResources():386 org.apache.drill.exec.work.fragment.FragmentExecutor.cleanup():214 org.apache.drill.exec.work.fragment.FragmentExecutor.run():329 org.apache.drill.common.SelfCleaningRunnable.run():38 java.util.concurrent.ThreadPoolExecutor.runWorker():1149 java.util.concurrent.ThreadPoolExecutor$Worker.run():624 java.lang.Thread.run():748 at org.apache.drill.jdbc.impl.DrillCursor.nextRowInternally(DrillCursor.java:538) at org.apache.drill.jdbc.impl.DrillCursor.loadInitialSchema(DrillCursor.java:610) at
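For reference, the parse_url(T.Url).host value the failing DRILL-7449 query extracts corresponds to the URL authority; a Python sketch with the two sample URLs from the dataset extract above (urllib stands in for Drill's parse_url, and is unaffected by the row-count-dependent leak):

```python
from urllib.parse import urlparse

urls = [
    "http://pasuruanbloggers.blogspot.ru/2012/12/beginilah-cara-orang-jepang-berpacaran.html",
    "http://pasuruanbloggers.blogspot.ru/2012/12/cara-membuat-widget-slideshow-postingan.html",
]
# The equivalent of SELECT parse_url(Url).host ... ORDER BY Domain:
domains = sorted({urlparse(u).netloc for u in urls})
```

That the leak appears only past roughly 500 000 to 1 000 000 rows points at buffer management in the UDF or sort, not at the per-row parsing itself.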
[jira] [Commented] (DRILL-7375) composite/nested type map/array convert_to/cast to varchar
[ https://issues.apache.org/jira/browse/DRILL-7375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16975198#comment-16975198 ] benj commented on DRILL-7375: - Waiting for a possible official version of such a feature, it is possible to use an own UDF like: {code:java} package org.apache.drill.contrib.function; import io.netty.buffer.DrillBuf; import org.apache.drill.exec.expr.DrillSimpleFunc; import org.apache.drill.exec.expr.annotations.FunctionTemplate; import org.apache.drill.exec.expr.annotations.FunctionTemplate.FunctionScope; import org.apache.drill.exec.expr.annotations.FunctionTemplate.NullHandling; import org.apache.drill.exec.expr.annotations.Output; import org.apache.drill.exec.expr.annotations.Param; import org.apache.drill.exec.vector.complex.reader.FieldReader; import org.apache.drill.exec.expr.holders.*; import javax.inject.Inject; public class ToString { @FunctionTemplate( name = "to_string", scope = FunctionScope.SIMPLE, nulls = NullHandling.NULL_IF_NULL) public static class NullableVarChar_Field_ToString implements DrillSimpleFunc { @Param FieldReader input; @Output VarCharHolder out; @Inject DrillBuf buffer; @Override public void setup() { } @Override public void eval() { String rowString = input.readObject().toString(); buffer = buffer.reallocIfNeeded(rowString.length()); buffer.setBytes(0, rowString.getBytes(), 0, rowString.length()); out.start = 0; out.end= rowString.length(); out.buffer = buffer; } } } {code} Example of use: {code:sql} apache drill> SELECT j, typeof(j) AS tj, to_string(j) AS strj, typeof(to_string(j)) AS tstrj FROM (SELECT convert_fromJSON('{a:["1","2","3"]}' ) j); +-+-+-+-+ | j | tj |strj | tstrj | +-+-+-+-+ | {"a":["1","2","3"]} | MAP | {"a":["1","2","3"]} | VARCHAR | +-+-+-+-+ 1 row selected (0.132 seconds) {code} With this function it's possible to "cast" anything in varchar and avoid storage problem in Parquet due to certain types. 
And it is eventually possible to cast the other way when requesting the Parquet file. > composite/nested type map/array convert_to/cast to varchar > -- > > Key: DRILL-7375 > URL: https://issues.apache.org/jira/browse/DRILL-7375 > Project: Apache Drill > Issue Type: Wish > Components: Functions - Drill >Affects Versions: 1.16.0 >Reporter: benj >Priority: Major > > As it possible to cast varchar to map (convert_from + JSON) with convert_from > or transform a varchar to array (split) > {code:sql} > SELECT a, typeof(a), sqltypeof(a) from (SELECT CONVERT_FROM('{a : 100, b: > 200}' ,'JSON') a); > +---+-++ > | a | EXPR$1 | EXPR$2 | > +---+-++ > | {"a":100,"b":200} | MAP | STRUCT | > +---+-++ > SELECT a, typeof(a), sqltypeof(a)FROM (SELECT split(str,',') AS a FROM ( > SELECT 'foo,bar' AS str)); > +---+-++ > |a | EXPR$1 | EXPR$2 | > +---+-++ > | ["foo","bar"] | VARCHAR | ARRAY | > +---+-++ > {code} > It will be very usefull : > # to have the capacity to "cast" the +_MAP_ into VARCHAR+ with a "cast > syntax" or with a "convert_to" possibility > Expected: > {code:sql} > SELECT a, typeof(a) ta, va, typeof(va) tva FROM ( > SELECT a, CAST(a AS varchar) va FROM (SELECT CONVERT_FROM('{a : 100, b: 200}' > ,'JSON') a)); > +---+--+---+-+ > | a | ta | va| tva | > +---+--+---+-+ > | {"a":100,"b":200} | MAP | {"a":100,"b":200} | VARCHAR | > +---+--+---+-+ > {code} > # to have the capacity to "cast" the +_ARRAY_ into VARCHAR+ with a "cast > syntax" or any other method > Expected > {code:sql} > SELECT a, sqltypeof(a) ta, va, sqltypeof(va) tva FROM ( > SELECT a, CAST(a AS varchar) va FROM (SELECT split(str,',') AS a FROM ( > SELECT 'foo,bar' AS str)); > +---+--+---+-+ > | a | ta | va| tva | > +---+--+---+-+ > | ["foo","bar"] | ARRAY| ["foo","bar"] | VARCHAR | > +---+--+---+-+ > {code} > Please note that these possibility of course exists in other database systems > Example with Postgres: > {code:sql} > SELECT '{"a":100,"b":200}'::json::text; > => {"a":100,"b":200} > SELECT 
array[1,2,3]::text; >
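The to_string UDF above amounts to serializing the value back to its JSON text; a Python sketch of the same "MAP to VARCHAR" conversion (json.dumps stands in for the UDF's readObject().toString()):

```python
import json

value = {"a": ["1", "2", "3"]}   # the MAP built by convert_fromJSON
# Compact separators match the {"a":["1","2","3"]} rendering shown above:
as_text = json.dumps(value, separators=(",", ":"))
```

Like the UDF, this makes any nested value storable as a plain VARCHAR column, at the cost of re-parsing when reading it back.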
[jira] [Updated] (DRILL-7444) JSON blank result on SELECT when too much byte in multiple files on Drill embedded
[ https://issues.apache.org/jira/browse/DRILL-7444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] benj updated DRILL-7444: Summary: JSON blank result on SELECT when too much byte in multiple files on Drill embedded (was: JSON blank result on SELECT when too much byte in multiple files on embedded) > JSON blank result on SELECT when too much byte in multiple files on Drill > embedded > -- > > Key: DRILL-7444 > URL: https://issues.apache.org/jira/browse/DRILL-7444 > Project: Apache Drill > Issue Type: Bug > Components: Storage - JSON >Affects Versions: 1.17.0 >Reporter: benj >Priority: Major > > 2 files (a.json and b.json) and the concat of these 2 file (ab.json) produce > different results on a simple _SELECT_ when using +Drill embedded+. > Problem appears from a number of byte (~ 102 400 000 in my case) > {code:bash} > #!/bin/bash > # script gen.sh to reproduce the problem > for ((i=1;i<=$1;++i)); > do > echo -n '{"At":"' > for j in {1..999}; > do > echo -n 'ab' > done > echo '"}' > done > {code} > {noformat} > == I == > $ gen.sh 1 > a.json > $ gen.sh 239 > b.json > $ wc -c *.json > 1 a.json > 239 b.json > 10239 total > $ bash drill-embedded > apache drill> SELECT * FROM dfs.tmp.`*.json` LIMIT 1; > ++ > | At | > ++ > | aab... | > ++ > => All is fine here > == II == > $ gen.sh 1 > a.json > $ gen.sh 240 > b.json > $ wc -c *.json > 1 a.json > 240 b.json > 10240 total > $ bash drill-embedded > apache drill> SELECT * FROM dfs.tmp.`*.json` LIMIT 1; > ++ > | At | > ++ > || > ++ > => In a surprising way field `At` is empty > == III == > $ gen.sh 10240 > ab.json > $ wc -c *.json > 10240 ab.json > $ bash drill-embedded > apache drill> SELECT * FROM dfs.tmp.`c.json` LIMIT 1; > ++ > |At | > ++ > | aab... 
| > ++ > => All is fine here although the number of lines is equal to case II > {noformat} > The Drill 1.17 version tested here is the latest as of 2019-11-13 > This problem does not appear with Drill embedded 1.16 -- This message was sent by Atlassian Jira (v8.3.4#803005)
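For the record, each line emitted by the gen.sh script in DRILL-7444 above has a fixed size, which makes the byte thresholds in the report easy to relate to record counts; a Python sketch of one generated record (assuming gen.sh exactly as shown):

```python
def gen_record():
    # One gen.sh line: {"At":"abab...ab"} with 999 'ab' pairs plus a newline.
    return '{"At":"' + "ab" * 999 + '"}\n'

record = gen_record()
size = len(record.encode())   # bytes per record, as counted by wc -c
```

So the I/II boundary in the report corresponds to crossing a fixed multiple of this record size across the globbed files, consistent with a buffer-boundary bug in the 1.17 JSON reader.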
[jira] [Updated] (DRILL-7444) JSON blank result on SELECT when too much byte in multiple files on embedded
[ https://issues.apache.org/jira/browse/DRILL-7444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] benj updated DRILL-7444: Description: 2 files (a.json and b.json) and the concat of these 2 file (ab.json) produce different results on a simple _SELECT_ when using +Drill embedded+. Problem appears from a number of byte (~ 102 400 000 in my case) {code:bash} #!/bin/bash # script gen.sh to reproduce the problem for ((i=1;i<=$1;++i)); do echo -n '{"At":"' for j in {1..999}; do echo -n 'ab' done echo '"}' done {code} {noformat} == I == $ gen.sh 1 > a.json $ gen.sh 239 > b.json $ wc -c *.json 1 a.json 239 b.json 10239 total $ bash drill-embedded apache drill> SELECT * FROM dfs.tmp.`*.json` LIMIT 1; ++ | At | ++ | aab... | ++ => All is fine here == II == $ gen.sh 1 > a.json $ gen.sh 240 > b.json $ wc -c *.json 1 a.json 240 b.json 10240 total $ bash drill-embedded apache drill> SELECT * FROM dfs.tmp.`*.json` LIMIT 1; ++ | At | ++ || ++ => In a surprising way field `At` is empty == III == $ gen.sh 10240 > ab.json $ wc -c *.json 10240 ab.json $ bash drill-embedded apache drill> SELECT * FROM dfs.tmp.`c.json` LIMIT 1; ++ |At | ++ | aab... | ++ => All is fine here although the number of lines is equal to case II {noformat} The Version of the Drill 1.17 tested here is the latest at 2019-11-13 This problem doesn't appears with Drill embedded 1.16 was: 2 files (a.json and b.json) and the concat of these 2 file (ab.json) produce different results on a simple _SELECT_ when using Drill embedded. Problem appears from a number of byte (~ 102 400 000 in my case) {code:bash} #!/bin/bash # script gen.sh to reproduce the problem for ((i=1;i<=$1;++i)); do echo -n '{"At":"' for j in {1..999}; do echo -n 'ab' done echo '"}' done {code} {noformat} == I == $ gen.sh 1 > a.json $ gen.sh 239 > b.json $ wc -c *.json 1 a.json 239 b.json 10239 total $ bash drill-embedded apache drill> SELECT * FROM dfs.tmp.`*.json` LIMIT 1; ++ | At | ++ | aab... 
| ++ => All is fine here == II == $ gen.sh 1 > a.json $ gen.sh 240 > b.json $ wc -c *.json 1 a.json 240 b.json 10240 total $ bash drill-embedded apache drill> SELECT * FROM dfs.tmp.`*.json` LIMIT 1; ++ | At | ++ || ++ => In a surprising way field `At` is empty == III == $ gen.sh 10240 > ab.json $ wc -c *.json 10240 ab.json $ bash drill-embedded apache drill> SELECT * FROM dfs.tmp.`c.json` LIMIT 1; ++ |At | ++ | aab... | ++ => All is fine here although the number of lines is equal to case II {noformat} The Version of the Drill 1.17 tested here is the latest at 2019-11-13 > JSON blank result on SELECT when too much byte in multiple files on embedded > > > Key: DRILL-7444 > URL: https://issues.apache.org/jira/browse/DRILL-7444 > Project: Apache Drill > Issue Type: Bug > Components: Storage - JSON >Affects Versions: 1.17.0 >Reporter: benj >Priority: Major > > 2 files (a.json and b.json) and the concat of these 2 file (ab.json) produce > different results on a simple _SELECT_ when using +Drill embedded+. > Problem appears from a number of byte (~ 102 400 000 in my case) > {code:bash} > #!/bin/bash > # script gen.sh to reproduce the problem > for ((i=1;i<=$1;++i)); > do > echo -n '{"At":"' > for j in {1..999}; > do > echo -n 'ab' > done > echo '"}' > done > {code} > {noformat} > == I == > $ gen.sh 1 > a.json > $ gen.sh 239 > b.json > $ wc -c *.json > 1 a.json > 239 b.json > 10239 total > $ bash drill-embedded > apache drill> SELECT * FROM dfs.tmp.`*.json` LIMIT 1; > ++ > | At | > ++ > | aab... | > ++ > => All is fine here > == II == > $ gen.sh 1 > a.json > $ gen.sh 240 > b.json > $ wc -c *.json > 1 a.json > 240 b.json > 10240 total > $ bash drill-embedded > apache drill> SELECT * FROM dfs.tmp.`*.json` LIMIT 1; > ++ > | At | > ++ > || > ++ > => In a surprising way field `At` is empty > == III == > $
[jira] [Updated] (DRILL-7444) JSON blank result on SELECT when too much byte in multiple files on embedded
[ https://issues.apache.org/jira/browse/DRILL-7444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] benj updated DRILL-7444:
Description:
Two files (a.json and b.json) and the concatenation of these two files (ab.json) produce different results on a simple _SELECT_ when using Drill embedded. The problem appears above a certain number of bytes (~ 102 400 000 in my case).
{code:bash}
#!/bin/bash
# script gen.sh to reproduce the problem
for ((i=1;i<=$1;++i)); do
  echo -n '{"At":"'
  for j in {1..999}; do
    echo -n 'ab'
  done
  echo '"}'
done
{code}
{noformat}
== I ==
$ gen.sh 1 > a.json
$ gen.sh 239 > b.json
$ wc -c *.json
1 a.json
239 b.json
10239 total
$ bash drill-embedded
apache drill> SELECT * FROM dfs.tmp.`*.json` LIMIT 1;
++
| At |
++
| aab... |
++
=> All is fine here

== II ==
$ gen.sh 1 > a.json
$ gen.sh 240 > b.json
$ wc -c *.json
1 a.json
240 b.json
10240 total
$ bash drill-embedded
apache drill> SELECT * FROM dfs.tmp.`*.json` LIMIT 1;
++
| At |
++
||
++
=> Surprisingly, the field `At` is empty

== III ==
$ gen.sh 10240 > ab.json
$ wc -c *.json
10240 ab.json
$ bash drill-embedded
apache drill> SELECT * FROM dfs.tmp.`c.json` LIMIT 1;
++
|At |
++
| aab... |
++
=> All is fine here although the number of lines is equal to case II
{noformat}
The Drill 1.17 tested here is the latest build as of 2019-11-13.

was: 2 files (a.json and b.json) and the concat of these 2 file (ab.json) produce different results on a simple _SELECT_. Problem appears from a number of byte (~ 102 400 000 in my case) {code:bash} #!/bin/bash # script gen.sh to reproduce the problem for ((i=1;i<=$1;++i)); do echo -n '{"At":"' for j in {1..999}; do echo -n 'ab' done echo '"}' done {code} {noformat} == I == $ gen.sh 1 > a.json $ gen.sh 239 > b.json $ wc -c *.json 1 a.json 239 b.json 10239 total $ bash drill-embedded apache drill> SELECT * FROM dfs.tmp.`*.json` LIMIT 1; ++ | At | ++ | aab... | ++ => All is fine here == II == $ gen.sh 1 > a.json $ gen.sh 240 > b.json $ wc -c *.json 1 a.json 240 b.json 10240 total $ bash drill-embedded apache drill> SELECT * FROM dfs.tmp.`*.json` LIMIT 1; ++ | At | ++ || ++ => In a surprising way field `At` is empty == III == $ gen.sh 10240 > ab.json $ wc -c *.json 10240 ab.json $ bash drill-embedded apache drill> SELECT * FROM dfs.tmp.`c.json` LIMIT 1; ++ |At | ++ | aab... | ++ => All is fine here although the number of lines is equal to case II {noformat} * This problem doesn't appears in Drill 1.16. * The Version of the Drill 1.17 tested here is the latest at 2019-11-13

Summary: JSON blank result on SELECT when too much byte in multiple files on embedded (was: JSON blank result on SELECT when too much byte in multiple files )
[jira] [Updated] (DRILL-7444) JSON blank result on SELECT when too much byte in multiple files
[ https://issues.apache.org/jira/browse/DRILL-7444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] benj updated DRILL-7444:
Description:
Two files (a.json and b.json) and the concatenation of these two files (ab.json) produce different results on a simple _SELECT_. The problem appears above a certain number of bytes (~ 102 400 000 in my case).
{code:bash}
#!/bin/bash
# script gen.sh to reproduce the problem
for ((i=1;i<=$1;++i)); do
  echo -n '{"At":"'
  for j in {1..999}; do
    echo -n 'ab'
  done
  echo '"}'
done
{code}
{noformat}
== I ==
$ gen.sh 1 > a.json
$ gen.sh 239 > b.json
$ wc -c *.json
1 a.json
239 b.json
10239 total
apache drill> SELECT * FROM dfs.tmp.`*.json` LIMIT 1;
++
| At |
++
| aab... |
++
=> All is fine here

== II ==
$ gen.sh 1 > a.json
$ gen.sh 240 > b.json
$ wc -c *.json
1 a.json
240 b.json
10240 total
apache drill> SELECT * FROM dfs.tmp.`*.json` LIMIT 1;
++
| At |
++
||
++
=> Surprisingly, the field `At` is empty

== III ==
$ gen.sh 10240 > ab.json
$ wc -c *.json
10240 ab.json
apache drill> SELECT * FROM dfs.tmp.`c.json` LIMIT 1;
++
|At |
++
| aab... |
++
=> All is fine here although the number of lines is equal to case II
{noformat}
* This problem does not appear in Drill 1.16.
* The Drill 1.17 tested here is the latest build as of 2019-11-13.

was: 2 files (a.json and b.json) and the concat of these 2 file (ab.json) produce different results on a simple _SELECT_. Problem appears from a number of byte (~ 102 400 000 in my case) {code:bash} #!/bin/bash # script gen.sh to reproduce the problem for ((i=1;i<=$1;++i)); do echo -n '{"At":"' for j in {1..999}; do echo -n 'ab' done echo '"}' done {code} {noformat} == I == $ gen.sh 1 > a.json $ gen.sh 239 > b.json $ wc -c *.json 1 a.json 239 b.json 10239 total apache drill> SELECT * FROM dfs.tmp.`*.json` LIMIT 1; ++ | At | ++ | aab... | ++ => All is fine here == II == $ gen.sh 1 > a.json $ gen.sh 240 > b.json $ wc -c *.json 1 a.json 240 b.json 10240 total apache drill> SELECT * FROM dfs.tmp.`*.json` LIMIT 1; ++ | At | ++ || ++ => In a surprising way field `At` is empty == III == $ gen.sh 10240 > ab.json $ wc -c *.json 10240 ab.json apache drill> SELECT * FROM dfs.tmp.`c.json` LIMIT 1; ++ |At | ++ | aab... | ++ => All is fine here although the number of lines is equal to case II {noformat} * This problem doesn't appears in Drill 1.16. * The Version of the Drill 1.17 tested here is the latest at 2019-11-13
[jira] [Created] (DRILL-7444) JSON blank result on SELECT when too much byte in multiple files
benj created DRILL-7444:
---
Summary: JSON blank result on SELECT when too much byte in multiple files
Key: DRILL-7444
URL: https://issues.apache.org/jira/browse/DRILL-7444
Project: Apache Drill
Issue Type: Bug
Components: Storage - JSON
Affects Versions: 1.17.0
Reporter: benj

Two files (a.json and b.json) and the concatenation of these two files (ab.json) produce different results on a simple _SELECT_. The problem appears above a certain number of bytes (~ 102 400 000 in my case).
{code:bash}
#!/bin/bash
# script gen.sh to reproduce the problem
for ((i=1;i<=$1;++i)); do
  echo -n '{"At":"'
  for j in {1..999}; do
    echo -n 'ab'
  done
  echo '"}'
done
{code}
{noformat}
== I ==
$ gen.sh 1 > a.json
$ gen.sh 239 > b.json
$ wc -c *.json
1 a.json
239 b.json
10239 total
apache drill> SELECT * FROM dfs.tmp.`*.json` LIMIT 1;
++
| At |
++
| aab... |
++
=> All is fine here

== II ==
$ gen.sh 1 > a.json
$ gen.sh 240 > b.json
$ wc -c *.json
1 a.json
240 b.json
10240 total
apache drill> SELECT * FROM dfs.tmp.`*.json` LIMIT 1;
++
| At |
++
||
++
=> Surprisingly, the field `At` is empty

== III ==
$ gen.sh 10240 > ab.json
$ wc -c *.json
10240 ab.json
apache drill> SELECT * FROM dfs.tmp.`c.json` LIMIT 1;
++
|At |
++
| aab... |
++
=> All is fine here although the number of lines is equal to case II
{noformat}
* This problem does not appear in Drill 1.16.
* The Drill 1.17 tested here is the latest build as of 2019-11-13.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
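For reference, the record shape that gen.sh emits can be sketched in Python (a hypothetical re-implementation for illustration, not part of the report); each record weighs a fixed 2008 bytes, which is how the byte counts above scale linearly with the argument:

```python
# Hypothetical Python equivalent of the gen.sh reproducer from this report.
def gen(n: int) -> str:
    # One JSON record per line: {"At":"abab...ab"} with 999 repetitions of "ab".
    record = '{"At":"' + "ab" * 999 + '"}\n'
    return record * n

one = gen(1)
print(len(one))  # 2008: 7 bytes of prefix + 1998 of payload + 2 of suffix + newline
```

Under this assumption the threshold the reporter observes is purely a total byte count, not a property of any single record.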
[jira] [Commented] (DRILL-7426) Json support lists of different types
[ https://issues.apache.org/jira/browse/DRILL-7426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961802#comment-16961802 ] benj commented on DRILL-7426:
- In my particular case the schema is difficult to predict but, as [~cgivre] says, an option to get/force values as strings would be great. What is particularly surprising in this case is that:
{noformat}
apache drill> ALTER SESSION SET `store.json.all_text_mode` = true;
apache drill> ALTER SESSION SET `exec.enable_union_type` = true;

/* I) does not work with a simple array */
{"name": "toto", "info": ["LOAD", 5, [] ], "response": 1 }
apache drill> SELECT * FROM dfs.test.`file.json` LIMIT 1;
Error: SYSTEM ERROR: SchemaChangeRuntimeException: Inner vector type mismatch. Requested type: [minor_type: VARCHAR mode: OPTIONAL ], actual type: [minor_type: UNION mode: OPTIONAL sub_type: VARCHAR sub_type: LIST ]

/* II) but works with an array of arrays */
{"name": "toto", "info": [ ["LOAD", 5, [] ] ], "response": 1 }
apache drill> SELECT * FROM dfs.test.`file.json` LIMIT 1;
+--+---+--+
| name | info | response |
+--+---+--+
| toto | [["LOAD","5",[]]] | 1 |
+--+---+--+
1 row selected (0.133 seconds)

/* III) and it also works when accessing the first field of the array of arrays (info[0]), which looks the same as the array of case (I) */
apache drill> SELECT *, info[0], info[0][0], info[0][1], info[0][2] FROM dfs.test.`file.json` LIMIT 1;
+--+--+--+-++++
| name | info | response | EXPR$1 | EXPR$2 | EXPR$3 | EXPR$4 |
+--+--+--+-++++
| toto | [["LOAD","5",[]],[]] | 1 | ["LOAD","5",[]] | LOAD | 5 | [] |
+--+--+--+-++++
1 row selected (0.185 seconds)
{noformat}

> Json support lists of different types
> -
>
> Key: DRILL-7426
> URL: https://issues.apache.org/jira/browse/DRILL-7426
> Project: Apache Drill
> Issue Type: Improvement
> Components: Documentation
> Affects Versions: 1.16.0
> Reporter: benj
> Priority: Trivial
>
> With a file.json like
> {code:json}
> {
>   "name": "toto",
>   "info": [["LOAD", []]],
>   "response": 1
> }
> {code}
> a simple SELECT gives an error:
> {code:sql}
> apache drill> SELECT * FROM dfs.test.`file.json`;
> Error: UNSUPPORTED_OPERATION ERROR: In a list of type VARCHAR, encountered a value of type LIST. Drill does not support lists of different types.
> {code}
> But there is an option _exec.enable_union_type_ that allows these requests:
> {code:sql}
> apache drill> ALTER SESSION SET `exec.enable_union_type` = true;
> apache drill> SELECT * FROM dfs.test.`file.json`;
> +--+---+--+
> | name | info | response |
> +--+---+--+
> | toto | [["LOAD",[]]] | 1 |
> +--+---+--+
> 1 row selected (0.283 seconds)
> {code}
> The existence of this option is not obvious, so it would be useful for the error message to mention the possibility of setting it:
> {noformat}
> Error: UNSUPPORTED_OPERATION ERROR: In a list of type VARCHAR, encountered a value of type LIST. Drill does not support lists of different types. SET the option 'exec.enable_union_type' to true and try again;
> {noformat}
> This behaviour is already used for other errors, for example:
> {noformat}
> ...
> Error: UNSUPPORTED_OPERATION ERROR: This query cannot be planned possibly due to either a cartesian join or an inequality join.
> If a cartesian or inequality join is used intentionally, set the option 'planner.enable_nljoin_for_scalar_only' to false and try again.
> {noformat}
[jira] [Created] (DRILL-7426) Json support lists of different types
benj created DRILL-7426:
---
Summary: Json support lists of different types
Key: DRILL-7426
URL: https://issues.apache.org/jira/browse/DRILL-7426
Project: Apache Drill
Issue Type: Improvement
Components: Documentation
Affects Versions: 1.16.0
Reporter: benj

With a file.json like
{code:json}
{
  "name": "toto",
  "info": [["LOAD", []]],
  "response": 1
}
{code}
a simple SELECT gives an error:
{code:sql}
apache drill> SELECT * FROM dfs.test.`file.json`;
Error: UNSUPPORTED_OPERATION ERROR: In a list of type VARCHAR, encountered a value of type LIST. Drill does not support lists of different types.
{code}
But there is an option _exec.enable_union_type_ that allows these requests:
{code:sql}
apache drill> ALTER SESSION SET `exec.enable_union_type` = true;
apache drill> SELECT * FROM dfs.test.`file.json`;
+--+---+--+
| name | info | response |
+--+---+--+
| toto | [["LOAD",[]]] | 1 |
+--+---+--+
1 row selected (0.283 seconds)
{code}
The existence of this option is not obvious, so it would be useful for the error message to mention the possibility of setting it:
{noformat}
Error: UNSUPPORTED_OPERATION ERROR: In a list of type VARCHAR, encountered a value of type LIST. Drill does not support lists of different types. SET the option 'exec.enable_union_type' to true and try again;
{noformat}
This behaviour is already used for other errors, for example:
{noformat}
...
Error: UNSUPPORTED_OPERATION ERROR: This query cannot be planned possibly due to either a cartesian join or an inequality join.
If a cartesian or inequality join is used intentionally, set the option 'planner.enable_nljoin_for_scalar_only' to false and try again.
{noformat}
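The mixed-type list itself is valid JSON; a small Python check (illustrative only, unrelated to Drill's implementation) shows the two element types that a single-typed list vector cannot hold together:

```python
import json

# The record from this ticket: "info" holds a list that mixes a string and a list.
rec = json.loads('{"name": "toto", "info": [["LOAD", []]], "response": 1}')

# A dynamically typed reader accepts this without complaint; Drill's value
# vectors want one element type per list, hence the UNSUPPORTED_OPERATION
# error unless `exec.enable_union_type` (or all_text_mode) is enabled.
types = sorted(type(v).__name__ for v in rec["info"][0])
print(types)  # ['list', 'str']
```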
[jira] [Created] (DRILL-7420) window function improve ROWS clause/frame possibilities
benj created DRILL-7420:
---
Summary: window function improve ROWS clause/frame possibilities
Key: DRILL-7420
URL: https://issues.apache.org/jira/browse/DRILL-7420
Project: Apache Drill
Issue Type: New Feature
Affects Versions: 1.16.0
Reporter: benj

The window frame possibilities are currently limited in Apache Drill. The ROWS clause is only possible with "BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW". It would be useful to be able to use:
* "BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING"
* "BETWEEN x PRECEDING AND y FOLLOWING"
{code:sql}
/* The ROWS clause is only possible with "BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW" */
apache drill> SELECT *, sum(a) OVER(ORDER BY b ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) FROM (SELECT 1 a, 1 b, 1 c);
+---+---+---++
| a | b | c | EXPR$3 |
+---+---+---++
| 1 | 1 | 1 | 1 |
+---+---+---++
1 row selected (1.357 seconds)

/* ROWS is currently not possible with "BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING" (it is possible with RANGE, but with a single ORDER BY only) */
apache drill> SELECT *, sum(a) OVER(ORDER BY b, c ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) FROM (SELECT 1 a, 1 b, 1 c);
Error: UNSUPPORTED_OPERATION ERROR: This type of window frame is currently not supported
See Apache Drill JIRA: DRILL-3188

/* ROWS is currently not possible with "BETWEEN x PRECEDING AND y FOLLOWING" */
apache drill> SELECT *, sum(a) OVER(ORDER BY b ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) FROM (SELECT 1 a, 1 b, 1 c);
Error: UNSUPPORTED_OPERATION ERROR: This type of window frame is currently not supported
See Apache Drill JIRA: DRILL-3188
{code}
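The requested `ROWS BETWEEN x PRECEDING AND y FOLLOWING` semantics can be sketched in Python (an illustrative model of the frame, not Drill code): each output row sums a sliding window of physical rows around the current one.

```python
def sum_rows_between(values, preceding, following):
    """SUM() over ROWS BETWEEN <preceding> PRECEDING AND <following> FOLLOWING:
    for each row, sum a window of physical rows around it, clipped at the
    partition edges."""
    out = []
    for i in range(len(values)):
        lo = max(0, i - preceding)
        hi = min(len(values), i + following + 1)
        out.append(sum(values[lo:hi]))
    return out

print(sum_rows_between([1, 2, 3, 4], 1, 1))  # [3, 6, 9, 7]
```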
[jira] [Commented] (DRILL-7017) lz4 codec for (un)compression
[ https://issues.apache.org/jira/browse/DRILL-7017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16957093#comment-16957093 ] benj commented on DRILL-7017:
- I'm not sure I understand, because lz4 is already shipped (by default) in jars/3rdparty/lz4-1.3.0.jar in Apache Drill, and it doesn't work. Even after adding "org.apache.hadoop.io.compress.Lz4Codec" to io.compression.codecs in core-site.xml and setting -Djava.library.path=/usr/hdp/.../lib/native/:
{code:sql}
SELECT * FROM dfs.test.`a.csvh.lz4`;
Error: EXECUTION_ERROR ERROR: native lz4 library not available
{code}

> lz4 codec for (un)compression
> -
>
> Key: DRILL-7017
> URL: https://issues.apache.org/jira/browse/DRILL-7017
> Project: Apache Drill
> Issue Type: Wish
> Components: Storage - Text CSV
> Affects Versions: 1.15.0
> Reporter: benj
> Priority: Major
>
> I didn't find in the documentation which compression formats are supported. But as it is possible to use Drill on a compressed file, like
> {code:java}
> SELECT * FROM tmp.`myfile.csv.gz`;
> {code}
> it would be useful to have this functionality for lz4 files ([https://github.com/lz4/lz4])
[jira] [Created] (DRILL-7404) window function RANGE with compound ORDER BY
benj created DRILL-7404:
---
Summary: window function RANGE with compound ORDER BY
Key: DRILL-7404
URL: https://issues.apache.org/jira/browse/DRILL-7404
Project: Apache Drill
Issue Type: Improvement
Components: Documentation
Affects Versions: 1.16.0
Reporter: benj

While creating ticket CALCITE-3402 (asking for improved window functions), it appeared that the Drill documentation is not up to date: [https://drill.apache.org/docs/aggregate-window-functions/]
{code:java}
frame_clause
If an ORDER BY clause is used for an aggregate function, an explicit frame clause is required. The frame clause refines the set of rows in a function's window, including or excluding sets of rows within the ordered result. The frame clause consists of the ROWS or RANGE keyword and associated specifiers.
{code}
But it is currently (1.16) possible to write an ORDER BY clause in a window function +without+ specifying an explicit frame clause. In this case, an +implicit+ frame clause is used. Normally the default/implicit framing option is {{RANGE UNBOUNDED PRECEDING}}, which is the same as {{RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW}} (and this should perhaps also be stated more explicitly in the documentation).
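The implicit {{RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW}} frame described above differs from a ROWS frame in that peers of the current row (rows with equal ORDER BY keys) are included together. A small Python sketch of that semantics (illustrative model with hypothetical names, not Drill code):

```python
def sum_default_range_frame(rows):
    """rows: list of (order_key, value) pairs already sorted by order_key.
    RANGE UNBOUNDED PRECEDING sums every row whose key is <= the current
    row's key, so peers with equal keys all receive the same running total."""
    return [sum(v for k, v in rows if k <= key) for key, _ in rows]

# Two peer rows with key 1 both see 10 + 20; the key-2 row sees all three.
print(sum_default_range_frame([(1, 10), (1, 20), (2, 5)]))  # [30, 30, 35]
```

A ROWS frame over the same input would instead give [10, 30, 35], which is why the distinction matters when ORDER BY keys are not unique.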
[jira] [Updated] (DRILL-7291) parquet with compression gzip doesn't work well
[ https://issues.apache.org/jira/browse/DRILL-7291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] benj updated DRILL-7291:
Component/s: Documentation

> parquet with compression gzip doesn't work well
> ---
>
> Key: DRILL-7291
> URL: https://issues.apache.org/jira/browse/DRILL-7291
> Project: Apache Drill
> Issue Type: Bug
> Components: Documentation, Storage - Parquet
> Affects Versions: 1.15.0, 1.16.0
> Reporter: benj
> Priority: Major
> Attachments: 0_0_0.parquet, short_no_binary_quote.csvh, sqlline_error.log
>
> Creating a parquet file with compression=gzip produces bad results.
> Example:
> * input: file_pqt (compression=none)
> {code:java}
> ALTER SESSION SET `store.format`='parquet';
> ALTER SESSION SET `store.parquet.compression` = 'snappy';
> CREATE TABLE `file_snappy_pqt` AS(SELECT * FROM `file_pqt`);
> ALTER SESSION SET `store.parquet.compression` = 'gzip';
> CREATE TABLE `file_gzip_pqt` AS(SELECT * FROM `file_pqt`);
> {code}
> Then compare the content of the different parquet files:
> {code:java}
> ALTER SESSION SET `store.parquet.use_new_reader` = true;
> SELECT COUNT(*) FROM `file_pqt`;        => 15728036
> SELECT COUNT(*) FROM `file_snappy_pqt`; => 15728036
> SELECT COUNT(*) FROM `file_gzip_pqt`;   => 15728036
> => OK
> SELECT COUNT(*) FROM `file_pqt` WHERE `Code` = '';        => 0
> SELECT COUNT(*) FROM `file_snappy_pqt` WHERE `Code` = ''; => 0
> SELECT COUNT(*) FROM `file_gzip_pqt` WHERE `Code` = '';   => 14744966
> => NOK
> SELECT COUNT(*) FROM `file_pqt` WHERE `Code2` = '';        => 0
> SELECT COUNT(*) FROM `file_snappy_pqt` WHERE `Code2` = ''; => 0
> SELECT COUNT(*) FROM `file_gzip_pqt` WHERE `Code2` = '';   => 14744921
> => NOK
> {code}
> _(There is no NULL value in these files.)_
> _(With exec.storage.enable_v3_text_reader=true it gives the same results)_
> So while the parquet file contains the right number of rows, the values in the different columns are not identical.
> Some "random" values of the _gzip parquet_ are reduced to empty strings.
> I think the problem is in the reader and not the writer because:
> {code:java}
> SELECT COUNT(*) FROM `file_pqt` WHERE `CRC32` = 'B33D600C';      => 2
> SELECT COUNT(*) FROM `file_gzip_pqt` WHERE `CRC32` = 'B33D600C'; => 0
> {code}
> but
> {code:java}
> hadoop jar parquet-tools-1.10.0.jar cat file_gzip_pqt/1_0_0.parquet | grep -c "B33D600C"
> 2019-06-12 11:45:23,738 INFO hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 3597092 records.
> 2019-06-12 11:45:23,739 INFO hadoop.InternalParquetRecordReader: at row 0. reading next block
> 2019-06-12 11:45:23,804 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
> 2019-06-12 11:45:23,805 INFO compress.CodecPool: Got brand-new decompressor [.gz]
> 2019-06-12 11:45:23,815 INFO hadoop.InternalParquetRecordReader: block read in memory in 76 ms. row count = 3597092
> 2
> {code}
> So the values are indeed present in the _Apache Parquet_ file but cannot be exploited via _Apache Drill_.
> Attached is an extract (the original file is 2.2 GB) which produces the same behaviour.
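The report's conclusion that the writer is fine is consistent with gzip itself being lossless; a stdlib round-trip (illustrative only, unrelated to Drill's reader code) shows the bytes survive compression intact:

```python
import gzip

# The CRC32 value the report greps for, repeated to make a compressible buffer.
data = b"B33D600C" * 1000
packed = gzip.compress(data)

# gzip is lossless: whatever the writer stored decompresses byte-for-byte,
# so values missing from query results point at the reading side.
assert gzip.decompress(packed) == data
print(len(data), len(packed) < len(data))
```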
[jira] [Updated] (DRILL-7246) DESCRIBE on Parquet File
[ https://issues.apache.org/jira/browse/DRILL-7246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] benj updated DRILL-7246:
Description:
It would be nice to be able to use DESCRIBE on a Parquet file. Example:
{code:sql}
DESCRIBE dfs.tmp.`test_parquet`;
+---+---+-+
| COLUMN_NAME | DATA_TYPE | IS_NULLABLE |
+---+---+-+
| MyColumn | INT | YES |
| AnotherColumn | DATE | NO |
| AdditionnalColumn | VARCHAR | YES |
+---+---+-+
{code}
Going further, why not offer this possibility for any file (in the case of CSV, DATA_TYPE would be VARCHAR and IS_NULLABLE YES)? Even if that is of limited use, it would at least list the available columns.

was: It will be nice if it's possible to use DESCRIBE on Parquet File. Example : {code:sql} DESCRIBE dfs.tmp.`test_parquet`; +---+---+-+ | COLUMN_NAME | DATA_TYPE | IS_NULLABLE | +---+---+-+ | MyColumn | INT | YES | | AnotherColumn | DATE | NO | | AdditionnalColumn | VARCHAR | YES | +---+---+-+ {code} And more why not propose this possibility for any file (in the case of the CSV, DATA_TYPE will be VARCHAR and IS_NULLABLE YES) - if it's a little bit useless, this would at least list the available columns.

> DESCRIBE on Parquet File
>
> Key: DRILL-7246
> URL: https://issues.apache.org/jira/browse/DRILL-7246
> Project: Apache Drill
> Issue Type: Improvement
> Components: Storage - Parquet
> Affects Versions: 1.16.0
> Reporter: benj
> Priority: Major
>
> It would be nice to be able to use DESCRIBE on a Parquet file. Example:
> {code:sql}
> DESCRIBE dfs.tmp.`test_parquet`;
> +---+---+-+
> | COLUMN_NAME | DATA_TYPE | IS_NULLABLE |
> +---+---+-+
> | MyColumn | INT | YES |
> | AnotherColumn | DATE | NO |
> | AdditionnalColumn | VARCHAR | YES |
> +---+---+-+
> {code}
> Going further, why not offer this possibility for any file (in the case of CSV, DATA_TYPE would be VARCHAR and IS_NULLABLE YES)? Even if that is of limited use, it would at least list the available columns.
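The CSV half of the proposal is easy to picture: every column would come back as nullable VARCHAR. A small Python sketch of that hypothetical DESCRIBE output (function name and row shape are assumptions for illustration, not Drill behaviour):

```python
import csv
import io

# Hypothetical DESCRIBE for a CSV-with-headers file: one
# (COLUMN_NAME, DATA_TYPE, IS_NULLABLE) row per header column,
# always VARCHAR / YES as suggested in the ticket.
def describe_csvh(text: str):
    header = next(csv.reader(io.StringIO(text)))
    return [(name, "VARCHAR", "YES") for name in header]

sample = "MyColumn,AnotherColumn\n1,2020-01-01\n"
print(describe_csvh(sample))
```

Even without type inference, such an output would already answer the "what columns does this file have?" question the ticket raises.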
[jira] [Commented] (DRILL-7291) parquet with compression gzip doesn't work well
[ https://issues.apache.org/jira/browse/DRILL-7291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16950784#comment-16950784 ] benj commented on DRILL-7291:
- [~arina] you are right: with the option +store.parquet.use_new_reader+ the problem disappears in 1.17 (last commit) and even in 1.16 and 1.15. I had not considered this option since it is marked "Not supported in this release." in its Comment field. So perhaps the problem can be addressed simply by changing the comment of the option _store.parquet.use_new_reader_. And maybe this option could become true by default in a future version (but could there be negative impacts?). Your help in solving this problem is appreciated.
[jira] [Commented] (DRILL-7291) parquet with compression gzip doesn't work well
[ https://issues.apache.org/jira/browse/DRILL-7291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16949569#comment-16949569 ] benj commented on DRILL-7291:
- Please find in attachment the error log [^sqlline_error.log]. I was intrigued by the "sc: REQUIRED BINARY O:UTF8 R:0 D:0" in relation to the error "Error: INTERNAL_ERROR ERROR: null", so I tried the query below, which solves the problem here.
{code:sql}
CREATE TABLE dfs.tmp.`t2` AS
SELECT sha1, md5, crc32, fn, fs, pc, osc,
       COALESCE(sc,'') /* USE COALESCE to avoid a NULL VALUE although empty csv values are protected by quotes */
FROM dfs.tmp.`short_no_binary_quote.csvh`;

SELECT * FROM dfs.tmp.`t2`;
+--+--+--++---+---+-++
| sha1 | md5 | crc32 | fn | fs | pc | osc | EXPR$7 |
+--+--+--++---+---+-++
| 000F8527DCCAB6642252BBCFA1B8072D33EE | 68CE322D8A896B6E4E7E3F18339EC85C | E39149E4 | Blended_Coolers_Vanilla_NL.png | 30439 | 19042 | 362 | |
| 0091728653B7D55DF30BFAFE86C52F2F4A59 | 81AE5D302A0E6D33182CB69ED791181C | 5594C3B0 | ic_menu_notifications.png | 366 | 21386 | 362 | |
| 065F1900120613745CC5E25A57C84624DC2B | AEB7C147EF7B7CEE91807B500A378BA4 | 24400952 | points_program_fragment.xml | 1684 | 21842 | 362 | |
+--+--+--++---+---+-++
3 rows selected (0.111 seconds)
{code}
But there is something wrong, because:
{code:sql}
CREATE TABLE dfs.tmp.`t2` AS
SELECT sha1, md5, crc32, fn, fs, pc, osc,
       COALESCE(sc,'FLAGFLAGFLAG') /* USE COALESCE to avoid a NULL VALUE although empty csv values are protected by quotes - NOTE that FLAGFLAGFLAG will not appear in EXPR$7 */
FROM dfs.tmp.`short_no_binary_quote.csvh`;

SELECT * FROM dfs.tmp.`t2` /* FLAGFLAGFLAG does not appear in EXPR$7 */;
+--+--+--++---+---+-++
| sha1 | md5 | crc32 | fn | fs | pc | osc | EXPR$7 |
+--+--+--++---+---+-++
| 000F8527DCCAB6642252BBCFA1B8072D33EE | 68CE322D8A896B6E4E7E3F18339EC85C | E39149E4 | Blended_Coolers_Vanilla_NL.png | 30439 | 19042 | 362 | |
| 0091728653B7D55DF30BFAFE86C52F2F4A59 | 81AE5D302A0E6D33182CB69ED791181C | 5594C3B0 | ic_menu_notifications.png | 366 | 21386 | 362 | |
| 065F1900120613745CC5E25A57C84624DC2B | AEB7C147EF7B7CEE91807B500A378BA4 | 24400952 | points_program_fragment.xml | 1684 | 21842 | 362 | |
+--+--+--++---+---+-++
3 rows selected (0.156 seconds)
{code}
[jira] [Updated] (DRILL-7291) parquet with compression gzip doesn't work well
[ https://issues.apache.org/jira/browse/DRILL-7291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] benj updated DRILL-7291: Attachment: sqlline_error.log > parquet with compression gzip doesn't work well > --- > > Key: DRILL-7291 > URL: https://issues.apache.org/jira/browse/DRILL-7291 > Project: Apache Drill > Issue Type: Bug > Components: Storage - Parquet >Affects Versions: 1.15.0, 1.16.0 >Reporter: benj >Priority: Major > Attachments: 0_0_0.parquet, short_no_binary_quote.csvh, > sqlline_error.log > > > Create a parquet with compression=gzip produce bad result. > Example: > * input: file_pqt (compression=none) > {code:java} > ALTER SESSION SET `store.format`='parquet'; > ALTER SESSION SET `store.parquet.compression` = 'snappy'; > CREATE TABLE `file_snappy_pqt` > AS(SELECT * FROM `file_pqt`); > ALTER SESSION SET `store.parquet.compression` = 'gzip'; > CREATE TABLE `file_gzip_pqt` > AS(SELECT * FROM `file_pqt`);{code} > Then compare the content of the different parquet files: > {code:java} > ALTER SESSION SET `store.parquet.use_new_reader` = true; > SELECT COUNT(*) FROM `file_pqt`;=> 15728036 > SELECT COUNT(*) FROM `file_snappy_pqt`; => 15728036 > SELECT COUNT(*) FROM `file_gzip_pqt`; => 15728036 > => OK > SELECT COUNT(*) FROM `file_pqt` WHERE `Code` = '';=> 0 > SELECT COUNT(*) FROM `file_snappy_pqt` WHERE `Code` = ''; => 0 > SELECT COUNT(*) FROM `file_gzip_pqt` WHERE `Code` = ''; => 14744966 > => NOK > SELECT COUNT(*) FROM `file_pqt` WHERE `Code2` = '';=> 0 > SELECT COUNT(*) FROM `file_snappy_pqt` WHERE `Code2` = ''; => 0 > SELECT COUNT(*) FROM `file_gzip_pqt` WHERE `Code2` = ''; => 14744921 > => NOK{code} > _(There is no NULL value in these files.)_ > _(With exec.storage.enable_v3_text_reader=true it gives same results)_ > So If the parquet file contains the right number of rows, the values in the > different columns are not identical. 
> Some "random" values of the _gzip parquet_ are reduce to empty string > I think the problem is from the reader and not the writer because: > {code:java} > SELECT COUNT(*) FROM `file_pqt` WHERE `CRC32` = 'B33D600C'; => 2 > SELECT COUNT(*) FROM `file_gzip_pqt` WHERE `CRC32` = 'B33D600C'; => 0 > {code} > but > {code:java} > hadoop jar parquet-tools-1.10.0.jar cat file_gzip_pqt/1_0_0.parquet | grep -c > "B33D600C" > 2019-06-12 11:45:23,738 INFO hadoop.InternalParquetRecordReader: RecordReader > initialized will read a total of 3597092 records. > 2019-06-12 11:45:23,739 INFO hadoop.InternalParquetRecordReader: at row 0. > reading next block > 2019-06-12 11:45:23,804 INFO zlib.ZlibFactory: Successfully loaded & > initialized native-zlib library > 2019-06-12 11:45:23,805 INFO compress.CodecPool: Got brand-new decompressor > [.gz] > 2019-06-12 11:45:23,815 INFO hadoop.InternalParquetRecordReader: block read > in memory in 76 ms. row count = 3597092 > 2 > {code} > So the values are well present in the _Apache Parquet_ file but can't be > exploited via _Apache Drill_. > In attachment an extract (the original file is 2.2 Go) which produce the same > behaviour. -- This message was sent by Atlassian Jira (v8.3.4#803005)
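The reader-vs-writer reasoning above (values visible to parquet-tools but not through Drill) is consistent with gzip itself being lossless; a minimal stdlib sketch, independent of Drill or Parquet internals, illustrating the round-trip:

```python
import gzip

# gzip is a lossless codec: whatever bytes go in come back out unchanged,
# so a value recoverable by one reader but not another points at that reader.
payload = b"B33D600C," * 1000  # repetitive data, compresses well
compressed = gzip.compress(payload)
restored = gzip.decompress(compressed)

assert restored == payload
assert len(compressed) < len(payload)  # genuinely compressed
```

Nothing here touches Parquet page framing, of course, which is where the "Error reading page data" failures below actually occur.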
[jira] [Commented] (DRILL-7291) parquet with compression gzip doesn't work well
[ https://issues.apache.org/jira/browse/DRILL-7291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16949501#comment-16949501 ] benj commented on DRILL-7291: - Exactly done the same as you (see below), but obtain an "+Error: INTERNAL_ERROR ERROR: null+" at the end {code:sql} ./drill-embedded Apache Drill 1.17.0-SNAPSHOT "A Drill in the hand is better than two in the bush." apache drill> ALTER SESSION SET `store.parquet.compression` = 'gzip'; +--++ | ok | summary | +--++ | true | store.parquet.compression updated. | +--++ 1 row selected (0.339 seconds) apache drill> select * from dfs.tmp.`short_no_binary_quote.csvh`; +--+--+--++---+---+-++ | sha1 | md5| crc32 | fn | fs | pc | osc | sc | +--+--+--++---+---+-++ | 000F8527DCCAB6642252BBCFA1B8072D33EE | 68CE322D8A896B6E4E7E3F18339EC85C | E39149E4 | Blended_Coolers_Vanilla_NL.png | 30439 | 19042 | 362 || | 0091728653B7D55DF30BFAFE86C52F2F4A59 | 81AE5D302A0E6D33182CB69ED791181C | 5594C3B0 | ic_menu_notifications.png | 366 | 21386 | 362 || | 065F1900120613745CC5E25A57C84624DC2B | AEB7C147EF7B7CEE91807B500A378BA4 | 24400952 | points_program_fragment.xml| 1684 | 21842 | 362 || +--+--+--++---+---+-++ 3 rows selected (1.008 seconds) apache drill> use dfs.tmp; +--+-+ | ok | summary | +--+-+ | true | Default schema changed to [dfs.tmp] | +--+-+ 1 row selected (0.087 seconds) apache drill (dfs.tmp)> create table t as select * from dfs.tmp.`short_no_binary_quote.csvh`; +--+---+ | Fragment | Number of records written | +--+---+ | 0_0 | 3 | +--+---+ 1 row selected (0.306 seconds) apache drill (dfs.tmp)> select * from t; Error: INTERNAL_ERROR ERROR: null Fragment 0:0 {code} But, the parquet seems OK {code:bash} hadoop jar parquet-tools-1.10.0.jar cat /tmp/t/0_0_0.parquet 2019-10-11 15:48:14,047 INFO hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 3 records. 2019-10-11 15:48:14,048 INFO hadoop.InternalParquetRecordReader: at row 0. 
reading next block 2019-10-11 15:48:14,063 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library 2019-10-11 15:48:14,063 INFO compress.CodecPool: Got brand-new decompressor [.gz] 2019-10-11 15:48:14,068 INFO hadoop.InternalParquetRecordReader: block read in memory in 19 ms. row count = 3 sha1 = 000F8527DCCAB6642252BBCFA1B8072D33EE md5 = 68CE322D8A896B6E4E7E3F18339EC85C crc32 = E39149E4 fn = Blended_Coolers_Vanilla_NL.png fs = 30439 pc = 19042 osc = 362 sc = sha1 = 0091728653B7D55DF30BFAFE86C52F2F4A59 md5 = 81AE5D302A0E6D33182CB69ED791181C crc32 = 5594C3B0 fn = ic_menu_notifications.png fs = 366 pc = 21386 osc = 362 sc = sha1 = 065F1900120613745CC5E25A57C84624DC2B md5 = AEB7C147EF7B7CEE91807B500A378BA4 crc32 = 24400952 fn = points_program_fragment.xml fs = 1684 pc = 21842 osc = 362 sc = hadoop jar parquet-tools-1.10.0.jar meta /tmp/t/0_0_0.parquet 2019-10-11 15:51:25,255 INFO hadoop.ParquetFileReader: Initiating action with parallelism: 5 2019-10-11 15:51:25,256 INFO hadoop.ParquetFileReader: reading another 1 footers 2019-10-11 15:51:25,256 INFO hadoop.ParquetFileReader: Initiating action with parallelism: 5 file:file:/tmp/t/0_0_0.parquet creator: parquet-mr version 1.10.0 (build ${buildNumber}) extra: drill-writer.version = 3 extra: drill.version = 1.17.0-SNAPSHOT file schema: root sha1:REQUIRED BINARY O:UTF8 R:0 D:0 md5: REQUIRED BINARY O:UTF8 R:0 D:0 crc32: REQUIRED BINARY O:UTF8 R:0 D:0 fn: REQUIRED BINARY O:UTF8 R:0 D:0 fs: REQUIRED BINARY O:UTF8 R:0 D:0 pc: REQUIRED BINARY O:UTF8 R:0 D:0 osc: REQUIRED BINARY O:UTF8 R:0 D:0 sc: REQUIRED BINARY O:UTF8 R:0 D:0 row group 1: RC:3 TS:914 OFFSET:4 sha1: BINARY GZIP DO:0 FPO:4 SZ:210/239/1,14 VC:3 ENC:BIT_PACKED,PLAIN ST:[min:
[jira] [Comment Edited] (DRILL-7348) Aggregate on Subquery with Select Distinct or UNION fails to Group By
[ https://issues.apache.org/jira/browse/DRILL-7348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16947512#comment-16947512 ] benj edited comment on DRILL-7348 at 10/9/19 10:01 AM: --- I have tested in 1.15 and 1.16. I think that the problem of [~snapdoodle] is that `date` is a reserved keyword:
{code:sql}
SELECT date, COUNT(1)
FROM (
  SELECT DISTINCT id, date, status
  FROM (select 1 id, 'b' date, 3 status
        UNION SELECT 2, 'c', 4
        UNION SELECT 1, 'c', 4) AS r1
) AS r2
GROUP BY 1
Error: PARSE ERROR: Encountered "date ,"
{code}
So you need backquotes around `date`:
{code:sql}
SELECT `date`, COUNT(1)
FROM (
  SELECT DISTINCT id, `date`, status
  FROM (select 1 id, 'b' `date`, 3 status
        UNION SELECT 2, 'c', 4
        UNION SELECT 1, 'c', 4) AS r1
) AS r2
GROUP BY 1;
+------+--------+
| date | EXPR$1 |
+------+--------+
| b    | 1      |
| c    | 2      |
+------+--------+
{code}
The result is correct here. Maybe there is also a real problem, but without the content of the file dfs.`path` it will be difficult to conclude.
> Aggregate on Subquery with Select Distinct or UNION fails to Group By > - > > Key: DRILL-7348 > URL: https://issues.apache.org/jira/browse/DRILL-7348 > Project: Apache Drill > Issue Type: Bug > Components: Functions - Drill >Affects Versions: 1.15.0 >Reporter: Keith G Yu >Priority: Major > > The following query fails to group properly. > {code:java} > SELECT date, COUNT(1) > FROM ( > SELECT DISTINCT > id, > date, > status > FROM table(dfs.`path`(type => 'text', fieldDelimiter => ',', > extractHeader => TRUE)) > ) > GROUP BY 1{code} > This also fails to group properly. > {code:java} > SELECT date, COUNT(1) > FROM ( > SELECT > id, > date, > status > FROM table(dfs.`path1`(type => 'text', fieldDelimiter => ',', > extractHeader => TRUE)) > UNION > SELECT > id, > date, > status > FROM table(dfs.`path2`(type => 'text', fieldDelimiter => ',', > extractHeader => TRUE)) > ) > GROUP BY 1 > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (DRILL-7348) Aggregate on Subquery with Select Distinct or UNION fails to Group By
[ https://issues.apache.org/jira/browse/DRILL-7348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16947512#comment-16947512 ] benj commented on DRILL-7348: - I have tested in 1.15 and 1.16. I think that the problem of [~snapdoodle] is that `date` is a reserved keyword:
{code:sql}
SELECT date, COUNT(1)
FROM (
  SELECT DISTINCT id, date, status
  FROM (select 1 id, 'b' date, 3 status
        UNION SELECT 2, 'c', 4
        UNION SELECT 1, 'c', 4) AS r1
) AS r2
GROUP BY 1
Error: PARSE ERROR: Encountered "date ,"
{code}
So you need backquotes around `date`:
{code:sql}
SELECT `date`, COUNT(1)
FROM (
  SELECT DISTINCT id, `date`, status
  FROM (select 1 id, 'b' `date`, 3 status
        UNION SELECT 2, 'c', 4
        UNION SELECT 1, 'c', 4) AS r1
) AS r2
GROUP BY 1;
+------+--------+
| date | EXPR$1 |
+------+--------+
| b    | 1      |
| c    | 2      |
+------+--------+
{code}
The result is correct here. Maybe there is also a real problem, but without the content of the file dfs.`path` it will be difficult to conclude. -- This message was sent by Atlassian Jira (v8.3.4#803005)
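For comparison, the same minimal counterexample groups correctly in other engines; a quick sketch against Python's bundled SQLite (an assumption-free cross-check only — SQLite does not reserve `date`, so no quoting is needed there):

```python
import sqlite3

# Reproduce the DISTINCT-then-GROUP BY counterexample from the comment above.
conn = sqlite3.connect(":memory:")
rows = conn.execute("""
    SELECT date, COUNT(1)
    FROM (
        SELECT DISTINCT id, date, status
        FROM (SELECT 1 id, 'b' date, 3 status
              UNION SELECT 2, 'c', 4
              UNION SELECT 1, 'c', 4)
    )
    GROUP BY 1
    ORDER BY 1
""").fetchall()
print(rows)  # [('b', 1), ('c', 2)], matching the Drill output above
```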
[jira] [Commented] (DRILL-7291) parquet with compression gzip doesn't work well
[ https://issues.apache.org/jira/browse/DRILL-7291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16945871#comment-16945871 ] benj commented on DRILL-7291: - In fact, the problem appears in +embedded mode+ (bin/drill-embedded) and not in sqlline mode. So it may be a little less serious, and that explains the differences in behavior. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (DRILL-7291) parquet with compression gzip doesn't work well
[ https://issues.apache.org/jira/browse/DRILL-7291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16945825#comment-16945825 ] benj commented on DRILL-7291: - Sure, sorry for the delay, but I preferred to double-check some points with the original file:
- There is a column 'FileName' in the csvh => Renaming columns doesn't fix the problem.
- There are Microsoft (CRLF) line endings in the original csvh => Changing to Unix line endings doesn't fix the problem.
- Some fields have no surrounding quotes => Forcing quotes everywhere doesn't fix the problem.
- There is some binary data in the csvh, but the same problem appears with a small extract containing no binary, so I prefer to push a small file rather than the big one.
- And I have double-checked again on the latest git 1.17 without fixing the problem.
So in attachment there is a minimalist csvh file (1 row of header and 3 rows of data). I also put it just below [^short_no_binary_quote.csvh]
{code:java}
"sha1","md5","crc32","fn","fs","pc","osc","sc"
"000F8527DCCAB6642252BBCFA1B8072D33EE","68CE322D8A896B6E4E7E3F18339EC85C","E39149E4","Blended_Coolers_Vanilla_NL.png","30439","19042","362",""
"0091728653B7D55DF30BFAFE86C52F2F4A59","81AE5D302A0E6D33182CB69ED791181C","5594C3B0","ic_menu_notifications.png","366","21386","362",""
"065F1900120613745CC5E25A57C84624DC2B","AEB7C147EF7B7CEE91807B500A378BA4","24400952","points_program_fragment.xml","1684","21842","362",""
{code}
{code:sql}
/* "csvh": { "type": "text", "extensions": [ "csvh" ], "extractHeader": true, "delimiter": "," }, */
ALTER SESSION set exec.storage.enable_v3_text_reader=true;
ALTER SESSION SET `store.parquet.compression` = 'gzip';
SELECT * FROM `DRILL_7291/short_no_binary_quote.csvh`;
| sha1                                 | md5                              | crc32    | fn                             | fs    | pc    | osc | sc |
| 000F8527DCCAB6642252BBCFA1B8072D33EE | 68CE322D8A896B6E4E7E3F18339EC85C | E39149E4 | Blended_Coolers_Vanilla_NL.png | 30439 | 19042 | 362 |    |
| 0091728653B7D55DF30BFAFE86C52F2F4A59 | 81AE5D302A0E6D33182CB69ED791181C | 5594C3B0 | ic_menu_notifications.png      | 366   | 21386 | 362 |    |
| 065F1900120613745CC5E25A57C84624DC2B | AEB7C147EF7B7CEE91807B500A378BA4 | 24400952 | points_program_fragment.xml    | 1684  | 21842 | 362 |    |
CREATE TABLE `DRILL_7291/problem_pqt` AS (
  SELECT * FROM `DRILL_7291/short_no_binary_quote.csvh`);
| Fragment | Number of records written |
| 0_0      | 3                         |
SELECT * FROM `DRILL_7291/problem_pqt`;
Error: DATA_READ ERROR: Error reading page data
{code}
And all works fine with 'snappy' or 'none'.
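As a sanity check on the attachment itself, the csvh above is ordinary fully-quoted CSV with a trailing empty column; a stdlib sketch (independent of Drill) using just its first data row:

```python
import csv
import io

# Header plus the first data row of short_no_binary_quote.csvh, as posted above.
CSVH = '''"sha1","md5","crc32","fn","fs","pc","osc","sc"
"000F8527DCCAB6642252BBCFA1B8072D33EE","68CE322D8A896B6E4E7E3F18339EC85C","E39149E4","Blended_Coolers_Vanilla_NL.png","30439","19042","362",""
'''

rows = list(csv.DictReader(io.StringIO(CSVH)))
assert len(rows) == 1
assert rows[0]["fn"] == "Blended_Coolers_Vanilla_NL.png"
assert rows[0]["sc"] == ""  # the trailing empty field survives parsing
```

So the input file parses cleanly with a plain CSV reader, which supports the idea that the defect is on the Parquet read path rather than in the source data.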
[jira] [Updated] (DRILL-7291) parquet with compression gzip doesn't work well
[ https://issues.apache.org/jira/browse/DRILL-7291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] benj updated DRILL-7291: Attachment: short_no_binary_quote.csvh -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (DRILL-7291) parquet with compression gzip doesn't work well
[ https://issues.apache.org/jira/browse/DRILL-7291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16944585#comment-16944585 ] benj edited comment on DRILL-7291 at 10/7/19 9:55 AM: -- Indeed, surprisingly, I can't reproduce the problem with the file in attachment. But I try to reproduce the problem with the original file (2.2Go) and all is fine with no compression or with snappy compression but with gzip compression it failed (but no exactly the same way as before): {code:sql} ALTER SESSION SET `store.format`='parquet'; ALTER SESSION SET `store.parquet.compression` = 'gzip'; CREATE TABLE `file_gzip_pqt` AS (SELECT * FROM `file_pqt`); /* 1.16 (and 1.15) */ SELECT count(*) FROM `file_gzip_pqt`; /* OK */ SELECT count(*) FROM `file_gzip_pqt` WHERE `Code`=20492; /* NOK */ Error: DATA_READ ERROR: Error reading page data /* 1.17 */ SELECT count(*) FROM `file_gzip_pqt`; /* OK */ SELECT count(*) FROM `file_gzip_pqt` WHERE `Code`=20492; /* NOK */ Error: INTERNAL_ERROR ERROR: null {code} Note that {code:sql} SELECT count( * ) FROM `file_gzip_pqt` WHERE SpecialCode =' '; /* OK */ SELECT count( * ) FROM `file_gzip_pqt` WHERE SpecialCode <> ''; /* NOK */ {code} But, maybe it's because all the values of 'SpecialCode' column are empty ("") was (Author: benj641): Indeed, surprisingly, I can't reproduce the problem with the file in attachment. 
But I try to reproduce the problem with the original file (2.2Go) and all is fine with no compression or with snappy compression but with gzip compression it failed (but no exactly the same way as before):
{code:sql}
ALTER SESSION SET `store.format`='parquet';
ALTER SESSION SET `store.parquet.compression` = 'gzip';
CREATE TABLE `file_gzip_pqt` AS (SELECT * FROM `file_pqt`);
/* 1.16 (and 1.15) */
SELECT count(*) FROM `file_gzip_pqt`; /* OK */
SELECT count(*) FROM `file_gzip_pqt` WHERE `Code`=20492; /* NOK */
Error: DATA_READ ERROR: Error reading page data
/* 1.17 */
SELECT count(*) FROM `file_gzip_pqt`; /* OK */
SELECT count(*) FROM `file_gzip_pqt` WHERE `Code`=20492; /* NOK */
Error: INTERNAL_ERROR ERROR: null
{code}
Note that {code:sql}SELECT count( * ) FROM `file_gzip_pqt` WHERE COLUMN=20492;{code} gives no error with 2 COLUMN (on 7) (?)
[jira] [Comment Edited] (DRILL-7291) parquet with compression gzip doesn't work well
[ https://issues.apache.org/jira/browse/DRILL-7291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16944585#comment-16944585 ] benj edited comment on DRILL-7291 at 10/7/19 9:44 AM: -- Indeed, surprisingly, I can't reproduce the problem with the file in attachment. But I try to reproduce the problem with the original file (2.2Go) and all is fine with no compression or with snappy compression but with gzip compression it failed (but no exactly the same way as before): {code:sql} ALTER SESSION SET `store.format`='parquet'; ALTER SESSION SET `store.parquet.compression` = 'gzip'; CREATE TABLE `file_gzip_pqt` AS (SELECT * FROM `file_pqt`); /* 1.16 (and 1.15) */ SELECT count(*) FROM `file_gzip_pqt`; /* OK */ SELECT count(*) FROM `file_gzip_pqt` WHERE `Code`=20492; /* NOK */ Error: DATA_READ ERROR: Error reading page data /* 1.17 */ SELECT count(*) FROM `file_gzip_pqt`; /* OK */ SELECT count(*) FROM `file_gzip_pqt` WHERE `Code`=20492; /* NOK */ Error: INTERNAL_ERROR ERROR: null {code} Note that {code:sql}SELECT count( * ) FROM `file_gzip_pqt` WHERE COLUMN=20492;{code} gives no error with 2 COLUMN (on 7) (?) was (Author: benj641): Indeed, surprisingly, I can't reproduce the problem with the file in attachment. 
But I try to reproduce the problem with the original file (2.2Go) and all is fine with no compression or with snappy compression but with gzip compression it failed (but no exactly the same way as before):
{code:sql}
ALTER SESSION SET `store.format`='parquet';
ALTER SESSION SET `store.parquet.compression` = 'gzip';
CREATE TABLE `file_gzip_pqt` AS (SELECT * FROM `file_pqt`);
/* 1.16 (and 1.15) */
SELECT count(*) FROM `file_gzip_pqt`; /* OK */
SELECT count(*) FROM `file_gzip_pqt` WHERE `Code`=20492; /* NOK */
Error: DATA_READ ERROR: Error reading page data
/* 1.17 */
SELECT count(*) FROM `file_gzip_pqt`; /* OK */
SELECT count(*) FROM `file_gzip_pqt` WHERE `Code`=20492; /* NOK */
Error: INTERNAL_ERROR ERROR: null
{code}
Note that {code:sql}SELECT count( * ) FROM `file_gzip_pqt` WHERE COLUMN=20492;{code} gives no error with 2 COLUMN (on 7) (?) if forcing each column with {code:sql}CAST(COLUMN AS VARCHAR){code} when {code:sql}CREATE AS{code} it seems there is no problem after. But I think more investigation needed. I will try to push more elements.
[jira] [Commented] (DRILL-7291) parquet with compression gzip doesn't work well
[ https://issues.apache.org/jira/browse/DRILL-7291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16944585#comment-16944585 ] benj commented on DRILL-7291: - Indeed, surprisingly, I can't reproduce the problem with the file in attachment. But I try to reproduce the problem with the original file (2.2Go) and all is fine with no compression or with snappy compression but with gzip compression it failed (but no exactly the same way as before): {code:sql} ALTER SESSION SET `store.format`='parquet'; ALTER SESSION SET `store.parquet.compression` = 'gzip'; CREATE TABLE `file_gzip_pqt` AS (SELECT * FROM `file_pqt`); /* 1.16 (and 1.15) */ SELECT count(*) FROM `file_gzip_pqt`; /* OK */ SELECT count(*) FROM `file_gzip_pqt` WHERE `Code`=20492; /* NOK */ Error: DATA_READ ERROR: Error reading page data /* 1.17 */ SELECT count(*) FROM `file_gzip_pqt`; /* OK */ SELECT count(*) FROM `file_gzip_pqt` WHERE `Code`=20492; /* NOK */ Error: INTERNAL_ERROR ERROR: null {code} Note that {code:sql}SELECT count( * ) FROM `file_gzip_pqt` WHERE COLUMN=20492;{code} gives no error with 2 COLUMN (on 7) (?) if forcing each column with {code:sql}CAST(COLUMN AS VARCHAR){code} when {code:sql}CREATE AS{code} it seems there is no problem after. But I think more investigation needed. I will try to push more elements. > parquet with compression gzip doesn't work well > --- > > Key: DRILL-7291 > URL: https://issues.apache.org/jira/browse/DRILL-7291 > Project: Apache Drill > Issue Type: Bug > Components: Storage - Parquet >Affects Versions: 1.15.0, 1.16.0 >Reporter: benj >Priority: Major > Attachments: 0_0_0.parquet > > > Create a parquet with compression=gzip produce bad result. 
> Example: > * input: file_pqt (compression=none) > {code:java} > ALTER SESSION SET `store.format`='parquet'; > ALTER SESSION SET `store.parquet.compression` = 'snappy'; > CREATE TABLE `file_snappy_pqt` > AS(SELECT * FROM `file_pqt`); > ALTER SESSION SET `store.parquet.compression` = 'gzip'; > CREATE TABLE `file_gzip_pqt` > AS(SELECT * FROM `file_pqt`);{code} > Then compare the content of the different parquet files: > {code:java} > ALTER SESSION SET `store.parquet.use_new_reader` = true; > SELECT COUNT(*) FROM `file_pqt`;=> 15728036 > SELECT COUNT(*) FROM `file_snappy_pqt`; => 15728036 > SELECT COUNT(*) FROM `file_gzip_pqt`; => 15728036 > => OK > SELECT COUNT(*) FROM `file_pqt` WHERE `Code` = '';=> 0 > SELECT COUNT(*) FROM `file_snappy_pqt` WHERE `Code` = ''; => 0 > SELECT COUNT(*) FROM `file_gzip_pqt` WHERE `Code` = ''; => 14744966 > => NOK > SELECT COUNT(*) FROM `file_pqt` WHERE `Code2` = '';=> 0 > SELECT COUNT(*) FROM `file_snappy_pqt` WHERE `Code2` = ''; => 0 > SELECT COUNT(*) FROM `file_gzip_pqt` WHERE `Code2` = ''; => 14744921 > => NOK{code} > _(There is no NULL value in these files.)_ > _(With exec.storage.enable_v3_text_reader=true it gives same results)_ > So If the parquet file contains the right number of rows, the values in the > different columns are not identical. > Some "random" values of the _gzip parquet_ are reduce to empty string > I think the problem is from the reader and not the writer because: > {code:java} > SELECT COUNT(*) FROM `file_pqt` WHERE `CRC32` = 'B33D600C'; => 2 > SELECT COUNT(*) FROM `file_gzip_pqt` WHERE `CRC32` = 'B33D600C'; => 0 > {code} > but > {code:java} > hadoop jar parquet-tools-1.10.0.jar cat file_gzip_pqt/1_0_0.parquet | grep -c > "B33D600C" > 2019-06-12 11:45:23,738 INFO hadoop.InternalParquetRecordReader: RecordReader > initialized will read a total of 3597092 records. > 2019-06-12 11:45:23,739 INFO hadoop.InternalParquetRecordReader: at row 0. 
> reading next block
> 2019-06-12 11:45:23,804 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
> 2019-06-12 11:45:23,805 INFO compress.CodecPool: Got brand-new decompressor [.gz]
> 2019-06-12 11:45:23,815 INFO hadoop.InternalParquetRecordReader: block read in memory in 76 ms. row count = 3597092
> 2
> {code}
> So the values are indeed present in the _Apache Parquet_ file but can't be exploited via _Apache Drill_.
> Attached is an extract (the original file is 2.2 GB) which produces the same behaviour.
-- This message was sent by Atlassian Jira (v8.3.4#803005)
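For reference, Parquet's GZIP codec stores each compressed page as an ordinary gzip stream, which is consistent with `parquet-tools` still reading the data when Drill's reader fails. A minimal standard-library sketch of the round trip the page codec performs (the page bytes are a stand-in, not real page data):

```python
import gzip

# Illustrative only: a Parquet GZIP-coded page is a standard gzip stream,
# so a page written correctly can be verified with any gzip implementation,
# independently of Drill's reader.
page = b"B33D600C" * 1000          # stand-in for a raw (uncompressed) page
compressed = gzip.compress(page)   # what the writer stores on disk
restored = gzip.decompress(compressed)

assert restored == page            # the data survives the codec itself
```

This supports the comment's conclusion that the corruption is on the read path rather than in the written file.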
[jira] [Created] (DRILL-7396) Exception when trying to access last element of an array with repeated_count
benj created DRILL-7396: --- Summary: Exception when trying to access last element of an array with repeated_count Key: DRILL-7396 URL: https://issues.apache.org/jira/browse/DRILL-7396 Project: Apache Drill Issue Type: Bug Components: Functions - Drill Affects Versions: 1.16.0 Reporter: benj Using arrays in Drill is not friendly:
{code:sql}
SELECT (split('a,b,c',','))[0]; /* NOK */
Error: SYSTEM ERROR: ClassCastException: org.apache.drill.common.expression.FunctionCall cannot be cast to org.apache.drill.common.expression.SchemaPath
/* an outer SELECT is needed */
SELECT x[0] FROM (SELECT split('a,b,c',',') x); /* OK */
{code}
And accessing the last element of an array is worse:
{code:sql}
SELECT x[repeated_count(x) - 1] AS lasteltidx FROM (SELECT split('a,b,c',',') x);
Error: SYSTEM ERROR: ClassCastException: org.apache.calcite.rex.RexCall cannot be cast to org.apache.calcite.rex.RexLiteral
/* while */
SELECT x[2] lastelt, (repeated_count(x) - 1) AS lasteltidx FROM (SELECT split('a,b,c',',') x);
+---------+------------+
| lastelt | lasteltidx |
+---------+------------+
| c       | 2          |
+---------+------------+
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
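For clarity, this is what the failing query is expected to compute, sketched in Python (the helper name is made up for illustration):

```python
def last_element(text: str, sep: str = ",") -> str:
    # Equivalent of split('a,b,c', ',') ...
    arr = text.split(sep)
    # ... followed by x[repeated_count(x) - 1]
    return arr[len(arr) - 1]

assert last_element("a,b,c") == "c"
```

The index expression is a plain function of the array, which is exactly what Drill's planner rejects when it expects a literal subscript.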
[jira] [Created] (DRILL-7395) Partial Partition By to CTAS Parquet files
benj created DRILL-7395: --- Summary: Partial Partition By to CTAS Parquet files Key: DRILL-7395 URL: https://issues.apache.org/jira/browse/DRILL-7395 Project: Apache Drill Issue Type: Improvement Components: Storage - Parquet Affects Versions: 1.16.0 Reporter: benj For a data set in which a few values are prevalent while most occur rarely, it would be useful to be able to create Parquet files with a partial _PARTITION BY_. It would then be possible to group all the rare values together without being "impacted" by the "too" common values. It's not exactly the same thing, but partial indexes exist in some databases (https://www.postgresql.org/docs/current/indexes-partial.html). -- This message was sent by Atlassian Jira (v8.3.4#803005)
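A sketch of the requested behaviour, under the assumption that "partial" means only sufficiently frequent values get their own partition while rare values share a catch-all partition (the function, threshold, and partition names are hypothetical):

```python
from collections import Counter

def partial_partitions(values, min_count=2):
    """Map each distinct value to a partition key: prevalent values keep
    their own partition, rare values share a single catch-all one."""
    freq = Counter(values)
    return {v: (v if freq[v] >= min_count else "__other__") for v in freq}

parts = partial_partitions(["a", "a", "a", "b", "c"], min_count=2)
assert parts["a"] == "a"                          # prevalent: own partition
assert parts["b"] == parts["c"] == "__other__"    # rare: grouped together
```

This keeps the number of partition directories bounded even when the column has a long tail of distinct values.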
[jira] [Commented] (DRILL-7004) improve show files functionnality
[ https://issues.apache.org/jira/browse/DRILL-7004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16943152#comment-16943152 ] benj commented on DRILL-7004: - Ok, thanks for the clarification. Note that, surprisingly, in the presented case there are not many files in each directory (60 files per directory), but there are many directories, and the tested system has several servers with several dozen processors each. So it's strange that the time increases linearly from 1 directory (60 files) to 24 directories (of 60 files each) or to 31*24 directories (of 60 files each). _(Ok, there is a factor of 2 for the 24 directories - but it's not meaningful enough.)_
> improve show files functionnality
> -
>
> Key: DRILL-7004
> URL: https://issues.apache.org/jira/browse/DRILL-7004
> Project: Apache Drill
> Issue Type: Wish
> Components: Storage - Other
>Affects Versions: 1.15.0
>Reporter: benj
>Priority: Major
>
> At the moment, it's possible to show the files/directories in a particular directory with the command
> {code:java}
> SHOW files FROM tmp.`mypath`;
> {code}
> It would certainly be very useful to improve this functionality with:
> * the possibility to list recursively
> * the possibility to use at least wildcards
> {code:java}
> SHOW files FROM tmp.`mypath/*/test/*/*a*`;
> {code}
> * the possibility to use the result like a table
> {code:java}
> SELECT p.* FROM (SHOW files FROM tmp.`mypath`) AS p WHERE ...
> {code}
>
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (DRILL-7004) improve show files functionnality
[ https://issues.apache.org/jira/browse/DRILL-7004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16942863#comment-16942863 ] benj commented on DRILL-7004: - Despite the parallelization, for a system containing several hundred thousand files, it is really too slow and therefore unusable. Example:
{code:java}
DIR_root
|--DIR_DAY_x
   |--DIR_HOUR_y
      |--File_MINUTE_z
with x from 1 to 31, y from 0 to 23, z from 0 to 59
{code}
{code:sql}
/* mydfs.myminutes : location = "DIR_ROOT/DIR_DAY_1/DIR_HOUR_0" */
SELECT * FROM INFORMATION_SCHEMA.`FILES` WHERE schema_name = 'mydfs.myminutes' and is_file = true;
=> time ~ 0.65 seconds - 60 files              > (time of unix find : 0.042s)

/* mydfs.myhours : location = "DIR_ROOT/DIR_DAY_1" */
ALTER SESSION SET `storage.list_files_recursively` = true;
SELECT * FROM INFORMATION_SCHEMA.`FILES` WHERE schema_name = 'mydfs.myhours' and is_file = true;
=> time ~ 9 seconds - 1440 files (60*24)       >>> (time of unix find : 0.095s)

/* mydfs.mydays : location = "DIR_ROOT/" */
ALTER SESSION SET `storage.list_files_recursively` = true;
SELECT * FROM INFORMATION_SCHEMA.`FILES` WHERE schema_name = 'mydfs.mydays' and is_file = true;
=> time ~ 417 seconds - 44640 files (60*24*31) >> (time of unix find : 1.5s (with print))
{code}
It's understandable that there is overhead compared to the unix tools, but the average time per file is much too expensive - here 0.01s per file, i.e. 2h30 to scan 1 million files. It's a pity that it's really more efficient to run a `find path/ -type f > mytmp.csv` and then `SELECT * FROM mytmp.csv` _(with the necessary permissions)_.
> improve show files functionnality
> -
>
> Key: DRILL-7004
> URL: https://issues.apache.org/jira/browse/DRILL-7004
> Project: Apache Drill
> Issue Type: Wish
> Components: Storage - Other
>Affects Versions: 1.15.0
>Reporter: benj
>Priority: Major
>
> At the moment, it's possible to show the files/directories in a particular directory with the command
> {code:java}
> SHOW files FROM tmp.`mypath`;
> {code}
> It would certainly be very useful to improve this functionality with:
> * the possibility to list recursively
> * the possibility to use at least wildcards
> {code:java}
> SHOW files FROM tmp.`mypath/*/test/*/*a*`;
> {code}
> * the possibility to use the result like a table
> {code:java}
> SELECT p.* FROM (SHOW files FROM tmp.`mypath`) AS p WHERE ...
> {code}
>
-- This message was sent by Atlassian Jira (v8.3.4#803005)
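The recursive, wildcard-filtered listing requested above can be modelled with the Python standard library; this only illustrates the desired SHOW FILES semantics, not Drill's implementation:

```python
import fnmatch
import os

def show_files(root, pattern="*"):
    """Recursively list files under root, keeping only paths matching an
    optional wildcard pattern (the `storage.list_files_recursively` case)."""
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if fnmatch.fnmatch(path, pattern):
                yield path
```

Applied to the DIR_DAY/DIR_HOUR layout above, a single pass like this returns all 60*24*31 file paths; a pattern such as `*/test/*/*a*` would implement the wildcard form of the wish.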
[jira] [Created] (DRILL-7392) Exclude some files when requesting directory
benj created DRILL-7392: --- Summary: Exclude some files when requesting directory Key: DRILL-7392 URL: https://issues.apache.org/jira/browse/DRILL-7392 Project: Apache Drill Issue Type: Wish Reporter: benj Fix For: 1.16.0 Currently Drill ignores files starting with a dot ('.') or an underscore ('_'). When querying a directory containing files of different types or different schemas, present at multiple levels of the file tree, it would be useful/more flexible to also have option(s) to exclude some files by extension, or maybe with a regexp. For example:
{code:java}
myTable
|--D1
   |--file1.csv
   |--file2.csv
|--D2
   |--SubD2
      |--file1.csv
   |--file1.csv
   |--file1.xml
   |--file1.json
{code}
Without entering into a debate about what a good organisation/layout for the data is, the current way to query all the csv files of this example is:
{code:sql}
SELECT * FROM `myTable/*/*.csv`
UNION
SELECT * FROM `myTable/*/*/*.csv`
{code}
It would be useful to have the capacity to query _myTable_ directly, like:
{code:sql}
/* ALTER SESSION SET exclude_files='xml,json' */
/* or */
/* ALTER SESSION SET only_files='csv' */
SELECT * FROM myTable
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
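A sketch of the requested filter, assuming a hypothetical `exclude_files`-style option holding a set of extensions; dot and underscore files are skipped as Drill already does:

```python
import os

def walk_excluding(root, exclude_ext=("xml", "json")):
    """Yield data files under root, skipping dot/underscore files
    (Drill's current rule) plus any extension listed in exclude_ext."""
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if name.startswith((".", "_")):
                continue  # Drill already ignores these
            if "." in name and name.rsplit(".", 1)[1].lower() in exclude_ext:
                continue  # the requested extension-based exclusion
            yield os.path.join(dirpath, name)
```

An `only_files='csv'` variant would simply invert the extension test.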
[jira] [Created] (DRILL-7390) kvgen/flatten doesn't produce same result from .json or .parquet
benj created DRILL-7390: --- Summary: kvgen/flatten doesn't produce same result from .json or .parquet Key: DRILL-7390 URL: https://issues.apache.org/jira/browse/DRILL-7390 Project: Apache Drill Issue Type: Bug Components: Execution - Data Types, Functions - Drill, Storage - JSON, Storage - Parquet Affects Versions: 1.16.0 Reporter: benj Attachments: ANIMALS_json.tar.gz, ANIMALS_pqt.tar.gz With a Parquet file produced from JSON (_ANIMALS_json_ and _ANIMALS_pqt_ attached in tar.gz format):
{code:sql}
CREATE TABLE `ANIMALS_pqt` AS (SELECT * FROM `ANIMALS_json`);
{code}
the same query, using kvgen and flatten, applied to the JSON and to the Parquet doesn't produce the same results:
{code:sql}
SELECT count(*) FROM (SELECT f FROM (SELECT flatten(k) AS f FROM (SELECT kvgen(animals) AS k FROM `ANIMALS_json`)) AS x)
=> 8482290
SELECT count(*) FROM (SELECT f FROM (SELECT flatten(k) AS f FROM (SELECT kvgen(animals) AS k FROM `ANIMALS_pqt`)) AS x)
=> 929430
{code}
Or another example:
{code:sql}
SELECT count(*) FROM (SELECT f FROM (SELECT flatten(k) AS f FROM (SELECT kvgen(animals) AS k FROM `ANIMALS_json`)) AS x WHERE x.f.key='Cat')
=> 121368
SELECT count(*) FROM (SELECT f FROM (SELECT flatten(k) AS f FROM (SELECT kvgen(animals) AS k FROM `ANIMALS_pqt`)) AS x WHERE x.f.key='Cat')
=> 13470
{code}
The correct result is the JSON one, as shown by:
{code:bash}
cat ANIMALS_json/*.json | grep -c "Cat"
121368
{code}
Please note that, here, it appears that the particular file _ANIMALS_pqt/1_0_0.parquet_ is not read correctly while the others are:
{code:sql}
SELECT count(*) FROM (SELECT f FROM (SELECT flatten(k) AS f FROM (SELECT kvgen(animals) AS k FROM `ANIMALS_pqt/1_0_0.parquet`)) AS x WHERE x.f.key='Cat');
=> 107898
SELECT count(*) FROM (SELECT f FROM (SELECT flatten(k) AS f FROM (SELECT kvgen(animals) AS k FROM `ANIMALS_pqt/1_1_0.parquet`)) AS x WHERE x.f.key='Cat');
=> 2429
SELECT count(*) FROM (SELECT f FROM (SELECT flatten(k) AS f FROM (SELECT kvgen(animals) AS k FROM `ANIMALS_pqt/1_2_0.parquet`)) AS x WHERE x.f.key='Cat');
=> 5419
SELECT count(*) FROM (SELECT f FROM (SELECT flatten(k) AS f FROM (SELECT kvgen(animals) AS k FROM `ANIMALS_pqt/1_3_0.parquet`)) AS x WHERE x.f.key='Cat');
=> 5622
{code}
2429+5419+5622=13470 (the result of the query on ANIMALS_pqt)
107898+2429+5419+5622=121368 (the result of the query on ANIMALS_json)
-- This message was sent by Atlassian Jira (v8.3.4#803005)
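The kvgen/flatten pipeline in the queries above has a direct Python analogue, which gives a format-independent reference count (toy data here, not the attached files):

```python
def kvgen(mapping):
    # Drill's kvgen: turn a map into an array of {"key", "value"} records.
    return [{"key": k, "value": v} for k, v in mapping.items()]

rows = [{"animals": {"Cat": {"seen": 1}, "Dog": {"seen": 2}}},
        {"animals": {"Cat": {"seen": 3}}}]

# flatten(kvgen(animals)) ... WHERE x.f.key = 'Cat'
flattened = [kv for row in rows for kv in kvgen(row["animals"])]
cats = [kv for kv in flattened if kv["key"] == "Cat"]

assert len(flattened) == 3
assert len(cats) == 2
```

Whatever the storage format, both counts should match this reference computation, which is why the JSON result is taken as ground truth.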
[jira] [Created] (DRILL-7389) JSON empty list avoid Parquet creation
benj created DRILL-7389: --- Summary: JSON empty list avoid Parquet creation Key: DRILL-7389 URL: https://issues.apache.org/jira/browse/DRILL-7389 Project: Apache Drill Issue Type: Improvement Components: Storage - JSON, Storage - Parquet Affects Versions: 1.16.0 Reporter: benj With a JSON file containing only one row with an empty list as below, it's possible to query the file, but there is an error when trying to create a Parquet file. File ANIMALS_1.json:
{code:json}
{"animals": {"Rhinoceros":{"detected":false,"gender":"1","obsdate":"20171229"},"Horse":{}}}
{code}
{code:sql}
SELECT * FROM `ANIMALS_1.json`;
+----------------------------------------------------------------------------------+
| animals                                                                          |
+----------------------------------------------------------------------------------+
| {"Rhinoceros":{"detected":"false","gender":"1","obsdate":"20171229"},"Horse":{}} |
+----------------------------------------------------------------------------------+
CREATE TABLE `ANIMALS_1_pqt` AS (SELECT * FROM `ANIMALS_1.json`);
=> Error: SYSTEM ERROR: InvalidSchemaException: Cannot write a schema with an empty group: optional group Horse {}
{code}
But if the JSON file contains a second line with a non-empty list for "Horse", it's possible to query the file and create the Parquet file. File ANIMALS_2.json:
{code:json}
{"animals": {"Rhinoceros":{"detected":false,"gender":"1","obsdate":"20171229"},"Horse":{}}}
{"animals": {"Rhinoceros":{"detected":false,"gender":"1","obsdate":"20171229"},"Horse":{"detected":false,"gender":"1","obsdate":"20171229"}}}
{code}
{code:sql}
SELECT * FROM `ANIMALS_2.json`;
+--------------------------------------------------------------------------------------------------------------------------------------+
| animals                                                                                                                              |
+--------------------------------------------------------------------------------------------------------------------------------------+
| {"Rhinoceros":{"detected":"false","gender":"1","obsdate":"20171229"},"Horse":{}}                                                     |
| {"Rhinoceros":{"detected":"false","gender":"1","obsdate":"20171229"},"Horse":{"detected":"false","gender":"1","obsdate":"20171229"}} |
+--------------------------------------------------------------------------------------------------------------------------------------+
CREATE TABLE `ANIMALS_2_pqt` AS (SELECT * FROM `ANIMALS_2.json`);
+----------+---------------------------+
| Fragment | Number of records written |
+----------+---------------------------+
| 0_0      | 2                         |
+----------+---------------------------+
{code}
Many problems appear with this when manipulating multiple JSON files with "rare" values (and when one does not control their generation). It's very annoying to have no way to push data into parquet when there are missing/null values in the JSON. The possibility to cast the data to varchar (DRILL-7375) could allow the parquet storage. In the simple case of the example discussed here, it's possible to change the type of the input file from JSON to CSV and it will work. But that does not answer all the problems, and it doesn't allow keeping some parts as "json" and some others as "text". -- This message was sent by Atlassian Jira (v8.3.4#803005)
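One possible pre-processing workaround, sketched in Python: strip empty nested objects from each JSON record before the CTAS, so that no empty Parquet group has to be written (the helper name is made up for illustration):

```python
import json

def drop_empty_groups(value):
    """Recursively remove keys whose value is an empty object, e.g.
    "Horse": {}, which Parquet cannot represent as a group."""
    if isinstance(value, dict):
        return {k: drop_empty_groups(v) for k, v in value.items()
                if not (isinstance(v, dict) and len(v) == 0)}
    return value

line = '{"animals": {"Rhinoceros": {"detected": false}, "Horse": {}}}'
cleaned = drop_empty_groups(json.loads(line))

assert "Horse" not in cleaned["animals"]
assert cleaned["animals"]["Rhinoceros"] == {"detected": False}
```

This loses the information that "Horse" was present but empty, which is exactly why a proper Drill-side solution would be preferable.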
[jira] [Commented] (DRILL-6958) CTAS csv with option
[ https://issues.apache.org/jira/browse/DRILL-6958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16935943#comment-16935943 ] benj commented on DRILL-6958: - In the next example, a table has a column that contains a piece of JSON:
{code:sql}
SELECT * FROM `example.parquet` LIMIT 2;
+---------+------------+-------------------------------------------------------------------------------------------------------------------------------------------+
| hash    | date       | info                                                                                                                                      |
+---------+------------+-------------------------------------------------------------------------------------------------------------------------------------------+
| B29C56F | 2019-09-23 | {"Number": 322, "scans": {"nameofprocess": {"detection": false, "version": "1.2"}}, {"othername": {"detection": true, "version": "0.1"}}} |
| C28956E | 2019-09-22 | {"Number": 312, "scans": {"thirdname": {"detection": false, "version": "1.0"}}}                                                           |
+---------+------------+-------------------------------------------------------------------------------------------------------------------------------------------+
SELECT typeof(hash) AS hash, typeof(`date`) AS `date`, typeof(info) AS info FROM `example.parquet` LIMIT 1;
+---------+------+------+
| hash    | date | info |
+---------+------+------+
| VARCHAR | DATE | MAP  |
+---------+------+------+
{code}
It's not possible to push this correctly into a CSV file because of the presence of the separator and quotes inside the JSON. And there is no way to manually avoid this problem by changing the separator or introducing quoting, because the type MAP is not convertible to VARCHAR (DRILL-7375), so it's not possible to manually concatenate the data.
> CTAS csv with option
>
>
> Key: DRILL-6958
> URL: https://issues.apache.org/jira/browse/DRILL-6958
> Project: Apache Drill
> Issue Type: Bug
> Components: Storage - Text CSV
>Affects Versions: 1.15.0, 1.16.0
>Reporter: benj
>Priority: Major
>
> Currently, it may be difficult to produce well-formed CSV with CTAS (see comment below).
> It appears necessary to have some additional/configurable options to write CSV files with CTAS:
> * possibility to change/define the separator,
> * possibility to write or not the header,
> * possibility to force the write of only 1 file instead of lots of parts,
> * possibility to force quoting
> * possibility to use/change the escape char
> * ... -- This message was sent by Atlassian Jira (v8.3.4#803005)
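For comparison, a standard CSV writer handles embedded separators and quotes by quoting the field and doubling inner quotes, which is what a configurable CTAS writer could do once a MAP can be rendered as text. A sketch with Python's csv module (illustrative data, not the real table):

```python
import csv
import io

row = ["B29C56F", "2019-09-23",
       '{"Number": 322, "scans": {"thirdname": {"detection": false}}}']

buf = io.StringIO()
# QUOTE_MINIMAL quotes only fields that contain the delimiter or quotechar,
# so the JSON column is safely quoted while the other fields stay bare.
csv.writer(buf, quoting=csv.QUOTE_MINIMAL).writerow(row)

parsed = next(csv.reader(io.StringIO(buf.getvalue())))
assert parsed == row  # round-trips despite commas and quotes inside the JSON
```

The separator, quoting mode, and escape character here are exactly the knobs the issue asks CTAS to expose.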
[jira] [Created] (DRILL-7379) Planning error
benj created DRILL-7379: --- Summary: Planning error Key: DRILL-7379 URL: https://issues.apache.org/jira/browse/DRILL-7379 Project: Apache Drill Issue Type: Bug Components: Functions - Drill Affects Versions: 1.16.0 Reporter: benj With data such as:
{code:sql}
SELECT id, tags FROM `example_parquet`;
+--------+------------------------------------+
| id     | tags                               |
+--------+------------------------------------+
| 7b8808 | ["peexe","signed","overlay"]       |
| 55a4ae | ["peexe","signed","upx","overlay"] |
+--------+------------------------------------+
{code}
the next query is OK:
{code:sql}
SELECT id, flatten(tags) tag
FROM (
  SELECT id, any_value(tags) tags
  FROM `example_parquet`
  GROUP BY id
) LIMIT 2;
+--------+--------+
| id     | tag    |
+--------+--------+
| 55a4ae | peexe  |
| 55a4ae | signed |
+--------+--------+
{code}
But unexpectedly, the next query fails:
{code:sql}
SELECT tag, count(*)
FROM (
  SELECT flatten(tags) tag
  FROM (
    SELECT id, any_value(tags) tags
    FROM `example_parquet`
    GROUP BY id
  )
) GROUP BY tag;
Error: SYSTEM ERROR: UnsupportedOperationException: Map, Array, Union or repeated scalar type should not be used in group by, order by or in a comparison operator. Drill does not support compare between MAP:REPEATED and MAP:REPEATED.
/* Or, with another set of data, a different error:
Error: SYSTEM ERROR: SchemaChangeException: Failure while trying to materialize incoming schema. Errors: Error in expression at index 0. Error: Missing function implementation: [hash32asdouble(MAP-REPEATED, INT-REQUIRED)]. Full expression: null.. */
{code}
These errors are incomprehensible because the aggregate is on a VARCHAR. Moreover, the query works if decomposed into 2 queries with the creation of an intermediate table, like below:
{code:sql}
CREATE TABLE `tmp_parquet` AS (
  SELECT id, flatten(tags) tag
  FROM (
    SELECT id, any_value(tags) tags
    FROM `example_parquet`
    GROUP BY id
  ));
SELECT tag, count(*) c FROM `tmp_parquet` GROUP BY tag;
+---------+---+
| tag     | c |
+---------+---+
| overlay | 2 |
| peexe   | 2 |
| signed  | 2 |
| upx     | 1 |
+---------+---+
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
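The aggregation the failing query describes is straightforward when modelled outside Drill, which confirms the expected result table (data copied from the example):

```python
from collections import Counter

rows = [{"id": "7b8808", "tags": ["peexe", "signed", "overlay"]},
        {"id": "55a4ae", "tags": ["peexe", "signed", "upx", "overlay"]}]

# flatten(tags) followed by GROUP BY tag, count(*)
counts = Counter(tag for row in rows for tag in row["tags"])

assert counts == {"peexe": 2, "signed": 2, "overlay": 2, "upx": 1}
```

The grouped values are plain strings after flattening, which is why the planner's complaint about repeated MAP types looks like a type-inference bug rather than a real type conflict.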
[jira] [Created] (DRILL-7378) Allowing less outer/inner select
benj created DRILL-7378: --- Summary: Allowing less outer/inner select Key: DRILL-7378 URL: https://issues.apache.org/jira/browse/DRILL-7378 Project: Apache Drill Issue Type: Improvement Components: Functions - Drill Affects Versions: 1.16.0 Reporter: benj Currently, it's not possible to exploit the result of some functions like _kvgen_ or _flatten_ directly, and an inner/outer select is needed for some operations. It would be easier to allow using the results of these functions directly. Example:
{code:sql}
SELECT CONVERT_FROM('{"Tuesday":{"close":"22:00"},"Friday":{"close":"23:00"}}','JSON') j;
+----------------------------------------------------------+
| j                                                        |
+----------------------------------------------------------+
| {"Tuesday":{"close":"22:00"},"Friday":{"close":"23:00"}} |
+----------------------------------------------------------+
{code}
But it's not possible to simply do:
{code:sql}
SELECT kvgen(CONVERT_FROM('{"Tuesday":{"close":"22:00"},"Friday":{"close":"23:00"}}','JSON'));
Error: PLAN ERROR: Failure while materializing expression in constant expression evaluator [CONVERT_FROM('{"Tuesday":{"close":"22:00"},"Friday":{"close":"23:00"}}', 'JSON')]. Errors: Error in expression at index -1. Error: Only ProjectRecordBatch could have complex writer function. You are using complex writer function convert_fromJSON in a non-project operation!. Full expression: --UNKNOWN EXPRESSION--.
{code}
It's only possible to do:
{code:sql}
SELECT kvgen(c) AS k FROM (SELECT CONVERT_FROM('{"Tuesday":{"close":"22:00"},"Friday":{"close":"23:00"}}','JSON') c);
+------------------------------------------------------------------------------------------+
| k                                                                                        |
+------------------------------------------------------------------------------------------+
| [{"key":"Tuesday","value":{"close":"22:00"}},{"key":"Friday","value":{"close":"23:00"}}] |
+------------------------------------------------------------------------------------------+
{code}
It's possible to cascade with flatten:
{code:sql}
SELECT flatten(kvgen(c)) f FROM (SELECT CONVERT_FROM('{"Tuesday":{"close":"22:00"},"Friday":{"close":"23:00"}}','JSON') c);
+---------------------------------------------+
| f                                           |
+---------------------------------------------+
| {"key":"Tuesday","value":{"close":"22:00"}} |
| {"key":"Friday","value":{"close":"23:00"}}  |
+---------------------------------------------+
{code}
But it's not possible to directly use the result of flatten to select the key or value:
{code:sql}
SELECT (flatten(kvgen(r.c))).key f FROM (SELECT CONVERT_FROM('{"Tuesday":{"close":"22:00"},"Friday":{"close":"23:00"}}','JSON') c) r;
Error: VALIDATION ERROR: From line 1, column 9 to line 1, column 27: Incompatible types
{code}
You have to use an inner/outer select like:
{code:sql}
SELECT r.f.key k FROM (SELECT flatten(kvgen(c)) f FROM (SELECT CONVERT_FROM('{"Tuesday":{"close":"22:00"},"Friday":{"close":"23:00"}}','JSON') c)) r;
+---------+
| k       |
+---------+
| Tuesday |
| Friday  |
+---------+
{code}
It would be useful to be able to write/read shorter and simpler queries, limiting where possible the need for inner/outer SELECTs.
-- This message was sent by Atlassian Jira (v8.3.4#803005)
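The single-expression composition the issue asks for is natural in a general-purpose language; a Python sketch of the convert_from -> kvgen -> flatten -> key chain applied in one expression:

```python
import json

# convert_from(..., 'JSON')
doc = json.loads('{"Tuesday":{"close":"22:00"},"Friday":{"close":"23:00"}}')

# kvgen + flatten + .key, chained without intermediate "tables"
keys = [kv["key"] for kv in ({"key": k, "value": v} for k, v in doc.items())]

assert keys == ["Tuesday", "Friday"]
```

Each stage simply consumes the previous stage's value, which is the ergonomics the wish describes.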
[jira] [Updated] (DRILL-7375) composite/nested type map/array convert_to/cast to varchar
[ https://issues.apache.org/jira/browse/DRILL-7375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] benj updated DRILL-7375: Summary: composite/nested type map/array convert_to/cast to varchar (was: composite type map cast/convert_to) > composite/nested type map/array convert_to/cast to varchar > -- > > Key: DRILL-7375 > URL: https://issues.apache.org/jira/browse/DRILL-7375 > Project: Apache Drill > Issue Type: Wish > Components: Functions - Drill >Affects Versions: 1.16.0 >Reporter: benj >Priority: Major > > As it possible to cast varchar to map (convert_from + JSON) with convert_from > or transform a varchar to array (split) > {code:sql} > SELECT a, typeof(a), sqltypeof(a) from (SELECT CONVERT_FROM('{a : 100, b: > 200}' ,'JSON') a); > +---+-++ > | a | EXPR$1 | EXPR$2 | > +---+-++ > | {"a":100,"b":200} | MAP | STRUCT | > +---+-++ > SELECT a, typeof(a), sqltypeof(a)FROM (SELECT split(str,',') AS a FROM ( > SELECT 'foo,bar' AS str)); > +---+-++ > |a | EXPR$1 | EXPR$2 | > +---+-++ > | ["foo","bar"] | VARCHAR | ARRAY | > +---+-++ > {code} > It will be very usefull : > # to have the capacity to "cast" the +_MAP_ into VARCHAR+ with a "cast > syntax" or with a "convert_to" possibility > Expected: > {code:sql} > SELECT a, typeof(a) ta, va, typeof(va) tva FROM ( > SELECT a, CAST(a AS varchar) va FROM (SELECT CONVERT_FROM('{a : 100, b: 200}' > ,'JSON') a)); > +---+--+---+-+ > | a | ta | va| tva | > +---+--+---+-+ > | {"a":100,"b":200} | MAP | {"a":100,"b":200} | VARCHAR | > +---+--+---+-+ > {code} > # to have the capacity to "cast" the +_ARRAY_ into VARCHAR+ with a "cast > syntax" or any other method > Expected > {code:sql} > SELECT a, sqltypeof(a) ta, va, sqltypeof(va) tva FROM ( > SELECT a, CAST(a AS varchar) va FROM (SELECT split(str,',') AS a FROM ( > SELECT 'foo,bar' AS str)); > +---+--+---+-+ > | a | ta | va| tva | > +---+--+---+-+ > | ["foo","bar"] | ARRAY| ["foo","bar"] | VARCHAR | > +---+--+---+-+ > {code} > Please note that these possibility of 
course exists in other database systems > Example with Postgres: > {code:sql} > SELECT '{"a":100,"b":200}'::json::text; > => {"a":100,"b":200} > SELECT array[1,2,3]::text; > => {1,2,3} > {code} -- This message was sent by Atlassian Jira (v8.3.2#803003)
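The requested CAST(MAP AS VARCHAR) amounts to JSON serialization of the map; in Python terms (an illustration of the expected semantics, not a Drill feature):

```python
import json

m = {"a": 100, "b": 200}                      # the MAP value
text = json.dumps(m, separators=(",", ":"))   # compact JSON rendering

assert text == '{"a":100,"b":200}'            # the rendering the issue expects
assert isinstance(text, str)                  # now a VARCHAR-like value
```

The array case is the same idea: `json.dumps(["foo", "bar"])` yields the `["foo", "bar"]` text form shown in the expected output.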
[jira] [Updated] (DRILL-7375) composite type map cast/convert_to
[ https://issues.apache.org/jira/browse/DRILL-7375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] benj updated DRILL-7375: Description: As it possible to cast varchar to map (convert_from + JSON) with convert_from or transform a varchar to array (split) {code:sql} SELECT a, typeof(a), sqltypeof(a) from (SELECT CONVERT_FROM('{a : 100, b: 200}' ,'JSON') a); +---+-++ | a | EXPR$1 | EXPR$2 | +---+-++ | {"a":100,"b":200} | MAP | STRUCT | +---+-++ SELECT a, typeof(a), sqltypeof(a)FROM (SELECT split(str,',') AS a FROM ( SELECT 'foo,bar' AS str)); +---+-++ |a | EXPR$1 | EXPR$2 | +---+-++ | ["foo","bar"] | VARCHAR | ARRAY | +---+-++ {code} It will be very usefull : # to have the capacity to "cast" the +_MAP_ into VARCHAR+ with a "cast syntax" or with a "convert_to" possibility Expected: {code:sql} SELECT a, typeof(a) ta, va, typeof(va) tva FROM ( SELECT a, CAST(a AS varchar) va FROM (SELECT CONVERT_FROM('{a : 100, b: 200}' ,'JSON') a)); +---+--+---+-+ | a | ta | va| tva | +---+--+---+-+ | {"a":100,"b":200} | MAP | {"a":100,"b":200} | VARCHAR | +---+--+---+-+ {code} # to have the capacity to "cast" the +_ARRAY_ into VARCHAR+ with a "cast syntax" or any other method Expected {code:sql} SELECT a, sqltypeof(a) ta, va, sqltypeof(va) tva FROM ( SELECT a, CAST(a AS varchar) va FROM (SELECT split(str,',') AS a FROM ( SELECT 'foo,bar' AS str)); +---+--+---+-+ | a | ta | va| tva | +---+--+---+-+ | ["foo","bar"] | ARRAY| ["foo","bar"] | VARCHAR | +---+--+---+-+ {code} Please note that these possibility of course exists in other database systems Example with Postgres: {code:sql} SELECT '{"a":100,"b":200}'::json::text; => {"a":100,"b":200} SELECT array[1,2,3]::text; => {1,2,3} {code} was: As it possible to cast varchar to map (convert_from + JSON) with convert_from or transform a varchar to array (split) {code:sql} SELECT a, typeof(a), sqltypeof(a) from (SELECT CONVERT_FROM('{a : 100, b: 200}' ,'JSON') a); +---+-++ | a | EXPR$1 | EXPR$2 | +---+-++ | 
{"a":100,"b":200} | MAP | STRUCT | +---+-++ SELECT a, typeof(a), sqltypeof(a)FROM (SELECT split(str,',') AS a FROM ( SELECT 'foo,bar' AS str)); +---+-++ |a | EXPR$1 | EXPR$2 | +---+-++ | ["foo","bar"] | VARCHAR | ARRAY | +---+-++ {code} It will be very usefull : # to have the capacity to "cast" the +_MAP_ into VARCHAR+ with a "cast syntax" or with a "convert_to" possibility Expected: {code:sql} SELECT a, typeof(a) ta, va, typeof(va) tva FROM ( SELECT a, CAST(a AS varchar) va FROM (SELECT CONVERT_FROM('{a : 100, b: 200}' ,'JSON') a)); +---+--+---+-+ | a | ta | va| tva | +---+--+---+-+ | {"a":100,"b":200} | MAP | {"a":100,"b":200} | VARCHAR | +---+--+---+-+ {code} # to have the capacity to "cast" the +_ARRAY_ into VARCHAR+ with a "cast syntax" or any other method Expected {code:sql} SELECT a, sqltypeof(a) ta, va, sqltypeof(va) tva FROM ( SELECT a, CAST(a AS varchar) va FROM (SELECT split(str,',') AS a FROM ( SELECT 'foo,bar' AS str)); +---+--+---+-+ | a | ta | va| tva | +---+--+---+-+ | ["foo","bar"] | ARRAY| ["foo","bar"] | VARCHAR | +---+--+---+-+ {code} Please note that these possibility of course exists in other database systems Example with postgres: {code:sql} SELECT '{"a":100,"b":200}'::json::text; => {"a":100,"b":200} SELECT array[1,2,3]::text; => {1,2,3} {code} > composite type map cast/convert_to > -- > > Key: DRILL-7375 > URL: https://issues.apache.org/jira/browse/DRILL-7375 > Project: Apache Drill > Issue Type: Wish > Components: Functions - Drill >Affects Versions: 1.16.0 >Reporter: benj >Priority: Major > > As it possible to cast varchar to map (convert_from + JSON) with convert_from > or transform a
[jira] [Updated] (DRILL-7375) composite type map cast/convert_to
[ https://issues.apache.org/jira/browse/DRILL-7375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] benj updated DRILL-7375: Description: As it possible to cast varchar to map (convert_from + JSON) with convert_from or transform a varchar to array (split) {code:sql} SELECT a, typeof(a), sqltypeof(a) from (SELECT CONVERT_FROM('{a : 100, b: 200}' ,'JSON') a); +---+-++ | a | EXPR$1 | EXPR$2 | +---+-++ | {"a":100,"b":200} | MAP | STRUCT | +---+-++ SELECT a, typeof(a), sqltypeof(a)FROM (SELECT split(str,',') AS a FROM ( SELECT 'foo,bar' AS str)); +---+-++ |a | EXPR$1 | EXPR$2 | +---+-++ | ["foo","bar"] | VARCHAR | ARRAY | +---+-++ {code} It will be very usefull : # to have the capacity to "cast" the +_MAP_ into VARCHAR+ with a "cast syntax" or with a "convert_to" possibility Expected: {code:sql} SELECT a, typeof(a) ta, va, typeof(va) tva FROM ( SELECT a, CAST(a AS varchar) va FROM (SELECT CONVERT_FROM('{a : 100, b: 200}' ,'JSON') a)); +---+--+---+-+ | a | ta | va| tva | +---+--+---+-+ | {"a":100,"b":200} | MAP | {"a":100,"b":200} | VARCHAR | +---+--+---+-+ {code} # to have the capacity to "cast" the +_ARRAY_ into VARCHAR+ with a "cast syntax" or any other method Expected {code:sql} SELECT a, sqltypeof(a) ta, va, sqltypeof(va) tva FROM ( SELECT a, CAST(a AS varchar) va FROM (SELECT split(str,',') AS a FROM ( SELECT 'foo,bar' AS str)); +---+--+---+-+ | a | ta | va| tva | +---+--+---+-+ | ["foo","bar"] | ARRAY| ["foo","bar"] | VARCHAR | +---+--+---+-+ {code} Please note that these possibility of course exists in other database systems Example with postgres: {code:sql} SELECT '{"a":100,"b":200}'::json::text; => {"a":100,"b":200} SELECT array[1,2,3]::text; => {1,2,3} {code} was: As it possible to cast varchar to map (JSON) with convert_from {code:sql} SELECT a, typeof(a) from (SELECT CONVERT_FROM('{a : 100, b: 200}' ,'JSON') a); +---++ | a | EXPR$1 | +---++ | {"a":100,"b":200} | MAP| +---++ {code} It will be very usefull to have the capacity to "cast" the 
_MAP_ into VARCHAR with a "cast syntax" or with a "convert_to" possibility Expected: {code:sql} SELECT a, typeof(a) ta, va, typeof(va) tva FROM ( SELECT a, CAST(a AS varchar) va from (SELECT CONVERT_FROM('{a : 100, b: 200}' ,'JSON') a)); +---+--+---+-+ | a | ta | va| tva | +---+--+---+-+ | {"a":100,"b":200} | MAP | {"a":100,"b":200} | VARCHAR | +---+--+---+-+ {code} Please note that these possibility of course exists in other database systems Example with postgres: {code:sql} SELECT '{"a":100,"b":200}'::json::text; SELECT (array[1,2,3])::text; {code} > composite type map cast/convert_to > -- > > Key: DRILL-7375 > URL: https://issues.apache.org/jira/browse/DRILL-7375 > Project: Apache Drill > Issue Type: Wish > Components: Functions - Drill >Affects Versions: 1.16.0 >Reporter: benj >Priority: Major > > As it possible to cast varchar to map (convert_from + JSON) with convert_from > or transform a varchar to array (split) > {code:sql} > SELECT a, typeof(a), sqltypeof(a) from (SELECT CONVERT_FROM('{a : 100, b: > 200}' ,'JSON') a); > +---+-++ > | a | EXPR$1 | EXPR$2 | > +---+-++ > | {"a":100,"b":200} | MAP | STRUCT | > +---+-++ > SELECT a, typeof(a), sqltypeof(a)FROM (SELECT split(str,',') AS a FROM ( > SELECT 'foo,bar' AS str)); > +---+-++ > |a | EXPR$1 | EXPR$2 | > +---+-++ > | ["foo","bar"] | VARCHAR | ARRAY | > +---+-++ > {code} > It will be very usefull : > # to have the capacity to "cast" the +_MAP_ into VARCHAR+ with a "cast > syntax" or with a "convert_to" possibility > Expected: > {code:sql} > SELECT a, typeof(a) ta, va, typeof(va) tva FROM ( > SELECT a, CAST(a AS varchar) va FROM (SELECT CONVERT_FROM('{a : 100, b: 200}' > ,'JSON') a)); >
[jira] [Updated] (DRILL-7375) composite type map cast/convert_to
[ https://issues.apache.org/jira/browse/DRILL-7375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] benj updated DRILL-7375: Description: As it possible to cast varchar to map (JSON) with convert_from {code:sql} SELECT a, typeof(a) from (SELECT CONVERT_FROM('{a : 100, b: 200}' ,'JSON') a); +---++ | a | EXPR$1 | +---++ | {"a":100,"b":200} | MAP| +---++ {code} It will be very usefull to have the capacity to "cast" the _MAP_ into VARCHAR with a "cast syntax" or with a "convert_to" possibility Expected: {code:sql} SELECT a, typeof(a) ta, va, typeof(va) tva FROM ( SELECT a, CAST(a AS varchar) va from (SELECT CONVERT_FROM('{a : 100, b: 200}' ,'JSON') a)); +---+--+---+-+ | a | ta | va| tva | +---+--+---+-+ | {"a":100,"b":200} | MAP | {"a":100,"b":200} | VARCHAR | +---+--+---+-+ {code} Please note that these possibility of course exists in other database systems Example with postgres: {code:sql} SELECT '{"a":100,"b":200}'::json::text; SELECT (array[1,2,3])::text; {code} was: As it possible to cast varchar to map (JSON) with convert_from {code:sql} SELECT a, typeof(a) from (SELECT CONVERT_FROM('{a : 100, b: 200}' ,'JSON') a); +---++ | a | EXPR$1 | +---++ | {"a":100,"b":200} | MAP| +---++ {code} It will be very usefull to have the capacity to "cast" the _MAP_ into VARCHAR with a "cast syntax" or with a "convert_to" possibility Expected: {code:sql} SELECT a, typeof(a) ta, va, typeof(va) tva FROM ( SELECT a, CAST(a AS varchar) va from (SELECT CONVERT_FROM('{a : 100, b: 200}' ,'JSON') a)); +---+--+---+-+ | a | ta | va| tva | +---+--+---+-+ | {"a":100,"b":200} | MAP | {"a":100,"b":200} | VARCHAR | +---+--+---+-+ {code} Please note that these possibility of course exists in other database systems Example with postgres: {code:sql} SELECT '{"a":100,"b":200}'::json::text {code} > composite type map cast/convert_to > -- > > Key: DRILL-7375 > URL: https://issues.apache.org/jira/browse/DRILL-7375 > Project: Apache Drill > Issue Type: Wish > Components: Functions - 
Drill >Affects Versions: 1.16.0 >Reporter: benj >Priority: Major > > As it possible to cast varchar to map (JSON) with convert_from > {code:sql} > SELECT a, typeof(a) from (SELECT CONVERT_FROM('{a : 100, b: 200}' ,'JSON') a); > +---++ > | a | EXPR$1 | > +---++ > | {"a":100,"b":200} | MAP| > +---++ > {code} > It will be very usefull to have the capacity to "cast" the _MAP_ into VARCHAR > with a "cast syntax" or with a "convert_to" possibility > Expected: > {code:sql} > SELECT a, typeof(a) ta, va, typeof(va) tva FROM ( > SELECT a, CAST(a AS varchar) va from (SELECT CONVERT_FROM('{a : 100, b: 200}' > ,'JSON') a)); > +---+--+---+-+ > | a | ta | va| tva | > +---+--+---+-+ > | {"a":100,"b":200} | MAP | {"a":100,"b":200} | VARCHAR | > +---+--+---+-+ > {code} > > > Please note that these possibility of course exists in other database systems > Example with postgres: > {code:sql} > SELECT '{"a":100,"b":200}'::json::text; > SELECT (array[1,2,3])::text; > {code} -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Updated] (DRILL-7375) composite type map cast/convert_to
[ https://issues.apache.org/jira/browse/DRILL-7375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] benj updated DRILL-7375:
Description:
As it is possible to cast a varchar to a map (JSON) with convert_from
{code:sql}
SELECT a, typeof(a) from (SELECT CONVERT_FROM('{a : 100, b: 200}' ,'JSON') a);
+-------------------+--------+
|         a         | EXPR$1 |
+-------------------+--------+
| {"a":100,"b":200} | MAP    |
+-------------------+--------+
{code}
it would be very useful to be able to "cast" the _MAP_ into VARCHAR, either with a "cast syntax" or with a "convert_to" possibility.
Expected:
{code:sql}
SELECT a, typeof(a) ta, va, typeof(va) tva FROM (
SELECT a, CAST(a AS varchar) va from (SELECT CONVERT_FROM('{a : 100, b: 200}' ,'JSON') a));
+-------------------+-----+-------------------+---------+
|         a         | ta  |        va         |   tva   |
+-------------------+-----+-------------------+---------+
| {"a":100,"b":200} | MAP | {"a":100,"b":200} | VARCHAR |
+-------------------+-----+-------------------+---------+
{code}
Please note that this possibility of course exists in other database systems. Example with postgres:
{code:sql}
SELECT '{"a":100,"b":200}'::json::text
{code}

> composite type map cast/convert_to
> ----------------------------------
>
>                 Key: DRILL-7375
>                 URL: https://issues.apache.org/jira/browse/DRILL-7375
>             Project: Apache Drill
>          Issue Type: Wish
>          Components: Functions - Drill
>    Affects Versions: 1.16.0
>            Reporter: benj
>            Priority: Major
>
-- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Updated] (DRILL-7375) composite type map cast/convert_to
[ https://issues.apache.org/jira/browse/DRILL-7375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] benj updated DRILL-7375:
Description:
As it is possible to cast a varchar to a map (JSON) with convert_from
{code:sql}
SELECT a, typeof(a) from (SELECT CONVERT_FROM('{a : 100, b: 200}' ,'JSON') a);
+-------------------+--------+
|         a         | EXPR$1 |
+-------------------+--------+
| {"a":100,"b":200} | MAP    |
+-------------------+--------+
{code}
it would be very useful to be able to "cast" the _MAP_ into VARCHAR, either with a "cast syntax" or with a "convert_to" possibility.
Expected:
{code:sql}
SELECT a, typeof(a) ta, va, typeof(va) tva FROM (
SELECT a, CAST(a AS varchar) va from (SELECT CONVERT_FROM('{a : 100, b: 200}' ,'JSON') a));
+-------------------+-----+-------------------+---------+
|         a         | ta  |        va         |   tva   |
+-------------------+-----+-------------------+---------+
| {"a":100,"b":200} | MAP | {"a":100,"b":200} | VARCHAR |
+-------------------+-----+-------------------+---------+
{code}
Please note that this possibility of course exists in other database systems. Example with postgres:
{code:sql}
SELECT '{"a":100,"b":200}'::json::text
{code}
was: the same description without the "Expected" example.

> composite type map cast/convert_to
> ----------------------------------
>
>                 Key: DRILL-7375
>                 URL: https://issues.apache.org/jira/browse/DRILL-7375
>             Project: Apache Drill
>          Issue Type: Wish
>          Components: Functions - Drill
>    Affects Versions: 1.16.0
>            Reporter: benj
>            Priority: Major
>
-- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Created] (DRILL-7375) composite type map cast/convert_to
benj created DRILL-7375:
---
             Summary: composite type map cast/convert_to
                 Key: DRILL-7375
                 URL: https://issues.apache.org/jira/browse/DRILL-7375
             Project: Apache Drill
          Issue Type: Wish
          Components: Functions - Drill
    Affects Versions: 1.16.0
            Reporter: benj

As it is possible to cast a varchar to a map (JSON) with convert_from
{code:sql}
SELECT a, typeof(a) from (SELECT CONVERT_FROM('{a : 100, b: 200}' ,'JSON') a);
+-------------------+--------+
|         a         | EXPR$1 |
+-------------------+--------+
| {"a":100,"b":200} | MAP    |
+-------------------+--------+
{code}
it would be very useful to be able to "cast" the _MAP_ into VARCHAR, either with a "cast syntax" or with a "convert_to" possibility.
Please note that this possibility of course exists in other database systems (example postgres: _SELECT '{"a":100,"b":200}'::json::text_)
-- This message was sent by Atlassian Jira (v8.3.2#803003)
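The wished-for MAP to VARCHAR cast is, in effect, JSON serialization: the inverse of CONVERT_FROM(..., 'JSON'). A minimal sketch of the intended semantics (illustrative Python, not Drill code):

```python
import json

# VARCHAR -> MAP: what Drill's CONVERT_FROM(..., 'JSON') already does.
text = '{"a": 100, "b": 200}'
m = json.loads(text)

# MAP -> VARCHAR: the requested CAST(a AS varchar) / convert_to step,
# i.e. serializing the map back to its compact JSON text form.
back = json.dumps(m, separators=(",", ":"))
print(back)  # {"a":100,"b":200}
```

This matches the expected output in the ticket: the round-tripped text equals Drill's displayed MAP rendering.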
[jira] [Updated] (DRILL-6975) TO_CHAR does not seems work well depends on LOCALE
[ https://issues.apache.org/jira/browse/DRILL-6975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] benj updated DRILL-6975: Affects Version/s: 1.14.0 1.16.0

> TO_CHAR does not seems work well depends on LOCALE
> --------------------------------------------------
>
>                 Key: DRILL-6975
>                 URL: https://issues.apache.org/jira/browse/DRILL-6975
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Functions - Drill
>    Affects Versions: 1.14.0, 1.15.0, 1.16.0
>            Reporter: benj
>            Priority: Major
>
> Strange results from the TO_CHAR function when using different LOCALEs.
> {code:java}
> SELECT TO_CHAR((CAST('2008-2-23' AS DATE)), 'yyyy-MMM-dd') FROM (VALUES(1));
> 2008-Feb-23 (as in the documentation (en_US.UTF-8))
> 2008-févr.-2 (fr_FR.UTF-8)
> {code}
> Surprisingly, adding a space ('yyyy-MMM-dd ') (or any character) at the end of the format makes the result correct (so there is no problem when formatting a timestamp with 'yyyy MMM dd HH:mm:ss').
> {code:java}
> SELECT TO_CHAR(1256.789383, '#,###.###') FROM (VALUES(1));
> 1,256.789 (as in the documentation (en_US.UTF-8))
> 1 256,78 (fr_FR.UTF-8)
> {code}
> Even worse results can be obtained:
> {code:java}
> SELECT TO_CHAR(12567,'#,###.###');
> 12,567 (en_US.UTF-8)
> 12 56 (fr_FR.UTF-8)
> {code}
> Again, adding a space/character at the end gives a better result.
> I have not tested all the locales, but for the last example the result is correct with de_DE.UTF-8: 12.567
> The situation is identical in 1.14.
-- This message was sent by Atlassian Jira (v8.3.2#803003)
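For context on why the rendered text differs per locale: DecimalFormat-style patterns such as '#,###.###' are locale-independent placeholders, while the actual grouping and decimal symbols are supplied by the active locale (comma/dot in en_US, no-break space/comma in fr_FR, dot/comma in de_DE). A small illustrative sketch of that substitution (Python, hypothetical `format_grouped` helper, not Drill's implementation); the truncation itself ('12 56' losing a digit) looks like a separate bug in Drill's output handling:

```python
def format_grouped(n: int, grouping_sep: str) -> str:
    """Mimic a '#,###' pattern: fixed grouping, locale-supplied separator."""
    s = f"{n:,}"                 # '12,567' with the ASCII comma
    return s.replace(",", grouping_sep)

print(format_grouped(12567, ","))       # en_US style: 12,567
print(format_grouped(12567, "\u00a0"))  # fr_FR style: 12 567 (no-break space)
print(format_grouped(12567, "."))       # de_DE style: 12.567
```

All three strings have the same digits; only the separator symbol changes, which is consistent with de_DE.UTF-8 producing the correct '12.567' while fr_FR.UTF-8 misbehaves only in Drill's rendering.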
[jira] [Updated] (DRILL-7371) DST/UTC cast/to_timestamp problem
[ https://issues.apache.org/jira/browse/DRILL-7371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] benj updated DRILL-7371: Component/s: Client - JDBC

> DST/UTC cast/to_timestamp problem
> ---------------------------------
>
>                 Key: DRILL-7371
>                 URL: https://issues.apache.org/jira/browse/DRILL-7371
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Client - JDBC, Functions - Drill
>    Affects Versions: 1.16.0
>            Reporter: benj
>            Priority: Major
>
> With LC_TIME=fr_FR.UTF-8 and +drillbits configured in UTC+ (as specified in [http://www.openkb.info/2015/05/understanding-drills-timestamp-and.html#.VUzhotpVhHw] found from [https://drill.apache.org/docs/data-type-conversion/#to_timestamp])
> {code:sql}
> SELECT TIMEOFDAY();
> +-----------------------------+
> |           EXPR$0            |
> +-----------------------------+
> | 2019-09-11 08:20:12.247 UTC |
> +-----------------------------+
> {code}
> Problems appear when applying _cast/to_timestamp_ to dates related to the DST (Daylight Saving Time) transitions of some countries.
> To illustrate, all the following requests give the same +wrong+ result:
> {code:sql}
> SELECT to_timestamp('2018-03-25 02:22:40 UTC','yyyy-MM-dd HH:mm:ss z');
> SELECT to_timestamp('2018-03-25 02:22:40','yyyy-MM-dd HH:mm:ss');
> SELECT cast('2018-03-25 02:22:40' as timestamp);
> SELECT cast('2018-03-25 02:22:40 +' as timestamp);
> +-----------------------+
> |        EXPR$0         |
> +-----------------------+
> | 2018-03-25 03:22:40.0 |
> +-----------------------+
> {code}
> while the result should be "2018-03-25 +02+:22:40.0".
> A UTC date and time in a string shouldn't change when cast to a UTC timestamp.
> To illustrate, the following requests produce +good+ results:
> {code:java}
> SELECT to_timestamp('2018-03-26 02:22:40 UTC','yyyy-MM-dd HH:mm:ss z');
> +-----------------------+
> |        EXPR$0         |
> +-----------------------+
> | 2018-03-26 02:22:40.0 |
> +-----------------------+
> SELECT CAST('2018-03-24 02:22:40' AS timestamp);
> +-----------------------+
> |        EXPR$0         |
> +-----------------------+
> | 2018-03-24 02:22:40.0 |
> +-----------------------+
> {code}
-- This message was sent by Atlassian Jira (v8.3.2#803003)
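The reported shift matches classic spring-forward behavior: in Europe/Paris, 2018-03-25 02:22:40 falls inside the 02:00 to 03:00 DST gap, so a parser that (incorrectly, for an already-UTC input) interprets the string in the local zone lands on 03:22:40, while the neighboring dates 03-24 and 03-26 are unaffected. A sketch of that mechanism with Python's zoneinfo (illustrative, not Drill's actual code path; assumes a system tz database is available):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

paris = ZoneInfo("Europe/Paris")

# '2018-03-25 02:22:40' does not exist on Paris wall clocks: at 02:00 local
# time, clocks jumped straight to 03:00 (CET +01:00 -> CEST +02:00).
naive = datetime(2018, 3, 25, 2, 22, 40)
as_local = naive.replace(tzinfo=paris)

# A gap time resolves with the pre-transition offset (+01:00); normalizing
# it through UTC and back to Paris wall time yields the hour-shifted value.
normalized = as_local.astimezone(timezone.utc).astimezone(paris)
print(normalized)  # 2018-03-25 03:22:40+02:00, i.e. the "wrong" 03:22:40
```

This supports the report's conclusion: the input was already UTC, so interpreting it in a DST-observing local zone is what introduces the one-hour shift.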
[jira] [Commented] (DRILL-7371) DST/UTC cast/to_timestamp problem
[ https://issues.apache.org/jira/browse/DRILL-7371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16927527#comment-16927527 ] benj commented on DRILL-7371:
[~vvysotskyi], the problem occurs on all Daylight Saving Time transitions of Europe (Paris).
From investigation after your message, it appears that the problem occurs in console mode (/bin/sqlline -u jdbc:drill:zk=...:2181;schema=myhdfs).
The result is also wrong when the request is executed via Zeppelin (JDBC too).
But there is no problem when the request is launched directly in the Apache Drill web interface ([http://...:8047/query]).
So it seems that the problem probably comes from JDBC.

> DST/UTC cast/to_timestamp problem
> ---------------------------------
>
>                 Key: DRILL-7371
>                 URL: https://issues.apache.org/jira/browse/DRILL-7371
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Functions - Drill
>    Affects Versions: 1.16.0
>            Reporter: benj
>            Priority: Major
>
-- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Updated] (DRILL-7371) DST/UTC cast/to_timestamp problem
[ https://issues.apache.org/jira/browse/DRILL-7371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] benj updated DRILL-7371: Summary: DST/UTC cast/to_timestamp problem (was: DST/UTC problem)

> DST/UTC cast/to_timestamp problem
> ---------------------------------
>
>                 Key: DRILL-7371
>                 URL: https://issues.apache.org/jira/browse/DRILL-7371
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Functions - Drill
>    Affects Versions: 1.16.0
>            Reporter: benj
>            Priority: Major
>
-- This message was sent by Atlassian Jira (v8.3.2#803003)