[jira] [Commented] (DRILL-8481) Ability to query XML root attributes
[ https://issues.apache.org/jira/browse/DRILL-8481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17821298#comment-17821298 ] benj commented on DRILL-8481: - Hi [~cgivre], It's just a bug report, or rather, a request for an enhancement. > Ability to query XML root attributes > > > Key: DRILL-8481 > URL: https://issues.apache.org/jira/browse/DRILL-8481 > Project: Apache Drill > Issue Type: Improvement > Components: Storage - XML >Affects Versions: 1.21.1 >Reporter: benj >Priority: Major > > Hi, > It is possible to retrieve the field attributes, except those of the root. > It would be interesting to be able to retrieve the attributes found in the > root node of XML files. > In my common use cases, I have many XML files, each containing a single XML > frame, often with one or more attributes in the root tag. > To recover this value, I am currently forced to preprocess the files to > "copy" this attribute into the fields of the XML record. > Even with multiple XML records under the root, it would be useful to make > the root attributes accessible for each record. > Example (file aaa.xml): > {noformat} > > > blue > > {noformat} > With the query: > {code:sql} > SELECT * FROM (SELECT filename, * FROM TABLE(dfs.test.`/aaa.xml`(type=>'xml', > dataLevel=>1)) as xml) AS x; > {code} > I can access: > * P1_SubVersion > * P1_MID > * P1_PN > * P1_SL > * P2_SubVersion > * P2.Color > But I can't access: > * PPP_Version > * PPP_TimeStamp > and changing the dataLevel does not solve the problem. > Regards, -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (DRILL-8481) Ability to query XML root attributes
[ https://issues.apache.org/jira/browse/DRILL-8481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] benj updated DRILL-8481: Summary: Ability to query XML root attributes (was: Ability to query root attributes) > Ability to query XML root attributes > > > Key: DRILL-8481 > URL: https://issues.apache.org/jira/browse/DRILL-8481 > Project: Apache Drill > Issue Type: Improvement > Components: Storage - XML >Affects Versions: 1.21.1 >Reporter: benj >Priority: Major > > Hi, > It is possible to retrieve the field attributes, except those of the root. > It would be interesting to be able to retrieve the attributes found in the > root node of XML files. > In my common use cases, I have many XML files, each containing a single XML > frame, often with one or more attributes in the root tag. > To recover this value, I am currently forced to preprocess the files to > "copy" this attribute into the fields of the XML record. > Even with multiple XML records under the root, it would be useful to make > the root attributes accessible for each record. > Example (file aaa.xml): > {noformat} > > > blue > > {noformat} > With the query: > {code:sql} > SELECT * FROM (SELECT filename, * FROM TABLE(dfs.test.`/aaa.xml`(type=>'xml', > dataLevel=>1)) as xml) AS x; > {code} > I can access: > * P1_SubVersion > * P1_MID > * P1_PN > * P1_SL > * P2_SubVersion > * P2.Color > But I can't access: > * PPP_Version > * PPP_TimeStamp > and changing the dataLevel does not solve the problem. > Regards, -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (DRILL-8481) Ability to query root attributes
benj created DRILL-8481: --- Summary: Ability to query root attributes Key: DRILL-8481 URL: https://issues.apache.org/jira/browse/DRILL-8481 Project: Apache Drill Issue Type: Improvement Components: Storage - XML Affects Versions: 1.21.1 Reporter: benj Hi, It is possible to retrieve the field attributes, except those of the root. It would be interesting to be able to retrieve the attributes found in the root node of XML files. In my common use cases, I have many XML files, each containing a single XML frame, often with one or more attributes in the root tag. To recover this value, I am currently forced to preprocess the files to "copy" this attribute into the fields of the XML record. Even with multiple XML records under the root, it would be useful to make the root attributes accessible for each record. Example (file aaa.xml): {noformat} blue {noformat} With the query: {code:sql} SELECT * FROM (SELECT filename, * FROM TABLE(dfs.test.`/aaa.xml`(type=>'xml', dataLevel=>1)) as xml) AS x; {code} I can access: * P1_SubVersion * P1_MID * P1_PN * P1_SL * P2_SubVersion * P2.Color But I can't access: * PPP_Version * PPP_TimeStamp and changing the dataLevel does not solve the problem. Regards, -- This message was sent by Atlassian Jira (v8.20.10#820010)
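The preprocessing workaround described in the report ("copy" the root attribute into the fields of each record) can be sketched in a few lines of Python. This is a hedged illustration, not part of the issue: the element and attribute names (`PPP`, `Version`, `TimeStamp`, `P1`, `P2`) are hypothetical, chosen only to echo the field names mentioned above.

```python
# Sketch of the preprocessing workaround: copy the root element's attributes
# down onto every child record so a level-1 reader can see them.
# Tag/attribute names are hypothetical, not taken from the (elided) sample file.
import xml.etree.ElementTree as ET

def push_root_attributes_down(xml_text: str) -> str:
    """Return XML where every child of the root also carries the root's attributes."""
    root = ET.fromstring(xml_text)
    for record in root:
        for name, value in root.attrib.items():
            # Prefix with the root tag to avoid clashing with record attributes.
            record.set(f"{root.tag}_{name}", value)
    return ET.tostring(root, encoding="unicode")

sample = '<PPP Version="1.0" TimeStamp="2024-01-01"><P1 MID="m1"/><P2 Color="blue"/></PPP>'
result = push_root_attributes_down(sample)
print(result)
```

After this rewrite, each record exposes `PPP_Version` and `PPP_TimeStamp` alongside its own attributes, which is exactly what the query with `dataLevel=>1` could then pick up without any engine change.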
[jira] [Comment Edited] (DRILL-7740) LEAST and GREATEST do not work well with date in embedded mode
[ https://issues.apache.org/jira/browse/DRILL-7740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17382023#comment-17382023 ] benj edited comment on DRILL-7740 at 7/16/21, 12:13 PM: The problem seems to have been fixed in version 1.19 {noformat} $ bash ./apache-drill-1.19.0/bin/drill-embedded apache drill> SELECT a, b, LEAST(a,b) AS min_a_b, GREATEST(a,b) AS max_a_b 2..semicolon> FROM (select to_date('2018-02-26','yyyy-MM-dd') AS a, to_date('2018-02-28','yyyy-MM-dd') AS b); +------------+------------+------------+------------+ | a | b | min_a_b | max_a_b | +------------+------------+------------+------------+ | 2018-02-26 | 2018-02-28 | 2018-02-26 | 2018-02-28 | +------------+------------+------------+------------+ {noformat} And there is no warning when using LEAST or GREATEST was (Author: benj641): The problem seems to have been fixed in version 1.19 {noformat} $ bash ./apache-drill-1.19.0/bin/drill-embedded apache drill> SELECT a, b, LEAST(a,b) AS min_a_b, GREATEST(a,b) AS max_a_b 2..semicolon> FROM (select to_date('2018-02-26','yyyy-MM-dd') AS a, to_date('2018-02-28','yyyy-MM-dd') AS b); +------------+------------+------------+------------+ | a | b | min_a_b | max_a_b | +------------+------------+------------+------------+ | 2018-02-26 | 2018-02-28 | 2018-02-26 | 2018-02-28 | +------------+------------+------------+------------+ {noformat} > LEAST and GREATEST do not work well with date in embedded mode > > > Key: DRILL-7740 > URL: https://issues.apache.org/jira/browse/DRILL-7740 > Project: Apache Drill > Issue Type: Bug > Components: Functions - Drill, Functions - Hive >Affects Versions: 1.17.0 >Reporter: benj >Priority: Major > > There seems to be a huge problem with the LEAST and GREATEST functions in > embedded mode when using them with the DATE type > {code:sql} > bash bin/drill-embedded > apache drill> SELECT a, b, LEAST(a,b) AS min_a_b, GREATEST(a,b) AS max_a_b > FROM (select to_date('2018-02-26','yyyy-MM-dd') AS a, > to_date('2018-02-28','yyyy-MM-dd') AS b); > +------------+------------+------------+------------+ > | a | b | min_a_b | max_a_b | > +------------+------------+------------+------------+ > | 2018-02-26 | 2018-02-28 | 2018-02-25 | 2018-02-27 | > +------------+------------+------------+------------+ > {code} > min_a_b = 2018-02-25 instead of 2018-02-26 > max_a_b = 2018-02-27 instead of 2018-02-28 > Please note that the first time I use LEAST or GREATEST I get a warning: > 
{noformat} > WARNING: An illegal reflective access operation has occurred > WARNING: Illegal reflective access by > org.apache.hadoop.hive.common.StringInternUtils > (file:.../apache-drill-1.17.0/jars/drill-hive-exec-shaded-1.17.0.jar) to > field java.net.URI.string > WARNING: Please consider reporting this to the maintainers of > org.apache.hadoop.hive.common.StringInternUtils > WARNING: Use --illegal-access=warn to enable warnings of further illegal > reflective access operations > WARNING: All illegal access operations will be denied in a future release > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (DRILL-7740) LEAST and GREATEST do not work well with date in embedded mode
[ https://issues.apache.org/jira/browse/DRILL-7740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17382023#comment-17382023 ] benj commented on DRILL-7740: - The problem seems to have been fixed in version 1.19 {noformat} $ bash ./apache-drill-1.19.0/bin/drill-embedded apache drill> SELECT a, b, LEAST(a,b) AS min_a_b, GREATEST(a,b) AS max_a_b 2..semicolon> FROM (select to_date('2018-02-26','yyyy-MM-dd') AS a, to_date('2018-02-28','yyyy-MM-dd') AS b); +------------+------------+------------+------------+ | a | b | min_a_b | max_a_b | +------------+------------+------------+------------+ | 2018-02-26 | 2018-02-28 | 2018-02-26 | 2018-02-28 | +------------+------------+------------+------------+ {noformat} > LEAST and GREATEST do not work well with date in embedded mode > > > Key: DRILL-7740 > URL: https://issues.apache.org/jira/browse/DRILL-7740 > Project: Apache Drill > Issue Type: Bug > Components: Functions - Drill, Functions - Hive >Affects Versions: 1.17.0 >Reporter: benj >Priority: Major > > There seems to be a huge problem with the LEAST and GREATEST functions in > embedded mode when using them with the DATE type > {code:sql} > bash bin/drill-embedded > apache drill> SELECT a, b, LEAST(a,b) AS min_a_b, GREATEST(a,b) AS max_a_b > FROM (select to_date('2018-02-26','yyyy-MM-dd') AS a, > to_date('2018-02-28','yyyy-MM-dd') AS b); > +------------+------------+------------+------------+ > | a | b | min_a_b | max_a_b | > +------------+------------+------------+------------+ > | 2018-02-26 | 2018-02-28 | 2018-02-25 | 2018-02-27 | > +------------+------------+------------+------------+ > {code} > min_a_b = 2018-02-25 instead of 2018-02-26 > max_a_b = 2018-02-27 instead of 2018-02-28 > Please note that the first time I use LEAST or GREATEST I get a warning: > {noformat} > WARNING: An illegal reflective access operation has occurred > WARNING: Illegal reflective access by > org.apache.hadoop.hive.common.StringInternUtils > (file:.../apache-drill-1.17.0/jars/drill-hive-exec-shaded-1.17.0.jar) to > field java.net.URI.string > WARNING: Please consider reporting this to the maintainers of > org.apache.hadoop.hive.common.StringInternUtils > WARNING: Use --illegal-access=warn to enable warnings of further illegal > reflective access operations > WARNING: All 
illegal access operations will be denied in a future release > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (DRILL-2388) support extract(epoch)
[ https://issues.apache.org/jira/browse/DRILL-2388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17368733#comment-17368733 ] benj commented on DRILL-2388: - It's painful to do {noformat} SELECT extract(hour FROM a) * 3600 + extract(minute FROM a) * 60 + extract(second FROM a) AS seconds FROM (VALUES(CAST('00:02:03' AS TIME))) AS t(a) {noformat} instead of {noformat} SELECT extract(epoch FROM a) AS seconds FROM (VALUES(CAST('00:02:03' AS TIME))) AS t(a) {noformat} maybe epoch is a good solution, or why not seconds (with an "s") > support extract(epoch) > --- > > Key: DRILL-2388 > URL: https://issues.apache.org/jira/browse/DRILL-2388 > Project: Apache Drill > Issue Type: Improvement > Components: SQL Parser >Affects Versions: 0.8.0 >Reporter: Chun Chang >Priority: Minor > Fix For: Future > > > Postgres supports the following: > {code} > SELECT extract(epoch FROM now()); > {code} > Drill will error: > {code} > 0: jdbc:drill:schema=dfs.drillTestDirComplexJ> SELECT extract(epoch FROM > now()) from sys.drillbits; > Query failed: ParseException: Encountered "epoch" at line 1, column 16. > Was expecting one of: > "YEAR" ... > "MONTH" ... > "DAY" ... > "HOUR" ... > "MINUTE" ... > "SECOND" ... > Error: exception while executing query: Failure while executing query. > (state=,code=0) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
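The hour/minute/second workaround in the comment above boils down to simple arithmetic. Expressed in plain Python for a time-of-day value (a sketch of the semantics, not of Drill internals):

```python
# The extract(hour)*3600 + extract(minute)*60 + extract(second) workaround
# from the comment, applied to a time-of-day value.
from datetime import time

def seconds_since_midnight(t: time) -> int:
    # Equivalent of extract(epoch FROM <time-of-day>) in engines that support it.
    return t.hour * 3600 + t.minute * 60 + t.second

print(seconds_since_midnight(time(0, 2, 3)))  # 123
```

A single `extract(epoch FROM ...)` would replace the three extracts and two multiplications with one call, which is the whole point of the request.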
[jira] [Created] (DRILL-7954) XML: ability not to concatenate fields and attributes - change presentation of data
benj created DRILL-7954: --- Summary: XML: ability not to concatenate fields and attributes - change presentation of data Key: DRILL-7954 URL: https://issues.apache.org/jira/browse/DRILL-7954 Project: Apache Drill Issue Type: Improvement Affects Versions: 1.19.0 Reporter: benj With an XML file containing this data: {noformat} x y z a {noformat} {noformat} apache drill> SELECT * FROM TABLE(dfs.test.`attributetest.xml`(type=>'xml', dataLevel=>1)) as x; +-----------------------------------------------+----------------+ | attributes | attr | +-----------------------------------------------+----------------+ | {"attr_set_num":"0123","attr_set_val":"12ab"} | {"set":"xyza"} | +-----------------------------------------------+----------------+ apache drill> SELECT * FROM TABLE(dfs.test.`attributetest.xml`(type=>'xml', dataLevel=>2)) as x; +---------------------------------+-----+ | attributes | set | +---------------------------------+-----+ | {"set_num":"01","set_val":"12"} | xy | | {"set_num":"23","set_val":"ab"} | za | +---------------------------------+-----+ apache drill> SELECT * FROM TABLE(dfs.test.`attributetest.xml`(type=>'xml', dataLevel=>3)) as x; +------------+ | attributes | +------------+ | {} | | {} | | {} | | {} | +------------+ {noformat} Attributes and fields with the same name are concatenated and remain unusable _(maybe the possibility of adding a separator would help, but that's not the point here)_ In fact, what we really need is the ability to obtain something like this _(depending on the data level)_ : {noformat} +---------------------------------------------------------------------------------------------------+ | attr | +---------------------------------------------------------------------------------------------------+ | [{"set":"x","_attributes":{"num":"0","val":"1"}},{"set":"y","_attributes":{"num":"1","val":"2"}}] | | [{"set":"z","_attributes":{"num":"2","val":"a"}},{"set":"a","_attributes":{"num":"3","val":"b"}}] | +---------------------------------------------------------------------------------------------------+ +-------------------------------------------------+ | set | +-------------------------------------------------+ | {"set":"x","_attributes":{"num":"0","val":"1"}} | | {"set":"y","_attributes":{"num":"1","val":"2"}} | | {"set":"z","_attributes":{"num":"2","val":"a"}} | | {"set":"a","_attributes":{"num":"3","val":"b"}} | +-------------------------------------------------+ {noformat} _attributes fields could be generated at each level instead of being generated with the path from the top level => that would allow working with the data at each level without losing information -- This message was sent by Atlassian Jira (v8.3.4#803005)
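The per-level layout requested above (each element becomes one record carrying its own `_attributes` map instead of attributes being concatenated across levels) can be sketched with the standard library XML parser. The tag and attribute names here are illustrative reconstructions, since the sample XML in the issue was stripped by the mailing-list formatter:

```python
# Sketch of the requested per-level attribute layout: each <set> element
# becomes one record with its text plus its own _attributes map.
# Tag/attribute names are illustrative, inferred from the desired output above.
import xml.etree.ElementTree as ET

xml_text = """
<attr>
  <set num="0" val="1">x</set>
  <set num="1" val="2">y</set>
  <set num="2" val="a">z</set>
  <set num="3" val="b">a</set>
</attr>
"""

def records_with_level_attributes(text: str):
    root = ET.fromstring(text)
    # One record per element at this level; attributes stay attached to it.
    return [{"set": el.text, "_attributes": dict(el.attrib)} for el in root]

recs = records_with_level_attributes(xml_text)
for rec in recs:
    print(rec)
```

This produces records shaped like `{"set": "x", "_attributes": {"num": "0", "val": "1"}}`, matching the second desired table in the issue.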
[jira] [Commented] (DRILL-4660) TextReader should support multibyte field delimiters
[ https://issues.apache.org/jira/browse/DRILL-4660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126798#comment-17126798 ] benj commented on DRILL-4660: - Any news/progress/hope for this functionality? > TextReader should support multibyte field delimiters > > > Key: DRILL-4660 > URL: https://issues.apache.org/jira/browse/DRILL-4660 > Project: Apache Drill > Issue Type: Improvement > Components: Storage - Text CSV >Affects Versions: 1.6.0 >Reporter: Arina Ielchiieva >Priority: Minor > Fix For: Future > > > Data file /tmp/foo.txt contents: > {noformat} > 0::2::3 > 0::3::1 > 0::5::2 > 0::9::4 > 0::11::1 > 0::12::2 > 0::15::1 > {noformat} > Query: > {code} > select > columns > from > table(dfs.`/tmp/foo.txt`(type => 'text', fieldDelimiter => '::')) > {code} > Results in an error message: > {noformat} > PARSE ERROR: > Expected single character but was String: :: > table /tmp/foo.txt > parameter fieldDelimiter SQL Query null > {noformat} > It would be nice if fieldDelimiter accepted text of any length. -- This message was sent by Atlassian Jira (v8.3.4#803005)
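What a multibyte `fieldDelimiter` would mean for the sample file above is just splitting each line on the full `::` string rather than on a single character, as this minimal Python sketch shows:

```python
# Splitting the sample rows from /tmp/foo.txt on the multibyte "::" delimiter.
lines = ["0::2::3", "0::3::1", "0::5::2", "0::9::4"]

# str.split accepts an arbitrary-length separator, which is exactly the
# behavior the issue asks fieldDelimiter to support.
rows = [line.split("::") for line in lines]
print(rows)
```

Note that Python's own `csv` module has the same single-character restriction on `delimiter` that Drill's TextReader does, which is why the split-based approach is used here.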
[jira] [Updated] (DRILL-7747) Function to determine unknown fields / on-the-fly generated missing fields
[ https://issues.apache.org/jira/browse/DRILL-7747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] benj updated DRILL-7747: Description: it would be really useful to have a function that tells whether a field comes from an existing column or not. With this data: {code:sql} apache drill 1.17> SELECT * FROM dfs.test.`f1.parquet`; +---+--------+-------+ | a | b | c | +---+--------+-------+ | 1 | test-1 | other | | 2 | test-2 | null | | 3 | test-3 | old | +---+--------+-------+ apache drill 1.17> SELECT * FROM dfs.test.`f2.parquet`; +----+---------+ | a | b | +----+---------+ | 10 | test-10 | | 20 | test-20 | | 30 | test-30 | +----+---------+ apache drill 1.17> SELECT *, drilltypeof(c), modeof(c) FROM dfs.test.`f*.parquet`; +------------+----+---------+-------+---------+----------+ | dir0 | a | b | c | EXPR$1 | EXPR$2 | +------------+----+---------+-------+---------+----------+ | f1.parquet | 1 | test-1 | other | VARCHAR | NULLABLE | | f1.parquet | 2 | test-2 | null | VARCHAR | NULLABLE | | f1.parquet | 3 | test-3 | old | VARCHAR | NULLABLE | | f2.parquet | 10 | test-10 | null | VARCHAR | NULLABLE | | f2.parquet | 20 | test-20 | null | VARCHAR | NULLABLE | | f2.parquet | 30 | test-30 | null | VARCHAR | NULLABLE | +------------+----+---------+-------+---------+----------+ {code} It would be nice to know whether the 'c' data is present because the column exists in the Parquet (or other file type) or whether the NULL value was generated because the column was missing. 
For example, a function 'origin' that takes a column name and returns, for each row, whether the value was 'generated' or 'original' (another/better keyword could be chosen (exist(column)=>true/false)). Hypothetical example with the previous data: {code:sql} apache drill> SELECT *, drilltypeof(c), modeof(c), origin(c) AS origin FROM dfs.test.`f*.parquet`; +------------+----+---------+-------+---------+----------+-----------+ | dir0 | a | b | c | EXPR$1 | EXPR$2 | origin | +------------+----+---------+-------+---------+----------+-----------+ | f1.parquet | 1 | test-1 | other | VARCHAR | NULLABLE | original | | f1.parquet | 2 | test-2 | null | VARCHAR | NULLABLE | original | | f1.parquet | 3 | test-3 | old | VARCHAR | NULLABLE | original | | f2.parquet | 10 | test-10 | null | VARCHAR | NULLABLE | generated | | f2.parquet | 20 | test-20 | null | VARCHAR | NULLABLE | generated | | f2.parquet | 30 | test-30 | null | VARCHAR | NULLABLE | generated | +------------+----+---------+-------+---------+----------+-----------+ {code} Or maybe another way would be to have an implicit column (like filename, filepath, ...) that contains the list of available "columns" was: it would be really useful to have a function that tells whether a field comes from an existing column or not. 
With this data: {code:sql} apache drill 1.17> SELECT * FROM dfs.test.`f1.parquet`; +---+--------+-------+ | a | b | c | +---+--------+-------+ | 1 | test-1 | other | | 2 | test-2 | null | | 3 | test-3 | old | +---+--------+-------+ apache drill 1.17> SELECT * FROM dfs.test.`f2.parquet`; +----+---------+ | a | b | +----+---------+ | 10 | test-10 | | 20 | test-20 | | 30 | test-30 | +----+---------+ apache drill 1.17> SELECT *, drilltypeof(c), modeof(c) FROM dfs.test.`f*.parquet`; +------------+----+---------+-------+---------+----------+ | dir0 | a | b | c | EXPR$1 | EXPR$2 | +------------+----+---------+-------+---------+----------+ | f1.parquet | 1 | test-1 | other | VARCHAR | NULLABLE | | f1.parquet | 2 | test-2 | null | VARCHAR | NULLABLE | | f1.parquet | 3 | test-3 | old | VARCHAR | NULLABLE | | f2.parquet | 10 | test-10 | null | VARCHAR | NULLABLE | | f2.parquet | 20 | test-20 | null | VARCHAR | NULLABLE | | f2.parquet | 30 | test-30 | null | VARCHAR | NULLABLE | +------------+----+---------+-------+---------+----------+ {code} It would be nice to know whether the 'c' data is present because the column exists in the Parquet (or other file type) or whether the NULL value was generated because the column was missing. For example, a function 'origin' that takes a column name and returns, for each row, whether the value was 'generated' or 'original' (another/better keyword could be chosen (exist(column)=>true/false)). Hypothetical example with the previous data: {code:sql} apache drill> SELECT *, drilltypeof(c), modeof(c), origin(c) AS origin FROM dfs.test.`f*.parquet`; +------------+----+---------+-------+---------+----------+-----------+ | dir0 | a | b | c | EXPR$1 | EXPR$2 | origin | +------------+----+---------+-------+---------+----------+-----------+ | f1.parquet | 1 | test-1 | other | VARCHAR | NULLABLE | original | | f1.parquet | 2 | test-2 | null | VARCHAR | NULLABLE | original | | f1.parquet | 3 | test-3 | old | VARCHAR | NULLABLE | original | | f2.parquet | 10 | test-10 | null | VARCHAR |
[jira] [Updated] (DRILL-7747) Function to determine unknown fields / on-the-fly generated missing fields
[ https://issues.apache.org/jira/browse/DRILL-7747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] benj updated DRILL-7747: Description: it would be really useful to have a function that tells whether a field comes from an existing column or not. With this data: {code:sql} apache drill 1.17> SELECT * FROM dfs.test.`f1.parquet`; +---+--------+-------+ | a | b | c | +---+--------+-------+ | 1 | test-1 | other | | 2 | test-2 | null | | 3 | test-3 | old | +---+--------+-------+ apache drill 1.17> SELECT * FROM dfs.test.`f2.parquet`; +----+---------+ | a | b | +----+---------+ | 10 | test-10 | | 20 | test-20 | | 30 | test-30 | +----+---------+ apache drill 1.17> SELECT *, drilltypeof(c), modeof(c) FROM dfs.test.`f*.parquet`; +------------+----+---------+-------+---------+----------+ | dir0 | a | b | c | EXPR$1 | EXPR$2 | +------------+----+---------+-------+---------+----------+ | f1.parquet | 1 | test-1 | other | VARCHAR | NULLABLE | | f1.parquet | 2 | test-2 | null | VARCHAR | NULLABLE | | f1.parquet | 3 | test-3 | old | VARCHAR | NULLABLE | | f2.parquet | 10 | test-10 | null | VARCHAR | NULLABLE | | f2.parquet | 20 | test-20 | null | VARCHAR | NULLABLE | | f2.parquet | 30 | test-30 | null | VARCHAR | NULLABLE | +------------+----+---------+-------+---------+----------+ {code} It would be nice to know whether the 'c' data is present because the column exists in the Parquet (or other file type) or whether the NULL value was generated because the column was missing. 
For example, a function 'origin' that takes a column name and returns, for each row, whether the value was 'generated' or 'original' (another/better keyword could be chosen (exist(column)=>true/false)). Hypothetical example with the previous data: {code:sql} apache drill> SELECT *, drilltypeof(c), modeof(c), origin(c) AS origin FROM dfs.test.`f*.parquet`; +------------+----+---------+-------+---------+----------+-----------+ | dir0 | a | b | c | EXPR$1 | EXPR$2 | origin | +------------+----+---------+-------+---------+----------+-----------+ | f1.parquet | 1 | test-1 | other | VARCHAR | NULLABLE | original | | f1.parquet | 2 | test-2 | null | VARCHAR | NULLABLE | original | | f1.parquet | 3 | test-3 | old | VARCHAR | NULLABLE | original | | f2.parquet | 10 | test-10 | null | VARCHAR | NULLABLE | generated | | f2.parquet | 20 | test-20 | null | VARCHAR | NULLABLE | generated | | f2.parquet | 30 | test-30 | null | VARCHAR | NULLABLE | generated | +------------+----+---------+-------+---------+----------+-----------+ {code} was: it would be really useful to have a function that tells whether a field comes from an existing column or not. With this data: {code:sql} apache drill 1.17> SELECT * FROM dfs.test.`f1.parquet`; +---+--------+-------+ | a | b | c | +---+--------+-------+ | 1 | test-1 | other | | 2 | test-2 | null | | 3 | test-3 | old | +---+--------+-------+ apache drill 1.17> SELECT * FROM dfs.test.`f2.parquet`; +----+---------+ | a | b | +----+---------+ | 10 | test-10 | | 20 | test-20 | | 30 | test-30 | +----+---------+ apache drill 1.17> SELECT *, drilltypeof(c), modeof(c) FROM dfs.test.`f*.parquet`; +------------+----+---------+-------+---------+----------+ | dir0 | a | b | c | EXPR$1 | EXPR$2 | +------------+----+---------+-------+---------+----------+ | f1.parquet | 1 | test-1 | other | VARCHAR | NULLABLE | | f1.parquet | 2 | test-2 | null | VARCHAR | NULLABLE | | f1.parquet | 3 | test-3 | old | VARCHAR | NULLABLE | | f2.parquet | 10 | test-10 | null | VARCHAR | NULLABLE | | f2.parquet | 20 | test-20 | null | VARCHAR | NULLABLE | | f2.parquet | 30 | test-30 | null | VARCHAR | NULLABLE | +------------+----+---------+-------+---------+----------+ {code} It would be nice to know whether the 'c' data is present because the column exists in the Parquet (or other file type) or whether the NULL value was generated because the column was missing. 
For example, a function 'origin' that takes a column name and returns, for each row, whether the value was 'generated' or 'original' (another/better keyword could be chosen). Hypothetical example with the previous data: {code:sql} apache drill> SELECT *, drilltypeof(c), modeof(c), origin(c) AS origin FROM dfs.test.`f*.parquet`; +------------+----+---------+-------+---------+----------+-----------+ | dir0 | a | b | c | EXPR$1 | EXPR$2 | origin | +------------+----+---------+-------+---------+----------+-----------+ | f1.parquet | 1 | test-1 | other | VARCHAR | NULLABLE | original | | f1.parquet | 2 | test-2 | null | VARCHAR | NULLABLE | original | | f1.parquet | 3 | test-3 | old | VARCHAR | NULLABLE | original | | f2.parquet | 10 | test-10 | null | VARCHAR | NULLABLE | generated | | f2.parquet | 20 | test-20 | null | VARCHAR | NULLABLE | generated | | f2.parquet | 30 | test-30 | null | VARCHAR | NULLABLE | generated |
[jira] [Created] (DRILL-7747) Function to determine unknown fields / on-the-fly generated missing fields
benj created DRILL-7747: --- Summary: Function to determine unknown fields / on-the-fly generated missing fields Key: DRILL-7747 URL: https://issues.apache.org/jira/browse/DRILL-7747 Project: Apache Drill Issue Type: Wish Components: Functions - Drill Affects Versions: 1.17.0 Reporter: benj it would be really useful to have a function that tells whether a field comes from an existing column or not. With this data: {code:sql} apache drill 1.17> SELECT * FROM dfs.test.`f1.parquet`; +---+--------+-------+ | a | b | c | +---+--------+-------+ | 1 | test-1 | other | | 2 | test-2 | null | | 3 | test-3 | old | +---+--------+-------+ apache drill 1.17> SELECT * FROM dfs.test.`f2.parquet`; +----+---------+ | a | b | +----+---------+ | 10 | test-10 | | 20 | test-20 | | 30 | test-30 | +----+---------+ apache drill 1.17> SELECT *, drilltypeof(c), modeof(c) FROM dfs.test.`f*.parquet`; +------------+----+---------+-------+---------+----------+ | dir0 | a | b | c | EXPR$1 | EXPR$2 | +------------+----+---------+-------+---------+----------+ | f1.parquet | 1 | test-1 | other | VARCHAR | NULLABLE | | f1.parquet | 2 | test-2 | null | VARCHAR | NULLABLE | | f1.parquet | 3 | test-3 | old | VARCHAR | NULLABLE | | f2.parquet | 10 | test-10 | null | VARCHAR | NULLABLE | | f2.parquet | 20 | test-20 | null | VARCHAR | NULLABLE | | f2.parquet | 30 | test-30 | null | VARCHAR | NULLABLE | +------------+----+---------+-------+---------+----------+ {code} It would be nice to know whether the 'c' data is present because the column exists in the Parquet (or other file type) or whether the NULL value was generated because the column was missing. 
For example, a function 'origin' that takes a column name and returns, for each row, whether the value was 'generated' or 'original' (another/better keyword could be chosen). Hypothetical example with the previous data: {code:sql} apache drill> SELECT *, drilltypeof(c), modeof(c), origin(c) AS origin FROM dfs.test.`f*.parquet`; +------------+----+---------+-------+---------+----------+-----------+ | dir0 | a | b | c | EXPR$1 | EXPR$2 | origin | +------------+----+---------+-------+---------+----------+-----------+ | f1.parquet | 1 | test-1 | other | VARCHAR | NULLABLE | original | | f1.parquet | 2 | test-2 | null | VARCHAR | NULLABLE | original | | f1.parquet | 3 | test-3 | old | VARCHAR | NULLABLE | original | | f2.parquet | 10 | test-10 | null | VARCHAR | NULLABLE | generated | | f2.parquet | 20 | test-20 | null | VARCHAR | NULLABLE | generated | | f2.parquet | 30 | test-30 | null | VARCHAR | NULLABLE | generated | +------------+----+---------+-------+---------+----------+-----------+ {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
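The proposed `origin()` semantics can be illustrated outside of Drill with plain dict records, where a key that is absent from the source record corresponds to a column the engine had to materialize as NULL. This is a sketch of the idea only; the function name `origin` comes from the issue, everything else is illustrative:

```python
# Illustration of the proposed origin() semantics over dict records:
# "original"  -> the key exists in the source record,
# "generated" -> the engine materialized a missing column as NULL.
rows = [
    {"a": 1, "b": "test-1", "c": "other"},   # like f1.parquet: column c exists
    {"a": 10, "b": "test-10"},               # like f2.parquet: no column c
]

def origin(row: dict, column: str) -> str:
    return "original" if column in row else "generated"

for row in rows:
    print(row.get("c"), origin(row, "c"))
```

The key point the sketch makes is that `row.get("c")` returns `None` in both the "NULL stored in the file" and "column absent" cases, so the value alone cannot distinguish them; an explicit `origin`-style predicate (or an implicit column listing the available columns) is needed.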
[jira] [Commented] (DRILL-3014) Casting unknown field yields different result from casting null, and bad error message
[ https://issues.apache.org/jira/browse/DRILL-3014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17125847#comment-17125847 ] benj commented on DRILL-3014: - This problem seems to have been corrected (it's not possible to reproduce in 1.17) > Casting unknown field yields different result from casting null, and bad > error message > -- > > Key: DRILL-3014 > URL: https://issues.apache.org/jira/browse/DRILL-3014 > Project: Apache Drill > Issue Type: Bug > Components: Execution - Relational Operators >Reporter: Daniel Barclay >Priority: Minor > Fix For: Future > > > Casting null to INTEGER works as expected like this: > {noformat} > 0: jdbc:drill:zk=local> select cast(NULL AS INTEGER) from > `dfs.tmp`.`simple.csv`; > ++ > | EXPR$0 | > ++ > | null | > ++ > 1 row selected (0.15 seconds) > 0: jdbc:drill:zk=local> > {noformat} > (File "{{simple.csv}}" contains one line containing simply "{{a,b,c,d}}".) > However, casting an unknown column yields an error: > {noformat} > 0: jdbc:drill:zk=local> select cast(noSuchField AS INTEGER) from > `dfs.tmp`.`simple.csv`; > Error: SYSTEM ERROR: null > Fragment 0:0 > [Error Id: a0b348ec-f2c5-4f66-9f05-591399f3c315 on dev-linux2:31010] > (state=,code=0) > 0: jdbc:drill:zk=local> > {noformat} > This looks like a JDK {{NumberFormatException}} that wasn't handled > properly*, and looks like the logical null from the non-existent column was > turned into the string "{{null}}" before the cast to {{INTEGER}}. > Is that a bug or is it intentional that the non-existent field in this case > is not actually treated as being all nulls (as non-existent fields are in at > least some other places)? > (*For most NumberFormatExceptions, the message text does not contain the > information that the kind of exception was a number-format exception--that > information is only in the class name. In particular that information is not > in the message text returned by getMessage(). 
> Drill code that can throw a {{NumberFormatException}} (e.g., cast functions > and other code that calls, e.g., {{Integer.parse(...)}}) should either > immediately wrap it in a {{UserException}}, or at least wrap it in another > {{NumberFormatException}} with fuller message text.) > This seems to confirm that it's a {{NumberFormatException}} (note the > first-column value "{{a}}"): > {noformat} > select cast(columns[0] AS INTEGER) from `dfs.tmp`.`simple.csv`; > Error: SYSTEM ERROR: a > Fragment 0:0 > [Error Id: 9d6107dc-dc2a-40ce-9676-6387ab427098 on dev-linux2:31010] > (state=,code=0) > 0: jdbc:drill:zk=local> > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (DRILL-7104) Change of data type when parquet has multiple fragments
[ https://issues.apache.org/jira/browse/DRILL-7104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] benj updated DRILL-7104: Affects Version/s: 1.16.0 1.17.0 > Change of data type when parquet has multiple fragments > --- > > Key: DRILL-7104 > URL: https://issues.apache.org/jira/browse/DRILL-7104 > Project: Apache Drill > Issue Type: Bug > Components: Storage - Parquet >Affects Versions: 1.15.0, 1.16.0, 1.17.0 >Reporter: benj >Priority: Major > Attachments: DRILL-7104_ErrorNumberFormatException_20190322.log > > > When creating a Parquet with a column filled only with "CAST(NULL AS > VARCHAR)", if the parquet has several fragments, the type is read as INT > instead of VARCHAR. > First, create a +Parquet with only one fragment+ - all is fine (the type of > "demo" is correct). > {code:java} > CREATE TABLE `nobug` AS > (SELECT CAST(NULL AS VARCHAR) AS demo > , md5(CAST(rand() AS VARCHAR)) AS jam > FROM `onebigfile` LIMIT 100); > +----------+---------------------------+ > | Fragment | Number of records written | > +----------+---------------------------+ > | 0_0 | 1000 | > SELECT drilltypeof(demo) AS goodtype FROM `nobug` LIMIT 1; > +----------+ > | goodtype | > +----------+ > | VARCHAR | > {code} > Second, create a +Parquet with at least 2 fragments+ - the type of "demo" > changes to INT > {code:java} > CREATE TABLE `bug` AS > ((SELECT CAST(NULL AS VARCHAR) AS demo > ,md5(CAST(rand() AS VARCHAR)) AS jam > FROM `onebigfile` LIMIT 100) > UNION > (SELECT CAST(NULL AS VARCHAR) AS demo > ,md5(CAST(rand() AS VARCHAR)) AS jam > FROM `onebigfile` LIMIT 100)); > +----------+---------------------------+ > | Fragment | Number of records written | > +----------+---------------------------+ > | 1_1 | 1000276 | > | 1_0 | 999724 | > SELECT drilltypeof(demo) AS badtype FROM `bug` LIMIT 1; > +---------+ > | badtype | > +---------+ > | INT |{code} > The change of type is really terrible... > > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (DRILL-7740) LEAST and GREATEST do not work well with date in embedded mode
benj created DRILL-7740: --- Summary: LEAST and GREATEST do not work well with date in embedded mode Key: DRILL-7740 URL: https://issues.apache.org/jira/browse/DRILL-7740 Project: Apache Drill Issue Type: Bug Components: Functions - Drill, Functions - Hive Affects Versions: 1.17.0 Reporter: benj There seems to be a huge problem with the LEAST and GREATEST functions in embedded mode when using them with the DATE type {code:sql} bash bin/drill-embedded apache drill> SELECT a, b, LEAST(a,b) AS min_a_b, GREATEST(a,b) AS max_a_b FROM (select to_date('2018-02-26','yyyy-MM-dd') AS a, to_date('2018-02-28','yyyy-MM-dd') AS b); +------------+------------+------------+------------+ | a | b | min_a_b | max_a_b | +------------+------------+------------+------------+ | 2018-02-26 | 2018-02-28 | 2018-02-25 | 2018-02-27 | +------------+------------+------------+------------+ {code} min_a_b = 2018-02-25 instead of 2018-02-26 max_a_b = 2018-02-27 instead of 2018-02-28 Please note that the first time I use LEAST or GREATEST I get a warning: {noformat} WARNING: An illegal reflective access operation has occurred WARNING: Illegal reflective access by org.apache.hadoop.hive.common.StringInternUtils (file:.../apache-drill-1.17.0/jars/drill-hive-exec-shaded-1.17.0.jar) to field java.net.URI.string WARNING: Please consider reporting this to the maintainers of org.apache.hadoop.hive.common.StringInternUtils WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations WARNING: All illegal access operations will be denied in a future release {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
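The expected behavior for the dates in this report can be checked against Python's date type: the least and greatest of two dates should be the operands themselves, never values shifted by one day as in the buggy embedded-mode output above.

```python
# Expected LEAST/GREATEST semantics for the dates in the report: min/max
# return the operands unchanged, with no off-by-one-day shift.
from datetime import date

a, b = date(2018, 2, 26), date(2018, 2, 28)
print(min(a, b), max(a, b))  # 2018-02-26 2018-02-28
```

The one-day offset in the reported output suggests a timezone-related conversion somewhere in the embedded-mode code path, though the report itself does not identify the cause.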
[jira] [Updated] (DRILL-6975) TO_CHAR does not seem to work well depending on LOCALE
[ https://issues.apache.org/jira/browse/DRILL-6975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] benj updated DRILL-6975: Affects Version/s: 1.17.0 > TO_CHAR does not seem to work well depending on LOCALE > -- > > Key: DRILL-6975 > URL: https://issues.apache.org/jira/browse/DRILL-6975 > Project: Apache Drill > Issue Type: Bug > Components: Functions - Drill >Affects Versions: 1.14.0, 1.15.0, 1.16.0, 1.17.0 >Reporter: benj >Priority: Major > > Strange results from the TO_CHAR function when using different LOCALEs. > {code:java} > SELECT TO_CHAR((CAST('2008-2-23' AS DATE)), 'yyyy-MMM-dd') FROM (VALUES(1)); > 2008-Feb-23 (in documentation (en_US.UTF-8)) > 2008-févr.-2 (fr_FR.UTF-8) > {code} > surprisingly, by adding a space ('yyyy-MMM-dd ') (or any character) at the end > of the format, the result becomes correct (so there is no problem when > formatting a timestamp with 'yyyy MMM dd HH:mm:ss') > {code:java} > SELECT TO_CHAR(1256.789383, '#,###.###') FROM (VALUES(1)); > 1,256.789 (in documentation (en_US.UTF-8)) > 1 256,78 (fr_FR.UTF-8) > {code} > Even worse results can be obtained > {code:java} > SELECT TO_CHAR(12567,'#,###.###'); > 12,567 (en_US.UTF-8) > 12 56 (fr_FR.UTF-8) > {code} > Again, with the addition of a space/char at the end, we get a better result. > I haven't tested all the locales, but for the last example, the result is > right with de_DE.UTF-8 : 12.567 > The situation is identical in 1.14 > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (DRILL-1755) Add support for arrays and scalars as first level elements in JSON files
[ https://issues.apache.org/jira/browse/DRILL-1755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17066614#comment-17066614 ] benj edited comment on DRILL-1755 at 3/25/20, 11:19 AM: The problem appears to have been corrected in 1.17. The error previously shown for 1.16 (see above) no longer appears in 1.17 (see below):
{code:sql}
drill-embedded 1.17> SELECT 'justfortest' AS mytext FROM dfs.tmp.`example.json`;
+-------------+
|   mytext    |
+-------------+
| justfortest |
| justfortest |
+-------------+
{code}
was (Author: benj641): The problem appears to have been corrected in 1.17. The error previously shown for 1.16 (see above) no longer appears in 1.17 (see below):
{code:sql}
drill-embedded 1.17> SELECT 'justfortest' AS mytext FROM dfs.tmp.`tmp.json`;
+-------------+
|   mytext    |
+-------------+
| justfortest |
| justfortest |
+-------------+
{code}
> Add support for arrays and scalars as first level elements in JSON files > > > Key: DRILL-1755 > URL: https://issues.apache.org/jira/browse/DRILL-1755 > Project: Apache Drill > Issue Type: Improvement >Reporter: Abhishek Girish >Priority: Major > Fix For: Future > > Attachments: drillbit.log > > > Publicly available JSON data sometimes has the following structure (arrays > as first level elements): > [ > {"accessLevel": "public" > }, > {"accessLevel": "private" > } > ] > Drill currently does not support arrays or scalars as first level elements; > only maps are supported. We should add support for arrays and scalars. > Log attached. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (DRILL-1755) Add support for arrays and scalars as first level elements in JSON files
[ https://issues.apache.org/jira/browse/DRILL-1755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17066614#comment-17066614 ] benj commented on DRILL-1755: - The problem appears to have been corrected in 1.17. The error previously shown for 1.16 (see above) no longer appears in 1.17 (see below):
{code:sql}
drill-embedded 1.17> SELECT 'justfortest' AS mytext FROM dfs.tmp.`tmp.json`;
+-------------+
|   mytext    |
+-------------+
| justfortest |
| justfortest |
+-------------+
{code}
> Add support for arrays and scalars as first level elements in JSON files > > > Key: DRILL-1755 > URL: https://issues.apache.org/jira/browse/DRILL-1755 > Project: Apache Drill > Issue Type: Improvement >Reporter: Abhishek Girish >Priority: Major > Fix For: Future > > Attachments: drillbit.log > > > Publicly available JSON data sometimes has the following structure (arrays > as first level elements): > [ > {"accessLevel": "public" > }, > {"accessLevel": "private" > } > ] > Drill currently does not support arrays or scalars as first level elements; > only maps are supported. We should add support for arrays and scalars. > Log attached. -- This message was sent by Atlassian Jira (v8.3.4#803005)
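For reference, the contents of the JSON test file queried above are not shown in the comment; for this issue it would need an array as the first-level element, e.g. a minimal hypothetical two-record file:
{noformat}
[
  {"accessLevel": "public"},
  {"accessLevel": "private"}
]
{noformat}
A constant select over such a file returning two rows, as above, indicates the top-level array is now read as two records.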
[jira] [Updated] (DRILL-7602) Possibility to force repartition on read/select
[ https://issues.apache.org/jira/browse/DRILL-7602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] benj updated DRILL-7602: Description: It would be nice and useful in certain situations to have the capacity to repartition like in Spark ([https://spark.apache.org/docs/latest/rdd-programming-guide.html]): either automatic repartitioning within certain limits, or the possibility to indicate the desired repartitioning, or both options. The only way (that I know of now) to do that with Drill is to change _store.parquet.block-size_ and regenerate the input file, then run the request (but it would be nice to have the ability to do that on read). Illustration: with 2 Parquet files, _file1_ of 50 MB (1 million rows) and _file2_ of 1 MB (5000 rows) {code:sql} CREATE TABLE dfs.test.`result_from_1_parquet` AS (SELECT * FROM dfs.test.`file2` INNER JOIN dfs.test.`file1` ON ...) => ~ 50min -- Today we have to change the parquet block size to force multiple parquet files ALTER SESSION SET `store.parquet.block-size` = 1048576; -- Repartition data CREATE TABLE dfs.test.`file1_bis` AS (SELECT * FROM dfs.test.`file1`); -- then launch the request CREATE TABLE dfs.test.`result_from_1_parquet` AS (SELECT * FROM dfs.test.`file2` INNER JOIN dfs.test.`file1_bis` ON ...) => ~ 1min {code} So it's possible to save a lot of time (depending on the cluster configuration) simply by forcing more input files. It would be useful not to have to regenerate the files with the ideal fragmentation before the query.
This situation easily appears when doing an inequality JOIN (for example, to look up an IP in an IP range) on a not-so-big dataset: {code:java} ALTER SESSION SET `planner.enable_nljoin_for_scalar_only` = false; SELECT * FROM dfs.test.`a_pqt` AS a INNER JOIN dfs.test.`b_pqt` AS b ON inet_aton(b.ip) >= inet_aton(a.ip_first) AND inet_aton(b.ip) <= inet_aton(a.ip_last); {code} was: It will be nice and usefull ion certain situations to have the capacity to do repartition like in spark (https://spark.apache.org/docs/latest/rdd-programming-guide.html) either an automatically repartition in certain limit or possibility to indicate the desired repartition or both options. The only way (that I know now) to do that with Drill is to change _store.parquet.block-size_ and regenerate the input file then do the request (but it will be nice to have the ability to do that on read) illustration : with 2 Parquets files _file1_ of 50Mo (1 million rows) and _file2_ of 1Mo (5000 rows) {code:sql} CREATE TABLE dfs.test.`result_from_1_parquet` AS (SELECT * FROM dfs.test.`file2` INNER JOIN dfs.test.`file1` ON ...) => ~ 50min -- Tody we have to change the parquet block size to force multiple parquet files ALTER SESSION SET `store.parquet.block-size` = 1048576; -- Repartition data CREATE TABLE dfs.test.`file1_bis` AS (SELECT * FROM dfs.test.`file1`); -- then Launch the request CREATE TABLE dfs.test.`result_from_1_parquet` AS (SELECT * FROM dfs.test.`file2` INNER JOIN dfs.test.`file1_bis` ON ...) => ~ 1min {code} So it's possible to save a lot of time (depending of configuration of cluster) by simply forcing more input file. it would be useful not to have to regenerate the files with the ideal fragmentation before request.
> Possibility to force repartition on read/select > --- > > Key: DRILL-7602 > URL: https://issues.apache.org/jira/browse/DRILL-7602 > Project: Apache Drill > Issue Type: Improvement > Components: Execution - Flow, Query Planning Optimization >Affects Versions: 1.17.0 >Reporter: benj >Priority: Major > > It will be nice and usefull in certain situations to have the capacity to do > repartition like in spark > ([https://spark.apache.org/docs/latest/rdd-programming-guide.html]) > either an automatically repartition in certain limit or possibility to > indicate the desired repartition or both options. > The only way (that I know now) to do that with Drill is to change > _store.parquet.block-size_ and regenerate the input file then do the request > (but it will be nice to have the ability to do that on read) > illustration : with 2 Parquets files _file1_ of 50Mo (1 million rows) and > _file2_ of 1Mo (5000 rows) > {code:sql} > CREATE TABLE dfs.test.`result_from_1_parquet` AS > (SELECT * FROM dfs.test.`file2` INNER JOIN dfs.test.`file1` ON ...) > => ~ 50min > -- Today we have to change the parquet block size to force multiple parquet > files > ALTER SESSION SET `store.parquet.block-size` = 1048576; > -- Repartition data > CREATE TABLE dfs.test.`file1_bis` AS (SELECT * FROM dfs.test.`file1`); > -- then Launch the request > CREATE TABLE dfs.test.`result_from_1_parquet` AS > (SELECT * FROM dfs.test.`file2` INNER JOIN dfs.test.`file1_bis` ON ...) > => ~ 1min > {code} > So it's possible to save a lot of time (depending of configuration of >
[jira] [Comment Edited] (DRILL-7371) DST/UTC cast/to_timestamp problem
[ https://issues.apache.org/jira/browse/DRILL-7371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16927527#comment-16927527 ] benj edited comment on DRILL-7371 at 3/12/20, 4:19 PM: --- [~volodymyr], the problem occurs on all Daylight Saving Time dates for Europe/Paris. From investigation after your message, it appears that the problem shows up in console mode (/bin/sqlline -u jdbc:drill:zk=...:2181;schema=myhdfs). The result is also wrong when executed via Zeppelin (JDBC too). But there is no problem with the request launched directly in the Apache Drill web interface ([http://...:8047/query]) and no problem with the request in drill-embedded. So it seems that the problem probably comes from JDBC. was (Author: benj641): [~volodymyr], the problem occurs on all Daylight Saving Time dates for Europe/Paris. From investigation after your message, it appears that the problem shows up in console mode (/bin/sqlline -u jdbc:drill:zk=...:2181;schema=myhdfs). The result is also wrong when executed via Zeppelin (JDBC too). But there is no problem with the request launched directly in the Apache Drill web interface ([http://...:8047/query).] So it seems that the problem probably comes from JDBC. > DST/UTC cast/to_timestamp problem > - > > Key: DRILL-7371 > URL: https://issues.apache.org/jira/browse/DRILL-7371 > Project: Apache Drill > Issue Type: Bug > Components: Client - JDBC, Functions - Drill >Affects Versions: 1.16.0 >Reporter: benj >Priority: Major > > With LC_TIME=fr_FR.UTF-8 and +drillbits configured in UTC+ (as specified in > [http://www.openkb.info/2015/05/understanding-drills-timestamp-and.html#.VUzhotpVhHw] > found via [https://drill.apache.org/docs/data-type-conversion/#to_timestamp])
> {code:sql}
> SELECT TIMEOFDAY();
> +-----------------------------+
> |           EXPR$0            |
> +-----------------------------+
> | 2019-09-11 08:20:12.247 UTC |
> +-----------------------------+
> {code}
> Problems appear when _cast/to_timestamp_ handles dates related to the DST (Daylight Saving Time) switch of some countries.
> To illustrate, all the next requests give the same +wrong+ results:
> {code:sql}
> SELECT to_timestamp('2018-03-25 02:22:40 UTC','yyyy-MM-dd HH:mm:ss z');
> SELECT to_timestamp('2018-03-25 02:22:40','yyyy-MM-dd HH:mm:ss');
> SELECT cast('2018-03-25 02:22:40' as timestamp);
> SELECT cast('2018-03-25 02:22:40 +' as timestamp);
> +-----------------------+
> |        EXPR$0         |
> +-----------------------+
> | 2018-03-25 03:22:40.0 |
> +-----------------------+
> {code}
> while the result should be "2018-03-25 +02+:22:40.0"
> A UTC date and time in a string shouldn't change when cast to a UTC timestamp.
> To illustrate, the next requests produce +good+ results:
> {code:java}
> SELECT to_timestamp('2018-03-26 02:22:40 UTC','yyyy-MM-dd HH:mm:ss z');
> +-----------------------+
> |        EXPR$0         |
> +-----------------------+
> | 2018-03-26 02:22:40.0 |
> +-----------------------+
> SELECT CAST('2018-03-24 02:22:40' AS timestamp);
> +-----------------------+
> |        EXPR$0         |
> +-----------------------+
> | 2018-03-24 02:22:40.0 |
> +-----------------------+
> {code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
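Independently of the JDBC-side investigation, a commonly suggested mitigation for DST-related timestamp shifts (described in the openkb article linked in the issue) is to force the drillbit JVMs to UTC; a configuration sketch for conf/drill-env.sh:
{noformat}
# conf/drill-env.sh -- run the drillbit JVM in UTC so string<->timestamp
# conversions are not shifted by the server's local DST rules
export DRILL_JAVA_OPTS="$DRILL_JAVA_OPTS -Duser.timezone=UTC"
{noformat}
This changes the server-side timezone only; a JDBC client JVM may need the same -Duser.timezone=UTC flag.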
[jira] [Created] (DRILL-7602) Possibility to force repartition on read/select
benj created DRILL-7602: --- Summary: Possibility to force repartition on read/select Key: DRILL-7602 URL: https://issues.apache.org/jira/browse/DRILL-7602 Project: Apache Drill Issue Type: Improvement Components: Execution - Flow, Query Planning Optimization Affects Versions: 1.17.0 Reporter: benj It would be nice and useful in certain situations to have the capacity to repartition like in Spark (https://spark.apache.org/docs/latest/rdd-programming-guide.html): either automatic repartitioning within certain limits, or the possibility to indicate the desired repartitioning, or both options. The only way (that I know of now) to do that with Drill is to change _store.parquet.block-size_ and regenerate the input file, then run the request (but it would be nice to have the ability to do that on read). Illustration: with 2 Parquet files, _file1_ of 50 MB (1 million rows) and _file2_ of 1 MB (5000 rows) {code:sql} CREATE TABLE dfs.test.`result_from_1_parquet` AS (SELECT * FROM dfs.test.`file2` INNER JOIN dfs.test.`file1` ON ...) => ~ 50min -- Today we have to change the parquet block size to force multiple parquet files ALTER SESSION SET `store.parquet.block-size` = 1048576; -- Repartition data CREATE TABLE dfs.test.`file1_bis` AS (SELECT * FROM dfs.test.`file1`); -- then launch the request CREATE TABLE dfs.test.`result_from_1_parquet` AS (SELECT * FROM dfs.test.`file2` INNER JOIN dfs.test.`file1_bis` ON ...) => ~ 1min {code} So it's possible to save a lot of time (depending on the cluster configuration) simply by forcing more input files. It would be useful not to have to regenerate the files with the ideal fragmentation before the query. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (DRILL-7595) Change of data type from bigint to int when parquet with multiple fragment
[ https://issues.apache.org/jira/browse/DRILL-7595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] benj updated DRILL-7595: Description: like on DRILL-7104, there is a bug that change the type from BIGINT to INT where a parquet have multiple fragment With a file containing few row (all is fine (we store a BIGINT and really have a BIGINT in the Parquet) {code:sql} apache drill> CREATE TABLE dfs.tmp.`out_pqt` AS (SELECT CAST(0 as BIGINT) AS d FROM dfs.tmp.`fewrowfile`; +--+---+ | Fragment | Number of records written | +--+---+ | 1_0 | 1500 | +--+---+ apache drill> SELECT typeof(d) FROM dfs.tmp.`out_pqt`; ++ | EXPR$0 | ++ | BIGINT | ++ {code} With a file containing "enough" row (there is a problem (we store a BIGINT but we unfortunatly have an INT in the Parquet) {code:sql} apache drill> CREATE TABLE dfs.tmp.`out_pqt` AS (SELECT CAST(0 as BIGINT) AS d FROM dfs.tmp.`manyrowfile`; +--+---+ | Fragment | Number of records written | +--+---+ | 1_1 | 934111| | 1_0 | 1488743 | +--+---+ apache drill> SELECT typeof(d) FROM dfs.tmp.`out_pqt`; ++ | EXPR$0 | ++ | INT| ++ {code} It's not really satisfactory but please note that there is a Trick to avoid this problem: using a CAST('0' AS BIGINT) instead of a CAST(0 AS BIGINT) {code:sql} apache drill> CREATE TABLE dfs.tmp.`out_pqt` AS (SELECT CAST('0' as BIGINT) AS d FROM dfs.tmp.`manyrowfile`; +--+---+ | Fragment | Number of records written | +--+---+ | 1_1 | 934111| | 1_0 | 1488743 | +--+---+ apache drill> SELECT typeof(d) FROM dfs.tmp.`out_pqt`; ++ | EXPR$0 | ++ | BIGINT | ++ {code} was: like on DRILL-7104, there is a bug that change the type from BIGINT to INT where a parquet have multiple fragment With a file containing few row (all is fine (we store a BIGINT and really have a BIGINT in the Parquet) {code:sql} apache drill> CREATE TABLE dfs.tmp.`out_pqt` AS (SELECT CAST(0 as BIGINT) AS d FROM dfs.tmp.`fewrowfile`; +--+---+ | Fragment | Number of records written | +--+---+ | 1_0 | 1500 | +--+---+ apache 
drill> SELECT typeof(d) FROM dfs.tmp.`out_pqt`; ++ | EXPR$0 | ++ | BIGINT | ++ {code} With a file containing "enough" row (there is a problem (we store a BIGINT but we unfortunatly have an INT in the Parquet) {code:sql} apache drill> CREATE TABLE dfs.tmp.`out_pqt` AS (SELECT CAST(0 as BIGINT) AS d FROM dfs.tmp.`fewrowfile`; +--+---+ | Fragment | Number of records written | +--+---+ | 1_1 | 934111| | 1_0 | 1488743 | +--+---+ apache drill> SELECT typeof(d) FROM dfs.tmp.`out_pqt`; ++ | EXPR$0 | ++ | INT| ++ {code} It's not really satisfactory but please note that there is a Trick to avoid this problem: using a CAST('0' AS BIGINT) instead of a CAST(0 AS BIGINT) {code:sql} apache drill> CREATE TABLE dfs.tmp.`out_pqt` AS (SELECT CAST('0' as BIGINT) AS d FROM dfs.tmp.`fewrowfile`; +--+---+ | Fragment | Number of records written | +--+---+ | 1_1 | 934111| | 1_0 | 1488743 | +--+---+ apache drill> SELECT typeof(d) FROM dfs.tmp.`out_pqt`; ++ | EXPR$0 | ++ | BIGINT | ++ {code} > Change of data type from bigint to int when parquet with multiple fragment > -- > > Key: DRILL-7595 > URL: https://issues.apache.org/jira/browse/DRILL-7595 > Project: Apache Drill > Issue Type: Bug > Components: Storage - Parquet >Affects Versions: 1.17.0 >Reporter: benj >Priority: Major > > like on DRILL-7104, there is a bug that change the type from BIGINT to INT > where a parquet have multiple fragment > With a file containing few row (all is fine (we store a BIGINT and really > have a BIGINT in the Parquet) > {code:sql} > apache drill> CREATE TABLE dfs.tmp.`out_pqt` AS (SELECT CAST(0 as BIGINT) AS > d FROM dfs.tmp.`fewrowfile`; > +--+---+ > | Fragment | Number of records written | > +--+---+ > | 1_0 | 1500
[jira] [Commented] (DRILL-7595) Change of data type from bigint to int when parquet with multiple fragment
[ https://issues.apache.org/jira/browse/DRILL-7595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17040919#comment-17040919 ] benj commented on DRILL-7595: - Another trick to avoid the problem (subtract two equal values that are bigger than an INTEGER):
{code:sql}
apache drill> CREATE TABLE dfs.tmp.`out_pqt` AS (SELECT CAST(2147483648 - 2147483648 as BIGINT) AS d FROM dfs.tmp.`manyrowfile`);
+----------+---------------------------+
| Fragment | Number of records written |
+----------+---------------------------+
| 1_1      | 934111                    |
| 1_0      | 1488743                   |
+----------+---------------------------+
apache drill> SELECT typeof(d) FROM dfs.tmp.`out_pqt`;
+--------+
| EXPR$0 |
+--------+
| BIGINT |
+--------+
{code}
> Change of data type from bigint to int when parquet with multiple fragment > -- > > Key: DRILL-7595 > URL: https://issues.apache.org/jira/browse/DRILL-7595 > Project: Apache Drill > Issue Type: Bug > Components: Storage - Parquet >Affects Versions: 1.17.0 >Reporter: benj >Priority: Major > > Like in DRILL-7104, there is a bug that changes the type from BIGINT to INT > when a parquet has multiple fragments. > With a file containing few rows, all is fine (we store a BIGINT and really > have a BIGINT in the Parquet):
> {code:sql}
> apache drill> CREATE TABLE dfs.tmp.`out_pqt` AS (SELECT CAST(0 as BIGINT) AS d FROM dfs.tmp.`fewrowfile`);
> +----------+---------------------------+
> | Fragment | Number of records written |
> +----------+---------------------------+
> | 1_0      | 1500                      |
> +----------+---------------------------+
> apache drill> SELECT typeof(d) FROM dfs.tmp.`out_pqt`;
> +--------+
> | EXPR$0 |
> +--------+
> | BIGINT |
> +--------+
> {code}
> With a file containing "enough" rows, there is a problem (we store a BIGINT > but unfortunately get an INT in the Parquet):
> {code:sql}
> apache drill> CREATE TABLE dfs.tmp.`out_pqt` AS (SELECT CAST(0 as BIGINT) AS d FROM dfs.tmp.`manyrowfile`);
> +----------+---------------------------+
> | Fragment | Number of records written |
> +----------+---------------------------+
> | 1_1      | 934111                    |
> | 1_0      | 1488743                   |
> +----------+---------------------------+
> apache drill> SELECT typeof(d) FROM dfs.tmp.`out_pqt`;
> +--------+
> | EXPR$0 |
> +--------+
> | INT    |
> +--------+
> {code}
> > It's not really satisfactory, but please note that there is a trick to avoid > this problem: use CAST('0' AS BIGINT) instead of CAST(0 AS BIGINT)
> {code:sql}
> apache drill> CREATE TABLE dfs.tmp.`out_pqt` AS (SELECT CAST('0' as BIGINT) AS d FROM dfs.tmp.`manyrowfile`);
> +----------+---------------------------+
> | Fragment | Number of records written |
> +----------+---------------------------+
> | 1_1      | 934111                    |
> | 1_0      | 1488743                   |
> +----------+---------------------------+
> apache drill> SELECT typeof(d) FROM dfs.tmp.`out_pqt`;
> +--------+
> | EXPR$0 |
> +--------+
> | BIGINT |
> +--------+
> {code}
> -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (DRILL-7595) Change of data type from bigint to int when parquet with multiple fragment
benj created DRILL-7595: --- Summary: Change of data type from bigint to int when parquet with multiple fragment Key: DRILL-7595 URL: https://issues.apache.org/jira/browse/DRILL-7595 Project: Apache Drill Issue Type: Bug Components: Storage - Parquet Affects Versions: 1.17.0 Reporter: benj Like in DRILL-7104, there is a bug that changes the type from BIGINT to INT when a parquet has multiple fragments. With a file containing few rows, all is fine (we store a BIGINT and really have a BIGINT in the Parquet):
{code:sql}
apache drill> CREATE TABLE dfs.tmp.`out_pqt` AS (SELECT CAST(0 as BIGINT) AS d FROM dfs.tmp.`fewrowfile`);
+----------+---------------------------+
| Fragment | Number of records written |
+----------+---------------------------+
| 1_0      | 1500                      |
+----------+---------------------------+
apache drill> SELECT typeof(d) FROM dfs.tmp.`out_pqt`;
+--------+
| EXPR$0 |
+--------+
| BIGINT |
+--------+
{code}
With a file containing "enough" rows, there is a problem (we store a BIGINT but unfortunately get an INT in the Parquet):
{code:sql}
apache drill> CREATE TABLE dfs.tmp.`out_pqt` AS (SELECT CAST(0 as BIGINT) AS d FROM dfs.tmp.`manyrowfile`);
+----------+---------------------------+
| Fragment | Number of records written |
+----------+---------------------------+
| 1_1      | 934111                    |
| 1_0      | 1488743                   |
+----------+---------------------------+
apache drill> SELECT typeof(d) FROM dfs.tmp.`out_pqt`;
+--------+
| EXPR$0 |
+--------+
| INT    |
+--------+
{code}
It's not really satisfactory, but please note that there is a trick to avoid this problem: use CAST('0' AS BIGINT) instead of CAST(0 AS BIGINT)
{code:sql}
apache drill> CREATE TABLE dfs.tmp.`out_pqt` AS (SELECT CAST('0' as BIGINT) AS d FROM dfs.tmp.`manyrowfile`);
+----------+---------------------------+
| Fragment | Number of records written |
+----------+---------------------------+
| 1_1      | 934111                    |
| 1_0      | 1488743                   |
+----------+---------------------------+
apache drill> SELECT typeof(d) FROM dfs.tmp.`out_pqt`;
+--------+
| EXPR$0 |
+--------+
| BIGINT |
+--------+
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (DRILL-7588) Function TABLE + option lineDelimiter = '\r\n' eats sometime first char of a row
[ https://issues.apache.org/jira/browse/DRILL-7588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17040793#comment-17040793 ] benj commented on DRILL-7588: - As another possible solution: in the case of a Windows file with \r\n EOL, it is possible to use '\n' as the line delimiter to avoid the problem described above. In this case the last field will have a \r included at the end, but if we know which field is last it does not matter, because it's possible to do a REGEXP_REPLACE(last_field,'\r$',''). Still, it's not really satisfactory. > Function TABLE + option lineDelimiter = '\r\n' eats sometime first char of a > row > > > Key: DRILL-7588 > URL: https://issues.apache.org/jira/browse/DRILL-7588 > Project: Apache Drill > Issue Type: Bug > Components: Functions - Drill >Affects Versions: 1.17.0 >Reporter: benj >Priority: Major > Attachments: demo.tsv.gz, drill_json_profile_tsv.log, drill_tsv.log > > > With a TSV file ([#demo.tsv.gz] in attachment) generated on _Windows_ (EOL = > \r\n). > The file contains some special chars like > {noformat} > http://bouzbal-fans.blogspot.com/search/label/Ã\230£Ã\230®Ã\230¨Ã\230§Ã\230± > Ã\230¨Ã\231Ë\206Ã\230²Ã\230¨Ã\230§Ã\231â\200\236 > {noformat} > The next request sometimes eats the first char of a line > {code:sql} > --CREATE TABLE dfs.test.`result_pqt` AS ( > SELECT > columns[0] as d > ,CAST(to_timestamp(columns[0],'MM/dd/yy HH:mm:ss a') AS TIMESTAMP) > FROM TABLE(dfs.test.`demo.tsv` (type => 'text', extractHeader => false, > fieldDelimiter => '\t', lineDelimiter => '\r\n')) > --) > java.sql.SQLException: SYSTEM ERROR: IllegalArgumentException: Invalid > format: "/19/2015 9:33:39 AM" > {code} > The string "^/19/2015 9:33:39 AM" doesn't exist. The month is already present > in this field in the TSV (so here there is "3/19/2015 9:33:39 AM" in the file > demo.tsv). > If '\r\n' is replaced by '\n' with _sed_ before the request, the result is > correct with *lineDelimiter => '\r\n'* as well as with *lineDelimiter => '\n'* > or without the TABLE function (there is no error and the date is correctly > converted with the to_timestamp function / column d is correct in result_pqt). > Keeping '\r\n', moving the line that produces the error elsewhere in demo.tsv > can prevent the error (why?). > Keeping '\r\n', removing/modifying one or more special chars (like in > "thá»\235i trang jean") can prevent the error (why?). > I didn't manage to reduce the file demo.tsv further while keeping the problem. -- This message was sent by Atlassian Jira (v8.3.4#803005)
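The workaround described in the comment above can be sketched as follows (hypothetical column layout: columns[2] stands for whatever the last field of the file is):
{code:sql}
-- Read with '\n' so no line start is eaten, then strip the trailing \r
-- that stays glued to the last field of each record
SELECT columns[0], columns[1],
       REGEXP_REPLACE(columns[2], '\r$', '') AS last_field
FROM TABLE(dfs.test.`demo.tsv` (type => 'text', extractHeader => false,
     fieldDelimiter => '\t', lineDelimiter => '\n'));
{code}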
[jira] [Commented] (DRILL-6096) Provide mechanisms to specify field delimiters and quoted text for TextRecordWriter
[ https://issues.apache.org/jira/browse/DRILL-6096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17038357#comment-17038357 ] benj commented on DRILL-6096: - Just trying to use this new functionality. Some points (tested in 1.17 and last 1.18 @ 2020-02-17) : * Should at least add _"write_text"_ in description of allowed values for option _store.format_ * Why _write_text_ doesn't appears in default storage configuration ? * Try to create write_text or equivalent in storage configuration but use of _"fieldDelimiter"_ produce _"Please retry: Error (invalid JSON mapping)"_ - need a new ticket ? > Provide mechanisms to specify field delimiters and quoted text for > TextRecordWriter > --- > > Key: DRILL-6096 > URL: https://issues.apache.org/jira/browse/DRILL-6096 > Project: Apache Drill > Issue Type: Improvement > Components: Storage - Text CSV >Affects Versions: 1.12.0 >Reporter: Kunal Khatua >Assignee: Arina Ielchiieva >Priority: Major > Labels: doc-impacting, ready-to-commit > Fix For: 1.17.0 > > > Currently, there is no way for a user to specify the field delimiter for the > writing records as a text output. Further more, if the fields contain the > delimiter, we have no mechanism of specifying quotes. > By default, quotes should be used to enclose non-numeric fields being written. > *Description of the implemented changes:* > 2 options are added to control text writer output: > {{store.text.writer.add_header}} - indicates if header should be added in > created text file. Default is true. > {{store.text.writer.force_quotes}} - indicates if all value should be quoted. > Default is false. It means only values that contain special characters (line > / field separators) will be quoted. > Line / field separators, quote / escape characters can be configured using > text format configuration using Web UI. User can create special format only > for writing data and then use it when creating files. 
Though such format can > be always used to read back written data. > {noformat} > "formats": { > "write_text": { > "type": "text", > "extensions": [ > "txt" > ], > "lineDelimiter": "\n", > "fieldDelimiter": "!", > "quote": "^", > "escape": "^", > } >}, > ... > {noformat} > Next set specified format and create text file: > {noformat} > alter session set `store.format` = 'write_text'; > create table dfs.tmp.t as select 1 as id from (values(1)); > {noformat} > Notes: > 1. To write data univocity-parsers are used, they limit line separator length > to not more than 2 characters, though Drill allows setting more 2 chars as > line separator since Drill can read data splitting by line separator of any > length, during data write exception will be thrown. > 2. {{extractHeader}} in text format configuration does not affect if header > will be written to text file, only {{store.text.writer.add_header}} controls > this action. {{extractHeader}} is used only when reading the data. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (DRILL-7588) Function TABLE + option lineDelimiter = '\r\n' eats sometime first char of a row
[ https://issues.apache.org/jira/browse/DRILL-7588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] benj updated DRILL-7588: Description: With a TSV file ([#demo.tsv.gz] in attachment) generated on _Windows_ (EOL = \r\n). The file contains some special char like {noformat} http://bouzbal-fans.blogspot.com/search/label/Ã\230£Ã\230®Ã\230¨Ã\230§Ã\230± Ã\230¨Ã\231Ë\206Ã\230²Ã\230¨Ã\230§Ã\231â\200\236 {noformat} The next request sometimes eat the first char of a line {code:sql} --CREATE TABLE dfs.test.`result_pqt` AS ( SELECT columns[0] as d ,CAST(to_timestamp(columns[0],'MM/dd/yy HH:mm:ss a') AS TIMESTAMP) FROM TABLE(dfs.test.`demo.tsv` (type => 'text', extractHeader => false, fieldDelimiter => '\t', lineDelimiter => '\r\n')) --) java.sql.SQLException: SYSTEM ERROR: IllegalArgumentException: Invalid format: "/19/2015 9:33:39 AM" {code} The string "^/19/2015 9:33:39 AM" doesn't exists. Month is already present in this field in the TSV (so here there is "3/19/2015 9:33:39 AM" in the file demo.tsv). If '\r\n' are replaced by '\n' with _sed_ before the request, the result is correct as well with *lineDelimiter => '\r\n'* as *lineDelimiter => '\n'* or without function TABLE (there is no error and the date is correctly converted with to_timestamp function / columns d is correct in the result_pqt) keeping '\r\n' and trying to move (in another line in demo.tsv) the line that produce error can prevent error (why ?) keeping '\r\n' and trying to remove/modify one or more special char (like in "thá»\235i trang jean") can prevent error (why ?) Didn't manage to reduce more the file demo.tsv while keeping the problem. Environment: (was: With a TSV file ([#demo.tsv.gz] in attachment) generated on _Windows_ (EOL = \r\n). 
The file contains some special char like {noformat} http://bouzbal-fans.blogspot.com/search/label/Ã\230£Ã\230®Ã\230¨Ã\230§Ã\230± Ã\230¨Ã\231Ë\206Ã\230²Ã\230¨Ã\230§Ã\231â\200\236 {noformat} The next request sometimes eat the first char of a line {code:sql} --CREATE TABLE dfs.test.`result_pqt` AS ( SELECT columns[0] as d ,CAST(to_timestamp(columns[0],'MM/dd/yy HH:mm:ss a') AS TIMESTAMP) FROM TABLE(dfs.test.`demo.tsv` (type => 'text', extractHeader => false, fieldDelimiter => '\t', lineDelimiter => '\r\n')) --) java.sql.SQLException: SYSTEM ERROR: IllegalArgumentException: Invalid format: "/19/2015 9:33:39 AM" {code} The string "^/19/2015 9:33:39 AM" doesn't exists. Month is already present in this field in the TSV (so here there is "3/19/2015 9:33:39 AM" in the file demo.tsv). If '\r\n' are replaced by '\n' with _sed_ before the request, the result is correct as well with *lineDelimiter => '\r\n'* as *lineDelimiter => '\n'* or without function TABLE (there is no error and the date is correctly converted with to_timestamp function / columns d is correct in the result_pqt) keeping '\r\n' and trying to move (in another line in demo.tsv) the line that produce error can prevent error (why ?) keeping '\r\n' and trying to remove/modify one or more special char (like in "thá»\235i trang jean") can prevent error (why ?) Didn't manage to reduce more the file demo.tsv while keeping the problem. ) > Function TABLE + option lineDelimiter = '\r\n' eats sometime first char of a > row > > > Key: DRILL-7588 > URL: https://issues.apache.org/jira/browse/DRILL-7588 > Project: Apache Drill > Issue Type: Bug > Components: Functions - Drill >Affects Versions: 1.17.0 >Reporter: benj >Priority: Major > Attachments: demo.tsv.gz, drill_json_profile_tsv.log, drill_tsv.log > > > With a TSV file ([#demo.tsv.gz] in attachment) generated on _Windows_ (EOL = > \r\n). 
> The file contains some special char like > {noformat} > http://bouzbal-fans.blogspot.com/search/label/Ã\230£Ã\230®Ã\230¨Ã\230§Ã\230± > Ã\230¨Ã\231Ë\206Ã\230²Ã\230¨Ã\230§Ã\231â\200\236 > {noformat} > The next request sometimes eat the first char of a line > {code:sql} > --CREATE TABLE dfs.test.`result_pqt` AS ( > SELECT > columns[0] as d > ,CAST(to_timestamp(columns[0],'MM/dd/yy HH:mm:ss a') AS TIMESTAMP) > FROM TABLE(dfs.test.`demo.tsv` (type => 'text', extractHeader => false, > fieldDelimiter => '\t', lineDelimiter => '\r\n')) > --) > java.sql.SQLException: SYSTEM ERROR: IllegalArgumentException: Invalid > format: "/19/2015 9:33:39 AM" > {code} > The string "^/19/2015 9:33:39 AM" doesn't exists. Month is already present in > this field in the TSV (so here there is "3/19/2015 9:33:39 AM" in the file > demo.tsv). > If '\r\n' are replaced by '\n' with _sed_ before the request, the result is > correct as well with *lineDelimiter => '\r\n'* as *lineDelimiter => '\n'* or > without function TABLE (there is no error and the date is correctly converted >
[jira] [Created] (DRILL-7588) Function TABLE + option lineDelimiter = '\r\n' eats sometime first char of a row
benj created DRILL-7588: --- Summary: Function TABLE + option lineDelimiter = '\r\n' eats sometime first char of a row Key: DRILL-7588 URL: https://issues.apache.org/jira/browse/DRILL-7588 Project: Apache Drill Issue Type: Bug Components: Functions - Drill Affects Versions: 1.17.0 Environment: With a TSV file ([#demo.tsv.gz] in attachment) generated on _Windows_ (EOL = \r\n). The file contains some special chars like {noformat} http://bouzbal-fans.blogspot.com/search/label/Ã\230£Ã\230®Ã\230¨Ã\230§Ã\230± Ã\230¨Ã\231Ë\206Ã\230²Ã\230¨Ã\230§Ã\231â\200\236 {noformat} The next request sometimes eats the first char of a line {code:sql} --CREATE TABLE dfs.test.`result_pqt` AS ( SELECT columns[0] as d ,CAST(to_timestamp(columns[0],'MM/dd/yy HH:mm:ss a') AS TIMESTAMP) FROM TABLE(dfs.test.`demo.tsv` (type => 'text', extractHeader => false, fieldDelimiter => '\t', lineDelimiter => '\r\n')) --) java.sql.SQLException: SYSTEM ERROR: IllegalArgumentException: Invalid format: "/19/2015 9:33:39 AM" {code} The string "^/19/2015 9:33:39 AM" doesn't exist. The month is already present in this field in the TSV (so here there is "3/19/2015 9:33:39 AM" in the file demo.tsv). If '\r\n' is replaced by '\n' with _sed_ before the request, the result is correct with *lineDelimiter => '\r\n'* as well as with *lineDelimiter => '\n'* or without the TABLE function (there is no error and the date is correctly converted with the to_timestamp function / column d is correct in result_pqt). Keeping '\r\n', moving the line that produces the error elsewhere in demo.tsv can prevent the error (why?). Keeping '\r\n', removing/modifying one or more special chars (like in "thá»\235i trang jean") can prevent the error (why?). I didn't manage to reduce the file demo.tsv further while keeping the problem. Reporter: benj Attachments: demo.tsv.gz, drill_json_profile_tsv.log, drill_tsv.log -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (DRILL-7569) dir0 problem reader - when path with wilcard and column named dir0
benj created DRILL-7569:
---
Summary: dir0 problem reader - when path with wilcard and column named dir0
Key: DRILL-7569
URL: https://issues.apache.org/jira/browse/DRILL-7569
Project: Apache Drill
Issue Type: Bug
Affects Versions: 1.17.0
Reporter: benj

If a file with named columns (like csvh, parquet, json) contains a column named *dir0* (or any dir[0-9]+), it can cause problems when querying with a wildcard in the path.
{code:sql}
apache drill> SELECT * FROM dfs.tmp.`REP/exa.csvh`;
+---------+------+
| dir0    | a    |
+---------+------+
| coldir0 | cola |
+---------+------+
apache drill> SELECT * FROM dfs.tmp.`R*/exa.csvh`;
Error: INTERNAL_ERROR ERROR: Failure while setting up text reader for file file:/tmp/REP/exa.csvh
{code}
The error messages differ depending on the input file type:
{noformat}
CSVH    => Error: INTERNAL_ERROR ERROR: Failure while setting up text reader for file file:...
PARQUET => Error: INTERNAL_ERROR ERROR: Error in parquet record reader. Message: Failure in setting up reader Parquet Metadata:...
JSON    => Error: INTERNAL_ERROR ERROR: org.apache.drill.exec.exception.SchemaChangeException: It's not allowed to have regular field and implicit field share common name dir0. Either change regular field name in datasource, or change the default implicit field names.
{noformat}
Note that the JSON error message is the most relevant and allows faster identification of the problem (even if (to my knowledge) dir* is not modifiable among the default implicit field names).
I know you should avoid using dir0 as a column name. But when creating a table it is "easy" to use a "SELECT *" which will include dir0 (and the other dir*) if the path contains a wildcard.
I have no good idea how to solve this problem, but it would be interesting to find a method to avoid falling into this trap.
Maybe *dir** should not appear automatically with _SELECT *_ but should require an explicit call like _SELECT dir0, dir1, *_ (maybe directed by an option).
Maybe the error messages should be improved.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
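One way to sidestep the collision today could be to enumerate the directories instead of using a wildcard path, so the implicit dir0 column is never generated (untested sketch; `REP2` is a hypothetical second directory, not from the report):
{code:sql}
/* untested sketch: no wildcard in the path, so the data column dir0
   does not collide with Drill's implicit dir0; the directory name is
   carried explicitly instead */
SELECT 'REP'  AS src_dir, t.* FROM dfs.tmp.`REP/exa.csvh`  AS t
UNION ALL
SELECT 'REP2' AS src_dir, t.* FROM dfs.tmp.`REP2/exa.csvh` AS t;
{code}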
[jira] [Created] (DRILL-7568) Strange renaming of duplicate column name
benj created DRILL-7568:
---
Summary: Strange renaming of duplicate column name
Key: DRILL-7568
URL: https://issues.apache.org/jira/browse/DRILL-7568
Project: Apache Drill
Issue Type: Bug
Affects Versions: 1.17.0, 1.16.0, 1.15.0
Reporter: benj

Explicitly listed duplicate column names are automatically renamed by Drill:
{code:java}
apache drill> SELECT 1 a, 2 a, 3 a, 4 a, 5 a, 6 a;
+---+----+----+----+----+----+
| a | a0 | a1 | a2 | a3 | a4 |
+---+----+----+----+----+----+
| 1 | 2  | 3  | 4  | 5  | 6  |
+---+----+----+----+----+----+
{code}
That's OK, this rule seems "logical". BUT with a csvh containing columns a, b and c:
{code:java}
SELECT *, a, a, a, a FROM dfs.tmp.`example.csvh`;
+------+------+------+------+------+------+------+
| a    | b    | c    | a0   | a00  | a1   | a2   |
+------+------+------+------+------+------+------+
| cola | colb | colc | cola | cola | cola | cola |
+------+------+------+------+------+------+------+
{code}
the renaming rule is not applied in the same way. The first duplicate a is renamed *a0* as expected, but the second is renamed *a00* (instead of *a1*). Note that the third is renamed a1 (with an offset of 1 from the expected name), and so on.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
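A way to avoid depending on the automatic renaming at all (untested sketch, using the same example.csvh) is to alias each repeated column explicitly:
{code:sql}
/* untested sketch: with explicit aliases, the output column names no
   longer depend on Drill's duplicate-column renaming rule */
SELECT *, a AS a_1, a AS a_2, a AS a_3, a AS a_4
FROM dfs.tmp.`example.csvh`;
{code}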
[jira] [Comment Edited] (DRILL-7449) memory leak parse_url function
[ https://issues.apache.org/jira/browse/DRILL-7449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17014216#comment-17014216 ] benj edited comment on DRILL-7449 at 1/28/20 11:28 AM:
---
I did a full check and in reality we haven't used drill-url-tools, because it sometimes produces incorrect values on big datasets (due to a memory problem caught inside the UDF?).
EDIT 28/01/2020: Just found a bug in drill-url-tools and proposed a correction. It may correct the problem.
After some other tests, the standard Drill *parse_url* works well (no memory leak) +if the ORDER BY clause is removed+. And note that the memory leak can already appear with url_parse (from drill-url-tools) when using an ORDER BY clause.
The only code that does not cause any critical problem for our use is a regexp of the type:
{code:sql}
SELECT REGEXP_REPLACE(Activity,'^(?:.*:.*@)?([^:]*)(?::.*)?$','$1') As Host
FROM (SELECT REGEXP_REPLACE(NULLIF(Url, ''),'^(?:(?:[^:/?#]+):)?(?://([^/?#]*))(?:[^?#]*)?(?:.*)?','$1') AS Activity FROM ...)
{code}
I don't know why, but from observation the ORDER BY clause produces a number of errors in different contexts with complex requests, and it is sometimes necessary to split the request into 2 distinct requests (one for the SELECT with computations and one for the SELECT with ORDER BY). Note that with the regexp there is no error even with the ORDER BY clause.

was (Author: benj641): I have had a full check and in reality we havn't used the drill-url-tools because it sometimes produce incorrect values on big dataset (due to memory problem catch into UDF ?) . After some other tests, the standard Drill *parse_url* works well (no Memory leak) +if remove the ORDER BY clause+. And note that Memory leaked can already appears with url_parse (from drill-url-tools) if using ORDER BY clause produce already. 
The only code that does not cause any critical problem for our use is regexp of the type: {code:sql} SELECT REGEXP_REPLACE(Activity,'^(?:.*:.*@)?([^:]*)(?::.*)?$','$1') As Host FROM (SELECT REGEXP_REPLACE(NULLIF(Url, ''),'^(?:(?:[^:/?#]+):)?(?://([^/?#]*))(?:[^?#]*)?(?:.*)?','$1') AS Activity FROM ...) {code} Don't know why, but in terms of observation, ORDER BY clause produce number of error of different contexts with complex request and it's sometimes necessary to split the request into 2 distinct requests (one for the SELECT with computations and one for the SELECT with ORDER BY) Note that with the regexp there is no error even with ORDER BY clause. > memory leak parse_url function > -- > > Key: DRILL-7449 > URL: https://issues.apache.org/jira/browse/DRILL-7449 > Project: Apache Drill > Issue Type: Bug > Components: Functions - Drill >Affects Versions: 1.16.0 >Reporter: benj >Assignee: Igor Guzenko >Priority: Major > Attachments: embedded_FullJsonProfile.txt, embedded_sqlline.log.txt, > embedded_sqlline_with_enable_debug_logging.log.txt > > > Requests with *parse_url* works well when the number of treated rows is low > but produce memory leak when number of rows grows (~ between 500 000 and 1 > million) (and for certain number of row sometimes the request works and > sometimes it failed with memory leaks) > Extract from dataset tested: > {noformat} > {"Attributable":true,"Description":"Website has been identified as malicious > by > Bing","FirstReportedDateTime":"2018-03-12T18:49:38Z","IndicatorExpirationDateTime":"2018-04-11T23:33:13Z","IndicatorProvider":"Bing","IndicatorThreatType":"MaliciousUrl","IsPartnerShareable":true,"IsProductLicensed":true,"LastReportedDateTime":"2018-03-12T18:49:38Z","NetworkDestinationAsn":15169,"NetworkDestinationIPv4":"172.217.8.193","NetworkDestinationPort":80,"Tags":["us"],"ThreatDetectionProduct":"ES","TLPLevel":"Amber","Url":"http://pasuruanbloggers.blogspot.ru/2012/12/beginilah-cara-orang-jepang-berpacaran.html","Version":1.5} 
> {"Attributable":true,"Description":"Website has been identified as malicious > by > Bing","FirstReportedDateTime":"2018-03-12T18:14:51Z","IndicatorExpirationDateTime":"2018-04-11T23:33:13Z","IndicatorProvider":"Bing","IndicatorThreatType":"MaliciousUrl","IsPartnerShareable":true,"IsProductLicensed":true,"LastReportedDateTime":"2018-03-12T18:14:51Z","NetworkDestinationAsn":15169,"NetworkDestinationIPv4":"216.58.192.193","NetworkDestinationPort":80,"Tags":["us"],"ThreatDetectionProduct":"ES","TLPLevel":"Amber","Url":"http://pasuruanbloggers.blogspot.ru/2012/12/cara-membuat-widget-slideshow-postingan.html","Version":1.5} > {noformat} > Request tested: > {code:sql} > ALTER SESSION SET `store.format`='parquet'; > ALTER SESSION SET `store.parquet.use_new_reader` = true; > ALTER SESSION SET `store.parquet.compression` = 'snappy'; > ALTER SESSION SET `drill.exec.functions.cast_empty_string_to_null`= true;
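The split into 2 distinct requests mentioned in the comment above could be sketched as follows (untested; `stage_pqt` is a hypothetical intermediate table name, the other names come from the original request):
{code:sql}
/* untested sketch: the first request does the parse_url computation
   without any ORDER BY */
CREATE TABLE dfs.test.`stage_pqt` AS (
  SELECT R.parsed.host AS Domain
  FROM (SELECT parse_url(T.Url) AS parsed FROM dfs.test.`file.json` AS T) AS R
);
/* the second request only sorts the materialized result */
CREATE TABLE dfs.test.`output_pqt` AS (
  SELECT Domain FROM dfs.test.`stage_pqt` ORDER BY Domain
);
{code}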
[jira] [Commented] (DRILL-7449) memory leak parse_url function
[ https://issues.apache.org/jira/browse/DRILL-7449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17019540#comment-17019540 ] benj commented on DRILL-7449:
---
[~arina], I would like to, but it's not possible: it's not a problem of size but a regulatory content issue.
[jira] [Commented] (DRILL-7449) memory leak parse_url function
[ https://issues.apache.org/jira/browse/DRILL-7449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17019508#comment-17019508 ] benj commented on DRILL-7449: - Hi [~IhorHuzenko] I realized that the problem may from input passed to _parse_url_. With the strict repetition of 2 extracted from beginning I can't produce the problem. But I have isolated typical row (from big original data) that can produce the problem when they are many. Others example of possible rows: {noformat} {"Attributable":true,"Description":"Website has been identified as malicious by Bing","FirstReportedDateTime":"2018-03-12T17:40:01Z","IndicatorExpirationDateTime":"2018-04-11T23:39:23Z","IndicatorProvider":"Bing","IndicatorThreatType":"MaliciousUrl","IsPartnerShareable":true,"IsProductLicensed":true,"LastReportedDateTime":"2018-03-12T17:40:01Z","NetworkDestinationAsn":0,"NetworkDestinationIPv4":"255.255.255.255","NetworkDestinationPort":80,"Tags":["??"],"ThreatDetectionProduct":"ES","TLPLevel":"Amber","Url":"http://www.guruvittal.org/lzp/gets.php?hl=Ã%C2%83Â%C2%83Ã%C2%82Â%C2%83Ã%C2%83Â%C2%82Ã%C2%82Â%C2%83Ã%C2%83Â%C2%83Ã%C2%82Â%C2%82Ã%C2%83Â%C2%82Ã%C2%82Â%C2%98Ã%C2%83Â%C2%83Ã%C2%82Â%C2%83Ã%C2%83Â%C2%82Ã%C2%82Â%C2%82Ã%C2%83Â%C2%83Ã%C2%82Â%C2%82Ã%C2%83Â%C2%82Ã%C2%82¹Ã%C2%83Â%C2%83Ã%C2%82Â%C2%83Ã%C2%83Â%C2%82Ã%C2%82Â%C2%83Ã%C2%83Â%C2%83Ã%C2%82Â%C2%82Ã%C2%83Â%C2%82Ã%C2%82Â%C2%9AÃ%C2%83Â%C2%83Ã%C2%82Â%C2%83Ã%C2%83Â%C2%82Ã%C2%82Â%C2%82Ã%C2%83Â%C2%83Ã%C2%82Â%C2%82Ã%C2%83Â%C2%82Ã%C2%82©Ã%C2%83Â%C2%83Ã%C2%82Â%C2%83Ã%C2%83Â%C2%82Ã%C2%82Â%C2%83Ã%C2%83Â%C2%83Ã%C2%82Â%C2%82Ã%C2%83Â%C2%82Ã%C2%82Â%C2%98Ã%C2%83Â%C2%83Ã%C2%82Â%C2%83Ã%C2%83Â%C2%82Ã%C2%82Â%C2%82Ã%C2%83Â%C2%83Ã%C2%82Â%C2%82Ã%C2%83Â%C2%82Ã%C2%82³-Ã%C2%83Â%C2%83Ã%C2%82Â%C2%83Ã%C2%83Â%C2%82Ã%C2%82Â%C2%83Ã%C2%83Â%C2%83Ã%C2%82Â%C2%82Ã%C2%83Â%C2%82Ã%C2%82Â%C2%9AÃ%C2%83Â%C2%83Ã%C2%82Â%C2%83Ã%C2%83Â%C2%82Ã%C2%82Â%C2%82Ã%C2%83Â%C2%83Ã%C2%82Â%C2%82Ã%C2%83Â%C2%82Ã%C2%82©Ã%C2%83Â%C2%83Ã%C2%82Â%C2%83Ã%C2%83Â%C2%82Ã%C2%82Â%C2%83Ã%C2%83Â%C2%83
Ã%C2%82Â%C2%82Ã%C2%83Â%C2%82Ã%C2%82Â%C2%98Ã%C2%83Â%C2%83Ã%C2%82Â%C2%83Ã%C2%83Â%C2%82Ã%C2%82Â%C2%82Ã%C2%83Â%C2%83Ã%C2%82Â%C2%82Ã%C2%83Â%C2%82Ã%C2%82³-Ã%C2%83Â%C2%83Ã%C2%82Â%C2%83Ã%C2%83Â%C2%82Ã%C2%82Â%C2%83Ã%C2%83Â%C2%83Ã%C2%82Â%C2%82Ã%C2%83Â%C2%82Ã%C2%82Â%C2%98Ã%C2%83Â%C2%83Ã%C2%82Â%C2%83Ã%C2%83Â%C2%82Ã%C2%82Â%C2%82Ã%C2%83Â%C2%83Ã%C2%82Â%C2%82Ã%C2%83Â%C2%82Ã%C2%82²Ã%C2%83Â%C2%83Ã%C2%82Â%C2%83Ã%C2%83Â%C2%82Ã%C2%82Â%C2%83Ã%C2%83Â%C2%83Ã%C2%82Â%C2%82Ã%C2%83Â%C2%82Ã%C2%82Â%C2%99Ã%C2%83Â%C2%83Ã%C2%82Â%C2%83Ã%C2%83Â%C2%82Ã%C2%82Â%C2%82Ã%C2%83Â%C2%83Ã%C2%82Â%C2%82Ã%C2%83Â%C2%82Ã%C2%82Â%C2%86","Version":1.5} {"Attributable":true,"Description":"Website has been identified as malicious by Bing","FirstReportedDateTime":"2018-03-12T17:54:33Z","IndicatorExpirationDateTime":"2018-04-11T23:39:23Z","IndicatorProvider":"Bing","IndicatorThreatType":"MaliciousUrl","IsPartnerShareable":true,"IsProductLicensed":true,"LastReportedDateTime":"2018-03-12T17:54:33Z","NetworkDestinationAsn":0,"NetworkDestinationIPv4":"255.255.255.255","NetworkDestinationPort":80,"Tags":["??"],"ThreatDetectionProduct":"ES","TLPLevel":"Amber","Url":"http://www.guruvittal.org/lzp/gets.php?hl=Ã%C2%83Â%C2%83Ã%C2%82Â%C2%83Ã%C2%83Â%C2%82Ã%C2%82Â%C2%98Ã%C2%83Â%C2%83Ã%C2%82Â%C2%82Ã%C2%83Â%C2%82Ã%C2%82·Ã%C2%83Â%C2%83Ã%C2%82Â%C2%83Ã%C2%83Â%C2%82Ã%C2%82¯Ã%C2%83Â%C2%83Ã%C2%82Â%C2%82Ã%C2%83Â%C2%82Ã%C2%82¿Ã%C2%83Â%C2%83Ã%C2%82Â%C2%82Ã%C2%83Â%C2%82Ã%C2%82½?Ã%C2%83Â%C2%83Ã%C2%82Â%C2%83Ã%C2%83Â%C2%82Ã%C2%82Â%C2%98Ã%C2%83Â%C2%83Ã%C2%82Â%C2%82Ã%C2%83Â%C2%82Ã%C2%82²-Ã%C2%83Â%C2%83Ã%C2%82Â%C2%83Ã%C2%83Â%C2%82Ã%C2%82¯Ã%C2%83Â%C2%83Ã%C2%82Â%C2%82Ã%C2%83Â%C2%82Ã%C2%82¿Ã%C2%83Â%C2%83Ã%C2%82Â%C2%82Ã%C2%83Â%C2%82Ã%C2%82½?Ã%C2%83Â%C2%83Ã%C2%82Â%C2%83Ã%C2%83Â%C2%82Ã%C2%82Â%C2%98Ã%C2%83Â%C2%83Ã%C2%82Â%C2%82Ã%C2%83Â%C2%82Ã%C2%82¨Ã%C2%83Â%C2%83Ã%C2%82Â%C2%83Ã%C2%83Â%C2%82Ã%C2%82¯Ã%C2%83Â%C2%83Ã%C2%82Â%C2%82Ã%C2%83Â%C2%82Ã%C2%82¿Ã%C2%83Â%C2%83Ã%C2%82Â%C2%82Ã%C2%83Â%C2%82Ã%C2%82½?Ã%C2%83Â%C2%83Ã%C2%82Â%C2%83Ã%C2%83Â%C2%82Ã%C2%82Â%C2%98Ã%C2%83Â
%C2%83Ã%C2%82Â%C2%82Ã%C2%83Â%C2%82Ã%C2%82±Ã%C2%83Â%C2%83Ã%C2%82Â%C2%83Ã%C2%83Â%C2%82Ã%C2%82¯Ã%C2%83Â%C2%83Ã%C2%82Â%C2%82Ã%C2%83Â%C2%82Ã%C2%82¿Ã%C2%83Â%C2%83Ã%C2%82Â%C2%82Ã%C2%83Â%C2%82Ã%C2%82½?-Ã%C2%83Â%C2%83Ã%C2%82Â%C2%83Ã%C2%83Â%C2%82Ã%C2%82Â%C2%98Ã%C2%83Â%C2%83Ã%C2%82Â%C2%82Ã%C2%83Â%C2%82Ã%C2%82¹Ã%C2%83Â%C2%83Ã%C2%82Â%C2%83Ã%C2%83Â%C2%82Ã%C2%82Â%C2%98Ã%C2%83Â%C2%83Ã%C2%82Â%C2%82Ã%C2%83Â%C2%82Ã%C2%82±Ã%C2%83Â%C2%83Ã%C2%82Â%C2%83Ã%C2%83Â%C2%82Ã%C2%82Â%C2%98Ã%C2%83Â%C2%83Ã%C2%82Â%C2%82Ã%C2%83Â%C2%82Ã%C2%82¨","Version":1.5} {"Attributable":true,"Description":"Website has been identified as malicious by
[jira] [Commented] (DRILL-7539) Aggregate expression is illegal in GROUP BY clause
[ https://issues.apache.org/jira/browse/DRILL-7539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17019310#comment-17019310 ] benj commented on DRILL-7539:
---
Please note that it is also possible to bypass the problem by fully prefixing the columns used in GROUP BY. Example (in the same way as before):
{code:sql}
/* OK because the GROUP BY is on x.b (not only b) */
apache drill 1.17> SELECT a, any_value(b) AS b FROM (SELECT 'a' a, 1 b) x GROUP BY a, x.b;
+---+---+
| a | b |
+---+---+
| a | 1 |
+---+---+
{code}

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (DRILL-7539) Aggregate expression is illegal in GROUP BY clause
benj created DRILL-7539:
---
Summary: Aggregate expression is illegal in GROUP BY clause
Key: DRILL-7539
URL: https://issues.apache.org/jira/browse/DRILL-7539
Project: Apache Drill
Issue Type: Bug
Components: SQL Parser
Affects Versions: 1.17.0
Reporter: benj

When using a grouped field in an aggregate function, it works unless the aggregate is aliased with the original name of the field.
Example (a minimalist example with no real sense, but based on a structure actually used (with a more complex GROUP BY part)):
{code:sql}
/* OK because the aggregate is on b, which is not a grouped field */
apache drill 1.17> SELECT a, any_value(b) AS b FROM (SELECT 'a' a, 1 b) x GROUP BY a;
+---+---+
| a | b |
+---+---+
| a | 1 |
+---+---+

/* NOK because the aggregate on grouped field b is aliased to b (the name used in the GROUP BY) */
apache drill 1.17> SELECT a, any_value(b) AS b FROM (SELECT 'a' a, 1 b) x GROUP BY a, b;
Error: VALIDATION ERROR: From line 1, column 11 to line 1, column 16: Aggregate expression is illegal in GROUP BY clause

/* OK as the aggregate on grouped field b is aliased to c */
apache drill 1.17> SELECT a, any_value(b) AS c FROM (SELECT 'a' a, 1 b) x GROUP BY a, b;
+---+---+
| a | c |
+---+---+
| a | 1 |
+---+---+
{code}
This is a problem that is easy to work around, but it's also easy to get caught by. And the bypass sometimes requires an additional level of SELECT, which is rarely desired.
Tested against postgres, which doesn't have this problem.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
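The "additional level of SELECT" bypass mentioned above could be sketched like this (untested): alias the aggregate to a temporary name inside, then rename it back outside:
{code:sql}
/* untested sketch: the inner alias c avoids the collision with the
   grouped column b; the outer SELECT restores the desired name */
SELECT a, c AS b
FROM (
  SELECT a, any_value(b) AS c
  FROM (SELECT 'a' a, 1 b) x
  GROUP BY a, b
) y;
{code}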
[jira] [Commented] (DRILL-7449) memory leak parse_url function
[ https://issues.apache.org/jira/browse/DRILL-7449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17016797#comment-17016797 ] benj commented on DRILL-7449:
---
Hi [~IhorHuzenko], I have enabled debug logging and the result is here: [^embedded_sqlline_with_enable_debug_logging.log.txt]
[jira] [Updated] (DRILL-7449) memory leak parse_url function
[ https://issues.apache.org/jira/browse/DRILL-7449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] benj updated DRILL-7449:
---
Attachment: embedded_sqlline_with_enable_debug_logging.log.txt
[jira] [Comment Edited] (DRILL-7449) memory leak parse_url function
[ https://issues.apache.org/jira/browse/DRILL-7449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17016064#comment-17016064 ] benj edited comment on DRILL-7449 at 1/15/20 3:07 PM:
---
[~IhorHuzenko], please find attached (execution with leak from my local machine, Drill 1.17 embedded on xubuntu 18.04):
- [^embedded_FullJsonProfile.txt]
- [^embedded_sqlline.log.txt]

The physical plan:
{noformat}
00-00 Screen : rowType = RecordType(VARCHAR(255) Fragment, BIGINT Number of records written): rowcount = 5845417.0, cumulative cost = {7.07295457E7 rows, 7.424585516618018E8 cpu, 5.985707595E9 io, 4.7885656064E10 network, 4.6763336E7 memory}, id = 739
00-01   Project(Fragment=[$0], Number of records written=[$1]) : rowType = RecordType(VARCHAR(255) Fragment, BIGINT Number of records written): rowcount = 5845417.0, cumulative cost = {7.0145004E7 rows, 7.418740099618018E8 cpu, 5.985707595E9 io, 4.7885656064E10 network, 4.6763336E7 memory}, id = 738
00-02     Writer : rowType = RecordType(VARCHAR(255) Fragment, BIGINT Number of records written): rowcount = 5845417.0, cumulative cost = {6.4299587E7 rows, 7.301831759618018E8 cpu, 5.985707595E9 io, 4.7885656064E10 network, 4.6763336E7 memory}, id = 737
00-03       ProjectAllowDup(Domain=[$0]) : rowType = RecordType(ANY Domain): rowcount = 5845417.0, cumulative cost = {5.845417E7 rows, 7.243377589618018E8 cpu, 5.985707595E9 io, 4.7885656064E10 network, 4.6763336E7 memory}, id = 736
00-04         Project(Domain=[$0]) : rowType = RecordType(ANY Domain): rowcount = 5845417.0, cumulative cost = {5.2608753E7 rows, 7.184923419618018E8 cpu, 5.985707595E9 io, 4.7885656064E10 network, 4.6763336E7 memory}, id = 735
00-05           SingleMergeExchange(sort0=[0]) : rowType = RecordType(ANY Domain): rowcount = 5845417.0, cumulative cost = {4.6763336E7 rows, 7.126469249618018E8 cpu, 5.985707595E9 io, 4.7885656064E10 network, 4.6763336E7 memory}, id = 734
01-01             OrderedMuxExchange(sort0=[0]) : rowType = RecordType(ANY Domain): rowcount = 5845417.0, cumulative cost = {4.0917919E7 rows, 6.658835889618018E8 cpu, 5.985707595E9 io, 2.3942828032E10 network, 4.6763336E7 memory}, id = 733
02-01               SelectionVectorRemover : rowType = RecordType(ANY Domain): rowcount = 5845417.0, cumulative cost = {3.5072502E7 rows, 6.600381719618018E8 cpu, 5.985707595E9 io, 2.3942828032E10 network, 4.6763336E7 memory}, id = 732
02-02                 Sort(sort0=[$0], dir0=[ASC]) : rowType = RecordType(ANY Domain): rowcount = 5845417.0, cumulative cost = {2.9227085E7 rows, 6.541927549618018E8 cpu, 5.985707595E9 io, 2.3942828032E10 network, 4.6763336E7 memory}, id = 731
02-03                   HashToRandomExchange(dist0=[[$0]]) : rowType = RecordType(ANY Domain): rowcount = 5845417.0, cumulative cost = {2.3381668E7 rows, 1.28599174E8 cpu, 5.985707595E9 io, 2.3942828032E10 network, 0.0 memory}, id = 730
03-01                     Project(Domain=[ITEM($0, 'host')]) : rowType = RecordType(ANY Domain): rowcount = 5845417.0, cumulative cost = {1.7536251E7 rows, 3.5072502E7 cpu, 5.985707595E9 io, 0.0 network, 0.0 memory}, id = 729
03-02                       Project(parsed=[PARSE_URL($0)]) : rowType = RecordType(ANY parsed): rowcount = 5845417.0, cumulative cost = {1.1690834E7 rows, 2.9227085E7 cpu, 5.985707595E9 io, 0.0 network, 0.0 memory}, id = 728
03-03                         Scan(table=[[dfs, tmp, fbingredagg.bigcopy.json]], groupscan=[EasyGroupScan [selectionRoot=file:/tmp/fbingredagg.bigcopy.json, numFiles=1, columns=[`Url`], files=[file:/tmp/fbingredagg.bigcopy.json], schema=null]]) : rowType = RecordType(ANY Url): rowcount = 5845417.0, cumulative cost = {5845417.0 rows, 5845417.0 cpu, 5.985707595E9 io, 0.0 network, 0.0 memory}, id = 727
{noformat}
And the operator profile. Note that Rows is 8 695 808 although there are 8 999 940 rows in the file:
{noformat}
Operator ID | Type    | Avg Setup Time | Max Setup Time | Avg Process Time | Max Process Time | Min Wait Time | Avg Wait Time | Max Wait Time | % Fragment Time | % Query Time | Rows | Avg Peak Memory | Max Peak Memory
00-xx-00    | SCREEN  | 0,000s | 0,000s | 0,000s | 0,000s | 0,000s | 0,000s | 0,000s | 0,94% | 0,00% | 0 | - | -
00-xx-01    | PROJECT | 0,000s
0,000s 0,000s 0,000s 0,000s 0,000s 0,000s 2,37% 0,00% 0 - - 00-xx-02PARQUET_WRITER 0,000s 0,000s 0,000s 0,000s 0,000s 0,000s 0,000s 6,08% 0,00% 0 - - 00-xx-03PROJECT_ALLOW_DUP 0,000s 0,000s 0,000s 0,000s 0,000s 0,000s 0,000s 16,61% 0,00% 0 52KB52KB 00-xx-04PROJECT 0,001s 0,001s 0,000s 0,000s 0,000s 0,000s 0,000s 35,03% 0,00% 0 52KB52KB 00-xx-05MERGING_RECEIVER0,000s 0,000s 0,000s 0,000s 40,382s 40,382s 40,382s 38,96% 0,00% 0 52KB52KB 01-xx-00
[jira] [Updated] (DRILL-7449) memory leak parse_url function
[ https://issues.apache.org/jira/browse/DRILL-7449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] benj updated DRILL-7449: Attachment: embedded_sqlline.log.txt embedded_FullJsonProfile.txt > memory leak parse_url function > -- > > Key: DRILL-7449 > URL: https://issues.apache.org/jira/browse/DRILL-7449 > Project: Apache Drill > Issue Type: Bug > Components: Functions - Drill >Affects Versions: 1.16.0 >Reporter: benj >Assignee: Igor Guzenko >Priority: Major > Attachments: embedded_FullJsonProfile.txt, embedded_sqlline.log.txt > > > Requests with *parse_url* works well when the number of treated rows is low > but produce memory leak when number of rows grows (~ between 500 000 and 1 > million) (and for certain number of row sometimes the request works and > sometimes it failed with memory leaks) > Extract from dataset tested: > {noformat} > {"Attributable":true,"Description":"Website has been identified as malicious > by > Bing","FirstReportedDateTime":"2018-03-12T18:49:38Z","IndicatorExpirationDateTime":"2018-04-11T23:33:13Z","IndicatorProvider":"Bing","IndicatorThreatType":"MaliciousUrl","IsPartnerShareable":true,"IsProductLicensed":true,"LastReportedDateTime":"2018-03-12T18:49:38Z","NetworkDestinationAsn":15169,"NetworkDestinationIPv4":"172.217.8.193","NetworkDestinationPort":80,"Tags":["us"],"ThreatDetectionProduct":"ES","TLPLevel":"Amber","Url":"http://pasuruanbloggers.blogspot.ru/2012/12/beginilah-cara-orang-jepang-berpacaran.html","Version":1.5} > {"Attributable":true,"Description":"Website has been identified as malicious > by > 
Bing","FirstReportedDateTime":"2018-03-12T18:14:51Z","IndicatorExpirationDateTime":"2018-04-11T23:33:13Z","IndicatorProvider":"Bing","IndicatorThreatType":"MaliciousUrl","IsPartnerShareable":true,"IsProductLicensed":true,"LastReportedDateTime":"2018-03-12T18:14:51Z","NetworkDestinationAsn":15169,"NetworkDestinationIPv4":"216.58.192.193","NetworkDestinationPort":80,"Tags":["us"],"ThreatDetectionProduct":"ES","TLPLevel":"Amber","Url":"http://pasuruanbloggers.blogspot.ru/2012/12/cara-membuat-widget-slideshow-postingan.html","Version":1.5} > {noformat} > Request tested: > {code:sql} > ALTER SESSION SET `store.format`='parquet'; > ALTER SESSION SET `store.parquet.use_new_reader` = true; > ALTER SESSION SET `store.parquet.compression` = 'snappy'; > ALTER SESSION SET `drill.exec.functions.cast_empty_string_to_null`= true; > ALTER SESSION SET `store.json.all_text_mode` = true; > ALTER SESSION SET `exec.enable_union_type` = true; > ALTER SESSION SET `store.json.all_text_mode` = true; > CREATE TABLE dfs.test.`output_pqt` AS > ( > SELECT R.parsed.host AS Domain > FROM ( > SELECT parse_url(T.Url) AS parsed > FROM dfs.test.`file.json` AS T > ) AS R > ORDER BY Domain > ); > {code} > > Result when memory leak: > {noformat} > Error: SYSTEM ERROR: IllegalStateException: Memory was leaked by query. > Memory leaked: (256) > Allocator(frag:3:0) 300/256/9337280/300 (res/actual/peak/limit) > Fragment 3:0 > Please, refer to logs for more information. > [Error Id: 3ffa5b43-0dde-4518-bb5a-ea3aab97f3d4 on servor01:31010] > (java.lang.IllegalStateException) Memory was leaked by query. 
Memory > leaked: (256) > Allocator(frag:3:0) 300/256/9337280/300 (res/actual/peak/limit) > org.apache.drill.exec.memory.BaseAllocator.close():520 > org.apache.drill.exec.ops.FragmentContextImpl.suppressingClose():552 > org.apache.drill.exec.ops.FragmentContextImpl.close():546 > > org.apache.drill.exec.work.fragment.FragmentExecutor.closeOutResources():386 > org.apache.drill.exec.work.fragment.FragmentExecutor.cleanup():214 > org.apache.drill.exec.work.fragment.FragmentExecutor.run():329 > org.apache.drill.common.SelfCleaningRunnable.run():38 > java.util.concurrent.ThreadPoolExecutor.runWorker():1149 > java.util.concurrent.ThreadPoolExecutor$Worker.run():624 > java.lang.Thread.run():748 (state=,code=0) > java.sql.SQLException: SYSTEM ERROR: IllegalStateException: Memory was leaked > by query. Memory leaked: (256) > Allocator(frag:3:0) 300/256/9337280/300 (res/actual/peak/limit) > Fragment 3:0 > Please, refer to logs for more information. > [Error Id: 3ffa5b43-0dde-4518-bb5a-ea3aab97f3d4 on servor01:31010] > (java.lang.IllegalStateException) Memory was leaked by query. Memory > leaked: (256) > Allocator(frag:3:0) 300/256/9337280/300 (res/actual/peak/limit) > org.apache.drill.exec.memory.BaseAllocator.close():520 > org.apache.drill.exec.ops.FragmentContextImpl.suppressingClose():552 > org.apache.drill.exec.ops.FragmentContextImpl.close():546 > > org.apache.drill.exec.work.fragment.FragmentExecutor.closeOutResources():386 >
[jira] [Commented] (DRILL-7449) memory leak parse_url function
[ https://issues.apache.org/jira/browse/DRILL-7449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17014960#comment-17014960 ] benj commented on DRILL-7449: - Hi [~IhorHuzenko], the problem doesn't appear on every run; sometimes (with exactly the same data) it works 5 times before crashing. Tested with the official 1.17 on a small 3-node cluster (each ~48 procs / 128 GB, DRILL_HEAP=15G, DRILL_MAX_DIRECT_MEMORY=80G) with a file of 688 MB / 1 118 320 JSON records. On the cluster, comparing the profiles of correct and crashed executions, I can see that: - the crash appears at the "02-xx-02 - EXTERNAL_SORT" level - on "02-xx-03 - UNORDERED_RECEIVER": in a correct execution, 99% of the Max Records are concentrated on 1 of the 8 minor fragments and the cumulative total is correct; in a crashed execution, Max Records are distributed roughly evenly over the 8 minor fragments and the cumulative total is incorrect (lower) (already incorrect in "03-xx-02 - PROJECT" and "03-xx-00 - JSON_SUB_SCAN"). On my local machine (1.17 too, 8 procs / 32 GB), in embedded mode, comparing the profiles of correct and crashed executions, I can see that: - the crash appears at the "02-xx-02 - EXTERNAL_SORT" level - the difference is on "03-xx-00 - JSON_SUB_SCAN": the crashed execution doesn't have the right number of Max Records - for "02-xx-03 - UNORDERED_RECEIVER", in both correct and crashed runs, Max Records are distributed roughly evenly over the 6 minor fragments. Example of log data from a crashed execution on the cluster: {noformat} 2020-01-14 08:22:33,681 [21e285b6-4d53-58fd-8a4d-dedc0cbfb86a:foreman] INFO o.a.drill.exec.work.foreman.Foreman - Query text for query with id 21e285b6-4d53-58fd-8a4d-dedc0cbfb86a issued by anonymous: CREATE TABLE dfs.test.`output_pqt` AS ( SELECT R.parsed.host AS D FROM (SELECT parse_url(T.Url) AS parsed FROM dfs.test.`demo2.big.json` AS T) AS R ORDER BY D ) 2020-01-14 08:22:33,724 [21e285b6-4d53-58fd-8a4d-dedc0cbfb86a:foreman] INFO o.a.d.e.p.s.h.CreateTableHandler - Creating persistent
table [output_pqt]. 2020-01-14 08:22:33,779 [21e285b6-4d53-58fd-8a4d-dedc0cbfb86a:frag:2:3] INFO o.a.d.e.w.fragment.FragmentExecutor - 21e285b6-4d53-58fd-8a4d-dedc0cbfb86a:2:3: State change requested AWAITING_ALLOCATION --> RUNNING 2020-01-14 08:22:33,779 [21e285b6-4d53-58fd-8a4d-dedc0cbfb86a:frag:2:7] INFO o.a.d.e.w.fragment.FragmentExecutor - 21e285b6-4d53-58fd-8a4d-dedc0cbfb86a:2:7: State change requested AWAITING_ALLOCATION --> RUNNING 2020-01-14 08:22:33,779 [21e285b6-4d53-58fd-8a4d-dedc0cbfb86a:frag:2:5] INFO o.a.d.e.w.fragment.FragmentExecutor - 21e285b6-4d53-58fd-8a4d-dedc0cbfb86a:2:5: State change requested AWAITING_ALLOCATION --> RUNNING 2020-01-14 08:22:33,780 [21e285b6-4d53-58fd-8a4d-dedc0cbfb86a:frag:2:7] INFO o.a.d.e.w.f.FragmentStatusReporter - 21e285b6-4d53-58fd-8a4d-dedc0cbfb86a:2:7: State to report: RUNNING 2020-01-14 08:22:33,780 [21e285b6-4d53-58fd-8a4d-dedc0cbfb86a:frag:2:3] INFO o.a.d.e.w.f.FragmentStatusReporter - 21e285b6-4d53-58fd-8a4d-dedc0cbfb86a:2:3: State to report: RUNNING 2020-01-14 08:22:33,780 [21e285b6-4d53-58fd-8a4d-dedc0cbfb86a:frag:2:5] INFO o.a.d.e.w.f.FragmentStatusReporter - 21e285b6-4d53-58fd-8a4d-dedc0cbfb86a:2:5: State to report: RUNNING 2020-01-14 08:22:33,782 [21e285b6-4d53-58fd-8a4d-dedc0cbfb86a:frag:1:2] INFO o.a.d.e.w.fragment.FragmentExecutor - 21e285b6-4d53-58fd-8a4d-dedc0cbfb86a:1:2: State change requested AWAITING_ALLOCATION --> RUNNING 2020-01-14 08:22:33,782 [21e285b6-4d53-58fd-8a4d-dedc0cbfb86a:frag:1:2] INFO o.a.d.e.w.f.FragmentStatusReporter - 21e285b6-4d53-58fd-8a4d-dedc0cbfb86a:1:2: State to report: RUNNING 2020-01-14 08:22:33,787 [21e285b6-4d53-58fd-8a4d-dedc0cbfb86a:frag:0:0] INFO o.a.d.e.w.fragment.FragmentExecutor - 21e285b6-4d53-58fd-8a4d-dedc0cbfb86a:0:0: State change requested AWAITING_ALLOCATION --> RUNNING 2020-01-14 08:22:33,787 [21e285b6-4d53-58fd-8a4d-dedc0cbfb86a:frag:0:0] INFO o.a.d.e.w.f.FragmentStatusReporter - 21e285b6-4d53-58fd-8a4d-dedc0cbfb86a:0:0: State to report: RUNNING 2020-01-14 
08:22:41,672 [BitServer-2] INFO o.a.d.e.w.fragment.FragmentExecutor - 21e285b6-4d53-58fd-8a4d-dedc0cbfb86a:0:0: State change requested RUNNING --> CANCELLATION_REQUESTED 2020-01-14 08:22:41,673 [BitServer-2] INFO o.a.d.e.w.f.FragmentStatusReporter - 21e285b6-4d53-58fd-8a4d-dedc0cbfb86a:0:0: State to report: CANCELLATION_REQUESTED 2020-01-14 08:22:41,674 [BitServer-2] INFO o.a.d.e.w.fragment.FragmentExecutor - 21e285b6-4d53-58fd-8a4d-dedc0cbfb86a:1:2: State change requested RUNNING --> CANCELLATION_REQUESTED 2020-01-14 08:22:41,674 [BitServer-2] INFO o.a.d.e.w.f.FragmentStatusReporter - 21e285b6-4d53-58fd-8a4d-dedc0cbfb86a:1:2: State to report: CANCELLATION_REQUESTED 2020-01-14 08:22:41,675 [BitServer-2] INFO o.a.d.e.w.fragment.FragmentExecutor - 21e285b6-4d53-58fd-8a4d-dedc0cbfb86a:2:3: State change requested RUNNING --> CANCELLATION_REQUESTED 2020-01-14 08:22:41,675
[jira] [Created] (DRILL-7524) Distinct on array with any_value
benj created DRILL-7524: --- Summary: Distinct on array with any_value Key: DRILL-7524 URL: https://issues.apache.org/jira/browse/DRILL-7524 Project: Apache Drill Issue Type: Bug Components: Functions - Drill Affects Versions: 1.17.0 Reporter: benj Attachments: IndexOutOfBoundsException.txt, NegativeArraySizeException.txt

As Drill doesn't allow GROUP BY, DISTINCT or ORDER BY on complex types, using the any_value aggregate function may appear to be a solution. But some problems appear. With a dataset of 223664 rows like:
{code:sql}
SELECT Url, Tags FROM dfs.tmp.`data.json` LIMIT 1;
+-----------------------------------------+--------+
| Url                                     | Tags   |
+-----------------------------------------+--------+
| http://000.dijiushipindian.com/feed.rss | ["us"] |
+-----------------------------------------+--------+
{code}
and with our own UDF to_string that only does
{code:java}
@Param FieldReader input;
...
String rowString = input.readObject().toString();
...
{code}
{code:sql}
SELECT any_value(T.Tags) Tags FROM dfs.tmp.`data.json` GROUP BY NULLIF(UPPER(to_string(T.Tags)),'') /* WORKS WELL */;
+--------+
| Tags   |
+--------+
| ["us"] |
| ["cn"] |
...

SELECT Url, any_value(T.Tags) Tags FROM dfs.tmp.`data.json` GROUP BY Url, NULLIF(UPPER(to_string(T.Tags)),'') /* NOK */;
java.lang.NegativeArraySizeException
{code}
Sometimes the error can be different (details in attachment): java.lang.IndexOutOfBoundsException: index: 1634787136, length: 7629168 (expected: range(0, 8388608)) And before producing the error, the output shows some results like the ones below:
{code}
+---------------------------------------------------------------------------------------+------+
| Url                                                                                   | Tags |
+---------------------------------------------------------------------------------------+------+
| http://everythiing4u.blogspot.com.es/2013/04/omg-proposal-fail.html                   | []   |
| http://everythiing4u.blogspot.com.es/2013/04/omg-this-dude-just-owned-his-friend.html | []   |
{code}
This result is not correct because the field Tags is empty, although that is never the case in the source file. So maybe there is a problem with the aggregate function any_value. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (DRILL-7449) memory leak parse_url function
[ https://issues.apache.org/jira/browse/DRILL-7449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17014216#comment-17014216 ] benj commented on DRILL-7449: - I have done a full check and in reality we haven't used drill-url-tools, because it sometimes produces incorrect values on big datasets (due to a memory problem caught inside the UDF?). After some other tests, the standard Drill *parse_url* works well (no memory leak) +if the ORDER BY clause is removed+. And note that the memory leak can also appear with url_parse (from drill-url-tools) when using an ORDER BY clause. The only code that does not cause any critical problem for our use is a regexp of the type:
{code:sql}
SELECT REGEXP_REPLACE(Activity,'^(?:.*:.*@)?([^:]*)(?::.*)?$','$1') AS Host
FROM (SELECT REGEXP_REPLACE(NULLIF(Url, ''),'^(?:(?:[^:/?#]+):)?(?://([^/?#]*))(?:[^?#]*)?(?:.*)?','$1') AS Activity FROM ...)
{code}
I don't know why, but from observation, the ORDER BY clause produces a number of errors in different contexts with complex requests, and it's sometimes necessary to split the request into 2 distinct requests (one for the SELECT with the computations and one for the SELECT with the ORDER BY). Note that with the regexp there is no error even with the ORDER BY clause.
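The "split into 2 distinct requests" workaround described in this issue (compute first, sort separately) can be sketched as follows; this is only an illustration following the pattern of the queries quoted in this issue, and the intermediate table name `hosts_unsorted_pqt` is hypothetical, not from the original report:
{code:sql}
-- Step 1: compute the domains without any ORDER BY (no External Sort involved)
CREATE TABLE dfs.test.`hosts_unsorted_pqt` AS
(
  SELECT R.parsed.host AS Domain
  FROM (SELECT parse_url(T.Url) AS parsed FROM dfs.test.`file.json` AS T) AS R
);
-- Step 2: sort the materialized intermediate table in a separate request
SELECT Domain FROM dfs.test.`hosts_unsorted_pqt` ORDER BY Domain;
{code}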
[jira] [Updated] (DRILL-7519) Error on case when different branches are arrays of same type but built differently
[ https://issues.apache.org/jira/browse/DRILL-7519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] benj updated DRILL-7519: Attachment: full_log_DRILL7519.log -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (DRILL-7449) memory leak parse_url function
[ https://issues.apache.org/jira/browse/DRILL-7449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17012874#comment-17012874 ] benj commented on DRILL-7449: - In the meantime, and in case it helps someone, I have found this page [https://www.r-bloggers.com/two-new-apache-drill-udfs-for-processing-urils-and-internet-domain-names/] and am now using the function _url_parse_ from [https://github.com/hrbrmstr/drill-url-tools], which uses [http://galimatias.mola.io/]
[jira] [Created] (DRILL-7519) Error on case when different branches are arrays of same type but built differently
benj created DRILL-7519: --- Summary: Error on CASE when different branches are arrays of the same type but built differently Key: DRILL-7519 URL: https://issues.apache.org/jira/browse/DRILL-7519 Project: Apache Drill Issue Type: Bug Affects Versions: 1.17.0 Reporter: benj With 3 arrays built like:
{code:sql}
SELECT T.s, typeof(T.s), modeof(T.s)
     , T.j, typeof(T.j), modeof(T.j)
     , T.j2.tag, typeof(T.j2.tag), modeof(T.j2.tag)
FROM (
  SELECT split('a,b',',') AS s
       , convert_fromJSON('["c","d"]') AS j
       , convert_fromJSON('{"tag":["e","f"]}') AS j2
) AS T;
+-----------+---------+--------+-----------+---------+--------+-----------+---------+--------+
|     s     | EXPR$1  | EXPR$2 |     j     | EXPR$4  | EXPR$5 |  EXPR$6   | EXPR$7  | EXPR$8 |
+-----------+---------+--------+-----------+---------+--------+-----------+---------+--------+
| ["a","b"] | VARCHAR | ARRAY  | ["c","d"] | VARCHAR | ARRAY  | ["e","f"] | VARCHAR | ARRAY  |
+-----------+---------+--------+-----------+---------+--------+-----------+---------+--------+
{code}
it is possible to use *s* and *j* as branches of the same CASE, but mixing *s* or *j* with *j2.tag* behaves inconsistently:
{code:sql}
SELECT CASE WHEN true THEN T.s ELSE T.j END
     , CASE WHEN false THEN T.s ELSE T.j END
FROM (
  SELECT split('a,b',',') AS s
       , convert_fromJSON('["c","d"]') AS j
       , convert_fromJSON('{"tag":["e","f"]}') AS j2
) AS T;
+-----------+-----------+
|  EXPR$0   |  EXPR$1   |
+-----------+-----------+
| ["a","b"] | ["c","d"] |
+-----------+-----------+

SELECT CASE WHEN true THEN T.j2.tag ELSE T.s /*idem with T.j*/ END
     , CASE WHEN false THEN T.j2.tag ELSE T.s /*idem with T.j*/ END
FROM (SELECT split('a,b',',') AS s, convert_fromJSON('["c","d"]') AS j, convert_fromJSON('{"tag":["e","f"]}') AS j2) AS T;
+-----------+-----------+
|  EXPR$0   |  EXPR$1   |
+-----------+-----------+
| ["e","f"] | ["a","b"] |
+-----------+-----------+

/* But surprisingly */
SELECT CASE WHEN false THEN T.j2.tag ELSE T.s /*idem with T.j*/ END
FROM (SELECT split('a,b',',') AS s, convert_fromJSON('["c","d"]') AS j, convert_fromJSON('{"tag":["e","f"]}') AS j2) AS T;
Error: SYSTEM ERROR: NullPointerException

/* and */
SELECT CASE WHEN true THEN T.j2.tag ELSE T.s /*idem with T.j*/ END
FROM (SELECT split('a,b',',') AS s, convert_fromJSON('["c","d"]') AS j, convert_fromJSON('{"tag":["e","f"]}') AS j2) AS T;
+-----------+
|  EXPR$0   |
+-----------+
| ["e","f"] |
+-----------+
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
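The DRILL-7519 report above expects the three arrays to be interchangeable because they share the same surface type (VARCHAR ARRAY). As a toy illustration of that expectation (a Python sketch, not Drill's actual planner; `surface_type` and `unify_case_branches` are hypothetical helpers), a type checker that unifies CASE branches on (type, mode) accepts all three, regardless of whether the array came from split or convert_fromJSON:

```python
import json

def surface_type(value):
    # Roughly what Drill's typeof()/modeof() report: element type plus mode.
    if isinstance(value, list):
        elem = {type(v).__name__ for v in value}
        return ("VARCHAR" if elem == {"str"} else "OTHER", "ARRAY")
    return (type(value).__name__.upper(), "SCALAR")

def unify_case_branches(*branches):
    # CASE requires every branch to share one result type; return it or fail.
    types = {surface_type(b) for b in branches}
    if len(types) != 1:
        raise TypeError("CASE branches disagree: %s" % types)
    return types.pop()

s = "a,b".split(",")                             # like split('a,b', ',')
j = json.loads('["c","d"]')                      # like convert_fromJSON('["c","d"]')
j2_tag = json.loads('{"tag":["e","f"]}')["tag"]  # like convert_fromJSON(...).tag

unified = unify_case_branches(s, j, j2_tag)      # all three unify
```

Under this surface-type view the NullPointerException is unexpected; the failure presumably comes from internal representation differences that typeof()/modeof() do not expose.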
[jira] [Created] (DRILL-7516) count(*) on empty JSON produces nothing
benj created DRILL-7516: --- Summary: count(*) on empty JSON produces nothing Key: DRILL-7516 URL: https://issues.apache.org/jira/browse/DRILL-7516 Project: Apache Drill Issue Type: Bug Components: Storage - JSON Affects Versions: 1.17.0 Reporter: benj With 2 empty files:
{code:bash}
touch 0.csv
touch 0.json
{code}
count(*) does not produce the same result:
{code:sql}
apache drill> select count(*) from dfs.TEST.`0.json`;
+--------+
| EXPR$0 |
+--------+
+--------+
No rows selected (0.151 seconds)

apache drill> select count(*) from dfs.TEST.`0.csv`;
+--------+
| EXPR$0 |
+--------+
| 0      |
+--------+
1 row selected (0.415 seconds)
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
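SQL defines COUNT(*) as an aggregate that returns exactly one row, 0, over empty input, which is what the CSV reader above does and the JSON reader does not. A minimal Python sketch of the expected semantics (`count_records` is an illustrative helper, not Drill code; the StringIO objects stand in for the empty files):

```python
import io

def count_records(lines):
    # COUNT(*) semantics: an aggregate over zero input rows yields 0,
    # never "no rows".
    return sum(1 for line in lines if line.strip())

n_json = count_records(io.StringIO(""))  # stands in for the empty 0.json
n_csv = count_records(io.StringIO(""))   # stands in for the empty 0.csv
```

Both readers should agree on 0 here.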
[jira] [Created] (DRILL-7515) ORDER BY clause produces error on GROUP BY with array field managed with any_value
benj created DRILL-7515: --- Summary: ORDER BY clause produce error on GROUP BY with array field manager with any_value Key: DRILL-7515 URL: https://issues.apache.org/jira/browse/DRILL-7515 Project: Apache Drill Issue Type: Bug Components: Execution - Data Types Affects Versions: 1.17.0 Reporter: benj With a parquet containing an array field, for example: {code:sql} apache drill 1.17> CREATE TABLE dfs.TEST.`example_any_pqt` AS (SELECT 'foo' AS a, 'bar' b, split('foo,bar',',') as c); apache drill 1.17> SELECT *, typeof(c) AS type, sqltypeof(c) AS sql_type FROM dfs.TEST.`example_any_pqt`; +-+-+---+-+--+ | a | b | c | type | sql_type | +-+-+---+-+--+ | foo | bar | ["foo","bar"] | VARCHAR | ARRAY| +-+-+---+-+--+ {code} The next request work well {code:sql} apache drill 1.17> SELECT * FROM (SELECT a, any_value(c) FROM dfs.TEST.`example_any_pqt` GROUP BY a) ORDER BY a; +-+---+ | a |EXPR$1 | +-+---+ | foo | ["foo","bar"] | +-+---+ {code} But the next request (with the same struct as the previous request) failed {code:sql} apache drill 1.17> SELECT * FROM (SELECT a, b, any_value(c) FROM dfs.TEST.`example_any_pqt` GROUP BY a, b) ORDER BY a; Error: UNSUPPORTED_OPERATION ERROR: Schema changes not supported in External Sort. Please enable Union type. Previous schema BatchSchema [fields=[[`a` (VARCHAR:OPTIONAL)], [`b` (VARCHAR:OPTIONAL)], [`EXPR$2` (NULL:OPTIONAL)]], selectionVector=NONE] Incoming schema BatchSchema [fields=[[`a` (VARCHAR:OPTIONAL)], [`b` (VARCHAR:OPTIONAL)], [`EXPR$2` (VARCHAR:REPEATED), children=([`$data$` (VARCHAR:REQUIRED)])]], selectionVector=NONE] Fragment 0:0 {code} Note that the same request +without the order by+ works well. It's also possible to use intermediate table and apply the ORDER BY in a second time. 
{code:sql} apache drill 1.17> SELECT * FROM (SELECT a, b, any_value(c) FROM dfs.TEST.`example_any_pqt` GROUP BY a, b); +-+-+---+ | a | b |EXPR$2 | +-+-+---+ | foo | bar | ["foo","bar"] | +-+-+---+ apache drill 1.17> CREATE TABLE dfs.TEST.`ok_pqt` AS (SELECT * FROM (SELECT a, b, any_value(c) FROM dfs.TEST.`example_any_pqt` GROUP BY a, b)); +--+---+ | Fragment | Number of records written | +--+---+ | 0_0 | 1 | +--+---+ apache drill 1.17> SELECT * FROM dfs.TEST.`ok_pqt` ORDER BY a; +-+-+---+ | a | b |EXPR$2 | +-+-+---+ | foo | bar | ["foo","bar"] | +-+-+---+ {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (DRILL-7493) convert_fromJSON and unicode
benj created DRILL-7493: --- Summary: convert_fromJSON and unicode Key: DRILL-7493 URL: https://issues.apache.org/jira/browse/DRILL-7493 Project: Apache Drill Issue Type: Bug Components: Functions - Drill Affects Versions: 1.16.0 Reporter: benj transform a json string (with \u char) into json struct {code:sql} apache drill> SELECT x_str, convert_fromJSON(x_str) AS x_array FROM (SELECT '["test=\u0014=test"]' x_str); +--+--+ |x_str | x_array| +--+--+ | ["test=\u0014=test"] | ["test=\u0014=test"] | +--+--+ {code} Use json struct : {code:sql} apache drill> SELECT x_str , x_array , x_array[0] AS x_array0 FROM(SELECT x_str, convert_fromJSON(x_str) AS x_array FROM (SELECT '["test=\u0014=test"]' x_str)); +--+--+-+ |x_str | x_array| x_array0 | +--+--+-+ | ["test=\u0014=test"] | ["test=\u0014=test"] | test==test | +--+--+-+ {code} Note that the char \u0014 is interpreted in x_array0 if using split function on x_array0, an array is built with non interpreted \u {code:sql} apache drill> SELECT x_str , x_array , x_array[0] AS x_array0 , split(x_array[0],',') AS x_array0_split FROM(SELECT x_str, convert_fromJSON(x_str) AS x_array FROM (SELECT '["test=\u0014=test"]' x_str)); +--+--+-+--+ |x_str | x_array| x_array0 |x_array0_split | +--+--+-+--+ | ["test=\u0014=test"] | ["test=\u0014=test"] | test==test | ["test=\u0014=test"] | +--+--+-+--+ {code} It's not possible to use convert_fromJSON on the interpreted \u {code:sql} SELECT x_str , x_array , x_array[0] AS x_array0 , split(x_array[0],',') AS x_array0_split , convert_fromJSON('["' || x_array[0] || '"]') AS convertJSONerror FROM(SELECT x_str, convert_fromJSON(x_str) AS x_array FROM (SELECT '["test=\u0014=test"]' x_str)); Error: DATA_READ ERROR: Illegal unquoted character ((CTRL-CHAR, code 20)): has to be escaped using backslash to be included in string value at [Source: (org.apache.drill.exec.vector.complex.fn.DrillBufInputStream); line: 1, column: 9] {code} don't work although the string is the same as the origin but \u is 
unfortunately interpreted -- This message was sent by Atlassian Jira (v8.3.4#803005)
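The DRILL-7493 behaviour above is reproducible with any strict JSON parser: once the \u0014 escape has been decoded to a raw control character, the string rebuilt by '["' || x_array[0] || '"]' is no longer valid JSON, and the value must be re-escaped before re-parsing. A Python sketch (Python's json module stands in for convert_fromJSON):

```python
import json

raw = '["test=\\u0014=test"]'   # the literal passed to convert_fromJSON
arr = json.loads(raw)           # \u0014 is decoded to a raw control char
assert arr[0] == "test=\x14=test"

# Rebuilding a JSON literal around the *decoded* value, as the failing query
# does, embeds an unescaped control character, which strict parsers reject:
try:
    json.loads('["' + arr[0] + '"]')
    failed = False
except json.JSONDecodeError:
    failed = True

# Re-escaping first restores the \u0014 notation and parses again:
roundtrip = json.loads(json.dumps([arr[0]]))
```

Re-escaping (json.dumps here) is the generic workaround: it regenerates the backslash notation that the parser, like Drill's DATA_READ check, requires.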
[jira] [Commented] (DRILL-6963) create/aggregate/work with array
[ https://issues.apache.org/jira/browse/DRILL-6963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16985067#comment-16985067 ] benj commented on DRILL-6963: - For the second point (arry_agg), in attempt of an eventual official function, here is a simple implementation that can do that (without possibility to _DISTINCT_ or _ORDER BY_) {code:java} package org.apache.drill.contrib.function; import io.netty.buffer.DrillBuf; import org.apache.drill.exec.expr.DrillAggFunc; import org.apache.drill.exec.expr.annotations.FunctionTemplate; import org.apache.drill.exec.expr.annotations.FunctionTemplate.FunctionScope; import org.apache.drill.exec.expr.annotations.FunctionTemplate.NullHandling; import org.apache.drill.exec.expr.annotations.Output; import org.apache.drill.exec.expr.annotations.Param; import org.apache.drill.exec.expr.annotations.Workspace; import org.apache.drill.exec.expr.holders.*; import javax.inject.Inject; // If dataset is too large, need : ALTER SESSION SET `planner.enable_hashagg` = false public class ArrayAgg { // STRING NULLABLE // @FunctionTemplate( name = "array_agg", scope = FunctionScope.POINT_AGGREGATE, nulls = NullHandling.INTERNAL) public static class NullableVarChar_ArrayAgg implements DrillAggFunc { @Param NullableVarCharHolder input; @Workspace ObjectHolder agg; @Output org.apache.drill.exec.vector.complex.writer.BaseWriter.ComplexWriter out; @Inject DrillBuf buffer; @Override public void setup() { agg = new ObjectHolder(); } @Override public void reset() { agg = new ObjectHolder(); } @Override public void add() { org.apache.drill.exec.vector.complex.writer.BaseWriter.ListWriter listWriter; if (agg.obj == null) { agg.obj = out.rootAsList(); } if ( input.isSet == 0 ) return; org.apache.drill.exec.expr.holders.VarCharHolder rowHolder = new org.apache.drill.exec.expr.holders.VarCharHolder(); byte[] inputBytes = org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.getStringFromVarCharHolder( input ).getBytes( 
com.google.common.base.Charsets.UTF_8 ); buffer.reallocIfNeeded(inputBytes.length); buffer.setBytes(0, inputBytes); rowHolder.start = 0; rowHolder.end = inputBytes.length; rowHolder.buffer = buffer; listWriter = (org.apache.drill.exec.vector.complex.writer.BaseWriter.ListWriter) agg.obj; listWriter.varChar().write( rowHolder ); } @Override public void output() { ((org.apache.drill.exec.vector.complex.writer.BaseWriter.ListWriter) agg.obj).endList(); } } // INTEGER NULLABLE // @FunctionTemplate( name = "array_agg", scope = FunctionScope.POINT_AGGREGATE, nulls = NullHandling.INTERNAL) public static class NullableInt_ArrayAgg implements DrillAggFunc { @Param NullableIntHolder input; @Workspace ObjectHolder agg; @Output org.apache.drill.exec.vector.complex.writer.BaseWriter.ComplexWriter out; @Inject DrillBuf buffer; @Override public void setup() { agg = new ObjectHolder(); } @Override public void reset() { agg = new ObjectHolder(); } @Override public void add() { org.apache.drill.exec.vector.complex.writer.BaseWriter.ListWriter listWriter; if (agg.obj == null) { agg.obj = out.rootAsList(); } if ( input.isSet == 0 ) return; listWriter = (org.apache.drill.exec.vector.complex.writer.BaseWriter.ListWriter) agg.obj; listWriter.integer().writeInt( input.value ); } @Override public void output() { ((org.apache.drill.exec.vector.complex.writer.BaseWriter.ListWriter) agg.obj).endList(); } } // ... } {code} > create/aggregate/work with array > > > Key: DRILL-6963 > URL: https://issues.apache.org/jira/browse/DRILL-6963 > Project: Apache Drill > Issue Type: Wish > Components: Functions - Drill >Reporter: benj >Priority: Major > > * Add the possibility to build array (like : SELECT array[a1,a2,a3...]) - > ideally work with all types > * Add a default array_agg (like : SELECT col1, array_agg(col2), > array_agg(DISTINCT col2) FROM ... 
GROUP BY col1) ; - ideally work with all > types > * Add function/facilities/operator to work with array -- This message was sent by Atlassian Jira (v8.3.4#803005)
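As a semantic reference for the DRILL-6963 UDF above: array_agg collects one list of values per group, skipping NULL inputs (the isSet == 0 check). A Python sketch of that contract (the `array_agg` helper and the sample rows are illustrative, not Drill code):

```python
from collections import defaultdict

def array_agg(rows, key, value):
    # Mirrors SELECT col1, array_agg(col2) ... GROUP BY col1:
    # one list of collected values per group key.
    groups = defaultdict(list)
    for row in rows:
        if row[value] is not None:   # the UDF skips NULL inputs (isSet == 0)
            groups[row[key]].append(row[value])
    return dict(groups)

rows = [
    {"col1": "a", "col2": "x"},
    {"col1": "a", "col2": "y"},
    {"col1": "b", "col2": None},
    {"col1": "b", "col2": "z"},
]
agg = array_agg(rows, "col1", "col2")
```

DISTINCT and ORDER BY variants, which the posted UDF does not support, would dedupe or sort each collected list before output.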
[jira] [Commented] (DRILL-1755) Add support for arrays and scalars as first level elements in JSON files
[ https://issues.apache.org/jira/browse/DRILL-1755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16982491#comment-16982491 ] benj commented on DRILL-1755: - This problems seems partially OK, With a file containing the previous description {noformat}[{"accessLevel": "public"},{"accessLevel": "private"}]{noformat} {code:sql} apache drill (1.16)> SELECT *, 'justfortest' AS mytext FROM dfs.tmp.`example.json`; +-+-+ | accessLevel | mytext| +-+-+ | public | justfortest | | private | justfortest | +-+-+ 2 rows selected (0.127 seconds) {code} But some problems subsists, like {code:sql} apache drill (1.16)> SELECT 'justfortest' As mytext FROM dfs.tmp.`example.json`; Error: DATA_READ ERROR: Error parsing JSON - Cannot read from the middle of a record. Current token was START_ARRAY File /tmp/example.json Record 1 Column 2 Fragment 0:0 {code} > Add support for arrays and scalars as first level elements in JSON files > > > Key: DRILL-1755 > URL: https://issues.apache.org/jira/browse/DRILL-1755 > Project: Apache Drill > Issue Type: Improvement >Reporter: Abhishek Girish >Priority: Major > Fix For: Future > > Attachments: drillbit.log > > > Publicly available JSON data sometimes have the following structure (arrays > as first level elements): > [ > {"accessLevel": "public" > }, > {"accessLevel": "private" > } > ] > Drill currently does not support Arrays or Scalars as first level elements. > Only maps are supported. We should add support for the arrays and scalars. > Log attached. -- This message was sent by Atlassian Jira (v8.3.4#803005)
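For comparison, ordinary JSON parsers already treat a top-level array as a sequence of records, which is the mapping the remaining DRILL-1755 case would need; a minimal Python illustration with the sample document from the description:

```python
import json

# A top-level JSON array of maps, as in the issue description:
doc = '[{"accessLevel": "public"},{"accessLevel": "private"}]'
records = json.loads(doc)   # each array element becomes one record/row
```

Drill's partial fix handles SELECT * over such files; the failing constant-projection query suggests the reader still assumes a map at the top level in some code paths.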
[jira] [Created] (DRILL-7452) Support comparison operator for Array
benj created DRILL-7452: --- Summary: Support comparison operator for Array Key: DRILL-7452 URL: https://issues.apache.org/jira/browse/DRILL-7452 Project: Apache Drill Issue Type: Wish Components: Functions - Drill Affects Versions: 1.16.0 Reporter: benj Attachments: example_array.parquet It would be useful to have a comparison operator for nested types, at least for Array. Sample file in attachment: example_array.parquet
{code:sql}
/* It's possible to do */
apache drill(1.16)> SELECT id, tags FROM `example_array.parquet`;
+--------+-----------+
|   id   |   tags    |
+--------+-----------+
| 7b8808 | [1,2,3]   |
| 7b8808 | [1,20,3]  |
| 55a4be | [1,3,5,6] |
+--------+-----------+

/* But it's not possible to use DISTINCT or ORDER BY on the field tags (ARRAY) */
/* https://drill.apache.org/docs/nested-data-limitations/ */
apache drill(1.16)> SELECT DISTINCT id, tags FROM `example_array.parquet` ORDER BY tags;
Error: SYSTEM ERROR: UnsupportedOperationException: Map, Array, Union or repeated scalar type should not be used in group by, order by or in a comparison operator. Drill does not support compare between BIGINT:REPEATED and BIGINT:REPEATED.
{code}
It's possible to do that in Postgres:
{code:sql}
SELECT DISTINCT id, tags
FROM (
  SELECT '7b8808' AS id, ARRAY[1,2,3] tags
  UNION SELECT '7b8808', ARRAY[1,20,3]
  UNION SELECT '55a4be', ARRAY[1,3,5,6]
) x ORDER BY tags;

7b8808;{1,2,3}
55a4be;{1,3,5,6}
7b8808;{1,20,3}
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
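The Postgres ordering shown above is plain element-wise (lexicographic) array comparison. Python defines the same ordering for lists, which makes the expected DISTINCT/ORDER BY result easy to state (sample rows copied from the output printed above):

```python
rows = [
    ("7b8808", [1, 2, 3]),
    ("7b8808", [1, 20, 3]),
    ("55a4be", [1, 3, 5, 6]),
]
# Lexicographic comparison: [1,2,3] < [1,3,5,6] < [1,20,3] because the first
# differing element decides (2 < 3, then 3 < 20).
ordered = sorted(rows, key=lambda r: r[1])
```

This is the comparison semantics the wish asks Drill to adopt for REPEATED types.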
[jira] [Updated] (DRILL-7379) Planning error
[ https://issues.apache.org/jira/browse/DRILL-7379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] benj updated DRILL-7379: Description: sample file: [^example.parquet] With data as: {code:sql} SELECT id, tags FROM `example_parquet`; +++ | id |tags| +++ | 7b8808 | ["peexe","signed","overlay"] | | 55a4ae | ["peexe","signed","upx","overlay"] | +++ {code} The next request is OK {code:sql} SELECT id, flatten(tags) tag FROM ( SELECT id, any_value(tags) tags FROM `example_parquet` GROUP BY id ) LIMIT 2; +++ | id | tag | +++ | 55a4ae | peexe | | 55a4ae | signed | +++ {code} But unexpectedly, the next query failed: {code:sql} SELECT tag, count(*) FROM ( SELECT flatten(tags) tag FROM ( SELECT id, any_value(tags) tags FROM `example_parquet` GROUP BY id ) ) GROUP BY tag; Error: SYSTEM ERROR: UnsupportedOperationException: Map, Array, Union or repeated scalar type should not be used in group by, order by or in a comparison operator. Drill does not support compare between MAP:REPEATED and MAP:REPEATED. /* Or other error with another set of data : Error: SYSTEM ERROR: SchemaChangeException: Failure while trying to materialize incoming schema. Errors: Error in expression at index 0. Error: Missing function implementation: [hash32asdouble(MAP-REPEATED, INT-REQUIRED)]. Full expression: null.. */ {code} These errors are incomprehensible because, the aggregate is on VARCHAR. 
More, the request works if decomposed in 2 request with with the creation of an intermediate table like below: {code:sql} CREATE TABLE `tmp.parquet` AS ( SELECT id, flatten(tags) tag FROM ( SELECT id, any_value(tags) tags FROM `example_parquet` GROUP BY id )); SELECT tag, count(*) c FROM `tmp_parquet` GROUP BY tag; +-+---+ | tag | c | +-+---+ | overlay | 2 | | peexe | 2 | | signed | 2 | | upx | 1 | +-+---+ {code} was: With data as: {code:sql} SELECT id, tags FROM `example_parquet`; +++ | id |tags| +++ | 7b8808 | ["peexe","signed","overlay"] | | 55a4ae | ["peexe","signed","upx","overlay"] | +++ {code} The next request is OK {code:sql} SELECT id, flatten(tags) tag FROM ( SELECT id, any_value(tags) tags FROM `example_parquet` GROUP BY id ) LIMIT 2; +++ | id | tag | +++ | 55a4ae | peexe | | 55a4ae | signed | +++ {code} But unexpectedly, the next query failed: {code:sql} SELECT tag, count(*) FROM ( SELECT flatten(tags) tag FROM ( SELECT id, any_value(tags) tags FROM `example_parquet` GROUP BY id ) ) GROUP BY tag; Error: SYSTEM ERROR: UnsupportedOperationException: Map, Array, Union or repeated scalar type should not be used in group by, order by or in a comparison operator. Drill does not support compare between MAP:REPEATED and MAP:REPEATED. /* Or other error with another set of data : Error: SYSTEM ERROR: SchemaChangeException: Failure while trying to materialize incoming schema. Errors: Error in expression at index 0. Error: Missing function implementation: [hash32asdouble(MAP-REPEATED, INT-REQUIRED)]. Full expression: null.. */ {code} These errors are incomprehensible because, the aggregate is on VARCHAR. 
More, the request works if decomposed in 2 request with with the creation of an intermediate table like below: {code:sql} CREATE TABLE `tmp.parquet` AS ( SELECT id, flatten(tags) tag FROM ( SELECT id, any_value(tags) tags FROM `example_parquet` GROUP BY id )); SELECT tag, count(*) c FROM `tmp_parquet` GROUP BY tag; +-+---+ | tag | c | +-+---+ | overlay | 2 | | peexe | 2 | | signed | 2 | | upx | 1 | +-+---+ {code} > Planning error > -- > > Key: DRILL-7379 > URL: https://issues.apache.org/jira/browse/DRILL-7379 > Project: Apache Drill > Issue Type: Bug > Components: Functions - Drill >Affects Versions: 1.16.0 >Reporter: benj >Priority: Major > Attachments: example.parquet > > > sample file: [^example.parquet] > With data as: > {code:sql} > SELECT id, tags FROM `example_parquet`; > +++ > | id |tags| > +++ > | 7b8808 | ["peexe","signed","overlay"] | > | 55a4ae | ["peexe","signed","upx","overlay"] | > +++ >
[jira] [Updated] (DRILL-7379) Planning error
[ https://issues.apache.org/jira/browse/DRILL-7379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] benj updated DRILL-7379: Attachment: example.parquet > Planning error > -- > > Key: DRILL-7379 > URL: https://issues.apache.org/jira/browse/DRILL-7379 > Project: Apache Drill > Issue Type: Bug > Components: Functions - Drill >Affects Versions: 1.16.0 >Reporter: benj >Priority: Major > Attachments: example.parquet > > > With data as: > {code:sql} > SELECT id, tags FROM `example_parquet`; > +++ > | id |tags| > +++ > | 7b8808 | ["peexe","signed","overlay"] | > | 55a4ae | ["peexe","signed","upx","overlay"] | > +++ > {code} > The next request is OK > {code:sql} > SELECT id, flatten(tags) tag > FROM ( > SELECT id, any_value(tags) tags > FROM `example_parquet` > GROUP BY id > ) LIMIT 2; > +++ > | id | tag | > +++ > | 55a4ae | peexe | > | 55a4ae | signed | > +++ > {code} > But unexpectedly, the next query failed: > {code:sql} > SELECT tag, count(*) > FROM ( > SELECT flatten(tags) tag > FROM ( > SELECT id, any_value(tags) tags > FROM `example_parquet` > GROUP BY id > ) > ) GROUP BY tag; > Error: SYSTEM ERROR: UnsupportedOperationException: Map, Array, Union or > repeated scalar type should not be used in group by, order by or in a > comparison operator. Drill does not support compare between MAP:REPEATED and > MAP:REPEATED. > /* Or other error with another set of data : > Error: SYSTEM ERROR: SchemaChangeException: Failure while trying to > materialize incoming schema. Errors: > > Error in expression at index 0. Error: Missing function implementation: > [hash32asdouble(MAP-REPEATED, INT-REQUIRED)]. Full expression: null.. > */ > {code} > These errors are incomprehensible because, the aggregate is on VARCHAR. 
> More, the request works if decomposed in 2 request with with the creation of > an intermediate table like below: > {code:sql} > CREATE TABLE `tmp.parquet` AS ( > SELECT id, flatten(tags) tag > FROM ( > SELECT id, any_value(tags) tags > FROM `example_parquet` > GROUP BY id > )); > SELECT tag, count(*) c FROM `tmp_parquet` GROUP BY tag; > +-+---+ > | tag | c | > +-+---+ > | overlay | 2 | > | peexe | 2 | > | signed | 2 | > | upx | 1 | > +-+---+ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
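Whatever the planner issue, the intended result of the failing DRILL-7379 query is easy to state: flatten each id's tags, then count per tag. A Python sketch using the two rows shown above reproduces the workaround's output table:

```python
from collections import Counter

rows = {
    "7b8808": ["peexe", "signed", "overlay"],
    "55a4ae": ["peexe", "signed", "upx", "overlay"],
}
# flatten(tags) followed by GROUP BY tag / count(*):
counts = Counter(tag for tags in rows.values() for tag in tags)
```

Each counted key is a VARCHAR, which is why the MAP:REPEATED comparison error is surprising.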
[jira] [Created] (DRILL-7449) memory leak parse_url function
benj created DRILL-7449: --- Summary: memory leak parse_url function Key: DRILL-7449 URL: https://issues.apache.org/jira/browse/DRILL-7449 Project: Apache Drill Issue Type: Bug Components: Functions - Drill Affects Versions: 1.16.0 Reporter: benj Requests with *parse_url* works well when the number of treated rows is low but produce memory leak when number of rows grows (~ between 500 000 and 1 million) (and for certain number of row sometimes the request works and sometimes it failed with memory leaks) Extract from dataset tested: {noformat} {"Attributable":true,"Description":"Website has been identified as malicious by Bing","FirstReportedDateTime":"2018-03-12T18:49:38Z","IndicatorExpirationDateTime":"2018-04-11T23:33:13Z","IndicatorProvider":"Bing","IndicatorThreatType":"MaliciousUrl","IsPartnerShareable":true,"IsProductLicensed":true,"LastReportedDateTime":"2018-03-12T18:49:38Z","NetworkDestinationAsn":15169,"NetworkDestinationIPv4":"172.217.8.193","NetworkDestinationPort":80,"Tags":["us"],"ThreatDetectionProduct":"ES","TLPLevel":"Amber","Url":"http://pasuruanbloggers.blogspot.ru/2012/12/beginilah-cara-orang-jepang-berpacaran.html","Version":1.5} {"Attributable":true,"Description":"Website has been identified as malicious by Bing","FirstReportedDateTime":"2018-03-12T18:14:51Z","IndicatorExpirationDateTime":"2018-04-11T23:33:13Z","IndicatorProvider":"Bing","IndicatorThreatType":"MaliciousUrl","IsPartnerShareable":true,"IsProductLicensed":true,"LastReportedDateTime":"2018-03-12T18:14:51Z","NetworkDestinationAsn":15169,"NetworkDestinationIPv4":"216.58.192.193","NetworkDestinationPort":80,"Tags":["us"],"ThreatDetectionProduct":"ES","TLPLevel":"Amber","Url":"http://pasuruanbloggers.blogspot.ru/2012/12/cara-membuat-widget-slideshow-postingan.html","Version":1.5} {noformat} Request tested: {code:sql} ALTER SESSION SET `store.format`='parquet'; ALTER SESSION SET `store.parquet.use_new_reader` = true; ALTER SESSION SET `store.parquet.compression` = 'snappy'; ALTER 
SESSION SET `drill.exec.functions.cast_empty_string_to_null`= true; ALTER SESSION SET `store.json.all_text_mode` = true; ALTER SESSION SET `exec.enable_union_type` = true; ALTER SESSION SET `store.json.all_text_mode` = true; CREATE TABLE dfs.test.`output_pqt` AS ( SELECT R.parsed.host AS Domain FROM ( SELECT parse_url(T.Url) AS parsed FROM dfs.test.`file.json` AS T ) AS R ORDER BY Domain ); {code} Result when memory leak: {noformat} Error: SYSTEM ERROR: IllegalStateException: Memory was leaked by query. Memory leaked: (256) Allocator(frag:3:0) 300/256/9337280/300 (res/actual/peak/limit) Fragment 3:0 Please, refer to logs for more information. [Error Id: 3ffa5b43-0dde-4518-bb5a-ea3aab97f3d4 on servor01:31010] (java.lang.IllegalStateException) Memory was leaked by query. Memory leaked: (256) Allocator(frag:3:0) 300/256/9337280/300 (res/actual/peak/limit) org.apache.drill.exec.memory.BaseAllocator.close():520 org.apache.drill.exec.ops.FragmentContextImpl.suppressingClose():552 org.apache.drill.exec.ops.FragmentContextImpl.close():546 org.apache.drill.exec.work.fragment.FragmentExecutor.closeOutResources():386 org.apache.drill.exec.work.fragment.FragmentExecutor.cleanup():214 org.apache.drill.exec.work.fragment.FragmentExecutor.run():329 org.apache.drill.common.SelfCleaningRunnable.run():38 java.util.concurrent.ThreadPoolExecutor.runWorker():1149 java.util.concurrent.ThreadPoolExecutor$Worker.run():624 java.lang.Thread.run():748 (state=,code=0) java.sql.SQLException: SYSTEM ERROR: IllegalStateException: Memory was leaked by query. Memory leaked: (256) Allocator(frag:3:0) 300/256/9337280/300 (res/actual/peak/limit) Fragment 3:0 Please, refer to logs for more information. [Error Id: 3ffa5b43-0dde-4518-bb5a-ea3aab97f3d4 on servor01:31010] (java.lang.IllegalStateException) Memory was leaked by query. 
Memory leaked: (256) Allocator(frag:3:0) 300/256/9337280/300 (res/actual/peak/limit) org.apache.drill.exec.memory.BaseAllocator.close():520 org.apache.drill.exec.ops.FragmentContextImpl.suppressingClose():552 org.apache.drill.exec.ops.FragmentContextImpl.close():546 org.apache.drill.exec.work.fragment.FragmentExecutor.closeOutResources():386 org.apache.drill.exec.work.fragment.FragmentExecutor.cleanup():214 org.apache.drill.exec.work.fragment.FragmentExecutor.run():329 org.apache.drill.common.SelfCleaningRunnable.run():38 java.util.concurrent.ThreadPoolExecutor.runWorker():1149 java.util.concurrent.ThreadPoolExecutor$Worker.run():624 java.lang.Thread.run():748 at org.apache.drill.jdbc.impl.DrillCursor.nextRowInternally(DrillCursor.java:538) at org.apache.drill.jdbc.impl.DrillCursor.loadInitialSchema(DrillCursor.java:610) at
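For reference, the parse_url(T.Url).host value the failing DRILL-7449 query extracts corresponds to the URL authority; a Python sketch with the two sample URLs from the dataset extract above (urllib stands in for Drill's parse_url, and is unaffected by the row-count-dependent leak):

```python
from urllib.parse import urlparse

urls = [
    "http://pasuruanbloggers.blogspot.ru/2012/12/beginilah-cara-orang-jepang-berpacaran.html",
    "http://pasuruanbloggers.blogspot.ru/2012/12/cara-membuat-widget-slideshow-postingan.html",
]
# The equivalent of SELECT parse_url(Url).host ... ORDER BY Domain:
domains = sorted({urlparse(u).netloc for u in urls})
```

That the leak appears only past roughly 500 000 to 1 000 000 rows points at buffer management in the UDF or sort, not at the per-row parsing itself.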
[jira] [Commented] (DRILL-7375) composite/nested type map/array convert_to/cast to varchar
[ https://issues.apache.org/jira/browse/DRILL-7375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16975198#comment-16975198 ] benj commented on DRILL-7375: - Waiting for a possible official version of such a feature, it is possible to use an own UDF like: {code:java} package org.apache.drill.contrib.function; import io.netty.buffer.DrillBuf; import org.apache.drill.exec.expr.DrillSimpleFunc; import org.apache.drill.exec.expr.annotations.FunctionTemplate; import org.apache.drill.exec.expr.annotations.FunctionTemplate.FunctionScope; import org.apache.drill.exec.expr.annotations.FunctionTemplate.NullHandling; import org.apache.drill.exec.expr.annotations.Output; import org.apache.drill.exec.expr.annotations.Param; import org.apache.drill.exec.vector.complex.reader.FieldReader; import org.apache.drill.exec.expr.holders.*; import javax.inject.Inject; public class ToString { @FunctionTemplate( name = "to_string", scope = FunctionScope.SIMPLE, nulls = NullHandling.NULL_IF_NULL) public static class NullableVarChar_Field_ToString implements DrillSimpleFunc { @Param FieldReader input; @Output VarCharHolder out; @Inject DrillBuf buffer; @Override public void setup() { } @Override public void eval() { String rowString = input.readObject().toString(); buffer = buffer.reallocIfNeeded(rowString.length()); buffer.setBytes(0, rowString.getBytes(), 0, rowString.length()); out.start = 0; out.end= rowString.length(); out.buffer = buffer; } } } {code} Example of use: {code:sql} apache drill> SELECT j, typeof(j) AS tj, to_string(j) AS strj, typeof(to_string(j)) AS tstrj FROM (SELECT convert_fromJSON('{a:["1","2","3"]}' ) j); +-+-+-+-+ | j | tj |strj | tstrj | +-+-+-+-+ | {"a":["1","2","3"]} | MAP | {"a":["1","2","3"]} | VARCHAR | +-+-+-+-+ 1 row selected (0.132 seconds) {code} With this function it's possible to "cast" anything in varchar and avoid storage problem in Parquet due to certain types. 
And it is eventually possible to cast the other way when requesting the Parquet file. > composite/nested type map/array convert_to/cast to varchar > -- > > Key: DRILL-7375 > URL: https://issues.apache.org/jira/browse/DRILL-7375 > Project: Apache Drill > Issue Type: Wish > Components: Functions - Drill >Affects Versions: 1.16.0 >Reporter: benj >Priority: Major > > As it possible to cast varchar to map (convert_from + JSON) with convert_from > or transform a varchar to array (split) > {code:sql} > SELECT a, typeof(a), sqltypeof(a) from (SELECT CONVERT_FROM('{a : 100, b: > 200}' ,'JSON') a); > +---+-++ > | a | EXPR$1 | EXPR$2 | > +---+-++ > | {"a":100,"b":200} | MAP | STRUCT | > +---+-++ > SELECT a, typeof(a), sqltypeof(a)FROM (SELECT split(str,',') AS a FROM ( > SELECT 'foo,bar' AS str)); > +---+-++ > |a | EXPR$1 | EXPR$2 | > +---+-++ > | ["foo","bar"] | VARCHAR | ARRAY | > +---+-++ > {code} > It will be very usefull : > # to have the capacity to "cast" the +_MAP_ into VARCHAR+ with a "cast > syntax" or with a "convert_to" possibility > Expected: > {code:sql} > SELECT a, typeof(a) ta, va, typeof(va) tva FROM ( > SELECT a, CAST(a AS varchar) va FROM (SELECT CONVERT_FROM('{a : 100, b: 200}' > ,'JSON') a)); > +---+--+---+-+ > | a | ta | va| tva | > +---+--+---+-+ > | {"a":100,"b":200} | MAP | {"a":100,"b":200} | VARCHAR | > +---+--+---+-+ > {code} > # to have the capacity to "cast" the +_ARRAY_ into VARCHAR+ with a "cast > syntax" or any other method > Expected > {code:sql} > SELECT a, sqltypeof(a) ta, va, sqltypeof(va) tva FROM ( > SELECT a, CAST(a AS varchar) va FROM (SELECT split(str,',') AS a FROM ( > SELECT 'foo,bar' AS str)); > +---+--+---+-+ > | a | ta | va| tva | > +---+--+---+-+ > | ["foo","bar"] | ARRAY| ["foo","bar"] | VARCHAR | > +---+--+---+-+ > {code} > Please note that these possibility of course exists in other database systems > Example with Postgres: > {code:sql} > SELECT '{"a":100,"b":200}'::json::text; > => {"a":100,"b":200} > SELECT 
array[1,2,3]::text; >
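The to_string UDF above amounts to serializing the value back to its JSON text; a Python sketch of the same "MAP to VARCHAR" conversion (json.dumps stands in for the UDF's readObject().toString()):

```python
import json

value = {"a": ["1", "2", "3"]}   # the MAP built by convert_fromJSON
# Compact separators match the {"a":["1","2","3"]} rendering shown above:
as_text = json.dumps(value, separators=(",", ":"))
```

Like the UDF, this makes any nested value storable as a plain VARCHAR column, at the cost of re-parsing when reading it back.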
[jira] [Updated] (DRILL-7444) JSON blank result on SELECT when too much byte in multiple files on Drill embedded
[ https://issues.apache.org/jira/browse/DRILL-7444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] benj updated DRILL-7444: Summary: JSON blank result on SELECT when too much byte in multiple files on Drill embedded (was: JSON blank result on SELECT when too much byte in multiple files on embedded) > JSON blank result on SELECT when too much byte in multiple files on Drill > embedded > -- > > Key: DRILL-7444 > URL: https://issues.apache.org/jira/browse/DRILL-7444 > Project: Apache Drill > Issue Type: Bug > Components: Storage - JSON >Affects Versions: 1.17.0 >Reporter: benj >Priority: Major > > 2 files (a.json and b.json) and the concat of these 2 file (ab.json) produce > different results on a simple _SELECT_ when using +Drill embedded+. > Problem appears from a number of byte (~ 102 400 000 in my case) > {code:bash} > #!/bin/bash > # script gen.sh to reproduce the problem > for ((i=1;i<=$1;++i)); > do > echo -n '{"At":"' > for j in {1..999}; > do > echo -n 'ab' > done > echo '"}' > done > {code} > {noformat} > == I == > $ gen.sh 1 > a.json > $ gen.sh 239 > b.json > $ wc -c *.json > 1 a.json > 239 b.json > 10239 total > $ bash drill-embedded > apache drill> SELECT * FROM dfs.tmp.`*.json` LIMIT 1; > ++ > | At | > ++ > | aab... | > ++ > => All is fine here > == II == > $ gen.sh 1 > a.json > $ gen.sh 240 > b.json > $ wc -c *.json > 1 a.json > 240 b.json > 10240 total > $ bash drill-embedded > apache drill> SELECT * FROM dfs.tmp.`*.json` LIMIT 1; > ++ > | At | > ++ > || > ++ > => In a surprising way field `At` is empty > == III == > $ gen.sh 10240 > ab.json > $ wc -c *.json > 10240 ab.json > $ bash drill-embedded > apache drill> SELECT * FROM dfs.tmp.`c.json` LIMIT 1; > ++ > |At | > ++ > | aab... 
| > ++ > => All is fine here although the number of lines is equal to case II > {noformat} > The Drill 1.17 version tested here is the latest as of 2019-11-13 > This problem does not appear with Drill embedded 1.16 -- This message was sent by Atlassian Jira (v8.3.4#803005)
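For the record, each line emitted by the gen.sh script in DRILL-7444 above has a fixed size, which makes the byte thresholds in the report easy to relate to record counts; a Python sketch of one generated record (assuming gen.sh exactly as shown):

```python
def gen_record():
    # One gen.sh line: {"At":"abab...ab"} with 999 'ab' pairs plus a newline.
    return '{"At":"' + "ab" * 999 + '"}\n'

record = gen_record()
size = len(record.encode())   # bytes per record, as counted by wc -c
```

So the I/II boundary in the report corresponds to crossing a fixed multiple of this record size across the globbed files, consistent with a buffer-boundary bug in the 1.17 JSON reader.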
[jira] [Updated] (DRILL-7444) JSON blank result on SELECT when too much byte in multiple files on embedded
[ https://issues.apache.org/jira/browse/DRILL-7444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] benj updated DRILL-7444: Description: 2 files (a.json and b.json) and the concat of these 2 file (ab.json) produce different results on a simple _SELECT_ when using +Drill embedded+. Problem appears from a number of byte (~ 102 400 000 in my case) {code:bash} #!/bin/bash # script gen.sh to reproduce the problem for ((i=1;i<=$1;++i)); do echo -n '{"At":"' for j in {1..999}; do echo -n 'ab' done echo '"}' done {code} {noformat} == I == $ gen.sh 1 > a.json $ gen.sh 239 > b.json $ wc -c *.json 1 a.json 239 b.json 10239 total $ bash drill-embedded apache drill> SELECT * FROM dfs.tmp.`*.json` LIMIT 1; ++ | At | ++ | aab... | ++ => All is fine here == II == $ gen.sh 1 > a.json $ gen.sh 240 > b.json $ wc -c *.json 1 a.json 240 b.json 10240 total $ bash drill-embedded apache drill> SELECT * FROM dfs.tmp.`*.json` LIMIT 1; ++ | At | ++ || ++ => In a surprising way field `At` is empty == III == $ gen.sh 10240 > ab.json $ wc -c *.json 10240 ab.json $ bash drill-embedded apache drill> SELECT * FROM dfs.tmp.`c.json` LIMIT 1; ++ |At | ++ | aab... | ++ => All is fine here although the number of lines is equal to case II {noformat} The Version of the Drill 1.17 tested here is the latest at 2019-11-13 This problem doesn't appears with Drill embedded 1.16 was: 2 files (a.json and b.json) and the concat of these 2 file (ab.json) produce different results on a simple _SELECT_ when using Drill embedded. Problem appears from a number of byte (~ 102 400 000 in my case) {code:bash} #!/bin/bash # script gen.sh to reproduce the problem for ((i=1;i<=$1;++i)); do echo -n '{"At":"' for j in {1..999}; do echo -n 'ab' done echo '"}' done {code} {noformat} == I == $ gen.sh 1 > a.json $ gen.sh 239 > b.json $ wc -c *.json 1 a.json 239 b.json 10239 total $ bash drill-embedded apache drill> SELECT * FROM dfs.tmp.`*.json` LIMIT 1; ++ | At | ++ | aab... 
| ++ => All is fine here == II == $ gen.sh 1 > a.json $ gen.sh 240 > b.json $ wc -c *.json 1 a.json 240 b.json 10240 total $ bash drill-embedded apache drill> SELECT * FROM dfs.tmp.`*.json` LIMIT 1; ++ | At | ++ || ++ => In a surprising way field `At` is empty == III == $ gen.sh 10240 > ab.json $ wc -c *.json 10240 ab.json $ bash drill-embedded apache drill> SELECT * FROM dfs.tmp.`c.json` LIMIT 1; ++ |At | ++ | aab... | ++ => All is fine here although the number of lines is equal to case II {noformat} The Version of the Drill 1.17 tested here is the latest at 2019-11-13 > JSON blank result on SELECT when too much byte in multiple files on embedded > > > Key: DRILL-7444 > URL: https://issues.apache.org/jira/browse/DRILL-7444 > Project: Apache Drill > Issue Type: Bug > Components: Storage - JSON >Affects Versions: 1.17.0 >Reporter: benj >Priority: Major > > 2 files (a.json and b.json) and the concat of these 2 file (ab.json) produce > different results on a simple _SELECT_ when using +Drill embedded+. > Problem appears from a number of byte (~ 102 400 000 in my case) > {code:bash} > #!/bin/bash > # script gen.sh to reproduce the problem > for ((i=1;i<=$1;++i)); > do > echo -n '{"At":"' > for j in {1..999}; > do > echo -n 'ab' > done > echo '"}' > done > {code} > {noformat} > == I == > $ gen.sh 1 > a.json > $ gen.sh 239 > b.json > $ wc -c *.json > 1 a.json > 239 b.json > 10239 total > $ bash drill-embedded > apache drill> SELECT * FROM dfs.tmp.`*.json` LIMIT 1; > ++ > | At | > ++ > | aab... | > ++ > => All is fine here > == II == > $ gen.sh 1 > a.json > $ gen.sh 240 > b.json > $ wc -c *.json > 1 a.json > 240 b.json > 10240 total > $ bash drill-embedded > apache drill> SELECT * FROM dfs.tmp.`*.json` LIMIT 1; > ++ > | At | > ++ > || > ++ > => In a surprising way field `At` is empty > == III == > $
[jira] [Updated] (DRILL-7444) JSON blank result on SELECT when too much byte in multiple files on embedded
[ https://issues.apache.org/jira/browse/DRILL-7444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] benj updated DRILL-7444:
Description:
Two files (a.json and b.json) and the concatenation of these two files (ab.json) produce different results on a simple _SELECT_ when using Drill embedded. The problem appears above a certain number of bytes (~ 102 400 000 in my case).
{code:bash}
#!/bin/bash
# script gen.sh to reproduce the problem
for ((i=1;i<=$1;++i)); do
  echo -n '{"At":"'
  for j in {1..999}; do
    echo -n 'ab'
  done
  echo '"}'
done
{code}
{noformat}
== I ==
$ gen.sh 1 > a.json
$ gen.sh 239 > b.json
$ wc -c *.json
1 a.json
239 b.json
10239 total
$ bash drill-embedded
apache drill> SELECT * FROM dfs.tmp.`*.json` LIMIT 1;
++
| At |
++
| aab... |
++
=> All is fine here

== II ==
$ gen.sh 1 > a.json
$ gen.sh 240 > b.json
$ wc -c *.json
1 a.json
240 b.json
10240 total
$ bash drill-embedded
apache drill> SELECT * FROM dfs.tmp.`*.json` LIMIT 1;
++
| At |
++
||
++
=> Surprisingly, the field `At` is empty

== III ==
$ gen.sh 10240 > ab.json
$ wc -c *.json
10240 ab.json
$ bash drill-embedded
apache drill> SELECT * FROM dfs.tmp.`c.json` LIMIT 1;
++
|At |
++
| aab... |
++
=> All is fine here although the number of lines is equal to case II
{noformat}
The Drill 1.17 tested here is the latest build as of 2019-11-13.

was: 2 files (a.json and b.json) and the concat of these 2 file (ab.json) produce different results on a simple _SELECT_. Problem appears from a number of byte (~ 102 400 000 in my case) {code:bash} #!/bin/bash # script gen.sh to reproduce the problem for ((i=1;i<=$1;++i)); do echo -n '{"At":"' for j in {1..999}; do echo -n 'ab' done echo '"}' done {code} {noformat} == I == $ gen.sh 1 > a.json $ gen.sh 239 > b.json $ wc -c *.json 1 a.json 239 b.json 10239 total $ bash drill-embedded apache drill> SELECT * FROM dfs.tmp.`*.json` LIMIT 1; ++ | At | ++ | aab... | ++ => All is fine here == II == $ gen.sh 1 > a.json $ gen.sh 240 > b.json $ wc -c *.json 1 a.json 240 b.json 10240 total $ bash drill-embedded apache drill> SELECT * FROM dfs.tmp.`*.json` LIMIT 1; ++ | At | ++ || ++ => In a surprising way field `At` is empty == III == $ gen.sh 10240 > ab.json $ wc -c *.json 10240 ab.json $ bash drill-embedded apache drill> SELECT * FROM dfs.tmp.`c.json` LIMIT 1; ++ |At | ++ | aab... | ++ => All is fine here although the number of lines is equal to case II {noformat} * This problem doesn't appears in Drill 1.16. * The Version of the Drill 1.17 tested here is the latest at 2019-11-13

Summary: JSON blank result on SELECT when too much byte in multiple files on embedded (was: JSON blank result on SELECT when too much byte in multiple files )
[jira] [Updated] (DRILL-7444) JSON blank result on SELECT when too much byte in multiple files
[ https://issues.apache.org/jira/browse/DRILL-7444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] benj updated DRILL-7444:
Description:
Two files (a.json and b.json) and the concatenation of these two files (ab.json) produce different results on a simple _SELECT_. The problem appears above a certain number of bytes (~ 102 400 000 in my case).
{code:bash}
#!/bin/bash
# script gen.sh to reproduce the problem
for ((i=1;i<=$1;++i)); do
  echo -n '{"At":"'
  for j in {1..999}; do
    echo -n 'ab'
  done
  echo '"}'
done
{code}
{noformat}
== I ==
$ gen.sh 1 > a.json
$ gen.sh 239 > b.json
$ wc -c *.json
1 a.json
239 b.json
10239 total
apache drill> SELECT * FROM dfs.tmp.`*.json` LIMIT 1;
++
| At |
++
| aab... |
++
=> All is fine here

== II ==
$ gen.sh 1 > a.json
$ gen.sh 240 > b.json
$ wc -c *.json
1 a.json
240 b.json
10240 total
apache drill> SELECT * FROM dfs.tmp.`*.json` LIMIT 1;
++
| At |
++
||
++
=> Surprisingly, the field `At` is empty

== III ==
$ gen.sh 10240 > ab.json
$ wc -c *.json
10240 ab.json
apache drill> SELECT * FROM dfs.tmp.`c.json` LIMIT 1;
++
|At |
++
| aab... |
++
=> All is fine here although the number of lines is equal to case II
{noformat}
* This problem does not appear in Drill 1.16.
* The Drill 1.17 tested here is the latest build as of 2019-11-13.

was: 2 files (a.json and b.json) and the concat of these 2 file (ab.json) produce different results on a simple _SELECT_. Problem appears from a number of byte (~ 102 400 000 in my case) {code:bash} #!/bin/bash # script gen.sh to reproduce the problem for ((i=1;i<=$1;++i)); do echo -n '{"At":"' for j in {1..999}; do echo -n 'ab' done echo '"}' done {code} {noformat} == I == $ gen.sh 1 > a.json $ gen.sh 239 > b.json $ wc -c *.json 1 a.json 239 b.json 10239 total apache drill> SELECT * FROM dfs.tmp.`*.json` LIMIT 1; ++ | At | ++ | aab... | ++ => All is fine here == II == $ gen.sh 1 > a.json $ gen.sh 240 > b.json $ wc -c *.json 1 a.json 240 b.json 10240 total apache drill> SELECT * FROM dfs.tmp.`*.json` LIMIT 1; ++ | At | ++ || ++ => In a surprising way field `At` is empty == III == $ gen.sh 10240 > ab.json $ wc -c *.json 10240 ab.json apache drill> SELECT * FROM dfs.tmp.`c.json` LIMIT 1; ++ |At | ++ | aab... | ++ => All is fine here although the number of lines is equal to case II {noformat} * This problem doesn't appears in Drill 1.16. * The Version of the Drill 1.17 tested here is the latest at 2019-11-13
[jira] [Created] (DRILL-7444) JSON blank result on SELECT when too much byte in multiple files
benj created DRILL-7444:
---
Summary: JSON blank result on SELECT when too much byte in multiple files
Key: DRILL-7444
URL: https://issues.apache.org/jira/browse/DRILL-7444
Project: Apache Drill
Issue Type: Bug
Components: Storage - JSON
Affects Versions: 1.17.0
Reporter: benj

Two files (a.json and b.json) and the concatenation of these two files (ab.json) produce different results on a simple _SELECT_. The problem appears above a certain number of bytes (~ 102 400 000 in my case).
{code:bash}
#!/bin/bash
# script gen.sh to reproduce the problem
for ((i=1;i<=$1;++i)); do
  echo -n '{"At":"'
  for j in {1..999}; do
    echo -n 'ab'
  done
  echo '"}'
done
{code}
{noformat}
== I ==
$ gen.sh 1 > a.json
$ gen.sh 239 > b.json
$ wc -c *.json
1 a.json
239 b.json
10239 total
apache drill> SELECT * FROM dfs.tmp.`*.json` LIMIT 1;
++
| At |
++
| aab... |
++
=> All is fine here

== II ==
$ gen.sh 1 > a.json
$ gen.sh 240 > b.json
$ wc -c *.json
1 a.json
240 b.json
10240 total
apache drill> SELECT * FROM dfs.tmp.`*.json` LIMIT 1;
++
| At |
++
||
++
=> Surprisingly, the field `At` is empty

== III ==
$ gen.sh 10240 > ab.json
$ wc -c *.json
10240 ab.json
apache drill> SELECT * FROM dfs.tmp.`c.json` LIMIT 1;
++
|At |
++
| aab... |
++
=> All is fine here although the number of lines is equal to case II
{noformat}
* This problem does not appear in Drill 1.16.
* The Drill 1.17 tested here is the latest build as of 2019-11-13.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
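For reference, the record shape that gen.sh emits can be sketched in Python (a hypothetical re-implementation for illustration, not part of the report); each record weighs a fixed 2008 bytes, which is how the byte counts above scale linearly with the argument:

```python
# Hypothetical Python equivalent of the gen.sh reproducer from this report.
def gen(n: int) -> str:
    # One JSON record per line: {"At":"abab...ab"} with 999 repetitions of "ab".
    record = '{"At":"' + "ab" * 999 + '"}\n'
    return record * n

one = gen(1)
print(len(one))  # 2008: 7 bytes of prefix + 1998 of payload + 2 of suffix + newline
```

Under this assumption the threshold the reporter observes is purely a total byte count, not a property of any single record.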
[jira] [Commented] (DRILL-7426) Json support lists of different types
[ https://issues.apache.org/jira/browse/DRILL-7426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961802#comment-16961802 ] benj commented on DRILL-7426:
- In my particular case the schema is difficult to predict but, as [~cgivre] says, an option to get/force values as strings would be great. What is particularly surprising in this case is that:
{noformat}
apache drill> ALTER SESSION SET `store.json.all_text_mode` = true;
apache drill> ALTER SESSION SET `exec.enable_union_type` = true;

/* I) does not work with a simple array */
{"name": "toto", "info": ["LOAD", 5, [] ], "response": 1 }
apache drill> SELECT * FROM dfs.test.`file.json` LIMIT 1;
Error: SYSTEM ERROR: SchemaChangeRuntimeException: Inner vector type mismatch. Requested type: [minor_type: VARCHAR mode: OPTIONAL ], actual type: [minor_type: UNION mode: OPTIONAL sub_type: VARCHAR sub_type: LIST ]

/* II) but works with an array of arrays */
{"name": "toto", "info": [ ["LOAD", 5, [] ] ], "response": 1 }
apache drill> SELECT * FROM dfs.test.`file.json` LIMIT 1;
+--+---+--+
| name | info | response |
+--+---+--+
| toto | [["LOAD","5",[]]] | 1 |
+--+---+--+
1 row selected (0.133 seconds)

/* III) and it also works when accessing the first field of the array of arrays (info[0]), which looks the same as the array of case (I) */
apache drill> SELECT *, info[0], info[0][0], info[0][1], info[0][2] FROM dfs.test.`file.json` LIMIT 1;
+--+--+--+-++++
| name | info | response | EXPR$1 | EXPR$2 | EXPR$3 | EXPR$4 |
+--+--+--+-++++
| toto | [["LOAD","5",[]],[]] | 1 | ["LOAD","5",[]] | LOAD | 5 | [] |
+--+--+--+-++++
1 row selected (0.185 seconds)
{noformat}

> Json support lists of different types
> -
>
> Key: DRILL-7426
> URL: https://issues.apache.org/jira/browse/DRILL-7426
> Project: Apache Drill
> Issue Type: Improvement
> Components: Documentation
> Affects Versions: 1.16.0
> Reporter: benj
> Priority: Trivial
>
> With a file.json like
> {code:json}
> {
>   "name": "toto",
>   "info": [["LOAD", []]],
>   "response": 1
> }
> {code}
> a simple SELECT gives an error:
> {code:sql}
> apache drill> SELECT * FROM dfs.test.`file.json`;
> Error: UNSUPPORTED_OPERATION ERROR: In a list of type VARCHAR, encountered a value of type LIST. Drill does not support lists of different types.
> {code}
> But there is an option _exec.enable_union_type_ that allows these requests:
> {code:sql}
> apache drill> ALTER SESSION SET `exec.enable_union_type` = true;
> apache drill> SELECT * FROM dfs.test.`file.json`;
> +--+---+--+
> | name | info | response |
> +--+---+--+
> | toto | [["LOAD",[]]] | 1 |
> +--+---+--+
> 1 row selected (0.283 seconds)
> {code}
> The existence of this option is not obvious, so it would be useful for the error message to mention the possibility of setting it:
> {noformat}
> Error: UNSUPPORTED_OPERATION ERROR: In a list of type VARCHAR, encountered a value of type LIST. Drill does not support lists of different types. SET the option 'exec.enable_union_type' to true and try again;
> {noformat}
> This behaviour is already used for other errors, for example:
> {noformat}
> ...
> Error: UNSUPPORTED_OPERATION ERROR: This query cannot be planned possibly due to either a cartesian join or an inequality join.
> If a cartesian or inequality join is used intentionally, set the option 'planner.enable_nljoin_for_scalar_only' to false and try again.
> {noformat}
[jira] [Created] (DRILL-7426) Json support lists of different types
benj created DRILL-7426:
---
Summary: Json support lists of different types
Key: DRILL-7426
URL: https://issues.apache.org/jira/browse/DRILL-7426
Project: Apache Drill
Issue Type: Improvement
Components: Documentation
Affects Versions: 1.16.0
Reporter: benj

With a file.json like
{code:json}
{
  "name": "toto",
  "info": [["LOAD", []]],
  "response": 1
}
{code}
a simple SELECT gives an error:
{code:sql}
apache drill> SELECT * FROM dfs.test.`file.json`;
Error: UNSUPPORTED_OPERATION ERROR: In a list of type VARCHAR, encountered a value of type LIST. Drill does not support lists of different types.
{code}
But there is an option _exec.enable_union_type_ that allows these requests:
{code:sql}
apache drill> ALTER SESSION SET `exec.enable_union_type` = true;
apache drill> SELECT * FROM dfs.test.`file.json`;
+--+---+--+
| name | info | response |
+--+---+--+
| toto | [["LOAD",[]]] | 1 |
+--+---+--+
1 row selected (0.283 seconds)
{code}
The existence of this option is not obvious, so it would be useful for the error message to mention the possibility of setting it:
{noformat}
Error: UNSUPPORTED_OPERATION ERROR: In a list of type VARCHAR, encountered a value of type LIST. Drill does not support lists of different types. SET the option 'exec.enable_union_type' to true and try again;
{noformat}
This behaviour is already used for other errors, for example:
{noformat}
...
Error: UNSUPPORTED_OPERATION ERROR: This query cannot be planned possibly due to either a cartesian join or an inequality join.
If a cartesian or inequality join is used intentionally, set the option 'planner.enable_nljoin_for_scalar_only' to false and try again.
{noformat}
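The mixed-type list itself is valid JSON; a small Python check (illustrative only, unrelated to Drill's implementation) shows the two element types that a single-typed list vector cannot hold together:

```python
import json

# The record from this ticket: "info" holds a list that mixes a string and a list.
rec = json.loads('{"name": "toto", "info": [["LOAD", []]], "response": 1}')

# A dynamically typed reader accepts this without complaint; Drill's value
# vectors want one element type per list, hence the UNSUPPORTED_OPERATION
# error unless `exec.enable_union_type` (or all_text_mode) is enabled.
types = sorted(type(v).__name__ for v in rec["info"][0])
print(types)  # ['list', 'str']
```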
[jira] [Created] (DRILL-7420) window function improve ROWS clause/frame possibilities
benj created DRILL-7420:
---
Summary: window function improve ROWS clause/frame possibilities
Key: DRILL-7420
URL: https://issues.apache.org/jira/browse/DRILL-7420
Project: Apache Drill
Issue Type: New Feature
Affects Versions: 1.16.0
Reporter: benj

The window frame possibilities are currently limited in Apache Drill. The ROWS clause is only possible with "BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW". It would be useful to be able to use:
* "BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING"
* "BETWEEN x PRECEDING AND y FOLLOWING"
{code:sql}
/* The ROWS clause is only possible with "BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW" */
apache drill> SELECT *, sum(a) OVER(ORDER BY b ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) FROM (SELECT 1 a, 1 b, 1 c);
+---+---+---++
| a | b | c | EXPR$3 |
+---+---+---++
| 1 | 1 | 1 | 1 |
+---+---+---++
1 row selected (1.357 seconds)

/* ROWS is currently not possible with "BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING" (it is possible with RANGE, but with a single ORDER BY only) */
apache drill> SELECT *, sum(a) OVER(ORDER BY b, c ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) FROM (SELECT 1 a, 1 b, 1 c);
Error: UNSUPPORTED_OPERATION ERROR: This type of window frame is currently not supported
See Apache Drill JIRA: DRILL-3188

/* ROWS is currently not possible with "BETWEEN x PRECEDING AND y FOLLOWING" */
apache drill> SELECT *, sum(a) OVER(ORDER BY b ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) FROM (SELECT 1 a, 1 b, 1 c);
Error: UNSUPPORTED_OPERATION ERROR: This type of window frame is currently not supported
See Apache Drill JIRA: DRILL-3188
{code}
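The requested `ROWS BETWEEN x PRECEDING AND y FOLLOWING` semantics can be sketched in Python (an illustrative model of the frame, not Drill code): each output row sums a sliding window of physical rows around the current one.

```python
def sum_rows_between(values, preceding, following):
    """SUM() over ROWS BETWEEN <preceding> PRECEDING AND <following> FOLLOWING:
    for each row, sum a window of physical rows around it, clipped at the
    partition edges."""
    out = []
    for i in range(len(values)):
        lo = max(0, i - preceding)
        hi = min(len(values), i + following + 1)
        out.append(sum(values[lo:hi]))
    return out

print(sum_rows_between([1, 2, 3, 4], 1, 1))  # [3, 6, 9, 7]
```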
[jira] [Commented] (DRILL-7017) lz4 codec for (un)compression
[ https://issues.apache.org/jira/browse/DRILL-7017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16957093#comment-16957093 ] benj commented on DRILL-7017:
- I'm not sure I understand, because lz4 is already shipped (by default) in jars/3rdparty/lz4-1.3.0.jar in Apache Drill, and it doesn't work. Even after adding "org.apache.hadoop.io.compress.Lz4Codec" to io.compression.codecs in core-site.xml and setting -Djava.library.path=/usr/hdp/.../lib/native/:
{code:sql}
SELECT * FROM dfs.test.`a.csvh.lz4`;
Error: EXECUTION_ERROR ERROR: native lz4 library not available
{code}

> lz4 codec for (un)compression
> -
>
> Key: DRILL-7017
> URL: https://issues.apache.org/jira/browse/DRILL-7017
> Project: Apache Drill
> Issue Type: Wish
> Components: Storage - Text CSV
> Affects Versions: 1.15.0
> Reporter: benj
> Priority: Major
>
> I didn't find in the documentation which compression formats are supported. But as it is possible to use Drill on a compressed file, like
> {code:java}
> SELECT * FROM tmp.`myfile.csv.gz`;
> {code}
> it would be useful to have this functionality for lz4 files ([https://github.com/lz4/lz4])
[jira] [Created] (DRILL-7404) window function RANGE with compound ORDER BY
benj created DRILL-7404:
---
Summary: window function RANGE with compound ORDER BY
Key: DRILL-7404
URL: https://issues.apache.org/jira/browse/DRILL-7404
Project: Apache Drill
Issue Type: Improvement
Components: Documentation
Affects Versions: 1.16.0
Reporter: benj

While creating ticket CALCITE-3402 (asking for improved window functions), it appeared that the Drill documentation is not up to date: [https://drill.apache.org/docs/aggregate-window-functions/]
{code:java}
frame_clause
If an ORDER BY clause is used for an aggregate function, an explicit frame clause is required. The frame clause refines the set of rows in a function's window, including or excluding sets of rows within the ordered result. The frame clause consists of the ROWS or RANGE keyword and associated specifiers.
{code}
But it is currently (1.16) possible to write an ORDER BY clause in a window function +without+ specifying an explicit frame clause. In this case, an +implicit+ frame clause is used. Normally the default/implicit framing option is {{RANGE UNBOUNDED PRECEDING}}, which is the same as {{RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW}} (and this should perhaps also be stated more explicitly in the documentation).
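The implicit {{RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW}} frame described above differs from a ROWS frame in that peers of the current row (rows with equal ORDER BY keys) are included together. A small Python sketch of that semantics (illustrative model with hypothetical names, not Drill code):

```python
def sum_default_range_frame(rows):
    """rows: list of (order_key, value) pairs already sorted by order_key.
    RANGE UNBOUNDED PRECEDING sums every row whose key is <= the current
    row's key, so peers with equal keys all receive the same running total."""
    return [sum(v for k, v in rows if k <= key) for key, _ in rows]

# Two peer rows with key 1 both see 10 + 20; the key-2 row sees all three.
print(sum_default_range_frame([(1, 10), (1, 20), (2, 5)]))  # [30, 30, 35]
```

A ROWS frame over the same input would instead give [10, 30, 35], which is why the distinction matters when ORDER BY keys are not unique.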
[jira] [Updated] (DRILL-7291) parquet with compression gzip doesn't work well
[ https://issues.apache.org/jira/browse/DRILL-7291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] benj updated DRILL-7291:
Component/s: Documentation

> parquet with compression gzip doesn't work well
> ---
>
> Key: DRILL-7291
> URL: https://issues.apache.org/jira/browse/DRILL-7291
> Project: Apache Drill
> Issue Type: Bug
> Components: Documentation, Storage - Parquet
> Affects Versions: 1.15.0, 1.16.0
> Reporter: benj
> Priority: Major
> Attachments: 0_0_0.parquet, short_no_binary_quote.csvh, sqlline_error.log
>
> Creating a parquet file with compression=gzip produces bad results.
> Example:
> * input: file_pqt (compression=none)
> {code:java}
> ALTER SESSION SET `store.format`='parquet';
> ALTER SESSION SET `store.parquet.compression` = 'snappy';
> CREATE TABLE `file_snappy_pqt` AS(SELECT * FROM `file_pqt`);
> ALTER SESSION SET `store.parquet.compression` = 'gzip';
> CREATE TABLE `file_gzip_pqt` AS(SELECT * FROM `file_pqt`);
> {code}
> Then compare the content of the different parquet files:
> {code:java}
> ALTER SESSION SET `store.parquet.use_new_reader` = true;
> SELECT COUNT(*) FROM `file_pqt`;        => 15728036
> SELECT COUNT(*) FROM `file_snappy_pqt`; => 15728036
> SELECT COUNT(*) FROM `file_gzip_pqt`;   => 15728036
> => OK
> SELECT COUNT(*) FROM `file_pqt` WHERE `Code` = '';        => 0
> SELECT COUNT(*) FROM `file_snappy_pqt` WHERE `Code` = ''; => 0
> SELECT COUNT(*) FROM `file_gzip_pqt` WHERE `Code` = '';   => 14744966
> => NOK
> SELECT COUNT(*) FROM `file_pqt` WHERE `Code2` = '';        => 0
> SELECT COUNT(*) FROM `file_snappy_pqt` WHERE `Code2` = ''; => 0
> SELECT COUNT(*) FROM `file_gzip_pqt` WHERE `Code2` = '';   => 14744921
> => NOK
> {code}
> _(There is no NULL value in these files.)_
> _(With exec.storage.enable_v3_text_reader=true it gives the same results)_
> So while the parquet file contains the right number of rows, the values in the different columns are not identical.
> Some "random" values of the _gzip parquet_ are reduced to empty strings.
> I think the problem is in the reader and not the writer because:
> {code:java}
> SELECT COUNT(*) FROM `file_pqt` WHERE `CRC32` = 'B33D600C';      => 2
> SELECT COUNT(*) FROM `file_gzip_pqt` WHERE `CRC32` = 'B33D600C'; => 0
> {code}
> but
> {code:java}
> hadoop jar parquet-tools-1.10.0.jar cat file_gzip_pqt/1_0_0.parquet | grep -c "B33D600C"
> 2019-06-12 11:45:23,738 INFO hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 3597092 records.
> 2019-06-12 11:45:23,739 INFO hadoop.InternalParquetRecordReader: at row 0. reading next block
> 2019-06-12 11:45:23,804 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
> 2019-06-12 11:45:23,805 INFO compress.CodecPool: Got brand-new decompressor [.gz]
> 2019-06-12 11:45:23,815 INFO hadoop.InternalParquetRecordReader: block read in memory in 76 ms. row count = 3597092
> 2
> {code}
> So the values are indeed present in the _Apache Parquet_ file but cannot be exploited via _Apache Drill_.
> Attached is an extract (the original file is 2.2 GB) which produces the same behaviour.
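The report's conclusion that the writer is fine is consistent with gzip itself being lossless; a stdlib round-trip (illustrative only, unrelated to Drill's reader code) shows the bytes survive compression intact:

```python
import gzip

# The CRC32 value the report greps for, repeated to make a compressible buffer.
data = b"B33D600C" * 1000
packed = gzip.compress(data)

# gzip is lossless: whatever the writer stored decompresses byte-for-byte,
# so values missing from query results point at the reading side.
assert gzip.decompress(packed) == data
print(len(data), len(packed) < len(data))
```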
[jira] [Updated] (DRILL-7246) DESCRIBE on Parquet File
[ https://issues.apache.org/jira/browse/DRILL-7246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] benj updated DRILL-7246:
Description:
It would be nice to be able to use DESCRIBE on a Parquet file. Example:
{code:sql}
DESCRIBE dfs.tmp.`test_parquet`;
+---+---+-+
| COLUMN_NAME | DATA_TYPE | IS_NULLABLE |
+---+---+-+
| MyColumn | INT | YES |
| AnotherColumn | DATE | NO |
| AdditionnalColumn | VARCHAR | YES |
+---+---+-+
{code}
Going further, why not offer this possibility for any file (in the case of CSV, DATA_TYPE would be VARCHAR and IS_NULLABLE YES)? Even if that is of limited use, it would at least list the available columns.

was: It will be nice if it's possible to use DESCRIBE on Parquet File. Example : {code:sql} DESCRIBE dfs.tmp.`test_parquet`; +---+---+-+ | COLUMN_NAME | DATA_TYPE | IS_NULLABLE | +---+---+-+ | MyColumn | INT | YES | | AnotherColumn | DATE | NO | | AdditionnalColumn | VARCHAR | YES | +---+---+-+ {code} And more why not propose this possibility for any file (in the case of the CSV, DATA_TYPE will be VARCHAR and IS_NULLABLE YES) - if it's a little bit useless, this would at least list the available columns.

> DESCRIBE on Parquet File
>
> Key: DRILL-7246
> URL: https://issues.apache.org/jira/browse/DRILL-7246
> Project: Apache Drill
> Issue Type: Improvement
> Components: Storage - Parquet
> Affects Versions: 1.16.0
> Reporter: benj
> Priority: Major
>
> It would be nice to be able to use DESCRIBE on a Parquet file. Example:
> {code:sql}
> DESCRIBE dfs.tmp.`test_parquet`;
> +---+---+-+
> | COLUMN_NAME | DATA_TYPE | IS_NULLABLE |
> +---+---+-+
> | MyColumn | INT | YES |
> | AnotherColumn | DATE | NO |
> | AdditionnalColumn | VARCHAR | YES |
> +---+---+-+
> {code}
> Going further, why not offer this possibility for any file (in the case of CSV, DATA_TYPE would be VARCHAR and IS_NULLABLE YES)? Even if that is of limited use, it would at least list the available columns.
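The CSV half of the proposal is easy to picture: every column would come back as nullable VARCHAR. A small Python sketch of that hypothetical DESCRIBE output (function name and row shape are assumptions for illustration, not Drill behaviour):

```python
import csv
import io

# Hypothetical DESCRIBE for a CSV-with-headers file: one
# (COLUMN_NAME, DATA_TYPE, IS_NULLABLE) row per header column,
# always VARCHAR / YES as suggested in the ticket.
def describe_csvh(text: str):
    header = next(csv.reader(io.StringIO(text)))
    return [(name, "VARCHAR", "YES") for name in header]

sample = "MyColumn,AnotherColumn\n1,2020-01-01\n"
print(describe_csvh(sample))
```

Even without type inference, such an output would already answer the "what columns does this file have?" question the ticket raises.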
[jira] [Commented] (DRILL-7291) parquet with compression gzip doesn't work well
[ https://issues.apache.org/jira/browse/DRILL-7291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16950784#comment-16950784 ] benj commented on DRILL-7291:
- [~arina] you are right: with the option +store.parquet.use_new_reader+ the problem disappears in 1.17 (last commit) and even in 1.16 and 1.15. I had not considered this option since it is marked "Not supported in this release." in its Comment field. So perhaps the problem can be addressed simply by changing the comment of the option _store.parquet.use_new_reader_. And maybe this option could become true by default in a future version (but could there be negative impacts?). Your help in solving this problem is appreciated.
[jira] [Commented] (DRILL-7291) parquet with compression gzip doesn't work well
[ https://issues.apache.org/jira/browse/DRILL-7291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16949569#comment-16949569 ] benj commented on DRILL-7291:
- Please find in attachment the error log [^sqlline_error.log]. I was intrigued by the "sc: REQUIRED BINARY O:UTF8 R:0 D:0" in relation to the error "Error: INTERNAL_ERROR ERROR: null", so I tried the query below, which solves the problem here.
{code:sql}
CREATE TABLE dfs.tmp.`t2` AS
SELECT sha1, md5, crc32, fn, fs, pc, osc,
       COALESCE(sc,'') /* USE COALESCE to avoid a NULL VALUE although empty csv values are protected by quotes */
FROM dfs.tmp.`short_no_binary_quote.csvh`;

SELECT * FROM dfs.tmp.`t2`;
+--+--+--++---+---+-++
| sha1 | md5 | crc32 | fn | fs | pc | osc | EXPR$7 |
+--+--+--++---+---+-++
| 000F8527DCCAB6642252BBCFA1B8072D33EE | 68CE322D8A896B6E4E7E3F18339EC85C | E39149E4 | Blended_Coolers_Vanilla_NL.png | 30439 | 19042 | 362 | |
| 0091728653B7D55DF30BFAFE86C52F2F4A59 | 81AE5D302A0E6D33182CB69ED791181C | 5594C3B0 | ic_menu_notifications.png | 366 | 21386 | 362 | |
| 065F1900120613745CC5E25A57C84624DC2B | AEB7C147EF7B7CEE91807B500A378BA4 | 24400952 | points_program_fragment.xml | 1684 | 21842 | 362 | |
+--+--+--++---+---+-++
3 rows selected (0.111 seconds)
{code}
But there is something wrong, because:
{code:sql}
CREATE TABLE dfs.tmp.`t2` AS
SELECT sha1, md5, crc32, fn, fs, pc, osc,
       COALESCE(sc,'FLAGFLAGFLAG') /* USE COALESCE to avoid a NULL VALUE although empty csv values are protected by quotes - NOTE that FLAGFLAGFLAG will not appear in EXPR$7 */
FROM dfs.tmp.`short_no_binary_quote.csvh`;

SELECT * FROM dfs.tmp.`t2` /* FLAGFLAGFLAG does not appear in EXPR$7 */;
+--+--+--++---+---+-++
| sha1 | md5 | crc32 | fn | fs | pc | osc | EXPR$7 |
+--+--+--++---+---+-++
| 000F8527DCCAB6642252BBCFA1B8072D33EE | 68CE322D8A896B6E4E7E3F18339EC85C | E39149E4 | Blended_Coolers_Vanilla_NL.png | 30439 | 19042 | 362 | |
| 0091728653B7D55DF30BFAFE86C52F2F4A59 | 81AE5D302A0E6D33182CB69ED791181C | 5594C3B0 | ic_menu_notifications.png | 366 | 21386 | 362 | |
| 065F1900120613745CC5E25A57C84624DC2B | AEB7C147EF7B7CEE91807B500A378BA4 | 24400952 | points_program_fragment.xml | 1684 | 21842 | 362 | |
+--+--+--++---+---+-++
3 rows selected (0.156 seconds)
{code}
[jira] [Updated] (DRILL-7291) parquet with compression gzip doesn't work well
[ https://issues.apache.org/jira/browse/DRILL-7291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] benj updated DRILL-7291: Attachment: sqlline_error.log > parquet with compression gzip doesn't work well > --- > > Key: DRILL-7291 > URL: https://issues.apache.org/jira/browse/DRILL-7291 > Project: Apache Drill > Issue Type: Bug > Components: Storage - Parquet >Affects Versions: 1.15.0, 1.16.0 >Reporter: benj >Priority: Major > Attachments: 0_0_0.parquet, short_no_binary_quote.csvh, > sqlline_error.log > > > Create a parquet with compression=gzip produce bad result. > Example: > * input: file_pqt (compression=none) > {code:java} > ALTER SESSION SET `store.format`='parquet'; > ALTER SESSION SET `store.parquet.compression` = 'snappy'; > CREATE TABLE `file_snappy_pqt` > AS(SELECT * FROM `file_pqt`); > ALTER SESSION SET `store.parquet.compression` = 'gzip'; > CREATE TABLE `file_gzip_pqt` > AS(SELECT * FROM `file_pqt`);{code} > Then compare the content of the different parquet files: > {code:java} > ALTER SESSION SET `store.parquet.use_new_reader` = true; > SELECT COUNT(*) FROM `file_pqt`;=> 15728036 > SELECT COUNT(*) FROM `file_snappy_pqt`; => 15728036 > SELECT COUNT(*) FROM `file_gzip_pqt`; => 15728036 > => OK > SELECT COUNT(*) FROM `file_pqt` WHERE `Code` = '';=> 0 > SELECT COUNT(*) FROM `file_snappy_pqt` WHERE `Code` = ''; => 0 > SELECT COUNT(*) FROM `file_gzip_pqt` WHERE `Code` = ''; => 14744966 > => NOK > SELECT COUNT(*) FROM `file_pqt` WHERE `Code2` = '';=> 0 > SELECT COUNT(*) FROM `file_snappy_pqt` WHERE `Code2` = ''; => 0 > SELECT COUNT(*) FROM `file_gzip_pqt` WHERE `Code2` = ''; => 14744921 > => NOK{code} > _(There is no NULL value in these files.)_ > _(With exec.storage.enable_v3_text_reader=true it gives same results)_ > So If the parquet file contains the right number of rows, the values in the > different columns are not identical. 
> Some "random" values of the _gzip parquet_ are reduce to empty string > I think the problem is from the reader and not the writer because: > {code:java} > SELECT COUNT(*) FROM `file_pqt` WHERE `CRC32` = 'B33D600C'; => 2 > SELECT COUNT(*) FROM `file_gzip_pqt` WHERE `CRC32` = 'B33D600C'; => 0 > {code} > but > {code:java} > hadoop jar parquet-tools-1.10.0.jar cat file_gzip_pqt/1_0_0.parquet | grep -c > "B33D600C" > 2019-06-12 11:45:23,738 INFO hadoop.InternalParquetRecordReader: RecordReader > initialized will read a total of 3597092 records. > 2019-06-12 11:45:23,739 INFO hadoop.InternalParquetRecordReader: at row 0. > reading next block > 2019-06-12 11:45:23,804 INFO zlib.ZlibFactory: Successfully loaded & > initialized native-zlib library > 2019-06-12 11:45:23,805 INFO compress.CodecPool: Got brand-new decompressor > [.gz] > 2019-06-12 11:45:23,815 INFO hadoop.InternalParquetRecordReader: block read > in memory in 76 ms. row count = 3597092 > 2 > {code} > So the values are well present in the _Apache Parquet_ file but can't be > exploited via _Apache Drill_. > In attachment an extract (the original file is 2.2 Go) which produce the same > behaviour. -- This message was sent by Atlassian Jira (v8.3.4#803005)
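The reader-vs-writer reasoning above (values visible to parquet-tools but not through Drill) is consistent with gzip itself being lossless; a minimal stdlib sketch, independent of Drill or Parquet internals, illustrating the round-trip:

```python
import gzip

# gzip is a lossless codec: whatever bytes go in come back out unchanged,
# so a value recoverable by one reader but not another points at that reader.
payload = b"B33D600C," * 1000  # repetitive data, compresses well
compressed = gzip.compress(payload)
restored = gzip.decompress(compressed)

assert restored == payload
assert len(compressed) < len(payload)  # genuinely compressed
```

Nothing here touches Parquet page framing, of course, which is where the "Error reading page data" failures below actually occur.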
[jira] [Commented] (DRILL-7291) parquet with compression gzip doesn't work well
[ https://issues.apache.org/jira/browse/DRILL-7291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16949501#comment-16949501 ] benj commented on DRILL-7291: - Exactly done the same as you (see below), but obtain an "+Error: INTERNAL_ERROR ERROR: null+" at the end {code:sql} ./drill-embedded Apache Drill 1.17.0-SNAPSHOT "A Drill in the hand is better than two in the bush." apache drill> ALTER SESSION SET `store.parquet.compression` = 'gzip'; +--++ | ok | summary | +--++ | true | store.parquet.compression updated. | +--++ 1 row selected (0.339 seconds) apache drill> select * from dfs.tmp.`short_no_binary_quote.csvh`; +--+--+--++---+---+-++ | sha1 | md5| crc32 | fn | fs | pc | osc | sc | +--+--+--++---+---+-++ | 000F8527DCCAB6642252BBCFA1B8072D33EE | 68CE322D8A896B6E4E7E3F18339EC85C | E39149E4 | Blended_Coolers_Vanilla_NL.png | 30439 | 19042 | 362 || | 0091728653B7D55DF30BFAFE86C52F2F4A59 | 81AE5D302A0E6D33182CB69ED791181C | 5594C3B0 | ic_menu_notifications.png | 366 | 21386 | 362 || | 065F1900120613745CC5E25A57C84624DC2B | AEB7C147EF7B7CEE91807B500A378BA4 | 24400952 | points_program_fragment.xml| 1684 | 21842 | 362 || +--+--+--++---+---+-++ 3 rows selected (1.008 seconds) apache drill> use dfs.tmp; +--+-+ | ok | summary | +--+-+ | true | Default schema changed to [dfs.tmp] | +--+-+ 1 row selected (0.087 seconds) apache drill (dfs.tmp)> create table t as select * from dfs.tmp.`short_no_binary_quote.csvh`; +--+---+ | Fragment | Number of records written | +--+---+ | 0_0 | 3 | +--+---+ 1 row selected (0.306 seconds) apache drill (dfs.tmp)> select * from t; Error: INTERNAL_ERROR ERROR: null Fragment 0:0 {code} But, the parquet seems OK {code:bash} hadoop jar parquet-tools-1.10.0.jar cat /tmp/t/0_0_0.parquet 2019-10-11 15:48:14,047 INFO hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 3 records. 2019-10-11 15:48:14,048 INFO hadoop.InternalParquetRecordReader: at row 0. 
reading next block 2019-10-11 15:48:14,063 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library 2019-10-11 15:48:14,063 INFO compress.CodecPool: Got brand-new decompressor [.gz] 2019-10-11 15:48:14,068 INFO hadoop.InternalParquetRecordReader: block read in memory in 19 ms. row count = 3 sha1 = 000F8527DCCAB6642252BBCFA1B8072D33EE md5 = 68CE322D8A896B6E4E7E3F18339EC85C crc32 = E39149E4 fn = Blended_Coolers_Vanilla_NL.png fs = 30439 pc = 19042 osc = 362 sc = sha1 = 0091728653B7D55DF30BFAFE86C52F2F4A59 md5 = 81AE5D302A0E6D33182CB69ED791181C crc32 = 5594C3B0 fn = ic_menu_notifications.png fs = 366 pc = 21386 osc = 362 sc = sha1 = 065F1900120613745CC5E25A57C84624DC2B md5 = AEB7C147EF7B7CEE91807B500A378BA4 crc32 = 24400952 fn = points_program_fragment.xml fs = 1684 pc = 21842 osc = 362 sc = hadoop jar parquet-tools-1.10.0.jar meta /tmp/t/0_0_0.parquet 2019-10-11 15:51:25,255 INFO hadoop.ParquetFileReader: Initiating action with parallelism: 5 2019-10-11 15:51:25,256 INFO hadoop.ParquetFileReader: reading another 1 footers 2019-10-11 15:51:25,256 INFO hadoop.ParquetFileReader: Initiating action with parallelism: 5 file:file:/tmp/t/0_0_0.parquet creator: parquet-mr version 1.10.0 (build ${buildNumber}) extra: drill-writer.version = 3 extra: drill.version = 1.17.0-SNAPSHOT file schema: root sha1:REQUIRED BINARY O:UTF8 R:0 D:0 md5: REQUIRED BINARY O:UTF8 R:0 D:0 crc32: REQUIRED BINARY O:UTF8 R:0 D:0 fn: REQUIRED BINARY O:UTF8 R:0 D:0 fs: REQUIRED BINARY O:UTF8 R:0 D:0 pc: REQUIRED BINARY O:UTF8 R:0 D:0 osc: REQUIRED BINARY O:UTF8 R:0 D:0 sc: REQUIRED BINARY O:UTF8 R:0 D:0 row group 1: RC:3 TS:914 OFFSET:4 sha1: BINARY GZIP DO:0 FPO:4 SZ:210/239/1,14 VC:3 ENC:BIT_PACKED,PLAIN ST:[min:
[jira] [Comment Edited] (DRILL-7348) Aggregate on Subquery with Select Distinct or UNION fails to Group By
[ https://issues.apache.org/jira/browse/DRILL-7348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16947512#comment-16947512 ] benj edited comment on DRILL-7348 at 10/9/19 10:01 AM: --- I have tested in 1.15 and 1.16. I think that the problem of [~snapdoodle] is that `date` is a reserved keyword:
{code:sql}
SELECT date, COUNT(1)
FROM (
  SELECT DISTINCT id, date, status
  FROM (select 1 id, 'b' date, 3 status
        UNION SELECT 2, 'c', 4
        UNION SELECT 1, 'c', 4) AS r1
) AS r2
GROUP BY 1
Error: PARSE ERROR: Encountered "date ,"
{code}
So you need backquotes around `date`:
{code:sql}
SELECT `date`, COUNT(1)
FROM (
  SELECT DISTINCT id, `date`, status
  FROM (select 1 id, 'b' `date`, 3 status
        UNION SELECT 2, 'c', 4
        UNION SELECT 1, 'c', 4) AS r1
) AS r2
GROUP BY 1;
+------+--------+
| date | EXPR$1 |
+------+--------+
| b    | 1      |
| c    | 2      |
+------+--------+
{code}
The result is correct here. Maybe there is also a real problem, but without the content of the file dfs.`path` it will be difficult to conclude.
> Aggregate on Subquery with Select Distinct or UNION fails to Group By > - > > Key: DRILL-7348 > URL: https://issues.apache.org/jira/browse/DRILL-7348 > Project: Apache Drill > Issue Type: Bug > Components: Functions - Drill >Affects Versions: 1.15.0 >Reporter: Keith G Yu >Priority: Major > > The following query fails to group properly. > {code:java} > SELECT date, COUNT(1) > FROM ( > SELECT DISTINCT > id, > date, > status > FROM table(dfs.`path`(type => 'text', fieldDelimiter => ',', > extractHeader => TRUE)) > ) > GROUP BY 1{code} > This also fails to group properly. > {code:java} > SELECT date, COUNT(1) > FROM ( > SELECT > id, > date, > status > FROM table(dfs.`path1`(type => 'text', fieldDelimiter => ',', > extractHeader => TRUE)) > UNION > SELECT > id, > date, > status > FROM table(dfs.`path2`(type => 'text', fieldDelimiter => ',', > extractHeader => TRUE)) > ) > GROUP BY 1 > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (DRILL-7348) Aggregate on Subquery with Select Distinct or UNION fails to Group By
[ https://issues.apache.org/jira/browse/DRILL-7348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16947512#comment-16947512 ] benj commented on DRILL-7348: - I have tested in 1.15 and 1.16. I think that the problem of [~snapdoodle] is that `date` is a reserved keyword:
{code:sql}
SELECT date, COUNT(1)
FROM (
  SELECT DISTINCT id, date, status
  FROM (select 1 id, 'b' date, 3 status
        UNION SELECT 2, 'c', 4
        UNION SELECT 1, 'c', 4) AS r1
) AS r2
GROUP BY 1
Error: PARSE ERROR: Encountered "date ,"
{code}
So you need backquotes around `date`:
{code:sql}
SELECT `date`, COUNT(1)
FROM (
  SELECT DISTINCT id, `date`, status
  FROM (select 1 id, 'b' `date`, 3 status
        UNION SELECT 2, 'c', 4
        UNION SELECT 1, 'c', 4) AS r1
) AS r2
GROUP BY 1;
+------+--------+
| date | EXPR$1 |
+------+--------+
| b    | 1      |
| c    | 2      |
+------+--------+
{code}
The result is correct here. Maybe there is also a real problem, but without the content of the file dfs.`path` it will be difficult to conclude. -- This message was sent by Atlassian Jira (v8.3.4#803005)
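For comparison, the same minimal counterexample groups correctly in other engines; a quick sketch against Python's bundled SQLite (an assumption-free cross-check only — SQLite does not reserve `date`, so no quoting is needed there):

```python
import sqlite3

# Reproduce the DISTINCT-then-GROUP BY counterexample from the comment above.
conn = sqlite3.connect(":memory:")
rows = conn.execute("""
    SELECT date, COUNT(1)
    FROM (
        SELECT DISTINCT id, date, status
        FROM (SELECT 1 id, 'b' date, 3 status
              UNION SELECT 2, 'c', 4
              UNION SELECT 1, 'c', 4)
    )
    GROUP BY 1
    ORDER BY 1
""").fetchall()
print(rows)  # [('b', 1), ('c', 2)], matching the Drill output above
```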
[jira] [Commented] (DRILL-7291) parquet with compression gzip doesn't work well
[ https://issues.apache.org/jira/browse/DRILL-7291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16945871#comment-16945871 ] benj commented on DRILL-7291: - In fact, the problem appears in +embedded mode+ (bin/drill-embedded) and not in sqlline mode. So it may be a little less serious, and that explains the differences in behavior. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (DRILL-7291) parquet with compression gzip doesn't work well
[ https://issues.apache.org/jira/browse/DRILL-7291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16945825#comment-16945825 ] benj commented on DRILL-7291: - Sure, sorry for the delay, but I preferred to double-check some points with the original file:
- There is a column 'FileName' in the csvh => Renaming columns doesn't fix the problem.
- There are Microsoft (CRLF) line endings in the original csvh => Changing to Unix line endings doesn't fix the problem.
- Some fields have no surrounding quotes => Forcing quotes everywhere doesn't fix the problem.
- There is some binary data in the csvh, but the same problem appears with a small extract containing no binary, so I prefer to push a small file rather than the big one.
- And I have double-checked again on the latest git 1.17 without fixing the problem.
So in attachment there is a minimalist csvh file (1 row of header and 3 rows of data). I also put it just below [^short_no_binary_quote.csvh]
{code:java}
"sha1","md5","crc32","fn","fs","pc","osc","sc"
"000F8527DCCAB6642252BBCFA1B8072D33EE","68CE322D8A896B6E4E7E3F18339EC85C","E39149E4","Blended_Coolers_Vanilla_NL.png","30439","19042","362",""
"0091728653B7D55DF30BFAFE86C52F2F4A59","81AE5D302A0E6D33182CB69ED791181C","5594C3B0","ic_menu_notifications.png","366","21386","362",""
"065F1900120613745CC5E25A57C84624DC2B","AEB7C147EF7B7CEE91807B500A378BA4","24400952","points_program_fragment.xml","1684","21842","362",""
{code}
{code:sql}
/* "csvh": { "type": "text", "extensions": [ "csvh" ], "extractHeader": true, "delimiter": "," }, */
ALTER SESSION set exec.storage.enable_v3_text_reader=true;
ALTER SESSION SET `store.parquet.compression` = 'gzip';
SELECT * FROM `DRILL_7291/short_no_binary_quote.csvh`;
| sha1                                 | md5                              | crc32    | fn                             | fs    | pc    | osc | sc |
| 000F8527DCCAB6642252BBCFA1B8072D33EE | 68CE322D8A896B6E4E7E3F18339EC85C | E39149E4 | Blended_Coolers_Vanilla_NL.png | 30439 | 19042 | 362 |    |
| 0091728653B7D55DF30BFAFE86C52F2F4A59 | 81AE5D302A0E6D33182CB69ED791181C | 5594C3B0 | ic_menu_notifications.png      | 366   | 21386 | 362 |    |
| 065F1900120613745CC5E25A57C84624DC2B | AEB7C147EF7B7CEE91807B500A378BA4 | 24400952 | points_program_fragment.xml    | 1684  | 21842 | 362 |    |
CREATE TABLE `DRILL_7291/problem_pqt` AS (
  SELECT * FROM `DRILL_7291/short_no_binary_quote.csvh`);
| Fragment | Number of records written |
| 0_0      | 3                         |
SELECT * FROM `DRILL_7291/problem_pqt`;
Error: DATA_READ ERROR: Error reading page data
{code}
And all works fine with 'snappy' or 'none'.
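As a sanity check on the attachment itself, the csvh above is ordinary fully-quoted CSV with a trailing empty column; a stdlib sketch (independent of Drill) using just its first data row:

```python
import csv
import io

# Header plus the first data row of short_no_binary_quote.csvh, as posted above.
CSVH = '''"sha1","md5","crc32","fn","fs","pc","osc","sc"
"000F8527DCCAB6642252BBCFA1B8072D33EE","68CE322D8A896B6E4E7E3F18339EC85C","E39149E4","Blended_Coolers_Vanilla_NL.png","30439","19042","362",""
'''

rows = list(csv.DictReader(io.StringIO(CSVH)))
assert len(rows) == 1
assert rows[0]["fn"] == "Blended_Coolers_Vanilla_NL.png"
assert rows[0]["sc"] == ""  # the trailing empty field survives parsing
```

So the input file parses cleanly with a plain CSV reader, which supports the idea that the defect is on the Parquet read path rather than in the source data.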
[jira] [Updated] (DRILL-7291) parquet with compression gzip doesn't work well
[ https://issues.apache.org/jira/browse/DRILL-7291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] benj updated DRILL-7291: Attachment: short_no_binary_quote.csvh -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (DRILL-7291) parquet with compression gzip doesn't work well
[ https://issues.apache.org/jira/browse/DRILL-7291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16944585#comment-16944585 ] benj edited comment on DRILL-7291 at 10/7/19 9:55 AM: -- Indeed, surprisingly, I can't reproduce the problem with the file in attachment. But I try to reproduce the problem with the original file (2.2Go) and all is fine with no compression or with snappy compression but with gzip compression it failed (but no exactly the same way as before): {code:sql} ALTER SESSION SET `store.format`='parquet'; ALTER SESSION SET `store.parquet.compression` = 'gzip'; CREATE TABLE `file_gzip_pqt` AS (SELECT * FROM `file_pqt`); /* 1.16 (and 1.15) */ SELECT count(*) FROM `file_gzip_pqt`; /* OK */ SELECT count(*) FROM `file_gzip_pqt` WHERE `Code`=20492; /* NOK */ Error: DATA_READ ERROR: Error reading page data /* 1.17 */ SELECT count(*) FROM `file_gzip_pqt`; /* OK */ SELECT count(*) FROM `file_gzip_pqt` WHERE `Code`=20492; /* NOK */ Error: INTERNAL_ERROR ERROR: null {code} Note that {code:sql} SELECT count( * ) FROM `file_gzip_pqt` WHERE SpecialCode =' '; /* OK */ SELECT count( * ) FROM `file_gzip_pqt` WHERE SpecialCode <> ''; /* NOK */ {code} But, maybe it's because all the values of 'SpecialCode' column are empty ("") was (Author: benj641): Indeed, surprisingly, I can't reproduce the problem with the file in attachment. 
But I try to reproduce the problem with the original file (2.2Go) and all is fine with no compression or with snappy compression but with gzip compression it failed (but no exactly the same way as before):
{code:sql}
ALTER SESSION SET `store.format`='parquet';
ALTER SESSION SET `store.parquet.compression` = 'gzip';
CREATE TABLE `file_gzip_pqt` AS (SELECT * FROM `file_pqt`);
/* 1.16 (and 1.15) */
SELECT count(*) FROM `file_gzip_pqt`; /* OK */
SELECT count(*) FROM `file_gzip_pqt` WHERE `Code`=20492; /* NOK */
Error: DATA_READ ERROR: Error reading page data
/* 1.17 */
SELECT count(*) FROM `file_gzip_pqt`; /* OK */
SELECT count(*) FROM `file_gzip_pqt` WHERE `Code`=20492; /* NOK */
Error: INTERNAL_ERROR ERROR: null
{code}
Note that {code:sql}SELECT count( * ) FROM `file_gzip_pqt` WHERE COLUMN=20492;{code} gives no error with 2 COLUMN (on 7) (?)
[jira] [Comment Edited] (DRILL-7291) parquet with compression gzip doesn't work well
[ https://issues.apache.org/jira/browse/DRILL-7291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16944585#comment-16944585 ] benj edited comment on DRILL-7291 at 10/7/19 9:44 AM: -- Indeed, surprisingly, I can't reproduce the problem with the file in attachment. But I try to reproduce the problem with the original file (2.2Go) and all is fine with no compression or with snappy compression but with gzip compression it failed (but no exactly the same way as before): {code:sql} ALTER SESSION SET `store.format`='parquet'; ALTER SESSION SET `store.parquet.compression` = 'gzip'; CREATE TABLE `file_gzip_pqt` AS (SELECT * FROM `file_pqt`); /* 1.16 (and 1.15) */ SELECT count(*) FROM `file_gzip_pqt`; /* OK */ SELECT count(*) FROM `file_gzip_pqt` WHERE `Code`=20492; /* NOK */ Error: DATA_READ ERROR: Error reading page data /* 1.17 */ SELECT count(*) FROM `file_gzip_pqt`; /* OK */ SELECT count(*) FROM `file_gzip_pqt` WHERE `Code`=20492; /* NOK */ Error: INTERNAL_ERROR ERROR: null {code} Note that {code:sql}SELECT count( * ) FROM `file_gzip_pqt` WHERE COLUMN=20492;{code} gives no error with 2 COLUMN (on 7) (?) was (Author: benj641): Indeed, surprisingly, I can't reproduce the problem with the file in attachment. 
But I try to reproduce the problem with the original file (2.2Go) and all is fine with no compression or with snappy compression but with gzip compression it failed (but no exactly the same way as before):
{code:sql}
ALTER SESSION SET `store.format`='parquet';
ALTER SESSION SET `store.parquet.compression` = 'gzip';
CREATE TABLE `file_gzip_pqt` AS (SELECT * FROM `file_pqt`);
/* 1.16 (and 1.15) */
SELECT count(*) FROM `file_gzip_pqt`; /* OK */
SELECT count(*) FROM `file_gzip_pqt` WHERE `Code`=20492; /* NOK */
Error: DATA_READ ERROR: Error reading page data
/* 1.17 */
SELECT count(*) FROM `file_gzip_pqt`; /* OK */
SELECT count(*) FROM `file_gzip_pqt` WHERE `Code`=20492; /* NOK */
Error: INTERNAL_ERROR ERROR: null
{code}
Note that {code:sql}SELECT count( * ) FROM `file_gzip_pqt` WHERE COLUMN=20492;{code} gives no error with 2 COLUMN (on 7) (?) if forcing each column with {code:sql}CAST(COLUMN AS VARCHAR){code} when {code:sql}CREATE AS{code} it seems there is no problem after. But I think more investigation needed. I will try to push more elements.
[jira] [Commented] (DRILL-7291) parquet with compression gzip doesn't work well
[ https://issues.apache.org/jira/browse/DRILL-7291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16944585#comment-16944585 ] benj commented on DRILL-7291: - Indeed, surprisingly, I can't reproduce the problem with the file in attachment. But I try to reproduce the problem with the original file (2.2Go) and all is fine with no compression or with snappy compression but with gzip compression it failed (but no exactly the same way as before): {code:sql} ALTER SESSION SET `store.format`='parquet'; ALTER SESSION SET `store.parquet.compression` = 'gzip'; CREATE TABLE `file_gzip_pqt` AS (SELECT * FROM `file_pqt`); /* 1.16 (and 1.15) */ SELECT count(*) FROM `file_gzip_pqt`; /* OK */ SELECT count(*) FROM `file_gzip_pqt` WHERE `Code`=20492; /* NOK */ Error: DATA_READ ERROR: Error reading page data /* 1.17 */ SELECT count(*) FROM `file_gzip_pqt`; /* OK */ SELECT count(*) FROM `file_gzip_pqt` WHERE `Code`=20492; /* NOK */ Error: INTERNAL_ERROR ERROR: null {code} Note that {code:sql}SELECT count( * ) FROM `file_gzip_pqt` WHERE COLUMN=20492;{code} gives no error with 2 COLUMN (on 7) (?) if forcing each column with {code:sql}CAST(COLUMN AS VARCHAR){code} when {code:sql}CREATE AS{code} it seems there is no problem after. But I think more investigation needed. I will try to push more elements. > parquet with compression gzip doesn't work well > --- > > Key: DRILL-7291 > URL: https://issues.apache.org/jira/browse/DRILL-7291 > Project: Apache Drill > Issue Type: Bug > Components: Storage - Parquet >Affects Versions: 1.15.0, 1.16.0 >Reporter: benj >Priority: Major > Attachments: 0_0_0.parquet > > > Create a parquet with compression=gzip produce bad result. 
> Example: > * input: file_pqt (compression=none) > {code:java} > ALTER SESSION SET `store.format`='parquet'; > ALTER SESSION SET `store.parquet.compression` = 'snappy'; > CREATE TABLE `file_snappy_pqt` > AS(SELECT * FROM `file_pqt`); > ALTER SESSION SET `store.parquet.compression` = 'gzip'; > CREATE TABLE `file_gzip_pqt` > AS(SELECT * FROM `file_pqt`);{code} > Then compare the content of the different parquet files: > {code:java} > ALTER SESSION SET `store.parquet.use_new_reader` = true; > SELECT COUNT(*) FROM `file_pqt`;=> 15728036 > SELECT COUNT(*) FROM `file_snappy_pqt`; => 15728036 > SELECT COUNT(*) FROM `file_gzip_pqt`; => 15728036 > => OK > SELECT COUNT(*) FROM `file_pqt` WHERE `Code` = '';=> 0 > SELECT COUNT(*) FROM `file_snappy_pqt` WHERE `Code` = ''; => 0 > SELECT COUNT(*) FROM `file_gzip_pqt` WHERE `Code` = ''; => 14744966 > => NOK > SELECT COUNT(*) FROM `file_pqt` WHERE `Code2` = '';=> 0 > SELECT COUNT(*) FROM `file_snappy_pqt` WHERE `Code2` = ''; => 0 > SELECT COUNT(*) FROM `file_gzip_pqt` WHERE `Code2` = ''; => 14744921 > => NOK{code} > _(There is no NULL value in these files.)_ > _(With exec.storage.enable_v3_text_reader=true it gives same results)_ > So If the parquet file contains the right number of rows, the values in the > different columns are not identical. > Some "random" values of the _gzip parquet_ are reduce to empty string > I think the problem is from the reader and not the writer because: > {code:java} > SELECT COUNT(*) FROM `file_pqt` WHERE `CRC32` = 'B33D600C'; => 2 > SELECT COUNT(*) FROM `file_gzip_pqt` WHERE `CRC32` = 'B33D600C'; => 0 > {code} > but > {code:java} > hadoop jar parquet-tools-1.10.0.jar cat file_gzip_pqt/1_0_0.parquet | grep -c > "B33D600C" > 2019-06-12 11:45:23,738 INFO hadoop.InternalParquetRecordReader: RecordReader > initialized will read a total of 3597092 records. > 2019-06-12 11:45:23,739 INFO hadoop.InternalParquetRecordReader: at row 0. 
> reading next block
> 2019-06-12 11:45:23,804 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
> 2019-06-12 11:45:23,805 INFO compress.CodecPool: Got brand-new decompressor [.gz]
> 2019-06-12 11:45:23,815 INFO hadoop.InternalParquetRecordReader: block read in memory in 76 ms. row count = 3597092
> 2
> {code}
> So the values are indeed present in the _Apache Parquet_ file but can't be exploited via _Apache Drill_.
> Attached is an extract (the original file is 2.2 GB) which produces the same behaviour.
-- This message was sent by Atlassian Jira (v8.3.4#803005)
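For reference, Parquet's GZIP codec stores each compressed page as an ordinary gzip stream, which is consistent with `parquet-tools` still reading the data when Drill's reader fails. A minimal standard-library sketch of the round trip the page codec performs (the page bytes are a stand-in, not real page data):

```python
import gzip

# Illustrative only: a Parquet GZIP-coded page is a standard gzip stream,
# so a page written correctly can be verified with any gzip implementation,
# independently of Drill's reader.
page = b"B33D600C" * 1000          # stand-in for a raw (uncompressed) page
compressed = gzip.compress(page)   # what the writer stores on disk
restored = gzip.decompress(compressed)

assert restored == page            # the data survives the codec itself
```

This supports the comment's conclusion that the corruption is on the read path rather than in the written file.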
[jira] [Created] (DRILL-7396) Exception when trying to access last element of an array with repeated_count
benj created DRILL-7396: --- Summary: Exception when trying to access last element of an array with repeated_count Key: DRILL-7396 URL: https://issues.apache.org/jira/browse/DRILL-7396 Project: Apache Drill Issue Type: Bug Components: Functions - Drill Affects Versions: 1.16.0 Reporter: benj Using arrays in Drill is not friendly:
{code:sql}
SELECT (split('a,b,c',','))[0]; /* NOK */
Error: SYSTEM ERROR: ClassCastException: org.apache.drill.common.expression.FunctionCall cannot be cast to org.apache.drill.common.expression.SchemaPath
/* an outer SELECT is needed */
SELECT x[0] FROM (SELECT split('a,b,c',',') x); /* OK */
{code}
And accessing the last element of an array is worse:
{code:sql}
SELECT x[repeated_count(x) - 1] AS lasteltidx FROM (SELECT split('a,b,c',',') x);
Error: SYSTEM ERROR: ClassCastException: org.apache.calcite.rex.RexCall cannot be cast to org.apache.calcite.rex.RexLiteral
/* while */
SELECT x[2] lastelt, (repeated_count(x) - 1) AS lasteltidx FROM (SELECT split('a,b,c',',') x);
+---------+------------+
| lastelt | lasteltidx |
+---------+------------+
| c       | 2          |
+---------+------------+
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
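For clarity, this is what the failing query is expected to compute, sketched in Python (the helper name is made up for illustration):

```python
def last_element(text: str, sep: str = ",") -> str:
    # Equivalent of split('a,b,c', ',') ...
    arr = text.split(sep)
    # ... followed by x[repeated_count(x) - 1]
    return arr[len(arr) - 1]

assert last_element("a,b,c") == "c"
```

The index expression is a plain function of the array, which is exactly what Drill's planner rejects when it expects a literal subscript.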
[jira] [Created] (DRILL-7395) Partial Partition By to CTAS Parquet files
benj created DRILL-7395: --- Summary: Partial Partition By to CTAS Parquet files Key: DRILL-7395 URL: https://issues.apache.org/jira/browse/DRILL-7395 Project: Apache Drill Issue Type: Improvement Components: Storage - Parquet Affects Versions: 1.16.0 Reporter: benj For a data set in which a few values are prevalent while most occur rarely, it would be useful to be able to create Parquet files with a partial _PARTITION BY_. It would then be possible to group all the rare values together without being "impacted" by the "too" common values. It's not exactly the same thing, but partial indexes exist in some databases (https://www.postgresql.org/docs/current/indexes-partial.html). -- This message was sent by Atlassian Jira (v8.3.4#803005)
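A sketch of the requested behaviour, under the assumption that "partial" means only sufficiently frequent values get their own partition while rare values share a catch-all partition (the function, threshold, and partition names are hypothetical):

```python
from collections import Counter

def partial_partitions(values, min_count=2):
    """Map each distinct value to a partition key: prevalent values keep
    their own partition, rare values share a single catch-all one."""
    freq = Counter(values)
    return {v: (v if freq[v] >= min_count else "__other__") for v in freq}

parts = partial_partitions(["a", "a", "a", "b", "c"], min_count=2)
assert parts["a"] == "a"                          # prevalent: own partition
assert parts["b"] == parts["c"] == "__other__"    # rare: grouped together
```

This keeps the number of partition directories bounded even when the column has a long tail of distinct values.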
[jira] [Commented] (DRILL-7004) improve show files functionnality
[ https://issues.apache.org/jira/browse/DRILL-7004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16943152#comment-16943152 ] benj commented on DRILL-7004: - Ok, thanks for the clarification. Note that, surprisingly, in the presented case there are not many files in each directory (60 files per directory), but there are many directories, and the tested system has several servers with several dozen processors each. So it's strange that the time increases linearly from 1 directory (60 files) to 24 directories (of 60 files each) or to 31*24 directories (of 60 files each). _(Ok, there is a factor of 2 for the 24 directories - but it's not meaningful enough.)_
> improve show files functionnality
> -
>
> Key: DRILL-7004
> URL: https://issues.apache.org/jira/browse/DRILL-7004
> Project: Apache Drill
> Issue Type: Wish
> Components: Storage - Other
>Affects Versions: 1.15.0
>Reporter: benj
>Priority: Major
>
> At the moment, it's possible to show the files/directories in a particular directory with the command
> {code:java}
> SHOW files FROM tmp.`mypath`;
> {code}
> It would certainly be very useful to improve this functionality with:
> * the possibility to list recursively
> * the possibility to use at least wildcards
> {code:java}
> SHOW files FROM tmp.`mypath/*/test/*/*a*`;
> {code}
> * the possibility to use the result like a table
> {code:java}
> SELECT p.* FROM (SHOW files FROM tmp.`mypath`) AS p WHERE ...
> {code}
>
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (DRILL-7004) improve show files functionnality
[ https://issues.apache.org/jira/browse/DRILL-7004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16942863#comment-16942863 ] benj commented on DRILL-7004: - Despite the parallelization, for a system containing several hundred thousand files, it is really too slow and therefore unusable. Example:
{code:java}
DIR_root
|--DIR_DAY_x
   |--DIR_HOUR_y
      |--File_MINUTE_z
with x from 1 to 31, y from 0 to 23, z from 0 to 59
{code}
{code:sql}
/* mydfs.myminutes : location = "DIR_ROOT/DIR_DAY_1/DIR_HOUR_0" */
SELECT * FROM INFORMATION_SCHEMA.`FILES` WHERE schema_name = 'mydfs.myminutes' and is_file = true;
=> time ~ 0.65 seconds - 60 files              > (time of unix find : 0.042s)

/* mydfs.myhours : location = "DIR_ROOT/DIR_DAY_1" */
ALTER SESSION SET `storage.list_files_recursively` = true;
SELECT * FROM INFORMATION_SCHEMA.`FILES` WHERE schema_name = 'mydfs.myhours' and is_file = true;
=> time ~ 9 seconds - 1440 files (60*24)       >>> (time of unix find : 0.095s)

/* mydfs.mydays : location = "DIR_ROOT/" */
ALTER SESSION SET `storage.list_files_recursively` = true;
SELECT * FROM INFORMATION_SCHEMA.`FILES` WHERE schema_name = 'mydfs.mydays' and is_file = true;
=> time ~ 417 seconds - 44640 files (60*24*31) >> (time of unix find : 1.5s (with print))
{code}
It's understandable that there is overhead compared to the unix tools, but the average time per file is much too expensive - here 0.01s per file, i.e. 2h30 to scan 1 million files. It's a pity that it's really more efficient to run a `find path/ -type f > mytmp.csv` and then `SELECT * FROM mytmp.csv` _(with the necessary permissions)_.
> improve show files functionnality
> -
>
> Key: DRILL-7004
> URL: https://issues.apache.org/jira/browse/DRILL-7004
> Project: Apache Drill
> Issue Type: Wish
> Components: Storage - Other
>Affects Versions: 1.15.0
>Reporter: benj
>Priority: Major
>
> At the moment, it's possible to show the files/directories in a particular directory with the command
> {code:java}
> SHOW files FROM tmp.`mypath`;
> {code}
> It would certainly be very useful to improve this functionality with:
> * the possibility to list recursively
> * the possibility to use at least wildcards
> {code:java}
> SHOW files FROM tmp.`mypath/*/test/*/*a*`;
> {code}
> * the possibility to use the result like a table
> {code:java}
> SELECT p.* FROM (SHOW files FROM tmp.`mypath`) AS p WHERE ...
> {code}
>
-- This message was sent by Atlassian Jira (v8.3.4#803005)
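The recursive, wildcard-filtered listing requested above can be modelled with the Python standard library; this only illustrates the desired SHOW FILES semantics, not Drill's implementation:

```python
import fnmatch
import os

def show_files(root, pattern="*"):
    """Recursively list files under root, keeping only paths matching an
    optional wildcard pattern (the `storage.list_files_recursively` case)."""
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if fnmatch.fnmatch(path, pattern):
                yield path
```

Applied to the DIR_DAY/DIR_HOUR layout above, a single pass like this returns all 60*24*31 file paths; a pattern such as `*/test/*/*a*` would implement the wildcard form of the wish.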
[jira] [Created] (DRILL-7392) Exclude some files when requesting directory
benj created DRILL-7392: --- Summary: Exclude some files when requesting directory Key: DRILL-7392 URL: https://issues.apache.org/jira/browse/DRILL-7392 Project: Apache Drill Issue Type: Wish Reporter: benj Fix For: 1.16.0 Currently Drill ignores files starting with a dot ('.') or an underscore ('_'). When querying a directory containing files of different types or different schemas, present at multiple levels of the file tree, it would be useful/more flexible to also have option(s) to exclude some files by extension, or maybe with a regexp. For example:
{code:java}
myTable
|--D1
   |--file1.csv
   |--file2.csv
|--D2
   |--SubD2
      |--file1.csv
   |--file1.csv
   |--file1.xml
   |--file1.json
{code}
Without entering into a debate about what a good organisation/layout for the data is, the current way to query all the csv files of this example is:
{code:sql}
SELECT * FROM `myTable/*/*.csv`
UNION
SELECT * FROM `myTable/*/*/*.csv`
{code}
It would be useful to have the capacity to query _myTable_ directly, like:
{code:sql}
/* ALTER SESSION SET exclude_files='xml,json' */
/* or */
/* ALTER SESSION SET only_files='csv' */
SELECT * FROM myTable
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
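A sketch of the requested filter, assuming a hypothetical `exclude_files`-style option holding a set of extensions; dot and underscore files are skipped as Drill already does:

```python
import os

def walk_excluding(root, exclude_ext=("xml", "json")):
    """Yield data files under root, skipping dot/underscore files
    (Drill's current rule) plus any extension listed in exclude_ext."""
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if name.startswith((".", "_")):
                continue  # Drill already ignores these
            if "." in name and name.rsplit(".", 1)[1].lower() in exclude_ext:
                continue  # the requested extension-based exclusion
            yield os.path.join(dirpath, name)
```

An `only_files='csv'` variant would simply invert the extension test.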
[jira] [Created] (DRILL-7390) kvgen/flatten doesn't produce same result from .json or .parquet
benj created DRILL-7390: --- Summary: kvgen/flatten doesn't produce same result from .json or .parquet Key: DRILL-7390 URL: https://issues.apache.org/jira/browse/DRILL-7390 Project: Apache Drill Issue Type: Bug Components: Execution - Data Types, Functions - Drill, Storage - JSON, Storage - Parquet Affects Versions: 1.16.0 Reporter: benj Attachments: ANIMALS_json.tar.gz, ANIMALS_pqt.tar.gz With a Parquet file produced from JSON (_ANIMALS_json_ and _ANIMALS_pqt_ attached in tar.gz format):
{code:sql}
CREATE TABLE `ANIMALS_pqt` AS (SELECT * FROM `ANIMALS_json`);
{code}
the same query, using kvgen and flatten, applied to the JSON and to the Parquet doesn't produce the same results:
{code:sql}
SELECT count(*) FROM (SELECT f FROM (SELECT flatten(k) AS f FROM (SELECT kvgen(animals) AS k FROM `ANIMALS_json`)) AS x)
=> 8482290
SELECT count(*) FROM (SELECT f FROM (SELECT flatten(k) AS f FROM (SELECT kvgen(animals) AS k FROM `ANIMALS_pqt`)) AS x)
=> 929430
{code}
Or another example:
{code:sql}
SELECT count(*) FROM (SELECT f FROM (SELECT flatten(k) AS f FROM (SELECT kvgen(animals) AS k FROM `ANIMALS_json`)) AS x WHERE x.f.key='Cat')
=> 121368
SELECT count(*) FROM (SELECT f FROM (SELECT flatten(k) AS f FROM (SELECT kvgen(animals) AS k FROM `ANIMALS_pqt`)) AS x WHERE x.f.key='Cat')
=> 13470
{code}
The correct result is the JSON one, as shown by:
{code:bash}
cat ANIMALS_json/*.json | grep -c "Cat"
121368
{code}
Please note that, here, it appears that the particular file _ANIMALS_pqt/1_0_0.parquet_ is not read correctly while the others are:
{code:sql}
SELECT count(*) FROM (SELECT f FROM (SELECT flatten(k) AS f FROM (SELECT kvgen(animals) AS k FROM `ANIMALS_pqt/1_0_0.parquet`)) AS x WHERE x.f.key='Cat');
=> 107898
SELECT count(*) FROM (SELECT f FROM (SELECT flatten(k) AS f FROM (SELECT kvgen(animals) AS k FROM `ANIMALS_pqt/1_1_0.parquet`)) AS x WHERE x.f.key='Cat');
=> 2429
SELECT count(*) FROM (SELECT f FROM (SELECT flatten(k) AS f FROM (SELECT kvgen(animals) AS k FROM `ANIMALS_pqt/1_2_0.parquet`)) AS x WHERE x.f.key='Cat');
=> 5419
SELECT count(*) FROM (SELECT f FROM (SELECT flatten(k) AS f FROM (SELECT kvgen(animals) AS k FROM `ANIMALS_pqt/1_3_0.parquet`)) AS x WHERE x.f.key='Cat');
=> 5622
{code}
2429+5419+5622=13470 (the result of the query on ANIMALS_pqt)
107898+2429+5419+5622=121368 (the result of the query on ANIMALS_json)
-- This message was sent by Atlassian Jira (v8.3.4#803005)
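The kvgen/flatten pipeline in the queries above has a direct Python analogue, which gives a format-independent reference count (toy data here, not the attached files):

```python
def kvgen(mapping):
    # Drill's kvgen: turn a map into an array of {"key", "value"} records.
    return [{"key": k, "value": v} for k, v in mapping.items()]

rows = [{"animals": {"Cat": {"seen": 1}, "Dog": {"seen": 2}}},
        {"animals": {"Cat": {"seen": 3}}}]

# flatten(kvgen(animals)) ... WHERE x.f.key = 'Cat'
flattened = [kv for row in rows for kv in kvgen(row["animals"])]
cats = [kv for kv in flattened if kv["key"] == "Cat"]

assert len(flattened) == 3
assert len(cats) == 2
```

Whatever the storage format, both counts should match this reference computation, which is why the JSON result is taken as ground truth.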
[jira] [Created] (DRILL-7389) JSON empty list avoid Parquet creation
benj created DRILL-7389: --- Summary: JSON empty list avoid Parquet creation Key: DRILL-7389 URL: https://issues.apache.org/jira/browse/DRILL-7389 Project: Apache Drill Issue Type: Improvement Components: Storage - JSON, Storage - Parquet Affects Versions: 1.16.0 Reporter: benj With a JSON file containing only one row with an empty list as below, it's possible to query the file, but there is an error when trying to create a Parquet file. File ANIMALS_1.json:
{code:json}
{"animals": {"Rhinoceros":{"detected":false,"gender":"1","obsdate":"20171229"},"Horse":{}}}
{code}
{code:sql}
SELECT * FROM `ANIMALS_1.json`;
+----------------------------------------------------------------------------------+
| animals                                                                          |
+----------------------------------------------------------------------------------+
| {"Rhinoceros":{"detected":"false","gender":"1","obsdate":"20171229"},"Horse":{}} |
+----------------------------------------------------------------------------------+
CREATE TABLE `ANIMALS_1_pqt` AS (SELECT * FROM `ANIMALS_1.json`);
=> Error: SYSTEM ERROR: InvalidSchemaException: Cannot write a schema with an empty group: optional group Horse {}
{code}
But if the JSON file contains a second line with a non-empty list for "Horse", it's possible to query the file and create the Parquet file. File ANIMALS_2.json:
{code:json}
{"animals": {"Rhinoceros":{"detected":false,"gender":"1","obsdate":"20171229"},"Horse":{}}}
{"animals": {"Rhinoceros":{"detected":false,"gender":"1","obsdate":"20171229"},"Horse":{"detected":false,"gender":"1","obsdate":"20171229"}}}
{code}
{code:sql}
SELECT * FROM `ANIMALS_2.json`;
+--------------------------------------------------------------------------------------------------------------------------------------+
| animals                                                                                                                              |
+--------------------------------------------------------------------------------------------------------------------------------------+
| {"Rhinoceros":{"detected":"false","gender":"1","obsdate":"20171229"},"Horse":{}}                                                     |
| {"Rhinoceros":{"detected":"false","gender":"1","obsdate":"20171229"},"Horse":{"detected":"false","gender":"1","obsdate":"20171229"}} |
+--------------------------------------------------------------------------------------------------------------------------------------+
CREATE TABLE `ANIMALS_2_pqt` AS (SELECT * FROM `ANIMALS_2.json`);
+----------+---------------------------+
| Fragment | Number of records written |
+----------+---------------------------+
| 0_0      | 2                         |
+----------+---------------------------+
{code}
Many problems appear with this when manipulating multiple JSON files with "rare" values (and when one does not control their generation). It's very annoying to have no way to push data into parquet when there are missing/null values in the JSON. The possibility to cast the data to varchar (DRILL-7375) could allow the parquet storage. In the simple case of the example discussed here, it's possible to change the type of the input file from JSON to CSV and it will work. But that does not answer all the problems, and it doesn't allow keeping some parts as "json" and some others as "text". -- This message was sent by Atlassian Jira (v8.3.4#803005)
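One possible pre-processing workaround, sketched in Python: strip empty nested objects from each JSON record before the CTAS, so that no empty Parquet group has to be written (the helper name is made up for illustration):

```python
import json

def drop_empty_groups(value):
    """Recursively remove keys whose value is an empty object, e.g.
    "Horse": {}, which Parquet cannot represent as a group."""
    if isinstance(value, dict):
        return {k: drop_empty_groups(v) for k, v in value.items()
                if not (isinstance(v, dict) and len(v) == 0)}
    return value

line = '{"animals": {"Rhinoceros": {"detected": false}, "Horse": {}}}'
cleaned = drop_empty_groups(json.loads(line))

assert "Horse" not in cleaned["animals"]
assert cleaned["animals"]["Rhinoceros"] == {"detected": False}
```

This loses the information that "Horse" was present but empty, which is exactly why a proper Drill-side solution would be preferable.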
[jira] [Commented] (DRILL-6958) CTAS csv with option
[ https://issues.apache.org/jira/browse/DRILL-6958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16935943#comment-16935943 ] benj commented on DRILL-6958: - In the next example, a table has a column that contains a piece of JSON:
{code:sql}
SELECT * FROM `example.parquet` LIMIT 2;
+---------+------------+-------------------------------------------------------------------------------------------------------------------------------------------+
| hash    | date       | info                                                                                                                                      |
+---------+------------+-------------------------------------------------------------------------------------------------------------------------------------------+
| B29C56F | 2019-09-23 | {"Number": 322, "scans": {"nameofprocess": {"detection": false, "version": "1.2"}}, {"othername": {"detection": true, "version": "0.1"}}} |
| C28956E | 2019-09-22 | {"Number": 312, "scans": {"thirdname": {"detection": false, "version": "1.0"}}}                                                           |
+---------+------------+-------------------------------------------------------------------------------------------------------------------------------------------+
SELECT typeof(hash) AS hash, typeof(`date`) AS `date`, typeof(info) AS info FROM `example.parquet` LIMIT 1;
+---------+------+------+
| hash    | date | info |
+---------+------+------+
| VARCHAR | DATE | MAP  |
+---------+------+------+
{code}
It's not possible to push this correctly into a CSV file because of the presence of the separator and quotes inside the JSON. And there is no way to manually avoid this problem by changing the separator or introducing quoting, because the type MAP is not convertible to VARCHAR (DRILL-7375), so it's not possible to manually concatenate the data.
> CTAS csv with option
>
>
> Key: DRILL-6958
> URL: https://issues.apache.org/jira/browse/DRILL-6958
> Project: Apache Drill
> Issue Type: Bug
> Components: Storage - Text CSV
>Affects Versions: 1.15.0, 1.16.0
>Reporter: benj
>Priority: Major
>
> Currently, it may be difficult to produce well-formed CSV with CTAS (see comment below).
> It appears necessary to have some additional/configurable options to write CSV files with CTAS:
> * possibility to change/define the separator,
> * possibility to write or not the header,
> * possibility to force the write of only 1 file instead of lots of parts,
> * possibility to force quoting
> * possibility to use/change the escape char
> * ... -- This message was sent by Atlassian Jira (v8.3.4#803005)
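For comparison, a standard CSV writer handles embedded separators and quotes by quoting the field and doubling inner quotes, which is what a configurable CTAS writer could do once a MAP can be rendered as text. A sketch with Python's csv module (illustrative data, not the real table):

```python
import csv
import io

row = ["B29C56F", "2019-09-23",
       '{"Number": 322, "scans": {"thirdname": {"detection": false}}}']

buf = io.StringIO()
# QUOTE_MINIMAL quotes only fields that contain the delimiter or quotechar,
# so the JSON column is safely quoted while the other fields stay bare.
csv.writer(buf, quoting=csv.QUOTE_MINIMAL).writerow(row)

parsed = next(csv.reader(io.StringIO(buf.getvalue())))
assert parsed == row  # round-trips despite commas and quotes inside the JSON
```

The separator, quoting mode, and escape character here are exactly the knobs the issue asks CTAS to expose.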
[jira] [Created] (DRILL-7379) Planning error
benj created DRILL-7379: --- Summary: Planning error Key: DRILL-7379 URL: https://issues.apache.org/jira/browse/DRILL-7379 Project: Apache Drill Issue Type: Bug Components: Functions - Drill Affects Versions: 1.16.0 Reporter: benj With data such as:
{code:sql}
SELECT id, tags FROM `example_parquet`;
+--------+------------------------------------+
| id     | tags                               |
+--------+------------------------------------+
| 7b8808 | ["peexe","signed","overlay"]       |
| 55a4ae | ["peexe","signed","upx","overlay"] |
+--------+------------------------------------+
{code}
the next query is OK:
{code:sql}
SELECT id, flatten(tags) tag
FROM (
  SELECT id, any_value(tags) tags
  FROM `example_parquet`
  GROUP BY id
) LIMIT 2;
+--------+--------+
| id     | tag    |
+--------+--------+
| 55a4ae | peexe  |
| 55a4ae | signed |
+--------+--------+
{code}
But unexpectedly, the next query fails:
{code:sql}
SELECT tag, count(*)
FROM (
  SELECT flatten(tags) tag
  FROM (
    SELECT id, any_value(tags) tags
    FROM `example_parquet`
    GROUP BY id
  )
) GROUP BY tag;
Error: SYSTEM ERROR: UnsupportedOperationException: Map, Array, Union or repeated scalar type should not be used in group by, order by or in a comparison operator. Drill does not support compare between MAP:REPEATED and MAP:REPEATED.
/* Or, with another set of data, a different error:
Error: SYSTEM ERROR: SchemaChangeException: Failure while trying to materialize incoming schema. Errors: Error in expression at index 0. Error: Missing function implementation: [hash32asdouble(MAP-REPEATED, INT-REQUIRED)]. Full expression: null.. */
{code}
These errors are incomprehensible because the aggregate is on a VARCHAR. Moreover, the query works if decomposed into 2 queries with the creation of an intermediate table, like below:
{code:sql}
CREATE TABLE `tmp_parquet` AS (
  SELECT id, flatten(tags) tag
  FROM (
    SELECT id, any_value(tags) tags
    FROM `example_parquet`
    GROUP BY id
  ));
SELECT tag, count(*) c FROM `tmp_parquet` GROUP BY tag;
+---------+---+
| tag     | c |
+---------+---+
| overlay | 2 |
| peexe   | 2 |
| signed  | 2 |
| upx     | 1 |
+---------+---+
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
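The aggregation the failing query describes is straightforward when modelled outside Drill, which confirms the expected result table (data copied from the example):

```python
from collections import Counter

rows = [{"id": "7b8808", "tags": ["peexe", "signed", "overlay"]},
        {"id": "55a4ae", "tags": ["peexe", "signed", "upx", "overlay"]}]

# flatten(tags) followed by GROUP BY tag, count(*)
counts = Counter(tag for row in rows for tag in row["tags"])

assert counts == {"peexe": 2, "signed": 2, "overlay": 2, "upx": 1}
```

The grouped values are plain strings after flattening, which is why the planner's complaint about repeated MAP types looks like a type-inference bug rather than a real type conflict.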
[jira] [Created] (DRILL-7378) Allowing less outer/inner select
benj created DRILL-7378: --- Summary: Allowing less outer/inner select Key: DRILL-7378 URL: https://issues.apache.org/jira/browse/DRILL-7378 Project: Apache Drill Issue Type: Improvement Components: Functions - Drill Affects Versions: 1.16.0 Reporter: benj Currently, it's not possible to exploit the result of some functions like _kvgen_ or _flatten_ directly, and an inner/outer select is needed for some operations. It would be easier to allow using the results of these functions directly. Example:
{code:sql}
SELECT CONVERT_FROM('{"Tuesday":{"close":"22:00"},"Friday":{"close":"23:00"}}','JSON') j;
+----------------------------------------------------------+
| j                                                        |
+----------------------------------------------------------+
| {"Tuesday":{"close":"22:00"},"Friday":{"close":"23:00"}} |
+----------------------------------------------------------+
{code}
But it's not possible to simply do:
{code:sql}
SELECT kvgen(CONVERT_FROM('{"Tuesday":{"close":"22:00"},"Friday":{"close":"23:00"}}','JSON'));
Error: PLAN ERROR: Failure while materializing expression in constant expression evaluator [CONVERT_FROM('{"Tuesday":{"close":"22:00"},"Friday":{"close":"23:00"}}', 'JSON')]. Errors: Error in expression at index -1. Error: Only ProjectRecordBatch could have complex writer function. You are using complex writer function convert_fromJSON in a non-project operation!. Full expression: --UNKNOWN EXPRESSION--.
{code}
It's only possible to do:
{code:sql}
SELECT kvgen(c) AS k FROM (SELECT CONVERT_FROM('{"Tuesday":{"close":"22:00"},"Friday":{"close":"23:00"}}','JSON') c);
+------------------------------------------------------------------------------------------+
| k                                                                                        |
+------------------------------------------------------------------------------------------+
| [{"key":"Tuesday","value":{"close":"22:00"}},{"key":"Friday","value":{"close":"23:00"}}] |
+------------------------------------------------------------------------------------------+
{code}
It's possible to cascade with flatten:
{code:sql}
SELECT flatten(kvgen(c)) f FROM (SELECT CONVERT_FROM('{"Tuesday":{"close":"22:00"},"Friday":{"close":"23:00"}}','JSON') c);
+---------------------------------------------+
| f                                           |
+---------------------------------------------+
| {"key":"Tuesday","value":{"close":"22:00"}} |
| {"key":"Friday","value":{"close":"23:00"}}  |
+---------------------------------------------+
{code}
But it's not possible to directly use the result of flatten to select the key or value:
{code:sql}
SELECT (flatten(kvgen(r.c))).key f FROM (SELECT CONVERT_FROM('{"Tuesday":{"close":"22:00"},"Friday":{"close":"23:00"}}','JSON') c) r;
Error: VALIDATION ERROR: From line 1, column 9 to line 1, column 27: Incompatible types
{code}
You have to use an inner/outer select like:
{code:sql}
SELECT r.f.key k FROM (SELECT flatten(kvgen(c)) f FROM (SELECT CONVERT_FROM('{"Tuesday":{"close":"22:00"},"Friday":{"close":"23:00"}}','JSON') c)) r;
+---------+
| k       |
+---------+
| Tuesday |
| Friday  |
+---------+
{code}
It would be useful to be able to write/read shorter and simpler queries, limiting where possible the need for inner/outer SELECTs.
-- This message was sent by Atlassian Jira (v8.3.4#803005)
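The single-expression composition the issue asks for is natural in a general-purpose language; a Python sketch of the convert_from -> kvgen -> flatten -> key chain applied in one expression:

```python
import json

# convert_from(..., 'JSON')
doc = json.loads('{"Tuesday":{"close":"22:00"},"Friday":{"close":"23:00"}}')

# kvgen + flatten + .key, chained without intermediate "tables"
keys = [kv["key"] for kv in ({"key": k, "value": v} for k, v in doc.items())]

assert keys == ["Tuesday", "Friday"]
```

Each stage simply consumes the previous stage's value, which is the ergonomics the wish describes.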
[jira] [Updated] (DRILL-7375) composite/nested type map/array convert_to/cast to varchar
[ https://issues.apache.org/jira/browse/DRILL-7375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] benj updated DRILL-7375: Summary: composite/nested type map/array convert_to/cast to varchar (was: composite type map cast/convert_to) > composite/nested type map/array convert_to/cast to varchar > -- > > Key: DRILL-7375 > URL: https://issues.apache.org/jira/browse/DRILL-7375 > Project: Apache Drill > Issue Type: Wish > Components: Functions - Drill >Affects Versions: 1.16.0 >Reporter: benj >Priority: Major > > As it possible to cast varchar to map (convert_from + JSON) with convert_from > or transform a varchar to array (split) > {code:sql} > SELECT a, typeof(a), sqltypeof(a) from (SELECT CONVERT_FROM('{a : 100, b: > 200}' ,'JSON') a); > +---+-++ > | a | EXPR$1 | EXPR$2 | > +---+-++ > | {"a":100,"b":200} | MAP | STRUCT | > +---+-++ > SELECT a, typeof(a), sqltypeof(a)FROM (SELECT split(str,',') AS a FROM ( > SELECT 'foo,bar' AS str)); > +---+-++ > |a | EXPR$1 | EXPR$2 | > +---+-++ > | ["foo","bar"] | VARCHAR | ARRAY | > +---+-++ > {code} > It will be very usefull : > # to have the capacity to "cast" the +_MAP_ into VARCHAR+ with a "cast > syntax" or with a "convert_to" possibility > Expected: > {code:sql} > SELECT a, typeof(a) ta, va, typeof(va) tva FROM ( > SELECT a, CAST(a AS varchar) va FROM (SELECT CONVERT_FROM('{a : 100, b: 200}' > ,'JSON') a)); > +---+--+---+-+ > | a | ta | va| tva | > +---+--+---+-+ > | {"a":100,"b":200} | MAP | {"a":100,"b":200} | VARCHAR | > +---+--+---+-+ > {code} > # to have the capacity to "cast" the +_ARRAY_ into VARCHAR+ with a "cast > syntax" or any other method > Expected > {code:sql} > SELECT a, sqltypeof(a) ta, va, sqltypeof(va) tva FROM ( > SELECT a, CAST(a AS varchar) va FROM (SELECT split(str,',') AS a FROM ( > SELECT 'foo,bar' AS str)); > +---+--+---+-+ > | a | ta | va| tva | > +---+--+---+-+ > | ["foo","bar"] | ARRAY| ["foo","bar"] | VARCHAR | > +---+--+---+-+ > {code} > Please note that these possibility of 
course exists in other database systems > Example with Postgres: > {code:sql} > SELECT '{"a":100,"b":200}'::json::text; > => {"a":100,"b":200} > SELECT array[1,2,3]::text; > => {1,2,3} > {code} -- This message was sent by Atlassian Jira (v8.3.2#803003)
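The requested CAST(MAP AS VARCHAR) amounts to JSON serialization of the map; in Python terms (an illustration of the expected semantics, not a Drill feature):

```python
import json

m = {"a": 100, "b": 200}                      # the MAP value
text = json.dumps(m, separators=(",", ":"))   # compact JSON rendering

assert text == '{"a":100,"b":200}'            # the rendering the issue expects
assert isinstance(text, str)                  # now a VARCHAR-like value
```

The array case is the same idea: `json.dumps(["foo", "bar"])` yields the `["foo", "bar"]` text form shown in the expected output.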
[jira] [Updated] (DRILL-7375) composite type map cast/convert_to
[ https://issues.apache.org/jira/browse/DRILL-7375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] benj updated DRILL-7375: Description: As it possible to cast varchar to map (convert_from + JSON) with convert_from or transform a varchar to array (split) {code:sql} SELECT a, typeof(a), sqltypeof(a) from (SELECT CONVERT_FROM('{a : 100, b: 200}' ,'JSON') a); +---+-++ | a | EXPR$1 | EXPR$2 | +---+-++ | {"a":100,"b":200} | MAP | STRUCT | +---+-++ SELECT a, typeof(a), sqltypeof(a)FROM (SELECT split(str,',') AS a FROM ( SELECT 'foo,bar' AS str)); +---+-++ |a | EXPR$1 | EXPR$2 | +---+-++ | ["foo","bar"] | VARCHAR | ARRAY | +---+-++ {code} It will be very usefull : # to have the capacity to "cast" the +_MAP_ into VARCHAR+ with a "cast syntax" or with a "convert_to" possibility Expected: {code:sql} SELECT a, typeof(a) ta, va, typeof(va) tva FROM ( SELECT a, CAST(a AS varchar) va FROM (SELECT CONVERT_FROM('{a : 100, b: 200}' ,'JSON') a)); +---+--+---+-+ | a | ta | va| tva | +---+--+---+-+ | {"a":100,"b":200} | MAP | {"a":100,"b":200} | VARCHAR | +---+--+---+-+ {code} # to have the capacity to "cast" the +_ARRAY_ into VARCHAR+ with a "cast syntax" or any other method Expected {code:sql} SELECT a, sqltypeof(a) ta, va, sqltypeof(va) tva FROM ( SELECT a, CAST(a AS varchar) va FROM (SELECT split(str,',') AS a FROM ( SELECT 'foo,bar' AS str)); +---+--+---+-+ | a | ta | va| tva | +---+--+---+-+ | ["foo","bar"] | ARRAY| ["foo","bar"] | VARCHAR | +---+--+---+-+ {code} Please note that these possibility of course exists in other database systems Example with Postgres: {code:sql} SELECT '{"a":100,"b":200}'::json::text; => {"a":100,"b":200} SELECT array[1,2,3]::text; => {1,2,3} {code} was: As it possible to cast varchar to map (convert_from + JSON) with convert_from or transform a varchar to array (split) {code:sql} SELECT a, typeof(a), sqltypeof(a) from (SELECT CONVERT_FROM('{a : 100, b: 200}' ,'JSON') a); +---+-++ | a | EXPR$1 | EXPR$2 | +---+-++ | 
{"a":100,"b":200} | MAP | STRUCT | +---+-++ SELECT a, typeof(a), sqltypeof(a)FROM (SELECT split(str,',') AS a FROM ( SELECT 'foo,bar' AS str)); +---+-++ |a | EXPR$1 | EXPR$2 | +---+-++ | ["foo","bar"] | VARCHAR | ARRAY | +---+-++ {code} It will be very usefull : # to have the capacity to "cast" the +_MAP_ into VARCHAR+ with a "cast syntax" or with a "convert_to" possibility Expected: {code:sql} SELECT a, typeof(a) ta, va, typeof(va) tva FROM ( SELECT a, CAST(a AS varchar) va FROM (SELECT CONVERT_FROM('{a : 100, b: 200}' ,'JSON') a)); +---+--+---+-+ | a | ta | va| tva | +---+--+---+-+ | {"a":100,"b":200} | MAP | {"a":100,"b":200} | VARCHAR | +---+--+---+-+ {code} # to have the capacity to "cast" the +_ARRAY_ into VARCHAR+ with a "cast syntax" or any other method Expected {code:sql} SELECT a, sqltypeof(a) ta, va, sqltypeof(va) tva FROM ( SELECT a, CAST(a AS varchar) va FROM (SELECT split(str,',') AS a FROM ( SELECT 'foo,bar' AS str)); +---+--+---+-+ | a | ta | va| tva | +---+--+---+-+ | ["foo","bar"] | ARRAY| ["foo","bar"] | VARCHAR | +---+--+---+-+ {code} Please note that these possibility of course exists in other database systems Example with postgres: {code:sql} SELECT '{"a":100,"b":200}'::json::text; => {"a":100,"b":200} SELECT array[1,2,3]::text; => {1,2,3} {code} > composite type map cast/convert_to > -- > > Key: DRILL-7375 > URL: https://issues.apache.org/jira/browse/DRILL-7375 > Project: Apache Drill > Issue Type: Wish > Components: Functions - Drill >Affects Versions: 1.16.0 >Reporter: benj >Priority: Major > > As it possible to cast varchar to map (convert_from + JSON) with convert_from > or transform a
[jira] [Updated] (DRILL-7375) composite type map cast/convert_to
[ https://issues.apache.org/jira/browse/DRILL-7375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] benj updated DRILL-7375: Description: As it possible to cast varchar to map (convert_from + JSON) with convert_from or transform a varchar to array (split) {code:sql} SELECT a, typeof(a), sqltypeof(a) from (SELECT CONVERT_FROM('{a : 100, b: 200}' ,'JSON') a); +---+-++ | a | EXPR$1 | EXPR$2 | +---+-++ | {"a":100,"b":200} | MAP | STRUCT | +---+-++ SELECT a, typeof(a), sqltypeof(a)FROM (SELECT split(str,',') AS a FROM ( SELECT 'foo,bar' AS str)); +---+-++ |a | EXPR$1 | EXPR$2 | +---+-++ | ["foo","bar"] | VARCHAR | ARRAY | +---+-++ {code} It will be very usefull : # to have the capacity to "cast" the +_MAP_ into VARCHAR+ with a "cast syntax" or with a "convert_to" possibility Expected: {code:sql} SELECT a, typeof(a) ta, va, typeof(va) tva FROM ( SELECT a, CAST(a AS varchar) va FROM (SELECT CONVERT_FROM('{a : 100, b: 200}' ,'JSON') a)); +---+--+---+-+ | a | ta | va| tva | +---+--+---+-+ | {"a":100,"b":200} | MAP | {"a":100,"b":200} | VARCHAR | +---+--+---+-+ {code} # to have the capacity to "cast" the +_ARRAY_ into VARCHAR+ with a "cast syntax" or any other method Expected {code:sql} SELECT a, sqltypeof(a) ta, va, sqltypeof(va) tva FROM ( SELECT a, CAST(a AS varchar) va FROM (SELECT split(str,',') AS a FROM ( SELECT 'foo,bar' AS str)); +---+--+---+-+ | a | ta | va| tva | +---+--+---+-+ | ["foo","bar"] | ARRAY| ["foo","bar"] | VARCHAR | +---+--+---+-+ {code} Please note that these possibility of course exists in other database systems Example with postgres: {code:sql} SELECT '{"a":100,"b":200}'::json::text; => {"a":100,"b":200} SELECT array[1,2,3]::text; => {1,2,3} {code} was: As it possible to cast varchar to map (JSON) with convert_from {code:sql} SELECT a, typeof(a) from (SELECT CONVERT_FROM('{a : 100, b: 200}' ,'JSON') a); +---++ | a | EXPR$1 | +---++ | {"a":100,"b":200} | MAP| +---++ {code} It will be very usefull to have the capacity to "cast" the 
_MAP_ into VARCHAR with a "cast syntax" or with a "convert_to" possibility Expected: {code:sql} SELECT a, typeof(a) ta, va, typeof(va) tva FROM ( SELECT a, CAST(a AS varchar) va from (SELECT CONVERT_FROM('{a : 100, b: 200}' ,'JSON') a)); +---+--+---+-+ | a | ta | va| tva | +---+--+---+-+ | {"a":100,"b":200} | MAP | {"a":100,"b":200} | VARCHAR | +---+--+---+-+ {code} Please note that these possibility of course exists in other database systems Example with postgres: {code:sql} SELECT '{"a":100,"b":200}'::json::text; SELECT (array[1,2,3])::text; {code} > composite type map cast/convert_to > -- > > Key: DRILL-7375 > URL: https://issues.apache.org/jira/browse/DRILL-7375 > Project: Apache Drill > Issue Type: Wish > Components: Functions - Drill >Affects Versions: 1.16.0 >Reporter: benj >Priority: Major > > As it possible to cast varchar to map (convert_from + JSON) with convert_from > or transform a varchar to array (split) > {code:sql} > SELECT a, typeof(a), sqltypeof(a) from (SELECT CONVERT_FROM('{a : 100, b: > 200}' ,'JSON') a); > +---+-++ > | a | EXPR$1 | EXPR$2 | > +---+-++ > | {"a":100,"b":200} | MAP | STRUCT | > +---+-++ > SELECT a, typeof(a), sqltypeof(a)FROM (SELECT split(str,',') AS a FROM ( > SELECT 'foo,bar' AS str)); > +---+-++ > |a | EXPR$1 | EXPR$2 | > +---+-++ > | ["foo","bar"] | VARCHAR | ARRAY | > +---+-++ > {code} > It will be very usefull : > # to have the capacity to "cast" the +_MAP_ into VARCHAR+ with a "cast > syntax" or with a "convert_to" possibility > Expected: > {code:sql} > SELECT a, typeof(a) ta, va, typeof(va) tva FROM ( > SELECT a, CAST(a AS varchar) va FROM (SELECT CONVERT_FROM('{a : 100, b: 200}' > ,'JSON') a)); >
[jira] [Updated] (DRILL-7375) composite type map cast/convert_to
[ https://issues.apache.org/jira/browse/DRILL-7375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] benj updated DRILL-7375: Description: As it possible to cast varchar to map (JSON) with convert_from {code:sql} SELECT a, typeof(a) from (SELECT CONVERT_FROM('{a : 100, b: 200}' ,'JSON') a); +---++ | a | EXPR$1 | +---++ | {"a":100,"b":200} | MAP| +---++ {code} It will be very usefull to have the capacity to "cast" the _MAP_ into VARCHAR with a "cast syntax" or with a "convert_to" possibility Expected: {code:sql} SELECT a, typeof(a) ta, va, typeof(va) tva FROM ( SELECT a, CAST(a AS varchar) va from (SELECT CONVERT_FROM('{a : 100, b: 200}' ,'JSON') a)); +---+--+---+-+ | a | ta | va| tva | +---+--+---+-+ | {"a":100,"b":200} | MAP | {"a":100,"b":200} | VARCHAR | +---+--+---+-+ {code} Please note that these possibility of course exists in other database systems Example with postgres: {code:sql} SELECT '{"a":100,"b":200}'::json::text; SELECT (array[1,2,3])::text; {code} was: As it possible to cast varchar to map (JSON) with convert_from {code:sql} SELECT a, typeof(a) from (SELECT CONVERT_FROM('{a : 100, b: 200}' ,'JSON') a); +---++ | a | EXPR$1 | +---++ | {"a":100,"b":200} | MAP| +---++ {code} It will be very usefull to have the capacity to "cast" the _MAP_ into VARCHAR with a "cast syntax" or with a "convert_to" possibility Expected: {code:sql} SELECT a, typeof(a) ta, va, typeof(va) tva FROM ( SELECT a, CAST(a AS varchar) va from (SELECT CONVERT_FROM('{a : 100, b: 200}' ,'JSON') a)); +---+--+---+-+ | a | ta | va| tva | +---+--+---+-+ | {"a":100,"b":200} | MAP | {"a":100,"b":200} | VARCHAR | +---+--+---+-+ {code} Please note that these possibility of course exists in other database systems Example with postgres: {code:sql} SELECT '{"a":100,"b":200}'::json::text {code} > composite type map cast/convert_to > -- > > Key: DRILL-7375 > URL: https://issues.apache.org/jira/browse/DRILL-7375 > Project: Apache Drill > Issue Type: Wish > Components: Functions - 
Drill >Affects Versions: 1.16.0 >Reporter: benj >Priority: Major > > As it possible to cast varchar to map (JSON) with convert_from > {code:sql} > SELECT a, typeof(a) from (SELECT CONVERT_FROM('{a : 100, b: 200}' ,'JSON') a); > +---++ > | a | EXPR$1 | > +---++ > | {"a":100,"b":200} | MAP| > +---++ > {code} > It will be very usefull to have the capacity to "cast" the _MAP_ into VARCHAR > with a "cast syntax" or with a "convert_to" possibility > Expected: > {code:sql} > SELECT a, typeof(a) ta, va, typeof(va) tva FROM ( > SELECT a, CAST(a AS varchar) va from (SELECT CONVERT_FROM('{a : 100, b: 200}' > ,'JSON') a)); > +---+--+---+-+ > | a | ta | va| tva | > +---+--+---+-+ > | {"a":100,"b":200} | MAP | {"a":100,"b":200} | VARCHAR | > +---+--+---+-+ > {code} > > > Please note that these possibility of course exists in other database systems > Example with postgres: > {code:sql} > SELECT '{"a":100,"b":200}'::json::text; > SELECT (array[1,2,3])::text; > {code} -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Updated] (DRILL-7375) composite type map cast/convert_to
[ https://issues.apache.org/jira/browse/DRILL-7375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] benj updated DRILL-7375:
Description:
As it is possible to cast a varchar to a map (JSON) with convert_from
{code:sql}
SELECT a, typeof(a) from (SELECT CONVERT_FROM('{a : 100, b: 200}' ,'JSON') a);
+-------------------+--------+
|         a         | EXPR$1 |
+-------------------+--------+
| {"a":100,"b":200} | MAP    |
+-------------------+--------+
{code}
it would be very useful to be able to "cast" the _MAP_ into VARCHAR, either with a "cast syntax" or with a "convert_to" possibility.
Expected:
{code:sql}
SELECT a, typeof(a) ta, va, typeof(va) tva FROM (
SELECT a, CAST(a AS varchar) va from (SELECT CONVERT_FROM('{a : 100, b: 200}' ,'JSON') a));
+-------------------+-----+-------------------+---------+
|         a         | ta  |        va         |   tva   |
+-------------------+-----+-------------------+---------+
| {"a":100,"b":200} | MAP | {"a":100,"b":200} | VARCHAR |
+-------------------+-----+-------------------+---------+
{code}
Please note that this possibility of course exists in other database systems. Example with postgres:
{code:sql}
SELECT '{"a":100,"b":200}'::json::text
{code}

> composite type map cast/convert_to
> ----------------------------------
>
>                 Key: DRILL-7375
>                 URL: https://issues.apache.org/jira/browse/DRILL-7375
>             Project: Apache Drill
>          Issue Type: Wish
>          Components: Functions - Drill
>    Affects Versions: 1.16.0
>            Reporter: benj
>            Priority: Major
>
-- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Updated] (DRILL-7375) composite type map cast/convert_to
[ https://issues.apache.org/jira/browse/DRILL-7375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] benj updated DRILL-7375:
Description:
As it is possible to cast a varchar to a map (JSON) with convert_from
{code:sql}
SELECT a, typeof(a) from (SELECT CONVERT_FROM('{a : 100, b: 200}' ,'JSON') a);
+-------------------+--------+
|         a         | EXPR$1 |
+-------------------+--------+
| {"a":100,"b":200} | MAP    |
+-------------------+--------+
{code}
it would be very useful to be able to "cast" the _MAP_ into VARCHAR, either with a "cast syntax" or with a "convert_to" possibility.
Expected:
{code:sql}
SELECT a, typeof(a) ta, va, typeof(va) tva FROM (
SELECT a, CAST(a AS varchar) va from (SELECT CONVERT_FROM('{a : 100, b: 200}' ,'JSON') a));
+-------------------+-----+-------------------+---------+
|         a         | ta  |        va         |   tva   |
+-------------------+-----+-------------------+---------+
| {"a":100,"b":200} | MAP | {"a":100,"b":200} | VARCHAR |
+-------------------+-----+-------------------+---------+
{code}
Please note that this possibility of course exists in other database systems. Example with postgres:
{code:sql}
SELECT '{"a":100,"b":200}'::json::text
{code}
was: the same description without the "Expected" example.

> composite type map cast/convert_to
> ----------------------------------
>
>                 Key: DRILL-7375
>                 URL: https://issues.apache.org/jira/browse/DRILL-7375
>             Project: Apache Drill
>          Issue Type: Wish
>          Components: Functions - Drill
>    Affects Versions: 1.16.0
>            Reporter: benj
>            Priority: Major
>
-- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Created] (DRILL-7375) composite type map cast/convert_to
benj created DRILL-7375:
---
             Summary: composite type map cast/convert_to
                 Key: DRILL-7375
                 URL: https://issues.apache.org/jira/browse/DRILL-7375
             Project: Apache Drill
          Issue Type: Wish
          Components: Functions - Drill
    Affects Versions: 1.16.0
            Reporter: benj

As it is possible to cast a varchar to a map (JSON) with convert_from
{code:sql}
SELECT a, typeof(a) from (SELECT CONVERT_FROM('{a : 100, b: 200}' ,'JSON') a);
+-------------------+--------+
|         a         | EXPR$1 |
+-------------------+--------+
| {"a":100,"b":200} | MAP    |
+-------------------+--------+
{code}
it would be very useful to be able to "cast" the _MAP_ into VARCHAR, either with a "cast syntax" or with a "convert_to" possibility.
Please note that this possibility of course exists in other database systems (example postgres: _SELECT '{"a":100,"b":200}'::json::text_)
-- This message was sent by Atlassian Jira (v8.3.2#803003)
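The wished-for MAP to VARCHAR cast is, in effect, JSON serialization: the inverse of CONVERT_FROM(..., 'JSON'). A minimal sketch of the intended semantics (illustrative Python, not Drill code):

```python
import json

# VARCHAR -> MAP: what Drill's CONVERT_FROM(..., 'JSON') already does.
text = '{"a": 100, "b": 200}'
m = json.loads(text)

# MAP -> VARCHAR: the requested CAST(a AS varchar) / convert_to step,
# i.e. serializing the map back to its compact JSON text form.
back = json.dumps(m, separators=(",", ":"))
print(back)  # {"a":100,"b":200}
```

This matches the expected output in the ticket: the round-tripped text equals Drill's displayed MAP rendering.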
[jira] [Updated] (DRILL-6975) TO_CHAR does not seems work well depends on LOCALE
[ https://issues.apache.org/jira/browse/DRILL-6975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] benj updated DRILL-6975: Affects Version/s: 1.14.0 1.16.0

> TO_CHAR does not seems work well depends on LOCALE
> --------------------------------------------------
>
>                 Key: DRILL-6975
>                 URL: https://issues.apache.org/jira/browse/DRILL-6975
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Functions - Drill
>    Affects Versions: 1.14.0, 1.15.0, 1.16.0
>            Reporter: benj
>            Priority: Major
>
> Strange results from the TO_CHAR function when using different LOCALEs.
> {code:java}
> SELECT TO_CHAR((CAST('2008-2-23' AS DATE)), 'yyyy-MMM-dd') FROM (VALUES(1));
> 2008-Feb-23 (as in the documentation (en_US.UTF-8))
> 2008-févr.-2 (fr_FR.UTF-8)
> {code}
> Surprisingly, adding a space ('yyyy-MMM-dd ') (or any character) at the end of the format makes the result correct (so there is no problem when formatting a timestamp with 'yyyy MMM dd HH:mm:ss').
> {code:java}
> SELECT TO_CHAR(1256.789383, '#,###.###') FROM (VALUES(1));
> 1,256.789 (as in the documentation (en_US.UTF-8))
> 1 256,78 (fr_FR.UTF-8)
> {code}
> Even worse results can be obtained:
> {code:java}
> SELECT TO_CHAR(12567,'#,###.###');
> 12,567 (en_US.UTF-8)
> 12 56 (fr_FR.UTF-8)
> {code}
> Again, adding a space/character at the end gives a better result.
> I have not tested all the locales, but for the last example the result is correct with de_DE.UTF-8: 12.567
> The situation is identical in 1.14.
-- This message was sent by Atlassian Jira (v8.3.2#803003)
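For context on why the rendered text differs per locale: DecimalFormat-style patterns such as '#,###.###' are locale-independent placeholders, while the actual grouping and decimal symbols are supplied by the active locale (comma/dot in en_US, no-break space/comma in fr_FR, dot/comma in de_DE). A small illustrative sketch of that substitution (Python, hypothetical `format_grouped` helper, not Drill's implementation); the truncation itself ('12 56' losing a digit) looks like a separate bug in Drill's output handling:

```python
def format_grouped(n: int, grouping_sep: str) -> str:
    """Mimic a '#,###' pattern: fixed grouping, locale-supplied separator."""
    s = f"{n:,}"                 # '12,567' with the ASCII comma
    return s.replace(",", grouping_sep)

print(format_grouped(12567, ","))       # en_US style: 12,567
print(format_grouped(12567, "\u00a0"))  # fr_FR style: 12 567 (no-break space)
print(format_grouped(12567, "."))       # de_DE style: 12.567
```

All three strings have the same digits; only the separator symbol changes, which is consistent with de_DE.UTF-8 producing the correct '12.567' while fr_FR.UTF-8 misbehaves only in Drill's rendering.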
[jira] [Updated] (DRILL-7371) DST/UTC cast/to_timestamp problem
[ https://issues.apache.org/jira/browse/DRILL-7371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] benj updated DRILL-7371: Component/s: Client - JDBC

> DST/UTC cast/to_timestamp problem
> ---------------------------------
>
>                 Key: DRILL-7371
>                 URL: https://issues.apache.org/jira/browse/DRILL-7371
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Client - JDBC, Functions - Drill
>    Affects Versions: 1.16.0
>            Reporter: benj
>            Priority: Major
>
> With LC_TIME=fr_FR.UTF-8 and +drillbits configured in UTC+ (as specified in [http://www.openkb.info/2015/05/understanding-drills-timestamp-and.html#.VUzhotpVhHw] found from [https://drill.apache.org/docs/data-type-conversion/#to_timestamp])
> {code:sql}
> SELECT TIMEOFDAY();
> +-----------------------------+
> |           EXPR$0            |
> +-----------------------------+
> | 2019-09-11 08:20:12.247 UTC |
> +-----------------------------+
> {code}
> Problems appear when applying _cast/to_timestamp_ to dates related to the DST (Daylight Saving Time) transitions of some countries.
> To illustrate, all the following requests give the same +wrong+ result:
> {code:sql}
> SELECT to_timestamp('2018-03-25 02:22:40 UTC','yyyy-MM-dd HH:mm:ss z');
> SELECT to_timestamp('2018-03-25 02:22:40','yyyy-MM-dd HH:mm:ss');
> SELECT cast('2018-03-25 02:22:40' as timestamp);
> SELECT cast('2018-03-25 02:22:40 +' as timestamp);
> +-----------------------+
> |        EXPR$0         |
> +-----------------------+
> | 2018-03-25 03:22:40.0 |
> +-----------------------+
> {code}
> while the result should be "2018-03-25 +02+:22:40.0".
> A UTC date and time in a string shouldn't change when cast to a UTC timestamp.
> To illustrate, the following requests produce +good+ results:
> {code:java}
> SELECT to_timestamp('2018-03-26 02:22:40 UTC','yyyy-MM-dd HH:mm:ss z');
> +-----------------------+
> |        EXPR$0         |
> +-----------------------+
> | 2018-03-26 02:22:40.0 |
> +-----------------------+
> SELECT CAST('2018-03-24 02:22:40' AS timestamp);
> +-----------------------+
> |        EXPR$0         |
> +-----------------------+
> | 2018-03-24 02:22:40.0 |
> +-----------------------+
> {code}
-- This message was sent by Atlassian Jira (v8.3.2#803003)
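The reported shift matches classic spring-forward behavior: in Europe/Paris, 2018-03-25 02:22:40 falls inside the 02:00 to 03:00 DST gap, so a parser that (incorrectly, for an already-UTC input) interprets the string in the local zone lands on 03:22:40, while the neighboring dates 03-24 and 03-26 are unaffected. A sketch of that mechanism with Python's zoneinfo (illustrative, not Drill's actual code path; assumes a system tz database is available):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

paris = ZoneInfo("Europe/Paris")

# '2018-03-25 02:22:40' does not exist on Paris wall clocks: at 02:00 local
# time, clocks jumped straight to 03:00 (CET +01:00 -> CEST +02:00).
naive = datetime(2018, 3, 25, 2, 22, 40)
as_local = naive.replace(tzinfo=paris)

# A gap time resolves with the pre-transition offset (+01:00); normalizing
# it through UTC and back to Paris wall time yields the hour-shifted value.
normalized = as_local.astimezone(timezone.utc).astimezone(paris)
print(normalized)  # 2018-03-25 03:22:40+02:00, i.e. the "wrong" 03:22:40
```

This supports the report's conclusion: the input was already UTC, so interpreting it in a DST-observing local zone is what introduces the one-hour shift.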
[jira] [Commented] (DRILL-7371) DST/UTC cast/to_timestamp problem
[ https://issues.apache.org/jira/browse/DRILL-7371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16927527#comment-16927527 ] benj commented on DRILL-7371:
[~vvysotskyi], the problem occurs on all Daylight Saving Time transitions of Europe (Paris).
From investigation after your message, it appears that the problem occurs in console mode (/bin/sqlline -u jdbc:drill:zk=...:2181;schema=myhdfs).
The result is also wrong when the request is executed via Zeppelin (JDBC too).
But there is no problem when the request is launched directly in the Apache Drill web interface ([http://...:8047/query]).
So it seems that the problem probably comes from JDBC.

> DST/UTC cast/to_timestamp problem
> ---------------------------------
>
>                 Key: DRILL-7371
>                 URL: https://issues.apache.org/jira/browse/DRILL-7371
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Functions - Drill
>    Affects Versions: 1.16.0
>            Reporter: benj
>            Priority: Major
>
-- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Updated] (DRILL-7371) DST/UTC cast/to_timestamp problem
[ https://issues.apache.org/jira/browse/DRILL-7371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] benj updated DRILL-7371: Summary: DST/UTC cast/to_timestamp problem (was: DST/UTC problem)

> DST/UTC cast/to_timestamp problem
> ---------------------------------
>
>                 Key: DRILL-7371
>                 URL: https://issues.apache.org/jira/browse/DRILL-7371
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Functions - Drill
>    Affects Versions: 1.16.0
>            Reporter: benj
>            Priority: Major
>
-- This message was sent by Atlassian Jira (v8.3.2#803003)