[jira] [Closed] (DRILL-8443) upgrade netty to 4.1.94 due to CVE

2023-12-13 Thread PJ Fanning (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-8443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

PJ Fanning closed DRILL-8443.
-
Resolution: Duplicate

> upgrade netty to 4.1.94 due to CVE
> --
>
> Key: DRILL-8443
> URL: https://issues.apache.org/jira/browse/DRILL-8443
> Project: Apache Drill
>  Issue Type: Task
>  Components:  Server
>Reporter: PJ Fanning
>Priority: Major
>
> https://github.com/apache/drill/security/dependabot/45



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (DRILL-8466) logback 1.3.14 (due to CVE)

2023-12-02 Thread PJ Fanning (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-8466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

PJ Fanning updated DRILL-8466:
--
Summary: logback 1.3.14 (due to CVE)  (was: logback 1.3.13 (due to CVE))

> logback 1.3.14 (due to CVE)
> ---
>
> Key: DRILL-8466
> URL: https://issues.apache.org/jira/browse/DRILL-8466
> Project: Apache Drill
>  Issue Type: Improvement
>  Components:  Server
>Reporter: PJ Fanning
>Priority: Major
>
> https://github.com/advisories/GHSA-vmq6-5m68-f53m





[jira] [Created] (DRILL-8466) logback 1.3.13 (due to CVE)

2023-12-01 Thread PJ Fanning (Jira)
PJ Fanning created DRILL-8466:
-

 Summary: logback 1.3.13 (due to CVE)
 Key: DRILL-8466
 URL: https://issues.apache.org/jira/browse/DRILL-8466
 Project: Apache Drill
  Issue Type: Improvement
  Components:  Server
Reporter: PJ Fanning


https://github.com/advisories/GHSA-vmq6-5m68-f53m





[jira] [Created] (DRILL-8465) check data input when loading iceberg data

2023-12-01 Thread PJ Fanning (Jira)
PJ Fanning created DRILL-8465:
-

 Summary: check data input when loading iceberg data
 Key: DRILL-8465
 URL: https://issues.apache.org/jira/browse/DRILL-8465
 Project: Apache Drill
  Issue Type: Improvement
  Components: Storage - Iceberg
Reporter: PJ Fanning








[jira] [Created] (DRILL-8464) GitHubActions: checkout action needs to be upgraded to v4 due to node16 deprecation

2023-11-26 Thread PJ Fanning (Jira)
PJ Fanning created DRILL-8464:
-

 Summary: GitHubActions: checkout action needs to be upgraded to v4 
due to node16 deprecation 
 Key: DRILL-8464
 URL: https://issues.apache.org/jira/browse/DRILL-8464
 Project: Apache Drill
  Issue Type: Improvement
Reporter: PJ Fanning



The following actions uses node12 which is deprecated and will be forced to run 
on node16: actions/checkout@v2. For more info: 
https://github.blog/changelog/2023-06-13-github-actions-all-actions-will-run-on-node16-instead-of-node12-by-default/






[jira] [Created] (DRILL-8463) upgrade to bouncy castle jdk1.8 jars

2023-11-26 Thread PJ Fanning (Jira)
PJ Fanning created DRILL-8463:
-

 Summary: upgrade to bouncy castle jdk1.8 jars
 Key: DRILL-8463
 URL: https://issues.apache.org/jira/browse/DRILL-8463
 Project: Apache Drill
  Issue Type: Improvement
Reporter: PJ Fanning


They have stopped releasing the JDK 1.5 supporting jars. This lib is 
important for security purposes. 

https://www.bouncycastle.org/latest_releases.html





[jira] [Updated] (DRILL-8462) upgrade to poi 5.2.5

2023-11-26 Thread PJ Fanning (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-8462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

PJ Fanning updated DRILL-8462:
--
Description: 
Includes some regression fixes but these probably don't affect Drill usage.

https://poi.apache.org/changes.html

  was:Includes some regression fixes but these probably don't affect Drill 
usage.


> upgrade to poi 5.2.5
> 
>
> Key: DRILL-8462
> URL: https://issues.apache.org/jira/browse/DRILL-8462
> Project: Apache Drill
>  Issue Type: Improvement
>Reporter: PJ Fanning
>Priority: Major
>
> Includes some regression fixes but these probably don't affect Drill usage.
> https://poi.apache.org/changes.html





[jira] [Created] (DRILL-8462) upgrade to poi 5.2.5

2023-11-26 Thread PJ Fanning (Jira)
PJ Fanning created DRILL-8462:
-

 Summary: upgrade to poi 5.2.5
 Key: DRILL-8462
 URL: https://issues.apache.org/jira/browse/DRILL-8462
 Project: Apache Drill
  Issue Type: Improvement
Reporter: PJ Fanning


Includes some regression fixes but these probably don't affect Drill usage.





[jira] [Reopened] (DRILL-8460) Bump zookeeper jar to 3.7.2 due to CVE

2023-10-31 Thread PJ Fanning (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-8460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

PJ Fanning reopened DRILL-8460:
---
  Assignee: (was: PJ Fanning)

This is not fixed. The CI build had some test failures that indicate that we 
may not be able to upgrade.

> Bump zookeeper jar to 3.7.2 due to CVE
> --
>
> Key: DRILL-8460
> URL: https://issues.apache.org/jira/browse/DRILL-8460
> Project: Apache Drill
>  Issue Type: Sub-task
>Affects Versions: 1.21.1
>Reporter: PJ Fanning
>Priority: Major
> Fix For: 1.22.0
>
>
> https://github.com/apache/drill/security/dependabot/51





[jira] [Updated] (DRILL-8460) bump zookeeper jar to 3.7.2 due to cve

2023-10-30 Thread PJ Fanning (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-8460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

PJ Fanning updated DRILL-8460:
--
Parent: DRILL-8452
Issue Type: Sub-task  (was: Improvement)

> bump zookeeper jar to 3.7.2 due to cve
> --
>
> Key: DRILL-8460
> URL: https://issues.apache.org/jira/browse/DRILL-8460
> Project: Apache Drill
>  Issue Type: Sub-task
>Reporter: PJ Fanning
>Priority: Major
>
> https://github.com/apache/drill/security/dependabot/51





[jira] [Created] (DRILL-8460) bump zookeeper jar to 3.7.2 due to cve

2023-10-30 Thread PJ Fanning (Jira)
PJ Fanning created DRILL-8460:
-

 Summary: bump zookeeper jar to 3.7.2 due to cve
 Key: DRILL-8460
 URL: https://issues.apache.org/jira/browse/DRILL-8460
 Project: Apache Drill
  Issue Type: Improvement
Reporter: PJ Fanning


https://github.com/apache/drill/security/dependabot/51





[jira] [Updated] (DRILL-8459) bump avro to 1.11.3 due to cve

2023-10-30 Thread PJ Fanning (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-8459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

PJ Fanning updated DRILL-8459:
--
Parent: DRILL-8452
Issue Type: Sub-task  (was: Improvement)

> bump avro to 1.11.3 due to cve
> --
>
> Key: DRILL-8459
> URL: https://issues.apache.org/jira/browse/DRILL-8459
> Project: Apache Drill
>  Issue Type: Sub-task
>Reporter: PJ Fanning
>Priority: Major
>
> https://github.com/apache/drill/security/dependabot/49





[jira] [Created] (DRILL-8459) bump avro to 1.11.3 due to cve

2023-10-30 Thread PJ Fanning (Jira)
PJ Fanning created DRILL-8459:
-

 Summary: bump avro to 1.11.3 due to cve
 Key: DRILL-8459
 URL: https://issues.apache.org/jira/browse/DRILL-8459
 Project: Apache Drill
  Issue Type: Improvement
Reporter: PJ Fanning


https://github.com/apache/drill/security/dependabot/49





[jira] [Created] (DRILL-8456) uptake POI 5.2.4

2023-09-28 Thread PJ Fanning (Jira)
PJ Fanning created DRILL-8456:
-

 Summary: uptake POI 5.2.4
 Key: DRILL-8456
 URL: https://issues.apache.org/jira/browse/DRILL-8456
 Project: Apache Drill
  Issue Type: Improvement
Reporter: PJ Fanning


latest release with some transitive dependencies having security patches





[jira] [Created] (DRILL-8445) Upgrade Janino

2023-07-04 Thread PJ Fanning (Jira)
PJ Fanning created DRILL-8445:
-

 Summary: Upgrade Janino
 Key: DRILL-8445
 URL: https://issues.apache.org/jira/browse/DRILL-8445
 Project: Apache Drill
  Issue Type: Task
  Components:  Server
Reporter: PJ Fanning


I'm not familiar with exactly how janino is used inside Drill.

There is a new 3.1.10 release today to fix 
[https://github.com/janino-compiler/janino/issues/201]

This may be an issue if Janino is used to parse input that may not be entirely 
trustworthy.





[jira] [Created] (DRILL-8443) upgrade netty to 4.1.94 due to CVE

2023-06-24 Thread PJ Fanning (Jira)
PJ Fanning created DRILL-8443:
-

 Summary: upgrade netty to 4.1.94 due to CVE
 Key: DRILL-8443
 URL: https://issues.apache.org/jira/browse/DRILL-8443
 Project: Apache Drill
  Issue Type: Task
  Components:  Server
Reporter: PJ Fanning


https://github.com/apache/drill/security/dependabot/45





[jira] [Created] (DRILL-8431) add immutable wrapper for ObjectMapper

2023-05-09 Thread PJ Fanning (Jira)
PJ Fanning created DRILL-8431:
-

 Summary: add immutable wrapper for ObjectMapper
 Key: DRILL-8431
 URL: https://issues.apache.org/jira/browse/DRILL-8431
 Project: Apache Drill
  Issue Type: Task
  Components:  Server
Reporter: PJ Fanning


The Jackson based code in Drill is quite complicated and passes around 
ObjectMapper instances in a way that is difficult to maintain.

We need to balance the objective of reusing ObjectMapper instances (because 
they are fairly expensive to create) against the risk that code modifies an 
ObjectMapper instance (extra config or extra modules added) in a way that 
affects other code that uses the same instance.

Jackson 3 (which is under development but a long way off) moves towards making 
ObjectMappers immutable. Mapper Builders are used instead to configure mappers. 
Some of these API changes are already backported to Jackson 2.

My suggestion in this Jira is that we create a new Drill class called 
ImmutableObjectMapper and this exposes API methods for reading and writing JSON 
but that hides methods for configuring the mapper. We can wrap some of our 
ObjectMappers. It will probably take a few iterations to get everything 
switched over but we can start with the low hanging fruit.

This class would allow the Java compiler to check for any untidy attempts to 
modify an ObjectMapper that was created elsewhere. 
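
A minimal sketch of the wrapper idea described above, for illustration only. 
It assumes a toy stand-in class (ToyMapper) in place of Jackson's ObjectMapper 
so that it runs with the JDK alone. ToyMapper, ImmutableToyMapper and 
setSpaced are hypothetical names, not Drill or Jackson APIs. The point is 
that the wrapper forwards the write method but exposes no configuration 
method, so an attempt to reconfigure a shared instance fails at compile time.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ImmutableMapperSketch {

    /** Toy stand-in for Jackson's ObjectMapper (NOT the real class): it can be reconfigured. */
    static final class ToyMapper {
        private boolean spaced = false;

        /** Mutating configuration call, analogous to adding config or modules to a mapper. */
        ToyMapper setSpaced(boolean spaced) {
            this.spaced = spaced;
            return this;
        }

        /** Writes a flat map as a JSON-like string (keys are not escaped; illustration only). */
        String write(Map<String, Integer> value) {
            StringBuilder sb = new StringBuilder("{");
            String separator = spaced ? ", " : ",";
            boolean first = true;
            for (Map.Entry<String, Integer> entry : value.entrySet()) {
                if (!first) {
                    sb.append(separator);
                }
                sb.append('"').append(entry.getKey()).append("\":").append(entry.getValue());
                first = false;
            }
            return sb.append('}').toString();
        }
    }

    /** Hypothetical wrapper in the spirit of the proposed ImmutableObjectMapper. */
    static final class ImmutableToyMapper {
        private final ToyMapper delegate;

        ImmutableToyMapper(ToyMapper delegate) {
            this.delegate = delegate;
        }

        /** Writing is forwarded unchanged to the wrapped mapper. */
        String write(Map<String, Integer> value) {
            return delegate.write(value);
        }
        // Deliberately no setSpaced(...): the wrapper offers no way to reconfigure.
    }

    public static void main(String[] args) {
        ImmutableToyMapper shared = new ImmutableToyMapper(new ToyMapper());
        Map<String, Integer> row = new LinkedHashMap<>();
        row.put("a", 1);
        row.put("b", 2);
        System.out.println(shared.write(row)); // prints {"a":1,"b":2}
        // shared.setSpaced(true);             // would not compile
    }
}
```

Code that still holds the original ToyMapper reference could still mutate it, 
so a wrapper like this only helps if the wrapped instance is not handed out.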





[jira] [Created] (DRILL-8430) add factory method for creating Jackson ObjectMappers

2023-05-09 Thread PJ Fanning (Jira)
PJ Fanning created DRILL-8430:
-

 Summary: add factory method for creating Jackson ObjectMappers
 Key: DRILL-8430
 URL: https://issues.apache.org/jira/browse/DRILL-8430
 Project: Apache Drill
  Issue Type: Task
  Components:  Server
Reporter: PJ Fanning


See https://issues.apache.org/jira/browse/DRILL-8415

It's useful to keep any customisation of the ObjectMapper creation in one place.





[jira] [Created] (DRILL-8429) jackson 2.14.3

2023-05-05 Thread PJ Fanning (Jira)
PJ Fanning created DRILL-8429:
-

 Summary: jackson 2.14.3
 Key: DRILL-8429
 URL: https://issues.apache.org/jira/browse/DRILL-8429
 Project: Apache Drill
  Issue Type: Task
  Components:  Server
Reporter: PJ Fanning


Jackson 2.14.3 has perf and security hardening improvements

https://github.com/FasterXML/jackson/wiki/Jackson-Release-2.14.3

prelude to DRILL-8415





[jira] [Commented] (DRILL-8415) Jackson 2.15

2023-05-05 Thread PJ Fanning (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-8415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17719976#comment-17719976
 ] 

PJ Fanning commented on DRILL-8415:
---

[~cgivre] [~dzamo] Would it be a good idea to create a factory method in 
drill-common for creating ObjectMappers? It would be a good way of centralising 
the logic about creating and configuring these mappers. `new ObjectMapper()` 
has the problem of relying on the default settings for everything. 

> Jackson 2.15
> 
>
> Key: DRILL-8415
> URL: https://issues.apache.org/jira/browse/DRILL-8415
> Project: Apache Drill
>  Issue Type: Improvement
>Reporter: PJ Fanning
>Priority: Major
>
> I'm not advocating for an upgrade to [Jackson 
> 2.15|https://github.com/FasterXML/jackson/wiki/Jackson-Release-2.15]. 
> 2.15.0-rc1 has just been released and 2.15.0 should be out soon.
> There are some security focused enhancements including a new class called 
> StreamReadConstraints. The defaults on 
> [StreamReadConstraints|https://javadoc.io/static/com.fasterxml.jackson.core/jackson-core/2.15.0-rc1/com/fasterxml/jackson/core/StreamReadConstraints.html]
>  are pretty high but it is not inconceivable that some Drill users might need 
> to relax them. Parsing large strings as numbers is sub-quadratic, thus the 
> default limit of 1000 chars or bytes (depending on input context).
> When the Drill team consider upgrading to Jackson 2.15 or above, you might 
> also want to consider adding some way for users to configure the 
> StreamReadConstraints.





[jira] [Created] (DRILL-8415) Jackson 2.15

2023-03-19 Thread PJ Fanning (Jira)
PJ Fanning created DRILL-8415:
-

 Summary: Jackson 2.15
 Key: DRILL-8415
 URL: https://issues.apache.org/jira/browse/DRILL-8415
 Project: Apache Drill
  Issue Type: Improvement
Reporter: PJ Fanning


I'm not advocating for an upgrade to [Jackson 
2.15|https://github.com/FasterXML/jackson/wiki/Jackson-Release-2.15]. 
2.15.0-rc1 has just been released and 2.15.0 should be out soon.

There are some security focused enhancements including a new class called 
StreamReadConstraints. The defaults on 
[StreamReadConstraints|https://javadoc.io/static/com.fasterxml.jackson.core/jackson-core/2.15.0-rc1/com/fasterxml/jackson/core/StreamReadConstraints.html]
 are pretty high but it is not inconceivable that some Drill users might need 
to relax them. Parsing large strings as numbers is sub-quadratic, thus the 
default limit of 1000 chars or bytes (depending on input context).

When the Drill team consider upgrading to Jackson 2.15 or above, you might also 
want to consider adding some way for users to configure the 
StreamReadConstraints.
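
To illustrate the kind of limit discussed above, here is a minimal JDK-only 
sketch. It is not Jackson's implementation and does not use the 
StreamReadConstraints class. NumberLengthGuard, parseBounded and nines are 
hypothetical names. The idea is that the length of numeric text is checked 
against a limit (1000 here, matching the default quoted above) before any 
parsing work is done, and a caller can pass a higher limit when needed.

```java
import java.math.BigInteger;
import java.util.Arrays;

final class NumberLengthGuard {
    /** Mirrors the default of 1000 characters quoted above. */
    static final int DEFAULT_MAX_NUMBER_LENGTH = 1000;

    /** Rejects over-long numeric text before doing any parsing. */
    static BigInteger parseBounded(String text, int maxLength) {
        if (text.length() > maxLength) {
            throw new IllegalArgumentException(
                "Number length (" + text.length() + ") exceeds the limit (" + maxLength + ")");
        }
        return new BigInteger(text);
    }

    /** Builds a string made of the given number of '9' digits, for the demo. */
    static String nines(int count) {
        char[] digits = new char[count];
        Arrays.fill(digits, '9');
        return new String(digits);
    }

    public static void main(String[] args) {
        System.out.println(parseBounded("12345", DEFAULT_MAX_NUMBER_LENGTH));
        String huge = nines(5000);
        try {
            parseBounded(huge, DEFAULT_MAX_NUMBER_LENGTH);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
        // A user with legitimately large numbers relaxes the limit explicitly.
        System.out.println(parseBounded(huge, 10_000).toString().length());
    }
}
```

A configurable limit along these lines is what the suggestion about letting 
users configure the constraints amounts to: the default stays strict and the 
relaxed value is an explicit choice.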





[jira] [Created] (DRILL-8405) upgrade to snakeyaml 2.0 due to cve

2023-02-26 Thread PJ Fanning (Jira)
PJ Fanning created DRILL-8405:
-

 Summary: upgrade to snakeyaml 2.0 due to cve
 Key: DRILL-8405
 URL: https://issues.apache.org/jira/browse/DRILL-8405
 Project: Apache Drill
  Issue Type: Task
Reporter: PJ Fanning


https://bitbucket.org/snakeyaml/snakeyaml/issues/561/cve-2022-1471-vulnerability-in





[jira] [Created] (DRILL-8363) upgrade postgresql to 42.4.3 due to security issue

2022-11-29 Thread PJ Fanning (Jira)
PJ Fanning created DRILL-8363:
-

 Summary: upgrade postgresql to 42.4.3 due to security issue
 Key: DRILL-8363
 URL: https://issues.apache.org/jira/browse/DRILL-8363
 Project: Apache Drill
  Issue Type: Task
  Components: Storage - JDBC
Reporter: PJ Fanning


https://github.com/advisories/GHSA-562r-vg33-8x8h





[jira] [Created] (DRILL-8362) upgrade excel-streaming-reader v4.0.5

2022-11-29 Thread PJ Fanning (Jira)
PJ Fanning created DRILL-8362:
-

 Summary: upgrade excel-streaming-reader v4.0.5
 Key: DRILL-8362
 URL: https://issues.apache.org/jira/browse/DRILL-8362
 Project: Apache Drill
  Issue Type: Task
Reporter: PJ Fanning


A few small issues have been fixed





[jira] [Commented] (DRILL-8343) Upgrade Commons Text to 1.10.0

2022-10-24 Thread PJ Fanning (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-8343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17623066#comment-17623066
 ] 

PJ Fanning commented on DRILL-8343:
---

Duplicate of DRILL-8323

> Upgrade Commons Text to 1.10.0
> --
>
> Key: DRILL-8343
> URL: https://issues.apache.org/jira/browse/DRILL-8343
> Project: Apache Drill
>  Issue Type: Bug
>Reporter: Jason-Morries Adam
>Priority: Critical
>
> Apache Commons Text versions prior to 1.10.0 are vulnerable to 
> [CVE-2022-42889|https://nvd.nist.gov/vuln/detail/CVE-2022-42889], which 
> involves potential script execution when processing untrusted input using 
> {{{}StringLookup{}}}. Direct and transitive references to Apache Commons Text 
> prior to 1.10.0 should be upgraded to avoid the default interpolation 
> behavior.





[jira] [Created] (DRILL-8334) upgrade to okhttp 4.10.0 due to CVEs in kotlin transitive dependencies

2022-10-14 Thread PJ Fanning (Jira)
PJ Fanning created DRILL-8334:
-

 Summary: upgrade to okhttp 4.10.0 due to CVEs in kotlin transitive 
dependencies
 Key: DRILL-8334
 URL: https://issues.apache.org/jira/browse/DRILL-8334
 Project: Apache Drill
  Issue Type: Improvement
Reporter: PJ Fanning


[https://mvnrepository.com/artifact/com.squareup.okhttp3/okhttp]

It's a fairly minor bump from 4.9.3 to 4.10.0

okhttp 4.10.0 uses a newer copy of kotlin-stdlib that doesn't have CVEs





[jira] [Updated] (DRILL-8332) upgrade to jackson 2.13.4.20221013

2022-10-14 Thread PJ Fanning (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-8332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

PJ Fanning updated DRILL-8332:
--
Description: 
* [https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-42003]
 * [https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-42004]
 * both fixes have been backported (the CVEs themselves need to be updated to 
reflect this)

There was a gradle module issue in 2.13.4.20221012 so upgrading to 
2.13.4.20221013

  was:
* [https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-42003]
 * [https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-42004]
 * both fixes have been backported (the CVEs themselves need to be updated to 
reflect this)


> upgrade to jackson 2.13.4.20221013
> --
>
> Key: DRILL-8332
> URL: https://issues.apache.org/jira/browse/DRILL-8332
> Project: Apache Drill
>  Issue Type: Improvement
>Reporter: PJ Fanning
>Priority: Major
>
> * [https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-42003]
>  * [https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-42004]
>  * both fixes have been backported (the CVEs themselves need to be updated to 
> reflect this)
> There was a gradle module issue in 2.13.4.20221012 so upgrading to 
> 2.13.4.20221013





[jira] [Updated] (DRILL-8332) upgrade to jackson 2.13.4.20221013

2022-10-14 Thread PJ Fanning (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-8332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

PJ Fanning updated DRILL-8332:
--
Summary: upgrade to jackson 2.13.4.20221013  (was: upgrade to jackson 
2.13.4.20221012)

> upgrade to jackson 2.13.4.20221013
> --
>
> Key: DRILL-8332
> URL: https://issues.apache.org/jira/browse/DRILL-8332
> Project: Apache Drill
>  Issue Type: Improvement
>Reporter: PJ Fanning
>Priority: Major
>
> * [https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-42003]
>  * [https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-42004]
>  * both fixes have been backported (the CVEs themselves need to be updated to 
> reflect this)





[jira] [Created] (DRILL-8332) upgrade to jackson 2.13.4.20221012

2022-10-13 Thread PJ Fanning (Jira)
PJ Fanning created DRILL-8332:
-

 Summary: upgrade to jackson 2.13.4.20221012
 Key: DRILL-8332
 URL: https://issues.apache.org/jira/browse/DRILL-8332
 Project: Apache Drill
  Issue Type: Improvement
Reporter: PJ Fanning


* [https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-42003]
 * [https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-42004]
 * both fixes have been backported (the CVEs themselves need to be updated to 
reflect this)





[jira] [Created] (DRILL-8326) snakeyaml 1.33

2022-10-01 Thread PJ Fanning (Jira)
PJ Fanning created DRILL-8326:
-

 Summary: snakeyaml 1.33
 Key: DRILL-8326
 URL: https://issues.apache.org/jira/browse/DRILL-8326
 Project: Apache Drill
  Issue Type: Improvement
Reporter: PJ Fanning


[https://bitbucket.org/snakeyaml/snakeyaml/wiki/Changes] – fixes bug in code 
point limit added in 1.32





[jira] [Commented] (DRILL-8321) Change kafka_2.13 dependency scope to test

2022-09-30 Thread PJ Fanning (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-8321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17611539#comment-17611539
 ] 

PJ Fanning commented on DRILL-8321:
---

I opened an issue and PR at https://issues.apache.org/jira/browse/DRILL-8324

> Change kafka_2.13 dependency scope to test 
> ---
>
> Key: DRILL-8321
> URL: https://issues.apache.org/jira/browse/DRILL-8321
> Project: Apache Drill
>  Issue Type: Task
>Affects Versions: 1.20.2
>Reporter: Maksym Rymar
>Assignee: Maksym Rymar
>Priority: Minor
> Fix For: 1.20.3
>
>
> Drill has 2 scala dependencies:
>  * {{org.apache.kafka.kafka_2.13}}
>  * {{com.madhukaraphatak.java-sizeof_2.11}}
> which target different Scala versions, 2.13 and 2.11. Scala has no backward 
> compatibility across major releases, so we can’t have 2 libraries compiled 
> against different versions of Scala.
> To solve the issue there are only 2 ways:
>  # Compile both libraries on the same major Scala version.
>  # Remove one of the libraries from Drill
> {{kafka_2.13}} is a server-side dependency (Kafka’s server side) and is 
> unnecessary on the client side (Drill). It was probably added to Drill in 
> compile scope by mistake, while it is needed only in test scope.
> So {{kafka_2.13}} can be removed from compile scope. This will reduce the Drill 
> package size and, most importantly, resolve the Scala version conflict.





[jira] [Updated] (DRILL-8323) upgrade commons-text to 1.10.0

2022-09-29 Thread PJ Fanning (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-8323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

PJ Fanning updated DRILL-8323:
--
Description: 
[https://commons.apache.org/proper/commons-text/changes-report.html#a1.10.0]

https://issues.apache.org/jira/browse/TEXT-191 affects one of our tests - I 
have fixed the test in my PR - the old expected value was wrong due to TEXT-191 
bug

  was:https://commons.apache.org/proper/commons-text/changes-report.html#a1.10.0


> upgrade commons-text to 1.10.0
> --
>
> Key: DRILL-8323
> URL: https://issues.apache.org/jira/browse/DRILL-8323
> Project: Apache Drill
>  Issue Type: Improvement
>Reporter: PJ Fanning
>Priority: Major
>
> [https://commons.apache.org/proper/commons-text/changes-report.html#a1.10.0]
> https://issues.apache.org/jira/browse/TEXT-191 affects one of our tests - I 
> have fixed the test in my PR - the old expected value was wrong due to 
> TEXT-191 bug





[jira] [Created] (DRILL-8324) remove dependency on java-sizeof jar

2022-09-29 Thread PJ Fanning (Jira)
PJ Fanning created DRILL-8324:
-

 Summary: remove dependency on java-sizeof jar
 Key: DRILL-8324
 URL: https://issues.apache.org/jira/browse/DRILL-8324
 Project: Apache Drill
  Issue Type: Improvement
Reporter: PJ Fanning


[https://github.com/phatak-dev/java-sizeof] is not maintained and ties us to a 
very old version of Scala.

It looks like it should be easy to rewrite the code in Java and have it in 
Drill itself.





[jira] [Created] (DRILL-8323) upgrade commons-text to 1.10.0

2022-09-29 Thread PJ Fanning (Jira)
PJ Fanning created DRILL-8323:
-

 Summary: upgrade commons-text to 1.10.0
 Key: DRILL-8323
 URL: https://issues.apache.org/jira/browse/DRILL-8323
 Project: Apache Drill
  Issue Type: Improvement
Reporter: PJ Fanning


https://commons.apache.org/proper/commons-text/changes-report.html#a1.10.0





[jira] [Commented] (DRILL-8321) Change kafka_2.13 dependency scope to test

2022-09-29 Thread PJ Fanning (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-8321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17611053#comment-17611053
 ] 

PJ Fanning commented on DRILL-8321:
---

Any chance we can drop [https://github.com/phatak-dev/java-sizeof] ? If it 
isn't published for recent scala versions, it is a real millstone around our 
necks.

> Change kafka_2.13 dependency scope to test 
> ---
>
> Key: DRILL-8321
> URL: https://issues.apache.org/jira/browse/DRILL-8321
> Project: Apache Drill
>  Issue Type: Task
>Affects Versions: 1.20.2
>Reporter: Maksym Rymar
>Assignee: Maksym Rymar
>Priority: Minor
> Fix For: 2.0.0
>
>
> Drill has 2 scala dependencies:
>  * {{org.apache.kafka.kafka_2.13}}
>  * {{com.madhukaraphatak.java-sizeof_2.11}}
> which target different Scala versions, 2.13 and 2.11. Scala has no backward 
> compatibility across major releases, so we can’t have 2 libraries compiled 
> against different versions of Scala.
> To solve the issue there are only 2 ways:
>  # Compile both libraries on the same major Scala version.
>  # Remove one of the libraries from Drill
> {{kafka_2.13}} is a server-side dependency (Kafka’s server side) and is 
> unnecessary on the client side (Drill). It was probably added to Drill in 
> compile scope by mistake, while it is needed only in test scope.
> So {{kafka_2.13}} can be removed from compile scope. This will reduce the Drill 
> package size and, most importantly, resolve the Scala version conflict.





[jira] [Commented] (DRILL-7878) Fix LGTM Alerts

2022-09-29 Thread PJ Fanning (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17611051#comment-17611051
 ] 

PJ Fanning commented on DRILL-7878:
---

[~yaybeNo] can this be closed? lgtm is closing down and many of the issues have 
been dealt with anyway

> Fix LGTM Alerts
> ---
>
> Key: DRILL-7878
> URL: https://issues.apache.org/jira/browse/DRILL-7878
> Project: Apache Drill
>  Issue Type: Improvement
>Reporter: Evan Wong
>Priority: Major
>
> Try and deal with all alerts from LGTM badge
> [https://lgtm.com/projects/g/apache/drill/alerts/?mode=list]
>  





[jira] [Closed] (DRILL-8313) Introduce configuration for yaml parsing to override the default max file size

2022-09-23 Thread PJ Fanning (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-8313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

PJ Fanning closed DRILL-8313.
-
Resolution: Won't Fix

Thanks [~dzamo] - I'll close this

> Introduce configuration for yaml parsing to override the default max file size
> --
>
> Key: DRILL-8313
> URL: https://issues.apache.org/jira/browse/DRILL-8313
> Project: Apache Drill
>  Issue Type: Improvement
>Reporter: PJ Fanning
>Priority: Major
>
> snakeyaml 1.32 brings in a default limit of 3Mb when parsing yaml files.
> Need to allow users to specify another value if they need to.
> [https://bitbucket.org/snakeyaml/snakeyaml/src/72dfa9f1074abe2b8a6c8776bee4476b0aed02e3/src/main/java/org/yaml/snakeyaml/LoaderOptions.java]
> I only became aware of this issue in the last few hours.





[jira] [Commented] (DRILL-8313) introduce configuration for yaml parsing to override the default max file size

2022-09-20 Thread PJ Fanning (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-8313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607120#comment-17607120
 ] 

PJ Fanning commented on DRILL-8313:
---

[~dzamo] [~cgivre] When searching Drill code, I can find no direct use of 
snakeyaml in Drill. The only thing I found was:

drill-rdbms-metastore/pom.xml
{code:java}


  <groupId>org.yaml</groupId>
  <artifactId>snakeyaml</artifactId>
 {code}
Do you think we need to worry about yaml files that are larger than 3Mb here?

> introduce configuration for yaml parsing to override the default max file size
> --
>
> Key: DRILL-8313
> URL: https://issues.apache.org/jira/browse/DRILL-8313
> Project: Apache Drill
>  Issue Type: Improvement
>Reporter: PJ Fanning
>Priority: Major
>
> snakeyaml 1.32 brings in a default limit of 3Mb when parsing yaml files.
> Need to allow users to specify another value if they need to.
> [https://bitbucket.org/snakeyaml/snakeyaml/src/72dfa9f1074abe2b8a6c8776bee4476b0aed02e3/src/main/java/org/yaml/snakeyaml/LoaderOptions.java]
> I only became aware of this issue in the last few hours.





[jira] [Created] (DRILL-8313) introduce configuration for yaml parsing to override the default max file size

2022-09-19 Thread PJ Fanning (Jira)
PJ Fanning created DRILL-8313:
-

 Summary: introduce configuration for yaml parsing to override the 
default max file size
 Key: DRILL-8313
 URL: https://issues.apache.org/jira/browse/DRILL-8313
 Project: Apache Drill
  Issue Type: Improvement
Reporter: PJ Fanning


snakeyaml 1.32 brings in a default limit of 3Mb when parsing yaml files.

Need to allow users to specify another value if they need to.

[https://bitbucket.org/snakeyaml/snakeyaml/src/72dfa9f1074abe2b8a6c8776bee4476b0aed02e3/src/main/java/org/yaml/snakeyaml/LoaderOptions.java]

I only became aware of this issue in the last few hours.
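
A minimal JDK-only sketch of what a user-overridable limit could look like. 
Everything here is an assumption for illustration: the property name 
drill.yaml.code_point_limit and the class YamlSizeLimitSketch are 
hypothetical, not actual Drill options, and the limit is assumed to be 
counted in code points. In practice the resolved value would presumably be 
handed to snakeyaml's LoaderOptions (the class linked above) rather than 
checked by hand.

```java
final class YamlSizeLimitSketch {
    /** Default of 3 * 1024 * 1024, matching the 3Mb figure above. */
    static final int DEFAULT_CODE_POINT_LIMIT = 3 * 1024 * 1024;

    /** Hypothetical system property name; not an actual Drill option. */
    static final String LIMIT_PROPERTY = "drill.yaml.code_point_limit";

    /** Returns the user-supplied limit if the property holds an integer, else the default. */
    static int resolveLimit() {
        return Integer.getInteger(LIMIT_PROPERTY, DEFAULT_CODE_POINT_LIMIT);
    }

    /** Throws if the document has more code points than the limit allows. */
    static void checkSize(String yamlText, int limit) {
        int codePoints = yamlText.codePointCount(0, yamlText.length());
        if (codePoints > limit) {
            throw new IllegalArgumentException(
                "YAML input has " + codePoints + " code points, limit is " + limit);
        }
    }

    public static void main(String[] args) {
        System.out.println(resolveLimit());      // default unless the property is set
        System.setProperty(LIMIT_PROPERTY, "16");
        System.out.println(resolveLimit());      // now the overridden value
        checkSize("key: value", resolveLimit()); // 10 code points, accepted
        try {
            checkSize("key: a much longer value", resolveLimit());
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```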





[jira] [Created] (DRILL-8309) uptake slf4j 2.0.1

2022-09-17 Thread PJ Fanning (Jira)
PJ Fanning created DRILL-8309:
-

 Summary: uptake slf4j 2.0.1
 Key: DRILL-8309
 URL: https://issues.apache.org/jira/browse/DRILL-8309
 Project: Apache Drill
  Issue Type: Improvement
Reporter: PJ Fanning


log4j 2.19.0 and logback 2.19.0 support slf4j 2.0.1

 





[jira] [Created] (DRILL-8308) uptake POI 5.2.3

2022-09-17 Thread PJ Fanning (Jira)
PJ Fanning created DRILL-8308:
-

 Summary: uptake POI 5.2.3
 Key: DRILL-8308
 URL: https://issues.apache.org/jira/browse/DRILL-8308
 Project: Apache Drill
  Issue Type: Improvement
  Components: Storage - Other
Affects Versions: 2.0.0
Reporter: PJ Fanning


https://poi.apache.org/changes.html





[jira] [Commented] (DRILL-8300) Upgrade to snakeyaml 1.32 due to cve

2022-09-13 Thread PJ Fanning (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-8300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17603783#comment-17603783
 ] 

PJ Fanning commented on DRILL-8300:
---

Another release - maybe another CVE - unclear from release notes

[https://bitbucket.org/snakeyaml/snakeyaml/wiki/Changes]

[https://bitbucket.org/snakeyaml/snakeyaml/issues/547/restrict-the-size-of-incoming-data]

> Upgrade to snakeyaml 1.32 due to cve
> 
>
> Key: DRILL-8300
> URL: https://issues.apache.org/jira/browse/DRILL-8300
> Project: Apache Drill
>  Issue Type: Bug
>Reporter: PJ Fanning
>Priority: Major
>
> https://github.com/advisories/GHSA-3mc7-4q67-w48m





[jira] [Updated] (DRILL-8300) Upgrade to snakeyaml 1.32 due to cve

2022-09-13 Thread PJ Fanning (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-8300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

PJ Fanning updated DRILL-8300:
--
Environment: (was: Another release - maybe another CVE - unclear from 
release notes

[https://bitbucket.org/snakeyaml/snakeyaml/wiki/Changes]

[https://bitbucket.org/snakeyaml/snakeyaml/issues/547/restrict-the-size-of-incoming-data])

> Upgrade to snakeyaml 1.32 due to cve
> 
>
> Key: DRILL-8300
> URL: https://issues.apache.org/jira/browse/DRILL-8300
> Project: Apache Drill
>  Issue Type: Bug
>Reporter: PJ Fanning
>Priority: Major
>
> https://github.com/advisories/GHSA-3mc7-4q67-w48m





[jira] [Updated] (DRILL-8300) Upgrade to snakeyaml 1.32 due to cve

2022-09-13 Thread PJ Fanning (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-8300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

PJ Fanning updated DRILL-8300:
--
Environment: 
Another release - maybe another CVE - unclear from release notes

[https://bitbucket.org/snakeyaml/snakeyaml/wiki/Changes]

[https://bitbucket.org/snakeyaml/snakeyaml/issues/547/restrict-the-size-of-incoming-data]
Summary: Upgrade to snakeyaml 1.32 due to cve  (was: Upgrade to 
snakeyaml 1.31 due to cve)

> Upgrade to snakeyaml 1.32 due to cve
> 
>
> Key: DRILL-8300
> URL: https://issues.apache.org/jira/browse/DRILL-8300
> Project: Apache Drill
>  Issue Type: Bug
> Environment: Another release - maybe another CVE - unclear from 
> release notes
> [https://bitbucket.org/snakeyaml/snakeyaml/wiki/Changes]
> [https://bitbucket.org/snakeyaml/snakeyaml/issues/547/restrict-the-size-of-incoming-data]
>Reporter: PJ Fanning
>Priority: Major
>
> https://github.com/advisories/GHSA-3mc7-4q67-w48m





[jira] [Commented] (DRILL-8304) Update Calcite to 1.32

2022-09-10 Thread PJ Fanning (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-8304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17602711#comment-17602711
 ] 

PJ Fanning commented on DRILL-8304:
---

Includes a CVE fix - [https://calcite.apache.org/docs/history.html]

[CVE-2022-39135|http://cve.mitre.org/cgi-bin/cvename.cgi?name=2022-39135]

> Update Calcite to 1.32
> --
>
> Key: DRILL-8304
> URL: https://issues.apache.org/jira/browse/DRILL-8304
> Project: Apache Drill
>  Issue Type: Task
>Reporter: Vova Vysotskyi
>Assignee: Vova Vysotskyi
>Priority: Major
>






[jira] [Commented] (DRILL-8301) Standardise on UTF-8 encoding for char to byte (and vice versa) conversions

2022-09-08 Thread PJ Fanning (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-8301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17601752#comment-17601752
 ] 

PJ Fanning commented on DRILL-8301:
---

See https://github.com/apache/drill/pull/2637

> Standardise on UTF-8 encoding for char to byte (and vice versa) conversions
> ---
>
> Key: DRILL-8301
> URL: https://issues.apache.org/jira/browse/DRILL-8301
> Project: Apache Drill
>  Issue Type: Improvement
>Reporter: PJ Fanning
>Priority: Major
>
> Lots of Drill code uses UTF-8 explicitly. Lots more Drill code does not set 
> an explicit encoding which means it relies on the JVM default (which differs 
> by JVM install).





[jira] [Updated] (DRILL-8302) tidy up some char conversions

2022-09-08 Thread PJ Fanning (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-8302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

PJ Fanning updated DRILL-8302:
--
Description: 
As part of DRILL-8301, I spotted code that could be tidied up. The aim of this 
issue is to reduce the size of DRILL-8301 without introducing changes to the 
char encodings.
 * uses of a pattern like `new String("")` - IntelliJ and other tools 
highlight this as unnecessary
 * uses of `new String(bytes, StandardCharsets.UTF_8.name())` - better to use 
`new String(bytes, StandardCharsets.UTF_8)`
 * use Base64 encodeToString instead of case where we encode to bytes and then 
do our own encoding of those bytes to a String
 * Change existing code with `Charset.forName("UTF-8")` to use 
`StandardCharsets.UTF_8`

  was:
As part of DRILL-8301, I spotted code that could be tidied up. The aim of this 
issue is to reduce the size of DRILL-8301 without introducing changes to the 
char encodings.
 * uses of a pattern like `new String("")` - IntelliJ and other tools 
highlight this as unnecessary
 * uses of `new String(bytes, StandardCharsets.UTF_8.name())` - better to use 
`new String(bytes, StandardCharsets.UTF_8)`
 * use Base64 encodeToString instead of case where we encode to bytes and then 
do our own encoding of those bytes to a String
 * Replace existing code with `Charset.forName("UTF-8")` to use 
`StandardCharsets.UTF_8`


> tidy up some char conversions
> -
>
> Key: DRILL-8302
> URL: https://issues.apache.org/jira/browse/DRILL-8302
> Project: Apache Drill
>  Issue Type: Improvement
>Reporter: PJ Fanning
>Priority: Major
>
> As part of DRILL-8301, I spotted code that could be tidied up. The aim of 
> this issue is to reduce the size of DRILL-8301 without introducing changes to 
> the char encodings.
>  * uses of a pattern like `new String("")` - IntelliJ and other tools 
> highlight this as unnecessary
>  * uses of `new String(bytes, StandardCharsets.UTF_8.name())` - better to use 
> `new String(bytes, StandardCharsets.UTF_8)`
>  * use Base64 encodeToString instead of case where we encode to bytes and 
> then do our own encoding of those bytes to a String
>  * Change existing code with `Charset.forName("UTF-8")` to use 
> `StandardCharsets.UTF_8`





[jira] [Created] (DRILL-8302) tidy up some char conversions

2022-09-08 Thread PJ Fanning (Jira)
PJ Fanning created DRILL-8302:
-

 Summary: tidy up some char conversions
 Key: DRILL-8302
 URL: https://issues.apache.org/jira/browse/DRILL-8302
 Project: Apache Drill
  Issue Type: Improvement
Reporter: PJ Fanning


As part of DRILL-8301, I spotted code that could be tidied up. The aim of this 
issue is to reduce the size of DRILL-8301 without introducing changes to the 
char encodings.
 * uses of a pattern like `new String("")` - IntelliJ and other tools 
highlight this as unnecessary
 * uses of `new String(bytes, StandardCharsets.UTF_8.name())` - better to use 
`new String(bytes, StandardCharsets.UTF_8)`
 * use Base64 encodeToString instead of case where we encode to bytes and then 
do our own encoding of those bytes to a String
 * Replace existing code with `Charset.forName("UTF-8")` to use 
`StandardCharsets.UTF_8`
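
The four tidy-ups listed above can be sketched with JDK classes only. This is an illustration, not code taken from Drill; the class and method names (CharConversionTidyUp, decodeUtf8, toBase64, utf8) are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper class; the names are illustrative and not from Drill.
class CharConversionTidyUp {

  // Before: new String(""), which allocates a redundant copy of the literal.
  static String emptyString() {
    return "";
  }

  // Before: new String(bytes, StandardCharsets.UTF_8.name()), which looks the
  // charset up by name and declares UnsupportedEncodingException.
  static String decodeUtf8(byte[] bytes) {
    return new String(bytes, StandardCharsets.UTF_8);
  }

  // Before: encode to Base64 bytes, then convert those bytes to a String by hand.
  static String toBase64(byte[] bytes) {
    return Base64.getEncoder().encodeToString(bytes);
  }

  // Before: Charset.forName("UTF-8"); the constant avoids the lookup by name.
  static Charset utf8() {
    return StandardCharsets.UTF_8;
  }

  public static void main(String[] args) {
    byte[] bytes = "drill".getBytes(StandardCharsets.UTF_8);
    System.out.println(decodeUtf8(bytes));
    System.out.println(toBase64(bytes));
  }
}
```

None of these rewrites changes which encoding is used, which matches the stated aim of keeping encoding changes inside DRILL-8301.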





[jira] [Created] (DRILL-8301) Standardise on UTF-8 encoding for char to byte (and vice versa) conversions

2022-09-08 Thread PJ Fanning (Jira)
PJ Fanning created DRILL-8301:
-

 Summary: Standardise on UTF-8 encoding for char to byte (and vice 
versa) conversions
 Key: DRILL-8301
 URL: https://issues.apache.org/jira/browse/DRILL-8301
 Project: Apache Drill
  Issue Type: Improvement
Reporter: PJ Fanning


Lots of Drill code uses UTF-8 explicitly. Lots more Drill code does not set an 
explicit encoding which means it relies on the JVM default (which differs by 
JVM install).
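
A minimal sketch of why the explicit encoding matters, using JDK classes only. The class name ExplicitCharsetDemo and the sample text are hypothetical. ISO-8859-1 stands in for a non-UTF-8 JVM default; since JDK 18 the default is UTF-8, so the difference mainly shows up on older JVMs.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Drill.
class ExplicitCharsetDemo {

  // Always encodes as UTF-8, whatever the JVM default charset is.
  static byte[] encode(String text) {
    return text.getBytes(StandardCharsets.UTF_8);
  }

  // Always decodes as UTF-8, whatever the JVM default charset is.
  static String decode(byte[] bytes) {
    return new String(bytes, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    // "cafe" with an accented e, written as an escape so the source file
    // encoding does not affect the demo.
    String text = "caf\u00e9";

    // text.getBytes() with no argument would use this charset.
    System.out.println("JVM default charset: " + Charset.defaultCharset());

    // The same text produces different bytes under different charsets.
    System.out.println("ISO-8859-1 bytes: " + text.getBytes(StandardCharsets.ISO_8859_1).length);
    System.out.println("UTF-8 bytes: " + encode(text).length);
  }
}
```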





[jira] [Created] (DRILL-8300) upgrade to snakeyaml 1.31 due to cve

2022-09-07 Thread PJ Fanning (Jira)
PJ Fanning created DRILL-8300:
-

 Summary: upgrade to snakeyaml 1.31 due to cve
 Key: DRILL-8300
 URL: https://issues.apache.org/jira/browse/DRILL-8300
 Project: Apache Drill
  Issue Type: Bug
Reporter: PJ Fanning


https://github.com/advisories/GHSA-3mc7-4q67-w48m





[jira] [Updated] (DRILL-8298) possible bug in NonCoveringIndexPlanGenerator

2022-09-07 Thread PJ Fanning (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-8298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

PJ Fanning updated DRILL-8298:
--
Issue Type: Bug  (was: Improvement)

> possible bug in NonCoveringIndexPlanGenerator
> -
>
> Key: DRILL-8298
> URL: https://issues.apache.org/jira/browse/DRILL-8298
> Project: Apache Drill
>  Issue Type: Bug
>Reporter: PJ Fanning
>Priority: Major
>
> I'm not a Calcite expert, but LGTM.com and IntelliJ suggest that the element 
> type of this set and the type of the instance in the contains check do not match.
> {code:java}
> (restrictedScanTraitSet.contains(RelCollationTraitDef.INSTANCE)) 
> {code}





[jira] [Updated] (DRILL-8299) type matching in MetadataContext

2022-09-07 Thread PJ Fanning (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-8299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

PJ Fanning updated DRILL-8299:
--
Issue Type: Bug  (was: Improvement)

> type matching in MetadataContext
> 
>
> Key: DRILL-8299
> URL: https://issues.apache.org/jira/browse/DRILL-8299
> Project: Apache Drill
>  Issue Type: Bug
>Reporter: PJ Fanning
>Priority: Major
>
> The dirModifCheckMap used in this lookup is keyed using an HDFS Path instance, 
> not a string, so this code is not going to work:
> {code:java}
>   public boolean getStatus(String dir) {
>     if (dirModifCheckMap.containsKey(dir)) {
>       return dirModifCheckMap.get(dir);
>     }
>     return false;
>   }
> {code}





[jira] [Created] (DRILL-8299) type matching in MetadataContext

2022-09-07 Thread PJ Fanning (Jira)
PJ Fanning created DRILL-8299:
-

 Summary: type matching in MetadataContext
 Key: DRILL-8299
 URL: https://issues.apache.org/jira/browse/DRILL-8299
 Project: Apache Drill
  Issue Type: Improvement
Reporter: PJ Fanning


The dirModifCheckMap used in this lookup is keyed using an HDFS Path instance, 
not a string, so this code is not going to work:


{code:java}
  public boolean getStatus(String dir) {
    if (dirModifCheckMap.containsKey(dir)) {
      return dirModifCheckMap.get(dir);
    }
    return false;
  }
{code}
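
The mismatch can be reproduced outside Drill. In the sketch below, java.nio.file.Path stands in for the Hadoop Path (an assumption made so the example needs only the JDK), and the class DirStatusLookup with its setStatus and getStatusByString methods is hypothetical. Converting the argument to the key type is one possible fix, not necessarily the one Drill should adopt.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the lookup in MetadataContext.
class DirStatusLookup {

  // Keyed by Path; java.nio.file.Path replaces the Hadoop Path in this sketch.
  private final Map<Path, Boolean> dirModifCheckMap = new HashMap<>();

  void setStatus(String dir, boolean status) {
    dirModifCheckMap.put(Paths.get(dir), status);
  }

  // Mirrors the reported code: Map.get accepts any Object, so this compiles,
  // but a String never equals a Path key and the result is always false.
  boolean getStatusByString(String dir) {
    Boolean status = dirModifCheckMap.get(dir);
    return status != null && status;
  }

  // Possible fix: convert the argument to the key type before the lookup.
  boolean getStatus(String dir) {
    return dirModifCheckMap.getOrDefault(Paths.get(dir), false);
  }
}
```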






[jira] [Created] (DRILL-8298) possible bug in NonCoveringIndexPlanGenerator

2022-09-07 Thread PJ Fanning (Jira)
PJ Fanning created DRILL-8298:
-

 Summary: possible bug in NonCoveringIndexPlanGenerator
 Key: DRILL-8298
 URL: https://issues.apache.org/jira/browse/DRILL-8298
 Project: Apache Drill
  Issue Type: Improvement
Reporter: PJ Fanning


I'm not a Calcite expert, but LGTM.com and IntelliJ suggest that the element 
type of this set and the type of the instance in the contains check do not match.


{code:java}
(restrictedScanTraitSet.contains(RelCollationTraitDef.INSTANCE)) 
{code}






[jira] [Created] (DRILL-8297) remove or fix OrderedPartitionRecordBatch

2022-09-07 Thread PJ Fanning (Jira)
PJ Fanning created DRILL-8297:
-

 Summary: remove or fix OrderedPartitionRecordBatch
 Key: DRILL-8297
 URL: https://issues.apache.org/jira/browse/DRILL-8297
 Project: Apache Drill
  Issue Type: Improvement
Reporter: PJ Fanning


The constructor will always throw a NullPointerException because cache is 
always null.





[jira] [Updated] (DRILL-8296) possible type bug in SplunkBatchReader

2022-09-07 Thread PJ Fanning (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-8296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

PJ Fanning updated DRILL-8296:
--
Description: 

{code:java}
  if (path.nameEquals("**")) {
    return true;
  } else {
    return specialFields.contains(path.getAsNamePart());
  }
{code}


LGTM and IntelliJ both say that NamePart type does not match the type stored in 
specialFields collection.

  was:
```
  if (path.nameEquals("**")) {
    return true;
  } else {
    return specialFields.contains(path.getAsNamePart());
  }
```

LGTM and IntelliJ both say that NamePart type does not match the type stored in 
specialFields collection.


> possible type bug in SplunkBatchReader
> --
>
> Key: DRILL-8296
> URL: https://issues.apache.org/jira/browse/DRILL-8296
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: splunk
>Reporter: PJ Fanning
>Priority: Major
>
> {code:java}
>   if (path.nameEquals("**")) {
>     return true;
>   } else {
>     return specialFields.contains(path.getAsNamePart());
>   }
> {code}
> LGTM and IntelliJ both say that NamePart type does not match the type stored 
> in specialFields collection.





[jira] [Created] (DRILL-8296) possible type bug in SplunkBatchReader

2022-09-07 Thread PJ Fanning (Jira)
PJ Fanning created DRILL-8296:
-

 Summary: possible type bug in SplunkBatchReader
 Key: DRILL-8296
 URL: https://issues.apache.org/jira/browse/DRILL-8296
 Project: Apache Drill
  Issue Type: Improvement
  Components: splunk
Reporter: PJ Fanning


```
  if (path.nameEquals("**")) {
    return true;
  } else {
    return specialFields.contains(path.getAsNamePart());
  }
```

LGTM and IntelliJ both say that NamePart type does not match the type stored in 
specialFields collection.
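
A sketch of the kind of mismatch the tools report, assuming specialFields holds Strings (the issue does not say which type it stores). The nested NamePart class is a hypothetical stand-in for Drill's class, and the field names are made up.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical reproduction; not the SplunkBatchReader code.
class SpecialFieldCheck {

  // Stand-in for Drill's NamePart; only the name is modelled.
  static final class NamePart {
    private final String name;

    NamePart(String name) {
      this.name = name;
    }

    String getName() {
      return name;
    }
  }

  // Assumed to be a collection of Strings; the entries are placeholders.
  static final Set<String> SPECIAL_FIELDS =
      new HashSet<>(Arrays.asList("field_a", "field_b"));

  // Mirrors the reported check: contains takes Object, so this compiles,
  // but a NamePart never equals a String and the result is always false.
  static boolean isSpecialMismatched(NamePart part) {
    return SPECIAL_FIELDS.contains(part);
  }

  // Possible fix: compare using the same type the collection stores.
  static boolean isSpecial(NamePart part) {
    return SPECIAL_FIELDS.contains(part.getName());
  }
}
```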





[jira] [Created] (DRILL-8282) upgrade to hadoop-common 3.2.4 due to cve

2022-08-22 Thread PJ Fanning (Jira)
PJ Fanning created DRILL-8282:
-

 Summary: upgrade to hadoop-common 3.2.4 due to cve 
 Key: DRILL-8282
 URL: https://issues.apache.org/jira/browse/DRILL-8282
 Project: Apache Drill
  Issue Type: Improvement
Reporter: PJ Fanning


https://github.com/advisories/GHSA-8wm5-8h9c-47pc

* this change requires some reload4j dependency changes too - see broken build 
- https://github.com/apache/drill/pull/2628





[jira] [Closed] (DRILL-8267) Remove commons-configuration dependency management

2022-07-27 Thread PJ Fanning (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-8267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

PJ Fanning closed DRILL-8267.
-
Resolution: Won't Fix

This doesn't need to be done

> Remove commons-configuration dependency management
> --
>
> Key: DRILL-8267
> URL: https://issues.apache.org/jira/browse/DRILL-8267
> Project: Apache Drill
>  Issue Type: Improvement
>Reporter: PJ Fanning
>Priority: Major
>
> https://mvnrepository.com/artifact/commons-configuration/commons-configuration/1.10
> This jar is EOL and has many very insecure dependencies.
> Looks like this dependency is not used by Drill or any of its dependencies. 
> Hadoop uses commons-configuration2 instead.





[jira] [Updated] (DRILL-8267) remove commons-configuration dependency

2022-07-19 Thread PJ Fanning (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-8267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

PJ Fanning updated DRILL-8267:
--
Description: 
https://mvnrepository.com/artifact/commons-configuration/commons-configuration/1.10

This jar is EOL and has many very insecure dependencies.

Looks like this dependency is not used by Drill or any of its dependencies. 
Hadoop uses commons-configuration2 instead.

  was:
https://mvnrepository.com/artifact/commons-configuration/commons-configuration/1.10

This jar is EOL and has many very insecure dependencies.

We should use commons-configuration2.


> remove commons-configuration dependency
> ---
>
> Key: DRILL-8267
> URL: https://issues.apache.org/jira/browse/DRILL-8267
> Project: Apache Drill
>  Issue Type: Improvement
>Reporter: PJ Fanning
>Priority: Major
>
> https://mvnrepository.com/artifact/commons-configuration/commons-configuration/1.10
> This jar is EOL and has many very insecure dependencies.
> Looks like this dependency is not used by Drill or any of its dependencies. 
> Hadoop uses commons-configuration2 instead.





[jira] [Updated] (DRILL-8267) remove commons-configuration dependency

2022-07-19 Thread PJ Fanning (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-8267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

PJ Fanning updated DRILL-8267:
--
Summary: remove commons-configuration dependency  (was: switch to 
commons-configuration2)

> remove commons-configuration dependency
> ---
>
> Key: DRILL-8267
> URL: https://issues.apache.org/jira/browse/DRILL-8267
> Project: Apache Drill
>  Issue Type: Improvement
>Reporter: PJ Fanning
>Priority: Major
>
> https://mvnrepository.com/artifact/commons-configuration/commons-configuration/1.10
> This jar is EOL and has many very insecure dependencies.
> We should use commons-configuration2.





[jira] [Created] (DRILL-8267) switch to commons-configuration2

2022-07-19 Thread PJ Fanning (Jira)
PJ Fanning created DRILL-8267:
-

 Summary: switch to commons-configuration2
 Key: DRILL-8267
 URL: https://issues.apache.org/jira/browse/DRILL-8267
 Project: Apache Drill
  Issue Type: Improvement
Reporter: PJ Fanning


https://mvnrepository.com/artifact/commons-configuration/commons-configuration/1.10

This jar is EOL and has many very insecure dependencies.

We should use commons-configuration2.





[jira] [Updated] (DRILL-8266) address number casting issues in github scan

2022-07-19 Thread PJ Fanning (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-8266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

PJ Fanning updated DRILL-8266:
--
Summary: address number casting issues in github scan  (was: address number 
casting issues in https://github.com/apache/drill/security/code-scanning)

> address number casting issues in github scan
> 
>
> Key: DRILL-8266
> URL: https://issues.apache.org/jira/browse/DRILL-8266
> Project: Apache Drill
>  Issue Type: Improvement
>Reporter: PJ Fanning
>Priority: Major
>
> https://github.com/apache/drill/security/code-scanning





[jira] [Created] (DRILL-8266) address number casting issues in https://github.com/apache/drill/security/code-scanning

2022-07-19 Thread PJ Fanning (Jira)
PJ Fanning created DRILL-8266:
-

 Summary: address number casting issues in 
https://github.com/apache/drill/security/code-scanning
 Key: DRILL-8266
 URL: https://issues.apache.org/jira/browse/DRILL-8266
 Project: Apache Drill
  Issue Type: Improvement
Reporter: PJ Fanning


https://github.com/apache/drill/security/code-scanning





[jira] [Created] (DRILL-8265) upgrade aws-java-sdk-s3 due to CVE

2022-07-19 Thread PJ Fanning (Jira)
PJ Fanning created DRILL-8265:
-

 Summary: upgrade aws-java-sdk-s3 due to CVE
 Key: DRILL-8265
 URL: https://issues.apache.org/jira/browse/DRILL-8265
 Project: Apache Drill
  Issue Type: Improvement
Reporter: PJ Fanning


https://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk-s3/1.12.260





[jira] [Commented] (DRILL-8262) Xalan is EOL and has a never to be fixed CVE

2022-07-19 Thread PJ Fanning (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-8262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17568722#comment-17568722
 ] 

PJ Fanning commented on DRILL-8262:
---

https://github.com/apache/drill/pull/2607

> Xalan is EOL and has a never to be fixed CVE
> 
>
> Key: DRILL-8262
> URL: https://issues.apache.org/jira/browse/DRILL-8262
> Project: Apache Drill
>  Issue Type: Improvement
>Reporter: PJ Fanning
>Priority: Major
>
> Xalan is no longer supported.
> https://lists.apache.org/thread/s8kjny5270ssfcp46v0fl39lk98987w7
> It is better to use JAXP TransformerFactory than using xalan directly. If you 
> add xalan dependency just to ensure that you have a JAXP compliant 
> transformer on the classpath, this is unnecessary - the Java runtime has a 
> built-in implementation.
> Drill dependency:
> https://mvnrepository.com/artifact/org.apache.drill.exec/drill-java-exec/1.20.0





[jira] [Created] (DRILL-8264) upgrade joda to fix security warning

2022-07-19 Thread PJ Fanning (Jira)
PJ Fanning created DRILL-8264:
-

 Summary: upgrade joda to fix security warning
 Key: DRILL-8264
 URL: https://issues.apache.org/jira/browse/DRILL-8264
 Project: Apache Drill
  Issue Type: Improvement
Reporter: PJ Fanning


A bug in joda-time pom causes this:
https://github.com/apache/drill/security/code-scanning/27





[jira] [Created] (DRILL-8263) use secure, non-preview version of libpam4j

2022-07-19 Thread PJ Fanning (Jira)
PJ Fanning created DRILL-8263:
-

 Summary: use secure, non-preview version of libpam4j
 Key: DRILL-8263
 URL: https://issues.apache.org/jira/browse/DRILL-8263
 Project: Apache Drill
  Issue Type: Improvement
  Components: Execution - Data Types
Reporter: PJ Fanning


https://github.com/apache/drill/blob/master/exec/java-exec/pom.xml#L32

See dependency with CVE in:
https://mvnrepository.com/artifact/org.apache.drill.exec/drill-java-exec/1.20.0






[jira] [Created] (DRILL-8262) Xalan is EOL and has a never to be fixed CVE

2022-07-19 Thread PJ Fanning (Jira)
PJ Fanning created DRILL-8262:
-

 Summary: Xalan is EOL and has a never to be fixed CVE
 Key: DRILL-8262
 URL: https://issues.apache.org/jira/browse/DRILL-8262
 Project: Apache Drill
  Issue Type: Improvement
Reporter: PJ Fanning


Xalan is no longer supported.

https://lists.apache.org/thread/s8kjny5270ssfcp46v0fl39lk98987w7

It is better to use JAXP TransformerFactory than using xalan directly. If you 
add xalan dependency just to ensure that you have a JAXP compliant transformer 
on the classpath, this is unnecessary - the Java runtime has a built-in 
implementation.

Drill dependency:
https://mvnrepository.com/artifact/org.apache.drill.exec/drill-java-exec/1.20.0
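
A minimal sketch of going through the JAXP TransformerFactory with no xalan dependency on the classpath. The class JaxpTransformDemo and its identityTransform method are hypothetical; the identity transform just copies the input, which is enough to show that the JDK's built-in implementation is picked up. Enabling secure processing is an extra precaution added here, not something the issue asks for.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.XMLConstants;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical demo class; not part of Drill.
class JaxpTransformDemo {

  // Copies the XML through an identity transformer obtained from JAXP.
  static String identityTransform(String xml) {
    try {
      // Resolves to the transformer bundled with the Java runtime when no
      // other implementation is configured.
      TransformerFactory factory = TransformerFactory.newInstance();
      factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);

      Transformer transformer = factory.newTransformer();
      transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

      StringWriter out = new StringWriter();
      transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
      return out.toString();
    } catch (TransformerException e) {
      throw new IllegalStateException("XML transform failed", e);
    }
  }

  public static void main(String[] args) {
    System.out.println(identityTransform("<greeting>hello</greeting>"));
  }
}
```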





[jira] [Commented] (DRILL-8096) format-excel reader: support different Shared String implementations

2022-07-12 Thread PJ Fanning (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-8096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17565411#comment-17565411
 ] 

PJ Fanning commented on DRILL-8096:
---

This is not implemented. The excel-streaming-reader that Drill uses now uses 
ReadOnlySharedStringTable, so that part of this issue is already addressed, 
but letting users choose the implementation when using Drill is not yet 
supported. The feature is potentially useful, but it may be better to wait 
until users start reporting memory footprint issues before adding extra 
Drill features.

> format-excel reader: support different Shared String implementations
> 
>
> Key: DRILL-8096
> URL: https://issues.apache.org/jira/browse/DRILL-8096
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Execution - Data Types
>Reporter: PJ Fanning
>Priority: Major
>
> One of the biggest users of memory and processing time when reading Excel 
> files is handling the Shared Strings Table.
> excel-streaming-reader v3.3.0 supports 3 implementations.
> I would suggest that Drill should use the ReadOnlySharedStringTable as the 
> default.
> Drill currently uses the full featured Apache POI SharedStringTable by 
> default (which requires more memory and parsing effort).
> There is also a TempFileSharedStringTable which uses a temp file to keep the 
> data out of heap memory. This is still pretty fast because it is implemented 
> using an H2 database MVMap.
> If letting users configure which implementation they want sounds 
> useful, I can do a PR.
>  





[jira] [Created] (DRILL-8251) Upgrade hadoop 2 (to 2.10.2) due to CVE

2022-06-19 Thread PJ Fanning (Jira)
PJ Fanning created DRILL-8251:
-

 Summary: Upgrade hadoop 2 (to 2.10.2) due to CVE 
 Key: DRILL-8251
 URL: https://issues.apache.org/jira/browse/DRILL-8251
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.20.1
Reporter: PJ Fanning



Relates to https://github.com/apache/drill/security/dependabot/21



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (DRILL-8240) Revisit clone of log4j Strings class

2022-05-30 Thread PJ Fanning (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-8240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17543925#comment-17543925
 ] 

PJ Fanning commented on DRILL-8240:
---

The issue is that Apache Hive code uses a class from log4j-api jar but Drill 
does not include log4j-api jar as a dependency when it uses Apache Hive. So 
far, the solution is for Drill to have a copy of the log4j class that Hive 
needs. This java file needs to be kept up to date - we upgraded Log4j during 
the Log4j panic at the turn of this year - but never upgraded the java file.

I believe that Drill should not be copying log4j classes like this and that it 
should include the log4j-api jar as a dependency when using Apache Hive. If 
Drill team insists on not adding this dependency, then we are stuck with having 
to merge in all the changes that happen to the Java file.

> Revisit clone of log4j Strings class
> 
>
> Key: DRILL-8240
> URL: https://issues.apache.org/jira/browse/DRILL-8240
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Functions - Hive
>Affects Versions: 1.20.1
>Reporter: PJ Fanning
>Priority: Major
>
> See https://issues.apache.org/jira/browse/DRILL-8044 for background.
> The code added there is now out of date. After the log4j panic late last 
> year, 5 commits were made to modify the real log4j class and these are 
> missing from the Drill copy.
> Compare 
> https://github.com/apache/logging-log4j2/commits/rel/2.17.2/log4j-api/src/main/java/org/apache/logging/log4j/util/Strings.java
>  to 
> https://github.com/apache/logging-log4j2/commits/rel/2.14.1/log4j-api/src/main/java/org/apache/logging/log4j/util/Strings.java
> The Drill copy is based on Log4J 2.14.1. Every commit in 2021 and 2022 is 
> missing from the Drill copy.





[jira] [Comment Edited] (DRILL-8240) Revisit clone of log4j Strings class

2022-05-30 Thread PJ Fanning (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-8240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17543925#comment-17543925
 ] 

PJ Fanning edited comment on DRILL-8240 at 5/30/22 12:27 PM:
-

The issue is that Apache Hive code uses a class from log4j-api jar but Drill 
does not include log4j-api jar as a dependency when it uses Apache Hive. So 
far, the solution is for Drill to have a copy of the log4j class that Hive 
needs. This java file needs to be kept up to date - we upgraded Log4j during 
the Log4j panic at the turn of this year - but never upgraded the java file.

I believe that Drill should not be copying log4j classes like this and that it 
should include the log4j-api jar as a dependency when using Apache Hive. If 
Drill team insists on not adding this dependency, then we are stuck with having 
to merge in all the changes that happen to the Java file.


was (Author: pj.fanning):
The issue is that Apache Hive code uses a class from log4j-api jar but Drill 
does not include log4j-api jar as a dependency when it uses Apache Hive. So 
far, the solution is for Drill to have a copy of the log4j class that Hive 
needs. This java file needs to be kept up to date - we upgraded Log4j during 
the Log4j panic at the tirn of this year - but never upgraded the java file.

I believe that Drill should not be copying log4j classes like this and that it 
should include the log4j-api jar as a dependency when using Apache Hive. If 
Drill team insists on not adding this dependency, then we are stuck with having 
to merge in all the changes that happen to the Java file.

> Revisit clone of log4j Strings class
> 
>
> Key: DRILL-8240
> URL: https://issues.apache.org/jira/browse/DRILL-8240
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Functions - Hive
>Affects Versions: 1.20.1
>Reporter: PJ Fanning
>Priority: Major
>
> See https://issues.apache.org/jira/browse/DRILL-8044 for background.
> The code added there is now out of date. After the log4j panic late last 
> year, 5 commits were made to modify the real log4j class and these are 
> missing from the Drill copy.
> Compare 
> https://github.com/apache/logging-log4j2/commits/rel/2.17.2/log4j-api/src/main/java/org/apache/logging/log4j/util/Strings.java
>  to 
> https://github.com/apache/logging-log4j2/commits/rel/2.14.1/log4j-api/src/main/java/org/apache/logging/log4j/util/Strings.java
> The Drill copy is based on Log4J 2.14.1. Every commit in 2021 and 2022 is 
> missing from the Drill copy.





[jira] [Commented] (DRILL-8240) Revisit clone of log4j Strings class

2022-05-29 Thread PJ Fanning (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-8240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17543600#comment-17543600
 ] 

PJ Fanning commented on DRILL-8240:
---

[~dzamo], [~cgivre], [~luoc] Any thoughts on how we should proceed here? Should 
we just update the Drill copy of the code?

> Revisit clone of log4j Strings class
> 
>
> Key: DRILL-8240
> URL: https://issues.apache.org/jira/browse/DRILL-8240
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Functions - Hive
>Affects Versions: 1.20.1
>Reporter: PJ Fanning
>Priority: Major
>
> See https://issues.apache.org/jira/browse/DRILL-8044 for background.
> The code added there is now out of date. After the log4j panic late last 
> year, 5 commits were made to modify the real log4j class and these are 
> missing from the Drill copy.
> Compare 
> https://github.com/apache/logging-log4j2/commits/rel/2.17.2/log4j-api/src/main/java/org/apache/logging/log4j/util/Strings.java
>  to 
> https://github.com/apache/logging-log4j2/commits/rel/2.14.1/log4j-api/src/main/java/org/apache/logging/log4j/util/Strings.java
> The Drill copy is based on Log4J 2.14.1. Every commit in 2021 and 2022 is 
> missing from the Drill copy.





[jira] [Created] (DRILL-8240) Revisit clone of log4j Strings class

2022-05-29 Thread PJ Fanning (Jira)
PJ Fanning created DRILL-8240:
-

 Summary: Revisit clone of log4j Strings class
 Key: DRILL-8240
 URL: https://issues.apache.org/jira/browse/DRILL-8240
 Project: Apache Drill
  Issue Type: Improvement
  Components: Functions - Hive
Affects Versions: 1.20.1
Reporter: PJ Fanning


See https://issues.apache.org/jira/browse/DRILL-8044 for background.

The code added there is now out of date. After the log4j panic late last year, 
5 commits were made to modify the real log4j class and these are missing from 
the Drill copy.

Compare 
https://github.com/apache/logging-log4j2/commits/rel/2.17.2/log4j-api/src/main/java/org/apache/logging/log4j/util/Strings.java
 to 
https://github.com/apache/logging-log4j2/commits/rel/2.14.1/log4j-api/src/main/java/org/apache/logging/log4j/util/Strings.java

The Drill copy is based on Log4J 2.14.1. Every commit in 2021 and 2022 is 
missing from the Drill copy.





[jira] [Created] (DRILL-8230) upgrade to poi 5.2.2

2022-05-20 Thread PJ Fanning (Jira)
PJ Fanning created DRILL-8230:
-

 Summary: upgrade to poi 5.2.2
 Key: DRILL-8230
 URL: https://issues.apache.org/jira/browse/DRILL-8230
 Project: Apache Drill
  Issue Type: Improvement
Reporter: PJ Fanning








[jira] [Created] (DRILL-8176) upgrade jackson due to CVE-2020-36518

2022-03-25 Thread PJ Fanning (Jira)
PJ Fanning created DRILL-8176:
-

 Summary: upgrade jackson due to CVE-2020-36518
 Key: DRILL-8176
 URL: https://issues.apache.org/jira/browse/DRILL-8176
 Project: Apache Drill
  Issue Type: Bug
Reporter: PJ Fanning


https://nvd.nist.gov/vuln/detail/CVE-2020-36518



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (DRILL-8154) upgrade to poi 5.2.1

2022-03-04 Thread PJ Fanning (Jira)
PJ Fanning created DRILL-8154:
-

 Summary: upgrade to poi 5.2.1
 Key: DRILL-8154
 URL: https://issues.apache.org/jira/browse/DRILL-8154
 Project: Apache Drill
  Issue Type: Improvement
  Components: Execution - Data Types
Reporter: PJ Fanning


https://poi.apache.org/





[jira] [Created] (DRILL-8150) upgrade to log4j 2.17.2

2022-02-28 Thread PJ Fanning (Jira)
PJ Fanning created DRILL-8150:
-

 Summary: upgrade to log4j 2.17.2
 Key: DRILL-8150
 URL: https://issues.apache.org/jira/browse/DRILL-8150
 Project: Apache Drill
  Issue Type: Improvement
Reporter: PJ Fanning


https://logging.apache.org/log4j/2.x/changes-report.html





[jira] [Created] (DRILL-8149) format-excel plugin needs to support POI IOUtils byte array overrides to support big files

2022-02-24 Thread PJ Fanning (Jira)
PJ Fanning created DRILL-8149:
-

 Summary: format-excel plugin needs to support POI IOUtils byte 
array overrides to support big files
 Key: DRILL-8149
 URL: https://issues.apache.org/jira/browse/DRILL-8149
 Project: Apache Drill
  Issue Type: Improvement
  Components: Execution - Data Types
Affects Versions: 1.19.0
Reporter: PJ Fanning


[https://poi.apache.org/components/configuration.html] - see 
[org.apache.poi.util.IOUtils.setByteArrayMaxOverride(int 
maxOverride)|https://poi.apache.org/apidocs/5.0/org/apache/poi/util/IOUtils.html#setByteArrayMaxOverride-int-]

Core POI code tries to set limits on resource allocations. 
excel-streaming-reader may not be as heavily affected by these settings because 
it only uses parts of the core POI codebase.

POI 5.2.1 (due in the next few weeks) fixes a few issues but there is some evidence 
that core POI users are hitting issues when loading large files and having to 
set the byte array max override setting.

I can do some testing of the format-excel plugin to see if it can hit these 
issues with large files.
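
To illustrate the kind of limit being discussed, here is a minimal sketch of an allocation check with a global override. It is a stand-in written for this ticket, not POI's implementation: the class `AllocationLimitDemo` and its methods are hypothetical, and the only real POI call referenced is `IOUtils.setByteArrayMaxOverride(int)` from the link above.

```java
// Hypothetical stand-in for a library-wide allocation limit with an override.
// This is NOT the POI source; it only mirrors the idea described in the ticket.
class AllocationLimitDemo {
    // -1 means "no override set", so each caller's default limit applies.
    private static int maxOverride = -1;

    static void setMaxOverride(int value) {
        maxOverride = value;
    }

    // Rejects a requested allocation that exceeds the effective limit.
    static void checkLength(long length, int defaultMax) {
        int limit = maxOverride > 0 ? maxOverride : defaultMax;
        if (length > limit) {
            throw new IllegalStateException(
                "Tried to allocate " + length + " bytes, but the limit is " + limit);
        }
    }

    public static void main(String[] args) {
        checkLength(50, 100);          // within the default limit
        setMaxOverride(1000);          // raise the limit for big files
        checkLength(200, 100);         // passes only because of the override
        System.out.println("ok");
    }
}
```

If format-excel can hit the limit with large files, the real override would presumably need to be set once before the workbook is opened; whether that should be a plugin config option is still open.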





[jira] [Resolved] (DRILL-8095) format-excel reader - upgrade to POI 5.2.0

2022-01-15 Thread PJ Fanning (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-8095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

PJ Fanning resolved DRILL-8095.
---
Fix Version/s: 1.20.0
   Resolution: Fixed

PR merged

> format-excel reader - upgrade to POI 5.2.0
> --
>
> Key: DRILL-8095
> URL: https://issues.apache.org/jira/browse/DRILL-8095
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Execution - Data Types
>Reporter: PJ Fanning
>Priority: Major
> Fix For: 1.20.0
>
>
> Upgrade to latest POI release





[jira] [Updated] (DRILL-8095) format-excel reader - upgrade to POI 5.2.0

2022-01-15 Thread PJ Fanning (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-8095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

PJ Fanning updated DRILL-8095:
--
Description: Upgrade to latest POI release  (was: I've recently added a 
feature to excel-streaming-reader (in v3.3.0) to optionally ignore cell style 
information. This is not enabled by default. It saves memory and processing 
time to ignore the cell styles.

The current Drill format-excel code does not use the cell styles.

At some point in the future, it may be worth having a Drill feature that allows 
it to infer the schema for the sheet based on the cell styles but until such a 
feature is added, parsing the cell styles is a waste of compute resources.

If this sounds useful, I can submit a PR.)
Summary: format-excel reader - upgrade to POI 5.2.0  (was: format-excel 
reader should ignore cell styles)

It appears that the Drill code needs the Excel styles to work out if the cell data 
is a date - so it needs to keep parsing the style data.

 

was:

I've recently added a feature to excel-streaming-reader (in v3.3.0) to 
optionally ignore cell style information. This is not enabled by default. It 
saves memory and processing time to ignore the cell styles.

The current Drill format-excel code does not use the cell styles.

At some point in the future, it may be worth having a Drill feature that allows 
it to infer the schema for the sheet based on the cell styles but until such a 
feature is added, parsing the cell styles is a waste of compute resources.

If this sounds useful, I can submit a PR.

> format-excel reader - upgrade to POI 5.2.0
> --
>
> Key: DRILL-8095
> URL: https://issues.apache.org/jira/browse/DRILL-8095
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Execution - Data Types
>Reporter: PJ Fanning
>Priority: Major
>
> Upgrade to latest POI release





[jira] [Commented] (DRILL-8106) format-excel does not handle missing cells properly

2022-01-12 Thread PJ Fanning (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-8106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17474927#comment-17474927
 ] 

PJ Fanning commented on DRILL-8106:
---

[~cgivre] this is the issue that you emailed me about.

> format-excel does not handle missing cells properly
> ---
>
> Key: DRILL-8106
> URL: https://issues.apache.org/jira/browse/DRILL-8106
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Execution - Data Types
>Reporter: PJ Fanning
>Priority: Major
>
> ExcelBatchReader uses cellIterator assuming that this will return cells for 
> all columns - but this is not how that code works - the iterator only returns 
> non-empty cells.





[jira] [Created] (DRILL-8106) format-excel does not handle missing cells properly

2022-01-12 Thread PJ Fanning (Jira)
PJ Fanning created DRILL-8106:
-

 Summary: format-excel does not handle missing cells properly
 Key: DRILL-8106
 URL: https://issues.apache.org/jira/browse/DRILL-8106
 Project: Apache Drill
  Issue Type: Improvement
  Components: Execution - Data Types
Reporter: PJ Fanning


ExcelBatchReader uses cellIterator assuming that this will return cells for all 
columns - but this is not how that code works - the iterator only returns 
non-empty cells.
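
A minimal sketch of the problem and one way around it, using plain Java collections rather than POI types. The sparse row is modelled as a map from column index to value (an assumption made for illustration; `SparseRowDemo` and `toDenseRow` are hypothetical names). With POI itself, the equivalent approach would be to look cells up by column index, for example with `Row.getCell(int)`, instead of relying on the iterator.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model of a sparse spreadsheet row: only populated cells are stored.
class SparseRowDemo {
    // Expands a sparse row into one entry per column, using null for missing cells.
    static List<String> toDenseRow(Map<Integer, String> sparseRow, int columnCount) {
        List<String> dense = new ArrayList<>(columnCount);
        for (int col = 0; col < columnCount; col++) {
            dense.add(sparseRow.get(col)); // null when the cell was never stored
        }
        return dense;
    }

    public static void main(String[] args) {
        Map<Integer, String> row = new TreeMap<>();
        row.put(0, "a");
        row.put(2, "c"); // column 1 is empty, so it is simply absent

        // Iterating the stored cells yields 2 values for a 3-column sheet.
        System.out.println(row.values());       // prints [a, c]
        // Indexing by column keeps values aligned with their columns.
        System.out.println(toDenseRow(row, 3)); // prints [a, null, c]
    }
}
```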





[jira] [Commented] (DRILL-8095) format-excel reader should ignore cell styles

2022-01-12 Thread PJ Fanning (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-8095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17474699#comment-17474699
 ] 

PJ Fanning commented on DRILL-8095:
---

In theory, cell styling should not affect Drill based on how it currently 
parses the data. I can add a PR after the POI 5.2.0 release goes out (at the 
weekend, hopefully). If you have any examples of xlsx files that cause problems 
with existing Drill code - could you send them to me? You can email if the data 
is sensitive.

> format-excel reader should ignore cell styles
> -
>
> Key: DRILL-8095
> URL: https://issues.apache.org/jira/browse/DRILL-8095
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Execution - Data Types
>Reporter: PJ Fanning
>Priority: Major
>
> I've recently added a feature to excel-streaming-reader (in v3.3.0) to 
> optionally ignore cell style information. This is not enabled by default. It 
> saves memory and processing time to ignore the cell styles.
> The current Drill format-excel code does not use the cell styles.
> At some point in the future, it may be worth having a Drill feature that 
> allows it to infer the schema for the sheet based on the cell styles but 
> until such a feature is added, parsing the cell styles is a waste of 
> compute resources.
> If this sounds useful, I can submit a PR.





[jira] [Commented] (DRILL-8096) format-excel reader: support different Shared String implementations

2022-01-12 Thread PJ Fanning (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-8096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17474696#comment-17474696
 ] 

PJ Fanning commented on DRILL-8096:
---

[~cgivre] I'm heading up the POI 5.2.0 release and that will be released in a 
few days if no one drops a late -1. So I'm planning to wait until that is 
released and then include the POI and associated lib updates in my next Drill PR.

> format-excel reader: support different Shared String implementations
> 
>
> Key: DRILL-8096
> URL: https://issues.apache.org/jira/browse/DRILL-8096
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Execution - Data Types
>Reporter: PJ Fanning
>Priority: Major
>
> One of the biggest users of memory and processing time when reading Excel 
> files is handling the Shared Strings Table.
> excel-streaming-reader v3.3.0 supports 3 implementations.
> I would suggest that Drill should use the ReadOnlySharedStringTable as the 
> default.
> Drill currently uses the full featured Apache POI SharedStringTable by 
> default (which requires more memory and parsing effort).
> There is also a TempFileSharedStringTable which uses a temp file to keep the 
> data out of heap memory. This is still pretty fast because it is implemented 
> using an H2 database MVMap.
> If allowing users to configure which implementation they want sounds 
> useful, I can do a PR.
>  





[jira] [Created] (DRILL-8096) format-excel reader: support different Shared String implementations

2021-12-27 Thread PJ Fanning (Jira)
PJ Fanning created DRILL-8096:
-

 Summary: format-excel reader: support different Shared String 
implementations
 Key: DRILL-8096
 URL: https://issues.apache.org/jira/browse/DRILL-8096
 Project: Apache Drill
  Issue Type: Improvement
  Components: Execution - Data Types
Reporter: PJ Fanning


One of the biggest users of memory and processing time when reading Excel files 
is handling the Shared Strings Table.

excel-streaming-reader v3.3.0 supports 3 implementations.

I would suggest that Drill should use the ReadOnlySharedStringTable as the 
default.

Drill currently uses the full featured Apache POI SharedStringTable by default 
(which requires more memory and parsing effort).

There is also a TempFileSharedStringTable which uses a temp file to keep the 
data out of heap memory. This is still pretty fast because it is implemented 
using an H2 database MVMap.

If allowing users to configure which implementation they want sounds 
useful, I can do a PR.
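
As a rough sketch of what such a setting could look like, the snippet below maps a config string to one of the three implementations and falls back to the read-only one. Everything here is hypothetical: the enum, its constant names and the accepted strings are assumptions for illustration, not an existing Drill option or the excel-streaming-reader API.

```java
import java.util.Locale;

// Hypothetical config handling for choosing a Shared Strings implementation.
class SharedStringsConfigDemo {
    // Constant names are illustrative; they mirror the three options listed above.
    enum SharedStringsImpl { POI_DEFAULT, READ_ONLY, TEMP_FILE }

    // Parses a user-supplied value, defaulting to READ_ONLY when nothing is set.
    static SharedStringsImpl parse(String configured) {
        if (configured == null || configured.trim().isEmpty()) {
            return SharedStringsImpl.READ_ONLY;
        }
        String normalized = configured.trim().toUpperCase(Locale.ROOT).replace('-', '_');
        return SharedStringsImpl.valueOf(normalized); // throws on unknown values
    }

    public static void main(String[] args) {
        System.out.println(parse(null));        // prints READ_ONLY
        System.out.println(parse("temp-file")); // prints TEMP_FILE
    }
}
```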

 





[jira] [Created] (DRILL-8095) format-excel reader should ignore cell styles

2021-12-27 Thread PJ Fanning (Jira)
PJ Fanning created DRILL-8095:
-

 Summary: format-excel reader should ignore cell styles
 Key: DRILL-8095
 URL: https://issues.apache.org/jira/browse/DRILL-8095
 Project: Apache Drill
  Issue Type: Improvement
  Components: Execution - Data Types
Reporter: PJ Fanning


I've recently added a feature to excel-streaming-reader (in v3.3.0) to 
optionally ignore cell style information. This is not enabled by default. It 
saves memory and processing time to ignore the cell styles.

The current Drill format-excel code does not use the cell styles.

At some point in the future, it may be worth having a Drill feature that allows 
it to infer the schema for the sheet based on the cell styles but until such a 
feature is added, parsing the cell styles is a waste of compute resources.





[jira] [Updated] (DRILL-8095) format-excel reader should ignore cell styles

2021-12-27 Thread PJ Fanning (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-8095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

PJ Fanning updated DRILL-8095:
--
Description: 
I've recently added a feature to excel-streaming-reader (in v3.3.0) to 
optionally ignore cell style information. This is not enabled by default. It 
saves memory and processing time to ignore the cell styles.

The current Drill format-excel code does not use the cell styles.

At some point in the future, it may be worth having a Drill feature that allows 
it to infer the schema for the sheet based on the cell styles but until such a 
feature is added, parsing the cell styles is a waste of compute resources.

If this sounds useful, I can submit a PR.

  was:
I've recently added a feature to excel-streaming-reader (in v3.3.0) to 
optionally ignore cell style information. This is not enabled by default. It 
saves memory and processing time to ignore the cell styles.

The current Drill format-excel code does not use the cell styles.

At some point in the future, it may be worth having a Drill feature that allows 
it to infer the schema for the sheet based on the cell styles but until such a 
feature is added, parsing the cell styles is a waste of compute resources.


> format-excel reader should ignore cell styles
> -
>
> Key: DRILL-8095
> URL: https://issues.apache.org/jira/browse/DRILL-8095
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Execution - Data Types
>Reporter: PJ Fanning
>Priority: Major
>
> I've recently added a feature to excel-streaming-reader (in v3.3.0) to 
> optionally ignore cell style information. This is not enabled by default. It 
> saves memory and processing time to ignore the cell styles.
> The current Drill format-excel code does not use the cell styles.
> At some point in the future, it may be worth having a Drill feature that 
> allows it to infer the schema for the sheet based on the cell styles but 
> until such a feature is added, parsing the cell styles is a waste of 
> compute resources.
> If this sounds useful, I can submit a PR.





[jira] [Commented] (DRILL-8071) format-excel data parsing should use POI code

2021-12-13 Thread PJ Fanning (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-8071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17458822#comment-17458822
 ] 

PJ Fanning commented on DRILL-8071:
---

I scaled back the scope of this issue to just what was covered in the linked PR.

I removed this from the description:

The existing ExcelBatchReader uses the raw data values from the cells. This raw 
data ignores formatting set on the cells. As an example, numbers and dates are 
stored as doubles. With the POI DataFormatter, you can get the cell style 
applied so that the data will appear as it does when you view the data in Excel 
itself.

[https://poi.apache.org/apidocs/dev/org/apache/poi/ss/usermodel/DataFormatter.html#formatCellValue-org.apache.poi.ss.usermodel.Cell-]

 

A big number like 123456789.987654 could be stored as double that is more like 
123456789.987653999 when represented in decimal format (because this might 
be the closest match that double can represent). The cell format could say that 
cell has 6 decimal places after the decimal point so the formatter would round 
the number back to the value that Excel displays.
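
The rounding effect can be reproduced with plain Java, without POI. This sketch only demonstrates the arithmetic; it is not how `DataFormatter` is implemented, and `DoubleRoundingDemo` and `roundToPlaces` are hypothetical names.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

// Shows that a double holds only the nearest representable value, and that
// rounding to the displayed number of decimal places recovers the typed value.
class DoubleRoundingDemo {
    // Rounds the exact binary value of the double to the given decimal places.
    static BigDecimal roundToPlaces(double raw, int places) {
        return new BigDecimal(raw).setScale(places, RoundingMode.HALF_UP);
    }

    public static void main(String[] args) {
        double raw = 123456789.987654;
        // The exact stored value has many more digits than were typed.
        System.out.println(new BigDecimal(raw));
        // Applying a 6-decimal-place format gives back the typed value.
        System.out.println(roundToPlaces(raw, 6)); // prints 123456789.987654
    }
}
```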

Even if you choose not to use the DataFormatter, you have unprotected calls to 
`cell.getNumericCellValue()` and that could easily throw an exception (if the 
data is not stored as a number). Even `cell.getStringCellValue()` can throw an 
exception - for similar reasons.
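
A sketch of what a protected call could look like. `FakeCell` and `CellType` here are small hypothetical stand-ins so the example runs without POI on the classpath; they are not the POI `Cell` interface. The point is only the pattern: check the cell type before calling a typed getter.

```java
// Hypothetical stand-ins that mimic a typed getter which throws on a type mismatch.
class GuardedCellDemo {
    enum CellType { NUMERIC, STRING, BLANK }

    static final class FakeCell {
        final CellType type;
        final Object value;

        FakeCell(CellType type, Object value) {
            this.type = type;
            this.value = value;
        }

        // Throws when the cell does not hold a number.
        double getNumericCellValue() {
            if (type != CellType.NUMERIC) {
                throw new IllegalStateException("Cannot get a numeric value from a " + type + " cell");
            }
            return (Double) value;
        }
    }

    // Guarded access: returns null instead of throwing for non-numeric cells.
    static Double numericOrNull(FakeCell cell) {
        if (cell == null || cell.type != CellType.NUMERIC) {
            return null;
        }
        return cell.getNumericCellValue();
    }

    public static void main(String[] args) {
        System.out.println(numericOrNull(new FakeCell(CellType.NUMERIC, 1.5))); // prints 1.5
        System.out.println(numericOrNull(new FakeCell(CellType.STRING, "x")));  // prints null
    }
}
```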

> format-excel data parsing should use POI code
> -
>
> Key: DRILL-8071
> URL: https://issues.apache.org/jira/browse/DRILL-8071
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Execution - Data Types
>Affects Versions: 1.19.0
>Reporter: PJ Fanning
>Priority: Major
>
> There is also custom code for handling the conversion of the raw numbers 
> representing dates/timestamps but this also seems like a bad idea. The Cell 
> class has getLocalDateTimeCellValue and this has the right logic for 
> converting 1904 and 1900 based dates - yes, Excel uses 2 different formats.
> Code that processes excel files is a real pain to get right because the 
> Microsoft storage format is really bad.
>  





[jira] [Updated] (DRILL-8071) format-excel data parsing should use POI code

2021-12-13 Thread PJ Fanning (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-8071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

PJ Fanning updated DRILL-8071:
--
Description: 
There is also custom code for handling the conversion of the raw numbers 
representing dates/timestamps but this also seems like a bad idea. The Cell 
class has getLocalDateTimeCellValue and this has the right logic for converting 
1904 and 1900 based dates - yes, Excel uses 2 different formats.

Code that processes excel files is a real pain to get right because the 
Microsoft storage format is really bad.
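
For reference, the two date systems mentioned above differ only in their epoch. The sketch below converts a whole-day serial number under either system using `java.time`; it is an illustration of the arithmetic, not the POI code, and `ExcelDateDemo.fromSerial` is a hypothetical helper. It assumes serial values of 61 or more in the 1900 system, because that system counts a nonexistent 29 February 1900.

```java
import java.time.LocalDate;

// Converts whole-day Excel serial numbers to dates under both date systems.
class ExcelDateDemo {
    static LocalDate fromSerial(long serial, boolean use1904) {
        if (use1904) {
            // 1904 system: serial 0 is 1904-01-01.
            return LocalDate.of(1904, 1, 1).plusDays(serial);
        }
        // 1900 system: serial 1 is 1900-01-01, but serial 60 is the fictitious
        // 1900-02-29, so this offset is only valid for serial >= 61.
        return LocalDate.of(1899, 12, 30).plusDays(serial);
    }

    public static void main(String[] args) {
        // The same calendar day has different serials in the two systems.
        System.out.println(fromSerial(44562, false)); // prints 2022-01-01
        System.out.println(fromSerial(43100, true));  // prints 2022-01-01
    }
}
```

The two serials differ by 1462 days, which is why reading a 1904-based workbook with 1900-based logic shifts every date by about four years.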

 

  was:
The existing ExcelBatchReader uses the raw data values from the cells. This raw 
data ignores formatting set on the cells. As an example, numbers and dates are 
stored as doubles. With the POI DataFormatter, you can get the cell style 
applied so that the data will appear as it does when you view the data in Excel 
itself.

[https://poi.apache.org/apidocs/dev/org/apache/poi/ss/usermodel/DataFormatter.html#formatCellValue-org.apache.poi.ss.usermodel.Cell-]

 

A big number like 123456789.987654 could be stored as double that is more like 
123456789.987653999 when represented in decimal format (because this might 
be the closest match that double can represent). The cell format could say that 
cell has 6 decimal places after the decimal point so the formatter would round 
the number back to the value that Excel displays.

Even if you choose not to use the DataFormatter, you have unprotected calls to 
`cell.getNumericCellValue()` and that could easily throw an exception (if the 
data is not stored as a number). Even `cell.getStringCellValue()` can throw an 
exception - for similar reasons.

 

There is also custom code for handling the conversion of the raw numbers 
representing dates/timestamps but this also seems like a bad idea. The Cell 
class has getLocalDateTimeCellValue and this has the right logic for converting 
1904 and 1900 based dates - yes, Excel uses 2 different formats.

Code that processes excel files is a real pain to get right because the 
Microsoft storage format is really bad.

 


> format-excel data parsing should use POI code
> -
>
> Key: DRILL-8071
> URL: https://issues.apache.org/jira/browse/DRILL-8071
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Execution - Data Types
>Affects Versions: 1.19.0
>Reporter: PJ Fanning
>Priority: Major
>
> There is also custom code for handling the conversion of the raw numbers 
> representing dates/timestamps but this also seems like a bad idea. The Cell 
> class has getLocalDateTimeCellValue and this has the right logic for 
> converting 1904 and 1900 based dates - yes, Excel uses 2 different formats.
> Code that processes excel files is a real pain to get right because the 
> Microsoft storage format is really bad.
>  





[jira] [Updated] (DRILL-8071) format-excel data parsing should use POI code

2021-12-13 Thread PJ Fanning (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-8071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

PJ Fanning updated DRILL-8071:
--
Summary: format-excel data parsing should use POI code  (was: format-excel 
should use POI DataFormatter)

> format-excel data parsing should use POI code
> -
>
> Key: DRILL-8071
> URL: https://issues.apache.org/jira/browse/DRILL-8071
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Execution - Data Types
>Affects Versions: 1.19.0
>Reporter: PJ Fanning
>Priority: Major
>
> The existing ExcelBatchReader uses the raw data values from the cells. This 
> raw data ignores formatting set on the cells. As an example, numbers and 
> dates are stored as doubles. With the POI DataFormatter, you can get the cell 
> style applied so that the data will appear as it does when you view the data 
> in Excel itself.
> [https://poi.apache.org/apidocs/dev/org/apache/poi/ss/usermodel/DataFormatter.html#formatCellValue-org.apache.poi.ss.usermodel.Cell-]
>  
> A big number like 123456789.987654 could be stored as double that is more 
> like 123456789.987653999 when represented in decimal format (because this 
> might be the closest match that double can represent). The cell format could 
> say that cell has 6 decimal places after the decimal point so the formatter 
> would round the number back to the value that Excel displays.
> Even if you choose not to use the DataFormatter, you have unprotected calls 
> to `cell.getNumericCellValue()` and that could easily throw an exception (if 
> the data is not stored as a number). Even `cell.getStringCellValue()` can throw 
> an exception - for similar reasons.
>  
> There is also custom code for handling the conversion of the raw numbers 
> representing dates/timestamps but this also seems like a bad idea. The Cell 
> class has getLocalDateTimeCellValue and this has the right logic for 
> converting 1904 and 1900 based dates - yes, Excel uses 2 different formats.
> Code that processes excel files is a real pain to get right because the 
> Microsoft storage format is really bad.
>  





[jira] [Updated] (DRILL-8070) format-excel assumes that rowIterator returns every row

2021-12-07 Thread PJ Fanning (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-8070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

PJ Fanning updated DRILL-8070:
--
Summary: format-excel assumes that rowIterator returns every row  (was: 
format-excel assumes that rowIterator returns every row - it doesn't)

> format-excel assumes that rowIterator returns every row
> ---
>
> Key: DRILL-8070
> URL: https://issues.apache.org/jira/browse/DRILL-8070
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Execution - Data Types
>Reporter: PJ Fanning
>Priority: Major
>
> In ExcelBatchReader, this code makes the wrong assumption:
> {code:java}
>     for (int i = 1; i < rowNumber; i++) {
>          currentRow = rowIterator.next();
>     } {code}
>  
> There are 2 for loops like this.
> Empty Rows will not necessarily be returned by the iterator. Basically, rows 
> without populated cells could easily be skipped. Think of the Sheet as being 
> represented as a sparse matrix - because it is stored like this.
>  
>  
>  
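
A minimal sketch of a safer way to skip ahead, using plain Java instead of POI types. The iterator here yields the row numbers that are actually stored (an assumption made to model the sparse sheet; `SparseSheetDemo` and `seekToRow` are hypothetical names). With POI, the same idea would compare `Row.getRowNum()` against the target instead of counting calls to `next()`.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Models a sheet where only rows with populated cells are stored.
class SparseSheetDemo {
    // Advances until a stored row at or after targetRow is found.
    // Returns that row's number, or -1 if the sheet ends first.
    static int seekToRow(Iterator<Integer> rowNumbers, int targetRow) {
        while (rowNumbers.hasNext()) {
            int rowNum = rowNumbers.next();
            if (rowNum >= targetRow) {
                return rowNum;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Rows 2, 3 and 4 are empty, so the iterator never returns them.
        List<Integer> storedRows = Arrays.asList(0, 1, 5, 6);

        // Counting three next() calls lands on row 5, not row 2.
        Iterator<Integer> counting = storedRows.iterator();
        counting.next();
        counting.next();
        System.out.println(counting.next()); // prints 5

        // Checking the row number makes the gap explicit to the caller.
        System.out.println(seekToRow(storedRows.iterator(), 2)); // prints 5
    }
}
```

Because the returned row number can be larger than the target, the caller can tell that the requested row was empty and handle it, rather than silently reading the wrong row.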





[jira] [Updated] (DRILL-8071) format-excel should use POI DataFormatter

2021-12-07 Thread PJ Fanning (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-8071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

PJ Fanning updated DRILL-8071:
--
Description: 
The existing ExcelBatchReader uses the raw data values from the cells. This raw 
data ignores formatting set on the cells. As an example, numbers and dates are 
stored as doubles. With the POI DataFormatter, you can get the cell style 
applied so that the data will appear as it does when you view the data in Excel 
itself.

[https://poi.apache.org/apidocs/dev/org/apache/poi/ss/usermodel/DataFormatter.html#formatCellValue-org.apache.poi.ss.usermodel.Cell-]

 

A big number like 123456789.987654 could be stored as double that is more like 
123456789.987653999 when represented in decimal format (because this might 
be the closest match that double can represent). The cell format could say that 
cell has 6 decimal places after the decimal point so the formatter would round 
the number back to the value that Excel displays.

Even if you choose not to use the DataFormatter, you have unprotected calls to 
`cell.getNumericCellValue()` and that could easily throw an exception (if the 
data is not stored as a number). Even `cell.getStringCellValue()` can throw an 
exception - for similar reasons.

 

There is also custom code for handling the conversion of the raw numbers 
representing dates/timestamps but this also seems like a bad idea. The Cell 
class has getLocalDateTimeCellValue and this has the right logic for converting 
1904 and 1900 based dates - yes, Excel uses 2 different formats.

Code that processes excel files is a real pain to get right because the 
Microsoft storage format is really bad.

 

  was:
The existing ExcelBatchReader uses the raw data values from the cells. This raw 
data ignores formatting set on the cells. As an example, numbers and dates are 
stored as doubles. With the POI DataFormatter, you can get the cell style 
applied so that the data will appear as it does when you view the data in Excel 
itself.

[https://poi.apache.org/apidocs/dev/org/apache/poi/ss/usermodel/DataFormatter.html#formatCellValue-org.apache.poi.ss.usermodel.Cell-]

 

A big number like 123456789.987654 could be stored as double that is more like 
123456789.987653999 when represented in decimal format (because this might 
be the closest match that double can represent). The cell format could say that 
cell has 6 decimal places after the decimal point so the formatter would round 
the number back to the value that Excel displays.

Even if you choose not to use the DataFormatter, you have unprotected calls to 
`cell.getNumericCellValue()` and that could easily throw an exception (if the 
data is not stored as a number). Even `cell.getStringCellValue()` can throw an 
exception - for similar reasons.

 

Code that processes excel files is a real pain to get right because the 
Microsoft storage format is really bad.

 


> format-excel should use POI DataFormatter
> -
>
> Key: DRILL-8071
> URL: https://issues.apache.org/jira/browse/DRILL-8071
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Execution - Data Types
>Reporter: PJ Fanning
>Priority: Major
>
> The existing ExcelBatchReader uses the raw data values from the cells. This 
> raw data ignores formatting set on the cells. As an example, numbers and 
> dates are stored as doubles. With the POI DataFormatter, you can get the cell 
> style applied so that the data will appear as it does when you view the data 
> in Excel itself.
> [https://poi.apache.org/apidocs/dev/org/apache/poi/ss/usermodel/DataFormatter.html#formatCellValue-org.apache.poi.ss.usermodel.Cell-]
>  
> A big number like 123456789.987654 could be stored as double that is more 
> like 123456789.987653999 when represented in decimal format (because this 
> might be the closest match that double can represent). The cell format could 
> say that cell has 6 decimal places after the decimal point so the formatter 
> would round the number back to the value that Excel displays.
> Even if you choose not to use the DataFormatter, you have unprotected calls 
> to `cell.getNumericCellValue()` and that could easily throw an exception (if 
> the data is not stored as a number). Even `cell.getStringCellValue()` can throw 
> an exception - for similar reasons.
>  
> There is also custom code for handling the conversion of the raw numbers 
> representing dates/timestamps but this also seems like a bad idea. The Cell 
> class has getLocalDateTimeCellValue and this has the right logic for 
> converting 1904 and 1900 based dates - yes, Excel uses 2 different formats.
> Code that processes excel files is a real pain to get right because the 
> Microsoft storage format is really bad.
>  




[jira] [Commented] (DRILL-8071) format-excel should use POI DataFormatter

2021-12-07 Thread PJ Fanning (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-8071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17454585#comment-17454585
 ] 

PJ Fanning commented on DRILL-8071:
---

[~cgivre] I spotted what I think is another issue in the excel code

> format-excel should use POI DataFormatter
> -
>
> Key: DRILL-8071
> URL: https://issues.apache.org/jira/browse/DRILL-8071
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Execution - Data Types
>Reporter: PJ Fanning
>Priority: Major
>
> The existing ExcelBatchReader uses the raw data values from the cells. This 
> raw data ignores formatting set on the cells. As an example, numbers and 
> dates are stored as doubles. With the POI DataFormatter, you can get the cell 
> style applied so that the data will appear as it does when you view the data 
> in Excel itself.
> [https://poi.apache.org/apidocs/dev/org/apache/poi/ss/usermodel/DataFormatter.html#formatCellValue-org.apache.poi.ss.usermodel.Cell-]
>  
> A big number like 123456789.987654 could be stored as double that is more 
> like 123456789.987653999 when represented in decimal format (because this 
> might be the closest match that double can represent). The cell format could 
> say that cell has 6 decimal places after the decimal point so the formatter 
> would round the number back to the value that Excel displays.
> Even if you choose not to use the DataFormatter, you have unprotected calls 
> to `cell.getNumericCellValue()` and that could easily throw an exception (if 
> the data is not stored as a number). Even `cell.getStringCellValue()` can throw 
> an exception - for similar reasons.
>  
> Code that processes excel files is a real pain to get right because the 
> Microsoft storage format is really bad.
>  





[jira] [Updated] (DRILL-8071) format-excel should use POI DataFormatter

2021-12-07 Thread PJ Fanning (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-8071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

PJ Fanning updated DRILL-8071:
--
Description: 
The existing ExcelBatchReader uses the raw data values from the cells. This raw 
data ignores formatting set on the cells. As an example, numbers and dates are 
stored as doubles. With the POI DataFormatter, you can get the cell style 
applied so that the data will appear as it does when you view the data in Excel 
itself.

[https://poi.apache.org/apidocs/dev/org/apache/poi/ss/usermodel/DataFormatter.html#formatCellValue-org.apache.poi.ss.usermodel.Cell-]

 

A big number like 123456789.987654 could be stored as double that is more like 
123456789.987653999 when represented in decimal format (because this might 
be the closest match that double can represent). The cell format could say that 
cell has 6 decimal places after the decimal point so the formatter would round 
the number back to the value that Excel displays.

Even if you choose not to use the DataFormatter, you have unprotected calls to 
`cell.getNumericCellValue()` and that could easily throw an exception (if the 
data is not stored as a number). Even `cell.getStringCellValue()` can throw an 
exception - for similar reasons.

 

Code that processes excel files is a real pain to get right because the 
Microsoft storage format is really bad.

 

  was:
The existing ExcelBatchReader uses the raw data values from the cells. This raw 
data ignores formatting set on the cells. As an example, numbers and dates are 
stored as doubles. With the POI DataFormatter, you can get the cell style 
applied so that the data will appear as it does when you view the data in Excel 
itself.

[https://poi.apache.org/apidocs/dev/org/apache/poi/ss/usermodel/DataFormatter.html#formatCellValue-org.apache.poi.ss.usermodel.Cell-]

 

A big number like 123456789.987654 could be stored as a double that is closer 
to 123456789.987653999 when written out in decimal (because that might be the 
closest value a double can represent). The cell format could say that the cell 
has 6 decimal places, so the formatter would round the number back to the 
value that Excel displays.

Code that processes Excel files is a real pain to get right because the 
Microsoft storage format is really bad.

 


> format-excel should use POI DataFormatter
> -
>
> Key: DRILL-8071
> URL: https://issues.apache.org/jira/browse/DRILL-8071
> Project: Apache Drill
> Issue Type: Improvement
> Components: Execution - Data Types
> Reporter: PJ Fanning
> Priority: Major
>
> The existing ExcelBatchReader uses the raw data values from the cells. This 
> raw data ignores formatting set on the cells. As an example, numbers and 
> dates are stored as doubles. With the POI DataFormatter, you can get the cell 
> style applied so that the data will appear as it does when you view the data 
> in Excel itself.
> [https://poi.apache.org/apidocs/dev/org/apache/poi/ss/usermodel/DataFormatter.html#formatCellValue-org.apache.poi.ss.usermodel.Cell-]
>  
> A big number like 123456789.987654 could be stored as a double that is 
> closer to 123456789.987653999 when written out in decimal (because that 
> might be the closest value a double can represent). The cell format could 
> say that the cell has 6 decimal places, so the formatter would round the 
> number back to the value that Excel displays.
> Even if you choose not to use the DataFormatter, you have unprotected calls 
> to `cell.getNumericCellValue()`, which could easily throw an exception (if 
> the data is not stored as a number). Even `cell.getStringCellValue()` can 
> throw an exception, for similar reasons.
>  
> Code that processes Excel files is a real pain to get right because the 
> Microsoft storage format is really bad.
>  





[jira] [Created] (DRILL-8071) format-excel should use POI DataFormatter

2021-12-07 Thread PJ Fanning (Jira)
PJ Fanning created DRILL-8071:
-

 Summary: format-excel should use POI DataFormatter
 Key: DRILL-8071
 URL: https://issues.apache.org/jira/browse/DRILL-8071
 Project: Apache Drill
  Issue Type: Improvement
  Components: Execution - Data Types
Reporter: PJ Fanning


The existing ExcelBatchReader uses the raw data values from the cells. This raw 
data ignores formatting set on the cells. As an example, numbers and dates are 
stored as doubles. With the POI DataFormatter, you can get the cell style 
applied so that the data will appear as it does when you view the data in Excel 
itself.

[https://poi.apache.org/apidocs/dev/org/apache/poi/ss/usermodel/DataFormatter.html#formatCellValue-org.apache.poi.ss.usermodel.Cell-]

 

A big number like 123456789.987654 could be stored as a double that is closer 
to 123456789.987653999 when written out in decimal (because that might be the 
closest value a double can represent). The cell format could say that the cell 
has 6 decimal places, so the formatter would round the number back to the 
value that Excel displays.

Code that processes Excel files is a real pain to get right because the 
Microsoft storage format is really bad.

 





[jira] [Commented] (DRILL-8070) format-excel assumes that rowIterator returns every row - it doesn't

2021-12-07 Thread PJ Fanning (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-8070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17454580#comment-17454580
 ] 

PJ Fanning commented on DRILL-8070:
---

[~cgivre] I spotted this when looking at DRILL-8069

> format-excel assumes that rowIterator returns every row - it doesn't
> 
>
> Key: DRILL-8070
> URL: https://issues.apache.org/jira/browse/DRILL-8070
> Project: Apache Drill
> Issue Type: Bug
> Components: Execution - Data Types
> Reporter: PJ Fanning
> Priority: Major
>
> In ExcelBatchReader, this code makes the wrong assumption:
> {code:java}
> for (int i = 1; i < rowNumber; i++) {
>   currentRow = rowIterator.next();
> }
> {code}
>  
> There are 2 for loops like this.
> Empty rows will not necessarily be returned by the iterator: rows without 
> populated cells can simply be skipped. Think of the Sheet as a sparse 
> matrix, because that is how it is stored.





[jira] [Updated] (DRILL-8070) format-excel assumes that rowIterator returns every row - it doesn't

2021-12-07 Thread PJ Fanning (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-8070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

PJ Fanning updated DRILL-8070:
--
Description: 
In ExcelBatchReader, this code makes the wrong assumption:
{code:java}
for (int i = 1; i < rowNumber; i++) {
  currentRow = rowIterator.next();
}
{code}
 
There are 2 for loops like this.

Empty rows will not necessarily be returned by the iterator: rows without 
populated cells can simply be skipped. Think of the Sheet as a sparse matrix, 
because that is how it is stored.
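
The effect can be sketched without POI by modelling the sheet as a `TreeMap` from row index to row content. The class `SparseRowSkip`, its methods, and the sample data are all hypothetical; the point is only that counting `next()` calls and comparing row indexes give different answers once some rows are missing. With POI, the index to compare against would come from `row.getRowNum()`.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical demo class; the TreeMap stands in for a sparsely stored sheet.
class SparseRowSkip {

    // Rows 1, 3 and 4 are empty, so they are simply absent from the map.
    static TreeMap<Integer, String> sampleRows() {
        TreeMap<Integer, String> rows = new TreeMap<>();
        rows.put(0, "header");
        rows.put(2, "a");
        rows.put(5, "b");
        return rows;
    }

    // Flawed approach: assumes one next() call per row index.
    static String skipByCounting(TreeMap<Integer, String> rows, int target) {
        Iterator<String> it = rows.values().iterator();
        String current = null;
        for (int i = 0; i <= target && it.hasNext(); i++) {
            current = it.next();
        }
        return current;
    }

    // Index-aware approach: compare each row's own index with the target.
    // Returns null when the target row is empty (not stored).
    static String skipByIndex(TreeMap<Integer, String> rows, int target) {
        for (Map.Entry<Integer, String> entry : rows.entrySet()) {
            int index = entry.getKey();
            if (index == target) {
                return entry.getValue();
            }
            if (index > target) {
                break; // walked past the target, so the row is empty
            }
        }
        return null;
    }

    public static void main(String[] args) {
        TreeMap<Integer, String> rows = sampleRows();
        System.out.println("counting, row 2: " + skipByCounting(rows, 2));
        System.out.println("by index, row 2: " + skipByIndex(rows, 2));
        System.out.println("by index, row 3: " + skipByIndex(rows, 3));
    }
}
```

With the sample data, the counting version asked for row 2 ends up on the row stored at index 5, while the index-aware version returns the row at index 2 and reports row 3 as empty.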

 

 

 

  was:
In ExcelBatchReader, this code makes the wrong assumption:

```

for (int i = 1; i < rowNumber; i++) {
  currentRow = rowIterator.next();
}

```

 

There are 2 for loops like this.

 

Empty rows will not necessarily be returned by the iterator: rows without 
populated cells can simply be skipped. Think of the Sheet as a sparse matrix, 
because that is how it is stored.


> format-excel assumes that rowIterator returns every row - it doesn't
> 
>
> Key: DRILL-8070
> URL: https://issues.apache.org/jira/browse/DRILL-8070
> Project: Apache Drill
> Issue Type: Bug
> Components: Execution - Data Types
> Reporter: PJ Fanning
> Priority: Major
>
> In ExcelBatchReader, this code makes the wrong assumption:
> {code:java}
> for (int i = 1; i < rowNumber; i++) {
>   currentRow = rowIterator.next();
> }
> {code}
>  
> There are 2 for loops like this.
> Empty rows will not necessarily be returned by the iterator: rows without 
> populated cells can simply be skipped. Think of the Sheet as a sparse 
> matrix, because that is how it is stored.





[jira] [Created] (DRILL-8070) format-excel assumes that rowIterator returns every row - it doesn't

2021-12-07 Thread PJ Fanning (Jira)
PJ Fanning created DRILL-8070:
-

 Summary: format-excel assumes that rowIterator returns every row - it doesn't
 Key: DRILL-8070
 URL: https://issues.apache.org/jira/browse/DRILL-8070
 Project: Apache Drill
  Issue Type: Bug
  Components: Execution - Data Types
Reporter: PJ Fanning


In ExcelBatchReader, this code makes the wrong assumption:

```

for (int i = 1; i < rowNumber; i++) {
  currentRow = rowIterator.next();
}

```

 

There are 2 for loops like this.

 

Empty rows will not necessarily be returned by the iterator: rows without 
populated cells can simply be skipped. Think of the Sheet as a sparse matrix, 
because that is how it is stored.




