[jira] [Closed] (DRILL-8443) upgrade netty to 4.1.94 due to CVE
[ https://issues.apache.org/jira/browse/DRILL-8443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] PJ Fanning closed DRILL-8443. - Resolution: Duplicate > upgrade netty to 4.1.94 due to CVE > -- > > Key: DRILL-8443 > URL: https://issues.apache.org/jira/browse/DRILL-8443 > Project: Apache Drill > Issue Type: Task > Components: Server >Reporter: PJ Fanning >Priority: Major > > https://github.com/apache/drill/security/dependabot/45 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (DRILL-8466) logback 1.3.14 (due to CVE)
[ https://issues.apache.org/jira/browse/DRILL-8466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] PJ Fanning updated DRILL-8466: -- Summary: logback 1.3.14 (due to CVE) (was: logback 1.3.13 (due to CVE)) > logback 1.3.14 (due to CVE) > --- > > Key: DRILL-8466 > URL: https://issues.apache.org/jira/browse/DRILL-8466 > Project: Apache Drill > Issue Type: Improvement > Components: Server >Reporter: PJ Fanning >Priority: Major > > https://github.com/advisories/GHSA-vmq6-5m68-f53m -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (DRILL-8466) logback 1.3.13 (due to CVE)
PJ Fanning created DRILL-8466: - Summary: logback 1.3.13 (due to CVE) Key: DRILL-8466 URL: https://issues.apache.org/jira/browse/DRILL-8466 Project: Apache Drill Issue Type: Improvement Components: Server Reporter: PJ Fanning https://github.com/advisories/GHSA-vmq6-5m68-f53m -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (DRILL-8465) check data input when loading iceberg data
PJ Fanning created DRILL-8465: - Summary: check data input when loading iceberg data Key: DRILL-8465 URL: https://issues.apache.org/jira/browse/DRILL-8465 Project: Apache Drill Issue Type: Improvement Components: Storage - Iceberg Reporter: PJ Fanning -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (DRILL-8464) GitHubActions: checkout action needs to be upgraded to v4 due to node16 deprecation
PJ Fanning created DRILL-8464: - Summary: GitHubActions: checkout action needs to be upgraded to v4 due to node16 deprecation Key: DRILL-8464 URL: https://issues.apache.org/jira/browse/DRILL-8464 Project: Apache Drill Issue Type: Improvement Reporter: PJ Fanning The following actions uses node12 which is deprecated and will be forced to run on node16: actions/checkout@v2. For more info: https://github.blog/changelog/2023-06-13-github-actions-all-actions-will-run-on-node16-instead-of-node12-by-default/ -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (DRILL-8463) upgrade to bouncy castle jdk1.8 jars
PJ Fanning created DRILL-8463: - Summary: upgrade to bouncy castle jdk1.8 jars Key: DRILL-8463 URL: https://issues.apache.org/jira/browse/DRILL-8463 Project: Apache Drill Issue Type: Improvement Reporter: PJ Fanning They have stopped releasing the jars that support JDK 1.5. This lib is important for security purposes. https://www.bouncycastle.org/latest_releases.html -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (DRILL-8462) upgrade to poi 5.2.5
[ https://issues.apache.org/jira/browse/DRILL-8462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] PJ Fanning updated DRILL-8462: -- Description: Includes some regression fixes but these probably don't affect Drill usage. https://poi.apache.org/changes.html was:Includes some regression fixes but these probably don't affect Drill usage. > upgrade to poi 5.2.5 > > > Key: DRILL-8462 > URL: https://issues.apache.org/jira/browse/DRILL-8462 > Project: Apache Drill > Issue Type: Improvement >Reporter: PJ Fanning >Priority: Major > > Includes some regression fixes but these probably don't affect Drill usage. > https://poi.apache.org/changes.html -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (DRILL-8462) upgrade to poi 5.2.5
PJ Fanning created DRILL-8462: - Summary: upgrade to poi 5.2.5 Key: DRILL-8462 URL: https://issues.apache.org/jira/browse/DRILL-8462 Project: Apache Drill Issue Type: Improvement Reporter: PJ Fanning Includes some regression fixes but these probably don't affect Drill usage. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Reopened] (DRILL-8460) Bump zookeeper jar to 3.7.2 due to CVE
[ https://issues.apache.org/jira/browse/DRILL-8460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] PJ Fanning reopened DRILL-8460: --- Assignee: (was: PJ Fanning) This is not fixed. The CI build had some test failures that indicate that we may not be able to upgrade. > Bump zookeeper jar to 3.7.2 due to CVE > -- > > Key: DRILL-8460 > URL: https://issues.apache.org/jira/browse/DRILL-8460 > Project: Apache Drill > Issue Type: Sub-task >Affects Versions: 1.21.1 >Reporter: PJ Fanning >Priority: Major > Fix For: 1.22.0 > > > https://github.com/apache/drill/security/dependabot/51 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (DRILL-8460) bump zookeeper jar to 3.7.2 due to cve
[ https://issues.apache.org/jira/browse/DRILL-8460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] PJ Fanning updated DRILL-8460: -- Parent: DRILL-8452 Issue Type: Sub-task (was: Improvement) > bump zookeeper jar to 3.7.2 due to cve > -- > > Key: DRILL-8460 > URL: https://issues.apache.org/jira/browse/DRILL-8460 > Project: Apache Drill > Issue Type: Sub-task >Reporter: PJ Fanning >Priority: Major > > https://github.com/apache/drill/security/dependabot/51 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (DRILL-8460) bump zookeeper jar to 3.7.2 due to cve
PJ Fanning created DRILL-8460: - Summary: bump zookeeper jar to 3.7.2 due to cve Key: DRILL-8460 URL: https://issues.apache.org/jira/browse/DRILL-8460 Project: Apache Drill Issue Type: Improvement Reporter: PJ Fanning https://github.com/apache/drill/security/dependabot/51 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (DRILL-8459) bump avro to 1.11.3 due to cve
[ https://issues.apache.org/jira/browse/DRILL-8459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] PJ Fanning updated DRILL-8459: -- Parent: DRILL-8452 Issue Type: Sub-task (was: Improvement) > bump avro to 1.11.3 due to cve > -- > > Key: DRILL-8459 > URL: https://issues.apache.org/jira/browse/DRILL-8459 > Project: Apache Drill > Issue Type: Sub-task >Reporter: PJ Fanning >Priority: Major > > https://github.com/apache/drill/security/dependabot/49 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (DRILL-8459) bump avro to 1.11.3 due to cve
PJ Fanning created DRILL-8459: - Summary: bump avro to 1.11.3 due to cve Key: DRILL-8459 URL: https://issues.apache.org/jira/browse/DRILL-8459 Project: Apache Drill Issue Type: Improvement Reporter: PJ Fanning https://github.com/apache/drill/security/dependabot/49 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (DRILL-8456) uptake POI 5.2.4
PJ Fanning created DRILL-8456: - Summary: uptake POI 5.2.4 Key: DRILL-8456 URL: https://issues.apache.org/jira/browse/DRILL-8456 Project: Apache Drill Issue Type: Improvement Reporter: PJ Fanning latest release with some transitive dependencies having security patches -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (DRILL-8445) Upgrade Janino
PJ Fanning created DRILL-8445: - Summary: Upgrade Janino Key: DRILL-8445 URL: https://issues.apache.org/jira/browse/DRILL-8445 Project: Apache Drill Issue Type: Task Components: Server Reporter: PJ Fanning I'm not familiar with exactly how Janino is used inside Drill. There is a new 3.1.10 release today to fix [https://github.com/janino-compiler/janino/issues/201] This may be an issue if Janino is used to parse input that may not be entirely trustworthy. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (DRILL-8443) upgrade netty to 4.1.94 due to CVE
PJ Fanning created DRILL-8443: - Summary: upgrade netty to 4.1.94 due to CVE Key: DRILL-8443 URL: https://issues.apache.org/jira/browse/DRILL-8443 Project: Apache Drill Issue Type: Task Components: Server Reporter: PJ Fanning https://github.com/apache/drill/security/dependabot/45 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (DRILL-8431) add immutable wrapper for ObjectMapper
PJ Fanning created DRILL-8431: - Summary: add immutable wrapper for ObjectMapper Key: DRILL-8431 URL: https://issues.apache.org/jira/browse/DRILL-8431 Project: Apache Drill Issue Type: Task Components: Server Reporter: PJ Fanning The Jackson based code in Drill is quite complicated and passes around ObjectMapper instances in a way that is difficult to maintain. We need to balance the objective of reusing ObjectMapper instances (because they are fairly expensive to create) against the risk that code modifies an ObjectMapper instance (extra config or extra modules added) in a way that affects other code that uses the same instance. Jackson 3 (which is under development but a long way off) moves towards making ObjectMappers immutable. Mapper Builders are used instead to configure mappers. Some of these API changes are already backported to Jackson 2. My suggestion in this Jira is that we create a new Drill class called ImmutableObjectMapper that exposes API methods for reading and writing JSON but hides the methods for configuring the mapper. We can wrap some of our ObjectMappers. It will probably take a few iterations to get everything switched over but we can start with the low hanging fruit. This class would allow the Java compiler to check for any untidy attempts to modify an ObjectMapper that was created elsewhere. -- This message was sent by Atlassian Jira (v8.20.10#820010)
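The wrapper idea in DRILL-8431 can be sketched roughly as follows. To keep the sketch runnable without Jackson on the classpath, a toy class `ToyMapper` stands in for a mutable `ObjectMapper`; both `ToyMapper` and `ImmutableToyMapper` are hypothetical names invented for this illustration and are not Drill or Jackson classes. The point is only the shape: the wrapper exposes read and write methods, takes a defensive copy, and offers no configuration methods, so the compiler rejects any attempt to reconfigure a shared instance.

```java
import java.util.Objects;

// Hypothetical stand-in for a mutable mapper such as Jackson's ObjectMapper.
final class ToyMapper {
    private boolean quoteStrings = true;

    // Configuration method: the kind of call the wrapper is meant to hide.
    ToyMapper setQuoteStrings(boolean quoteStrings) {
        this.quoteStrings = quoteStrings;
        return this;
    }

    // "Serialises" a string, optionally wrapping it in double quotes.
    String write(String value) {
        return quoteStrings ? "\"" + value + "\"" : value;
    }

    // "Deserialises" a string by stripping one pair of surrounding quotes, if present.
    String read(String text) {
        if (text.length() >= 2 && text.startsWith("\"") && text.endsWith("\"")) {
            return text.substring(1, text.length() - 1);
        }
        return text;
    }

    // Returns an independent mapper with the same settings.
    ToyMapper copy() {
        return new ToyMapper().setQuoteStrings(quoteStrings);
    }
}

// Hypothetical wrapper: read/write only, no way to change the delegate's settings.
final class ImmutableToyMapper {
    private final ToyMapper delegate;

    ImmutableToyMapper(ToyMapper configured) {
        // Defensive copy, so later changes to the caller's instance do not leak in.
        this.delegate = Objects.requireNonNull(configured).copy();
    }

    String write(String value) {
        return delegate.write(value);
    }

    String read(String text) {
        return delegate.read(text);
    }
}
```

With this shape, code that holds an `ImmutableToyMapper` can be shared freely, while code that really needs different settings has to build its own mapper rather than mutate one created elsewhere.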
[jira] [Created] (DRILL-8430) add factory method for creating Jackson ObjectMappers
PJ Fanning created DRILL-8430: - Summary: add factory method for creating Jackson ObjectMappers Key: DRILL-8430 URL: https://issues.apache.org/jira/browse/DRILL-8430 Project: Apache Drill Issue Type: Task Components: Server Reporter: PJ Fanning See https://issues.apache.org/jira/browse/DRILL-8415 It's useful to keep any customisation of the ObjectMapper creation in 1 place -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (DRILL-8429) jackson 2.14.3
PJ Fanning created DRILL-8429: - Summary: jackson 2.14.3 Key: DRILL-8429 URL: https://issues.apache.org/jira/browse/DRILL-8429 Project: Apache Drill Issue Type: Task Components: Server Reporter: PJ Fanning Jackson 2.14.3 has perf and security hardening improvements https://github.com/FasterXML/jackson/wiki/Jackson-Release-2.14.3 prelude to DRILL-8415 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (DRILL-8415) Jackson 2.15
[ https://issues.apache.org/jira/browse/DRILL-8415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17719976#comment-17719976 ] PJ Fanning commented on DRILL-8415: --- [~cgivre] [~dzamo] would it be a good idea to create a factory method in drill-common for creating ObjectMappers. It would be a good way of centralising the logic about creating and configuring these mappers. `new ObjectMapper()` has the problem of relying on the default settings for everything. > Jackson 2.15 > > > Key: DRILL-8415 > URL: https://issues.apache.org/jira/browse/DRILL-8415 > Project: Apache Drill > Issue Type: Improvement >Reporter: PJ Fanning >Priority: Major > > I'm not advocating for an upgrade to [Jackson > 2.15|https://github.com/FasterXML/jackson/wiki/Jackson-Release-2.15]. > 2.15.0-rc1 has just been released and 2.15.0 should be out soon. > There are some security focused enhancements including a new class called > StreamReadConstraints. The defaults on > [StreamReadConstraints|https://javadoc.io/static/com.fasterxml.jackson.core/jackson-core/2.15.0-rc1/com/fasterxml/jackson/core/StreamReadConstraints.html] > are pretty high but it is not inconceivable that some Drill users might need > to relax them. Parsing large strings as numbers is sub-quadratic, thus the > default limit of 1000 chars or bytes (depending on input context). > When the Drill team consider upgrading to Jackson 2.15 or above, you might > also want to consider adding some way for users to configure the > StreamReadConstraints. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (DRILL-8415) Jackson 2.15
PJ Fanning created DRILL-8415: - Summary: Jackson 2.15 Key: DRILL-8415 URL: https://issues.apache.org/jira/browse/DRILL-8415 Project: Apache Drill Issue Type: Improvement Reporter: PJ Fanning I'm not advocating for an upgrade to [Jackson 2.15|https://github.com/FasterXML/jackson/wiki/Jackson-Release-2.15]. 2.15.0-rc1 has just been released and 2.15.0 should be out soon. There are some security focused enhancements including a new class called StreamReadConstraints. The defaults on [StreamReadConstraints|https://javadoc.io/static/com.fasterxml.jackson.core/jackson-core/2.15.0-rc1/com/fasterxml/jackson/core/StreamReadConstraints.html] are pretty high but it is not inconceivable that some Drill users might need to relax them. Parsing large strings as numbers is sub-quadratic, thus the default limit of 1000 chars or bytes (depending on input context). When the Drill team consider upgrading to Jackson 2.15 or above, you might also want to consider adding some way for users to configure the StreamReadConstraints. -- This message was sent by Atlassian Jira (v8.20.10#820010)
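The reasoning behind a length limit on numeric input can be illustrated with a small guard that rejects overly long digit strings before any expensive parsing happens. This is only a sketch of the general idea, not Jackson's implementation and not the StreamReadConstraints API; the class `BoundedNumberParser` and its method are hypothetical, and the value 1000 is taken from the default quoted in the issue text.

```java
import java.math.BigInteger;

// Hypothetical helper illustrating a length check ahead of big-number parsing.
final class BoundedNumberParser {
    // Mirrors the default limit of 1000 chars mentioned in the issue description.
    static final int DEFAULT_MAX_NUMBER_LENGTH = 1000;

    // Parses decimal digits into a BigInteger, refusing input longer than maxLength.
    static BigInteger parse(String digits, int maxLength) {
        if (digits.length() > maxLength) {
            throw new IllegalArgumentException(
                "Number length " + digits.length() + " exceeds limit " + maxLength);
        }
        return new BigInteger(digits);
    }
}
```

A user-facing Drill option for relaxing the limit would, under this sketch, amount to passing a larger `maxLength`; how that would be exposed in Drill is left open by the issue.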
[jira] [Created] (DRILL-8405) upgrade to snakeyaml 2.0 due to cve
PJ Fanning created DRILL-8405: - Summary: upgrade to snakeyaml 2.0 due to cve Key: DRILL-8405 URL: https://issues.apache.org/jira/browse/DRILL-8405 Project: Apache Drill Issue Type: Task Reporter: PJ Fanning https://bitbucket.org/snakeyaml/snakeyaml/issues/561/cve-2022-1471-vulnerability-in -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (DRILL-8363) upgrade postgresql to 42.4.3 due to security issue
PJ Fanning created DRILL-8363: - Summary: upgrade postgresql to 42.4.3 due to security issue Key: DRILL-8363 URL: https://issues.apache.org/jira/browse/DRILL-8363 Project: Apache Drill Issue Type: Task Components: Storage - JDBC Reporter: PJ Fanning https://github.com/advisories/GHSA-562r-vg33-8x8h -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (DRILL-8362) upgrade excel-streaming-reader v4.0.5
PJ Fanning created DRILL-8362: - Summary: upgrade excel-streaming-reader v4.0.5 Key: DRILL-8362 URL: https://issues.apache.org/jira/browse/DRILL-8362 Project: Apache Drill Issue Type: Task Reporter: PJ Fanning A few small issues have been fixed -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (DRILL-8343) Upgrade Commons Text to 1.10.0
[ https://issues.apache.org/jira/browse/DRILL-8343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17623066#comment-17623066 ] PJ Fanning commented on DRILL-8343: --- Duplicate of DRILL-8323 > Upgrade Commons Text to 1.10.0 > -- > > Key: DRILL-8343 > URL: https://issues.apache.org/jira/browse/DRILL-8343 > Project: Apache Drill > Issue Type: Bug >Reporter: Jason-Morries Adam >Priority: Critical > > Apache Commons Text versions prior to 1.10.0 are vulnerable to > [CVE-2022-42889|https://nvd.nist.gov/vuln/detail/CVE-2022-42889], which > involves potential script execution when processing untrusted input using > {{{}StringLookup{}}}. Direct and transitive references to Apache Commons Text > prior to 1.10.0 should be upgraded to avoid the default interpolation > behavior. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (DRILL-8334) upgrade to okhttp 4.10.0 due to CVEs in kotlin transitive dependencies
PJ Fanning created DRILL-8334: - Summary: upgrade to okhttp 4.10.0 due to CVEs in kotlin transitive dependencies Key: DRILL-8334 URL: https://issues.apache.org/jira/browse/DRILL-8334 Project: Apache Drill Issue Type: Improvement Reporter: PJ Fanning [https://mvnrepository.com/artifact/com.squareup.okhttp3/okhttp] It's a fairly minor bump from 4.9.3 to 4.10.0. okhttp 4.10.0 uses a newer copy of kotlin-stdlib that doesn't have CVEs -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (DRILL-8332) upgrade to jackson 2.13.4.20221013
[ https://issues.apache.org/jira/browse/DRILL-8332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] PJ Fanning updated DRILL-8332: -- Description: * [https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-42003] * [https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-42004] * both fixes have been backported (the CVEs themselves need to be updated to reflect this) There was a gradle module issue in 2.13.4.20221012 so upgrading to 2.13.4.20221013 was: * [https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-42003] * [https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-42004] * both fixes have been backported (the CVEs themselves need to be updated to reflect this) > upgrade to jackson 2.13.4.20221013 > -- > > Key: DRILL-8332 > URL: https://issues.apache.org/jira/browse/DRILL-8332 > Project: Apache Drill > Issue Type: Improvement >Reporter: PJ Fanning >Priority: Major > > * [https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-42003] > * [https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-42004] > * both fixes have been backported (the CVEs themselves need to be updated to > reflect this) > There was a gradle module issue in 2.13.4.20221012 so upgrading to > 2.13.4.20221013 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (DRILL-8332) upgrade to jackson 2.13.4.20221013
[ https://issues.apache.org/jira/browse/DRILL-8332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] PJ Fanning updated DRILL-8332: -- Summary: upgrade to jackson 2.13.4.20221013 (was: upgrade to jackson 2.13.4.20221012) > upgrade to jackson 2.13.4.20221013 > -- > > Key: DRILL-8332 > URL: https://issues.apache.org/jira/browse/DRILL-8332 > Project: Apache Drill > Issue Type: Improvement >Reporter: PJ Fanning >Priority: Major > > * [https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-42003] > * [https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-42004] > * both fixes have been backported (the CVEs themselves need to be updated to > reflect this) -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (DRILL-8332) upgrade to jackson 2.13.4.20221012
PJ Fanning created DRILL-8332: - Summary: upgrade to jackson 2.13.4.20221012 Key: DRILL-8332 URL: https://issues.apache.org/jira/browse/DRILL-8332 Project: Apache Drill Issue Type: Improvement Reporter: PJ Fanning * [https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-42003] * [https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-42004] * both fixes have been backported (the CVEs themselves need to be updated to reflect this) -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (DRILL-8326) snakeyaml 1.33
PJ Fanning created DRILL-8326: - Summary: snakeyaml 1.33 Key: DRILL-8326 URL: https://issues.apache.org/jira/browse/DRILL-8326 Project: Apache Drill Issue Type: Improvement Reporter: PJ Fanning [https://bitbucket.org/snakeyaml/snakeyaml/wiki/Changes] – fixes bug in code point limit added in 1.32 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (DRILL-8321) Change kafka_2.13 dependency scope to test
[ https://issues.apache.org/jira/browse/DRILL-8321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17611539#comment-17611539 ] PJ Fanning commented on DRILL-8321: --- I opened an issue and PR at https://issues.apache.org/jira/browse/DRILL-8324 > Change kafka_2.13 dependency scope to test > --- > > Key: DRILL-8321 > URL: https://issues.apache.org/jira/browse/DRILL-8321 > Project: Apache Drill > Issue Type: Task >Affects Versions: 1.20.2 >Reporter: Maksym Rymar >Assignee: Maksym Rymar >Priority: Minor > Fix For: 1.20.3 > > > Drill has 2 scala dependencies: > * {{org.apache.kafka.kafka_2.13}} > * {{com.madhukaraphatak.java-sizeof_2.11}} > which target different Scala versions, 2.13 and 2.11. But Scala has no > backward compatibility for major releases, so we can’t have 2 libraries > compiled against different versions of Scala. > To solve the issue there are only 2 ways: > # Compile both libraries on the same major Scala version. > # Remove one of the libraries from Drill > {{kafka_2.13}} is a server-side (Kafka’s server side) dependency and is > unnecessary on the client side (Drill). Probably, it was added carelessly to > Drill in compile scope, while it is necessary only in test scope. > So {{kafka_2.13}} can be removed from compile scope. It will reduce the Drill > package size and, most importantly, it will resolve the Scala version conflict. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (DRILL-8323) upgrade commons-text to 1.10.0
[ https://issues.apache.org/jira/browse/DRILL-8323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] PJ Fanning updated DRILL-8323: -- Description: [https://commons.apache.org/proper/commons-text/changes-report.html#a1.10.0] https://issues.apache.org/jira/browse/TEXT-191 affects one of our tests - I have fixed the test in my PR - the old expected value was wrong due to TEXT-191 bug was:https://commons.apache.org/proper/commons-text/changes-report.html#a1.10.0 > upgrade commons-text to 1.10.0 > -- > > Key: DRILL-8323 > URL: https://issues.apache.org/jira/browse/DRILL-8323 > Project: Apache Drill > Issue Type: Improvement >Reporter: PJ Fanning >Priority: Major > > [https://commons.apache.org/proper/commons-text/changes-report.html#a1.10.0] > https://issues.apache.org/jira/browse/TEXT-191 affects one of our tests - I > have fixed the test in my PR - the old expected value was wrong due to > TEXT-191 bug -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (DRILL-8324) remove dependency on java-sizeof jar
PJ Fanning created DRILL-8324: - Summary: remove dependency on java-sizeof jar Key: DRILL-8324 URL: https://issues.apache.org/jira/browse/DRILL-8324 Project: Apache Drill Issue Type: Improvement Reporter: PJ Fanning [https://github.com/phatak-dev/java-sizeof] is not maintained and ties us to a very old version of Scala. It looks like it should be easy to rewrite the code in Java and have it in Drill itself. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (DRILL-8323) upgrade commons-text to 1.10.0
PJ Fanning created DRILL-8323: - Summary: upgrade commons-text to 1.10.0 Key: DRILL-8323 URL: https://issues.apache.org/jira/browse/DRILL-8323 Project: Apache Drill Issue Type: Improvement Reporter: PJ Fanning https://commons.apache.org/proper/commons-text/changes-report.html#a1.10.0 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (DRILL-8321) Change kafka_2.13 dependency scope to test
[ https://issues.apache.org/jira/browse/DRILL-8321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17611053#comment-17611053 ] PJ Fanning commented on DRILL-8321: --- Any chance we can drop [https://github.com/phatak-dev/java-sizeof] ? If it isn't published for recent scala versions, it is a real millstone around our necks. > Change kafka_2.13 dependency scope to test > --- > > Key: DRILL-8321 > URL: https://issues.apache.org/jira/browse/DRILL-8321 > Project: Apache Drill > Issue Type: Task >Affects Versions: 1.20.2 >Reporter: Maksym Rymar >Assignee: Maksym Rymar >Priority: Minor > Fix For: 2.0.0 > > > Drill has 2 scala dependencies: > * {{org.apache.kafka.kafka_2.13}} > * {{com.madhukaraphatak.java-sizeof_2.11}} > which target different Scala versions, 2.13 and 2.11. But Scala has no > backward compatibility for major releases, so we can’t have 2 libraries > compiled against different versions of Scala. > To solve the issue there are only 2 ways: > # Compile both libraries on the same major Scala version. > # Remove one of the libraries from Drill > {{kafka_2.13}} is a server-side (Kafka’s server side) dependency and is > unnecessary on the client side (Drill). Probably, it was added carelessly to > Drill in compile scope, while it is necessary only in test scope. > So {{kafka_2.13}} can be removed from compile scope. It will reduce the Drill > package size and, most importantly, it will resolve the Scala version conflict. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (DRILL-7878) Fix LGTM Alerts
[ https://issues.apache.org/jira/browse/DRILL-7878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17611051#comment-17611051 ] PJ Fanning commented on DRILL-7878: --- [~yaybeNo] can this be closed? lgtm is closing down and many of the issues have been dealt with anyway > Fix LGTM Alerts > --- > > Key: DRILL-7878 > URL: https://issues.apache.org/jira/browse/DRILL-7878 > Project: Apache Drill > Issue Type: Improvement >Reporter: Evan Wong >Priority: Major > > Try and deal with all alerts from LGTM badge > [https://lgtm.com/projects/g/apache/drill/alerts/?mode=list] > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (DRILL-8313) Introduce configuration for yaml parsing to override the default max file size
[ https://issues.apache.org/jira/browse/DRILL-8313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] PJ Fanning closed DRILL-8313. - Resolution: Won't Fix Thanks [~dzamo] - I'll close this > Introduce configuration for yaml parsing to override the default max file size > -- > > Key: DRILL-8313 > URL: https://issues.apache.org/jira/browse/DRILL-8313 > Project: Apache Drill > Issue Type: Improvement >Reporter: PJ Fanning >Priority: Major > > snakeyaml 1.32 brings in a default limit of 3Mb when parsing yaml files. > Need to allow users to specify another value if they need to. > [https://bitbucket.org/snakeyaml/snakeyaml/src/72dfa9f1074abe2b8a6c8776bee4476b0aed02e3/src/main/java/org/yaml/snakeyaml/LoaderOptions.java] > I only became aware of this issue in the last few hours. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (DRILL-8313) introduce configuration for yaml parsing to override the default max file size
[ https://issues.apache.org/jira/browse/DRILL-8313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607120#comment-17607120 ] PJ Fanning commented on DRILL-8313: --- [~dzamo] [~cgivre] When searching Drill code, I can find no direct use of snakeyaml in Drill. The only thing I found was: drill-rdbms-metastore/pom.xml {code:java} <groupId>org.yaml</groupId> <artifactId>snakeyaml</artifactId> {code} Do you think we need to worry about yaml files that are larger than 3Mb here? > introduce configuration for yaml parsing to override the default max file size > -- > > Key: DRILL-8313 > URL: https://issues.apache.org/jira/browse/DRILL-8313 > Project: Apache Drill > Issue Type: Improvement >Reporter: PJ Fanning >Priority: Major > > snakeyaml 1.32 brings in a default limit of 3Mb when parsing yaml files. > Need to allow users to specify another value if they need to. > [https://bitbucket.org/snakeyaml/snakeyaml/src/72dfa9f1074abe2b8a6c8776bee4476b0aed02e3/src/main/java/org/yaml/snakeyaml/LoaderOptions.java] > I only became aware of this issue in the last few hours. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (DRILL-8313) introduce configuration for yaml parsing to override the default max file size
PJ Fanning created DRILL-8313: - Summary: introduce configuration for yaml parsing to override the default max file size Key: DRILL-8313 URL: https://issues.apache.org/jira/browse/DRILL-8313 Project: Apache Drill Issue Type: Improvement Reporter: PJ Fanning snakeyaml 1.32 brings in a default limit of 3Mb when parsing yaml files. Need to allow users to specify another value if they need to. [https://bitbucket.org/snakeyaml/snakeyaml/src/72dfa9f1074abe2b8a6c8776bee4476b0aed02e3/src/main/java/org/yaml/snakeyaml/LoaderOptions.java] I only became aware of this issue in the last few hours. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (DRILL-8309) uptake slf4j 2.0.1
PJ Fanning created DRILL-8309: - Summary: uptake slf4j 2.0.1 Key: DRILL-8309 URL: https://issues.apache.org/jira/browse/DRILL-8309 Project: Apache Drill Issue Type: Improvement Reporter: PJ Fanning log4j 2.19.0 and logback 2.19.0 support slf4j 2.0.1 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (DRILL-8308) uptake POI 5.2.3
PJ Fanning created DRILL-8308: - Summary: uptake POI 5.2.3 Key: DRILL-8308 URL: https://issues.apache.org/jira/browse/DRILL-8308 Project: Apache Drill Issue Type: Improvement Components: Storage - Other Affects Versions: 2.0.0 Reporter: PJ Fanning https://poi.apache.org/changes.html -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (DRILL-8300) Upgrade to snakeyaml 1.32 due to cve
[ https://issues.apache.org/jira/browse/DRILL-8300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17603783#comment-17603783 ] PJ Fanning commented on DRILL-8300: --- Another release - maybe another CVE - unclear from release notes [https://bitbucket.org/snakeyaml/snakeyaml/wiki/Changes] [https://bitbucket.org/snakeyaml/snakeyaml/issues/547/restrict-the-size-of-incoming-data] > Upgrade to snakeyaml 1.32 due to cve > > > Key: DRILL-8300 > URL: https://issues.apache.org/jira/browse/DRILL-8300 > Project: Apache Drill > Issue Type: Bug >Reporter: PJ Fanning >Priority: Major > > https://github.com/advisories/GHSA-3mc7-4q67-w48m -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (DRILL-8300) Upgrade to snakeyaml 1.32 due to cve
[ https://issues.apache.org/jira/browse/DRILL-8300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] PJ Fanning updated DRILL-8300: -- Environment: (was: Another release - maybe another CVE - unclear from release notes [https://bitbucket.org/snakeyaml/snakeyaml/wiki/Changes] [https://bitbucket.org/snakeyaml/snakeyaml/issues/547/restrict-the-size-of-incoming-data]) > Upgrade to snakeyaml 1.32 due to cve > > > Key: DRILL-8300 > URL: https://issues.apache.org/jira/browse/DRILL-8300 > Project: Apache Drill > Issue Type: Bug >Reporter: PJ Fanning >Priority: Major > > https://github.com/advisories/GHSA-3mc7-4q67-w48m -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (DRILL-8300) Upgrade to snakeyaml 1.32 due to cve
[ https://issues.apache.org/jira/browse/DRILL-8300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] PJ Fanning updated DRILL-8300: -- Environment: Another release - maybe another CVE - unclear from release notes [https://bitbucket.org/snakeyaml/snakeyaml/wiki/Changes] [https://bitbucket.org/snakeyaml/snakeyaml/issues/547/restrict-the-size-of-incoming-data] Summary: Upgrade to snakeyaml 1.32 due to cve (was: Upgrade to snakeyaml 1.31 due to cve) > Upgrade to snakeyaml 1.32 due to cve > > > Key: DRILL-8300 > URL: https://issues.apache.org/jira/browse/DRILL-8300 > Project: Apache Drill > Issue Type: Bug > Environment: Another release - maybe another CVE - unclear from > release notes > [https://bitbucket.org/snakeyaml/snakeyaml/wiki/Changes] > [https://bitbucket.org/snakeyaml/snakeyaml/issues/547/restrict-the-size-of-incoming-data] >Reporter: PJ Fanning >Priority: Major > > https://github.com/advisories/GHSA-3mc7-4q67-w48m -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (DRILL-8304) Update Calcite to 1.32
[ https://issues.apache.org/jira/browse/DRILL-8304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17602711#comment-17602711 ] PJ Fanning commented on DRILL-8304: --- Includes a CVE fix - [https://calcite.apache.org/docs/history.html] [CVE-2022-39135|http://cve.mitre.org/cgi-bin/cvename.cgi?name=2022-39135] > Update Calcite to 1.32 > -- > > Key: DRILL-8304 > URL: https://issues.apache.org/jira/browse/DRILL-8304 > Project: Apache Drill > Issue Type: Task >Reporter: Vova Vysotskyi >Assignee: Vova Vysotskyi >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (DRILL-8301) Standardise on UTF-8 encoding for char to byte (and vice versa) conversions
[ https://issues.apache.org/jira/browse/DRILL-8301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17601752#comment-17601752 ] PJ Fanning commented on DRILL-8301: --- See https://github.com/apache/drill/pull/2637 > Standardise on UTF-8 encoding for char to byte (and vice versa) conversions > --- > > Key: DRILL-8301 > URL: https://issues.apache.org/jira/browse/DRILL-8301 > Project: Apache Drill > Issue Type: Improvement >Reporter: PJ Fanning >Priority: Major > > Lots of Drill code uses UTF-8 explicitly. Lots more Drill code does not set > an explicit encoding which means it relies on the JVM default (which differs > by JVM install). -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (DRILL-8302) tidy up some char conversions
[ https://issues.apache.org/jira/browse/DRILL-8302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] PJ Fanning updated DRILL-8302: -- Description: As part of DRILL-8301, I spotted code that could be tidied up. The aim of this issue is to reduce the size of DRILL-8301 without introducing changes to the char encodings. * uses of a pattern like `new String("")` - IntelliJ and other tools highlight this as unnecessary * uses of `new String(bytes, StandardCharsets.UTF_8.name())` - better to use `new String(bytes, StandardCharsets.UTF_8)` * use Base64 encodeToString instead of case where we encode to bytes and then do our own encoding of those bytes to a String * Change existing code with `Charset.forName("UTF-8")` to use `StandardCharsets.UTF_8` was: As part of DRILL-8301, I spotted code that could be tidied up. The aim of this issue is to reduce the size of DRILL-8301 without introducing changes to the char encodings. * uses of a pattern like `new String("")` - IntelliJ and other tools highlight this as unnecessary * uses of `new String(bytes, StandardCharsets.UTF_8.name())` - better to use `new String(bytes, StandardCharsets.UTF_8)` * use Base64 encodeToString instead of case where we encode to bytes and then do our own encoding of those bytes to a String * Replace existing code with `Charset.forName("UTF-8")` to use `StandardCharsets.UTF_8` > tidy up some char conversions > - > > Key: DRILL-8302 > URL: https://issues.apache.org/jira/browse/DRILL-8302 > Project: Apache Drill > Issue Type: Improvement >Reporter: PJ Fanning >Priority: Major > > As part of DRILL-8301, I spotted code that could be tidied up. The aim of > this issue is to reduce the size of DRILL-8301 without introducing changes to > the char encodings. 
> * uses of a pattern like `new String("")` - IntelliJ and other tools > highlight this as unnecessary > * uses of `new String(bytes, StandardCharsets.UTF_8.name())` - better to use > `new String(bytes, StandardCharsets.UTF_8)` > * use Base64 encodeToString instead of case where we encode to bytes and > then do our own encoding of those bytes to a String > * Change existing code with `Charset.forName("UTF-8")` to use > `StandardCharsets.UTF_8` -- This message was sent by Atlassian Jira (v8.20.10#820010)
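The tidy-ups listed in DRILL-8302 can be sketched as below. This is a minimal illustration, not the code from the Drill pull request: the class and method names are made up for the example, and it assumes the `Base64` mentioned in the ticket is `java.util.Base64` (the ticket does not say which Base64 class Drill uses).

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical example class; none of these names come from the Drill codebase.
public class CharConversionTidyUp {

    // Before: new String(bytes, StandardCharsets.UTF_8.name()) looks the charset up
    // by name and declares the checked UnsupportedEncodingException.
    // After: pass the Charset constant directly; no checked exception is involved.
    public static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Before: encode to bytes, then convert those bytes to a String by hand.
    // After: let the encoder return the String in one call.
    public static String toBase64(byte[] raw) {
        return Base64.getEncoder().encodeToString(raw);
    }

    public static void main(String[] args) {
        // Before: Charset.forName("UTF-8"); after: the StandardCharsets.UTF_8 constant.
        byte[] bytes = "drill".getBytes(StandardCharsets.UTF_8);
        System.out.println(decodeUtf8(bytes));
        System.out.println(toBase64(bytes));
    }
}
```

None of these changes alters which encoding is used, which matches the stated aim of keeping DRILL-8302 free of char-encoding changes.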
[jira] [Created] (DRILL-8302) tidy up some char conversions
PJ Fanning created DRILL-8302: - Summary: tidy up some char conversions Key: DRILL-8302 URL: https://issues.apache.org/jira/browse/DRILL-8302 Project: Apache Drill Issue Type: Improvement Reporter: PJ Fanning As part of DRILL-8301, I spotted code that could be tidied up. The aim of this issue is to reduce the size of DRILL-8301 without introducing changes to the char encodings. * uses of a pattern like `new String("")` - IntelliJ and other tools highlight this as unnecessary * uses of `new String(bytes, StandardCharsets.UTF_8.name())` - better to use `new String(bytes, StandardCharsets.UTF_8)` * use Base64 encodeToString instead of case where we encode to bytes and then do our own encoding of those bytes to a String * Replace existing code with `Charset.forName("UTF-8")` to use `StandardCharsets.UTF_8` -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (DRILL-8301) Standardise on UTF-8 encoding for char to byte (and vice versa) conversions
PJ Fanning created DRILL-8301: - Summary: Standardise on UTF-8 encoding for char to byte (and vice versa) conversions Key: DRILL-8301 URL: https://issues.apache.org/jira/browse/DRILL-8301 Project: Apache Drill Issue Type: Improvement Reporter: PJ Fanning Lots of Drill code uses UTF-8 explicitly. Lots more Drill code does not set an explicit encoding which means it relies on the JVM default (which differs by JVM install). -- This message was sent by Atlassian Jira (v8.20.10#820010)
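To illustrate the concern in DRILL-8301, here is a small sketch contrasting a conversion that relies on the JVM default charset with one that names UTF-8 explicitly. The class and method names are hypothetical and are not taken from Drill.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical example class, not Drill code.
public class ExplicitCharsetDemo {

    // Uses the JVM default charset, so the byte count can differ between installs.
    public static byte[] implicitBytes(String s) {
        return s.getBytes();
    }

    // Names the charset, so the result is the same on every JVM.
    public static byte[] utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String s = "\u00e9"; // 'e' with an acute accent: 2 bytes in UTF-8, 1 in ISO-8859-1
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("implicit length: " + implicitBytes(s).length);
        System.out.println("UTF-8 length:    " + utf8Bytes(s).length);
    }
}
```

On a JVM whose default is a single-byte charset the two lengths differ, which is the kind of install-dependent behaviour the ticket wants to remove.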
[jira] [Created] (DRILL-8300) upgrade to snakeyaml 1.31 due to cve
PJ Fanning created DRILL-8300: - Summary: upgrade to snakeyaml 1.31 due to cve Key: DRILL-8300 URL: https://issues.apache.org/jira/browse/DRILL-8300 Project: Apache Drill Issue Type: Bug Reporter: PJ Fanning https://github.com/advisories/GHSA-3mc7-4q67-w48m -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (DRILL-8298) possible bug in NonCoveringIndexPlanGenerator
[ https://issues.apache.org/jira/browse/DRILL-8298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] PJ Fanning updated DRILL-8298: -- Issue Type: Bug (was: Improvement) > possible bug in NonCoveringIndexPlanGenerator > - > > Key: DRILL-8298 > URL: https://issues.apache.org/jira/browse/DRILL-8298 > Project: Apache Drill > Issue Type: Bug >Reporter: PJ Fanning >Priority: Major > > I'm not a Calcite expert, but LGTM.com and IntelliJ suggest that the element type of this set and > the type of the instance in the contains check do not match. > {code:java} > (restrictedScanTraitSet.contains(RelCollationTraitDef.INSTANCE)) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (DRILL-8299) type matching in MetadataContext
[ https://issues.apache.org/jira/browse/DRILL-8299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] PJ Fanning updated DRILL-8299: -- Issue Type: Bug (was: Improvement) > type matching in MetadataContext > > > Key: DRILL-8299 > URL: https://issues.apache.org/jira/browse/DRILL-8299 > Project: Apache Drill > Issue Type: Bug >Reporter: PJ Fanning >Priority: Major > > The dirModifCheckMap used in this lookup is keyed using a HDFS Path instance, > not a string, so this code is not going to work: > {code:java} > public boolean getStatus(String dir) { > if (dirModifCheckMap.containsKey(dir)) { > return dirModifCheckMap.get(dir); > } > return false; > } > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
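The lookup problem described in DRILL-8299 can be reproduced with a small stand-alone sketch. To stay self-contained it uses `java.nio.file.Path` as a stand-in for the HDFS `Path` class, and the class name and the `setStatus` and `getStatusFixed` methods are hypothetical; only `dirModifCheckMap` and the shape of `getStatus` come from the ticket. The fix shown is one possible approach, not necessarily the one adopted in Drill.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the reported pattern; java.nio.file.Path stands in for the HDFS Path.
public class PathKeyLookupDemo {
    private final Map<Path, Boolean> dirModifCheckMap = new HashMap<>();

    public void setStatus(String dir, boolean status) {
        dirModifCheckMap.put(Paths.get(dir), status);
    }

    // Mirrors the reported code: it compiles because containsKey/get accept Object,
    // but a String key never equals a Path key, so this always returns false.
    public boolean getStatusBroken(String dir) {
        if (dirModifCheckMap.containsKey(dir)) {
            return dirModifCheckMap.get(dir);
        }
        return false;
    }

    // One possible fix: convert the argument to the key type before the lookup.
    public boolean getStatusFixed(String dir) {
        return dirModifCheckMap.getOrDefault(Paths.get(dir), false);
    }

    public static void main(String[] args) {
        PathKeyLookupDemo demo = new PathKeyLookupDemo();
        demo.setStatus("/data/t1", true);
        System.out.println(demo.getStatusBroken("/data/t1"));
        System.out.println(demo.getStatusFixed("/data/t1"));
    }
}
```

The compiler accepts the mismatched lookup without complaint, which is why static analysis tools are usually what surfaces this class of bug.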
[jira] [Created] (DRILL-8299) type matching in MetadataContext
PJ Fanning created DRILL-8299: - Summary: type matching in MetadataContext Key: DRILL-8299 URL: https://issues.apache.org/jira/browse/DRILL-8299 Project: Apache Drill Issue Type: Improvement Reporter: PJ Fanning The dirModifCheckMap used in this lookup is keyed using a HDFS Path instance, not a string, so this code is not going to work: {code:java} public boolean getStatus(String dir) { if (dirModifCheckMap.containsKey(dir)) { return dirModifCheckMap.get(dir); } return false; } {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (DRILL-8298) possible bug in NonCoveringIndexPlanGenerator
PJ Fanning created DRILL-8298: - Summary: possible bug in NonCoveringIndexPlanGenerator Key: DRILL-8298 URL: https://issues.apache.org/jira/browse/DRILL-8298 Project: Apache Drill Issue Type: Improvement Reporter: PJ Fanning I'm not a Calcite expert, but LGTM.com and IntelliJ suggest that the element type of this set and the type of the instance in the contains check do not match. {code:java} (restrictedScanTraitSet.contains(RelCollationTraitDef.INSTANCE)) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (DRILL-8297) remove or fix OrderedPartitionRecordBatch
PJ Fanning created DRILL-8297: - Summary: remove or fix OrderedPartitionRecordBatch Key: DRILL-8297 URL: https://issues.apache.org/jira/browse/DRILL-8297 Project: Apache Drill Issue Type: Improvement Reporter: PJ Fanning The constructor will always throw a NullPointerException because cache is always null. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (DRILL-8296) possible type bug in SplunkBatchReader
[ https://issues.apache.org/jira/browse/DRILL-8296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] PJ Fanning updated DRILL-8296: -- Description: {code:java} if (path.nameEquals("**")) { return true; } else { return specialFields.contains(path.getAsNamePart()); } {code} LGTM and IntelliJ both say that NamePart type does not match the type stored in specialFields collection. was: ``` if (path.nameEquals("**")) { return true; } else { return specialFields.contains(path.getAsNamePart()); } ``` LGTM and IntelliJ both say that NamePart type does not match the type stored in specialFields collection. > possible type bug in SplunkBatchReader > -- > > Key: DRILL-8296 > URL: https://issues.apache.org/jira/browse/DRILL-8296 > Project: Apache Drill > Issue Type: Improvement > Components: splunk >Reporter: PJ Fanning >Priority: Major > > {code:java} > if (path.nameEquals("**")) { > return true; > } else { > return specialFields.contains(path.getAsNamePart()); > } > {code} > LGTM and IntelliJ both say that NamePart type does not match the type stored > in specialFields collection. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (DRILL-8296) possible type bug in SplunkBatchReader
PJ Fanning created DRILL-8296: - Summary: possible type bug in SplunkBatchReader Key: DRILL-8296 URL: https://issues.apache.org/jira/browse/DRILL-8296 Project: Apache Drill Issue Type: Improvement Components: splunk Reporter: PJ Fanning ``` if (path.nameEquals("**")) { return true; } else { return specialFields.contains(path.getAsNamePart()); } ``` LGTM and IntelliJ both say that NamePart type does not match the type stored in specialFields collection. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (DRILL-8282) upgrade to hadoop-common 3.2.4 due to cve
PJ Fanning created DRILL-8282: - Summary: upgrade to hadoop-common 3.2.4 due to cve Key: DRILL-8282 URL: https://issues.apache.org/jira/browse/DRILL-8282 Project: Apache Drill Issue Type: Improvement Reporter: PJ Fanning https://github.com/advisories/GHSA-8wm5-8h9c-47pc * this change requires some reload4j dependency changes too - see broken build - https://github.com/apache/drill/pull/2628 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (DRILL-8267) Remove commons-configuration dependency management
[ https://issues.apache.org/jira/browse/DRILL-8267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] PJ Fanning closed DRILL-8267. - Resolution: Won't Fix This doesn't need to be done > Remove commons-configuration dependency management > -- > > Key: DRILL-8267 > URL: https://issues.apache.org/jira/browse/DRILL-8267 > Project: Apache Drill > Issue Type: Improvement >Reporter: PJ Fanning >Priority: Major > > https://mvnrepository.com/artifact/commons-configuration/commons-configuration/1.10 > This jar is EOL and has many very insecure dependencies. > Looks like this dependency is not used by Drill or any of its dependencies. > Hadoop uses commons-configuration2 instead. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (DRILL-8267) remove commons-configuration dependency
[ https://issues.apache.org/jira/browse/DRILL-8267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] PJ Fanning updated DRILL-8267: -- Description: https://mvnrepository.com/artifact/commons-configuration/commons-configuration/1.10 This jar is EOL and has many very insecure dependencies. Looks like this dependency is not used by Drill or any of its dependencies. Hadoop uses commons-configuration2 instead. was: https://mvnrepository.com/artifact/commons-configuration/commons-configuration/1.10 This jar is EOL and has many very insecure dependencies. We should use commons-configuration2. > remove commons-configuration dependency > --- > > Key: DRILL-8267 > URL: https://issues.apache.org/jira/browse/DRILL-8267 > Project: Apache Drill > Issue Type: Improvement >Reporter: PJ Fanning >Priority: Major > > https://mvnrepository.com/artifact/commons-configuration/commons-configuration/1.10 > This jar is EOL and has many very insecure dependencies. > Looks like this dependency is not used by Drill or any of its dependencies. > Hadoop uses commons-configuration2 instead. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (DRILL-8267) remove commons-configuration dependency
[ https://issues.apache.org/jira/browse/DRILL-8267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] PJ Fanning updated DRILL-8267: -- Summary: remove commons-configuration dependency (was: switch to commons-configuration2) > remove commons-configuration dependency > --- > > Key: DRILL-8267 > URL: https://issues.apache.org/jira/browse/DRILL-8267 > Project: Apache Drill > Issue Type: Improvement >Reporter: PJ Fanning >Priority: Major > > https://mvnrepository.com/artifact/commons-configuration/commons-configuration/1.10 > This jar is EOL and has many very insecure dependencies. > We should use commons-configuration2. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (DRILL-8267) switch to commons-configuration2
PJ Fanning created DRILL-8267: - Summary: switch to commons-configuration2 Key: DRILL-8267 URL: https://issues.apache.org/jira/browse/DRILL-8267 Project: Apache Drill Issue Type: Improvement Reporter: PJ Fanning https://mvnrepository.com/artifact/commons-configuration/commons-configuration/1.10 This jar is EOL and has many very insecure dependencies. We should use commons-configuration2. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (DRILL-8266) address number casting issues in github scan
[ https://issues.apache.org/jira/browse/DRILL-8266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] PJ Fanning updated DRILL-8266: -- Summary: address number casting issues in github scan (was: address number casting issues in https://github.com/apache/drill/security/code-scanning) > address number casting issues in github scan > > > Key: DRILL-8266 > URL: https://issues.apache.org/jira/browse/DRILL-8266 > Project: Apache Drill > Issue Type: Improvement >Reporter: PJ Fanning >Priority: Major > > https://github.com/apache/drill/security/code-scanning -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (DRILL-8266) address number casting issues in https://github.com/apache/drill/security/code-scanning
PJ Fanning created DRILL-8266: - Summary: address number casting issues in https://github.com/apache/drill/security/code-scanning Key: DRILL-8266 URL: https://issues.apache.org/jira/browse/DRILL-8266 Project: Apache Drill Issue Type: Improvement Reporter: PJ Fanning https://github.com/apache/drill/security/code-scanning -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (DRILL-8265) upgrade aws-java-sdk-s3 due to CVE
PJ Fanning created DRILL-8265: - Summary: upgrade aws-java-sdk-s3 due to CVE Key: DRILL-8265 URL: https://issues.apache.org/jira/browse/DRILL-8265 Project: Apache Drill Issue Type: Improvement Reporter: PJ Fanning https://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk-s3/1.12.260 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (DRILL-8262) Xalan is EOL and has a never to be fixed CVE
[ https://issues.apache.org/jira/browse/DRILL-8262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17568722#comment-17568722 ] PJ Fanning commented on DRILL-8262: --- https://github.com/apache/drill/pull/2607 > Xalan is EOL and has a never to be fixed CVE > > > Key: DRILL-8262 > URL: https://issues.apache.org/jira/browse/DRILL-8262 > Project: Apache Drill > Issue Type: Improvement >Reporter: PJ Fanning >Priority: Major > > Xalan is no longer supported. > https://lists.apache.org/thread/s8kjny5270ssfcp46v0fl39lk98987w7 > It is better to use JAXP TransformerFactory than using xalan directly. If you > add xalan dependency just to ensure that you have a JAXP compliant > transformer on the classpath, this is unnecessary - the Java runtime has a > built-in implementation. > Drill dependency: > https://mvnrepository.com/artifact/org.apache.drill.exec/drill-java-exec/1.20.0 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (DRILL-8264) upgrade joda to fix security warning
PJ Fanning created DRILL-8264: - Summary: upgrade joda to fix security warning Key: DRILL-8264 URL: https://issues.apache.org/jira/browse/DRILL-8264 Project: Apache Drill Issue Type: Improvement Reporter: PJ Fanning A bug in joda-time pom causes this: https://github.com/apache/drill/security/code-scanning/27 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (DRILL-8263) use secure, non-preview version of libpam4j
PJ Fanning created DRILL-8263: - Summary: use secure, non-preview version of libpam4j Key: DRILL-8263 URL: https://issues.apache.org/jira/browse/DRILL-8263 Project: Apache Drill Issue Type: Improvement Components: Execution - Data Types Reporter: PJ Fanning https://github.com/apache/drill/blob/master/exec/java-exec/pom.xml#L32 See dependency with CVE in: https://mvnrepository.com/artifact/org.apache.drill.exec/drill-java-exec/1.20.0 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (DRILL-8262) Xalan is EOL and has a never to be fixed CVE
PJ Fanning created DRILL-8262: - Summary: Xalan is EOL and has a never to be fixed CVE Key: DRILL-8262 URL: https://issues.apache.org/jira/browse/DRILL-8262 Project: Apache Drill Issue Type: Improvement Reporter: PJ Fanning Xalan is no longer supported. https://lists.apache.org/thread/s8kjny5270ssfcp46v0fl39lk98987w7 It is better to use JAXP TransformerFactory than using xalan directly. If you add xalan dependency just to ensure that you have a JAXP compliant transformer on the classpath, this is unnecessary - the Java runtime has a built-in implementation. Drill dependency: https://mvnrepository.com/artifact/org.apache.drill.exec/drill-java-exec/1.20.0 -- This message was sent by Atlassian Jira (v8.20.10#820010)
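As a sketch of the JAXP route recommended in DRILL-8262, the example below obtains a transformer through `TransformerFactory` rather than referencing Xalan classes directly. The class and method names are hypothetical, and the identity transform is only a placeholder for whatever transformation Drill actually performs; enabling secure processing is an extra precaution added here, not something the ticket asks for.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.XMLConstants;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical example class, not Drill code.
public class JaxpTransformDemo {

    // Copies the input XML to a String using whichever JAXP implementation is on the
    // classpath; with no extra dependency that is the one built into the Java runtime.
    public static String identityTransform(String xml) {
        try {
            TransformerFactory factory = TransformerFactory.newInstance();
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            Transformer transformer = factory.newTransformer(); // identity transform
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XML transformation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(identityTransform("<a><b>1</b></a>"));
    }
}
```

Because the code depends only on `javax.xml.transform`, no Xalan artifact is needed on the classpath for it to run.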
[jira] [Commented] (DRILL-8096) format-excel reader: support different Shared String implementations
[ https://issues.apache.org/jira/browse/DRILL-8096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17565411#comment-17565411 ] PJ Fanning commented on DRILL-8096: --- This is not implemented. The excel-streaming-reader library that Drill uses now uses ReadOnlySharedStringTable, so that part of this issue is already addressed, but letting users choose the implementation when using Drill is not yet supported. The feature is potentially useful, but it may be better to wait until users start reporting memory footprint issues before adding extra Drill features. > format-excel reader: support different Shared String implementations > > > Key: DRILL-8096 > URL: https://issues.apache.org/jira/browse/DRILL-8096 > Project: Apache Drill > Issue Type: Improvement > Components: Execution - Data Types >Reporter: PJ Fanning >Priority: Major > > One of the biggest users of memory and processing time when reading Excel > files is handling the Shared Strings Table. > excel-streaming-reader v3.3.0 supports 3 implementations. > I would suggest that Drill should use the ReadOnlySharedStringTable as the > default. > Drill currently uses the full featured Apache POI SharedStringTable by > default (which requires more memory and parsing effort). > There is also a TempFileSharedStringTable which uses a temp file to keep the > data out of heap memory. This is still pretty fast because it is implemented > using an H2 database MVMap. > If allowing users to configure which implementation they want sounds > useful, I can do a PR. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (DRILL-8251) Upgrade hadoop 2 (to 2.10.2) due to CVE
PJ Fanning created DRILL-8251: - Summary: Upgrade hadoop 2 (to 2.10.2) due to CVE Key: DRILL-8251 URL: https://issues.apache.org/jira/browse/DRILL-8251 Project: Apache Drill Issue Type: Improvement Affects Versions: 1.20.1 Reporter: PJ Fanning Relates to https://github.com/apache/drill/security/dependabot/21 -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (DRILL-8240) Revisit clone of log4j Strings class
[ https://issues.apache.org/jira/browse/DRILL-8240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17543925#comment-17543925 ] PJ Fanning commented on DRILL-8240: --- The issue is that Apache Hive code uses a class from log4j-api jar but Drill does not include log4j-api jar as a dependency when it uses Apache Hive. So far, the solution is for Drill to have a copy of the log4j class that Hive needs. This java file needs to be kept up to date - we upgraded Log4j during the Log4j panic at the turn of this year - but never upgraded the java file. I believe that Drill should not be copying log4j classes like this and that it should include the log4j-api jar as a dependency when using Apache Hive. If the Drill team insists on not adding this dependency, then we are stuck with having to merge in all the changes that happen to the Java file. > Revisit clone of log4j Strings class > > > Key: DRILL-8240 > URL: https://issues.apache.org/jira/browse/DRILL-8240 > Project: Apache Drill > Issue Type: Improvement > Components: Functions - Hive >Affects Versions: 1.20.1 >Reporter: PJ Fanning >Priority: Major > > See https://issues.apache.org/jira/browse/DRILL-8044 for background. > The code added there is now out of date. After the log4j panic late last > year, 5 commits were made to modify the real log4j class and these are > missing from the Drill copy. > Compare > https://github.com/apache/logging-log4j2/commits/rel/2.17.2/log4j-api/src/main/java/org/apache/logging/log4j/util/Strings.java > to > https://github.com/apache/logging-log4j2/commits/rel/2.14.1/log4j-api/src/main/java/org/apache/logging/log4j/util/Strings.java > The Drill copy is based on Log4J 2.14.1. Every commit in 2021 and 2022 is > missing from the Drill copy. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Comment Edited] (DRILL-8240) Revisit clone of log4j Strings class
[ https://issues.apache.org/jira/browse/DRILL-8240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17543925#comment-17543925 ] PJ Fanning edited comment on DRILL-8240 at 5/30/22 12:27 PM: - The issue is that Apache Hive code uses a class from log4j-api jar but Drill does not include log4j-api jar as a dependency when it uses Apache Hive. So far, the solution is for Drill to have a copy of the log4j class that Hive needs. This java file needs to be kept up to date - we upgraded Log4j during the Log4j panic at the turn of this year - but never upgraded the java file. I believe that Drill should not be copying log4j classes like this and that it should include the log4j-api jar as a dependency when using Apache Hive. If Drill team insists on not adding this dependency, then we are stuck with having to merge in all the changes that happen to the Java file. was (Author: pj.fanning): The issue is that Apache Hive code uses a class from log4j-api jar but Drill does not include log4j-api jar as a dependency when it uses Apache Hive. So far, the solution is for Drill to have a copy of the log4j class that Hive needs. This java file needs to be kept up to date - we upgraded Log4j during the Log4j panic at the tirn of this year - but never upgraded the java file. I believe that Drill should not be copying log4j classes like this and that it should include the log4j-api jar as a dependency when using Apache Hive. If Drill team insists on not adding this dependency, then we are stuck with having to merge in all the changes that happen to the Java file. > Revisit clone of log4j Strings class > > > Key: DRILL-8240 > URL: https://issues.apache.org/jira/browse/DRILL-8240 > Project: Apache Drill > Issue Type: Improvement > Components: Functions - Hive >Affects Versions: 1.20.1 >Reporter: PJ Fanning >Priority: Major > > See https://issues.apache.org/jira/browse/DRILL-8044 for background. > The code added there is now out of date. 
After the log4j panic late last > year, 5 commits were made to modify the real log4j class and these are > missing from the Drill copy. > Compare > https://github.com/apache/logging-log4j2/commits/rel/2.17.2/log4j-api/src/main/java/org/apache/logging/log4j/util/Strings.java > to > https://github.com/apache/logging-log4j2/commits/rel/2.14.1/log4j-api/src/main/java/org/apache/logging/log4j/util/Strings.java > The Drill copy is based on Log4J 2.14.1. Every commit in 2021 and 2022 is > missing from the Drill copy. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (DRILL-8240) Revisit clone of log4j Strings class
[ https://issues.apache.org/jira/browse/DRILL-8240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17543600#comment-17543600 ] PJ Fanning commented on DRILL-8240: --- [~dzamo], [~cgivre], [~luoc] Any thoughts on how we should proceed here? Should we just update the Drill copy of the code? > Revisit clone of log4j Strings class > > > Key: DRILL-8240 > URL: https://issues.apache.org/jira/browse/DRILL-8240 > Project: Apache Drill > Issue Type: Improvement > Components: Functions - Hive >Affects Versions: 1.20.1 >Reporter: PJ Fanning >Priority: Major > > See https://issues.apache.org/jira/browse/DRILL-8044 for background. > The code added there is now out of date. After the log4j panic late last > year, 5 commits were made to modify the real log4j class and these are > missing from the Drill copy. > Compare > https://github.com/apache/logging-log4j2/commits/rel/2.17.2/log4j-api/src/main/java/org/apache/logging/log4j/util/Strings.java > to > https://github.com/apache/logging-log4j2/commits/rel/2.14.1/log4j-api/src/main/java/org/apache/logging/log4j/util/Strings.java > The Drill copy is based on Log4J 2.14.1. Every commit in 2021 and 2022 is > missing from the Drill copy. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (DRILL-8240) Revisit clone of log4j Strings class
PJ Fanning created DRILL-8240: - Summary: Revisit clone of log4j Strings class Key: DRILL-8240 URL: https://issues.apache.org/jira/browse/DRILL-8240 Project: Apache Drill Issue Type: Improvement Components: Functions - Hive Affects Versions: 1.20.1 Reporter: PJ Fanning See https://issues.apache.org/jira/browse/DRILL-8044 for background. The code added there is now out of date. After the log4j panic late last year, 5 commits were made to modify the real log4j class and these are missing from the Drill copy. Compare https://github.com/apache/logging-log4j2/commits/rel/2.17.2/log4j-api/src/main/java/org/apache/logging/log4j/util/Strings.java to https://github.com/apache/logging-log4j2/commits/rel/2.14.1/log4j-api/src/main/java/org/apache/logging/log4j/util/Strings.java The Drill copy is based on Log4J 2.14.1. Every commit in 2021 and 2022 is missing from the Drill copy. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (DRILL-8230) upgrade to poi 5.2.2
PJ Fanning created DRILL-8230: - Summary: upgrade to poi 5.2.2 Key: DRILL-8230 URL: https://issues.apache.org/jira/browse/DRILL-8230 Project: Apache Drill Issue Type: Improvement Reporter: PJ Fanning -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (DRILL-8176) upgrade jackson due to CVE-2020-36518
PJ Fanning created DRILL-8176: - Summary: upgrade jackson due to CVE-2020-36518 Key: DRILL-8176 URL: https://issues.apache.org/jira/browse/DRILL-8176 Project: Apache Drill Issue Type: Bug Reporter: PJ Fanning https://nvd.nist.gov/vuln/detail/CVE-2020-36518 -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (DRILL-8154) upgrade to poi 5.2.1
PJ Fanning created DRILL-8154: - Summary: upgrade to poi 5.2.1 Key: DRILL-8154 URL: https://issues.apache.org/jira/browse/DRILL-8154 Project: Apache Drill Issue Type: Improvement Components: Execution - Data Types Reporter: PJ Fanning https://poi.apache.org/ -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (DRILL-8150) upgrade to log4j 2.17.2
PJ Fanning created DRILL-8150: - Summary: upgrade to log4j 2.17.2 Key: DRILL-8150 URL: https://issues.apache.org/jira/browse/DRILL-8150 Project: Apache Drill Issue Type: Improvement Reporter: PJ Fanning https://logging.apache.org/log4j/2.x/changes-report.html -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (DRILL-8149) format-excel plugin needs to support POI IOUtils byte array overrides to support big files
PJ Fanning created DRILL-8149: - Summary: format-excel plugin needs to support POI IOUtils byte array overrides to support big files Key: DRILL-8149 URL: https://issues.apache.org/jira/browse/DRILL-8149 Project: Apache Drill Issue Type: Improvement Components: Execution - Data Types Affects Versions: 1.19.0 Reporter: PJ Fanning [https://poi.apache.org/components/configuration.html] - see [org.apache.poi.util.IOUtils.setByteArrayMaxOverride(int maxOverride)|https://poi.apache.org/apidocs/5.0/org/apache/poi/util/IOUtils.html#setByteArrayMaxOverride-int-] Core POI code tries to set limits on resource allocations. excel-streaming-reader may not be as heavily affected by these settings because it only uses parts of the core POI codebase. POI 5.2.1 (due in the next few weeks) fixes a few issues, but there is some evidence that core POI users are hitting issues when loading large files and having to set the byte array max override setting. I can do some testing of the format-excel plugin to see if it can hit these issues with large files. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (DRILL-8095) format-excel reader - upgrade to POI 5.2.0
[ https://issues.apache.org/jira/browse/DRILL-8095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] PJ Fanning resolved DRILL-8095. --- Fix Version/s: 1.20.0 Resolution: Fixed PR merged > format-excel reader - upgrade to POI 5.2.0 > -- > > Key: DRILL-8095 > URL: https://issues.apache.org/jira/browse/DRILL-8095 > Project: Apache Drill > Issue Type: Improvement > Components: Execution - Data Types >Reporter: PJ Fanning >Priority: Major > Fix For: 1.20.0 > > > Upgrade to latest POI release -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (DRILL-8095) format-excel reader - upgrade to POI 5.2.0
[ https://issues.apache.org/jira/browse/DRILL-8095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] PJ Fanning updated DRILL-8095: -- Description: Upgrade to latest POI release (was: I've recently added a feature to excel-streaming-reader (in v3.3.0) to optionally ignore cell style information. This is not enabled by default. It saves memory and processing time to ignore the cell styles. The current Drill format-excel code does not use the cell styles. At some point in the future, it may be worth having a Drill feature that allows it to infer the schema for the sheet based on the cell styles but until such a feature is added, the parsing the cell styles is a waste of compute resources. If this sounds, useful, I can submit a PR.) Summary: format-excel reader - upgrade to POI 5.2.0 (was: format-excel reader should ignore cell styles) It appears that Drill code need the excel styles to work out if the cell data is a cell - so need to keep parsing the style data. was: I've recently added a feature to excel-streaming-reader (in v3.3.0) to optionally ignore cell style information. This is not enabled by default. It saves memory and processing time to ignore the cell styles. The current Drill format-excel code does not use the cell styles. At some point in the future, it may be worth having a Drill feature that allows it to infer the schema for the sheet based on the cell styles but until such a feature is added, the parsing the cell styles is a waste of compute resources. If this sounds, useful, I can submit a PR. > format-excel reader - upgrade to POI 5.2.0 > -- > > Key: DRILL-8095 > URL: https://issues.apache.org/jira/browse/DRILL-8095 > Project: Apache Drill > Issue Type: Improvement > Components: Execution - Data Types >Reporter: PJ Fanning >Priority: Major > > Upgrade to latest POI release -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (DRILL-8106) format-excel does not handle missing cells properly
[ https://issues.apache.org/jira/browse/DRILL-8106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17474927#comment-17474927 ] PJ Fanning commented on DRILL-8106: --- [~cgivre] this is the issue that you emailed me about. > format-excel does not handle missing cells properly > --- > > Key: DRILL-8106 > URL: https://issues.apache.org/jira/browse/DRILL-8106 > Project: Apache Drill > Issue Type: Improvement > Components: Execution - Data Types >Reporter: PJ Fanning >Priority: Major > > ExcelBatchReader uses cellIterator assuming that this will return cells for > all columns - but this is not how that code works - the iterator only returns > non-empty cells. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (DRILL-8106) format-excel does not handle missing cells properly
PJ Fanning created DRILL-8106:

Summary: format-excel does not handle missing cells properly
Key: DRILL-8106
URL: https://issues.apache.org/jira/browse/DRILL-8106
Project: Apache Drill
Issue Type: Improvement
Components: Execution - Data Types
Reporter: PJ Fanning

ExcelBatchReader uses cellIterator on the assumption that it returns a cell for every column, but that is not how the iterator works: it only returns non-empty cells.

-- This message was sent by Atlassian Jira (v8.20.1#820001)
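The effect of a sparse cell iterator can be sketched without POI. The example below is a JDK-only model, not the Drill or POI code: the class `SparseRowDemo` and both method names are hypothetical, and a `TreeMap` from column index to value stands in for a row that stores only its non-empty cells. In POI itself, looking cells up by column index (for example with `Row.getCell(int)`, which can return null for a missing cell) is one way to get the same alignment, though whether the eventual fix takes that route is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** JDK-only model of a sparse row; this is not the POI API. */
final class SparseRowDemo {

  /** Naive version: takes values in iteration order and ignores column indexes. */
  static List<String> naiveRow(Map<Integer, String> sparseCells) {
    return new ArrayList<>(sparseCells.values());
  }

  /**
   * Expands a sparse row (column index -> value) into a dense list with one
   * entry per column, using null for columns that have no stored cell.
   */
  static List<String> toDenseRow(Map<Integer, String> sparseCells, int columnCount) {
    List<String> dense = new ArrayList<>();
    for (int col = 0; col < columnCount; col++) {
      dense.add(sparseCells.get(col)); // null when the cell is missing
    }
    return dense;
  }

  public static void main(String[] args) {
    Map<Integer, String> row = new TreeMap<>();
    row.put(0, "a");
    row.put(2, "c"); // column 1 is empty, so no cell is stored for it
    System.out.println(naiveRow(row));      // [a, c]
    System.out.println(toDenseRow(row, 3)); // [a, null, c]
  }
}
```

In the naive result the value "c" lands in the second position, so it would be written to the wrong column; the dense version keeps it in the third.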
[jira] [Commented] (DRILL-8095) format-excel reader should ignore cell styles
[ https://issues.apache.org/jira/browse/DRILL-8095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17474699#comment-17474699 ] PJ Fanning commented on DRILL-8095: --- In theory, cell styling should not affect Drill based on how it currently parses the data. I can add a PR after the POI 5.2.0 release goes out (at weekend, hopefully). If you have any examples of xlsx files that cause problems with existing Drill code - could you send them to me? You can email if the data is sensitive. > format-excel reader should ignore cell styles > - > > Key: DRILL-8095 > URL: https://issues.apache.org/jira/browse/DRILL-8095 > Project: Apache Drill > Issue Type: Improvement > Components: Execution - Data Types >Reporter: PJ Fanning >Priority: Major > > I've recently added a feature to excel-streaming-reader (in v3.3.0) to > optionally ignore cell style information. This is not enabled by default. It > saves memory and processing time to ignore the cell styles. > The current Drill format-excel code does not use the cell styles. > At some point in the future, it may be worth having a Drill feature that > allows it to infer the schema for the sheet based on the cell styles but > until such a feature is added, the parsing the cell styles is a waste of > compute resources. > If this sounds, useful, I can submit a PR. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (DRILL-8096) format-excel reader: support different Shared String implementations
[ https://issues.apache.org/jira/browse/DRILL-8096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17474696#comment-17474696 ]

PJ Fanning commented on DRILL-8096:

[~cgivre] I'm heading up the POI 5.2.0 release, and it will be released in a few days if no one drops a late -1. So I'm planning to wait until that is released and include the POI and associated lib updates in my next Drill PR.

> format-excel reader: support different Shared String implementations
>
> Key: DRILL-8096
> URL: https://issues.apache.org/jira/browse/DRILL-8096
> Project: Apache Drill
> Issue Type: Improvement
> Components: Execution - Data Types
> Reporter: PJ Fanning
> Priority: Major
>
> One of the biggest users of memory and processing time when reading Excel files is handling the Shared Strings Table.
> excel-streaming-reader v3.3.0 supports 3 implementations.
> I would suggest that Drill should use the ReadOnlySharedStringTable as the default.
> Drill currently uses the full-featured Apache POI SharedStringTable by default (which requires more memory and parsing effort).
> There is also a TempFileSharedStringTable, which uses a temp file to keep the data out of heap memory. This is still pretty fast because it is implemented using an H2 database MVMap.
> If letting users configure which implementation they want sounds useful, I can do a PR.

-- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (DRILL-8096) format-excel reader: support different Shared String implementations
PJ Fanning created DRILL-8096:

Summary: format-excel reader: support different Shared String implementations
Key: DRILL-8096
URL: https://issues.apache.org/jira/browse/DRILL-8096
Project: Apache Drill
Issue Type: Improvement
Components: Execution - Data Types
Reporter: PJ Fanning

One of the biggest users of memory and processing time when reading Excel files is handling the Shared Strings Table.
excel-streaming-reader v3.3.0 supports 3 implementations.
I would suggest that Drill should use the ReadOnlySharedStringTable as the default.
Drill currently uses the full-featured Apache POI SharedStringTable by default (which requires more memory and parsing effort).
There is also a TempFileSharedStringTable, which uses a temp file to keep the data out of heap memory. This is still pretty fast because it is implemented using an H2 database MVMap.
If letting users configure which implementation they want sounds useful, I can do a PR.

-- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (DRILL-8095) format-excel reader should ignore cell styles
PJ Fanning created DRILL-8095:

Summary: format-excel reader should ignore cell styles
Key: DRILL-8095
URL: https://issues.apache.org/jira/browse/DRILL-8095
Project: Apache Drill
Issue Type: Improvement
Components: Execution - Data Types
Reporter: PJ Fanning

I've recently added a feature to excel-streaming-reader (in v3.3.0) to optionally ignore cell style information. This is not enabled by default. It saves memory and processing time to ignore the cell styles.
The current Drill format-excel code does not use the cell styles.
At some point in the future, it may be worth having a Drill feature that allows it to infer the schema for the sheet based on the cell styles, but until such a feature is added, parsing the cell styles is a waste of compute resources.

-- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (DRILL-8095) format-excel reader should ignore cell styles
[ https://issues.apache.org/jira/browse/DRILL-8095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] PJ Fanning updated DRILL-8095: -- Description: I've recently added a feature to excel-streaming-reader (in v3.3.0) to optionally ignore cell style information. This is not enabled by default. It saves memory and processing time to ignore the cell styles. The current Drill format-excel code does not use the cell styles. At some point in the future, it may be worth having a Drill feature that allows it to infer the schema for the sheet based on the cell styles but until such a feature is added, the parsing the cell styles is a waste of compute resources. If this sounds, useful, I can submit a PR. was: I've recently added a feature to excel-streaming-reader (in v3.3.0) to optionally ignore cell style information. This is not enabled by default. It saves memory and processing time to ignore the cell styles. The current Drill format-excel code does not use the cell styles. At some point in the future, it may be worth having a Drill feature that allows it to infer the schema for the sheet based on the cell styles but until such a feature is added, the parsing the cell styles is a waste of compute resources. > format-excel reader should ignore cell styles > - > > Key: DRILL-8095 > URL: https://issues.apache.org/jira/browse/DRILL-8095 > Project: Apache Drill > Issue Type: Improvement > Components: Execution - Data Types >Reporter: PJ Fanning >Priority: Major > > I've recently added a feature to excel-streaming-reader (in v3.3.0) to > optionally ignore cell style information. This is not enabled by default. It > saves memory and processing time to ignore the cell styles. > The current Drill format-excel code does not use the cell styles. 
> At some point in the future, it may be worth having a Drill feature that > allows it to infer the schema for the sheet based on the cell styles but > until such a feature is added, the parsing the cell styles is a waste of > compute resources. > If this sounds, useful, I can submit a PR. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (DRILL-8071) format-excel data parsing should use POI code
[ https://issues.apache.org/jira/browse/DRILL-8071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17458822#comment-17458822 ] PJ Fanning commented on DRILL-8071: --- I scaled back the scope of this issue to just what was covered in the linked PR. I removed this from the description: The existing ExcelBatchReader uses the raw data values from the cells. This raw data ignores formatting set on the cells. As an example, numbers and dates are stored as doubles. With the POI DataFormatter, you can get the cell style applied so that the data will appear as it does when you view the data in Excel itself. [https://poi.apache.org/apidocs/dev/org/apache/poi/ss/usermodel/DataFormatter.html#formatCellValue-org.apache.poi.ss.usermodel.Cell-] A big number like 123456789.987654 could be stored as double that is more like 123456789.987653999 when represented in decimal format (because this might be the closest match that double can represent). The cell format could say that cell has 6 decimal places after the decimal point so the formatter would round the number back to the value that it displayed in Excel as. Even if you choose not to use the DataFormatter, you have unprotected calls to `cell.getNumericCellValue()` and that could easily throw an exception (if the data is not stored a number). Even `cell.getStringCellValue()` can throw an exception - for similar reasons. > format-excel data parsing should use POI code > - > > Key: DRILL-8071 > URL: https://issues.apache.org/jira/browse/DRILL-8071 > Project: Apache Drill > Issue Type: Improvement > Components: Execution - Data Types >Affects Versions: 1.19.0 >Reporter: PJ Fanning >Priority: Major > > There is also custom code for handling the conversion of the raw numbers > representing dates/timestamps but this also seems like a bad idea. The Cell > class has getLocalDateTimeCellValue and this has the right logic for > converting 1904 and 1900 based dates - yes, Excel uses 2 different formats. 
> Code that processes excel files is a real pain to get right because the > Microsoft storage format is really bad. > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (DRILL-8071) format-excel data parsing should use POI code
[ https://issues.apache.org/jira/browse/DRILL-8071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] PJ Fanning updated DRILL-8071: -- Description: There is also custom code for handling the conversion of the raw numbers representing dates/timestamps but this also seems like a bad idea. The Cell class has getLocalDateTimeCellValue and this has the right logic for converting 1904 and 1900 based dates - yes, Excel uses 2 different formats. Code that processes excel files is a real pain to get right because the Microsoft storage format is really bad. was: The existing ExcelBatchReader uses the raw data values from the cells. This raw data ignores formatting set on the cells. As an example, numbers and dates are stored as doubles. With the POI DataFormatter, you can get the cell style applied so that the data will appear as it does when you view the data in Excel itself. [https://poi.apache.org/apidocs/dev/org/apache/poi/ss/usermodel/DataFormatter.html#formatCellValue-org.apache.poi.ss.usermodel.Cell-] A big number like 123456789.987654 could be stored as double that is more like 123456789.987653999 when represented in decimal format (because this might be the closest match that double can represent). The cell format could say that cell has 6 decimal places after the decimal point so the formatter would round the number back to the value that it displayed in Excel as. Even if you choose not to use the DataFormatter, you have unprotected calls to `cell.getNumericCellValue()` and that could easily throw an exception (if the data is not stored a number). Even `cell.getStringCellValue()` can throw an exception - for similar reasons. There is also custom code for handling the conversion of the raw numbers representing dates/timestamps but this also seems like a bad idea. The Cell class has getLocalDateTimeCellValue and this has the right logic for converting 1904 and 1900 based dates - yes, Excel uses 2 different formats. 
Code that processes excel files is a real pain to get right because the Microsoft storage format is really bad. > format-excel data parsing should use POI code > - > > Key: DRILL-8071 > URL: https://issues.apache.org/jira/browse/DRILL-8071 > Project: Apache Drill > Issue Type: Improvement > Components: Execution - Data Types >Affects Versions: 1.19.0 >Reporter: PJ Fanning >Priority: Major > > There is also custom code for handling the conversion of the raw numbers > representing dates/timestamps but this also seems like a bad idea. The Cell > class has getLocalDateTimeCellValue and this has the right logic for > converting 1904 and 1900 based dates - yes, Excel uses 2 different formats. > Code that processes excel files is a real pain to get right because the > Microsoft storage format is really bad. > -- This message was sent by Atlassian Jira (v8.20.1#820001)
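To illustrate why the two Excel date systems mentioned above matter, here is a minimal JDK-only sketch of converting a whole-day serial number in each system. It is hand-rolled for illustration and is not POI's implementation; the class `ExcelDateSystemDemo` and the method `fromSerial` are hypothetical names. It assumes the commonly documented conventions: in the 1900 system serial 1 is 1900-01-01 and serial 60 is a 29 February 1900 that never existed, and in the 1904 system serial 0 is 1904-01-01. Time-of-day fractions are ignored.

```java
import java.time.LocalDate;

/** Illustrative only: hand-rolled conversion of Excel day serials, not POI's code. */
final class ExcelDateSystemDemo {

  /** Converts a whole-day serial number to a date in the chosen date system. */
  static LocalDate fromSerial(long serial, boolean use1904System) {
    if (use1904System) {
      // 1904 system: serial 0 is 1904-01-01.
      return LocalDate.of(1904, 1, 1).plusDays(serial);
    }
    if (serial == 60) {
      // The 1900 system counts a 29 February 1900 that never existed.
      throw new IllegalArgumentException("serial 60 has no real calendar date");
    }
    // 1900 system: serial 1 is 1900-01-01; serials after 60 are shifted by
    // one day to step over the phantom leap day.
    LocalDate base = serial < 60 ? LocalDate.of(1899, 12, 31) : LocalDate.of(1899, 12, 30);
    return base.plusDays(serial);
  }

  public static void main(String[] args) {
    System.out.println(fromSerial(44562, false)); // 2022-01-01
    System.out.println(fromSerial(43100, true));  // 2022-01-01
    System.out.println(fromSerial(44562, true));  // a different date: wrong system
  }
}
```

The same calendar date has serials 1462 apart in the two systems, so reading a workbook with the wrong base silently shifts every date by about four years. That is the kind of detail the issue suggests leaving to `getLocalDateTimeCellValue` rather than to custom code.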
[jira] [Updated] (DRILL-8071) format-excel data parsing should use POI code
[ https://issues.apache.org/jira/browse/DRILL-8071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] PJ Fanning updated DRILL-8071: -- Summary: format-excel data parsing should use POI code (was: format-excel should use POI DataFormatter) > format-excel data parsing should use POI code > - > > Key: DRILL-8071 > URL: https://issues.apache.org/jira/browse/DRILL-8071 > Project: Apache Drill > Issue Type: Improvement > Components: Execution - Data Types >Affects Versions: 1.19.0 >Reporter: PJ Fanning >Priority: Major > > The existing ExcelBatchReader uses the raw data values from the cells. This > raw data ignores formatting set on the cells. As an example, numbers and > dates are stored as doubles. With the POI DataFormatter, you can get the cell > style applied so that the data will appear as it does when you view the data > in Excel itself. > [https://poi.apache.org/apidocs/dev/org/apache/poi/ss/usermodel/DataFormatter.html#formatCellValue-org.apache.poi.ss.usermodel.Cell-] > > A big number like 123456789.987654 could be stored as double that is more > like 123456789.987653999 when represented in decimal format (because this > might be the closest match that double can represent). The cell format could > say that cell has 6 decimal places after the decimal point so the formatter > would round the number back to the value that it displayed in Excel as. > Even if you choose not to use the DataFormatter, you have unprotected calls > to `cell.getNumericCellValue()` and that could easily throw an exception (if > the data is not stored a number). Even `cell.getStringCellValue()` can throw > an exception - for similar reasons. > > There is also custom code for handling the conversion of the raw numbers > representing dates/timestamps but this also seems like a bad idea. The Cell > class has getLocalDateTimeCellValue and this has the right logic for > converting 1904 and 1900 based dates - yes, Excel uses 2 different formats. 
> Code that processes excel files is a real pain to get right because the > Microsoft storage format is really bad. > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (DRILL-8070) format-excel assumes that rowIterator returns every row
[ https://issues.apache.org/jira/browse/DRILL-8070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] PJ Fanning updated DRILL-8070: -- Summary: format-excel assumes that rowIterator returns every row (was: format-excel assumes that rowIterator returns every row - it doesn't) > format-excel assumes that rowIterator returns every row > --- > > Key: DRILL-8070 > URL: https://issues.apache.org/jira/browse/DRILL-8070 > Project: Apache Drill > Issue Type: Bug > Components: Execution - Data Types >Reporter: PJ Fanning >Priority: Major > > In ExcelBatchReader, this code makes the wrong assumption: > {code:java} > for (int i = 1; i < rowNumber; i++) { > currentRow = rowIterator.next(); > } {code} > > There are 2 for loops like this. > Empty Rows will not necessarily be returned by the iterator. Basically, rows > without populated cells could easily be skipped. Think of the Sheet as being > represented as a sparse matrix - because it is stored like this. > > > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (DRILL-8071) format-excel should use POI DataFormatter
[ https://issues.apache.org/jira/browse/DRILL-8071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] PJ Fanning updated DRILL-8071: -- Description: The existing ExcelBatchReader uses the raw data values from the cells. This raw data ignores formatting set on the cells. As an example, numbers and dates are stored as doubles. With the POI DataFormatter, you can get the cell style applied so that the data will appear as it does when you view the data in Excel itself. [https://poi.apache.org/apidocs/dev/org/apache/poi/ss/usermodel/DataFormatter.html#formatCellValue-org.apache.poi.ss.usermodel.Cell-] A big number like 123456789.987654 could be stored as double that is more like 123456789.987653999 when represented in decimal format (because this might be the closest match that double can represent). The cell format could say that cell has 6 decimal places after the decimal point so the formatter would round the number back to the value that it displayed in Excel as. Even if you choose not to use the DataFormatter, you have unprotected calls to `cell.getNumericCellValue()` and that could easily throw an exception (if the data is not stored a number). Even `cell.getStringCellValue()` can throw an exception - for similar reasons. There is also custom code for handling the conversion of the raw numbers representing dates/timestamps but this also seems like a bad idea. The Cell class has getLocalDateTimeCellValue and this has the right logic for converting 1904 and 1900 based dates - yes, Excel uses 2 different formats. Code that processes excel files is a real pain to get right because the Microsoft storage format is really bad. was: The existing ExcelBatchReader uses the raw data values from the cells. This raw data ignores formatting set on the cells. As an example, numbers and dates are stored as doubles. With the POI DataFormatter, you can get the cell style applied so that the data will appear as it does when you view the data in Excel itself. 
[https://poi.apache.org/apidocs/dev/org/apache/poi/ss/usermodel/DataFormatter.html#formatCellValue-org.apache.poi.ss.usermodel.Cell-] A big number like 123456789.987654 could be stored as double that is more like 123456789.987653999 when represented in decimal format (because this might be the closest match that double can represent). The cell format could say that cell has 6 decimal places after the decimal point so the formatter would round the number back to the value that it displayed in Excel as. Even if you choose not to use the DataFormatter, you have unprotected calls to `cell.getNumericCellValue()` and that could easily throw an exception (if the data is not stored a number). Even `cell.getStringCellValue()` can throw an exception - for similar reasons. Code that processes excel files is a real pain to get right because the Microsoft storage format is really bad. > format-excel should use POI DataFormatter > - > > Key: DRILL-8071 > URL: https://issues.apache.org/jira/browse/DRILL-8071 > Project: Apache Drill > Issue Type: Improvement > Components: Execution - Data Types >Reporter: PJ Fanning >Priority: Major > > The existing ExcelBatchReader uses the raw data values from the cells. This > raw data ignores formatting set on the cells. As an example, numbers and > dates are stored as doubles. With the POI DataFormatter, you can get the cell > style applied so that the data will appear as it does when you view the data > in Excel itself. > [https://poi.apache.org/apidocs/dev/org/apache/poi/ss/usermodel/DataFormatter.html#formatCellValue-org.apache.poi.ss.usermodel.Cell-] > > A big number like 123456789.987654 could be stored as double that is more > like 123456789.987653999 when represented in decimal format (because this > might be the closest match that double can represent). The cell format could > say that cell has 6 decimal places after the decimal point so the formatter > would round the number back to the value that it displayed in Excel as. 
> Even if you choose not to use the DataFormatter, you have unprotected calls > to `cell.getNumericCellValue()` and that could easily throw an exception (if > the data is not stored a number). Even `cell.getStringCellValue()` can throw > an exception - for similar reasons. > > There is also custom code for handling the conversion of the raw numbers > representing dates/timestamps but this also seems like a bad idea. The Cell > class has getLocalDateTimeCellValue and this has the right logic for > converting 1904 and 1900 based dates - yes, Excel uses 2 different formats. > Code that processes excel files is a real pain to get right because the > Microsoft storage format is really bad. > -- This message was sent by Atlassian Jira
[jira] [Commented] (DRILL-8071) format-excel should use POI DataFormatter
[ https://issues.apache.org/jira/browse/DRILL-8071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17454585#comment-17454585 ] PJ Fanning commented on DRILL-8071: --- [~cgivre] I spotted what I think is another issue in the excel code > format-excel should use POI DataFormatter > - > > Key: DRILL-8071 > URL: https://issues.apache.org/jira/browse/DRILL-8071 > Project: Apache Drill > Issue Type: Improvement > Components: Execution - Data Types >Reporter: PJ Fanning >Priority: Major > > The existing ExcelBatchReader uses the raw data values from the cells. This > raw data ignores formatting set on the cells. As an example, numbers and > dates are stored as doubles. With the POI DataFormatter, you can get the cell > style applied so that the data will appear as it does when you view the data > in Excel itself. > [https://poi.apache.org/apidocs/dev/org/apache/poi/ss/usermodel/DataFormatter.html#formatCellValue-org.apache.poi.ss.usermodel.Cell-] > > A big number like 123456789.987654 could be stored as double that is more > like 123456789.987653999 when represented in decimal format (because this > might be the closest match that double can represent). The cell format could > say that cell has 6 decimal places after the decimal point so the formatter > would round the number back to the value that it displayed in Excel as. > Even if you choose not to use the DataFormatter, you have unprotected calls > to `cell.getNumericCellValue()` and that could easily throw an exception (if > the data is not stored a number). Even `cell.getStringCellValue()` can throw > an exception - for similar reasons. > > Code that processes excel files is a real pain to get right because the > Microsoft storage format is really bad. > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (DRILL-8071) format-excel should use POI DataFormatter
[ https://issues.apache.org/jira/browse/DRILL-8071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] PJ Fanning updated DRILL-8071: -- Description: The existing ExcelBatchReader uses the raw data values from the cells. This raw data ignores formatting set on the cells. As an example, numbers and dates are stored as doubles. With the POI DataFormatter, you can get the cell style applied so that the data will appear as it does when you view the data in Excel itself. [https://poi.apache.org/apidocs/dev/org/apache/poi/ss/usermodel/DataFormatter.html#formatCellValue-org.apache.poi.ss.usermodel.Cell-] A big number like 123456789.987654 could be stored as double that is more like 123456789.987653999 when represented in decimal format (because this might be the closest match that double can represent). The cell format could say that cell has 6 decimal places after the decimal point so the formatter would round the number back to the value that it displayed in Excel as. Even if you choose not to use the DataFormatter, you have unprotected calls to `cell.getNumericCellValue()` and that could easily throw an exception (if the data is not stored a number). Even `cell.getStringCellValue()` can throw an exception - for similar reasons. Code that processes excel files is a real pain to get right because the Microsoft storage format is really bad. was: The existing ExcelBatchReader uses the raw data values from the cells. This raw data ignores formatting set on the cells. As an example, numbers and dates are stored as doubles. With the POI DataFormatter, you can get the cell style applied so that the data will appear as it does when you view the data in Excel itself. 
[https://poi.apache.org/apidocs/dev/org/apache/poi/ss/usermodel/DataFormatter.html#formatCellValue-org.apache.poi.ss.usermodel.Cell-] A big number like 123456789.987654 could be stored as double that is more like 123456789.987653999 when represented in decimal format (because this might be the closest match that double can represent). The cell format could say that cell has 6 decimal places after the decimal point so the formatter would round the number back to the value that it displayed in Excel as. Code that processes excel files is a real pain to get right because the Microsoft storage format is really bad. > format-excel should use POI DataFormatter > - > > Key: DRILL-8071 > URL: https://issues.apache.org/jira/browse/DRILL-8071 > Project: Apache Drill > Issue Type: Improvement > Components: Execution - Data Types >Reporter: PJ Fanning >Priority: Major > > The existing ExcelBatchReader uses the raw data values from the cells. This > raw data ignores formatting set on the cells. As an example, numbers and > dates are stored as doubles. With the POI DataFormatter, you can get the cell > style applied so that the data will appear as it does when you view the data > in Excel itself. > [https://poi.apache.org/apidocs/dev/org/apache/poi/ss/usermodel/DataFormatter.html#formatCellValue-org.apache.poi.ss.usermodel.Cell-] > > A big number like 123456789.987654 could be stored as double that is more > like 123456789.987653999 when represented in decimal format (because this > might be the closest match that double can represent). The cell format could > say that cell has 6 decimal places after the decimal point so the formatter > would round the number back to the value that it displayed in Excel as. > Even if you choose not to use the DataFormatter, you have unprotected calls > to `cell.getNumericCellValue()` and that could easily throw an exception (if > the data is not stored a number). Even `cell.getStringCellValue()` can throw > an exception - for similar reasons. 
> > Code that processes excel files is a real pain to get right because the > Microsoft storage format is really bad. > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (DRILL-8071) format-excel should use POI DataFormatter
PJ Fanning created DRILL-8071:

Summary: format-excel should use POI DataFormatter
Key: DRILL-8071
URL: https://issues.apache.org/jira/browse/DRILL-8071
Project: Apache Drill
Issue Type: Improvement
Components: Execution - Data Types
Reporter: PJ Fanning

The existing ExcelBatchReader uses the raw data values from the cells. This raw data ignores formatting set on the cells. As an example, numbers and dates are stored as doubles. With the POI DataFormatter, you can get the cell style applied so that the data will appear as it does when you view the data in Excel itself.
[https://poi.apache.org/apidocs/dev/org/apache/poi/ss/usermodel/DataFormatter.html#formatCellValue-org.apache.poi.ss.usermodel.Cell-]

A big number like 123456789.987654 could be stored as a double that is more like 123456789.987653999 when represented in decimal format (because this might be the closest value that a double can represent). The cell format could say that the cell has 6 decimal places after the decimal point, so the formatter would round the number back to the value displayed in Excel.

Code that processes excel files is a real pain to get right because the Microsoft storage format is really bad.

-- This message was sent by Atlassian Jira (v8.20.1#820001)
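The rounding argument above can be checked with plain JDK arithmetic. The sketch below does not use POI's DataFormatter; the class `StoredDoubleDemo` and its method names are hypothetical, and `BigDecimal` merely stands in for the "apply the cell format" step under the assumption that the format asks for 6 decimal places.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

/** JDK-only illustration of raw double versus formatted value; not POI code. */
final class StoredDoubleDemo {

  /** Exact decimal expansion of the binary double, with no rounding applied. */
  static String exactValue(double stored) {
    return new BigDecimal(stored).toPlainString();
  }

  /** Rounds the stored double to the number of decimal places a cell format asks for. */
  static String roundedForDisplay(double stored, int decimalPlaces) {
    return new BigDecimal(stored)
        .setScale(decimalPlaces, RoundingMode.HALF_UP)
        .toPlainString();
  }

  public static void main(String[] args) {
    double stored = 123456789.987654;
    System.out.println(exactValue(stored));           // long expansion, not exactly ...987654
    System.out.println(roundedForDisplay(stored, 6)); // 123456789.987654
  }
}
```

The exact expansion differs from the typed value only far beyond the sixth decimal place, which is why applying the cell's format recovers what Excel shows.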
[jira] [Commented] (DRILL-8070) format-excel assumes that rowIterator returns every row - it doesn't
[ https://issues.apache.org/jira/browse/DRILL-8070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17454580#comment-17454580 ] PJ Fanning commented on DRILL-8070: --- [~cgivre] I spotted this when looking at DRILL-8069 > format-excel assumes that rowIterator returns every row - it doesn't > > > Key: DRILL-8070 > URL: https://issues.apache.org/jira/browse/DRILL-8070 > Project: Apache Drill > Issue Type: Bug > Components: Execution - Data Types >Reporter: PJ Fanning >Priority: Major > > In ExcelBatchReader, this code makes the wrong assumption: > {code:java} > for (int i = 1; i < rowNumber; i++) { > currentRow = rowIterator.next(); > } {code} > > There are 2 for loops like this. > Empty Rows will not necessarily be returned by the iterator. Basically, rows > without populated cells could easily be skipped. Think of the Sheet as being > represented as a sparse matrix - because it is stored like this. > > > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (DRILL-8070) format-excel assumes that rowIterator returns every row - it doesn't
[ https://issues.apache.org/jira/browse/DRILL-8070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] PJ Fanning updated DRILL-8070: -- Description: In ExcelBatchReader, this code makes the wrong assumption: {code:java} for (int i = 1; i < rowNumber; i++) { currentRow = rowIterator.next(); } {code} There are 2 for loops like this. Empty Rows will not necessarily be returned by the iterator. Basically, rows without populated cells could easily be skipped. Think of the Sheet as being represented as a sparse matrix - because it is stored like this. was: In ExcelBatchReader, this code makes the wrong assumption: ``` for (int i = 1; i < rowNumber; i++) { currentRow = rowIterator.next(); } ``` There are 2 for loops like this. Empty Rows will not necessarily be returned by the iterator. Basically, rows without populated cells could easily be skipped. Think of the Sheet as being represented as a sparse matrix - because it is stored like this. > format-excel assumes that rowIterator returns every row - it doesn't > > > Key: DRILL-8070 > URL: https://issues.apache.org/jira/browse/DRILL-8070 > Project: Apache Drill > Issue Type: Bug > Components: Execution - Data Types >Reporter: PJ Fanning >Priority: Major > > In ExcelBatchReader, this code makes the wrong assumption: > {code:java} > for (int i = 1; i < rowNumber; i++) { > currentRow = rowIterator.next(); > } {code} > > There are 2 for loops like this. > Empty Rows will not necessarily be returned by the iterator. Basically, rows > without populated cells could easily be skipped. Think of the Sheet as being > represented as a sparse matrix - because it is stored like this. > > > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (DRILL-8070) format-excel assumes that rowIterator returns every row - it doesn't
PJ Fanning created DRILL-8070:

Summary: format-excel assumes that rowIterator returns every row - it doesn't
Key: DRILL-8070
URL: https://issues.apache.org/jira/browse/DRILL-8070
Project: Apache Drill
Issue Type: Bug
Components: Execution - Data Types
Reporter: PJ Fanning

In ExcelBatchReader, this code makes the wrong assumption:

```
for (int i = 1; i < rowNumber; i++) {
  currentRow = rowIterator.next();
}
```

There are 2 for loops like this.

Empty rows will not necessarily be returned by the iterator. Basically, rows without populated cells could easily be skipped. Think of the Sheet as being represented as a sparse matrix - because it is stored like this.

-- This message was sent by Atlassian Jira (v8.20.1#820001)
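A small JDK-only model shows how counting `next()` calls goes wrong on a sparse sheet. This is a sketch, not the ExcelBatchReader code or its eventual fix: the class `SparseSheetDemo` and both methods are hypothetical, and a `TreeMap` keyed by row number stands in for a sheet that stores only populated rows. With POI, the comparable check would presumably compare against `Row.getRowNum()` instead of a loop counter.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.TreeMap;

/** JDK-only model of a sheet that stores only non-empty rows; not the POI API. */
final class SparseSheetDemo {

  /** Naive skip: calls next() a fixed number of times, as if no row were missing. */
  static Integer rowReachedByCounting(TreeMap<Integer, String> rows, int skipCount) {
    Iterator<Map.Entry<Integer, String>> it = rows.entrySet().iterator();
    Integer reached = null;
    for (int i = 0; i < skipCount && it.hasNext(); i++) {
      reached = it.next().getKey();
    }
    return reached;
  }

  /** Index-aware skip: advances until the stored row number reaches the target. */
  static Integer firstRowAtOrAfter(TreeMap<Integer, String> rows, int targetRow) {
    for (Map.Entry<Integer, String> entry : rows.entrySet()) {
      if (entry.getKey() >= targetRow) {
        return entry.getKey();
      }
    }
    return null; // no stored row at or after the target
  }

  public static void main(String[] args) {
    TreeMap<Integer, String> rows = new TreeMap<>();
    rows.put(0, "header");
    rows.put(1, "first");
    rows.put(4, "after two empty rows"); // rows 2 and 3 are not stored at all
    // Three next() calls were meant to land on row 2 but land on row 4.
    System.out.println(rowReachedByCounting(rows, 3)); // 4
    // Checking the stored row number makes the gap visible to the caller.
    System.out.println(firstRowAtOrAfter(rows, 2));    // 4
  }
}
```

Both calls return row 4 here, but only the second one was asked about a row number, so the caller can tell that rows 2 and 3 were absent rather than silently treating row 4 as row 2.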