[jira] [Created] (DRILL-8494) HTTP Caching Not Saving Pages

2024-05-06 Thread Charles Givre (Jira)
Charles Givre created DRILL-8494:


 Summary: HTTP Caching Not Saving Pages
 Key: DRILL-8494
 URL: https://issues.apache.org/jira/browse/DRILL-8494
 Project: Apache Drill
  Issue Type: Bug
  Components: Storage - HTTP
Affects Versions: 1.21.1
Reporter: Charles Givre
Assignee: Charles Givre
 Fix For: 1.21.2


A minor bug fix: the HTTP storage plugin was not actually caching results 
even when caching was set to true.  This regression was introduced in DRILL-8329.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [VOTE] Release Apache Drill 1.21.2 - RC0

2024-04-28 Thread Charles Givre
Mine as well:
-1

I merged the XML fix and will have a fix for the caching regression by mid-week.
Best,
-- C

> On Apr 27, 2024, at 1:43 AM, James Turton  wrote:
> 
> To keep the mailing list apprised of goings-on, some regressions are being 
> fixed (DRILL-8493, DRILL-8329) and another RC will be prepared. Here's my -1 
> for RC0.
> 
> On 2024/04/15 11:26, James Turton wrote:
>> Hi all
>> 
>> I'd like to propose the first release candidate (RC0) of Apache Drill, 
>> version 1.21.2.
>> 
>> The release candidate covers a total of 38 resolved Jira issues [1]. Thanks 
>> to everyone who contributed to this release.
>> 
>> The tarball artefacts are hosted at [2] and the Maven artefacts are hosted 
>> at [3].
>> 
>> This release candidate is based on a98e5f50405437a2fd670fdcb5840796fadfa6b2 
>> located at [4].
>> 
>> CI runs for the release candidate are viewable at [5].
>> 
>> [ ] +1
>> [ ] +0
>> [ ] -1
>> 
>> [1] 
>> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12313820&version=12353550
>> [2] https://dist.apache.org/repos/dist/dev/drill/1.21.2-rc0/
>> [3] https://repository.apache.org/content/repositories/orgapachedrill-1108/
>> [4] https://github.com/jnturton/drill/commits/drill-1.21.2
>> [5] https://github.com/apache/drill/actions/runs/8685131686/job/23813981058
> 



[jira] [Created] (DRILL-8493) Drill Unable to Read XML Files with Namespaces

2024-04-26 Thread Charles Givre (Jira)
Charles Givre created DRILL-8493:


 Summary: Drill Unable to Read XML Files with Namespaces
 Key: DRILL-8493
 URL: https://issues.apache.org/jira/browse/DRILL-8493
 Project: Apache Drill
  Issue Type: Bug
  Components: Format - XML
Affects Versions: 1.21.1
Reporter: Charles Givre
Assignee: Charles Givre
 Fix For: 1.21.2


This fixes a bug whereby Drill ignored all data when an XML file had a 
namespace.  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Drill <> DFDL

2024-04-10 Thread Charles Givre
Hi Mike, 
I hope all is well.  Congrats on the latest release of Daffodil!  Now that that is 
done, I wanted to follow up with you about the Drill/DFDL integration.  Are we 
getting closer to being able to merge this?
Best,
-- C

Towards Drill 1.21.2

2024-03-18 Thread Charles Givre
Hello all, 
I hope all is well.  It seems like we are ready to do a bug fix release.  There 
are two relatively minor PRs in flight, but from my perspective 1.21.2 is ready 
to go.  What does everyone think?
Best,
-- C

Re: [I] Query results caching (drill)

2024-02-16 Thread Charles Givre
Hi there, 
You should check the cache directory to see whether Drill is writing files to 
the cache; if caching is working, you should see result files there.
Best, 
-- C

> On Feb 16, 2024, at 09:06, RayetSena (via GitHub)  wrote:
> 
> 
> RayetSena opened a new issue, #2881:
> URL: https://github.com/apache/drill/issues/2881
> 
>   Hello, 
> 
>   I am using http storage plugin and I want to activate the caching. In the 
> documentation it says that this requires adding cacheResults. So I added this 
> to my config. But when I run the same query again, there is almost no 
> difference in query duration. You can see the configuration and query 
> duration below.
> 
>   
> ![image](https://github.com/apache/drill/assets/55273803/2214311b-e359-401f-8abe-60f4bbb9b099)
>   ```
>   {
> "type": "http",
> "cacheResults": true,
> "connections": {
>   "opendata": {
> "url": "https://opendata.cbs.nl/ODataApi/odata/",
> "requireTail": true,
> "method": "GET",
> "authType": "none",
> "inputType": "json",
> "xmlDataLevel": 1,
> "postParameterLocation": "QUERY_STRING",
> "verifySSLCert": false
>   }
> },
> "timeout": 4000,
> "retryDelay": 5000,
> "proxyType": "direct",
> "authMode": "USER_TRANSLATION",
> "enabled": true
>   }
>   ```
>   
> ![image](https://github.com/apache/drill/assets/55273803/4c90c486-4487-4c47-bc44-6f3b45819341)
> 
>   Also I tried to configure the caching properties in the drill-override.conf 
> file. Here is the configuration that I added:
>   ```
>   drill.exec: {
>   caching: {
>   enabled: true
>   storage:{
>  type: "local"
>  path:  "cache/directory"
>   }
>   }
>   }
>   ```
> 
>   **Drill version**
>   1.21.1
> 
> 
> 
> 
> -- 
> This is an automated message from the Apache Git Service.
> To respond to the message, please log on to GitHub and use the
> URL above to go to the specific comment.
> 
> To unsubscribe, e-mail: dev-unsubscr...@drill.apache.org
> 
> For queries about this service, please contact Infrastructure at:
> us...@infra.apache.org
> 



Re: Joint OGC / ASF / OSGeo codesprint in 3 weeks

2024-02-02 Thread Charles Givre
Hi Bertil, 
Let me explain a bit about Drill and Calcite.  Drill uses Calcite for query 
planning.  Until fairly recently, Drill had a fork of Calcite with some 
special features which Drill required.  However, about 1-2 versions ago, we 
were able to get Drill off of the fork and onto "mainstream" Calcite.  We're 
currently using version 1.34 (I think).  However, with version 1.35 of Calcite 
there were some breaking changes for Drill, so we haven't upgraded the 
dependency yet.  For someone who knows Calcite really well, I don't think that 
would be too difficult, but the breaking issues had to do with the data types 
returned by some of the date functions... Anyway...

With respect to SQL functions, Drill does this on a case-by-case basis.  For 
certain situations, it relies on Calcite for the functions, and in other cases, 
it uses its own logic.  I'm not an expert on Drill's query planning, but I've 
tinkered with this a few times.  In any event, all of the geo functions in 
Drill are implemented as UDFs and live in the contrib folder [1].  Additionally, 
the ESRI reader is also in the contrib folder of the project [2].  I think the 
rationale for this was that when the geo functions were implemented, Drill was 
still stuck on the Calcite fork, which did not have the geo functions.  Another 
issue we may encounter is that Drill does not have a specific spatial data 
type; it relies on the VARBINARY data type for spatial data.
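
If it helps to picture what those UDFs look like, a Drill "simple" UDF is just an 
annotated Java class with holders for its inputs and output.  A rough, purely 
illustrative skeleton (the names here are made up, not taken from the geo or H3 
repos) looks something like this:

import org.apache.drill.exec.expr.DrillSimpleFunc;
import org.apache.drill.exec.expr.annotations.FunctionTemplate;
import org.apache.drill.exec.expr.annotations.Output;
import org.apache.drill.exec.expr.annotations.Param;
import org.apache.drill.exec.expr.holders.Float8Holder;

// Hypothetical example function; the real geo UDFs live under contrib/udfs [1].
@FunctionTemplate(
    name = "example_double",
    scope = FunctionTemplate.FunctionScope.SIMPLE,
    nulls = FunctionTemplate.NullHandling.NULL_IF_NULL)
public class ExampleDoubleFunction implements DrillSimpleFunc {

  @Param
  Float8Holder input;   // one DOUBLE argument

  @Output
  Float8Holder out;     // DOUBLE result

  public void setup() {
    // One-time setup; nothing needed for this trivial example.
  }

  public void eval() {
    // Stand-in for real spatial logic, which (as noted above) would typically
    // operate on VARBINARY-encoded geometries.
    out.value = input.value * 2;
  }
}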

Take all of this with a grain of salt. I'm not an expert on Calcite or GIS.  :-)
Best,
-- C


[1]: https://github.com/apache/drill/tree/master/contrib/udfs
[2]: https://github.com/apache/drill/tree/master/contrib/format-esri


> On Feb 2, 2024, at 01:25, Bertil Chapuis  wrote:
> 
> Hello Jia and Charles,
> 
> I'm really interested in this topic as well. Apache Calcite transitioned 
> from ESRI Geometry to JTS, and many ST functions have been implemented there 
> as well [1, 2, 3]. Sharing experiences and code could benefit all projects.
> 
> I haven’t looked into the details of each project, but from what I 
> understand, Sedona depends on Calcite through Flink, and Drill depends 
> directly on Calcite. Is that correct? Since these functions are available in 
> Calcite’s core, it means they may already be available in the respective 
> class paths.
> 
> Best regards,
> 
> Bertil
> 
> [1] 
> https://github.com/apache/calcite/blob/main/core/src/main/java/org/apache/calcite/runtime/SpatialTypeFunctions.java
> [2] 
> https://github.com/apache/calcite/blob/main/core/src/test/resources/sql/spatial.iq
> [3] 
> https://calcite.apache.org/docs/reference.html#geometry-conversion-functions-2d
> 
> 
>> On 2 Feb 2024, at 06:47, Jia Yu  wrote:
>> 
>> Hi Charles,
>> 
>> This is Jia Yu from Apache Sedona. I think what you did is fantastic.
>> As a project of this joint code sprint, I am proposing to implement a
>> comprehensive set of spatial functions to Apache Drill using Apache
>> Sedona.
>> 
>> Apache Sedona has implemented over 130 ST functions and a
>> high-performance geometry serializer in pure Java. All these functions
>> have been ported to Apache Spark, Apache Flink and Snowflake. They are
>> being downloaded over 1.5 million times per month.
>> 
>> This porting process is fairly simple. Let's take Sedona on Apache
>> Flink as an example:
>> 
>> 1. Call a Sedona java function in a UDF template:
>> https://github.com/apache/sedona/blob/master/flink/src/main/java/org/apache/sedona/flink/expressions/Functions.java
>> 2. Register this function in a catalog file:
>> https://github.com/apache/sedona/blob/master/flink/src/main/java/org/apache/sedona/flink/Catalog.java
>> 
>> What do you think?
>> 
>> Thanks,
>> Jia
>> 
>> On Thu, Feb 1, 2024 at 2:44 PM Charles Givre  wrote:
>>> 
>>> Hi Martin,
>>> Thanks for sending.  I'd love for Drill to be included in this.  I have a 
>>> question for you.  A while ago, I started work on a collection of UDFs for 
>>> interacting with H3 Geo Indexes.  I'm not an expert on this but would this 
>>> be useful?  Here's the repo: https://github.com/datadistillr/drill-h3-udf   
>>> If someone would like to collaborate to complete this and get it 
>>> integrated, I'm all for that.
>>> Best,
>>> -- C
>>> 
>>> 
>>> 
>>>> On Jan 31, 2024, at 10:20, Martin Desruisseaux 
>>>>  wrote:
>>>> 
>>>> Hello all
>>>> 
>>>> The Open Geospatial Consortium (OGC), The Apache Software Foundation (ASF) 
>>>> and The Open Source Geospatial Foundation (OSGeo) hold a joint code sprint 
>>>> on February 26 to 28 

Re: Joint OGC / ASF / OSGeo codesprint in 3 weeks

2024-02-02 Thread Charles Givre
Hi Jia, 
Thanks for your interest!  I'd be happy to work with you to build out a Drill 
<> Sedona collab.  This sounds really interesting and I think would be a great 
addition to both projects.
With that said, I'm totally unfamiliar with Sedona unfortunately, so I'm not 
sure how much help I can be on that side, but I do know something about 
Drill... :-)
Best,
-- C


> On Feb 2, 2024, at 00:47, Jia Yu  wrote:
> 
> Hi Charles,
> 
> This is Jia Yu from Apache Sedona. I think what you did is fantastic.
> As a project of this joint code sprint, I am proposing to implement a
> comprehensive set of spatial functions to Apache Drill using Apache
> Sedona.
> 
> Apache Sedona has implemented over 130 ST functions and a
> high-performance geometry serializer in pure Java. All these functions
> have been ported to Apache Spark, Apache Flink and Snowflake. They are
> being downloaded over 1.5 million times per month.
> 
> This porting process is fairly simple. Let's take Sedona on Apache
> Flink as an example:
> 
> 1. Call a Sedona java function in a UDF template:
> https://github.com/apache/sedona/blob/master/flink/src/main/java/org/apache/sedona/flink/expressions/Functions.java
> 2. Register this function in a catalog file:
> https://github.com/apache/sedona/blob/master/flink/src/main/java/org/apache/sedona/flink/Catalog.java
> 
> What do you think?
> 
> Thanks,
> Jia
> 
> On Thu, Feb 1, 2024 at 2:44 PM Charles Givre  wrote:
>> 
>> Hi Martin,
>> Thanks for sending.  I'd love for Drill to be included in this.  I have a 
>> question for you.  A while ago, I started work on a collection of UDFs for 
>> interacting with H3 Geo Indexes.  I'm not an expert on this but would this 
>> be useful?  Here's the repo: https://github.com/datadistillr/drill-h3-udf   
>> If someone would like to collaborate to complete this and get it integrated, 
>> I'm all for that.
>> Best,
>> -- C
>> 
>> 
>> 
>>> On Jan 31, 2024, at 10:20, Martin Desruisseaux 
>>>  wrote:
>>> 
>>> Hello all
>>> 
>>> The Open Geospatial Consortium (OGC), The Apache Software Foundation (ASF) 
>>> and The Open Source Geospatial Foundation (OSGeo) hold a joint code sprint 
>>> on February 26 to 28 [1]. The main goals are to support the development of 
>>> open standards for geospatial information and to support the development of 
>>> free and open source software which implements those standards, as well as 
>>> creating awareness about the standards and software projects. This is the 
>>> fourth year that this joint code sprint is organized, and this year will be 
>>> physically located in Évora (Portugal). The event can also be attended 
>>> on-line. Registration is free [2].
>>> 
>>> Apache SIS, Sedona, Baremaps, Parquet, Drill and Camel projects 
>>> participated in the past. It would be great if participation was possible 
>>> this year too. Some ideas could be:
>>> 
>>> * Experiment the use of Apache SIS in Sedona for referencing and grid
>>>  coverage services (could be a join effort between Sedona and SIS
>>>  developers).
>>> * Any work related to Geoparquet [3] (an incubating OGC standard based
>>>  on Apache Parquet).
>>> * Any work related to Drill GIS functions [4].
>>> * Any work related to Camel Geocoder [5]. For example, exploring the
>>>  pertinence of using the ISO 19112 standard (could be a join effort
>>>  between Camel and GeoAPI developers).
>>> 
>>> If anyone is interested, the wiki page [1] can be edited directly. In 
>>> particular, you can add your project in the "Which Apache projects are going 
>>> to participate?" section. If an introduction to a project can be presented 
>>> as a tutorial, it can also be added in the "Mentor streams" section of [1].
>>> 
>>>Martin
>>> 
>>> [1]https://github.com/opengeospatial/developer-events/wiki/2024-Joint-OGC-%E2%80%93-OSGeo-%E2%80%93-ASF-Code-Sprint
>>> [2]https://developer.ogc.org/sprints/23/
>>> [3]https://geoparquet.org/
>>> [4]https://drill.apache.org/docs/gis-functions/
>>> [5]https://camel.apache.org/components/4.0.x/geocoder-component.html
>> 



Re: Joint OGC / ASF / OSGeo codesprint in 3 weeks

2024-02-01 Thread Charles Givre
Hi Martin, 
Thanks for sending.  I'd love for Drill to be included in this.  I have a 
question for you.  A while ago, I started work on a collection of UDFs for 
interacting with H3 Geo Indexes.  I'm not an expert on this but would this be 
useful?  Here's the repo: https://github.com/datadistillr/drill-h3-udf   If 
someone would like to collaborate to complete this and get it integrated, I'm 
all for that. 
Best,
-- C



> On Jan 31, 2024, at 10:20, Martin Desruisseaux 
>  wrote:
> 
> Hello all
> 
> The Open Geospatial Consortium (OGC), The Apache Software Foundation (ASF) 
> and The Open Source Geospatial Foundation (OSGeo) hold a joint code sprint on 
> February 26 to 28 [1]. The main goals are to support the development of open 
> standards for geospatial information and to support the development of free 
> and open source software which implements those standards, as well as 
> creating awareness about the standards and software projects. This is the 
> fourth year that this joint code sprint is organized, and this year will be 
> physically located in Évora (Portugal). The event can also be attended 
> on-line. Registration is free [2].
> 
> Apache SIS, Sedona, Baremaps, Parquet, Drill and Camel projects participated 
> in the past. It would be great if participation was possible this year too. 
> Some ideas could be:
> 
> * Experiment the use of Apache SIS in Sedona for referencing and grid
>   coverage services (could be a join effort between Sedona and SIS
>   developers).
> * Any work related to Geoparquet [3] (an incubating OGC standard based
>   on Apache Parquet).
> * Any work related to Drill GIS functions [4].
> * Any work related to Camel Geocoder [5]. For example, exploring the
>   pertinence of using the ISO 19112 standard (could be a join effort
>   between Camel and GeoAPI developers).
> 
> If anyone is interested, the wiki page [1] can be edited directly. In 
> particular, you can add your project in the "Which Apache projects are going 
> to participate?" section. If an introduction to a project can be presented as 
> a tutorial, it can also be added in the "Mentor streams" section of [1].
> 
> Martin
> 
> [1]https://github.com/opengeospatial/developer-events/wiki/2024-Joint-OGC-%E2%80%93-OSGeo-%E2%80%93-ASF-Code-Sprint
> [2]https://developer.ogc.org/sprints/23/
> [3]https://geoparquet.org/
> [4]https://drill.apache.org/docs/gis-functions/
> [5]https://camel.apache.org/components/4.0.x/geocoder-component.html



Possible Regression: Can't build current master

2024-01-25 Thread Charles Givre
All, 
I just rebased my local Drill on the current master and I'm getting the 
following error when I try to build it.   Is anyone else encountering this?

[ERROR] Failed to execute goal 
org.apache.maven.plugins:maven-compiler-plugin:3.11.0:compile (default-compile) 
on project vector: Compilation failure
[ERROR] 
/Users/charlesgivre/github/drill/exec/vector/src/main/java/org/apache/drill/exec/vector/accessor/writer/UnionVectorShim.java:[78,10]
 error: cannot find symbol
[ERROR]   symbol:   class UnionWriter
[ERROR]   location: class UnionVectorShim
[ERROR]
[ERROR] -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e 
switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please 
read the following articles:
[ERROR] [Help 1] 
http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException

Re: DFDL Standard approved for ISO JTC1 PAS (Pre-Approved Standard)

2024-01-22 Thread Charles Givre
Nice!

> On Jan 22, 2024, at 08:44, Mike Beckerle  wrote:
> 
> I received notice today that the DFDL OGF standard is officially headed to
> become an ISO standard. The ballot within ISO JTC1 passed 100%.
> 
> I will let you know more info when I get it about how we can propagate this
> information, any branding considerations ISO JTC1 requires about it, their
> trademark rules, etc.
> At that point it will make sense to send this info to our users lists, make
> a blog post, etc.
> 
> Mike Beckerle
> Apache Daffodil PMC | daffodil.apache.org
> OGF DFDL Workgroup Co-Chair | www.ogf.org/ogf/doku.php/standards/dfdl/dfdl
> Owl Cyber Defense | www.owlcyberdefense.com



Re: [I] HTTP plugin timeout error (drill)

2024-01-10 Thread Charles Givre
Hi there, 
I was actually going to comment on this.  I had no trouble executing the query. 
 I think James is correct that you are getting blocked by the API.  However, 
I may have a solution for you. 

It looks like you are trying to pull back multiple pages from the same API.  
Drill can actually do this for you without you having to resort to complex 
queries.  Take a look at the docs below; you should be able to configure 
pagination so that Drill does the query generation for you.  I thought I had 
implemented throttling, or at least a configurable delay between queries, but I 
can't find it if I did.  

Also, you shouldn't need to do all the casting.  If you're using the latest 
version of Drill and the incoming data is in JSON format (and well formed) 
Drill should be able to figure all that out. 
Best,
-- C

https://github.com/apache/drill/blob/master/contrib/storage-http/Pagination.md



> On Jan 10, 2024, at 02:54, jnturton (via GitHub)  wrote:
> 
> 
> jnturton commented on issue #2869:
> URL: https://github.com/apache/drill/issues/2869#issuecomment-1884350260
> 
>   I think this is likely to be due to rate limiting on the API you're 
> querying. Each `FROM http.feed.` clause you've got is going to result in 
> a separate HTTP request to the API, and probably too many of those are being 
> sent in too short a space of time for the server's liking. I can't remember 
> if @cgivre added a throttling feature to the HTTP plugin but another 
> possibility might be to split the query into a few pieces using CTAS to keep 
> retrieved data around locally for later queries that rely on it.
> 
> 
> -- 
> This is an automated message from the Apache Git Service.
> To respond to the message, please log on to GitHub and use the
> URL above to go to the specific comment.
> 
> To unsubscribe, e-mail: dev-unsubscr...@drill.apache.org
> 
> For queries about this service, please contact Infrastructure at:
> us...@infra.apache.org
> 



[jira] [Created] (DRILL-8474) Add Daffodil Format Plugin

2024-01-03 Thread Charles Givre (Jira)
Charles Givre created DRILL-8474:


 Summary: Add Daffodil Format Plugin
 Key: DRILL-8474
 URL: https://issues.apache.org/jira/browse/DRILL-8474
 Project: Apache Drill
  Issue Type: New Feature
Affects Versions: 1.21.1
Reporter: Charles Givre
 Fix For: 1.22.0






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (DRILL-8472) Bump Image Metadata Library to Latest Version

2024-01-01 Thread Charles Givre (Jira)
Charles Givre created DRILL-8472:


 Summary: Bump Image Metadata Library to Latest Version
 Key: DRILL-8472
 URL: https://issues.apache.org/jira/browse/DRILL-8472
 Project: Apache Drill
  Issue Type: Task
Affects Versions: 1.21.1
Reporter: Charles Givre
Assignee: Charles Givre
 Fix For: 1.21.2


Bump Metadata Extractor dependency to latest version.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (DRILL-8471) Bump DeltaLake Driver to Version 3.0.0

2024-01-01 Thread Charles Givre (Jira)
Charles Givre created DRILL-8471:


 Summary: Bump DeltaLake Driver to Version 3.0.0
 Key: DRILL-8471
 URL: https://issues.apache.org/jira/browse/DRILL-8471
 Project: Apache Drill
  Issue Type: Task
  Components: Format - DeltaLake
Reporter: Charles Givre


Bump DeltaLake Driver to Version 3.0.0



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: Next Version

2024-01-01 Thread Charles Givre
sts to me that 
>>> we should keep putting effort into: embedded Drill, Windows support, rapid 
>>> installation and setup, low "time to insight".
>>> 
>>> MongoDB questions still come up frequently giving a reason beyond the JSON 
>>> files questions to think that the JSON data model is still very important. 
>>> Wherever we decide to bound the current EVF v2 data model implementation, 
>>> maybe we can sketch out a design of whatever is unimplemented in some 
>>> updates to the Drill wiki pages? This would give other devs a head start if 
>>> we decide that some unsupported complex data type is worth implementing 
>>> down the road?
>>> 
>>> 1. https://infra-reports.apache.org/#downloads=drill
>>> 2. https://martinfowler.com/articles/data-mesh-principles.html
>>> 
>>> Regards
>>> James
>>> 
>>> On 2024/01/01 03:16, Charles Givre wrote:
>>>> I'll throw my .02 here...  As a user of Drill, I've only had the occasion 
>>>> to use the Union once. However, when I used it, it consumed so much 
>>>> memory, we ended up finding a workaround anyway and stopped using it. 
>>>> Honestly, since we improved the implicit casting rules, I think Drill is a 
>>>> lot smarter about how it reads data anyway. Bottom line, I do think we 
>>>> could drop the union and repeated union.
>>>> 
>>>> The repeated lists and maps however are unfortunately something that does 
>>>> come up a bit.   Honestly, I'm not sure what work is remaining here but 
>>>> TBH Drill works pretty well at the moment with most of the data I'm using 
>>>> it for.  This would include some really nasty nested JSON objects.
>>>> 
>>>> -- C
>>>> 
>>>> 
>>>>> On Dec 31, 2023, at 01:38, Paul Rogers  wrote:
>>>>> 
>>>>> Hi Luoc,
>>>>> 
>>>>> Thanks for reminding me about the EVF V2 work. I got mostly done adding
>>>>> projection for complex types, then got busy on other projects. I've yet to
>>>>> tackle the hard cases: unions, repeated unions and repeated lists (which
>>>>> are, in fact, repeated repeated unions).
>>>>> 
>>>>> The code to handle unprojected fields in these areas is getting awfully
>>>>> complicated. In doing that work, and then seeing a trick that Druid uses,
>>>>> I'm tempted to rework the projection bits of the code to use a cleaner
>>>>> approach. However, it might be better to commit the work done thus far so
>>>>> folks can use it before I wander off to take another approach.
>>>>> 
>>>>> Then, I wondered if anyone actually still uses this stuff. Do you still
>>>>> need the code to handle non-projection of complex types?
>>>>> 
>>>>> Of course, perhaps no one will ever need the hard cases: I've never been
>>>>> convinced that unions, repeated lists, or arrays of repeated lists are
>>>>> things that any sane data engineer will want to use -- or use more than
>>>>> once.
>>>>> 
>>>>> Thanks,
>>>>> 
>>>>> - Paul
>>>>> 
>>>>> 
>>>>> On Sat, Dec 30, 2023 at 10:26 PM James Turton  wrote:
>>>>> 
>>>>>> Hi Luoc and Drill devs!
>>>>>> 
>>>>>> It's best to email Paul directly since he doesn't follow these lists
>>>>>> closely. In the meantime I've prepared a PR of backported fixes for
>>>>>> 1.21.2 to the 1.21 branch [1]. I think we can try to get the Netty
>>>>>> upgrade that Maksym is working on, and which looks close to done,
>>>>>> included? There's at least one CVE  applicable to our current version of
>>>>>> Netty...
>>>>>> 
>>>>>> Regards
>>>>>> James
>>>>>> 
>>>>>> 
>>>>>> 1. https://github.com/apache/drill/pull/2860
>>>>>> 
>>>>>> On 2023/12/11 04:41, luoc wrote:
>>>>>>> Hello all,
>>>>>>>1.22 will be a more stable version. This is a digression: Is Paul
>>>>>> still interested in participating in the EVF V2 refactoring in the
>>>>>> framework? I would like to offer time to assist him.
>>>>>>> luoc
>>>>>>> 
>>>>>>>> 2023年12月9日 01:01,Charles Givre  写道:
>>>>>>>> 
>>>>>>>> Hello all,
>>>>>>>> Happy Friday everyone!   I wanted to raise the topic of getting a Drill
>>>>>> minor release out the door before the end of the year. My opinion is that
>>>>>> I'd really like to release Drill 1.22 once the integration with Apache
>>>>>> Daffodil is complete, but it sounds like that is still a few weeks away.
>>>>>>>> What does everyone think about issuing a maintenance release before the
>> end of the year?  There are a number of significant fixes including some
>>>>>> security updates and a major bug in the ES plugin that basically makes it
>>>>>> unusable.
>>>>>>>> Best,
>>>>>>>> -- C
>>>>>> 
>>> 
>> 
> 



[jira] [Created] (DRILL-8470) Bump MongoDB Driver to Latest Version

2024-01-01 Thread Charles Givre (Jira)
Charles Givre created DRILL-8470:


 Summary: Bump MongoDB Driver to Latest Version
 Key: DRILL-8470
 URL: https://issues.apache.org/jira/browse/DRILL-8470
 Project: Apache Drill
  Issue Type: Task
  Components: Storage - MongoDB
Affects Versions: 1.21.1
Reporter: Charles Givre
Assignee: Charles Givre
 Fix For: 1.21.2


Bump mongoDB driver to latest version.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: Next Version

2023-12-31 Thread Charles Givre
I'll throw my .02 here...  As a user of Drill, I've only had the occasion to 
use the Union once.  However, when I used it, it consumed so much memory, we 
ended up finding a workaround anyway and stopped using it.  Honestly, since we 
improved the implicit casting rules, I think Drill is a lot smarter about how 
it reads data anyway.  Bottom line, I do think we could drop the union and 
repeated union. 

The repeated lists and maps however are unfortunately something that does come 
up a bit.   Honestly, I'm not sure what work is remaining here but TBH Drill 
works pretty well at the moment with most of the data I'm using it for.  This 
would include some really nasty nested JSON objects. 

-- C


> On Dec 31, 2023, at 01:38, Paul Rogers  wrote:
> 
> Hi Luoc,
> 
> Thanks for reminding me about the EVF V2 work. I got mostly done adding
> projection for complex types, then got busy on other projects. I've yet to
> tackle the hard cases: unions, repeated unions and repeated lists (which
> are, in fact, repeated repeated unions).
> 
> The code to handle unprojected fields in these areas is getting awfully
> complicated. In doing that work, and then seeing a trick that Druid uses,
> I'm tempted to rework the projection bits of the code to use a cleaner
> approach. However, it might be better to commit the work done thus far so
> folks can use it before I wander off to take another approach.
> 
> Then, I wondered if anyone actually still uses this stuff. Do you still
> need the code to handle non-projection of complex types?
> 
> Of course, perhaps no one will ever need the hard cases: I've never been
> convinced that unions, repeated lists, or arrays of repeated lists are
> things that any sane data engineer will want to use -- or use more than
> once.
> 
> Thanks,
> 
> - Paul
> 
> 
> On Sat, Dec 30, 2023 at 10:26 PM James Turton  wrote:
> 
>> Hi Luoc and Drill devs!
>> 
>> It's best to email Paul directly since he doesn't follow these lists
>> closely. In the meantime I've prepared a PR of backported fixes for
>> 1.21.2 to the 1.21 branch [1]. I think we can try to get the Netty
>> upgrade that Maksym is working on, and which looks close to done,
>> included? There's at least one CVE  applicable to our current version of
>> Netty...
>> 
>> Regards
>> James
>> 
>> 
>> 1. https://github.com/apache/drill/pull/2860
>> 
>> On 2023/12/11 04:41, luoc wrote:
>>> Hello all,
>>>   1.22 will be a more stable version. This is a digression: Is Paul
>> still interested in participating in the EVF V2 refactoring in the
>> framework? I would like to offer time to assist him.
>>> 
>>> luoc
>>> 
>>>> 2023年12月9日 01:01,Charles Givre  写道:
>>>> 
>>>> Hello all,
>>>> Happy Friday everyone!   I wanted to raise the topic of getting a Drill
>> minor release out the door before the end of the year.   My opinion is that
>> I'd really like to release Drill 1.22 once the integration with Apache
>> Daffodil is complete, but it sounds like that is still a few weeks away.
>>>> 
>>>> What does everyone think about issuing a maintenance release before the
>> end of the year?  There are a number of significant fixes including some
>> security updates and a major bug in the ES plugin that basically makes it
>> unusable.
>>>> Best,
>>>> -- C
>> 
>> 



Next Version

2023-12-08 Thread Charles Givre
Hello all, 
Happy Friday everyone!   I wanted to raise the topic of getting a Drill minor 
release out the door before the end of the year.   My opinion is that I'd 
really like to release Drill 1.22 once the integration with Apache Daffodil is 
complete, but it sounds like that is still a few weeks away.  

What does everyone think about issuing a maintenance release before the end of 
the year?  There are a number of significant fixes including some security 
updates and a major bug in the ES plugin that basically makes it unusable.
Best,
-- C



[jira] [Created] (DRILL-8461) Prevent XXE Attacks in XML Format Plugin

2023-11-03 Thread Charles Givre (Jira)
Charles Givre created DRILL-8461:


 Summary: Prevent XXE Attacks in XML Format Plugin
 Key: DRILL-8461
 URL: https://issues.apache.org/jira/browse/DRILL-8461
 Project: Apache Drill
  Issue Type: Bug
  Components: Format - XML
Affects Versions: 1.21.1
Reporter: Charles Givre
Assignee: Charles Givre
 Fix For: 1.22.0


Drill's XML reader would allow a maliciously crafted XML file to perform an 
_XML eXternal Entity injection_ (XXE)  attack.  This fix disables DTD parsing 
in the XML format plugin and prevents XXE attacks.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Config Questions

2023-10-31 Thread Charles Givre
Hi Mike, 
I had a look at your branch, but I couldn't build it because my machine 
couldn't find your version of Daffodil.  I don't want to push to your branch 
directly w/o permission again, but I looked at your unit tests and have some 
suggestions that might clarify things.  Would you be open to a brief zoom call 
sometime?  I think it might be quicker if we took 15 min and I can explain all 
this vs the back and forth over email. 

If not.. I'll write it up. 
Best,
-- C



Re: Drill 1.21.2

2023-10-24 Thread Charles Givre
Hi James, 
Thanks for sending.  I'm not opposed to a bugfix release, but what are your 
thoughts about also starting to plan for Drill 1.22?  I'm really excited about 
Mike Beckerle's work to integrate Apache Daffodil with Drill.  IMHO, once 
that's done we should release the next Drill so that Daffodil users can 
benefit from Drill and vice versa.

Best,
-- C

> On Oct 24, 2023, at 7:58 AM, James Turton  wrote:
> 
> Hi folks
> 
> It's been a relatively long time since we did a bugfix release and we've 
> accumulated a number of library upgrades and one or two fixes, so should we 
> start working towards get Drill 1.21.2 released?
> 
> Regards
> James



Re: Drill TupleMetadata created from DFDL Schema - how do I inform Drill about it

2023-10-18 Thread Charles Givre
Got it.  I’ll review today and tomorrow and hopefully we can get you unblocked. 
 
Sent from my iPhone

> On Oct 18, 2023, at 18:01, Mike Beckerle  wrote:
> 
> I am very much hoping someone will look at my open PR soon.
> https://github.com/apache/drill/pull/2836
> 
> I am basically blocked on this effort until you help me with one key area
> of that.
> 
> I expect the part I am puzzling over is routine to you, so it will save me
> much effort.
> 
> This is the key area in the DaffodilBatchReader.java code:
> 
>  // FIXME: Next, a MIRACLE occurs.
>  //
>  // We get the dfdlSchemaURI filled in from the query, or a default config
> location
>  // We get the rootName (or null if not supplied) from the query, or a
> default config location
>  // We get the rootNamespace (or null if not supplied) from the query, or
> a default config location
>  // We get the validationMode (true/false) filled in from the query or a
> default config location
>  // We get the dataInputURI filled in from the query, or from a default
> config location
>  //
>  // For a first cut, let's just fake it. :-)
>  boolean validationMode = true;
>  URI dfdlSchemaURI = new URI("schema/complexArray1.dfdl.xsd");
>  String rootName = null;
>  String rootNamespace = null;
>  URI dataInputURI = new URI("data/complexArray1.dat");
> 
> 
> I imagine this is just a few lines of code to grab these from the query,
> and i don't even care about config files for now.
> 
> I gave up on trying to figure out how to do this myself. It was actually
> quite unclear from looking at the other format plugins. The way Drill does
> configuration is obviously motivated by the distributed architecture
> combined with pluggability, but all that combined with the negotiation over
> schemas which extends into runtime, and it all became quite muddy to me. I
> think what I need is super straightforward, so i figured I should just
> ask.
> 
> This is just to get enough working (against local files only) that I can be
> unblocked on creating and testing the rest of the Daffodil-to-Drill
> metadata bridge and data bridge.
> 
> My plan is to get all kinds of data and queries working first but just
> against local-only files.  Fixing it to work in distributed Drill can come
> later.
> 
> -mikeb
> 
>> On Wed, Oct 18, 2023 at 2:11 PM Paul Rogers  wrote:
>> 
>> Hi Charles,
>> 
>> The persistent store is just ZooKeeper, and ZK is known to work poorly as
>> a distributed DB. ZK works great for things like tokens, node registrations
>> and the like. But, ZK scales very poorly for things like schemas (or query
>> profiles or a list of active queries.)
>> 
>> A more scalable approach may be to cache the schemas in each Drillbit,
>> then translate them to Drill's format and include them in each Scan
>> operator definition sent to each execution Drillbit. That solution avoids
>> race conditions when the schemas change while a query is in flight. This
>> is, in fact, the model used for storage plugin definitions. (The storage
>> plugin definitions are, in fact, stored in ZK, but tend to be small and few
>> in number.)
>> 
>> - Paul
>> 
>> 
>>> On Wed, Oct 18, 2023 at 7:51 AM Charles Givre  wrote:
>>> 
>>> Hi Mike,
>>> I hope all is well.  I remembered one other piece which might be useful
>>> for you.  Drill has an interface called a PersistentStore which is used for
>>> storing artifacts such as tokens etc.  I've used it on two occasions: in
>>> the GoogleSheets plugin and the Http plugin.  In both cases, I used it to
>>> store OAuth user tokens which need to be preserved and shared across
>>> drillbits, and also frequently updated.  I was thinking that this might be
>>> useful for caching the DFDL schemata.  If you take a look here:
>>> https://github.com/apache/drill/blob/master/contrib/storage-http/src/main/java/org/apache/drill/exec/store/http/oauth/AccessTokenRepository.java,
>>> 
>>> https://github.com/apache/drill/tree/master/exec/java-exec/src/main/java/org/apache/drill/exec/oauth.
>>> and here
>>> https://github.com/apache/drill/blob/master/contrib/storage-http/src/main/java/org/apache/drill/exec/store/http/HttpStoragePlugin.java,
>>> you can see how I used that.
>>> 
>>> Best,
>>> -- C
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>>> On Oct 13, 2023, at 1:25 PM, Mike Beckerle 
>>> wrote:
>>>> 
>>>> Very helpful.
>>>> 
>>>> Answers to your questions, and comments are below:
>>>> 
>>

Re: Drill TupleMetadata created from DFDL Schema - how do I inform Drill about it

2023-10-18 Thread Charles Givre
Hi Mike, 
I hope all is well.  I remembered one other piece which might be useful for 
you.  Drill has an interface called a PersistentStore which is used for storing 
artifacts such as tokens etc.  I've used it on two occasions: in the 
GoogleSheets plugin and the Http plugin.  In both cases, I used it to store 
OAuth user tokens which need to be preserved and shared across drillbits, and 
also frequently updated.  I was thinking that this might be useful for caching 
the DFDL schemata.  If you take a look here: 
https://github.com/apache/drill/blob/master/contrib/storage-http/src/main/java/org/apache/drill/exec/store/http/oauth/AccessTokenRepository.java,
 
https://github.com/apache/drill/tree/master/exec/java-exec/src/main/java/org/apache/drill/exec/oauth.
 and here 
https://github.com/apache/drill/blob/master/contrib/storage-http/src/main/java/org/apache/drill/exec/store/http/HttpStoragePlugin.java,
 you can see how I used that.

Best,
-- C

  




> On Oct 13, 2023, at 1:25 PM, Mike Beckerle  wrote:
> 
> Very helpful.
> 
> Answers to your questions, and comments are below:
> 
> On Thu, Oct 12, 2023 at 5:14 PM Charles Givre  wrote:
>> HI Mike, 
>> I hope all is well.  I'll take a stab at answering your questions.  But I 
>> have a few questions as well:
>>  
>> 1.  Are you writing a storage or format plugin for DFDL?  My thinking was 
>> that this would be a format plugin, but let me know if you were thinking 
>> differently
> 
> Format plugin.
>  
>> 2.  In traditional deployments, where do people store the DFDL schemata 
>> files?  Are they local or accessible via URL?
> 
> Schemas are stored in files, or in jar files created when packaging a schema 
> project. Hence URI is the preferred identifier for them.  They are not 
> retrieved remotely or anything like that. It's a matter of whether they are 
> in jars on the classpath, directories on the classpath, or just a file 
> location. 
> 
> The source-code of DFDL schemas are often created using other schemas as 
> components, so a single "DFDL schema" may have parts that come from 5 jar 
> files on the classpath e.g., 2 different header schemas, a library schema, 
> and the "main" schema that assembles them all.  Inside schemas they refer to 
> each other via xs:include or xs:import, and the schemaLocation attribute 
> takes a URI to the location of the included/imported schema and those URIs 
> are interpreted this same way we would want Drill to identify the location of 
> a schema. 
> 
> However, really people will want to pre-compile any real non-toy/test DFDL 
> schemas into binary ".bin" files for faster loading. Otherwise Daffodil 
> schema compilation time can be excessive (minutes for large DFDL schemas - 
> for example the DFDL schema for VMF is 180K lines of DFDL). Compiled schemas 
> live in exactly 1 file (relatively small. The compiled form of VMF schema is 
> 8Mbytes). So the path given for schema in Drill sql query, or in the config 
> wants to be allowed to be either a compiled schema or a source-code schema 
> (.xsd) this latter mostly being for test, training, and toy examples that we 
> would compile on-the-fly.  
>  
>> To get the DFDL schema file or URL we have a few options, all of which 
>> revolve around setting a config variable.  For now, let's just say that the 
>> schema file is contained in the same folder as the data.  (We can make this 
>> more sophisticated later...)
> 
> It would make life difficult if the schemas and test data must be 
> co-resident. Most schema projects have these in entirely separate sub-trees. 
> Schema will be under src/main/resources//xsd, compiled schema would be 
> under target/... and test data under src/test/resources/.../data
> 
> For now I think the easiest thing is just we get two URIs. One is for the 
> data, one is for the schema. We access them via getClass().getResource(). 
> 
> We should not worry about caching or anything for now. Once the above works 
> for a decent scope of tests we can worry about making it more convenient to 
> have a library of schemas at one's disposal. 
>  
>> 
>> Here's what you have to do.
>> 
>> 1.  In the formatConfig file, define a String called 'dfdlSchema'.   Note... 
>> config variables must be private and final.  If they aren't it can cause 
>> weird errors that are really difficult to debug.  For some reference, take a 
>> look at the Excel plugin.  
>> (https://github.com/apache/drill/blob/master/contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelFormatConfig.java)
>> 
>> Setting a config variable there will allow a user to set a global schema 
>> definition.  This can

Re: Drill TupleMetadata created from DFDL Schema - how do I inform Drill about it

2023-10-12 Thread Charles Givre
One more thought... As a suggestion, I'd recommend getting the batch reader to 
first work with the DFDL schema file.  Once that's done, Paul and I can assist 
with caching, metastores etc. 
-- C



> On Oct 12, 2023, at 5:13 PM, Charles Givre  wrote:
> 
> HI Mike, 
> I hope all is well.  I'll take a stab at answering your questions.  But I 
> have a few questions as well:
> 
> 1.  Are you writing a storage or format plugin for DFDL?  My thinking was 
> that this would be a format plugin, but let me know if you were thinking 
> differently
> 2.  In traditional deployments, where do people store the DFDL schemata 
> files?  Are they local or accessible via URL?
> 
> To get the DFDL schema file or URL we have a few options, all of which 
> revolve around setting a config variable.  For now, let's just say that the 
> schema file is contained in the same folder as the data.  (We can make this 
> more sophisticated later...)
> 
> Here's what you have to do.
> 
> 1.  In the formatConfig file, define a String called 'dfdlSchema'.   Note... 
> config variables must be private and final.  If they aren't it can cause 
> weird errors that are really difficult to debug.  For some reference, take a 
> look at the Excel plugin.  
> (https://github.com/apache/drill/blob/master/contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelFormatConfig.java)
> 
> Setting a config variable there will allow a user to set a global schema 
> definition.  This can also be configured individually for various workspaces. 
>  So let's say you had PCAP files in one workspace, you could globally set the 
> DFDL file for that and then another workspace which has some other file, you 
> could create another DFDL plugin instance for that. 
> 
> Now, this is all fine and good, but a user might also want to define the 
> schema file at query time.  The good news is that Drill allows you to do that 
> via the table() function. 
> 
> So let's say that we want to use a different schema file than the default, we 
> could do something like this:
> 
> SELECT <fields>
> FROM table(dfs.dfdl_workspace.`myfile` (type=>'dfdl', 
> dfdlSchema=>'path_to_schema'))
> 
> Take a look at the Excel docs 
> (https://github.com/apache/drill/blob/master/contrib/format-excel/README.md) 
> which demonstrate how to write queries like that.  I believe that the 
> parameters in the table function take higher precedence than the parameters 
> from the config.  That would make sense at least.
> 
> 
> 2.  Now that we have the schema file, the next thing would be to convert that 
> into a Drill schema.  Let's say that we have a function called dfdlToDrill 
> that handles the conversion.
> 
> What you'd have to do is in the constructor for the BatchReader, you'd have 
> to set the schema there.  So pseudo code:
> 
> public DFDLBatchReader(DFDLReaderConfig, EasySubScan scan, 
> FileSchemaNegotiator negotiator) {
>   // Other stuff...
>   
>   // Get Drill schema from DFDL
> TupleMetadata schema = dfdlToDrill(...);
>   // Here's the important part
>   negotiator.tableSchema(schema, true);
> }
> 
> The negotiator.tableSchema() accepts two args, a TupleMetadata and a boolean 
> as to whether the schema is final or not.  Once this schema has been added to 
> the negotiator object, you can then create the writers. 
> 
> 
> Take a look here... 
> 
> https://github.com/apache/drill/blob/2ab46a9411a52f12a0f9acb1144a318059439bc4/contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java#L199
> 
> 
> I see Paul just responded so I'll leave you with this.  If you have 
> additional questions, send them our way.  Do take a look at the Excel plugin 
> as I think it will be helpful.
> 
> Best,
> --C
> 
> 
>> On Oct 12, 2023, at 2:58 PM, Mike Beckerle  wrote:
>> 
>> So when a data format is described by a DFDL schema, I can generate
>> equivalent Drill schema (TupleMetadata). This schema is always complete. I
>> have unit tests working with this.
>> 
>> To do this for a real SQL query, I need the DFDL schema to be identified on
>> the SQL query by a file path or URI.
>> 
>> Q: How do I get that DFDL schema File/URI parameter from the SQL query?
>> 
>> Next, assuming I have the DFDL schema identified, I generate an equivalent
>> Drill TupleMetadata from it. (Or, hopefully retrieve it from a cache)
>> 
>> What objects do I call, or what classes do I have to create to make this
>> Drill TupleMetadata available to Drill so it uses it in all the ways a
>> static Drill schema can be useful?
>> 
>> I just need pointers to the code that illustrate how to do this. Thanks

Re: Drill TupleMetadata created from DFDL Schema - how do I inform Drill about it

2023-10-12 Thread Charles Givre
Hi Mike, 
I hope all is well.  I'll take a stab at answering your questions.  But I have 
a few questions as well:

1.  Are you writing a storage or format plugin for DFDL?  My thinking was that 
this would be a format plugin, but let me know if you were thinking differently
2.  In traditional deployments, where do people store the DFDL schemata files?  
Are they local or accessible via URL?

To get the DFDL schema file or URL we have a few options, all of which revolve 
around setting a config variable.  For now, let's just say that the schema file 
is contained in the same folder as the data.  (We can make this more 
sophisticated later...)

Here's what you have to do.

1.  In the formatConfig file, define a String called 'dfdlSchema'.   Note... 
config variables must be private and final.  If they aren't it can cause weird 
errors that are really difficult to debug.  For some reference, take a look at 
the Excel plugin.  
(https://github.com/apache/drill/blob/master/contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelFormatConfig.java)

Setting a config variable there will allow a user to set a global schema 
definition.  This can also be configured individually for various workspaces.  
So let's say you had PCAP files in one workspace, you could globally set the 
DFDL file for that and then another workspace which has some other file, you 
could create another DFDL plugin instance for that. 
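
To make that concrete, here's a rough sketch (not actual plugin code; the class 
and field names are just illustrative) of the shape such a config class tends to 
take, following the Excel plugin's pattern:

import java.util.Collections;
import java.util.List;

import com.fasterxml.jackson.annotation.JsonCreator;
import com.fasterxml.jackson.annotation.JsonInclude;
import com.fasterxml.jackson.annotation.JsonProperty;
import com.fasterxml.jackson.annotation.JsonTypeName;

import org.apache.drill.common.logical.FormatPluginConfig;

@JsonTypeName("dfdl")
@JsonInclude(JsonInclude.Include.NON_DEFAULT)
public class DfdlFormatConfig implements FormatPluginConfig {

  // Config variables must be private and final, as noted above.
  private final List<String> extensions;
  private final String dfdlSchema;   // path or URI of the DFDL schema

  @JsonCreator
  public DfdlFormatConfig(
      @JsonProperty("extensions") List<String> extensions,
      @JsonProperty("dfdlSchema") String dfdlSchema) {
    this.extensions = extensions == null
        ? Collections.singletonList("dat") : extensions;
    this.dfdlSchema = dfdlSchema;
  }

  public List<String> getExtensions() { return extensions; }

  public String getDfdlSchema() { return dfdlSchema; }

  // A real config would also implement equals() and hashCode(), since Drill
  // uses them to compare plugin configurations.
}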

Now, this is all fine and good, but a user might also want to define the schema 
file at query time.  The good news is that Drill allows you to do that via the 
table() function. 

So let's say that we want to use a different schema file than the default, we 
could do something like this:

SELECT <fields>
FROM table(dfs.dfdl_workspace.`myfile` (type=>'dfdl', 
dfdlSchema=>'path_to_schema'))

Take a look at the Excel docs 
(https://github.com/apache/drill/blob/master/contrib/format-excel/README.md) 
which demonstrate how to write queries like that.  I believe that the 
parameters in the table function take higher precedence than the parameters 
from the config.  That would make sense at least.


2.  Now that we have the schema file, the next thing would be to convert that 
into a Drill schema.  Let's say that we have a function called dfdlToDrill that 
handles the conversion.

What you'd have to do is in the constructor for the BatchReader, you'd have to 
set the schema there.  So pseudo code:

public DFDLBatchReader(DFDLReaderConfig, EasySubScan scan, FileSchemaNegotiator 
negotiator) {
  // Other stuff...

  // Get Drill schema from DFDL
  TupleMetadata schema = dfdlToDrill(...);
  // Here's the important part
  negotiator.tableSchema(schema, true);
}

The negotiator.tableSchema() accepts two args, a TupleMetadata and a boolean 
as to whether the schema is final or not.  Once this schema has been added to 
the negotiator object, you can then create the writers. 

Take a look here... 

https://github.com/apache/drill/blob/2ab46a9411a52f12a0f9acb1144a318059439bc4/contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java#L199
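
Filling in the blanks a bit (purely as a sketch: dfdlToDrill() is the hypothetical 
converter above, DfdlReaderConfig is an assumed config holder, and the loader/writer 
calls follow the EVF pattern the Excel reader uses, so treat the exact method names 
as approximate), the constructor ends up looking something like:

public class DfdlBatchReader {   // implements Drill's managed reader interface (omitted)

  private ResultSetLoader loader;
  private RowSetLoader rowWriter;

  public DfdlBatchReader(DfdlReaderConfig config, EasySubScan scan,
                         FileSchemaNegotiator negotiator) {
    // Build the Drill schema from the DFDL schema referenced in the config.
    TupleMetadata schema = dfdlToDrill(config.getDfdlSchema());  // hypothetical helper

    // Hand the schema to Drill and mark it as final (the 'true' flag).
    negotiator.tableSchema(schema, true);

    // Build the result set loader; its row writer is what next() fills
    // as records are parsed.
    loader = negotiator.build();
    rowWriter = loader.writer();
  }
}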


I see Paul just responded so I'll leave you with this.  If you have additional 
questions, send them our way.  Do take a look at the Excel plugin as I think it 
will be helpful.

Best,
--C


> On Oct 12, 2023, at 2:58 PM, Mike Beckerle  wrote:
> 
> So when a data format is described by a DFDL schema, I can generate
> equivalent Drill schema (TupleMetadata). This schema is always complete. I
> have unit tests working with this.
> 
> To do this for a real SQL query, I need the DFDL schema to be identified on
> the SQL query by a file path or URI.
> 
> Q: How do I get that DFDL schema File/URI parameter from the SQL query?
> 
> Next, assuming I have the DFDL schema identified, I generate an equivalent
> Drill TupleMetadata from it. (Or, hopefully retrieve it from a cache)
> 
> What objects do I call, or what classes do I have to create to make this
> Drill TupleMetadata available to Drill so it uses it in all the ways a
> static Drill schema can be useful?
> 
> I just need pointers to the code that illustrate how to do this. Thanks
> 
> -Mike Beckerle
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> On Thu, Oct 12, 2023 at 12:13 AM Paul Rogers  wrote:
> 
>> Mike,
>> 
>> This is a complex question and has two answers.
>> 
>> First, the standard enhanced vector framework (EVF) used by most readers
>> assumes a "pull" model: read each record. This is where the next() comes
>> in: readers just implement this to read the next record. But, the code
>> under EVF works with a push model: the readers write to vectors, and signal
>> the next record. EVF translates the lower-level push model to the
>> higher-level, easier-to-use pull model. The best example of this is the
>> JSON reader which uses Jackson to parse JSON and responds to the
>> corresponding events.
>> 
>> You can thus take over the task of filling a batch of records. I'd have to
>> poke around the code to refresh my memory. Or, you can take a look at the
>> (quite complex) JSON parser, or the EVF itself to see what it does. 

Re: Drill expects pull parsing? Daffodil is event callbacks style

2023-10-12 Thread Charles Givre
Mike, 
I'll add to Paul's comments.  While Drill is expecting an iterator style 
reader, that iterator pattern really only applies to batches.  This concept 
took me a while to wrap my head around, but in any of the batch reader classes, 
it's important to remember that the next() method is really applying to the 
batch of records and not the data source itself.  In other words, it isn't a 
line iterator.  What that means in practice is that you can do whatever you 
want in the next method until the batch is full.
I hope this helps somewhat and doesn't add to the confusion.
Best,
-- C


> On Oct 12, 2023, at 12:13 AM, Paul Rogers  wrote:
> 
> Mike,
> 
> This is a complex question and has two answers.
> 
> First, the standard enhanced vector framework (EVF) used by most readers
> assumes a "pull" model: read each record. This is where the next() comes
> in: readers just implement this to read the next record. But, the code
> under EVF works with a push model: the readers write to vectors, and signal
> the next record. EVF translates the lower-level push model to the
> higher-level, easier-to-use pull model. The best example of this is the
> JSON reader which uses Jackson to parse JSON and responds to the
> corresponding events.
> 
> You can thus take over the task of filling a batch of records. I'd have to
> poke around the code to refresh my memory. Or, you can take a look at the
> (quite complex) JSON parser, or the EVF itself to see what it does. There
> are many unit tests that show this at various levels of abstraction.
> 
> Basically, you have to:
> 
> * Start a batch
> * Ask if you can start the next record (which might be declined if the
> batch is full)
> * Write each field. For complex fields, such as records, recursively do the
> start/end record work.
> * Mark the record as complete.
> 
> You should be able to map event handlers to EVF actions as a result. Even
> though DFDL wants to "drive", it still has to give up control once the
> batch is full. EVF will then handle the (surprisingly complex) task of
> finishing up the batch and returning it as the output of the Scan operator.
> 
> - Paul
> 
> On Wed, Oct 11, 2023 at 6:30 PM Mike Beckerle  wrote:
> 
>> Daffodil parsing generates event callbacks to an InfosetOutputter, which is
>> analogous to a SAX event handler.
>> 
>> Drill is expecting an iterator style of calling next() to advance through
>> the input, i.e., Drill has the control thread and expects to do pull
>> parsing. At least from the code I studied in the format-xml contrib.
>> 
>> Is there any alternative? Before I dig into creating another one of these
>> co-routine-style control inversions (which have proven to be problematic
>> for performance.
>> 



Re: Question about Drill internal data representation for Daffodil tree infosets

2023-10-10 Thread Charles Givre
Hi Mike, 
Thanks for all the work you are doing on Drill.

To answer your question, sub1 should be treated as a map in Drill.  You can 
verify this with the following query:

SELECT drillTypeOf(sub1) FROM...

In general, I'm pretty sure that Drill doesn't output strings that look like 
JSON objects unless they actually are complex objects.

Take a look here for data type functions:  
https://drill.apache.org/docs/data-type-functions/
Best,
-- C


> On Oct 10, 2023, at 7:56 AM, Mike Beckerle  wrote:
> 
> I am trying to understand the options for populating Drill data from a
> Daffodil data parse.
> 
> Suppose you have this JSON
> 
> {"parent": { "sub1": { "a1":1, "a2":2}, "sub2":{"b1":3, "b2":4, "b3":5}}}
> 
> or this equivalent XML:
> 
> <parent>
>   <sub1><a1>1</a1><a2>2</a2></sub1>
>   <sub2><b1>3</b1><b2>4</b2><b3>5</b3></sub2>
> </parent>
> 
> Unlike those texts, Daffodil is going to have a tree data structure where a
> parent node contains two child nodes sub1 and sub2, and each of those has
> children a1, a2, and b1, b2, b3 respectively.
> It's analogous roughly to the DOM tree of the XML, or the tree of nested
> JSON map nodes you'd get back from a JSON parse of that text.
> 
> In Drill to query the JSON like:
> 
> select parent.sub1 from myStructure
> 
> gives you back a single column containing what seems to be a string like
> 
> |sub1|
> --
> | { "a1":1, "a2":2}  |
> 
> So, my question is this. Is this actually a string in Drill, (what is the
> type of sub1?) or is sub1 actually a Drill data row/map node value with two
> node children, that just happens to print out looking like a JSON string?
> 
> Thanks for any insight here.
> 
> Mike Beckerle
> Apache Daffodil PMC | daffodil.apache.org
> OGF DFDL Workgroup Co-Chair | www.ogf.org/ogf/doku.php/standards/dfdl/dfdl
> Owl Cyber Defense | www.owlcyberdefense.com



Re: Security Vulnerability - Action Required: “Incorrect Permission Assignment for Critical Resource” vulnerability in org.apache.drill.exec_drill-jdbc-all:1.4.0

2023-09-21 Thread Charles Givre
Hi there, 
Thanks for sending.  Could you please verify that this vulnerability actually 
exists in current Drill?   We are currently on Drill 1.21.1.  I'm also fairly 
certain that we are no longer using the Hadoop versions listed below. 
Thanks,
-- C



> On Sep 21, 2023, at 9:20 AM, James Watt  wrote:
> 
> Hi there,
>  I think the method 
> `org.apache.hadoop.mapreduce.filecache.ClientDistributedCacheManager.checkPermissionOfOther(FileSystem
>  fs, Path path, FsAction action, Map<URI, FileStatus> statCache)` may have an 
> “Incorrect Permission Assignment for Critical Resource” vulnerability which is 
> vulnerable in org.apache.drill.exec_drill-jdbc-all:1.4.0. It shares 
> similarities to a recent CVE disclosure CVE-2017-3166 in the same project 
> "apache/hadoop" project.
> The source vulnerability information is as follows:
>> Vulnerability Detail:
>> CVE Identifier: CVE-2017-3166
>> Description: In Apache Hadoop versions 2.6.1 to 2.6.5, 2.7.0 to 2.7.3, and 
>> 3.0.0-alpha1, if a file in an encryption zone with access permissions that 
>> make it world readable is localized via YARN's localization mechanism, that 
>> file will be stored in a world-readable location and can be shared freely 
>> with any application that requests to localize that file.
>> Reference:  
>> https://nvd.nist.gov/vuln/detail/CVE-2017-3166
>> Patch: 
>> https://github.com/apache/hadoop/commit/a47d8283b136aab5b9fa4c18e6f51fa799d91a29
> 
> Vulnerability Description: The vulnerability is present in the class  
> org.apache.hadoop.mapreduce.filecache.ClientDistributedCacheManager  of 
> method  checkPermissionOfOther(FileSystem fs, Path path, FsAction action, 
> Map<URI, FileStatus> statCache), which is responsible for checking the 
> permissions of other files in the distributed cache. But the check snippet is 
> similar to the vulnerable snippet for CVE-2017-3166 and may have the same 
> consequence as CVE-2017-3166: a file in an encryption zone with access 
> permissions  will be stored in a world-readable location and can be freely 
> shared with any application that requests the file to be localized. 
> Therefore, maybe you need to fix the vulnerability with much the same fix 
> code as the CVE-2017-3166 patch. 
>  Maybe the version of the hadoop your project depends on is vulnerable, 
> you can update it to a newer one.
> Considering the potential risks it may have, I am willing to cooperate 
> with you to verify, address, and report the identified vulnerability promptly 
> through responsible means. If you require any further information or 
> assistance, please do not hesitate to reach out to me. Thank you and look 
> forward to hearing from you soon.
> 
> Best regards,
> Yiheng Cao
> 
> 



Re: Question on Representing DFDL/XSD choice data for Drill (Unions required?)

2023-09-18 Thread Charles Givre
Hi Mike, 
Let me answer your question with some queries:

 >>> select * from dfs.test.`record.json`;
+--+
|  record   
   |
+--+
| 
[{"a":{"a1":5.0,"a2":6.0},"b":{}},{"a":{},"b":{"b1":55.0,"b2":66.0,"b3":77.0}},{"a":{"a1":7.0,"a2":8.0},"b":{}},{"a":{},"b":{"b1":77.0,"b2":88.0,"b3":99.0}}]
 |
+--+

Now... I can flatten that like this:

>>> select flatten(record) AS data from dfs.test.`record.json`;
+--+
| data |
+--+
| {"a":{"a1":5.0,"a2":6.0},"b":{}} |
| {"a":{},"b":{"b1":55.0,"b2":66.0,"b3":77.0}} |
| {"a":{"a1":7.0,"a2":8.0},"b":{}} |
| {"a":{},"b":{"b1":77.0,"b2":88.0,"b3":99.0}} |
+--+
4 rows selected (0.298 seconds)

You asked about filtering.   For this, I broke it up into a subquery, but 
here's how I did that:

>>> SELECT data['a'], data['b']
2..semicolon> FROM (select flatten(record) AS data from dfs.test.`record.json`)
3..semicolon> WHERE data['b']['b1'] > 60.0;
++-+
| EXPR$0 | EXPR$1  |
++-+
| {} | {"b1":77.0,"b2":88.0,"b3":99.0} |
++-+
1 row selected (0.379 seconds)

I did all this without the union data type.  

Does this make sense?
Best,
-- C


> On Sep 13, 2023, at 11:08 AM, Mike Beckerle  wrote:
> 
> I'm thinking whether a first prototype of DFDL integration to Drill should
> just use JSON.
> 
> But please consider this JSON:
> 
> { "record": [
>{ "a": { "a1":5, "a2":6 } },
>{ "b": { "b1":55, "b2":66, "b3":77 } }
>{ "a": { "a1":7, "a2":8 } },
>{ "b": { "b1":77, "b2":88, "b3":99 } }
>  ] }
> 
> It corresponds to this text data file, parsed using Daffodil:
> 
>105062556677107082778899
> 
> The file is a stream of records. The first byte is a tag value 1 for type
> 'a' records, and 2 for type 'b' records.
> The 'a' records are 2 fixed length fields, each 2 bytes long, named a1 and
> a2. They are integers.
> The 'b' records are 3 fixed length fields, each 2 bytes long, named b1, b2,
> and b3. They are integers.
> This kind of format is very common, even textualized like this (from COBOL
> programs for example)
> 
> Can Drill query the JSON above to get (b1, b2) where b1 > 10 ?
> (and ... does this require the experimental Union feature?)
> 
> b1, b2
> -
> (55, 66)
> (77, 88)
> 
> I ask because in an XML Schema or DFDL schema choices with dozens of
> 'branches' are very common.
> Ex: schema for the above data:
> 
> <xs:element name="record" maxOccurs="unbounded">
>   <xs:complexType>
>     <xs:choice>
>       <xs:element name="a">
>         <xs:complexType>
>           <xs:sequence>
>             ... many child elements let's say named a1, a2, ...
>           </xs:sequence>
>         </xs:complexType>
>       </xs:element>
>       <xs:element name="b">
>         <xs:complexType>
>           <xs:sequence>
>             ... many child elements let's say named b1, b2, b3
>             ...
>           </xs:sequence>
>         </xs:complexType>
>       </xs:element>
>     </xs:choice>
>   </xs:complexType>
> </xs:element>
> 
> To me XSD choice naturally requires a Union feature of some sort.
> If that's expermental still in Drill ... what to do?
> 
> On Sun, Aug 6, 2023 at 10:19 AM Charles S. Givre 
> wrote:
> 
>> @mbeckerle 
>> You've encountered another challenge that exists in Drill reading data
>> without a schema.
>> Let me explain a bit about this and I'm going to use the JSON reader as an
>> example. First Drill requires data to be homogeneous. Drill does have a
>> Union vector type which allows heterogeneous data however this is a bit
>> experimental and I wouldn't recommend using it. Also, it really just shifts
>> schema inconsistencies to the user.
>> 
>> For instance, let's say you have a column consisting of strings and
>> floats. What happens if you try to do something like this:
>> 
>> SELECT sum(mixed_col)-- orSELECT ORDER BY mixed_col
>> 
>> Remembering that Drill is distributed and if you have a column with the
>> same name and you try to do these operations, they will fail.
>> 
>> Let's say we have data like this:
>> 
>> [
>>  {
>> 'col1': 'Hi there',
>> 'col2': 5.0
>>  },
>>  {
>> 'col1':True,
>> 'col2': 4,
>> 'col3': 'foo'
>>  }
>> ]
>> 
>> In older versions of Drill, this kind of data, this would throw all kinds
>> of SchemaChangeExceptions. However, in recent versions of Drill, @jnturton
>>  submitted apache#2638
>>  which overhauled implicit
>> casting. What this meant for users is that col2 in the above would be
>> automatically cast to a FLOAT and col1 would be automatically cast to a
>> VARCHAR.
>> 

Re: Mongodb metastore..

2023-08-23 Thread Charles Givre
Hi there,
Welcome to Drill!  I think you have several problems going on there. First of 
all, the mongo metastore has nothing to do with being able to query mongo with 
Drill. The first thing you need to get working is the mongoDB storage plugin.  
You didn't say whether that is in fact working or not.   Executing a `SHOW 
DATABASES` query in Drill should show you whether Mongo is queryable or not.

Regarding the Mongo metastore, I think you've misunderstood its purpose.  
Drill's metastore exists to provide metadata for query planning for file 
systems.  This data needs to be stored somewhere and the metastore gives you 
the option of storing it in iceberg tables, JDBC databases or MongoDB.  To the 
best of my knowledge, the metastore only works for file systems and thus you 
won't see anything in the metastore from a Drill-MongoDB connection.

With that said, MongoDB itself may provide a lot of the info you are seeking.  
Take a look at the various information_schema queries in Drill and try running 
those on your MongoDB connection.
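
For example, assuming your Mongo storage plugin is named `mongo`, queries 
along these lines will show you what Drill can actually see:

SHOW DATABASES;

SELECT TABLE_SCHEMA, TABLE_NAME
FROM INFORMATION_SCHEMA.`TABLES`
WHERE TABLE_SCHEMA LIKE 'mongo%';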
Good luck!
-- C



> On Aug 23, 2023, at 6:49 AM, Surendar Karunanandan 
>  wrote:
> 
> Team,
>I am using the latest version of Drill, installed as a cluster.  We 
> are trying to query MongoDB as plugin storage.
> 
>I could not find any metadata getting updated (like 
> num_rows, last modified, etc.).
> 
>I even tried to implement the mongo metastore as given in 
> the https://drill.apache.org/docs/mongo-metastore/ but it does not seem to 
> be working.
> 
>Mongodb meta tables not getting populated with  metadata. Any 
> help  is appreciated.
> 
> 
> 
> 
> Thanks & regards,
> Sk
> 
>  
> 
> Surendar
> Database Administrator
> Text SBT to 77513
> +91 8939953306
> Schedule A M­eeting 
> 
> 
> CONFIDENTIALITY NOTICE: This email and any files attached are intended solely 
> for the use of the individual or entity to whom they are addressed. This 
> message may contain information that is confidential, privileged, and/or 
> protected by work product. If you have received this email in error, please 
> notify the sender and delete the email immediately.





[jira] [Created] (DRILL-8453) Add XSD Support to XML Reader (Part 1)

2023-08-21 Thread Charles Givre (Jira)
Charles Givre created DRILL-8453:


 Summary: Add XSD Support to XML Reader (Part 1)
 Key: DRILL-8453
 URL: https://issues.apache.org/jira/browse/DRILL-8453
 Project: Apache Drill
  Issue Type: Improvement
  Components: Format - XML
Affects Versions: 1.21.1
Reporter: Charles Givre
Assignee: Charles Givre
 Fix For: 1.21.2


This PR is a part of a series to add better support for reading XML data to 
Drill.  One of the main challenges is that XML data does not have a way of 
inferring data types, nor does it have a way of detecting arrays.  

The only way to do this really well is to have a schema.  Some XML files link a 
schema definition file to the data.  This PR adds the capability for Drill to 
map XSD schema files into Drill schemas.  

The current plan is as follows: Part 1 of this PR simply adds the reader but 
adds no new user detectable functionality.  Part 2 will include the actual 
integration with the XML reader.  Part 3 will include the ability to read 
arrays.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: Drill SQL questions - JSON context

2023-08-21 Thread Charles Givre
One thing I'll add is that if you have nested data, Drill will allow you to 
query the data using both dotted notation and bracket notation.  IE:

SELECT foo.bar FROM

AND

SELECT foo['bar'] FROM

Both will work, however, I STRONGLY recommend using the bracket notation.  I've 
found that the dotted notation will cause some strange and inconsistent errors. 
 The bracket notation is much more reliable.
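
For the JSON you posted, that looks something like this (an untested sketch; 
the file path is a placeholder, and array elements take a numeric index):

SELECT t.`a`, t.`c`[0]['d'] AS d, t.`c`[0]['f'][1]['g'] AS g
FROM dfs.test.`nested.json` t;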
Best,
-- C


> On Aug 18, 2023, at 5:00 PM, Mike Beckerle  wrote:
> 
> I'm using Apache Daffodil in the mode where it outputs JSON data. (For the
> moment, until we build a tighter integration. This is my conceptual test
> framework for that integration.)
> 
> I have parsed data to create this JSON which represents 2-level nested
> repeating subrecords.
> 
> All the simple fields are int.
> 
> [{"a":1,  "b":2,  "c":[{"d":3,  "e":4,  "f":[{"g":5,  "h":6 },
> {"g":7,  "h":8 }]},
>   {"d":9,  "e":10, "f":[{"g":11, "h":12},
> {"g":13, "h":14}]}]},
> {"a":21, "b":22, "c":[{"d":23, "e":24, "f":[{"g":25, "h":26 },
> {"g":27, "h":28 }]},
>   {"d":29, "e":30, "f":[{"g":31, "h":32},
> {"g":33, "h":34}]}]}]
> 
> So, the top level is a vector of maps,
> within that, field "c" is a vector of maps,
> and within "c" is a field f which is a vector of maps.
> 
> The reason I created this is I'm trying to understand the arrays and how
> they work with Drill SQL.
> 
> I'm trying to figure out how to get this rowset of 3 rows from a query, and
> I'm stumped.
> 
>  a   b   d   e   g   h
> ( 1,  2,  3,  4,  5,  6)
> ( 1,  2,  9, 10, 13, 14)
> (21, 22, 29, 30, 33, 34)
> 
> This is the SQL that is my conceptual framework, but I'm sure it won't work.
> 
> SELECT a, b, c.d AS d, c.e AS e, c.f.g AS g, c.f.h AS h
> FROM ... the json file...
> WHERE g mod 10 == 3 OR g == 5
> 
> But I know it's not going to be that easy to get the query to traverse the
> vector inside the vector.
> 
> From the doc, the FLATTEN operator seems to be needed, but I can't really
> figure it out.
> 
> This is what all my data is like. Trees of nested vectors of sub-records.
> 
> Can anyone advise on what the SQL might look like, or where there's an
> example doing something like this I can learn from?
> 
> Thanks for any help
> 
> Mike Beckerle
> Apache Daffodil PMC | daffodil.apache.org
> OGF DFDL Workgroup Co-Chair | www.ogf.org/ogf/doku.php/standards/dfdl/dfdl
> Owl Cyber Defense | www.owlcyberdefense.com





Re: is there a way to provide inline array metadata to inform the xml_reader?

2023-08-18 Thread Charles Givre
Hey Mike,
So it looks like I was wrong and the XML reader does not have the support for 
Arrays.  However... Once DRILL-8450 is merged, I'll add the readers for arrays. 
  The XML reader itself still won't be able to dynamically detect them until we 
finish the XSD support, but at least the infra will be there.
Best,
-- C


> On Aug 15, 2023, at 11:39 PM, Charles Givre  wrote:
> 
> I stand corrected...  It does not look like the XML reader has any support 
> for arrays.
> -- C
> 
>> On Aug 15, 2023, at 12:01 AM, Paul Rogers  wrote:
>> 
>> IIRC, the syntax for the "provided schema" for arrays is "ARRAY<type>" such
>> as "ARRAY<VARCHAR>". This works, however, only if the XML reader uses the
>> (very complex) EVF framework and has a way to control parsing based on the
>> data type (and to set the data type based on parsing). The JSON reader has
>> such an integration. Charles, did you do the work to add that kind of
>> dynamic state machine to the XML parser?
>> 
>> - Paul
>> 
>> On Mon, Aug 14, 2023 at 6:28 PM Charles Givre  wrote:
>> 
>>> Hi Mike,
>>> It is theoretically possible but I don't have an example of the syntax.
>>> As you've probably figured out, Drill vectors have both a type and data
>>> mode.  The mode is either NULLABLE or REPEATED if I remember correctly.
>>> Thus, you could tell Drill via the inline schema that the data mode for a
>>> given field is REPEATED and that would be the Drill equivalent of an
>>> Array.  I've never actually done this, so I don't really know if it would
>>> work for inline schemata but I'd assume that it would.
>>> 
>>> I'll do some digging to see whether I have any examples of this.
>>> Best,
>>> --C
>>> 
>>> 
>>> 
>>> 
>>> 
>>>> On Aug 14, 2023, at 3:36 PM, Mike Beckerle  wrote:
>>>> 
>>>> I'm trying to get my Drill SQL queries to produce the right thing from
>>> XML.
>>>> 
>>>> A major thing that you can't easily infer from looking at just XML data
>>> is
>>>> what is an array. XML lacks an array starting indicator.
>>>> 
>>>> Is there an inline schema notation in the Drill Query language for
>>>> array-ness, so that one can inform Drill what is an array?
>>>> 
>>>> For example this provides simple types for all the fields directly in the
>>>> query.
>>>> 
>>>> @Test
>>>> 
>>>> public void testSimpleProvidedSchema() throws Exception {
>>>> 
>>>> String sql = "SELECT * FROM table(cp.`xml/simple_with_datatypes.xml`
>>>> (type => 'xml', schema " +
>>>> 
>>>>  "=> 'inline=(`int_field` INT, `bigint_field` BIGINT, `float_field`
>>>> FLOAT, `double_field` DOUBLE, `boolean_field` " +
>>>> 
>>>>  "BOOLEAN, `date_field` DATE, `time_field` TIME, `timestamp_field`
>>>> TIMESTAMP, `string_field`" +
>>>> 
>>>>  " VARCHAR, `date2_field` DATE properties {`drill.format` =
>>>> `MM/dd/`})'))";
>>>> 
>>>> RowSet results = client.queryBuilder().sql(sql).rowSet();
>>>> 
>>>> assertEquals(2, results.rowCount());
>>>> 
>>>> 
>>>> Can one also tell Drill what fields or child elements are arrays?
>>> 
>>> 
> 





Re: Drop MapR, CDH, HDP profiles and plugins from master?

2023-08-17 Thread Charles Givre
I’d agree with removing them. We’ve had a lot of issues with the MapR 
certificates expiring and breaking Drill builds. 

Sent from my iPhone

> On Aug 17, 2023, at 09:19, James Turton  wrote:
> 
> The subject line says it -- should we drop these? As far as I know they're 
> unmaintained and they add to an already monstrously big effective POM...
> 
> Regards
> James


Re: is there a way to provide inline array metadata to inform the xml_reader?

2023-08-15 Thread Charles Givre
I stand corrected...  It does not look like the XML reader has any support for 
arrays.
-- C

> On Aug 15, 2023, at 12:01 AM, Paul Rogers  wrote:
> 
> IIRC, the syntax for the "provided schema" for arrays is "ARRAY<type>" such
> as "ARRAY<VARCHAR>". This works, however, only if the XML reader uses the
> (very complex) EVF framework and has a way to control parsing based on the
> data type (and to set the data type based on parsing). The JSON reader has
> such an integration. Charles, did you do the work to add that kind of
> dynamic state machine to the XML parser?
> 
> - Paul
> 
> On Mon, Aug 14, 2023 at 6:28 PM Charles Givre  wrote:
> 
>> Hi Mike,
>> It is theoretically possible but I don't have an example of the syntax.
>> As you've probably figured out, Drill vectors have both a type and data
>> mode.  The mode is either NULLABLE or REPEATED if I remember correctly.
>> Thus, you could tell Drill via the inline schema that the data mode for a
>> given field is REPEATED and that would be the Drill equivalent of an
>> Array.  I've never actually done this, so I don't really know if it would
>> work for inline schemata but I'd assume that it would.
>> 
>> I'll do some digging to see whether I have any examples of this.
>> Best,
>> --C
>> 
>> 
>> 
>> 
>> 
>>> On Aug 14, 2023, at 3:36 PM, Mike Beckerle  wrote:
>>> 
>>> I'm trying to get my Drill SQL queries to produce the right thing from
>> XML.
>>> 
>>> A major thing that you can't easily infer from looking at just XML data
>> is
>>> what is an array. XML lacks an array starting indicator.
>>> 
>>> Is there an inline schema notation in the Drill Query language for
>>> array-ness, so that one can inform Drill what is an array?
>>> 
>>> For example this provides simple types for all the fields directly in the
>>> query.
>>> 
>>> @Test
>>> 
>>> public void testSimpleProvidedSchema() throws Exception {
>>> 
>>> String sql = "SELECT * FROM table(cp.`xml/simple_with_datatypes.xml`
>>> (type => 'xml', schema " +
>>> 
>>>   "=> 'inline=(`int_field` INT, `bigint_field` BIGINT, `float_field`
>>> FLOAT, `double_field` DOUBLE, `boolean_field` " +
>>> 
>>>   "BOOLEAN, `date_field` DATE, `time_field` TIME, `timestamp_field`
>>> TIMESTAMP, `string_field`" +
>>> 
>>>   " VARCHAR, `date2_field` DATE properties {`drill.format` =
>>> `MM/dd/`})'))";
>>> 
>>> RowSet results = client.queryBuilder().sql(sql).rowSet();
>>> 
>>> assertEquals(2, results.rowCount());
>>> 
>>> 
>>> Can one also tell Drill what fields or child elements are arrays?
>> 
>> 





Re: is there a way to provide inline array metadata to inform the xml_reader?

2023-08-15 Thread Charles Givre
Hey Paul,
The XML reader was implemented using the EVF2 Framework and in theory does have 
writers for repeated data types.  I'm not sure to what extent this has been 
tested.
Best,
-- C

> On Aug 15, 2023, at 12:01 AM, Paul Rogers  wrote:
> 
> IIRC, the syntax for the "provided schema" for arrays is "ARRAY<type>" such
> as "ARRAY<VARCHAR>". This works, however, only if the XML reader uses the
> (very complex) EVF framework and has a way to control parsing based on the
> data type (and to set the data type based on parsing). The JSON reader has
> such an integration. Charles, did you do the work to add that kind of
> dynamic state machine to the XML parser?
> 
> - Paul
> 
> On Mon, Aug 14, 2023 at 6:28 PM Charles Givre  wrote:
> 
>> Hi Mike,
>> It is theoretically possible but I don't have an example of the syntax.
>> As you've probably figured out, Drill vectors have both a type and data
>> mode.  The mode is either NULLABLE or REPEATED if I remember correctly.
>> Thus, you could tell Drill via the inline schema that the data mode for a
>> given field is REPEATED and that would be the Drill equivalent of an
>> Array.  I've never actually done this, so I don't really know if it would
>> work for inline schemata but I'd assume that it would.
>> 
>> I'll do some digging to see whether I have any examples of this.
>> Best,
>> --C
>> 
>> 
>> 
>> 
>> 
>>> On Aug 14, 2023, at 3:36 PM, Mike Beckerle  wrote:
>>> 
>>> I'm trying to get my Drill SQL queries to produce the right thing from
>> XML.
>>> 
>>> A major thing that you can't easily infer from looking at just XML data
>> is
>>> what is an array. XML lacks an array starting indicator.
>>> 
>>> Is there an inline schema notation in the Drill Query language for
>>> array-ness, so that one can inform Drill what is an array?
>>> 
>>> For example this provides simple types for all the fields directly in the
>>> query.
>>> 
>>> @Test
>>> 
>>> public void testSimpleProvidedSchema() throws Exception {
>>> 
>>> String sql = "SELECT * FROM table(cp.`xml/simple_with_datatypes.xml`
>>> (type => 'xml', schema " +
>>> 
>>>   "=> 'inline=(`int_field` INT, `bigint_field` BIGINT, `float_field`
>>> FLOAT, `double_field` DOUBLE, `boolean_field` " +
>>> 
>>>   "BOOLEAN, `date_field` DATE, `time_field` TIME, `timestamp_field`
>>> TIMESTAMP, `string_field`" +
>>> 
>>>   " VARCHAR, `date2_field` DATE properties {`drill.format` =
>>> `MM/dd/`})'))";
>>> 
>>> RowSet results = client.queryBuilder().sql(sql).rowSet();
>>> 
>>> assertEquals(2, results.rowCount());
>>> 
>>> 
>>> Can one also tell Drill what fields or child elements are arrays?
>> 
>> 





Re: is there a way to provide inline array metadata to inform the xml_reader?

2023-08-14 Thread Charles Givre
Hi Mike,
It is theoretically possible but I don't have an example of the syntax.  As 
you've probably figured out, Drill vectors have both a type and data mode.  The 
mode is either NULLABLE or REPEATED if I remember correctly.  Thus, you could 
tell Drill via the inline schema that the data mode for a given field is 
REPEATED and that would be the Drill equivalent of an Array.  I've never 
actually done this, so I don't really know if it would work for inline schemata 
but I'd assume that it would.
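
If it does work, I'd expect the syntax to look roughly like this (completely 
untested; the field names are lifted from your earlier example, and ARRAY<INT> 
is the schema syntax I'd expect Drill to accept for a repeated column):

SELECT * FROM table(cp.`xml/simple_with_datatypes.xml`
  (type => 'xml',
   schema => 'inline=(`int_field` ARRAY<INT>, `string_field` VARCHAR)'))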

I'll do some digging to see whether I have any examples of this.
Best,
--C





> On Aug 14, 2023, at 3:36 PM, Mike Beckerle  wrote:
> 
> I'm trying to get my Drill SQL queries to produce the right thing from XML.
> 
> A major thing that you can't easily infer from looking at just XML data is
> what is an array. XML lacks an array starting indicator.
> 
> Is there an inline schema notation in the Drill Query language for
> array-ness, so that one can inform Drill what is an array?
> 
> For example this provides simple types for all the fields directly in the
> query.
> 
> @Test
> 
> public void testSimpleProvidedSchema() throws Exception {
> 
>  String sql = "SELECT * FROM table(cp.`xml/simple_with_datatypes.xml`
> (type => 'xml', schema " +
> 
>"=> 'inline=(`int_field` INT, `bigint_field` BIGINT, `float_field`
> FLOAT, `double_field` DOUBLE, `boolean_field` " +
> 
>"BOOLEAN, `date_field` DATE, `time_field` TIME, `timestamp_field`
> TIMESTAMP, `string_field`" +
> 
>" VARCHAR, `date2_field` DATE properties {`drill.format` =
> `MM/dd/`})'))";
> 
>  RowSet results = client.queryBuilder().sql(sql).rowSet();
> 
>  assertEquals(2, results.rowCount());
> 
> 
> Can one also tell Drill what fields or child elements are arrays?





[jira] [Created] (DRILL-8450) Add Data Type Inference to XML Format Plugin

2023-08-08 Thread Charles Givre (Jira)
Charles Givre created DRILL-8450:


 Summary: Add Data Type Inference to XML Format Plugin
 Key: DRILL-8450
 URL: https://issues.apache.org/jira/browse/DRILL-8450
 Project: Apache Drill
  Issue Type: Improvement
  Components: Format - XML
Affects Versions: 1.21.1
Reporter: Charles Givre
Assignee: Charles Givre
 Fix For: 1.22.0






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: Attending Community over Code 2023 ?

2023-08-06 Thread Charles Givre
Hi Mike,
If your talk was accepted, I was thinking about going. :-(
-- C

> On Aug 4, 2023, at 11:10 AM, Mike Beckerle  wrote:
> 
> Is anyone from the drill project planning to attend the Community over Code
> 2023 conference?
> 
> 
>> 
>> 





Re: Drill representation of XML Complex Type with Simple Content

2023-08-06 Thread Charles Givre
Hi Mike,
I'm traveling to the BlackHat conference so please forgive the short responses. 
 In its current implementation, Drill can't identify arrays in XML.  Since we 
don't know the schema in advance, when it hits the field int1, it has no way of 
knowing whether it is a repeated field or not.  This is one of the reasons why I 
started working on the XSD reader because that will allow Drill to read XML 
arrays.  When I did the original implementation, I thought about having a 
lookahead function which would attempt to see whether the next field has the 
same name and if it does, write an array.  However there are a lot of edge 
cases with that as well.  It's something that could be added, but I wanted to 
get the v1 of the XML reader merged.

Another option would have been to include some separator and a UDF to break 
that field up into a list.  In any event, that's why all those fields get 
concatenated into one field.
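
For example, if the reader had joined the repeated values with a known 
separator (it doesn't today, so this is purely hypothetical), you could 
recover a list on the SQL side with something like Drill's split() function, 
assuming it is available in your build:

SELECT split(int1, ',') AS int1_values
FROM cp.`xml/foo.xml`;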
Best,
-- C



> On Aug 4, 2023, at 4:50 PM, Mike Beckerle  wrote:
> 
> Consider this XML:
> 
> <foo>
>   <int1 x="2">A</int1>
>   <int1 x="7">B</int1>
>   <int1 y="3">Y</int1>
>   <char1 y="4">C</char1>
> </foo>
> 
> And this drill query:
> 
> SELECT * FROM cp.`xml/foo.xml`
> 
> I am using datalevel = 1.
> 
> The results I get (calling RowSet results.print() in my junit test) are:
> 
> #: `attributes` STRUCT<`int1_x` VARCHAR, `int1_y` VARCHAR, `char1_y` 
> VARCHAR>, `int1` VARCHAR, `char1` VARCHAR
> 0: {"27", "3", "4"}, "ABY", "C"
> 
> So questions:
> 
> First, why is it constructing 1 row, not multiple?
> 
> The only way I expect to get only 1 row out is if I did a group-by with the 
> whole row-set having only 1 key value.
> 
> Second, why is it concatenating the value strings?
> 
> I'd expect to write like: "SELECT '1' AS key, * FROM ...theTable... GROUP BY 
> key", and only then would I expect concatenation if everything is a string 
> and concat is somehow the default grouping operation. Even then it's a 
> stretch.
> 
> Here's what I expected to get out after inspecting the schema that was 
> inferred from the data:
> 
> 0: {"2", null, null}, "A", null
> 1: {"7", null, null}, "B", null
> 2: {null, "3", null}, "Y", null
> 3: {null, null, "4"}, null, "C"
> 
> Those correspond to the 3 columns "attributes", "int1", "char1", where 
> attributes is itself { int1_x, int1_y, char1_y}.
> 
> Third, how would I change my query to get out what I expect?
> 
> Lastly, what is the rationale for the name "int1_x" (also int1_y, and 
> char1_y) ?
> I expected to see two separate attributes columns: "attributes_int1" and 
> "attributes_char1" as maps with non-prefixed children named  x, y and y 
> respectively.
> 
> I guess I just don't grok the rationale for how queries work against XML.
> 
> The natural XML schema for this XML document is:
> 
> <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
>   <xs:element name="foo">
>     <xs:complexType>
>       <xs:sequence>
>         <xs:element name="int1" maxOccurs="unbounded">
>           <xs:complexType>
>             <xs:simpleContent>
>               <xs:extension base="xs:string">
>                 <xs:attribute name="x" type="xs:string"/>
>                 <xs:attribute name="y" type="xs:string"/>
>               </xs:extension>
>             </xs:simpleContent>
>           </xs:complexType>
>         </xs:element>
>         <xs:element name="char1">
>           <xs:complexType>
>             <xs:simpleContent>
>               <xs:extension base="xs:string">
>                 <xs:attribute name="y" type="xs:string"/>
>               </xs:extension>
>             </xs:simpleContent>
>           </xs:complexType>
>         </xs:element>
>       </xs:sequence>
>     </xs:complexType>
>   </xs:element>
> </xs:schema>
> I need to synthesize the same TupleMetadata from this schema that the current 
> XML reader infers incrementally, so I really need to understand the 
> rationale, because I wouldn't expect this choice to be entirely flattened 
> including the attributes.
> 
> Thanks for any help
> 
> Mike Beckerle
> Apache Daffodil PMC | daffodil.apache.org 
> OGF DFDL Workgroup Co-Chair | www.ogf.org/ogf/doku.php/standards/dfdl/dfdl 
> 
> Owl Cyber Defense | www.owlcyberdefense.com 
> 
> 
> 





Re: Issue Querying Drill in Mongo

2023-08-04 Thread Charles Givre
Hi Srinath,
Unfortunately, the mailing list doesn't support images.  Could you resend the 
text of the queries?

> On Aug 4, 2023, at 5:07 AM, Srikanth Nallela  
> wrote:
> 
>  Hi ,
> 
> 1 .I am trying to get data based on object id in mongo not getting any data. 
> I am not sure what is issue.
> 
> 
> 
> 
> 
> 
> 2 . When I ran without where condition I get this  record
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> This email is from Coherent Corp. or a Coherent group company. The contents 
> of this email, including any attachments, are intended solely for the 
> intended recipient and may contain Coherent proprietary and/or confidential 
> information and material. Any review, use, disclosure, re-transmission, 
> dissemination, distribution, retention, or copying of this email and any of 
> its contents by any person other than the intended recipient is strictly 
> prohibited. If you received this email in error, please immediately notify 
> the sender and destroy any and all copies of this email and any attachments. 
> To contact us directly, please email postmas...@coherent.com 
> .
> 
> Privacy: For information about how Coherent processes personal information, 
> please review our privacy policy at https://ii-vi.com/privacy/.





Re: xml_reader branch - and getting Drill to use the XSD-derived metadata

2023-07-31 Thread Charles Givre
Hi Mike, 
That's awesome work!  I wasn't expecting you to work on this but I really 
appreciate it.   My plan was to do this in two phases.  The first was simply to 
get Drill to be able to read an XSD schema and translate that into a Drill 
schema.  Part two would be to integrate that into the XML reader and the HTTP 
plugin which also uses the XML reader.  
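
In EVF terms, I'd expect part two to mostly boil down to handing the 
XSD-derived TupleMetadata to the reader's schema negotiator before the reader 
opens, roughly like this (a sketch from memory, not code from the branch; 
schemaFromXsd and drillSchemaFromXsd are hypothetical names):

  TupleMetadata schemaFromXsd = drillSchemaFromXsd(xsdFile);  // hypothetical helper
  negotiator.tableSchema(schemaFromXsd, true);  // true = schema is complete
  ResultSetLoader loader = negotiator.build();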

I'd be happy to discuss this with you.  Could we shoot for tomorrow (Tues) 
afternoon sometime?  I'm basically free except at 4PM.
Thx!
-- C




> On Jul 31, 2023, at 5:55 PM, Mike Beckerle  wrote:
> 
> I added a first cut at attribute support -
> https://github.com/cgivre/drill/pull/6
> 
> Ok, so given that the xsd_reader can now map a small usable subset of XSD
> to drill metadata, it seems next we want Drill to start using this
> metadata.
> 
> I am not sure where to start here. Where would the metadata from the XSD
> plug into the query planning/building?
> 
> Can we schedule a time to discuss?
> 
> I am broadly available the rest of this week. Tues at 1pm or 3pm, any time
> the rest of the week, but sooner is better as I also have time to work on
> this currently.
> 
> Mike Beckerle
> Apache Daffodil PMC | daffodil.apache.org
> OGF DFDL Workgroup Co-Chair | www.ogf.org/ogf/doku.php/standards/dfdl/dfdl
> Owl Cyber Defense | www.owlcyberdefense.com



Re: drill tests not passing

2023-07-24 Thread Charles Givre
h.data.eventsource-8.3.0.jar:/home/mbeckerle/.m2/repository/de/huxhorn/lilith/de.huxhorn.lilith.data.logging/8.3.0/de.huxhorn.lilith.data.logging-8.3.0.jar:/home/mbeckerle/.m2/repository/de/huxhorn/sulky/de.huxhorn.sulky.formatting/8.3.0/de.huxhorn.sulky.formatting-8.3.0.jar:/home/mbeckerle/.m2/repository/de/huxhorn/lilith/de.huxhorn.lilith.sender/8.3.0/de.huxhorn.lilith.sender-8.3.0.jar:/home/mbeckerle/.m2/repository/de/huxhorn/sulky/de.huxhorn.sulky.io/8.3.0/de.huxhorn.sulky.io-8.3.0.jar:/home/mbeckerle/.m2/repository/de/huxhorn/lilith/de.huxhorn.lilith.logback.converter-classic/8.3.0/de.huxhorn.lilith.logback.converter-classic-8.3.0.jar:/home/mbeckerle/.m2/repository/de/huxhorn/lilith/de.huxhorn.lilith.data.converter/8.3.0/de.huxhorn.lilith.data.converter-8.3.0.jar:/home/mbeckerle/.m2/repository/de/huxhorn/lilith/de.huxhorn.lilith.logback.classic/8.3.0/de.huxhorn.lilith.logback.classic-8.3.0.jar:/home/mbeckerle/.m2/repository/de/huxhorn/lilith/de.huxhorn.lilith.logback.appender.multiplex-core/8.3.0/de.huxhorn.lilith.logback.appender.multiplex-core-8.3.0.jar:/home/mbeckerle/.m2/repository/de/huxhorn/sulky/de.huxhorn.sulky.ulid/8.3.0/de.huxhorn.sulky.ulid-8.3.0.jar:/home/mbeckerle/.m2/repository/org/apache/kerby/kerb-client/1.0.0/kerb-client-1.0.0.jar:/home/mbeckerle/.m2/repository/org/apache/kerby/kerby-config/1.0.0/kerby-config-1.0.0.jar:/home/mbeckerle/.m2/repository/org/apache/kerby/kerb-common/1.0.0/kerb-common-1.0.0.jar:/home/mbeckerle/.m2/repository/org/apache/kerby/kerb-crypto/1.0.0/kerb-crypto-1.0.0.jar:/home/mbeckerle/.m2/repository/org/apache/kerby/kerb-util/1.0.0/kerb-util-1.0.0.jar:/home/mbeckerle/.m2/repository/org/apache/kerby/kerb-core/1.0.0/kerb-core-1.0.0.jar:/home/mbeckerle/.m2/repository/org/apache/kerby/kerby-pkix/1.0.0/kerby-pkix-1.0.0.jar:/home/mbeckerle/.m2/repository/org/apache/kerby/kerby-asn1/1.0.0/kerby-asn1-1.0.0.jar:/home/mbeckerle/.m2/repository/org/apache/kerby/kerby-util/1.0.0/kerby-util-1.0.0.jar:/home/mbeckerle/.m2/repository/org/apache/kerby/kerb-simplekdc/1.0.0/kerb-simplekdc-1.0.0.jar:/home/mbeckerle/.m2/repository/org/apache/kerby/kerb-admin/1.0.0/kerb-admin-1.0.0.jar:/home/mbeckerle/.m2/repository/org/apache/kerby/kerb-server/1.0.0/kerb-server-1.0.0.jar:/home/mbeckerle/.m2/repository/org/apache/kerby/kerb-identity/1.0.0/kerb-identity-1.0.0.jar:/home/mbeckerle/.m2/repository/org/apache/kerby/kerby-xdr/1.0.0/kerby-xdr-1.0.0.jar:exec/jdbc/src/test/resources/storage-plugins.json
> com.intellij.rt.junit.JUnitStarter -ideVersion5 -junit4
> org.apache.drill.exec.store.xml.xsd.TestXSDSchema,testSimpleXSD
> Unrecognized option: --add-opens
> Error: Could not create the Java Virtual Machine.
> Error: A fatal exception has occurred. Program will exit.
> 
> Process finished with exit code 1
> 
> On Mon, Jul 17, 2023 at 9:08 AM Mike Beckerle  wrote:
> 
>> That looks like a great start at what we will need for DFDL.
>> 
>> I will study what you've done carefully and get back to you with
>> questions.
>> 
>> 
>> On Fri, Jul 14, 2023 at 5:59 PM Charles Givre  wrote:
>> 
>>> Hi Mike,
>>> One more thing... I've been working on an XSD Reader for Drill for some
>>> time.  (This is still very buggy)
>>> https://github.com/cgivre/drill/tree/xsd_reader
>>> 
>>> What this does is attempt to convert a XML XSD file into a Drill Schema.
>>> 
>>> Best,
>>> -- C
>>> 
>>> 
>>> 
>>> On Jul 14, 2023, at 2:20 PM, Charles Givre  wrote:
>>> 
>>> Mike,
>>> Are you able to build Drill w/o the tests?  If so, my suggestion is
>>> really just to start working on the DFDL extensions.  I've been doing Drill
>>> stuff for far too long and really haven't needed to run the full battery of
>>> unit tests locally.  As long as you can build it and can execute individual
>>> unit tests, you should be ok.  Others may disagree, but for what you're
>>> doing, I'd think it would be fine.
>>> Best,
>>> -- C
>>> 
>>> 
>>> 
>>> On Jul 14, 2023, at 2:04 PM, Mike Beckerle  wrote:
>>> 
>>> Update: I did a clean and install -DskipTests=true.
>>> 
>>> Then I tried the mvn test using the non-UTC timezone stuff, as suggested.
>>> 
>>> But alas, it still fails, this time the failure unique and is only in
>>> "Java Execution Engine"
>>> 
>>> [ERROR] Failed to execute goal
>>> org.apache.maven.plugins:maven-dependency-plugin:3.4.0:unpack
>>> (unpack-vector-types) on project drill-java-exec: Artifact has not been
>>> packaged yet. When used on reactor artifact, unpack should be executed
>>> after packaging: see MDEP-98. -> [Help 1]
>>> 
>>> The command and complete trace output are b

Re: drill tests not passing

2023-07-14 Thread Charles Givre
Hi Mike, 
One more thing... I've been working on an XSD Reader for Drill for some time.  
(This is still very buggy)  https://github.com/cgivre/drill/tree/xsd_reader

 What this does is attempt to convert an XML XSD file into a Drill Schema.  
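
The end goal is roughly this kind of mapping: an XSD declaration such as

  <xs:element name="int_field" type="xs:int"/>
  <xs:element name="string_field" type="xs:string"/>

becomes the sort of column metadata you would otherwise build by hand with 
Drill's SchemaBuilder (a toy illustration of the idea, not code from that 
branch):

  TupleMetadata schema = new SchemaBuilder()
      .addNullable("int_field", MinorType.INT)
      .addNullable("string_field", MinorType.VARCHAR)
      .buildSchema();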
Best,
-- C



> On Jul 14, 2023, at 2:20 PM, Charles Givre  wrote:
> 
> Mike, 
> Are you able to build Drill w/o the tests?  If so, my suggestion is really 
> just to start working on the DFDL extensions.  I've been doing Drill stuff 
> for far too long and really haven't needed to run the full battery of unit 
> tests locally.  As long as you can build it and can execute individual unit 
> tests, you should be ok.  Others may disagree, but for what you're doing, I'd 
> think it would be fine.
> Best,
> -- C
> 
> 
> 
>> On Jul 14, 2023, at 2:04 PM, Mike Beckerle  wrote:
>> 
>> Update: I did a clean and install -DskipTests=true.
>> 
>> Then I tried the mvn test using the non-UTC timezone stuff, as suggested.
>> 
>> But alas, it still fails, this time the failure unique and is only in "Java 
>> Execution Engine"
>> 
>> [ERROR] Failed to execute goal 
>> org.apache.maven.plugins:maven-dependency-plugin:3.4.0:unpack 
>> (unpack-vector-types) on project drill-java-exec: Artifact has not been 
>> packaged yet. When used on reactor artifact, unpack should be executed after 
>> packaging: see MDEP-98. -> [Help 1]
>> 
>> The command and complete trace output are below. 
>> 
>> I need assistance on how to proceed. 
>> 
>> Complete trace from the mvn test is attached. 
>>  
>> 
>> On Thu, Jul 13, 2023 at 1:13 PM Mike Beckerle > <mailto:mbecke...@apache.org>> wrote:
>>> To answer questions:
>>> 
>>> 1. Paul: This is a 100% stock build. All I have done is clone the repo 
>>> (master branch). Make a new git branch (in case I make future changes). Try 
>>> to build (success) and test (failed so far). 
>>> 
>>> 2. James: The /opt/drill directory I created is owned by my userid and has 
>>> full read/write access for all the development activities. I just put it 
>>> there so it would have a shorter path to fix the first Hive-related glitch 
>>> I encountered with the Linux 255 limit on file pathname length. 
>>> 
>>> I will try the suggested maven command line for non-UTC and see if things 
>>> improve. 
>>> 
>>> The challenge for me as a newby is how do I know if I have everything 
>>> properly configured? 
>>> 
>>> Can I just turn off building and testing of the Hive-related stuff in some 
>>> supported/well-known way? 
>>> 
>>> If so, I would suggest I'd like to turn off not just Hive, but as much as 
>>> possible. I really just need the embedded drill to work. 
>>> 
>>> I would agree with @Charles Givre <mailto:cgi...@gmail.com>  that a contrib 
>>> package addition is the ideal approach and that's what I'll be attempting. 
>>> 
>>> -mikeb
>>> 
>>> On Thu, Jul 13, 2023 at 10:59 AM Charles Givre >> <mailto:cgi...@gmail.com>> wrote:
>>>> I'll add some heresy here... IMHO, for the purposes of developing a DFDL 
>>>> extension, you probably don't need all the Drill tests to run.  For your 
>>>> project, my suggestion would be to add a module to the contrib package and 
>>>> that way your changes are relatively self contained.  
>>>> Best,
>>>> -- C
>>>> 
>>>> 
>>>> 
>>>> > On Jul 13, 2023, at 10:27 AM, James Turton >>> > <mailto:dz...@apache.org>> wrote:
>>>> > 
>>>> > Hi Mike
>>>> > 
>>>> > Here's the command line I use to run tests on a machine that's not in 
>>>> > the UTC time zone (plus some unrelated memory size arguments).
>>>> > 
>>>> > mvn test -Djunit.args="-Duser.timezone=UTC -Duser.language=en 
>>>> > -Duser.region=US" -DmemoryMb=2560 -DdirectMemoryMb=2560
>>>> > 
>>>> > I have one other question to add to Paul's comments - does the OS user 
>>>> > that you're running Maven under have write access to all of the source 
>>>> > tree that you put at /opt/drill?
>>>> > 
>>>> > On 2023/07/11 22:12, Paul Rogers wrote:
>>>> >> Hi Mike,
>>>> >> 
>>>> >> A quick glance at the log suggests a failure in the tests for the JSON
>>>> >> reader, in the Mongo extended types. Dr

Re: drill tests not passing

2023-07-14 Thread Charles Givre
Mike, 
Are you able to build Drill w/o the tests?  If so, my suggestion is really just 
to start working on the DFDL extensions.  I've been doing Drill stuff for far 
too long and really haven't needed to run the full battery of unit tests 
locally.  As long as you can build it and can execute individual unit tests, 
you should be ok.  Others may disagree, but for what you're doing, I'd think it 
would be fine.
Best,
-- C



> On Jul 14, 2023, at 2:04 PM, Mike Beckerle  wrote:
> 
> Update: I did a clean and install -DskipTests=true.
> 
> Then I tried the mvn test using the non-UTC timezone stuff, as suggested.
> 
> But alas, it still fails, this time the failure unique and is only in "Java 
> Execution Engine"
> 
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-dependency-plugin:3.4.0:unpack 
> (unpack-vector-types) on project drill-java-exec: Artifact has not been 
> packaged yet. When used on reactor artifact, unpack should be executed after 
> packaging: see MDEP-98. -> [Help 1]
> 
> The command and complete trace output are below. 
> 
> I need assistance on how to proceed. 
> 
> Complete trace from the mvn test is attached. 
>  
> 
> On Thu, Jul 13, 2023 at 1:13 PM Mike Beckerle  <mailto:mbecke...@apache.org>> wrote:
>> To answer questions:
>> 
>> 1. Paul: This is a 100% stock build. All I have done is clone the repo 
>> (master branch). Make a new git branch (in case I make future changes). Try 
>> to build (success) and test (failed so far). 
>> 
>> 2. James: The /opt/drill directory I created is owned by my userid and has 
>> full read/write access for all the development activities. I just put it 
>> there so it would have a shorter path to fix the first Hive-related glitch I 
>> encountered with the Linux 255 limit on file pathname length. 
>> 
>> I will try the suggested maven command line for non-UTC and see if things 
>> improve. 
>> 
>> The challenge for me as a newby is how do I know if I have everything 
>> properly configured? 
>> 
>> Can I just turn off building and testing of the Hive-related stuff in some 
>> supported/well-known way? 
>> 
>> If so, I would suggest I'd like to turn off not just Hive, but as much as 
>> possible. I really just need the embedded drill to work. 
>> 
>> I would agree with @Charles Givre <mailto:cgi...@gmail.com>  that a contrib 
>> package addition is the ideal approach and that's what I'll be attempting. 
>> 
>> -mikeb
>> 
>> On Thu, Jul 13, 2023 at 10:59 AM Charles Givre > <mailto:cgi...@gmail.com>> wrote:
>>> I'll add some heresy here... IMHO, for the purposes of developing a DFDL 
>>> extension, you probably don't need all the Drill tests to run.  For your 
>>> project, my suggestion would be to add a module to the contrib package and 
>>> that way your changes are relatively self contained.  
>>> Best,
>>> -- C
>>> 
>>> 
>>> 
>>> > On Jul 13, 2023, at 10:27 AM, James Turton >> > <mailto:dz...@apache.org>> wrote:
>>> > 
>>> > Hi Mike
>>> > 
>>> > Here's the command line I use to run tests on a machine that's not in the 
>>> > UTC time zone (plus some unrelated memory size arguments).
>>> > 
>>> > mvn test -Djunit.args="-Duser.timezone=UTC -Duser.language=en 
>>> > -Duser.region=US" -DmemoryMb=2560 -DdirectMemoryMb=2560
>>> > 
>>> > I have one other question to add to Paul's comments - does the OS user 
>>> > that you're running Maven under have write access to all of the source 
>>> > tree that you put at /opt/drill?
>>> > 
>>> > On 2023/07/11 22:12, Paul Rogers wrote:
>>> >> Hi Mike,
>>> >> 
>>> >> A quick glance at the log suggests a failure in the tests for the JSON
>>> >> reader, in the Mongo extended types. Drill's date/time support has
>>> >> historically been fragile. Some tests only work if your machine is set to
>>> >> use the UTC time zone (or Java is told to pretend that the time is UTC.)
>>> >> The Mongo types test failure seems to be around a date/time test so maybe
>>> >> this is the issue?
>>> >> 
>>> >> There are also failures indicating that the Drillbit (Drill server) died.
>>> >> Not sure how this can happen, as tests run Drill embedded (or used to.)
>>> >> Looking earlier in the logs, it seems that the Drillbit didn't start due 
>>> >> to
>>> &g

Re: drill tests not passing

2023-07-13 Thread Charles Givre
Hi Mike,
You can just build Drill with the -DskipTests=true option and that should
work.  I do my development in IntelliJ, and just run the relevant unit
tests there.  Then to test Drill in its entirety, I'll use the CI in GitHub
which works most of the time. ;-)


-- C

On Thu, Jul 13, 2023 at 1:14 PM Mike Beckerle  wrote:

> To answer questions:
>
> 1. Paul: This is a 100% stock build. All I have done is clone the repo
> (master branch). Make a new git branch (in case I make future changes). Try
> to build (success) and test (failed so far).
>
> 2. James: The /opt/drill directory I created is owned by my userid and has
> full read/write access for all the development activities. I just put it
> there so it would have a shorter path to fix the first Hive-related glitch
> I encountered with the Linux 255 limit on file pathname length.
>
> I will try the suggested maven command line for non-UTC and see if things
> improve.
>
> The challenge for me as a newby is how do I know if I have everything
> properly configured?
>
> Can I just turn off building and testing of the Hive-related stuff in some
> supported/well-known way?
>
> If so, I would suggest I'd like to turn off not just Hive, but *as much as
> possible*. I really just need the embedded drill to work.
>
> I would agree with @Charles Givre   that a contrib
> package addition is the ideal approach and that's what I'll be attempting.
>
> -mikeb
>
> On Thu, Jul 13, 2023 at 10:59 AM Charles Givre  wrote:
>
> > I'll add some heresy here... IMHO, for the purposes of developing a DFDL
> > extension, you probably don't need all the Drill tests to run.  For your
> > project, my suggestion would be to add a module to the contrib package
> and
> > that way your changes are relatively self contained.
> > Best,
> > -- C
> >
> >
> >
> > > On Jul 13, 2023, at 10:27 AM, James Turton  wrote:
> > >
> > > Hi Mike
> > >
> > > Here's the command line I use to run tests on a machine that's not in
> > the UTC time zone (plus some unrelated memory size arguments).
> > >
> > > mvn test -Djunit.args="-Duser.timezone=UTC -Duser.language=en
> > -Duser.region=US" -DmemoryMb=2560 -DdirectMemoryMb=2560
> > >
> > > I have one other question to add to Paul's comments - does the OS user
> > that you're running Maven under have write access to all of the source
> tree
> > that you put at /opt/drill?
> > >
> > > On 2023/07/11 22:12, Paul Rogers wrote:
> > >> Hi Mike,
> > >>
> > >> A quick glance at the log suggests a failure in the tests for the JSON
> > >> reader, in the Mongo extended types. Drill's date/time support has
> > >> historically been fragile. Some tests only work if your machine is set
> > to
> > >> use the UTC time zone (or Java is told to pretend that the time is
> UTC.)
> > >> The Mongo types test failure seems to be around a date/time test so
> > maybe
> > >> this is the issue?
> > >>
> > >> There are also failures indicating that the Drillbit (Drill server)
> > died.
> > >> Not sure how this can happen, as tests run Drill embedded (or used
> to.)
> > >> Looking earlier in the logs, it seems that the Drillbit didn't start
> > due to
> > >> UDF (user-defined function) failures:
> > >>
> > >> Found duplicated function in drill-custom-lower.jar:
> > >> custom_lower(VARCHAR-REQUIRED)
> > >> Found duplicated function in built-in: lower(VARCHAR-REQUIRED)
> > >>
> > >> Not sure how this could occur: it should have failed in all builds.
> > >>
> > >> Also:
> > >>
> > >> File
> > >>
> >
> /opt/drill/exec/java-exec/target/org.apache.drill.exec.udf.dynamic.TestDynamicUDFSupport/home/drill/happy/udf/staging/drill-custom-lower-sources.jar
> > >> does not exist on file system file:///
> > >>
> > >> This is complaining that Drill needs the source code (not just class
> > file)
> > >> for its built-in functions. Again, this should not fail in a standard
> > >> build, because if it did, it would fail in all builds.
> > >>
> > >> There are other odd errors as well.
> > >>
> > >> Perhaps we should ask: is this a "stock" build? Check out Drill and
> run
> > >> tests? Or, have you already started making changes for your project?
> > >>
> > >> - Paul
> > >>
> > >>
> > >> On Tue, Jul 11, 2023 at 9:07 AM Mike Beckerle 
> > wrote:
> > >>
> > >>> I have drill building and running its tests. Some tests fail: [ERROR]
> > >>> Tests run: 4366, Failures: 2, Errors: 1, Skipped: 133
> > >>>
> > >>> I am wondering if there is perhaps some setup step that I missed in
> the
> > >>> instructions.
> > >>>
> > >>> I have attached the output from the 'mvn clean install
> > -DskipTests=false'
> > >>> execution. (zipped)
> > >>> I am running on Ubuntu 20.04, definitely have Java 8 setup.
> > >>>
> > >>> I'm hoping someone can skim it and spot the issue(s).
> > >>>
> > >>> Thanks for any help
> > >>>
> > >>> Mike Beckerle
> > >>> Apache Daffodil PMC | daffodil.apache.org
> > >>> OGF DFDL Workgroup Co-Chair |
> > www.ogf.org/ogf/doku.php/standards/dfdl/dfdl
> > >>> Owl Cyber Defense | www.owlcyberdefense.com
> > >>>
> > >>>
> > >>>
> > >
> >
> >
>


Re: drill tests not passing

2023-07-13 Thread Charles Givre
I'll add some heresy here... IMHO, for the purposes of developing a DFDL 
extension, you probably don't need all the Drill tests to run.  For your 
project, my suggestion would be to add a module to the contrib package and that 
way your changes are relatively self contained.  
Best,
-- C



> On Jul 13, 2023, at 10:27 AM, James Turton  wrote:
> 
> Hi Mike
> 
> Here's the command line I use to run tests on a machine that's not in the UTC 
> time zone (plus some unrelated memory size arguments).
> 
> mvn test -Djunit.args="-Duser.timezone=UTC -Duser.language=en 
> -Duser.region=US" -DmemoryMb=2560 -DdirectMemoryMb=2560
> 
> I have one other question to add to Paul's comments - does the OS user that 
> you're running Maven under have write access to all of the source tree that 
> you put at /opt/drill?
> 
> On 2023/07/11 22:12, Paul Rogers wrote:
>> Hi Mike,
>> 
>> A quick glance at the log suggests a failure in the tests for the JSON
>> reader, in the Mongo extended types. Drill's date/time support has
>> historically been fragile. Some tests only work if your machine is set to
>> use the UTC time zone (or Java is told to pretend that the time is UTC.)
>> The Mongo types test failure seems to be around a date/time test so maybe
>> this is the issue?
>> 
>> There are also failures indicating that the Drillbit (Drill server) died.
>> Not sure how this can happen, as tests run Drill embedded (or used to.)
>> Looking earlier in the logs, it seems that the Drillbit didn't start due to
>> UDF (user-defined function) failures:
>> 
>> Found duplicated function in drill-custom-lower.jar:
>> custom_lower(VARCHAR-REQUIRED)
>> Found duplicated function in built-in: lower(VARCHAR-REQUIRED)
>> 
>> Not sure how this could occur: it should have failed in all builds.
>> 
>> Also:
>> 
>> File
>> /opt/drill/exec/java-exec/target/org.apache.drill.exec.udf.dynamic.TestDynamicUDFSupport/home/drill/happy/udf/staging/drill-custom-lower-sources.jar
>> does not exist on file system file:///
>> 
>> This is complaining that Drill needs the source code (not just class file)
>> for its built-in functions. Again, this should not fail in a standard
>> build, because if it did, it would fail in all builds.
>> 
>> There are other odd errors as well.
>> 
>> Perhaps we should ask: is this a "stock" build? Check out Drill and run
>> tests? Or, have you already started making changes for your project?
>> 
>> - Paul
>> 
>> 
>> On Tue, Jul 11, 2023 at 9:07 AM Mike Beckerle  wrote:
>> 
>>> I have drill building and running its tests. Some tests fail: [ERROR]
>>> Tests run: 4366, Failures: 2, Errors: 1, Skipped: 133
>>> 
>>> I am wondering if there is perhaps some setup step that I missed in the
>>> instructions.
>>> 
>>> I have attached the output from the 'mvn clean install -DskipTests=false'
>>> execution. (zipped)
>>> I am running on Ubuntu 20.04, definitely have Java 8 setup.
>>> 
>>> I'm hoping someone can skim it and spot the issue(s).
>>> 
>>> Thanks for any help
>>> 
>>> Mike Beckerle
>>> Apache Daffodil PMC | daffodil.apache.org
>>> OGF DFDL Workgroup Co-Chair | www.ogf.org/ogf/doku.php/standards/dfdl/dfdl
>>> Owl Cyber Defense | www.owlcyberdefense.com
>>> 
>>> 
>>> 
> 



Re: Drill and Highly Hierarchical Data from Daffodil

2023-07-11 Thread Charles Givre
Hi Mike, 
When you say "you want all of them", can you clarify a bit about what you'd 
want the data to look like?
Best,
-- C



> On Jul 11, 2023, at 12:33 PM, Mike Beckerle  wrote:
> 
> In designing the integration of Apache Daffodil into Drill, I'm trying to
> figure out how queries would look operating on deeply nested data.
> 
> Here's an example.
> 
> This is the path to many geo-location latLong field pairs in some
> "messageSet" data:
> 
> messageSet/noc_message[*]/message_content/content/vmf/payload/message/K05_17/overlay_message/r1_group/item[*]/points_group/item[*]/latLong
> 
> This is sort-of like XPath, except in the above I have put "[*]" to
> indicate the child elements that are vectors. You can see there are 3
> nested vectors here.
> 
> Beneath that path are these two fields, which are what I would want out of
> my query, along with some fields from higher up in the nest.
> 
> entity_latitude_1/degrees
> entity_longitude_1/degrees
> 
> The tutorial information here
> 
>https://drill.apache.org/docs/selecting-nested-data-for-a-column/
> 
> describes how to index into JSON arrays with specific integer values, but I
> don't want specific integers, I want all values of them.
> 
> Can someone show me what a hypothetical Drill query would look like that
> pulls out all the values of this latLong pair?
> 
> My stab is:
> 
> SELECT pairs.entity_latitude_1.degrees AS lat,
> pairs.entity_longitude_1.degrees AS lon FROM
> messageSet.noc_message[*].message_content.content.vmf.payload.message.K05_17.overlay_message.r1_group.item[*].points_group.item[*].latLong
> AS pairs
> 
> I'm not at all sure about the vectors in that though.
> 
> The other idea was this quasi-notation (that I'm making up on the fly here)
> which treats each vector as a table.
> 
> SELECT pairs.entity_latitude_1.degrees AS lat,
> pairs.entity_longitude_1.degrees AS lon FROM
>  messageSet.noc_message AS messages,
> 
> messages.message_content.content.vmf.payload.message.K05_17.overlay_message.r1_group.item
> AS parents
>  parents.points_group.item AS items
>  items.latLong AS pairs
> 
> I have no idea if that makes any sense at all for Drill
> 
> Any help greatly appreciated.
> 
> -Mike Beckerle



Re: Newby: First attempt to build drill - failure

2023-07-11 Thread Charles Givre
Methinks the Hive plugin could probably use some attention. With that said, I 
don't know how much use it actually gets.  Yes... a ticket would probably be in 
order.
Best,
-- C



> On Jul 11, 2023, at 10:38 AM, Mike Beckerle  wrote:
> 
> Should there be a ticket created about this:
> 
> /home/mbeckerle/dataiti/opensource/drill/contrib/storage-hive/hive-exec-shade/target/classes/org/apache/hadoop/hive/metastore/api/ThriftHiveMetastore$drop_partition_by_name_with_environment_context_args$drop_partition_by_name_with_environment_context_argsTupleSchemeFactory.class
> 
> The largest part of that path is the file name part which has
> "drop_partition_by_name_with_environment_context_args" appearing twice in
> the class file name. This appears to be a generated name so we should be
> able to shorten it.
> 
> 
> On Tue, Jul 11, 2023 at 12:27 AM James Turton  wrote:
> 
>> Good news and welcome to Drill!
>> 
>> I haven't heard of anyone runing into this problem before, and I build
>> Drill under the directory /home/james/Development/apache/drill which
>> isn't far off of what you tried in terms of length. I do see the
>> 280-character path cited by Maven below though. Perhaps in your case the
>> drill-hive-exec-shaded was downloaded from the Apache Snapshots repo,
>> rather than built locally, and this issue only presents itself if the
>> maven-dependency-plugin must unpack a very long file path from a
>> downloaded jar.
>> 
>> 
>> On 2023/07/10 18:23, Mike Beckerle wrote:
>>> Never mind. The file name was > 255 long, so I have installed the drill
>>> build tree in /opt and now the path is shorter than 255.
>>> 
>>> 
>>> On Mon, Jul 10, 2023 at 12:00 PM Mike Beckerle 
>> wrote:
>>> 
 I'm trying to build the current master branch as of today 2023-07-10.
 
 It fails due to a file-name too long issue.
 
 The command I issued is just "mvn clean install -DskipTests" per the
 instructions.
 
 I'm running on Linux, Ubuntu 20.04. Java 8.
 
 [INFO] --- maven-dependency-plugin:3.4.0:unpack (unpack) @
 drill-hive-exec-shaded ---
 [INFO] Configured Artifact:
 
>> org.apache.drill.contrib.storage-hive:drill-hive-exec-shaded:1.22.0-SNAPSHOT:jar
 [INFO] Unpacking
 
>> /home/mbeckerle/dataiti/opensource/drill/contrib/storage-hive/hive-exec-shade/target/drill-hive-exec-shaded-1.22.0-SNAPSHOT.jar
 to
 
>> /home/mbeckerle/dataiti/opensource/drill/contrib/storage-hive/hive-exec-shade/target/classes
 with includes "**/**" and excludes ""
 [INFO]
 
 [INFO] Reactor Summary for Drill : 1.22.0-SNAPSHOT:
 [INFO]
 [INFO] Drill :  SUCCESS [
  3.974 s]
 [INFO] Drill : Tools :  SUCCESS [
  0.226 s]
 [INFO] Drill : Tools : Freemarker codegen . SUCCESS [
  3.762 s]
 [INFO] Drill : Protocol ... SUCCESS [
  5.001 s]
 [INFO] Drill : Common . SUCCESS [
  4.944 s]
 [INFO] Drill : Logical Plan ... SUCCESS [
  5.991 s]
 [INFO] Drill : Exec : . SUCCESS [
  0.210 s]
 [INFO] Drill : Exec : Memory :  SUCCESS [
  0.179 s]
 [INFO] Drill : Exec : Memory : Base ... SUCCESS [
  2.373 s]
 [INFO] Drill : Exec : RPC . SUCCESS [
  2.436 s]
 [INFO] Drill : Exec : Vectors . SUCCESS [
 54.917 s]
 [INFO] Drill : Contrib : .. SUCCESS [
  0.138 s]
 [INFO] Drill : Contrib : Data : ... SUCCESS [
  0.143 s]
 [INFO] Drill : Contrib : Data : TPCH Sample ... SUCCESS [
  1.473 s]
 [INFO] Drill : Metastore :  SUCCESS [
  0.144 s]
 [INFO] Drill : Metastore : API  SUCCESS [
  4.366 s]
 [INFO] Drill : Metastore : Iceberg  SUCCESS [
  3.940 s]
 [INFO] Drill : Exec : Java Execution Engine ... SUCCESS
>> [01:04
 min]
 [INFO] Drill : Exec : JDBC Driver using dependencies .. SUCCESS [
  7.332 s]
 [INFO] Drill : Exec : JDBC JAR with all dependencies .. SUCCESS [
 16.304 s]
 [INFO] Drill : On-YARN  SUCCESS [
  5.477 s]
 [INFO] Drill : Metastore : RDBMS .. SUCCESS [
  6.704 s]
 [INFO] Drill : Metastore : Mongo .. SUCCESS [
  3.621 s]
 [INFO] Drill : Contrib : Storage : Kudu ... SUCCESS [
  6.693 s]
 [INFO] Drill : Contrib : Format : XML . SUCCESS [
  3.511 s]
 [INFO] Drill : Contrib : Storage : HTTP 

Re: A deadline for Drill + Daffodil Integration - ApacheCon in Oct.

2023-07-07 Thread Charles Givre
Concur +1 From me!
-- C



> On Jul 6, 2023, at 12:46 PM, Ted Dunning  wrote:
> 
> That's a cool abstract.
> 
> 
> 
> On Thu, Jul 6, 2023 at 8:29 AM Mike Beckerle  wrote:
> 
>> I decided the only way to force getting this Drill + Daffodil integration
>> done, or at least started, is to have a deadline.
>> 
>> So I submitted this abstract below for the upcoming "Community over Code"
>> (formerly known as ApacheCon) conference this fall (Oct 7-10)
>> 
>> I'm hoping this forces some of the refactoring that is gating other efforts
>> and fixes in Daffodil at the same time.
>> 
>> *Direct Query of Arbitrary Data Formats using Apache Drill and Apache
>> Daffodil*
>> 
>> 
>> *Suppose you have data in an ad-hoc data format like **EDIFACT, ISO8583,
>> Asterix, some COBOL FD, or any other kind of data. **You can now describe
>> it with a Data Format Description Language (DFDL) schema, then u**sing
>> Apache Drill, you can directly query that data and those queries can also
>> incorporate data from any of Apache Drill's other array of data sources.
>> **This
>> talk will describe the integration of Apache Drill with Apache Daffodil's
>> DFDL implementation. **This deep integration implements Drill's metadata
>> model in terms of the Daffodil DFDL metadata model, and implements Drill's
>> data model in terms of the Daffodil DFDL Infoset API. This enables Drill
>> queries to operate intelligently on DFDL-described data without the cost of
>> data conversion into an expensive intermediate form like JSON or XML. **The
>> talk will highlight the specific challenges in this integration and the
>> lessons learned that are applicable to integration of other Apache projects
>> having their own metadata and data models. *
>> 



[jira] [Created] (DRILL-8438) Bump YAUAA to 7.19.2

2023-05-23 Thread Charles Givre (Jira)
Charles Givre created DRILL-8438:


 Summary: Bump YAUAA to 7.19.2
 Key: DRILL-8438
 URL: https://issues.apache.org/jira/browse/DRILL-8438
 Project: Apache Drill
  Issue Type: Task
  Components: Functions - Drill
Reporter: Charles Givre
Assignee: Niels Basjes


Bump YAUAA to latest version.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (DRILL-8437) Add Header Index Pagination

2023-05-21 Thread Charles Givre (Jira)
Charles Givre created DRILL-8437:


 Summary: Add Header Index Pagination
 Key: DRILL-8437
 URL: https://issues.apache.org/jira/browse/DRILL-8437
 Project: Apache Drill
  Issue Type: Improvement
  Components: Storage - HTTP
Affects Versions: 1.21.1
Reporter: Charles Givre
Assignee: Charles Givre
 Fix For: 1.22.0


Some APIs include pagination fields in the HTTP response headers.  This PR adds 
a new pagination method called Header Index which supports that.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (DRILL-8434) Add Median Function

2023-05-15 Thread Charles Givre (Jira)
Charles Givre created DRILL-8434:


 Summary: Add Median Function
 Key: DRILL-8434
 URL: https://issues.apache.org/jira/browse/DRILL-8434
 Project: Apache Drill
  Issue Type: Improvement
  Components: Functions - Drill
Affects Versions: 1.21.1
Reporter: Charles Givre
Assignee: Charles Givre
 Fix For: 1.22.0


Adds a median function to Drill. 
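
For illustration, a hedged usage sketch, assuming the UDF is registered as an
aggregate named median(); the name, table and columns below are assumptions:

  -- median order value per region
  SELECT region, median(order_total) AS median_order
  FROM dfs.`/data/orders.csvh`
  GROUP BY region;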



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (DRILL-8433) Add Percent Change UDF to Drill

2023-05-09 Thread Charles Givre (Jira)
Charles Givre created DRILL-8433:


 Summary: Add Percent Change UDF to Drill
 Key: DRILL-8433
 URL: https://issues.apache.org/jira/browse/DRILL-8433
 Project: Apache Drill
  Issue Type: Improvement
  Components: Functions - Drill
Affects Versions: 1.21.1
Reporter: Charles Givre
Assignee: Charles Givre
 Fix For: 1.22.0


Adds a function to calculate the percent change between two columns.  Doing 
this without a custom function is cumbersome because you have to include a 
check for division by zero.
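
For illustration, a hedged usage sketch, assuming the function is exposed as
percent_change(old, new); the name, argument order and table below are assumptions:

  SELECT product,
         percent_change(sales_2022, sales_2023) AS pct_change
  FROM dfs.`/data/sales.parquet`;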



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[DRAFT] Report to Board

2023-05-09 Thread Charles Givre
Hello all, 
Here is the draft report to the Apache Board.  This is due tomorrow, so please 
let me know any comments ASAP.   Personally, I was very disappointed with 
Apache's marketing dept's announcement of Drill 1.21.  Last time we had a major 
release, it seemed like they did a lot more.  This time... not so much.  They 
have a policy that they don't make a big deal for bug fix releases, but still...
Best,
-- C



## Description:
The mission of Drill is the creation and maintenance of software related to 
Schema-free SQL Query Engine for Apache Hadoop, NoSQL and Cloud Storage

## Issues:
No issues requiring Board attention.
The PMC would like to express a little disappointment with the rather tepid 
response to our request for a release announcement.  Drill 1.21 was a very
significant release from a functionality perspective, and the only 
announcements we received were buried in other announcements. 


## Membership Data:
Apache Drill was founded 2014-11-18 (8 years ago)
There are currently 62 committers and 27 PMC members in this project.
The Committer-to-PMC ratio is roughly 2:1.

Community changes, past quarter:
- No new PMC members. Last addition was James Turton on 2022-01-23.
- No new committers. Last addition was Maksym Rymar on 2022-10-19.

## Project Activity:
Since the last report, we released Drill 1.21.0 and a bugfix release of 
1.21.1.  

1.21 was quite a significant release in that we eliminated the fork of Apache 
Calcite that Drill was using and now Drill is running on the main branch of 
Calcite, 1.34.  The result was a significant improvement in stability and 
overall ease of use.  Another very significant improvement was the work done
to improve implicit casting.  One of the main challenges of Drill has been querying 
data without a schema.  When it works, the experience is magical, but when it 
doesn't, the user is often confronted with a barrage of incomprehensible errors.
Drill 1.21.0 largely fixes both of these issues and, by and large, "it just works"!

Drill 1.21 also adds new connectors/plugins for:
* GoogleSheets
* Box File System
* MS Access

Drill 1.21 also adds the ability for Drill to query other Drill clusters.  This
is particularly useful if a user has data in multiple public clouds. In this 
scenario, Drill can be installed in both environments, and then connected
using the new Drill-on-Drill plugin.  From there, a user can query across
environments without incurring massive data transfer costs.

Lastly, Drill 1.21 adds user translation to its security settings.  User translation
allows users to use individual credentials instead of service accounts to query 
data in systems like Splunk or ElasticSearch.  

1.21.1 is a minor bugfix release which fixed a significant regression in 
Calcite's date functions as well as some other minor bugs. 


1.21.1 was released on 2023-04-29.
1.21.0 was released on 2023-02-21.
1.20.3 was released on 2023-01-07.


## Community Health:
Drill's community health remains strong. 

dev@drill.apache.org had a 4% increase in traffic in the past quarter 
(349 emails compared to 335)
iss...@drill.apache.org had a 52% increase in traffic in the past quarter
(495 emails compared to 324)
29 issues opened in JIRA, past quarter (-32% change)
35 issues closed in JIRA, past quarter (-7% change)
112 commits in the past quarter (13% increase)
11 code contributors in the past quarter (-8% change)
34 PRs opened on GitHub, past quarter (-20% change)
34 PRs closed on GitHub, past quarter (-27% change)
11 issues opened on GitHub, past quarter (37% increase)
4 issues closed on GitHub, past quarter (33% increase)

[jira] [Created] (DRILL-8428) ElasticSearch Config Missing Getters

2023-05-04 Thread Charles Givre (Jira)
Charles Givre created DRILL-8428:


 Summary: ElasticSearch Config Missing Getters
 Key: DRILL-8428
 URL: https://issues.apache.org/jira/browse/DRILL-8428
 Project: Apache Drill
  Issue Type: Bug
Reporter: Charles Givre
Assignee: Charles Givre


The ElasticSearch config was missing some getters and as a result, prevented 
users from setting certain config variables.  This PR fixes this.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [ANNOUNCE] Apache Drill 1.21.1 Released

2023-04-30 Thread Charles Givre
Great work everyone!  
One thing to note is that hidden in this bug fix release is the update to 
Calcite 1.34, which also includes bug fixes to the date functions as well as 
adding some new functionality to Drill!  James, I couldn't remember what new 
operators it added but would you mind please sending a note to the list?  I can 
work on updating the docs.
Best,
-- C



> On Apr 30, 2023, at 6:37 AM, James Turton  wrote:
> 
> On behalf of the Apache Drill community, I am happy to announce the release 
> of Apache Drill 1.21.1.
> 
> Drill is an Apache open-source SQL query engine for Big Data exploration.
> Drill is designed from the ground up to support high-performance analysis
> on the semi-structured and rapidly evolving data coming from modern Big
> Data applications, while still providing the familiarity and ecosystem of
> ANSI SQL, the industry-standard query language. Drill provides
> plug-and-play integration with existing Apache Hive and Apache HBase
> deployments.
> 
> For information about Apache Drill, and to get involved, visit the project 
> website [1].
> 
> A total of 12 JIRA's are resolved in this bugfix release of Drill. For the 
> full list please see release notes [2].
> 
> The binary and source artifacts are available here [3].
> 
> Thanks to everyone in the community who contributed to this release!
> 
> 1. https://drill.apache.org/
> 2. https://drill.apache.org/docs/apache-drill-1-21-1-release-notes/
> 3. https://drill.apache.org/download/
> 



Re: [VOTE] Release Apache Drill 1.21.1 - RC0

2023-04-25 Thread Charles Givre
Hello all, 
Built from source, ran various queries. LGTM +1 (Binding)
-- C



> On Apr 19, 2023, at 5:26 AM, James Turton  wrote:
> 
> Hi all
> 
> I'd like to propose the first release candidate (RC0) of Apache Drill, 
> version 1.21.1.
> 
> The release candidate covers a total of 12 resolved Jira issues [1]. Thanks 
> to everyone who contributed to this release.
> 
> The tarball artefacts are hosted at [2] and the Maven artefacts are hosted at 
> [3].
> 
> This release candidate is based on commit 
> 6da4b77ceb1510df4de4a2cfcc43a536b99f37ab located at [4].
> 
> [ ] +1
> [ ] +0
> [ ] -1
> 
> [1] 
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12313820=12352949
> [2] https://dist.apache.org/repos/dist/dev/drill/1.21.1-rc1/
> [3] https://repository.apache.org/content/repositories/orgapachedrill-1105/
> [4] https://github.com/jnturton/drill/commits/drill-1.21.1



[jira] [Resolved] (DRILL-4223) PIVOT and UNPIVOT to rotate table valued expressions

2023-04-18 Thread Charles Givre (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Charles Givre resolved DRILL-4223.
--
Resolution: Fixed

Added in Drill 1.21. 

> PIVOT and UNPIVOT to rotate table valued expressions
> 
>
> Key: DRILL-4223
> URL: https://issues.apache.org/jira/browse/DRILL-4223
> Project: Apache Drill
>  Issue Type: New Feature
>  Components: Execution - Codegen, SQL Parser
>Reporter: Ashwin Aravind
>Priority: Major
> Fix For: 1.21.0
>
>
> Capability to PIVOT and UNPIVOT table valued expressions which are results of 
> a SELECT query
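
With Drill now on mainline Calcite, the syntax should follow Calcite's PIVOT grammar;
a minimal, untested sketch (table and column names are illustrative):

  SELECT *
  FROM (SELECT deptno, job, sal FROM cp.`emp.json`)
  PIVOT (SUM(sal) AS total FOR job IN ('CLERK' AS clerk, 'MANAGER' AS manager));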



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Towards Drill 1.21.1

2023-03-30 Thread Charles Givre
Hello all,
It looks like we're ready to start with Drill 1.21.1.  I just submitted a PR 
that I'm not sure qualifies as a bug fix. Basically, I found that 
when I was working with some Excel files that had formula errors, Drill would 
throw exceptions.
The original intent was to make it so that the files would at least read, but I 
realized that the user might want to configure this, so it expanded a bit.  I 
don't know if this qualifies as a bug fix or not, but if it does, the PR is 
ready for review.
Best,
-- C



signature.asc
Description: Message signed with OpenPGP


[jira] [Created] (DRILL-8417) Allow Excel Reader to Ignore Formula Errors

2023-03-30 Thread Charles Givre (Jira)
Charles Givre created DRILL-8417:


 Summary: Allow Excel Reader to Ignore Formula Errors
 Key: DRILL-8417
 URL: https://issues.apache.org/jira/browse/DRILL-8417
 Project: Apache Drill
  Issue Type: Improvement
  Components: Storage - Excel
Affects Versions: 1.21.0
Reporter: Charles Givre
Assignee: Charles Givre
 Fix For: 1.21.1






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (DRILL-8414) Index Paginator Not Working When Provided URL

2023-03-18 Thread Charles Givre (Jira)
Charles Givre created DRILL-8414:


 Summary: Index Paginator Not Working When Provided URL
 Key: DRILL-8414
 URL: https://issues.apache.org/jira/browse/DRILL-8414
 Project: Apache Drill
  Issue Type: Bug
  Components: Storage - HTTP
Affects Versions: 1.21.0
Reporter: Charles Givre
Assignee: Charles Givre
 Fix For: 1.21.1


The index paginator offers two options: one where the API returns an index or 
offset, and another where it returns a URL.  The second was not fully 
implemented.  This PR also adds handling for the case where the API returns 
a path rather than a full URL; in that case, the path replaces the pre-existing 
path segments.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (DRILL-8413) Add DNS Lookup Functions

2023-03-17 Thread Charles Givre (Jira)
Charles Givre created DRILL-8413:


 Summary: Add DNS Lookup Functions
 Key: DRILL-8413
 URL: https://issues.apache.org/jira/browse/DRILL-8413
 Project: Apache Drill
  Issue Type: New Feature
  Components: Functions - Drill
Affects Versions: 1.21.0
Reporter: Charles Givre
Assignee: Charles Givre
 Fix For: 1.22


This PR adds additional DNS lookup functions to Drill:

 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (DRILL-8411) GoogleSheets Reader Will Not Read More than 1K Rows

2023-03-14 Thread Charles Givre (Jira)
Charles Givre created DRILL-8411:


 Summary: GoogleSheets Reader Will Not Read More than 1K Rows
 Key: DRILL-8411
 URL: https://issues.apache.org/jira/browse/DRILL-8411
 Project: Apache Drill
  Issue Type: Bug
  Components: Storage - GoogleSheets
Affects Versions: 1.21.0
Reporter: Charles Givre
Assignee: Charles Givre
 Fix For: 1.21.1


The GoogleSheets reader hits the GoogleSheets SDK's batch limit of 1000 rows and 
stops.  This PR fixes that.

It also fixes a minor but annoying issue whereby the GoogleSheets reader 
determines a column is a date/time, but is then unable to parse it because it 
is in a non-standard format.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (DRILL-8408) Allow Implicit Casts on Join

2023-03-08 Thread Charles Givre (Jira)
Charles Givre created DRILL-8408:


 Summary: Allow Implicit Casts on Join
 Key: DRILL-8408
 URL: https://issues.apache.org/jira/browse/DRILL-8408
 Project: Apache Drill
  Issue Type: Improvement
  Components: Execution - Data Types
Affects Versions: 1.21.0
Reporter: Charles Givre
Assignee: Charles Givre
 Fix For: 1.21.1


Currently, Drill does not allow implicit casts on joins.  With DRILL-8136, this 
has been significantly improved, and it might make sense to do so. 
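
For illustration, a join like the sketch below (plugin and table names are
hypothetical), where the key is a VARCHAR on one side and an INT on the other,
currently needs an explicit CAST in the ON clause; allowing implicit casts would
let it be written directly:

  SELECT o.order_id, c.name
  FROM dfs.`/data/orders.csvh` AS o      -- customer_id read as VARCHAR
  JOIN pg.public.customers AS c          -- id stored as INT
    ON o.customer_id = c.id;             -- today this needs CAST(o.customer_id AS INT)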



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (DRILL-8407) Add Support for SFTP File Systems

2023-03-05 Thread Charles Givre (Jira)
Charles Givre created DRILL-8407:


 Summary: Add Support for SFTP File Systems
 Key: DRILL-8407
 URL: https://issues.apache.org/jira/browse/DRILL-8407
 Project: Apache Drill
  Issue Type: Improvement
  Components: Storage - File
Affects Versions: 1.20.0
Reporter: Charles Givre
Assignee: Charles Givre
 Fix For: Future


Add support for SFTP File Systems. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: Friendly suggestion to make the dev-list readable again

2023-02-22 Thread Charles Givre
HI Christofer, 
Thanks for the suggestion.  We were in the middle of a release, but thank you 
VERY much for this suggestion.  I know all of us are probably overwhelmed with 
the number of emails we get, so any way to organize and/or reduce that is 
greatly appreciated.   I submitted a pull request for this change.  
(https://github.com/apache/drill/pull/2764)   Personally, I'd love to disable 
the gitbox emails as well. Those are completely useless and merely duplicate 
what I already get from github.  

Best,
--C 



> On Feb 15, 2023, at 7:41 AM, Christofer Dutz  
> wrote:
> 
> Hi,
> 
> while reviewing your projects activity as part of my board duties, I came to 
> notice, that the dev-list is in an almost unusable state.
> 
> The vast majority of all emails are auto-generated emails from GitHub.
> 
> The way the replication is setup per default makes these emails impossible to 
> follow and display in any email client I tried to read it on. After a lot of 
> persuasion Infra introduced a new feature into the .asf.yml to allow custom 
> templates for these auto-generated emails.
> 
> https://cwiki.apache.org/confluence/display/INFRA/Git+-+.asf.yaml+features#Git.asf.yamlfeatures-CustomsubjectlinesforGitHubevents
> 
> In the PLC4X Project we used these settings to drastically reformat the 
> subject lines of all GitHub emails.
> 
> https://github.com/apache/plc4x/blob/develop/.asf.yaml#L62
> 
> With this we can now read the start of the title in typical columnar layouted 
> email clients as well on phones and the clients are able to do correct 
> threading.
> 
> Perhaps worth discussing to adopt these or similar changes.
> 
> The way the list is currently, I doubt anyone is reading it and it makes it 
> impossible for outsiders to get started … well … and it makes it hard for the 
> board to execute it’s oversight.
> 
> As I said in the subject … it’s not an order to change anything, but a 
> friendly suggestion.
> 
> Chris



[jira] [Created] (DRILL-8402) Add REGEXP_EXTRACT Function

2023-02-20 Thread Charles Givre (Jira)
Charles Givre created DRILL-8402:


 Summary: Add REGEXP_EXTRACT Function
 Key: DRILL-8402
 URL: https://issues.apache.org/jira/browse/DRILL-8402
 Project: Apache Drill
  Issue Type: Improvement
  Components: Functions - Drill
Affects Versions: 1.21.0
Reporter: Charles Givre
Assignee: Charles Givre
 Fix For: 1.21.0


This PR adds two UDFs to Drill:

regexp_extract(<text>, <regex>): returns an array of the strings captured by the 
capturing groups in the regex.

regexp_extract(<text>, <regex>, <group index>): returns the text captured by a 
specific capturing group. 
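
For illustration, a hedged usage sketch (the literal, pattern and group numbering
below are illustrative guesses, not confirmed behaviour):

  SELECT regexp_extract('drill-1.21.0', '([a-z]+)-([0-9.]+)')    AS all_groups,
         regexp_extract('drill-1.21.0', '([a-z]+)-([0-9.]+)', 2) AS version_only
  FROM (VALUES(1));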



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [VOTE] Release Apache Drill 1.21.0 - RC0

2023-02-17 Thread Charles Givre
Built Drill from source on Java 11.  I encountered 1 small issue, which I don't 
think is a blocker, but should definitely be documented somewhere.  
Specifically, if there is not a file called git.properties in the root folder, 
the build will fail at the distribution phase.  We could either include a blank 
file or just document this to tell developers to create one.  I don't think it 
is a big deal.

In any event, I ran various queries against various data sources, including PIVOT 
queries!!
All behaved as expected, so 
+1 from me. (Binding)

Best,
-- C




> On Feb 15, 2023, at 4:05 AM, James Turton  wrote:
> 
> Hi all,
> 
> I'd like to propose the first release candidate (RC0) of Apache Drill, 
> version 1.21.0.
> 
> The release candidate covers a total of 106 resolved JIRAs [1]. Thanks to 
> everyone who contributed to this release.
> 
> The tarball artifacts are hosted at [2] and the maven artifacts are hosted at 
> [3].
> 
> This release candidate is based on commit 
> ff86c29b125776c56f361fc9187fb96458de37a6 located at [4].
> 
> Please download and try out the release, especially using JDK 8 because this 
> release was built with JDK 17.
> 
> The vote ends at Sat 18 Feb 2023 12:00:00 UTC.
> 
> [ ] +1
> [ ] +0
> [ ] -1
> 
> [1] 
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12313820=12335462
> [2] https://github.com/jnturton/drill/releases/tag/drill-1.21.0
> [3] https://repository.apache.org/content/repositories/orgapachedrill-1104/
> [4] https://github.com/jnturton/drill/commits/drill-1.21.0



Re: [DISCUSS] Towards Drill 1.21

2023-02-13 Thread Charles Givre
Hello all, 
It seems like we have one remaining PR to merge which is almost done and one 
minor bug to fix.  Once this is done, we should be ready to generate a release 
candidate.

DRILL-8117:  Upgrade unit tests to the cluster fixture framework

Thanks for everyone's hard work on this!
Best,
-- C


> On Feb 8, 2023, at 11:34 PM, James Turton  wrote:
> 
> Another note, arising from some basic testing I just been doing of 
> 1.21.0-SNAPSHOT on my laptop, is that embedded Drill takes 5+ seconds to 
> start and eats 1.5Gb of RAM before you even do anything. That kind of bloat 
> won't worry big data users much but it'll certainly hurt Drill's appeal to 
> lightweight users. It can only originate from our ongoing inclusion of 
> libraries for external formats and storage without having a modular plugin 
> architecture that is suited for scaling along that dimension. JAR hell 
> <https://en.wikipedia.org/wiki/Java_Classloader#JAR_hell> will also 
> increasingly turn routine dependency upgrades into protracted battles (and I 
> think we all already have one or two scars from these to show off).
> 
> My gut says that eventually, not today, this is going to become of 
> existential importance.
> 
> On 2023/02/08 17:06, Charles Givre wrote:
>> Most definitely!
>> 
>> Current status:
>> PRs Remaining
>> DRILL-8395:  Add Support for Insert and Drop Tables to GoogleSheets
>> DRILL-4232:  Support for EXCEPT and INTERSECT set operator
>> DRILL-8117:  Upgrade unit tests to the cluster fixture framework
>> 
>> Best,
>> -- C
>> 
>> 
>>> On Feb 8, 2023, at 9:57 AM, James Turton  wrote:
>>> 
>>> One other category of thing we need to do for this release is doc updates. 
>>> A *lot* of them, I fear, because we've merged a lot of new stuff since 1.20.
>>> 
>>> On 2023/02/07 18:29, Charles Givre wrote:
>>>> Hello all,
>>>> I hope all is well.  Calcite just released 1.33 and I think it's time to 
>>>> start cutting a release of Drill.   There are a few outstanding PRs which 
>>>> I'd like to see merged and they are:
>>>> 
>>>> DRILL-8379:  Update Calcite to 1.33
>>>> DRILL-8395:  Add Support for Insert and Drop Tables to GoogleSheets
>>>> DRILL-8372:  Unfreed buffers when running a LIMIT 0 query over delimited 
>>>> text
>>>> DRILL-4232:  Support for EXCEPT and INTERSECT set operator
>>>> DRILL-8290:  Early exit from recursive file listing for LIMIT 0 queries
>>>> 
>>>> These are PRs that are in the final stages of review and I think we can 
>>>> get them over the finish line very quickly.  If there are other PRs that 
>>>> anyone would like included, (or removed) please let me know and we can add 
>>>> them to the list.
>>>> 
>>>> This will really be a milestone release.  Thanks to everyone who has 
>>>> contributed to this release.
>>>> Best,
>>>> -- C
>>>> 
>>>> 



[jira] [Created] (DRILL-8399) MS Access Reader Misinterprets Data Types

2023-02-10 Thread Charles Givre (Jira)
Charles Givre created DRILL-8399:


 Summary: MS Access Reader Misinterprets Data Types
 Key: DRILL-8399
 URL: https://issues.apache.org/jira/browse/DRILL-8399
 Project: Apache Drill
  Issue Type: Bug
  Components: Format - MS Access
Affects Versions: 1.21.0
Reporter: Charles Givre
Assignee: Charles Givre
 Fix For: 1.21.0


The MS Access reader was assigning certain data types incorrectly, resulting in 
various errors.  This minor PR fixes that.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [DISCUSS] Towards Drill 1.21

2023-02-08 Thread Charles Givre
Most definitely!  

Current status:
PRs Remaining
DRILL-8395:  Add Support for Insert and Drop Tables to GoogleSheets
DRILL-4232:  Support for EXCEPT and INTERSECT set operator
DRILL-8117:  Upgrade unit tests to the cluster fixture framework

Best,
-- C


> On Feb 8, 2023, at 9:57 AM, James Turton  wrote:
> 
> One other category of thing we need to do for this release is doc updates. A 
> *lot* of them, I fear, because we've merged a lot of new stuff since 1.20.
> 
> On 2023/02/07 18:29, Charles Givre wrote:
>> Hello all,
>> I hope all is well.  Calcite just released 1.33 and I think it's time to 
>> start cutting a release of Drill.   There are a few outstanding PRs which 
>> I'd like to see merged and they are:
>> 
>> DRILL-8379:  Update Calcite to 1.33
>> DRILL-8395:  Add Support for Insert and Drop Tables to GoogleSheets
>> DRILL-8372:  Unfreed buffers when running a LIMIT 0 query over delimited text
>> DRILL-4232:  Support for EXCEPT and INTERSECT set operator
>> DRILL-8290:  Early exit from recursive file listing for LIMIT 0 queries
>> 
>> These are PRs that are in the final stages of review and I think we can get 
>> them over the finish line very quickly.  If there are other PRs that anyone 
>> would like included, (or removed) please let me know and we can add them to 
>> the list.
>> 
>> This will really be a milestone release.  Thanks to everyone who has 
>> contributed to this release.
>> Best,
>> -- C
>> 
>> 
> 



[DISCUSS] Towards Drill 1.21

2023-02-07 Thread Charles Givre
Hello all, 
I hope all is well.  Calcite just released 1.33 and I think it's time to start 
cutting a release of Drill.   There are a few outstanding PRs which I'd like to 
see merged and they are:

DRILL-8379:  Update Calcite to 1.33
DRILL-8395:  Add Support for Insert and Drop Tables to GoogleSheets
DRILL-8372:  Unfreed buffers when running a LIMIT 0 query over delimited text
DRILL-4232:  Support for EXCEPT and INTERSECT set operator
DRILL-8290:  Early exit from recursive file listing for LIMIT 0 queries

These are PRs that are in the final stages of review and I think we can get 
them over the finish line very quickly.  If there are other PRs that anyone 
would like included, (or removed) please let me know and we can add them to the 
list.

This will really be a milestone release.  Thanks to everyone who has 
contributed to this release.
Best,
-- C



[jira] [Created] (DRILL-8395) Add Support for INSERT and Drop Table to GoogleSheets Plugin

2023-02-05 Thread Charles Givre (Jira)
Charles Givre created DRILL-8395:


 Summary: Add Support for INSERT and Drop Table to GoogleSheets 
Plugin
 Key: DRILL-8395
 URL: https://issues.apache.org/jira/browse/DRILL-8395
 Project: Apache Drill
  Issue Type: Improvement
  Components: Storage - GoogleSheets
Affects Versions: 1.20.3
Reporter: Charles Givre
Assignee: Charles Givre
 Fix For: 1.21.0


This PR adds support for INSERT queries which allow a user to append data to an 
existing GoogleSheets tab.  It also:
 * Adds support for DROP TABLE queries which were not implemented
 * Modifies CTAS queries so that if a user executes a CTAS query with a file 
token, Drill will add a new tab to an existing document, but if the user 
executes a CTAS with a file name, it will create an entirely new document.
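
For illustration, a hedged sketch of the resulting query shapes, assuming a plugin
instance named googlesheets and a <token>.<tab> naming scheme (the token, tab name
and source table are placeholders):

  -- CTAS with a file token: adds a new tab to the existing document
  CREATE TABLE googlesheets.`<file-token>`.`new_tab` AS
  SELECT * FROM cp.`employee.json`;

  -- INSERT: appends rows to an existing tab
  INSERT INTO googlesheets.`<file-token>`.`new_tab`
  SELECT * FROM cp.`employee.json` LIMIT 10;

  -- DROP TABLE: removes the tab
  DROP TABLE googlesheets.`<file-token>`.`new_tab`;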



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (DRILL-8392) Empty Tables Causes Index Out of Bounds Exception on PDF Reader

2023-01-24 Thread Charles Givre (Jira)
Charles Givre created DRILL-8392:


 Summary: Empty Tables Causes Index Out of Bounds Exception on PDF 
Reader
 Key: DRILL-8392
 URL: https://issues.apache.org/jira/browse/DRILL-8392
 Project: Apache Drill
  Issue Type: Bug
  Components: Format - PDF
Affects Versions: 1.20.3
Reporter: Charles Givre
Assignee: Charles Givre
 Fix For: 1.21.0






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (DRILL-8390) Minor Improvements to PDF Reader

2023-01-18 Thread Charles Givre (Jira)
Charles Givre created DRILL-8390:


 Summary: Minor Improvements to PDF Reader
 Key: DRILL-8390
 URL: https://issues.apache.org/jira/browse/DRILL-8390
 Project: Apache Drill
  Issue Type: Improvement
  Components: Format - PDF
Reporter: Charles Givre
Assignee: Charles Givre


This PR makes some minor improvements to the PDF reader including:
 * Fixes a minor bug where certain configurations the first row of data was 
skipped
 * Fixes a minor bug where empty tables were causing crashes with the 
spreadsheet extraction algorithm was used
 * Adds a table_count metadata field
 * Adds a table_index metadata field to reflect the current table.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (DRILL-8387) Add Support for User Translation to ElasticSearch Plugin

2023-01-12 Thread Charles Givre (Jira)
Charles Givre created DRILL-8387:


 Summary: Add Support for User Translation to ElasticSearch Plugin
 Key: DRILL-8387
 URL: https://issues.apache.org/jira/browse/DRILL-8387
 Project: Apache Drill
  Issue Type: Improvement
  Components: Storage - ElasticSearch
Affects Versions: 1.20.3
Reporter: Charles Givre
Assignee: Charles Givre
 Fix For: 1.21.0


Add support for user translation to ElasticSearch. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (DRILL-8386) Add Support for User Translation for Cassandra

2023-01-11 Thread Charles Givre (Jira)
Charles Givre created DRILL-8386:


 Summary: Add Support for User Translation for Cassandra
 Key: DRILL-8386
 URL: https://issues.apache.org/jira/browse/DRILL-8386
 Project: Apache Drill
  Issue Type: Improvement
  Components: Storage - Cassandra
Affects Versions: 1.20.3
Reporter: Charles Givre
Assignee: Charles Givre
 Fix For: 1.21.0


Adds support for user translation to the Cassandra plugin. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (DRILL-8384) Add Format Plugin for Microsoft Access

2023-01-10 Thread Charles Givre (Jira)
Charles Givre created DRILL-8384:


 Summary: Add Format Plugin for Microsoft Access
 Key: DRILL-8384
 URL: https://issues.apache.org/jira/browse/DRILL-8384
 Project: Apache Drill
  Issue Type: Improvement
  Components: Format - MS Access
Affects Versions: 1.21.0
Reporter: Charles Givre
Assignee: Charles Givre
 Fix For: 1.21.0


Shockingly, MS Access is still in widespread use.  This plugin enables Drill to 
read MS Access files. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [VOTE] Release Apache Drill 1.20.3 - RC0

2023-01-04 Thread Charles Givre
Hello all,
Just a reminder, to please try this out and record your vote here.  I think we 
need one more vote to publish the RC0.
Best,
-- C

> On Dec 28, 2022, at 1:21 AM, James Turton  wrote:
> 
> Herewith an attempt to unmangle the formatting of my original email :(.
> 
> Hi all,
> 
> I'd like to propose the first release candidate (RC0) of Apache Drill, 
> version 1.20.3. This is very likely to
> be the final update to the 1.20.x series given that 1.21 is expected in the 
> near future.
> 
> The release candidate covers a total of 30 resolved JIRAs [1]. Thanks to 
> everyone who contributed to this release
> and to Jingchuan Hu for some nontrivial backporting of fixes.
> 
> The tarball artifacts are hosted at [2] and the maven artifacts are hosted at 
> [3]. The unit test run based on the
> final merge into the 1.20 branch may be viewed here [4].
> 
> This release candidate is based on commit 
> f49d553d71db553fda2fc330f9a3f05d7dcdd17f located at [5].
> 
> Please download and try out the release.
> 
> [ ] +1
> [ ] +0
> [ ] -1
> 
> Here's my vote: +1
> 
> [1] 
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12313820=12352165
> [2] https://dist.apache.org/repos/dist/dev/drill/1.20.3-rc0
> [3] https://repository.apache.org/content/repositories/orgapachedrill-1103/
> [4] https://github.com/apache/drill/actions/runs/3780949700
> [5] https://github.com/jnturton/drill/releases/tag/drill-1.20.3
> 
> Season's greetings
> James



Towards Drill 1.21

2022-12-28 Thread Charles Givre
Hello all, 
I hope everyone is having a great holiday season.  I'd like to officially start 
the discussion around the Drill 1.21 release.  The target date for a release candidate 
would be mid-January, but there's an external dependency on Calcite, so we'll 
see. 

There is some work pending which I think we should definitely include in the 
next release which are:

Upgrade to Calcite 1.33.  (Note: Calcite hasn't released this yet, but likely 
will around the beginning of the year and we do not anticipate major 
modifications will be necessary to Drill to support this)
DRILL-8376: Distribution UDFs
DRILL-8372 Unfreed Buffers in Text Reader
DRILL-8290: Shortcut recursive file listings
DRILL-4232: Support for Except and Intersect Set Operators
DRILL-8188: HDF5 to EVF2.  (Not sure if @luoc is still working on this, but the 
PR hasn't been touched in a while and this is one of the last EVF1 format 
plugins)
DRILL-8375: Incomplete support for non-projected complex vectors (Paul Rogers 
created a JIRA for this, but we don't have a PR yet, so I don't know if this 
will make it or not)

Are there other issues or PRs that anyone would like to see included in the 
next Drill release?  This is not a call to close the release, but just to start 
taking inventory on what we'd like to wrap up for the next release.

Thanks everyone and Happy New Year!
-- C




Re: [VOTE] Release Apache Drill 1.20.3 - RC0

2022-12-27 Thread Charles Givre
Built Drill from source.  Ran tests locally. No issues. +1 from me.
-- C



> On Dec 27, 2022, at 5:29 AM, James Turton  wrote:
> 
> Hi all,
> 
> I'd like to propose the first release candidate (RC0) of Apache Drill, version 
> 1.20.3. This is very likely to be the final update to the 1.20.x series given 
> that 1.21 is expected in the near future.
> 
> The release candidate covers a total of 30 resolved JIRAs [1]. Thanks to 
> everyone who contributed to this release and to Jingchuan Hu for some 
> nontrivial backporting of fixes.
> 
> The tarball artifacts are hosted at [2] and the maven artifacts are hosted at 
> [3]. The unit test run based on the final merge into the 1.20 branch may be 
> viewed here [4].
> 
> This release candidate is based on commit 
> f49d553d71db553fda2fc330f9a3f05d7dcdd17f located at [5].
> 
> Please download and try out the release.
> 
> [ ] +1
> [ ] +0
> [ ] -1
> 
> Here's my vote: +1
> 
> [1] https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12313820=12352165
> [2] https://dist.apache.org/repos/dist/dev/drill/1.20.3-rc0
> [3] https://repository.apache.org/content/repositories/orgapachedrill-1103/
> [4] https://github.com/apache/drill/actions/runs/3780949700
> [5] https://github.com/jnturton/drill/releases/tag/drill-1.20.3
> 
> Season's greetings
> James



signature.asc
Description: Message signed with OpenPGP


[jira] [Created] (DRILL-8376) Add Distribution UDFs

2022-12-24 Thread Charles Givre (Jira)
Charles Givre created DRILL-8376:


 Summary: Add Distribution UDFs
 Key: DRILL-8376
 URL: https://issues.apache.org/jira/browse/DRILL-8376
 Project: Apache Drill
  Issue Type: Improvement
  Components: Functions - Drill
Affects Versions: 1.21
Reporter: Charles Givre
Assignee: Charles Givre


Add `width_bucket`, `pearson_correlation` and `kendall_correlation` to Drill
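
For illustration, a hedged usage sketch (signatures are assumptions based on the
usual forms of these functions; the table and columns are placeholders):

  -- bucket values into 10 equal-width bins between 0 and 100
  SELECT width_bucket(score, 0, 100, 10) AS bucket, COUNT(*) AS n
  FROM dfs.`/data/measurements.parquet`
  GROUP BY width_bucket(score, 0, 100, 10);

  -- correlation between two numeric columns (assumed aggregate signatures)
  SELECT pearson_correlation(height, weight) AS pearson_r,
         kendall_correlation(height, weight) AS kendall_tau
  FROM dfs.`/data/measurements.parquet`;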



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: Towards Drill 2.0

2022-12-21 Thread Charles Givre
 James, Maksym, 
I suppose you are right.  We've done A LOT of work on Drill since 1.20 to 
include several new plugins (GoogleSheets, Delta Lake, Drill on Drill, 
others?).  All the user translation work, bug fixes, much more.  
In terms of pending work, Calcite is about to release 1.33 which will have some 
fixes that are relevant to Drill such as Athena support and a few others.  
Additionally, there are a few pending PRs such as the Splunk writer, a few from 
James, and some from Vova that are reasonably close to completion.  

Here's what I'd propose... why don't we shoot for a Drill 1.21 in January that 
incorporates all this new work?
What do you think?
-- C





> On Dec 20, 2022, at 4:19 AM, James Turton  wrote:
> 
> I tend to agree that a 1.21 before a 2.0 wouldn't do any harm considering 
> where we are and that little of the stuff we've got in the master branch 
> breaks things in more than a superficial way.
> 
> On 12/19/22 16:49, Максим Римар wrote:
>> Hi Charles,
>> 
>> Could you please tell me, you said that you wanted a few pieces to be
>> wrapped up in the next release, what are they?
>> 
>> And another thought: does Drill have breaking changes that warrant releasing
>> it as Drill 2.0?
>> Of course, the Calcite update is a significant update, but it is only one of
>> many changes that we wanted to make for release 2.0 (
>> https://github.com/apache/drill/wiki/Drill-2.0-Proposal). Maybe it makes
>> sense to release the current Drill as 1.21.0 and not make release 2.0
>> until we make structural changes (
>> https://github.com/apache/drill/wiki/Drill-2.0-Proposal#project-structure-packaging-and-distribution
>> )?
>> 
>> Regards,
>> Maksym
>> 
>> On 2022/12/13 14:51:56 Charles Givre wrote:
>>> Hello all,
>>> We've been doing a lot of really great work on Drill in the last few
>> months, and I wanted to thank those who have been involved with releasing
>> the bug fix releases throughout the year.  With all that said, there's been
>> enormous progress on Drill 2.0 and while it isn't quite what we had
>> originally discussed, IMHO, it's time to get this in people's hands.
>>> The work that Vova did updating Drill to Calcite 1.32 and getting us off
>> of the fork has been a HUGE improvement in performance and stability.
>> Likewise the casting work that James T. did has made Drill significantly
>> more usable.  From my perspective, there are a few pieces that I'd like to
>> see wrapped up but I wanted to start the dialogue to see if we might be
>> able to get Drill 2.0 released either by the end of the year or early
>> January.  What do you think?
>>> Best,
>>> -- C
> 



[jira] [Resolved] (DRILL-8198) XML EVF2 reader provideSchema usage

2022-12-19 Thread Charles Givre (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-8198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Charles Givre resolved DRILL-8198.
--
Resolution: Fixed

> XML EVF2 reader provideSchema usage
> ---
>
> Key: DRILL-8198
> URL: https://issues.apache.org/jira/browse/DRILL-8198
> Project: Apache Drill
>  Issue Type: Sub-task
>  Components: Storage - XML
>Affects Versions: 1.20.0
>Reporter: Vitalii Diravka
>Assignee: Vitalii Diravka
>Priority: Major
> Fix For: 2.0.0
>
>
> XMLBatchReader is converted to EVF2 reader, but not used provideSchema for 
> Schema Provision feature



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (DRILL-8371) Add Write/Append Capability to Splunk Plugin

2022-12-14 Thread Charles Givre (Jira)
Charles Givre created DRILL-8371:


 Summary: Add Write/Append Capability to Splunk Plugin
 Key: DRILL-8371
 URL: https://issues.apache.org/jira/browse/DRILL-8371
 Project: Apache Drill
  Issue Type: Improvement
  Components: Storage - Splunk
Affects Versions: 1.20.2
Reporter: Charles Givre
Assignee: Charles Givre
 Fix For: 2.0.0


While Drill can currently read from Splunk indexes, it cannot write to them or 
create them.  This proposed PR adds support for CTAS queries for Splunk as well 
as INSERT and DROP TABLE. 
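
For illustration, a hedged sketch of the query shapes this would enable, assuming a
plugin instance named splunk (the index name and source paths are placeholders):

  -- create a new Splunk index and load it
  CREATE TABLE splunk.drill_events AS
  SELECT * FROM dfs.`/logs/events.json`;

  -- append more events to the same index
  INSERT INTO splunk.drill_events
  SELECT * FROM dfs.`/logs/more_events.json`;

  -- remove the index
  DROP TABLE splunk.drill_events;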



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Towards Drill 2.0

2022-12-13 Thread Charles Givre
Hello all, 
We've been doing a lot of really great work on Drill in the last few months, 
and I wanted to thank those who have been involved with releasing the bug fix 
releases throughout the year.  With all that said, there's been enormous 
progress on Drill 2.0 and while it isn't quite what we had originally 
discussed, IMHO, it's time to get this in people's hands.  

The work that Vova did updating Drill to Calcite 1.32 and getting us off of the 
fork has been a HUGE improvement in performance and stability.  Likewise the 
casting work that James T. did has made Drill significantly more usable.  From 
my perspective, there are a few pieces that I'd like to see wrapped up but I 
wanted to start the dialogue to see if we might be able to get Drill 2.0 
released either by the end of the year or early January.  What do you think?
 
Best,
-- C

[jira] [Created] (DRILL-8365) HTTP Plugin Places Parameters in Wrong Place

2022-12-04 Thread Charles Givre (Jira)
Charles Givre created DRILL-8365:


 Summary: HTTP Plugin Places Parameters in Wrong Place
 Key: DRILL-8365
 URL: https://issues.apache.org/jira/browse/DRILL-8365
 Project: Apache Drill
  Issue Type: Bug
  Components: Storage - HTTP
Affects Versions: 1.20.2
Reporter: Charles Givre
Assignee: Charles Givre
 Fix For: 1.20.3


When the requireTail option is set to true, and pagination is enabled, the HTTP 
plugin puts the required parameters in the wrong place in the URL.  This PR 
fixes that.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (DRILL-8364) Add Support for OAuth Enabled File Systems

2022-12-03 Thread Charles Givre (Jira)
Charles Givre created DRILL-8364:


 Summary: Add Support for OAuth Enabled File Systems
 Key: DRILL-8364
 URL: https://issues.apache.org/jira/browse/DRILL-8364
 Project: Apache Drill
  Issue Type: Improvement
  Components: Storage - File
Affects Versions: 1.20.2
Reporter: Charles Givre
Assignee: Charles Givre
 Fix For: 2.0.0


Currently Drill supports reading from file systems such as HDFS, S3 and others 
that use token based authentication.  This PR extends Drill's plugin 
architecture so that Drill can connect with other file systems which use OAuth 
2.0 for authentication.

This PR also adds support for Drill to query Box. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (DRILL-8360) Add Provided Schema for XML Reader

2022-11-27 Thread Charles Givre (Jira)
Charles Givre created DRILL-8360:


 Summary: Add Provided Schema for XML Reader
 Key: DRILL-8360
 URL: https://issues.apache.org/jira/browse/DRILL-8360
 Project: Apache Drill
  Issue Type: Improvement
  Components: Format - XML
Affects Versions: 1.20.2
Reporter: Charles Givre
Assignee: Charles Givre
 Fix For: 2.0.0


The XML reader does not support provisioned schema.  This PR adds that support.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (DRILL-8356) Add File Name to GoogleSheets Plugin

2022-11-09 Thread Charles Givre (Jira)
Charles Givre created DRILL-8356:


 Summary: Add File Name to GoogleSheets Plugin
 Key: DRILL-8356
 URL: https://issues.apache.org/jira/browse/DRILL-8356
 Project: Apache Drill
  Issue Type: Improvement
  Components: Storage - GoogleSheets
Affects Versions: 2.0.0
Reporter: Charles Givre
Assignee: Charles Givre
 Fix For: 2.0.0


GoogleSheets uses tokens to identify the individual files.  These tokens are 
not human readable and will make it difficult for a user to know which file 
they are accessing.  

This PR adds a metadata field called `_title` which identifies the document 
they are working with.
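
For illustration, a hedged usage sketch (the plugin instance name, file token, tab
and columns are placeholders):

  SELECT _title, col1, col2
  FROM googlesheets.`<file-token>`.`Sheet1`
  LIMIT 5;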



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (DRILL-8354) Add IS_EMPTY Function.

2022-11-08 Thread Charles Givre (Jira)
Charles Givre created DRILL-8354:


 Summary: Add IS_EMPTY Function.
 Key: DRILL-8354
 URL: https://issues.apache.org/jira/browse/DRILL-8354
 Project: Apache Drill
  Issue Type: Improvement
  Components: Functions - Drill
Affects Versions: 1.20.2
Reporter: Charles Givre
Assignee: Charles Givre
 Fix For: 2.0.0


When analyzing data, there is currently no single function to evaluate whether 
a given field is empty.  With scalar fields, this can be accomplished with the 
`IS NOT NULL` operator, but with complex fields, this is more challenging as 
complex fields are never null. 

This PR adds a UDF called IS_EMPTY() which accepts any type of field and 
returns true if the field does not contain data.  

 

In the case of scalar fields, the function returns true if the field is `null`.  
Complex fields can never be `null`: for lists, the function returns true if the 
list is empty, and for maps it returns true if all of the map's fields are 
unpopulated. 
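
For illustration, a hedged usage sketch (the table and column names are placeholders):

  -- keep only records whose complex field actually carries data
  SELECT user_id, attributes
  FROM dfs.`/data/users.json`
  WHERE NOT is_empty(attributes);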



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (DRILL-8350) Convert PCAP Format Plugin to EVF2

2022-11-02 Thread Charles Givre (Jira)
Charles Givre created DRILL-8350:


 Summary: Convert PCAP Format Plugin to EVF2
 Key: DRILL-8350
 URL: https://issues.apache.org/jira/browse/DRILL-8350
 Project: Apache Drill
  Issue Type: Task
  Components: Format - PCAP
Affects Versions: 1.20.2
Reporter: Charles Givre
Assignee: Charles Givre
 Fix For: 2.0.0


Convert the PCAP format plugin to EVF2



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (DRILL-8349) GoogleSheets Not Registering Schemas with Non Default Name

2022-11-01 Thread Charles Givre (Jira)
Charles Givre created DRILL-8349:


 Summary: GoogleSheets Not Registering Schemas with Non Default Name
 Key: DRILL-8349
 URL: https://issues.apache.org/jira/browse/DRILL-8349
 Project: Apache Drill
  Issue Type: Bug
  Components: Storage - GoogleSheets
Affects Versions: 2.0.0
Reporter: Charles Givre
Assignee: Charles Givre
 Fix For: 2.0.0


GoogleSheets plugin fails to register plugin instances with names other than 
`GoogleSheets`. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (DRILL-8342) Add Automatic Retry for Rate Limited APIs

2022-10-22 Thread Charles Givre (Jira)
Charles Givre created DRILL-8342:


 Summary: Add Automatic Retry for Rate Limited APIs
 Key: DRILL-8342
 URL: https://issues.apache.org/jira/browse/DRILL-8342
 Project: Apache Drill
  Issue Type: Improvement
  Components: Storage - HTTP
Affects Versions: 1.20.2
Reporter: Charles Givre
Assignee: Charles Givre
 Fix For: 2.0.0


Many APIs have a burst limit for number of requests.  This PR adds a retry 
capability to the HTTP Storage Plugin, whereby if a 429 response code is 
received, Drill will wait a configurable amount of time, and retry the request 
once. 

To prevent runaway pagination, this retry will only happen once per request. 

This PR adds a new configuration option called retryDelay which is the number 
of milliseconds that Drill should wait between retries.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (DRILL-8341) Add Scanned Plugin List to Sys Profiles Table

2022-10-21 Thread Charles Givre (Jira)
Charles Givre created DRILL-8341:


 Summary: Add Scanned Plugin List to Sys Profiles Table
 Key: DRILL-8341
 URL: https://issues.apache.org/jira/browse/DRILL-8341
 Project: Apache Drill
  Issue Type: Improvement
  Components: Execution - Monitoring
Affects Versions: 1.20.2
Reporter: Charles Givre
Assignee: Charles Givre
 Fix For: 2.0.0


In DRILL-8322, [~dzamo] added the list of scanned plugins to the query 
profiles.  This information is extremely useful in query analysis.  This minor 
PR adds this same information to the sys.profiles table. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (DRILL-8340) Add Additional Date Manipulation Functions (Part 1)

2022-10-20 Thread Charles Givre (Jira)
Charles Givre created DRILL-8340:


 Summary: Add Additional Date Manipulation Functions (Part 1)
 Key: DRILL-8340
 URL: https://issues.apache.org/jira/browse/DRILL-8340
 Project: Apache Drill
  Issue Type: Improvement
  Components: Functions - Drill
Affects Versions: 1.20.2
Reporter: Charles Givre
Assignee: Charles Givre
 Fix For: 2.0.0


This PR adds several utility functions to facilitate working with dates and 
times.  These are modeled after the date/time functionality in MySQL.

Specifically this adds:
 * YEARWEEK():  Returns the year and week as an int, e.g. 202002
 * TIME_STAMP():  Converts most anything that looks like a date 
string into a timestamp.
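
For illustration, a hedged usage sketch (the argument types and literals below are
assumptions):

  SELECT yearweek(CAST('2020-01-06' AS DATE)) AS yr_week,   -- e.g. 202002
         time_stamp('01/06/2020 10:30 PM')    AS parsed_ts
  FROM (VALUES(1));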



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (DRILL-8335) Add Ability to Query GoogleSheets Tabs by Index

2022-10-14 Thread Charles Givre (Jira)
Charles Givre created DRILL-8335:


 Summary: Add Ability to Query GoogleSheets Tabs by Index
 Key: DRILL-8335
 URL: https://issues.apache.org/jira/browse/DRILL-8335
 Project: Apache Drill
  Issue Type: Improvement
  Components: Storage - GoogleSheets
Affects Versions: 1.20.2
Reporter: Charles Givre
Assignee: Charles Givre
 Fix For: 2.0.0


The GoogleSheets plugin does not provide a way for a user to query data if they 
do not know the available tab names.  This adds the ability to query by index 
of the tabs.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (DRILL-8333) Fix Resource Leaks in HTTP Plugin

2022-10-13 Thread Charles Givre (Jira)
Charles Givre created DRILL-8333:


 Summary: Fix Resource Leaks in HTTP Plugin
 Key: DRILL-8333
 URL: https://issues.apache.org/jira/browse/DRILL-8333
 Project: Apache Drill
  Issue Type: Bug
  Components: Storage - HTTP
Affects Versions: 1.20.2
Reporter: Charles Givre
Assignee: Charles Givre
 Fix For: 1.20.3


The HTTP plugin has several methods which collect a `ResponseBody` object but 
do not close these objects.  This is causing a resource leak and will cause 
Drill to fail in the event that queries fire off many API calls. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

