[jira] [Created] (DRILL-8473) Update and incorporate Agirish/drill-helm-charts in Drill

2024-01-01 Thread James Turton (Jira)
James Turton created DRILL-8473:
---

 Summary: Update and incorporate Agirish/drill-helm-charts in Drill
 Key: DRILL-8473
 URL: https://issues.apache.org/jira/browse/DRILL-8473
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.21.1
Reporter: James Turton
Assignee: James Turton
 Fix For: 1.22.0


Helm charts for deploying Drill on Kubernetes were developed by [~agirish] and 
[released under the Apache 
License|https://github.com/Agirish/drill-helm-charts]. These charts can be 
updated to make use of the container images that are automatically published to 
Docker Hub, and incorporated into the Drill codebase where they can be 
maintained by Drill contributors.
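
A minimal sketch of what using the incorporated charts could look like (the 
chart path and release name here are assumptions, not settled details):

  git clone https://github.com/Agirish/drill-helm-charts.git
  cd drill-helm-charts
  helm install drill ./drill --namespace drill --create-namespace

The chart's default image would then point at the apache/drill images 
published automatically to Docker Hub.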





Re: [PR] BACKPORT-TO-STABLE: Backported fixes for Drill 1.21.2 (drill)

2024-01-01 Thread via GitHub


jnturton commented on PR #2860:
URL: https://github.com/apache/drill/pull/2860#issuecomment-1873633340

   > @jnturton Did we add #2795 to this?
   
   @cgivre yes, it's here.





Re: [PR] DRILL-8470: Bump MongoDB Driver to Latest Version (drill)

2024-01-01 Thread via GitHub


jnturton commented on code in PR #2862:
URL: https://github.com/apache/drill/pull/2862#discussion_r1439154883


##
contrib/format-image/pom.xml:
##
@@ -39,7 +39,7 @@
     <dependency>
       <groupId>com.drewnoakes</groupId>
       <artifactId>metadata-extractor</artifactId>
-      <version>2.18.0</version>
+      <version>2.19.0</version>

Review Comment:
   I guess there's a bug that got fixed in this library and that's now breaking 
a unit test?






Re: [PR] DRILL-2835: Daffodil Feature for Drill (drill)

2024-01-01 Thread via GitHub


cgivre commented on code in PR #2836:
URL: https://github.com/apache/drill/pull/2836#discussion_r1439055155


##
contrib/format-daffodil/src/main/java/org/apache/drill/exec/store/daffodil/DaffodilBatchReader.java:
##
@@ -0,0 +1,184 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.drill.exec.store.daffodil;
+
+import java.io.InputStream;
+import java.net.URI;
+import java.net.URISyntaxException;
+import java.util.Objects;
+
+import org.apache.daffodil.japi.DataProcessor;
+import org.apache.drill.common.AutoCloseables;
+import org.apache.drill.common.exceptions.CustomErrorContext;
+import org.apache.drill.common.exceptions.UserException;
+import org.apache.drill.exec.physical.impl.scan.v3.ManagedReader;
+import org.apache.drill.exec.physical.impl.scan.v3.file.FileDescrip;
+import org.apache.drill.exec.physical.impl.scan.v3.file.FileSchemaNegotiator;
+import org.apache.drill.exec.physical.resultSet.RowSetLoader;
+import org.apache.drill.exec.record.metadata.TupleMetadata;
+import org.apache.drill.exec.store.daffodil.schema.DaffodilDataProcessorFactory;
+import org.apache.drill.exec.store.dfs.DrillFileSystem;
+import org.apache.drill.exec.store.dfs.easy.EasySubScan;
+import org.apache.hadoop.fs.Path;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import static org.apache.drill.exec.store.daffodil.schema.DrillDaffodilSchemaUtils.daffodilDataProcessorToDrillSchema;
+
+public class DaffodilBatchReader implements ManagedReader {
+
+  private static final Logger logger = LoggerFactory.getLogger(DaffodilBatchReader.class);
+  private final DaffodilFormatConfig dafConfig;
+  private final RowSetLoader rowSetLoader;
+  private final CustomErrorContext errorContext;
+  private final DaffodilMessageParser dafParser;
+  private final InputStream dataInputStream;
+
+  static class DaffodilReaderConfig {
+    final DaffodilFormatPlugin plugin;
+    DaffodilReaderConfig(DaffodilFormatPlugin plugin) {
+      this.plugin = plugin;
+    }
+  }
+
+  public DaffodilBatchReader(DaffodilReaderConfig readerConfig, EasySubScan scan, FileSchemaNegotiator negotiator) {
+
+    errorContext = negotiator.parentErrorContext();
+    this.dafConfig = readerConfig.plugin.getConfig();
+
+    String schemaURIString = dafConfig.getSchemaURI(); // "schema/complexArray1.dfdl.xsd";
+    String rootName = dafConfig.getRootName();
+    String rootNamespace = dafConfig.getRootNamespace();
+    boolean validationMode = dafConfig.getValidationMode();
+
+    URI dfdlSchemaURI;
+    try {
+      dfdlSchemaURI = new URI(schemaURIString);
+    } catch (URISyntaxException e) {
+      throw UserException.validationError(e)
+          .build(logger);
+    }
+
+    FileDescrip file = negotiator.file();
+    DrillFileSystem fs = file.fileSystem();
+    URI fsSchemaURI = fs.getUri().resolve(dfdlSchemaURI);
+
+    DaffodilDataProcessorFactory dpf = new DaffodilDataProcessorFactory();
+    DataProcessor dp;
+    try {
+      dp = dpf.getDataProcessor(fsSchemaURI, validationMode, rootName, rootNamespace);
+    } catch (Exception e) {
+      throw UserException.dataReadError(e)
+          .message(String.format("Failed to get Daffodil DFDL processor for: %s", fsSchemaURI))
+          .addContext(errorContext).addContext(e.getMessage()).build(logger);
+    }
+    // Create the corresponding Drill schema.
+    // Note: this could be a very large schema. Think of a large complex RDBMS schema,
+    // all of it, hundreds of tables, but all part of the same metadata tree.
+    TupleMetadata drillSchema = daffodilDataProcessorToDrillSchema(dp);
+    // Inform Drill about the schema
+    negotiator.tableSchema(drillSchema, true);
+
+    //
+    // DATA TIME: Next we construct the runtime objects, and open files.
+    //
+    // We get the DaffodilMessageParser, which is a stateful driver for daffodil that
+    // actually does the parsing.
+    rowSetLoader = negotiator.build().writer();
+
+    // We construct the Daffodil InfosetOutputter which the daffodil parser uses to
+    // convert infoset event calls to fill in a Drill row via a rowSetLoader.
+    DaffodilDrillInfosetOutputter outputter = new DaffodilDrillInfosetOutputter(rowSetLoader);

Re: Next Version

2024-01-01 Thread Paul Rogers
Hi All,

My two cents on Charles' other points: about Drill's use with Mongo or
Druid. If this is common, we might want to put more effort into the
integrations above the level of the reader. I'm most familiar with Druid,
so let's use that as an example.

Druid provides a SQL interface, so it is convenient to forward Drill
queries to Druid as SQL. But Druid has a very limited distribution
architecture: it is two-level, with a coordinator and data nodes. This
means we've got, say, 10 Drill nodes that pick one Drill node to be the
reader that talks to the one Druid coordinator, which then talks to, say, 20
data nodes. This is clearly a bottleneck, and will never perform anywhere
near what Druid's native UI can do.

So, a better approach is to bypass Druid SQL and use Druid native queries,
bypassing the coordinator and talking directly to the data nodes. Now we
have our 10 Drill nodes each talking to two Druid data nodes, providing
parallelism far better than Druid itself provides. Drill's distributed
sort, join and windowing functionality is far more scalable than Druid's
single-node-only functionality.
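
For illustration, a minimal sketch of the direct-to-data-node idea in Java
(the host, port and datasource are made up for this example; the query body
is standard Druid native query JSON):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class DruidNativeScanSketch {
  public static void main(String[] args) throws Exception {
    // A Druid native "scan" query: the JSON query language that Druid
    // itself translates SQL into.
    String scanQuery =
        "{\n"
        + "  \"queryType\": \"scan\",\n"
        + "  \"dataSource\": \"wikipedia\",\n"
        + "  \"intervals\": [\"2024-01-01/2024-01-02\"],\n"
        + "  \"columns\": [\"page\", \"user\"],\n"
        + "  \"resultFormat\": \"compactedList\"\n"
        + "}";
    // Each Drill minor fragment would POST its slice of the query straight
    // to one data node, skipping the coordinator bottleneck entirely.
    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("http://druid-data-node-1:8083/druid/v2/"))
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(scanQuery))
        .build();
    HttpResponse<String> response = HttpClient.newHttpClient()
        .send(request, HttpResponse.BodyHandlers.ofString());
    System.out.println(response.body());
  }
}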

Druid is optimized for small, simple queries that power dashboards. Druid
frowns on "BI" use cases that touch large chunks of data. In Druid, the
coordinator is the bottleneck: BI queries against the coordinator kill
dashboard SLAs. With the above setup, Drill would provide a wonderful,
scalable BI solution for Druid that does not degrade the system because
Drill would no longer put load on Druid's weak link: the coordinator node.

Mongo is also distributed. Does it have the same potential to use Drill to
distribute work to avoid a similar bottleneck?

To give MapR some credit, MapR-DB had a client that allowed distributed
queries. The Drill integration with MapR-DB was supposed to use an approach
similar to the one outlined above for Druid.

Alas, the above trick won't work for a traditional DBMS using JDBC.
However, if the DB is sharded, then, with the right metadata, Drill could
distribute queries to the shards so the DB's own query system doesn't have
to.

So there you have it, a fun weekend project for someone familiar with the
details of a particular distributed DB.

Thanks,

- Paul


On Mon, Jan 1, 2024 at 7:17 AM Charles Givre  wrote:

> To continue the thread hijacking
>
> I'd agree with what James is saying.  What if we were to create a docker
> container (or some sort of package) that included Drill, Superset and all
> associated configuration stuff so that a user could just run a docker
> command and have a fully functional Drill instance set up with Superset?
>
> Regarding the JSON, for a while we were working on updating all the
> plugins to use EVF2.  From my recollection, we got all the formats
> converted except for parquet (major project) and HDF5 (PR pending:
> https://github.com/apache/drill/pull/2515).  We had also started working
> on removing the old JSON reader, however, there were a few places it reared
> its head:
> 1.  The Druid plugin.  I wrote a draft PR that is pending to swap it out
> for the EVF JSON reader but haven't touched it in a really long time. (
> https://github.com/apache/drill/pull/2657)
> 2.  The Mongo plugin:  No work there...
> 3.  The conversion UDFs.   Work started.  (
> https://github.com/apache/drill/pull/2567)
>
> In any event, given the interest in Mongo/Drill, it might be worthwhile to
> take a look at the Mongo plugin to see what it would take to swap out the
> old JSON reader for the EVF one.
> Regarding unprojected columns, if that's the holdup, I'd say scrap that
> feature for complex data types.
>
> What do you think?
>
>
> > On Jan 1, 2024, at 07:57, James Turton  wrote:
> >
> > P.P.S. since I'm spamming this thread today. With
> >
> > > this suggests to me that we should keep putting effort into: embedded
> Drill, Windows support, rapid installation and setup, low "time to insight".
> >
> > I'm not going so far as to suggest that Drill be thought of as desktop
> software, rather that ad hoc Drill deployments working on small (Gb) to big
> (Tb) data may be as, or more, important than long lived, heavily
> integrated, professionally managed deployments working on really Big data
> (Pb). Perhaps the last category belongs almost entirely to BigQuery,
> Athena, Snowflake and the like nowadays anyway.
> >
> > I still think a cluster is often the most effective way to deploy
> Drill so the question contemplated is really "Can we make it faster and
> easier to spin up a cluster (and embedded Drill), connect to data sources
> and start running (successful) queries"?
> >
> > On 2024/01/01 07:33, James Turton wrote:
> >> P.S. I also have an admittedly vague idea about deprecating the UNION
> data type, which still breaks things in many operators, in favour of a
> different approach where we kick any invalid data encountered while loading
> column FOO out to a generated _FOO_EXCEPTIONS VARCHAR (or VARBINARY, though
> binary data formats tend not to be 

Re: Next Version

2024-01-01 Thread Paul Rogers
Hi All,

Thanks for the insight into current Drill usage.

Just to clarify one point, the discussion around UNION, LIST and REPEATED
LIST was specifically for handling non-projected columns in EVF. Some
history. Originally, the readers would either a) read all columns into
vectors, then rely on a PROJECT operator to remove them, or b) use
super-complex, reader-specific code to do the projection at read time (e.g.
Parquet). To provide better performance, and reduce direct memory
fragmentation, EVF handles the projection at read time. Readers read all
columns and provide data to EVF; EVF ignores the unprojected columns. This
is super easy for scalar columns. Moderately complex for arrays, maps and
maps of arrays. It got quite messy for UNION, LIST and REPEATED LIST due to
the complex, type-based, ever-changing nature of those structures.
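
A sketch of the read-time pattern being described (the class names are from
Drill's EVF; the reader logic itself is simplified and hypothetical):

import org.apache.drill.common.types.TypeProtos.MinorType;
import org.apache.drill.exec.physical.impl.scan.v3.file.FileSchemaNegotiator;
import org.apache.drill.exec.physical.resultSet.RowSetLoader;
import org.apache.drill.exec.record.metadata.SchemaBuilder;
import org.apache.drill.exec.record.metadata.TupleMetadata;

public class EvfProjectionSketch {
  void loadOneRow(FileSchemaNegotiator negotiator) {
    // Declare everything the reader can produce.
    TupleMetadata schema = new SchemaBuilder()
        .add("id", MinorType.INT)
        .add("name", MinorType.VARCHAR)
        .buildSchema();
    negotiator.tableSchema(schema, true);
    RowSetLoader writer = negotiator.build().writer();

    writer.start();
    // The reader writes every column it read. If "name" is not in the
    // query's projection list, EVF supplies a no-op writer and the value
    // is dropped at read time: no vector allocated, no PROJECT operator
    // needed later in the DAG.
    writer.scalar("id").setInt(1);
    writer.scalar("name").setString("example");
    writer.save();
  }
}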

The thing that makes UNION, LIST and REPEATED LIST "special" is that they
are all based on a dynamic type system: data can change types, either per
row (UNION), a list of varying-type values (LIST), or lists of lists of
varying-type values (REPEATED LIST). (For example, a JSON field holding 10
on one row and "ten" on the next maps to UNION; [1, "two"] maps to LIST;
[[1], ["two"]] to REPEATED LIST.) Following Visual Basic, I call these
"variants": we have variants (UNION), variant arrays (LIST), and 2D variant
arrays (REPEATED LIST). Oddly, there is no "REPEATED UNION": one uses LIST
instead -- the only type in Drill with this distinction. We end up with
REPEATED LIST only because we have REPEATED everything (except UNION), even
though LIST itself is already a "REPEATED UNION." Yes, it is a bit of a
mess and has been so from the start.

The variant types, of course, were meant to allow reading unstructured JSON
or XML: Drill could query a web page, say. But, of course, SQL is an awful
language to work with such data. As a result, Drill's variant functionality
has always been a muddle: it mostly works at the vector level (after
numerous fixes), but SQL is not up to the task of wrangling the resulting
data. To be fair, the Druid folks confidently ran into the same mess when
they started evolving Druid to handle unstructured JSON data using SQL: it
*should* work in theory, but doesn't actually work in practice. They ran
into the same problems we've discussed: this suggests that the problems are
real, not just the result of lack of knowledge by folks who built the thing.

So, a proposal is to handle projection at read time for all but the variant
cases. EVF always writes variant vectors, projected or not. A PROJECT
operator removes unwanted columns later in the DAG, just like in the
pre-EVF days. The result is not performant, but who cares if few people use
these features.

For the remaining readers not yet using EVF, I might suggest that if it
ain't broke, don't fix it. The key value that EVF adds for existing readers
is efficient handling of non-projected columns. However, for something like
the JSON, Druid or Mongo readers, the DB itself will return only the
projected columns, so the reader should project all returned columns.
(Contrast this with a JSON file: we are forced to read all the columns,
even if we only want 5 out of 100. Dumping the other 95 at read time is a
big win.)

Existing readers, such as Parquet, already contain one-off versions of the
complex code needed to write to vectors and project columns. There is no
win for the users by taking on the daunting challenge of rewriting the
Parquet reader to use EVF. For NEW readers, of course, EVF provides all
this functionality ready-made, so new readers should be written to use EVF.

The key purpose of Parquet back in the halcyon days of Hadoop was as a
compact, efficient data file format for distributed processing, especially
for applications with thousands or millions of files. Impala, for example,
targeted this exact use case. Does Parquet still have this role on large,
cloud-based systems? If so, use Athena or Presto, along with Glue for
metadata. For the desktop, use Drill without metadata. Still, is
Parquet-at-scale a common format for desktop usage? If not, then the
existing reader is probably good enough.

Data "lakehouses" seem to be the new thing: Delta Lake with Spark and the
like. Drill would need some significant work to play well in that context
since it was designed for stand-alone usage and Hadoop integration. As
Hadoop continues to fade, and new architectures emerge, perhaps Drill will
avoid lakehouse integration work by focusing on the
desktop/small-deployment use case.

To summarize, for the use cases where Drill still finds users, we can punt
on the variant types (UNION, LIST, REPEATED LIST): EVF can read them, but
will always project them. A PROJECT operator can remove them if they are
not projected. This approach has a performance hit, but that won't matter
if no one actually uses these particular types. If we go this route, then
the fixes I did are probably good enough: I just need to dust off the code
and create a PR.

Does this make sense?

Thanks,

- Paul


On Mon, Jan 1, 2024 at 7:17 AM Charles Givre  wrote:

> To continue the thread 

[jira] [Created] (DRILL-8472) Bump Image Metadata Library to Latest Version

2024-01-01 Thread Charles Givre (Jira)
Charles Givre created DRILL-8472:


 Summary: Bump Image Metadata Library to Latest Version
 Key: DRILL-8472
 URL: https://issues.apache.org/jira/browse/DRILL-8472
 Project: Apache Drill
  Issue Type: Task
Affects Versions: 1.21.1
Reporter: Charles Givre
Assignee: Charles Givre
 Fix For: 1.21.2


Bump Metadata Extractor dependency to latest version.





[jira] [Created] (DRILL-8471) Bump DeltaLake Driver to Version 3.0.0

2024-01-01 Thread Charles Givre (Jira)
Charles Givre created DRILL-8471:


 Summary: Bump DeltaLake Driver to Version 3.0.0
 Key: DRILL-8471
 URL: https://issues.apache.org/jira/browse/DRILL-8471
 Project: Apache Drill
  Issue Type: Task
  Components: Format - DeltaLake
Reporter: Charles Givre


Bump DeltaLake Driver to Version 3.0.0





Re: Next Version

2024-01-01 Thread Charles Givre
To continue the thread hijacking

I'd agree with what James is saying.  What if we were to create a docker 
container (or some sort of package) that included Drill, Superset and all 
associated configuration stuff so that a user could just run a docker command 
and have a fully functional Drill instance set up with Superset?

Regarding the JSON, for a while we were working on updating all the plugins to 
use EVF2.  From my recollection, we got all the formats converted except for 
parquet (major project) and HDF5 (PR pending: 
https://github.com/apache/drill/pull/2515).  We had also started working on 
removing the old JSON reader, however, there were a few places it reared its 
head:
1.  The Druid plugin.  I wrote a draft PR that is pending to swap it out for 
the EVF JSON reader but haven't touched it in a really long time. 
(https://github.com/apache/drill/pull/2657)
2.  The Mongo plugin:  No work there... 
3.  The conversion UDFs.   Work started.  
(https://github.com/apache/drill/pull/2567)

In any event, given the interest in Mongo/Drill, it might be worthwhile to take 
a look at the Mongo plugin to see what it would take to swap out the old JSON 
reader for the EVF one. 
Regarding unprojected columns, if that's the holdup, I'd say scrap that feature 
for complex data types. 

What do you think?


> On Jan 1, 2024, at 07:57, James Turton  wrote:
> 
> P.P.S. since I'm spamming this thread today. With
> 
> > this suggests to me that we should keep putting effort into: embedded 
> > Drill, Windows support, rapid installation and setup, low "time to insight".
> 
> I'm not going so far as to suggest that Drill be thought of as desktop 
> software, rather that ad hoc Drill deployments working on small (Gb) to big 
> (Tb) data may be as, or more, important than long lived, heavily integrated, 
> professionally managed deployments working on really Big data (Pb). Perhaps 
> the last category belongs almost entirely to BigQuery, Athena, Snowflake and 
> the like nowadays anyway.
> 
> I still think a cluster is often the most effective way to deploy Drill 
> so the question contemplated is really "Can we make it faster and easier to 
> spin up a cluster (and embedded Drill), connect to data sources and start 
> running (successful) queries"?
> 
> On 2024/01/01 07:33, James Turton wrote:
>> P.S. I also have an admittedly vague idea about deprecating the UNION data 
>> type, which still breaks things in many operators, in favour of a different 
>> approach where we kick any invalid data encountered while loading column FOO 
>> out to a generated _FOO_EXCEPTIONS VARCHAR (or VARBINARY, though binary data 
>> formats tend not to be malformed?) column. This would let a query over dirty 
>> data complete without invisible data swallowing, and would mean we could cut 
>> further effort on UNION support.
>> 
>> On 2024/01/01 07:11, James Turton wrote:
>>> Happy New Year!
>>> 
>>> Here's another two cents. Make that five now that I scan this email again!
>>> 
>>> Excluding our Docker Hub images (which are popular), Drill is downloaded 
>>> ~1000 times a month [1] (order of magnitude, it's hard to count genuinely 
>>> new installations from web server downloads).
>>> 
>>> What roles are these folks in? I'm a data engineer by day and I don't think 
>>> that we count for a large share of those downloads. The DEs I work with are 
>>> risk averse sorts that tend to favour setups with rigid schemas early on 
>>> and no surprises for their users at query time. Add to that a second stat 
>>> from the download data: the biggest single download user OS is Windows, at 
>>> about 50% [1]. Some of these users may go on to copy that download to a 
>>> server environment but I have a theory that many of them go on to run 
>>> embedded Drill right there on beefy Windows laptops.
>>> 
>>> I conjecture that most of the people reaching for Drill are analysts or 
>>> developers working _away_ from an established, shared data infrastructure. 
>>> There may not be any shared data engineering where they are, or they may 
>>> find themselves in a fashionable "Data Mesh" environment [2]. I'm probably 
>>> abusing Data Mesh a bit here in that I'm told that it mainly proposes a 
>>> federation of distinct data _teams_, rather than of data _systems_ but, if 
>>> you entertain my cynical formulation of "Data Mesh guys! Silos aren't 
>>> uncool any more!" just a bit, then you can well imagine why a user in a 
>>> Data Mesh might look for something like Drill to combine data from 
>>> different silos on their own machine. Tangentially this suggests to me that 
>>> we should keep putting effort into: embedded Drill, Windows support, rapid 
>>> installation and setup, low "time to insight".
>>> 
>>> MongoDB questions still come up frequently, giving a reason beyond the JSON 
>>> file questions to think that the JSON data model is still very important. 
>>> Wherever we decide to bound the current EVF v2 data model implementation, 
>>> maybe we 

Re: [PR] BACKPORT-TO-STABLE: Backported fixes for Drill 1.21.2 (drill)

2024-01-01 Thread via GitHub


cgivre commented on PR #2860:
URL: https://github.com/apache/drill/pull/2860#issuecomment-1873350619

   @jnturton 
   Did we add https://github.com/apache/drill/pull/2795 to this?
   





[PR] DRILL-8470: Bump MongoDB Driver to Latest Version (drill)

2024-01-01 Thread via GitHub


cgivre opened a new pull request, #2862:
URL: https://github.com/apache/drill/pull/2862

   # [DRILL-8470](https://issues.apache.org/jira/browse/DRILL-8470): Bump 
MongoDB Driver to Latest Version
   
   ## Description
   Update the MongoDB Java driver to the latest version.
   
   ## Documentation
   N/A
   
   ## Testing
   Ran unit tests.





[jira] [Created] (DRILL-8470) Bump MongoDB Driver to Latest Version

2024-01-01 Thread Charles Givre (Jira)
Charles Givre created DRILL-8470:


 Summary: Bump MongoDB Driver to Latest Version
 Key: DRILL-8470
 URL: https://issues.apache.org/jira/browse/DRILL-8470
 Project: Apache Drill
  Issue Type: Task
  Components: Storage - MongoDB
Affects Versions: 1.21.1
Reporter: Charles Givre
Assignee: Charles Givre
 Fix For: 1.21.2


Bump MongoDB driver to latest version.





Re: Next Version

2024-01-01 Thread James Turton

P.P.S. since I'm spamming this thread today. With

> this suggests to me that we should keep putting effort into: embedded 
Drill, Windows support, rapid installation and setup, low "time to insight".


I'm not going so far as to suggest that Drill be thought of as desktop 
software, rather that ad hoc Drill deployments working on small (Gb) to 
big (Tb) data may be as, or more, important than long lived, heavily 
integrated, professionally managed deployments working on really Big 
data (Pb). Perhaps the last category belongs almost entirely to 
BigQuery, Athena, Snowflake and the like nowadays anyway.


I still think a cluster is often the most effective way to deploy 
Drill so the question contemplated is really "Can we make it faster and 
easier to spin up a cluster (and embedded Drill), connect to data 
sources and start running (successful) queries"?


On 2024/01/01 07:33, James Turton wrote:
P.S. I also have an admittedly vague idea about deprecating the UNION 
data type, which still breaks things in many operators, in favour of a 
different approach where we kick any invalid data encountered while 
loading column FOO out to a generated _FOO_EXCEPTIONS VARCHAR (or 
VARBINARY, though binary data formats tend not to be malformed?) 
column. This would let a query over dirty data complete without 
invisible data swallowing, and would mean we could cut further effort 
on UNION support.
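
A sketch of how a reader could implement that fallback (the column names
follow the proposal; the parsing logic is hypothetical):

import org.apache.drill.exec.physical.resultSet.RowSetLoader;

public class ExceptionsColumnSketch {
  // Hypothetical: loading raw text into INT column FOO, with invalid
  // values diverted to a generated VARCHAR column instead of being
  // silently swallowed or failing the query.
  static void writeFoo(RowSetLoader writer, String raw) {
    try {
      writer.scalar("FOO").setInt(Integer.parseInt(raw));
    } catch (NumberFormatException e) {
      writer.scalar("_FOO_EXCEPTIONS").setString(raw);
    }
  }
}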


On 2024/01/01 07:11, James Turton wrote:

Happy New Year!

Here's another two cents. Make that five now that I scan this email 
again!


Excluding our Docker Hub images (which are popular), Drill is 
downloaded ~1000 times a month [1] (order of magnitude, it's hard to 
count genuinely new installations from web server downloads).


What roles are these folks in? I'm a data engineer by day and I don't 
think that we count for a large share of those downloads. The DEs I 
work with are risk averse sorts that tend to favour setups with rigid 
schemas early on and no surprises for their users at query time. Add 
to that a second stat from the download data: the biggest single 
download user OS is Windows, at about 50% [1]. Some of these users 
may go on to copy that download to a server environment but I have a 
theory that many of them go on to run embedded Drill right there on 
beefy Windows laptops.


I conjecture that most of the people reaching for Drill are analysts 
or developers working _away_ from an established, shared data 
infrastructure. There may not be any shared data engineering where 
they are, or they may find themselves in a fashionable "Data Mesh" 
environment [2]. I'm probably abusing Data Mesh a bit here in that 
I'm told that it mainly proposes a federation of distinct data 
_teams_, rather than of data _systems_ but, if you entertain my 
cynical formulation of "Data Mesh guys! Silos aren't uncool any 
more!" just a bit, then you can well imagine why a user in a Data 
Mesh might look for something like Drill to combine data from 
different silos on their own machine. Tangentially this suggests to 
me that we should keep putting effort into: embedded Drill, Windows 
support, rapid installation and setup, low "time to insight".


MongoDB questions still come up frequently, giving a reason beyond the 
JSON file questions to think that the JSON data model is still very 
important. Wherever we decide to bound the current EVF v2 data model 
implementation, maybe we can sketch out a design of whatever is 
unimplemented in some updates to the Drill wiki pages? This would 
give other devs a head start if we decide that some unsupported 
complex data type is worth implementing down the road?


1. https://infra-reports.apache.org/#downloads=drill
2. https://martinfowler.com/articles/data-mesh-principles.html

Regards
James

On 2024/01/01 03:16, Charles Givre wrote:
I'll throw my .02 here...  As a user of Drill, I've only had the 
occasion to use the Union once. However, when I used it, it consumed 
so much memory that we ended up finding a workaround and stopped 
using it. Honestly, since we improved the implicit casting rules, I 
think Drill is a lot smarter about how it reads data anyway. Bottom 
line, I do think we could drop the union and repeated union.


The repeated lists and maps, however, are unfortunately something that 
does come up a bit.   Honestly, I'm not sure what work is remaining 
here but TBH Drill works pretty well at the moment with most of the 
data I'm using it for.  This would include some really nasty nested 
JSON objects.


-- C



On Dec 31, 2023, at 01:38, Paul Rogers  wrote:

Hi Luoc,

Thanks for reminding me about the EVF V2 work. I got mostly done adding 
projection for complex types, then got busy on other projects. I've yet to 
tackle the hard cases: unions, repeated unions and repeated lists (which 
are, in fact, repeated repeated unions).

The code to handle unprojected fields in these areas is getting awfully 
complicated. In doing that work, and then seeing a trick that