Re: Does Drill support variable arguments in a custom UDAF

2020-10-28 Thread Paul Rogers
A variable-arg UDAF might be a bit harder than a var-arg UDF, as support is
needed in the aggregation operators.

- Paul

On Wed, Oct 28, 2020 at 5:21 AM Charles Givre  wrote:

> Hi there,
> Drill does support VARARG UDFs.  Take a look at this PR for an example:
> https://github.com/apache/drill/pull/1835 <
> https://github.com/apache/drill/pull/1835>
> -- C
>
> > On Oct 28, 2020, at 4:22 AM, wingc.s...@qq.com 
> wrote:
> >
> > I have seen the mailing list history. The same problem was mentioned
> > in 2015 and 2017. I wonder if Drill supports passing variable arguments in
> > a custom UDAF such as my_sum(1, 2, 3, ..). I'm looking forward to
> > your reply, thanks.
>
>


Re: Thought about schemaless sources (mongodb/json)

2020-10-26 Thread Paul Rogers
Hi Dobes,

The good news is that, with the "new" JSON reader we added recently, it
would be pretty easy to read each row as text.

Most of what I said earlier still applies, however. Drill has a function to
parse JSON into fields. It suffers from the same limitations as the JSON
readers (because the code is identical).

You could write code to pull out each column as is done in DBs. That way,
Drill knows the set of columns because they appear explicitly in the SELECT
clause. All good. This, however, introduces an extreme form of the CAST
issue: users must know not only how to CAST, but also how to parse JSON.

One drawback of the field-by-field way of extracting JSON is that doing so
is very inefficient: the JSON must be parsed for each column extraction.
Perhaps a clever optimizer could combine all such expressions into a single
combined operation that parses the JSON once. Of course, if we do that,
we're kind of back to what the JSON reader always does.

Further, if the JSON fields are inconsistent, you need to add conditional
code to handle cases such as the Mongo structured constants, or the case
where a column is both a text "10" and a number 20. This can be done, it
just makes each SQL statement more complex, which may or may not be a
problem. Again, a view could hide this complexity.
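
As a rough sketch of that idea (JSON_EXTRACT_PATH and JSON_TYPEOF here are the
UDFs proposed in your note, not functions Drill ships today, and the workspace,
view name, and file path are invented for illustration):

  CREATE VIEW dfs.tmp.clean_events AS
  SELECT CAST(JSON_EXTRACT_PATH(`row`, '$.foo') AS INTEGER) AS foo,
         CASE WHEN JSON_TYPEOF(`row`, '$.bar') = 'number'
              THEN CAST(JSON_EXTRACT_PATH(`row`, '$.bar') AS INTEGER)
              ELSE CAST(JSON_EXTRACT_PATH(`row`, '$.foo') AS INTEGER)
         END AS bar
  FROM dfs.`/data/events.json`;  -- each row arrives as a single text column

Users then query dfs.tmp.clean_events and never see the raw text column or the
conditional logic.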

Still, it is simple enough to give the idea a try: 1) add the read-each-row-as-text
option using the new JSON reader; 2) add a "parse JSON field" function as a
UDF. Java libraries exist to do the heavy lifting. You could then try out
the idea to see how it works.

Thanks,

- Paul


On Mon, Oct 26, 2020 at 11:57 AM Dobes Vandermeer  wrote:

> Hi Paul,
>
> I think you misunderstood my proposal.  I was suggesting that each whole
> row would be provided as a single column of TEXT.
>
> So if your JSON is like:
>
> {"foo":1}
> {"foo":2, "bar": 3}
>
> The schema would initially be just a single column, maybe row TEXT NOT
> NULL.
>
> Then when you operate on that data you would use projection functions,
> (e.g. similar to the ones in PostgreSQL
> https://www.postgresql.org/docs/9.5/functions-json.html) like:
>
> SELECT SUM(CAST(JSON_EXTRACT_PATH(row, '$.foo') AS INTEGER)) FROM ...
> SELECT SUM(CASE WHEN JSON_TYPEOF(row, '$.bar') = 'number' THEN
> CAST(JSON_EXTRACT_PATH(row, '$.bar') AS INTEGER) ELSE
> CAST(JSON_EXTRACT_PATH(row, '$.foo') AS INTEGER) END) FROM ...
>
>
> A similar approach could be taken for CSV data although you would want the
> column headings to be available, which makes it a bit more complicated but
> I think a system could be designed for that.
>
> I don't think I would propose this as the default way of handling
> JSON/CSV necessarily. For people lucky enough to have data that adheres to
> a strict universal schema with no missing fields / nulls the current
> approach is more convenient since you can use a better syntax.
>
> For people with less perfect data this approach allows them to "clean up"
> data using a view or a CREATE TABLE AS that parses the JSON / MongoDB data
> using custom logic to produce a Parquet file with a fully specified schema.
>
> On 10/26/2020 10:54:00 AM, Paul Rogers  wrote:
> Hi Dobes,
>
> Thanks for the idea: a text-only approach would certainly solve one class
> of problems. As it turns out, CSV does read all columns as text.
>
> The solution does not solve two other problems which have popped up from
> time to time. First, if file A contains columns a and b, and file B
> contains columns a, b, and c, then the CSV reader for A does not know to
> read column c. Some other operator in Drill will fill in the column, and
> will choose Nullable INT as the type. That could, however, be changed to
> choose Nullable Varchar.
>
> Second, if a batch of A columns is given to the client before Drill sees
> the first file B rows, then the client will see a schema of (a, b) followed
> by a schema of (a, b, c.) JDBC and ODBC clients can't handle this. The REST
> API handles this, but at the cost of buffering all rows in memory before
> sending, which causes its own issues, as someone recently noted.
>
> JSON does have an all-text mode which suffers from the same missing-column
> issue. Since the Mongo plugin reads JSON-like (BSON) documents, it should
> also support all-text mode.
>
> Mongo is a bit tricky because it allows an object syntax to specify
> scalars: {type: "int", value: 10} (IIRC). Drill does not read such values
> as strings, even in all-text mode. The result can be a type conversion
> error if some values use the extended form, others use the simple form. So,
> Mongo would need special handling. There was a start at fixing this before
> our committer reviewers dried up. We could dust off the work and finish it
> up.
>
> The all-text approach avoids type conversion 

Re: What is the most memory-efficient technique for selecting several million records from a CSV file

2020-10-26 Thread Paul Rogers
Hi Gareth,

Glad you found a solution! The fix for the REST API is to convert it from a
buffered form to streaming. As noted, the current version buffers all rows
in memory, then formats the JSON response. The fix is to stream the JSON
out as the REST API receives batches from the rest of Drill. Doing so
avoids the memory pressure problem.

As you note, it is a partial fix. The REST API appends the schema to the
JSON message after all the data. If your code needs that schema, then your
code must buffer the full result set to wait for that schema. We decided
not to change this behavior to avoid breaking existing code. We could,
however, create a new API version that sends the schema before the data.

Sending the schema first has the problem I outlined in my response to
Dobes: Drill will change the schema as the query runs: the first schema may
not contain all the columns that will occur later in the query. (JDBC/ODBC
have this same issue, by the way.)

With Drill, there is no free lunch. It's all a matter of trade-offs.

Thanks,

- Paul

On Mon, Oct 26, 2020 at 2:10 AM Gareth Western 
wrote:

> Hi Paul,
>
> What is the "partial fix" related to in the REST API? The API has worked
> fine for our needs, except in the case I mentioned where we would like to
> select 12 million records all at once. I don't think this type of query
> will ever work with the REST API until the API supports a streaming
> protocol (e.g. gRPC or rsocket), right?
>
> Regarding the cleaning, I found out that there is actually a small
> cleaning step when the CSV is first created, so it should be possible to
> use this stage to convert the data to a format such as Parquet.
>
> Regarding the immediate solution for my problem, I got the JDBC driver
> working with Python using the JayDeBeApi library, and can keep the memory
> usage down by using the fetchMany method to stream batches of results from
> the server:
> https://gist.github.com/gdubya/a2489e4b9451720bb2be996725ce35bb
>
> Mvh,
> Gareth
>
> On 23/10/2020, 22:44, "Paul Rogers"  wrote:
>
> Hi Gareth,
>
> The REST API is handy. We do have a partial fix queued up, but it got
> stalled a bit because of lack of reviewers for the tricky part of the
> code
> that is impacted. If the REST API will help your use case; perhaps you
> can
> help with review of the fix, or trying it out in your environment.
> You'd
> need a custom Drill build, but doing that is pretty easy.
>
> One other thing to keep in mind: Drill will read many kinds of "raw"
> data.
> But, Drill does require that the data be clean. For CSV, that means
> consistent columns and consistent formatting. (A column cannot be a
> number
> in one place, and a string in another. If using headers, a column
> cannot be
> called "foo" in one file, and "bar" in another.) If your files are
> messy,
> it is very helpful to run an ETL step to clean up the data so you
> don't end
> up with random failed queries. Since the data is rewritten for
> cleaning,
> you might as well write the output to Parquet as Nitin suggests.
>
> - Paul
>
>
>
> On Fri, Oct 23, 2020 at 2:54 AM Gareth Western <
> gar...@garethwestern.com>
> wrote:
>
> > Thanks Paul and Nitin.
> >
> > Yes, we are currently using the REST API, so I guess that caveat is
> the
> > main issue. I am experimenting with JDBC and ODBC, but haven't made a
> > successful connection with those from our Python apps yet (issues
> not
> > related to Drill but with the libraries I'm trying to use).
> >
> > Our use case for Drill is using it to expose some source data files
> > directly with the least amount of "preparation" possible (e.g.
> converting
> > to Parquet before working with the data). Read performance isn't a
> priority
> > yet just as long as we can actually get to the data.
> >
> > I guess I'll port the app over to Java and try again with JDBC first.
> >
> > Kind regards,
> > Gareth
> >
> > On 23/10/2020, 09:08, "Paul Rogers"  wrote:
> >
> > Hi Gareth,
> >
> > As it turns out, SELECT * by itself should use a fixed amount of
> memory
> > regardless of table size. (With two caveats.) Drill, as with
> most query
> > engines, reads data in batches, then returns each batch to the
> client.
> > So,
> > if you do SELECT * FROM yourfile.csv, the execution engine will
> use
> > only
> > enough memory for one batch of data (which is likely to be in
> th

Re: Thought about schemaless sources (mongodb/json)

2020-10-26 Thread Paul Rogers
Hi Dobes,

Thanks for the idea: a text-only approach would certainly solve one class
of problems. As it turns out, CSV does read all columns as text.

The solution does not solve two other problems which have popped up from
time to time. First, if file A contains columns a and b, and file B
contains columns a, b, and c, then the CSV reader for A does not know to
read column c. Some other operator in Drill will fill in the column, and
will choose Nullable INT as the type. That could, however, be changed to
choose Nullable Varchar.

Second, if a batch of A columns is given to the client before Drill sees
the first file B rows, then the client will see a schema of (a, b) followed
by a schema of (a, b, c.) JDBC and ODBC clients can't handle this. The REST
API handles this, but at the cost of buffering all rows in memory before
sending, which causes its own issues, as someone recently noted.

JSON does have an all-text mode which suffers from the same missing-column
issue. Since the Mongo plugin reads JSON-like (BSON) documents, it should
also support all-text mode.
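
For reference, all-text mode is just a session (or system) option. Something
like the following turns it on for the JSON reader (the Mongo plugin has a
similar all_text_mode option, if memory serves):

  -- read every JSON scalar as VARCHAR instead of inferring types
  ALTER SESSION SET `store.json.all_text_mode` = true;

With that set, the type-conversion class of errors goes away, at the cost of
the CASTs discussed below.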

Mongo is a bit tricky because it allows an object syntax to specify
scalars: {type: "int", value: 10} (IIRC). Drill does not read such values
as strings, even in all-text mode. The result can be a type conversion
error if some values use the extended form, others use the simple form. So,
Mongo would need special handling. There was a start at fixing this before
our committer reviewers dried up. We could dust off the work and finish it
up.

The all-text approach avoids type conversion issues because there is no
type conversion. Users, however, want to do math and other operations which
require numeric types. So, an inconvenience with the all-text approach is
that the user must include CASTs in every query to convert the data to the
proper type. Doing so makes queries more complex and slower.

You can work around the CAST problem by creating a view: the view contains
all the needed CASTs. The user references the view instead of the actual
file.
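
A minimal sketch of such a view (the workspace, view name, columns, and file
path are placeholders):

  CREATE VIEW dfs.tmp.orders_typed AS
  SELECT CAST(order_id AS INTEGER) AS order_id,
         CAST(amount AS DOUBLE)    AS amount,
         order_date                -- left as text
  FROM dfs.`/data/orders.json`;

Queries against dfs.tmp.orders_typed then see typed columns with no CASTs in
the user's SQL.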

Of course, if you are going to define a view, you might as well go all the
way and use the other approach you mentioned, which Drill started to
support: tell the CSV or JSON reader what schema to expect. That way, the
reader does the conversion (as for JSON), and does so reliably. Since the
schema is known at read time, all batches have the same schema, which
solves the "schema change" problem.

The key challenge is that, to solve these problems, Drill needs information
not available in the raw data. The question is: what is the most reliable,
least complex way to supply that information? All-text mode, with
pre-defined conversions, and a list of columns to expect would provide
Drill with the needed information.

One way to gather the information would be to extend Drill to do a "sniff" pass
over the data to infer a schema across the set of files to scan, work out
any inconsistencies, then do the real scan with a reliable, complete schema.

Doing the "sniff" pass for every query is slow; it would be better to do
the "sniff" pass once and reuse the information. Drill tries to do this
with Parquet metadata. AWS recognized this same problem. The AWS Glue
product, based on Hive, will "sniff" your S3 data to infer a schema which
is then stored in HMS and available for tools to use. Drill could tie into
Glue to obtain the information so that the user need not create it.

So, several issues to consider and several ways to address them. It does
look like the key challenge is what you identified: to provide Drill with
information not available in that first file record.

Thanks,

- Paul


On Mon, Oct 26, 2020 at 1:42 AM Dobes Vandermeer  wrote:

> Currently drill tries to infer schemas from data that doesn't come with
> one, such as JSON, CSV, and mongoDB.  However this doesn't work well if the
> first N rows are missing values for fields - drill just assigns an
> arbitrary type to fields that are only null and no type to fields that are
> missing completely, then rejects values when it finds them later.
>
> What if you could instead query in a mode where each row is just given as
> a string, and you use JSON functions to load the data out and convert or
> cast it to the appropriate type?
>
> For JSON in particular it's common these days to provide functions that
> extract data from a JSON string column.  BigQuery and postgres are two good
> examples.
>
> I think in many cases these JSON functions could be inspected by a driver
> and still be used for filter push
> down.
>
> Anyway, just an idea I had to approach the mongo schema problem that's a
> bit different from trying to specify the schema up front.  I think this
> approach offers more flexibility to the user at the cost of more verbose
> syntax and harder to optimize queries.
>


Re: What is the most memory-efficient technique for selecting several million records from a CSV file

2020-10-23 Thread Paul Rogers
Hi Gareth,

The REST API is handy. We do have a partial fix queued up, but it got
stalled a bit because of lack of reviewers for the tricky part of the code
that is impacted. If the REST API will help your use case; perhaps you can
help with review of the fix, or trying it out in your environment. You'd
need a custom Drill build, but doing that is pretty easy.

One other thing to keep in mind: Drill will read many kinds of "raw" data.
But, Drill does require that the data be clean. For CSV, that means
consistent columns and consistent formatting. (A column cannot be a number
in one place, and a string in another. If using headers, a column cannot be
called "foo" in one file, and "bar" in another.) If your files are messy,
it is very helpful to run an ETL step to clean up the data so you don't end
up with random failed queries. Since the data is rewritten for cleaning,
you might as well write the output to Parquet as Nitin suggests.
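
When the data is already clean enough to read, the ETL itself can be a simple
CTAS. A sketch, assuming a headerless CSV (so values arrive in the columns[]
array) and made-up paths and column names:

  -- parquet is already the default output format; shown for clarity
  ALTER SESSION SET `store.format` = 'parquet';

  CREATE TABLE dfs.tmp.`orders_parquet` AS
  SELECT CAST(columns[0] AS INTEGER) AS order_id,
         CAST(columns[1] AS DOUBLE)  AS amount,
         columns[2]                  AS order_date
  FROM dfs.`/data/orders.csv`;

For messier files, an external cleanup step before the CTAS is still the safer
route.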

- Paul



On Fri, Oct 23, 2020 at 2:54 AM Gareth Western 
wrote:

> Thanks Paul and Nitin.
>
> Yes, we are currently using the REST API, so I guess that caveat is the
> main issue. I am experimenting with JDBC and ODBC, but haven't made a
> successful connection with those from our Python apps yet (issues not
> related to Drill but with the libraries I'm trying to use).
>
> Our use case for Drill is using it to expose some source data files
> directly with the least amount of "preparation" possible (e.g. converting
> to Parquet before working with the data). Read performance isn't a priority
> yet just as long as we can actually get to the data.
>
> I guess I'll port the app over to Java and try again with JDBC first.
>
> Kind regards,
> Gareth
>
> On 23/10/2020, 09:08, "Paul Rogers"  wrote:
>
> Hi Gareth,
>
> As it turns out, SELECT * by itself should use a fixed amount of memory
> regardless of table size. (With two caveats.) Drill, as with most query
> engines, reads data in batches, then returns each batch to the client.
> So,
> if you do SELECT * FROM yourfile.csv, the execution engine will use
> only
> enough memory for one batch of data (which is likely to be in the 10s
> of
> meg in size.)
>
> The first caveat is if you do a "buffering" operation, such as a sort.
> SELECT * FROM yourfile.csv ORDER BY someCol will need to hold all data.
> But, Drill spills to disk to relieve memory pressure.
>
> The other caveat is if you use the REST API to fetch data. Drill's
> REST API
> is not scalable. It buffers all data in memory in an extremely
> inefficient
> manner. If you use the JDBC, ODBC or native APIs, then you won't have
> this
> problem. (There is a pending fix we can do for a future release.) Are
> you
> using the REST API?
>
> Note that the above is just as true of Parquet as it is with CSV.
> However,
> as Nitin notes, Parquet is more efficient to read.
>
> Thanks,
>
> - Paul
>
>
> On Thu, Oct 22, 2020 at 11:30 PM Nitin Pawar 
> wrote:
>
> > Please convert CSV to Parquet first and while doing so make sure you
> > cast each column to the correct datatype.
> >
> > Once you have it in Parquet, your queries should be a bit faster.
> >
> > On Fri, Oct 23, 2020, 11:57 AM Gareth Western <
> gar...@garethwestern.com>
> > wrote:
> >
> > > I have a very large CSV file (nearly 13 million records) stored in
> Azure
> > > Storage and read via the Azure Storage plugin. The drillbit
> configuration
> > > has a modest 4GB heap size. Is there an effective way to select
> all the
> > > records from the file without running out of resources in Drill?
> > >
> > > SELECT * … is too big
> > >
> > > SELECT * with OFFSET and LIMIT sounds like the right approach, but
> OFFSET
> > > still requires scanning through the offset records, and this seems
> to hit
> > > the same memory issues even with small LIMITs once the offset is
> large
> > > enough.
> > >
> > > Would it help to switch the format to something other than CSV? Or
> move
> > it
> > > to a different storage mechanism? Or something else?
> > >
> >
>


Re: What is the most memory-efficient technique for selecting several million records from a CSV file

2020-10-23 Thread Paul Rogers
Hi Gareth,

As it turns out, SELECT * by itself should use a fixed amount of memory
regardless of table size. (With two caveats.) Drill, as with most query
engines, reads data in batches, then returns each batch to the client. So,
if you do SELECT * FROM yourfile.csv, the execution engine will use only
enough memory for one batch of data (which is likely to be in the 10s of
meg in size.)

The first caveat is if you do a "buffering" operation, such as a sort.
SELECT * FROM yourfile.csv ORDER BY someCol will need to hold all data.
But, Drill spills to disk to relieve memory pressure.

The other caveat is if you use the REST API to fetch data. Drill's REST API
is not scalable. It buffers all data in memory in an extremely inefficient
manner. If you use the JDBC, ODBC or native APIs, then you won't have this
problem. (There is a pending fix we can do for a future release.) Are you
using the REST API?

Note that the above is just as true of Parquet as it is with CSV. However,
as Nitin notes, Parquet is more efficient to read.

Thanks,

- Paul


On Thu, Oct 22, 2020 at 11:30 PM Nitin Pawar 
wrote:

> Please convert CSV to Parquet first and while doing so make sure you cast
> each column to the correct datatype.
>
> Once you have it in Parquet, your queries should be a bit faster.
>
> On Fri, Oct 23, 2020, 11:57 AM Gareth Western 
> wrote:
>
> > I have a very large CSV file (nearly 13 million records) stored in Azure
> > Storage and read via the Azure Storage plugin. The drillbit configuration
> > has a modest 4GB heap size. Is there an effective way to select all the
> > records from the file without running out of resources in Drill?
> >
> > SELECT * … is too big
> >
> > SELECT * with OFFSET and LIMIT sounds like the right approach, but OFFSET
> > still requires scanning through the offset records, and this seems to hit
> > the same memory issues even with small LIMITs once the offset is large
> > enough.
> >
> > Would it help to switch the format to something other than CSV? Or move
> it
> > to a different storage mechanism? Or something else?
> >
>


Re: Standalone drillbit without Zookeeper?

2020-09-29 Thread Paul Rogers
Hi Matt,

Running Drill with drillbit.sh mostly does a bunch of Java setup, then
invokes a main routine that runs Drill as a service. Nothing special about
that.

Drill's "embedded" mode runs without ZK. Technically, it creates an
in-memory "cluster coordinator" that coordinates only with itself. The
"main" for embedded mode is the Sqlline process (which starts Drill via a
JDBC call.)

Many of Drill's unit tests create an in-process server using the in-memory
cluster coordinator. Although tests normally disable the embedded web
server, it is possible to enable the web server, as is done by one or two
tests. Depending on what you're trying to do, you could probably cobble
something together along these lines.

A bit more ambitious option is to add a Drill command-line option to run as
a service (with all the usual Java and logging setup), but use the
single-node, non-ZK cluster coordinator.

- Paul


On Tue, Sep 29, 2020 at 2:43 PM Matt Keranen  wrote:

> Is it possible to run a single node drillbit without Zookeeper, as a
> "service" without the need for coordination across multiple nodes?
>
> `zk.connect: "local"` is not accepted as the equivalent of "zk=local" with
> drill-embedded.
>


Re: How to make the Drill query optimizer always push down clauses

2020-09-23 Thread Paul Rogers
Absolutely. I'm not sure who manages it. I'll ask (on Slack).

- Paul

On Wed, Sep 23, 2020 at 12:14 PM Carolina Gomes  wrote:

> Also, could I be added to the Slack channel?
>
> On Wed, Sep 23, 2020 at 2:57 PM Carolina Gomes  wrote:
>
> > Hi Paul,
> >
> >
> > That would be great even if you can just copy the discussion here. Being
> > able to do that would greatly optimize the performance of our product.
> >
> > On Mon, Sep 21, 2020 at 5:57 PM Paul Rogers  wrote:
> >
> >> Hi Carolina,
> >>
> >> This issue came up recently in one of the Drill Slack channels. I
> wonder,
> >> can anyone here summarize the findings from that Slack discussion?
> >>
> >> Thanks,
> >>
> >> - Paul
> >>
> >>
> >> On Mon, Sep 21, 2020 at 7:35 AM Carolina Gomes 
> >> wrote:
> >>
> >> > Also if it helps, I’m using Drill 1.16 in single-node mode.
> >> >
> >> > On Mon, Sep 21, 2020 at 10:32 AM Carolina Gomes 
> >> > wrote:
> >> >
> >> > > Hi all,
> >> > >
> >> > > I have a question about push down of limit and offset clauses on
> >> Drill.
> >> > > For my use case, I’d always like for limit and offset clauses to be
> >> > pushed
> >> > > down to the data sources, which are always RDBMS databases like SQL
> >> > Server,
> >> > > Oracle etc.
> >> > >
> >> > >
> >> > > However, I have noticed the decision to push down seems to happen
> >> > > depending on the size of the limit clause, and on the number of
> >> columns
> >> > > being projected.
> >> > >
> >> > >
> >> > > As an example, I have a table of about 250 columns with about 50
> >> million
> >> > > rows. If I do:
> >> > >
> >> > >
> >> > > select * from table limit 1000 —-> limit push down does not happen,
> >> query
> >> > > takes 30s while if I change the physical plan to push down the limit
> >> > > clause, it takes less than 1s.
> >> > >
> >> > > select * from table limit 100 —-> limit push down does happen,
> >> query
> >> > > takes roughly same time as if I queried directly on the source DB.
> >> > >
> >> > > Is there a way of easily telling Drill to always pushdown?
> >> > > --
> >> > > [Carolina Gomes]
> >> > > CEO, AfterData.ai <https://www.afterdata.ai/>
> >> > > +1 (416) 931 4774
> >> > >
> >> > >
> >> > >
> >> > > --
> >> > [Carolina Gomes]
> >> > CEO, AfterData.ai <https://www.afterdata.ai/>
> >> > +1 (416) 931 4774
> >> >
> >>
> > --
> > [Carolina Gomes]
> > CEO, AfterData.ai <https://www.afterdata.ai/>
> > +1 (416) 931 4774
> >
> >
> >
> > --
> [Carolina Gomes]
> CEO, AfterData.ai <https://www.afterdata.ai/>
> +1 (416) 931 4774
>


Re: How to make the Drill query optimizer always push down clauses

2020-09-23 Thread Paul Rogers
Hi Carolina,

The discussion, as I recall, was on the public Drill Slack channel which
you are welcome to join. Would also be great if the participants of that
discussion could record the info in a Jira ticket.

As I recall, the folks found that there are complications with Drill's
Calcite-based query planner when computing costs after the filter pushdown.
There was detailed discussion of specific code changes, but I don't know if
that fixed the problem or if the discussion petered out.
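
Independent of that Slack discussion, one way to check whether the pushdown
happened for a given query is to look at the physical plan and see whether the
LIMIT made it into the SQL that Drill generates for the source database (the
plugin and table names below are placeholders):

  EXPLAIN PLAN FOR
  SELECT * FROM mssql.dbo.`my_table` LIMIT 1000;

If the generated SQL shown for the JDBC scan has no LIMIT/TOP clause, the
pushdown was not applied for that plan.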

Thanks,

- Paul


On Wed, Sep 23, 2020 at 11:57 AM Carolina Gomes  wrote:

> Hi Paul,
>
>
> That would be great even if you can just copy the discussion here. Being
> able to do that would greatly optimize the performance of our product.
>
> On Mon, Sep 21, 2020 at 5:57 PM Paul Rogers  wrote:
>
> > Hi Carolina,
> >
> > This issue came up recently in one of the Drill Slack channels. I wonder,
> > can anyone here summarize the findings from that Slack discussion?
> >
> > Thanks,
> >
> > - Paul
> >
> >
> > On Mon, Sep 21, 2020 at 7:35 AM Carolina Gomes 
> wrote:
> >
> > > Also if it helps, I’m using Drill 1.16 in single-node mode.
> > >
> > > On Mon, Sep 21, 2020 at 10:32 AM Carolina Gomes 
> > > wrote:
> > >
> > > > Hi all,
> > > >
> > > > I have a question about push down of limit and offset clauses on
> Drill.
> > > > For my use case, I’d always like for limit and offset clauses to be
> > > pushed
> > > > down to the data sources, which are always RDBMS databases like SQL
> > > Server,
> > > > Oracle etc.
> > > >
> > > >
> > > > However, I have noticed the decision to push down seems to happen
> > > > depending on the size of the limit clause, and on the number of
> columns
> > > > being projected.
> > > >
> > > >
> > > > As an example, I have a table of about 250 columns with about 50
> > million
> > > > rows. If I do:
> > > >
> > > >
> > > > select * from table limit 1000 —-> limit push down does not happen,
> > query
> > > > takes 30s while if I change the physical plan to push down the limit
> > > > clause, it takes less than 1s.
> > > >
> > > > select * from table limit 100 —-> limit push down does happen,
> > query
> > > > takes roughly same time as if I queried directly on the source DB.
> > > >
> > > > Is there a way of easily telling Drill to always pushdown?
> > > > --
> > > > [Carolina Gomes]
> > > > CEO, AfterData.ai <https://www.afterdata.ai/>
> > > > +1 (416) 931 4774
> > > >
> > > >
> > > >
> > > > --
> > > [Carolina Gomes]
> > > CEO, AfterData.ai <https://www.afterdata.ai/>
> > > +1 (416) 931 4774
> > >
> >
> --
> [Carolina Gomes]
> CEO, AfterData.ai <https://www.afterdata.ai/>
> +1 (416) 931 4774
>


Re: How to make the Drill query optimizer always push down clauses

2020-09-21 Thread Paul Rogers
Hi Carolina,

This issue came up recently in one of the Drill Slack channels. I wonder,
can anyone here summarize the findings from that Slack discussion?

Thanks,

- Paul


On Mon, Sep 21, 2020 at 7:35 AM Carolina Gomes  wrote:

> Also if it helps, I’m using Drill 1.16 in single-node mode.
>
> On Mon, Sep 21, 2020 at 10:32 AM Carolina Gomes 
> wrote:
>
> > Hi all,
> >
> > I have a question about push down of limit and offset clauses on Drill.
> > For my use case, I’d always like for limit and offset clauses to be
> pushed
> > down to the data sources, which are always RDBMS databases like SQL
> Server,
> > Oracle etc.
> >
> >
> > However, I have noticed the decision to push down seems to happen
> > depending on the size of the limit clause, and on the number of columns
> > being projected.
> >
> >
> > As an example, I have a table of about 250 columns with about 50 million
> > rows. If I do:
> >
> >
> > select * from table limit 1000 —-> limit push down does not happen, query
> > takes 30s while if I change the physical plan to push down the limit
> > clause, it takes less than 1s.
> >
> > select * from table limit 100 —-> limit push down does happen, query
> > takes roughly same time as if I queried directly on the source DB.
> >
> > Is there a way of easily telling Drill to always pushdown?
> > --
> > [Carolina Gomes]
> > CEO, AfterData.ai 
> > +1 (416) 931 4774
> >
> >
> >
> > --
> [Carolina Gomes]
> CEO, AfterData.ai 
> +1 (416) 931 4774
>


Re: CTAS query fails

2020-09-21 Thread Paul Rogers
Hi Vimal,

One thing to consider is that if you do have variable schema, you may be
presenting Parquet with a Drill feature which Parquet cannot support.
Parquet appears to require that the schema be known when creating the file.
In Drill-speak, this means that the batch used to create a Parquet file
will define the schema. If another batch comes along later with a different
schema, there is no way to go back and revise the Parquet schema.

That is, there is an impedance mismatch: for a subset of operators, Drill
allows the schema to vary from one batch of records to the next, whereas
Parquet (or JDBC or ODBC) requires that the schema be known up-front.

In your case, this shows up as that JSON object (Drill MAP) with a varying
set of elements.

Drill provides no way to bridge this gap. The ability to have ill-defined
schemas is seen as a "feature" of Drill, not a bug.

The best solution is to do an ETL step to normalize the data before running
it through Drill. That way, although Drill does allow the schema to change,
it won't, in fact, change and so the Parquet writer will be happy.

Thanks,

- Paul


On Sun, Sep 20, 2020 at 11:50 PM Vimal Jain  wrote:

> Thanks Paul for the quick response.
> So reading your response, it looks like this has something to do with Parquet
> instead of Drill? I would post this question in the Parquet community
> group as well to see if we can get an answer for this.
>
> *Thanks and Regards,*
> *Vimal Jain*
>
>
> On Fri, Sep 18, 2020 at 10:45 PM Paul Rogers  wrote:
>
> > Hi Vimal,
> >
> > You've stumbled across one of the more frustrating bits of Drill. Drill
> is
> > "schema-free", meaning that the only information which Drill has to read
> > your data is the data itself. In your case, the JSON reader can infer
> that
> > "abc" is a MAP (Drill's term, Hive would call it a STRUCT.) Each file is
> > read in a different "fragment". One fragment says that "abc" is an empty
> > MAP, another says that it has some schema. These are merged sometime
> later
> > in the query.
> >
> > If you had had a null value instead, Drill wouldn't have known that "abc" is a map
> > and would have guessed INT as the type. So, good that you have an empty
> > object, it avoids ambiguity.
> >
> > Sounds like the issue is in the Parquet writer: that it has some
> limitation
> > on an empty group. Why is the group empty? Because, when writing the
> first
> > file with the empty group, the Parquet writer has no way to predict that
> > your "abc" field will eventually include a non-empty group. In fact, when
> > the non-empty group does appear, the Parquet schema must change. Not sure
> > what Parquet will do in that case: you may end up with some files with
> one
> > schema, other files with another schema.
> >
> > What you want, of course, is for Drill to combine your files to create a
> > single schema for Parquet, setting fields to null when they are missing.
> > Drill can't currently do that effectively because it involves predicting
> > the future, which Drill cannot do.
> >
> > Does anyone have more direct knowledge of how Parquet handles this case?
> >
> > Thanks,
> >
> > - Paul
> >
> > On Fri, Sep 18, 2020 at 4:10 AM Vimal Jain  wrote:
> >
> > > Hi,
> > > I am trying to convert my JSON data into Parquet format using CTAS
> query
> > > like below :-
> > >
> > > *create table ds2.root.`parquetOutput` as select * from
> > > TABLE(ds1.root.`jsonInput/` (type =>'json'));*
> > >
> > > But it fails with error :-
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > Error: SYSTEM ERROR: InvalidSchemaException: Cannot write a schema with an
> > > empty group: optional group abc {}
> > > Fragment 0:0
> > > Please, refer to logs for more information.
> > > [Error Id: fa3c0390-0093-4c4a-9b32-098d5cc68c7e on
> > > ip-172-30-3-153.ec2.internal:31010] (state=,code=0)
> > >
> > > So can someone explain what is the issue here, can't my jsons have a
> key
> > > "abc" with value as empty object "{}" ?
> > > It's empty in some json files in ds1 but in some there is a value.
> > > Any help to resolve this would be appreciated.
> > >
> > > *Thanks and Regards,*
> > > *Vimal Jain*
> > >
> >
>


Re: CTAS query fails

2020-09-18 Thread Paul Rogers
Hi Vimal,

You've stumbled across one of the more frustrating bits of Drill. Drill is
"schema-free", meaning that the only information which Drill has to read
your data is the data itself. In your case, the JSON reader can infer that
"abc" is a MAP (Drill's term, Hive would call it a STRUCT.) Each file is
read in a different "fragment". One fragment says that "abc" is an empty
MAP, another says that it has some schema. These are merged sometime later
in the query.

If you had had a null value instead, Drill wouldn't have known that "abc" is a map
and would have guessed INT as the type. So, good that you have an empty
object, it avoids ambiguity.

Sounds like the issue is in the Parquet writer: that it has some limitation
on an empty group. Why is the group empty? Because, when writing the first
file with the empty group, the Parquet writer has no way to predict that
your "abc" field will eventually include a non-empty group. In fact, when
the non-empty group does appear, the Parquet schema must change. Not sure
what Parquet will do in that case: you may end up with some files with one
schema, other files with another schema.

What you want, of course, is for Drill to combine your files to create a
single schema for Parquet, setting fields to null when they are missing.
Drill can't currently do that effectively because it involves predicting
the future, which Drill cannot do.

Does anyone have more direct knowledge of how Parquet handles this case?

Thanks,

- Paul

On Fri, Sep 18, 2020 at 4:10 AM Vimal Jain  wrote:

> Hi,
> I am trying to convert my JSON data into Parquet format using CTAS query
> like below :-
>
> *create table ds2.root.`parquetOutput` as select * from
> TABLE(ds1.root.`jsonInput/` (type =>'json'));*
>
> But it fails with error :-
>
>
>
>
>
>
>
>
> Error: SYSTEM ERROR: InvalidSchemaException: Cannot write a schema with an
> empty group: optional group abc {}
> Fragment 0:0
> Please, refer to logs for more information.
> [Error Id: fa3c0390-0093-4c4a-9b32-098d5cc68c7e on
> ip-172-30-3-153.ec2.internal:31010] (state=,code=0)
>
> So can someone explain what is the issue here, can't my jsons have a key
> "abc" with value as empty object "{}" ?
> It's empty in some json files in ds1 but in some there is a value.
> Any help to resolve this would be appreciated.
>
> *Thanks and Regards,*
> *Vimal Jain*
>


Re: Support for JSONPath while querying JSON data in drill

2020-09-14 Thread Paul Rogers
Hi Vimal,

Drill does not support JSON Path. Instead, Drill attempts to read your JSON
into records which you can then manipulate in SQL. Drill supports JSON
structure to some degree: nested records, arrays of a single type (with no
nulls.) More recent versions do provide a way to ignore some part of the
JSON, such as ignoring the message body of a REST response, to focus just
on the "payload" portion of the response.

It is possible to add JSON Path by creating a new format plugin. The
simplest JSON path implementations load your data into memory to simplify
path queries. Since Drill queries data at scale, if your files are large,
loading the entire file can be problematic. Instead, you'd want a streaming
path solution, or load a single object at a time and apply path rules to
that.

Another consideration is that using path rules on each record can be
expensive and is a cost paid for every query, which will slow performance.
If you will query the data multiple times, you may find it more effective
to perform an ETL from the JSON format into a Parquet as a separate step,
then query the Parquet format to get good query performance.
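
For many JSON Path use cases, Drill's own nested-field syntax already covers
the need, and a CTAS can flatten the structure into Parquet once. A sketch
with invented workspace, path, and field names:

  CREATE TABLE s3.tmp.`events_flat` AS
  SELECT t.payload.id           AS event_id,
         t.payload.detail.score AS score,
         t.`timestamp`          AS event_time   -- backticks: reserved word
  FROM s3.root.`events/` t;

Array elements are reachable with indexes such as t.payload.items[0], and
FLATTEN() expands an array into one row per element.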

Thanks,

- Paul


On Mon, Sep 14, 2020 at 7:29 AM Vimal Jain  wrote:

> Hi There,
> I am using Drill 1.17.
> I have complex json data in AWS S3 to query.
> I am looking to see if there is already a support in drill to query using
> JSON Path ( similar to XPath for XML  , for ref -
> https://goessner.net/articles/JsonPath/ )
> If it's not possible by default , is there we can make use of it ? ( like
> UDF etc ? )
>
> *Thanks and Regards,*
> *Vimal Jain*
>


Re: [VOTE] Release Apache Drill 1.18.0 - RC0

2020-09-02 Thread Paul Rogers
Hi Abhishek,

Downloaded the tar file, installed Drill, cleaned my ZK and poked around in
the UI.

As you noted, you've already run the thousands of unit tests and the test
framework, so no point in trying to repeat that. Our tests, however, don't
cover the UI much at all, so I clicked around on the basics to ensure
things basically work. Seems good.

To catch the odd cases, would be great if someone who uses Drill in
production could try it out. Until then, my vote is +1.

- Paul


On Tue, Sep 1, 2020 at 5:28 PM Abhishek Girish  wrote:

> Thanks Vova!
>
> Hey folks, we need more votes to validate the release. Please give RC0 a
> try.
>
> Special request to PMCs - please vote as we only have 1 binding vote at
> this point. I am fine extending the voting window by a day or two if anyone
> is or plans to work on it soon.
>
> On Tue, Sep 1, 2020 at 12:09 PM Volodymyr Vysotskyi 
> wrote:
>
> > Verified checksums and signatures for binary and source tarballs and for
> > jars published to the maven repo.
> > Run all unit tests on Ubuntu with JDK 8 using tar with sources.
> > Run Drill in embedded mode on Ubuntu, submitted several queries, verified
> > that profiles displayed correctly.
> > Checked JDBC driver using SQuirreL SQL client and custom java client,
> > ensured that it works correctly with the custom authenticator.
> >
> > +1 (binding)
> >
> > Kind regards,
> > Volodymyr Vysotskyi
> >
> >
> > On Mon, Aug 31, 2020 at 1:37 PM Volodymyr Vysotskyi <
> volody...@apache.org>
> > wrote:
> >
> > > Hi all,
> > >
> > > I have looked into the DRILL-7785, and the problem is not in Drill, so
> it
> > > is not a blocker for the release.
> > > For more details please refer to my comment
> > > <
> >
> https://issues.apache.org/jira/browse/DRILL-7785?focusedCommentId=17187629=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17187629
> > >
> > > on this ticket.
> > >
> > > Kind regards,
> > > Volodymyr Vysotskyi
> > >
> > >
> > > On Mon, Aug 31, 2020 at 4:26 AM Abhishek Girish 
> > > wrote:
> > >
> > >> Yup we can certainly include it if RC0 fails. So far I’m inclined to
> not
> > >> consider it a blocker. I’ve requested Vova and Anton to take a look.
> > >>
> > >> So folks, please continue to test the candidate.
> > >>
> > >> On Sun, Aug 30, 2020 at 6:16 PM Charles Givre 
> wrote:
> > >>
> > >> > Ok.  Are you looking to include DRILL-7785?  I don't think it's a
> > >> blocker,
> > >> > but if we find anything with RC0... let's make sure we get it in.
> > >> >
> > >> > -- C
> > >> >
> > >> >
> > >> >
> > >> > > On Aug 30, 2020, at 9:14 PM, Abhishek Girish 
> > >> wrote:
> > >> >
> > >> > >
> > >> >
> > >> > > Hey Charles,
> > >> >
> > >> > >
> > >> >
> > >> > > I would have liked to. We did get one of the PRs merged after the
> > >> master
> > >> >
> > >> > > branch was closed as I hadn't made enough progress with the
> release
> > >> yet.
> > >> >
> > >> > > But that’s not the case now.
> > >> >
> > >> > >
> > >> >
> > >> > > Unless DRILL-7781 is a release blocker, we should probably skip
> it.
> > So
> > >> > far,
> > >> >
> > >> > > a lot of effort has gone into getting RC0 ready. So I'm hoping to
> > get
> > >> > this
> > >> >
> > >> > > closed asap.
> > >> >
> > >> > >
> > >> >
> > >> > > Regards,
> > >> >
> > >> > > Abhishek
> > >> >
> > >> > >
> > >> >
> > >> > > On Sun, Aug 30, 2020 at 6:07 PM Charles Givre 
> > >> wrote:
> > >> >
> > >> > >
> > >> >
> > >> > >> HI Abhishek,
> > >> >
> > >> > >>
> > >> >
> > >> > >> Can we merge DRILL-7781?  We really shouldn't ship something
> with a
> > >> > simple
> > >> >
> > >> > >> bug like this.
> > >> >
> > >> > >>
> > >> >
> > >> > >> -- C
> > >> >
> > >> > >>
> > >> >
> > >> > >>
> > >> >
> > >> > >>
> > >> >
> > >> > >>
> > >> >
> > >> > >>
> > >> >
> > >> > >>> On Aug 30, 2020, at 8:40 PM, Abhishek Girish <
> agir...@apache.org>
> > >> > wrote:
> > >> >
> > >> > >>
> > >> >
> > >> > >>>
> > >> >
> > >> > >>
> > >> >
> > >> > >>> Advanced tests from [5] are also complete. All 7500+ tests
> passed,
> > >> > except
> > >> >
> > >> > >>
> > >> >
> > >> > >>> for a few relating to known resource issues (drillbit
> > connectivity /
> > >> > OOM
> > >> >
> > >> > >>
> > >> >
> > >> > >>> /...). Plus a few with the same symptoms as DRILL-7785.
> > >> >
> > >> > >>
> > >> >
> > >> > >>>
> > >> >
> > >> > >>
> > >> >
> > >> > >>> On Sun, Aug 30, 2020 at 2:17 PM Abhishek Girish <
> > agir...@apache.org
> > >> >
> > >> >
> > >> > >> wrote:
> > >> >
> > >> > >>
> > >> >
> > >> > >>>
> > >> >
> > >> > >>
> > >> >
> > >> >  Wanted to share an update on some of the testing I've done from
> > my
> > >> > side:
> > >> >
> > >> > >>
> > >> >
> > >> > 
> > >> >
> > >> > >>
> > >> >
> > >> >  All Functional tests from [5] (plus private Customer tests) are
> > >> >
> > >> > >> complete.
> > >> >
> > >> > >>
> > >> >
> > >> >  10,000+ tests have passed. However, I did see an issue with
> Hive
> > >> ORC
> > >> >
> > >> > >> tables
> > >> >
> > >> > 

Re: Successful (and not so successful) Production use cases for drill?

2020-08-25 Thread Paul Rogers
trivial/inconsequential but fatal to a
>
> Eg can I have a "table" schema configured for a directory that has a number
> of csvs
> Orders202001.csv has columns |OrderID|ProductCode|Quantity|Date|
> Orders202002.csv has columns |OrderID|ProductCode|Date|Quantity|
>
> A single schema definition over the top of these two files should work
> correctly for both - the ordinal position in the individual csvs is
> dynamically mapped to the schema based on the csv header column
>
> The same applies to extra columns and missing columns
>
> Orders202003.csv has columns
> |OrderID|ProductCode|SupplierCode|Quantity|Date|
> Orders202004.csv has columns |ProductCode|Quantity|Date|
>
> Extra columns are ignored and missing columns show as null. Throw a
> 'warning', or provide an option to set a column as mandatory, in which case
> throw an error.
>
> How does drill handle the situation where there are multiple csvs in a
> directory and one fails but the rest are ok. Is the whole table offline? Do
> all selects fail or does it show what it knows and throws a warning?
>
> I've written a C# CSV handler like the above and use it for ETLing into
> relational DBs when required. It saves so much time.
>
>
> Is there a 3rd party SQL query tool that plays nicely with drill?
>
>
> I do a lot of funky SQL with views on views and CTEs etc etc. How accurate
> is the dependency metadata? Would I be able to generate object level
> (view/table) data lineage/dependency data?
>
> As an aside – I’ve seen some of the threads om the mailing list about
> writing a generic rest plugin. I’ve previously used the CDATA -
> https://www.cdata.com/drivers/rest/odbc/ (worth a download of the trial to
> check out for ideas imho) especially mapping output data and uri params
> http://cdn.cdata.com/help/DWF/odbc/pg_customschemacolumns.htm
>
> On 2020/08/21 04:55:54, Paul Rogers  wrote:
> > Hi, welcome to Drill.>
> >
> > In my (albeit limited) experience, Drill has a particular sweet spot:
> data>
> > large enough to justify a distributed system, but not so large as to>
> > overtax the limited support Drill has for huge deployments.
> Self-describing>
> > data is good, but not data that is dirty or with inconsistent format.
> Drill>
> > is good to grab data from other systems, but only if those systems have>
> > some way to "push" operations via a system-specific query API (and
> someone>
> > has written a Drill plugin.)>
> >
> > Drill tries to be really good with Parquet: but that is not a "source">
> > format; you'll need to ETL data into Parquet. Some have used Drill for
> the>
> > ETL, but that only works if the source data is clean.>
> >
> > One of the biggest myths around big data is that you can get
> interactive>
> > response times on large data sets. You are entirely at the mercy of I/O>
> > performance. You can get more, but it will cost you. (In the "old days"
> by>
> > having a very large number of disk spindles; today by having many nodes>
> > pull from S3.)>
> >
> > As your data size increases, you'll want to partition data (which is as>
> > close to indexing as Drill and similar tools get.) But, as the number
> of>
> > partitions (or, for Parquet, row groups) increases, Drill will spend
> more>
>
> > time figuring out which partitions & row groups to scan than it spends>
> > scanning the resulting files. The Hive Metastore tries to solve this,
> but>
>
> > has become a huge mess with its own problems.>
> >
> > From what I've seen, Drill works best somewhere in the middle: larger
> than>
> > a set of files on your laptop, smaller than 10's of K of Parquet files.>
> >
> > Might be easier to discuss *your* specific use case rather than explain
> the>
> > universe of places where Drill has been used.>
> >
> > To be honest, I guess my first choice would be to run in the cloud
> using>
> > tools available from Amazon, DataBricks or Snowflake if you have a>
> > reasonably "normal" use case and just want to get up and running
> quickly.>
>
> > If the use case turns out to be viable, you can find ways to reduce
> costs>
>
> > by replacing "name brand" components with open source. But, if you
> "failed>
> > fast", you did so without spending much time at all on plumbing.>
> >
> > Thanks,>
> >
> > - Paul>
> >
> >
> > On Thu, Aug 20, 2020 at 9:02 PM  wrote:>
> >
> > > Hi all,>
> > >>
> > >>
>

Re: Successful (and not so successful) Production use cases for drill?

2020-08-20 Thread Paul Rogers
Hi, welcome to Drill.

In my (albeit limited) experience, Drill has a particular sweet spot: data
large enough to justify a distributed system, but not so large as to
overtax the limited support Drill has for huge deployments. Self-describing
data is good, but not data that is dirty or with inconsistent format. Drill
is good to grab data from other systems, but only if those systems have
some way to "push" operations via a system-specific query API (and someone
has written a Drill plugin.)

Drill tries to be really good with Parquet: but that is not a "source"
format; you'll need to ETL data into Parquet. Some have used Drill for the
ETL, but that only works if the source data is clean.

One of the biggest myths around big data is that you can get interactive
response times on large data sets. You are entirely at the mercy of I/O
performance. You can get more, but it will cost you. (In the "old days" by
having a very large number of disk spindles; today by having many nodes
pull from S3.)

As your data size increases, you'll want to partition data (which is as
close to indexing as Drill and similar tools get.) But, as the number of
partitions (or, for Parquet, row groups) increases, Drill will spend more
time figuring out which partitions & row groups to scan than it spends
scanning the resulting files. The Hive Metastore tries to solve this, but
has become a huge mess with its own problems.
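
(For concreteness: Drill exposes the directory levels of a partitioned layout
as implicit dir0, dir1, ... columns, so with a layout such as
/sales/2020/01/*.parquet a filter like the one below prunes the scan to the
matching sub-directories; the path and column names are illustrative only.)

  SELECT SUM(amount)
  FROM dfs.`/sales`
  WHERE dir0 = '2020' AND dir1 = '01';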

From what I've seen, Drill works best somewhere in the middle: larger than
a set of files on your laptop, smaller than 10's of K of Parquet files.

Might be easier to discuss *your* specific use case rather than explain the
universe of places where Drill has been used.

To be honest, I guess my first choice would be to run in the cloud using
tools available from Amazon, DataBricks or Snowflake if you have a
reasonably "normal" use case and just want to get up and running quickly.
If the use case turns out to be viable, you can find ways to reduce costs
by replacing "name brand" components with open source. But, if you "failed
fast", you did so without spending much time at all on plumbing.

Thanks,

- Paul


On Thu, Aug 20, 2020 at 9:02 PM  wrote:

> Hi all,
>
>
>
> Can some of the users that have deployed drill in production, whether
> small/medium and enterprise firms, share the use cases and experiences?
>
>
>
> What problems was drill meant to solve?
>
>
>
> Was it successful?
>
>
>
> What was/is drill mostly used for at your corporation?
>
>
>
> What was tried but wasn't taken up by users?
>
>
>
> Has it found a niche, or a core group of heavy users? What are their roles?
>
>
>
>
>
> I've been working in reporting, data warehousing, business intelligence,
> data engineering(?) (the name of the field seems to rebrand every 5 or so
> years - or the lifecycle of 2 failed enterprise data projects - but that's
> a
> theory for another time) for a bit over 15 years now and for the last 5 or
> so have been trying to understand why 70-80% of projects never achieve
> their
> aims. It doesn't seem to matter if they're run by really smart (and
> expensive!) people using best in class tools and processes. Their failure
> rate might be closer to the 70%, but that's still pretty terrible
>
>
>
> I have a couple theories as to why and have tested them over the last 5 or
> so years
>
>
>
> One part is reducing the gap between project inception and production
> quality data output. Going live quickly creates enthusiasm + a feedback
> loop
> to iterate the models which in turn creates a sense of engagement
>
>
>
> Getting rid of a thick ETL process that takes months or more of dev and
> refactoring before hitting production is one component. Using ~70% of the
> project resources on the plumbing - leaving very little for the complex
> data
> model iterations - just creates a tech demo not a commercially useful
> solution.  I don't think this is a technology problem, and applies whether
> using traditional on prem etl tools or the current data engineering scripts
> and cron jobs but in the cloud
>
>
>
> The least unsuccessful data engineering approach I've seen is the ELT
> logical data mart pattern; landing the source data as close to a 1:1 format
> as possible into a relational-like data store and leveraging MPP dbs via
> views and CTASes to create a conformed star schema. Then using the star
> schemas as building blocks create the complex (and actually useful) models.
> Something like this can be up in a few weeks and still cover the majority
> of
> user facing features a full data pipeline/ETL would have (snapshots +
> transactional facts, inferred members, type 1 dims only - almost everyone
> double joins a type 2 dim to get the current record anyway). While they
> aren't always (or even usually) 100% successes, they at least produce something
> useful or just fail quickly, which is useful in itself.
>
>
>
> The first part of this - getting all the data into a single spot, still
> sucks and is probably more fiddly than 10 years ago 

Re: GitHub raw data as a Data source

2020-07-29 Thread Paul Rogers
Hi Faraz,

The short answer is, "yes, but you have to write some code." Drill can
process any tabular data, but needs a reader (a "storage plugin") to
convert from the API's data format to Drill's value vector format. The good
news is that, for most formats, readers already exist. Your file appears to
be CSV: Drill provides a CSV reader. What Drill does not provide is a
storage plugin to read CSV from a REST call. It should be easy to create
one: just start with (or better, modify) the REST storage plugin. Instead
of creating a JSON decoder for the data, create a CSV decoder.

If you choose to go this route, we can give you pointers for how to
proceed. Alternatively, you can use a script to download the data to a
local file, then use the existing CSV reader to query the data. Not
elegant, but may be fine if you do the query infrequently.

- Paul


On Wed, Jul 29, 2020 at 1:06 PM Faraz Ahmad  wrote:

> Hi Team,
>
>
>
> Is there any way we can query CSV file data from GitHub using
> Apache Drill?
>
>
>
> Currently, I can pull this GitHub data into Power BI by using a Web
> data connection with the below URL:
>
>
>
>
> https://raw.githubusercontent.com/itsnotaboutthecell/Power-BI-Sessions/master/An%20Introduction%20to%20Tabular%20Editor/Source%20Files/Customers.csv
>
>
>
>
>
> My goal is to pull this data outside of Power BI, mash up with other data
> and then simply create a view within Drill.
>
> This view will then be connected to Power BI thru Drill ODBC connection.
>
>
>
> Kindly let me know if this is possible. Thanks so much!
>
>
>
>
>
> Regards,
>
> Faraz Ahmad
>
>
>


Re: Aggregate UDF and HashAgg

2020-07-27 Thread Paul Rogers
Hi James,

The behavior you see can mostly be explained by noting the way the two
aggregates work. The streaming agg is a sequential operator: it works with
sorted data, starts one aggregate, gathers all data, then resets for the
next. The hash agg is a parallel aggregate: it runs all aggregates in
parallel, it will start all aggregates at the same time, add data to each
of them depending on the hash key as it arrives, and complete all
aggregates at the same time at the end. There is no reset needed in a
parallel agg.

The real question is whether the parallel (hash) agg correctly calls the
add method multiple times and the output method once for each of the parallel
aggregates.

You are seeing the key trade-off between the two implementations: the
sequential (streaming) agg is very memory frugal, but requires a sort to
organize data. The parallel (hash) agg requires no sort, at the cost of
more memory to hold all active groups in memory. Classic DB stuff.

Thanks,

- Paul


On Sun, Jul 26, 2020 at 7:56 AM James Turton  wrote:

> Hi all
>
> I'm writing an aggregate UDF with help from the notes here
>
> https://github.com/paul-rogers/drill/wiki/Aggregate-UDFs
>
> .  I'm printing a line to stderr from each of the UDF methods so I can
> keep an eye on the call sequence.  When my UDF is invoked by a
> StreamingAgg operator the lifecycle of method calls - setup(), reset(),
> add(), output() - is as described in the wiki.  When my UDF is invoked
> by a HashAgg operator things change dramatically.  The setup() method is
> called some hundreds of times and reset() is never called even though I
> have three groups in the query's "group by"!  Anyone know what could be
> happening here?
>
> Thanks
> James
>
> --
> PGP public key <http://somecomputer.xyz/james.asc>
>


Re: Re: HDFS file is listable but not queryable (object not found)

2020-07-25 Thread Paul Rogers
Hi Clark,

This is a hard one. On the one hand, the "SASL" part of the data node log
messages suggests that Drill tried to do a data node operation, and it
failed for security reasons. But, we can't be sure if the two are connected.

On the other hand, the stack trace does not show the entries we'd expect.
Such a failure should appear as an IOException originating in the HDFS
DistributedFileSystem class, and bubbling up into Calcite. Instead, what we
see are only Calcite operations in the stack, which suggests that something
else is amiss.

I checked the code: unfortunately there is no logging in the file system
classes that I could find. Perhaps there is some in the HDFS code? The
operation to find a table in a namespace should go through
org.apache.drill.exec.store.dfs.WorkspaceSchemaFactory.getTable(String
tableName). There is debug-level logging in that method, so you can try
enabling debug (or even trace) level logging for all of Drill (which will
produce huge output) or just this one class or package.
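
If you go the single-class route, that is typically just one more logger entry
in Drill's conf/logback.xml, roughly like the following (the "FILE" appender
name is illustrative; use whichever appender your install defines):

  <logger name="org.apache.drill.exec.store.dfs.WorkspaceSchemaFactory" additivity="false">
    <level value="trace" />
    <appender-ref ref="FILE" />
  </logger>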

The logging in the getTable() function does not cover the normal case,
however, only some odd cases. So, if you are comfortable building Drill
(not hard), you can add extra logging to see if the function is even called
(I'd bet it is not, for reasons above.)

The trick here is that the problem occurs in your environment; I think we'd be
hard pressed to replicate the situation. So, if the above does not tell us
anything more, the next step is to run your Drill build in a debugger,
using a unit test, and setting breakpoints in likely places to see where
things go off the rails. If you want to go that route, we can give you
pointers for how to proceed.

Thanks,

- Paul




On Fri, Jul 24, 2020 at 8:45 AM Updike, Clark 
wrote:

> Yes, I've read that page but it wasn't clear to me how much of it
> applied.  I don't need kerberos auth from the client to the drillbits.  But
> the drillbits must use kerberos auth when interacting with hdfs.  By
> putting the principal and keytab info into the drillbit config
> (drill-override.conf), and not using impersonation, and no security.user
> settings,  I thought that was what I was effectively doing. And it at least
> partially works since SHOW FILES works.
>
> Is this not a valid setup?
>
> On 7/24/20, 11:23 AM, "Charles Givre"  wrote:
>
> Hey Clark,
> Have you gone through this:
> https://drill.apache.org/docs/configuring-kerberos-security/ <
> https://drill.apache.org/docs/configuring-kerberos-security/>
>
> As Paul indicated, this does seem like the likely suspect as to why
> this isn't working or at least the next thing to verify.  I'm surprised
> you're able to connect at all. I would have expected you to get connection
> denied when you tried the SHOW FILES query if Kerberos was not configured
> correctly.
>
> -- C
>
> > On Jul 24, 2020, at 11:14 AM, Updike, Clark 
> wrote:
> >
> > Using CDH version of 2.6.0.
> >
> > I was not able to find any errors on the Drill side besides what I
> already provided from sqlline.  However, I did find an exception on some of
> the datanodes (below).
> >
> > Everything works fine using hdfs cli commands (ls, get, cat).
> >
> > I have set up security.auth.principal and security.auth.keytab for
> drill.exec in drill-override.conf.  That's what got SHOW FILES working.
> However, I have not been doing kerberos auth when using Sqlline.
> >
> > Is there any chance that SHOW FILES can work when Sqlline is not
> authenticated using kerberos, but the actual query requires Sqlline
> kerberos auth?  That might explain it if that's how it worked.  Note the
> only thing running kerberos is HDFS (not using kerberos on the Drill parts).
> >
> > STACKTRACE FROM DATANODE
> > dn003:20003:DataXceiver error processing unknown operation  src:
> /xx.xx.xx.22:53154 dst: /xx.xx.xx.23:20003
> > java.io.IOException:
> >  at
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.DataTransferSaslUtil.readSaslMessage(DataTransferSaslUtil.java:217)
> >  at
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.doSaslHandshake(SaslDataTransferServer.java:364)
> >  at
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.getEncryptedStreams(SaslDataTransferServer.java:178)
> >  at
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.receive(SaslDataTransferServer.java:110)
> >  at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:193)
> >  at java.lang.Thread.run(Thread.java:745)
> >
> > Thanks,
> > Clark
>

Re: RE: HDFS file is listable but not queryable (object not found)

2020-07-23 Thread Paul Rogers
Hi Clark,

Security was going to be my next question. The stack trace didn't look like
one where the file open would fail: the planner doesn't actually open a
JSON file. There is no indication of the HDFS call that might have failed.
Another question is: what version of HDFS are you using? I wonder if there
is a conflict somewhere.

Although the stack trace does not tell us which file-system call failed,
the logs might. Can you check your Drill log file for entries at the time
of failure? Is there additional information about the specific operation
which failed?

What happens if you try to download the file using the command line HDFS
tools? Does that work? This test might verify that HDFS itself is sane and
that the security settings work.

Setting up Kerberos in Drill is documented on the web site. You probably
went through the steps there to ensure Drill has the needed info?

Thanks,

- Paul

On Thu, Jul 23, 2020 at 11:26 AM Updike, Clark 
wrote:

> I should mention that this is a kerberized HDFS cluster.  I'm still not
> sure why the SHOW FILES would work but the query would not--but it could be
> behind the issue somehow.
>
> On 7/23/20, 2:18 PM, "Updike, Clark"  wrote:
>
> No change unfortunately:
>
> apache drill> select * from hdfs.`root`.`/tmp/employee.json`;
> Error: VALIDATION ERROR: From line 1, column 15 to line 1, column 18:
> Object '/tmp/employee.json' not found within 'hdfs.root'
>
> On 7/23/20, 2:11 PM, "Paul Rogers"  wrote:
>
> Hi Clark,
>
> Try using `hdfs`.`root` rather than `hdfs.root`. Calcite wants to
> walk down
> `hdfs` then `root`. There is no workspace called `hdfs.root`.
>
> Thanks,
>
> - Paul
>
> On Thu, Jul 23, 2020 at 8:58 AM Updike, Clark <
> clark.upd...@jhuapl.edu>
> wrote:
>
> > Oops, sorry.  No luck there either unfortunately:
> >
> > apache drill> SELECT * FROM hdfs.`/tmp/employee.json`;
> > Error: VALIDATION ERROR: From line 1, column 15 to line 1,
> column 18:
> > Object '/tmp/employee.json' not found within 'hdfs'
> >
> >
> > On 7/23/20, 11:52 AM, "Charles Givre"  wrote:
> >
> > Oh.. I meant:
> >
> > SELECT *
> > FROM hdfs.`/tmp/employee.json`
> >
> > > On Jul 23, 2020, at 11:41 AM, Updike, Clark <
> clark.upd...@jhuapl.edu>
> > wrote:
> > >
> > > No change unfortunately...
> > >
> > > $ hdfs dfs -ls hdfs://nn01:8020/tmp/employee.json
> > > -rw-r--r--   2 me supergroup 474630 2020-07-23 10:53
> > hdfs://nn01:8020/tmp/employee.json
> > >
> > > apache drill> select * from
> > hdfs.root.`hdfs://nn01:8020/tmp/employee.json`;
> > > Error: VALIDATION ERROR: From line 1, column 15 to line 1,
> column
> > 18: Object 'hdfs://nn01:8020/tmp/employee.json' not found within
> 'hdfs.root'
> > >
> > >
> > > On 7/23/20, 11:30 AM, "Charles Givre" 
> wrote:
> > >
> > >Hi Clark,
> > >That's strange.  My initial thought is that this could
> be a
> > permission issue.  However, it might also be that Drill isn't
> finding the
> > file for some reason.
> > >
> > >Could you try:
> > >
> > >SELECT *
> > >FROM hdfs.`<file path>`
> > >
> > >Best,
> > >--- C
> > >
> > >
> > >> On Jul 23, 2020, at 11:23 AM, Updike, Clark <
> > clark.upd...@jhuapl.edu> wrote:
> > >>
> > >> This is in 1.17.  I can use SHOW FILES to list the file
> I'm
> > targeting, but I cannot query it:
> > >>
> > >> apache drill> show files in
> hdfs.root.`/tmp/employee.json`;
> > >>
> >
> +---+-+++--++-+-+-+
> > >> | name  | isDirectory | isFile | length |  owner
>  |
> >  group| permissions |   accessTime|
> modificationTime
> >  |
> > >>

Re: Re: HDFS file is listable but not queryable (object not found)

2020-07-23 Thread Paul Rogers
Hi Clark,

Try using `hdfs`.`root` rather than `hdfs.root`. Calcite wants to walk down
`hdfs` then `root`. There is no workspace called `hdfs.root`.
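That is, something like:

select * from `hdfs`.`root`.`/tmp/employee.json`;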

Thanks,

- Paul

On Thu, Jul 23, 2020 at 8:58 AM Updike, Clark 
wrote:

> Oops, sorry.  No luck there either unfortunately:
>
> apache drill> SELECT * FROM hdfs.`/tmp/employee.json`;
> Error: VALIDATION ERROR: From line 1, column 15 to line 1, column 18:
> Object '/tmp/employee.json' not found within 'hdfs'
>
>
> On 7/23/20, 11:52 AM, "Charles Givre"  wrote:
>
> Oh.. I meant:
>
> SELECT *
> FROM hdfs.`/tmp/employee.json`
>
> > On Jul 23, 2020, at 11:41 AM, Updike, Clark 
> wrote:
> >
> > No change unfortunately...
> >
> > $ hdfs dfs -ls hdfs://nn01:8020/tmp/employee.json
> > -rw-r--r--   2 me supergroup 474630 2020-07-23 10:53
> hdfs://nn01:8020/tmp/employee.json
> >
> > apache drill> select * from
> hdfs.root.`hdfs://nn01:8020/tmp/employee.json`;
> > Error: VALIDATION ERROR: From line 1, column 15 to line 1, column
> 18: Object 'hdfs://nn01:8020/tmp/employee.json' not found within 'hdfs.root'
> >
> >
> > On 7/23/20, 11:30 AM, "Charles Givre"  wrote:
> >
> >Hi Clark,
> >That's strange.  My initial thought is that this could be a
> permission issue.  However, it might also be that Drill isn't finding the
> file for some reason.
> >
> >Could you try:
> >
> >SELECT *
> >FROM hdfs.`<file path>`
> >
> >Best,
> >--- C
> >
> >
> >> On Jul 23, 2020, at 11:23 AM, Updike, Clark <
> clark.upd...@jhuapl.edu> wrote:
> >>
> >> This is in 1.17.  I can use SHOW FILES to list the file I'm
> targeting, but I cannot query it:
> >>
> >> apache drill> show files in hdfs.root.`/tmp/employee.json`;
> >>
> +---+-+++--++-+-+-+
> >> | name  | isDirectory | isFile | length |  owner   |
>  group| permissions |   accessTime|modificationTime
>  |
> >>
> +---+-+++--++-+-+-+
> >> | employee.json | false   | true   | 474630 | me   |
> supergroup | rw-r--r--   | 2020-07-23 10:53:15.055 | 2020-07-23
> 10:53:15.387 |
> >>
> +---+-+++--++-+-+-+
> >> 1 row selected (3.039 seconds)
> >>
> >>
> >> apache drill> select * from hdfs.root.`/tmp/employee.json`;
> >> Error: VALIDATION ERROR: From line 1, column 15 to line 1, column
> 18: Object '/tmp/employee.json' not found within 'hdfs.root'
> >> [Error Id: 3b833622-4fac-4ecc-becd-118291cd8560 ] (state=,code=0)
> >>
> >> The storage plugin uses the standard json config:
> >>
> >>   "json": {
> >> "type": "json",
> >> "extensions": [
> >>   "json"
> >> ]
> >>   },
> >>
> >> I can't see any problems on the HDFS side.  Full stack trace is
> below.
> >>
> >> Any ideas what could be causing this behavior?
> >>
> >> Thanks, Clark
> >>
> >>
> >>
> >> FULL STACKTRACE:
> >>
> >> apache drill> select * from hdfs.root.`/tmp/employee.json`;
> >> Error: VALIDATION ERROR: From line 1, column 15 to line 1, column
> 18: Object '/tmp/employee.json' not found within 'hdfs.root'
> >>
> >>
> >> [Error Id: 69c8ffc0-4933-4008-a786-85ad623578ea ]
> >>
> >> (org.apache.calcite.runtime.CalciteContextException) From line 1,
> column 15 to line 1, column 18: Object '/tmp/employee.json' not found
> within 'hdfs.root'
> >>   sun.reflect.NativeConstructorAccessorImpl.newInstance0():-2
> >>   sun.reflect.NativeConstructorAccessorImpl.newInstance():62
> >>   sun.reflect.DelegatingConstructorAccessorImpl.newInstance():45
> >>   java.lang.reflect.Constructor.newInstance():423
> >>   org.apache.calcite.runtime.Resources$ExInstWithCause.ex():463
> >>   org.apache.calcite.sql.SqlUtil.newContextException():824
> >>   org.apache.calcite.sql.SqlUtil.newContextException():809
> >>
>  org.apache.calcite.sql.validate.SqlValidatorImpl.newValidationError():4805
> >>
>  org.apache.calcite.sql.validate.IdentifierNamespace.resolveImpl():127
> >>
>  org.apache.calcite.sql.validate.IdentifierNamespace.validateImpl():177
> >>   org.apache.calcite.sql.validate.AbstractNamespace.validate():84
> >>
>  org.apache.calcite.sql.validate.SqlValidatorImpl.validateNamespace():995
> >>
>  org.apache.calcite.sql.validate.SqlValidatorImpl.validateQuery():955
> >>
>  org.apache.calcite.sql.validate.SqlValidatorImpl.validateFrom():3109
> >>
>  
> org.apache.drill.exec.planner.sql.SqlConverter$DrillValidator.validateFrom():298
> >>
>  

Re: exec.queue.enable in drill-embedded

2020-06-28 Thread Paul Rogers
Hi Avner,

Query queueing is not available in embedded mode: it uses ZK to throttle
the number of concurrent queries across a cluster; but embedded does not
have a cluster or use ZK. (If you are running more than a few concurrent
queries, embedded mode is likely the wrong deployment model anyway.)

The problem here is the use of the REST API. It has horrible performance;
it buffers the entire result set in memory in a way that overwhelms the
heap. The REST API was designed to power the Web UI for small queries of a
few hundred rows or fewer. Drill was designed assuming "real" queries would use the
ODBC, JDBC or native APIs.

That said, there is an in-flight PR designed to fix the heap memory issue
for REST queries. However, even with that fix, your client must still be
capable of handling a very large JSON document since rows are not returned
in a "jsonlines" format or in batches. If you retrieve a million rows, they
will be in single huge JSON document.

How many rows does the query return? If a few thousand or less, we can
perhaps finish up the REST fix to solve the issue. Else, consider switching
to a more scalable API.

How many rows are read from S3? Doing what kind of processing? Simple WHERE
clause, or is there some ORDER BY, GROUP BY or joins that would cause
memory use? If just a scan and WHERE clause, then the memory you are using
should be plenty - once the REST problem is fixed.

Thanks,

- Paul


On Sun, Jun 28, 2020 at 3:17 PM Avner Levy  wrote:

> Hi,
> I'm using Drill 1.18 (master) docker and trying to configure its memory
> after getting out of heap memory errors:
> "RESOURCE ERROR: There is not enough heap memory to run this query using
> the web interface."
> The docker is serving remote clients through the REST API.
> The queries are simple selects over tiny parquet files that are stored in
> S3.
> It is running in a 16GB container, configured with a heap of 8GB, and 8GB
> direct memory.
> I tried to use:
>   exec.queue.enable=true
>   exec.queue.large=1
>   exec.queue.small=1
>
> and verified it was configured correctly, but I still see queries running
> concurrently.
> In addition, the "drill.queries.enqueued" counter remains zero.
> Is this mechanism supported in drill-embedded?
>
> In addition, it seems there is some memory leak, since even after no query
> has run for a while, running a single tiny query still gives
> the same error.
> Any insight would be highly appreciated :)
> Thanks,
>   Avner
>


Re: FetchSize parameter by Drill driver

2020-05-30 Thread Paul Rogers
Hi Aditya,


Drill works with data in "batches" of value vectors. I believe that the
client has no control over the amount of data in each batch. Instead, the
batch size is set by the top-most operator in the DAG before "screen". The
good news is that Drill does limit batch sizes in most cases. The bad news
is that the limit can't be set (AFAIK) from JDBC.


- Paul

On Thu, May 28, 2020 at 6:12 PM Aditya Allamraju 
wrote:

> Team,
>
> Does drillbit understand the "fetchsize" connection parameter used by a
> JDBC driver.
> I was going through the Drill JDBC driver documentation and there is no
> mention about it.
>
> Is there a fixed default value Drill uses for fetchsize?
>
> Thanks
> Aditya
>


Re: one question about using pipe in drill

2020-05-19 Thread Paul Rogers
Try doing sqlline -? to get the help. I didn't see anything for reading
from a piped stdin, but you can read from and write to a file:

-f <script file>    script file to execute (same as --run)

-log <file>         file to write output
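
For example, something like this (connection string and paths are illustrative):

$ cat /tmp/query.sql
!set outputformat csv
SELECT * FROM cp.`employee.json` LIMIT 5;

$ bin/sqlline -u "jdbc:drill:zk=local" --run=/tmp/query.sql -log /tmp/result.log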

- Paul

On Tue, May 19, 2020 at 8:15 AM Charles Givre  wrote:

> Hi there,
> You might first want to take a look at the docs for sqlline:
> http://sqlline.sourceforge.net/#introduction <
> http://sqlline.sourceforge.net/#introduction>
> I don't really think it is meant to be used in that manner.  I'd suggest
> writing a simple python script using PyDrill that accepts a query as input
> and outputs the result in the format that you need.
> Best,
> -- C
>
>
> > On May 19, 2020, at 3:39 AM, 肖辉  wrote:
> >
> > Hi, when I use Drill, I can query successfully in the Drill shell.
> However, when I try to query data from a script, it fails.
> > When I pipe the command to Drill, it fails and quits
> immediately, as in the following picture:
> >
> > My script is a simple statement like
> >
> > but when I remove the pipe, it can access Drill successfully. How
> can I use a script to query and export data from Drill? Thanks,
> and waiting for your advice sincerely.
> >
> >
> >
>
>


Re: Rest API and SQL injection

2020-05-10 Thread Paul Rogers
Hi Avner,

Drill does not support prepared statements. Nor does Drill support statements 
with parameters. This is true with all interfaces. These would be great 
features; but they've never been implemented.


Drill was designed to operate in a Hadoop-like environment with semi-trusted 
users. (Meaning that, if any user did something malicious, you could sue or 
fire them.) As noted, the file system enforced security. There was no notion of 
the public using Drill to access secure data, with Drill acting as the secure 
gateway. Again, code could be added, but it is not there today. FWIW, in its 
present state, I would not trust Drill on a public web site with sensitive data.


Given how Drill acts today, I'd wager your best bet is to insert your own 
server between your user and Drill. Allow the user to specify queries in some 
simple, non-SQL way. Then, your server can build the SQL and forward it to 
Drill.

Public Internet (Web Browser --> Secure Gateway) --> Private network (App 
server --> Drill)


For example, if I want to know about "Orders", I can specify a date range. Your 
server fills in the storage plugin, table name, WHERE clause, etc. to create 
the SQL. Such an approach allows the Drill REST API to sit on a private IP address 
within your data center. Only your outward-facing user service is on a public 
IP. With this approach, there is no SQL injection risk because you do not 
directly use the web-provided info in a SQL statement. Of course, you have to 
build your SQL statement correctly, as you are doing. Don't just append web 
text to a SQL statement.
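
A rough sketch of that kind of server-side SQL building (all names, paths and
columns here are illustrative, not a real API):

import java.time.LocalDate;

class OrdersSqlBuilder {
  // Build the SQL on the server from validated, typed inputs only;
  // never append raw web text to the statement.
  static String buildOrdersQuery(String customerId, LocalDate from, LocalDate to) {
    // Whitelist the identifier: it becomes part of an S3 path.
    if (!customerId.matches("[A-Za-z0-9_-]+")) {
      throw new IllegalArgumentException("bad customer id");
    }
    // Dates are typed objects, so formatting them cannot smuggle in SQL text.
    return String.format(
        "SELECT order_id, total FROM s3.db.`data/%s/orders.parquet` "
      + "WHERE order_date BETWEEN DATE '%s' AND DATE '%s'",
        customerId, from, to);
  }
}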


I don't think this is unique to Drill. I'd be surprised if most people allow, 
say, public access to their HBase, Cassandra or MySQL DBs.


Thanks,
- Paul

 

On Sunday, May 10, 2020, 5:29:13 PM PDT, Avner Levy  
wrote:  
 
 Hi Charles, Paul,
Thanks for your answers.
I'm interested in a case where there is a REST service which
authenticates the service's users, gets a request with the user parameters
and builds from it the SQL sent to Drill.
The customers are identified by some account ID and they send for example
the name of an entity they are looking for as a parameter in the service's
REST request.
Then the service can build the select (just an example):
SELECT x FROM S3.db.`data/[CUSTOMER_ID]/data.parquet` where
name='CUSTOMER_USER_INPUT]'

In such case, they can still send in CUSTOMER_USER_INPUT the following: "x'
union SELECT x FROM S3.db.`data/[OTHER_CUSTOMER_ID]/data.parquet`".

Usually such things are solved with prepared statements, but I believe this
isn't supported over REST.
I would prefer not having to authenticate my end users to Drill since this
creates more work and complexity.
Is there a way to have prepared statements in Drill?
Is it supported in other protocols? (JDBC/ODBC)
Limiting the query folder outside the SQL would do the job as well.
Any feedback is appreciated,
Thanks,
  Avner


On Sun, May 10, 2020 at 5:57 PM Paul Rogers 
wrote:

> Hi Charles,
>
> One of the changes I was looking at was allowing multiple SQL statements
> per REST request to get around the lack of session. The idea would be to
> issue a number of ALTER SESSION, CTAS, USE and similar statements followed
> by a single query that returns data.
>
>
> A better solution is to enable session support for the REST API. We
> discussed the challenges involved due the disconnected nature of HTTP
> requests.
>
> Another good improvement would be a SQL command way to create configs, not
> just JSON editing. That way it would be easier to automate creation of a
> config. Also, it would be handy to be able to externalize configs so they
> can be stored in locations other than ZK (or local disk, in embedded mode.)
> For this use case, a query for user "X" would work against the "s3-X"
> config would could be retrieved from an external system that knows the
> mapping from user X to the S3 files visible to X, and the security tokens
> to use for that user.
>
> The question for now, however, is how to do this with the code that exists
> in Drill 1.17. I'm hoping someone has worked out a solution.
>
>
> Thanks,
> - Paul
>
>
>
>    On Sunday, May 10, 2020, 1:05:50 PM PDT, Charles Givre <
> cgi...@gmail.com> wrote:
>
>  Hi Avner, Paul,
> I was reading this and wondering:
>
> 1.  Is it in fact true (I think it is) that Drill does not allow multiple
> queries to be submitted in one REST request?  I seem to remember running
> into that issue when I was trying to do some of the Superset work.
> 2.  If a user is required to be authenticated to execute a query, would
> that not prevent the possibility of a non-authenticated user executing
> arbitrary queries against someone else's data?
> 3.  I would definitely create separate data sources for each tenant, but I
> don't know that it

Re: Rest API and SQL injection

2020-05-10 Thread Paul Rogers
Hi Charles,

One of the changes I was looking at was allowing multiple SQL statements per 
REST request to get around the lack of session. The idea would be to issue a 
number of ALTER SESSION, CTAS, USE and similar statements followed by a single 
query that returns data.


A better solution is to enable session support for the REST API. We discussed 
the challenges involved due the disconnected nature of HTTP requests.

Another good improvement would be a SQL command way to create configs, not just 
JSON editing. That way it would be easier to automate creation of a config. 
Also, it would be handy to be able to externalize configs so they can be stored 
in locations other than ZK (or local disk, in embedded mode.) For this use 
case, a query for user "X" would work against the "s3-X" config would could be 
retrieved from an external system that knows the mapping from user X to the S3 
files visible to X, and the security tokens to use for that user.

The question for now, however, is how to do this with the code that exists in 
Drill 1.17. I'm hoping someone has worked out a solution.


Thanks,
- Paul

 

On Sunday, May 10, 2020, 1:05:50 PM PDT, Charles Givre  
wrote:  
 
 Hi Avner, Paul, 
I was reading this and wondering:

1.  Is it in fact true (I think it is) that Drill does not allow multiple 
queries to be submitted in one REST request?  I seem to remember running into 
that issue when I was trying to do some of the Superset work.
2.  If a user is required to be authenticated to execute a query, would that 
not prevent the possibility of a non-authenticated user executing arbitrary 
queries against someone else's data?
3.  I would definitely create separate data sources for each tenant, but I 
don't know that it is necessary (or helpful) to create one for each query.  

I'd agree with Paul, that Drill's access model needs improvement and that would 
be a good addition to the project.  We might be able to assist with that if 
there's interest.
Best,
-- C


> On May 10, 2020, at 3:55 PM, Paul Rogers  wrote:
> 
> Hi Avner,
> 
> Drill was designed for a system in which the user name maps to a certificate 
> on the underlying file system, and the file system provides complete 
> security. This model has not been extended to the cloud world.
> 
> What you want is a way to authenticate your user, map the user to a storage 
> plugin config for only that client's files, then restrict that user to only 
> that config. Further, you'd want the config to obtain S3 keys from a vault of 
> some sort. If you have that, you'd not have to worry about SQL injection 
> since only an authorized user could muck with the SQL, and they could only 
> access their own data -- which they can presumably access anyway.
> 
> 
> At present, Drill has no out-of-the-box security model for this use case; 
> there is no mechanism to associate users with configs, or to externalize S3 
> security keys. Such a system would be a worthwhile addition to the project.
> 
> I wonder, has anyone else found a workaround for this use case? Maybe via 
> Kerberos or some such?
> 
> 
> Thanks,
> - Paul
> 
> 
> 
>    On Sunday, May 10, 2020, 12:04:16 PM PDT, Avner Levy 
> wrote:  
> 
> Hi,
> I'm trying to use Apache Drill as a database for providing SQL over S3
> parquet files.
> Drill is used for serving multi-tenant data for multiple customers.
> Since I need to build the SQL string using the REST API I'm vulnerable to
> SQL injection attacks.
> I do test all user input and close it between apostrophes and
> escape apostrophe in the user input by doubling it but I'm still concerned
> about optional SQL attacks.
> Will adding a different data source (which points to a different folder on
> S3) per tenant have an impact on performance? (I might
> have thousands of those)
> Does it make sense to create the data source on the fly before query?
> Is there another way to limit the sent SQL to a specific folder?
> Thanks,
>  Avner
  

Re: Rest API and SQL injection

2020-05-10 Thread Paul Rogers
Hi Avner,

Drill was designed for a system in which the user name maps to a certificate on 
the underlying file system, and the file system provides complete security. 
This model has not been extended to the cloud world.

What you want is a way to authenticate your user, map the user to a storage 
plugin config for only that client's files, then restrict that user to only 
that config. Further, you'd want the config to obtain S3 keys from a vault of 
some sort. If you have that, you'd not have to worry about SQL injection since 
only an authorized user could muck with the SQL, and they could only access 
their own data -- which they can presumably access anyway.


At present, Drill has no out-of-the-box security model for this use case; there 
is no mechanism to associate users with configs, or to externalize S3 security 
keys. Such a system would be a worthwhile addition to the project.

I wonder, has anyone else found a workaround for this use case? Maybe via 
Kerberos or some such?


Thanks,
- Paul

 

On Sunday, May 10, 2020, 12:04:16 PM PDT, Avner Levy  
wrote:  
 
 Hi,
I'm trying to use Apache Drill as a database for providing SQL over S3
parquet files.
Drill is used for serving multi-tenant data for multiple customers.
Since I need to build the SQL string using the REST API I'm vulnerable to
SQL injection attacks.
I do test all user input and close it between apostrophes and
escape apostrophe in the user input by doubling it but I'm still concerned
about potential SQL attacks.
Will adding a different data source (which points to a different folder on
S3) per tenant have an impact on performance? (I might
have thousands of those)
Does it make sense to create the data source on the fly before query?
Is there another way to limit the sent SQL to a specific folder?
Thanks,
  Avner
  

Re: REST query improvements [Was: Heap memory and performance issue in Apache drill]

2020-05-05 Thread Paul Rogers
Hi Charles,

Thanks. Your SuperSet integration uses the REST API, doesn't it? Once the 
various PRs are done, it would be interesting to try out the new version with 
your SuperSet integration to learn if we see any performance difference.

Thanks,
- Paul

 

On Tuesday, May 5, 2020, 5:29:04 PM PDT, Charles Givre  
wrote:  
 
 Paul, 
Nice work!
--C

> On May 5, 2020, at 7:27 PM, Paul Rogers  wrote:
> 
> Hi All,
> 
> One more update. Went ahead and implemented the streaming solution for REST 
> JSON queries. The result is that REST queries run almost as fast as native or 
> JDBC queries: the results stream directly from the query DAG out to the HTTP 
> client with no buffering at all.
> 
> Tested with a file of 1 GB size: 1 M rows of 20 fields, each of 50 bytes. 
> Tests run in a debugger, single threaded. A COUNT(*) query took about 10 
> seconds. Running a SELECT * to JSON took about 18 seconds, presumably for the 
> cost of encoding data as JSON and streaming 1+ GB over the network.
> 
> This will help clients that use the REST JSON query API -- but only if the 
> client itself handles the data in a streaming way (parses rows as they 
> arrive, processes them, and disposes of them.) If the client buffers the 
> entire result set into a data structure, then the client will run out of 
> memory as result set sizes increases.
> 
> As noted earlier, the current JSON structure is awkward for this. A better 
> format might be as a stream of "JSON lines" in which each line is an 
> independent JSON object. An even better format would be binary-encoded rows 
> of some sort to avoid repeating the field names a million times.
> 
> FWIW, it turns out that the current design assumes uniform rows. The list of 
> column names is emitted at the start. If a schema change occurs, the set of 
> fields will change, but there is no way to go back and amend the column name 
> list. Not sure if anyone actually uses schema changes, but just something to 
> be aware of if you do.
> 
> The Web query feature (display the results in a web page) still uses the 
> buffering approach, which is probably fine because you don't want to load a 
> 1GB result set in the browser anyway.
> 
> See DRILL-7733 for the details.
> 
> 
> Thanks,
> - Paul
> 
> 
> 
>    On Monday, May 4, 2020, 12:20:15 AM PDT, Paul Rogers  
>wrote:  
> 
> Hi All,
> 
> Was able to reduce the memory impact of REST queries a bit by avoiding some 
> excessive copies and duplicate in-memory objects. The changes will show up in 
> a PR for Drill 1.18.
> 
> The approach still buffers the entire result set on the heap, which is the 
> next thing to fix. Looks feasible to stream the results to the browser as 
> they arrive, while keeping the same JSON structure as the current version. 
> The current implementation sends column names, then all the data, then column 
> types. Might make more sense to send the names and types, followed by the 
> rows. That way, the client knows what to do with the rows as they arrive. As 
> long as the fields are identical, changing field order should not break 
> existing clients (unless someone implemented a brittle do-it-yourself JSON 
> parser.)
> 
> 
> With streaming, Drill should be able to deliver any number of rows with no 
> memory overhead due to REST. However, the current JSON-based approach is 
> awkward for that amount of data.
> 
> We briefly mentioned some possible alternatives. For those of you who want to 
> use REST to consume large data sets, do you have a favorite example of a tool 
> that does a good job at sending such data? Might as well avoid reinventing the 
> wheel; would be great if Drill can just adopt the solution that works for 
> "Tool X." Suggestions?
> 
> 
> Thanks,
> - Paul
> 
> 
> 
>    On Friday, May 1, 2020, 4:15:01 PM PDT, Dobes Vandermeer 
> wrote:  
> 
> I think an okay approach to take is to use CTAS to dump your result into a 
> folder / bucket of your choice instead of trying to receive the result 
> directly from Drill.
> 
> The user can run a cron job or use lifecycle policies to clean up old query 
> results if they fail to delete them manually in the code that consumes them.
> 
> However, in my own experimentation I found that when I try to do this using 
> the REST API it will still complain about running out of memory, even though 
> it doesn't need to buffer any results.
> 
> I think it just used a lot of memory to perform the operation regardless of 
> whether it needs to serialize the results as JSON.
> 
> On 5/1/2020 2:51:49 PM, Paul Rogers  wrote:
> Hi All,
> 
> TL;DR: Your use case is too large for the REST API as it is currently 
> implemented. Three alternatives:
>

Re: REST query improvements [Was: Heap memory and performance issue in Apache drill]

2020-05-05 Thread Paul Rogers
Hi All,

One more update. Went ahead and implemented the streaming solution for REST 
JSON queries. The result is that REST queries run almost as fast as native or 
JDBC queries: the results stream directly from the query DAG out to the HTTP 
client with no buffering at all.

Tested with a file of 1 GB size: 1 M rows of 20 fields, each of 50 bytes. Tests 
run in a debugger, single threaded. A COUNT(*) query took about 10 seconds. 
Running a SELECT * to JSON took about 18 seconds, presumably for the cost of 
encoding data as JSON and streaming 1+ GB over the network.

This will help clients that use the REST JSON query API -- but only if the 
client itself handles the data in a streaming way (parses rows as they arrive, 
processes them, and disposes of them.) If the client buffers the entire result 
set into a data structure, then the client will run out of memory as result set 
sizes increases.

As noted earlier, the current JSON structure is awkward for this. A better 
format might be as a stream of "JSON lines" in which each line is an 
independent JSON object. An even better format would be binary-encoded rows of 
some sort to avoid repeating the field names a million times.
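
Schematically, the difference is between one big document and one small,
independent object per row (field names and structure shown only roughly):

Today, one document held until the query completes:

  {"columns": ["a", "b"], "rows": [{"a": 1, "b": "x"}, {"a": 2, "b": "y"}, ...], "metadata": [...]}

A "JSON lines" style response, streamable row by row:

  {"a": 1, "b": "x"}
  {"a": 2, "b": "y"}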

FWIW, it turns out that the current design assumes uniform rows. The list of 
column names is emitted at the start. If a schema change occurs, the set of 
fields will change, but there is no way to go back and amend the column name 
list. Not sure if anyone actually uses schema changes, but just something to be 
aware of if you do.

The Web query feature (display the results in a web page) still uses the 
buffering approach, which is probably fine because you don't want to load a 1GB 
result set in the browser anyway.

See DRILL-7733 for the details.


Thanks,
- Paul

 

On Monday, May 4, 2020, 12:20:15 AM PDT, Paul Rogers  
wrote:  
 
 Hi All,

Was able to reduce the memory impact of REST queries a bit by avoiding some 
excessive copies and duplicate in-memory objects. The changes will show up in a 
PR for Drill 1.18.

The approach still buffers the entire result set on the heap, which is the next 
thing to fix. Looks feasible to stream the results to the browser as they 
arrive, while keeping the same JSON structure as the current version. The 
current implementation sends column names, then all the data, then column 
types. Might make more sense to send the names and types, followed by the rows. 
That way, the client knows what to do with the rows as they arrive. As long as 
the fields are identical, changing field order should not break existing 
clients (unless someone implemented a brittle do-it-yourself JSON parser.)


With streaming, Drill should be able to deliver any number of rows with no 
memory overhead due to REST. However, the current JSON-based approach is 
awkward for that amount of data.

We briefly mentioned some possible alternatives. For those of you who want to 
use REST to consume large data sets, do you have a favorite example of a tool 
that does a good job at sending such data? Might as well avoid reinventing the 
wheel; would be great if Drill can just adopt the solution that works for "Tool 
X." Suggestions?
 

Thanks,
- Paul

 

On Friday, May 1, 2020, 4:15:01 PM PDT, Dobes Vandermeer  
wrote:  
 
 I think an okay approach to take is to use CTAS to dump your result into a 
folder / bucket of your choice instead of trying to receive the result directly 
from Drill.

The user can run a cron job or use lifecycle policies to clean up old query 
results if they fail to delete them manually in the code that consumes them.

However, in my own experimentation I found that when I try to do this using the 
REST API it will still complain about running out of memory, even though it 
doesn't need to buffer any results.

I think it just used a lot of memory to perform the operation regardless of 
whether it needs to serialize the results as JSON.

On 5/1/2020 2:51:49 PM, Paul Rogers  wrote:
Hi All,

TL;DR: Your use case is too large for the REST API as it is currently 
implemented. Three alternatives:

1. Switch to JDBD/ODBC,
2. Write the results to a file rather than sending to your web client. The web 
client can then read the file.
3. Help us improve the scalability of the REST API.

The REST API is increasingly popular. Unfortunately, its current 
implementation has significant limitations. All results are held in memory 
until the end of the query, after which they are translated to JSON. This model 
was fine when the REST API was used to run a few, small, sample queries in the 
Drill Web Console, but is not well suited to larger, production use cases.


Let's roughly estimate the memory needs for your query with the current design. 
A 400 MB Parquet file, with compression, might translate to 4 GB uncompressed. 
As it turns out, none of that will be buffered in direct memory unless you also 
have an ORDER BY clause (where we need to hold all data in memory to do the 
sort.)


The rea

Re: EXTERNAL: Re: Apache Drill Sizing guide

2020-05-05 Thread Paul Rogers
Hi Navin,

Just wanted to let you know I've not forgotten this thread. There is much here 
to digest; thanks for doing the research. There are many interesting 
performance questions raised by the results. Our goal, however, is to predict 
the resources needed to run the queries.

My previous note focused on disk as that is often the primary bottleneck. 
Indeed, for your scan-only queries, that seems to be true. Results are thrown 
off a bit by the large planning time relative to query execution time. That 3 
second plan time needs explanation by itself.

Then you ran the full query and found it takes 50x as long. Since it reads the 
same data, it must be CPU bound (since queries mostly use either CPU or disk.) 
I'm puzzled by what the query could be doing that uses up so much time. Can you 
perhaps share a bit more about the query? If you look at the query profile, 
what is taking the bulk of the time?

Looking at the table at the end of your note, the file contains 225,27,414 
records. (Is that actually 22,527,414?) The output is on the order of 20K 
records, or a 1000:1 reduction. How do you get from the one to the other? Just 
a WHERE clause? Aggregation? Windowing?

Does the query fetch the same number of Parquet columns in both the scan-only 
and full query? As you know, Parquet is columnar: cost is driven not only by 
the number of rows, but also by the number of columns. (CSV, JSON and the like 
are simpler: every query reads the entire file. But, of course, that is usually 
more expensive.)


If it does turn out that your queries are CPU-intensive because of the kind of 
analysis that needs to be performed, then sizing becomes more about getting 
enough cores rather than about dealing with disk I/O limitations.

Can you provide a bit more info about the query and the file? I'm sure you've 
told us about your file structure, but it is a bit hard to find among all the 
e-mails. (You might consider creating a JIRA ticket with all the info to avoid 
having to repeat yourself.)


Thanks,
- Paul

 

On Wednesday, April 29, 2020, 9:19:49 AM PDT, Navin Bhawsar 
 wrote:  
 
 Hi Paul,

I tried to follow your suggested approach , but not much luck

-- Scan Query

Planning = 3.481 sec
Queued = 0.007 sec
Execution = 0.591 sec
Total = 4.079 sec
Parquet_Row_group_Scan:
  Avg Process Time = 0.036s
  Max Process Time = 0.036s
CPU Usage after = 0.013 sec
CPU Usage before = 0.003 sec
Wall clock = 4.141 sec

Difference between CPU time and wall clock = 4.141 sec - 0.01 sec = 4.13 sec
(maximum scan throughput for one Drill fragment)
Drill throughput = 1095 MB / 4.13 sec = 265 MB/sec
Number of fragments = 1
Per fragment = 4.13/1 = 4.13 sec
Single scan per fragment = 1095 MB / 4.13 = 265 MB/sec (non-linear)

No. of concurrent users = 30
Total size = 1320
Non-cached read = 265 MB/s
1.3 min required to complete all 30 queries

-- Full Result Query

Planning = 3.295 sec
Queued = 0.004 sec
Execution = 02 min 34.215 sec
Total = 02 min 37.514 sec
Parquet_Row_group_Scan:
  Avg Process Time = 14.843s
  Max Process Time = 14.843s
CPU Usage after = 1.08 sec
CPU Usage before = 0.003 sec
Wall clock = 212.7 sec

Difference between CPU time and wall clock = 212.7 sec - 1.05 sec = 211.65 sec
(maximum scan throughput for one Drill fragment)
Drill throughput = 1095 MB / 212 sec = 5.17 MB/sec
Number of fragments = 2
Per fragment = 211/2 = 105.5 sec
Single scan per fragment = 1095/2.25 = 486 MB/sec

There are a few questions based on this approach -

1. Disk throughput does not match the expected disk read performance, i.e. 500
Mbps, although the disk scan per fragment is close to that.

2. Also, in this case the full query time is 50x the scan time, and I am not sure
how this will give us the expected CPU count for 30 users.

Also, Drill has different performance with the same CPU cores (6-core): when it
is two separate machines, Drill performs better, especially on scan and the level
of parallelism in minor fragments.



                           Single Node   Single Node   2-Node
                           1 Core        6-Core        6-Core (3-Core per Node)
Planning                   3.662 sec     0.553 sec     0.568 sec
Queued                     0.007 sec     0.004 sec     0.021 sec
Execution                  22.921 sec    6.833 sec     7.365 sec
Total                      26.590 sec    7.390 sec     7.954 sec
No. of Fragments           2             5             7
Scan Process Time          21.124s       5.946s        4.726s
No. of Minor Fragments     1             4             6
Filter Time                0.288 sec     0.078 sec     0.053s
Scan Records               225,27,414    225,27,414    225,27,414
Scan Peak Memory           15 MB         15 MB         15 MB
No. of Rowgroups scanned   123           123           123
Filtered Rows              20,406        20,406        20,406




Thanks,

Navin

On Fri, Apr 17, 2020 at 11:05 PM Paul Rogers  wrote:

> Hi Navin,
>
>
> One more factor for you to consider. The straw-man analysis we just did
> was for a file format such as CSV in which Drill must read all data within
> each HDFS block. You said you are using Parquet. One of the great features
> of Parquet is that Drill reads only the columns needed for your query. This
> makes the analysi

Re: Partition Pruning in Apache Drill

2020-05-05 Thread Paul Rogers
Hi Sreeparna,

There are various reasons that planning might be slow. You mentioned you have a 
partitioned directory structure, which is a good approach. How many directories 
exist at each level? How many files in the leaf folders? How many of those 
folders are included in your query?

If the number is large, then the delay may be due to the fact that Drill must 
walk the tree to identify which files to include in the query.

Also, which file system are you using? HDFS? S3? Each has different 
characteristics when doing directory operations. (S3 has no actual directories, 
for example.)

Please provide the additional information so we can identify the source of the 
issue.


Thanks,
- Paul

 

On Monday, May 4, 2020, 9:01:08 AM PDT, sreeparna bhabani 
 wrote:  
 
 Hi Team,

Kindly check the below query regarding the partition pruning. We are using
the partition pruning for our current project in Apache Drill and have some
questions. Please find the below details of the scenario-

File Type-
Parquet generated from Python

Folder structure in hdfs-


Query used to select data under -
To take advantage of partition pruning
select column1, column2, ... from dfs.`tmp`.`<root dir>` where dir0 =
<value> and dir1 = <value> and dir2 = <value> and <column> = ..;

Observation-
Although the execution is fast, the time taken for planning is quite high.
I didn't see a VALUES operator in the physical plan of the query; rather,
there was a SCAN operator.
How can we ensure that the selected data is partition pruned here ?
As an alternative, I modified the query to bring down the planning time of
it and included the sub-directories in the root directory. The modified
query is-
select column1, column2, ... from
dfs.`tmp`.`<dir0>/<dir1>/<dir2>` where <column> = ..;

Can you please tell me why the planning time is so high for the first
query? How can we take advantage of partition pruning from it ? Or should
we include sub-directories in the root directory ?

Thanks in advance.

*Sreeparna Bhabani*
  

Re: Suggestion needed for UNION ALL performance in Apache drill

2020-05-05 Thread Paul Rogers
Hi Sreeparna,

Thanks much for digging into the details. SQL is pretty complex and things 
don't always work as we might expect.

The first question is: which plan is correct: the (Parquet UNION ALL DB) plan 
or the (Parquet UNION ALL Parquet) plan? I tried poking around, and got no 
definitive answer on whether UNION ALL implies ordering. On the one hand, the 
parts of the standards that I could find didn't seem to imply ordering. A UNION 
(no ALL) can't imply ordering since it essentially does an anti-join which may 
be hash-partitioned. But, a StackOverflow post suggested that there is an 
implied ordering in the case of an ORDER BY on the sub-queries:

(SELECT ... ORDER BY ...)
UNION ALL
 (SELECT ... ORDER BY ...)

That is, if we can sort each sub-query, then doing so only makes sense if all 
results of the first are returned before any results of the second. (Have not 
checked if the above is valid in Drill. Even if it was, the planner should 
handle the above as a special case.)


So, let's assume that the (Parquet UNION ALL Parquet) case is the correct 
behavior. Then, we can speculate that the planner is getting confused somehow 
in the mixed case. Each data source (file for Parquet, JDBC for the DB) 
parallelizes independently. Each decides it needs just one fragment. Somehow 
the planner must be saying, "well, if they both only want one fragment, let's 
run the whole query in a single fragment."

Perhaps the decision is based on row count and the planner somehow thinks the 
row counts will be small for one or both of the queries. In fact, what happens 
if you do a query with (DB UNION ALL DB)? Drill's ability to estimate row 
counts is poor, especially from JDBC. Perhaps the planner is guessing the 
tables are small and parallelizing is unnecessary.


My advice is to file a JIRA ticket with as much detail as you can provide. 
Certainly the information from your e-mail. Ideally, also the query (with names 
redacted if necessary.) Also, the JSON query plan (obtained by using EXPLAIN 
PLAN FOR), again with names redacted if necessary.
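
That is, prefix the problem query with EXPLAIN PLAN FOR, roughly like this
(table and column names below are placeholders):

EXPLAIN PLAN FOR
SELECT col1 FROM dfs.`/data/my_file.parquet`
UNION ALL
SELECT col1 FROM rdbms.mydb.mytable;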

With that info, we can dig a bit deeper to determine why the two cases come out 
differently.

Thanks,
- Paul

 

On Monday, May 4, 2020, 9:39:12 AM PDT, sreeparna bhabani 
 wrote:  
 
 Hi Team,

After further checking on this UNION ALL, I found that UNION ALL
(between Parquet and database) behaves as expected with a limited number of
rows and columns. But for a larger Parquet file and a higher number of
selected rows and columns, the UNION ALL takes much more time than the sum
of the times of the individual Parquet and DB queries.

As per the analysis, it looks like the source of this issue is-
Although we are using distributed mode, the UNION ALL query is executed
only on 1 NODE in case of Parquet UNION ALL DB. It is not distributed and
parallelized in multiple nodes.

Whereas, for individual query or UNION ALL between same type datasets
(Parquet + Parquet) it is getting distributed in 2 NODES.

Do you have any finding / idea on this ?

Thanks,
Sreeparna Bhabani

On Tue, Apr 28, 2020 at 9:00 PM sreeparna bhabani <
bhabani.sreepa...@gmail.com> wrote:

> Hi Paul Team,
>
> Please check the observation mentioned in the  below Jira where we found
> that UNION ALL query is not parallelized between multiple nodes when there
> are 2 types dataset (Parquet and Database). But it is parallelized if we
> query individual Parquet file.
>
> Is there any way to enforce parallel execution in multiple nodes ?
>
> Thanks,
> Sreeparna Bhabani
>
>
> On Tue, 28 Apr 2020, 20:46 sreeparna bhabani, 
> wrote:
>
>>
>> Hi Paul and Team,
>>
>> As you suggested I have created a Jira ticket which is  -
>> https://issues.apache.org/jira/browse/DRILL-7720.
>> I have mentioned details in the Jira you asked. Please have a look. As
>> the data is sensitive, I am trying to create dummy dataset. Will
>> provide once it is ready.
>>
>> Thanks,
>> Sreeparna Bhabani
>>
>> On Fri, Apr 24, 2020 at 11:28 AM sreeparna bhabani <
>> bhabani.sreepa...@gmail.com> wrote:
>>
>>>
>>> -- Forwarded message -
>>> From: Paul Rogers 
>>> Date: Thu, 23 Apr 2020, 23:59
>>> Subject: Re: Suggestion needed for UNION ALL performance in Apache drill
>>> To: , sreeparna bhabani <
>>> bhabani.sreepa...@gmail.com>
>>> Cc: , 
>>>
>>>
>>> Hi Sreeparna,
>>>
>>>
>>> As suggested in the earlier e-mail, we would not expect to see different
>>> performance in UNION ALL than in a simple scan. Clearly you've found some
>>> kind of issue. The next step is to investigate that issue, which is a bit
>>> hard to do over e-mail.
>>>
>>>
>>> Please file a JIRA ticket to describe the issue and prov

Re: Apache drill JDBC storage plugin for salesforce

2020-05-05 Thread Paul Rogers
Hi Mohammed,

Welcome to the Drill mailing list. We'll try to help solve your issue.

From the text of the message, sounds like you are getting the error when you 
try to create the storage plugin config in the Web console. Correct? Checking 
the code, it appears you need one more entry in your config:

"caseInsensitiveTableNames": false

(Looks like Salesforce uses Oracle which has case-insensitive table names.)
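
That is, something along these lines, keeping your other settings as they are:

{
  "type": "jdbc",
  "driver": "oracle.jdbc.driver.OracleDriver",
  "url": "jdbc:oracle:thin:@login.salesforce.com",
  "username": "XXXMyUserXXX",
  "password": "XXXMyPasswordXXX",
  "caseInsensitiveTableNames": false,
  "enabled": true
}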

By the way, I just changed the code so that this problem won't trip up future 
users.


Thanks,
- Paul

 

On Monday, May 4, 2020, 10:22:05 PM PDT, Mohammed Zeeshan 
 wrote:  
 
 Hi Team,

I've a query for creating a storage plugin to salesforce with JDBC

Installed necessary jdbc driver and tried with below configuration:
{
  "type": "jdbc",
  "driver": "oracle.jdbc.driver.OracleDriver",
  "url": "jdbc:oracle:thin:@login.salesforce.com",
  "username": "XXXMyUserXXX",
  "password": "XXXMyPasswordXXX",
  "enabled": true
}

But I eventually end up with an error. I have followed the documentation but
unfortunately no luck

* Please retry: error (unable to create/ update storage) *

Could you please help to find the missing piece?

Best Wishes,
Mohammed Zeeshan
  

REST query improvements [Was: Heap memory and performance issue in Apache drill]

2020-05-04 Thread Paul Rogers
Hi All,

Was able to reduce the memory impact of REST queries a bit by avoiding some 
excessive copies and duplicate in-memory objects. The changes will show up in a 
PR for Drill 1.18.

The approach still buffers the entire result set on the heap, which is the next 
thing to fix. Looks feasible to stream the results to the browser as they 
arrive, while keeping the same JSON structure as the current version. The 
current implementation sends column names, then all the data, then column 
types. Might make more sense to send the names and types, followed by the rows. 
That way, the client knows what to do with the rows as they arrive. As long as 
the fields are identical, changing field order should not break existing 
clients (unless someone implemented a brittle do-it-yourself JSON parser.)


With streaming, Drill should be able to deliver any number of rows with no 
memory overhead due to REST. However, the current JSON-based approach is 
awkward for that amount of data.

We briefly mentioned some possible alternatives. For those of you who want to 
use REST to consume large data sets, do you have a favorite example of a tool 
that does a good job at sending such data? Might as well avoid reinventing the 
wheel; would be great if Drill can just adopt the solution that works for "Tool 
X." Suggestions?
 

Thanks,
- Paul

 

On Friday, May 1, 2020, 4:15:01 PM PDT, Dobes Vandermeer  
wrote:  
 
 I think an okay approach to take is to use CTAS to dump your result into a 
folder / bucket of your choice instead of trying to receive the result directly 
from Drill.

The user can run a cron job or use lifecycle policies to clean up old query 
results if they fail to delete them manually in the code that consumes them.

However, in my own experimentation I found that when I try to do this using the 
REST API it will still complain about running out of memory, even though it 
doesn't need to buffer any results.

I think it just used a lot of memory to perform the operation regardless of 
whether it needs to serialize the results as JSON.

On 5/1/2020 2:51:49 PM, Paul Rogers  wrote:
Hi All,

TL;DR: Your use case is too large for the REST API as it is currently 
implemented. Three alternatives:

1. Switch to JDBD/ODBC,
2. Write the results to a file rather than sending to your web client. The web 
client can then read the file.
3. Help us improve the scalability of the REST API.

The REST API is increasingly popular. Unfortunately, its current 
implementation has significant limitations. All results are held in memory 
until the end of the query, after which they are translated to JSON. This model 
was fine when the REST API was used to run a few, small, sample queries in the 
Drill Web Console, but is not well suited to larger, production use cases.


Let's roughly estimate the memory needs for your query with the current design. 
A 400 MB Parquet file, with compression, might translate to 4 GB uncompressed. 
As it turns out, none of that will be buffered in direct memory unless you also 
have an ORDER BY clause (where we need to hold all data in memory to do the 
sort.)


The real cost is the simple design of the REST API. As your query runs, the 
REST handler stores all rows in an on-heap map of name/string pairs: one for 
each column in each row of your table. This is 15 M rows * 16 cols/row = 250 
million keys and another 250 million string values. A quick check of the code 
suggests it does not do string "interning", so it is likely that each of the 15 
million occurrences of each name is a separate heap object. Verifying, and 
fixing this would be a good short-term improvement.


If your data is 4 GB uncompressed, then when expanded as above, it could easily 
take, say, 10 GB of heap to encode as key/string pairs. The code does monitor 
heap size and gives you the error you reported as heap use grows too large. 
This obviously is not a good design, but it is how things work today. It was 
done quickly many years ago and has only been slightly improved since then.

For your query, with a single Parquet file, the query will run in a single 
minor fragment. None of the tuning parameters you mentioned will solve your 
REST problem because the query itself is quite simple; it is the REST handler 
which is causing this particular problem.

Here is a simple way to verify this. Take your query and wrap it in:

SELECT COUNT(*) FROM (<your query>)

This will do all the work to run your query, count the results, and return a 
single row using the REST API. This will give you a sense of how fast the query 
should run if the REST API were out of the picture.


As Rafael noted, the ODBC and JDBC interfaces are designed for scale: they 
incrementally deliver results so that Drill need not hold the entire result set 
in memory. They also transfer results in a compact binary format.

It may be useful to take a step back. It is unclear the use case you are tying 
to solve. If your client intends to work with all 15 M

Re: Heap memory and performance issue in Apache drill

2020-05-01 Thread Paul Rogers
Hi All,

TL;DR: Your use case is too large for the REST API as it is currently 
implemented. Three alternatives:

1. Switch to JDBD/ODBC,
2. Write the results to a file rather than sending to your web client. The web 
client can then read the file.
3. Help us improve the scalability of the REST API.

The REST API is increasingly popular. Unfortunately, its current 
implementation has significant limitations. All results are held in memory 
until the end of the query, after which they are translated to JSON. This model 
was fine when the REST API was used to run a few, small, sample queries in the 
Drill Web Console, but is not well suited to larger, production use cases.


Let's roughly estimate the memory needs for your query with the current design. 
A 400 MB Parquet file, with compression, might translate to 4 GB uncompressed. 
As it turns out, none of that will be buffered in direct memory unless you also 
have an ORDER BY clause (where we need to hold all data in memory to do the 
sort.)


The real cost is the simple design of the REST API. As your query runs, the 
REST handler stores all rows in an on-heap map of name/string pairs: one for 
each column in each row of your table. This is 15 M rows * 16 cols/row = roughly 240 
million keys and another 240 million string values. A quick check of the code 
suggests it does not do string "interning", so it is likely that each of the 15 
million occurrences of each name is a separate heap object. Verifying, and 
fixing this would be a good short-term improvement.


If your data is 4 GB uncompressed, then when expanded as above, it could easily 
take, say, 10 GB of heap to encode as key/string pairs. The code does monitor 
heap size and gives you the error you reported as heap use grows too large. 
This obviously is not a good design, but it is how things work today. It was 
done quickly many years ago and has only been slightly improved since then.

For your query, with a single Parquet file, the query will run in a single 
minor fragment. None of the tuning parameters you mentioned will solve your 
REST problem because the query itself is quite simple; it is the REST handler 
which is causing this particular problem.

Here is a simple way to verify this. Take your query and wrap it in:

SELECT COUNT(*) FROM (<your query>)

This will do all the work to run your query, count the results, and return a 
single row using the REST API. This will give you a sense of how fast the query 
should run if the REST API were out of the picture.
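
For example, if your query selects a handful of columns from a Parquet file, the 
wrapped version might look like the sketch below (the file path and column names 
are placeholders, not your actual schema):

SELECT COUNT(*)
FROM (
  SELECT col1, col2, col3
  FROM dfs.`/data/myfile.parquet`
  WHERE col4 = 'some_value'
) t;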


As Rafael noted, the ODBC and JDBC interfaces are designed for scale: they 
incrementally deliver results so that Drill need not hold the entire result set 
in memory. They also transfer results in a compact binary format.

It may be useful to take a step back. It is unclear the use case you are trying 
to solve. If your client intends to work with all 15 M rows and 16 columns, 
then it needs sufficient memory to buffer these results. No human or dashboard 
can consume that much data. So, you must be doing additional processing. 
Consider pushing that processing into SQL and Drill. Or, consider writing the 
results to a file using a CREATE TABLE AS (CTAS) statement to avoid buffering 
the large result set in your client. Big data tools often transform data from 
one set of files to another since data is too large to buffer in memory. The 
REST API is perfectly suitable to run that CTAS statement.
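
A minimal CTAS sketch, assuming a writable dfs.tmp workspace and made-up paths 
and columns:

-- write the result set to files instead of returning it over REST
CREATE TABLE dfs.tmp.`query_results` AS
SELECT customer_id, order_total
FROM dfs.`/data/orders.parquet`
WHERE order_total > 100;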


That said, should the REST API be extended to be more scalable? Absolutely. 
Drill is open source. The community is encouraged to help expand Drill's 
capabilities. We've often discussed the idea of a session-oriented REST API so 
clients can fetch blocks of results without the need for server or client-side 
buffering. Easy enough to code.


The key challenge is state. With the current design, the entire query runs in a 
single message. If the user abandons the query, the HTTP connection closes and 
Drill immediately releases all resources. With a REST client using multiple 
messages to transfer results, how do we know when the client has abandoned the 
query? Impala struggled with this. Some commercial tools dump results to disk 
and eventually delete them after x minutes or hours of inactivity. Anyone know 
of how other tools solve this problem?

Thanks,
- Paul

 

On Friday, May 1, 2020, 10:49:36 AM PDT, Rafael Jaimes III 
 wrote:  
 
 Hi Sreeparna,

I know your dataset is 15 million rows and 16 columns, but how big is the
result set you are expecting from that query?

I think that result set is too large for Drill's REST interface to handle
especially with only 16G heap. I try to keep the REST queries in Drill to
about 10k rows with limited number of columns. JDBC or ODBC can handle MUCH
larger volumes without issue.

Best,
Rafael

On Fri, May 1, 2020 at 1:39 PM sreeparna bhabani <
bhabani.sreepa...@gmail.com> wrote:

> Hi Team,
>
> Kindly suggest on the below problem which we are facing in Apache Drill
> while running query in Web interface. 

Re: Parquet Predicate Push down not working

2020-04-29 Thread Paul Rogers
Hi Navin,

You raise some good questions. I don't have a complete answer, but I can tackle 
some of the basics.

Rafael noted that images are blocked on Apache mail lists. I believe you can 
post images in the Drill Slack channel. Better, perhaps is to open a JIRA 
ticket with your images and information so it is easier for us to track these 
specific questions & issues.


Drill supports two forms of Parquet predicate push-down. The first is partition 
pruning, which removes files based on their directory names. (Let's say you 
have files in the 2019 and 2020 directories, and have a WHERE clause that 
limits the query to just the 2020 directory). Partition pruning should work as 
long as you explicitly mention the directories:

... WHERE dir0 = "2020"

(Unfortunately, since Drill has no schema, Drill cannot map directories to 
column names the way Hive can.)

The simplest, least-fuss way to enable filter push-down is to filter based on 
directories: doing so requires no extra schema information be provided to 
Drill, nor does it require Drill to do extra work (reading files) when planning 
a query. Directory pruning works for Parquet and all other file types as well.
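
For example, with files laid out as /logs/events/<year>/<month>/..., a pruned 
query might look like this (the directory layout and column names here are only 
an illustration):

SELECT errorCode, hostName
FROM dfs.`/logs/events`
WHERE dir0 = '2020' AND dir1 = '03';  -- dir0/dir1 map to the directory levels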


The second form of pruning occurs at the row group level. Here I'll need some 
help from the folks that have worked with that code. I'm not sure if the 
planner will open every file at plan time to read this information. I do seem 
to recall that Drill does (did?) gather and cache the info. There is also a 
newly-added metadata feature to gather this information once to avoid per-query 
scans. Perhaps someone with more current knowledge can fill in the details.
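
For what it is worth, the Parquet metadata cache can be built ahead of time with 
the metadata refresh command, which avoids re-reading file footers on every 
query (the table path here is only a placeholder):

REFRESH TABLE METADATA dfs.`/data/my_parquet_table`;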

You noted that the filter does not remove records. This is correct. The filter 
simply tags records as matching the filter or not. The Selection Vector Remover 
(SVR) does the actual removal. The SVR operator is used in other places as 
well. It is the combination of (Filter --> SVR) that performs the full filter 
operation. The (Filter --> SVR) combination will always run in the same minor 
fragment, so no extra network I/O occurs.


Another question asked about parallelism. Drill parallelizes based on HDFS file 
blocks which are commonly 256 MB or 512 MB. This is classic HDFS "data 
locality" behavior and is why Rafael suggests having larger Parquet files. 
(That said, Drill also parallelizes based on files, so having many small files 
should also work, ignoring the classic HDFS "small file problem"; this was a 
big advantage of the MapR file system, and of S3.)


Your note does suggest another approach, which might work better on "blockless" 
systems such as S3 or local disk: parallelize at the row group level. Parquet 
is complex, we'd have to understand the costs and benefits of such an approach.

Thanks,
- Paul

 

On Wednesday, April 29, 2020, 9:35:40 AM PDT, Navin Bhawsar 
 wrote:  
 
 Hi,
We are trying to do a simple where clause query with Predicate. Parquet files 
are created using python and stored on hdfs. Apache Drill version used is 1.17.

Below options are set as default required for Predicate Push Down

Drill query is scanning a directory with multiple parquet files (total size 1 
GB). We are expecting if predicate push down works it will help reduce scan time, 
which is currently 97%. If predicate push down works, row group scan should only 
fetch 70,840 records instead of 14,162,187.




| Metric                        |    01-00-04 |    01-01-04 |    01-02-04 |    01-03-04 | 01-04-04 (truncated) |
| NUM_ROWGROUPS                 |           7 |           6 |           6 |           6 |                    6 |
| ROWGROUPS_PRUNED              |           0 |           0 |           0 |           0 |                    0 |
| NUM_DICT_PAGE_LOADS           |          77 |          66 |          66 |          66 |                   66 |
| NUM_DATA_PAGE_LOADS           |           0 |           0 |           0 |           0 |                    2 |
| NUM_DATA_PAGES_DECODED        |          77 |          66 |          66 |          66 |                      |
| NUM_DICT_PAGES_DECOMPRESSED   |          77 |          66 |          66 |          66 |                      |
| NUM_DATA_PAGES_DECOMPRESSED   |          77 |          66 |          66 |          66 |                      |
| TOTAL_DICT_PAGE_READ_BYTES    |           0 |           0 |           0 |           0 |                      |
| TOTAL_DATA_PAGE_READ_BYTES    |           0 |           0 |           0 |           0 |                      |
| TOTAL_DICT_DECOMPRESSED_BYTES |   7,147,852 |   2,115,860 |   6,835,560 |   2,242,112 |                      |
| TOTAL_DATA_DECOMPRESSED_BYTES |   8,884,071 |   4,316,153 |   8,630,174 |   4,516,183 |                      |
| TIME_DICT_PAGE_LOADS          |     598,070 |   1,778,468 |     337,404 |   1,586,562 |                      |
| TIME_DATA_PAGE_LOADS          |           0 |           0 |           0 |           0 |                      |
| TIME_DATA_PAGE_DECODE         |      97,822 |     144,320 |     100,190 |     164,398 |                      |
| TIME_DICT_PAGE_DECODE         |  11,440,739 |   3,665,957 |  10,876,145 |   3,827,371 |                      |
| TIME_DICT_PAGES_DECOMPRESSED  |   2,081,514 |     775,403 |   1,970,521 |     877,814 |                      |
| TIME_DATA_PAGES_DECOMPRESSED  |  17,694,740 |   8,693,618 |  11,789,061 |   8,604,307 |                      |
| TIME_DISK_SCAN_WAIT           |     598,070 |   1,778,468 |     337,404 |   1,586,562 |                      |
| TIME_DISK_SCAN                |           0 |           0 |           0 |           0 |                      |
| TIME_FIXEDCOLUMN_READ         | 112,108,259 | 105,066,657 | 102,833,433 | 112,745,628 |                      |
| TIME_VARCOLUMN_READ           | 703,103,096 | 776,807,232 | 655,338,696 | 758,634,132 |                      |
| TIME_PROCESS                  | 815,245,307 | 882,070,408 | 758,203,357 | 871,586,588 |                      |

Re: Suggestion needed for UNION ALL performance in Apache drill

2020-04-23 Thread Paul Rogers
Hi Sreeparna,

As suggested in the earlier e-mail, we would not expect to see different 
performance in UNION ALL than in a simple scan. Clearly you've found some kind 
of issue. The next step is to investigate that issue, which is a bit hard to do 
over e-mail.


Please file a JIRA ticket to describe the issue and provide a reproducible test 
case including query and data. If your data is sensitive, please create a dummy 
data set, or use the provided TPC-H data set to recreate the issue. We can then 
take a look to see what might be happening.

Thanks,
- Paul

 

On Thursday, April 23, 2020, 10:18:13 AM PDT, sreeparna bhabani 
 wrote:  
 
 Hi Team,
In addition to the below mail I have another finding. Please consider below 
scenarios. The first 2 scenarios are giving expected results in terms of 
performance. But we are not getting expected performance for 3rd scenario which 
is UNION ALL with 2 different types of datasets.

Scenario 1 - Parquet UNION ALL Parquet
Individual execution time of 1st query - 5 secs
Individual execution time of 2nd query - 5 secs
UNION ALL of both queries execution time - 10 secs

Scenario 2 - DB query UNION ALL DB query
Individual execution time of 1st query - 5 secs
Individual execution time of 2nd query - 5 secs
UNION ALL of both queries execution time - 10 secs

Scenario 3 - Parquet UNION ALL DB query
Individual execution time of 1st query - 5 secs
Individual execution time of 2nd query - 1 sec
UNION ALL execution time - 20 secs
Ideally the execution time should not be more than 6 secs.

May I request you to check whether the UNION ALL performance of 3rd scenario is 
expected with different dataset types.
Please suggest if there is any specific way to bring down the execution time of 
3rd scenario.
Thanks in advance.
Sreeparna Bhabani


On Thu, 23 Apr 2020, 12:18 sreeparna bhabani,  
wrote:

Hi Team,
Apart from the below issue I have another question.
Is there any relation between number of row groups and performance ?
In the below query the number of files is 13 and numRowGroups is 69. Does the 
UNION ALL take more time if the number of row groups is high like that?
Please note that the individual Parquet query takes 6 secs. But UNION ALL takes 
20 secs. Details are given in trail mail.
Thanks, Sreeparna Bhabani

On Thu, 23 Apr 2020, 11:08 sreeparna bhabani,  wrote:

Hi Paul,
Please find the details below. We are using 2 drillbits. Heap memory 16 G, Max 
direct memory 32 G. One query selects from Parquet. Another one selects from 
JDBC. The parquet file size is 849 MB. It is UNION ALL. There is no sorting.

Single parquet query - Total execution time: 6.6 sec, Scan time: 0.152 sec, 
Screen wait time: 5.3 sec
Single JDBC query - Total execution time: 0.261 sec, JDBC scan: 0.152 sec, 
Screen wait: 0.004 sec
Union all query - Execution time: 21.118 sec, Screen wait time: 5.351 sec, 
Parquet scan: 15.368 sec, Unordered receiver wait time: 14.41 sec

Thanks, Sreeparna Bhabani

On Thu, 23 Apr 2020, 10:43 Paul Rogers,  wrote:

Hi Sreeparna,

The short answer is it *should* work: a UNION ALL is simply an append. (Be sure 
you are not using a plain UNION as that needs to do more work to remove 
duplicates.)

Since you are seeing unexpected behavior, we may have some kind of issue to 
investigate and perhaps fix. Always hard to do over e-mail, but let's see what 
we can do.


The first question is to understand the full query: are you doing more than a 
simple scan of two files and a UNION ALL? Are there sorts or joins involved?

The best place to start to investigate performance issues is the query profile, 
which it looks like you are doing. What is the time for the scans if you run 
each of the two scans separately? You said that they take 8 and 1 seconds. Is 
that for the whole query or just the scan operators?

Then, when you run the UNION ALL, again looking at the scan operators, is there 
any difference in run times? If the scans take longer, that is one thing to 
investigate. If the scans take the same amount of time, what other operator(s) 
are taking the rest of the time? Your note suggests that it is the scan taking 
the time. But, there should be two scan operators: one for each file. How is 
the time divided between them?


How large are the data files? Using what storage system? How many Drillbits? 
How much memory?


Thanks,
- Paul

 

On Wednesday, April 22, 2020, 11:32:24 AM PDT, sreeparna bhabani 
 wrote:  
 
 Hi Team,

I reach out to you for a specific problem regarding UNION ALL. There is one
UNION ALL statement which combines 2 queries. The individual queries are
taking 8 secs and 1 sec respectively. But UNION ALL takes 30 secs.
PARQUET_SCAN_ROW_GROUP takes the maximum time. Apache drill version is 1.17.

Please help to suggest how to improve this UNION ALL performance. We are
using parquet file.

Thanks,
Sreeparna Bhabani
  


  

Re: Suggestion needed for UNION ALL performance in Apache drill

2020-04-22 Thread Paul Rogers
Hi Sreeparna,

The short answer is it *should* work: a UNION ALL is simply an append. (Be sure 
you are not using a plain UNION as that needs to do more work to remove 
duplicates.)

Since you are seeing unexpected behavior, we may have some kind of issue to 
investigate and perhaps fix. Always hard to do over e-mail, but let's see what 
we can do.


The first question is to understand the full query: are you doing more than a 
simple scan of two files and a UNION ALL? Are there sorts or joins involved?

The best place to start to investigate performance issues is the query profile, 
which it looks like you are doing. What is the time for the scans if you run 
each of the two scans separately? You said that they take 8 and 1 seconds. Is 
that for the whole query or just the scan operators?

Then, when you run the UNION ALL, again looking at the scan operators, is there 
any difference in run times? If the scans take longer, that is one thing to 
investigate. If the scans take the same amount of time, what other operator(s) 
are taking the rest of the time? Your note suggests that it is the scan taking 
the time. But, there should be two scan operators: one for each file. How is 
the time divided between them?


How large are the data files? Using what storage system? How many Drillbits? 
How much memory?


Thanks,
- Paul

 

On Wednesday, April 22, 2020, 11:32:24 AM PDT, sreeparna bhabani 
 wrote:  
 
 Hi Team,

I reach out to you for a specific problem regarding UNION ALL. There is one
UNION ALL statement which combines 2 queries. The individual queries are
taking 8 secs and 1 sec respectively. But UNION ALL takes 30 secs.
PARQUET_SCAN_ROW_GROUP takes the maximum time. Apache drill version is 1.17.

Please help to suggest how to improve this UNION ALL performance. We are
using parquet file.

Thanks,
Sreeparna Bhabani
  

Re: Important Message about Bay Area Apache Drill User Group

2020-04-20 Thread Paul Rogers
Thanks to Aman for previously hosting the group. We had some excellent meetups.

I do happen to live in the Bay Area and can offer to become the organizer to 
keep things going.


Thanks,
- Paul

 

On Monday, April 20, 2020, 11:29:36 AM PDT, Charles Givre 
 wrote:  
 
 If I lived in the Bay Area, I'd do it.  Can it be a virtual group?
-- C

> On Apr 20, 2020, at 2:27 PM, Ted Dunning  wrote:
> 
> Does anybody want to change this?
> 
> -- Forwarded message -
> From: Meetup 
> Date: Mon, Apr 20, 2020 at 8:43 AM
> Subject: Important Message about Bay Area Apache Drill User Group
> To: 
> 
> 
> [image: Meetup]
> 
> Your Meetup Group will shut down soon!
> 
> *Members of Bay Area Apache Drill User Group
> ,*
> 
> Your Organizer, Aman Sinha, just stepped down without nominating a
> replacement.
> 
> Without an Organizer, Bay Area Apache Drill User Group
> 
> will shut down on April 30, 2020.
> 
> *Step up to become this Meetup Group's Organizer* and you can guide its
> future direction!
> 
> Other members can help you. Ask them to suggest Meetups or even nominate a
> few to help as Assistant Organizers.
> KEEP THIS GROUP GOING
> 
> Bay Area Apache Drill User Group
> 
> *Members:*
> 585 Drillers
> 
> 
> 

Re: [NOTICE] Maven 3.6.3

2020-04-17 Thread Paul Rogers
Hi Arina,

Thanks for keeping us up to date!

As it turns out, I use Ubuntu (Linux Mint) for development. Maven is installed 
as a package using apt-get. Packages can lag behind a bit. The latest maven 
available via apt-get is 3.6.0.

It is a nuisance to install a new version outside the package manager. I 
changed the Maven version in the root pom.xml to 3.6.0 and the build seemed to 
work. Any reason we need the absolute latest version rather than just 3.6.0 or 
later?

The workaround for now is to manually edit the pom.xml file on each checkout, 
then revert the change before commit. Can we maybe adjust the "official" 
version instead?


Thanks,
- Paul

 

On Friday, April 17, 2020, 5:09:49 AM PDT, Arina Ielchiieva 
 wrote:  
 
 Hi all,

Starting from Drill 1.18.0 (and current master from commit 20ad3c9 [1]), Drill 
build will require Maven 3.6.3, otherwise build will fail.
Please make sure you have Maven 3.6.3 installed on your environments. 

[1] 
https://github.com/apache/drill/commit/20ad3c9837e9ada149c246fc7a4ac1fe02de6fe8

Kind regards,
Arina  

Re: EXTERNAL: Re: Apache Drill Sizing guide

2020-04-17 Thread Paul Rogers
Hi Navin,

One more factor for you to consider. The straw-man analysis we just did was for 
a file format such as CSV in which Drill must read all data within each HDFS 
block. You said you are using Parquet. One of the great features of Parquet is 
that Drill reads only the columns needed for your query. This makes the 
analysis a bit more interesting.

First, how much data will Drill actually read? You mentioned reading 10-15 of 
150 columns. If columns are of uniform size, that might mean reading only 10% 
of each block. The best approach is to actually measure the amount of disk I/O. 
In a previous life I used the MapR file system which provided a wealth of such 
information. Perhaps your system does also. For now, let's assume 10%; you can 
replace this with the actual ratio once you measure it.

We said that Drill will split the 1 GB file into four 256 MB blocks and will 
need 4 fragments (cores) to read them. We've just said we'd read 10% of that 
data or about 25 MB. You'll measure query run time for just scan, let's say it 
takes 1 second. (Parquet decoding is CPU intensive.) This means each query 
reads 4 * 25 MB = 100 MB in a second. Since your disk system can supply 500 
MB/s, you can run 5 concurrent queries. More if the data is cached.


We then add the full query cost as before. We made up a ratio of 2x, so each 
query takes 1 sec for scan, 2 sec to complete on 4 cores for scan plus 4 cores 
for compute. This means we can run 5 queries every 2 seconds. Your 30 queries 
would complete in 30 / 5 * 2 = 12 seconds, well within your 30-second SLA.

Now you have a choice. You can provision the full 8 * 5 = 40 cores needed to 
saturate your file system. Or, you can provision fewer, maybe run 2 concurrent 
queries, so 16 cores, with all 30 queries completing in 30 / 2 * 2 = 30 
seconds. In this case, you would enable query throttling to avoid overloads.

I hope this gives you a general sense for the approach: have a model, measure 
actual performance, get a ball-park estimate and test to see what additional 
factors crop up in your actual setup.

Thanks,
- Paul

 

On Thursday, April 16, 2020, 9:42:01 PM PDT, Navin Bhawsar 
 wrote:  
 
 Thanks Paul.. I will follow suggested approach next. Point noted on Rest API, 
do you have suggestion what interface should be best for larger set odbc or 
jdbc or any other reporting tool which gives better performance with drill. Our 
reports are mainly tabular format or pivot. jdbc we had to drop as UI client is 
.net

Thanks, Navin

On Fri, 17 Apr 2020, 07:35 Paul Rogers,  wrote:

Hi Navin,

Thanks for the additional info. Let's take it step by step. I'll walk you 
through the kind of exercise you'll need to perform, using made-up numbers to 
make the exercise concrete. Running the same analysis with your results will 
give you a ball-park estimate of expected performance.

As we'll see, you may end up being limited more by disk I/O than anything else.


First, let's characterize the read performance. We can do this by limiting the 
query run to a single node (easiest if you have a single-node cluster 
available) and a single thread of execution:

ALTER SESSION SET `planner.width.max_per_node` = 1

Now, take a typical query, say the 1 GB scan. Modify the query to keep all the 
column references in the SELECT clause (the 15 columns you mentioned) but 
remove all other expressions, calculations, GROUP BY, etc. That is:

SELECT col1, col2, ... col15
FROM yourfile

Then, add only the partitioning expression to the WHERE clause to limit the scan 
to 
the 1GB of data you expect. Also add a "select nothing" expression on one of 
the columns:

WHERE dir0 = ... AND dir1 = ... AND col1 = "bogus"

This query forces Drill to read the full data amount, but immediately throws 
away the data so we can time just the scan portion of the query.
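
Putting those steps together, the probe query might look like the sketch below 
(listing only three of the 15 columns; the path, partition values, and column 
names are placeholders for your own):

ALTER SESSION SET `planner.width.max_per_node` = 1;

SELECT col1, col2, col3
FROM dfs.`/data/mytable`
WHERE dir0 = '2020' AND dir1 = '04' AND col1 = 'bogus';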

Run this query on a single node "cluster". Use top or another command to check 
CPU seconds used by Drill before and after the query. Look at the query profile 
to determine query run time. The difference between CPU and wall clock time 
tells us how much time was spent waiting for things. (You can also look at the 
scan timings in the query profile to get a better estimate than overall query 
run time.) 


This tells us the maximum scan throughput for one Drill fragment on one of your 
CPUs. Best to do the exercise a few times and average the results since your 
file system will read cached data in the second and subsequent runs.


OK, so suppose it takes 10 seconds to scan 1 GB of data. The disk can do 500 
MB/s, so we estimate the Drill throughput as 1 GB / 10 sec = 100 MB/s. Your 
numbers will, of course, be different.

Now we can work out the benefits of parallelism. Parquet typically uses 256 MB 
or 512 MB blocks. This limits the benefit of parallelism on a 1 GB file. So, is 
the 1 GB the size of the scanned files? Or, are you scanning 1 GB from, say, a 
set of files totaling, say, 10 GB? In eith

Re: EXTERNAL: Re: Apache Drill Sizing guide

2020-04-16 Thread Paul Rogers
Hi Navin,

For .NET (Windows) the best solution is ODBC. MapR provides a driver created by 
Simba [1]. But, it looks like it hasn't been updated since Drill 1.15, which is 
a pity. Calcite Avatica has long promised a framework to create an open source 
version, but looks like that effort stalled.

You can try the 1.15 driver with a newer Drill. We have tried to avoid changing 
the 
wire protocol so it should work. If not, please file a bug so we can fix the 
issue.

What are you using on the client side to connect your .net client to the REST 
API? Is this a home-grown tool or third-party? I wonder if there is an 
opportunity there to create a version of the REST API that works better in that 
use case.

Oh, one other limitation of the REST API: session options (of the kind I 
suggested) don't work because there is no session. A contributor just added an 
enhancement to the REST API with a workaround, but your client is likely not 
using that feature (since it was completed only a week ago.)


Thanks,
- Paul

[1] https://mapr.com/docs/61/Drill/drill_odbc_connector.html
 

On Thursday, April 16, 2020, 9:52:19 PM PDT, Navin Bhawsar 
 wrote:  
 
 Thanks Paul.. I will follow suggested approach next.
Point noted on Rest API, do you have suggestion what interface should be
best for larger set odbc or jdbc or any other reporting tool which gives
better performance with drill. Our reports are mainly tabular format or
pivot.
jdbc we had to drop as UI client is .net


Thanks,
Navin


On Fri, 17 Apr 2020, 07:35 Paul Rogers,  wrote:

> Hi Navin,
>
>
> Thanks for the additional info. Let's take it step by step. I'll walk you
> through the kind of exercise you'll need to perform, using made-up numbers
> to make the exercise concrete. Running the same analysis with your results
> will give you a ball-park estimate of expected performance.
>
>
> As we'll see, you may end up being limited more by disk I/O than anything
> else.
>
>
> First, let's characterize the read performance. We can do this by limiting
> the query run to a single node (easiest if you have a single-node cluster
> available) and a single thread of execution:
>
>
> ALTER SESSION SET `planner.width.max_per_node` = 1
>
>
> Now, take a typical query, say the 1 GB scan. Modify the query to keep all
> the column references in the SELECT clause (the 15 columns you mentioned)
> but remove all other expressions, calculations, GROUP BY, etc. That is:
>
>
> SELECT col1, col2, ... col15
>
> FROM yourfile
>
>
> Then, add only the partitioning expression to WHERE clause to limit the
> scan to the 1GB of data you expect. Also add a "select nothing" expression
> on one of the columns:
>
>
> WHERE dir0 = ... AND dir1 = ... AND col1 = "bogus"
>
>
> This query forces Drill to read the full data amount, but immediately
> throws away the data so we can time just the scan portion of the query.
>
>
> Run this query on a single node "cluster". Use top or another command to
> check CPU seconds used by Drill before and after the query. Look at the
> query profile to determine query run time. The difference between CPU and
> wall clock time tells us how much time was spent waiting for things. (You
> an also look at the scan timings in the query profile to get a better
> estimate than overall query run time.)
>
>
> This tells us the maximum scan throughput for one Drill fragment on one of
> your CPUs. Best to do the exercise a few times and average the results
> since your file system will read cached data in the second and subsequent
> runs.
>
>
> OK, so suppose it takes 10 seconds to scan 1 GB of data. The disk can do
> 500 MB/s so the estimate the Drill throughput as 1 GB / 10 sec = 100 MB/s.
> Your numbers will, of course, be different.
>
>
> Now we can work out the benefits of parallelism. Parquet typically uses
> 256 MB or 512 MB blocks. This limits the benefit of parallelism on a 1 GB
> file. So, is the 1 GB the size of the scanned files? Or, are you scanning 1
> GB from, say, a set of files totaling, say, 10 GB? In either case, the best
> Drill can do is parallelize down to the block level, which will be 2 or 4
> threads (depending on block size) for a single 1 GB file. You can work out
> the real numbers based on your actual block size and file count.
>
>
> Suppose we can get a parallelism of 4 on our made-up 10 sec scan. The
> ideal result would be four fragments which each take 2.5 secs. We'd like to
> multiply by 30 to get totals. But, here is where things get non-linear.
>
>
> A single scan reads 1 GB / 2.5 sec = 400 MB/s, which is close to your
> uncached read rate. So, you get no real benefit from trying to run 30 of
> these queries in parallel, you can maybe do 1.25 (given these made-up
> numbers.) S

Re: EXTERNAL: Re: Apache Drill Sizing guide

2020-04-16 Thread Paul Rogers
pared to the scan-only query. This tells us you need 2x the number of 
CPUs as we computed above: rather than 4 per query, maybe 8 per user. (Again, 
your numbers will certainly be different.) Since we are CPU limited, if we 
needed, say, 6 cores to saturate the disk, we need 12 to both saturate the disk 
and do the needed extra processing. (Again, your numbers will be different.)


This covers your "big" queries. The same analysis can be done for the "small" 
queries and a weighted total computed.

We've not talked about memory. Scans need minimal memory (except for Parquet 
which has a bunch of buffers and worker threads; check the top command and the 
query profile to see what yours needs.)

The rest of the query will require memory if you do joins, aggregations and 
sorts. Look at the query profile for the full run. Multiply the memory total by 
30 for your 30 concurrent users. Divide by your node count. That is the minimum 
memory you need per node, though you should have, say, 2x to provide sufficient 
safety margin. On the other hand, if the queries run sequentially (because of 
disk saturation), then you only need memory for the number of actively running 
queries.


All this could be put in a spreadsheet. (Maybe someone can create such a 
spreadsheet and attach it to a JIRA ticket so we can post it to the web site.)

Also, the above makes all this look scientific. There are, however, many factors 
we've not discussed. Is Drill the only user of the file system? How much 
variation do you get in load? There are other factors not accounted for. Thus, 
the above will give you a ball-park estimate, not a precise sizing. Caveat 
emptor and all that.


This is the approach I've used for a couple of systems. If anyone has a better 
(i.e. simpler, more accurate) approach, please share!


Finally, a comment about the REST API. It is a wonderful tool to power the 
Drill Web console. It is helpful for small-ish result sets (1000 rows or 
fewer.) It is not really designed for large result sets and you may run into 
performance or memory issues for large result sets. This is certainly something 
we should fix, but it is what it is for now. So, keep an eye on that as well.


Thanks,
- Paul

 

On Thursday, April 16, 2020, 9:16:38 AM PDT, Navin Bhawsar 
 wrote:  
 
 
Hi Paul,

Thanks for your response.




I have tried to add more details as advised :

Query Mix and selectivity

Query mix will be max 30 concurrent users running adhoc reporting queries via 
Drill Rest API called from ASP .Net Core(httpclient).

Query mix is combination of below query load running on server 

1.   queries (5-10) aggregating data over (1 GB or 1-3M records)

2.   Majority of queries aggregating data 100k records (15-25)

Most of the queries are using simple filter clause and few using group by on 
10-15 columns out of 150 columns in  Parquet File.

Performance expectation is these queries should be available in seconds (<= 30 
secs)




Partitioning - Data is already partitioned on date and business level with 
lower level include parquet files (200-300 MB,100 K records)

 

Storage -

VMDK(VMware Disk) with 1 TB Size

cached reads -  8000 MB/sec

buffered disk reads - 500 MB/sec

Drill queries parquet files on hdfs

 

Deployment - HDFS on-prem are hosted on Internal Cloud Platform (IaaS); 
spinning up a new env will be quick.

Thanks, Navin


From: Paul Rogers 
Sent: Tuesday, April 14, 2020 12:41 AM
To: user 
Cc: arun...@gmail.com;  
Subject: EXTERNAL: Re: Apache Drill Sizing guide

 

Hi Navin,

 

 

Ted is absolutely right. To add a bit of context, here are some of the factors 
we've considered in the past.

 

 

Queries: A simple filter scan takes the minimum resources: scan the tables, 
throw away most of the data, and deliver the rows that are needed. Such a use 
case is strongly driven by scan time. As Ted suggests, partitioning drives down 
scan cost. If every query hits the full TB of data, you will need many machines 
& disks to get adequate performance. Depending on your hardware, if you get 100 
MB/s read performance per disk, it will take 10,000 seconds (three hours) to 
read your TB of data on one disk. If you have 100 disks, the time drops to 100 
seconds. You didn't mention your storage technology: these numbers are likely 
entirely different for something like S3.

 

 

So, you don't want to read the full TB. By using good partitioning (typically 
by date), you might reduce the scan by a factor of 1000. Huge win. And this is 
true whether you use Drill, Spark, Presto or Python to read your data.

 

 

The next question is the selectivity of your queries. In the simple filter 
case, are you returning a few rows or a GB of rows? The more rows, the more 
Drill must grind through the data once it is read. This internal grinding 
requires CPU and benefits from parallelism. The amount you need depends on the 
number of rows processed per query.

 

 

There is little memory need

Re: Drill large data build up in fragment by using join

2020-04-15 Thread Paul Rogers
Hi Shashank,

Let me make sure I understand the question. You have two large JSON data files? 
You are on a distributed Drill cluster. You want to know why you are seeing a 
billion rows in one fragment rather than the work being distributed across 
multiple fragments? Is this an accurate summary?

The key thing to know is that Drill (and most Hadoop-based systems) rely on 
files to be "block-splittable". That is, if your file is 1 GB in size, Drill 
needs to be able to read, say, blocks of 256 MB from the file so that we can 
have four Drill fragments read that single 1 GB file. This is true even if you 
store the files in S3.


CSV, Parquet, Sequence File and others are block splittable. As it turns out, 
JSON is not. The reason is simple: there is no way to jump into a typical JSON 
file and scan for the start of the next record. With CSV, newlines are record 
separators. Parquet has row groups. With JSON, there may or may not be newlines 
between records, and there may or may not be newlines within records.

It turns out that there is an emerging standard called jsonlines [1] which 
requires that there be newlines between, but not within, JSON records. Using 
jsonlines would make JSON into a block-splittable format. Drill does not yet 
support this specialized JSON format, but doing so would be good enhancement 
for data files that adhere to the jsonlines format. Is your data in jsonlines 
format?


For now, the solution is simple: rather than storing your data in a single 
large JSON file, simply split the data into multiple small files within a 
single directory. Drill will read each file in a separate fragment, giving you 
the parallelism you want. Make each file on the order of 100MB, say. The key is 
to ensure that you have at least as many files as you have minor fragments. The 
number of minor fragments will be 70% of your CPU count per node. If you have 
10 CPUs, say, Drill will create 7 fragments per node. Then, multiply this by 
the number of nodes. If you have 4 nodes, say, you'll have 28 minor fragments 
total. You want to have at least 28 JSON files so you can keep each fragment 
busy.
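
Once the data is split, Drill can query the whole directory as a single table; 
a minimal sketch, assuming a made-up path and field names:

SELECT t.id, t.name
FROM dfs.`/data/events_json` t  -- directory containing the smaller JSON files
LIMIT 10;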

If your code generates the JSON, then you can change the code to split the data 
into smaller files. If you obtain the JSON from somewhere else, then your 
options may be more limited.

Will any of this help resolve your issue?


Thanks,
- Paul

 
[1] http://jsonlines.org/

On Wednesday, April 15, 2020, 12:32:35 PM PDT, Shashank Sharma 
 wrote:  
 
 Hi folks,

I have a two large big json data set and querying on distributed apache
drill system, can anyone explain why it is  making or build billion of
records to scan in fragment when join between two big records by hash join
as well as merge join with only 60,000 record data set through s3 bucket
file distributed system?

-- 

[image: https://jungleworks.com/] 

Shashank Sharma

Software Engineer

Phone: +91 8968101068

 

  

Re: Apache Drill Sizing guide

2020-04-13 Thread Paul Rogers
Hi Navin,

Ted is absolutely right. To add a bit of context, here are some of the factors 
we've considered in the past.

Queries: A simple filter scan takes the minimum resources: scan the tables, 
throw away most of the data, and deliver the rows that are needed. Such a use 
case is strongly driven by scan time. As Ted suggests, partitioning drives down 
scan cost. If every query hits the full TB of data, you will need many machines 
& disks to get adequate performance. Depending on your hardware, if you get 100 
MB/s read performance per disk, it will take 10,000 seconds (three hours) to 
read your TB of data on one disk. If you have 100 disks, the time drops to 100 
seconds. You didn't mention your storage technology: these numbers are likely 
entirely different for something like S3.


So, you don't want to read the full TB. By using good partitioning (typically 
by date), you might reduce the scan by a factor of 1000. Huge win. And this is 
true whether you use Drill, Spark, Presto or Python to read your data.
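
If the data is not yet partitioned, Drill itself can rewrite it that way; a 
rough sketch, with made-up paths and columns (PARTITION BY writes Parquet 
output split by the named column):

CREATE TABLE dfs.tmp.`sales_partitioned`
PARTITION BY (sale_date)
AS SELECT sale_date, store_id, amount
FROM dfs.`/raw/sales`;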

The next question is the selectivity of your queries. In the simple filter 
case, are you returning a few rows or a GB of rows? The more rows, the more 
Drill must grind through the data once it is read. This internal grinding 
requires CPU and benefits from parallelism. The amount you need depends on the 
number of rows processed per query.

There is little memory needed for a pure filter query. Drill reads the data, 
tosses most rows, a returns the remainder to the client. Interesting queries, 
however, do more than filtering: they might group, join, sort and so on. Each 
of these operations carries its own cost. Joins are network heavy (to shuffle 
data). Sorts want enough memory to buffer the entire result set to avoid slow 
disk-based sorts.

The query profile will provide lots of good information about the row count, 
memory usage and operators in each of your queries so you can determine the 
resources needed for each. When Ted asks you to analyze each query, the best 
way to do that is to look at the query profile and see which resources were 
needed by that query.


Then, there are concurrent users. What do you mean by concurrent? 40 people who 
might use Drill during the day so that only a few are active at the same time? 
Or, 40 users each watching dashboard that each run 10 queries, updated each 
second, which will place a huge load on the system? Most humans are 
intermittent users. Dashboards, when overdone, can kill any system.

Also, as Ted has said many times, if you run 40 queries a minute, and each 
takes 1 second, then concurrency turns into sequential processing. On the other 
hand, if one query uses all cluster resources for an hour, and you run 10 of 
them per hour, then the workload will fail.

Once you determine the actual "concurrent concurrency" level (number of queries 
that run at the same time), work out the mix. Sum the resources for those 
concurrent queries. That tells you the cluster capacity you need (plus some 
safety margin because load is random.) Drill does have features to smooth out 
the load peaks by queuing queries. Not state-of-the-art, but can prevent the 
inevitable overloads that occur at random peak loads when there is not 
sufficient reserve capacity.

You didn't mention your deployment model. In classic Hadoop days, with an 
on-prem cluster, you had to work all this out ahead of time so you could plan 
your equipment purchases 3 to 6 months in advance. In the cloud, however, 
especially with K8s, you just resize the cluster based on demand. Drill is not 
quite there yet with our K8s integration, but the team is making good progress 
and we should have a solution soon; contributions/feedback would be very 
helpful.


In short, there are many factors, some rather complex. (We all know it should 
be simple, but having done this with many DBs, it just turns out that it never 
is.)

We'd be happy to offer pointers if you can offer a few more specifics. Also, 
perhaps we can distill this discussion into a few pages in the Drill docs.


Thanks,
- Paul

 

On Monday, April 13, 2020, 7:59:08 AM PDT, Ted Dunning 
 wrote:  
 
 Navin,

Your specification of 40 concurrent users and data size are only a bit less
than half the story. Without the rest of the story, nobody will be able to
give you even general guidance beyond a useless estimate that it will take
between roughly 1 and 40 drillbits with with a gob of memory.

To do better than such non-specific "guidance", you need to add some
additional answers. For example,

What is the query mix?
How long do these queries run without any question of concurrency?
Could that query speed be enhanced with better partitioning?
How are you storing your data?
What promises are you making to these concurrent users?



On Mon, Apr 13, 2020 at 7:21 AM Navin Bhawsar 
wrote:

> Hi Team ,
>
> We are planning to use drill to query hdfs cluster with about a terabyte
> data in parquet file format .There will be approx. 

Re: Querying encrypted JSON file

2020-04-12 Thread Paul Rogers
Hi Prabhakar,

Looking at the Drill code, the existing compression support (via "codecs") is 
in the FileSystemPlugin class, [1]. Looks like Drill uses the compression codec 
feature of Hadoop [2] based on a CompressionCodec class [3].

This means that you just need to use standard Hadoop mechanisms to define a 
custom codec. [4].

If you are storing JSON, it might be worthwhile combining compression and 
encryption together, since JSON files tend to be large (especially if the JSON 
is indented.) Perhaps one of the existing Hadoop codecs (see [2]) might do the 
job for you.

Here it might be worth pointing out that you'll need a file system to store the 
files. If your use case is small enough that your files fit on a single 
machine, you can use a single Drillbit to query local files. If the set of 
files is large, then one node will not provide adequate performance so you'll 
need a Drill cluster. For that, you'll need a distributed file system: HDFS, 
MapR-FS, S3 or whatever.


Note also that JSON is a convenient, but inefficient, format. If you have to 
encrypt files, we already suggested compressing them as well. However, JSON 
files are not block-splittable: if you have a big JSON file, it must be read in 
a single thread. (Not as much of a problem if you instead have many smaller 
files.) A format such as Parquet is better suited for queries. So, if you must 
convert your file to encrypt it, consider converting the files to Parquet to 
get better query performance. Drill can even do the conversion for you with the 
CREATE TABLE AS (CTAS) command.
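
A rough sketch of that conversion (the paths and workspace are made up; dfs.tmp 
must point at a writable location):

ALTER SESSION SET `store.format` = 'parquet';

CREATE TABLE dfs.tmp.`archive_parquet` AS
SELECT * FROM dfs.`/archive/data.json`;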


Thanks,
- Paul


[1] 
https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/dfs/FileSystemPlugin.java#L141

[2] https://netjs.blogspot.com/2018/04/data-compression-in-hadoop.html

[3] 
https://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/compress/CompressionCodec.html
 

[4] 
https://stackoverflow.com/questions/37608227/adding-custom-code-to-hadoop-spark-compression-codec

On Sunday, April 12, 2020, 12:03:13 AM PDT, Prabhakar Bhosaale 
 wrote:  
 
 Hi Paul,
Thanks for details. As of now I have not finalized on any encryption
technique as first I wanted to understand drill capabilities on encryption
and decryption.
To give you more details on my requirement: I will be archiving data in
JSON format from a database. And that archived data will be accessed using
drill for reporting purpose. I am already zipping up JSON files using gzip.
But for security reasons I need to encrypt the files also. Thx

Regards
Prabhakar



On Sun, Apr 12, 2020, 11:38 Paul Rogers  wrote:

> Hi Prabhakar,
>
> Depending on how you perform encryption, you may be able to treat it
> similar to compression. Drill handles compression (zip, gzip, etc.) via an
> extra layer of functionality on top of any format plugin. That means,
> rather than writing a new JSON file reader, you write a new compression
> plugin (which will actually do decryption). I have not added one of these,
> but I'll poke around to see if I can find some pointers.
>
> On the other hand, if encryption is part of the access protocol (such as
> S3), then you can configure it via the S3 client.
>
> Can you describe a bit more how you encrypt your files and what is needed
> to decrypt?
>
>
> Thanks,
> - Paul
>
>
>
>    On Saturday, April 11, 2020, 10:39:15 PM PDT, Prabhakar Bhosaale <
> bhosale@gmail.com> wrote:
>
>  Hi Ted,
> Thanks for your reply. Could you please give some more details on how to
> write to create file format, how to use it. Any pointers will be
> appreciated. Thx
>
> Regards
> Prabhakar
>
> On Sun, Apr 12, 2020, 00:19 Ted Dunning  wrote:
>
> > Yes.
> >
> > You need to write a special file format for that, though.
> >
> >
> > On Sat, Apr 11, 2020 at 6:58 AM Prabhakar Bhosaale <
> bhosale@gmail.com>
> > wrote:
> >
> > > Hi All,
> > > I have a  encrypted JSON file. is there any way in drill to query the
> > > encrypted JSON file? Thanks
> > >
> > > Regards
> > > Prabhakar
> > >
> >
>
  

Re: Querying encrypted JSON file

2020-04-12 Thread Paul Rogers
Hi Prabhakar,

Depending on how you perform encryption, you may be able to treat it similar to 
compression. Drill handles compression (zip, gzip, etc.) via an extra layer of 
functionality on top of any format plugin. That means, rather than writing a 
new JSON file reader, you write a new compression plugin (which will actually 
do decryption). I have not added one of these, but I'll poke around to see if I 
can find some pointers.

On the other hand, if encryption is part of the access protocol (such as S3), 
then you can configure it via the S3 client.

Can you describe a bit more how you encrypt your files and what is needed to 
decrypt?


Thanks,
- Paul

 

On Saturday, April 11, 2020, 10:39:15 PM PDT, Prabhakar Bhosaale 
 wrote:  
 
 Hi Ted,
Thanks for your reply. Could you please give some more details on how to
write to create file format, how to use it. Any pointers will be
appreciated. Thx

Regards
Prabhakar

On Sun, Apr 12, 2020, 00:19 Ted Dunning  wrote:

> Yes.
>
> You need to write a special file format for that, though.
>
>
> On Sat, Apr 11, 2020 at 6:58 AM Prabhakar Bhosaale 
> wrote:
>
> > Hi All,
> > I have a  encrypted JSON file. is there any way in drill to query the
> > encrypted JSON file? Thanks
> >
> > Regards
> > Prabhakar
> >
>
  

Re: java version for Drill JDBC driver

2020-04-09 Thread Paul Rogers
Nice sleuthing!


Thanks,
- Paul

 

On Thursday, April 9, 2020, 1:07:48 PM PDT, Jaimes, Rafael - 0993 - MITLL 
 wrote:  
 
 One of my coworkers looked at the pom.xml in /exec/jdbc and noticed there was 
a version of javax.validation being pulled in that is about 7 years old 
(1.1.0.Final). Replacing it with version 2.0.1.Final and rebuilding the JDBC 
driver jar solved the problem.

-Original Message-
From: Paul Rogers  
Sent: Thursday, April 9, 2020 3:31 PM
To: user@drill.apache.org
Subject: Re: java version for Drill JDBC driver

Hi Rafael,

Drill's Git-based tests run against all Java versions from 8 to 14. Our biggest 
challenge is Guava: Drill has many dependencies and some use different (and 
incompatible) Guava versions. There is a "patcher" to edit the code at runtime 
to fix the issue.

Presto is nice in that it will load your connector using a dedicated class 
loader so that Drill's many dependencies should not conflict with Presto's 
dependencies. (We are slowly working on something similar for Drill.)


Your specific error is mysterious. That "getClockProviderClassName()" looks 
like Java's SPI system is trying to find a "clock provider" and failing. I've 
not seen anything like that in Drill.

I wonder if Drill's overly large set of JDBC dependencies is somehow 
conflicting with those in Presto?

Thanks,
- Paul

 

    On Thursday, April 9, 2020, 8:55:37 AM PDT, Bob Rudis  wrote:  
 
 I use the JDBC driver via an RJDBC wrapper I wrote and the rJava it runs in is 
built with JDK 11, so it definitely is working in 11 for me.

> On Apr 9, 2020, at 11:38, Jaimes, Rafael - 0993 - MITLL 
>  wrote:
> 
> On the topic of java versions, I am trying to load the Drill JDBC driver in a 
> docker container running Presto and Java 11, I’m getting the following error:
>  
> ERROR main io.prestosql.server.PrestoServer 'java.lang.String 
> javax.validation.BootstrapConfiguration.getClockProviderClassName()' 
> java.lang.NoSuchMethodError: 'java.lang.String 
> javax.validation.BootstrapConfiguration.getClockProviderClassName()'
>  
> Some stackoverflow searching shows that others have resolved that error for 
> other projects by changing Java versions (7 to 8 for example). I normally run 
> Drill in a Java 8 environment, but what about the JDBC driver? Should it work 
> in Java 11 or is it 8 only?
>  
> My query Presto with Drill experiment has failed, so I am trying it the other 
> way around out of curiosity (query Drill with Presto).
    

Re: java version for Drill JDBC driver

2020-04-09 Thread Paul Rogers
Hi Rafael,

Drill's Git-based tests run against all Java versions from 8 to 14. Our biggest 
challenge is Guava: Drill has many dependencies and some use different (and 
incompatible) Guava versions. There is a "patcher" to edit the code at runtime 
to fix the issue.

Presto is nice in that it will load your connector using a dedicated class 
loader so that Drill's many dependencies should not conflict with Presto's 
dependencies. (We are slowly working on something similar for Drill.)


Your specific error is mysterious. That "getClockProviderClassName()" looks 
like Java's SPI system is trying to find a "clock provider" and failing. I've 
not seen anything like that in Drill.

I wonder if Drill's overly large set of JDBC dependencies is somehow 
conflicting with those in Presto?

Thanks,
- Paul

 

On Thursday, April 9, 2020, 8:55:37 AM PDT, Bob Rudis  wrote:  
 
 I use the JDBC driver via an RJDBC wrapper I wrote and the rJava it runs in is 
built with JDK 11, so it definitely is working in 11 for me.

> On Apr 9, 2020, at 11:38, Jaimes, Rafael - 0993 - MITLL 
>  wrote:
> 
> On the topic of java versions, I am trying to load the Drill JDBC driver in a 
> docker container running Presto and Java 11, I’m getting the following error:
>  
> ERROR main io.prestosql.server.PrestoServer 'java.lang.String 
> javax.validation.BootstrapConfiguration.getClockProviderClassName()' 
> java.lang.NoSuchMethodError: 'java.lang.String 
> javax.validation.BootstrapConfiguration.getClockProviderClassName()'
>  
> Some stackoverflow searching shows that others have resolved that error for 
> other projects by changing Java versions (7 to 8 for example). I normally run 
> Drill in a Java 8 environment, but what about the JDBC driver? Should it work 
> in Java 11 or is it 8 only?
>  
> My query Presto with Drill experiment has failed, so I am trying it the other 
> way around out of curiosity (query Drill with Presto).
  

Re: Drill embedded mode on Linux

2020-04-09 Thread Paul Rogers
Hi Prabhakar,

Rafael has pointed you in the right direction. Drill does code generation at 
run time and for that it needs the Java compiler which requires the JDK, not 
just the JRE.

I do development on Linux (Ubuntu-based Linux Mint) and was able to install the 
JDK. It's been awhile so I don't recall exactly what I did.

Which distro are you using? There are different ways to install the JDK 
depending on your distro. I find a Google search often reveals the correct path 
for each.

That you have to ask this question shows a hole in the Drill documentation. 
Please file a JIRA ticket to describe the problem. Once you find how to install 
the JDK on your distro, please add that to the ticket so we can update the docs.

Thanks,
- Paul

 

On Thursday, April 9, 2020, 7:27:58 AM PDT, Prabhakar Bhosaale 
 wrote:  
 
 Thanks Jaims, This helps. I have only openJDK. I will get the devel and
will update you.

Regards
Prabhakar

On Thu, Apr 9, 2020 at 7:53 PM Rafael Jaimes III 
wrote:

> Prab,
>
> I don't think screenshots work on the list. What distro are you using?
>
> On Red Hat, OpenJDK is a JRE but OpenJDK-devel has the JDK. It may be
> confusing.
>
> On Thu, Apr 9, 2020, 10:17 AM Prabhakar Bhosaale 
> wrote:
>
> > Hi All,
> >
> > Just to give you some additional information. I came across information
> on
> >
> http://www.openkb.info/2017/05/drill-errors-with-jdk-java-compiler-not.html
> >
> >
> > As per this article, my output of step 2 is not as expected.  But this
> > article does not mention what to do in this case.  thx
> >
> > Regards
> > Prabhakar
> >
> > On Thu, Apr 9, 2020 at 7:37 PM Prabhakar Bhosaale  >
> > wrote:
> >
> >> Hi James,
> >> thanks for quick reply.
> >> Below is Java version screenshot. As per documentation this is correct.
> >> [image: image.png]
> >>
> >> Below is screenshot of java path. this is also correct. But still same
> >> error
> >> [image: image.png]
> >>
> >> Regards
> >> Prabhakar
> >>
> >> On Thu, Apr 9, 2020 at 7:08 PM Jaimes, Rafael - 0993 - MITLL <
> >> rafael.jai...@ll.mit.edu> wrote:
> >>
> >>> The error tells you that it's not finding a Java 1.8 JDK. You can use
> >>> OpenJDK
> >>> 1.8 for the job.
> >>> I would check:
> >>> 1) your java version (both version # and whether it is a JDK, not a
> JRE)
> >>> 2) your java path env vars
> >>>
> >>> -Original Message-
> >>> From: Prabhakar Bhosaale 
> >>> Sent: Thursday, April 9, 2020 9:29 AM
> >>> To: user@drill.apache.org
> >>> Subject: Drill embedded mode on Linux
> >>>
> >>> Hi All,
> >>> I am using drill 1.16 and trying to start the drill in embedded mode on
> >>> linux
> >>> machine. Following the documentation from drill website.
> >>>
> >>> I am using  bin/drill-embedded command but it is giving following
> error.
> >>> Checked the java version and it is correct.  Please help urgently. thx
> >>>
> >>> Regards
> >>> Prabhakar
> >>>
> >>> Error: Failure in starting embedded Drillbit:
> >>> org.apache.drill.exec.exception.DrillbitStartupException: JDK Java
> >>> compiler
> >>> not available. Ensure Drill is running with the java executable from a
> >>> JDK and
> >>> not a JRE (state=,code=0)
> >>> java.sql.SQLException: Failure in starting embedded Drillbit:
> >>> org.apache.drill.exec.exception.DrillbitStartupException: JDK Java
> >>> compiler
> >>> not available. Ensure Drill is running with the java executable from a
> >>> JDK and
> >>> not a JRE
> >>>        at
> >>>
> >>>
> org.apache.drill.jdbc.impl.DrillConnectionImpl.(DrillConnectionImpl.java:143)
> >>>        at
> >>>
> >>>
> org.apache.drill.jdbc.impl.DrillJdbc41Factory.newDrillConnection(DrillJdbc41Factory.java:67)
> >>>        at
> >>>
> >>>
> org.apache.drill.jdbc.impl.DrillFactory.newConnection(DrillFactory.java:67)
> >>>        at
> >>>
> >>>
> org.apache.calcite.avatica.UnregisteredDriver.connect(UnregisteredDriver.java:138)
> >>>        at org.apache.drill.jdbc.Driver.connect(Driver.java:72)
> >>>        at
> >>> sqlline.DatabaseConnection.connect(DatabaseConnection.java:130)
> >>>        at
> >>> sqlline.DatabaseConnection.getConnection(DatabaseConnection.java:179)
> >>>        at sqlline.Commands.connect(Commands.java:1278)
> >>>        at sqlline.Commands.connect(Commands.java:1172)
> >>>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >>>        at
> >>>
> >>>
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> >>>        at
> >>>
> >>>
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> >>>        at java.lang.reflect.Method.invoke(Method.java:498)
> >>>        at
> >>>
> >>>
> sqlline.ReflectiveCommandHandler.execute(ReflectiveCommandHandler.java:38)
> >>>        at sqlline.SqlLine.dispatch(SqlLine.java:736)
> >>>        at sqlline.SqlLine.initArgs(SqlLine.java:428)
> >>>        at sqlline.SqlLine.begin(SqlLine.java:531)
> >>>        at sqlline.SqlLine.start(SqlLine.java:270)
> >>>        at sqlline.SqlLine.main(SqlLine.java:201)
> >>> 

Re: Apache Drill Support concurrent parallel Request

2020-04-08 Thread Paul Rogers
in their time series query engine.
There, the primary data source is a variant of Open TSDB and query costs
are dominated by the primary facts (the time series itself). Tuning the
optimizer to not think too much is a good thing.

So, could you say more about your workload so that the Drill community can
say more about what Drill will (or won't) do for you?



On Wed, Apr 8, 2020 at 12:02 PM Paul Rogers 
wrote:

> Hi Ramasamy,
>
> Let's define some terms. By "parallel requests" do you mean multiple
> people submitting queries at the same time? If so, then Drill handles this
> just fine: Drill is designed to run multiple queries from multiple users
> concurrently.
>
> There is a caveat. Many people run Drill in embedded mode when they get
> started. Embedded mode is a single user, single-machine setup that is great
> for testing Drill, exploring small data sets and so on. However, to support
> multiple concurrent queries, the proper way to run Drill is as a service,
> preferably across multiple machines. Further, if you are running a cluster
> of two or more machines, you need some kind of distributed file system: S3,
> Hadoop, etc.
>
>
> Once you start running concurrent queries, memory becomes an important
> consideration, especially if your JSON files are large and you are doing
> memory-intensive operations such as sorting and joins. The Drill
> documentation explains the correct configuration steps.
>
> Thanks,
> - Paul
>
>
>
>    On Wednesday, April 8, 2020, 11:00:14 AM PDT, Ramasamy Javakar <
> ramas...@ezeeinfosolutions.com> wrote:
>
>  Hi, I did an analytics web application on drill, data set in json file.
> We
> are facing issues while getting multiple parallel requests. Does Apache
> Drill support concurrent requests?. Please let me know
>
>
> Thanks & Regards
> Ramasamy
>
> Product Manager
> EzeeInfo Cloud Solutions
> +91 95000 07269
>
  

Re: Apache Drill Support concurrent parallel Request

2020-04-08 Thread Paul Rogers
Hi Ramasamy,

Let's define some terms. By "parallel requests" do you mean multiple people 
submitting queries at the same time? If so, then Drill handles this just fine: 
Drill is designed to run multiple queries from multiple users concurrently.

There is a caveat. Many people run Drill in embedded mode when they get 
started. Embedded mode is a single user, single-machine setup that is great for 
testing Drill, exploring small data sets and so on. However, to support 
multiple concurrent queries, the proper way to run Drill is as a service, 
preferably across multiple machines. Further, if you are running a cluster of 
two or more machines, you need some kind of distributed file system: S3, 
Hadoop, etc.


Once you start running concurrent queries, memory becomes an important 
consideration, especially if your JSON files are large and you are doing 
memory-intensive operations such as sorting and joins. The Drill documentation 
explains the correct configuration steps.
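
For reference, here is a minimal sketch of the memory knobs involved, using the
standard variables in conf/drill-env.sh; the values are purely illustrative and
should be sized for your own hardware and workload:

  # conf/drill-env.sh -- illustrative values only
  export DRILL_HEAP="8G"                  # Java heap: planning, metadata, Web UI
  export DRILL_MAX_DIRECT_MEMORY="16G"    # direct memory: sorts, joins, record batches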

Thanks,
- Paul

 

On Wednesday, April 8, 2020, 11:00:14 AM PDT, Ramasamy Javakar 
 wrote:  
 
 Hi, I did an analytics web application on drill, data set in json file.  We
are facing issues while getting multiple parallel requests. Does Apache
Drill support concurrent requests?. Please let me know


Thanks & Regards
Ramasamy

Product Manager
EzeeInfo Cloud Solutions
+91 95000 07269
  

Re: Linux versions supported for Apache drill

2020-04-03 Thread Paul Rogers
Hi Ted,

Very cool! Saw an article from a guy who ran K8s on a cluster of Raspberry Pis. 
[1] Combine that with your setup and we've have a Drill cluster in a shoe-box. 
(Memory would be a problem.)


So, I'm guessing if Drill runs on your Raspberry Pi (ARM-based), it will 
probably run on just about any i64 Linux.

Thanks,
- Paul

[1] https://medium.com/nycdev/k8s-on-pi-9cc14843d43


 

On Friday, April 3, 2020, 12:06:24 PM PDT, Ted Dunning 
 wrote:  
 
 Paul,

My Raspberry Pi4's run Drill with no problem. They have 4GB of RAM.



On Fri, Apr 3, 2020 at 10:41 AM Paul Rogers 
wrote:

> Hi Prabhakar,
>
> Drill is written in Java and should support just about any Linux version;
> certainly all the major versions. It's been run on MacOS, Ubuntu, CentOS,
> RedHat and probably many more. Might struggle a bit on a RaspberryPi, but I
> think someone even did that several years back.
>
> The main limitation is Windows, simply because no one has ever written the
> wrapper scripts/batch files/PowerShell scripts to launch Drill.
>
>
> Is there a specific version of interest?
>
> Thanks,
> - Paul
>
>
>
>    On Thursday, April 2, 2020, 8:59:23 PM PDT, Prabhakar Bhosaale <
> bhosale@gmail.com> wrote:
>
>  Hi All,
> Can anyone help us with the versions of Linux supported by Apache Drill? I
> could not find this information on the Drill website. Thanks in advance.
>
> Regards
> Prabhakar
>
  

Re: Linux versions supported for Apache drill

2020-04-03 Thread Paul Rogers
Hi Prabhakar,

Drill is written in Java and should support just about any Linux version; 
certainly all the major versions. It's been run on MacOS, Ubuntu, CentOS, 
RedHat and probably many more. Might struggle a bit on a RaspberryPi, but I 
think someone even did that several years back.

The main limitation is Windows, simply because no one has ever written the 
wrapper scripts/batch files/PowerShell scripts to launch Drill.


Is there a specific version of interest?

Thanks,
- Paul

 

On Thursday, April 2, 2020, 8:59:23 PM PDT, Prabhakar Bhosaale 
 wrote:  
 
 Hi All,
Can anyone help us with the versions of Linux supported by Apache Drill? I
could not find this information on the Drill website. Thanks in advance.

Regards
Prabhakar
  

Re: REST data source?

2020-04-02 Thread Paul Rogers
Hi Rafael,

Thanks for the update! We were thinking to try to finish up the current PR so 
we can get it merged into Drill. Then, we can add a simpler way to handle the 
extra message fields, and add the filter push-down code. We look forward to your 
continued advice as we add those additional features.

Thanks,
- Paul

 

On Thursday, April 2, 2020, 7:40:08 AM PDT, Jaimes, Rafael - 0993 - MITLL 
 wrote:  
 
 Hi all,

Just an update after testing HTTP REST plugin some more. It's working well.
I'm not sure how common or standardized these operators are, but in case it
is useful to you, I've been using the following:
  =    equals
  =~   not equals
  %3C  less than
  %3E  greater than
  ..   between
  ,    in

Let me know if you have any questions or if additional testing would help.

Thanks,
Rafael

-Original Message-
From: Jaimes, Rafael - 0993 - MITLL  
Sent: Wednesday, April 1, 2020 12:43 PM
To: user@drill.apache.org
Subject: RE: REST data source?

Yes that's correct. I saw the work you started with the env vars, but for
now I set the proxy in the plugin.

- Rafael

-Original Message-
From: Charles Givre  
Sent: Wednesday, April 1, 2020 12:42 PM
To: user@drill.apache.org
Subject: Re: REST data source?

Hey Rafael, 
Thanks for the feedback.  My original idea was to pull the proxy from the
environment vars in HTTP_PROXY/HTTPS_PROXY and ALL_PROXY, but that part
isn't quite done yet. Did you set the proxy info via the plugin config?
-- C


> On Apr 1, 2020, at 10:22 AM, Jaimes, Rafael - 0993 - MITLL
 wrote:
> 
> Hi all,
> 
> I built Charles' latest branch including the proxy setup. It appears to be
> working quite well going through the proxy.
> 
> I'll continue to test and report back if I find any issues.
> 
> Note: Beyond Paul's repo recommendations, I had to skip checkstyle to get the
> maven build to complete. You're probably already aware of that, I think it's
> just specific to this branch.
> 
> Thanks!
> Rafael
> 
> -Original Message-
> From: Paul Rogers 
> Sent: Wednesday, April 1, 2020 1:29 AM
> To: user 
> Subject: Re: REST data source?
> 
> Thanks, Charles.
> 
> As Charles suggested, I pushed a commit that replaces the "old" JSON reader
> with the new EVF-based one. Eventually this will allow us to use a "provided
> schema" to handle any JSON ambiguities.
> 
> As we've been discussing, I'll try to add the ability to specify a path to
> data: "response/payload/records" or whatever. With the present commit, that
> path can be parsed in code, but I think a simple path spec would be easier.
> 
> Thanks,
> - Paul
> 
> 
> 
>    On Tuesday, March 31, 2020, 10:00:52 PM PDT, Charles Givre 
>  wrote:
> 
> Hello all,
> I pushed some updates to the REST PR to include initial work on proxy 
> configuration.  I haven't updated the docs yet (until this is finalized). It
> adds new config variables as shown below:
> 
> {
>  "type": "http",
>  "cacheResults": true,
>  "connections": {},
>  "timeout": 0,
>  "proxyHost": null,
>  "proxyPort": 0,
>  "proxyType": null,
>  "proxyUsername": null,
>  "proxyPassword": null,
>  "enabled": true
> }
> I started on getting Drill to recognize the proxy info from the environment,
> but haven't quite finished that.  The plan is for the plugin config to
> override environment vars.
> Feedback is welcome.
> 
> @paul-rogers, I think you can push to my branch (or submit a PR?) and that
> will be included in the main PR.
> -- C
> 
> 
> 
>> On Mar 31, 2020, at 10:40 PM, Rafael Jaimes III 
wrote:
>> 
>> Yes your initial assessment was correct, there is extra material other
>> than the data field.
>> The returned JSON has some top-level fields that don't go any deeper,
>> akin to your "status" : ok field. In the example I'm running now, one
>> is called MessageState which is set to "NEW". There's another field
>> called MessageData, which, obviously, holds most of the data. There
>> are some other top-level fields, and one is called MessageHeader which
>> is nested. There's a lot of stuff here, and this is just one "table" I'm 
>> querying against now.
>> Not sure how it will differ with the other services.
>> 
>> The service is definitely returning multiple records - I believe it's
>> a JSON array and Drill+HTTP/plugin appears to handle it quite well.
>> 
>> You're right, Drill is handling most of the structure by modifying my
>> SELECT statement as you suggested.
>> 
>> For filter pushdown, expressions of that form would be great. That's
>> what I had i

Re: REST data source?

2020-03-31 Thread Paul Rogers
Thanks, Charles.

As Charles suggested, I pushed a commit that replaces the "old" JSON reader 
with the new EVF-based one. Eventually this will allow us to use a "provided 
schema" to handle any JSON ambiguities.

As we've been discussing, I'll try to add the ability to specify a path to 
data: "response/payload/records" or whatever. With the present commit, that 
path can be parsed in code, but I think a simple path spec would be easier.

Thanks,
- Paul

 

On Tuesday, March 31, 2020, 10:00:52 PM PDT, Charles Givre 
 wrote:  
 
 Hello all, 
I pushed some updates to the REST PR to include initial work on proxy 
configuration.  I haven't updated the docs yet (until this is finalized).  It 
adds new config variables as shown below:

{
  "type": "http",
  "cacheResults": true,
  "connections": {},
  "timeout": 0,
  "proxyHost": null, 
  "proxyPort": 0,
  "proxyType": null,
  "proxyUsername": null,
  "proxyPassword": null,
  "enabled": true
}
I started on getting Drill to recognize the proxy info from the environment, 
but haven't quite finished that.  The plan is for the plugin config to override 
environment vars.
Feedback is welcome.

@paul-rogers, I think you can push to my branch (or submit a PR?) and that will 
be included in the main PR. 
-- C



> On Mar 31, 2020, at 10:40 PM, Rafael Jaimes III  wrote:
> 
> Yes your initial assessment was correct, there is extra material other than
> the data field.
> The returned JSON has some top-level fields that don't go any deeper, akin
> to your "status" : ok field. In the example I'm running now, one is called
> MessageState which is set to "NEW". There's another field called
> MessageData, which, obviously, holds most of the data. There are some other
> top-level fields, and one is called MessageHeader which is nested. There's
> a lot of stuff here, and this is just one "table" I'm querying against now.
> Not sure how it will differ with the other services.
> 
> The service is definitely returning multiple records - I believe it's a
> JSON array and Drill+HTTP/plugin appears to handle it quite well.
> 
> You're right, Drill is handling most of the structure by modifying my
> SELECT statement as you suggested.
> 
> For filter pushdown, expressions of that form would be great. That's what I
> had in mind too.
> 
> Thanks,
> Rafael
> 
> On Tue, Mar 31, 2020 at 10:14 PM Paul Rogers 
> wrote:
> 
>> Hi Rafael,
>> 
>> Thanks much for the info. We had already implemented filter push-down for
>> other plugins, and for a few custom REST APIs, so should be possible to
>> port it over to the HTTP plugin. If you can supply code, then you can
>> convert filters to anything you want, a specialized JSON request body, etc.
>> To do this generically, we have to make some assumptions, such as either 1)
>> all fields can be pushed as query parameters, or 2) only those in some
>> config list. Either way, we know how to create name=value pairs in either a
>> GET or POST format.
>> 
>> You mentioned that your "payload" objects are structured. Drill can
>> already handle this; your query can map them to the top level:
>> 
>> SELECT t.characteristic.color.name AS color_name,
>> t.characteristic.color.confidence AS color_confidence, ...  FROM yourTable
>> AS t
>> 
>> You'll get that "out of box." Drill does assume that data is in "record
>> format": a single list of objects which represent records. Code would be
>> needed to handle, say, two separate lists of objects or other,
>> more-general, JSON structures.
>> 
>> 
>> My specific question was more around the response from your web service.
>> Does that have extra material besides just the data records? Something like:
>> 
>> 
>> { "status": "ok", "data": [ {characteristic: ... }, {...}] }
>> 
>> Or, is the response directly an array of objects:
>> 
>> [ {characteristic: ... }, {...}]
>> 
>> 
>> If it is just an array, then the "out of the box" plugin will work. If
>> there is other stuff, then you'll need the new feature to tell Drill how to
>> find the field to your data. The present version needs code, but I'm
>> thinking we can just use an array of names in the plugin config:
>> 
>> dataPath: [ "data" ],
>> 
>> Or, in your case, do you get a single record per HTTP request? If a single
>> record, then either your queries will be super-simple, or performance will
>> be horrible when requesting multiple records. (The HTTP plugin only

Re: REST data source?

2020-03-31 Thread Paul Rogers
Hi Rafael,

Thanks much for the info. We had already implemented filter push-down for other 
plugins, and for a few custom REST APIs, so should be possible to port it over 
to the HTTP plugin. If you can supply code, then you can convert filters to 
anything you want, a specialized JSON request body, etc. To do this 
generically, we have to make some assumptions, such as either 1) all fields can 
be pushed as query parameters, or 2) only those in some config list. Either 
way, we know how to create name=value pairs in either a GET or POST format.

You mentioned that your "payload" objects are structured. Drill can already 
handle this; your query can map them to the top level:

SELECT t.characteristic.color.name AS color_name,   
t.characteristic.color.confidence AS color_confidence, ...   FROM yourTable AS t

You'll get that "out of box." Drill does assume that data is in "record 
format": a single list of objects which represent records. Code would be needed 
to handle, say, two separate lists of objects or other, more-general, JSON 
structures.


My specific question was more around the response from your web service. Does 
that have extra material besides just the data records? Something like:


{ "status": "ok", "data": [ {characteristic: ... }, {...}] }

Or, is the response directly an array of objects:

 [ {characteristic: ... }, {...}]


If it is just an array, then the "out of the box" plugin will work. If there is 
other stuff, then you'll need the new feature to tell Drill how to find the 
field to your data. The present version needs code, but I'm thinking we can 
just use an array of names in the plugin config:

dataPath: [ "data" ],
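
Purely as an illustration of that idea, such an option might sit inside a
connection entry of the plugin config roughly like this (the connection name and
URL are made up, and dataPath itself is still only a proposal at this point):

  "connections": {
    "myapi": {
      "url": "https://example.com/api/records",
      "dataPath": [ "data" ]
    }
  }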

Or, in your case, do you get a single record per HTTP request? If a single 
record, then either your queries will be super-simple, or performance will be 
horrible when requesting multiple records. (The HTTP plugin only does one 
request and assumes it will get back a set of records as a JSON array or as 
whitespace-separated JSON objects as in a JSON file.)

Can you clarify a bit which of these cases your data follows?

I like your idea of optionally supplying a parser class for the "hard" cases:

messageParserClass: "com.mycompany.drill.MyMessageParser",

As long as the class is on the classpath, Java will find it.

Finally, on the filter push-down, the existing code we're thinking of using can 
handle expressions of the form:

column op constant

Where "op" is one of the relational operators: =, !=, < etc. Also handles the 
obvious variations (const op constant, column BETWEEN const1 AND const2, column 
IN (const1, const2, ...)).

The code cannot handle expressions (due to a limitation in Drill itself). That 
is, this won't work as a filter push-down: col = 10 + 2 or col + 2 = 10. Nor 
can it handle multi-column expressions: column1 = column2, etc.
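
To make that concrete, a few illustrative WHERE clauses (the table and column
names are made up):

  -- Push-down candidates: column op constant, plus the variations above
  SELECT * FROM api.records WHERE state = 'NEW';
  SELECT * FROM api.records WHERE size BETWEEN 100 AND 500;
  SELECT * FROM api.records WHERE color IN ('red', 'blue');

  -- Not pushed down: expressions and column-to-column comparisons stay in Drill
  SELECT * FROM api.records WHERE size + 2 = 10;
  SELECT * FROM api.records WHERE created_at = updated_at;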


I'll write up something more specific so you can see exactly what we propose.


Thanks,
- Paul

 

On Tuesday, March 31, 2020, 6:39:57 PM PDT, Rafael Jaimes III 
 wrote:  
 
 Either a text description of the parse path or specifying the class
with the message parser could work.
I think the latter would be better, if it were simple as dropping the
JAR in 3rdparty after Drill is already built.
That way we can just continually add parsers ad-hoc.

An example JSON response includes about 4 top-level fields,
then 2 of those fields have many sub-fields.
For example a field could be nested 3 levels deep and say:

Characteristic:

  Color:

      Color name: "Red"

      Confidence: 100

  Physical:

      Size: 405

      Confidence:  95

As you can imagine, it would be difficult to flatten this because of
repeated sub-field names like "Confidence".

I don't think it would be easily exportable into a CSV.
At least for me pandas dataframe is the ultimate destination for all
of this, which also don't handle nested fields well either.
I'll have to handle some parsing on my end.

Filter pushdown would be huge and much desired.
Our other end-users are accustomed to using SQL in that manner and the
REST API we use fully support AND, OR, BETWEEN, =, <, >, etc (I can
get a full list if you're interested).
For example I think "between" is a ",". Converting the SQL statement
into the URL format would be awesome and help streamline querying
across data sources.
This is one of the main reasons why we're so interested in Drill.


Thanks,

Rafael
  

Re: REST data source?

2020-03-31 Thread Paul Rogers
Hi Rafael,

You mention that your JSON response is nested. As it turns out, I just used 
something similar to Charles' HTTP plugin for a recent project. We had to deal 
with a bit of message overhead to get to the data:

{status: "ok", data: [your data here ]}

A PR was just submitted for a change to the "new" JSON parser to handle this 
case. However, the "message parser" does require code to parse its way down 
through the JSON.

The next step is to upgrade Charle's PR with the new JSON reader and support 
for the message parser. (The new JSON reader also allows you to specify a 
schema to handle messy JSON, if we could figure out where to store the schema.)

Can you perhaps share the JSON response structure you need? I'm trying to 
figure out if it is better to work out some kind of text description of the 
parse path, or just let you specify the name of a class that implements the 
message parser. Which would work better for you?

We are also trying to update an earlier ill-fated PR that adds filter 
push-down: the ability to convert a SQL WHERE expression into an HTTP 
parameter. That is WHERE foo = 'bar' becomes =bar in the URL. It is easy to 
implement the "naive" approach that handles only equality, and does a direct 
mapping to HTTP query params. Would this be useful in your case? Do you need to 
parameterize your HTTP request?


Any real-world insight would be helpful.

Thanks,
- Paul

 

On Tuesday, March 31, 2020, 1:40:17 PM PDT, Jaimes, Rafael - 0993 - MITLL 
 wrote:  
 
 Ok, I commented in that thread.
I think the proxy is the only missing piece. I tried connecting to a different 
service that is inside the proxy and it worked as expected. This looks like it 
will work well for our application.

FYI, Although it has basic auth, I am not using the authType field in the 
storage config.
Rather, our service authenticates from the header in this format: 
{"Authentication": "Basic "}.

The response JSON is nested quite a bit but I think it can be fixed by 
modifying the SELECT as you have done in your examples.

Thanks,
Rafael

-Original Message-
From: Charles Givre  
Sent: Tuesday, March 31, 2020 3:27 PM
To: user@drill.apache.org
Subject: Re: REST data source?

Rafael,
At the moment the plugin does not support proxy servers.  However, this is 
pretty easy to implement using the current libraries.  Could you please add a 
comment to the PR for the plugin (https://github.com/apache/drill/pull/1892)
with some explanation of what you
need?
Thanks,
-- C

> On Mar 31, 2020, at 3:21 PM, Jaimes, Rafael - 0993 - MITLL 
>  wrote:
> 
> Hi Paul,
> 
> I tried that (even tried a vanilla build before on its own) and I run into 
> the same dependency problem. There is something in apache-21.pom that I 
> cannot resolve. If it works for you I am certain it is a config on our end 
> due to the way our proxies and mirrors are setup, we have to go through these 
> internal channels when building and it sometimes causes issues.
> 
> Charles,
> 
> I am almost up and running with your pre-built instance. I have narrowed the 
> problem down to possibly being another proxy issue. The GET requests don't 
> seem to be honoring my system env variable proxy settings. Do you think 
> there's any way to force Drill/plug-in to use a proxy? I'm unable to get the 
> examples you have posted working: getting Connection reset error on HTTPS and 
> Connect time out with HTTP.  The URLs work fine if I test them outside of 
> Drill.
> 
> Thanks,
> Rafael
> 
> -Original Message-
> From: Paul Rogers 
> Sent: Tuesday, March 31, 2020 2:36 PM
> To: user@drill.apache.org
> Subject: Re: REST data source?
> 
> Hi Rafael,
> 
> The easiest way to build the plugin will be to build all of Drill 1.18 
> Snapshot with the plugin included.
> 
> 1. Grab master from GitHub.
> 
> 2. Merge in Charles' PR branch.
> 
> 3. mvn clean install -DskipTests
> 
> The above usually works for me. This process ensures that all the snapshot 
> versions come from your own build.
> 
> Not sure how we started storing snapshot versions in a Maven repo. 
> This causes issues. If you rebuild part of Drill, and have not built 
> the other parts in more than a day, Maven helpfully downloads the 
> snapshots from the repo, causing all kinds of chaos. (We should fix 
> this.)
> 
> Once you do the build, you'll have a full Drill distribution, just like you'd 
> download. You can use that distribution to run Drill with the plugin included.
> 
> There are other ways that also work; the above may be the simplest.
> 
> 
> Thanks,
> - Paul
> 
> 
> 
>    On Tuesday, March 31, 2020, 10:51:18 AM PDT, Jaimes, Rafael - 0993 - MITLL 
> wrote:  
> 
> Hi Charles,

Re: REST data source?

2020-03-31 Thread Paul Rogers
Hi Rafael,

You may be running into something that I hit at a recent employer. The firm 
hosted its own in-house artifactory that would pull only from "authorized" 
repos. Drill has a couple of dependencies on MapR-hosted repos which this firm 
did not mirror, causing Drill to break. Rather than argue with the Powers That 
Be to change the rules for my little POC, I found a work-around. If you are 
having the same issue, this might work for you. My notes from that time are at 
[1]. Of course, your issue could be different, so we might need a different 
solution. As I recall, the error I got was a bit different than the one you 
got. Still, worth a try.

Thanks,
- Paul


[1] 
https://github.com/paul-rogers/drill/wiki/Build-Drill-in-a-Corporate-Environment


 

On Tuesday, March 31, 2020, 12:21:42 PM PDT, Jaimes, Rafael - 0993 - MITLL 
 wrote:  
 
 Hi Paul,

I tried that (even tried a vanilla build before on its own) and I run into the 
same dependency problem. There is something in apache-21.pom that I cannot 
resolve. If it works for you I am certain it is a config on our end due to the 
way our proxies and mirrors are setup, we have to go through these internal 
channels when building and it sometimes causes issues.

Charles,

I am almost up and running with your pre-built instance. I have narrowed the 
problem down to possibly being another proxy issue. The GET requests don't seem 
to be honoring my system env variable proxy settings. Do you think there's any 
way to force Drill/plug-in to use a proxy? I'm unable to get the examples you 
have posted working: getting Connection reset error on HTTPS and Connect time 
out with HTTP.  The URLs work fine if I test them outside of Drill.

Thanks,
Rafael

-Original Message-----
From: Paul Rogers  
Sent: Tuesday, March 31, 2020 2:36 PM
To: user@drill.apache.org
Subject: Re: REST data source?

Hi Rafael,

The easiest way to build the plugin will be to build all of Drill 1.18 Snapshot 
with the plugin included.

1. Grab master from GitHub.

2. Merge in Charles' PR branch.

3. mvn clean install -DskipTests

The above usually works for me. This process ensures that all the snapshot 
versions come from your own build.

Not sure how we started storing snapshot versions in a Maven repo. This causes 
issues. If you rebuild part of Drill, and have not built the other parts in 
more than a day, Maven helpfully downloads the snapshots from the repo, causing 
all kinds of chaos. (We should fix this.)

Once you do the build, you'll have a full Drill distribution, just like you'd 
download. You can use that distribution to run Drill with the plugin included.

There are other ways that also work; the above may be the simplest.


Thanks,
- Paul

 

    On Tuesday, March 31, 2020, 10:51:18 AM PDT, Jaimes, Rafael - 0993 - MITLL 
 wrote:  
 
 Hi Charles,

(1./2.)
I have not been able to build Drill, from either a full clone of your tagged 
http-storage branch or from the standard Drill 1.17 release. 
I've narrowed it down to some dependency problems from the POM. In particular, 
I run into issues here:

Downloading: 
https://repo.maven.apache.org/maven2/org/apache/apache/21/apache-21.pom
[ERROR] The build could not read 1 project -> [Help 1] [ERROR] [ERROR]  The 
project org.apache.drill:drill-root:1.18.0-SNAPSHOT 
(/home/ra29435/drill-official/drill/pom.xml) has 1 error [ERROR]    
Non-resolvable parent POM: Could not transfer artifact org.apache:apache:pom:21 
from/to conjars (http://conjars.org/repo): Connection to http://conjars.org 
refurelativePath' points at no local POM @ line 24, column 11: Connection timed 
out (Connection timed out) -> [Help 2] [ERROR] [ERROR] To see the full stack 
trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please 
read the following articles:
[ERROR] [Help 1] 
http://cwiki.apache.org/confluence/display/MAVEN/ProjectBuildingException
[ERROR] [Help 2] 
http://cwiki.apache.org/confluence/display/MAVEN/UnresolvableModelException

I think it has something to do with the fact that I normally resolve 
dependencies from our local Maven repo mirrors. We have no problems getting 
stuff from Maven Central and common places, but I am unfamiliar with 
conjars.org. I wonder if it is related to that?

(3./4.)
I tried putting the JAR into either jars/ or jars/3rdparty with the same error. 
I haven't gone down the dependency tree so I have not made and JARs of them, 
that could be a major thing I'm missing.

Yes this is still in a testing environment. I'm going to use your pre-built 
images for testing the REST endpoint, this is extremely helpful. If it works 
out I'll go back to trying to build it. Also, hoping that this will make its 
way into the next (1.18) release.

Best,
Rafael

-Original Message-
From: Charles Givre 
Sent: Tuesday, March 31, 2020 1:34 PM
To: user 
Subjec

Re: REST data source?

2020-03-31 Thread Paul Rogers
Hi Rafael,

The easiest way to build the plugin will be to build all of Drill 1.18 Snapshot 
with the plugin included.

1. Grab master from GitHub.

2. Merge in Charles' PR branch.

3. mvn clean install -DskipTests

The above usually works for me. This process ensures that all the snapshot 
versions come from your own build.
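
As a rough sketch, those steps might look like this on the command line (the
local branch name is arbitrary; PR 1892 is the HTTP plugin PR discussed in this
thread):

  git clone https://github.com/apache/drill.git
  cd drill
  # fetch and merge the PR branch into a local branch
  git fetch origin pull/1892/head:http-plugin
  git merge http-plugin
  mvn clean install -DskipTests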

Not sure how we started storing snapshot versions in a Maven repo. This causes 
issues. If you rebuild part of Drill, and have not built the other parts in 
more than a day, Maven helpfully downloads the snapshots from the repo, causing 
all kinds of chaos. (We should fix this.)

Once you do the build, you'll have a full Drill distribution, just like you'd 
download. You can use that distribution to run Drill with the plugin included.

There are other ways that also work; the above may be the simplest.


Thanks,
- Paul

 

On Tuesday, March 31, 2020, 10:51:18 AM PDT, Jaimes, Rafael - 0993 - MITLL 
 wrote:  
 
 Hi Charles,

(1./2.)
I have not been able to build Drill, from either a full clone of your tagged 
http-storage branch or from the standard Drill 1.17 release. 
I've narrowed it down to some dependency problems from the POM. In particular, 
I run into issues here:

Downloading: 
https://repo.maven.apache.org/maven2/org/apache/apache/21/apache-21.pom
[ERROR] The build could not read 1 project -> [Help 1]
[ERROR]
[ERROR]  The project org.apache.drill:drill-root:1.18.0-SNAPSHOT 
(/home/ra29435/drill-official/drill/pom.xml) has 1 error
[ERROR]    Non-resolvable parent POM: Could not transfer artifact 
org.apache:apache:pom:21 from/to conjars (http://conjars.org/repo): Connection 
to http://conjars.org refurelativePath' points at no local POM @ line 24, 
column 11: Connection timed out (Connection timed out) -> [Help 2]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e 
switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please 
read the following articles:
[ERROR] [Help 1] 
http://cwiki.apache.org/confluence/display/MAVEN/ProjectBuildingException
[ERROR] [Help 2] 
http://cwiki.apache.org/confluence/display/MAVEN/UnresolvableModelException

I think it has something to do with the fact that I normally resolve 
dependencies from our local Maven repo mirrors. We have no problems getting 
stuff from Maven Central and common places, but I am unfamiliar with 
conjars.org. I wonder if it is related to that?

(3./4.)
I tried putting the JAR into either jars/ or jars/3rdparty with the same error. 
I haven't gone down the dependency tree so I have not made and JARs of them, 
that could be a major thing I'm missing.

Yes this is still in a testing environment. I'm going to use your pre-built 
images for testing the REST endpoint, this is extremely helpful. If it works 
out I'll go back to trying to build it. Also, hoping that this will make its 
way into the next (1.18) release.

Best,
Rafael

-Original Message-
From: Charles Givre  
Sent: Tuesday, March 31, 2020 1:34 PM
To: user 
Subject: Re: REST data source?

Hi Rafael,
Glad you're getting some value from Drill.  Repackaging that directory as a 
truly pluggable jar is tricky.  A few questions:
1.  Did you copy the contrib/storage-http into its own folder and then do a 
build from that?
2.  Did it build successfully?
3.  Did you copy the JARs into your Drill jars/3rdparty folder?
4.  You'll also have to get JARs of any dependencies as well and copy them to 
the jars/3rdparty.  Have you done that?

I actually have a pre-built version of Drill with the storage-http plugin 
available here: https://github.com/cgivre/drill/releases. Please do not use that in any kind 
of production setup.  If you're just wanting to try this out, it might be 
easier to d/l that and use that.
-- C



> On Mar 31, 2020, at 12:57 PM, Jaimes, Rafael - 0993 - MITLL 
>  wrote:
> 
> Hi Charles,
>  
> I am trying to use the http-storage plugin from your branch. I put the 
> storage plug-in files in a jar and tried to keep the jar directory structure 
> the same as other plug-ins. Upon starting drill-embedded I’m getting the 
> error below.  I am using your drill-module.conf and 
> bootstrap-storage-plugins.json from your branch. Is there another step I need 
> to perform to get Drill to recognize the plug-in? I am using 1.17 release.
>  
> Error: Failure in starting embedded Drillbit: 
> java.lang.IllegalStateException: 
> com.fasterxml.jackson.databind.exc.InvalidTypeIdException: Could not resolve 
> type id 'http' as a subtype of [simple type, class 
> org.apache.drill.common.logical.StoragePluginConfig]: known type ids = 
> [InfoSchemaConfig, SystemTablePluginConfig, file, hbase, hive, jdbc, kafka, 
> kudu, mock, mongo, named, openTSDB] (for POJO property 'storage') at [Source: 
> (String)"{
>  "storage":{
>    "http" : {
>      "type":"http",
>      "connections": {},
>      

Re: JDBC datasource on Websphere server 8.5.5.9

2020-03-30 Thread Paul Rogers
Hi Prabhakar,

Not being much of a JDBC expert, I did some poking around. It seems that 
Drill's open-source JDBC driver is based on Apache Calcite's Avatica framework. 
Avatica does not appear to include JDBC DataSource support; it is just a simple, 
basic JDBC driver.

My attempts to Google how to use such a basic Driver with Websphere did not 
produce many results. I found the click-this, type-that instructions (from 
2007!) but did not see anything about how to handle a basic driver.

So, seems that there several approaches:

1. Extend the Drill JDBC driver to include DataSource support.
2. Find a Websphere or third-party solution to wrap "plain" JDBC drivers.
3. Try MapR's commercial JDBC driver created by Simba. [1] Looks like this 
driver works on Windows. The documentation [2] does not list DataSource 
support, however.


Contributions are welcome to solve item 1. You are probably more of a WS expert 
than I, so perhaps you can research item 2. You can also check whether the MapR 
Driver give you what you need.

Also, if any others out there have more JDBC experience, it would be great if 
someone could add a bit more context. For example, how is this issue handled 
for other JDBC drivers? What would it take for Drill to add DataSource support?


Thanks,
- Paul
[1] https://mapr.com/docs/61/Drill/drill_odbc_connector.html
[2] 
https://mapr.com/docs/61/attachments/JDBC_ODBC_drivers/DrillODBCInstallandConfigurationGuide.pdf

 

On Sunday, March 29, 2020, 9:26:52 PM PDT, Prabhakar Bhosaale 
 wrote:  
 
 Hi Paul,

Any further inputs on JDBC driver for drill?  thx

Regards
Prabhakar

On Thu, Mar 26, 2020 at 1:25 PM Prabhakar Bhosaale 
wrote:

> Hi Paul,
> Please see my answers inline below
>
> Drill is supported on Windows only in embedded mode; we have no scripts to
> run a server. Were you able to create your own solution?
> Prabhakar: We are using drill on windows only in embedded mode
>
> The exception appears to indicate that the Drill JDBC connection is being
> used inside a transaction, perhaps with other data sources, so a two-phase
> commit is needed. However, Drill does not support transactions as
> transactions don't make sense for data sources such as HDFS or S3.
>
>
> Is there a way to configure WAS to use Drill just for read-only access
> without transactions? See this link: [1]. To quote:
>
> Non-transactional data source
> Specifies that the application server does not enlist the connections from
> this data source in global or local transactions. Applications must
> explicitly call setAutoCommit(false) on the connection if they want to
> start a local transaction on the connection, and they must commit or roll
> back the transaction that they started.
>
> Prabhakar: I tried making the datasource as non-transactional data source.
> But still it gave same error
>
> Can you run a test? Will SQLLine connect to your Drill server? If so, then
> you know that you have the host name correct, that the ports are open, and
> that Drill runs well enough on Windows for your needs.
>
> Prabhakar: I tried connecting to Drill using SQuirreL and it connected
> successfully. We even tried simple Java code using this driver class and it
> successfully retrieved the data. So Drill, with its port and host, is working
> fine.
>
>
> Our understanding is that WebSphere expects any JDBC driver to implement the
> javax.sql.ConnectionPoolDataSource interface, but we are not sure whether the
> Drill driver implements this.
>
> Please refer
> https://www.ibm.com/mysupport/s/question/0D50z62kMU2CAM/classcastexception-comibmoptimconnectjdbcnvdriver-incompatible-with-javaxsqlconnectionpooldatasource?language=en_US
>
> Any help in this regard is highly appreciated. thx
>
> REgards
> Prabhakar
>
> On Thu, Mar 26, 2020 at 10:52 AM Paul Rogers 
> wrote:
>
>> Hi Prabhakar,
>>
>> Drill is supported on Windows only in embedded mode; we have no scripts
>> to run a server. Were you able to create your own solution?
>>
>> The exception appears to indicate that the Drill JDBC connection is being
>> used inside a transaction, perhaps with other data sources, so a two-phase
>> commit is needed. However, Drill does not support transactions as
>> transactions don't make sense for data sources such as HDFS or S3.
>>
>>
>> Is there a way to configure WAS to use Drill just for read-only access
>> without transactions? See this link: [1]. To quote:
>>
>> Non-transactional data source
>> Specifies that the application server does not enlist the connections
>> from this data source in global or local transactions. Applications must
>> explicitly call setAutoCommit(false) on the connection if they want to
>> start a local transaction on the connect

Re: ECS parquet files query timing out

2020-03-28 Thread Paul Rogers
Creator.getRootExec():114
org.apache.drill.exec.physical.impl.ImplCreator.getExec():90
org.apache.drill.exec.work.fragment.FragmentExecutor.run():292
org.apache.drill.common.SelfCleaningRunnable.run():38
...():0
  Caused By (java.lang.Exception) getFileStatus on 
s3a://test-bucket/TestDir/Test_1.parquet:  com.amazonaws.SdkClientException: 
Unable to execute HTTP request: Timeout waiting for connection from pool
org.apache.hadoop.fs.s3a.S3AUtils.translateInterruptedException():352
org.apache.hadoop.fs.s3a.S3AUtils.translateException():177
org.apache.hadoop.fs.s3a.S3AUtils.translateException():151
org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus():2242
org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus():2204
org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus():2143
org.apache.parquet.hadoop.util.HadoopInputFile.fromPath():39

org.apache.drill.exec.store.parquet.AbstractParquetScanBatchCreator.readFooter():353

org.apache.drill.exec.store.parquet.AbstractParquetScanBatchCreator.getBatch():149
org.apache.drill.exec.store.parquet.ParquetScanBatchCreator.getBatch():42
org.apache.drill.exec.store.parquet.ParquetScanBatchCreator.getBatch():36
org.apache.drill.exec.physical.impl.ImplCreator.getRecordBatch():163
org.apache.drill.exec.physical.impl.ImplCreator.getChildren():186
org.apache.drill.exec.physical.impl.ImplCreator.getRecordBatch():141
org.apache.drill.exec.physical.impl.ImplCreator.getChildren():186
org.apache.drill.exec.physical.impl.ImplCreator.getRecordBatch():141
org.apache.drill.exec.physical.impl.ImplCreator.getChildren():186
org.apache.drill.exec.physical.impl.ImplCreator.getRecordBatch():141
org.apache.drill.exec.physical.impl.ImplCreator.getChildren():186
org.apache.drill.exec.physical.impl.ImplCreator.getRootExec():114
org.apache.drill.exec.physical.impl.ImplCreator.getExec():90
org.apache.drill.exec.work.fragment.FragmentExecutor.run():292
org.apache.drill.common.SelfCleaningRunnable.run():38
...():0

Thanks & Regards,
Navin

On Sat, 28 Mar 2020, 09:27 Paul Rogers,  wrote:

Hi Navin,

You had mentioned your ECS solution in an earlier note. What are you using to 
access data in your container? Is your ECS container running HDFS? Or, do you 
have some other API?

Do you have Drill running in a container on ECS, or is that where your data is 
located? It would be helpful if you could perhaps describe your setup in a bit 
more detail so we can offer suggestions about where to look for an issue.

By the way: the query profile is often a good place to start. You'll find them 
in the Drill Web Console. Looking at each operator you can see how much memory 
was used and how long things took. Specifically, look at the time taken by the 
scan: is the slowness due to reading the data, or is some other part of the 
query taking the time?

When you get the error, what is the stack trace? Is the error coming from some 
particular HDFS client? In some particular operation?


Thanks,
- Paul

 

On Friday, March 27, 2020, 6:59:42 AM PDT, Navin Bhawsar 
 wrote:  
 
 Hi,

We are facing a performance issue where an Apache Drill query on ECS times out
with below error "ConnectionPoolTimeoutException: Timeout waiting for
connection from pool"

However, the same query works fine on a single HDFS node with an execution time
of 2.1 sec (planning = 0.483 s).

Parquet file size <1.5 GB
Total parquet files scanned = 8( total 19 in directory)
Apache drill version 1.17
JDK 1.8.0_74
Total rows returned from query =71000

There are 2 drillbits running in distributed mode .
13 GB default allocated per drill bit.

Any ideas why ECS performance is so bad compared with HDFS for Drill?
Please advise if Drill provides options to optimize ECS querying.

Please let me know if you need more details.

Thanks & Regards,
Navin
  
  

Re: ECS parquet files query timing out

2020-03-27 Thread Paul Rogers
Hi Navin,

You had mentioned your ECS solution in an earlier note. What are you using to 
access data in your container? Is your ECS container running HDFS? Or, do you 
have some other API?

Do you have Drill running in a container on ECS, or is that where your data is 
located? It would be helpful if you could perhaps describe your setup in a bit 
more detail so we can offer suggestions about where to look for an issue.

By the way: the query profile is often a good place to start. You'll find them 
in the Drill Web Console. Looking at each operator you can see how much memory 
was used and how long things took. Specifically, look at the time taken by the 
scan: is the slowness due to reading the data, or is some other part of the 
query taking the time?

When you get the error, what is the stack trace? Is the error coming from some 
particular HDFS client? In some particular operation?


Thanks,
- Paul

 

On Friday, March 27, 2020, 6:59:42 AM PDT, Navin Bhawsar 
 wrote:  
 
 Hi,

We are facing a performance issue where an Apache Drill query on ECS times out
with below error "ConnectionPoolTimeoutException: Timeout waiting for
connection from pool"

However, the same query works fine on a single HDFS node with an execution time
of 2.1 sec (planning = 0.483 s).

Parquet file size <1.5 GB
Total parquet files scanned = 8( total 19 in directory)
Apache drill version 1.17
JDK 1.8.0_74
Total rows returned from query =71000

There are 2 drillbits running in distributed mode .
13 GB default allocated per drill bit.

Any ideas why ECS performance is so bad compared with HDFS for Drill?
Please advise if Drill provides options to optimize ECS querying.

Please let me know if you need more details.

Thanks & Regards,
Navin
  

Re: JDBC datasource on Websphere server 8.5.5.9

2020-03-25 Thread Paul Rogers
Hi Prabhakar,

Drill is supported on Windows only in embedded mode; we have no scripts to run 
a server. Were you able to create your own solution?

The exception appears to indicate that the Drill JDBC connection is being used 
inside a transaction, perhaps with other data sources, so a two-phase commit is 
needed. However, Drill does not support transactions as transactions don't make 
sense for data sources such as HDFS or S3. 


Is there a way to configure WAS to use Drill just for read-only access without 
transactions? See this link: [1]. To quote:

Non-transactional data source
Specifies that the application server does not enlist the connections from this 
data source in global or local transactions. Applications must explicitly call 
setAutoCommit(false) on the connection if they want to start a local 
transaction on the connection, and they must commit or roll back the 
transaction that they started.

Can you run a test? Will SQLLine connect to your Drill server? If so, then you 
know that you have the host name correct, that the ports are open, and that 
Drill runs well enough on Windows for your needs.
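
If you want to take SQLLine out of the picture, a bare-bones connectivity check
with the Drill JDBC driver might look like the sketch below (the host name is a
placeholder; use a jdbc:drill:zk=... URL instead if you connect through
ZooKeeper):

  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.ResultSet;
  import java.sql.Statement;

  public class DrillConnectTest {
    public static void main(String[] args) throws Exception {
      // Direct connection to a single Drillbit on the default user port
      String url = "jdbc:drill:drillbit=localhost:31010";
      try (Connection conn = DriverManager.getConnection(url);
           Statement stmt = conn.createStatement();
           ResultSet rs = stmt.executeQuery("SELECT * FROM sys.version")) {
        while (rs.next()) {
          System.out.println("Connected to Drill " + rs.getString("version"));
        }
      }
    }
  }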

By the way, the Apache mail agent does not support attachments. Can you post 
the log somewhere else? Or, just paste into an e-mail the lines around the 
failure.

Thanks,
- Paul


[1] 
https://www.ibm.com/support/knowledgecenter/en/SSAW57_8.5.5/com.ibm.websphere.nd.multiplatform.doc/ae/udat_jdbcdatasorprops.html


 

On Wednesday, March 25, 2020, 8:53:31 PM PDT, Prabhakar Bhosaale 
 wrote:  
 
 Hi Charles,

Thanks for the reply.  The Drill version is 1.16 and the JDBC version is also
the same. Drill is installed on Windows in standalone mode.

The challenge here is that, when we created the data provider and data
source on WAS, we did not give any hostname or port details of the Drill
server, so when the test connection happens on the WAS server, it is actually
not connecting to Drill.

Please let me know if you need any additional information. Once again
thanks for your help

Regards
Prabhakar

On Tue, Mar 24, 2020 at 6:19 PM Charles Givre  wrote:

> HI Prabhakar,
> Thanks for your interest in Drill.  Can you share your config info as well
> as the versions of Drill and JDBC Driver that you are using?
> Thanks,
> -- C
>
>
> > On Mar 24, 2020, at 7:07 AM, Prabhakar Bhosaale 
> wrote:
> >
> > Hi Team,
> >
> > we are trying to connect to apache drill from websphere 8.5.5.9.  We
> created the the Data provider and data source as per standard process of
> WAS.  But when we try to test the connection, it gives following error.
> >
> > "Test connection operation failed for data source retrievalds on server
> ARCHIVE_SERVER at node ARCHIVALPROFILENode1 with the following exception:
> java.lang.Exception: DSRA8101E: DataSource class cannot be used as
> one-phase: ClassCastException: org.apache.drill.jdbc.Driver incompatible
> with javax.sql.ConnectionPoolDataSource  "
> >
> > We are using SDK version 1.8
> > Attaching the JVM log also for your reference. thx
> >
> > Any pointers or any documentation in this regards would be appreciated.
> Please help. thx
> >
> > Regards
> > Prabhakar
> > 
>
>
  

Re: Excessive Memory Use in Parquet Files (From Drill Slack Channel)

2020-03-24 Thread Paul Rogers
red: One or more nodes ran out of memory while executing the 
query. (null)
org.apache.drill.common.exceptions.UserException: RESOURCE ERROR: One or more 
nodes ran out of memory while executing the query.
null
[Error Id: 67b61fc9-320f-47a1-8718-813843a10ecc ]
    at 
org.apache.drill.common.exceptions.UserException$Builder.build(UserException.java:657)
    at 
org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:338)
    at 
org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38)
    at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.drill.exec.exception.OutOfMemoryException: null
    at 
org.apache.drill.exec.vector.complex.AbstractContainerVector.allocateNew(AbstractContainerVector.java:59)
    at 
org.apache.drill.exec.test.generated.PartitionerGen5$OutgoingRecordBatch.allocateOutgoingRecordBatch(PartitionerTemplate.java:380)
    at 
org.apache.drill.exec.test.generated.PartitionerGen5$OutgoingRecordBatch.initializeBatch(PartitionerTemplate.java:400)
    at 
org.apache.drill.exec.test.generated.PartitionerGen5.setup(PartitionerTemplate.java:126)
    at 
org.apache.drill.exec.physical.impl.partitionsender.PartitionSenderRootExec.createClassInstances(PartitionSenderRootExec.java:263)
    at 
org.apache.drill.exec.physical.impl.partitionsender.PartitionSenderRootExec.createPartitioner(PartitionSenderRootExec.java:218)
    at 
org.apache.drill.exec.physical.impl.partitionsender.PartitionSenderRootExec.innerNext(PartitionSenderRootExec.java:188)
    at 
org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:93)
    at 
org.apache.drill.exec.work.fragment.FragmentExecutor$1.run(FragmentExecutor.java:323)
    at 
org.apache.drill.exec.work.fragment.FragmentExecutor$1.run(FragmentExecutor.java:310)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
    at 
org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:310)
    ... 4 common frames omitted
Now, I'm running this query from a 16 core, 32GB Ram machine, with Heap sized 
at 20GB, Eden sized at 16GB (added manually to JAVA_OPTS) and Direct Sized at 8 
GB.
By querying sys.memory I can confirm all limits apply. At no point throughout 
the query Am I nearing memory limit of the HEAP/DIRECT or the OS itself





8:25
However, due to the way 
org.apache.drill.exec.vector.complex.AbstractContainerVector.allocateNew is 
implemented
8:27
@Override
  public void allocateNew() throws OutOfMemoryException {
    if (!allocateNewSafe()) {
      throw new OutOfMemoryException();
    }
  }
8:27
The actual exception/error is swallowed, and I have no idea what's the cause of 
the failure
8:28
The data-set itself consists of say 15 parquet files, each one weighing at 
about 100kb
8:30
but as mentioned earlier, the parquet files are a bit more complex than the 
usual.
8:32
@cgivre @Vova Vysotskyi is there anything I can do or tweak to make this error 
go away?

cgivre  8:40 AM
Hmm...
8:40
This may be a bug.  Can you create an issue on our JIRA board?

Idan Sheinberg  8:43 AM
Sure
8:43
I'll get to it

cgivre  8:44 AM
I'd like for Paul Rogers to see this as I think he was the author of some of 
this.

Idan Sheinberg  8:44 AM
Hmm. I'll keep that in mind

cgivre  8:47 AM
We've been refactoring some of the complex readers as well, so it's possible 
that caused this, but I'm not really sure.
8:47
What version of Drill?

cgivre  9:11 AM
This kind of info is super helpful as we're trying to work out all these 
details.
9:11
Reading schemas on the fly is not trivial, so when we find issues, we do like 
to resolve them

Idan Sheinberg  9:16 AM
This is Drill 1.18-SNAPSHOT as of last month
9:16
U
9:16
I do think I managed to resolve the issue however
9:16
I'm going to run some additional tests and let you know

cgivre  9:16 AM
What did you do?
9:17
You might want to rebase with today's build as well

Idan Sheinberg  9:21 AM
I'll come back with the details in a few moments

cgivre  9:38 AM
Thx
new messages

Idan Sheinberg  9:50 AM
Ok. See it seems as though it's a combination of a few things.
The data-set in question is still small (as mentioned before), but we are 
setting planner.slice_target  to an extremely low value in order to trigger 
parallelism and speed up parquet parsing by using multiple fragments.
We have 16 cores, 32 GB (C5.4xlarge on AWS) but we set 
planner.width.max_per_node  to further increase parallelism.  it seems as 
though each fragment is handling parquet parsing on it's own, and somehow 
incurs a great burden on
the direct memory buffer pool, as I do see 16GB peaks of direct memory usage 
after lowering the planner.width.max
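
For reference, the options mentioned above are ordinary Drill options and can be
changed per session or system-wide; the values below are only illustrative, not
recommendations:

  ALTER SYSTEM SET `planner.slice_target` = 100000;
  ALTER SYSTEM SET `planner.width.max_per_node` = 8;
  -- per-query memory budget per node, in bytes
  ALTER SYSTEM SET `planner.memory.max_query_memory_per_node` = 4294967296;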

Re: scaling drill in an openshift (K8s) cluster

2020-03-24 Thread Paul Rogers
Hi All,

The issue of connecting to a pod from outside the K8s cluster is a known, 
intentional limitation of K8s. K8s creates its own overlay network for pod 
addresses (at least in the plain-vanilla version.) Amazon EKS seems to draw pod 
IPs from the same pool as VMs, and so, on AWS, pods may be reachable.

Dobes makes a good point about stateful sets. However, in normal operation, the 
Drillbit IPs should not matter: it is ZK which is critical. Each Drillbit needs 
to know the ZK addresses and will register itself with ZK. Clients consult ZK 
to find Drilbits. So, the Drillbit IPs themselves can change on each Drillbit 
run.

This does mean that ZK has to be visible outside the K8s overlay network. And, 
to connect to the Drillbit, each Drillbit IP must also be visible (but not 
known ahead of time to the client, only the ZK addresses must be known to the 
client ahead of time.)


The general solution is to put a load balancer or other gateway in front of 
each ingress point. In a production environment, each ingress tends to be 
secured with that firm's SSO solution. All of this is more K8s magic than a 
Drill issue.

One quick solution is to run a K8s proxy to forward the Drillbit web address to 
outside nodes. It won't help for the JDBC driver, but it lets you manage the Drill 
server via REST.
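
As a quick sketch, assuming a pod named drillbit-0 and the default web port,
that could be as simple as:

  # forward the Drill Web Console / REST port from the pod to the local machine
  kubectl port-forward pod/drillbit-0 8047:8047
  # then browse to http://localhost:8047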

Abhishek has been working on a K8s solution. If he is reading this, perhaps he 
can offer some advice on what worked for him.


Thanks,
- Paul

 

On Tuesday, March 24, 2020, 9:04:35 AM PDT, Dobes Vandermeer 
 wrote:  
 
 I was able to get drill up and running inside a k8s cluster but I didn't 
connect to it from outside the cluster, so the DNS names were always resolvable 
by the client(s).

I had to run it as a statefulset to ensure the DNS names are stable, otherwise 
the drillbits couldn't talk to each other, either.

On 3/24/2020 6:37:44 AM, Jaimes, Rafael - 0993 - MITLL 
 wrote:
I’m seeing a problem with scaling the number of pod instances in the 
replication controller because they aren’t reporting their hostnames properly. 
This was a common problem that got fixed in scalable architectures like 
ZooKeeper and Kafka (see reference at bottom; I think this was related).
 
In Drill’s case, ZooKeeper is able to see all of the drillbits, however, the 
hostnames are only locally addressable within the cluster, so as soon as you 
perform a query it fails since the client can’t find the drillbit that it got 
assigned, its hostname isn’t externally addressable.
 
Kafka fixes this by allowing an override for advertised names. Has anyone 
gotten Drill to scale in a K8s cluster?
 
https://issues.apache.org/jira/browse/KAFKA-1070  

Re: REST data source?

2020-03-24 Thread Paul Rogers
Hi Rafael,

We are seeing increasing interest in Drill obtaining data via REST. At present, 
there are no real standards for things like telling the server which fields are 
wanted, or "pushing" filter conditions to the source. It would be great if we 
can build on Charle's work to add some kind of solution.

Can you share a bit about the REST service to which you want to connect? How 
does it handle the "projection" and "filter" issues described above? How does 
it return the data? Streaming JSON? JSON embedded in a response structure? 
Something else?

Thanks,
- Paul

 

On Tuesday, March 24, 2020, 7:17:20 AM PDT, Jaimes, Rafael - 0993 - MITLL 
 wrote:  
 
 
Thank you so much. I apologize, looks like someone asked a similar question 
right after I checked the archives yesterday. This looks great.

-  Rafael 

  

From: Charles Givre  
Sent: Tuesday, March 24, 2020 10:14 AM
To: user@drill.apache.org
Cc: Jaimes, Rafael - 0993 - MITLL 
Subject: Re: REST data source?

  

Hi Rafael, 

Thanks for your interest in Drill.  To answer your question, there is a PR in 
progress which allows you to query REST APIs from Drill. [1]. Here's a link to 
the documentation as well. [2].  The idea behind Drill is to be able to query a 
wide variety of data sources, and this PR will enable you to reach out to REST 
endpoints and query that data. 

  

Mechanically, this is very different from how Drill queries other RDBMS systems 
via JDBC.  If you have any questions, please let me know. 

Thanks,

  

  

  

[1]: https://github.com/apache/drill/pull/1892

[2]: 
https://github.com/apache/drill/blob/27e72499a3a80c0b2927d532d2d4959d8be4eea6/contrib/storage-http/README.md

  

  






On Mar 24, 2020, at 9:22 AM, Jaimes, Rafael - 0993 - MITLL 
 wrote:

  

I know you can use REST API to query against Drill, but can Drill make REST 
queries itself?

 

It might seem unnecessary but if the idea is one stop shop for all querying, I 
don’t see how it’s different than using SQL against Drill which then queries 
against a RDBMS using SQL.

 

Thanks in advance.


  
  

Re: Embedding Drill as a distributed query engine

2020-03-24 Thread Paul Rogers
>> The ease of communication between threads within the same process is
>> dramatically better than communication between processes, even
>> (especially?) with shared memory.
>>
>> My own recommendation would be to *allow* collocation but not assume it.
>> Allow for non-collocated Drill bits as well. That allows you to pivot at
>> any point.
>>
>>
>> On the other hand
>>
>> On Tue, Jan 21, 2020 at 5:10 PM Paul Rogers 
>> wrote:
>>
>> > Hi Benjamin,
>> >
>> > Very cool project! Drill works well on top of custom data sources.
>> >
>> > That said, I suspect that actually running Drill inside your process
>> will
>> > lead to a large amount of complexity. Your comment focuses on code
>> issues.
>> > However, there are larger concerns. Although we think of Drill as a
>> simple
>> > single-threaded, single node tool (when run in SqlLine or on a Mac),
>> Drill
>> > is designed to be fully distributed.
>> >
>> > As queries get larger, you will find that Drill itself uses large
>> amounts
>> > of memory and CPU to run a query quickly. (Imagine a join or sort of
>> > billions of rows from several tables.) Drill has its own memory
>> management
>> > system to handle the large blocks of memory needed. Your DB also needs
>> > memory. You'd need a way to unify Drill's memory management with your
>> own
>> > -- a daunting task.
>> >
>> > Grinding through billions of rows is CPU intensive. Drill manages its
>> own
>> > thread and makes very liberal use of CPU. Your DB engine likely also
>> has a
>> > threading model. Again, integrating the two is difficult. We could go
>> on.
>> >
>> > In short, although Drill works well as a query engine on top of a custom
>> > data source; Drill itself is not designed to be a library included into
>> > your app process; it is designed to run as its own distributed set of
>> > processes running alongside your process.
>> >
>> > We could, of course, change the design, but that would be a bit of a big
>> > project because of the above issues. Might be interesting to think how
>> > you'd embed a distributed framework as a library in some host process.
>> Not
>> > sure I've ever seen this done for any tool. (If anyone knows of an
>> example,
>> > please let us know.)
>> >
>> >
>> > I wonder if there is a better solution. Run Drill alongside your DB on
>> the
>> > same nodes. Have Drill then obtain data from your DB via an API. The
>> quick
>> > & dirty solution is to use an RPC API. You can get fancy and use shared
>> > memory. A side benefit is that other tools can also use the API. For
>> > example, if you find you need Spark integration, it is easier to
>> provide.
>> > (You can't, of course, run Spark in your DB process.)
>> >
>> > In this case, an "embedded solution" means that Drill is embedded in
>> your
>> > app cluster (like ZK), not that it is embedded in your app process.
>> >
>> >
>> > In this way, you can tune Drill's memory and CPU usage separately from
>> > that of your engine, making the problem tractable. This model is, in
>> fact,
>> > very similar to the traditional HDFS model in which both Drill and HDFS
>> run
>> > on the same nodes. It is also similar to what MapR did with the MapR DB
>> > integration.
>> >
>> >
>> > Further, by separating the two, you can run Drill on its own nodes if
>> you
>> > find your queries are getting larger and more expensive. That is, you
>> can
>> > scale out be separating compute (Drill) from storage (your DB), allowing
>> > each to scale independently.
>> >
>> >
>> > And, of course, a failure in one engine (Drill or DB) won't take down
>> the
>> > other if the two are in separate processes.
>> >
>> >
>> > In either case, your storage plugin needs to compute data locality. If
>> > your DB is distributed, then perhaps it has some scheme for distributing
>> > data: hash partitioning, range partitioning, or whatever. Somehow, if I
>> > have key 'x', I know to go to node Y to get that value. For example, in
>> > HDFS, Drill can distribute block scans to the node(s) with the blocks.
>> >
>> >
>> > Or, maybe data is randomly distributed, so that every scan must run
>> > against every DB node; in which case if you have N

Re: Apache Drill rest api plugin

2020-03-24 Thread Paul Rogers
Hi Arun,

If I understand you, the Parquet file format is essentially unimportant. You 
have your own in-memory structures that happen to be populated from Parquet.

You probably have some form of REST API that, at the least, includes projection 
and filtering. That is, I can say which columns I want (projection) and which 
rows (filtering).

The API delivers data in some format. If a REST API, that format is probably 
JSON, though JSON is horribly inefficient for a big data, high-speed query 
engine. The data would, ideally, be in some compact, easy-to-parse binary 
format.

There is no out-of-the-box storage plugin to that will do everything you want. 
However, Drill is designed to be extensible; it is not hard to build such a 
plugin. For example, I've now built a couple that do something similar, and 
Charles is working on a generic HTTP version.


There are a number of resources to help. One is our book Learning Apache Drill. 
Another is a set of notes on my wiki from when I built a similar plugin. [1] 
Charles mentioned his in-flight REST API which gives you much of what you need 
except the filter push-down. [2]


There are two minor challenges. The first is just learning how to build Drill 
and assemble the pieces needed. The book and Wiki can help. The other is to 
build the "filter push-down" logic that translates from Drill's internal parse 
tree for filters to the format that your REST API needs. Basically, you pull 
predicates out of the query (WHERE a=10 AND b="fred") and pass them along using 
your REST request. There is a framework to help with filter push downs in [3]. 
The framework converts from Drill's filter format to a compact (col op const) 
format that handles most of the cases you'll need, at least when getting 
started.


The obvious way to pass the predicates is via an HTTP GET query string. 
However, predicates can be complex; some people find it better to pass the 
predicates encoded in JSON via an HTTP POST request.

If your data does return in JSON, you can use the existing JSON parser to read 
the data. See PR 1892 for an example. We are also working on an improved JSON 
reader which will be available in a couple of weeks (if all goes well.)

Thanks,
- Paul


[1] https://github.com/paul-rogers/drill/wiki/Create-a-Storage-Plugin
[2] https://github.com/apache/drill/pull/1892
[3] https://github.com/apache/drill/pull/1914 

On Monday, March 23, 2020, 11:03:01 PM PDT, Arun Sathianathan 
 wrote:  
 
 Hi Paul,
Thanks for getting back. Let me rephrase the question for clarity. We have an 
architecture where parquet files in ECS are read into memory and hosted in 
in-memory structures. We then have an API exposed to users that returns data 
from memory via a REST API. We would like to know if we can query the REST API 
using Apache Drill. The authentication to the API is via OAuth2. 
We were also pointed to the enhancement below, which is in the pipeline: 
https://github.com/apache/drill/pull/1892 

Regards,Arun 

Sent from my iPhone

On 24-Mar-2020, at 12:10 AM, Paul Rogers  wrote:


Hi Navin,

Can you share a bit more what you are trying to do? ECS is Elastic Container 
Service, correct? So, the Parquet files are ephemeral: they exist only while 
the container runs? Do the files have a permanent form, such as in S3?

Parquet is a complex format. Drill exploits the Parquet structure to optimize 
query performance. This means that Drill must seek to the header, footer and 
row groups of each file. More specifically, Parquet cannot be read in a 
streaming fashion the way we can read CSV or JSON.

The best REST API for Parquet would be a clone of the Amazon S3 API. 
Alternatively, expose the files using something like NFS so that the file on 
ECS appears like a local file to Drill.

You can even implement the HDFS client API on top of your REST API (assuming 
your REST API supports the required functions), and use Drill's DFS plugin with 
your client.


Yet another alternative is to store Parquet in S3, so Drill can use the S3 API 
directly. Or, to stream the content to Drill from a container, use JSON or CSV.

Lots of options that depend on what you're trying to do.

Thanks,

- Paul

 

On Monday, March 23, 2020, 6:03:48 AM PDT, Charles Givre  
wrote:  
 
 Hi Navin, 
Thanks for your interest in Drill.  To answer your question, there is currently 
a pull request for a REST storage plugin [1], however as implemented it only 
accepts JSON responses.  However, it would not be difficult to get the reader 
to accept Parquet files.  Please take a look and send any feedback.
-- C


[1]: https://github.com/apache/drill/pull/1892 
<https://github.com/apache/drill/pull/1892>

> On Mar 23, 2020, at 8:14 AM, Navin Bhawsar  wrote:
> 
> Hi
> 
> We are currently doing an experiment to use apache drill to query parquet
> files .These parquet files will be copied on ecs and exposed via rest api .
> 
> Can you please advise if there is a storage plugin to query r

Re: Apache Drill rest api plugin

2020-03-23 Thread Paul Rogers
Hi Navin,

Can you share a bit more what you are trying to do? ECS is Elastic Container 
Service, correct? So, the Parquet files are ephemeral: they exist only while 
the container runs? Do the files have a permanent form, such as in S3?

Parquet is a complex format. Drill exploits the Parquet structure to optimize 
query performance. This means that Drill must seek to the header, footer and 
row groups of each file. More specifically, Parquet cannot be read in a 
streaming fashion the way we can read CSV or JSON.

The best REST API for Parquet would be a clone of the Amazon S3 API. 
Alternatively, expose the files using something like NFS so that the file on 
ECS appears like a local file to Drill.

You can even implement the HDFS client API on top of your REST API (assuming 
your REST API supports the required functions), and use Drill's DFS plugin with 
your client.


Yet another alternative is to store Parquet in S3, so Drill can use the S3 API 
directly. Or, to stream the content to Drill from a container, use JSON or CSV.

Lots of options that depend on what you're trying to do.

Thanks,

- Paul

 

On Monday, March 23, 2020, 6:03:48 AM PDT, Charles Givre  
wrote:  
 
 Hi Navin, 
Thanks for your interest in Drill.  To answer your question, there is currently 
a pull request for a REST storage plugin [1], however as implemented it only 
accepts JSON responses.  However, it would not be difficult to get the reader 
to accept Parquet files.  Please take a look and send any feedback.
-- C


[1]: https://github.com/apache/drill/pull/1892 


> On Mar 23, 2020, at 8:14 AM, Navin Bhawsar  wrote:
> 
> Hi
> 
> We are currently doing an experiment to use apache drill to query parquet
> files .These parquet files will be copied on ecs and exposed via rest api .
> 
> Can you please advise if there is a storage plugin to query rest api ?
> 
> Currently we are using Apache Drill 1.17 version in distributed mode .
> 
> Please let me know if you need more details .
> 
> Thanks and Regards,
> Navin
  

Re: Updating tables stored on s3

2020-03-14 Thread Paul Rogers
Hi Dobes,

Updating data in this way does seem to be challenge. Hive added ACID features a 
while back, but they are fiendishly complex: every record is given an ID. A 
delete or replace creates a edit entry in another file that uses the same ID. 
Hive then joins the main and update file to apply the updates, and anti-joins 
the deletes file to remove deletions. Seems to work, but it does seem fragile 
and expensive.

For how long are your files open to revision? Days? Weeks? The whole school 
term? Can deletes arrive after the school term? I'm wondering how dynamic the 
data is? Is there a "frothy" present, but then a stable history? Or is the 
whole history frothy?


If revisions are short-term (due to correcting errors, make-up tests, etc.), 
you are trying to create a streaming database as described in the book 
Streaming Systems [1]. If you read that book online, it has lots of nice 
animations showing the various issues, and how late-arriving values can replace 
earlier values. Might spark some ideas.


Depending on how fancy you want to get, you can create a fake directory 
structure in S3 and let Drill's usual partition pruning reduce the number of 
files for each query (partition pruning.)
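
For example (a minimal sketch; the s3 workspace, directory layout and column
names are all made up), if files land under school/day subdirectories, Drill's
dir0/dir1 pseudo-columns let a query prune to just the matching directories:

SELECT student_id, score
FROM s3.`scores`
WHERE dir0 = 'district_42'   -- first directory level, e.g. school or district
  AND dir1 = '2020-03-01';   -- second directory level, e.g. day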

Or, you can create an index database that holds the complete list of your 
files, along with metadata (maybe the school, class, date, assignment, 
whatever.) You can write custom code that first looks up filter conditions in 
your index DB, to return a list of files. From that, pass the files to Drill in 
place of Drill's partition pruning. This is not an "out of the box" task; you'd 
basically be writing a plugin that replaces Drill's normal directory & 
partition push-down code.

With the index DB, replacing a Parquet file is as easy as replacing the file at 
a particular (coordinate) in your index. Doing this also avoids race 
conditions: you can replace the index entry, wait a few hours to ensure all 
in-flight queries using the old file complete, then delete the obsolete Parquet 
file.

If queries are local (for a single teacher, student or school), the number of 
files can be small (with good data localization: all of the data from one 
district in one file, say.) So, Drill might normally scan a handful (dozens, 
few hundreds) of files. If, however, you are computing nation-wide trends 
across all data, then all files might be involved. In the localized, case, 
having a good index would help you keep the number of files per query small.

Hive, by the way, helps with this because Hive maps table columns to 
directories. You could have partitions for, say, school district, teacher, 
class, student, test which would directly map to terms in your actual queries. 
With Drill, you have to do the mapping yourself, which is a hassle. With a 
roll-your-own index, you can reproduce the Hive behavior. (Yes, we should 
implement the Hive pattern in Drill - perhaps an outcome of the Drill 
metastore.)


Are you planning to combine this with Mongo for the most recent data? If so, 
then your index mechanism can also identify when to use Mongo, and when data 
has migrated to in S3.

Just out of curiosity, if your data changes frequently, have you considered a 
distributed DB such as Cassandra or the like? A Google search suggested there 
are about 60 million students in the US. Let's assume you are wildly 
successful, and serve all of them. There are 250 school days per year (IIRC). 
Let's say each student takes 5 tests per day. (My kids would revolt, but 
still.) That is 75 billion data points per school year. At, say, 100 bytes per 
record, that is about 8 TB of data. Seems like something a small cluster could 
handle, with the number of nodes probably driven by query rate.

Maybe break the problems into pieces? Ignoring the update issue, could you get 
good query performance from S3 with some partitioning strategy? Could you get 
good performance from Mongo (or Cassandra or whatever?) If a static solution 
won't perform well, a dynamic update one probably won't be any better.


Anyone else whose built this kind of system who can offer suggestions?

Thanks,
- Paul


[1] https://learning.oreilly.com/library/view/streaming-systems/9781491983867/



 

On Friday, March 13, 2020, 9:35:13 PM PDT, Dobes Vandermeer 
 wrote:  
 
 Hi,

I've been thinking about how I might be able to get a good level of performance 
from drill while still having data that updates and while storing the data in 
s3.  Maybe this is a pipe dream, but here are some thoughts and questions.

What I would like to be able to do is to update, replace, re-balance the 
parquet files in s3, but I don't want to calculate and specify the whole list 
of files that are "current" in each query.

I was thinking perhaps I could use a view, so when I replace a file I can add a 
new file, update the view to include it, and then delete the old file.
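
(For reference, a minimal sketch of such a view, with invented workspace and
file names, would just union the files that are currently live:

CREATE OR REPLACE VIEW dfs.tmp.scores_current AS
SELECT * FROM s3.`scores/batch_0001.parquet`
UNION ALL
SELECT * FROM s3.`scores/batch_0002.parquet`;

Swapping a file would then mean re-issuing the CREATE OR REPLACE with the new
file list.)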

But I'm worried that having a view with thousands of files could perform poorly.

Building on that idea, it 

Re: Help with time taken to generate parquet file.

2020-03-12 Thread Paul Rogers
Hi Vishwajeet,

Welcome to the Drill community. As it turns out, our mailing list does not 
forward images. But, fortunately, the moderation software preserved the images 
so I was able to find them. Let me tackle your questions one by one.

Like all generalizations, saying "Drill needs lots of memory" is relative. The 
statement applies to a production system, running against large files, with 
many concurrent users. It probably does not apply to your local machine running 
a few sample queries.

What drives memory usage? It is not just file size. It is the buffered size. If 
you scan 1TB of data with a simple query with only a WHERE clause, Drill will 
use very little memory. But, if you sort the 1TB of data, Drill will obviously 
need lots of memory to perform the sort. For sort (and several other 
operations), if there is not enough memory, Drill will spill to disk, which is 
slow. (At least three IOs for each block of data instead of just one.)

Second, the variable you used to set memory:

JAVA_TOOL_OPTIONS -Xmx8192m

Is not the documented way to set memory. See [1] for the preferred approach. 
Looks like your approach works; but probably because you are running an 
embedded-mode Drillbit.


Just to emphasize this: Drill works fine as an embedded desktop tool. But, it 
is designed to run well on clusters, with distributed storage and multiple 
machines all working away on large queries.


To assign memory, consider your use case. Your second image is a screen shot of 
one line of the Drill web console showing the Drillbit using .2GB of 8GB of 
heap, 0GB of direct memory, and basically 0% CPU. You did not say if this is 
during a query or between queries. I assume it is between queries.


You mention that you want to "reduce file generation time", but you did not 
state the kind of file you are reading, or the expected sizes of the input and 
output files. (The message title does state the output is Parquet.) I'll guess 
that both files reside on your local machine. So, depending on disk type (SSD 
or HDD), you can expect maybe 50 MB/s (HDD) to 200MB/s (SSD) IO throughput. If 
you want to process a 1GB file, you will need to do 2GB of I/O. At 100MB/s, it 
will take 20 seconds just for the I/O, maybe more if the HDD starts seeking 
between input and output files. This is why a production Drillbit runs on 
multiple servers: to spread out the I/O.

Another issue might be that your input is all one big file. In this case, Drill 
will run in a single thread, with no parallelism. Drill works better if your 
input is divided into multiple files. (Or, multiple blocks in HDFS or S3.) On 
the local system, create a directory that contains your file split into four, 
eight or more chunks. That way, Drill can put all your CPUs to work for 
CPU-intensive tasks such as filtering, computing values, and so on.
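
If it helps, here is a rough sketch of the kind of CTAS involved (the sqlserver
plugin name and the table are placeholders); adding PARTITION BY makes Drill
write the output as multiple Parquet files, which later scans can read in
parallel:

ALTER SESSION SET `store.format` = 'parquet';
CREATE TABLE dfs.tmp.orders_pq
PARTITION BY (order_year)
AS SELECT order_id, order_year, amount
FROM sqlserver.sales.orders;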


At times like this, the query profile is your friend. The amount of information 
can be overwhelming. Look at the total run time. Then, look at the time in the 
various operators. Which ones take time? Only the scan and root (the root 
writes your output file)? Or, do you have a join, sort, or other complex 
operation? How much parallelism are you getting? You would prefer to keep all 
your CPUs busy.

These are a few hints to help you get started. Please feel free to report back 
your findings and perhaps give us a bit more of a description of what you are 
trying to accomplish.


Thanks,
- Paul

[1] https://drill.apache.org/docs/configuring-drill-memory/



 

On Wednesday, March 11, 2020, 5:42:39 AM PDT, Vishwajeet Anantvilas SONUNE 
 wrote:  
 
  
Hi Team,

Learned about Apache Drill that – ‘Drill is memory intensive and therefore 
requires sufficient memory to run optimally. You can modify how much memory 
that you want allocated to Drill. Drill typically performs better with as much 
memory as possible.’

With this I tried allocating as much memory as I could for Drill to run. I’m 
running Drill on my local machine, so I configured JAVA_TOOL_OPTIONS to 8GB as 
an environment variable, which in turn increased the heap memory.

[first image omitted]

While running a query that generates a Parquet file from SQL Server with 
millions of records, Drill uses just 3–4% of heap memory, and there is no 
improvement in performance (no reduction in file generation time).

Can you please let us know if there’s a way to reduce the file generation time?

[second image omitted: Drill web console screenshot]

Please let me know if any further details are required.

Looking forward to your reply.

Thanks
Vishwajeet Sonune
 
  
 







Re: Issue connecting to ECS from Drill

2020-03-12 Thread Paul Rogers
Hi Arun,

Charles commented on the Drill version; here I'll assume you are actually using 
Drill 1.17. Hard to diagnose the issue remotely. Here are a few things to 
double-check.

The message does look odd, do you really have a user named ":doesBucketExist"? 
Not sure where Drill got this user from. Searching the source code, this 
message occurs when Drill does a "do as" with a user defined, it seems, defined 
by the Hadoop proxy user system. Does your core-site.xml file contain the 
definition of a proxy user? See documentation at [1].

Does your core-site.xml file contain the entries from the Drill docs? [2] And 
only those entries (no other entries copied from some other Hadoop-based file?)

Is the file named "core-site.xml" and is it in your Drill conf directory? Or, 
if you are using a separate "site" directory, is it in the site's conf 
directory?

If using core-site.xml, did you remove the corresponding properties from the 
JSON storage plugin config?

Do these same credentials work with the AWS script to access S3?

Thanks,
- Paul

[1] https://drill.apache.org/docs/configuring-inbound-impersonation/

[2] https://drill.apache.org/docs/s3-storage-plugin/ 

On Wednesday, March 11, 2020, 4:38:49 AM PDT, Arun Sathianathan 
 wrote:  
 
 Hi,

We are running Drill 1.7 in embedded mode and trying to connect to ECS using s3 
plugin. We tried putting the config in both the plugin and core-site.xml, but it 
doesn’t work and gives the error below:

Validation error: Failed to create DrillFileSystem for proxy user 
:doesBucketExist on ...

Please can someone give pointers? Appreciate any help. Thanks. 

Regards,
Arun 

Sent from my iPhone  

Re: Drill + Mongo

2020-03-04 Thread Paul Rogers
Hi Ron,

Sounds like the good news is that Drill is about as good as Presto when 
querying Mongo. Sounds like the bad news is that both are equally deficient. On 
the other hand, the other good news is that better performance is just a matter 
of adding additional planning rules (with perhaps some Mongo metadata.)


The Wikipedia page for Mongo [1] suggests several features that Mongo (Simba) 
is probably using in their own JDBC driver, but which Drill probably does not 
use:

* Primary and secondary indices
* Field, range query, and regular-expression searches
* User-defined JavaScript functions
* Three ways to perform aggregation: the aggregation pipeline, the map-reduce 
function, and single-purpose aggregation methods.

My guess is that the Mongo JDBC driver does thorough planning to exploit each 
of the above functions, while Drill may use only a few. We already noted other 
weaknesses in the filter push-down code for the Drill Mongo plugin. Seems 
fixable if we can put in the effort.


Seems Mongo provides a Simba JDBC driver, which is proprietary, so no source 
code is available we could use as a "cheat sheet" to see what's what.


Just out of curiosity, what is the query that works well with the Mongo JDBC 
driver, but poorly with Drill?

Anybody know more about how Mongo works and what Drill might be missing?


Thanks,
- Paul

[1] https://en.wikipedia.org/wiki/MongoDB



 

On Wednesday, March 4, 2020, 9:28:44 PM PST, Ron Cecchini 
 wrote:  
 
 Hi, guys.

This is actually more of a Mongo question than a Drill-specific question as it 
also applies to Presto + Mongo, and the vanilla Mongo shell as well.

I'm asking here, though, because, well, I'm curious, and because you're the 
database geniuses...

So, I essentially get why a NoSQL database, in general, wouldn't be as 
performant as a SQL one at "relational" things.  From what I gather, there are 
denormalization and optimization techniques and tricks you can use to speed up 
a Mongo query and so forth, but my question is:

Why is it that any Drill/Presto + Mongo CLI or JDBC query against a large 
collection (100-200 million documents) that includes even a single WHERE 
clause, or the Mongo equivalent query made via Mongo shell, basically never 
returns and has to be killed, whereas the same (Mongo equivalent) query against 
the same collection made via *Mongo's* JDBC driver takes only a second or two?

Is the Mongo JDBC using some indexing that the others aren't?  (But how would 
that explain Mongo shell's non-performance...  Why doesn't Mongo shell just 
make a JDBC call to the db...)

Thank you in advance for educating me.

Ron
  

Re: [DISCUSS]: Proposed Agenda for Drill Hangout

2020-02-28 Thread Paul Rogers
If we have time, I'd like to outline the "SPI" project I'm working on. This 
project will create a "Service Provider Interface" for add-on code which 
follows Java practices used elsewhere, such as in the JDK, Presto etc.
Another topic is our efforts to straighten out our data model as touched on by 
recent mail list discussions.
I'll prepare a couple of writeups ahead of time to give folks some background.
These are both long-term discussions, so the point will just be to make folks 
aware we can contribute to the design.

Thanks,
- Paul

 

On Friday, February 28, 2020, 12:57:56 PM PST, Charles Givre 
 wrote:  
 
 All, 
Here is a proposed agenda for Drill Hangout:

1.  Improvements to CI tools
2.  Plan for next release. (Yes I know it seems early, but given the amount of 
time the last release took... )
3.  General updates / long term goals.

Thoughts?
-- C
  

Re: Mongo column types

2020-02-27 Thread Paul Rogers
Hi Dobes,

I like your idea! I think it would be a great addition to Drill in general and 
will show the way for other storage plugins.


Technically, you are right, there should not be that much work. We have working 
examples in other plugins of what would be needed. Perhaps the biggest cost is 
just to get familiar with the storage plugin lifecycle, which is not that 
complex, but has a number of moving parts.


Some general thoughts. For the schema, you can use the 
TupleMetadata/ColumnMetadata classes which are how we handle the provided 
schema (and many other tasks.) Pretty simple and there are many examples.

The first step is to pass the schema from the planner to the execution engine 
(reader). Basically, in the GroupScan/SubScan, obtain your schema from wherever 
make sense. Extend the SubScan with an element of type TupleMetadata and set it 
to the schema. That will automatically get serialized to JSON in the execution 
plan and sent to each executor. Then, use this schema so the reader knows what 
type of column to create.


The next step is to actually use the schema when reading data. We have a newer 
framework called EVF ("extended vector framework") which handles much of the 
boilerplate including creating a set of vectors and "column writers" given a 
schema. Mongo, however, uses an older method, but the same ideas should apply.

For JSON, we are working on a newer, EVF-based JSON parser that will support a 
provided schema.

Mongo uses BSON. Looks like it uses a BsonRecordReader class. The way that 
class is structured, it might be a bit tricky to insert the type conversion 
layer. There is a trick we've used elsewhere to convert the in-line case 
statement by type into a set of classes, one per column type, which can do the 
conversion. Once we've made that change, the column "adapter" can use the 
schema to write of the write type.

The fancy solution would be to update the Mongo plugin to use EVF and the 
BsonRecordReader to follow the pattern for the EVF-based parser.  The more 
pragmatic solution is to just wrap the existing solution with a bit more code 
to add the schema; we can do a full upgrade later.


We also have a work-in-progress filter push-down framework which can improve 
Mongo's ability to push what it can, leave the rest to Drill. That way, if you 
can push into Mongo things like a date range, student ID or whatever, then more 
advanced predicates can stay in Drill.

Let me know where we can help with suggestions, answering questions, pointers 
to examples, tackling some of the trickier bits, or whatever.

Thanks,
- Paul

 

On Thursday, February 27, 2020, 6:21:46 PM PST, Dobes Vandermeer 
 wrote:  
 
 Hi Paul,

After looking at the mongo stuff a bit more today I realized that probably the 
simplest solution would be to put some kind of schema mapping into the storage 
plugin configuration. Some subset of JSON schema using the drill config syntax.

What do you think of this idea? Might actually be something that can be 
implemented pretty quickly.
On Feb 26, 2020, 1:36 PM -0800, Dobes Vandermeer , wrote:
> Hi Paul,
>
> You can always model a type union as a struct with a nullable field for each 
> variant.  So JSON would be a struct with nullable boolean, double, object, 
> array fields.  This would work for JDBC as well.  As you have pointed out, 
> this does have negative performance implications.
>
> Operators would have to be updated to support the JSON data type, it's true, 
> and there's some complexity in terms of type promotions, the sorting of 
> heterogeneous types, and so forth.
>
> Our data isn't really unstructured but we typically omit fields if they are 
> on a default value, to save space.  And we have fields that might be null, an 
> int, or a double.  Would need Drill to support this scenario without relying 
> on a double or int specifically showing up in some sampling of records in 
> order to determine the data type of that column.
>
> We aren't allowing our clients to write any SQL.  This is just for our 
> purposes trying to run reports in a scalable manner.  MongoDB doesn't scale 
> horizontally or dynamically, and writing reports in MongoDB is quite painful. 
>  SQL is much better for this.
>
> If SQL isn't suitable for use with MongoDB/JSON that's fine in general, but 
> for me at least I was thinking of using Apache Drill to allow SQL against 
> both MongoDB, JSON, and Parquet data together.  Otherwise there are probably 
> more mature products that just target purely relational data or Parquet files.
>
>
>
> > On 2/26/2020 11:44:09 AM, Paul Rogers  wrote:
> > Hi Dobes,
> >
> > You have a very interesting use case. Let's explore a bit further.
> >
> > You are right that it should be possible to build a query engine on top of 
> > Union/Variant (JSON object) types. MapR has

Re: Patterns for data updating?

2020-02-27 Thread Paul Rogers
Hi Dobes,


Also, if Ted is still lurking on this list, he's an expert at this stuff. Here 
are some patterns I've seen.


What you describe is a pretty standard pattern. Substitute anything for 
"scores" (logs, sales, clicks, GPS tracking locations) and you find that many 
folks have solved the same issue.

The general approach is either: single system (some kind of DB), or a two-part 
system (lambda architecture [1]) such as you described earlier.

As you've discovered, Parquet is designed to be an archive format, not a live 
update format. As you've seen, considerable work goes into write time to 
organized data into columns row groups, footers, etc. to speed up read access. 
But, as a result, Parquet cannot be updated incrementally.

A common pattern is to write data into Kafka as it arrives. Then, run your ETL 
step every so often (once a day, say.) In the meantime, stream data to a 
short-term store. For scores, each (student, exercise) pair is unique, so you 
probably don't have to accumulate multiple events for the same exercise or test?

Depending on your constraints and budget, you can accumulate current activity 
into a DB, an in-memory store, etc. Or, if you can find a good partitioning 
scheme that keeps each Kafka partition relatively small, you can even query 
Kafka directly (Drill has a Kafka plugin). In this model, Kafka would hold, 
say, today's data and Parquet would hold past data (partitioned in some way so 
you scan only one Parquet file per day per school, say.)
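
A rough sketch of that split (the plugin names, topic, path and columns are all
assumptions):

SELECT student_id, score, event_ts
FROM dfs.`/warehouse/scores`   -- history, exported to Parquet
WHERE event_ts < CURRENT_DATE
UNION ALL
SELECT student_id, score, event_ts
FROM kafka.`scores`            -- today's messages, JSON in Kafka
WHERE event_ts >= CURRENT_DATE;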

The Kafka component is essential for many reasons including those above. It 
also gives you a rock-solid guarantee that you never lose data even if your ETL 
fails and so on.


So:

External source --> Kafka --> ETL --> Parquet

And

... Kafka --> "Current Data" Cache

Both can then be queried by Drill to combine results as you described in an 
earlier message.


The best reference I've seen for this kind of thing is Designing Data-Intensive 
Applications by  Martin Kleppmann (avoid the knock-off with nearly the same 
title.)

Thanks,
- Paul


[1] http://lambda-architecture.net/
[2] 
https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781491903063/
 

On Thursday, February 27, 2020, 11:37:45 AM PST, Dobes Vandermeer 
 wrote:  
 
 Hi,

I am trying to figure out a system that can offer both low latency in 
generating reports, low latency between data being collected and being 
available in the reporting system, and avoiding glitches and errors.

In our system users are collecting responses from their students.  We want to 
generate reports showing student scores over time, and the scores should 
typically be available within a minute of being collected.

I am looking to store the data tables in parquet on S3 and query them using 
drill.

However, updating parquet files can be a bit troublesome.  The files cannot 
easily be appended to.  So some process has to periodically re-write the 
parquet files.  Also, we don't want to have hundreds or thousands of separate 
files, as this can slow down query executing.  So we don't want to end up with 
a new file every 10 seconds.

What I have been thinking is to have a process that runs which writes changes 
fairly frequently to small new files and another process that rolls up those 
small files into progressively larger ones as they get older.

When querying the data I will have to de-duplicate and keep only the most 
recent version of each record, which I think is possible using window 
functions.  Thus the file aggregation process might not have to worry about 
having the exact same row in two files temporarily.
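
A minimal sketch of that de-duplication, assuming each record carries an id and
an updated_at column:

SELECT id, score, updated_at
FROM (
  SELECT id, score, updated_at,
         ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) AS rn
  FROM dfs.`/warehouse/scores`
) t
WHERE t.rn = 1;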

I'm wondering if anyone has gone down this road before and has insights to 
share about it.
  

Re: Mongo column types

2020-02-26 Thread Paul Rogers
Hi Dobes,

MHO, you are on the right track with the design. If your JSON and Mongo 
represents tabular data, Drill should work for you (even if we have to fix a 
few things.)

The Union-as-struct idea is pretty much what Drill does. The actual 
implementation is a vector for each type, plus another that gives the type (or 
null) and tells Drill which vector to consult. As you noted, JDBC presents the 
data as a Java map. When using Sqlline, the tool simply calls "toString()" on 
the resulting object (String, Double, Map, etc.). This trick makes it look as 
though everything "just works." But, a typed client won't see the field as Int 
or Double, it would see a Java Map.

We've also discussed leveraging Drill's Java Object vector support to store the 
JSON struct directly: as a null (Object), or String or Integer or whatever. For 
objects, we convert to a Map in the usual way. Users (such as yourself) could 
write UDF functions to work with this data using Java code to bypass SQL 
weaknesses. You can even write a function that resolves the Java object to a 
SQL scalar for xDBC clients.

The "omit defaults" compression is a good idea. We have to tell what those 
defaults are, however, The provided schema supports this idea: you can provide 
a default value for this use case: if no value is available, Drill uses the 
default value instead. This occurs even if the column does not appear at all in 
some data set.

If you eventually need to store arbitrary JSON, you'll need some more general 
purpose tool (XPath? Is there a JSON equivalent?) But, then, for arbitrary 
JSON, one does not expect to chart the results in a BI tool using SQL.


Let's see if we can work through the issues to get your tabular use case to 
work.


Thanks,
- Paul

 

On Wednesday, February 26, 2020, 1:36:20 PM PST, Dobes Vandermeer 
 wrote:  
 
 Hi Paul,

You can always model a type union as a struct with a nullable field for each 
variant.  So JSON would be a struct with nullable boolean, double, object, 
array fields.  This would work for JDBC as well.  As you have pointed out, this 
does have negative performance implications.

Operators would have to be updated to support the JSON data type, it's true, 
and there's some complexity in terms of type promotions, the sorting of 
heterogeneous types, and so forth.

Our data isn't really unstructured but we typically omit fields if they are on 
a default value, to save space.  And we have fields that might be null, an int, 
or a double.  Would need Drill to support this scenario without relying on a 
double or int specifically showing up in some sampling of records in order to 
determine the data type of that column.

We aren't allowing our clients to write any SQL.  This is just for our purposes 
trying to run reports in a scalable manner.  MongoDB doesn't scale horizontally 
or dynamically, and writing reports in MongoDB is quite painful.  SQL is much 
better for this.

If SQL isn't suitable for use with MongoDB/JSON that's fine in general, but for 
me at least I was thinking of using Apache Drill to allow SQL against both 
MongoDB, JSON, and Parquet data together.  Otherwise there are probably more 
mature products that just target purely relational data or Parquet files.



On 2/26/2020 11:44:09 AM, Paul Rogers  wrote:
Hi Dobes,

You have a very interesting use case. Let's explore a bit further.

You are right that it should be possible to build a query engine on top of 
Union/Variant (JSON object) types. MapR has done so with MapRDB. Sounds like 
Mongo has also (I'll make a note to read up on Mongo.) Drill has implemented 
some of these ideas in bits and pieces. However we've hit several problems due 
to the SQL and columnar nature of Drill.


First, SQL is not designed for this use case: every operator and function has 
to be rethought. (For example, a sort has see only INT values and has been 
using an INT comparison. Now, the column becomes a DOUBLE, so we need to add a 
DOUBLE comparison, DOUBLE/INT comparison, and logic to combine the two. Vectors 
that were pure INT in previous sorted runs now have to change to become a UNION 
of either INT or DOUBLE so we can merge runs of INTs with runs of DOUBLEs.)

Second, UNIONs are very inefficient with the extra storage, type checks and so 
on. So, they need to be used only where needed else we violate the #1 concern 
for any query engine: Performance. So, code complexity vastly increases.

Third, neither JDBC nor ODBC understand UNION (Variant). BI tools don't 
understand UNION. Yet, these are Drill's primary clients. So, somebody has to 
convert the UNION into something that xDBC can understand. Should this be 
client code (which, I think, can't even be done for ODBC) or should Drill do 
it? And, if we've only ever seen untyped nulls, we could not convert to the 
correct type, even in principle, because we don't know it (unless we enforce 
the CAST push-down from the other discussion.)


If Drill doe

Re: Mongo filter push-down limitations?

2020-02-26 Thread Paul Rogers
Hi Dobes,

Good points as always. The way open source projects like Drill improve is to 
understand use cases such as yours, then implement them.

We discussed some of Drill's join optimizations, which, if added to the Mongo 
plugin, would likely solve your join problem. The process you describe is 
typical of an RDBMS: the optimizer notices that the cheapest path is to do a 
row-key lookup per join key (using a B-tree in an RDBMS, say). This was 
implemented for MapRDB, and can be added for Mongo.

On the deletion issue, one handy trick is to not actually delete a question: 
just mark it as deleted. That way, you can always compute a score, but if 
someone asks "which questions are available", only those not marked as deleted 
appear. Else, you might find you have some tricky race conditions and 
non-repeatable queries.
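
A sketch of how that plays out in the score report (the table, column and flag
names here are just guesses):

SELECT a.student_id, q.tag, AVG(a.score) AS avg_score
FROM dfs.`/warehouse/answers` a
JOIN mongo.formative.questions q ON a.question_id = q._id
WHERE q.deleted_at IS NULL                 -- "deleted" questions stay in place
  AND a.created_at >= DATE '2020-02-01'
GROUP BY a.student_id, q.tag;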

Thanks,
- Paul

 

On Wednesday, February 26, 2020, 11:37:22 AM PST, Dobes Vandermeer 
 wrote:  
 
 Hi Paul,

In my case I was looking to union and join.  I was thinking of using a join to 
build up a sort of filter on the parquet data based on the user's query.

Example:

We have "tags" that can be applied to each question, and we want a report of 
each student's average score per tag for a given time period.  Questions can 
also be deleted and we have to verify that a question is not deleted before 
including it in the score.

So, we will scan the answers table, but filtering on whether a question is 
deleted, and grouping on the tags on each question, then take an average.

It seems like the way drill functions currently, if I wanted to get a 
question's tags and deleted status from mongodb, drill will load the entire 
mongodb collection of questions, which is too slow.

What I had hoped for is that drill would be able to scan the answers and gather 
up a list of question ids of interest and query mongodb for those questions 
only with some kind of grouping.

As for the union, I was also hoping that I would be able to pull the most 
recent answers from mongodb and union those with the ones from S3 parquet 
files.  However, my brief test trying to query answers from mongodb via drill 
showed it trying to load the entire collection.

My feeling at the moment is that Drill is not very useful for combining mongodb 
data with other data sources because I will constantly run into times where I 
accidentally pull down the entire collection, and also times where it gives an 
error if the data does not conform to a fixed tabular schema.



On 2/26/2020 12:04:38 AM, Paul Rogers  wrote:
Hi Dobes,

Sounds like the Mongo filter push-down logic might need some TLC.

You describe the classic "lambda" architecture: historical data in system A, 
current data in system B. In this case, it would be more of a union than a 
join. Drill handles this well. But, the user has to know how to write a query 
that does the union.

At a prior job, we wrote a front end that rewrote queries so the user just asks 
for, say, "testScores", and the rewrite layer figures that, for a time range of 
more than a week ago, go to long-term storage, else go to current storage. If 
current storage is faster, then, of course, some customers want a longer 
retention period in current storage to get faster queries. This means that the 
cut-off point is not fixed: it differs per customer (or data type.)

Would be cool to do this logic in Drill itself. Can probably even do it today 
with a cleverly written storage plugin that, during planning, rewrites itself 
out of the query in favor of the long-term and short-term data sources. 
(Calcite, Drill's query planner, is quite flexible.)


Once Drill has data, it can join it with any other data source. Drill comes 
from the "big data, scan the whole file" tradition, so the most basic join 
requires a scan of both tables. There is "partition filter push-down" for 
directories which works on each table individually. There is also a 
"join-predicate push-down" (JPPD) feature added a while back. A couple of years 
ago, Drill added the ability to push keys into queries (as would be done for an 
old-school DB with indexes.)

I believe, the Mongo plugin was done before most of the above work was added, 
so there might need to be work to get Mongo to work with these newer features.


Thanks,
- Paul



On Tuesday, February 25, 2020, 10:23:59 PM PST, Dobes Vandermeer wrote:

Hi Paul,

A simple filter I tried was: WHERE createdAt > TIMESTAMP "2020-02-25"

This wasn't pushed down.

I think I recall doing another query where it did send a filter to MongoDB so I 
was curious what I could expect to be applied at the mongodb level and what 
would not.

Would drill be able to do joins between queries where it pushes down filters 
for the elements that were found?  By the sounds of it, this may be quite far 
off, which does reduce Drill's appeal vs competitors to some degree.

I had hoped that Drill could intelligen

Re: Mongo column types

2020-02-26 Thread Paul Rogers
Hi Dobes,

You have a very interesting use case. Let's explore a bit further.

You are right that it should be possible to build a query engine on top of 
Union/Variant (JSON object) types. MapR has done so with MapRDB. Sounds like 
Mongo has also (I'll make a note to read up on Mongo.) Drill has implemented 
some of these ideas in bits and pieces. However we've hit several problems due 
to the SQL and columnar nature of Drill.


First, SQL is not designed for this use case: every operator and function has 
to be rethought. (For example, a sort has seen only INT values and has been 
using an INT comparison. Now, the column becomes a DOUBLE, so we need to add a 
DOUBLE comparison, DOUBLE/INT comparison, and logic to combine the two. Vectors 
that were pure INT in previous sorted runs now have to change to become a UNION 
of either INT or DOUBLE so we can merge runs of INTs with runs of DOUBLEs.)

Second, UNIONs are very inefficient with the extra storage, type checks and so 
on. So, they need to be used only where needed else we violate the #1 concern 
for any query engine: Performance. So, code complexity vastly increases.

Third, neither JDBC nor ODBC understand UNION (Variant). BI tools don't 
understand UNION. Yet, these are Drill's primary clients. So, somebody has to 
convert the UNION into something that xDBC can understand. Should this be 
client code (which, I think, can't even be done for ODBC) or should Drill do 
it? And, if we've only ever seen untyped nulls, we could not convert to the 
correct type, even in principle, because we don't know it (unless we enforce 
the CAST push-down from the other discussion.)


If Drill does the UNION-to-type conversion, why not do it early rather than 
late to avoid all the complexity?

Sounds like you may live in a world where it is JSON in, JSON out, JSON-aware 
clients and you need a JSON-aware query engine. That query engine cannot be 
based on (standard) SQL. Might be based on SQL++ or a JSON-extended SQL which 
does not try to work with xDBC.

Likely, elsewhere in your project, there is someone else (maybe it is you) 
trying to convert runs of untyped JSON nulls into a typed Parquet file that 
matches the case where those JSON fields have a type. Now you'd need to 
convince Parquet, and all its consumers (Amazon Athena, say) to support 
ambiguous types and unions. What about your ML group that now has to have 
feature vectors of mixed types? Right about now a little voice might be saying, 
"ah, maybe we're on the wrong track."


One could argue that the education market is looking for a simple solution: one 
that uses standard (SQL-based) tools. Rather than having to reinvent the entire 
stack to work around a flaw in JSON encoding (untyped nulls), it might be 
better to fix the JSON bug so you can leverage the rich ecosystem of SQL-based 
tools.

Can you explain a bit more the source and use of the data? Will either the 
source or the user expect typed data? How might you handle type ambiguity in 
parts of your app other than query?


Thanks,
- Paul

 

On Wednesday, February 26, 2020, 8:14:56 AM PST, Dobes Vandermeer 
 wrote:  
 
 Hi Paul,

You can, of course, represent a JSON value as a union of the various types, and 
your questions are already well answered in mongo and elsewhere.

Typically numeric operations (sum) will fail if values are not all numbers. 
Numbers are promoted to double if they are not all double.

Group by can just compare each field separately and its presence or absence, no 
problem there.

Conversion to jdbc can use whatever system is already used for objects. Any 
variant not populated is null. If they want a more concise representation then 
CAST will do a best effort conversion or fail.

For a chart you would CAST the data points to double.

I don't personally see any major roadblock there.
On Feb 25, 2020, 11:50 PM -0800, Paul Rogers , wrote:
> Hi Dobes,
>
> Looking at the BSON spec [1], it seems that BSON works like JSON: a null 
> isn't a null of some type, it is just null. So, if Drill sees, say, 10 nulls, 
> and has to create a vector to store them, it doesn't know which type to use. 
> In fact, if BSON worked any other way, there could not be a 1:1 translation 
> between the two.
>
>
> This contrasts with the SQL model where a NULL is a NULL INT or a NULL 
> VARCHAR; never just a NULL. Makes sense: all values in a SQL column must be 
> of the same type so NULLs are of that column type.
>
> And here we have the fundamental conflict between SQL and JSON. If I have a 
> field "c" in three JSON objects, JSON does not care if those three fields 
> have the same type or wildly differing types (null, Integer, object, say.) 
> SQL, however, gets its convenience from types being implicit and constant: 
> the data dictionary gives the types so the user does not have to.
>
> Drill has a UNION type (called a "Variant" in some DBs) that

Re: Mongo column types

2020-02-26 Thread Paul Rogers
Hi Dobes,

I had a similar thought: perhaps we can implement "type push-down": use the 
CAST to implicitly declare a provided schema for a column. In your query:

SELECT _id, CAST(points AS DOUBLE)
FROM mongo.formative.answers AS answer
WHERE answer.createdAt > DATE_SUB(current_timestamp, interval '1' day)
LIMIT 100

A normal SQL planner runs type propagation from column types to expressions to 
result columns.
For "type push-down", we would run the planner type propagation "backwards" to 
say that `points` must be a DOUBLE and `current_timestamp` must be a DATE, and 
thus so must `answer.createdAt`. These types would then be pushed into the scan 
so that if the scan sees NULLs for any of these types, it knows the type of the 
NULL. In this case, we have no information about the type of _id, which is OK 
because _id sounds like something that would never be NULL.

This does not help if `points` is read as an INT: now every reader would have 
to be prepared for any type conversion the user dreams up. This is, in fact, 
the problem I've been fighting in a current PR about type conversions. We end 
up not only with CAST implementations, but a separate implementation of the 
CAST logic in each reader. I'm actually pretty skeptical that the approach is a 
much of a win.


While this might be a clever hack, and can probably be implemented, it pushes 
all the complexity onto your user. If your target market is primary or 
secondary education, you may have a hard time convincing users or app 
developers to include the needed boilerplate. That is, while the above would 
work, the following would not:

SELECT _id, points ...

Further, your users are not likely going to write SQL. (In my years working 
with my local schools, I never found a single person who could write SQL, 
usually not even the IT staff.) Instead, the users will use some front-end: 
your hosted UI, a custom web UI, a BI tool, etc.. Those tools won't know to 
insert the cast. You could require the web UI people to do so, but not a BI 
tool vendor.

Each of these audiences would observe, rightly, that in the SQL world, the 
right place for type information is in the data dictionary so that UI 
engineers, BI tool vendors, ad-hoc query writers don't all have to keep the DD 
in their heads and write the correct CAST statements.

At least, that's been my (limited) experience. Is there something unique or 
clever about your app where pushing type info into the query would be practical?

We've been chipping away at a generalized data dictionary ("metastore") so that 
you can connect Drill to whatever type registry you might have. Not quite ready 
for prime-time, but close.


Drill has another trick: if we did do "type push-down", you could write a view 
on top of each table that includes all the casts. This is described in the 
Drill docs, on the mailing list, and in our Drill book. Problem is, now instead 
of maintaining a DD, you maintain a set of views. I've yet to be convinced that 
this is an improvement.
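
For concreteness, such a view would look something like this (the column types
are just an example):

CREATE OR REPLACE VIEW dfs.tmp.answers_v AS
SELECT CAST(_id AS VARCHAR)         AS `_id`,
       CAST(points AS DOUBLE)       AS points,
       CAST(createdAt AS TIMESTAMP) AS createdAt
FROM mongo.formative.answers;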


Would views + type-push-down (+ maybe hints) give you what you want without 
having some form of data dictionary?


Thanks,
- Paul

 

On Wednesday, February 26, 2020, 8:06:20 AM PST, Dobes Vandermeer 
 wrote:  
 
 Shower thought:

If you wrote types into the where clause as filters it would both declare the 
type and give permission to skip anything that doesn't match that type. This 
filter can even be pushed down to the mongo query using the $type operator.

It would still have to support nullable values, though.

Do you think this idea has potential?
On Feb 25, 2020, 11:50 PM -0800, Paul Rogers , wrote:
> Hi Dobes,
>
> Looking at the BSON spec [1], it seems that BSON works like JSON: a null 
> isn't a null of some type, it is just null. So, if Drill sees, say, 10 nulls, 
> and has to create a vector to store them, it doesn't know which type to use. 
> In fact, if BSON worked any other way, there could not be a 1:1 translation 
> between the two.
>
>
> This contrasts with the SQL model where a NULL is a NULL INT or a NULL 
> VARCHAR; never just a NULL. Makes sense: all values in a SQL column must be 
> of the same type so NULLs are of that column type.
>
> And here we have the fundamental conflict between SQL and JSON. If I have a 
> field "c" in three JSON objects, JSON does not care if those three fields 
> have the same type or wildly differing types (null, Integer, object, say.) 
> SQL, however, gets its convenience from types being implicit and constant: 
> the data dictionary gives the types so the user does not have to.
>
> Drill has a UNION type (called a "Variant" in some DBs) that can hold, say, a 
> NULL a FLOAT8 and an object (MAP). Perfect! But, how do you sum, sort or 
> group by this monster? How do you send the results to JDBC or ODBC? How to 
> you chart a UNION?
>
>
> Tryi

Re: Apache Drill: cast boolean to number

2020-02-26 Thread Paul Rogers
Hi,
I think this should also work:

IF(foo, 1, 0)

That said, the question was about the internal casting rules. As you've seen, a 
BIT is actually stored as a 1-bit integer internally (via the BitVector). One 
the one hand, it seems possible to add the needed implementations so that 
casting works.

On the other hand, I wonder if that would cause undesired interference with the 
SQL Boolean rules: maybe there is some place where the ability to treat a Bit 
as both an INT and a BOOLEAN causes ambiguities in the planner or in run-time 
type resolution code.

Still, easy enough to add the rule and rerun the unit tests to see what's what.

Thanks,
- Paul

 

On Wednesday, February 26, 2020, 6:16:52 AM PST, Charles Givre 
 wrote:  
 
 Hi There, 
Thanks for your interest in Drill.  Can you please explain a bit more about 
what you're trying to do?  My initial thought is to use a CASE statement rather 
than a CAST function.  IE:

SELECT ...
(CASE 
  WHEN foo = true THEN 1
  ELSE 0
END)
FROM dfs...

Best,
--C 

> On Feb 26, 2020, at 9:03 AM, Зиновьев Олег  wrote:
> 
> Hello.
> 
> I am confused with the implicit Drill cast rules for BIT data type. According 
> to TypeCastRules.java there is the possibility of conversion between BIT and 
> INT (and other number type), e.g. cast(true as int). However, when trying to 
> perform such a conversion, the error "Missing function implementation: 
> [castBIGINT (BIT-REQUIRED)]" occurs.
> 
> As far as I understand, Drill scans the list of functions for conversion (in 
> my case it is castINT) and tries to find the chain of transformations that is 
> the least expensive. However, since there is no direct BIT -> INT conversion 
> function, it selects the function with the least difference in 
> ResolverTypePrecedence.precedenceMap and tries to add an additional cast for 
> the argument (In my case it is [castINT (BIGINT-REQUIRED), since 
> TypeCastRules.java also allows conversion BIT -> BIGINT). At the time of 
> searching for the argument conversion function, the error mentioned above 
> occurs.
> 
> questions:
> 1) Is it correct that BIT -> Number conversions are allowed in TypeCastRules?
> 2) Was the conversion supposed to be done through TinyInt? However, for this 
> type I could not find the conversion functions to other types of numbers.
> 3) Is the inability to convert BIT to a number an error? Or is this the correct 
> behavior?
  

Re: Mongo filter push-down limitations?

2020-02-26 Thread Paul Rogers
Hi Dobes,

Sounds like the Mongo filter push-down logic might need some TLC.

You describe the classic "lambda" architecture: historical data in system A, 
current data in system B. In this case, it would be more of a union than a 
join. Drill handles this well. But, the user has to know how to write a query 
that does the union.
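
As a rough sketch, the hand-written union looks something like this (I'm
inventing the plugin, workspace and column names; the cut-off date is whatever
boundary you sync on):

SELECT studentId, points, createdAt
FROM s3.archive.`answers`            -- historical data landed as Parquet
WHERE createdAt < DATE '2020-02-01'
UNION ALL
SELECT studentId, points, createdAt
FROM mongo.formative.answers         -- live data still in Mongo
WHERE createdAt >= DATE '2020-02-01'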

At a prior job, we wrote a front end that rewrote queries so the user just asks 
for, say, "testScores", and the rewrite layer figures that, for a time range of 
more than a week ago, go to long-term storage, else go to current storage. If 
current storage is faster, then, of course, some customers want a longer 
retention period in current storage to get faster queries. This means that the 
cut-off point is not fixed: it differs per customer (or data type.)

Would be cool to do this logic in Drill itself. Can probably even do it today 
with a cleverly written storage plugin that, during planning, rewrites itself 
out of the query in favor of the long-term and short-term data sources. 
(Calcite, Drill's query planner, is quite flexible.)


Once Drill has data, it can join it with any other data source. Drill comes 
from the "big data, scan the whole file" tradition, so the most basic join 
requires a scan of both tables. There is "partition filter push-down" for 
directories which works on each table individually. There is also a 
"join-predicate push-down" (JPPD) feature added a while back. A couple of years 
ago, Drill added the ability to push keys into queries (as would be done for an 
old-school DB with indexes.)

I believe, the Mongo plugin was done before most of the above work was added, 
so there might need to be work to get Mongo to work with these newer features.


Thanks,
- Paul

 

On Tuesday, February 25, 2020, 10:23:59 PM PST, Dobes Vandermeer 
 wrote:  
 
 Hi Paul,

A simple filter I tried was: WHERE createdAt > TIMESTAMP "2020-02-25"

This wasn't pushed down.

I think I recall doing another query where it did send a filter to MongoDB so I 
was curious what I could expect to be applied at the mongodb level and what 
would not.

Would drill be able to do joins between queries where it pushes down filters 
for the elements that were found?  By the sounds of it, this may be quite far 
off, which does reduce Drill's appeal vs competitors to some degree.

I had hoped that Drill could intelligently merge historical data saved as 
parquet with the latest data in mongodb, giving a kind of hybrid reporting 
approach that gives current data without overloading mongodb to pull millions 
of historical records.  However, it sounds like this is not supported yet, and 
likely won't be for some time.
On 2/25/2020 8:19:19 PM, Paul Rogers  wrote:
Hi Dobes,

Your use case is exactly the one we hope Drill can serve: integrate data from 
multiple sources. We may have to work on Drill a bit to get it there, however.

A quick check of Mongo shows that it does implement filter push down. Check out 
the class MongoPushDownFilterForScan. The details appear to be in 
MongoFilterBuilder. This particular implementation appears to be rather 
limited: it seems to either push ALL filters, or none. A more advanced 
implementation would push those it can handle, leaving the rest to Drill.


There may be limitations; it depends on what the plugin author implemented. 
What kind of query did you do where you saw no push-down? And, how did you 
check the plan? Using an EXPLAIN PLAN FOR ... command? If filters are, in fact, 
pushed down, there has to be some trace in the JSON plan (in some 
Mongo-specific format.)

Given the all-or-nothing limitation of the Mongo plugin implementation, maybe 
try the simplest possible query such as classID = 10.
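
To see whether that filter made it into the Mongo scan, run something like the
following (substitute your own collection path) and look for the filter inside
the Mongo group scan in the JSON plan:

EXPLAIN PLAN FOR
SELECT * FROM mongo.school.`students` WHERE classID = 10;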


Filter push-down is a common operation; most implementations are currently 
(incomplete) copy/pastes of other (incomplete) implementations. We're working 
to fix that. We had a PR for the standard (col RELOP const) cases, but reviewers 
asked that it be made more complete. The PR does handle partial filter 
pushdown. Perhaps, as we move forward, we can apply the same ideas to Mongo.

Thanks,
- Paul



On Tuesday, February 25, 2020, 5:27:53 PM PST, Dobes Vandermeer wrote:

Hi,

I am trying to understand drill's performance and how we can best use it for our 
project. We use mongo as our primary "live" database and I am looking at 
syncing data to Amazon S3 and using Drill to run reports off of that.

I was hoping that I could have Drill connect directly to mongo for some things.

For example: Our software is used to collect responses from school classroom. I 
thought if I was running a report for students in a given class, I could build 
the list of students at a school using a query to mongodb.

I wanted to verify that drill would push down filters when doing a join, maybe 
first collecting a list of ids it is interested and use that as a filter when 
it scans the next mongo collection.

Re: Mongo column types

2020-02-25 Thread Paul Rogers
Hi Dobes,

Looking at the BSON spec [1], it seems that BSON works like JSON: a null isn't 
a null of some type, it is just null. So, if Drill sees, say, 10 nulls, and has 
to create a vector to store them, it doesn't know which type to use. In fact, 
if BSON worked any other way, there could not be a 1:1 translation between the 
two.


This contrasts with the SQL model where a NULL is a NULL INT or a NULL VARCHAR; 
never just a NULL. Makes sense: all values in a SQL column must be of the same 
type so NULLs are of that column type.

And here we have the fundamental conflict between SQL and JSON. If I have a 
field "c" in three JSON objects, JSON does not care if those three fields have 
the same type or wildly differing types (null, Integer, object, say.) SQL, 
however, gets its convenience from types being implicit and constant: the data 
dictionary gives the types so the user does not have to.

Drill has a UNION type (called a "Variant" in some DBs) that can hold, say, a 
NULL a FLOAT8 and an object (MAP). Perfect! But, how do you sum, sort or group 
by this monster? How do you send the results to JDBC or ODBC? How to you chart 
a UNION?


Trying to get SQL to act like JSON (or vice versa) has been an ongoing conflict 
for the life of the Drill project (and, for me, with a BI tool before that.)


The best we can do is to say that Drill works on the subset of JSON which 
represents a tabular structure (same types on fields of the same name across 
objects.) That is, Drill works if the JSON was created by the equivalent of a 
SQL query (with nested structure to "un-join" tables.) But, we still need to 
deal with untyped nulls.

The "sample the first batch" approach only works if we can guarantee a non-null 
value appears in the first batch, always. I'd not bet on it. A query of one row 
("Johnny's result from the test on Tuesday" when Johnny was absent and so 
"points" is null) will confuse the JDBC client programmed to expect a (possibly 
null) double.


What we need is a way to know that "when column `points` does appear, it will 
be a FLOAT8". This is what the "provided schema" does: you can specify just the 
problematic columns. For files, we put the schema file in the table directory. 
This is very light-weight: when a conflict occurs, add a hint to make the 
problem go away.
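
For a file-based table, the hint is a one-liner along these lines (table and
column names made up; today the provided schema is honored mainly by the text
readers):

CREATE OR REPLACE SCHEMA (points DOUBLE) FOR TABLE dfs.data.`answers`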

But, there is no directory in Mongo where we can stash a schema file. So, what 
we need is some other place to store that annotation. Maybe something 
associated with the storage plugin for something like Mongo: "for this Mongo 
server, `points` is FLOAT8." Or something fancier if, say, students, teachers, 
assignments and tests are all just objects (with some type field) in a single 
Mongo "table."


Thinking a bit more broadly, I'd guess that the data comes from somewhere, 
perhaps a web form. If so, then the UI designer certainly knew what type she 
was gathering. If it is a DB extract, the original source had a type. Scantron? 
Has a type. PowerSchool/SchoolWires/Moodle/Whatever likely store data in a DB, 
so it has a type. If you ETL to Parquet for storage in S3, Parquet needs a 
type. So, somebody knows the type. We just have to let Drill in on the secret.


Would some kind of hint-based model work for your use case? Some idea that 
would be better?


Thanks,
- Paul

[1] http://bsonspec.org/spec.html


 

On Tuesday, February 25, 2020, 10:17:56 PM PST, Dobes Vandermeer 
 wrote:  
 
 Hi Paul,

It seems to me that the type of the columns in a JSON file is "JSON" - e.g. 
map, array, number, string, null, or boolean.  In mongodb it is "BSON", which 
adds dates, integers, and a few other things.

Lacking further guidance from the user, I would expect drill to handle all JSON 
& BSON columns as if they could hold any of those types at any time.  It 
definitely should not distinguish between integers and floats in JSON, because 
JSON does not have this distinction.

I suppose this may seem like a pain, though; perhaps it blows up the algorithms 
drill uses.  I'm still new to drill so I don't really understand all the 
implications of it.  But I do know that this *is* the true data model of JSON & 
BSON.  Trying to lockdown the schema will create impedance mismatches.

Unless this reality is accepted then the pain will never end, I suspect.


If the user does a CAST() on some values then the output of the CAST operation 
can be assumed to be specified type, or there will be an error.  Perhaps 
there's some hope in that direction.




On 2/25/2020 8:05:37 PM, Paul Rogers  wrote:
Hi Dobes,

You've run into the classic drawback of runtime schema inference: if Drill 
never sees a column value in its first sample, then it has no way to "predict 
the future" and guess what type will eventually show up. So, Drill guesses 
"nullable INT" which turns out to almost always be wrong.

Some record readers pick the type on the very first row (a sample size of 1.) 
The newer JSON reader we're working on uses the first batch (a few thousand 
rows) as its sample size.

Re: JDBC driver for Java 7 vesion

2020-02-25 Thread Paul Rogers
Hi Prabhakar,

One more thought if you can't upgrade your app server to Java 8, and if 
back-porting Drill to Java 7 is not practical. As it turns out, all versions of 
Java are compatible with your network. So, perhaps you can use a network 
connection to bridge the two.

If your queries are of modest complexity, then the REST API is an option. The 
REST API returns the entire data set in a single response and so works for 
result sizes up to, say, several thousand rows.


A more advanced option is to create a specialized micro-service that speaks 
some Java 7 compatible format on one side (Thrift? Protobuf? REST?) and uses 
Drill's JDBC or client API to talk to Drill on the other side. The 
micro-service would be written in Java 8 so it can work with Drill. This works 
if you have limited needs (run queries, get back results). It would be a bit 
too much to ask to implement an entire JDBC shim (though Calcite Avatica is 
supposed to provide most of it for you.)

Thanks,
- Paul

 

On Tuesday, February 25, 2020, 12:24:25 AM PST, Prabhakar Bhosaale 
 wrote:  
 
 Thanks Paul, that helps a lot.

Regards
Prabhakar

On Tue, Feb 25, 2020 at 1:30 PM Paul Rogers 
wrote:

> Hi Prabhakar,
>
> As it turns out, Drill is built for Java 8-13, but we've not built for
> Java 7 in quite some time. (Java 7 reached end of life several years back.)
>
> That said, you can try to clone the project sources and do a build.
> Unfortunately, the JDBC driver tends to use quite a bit of Drill's
> internals and so has a rather large footprint, some of which is likely to
> depend on Java 8.
>
> Further, Drill depends on a large number of libraries, all of which have
> likely been upgraded to Java 8. You'd have to find old Java 7 versions, and
> then figure out how to change Drill code to work with those old versions.
>
> One might well ask, is it possible for you to upgrade to a supported Java
> version? Drill must not be the only library where you have the Java 8
> dependency problem.
>
>
> Thanks,
> - Paul
>
>
>
>    On Monday, February 24, 2020, 11:31:15 PM PST, Prabhakar Bhosaale <
> bhosale@gmail.com> wrote:
>
>  Hi All,
> We are using drill 1.16.0 and we are trying to create JDBC datasource  on
> WAS8.5 with java7. we are getting following error.
> "exception: java.sql.SQLException: java.lang.UnsupportedClassVersionError:
> JVMCFRE003 bad major version; class=org/apache/drill/jdbc/Driver, offset=6"
>
> So where can i get the JDBC driver compiled for java7?  thx
>
> Regards
> Prabhakar
>
  

Re: Mongo filter push-down limitations?

2020-02-25 Thread Paul Rogers
Hi Dobes,

Your use case is exactly the one we hope Drill can serve: integrate data from 
multiple sources. We may have to work on Drill a bit to get it there, however.

A quick check of Mongo shows that it does implement filter push down. Check out 
the class MongoPushDownFilterForScan. The details appear to be in 
MongoFilterBuilder. This particular implementation appears to be rather 
limited: it seems to either push ALL filters, or none. A more advanced 
implementation would push those it can handle, leaving the rest to Drill.


There may be limitations; it depends on what the plugin author implemented. 
What kind of query did you do where you saw no push-down? And, how did you 
check the plan? Using an EXPLAIN PLAN FOR ... command? If filters are, in fact, 
pushed down, there has to be some trace in the JSON plan (in some 
Mongo-specific format.)

Given the all-or-nothing limitation of the Mongo plugin implementation, maybe 
try the simplest possible query such as classID = 10.


Filter push-down is a common operation; most implementations are currently 
(incomplete) copy/pastes of other (incomplete) implementations. We're working 
to fix that. We had a PR for the standard (col RELOP const) cases, but reviewers 
asked that it be made more complete. The PR does handle partial filter 
pushdown. Perhaps, as we move forward, we can apply the same ideas to Mongo.

Thanks,
- Paul

 

On Tuesday, February 25, 2020, 5:27:53 PM PST, Dobes Vandermeer 
 wrote:  
 
 Hi,

I am trying to understand drill's performance and how we can best use it for our 
project.  We use mongo as our primary "live" database and I am looking at 
syncing data to Amazon S3 and using Drill to run reports off of that.

I was hoping that I could have Drill connect directly to mongo for some things.

For example: Our software is used to collect responses from school classroom.  
I thought if I was running a report for students in a given class, I could 
build the list of students at a school using a query to mongodb.

I wanted to verify that drill would push down filters when doing a join, maybe 
first collecting a list of ids it is interested and use that as a filter when 
it scans the next mongo collection.

However, when I look at the physical plan I don't see any evidence that it 
would do this, it shows the filter as null in this case.

I also tried a query where I filtered on createdAt > 
date_sub(current_timestamp, interval "1" day) and it didn't apply that as a 
push-down filter (according to the physical plan tab) whereas I had hoped it 
would have calculated the resulting timestamp and applied that as a filter when 
scanning the collection.

Is there some rule I can use to predict when a filter will be propagated to the 
mongo query?  

Re: Mongo column types

2020-02-25 Thread Paul Rogers
Hi Dobes,

You've run into the classic drawback of runtime schema inference: if Drill 
never sees a column value in its first sample, then it has no way to "predict 
the future" and guess what type will eventually show up. So, Drill guesses 
"nullable INT" which turns out to almost always be wrong.

Some record readers pick the type on the very first row (a sample size of 1.) 
The newer JSON reader we're working on uses the first batch (a few thousand 
rows) as its sample size.

Still, if you request "points", the reader is obligated to provide a column 
even if has to make something up. So, it makes up "nullable INT."

This is the "black swan" problem of inductive reasoning: no matter how many 
empty values Drill sees, there could always be a non-empty value of some other 
type.


Worse, one scan may see no value and choose "nullable INT" while another sees 
the actual value and chooses Float8. Now, some poor exchange receiver operator 
will see both types and have no clue what to do.


This is why most DBs require a metastore (AKA data dictionary) to provide table 
descriptions. Instead of inferring types, DBs define the types, often via the 
same spec that drives the generative process that created the data.


Drill also has relatively new "provided schema" feature that helps with this 
issue in some (but not all) format plugins. But, it has not yet been added to 
Mongo (or any other storage plugin other than the file system plugin.)

You could try a conditional cast: something like

IF(sqlTypeOf(points) = `INT`, CAST(NULL AS FLOAT4), points)

(I probably have the syntax a bit wrong.) This works if two different scans see 
the different types. But, it will fail if a single scan sees empty values 
followed by an actual double value (which is exactly the case you describe) 
because the scan is trying to cope with the data before it's even gotten to the 
Project operator where the IF would be applied.
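
For reference, a CASE form of that idea looks roughly like this (untested; it
assumes sqlTypeOf() reports 'INTEGER' for the guessed nullable INT column, and
the exact type strings may differ):

SELECT _id,
       CASE WHEN sqlTypeOf(points) = 'INTEGER'
            THEN CAST(NULL AS DOUBLE)
            ELSE CAST(points AS DOUBLE)
       END AS points
FROM mongo.formative.answers
LIMIT 100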

Sorry for the long post, but this is a difficult issue that has frustrated 
users for years. I recently posted a proposed solution design at [1] and would 
welcome feedback.


Thanks,
- Paul

[1] 
https://github.com/paul-rogers/drill/wiki/Toward-a-Workable-Dynamic-Schema-Model


On Tuesday, February 25, 2020, 5:27:01 PM PST, Dobes Vandermeer 
 wrote:  
 
 Hi,


I was experimenting with the mongo storage system and I found that when I query 
a field that doesn't usually have any value, I get this error "You tried to 
write a Float8 type when you are using a ValueWriter of type 
NullableIntWriterImpl."

Based on a bit of googling I found that this means drill has inferred the 
incorrect type for that field.  I was hoping I could override the inferred type 
using CAST or something, but CAST didn't work.  Is there a way to tell drill 
what type a field from mongodb is supposed to be?

Example query:

SELECT _id, CAST(points AS DOUBLE)
FROM mongo.formative.answers AS answer
WHERE answer.createdAt > DATE_SUB(current_timestamp, interval '1' day)
LIMIT 100

In this case "points" isn't set on every row, so I guess drill assumes it is 
"NullableInt" when really it is should be considered a double.  We also have 
many boolean fields that are not set by default that we would want to query.

What's the standard workaround for this case?  

Re: JDBC driver for Java 7 vesion

2020-02-25 Thread Paul Rogers
Hi Prabhakar,

As it turns out, Drill is built for Java 8-13, but we've not built for Java 7 
in quite some time. (Java 7 reached end of life several years back.)

That said, you can try to clone the project sources and do a build. 
Unfortunately, the JDBC driver tends to use quite a bit of Drill's internals 
and so has a rather large footprint, some of which is likely to depend on Java 
8.

Further, Drill depends on a large number of libraries, all of which have likely 
been upgraded to Java 8. You'd have to find old Java 7 versions, and then 
figure out how to change Drill code to work with those old versions.

One might well ask, is it possible for you to upgrade to a supported Java 
version? Drill must not be the only library where you have the Java 8 
dependency problem.


Thanks,
- Paul

 

On Monday, February 24, 2020, 11:31:15 PM PST, Prabhakar Bhosaale 
 wrote:  
 
 Hi All,
We are using drill 1.16.0 and we are trying to create JDBC datasource  on
WAS8.5 with java7. we are getting following error.
"exception: java.sql.SQLException: java.lang.UnsupportedClassVersionError:
JVMCFRE003 bad major version; class=org/apache/drill/jdbc/Driver, offset=6"

So where can i get the JDBC driver compiled for java7?  thx

Regards
Prabhakar
  

Re: Websphere JDBC data source - Class not found exception

2020-02-24 Thread Paul Rogers
Hi Prabhakar,

While it is a bit difficult to debug class path issues via e-mail, here are 
some suggestions.

First, verify that the Drill JDBC driver is indeed on your class path. Given 
that you are using an app server, it is important that the jar be visible to 
the class loader that is calling it.

Second, app servers tend to formalize things like the JDBC registry. Might 
there be some config needed to ensure that the Drill driver is registered? 
Check what you did for, say, MySQL and try that.

Third, some apps need to force the Drill driver class to be loaded and visible. 
See the last section of [1].

Thanks,
- Paul

[1] 
http://drill.apache.org/docs/using-the-jdbc-driver/#example-of-connecting-to-drill-programmatically


 

On Monday, February 24, 2020, 7:24:41 PM PST, Prabhakar Bhosaale 
 wrote:  
 
 Hi Team,

Please help with any pointers on below issue mentioned. Thanks in advance.


Regards
Prabhakar


On Mon, Feb 24, 2020 at 10:01 AM Prabhakar Bhosaale 
wrote:

> Hi All,
> we have apache drill version 1.16 and we are trying to create JDBC data
> source on websphere application server. But it gives error "Class not found
> org.apache.drill.jdbc.Driver" when i try to test the connection.
>
> I followed all the instructions that are available on different sites and
> on Drill site but with no success.
>
> The only pre-requisite which is not matching is JDK version. As per listed
> prerequisites it needs JDK 8 where as the websphere is running on JDK7.
>
> is this JDK version causing the error? Any pointers to resolve this will
> help. Thanks
>
> Regards
> Prabhakar
>
  

Re: Requesting json file with schema

2020-02-23 Thread Paul Rogers
Hi,

Sorry for the delay in responding. Thank you for the helpful background 
information - very helpful indeed. Here are some thoughts about how we could 
extend Drill to help with your use case.


Your challenge seems rather open-ended: you don't know the format of the 
incoming data and don't know how the data will be used. I would guess that this 
presents a large challenge when doing ETL as you don't have the two most 
important pieces of information. Real life is messy. 


Thanks much for the PostgreSQL link. We've recently discussed introducing a 
similar feature in Drill which one could, with some humor, call "let JSON be 
JSON." The idea would be, as in PostgreSQL, to simply represent JSON as text and 
allow the user to work with JSON using JSON-oriented functions. The PostgreSQL 
link suggests that this is, in fact, a workable approach (though, as you note, 
doing so is slower than converting JSON to a relational structure.) I took the 
liberty of filing DRILL-7598 [1] to request this feature.


Today, however, Drill attempts to map JSON into a relational model so that the 
user can use SQL operations to work on the data. [2] The Drill approach works 
well when the JSON is the output of a relational model (a dump of a relational 
table or query, say.) The approach does not work for "native" JSON in all its 
complexity. JSON is a superset of the relational model and so not all JSON 
files map to tables and columns.

To solve your use case, Drill would need to adopt a solution similar to 
PostgreSQL. In fact, Drill already has some of the pieces (such as the 
CONVERT_TO/CONVERT_FROM operations [3]), but even these attempt to convert JSON 
to or from the relational model. What we need, to solve the general use case, 
are the kind of native JSON functions which PostgreSQL provides.

Fortunately, since Drill would store JSON as a VARCHAR, no work would be needed 
in the Drill "core". All that is needed is someone to provide a set of Drill 
functions (UDFs) to call out to some JSON library to perform the desired 
operations. Anyone looking for a starter project? This one would be  a great 
one.

All that said, I don't think any of this would solve your ETL case. If you 
convert the data to Parquet, it is probably because your consumers want to read 
the data directly from Parquet. Storing JSON in Parquet is probably not what 
consumers expect as it simply pushes ETL complexity onto them. So, you do need 
to solve the problem of converting JSON to a relational model, which can then 
be stored in "plain" Parquet files.

Your e-mail (and several other ongoing discussions) prompted me to write up a 
description of the challenges in Drill's dynamic schema model and how we might 
improve the solution, starting with the scan operator [4]. The idea is to 
provide a way for you to give Drill just enough hints to overcome any 
ambiguities in the JSON. Drill has a "provided schema" feature [5], which, at 
present, is used only for text files (and recently with limited support in 
Avro.) We are working on a project to add such support for JSON. 


You offer a good suggestion: allow the JSON reader to read chunks of JSON as 
text which can be manipulated by those future JSON functions. In your example, 
column "c" would be read as JSON text; Drill would not attempt to parse it into 
a relational structure. As it turns out, the "new" JSON reader we're working on 
originally had a feature to do just that, but we took it out because we were 
not sure it was needed. Sounds like we should restore it as part of our 
"provided schema" support. It could work this way: if you CREATE SCHEMA with 
column "c" as VARCHAR (maybe with a hint to read as JSON), the JSON parser 
would read the entire nested structure as JSON without trying to parse it into 
a relational structure. I filed DRILL-7597 [6] to add the "parse as JSON" 
feature.
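
With that in place, the hint would look roughly like this (names made up; the
"read c as raw JSON text" behavior is the part DRILL-7597 proposes, not
something Drill does today):

CREATE OR REPLACE SCHEMA (a VARCHAR, b INT, c VARCHAR)
FOR TABLE dfs.etl.`source_json`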

If Drill were to include PostgreSQL-like JSON functions, and the ability to 
read selected JSON fields as JSON, would this address your use case? What else 
might be required?

Also, your e-mail mentions "json file with schema." How would you want to 
provide the schema? Would the provided schema feature, described above, work 
for you or is there a better solution?


Thanks,
- Paul

[1] https://issues.apache.org/jira/browse/DRILL-7598

[2] https://drill.apache.org/docs/json-data-model/

[3] 
https://drill.apache.org/docs/data-type-conversion/#convert_to-and-convert_from


[4] 
https://github.com/paul-rogers/drill/wiki/Toward-a-Workable-Dynamic-Schema-Model

[5] https://drill.apache.org/docs/create-or-replace-schema/

[6] https://issues.apache.org/jira/browse/DRILL-7597


On Monday, February 17, 2020, 4:35:31 AM PST, 
userdrill.mail...@laposte.net.INVALID  
wrote:  
 
 Hi, 

For our particular case here, we just want to prepare a dataset (in Parquet 
Format) which will be reused by

Re: Connecting Apache Drill with Snowflake DB

2020-02-20 Thread Paul Rogers
Hi Jeganathan,

What error are you seeing? We recently had another user trying to connect to 
Dremio and had them try some things to track down the problem. Perhaps you can 
start by trying some of those steps. See the Drill archives for February 2020.

The first thing to check is that the driver actually works in Drill so we can 
rule out things like the driver in the wrong place, conflicting dependencies 
and so on. Please check Drill's logs ($DRILL_HOME/log/drillbit.log) to see if 
they provide any information.


The next step is to try simple tasks directly from Sqlline or the Drill web 
console. The prior e-mails suggest the steps to take.

The final step would be to figure out if there is an issue for some specific 
query or workspace.
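
From Sqlline, the sanity checks are just a few statements like these (I'm
assuming you named the plugin "snowflake"; the last table name is a placeholder
for one of yours):

SHOW DATABASES;
USE snowflake.test_schema;
SHOW TABLES;
SELECT * FROM `some_table` LIMIT 10;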

A quick check of the Snowflake site suggests that they do not have a 
publicly-visible sample data set; looks like I'd need to set up a free trial. 
As a result, I can't easily try out the connection for you.


Thanks,
- Paul

 

On Thursday, February 20, 2020, 3:20:33 PM PST, velu jeganathan 
 wrote:  
 
 Hi,

I am trying to connect Apache Drill to query Snowflake cloud data
warehouse. I was able to successfully create a snowflake storage plugin in
Drill web UI from my windows 10 machine with JSON structure as below. But I
am not able to either explore the tables in the schema in the explorer nor
query the tables. I am using Snowflake JDBC driver from their site:
https://docs.snowflake.net/manuals/user-guide/jdbc-configure.html#examples

What am I doing wrong? Thanks in advance for your help.

{
 "type": "jdbc",
 "driver": "net.snowflake.client.jdbc.SnowflakeDriver",
  "url": "jdbc:snowflake://accountname.snowflakecomputing.com/?
    warehouse=wh=test_db=test_schema",
  "username": "user",
  "password": "pwd",
  "caseInsensitiveTableNames": false,
  "enabled": true
  }

Thanks and Regards,
Jeganathan Velu
  

Re: Drill Hangout Proposal

2020-02-20 Thread Paul Rogers
Hi All,

Thanks much for thinking of us straggler PT folks. 7 AM is fine. I may not turn 
on the camera, however.


As for the foreign language bit, you all speak (and write!) English so well 
we'd never know it was a second language unless you told us. I am always very 
impressed by your English skills. Your English teacher in school should be 
proud.


Thanks,
- Paul

 

On Thursday, February 20, 2020, 7:34:44 AM PST, Igor Guzenko 
 wrote:  
 
 Hi Charles,

Yes, 5 PM is fine for me. Isn't 7 AM still too early for Paul?

Thanks,
Igor

On Thu, Feb 20, 2020 at 5:29 PM Charles Givre  wrote:

> Hi Igor,
> So would 5PM Ukraine time work for you?  That translates to 10AM my time,
> and 7AM in the Bay Area.  Would that work?
> -- C
>
> > On Feb 20, 2020, at 10:27 AM, Igor Guzenko 
> wrote:
> >
> > Hello Charles,
> >
> > The mentioned EET range (4 pm - 8 pm) is the most convenient for me.
> Since
> > morning time in Ukraine poorly correlates with the US time.
> > And after 8 pm holding a conversation in a foreign language isn't the
> > easiest thing to do after a hard working day :)
> >
> > Thanks,
> > Igor
> >
> >
> > On Thu, Feb 20, 2020 at 5:16 PM Charles Givre  wrote:
> >
> >> Hi Igor,
> >> What are your working hours in Ukraine?  The time I proposed: 9:30AM ET
> >> fits within the window you listed below.  It is early for anyone on the
> >> west coast of the US so we could do it an hour later which makes it a
> more
> >> civilized hour in the morning for them.
> >> -- C
> >>
> >>> On Feb 20, 2020, at 10:14 AM, Igor Guzenko  >
> >> wrote:
> >>>
> >>> Hello Charles,
> >>>
> >>> I feel uncomfortable that time is not perfect for everyone. In my
> >> opinion,
> >>> we have a broader time range, perfectly would be to fit within:
> >>>
> >>> EST | 9 am - 2 pm
> >>> PST | 6 am - 10 am
> >>> EET | 4 pm - 8 pm
> >>>
> >>> Although still not so much space for PST, but at least not only 6:30
> >> am...
> >>>
> >>> Kind regards,
> >>> Igor
> >>>
> >>>
> >>> On Thu, Feb 20, 2020 at 4:22 PM Charles Givre 
> wrote:
> >>>
>  Hello all,
>  It is approaching the end of the month and I'd like to propose another
>  Drill hangout.  Since we got a great response last time, I'd like to
>  propose that we hold the next one on Thursday, Feb 27 at 0930 ET.
> >> (Sorry
>  Paul).
> 
>  If this time works for everyone, please respond and I'll start putting
>  together an agenda.  I'd also like to ask for a volunteer to take
> >> minutes.
>  Thanks,
>  -- C
> >>
> >>
>
>
  

Re: Error: DATA_READ ERROR: The JDBC storage plugin failed while trying setup the SQL query

2020-02-19 Thread Paul Rogers
Hi David,

So I tried to reproduce the problem, but I could not get Dremio to start: it 
consistently crashes when I follow the GitHub community edition instructions. 
[1] Something about "Unable to find injectable based on 
javax.ws.rs.core.SecurityContext."


I don't want to turn our attempt to diagnose a Drill issue into an attempt to 
diagnose my Dremio issue.

What I had planned to do was to start the server, then use the Dremio JDBC 
driver to create a very simple Java program that uses JDBC to connect to Dremio 
and issues the query in question. Then, I planned to muck around with the table 
path: leaving out the schema part, etc. until I got Dremio to accept the query. 
This would tell us what Dremio wants.

Once I had that information, I could then compare that with what Drill is 
sending and figure out how you might adjust your setup to get things to work.

Given that I could not get past the step to get Dremio running, are you 
comfortable enough with Java to run the little experiment outlined above? Once 
you have a SELECT statement that works, we can move on to the next step above: 
working out how to get Drill to issue that statement.
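
For what it's worth, the statements to try in that little program are just
variations on the table path, for example (the quoting styles below are
guesses; the goal is only to find the form Dremio accepts):

SELECT * FROM "Demo"."weather" LIMIT 1;
SELECT * FROM Demo.weather LIMIT 1;
SELECT * FROM "DREMIO"."Demo"."weather" LIMIT 1;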

Thanks,
- Paul

[1] https://github.com/dremio/dremio-oss


 

On Tuesday, February 18, 2020, 8:23:23 AM PST, David Du 
 wrote:  
 
 Thanks for your response. From http://localhost:9047/space/Demo, I ran
select * from weather; it works and I got data back. Then I reconfigured the
apache drill plugin:
dremiodemo
{
  "type": "jdbc",
  "driver": "com.dremio.jdbc.Driver",
  "url": "jdbc:dremio:direct=localhost:41010",
  "username": "admin",
  "password": "admin",
  "caseInsensitiveTableNames": true,
  "enabled": true
}

then restarted drill and run commands:

apache drill (dremiodemo.demo)> show databases;

+---+

|               SCHEMA_NAME                 |

+---+

| cp.default                                |

| dfs.default                              |

| dfs.root                                  |

| dfs.tmp                                  |

| dremiodemo.$scratch                      |

| dremiodemo.@admin                        |

| dremiodemo.demo                          |

| dremiodemo.dremio                         |

| dremiodemo.sys                            |

| dremiodemo.sys.cache                      |

| dremiodemo.testspace                      |

| dremiodemo                                |

| information_schema                        |

| qi.admin

apache drill (dremiodemo.demo)> use dremiodemo.demo;

+--+-+

|  ok  |                   summary                   |

+--+-+

| true | Default schema changed to [dremiodemo.demo] |

+--+-+

1 row selected (0.098 seconds)

apache drill (dremiodemo.demo)> use dremiodemo.demo;

+--+-+

|  ok  |                   summary                   |

+--+-+

| true | Default schema changed to [dremiodemo.demo] |

+--+-+

1 row selected (0.121 seconds)

apache drill (dremiodemo.demo)> show tables;

+-++

|  TABLE_SCHEMA   | TABLE_NAME |

+-++

| dremiodemo.demo | topips    |

| dremiodemo.demo | weather    |

+-++

2 rows selected (0.25 seconds)

apache drill (dremiodemo.demo)> select * from weather;

Error: DATA_READ ERROR: The JDBC storage plugin failed while trying setup
the SQL query.


sql SELECT *

FROM "DREMIO"."Demo"."weather"

plugin dremiodemo

Fragment 0:0


[Error Id: fb821614-d752-4ffb-896a-5a61f7e7cfd5 on
1672851h-t2349.noblis.org:31010] (state=,code=0)

apache drill (dremiodemo.demo)>


So in this test, I removed the schema=Demo part from the dremiodemo
profile, and the weather table shows up from "show tables" command, but the
query select * from weather; returned the same DATA_READ ERROR, and in
the error message: FROM "DREMIO"."Demo"."weather", the word "DREMIO" is
added by drill, not me, not sure why. In the sqlline.log file, the error
message is:

org.apache.drill.common.exceptions.UserRemoteException: VALIDATION ERROR:
Schema [[Demo]] is not valid with respect to either root schema or current
default schema.


Current default schema:  dremiodemo.demo


and in server.log file of dremio server log folder, this error message is:

Caused by: org.apache.calcite.sql.validate.SqlValidatorException: Table
'DREMIO.Demo.weather' not found



I also tested qi.admin schema, which I configured for postgres database, it
worked perfectly.

Re: Installing/Running Drill (Embedded) as Windows Service

2020-02-19 Thread Paul Rogers
Hi Sean,

I was hoping someone who uses Windows would answer. I'll take a stab.

First, if you need a service, you'll want to install the Drill server (not 
embedded) as a service. Embedded Drill runs inside the command-line Sqlline, 
which is not very useful when running as a service.

Checking the scripts in $DRILL_HOME/bin, it looks like there is no Windows 
batch file to start Drill, only a Linux shell script. So, you'd have to create 
the batch file. Fully recreating drillbit.sh would be a bit of a project. But, 
if you have access to a Linux machine (just run a VM) or a Mac, there may be a 
shortcut. Get your Linux/Mac Drillbit to start. Then, run the script like this:

$DRILL_HOME/bin/drillbit.sh debug

This command, instead of launching Drill, will show the environment and command 
line it would use if it were to actually launch Drill. You can ignore all the 
rest of drillbit.sh; you just need a batch file (or Powershell script, or 
whatever) to set up the environment and issue the Java launch command.

Once you have that replicated on Windows, you can then wrap it in a service 
(the details of which I've not messed with for years.)

All this said, I suspect one reason we do not provide a Drillbit launch script 
is that some of Drill's networking may be Unix-specific (if I remember what 
someone told me years ago.) Still, worth a try. 


Remember that if you launch a Drillbit, you also need ZK [1]. ZK becomes the 
persistent store for things like your plugins and system options.


Might it be easier to just use a Linux box (or VM) to run Drill and ZK? That 
way you are going down the common path. Still, if you get Windows working, 
please file a JIRA with your solution so we can get it into the project to make 
this easier in future releases.


Thanks,
- Paul

[1] 
https://medium.com/@shaaslam/installing-apache-zookeeper-on-windows-45eda303e835



 

On Thursday, February 13, 2020, 10:22:30 AM PST, Leyne, Sean 
 wrote:  
 
 
Does anyone have instructions on how to get Drill installed/running as a 
Windows Service?

Thanks


Sean  

Re: Drill + Accumulo?

2020-02-19 Thread Paul Rogers
Hi Ron,

Given that I know next to nothing about  Accumulo other than what I just 
learned from a Google search, the answer appears to be no. The best approach 
would be to write a connector that exploits Accumulo's new 2.0 AccumuloClient 
API, perhaps along with the new Scan Executors. See [1].

I thought I had read in the past that Accumulo grew out of some other project, 
but the project history of chapter 1 of the O'Reilly Accumulo book [2] suggests 
that Accumulo was built independently. That said, to the degree that Accumulo 
has similarities with HBase, another path is to fork/extend Drill's HBase 
connector.

Yet another solution would be to use a REST API, leveraging the REST connector 
that is being built. However, a quick review of the "Accumlo Clients" page of 
the docs [3] does not suggest that Accumulo ships with a REST API. Perhaps some 
other project has added one? Of course, this just shifts the work to create an 
Accumulo connector to the work to force a generic REST connector to issue the 
requests, and read the responses that the REST proxy might use. Not sure that 
is a huge win.

Accumulo appears to have Hive integration [4], as does Drill. I wonder if that 
is possible path? I'm not very familiar with how Drill reads data from Hive, 
but if we use Hive's record format, and Accumulo can produce that format, there 
might be a path. Not sure how things like filter push-down would be handled.

All this said, Drill is designed to allow connectors. The API is not as simple 
as we'd like (we're working on it), but if you need SQL access to Accumulo, writing 
a connector is a possible path.

Finally, if you need a SQL solution today, there is Presto, which already has 
an Accumulo connector. [5].


Thanks,
- Paul

[1] https://accumulo.apache.org/release/accumulo-2.0.0/

[2] https://learning.oreilly.com/library/view/accumulo/9781491947098/

[3] https://accumulo.apache.org/docs/2.x/getting-started/clients

[4] https://cwiki.apache.org/confluence/display/Hive/AccumuloIntegration

[5]  https://prestosql.io/docs/current/connector/accumulo.html







 

On Wednesday, February 19, 2020, 2:40:17 PM PST, Ron Cecchini 
 wrote:  
 
 (Keeping in mind that I know next to nothing about Accumulo or Hive, etc...)

Is there currently (i.e. without writing a connector) any way whatsoever to get 
Drill to query an Accumulo db? 

Thanks.
  

Re: data issue

2020-02-18 Thread Paul Rogers
Hi Vishal,

I think you can add the following to $DRILL_HOME/conf/logback.xml to enable the 
needed logging. Something like this (adjust the logger name and appender-ref to 
match the appenders defined in your logback.xml):

  <logger name="org.apache.drill.exec.store.easy.text.reader" additivity="false">
    <level value="trace" />
    <appender-ref ref="FILE" />
  </logger>


Note that if you use a config directory separate from your install (using the 
--site flag to launch Drill) then modify the file in your custom location.

To file a JIRA ticket, just go to Drill's home page [1], Click on Community, 
then Community Resources, then the first entry under Developer Resources: JIRA 
which is [2].

Make sure the Drill project is selected. Then just fill in the type 
(Improvement), title, your version number and a description. There are many 
other fields, but we mostly don't use them.

Would be super-helpful if you can include a few lines of a CSV file that 
exhibits the problem (once you track down the problem using logging.)


Thanks,
- Paul


[1] http://drill.apache.org/
 
[2] https://issues.apache.org/jira/browse/DRILL/

On Tuesday, February 18, 2020, 5:21:26 AM PST, Vishal Jadhav (BLOOMBERG/ 
731 LEX)  wrote:  
 
 Hello Paul,
Yes, I agree that a better error message would be a better solution. I am on 
drill 1.17. Regarding the logs - do I need to add/modify any specific things in 
the logback.xml to produce the trace?
I can file a Jira with the instructions. What is the process for it?
- Vishal

From: user@drill.apache.org At: 02/14/20 17:47:26To:  Vishal Jadhav (BLOOMBERG/ 
731 LEX ) ,  user@drill.apache.org
Subject: Re: data issue

Hi Vishal,

Yes, it is a known issue that Drill error reporting needs some TLC. Obviously, 
a better solution would be for the error to say something like 
"NumerFormatException: Column foo, value "this is not a number"". Feel 
free to file a JIRA ticket to remind us to fix this particular case. Please 
explain the context so we have a good shot at reproducing the issue.


You said that the logs, at trace level, provided no information. Which version 
of Drill are you using? If the latest (and, I think 1.16), there is a log 
message each time the reader opens a file:

package org.apache.drill.exec.store.easy.text.reader;


public class CompliantTextBatchReader ...

  private void openReader(TextOutput output) throws IOException {
    logger.trace("Opening file {}", split.getPath());


Given this, you should see a series of "Opening file" messages when you enable 
trace-level logging for the above class.

As Charles noted, CSV reads columns as text, so let's assume that you do have a 
CAST or other conversion. Then, the number format exception says that you are 
trying to convert a column from text to a number, and that value does not 
actually contain a number.

Again, it would be better if the error message told us the column that has the 
problem. Otherwise, if the number of columns in question is small, you can run 
a query to find non-numeric values. Now, it would be nice if Drill has an 
isNumber() function. (Another Jira feature request you can file.)

Since I can't find one, we can roll our own with a regex. Something like:

SELECT foo FROM yourTable WHERE NOT regexp_matches(foo, '\d+')

If the number is a float or decimal, add the proper pattern.

Caveat: I didn't try the above regex, there may be some fiddly bits with 
back-slashes.

Then, you can add file metadata (AKA "implicit") columns to give you the 
information you want:

SELECT filename, foo FROM ...


If that finds the data, and it is something you must handle, you can add an 
IF function to handle the data.

Thanks,
- Paul

 

    On Friday, February 14, 2020, 7:44:59 AM PST, Vishal Jadhav (BLOOMBERG/ 731 
LEX)  wrote:  
 
 During my select statement on conversion of csv file to parquet file, I get 
the NumberFormatException exception, I am running drill in the embedded mode. 
Is there a way to find out which csv file or row in that file is causing the 
issue?
I checked the logs with trace verbosity, but not able find the 'data' which has 
the issue. 

Error: SYSTEM ERROR: NumberFormatException

Fragment 1:5

Please, refer to logs for more information.

Thanks!
- Vishal

  

  

Re: WebUI not saving changes to storage plug-ins configuration

2020-02-17 Thread Paul Rogers
Hi Sean,

I poked around in the source code. Drill has something called a "plugin 
registry" that holds plugin definitions. This code is case insensitive, but 
should preserve name case.

Also, Drill uses a "CaseInsensitivePersistentStore" to write to the local file 
system in embedded mode. This store converts plugin names to lower case for 
storage. If you edit storage by hand, you should ensure that your plugin names 
are all lower case so that Drill can find them.

Thanks,
- Paul

 

On Friday, February 14, 2020, 6:54:26 PM PST, Paul Rogers 
 wrote:  
 
 Hi Sean,

Drill works best when it runs as a server, using ZK for persistent storage and 
coordination. That said, in embedded mode Drill is supposed to save plugins to 
local storage. The changes between your manual edits and what Drill sees are:

* Workspace names are lower-cased. I'm not sure if this is a bug or feature, my 
guess would be a bug.


* Drill uses the name "writable". You used the common misspelling "writeable", 
so Drill ignored your property.

How are you enabling/disabling plugins? Via the web UI? Manually in your file?


Thanks,
- Paul

 

On Friday, February 14, 2020, 6:42:57 PM PST, Prabhakar Bhosaale 
 wrote:  
 
 Hi,

I faced this problem when I start the drill with drill-embedded.bat. and it
was resolved by starting drill using  sqlline.bat -u "jdbc:drill:zk=local
Give it a try.

Regards
Prabhakar

On Fri, Feb 14, 2020, 03:09 Leyne, Sean  wrote:

>
> I created a new storage plug-in (copying the system defined 'cp'
> definition), and I am now trying to changes the configuration.
>
> First, enable/disable plug-in action is not having any effect.
>
> Second, when I shutdown the embedded engine, and make manual changes to
> the .sys.drill file in my C:\tmp\drill\sys.storage_plugins folder, the
> changes are only partially accepted.
>
> This is definition in the .sys.drill file:
>
>    {
>      "type" : "file",
>      "connection" : "file:///",
>      "config" : null,
>      "workspaces" : {
>          "CSVFiles": {
>              "location": "d:/drill/CSVFiles/",
>              "writeable": false
>          },
>          "ParquetFiles": {
>              "location": "d:/drill/ParquetFiles/",
>              "writeable": true
>          }
>      },
>      "formats" : {...
>
> This is what the WebUI is displaying
>
>    {
>      "type": "file",
>      "connection": "file:///",
>      "config": null,
>      "workspaces": {
>        "csvfiles": {
>          "location": "d:/drill/CSVFiles/",
>          "writable": false,
>          "defaultInputFormat": null,
>          "allowAccessOutsideWorkspace": false
>        },
>        "parquetfiles": {
>          "location": "d:/drill/ParquetFiles/",
>          "writable": false,
>          "defaultInputFormat": null,
>          "allowAccessOutsideWorkspace": false
>        }
>      },
>      "formats": {
>
> Note: the  "parquetfiles" | "writable" are different.
>
> Finally, trying to edit the "parquetfiles" | "writable" via the WebUI
> plugin configuration editor but the changes are not being saved/don't have
> any effect.
>
> What am I doing wrong?
>
>
> Sean
>
>


Re: WebUI not saving changes to storage plug-ins configuration

2020-02-14 Thread Paul Rogers
Hi Sean,

Drill works best when it runs as a server, using ZK for persistent storage and 
coordination. That said, in embedded mode Drill is supposed to save plugins to 
local storage. The changes between your manual edits and what Drill sees are:

* Workspace names are lower-cased. I'm not sure if this is a bug or feature, my 
guess would be a bug.


* Drill uses the name "writable". You used the common misspelling "writeable", 
so Drill ignored your property.

How are you enabling/disabling plugins? Via the web UI? Manually in your file?


Thanks,
- Paul

 

On Friday, February 14, 2020, 6:42:57 PM PST, Prabhakar Bhosaale 
 wrote:  
 
 Hi,

I faced this problem when I start the drill with drill-embedded.bat. and it
was resolved by starting drill using  sqlline.bat -u "jdbc:drill:zk=local
Give it a try.

Regards
Prabhakar

On Fri, Feb 14, 2020, 03:09 Leyne, Sean  wrote:

>
> I created a new storage plug-in (copying the system defined 'cp'
> definition), and I am now trying to changes the configuration.
>
> First, enable/disable plug-in action is not having any effect.
>
> Second, when I shutdown the embedded engine, and make manual changes to
> the .sys.drill file in my C:\tmp\drill\sys.storage_plugins folder, the
> changes are only partially accepted.
>
> This is definition in the .sys.drill file:
>
>    {
>      "type" : "file",
>      "connection" : "file:///",
>      "config" : null,
>      "workspaces" : {
>          "CSVFiles": {
>              "location": "d:/drill/CSVFiles/",
>              "writeable": false
>          },
>          "ParquetFiles": {
>              "location": "d:/drill/ParquetFiles/",
>              "writeable": true
>          }
>      },
>      "formats" : {...
>
> This is what the WebUI is displaying
>
>    {
>      "type": "file",
>      "connection": "file:///",
>      "config": null,
>      "workspaces": {
>        "csvfiles": {
>          "location": "d:/drill/CSVFiles/",
>          "writable": false,
>          "defaultInputFormat": null,
>          "allowAccessOutsideWorkspace": false
>        },
>        "parquetfiles": {
>          "location": "d:/drill/ParquetFiles/",
>          "writable": false,
>          "defaultInputFormat": null,
>          "allowAccessOutsideWorkspace": false
>        }
>      },
>      "formats": {
>
> Note: the  "parquetfiles" | "writable" are different.
>
> Finally, trying to edit the "parquetfiles" | "writable" via the WebUI
> plugin configuration editor but the changes are not being saved/don't have
> any effect.
>
> What am I doing wrong?
>
>
> Sean
>
>
  

Re: Requesting json file with schema

2020-02-14 Thread Paul Rogers
Thanks for the explanation, very helpful.

There are two parts to the problem. On the one hand, you want to read an 
ever-changing set of JSON files. Your example with "c5" is exactly the kind of 
"cannot predict the future" issues that can trip up Drill (or, I would argue, 
any tool that tries to convert JSON to records.)

One thing we have discussed (but not implemented) in Drill is the ability to 
read a JSON object as a Java map. That is, store the non-relational JSON 
structure as a more JSON-like Java data structure. The value of your "c" column 
would then be a Java Map which sometimes contains a column "c5" which is also a 
Java Map.

This would allow Drill to handle any arbitrary JSON without the need of a 
schema. So, we we have a solution, right?


Not so fast. We now get to the other part of the problem. Parquet is columnar; 
but it assumes that a file has a unified schema. Even if Drill could say, "here 
is your ragged collection of JSON objects", you'd still need to unify the 
schema to write the Parquet file.

You are presumably creating Parquet to be consumed by other query tools. So, 
whatever tool consumes the Parquet now has the same issue as we had with JSON, 
but now across Parquet files: if some files have one schema, some files another 
schema, most tools (other than Drill), won't even get started; they need a 
consistent schema.

Hive solves this by allowing you to specify an (evolving) schema. You tell Hive 
that c.c5.d and c.c5.e are part of your schema. Hive-compliant tools know to 
fill in nulls when the columns don't exist in some particular Parquet file.

So, taking this whole-system view we see that, in order to use Parquet as a 
relational data source, you will need to know the schema of the data in a form 
that your desired query tool understands. Thus, you need Drill to help you 
build Parquet files that satisfy that schema.


This brings us to the other solution we've discussed: asking you to provide a 
schema (or allow Drill to infer it). That way, even when Drill reads your 
"horses" record, Drill will know to create a c.c5 column as a map. That is, 
Drill will have the information to map your complex JSON into a relational-like 
structure.

Thus, when Drill writes the Parquet file, it will write a consistent schema. 
the same one you must provide to non-Drill query tools. Then, Drill won't write 
Parquet that depends on the columns that did or did not show up in a single ETL 
run.

So, two possible (future) solutions: 1) Java Maps (but wild and crazy Parquet 
schemas), or 2) declared schema to unify JSON files. Will one or the other work 
for your use case?

Or, is there some better solution you might suggest?

And, what is the target tool for the Parquet files you are creating? How will 
you handle schema evolution with those tools?
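
Just to make option 2 concrete: once the schema can declare "c" as text, the
conversion itself is a plain CTAS, something like this (names are made up; it
assumes dfs.out is a writable workspace whose default format is Parquet):

CREATE TABLE dfs.out.`animals_parquet` AS
SELECT a, CAST(b AS INT) AS b, c
FROM dfs.etl.`animals_json`;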


Thanks,
- Paul

 

On Friday, February 14, 2020, 7:15:15 AM PST, 
userdrill.mail...@laposte.net.INVALID  
wrote:  
 
 Hi,

Thanks for all the details.

Coming back to one use case: the context is the transformation into Parquet of 
JSONs containing billions of records, where each record has globally the same 
schema but can have some specificities.
Simplified example:
{"a":"horses","b":"28","c":{"c1":"black","c2":"blue"}}
{"a":"rabbit","b":"14","c":{"c1":"green"                        
,"c4":"vanilla"}}
{"a":"cow"  ,"b":"28","c":{"c1":"blue"            ,"c3":"black"              
,"c5":{"d":"2","e":"3"}}}
...

We need to transform the JSON into Parquet.
So that's OK for columns a and b (in this example), but for c we don't/can't 
know all the possibilities and it's growing continuously. So the solution is to 
read "c" as TEXT and defer the use/treatment of the content.
So in this example, the destination Parquet will have 3 columns:
a : VARCHAR (example: 'horses')
b : INT     (example: 14)
c : VARCHAR (example: '{"c1":"blue","c3":"black","c5":{"d":"2","e":"3"}}')

We can't do that with drill because the "discovery/alignment" of the "c" part 
of the json is too heavy in terms of resources and the request crashes.

So we currently use a Spark solution, as Spark allows specifying a schema when 
reading a file.

Hope that can help or give ideas,

Regards,

> Hi,
> 
> Welcome to the Drill mailing list.
> 
> You are right. Drill is a SQL engine. It works best when the JSON input files 
> represent rows
> and columns.
> 
> Of course, JSON itself can represent arbitrary data structures: you can use 
> it to serialize
> any Java structure you want. Relational tables and columns represent a small 
> subset of what
> JSON can do. Drill's goal is to read relational data encoded in JSON, not to 
> somehow magically
> convert any arbitrary data structure into tables and columns.
> 
> As described in our book, Learning Apache Drill, even seemingly trivial JSON 
> can violate relational
> rules. For example:
> 
> {a: 10} {a: 10.1}
> 
> Since Drill infers types, and must guess the type on the first row, Drill 
> will guess BIGINT.
> Then, the very next row delivers a DOUBLE (10.1), which breaks that guess.

Re: data issue

2020-02-14 Thread Paul Rogers
Hi Vishal,

Yes, it is a known issue that Drill error reporting needs some TLC. Obviously, 
a better solution would be for the error to say something like 
"NumerFormatException: Column foo, value "this is not a number"". Feel free to 
file a JIRA ticket to remind us to fix this particular case. Please explain the 
context so we have a good shot at reproducing the issue.


You said that the logs, at trace level, provided no information. Which version 
of Drill are you using? If the latest (and, I think 1.16), there is a log 
message each time the reader opens a file:

package org.apache.drill.exec.store.easy.text.reader;


public class CompliantTextBatchReader ...

  private void openReader(TextOutput output) throws IOException {
    logger.trace("Opening file {}", split.getPath());


Given this, you should see a series of "Opening file" messages when you enable 
trace-level logging for the above class.

As Charles noted, CSV reads columns as text, so let's assume that you do have a 
CAST or other conversion. Then, the number format exception says that you are 
trying to convert a column from text to a number, and that value does not 
actually contain a number.

Again, it would be better if the error message told us the column that has the 
problem. Otherwise, if the number of columns in question is small, you can run 
a query to find non-numeric values. Now, it would be nice if Drill has an 
isNumber() function. (Another Jira feature request you can file.)

Since I can't find one, we can roll our own with a regex. Something like:

SELECT foo FROM yourTable WHERE NOT regexp_matches(foo, '\d+')

If the number is a float or decimal, add the proper pattern.

Caveat: I didn't try the above regex, there may be some fiddly bits with 
back-slashes.

Then, you can add file metadata (AKA "implicit") columns to give you the 
information you want:

SELECT filename, foo FROM ...
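
Putting those two pieces together, a query to hunt down the offending file and
value might look like this (it assumes a headerless CSV where the suspect field
is the fourth column; adjust the workspace, path and pattern to your data):

SELECT filename, columns[3] AS suspect_value
FROM dfs.data.`*.csv`
WHERE NOT regexp_matches(columns[3], '^-?\d+(\.\d+)?$')
LIMIT 20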


If that finds the data, and it is something you must handle, you can add an 
IF function to handle the data.

Thanks,
- Paul

 

On Friday, February 14, 2020, 7:44:59 AM PST, Vishal Jadhav (BLOOMBERG/ 731 
LEX)  wrote:  
 
 During my select statement on conversion of csv file to parquet file, I get 
the NumberFormatException exception, I am running drill in the embedded mode. 
Is there a way to find out which csv file or row in that file is causing the 
issue?
I checked the logs with trace verbosity, but not able find the 'data' which has 
the issue. 

Error: SYSTEM ERROR: NumberFormatException

Fragment 1:5

Please, refer to logs for more information.

Thanks!
- Vishal

  
