Hi Avner,

A couple of other options to solve your issue. If your data is in S3, then you 
are on AWS. S3 Select lets you do simple queries against Parquet files in S3, 
returning the results as JSON. Amazon Athena is a hosted Presto that likely has 
better S3 security integration than Drill currently offers. Both are hosted for 
ease-of-management. Both are backed by AWS and thus have far more active users 
than Drill, meaning that others have likely worked out how to solve a use case 
such as yours. Either of these might help you get up and running more quickly 
than working around Drill's current limitations.

Where Drill is a great choice is if you have your own data format or service; 
one that AWS does not support.


Thanks,
- Paul

 

    On Sunday, May 10, 2020, 6:09:07 PM PDT, Paul Rogers 
<[email protected]> wrote:  
 
 Hi Avner,

Drill does not support prepared statements. Nor does Drill support statements 
with parameters. This is true with all interfaces. These would be great 
features; but they've never been implemented.


Drill was designed to operate in a Hadoop-like environment with semi-trusted 
users. (Meaning that, if any user did something malicious, you could sue or 
fire them.) As noted, the file system enforced security. There was no notion of 
the public using Drill to access secure data, with Drill acting as the secure 
gateway. Again, code could be added, but it is not there today. FWIW, in its 
present state, I would not trust Drill on a public web site with sensitive data.


Given how Drill acts today, I'd wager your best bet is to insert your own 
server between your user and Drill. Allow the user to specify queries in some 
simple, non-SQL way. Then, your server can build the SQL and forward it to 
Drill.

Public Internet (Web Browser --> Secure Gateway) --> Private network (App 
server --> Drill)


For example, if I want to know about "Orders", I can specify a date range. Your 
server fills in the storage plugin, table name, WHERE clause, etc. to create 
the SQL. Such an approach allows the Drill REST API is on a private IP address 
within your data center. Only your outward-facing user service is on a public 
IP. With this approach, there is no SQL injection risk because you do not 
directly use the web-provided info in a SQL statement. Of course, you have to 
build your SQL statement correctly, as you are doing. Don't just append web 
text to a SQL statement.


I don't think this is unique to Drill. I'd be surprised if most people allow, 
say, public access to their HBase, Cassandra or MySQL DBs.


Thanks,
- Paul

 

    On Sunday, May 10, 2020, 5:29:13 PM PDT, Avner Levy <[email protected]> 
wrote:  
 
 Hi Charles, Paul,
Thanks for your answers.
I'm interested in a case, where there is a Rest service which
authenticate the service's users, get a request with the user parameters
and build from it the SQL sent to Drill.
The customers are identified by some account ID and they send for example
the name of an entity they are looking for as a parameter in the service's
REST request.
Then the service can build the select (just an example):
SELECT x FROM S3.db.`data/[CUSTOMER_ID]/data.parquet` where
name='CUSTOMER_USER_INPUT]'

In such case, they can still send in CUSTOMER_USER_INPUT the following: "x'
union SELECT x FROM S3.db.`data/[OTHER_CUSTOMER_ID]/data.parquet`".

Usually such stuff are solved with prepared statements, but I believe this
isn't supported over REST.
I would prefer not having to authenticate my end users to Drill since this
creates more work and complexity.
Is there a way to have prepared statements in Drill?
Is it supported in other protocols? (JDBC/ODBC)
Limiting the query folder outside the SQL would do the job as well.
Any feedback is appreciated,
Thanks,
  Avner


On Sun, May 10, 2020 at 5:57 PM Paul Rogers <[email protected]>
wrote:

> Hi Charles,
>
> One of the changes I was looking at was allowing multiple SQL statements
> per REST request to get around the lack of session. The idea would be to
> issue a number of ALTER SESSION, CTTAS, USE and similar statements followed
> by a single query that returns data.
>
>
> A better solution is to enable session support for the REST API. We
> discussed the challenges involved due the disconnected nature of HTTP
> requests.
>
> Another good improvement would be a SQL command way to create configs, not
> just JSON editing. That way it would be easier to automate creation of a
> config. Also, it would be handy to be able to externalize configs so they
> can be stored in locations other than ZK (or local disk, in embedded mode.)
> For this use case, a query for user "X" would work against the "s3-X"
> config would could be retrieved from an external system that knows the
> mapping from user X to the S3 files visible to X, and the security tokens
> to use for that user.
>
> The question for now, however, is how to do this with the code that exists
> in Drill 1.17. I'm hoping someone has worked out a solution.
>
>
> Thanks,
> - Paul
>
>
>
>    On Sunday, May 10, 2020, 1:05:50 PM PDT, Charles Givre <
> [email protected]> wrote:
>
>  Hi Avner, Paul,
> I was reading this and wondering:
>
> 1.  Is it in fact true (I think it is) that Drill does not allow multiple
> queries to be submitted in one REST request?  I seem to remember running
> into that issue when I was trying to do some of the Superset work.
> 2.  If a user is required to be authenticated to execute a query, would
> that not prevent the possibility of a non-authenticated user executing
> arbitrary queries against someone else's data?
> 3.  I would definitely create separate data sources for each tenant, but I
> don't know that it is necessary (or helpful) to create one for each query.
>
> I'd agree with Paul, that Drill's access model needs improvement and that
> would be a good addition to the project.  We might be able to assist with
> that if there's interest.
> Best,
> -- C
>
>
> > On May 10, 2020, at 3:55 PM, Paul Rogers <[email protected]>
> wrote:
> >
> > Hi Avner,
> >
> > Drill was designed for a system in which the user name maps to a
> certificate on the underlying file system, and the file system provides
> complete security. This model has not been extended to the cloud world.
> >
> > What you want is a way to authenticate your user, map the user to a
> storage plugin config for only that client's files, then restrict that user
> to only that config. Further, you'd want the config to obtain S3 keys from
> a vault of some sort. If you have that, you'd not have to worry about SQL
> injection since only an authorized user could muck with the SQL, and they
> could only access their own data -- which they can presumably access anyway.
> >
> >
> > At present, Drill has no out-of-the-box security model for this use
> case; there is no mechanism to associate users with configs, or to
> externalize S3 security keys. Such a system would be a worthwhile addition
> to the project.
> >
> > I wonder, has anyone else found a workaround for this use case? Maybe
> via Kerberos or some such?
> >
> >
> > Thanks,
> > - Paul
> >
> >
> >
> >    On Sunday, May 10, 2020, 12:04:16 PM PDT, Avner Levy <
> [email protected]> wrote:
> >
> > Hi,
> > I'm trying to use Apache Drill as a database for providing SQL over S3
> > parquet files.
> > Drill is used for serving multi-tenant data for multiple customers.
> > Since I need to build the SQL string using the REST API I'm vulnerable to
> > SQL injection attacks.
> > I do test all user input and close it between apostrophes and
> > escape apostrophe in the user input by doubling it but I'm still
> concerned
> > about optional SQL attacks.
> > Will adding a different data source (which points to a different folder
> on
> > S3) per tenant is something that will have impact on performance? (I
> might
> > have thousands of those)
> > Does it make sense to create the data source on the fly before query?
> > Is there another way to limit the sent SQL to a specific folder?
> > Thanks,
> >  Avner
>
    

Reply via email to