Hi Avner, 
Can you explain more about what you'd need with the utility escaping function?  
That seems more practical than creating new storage plugins for each user.
--C

> On May 11, 2020, at 10:58 AM, Avner Levy <[email protected]> wrote:
> 
> Thanks Paul,
> It seems I wasn't clear enough in my previous email.
> I have a server in between the end users and drill (exactly as you
> suggested), my concern is SQL attacks on that server by bad parameters.
> I've tried Athena and I think they too don't support prepared statements.
> In addition, they were slower than Drill in my scenarios.
> If Drill had a utility escaping function that clients could use on their
> input and know they are safe it would help reducing the risk.
> For now, I guess it leaves me with the option of creating a data source per
> each client (hoping few thousands of those won't hurt performance) or
> making sure I properly escape all users input.
> Appreciate your help.
> Thanks,
>  Avner
> 
> On Sun, May 10, 2020 at 9:09 PM Paul Rogers <[email protected]>
> wrote:
> 
>> Hi Avner,
>> 
>> Drill does not support prepared statements. Nor does Drill support
>> statements with parameters. This is true with all interfaces. These would
>> be great features; but they've never been implemented.
>> 
>> 
>> Drill was designed to operate in a Hadoop-like environment with
>> semi-trusted users. (Meaning that, if any user did something malicious, you
>> could sue or fire them.) As noted, the file system enforced security. There
>> was no notion of the public using Drill to access secure data, with Drill
>> acting as the secure gateway. Again, code could be added, but it is not
>> there today. FWIW, in its present state, I would not trust Drill on a
>> public web site with sensitive data.
>> 
>> 
>> Given how Drill acts today, I'd wager your best bet is to insert your own
>> server between your user and Drill. Allow the user to specify queries in
>> some simple, non-SQL way. Then, your server can build the SQL and forward
>> it to Drill.
>> 
>> Public Internet (Web Browser --> Secure Gateway) --> Private network (App
>> server --> Drill)
>> 
>> 
>> For example, if I want to know about "Orders", I can specify a date range.
>> Your server fills in the storage plugin, table name, WHERE clause, etc. to
>> create the SQL. Such an approach allows the Drill REST API is on a private
>> IP address within your data center. Only your outward-facing user service
>> is on a public IP. With this approach, there is no SQL injection risk
>> because you do not directly use the web-provided info in a SQL statement.
>> Of course, you have to build your SQL statement correctly, as you are
>> doing. Don't just append web text to a SQL statement.
>> 
>> 
>> I don't think this is unique to Drill. I'd be surprised if most people
>> allow, say, public access to their HBase, Cassandra or MySQL DBs.
>> 
>> 
>> Thanks,
>> - Paul
>> 
>> 
>> 
>>    On Sunday, May 10, 2020, 5:29:13 PM PDT, Avner Levy <
>> [email protected]> wrote:
>> 
>> Hi Charles, Paul,
>> Thanks for your answers.
>> I'm interested in a case, where there is a Rest service which
>> authenticate the service's users, get a request with the user parameters
>> and build from it the SQL sent to Drill.
>> The customers are identified by some account ID and they send for example
>> the name of an entity they are looking for as a parameter in the service's
>> REST request.
>> Then the service can build the select (just an example):
>> SELECT x FROM S3.db.`data/[CUSTOMER_ID]/data.parquet` where
>> name='CUSTOMER_USER_INPUT]'
>> 
>> In such case, they can still send in CUSTOMER_USER_INPUT the following: "x'
>> union SELECT x FROM S3.db.`data/[OTHER_CUSTOMER_ID]/data.parquet`".
>> 
>> Usually such stuff are solved with prepared statements, but I believe this
>> isn't supported over REST.
>> I would prefer not having to authenticate my end users to Drill since this
>> creates more work and complexity.
>> Is there a way to have prepared statements in Drill?
>> Is it supported in other protocols? (JDBC/ODBC)
>> Limiting the query folder outside the SQL would do the job as well.
>> Any feedback is appreciated,
>> Thanks,
>>  Avner
>> 
>> 
>> On Sun, May 10, 2020 at 5:57 PM Paul Rogers <[email protected]>
>> wrote:
>> 
>>> Hi Charles,
>>> 
>>> One of the changes I was looking at was allowing multiple SQL statements
>>> per REST request to get around the lack of session. The idea would be to
>>> issue a number of ALTER SESSION, CTTAS, USE and similar statements
>> followed
>>> by a single query that returns data.
>>> 
>>> 
>>> A better solution is to enable session support for the REST API. We
>>> discussed the challenges involved due the disconnected nature of HTTP
>>> requests.
>>> 
>>> Another good improvement would be a SQL command way to create configs,
>> not
>>> just JSON editing. That way it would be easier to automate creation of a
>>> config. Also, it would be handy to be able to externalize configs so they
>>> can be stored in locations other than ZK (or local disk, in embedded
>> mode.)
>>> For this use case, a query for user "X" would work against the "s3-X"
>>> config would could be retrieved from an external system that knows the
>>> mapping from user X to the S3 files visible to X, and the security tokens
>>> to use for that user.
>>> 
>>> The question for now, however, is how to do this with the code that
>> exists
>>> in Drill 1.17. I'm hoping someone has worked out a solution.
>>> 
>>> 
>>> Thanks,
>>> - Paul
>>> 
>>> 
>>> 
>>>   On Sunday, May 10, 2020, 1:05:50 PM PDT, Charles Givre <
>>> [email protected]> wrote:
>>> 
>>> Hi Avner, Paul,
>>> I was reading this and wondering:
>>> 
>>> 1.  Is it in fact true (I think it is) that Drill does not allow multiple
>>> queries to be submitted in one REST request?  I seem to remember running
>>> into that issue when I was trying to do some of the Superset work.
>>> 2.  If a user is required to be authenticated to execute a query, would
>>> that not prevent the possibility of a non-authenticated user executing
>>> arbitrary queries against someone else's data?
>>> 3.  I would definitely create separate data sources for each tenant, but
>> I
>>> don't know that it is necessary (or helpful) to create one for each
>> query.
>>> 
>>> I'd agree with Paul, that Drill's access model needs improvement and that
>>> would be a good addition to the project.  We might be able to assist with
>>> that if there's interest.
>>> Best,
>>> -- C
>>> 
>>> 
>>>> On May 10, 2020, at 3:55 PM, Paul Rogers <[email protected]>
>>> wrote:
>>>> 
>>>> Hi Avner,
>>>> 
>>>> Drill was designed for a system in which the user name maps to a
>>> certificate on the underlying file system, and the file system provides
>>> complete security. This model has not been extended to the cloud world.
>>>> 
>>>> What you want is a way to authenticate your user, map the user to a
>>> storage plugin config for only that client's files, then restrict that
>> user
>>> to only that config. Further, you'd want the config to obtain S3 keys
>> from
>>> a vault of some sort. If you have that, you'd not have to worry about SQL
>>> injection since only an authorized user could muck with the SQL, and they
>>> could only access their own data -- which they can presumably access
>> anyway.
>>>> 
>>>> 
>>>> At present, Drill has no out-of-the-box security model for this use
>>> case; there is no mechanism to associate users with configs, or to
>>> externalize S3 security keys. Such a system would be a worthwhile
>> addition
>>> to the project.
>>>> 
>>>> I wonder, has anyone else found a workaround for this use case? Maybe
>>> via Kerberos or some such?
>>>> 
>>>> 
>>>> Thanks,
>>>> - Paul
>>>> 
>>>> 
>>>> 
>>>>   On Sunday, May 10, 2020, 12:04:16 PM PDT, Avner Levy <
>>> [email protected]> wrote:
>>>> 
>>>> Hi,
>>>> I'm trying to use Apache Drill as a database for providing SQL over S3
>>>> parquet files.
>>>> Drill is used for serving multi-tenant data for multiple customers.
>>>> Since I need to build the SQL string using the REST API I'm vulnerable
>> to
>>>> SQL injection attacks.
>>>> I do test all user input and close it between apostrophes and
>>>> escape apostrophe in the user input by doubling it but I'm still
>>> concerned
>>>> about optional SQL attacks.
>>>> Will adding a different data source (which points to a different folder
>>> on
>>>> S3) per tenant is something that will have impact on performance? (I
>>> might
>>>> have thousands of those)
>>>> Does it make sense to create the data source on the fly before query?
>>>> Is there another way to limit the sent SQL to a specific folder?
>>>> Thanks,
>>>> Avner
>>> 
>> 

Reply via email to