Hi Avner, Can you explain more about what you'd need with the utility escaping function? That seems more practical than creating new storage plugins for each user. --C
> On May 11, 2020, at 10:58 AM, Avner Levy <[email protected]> wrote: > > Thanks Paul, > It seems I wasn't clear enough in my previous email. > I have a server in between the end users and drill (exactly as you > suggested), my concern is SQL attacks on that server by bad parameters. > I've tried Athena and I think they too don't support prepared statements. > In addition, they were slower than Drill in my scenarios. > If Drill had a utility escaping function that clients could use on their > input and know they are safe it would help reducing the risk. > For now, I guess it leaves me with the option of creating a data source per > each client (hoping few thousands of those won't hurt performance) or > making sure I properly escape all users input. > Appreciate your help. > Thanks, > Avner > > On Sun, May 10, 2020 at 9:09 PM Paul Rogers <[email protected]> > wrote: > >> Hi Avner, >> >> Drill does not support prepared statements. Nor does Drill support >> statements with parameters. This is true with all interfaces. These would >> be great features; but they've never been implemented. >> >> >> Drill was designed to operate in a Hadoop-like environment with >> semi-trusted users. (Meaning that, if any user did something malicious, you >> could sue or fire them.) As noted, the file system enforced security. There >> was no notion of the public using Drill to access secure data, with Drill >> acting as the secure gateway. Again, code could be added, but it is not >> there today. FWIW, in its present state, I would not trust Drill on a >> public web site with sensitive data. >> >> >> Given how Drill acts today, I'd wager your best bet is to insert your own >> server between your user and Drill. Allow the user to specify queries in >> some simple, non-SQL way. Then, your server can build the SQL and forward >> it to Drill. >> >> Public Internet (Web Browser --> Secure Gateway) --> Private network (App >> server --> Drill) >> >> >> For example, if I want to know about "Orders", I can specify a date range. >> Your server fills in the storage plugin, table name, WHERE clause, etc. to >> create the SQL. Such an approach allows the Drill REST API is on a private >> IP address within your data center. Only your outward-facing user service >> is on a public IP. With this approach, there is no SQL injection risk >> because you do not directly use the web-provided info in a SQL statement. >> Of course, you have to build your SQL statement correctly, as you are >> doing. Don't just append web text to a SQL statement. >> >> >> I don't think this is unique to Drill. I'd be surprised if most people >> allow, say, public access to their HBase, Cassandra or MySQL DBs. >> >> >> Thanks, >> - Paul >> >> >> >> On Sunday, May 10, 2020, 5:29:13 PM PDT, Avner Levy < >> [email protected]> wrote: >> >> Hi Charles, Paul, >> Thanks for your answers. >> I'm interested in a case, where there is a Rest service which >> authenticate the service's users, get a request with the user parameters >> and build from it the SQL sent to Drill. >> The customers are identified by some account ID and they send for example >> the name of an entity they are looking for as a parameter in the service's >> REST request. >> Then the service can build the select (just an example): >> SELECT x FROM S3.db.`data/[CUSTOMER_ID]/data.parquet` where >> name='CUSTOMER_USER_INPUT]' >> >> In such case, they can still send in CUSTOMER_USER_INPUT the following: "x' >> union SELECT x FROM S3.db.`data/[OTHER_CUSTOMER_ID]/data.parquet`". >> >> Usually such stuff are solved with prepared statements, but I believe this >> isn't supported over REST. >> I would prefer not having to authenticate my end users to Drill since this >> creates more work and complexity. >> Is there a way to have prepared statements in Drill? >> Is it supported in other protocols? (JDBC/ODBC) >> Limiting the query folder outside the SQL would do the job as well. >> Any feedback is appreciated, >> Thanks, >> Avner >> >> >> On Sun, May 10, 2020 at 5:57 PM Paul Rogers <[email protected]> >> wrote: >> >>> Hi Charles, >>> >>> One of the changes I was looking at was allowing multiple SQL statements >>> per REST request to get around the lack of session. The idea would be to >>> issue a number of ALTER SESSION, CTTAS, USE and similar statements >> followed >>> by a single query that returns data. >>> >>> >>> A better solution is to enable session support for the REST API. We >>> discussed the challenges involved due the disconnected nature of HTTP >>> requests. >>> >>> Another good improvement would be a SQL command way to create configs, >> not >>> just JSON editing. That way it would be easier to automate creation of a >>> config. Also, it would be handy to be able to externalize configs so they >>> can be stored in locations other than ZK (or local disk, in embedded >> mode.) >>> For this use case, a query for user "X" would work against the "s3-X" >>> config would could be retrieved from an external system that knows the >>> mapping from user X to the S3 files visible to X, and the security tokens >>> to use for that user. >>> >>> The question for now, however, is how to do this with the code that >> exists >>> in Drill 1.17. I'm hoping someone has worked out a solution. >>> >>> >>> Thanks, >>> - Paul >>> >>> >>> >>> On Sunday, May 10, 2020, 1:05:50 PM PDT, Charles Givre < >>> [email protected]> wrote: >>> >>> Hi Avner, Paul, >>> I was reading this and wondering: >>> >>> 1. Is it in fact true (I think it is) that Drill does not allow multiple >>> queries to be submitted in one REST request? I seem to remember running >>> into that issue when I was trying to do some of the Superset work. >>> 2. If a user is required to be authenticated to execute a query, would >>> that not prevent the possibility of a non-authenticated user executing >>> arbitrary queries against someone else's data? >>> 3. I would definitely create separate data sources for each tenant, but >> I >>> don't know that it is necessary (or helpful) to create one for each >> query. >>> >>> I'd agree with Paul, that Drill's access model needs improvement and that >>> would be a good addition to the project. We might be able to assist with >>> that if there's interest. >>> Best, >>> -- C >>> >>> >>>> On May 10, 2020, at 3:55 PM, Paul Rogers <[email protected]> >>> wrote: >>>> >>>> Hi Avner, >>>> >>>> Drill was designed for a system in which the user name maps to a >>> certificate on the underlying file system, and the file system provides >>> complete security. This model has not been extended to the cloud world. >>>> >>>> What you want is a way to authenticate your user, map the user to a >>> storage plugin config for only that client's files, then restrict that >> user >>> to only that config. Further, you'd want the config to obtain S3 keys >> from >>> a vault of some sort. If you have that, you'd not have to worry about SQL >>> injection since only an authorized user could muck with the SQL, and they >>> could only access their own data -- which they can presumably access >> anyway. >>>> >>>> >>>> At present, Drill has no out-of-the-box security model for this use >>> case; there is no mechanism to associate users with configs, or to >>> externalize S3 security keys. Such a system would be a worthwhile >> addition >>> to the project. >>>> >>>> I wonder, has anyone else found a workaround for this use case? Maybe >>> via Kerberos or some such? >>>> >>>> >>>> Thanks, >>>> - Paul >>>> >>>> >>>> >>>> On Sunday, May 10, 2020, 12:04:16 PM PDT, Avner Levy < >>> [email protected]> wrote: >>>> >>>> Hi, >>>> I'm trying to use Apache Drill as a database for providing SQL over S3 >>>> parquet files. >>>> Drill is used for serving multi-tenant data for multiple customers. >>>> Since I need to build the SQL string using the REST API I'm vulnerable >> to >>>> SQL injection attacks. >>>> I do test all user input and close it between apostrophes and >>>> escape apostrophe in the user input by doubling it but I'm still >>> concerned >>>> about optional SQL attacks. >>>> Will adding a different data source (which points to a different folder >>> on >>>> S3) per tenant is something that will have impact on performance? (I >>> might >>>> have thousands of those) >>>> Does it make sense to create the data source on the fly before query? >>>> Is there another way to limit the sent SQL to a specific folder? >>>> Thanks, >>>> Avner >>> >>
