Hey Rafael, 
Thanks for the feedback. My original idea was to pull the proxy settings from 
the HTTP_PROXY, HTTPS_PROXY, and ALL_PROXY environment variables, but that 
part isn't quite done yet. Did you set the proxy info via the plugin config?
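For reference, here's roughly the precedence logic I'm aiming for. This is just a sketch; the function and argument names are illustrative, not Drill's actual API:

```python
import os

def resolve_proxy(config_host, config_port, url_scheme):
    # Explicit plugin config wins; otherwise fall back to the
    # scheme-specific env var, then ALL_PROXY.
    if config_host:
        return config_host, config_port
    var = "HTTPS_PROXY" if url_scheme == "https" else "HTTP_PROXY"
    raw = (os.environ.get(var) or os.environ.get(var.lower())
           or os.environ.get("ALL_PROXY") or os.environ.get("all_proxy"))
    if not raw:
        return None, None
    # Values typically look like "http://proxy.example.com:8080"
    hostport = raw.split("://", 1)[-1].rstrip("/")
    host, _, port = hostport.partition(":")
    return host, (int(port) if port else 0)
```

The point is just the ordering: config first, then env vars.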
-- C


> On Apr 1, 2020, at 10:22 AM, Jaimes, Rafael - 0993 - MITLL 
> <[email protected]> wrote:
> 
> Hi all,
> 
> I built Charles's latest branch, including the proxy setup, and it appears 
> to be working quite well through the proxy.
> 
> I'll continue to test and report back if I find any issues.
> 
> Note: beyond Paul's repo recommendations, I had to skip Checkstyle to get the 
> Maven build to complete. You're probably already aware of that; I think it's 
> specific to this branch.
> 
> Thanks!
> Rafael
> 
> -----Original Message-----
> From: Paul Rogers <[email protected]>
> Sent: Wednesday, April 1, 2020 1:29 AM
> To: user <[email protected]>
> Subject: Re: REST data source?
> 
> Thanks, Charles.
> 
> As Charles suggested, I pushed a commit that replaces the "old" JSON reader 
> with the new EVF-based one. Eventually this will allow us to use a "provided 
> schema" to handle any JSON ambiguities.
> 
> As we've been discussing, I'll try to add the ability to specify a path to 
> data: "response/payload/records" or whatever. With the present commit, that 
> path can be parsed in code, but I think a simple path spec would be easier.
> 
> Thanks,
> - Paul
> 
> 
> 
>    On Tuesday, March 31, 2020, 10:00:52 PM PDT, Charles Givre 
> <[email protected]> wrote:
> 
> Hello all,
> I pushed some updates to the REST PR to include initial work on proxy 
> configuration. I haven't updated the docs yet; I'll do that once this is 
> finalized. The commit adds new config variables, as shown below:
> 
> {
>  "type": "http",
>  "cacheResults": true,
>  "connections": {},
>  "timeout": 0,
>  "proxyHost": null,
>  "proxyPort": 0,
>  "proxyType": null,
>  "proxyUsername": null,
>  "proxyPassword": null,
>  "enabled": true
> }
> I started on getting Drill to recognize the proxy info from the environment, 
> but haven't quite finished that.  The plan is for the plugin config to 
> override environment vars.
> Feedback is welcome.
> 
> @paul-rogers, I think you can push to my branch (or submit a PR?) and that 
> will be included in the main PR.
> -- C
> 
> 
> 
>> On Mar 31, 2020, at 10:40 PM, Rafael Jaimes III <[email protected]> wrote:
>> 
>> Yes, your initial assessment was correct: there is extra material
>> besides the data field.
>> The returned JSON has some top-level fields that don't go any deeper,
>> akin to your "status": ok field. In the example I'm running now, one
>> is called MessageState, which is set to "NEW". There's another field
>> called MessageData, which, obviously, holds most of the data. There
>> are some other top-level fields, and one, called MessageHeader, is
>> nested. There's a lot of stuff here, and this is just one "table" I'm
>> querying against now.
>> I'm not sure how it will differ with the other services.
>> 
>> The service is definitely returning multiple records - I believe it's
>> a JSON array and Drill+HTTP/plugin appears to handle it quite well.
>> 
>> You're right, Drill is handling most of the structure by modifying my
>> SELECT statement as you suggested.
>> 
>> For filter pushdown, expressions of that form would be great. That's
>> what I had in mind too.
>> 
>> Thanks,
>> Rafael
>> 
>> On Tue, Mar 31, 2020 at 10:14 PM Paul Rogers
>> <[email protected]>
>> wrote:
>> 
>>> Hi Rafael,
>>> 
>>> Thanks much for the info. We've already implemented filter push-down
>>> for other plugins, and for a few custom REST APIs, so it should be
>>> possible to port it over to the HTTP plugin. If you can supply code,
>>> then you can convert filters to anything you want: a specialized JSON
>>> request body, etc.
>>> To do this generically, we have to make some assumptions, such as
>>> either 1) all fields can be pushed as query parameters, or 2) only
>>> those in some config list. Either way, we know how to create
>>> name=value pairs in either a GET or POST format.
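>>> To make that concrete, the generic conversion amounts to something
>>> like this sketch. It assumes pushed filters arrive as
>>> (column, op, constant) tuples and only equality is translatable;
>>> none of these names are the plugin's real API:

```python
from urllib.parse import urlencode

def filters_to_query(base_url, filters):
    # Keep only equality predicates; render them as name=value pairs.
    params = [(col, val) for (col, op, val) in filters if op == "="]
    return base_url + "?" + urlencode(params) if params else base_url
```

>>> A POST variant would put the same pairs in the request body instead.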
>>> 
>>> You mentioned that your "payload" objects are structured. Drill can
>>> already handle this; your query can map them to the top level:
>>> 
>>> SELECT t.characteristic.color.name AS color_name,
>>>        t.characteristic.color.confidence AS color_confidence, ...
>>> FROM yourTable AS t
>>> 
>>> You'll get that "out of the box." Drill does assume that data is in
>>> "record format": a single list of objects, each representing a record.
>>> Code would be needed to handle, say, two separate lists of objects or
>>> other, more general, JSON structures.
>>> 
>>> 
>>> My specific question was more around the response from your web service.
>>> Does that have extra material besides just the data records? Something 
>>> like:
>>> 
>>> 
>>> { "status": "ok", "data": [ {characteristic: ... }, {...}] }
>>> 
>>> Or, is the response directly an array of objects:
>>> 
>>> [ {characteristic: ... }, {...}]
>>> 
>>> 
>>> If it is just an array, then the "out of the box" plugin will work.
>>> If there is other stuff, then you'll need the new feature to tell
>>> Drill how to find the field to your data. The present version needs
>>> code, but I'm thinking we can just use an array of names in the plugin 
>>> config:
>>> 
>>> dataPath: [ "data" ],
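>>> The traversal itself would be trivial; something like this sketch
>>> (names illustrative):

```python
def extract_records(response, data_path):
    # Walk the configured list of field names down to the record array.
    node = response
    for name in data_path:
        node = node[name]
    return node
```

>>> With dataPath: [ "data" ], the plugin would read just the array under
>>> the "data" field and ignore the rest of the envelope.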
>>> 
>>> Or, in your case, do you get a single record per HTTP request? If a
>>> single record, then either your queries will be super-simple, or
>>> performance will be horrible when requesting multiple records. (The
>>> HTTP plugin only does one request and assumes it will get back a set
>>> of records as a JSON array or as whitespace-separated JSON objects as
>>> in a JSON file.)
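>>> For clarity, the two accepted response shapes can be read like this;
>>> a sketch of the assumption above, not the plugin's actual reader:

```python
import json

def read_records(text):
    # Accept either a JSON array of objects or whitespace-separated
    # JSON objects, as in a JSON file.
    text = text.strip()
    if text.startswith("["):
        return json.loads(text)
    records, decoder, pos = [], json.JSONDecoder(), 0
    while pos < len(text):
        obj, end = decoder.raw_decode(text, pos)
        records.append(obj)
        pos = end
        while pos < len(text) and text[pos].isspace():
            pos += 1
    return records
```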
>>> 
>>> Can you clarify a bit which of these cases your data follows?
>>> 
>>> I like your idea of optionally supplying a parser class for the "hard"
>>> cases:
>>> 
>>> messageParserClass: "com.mycompany.drill.MyMessageParser",
>>> 
>>> As long as the class is on the classpath, Java will find it.
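>>> On the Java side that's just Class.forName() plus instantiation. As a
>>> Python analogue, for illustration only, the resolution step is:

```python
import importlib

def load_parser(dotted_name):
    # Split a "com.mycompany.drill.MyMessageParser"-style name into
    # module and class, then resolve the attribute.
    module_name, _, cls_name = dotted_name.rpartition(".")
    return getattr(importlib.import_module(module_name), cls_name)
```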
>>> 
>>> Finally, on the filter push-down, the existing code we're thinking of
>>> using can handle expressions of the form:
>>> 
>>> column op constant
>>> 
>>> Where "op" is one of the relational operators: =, !=, <, etc. It also
>>> handles the obvious variations (constant op column, column BETWEEN
>>> const1 AND const2, column IN (const1, const2, ...)).
>>> 
>>> The code cannot handle expressions (due to a limitation in Drill itself).
>>> That is, this won't work as a filter push-down: col = 10 + 2 or col +
>>> 2 = 10. Nor can it handle multi-column expressions: column1 = column2, etc.
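>>> In other words, the eligibility check is essentially the following
>>> sketch; the tuple encoding here is mine, not Drill's:

```python
RELOPS = {"=", "!=", "<", "<=", ">", ">="}

def is_pushable(lhs, op, rhs):
    # Column references are ("col", name) tuples; ints, floats, and
    # strings are literals. Only column-op-constant (or the mirrored
    # constant-op-column) qualifies; expressions and column-to-column
    # comparisons stay in Drill.
    def is_col(x):
        return isinstance(x, tuple) and x and x[0] == "col"
    def is_const(x):
        return isinstance(x, (int, float, str))
    if op not in RELOPS:
        return False
    return (is_col(lhs) and is_const(rhs)) or (is_const(lhs) and is_col(rhs))
```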
>>> 
>>> 
>>> I'll write up something more specific so you can see exactly what we
>>> propose.
>>> 
>>> 
>>> Thanks,
>>> - Paul
>>> 
>>> 
>>> 
>>>   On Tuesday, March 31, 2020, 6:39:57 PM PDT, Rafael Jaimes III <
>>> [email protected]> wrote:
>>> 
>>> Either a text description of the parse path or specifying the class
>>> with the message parser could work.
>>> I think the latter would be better, if it were as simple as dropping the
>>> JAR into 3rdparty after Drill is already built.
>>> That way we can just continually add parsers ad-hoc.
>>> 
>>> An example JSON response includes about 4 top-level fields, then 2 of
>>> those fields have many sub-fields.
>>> For example a field could be nested 3 levels deep and say:
>>> 
>>> Characteristic:
>>>   Color:
>>>     Color name: "Red"
>>>     Confidence: 100
>>>   Physical:
>>>     Size: 405
>>>     Confidence: 95
>>> 
>>> As you can imagine, it would be difficult to flatten this because of
>>> repeated sub-field names like "Confidence".
>>> 
>>> I don't think it would be easily exportable into a CSV.
>>> At least for me, a pandas DataFrame is the ultimate destination for
>>> all of this, and DataFrames don't handle nested fields well either.
>>> I'll have to handle some parsing on my end.
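>>> For what it's worth, the parsing on my end will probably be a
>>> dotted-key flatten along these lines (a sketch, not anything
>>> shipping):

```python
def flatten(obj, prefix=""):
    # Flatten nested dicts into dotted keys so repeated sub-field names
    # like "Confidence" stay distinct per parent path.
    out = {}
    for key, val in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(val, dict):
            out.update(flatten(val, path))
        else:
            out[path] = val
    return out
```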
>>> 
>>> Filter pushdown would be huge and much desired.
>>> Our other end users are accustomed to using SQL in that manner, and
>>> the REST API we use fully supports AND, OR, BETWEEN, =, <, >, etc. (I
>>> can get a full list if you're interested).
>>> For example I think "between" is a ",". Converting the SQL statement
>>> into the URL format would be awesome and help streamline querying
>>> across data sources.
>>> This is one of the main reasons why we're so interested in Drill.
>>> 
>>> 
>>> Thanks,
>>> 
>>> Rafael
>>> 
> 
