Re: question about JSON query

Jason Altekruse Fri, 09 Jan 2015 10:14:28 -0800

I believe that Jim may have given the appropriate query to satisfy the
needs of the original question, but for anyone who finds this thread I
wanted to give a quick clarification about kvgen. The purpose of this
function is to allow queries against maps where the keys themselves
represent data rather than a schema. For example say you had statistics
about the number of interactions between the users on a social network
stored in a JSON document store. Each user might have a record like this:


{
   user_id : 12345
   user_interactions : {
       "00001" : 10,
       "05678" : 25,
       "11111" : 5
   }
}

Here the record is a summary of a single users interactions with other
users. The record contains the user_id as well as a map between other
users' ids and the number of interactions recorded between each and the
given user_id (if this was a complete dataset there should be the same
number stored in each of those users interaction summary, with 12345 listed
as their friend with the same number of interactions)

If you wanted to run a query to find the best friend of every user, you
could use the following query:

SELECT t.flat_interactions.key from (
      select flatten(kvgen(user_interactions)) as flat_interactions from
dfs.`/tmp/user_table.json`
) as t
order by t.flat_interactions.`value` DESC limit 1

Hope this helps, also see the wiki for more explanation on kvgen and
flatten:

https://cwiki.apache.org/confluence/display/DRILL/FLATTEN+Function
https://cwiki.apache.org/confluence/display/DRILL/KVGEN+Function


On Thu, Jan 1, 2015 at 10:00 AM, Jim Bates <[email protected]> wrote:

> Is this what you're looking for?
>
> select a.`t`.`ProductDescription` as `ProductDescription`,
> a.`t`.`AvailabilityZone` as `AvailabilityZone`, a.`t`.`Timestamp` as
> `Timestamp`, a.`t`.`InstanceType` as `InstanceType`, a.`t`.`SpotPrice` as
> `SpotPrice` from (select flatten(SpotPriceHistory) as t from
> dfs.`/data/tmp_json/1.json`) a;
>
> +--------------------+------------------+------------+--------------+------------+
> | ProductDescription | AvailabilityZone | Timestamp  | InstanceType |
> SpotPrice  |
>
> +--------------------+------------------+------------+--------------+------------+
> | Linux/UNIX         | us-east-1a       | 2015-01-01T13:26:16.000Z |
> cc2.8xlarge  | 0.256900   |
> | Linux/UNIX         | us-east-1a       | 2015-01-01T13:01:38.000Z |
> cc2.8xlarge  | 0.256800   |
>
> +--------------------+------------------+------------+--------------+------------+
>
> On Thu, Jan 1, 2015 at 11:28 AM, Vince Gonzalez <[email protected]>
> wrote:
>
> > Mufy, your suggestion didn't work, but this got me further. Still
> wondering
> > if this is the best way.
> >
> > 0: jdbc:drill:zk=local> create table spot_price_history as (select
> > flatten(SpotPriceHistory) as t from dfs.`/tmp/spot_price_history.json`);
> > +------------+---------------------------+
> > |  Fragment  | Number of records written |
> > +------------+---------------------------+
> > | 0_0        | 638                       |
> > +------------+---------------------------+
> > 1 row selected (0.353 seconds)
> > 0: jdbc:drill:zk=local> select * from spot_price_history limit 5;
> > +------------+
> > |     t      |
> > +------------+
> > |
> >
> >
> {"ProductDescription":"Linux/UNIX","AvailabilityZone":"us-east-1a","Timestamp":"2015-01-01T13:26:16.000Z","InstanceType":"cc2.8xlarge","SpotPrice":"0.256900"}
> > |
> > |
> >
> >
> {"ProductDescription":"Linux/UNIX","AvailabilityZone":"us-east-1a","Timestamp":"2015-01-01T13:01:38.000Z","InstanceType":"cc2.8xlarge","SpotPrice":"0.256800"}
> > |
> > |
> >
> >
> {"ProductDescription":"Linux/UNIX","AvailabilityZone":"us-east-1a","Timestamp":"2015-01-01T12:21:21.000Z","InstanceType":"cc2.8xlarge","SpotPrice":"0.256900"}
> > |
> > |
> >
> >
> {"ProductDescription":"Linux/UNIX","AvailabilityZone":"us-east-1a","Timestamp":"2015-01-01T12:13:21.000Z","InstanceType":"cc2.8xlarge","SpotPrice":"0.257000"}
> > |
> > |
> >
> >
> {"ProductDescription":"Linux/UNIX","AvailabilityZone":"us-east-1a","Timestamp":"2015-01-01T11:57:05.000Z","InstanceType":"cc2.8xlarge","SpotPrice":"0.256900"}
> > |
> > +------------+
> > 5 rows selected (0.052 seconds)
> > 0: jdbc:drill:zk=local> select t['SpotPrice'] from spot_price_history
> limit
> > 5;
> > +------------+
> > |   EXPR$0   |
> > +------------+
> > | 0.256900   |
> > | 0.256800   |
> > | 0.256900   |
> > | 0.257000   |
> > | 0.256900   |
> > +------------+
> > 5 rows selected (0.072 seconds)
> >
> > ---
> >  Vince Gonzalez
> >  Systems Engineer
> >  212.694.3879
> >
> >  mapr.com
> >
> > On Thu, Jan 1, 2015 at 10:06 AM, mufy <[email protected]> wrote:
> >
> > > Have you tried something like -
> > >
> > > select SpotPriceHistory[0].Timestamp from ...
> > >
> > >
> > > ---
> > > Mufeed Usman
> > > My LinkedIn <http://www.linkedin.com/pub/mufeed-usman/28/254/400> | My
> > > Social Cause <http://www.vision2016.org.in/> | My Blogs : LiveJournal
> > > <http://mufeed.livejournal.com>
> > >
> > >
> > >
> > >
> > > On Thu, Jan 1, 2015 at 7:25 PM, Vince Gonzalez <[email protected]>
> > wrote:
> > >
> > > > I've got some data with the following structure:
> > > >
> > > > {
> > > >     "SpotPriceHistory": [
> > > >         {
> > > >             "ProductDescription": "Linux/UNIX",
> > > >             "AvailabilityZone": "us-east-1a",
> > > >             "Timestamp": "2015-01-01T13:26:16.000Z",
> > > >             "InstanceType": "cc2.8xlarge",
> > > >             "SpotPrice": "0.256900"
> > > >         },
> > > >         {
> > > >             "ProductDescription": "Linux/UNIX",
> > > >             "AvailabilityZone": "us-east-1a",
> > > >             "Timestamp": "2015-01-01T13:01:38.000Z",
> > > >             "InstanceType": "cc2.8xlarge",
> > > >             "SpotPrice": "0.256800"
> > > >         },
> > > >         ...
> > > >     ]
> > > > }
> > > >
> > > > In exploring it with Drill, I'm able to get as far as:
> > > >
> > > > > select * from (select kvgen(flatten(SpotPriceHistory)) as t from
> > > > dfs.`/tmp/spot_price_history.json`);
> > > >
> > > > Which yields rows that each contain a list of key-value pairs, which
> > > seems
> > > > like what I want:
> > > >
> > > > +------------+
> > > > |     t      |
> > > > +------------+
> > > > |
> > > >
> > > >
> > >
> >
> [{"key":"ProductDescription","value":"Linux/UNIX"},{"key":"AvailabilityZone","value":"us-east-1a"},{"key":"Timestamp","value":"2014-12-20T11:48:46.000Z"},{"key":"InstanceType","value":"cc2.8xlarge"},{"key":"SpotPrice","value":"0.256900"}]
> > > > |
> > > > |
> > > >
> > > >
> > >
> >
> [{"key":"ProductDescription","value":"Linux/UNIX"},{"key":"AvailabilityZone","value":"us-east-1a"},{"key":"Timestamp","value":"2014-12-20T11:40:49.000Z"},{"key":"InstanceType","value":"cc2.8xlarge"},{"key":"SpotPrice","value":"0.257000"}]
> > > > |
> > > > |
> > > >
> > > >
> > >
> >
> [{"key":"ProductDescription","value":"Linux/UNIX"},{"key":"AvailabilityZone","value":"us-east-1a"},{"key":"Timestamp","value":"2014-12-20T11:36:46.000Z"},{"key":"InstanceType","value":"cc2.8xlarge"},{"key":"SpotPrice","value":"0.257100"}]
> > > > |
> > > >
> > > > But I have not figured out how to pull those keys up into column
> names
> > so
> > > > that I can do a query like:
> > > >
> > > > > select SpotPrice,Timestamp from t
> > > >
> > > > I'll appreciate any tips.
> > > >
> > > > Thanks,
> > > >
> > > > --vince
> > > >
> > >
> >
>

Re: question about JSON query

Reply via email to