Great example.

This also comes up with OpenTSDB, where column names are time offsets within
a window. Reading that data from HBase or MapR-DB gives you a map, and having
kvgen makes everything slick as a whistle.
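
As a rough sketch of that pattern (the table and column-family names below
are hypothetical, and HBase values would typically still need a CONVERT_FROM
to be readable), a query like this turns the per-row map into key/value rows:

select t.kv.key as time_offset, t.kv.`value` as reading
from (select flatten(kvgen(metrics)) as kv from hbase.`tsdb`) t;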


On Fri, Jan 9, 2015 at 12:10 PM, Jason Altekruse <[email protected]>
wrote:

> I believe that Jim may have given the appropriate query to satisfy the
> needs of the original question, but for anyone who finds this thread I
> wanted to give a quick clarification about kvgen. The purpose of this
> function is to allow queries against maps where the keys themselves
> represent data rather than a schema. For example, say you had statistics
> about the number of interactions between users on a social network,
> stored in a JSON document store. Each user might have a record like this:
>
> {
>     "user_id" : 12345,
>     "user_interactions" : {
>         "00001" : 10,
>         "05678" : 25,
>         "11111" : 5
>     }
> }
>
> Here the record is a summary of a single user's interactions with other
> users. The record contains the user_id as well as a map from other
> users' ids to the number of interactions recorded between each of them and
> the given user_id. (If this were a complete dataset, each of those users'
> own interaction summaries would list 12345 with the same number of
> interactions.)
>
> If you wanted to run a query to find a user's best friend (the other user
> with the highest interaction count), you could use the following query:
>
> SELECT t.flat_interactions.key FROM (
>     SELECT flatten(kvgen(user_interactions)) AS flat_interactions
>     FROM dfs.`/tmp/user_table.json`
> ) AS t
> ORDER BY t.flat_interactions.`value` DESC LIMIT 1
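>
> For reference, on the record above, flatten(kvgen(user_interactions))
> produces one row per map entry, which looks roughly like this:
>
> {"key":"00001","value":10}
> {"key":"05678","value":25}
> {"key":"11111","value":5}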
>
> Hope this helps. Also see the wiki for more explanation of kvgen and
> flatten:
>
> https://cwiki.apache.org/confluence/display/DRILL/FLATTEN+Function
> https://cwiki.apache.org/confluence/display/DRILL/KVGEN+Function
>
>
> On Thu, Jan 1, 2015 at 10:00 AM, Jim Bates <[email protected]> wrote:
>
> > Is this what you're looking for?
> >
> > select a.`t`.`ProductDescription` as `ProductDescription`,
> >        a.`t`.`AvailabilityZone` as `AvailabilityZone`,
> >        a.`t`.`Timestamp` as `Timestamp`,
> >        a.`t`.`InstanceType` as `InstanceType`,
> >        a.`t`.`SpotPrice` as `SpotPrice`
> > from (select flatten(SpotPriceHistory) as t from dfs.`/data/tmp_json/1.json`) a;
> >
> >
> > +--------------------+------------------+--------------------------+--------------+------------+
> > | ProductDescription | AvailabilityZone | Timestamp                | InstanceType | SpotPrice  |
> > +--------------------+------------------+--------------------------+--------------+------------+
> > | Linux/UNIX         | us-east-1a       | 2015-01-01T13:26:16.000Z | cc2.8xlarge  | 0.256900   |
> > | Linux/UNIX         | us-east-1a       | 2015-01-01T13:01:38.000Z | cc2.8xlarge  | 0.256800   |
> > +--------------------+------------------+--------------------------+--------------+------------+
> >
> > On Thu, Jan 1, 2015 at 11:28 AM, Vince Gonzalez <[email protected]>
> > wrote:
> >
> > > Mufy, your suggestion didn't work, but this got me further. Still
> > > wondering if this is the best way.
> > >
> > > 0: jdbc:drill:zk=local> create table spot_price_history as (select
> > >     flatten(SpotPriceHistory) as t from dfs.`/tmp/spot_price_history.json`);
> > > +------------+---------------------------+
> > > |  Fragment  | Number of records written |
> > > +------------+---------------------------+
> > > | 0_0        | 638                       |
> > > +------------+---------------------------+
> > > 1 row selected (0.353 seconds)
> > > 0: jdbc:drill:zk=local> select * from spot_price_history limit 5;
> > > +------------+
> > > |     t      |
> > > +------------+
> > > | {"ProductDescription":"Linux/UNIX","AvailabilityZone":"us-east-1a","Timestamp":"2015-01-01T13:26:16.000Z","InstanceType":"cc2.8xlarge","SpotPrice":"0.256900"} |
> > > | {"ProductDescription":"Linux/UNIX","AvailabilityZone":"us-east-1a","Timestamp":"2015-01-01T13:01:38.000Z","InstanceType":"cc2.8xlarge","SpotPrice":"0.256800"} |
> > > | {"ProductDescription":"Linux/UNIX","AvailabilityZone":"us-east-1a","Timestamp":"2015-01-01T12:21:21.000Z","InstanceType":"cc2.8xlarge","SpotPrice":"0.256900"} |
> > > | {"ProductDescription":"Linux/UNIX","AvailabilityZone":"us-east-1a","Timestamp":"2015-01-01T12:13:21.000Z","InstanceType":"cc2.8xlarge","SpotPrice":"0.257000"} |
> > > | {"ProductDescription":"Linux/UNIX","AvailabilityZone":"us-east-1a","Timestamp":"2015-01-01T11:57:05.000Z","InstanceType":"cc2.8xlarge","SpotPrice":"0.256900"} |
> > > +------------+
> > > 5 rows selected (0.052 seconds)
> > > 0: jdbc:drill:zk=local> select t['SpotPrice'] from spot_price_history limit 5;
> > > +------------+
> > > |   EXPR$0   |
> > > +------------+
> > > | 0.256900   |
> > > | 0.256800   |
> > > | 0.256900   |
> > > | 0.257000   |
> > > | 0.256900   |
> > > +------------+
> > > 5 rows selected (0.072 seconds)
> > >
> > > ---
> > >  Vince Gonzalez
> > >  Systems Engineer
> > >  212.694.3879
> > >
> > >  mapr.com
> > >
> > > On Thu, Jan 1, 2015 at 10:06 AM, mufy <[email protected]> wrote:
> > >
> > > > Have you tried something like -
> > > >
> > > > select SpotPriceHistory[0].Timestamp from ...
> > > >
> > > >
> > > > ---
> > > > Mufeed Usman
> > > > My LinkedIn <http://www.linkedin.com/pub/mufeed-usman/28/254/400> | My
> > > > Social Cause <http://www.vision2016.org.in/> | My Blogs: LiveJournal
> > > > <http://mufeed.livejournal.com>
> > > >
> > > >
> > > >
> > > >
> > > > On Thu, Jan 1, 2015 at 7:25 PM, Vince Gonzalez <[email protected]>
> > > wrote:
> > > >
> > > > > I've got some data with the following structure:
> > > > >
> > > > > {
> > > > >     "SpotPriceHistory": [
> > > > >         {
> > > > >             "ProductDescription": "Linux/UNIX",
> > > > >             "AvailabilityZone": "us-east-1a",
> > > > >             "Timestamp": "2015-01-01T13:26:16.000Z",
> > > > >             "InstanceType": "cc2.8xlarge",
> > > > >             "SpotPrice": "0.256900"
> > > > >         },
> > > > >         {
> > > > >             "ProductDescription": "Linux/UNIX",
> > > > >             "AvailabilityZone": "us-east-1a",
> > > > >             "Timestamp": "2015-01-01T13:01:38.000Z",
> > > > >             "InstanceType": "cc2.8xlarge",
> > > > >             "SpotPrice": "0.256800"
> > > > >         },
> > > > >         ...
> > > > >     ]
> > > > > }
> > > > >
> > > > > In exploring it with Drill, I'm able to get as far as:
> > > > >
> > > > > > select * from (select kvgen(flatten(SpotPriceHistory)) as t from dfs.`/tmp/spot_price_history.json`);
> > > > >
> > > > > Which yields rows that each contain a list of key-value pairs, which
> > > > > seems like what I want:
> > > > >
> > > > > +------------+
> > > > > |     t      |
> > > > > +------------+
> > > > > | [{"key":"ProductDescription","value":"Linux/UNIX"},{"key":"AvailabilityZone","value":"us-east-1a"},{"key":"Timestamp","value":"2014-12-20T11:48:46.000Z"},{"key":"InstanceType","value":"cc2.8xlarge"},{"key":"SpotPrice","value":"0.256900"}] |
> > > > > | [{"key":"ProductDescription","value":"Linux/UNIX"},{"key":"AvailabilityZone","value":"us-east-1a"},{"key":"Timestamp","value":"2014-12-20T11:40:49.000Z"},{"key":"InstanceType","value":"cc2.8xlarge"},{"key":"SpotPrice","value":"0.257000"}] |
> > > > > | [{"key":"ProductDescription","value":"Linux/UNIX"},{"key":"AvailabilityZone","value":"us-east-1a"},{"key":"Timestamp","value":"2014-12-20T11:36:46.000Z"},{"key":"InstanceType","value":"cc2.8xlarge"},{"key":"SpotPrice","value":"0.257100"}] |
> > > > >
> > > > > But I have not figured out how to pull those keys up into column names
> > > > > so that I can do a query like:
> > > > >
> > > > > > select SpotPrice,Timestamp from t
> > > > >
> > > > > I'll appreciate any tips.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > --vince
> > > > >
> > > >
> > >
> >
>
