Great example. This also comes up in open TSDB where column names are time offsets within a window. Reading data from HBase or MapR DB gives you a map and having kvgen makes everything slick as a whistle.
On Fri, Jan 9, 2015 at 12:10 PM, Jason Altekruse <[email protected]> wrote: > I believe that Jim may have given the appropriate query to satisfy the > needs of the original question, but for anyone who finds this thread I > wanted to give a quick clarification about kvgen. The purpose of this > function is to allow queries against maps where the keys themselves > represent data rather than a schema. For example say you had statistics > about the number of interactions between the users on a social network > stored in a JSON document store. Each user might have a record like this: > > { > user_id : 12345 > user_interactions : { > "00001" : 10, > "05678" : 25, > "11111" : 5 > } > } > > Here the record is a summary of a single users interactions with other > users. The record contains the user_id as well as a map between other > users' ids and the number of interactions recorded between each and the > given user_id (if this was a complete dataset there should be the same > number stored in each of those users interaction summary, with 12345 listed > as their friend with the same number of interactions) > > If you wanted to run a query to find the best friend of every user, you > could use the following query: > > SELECT t.flat_interactions.key from ( > select flatten(kvgen(user_interactions)) as flat_interactions from > dfs.`/tmp/user_table.json` > ) as t > order by t.flat_interactions.`value` DESC limit 1 > > Hope this helps, also see the wiki for more explanation on kvgen and > flatten: > > https://cwiki.apache.org/confluence/display/DRILL/FLATTEN+Function > https://cwiki.apache.org/confluence/display/DRILL/KVGEN+Function > > > On Thu, Jan 1, 2015 at 10:00 AM, Jim Bates <[email protected]> wrote: > > > Is this what you're looking for? > > > > select a.`t`.`ProductDescription` as `ProductDescription`, > > a.`t`.`AvailabilityZone` as `AvailabilityZone`, a.`t`.`Timestamp` as > > `Timestamp`, a.`t`.`InstanceType` as `InstanceType`, a.`t`.`SpotPrice` as > > `SpotPrice` from (select flatten(SpotPriceHistory) as t from > > dfs.`/data/tmp_json/1.json`) a; > > > > > +--------------------+------------------+------------+--------------+------------+ > > | ProductDescription | AvailabilityZone | Timestamp | InstanceType | > > SpotPrice | > > > > > +--------------------+------------------+------------+--------------+------------+ > > | Linux/UNIX | us-east-1a | 2015-01-01T13:26:16.000Z | > > cc2.8xlarge | 0.256900 | > > | Linux/UNIX | us-east-1a | 2015-01-01T13:01:38.000Z | > > cc2.8xlarge | 0.256800 | > > > > > +--------------------+------------------+------------+--------------+------------+ > > > > On Thu, Jan 1, 2015 at 11:28 AM, Vince Gonzalez <[email protected]> > > wrote: > > > > > Mufy, your suggestion didn't work, but this got me further. Still > > wondering > > > if this is the best way. > > > > > > 0: jdbc:drill:zk=local> create table spot_price_history as (select > > > flatten(SpotPriceHistory) as t from > dfs.`/tmp/spot_price_history.json`); > > > +------------+---------------------------+ > > > | Fragment | Number of records written | > > > +------------+---------------------------+ > > > | 0_0 | 638 | > > > +------------+---------------------------+ > > > 1 row selected (0.353 seconds) > > > 0: jdbc:drill:zk=local> select * from spot_price_history limit 5; > > > +------------+ > > > | t | > > > +------------+ > > > | > > > > > > > > > {"ProductDescription":"Linux/UNIX","AvailabilityZone":"us-east-1a","Timestamp":"2015-01-01T13:26:16.000Z","InstanceType":"cc2.8xlarge","SpotPrice":"0.256900"} > > > | > > > | > > > > > > > > > {"ProductDescription":"Linux/UNIX","AvailabilityZone":"us-east-1a","Timestamp":"2015-01-01T13:01:38.000Z","InstanceType":"cc2.8xlarge","SpotPrice":"0.256800"} > > > | > > > | > > > > > > > > > {"ProductDescription":"Linux/UNIX","AvailabilityZone":"us-east-1a","Timestamp":"2015-01-01T12:21:21.000Z","InstanceType":"cc2.8xlarge","SpotPrice":"0.256900"} > > > | > > > | > > > > > > > > > {"ProductDescription":"Linux/UNIX","AvailabilityZone":"us-east-1a","Timestamp":"2015-01-01T12:13:21.000Z","InstanceType":"cc2.8xlarge","SpotPrice":"0.257000"} > > > | > > > | > > > > > > > > > {"ProductDescription":"Linux/UNIX","AvailabilityZone":"us-east-1a","Timestamp":"2015-01-01T11:57:05.000Z","InstanceType":"cc2.8xlarge","SpotPrice":"0.256900"} > > > | > > > +------------+ > > > 5 rows selected (0.052 seconds) > > > 0: jdbc:drill:zk=local> select t['SpotPrice'] from spot_price_history > > limit > > > 5; > > > +------------+ > > > | EXPR$0 | > > > +------------+ > > > | 0.256900 | > > > | 0.256800 | > > > | 0.256900 | > > > | 0.257000 | > > > | 0.256900 | > > > +------------+ > > > 5 rows selected (0.072 seconds) > > > > > > --- > > > Vince Gonzalez > > > Systems Engineer > > > 212.694.3879 > > > > > > mapr.com > > > > > > On Thu, Jan 1, 2015 at 10:06 AM, mufy <[email protected]> wrote: > > > > > > > Have you tried something like - > > > > > > > > select SpotPriceHistory[0].Timestamp from ... > > > > > > > > > > > > --- > > > > Mufeed Usman > > > > My LinkedIn <http://www.linkedin.com/pub/mufeed-usman/28/254/400> | > My > > > > Social Cause <http://www.vision2016.org.in/> | My Blogs : > LiveJournal > > > > <http://mufeed.livejournal.com> > > > > > > > > > > > > > > > > > > > > On Thu, Jan 1, 2015 at 7:25 PM, Vince Gonzalez <[email protected]> > > > wrote: > > > > > > > > > I've got some data with the following structure: > > > > > > > > > > { > > > > > "SpotPriceHistory": [ > > > > > { > > > > > "ProductDescription": "Linux/UNIX", > > > > > "AvailabilityZone": "us-east-1a", > > > > > "Timestamp": "2015-01-01T13:26:16.000Z", > > > > > "InstanceType": "cc2.8xlarge", > > > > > "SpotPrice": "0.256900" > > > > > }, > > > > > { > > > > > "ProductDescription": "Linux/UNIX", > > > > > "AvailabilityZone": "us-east-1a", > > > > > "Timestamp": "2015-01-01T13:01:38.000Z", > > > > > "InstanceType": "cc2.8xlarge", > > > > > "SpotPrice": "0.256800" > > > > > }, > > > > > ... > > > > > ] > > > > > } > > > > > > > > > > In exploring it with Drill, I'm able to get as far as: > > > > > > > > > > > select * from (select kvgen(flatten(SpotPriceHistory)) as t from > > > > > dfs.`/tmp/spot_price_history.json`); > > > > > > > > > > Which yields rows that each contain a list of key-value pairs, > which > > > > seems > > > > > like what I want: > > > > > > > > > > +------------+ > > > > > | t | > > > > > +------------+ > > > > > | > > > > > > > > > > > > > > > > > > > > [{"key":"ProductDescription","value":"Linux/UNIX"},{"key":"AvailabilityZone","value":"us-east-1a"},{"key":"Timestamp","value":"2014-12-20T11:48:46.000Z"},{"key":"InstanceType","value":"cc2.8xlarge"},{"key":"SpotPrice","value":"0.256900"}] > > > > > | > > > > > | > > > > > > > > > > > > > > > > > > > > [{"key":"ProductDescription","value":"Linux/UNIX"},{"key":"AvailabilityZone","value":"us-east-1a"},{"key":"Timestamp","value":"2014-12-20T11:40:49.000Z"},{"key":"InstanceType","value":"cc2.8xlarge"},{"key":"SpotPrice","value":"0.257000"}] > > > > > | > > > > > | > > > > > > > > > > > > > > > > > > > > [{"key":"ProductDescription","value":"Linux/UNIX"},{"key":"AvailabilityZone","value":"us-east-1a"},{"key":"Timestamp","value":"2014-12-20T11:36:46.000Z"},{"key":"InstanceType","value":"cc2.8xlarge"},{"key":"SpotPrice","value":"0.257100"}] > > > > > | > > > > > > > > > > But I have not figured out how to pull those keys up into column > > names > > > so > > > > > that I can do a query like: > > > > > > > > > > > select SpotPrice,Timestamp from t > > > > > > > > > > I'll appreciate any tips. > > > > > > > > > > Thanks, > > > > > > > > > > --vince > > > > > > > > > > > > > > >
