I’m using the yelp academic data set <https://www.yelp.com/academic_dataset> for my testing purposes.
Say I want to list all businesses in the “Pets” category, this query doesn’t work: 0: jdbc:drill:zk=sen11:5181,sen12:5181> select `name`, flatten(`categories`) from maprfs.cmatta.`yelp/yelp_academic_dataset_business.json` WHERE flatten(`categories`) = 'Pets' limit 10; Query failed: Query stopped., Failure while trying to materialize incoming schema. Errors: Error in expression at index -1. Error: Missing function implementation: [flatten(VARCHAR-REPEATED)]. Full expression: --UNKNOWN EXPRESSION--.. [ daf9f62a-f39b-4745-a0e3-0160875f5cb7 on se-node13.se.lab:31010 ] Error: exception while executing query: Failure while executing query. (state=,code=0) But this query with a sub-select does work: 0: jdbc:drill:zk=sen11:5181,sen12:5181> select flat.`name`, flat.`categories` from (select `name`, flatten(`categories`) as categories from maprfs.cmatta.`yelp/yelp_academic_dataset_business.json`) as flat WHERE lower(flat.`categories`) = 'pets' limit 10; +------------+------------+ | name | categories | +------------+------------+ | Loving Hands Pet Care | Pets | | Amec Mid-City Animal Hospital | Pets | | PetSmart | Pets | | A Dog's Life Photography | Pets | | Goober Pet Direct | Pets | | Pet Planet | Pets | | All-Star Animal Hospital | Pets | | Team Canine, Inc | Pets | | Foothills Pet Resort | Pets | | Petco | Pets | +------------+------------+ 10 rows selected (0.141 seconds) Now that I can pick out which businesses are in the “Pets” category, I may want to get the average star rating and order them by the number of reviews. However, using the above technique in an aggregation query doesn’t seem to work: 0: jdbc:drill:zk=sen11:5181,sen12:5181> select . . . . . . . . . . . . . . . . . . . > a.`name`, . . . . . . . . . . . . . . . . . . . > count(b.`text`), . . . . . . . . . . . . . . . . . . . > avg(b.`stars`) . . . . . . . . . . . . . . . . . . . > FROM ( . . . . . . . . . . . . . . . . . . . > SELECT `business_id`, `name`, flatten(`categories`) as `category` . . . . . . . . . . . . . . . . . . . > FROM maprfs.`cmatta`.`yelp/yelp_academic_dataset_business.json`) a . . . . . . . . . . . . . . . . . . . > JOIN maprfs.`cmatta`.`yelp/yelp_academic_dataset_review.json` b . . . . . . . . . . . . . . . . . . . > ON a.`business_id` = b.`business_id` . . . . . . . . . . . . . . . . . . . > WHERE a.`category` = 'Pets' . . . . . . . . . . . . . . . . . . . > GROUP BY a.`name` . . . . . . . . . . . . . . . . . . . > ORDER BY count(*) DESC . . . . . . . . . . . . . . . . . . . > LIMIT 10; Query failed: Query failed: Failure while running fragment., Failure while trying to materialize incoming schema. Errors: Error in expression at index 2. Error: Missing function implementation: [flatten(VARCHAR-REPEATED)]. Full expression: null.. [ c28a76ef-ef20-426a-a47e-908e9b68b875 on se-node13.se.lab:31010 ] [ c28a76ef-ef20-426a-a47e-908e9b68b875 on se-node13.se.lab:31010 ] Error: exception while executing query: Failure while executing query. (state=,code=0) What does *Error: Missing function implementation: [flatten(VARCHAR-REPEATED)]* in this context? Am I going about this the wrong way? Chris Matta [email protected] 215-701-3146
