Hi Rob, Thanks for putting so much effort into getting Drill set up for your use case, we know that there are still some sharp edges in Drill and detailed information about use cases that are hard to set up help us to improve the docs and core project.
As a quick answer, I think you might have run into a few issues with syntactic limitations of Drill today. To do repeated flattens like this it requires you to put each flatten in a subquery with a table and column alias assigned for each intermediate result. There is an issue opened to make this use case simpler: https://issues.apache.org/jira/browse/DRILL-2783 select flatten(t.flat_entries.entries) from (select flatten(entries) as flat_entries from dfs.`/path/to/sample_json.json`) t; Here are answers to a few of your questions: > 1) Where does Drill get the employee.json file from? I have searched my whole computer after downloading the tar file and looked in the sample data directory and cannot find it anywhere. > I have attached a screenshot after doing a find command which shows the content of the sample data directory. *This is taken out of one of the jar files we depend on. I believe it is one of the jars for calcite our sql parsing and planning/optimization engine. This is why you need to use the cp (classpath) storage plugin to access it.* > 2) I have the json file I wish to query in the bin directory but whenever I try to query it errors - > based upon reading the tutes and also looking at the structure of the select statement agains the employee.json file I THINK it's right but it fails every time. > I have attached a cut down version of the sample json i am trying to query. It is called sample_json.json *As previous comments have said, it would be easiest to debug with the error message. Considering what you said about putting it in the bin directory, you might be trying to run a query without a full path to get to the file. Drill is currently designed to query filesystems or a few database systems like Hbase and Mongo. To query a file on your local filesystem you can use the syntax provided above by Neeraja. A default storage plugin called dfs (for distributed filesystem, we consider the local filesystem as a stand in for Drills main data source distributed databases and filesystems) is configured to point to you local disk at the root directory. Therefor you can query any file on your machine with dfs.`full/path/to/file.json` * *We do not currently have a shortcut to read files in the directory where you launched Drill.* 3) The aforementioned json file contains nested data and the problem is the nested level also contain similar names as some of the higher levels. This is the way the log information is returned in JSON format from the REST API call. How can one flatten this? The deepest level is 2 considering that level one starts at 0. ie. there are 3 levels at most. In cases where there are 3 levels, we want to flatten it out, in cases where there are 2, then only 2 will be flattened out. *This is a little more complicated, we can definitely do a static number of flattens. The answer I provided above will allow you to flatten two levels. * *There are some limitations in both how we represent lists in the Drill data model, as well as the json reader implementation that prevent flattening three times with the sample data you provided. The third level of nesting contains only empty lists.When the array support was added the decision was made to consider empty lists as equivalent as non-existent (null) ones. If we read only empty lists we don't bother populating them in the schema. Unfortunately this conflicts with our behavior to materialize a default non-repeated type when one of these 'missing' columns is discovered later. As we get the schema at read time, this discovery happens late in the query when we hit an expression evaluation or operation like flatten.We could just populate the empty lists at read, but it is particularly messy with JSON, because empty lists are typeless and we currently do not have a concept of a typeless lists in Drill. These are all problems we would like to solve soon, and this is probably more detail that you need or want, but its the best I can do to explain why we can give you two levels and it will suddenly fail if you try to flatten 3 levels with this sample file.As an intermediate solution, if you fill **in **one record at the top of each file that contains the full schema throughout the 3 levels of nesting, this will give us a set of types of all upcoming empty lists. If you put in a value you are very unlikely to hit in your production data in one of the fields, you could use this to filter these records in your final result.* *Unfortunately you will need to inject one of these records at the start of each file, as we discover the schema of each file independently.* 4) When trying to executed drill_embedded from the bin directory it starts up but then when trying to run the same query against employee.json that works from the UI, the drill_embedded command simply hangs until i have to kill the process. See the screenshot "drill embedded under cygwin hangs". Note JAVA_HOME has been set. *I don't know why this would be failing. The mail list reject binary attachments, if you can upload the image to a public host and share a link we will take a look.* 5)Then when trying sqlline from the bin directory it starts up, I run the same employee query, it hangs and then starts togive log4j errors. See attached screenshot "sqlline_hangs_then_gives_log4j_error" *The log4j errors do happen in some environments but should not have an impact on Drill.* 6) Finally the sqlline bat file works under the Windows command prompt and returns results when querying employee.json (but no other file) but I cannot exit despite entering the !quit command. see attached screenshot: "Unable_to_exit_sqlline_when_running_from_Windows_command_promtp" *Also not sure about this one, most of the dev team works in unix environments, but we have a few community members that prefer windows. We make sure to at least have some basic testing on all environments before a release goes out.* On Tue, Jun 9, 2015 at 8:36 AM, Andries Engelbrecht < [email protected]> wrote: > Rob, > > What errors are in your drillbit.log and drillbit.out files? > This will help to understand why you are seeing the issues with getting > Drill running. > > To “spool” the query output you can user !record <filename> where it > writes to the relative path from where sqlline was started > I would recommend that you rather use CTAS (Create Tables AS) to get the > output you want, writing a file(s) to the file system and defining the > filesystem storage plug in. > Also remember to set the storage format to the desired output format > before running CTAS. > alter session set `store.format` = ‘csv’; > > > —Andries > > > On Jun 9, 2015, at 6:09 AM, Neeraja Rentachintala < > [email protected]> wrote: > > > Rob > > Before answering specific questions, I just wanted to check how did you > > setup Drill. > > Did you just download the tar from the Drill site and launching in > embedded > > mode? > > I just did the same and am able to query the file you attached. I just > put > > in my local file system under the directory and referenced in the query. > > > > 0: jdbc:drill:zk=local> select flatten(t.entries.entries) from > > dfs.`/Users/nrentachintala/Downloads/yelp/sample_json.json` t limit 5; > > > > > *+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+* > > > > *| ** > > > > EXPR$0 > > > > ** |* > > > > > *+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+* > > > > *| > *{"@type":"activityLogEntry","id":"444312323","type":"BBBDDD","objectName":"lll","runId":3,"runtimeEnvironmentId":"333K2E24333333333332","startTime":"2344-34-13T33:33:41.333Z","endTime":"2344-34-13T33:33:11.333Z","state":4,"failedSourceRows":3,"successSourceRows":4,"failedTargetRows":3,"successTargetRows":4,"errorMsg":"No > > errors encountered."}* |* > > > > *| > *{"@type":"activityLogEntry","id":"444312322","type":"BBBDDD","objectName":"mmm","runId":3,"runtimeEnvironmentId":"333K2E24333333333332","startTime":"2344-34-13T33:33:11.333Z","endTime":"2344-34-13T33:33:13.333Z","state":4,"failedSourceRows":3,"successSourceRows":4,"failedTargetRows":3,"successTargetRows":4,"errorMsg":"No > > errors encountered."}* |* > > > > *| > *{"@type":"activityLogEntry","id":"444312322","type":"FFFGGG","objectName":"","runId":3,"runtimeEnvironmentId":"333K2E24333333333332","startTime":"2344-34-13T33:33:13.333Z","endTime":"2344-34-13T33:33:24.333Z","state":4,"failedSourceRows":3,"successSourceRows":4,"failedTargetRows":3,"successTargetRows":4,"errorMsg":"No > > errors encountered."}* |* > > > > *| > *{"@type":"activityLogEntry","id":"444312322","type":"FFFGGG","objectName":"","runId":3,"runtimeEnvironmentId":"333K2E24333333333332","startTime":"2344-34-13T33:33:24.333Z","endTime":"2344-34-13T33:33:22.333Z","state":4,"failedSourceRows":3,"successSourceRows":3,"failedTargetRows":3,"successTargetRows":3,"errorMsg":"No > > errors encountered."}* |* > > > > *| > *{"@type":"activityLogEntry","id":"444312323","type":"BBBDDD","objectName":"qqq","runId":3,"runtimeEnvironmentId":"333K2E24333333333332","startTime":"2344-34-13T33:33:22.333Z","endTime":"2344-34-13T33:31:14.333Z","state":4,"failedSourceRows":3,"successSourceRows":422,"failedTargetRows":3,"successTargetRows":422,"errorMsg":"No > > errors encountered."}* |* > > > > > *+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+* > > > > > > 0: jdbc:drill:zk=local> select t1.entries.objectName, t1.entries.runId > from > > (select flatten(t.entries.entries) entries from `sample_json.json` t) t1; > > > > *+---------+---------+* > > > > *| **EXPR$0 ** | **EXPR$1 ** |* > > > > *+---------+---------+* > > > > *| *lll * | *3 * |* > > > > *| *mmm * | *3 * |* > > > > *| * * | *3 * |* > > > > *| * * | *3 * |* > > > > *| *qqq * | *3 * |* > > > > *| *ooo * | *3 * |* > > > > *| *ppp * | *3 * |* > > > > *| * * | *3 * |* > > > > *| * * | *3 * |* > > > > *| * * | *3 * |* > > > > *+---------+---------+* > > > > > > On Tue, Jun 9, 2015 at 12:41 AM, Rob E <[email protected]> wrote: > > > >> Hello, > >> > >> We are trying to 'flatten' out a multi level (nested) json file that is > >> the result of a REST API call. We were going to try flatten out this to > a > >> CSV, load it into Oracle tables and then do traditional SQL queries. The > >> initial approach was going to be done using a combination of CSVKIT > >> (in2csv) and Perl. Until I came across Apache Drill.... > >> > >> I have spent the past 4 days going through the setup and tutorials but > >> have hit a roadlock on a number of different front which I have listed > >> below and hope you can help. > >> > >> 1) Where does Drill get the employee.json file from? I have searched my > >> whole computer after downloading the tar file and looked in the sample > data > >> directory and cannot find it anywhere. I have attached a screenshot > after > >> doing a find command which shows the content of the sample data > directory. > >> > >> 2) I have the json file I wish to query in the bin directory but > whenever > >> I try to query it errors - based upon reading the tutes and also > looking at > >> the structure of the select statement agains the employee.json file I > THINK > >> it's right but it fails every time. I have attached a cut down version > of > >> the sample json i am trying to query. It is called sample_json.json > >> > >> 3) The aforementioned json file contains nested data and the problem is > >> the nested level also contain similar names as some of the higher > levels. > >> This is the way the log information is returned in JSON format from the > >> REST API call. How can one flatten this? The deepest level is 2 > considering > >> that leve one starts at 0. ie. there are 3 levels at most. In cases > where > >> there are 3 levels, we want to flatten it out, in cases where there are > 2, > >> then only 2 will be flattened out. > >> > >> 4) When trying to executed drill_embedded from the bin directory it > starts > >> up but then when trying to run the same query against employee.json that > >> works from the UI, the drill_embedded command simply hangs until i have > to > >> kill the process. See the screenshot "drill embedded under cygwin > hangs". > >> Note JAVA_HOME has been set. > >> > >> 5)Then when trying sqlline from the bin directory it starts up, I run > the > >> same employee query, it hangs and then starts togive log4j errors. See > >> attached screenshot "sqlline_hangs_then_gives_log4j_error" > >> > >> 6) Finally the sqlline bat file works under the Windows command prompt > and > >> returns results when querying employee.json (but no other file) but I > >> cannot exit despite entering the !quit command. see attached screenshot: > >> "Unable_to_exit_sqlline_when_running_from_Windows_command_promtp" > >> > >> Please note I've gone through the documentation and believe I've > followed > >> the instructions as best I can and managed to solve a few issues by > >> searching online. However, even after trying towatch some Drill youtube > >> videos and looking on forums like stackoverflow still can't it to work > as > >> nicely as the tutes say and videos demonstrated. > >> > >> Basically all we want to do is flatten out a nested JSON file and Drill > >> seems far more suited to this task - if only we can get it working > properly > >> - than coding Python or Perl scripts. THe csvkit (in2csv) works fine but > >> not for nested JSON data which is our use case. > >> > >> One final question - how can one 'spool' results to a file? > >> > >> We've tried giving it a go under Windows, Mac OS and also cygwin but > each > >> time have failed somewhere despite trying to follow the documentation > and > >> tutorial material to the best of our ability. > >> > >> It just looks soooooo much easier reaing the online doco then actually > >> trying to implement. > >> > >> We'd gratefully welcome some feedback on the aforementioned teething > >> issues we're experiencing as we'd certainly like to fly the Apache Drill > >> flag. > >> > >> thanks and regards, > >> Rob. > >> > >
