Query on setting up Apache Drill and nested query for json file

Rob E Tue, 09 Jun 2015 05:23:39 -0700

Hello,

We are trying to 'flatten' out a multi level (nested) json file that is the
result of a REST API call. We were going to try flatten out this to a CSV,
load it into Oracle tables and then do traditional SQL queries. The initial
approach was going to be done using a combination of CSVKIT (in2csv) and
Perl. Until I came across Apache Drill....


I have spent the past 4 days going through the setup and tutorials but have
hit a roadlock on a number of different front which I have listed below and
hope you can help.

1) Where does Drill get the employee.json file from? I have searched my
whole computer after downloading the tar file and looked in the sample data
directory and cannot find it anywhere. I have attached a screenshot after
doing a find command which shows the content of the sample data directory.

2) I have the json file I wish to query in the bin directory but whenever I
try to query it errors - based upon reading the tutes and also looking at
the structure of the select statement agains the employee.json file I THINK
it's right but it fails every time. I have attached a cut down version of
the sample json i am trying to query. It is called sample_json.json

3) The aforementioned json file contains nested data and the problem is the
nested level also contain similar names as some of the higher levels. This
is the way the log information is returned in JSON format from the REST API
call. How can one flatten this? The deepest level is 2 considering that
leve one starts at 0. ie. there are 3 levels at most. In cases where there
are 3 levels, we want to flatten it out, in cases where there are 2, then
only 2 will be flattened out.

4) When trying to executed drill_embedded from the bin directory it starts
up but then when trying to run the same query against employee.json that
works from the UI, the drill_embedded command simply hangs until i have to
kill the process. See the screenshot "drill embedded under cygwin hangs".
Note JAVA_HOME has been set.

5)Then when trying sqlline from the bin directory it starts up, I run the
same employee query, it hangs and then starts togive log4j errors. See
attached screenshot "sqlline_hangs_then_gives_log4j_error"

6) Finally the sqlline bat file works under the Windows command prompt and
returns results when querying employee.json (but no other file) but I
cannot exit despite entering the !quit command. see attached screenshot:
"Unable_to_exit_sqlline_when_running_from_Windows_command_promtp"

Please note I've gone through the documentation and believe I've followed
the instructions as best I can and managed to solve a few issues by
searching online. However, even after trying towatch some Drill youtube
videos and looking on forums like stackoverflow still can't it to work as
nicely as the tutes say and videos demonstrated.

Basically all we want to do is flatten out a nested JSON file and Drill
seems far more suited to this task - if only we can get it working properly
- than coding Python or Perl scripts. THe csvkit (in2csv) works fine but
not for nested JSON data which is our use case.

One final question - how can one 'spool' results to a file?

We've tried giving it a go under Windows, Mac OS and also cygwin but each
time have failed somewhere despite trying to follow the documentation and
tutorial material to the best of our ability.

It just looks soooooo much easier reaing the online doco then actually
trying to implement.

We'd gratefully welcome some feedback on the aforementioned teething issues
we're experiencing as we'd certainly like to fly the Apache Drill flag.

thanks and regards,
Rob.

sample_json.json
Description: application/json

Query on setting up Apache Drill and nested query for json file

Reply via email to