Re: Query on setting up Apache Drill and nested query for json file

Jason Altekruse Tue, 09 Jun 2015 11:43:32 -0700

Hi Rob,

Thanks for putting so much effort into getting Drill set up for your use
case, we know that there are still some sharp edges in Drill and detailed
information about use cases that are hard to set up help us to improve the
docs and core project.

As a quick answer, I think you might have run into a few issues with
syntactic limitations of Drill today. To do repeated flattens like this it
requires you to put each flatten in a subquery with a table and column
alias assigned for each intermediate result. There is an issue opened to
make this use case simpler: https://issues.apache.org/jira/browse/DRILL-2783

select flatten(t.flat_entries.entries) from (select flatten(entries) as
flat_entries from dfs.`/path/to/sample_json.json`) t;

 Here are answers to a few of your questions:

> 1) Where does Drill get the employee.json file from? I have searched my
whole computer after downloading the tar file and looked in the sample data
directory and cannot find it anywhere.
> I have attached a screenshot after doing a find command which shows the
content of the sample data directory.

*This is taken out of one of the jar files we depend on. I believe it is
one of the jars for calcite our sql parsing and planning/optimization
engine. This is why you need to use the cp (classpath) storage plugin to
access it.*

> 2) I have the json file I wish to query in the bin directory but whenever
I try to query it errors -
> based upon reading the tutes and also looking at the structure of the
select statement agains the employee.json file I THINK it's right but it
fails every time.
> I have attached a cut down version of the sample json i am trying to
query. It is called sample_json.json

*As previous comments have said, it would be easiest to debug with the
error message. Considering what you said about putting it in the bin
directory, you might be trying to run a query without a full path to get to
the file. Drill is currently designed to query filesystems or a few
database systems like Hbase and Mongo. To query a file on your local
filesystem you can use the syntax provided above by Neeraja. A default
storage plugin called dfs (for distributed filesystem, we consider the
local filesystem as a stand in for Drills main data source distributed
databases and filesystems) is configured to point to you local disk at the
root directory. Therefor you can query any file on your machine with
dfs.`full/path/to/file.json` *

*We do not currently have a shortcut to read files in the directory where
you launched Drill.*

 3) The aforementioned json file contains nested data and the problem is
the nested level also contain similar names as some of the higher levels.
This is the way the log information is returned in JSON format from the
REST API call. How can one flatten this? The deepest level is 2 considering
that level one starts at 0. ie. there are 3 levels at most. In cases where
there are 3 levels, we want to flatten it out, in cases where there are 2,
then only 2 will be flattened out.

*This is a little more complicated, we can definitely do a static number of
flattens. The answer I provided above will allow you to flatten two levels.
*

*There are some limitations in both how we represent lists in the Drill
data model, as well as the json reader implementation that prevent
flattening three times with the sample data you provided. The third level
of nesting contains only empty lists.When the array support was added the
decision was made to consider empty lists as equivalent as non-existent
(null) ones. If we read only empty lists we don't bother populating them in
the schema. Unfortunately this conflicts with our behavior to materialize a
default non-repeated type when one of these 'missing' columns is discovered
later. As we get the schema at read time, this discovery happens late in
the query when we hit an expression evaluation or operation like flatten.We
could just populate the empty lists at read, but it is particularly messy
with JSON, because empty lists are typeless and we currently do not have a
concept of a typeless lists in Drill. These are all problems we would like
to solve soon, and this is probably more detail that you need or want, but
its the best I can do to explain why we can give you two levels and it will
suddenly fail if you try to flatten 3 levels with this sample file.As an
intermediate solution, if you fill **in **one record at the top of each
file that contains the full schema throughout the 3 levels of nesting, this
will give us a set of types of all upcoming empty lists. If you put in a
value you are very unlikely to hit in your production data in one of the
fields, you could use this to filter these records in your final
result.* *Unfortunately
you will need to inject one of these records at the start of each file, as
we discover the schema of each file independently.*

4) When trying to executed drill_embedded from the bin directory it starts
up but then when trying to run the same query against employee.json that
works from the UI, the drill_embedded command simply hangs until i have to
kill the process. See the screenshot "drill embedded under cygwin hangs".
Note JAVA_HOME has been set.

*I don't know why this would be failing. The mail list reject binary
attachments, if you can upload the image to a public host and share a link
we will take a look.*

5)Then when trying sqlline from the bin directory it starts up, I run the
same employee query, it hangs and then starts togive log4j errors. See
attached screenshot "sqlline_hangs_then_gives_log4j_error"

*The log4j errors do happen in some environments but should not have an
impact on Drill.*

 6) Finally the sqlline bat file works under the Windows command prompt and
returns results when querying employee.json (but no other file) but I
cannot exit despite entering the !quit command. see attached screenshot:
"Unable_to_exit_sqlline_when_running_from_Windows_command_promtp"

*Also not sure about this one, most of the dev team works in unix
environments, but we have a few community members that prefer windows. We
make sure to at least have some basic testing on all environments before a
release goes out.*

On Tue, Jun 9, 2015 at 8:36 AM, Andries Engelbrecht <
[email protected]> wrote:

> Rob,
>
> What errors are in your drillbit.log and drillbit.out files?
> This will help to understand why you are seeing the issues with getting
> Drill running.
>
> To “spool” the query output you can user !record <filename>  where it
> writes to the relative path from where sqlline was started
> I would recommend that you rather use CTAS (Create Tables AS) to get the
> output you want, writing a file(s) to the file system and defining the
> filesystem storage plug in.
> Also remember to set the storage format to the desired output format
> before running CTAS.
> alter session set `store.format` = ‘csv’;
>
>
> —Andries
>
>
> On Jun 9, 2015, at 6:09 AM, Neeraja Rentachintala <
> [email protected]> wrote:
>
> > Rob
> > Before answering specific questions, I just wanted to check how did you
> > setup Drill.
> > Did you just download the tar from the Drill site and launching in
> embedded
> > mode?
> > I just did the same and am able to query the file you attached. I just
> put
> > in my local file system under the directory and referenced in the query.
> >
> > 0: jdbc:drill:zk=local> select flatten(t.entries.entries) from
> > dfs.`/Users/nrentachintala/Downloads/yelp/sample_json.json` t limit 5;
> >
> >
> *+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+*
> >
> > *| **
> >
> >                      EXPR$0
> >
> >                                               ** |*
> >
> >
> *+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+*
> >
> > *|
> *{"@type":"activityLogEntry","id":"444312323","type":"BBBDDD","objectName":"lll","runId":3,"runtimeEnvironmentId":"333K2E24333333333332","startTime":"2344-34-13T33:33:41.333Z","endTime":"2344-34-13T33:33:11.333Z","state":4,"failedSourceRows":3,"successSourceRows":4,"failedTargetRows":3,"successTargetRows":4,"errorMsg":"No
> > errors encountered."}* |*
> >
> > *|
> *{"@type":"activityLogEntry","id":"444312322","type":"BBBDDD","objectName":"mmm","runId":3,"runtimeEnvironmentId":"333K2E24333333333332","startTime":"2344-34-13T33:33:11.333Z","endTime":"2344-34-13T33:33:13.333Z","state":4,"failedSourceRows":3,"successSourceRows":4,"failedTargetRows":3,"successTargetRows":4,"errorMsg":"No
> > errors encountered."}* |*
> >
> > *|
> *{"@type":"activityLogEntry","id":"444312322","type":"FFFGGG","objectName":"","runId":3,"runtimeEnvironmentId":"333K2E24333333333332","startTime":"2344-34-13T33:33:13.333Z","endTime":"2344-34-13T33:33:24.333Z","state":4,"failedSourceRows":3,"successSourceRows":4,"failedTargetRows":3,"successTargetRows":4,"errorMsg":"No
> > errors encountered."}* |*
> >
> > *|
> *{"@type":"activityLogEntry","id":"444312322","type":"FFFGGG","objectName":"","runId":3,"runtimeEnvironmentId":"333K2E24333333333332","startTime":"2344-34-13T33:33:24.333Z","endTime":"2344-34-13T33:33:22.333Z","state":4,"failedSourceRows":3,"successSourceRows":3,"failedTargetRows":3,"successTargetRows":3,"errorMsg":"No
> > errors encountered."}* |*
> >
> > *|
> *{"@type":"activityLogEntry","id":"444312323","type":"BBBDDD","objectName":"qqq","runId":3,"runtimeEnvironmentId":"333K2E24333333333332","startTime":"2344-34-13T33:33:22.333Z","endTime":"2344-34-13T33:31:14.333Z","state":4,"failedSourceRows":3,"successSourceRows":422,"failedTargetRows":3,"successTargetRows":422,"errorMsg":"No
> > errors encountered."}* |*
> >
> >
> *+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+*
> >
> >
> > 0: jdbc:drill:zk=local> select t1.entries.objectName, t1.entries.runId
> from
> > (select flatten(t.entries.entries) entries from `sample_json.json` t) t1;
> >
> > *+---------+---------+*
> >
> > *| **EXPR$0 ** | **EXPR$1 ** |*
> >
> > *+---------+---------+*
> >
> > *| *lll    * | *3      * |*
> >
> > *| *mmm    * | *3      * |*
> >
> > *| *       * | *3      * |*
> >
> > *| *       * | *3      * |*
> >
> > *| *qqq    * | *3      * |*
> >
> > *| *ooo    * | *3      * |*
> >
> > *| *ppp    * | *3      * |*
> >
> > *| *       * | *3      * |*
> >
> > *| *       * | *3      * |*
> >
> > *| *       * | *3      * |*
> >
> > *+---------+---------+*
> >
> >
> > On Tue, Jun 9, 2015 at 12:41 AM, Rob E <[email protected]> wrote:
> >
> >> Hello,
> >>
> >> We are trying to 'flatten' out a multi level (nested) json file that is
> >> the result of a REST API call. We were going to try flatten out this to
> a
> >> CSV, load it into Oracle tables and then do traditional SQL queries. The
> >> initial approach was going to be done using a combination of CSVKIT
> >> (in2csv) and Perl. Until I came across Apache Drill....
> >>
> >> I have spent the past 4 days going through the setup and tutorials but
> >> have hit a roadlock on a number of different front which I have listed
> >> below and hope you can help.
> >>
> >> 1) Where does Drill get the employee.json file from? I have searched my
> >> whole computer after downloading the tar file and looked in the sample
> data
> >> directory and cannot find it anywhere. I have attached a screenshot
> after
> >> doing a find command which shows the content of the sample data
> directory.
> >>
> >> 2) I have the json file I wish to query in the bin directory but
> whenever
> >> I try to query it errors - based upon reading the tutes and also
> looking at
> >> the structure of the select statement agains the employee.json file I
> THINK
> >> it's right but it fails every time. I have attached a cut down version
> of
> >> the sample json i am trying to query. It is called sample_json.json
> >>
> >> 3) The aforementioned json file contains nested data and the problem is
> >> the nested level also contain similar names as some of the higher
> levels.
> >> This is the way the log information is returned in JSON format from the
> >> REST API call. How can one flatten this? The deepest level is 2
> considering
> >> that leve one starts at 0. ie. there are 3 levels at most. In cases
> where
> >> there are 3 levels, we want to flatten it out, in cases where there are
> 2,
> >> then only 2 will be flattened out.
> >>
> >> 4) When trying to executed drill_embedded from the bin directory it
> starts
> >> up but then when trying to run the same query against employee.json that
> >> works from the UI, the drill_embedded command simply hangs until i have
> to
> >> kill the process. See the screenshot "drill embedded under cygwin
> hangs".
> >> Note JAVA_HOME has been set.
> >>
> >> 5)Then when trying sqlline from the bin directory it starts up, I run
> the
> >> same employee query, it hangs and then starts togive log4j errors. See
> >> attached screenshot "sqlline_hangs_then_gives_log4j_error"
> >>
> >> 6) Finally the sqlline bat file works under the Windows command prompt
> and
> >> returns results when querying employee.json (but no other file) but I
> >> cannot exit despite entering the !quit command. see attached screenshot:
> >> "Unable_to_exit_sqlline_when_running_from_Windows_command_promtp"
> >>
> >> Please note I've gone through the documentation and believe I've
> followed
> >> the instructions as best I can and managed to solve a few issues by
> >> searching online. However, even after trying towatch some Drill youtube
> >> videos and looking on forums like stackoverflow still can't it to work
> as
> >> nicely as the tutes say and videos demonstrated.
> >>
> >> Basically all we want to do is flatten out a nested JSON file and Drill
> >> seems far more suited to this task - if only we can get it working
> properly
> >> - than coding Python or Perl scripts. THe csvkit (in2csv) works fine but
> >> not for nested JSON data which is our use case.
> >>
> >> One final question - how can one 'spool' results to a file?
> >>
> >> We've tried giving it a go under Windows, Mac OS and also cygwin but
> each
> >> time have failed somewhere despite trying to follow the documentation
> and
> >> tutorial material to the best of our ability.
> >>
> >> It just looks soooooo much easier reaing the online doco then actually
> >> trying to implement.
> >>
> >> We'd gratefully welcome some feedback on the aforementioned teething
> >> issues we're experiencing as we'd certainly like to fly the Apache Drill
> >> flag.
> >>
> >> thanks and regards,
> >> Rob.
> >>
>
>

Re: Query on setting up Apache Drill and nested query for json file

Reply via email to