Re: Accessing underlying scheme of input

Aman Sinha Fri, 02 Mar 2018 11:49:22 -0800

Erol,
yes indeed Drill is internally creating the schema for Json data.  The top
level field's data type can be found using the TYPEOF(column) function that
Gautam mentioned earlier.
However, I understand you are looking for the nested schema as well, so I
would recommend the following approach:


- Run a CTAS and write a small sample of the table out in Parquet format.
- Run parquet-tools [1] with the 'schema' option to explore the schema.
- NOTE: Although this will show the Parquet schema, it at least will give
an idea of the original JSON schema.

Here's an example:

*// My original JSON data consisting of maps and arrays: *

0: jdbc:drill:drillbit=10.10.101.41> select * from t2.json;

*+-----+---------------+-----------------------------------+*

*| **a1 ** | **     b1      ** | **               c1                ** |*

*+-----+---------------+-----------------------------------+*

*| *10 * | *{"id":[1,2]} * | *{"d1":5,"e1":{"f1":["CA","TX"]}} * |*

*+-----+---------------+-----------------------------------+*



0: jdbc:drill:drillbit=10.10.101.41> create table dfs.tmp.tt3 as select *
from t2.json;

*+-----------+----------------------------+*

*| **Fragment ** | **Number of records written ** |*

*+-----------+----------------------------+*

*| *0_0      * | *1                         * |*

*+-----------+----------------------------+*


*// Analyze using parquet-tools*

[root@aman1 ~]# java -jar  parquet-tools/target/parquet-tools-1.9.0.jar
schema tt3/0_0_0.parquet

message root {

  optional int64 a1;

  optional group b1 {

    repeated int64 id;

  }

  optional group c1 {

    optional int64 d1;

    optional group e1 {

      repeated binary f1 (UTF8);

    }

  }

}

The meaning of these types is defined here [2].


[1] http://parquet.apache.org/downloads/
[2] https://github.com/apache/parquet-format/blob/master/LogicalTypes.md

On Thu, Mar 1, 2018 at 6:46 PM, Erol Akarsu <[email protected]> wrote:

> Boaz,
>
> Thanks for full explanation.
> I got impression that Apache Drill is internally creating json schema for
> input type. I thought we would be able to look deep into that schema,
> Maybe it is not in "describe" level but it is API level we can achieve it.
> That do you think?
>
> Erol Akarsu
>
> On Thu, Mar 1, 2018 at 7:04 PM, Boaz Ben-Zvi <[email protected]> wrote:
>
> > From the docs (https://drill.apache.org/docs/describe/): “Currently,
> > DESCRIBE does not support tables created in a file system.”
> > Seems that it only works well for Hive and HBase tables.
> >
> > The create view statement does not explore the actual schema of the
> > query’s table(s); it only parses and keeps the query.
> > Hence it can not tell what ‘*’ (or any list of column names) means:
> >
> >  0: jdbc:drill:zk=local> create view pview as select * from
> > dfs.`/data/PARQUET-1M.parquet`;
> > +-------+-------------------------------------------------------+
> > |  ok   |                        summary                        |
> > +-------+-------------------------------------------------------+
> > | true  | View 'pview' created successfully in 'dfs.tmp' schema  |
> > +-------+-------------------------------------------------------+
> > 1 row selected (0.274 seconds)
> > 0: jdbc:drill:zk=local> describe pview;
> > +--------------+---------------+--------------+
> > | COLUMN_NAME  |   DATA_TYPE   | IS_NULLABLE  |
> > +--------------+---------------+--------------+
> > | **           | DYNAMIC_STAR  | YES          |
> > +--------------+---------------+--------------+
> > 1 row selected (0.419 seconds)
> > 0: jdbc:drill:zk=local> create view pview2 as select
> > row_count,just_garbage from  dfs.`/data/PARQUET-1M.parquet`;
> > +-------+---------------------------------------------------------+
> > |  ok   |                         summary                         |
> > +-------+---------------------------------------------------------+
> > | true  | View 'pview2' created successfully in 'dfs.tmp' schema  |
> > +-------+---------------------------------------------------------+
> > 1 row selected (0.3 seconds)
> > 0: jdbc:drill:zk=local> describe pview2;
> > +---------------+------------+--------------+
> > |  COLUMN_NAME  | DATA_TYPE  | IS_NULLABLE  |
> > +---------------+------------+--------------+
> > | row_count     | ANY        | YES          |
> > | just_garbage  | ANY        | YES          |
> > +---------------+------------+--------------+
> > 2 rows selected (0.533 seconds)
> > 0: jdbc:drill:zk=local>
> >
> > Thanks,
> >
> >       Boaz
> >
> > On 3/1/18, 3:05 PM, "Padma Penumarthy" <[email protected]> wrote:
> >
> >     Not sure why it is not showing the fields. It does not work for me
> > either.
> >     Does anyone know more ? Is this broken ?
> >
> >     Thanks
> >     Padma
> >
> >     > On Mar 1, 2018, at 2:54 PM, Erol Akarsu <[email protected]> wrote:
> >     >
> >     > Somehow, after "user dfs.tmp", I was able to create view. But
> > "describe"
> >     > for view does not give much information. I was expecting "describe"
> > command
> >     > would give type  definitions of fields " employee_id  |
> > full_name     |
> >     > first_name  | last_name  | position_id  |   position_title    |
> > store_id  |
> >     > depart ". But it does give a very generic field type.
> >     >
> >     >
> >     > 0: jdbc:drill:zk=local> create view mydonuts2 as SELECT * FROM
> >     > cp.`employee.json` LIMIT 3;
> >     > +-------+---------------------------------------------------
> > ---------+
> >     > |  ok   |                          summary
> >  |
> >     > +-------+---------------------------------------------------
> > ---------+
> >     > | true  | View 'mydonuts2' created successfully in 'dfs.tmp' schema
> > |
> >     > +-------+---------------------------------------------------
> > ---------+
> >     > 1 row selected (0.283 seconds)
> >     > 0: jdbc:drill:zk=local> describe mydonuts2;
> >     > +--------------+------------+--------------+
> >     > | COLUMN_NAME  | DATA_TYPE  | IS_NULLABLE  |
> >     > +--------------+------------+--------------+
> >     > | *            | ANY        | YES          |
> >     > +--------------+------------+--------------+
> >     > 1 row selected (0.388 seconds)
> >     > 0: jdbc:drill:zk=local>  SELECT * FROM cp.`employee.json` LIMIT 3;
> >     > +--------------+------------------+-------------+-----------
> > -+--------------+---------------------+-----------+--------+
> >     > | employee_id  |    full_name     | first_name  | last_name  |
> > position_id
> >     > |   position_title    | store_id  | depart |
> >     > +--------------+------------------+-------------+-----------
> > -+--------------+---------------------+-----------+--------+
> >     > | 1            | Sheri Nowmer     | Sheri       | Nowmer     | 1
> >     > | President           | 0         | 1      |
> >     > | 2            | Derrick Whelply  | Derrick     | Whelply    | 2
> >     > | VP Country Manager  | 0         | 1      |
> >     > | 4            | Michael Spence   | Michael     | Spence     | 2
> >     > | VP Country Manager  | 0         | 1      |
> >     > +--------------+------------------+-------------+-----------
> > -+--------------+---------------------+-----------+--------+
> >     > 3 rows selected (0.579 seconds)
> >     >
> >     > On Thu, Mar 1, 2018 at 3:18 PM, Erol Akarsu <[email protected]>
> > wrote:
> >     >
> >     >> Padma,
> >     >>
> >     >> I have not created any user. I just installed the system and run
> > drill
> >     >> with  "sqlline.bat -u "jdbc:drill:zk=local"
> >     >> Therefore, what is shortest procedure to achieve what you have
> > described
> >     >> in previous email?
> >     >>
> >     >> Thanks
> >     >>
> >     >> Erol Akarsu
> >     >>
> >     >> On Thu, Mar 1, 2018 at 3:00 PM, Padma Penumarthy <
> > [email protected]>
> >     >> wrote:
> >     >>
> >     >>> Check if you have permissions to root directory or not.
> >     >>> You may have to specify the complete directory path (for which
> you
> > have
> >     >>> permissions for) in the create view command.
> >     >>>
> >     >>> For example:
> >     >>>
> >     >>> 0: jdbc:drill:zk=local> create view dfs.root.`/Users/ppenumarthy/
> > parquet/test-view`
> >     >>> as select * from dfs.root.`/Users/ppenumarthy/
> > parquet/0_0_0.parquet`;
> >     >>> +-------+---------------------------------------------------
> >     >>> -------------------------------------+
> >     >>> |  ok   |                                        summary
> >     >>>                       |
> >     >>> +-------+---------------------------------------------------
> >     >>> -------------------------------------+
> >     >>> | true  | View '/Users/ppenumarthy/parquet/test-view' created
> >     >>> successfully in 'dfs.root' schema  |
> >     >>> +-------+---------------------------------------------------
> >     >>> -------------------------------------+
> >     >>> 1 row selected (0.148 seconds)
> >     >>> 0: jdbc:drill:zk=local>
> >     >>>
> >     >>>
> >     >>> Thanks
> >     >>> Padma
> >     >>>
> >     >>> On Mar 1, 2018, at 11:37 AM, Erol Akarsu <[email protected]
> > <mailto:eaka
> >     >>> [email protected]>> wrote:
> >     >>>
> >     >>> Padma,
> >     >>>
> >     >>> I have changed dfs storage plugin through web interface as below.
> > But I am
> >     >>> getting same error response.
> >     >>>
> >     >>> {
> >     >>> "type": "file",
> >     >>> "enabled": true,
> >     >>> "connection": "file:///",
> >     >>> "config": null,
> >     >>> "workspaces": {
> >     >>>   "root": {
> >     >>>     "location": "/",
> >     >>>     "writable": true,
> >     >>>     "defaultInputFormat": null,
> >     >>>     "allowAccessOutsideWorkspace": true
> >     >>>   },
> >     >>>   "tmp": {
> >     >>>     "location": "/tmp",
> >     >>>     "writable": true,
> >     >>>     "defaultInputFormat": null,
> >     >>>     "allowAccessOutsideWorkspace": true
> >     >>>   }
> >     >>> },
> >     >>>
> >     >>> On Thu, Mar 1, 2018 at 1:15 PM, Padma Penumarthy <
> > [email protected]
> >     >>> <mailto:[email protected]>>
> >     >>> wrote:
> >     >>>
> >     >>> Make "writable": true for the workspace (dfs.root) in the storage
> > plugin
> >     >>> configuration.
> >     >>>
> >     >>> Thanks
> >     >>> Padma
> >     >>>
> >     >>>
> >     >>> On Mar 1, 2018, at 10:10 AM, Erol Akarsu <[email protected]
> > <mailto:eaka
> >     >>> [email protected]><mailto:eaka
> >     >>> [email protected]<mailto:[email protected]>>> wrote:
> >     >>>
> >     >>> Thanks Padma.
> >     >>>
> >     >>> I am getting problem while creating view
> >     >>>
> >     >>> 0: jdbc:drill:zk=local> create view mydonuts as SELECT * FROM
> >     >>> cp.`employee.json` LIMIT 3;
> >     >>> Error: VALIDATION ERROR: Root schema is immutable. Creating or
> > dropping
> >     >>> tables/views is not allowed in root schema.Select a schema using
> > 'USE
> >     >>> schema' command.
> >     >>>
> >     >>>
> >     >>> [Error Id: 68a31047-5a4e-4768-8722-55648d9a80f6 on
> > DESKTOP-8OANV3A:31010]
> >     >>> (state=,code=0)
> >     >>> 0: jdbc:drill:zk=local>
> >     >>>
> >     >>> On Thu, Mar 1, 2018 at 12:49 PM, Padma Penumarthy <
> > [email protected]
> >     >>> <mailto:[email protected]>
> >     >>> <mailto:[email protected]>>
> >     >>> wrote:
> >     >>>
> >     >>> Try creating a view and use describe.
> >     >>>
> >     >>> https://urldefense.proofpoint.com/v2/url?u=https-3A__drill.
> >     >>> apache.org_docs_describe_&d=DwIBaQ&c=cskdkSMqhcnjZxdQVpwTXg&r=
> >     >>> XVqW14B9eGK9QR_fKKCb5H5LxKnqNMmq1U7RdNlhq1c&
> > m=blTmu-WQJa5RUrxqG46o20B-a-
> >     >>> UP0yCHKGqcBo1ETlI&s=SxKqC4w_I5bbnNAtIvzO_9qptLP-
> KsHi0iIbjtzyrc8&e=
> >     >>>
> >     >>> Thanks
> >     >>> Padma
> >     >>>
> >     >>>
> >     >>> On Mar 1, 2018, at 9:22 AM, Erol Akarsu <[email protected]
> <mailto:
> > eaka
> >     >>> [email protected]><mailto:eaka
> >     >>> [email protected]<mailto:[email protected]>><mailto:eaka
> >     >>> [email protected]<mailto:[email protected]><mailto:[email protected]>>>
> wrote:
> >     >>>
> >     >>> When Use limit 0 query,  I am getting only field names. I am
> > looking for
> >     >>> json schema for input that will describe input type
> >     >>>
> >     >>> 0: jdbc:drill:> select * from `clicks/clicks.json` limit 0;
> >     >>>
> >     >>> +-----------+-------+-------+------------+-------------+
> >     >>> | trans_id  | date  | time  | user_info  | trans_info  |
> >     >>> +-----------+-------+-------+------------+-------------+
> >     >>> +-----------+-------+-------+------------+-------------+
> >     >>> No
> >     >>>
> >     >>> On Thu, Mar 1, 2018 at 11:24 AM, Erol Akarsu <[email protected]
> > <mailto:
> >     >>> [email protected]><mailto:
> >     >>> [email protected]<mailto:[email protected]>><mailto:
> >     >>> [email protected]<mailto:[email protected]><mailto:eakarsu@
> > gmail.com>>>
> >     >>> wrote:
> >     >>>
> >     >>> I am sorry Sorabh
> >     >>> Can you give an example? I am still learning Drill
> >     >>> Thanks
> >     >>>
> >     >>> On Thu, Mar 1, 2018 at 11:11 AM Sorabh Hamirwasia <
> > [email protected]
> >     >>> <mailto:[email protected]>
> >     >>> <mailto:[email protected]>
> >     >>> <mailto:[email protected]>>
> >     >>> wrote:
> >     >>>
> >     >>> Hi Erol,
> >     >>>
> >     >>> You can run limit 0 query from client to retrieve just the schema
> > for
> >     >>> your input.
> >     >>>
> >     >>>
> >     >>> Thanks,
> >     >>> Sorabh
> >     >>>
> >     >>> ________________________________
> >     >>> From: Erol Akarsu <[email protected]<mailto:[email protected]
> > ><mailto:
> >     >>> [email protected]><mailto:
> >     >>> [email protected]<mailto:[email protected]>>>
> >     >>> Sent: Thursday, March 1, 2018 5:28:52 AM
> >     >>> To: [email protected]<mailto:[email protected]><mailto:u
> >     >>> [email protected]><mailto:u
> >     >>> [email protected]<mailto:[email protected]>>
> >     >>> Subject: Accessing underlying scheme of input
> >     >>>
> >     >>> I know Apache drill is creating a json schema for input data file
> > or hdfs
> >     >>> input before user query on it.
> >     >>> I like to know whether or not Apache drill has API that will help
> > user to
> >     >>> obtain that  derived schema for say an json file or excel file or
> > hive
> >     >>> input.
> >     >>> I appreciate your help
> >     >>>
> >     >>> Erol Akarsu
> >     >>>
> >     >>> Sent from Mail for Windows 10
> >     >>>
> >     >>> --
> >     >>>
> >     >>> Erol Akarsu
> >     >>>
> >     >>>
> >     >>>
> >     >>>
> >     >>> --
> >     >>>
> >     >>> Erol Akarsu
> >     >>>
> >     >>>
> >     >>>
> >     >>>
> >     >>> --
> >     >>>
> >     >>> Erol Akarsu
> >     >>>
> >     >>>
> >     >>>
> >     >>>
> >     >>> --
> >     >>>
> >     >>> Erol Akarsu
> >     >>>
> >     >>>
> >     >>
> >     >>
> >     >> --
> >     >>
> >     >> Erol Akarsu
> >     >>
> >     >>
> >     >
> >     >
> >     > --
> >     >
> >     > Erol Akarsu
> >
> >
> >
> >
>
>
> --
>
> Erol Akarsu
>

Re: Accessing underlying scheme of input

Reply via email to