Erol,
yes indeed Drill is internally creating the schema for Json data. The top
level field's data type can be found using the TYPEOF(column) function that
Gautam mentioned earlier.
However, I understand you are looking for the nested schema as well, so I
would recommend the following approach:
- Run a CTAS and write a small sample of the table out in Parquet format.
- Run parquet-tools [1] with the 'schema' option to explore the schema.
- NOTE: Although this will show the Parquet schema, it at least will give
an idea of the original JSON schema.
Here's an example:
*// My original JSON data consisting of maps and arrays: *
0: jdbc:drill:drillbit=10.10.101.41> select * from t2.json;
*+-----+---------------+-----------------------------------+*
*| **a1 ** | ** b1 ** | ** c1 ** |*
*+-----+---------------+-----------------------------------+*
*| *10 * | *{"id":[1,2]} * | *{"d1":5,"e1":{"f1":["CA","TX"]}} * |*
*+-----+---------------+-----------------------------------+*
0: jdbc:drill:drillbit=10.10.101.41> create table dfs.tmp.tt3 as select *
from t2.json;
*+-----------+----------------------------+*
*| **Fragment ** | **Number of records written ** |*
*+-----------+----------------------------+*
*| *0_0 * | *1 * |*
*+-----------+----------------------------+*
*// Analyze using parquet-tools*
[root@aman1 ~]# java -jar parquet-tools/target/parquet-tools-1.9.0.jar
schema tt3/0_0_0.parquet
message root {
optional int64 a1;
optional group b1 {
repeated int64 id;
}
optional group c1 {
optional int64 d1;
optional group e1 {
repeated binary f1 (UTF8);
}
}
}
The meaning of these types is defined here [2].
[1] http://parquet.apache.org/downloads/
[2] https://github.com/apache/parquet-format/blob/master/LogicalTypes.md
On Thu, Mar 1, 2018 at 6:46 PM, Erol Akarsu <[email protected]> wrote:
> Boaz,
>
> Thanks for full explanation.
> I got impression that Apache Drill is internally creating json schema for
> input type. I thought we would be able to look deep into that schema,
> Maybe it is not in "describe" level but it is API level we can achieve it.
> That do you think?
>
> Erol Akarsu
>
> On Thu, Mar 1, 2018 at 7:04 PM, Boaz Ben-Zvi <[email protected]> wrote:
>
> > From the docs (https://drill.apache.org/docs/describe/): “Currently,
> > DESCRIBE does not support tables created in a file system.”
> > Seems that it only works well for Hive and HBase tables.
> >
> > The create view statement does not explore the actual schema of the
> > query’s table(s); it only parses and keeps the query.
> > Hence it can not tell what ‘*’ (or any list of column names) means:
> >
> > 0: jdbc:drill:zk=local> create view pview as select * from
> > dfs.`/data/PARQUET-1M.parquet`;
> > +-------+-------------------------------------------------------+
> > | ok | summary |
> > +-------+-------------------------------------------------------+
> > | true | View 'pview' created successfully in 'dfs.tmp' schema |
> > +-------+-------------------------------------------------------+
> > 1 row selected (0.274 seconds)
> > 0: jdbc:drill:zk=local> describe pview;
> > +--------------+---------------+--------------+
> > | COLUMN_NAME | DATA_TYPE | IS_NULLABLE |
> > +--------------+---------------+--------------+
> > | ** | DYNAMIC_STAR | YES |
> > +--------------+---------------+--------------+
> > 1 row selected (0.419 seconds)
> > 0: jdbc:drill:zk=local> create view pview2 as select
> > row_count,just_garbage from dfs.`/data/PARQUET-1M.parquet`;
> > +-------+---------------------------------------------------------+
> > | ok | summary |
> > +-------+---------------------------------------------------------+
> > | true | View 'pview2' created successfully in 'dfs.tmp' schema |
> > +-------+---------------------------------------------------------+
> > 1 row selected (0.3 seconds)
> > 0: jdbc:drill:zk=local> describe pview2;
> > +---------------+------------+--------------+
> > | COLUMN_NAME | DATA_TYPE | IS_NULLABLE |
> > +---------------+------------+--------------+
> > | row_count | ANY | YES |
> > | just_garbage | ANY | YES |
> > +---------------+------------+--------------+
> > 2 rows selected (0.533 seconds)
> > 0: jdbc:drill:zk=local>
> >
> > Thanks,
> >
> > Boaz
> >
> > On 3/1/18, 3:05 PM, "Padma Penumarthy" <[email protected]> wrote:
> >
> > Not sure why it is not showing the fields. It does not work for me
> > either.
> > Does anyone know more ? Is this broken ?
> >
> > Thanks
> > Padma
> >
> > > On Mar 1, 2018, at 2:54 PM, Erol Akarsu <[email protected]> wrote:
> > >
> > > Somehow, after "user dfs.tmp", I was able to create view. But
> > "describe"
> > > for view does not give much information. I was expecting "describe"
> > command
> > > would give type definitions of fields " employee_id |
> > full_name |
> > > first_name | last_name | position_id | position_title |
> > store_id |
> > > depart ". But it does give a very generic field type.
> > >
> > >
> > > 0: jdbc:drill:zk=local> create view mydonuts2 as SELECT * FROM
> > > cp.`employee.json` LIMIT 3;
> > > +-------+---------------------------------------------------
> > ---------+
> > > | ok | summary
> > |
> > > +-------+---------------------------------------------------
> > ---------+
> > > | true | View 'mydonuts2' created successfully in 'dfs.tmp' schema
> > |
> > > +-------+---------------------------------------------------
> > ---------+
> > > 1 row selected (0.283 seconds)
> > > 0: jdbc:drill:zk=local> describe mydonuts2;
> > > +--------------+------------+--------------+
> > > | COLUMN_NAME | DATA_TYPE | IS_NULLABLE |
> > > +--------------+------------+--------------+
> > > | * | ANY | YES |
> > > +--------------+------------+--------------+
> > > 1 row selected (0.388 seconds)
> > > 0: jdbc:drill:zk=local> SELECT * FROM cp.`employee.json` LIMIT 3;
> > > +--------------+------------------+-------------+-----------
> > -+--------------+---------------------+-----------+--------+
> > > | employee_id | full_name | first_name | last_name |
> > position_id
> > > | position_title | store_id | depart |
> > > +--------------+------------------+-------------+-----------
> > -+--------------+---------------------+-----------+--------+
> > > | 1 | Sheri Nowmer | Sheri | Nowmer | 1
> > > | President | 0 | 1 |
> > > | 2 | Derrick Whelply | Derrick | Whelply | 2
> > > | VP Country Manager | 0 | 1 |
> > > | 4 | Michael Spence | Michael | Spence | 2
> > > | VP Country Manager | 0 | 1 |
> > > +--------------+------------------+-------------+-----------
> > -+--------------+---------------------+-----------+--------+
> > > 3 rows selected (0.579 seconds)
> > >
> > > On Thu, Mar 1, 2018 at 3:18 PM, Erol Akarsu <[email protected]>
> > wrote:
> > >
> > >> Padma,
> > >>
> > >> I have not created any user. I just installed the system and run
> > drill
> > >> with "sqlline.bat -u "jdbc:drill:zk=local"
> > >> Therefore, what is shortest procedure to achieve what you have
> > described
> > >> in previous email?
> > >>
> > >> Thanks
> > >>
> > >> Erol Akarsu
> > >>
> > >> On Thu, Mar 1, 2018 at 3:00 PM, Padma Penumarthy <
> > [email protected]>
> > >> wrote:
> > >>
> > >>> Check if you have permissions to root directory or not.
> > >>> You may have to specify the complete directory path (for which
> you
> > have
> > >>> permissions for) in the create view command.
> > >>>
> > >>> For example:
> > >>>
> > >>> 0: jdbc:drill:zk=local> create view dfs.root.`/Users/ppenumarthy/
> > parquet/test-view`
> > >>> as select * from dfs.root.`/Users/ppenumarthy/
> > parquet/0_0_0.parquet`;
> > >>> +-------+---------------------------------------------------
> > >>> -------------------------------------+
> > >>> | ok | summary
> > >>> |
> > >>> +-------+---------------------------------------------------
> > >>> -------------------------------------+
> > >>> | true | View '/Users/ppenumarthy/parquet/test-view' created
> > >>> successfully in 'dfs.root' schema |
> > >>> +-------+---------------------------------------------------
> > >>> -------------------------------------+
> > >>> 1 row selected (0.148 seconds)
> > >>> 0: jdbc:drill:zk=local>
> > >>>
> > >>>
> > >>> Thanks
> > >>> Padma
> > >>>
> > >>> On Mar 1, 2018, at 11:37 AM, Erol Akarsu <[email protected]
> > <mailto:eaka
> > >>> [email protected]>> wrote:
> > >>>
> > >>> Padma,
> > >>>
> > >>> I have changed dfs storage plugin through web interface as below.
> > But I am
> > >>> getting same error response.
> > >>>
> > >>> {
> > >>> "type": "file",
> > >>> "enabled": true,
> > >>> "connection": "file:///",
> > >>> "config": null,
> > >>> "workspaces": {
> > >>> "root": {
> > >>> "location": "/",
> > >>> "writable": true,
> > >>> "defaultInputFormat": null,
> > >>> "allowAccessOutsideWorkspace": true
> > >>> },
> > >>> "tmp": {
> > >>> "location": "/tmp",
> > >>> "writable": true,
> > >>> "defaultInputFormat": null,
> > >>> "allowAccessOutsideWorkspace": true
> > >>> }
> > >>> },
> > >>>
> > >>> On Thu, Mar 1, 2018 at 1:15 PM, Padma Penumarthy <
> > [email protected]
> > >>> <mailto:[email protected]>>
> > >>> wrote:
> > >>>
> > >>> Make "writable": true for the workspace (dfs.root) in the storage
> > plugin
> > >>> configuration.
> > >>>
> > >>> Thanks
> > >>> Padma
> > >>>
> > >>>
> > >>> On Mar 1, 2018, at 10:10 AM, Erol Akarsu <[email protected]
> > <mailto:eaka
> > >>> [email protected]><mailto:eaka
> > >>> [email protected]<mailto:[email protected]>>> wrote:
> > >>>
> > >>> Thanks Padma.
> > >>>
> > >>> I am getting problem while creating view
> > >>>
> > >>> 0: jdbc:drill:zk=local> create view mydonuts as SELECT * FROM
> > >>> cp.`employee.json` LIMIT 3;
> > >>> Error: VALIDATION ERROR: Root schema is immutable. Creating or
> > dropping
> > >>> tables/views is not allowed in root schema.Select a schema using
> > 'USE
> > >>> schema' command.
> > >>>
> > >>>
> > >>> [Error Id: 68a31047-5a4e-4768-8722-55648d9a80f6 on
> > DESKTOP-8OANV3A:31010]
> > >>> (state=,code=0)
> > >>> 0: jdbc:drill:zk=local>
> > >>>
> > >>> On Thu, Mar 1, 2018 at 12:49 PM, Padma Penumarthy <
> > [email protected]
> > >>> <mailto:[email protected]>
> > >>> <mailto:[email protected]>>
> > >>> wrote:
> > >>>
> > >>> Try creating a view and use describe.
> > >>>
> > >>> https://urldefense.proofpoint.com/v2/url?u=https-3A__drill.
> > >>> apache.org_docs_describe_&d=DwIBaQ&c=cskdkSMqhcnjZxdQVpwTXg&r=
> > >>> XVqW14B9eGK9QR_fKKCb5H5LxKnqNMmq1U7RdNlhq1c&
> > m=blTmu-WQJa5RUrxqG46o20B-a-
> > >>> UP0yCHKGqcBo1ETlI&s=SxKqC4w_I5bbnNAtIvzO_9qptLP-
> KsHi0iIbjtzyrc8&e=
> > >>>
> > >>> Thanks
> > >>> Padma
> > >>>
> > >>>
> > >>> On Mar 1, 2018, at 9:22 AM, Erol Akarsu <[email protected]
> <mailto:
> > eaka
> > >>> [email protected]><mailto:eaka
> > >>> [email protected]<mailto:[email protected]>><mailto:eaka
> > >>> [email protected]<mailto:[email protected]><mailto:[email protected]>>>
> wrote:
> > >>>
> > >>> When Use limit 0 query, I am getting only field names. I am
> > looking for
> > >>> json schema for input that will describe input type
> > >>>
> > >>> 0: jdbc:drill:> select * from `clicks/clicks.json` limit 0;
> > >>>
> > >>> +-----------+-------+-------+------------+-------------+
> > >>> | trans_id | date | time | user_info | trans_info |
> > >>> +-----------+-------+-------+------------+-------------+
> > >>> +-----------+-------+-------+------------+-------------+
> > >>> No
> > >>>
> > >>> On Thu, Mar 1, 2018 at 11:24 AM, Erol Akarsu <[email protected]
> > <mailto:
> > >>> [email protected]><mailto:
> > >>> [email protected]<mailto:[email protected]>><mailto:
> > >>> [email protected]<mailto:[email protected]><mailto:eakarsu@
> > gmail.com>>>
> > >>> wrote:
> > >>>
> > >>> I am sorry Sorabh
> > >>> Can you give an example? I am still learning Drill
> > >>> Thanks
> > >>>
> > >>> On Thu, Mar 1, 2018 at 11:11 AM Sorabh Hamirwasia <
> > [email protected]
> > >>> <mailto:[email protected]>
> > >>> <mailto:[email protected]>
> > >>> <mailto:[email protected]>>
> > >>> wrote:
> > >>>
> > >>> Hi Erol,
> > >>>
> > >>> You can run limit 0 query from client to retrieve just the schema
> > for
> > >>> your input.
> > >>>
> > >>>
> > >>> Thanks,
> > >>> Sorabh
> > >>>
> > >>> ________________________________
> > >>> From: Erol Akarsu <[email protected]<mailto:[email protected]
> > ><mailto:
> > >>> [email protected]><mailto:
> > >>> [email protected]<mailto:[email protected]>>>
> > >>> Sent: Thursday, March 1, 2018 5:28:52 AM
> > >>> To: [email protected]<mailto:[email protected]><mailto:u
> > >>> [email protected]><mailto:u
> > >>> [email protected]<mailto:[email protected]>>
> > >>> Subject: Accessing underlying scheme of input
> > >>>
> > >>> I know Apache drill is creating a json schema for input data file
> > or hdfs
> > >>> input before user query on it.
> > >>> I like to know whether or not Apache drill has API that will help
> > user to
> > >>> obtain that derived schema for say an json file or excel file or
> > hive
> > >>> input.
> > >>> I appreciate your help
> > >>>
> > >>> Erol Akarsu
> > >>>
> > >>> Sent from Mail for Windows 10
> > >>>
> > >>> --
> > >>>
> > >>> Erol Akarsu
> > >>>
> > >>>
> > >>>
> > >>>
> > >>> --
> > >>>
> > >>> Erol Akarsu
> > >>>
> > >>>
> > >>>
> > >>>
> > >>> --
> > >>>
> > >>> Erol Akarsu
> > >>>
> > >>>
> > >>>
> > >>>
> > >>> --
> > >>>
> > >>> Erol Akarsu
> > >>>
> > >>>
> > >>
> > >>
> > >> --
> > >>
> > >> Erol Akarsu
> > >>
> > >>
> > >
> > >
> > > --
> > >
> > > Erol Akarsu
> >
> >
> >
> >
>
>
> --
>
> Erol Akarsu
>