Re: Accessing underlying scheme of input

Aman Sinha Fri, 02 Mar 2018 12:03:06 -0800

Looks like the formatting got messed up for the query output.  Here it is
in simple text format:


// My original JSON data consisting of maps and arrays:

0: jdbc:drill:drillbit=10.10.101.41> select * from t2.json;

+-----+---------------+-----------------------------------+
| a1  |      b1       |                c1                 |
+-----+---------------+-----------------------------------+
| 10  | {"id":[1,2]}  | {"d1":5,"e1":{"f1":["CA","TX"]}}  |
+-----+---------------+-----------------------------------+



0: jdbc:drill:drillbit=10.10.101.41> create table dfs.tmp.tt3 as select *
from t2.json;

+-----------+----------------------------+
| Fragment  | Number of records written  |
+-----------+----------------------------+
| 0_0       | 1                          |
+-----------+----------------------------+


// Analyze using parquet-tools

[root@aman1 ~]# java -jar  parquet-tools/target/parquet-tools-1.9.0.jar
schema tt3/0_0_0.parquet

message root {
  optional int64 a1;
  optional group b1 {
    repeated int64 id;
  }
  optional group c1 {
    optional int64 d1;
    optional group e1 {
      repeated binary f1 (UTF8);
    }
  }
}


On Fri, Mar 2, 2018 at 11:48 AM, Aman Sinha <[email protected]> wrote:

> Erol,
> yes indeed Drill is internally creating the schema for Json data.  The top
> level field's data type can be found using the TYPEOF(column) function that
> Gautam mentioned earlier.
> However, I understand you are looking for the nested schema as well, so I
> would recommend the following approach:
>
> - Run a CTAS and write a small sample of the table out in Parquet format.
> - Run parquet-tools [1] with the 'schema' option to explore the schema.
> - NOTE: Although this will show the Parquet schema, it at least will give
> an idea of the original JSON schema.
>
> Here's an example:
>
> *// My original JSON data consisting of maps and arrays: *
>
> 0: jdbc:drill:drillbit=10.10.101.41> select * from t2.json;
>
> *+-----+---------------+-----------------------------------+*
>
> *| **a1 ** | **     b1      ** | **               c1                ** |*
>
> *+-----+---------------+-----------------------------------+*
>
> *| *10 * | *{"id":[1,2]} * | *{"d1":5,"e1":{"f1":["CA","TX"]}} * |*
>
> *+-----+---------------+-----------------------------------+*
>
>
>
> 0: jdbc:drill:drillbit=10.10.101.41> create table dfs.tmp.tt3 as select *
> from t2.json;
>
> *+-----------+----------------------------+*
>
> *| **Fragment ** | **Number of records written ** |*
>
> *+-----------+----------------------------+*
>
> *| *0_0      * | *1                         * |*
>
> *+-----------+----------------------------+*
>
>
> *// Analyze using parquet-tools*
>
> [root@aman1 ~]# java -jar  parquet-tools/target/parquet-tools-1.9.0.jar
> schema tt3/0_0_0.parquet
>
> message root {
>
>   optional int64 a1;
>
>   optional group b1 {
>
>     repeated int64 id;
>
>   }
>
>   optional group c1 {
>
>     optional int64 d1;
>
>     optional group e1 {
>
>       repeated binary f1 (UTF8);
>
>     }
>
>   }
>
> }
>
> The meaning of these types is defined here [2].
>
>
> [1] http://parquet.apache.org/downloads/
> [2] https://github.com/apache/parquet-format/blob/master/LogicalTypes.md
>
> On Thu, Mar 1, 2018 at 6:46 PM, Erol Akarsu <[email protected]> wrote:
>
>> Boaz,
>>
>> Thanks for full explanation.
>> I got impression that Apache Drill is internally creating json schema for
>> input type. I thought we would be able to look deep into that schema,
>> Maybe it is not in "describe" level but it is API level we can achieve it.
>> That do you think?
>>
>> Erol Akarsu
>>
>> On Thu, Mar 1, 2018 at 7:04 PM, Boaz Ben-Zvi <[email protected]> wrote:
>>
>> > From the docs (https://drill.apache.org/docs/describe/): “Currently,
>> > DESCRIBE does not support tables created in a file system.”
>> > Seems that it only works well for Hive and HBase tables.
>> >
>> > The create view statement does not explore the actual schema of the
>> > query’s table(s); it only parses and keeps the query.
>> > Hence it can not tell what ‘*’ (or any list of column names) means:
>> >
>> >  0: jdbc:drill:zk=local> create view pview as select * from
>> > dfs.`/data/PARQUET-1M.parquet`;
>> > +-------+-------------------------------------------------------+
>> > |  ok   |                        summary                        |
>> > +-------+-------------------------------------------------------+
>> > | true  | View 'pview' created successfully in 'dfs.tmp' schema  |
>> > +-------+-------------------------------------------------------+
>> > 1 row selected (0.274 seconds)
>> > 0: jdbc:drill:zk=local> describe pview;
>> > +--------------+---------------+--------------+
>> > | COLUMN_NAME  |   DATA_TYPE   | IS_NULLABLE  |
>> > +--------------+---------------+--------------+
>> > | **           | DYNAMIC_STAR  | YES          |
>> > +--------------+---------------+--------------+
>> > 1 row selected (0.419 seconds)
>> > 0: jdbc:drill:zk=local> create view pview2 as select
>> > row_count,just_garbage from  dfs.`/data/PARQUET-1M.parquet`;
>> > +-------+---------------------------------------------------------+
>> > |  ok   |                         summary                         |
>> > +-------+---------------------------------------------------------+
>> > | true  | View 'pview2' created successfully in 'dfs.tmp' schema  |
>> > +-------+---------------------------------------------------------+
>> > 1 row selected (0.3 seconds)
>> > 0: jdbc:drill:zk=local> describe pview2;
>> > +---------------+------------+--------------+
>> > |  COLUMN_NAME  | DATA_TYPE  | IS_NULLABLE  |
>> > +---------------+------------+--------------+
>> > | row_count     | ANY        | YES          |
>> > | just_garbage  | ANY        | YES          |
>> > +---------------+------------+--------------+
>> > 2 rows selected (0.533 seconds)
>> > 0: jdbc:drill:zk=local>
>> >
>> > Thanks,
>> >
>> >       Boaz
>> >
>> > On 3/1/18, 3:05 PM, "Padma Penumarthy" <[email protected]> wrote:
>> >
>> >     Not sure why it is not showing the fields. It does not work for me
>> > either.
>> >     Does anyone know more ? Is this broken ?
>> >
>> >     Thanks
>> >     Padma
>> >
>> >     > On Mar 1, 2018, at 2:54 PM, Erol Akarsu <[email protected]>
>> wrote:
>> >     >
>> >     > Somehow, after "user dfs.tmp", I was able to create view. But
>> > "describe"
>> >     > for view does not give much information. I was expecting
>> "describe"
>> > command
>> >     > would give type  definitions of fields " employee_id  |
>> > full_name     |
>> >     > first_name  | last_name  | position_id  |   position_title    |
>> > store_id  |
>> >     > depart ". But it does give a very generic field type.
>> >     >
>> >     >
>> >     > 0: jdbc:drill:zk=local> create view mydonuts2 as SELECT * FROM
>> >     > cp.`employee.json` LIMIT 3;
>> >     > +-------+---------------------------------------------------
>> > ---------+
>> >     > |  ok   |                          summary
>> >  |
>> >     > +-------+---------------------------------------------------
>> > ---------+
>> >     > | true  | View 'mydonuts2' created successfully in 'dfs.tmp'
>> schema
>> > |
>> >     > +-------+---------------------------------------------------
>> > ---------+
>> >     > 1 row selected (0.283 seconds)
>> >     > 0: jdbc:drill:zk=local> describe mydonuts2;
>> >     > +--------------+------------+--------------+
>> >     > | COLUMN_NAME  | DATA_TYPE  | IS_NULLABLE  |
>> >     > +--------------+------------+--------------+
>> >     > | *            | ANY        | YES          |
>> >     > +--------------+------------+--------------+
>> >     > 1 row selected (0.388 seconds)
>> >     > 0: jdbc:drill:zk=local>  SELECT * FROM cp.`employee.json` LIMIT 3;
>> >     > +--------------+------------------+-------------+-----------
>> > -+--------------+---------------------+-----------+--------+
>> >     > | employee_id  |    full_name     | first_name  | last_name  |
>> > position_id
>> >     > |   position_title    | store_id  | depart |
>> >     > +--------------+------------------+-------------+-----------
>> > -+--------------+---------------------+-----------+--------+
>> >     > | 1            | Sheri Nowmer     | Sheri       | Nowmer     | 1
>> >     > | President           | 0         | 1      |
>> >     > | 2            | Derrick Whelply  | Derrick     | Whelply    | 2
>> >     > | VP Country Manager  | 0         | 1      |
>> >     > | 4            | Michael Spence   | Michael     | Spence     | 2
>> >     > | VP Country Manager  | 0         | 1      |
>> >     > +--------------+------------------+-------------+-----------
>> > -+--------------+---------------------+-----------+--------+
>> >     > 3 rows selected (0.579 seconds)
>> >     >
>> >     > On Thu, Mar 1, 2018 at 3:18 PM, Erol Akarsu <[email protected]>
>> > wrote:
>> >     >
>> >     >> Padma,
>> >     >>
>> >     >> I have not created any user. I just installed the system and run
>> > drill
>> >     >> with  "sqlline.bat -u "jdbc:drill:zk=local"
>> >     >> Therefore, what is shortest procedure to achieve what you have
>> > described
>> >     >> in previous email?
>> >     >>
>> >     >> Thanks
>> >     >>
>> >     >> Erol Akarsu
>> >     >>
>> >     >> On Thu, Mar 1, 2018 at 3:00 PM, Padma Penumarthy <
>> > [email protected]>
>> >     >> wrote:
>> >     >>
>> >     >>> Check if you have permissions to root directory or not.
>> >     >>> You may have to specify the complete directory path (for which
>> you
>> > have
>> >     >>> permissions for) in the create view command.
>> >     >>>
>> >     >>> For example:
>> >     >>>
>> >     >>> 0: jdbc:drill:zk=local> create view
>> dfs.root.`/Users/ppenumarthy/
>> > parquet/test-view`
>> >     >>> as select * from dfs.root.`/Users/ppenumarthy/
>> > parquet/0_0_0.parquet`;
>> >     >>> +-------+---------------------------------------------------
>> >     >>> -------------------------------------+
>> >     >>> |  ok   |                                        summary
>> >     >>>                       |
>> >     >>> +-------+---------------------------------------------------
>> >     >>> -------------------------------------+
>> >     >>> | true  | View '/Users/ppenumarthy/parquet/test-view' created
>> >     >>> successfully in 'dfs.root' schema  |
>> >     >>> +-------+---------------------------------------------------
>> >     >>> -------------------------------------+
>> >     >>> 1 row selected (0.148 seconds)
>> >     >>> 0: jdbc:drill:zk=local>
>> >     >>>
>> >     >>>
>> >     >>> Thanks
>> >     >>> Padma
>> >     >>>
>> >     >>> On Mar 1, 2018, at 11:37 AM, Erol Akarsu <[email protected]
>> > <mailto:eaka
>> >     >>> [email protected]>> wrote:
>> >     >>>
>> >     >>> Padma,
>> >     >>>
>> >     >>> I have changed dfs storage plugin through web interface as
>> below.
>> > But I am
>> >     >>> getting same error response.
>> >     >>>
>> >     >>> {
>> >     >>> "type": "file",
>> >     >>> "enabled": true,
>> >     >>> "connection": "file:///",
>> >     >>> "config": null,
>> >     >>> "workspaces": {
>> >     >>>   "root": {
>> >     >>>     "location": "/",
>> >     >>>     "writable": true,
>> >     >>>     "defaultInputFormat": null,
>> >     >>>     "allowAccessOutsideWorkspace": true
>> >     >>>   },
>> >     >>>   "tmp": {
>> >     >>>     "location": "/tmp",
>> >     >>>     "writable": true,
>> >     >>>     "defaultInputFormat": null,
>> >     >>>     "allowAccessOutsideWorkspace": true
>> >     >>>   }
>> >     >>> },
>> >     >>>
>> >     >>> On Thu, Mar 1, 2018 at 1:15 PM, Padma Penumarthy <
>> > [email protected]
>> >     >>> <mailto:[email protected]>>
>> >     >>> wrote:
>> >     >>>
>> >     >>> Make "writable": true for the workspace (dfs.root) in the
>> storage
>> > plugin
>> >     >>> configuration.
>> >     >>>
>> >     >>> Thanks
>> >     >>> Padma
>> >     >>>
>> >     >>>
>> >     >>> On Mar 1, 2018, at 10:10 AM, Erol Akarsu <[email protected]
>> > <mailto:eaka
>> >     >>> [email protected]><mailto:eaka
>> >     >>> [email protected]<mailto:[email protected]>>> wrote:
>> >     >>>
>> >     >>> Thanks Padma.
>> >     >>>
>> >     >>> I am getting problem while creating view
>> >     >>>
>> >     >>> 0: jdbc:drill:zk=local> create view mydonuts as SELECT * FROM
>> >     >>> cp.`employee.json` LIMIT 3;
>> >     >>> Error: VALIDATION ERROR: Root schema is immutable. Creating or
>> > dropping
>> >     >>> tables/views is not allowed in root schema.Select a schema using
>> > 'USE
>> >     >>> schema' command.
>> >     >>>
>> >     >>>
>> >     >>> [Error Id: 68a31047-5a4e-4768-8722-55648d9a80f6 on
>> > DESKTOP-8OANV3A:31010]
>> >     >>> (state=,code=0)
>> >     >>> 0: jdbc:drill:zk=local>
>> >     >>>
>> >     >>> On Thu, Mar 1, 2018 at 12:49 PM, Padma Penumarthy <
>> > [email protected]
>> >     >>> <mailto:[email protected]>
>> >     >>> <mailto:[email protected]>>
>> >     >>> wrote:
>> >     >>>
>> >     >>> Try creating a view and use describe.
>> >     >>>
>> >     >>> https://urldefense.proofpoint.com/v2/url?u=https-3A__drill.
>> >     >>> apache.org_docs_describe_&d=DwIBaQ&c=cskdkSMqhcnjZxdQVpwTXg&r=
>> >     >>> XVqW14B9eGK9QR_fKKCb5H5LxKnqNMmq1U7RdNlhq1c&
>> > m=blTmu-WQJa5RUrxqG46o20B-a-
>> >     >>> UP0yCHKGqcBo1ETlI&s=SxKqC4w_I5bbnNAtIvzO_9qptLP-KsHi0iIbjtzy
>> rc8&e=
>> >     >>>
>> >     >>> Thanks
>> >     >>> Padma
>> >     >>>
>> >     >>>
>> >     >>> On Mar 1, 2018, at 9:22 AM, Erol Akarsu <[email protected]
>> <mailto:
>> > eaka
>> >     >>> [email protected]><mailto:eaka
>> >     >>> [email protected]<mailto:[email protected]>><mailto:eaka
>> >     >>> [email protected]<mailto:[email protected]><mailto:[email protected]>>>
>> wrote:
>> >     >>>
>> >     >>> When Use limit 0 query,  I am getting only field names. I am
>> > looking for
>> >     >>> json schema for input that will describe input type
>> >     >>>
>> >     >>> 0: jdbc:drill:> select * from `clicks/clicks.json` limit 0;
>> >     >>>
>> >     >>> +-----------+-------+-------+------------+-------------+
>> >     >>> | trans_id  | date  | time  | user_info  | trans_info  |
>> >     >>> +-----------+-------+-------+------------+-------------+
>> >     >>> +-----------+-------+-------+------------+-------------+
>> >     >>> No
>> >     >>>
>> >     >>> On Thu, Mar 1, 2018 at 11:24 AM, Erol Akarsu <[email protected]
>> > <mailto:
>> >     >>> [email protected]><mailto:
>> >     >>> [email protected]<mailto:[email protected]>><mailto:
>> >     >>> [email protected]<mailto:[email protected]><mailto:eakarsu@
>> > gmail.com>>>
>> >     >>> wrote:
>> >     >>>
>> >     >>> I am sorry Sorabh
>> >     >>> Can you give an example? I am still learning Drill
>> >     >>> Thanks
>> >     >>>
>> >     >>> On Thu, Mar 1, 2018 at 11:11 AM Sorabh Hamirwasia <
>> > [email protected]
>> >     >>> <mailto:[email protected]>
>> >     >>> <mailto:[email protected]>
>> >     >>> <mailto:[email protected]>>
>> >     >>> wrote:
>> >     >>>
>> >     >>> Hi Erol,
>> >     >>>
>> >     >>> You can run limit 0 query from client to retrieve just the
>> schema
>> > for
>> >     >>> your input.
>> >     >>>
>> >     >>>
>> >     >>> Thanks,
>> >     >>> Sorabh
>> >     >>>
>> >     >>> ________________________________
>> >     >>> From: Erol Akarsu <[email protected]<mailto:[email protected]
>> > ><mailto:
>> >     >>> [email protected]><mailto:
>> >     >>> [email protected]<mailto:[email protected]>>>
>> >     >>> Sent: Thursday, March 1, 2018 5:28:52 AM
>> >     >>> To: [email protected]<mailto:[email protected]><mailto:
>> u
>> >     >>> [email protected]><mailto:u
>> >     >>> [email protected]<mailto:[email protected]>>
>> >     >>> Subject: Accessing underlying scheme of input
>> >     >>>
>> >     >>> I know Apache drill is creating a json schema for input data
>> file
>> > or hdfs
>> >     >>> input before user query on it.
>> >     >>> I like to know whether or not Apache drill has API that will
>> help
>> > user to
>> >     >>> obtain that  derived schema for say an json file or excel file
>> or
>> > hive
>> >     >>> input.
>> >     >>> I appreciate your help
>> >     >>>
>> >     >>> Erol Akarsu
>> >     >>>
>> >     >>> Sent from Mail for Windows 10
>> >     >>>
>> >     >>> --
>> >     >>>
>> >     >>> Erol Akarsu
>> >     >>>
>> >     >>>
>> >     >>>
>> >     >>>
>> >     >>> --
>> >     >>>
>> >     >>> Erol Akarsu
>> >     >>>
>> >     >>>
>> >     >>>
>> >     >>>
>> >     >>> --
>> >     >>>
>> >     >>> Erol Akarsu
>> >     >>>
>> >     >>>
>> >     >>>
>> >     >>>
>> >     >>> --
>> >     >>>
>> >     >>> Erol Akarsu
>> >     >>>
>> >     >>>
>> >     >>
>> >     >>
>> >     >> --
>> >     >>
>> >     >> Erol Akarsu
>> >     >>
>> >     >>
>> >     >
>> >     >
>> >     > --
>> >     >
>> >     > Erol Akarsu
>> >
>> >
>> >
>> >
>>
>>
>> --
>>
>> Erol Akarsu
>>
>
>

Re: Accessing underlying scheme of input

Reply via email to