Is 300MM records too much for a single CTAS statement?
After almost 23 hours I killed the query (^c) and it returned:
~~~
+-----------+----------------------------+
| Fragment  | Number of records written  |
+-----------+----------------------------+
| 1_20      | 13568824                   |
| 1_15      | 12411822                   |
| 1_7       | 12470329                   |
| 1_12      | 13693867                   |
| 1_5       | 13292136                   |
| 1_18      | 13874321                   |
| 1_16      | 13303094                   |
| 1_9       | 13639049                   |
| 1_10      | 13698380                   |
| 1_22      | 13501073                   |
| 1_8       | 13533736                   |
| 1_2       | 13549402                   |
| 1_21      | 13665183                   |
| 1_0       | 13544745                   |
| 1_4       | 13532957                   |
| 1_19      | 12767473                   |
| 1_17      | 13670687                   |
| 1_13      | 13469515                   |
| 1_23      | 12517632                   |
| 1_6       | 13634338                   |
| 1_14      | 13611322                   |
| 1_3       | 13061900                   |
| 1_11      | 12760978                   |
+-----------+----------------------------+
23 rows selected (82294.854 seconds)
~~~
The sum of those record counts is 306,772,763, which is close to the
320,843,454 rows in the source file:
~~~
0: jdbc:drill:zk=es05:2181> select count(*) FROM root.`sample_201501.dat`;
+------------+
| EXPR$0     |
+------------+
| 320843454  |
+------------+
1 row selected (384.665 seconds)
~~~
The source file represents one month of data, with 4 key columns and 38
numeric measure columns, and could also be partitioned daily. The test
here was to create monthly Parquet files to see how the min/max stats on
Parquet row groups help with range-select performance.
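For reference, the kind of range select I want the row-group stats to prune would look roughly like this (a sketch only; `event_date`, `measure_01`, and `dfs.tmp` are placeholder names, not the actual schema or workspace):
~~~
-- Hypothetical range query: event_date stands in for one of the 4 key
-- columns and measure_01 for one of the 38 measures; dfs.tmp is a
-- placeholder workspace. The hope is that row-group min/max stats let
-- Drill skip row groups that fall entirely outside the date range.
SELECT event_date, SUM(measure_01) AS total_measure_01
FROM dfs.tmp.`sample_201501_parquet`
WHERE event_date >= DATE '2015-01-10'
  AND event_date <  DATE '2015-01-17'
GROUP BY event_date;
~~~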
Coming from a small number of large monthly RDBMS tables, I am trying
to determine how many Parquet files should be used with Drill / HDFS.
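For context, the CTAS itself is roughly of this shape (a sketch only: the workspace, column names, and positions are placeholders, and the real statement casts all 42 columns):
~~~
-- Sketch of the CTAS, assuming the headerless tab-delimited file is read
-- through Drill's text reader, which exposes each line as a `columns` array.
-- All names and positions are placeholders for the real 4 keys + 38 measures.
CREATE TABLE dfs.tmp.`sample_201501_parquet` AS
SELECT
  CAST(columns[0] AS DATE)        AS event_date,
  CAST(columns[1] AS VARCHAR(32)) AS key_2,
  -- ... keys 3-4 and the remaining measures cast the same way ...
  CAST(columns[4] AS DOUBLE)      AS measure_01
FROM root.`sample_201501.dat`;
~~~
A daily variant would simply add a WHERE filter on the date key and run one CTAS per day, trading one large monthly file set for roughly 31 smaller ones.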
On 27 May 2015, at 15:17, Matt wrote:
Attempting to create a Parquet-backed table with a CTAS from a 44GB
tab-delimited file in HDFS. The process seemed to be running, as CPU
and IO were seen on all 4 nodes in this cluster, and .parquet files
were being created in the expected path.

However, in the last two hours or so, all nodes have shown near-zero
CPU and IO, and the Last Modified dates on the .parquet files have not
changed. The same time delay is shown in the Last Progress column of
the active fragment profile.
What approach can I take to determine what is happening (or not)?
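For what it's worth, beyond watching the fragment profile in the Drill web UI (port 8047, Profiles page), a couple of quick checks against the sys tables can confirm the drillbits themselves are still healthy; a sketch:
~~~
-- Confirm all 4 drillbits are still registered with ZooKeeper.
SELECT * FROM sys.drillbits;

-- Check heap / direct memory per node, in case a writer is memory-starved
-- (exact column names vary a bit between Drill versions).
SELECT * FROM sys.memory;
~~~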