That is correct. It is possible that HDFS could run out of heap, but that
is unlikely to be the cause of the failure you are seeing. We should not be
taxing ZooKeeper enough to cause any issues there.
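
If it helps, the Drill heap is set in conf/drill-env.sh on each drillbit.
The values below are only placeholders to show the shape of the file; size
them for your own nodes and restart the drillbits afterwards:

~~~
# conf/drill-env.sh -- example values only
DRILL_HEAP="8G"                    # JVM heap for the drillbit process
DRILL_MAX_DIRECT_MEMORY="16G"      # off-heap (direct) memory for execution
~~~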

On Thu, May 28, 2015 at 9:17 AM, Matt <[email protected]> wrote:

> To make sure I am adjusting the correct config: these are heap parameters
> within the Drill configuration path, not for Hadoop or ZooKeeper?
>
>
> > On May 28, 2015, at 12:08 PM, Jason Altekruse <[email protected]> wrote:
> >
> > There should be no upper limit on the size of the tables you can create
> > with Drill. Be advised that Drill currently operates entirely
> > optimistically with regard to available resources. If a network
> > connection between two drillbits fails during a query, we will not
> > currently re-schedule the work to make use of the remaining nodes and
> > network connections that are still live. While we have had a good amount
> > of success using Drill for data conversion, be aware that these
> > conditions could cause long-running queries to fail.
> >
> > That being said, a network failure isn't the only possible cause of such
> > a problem. In the case of a network failure we would expect to see a
> > message returned to you that part of the query was unsuccessful and that
> > it had been cancelled. Andries has a good suggestion with regard to
> > checking the heap memory; this should also be detected and reported back
> > to you at the CLI, but we may be failing to propagate the error back to
> > the head node for the query. I believe writing Parquet may still be the
> > most heap-intensive operation in Drill, despite our efforts to refactor
> > the write path to use direct memory instead of the heap for the large
> > buffers needed in the process of creating Parquet files.
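> >
> > If heap pressure from the Parquet writer does turn out to be the
> > problem, one thing that can be worth experimenting with is a smaller
> > Parquet block size for the CTAS session (the 256 MB value below is only
> > an example):
> >
> > ~~~
> > -- smaller row groups mean smaller in-memory write buffers per fragment
> > ALTER SESSION SET `store.parquet.block_size` = 268435456;
> > ~~~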
> >
> >> On Thu, May 28, 2015 at 8:43 AM, Matt <[email protected]> wrote:
> >>
> >> Is 300MM records too much to do in a single CTAS statement?
> >>
> >> After almost 23 hours I killed the query (^c) and it returned:
> >>
> >> ~~~
> >> +-----------+----------------------------+
> >> | Fragment  | Number of records written  |
> >> +-----------+----------------------------+
> >> | 1_20      | 13568824                   |
> >> | 1_15      | 12411822                   |
> >> | 1_7       | 12470329                   |
> >> | 1_12      | 13693867                   |
> >> | 1_5       | 13292136                   |
> >> | 1_18      | 13874321                   |
> >> | 1_16      | 13303094                   |
> >> | 1_9       | 13639049                   |
> >> | 1_10      | 13698380                   |
> >> | 1_22      | 13501073                   |
> >> | 1_8       | 13533736                   |
> >> | 1_2       | 13549402                   |
> >> | 1_21      | 13665183                   |
> >> | 1_0       | 13544745                   |
> >> | 1_4       | 13532957                   |
> >> | 1_19      | 12767473                   |
> >> | 1_17      | 13670687                   |
> >> | 1_13      | 13469515                   |
> >> | 1_23      | 12517632                   |
> >> | 1_6       | 13634338                   |
> >> | 1_14      | 13611322                   |
> >> | 1_3       | 13061900                   |
> >> | 1_11      | 12760978                   |
> >> +-----------+----------------------------+
> >> 23 rows selected (82294.854 seconds)
> >> ~~~
> >>
> >> The sum of those record counts is 306,772,763, which is close to the
> >> 320,843,454 rows in the source file:
> >>
> >> ~~~
> >> 0: jdbc:drill:zk=es05:2181> select count(*) FROM root.`sample_201501.dat`;
> >> +------------+
> >> |   EXPR$0   |
> >> +------------+
> >> | 320843454  |
> >> +------------+
> >> 1 row selected (384.665 seconds)
> >> ~~~
> >>
> >>
> >> It represents one month of data, 4 key columns and 38 numeric measure
> >> columns, which could also be partitioned daily. The test here was to
> >> create monthly Parquet files to see how the min/max stats on Parquet
> >> chunks help with range select performance.
> >>
> >> Instead of a small number of large monthly RDBMS tables, I am attempting
> >> to determine how many Parquet files should be used with Drill / HDFS.
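> >>
> >> For illustration, the kind of CTAS involved is of this general shape
> >> (the target workspace and table name below are placeholders):
> >>
> >> ~~~
> >> CREATE TABLE dfs.tmp.`sample_201501_parquet` AS
> >> SELECT * FROM root.`sample_201501.dat`;
> >> ~~~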
> >>
> >>
> >>
> >>
> >> On 27 May 2015, at 15:17, Matt wrote:
> >>
> >>> Attempting to create a Parquet-backed table with a CTAS from a 44GB
> >>> tab-delimited file in HDFS. The process seemed to be running, as CPU
> >>> and IO were seen on all 4 nodes in this cluster, and .parquet files
> >>> were being created in the expected path.
> >>>
> >>> However, in the last two hours or so, all nodes show near zero CPU or
> >>> IO, and the Last Modified dates on the .parquet files have not changed.
> >>> The same delay is shown in the Last Progress column in the active
> >>> fragment profile.
> >>>
> >>> What approach can I take to determine what is happening (or not)?
> >>
>
