It should execute multi-threaded; I need to check on text files specifically. Did you check the log files for any errors?
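A quick way to confirm is to check the planner width options from sqlline; sys.options is Drill's built-in options table, and this sketch only lists the relevant entries:

~~~
-- How many minor fragments will Drill run per node and per query?
SELECT name, num_val
FROM sys.options
WHERE name LIKE 'planner.width%';
~~~

planner.width.max_per_node scales with the core count by default, and the 1_0 through 1_23 writer fragments reported further down do look consistent with parallel execution across your 4 nodes.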
On May 28, 2015, at 10:36 AM, Matt <[email protected]> wrote:

>> The time seems pretty long for that file size. What type of file is it?
>
> Tab delimited UTF-8 text.
>
> I left the query to run overnight to see if it would complete, but 24 hours
> for an import like this would indeed be too long.
>
>> Is the CTAS running single-threaded?
>
> In the first hour, with this being the only client connected to the cluster,
> I observed activity on all 4 nodes.
>
> Is multi-threaded query execution the default? I would not have changed
> anything deliberately to force single-threaded execution.
>
>
> On 28 May 2015, at 13:06, Andries Engelbrecht wrote:
>
>> The time seems pretty long for that file size. What type of file is it?
>>
>> Is the CTAS running single-threaded?
>>
>> —Andries
>>
>>
>> On May 28, 2015, at 9:37 AM, Matt <[email protected]> wrote:
>>
>>>> How large is the data set you are working with, and your cluster/nodes?
>>>
>>> Just testing with that single 44GB source file currently, and my test
>>> cluster is made up of 4 nodes, each with 8 CPU cores, 32GB RAM, and a 6TB
>>> Ext4 volume (RAID-10).
>>>
>>> Drill defaults left as they come in v1.0. I will be adjusting memory and
>>> retrying the CTAS.
>>>
>>> I know I can / should assign individual disks to HDFS, but as this is a
>>> test cluster there are other apps that expect data volumes to work on. A
>>> dedicated Hadoop production cluster would have a disk layout specific to
>>> the task.
>>>
>>>
>>> On 28 May 2015, at 12:26, Andries Engelbrecht wrote:
>>>
>>>> Just check the drillbit.log and drillbit.out files in the log directory.
>>>> Before adjusting memory, see if that is an issue first. It was for me,
>>>> but as Jason mentioned there can be other causes as well.
>>>>
>>>> You adjust memory allocation in the drill-env.sh files, and have to
>>>> restart the drillbits.
>>>>
>>>> How large is the data set you are working with, and your cluster/nodes?
>>>>
>>>> —Andries
>>>>
>>>>
>>>> On May 28, 2015, at 9:17 AM, Matt <[email protected]> wrote:
>>>>
>>>>> To make sure I am adjusting the correct config: these are heap
>>>>> parameters within the Drill config path, not for Hadoop or Zookeeper?
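For what it's worth: heap and direct memory for the drillbits live in Drill's own conf/drill-env.sh on each node, not in the Hadoop or ZooKeeper configs. A minimal sketch, with placeholder sizes rather than recommendations:

~~~
# conf/drill-env.sh on each node -- the drillbits must be restarted
# after any change. Sizes below are placeholders for 32GB nodes.
export DRILL_HEAP="8G"                 # JVM heap; parquet writing leans on it
export DRILL_MAX_DIRECT_MEMORY="16G"   # off-heap memory for record batches
~~~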
>>>>>
>>>>>> On May 28, 2015, at 12:08 PM, Jason Altekruse <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>> There should be no upper limit on the size of the tables you can
>>>>>> create with Drill. Be advised that Drill does currently operate
>>>>>> entirely optimistically with regard to available resources. If a
>>>>>> network connection between two drillbits fails during a query, we will
>>>>>> not currently re-schedule the work to make use of the remaining nodes
>>>>>> and network connections that are still live. While we have had a good
>>>>>> amount of success using Drill for data conversion, be aware that these
>>>>>> conditions could cause long-running queries to fail.
>>>>>>
>>>>>> That being said, it isn't the only possible cause for such a failure.
>>>>>> In the case of a network failure we would expect to see a message
>>>>>> returned to you that part of the query was unsuccessful and that it
>>>>>> had been cancelled. Andries has a good suggestion about checking the
>>>>>> heap memory; this should also be detected and reported back to you at
>>>>>> the CLI, but we may be failing to propagate the error back to the head
>>>>>> node for the query. I believe writing parquet may still be the most
>>>>>> heap-intensive operation in Drill, despite our efforts to refactor the
>>>>>> write path to use direct memory instead of on-heap for the large
>>>>>> buffers needed in the process of creating parquet files.
>>>>>>
>>>>>>> On Thu, May 28, 2015 at 8:43 AM, Matt <[email protected]> wrote:
>>>>>>>
>>>>>>> Is 300MM records too much to do in a single CTAS statement?
>>>>>>>
>>>>>>> After almost 23 hours I killed the query (^c) and it returned:
>>>>>>>
>>>>>>> ~~~
>>>>>>> +-----------+----------------------------+
>>>>>>> | Fragment  | Number of records written  |
>>>>>>> +-----------+----------------------------+
>>>>>>> | 1_20      | 13568824                   |
>>>>>>> | 1_15      | 12411822                   |
>>>>>>> | 1_7       | 12470329                   |
>>>>>>> | 1_12      | 13693867                   |
>>>>>>> | 1_5       | 13292136                   |
>>>>>>> | 1_18      | 13874321                   |
>>>>>>> | 1_16      | 13303094                   |
>>>>>>> | 1_9       | 13639049                   |
>>>>>>> | 1_10      | 13698380                   |
>>>>>>> | 1_22      | 13501073                   |
>>>>>>> | 1_8       | 13533736                   |
>>>>>>> | 1_2       | 13549402                   |
>>>>>>> | 1_21      | 13665183                   |
>>>>>>> | 1_0       | 13544745                   |
>>>>>>> | 1_4       | 13532957                   |
>>>>>>> | 1_19      | 12767473                   |
>>>>>>> | 1_17      | 13670687                   |
>>>>>>> | 1_13      | 13469515                   |
>>>>>>> | 1_23      | 12517632                   |
>>>>>>> | 1_6       | 13634338                   |
>>>>>>> | 1_14      | 13611322                   |
>>>>>>> | 1_3       | 13061900                   |
>>>>>>> | 1_11      | 12760978                   |
>>>>>>> +-----------+----------------------------+
>>>>>>> 23 rows selected (82294.854 seconds)
>>>>>>> ~~~
>>>>>>>
>>>>>>> The sum of those record counts is 306,772,763, which is close to the
>>>>>>> 320,843,454 in the source file:
>>>>>>>
>>>>>>> ~~~
>>>>>>> 0: jdbc:drill:zk=es05:2181> select count(*) FROM root.`sample_201501.dat`;
>>>>>>> +------------+
>>>>>>> |   EXPR$0   |
>>>>>>> +------------+
>>>>>>> | 320843454  |
>>>>>>> +------------+
>>>>>>> 1 row selected (384.665 seconds)
>>>>>>> ~~~
>>>>>>>
>>>>>>> It represents one month of data, 4 key columns and 38 numeric measure
>>>>>>> columns, which could also be partitioned daily. The test here was to
>>>>>>> create monthly Parquet files to see how the min/max stats on Parquet
>>>>>>> chunks help with range select performance.
>>>>>>>
>>>>>>> Instead of a small number of large monthly RDBMS tables, I am
>>>>>>> attempting to determine how many Parquet files should be used with
>>>>>>> Drill / HDFS.
>>>>>>>
>>>>>>>
>>>>>>> On 27 May 2015, at 15:17, Matt wrote:
>>>>>>>
>>>>>>>> Attempting to create a Parquet-backed table with a CTAS from a 44GB
>>>>>>>> tab-delimited file in HDFS. The process seemed to be running, as CPU
>>>>>>>> and IO activity was seen on all 4 nodes in this cluster, and .parquet
>>>>>>>> files were being created in the expected path.
>>>>>>>>
>>>>>>>> However, in the last two hours or so, all nodes show near-zero CPU
>>>>>>>> and IO, and the Last Modified date on the .parquet files has not
>>>>>>>> changed. The same time delay is shown in the Last Progress column in
>>>>>>>> the active fragment profile.
>>>>>>>>
>>>>>>>> What approach can I take to determine what is happening (or not)?
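For reference, the CTAS being timed above has roughly the shape below. The column and table names are hypothetical and only 2 of the 38 measure columns are shown; Drill exposes delimited text rows as the columns array, so the explicit casts are what keep the resulting Parquet columns numeric instead of VARCHAR:

~~~
-- Minimal sketch of the monthly CTAS (hypothetical names throughout;
-- the target workspace must be writable).
CREATE TABLE root.`sample_201501_parquet` AS
SELECT
  columns[0]                  AS cust_id,     -- key column (hypothetical)
  columns[1]                  AS event_date,  -- key column (hypothetical)
  CAST(columns[4] AS BIGINT)  AS measure_01,  -- cast keeps Parquet numeric
  CAST(columns[5] AS DOUBLE)  AS measure_02
FROM root.`sample_201501.dat`;
~~~

Running one such CTAS per day instead of per month would yield more, smaller Parquet files and would cap the work lost when a long-running CTAS stalls like the one above.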
