What Ted just talked about is also explained in this On Demand Training
https://www.mapr.com/services/mapr-academy/mapr-distribution-essentials-training-course-on-demand
(which is free)
On Fri, May 29, 2015 at 5:29 PM, Ted Dunning wrote:
> There are two methods to support HBase table APIs.
There are two methods to support HBase table APIs. The first is to simply
run HBase. That is just like, well, running HBase.
The more interesting alternative is to use a special client API that talks
a special table-oriented wire protocol to the file system, which implements
a column-family / column-oriented data model.
I have another test case that queries a table using a filter on a range
of dates and customer key, and SUMs 38 columns. The returned record set
encompasses all 42 columns in the table - not a good design for Parquet
files or any RDBMS, but a modeling problem that is not yet fully in my
control.
See below:
> On May 27, 2015, at 12:17 PM, Matt wrote:
>
> Attempting to create a Parquet backed table with a CTAS from a 44GB tab
> delimited file in HDFS. The process seemed to be running, as CPU and IO were
> seen on all 4 nodes in this cluster, and .parquet files being created in the
> expected path.
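A minimal sketch of the aggregation test case described above - a date-range and customer-key filter with SUMs over many columns - using hypothetical table and column names:

~~~
-- Hypothetical names: the table, columns, and date range are illustrative only.
SELECT
  customer_key,
  SUM(metric_01) AS metric_01_total,
  SUM(metric_02) AS metric_02_total
  -- ... repeated for the remaining measure columns
FROM dfs.tmp.`customer_daily_parquet`
WHERE event_date BETWEEN DATE '2015-01-01' AND DATE '2015-03-31'
  AND customer_key = 12345
GROUP BY customer_key;
~~~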
> 1) it isn't HDFS.
Is MapR-FS a replacement or stand-in for HDFS?
On 29 May 2015, at 5:55, Ted Dunning wrote:
> Apologies for the plug, but using MapR FS would help you a lot here. The
> trick is that you can run an NFS server on every node and mount that server
> as localhost.
>
> The benefits are:
Could you expand on the HBase table integration? How does that work?
On Fri, May 29, 2015 at 5:55 AM, Ted Dunning wrote:
>
> 4) you get the use of the HBase API without having to run HBase. Tables
> are integrated directly into MapR FS.
>
> On Thu, May 28, 2015 at 9:37 AM, Matt wrote:
Apologies for the plug, but using MapR FS would help you a lot here. The
trick is that you can run an NFS server on every node and mount that server
as localhost.
The benefits are:
1) the entire cluster appears as a conventional POSIX-style file system in
addition to being available via the HDFS API
Bumping memory to:
DRILL_MAX_DIRECT_MEMORY="16G"
DRILL_HEAP="8G"
The 44GB file imported successfully in 25 minutes - acceptable on this
hardware.
I don't know if the default memory setting was to blame or not.
On 28 May 2015, at 14:22, Andries Engelbrecht wrote:
That is the Drill direct memory per node.
That is a good point. The difference between the number of source rows,
and those that made it into the parquet files is about the same count as
the other fragments.
Indeed the query profile does show fragment 1_1 as CANCELED while the
others all have State FINISHED. Additionally the other fra
I think the problem might be related to a single laggard; it looks like we
are waiting for one minor fragment to complete. Based on the output you
provided, it looks like fragment 1_1 hasn't completed. You might want to
find out where that fragment was scheduled and what is going on on that
node.
That is the Drill direct memory per node.
DRILL_HEAP is for the heap size per node.
More info here
http://drill.apache.org/docs/configuring-drill-memory/
—Andries
On May 28, 2015, at 11:09 AM, Matt wrote:
> Referencing http://drill.apache.org/docs/configuring-drill-memory/
>
> Is DRILL_MAX_DIRECT_MEMORY the limit for each node, or the cluster?
Referencing http://drill.apache.org/docs/configuring-drill-memory/
Is DRILL_MAX_DIRECT_MEMORY the limit for each node, or the cluster?
The root page on a drillbit at port 8047 lists four nodes, with the 16G
Maximum Direct Memory equal to DRILL_MAX_DIRECT_MEMORY, so I am uncertain
if that is a node or cluster limit.
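One way to check is to query Drill's system tables, assuming this build exposes sys.memory (a sketch, not verified against this exact version):

~~~
-- sys.memory reports one row per drillbit, so the heap and direct memory
-- limits shown are per node rather than cluster-wide.
SELECT * FROM sys.memory;
~~~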
Did you check the log files for any errors?
No messages related to this query contain errors or warnings, nor anything
mentioning memory or heap. Querying now to determine what is missing in the
parquet destination.
drillbit.out on the master shows no error messages, and what looks like
th
It should execute multi-threaded; need to check for text file sources.
Did you check the log files for any errors?
On May 28, 2015, at 10:36 AM, Matt wrote:
>> The time seems pretty long for that file size. What type of file is it?
>
> Tab delimited UTF-8 text.
>
> I left the query to run overnight to see if it would complete, but 24
> hours for an import like this would indeed be too long.
CPU and IO went to near zero on the master and all nodes after about 1
hour. I do not know if the bulk of rows were written within that hour
or after.
Is there any way you can read the table and try to validate if all of
the data was written?
A simple join will show me where it stopped.
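A minimal sketch of one way to validate the write, comparing row counts between the delimited source and the Parquet copy (the paths and table names below are hypothetical):

~~~
-- COUNT(*) works on a delimited file and on the Parquet output alike;
-- dfs.`/staging/source.tsv` and dfs.tmp.`target_parquet` are illustrative names.
SELECT 'source'  AS side, COUNT(*) AS row_cnt FROM dfs.`/staging/source.tsv`
UNION ALL
SELECT 'parquet' AS side, COUNT(*) AS row_cnt FROM dfs.tmp.`target_parquet`;
~~~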
The time seems pretty long for that file size. What type of file is
it?
Tab delimited UTF-8 text.
I left the query to run overnight to see if it would complete, but 24
hours for an import like this would indeed be too long.
Is the CTAS running single threaded?
In the first hour, with this
He mentioned in his original post that he saw CPU and IO on all of the
nodes for a while when the query was active, but it suddenly dropped down
to low CPU usage and stopped producing files. It seems like we are failing
to detect an error and cancel the query.
It is possible that the failure happened
The time seems pretty long for that file size. What type of file is it?
Is the CTAS running single threaded?
—Andries
On May 28, 2015, at 9:37 AM, Matt wrote:
>> How large is the data set you are working with, and your cluster/nodes?
>
> Just testing with that single 44GB source file currently, and my test
> cluster is made from 4 nodes, each with 8 CPU cores, 32GB RAM, a 6TB
> Ext4 volume (RAID-10).
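Regarding the single-threaded question above, a sketch of how to inspect Drill's parallelism settings, assuming the standard planner options are present in sys.options (option names unverified for this exact release):

~~~
-- planner.width.max_per_node caps minor fragments per drillbit;
-- planner.slice_target is the row-count threshold for parallelizing a query.
SELECT * FROM sys.options
WHERE name LIKE 'planner.width%' OR name LIKE 'planner.slice%';
~~~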
How large is the data set you are working with, and your
cluster/nodes?
Just testing with that single 44GB source file currently, and my test
cluster is made from 4 nodes, each with 8 CPU cores, 32GB RAM, a 6TB
Ext4 volume (RAID-10).
Drill defaults left as they come in v1.0. I will be adjusting memory settings.
Just check the drillbit.log and drillbit.out files in the log directory.
Before adjusting memory, see if that is an issue first. It was for me, but as
Jason mentioned there can be other causes as well.
You adjust memory allocation in the drill-env.sh files, and have to restart the
drillbits.
That is correct. I guess it could be possible that HDFS might run out of
heap, but I'm guessing that is unlikely to be the cause of the failure you are
seeing. We should not be taxing ZooKeeper enough to be causing any issues
there.
On Thu, May 28, 2015 at 9:17 AM, Matt wrote:
> To make sure I am adjusting the correct config, these are heap parameters
> within the Drill config path, not for Hadoop or ZooKeeper?
I did not note any memory errors or warnings in a quick scan of the logs, but
to double check, is there a specific log I would find such warnings in?
> On May 28, 2015, at 12:01 PM, Andries Engelbrecht
> wrote:
>
> I have used a single CTAS to create tables using parquet with 1.5B rows.
>
>
To make sure I am adjusting the correct config, these are heap parameters
within the Drill config path, not for Hadoop or ZooKeeper?
> On May 28, 2015, at 12:08 PM, Jason Altekruse
> wrote:
>
> There should be no upper limit on the size of the tables you can create
> with Drill. Be advised
There should be no upper limit on the size of the tables you can create
with Drill. Be advised that Drill does currently operate entirely
optimistically with regard to available resources. If a network connection
between two drillbits fails during a query, we will not currently
re-schedule the work
I have used a single CTAS to create tables using Parquet with 1.5B rows.
It did consume a lot of heap memory on the Drillbits and I had to increase the
heap size. Check your logs to see if you are running out of heap memory.
I used a 128MB Parquet block size.
This was with Drill 0.9, so I'm sure
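If you want to try the same block size, a minimal sketch (assuming the store.parquet.block-size session option, which Drill exposes; 134217728 bytes = 128MB):

~~~
-- Set the Parquet block (row group) size for this session, then run the CTAS.
ALTER SESSION SET `store.parquet.block-size` = 134217728;
~~~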
Is 300MM records too much to do in a single CTAS statement?
After almost 23 hours I killed the query (^c) and it returned:
~~~
+-----------+-----------------------------+
| Fragment  | Number of records written   |
+-----------+-----------------------------+
| 1_20      | 13568824
Attempting to create a Parquet backed table with a CTAS from a 44GB tab
delimited file in HDFS. The process seemed to be running, as CPU and IO
were seen on all 4 nodes in this cluster, and .parquet files were being
created in the expected path.
However, in the last two hours or so, all nodes show near-zero CPU and IO.
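A minimal sketch of the kind of CTAS described here, with hypothetical storage plugin workspaces, file paths, and column names (a tab-delimited file without header extraction is exposed as a single columns array in Drill):

~~~
-- All names and paths are illustrative; the source is read as a columns array
-- because no header extraction is configured for the tab-delimited format.
CREATE TABLE dfs.tmp.`customer_daily_parquet` AS
SELECT
  CAST(columns[0] AS DATE)   AS event_date,
  CAST(columns[1] AS BIGINT) AS customer_key,
  CAST(columns[2] AS DOUBLE) AS metric_01
FROM dfs.`/staging/customer_daily.tsv`;
~~~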