How many columns do you have?
Do you understand how columnar data stores work, and how selecting only a
single column means that much less data needs to be read? If your data
consists, say, of integers, then Drill only needs to read about 160 MB to
satisfy your query, which is quite a reasonable amount to read.
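As an illustration (table and column names here are hypothetical, not from the original thread), a query that touches only one column of a Parquet-backed table lets Drill skip the other columns entirely:

```sql
-- Hypothetical example: `trades` is a wide Parquet table.
-- Only the `price` column is read from disk; the other columns are
-- never touched, so a multi-GB file may cost only ~160 MB of I/O.
SELECT price
FROM dfs.`/data/trades`;

-- By contrast, SELECT * forces Drill to read every column.
```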
We are currently running (testing) with Veritas CFS (attached to EMC SAN
storage), which is visible across 6 servers. We also have a single test MapR
node, but that's a small sandbox. The production implementation will be a
10-node HDFS cluster.
The data files are 20 GB to 40 GB in size.
I see that Drill 1.1.0 declares support for Hive 1.0, which is not yet
provided by Amazon EMR. Any chance Hive 0.13 will still work? Can you
characterize when 0.13 would or would not work?
In general I think users will want to upgrade Drill much more frequently
than they are able to upgrade Hive.
The feature was added late in the release cycle, and it wasn't tested as
thoroughly as the default option. I think it should be perfectly ok to use;
just be aware that it may lead to decreased performance when running CTAS
operations.
On the other hand, this could drastically reduce the number of
No. A very simple model like that breaks down on many levels. The most
important place where reality intrudes is that your I/O probably can't
really be threaded that widely.
What kind of storage are you using? How big is your data?
You might also want to check out the new partitioned Parquet creation that
was launched with 1.1.0: https://drill.apache.org/docs/partition-by-clause/
This would increase your read speed if your queries tend to use predicates.
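A minimal sketch of what that PARTITION BY clause looks like in a CTAS statement, per the linked docs (the table and column names below are hypothetical):

```sql
-- Hypothetical example: partition the CTAS output by year so that
-- queries with a predicate on `trade_year` can prune whole files.
-- Note: the partition column must appear in the SELECT list.
CREATE TABLE dfs.tmp.`trades_by_year`
PARTITION BY (trade_year)
AS SELECT trade_year, symbol, price
   FROM dfs.`/data/trades`;

-- A later query such as
--   SELECT * FROM dfs.tmp.`trades_by_year` WHERE trade_year = 2014;
-- should only read the files belonging to the 2014 partition.
```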
Chris Matta