Re: Recursive CTE Support in Drill

2015-07-20 Thread Alexander Zarei
Thanks for elaborating further, Ted, Jacques and Jason!

@Ted that is a very cool idea. I tried the cross join but found that cross
join is not supported in Drill yet; we have DRILL-786 tracking it. The new
method looks very promising. It is effectively an implicit cross join, isn't
it? I just tried it out and it worked like a charm. I will go on with this
method.

@Jacques, yes, as Jason said, we discussed this before, and I have asked my
colleagues to help me modify the ODBC driver so it sends a plan.
Also, thanks for the query. I tried it out for two tables and it worked fine,
but extending it to three tables gives me a syntax error.

 select * from
 ((select column1, 1 as join_keyb from
   (SELECT column1, 1 as join_key FROM `hive43`.`default`.`double_table`) t1
   Join
   (SELECT column1, 1 as join_key FROM `hive43`.`default`.`double_table`) t2
   on t1.join_key=t2.join_key) t12
 Join
  (SELECT column1, 1 as join_key FROM `hive43`.`default`.`double_table`) t3
  on t12.join_keyb=t3.join_key)
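
One possible fix (untested, just a guess at what the parser objects to) would
be to give the first join its own qualified SELECT list before aliasing it as
t12, roughly:

 select t12.column1 from
  (select t1.column1, 1 as join_keyb from
    (SELECT column1, 1 as join_key FROM `hive43`.`default`.`double_table`) t1
    Join
    (SELECT column1, 1 as join_key FROM `hive43`.`default`.`double_table`) t2
    on t1.join_key=t2.join_key) t12
  Join
   (SELECT column1, 1 as join_key FROM `hive43`.`default`.`double_table`) t3
   on t12.join_keyb=t3.join_key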



*The other syntax was easier for me to use for joining three times, so I
could test it with:*


 select t1.column1 from
  (SELECT column1, 1 as join_key FROM `hive43`.`default`.`double_table`) t1,
  (SELECT column1, 1 as join_key FROM `hive43`.`default`.`double_table`) t2,
  (SELECT column1, 1 as join_key FROM `hive43`.`default`.`double_table`) t3
 where
  t1.join_key=t2.join_key and t1.join_key=t3.join_key
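
For sizing purposes (my own back-of-the-envelope, assuming the three inputs
really are independent copies): the output row count is the product of the
inputs, so three copies of a 1,000-row table yield 1,000 x 1,000 x 1,000 =
10^9 rows, in line with Ted's estimate further down the thread.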


Thank you very much for your time Ted, Jacques and Jason!

Thanks,
Alex

On Fri, Jul 17, 2015 at 2:09 PM, Jason Altekruse altekruseja...@gmail.com
wrote:

 Jacques,

 Alexander has brought up this problem previously in one of the hangouts and
 said that submitting a physical plan was not possible through ODBC. If he
 is able to modify the driver code to make it possible to submit one, that
 would be an option, as I believe the C++ client is capable of submitting
 plans. The issue I seem to recall him mentioning is that the ODBC driver
 was running a little sanity checking on the SQL query to try to prevent
 submitting complete garbage queries to a server. I think he had concerns
 that a JSON formatted physical plan would fail these checks and he would
 have to disable them along with trying to allow submitting two types of
 queries from ODBC.

 On Fri, Jul 17, 2015 at 8:52 AM, Jacques Nadeau jacq...@dremio.com
 wrote:

  Removing cross posting
 
  Alexander,
 
  There is currently no way for Drill to generate a large amount of data
  using SQL.  However, you can generate large generic data by using the
  MockStoragePlugin if you submit a plan.  You can find an example plan
 using
  this at [1].
 
  I heard someone might be working on extending the MockStoragePlugin to
  support SQL which would provide the outcome you requested.
 
  [1]
 
 
 https://github.com/apache/drill/blob/master/exec/java-exec/src/test/resources/mock-scan.json
 
  On Thu, Jul 16, 2015 at 10:16 PM, Ted Dunning ted.dunn...@gmail.com
  wrote:
 
  
   Also, just doing a Cartesian join of three copies of 1000 records will
   give you a billion records with negligible I/O.
  
   Sent from my iPhone
  
On Jul 16, 2015, at 15:43, Jason Altekruse altekruseja...@gmail.com
 
   wrote:
   
@Alexander If you want to test the speed of the ODBC driver you can do that
without a new storage plugin.

If you get the entire dataset into memory, it will be returned from Drill as
quickly as we can possibly send it to the client. One way to do this is to
insert a sort; we cannot send along any of the data until the complete sort
is done. As long as you don't read so much data that we will start spilling
the sort to disk, all of the records will be in memory. To take the read and
sort time out of your test, just make sure to record the time you first
receive data from Drill, not the query start time.
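
For example, something along these lines would force that buffering (only a
sketch; the table and column are borrowed from Alexander's cross-join example,
and any sufficiently large generator query could be substituted):

  SELECT column1
  FROM `hive43`.`default`.`double_table`
  ORDER BY column1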
   
There is one gotcha here. To make the BI tools more responsive, we
implemented a feature that will send along one empty batch of records with
the schema information populated. This schema is generated by applying all of
the transformations that happen throughout the query. For example, the join
operator handles this schema population by sending along the schema merged
from the two sides of the join; project will similarly add or remove columns
based on the expressions and columns requested. You will want to make sure
you record your start time when you receive the first batch with actual
records. This can give you an accurate measurement of the ODBC performance,
removing the bottleneck of the disk.
   
On Thu, Jul 16, 2015 at 3:24 PM, Alexander Zarei
alexanderz.si...@gmail.com wrote:

Thanks for the answers.

@Ted my only goal is to pump a large amount of data without having to read
from Hard Disk. I am measuring the ODBC driver performance and I need a
higher data transfer rate. So any method that helps

Re: Recursive CTE Support in Drill

2015-07-16 Thread Alexander Zarei
Thanks for the answers.

@Ted my only goal is to pump a large amount of data without having to read
from the hard disk. I am measuring the ODBC driver performance and I need a
higher data transfer rate, so any method that helps pump data out of Drill
faster would help. The log-synth approach seems a good way to generate data
for testing. However, I'd need a RAM-only option, which hopefully provides a
higher throughput.

@Jacques How involved is it to write a dummy plugin that returns one
hardcoded row repeatedly 12 million times?

Thanks,
Alex

On Fri, Jul 10, 2015 at 12:56 PM, Ted Dunning ted.dunn...@gmail.com wrote:

 It may be easy, but it is completely opaque about what really needs to
 happen.

 For instance,

 1) how is schema exposed?

 2) which classes do I really need to implement?

 3) how do I express partitioning of a format?

 4) how do I test it?

 Just a bit of documentation and comments would go a very, very long way.

 Even answers on the mailing list with more detail than "oh, that's easy"
 would help. I would be happy to transcribe answers into the code if I could
 just get some.



 On Fri, Jul 10, 2015 at 11:04 AM, Jacques Nadeau jacq...@apache.org
 wrote:

  Creating an EasyFormatPlugin is pretty simple.  They were designed to get
  rid of much of the scaffolding required for a standard FormatPlugin.
 
  JSON
 
 
 https://github.com/apache/drill/tree/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/json
 
  Text
 
 
 https://github.com/apache/drill/tree/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/text
 
  AVRO
 
 
 https://github.com/apache/drill/tree/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/avro
 
  In all cases, the connection code is pretty light.  A fully schematized
  format like log-synth should be even simpler to implement.
 
  On Fri, Jul 10, 2015 at 10:58 AM, Ted Dunning ted.dunn...@gmail.com
  wrote:
 
   I don't think we need a full-on storage plugin.  I think a data format
   should be sufficient, basically CSV on steroids.
  
  
  
  
  
   On Fri, Jul 10, 2015 at 10:47 AM, Abdel Hakim Deneche 
   adene...@maprtech.com
wrote:
  
Yeah, we still lack documentation on how to write a storage plugin. One piece
of advice I've been seeing a lot is to take a look at the mongo-db plugin; it
was basically added in one single commit:

https://github.com/apache/drill/commit/2ca9c907bff639e08a561eac32e0acab3a0b3304

I think this will give some general ideas on what to expect when writing a
storage plugin.
   
On Fri, Jul 10, 2015 at 9:10 AM, Ted Dunning ted.dunn...@gmail.com
wrote:
   
 Hakim,

 Not yet.  Still very much in the stage of gathering feedback.

 I would think it very simple.  The biggest obstacles are

 1) no documentation on how to write a data format

 2) I need to release a jar for log-synth to Maven Central.




 On Fri, Jul 10, 2015 at 8:17 AM, Abdel Hakim Deneche 
 adene...@maprtech.com
 wrote:

  @Ted, the log-synth storage format would be really useful. I'm already
  seeing many unit tests that could benefit from this. Do you have a github
  repo for your ongoing work?
 
  Thanks!
 
  On Thu, Jul 9, 2015 at 10:56 PM, Ted Dunning 
  ted.dunn...@gmail.com
  wrote:
 
   Are you hard set on using common table expressions?

   I have discussed a bit off-list creating a data format that would allow
   tables to be read from a log-synth [1] schema.  That would let you read
   as much data as you might like with an arbitrarily complex (or simple)
   query.

   Operationally, you would create a file containing a log-synth schema that
   has the extension .synth.  Your data source would have to be configured
   to connect that extension with the log-synth format.  At that point, you
   could select as much or little data as you like from the file and you
   would see generated data rather than the schema.
  
  
  
   [1] https://github.com/tdunning/log-synth
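
   If such a format existed, usage from SQL might look roughly like this
   (purely hypothetical: the workspace, file path and row count are
   placeholders, and the .synth format is not implemented yet):

    SELECT * FROM dfs.`/data/test_users.synth` LIMIT 1000000;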
  
   On Thu, Jul 9, 2015 at 11:31 AM, Alexander Zarei
   alexanderz.si...@gmail.com wrote:

   Hi All,

   I am trying to come up with a query which returns a given number of rows
   without having a real table on Storage.
   
   I am hoping to achieve something like this:

   http://stackoverflow.com/questions/6533524/sql-select-n-records-without-a-table
   
   DECLARE @start INT = 1;
   DECLARE @end INT = 100;

   WITH numbers AS (
       SELECT @start AS number
       UNION ALL
       SELECT number + 1
       FROM numbers
       WHERE number < @end
   )
   SELECT * FROM numbers
   OPTION (MAXRECURSION 0);
   
I do

Re: To EMRFS or not to EMRFS?

2015-06-18 Thread Alexander Zarei
What we learned through our research/experiments doing performance testing
for Drill ODBC is that you get the best throughput when Solid State Drive EC2
instances, such as m3.xlarge, are used to form the HDFS cluster.


Re: Apache Drill and S3 performance

2015-06-07 Thread Alexander Zarei
It might be because of using S3 as your file system.

We have done a similar experiment loading data to HDFS on m1.xlarge
machines. A query profile analysis of the experiments showed reading from
magnetic storage on m1.xlarge machines was the bottleneck. Hence we
switched to m3.xlarge instances (which have SSD storage).

A suggestion would be to try loading your data into HDFS or MapR-FS on the
cluster instead of S3; it will probably boost the performance. As you know,
m3.2xlarge machines have SSD storage, which in general performs better than
S3.

Cheers,
Alex



On Sun, Jun 7, 2015 at 5:12 PM, Jacques Nadeau jacq...@apache.org wrote:

 Can you post a query profile json?  It might help us to determine where the
 time is being spent.

 How many files are being queried?

 On Sun, Jun 7, 2015 at 3:47 PM, Satish Cattamanchi scattaman...@4info.com
 
 wrote:

  We are evaluating Apache Drill performance, and we have setup  Apache
  Drill on Amazon.
 
  All EC2 machines are r3.2xLarge instance type.
 
  Model        vCPU   Mem (GiB)   SSD Storage (GB)
  r3.2xlarge   8      61          1 x 160

  Zookeeper - 1 EC2 machine
  Drillbits - 25 EC2 machines.
  Data on - Amazon  S3
  Data Format - Flat File with PSV ( Pipe Separated) and GZIP'ed.
  Storage Hierarchy  - /logs/requests/y=2015/m=01/d=01/hh=-01/
  Daily Data Size - 2TB approx.
  Daily Rows - 3.5B
 
  Using Apache Drill with Default Configuration.
 
  I was successfully able to configure Apache Drill and connect to S3 and
  query the data from S3.
 
  But when I do count(*) on the day folder, it's taking around 45-50 min with
  the above setup. Any other queries with a WHERE condition also take similar
  time. I was wondering whether the slowness is due to copying data back and
  forth from S3.
 
  Could anyone give some suggestions on setup/configuration to achieve
  better performance with Apache Drill?
 
  Thanks,
  Satish
 
 



Re: Stopping Drillbit and preventing it from restarting automatically on MapR sandbox

2015-04-30 Thread Alexander Zarei
Thank you very much everyone! The revised command worked as follows:

*maprcli node services -name drill-bits -action stop -nodes 'maprdemo'*

given maprdemo is my hostname.

Thanks,
Alex

On Wed, Apr 29, 2015 at 5:49 PM, Chun Chang cch...@maprtech.com wrote:

 Actually it should be:

 maprcli node services -name drill-bits -action stop -nodes `hostname -f`

 Sorry about that.

 On Tue, Apr 28, 2015 at 6:01 PM, Chun Chang cch...@maprtech.com wrote:

  Run the following cmd to stop the old drillbit:
 
  maprcli node services -name Drill -action stop -nodes `hostname -f`
 
  On Tue, Apr 28, 2015 at 5:27 PM, Alexander Zarei 
  alexanderz.si...@gmail.com wrote:
 
  Hi everyone,
 
  I am wondering if you could help me stop the drillbit in MapR sandbox as
  it
  keeps starting again by default.
 
  I built Drill from source from GitHub and want to start it, but the default
  (older) Drill keeps running, preventing the new drillbit from starting.
 
  Thanks for your time and help!
 
  Thanks,
  Alex
 
 
 



Re: Documentation for Query Profile page in Web UI

2015-04-21 Thread Alexander Zarei
Sorry about the inconvenience. The Web UI output is printed in a PDF file
here:

https://drive.google.com/file/d/0B24zVBhi8pQ3aDRfRllFVUh2eEE/view?usp=sharing

Thanks,
Alex

On Tue, Apr 21, 2015 at 2:40 PM, Jason Altekruse altekruseja...@gmail.com
wrote:

 The attachment for the json profile made it to the list because it is
 ASCII, but the screenshot was blocked as a binary file. We can take a look
 at the profile by loading the json into an instance of Drill, but just a
 reminder about binary attachments for everyone: please upload to a public
 host and share a link.

 On Tue, Apr 21, 2015 at 2:34 PM, Alexander Zarei 
 alexanderz.si...@gmail.com
  wrote:

  Hi Team Drill!
 
 
 
  While performing performance testing on Drill clusters on AWS EMR, with
  TPC-H data of scale factor 100, I observed the results for a cluster of 3
  nodes are similar to a cluster of 13 nodes. Hence, I am investigating how
  the query is being carried out and which part of the query handling (e.g.
  planning, reading the data, executing the query, transferring the record
  batches) is the dominant time consuming part.
 
 
 
  Parth kindly suggested I should use the Query Profile from the Web UI and
  it helped a lot. However, there are some items on the Query Profile page
  for which I could not find documentation. I was wondering if you
  know what the following items are:
 
 
 
  *I)    What is the meaning of the operator types: Project,
  Unordered receiver, Single Sender? I guess Hive sub Scan is the time
  spent reading data from Hive; is that correct?*

  *II)   What are the units for the Processes columns in the
  Operator Profiles Overview table? Is it time in a minutes:seconds format?*
 
 
 
  Also it would be really nice to know:
 
  III)What metric does the blue chart on the top of the
  Overview section present?
 
  IV)   What is a fragment, a minor fragment and a major fragment?
 
  V) What are the first start and last start and first end
  and last end?
 
  VI)   What are the sets over which max, min and average are
  calculated?
 
  VII) Why is the Peak memory so small (4 MB) while the machine
  has 16 GB of RAM?
 
 
 
  The print of the Web UI as well as the json profile are attached.
 
  Thanks a lot for your time and help.
 
 
 
  Thanks,
 
  Alex
 



Documentation for Query Profile page in Web UI

2015-04-21 Thread Alexander Zarei
Hi Team Drill!



While performing performance testing on Drill clusters on AWS EMR, with
TPC-H data of scale factor 100, I observed the results for a cluster of 3
nodes are similar to a cluster of 13 nodes. Hence, I am investigating how
the query is being carried out and which part of the query handling (e.g.
planning, reading the data, executing the query, transferring the record
batches) is the dominant time consuming part.



Parth kindly suggested I should use the Query Profile from the Web UI and
it helped a lot. However, there are some items on the Query Profile page
for which I could not find documentation. I was wondering if you
know what the following items are:



*I)    What is the meaning of the operator types: Project,
Unordered receiver, Single Sender? I guess Hive sub Scan is the time spent
reading data from Hive; is that correct?*

*II)  What are the units for the Processes columns in the
Operator Profiles Overview table? Is it time in a minutes : seconds format?*



Also it would be really nice to know:

III)What metric does the blue chart on the top of the
Overview section present?

IV)   What is a fragment, a minor fragment and a major fragment?

V) What are the first start and last start and first end
and last end?

VI)   What are the sets over which max, min and average are
calculated?

VII) Why is the Peak memory so small (4 MB) while the machine has
16 GB of RAM?



The print of the Web UI as well as the json profile are attached.

Thanks a lot for your time and help.



Thanks,

Alex


query profile.json
Description: application/json


Re: Documentation for Query Profile page in Web UI

2015-04-21 Thread Alexander Zarei
And thanks very much, Andries, for the detailed information answering my
questions. I really appreciate it.

And the tables are stored in HDFS on the EMR cluster, not on S3, and then
loaded into Hive as External tables.


Thanks,
Alex



On Tue, Apr 21, 2015 at 3:08 PM, Andries Engelbrecht 
aengelbre...@maprtech.com wrote:

 Alex,

 It definitely looks like the majority of the time is by far spent reading the
 Hive data (Hive_Sub_Scan). I'm not sure how well the storage environment is
 configured, and it may very likely be that the nodes are just waiting on
 storage IO. With more nodes they will simply wait longer to actually get the
 data through the same pipe. Is the data on S3 with a Hive external table
 or similar?

 Very short answers below; these will probably need much more detailed
 documentation to cover them in better detail than a mailing list can.
 i) Probably will need to get some documentation to clarify that better.
 ii) Typically seconds are x.xxx and minutes are x:xx
 iii) Graphical representation of fragments and time taken in seconds
 iv) The best way for me to describe it is that a Major Fragment is a query
 phase (operator), and minor fragments are the parallelization of Major
 Fragments; that is why some Major Frags have only a single minor frag, as it
 can't be parallelized or doesn't make sense to parallelize.
 v) Since you have parallel execution for Major Frags, it shows when the
 first execution thread started, when the last one started, and similarly for
 ends. Basically just to give you an indication if one or more threads were
 significantly slower than others (important to understand for some
 environments to spot a laggard that holds the whole process up).
 vi) Same logic as v to see the avg execution of all threads and the
 quickest and slowest executors
 vii) The mem is to indicate how much was assigned to a thread to complete
 the operation. A simple scan operation doesn't require that much memory vs
 joins and other functions that can require much more memory.

 —Andries



 On Apr 21, 2015, at 2:34 PM, Alexander Zarei alexanderz.si...@gmail.com
 wrote:

 Hi Team Drill!


 While performing performance testing on Drill clusters on AWS EMR, with
 TPC-H data of scale factor 100, I observed the results for a cluster of 3
 nodes are similar to a cluster of 13 nodes. Hence, I am investigating how
 the query is being carried out and which part of the query handling (e.g.
 planning, reading the data, executing the query, transferring the record
 batches) is the dominant time consuming part.


 Parth kindly suggested I should use the Query Profile from the Web UI and
 it helped a lot. However, there are some items on the Query Profile page
 for which I could not find documentation. I was wondering if you
 know what the following items are:


 *I)    What is the meaning of the operator types: Project,
 Unordered receiver, Single Sender? I guess Hive sub Scan is the time spent
 reading data from Hive; is that correct?*

 *II)  What are the units for the Processes columns in the
 Operator Profiles Overview table? Is it time in a minutes : seconds format?*



 Also it would be really nice to know:

 III)What metric does the blue chart on the top of the
 Overview section present?

 IV)   What is a fragment, a minor fragment and a major fragment?

 V) What are the first start and last start and first end
 and last end?

 VI)   What are the sets over which max, min and average are
 calculated?

 VII) Why is the Peak memory so small (4 MB) while the machine
 has 16 GB of RAM?


 The print of the Web UI as well as the json profile are attached.

 Thanks a lot for your time and help.


 Thanks,

 Alex
 Apache Drill Query Profile.pdf, query profile.json