Re: Recursive CTE Support in Drill
Ted Dunning wrote: The cartesian join approach will produce an enormous stream of data with only a very small amount of disk read.

You don't even need an external seed file (or to query a built-in table), and using WITH can extend multiplication to exponentiation:

WITH q(key) AS (
  WITH q(key) AS (VALUES 1, 1)
  SELECT q1.key FROM q q1
  INNER JOIN q q2 ON q1.key = q2.key
  INNER JOIN q q3 ON q2.key = q3.key
  INNER JOIN q q4 ON q3.key = q4.key
)
SELECT COUNT(*) FROM q q1
INNER JOIN q q2 ON q1.key = q2.key
INNER JOIN q q3 ON q2.key = q3.key
INNER JOIN q q4 ON q3.key = q4.key;

That generates 65,536 ((2^4)^4) rows before the COUNT(*): the inner query multiplies the two seed rows up to 2^4 = 16 rows, and the outer query raises those 16 rows to 16^4. Augmenting the table value constructor (VALUES ...) to list a third row (a third 1) increases the resulting row count to 43,046,721 ((3^4)^4). Going to four rows bumps that to 4,294,967,296 ((4^4)^4). (A sized-down sketch of the same pattern appears at the end of this message.)

0: jdbc:drill:drillbit=localhost> WITH q(key) AS (
    WITH q(key) AS (VALUES 1, 1, 1, 1)
    SELECT q1.key FROM q q1
    INNER JOIN q q2 ON q1.key = q2.key
    INNER JOIN q q3 ON q2.key = q3.key
    INNER JOIN q q4 ON q3.key = q4.key
  )
  SELECT COUNT(*) FROM q q1
  INNER JOIN q q2 ON q1.key = q2.key
  INNER JOIN q q3 ON q2.key = q3.key
  INNER JOIN q q4 ON q3.key = q4.key;
+-------------+
|   EXPR$0    |
+-------------+
| 4294967296  |
+-------------+
1 row selected (1197.674 seconds)

Daniel

Ted Dunning wrote:

Jacques, Alexander,

The cartesian join approach will produce an enormous stream of data with only a very small amount of disk read. It works now. If you start with a small file, you can increase the amount of output in small increments by simply adding another join. You do have to have some kind of equality join involved, since Drill will (anomalously, in my view) refuse to plan the query otherwise.

Here is an example. Note that I have deleted many lines of output and indicated that with ellipsis (...). The three queries here produce 4, 16 and 64 lines of output from a 4-line input file. That could have been n, n^2 and n^3 just as easily, for any value of n that you might like.

ted:apache-drill-1.0.0$ cat x.csv
a,1
a,2
a,3
a,4

ted:apache-drill-1.0.0$ bin/sqlline -u jdbc:drill:zk=local -n admin -p admin

0: jdbc:drill:zk=local> select x.columns[0], x.columns[1]
  from dfs.`/Users/tdunning/tmp/apache-drill-1.0.0/x.csv` x;
+---------+---------+
| EXPR$0  | EXPR$1  |
+---------+---------+
| a       | 1       |
| a       | 2       |
| a       | 3       |
| a       | 4       |
+---------+---------+
4 rows selected (0.097 seconds)

0: jdbc:drill:zk=local> select x.columns[0], x.columns[1]
  from dfs.`/Users/tdunning/tmp/apache-drill-1.0.0/x.csv` x,
       dfs.`/Users/tdunning/tmp/apache-drill-1.0.0/x.csv` y
  where x.columns[0] = y.columns[0];
+---------+---------+
| EXPR$0  | EXPR$1  |
+---------+---------+
| a       | 1       |
| a       | 1       |
| a       | 1       |
| a       | 1       |
| a       | 2       |
...
| a       | 4       |
| a       | 4       |
+---------+---------+
16 rows selected (0.614 seconds)

0: jdbc:drill:zk=local> select x.columns[0], x.columns[1]
  from dfs.`/Users/tdunning/tmp/apache-drill-1.0.0/x.csv` x,
       dfs.`/Users/tdunning/tmp/apache-drill-1.0.0/x.csv` y,
       dfs.`/Users/tdunning/tmp/apache-drill-1.0.0/x.csv` z
  where x.columns[0] = y.columns[0] and x.columns[0] = z.columns[0];
+---------+---------+
| EXPR$0  | EXPR$1  |
+---------+---------+
| a       | 1       |
| a       | 1       |
| a       | 1       |
| a       | 1       |
...
| a       | 4       |
| a       | 4       |
| a       | 4       |
| a       | 4       |
+---------+---------+
64 rows selected (0.63 seconds)

On Fri, Jul 17, 2015 at 8:52 AM, Jacques Nadeau jacq...@dremio.com wrote:

Removing cross-posting.

Alexander,

There is currently no way for Drill to generate a large amount of data using SQL. However, you can generate large generic data by using the MockStoragePlugin if you submit a plan. You can find an example plan using this at [1].
I heard someone might be working on extending the MockStoragePlugin to support SQL, which would provide the outcome you requested.

[1] https://github.com/apache/drill/blob/master/exec/java-exec/src/test/resources/mock-scan.json

On Thu, Jul 16, 2015 at 10:16 PM, Ted Dunning ted.dunn...@gmail.com wrote: Also, just doing a Cartesian join of three copies of 1000 records will give you a billion records with negligible I/O. ...SNIP...

On Jul 16, 2015, at 15:43, Jason Altekruse altekruseja...@gmail.com wrote: @Alexander If you want to test the speed of the ODBC driver you can do that without a new storage plugin. ...SNIP...
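For reference, here is a sized-down sketch of the same seed-free pattern, untested and modeled directly on Daniel's queries above: ten seed rows that share one key, joined to themselves six ways, give 10^6 rows before the COUNT(*).

-- Sketch (untested): every row has key = 1, so each chained equality join
-- keeps the full cross product: 10 rows joined six ways = 10^6 rows.
WITH q(key) AS (VALUES 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
SELECT COUNT(*) FROM q q1
INNER JOIN q q2 ON q1.key = q2.key
INNER JOIN q q3 ON q2.key = q3.key
INNER JOIN q q4 ON q3.key = q4.key
INNER JOIN q q5 ON q4.key = q5.key
INNER JOIN q q6 ON q5.key = q6.key;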
Re: Recursive CTE Support in Drill
Thanks for more elaboration Ted, Jacques and Jason!

@Ted that is a very cool idea. I tried the cross join but figured out that cross join is not supported in Drill yet; we have DRILL-786 for it. The new method looks very promising; it seems to be an implicit cross join, doesn't it? I just tried it out and it worked like a charm. I will go on with this method.

@Jacques, yes, as Jason said, we discussed this before, and I have talked to my colleagues about helping me modify the ODBC driver so it sends a plan. Also thanks for the query. I tried it out for two tables and it worked fine, but extending it to three tables gives me a syntax error:

select * from
  ((select column1, 1 as join_keyb from
      (SELECT column1, 1 as join_key FROM `hive43`.`default`.`double_table`) t1
    Join
      (SELECT column1, 1 as join_key FROM `hive43`.`default`.`double_table`) t2
    on t1.join_key = t2.join_key) t12
  Join
    (SELECT column1, 1 as join_key FROM `hive43`.`default`.`double_table`) t3
  on t12.join_keyb = t3.join_key)

(A possible fix is sketched at the end of this message.)

The other syntax, using the join three times, was easier for me, so I could test with:

select t1.column1 from
  (SELECT column1, 1 as join_key FROM `hive43`.`default`.`double_table`) t1,
  (SELECT column1, 1 as join_key FROM `hive43`.`default`.`double_table`) t2,
  (SELECT column1, 1 as join_key FROM `hive43`.`default`.`double_table`) t3
where t1.join_key = t2.join_key and t1.join_key = t3.join_key

Thank you very much for your time Ted, Jacques and Jason!

Thanks,
Alex

On Fri, Jul 17, 2015 at 2:09 PM, Jason Altekruse altekruseja...@gmail.com wrote:

Jacques,

Alexander has brought up this problem previously in one of the hangouts and said that submitting a physical plan was not possible through ODBC. If he is able to modify the driver code to make it possible to submit one, that would be an option, as I believe the C++ client is capable of submitting plans. The issue I seem to recall him mentioning is that the ODBC driver was running a little sanity checking on the SQL query to try to prevent submitting complete garbage queries to a server. I think he had concerns that a JSON-formatted physical plan would fail these checks and he would have to disable them, along with trying to allow submitting two types of queries from ODBC.

On Fri, Jul 17, 2015 at 8:52 AM, Jacques Nadeau jacq...@dremio.com wrote: Removing cross-posting. Alexander, there is currently no way for Drill to generate a large amount of data using SQL. ...SNIP...

On Thu, Jul 16, 2015 at 10:16 PM, Ted Dunning ted.dunn...@gmail.com wrote: Also, just doing a Cartesian join of three copies of 1000 records will give you a billion records with negligible I/O. ...SNIP...

On Jul 16, 2015, at 15:43, Jason Altekruse altekruseja...@gmail.com wrote: @Alexander If you want to test the speed of the ODBC driver you can do that without a new storage plugin. ...SNIP...
On Thu, Jul 16, 2015 at 3:24 PM, Alexander Zarei alexanderz.si...@gmail.com wrote: Thanks for the answers. ...SNIP...
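A hedged guess at the syntax error above (not verified against Drill): the outermost parentheses wrap a join expression used directly in the FROM clause, and column1 in the t12 subquery is ambiguous between t1 and t2. A variant that addresses both, keeping the table names from Alex's example:

-- Untested sketch: outer parentheses removed, ambiguous column1 qualified.
SELECT t12.column1
FROM (
    SELECT t1.column1, 1 AS join_keyb
    FROM (SELECT column1, 1 AS join_key FROM `hive43`.`default`.`double_table`) t1
    JOIN (SELECT column1, 1 AS join_key FROM `hive43`.`default`.`double_table`) t2
      ON t1.join_key = t2.join_key
  ) t12
JOIN (SELECT column1, 1 AS join_key FROM `hive43`.`default`.`double_table`) t3
  ON t12.join_keyb = t3.join_key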
Re: Recursive CTE Support in Drill
Good point. In fact, you can just use a literal expression and some sample data, such as the TPC-H lineitem table:

SELECT * FROM
  (select l_orderkey, l_shipdate, l_commitdate, l_shipmode, 1 as join_key
   from cp.`tpch/lineitem.parquet`) t1
JOIN
  (select l_orderkey, l_shipdate, l_commitdate, l_shipmode, 1 as join_key
   from cp.`tpch/lineitem.parquet`) t2
ON t1.join_key = t2.join_key

...SNIP...
3,621,030,625 rows selected

(3,621,030,625 = 60,175^2, so the sample file evidently holds 60,175 rows. A variant that caps the generated row count is sketched at the end of this message.)

On Sat, Jul 18, 2015 at 2:38 PM, Ted Dunning ted.dunn...@gmail.com wrote: Jacques, Alexander, The cartesian join approach will produce an enormous stream of data with only a very small amount of disk read. It works now. ...SNIP...

On Fri, Jul 17, 2015 at 8:52 AM, Jacques Nadeau jacq...@dremio.com wrote: Removing cross-posting. Alexander, there is currently no way for Drill to generate a large amount of data using SQL. ...SNIP...

On Thu, Jul 16, 2015 at 10:16 PM, Ted Dunning ted.dunn...@gmail.com wrote: Also, just doing a Cartesian join of three copies of 1000 records will give you a billion records with negligible I/O. ...SNIP...

On Jul 16, 2015, at 15:43, Jason Altekruse altekruseja...@gmail.com wrote: @Alexander If you want to test the speed of the ODBC driver you can do that without a new storage plugin. ...SNIP...
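If the full 3.6-billion-row result is more than a test needs, a LIMIT on the outer query is one way to cap it. An untested sketch against the same classpath sample file:

-- Untested sketch: same constant join_key trick, capped at one million rows.
SELECT * FROM
  (select l_orderkey, 1 as join_key from cp.`tpch/lineitem.parquet`) t1
JOIN
  (select l_orderkey, 1 as join_key from cp.`tpch/lineitem.parquet`) t2
ON t1.join_key = t2.join_key
LIMIT 1000000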
Re: Recursive CTE Support in Drill
Also, just doing a Cartesian join of three copies of 1000 records will give you a billion records with negligible I/O.

Sent from my iPhone

On Jul 16, 2015, at 15:43, Jason Altekruse altekruseja...@gmail.com wrote: @Alexander If you want to test the speed of the ODBC driver you can do that without a new storage plugin. ...SNIP...

On Thu, Jul 16, 2015 at 3:24 PM, Alexander Zarei alexanderz.si...@gmail.com wrote: Thanks for the answers. ...SNIP...

On Fri, Jul 10, 2015 at 12:56 PM, Ted Dunning ted.dunn...@gmail.com wrote: It may be easy, but it is completely opaque about what really needs to happen. ...SNIP...

On Fri, Jul 10, 2015 at 11:04 AM, Jacques Nadeau jacq...@apache.org wrote:

Creating an EasyFormatPlugin is pretty simple. They were designed to get rid of much of the scaffolding required for a standard FormatPlugin.

JSON: https://github.com/apache/drill/tree/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/json
Text: https://github.com/apache/drill/tree/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/text
AVRO: https://github.com/apache/drill/tree/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/avro

In all cases, the connection code is pretty light. A fully schematized format like log-synth should be even simpler to implement.

On Fri, Jul 10, 2015 at 10:58 AM, Ted Dunning ted.dunn...@gmail.com wrote: I don't think we need a full on storage plugin. ...SNIP...
On Fri, Jul 10, 2015 at 10:47 AM, Abdel Hakim Deneche adene...@maprtech.com wrote: Yeah, we still lack documentation on how to write a storage plugin. ...SNIP...

On Fri, Jul 10, 2015 at 9:10 AM, Ted Dunning ted.dunn...@gmail.com wrote: Hakim, Not yet. Still very much in the stage of gathering feedback. ...SNIP...

On Fri, Jul 10, 2015 at 8:17 AM, Abdel Hakim Deneche adene...@maprtech.com wrote: @Ted, the log-synth storage format would be really useful. ...SNIP...

On Thu, Jul 9, 2015 at 10:56 PM, Ted Dunning ted.dunn...@gmail.com wrote: Are you hard set on using common table expressions? ...SNIP...
Re: Recursive CTE Support in Drill
Thanks for the answers.

@Ted my only goal is to pump a large amount of data out of Drill without having to read from the hard disk. I am measuring the ODBC driver's performance and I need a higher data transfer rate, so any method that helps pump data out of Drill faster would help. Log-synth seems a good way to generate data for testing; however, I'd need a RAM-only option, which hopefully provides higher throughput.

@Jacques How involved is it to write a dummy plugin that returns one hardcoded row repeatedly, 12 million times?

Thanks,
Alex

On Fri, Jul 10, 2015 at 12:56 PM, Ted Dunning ted.dunn...@gmail.com wrote: It may be easy, but it is completely opaque about what really needs to happen. ...SNIP...

On Fri, Jul 10, 2015 at 11:04 AM, Jacques Nadeau jacq...@apache.org wrote: Creating an EasyFormatPlugin is pretty simple. ...SNIP...

On Fri, Jul 10, 2015 at 10:58 AM, Ted Dunning ted.dunn...@gmail.com wrote: I don't think we need a full on storage plugin. ...SNIP...

On Fri, Jul 10, 2015 at 10:47 AM, Abdel Hakim Deneche adene...@maprtech.com wrote: Yeah, we still lack documentation on how to write a storage plugin. ...SNIP...

On Fri, Jul 10, 2015 at 9:10 AM, Ted Dunning ted.dunn...@gmail.com wrote: Hakim, Not yet. Still very much in the stage of gathering feedback. ...SNIP...

On Fri, Jul 10, 2015 at 8:17 AM, Abdel Hakim Deneche adene...@maprtech.com wrote: @Ted, the log-synth storage format would be really useful. ...SNIP...

On Thu, Jul 9, 2015 at 10:56 PM, Ted Dunning ted.dunn...@gmail.com wrote:

Are you hard set on using common table expressions?

I have discussed a bit off-list creating a data format that would allow tables to be read from a log-synth [1] schema. That would let you read as much data as you might like with an arbitrarily complex (or simple) query.

Operationally, you would create a file containing a log-synth schema that has the extension .synth. Your data source would have to be configured to connect that extension with the log-synth format.
At that point, you could select as much or as little data as you like from the file, and you would see generated data rather than the schema.

[1] https://github.com/tdunning/log-synth

On Thu, Jul 9, 2015 at 11:31 AM, Alexander Zarei alexanderz.si...@gmail.com wrote:

Hi All,

I am trying to come up with a query which returns a given number of rows without having a real table on storage. I am hoping to achieve something like this: http://stackoverflow.com/questions/6533524/sql-select-n-records-without-a-table

DECLARE @start INT = 1;
DECLARE @end INT = 100;

WITH numbers AS (
    SELECT @start AS number
    UNION ALL
    SELECT number + 1 FROM numbers WHERE number < @end
)
SELECT * FROM numbers
OPTION (MAXRECURSION 0);

I do not actually need to create different values, and returning identical rows would work too. I just need to bypass the FROM clause in the query.

Thanks,
Alex
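For comparison, the query quoted above is T-SQL (SQL Server). The ANSI-standard spelling of the same number generator uses WITH RECURSIVE, which, as this thread makes clear, Drill did not support at the time:

-- ANSI-style recursive CTE equivalent (not runnable in Drill as of this
-- thread; shown only to contrast with the T-SQL form above).
WITH RECURSIVE numbers(number) AS (
    SELECT 1
    UNION ALL
    SELECT number + 1 FROM numbers WHERE number < 100
)
SELECT * FROM numbers;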
Re: Recursive CTE Support in Drill
@Alexander If you want to test the speed of the ODBC driver you can do that without a new storage plugin. If you get the entire dataset into memory, it will be returned from Drill as quickly as we can possibly send it to the client. One way to do this is to insert a sort; we cannot send along any of the data until the complete sort is done. As long as you don't read so much data that we start spilling the sort to disk, all of the records will be in memory. To take the read and sort time out of your test, just make sure to record the time you first receive data from Drill, not the query start time. (A sketch of such a query appears at the end of this message.)

There is one gotcha here. To make the BI tools more responsive, we implemented a feature that will send along one empty batch of records with the schema information populated. This schema is generated by applying all of the transformations that happen throughout the query. For example, the join operator handles this schema population by sending along the schema merged from the two sides of the join; project will similarly add or remove columns based on the expressions and columns requested. You will want to make sure you record your start time when you receive the first batch with actual records. This can give you an accurate measurement of the ODBC performance, removing the bottleneck of the disk.

On Thu, Jul 16, 2015 at 3:24 PM, Alexander Zarei alexanderz.si...@gmail.com wrote: Thanks for the answers. ...SNIP...

On Fri, Jul 10, 2015 at 12:56 PM, Ted Dunning ted.dunn...@gmail.com wrote: It may be easy, but it is completely opaque about what really needs to happen. ...SNIP...

On Fri, Jul 10, 2015 at 11:04 AM, Jacques Nadeau jacq...@apache.org wrote: Creating an EasyFormatPlugin is pretty simple. ...SNIP...

On Fri, Jul 10, 2015 at 10:58 AM, Ted Dunning ted.dunn...@gmail.com wrote: I don't think we need a full on storage plugin. ...SNIP...

On Fri, Jul 10, 2015 at 10:47 AM, Abdel Hakim Deneche adene...@maprtech.com wrote: Yeah, we still lack documentation on how to write a storage plugin. ...SNIP...
On Fri, Jul 10, 2015 at 9:10 AM, Ted Dunning ted.dunn...@gmail.com wrote: Hakim, Not yet. Still very much in the stage of gathering feedback. ...SNIP...

On Fri, Jul 10, 2015 at 8:17 AM, Abdel Hakim Deneche adene...@maprtech.com wrote: @Ted, the log-synth storage format would be really useful. ...SNIP...

On Thu, Jul 9, 2015 at 10:56 PM, Ted Dunning ted.dunn...@gmail.com wrote: Are you hard set on using common table expressions? ...SNIP...
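A hedged sketch of the kind of query Jason describes, reusing the x.csv file from Ted's example earlier in the thread (untested; any small file and any sort key would do, and the self-join inflates the row count as discussed above):

-- Untested sketch: the ORDER BY forces Drill to materialize the whole
-- result before sending the first data batch, so timing from the first
-- non-empty batch measures client transfer speed rather than disk read.
select x.columns[0], x.columns[1]
from dfs.`/Users/tdunning/tmp/apache-drill-1.0.0/x.csv` x,
     dfs.`/Users/tdunning/tmp/apache-drill-1.0.0/x.csv` y
where x.columns[0] = y.columns[0]
order by x.columns[1];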
Re: Recursive CTE Support in Drill
Yeah, we still lack documentation on how to write a storage plugin. One piece of advice I've seen a lot is to take a look at the mongo-db plugin; it was basically added in one single commit: https://github.com/apache/drill/commit/2ca9c907bff639e08a561eac32e0acab3a0b3304

I think this will give some general ideas of what to expect when writing a storage plugin.

On Fri, Jul 10, 2015 at 9:10 AM, Ted Dunning ted.dunn...@gmail.com wrote: Hakim, Not yet. Still very much in the stage of gathering feedback. ...SNIP...

On Fri, Jul 10, 2015 at 8:17 AM, Abdel Hakim Deneche adene...@maprtech.com wrote: @Ted, the log-synth storage format would be really useful. ...SNIP...

On Thu, Jul 9, 2015 at 10:56 PM, Ted Dunning ted.dunn...@gmail.com wrote: Are you hard set on using common table expressions? ...SNIP...

On Thu, Jul 9, 2015 at 11:31 AM, Alexander Zarei alexanderz.si...@gmail.com wrote: Hi All, I am trying to come up with a query which returns a given number of rows without having a real table on storage. ...SNIP...
Re: Recursive CTE Support in Drill
I don't think we need a full-on storage plugin. I think a data format should be sufficient: basically CSV on steroids.

On Fri, Jul 10, 2015 at 10:47 AM, Abdel Hakim Deneche adene...@maprtech.com wrote: Yeah, we still lack documentation on how to write a storage plugin. ...SNIP...

On Fri, Jul 10, 2015 at 9:10 AM, Ted Dunning ted.dunn...@gmail.com wrote: Hakim, Not yet. Still very much in the stage of gathering feedback. ...SNIP...

On Fri, Jul 10, 2015 at 8:17 AM, Abdel Hakim Deneche adene...@maprtech.com wrote: @Ted, the log-synth storage format would be really useful. ...SNIP...

On Thu, Jul 9, 2015 at 10:56 PM, Ted Dunning ted.dunn...@gmail.com wrote: Are you hard set on using common table expressions? ...SNIP...

On Thu, Jul 9, 2015 at 11:31 AM, Alexander Zarei alexanderz.si...@gmail.com wrote: Hi All, I am trying to come up with a query which returns a given number of rows without having a real table on storage. ...SNIP...
Re: Recursive CTE Support in Drill
Hakim,

Not yet. Still very much in the stage of gathering feedback. I would think it very simple. The biggest obstacles are:

1) no documentation on how to write a data format
2) I need to release a jar for log-synth to Maven Central.

On Fri, Jul 10, 2015 at 8:17 AM, Abdel Hakim Deneche adene...@maprtech.com wrote:

@Ted, the log-synth storage format would be really useful. I'm already seeing many unit tests that could benefit from this. Do you have a github repo for your ongoing work?

Thanks!

On Thu, Jul 9, 2015 at 10:56 PM, Ted Dunning ted.dunn...@gmail.com wrote: Are you hard set on using common table expressions? I have discussed a bit off-list creating a data format that would allow tables to be read from a log-synth [1] schema. ...SNIP...

On Thu, Jul 9, 2015 at 11:31 AM, Alexander Zarei alexanderz.si...@gmail.com wrote: Hi All, I am trying to come up with a query which returns a given number of rows without having a real table on storage. ...SNIP...
Re: Recursive CTE Support in Drill
It may be easy, but it is completely opaque about what really needs to happen. For instance:

1) how is schema exposed?
2) which classes do I really need to implement?
3) how do I express partitioning of a format?
4) how do I test it?

Just a bit of documentation and comments would go a very, very long way. Even answers on the mailing list with more detail than "oh, that's easy" would help; I would be happy to transcribe answers into the code if I could just get some.

On Fri, Jul 10, 2015 at 11:04 AM, Jacques Nadeau jacq...@apache.org wrote: Creating an EasyFormatPlugin is pretty simple. They were designed to get rid of much of the scaffolding required for a standard FormatPlugin. ...SNIP...

On Fri, Jul 10, 2015 at 10:58 AM, Ted Dunning ted.dunn...@gmail.com wrote: I don't think we need a full on storage plugin. ...SNIP...

On Fri, Jul 10, 2015 at 10:47 AM, Abdel Hakim Deneche adene...@maprtech.com wrote: Yeah, we still lack documentation on how to write a storage plugin. ...SNIP...

On Fri, Jul 10, 2015 at 9:10 AM, Ted Dunning ted.dunn...@gmail.com wrote: Hakim, Not yet. Still very much in the stage of gathering feedback. ...SNIP...

On Fri, Jul 10, 2015 at 8:17 AM, Abdel Hakim Deneche adene...@maprtech.com wrote: @Ted, the log-synth storage format would be really useful. ...SNIP...

On Thu, Jul 9, 2015 at 10:56 PM, Ted Dunning ted.dunn...@gmail.com wrote: Are you hard set on using common table expressions? ...SNIP...

On Thu, Jul 9, 2015 at 11:31 AM, Alexander Zarei alexanderz.si...@gmail.com wrote: Hi All, I am trying to come up with a query which returns a given number of rows without having a real table on storage. ...SNIP...