Hi,

I have been trying to get Drill to work with Tachyon (http://tachyon-project.org/index.html), using S3 as deep storage (in Tachyon terms: the under file system).
The whole idea is that each Drillbit (node) has its own multi-tiered local storage (MEM, SSD and HDD) and uses it to cache Parquet files that are stored in S3. This should minimize S3 traffic and latency and maximize performance, since Tachyon handles evicting unused files and moving hot files between tiers.

In theory this sounds good (to me at least), and in practice it is almost working.

I would like to share the steps we have taken to get this running so that others can follow them, and I am hoping someone here can assist us with what we hope is the last leg.

*Steps taken:*

1. *Have Drill 1.1.x running*
   - according to their simple guide (https://drill.apache.org/docs/starting-drill-on-linux-and-mac-os-x/)

2. *Have Tachyon running*
   1. Latest release: https://github.com/amplab/tachyon/releases/download/v0.7.0/tachyon-0.7.0-bin.tar.gz
   2. Configure and run a local instance according to their simple guide (http://tachyon-project.org/Running-Tachyon-Locally.html)
      - requires Java 7 to run (JSP pages will not render correctly using Java 8)
   3. Make sure to run the tests (they should leave some test files in your Tachyon file system)
   4. Have it running on localhost (bin/tachyon-startup.sh localhost, for this example)

3. *Configure S3 as the underlying FS for Tachyon*
   1. Configure S3 according to this guide (http://tachyon-project.org/Setup-UFS.html)
   2. Add "export TACHYON_UNDERFS_ADDRESS=s3n://<bucket-name>" to conf/tachyon-env.sh
   3. Add "-Dfs.s3n.awsAccessKeyId=<your-key>" to the export TACHYON_JAVA_OPTS section of the same file (conf/tachyon-env.sh)
   4. Add "-Dfs.s3n.awsSecretAccessKey=<your-secret>" to the export TACHYON_JAVA_OPTS section of the same file (conf/tachyon-env.sh)

4. *Add the Tachyon client and jets3t client (jars) to Drill*
   1. cp <tachyon-root>/clients/client/target/tachyon-client-0.7.0.jar <drill-root>/jars/3rdparty/
   2. Download jets3t (http://bitbucket.org/jmurty/jets3t/downloads/jets3t-0.9.3.zip)
   3. Unzip it and cp jets3t-0.9.3/jars/jets3t-0.9.3.jar to <drill-root>/jars/3rdparty/

5. *Allow Drill to load the jets3t jar*
   1. Edit <drill-root>/bin/hadoop-excludes.txt
   2. Remove the jets3t line from the file

6. *Configure S3 access for jets3t in Drill (used by the Tachyon driver)*
   1. Edit <drill-root>/conf/drill-env.sh
   2. Add -Dfs.s3n.awsAccessKeyId=<your-key> to the "export DRILL_JAVA_OPTS=" line
   3. Add -Dfs.s3n.awsSecretAccessKey=<your-secret> to the "export DRILL_JAVA_OPTS=" line
      - I have no idea why the Tachyon client needs both a native Tachyon client-master/worker connection and an S3 connection

7. *Configure a new storage for Drill using the Drill admin (localhost:8047)*
   1. Create a new storage named "ts3" (for example)
   2. Use the following config for it:

      {
        "type": "file",
        "enabled": true,
        "connection": "tachyon://127.0.0.1:19998/",
        "workspaces": {
          "root": {
            "location": "/",
            "writable": true,
            "defaultInputFormat": null
          }
        },
        "formats": {
          "psv": { "type": "text", "extensions": [ "tbl" ], "delimiter": "|" },
          "csv": { "type": "text", "extensions": [ "csv" ], "delimiter": "," },
          "tsv": { "type": "text", "extensions": [ "tsv" ], "delimiter": "\t" },
          "parquet": { "type": "parquet" },
          "json": { "type": "json" },
          "avro": { "type": "avro" }
        }
      }

   3. Notice the "tachyon://127.0.0.1:19998/" connection string in the config.
      - It's the glue between Drill and Tachyon
   4. Run Drillbit + local client/sqlline (see the Drill documentation)

8. *Make sure Drill is communicating with Tachyon*
   1. Type "use ts3.root;" in the Drill sqlline/client
   2. Type "show files;" in the Drill sqlline/client
   3. This should show the test files directory generated earlier:

      +----------------------+--------------+---------+---------+--------+--------+--------------+--------------------------+--------------------------+
      | name                 | isDirectory  | isFile  | length  | owner  | group  | permissions  | accessTime               | modificationTime         |
      +----------------------+--------------+---------+---------+--------+--------+--------------+--------------------------+--------------------------+
      | default_tests_files  | true         | false   | 0       |        |        | rwxrwxrwx    | 2015-07-29 15:08:13.782  | 2015-07-29 15:08:13.782  |
      +----------------------+--------------+---------+---------+--------+--------+--------------+--------------------------+--------------------------+

   4. Do the partial-success dance!
      - Drill is now talking to the local Tachyon file system

9. *Create a table on the Tachyon file system*
   1. Run: "CREATE TABLE ts3.root.`/test` AS SELECT * FROM dfs.tmp.`/some-file.json`;"
   2. Have it not work:

      Error: SYSTEM ERROR: IllegalArgumentException: No Under File System Factory found for: s3n://streamanalytics/tmp/tachyon/workers/1438179000001/48

      Fragment 0:0

      [Error Id: e4201119-1805-44b7-8088-3fc1c898f388 on localhost:31010] (state=,code=0)

   3. Do the goddammit-frustration dance and then help me solve this one!
      - the empty Parquet file is created in Tachyon and can be listed with "show files"
      - nothing is created in S3 (other than the tmp files created by Tachyon when formatting/setting up)

10. *Verify that everything is saved to S3*
    - pending

11. *Verify that Drillbits see material from every Tachyon node*
    - pending

12. *Configure Tachyon to be multi-tiered*
    - pending

So, there we almost have it! :)

All input and ideas are welcome! (If someone is doing this already, then please step forward and share.)

Regards,
-Stefan
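
P.S. In case it helps anyone following along, here is a rough sketch of the environment additions from steps 3 and 6 gathered in one place. The bucket name and credentials are placeholders, and the S3_BUCKET, AWS_KEY and AWS_SECRET helper variables are purely illustrative. Appending with ${VAR:-} is just a compact way to show the idea; in the real conf/tachyon-env.sh and conf/drill-env.sh the -D options go inside the existing export lines.

```shell
#!/bin/sh
# Sketch only: placeholder values, not real credentials.
S3_BUCKET="my-bucket"      # stands in for <bucket-name>
AWS_KEY="your-key"         # stands in for <your-key>
AWS_SECRET="your-secret"   # stands in for <your-secret>

# Step 3.2: point Tachyon's under file system at the S3 bucket
# (belongs in conf/tachyon-env.sh).
export TACHYON_UNDERFS_ADDRESS="s3n://${S3_BUCKET}"

# Steps 3.3 and 3.4: pass the S3 credentials to the Tachyon JVMs,
# keeping whatever TACHYON_JAVA_OPTS already contains.
export TACHYON_JAVA_OPTS="${TACHYON_JAVA_OPTS:-} -Dfs.s3n.awsAccessKeyId=${AWS_KEY} -Dfs.s3n.awsSecretAccessKey=${AWS_SECRET}"

# Steps 6.2 and 6.3: the same two options for Drill
# (belongs in conf/drill-env.sh).
export DRILL_JAVA_OPTS="${DRILL_JAVA_OPTS:-} -Dfs.s3n.awsAccessKeyId=${AWS_KEY} -Dfs.s3n.awsSecretAccessKey=${AWS_SECRET}"
```

Both services need a restart after these files change, since the options are only read at JVM startup.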
