The incomplete saga of Drill, Tachyon and S3 (Three Amigos, - the analytics edition)

Stefán Baxter Wed, 29 Jul 2015 08:50:54 -0700

Hi,

I have been trying to get Drill to work with Tachyon (
http://tachyon-project.org/index.html) using S3 as deep storage (Tachyon's
under file system).


The whole idea is that each Drillbit (node) has its own multi-tiered local
storage (MEM, SSD + HDD) and uses it to cache Parquet files that are
stored in S3.
This should minimize S3 traffic and latency and maximize performance,
as Tachyon handles evicting unused files and moving hot files between
tiers.

In theory this sounds good (to me at least) and in practice it is almost
working.

I would like to share the steps we have taken to get this running so others
can follow them, and I am hoping someone here can assist us with what we
hope is the last leg.

*Steps taken:*

   1. *Have Drill 1.1.x running*
      - according to their simple guide (
      https://drill.apache.org/docs/starting-drill-on-linux-and-mac-os-x/)

   2. *Have Tachyon running*
      1. Latest release:
      https://github.com/amplab/tachyon/releases/download/v0.7.0/tachyon-0.7.0-bin.tar.gz
      2. Configure and run a local instance according to their simple guide (
      http://tachyon-project.org/Running-Tachyon-Locally.html)
      - requires Java 7 to run (the JSP pages will not render correctly
      using Java 8)
      3. Make sure to run the tests (they should leave some test files in
      your Tachyon file system)
      4. Have it running on localhost (bin/tachyon-start.sh local
      (for this example))

   3. *Configure S3 Underlying FS for Tachyon*
      1. Configure S3 according to this guide (
      http://tachyon-project.org/Setup-UFS.html)
      2. Add "export TACHYON_UNDERFS_ADDRESS=s3n://<bucket-name>"
      to conf/tachyon-env.sh
      3. Add "-Dfs.s3n.awsAccessKeyId=<your-key>" to the export
      TACHYON_JAVA_OPTS section of the same file: conf/tachyon-env.sh
      4. Add "-Dfs.s3n.awsSecretAccessKey=<your-secret>" to the export
      TACHYON_JAVA_OPTS section of the same file: conf/tachyon-env.sh

   4. *Add Tachyon client and jets3t client (jars) to Drill*
      1. cp <tachyon-root>/clients/client/target/tachyon-client-0.7.0.jar
      <drill-root>/jars/3rdparty/
      2. Get the jets3t download (
      http://bitbucket.org/jmurty/jets3t/downloads/jets3t-0.9.3.zip)
      3. Unzip it and cp jets3t-0.9.3/jars/jets3t-0.9.3.jar
      to <drill-root>/jars/3rdparty/

   5. *Allow Drill to load the jets3t jar*
      1. Edit <drill-root>/bin/hadoop-excludes.txt
      2. Remove the jets3t line from the file

   6. *Configure S3 access for jets3t in Drill (used by the Tachyon
   driver)*
      1. Edit <drill-root>/conf/drill-env.sh
      2. Add -Dfs.s3n.awsAccessKeyId=<your-key> to the "export
      DRILL_JAVA_OPTS=" line
      3. Add -Dfs.s3n.awsSecretAccessKey=<your-secret> to the "export
      DRILL_JAVA_OPTS=" line
      - I have no idea why the Tachyon client needs both a native Tachyon
      client-master/worker connection and an S3 connection

   7. *Configure a new storage plugin for Drill using the Drill admin UI
   (localhost:8047)*
      1. Create a new storage plugin named "ts3" (for example)
      2. Use the following config for it:
      {
        "type": "file",
        "enabled": true,
        "connection": "tachyon://127.0.0.1:19998/",
        "workspaces": {
          "root": {
            "location": "/",
            "writable": true,
            "defaultInputFormat": null
          }
        },
        "formats": {
          "psv": { "type": "text", "extensions": [ "tbl" ], "delimiter": "|" },
          "csv": { "type": "text", "extensions": [ "csv" ], "delimiter": "," },
          "tsv": { "type": "text", "extensions": [ "tsv" ], "delimiter": "\t" },
          "parquet": { "type": "parquet" },
          "json": { "type": "json" },
          "avro": { "type": "avro" }
        }
      }
      3. Notice the "tachyon://127.0.0.1:19998/" connection string in the
      config.
      - It's the glue between Drill and Tachyon
      4. Run Drillbit + local client/sqlline (see the Drill documentation)
   8. *Make sure Drill is communicating with Tachyon*
      1. Type "use ts3.root;" in the Drill sqlline/client
      2. Type "show files;" in the Drill sqlline/client
      3. It should show the test files directory generated earlier:

      +----------------------+--------------+---------+---------+--------+--------+--------------+--------------------------+--------------------------+
      |         name         | isDirectory  | isFile  | length  | owner  | group  | permissions  |        accessTime        |     modificationTime     |
      +----------------------+--------------+---------+---------+--------+--------+--------------+--------------------------+--------------------------+
      | default_tests_files  | true         | false   | 0       |        |        | rwxrwxrwx    | 2015-07-29 15:08:13.782  | 2015-07-29 15:08:13.782  |
      +----------------------+--------------+---------+---------+--------+--------+--------------+--------------------------+--------------------------+

      4. Do the partial-success dance!
      - Drill is now talking to the local Tachyon file system

   9. *Create a table on the Tachyon file system*
      1. Run: "CREATE TABLE ts3.root.`/test` AS SELECT * FROM
      dfs.tmp.`/some-file.json`;"
      2. Have it not work:
      Error: SYSTEM ERROR: IllegalArgumentException: No Under File System
      Factory found for:
      s3n://streamanalytics/tmp/tachyon/workers/1438179000001/48
      Fragment 0:0
      [Error Id: e4201119-1805-44b7-8088-3fc1c898f388 on localhost:31010]
      (state=,code=0)
      3. Do the goddammit-frustration dance and then help me solve this one!
      - the empty Parquet file is created in Tachyon and can be listed with
      "show files"
      - nothing is created in S3 (other than the tmp files created by
      Tachyon when formatting/setting up)

   10. *Verify that everything is saved to S3*
   - pending

   11. *Verify that Drillbits see material from every Tachyon node*
   - pending

   12. *Configure Tachyon to be multi-tiered*
   - pending
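
The S3 settings from steps 3 and 6 can be scripted. Below is a minimal sketch, assuming both conf/tachyon-env.sh and conf/drill-env.sh are sourced as plain shell, so an export appended at the end extends the options defined earlier in the file. The helper names (append_s3_opts, set_underfs_address) are made up for illustration and are not part of Tachyon or Drill.

```shell
#!/bin/sh
# Sketch only: hypothetical helpers for steps 3 and 6.

# Append the two S3 credential options to an options variable in an env file.
# Usage: append_s3_opts <env-file> <opts-variable> <access-key> <secret-key>
append_s3_opts() {
    env_file=$1; opts_var=$2; key=$3; secret=$4
    # Extend the existing value of the variable instead of overwriting it.
    printf 'export %s="$%s -Dfs.s3n.awsAccessKeyId=%s -Dfs.s3n.awsSecretAccessKey=%s"\n' \
        "$opts_var" "$opts_var" "$key" "$secret" >> "$env_file"
}

# Tachyon only: point the under file system at the S3 bucket (step 3.2).
# Usage: set_underfs_address <env-file> <bucket-name>
set_underfs_address() {
    env_file=$1; bucket=$2
    printf 'export TACHYON_UNDERFS_ADDRESS=s3n://%s\n' "$bucket" >> "$env_file"
}

# Example calls (paths and values are placeholders):
#   set_underfs_address conf/tachyon-env.sh <bucket-name>
#   append_s3_opts conf/tachyon-env.sh TACHYON_JAVA_OPTS <your-key> <your-secret>
#   append_s3_opts <drill-root>/conf/drill-env.sh DRILL_JAVA_OPTS <your-key> <your-secret>
```

Appending rather than editing the existing export lines keeps the change easy to revert, but it also leaves the secret in clear text in both files, just as the manual steps do.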

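Steps 4 and 5 boil down to two file copies and one line removal. A minimal sketch follows, assuming the directory layout described in those steps; install_s3_jars and allow_jets3t are hypothetical helper names, and the three root directories are passed in as arguments rather than guessed.

```shell
#!/bin/sh
# Sketch only: hypothetical helpers for steps 4 and 5.

# Copy the Tachyon client jar and the jets3t jar into Drill's 3rdparty dir.
# Usage: install_s3_jars <tachyon-root> <unzipped-jets3t-dir> <drill-root>
install_s3_jars() {
    tachyon_root=$1; jets3t_dir=$2; drill_root=$3
    cp "$tachyon_root/clients/client/target/tachyon-client-0.7.0.jar" \
       "$drill_root/jars/3rdparty/" || return 1
    cp "$jets3t_dir/jars/jets3t-0.9.3.jar" \
       "$drill_root/jars/3rdparty/" || return 1
}

# Remove the jets3t line from hadoop-excludes.txt, keeping a .bak copy.
# Usage: allow_jets3t <drill-root>
allow_jets3t() {
    excludes=$1/bin/hadoop-excludes.txt
    cp "$excludes" "$excludes.bak" || return 1
    grep -v 'jets3t' "$excludes.bak" > "$excludes"
    # grep exits 1 when no lines remain; only a status above 1 is an error.
    [ $? -le 1 ]
}

# Example calls (paths are placeholders):
#   install_s3_jars <tachyon-root> ./jets3t-0.9.3 <drill-root>
#   allow_jets3t <drill-root>
```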

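The storage config from step 7 can also be submitted without the admin UI. This is a sketch under one assumption: that the Drill web server on port 8047 accepts a POST of {"name": ..., "config": ...} at /storage/<name>.json. I have not checked that against 1.1.x, so verify it in the REST documentation for your version. ts3_payload and register_ts3 are hypothetical names, and only the parquet and json formats are included to keep the example short.

```shell
#!/bin/sh
# Sketch only: hypothetical helpers for step 7.

# Print the request body for the "ts3" plugin.
# Usage: ts3_payload [tachyon-master-host:port]
ts3_payload() {
    master=${1:-127.0.0.1:19998}
    cat <<EOF
{
  "name": "ts3",
  "config": {
    "type": "file",
    "enabled": true,
    "connection": "tachyon://$master/",
    "workspaces": {
      "root": { "location": "/", "writable": true, "defaultInputFormat": null }
    },
    "formats": {
      "parquet": { "type": "parquet" },
      "json": { "type": "json" }
    }
  }
}
EOF
}

# Assumption: Drill accepts the plugin definition at /storage/ts3.json.
# Usage: register_ts3 [tachyon-master-host:port] [drill-web-host:port]
register_ts3() {
    ts3_payload "$1" | curl -s -X POST -H "Content-Type: application/json" \
        -d @- "http://${2:-localhost:8047}/storage/ts3.json"
}

# Example call against a local Drillbit and Tachyon master:
#   register_ts3 127.0.0.1:19998 localhost:8047
```
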
So, there we almost have it! :)

All input and ideas are welcome! (If someone is doing this already, then
please step forward and share.)

Regards,
 -Stefan
