That is tricky, and from what I understand, you can't avoid that. A Drillbit on one node has no way of knowing that a file another Drillbit sees on its local FS is a duplicate of a file on its own FS.
However, there might be some crafty ways of working around this. Thinking off the top of my head (so take this with a pinch of salt), you could do the following:

1. Have all the nodes run NFS servers to expose the local filesystem directories where your files reside.
2. Mount these on the node where you launch your queries.
3. Write and run a script that walks through these mounted directories to identify duplicate files (use the MD5 checksum plus the filename as a unique ID).
4. Move the files your script identifies as duplicates to a temp directory on that node.
5. Run the query.
6. Move back the files originally moved in step 4.

Not an elegant solution, but it mimics the way a distributed FS would manage files.

Also, ZooKeeper doesn't actually coordinate the files within a cluster, only the Drillbits. The distributed FS, storage plugins, etc. help the Foreman Drillbit allocate files to each of the participating worker Drillbits for processing.

Hope this helps.

-----Original Message-----
From: Matt [mailto:[email protected]]
Sent: Saturday, November 04, 2017 9:53 PM
To: [email protected]
Subject: Drillbits and duplicate files

If multiple Drillbits on different servers are coordinating via Zookeeper, and some files across the servers are duplicates (with identical filenames), will the cluster of distributed Drillbits avoid duplicating data on queries?

I'm specifically interested in aggregating CSV data on multiple servers, but not in HDFS.
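PS: the dedup script in steps 3 and 4 of the workaround above could be sketched roughly like this. A minimal sketch, not tested against a real Drill cluster; the mount points and temp directory below are placeholders you would swap for your own paths.

```python
# Hypothetical sketch: park duplicate files before a Drill query, restore them after.
# MOUNTS and TEMP_DIR are assumed example paths, not from the original post.
import hashlib
import os
import shutil

MOUNTS = ["/mnt/node1", "/mnt/node2", "/mnt/node3"]  # NFS mounts of each node's data dir
TEMP_DIR = "/tmp/drill_dupes"                        # where duplicates are parked

def file_md5(path, chunk_size=1 << 20):
    """MD5 of a file, read in chunks so large CSVs don't exhaust memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def stash_duplicates(mounts=MOUNTS, temp_dir=TEMP_DIR):
    """Steps 3-4: keep the first copy of each (filename, md5) pair and move
    later copies into temp_dir. Returns (original_path, stashed_path) pairs
    so the move can be undone after the query."""
    os.makedirs(temp_dir, exist_ok=True)
    seen = set()
    moved = []
    for mount in mounts:
        for root, _dirs, files in os.walk(mount):
            for name in files:
                path = os.path.join(root, name)
                key = (name, file_md5(path))  # filename + checksum as the unique ID
                if key in seen:
                    dest = os.path.join(temp_dir, "%d_%s" % (len(moved), name))
                    shutil.move(path, dest)
                    moved.append((path, dest))
                else:
                    seen.add(key)
    return moved

def restore(moved):
    """Step 6: move the stashed duplicates back once the query finishes."""
    for original, stashed in moved:
        shutil.move(stashed, original)
```

You would call `stash_duplicates()` right before launching the query and `restore()` on the list it returns afterwards; anything that dies in between leaves the duplicates sitting in the temp directory, so it is worth keeping that list on disk too.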
