That is tricky, and from what I understand, you can't avoid it. A drillbit on 
one node doesn't know that a file the other drillbit sees on its local FS is 
a duplicate of a file on its own FS. 

However, there might be some crafty ways of working around this. 

Thinking off the top of my head (so take this with a pinch of salt), you could 
do the following:

1. Have all the nodes run NFS servers to expose the local filesystem 
directories where your files reside.
2. Mount these on the node where you launch your queries. 
3. Write and run a script that walks through these mounted directories to 
identify duplicate files. (Use the MD5 checksum plus the filename as a unique 
ID.)
4. Move the files your script identifies as duplicates to a temp directory on 
that node.
5. Run the query.
6. Move back the files you moved in step 4.
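
Steps 3 and 4 could be sketched roughly like this (a minimal Python sketch; 
the mount and temp paths are placeholders for your own layout, and you'd want 
to harden it before trusting it with real data):

```python
#!/usr/bin/env python3
# Walk the NFS-mounted directories, flag files whose (filename, MD5) pair
# has already been seen, and move the duplicates aside so the query only
# reads one copy. MOUNT_DIRS and DUP_DIR are placeholder paths.
import hashlib
import shutil
from pathlib import Path

MOUNT_DIRS = [Path("/mnt/node1/data"), Path("/mnt/node2/data")]  # NFS mounts
DUP_DIR = Path("/tmp/drill_dups")  # temp directory for duplicates (step 4)

def md5sum(path: Path) -> str:
    # Hash in chunks so large CSVs don't need to fit in memory.
    h = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def quarantine_duplicates():
    DUP_DIR.mkdir(parents=True, exist_ok=True)
    seen = set()   # unique ID = (filename, MD5 checksum)
    moved = []     # remember originals so step 6 can move them back
    for root in MOUNT_DIRS:
        for path in sorted(root.rglob("*")):
            if not path.is_file():
                continue
            key = (path.name, md5sum(path))
            if key in seen:
                target = DUP_DIR / f"{path.parent.name}__{path.name}"
                shutil.move(str(path), target)
                moved.append((path, target))
            else:
                seen.add(key)
    return moved
```

After the query finishes, iterating over the returned `moved` list in reverse 
and moving each `target` back to its original `path` would cover step 6.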

Not an elegant solution, but it mimics the way a distributed FS would manage 
files. 

Also, ZooKeeper doesn't actually coordinate the files within a cluster, only 
the Drillbits. The distributed FS, storage plugins, etc. help the Foreman 
Drillbit allocate files to each of the participating worker Drillbits for 
processing.

Hope this helps. 

-----Original Message-----
From: Matt [mailto:[email protected]] 
Sent: Saturday, November 04, 2017 9:53 PM
To: [email protected]
Subject: Drillbits and duplicate files

If multiple Drillbits on different servers are coordinating via Zookeeper, and 
some files across the servers are duplicates (with identical filenames), will 
the cluster of distributed Drillbits avoid duplicating data on queries?

I’m specifically interested in aggregating CSV data on multiple servers, but 
not in HDFS.
