Thank you. This kind of summary advice is helpful for getting started.

On 5/22/15, 6:37 PM, "Ted Dunning" <[email protected]> wrote:

>The variation will have less to do with Drill (which can read all these
>options such as EMR resident MapR FS or HDFS or persistent MapR FS or HDFS
>or S3).
>
>The biggest differences will have to do with whether your clusters
>providing storage are permanent or ephemeral. If they are ephemeral, you
>can host the distributed file system on EBS based volumes so that you will
>have an ephemeral but restartable cluster.
>
>So the costs in run time will have to do with startup or restart times and
>the time it takes to pour the data into any new distributed file system.
>If you host permanently in S3 and have Drill read directly from there, you
>have no permanent storage cost for the input data, but will probably have
>slower reads. With a permanent cluster hosting the data, you will have
>higher costs, but likely also higher performance. Copying data from S3 to
>a distributed file system is probably not a great idea since you pay
>roughly the same cost during copy as you would have paid just querying
>directly from S3.
>
>Exactly how these trade-offs pan out requires some careful thought and
>considerable knowledge of your workload.
>
>On Fri, May 22, 2015 at 3:22 PM, Paul Mogren <[email protected]>
>wrote:
>
>> > When running Drill in AWS EMR, can anyone advise as to the advantages
>> >and disadvantages of having Drill access S3 via EMRFS vs. directly?
>>
>> Also, a third option: an actual HDFS not backed by S3
