Re: To EMRFS or not to EMRFS?

Paul Mogren Tue, 26 May 2015 11:16:57 -0700

Thank you. This kind of summary advice is helpful for getting started.




On 5/22/15, 6:37 PM, "Ted Dunning" <[email protected]> wrote:

>The variation will have less to do with Drill itself (which can read all of
>these options: EMR-resident MapR FS or HDFS, persistent MapR FS or HDFS, or
>S3).
>
>The biggest differences will have to do with whether the clusters
>providing storage are permanent or ephemeral.  If they are ephemeral, you
>can host the distributed file system on EBS-based volumes so that you will
>have an ephemeral but restartable cluster.
>
>So the costs in run time will have to do with startup or restart times and
>the time it takes to pour the data into any new distributed file system.
>If you host permanently in S3 and have Drill read directly from there, you
>have no permanent storage cost for the input data, but will probably have
>slower reads.  With a permanent cluster hosting the data, you will have
>higher costs, but likely also higher performance.  Copying data from S3 to
>a distributed file system is probably not a great idea since you pay
>roughly the same cost during copy as you would have paid just querying
>directly from S3.
>
>Exactly how these trade-offs pan out requires some careful thought and
>considerable knowledge of your workload.
>
>
>
>On Fri, May 22, 2015 at 3:22 PM, Paul Mogren <[email protected]>
>wrote:
>
>> > When running Drill in AWS EMR, can anyone advise as to the advantages
>> >and disadvantages of having Drill access S3 via EMRFS vs. directly?
>>
>> Also, a third option: an actual HDFS not backed by S3
>>
>>
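
The point quoted above about copying from S3 into a distributed file system
can be made concrete with a rough break-even calculation. The sketch below is
only an illustration under stated assumptions: the function names are
hypothetical, the copy is assumed to cost about one full S3 scan (as the
quoted message suggests), and storage cost and cluster startup time are
ignored. Real numbers depend on the workload.

```python
def total_hours(n_queries, scan_hours_s3, scan_hours_local, copy_first):
    """Wall-clock hours to run n_queries full scans over the same data.

    Hypothetical model: every query scans the whole data set once.
    """
    if copy_first:
        # Assumption taken from the thread: copying the data out of S3
        # costs roughly the same as one query that reads it from S3.
        return scan_hours_s3 + n_queries * scan_hours_local
    return n_queries * scan_hours_s3


def break_even_queries(scan_hours_s3, scan_hours_local):
    """Number of scans at which copying first matches reading S3 directly."""
    saving_per_query = scan_hours_s3 - scan_hours_local
    if saving_per_query <= 0:
        # Local reads are no faster, so the copy never pays for itself.
        return float("inf")
    return scan_hours_s3 / saving_per_query


if __name__ == "__main__":
    # Made-up figures: a scan takes 1.0 h from S3 and 0.5 h from local storage.
    print(break_even_queries(1.0, 0.5))
    for n in (1, 2, 5):
        print(n, total_hours(n, 1.0, 0.5, True), total_hours(n, 1.0, 0.5, False))
```

Under this model a single pass over the data is always cheaper straight from
S3, and copying only starts to pay off once the same data is scanned several
times on a cluster that stays up long enough to serve those scans.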
