"When consistent view is enabled, Amazon EMR also has better performance when listing Amazon S3 prefixes with over 10,000 objects. In fact, we’ve seen a 5x increase in list performance on prefixes with over 1 million objects. This speed-up is due to using the EMRFS metadata, which is required for consistent view, to make listing large numbers of objects more efficient."
https://blogs.aws.amazon.com/bigdata/post/Tx1WL4KR7SE37YY/Ensuring-Consistency-When-Using-Amazon-S3-and-Amazon-Elastic-MapReduce-for-ETL-W

I'm also realizing that EMRFS may be necessary for Drill to find data that
was written very recently to S3 by another process. The other process also
has to write via EMRFS, not directly to S3, in order to get that benefit.

On 6/18/15, 11:24 AM, "Paul Mogren" <[email protected]> wrote:

>Following up. Ted gave sound advice regarding reading S3 vs HDFS, but
>didn't address EMRFS specifically. Here is what I have learned.
>
>EMRFS is an HDFS emulation layer over S3 storage. What it provides is a
>way to get the data consistency expected by clients of HDFS, but not
>provided directly by S3, by transparently doing things like retries of
>reads and tracking which clients recently wrote which data. I see no
>reason to believe that the performance of Drill over EMRFS would be
>better than Drill directly reading from S3, unless maybe in the context
>of large objects, if there is something different about how they can be
>split for concurrent readers - I have not investigated this.
>
>I don't believe Drill writes to shared storage except by user request or
>configuration. Perhaps EMRFS is helpful to users of Drill features that
>write data, like CTAS or spill-to-DFS.
>
>
>Other potential reasons that some might like to use EMRFS include its
>support for IAM Role security in EC2, and transparent client-side
>encryption of S3 objects. It seems the former is finally coming to
>Jets3t, the library used by Drill to access S3, so hopefully that will
>soon be an option even without requiring EMRFS:
>https://bitbucket.org/jmurty/jets3t/issue/163/provide-support-for-aws-iam-instance-roles
>
>-Paul
>
>
>
>On 5/26/15, 2:15 PM, "Paul Mogren" <[email protected]> wrote:
>
>>Thank you. This kind of summary advice is helpful to getting started.
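[Editor's note: a minimal sketch of what "without IAM role support" implies in practice. Until Jets3t gained instance-role support, Drill's direct S3 access required static credentials in the Hadoop configuration. The property names below are the standard fs.s3n/jets3t ones; the values are placeholders, not real keys.]

<!-- core-site.xml (sketch): static S3 credentials for jets3t-based access -->
<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>YOUR_ACCESS_KEY</value>
</property>
<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>YOUR_SECRET_KEY</value>
</property>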
>>
>>
>>
>>
>>On 5/22/15, 6:37 PM, "Ted Dunning" <[email protected]> wrote:
>>
>>>The variation will have less to do with Drill (which can read all these
>>>options such as EMR resident MapR FS or HDFS or persistent MapR FS or
>>>HDFS or S3).
>>>
>>>The biggest differences will have to do with whether your clusters
>>>providing storage are permanent or ephemeral. If they are ephemeral,
>>>you can host the distributed file system on EBS based volumes so that
>>>you will have an ephemeral but restartable cluster.
>>>
>>>So the costs in run time will have to do with startup or restart times
>>>and the time it takes to pour the data into any new distributed file
>>>system. If you host permanently in S3 and have Drill read directly from
>>>there, you have no permanent storage cost for the input data, but will
>>>probably have slower reads. With a permanent cluster hosting the data,
>>>you will have higher costs, but likely also higher performance. Copying
>>>data from S3 to a distributed file system is probably not a great idea,
>>>since you pay roughly the same cost during the copy as you would have
>>>paid just querying directly from S3.
>>>
>>>Exactly how these trade-offs pan out requires some careful thought and
>>>considerable knowledge of your workload.
>>>
>>>
>>>
>>>On Fri, May 22, 2015 at 3:22 PM, Paul Mogren <[email protected]>
>>>wrote:
>>>
>>>> When running Drill in AWS EMR, can anyone advise as to the
>>>>advantages and disadvantages of having Drill access S3 via EMRFS vs.
>>>>directly?
>>>>
>>>> Also, a third option: an actual HDFS not backed by S3
>>>>
>>
