It sounds like you're running into the known S3 consistency issues. However, I don't know whether EMRFS is supposed to support all of the things that Accumulo requires. I would assume that EMRFS should be bridging the gap from S3 (a blobstore) to the consistent, distributed FileSystem that Accumulo expects. Their summary[1] indicates that consistent listings and read-after-write are solved, which is a big part of the problem. I'm not sure whether you are also supposed to get atomic rename from it.
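
One quick way to see what the EMR filesystem actually gives you is to exercise the rename-and-check pattern that compactions depend on. Below is a rough, untested sketch (the bucket/prefix are placeholders and "RenameProbe" is just a name I made up); it mirrors the fact that Accumulo treats a false return from FileSystem.rename() as a failure, as in the DatafileManager stack trace in your message below.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RenameProbe {
  public static void main(String[] args) throws Exception {
    // Placeholder bucket/prefix -- substitute your own.
    URI root = URI.create("s3://THEBUCKET/rename-probe/");
    FileSystem fs = FileSystem.get(root, new Configuration());

    Path tmp = new Path(root.resolve("A000000.rf_tmp"));
    Path dst = new Path(root.resolve("A000000.rf"));

    // Write a small file, the way a compaction first writes an RFile to *_tmp.
    try (FSDataOutputStream out = fs.create(tmp, true)) {
      out.writeUTF("probe");
    }

    // Accumulo treats a false return here as a failed compaction.
    boolean renamed = fs.rename(tmp, dst);
    System.out.println("rename returned " + renamed);

    // Read-after-rename: the destination must be immediately visible
    // and the temp name gone, or subsequent opens will fail.
    System.out.println("dest exists: " + fs.exists(dst));
    System.out.println("tmp exists:  " + fs.exists(tmp));
  }
}

If rename ever comes back false, or the destination takes a while to show up, that would line up with both of the exceptions you pasted.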

This presentation[2] is a primer I put together earlier this year on cloud storage for BigTables, which may help you understand what's going on. I gave it at a meetup here in MD a couple of months back, but I don't think it was recorded.

[1] https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-fs.html
[2] https://drive.google.com/file/d/1Or1s-X0JjiLM87HKIOWlh3WlkdUQfYH9/view?usp=sharing

On 4/2/20 3:56 PM, Kevin Hobbs wrote:
Accumulo Users,

Is AWS EMR's "EMRFS consistent view" useful or required for Accumulo2 on S3? Has anyone else tried EMR + Accumulo2 on S3?

I have incorporated *most* of the steps in the blog post

https://accumulo.apache.org/blog/2019/09/10/accumulo-S3-notes.html

into an AWS EMR bootstrap action that creates an Accumulo cluster running on emr-6.0.0-beta2. I have not used the hadoop-aws-relocated jar, since the EMR jars are available.
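
For context, the volume layout follows the blog post's pattern, roughly the following (bucket and namenode names are placeholders, s3:// in place of the post's s3a:// since I'm using the EMR jars, and the chooser property names are taken from the post, so worth double-checking against your version):

# WALs stay on HDFS; table data goes to S3 (NAMENODE/THEBUCKET are placeholders).
instance.volumes=hdfs://NAMENODE/accumulo-wal,s3://THEBUCKET/accumulo
general.volume.chooser=org.apache.accumulo.server.fs.PreferredVolumeChooser
general.custom.volume.preferred.default=s3://THEBUCKET/accumulo
general.custom.volume.preferred.logger=hdfs://NAMENODE/accumulo-wal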

I am able to use a GeoMesa snapshot to ingest and retrieve data on the S3 volume. However, I just tried an ingest of about 10 GB, which progressed smoothly for a while until the master's web UI reported "MajC Failed, extent = a<;":

java.io.IOException: Rename s3://THEBUCKET/accumulo/tables/a/default_tablet/A00000ci.rf_tmp to s3://THEBUCKET/accumulo/tables/a/default_tablet/A00000ci.rf returned false
    at org.apache.accumulo.tserver.tablet.DatafileManager.rename(DatafileManager.java:85)
    at org.apache.accumulo.tserver.tablet.DatafileManager.bringMajorCompactionOnline(DatafileManager.java:533)
    at org.apache.accumulo.tserver.tablet.Tablet._majorCompact(Tablet.java:2051)
    at org.apache.accumulo.tserver.tablet.Tablet.majorCompact(Tablet.java:2164)
    at org.apache.accumulo.tserver.tablet.CompactionRunner.run(CompactionRunner.java:37)
    at org.apache.htrace.wrappers.TraceRunnable.run(TraceRunnable.java:57)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
    at java.lang.Thread.run(Thread.java:748)


A bit later it reported:

java.io.FileNotFoundException: No such file or directory 's3://THEBUCKET/accumulo/tables/c/t-0000090/F00000nz.rf'
    at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:808)
    at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.open(S3NativeFileSystem.java:1212)
    at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:902)
    at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.open(EmrFileSystem.java:207)
    at org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$CachableBuilder.lambda$fsPath$0(CachableBlockFile.java:91)
    at org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader.getBCFile(CachableBlockFile.java:172)
    at org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader.getMetaBlock(CachableBlockFile.java:400)
    at org.apache.accumulo.core.file.rfile.RFile$Reader.<init>(RFile.java:1156)
    at org.apache.accumulo.core.file.rfile.RFile$Reader.<init>(RFile.java:1251)
    at org.apache.accumulo.core.file.rfile.RFileOperations.getReader(RFileOperations.java:53)
    at org.apache.accumulo.core.file.rfile.RFileOperations.openReader(RFileOperations.java:68)
    at org.apache.accumulo.core.file.DispatchingFileFactory.openReader(DispatchingFileFactory.java:83)
    at org.apache.accumulo.core.file.FileOperations$ReaderBuilder.build(FileOperations.java:478)
    at org.apache.accumulo.tserver.tablet.Compactor.openMapDataFiles(Compactor.java:299)
    at org.apache.accumulo.tserver.tablet.Compactor.compactLocalityGroup(Compactor.java:344)
    at org.apache.accumulo.tserver.tablet.Compactor.call(Compactor.java:225)
    at org.apache.accumulo.tserver.tablet.Tablet._majorCompact(Tablet.java:2039)
    at org.apache.accumulo.tserver.tablet.Tablet.majorCompact(Tablet.java:2164)
    at org.apache.accumulo.tserver.tablet.CompactionRunner.run(CompactionRunner.java:37)
    at org.apache.htrace.wrappers.TraceRunnable.run(TraceRunnable.java:57)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
    at java.lang.Thread.run(Thread.java:748)


These seem like the same sorts of problems HBase on EMR can have when EMRFS isn't functioning properly.

--Kevin

On 3/3/20 1:57 PM, Jim Hughes wrote:
Hi all,

The next major release of GeoMesa is aimed at supporting Accumulo 2.x. As part of testing, my coworker Kevin and I are trying out Accumulo 2.0 on S3.

Keith's blog post[1] is great. As people have been testing Accumulo 2.0 in AWS, has anyone tried using EMR for the underlying HDFS cluster (and then installing Accumulo via bootstrap actions)? Is there a preferred/suggested deployment strategy?

Cheers,

Jim

1. https://accumulo.apache.org/blog/2019/09/10/accumulo-S3-notes.html
