Since S3 filesystems are not taken into account in FSHDFSUtils#isSameHdfs, we would need to add more code there to avoid the copy overhead.
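One possible direction, as a minimal sketch only (this is not HBase code, and the class and method names below are hypothetical): since S3 filesystems return null from getCanonicalServiceName, a fallback could compare the filesystem URIs directly. For S3 URIs the authority component is the bucket name, so two paths in the same bucket would compare equal.

```java
import java.net.URI;
import java.util.Objects;

// Hypothetical fallback for the isSameHdfs-style check: when a filesystem
// (e.g. S3) has no canonical service name, compare the filesystem URIs
// directly instead of assuming the filesystems differ. Illustrative only.
public class FsUriCompare {

    // Treat two filesystems as "the same" if their URIs share a scheme and
    // authority. For s3:// URIs, the authority is the bucket name.
    public static boolean isSameFs(URI src, URI dest) {
        if (src.getScheme() == null || dest.getScheme() == null) {
            return false;
        }
        return src.getScheme().equalsIgnoreCase(dest.getScheme())
            && Objects.equals(src.getAuthority(), dest.getAuthority());
    }

    public static void main(String[] args) {
        URI bulkload = URI.create("s3://bucket_name/data/bulkload/z3/d/");
        URI hbaseRoot = URI.create("s3://bucket_name/");
        URI otherBucket = URI.create("s3://other_bucket/");

        System.out.println(isSameFs(bulkload, hbaseRoot));   // true: same bucket
        System.out.println(isSameFs(bulkload, otherBucket)); // false: different bucket
    }
}
```

Note one caveat with a scheme-based comparison: 's3://' and 's3n://' would be reported as different filesystems even when they point at the same bucket, so a real patch would need to decide how to treat equivalent schemes.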
Can you log a JIRA with what you discovered? Thanks

On Thu, Jun 21, 2018 at 2:08 PM, Austin Heyne <[email protected]> wrote:

> Hi again,
>
> I've been doing more digging into this and I've found that with the way
> the code is written it's actually impossible. In FSHDFSUtils [1], HBase
> attempts to get the canonical service name from Hadoop. Since we're running
> on EMR, our filesystem is the S3NativeFileSystem (com.amazon), which
> extends, I believe, the NativeS3FileSystem (org.apache). Since the
> NativeS3FileSystem [2] and the S3FileSystem [3] both always return null
> from getCanonicalServiceName, and from testing it appears the
> S3NativeFileSystem does the same, it looks like there is no way to get past
> the check in FSHDFSUtils.isSameHdfs when running on S3 of any kind.
>
> Does anyone know of a workaround for this issue?
>
> Thanks,
> Austin Heyne
>
> [1] https://github.com/apache/hbase/blob/rel/1.4.2/hbase-server/src/main/java/org/apache/hadoop/hbase/util/FSHDFSUtils.java#L120-L123
> [2] https://github.com/apache/hadoop/blob/release-2.8.3-RC0/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3native/NativeS3FileSystem.java#L788-L791
> [3] https://github.com/apache/hadoop/blob/release-2.8.3-RC0/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3/S3FileSystem.java#L403-L406
>
>
> On 06/20/2018 07:07 PM, Austin Heyne wrote:
>
>> Hi everyone,
>>
>> I'm trying to run a bulk load of about 15TB of data sitting in S3 that
>> I've bulk ingested. When I initiate the load, I'm seeing the data get
>> copied down to the workers and then back up to S3, even though the HBase
>> root is the same bucket.
>>
>> Data to be bulk loaded is at s3://bucket_name/data/bulkload/z3/d/ and
>> the HBase root is s3://bucket_name/. I have validation disabled through
>> the "hbase.loadincremental.validate.hfile" = "false" config set in code
>> before I call LoadIncrementalHFiles.doBulkLoad.
>> (Code is available at [1].) The splits have already been generated with
>> the same config that was used during the ingest, so they'll line up. I'm
>> currently running HBase 1.4.2 on AWS EMR. The logs of interest were pulled
>> from a worker on the cluster:
>>
>> """
>> 2018-06-20 22:42:15,888 INFO
>> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16020]
>> s3n.S3NativeFileSystem: Opening
>> 's3n://bucket_name/data/bulkload/z3/d/b9371885084e4060ac157799e5c89b59'
>> for reading
>> 2018-06-20 22:42:16,026 INFO
>> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16020]
>> compress.CodecPool: Got brand-new decompressor [.snappy]
>> 2018-06-20 22:42:16,056 INFO
>> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16020]
>> regionserver.HRegionFileSystem: Bulk-load file
>> s3n://bucket_name/data/bulkload/z3/d/b9371885084e4060ac157799e5c89b59 is
>> on different filesystem than the destination store. Copying file over to
>> destination filesystem.
>> 2018-06-20 22:42:16,109 INFO
>> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16020]
>> s3n.S3NativeFileSystem: Opening
>> 's3n://bucket_name/data/bulkload/z3/d/b9371885084e4060ac157799e5c89b59'
>> for reading
>> 2018-06-20 22:42:17,910 INFO
>> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16020]
>> s3n.MultipartUploadOutputStream: close closed:false
>> s3://bucket_name/data/default/z3_v2/c34a364639dab40857a8d18871a3f705/.tmp/3d69598daa9841f986da2341a5901444
>> 2018-06-20 22:42:17,927 INFO
>> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16020]
>> regionserver.HRegionFileSystem: Copied
>> s3n://bucket_name/data/bulkload/z3/d/b9371885084e4060ac157799e5c89b59 to
>> temporary path on destination filesystem:
>> s3://bucket_name/data/default/z3_v2/c34a364639dab40857a8d18871a3f705/.tmp/3d69598daa9841f986da2341a5901444
>> """
>>
>> I'm seeing the behavior using 's3://' or 's3n://'. Has anyone experienced
>> this or have advice?
>>
>> Thanks,
>> Austin Heyne
>>
>> [1] https://github.com/aheyne/geomesa/blob/b36507f4e999b295ebab8b1fb47a38f53e2a0e93/geomesa-hbase/geomesa-hbase-tools/src/main/scala/org/locationtech/geomesa/hbase/tools/ingest/HBaseBulkLoadCommand.scala#L47
>
> --
> Austin L. Heyne
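For reference, the failure mode described in the thread can be reduced to a small sketch. The stub classes below are simplified stand-ins for the Hadoop/HBase types, not the real API; they only illustrate the guard in FSHDFSUtils.isSameHdfs (lines 120-123 in the linked source), where a null canonical service name on either side makes the filesystems look different, which in turn triggers the copy in HRegionFileSystem.

```java
// Minimal stand-in for the check referenced at [1] above: if either
// filesystem's canonical service name is null, the two are treated as
// different, and the bulk-load file gets copied. Illustrative only.
public class IsSameFsDemo {

    // Stub mimicking FileSystem#getCanonicalServiceName.
    public interface Fs {
        String getCanonicalServiceName();
    }

    public static final Fs S3 = () -> null;                    // S3 filesystems return null
    public static final Fs HDFS = () -> "namenode.example:8020"; // hypothetical HDFS service name

    // Simplified version of the guard in FSHDFSUtils#isSameHdfs.
    public static boolean isSameHdfs(Fs src, Fs dest) {
        String srcName = src.getCanonicalServiceName();
        String destName = dest.getCanonicalServiceName();
        if (srcName == null || destName == null) {
            return false; // null name => assumed to be different filesystems
        }
        return srcName.equals(destName);
    }

    public static void main(String[] args) {
        // Even when source and destination are the same S3 bucket, the
        // check fails because both canonical service names are null.
        System.out.println(isSameHdfs(S3, S3)); // false -> file gets copied
    }
}
```

This is why disabling HFile validation does not help here: the copy decision happens after the null-name check, regardless of where the files actually live.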
