Since S3FileSystem is not taken into account in FSHDFSUtils#isSameHdfs, we
need to add more code to avoid the overhead.
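To illustrate why the check can never pass on S3, here is a minimal, hypothetical sketch of the comparison that FSHDFSUtils#isSameHdfs effectively performs (the class and method names below are illustrative, not the actual HBase code): since every S3-backed FileSystem returns null from getCanonicalServiceName, the equality test short-circuits to "different filesystem" and the file is copied.

```java
// Hypothetical simplified model of the service-name comparison in
// FSHDFSUtils.isSameHdfs; not the real HBase implementation.
public class SameFsCheck {

    // HDFS filesystems report a canonical service name like "ha-hdfs:cluster";
    // S3 filesystems (NativeS3FileSystem, S3FileSystem, EMR's S3NativeFileSystem)
    // return null, so the null branch always fires for S3.
    static boolean isSameFs(String srcService, String destService) {
        if (srcService == null || destService == null) {
            return false; // null service name: treated as a different filesystem
        }
        return srcService.equals(destService);
    }

    public static void main(String[] args) {
        // HDFS-style: both sides report the same service name -> same FS
        System.out.println(isSameFs("ha-hdfs:mycluster", "ha-hdfs:mycluster"));
        // S3-style: both sides return null -> always "different", always copied
        System.out.println(isSameFs(null, null));
    }
}
```

This is why a fix needs S3-aware logic (e.g. comparing URIs or buckets) rather than relying on the canonical service name alone.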

Can you log a JIRA with what you discovered?

Thanks

On Thu, Jun 21, 2018 at 2:08 PM, Austin Heyne <[email protected]> wrote:

> Hi again,
>
> I've been doing more digging into this and I've found that with the way
> the code is written it's actually impossible. In FSHDFSUtils [1] HBase
> attempts to get the canonical service name from Hadoop. Since we're running
> on EMR our filesystem is the S3NativeFileSystem (com.amazon) which extends,
> I believe, the NativeS3FileSystem (org.apache). Since the
> NativeS3FileSystem [2] and S3FileSystem [3] both always return null from
> getCanonicalServiceName, and from testing it appears the S3NativeFileSystem
> does the same, it looks like there is no way to get past the check in
> FSHDFSUtils.isSameHdfs when running on S3 of any kind.
>
> Does anyone know of a workaround for this issue?
>
> Thanks,
> Austin Heyne
>
> [1] https://github.com/apache/hbase/blob/rel/1.4.2/hbase-server/src/main/java/org/apache/hadoop/hbase/util/FSHDFSUtils.java#L120-L123
> [2] https://github.com/apache/hadoop/blob/release-2.8.3-RC0/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3native/NativeS3FileSystem.java#L788-L791
> [3] https://github.com/apache/hadoop/blob/release-2.8.3-RC0/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3/S3FileSystem.java#L403-L406
>
>
> On 06/20/2018 07:07 PM, Austin Heyne wrote:
>
>> Hi everyone,
>>
>> I'm trying to run a bulk load of about 15TB of data sitting in S3 that
>> I've bulk ingested. When I initiate the load I'm seeing the data get copied
>> down to the workers and then back up to S3, even though the HBase root is
>> in the same bucket.
>>
>> Data to be bulk loaded is at s3://bucket_name/data/bulkload/z3/d/ and
>> the HBase root is s3://bucket_name/. I have validation disabled through the
>> "hbase.loadincremental.validate.hfile" = "false" config set in code
>> before I call LoadIncrementalHFiles.doBulkLoad. (Code is available at
>> [1]) The splits have already been generated with the same config that was
>> used during the ingest so they'll line up. I'm currently running HBase
>> 1.4.2 on AWS EMR. The logs of interest were pulled from a worker on the
>> cluster:
>>
>> """
>> 2018-06-20 22:42:15,888 INFO 
>> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16020]
>> s3n.S3NativeFileSystem: Opening 's3n://bucket_name/data/bulklo
>> ad/z3/d/b9371885084e4060ac157799e5c89b59' for reading
>> 2018-06-20 22:42:16,026 INFO 
>> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16020]
>> compress.CodecPool: Got brand-new decompressor [.snappy]
>> 2018-06-20 22:42:16,056 INFO 
>> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16020]
>> regionserver.HRegionFileSystem: Bulk-load file
>> s3n://bucket_name/data/bulkload/z3/d/b9371885084e4060ac157799e5c89b59 is
>> on different filesystem than the destination store. Copying file over to
>> destination filesystem.
>> 2018-06-20 22:42:16,109 INFO 
>> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16020]
>> s3n.S3NativeFileSystem: Opening 's3n://bucket_name/data/bulklo
>> ad/z3/d/b9371885084e4060ac157799e5c89b59' for reading
>> 2018-06-20 22:42:17,910 INFO 
>> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16020]
>> s3n.MultipartUploadOutputStream: close closed:false
>> s3://bucket_name/data/default/z3_v2/c34a364639dab40857a8d188
>> 71a3f705/.tmp/3d69598daa9841f986da2341a5901444
>> 2018-06-20 22:42:17,927 INFO 
>> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16020]
>> regionserver.HRegionFileSystem: Copied s3n://bucket_name/data/bulkloa
>> d/z3/d/b9371885084e4060ac157799e5c89b59 to temporary path on destination
>> filesystem: s3://bucket_name/data/default/z3_v2/c34a364639dab40857a8d188
>> 71a3f705/.tmp/3d69598daa9841f986da2341a5901444
>> """
>>
>> I'm seeing the same behavior with both 's3://' and 's3n://'. Has anyone
>> experienced this or have advice?
>>
>> Thanks,
>> Austin Heyne
>>
>> [1] https://github.com/aheyne/geomesa/blob/b36507f4e999b295ebab8b1fb47a38f53e2a0e93/geomesa-hbase/geomesa-hbase-tools/src/main/scala/org/locationtech/geomesa/hbase/tools/ingest/HBaseBulkLoadCommand.scala#L47
>>
>
> --
> Austin L. Heyne
>
>
