[ https://issues.apache.org/jira/browse/HADOOP-16260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun Suresh updated HADOOP-16260:
---------------------------------
    Description: 
We use distcp to copy entire HDFS clusters to GCS.
 In the process, we hit the following error:
{noformat}
INFO: Encountered status code 410 when accessing URL https://www.googleapis.com/upload/storage/v1/b/app/o?ifGenerationMatch=0&name=analytics/.distcp.tmp.attempt_local1083459072_0001_m_000000_0&uploadType=resumable&upload_id=AEnB2Uq4mZeZxXgs2Mhx0uskNpZ4Cka8pT4aCcd7v6UC4TDQx-h0uEFWoPpdOO4pWEdmaKnhTjxVva5Ow4vXbTe6_JScIU5fsQSaIwNkF3D84DHjtuhKSCU. Delegating to response handler for possible retry.
Apr 14, 2019 5:53:17 AM com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel$UploadOperation call
SEVERE: Exception not convertible into handled response
com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.json.GoogleJsonResponseException: 410 Gone
{
  "code" : 429,
  "errors" : [ {
    "domain" : "usageLimits",
    "message" : "The total number of changes to the object app/folder/.distcp.tmp.attempt_local1083459072_0001_m_000000_0 exceeds the rate limit. Please reduce the rate of create, update, and delete requests.",
    "reason" : "rateLimitExceeded"
  } ],
  "message" : "The total number of changes to the object app/folder/.distcp.tmp.attempt_local1083459072_0001_m_000000_0 exceeds the rate limit. Please reduce the rate of create, update, and delete requests."
}
        at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.json.GoogleJsonResponseException.from(GoogleJsonResponseException.java:150)
        at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:113)
        at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:40)
        at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:432)
        at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:352)
        at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:469)
        at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel$UploadOperation.call(AbstractGoogleAsyncWriteChannel.java:301)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
{noformat}
Looking at the code, each DistCp mapper gets a list of files to copy from the source to the target filesystem. The mapper handles each file in its list sequentially: it first creates/overwrites a temp file (*.distcp.tmp.attempt_local1083459072_0001_m_000000_0*), then copies the source file to the temp file, and finally renames the temp file to the actual target file.
The temp file name (which contains the task attempt ID) is reused for every file in the mapper's batch. GCS enforces a rate limit on the number of mutations per second to any single object, so even though we are actually creating a new file and then renaming it to the final target, GCS counts each copy as yet another change to the same temp object.
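
For reference, this is roughly how the temp path is derived today in RetriableFileCopyCommand (a simplified sketch based on reading the Hadoop 2.x source; exact field and constant names may differ between versions):
{code:java}
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.tools.DistCpConstants;

// Sketch: every file copied by a task attempt shares one temp path,
// derived from the task attempt ID alone.
private Path getTmpFile(Path target, Mapper.Context context) {
  Path targetWorkPath = new Path(context.getConfiguration()
      .get(DistCpConstants.CONF_LABEL_TARGET_WORK_PATH));
  Path root = target.equals(targetWorkPath)
      ? targetWorkPath.getParent() : targetWorkPath;
  // Same name for every file in the batch, so each copy creates,
  // writes, and renames the very same GCS object.
  return new Path(root, ".distcp.tmp." + context.getTaskAttemptID().toString());
}
{code}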

While it is possible to tune the number of maps, the split size, etc., it is hard to derive suitable values for them from any given rate limit.
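
For illustration, these are the knobs available today ({{-m}} and {{-strategy dynamic}} are existing DistCp options; the paths are placeholders). Neither addresses the per-object limit, since each mapper still funnels all of its files through a single temp name:
{noformat}
hadoop distcp -m 20 -strategy dynamic hdfs://nn:8020/src/path gs://bucket/dest/path
{noformat}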

We therefore propose adding a flag that allows the DistCp mapper to use a different temp file *per* file.
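
A minimal sketch of the idea, assuming a hypothetical flag name ({{distcp.copy.tmp.file.per.file}}) and a per-file suffix; both are illustrative, not existing DistCp options:
{code:java}
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.tools.DistCpConstants;

// Hypothetical sketch only: the flag name and suffix scheme below are
// placeholders for discussion, not an existing DistCp option.
private Path getTmpFile(Path target, Mapper.Context context) {
  Path targetWorkPath = new Path(context.getConfiguration()
      .get(DistCpConstants.CONF_LABEL_TARGET_WORK_PATH));
  Path root = target.equals(targetWorkPath)
      ? targetWorkPath.getParent() : targetWorkPath;
  String name = ".distcp.tmp." + context.getTaskAttemptID().toString();
  if (context.getConfiguration()
      .getBoolean("distcp.copy.tmp.file.per.file", false)) {
    // Make the temp object unique per target file, so GCS sees one
    // create plus one rename per object instead of repeated churn
    // on a single shared object.
    name += "." + Integer.toHexString(target.toString().hashCode());
  }
  return new Path(root, name);
}
{code}
With the flag off, behaviour is unchanged; with it on, each file in the batch writes to its own temp object, so the per-object mutation rate never exceeds one create plus one rename.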

Thoughts? (cc [~steve_l], [~benoyantony])

> Allow Distcp to create a new tempTarget file per File
> -----------------------------------------------------
>
>                 Key: HADOOP-16260
>                 URL: https://issues.apache.org/jira/browse/HADOOP-16260
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 2.9.2
>            Reporter: Arun Suresh
>            Priority: Major
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org

Reply via email to