[jira] [Comment Edited] (HIVE-16295) Add support for using Hadoop's S3A OutputCommitter

2018-04-26 Thread Aaron Fabbri (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16455520#comment-16455520
 ] 

Aaron Fabbri edited comment on HIVE-16295 at 4/26/18 11:08 PM:
---

This is a really cool prototype [~stakiar], thank you for doing this. I don't 
have much Hive knowledge but will try to spend some more time looking at the 
code.  I'm also happy to work w/ [~ste...@apache.org] on stabilizing the 
_SUCCESS file manifest (which enumerates the files committed) if that works for 
your dynamic partitioning problem.

edit: need more coffee.


was (Author: fabbri):
This is a really cool prototype [~stakiar], thank you for doing this. I don't 
have much Hive knowledge but will try to spend some more time looking at the 
code.  I'm also happy to work w/ [~ste...@apache.org] on stabilizing the 
_SUCCESS file manifest (which enumerates the uploaded-but-not-completed 
multipart uploads to S3) if that works for your dynamic partitioning problem.

> Add support for using Hadoop's S3A OutputCommitter
> --
>
> Key: HIVE-16295
> URL: https://issues.apache.org/jira/browse/HIVE-16295
> Project: Hive
> Issue Type: Sub-task
> Reporter: Sahil Takiar
> Assignee: Sahil Takiar
> Priority: Major
> Attachments: HIVE-16295.1.WIP.patch, HIVE-16295.2.WIP.patch
>
>
> Hive doesn't have integration with Hadoop's {{OutputCommitter}}; it uses a 
> {{NullOutputCommitter}} and its own commit logic spread across 
> {{FileSinkOperator}}, {{MoveTask}}, and {{Hive}}.
> The Hadoop community is building an {{OutputCommitter}} that integrates with 
> S3Guard and does a safe, coordinated commit of data on S3 inside individual 
> tasks (HADOOP-13786). If Hive can integrate with this new {{OutputCommitter}}, 
> there would be a lot of benefits to Hive-on-S3:
> * Data is only written once; directly committing data at a task level means 
> no renames are necessary
> * The commit is done safely, in a coordinated manner; duplicate tasks (from 
> task retries or speculative execution) should not step on each other



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-16295) Add support for using Hadoop's S3A OutputCommitter

2018-04-26 Thread Aaron Fabbri (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16455520#comment-16455520
 ] 

Aaron Fabbri commented on HIVE-16295:
-

This is a really cool prototype [~stakiar], thank you for doing this. I don't 
have much Hive knowledge but will try to spend some more time looking at the 
code.  I'm also happy to work w/ [~ste...@apache.org] on stabilizing the 
_SUCCESS file manifest (which enumerates the uploaded-but-not-completed 
multipart uploads to S3) if that works for your dynamic partitioning problem.

> Add support for using Hadoop's S3A OutputCommitter
> --
>
> Key: HIVE-16295
> URL: https://issues.apache.org/jira/browse/HIVE-16295
> Project: Hive
> Issue Type: Sub-task
> Reporter: Sahil Takiar
> Assignee: Sahil Takiar
> Priority: Major
> Attachments: HIVE-16295.1.WIP.patch, HIVE-16295.2.WIP.patch
>
>
> Hive doesn't have integration with Hadoop's {{OutputCommitter}}; it uses a 
> {{NullOutputCommitter}} and its own commit logic spread across 
> {{FileSinkOperator}}, {{MoveTask}}, and {{Hive}}.
> The Hadoop community is building an {{OutputCommitter}} that integrates with 
> S3Guard and does a safe, coordinated commit of data on S3 inside individual 
> tasks (HADOOP-13786). If Hive can integrate with this new {{OutputCommitter}}, 
> there would be a lot of benefits to Hive-on-S3:
> * Data is only written once; directly committing data at a task level means 
> no renames are necessary
> * The commit is done safely, in a coordinated manner; duplicate tasks (from 
> task retries or speculative execution) should not step on each other





[jira] [Commented] (HIVE-16295) Add support for using Hadoop's OutputCommitter

2017-11-30 Thread Aaron Fabbri (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16273948#comment-16273948
 ] 

Aaron Fabbri commented on HIVE-16295:
-

Just FYI for watchers: the S3 Output Committer has been merged to trunk in 
Hadoop Common (HADOOP-13786).

> Add support for using Hadoop's OutputCommitter
> --
>
> Key: HIVE-16295
> URL: https://issues.apache.org/jira/browse/HIVE-16295
> Project: Hive
> Issue Type: Sub-task
> Reporter: Sahil Takiar
> Assignee: Sahil Takiar
>
> Hive doesn't have integration with Hadoop's {{OutputCommitter}}; it uses a 
> {{NullOutputCommitter}} and its own commit logic spread across 
> {{FileSinkOperator}}, {{MoveTask}}, and {{Hive}}.
> The Hadoop community is building an {{OutputCommitter}} that integrates with 
> S3Guard and does a safe, coordinated commit of data on S3 inside individual 
> tasks (HADOOP-13786). If Hive can integrate with this new {{OutputCommitter}}, 
> there would be a lot of benefits to Hive-on-S3:
> * Data is only written once; directly committing data at a task level means 
> no renames are necessary
> * The commit is done safely, in a coordinated manner; duplicate tasks (from 
> task retries or speculative execution) should not step on each other



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-13778) DROP TABLE PURGE on S3A table with too many files does not delete the files

2016-06-07 Thread Aaron Fabbri (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-13778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15319577#comment-15319577
 ] 

Aaron Fabbri commented on HIVE-13778:
-

Thanks. You could also resolve it as "duplicated by".

> DROP TABLE PURGE on S3A table with too many files does not delete the files
> ---
>
> Key: HIVE-13778
> URL: https://issues.apache.org/jira/browse/HIVE-13778
> Project: Hive
> Issue Type: Bug
> Components: Metastore
> Reporter: Sailesh Mukil
> Priority: Critical
> Labels: metastore, s3
>
> I've noticed that when we do a DROP TABLE tablename PURGE on a table on S3A 
> that has many files, the files never get deleted. However, the Hive metastore 
> logs do say that the path was deleted:
> "Not moving [path] to trash"
> "Deleted the diretory [path]"
> I initially thought that this was due to the eventually consistent nature of 
> S3 for deletes; however, a week later, the files still exist.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-13778) DROP TABLE PURGE on S3A table with too many files does not delete the files

2016-05-26 Thread Aaron Fabbri (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-13778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15303387#comment-15303387
 ] 

Aaron Fabbri commented on HIVE-13778:
-

[~sailesh] can you assign this to me please?  I will resolve it.

> DROP TABLE PURGE on S3A table with too many files does not delete the files
> ---
>
> Key: HIVE-13778
> URL: https://issues.apache.org/jira/browse/HIVE-13778
> Project: Hive
> Issue Type: Bug
> Components: Metastore
> Reporter: Sailesh Mukil
> Priority: Critical
> Labels: metastore, s3
>
> I've noticed that when we do a DROP TABLE tablename PURGE on a table on S3A 
> that has many files, the files never get deleted. However, the Hive metastore 
> logs do say that the path was deleted:
> "Not moving [path] to trash"
> "Deleted the diretory [path]"
> I initially thought that this was due to the eventually consistent nature of 
> S3 for deletes; however, a week later, the files still exist.





[jira] [Comment Edited] (HIVE-13778) DROP TABLE PURGE on S3A table with too many files does not delete the files

2016-05-26 Thread Aaron Fabbri (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-13778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15303380#comment-15303380
 ] 

Aaron Fabbri edited comment on HIVE-13778 at 5/27/16 3:01 AM:
--

Note this is the same as 
[IMPALA-3558|https://issues.cloudera.org/projects/IMPALA/issues/IMPALA-3558].  
See that issue for my explanation that this is expected behavior.


was (Author: fabbri):
Note this is the same as 
[IMPALA-3558|https://issues.cloudera.org/projects/IMPALA/issues/IMPALA-3558]

> DROP TABLE PURGE on S3A table with too many files does not delete the files
> ---
>
> Key: HIVE-13778
> URL: https://issues.apache.org/jira/browse/HIVE-13778
> Project: Hive
> Issue Type: Bug
> Components: Metastore
> Reporter: Sailesh Mukil
> Priority: Critical
> Labels: metastore, s3
>
> I've noticed that when we do a DROP TABLE tablename PURGE on a table on S3A 
> that has many files, the files never get deleted. However, the Hive metastore 
> logs do say that the path was deleted:
> "Not moving [path] to trash"
> "Deleted the diretory [path]"
> I initially thought that this was due to the eventually consistent nature of 
> S3 for deletes; however, a week later, the files still exist.





[jira] [Commented] (HIVE-13778) DROP TABLE PURGE on S3A table with too many files does not delete the files

2016-05-26 Thread Aaron Fabbri (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-13778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15303380#comment-15303380
 ] 

Aaron Fabbri commented on HIVE-13778:
-

Note this is the same as 
[IMPALA-3558|https://issues.cloudera.org/projects/IMPALA/issues/IMPALA-3558]

> DROP TABLE PURGE on S3A table with too many files does not delete the files
> ---
>
> Key: HIVE-13778
> URL: https://issues.apache.org/jira/browse/HIVE-13778
> Project: Hive
> Issue Type: Bug
> Components: Metastore
> Reporter: Sailesh Mukil
> Priority: Critical
> Labels: metastore, s3
>
> I've noticed that when we do a DROP TABLE tablename PURGE on a table on S3A 
> that has many files, the files never get deleted. However, the Hive metastore 
> logs do say that the path was deleted:
> "Not moving [path] to trash"
> "Deleted the diretory [path]"
> I initially thought that this was due to the eventually consistent nature of 
> S3 for deletes; however, a week later, the files still exist.





[jira] [Commented] (HIVE-13778) DROP TABLE PURGE on S3A table with too many files does not delete the files

2016-05-20 Thread Aaron Fabbri (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-13778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15294575#comment-15294575
 ] 

Aaron Fabbri commented on HIVE-13778:
-

Thanks for the details [~sailesh].  Namenode should not be involved with s3a 
paths.

Can you re-run with some s3a logging on?  i.e. org.apache.hadoop.fs.s3a=DEBUG
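
For reference, a logger setting like the one above is usually written as a line in the logging configuration. This is only a minimal sketch, assuming the process being debugged loads a log4j 1.x style properties file; which file that is depends on the deployment and is not stated in this thread, and log4j 2 properties files use a different syntax.

```properties
# Assumption: log4j 1.x properties syntax. Add this to whichever log4j
# properties file the process under test actually loads.
# Enables DEBUG output for the S3A filesystem client package only.
log4j.logger.org.apache.hadoop.fs.s3a=DEBUG
```

Scoping the level to the org.apache.hadoop.fs.s3a package keeps the rest of the log at its existing verbosity, so the S3A delete calls are easier to spot.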

> DROP TABLE PURGE on S3A table with too many files does not delete the files
> ---
>
> Key: HIVE-13778
> URL: https://issues.apache.org/jira/browse/HIVE-13778
> Project: Hive
> Issue Type: Bug
> Components: Metastore
> Reporter: Sailesh Mukil
> Priority: Critical
> Labels: metastore, s3
>
> I've noticed that when we do a DROP TABLE tablename PURGE on a table on S3A 
> that has many files, the files never get deleted. However, the Hive metastore 
> logs do say that the path was deleted:
> "Not moving [path] to trash"
> "Deleted the diretory [path]"
> I initially thought that this was due to the eventually consistent nature of 
> S3 for deletes; however, a week later, the files still exist.


