[jira] [Commented] (HIVE-14271) FileSinkOperator should not rename files to final paths when S3 is the default destination

2018-09-28 Thread t oo (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-14271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631691#comment-16631691
 ] 

t oo commented on HIVE-14271:
-

Is this still relevant?

> FileSinkOperator should not rename files to final paths when S3 is the
> default destination
> ----------------------------------------------------------------------
>
>                 Key: HIVE-14271
>                 URL: https://issues.apache.org/jira/browse/HIVE-14271
>             Project: Hive
>          Issue Type: Sub-task
>            Reporter: Sergio Peña
>            Assignee: Sergio Peña
>            Priority: Major
>
> FileSinkOperator renames {{outPaths -> finalPaths}} when it finishes
> writing all rows to a temporary path. The problem is that S3 does not support
> renaming.
> Two options can be considered:
> a. Use a copy operation instead. After FileSinkOperator writes all rows to
> outPaths, the commit method will do a copy() call instead of move().
> b. Write row by row directly to the S3 path (see HIVE-1620). This may
> perform better, but we must take care of cleanup when a write fails.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-14271) FileSinkOperator should not rename files to final paths when S3 is the default destination

2016-11-14 Thread Sahil Takiar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15665533#comment-15665533
 ] 

Sahil Takiar commented on HIVE-14271:
-

[~spena] I looked more into what we discussed this morning. You are correct:
there are two places where the {{FileSinkOperator}} renames files. The
first is in the {{commit(FileSystem)}} method, which is invoked
inside each map task. The second is in the {{jobCloseOp(boolean)}} method,
which is invoked inside HiveServer2.

I think we can break this work down into two JIRAs:

1: Eliminate the rename that occurs in HiveServer2
2: Eliminate the rename that occurs inside each map task

When running on S3, I can't think of a reason why either would be necessary. I 
think the first priority will be to eliminate the rename that occurs in 
HiveServer2 (as you said this morning).



[jira] [Commented] (HIVE-14271) FileSinkOperator should not rename files to final paths when S3 is the default destination

2016-11-09 Thread Sahil Takiar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15651862#comment-15651862
 ] 

Sahil Takiar commented on HIVE-14271:
-

Yes, I agree with Steve. Sergio summarized it well. This sounds like a
reasonable change. [~spena] can you re-open this JIRA?



[jira] [Commented] (HIVE-14271) FileSinkOperator should not rename files to final paths when S3 is the default destination

2016-11-09 Thread JIRA

[ 
https://issues.apache.org/jira/browse/HIVE-14271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15651362#comment-15651362
 ] 

Sergio Peña commented on HIVE-14271:


I agree with approach #2. If outPath and finalPath are both under the scratch
directory, then we can just write directly to finalPath and avoid the rename.
[~ste...@apache.org] There is another patch that does S3-to-S3 renames in
parallel to speed up the COPY operations (see HIVE-15093).



[jira] [Commented] (HIVE-14271) FileSinkOperator should not rename files to final paths when S3 is the default destination

2016-11-09 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15651284#comment-15651284
 ] 

Steve Loughran commented on HIVE-14271:
---

Strategy 2 will eliminate one rename, which is good given that rename costs
are O(data). However, there's still one rename to go.

There's still the overhead of copying the data from scratch to final. This
shouldn't be done in client-side code, as object store COPY operations
happen server side; they're what rename() uses. If renames of files in a
directory are issued in parallel, then the rename can be sped up
significantly; this works precisely because you can hold open the HTTP
connections for the copy calls without much cost in network traffic.
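
A rough sketch of the "issue the per-file renames in parallel" idea, using an in-memory map as a stand-in for the bucket. The class {{ParallelRenameSketch}}, its methods, and the path names are hypothetical and are not the S3A or Hive API; each {{renameOne}} call stands in for one server-side COPY plus a DELETE.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical in-memory bucket; not the S3A API. */
final class ParallelRenameSketch {
    private final Map<String, String> objects = new ConcurrentHashMap<>();

    void put(String key, String data) { objects.put(key, data); }
    boolean exists(String key) { return objects.containsKey(key); }

    /** Stand-in for one server-side COPY followed by a DELETE of the source. */
    private void renameOne(String src, String dst) {
        String data = objects.get(src);
        if (data == null) throw new IllegalStateException("missing source: " + src);
        objects.put(dst, data);
        objects.remove(src);
    }

    /** Renames every object under srcDir to dstDir, one task per object. */
    int renameAll(String srcDir, String dstDir, int threads) {
        // Snapshot the source keys first so the listing is not affected by the copies.
        List<String> sources = new ArrayList<>();
        for (String key : objects.keySet()) {
            if (key.startsWith(srcDir)) sources.add(key);
        }
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<CompletableFuture<Void>> pending = new ArrayList<>();
            for (String src : sources) {
                String dst = dstDir + src.substring(srcDir.length());
                pending.add(CompletableFuture.runAsync(() -> renameOne(src, dst), pool));
            }
            // join() rethrows a failure from any of the per-file renames.
            CompletableFuture.allOf(pending.toArray(new CompletableFuture[0])).join();
            return pending.size();
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        ParallelRenameSketch store = new ParallelRenameSketch();
        for (int i = 0; i < 8; i++) store.put("scratch/final/part-" + i, "data" + i);
        System.out.println("renamed: " + store.renameAll("scratch/final/", "warehouse/t/", 4));
    }
}
```

Parallelism shortens the wall-clock time but does not make the directory rename atomic: a failure part-way through still leaves some files at the source and some at the destination.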



[jira] [Commented] (HIVE-14271) FileSinkOperator should not rename files to final paths when S3 is the default destination

2016-11-08 Thread Sahil Takiar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649356#comment-15649356
 ] 

Sahil Takiar commented on HIVE-14271:
-

We might want to consider re-opening this ticket, but changing the original
approach. To clarify, right now the FileSinkOperator (FSOP) always writes
all its data to a scratch directory. The FSOP first writes to {{outPaths}}
and then renames the data to {{finalPaths}}, but all the data is still under
the scratch directory. No data is exposed to users or future ETL jobs yet.

There are two different ways to modify this to improve performance on S3:

1: FSOP implements the "direct output committer" strategy (similar to
HIVE-1620): all data is written directly to the final table location, and no
data is written to a staging file or to the scratch directory. Hive's MoveTask
(which runs in HiveServer2) does nothing.

2: FSOP still writes data to the scratch directory, but it writes to
{{finalPaths}} instead of {{outPaths}} (remember that both of these
directories are under the scratch directory), so no temp file is written.
Hive's MoveTask (which runs inside HiveServer2) then copies the data from the
scratch directory to the final table location. This improves performance since
it avoids copying data from {{outPaths}} to {{finalPaths}}.

For reasons stated in earlier comments, there are a number of issues with
approach 1. Approach 2 avoids them and should still improve performance
significantly.
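
As a back-of-the-envelope illustration of the difference, the sketch below counts the bytes copied inside the object store after the initial write, assuming every rename or move on S3 costs one full copy of the data. The class and enum names are made up for this example and do not exist in Hive.

```java
/** Hypothetical cost model; assumes each rename/move on S3 is a full copy of the data. */
final class CommitStrategyCost {
    enum Strategy { CURRENT, DIRECT_TO_TABLE, WRITE_TO_FINAL_PATHS }

    /** Bytes copied after the initial write for a job producing outputBytes of data. */
    static long bytesCopied(Strategy strategy, long outputBytes) {
        switch (strategy) {
            case CURRENT:
                // outPaths -> finalPaths in the task, then scratch -> table in MoveTask
                return 2 * outputBytes;
            case WRITE_TO_FINAL_PATHS:
                // approach 2: only scratch -> table in MoveTask
                return outputBytes;
            case DIRECT_TO_TABLE:
                // approach 1: nothing is copied; MoveTask does nothing
                return 0;
            default:
                throw new AssertionError(strategy);
        }
    }

    public static void main(String[] args) {
        long oneGiB = 1L << 30;
        for (Strategy s : Strategy.values()) {
            System.out.println(s + ": " + bytesCopied(s, oneGiB) + " bytes copied");
        }
    }
}
```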



[jira] [Commented] (HIVE-14271) FileSinkOperator should not rename files to final paths when S3 is the default destination

2016-10-23 Thread Sahil Takiar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15600134#comment-15600134
 ] 

Sahil Takiar commented on HIVE-14271:
-

Thanks [~ste...@apache.org], that does make sense.



[jira] [Commented] (HIVE-14271) FileSinkOperator should not rename files to final paths when S3 is the default destination

2016-10-20 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15593138#comment-15593138
 ] 

Steve Loughran commented on HIVE-14271:
---

One oddity of last-writer-wins is this scenario:

# executor 1 starts working on part-001
# executor 2 also starts working on it and opens a stream to part-001
# executor 2 finishes; its work becomes visible
# whatever was waiting for part-001 to be ready sets off
# executor 1 finishes and overwrites the existing part-001

That needs to be avoided.
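
The five steps can be replayed deterministically against an in-memory map standing in for the bucket. This is only a sketch: {{LastWriterWinsDemo}}, the key name, and the attempt payloads are invented for illustration, and the two attempts are given different contents so the overwrite is visible.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical replay of the scenario; not Hive or S3A code. */
final class LastWriterWinsDemo {
    /** Key -> content; a put on an existing key silently replaces it. */
    static final class Bucket {
        private final Map<String, String> objects = new HashMap<>();
        void put(String key, String content) { objects.put(key, content); }
        String get(String key) { return objects.get(key); }
    }

    /** Returns { what the downstream consumer read, what is stored at the end }. */
    static String[] replay() {
        Bucket bucket = new Bucket();
        String key = "table/part-001";
        // Steps 1 and 2: both attempts are running; nothing is visible until a stream closes.
        String attempt1 = "rows-from-attempt-1";
        String attempt2 = "rows-from-attempt-2";
        // Step 3: executor 2 finishes first and its object becomes visible.
        bucket.put(key, attempt2);
        // Step 4: the consumer that was waiting for part-001 reads it.
        String seenByConsumer = bucket.get(key);
        // Step 5: executor 1 finishes and replaces the object the consumer already used.
        bucket.put(key, attempt1);
        return new String[] { seenByConsumer, bucket.get(key) };
    }

    public static void main(String[] args) {
        String[] result = replay();
        System.out.println("consumer read: " + result[0]);
        System.out.println("stored now:    " + result[1]);
    }
}
```

Even if both attempts would produce identical rows, the consumer still reads an object that is later replaced underneath it.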



[jira] [Commented] (HIVE-14271) FileSinkOperator should not rename files to final paths when S3 is the default destination

2016-10-20 Thread Sahil Takiar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15593020#comment-15593020
 ] 

Sahil Takiar commented on HIVE-14271:
-

[~cnauroth], we were actually thinking of implementing a "direct output 
committer" strategy for Hive (it would be optional of course). Any chance you 
could expand some more on what the drawbacks of this approach would be?

For the issue reported in SPARK-10063, I think you should be able to add a 
config option that says the file is only closed if the Task was successful.

I know there are other concerns with things like speculative execution and task
retries, but Hive may be able to overcome those by making sure each attempt of
a task writes to the same file on S3. Since S3 follows a last-writer-wins
approach, and each task attempt is idempotent, there should be no data issues
(a similar approach was taken in HIVE-1620).

Thoughts?
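
To make "each task attempt writes to the same file" concrete, here is a small sketch that derives the object key from the task ID and ignores the attempt number. The {{<task>_<attempt>}} naming and the {{AttemptIndependentKey}} helper are assumptions made for this example, not a description of how Hive names its output files.

```java
/** Hypothetical helper; the "<task>_<attempt>" ID format is an assumption. */
final class AttemptIndependentKey {
    /** Drops a trailing "_<attempt>" suffix so every attempt of a task maps to one key. */
    static String keyFor(String tableDir, String taskAttemptId) {
        int idx = taskAttemptId.lastIndexOf('_');
        String taskId = idx < 0 ? taskAttemptId : taskAttemptId.substring(0, idx);
        return tableDir + "/" + taskId;
    }

    public static void main(String[] args) {
        // Two attempts of the same task resolve to the same object key.
        System.out.println(keyFor("warehouse/t", "000003_0"));
        System.out.println(keyFor("warehouse/t", "000003_1"));
    }
}
```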



[jira] [Commented] (HIVE-14271) FileSinkOperator should not rename files to final paths when S3 is the default destination

2016-07-22 Thread Thomas Poepping (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15389878#comment-15389878
 ] 

Thomas Poepping commented on HIVE-14271:


If we have [HIVE-14270|https://issues.apache.org/jira/browse/HIVE-14270], then
it seems like only the first option will be necessary, as all temporary paths
will be on HDFS. The "rename" can be changed to a move or copy, giving us only
one operation against S3, rather than many. This also avoids the potential
downsides Chris describes with direct output committing.

> FileSinkOperator should not rename files to final paths when S3 is the
> default destination
> ----------------------------------------------------------------------
>
>                 Key: HIVE-14271
>                 URL: https://issues.apache.org/jira/browse/HIVE-14271
>             Project: Hive
>          Issue Type: Sub-task
>            Reporter: Sergio Peña
>            Assignee: Abdullah Yousufi
>
> FileSinkOperator renames {{outPaths -> finalPaths}} when it finishes
> writing all rows to a temporary path. The problem is that S3 does not support
> renaming.
> Two options can be considered:
> a. Use a copy operation instead. After FileSinkOperator writes all rows to
> outPaths, the commit method will do a copy() call instead of move().
> b. Write row by row directly to the S3 path (see HIVE-1620). This may
> perform better, but we must take care of cleanup when a write fails.





[jira] [Commented] (HIVE-14271) FileSinkOperator should not rename files to final paths when S3 is the default destination

2016-07-22 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15389835#comment-15389835
 ] 

Chris Nauroth commented on HIVE-14271:
--

If I understand correctly, then approach b) sounds like the "direct output
committer" strategy that has been discussed in a few other contexts. Please be
aware that this is unsafe in the presence of certain kinds of network
partitions. It might be a rare case, but the consequences are disastrous:
data loss or corruption. For example, Spark highly discourages a direct write
strategy. (See SPARK-10063.)



[jira] [Commented] (HIVE-14271) FileSinkOperator should not rename files to final paths when S3 is the default destination

2016-07-21 Thread Abdullah Yousufi (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15388230#comment-15388230
 ] 

Abdullah Yousufi commented on HIVE-14271:
-

Agreed. I'll upload a patch for the second approach shortly.



[jira] [Commented] (HIVE-14271) FileSinkOperator should not rename files to final paths when S3 is the default destination

2016-07-21 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15388208#comment-15388208
 ] 

Steve Loughran commented on HIVE-14271:
---

Given that S3 rename is emulated by a recursive copy() + delete(), it's not
clear that a copy() operation will provide any performance benefit, and it
would still have the failure conditions of a non-atomic operation.
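
A minimal sketch of what "rename emulated by copy() + delete()" amounts to, with an in-memory sorted map standing in for the bucket. {{FakeObjectStore}} and its methods are hypothetical and only mimic the shape of the operation; they are not the S3A implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Hypothetical in-memory bucket: flat keys, no real directories. Not the S3A API. */
final class FakeObjectStore {
    private final TreeMap<String, byte[]> objects = new TreeMap<>();
    long bytesCopied = 0;

    void put(String key, byte[] data) { objects.put(key, data.clone()); }
    boolean exists(String key) { return objects.containsKey(key); }

    /** Lists all keys that start with the given prefix. */
    List<String> list(String prefix) {
        List<String> keys = new ArrayList<>();
        for (String key : objects.tailMap(prefix, true).keySet()) {
            if (!key.startsWith(prefix)) break;
            keys.add(key);
        }
        return keys;
    }

    /** Copies one object; the cost is proportional to its size. */
    void copy(String src, String dst) {
        byte[] data = objects.get(src);
        if (data == null) throw new IllegalArgumentException("no such key: " + src);
        objects.put(dst, data.clone());
        bytesCopied += data.length;
    }

    void delete(String key) { objects.remove(key); }

    /** "Renames" a directory by copying every object under it, then deleting the sources. */
    void rename(String srcPrefix, String dstPrefix) {
        List<String> keys = list(srcPrefix);
        for (String key : keys) {
            copy(key, dstPrefix + key.substring(srcPrefix.length()));
        }
        // A failure before or during this loop leaves data under both prefixes.
        for (String key : keys) {
            delete(key);
        }
    }

    public static void main(String[] args) {
        FakeObjectStore store = new FakeObjectStore();
        store.put("scratch/_tmp/part-000", new byte[10]);
        store.put("scratch/_tmp/part-001", new byte[20]);
        store.rename("scratch/_tmp/", "scratch/final/");
        System.out.println("bytes copied: " + store.bytesCopied);
    }
}
```

In this model, replacing the rename with an explicit copy() performs the same per-object copies, which is why it saves nothing and is no more atomic.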
