[jira] [Commented] (HIVE-14535) add micromanaged tables to Hive (metastore keeps track of the files)

2016-10-23 Thread Sahil Takiar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15600119#comment-15600119
 ] 

Sahil Takiar commented on HIVE-14535:
-

Thanks [~gopalv] and [~sershe], that helps a ton!

> add micromanaged tables to Hive (metastore keeps track of the files)
> 
>
> Key: HIVE-14535
> URL: https://issues.apache.org/jira/browse/HIVE-14535
> Project: Hive
>  Issue Type: Improvement
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
>
> Design doc: 
> https://docs.google.com/document/d/1b3t1RywfyRb73-cdvkEzJUyOiekWwkMHdiQ-42zCllY
> Feel free to comment.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-14535) add micromanaged tables to Hive (metastore keeps track of the files)

2016-10-20 Thread Gopal V (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15593182#comment-15593182
 ] 

Gopal V commented on HIVE-14535:


bq.  Was Hive modified to force each task attempt to write to the same file?

No, the file name choice was a product of Hive bucketing. Because of the 
write-once, rename-twice sequence (_tmp -> task dir, task dir -> table dir), 
this was not a problem until someone tried to write directly to the table.
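
The write-once, rename-twice sequence described above can be sketched roughly as follows. This is only an illustration on a local filesystem, not Hive's actual code: the directory names, the helper `write_then_rename_twice`, and the file name `000000_0` are assumptions made for the example.

```python
import os
import tempfile


def write_then_rename_twice(root, bucket_file, data):
    """Write under _tmp, rename into a task dir, then rename into the table dir.

    All directory names here are hypothetical stand-ins for the real layout.
    """
    tmp_dir = os.path.join(root, "_tmp")
    task_dir = os.path.join(root, "task")
    table_dir = os.path.join(root, "table")
    for d in (tmp_dir, task_dir, table_dir):
        os.makedirs(d, exist_ok=True)

    tmp_path = os.path.join(tmp_dir, bucket_file)
    with open(tmp_path, "wb") as f:       # write once
        f.write(data)

    task_path = os.path.join(task_dir, bucket_file)
    os.replace(tmp_path, task_path)       # rename 1: _tmp -> task dir

    final_path = os.path.join(table_dir, bucket_file)
    os.replace(task_path, final_path)     # rename 2: task dir -> table dir
    return final_path


root = tempfile.mkdtemp()
final_path = write_then_rename_twice(root, "000000_0", b"row1\n")
```

Because the file only appears under the table directory after the second rename, a half-written file never shows up there; that protection is what a direct write gives up.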

bq.  In that case what was the exact issue with checksum-safety?

The writers can't "win" till they have consumed the last byte of their shuffle, 
which is the point where one of them gets to find out they had corrupted data 
(because the checksum does not match).



[jira] [Commented] (HIVE-14535) add micromanaged tables to Hive (metastore keeps track of the files)

2016-10-20 Thread Sahil Takiar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15593127#comment-15593127
 ] 

Sahil Takiar commented on HIVE-14535:
-

Thanks a ton [~gopalv]! That definitely helps a lot; we don't want to go down 
the direct write approach if users have already hit issues with it. Could 
you expand a little more on your comment about writing to the same file name? 
Was Hive modified to force each task attempt to write to the same file? In that 
case what was the exact issue with checksum-safety? Was one of the writes 
rejected?

Also, I'd like to understand better how it can lead to data loss. Wouldn't the 
query fail if something like this happens?



[jira] [Commented] (HIVE-14535) add micromanaged tables to Hive (metastore keeps track of the files)

2016-10-20 Thread Sergey Shelukhin (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15593118#comment-15593118
 ] 

Sergey Shelukhin commented on HIVE-14535:
-

Also, as a side note, there's work under way to get rid of new APIs and tables 
that I added and reuse existing ACID infrastructure (without requirements like 
ORC/buckets/etc.) for MM tables.



[jira] [Commented] (HIVE-14535) add micromanaged tables to Hive (metastore keeps track of the files)

2016-10-20 Thread Sergey Shelukhin (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15593112#comment-15593112
 ] 

Sergey Shelukhin commented on HIVE-14535:
-

Just to add to [~gopalv]'s response - the "rest" of the MM table support, 
namely the commit mechanic in metastore, is what makes it safe to write 
directly to the table without moves/copies, in the presence of task 
failures/retries/speculative execution, catastrophic query failures (when 
there's no one left to clean up), and also considering reads in parallel with 
in-flight writes.
There has to be some way to tell the committed files apart from the 
uncommitted ones.
My initial plan was to store file names in metastore for every file that 
MoveTask would have moved, but the ID approach is much more efficient for 
commit and DB storage requirements.
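
As a rough illustration of why an ID approach needs less commit work and storage than tracking file names: the metastore only has to record one ID per write, and a reader keeps the files whose ID is committed. The sketch below is a toy model under that assumption; `ToyMetastore`, `visible_files`, and the `(write_id, file_name)` pairs are hypothetical and are not the actual metastore API or file layout.

```python
class ToyMetastore:
    """Hypothetical stand-in that tracks one ID per write, not per file."""

    def __init__(self):
        self._next_id = 1
        self._committed = set()

    def open_write(self):
        # Hand out a fresh ID that the writer tags all of its files with.
        write_id = self._next_id
        self._next_id += 1
        return write_id

    def commit(self, write_id):
        # Committing is a single small record, however many files were written.
        self._committed.add(write_id)

    def is_committed(self, write_id):
        return write_id in self._committed


def visible_files(files, metastore):
    """Return the names of files that belong to committed writes only."""
    return [name for write_id, name in files if metastore.is_committed(write_id)]


ms = ToyMetastore()
w1 = ms.open_write()
w2 = ms.open_write()
# Files written directly into the table, tagged with the ID of their write.
files = [(w1, "000000_0"), (w1, "000001_0"), (w2, "000000_0")]
ms.commit(w1)
readable = visible_files(files, ms)   # w2 is still in flight, so it is hidden
```

A reader running in parallel with the in-flight write `w2` simply does not see its files until that ID is committed.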



[jira] [Commented] (HIVE-14535) add micromanaged tables to Hive (metastore keeps track of the files)

2016-10-20 Thread Gopal V (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15593084#comment-15593084
 ] 

Gopal V commented on HIVE-14535:


>  Do you think it would be reasonable to commit the changes to the 
> FileSinkOperator without the rest of the MM tables support?

No, a direct output committer approach without query isolation has lost data 
for production customers before, by forcing multiple tasks to write to the same 
file name by accident. Due to the way checksum-safety works, the first writer 
is not the winner in failure-tolerance scenarios.

We want to prevent users from making such expensive mistakes again, so this 
patch isolates different queries from each other - without which you will stomp 
over files.
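
A toy model of the stomping problem, assuming nothing about Hive's real layout: the table is a plain dictionary keyed by file name, and "isolation" just means the key also carries a query ID. The function `direct_write` and the IDs `q1`/`q2` are invented for the example.

```python
def direct_write(table, query_id, file_name, data, isolate):
    """Write straight into the table; with isolation the query ID is part of the key."""
    key = (query_id, file_name) if isolate else file_name
    table[key] = data


# Without isolation, two writers that pick the same file name collide.
shared = {}
direct_write(shared, "q1", "000000_0", b"from q1", isolate=False)
direct_write(shared, "q2", "000000_0", b"from q2", isolate=False)

# With isolation, both outputs survive side by side.
isolated = {}
direct_write(isolated, "q1", "000000_0", b"from q1", isolate=True)
direct_write(isolated, "q2", "000000_0", b"from q2", isolate=True)
```

In the shared case only one file remains and the earlier output is silently gone, which is the kind of loss described above.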

>  I know there are some concerns that this "direct output committer" approach 
> could cause data corruption issues, is this something that was considered 
> explicitly in the design? If so, could you expand on why those data 
> corruption issues would occur?

Without the isolation fix, the other parts are dangerous to use. 

With the isolation in place, the system moves away from the move model to a 
cleanup model (the cleanup code already exists, it is just applied to the 
scratch dir today).
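
One way to picture the cleanup model: nothing is moved on success, and the only filesystem work is deleting the output of writes that did not commit. The sketch below assumes a hypothetical layout with one `write_<id>` directory per write and a known set of aborted IDs; neither is taken from the patch.

```python
import os
import shutil
import tempfile


def cleanup_aborted(table_dir, aborted_ids):
    """Delete the per-write directories of aborted writes; return their names."""
    prefix = "write_"   # hypothetical naming scheme
    removed = []
    for name in sorted(os.listdir(table_dir)):
        if name.startswith(prefix) and int(name[len(prefix):]) in aborted_ids:
            shutil.rmtree(os.path.join(table_dir, name))
            removed.append(name)
    return removed


table_dir = tempfile.mkdtemp()
for write_id in (1, 2, 3):
    os.makedirs(os.path.join(table_dir, f"write_{write_id}"))

# Write 2 failed; writes 1 and 3 stay exactly where they were written.
removed = cleanup_aborted(table_dir, aborted_ids={2})
```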



[jira] [Commented] (HIVE-14535) add micromanaged tables to Hive (metastore keeps track of the files)

2016-10-20 Thread Sahil Takiar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15593030#comment-15593030
 ] 

Sahil Takiar commented on HIVE-14535:
-

Hey [~sershe],

Very interesting feature. I think this could have some benefits for Hive-on-S3 
write performance also (ref: HIVE-14269). Particularly the changes to the 
{{FileSinkOperator}}. If I understand correctly, the changes cause the 
{{FileSinkOperator}} to directly write to the final Hive table location rather 
than to a staging directory. On blob stores (like S3), this should significantly 
improve performance since data doesn't need to be copied from a staging 
directory to the final directory. We were thinking of implementing something 
similar in HIVE-14271. Do you think it would be reasonable to commit the 
changes to the {{FileSinkOperator}} without the rest of the MM tables support? 
I know there are some concerns that this "direct output committer" approach 
could cause data corruption issues; is this something that was considered 
explicitly in the design? If so, could you expand on why those data corruption 
issues would occur?



[jira] [Commented] (HIVE-14535) add micromanaged tables to Hive (metastore keeps track of the files)

2016-08-25 Thread Sergey Shelukhin (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437494#comment-15437494
 ] 

Sergey Shelukhin commented on HIVE-14535:
-

Also, it will allow flexibility wrt the actual model, as e.g. metastore 
plumbing and special cases can be done separately.



[jira] [Commented] (HIVE-14535) add micromanaged tables to Hive (metastore keeps track of the files)

2016-08-25 Thread Sergey Shelukhin (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437493#comment-15437493
 ] 

Sergey Shelukhin commented on HIVE-14535:
-

I think this will take too long for me to do as a single patch (maybe that's 
just because I'm not very familiar with this part of the codebase).
I think I'll create a branch to be able to commit broken code and split the 
work into subtasks...
