[jira] [Commented] (HIVE-13479) Relax sorting requirement in ACID tables

2019-03-20 Thread Eugene Koifman (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-13479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16797769#comment-16797769
 ] 

Eugene Koifman commented on HIVE-13479:
---

There is no sorting restriction on insert-only ACID tables.
Delete event filtering (HIVE-20738) for full-CRUD tables relies on the fact 
that data is ordered by ROW__ID.
I don't think anything precludes INSERT INTO T  SORT BY ...  
on a full-CRUD table, though.
That should be enough to make the min/max statistics in ORC useful for 
predicate push-down in a lot of cases.
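
To illustrate why sorting on a user column helps min/max-based predicate push-down, here is a minimal sketch. It is not ORC's reader: the stripe size, the column values and the helper names (stripe_stats, stripes_to_read) are all assumptions made up for this example. It only shows that when values are written in sorted order, the per-stripe [min, max] ranges stop overlapping, so a range predicate can skip most stripes.

```python
from typing import List, Tuple

def stripe_stats(values: List[int], stripe_size: int) -> List[Tuple[int, int]]:
    """Return (min, max) for each fixed-size stripe of a column."""
    stats = []
    for start in range(0, len(values), stripe_size):
        chunk = values[start:start + stripe_size]
        stats.append((min(chunk), max(chunk)))
    return stats

def stripes_to_read(stats: List[Tuple[int, int]], lo: int, hi: int) -> int:
    """Count stripes whose [min, max] range overlaps the predicate lo <= x <= hi."""
    return sum(1 for mn, mx in stats if mx >= lo and mn <= hi)

# Hypothetical column: the values 0..99 in a scrambled arrival order.
unsorted_col = [(i * 37) % 100 for i in range(100)]
# The same values as they would be laid out after a SORT BY on this column.
sorted_col = sorted(unsorted_col)

# Predicate: 40 <= x <= 49, with 10 values per stripe.
unsorted_hits = stripes_to_read(stripe_stats(unsorted_col, 10), 40, 49)
sorted_hits = stripes_to_read(stripe_stats(sorted_col, 10), 40, 49)
print(unsorted_hits, sorted_hits)
```

With the sorted layout only the one stripe covering 40..49 has to be read; with the scrambled layout the stripe ranges overlap the predicate much more often.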

INSERT OVERWRITE (IOW) is supported, and I think it could be used to re-sort 
the table by any column (it will generate new row_ids), but it is currently an 
operation that takes an X (exclusive) lock.  
With some work, IOW could run with a less strict lock that allows reads but not 
any other writes.  Compaction that does an overwrite would have the same issue, 
which is likely too restrictive.  
IOW (whether issued directly by a user or by the compactor) is also problematic 
since it will invalidate all result set caches and materialized views.

Incidentally, {{hive.optimize.sort.dynamic.partition=true}} was fixed on ACID 
tables long ago.








> Relax sorting requirement in ACID tables
> ----------------------------------------
>
>                 Key: HIVE-13479
>                 URL: https://issues.apache.org/jira/browse/HIVE-13479
>             Project: Hive
>          Issue Type: New Feature
>          Components: Transactions
>    Affects Versions: 1.2.0
>            Reporter: Eugene Koifman
>            Assignee: Eugene Koifman
>            Priority: Major
>   Original Estimate: 160h
>  Remaining Estimate: 160h
>
> Currently ACID tables require data to be sorted according to the internal 
> primary key.  This is so that base + delta files can be efficiently 
> sort/merged to produce the snapshot for the current transaction.
> This prevents the user from sorting the table on any other criteria, 
> which can be useful.  One example is dynamic partition insert (which 
> also occurs for update/delete SQL).  This may create lots of writers 
> (buckets*partitions) and tax cluster resources.
> The usual solution is hive.optimize.sort.dynamic.partition=true, which won't 
> be honored for ACID tables.
> We could rely on a hash-table-based algorithm to merge delta files and then 
> not require any particular sort on ACID tables.  One way to do that is to treat 
> each update event as an insert (new internal PK) + delete (old PK).  Delete 
> events are very small since they just need to contain PKs.  So the hash table 
> would just need to contain delete events and be reasonably memory efficient.
> This is a significant amount of work but worth doing.
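
The hash-table idea in the description above can be sketched as follows. This is only an illustration under stated assumptions, not Hive's implementation: the row-id layout (write id, bucket, row number), the event lists and the function name apply_deletes are hypothetical. The point is that only the delete keys need to be held in memory, and the insert events can then be streamed in any order.

```python
from typing import Dict, Iterable, Iterator, Set, Tuple

# Hypothetical row-id layout for illustration: (write id, bucket, row number).
RowId = Tuple[int, int, int]
Row = Dict[str, str]

def apply_deletes(
    inserts: Iterable[Tuple[RowId, Row]], deletes: Iterable[RowId]
) -> Iterator[Tuple[RowId, Row]]:
    """Yield the insert events that have no matching delete event.

    Only the delete keys are loaded into a hash set; the inserts are
    streamed once and do not need to be sorted.
    """
    deleted: Set[RowId] = set(deletes)
    for row_id, row in inserts:
        if row_id not in deleted:
            yield row_id, row

# Insert events in no particular order. The update of row (1, 0, 0) is
# modelled as a delete of the old id plus an insert under a new id (2, 0, 0).
insert_events = [
    ((1, 0, 1), {"name": "b"}),
    ((2, 0, 0), {"name": "a-updated"}),
    ((1, 0, 0), {"name": "a"}),
]
delete_events = [(1, 0, 0)]

snapshot = list(apply_deletes(insert_events, delete_events))
print(snapshot)
```

The memory cost is proportional to the number of delete events rather than to the table size, which is the property the description relies on.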



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-13479) Relax sorting requirement in ACID tables

2019-03-19 Thread Abhishek Somani (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-13479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16795933#comment-16795933
 ] 

Abhishek Somani commented on HIVE-13479:


What I meant is: now that ACID v2 has been implemented, do we plan to work on 
relaxing the sorting requirement? As far as I know, we still enforce that 
rows be sorted on the ACID columns (row id), so that the reader 
can sort-merge the delete events with the insert events while reading. Isn't 
that right?
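
For reference, the sort-merge the question refers to can be sketched like this. It is a simplified illustration, not the actual Hive reader: row ids are plain integers here and the name sort_merge_filter is made up. It assumes both streams are already sorted ascending by row id, which is exactly the requirement under discussion.

```python
from typing import Iterable, Iterator, Tuple

def sort_merge_filter(
    inserts: Iterable[Tuple[int, str]], deletes: Iterable[int]
) -> Iterator[Tuple[int, str]]:
    """Yield insert events that have no matching delete event.

    Both inputs must be sorted ascending by row id. Each stream is read
    once and only the current delete key is kept in memory.
    """
    delete_iter = iter(deletes)
    current = next(delete_iter, None)
    for row_id, row in inserts:
        # Skip delete events that sort before the current insert event.
        while current is not None and current < row_id:
            current = next(delete_iter, None)
        if current is not None and current == row_id:
            continue  # this row was deleted
        yield row_id, row

# Hypothetical data: integer row ids, both lists sorted ascending.
insert_events = [(1, "a"), (2, "b"), (3, "c"), (5, "e")]
delete_events = [2, 4, 5]

visible = list(sort_merge_filter(insert_events, delete_events))
print(visible)
```

If the insert events were not sorted by row id, the single forward pass over the delete stream would miss matches, which is why the reader depends on the ordering.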

If so, the only way to have data sorted on another, user-specified column 
seems to be to insert the data ordered on that column in the first place, 
so that it is sorted on BOTH the ACID columns and the user-specified 
column.

If, however, we were able to relax the requirement that data HAS to be sorted 
on the ACID columns, we could use something like compaction to sort the data 
on user-desired columns in the background. Theoretically one could do such 
sorting in compaction even today, but unless the sorting requirement is 
relaxed, the data must stay sorted on both the row ids and the user column. 
Compaction would then have to behave as an insert overwrite and generate new 
row ids, so that the data is sorted on both the (new) row id columns and the 
user-specified column, which would be good to avoid.

Have I understood this correctly?



[jira] [Commented] (HIVE-13479) Relax sorting requirement in ACID tables

2019-03-18 Thread Gopal V (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-13479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16795741#comment-16795741
 ] 

Gopal V commented on HIVE-13479:


[~asomani]: This ticket describes ACIDv2



[jira] [Commented] (HIVE-13479) Relax sorting requirement in ACID tables

2019-03-18 Thread Abhishek Somani (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-13479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16795678#comment-16795678
 ] 

Abhishek Somani commented on HIVE-13479:


[~ekoifman] [~vgumashta] [~gopalv]

Do we have any plans to work on this?



[jira] [Commented] (HIVE-13479) Relax sorting requirement in ACID tables

2016-04-11 Thread Owen O'Malley (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-13479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15235674#comment-15235674
 ] 

Owen O'Malley commented on HIVE-13479:
--

hive.optimize.sort.dynamic.partition=true has other issues, but ACID 
shouldn't need to disable it, since it only sorts on the partition columns. 
That does not interfere with the ACID sort, which is only required inside each 
bucket file.
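
A small sketch of this point, under assumptions: the rows, the partition keys and the helper peak_open_writers are hypothetical, and a stable sort on the partition key stands in for whatever the optimizer actually does. It shows two things: sorting on the partition key lets a task keep one writer open at a time, and the row-id order inside each partition's output is left intact.

```python
from itertools import groupby
from typing import List, Tuple

# Hypothetical rows: (partition key, row id), arriving in row-id order.
rows: List[Tuple[str, int]] = [
    ("2019-03-20", 0), ("2019-03-18", 1), ("2019-03-19", 2),
    ("2019-03-18", 3), ("2019-03-20", 4), ("2019-03-19", 5),
]

def peak_open_writers(keys: List[str], input_is_sorted: bool) -> int:
    """Return the largest number of partition writers open at the same time.

    If the input is sorted by partition key, the previous writer can be
    closed as soon as the key changes. Otherwise every writer stays open.
    """
    open_writers = set()
    peak = 0
    previous = None
    for key in keys:
        if input_is_sorted and previous is not None and key != previous:
            open_writers.discard(previous)
        open_writers.add(key)
        peak = max(peak, len(open_writers))
        previous = key
    return peak

# Python's sorted() is stable, so rows keep their row-id order per partition.
by_partition = sorted(rows, key=lambda r: r[0])

unsorted_peak = peak_open_writers([p for p, _ in rows], False)
sorted_peak = peak_open_writers([p for p, _ in by_partition], True)

# Row ids written to each partition, in write order.
per_file = {
    part: [row_id for _, row_id in group]
    for part, group in groupby(by_partition, key=lambda r: r[0])
}
print(unsorted_peak, sorted_peak, per_file)
```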

The bigger requirement is that we need to support sorting on user-defined 
primary keys instead of our internal row ids. That will enable implementation 
of the upsert/merge commands. That does NOT require moving to a split insert + 
delete for modifications. The split has other advantages (like enabling 
predicate push-down on the deltas), but they don't help very much for the case 
of sorted primary keys.
