[jira] [Commented] (HIVE-7384) Research into reduce-side join [Spark Branch]

2014-09-08 Thread Xuefu Zhang (JIRA)

[ https://issues.apache.org/jira/browse/HIVE-7384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14126179#comment-14126179 ]

Xuefu Zhang commented on HIVE-7384:
---

For information only: SPARK-2978 has been resolved. We should be able to continue 
our work on reduce-side join.
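
For reference, SPARK-2978 added an MR-style shuffle transformation, repartitionAndSortWithinPartitions, to Spark's OrderedRDDFunctions. Below is a minimal sketch of the behavior it provides, using toy string keys and rows in place of Hive's actual HiveKey/value pairs; all names and data here are illustrative assumptions, not the Hive-on-Spark implementation.

{code}
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object MrStyleShuffleSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("mr-style-shuffle-sketch").setMaster("local[2]"))

    // Toy (join key -> row) pairs standing in for the shuffled join input.
    val rows = sc.parallelize(Seq(
      "store_2" -> "sales row 100", "store_1" -> "sales row 7",
      "store_1" -> "stores row",    "store_2" -> "stores row"))

    // SPARK-2978: hash-partition by key and sort each partition locally,
    // with no global range partitioning -- the MR-style shuffle behavior a
    // reduce-side join expects.
    val shuffled = rows.repartitionAndSortWithinPartitions(new HashPartitioner(2))

    // Print each reduce partition; rows sharing a join key are co-located and sorted.
    shuffled.glom().collect().foreach(part => println(part.mkString(", ")))
    sc.stop()
  }
}
{code}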

 Research into reduce-side join [Spark Branch]
 -

 Key: HIVE-7384
 URL: https://issues.apache.org/jira/browse/HIVE-7384
 Project: Hive
  Issue Type: Sub-task
  Components: Spark
Reporter: Xuefu Zhang
Assignee: Szehon Ho
 Attachments: Hive on Spark Reduce Side Join.docx, sales_items.txt, 
 sales_products.txt, sales_stores.txt


 Hive's join operator is very sophisticated, especially for reduce-side join. 
 While we expect that other types of join, such as map-side join and SMB 
 map-side join, will work out of the box with our design, there may be some 
 complications in reduce-side join, which extensively utilizes key tags and 
 shuffle behavior. Our design principle is that the Hive implementation should 
 also work out of the box, which might require new functionality from Spark. 
 The task is to research this area, identifying requirements for the Spark 
 community and the work to be done on Hive to make reduce-side join work.
 A design doc might be needed for this. For more information, please refer to 
 the overall design doc on the wiki.





[jira] [Commented] (HIVE-7384) Research into reduce-side join [Spark Branch]

2014-08-22 Thread Szehon Ho (JIRA)

[ https://issues.apache.org/jira/browse/HIVE-7384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14106532#comment-14106532 ]

Szehon Ho commented on HIVE-7384:
-

Thanks [~lianhuiwang] for the information.



[jira] [Commented] (HIVE-7384) Research into reduce-side join [Spark Branch]

2014-08-22 Thread Brock Noland (JIRA)

[ https://issues.apache.org/jira/browse/HIVE-7384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14107278#comment-14107278 ]

Brock Noland commented on HIVE-7384:


bq. I thought that TotalOrderPartitioner was only used in the order-by case, when 
hive.optimize.sampling.orderby=true, and not for joins? That's just my reading of 
it; I'll take a second look and update if I'm wrong.

You are right.



[jira] [Commented] (HIVE-7384) Research into reduce-side join [Spark Branch]

2014-08-22 Thread Szehon Ho (JIRA)

[ https://issues.apache.org/jira/browse/HIVE-7384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14107583#comment-14107583 ]

Szehon Ho commented on HIVE-7384:
-

Did some preliminary research for part 2 (parallel reducers) and put the findings 
in HIVE-7856, where the work will happen.



[jira] [Commented] (HIVE-7384) Research into reduce-side join [Spark Branch]

2014-08-21 Thread Lianhui Wang (JIRA)

[ https://issues.apache.org/jira/browse/HIVE-7384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14105301#comment-14105301 ]

Lianhui Wang commented on HIVE-7384:


I think current Spark can already support hashing by join_col and sorting by 
{join_col, tag}: in Spark, the map side's shuffle writer hashes by the key's 
hashCode and sorts by the key, and Hive's HiveKey class already defines the 
hashCode. So it can support hashing by HiveKey.hashCode and sorting by the 
HiveKey's bytes.
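
To make the key contract described above concrete, here is a small, Spark-free sketch with a hypothetical TaggedJoinKey standing in for Hive's HiveKey: the hash covers only the join key, while the ordering also covers the tag, so rows from both tables hash to the same reducer but still sort by (join key, tag).

{code}
// Hypothetical stand-in for HiveKey: hashCode ignores the tag, the ordering does not.
final case class TaggedJoinKey(joinKey: String, tag: Byte) {
  override def hashCode(): Int = joinKey.hashCode   // partitioning sees only the join key
}

object TaggedJoinKey {
  // "Sort by the key's bytes" is modeled here as sorting by (joinKey, tag).
  implicit val ordering: Ordering[TaggedJoinKey] =
    Ordering.by((k: TaggedJoinKey) => (k.joinKey, k.tag))
}

object TaggedJoinKeyDemo extends App {
  val left  = TaggedJoinKey("store_1", 0)   // row from the first table
  val right = TaggedJoinKey("store_1", 1)   // row from the second table

  // Same hash -> same reduce partition; the ordering still groups by table,
  // so the join operator sees all tag-0 rows for a key before the tag-1 rows.
  assert(left.hashCode == right.hashCode)
  assert(Ordering[TaggedJoinKey].lt(left, right))
  println("hash by join key, sort by (join key, tag)")
}
{code}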



[jira] [Commented] (HIVE-7384) Research into reduce-side join [Spark Branch]

2014-08-21 Thread Szehon Ho (JIRA)

[ https://issues.apache.org/jira/browse/HIVE-7384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14105720#comment-14105720 ]

Szehon Ho commented on HIVE-7384:
-

Thanks for the comment. I had a similar thought initially, but then saw that 
sortByKey does a re-partitioning (range partitioning), as it has to achieve a 
total order. I think we need something that sorts within a partition.
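
To illustrate the distinction, here is a toy sketch (not Hive code) contrasting sortByKey, which re-partitions with a RangePartitioner to get a total order, with hash partitioning followed by a local sort inside each partition.

{code}
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object WithinPartitionSortSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("within-partition-sort").setMaster("local[2]"))

    val pairs = sc.parallelize(Seq("b" -> 2, "a" -> 1, "c" -> 3, "a" -> 4))

    // sortByKey() range-partitions (it even runs a sampling pass to pick the
    // range bounds) so that the partitions form one global order.
    val totallySorted = pairs.sortByKey()

    // What the join path wants instead: keep a hash distribution of the join
    // key and only sort the rows inside each reduce partition.
    val hashedAndSorted = pairs
      .partitionBy(new HashPartitioner(2))
      .mapPartitions(iter => iter.toArray.sortBy(_._1).iterator,
        preservesPartitioning = true)

    println(totallySorted.partitioner)    // Some(org.apache.spark.RangePartitioner@...)
    println(hashedAndSorted.partitioner)  // Some(org.apache.spark.HashPartitioner@...)
    sc.stop()
  }
}
{code}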



[jira] [Commented] (HIVE-7384) Research into reduce-side join [Spark Branch]

2014-08-21 Thread Brock Noland (JIRA)

[ https://issues.apache.org/jira/browse/HIVE-7384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14106253#comment-14106253 ]

Brock Noland commented on HIVE-7384:


1) I noticed recently that the latest Hive, when there is more than one reducer, 
does a total-order sort:
https://github.com/apache/hive/blob/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/mr/ExecDriver.java#L374

2) Should we do some investigation into Tez auto-parallelism (HIVE-7158)? Let me 
know your thoughts.



[jira] [Commented] (HIVE-7384) Research into reduce-side join [Spark Branch]

2014-08-21 Thread Lianhui Wang (JIRA)

[ https://issues.apache.org/jira/browse/HIVE-7384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14106343#comment-14106343 ]

Lianhui Wang commented on HIVE-7384:


@Szehon Ho Yes, I read the OrderedRDDFunctions code and discovered that sortByKey 
actually does a range partition. We need to replace the range partition with a 
hash partition, so Spark should perhaps add a new interface, for example 
partitionSortByKey (sketched below).
@Brock Noland The code in 1) means that when sampling data and there is more than 
one reducer, Hive does a total-order sort. A join does not sample data, so it does 
not need a total-order sort.
2) I think we really need auto-parallelism. I talked about it with Reynold Xin 
before: Spark needs to support re-partitioning the map output's data in the same 
way Tez does.
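
For illustration only, here is what the proposed partitionSortByKey could look like as an enrichment of pair RDDs; the name comes from the comment above, and the body is just one possible sketch, not an existing Spark API. Spark's eventual answer to this request was repartitionAndSortWithinPartitions, added by SPARK-2978.

{code}
import scala.reflect.ClassTag
import org.apache.spark.Partitioner
import org.apache.spark.rdd.RDD

object PartitionSortByKeySketch {
  // Hypothetical helper: partition by key with the given partitioner, then sort
  // each partition locally -- no range partitioning and no global order,
  // unlike sortByKey.
  implicit class PartitionSortOps[K: Ordering : ClassTag, V: ClassTag](rdd: RDD[(K, V)]) {
    def partitionSortByKey(partitioner: Partitioner): RDD[(K, V)] =
      rdd.partitionBy(partitioner)
        .mapPartitions(iter => iter.toVector.sortBy(_._1).iterator,
          preservesPartitioning = true)
  }
}
{code}

With the implicit imported, usage would be along the lines of pairs.partitionSortByKey(new HashPartitioner(numReducers)).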



[jira] [Commented] (HIVE-7384) Research into reduce-side join [Spark Branch]

2014-08-21 Thread Szehon Ho (JIRA)

[ https://issues.apache.org/jira/browse/HIVE-7384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14106358#comment-14106358 ]

Szehon Ho commented on HIVE-7384:
-

1.  I thought that TotalOrderPartitioner was only used in the order-by case, when 
hive.optimize.sampling.orderby=true, and not for joins? That's just my reading of 
it; I'll take a second look and update if I'm wrong.

2.  Auto-parallelism looks like a Tez feature that will auto-calculate the number 
of reducers based on some input from Hive (upper/lower bounds).

Today in Spark we take numReducers from what the Hive query optimizer gives us 
and, during the shuffle stage, use that as the number of RDD partitions of the 
shuffle output (the reducer input). Spark has some defaults if we don't set it 
explicitly (the doc says it's based on the number of partitions in the largest 
parent RDD): 
http://spark.apache.org/docs/latest/tuning.html#level-of-parallelism. 
I don't know off the top of my head of any option for Spark to give us a better 
number, if that's what you're asking.
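
As a toy illustration of the two behaviors described (assuming a reasonably recent Spark for getNumPartitions; none of this is the actual Hive-on-Spark code): pass the optimizer's numReducers through as the shuffle partition count, or let Spark fall back to its defaults.

{code}
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object ReducerParallelismSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("reducer-parallelism").setMaster("local[2]"))

    // 8 map-side partitions feeding the shuffle.
    val pairs = sc.parallelize(Seq("a" -> 1, "b" -> 2, "a" -> 3), numSlices = 8)

    // Hive's optimizer hands us numReducers; we pass it straight through as the
    // number of shuffle-output (reduce-input) RDD partitions.
    val numReducers = 4
    val explicit = pairs.groupByKey(new HashPartitioner(numReducers))
    println(explicit.getNumPartitions)   // 4

    // With no explicit setting, Spark falls back to spark.default.parallelism
    // if set, otherwise the largest parent RDD's partition count (8 here).
    val defaulted = pairs.groupByKey()
    println(defaulted.getNumPartitions)  // 8
    sc.stop()
  }
}
{code}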



[jira] [Commented] (HIVE-7384) Research into reduce-side join [Spark Branch]

2014-08-21 Thread Szehon Ho (JIRA)

[ https://issues.apache.org/jira/browse/HIVE-7384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14106361#comment-14106361 ]

Szehon Ho commented on HIVE-7384:
-

Sorry Lianhui, I didn't see your reply before I typed my comment. Please do let us 
know the Spark community's thoughts on auto-parallelism if you have any.



[jira] [Commented] (HIVE-7384) Research into reduce-side join [Spark Branch]

2014-08-21 Thread Lianhui Wang (JIRA)

[ https://issues.apache.org/jira/browse/HIVE-7384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14106407#comment-14106407 ]

Lianhui Wang commented on HIVE-7384:


I think the thoughts are the same as the ideas you mentioned before: like 
HIVE-7158, auto-calculate the number of reducers based on some input from Hive 
(upper/lower bounds).




--
This message was sent by Atlassian JIRA
(v6.2#6252)