[jira] [Commented] (HIVE-7384) Research into reduce-side join [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-7384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14126179#comment-14126179 ]

Xuefu Zhang commented on HIVE-7384:

For information only: SPARK-2978 has been resolved. We should be able to continue our work on reduce-side join.

Research into reduce-side join [Spark Branch]

Key: HIVE-7384
URL: https://issues.apache.org/jira/browse/HIVE-7384
Project: Hive
Issue Type: Sub-task
Components: Spark
Reporter: Xuefu Zhang
Assignee: Szehon Ho
Attachments: Hive on Spark Reduce Side Join.docx, sales_items.txt, sales_products.txt, sales_stores.txt

Hive's join operator is very sophisticated, especially for reduce-side join. While we expect that other types of join, such as map-side join and SMB map-side join, will work out of the box with our design, there may be some complications in reduce-side join, which makes extensive use of key tags and shuffle behavior. Our design principle prefers making Hive's implementation work out of the box as well, which might require new functionality from Spark. The task is to research this area, identifying requirements for the Spark community and the work to be done on Hive to make reduce-side join work. A design doc might be needed for this. For more information, please refer to the overall design doc on the wiki.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-7384) Research into reduce-side join [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-7384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14106532#comment-14106532 ]

Szehon Ho commented on HIVE-7384:

Thanks [~lianhuiwang] for the information.
[jira] [Commented] (HIVE-7384) Research into reduce-side join [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-7384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14107278#comment-14107278 ]

Brock Noland commented on HIVE-7384:

bq. I thought that TotalOrderPartition was only in the order-by case, for hive.optimize.sampling.orderby=true, and not for joins? Just my reading of it, I'll take a second look and update if wrong.

You are right.
[jira] [Commented] (HIVE-7384) Research into reduce-side join [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-7384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14107583#comment-14107583 ]

Szehon Ho commented on HIVE-7384:

Did some preliminary research for part 2 (parallel reducers), and put the findings in HIVE-7856, where the work will happen.
[jira] [Commented] (HIVE-7384) Research into reduce-side join [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-7384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14105301#comment-14105301 ]

Lianhui Wang commented on HIVE-7384:

I think current Spark already supports hash by join_col, sort by {join_col, tag}. In Spark, the map side's shuffleWriter hashes by Key.hashcode and sorts by Key, and in Hive the HiveKey class already defines the hashcode. So Spark can support hash by HiveKey.hashcode, sort by HiveKey's bytes.
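To make the "hash by join_col, sort by {join_col, tag}" idea concrete, here is a minimal Python sketch. It is not Hive or Spark code: the record layout (join_key, tag, row) and the function names shuffle and reduce_side_join are assumptions made for illustration, and it only covers an inner join of two inputs tagged 0 and 1.

```python
from collections import defaultdict


def shuffle(records, num_partitions):
    """Hash-partition on the join key, then sort each partition by (join key, tag).

    records: iterable of (join_key, tag, row); the tag names the source table.
    """
    partitions = defaultdict(list)
    for join_key, tag, row in records:
        # Partition on the join key only, so all tags for a key land together.
        partitions[hash(join_key) % num_partitions].append((join_key, tag, row))
    for part in partitions.values():
        # Sort inside the partition; no ordering across partitions is needed.
        part.sort(key=lambda r: (r[0], r[1]))
    return partitions


def reduce_side_join(partition):
    """Inner-join tag-0 rows with tag-1 rows for each key of one sorted partition."""
    results = []
    current_key, buffered = None, []
    for join_key, tag, row in partition:
        if join_key != current_key:
            # A new key starts: drop the buffer kept for the previous key.
            current_key, buffered = join_key, []
        if tag == 0:
            # Tag-0 rows arrive first thanks to the sort, so buffer them.
            buffered.append(row)
        else:
            # Tag-1 rows are streamed against the buffered tag-0 rows.
            results.extend((join_key, left, row) for left in buffered)
    return results


records = [(1, 0, "apple"), (2, 0, "pear"), (1, 1, 10), (1, 1, 20), (3, 1, 5)]
parts = shuffle(records, 2)
joined = sorted(t for p in parts.values() for t in reduce_side_join(p))
print(joined)  # [(1, 'apple', 10), (1, 'apple', 20)]
```

Because the sort places the lower tag first within each key, the reducer only has to buffer one side of the join and can stream the other.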
[jira] [Commented] (HIVE-7384) Research into reduce-side join [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-7384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14105720#comment-14105720 ]

Szehon Ho commented on HIVE-7384:

Thanks for the comment. I had a similar thought initially, but then saw that sortByKey does a re-partitioning (range partition), since it has to achieve total order. I think we need something that sorts within a partition.
[jira] [Commented] (HIVE-7384) Research into reduce-side join [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-7384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14106253#comment-14106253 ]

Brock Noland commented on HIVE-7384:

1) I noticed recently that the latest Hive, when there is more than one reducer, does a total order sort: https://github.com/apache/hive/blob/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/mr/ExecDriver.java#L374

2) Should we do some investigation into Tez auto-parallelism (HIVE-7158)?

Let me know your thoughts.
[jira] [Commented] (HIVE-7384) Research into reduce-side join [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-7384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14106343#comment-14106343 ]

Lianhui Wang commented on HIVE-7384:

@Szehon Ho Yes, I read the OrderedRDDFunctions code and discovered that sortByKey actually does a range partition. We need to replace the range partition with a hash partition, so Spark should perhaps provide a new interface, for example partitionSortByKey.

@Brock Noland
1) The code in 1) means that when data is sampled and there is more than one reducer, Hive does a total order sort. A join does not sample data, so it does not need a total order sort.
2) I think we really need auto-parallelism. I discussed it with Reynold Xin earlier: Spark needs to support re-partitioning the map output data, the same as Tez does.
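The difference between the two partitioning schemes discussed above can be sketched in a few lines of Python. This is an illustration only, not Spark's implementation: range_partition stands in for what a range-partitioned sort does, and partition_sort_by_key is a hypothetical stand-in for the proposed interface; the boundaries argument is likewise an assumption.

```python
import bisect


def range_partition(keys, boundaries):
    """Total order: partition i holds the keys up to boundaries[i], each partition sorted."""
    parts = [[] for _ in range(len(boundaries) + 1)]
    for k in keys:
        # bisect_left picks the first range whose upper boundary is >= k.
        parts[bisect.bisect_left(boundaries, k)].append(k)
    return [sorted(p) for p in parts]


def partition_sort_by_key(keys, num_partitions):
    """Hypothetical: hash-partition the keys, then sort inside each partition only."""
    parts = [[] for _ in range(num_partitions)]
    for k in keys:
        parts[hash(k) % num_partitions].append(k)
    return [sorted(p) for p in parts]


keys = [5, 1, 4, 2, 3, 6]
print(range_partition(keys, [2, 4]))   # [[1, 2], [3, 4], [5, 6]]
print(partition_sort_by_key(keys, 2))  # [[2, 4, 6], [1, 3, 5]]
```

Both schemes keep equal keys in the same partition, which is all a join needs. Only the range-partitioned version is globally ordered across partitions, and choosing its boundaries is what normally requires sampling the data.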
[jira] [Commented] (HIVE-7384) Research into reduce-side join [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-7384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14106358#comment-14106358 ]

Szehon Ho commented on HIVE-7384:

1. I thought that TotalOrderPartition was only in the order-by case, for hive.optimize.sampling.orderby=true, and not for joins? Just my reading of it, I'll take a second look and update if wrong.

2. Auto-parallelism looks like a Tez feature that will auto-calculate the number of reducers based on some input from Hive (upper/lower bound). Today in Spark we take numReducers from what the Hive query optimizer gives us, and during the shuffle stage we use that as the number of RDD partitions of the shuffle output (reducer input). Spark has some defaults if we don't set it explicitly (the doc says it's based on the number of partitions already in the largest parent RDD): http://spark.apache.org/docs/latest/tuning.html#level-of-parallelism. I don't know off the top of my head of any option for Spark to give us a better number, if that's what you're asking?
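As a rough illustration of the two ways of choosing a reducer count mentioned above, here is a small Python sketch. It is not Tez's or Hive's actual algorithm: the function and parameter names are hypothetical, and the size-based estimate is just one plausible way to pick a number between a lower and an upper bound.

```python
def estimate_num_reducers(input_bytes, bytes_per_reducer, min_reducers, max_reducers):
    """Pick a reducer count from the data size, clamped to [min_reducers, max_reducers]."""
    # Integer ceiling division: enough reducers so that none exceeds bytes_per_reducer.
    wanted = (input_bytes + bytes_per_reducer - 1) // bytes_per_reducer
    return max(min_reducers, min(max_reducers, wanted))


def default_num_partitions(parent_partition_counts):
    """Fallback when nothing is set: reuse the largest parent's partition count."""
    return max(parent_partition_counts)


# 10 GiB of shuffle input at 256 MiB per reducer, bounded to 1..999 reducers.
print(estimate_num_reducers(10 * 2**30, 256 * 2**20, 1, 999))  # 40
print(default_num_partitions([4, 16, 8]))                      # 16
```

The first function needs a size estimate before the shuffle runs, which is why auto-parallelism depends on statistics gathered from the map output.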
[jira] [Commented] (HIVE-7384) Research into reduce-side join [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-7384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14106361#comment-14106361 ]

Szehon Ho commented on HIVE-7384:

Sorry Lianhui, I didn't see your reply before I typed my comment. Please do let us know about Spark's thinking on auto-parallelism, if you have any information.
[jira] [Commented] (HIVE-7384) Research into reduce-side join [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-7384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14106407#comment-14106407 ]

Lianhui Wang commented on HIVE-7384:

I think the thinking is the same as the idea you described before: like HIVE-7158, it would auto-calculate the number of reducers based on some input from Hive (upper/lower bound).