[jira] [Commented] (HIVE-21524) Impala Engine

2019-03-27 Thread Xuefu Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-21524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16803073#comment-16803073
 ] 

Xuefu Zhang commented on HIVE-21524:


This sounds interesting. However, I'm trying to figure out what a user gains 
from this if the query is simply routed to the Impala coordinator. As an 
alternative, a user could just connect the client (e.g., Beeline) to the Impala 
coordinator directly.

> Impala Engine
> -
>
> Key: HIVE-21524
> URL: https://issues.apache.org/jira/browse/HIVE-21524
> Project: Hive
>  Issue Type: New Feature
>Affects Versions: 4.0.0
>Reporter: David Mollitor
>Priority: Major
>
> Now that Impala has "dedicated coordinator" capability, it could be 
> interesting to pair HiveServer2 instances with Impala dedicated coordinators 
> on the same localhost.  A client could request an 'impala' execution engine 
> and subsequent queries would be routed to the local coordinator.
> {code:sql}
> set hive.execution.engine=impala;
> {code}
> This would allow clients seamless access to both capabilities without needing 
> different connections or drivers. Hive would also be a central location for 
> auditing and authorization.
> https://www.cloudera.com/documentation/enterprise/latest/topics/impala_dedicated_coordinator.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-21035) Race condition in SparkUtilities#getSparkSession

2018-12-12 Thread Xuefu Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16719356#comment-16719356
 ] 

Xuefu Zhang commented on HIVE-21035:


[~asinkovits] Thanks for working on this. Maybe I have missed something, but 
I'm wondering how multiple app masters can be created in one session. My 
understanding is that at most one master is created for one session while 
multiple queries can be submitted to the app master.

> Race condition in SparkUtilities#getSparkSession
> 
>
> Key: HIVE-21035
> URL: https://issues.apache.org/jira/browse/HIVE-21035
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Affects Versions: 4.0.0
>Reporter: Antal Sinkovits
>Assignee: Antal Sinkovits
>Priority: Major
> Attachments: HIVE-21035.01.patch
>
>
> It can happen that, when multiple queries are executed in one session, a race 
> condition kicks off multiple Spark application masters.
> In this case, the one that started earlier will not be killed when the Hive 
> session closes, and keeps consuming resources.
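The lazy-initialization race described above can be sketched in plain Java (a toy stand-in, not Hive's actual `SparkUtilities` code; the `Session` class is hypothetical): without synchronization, two threads can both pass the null check and each start a session, and synchronizing the check-then-create closes that window.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

public class SessionRaceDemo {
    // Hypothetical stand-in for a Spark session; counts constructions so we
    // can observe how many "application masters" were kicked off.
    static final AtomicInteger started = new AtomicInteger();

    static class Session {
        Session() { started.incrementAndGet(); }
    }

    private static Session cached;

    // Synchronizing the whole check-then-create makes the lazy init
    // race-free, so a second session is never created.
    static synchronized Session getOrCreate() {
        if (cached == null) {
            cached = new Session();
        }
        return cached;
    }

    public static void main(String[] args) throws InterruptedException {
        int n = 16;
        CountDownLatch done = new CountDownLatch(n);
        for (int i = 0; i < n; i++) {
            new Thread(() -> { getOrCreate(); done.countDown(); }).start();
        }
        done.await();
        System.out.println("sessions started: " + started.get());  // 1
    }
}
```

Without the `synchronized` keyword, the count can exceed one under contention, which mirrors the leak reported here: the losing session is never killed when the Hive session closes.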





[jira] [Commented] (HIVE-20721) Describe table sometimes shows "from deserializer" for column comments

2018-10-10 Thread Xuefu Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16645899#comment-16645899
 ] 

Xuefu Zhang commented on HIVE-20721:


Hi [~Ballesteros], have you tried "desc table" for your table as that's what 
HIVE-6681 was about?
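For reference, the distinction being drawn is roughly the following (table name illustrative; my reading of HIVE-6681 is that it made DESCRIBE fetch column comments from the metastore rather than showing the deserializer placeholder, which SHOW CREATE TABLE may still do):

```sql
-- DESCRIBE reads column metadata, including comments, from the metastore:
DESCRIBE table_name;
-- or, for more detail:
DESCRIBE FORMATTED table_name;
```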

> Describe table sometimes shows "from deserializer" for column comments
> --
>
> Key: HIVE-20721
> URL: https://issues.apache.org/jira/browse/HIVE-20721
> Project: Hive
>  Issue Type: Bug
>  Components: Serializers/Deserializers
>Affects Versions: 1.1.0
>Reporter: Pedro
>Priority: Major
>
> When we want to see the comments on our Hive tables' columns, Hive only shows 
> "from deserializer"
>  
> For example
> > show create table table_name
> CREATE EXTERNAL TABLE `table_name`(
>  `some_column1` bigint COMMENT 'from deserializer', 
>  `some_column2` string COMMENT 'from deserializer', 
>  `some_column3` string COMMENT 'from deserializer',
> [...]
> PARTITIONED BY ( 
>  `dt` string)
> ROW FORMAT SERDE 
>  'org.apache.hive.hcatalog.data.JsonSerDe' 
> STORED AS INPUTFORMAT 
>  'org.apache.hadoop.mapred.TextInputFormat' 
> OUTPUTFORMAT 
>  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
> LOCATION
>  'hdfs://location/table_name'
> TBLPROPERTIES (
>  'json.schema.url'='/location/json/table_name.json', 
>  'transient_lastDdlTime'='1525858710')
>  
> I saw it was resolved in HIVE-6681 but the fixed version was 0.13.0. I'm on 
> 1.1.0 and apparently I have the same issue.





[jira] [Commented] (HIVE-20276) Hive UDF class getting Instantiated for each call of function

2018-07-31 Thread Xuefu Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16563838#comment-16563838
 ] 

Xuefu Zhang commented on HIVE-20276:


[~hardik1808] Could you give some code examples of the usage of your UDF in 
your code? In Hive, one usually registers a UDF and uses it in the query. Thus, 
I'm not sure what you meant by calling the UDF in your code or spark session 
object.
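As a sketch of the usual flow Xuefu describes (jar path, class name, and function name are all hypothetical, shown only to illustrate registering a UDF and then using it in a query):

```sql
-- Register the UDF from a jar, then call it like any built-in function:
ADD JAR /path/to/my-udfs.jar;
CREATE TEMPORARY FUNCTION my_upper AS 'com.example.udf.MyUpper';
SELECT my_upper(name) FROM employees;
```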

> Hive UDF class getting Instantiated for each call of function
> -
>
> Key: HIVE-20276
> URL: https://issues.apache.org/jira/browse/HIVE-20276
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 1.2.1, 2.1.1
>Reporter: Hardik Trivedi
>Priority: Blocker
>
> * I have created a Hive UDF class and registered its function in Spark.
>  * In a Hive query inside the Spark session object, I call this function.
>  * Now, when I run my code, I observe that each time the function is called, 
> it creates a new instance of the UDF class.
>  * Is this normal behavior? Should each call create a new instance?
>  * Is this a version-specific issue?





[jira] [Commented] (HIVE-20261) Expose inputPartitionList in QueryPlan

2018-07-30 Thread Xuefu Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16562622#comment-16562622
 ] 

Xuefu Zhang commented on HIVE-20261:


+1 pending on test.

[~zshao] thanks for working on this. As an FYI, you might need to submit the 
patch in order to trigger the automated test.

> Expose inputPartitionList in QueryPlan
> --
>
> Key: HIVE-20261
> URL: https://issues.apache.org/jira/browse/HIVE-20261
> Project: Hive
>  Issue Type: New Feature
>  Components: Query Planning
>Reporter: Zheng Shao
>Assignee: Zheng Shao
>Priority: Minor
> Attachments: HIVE-20261.1.patch, HIVE-20261.2.patch
>
>
> Having access to the list of input partitions for all historical Hive queries 
> in a system provides a great opportunity for insights into data access 
> frequency and potential storage tiering.
> This task aims to expose that via QueryPlan so that a Hive Hook can pick it 
> up and store the information for analysis later.
>  





[jira] [Commented] (HIVE-20141) Turn hive.spark.use.groupby.shuffle off by default

2018-07-11 Thread Xuefu Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16540335#comment-16540335
 ] 

Xuefu Zhang commented on HIVE-20141:


[~stakiar] Based on our benchmarking at Uber, groupByKey does offer better 
performance in certain cases, specifically in aggregation without ordering. 
The difference is about 10%. I understand the limitation with group-by, which 
is why this configuration exists. I don't feel it's compelling enough to change 
the default behavior, from either the performance or backward-compatibility 
point of view. The configuration has existed for a few releases already, and 
most users don't have to bother with it anyway.

The best approach is to enhance groupByKey or provide a new shuffle mode that 
overcomes the memory limitation while maintaining the benefit of not enforcing 
ordering on keys. I saw you created a JIRA for that; looking forward to 
progress on it.

> Turn hive.spark.use.groupby.shuffle off by default
> --
>
> Key: HIVE-20141
> URL: https://issues.apache.org/jira/browse/HIVE-20141
> Project: Hive
>  Issue Type: Task
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
>Priority: Major
>
> [~xuefuz] any thoughts on this? I think it would provide better out of the 
> box behavior for Hive-on-Spark users, especially for users who are migrating 
> from Hive-on-MR to HoS. Wondering what your experience with this config has 
> been?
> I've done a bunch of performance profiling with this config turned on vs. 
> off, and for TPC-DS queries it doesn't make a significant difference. The 
> main difference I can see is that when a Spark stage has to spill to disk, 
> {{repartitionAndSortWithinPartitions}} spills more data to disk than 
> {{groupByKey}} - my guess is that this happens because {{groupByKey}} stores 
> everything in Spark's {{ExternalAppendOnlyMap}} (which only stores a single 
> copy of the key for potentially multiple values) whereas 
> {{repartitionAndSortWithinPartitions}} uses Spark's {{ExternalSorter}} which 
> sorts all the K, V pairs (and thus doesn't de-duplicate keys, which results 
> in more data being spilled to disk).
> My understanding is that using {{repartitionAndSortWithinPartitions}} for 
> Hive GROUP BYs is similar to what Hive-on-MR does. So disabling this config 
> would provide a similar experience to HoMR. Furthermore, last I checked, 
> {{groupByKey}} still can't spill within a row group.
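The key-deduplication argument above can be made concrete with a toy model (plain Java collections standing in for Spark's {{ExternalAppendOnlyMap}} and {{ExternalSorter}}; this only counts key copies and does not model spilling):

```java
import java.util.*;

public class ShuffleFootprintSketch {
    // Returns {key entries kept by hash grouping, key copies kept by a
    // sort-based shuffle} for a toy dataset of 3 keys x 4 values.
    static int[] keyCopies() {
        List<Map.Entry<String, Integer>> records = new ArrayList<>();
        for (String k : new String[]{"a", "b", "c"})
            for (int v = 0; v < 4; v++)
                records.add(new AbstractMap.SimpleEntry<>(k, v));

        // groupByKey-style map (toy analog of ExternalAppendOnlyMap):
        // one entry per distinct key, values appended to a list.
        Map<String, List<Integer>> grouped = new HashMap<>();
        for (Map.Entry<String, Integer> r : records)
            grouped.computeIfAbsent(r.getKey(), k -> new ArrayList<>()).add(r.getValue());

        // Sort-based shuffle (toy analog of ExternalSorter): every (K, V)
        // pair survives, so each record carries its own copy of the key.
        List<Map.Entry<String, Integer>> sorted = new ArrayList<>(records);
        sorted.sort(Map.Entry.comparingByKey());

        return new int[]{grouped.size(), sorted.size()};
    }

    public static void main(String[] args) {
        int[] c = keyCopies();
        System.out.println("key entries, grouped: " + c[0]);  // 3
        System.out.println("key copies,  sorted:  " + c[1]);  // 12
    }
}
```

The gap between 3 and 12 is the extra per-record key data a sort-based shuffle carries, which is consistent with the extra spill observed for `repartitionAndSortWithinPartitions`.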





[jira] [Commented] (HIVE-20007) Hive should carry out timestamp computations in UTC

2018-06-27 Thread Xuefu Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16525488#comment-16525488
 ] 

Xuefu Zhang commented on HIVE-20007:


cc: [~lirui]

> Hive should carry out timestamp computations in UTC
> ---
>
> Key: HIVE-20007
> URL: https://issues.apache.org/jira/browse/HIVE-20007
> Project: Hive
>  Issue Type: Sub-task
>  Components: Hive
>Reporter: Ryan Blue
>Assignee: Jesus Camacho Rodriguez
>Priority: Blocker
>  Labels: timestamp
> Attachments: HIVE-20007.patch
>
>
> Hive currently uses the "local" time of a java.sql.Timestamp to represent the 
> SQL data type TIMESTAMP WITHOUT TIME ZONE. The purpose is to be able to use 
> {{Timestamp#getYear()}} and similar methods to implement SQL functions like 
> {{year}}.
> When the SQL session's time zone is a DST zone, such as America/Los_Angeles 
> that alternates between PST and PDT, there are times that cannot be 
> represented because the effective zone skips them.
> {code}
> hive> select TIMESTAMP '2015-03-08 02:10:00.101';
> 2015-03-08 03:10:00.101
> {code}
> Using UTC instead of the SQL session time zone as the underlying zone for a 
> java.sql.Timestamp avoids this bug, while still returning correct values for 
> {{getYear}} etc. Using UTC as the convenience representation (timestamp 
> without time zone has no real zone) would make timestamp calculations more 
> consistent and avoid similar problems in the future.
> Notably, this would break the {{unix_timestamp}} UDF that specifies the 
> result is with respect to ["the default timezone and default 
> locale"|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-DateFunctions].
>  That function would need to be updated to use the 
> {{System.getProperty("user.timezone")}} zone.
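The DST gap described in this issue can be reproduced with the pre-Java-8 date classes; a minimal sketch (not Hive code) using `SimpleDateFormat` to parse a wall-clock timestamp in a session zone versus UTC:

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.TimeZone;

public class DstGapDemo {
    // Parse a wall-clock timestamp in the given zone, then format it back.
    static String roundTrip(String zone, String ts) {
        SimpleDateFormat f = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS");
        f.setTimeZone(TimeZone.getTimeZone(zone));
        try {
            return f.format(f.parse(ts));
        } catch (ParseException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // 02:10 on 2015-03-08 does not exist in America/Los_Angeles: the
        // clock jumps from 02:00 PST straight to 03:00 PDT, so the value
        // silently shifts, exactly as in the example above.
        System.out.println(roundTrip("America/Los_Angeles", "2015-03-08 02:10:00.101"));
        // UTC has no DST gap, so the same timestamp round-trips unchanged.
        System.out.println(roundTrip("UTC", "2015-03-08 02:10:00.101"));
    }
}
```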





[jira] [Commented] (HIVE-19671) Distribute by rand() can lead to data inconsistency

2018-06-26 Thread Xuefu Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-19671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16523928#comment-16523928
 ] 

Xuefu Zhang commented on HIVE-19671:


Yeah, I think it makes sense. Thanks.

> Distribute by rand() can lead to data inconsistency
> ---
>
> Key: HIVE-19671
> URL: https://issues.apache.org/jira/browse/HIVE-19671
> Project: Hive
>  Issue Type: Bug
>Reporter: Rui Li
>Assignee: Rui Li
>Priority: Major
>
> Noticed the following queries can give different results:
> {code}
> select count(*) from tbl;
> select count(*) from (select * from tbl distribute by rand()) a;
> {code}





[jira] [Commented] (HIVE-19671) Distribute by rand() can lead to data inconsistency

2018-06-21 Thread Xuefu Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-19671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16519830#comment-16519830
 ] 

Xuefu Zhang commented on HIVE-19671:


Printing a warning is good, but we may not know whether a partitioning is 
non-deterministic. Let me know your thoughts. Thanks.

> Distribute by rand() can lead to data inconsistency
> ---
>
> Key: HIVE-19671
> URL: https://issues.apache.org/jira/browse/HIVE-19671
> Project: Hive
>  Issue Type: Bug
>Reporter: Rui Li
>Assignee: Rui Li
>Priority: Major
>
> Noticed the following queries can give different results:
> {code}
> select count(*) from tbl;
> select count(*) from (select * from tbl distribute by rand()) a;
> {code}





[jira] [Commented] (HIVE-19671) Distribute by rand() can lead to data inconsistency

2018-06-20 Thread Xuefu Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-19671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16518745#comment-16518745
 ] 

Xuefu Zhang commented on HIVE-19671:


Based on your analysis, it seems that rand(seed) depends on a deterministic 
order of the input data. Reading HDFS may guarantee the order, but probably not 
every data source has such a guarantee. Also, map or reduce logic may itself 
generate a nondeterministic order. Having said that, it appears to me that any 
partitioning that depends on a deterministic ordering of the data is doomed, 
including rand() and rand(seed). This is rather a user problem, for which I'm 
not sure Hive needs to do anything. We may document this as a general bad 
practice, but blocking only solves the rand() problem and doesn't help other 
similar problems. I suggest we leave it to the user to solve. Thoughts?
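One workaround consistent with this advice is to distribute by a deterministic expression over the row's own columns, so a retried task sends each row to the same reducer regardless of read order (the column name below is hypothetical; `hash` is a Hive built-in):

```sql
-- Deterministic alternative to distribute by rand(): the target partition
-- depends only on the row itself, not on the order rows were read.
select count(*) from (select * from tbl distribute by hash(key)) a;
```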

> Distribute by rand() can lead to data inconsistency
> ---
>
> Key: HIVE-19671
> URL: https://issues.apache.org/jira/browse/HIVE-19671
> Project: Hive
>  Issue Type: Bug
>Reporter: Rui Li
>Assignee: Rui Li
>Priority: Major
>
> Noticed the following queries can give different results:
> {code}
> select count(*) from tbl;
> select count(*) from (select * from tbl distribute by rand()) a;
> {code}





[jira] [Commented] (HIVE-19937) Intern JobConf objects in Spark tasks

2018-06-19 Thread Xuefu Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-19937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16517674#comment-16517674
 ] 

Xuefu Zhang commented on HIVE-19937:


+1

> Intern JobConf objects in Spark tasks
> -
>
> Key: HIVE-19937
> URL: https://issues.apache.org/jira/browse/HIVE-19937
> Project: Hive
>  Issue Type: Improvement
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
>Priority: Major
> Attachments: HIVE-19937.1.patch
>
>
> When fixing HIVE-16395, we decided that each new Spark task should clone the 
> {{JobConf}} object to prevent any {{ConcurrentModificationException}} from 
> being thrown. However, setting this variable comes at a cost of storing a 
> duplicate {{JobConf}} object for each Spark task. These objects can take up a 
> significant amount of memory, we should intern them so that Spark tasks 
> running in the same JVM don't store duplicate copies.
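Interning can be sketched with a plain `ConcurrentHashMap` pool. This is a toy analog only, not the actual patch: it interns the string contents of a per-task `Properties` object standing in for the duplicated {{JobConf}} data.

```java
import java.util.Map;
import java.util.Properties;
import java.util.concurrent.ConcurrentHashMap;

public class InternSketch {
    // One shared pool: equal strings collapse to a single canonical instance.
    static final ConcurrentHashMap<String, String> POOL = new ConcurrentHashMap<>();

    static String intern(String s) {
        String prev = POOL.putIfAbsent(s, s);
        return prev != null ? prev : s;
    }

    // Intern every key and value of a per-task Properties object, so tasks
    // in the same JVM share one copy of each distinct string.
    static Properties internValues(Properties p) {
        Properties out = new Properties();
        for (Map.Entry<Object, Object> e : p.entrySet())
            out.put(intern((String) e.getKey()), intern((String) e.getValue()));
        return out;
    }

    public static void main(String[] args) {
        Properties task1 = new Properties();
        task1.setProperty("mapreduce.job.name", new String("query-1"));
        Properties task2 = new Properties();
        task2.setProperty("mapreduce.job.name", new String("query-1"));

        Properties a = internValues(task1), b = internValues(task2);
        // Same reference, not just equal: the duplicate copy is shared.
        System.out.println(a.getProperty("mapreduce.job.name")
                == b.getProperty("mapreduce.job.name"));  // true
    }
}
```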





[jira] [Commented] (HIVE-19671) Distribute by rand() can lead to data inconsistency

2018-05-29 Thread Xuefu Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-19671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16493796#comment-16493796
 ] 

Xuefu Zhang commented on HIVE-19671:


[~lirui] I think #1 is better. Nondeterministic partitioning, including using 
random, would be a problem in many aspects. #2 is a little harsh, as those are 
usually service-level attributes. Thanks.

> Distribute by rand() can lead to data inconsistency
> ---
>
> Key: HIVE-19671
> URL: https://issues.apache.org/jira/browse/HIVE-19671
> Project: Hive
>  Issue Type: Bug
>Reporter: Rui Li
>Assignee: Rui Li
>Priority: Major
>
> Noticed the following queries can give different results:
> {code}
> select count(*) from tbl;
> select count(*) from (select * from tbl distribute by rand()) a;
> {code}





[jira] [Commented] (HIVE-19523) Decimal truncation for trailing zeros in Hive 1.2.1

2018-05-21 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-19523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16483051#comment-16483051
 ] 

Xuefu Zhang commented on HIVE-19523:


{{HiveDecimal}} doesn't carry precision and scale info even though it's backed 
by a {{BigDecimal}} object. Internally, Hive normalizes the {{BigDecimal}} 
input when storing it as a {{HiveDecimal}} instance. If desired, you can call 
{{HiveDecimal.enforcePrecisionScale()}} to generate an instance that has 
certain precision/scale.
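The normalization described above happens inside {{HiveDecimal}}, but its effect can be approximated with plain `BigDecimal.stripTrailingZeros()` (an illustrative analog only, not Hive's actual code path):

```java
import java.math.BigDecimal;

public class DecimalNormalizeDemo {
    public static void main(String[] args) {
        BigDecimal bd = new BigDecimal("47.6300");
        // The input carries its trailing zeros in precision/scale:
        System.out.println(bd + "  precision=" + bd.precision() + " scale=" + bd.scale());
        // prints: 47.6300  precision=6 scale=4

        // Plain-BigDecimal analog of the normalization HiveDecimal applies:
        BigDecimal n = bd.stripTrailingZeros();
        System.out.println(n + "  precision=" + n.precision() + " scale=" + n.scale());
        // prints: 47.63  precision=4 scale=2
    }
}
```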

> Decimal truncation for trailing zeros in Hive 1.2.1
> ---
>
> Key: HIVE-19523
> URL: https://issues.apache.org/jira/browse/HIVE-19523
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Affects Versions: 1.2.1, 2.3.1
>Reporter: vinay kant garg
>Priority: Critical
>
> This issue is related to the unnecessary truncation of zeros while 
> serializing a BigDecimal object to a HiveDecimal object.
> The BigDecimal object carries scale and precision info, yet serializing ends 
> up modifying the scale and precision metadata as well as truncating zeros.
> Eg: if our BigDecimal val is 47.6300, with scale = 4 and precision = 6, and I 
> serialize it to HiveDecimal using the APIs Hive exposes:
> static HiveDecimal create(BigDecimal b)
> static HiveDecimal create(BigDecimal b, boolean allowRounding)
> our output will be: val = 47.63, scale = 2 and precision = 4.
> Or if the input is val = 47.00, scale = 2 and precision = 4, then
> the output is val = 47, scale = 0 and precision = 2.
> *In the above example there is no DATA CORRUPTION, because 47.6300 is 
> equivalent to 47.63 and 47.00 is equivalent to 47, but if this data is later 
> used to identify whether a value is an integer or a decimal, we may run into 
> weird issues, because 47.00 is a decimal while 47 is an integer*.
> I am able to reproduce this issue even with a standalone program:
> {code:java}
> import java.math.BigDecimal;
> import org.apache.hadoop.hive.common.type.HiveDecimal;
> 
> public class testHivedecimal {
>   public static void main(String[] argv) {
>     BigDecimal bd = new BigDecimal("47.6300");
>     // Note: BigDecimal is immutable, so setScale returns a new object.
>     bd = bd.setScale(4);
>     System.out.println("bigdecimal object created and value is : " + bd.toString());
>     System.out.println("precision : " + bd.precision());
>     System.out.println("scale : " + bd.scale());
>     HiveDecimal hv = HiveDecimal.create(bd);
>     String str = hv.toString();
>     System.out.println("value after serialization of bigdecimal to hivedecimal : " + str);
>     System.out.println("precision after hivedecimal : " + hv.precision());
>     System.out.println("scale after hivedecimal : " + hv.scale());
>   }
> }
> {code}
> You can use the above program to reproduce the issue.
> To compile:
> 1) use jdk8_u72
> 2) command: javac -cp hive-common-1.2.1000.2.6.0.3-8.jar:. 
> testHivedecimal.java
> To run:
> 1) use jdk8_u72
> 2) command: java -cp hive-common-1.2.1000.2.6.0.3-8.jar:. testHivedecimal
>  





[jira] [Commented] (HIVE-18525) Add explain plan to Hive on Spark Web UI

2018-03-26 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16414268#comment-16414268
 ] 

Xuefu Zhang commented on HIVE-18525:


{quote}
I haven't found a way to do that in the Spark Web UI yet. This might be 
possible if we implement HIVE-18515, but that would require quite a bit of work.
{quote}
Maybe we can put the overall explain plan in the job description, similar to 
what you do here for stages.

> Add explain plan to Hive on Spark Web UI
> 
>
> Key: HIVE-18525
> URL: https://issues.apache.org/jira/browse/HIVE-18525
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
>Priority: Major
> Attachments: HIVE-18525.1.patch, HIVE-18525.2.patch, 
> HIVE-18525.3.patch, Job-Page-Collapsed.png, Job-Page-Expanded.png, 
> Map-Explain-Plan.png, Reduce-Explain-Plan.png
>
>
> More of an investigation JIRA. The Spark UI has a "long description" of each 
> stage in the Spark DAG. Typically one stage in the Spark DAG corresponds to 
> either a {{MapWork}} or {{ReduceWork}} object. It would be useful if the long 
> description contained the explain plan of the corresponding work object.
> I'm not sure how much additional overhead this would introduce. If not the 
> full explain plan, then maybe a modified one that just lists out all the 
> operator tree along with each operator name.





[jira] [Commented] (HIVE-18525) Add explain plan to Hive on Spark Web UI

2018-03-21 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16408965#comment-16408965
 ] 

Xuefu Zhang commented on HIVE-18525:


The feature looks good and useful to me. I don't think generating the explain 
plan costs too much. However, you might want to add some perf logging so the 
cost can be measured. For this, I don't think we need to make it configurable 
unless perf logs later show otherwise.

As a related question, do we show the plan at the job level? That is, show the 
whole query plan for a spark job. That could be useful too.

> Add explain plan to Hive on Spark Web UI
> 
>
> Key: HIVE-18525
> URL: https://issues.apache.org/jira/browse/HIVE-18525
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
>Priority: Major
> Attachments: HIVE-18525.1.patch, HIVE-18525.2.patch, 
> Job-Page-Collapsed.png, Job-Page-Expanded.png, Map-Explain-Plan.png, 
> Reduce-Explain-Plan.png
>
>
> More of an investigation JIRA. The Spark UI has a "long description" of each 
> stage in the Spark DAG. Typically one stage in the Spark DAG corresponds to 
> either a {{MapWork}} or {{ReduceWork}} object. It would be useful if the long 
> description contained the explain plan of the corresponding work object.
> I'm not sure how much additional overhead this would introduce. If not the 
> full explain plan, then maybe a modified one that just lists out all the 
> operator tree along with each operator name.





[jira] [Commented] (HIVE-18765) SparkClientImpl swallows exception messages from the RemoteDriver

2018-02-21 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16372363#comment-16372363
 ] 

Xuefu Zhang commented on HIVE-18765:


+1 pending on test.

> SparkClientImpl swallows exception messages from the RemoteDriver
> -
>
> Key: HIVE-18765
> URL: https://issues.apache.org/jira/browse/HIVE-18765
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
>Priority: Major
> Attachments: HIVE-18765.1.patch
>
>
> {{SparkClientImpl#handle(ChannelHandlerContext, Error)}} swallows the cause 
> of the error message:
> {code}
> LOG.warn("Error reported from remote driver.", msg.cause);
> {code}
> There should be a '{}' in the message. Without it the {{msg.cause}} info gets 
> swallowed.





[jira] [Commented] (HIVE-18713) Optimize: Transform IN clauses to = when there's only one element

2018-02-14 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16364495#comment-16364495
 ] 

Xuefu Zhang commented on HIVE-18713:


Okay, makes sense. Thanks for the explanation.

> Optimize: Transform IN clauses to = when there's only one element
> -
>
> Key: HIVE-18713
> URL: https://issues.apache.org/jira/browse/HIVE-18713
> Project: Hive
>  Issue Type: Bug
>  Components: Vectorization
>Affects Versions: 3.0.0
>Reporter: Gopal V
>Assignee: Gopal V
>Priority: Major
> Attachments: HIVE-18713.1.patch
>
>
> (col1) IN (col2) can be transformed to (col1) = (col2), to avoid the hash-set 
> implementation.





[jira] [Commented] (HIVE-18713) Optimize: Transform IN clauses to = when there's only one element

2018-02-14 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16364332#comment-16364332
 ] 

Xuefu Zhang commented on HIVE-18713:


I'm wondering if it's a good idea to extend this such that if the number of 
elements is less than a threshold (say, 3), we convert it to ( (col1) = (a) OR 
(col1) = (b) OR (col1) = (c) ).

> Optimize: Transform IN clauses to = when there's only one element
> -
>
> Key: HIVE-18713
> URL: https://issues.apache.org/jira/browse/HIVE-18713
> Project: Hive
>  Issue Type: Bug
>  Components: Vectorization
>Affects Versions: 3.0.0
>Reporter: Gopal V
>Assignee: Gopal V
>Priority: Major
> Attachments: HIVE-18713.1.patch
>
>
> (col1) IN (col2) can be transformed to (col1) = (col2), to avoid the hash-set 
> implementation.





[jira] [Commented] (HIVE-18442) HoS: No FileSystem for scheme: nullscan

2018-02-13 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16362690#comment-16362690
 ] 

Xuefu Zhang commented on HIVE-18442:


+1 to patch #1.

> HoS: No FileSystem for scheme: nullscan
> ---
>
> Key: HIVE-18442
> URL: https://issues.apache.org/jira/browse/HIVE-18442
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Rui Li
>Assignee: Rui Li
>Priority: Major
> Attachments: HIVE-18442.1.patch, HIVE-18442.2.patch
>
>
> Hit the issue when I run the following query in yarn-cluster mode:
> {code}
> select * from (select key from src where false) a left outer join (select key 
> from srcpart limit 0) b on a.key=b.key;
> {code}
> Stack trace:
> {noformat}
> Job failed with java.io.IOException: No FileSystem for scheme: nullscan
>   at 
> org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2799)
>   at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2810)
>   at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:100)
>   at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2849)
>   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2831)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:389)
>   at org.apache.hadoop.fs.Path.getFileSystem(Path.java:356)
>   at 
> org.apache.hadoop.hive.ql.exec.Utilities.isEmptyPath(Utilities.java:2605)
>   at 
> org.apache.hadoop.hive.ql.exec.Utilities.isEmptyPath(Utilities.java:2601)
>   at 
> org.apache.hadoop.hive.ql.exec.Utilities$GetInputPathsCallable.call(Utilities.java:3409)
>   at 
> org.apache.hadoop.hive.ql.exec.Utilities.getInputPaths(Utilities.java:3347)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.cloneJobConf(SparkPlanGenerator.java:299)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generate(SparkPlanGenerator.java:222)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generate(SparkPlanGenerator.java:109)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient$JobStatusJob.call(RemoteHiveSparkClient.java:354)
>   at 
> org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:358)
>   at 
> org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:323)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}





[jira] [Commented] (HIVE-18442) HoS: No FileSystem for scheme: nullscan

2018-02-12 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16361407#comment-16361407
 ] 

Xuefu Zhang commented on HIVE-18442:


Okay. It's fine to use {{fs.nullscan.impl}} to solve the problem. However, I 
don't quite follow how it solves the classpath issue if hive-exec.jar isn't 
loaded. Can you shed some light on this?
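For context, Hadoop's `FileSystem.getFileSystemClass()` consults the `fs.<scheme>.impl` configuration key before falling back to ServiceLoader discovery, so a mapping like the following lets the scheme resolve from the job configuration (the implementation class name is as it appears in Hive's source tree; treat it as an assumption):

```xml
<!-- Sketch: map the "nullscan" scheme to its implementation class so
     FileSystem.get() can resolve it from configuration rather than
     ServiceLoader discovery on the driver's classpath. -->
<property>
  <name>fs.nullscan.impl</name>
  <value>org.apache.hadoop.hive.ql.io.NullScanFileSystem</value>
</property>
```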

> HoS: No FileSystem for scheme: nullscan
> ---
>
> Key: HIVE-18442
> URL: https://issues.apache.org/jira/browse/HIVE-18442
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Rui Li
>Assignee: Rui Li
>Priority: Major
> Attachments: HIVE-18442.1.patch, HIVE-18442.2.patch
>
>
> Hit the issue when I run the following query in yarn-cluster mode:
> {code}
> select * from (select key from src where false) a left outer join (select key 
> from srcpart limit 0) b on a.key=b.key;
> {code}
> Stack trace:
> {noformat}
> Job failed with java.io.IOException: No FileSystem for scheme: nullscan
>   at 
> org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2799)
>   at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2810)
>   at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:100)
>   at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2849)
>   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2831)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:389)
>   at org.apache.hadoop.fs.Path.getFileSystem(Path.java:356)
>   at 
> org.apache.hadoop.hive.ql.exec.Utilities.isEmptyPath(Utilities.java:2605)
>   at 
> org.apache.hadoop.hive.ql.exec.Utilities.isEmptyPath(Utilities.java:2601)
>   at 
> org.apache.hadoop.hive.ql.exec.Utilities$GetInputPathsCallable.call(Utilities.java:3409)
>   at 
> org.apache.hadoop.hive.ql.exec.Utilities.getInputPaths(Utilities.java:3347)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.cloneJobConf(SparkPlanGenerator.java:299)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generate(SparkPlanGenerator.java:222)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generate(SparkPlanGenerator.java:109)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient$JobStatusJob.call(RemoteHiveSparkClient.java:354)
>   at 
> org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:358)
>   at 
> org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:323)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}





[jira] [Commented] (HIVE-18442) HoS: No FileSystem for scheme: nullscan

2018-02-09 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16359134#comment-16359134
 ] 

Xuefu Zhang commented on HIVE-18442:


[~lirui] Since it's a hack anyway, is it okay to just use the 
{{spark.yarn.user.classpath.first}} configuration? I know it's not ideal.

> HoS: No FileSystem for scheme: nullscan
> ---
>
> Key: HIVE-18442
> URL: https://issues.apache.org/jira/browse/HIVE-18442
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Rui Li
>Assignee: Rui Li
>Priority: Major
> Attachments: HIVE-18442.1.patch, HIVE-18442.2.patch
>
>
> Hit the issue when I run following query in yarn-cluster mode:
> {code}
> select * from (select key from src where false) a left outer join (select key 
> from srcpart limit 0) b on a.key=b.key;
> {code}
> Stack trace:
> {noformat}
> Job failed with java.io.IOException: No FileSystem for scheme: nullscan
>   at 
> org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2799)
>   at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2810)
>   at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:100)
>   at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2849)
>   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2831)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:389)
>   at org.apache.hadoop.fs.Path.getFileSystem(Path.java:356)
>   at 
> org.apache.hadoop.hive.ql.exec.Utilities.isEmptyPath(Utilities.java:2605)
>   at 
> org.apache.hadoop.hive.ql.exec.Utilities.isEmptyPath(Utilities.java:2601)
>   at 
> org.apache.hadoop.hive.ql.exec.Utilities$GetInputPathsCallable.call(Utilities.java:3409)
>   at 
> org.apache.hadoop.hive.ql.exec.Utilities.getInputPaths(Utilities.java:3347)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.cloneJobConf(SparkPlanGenerator.java:299)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generate(SparkPlanGenerator.java:222)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generate(SparkPlanGenerator.java:109)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient$JobStatusJob.call(RemoteHiveSparkClient.java:354)
>   at 
> org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:358)
>   at 
> org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:323)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}





[jira] [Commented] (HIVE-18513) Query results caching

2018-02-02 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350654#comment-16350654
 ] 

Xuefu Zhang commented on HIVE-18513:


That's good. As a minor comment, if we were only doing time-based invalidation, 
there wouldn't be much of a difference between internal and external tables.
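As a rough sketch of what a purely time-based policy could look like, one might imagine a single TTL knob (the property name below is hypothetical, for illustration only, not an actual Hive setting):

{code}
-- hypothetical knob: evict cached results older than one hour,
-- regardless of whether the backing table is internal or external
set hive.query.results.cache.ttl=3600s;
{code}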

> Query results caching
> -
>
> Key: HIVE-18513
> URL: https://issues.apache.org/jira/browse/HIVE-18513
> Project: Hive
>  Issue Type: Bug
>  Components: Query Planning
>Reporter: Jason Dere
>Assignee: Jason Dere
>Priority: Major
> Attachments: HIVE-18513.1.patch, HIVE-18513.2.patch, 
> HIVE-18513.3.patch, HIVE-18513.4.patch, HIVE-18513.5.patch
>
>
> Add a query results cache that can save the results of an executed Hive query 
> for reuse on subsequent queries. This may be useful in cases where the same 
> query is issued many times, since Hive can return back the results of a 
> cached query rather than having to execute the full query on the cluster.





[jira] [Commented] (HIVE-18442) HoS: No FileSystem for scheme: nullscan

2018-02-02 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350627#comment-16350627
 ] 

Xuefu Zhang commented on HIVE-18442:


Regarding loading the jar twice: since we are loading it when the JVM starts, 
can we get rid of the other loading?

> HoS: No FileSystem for scheme: nullscan
> ---
>
> Key: HIVE-18442
> URL: https://issues.apache.org/jira/browse/HIVE-18442
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Rui Li
>Assignee: Rui Li
>Priority: Major
> Attachments: HIVE-18442.1.patch, HIVE-18442.2.patch
>
>
> Hit the issue when I run following query in yarn-cluster mode:
> {code}
> select * from (select key from src where false) a left outer join (select key 
> from srcpart limit 0) b on a.key=b.key;
> {code}
> Stack trace:
> {noformat}
> Job failed with java.io.IOException: No FileSystem for scheme: nullscan
>   at 
> org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2799)
>   at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2810)
>   at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:100)
>   at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2849)
>   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2831)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:389)
>   at org.apache.hadoop.fs.Path.getFileSystem(Path.java:356)
>   at 
> org.apache.hadoop.hive.ql.exec.Utilities.isEmptyPath(Utilities.java:2605)
>   at 
> org.apache.hadoop.hive.ql.exec.Utilities.isEmptyPath(Utilities.java:2601)
>   at 
> org.apache.hadoop.hive.ql.exec.Utilities$GetInputPathsCallable.call(Utilities.java:3409)
>   at 
> org.apache.hadoop.hive.ql.exec.Utilities.getInputPaths(Utilities.java:3347)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.cloneJobConf(SparkPlanGenerator.java:299)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generate(SparkPlanGenerator.java:222)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generate(SparkPlanGenerator.java:109)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient$JobStatusJob.call(RemoteHiveSparkClient.java:354)
>   at 
> org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:358)
>   at 
> org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:323)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}





[jira] [Comment Edited] (HIVE-18513) Query results caching

2018-02-01 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16349716#comment-16349716
 ] 

Xuefu Zhang edited comment on HIVE-18513 at 2/2/18 3:16 AM:


Thanks, [~jcamachorodriguez] and [~jdere]. I went through the doc, and it is 
still unclear to me which mechanism was chosen for cache invalidation. Is this 
doc still a work in progress?


was (Author: xuefuz):
Thanks, [~jcamachorodriguez] and [~jdere]. I went through the doc, and it is 
clear which mechanism was chosen for cache invalidation. Is this doc still a 
work in progress?

> Query results caching
> -
>
> Key: HIVE-18513
> URL: https://issues.apache.org/jira/browse/HIVE-18513
> Project: Hive
>  Issue Type: Bug
>  Components: Query Planning
>Reporter: Jason Dere
>Assignee: Jason Dere
>Priority: Major
> Attachments: HIVE-18513.1.patch, HIVE-18513.2.patch, 
> HIVE-18513.3.patch, HIVE-18513.4.patch, HIVE-18513.5.patch
>
>
> Add a query results cache that can save the results of an executed Hive query 
> for reuse on subsequent queries. This may be useful in cases where the same 
> query is issued many times, since Hive can return back the results of a 
> cached query rather than having to execute the full query on the cluster.





[jira] [Commented] (HIVE-18513) Query results caching

2018-02-01 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16349716#comment-16349716
 ] 

Xuefu Zhang commented on HIVE-18513:


Thanks, [~jcamachorodriguez] and [~jdere]. I went through the doc, and it is 
clear which mechanism was chosen for cache invalidation. Is this doc still a 
work in progress?

> Query results caching
> -
>
> Key: HIVE-18513
> URL: https://issues.apache.org/jira/browse/HIVE-18513
> Project: Hive
>  Issue Type: Bug
>  Components: Query Planning
>Reporter: Jason Dere
>Assignee: Jason Dere
>Priority: Major
> Attachments: HIVE-18513.1.patch, HIVE-18513.2.patch, 
> HIVE-18513.3.patch, HIVE-18513.4.patch, HIVE-18513.5.patch
>
>
> Add a query results cache that can save the results of an executed Hive query 
> for reuse on subsequent queries. This may be useful in cases where the same 
> query is issued many times, since Hive can return back the results of a 
> cached query rather than having to execute the full query on the cluster.





[jira] [Commented] (HIVE-18513) Query results caching

2018-02-01 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16349703#comment-16349703
 ] 

Xuefu Zhang commented on HIVE-18513:


Could we have the high-level doc linked here so that others who are interested 
can at least do a high-level review? Thanks.

> Query results caching
> -
>
> Key: HIVE-18513
> URL: https://issues.apache.org/jira/browse/HIVE-18513
> Project: Hive
>  Issue Type: Bug
>  Components: Query Planning
>Reporter: Jason Dere
>Assignee: Jason Dere
>Priority: Major
> Attachments: HIVE-18513.1.patch, HIVE-18513.2.patch, 
> HIVE-18513.3.patch, HIVE-18513.4.patch, HIVE-18513.5.patch
>
>
> Add a query results cache that can save the results of an executed Hive query 
> for reuse on subsequent queries. This may be useful in cases where the same 
> query is issued many times, since Hive can return back the results of a 
> cached query rather than having to execute the full query on the cluster.





[jira] [Commented] (HIVE-18301) Investigate to enable MapInput cache in Hive on Spark

2018-01-31 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347253#comment-16347253
 ] 

Xuefu Zhang commented on HIVE-18301:


It seems that IOContext is used in many places and the logic is complicated. 
Instead of putting the input path in each row, as the patch proposes, could we 
send a serialized IOContext object as a special row whenever the content of the 
object changes? I'm not sure how feasible it is, so it's just a rough idea to 
be explored.

> Investigate to enable MapInput cache in Hive on Spark
> -
>
> Key: HIVE-18301
> URL: https://issues.apache.org/jira/browse/HIVE-18301
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang
>Assignee: liyunzhang
>Priority: Major
> Attachments: HIVE-18301.1.patch, HIVE-18301.patch
>
>
> Previously, an IOContext problem was found in MapTran when the Spark RDD cache 
> was enabled (HIVE-8920), so we disabled the RDD cache in MapTran at 
> [SparkPlanGenerator|https://github.com/kellyzly/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java#L202].
>   The problem is that IOContext does not seem to be initialized correctly in 
> Spark yarn client/cluster mode, causing exceptions like
> {code}
> Job aborted due to stage failure: Task 93 in stage 0.0 failed 4 times, most 
> recent failure: Lost task 93.3 in stage 0.0 (TID 616, bdpe48): 
> java.lang.RuntimeException: Error processing row: 
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.processRow(SparkMapRecordHandler.java:165)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:48)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:27)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList.hasNext(HiveBaseFunctionResultList.java:85)
>   at 
> scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.exec.AbstractMapOperator.getNominalPath(AbstractMapOperator.java:101)
>   at 
> org.apache.hadoop.hive.ql.exec.MapOperator.cleanUpInputFileChangedOp(MapOperator.java:516)
>   at 
> org.apache.hadoop.hive.ql.exec.Operator.cleanUpInputFileChanged(Operator.java:1187)
>   at 
> org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:546)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.processRow(SparkMapRecordHandler.java:152)
>   ... 12 more
> Driver stacktrace:
> {code}
> in yarn client/cluster mode, sometimes 
> [ExecMapperContext#currentInputPath|https://github.com/kellyzly/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/mr/ExecMapperContext.java#L109]
>  is null when rdd cache is enabled.





[jira] [Commented] (HIVE-18442) HoS: No FileSystem for scheme: nullscan

2018-01-31 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347238#comment-16347238
 ] 

Xuefu Zhang commented on HIVE-18442:


Hi [~lirui], thanks for the explanation. The patch looks fine. I'm wondering if 
there could be other similar issues. Thus, would either of the two options you 
mentioned work better?
{quote}
Unless it's added to the driver's extra class path or we enable 
{{spark.yarn.user.classpath.first}}.
{quote}
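For reference, a hedged sketch of the second option: in Hive on Spark, {{spark.*}} properties set in the Hive session are generally passed through to the Spark application, so the flag could presumably be enabled as below (illustrative only; it would need to take effect before the Spark session is created):

{code}
set spark.yarn.user.classpath.first=true;
{code}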

> HoS: No FileSystem for scheme: nullscan
> ---
>
> Key: HIVE-18442
> URL: https://issues.apache.org/jira/browse/HIVE-18442
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Rui Li
>Assignee: Rui Li
>Priority: Major
> Attachments: HIVE-18442.1.patch
>
>
> Hit the issue when I run following query in yarn-cluster mode:
> {code}
> select * from (select key from src where false) a left outer join (select key 
> from srcpart limit 0) b on a.key=b.key;
> {code}
> Stack trace:
> {noformat}
> Job failed with java.io.IOException: No FileSystem for scheme: nullscan
>   at 
> org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2799)
>   at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2810)
>   at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:100)
>   at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2849)
>   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2831)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:389)
>   at org.apache.hadoop.fs.Path.getFileSystem(Path.java:356)
>   at 
> org.apache.hadoop.hive.ql.exec.Utilities.isEmptyPath(Utilities.java:2605)
>   at 
> org.apache.hadoop.hive.ql.exec.Utilities.isEmptyPath(Utilities.java:2601)
>   at 
> org.apache.hadoop.hive.ql.exec.Utilities$GetInputPathsCallable.call(Utilities.java:3409)
>   at 
> org.apache.hadoop.hive.ql.exec.Utilities.getInputPaths(Utilities.java:3347)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.cloneJobConf(SparkPlanGenerator.java:299)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generate(SparkPlanGenerator.java:222)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generate(SparkPlanGenerator.java:109)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient$JobStatusJob.call(RemoteHiveSparkClient.java:354)
>   at 
> org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:358)
>   at 
> org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:323)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}





[jira] [Commented] (HIVE-18513) Query results caching

2018-01-26 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16341418#comment-16341418
 ] 

Xuefu Zhang commented on HIVE-18513:


[~jdere] Thanks for working on this useful feature! The patch seems big, and 
for such a feature a high-level doc would help those who are interested in 
learning more about it without reading the code changes. For example, besides 
the question asked above about determining whether two queries are the same, 
I'd also like to know whether the cache is distributed (shared by multiple 
HS2s). One might also ask about the eviction policy, etc. A high-level doc 
would go a long way toward answering these questions. Thanks.

> Query results caching
> -
>
> Key: HIVE-18513
> URL: https://issues.apache.org/jira/browse/HIVE-18513
> Project: Hive
>  Issue Type: Bug
>  Components: Query Planning
>Reporter: Jason Dere
>Assignee: Jason Dere
>Priority: Major
> Attachments: HIVE-18513.1.patch, HIVE-18513.2.patch, 
> HIVE-18513.3.patch
>
>
> Add a query results cache that can save the results of an executed Hive query 
> for reuse on subsequent queries. This may be useful in cases where the same 
> query is issued many times, since Hive can return back the results of a 
> cached query rather than having to execute the full query on the cluster.





[jira] [Commented] (HIVE-18368) Improve Spark Debug RDD Graph

2018-01-24 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16338668#comment-16338668
 ] 

Xuefu Zhang commented on HIVE-18368:


[~stakiar] It may not be meaningful to all users, but it could provide 
additional info for Hive developers when diagnosing issues. I feel it's 
probably better than just repeating the same info. However, I don't have a 
strong opinion about this.

> Improve Spark Debug RDD Graph
> -
>
> Key: HIVE-18368
> URL: https://issues.apache.org/jira/browse/HIVE-18368
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
>Priority: Major
> Attachments: Completed Stages.png, HIVE-18368.1.patch, 
> HIVE-18368.2.patch, HIVE-18368.3.patch, Job Ids.png, Stage DAG 1.png, Stage 
> DAG 2.png
>
>
> The {{SparkPlan}} class does some logging to show the mapping between 
> different {{SparkTran}}, what shuffle types are used, and what trans are 
> cached. However, there is room for improvement.
> When debug logging is enabled the RDD graph is logged, but there isn't much 
> information printed about each RDD.
> We should combine both of the graphs and improve them. We could even make the 
> Spark Plan graph part of the {{EXPLAIN EXTENDED}} output.
> Ideally, the final graph shows a clear relationship between Tran objects, 
> RDDs, and BaseWorks. Edge should include information about number of 
> partitions, shuffle types, Spark operations used, etc.





[jira] [Commented] (HIVE-18368) Improve Spark Debug RDD Graph

2018-01-24 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16338395#comment-16338395
 ] 

Xuefu Zhang commented on HIVE-18368:


Thanks, [~stakiar]. 

As to the duplication, is it possible to name the call site differently so that 
it is less confusing, such as "In ReduceTran"?

The code looks fine, though I didn't get too much into the details. I will let 
[~lirui] share his comments.

One thing that is unclear to me is why we changed the test case.

> Improve Spark Debug RDD Graph
> -
>
> Key: HIVE-18368
> URL: https://issues.apache.org/jira/browse/HIVE-18368
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
>Priority: Major
> Attachments: Completed Stages.png, HIVE-18368.1.patch, 
> HIVE-18368.2.patch, HIVE-18368.3.patch, Job Ids.png, Stage DAG 1.png, Stage 
> DAG 2.png
>
>
> The {{SparkPlan}} class does some logging to show the mapping between 
> different {{SparkTran}}, what shuffle types are used, and what trans are 
> cached. However, there is room for improvement.
> When debug logging is enabled the RDD graph is logged, but there isn't much 
> information printed about each RDD.
> We should combine both of the graphs and improve them. We could even make the 
> Spark Plan graph part of the {{EXPLAIN EXTENDED}} output.
> Ideally, the final graph shows a clear relationship between Tran objects, 
> RDDs, and BaseWorks. Edge should include information about number of 
> partitions, shuffle types, Spark operations used, etc.





[jira] [Commented] (HIVE-17257) Hive should merge empty files

2018-01-18 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16330994#comment-16330994
 ] 

Xuefu Zhang commented on HIVE-17257:


Thanks for the update, [~csun]. I also verified with the patch and it fixed the 
problem for both MR and Spark. Will commit the patch shortly.

> Hive should merge empty files
> -
>
> Key: HIVE-17257
> URL: https://issues.apache.org/jira/browse/HIVE-17257
> Project: Hive
>  Issue Type: Bug
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
> Attachments: HIVE-17257.0.patch, HIVE-17257.1.patch, 
> HIVE-17257.2.patch, HIVE-17257.3.patch
>
>
> Currently if merging file option is turned on and the dest dir contains large 
> number of empty files, Hive will not trigger merge task:
> {code}
>   private long getMergeSize(FileSystem inpFs, Path dirPath, long avgSize) {
> AverageSize averageSize = getAverageSize(inpFs, dirPath);
> if (averageSize.getTotalSize() <= 0) {
>   return -1;
> }
> if (averageSize.getNumFiles() <= 1) {
>   return -1;
> }
> if (averageSize.getTotalSize()/averageSize.getNumFiles() < avgSize) {
>   return averageSize.getTotalSize();
> }
> return -1;
>   }
> {code}
> This logic doesn't seem right, as it seems better to combine these empty 
> files into one.





[jira] [Commented] (HIVE-14162) Allow disabling of long running job on Hive On Spark On YARN

2018-01-17 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16329692#comment-16329692
 ] 

Xuefu Zhang commented on HIVE-14162:


Thanks, [~belugabehr]. I liked your thoughts and agreed that live drivers might 
be a concern for long idle sessions. Let's wait for more input to see whether 
it makes sense to add a knob for this.

> Allow disabling of long running job on Hive On Spark On YARN
> 
>
> Key: HIVE-14162
> URL: https://issues.apache.org/jira/browse/HIVE-14162
> Project: Hive
>  Issue Type: New Feature
>  Components: Spark
>Reporter: Thomas Scott
>Assignee: Aihua Xu
>Priority: Minor
> Attachments: HIVE-14162.1.patch
>
>
> Hive On Spark launches a long running process on the first query to handle 
> all queries for that user session. In some use cases this is not desired, for 
> instance when using Hue with large intervals between query executions.
> Could we have a property that would cause long running spark jobs to be 
> terminated after each query execution and started again for the next one?





[jira] [Commented] (HIVE-17257) Hive should merge empty files

2018-01-17 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16329018#comment-16329018
 ] 

Xuefu Zhang commented on HIVE-17257:


+1 for the patch. However, I'm not sure if those test failures are related.

> Hive should merge empty files
> -
>
> Key: HIVE-17257
> URL: https://issues.apache.org/jira/browse/HIVE-17257
> Project: Hive
>  Issue Type: Bug
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
> Attachments: HIVE-17257.0.patch, HIVE-17257.1.patch, 
> HIVE-17257.2.patch, HIVE-17257.3.patch
>
>
> Currently if merging file option is turned on and the dest dir contains large 
> number of empty files, Hive will not trigger merge task:
> {code}
>   private long getMergeSize(FileSystem inpFs, Path dirPath, long avgSize) {
> AverageSize averageSize = getAverageSize(inpFs, dirPath);
> if (averageSize.getTotalSize() <= 0) {
>   return -1;
> }
> if (averageSize.getNumFiles() <= 1) {
>   return -1;
> }
> if (averageSize.getTotalSize()/averageSize.getNumFiles() < avgSize) {
>   return averageSize.getTotalSize();
> }
> return -1;
>   }
> {code}
> This logic doesn't seem right, as it seems better to combine these empty 
> files into one.





[jira] [Commented] (HIVE-18442) HoS: No FileSystem for scheme: nullscan

2018-01-17 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16329014#comment-16329014
 ] 

Xuefu Zhang commented on HIVE-18442:


Hi [~lirui], thanks for working on this. Just curious: the stack trace shows 
that the ql package is loaded, and this package is in hive-exec.jar, right? If 
the nullscan class is in hive-exec.jar, how come it's not found?

> HoS: No FileSystem for scheme: nullscan
> ---
>
> Key: HIVE-18442
> URL: https://issues.apache.org/jira/browse/HIVE-18442
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Rui Li
>Assignee: Rui Li
>Priority: Major
> Attachments: HIVE-18442.1.patch
>
>
> Hit the issue when I run following query in yarn-cluster mode:
> {code}
> select * from (select key from src where false) a left outer join (select key 
> from srcpart limit 0) b on a.key=b.key;
> {code}
> Stack trace:
> {noformat}
> Job failed with java.io.IOException: No FileSystem for scheme: nullscan
>   at 
> org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2799)
>   at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2810)
>   at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:100)
>   at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2849)
>   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2831)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:389)
>   at org.apache.hadoop.fs.Path.getFileSystem(Path.java:356)
>   at 
> org.apache.hadoop.hive.ql.exec.Utilities.isEmptyPath(Utilities.java:2605)
>   at 
> org.apache.hadoop.hive.ql.exec.Utilities.isEmptyPath(Utilities.java:2601)
>   at 
> org.apache.hadoop.hive.ql.exec.Utilities$GetInputPathsCallable.call(Utilities.java:3409)
>   at 
> org.apache.hadoop.hive.ql.exec.Utilities.getInputPaths(Utilities.java:3347)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.cloneJobConf(SparkPlanGenerator.java:299)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generate(SparkPlanGenerator.java:222)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generate(SparkPlanGenerator.java:109)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient$JobStatusJob.call(RemoteHiveSparkClient.java:354)
>   at 
> org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:358)
>   at 
> org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:323)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}





[jira] [Commented] (HIVE-14162) Allow disabling of long running job on Hive On Spark On YARN

2018-01-12 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16324408#comment-16324408
 ] 

Xuefu Zhang commented on HIVE-14162:


The size of the driver is configurable, and the minimum number of executors can 
be 0. Would this be a problem for you?

> Allow disabling of long running job on Hive On Spark On YARN
> 
>
> Key: HIVE-14162
> URL: https://issues.apache.org/jira/browse/HIVE-14162
> Project: Hive
>  Issue Type: New Feature
>  Components: Spark
>Reporter: Thomas Scott
>Assignee: Aihua Xu
>Priority: Minor
> Attachments: HIVE-14162.1.patch
>
>
> Hive On Spark launches a long running process on the first query to handle 
> all queries for that user session. In some use cases this is not desired, for 
> instance when using Hue with large intervals between query executions.
> Could we have a property that would cause long running spark jobs to be 
> terminated after each query execution and started again for the next one?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-14162) Allow disabling of long running job on Hive On Spark On YARN

2018-01-12 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16324349#comment-16324349
 ] 

Xuefu Zhang commented on HIVE-14162:


[~belugabehr], Spark on YARN is powered by a feature called dynamic allocation, 
which is strongly recommended in a multi-tenant or cost-sensitive environment. 
This is something you might have missed. With it, unused executors are returned 
to the cluster so that others can use them.

SparkContext is stateful and can be considered an extension of the Hive 
session. If you allow a SparkSession to time out, then part of that state is 
gone. In that case, you might as well let the whole session expire.

What you requested isn't completely off, but I'd like to see whether existing 
features are enough to achieve what you want.
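As an illustrative sketch (the property names are standard Spark ones; the values are examples, not recommendations), dynamic allocation with a zero minimum lets an idle Hive on Spark session hold no executors at all between queries:

{code}
set spark.dynamicAllocation.enabled=true;
set spark.shuffle.service.enabled=true;
set spark.dynamicAllocation.minExecutors=0;
set spark.dynamicAllocation.executorIdleTimeout=60s;
{code}

With these, only the (configurably small) driver stays alive while the session is idle.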

> Allow disabling of long running job on Hive On Spark On YARN
> 
>
> Key: HIVE-14162
> URL: https://issues.apache.org/jira/browse/HIVE-14162
> Project: Hive
>  Issue Type: New Feature
>  Components: Spark
>Reporter: Thomas Scott
>Assignee: Aihua Xu
>Priority: Minor
> Attachments: HIVE-14162.1.patch
>
>
> Hive On Spark launches a long running process on the first query to handle 
> all queries for that user session. In some use cases this is not desired, for 
> instance when using Hue with large intervals between query executions.
> Could we have a property that would cause long running spark jobs to be 
> terminated after each query execution and started again for the next one?





[jira] [Commented] (HIVE-18434) Type is not determined correctly for comparison between decimal column and string constant

2018-01-11 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16323174#comment-16323174
 ] 

Xuefu Zhang commented on HIVE-18434:


Thanks for the explanation. While your patch may fix the particular problem, 
I'm afraid it might introduce inconsistency in treating string literals for 
numbers. My understanding is that Hive treats and implicitly converts numeric 
string literals (such as "3.1415926") as double rather than decimal. When a 
decimal (a precise data type) and a double (an imprecise data type) appear in 
the same numeric or logical operation, the result data type is double. For 
instance:
{code}
hive> explain select 3.14BD + "3.14";
OK
STAGE DEPENDENCIES:
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-0
Fetch Operator
  limit: -1
  Processor Tree:
TableScan
  alias: _dummy_table
  Row Limit Per Split: 1
  Statistics: Num rows: 1 Data size: 1 Basic stats: COMPLETE Column 
stats: COMPLETE
  Select Operator
expressions: 6.28 (type: double)
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column 
stats: COMPLETE
ListSink

hive> desc dec1;
OK
name    string
value   decimal(5,2)

hive> explain select * from dec1 where value="3.14";
OK
STAGE DEPENDENCIES:
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-0
Fetch Operator
  limit: -1
  Processor Tree:
TableScan
  alias: dec1
  filterExpr: (value = '3.14') (type: boolean)
  Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column 
stats: NONE
  Filter Operator
predicate: (value = '3.14') (type: boolean)
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column 
stats: NONE
Select Operator
  expressions: name (type: string), value (type: decimal(5,2))
  outputColumnNames: _col0, _col1
  Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column 
stats: NONE
  ListSink
{code}
I believe that {{ filterExpr: (value = '3.14') (type: boolean)}} will convert 
both sides to double and then compare internally.

The problem you described seems to be related to constant optimization. 
Specifically, the select op should project {{a}} instead of the following:
{code}
  Select Operator
expressions: -1511503446182.5518 (type: decimal(19,6))
{code}
The numeric comparison should still be based on double. We cannot just use 
what is used in the filter condition to rewrite the projection columns.
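To illustrate the double-based comparison, here is a minimal Java sketch (my own illustration, not Hive's actual code; the class and variable names are hypothetical): both the decimal column value and the string constant are converted to double, so the comparison succeeds on equal values.

```java
import java.math.BigDecimal;

public class DoubleCompare {
    public static void main(String[] args) {
        // Both sides of the filter (value = '3.14') are converted to double
        // before comparing, mirroring Hive's implicit conversion rules.
        BigDecimal decimalValue = new BigDecimal("3.14"); // e.g. a decimal(5,2) column value
        String literal = "3.14";                          // the string constant in the filter

        double left = decimalValue.doubleValue();
        double right = Double.parseDouble(literal);

        System.out.println(left == right); // prints "true"
    }
}
```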

> Type is not determined correctly for comparison between decimal column and 
> string constant
> --
>
> Key: HIVE-18434
> URL: https://issues.apache.org/jira/browse/HIVE-18434
> Project: Hive
>  Issue Type: Bug
>  Components: Types
>Reporter: Ashutosh Chauhan
>Assignee: Ashutosh Chauhan
> Attachments: HIVE-18434.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-18434) Type is not determined correctly for comparison between decimal column and string constant

2018-01-11 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16322251#comment-16322251
 ] 

Xuefu Zhang commented on HIVE-18434:


Hi [~ashutoshc], thanks for working on this. Do you mind providing details on 
the behavior without your patch here? Maybe you can share the console output 
of one of the test cases you have in the patch. Thanks.

> Type is not determined correctly for comparison between decimal column and 
> string constant
> --
>
> Key: HIVE-18434
> URL: https://issues.apache.org/jira/browse/HIVE-18434
> Project: Hive
>  Issue Type: Bug
>  Components: Types
>Reporter: Ashutosh Chauhan
>Assignee: Ashutosh Chauhan
> Attachments: HIVE-18434.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-14162) Allow disabling of long running job on Hive On Spark On YARN

2018-01-10 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16321254#comment-16321254
 ] 

Xuefu Zhang commented on HIVE-14162:


I don't quite follow why existing configurations cannot meet what you need. 
Session and operation timeouts are designed for the purpose described. I don't 
see why we need a new timeout.

> Allow disabling of long running job on Hive On Spark On YARN
> 
>
> Key: HIVE-14162
> URL: https://issues.apache.org/jira/browse/HIVE-14162
> Project: Hive
>  Issue Type: New Feature
>  Components: Spark
>Reporter: Thomas Scott
>Assignee: Aihua Xu
>Priority: Minor
> Attachments: HIVE-14162.1.patch
>
>
> Hive On Spark launches a long running process on the first query to handle 
> all queries for that user session. In some use cases this is not desired, for 
> instance when using Hue with large intervals between query executions.
> Could we have a property that would cause long running spark jobs to be 
> terminated after each query execution and started again for the next one?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-18368) Improve Spark Debug RDD Graph

2018-01-09 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16319245#comment-16319245
 ] 

Xuefu Zhang commented on HIVE-18368:


Hi [~stakiar], thanks for working on this. I think this is very useful. I 
haven't looked at the patch, but I have a couple of high-level questions:

1. Can we get rid of code references such as {{at 
repartitionAndSortWithinPartitions at SortByShuffler.java:57}}? They don't seem 
useful.
2. Can you clarify the format of an RDD specification as shown in each line 
of the output? Besides the code reference, I'm not entirely sure what the 
other elements mean. For instance, I see many "[]" out there.
3. We have several internal object graphs, from the Work graph, to SparkTran, and 
to RDD. We can skip SparkTran entirely, but we need a clear mapping from 
Work to RDD. Maybe reading the patch will give me the idea.


> Improve Spark Debug RDD Graph
> -
>
> Key: HIVE-18368
> URL: https://issues.apache.org/jira/browse/HIVE-18368
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
> Attachments: HIVE-18368.1.patch, HIVE-18368.2.patch, Spark UI - Named 
> RDDs.png
>
>
> The {{SparkPlan}} class does some logging to show the mapping between 
> different {{SparkTran}}, what shuffle types are used, and what trans are 
> cached. However, there is room for improvement.
> When debug logging is enabled the RDD graph is logged, but there isn't much 
> information printed about each RDD.
> We should combine both of the graphs and improve them. We could even make the 
> Spark Plan graph part of the {{EXPLAIN EXTENDED}} output.
> Ideally, the final graph shows a clear relationship between Tran objects, 
> RDDs, and BaseWorks. Edges should include information about the number of 
> partitions, shuffle types, Spark operations used, etc.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-16484) Investigate SparkLauncher for HoS as alternative to bin/spark-submit

2018-01-09 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16319187#comment-16319187
 ] 

Xuefu Zhang commented on HIVE-16484:


Cool. 
{quote}
Once that gets started I'll work on using InProcessLauncher in a new 
SparkClient.
{quote}
Another option is to use InProcessLauncher to replace the current code in 
SparkClientImpl that invokes the SparkSubmit class directly, which isn't used much 
anyway. We can organize it in whatever way makes the code cleaner.

> Investigate SparkLauncher for HoS as alternative to bin/spark-submit
> 
>
> Key: HIVE-16484
> URL: https://issues.apache.org/jira/browse/HIVE-16484
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
> Attachments: HIVE-16484.1.patch, HIVE-16484.10.patch, 
> HIVE-16484.2.patch, HIVE-16484.3.patch, HIVE-16484.4.patch, 
> HIVE-16484.5.patch, HIVE-16484.6.patch, HIVE-16484.7.patch, 
> HIVE-16484.8.patch, HIVE-16484.9.patch
>
>
> The {{SparkClientImpl#startDriver}} currently looks for the {{SPARK_HOME}} 
> directory and invokes the {{bin/spark-submit}} script, which spawns a 
> separate process to run the Spark application.
> {{SparkLauncher}} was added in SPARK-4924 and is a programmatic way to launch 
> Spark applications.
> I see a few advantages:
> * No need to spawn a separate process to launch a HoS --> lower startup time
> * Simplifies the code in {{SparkClientImpl}} --> easier to debug
> * {{SparkLauncher#startApplication}} returns a {{SparkAppHandle}} which 
> contains some useful utilities for querying the state of the Spark job
> ** It also allows the launcher to specify a list of job listeners



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (HIVE-16484) Investigate SparkLauncher for HoS as alternative to bin/spark-submit

2018-01-08 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16317336#comment-16317336
 ] 

Xuefu Zhang edited comment on HIVE-16484 at 1/8/18 11:57 PM:
-

[~stakiar], I'm not denying the potential benefits we might get; I'm 
totally up for them. However, I wouldn't feel comfortable replacing a critical 
code path that's proven to work with something that's completely new. For this, 
a fallback is much better than a sheer replacement.


was (Author: xuefuz):
[~stakiar], I'm not denying the potential benefits we might get, for which I'm 
totally up for them. However, I wouldn't feel comfortable to replace a code 
path that's proven working with something completely new. For this, a fallback 
is much better than a sheer replacement.

> Investigate SparkLauncher for HoS as alternative to bin/spark-submit
> 
>
> Key: HIVE-16484
> URL: https://issues.apache.org/jira/browse/HIVE-16484
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
> Attachments: HIVE-16484.1.patch, HIVE-16484.10.patch, 
> HIVE-16484.2.patch, HIVE-16484.3.patch, HIVE-16484.4.patch, 
> HIVE-16484.5.patch, HIVE-16484.6.patch, HIVE-16484.7.patch, 
> HIVE-16484.8.patch, HIVE-16484.9.patch
>
>
> The {{SparkClientImpl#startDriver}} currently looks for the {{SPARK_HOME}} 
> directory and invokes the {{bin/spark-submit}} script, which spawns a 
> separate process to run the Spark application.
> {{SparkLauncher}} was added in SPARK-4924 and is a programmatic way to launch 
> Spark applications.
> I see a few advantages:
> * No need to spawn a separate process to launch a HoS --> lower startup time
> * Simplifies the code in {{SparkClientImpl}} --> easier to debug
> * {{SparkLauncher#startApplication}} returns a {{SparkAppHandle}} which 
> contains some useful utilities for querying the state of the Spark job
> ** It also allows the launcher to specify a list of job listeners



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-16484) Investigate SparkLauncher for HoS as alternative to bin/spark-submit

2018-01-08 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16317336#comment-16317336
 ] 

Xuefu Zhang commented on HIVE-16484:


[~stakiar], I'm not denying the potential benefits we might get; I'm 
totally up for them. However, I wouldn't feel comfortable replacing a code 
path that's proven to work with something completely new. For this, a fallback 
is much better than a sheer replacement.

> Investigate SparkLauncher for HoS as alternative to bin/spark-submit
> 
>
> Key: HIVE-16484
> URL: https://issues.apache.org/jira/browse/HIVE-16484
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
> Attachments: HIVE-16484.1.patch, HIVE-16484.10.patch, 
> HIVE-16484.2.patch, HIVE-16484.3.patch, HIVE-16484.4.patch, 
> HIVE-16484.5.patch, HIVE-16484.6.patch, HIVE-16484.7.patch, 
> HIVE-16484.8.patch, HIVE-16484.9.patch
>
>
> The {{SparkClientImpl#startDriver}} currently looks for the {{SPARK_HOME}} 
> directory and invokes the {{bin/spark-submit}} script, which spawns a 
> separate process to run the Spark application.
> {{SparkLauncher}} was added in SPARK-4924 and is a programmatic way to launch 
> Spark applications.
> I see a few advantages:
> * No need to spawn a separate process to launch a HoS --> lower startup time
> * Simplifies the code in {{SparkClientImpl}} --> easier to debug
> * {{SparkLauncher#startApplication}} returns a {{SparkAppHandle}} which 
> contains some useful utilities for querying the state of the Spark job
> ** It also allows the launcher to specify a list of job listeners



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-16484) Investigate SparkLauncher for HoS as alternative to bin/spark-submit

2018-01-08 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16317034#comment-16317034
 ] 

Xuefu Zhang commented on HIVE-16484:


I'd echo [~lirui] in wondering what benefits the proposal brings. While I 
only gave the patch a brief look, from the conversations I found that 
SparkLauncher doesn't really offer all the advantages listed in the 
description. Rather, it brings uncertainty and possible stability issues to 
Hive.

We have been using HoS with spark-submit in production. While it has 
some imperfections (like launching a dummy process), it works for us. I'd feel 
nervous about a completely different code path that is so critical. Moreover, 
security-related functionality will need more testing at the least.

Having said that, I'd suggest we keep the existing implementation of Spark job 
submission. If we want to test out SparkLauncher, I think we can use it to 
replace the other code path where the class {{org.apache.spark.deploy.SparkSubmit}} 
is directly invoked (if that makes sense at all).

When SparkLauncher becomes mature and capable of replacing {{bin/spark-submit}} 
with the promised benefits, we can make a switch in later releases, which 
hopefully brings no impact to Hive on Spark users.

> Investigate SparkLauncher for HoS as alternative to bin/spark-submit
> 
>
> Key: HIVE-16484
> URL: https://issues.apache.org/jira/browse/HIVE-16484
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
> Attachments: HIVE-16484.1.patch, HIVE-16484.10.patch, 
> HIVE-16484.2.patch, HIVE-16484.3.patch, HIVE-16484.4.patch, 
> HIVE-16484.5.patch, HIVE-16484.6.patch, HIVE-16484.7.patch, 
> HIVE-16484.8.patch, HIVE-16484.9.patch
>
>
> The {{SparkClientImpl#startDriver}} currently looks for the {{SPARK_HOME}} 
> directory and invokes the {{bin/spark-submit}} script, which spawns a 
> separate process to run the Spark application.
> {{SparkLauncher}} was added in SPARK-4924 and is a programmatic way to launch 
> Spark applications.
> I see a few advantages:
> * No need to spawn a separate process to launch a HoS --> lower startup time
> * Simplifies the code in {{SparkClientImpl}} --> easier to debug
> * {{SparkLauncher#startApplication}} returns a {{SparkAppHandle}} which 
> contains some useful utilities for querying the state of the Spark job
> ** It also allows the launcher to specify a list of job listeners



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (HIVE-18304) datediff() UDF returns a wrong result when dealing with a (date, string) input

2017-12-19 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16297688#comment-16297688
 ] 

Xuefu Zhang edited comment on HIVE-18304 at 12/20/17 12:20 AM:
---

[~lirui] Can you comment on this issue? Changes here might need to be 
compatible with the timezone stuff you worked on previously. Thanks.


was (Author: xuefuz):
[~lirui] Can you comment on this issue? This might be related to the timezone 
stuff you worked on previously. Thanks.

> datediff() UDF returns a wrong result when dealing with a (date, string) input
> --
>
> Key: HIVE-18304
> URL: https://issues.apache.org/jira/browse/HIVE-18304
> Project: Hive
>  Issue Type: Bug
>  Components: UDF
>Reporter: Hengyu Dai
>Assignee: Hengyu Dai
>Priority: Minor
> Attachments: 0001.patch
>
>
> For a date-type argument, datediff() uses DateConverter to convert the input to 
> a java Date object; 
> for example, '2017-12-18' becomes 2017-12-18T00:00:00.000+0800.
> For a string-type argument, datediff() uses TextConverter to convert the string 
> to a date;
> for '2012-01-01' we get 2012-01-01T08:00:00.000+0800.
> As a result, datediff() returns a number less than the real date diff.
> We should use TextConverter to handle date input too.
> reproduce:
> {code:java}
> select datediff(cast('2017-12-18' as date), '2012-01-01'); --2177
> select datediff('2017-12-18', '2012-01-01'); --2178
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-18304) datediff() UDF returns a wrong result when dealing with a (date, string) input

2017-12-19 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16297688#comment-16297688
 ] 

Xuefu Zhang commented on HIVE-18304:


[~lirui] Can you comment on this issue? This might be related to the timezone 
stuff you worked on previously. Thanks.

> datediff() UDF returns a wrong result when dealing with a (date, string) input
> --
>
> Key: HIVE-18304
> URL: https://issues.apache.org/jira/browse/HIVE-18304
> Project: Hive
>  Issue Type: Bug
>  Components: UDF
>Reporter: Hengyu Dai
>Assignee: Hengyu Dai
>Priority: Minor
> Attachments: 0001.patch
>
>
> For a date-type argument, datediff() uses DateConverter to convert the input to 
> a java Date object; 
> for example, '2017-12-18' becomes 2017-12-18T00:00:00.000+0800.
> For a string-type argument, datediff() uses TextConverter to convert the string 
> to a date;
> for '2012-01-01' we get 2012-01-01T08:00:00.000+0800.
> As a result, datediff() returns a number less than the real date diff.
> We should use TextConverter to handle date input too.
> reproduce:
> {code:java}
> select datediff(cast('2017-12-18' as date), '2012-01-01'); --2177
> select datediff('2017-12-18', '2012-01-01'); --2178
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-18283) Better error message and error code for HoS exceptions

2017-12-19 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16297116#comment-16297116
 ] 

Xuefu Zhang commented on HIVE-18283:


+1

> Better error message and error code for HoS exceptions
> --
>
> Key: HIVE-18283
> URL: https://issues.apache.org/jira/browse/HIVE-18283
> Project: Hive
>  Issue Type: Improvement
>  Components: Spark
>Reporter: Chao Sun
>Assignee: Chao Sun
> Attachments: HIVE-18283.0.patch, HIVE-18283.1.patch, 
> HIVE-18283.2.patch, HIVE-18283.3.patch
>
>
> Right now HoS only uses a few error codes. For the majority of errors, the 
> user will see error code 1 followed by a lengthy stacktrace. This is not 
> ideal since:
> 1. It is often hard to find the root cause - sometimes it is hidden deep 
> inside the stacktrace.
> 2. After identifying the root cause, it is not easy to find a fix. Often users 
> have to copy & paste the error message and google it. 
> 3. It is not clear whether the error is transient, based on which the 
> user may decide whether to retry the query. 
> To improve the above, this JIRA proposes to assign error codes & canonical 
> error messages for different HoS errors. We can take advantage of the 
> existing {{ErrorMsg}} class.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-18291) An exception should be raised if the result is outside the range of decimal

2017-12-19 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16297106#comment-16297106
 ] 

Xuefu Zhang commented on HIVE-18291:


I think Hive's current behavior is well established, understood, and accepted, 
and I don't see the need to change it just because of a certain standard, especially 
when such a change alters the default behavior. Please note that the standard 
changes too!

Returning NULL in the said case is by design as well. Hive's decimal in fact 
follows MySQL more closely, though the implementation borrowed a lot from SQL 
Server.

When we test a query on a certain DB, we need to note that a DB server may offer 
different modes, such as the strict mode in MySQL 
(https://dev.mysql.com/doc/refman/5.7/en/sql-mode.html), that dictate error 
handling. Data errors such as divide-by-zero throw an exception in strict mode; 
otherwise, NULL is returned.

Since Hive doesn't have a server strict mode, returning NULL for the case here 
is quite reasonable. If one would like to make the behavior configurable, 
introducing different modes in HS2 would be a more appropriate approach.

Thus, I would be -0 on introducing SQL compliance for this, but certainly -1 on 
changing the default behavior.
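To make the NULL-on-overflow behavior concrete, here is a minimal Java sketch (an illustration under my own assumptions, not Hive's actual implementation; {{multiplyOrNull}} is a hypothetical helper): when the exact result needs more digits than the type's maximum precision, NULL is returned instead of raising an exception.

```java
import java.math.BigDecimal;

public class DecimalOverflow {
    // Hypothetical helper mirroring the discussed behavior: return null when
    // the exact product does not fit within the maximum precision.
    static BigDecimal multiplyOrNull(BigDecimal a, BigDecimal b, int maxPrecision) {
        BigDecimal result = a.multiply(b);
        return result.precision() <= maxPrecision ? result : null;
    }

    public static void main(String[] args) {
        int hiveMaxPrecision = 38; // Hive decimals hold at most 38 digits

        // A product that fits is returned as-is.
        System.out.println(multiplyOrNull(new BigDecimal("2"), new BigDecimal("3"), hiveMaxPrecision));

        // Squaring a 20-digit number needs about 40 digits, exceeding the
        // maximum precision, so the sketch returns null rather than throwing.
        BigDecimal big = new BigDecimal("99999999999999999999");
        System.out.println(multiplyOrNull(big, big, hiveMaxPrecision));
    }
}
```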

> An exception should be raised if the result is outside the range of decimal
> ---
>
> Key: HIVE-18291
> URL: https://issues.apache.org/jira/browse/HIVE-18291
> Project: Hive
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Marco Gaido
>Assignee: Daniel Voros
>
> Citing SQL:2011 on page 27 available at 
> http://standards.iso.org/ittf/PubliclyAvailableStandards/c053681_ISO_IEC_9075-1_2011.zip:
> {noformat}
> If the result cannot be represented exactly in the result type, then whether 
> it is rounded
> or truncated is implementation-defined. An exception condition is raised if 
> the result is
> outside the range of numeric values of the result type, or if the arithmetic 
> operation
> is not defined for the operands.
> {noformat}
> Currently Hive is returning NULL instead of throwing an exception if the 
> result is out of range, eg.:
> {code}
> > select 100.01*100.01;
> +---+
> |  _c0  |
> +---+
> | NULL  |
> +---+
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-18283) Better error message and error code for HoS exceptions

2017-12-18 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16295898#comment-16295898
 ] 

Xuefu Zhang commented on HIVE-18283:


The patch looks good to me, except for a minor improvement: a faster match() using 
precompiled patterns.
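For reference, the precompiled-pattern idea looks roughly like the following Java sketch (the pattern, class, and method names are illustrative, not taken from the HIVE-18283 patch): compiling the {{Pattern}} once, e.g. in a static field, avoids recompiling the regex on every {{String.matches()}} call.

```java
import java.util.regex.Pattern;

public class PrecompiledMatch {
    // Compile once and reuse; String.matches() would recompile this regex
    // on every invocation.
    private static final Pattern OOM_PATTERN =
            Pattern.compile(".*java\\.lang\\.OutOfMemoryError.*");

    public static boolean isOutOfMemory(String stackTrace) {
        return OOM_PATTERN.matcher(stackTrace).matches();
    }

    public static void main(String[] args) {
        System.out.println(isOutOfMemory("... java.lang.OutOfMemoryError: Java heap space")); // prints "true"
        System.out.println(isOutOfMemory("an ordinary log line"));                            // prints "false"
    }
}
```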

> Better error message and error code for HoS exceptions
> --
>
> Key: HIVE-18283
> URL: https://issues.apache.org/jira/browse/HIVE-18283
> Project: Hive
>  Issue Type: Improvement
>  Components: Spark
>Reporter: Chao Sun
>Assignee: Chao Sun
> Attachments: HIVE-18283.0.patch, HIVE-18283.1.patch, 
> HIVE-18283.2.patch
>
>
> Right now HoS only use a few error codes. For the majority of the errors, 
> user will see an error code 1 followed by a lengthy stacktrace. This is not 
> ideal since:
> 1. It is often hard to find the root cause - sometimes it is hidden deeply 
> inside the stacktrace.
> 2. After identifying the root cause, it is not easy to find a fix. Often user 
> have to copy & paste the error message and google them. 
> 3. It is not clear whether the error is transient or not, depending on which 
> user may want to retry the query. 
> To improve the above, this JIRA propose to assign error code & canonical 
> error messages for different HoS errors. We can take advantage of the 
> existing {{ErrorMsg}} class.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17486) Enable SharedWorkOptimizer in tez on HOS

2017-12-11 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16286811#comment-16286811
 ] 

Xuefu Zhang commented on HIVE-17486:


I meant that if FIL[52] and FIL[53] are the same in your example, then we should 
break after the filter op for the M-M split. Looking forward to your complete 
design doc for this. Thanks.

> Enable SharedWorkOptimizer in tez on HOS
> 
>
> Key: HIVE-17486
> URL: https://issues.apache.org/jira/browse/HIVE-17486
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang
>Assignee: liyunzhang
> Attachments: HIVE-17486.1.patch, explain.28.share.false, 
> explain.28.share.true, scanshare.after.svg, scanshare.before.svg
>
>
> In HIVE-16602, shared scans were implemented with Tez.
> Given a query plan, the goal is to identify scans on input tables that can be 
> merged so the data is read only once. Optimization will be carried out at the 
> physical level.  In Hive on Spark, it caches the result of spark work if the 
> spark work is used by more than 1 child spark work. After sharedWorkOptimizer 
> is enabled in physical plan in HoS, the identical table scans are merged to 1 
> table scan. The result of this table scan will be used by more than 1 child 
> spark work. Thus we need not do the same computation, because of the cache mechanism.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17486) Enable SharedWorkOptimizer in tez on HOS

2017-12-10 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16285550#comment-16285550
 ] 

Xuefu Zhang commented on HIVE-17486:


[~kellyzly], thanks for working on this. I'm not sure we should just look at 
the TS to determine whether to generate M-M-R. It seems we can do so whenever 
a TS is connected to multiple RSs. The split point should happen at the fork. 
I'm not sure what's the best way to apply the optimization rules, but if you 
look at SparkProcessAnalyzeTable, it has an if statement to check whether it's an 
analyze table command. If not, it doesn't do anything. Thus, you can have a 
super rule that covers both analyze table and the new rule you're adding.

> Enable SharedWorkOptimizer in tez on HOS
> 
>
> Key: HIVE-17486
> URL: https://issues.apache.org/jira/browse/HIVE-17486
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang
>Assignee: liyunzhang
> Attachments: HIVE-17486.1.patch, explain.28.share.false, 
> explain.28.share.true, scanshare.after.svg, scanshare.before.svg
>
>
> In HIVE-16602, shared scans were implemented with Tez.
> Given a query plan, the goal is to identify scans on input tables that can be 
> merged so the data is read only once. Optimization will be carried out at the 
> physical level.  In Hive on Spark, it caches the result of spark work if the 
> spark work is used by more than 1 child spark work. After sharedWorkOptimizer 
> is enabled in physical plan in HoS, the identical table scans are merged to 1 
> table scan. The result of this table scan will be used by more than 1 child 
> spark work. Thus we need not do the same computation, because of the cache mechanism.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17486) Enable SharedWorkOptimizer in tez on HOS

2017-12-07 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16282964#comment-16282964
 ] 

Xuefu Zhang commented on HIVE-17486:


Hi [~kellyzly], I think this thread is getting a little long and the 
problem doesn't seem trivial. Could you please create a doc that describes the 
problem or feature we are addressing, along with your proposal? That would 
probably be easier to communicate. Thanks. 

> Enable SharedWorkOptimizer in tez on HOS
> 
>
> Key: HIVE-17486
> URL: https://issues.apache.org/jira/browse/HIVE-17486
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang
>Assignee: liyunzhang
> Attachments: HIVE-17486.1.patch, explain.28.share.false, 
> explain.28.share.true, scanshare.after.svg, scanshare.before.svg
>
>
> In HIVE-16602, shared scans were implemented with Tez.
> Given a query plan, the goal is to identify scans on input tables that can be 
> merged so the data is read only once. Optimization will be carried out at the 
> physical level.  In Hive on Spark, it caches the result of spark work if the 
> spark work is used by more than 1 child spark work. After sharedWorkOptimizer 
> is enabled in physical plan in HoS, the identical table scans are merged to 1 
> table scan. The result of this table scan will be used by more than 1 child 
> spark work. Thus we need not do the same computation, because of the cache mechanism.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17964) HoS: some spark configs doesn't require re-creating a session

2017-11-17 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16257222#comment-16257222
 ] 

Xuefu Zhang commented on HIVE-17964:


+1

> HoS: some spark configs doesn't require re-creating a session
> -
>
> Key: HIVE-17964
> URL: https://issues.apache.org/jira/browse/HIVE-17964
> Project: Hive
>  Issue Type: Improvement
>  Components: Spark
>Reporter: Rui Li
>Assignee: Rui Li
>Priority: Minor
> Attachments: HIVE-17964.1.patch, HIVE-17964.2.patch
>
>
> I guess the {{hive.spark.}} configs were initially intended for the RSC. 
> Therefore when they're changed, we'll re-create the session for them to take 
> effect. There're some configs not related to RSC that also start with 
> {{hive.spark.}}. We'd better rename them so that we don't unnecessarily 
> re-create sessions, which is usually time consuming.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-18030) HCatalog can't be used with Pig on Spark

2017-11-17 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16257212#comment-16257212
 ] 

Xuefu Zhang commented on HIVE-18030:


[~szita], as you noted,
{quote}
This feature was working previously by mapred.task.id property being set for 
Pig on MR/Tez jobs. In Spark mode this property is not used...
{quote}
I think setting {{mapred.task.id}} in Pig on Spark is consistent with what's 
done for MR and Tez. Setting this property is not necessarily just for either 
Hive or Pig itself, but for downstream applications, as a backward-compatibility measure.

> HCatalog can't be used with Pig on Spark
> 
>
> Key: HIVE-18030
> URL: https://issues.apache.org/jira/browse/HIVE-18030
> Project: Hive
>  Issue Type: Bug
>  Components: HCatalog
>Reporter: Adam Szita
>Assignee: Adam Szita
> Attachments: HIVE-18030.0.patch
>
>
> When using Pig on Spark in cluster mode, all queries containing HCatalog 
> access are failing:
> {code}
> 2017-11-03 12:39:19,268 [dispatcher-event-loop-19] INFO  
> org.apache.spark.storage.BlockManagerInfo - Added broadcast_6_piece0 in 
> memory on <>:<> (size: 83.0 KB, free: 408.5 
> MB)
> 2017-11-03 12:39:19,277 [task-result-getter-0] WARN  
> org.apache.spark.scheduler.TaskSetManager - Lost task 0.0 in stage 0.0 (TID 
> 0, <>, executor 2): java.lang.NullPointerException
>   at org.apache.hadoop.security.Credentials.addAll(Credentials.java:401)
>   at org.apache.hadoop.security.Credentials.addAll(Credentials.java:388)
>   at 
> org.apache.hive.hcatalog.pig.HCatLoader.setLocation(HCatLoader.java:128)
>   at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.mergeSplitSpecificConf(PigInputFormat.java:147)
>   at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat$RecordReaderFactory.<init>(PigInputFormat.java:115)
>   at 
> org.apache.pig.backend.hadoop.executionengine.spark.running.PigInputFormatSpark$SparkRecordReaderFactory.<init>(PigInputFormatSpark.java:126)
>   at 
> org.apache.pig.backend.hadoop.executionengine.spark.running.PigInputFormatSpark.createRecordReader(PigInputFormatSpark.java:70)
>   at 
> org.apache.spark.rdd.NewHadoopRDD$$anon$1.liftedTree1$1(NewHadoopRDD.scala:180)
>   at 
> org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:179)
>   at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:134)
>   at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:69)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>   at org.apache.spark.scheduler.Task.run(Task.scala:108)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17964) HoS: some spark configs doesn't require re-creating a session

2017-11-14 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16252858#comment-16252858
 ] 

Xuefu Zhang commented on HIVE-17964:


That's good then.

> HoS: some spark configs doesn't require re-creating a session
> -
>
> Key: HIVE-17964
> URL: https://issues.apache.org/jira/browse/HIVE-17964
> Project: Hive
>  Issue Type: Improvement
>  Components: Spark
>Reporter: Rui Li
>Assignee: Rui Li
>Priority: Minor
> Attachments: HIVE-17964.1.patch, HIVE-17964.2.patch
>
>
> I guess the {{hive.spark.}} configs were initially intended for the RSC. 
> Therefore when they're changed, we'll re-create the session for them to take 
> effect. There're some configs not related to RSC that also start with 
> {{hive.spark.}}. We'd better rename them so that we don't unnecessarily 
> re-create sessions, which is usually time consuming.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17964) HoS: some spark configs doesn't require re-creating a session

2017-11-14 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16252653#comment-16252653
 ] 

Xuefu Zhang commented on HIVE-17964:


[~lirui] What are your thoughts on configurations such as {{spark.driver.cores}}?

> HoS: some spark configs doesn't require re-creating a session
> -
>
> Key: HIVE-17964
> URL: https://issues.apache.org/jira/browse/HIVE-17964
> Project: Hive
>  Issue Type: Improvement
>  Components: Spark
>Reporter: Rui Li
>Assignee: Rui Li
>Priority: Minor
> Attachments: HIVE-17964.1.patch, HIVE-17964.2.patch
>
>
> I guess the {{hive.spark.}} configs were initially intended for the RSC. 
> Therefore when they're changed, we'll re-create the session for them to take 
> effect. There're some configs not related to RSC that also start with 
> {{hive.spark.}}. We'd better rename them so that we don't unnecessarily 
> re-create sessions, which is usually time consuming.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17964) HoS: some spark configs doesn't require re-creating a session

2017-11-13 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16250420#comment-16250420
 ] 

Xuefu Zhang commented on HIVE-17964:


+1

> HoS: some spark configs doesn't require re-creating a session
> -
>
> Key: HIVE-17964
> URL: https://issues.apache.org/jira/browse/HIVE-17964
> Project: Hive
>  Issue Type: Improvement
>  Components: Spark
>Reporter: Rui Li
>Assignee: Rui Li
>Priority: Minor
> Attachments: HIVE-17964.1.patch
>
>
> I guess the {{hive.spark.}} configs were initially intended for the RSC. 
> Therefore when they're changed, we'll re-create the session for them to take 
> effect. There're some configs not related to RSC that also start with 
> {{hive.spark.}}. We'd better rename them so that we don't unnecessarily 
> re-create sessions, which is usually time consuming.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (HIVE-17964) HoS: some spark configs doesn't require re-creating a session

2017-11-13 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16250420#comment-16250420
 ] 

Xuefu Zhang edited comment on HIVE-17964 at 11/13/17 10:43 PM:
---

+1 pending on tests


was (Author: xuefuz):
+1

> HoS: some spark configs doesn't require re-creating a session
> -
>
> Key: HIVE-17964
> URL: https://issues.apache.org/jira/browse/HIVE-17964
> Project: Hive
>  Issue Type: Improvement
>  Components: Spark
>Reporter: Rui Li
>Assignee: Rui Li
>Priority: Minor
> Attachments: HIVE-17964.1.patch
>
>
> I guess the {{hive.spark.}} configs were initially intended for the RSC. 
> Therefore when they're changed, we'll re-create the session for them to take 
> effect. There're some configs not related to RSC that also start with 
> {{hive.spark.}}. We'd better rename them so that we don't unnecessarily 
> re-create sessions, which is usually time consuming.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17976) HoS: don't set output collector if there's no data to process

2017-11-13 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16249734#comment-16249734
 ] 

Xuefu Zhang commented on HIVE-17976:


Sounds good. Thanks for the explanation, [~lirui].

> HoS: don't set output collector if there's no data to process
> -
>
> Key: HIVE-17976
> URL: https://issues.apache.org/jira/browse/HIVE-17976
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Rui Li
>Assignee: Rui Li
>Priority: Minor
> Attachments: HIVE-17976.1.patch, HIVE-17976.2.patch
>
>
> MR doesn't set an output collector if no row is processed, i.e. 
> {{ExecMapper::map}} is never called. Let's investigate whether Spark should 
> do the same.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17976) HoS: don't set output collector if there's no data to process

2017-11-12 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16249133#comment-16249133
 ] 

Xuefu Zhang commented on HIVE-17976:


Patch looks good to me. +1

[~lirui] Do you know why setting the OutputCollector at init time generates an 
empty row when there are no input rows? Is it due to operator.close()? 

> HoS: don't set output collector if there's no data to process
> -
>
> Key: HIVE-17976
> URL: https://issues.apache.org/jira/browse/HIVE-17976
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Rui Li
>Assignee: Rui Li
>Priority: Minor
> Attachments: HIVE-17976.1.patch, HIVE-17976.2.patch
>
>
> MR doesn't set an output collector if no row is processed, i.e. 
> {{ExecMapper::map}} is never called. Let's investigate whether Spark should 
> do the same.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17976) HoS: don't set output collector if there's no data to process

2017-11-09 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16247079#comment-16247079
 ] 

Xuefu Zhang commented on HIVE-17976:


[~lirui] Thanks for working on this. I will take a look.

> HoS: don't set output collector if there's no data to process
> -
>
> Key: HIVE-17976
> URL: https://issues.apache.org/jira/browse/HIVE-17976
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Rui Li
>Assignee: Rui Li
>Priority: Minor
> Attachments: HIVE-17976.1.patch, HIVE-17976.2.patch
>
>
> MR doesn't set an output collector if no row is processed, i.e. 
> {{ExecMapper::map}} is never called. Let's investigate whether Spark should 
> do the same.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (HIVE-17964) HoS: some spark configs doesn't require re-creating a session

2017-11-09 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16246136#comment-16246136
 ] 

Xuefu Zhang edited comment on HIVE-17964 at 11/9/17 5:47 PM:
-

{quote}
I think renaming a bunch of configs is not very user friendly. Maybe we should 
differentiate these configs in our code.
{quote}
+1. Probably we can add a new param that enlists the params that require a 
session refresh.
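
One way to express this is a predicate that consults an allowlist of 
session-affecting keys, so that only those keys force a session re-create. The 
class name and the listed keys below are illustrative, not Hive's actual 
implementation:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class SparkConfRefresh {
    // Hypothetical allowlist: only these hive.spark.* settings affect the
    // remote Spark context, so only they should force a session re-create.
    static final Set<String> RSC_CONFS = new HashSet<>(Arrays.asList(
            "hive.spark.client.connect.timeout",
            "hive.spark.client.server.connect.timeout"));

    // spark.* settings are shipped to the Spark context at startup, so changing
    // them also requires a new session; other hive.spark.* keys do not.
    static boolean requiresNewSession(String key) {
        return key.startsWith("spark.") || RSC_CONFS.contains(key);
    }
}
```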


was (Author: xuefuz):
{quote}
I think renaming a bunch of configs is not very user friendly. Maybe we should 
differentiate these configs in our code.
{quote}
Probably we can add a new param that enlists the params that require a session 
refresh.

> HoS: some spark configs doesn't require re-creating a session
> -
>
> Key: HIVE-17964
> URL: https://issues.apache.org/jira/browse/HIVE-17964
> Project: Hive
>  Issue Type: Improvement
>Reporter: Rui Li
>Assignee: Rui Li
>Priority: Minor
>
> I guess the {{hive.spark.}} configs were initially intended for the RSC. 
> Therefore when they're changed, we'll re-create the session for them to take 
> effect. There're some configs not related to RSC that also start with 
> {{hive.spark.}}. We'd better rename them so that we don't unnecessarily 
> re-create sessions, which is usually time consuming.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17964) HoS: some spark configs doesn't require re-creating a session

2017-11-09 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16246136#comment-16246136
 ] 

Xuefu Zhang commented on HIVE-17964:


{quote}
I think renaming a bunch of configs is not very user friendly. Maybe we should 
differentiate these configs in our code.
{quote}
Probably we can add a new param that enlists the params that require a session 
refresh.

> HoS: some spark configs doesn't require re-creating a session
> -
>
> Key: HIVE-17964
> URL: https://issues.apache.org/jira/browse/HIVE-17964
> Project: Hive
>  Issue Type: Improvement
>Reporter: Rui Li
>Assignee: Rui Li
>Priority: Minor
>
> I guess the {{hive.spark.}} configs were initially intended for the RSC. 
> Therefore when they're changed, we'll re-create the session for them to take 
> effect. There're some configs not related to RSC that also start with 
> {{hive.spark.}}. We'd better rename them so that we don't unnecessarily 
> re-create sessions, which is usually time consuming.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-18030) HCatalog can't be used with Pig on Spark

2017-11-09 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16246089#comment-16246089
 ] 

Xuefu Zhang commented on HIVE-18030:


Patch looks fine. However, I'm wondering if it would be better to set 
mapred.task.id in Pig on Spark, as Hive on Spark sets it already.

> HCatalog can't be used with Pig on Spark
> 
>
> Key: HIVE-18030
> URL: https://issues.apache.org/jira/browse/HIVE-18030
> Project: Hive
>  Issue Type: Bug
>  Components: HCatalog
>Reporter: Adam Szita
>Assignee: Adam Szita
> Attachments: HIVE-18030.0.patch
>
>
> When using Pig on Spark in cluster mode, all queries containing HCatalog 
> access are failing:
> {code}
> 2017-11-03 12:39:19,268 [dispatcher-event-loop-19] INFO  
> org.apache.spark.storage.BlockManagerInfo - Added broadcast_6_piece0 in 
> memory on <>:<> (size: 83.0 KB, free: 408.5 
> MB)
> 2017-11-03 12:39:19,277 [task-result-getter-0] WARN  
> org.apache.spark.scheduler.TaskSetManager - Lost task 0.0 in stage 0.0 (TID 
> 0, <>, executor 2): java.lang.NullPointerException
>   at org.apache.hadoop.security.Credentials.addAll(Credentials.java:401)
>   at org.apache.hadoop.security.Credentials.addAll(Credentials.java:388)
>   at 
> org.apache.hive.hcatalog.pig.HCatLoader.setLocation(HCatLoader.java:128)
>   at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.mergeSplitSpecificConf(PigInputFormat.java:147)
>   at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat$RecordReaderFactory.<init>(PigInputFormat.java:115)
>   at 
> org.apache.pig.backend.hadoop.executionengine.spark.running.PigInputFormatSpark$SparkRecordReaderFactory.<init>(PigInputFormatSpark.java:126)
>   at 
> org.apache.pig.backend.hadoop.executionengine.spark.running.PigInputFormatSpark.createRecordReader(PigInputFormatSpark.java:70)
>   at 
> org.apache.spark.rdd.NewHadoopRDD$$anon$1.liftedTree1$1(NewHadoopRDD.scala:180)
>   at 
> org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:179)
>   at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:134)
>   at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:69)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>   at org.apache.spark.scheduler.Task.run(Task.scala:108)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-18009) Multiple lateral view query is slow on hive on spark

2017-11-08 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16244952#comment-16244952
 ] 

Xuefu Zhang commented on HIVE-18009:


[~aihuaxu] Thanks for working on this improvement. Patch looks good. +1. If you 
ever need to update the patch, one minor suggestion: could you rename the 
variable {{opSet}} to something like {{visited}} to be more meaningful? Never 
mind if the patch goes in smoothly.
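
The fix amounts to a depth-first walk that skips operators already expanded 
through another parent, so a diamond-shaped tree is traversed in linear rather 
than exponential time. A minimal sketch with a hypothetical Node type (not 
Hive's Operator API):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DiamondWalk {
    static class Node {
        final List<Node> children = new ArrayList<>();
        int visits = 0;
    }

    // Without the visited set, a chain of n diamonds is expanded 2^n times;
    // with it, each node is visited once no matter how many parents it has.
    static void walk(Node n, Set<Node> visited) {
        if (!visited.add(n)) {
            return;          // already expanded via another parent
        }
        n.visits++;
        for (Node child : n.children) {
            walk(child, visited);
        }
    }
}
```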

> Multiple lateral view query is slow on hive on spark
> 
>
> Key: HIVE-18009
> URL: https://issues.apache.org/jira/browse/HIVE-18009
> Project: Hive
>  Issue Type: Improvement
>  Components: Spark
>Affects Versions: 3.0.0
>Reporter: Aihua Xu
>Assignee: Aihua Xu
> Attachments: HIVE-18009.1.patch, HIVE-18009.2.patch
>
>
> When running a query with multiple lateral views, HoS is busy with 
> compilation. GenSparkUtils has an inefficient implementation of 
> getChildOperator when we have a diamond hierarchy in the operator tree (lateral 
> view in this case), since a node may be visited multiple times.
> {noformat}
> at 
> org.apache.hadoop.hive.ql.parse.spark.GenSparkUtils.getChildOperator(GenSparkUtils.java:442)
>   at 
> org.apache.hadoop.hive.ql.parse.spark.GenSparkUtils.getChildOperator(GenSparkUtils.java:438)
>   at 
> org.apache.hadoop.hive.ql.parse.spark.GenSparkUtils.getChildOperator(GenSparkUtils.java:438)
>   at 
> org.apache.hadoop.hive.ql.parse.spark.GenSparkUtils.getChildOperator(GenSparkUtils.java:438)
>   at 
> org.apache.hadoop.hive.ql.parse.spark.GenSparkUtils.getChildOperator(GenSparkUtils.java:438)
>   at 
> org.apache.hadoop.hive.ql.parse.spark.GenSparkUtils.getChildOperator(GenSparkUtils.java:438)
>   at 
> org.apache.hadoop.hive.ql.parse.spark.GenSparkUtils.getChildOperator(GenSparkUtils.java:438)
>   at 
> org.apache.hadoop.hive.ql.parse.spark.GenSparkUtils.getChildOperator(GenSparkUtils.java:438)
>   at 
> org.apache.hadoop.hive.ql.parse.spark.GenSparkUtils.getChildOperator(GenSparkUtils.java:438)
>   at 
> org.apache.hadoop.hive.ql.parse.spark.GenSparkUtils.getChildOperator(GenSparkUtils.java:438)
>   at 
> org.apache.hadoop.hive.ql.parse.spark.GenSparkUtils.getChildOperator(GenSparkUtils.java:438)
>   at 
> org.apache.hadoop.hive.ql.parse.spark.GenSparkUtils.getChildOperator(GenSparkUtils.java:438)
>   at 
> org.apache.hadoop.hive.ql.parse.spark.GenSparkUtils.getChildOperator(GenSparkUtils.java:438)
>   at 
> org.apache.hadoop.hive.ql.parse.spark.GenSparkUtils.getChildOperator(GenSparkUtils.java:438)
>   at 
> org.apache.hadoop.hive.ql.parse.spark.GenSparkUtils.getChildOperator(GenSparkUtils.java:438)
>   at 
> org.apache.hadoop.hive.ql.parse.spark.GenSparkUtils.getChildOperator(GenSparkUtils.java:438)
>   at 
> org.apache.hadoop.hive.ql.parse.spark.GenSparkUtils.getChildOperator(GenSparkUtils.java:438)
>   at 
> org.apache.hadoop.hive.ql.parse.spark.GenSparkUtils.getChildOperator(GenSparkUtils.java:438)
>   at 
> org.apache.hadoop.hive.ql.parse.spark.GenSparkUtils.getChildOperator(GenSparkUtils.java:438)
>   at 
> org.apache.hadoop.hive.ql.parse.spark.GenSparkUtils.getChildOperator(GenSparkUtils.java:438)
>   at 
> org.apache.hadoop.hive.ql.parse.spark.GenSparkUtils.getChildOperator(GenSparkUtils.java:438)
>   at 
> org.apache.hadoop.hive.ql.parse.spark.GenSparkUtils.getChildOperator(GenSparkUtils.java:438)
>   at 
> org.apache.hadoop.hive.ql.parse.spark.GenSparkUtils.getChildOperator(GenSparkUtils.java:438)
>   at 
> org.apache.hadoop.hive.ql.parse.spark.GenSparkUtils.getChildOperator(GenSparkUtils.java:438)
>   at 
> org.apache.hadoop.hive.ql.parse.spark.GenSparkUtils.getChildOperator(GenSparkUtils.java:438)
>   at 
> org.apache.hadoop.hive.ql.parse.spark.GenSparkUtils.getChildOperator(GenSparkUtils.java:438)
>   at 
> org.apache.hadoop.hive.ql.parse.spark.GenSparkUtils.getChildOperator(GenSparkUtils.java:438)
>   at 
> org.apache.hadoop.hive.ql.parse.spark.GenSparkUtils.getChildOperator(GenSparkUtils.java:438)
>   at 
> org.apache.hadoop.hive.ql.parse.spark.GenSparkUtils.getChildOperator(GenSparkUtils.java:438)
>   at 
> org.apache.hadoop.hive.ql.parse.spark.GenSparkUtils.getChildOperator(GenSparkUtils.java:438)
>   at 
> org.apache.hadoop.hive.ql.parse.spark.GenSparkUtils.getChildOperator(GenSparkUtils.java:438)
>   at 
> org.apache.hadoop.hive.ql.parse.spark.GenSparkUtils.getChildOperator(GenSparkUtils.java:438)
>   at 
> org.apache.hadoop.hive.ql.parse.spark.GenSparkUtils.getChildOperator(GenSparkUtils.java:438)
>   at 
> org.apache.hadoop.hive.ql.parse.spark.GenSparkUtils.getChildOperator(GenSparkUtils.java:438)
>   at 
> 

[jira] [Commented] (HIVE-17486) Enable SharedWorkOptimizer in tez on HOS

2017-11-02 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16236087#comment-16236087
 ] 

Xuefu Zhang commented on HIVE-17486:


[~lirui] The reason was briefly given at 
https://issues.apache.org/jira/browse/HIVE-8920?focusedCommentId=14260846=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14260846.
 I was dealing with the IOContext initialization issue.

> Enable SharedWorkOptimizer in tez on HOS
> 
>
> Key: HIVE-17486
> URL: https://issues.apache.org/jira/browse/HIVE-17486
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang
>Assignee: liyunzhang
>Priority: Major
> Attachments: scanshare.after.svg, scanshare.before.svg
>
>
> in HIVE-16602, Implement shared scans with Tez.
> Given a query plan, the goal is to identify scans on input tables that can be 
> merged so the data is read only once. Optimization will be carried out at the 
> physical level.  In Hive on Spark, it caches the result of spark work if the 
> spark work is used by more than 1 child spark work. After sharedWorkOptimizer 
> is enabled in physical plan in HoS, the identical table scans are merged to 1 
> table scan. This result of table scan will be used by more 1 child spark 
> work. Thus we need not do the same computation because of cache mechanism.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17486) Enable SharedWorkOptimizer in tez on HOS

2017-11-01 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16235137#comment-16235137
 ] 

Xuefu Zhang commented on HIVE-17486:


[~kellyzly] I think M->M->R is possible. It's just that the current planner 
doesn't do this, but in theory it can be done. Currently the assumption is that 
a Map task is always followed by a Reduce task. 

> Enable SharedWorkOptimizer in tez on HOS
> 
>
> Key: HIVE-17486
> URL: https://issues.apache.org/jira/browse/HIVE-17486
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang
>Assignee: liyunzhang
>Priority: Major
> Attachments: scanshare.after.svg, scanshare.before.svg
>
>
> in HIVE-16602, Implement shared scans with Tez.
> Given a query plan, the goal is to identify scans on input tables that can be 
> merged so the data is read only once. Optimization will be carried out at the 
> physical level.  In Hive on Spark, it caches the result of spark work if the 
> spark work is used by more than 1 child spark work. After sharedWorkOptimizer 
> is enabled in physical plan in HoS, the identical table scans are merged to 1 
> table scan. This result of table scan will be used by more 1 child spark 
> work. Thus we need not do the same computation because of cache mechanism.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17486) Enable SharedWorkOptimizer in tez on HOS

2017-11-01 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16235085#comment-16235085
 ] 

Xuefu Zhang commented on HIVE-17486:


Hi [~kellyzly], I think your observation is correct. Spark has certain 
limitations. In fact, the edge theory doesn't even apply to Spark, which uses 
the RDD model. Internally Hive translates the DAG into RDD operations 
(transformations and actions). In the example of (Map1->Reducer3, 
Map1->Reducer2), Hive on Spark actually has a plan like (map12 -> reduce2, 
map13 -> reduce3) with map12 = map13. This way, there will be two Spark jobs. In 
the second job, the cached result is used instead of loading the data again. 
BTW, this is a multi-insert example.

Multiple edges between two vertices are even less feasible. You might be able 
to turn this optimization on for Spark, but Spark might not be able to run it. 
I'm not sure if there is any case where this optimization would help Spark. My 
gut feeling is that it needs to be combined with Spark RDD caching or Hive's 
materialized views.

> Enable SharedWorkOptimizer in tez on HOS
> 
>
> Key: HIVE-17486
> URL: https://issues.apache.org/jira/browse/HIVE-17486
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang
>Assignee: liyunzhang
>Priority: Major
> Attachments: scanshare.after.svg, scanshare.before.svg
>
>
> in HIVE-16602, Implement shared scans with Tez.
> Given a query plan, the goal is to identify scans on input tables that can be 
> merged so the data is read only once. Optimization will be carried out at the 
> physical level.  In Hive on Spark, it caches the result of spark work if the 
> spark work is used by more than 1 child spark work. After sharedWorkOptimizer 
> is enabled in physical plan in HoS, the identical table scans are merged to 1 
> table scan. This result of table scan will be used by more 1 child spark 
> work. Thus we need not do the same computation because of cache mechanism.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2017-10-24 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16217431#comment-16217431
 ] 

Xuefu Zhang commented on HIVE-15104:


+1

> Hive on Spark generate more shuffle data than hive on mr
> 
>
> Key: HIVE-15104
> URL: https://issues.apache.org/jira/browse/HIVE-15104
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Affects Versions: 1.2.1
>Reporter: wangwenli
>Assignee: Rui Li
> Attachments: HIVE-15104.1.patch, HIVE-15104.10.patch, 
> HIVE-15104.2.patch, HIVE-15104.3.patch, HIVE-15104.4.patch, 
> HIVE-15104.5.patch, HIVE-15104.6.patch, HIVE-15104.7.patch, 
> HIVE-15104.8.patch, HIVE-15104.9.patch, TPC-H 100G.xlsx
>
>
> the same SQL, running on the Spark and MR engines, will generate different 
> sizes of shuffle data.
> I think it is because Hive on MR serializes only part of the HiveKey, while 
> Hive on Spark, which uses Kryo, serializes the full HiveKey object.
> What is your opinion?
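
The size gap can be reproduced in miniature with plain Java serialization: a 
key type whose auxiliary fields are marked transient serializes smaller than 
one that writes everything, which is roughly the difference between writing 
only the key bytes and serializing the whole object with a general-purpose 
serializer like Kryo. The classes below are illustrative, not Hive's HiveKey:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class KeySize {
    // A key that serializes every field, including auxiliary state.
    static class FullKey implements Serializable {
        byte[] bytes;
        int hashCode;              // serialized along with the payload
        FullKey(byte[] b, int h) { bytes = b; hashCode = h; }
    }

    // A key whose auxiliary state is excluded from the serialized form,
    // analogous to a Writable that emits only the raw key bytes.
    static class SlimKey implements Serializable {
        byte[] bytes;
        transient int hashCode;    // not written to the stream
        SlimKey(byte[] b, int h) { bytes = b; hashCode = h; }
    }

    // Measure how many bytes an object occupies after Java serialization.
    static int serializedSize(Object o) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            ObjectOutputStream oos = new ObjectOutputStream(bos);
            oos.writeObject(o);
            oos.close();
            return bos.size();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```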



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17684) HoS memory issues with MapJoinMemoryExhaustionHandler

2017-10-23 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16216000#comment-16216000
 ] 

Xuefu Zhang commented on HIVE-17684:


We don't see this issue often, possibly because our settings are conservative. 
Because of the dynamic nature of GC and the possibility of different tasks 
running concurrently in an executor, completely avoiding this problem might be 
very hard.

When we do have a memory issue while loading the hash map into memory, it's 
usually because the plan was wrong and the map join isn't the right choice. 
For this, I think it might make sense to keep track of the size of the hash 
map when it's written to disk. If it goes beyond a threshold (such as the value 
of noconditional.size), fail the task right away rather than failing later when 
loading the table into memory.
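
The proposal above can be sketched as a running-total guard that fails fast 
once the spilled small table exceeds the map-join threshold. The class name and 
threshold handling are illustrative, not Hive's actual code:

```java
public class SpillGuard {
    // Bytes of the small table written to disk so far.
    private long written = 0;
    // Threshold, e.g. the value of hive.auto.convert.join.noconditionaltask.size.
    private final long maxBytes;

    SpillGuard(long maxBytes) {
        this.maxBytes = maxBytes;
    }

    // Call once per spilled chunk; fail the task as soon as the running total
    // exceeds the threshold, instead of failing later when loading into memory.
    void record(long chunkBytes) {
        written += chunkBytes;
        if (written > maxBytes) {
            throw new IllegalStateException(
                "small table exceeds map-join threshold: " + written + " > " + maxBytes);
        }
    }
}
```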

> HoS memory issues with MapJoinMemoryExhaustionHandler
> -
>
> Key: HIVE-17684
> URL: https://issues.apache.org/jira/browse/HIVE-17684
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
>
> We have seen a number of memory issues due the {{HashSinkOperator}} use of 
> the {{MapJoinMemoryExhaustionHandler}}. This handler is meant to detect 
> scenarios where the small table is taking too much space in memory, in which 
> case a {{MapJoinMemoryExhaustionError}} is thrown.
> The configs to control this logic are:
> {{hive.mapjoin.localtask.max.memory.usage}} (default 0.90)
> {{hive.mapjoin.followby.gby.localtask.max.memory.usage}} (default 0.55)
> The handler works by using the {{MemoryMXBean}} and uses the following logic 
> to estimate how much memory the {{HashMap}} is consuming: 
> {{MemoryMXBean#getHeapMemoryUsage().getUsed() / 
> MemoryMXBean#getHeapMemoryUsage().getMax()}}
> The issue is that {{MemoryMXBean#getHeapMemoryUsage().getUsed()}} can be 
> inaccurate. The value returned by this method returns all reachable and 
> unreachable memory on the heap, so there may be a bunch of garbage data, and 
> the JVM just hasn't taken the time to reclaim it all. This can lead to 
> intermittent failures of this check even though a simple GC would have 
> reclaimed enough space for the process to continue working.
> We should re-think the usage of {{MapJoinMemoryExhaustionHandler}} for HoS. 
> In Hive-on-MR this probably made sense to use because every Hive task was run 
> in a dedicated container, so a Hive Task could assume it created most of the 
> data on the heap. However, in Hive-on-Spark there can be multiple Hive Tasks 
> running in a single executor, each doing different things.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (HIVE-17868) Make queries in spark_local_queries.q have deterministic output

2017-10-23 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16215443#comment-16215443
 ] 

Xuefu Zhang edited comment on HIVE-17868 at 10/23/17 4:53 PM:
--

Makes sense, [~asherman]. Thanks for the explanation.

+1


was (Author: xuefuz):
Makes sense, [~asherman]. Thanks for the explanation.

> Make queries in spark_local_queries.q have deterministic output
> ---
>
> Key: HIVE-17868
> URL: https://issues.apache.org/jira/browse/HIVE-17868
> Project: Hive
>  Issue Type: Bug
>Reporter: Andrew Sherman
>Assignee: Andrew Sherman
> Attachments: HIVE-17868.1.patch
>
>
> Add 'order by' to queries so that output is always the same



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17868) Make queries in spark_local_queries.q have deterministic output

2017-10-23 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16215443#comment-16215443
 ] 

Xuefu Zhang commented on HIVE-17868:


Makes sense, [~asherman]. Thanks for the explanation.

> Make queries in spark_local_queries.q have deterministic output
> ---
>
> Key: HIVE-17868
> URL: https://issues.apache.org/jira/browse/HIVE-17868
> Project: Hive
>  Issue Type: Bug
>Reporter: Andrew Sherman
>Assignee: Andrew Sherman
> Attachments: HIVE-17868.1.patch
>
>
> Add 'order by' to queries so that output is always the same



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-16601) Display Session Id and Query Name / Id in Spark UI

2017-10-23 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16215254#comment-16215254
 ] 

Xuefu Zhang commented on HIVE-16601:


The new screenshot looks great! +1 on that. I didn't review the code, but I 
think it's fine since other folks have reviewed it.

> Display Session Id and Query Name / Id in Spark UI
> --
>
> Key: HIVE-16601
> URL: https://issues.apache.org/jira/browse/HIVE-16601
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
> Attachments: HIVE-16601.1.patch, HIVE-16601.2.patch, 
> HIVE-16601.3.patch, HIVE-16601.4.patch, HIVE-16601.5.patch, 
> HIVE-16601.6.patch, HIVE-16601.7.patch, HIVE-16601.8.patch, Spark UI 
> Applications List.png, Spark UI Jobs List.png
>
>
> We should display the session id for each HoS Application Launched, and the 
> Query Name / Id and Dag Id for each Spark job launched. Hive-on-MR does 
> something similar via the {{mapred.job.name}} parameter. The query name is 
> displayed in the Job Name of the MR app.
> The changes here should also allow us to leverage the config 
> {{hive.query.name}} for HoS.
> This should help with debuggability of HoS applications. The Hive-on-Tez UI 
> does something similar.
> Related issues for Hive-on-Tez: HIVE-12357, HIVE-12523



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17868) Make queries in spark_local_queries.q have deterministic output

2017-10-20 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16213178#comment-16213178
 ] 

Xuefu Zhang commented on HIVE-17868:


Adding order by might slow down the test. The preferable way is to add the tag 
"--sort_query_result" (or something like that).

> Make queries in spark_local_queries.q have deterministic output
> ---
>
> Key: HIVE-17868
> URL: https://issues.apache.org/jira/browse/HIVE-17868
> Project: Hive
>  Issue Type: Bug
>Reporter: Andrew Sherman
>Assignee: Andrew Sherman
>
> Add 'order by' to queries so that output is always the same



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-16601) Display Session Id and Query Name / Id in Spark UI

2017-10-19 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16212089#comment-16212089
 ] 

Xuefu Zhang commented on HIVE-16601:


Thanks for the update. Personally, I like the way the app name is formatted. 
However, the job group portion is less readable. Formatting the job group in a 
similar way to the app name would be great. (Instead of just "", maybe 
we can have "query_id="). Thoughts?

> Display Session Id and Query Name / Id in Spark UI
> --
>
> Key: HIVE-16601
> URL: https://issues.apache.org/jira/browse/HIVE-16601
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
> Attachments: HIVE-16601.1.patch, HIVE-16601.2.patch, 
> HIVE-16601.3.patch, HIVE-16601.4.patch, HIVE-16601.5.patch, 
> HIVE-16601.6.patch, Spark UI Applications List.png, Spark UI Jobs List.png
>
>
> We should display the session id for each HoS Application Launched, and the 
> Query Name / Id and Dag Id for each Spark job launched. Hive-on-MR does 
> something similar via the {{mapred.job.name}} parameter. The query name is 
> displayed in the Job Name of the MR app.
> The changes here should also allow us to leverage the config 
> {{hive.query.name}} for HoS.
> This should help with debuggability of HoS applications. The Hive-on-Tez UI 
> does something similar.
> Related issues for Hive-on-Tez: HIVE-12357, HIVE-12523



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-16601) Display Session Id and Query Name / Id in Spark UI

2017-10-19 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16211277#comment-16211277
 ] 

Xuefu Zhang commented on HIVE-16601:


Thanks for working on this. Could we have updated screenshots for the 
improvement? Thanks.

> Display Session Id and Query Name / Id in Spark UI
> --
>
> Key: HIVE-16601
> URL: https://issues.apache.org/jira/browse/HIVE-16601
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
> Attachments: HIVE-16601.1.patch, HIVE-16601.2.patch, 
> HIVE-16601.3.patch, HIVE-16601.4.patch, HIVE-16601.5.patch, 
> HIVE-16601.6.patch, Spark UI Applications List.png, Spark UI Jobs List.png
>
>
> We should display the session id for each HoS Application Launched, and the 
> Query Name / Id and Dag Id for each Spark job launched. Hive-on-MR does 
> something similar via the {{mapred.job.name}} parameter. The query name is 
> displayed in the Job Name of the MR app.
> The changes here should also allow us to leverage the config 
> {{hive.query.name}} for HoS.
> This should help with debuggability of HoS applications. The Hive-on-Tez UI 
> does something similar.
> Related issues for Hive-on-Tez: HIVE-12357, HIVE-12523



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17756) Enable subquery related Qtests for Hive on Spark

2017-10-16 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16206076#comment-16206076
 ] 

Xuefu Zhang commented on HIVE-17756:


I think it might be better if we fix the problem in a separate JIRA. 
[~dapengsun], could you look into the problem? Thanks.

> Enable subquery related Qtests for Hive on Spark
> 
>
> Key: HIVE-17756
> URL: https://issues.apache.org/jira/browse/HIVE-17756
> Project: Hive
>  Issue Type: Sub-task
>  Components: Logical Optimizer
>Reporter: Dapeng Sun
>Assignee: Dapeng Sun
> Fix For: 3.0.0
>
> Attachments: HIVE-17756.001.patch
>
>
> HIVE-15456 and HIVE-15192 use Calcite to decorrelate and plan subqueries. 
> This JIRA introduces subquery tests and verifies the subquery plans for 
> Hive on Spark.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2017-10-13 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16203867#comment-16203867
 ] 

Xuefu Zhang commented on HIVE-15104:


I think it's fairly safe to assume that hive-exec.jar and the new jar are in 
the same location. We can error out if the jar cannot be found in that location.
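
Resolving the new jar relative to hive-exec.jar's own location can be sketched 
roughly as below. This is an illustrative sketch, not the actual patch: the 
class and method names are made up, and it simply asks the class loader where a 
known class came from so a sibling jar can be looked up next to it.

```java
import java.security.CodeSource;

public class LocateJar {
    // Returns the filesystem path of the jar (or class directory) that a given
    // class was loaded from, or null for bootstrap classes with no CodeSource.
    static String jarPathOf(Class<?> cls) {
        CodeSource src = cls.getProtectionDomain().getCodeSource();
        return (src == null || src.getLocation() == null)
                ? null : src.getLocation().getPath();
    }

    public static void main(String[] args) {
        // Bootstrap classes such as java.lang.String report no CodeSource,
        // so locating a jar this way only works for application classes.
        System.out.println("String.class -> " + jarPathOf(String.class));
        System.out.println("LocateJar    -> " + jarPathOf(LocateJar.class));
    }
}
```

With a path like this in hand, one could error out early if the expected 
sibling jar does not exist at that location.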

> Hive on Spark generate more shuffle data than hive on mr
> 
>
> Key: HIVE-15104
> URL: https://issues.apache.org/jira/browse/HIVE-15104
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Affects Versions: 1.2.1
>Reporter: wangwenli
>Assignee: Rui Li
> Attachments: HIVE-15104.1.patch, HIVE-15104.2.patch, 
> HIVE-15104.3.patch, HIVE-15104.4.patch, HIVE-15104.5.patch, 
> HIVE-15104.5.patch, TPC-H 100G.xlsx
>
>
> The same SQL, running on the Spark and MR engines, will generate different 
> sizes of shuffle data.
> I think it is because Hive on MR serializes only part of the HiveKey, while 
> Hive on Spark, which uses Kryo, serializes the full HiveKey object.
> What is your opinion?
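
The size gap described above can be illustrated with a toy example. This is a 
hedged sketch: {{FakeHiveKey}} is a made-up stand-in for HiveKey, and JDK 
serialization stands in for any whole-object serializer such as Kryo. 
Writable-style serialization ships only the key bytes, while full-object 
serialization also ships class metadata and every extra field.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

public class KeySizeDemo {
    // Stand-in for HiveKey: raw key bytes plus extra fields (hash code,
    // distribution key length) that an MR Writable would not ship.
    static class FakeHiveKey implements Serializable {
        byte[] bytes;
        int hashCode;
        int distKeyLength;
        FakeHiveKey(byte[] b, int h, int d) { bytes = b; hashCode = h; distKeyLength = d; }
    }

    // MR-style serialization: only the payload length and the bytes go out.
    static int writableSize(FakeHiveKey k) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            DataOutputStream dos = new DataOutputStream(bos);
            dos.writeInt(k.bytes.length);
            dos.write(k.bytes);
            dos.flush();
            return bos.size();
        } catch (IOException e) { throw new UncheckedIOException(e); }
    }

    // Full-object serialization: class descriptor plus every field goes out.
    static int fullObjectSize(FakeHiveKey k) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            ObjectOutputStream oos = new ObjectOutputStream(bos);
            oos.writeObject(k);
            oos.flush();
            return bos.size();
        } catch (IOException e) { throw new UncheckedIOException(e); }
    }

    public static void main(String[] args) {
        FakeHiveKey k = new FakeHiveKey("some-key".getBytes(), 42, 8);
        System.out.println("writable-style bytes: " + writableSize(k));
        System.out.println("full-object bytes:    " + fullObjectSize(k));
    }
}
```

Even for this tiny key, the full-object form is several times larger, which is 
consistent with the extra shuffle data reported on the Spark engine.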



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (HIVE-17756) Enable subquery related Qtests for Hive on Spark

2017-10-13 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated HIVE-17756:
---
   Resolution: Fixed
Fix Version/s: 3.0.0
   Status: Resolved  (was: Patch Available)

Patch committed to master. Thanks to Dapeng for the contribution.

> Enable subquery related Qtests for Hive on Spark
> 
>
> Key: HIVE-17756
> URL: https://issues.apache.org/jira/browse/HIVE-17756
> Project: Hive
>  Issue Type: Sub-task
>  Components: Logical Optimizer
>Reporter: Dapeng Sun
>Assignee: Dapeng Sun
> Fix For: 3.0.0
>
> Attachments: HIVE-17756.001.patch
>
>
> HIVE-15456 and HIVE-15192 use Calcite to decorrelate and plan subqueries. 
> This JIRA introduces subquery tests and verifies the subquery plans for 
> Hive on Spark.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2017-10-12 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16202897#comment-16202897
 ] 

Xuefu Zhang commented on HIVE-15104:


Hi [~lirui], to locate the jar, can we assume that the jar is located somewhere 
in Hive's installation path? I'm not sure where (Hive, spark-submit, or remote 
driver) we need to find the location of the jar.

> Hive on Spark generate more shuffle data than hive on mr
> 
>
> Key: HIVE-15104
> URL: https://issues.apache.org/jira/browse/HIVE-15104
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Affects Versions: 1.2.1
>Reporter: wangwenli
>Assignee: Rui Li
> Attachments: HIVE-15104.1.patch, HIVE-15104.2.patch, 
> HIVE-15104.3.patch, HIVE-15104.4.patch, HIVE-15104.5.patch, 
> HIVE-15104.5.patch, TPC-H 100G.xlsx
>
>
> The same SQL, running on the Spark and MR engines, will generate different 
> sizes of shuffle data.
> I think it is because Hive on MR serializes only part of the HiveKey, while 
> Hive on Spark, which uses Kryo, serializes the full HiveKey object.
> What is your opinion?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17786) JdbcConnectionParams set exact host and port in Utils.java

2017-10-12 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16201962#comment-16201962
 ] 

Xuefu Zhang commented on HIVE-17786:


+1 pending on test.

> JdbcConnectionParams set exact host and port in Utils.java
> --
>
> Key: HIVE-17786
> URL: https://issues.apache.org/jira/browse/HIVE-17786
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Saijin Huang
>Assignee: Saijin Huang
>Priority: Minor
> Attachments: HIVE-17786.1.patch
>
>
> In Utils.java, lines 557 and 558, connParams.setHost and connParams.setPort 
> should be set to the exact values.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17756) Enable subquery related Qtests for Hive on Spark

2017-10-11 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16200622#comment-16200622
 ] 

Xuefu Zhang commented on HIVE-17756:


+1

> Enable subquery related Qtests for Hive on Spark
> 
>
> Key: HIVE-17756
> URL: https://issues.apache.org/jira/browse/HIVE-17756
> Project: Hive
>  Issue Type: Sub-task
>  Components: Logical Optimizer
>Reporter: Dapeng Sun
>Assignee: Dapeng Sun
> Attachments: HIVE-17756.001.patch
>
>
> HIVE-15456 and HIVE-15192 use Calcite to decorrelate and plan subqueries. 
> This JIRA introduces subquery tests and verifies the subquery plans for 
> Hive on Spark.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17111) Add TestLocalSparkCliDriver

2017-10-09 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16197213#comment-16197213
 ] 

Xuefu Zhang commented on HIVE-17111:


Thanks, guys! This is cool.
+1

> Add TestLocalSparkCliDriver
> ---
>
> Key: HIVE-17111
> URL: https://issues.apache.org/jira/browse/HIVE-17111
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
> Attachments: HIVE-17111.1.patch
>
>
> The TestSparkCliDriver sets spark.master to local-cluster[2,2,1024], but 
> HoS still decides to use the RemoteHiveSparkClient rather than the 
> LocalHiveSparkClient.
> The issue is with the following check in HiveSparkClientFactory:
> {code}
> if (master.equals("local") || master.startsWith("local[")) {
>   // With local spark context, all user sessions share the same spark context.
>   return LocalHiveSparkClient.getInstance(generateSparkConf(sparkConf));
> } else {
>   return new RemoteHiveSparkClient(hiveconf, sparkConf);
> }
> {code}
> It checks the value of spark.master, sees that it doesn't start with 
> {{local[}}, and then decides to use the RemoteHiveSparkClient.
> We should fix this so that the LocalHiveSparkClient is used. It should speed 
> up some of the tests, and also make qtests easier to debug since everything 
> will now be run in the same process.
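
The mismatch is easy to see by running the quoted condition against the 
test driver's master string. A minimal sketch (the method name here is 
illustrative, not Hive's actual API):

```java
public class MasterCheckDemo {
    // Mirrors the routing check from HiveSparkClientFactory quoted above:
    // a master of "local" or "local[...]" selects the in-process client.
    static boolean usesLocalClient(String master) {
        return master.equals("local") || master.startsWith("local[");
    }

    public static void main(String[] args) {
        // "local-cluster[2,2,1024]" starts with "local-", not "local[",
        // so the check falls through to the remote client.
        System.out.println(usesLocalClient("local[4]"));                // true
        System.out.println(usesLocalClient("local-cluster[2,2,1024]")); // false
    }
}
```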



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17111) Add TestLocalSparkCliDriver

2017-10-06 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16195239#comment-16195239
 ] 

Xuefu Zhang commented on HIVE-17111:


I think this is good. However, I have two questions:

1. The other test driver classes are generated, so I'm wondering why we have a 
non-generated class for TestLocalSparkCliDriver.
2. With this change, are we able to run any .q test using this test driver 
class?

> Add TestLocalSparkCliDriver
> ---
>
> Key: HIVE-17111
> URL: https://issues.apache.org/jira/browse/HIVE-17111
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
> Attachments: HIVE-17111.1.patch
>
>
> The TestSparkCliDriver sets spark.master to local-cluster[2,2,1024], but 
> HoS still decides to use the RemoteHiveSparkClient rather than the 
> LocalHiveSparkClient.
> The issue is with the following check in HiveSparkClientFactory:
> {code}
> if (master.equals("local") || master.startsWith("local[")) {
>   // With local spark context, all user sessions share the same spark context.
>   return LocalHiveSparkClient.getInstance(generateSparkConf(sparkConf));
> } else {
>   return new RemoteHiveSparkClient(hiveconf, sparkConf);
> }
> {code}
> It checks the value of spark.master, sees that it doesn't start with 
> {{local[}}, and then decides to use the RemoteHiveSparkClient.
> We should fix this so that the LocalHiveSparkClient is used. It should speed 
> up some of the tests, and also make qtests easier to debug since everything 
> will now be run in the same process.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17586) Make HS2 BackgroundOperationPool not fixed

2017-09-26 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16181938#comment-16181938
 ] 

Xuefu Zhang commented on HIVE-17586:


[~lirui] Thanks for sharing your findings. I think a variable pool size and 
allowing core threads to idle out are two different aspects of thread pooling. 
They don't contradict each other. Plus, the patch doesn't change any 
behavior; it just makes the threadpool more general.

On the other hand, it does bring up an interesting point about the object/memory 
leaks that I observed in our production cluster. I guess I will go back and 
investigate a little bit more.

> Make HS2 BackgroundOperationPool not fixed
> --
>
> Key: HIVE-17586
> URL: https://issues.apache.org/jira/browse/HIVE-17586
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2
>Affects Versions: 1.1.0
>Reporter: Xuefu Zhang
>Assignee: Xuefu Zhang
> Attachments: HIVE-17586.1.patch, HIVE-17586.patch
>
>
> Currently the threadpool for background asynchronous operations has a fixed 
> size controlled by {{hive.server2.async.exec.threads}}. However, the thread 
> factory supplied for this threadpool is {{ThreadFactoryWithGarbageCleanup}}, 
> which creates ThreadWithGarbageCleanup. Since this is a fixed threadpool, the 
> threads are actually never killed, defeating the purpose of the garbage 
> cleanup noted in the thread class name. On the other hand, since these threads 
> never go away, significant resources such as threadlocal variables 
> (classloaders, hiveconfs, etc.) are held even when no operation is running. 
> This can lead to escalated HS2 memory usage.
> Ideally, the threadpool should not be fixed, allowing threads to die out so 
> resources can be reclaimed. The existing config 
> {{hive.server2.async.exec.threads}} is treated as the max, and we can add a 
> min for the threadpool: {{hive.server2.async.exec.min.threads}}. The default 
> value for this config is -1, which keeps the existing behavior.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (HIVE-17586) Make HS2 BackgroundOperationPool not fixed

2017-09-26 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16181919#comment-16181919
 ] 

Xuefu Zhang edited comment on HIVE-17586 at 9/27/17 3:06 AM:
-

Actually no, according to the Javadoc at 
https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ThreadPoolExecutor.html:

{quote}
If the pool currently has more than corePoolSize threads, excess threads will 
be terminated if they have been idle for more than the keepAliveTime (see 
getKeepAliveTime(java.util.concurrent.TimeUnit)).
{quote}
 
Because we use fixed pool size, there are no "excess threads" to idle out.


was (Author: xuefuz):
Actually no according java doc at 
https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ThreadPoolExecutor.html:

{quote}
If the pool currently has more than corePoolSize threads, excess threads will 
be terminated if they have been idle for more than the keepAliveTime (see 
getKeepAliveTime(java.util.concurrent.TimeUnit)).
{quote}
 
Because we use fixed pool size, there are no "excess threads" to idle out.

> Make HS2 BackgroundOperationPool not fixed
> --
>
> Key: HIVE-17586
> URL: https://issues.apache.org/jira/browse/HIVE-17586
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2
>Affects Versions: 1.1.0
>Reporter: Xuefu Zhang
>Assignee: Xuefu Zhang
> Attachments: HIVE-17586.1.patch, HIVE-17586.patch
>
>
> Currently the threadpool for background asynchronous operations has a fixed 
> size controlled by {{hive.server2.async.exec.threads}}. However, the thread 
> factory supplied for this threadpool is {{ThreadFactoryWithGarbageCleanup}}, 
> which creates ThreadWithGarbageCleanup. Since this is a fixed threadpool, the 
> threads are actually never killed, defeating the purpose of the garbage 
> cleanup noted in the thread class name. On the other hand, since these threads 
> never go away, significant resources such as threadlocal variables 
> (classloaders, hiveconfs, etc.) are held even when no operation is running. 
> This can lead to escalated HS2 memory usage.
> Ideally, the threadpool should not be fixed, allowing threads to die out so 
> resources can be reclaimed. The existing config 
> {{hive.server2.async.exec.threads}} is treated as the max, and we can add a 
> min for the threadpool: {{hive.server2.async.exec.min.threads}}. The default 
> value for this config is -1, which keeps the existing behavior.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17586) Make HS2 BackgroundOperationPool not fixed

2017-09-26 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16181919#comment-16181919
 ] 

Xuefu Zhang commented on HIVE-17586:


Actually no, according to the Javadoc at 
https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ThreadPoolExecutor.html:

{quote}
If the pool currently has more than corePoolSize threads, excess threads will 
be terminated if they have been idle for more than the keepAliveTime (see 
getKeepAliveTime(java.util.concurrent.TimeUnit)).
{quote}
 
Because we use fixed pool size, there are no "excess threads" to idle out.
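
One way to reconcile the two observations (a sketch of the alternative, not the 
patch itself, which instead adds a separate min size): even a pool with core == 
max can reclaim idle threads if core-thread timeout is enabled, since 
keepAliveTime otherwise never applies to core threads.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class ShrinkablePool {
    // Builds a "fixed"-size pool whose idle threads can still be reclaimed.
    // With core == max, the only way to let threads die is
    // allowCoreThreadTimeOut(true); otherwise keepAliveTime is ignored.
    static ThreadPoolExecutor build(int maxThreads, long idleSeconds) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                maxThreads, maxThreads,
                idleSeconds, TimeUnit.SECONDS,
                new LinkedBlockingQueue<>());
        pool.allowCoreThreadTimeOut(true); // core threads may now idle out
        return pool;
    }

    public static void main(String[] args) {
        ThreadPoolExecutor pool = build(4, 10);
        System.out.println(pool.allowsCoreThreadTimeOut()); // true
        pool.shutdown();
    }
}
```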

> Make HS2 BackgroundOperationPool not fixed
> --
>
> Key: HIVE-17586
> URL: https://issues.apache.org/jira/browse/HIVE-17586
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2
>Affects Versions: 1.1.0
>Reporter: Xuefu Zhang
>Assignee: Xuefu Zhang
> Attachments: HIVE-17586.1.patch, HIVE-17586.patch
>
>
> Currently the threadpool for background asynchronous operations has a fixed 
> size controlled by {{hive.server2.async.exec.threads}}. However, the thread 
> factory supplied for this threadpool is {{ThreadFactoryWithGarbageCleanup}}, 
> which creates ThreadWithGarbageCleanup. Since this is a fixed threadpool, the 
> threads are actually never killed, defeating the purpose of the garbage 
> cleanup noted in the thread class name. On the other hand, since these threads 
> never go away, significant resources such as threadlocal variables 
> (classloaders, hiveconfs, etc.) are held even when no operation is running. 
> This can lead to escalated HS2 memory usage.
> Ideally, the threadpool should not be fixed, allowing threads to die out so 
> resources can be reclaimed. The existing config 
> {{hive.server2.async.exec.threads}} is treated as the max, and we can add a 
> min for the threadpool: {{hive.server2.async.exec.min.threads}}. The default 
> value for this config is -1, which keeps the existing behavior.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17586) Make HS2 BackgroundOperationPool not fixed

2017-09-26 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16181857#comment-16181857
 ] 

Xuefu Zhang commented on HIVE-17586:


The above test failures don't seem related to the patch.

> Make HS2 BackgroundOperationPool not fixed
> --
>
> Key: HIVE-17586
> URL: https://issues.apache.org/jira/browse/HIVE-17586
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2
>Affects Versions: 1.1.0
>Reporter: Xuefu Zhang
>Assignee: Xuefu Zhang
> Attachments: HIVE-17586.1.patch, HIVE-17586.patch
>
>
> Currently the threadpool for background asynchronous operations has a fixed 
> size controlled by {{hive.server2.async.exec.threads}}. However, the thread 
> factory supplied for this threadpool is {{ThreadFactoryWithGarbageCleanup}}, 
> which creates ThreadWithGarbageCleanup. Since this is a fixed threadpool, the 
> threads are actually never killed, defeating the purpose of the garbage 
> cleanup noted in the thread class name. On the other hand, since these threads 
> never go away, significant resources such as threadlocal variables 
> (classloaders, hiveconfs, etc.) are held even when no operation is running. 
> This can lead to escalated HS2 memory usage.
> Ideally, the threadpool should not be fixed, allowing threads to die out so 
> resources can be reclaimed. The existing config 
> {{hive.server2.async.exec.threads}} is treated as the max, and we can add a 
> min for the threadpool: {{hive.server2.async.exec.min.threads}}. The default 
> value for this config is -1, which keeps the existing behavior.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (HIVE-17586) Make HS2 BackgroundOperationPool not fixed

2017-09-25 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated HIVE-17586:
---
Attachment: HIVE-17586.1.patch

> Make HS2 BackgroundOperationPool not fixed
> --
>
> Key: HIVE-17586
> URL: https://issues.apache.org/jira/browse/HIVE-17586
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2
>Affects Versions: 1.1.0
>Reporter: Xuefu Zhang
>Assignee: Xuefu Zhang
> Attachments: HIVE-17586.1.patch, HIVE-17586.patch
>
>
> Currently the threadpool for background asynchronous operations has a fixed 
> size controlled by {{hive.server2.async.exec.threads}}. However, the thread 
> factory supplied for this threadpool is {{ThreadFactoryWithGarbageCleanup}}, 
> which creates ThreadWithGarbageCleanup. Since this is a fixed threadpool, the 
> threads are actually never killed, defeating the purpose of the garbage 
> cleanup noted in the thread class name. On the other hand, since these threads 
> never go away, significant resources such as threadlocal variables 
> (classloaders, hiveconfs, etc.) are held even when no operation is running. 
> This can lead to escalated HS2 memory usage.
> Ideally, the threadpool should not be fixed, allowing threads to die out so 
> resources can be reclaimed. The existing config 
> {{hive.server2.async.exec.threads}} is treated as the max, and we can add a 
> min for the threadpool: {{hive.server2.async.exec.min.threads}}. The default 
> value for this config is -1, which keeps the existing behavior.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (HIVE-17586) Make HS2 BackgroundOperationPool not fixed

2017-09-25 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated HIVE-17586:
---
Attachment: HIVE-17586.patch

> Make HS2 BackgroundOperationPool not fixed
> --
>
> Key: HIVE-17586
> URL: https://issues.apache.org/jira/browse/HIVE-17586
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2
>Affects Versions: 1.1.0
>Reporter: Xuefu Zhang
>Assignee: Xuefu Zhang
> Attachments: HIVE-17586.patch
>
>
> Currently the threadpool for background asynchronous operations has a fixed 
> size controlled by {{hive.server2.async.exec.threads}}. However, the thread 
> factory supplied for this threadpool is {{ThreadFactoryWithGarbageCleanup}}, 
> which creates ThreadWithGarbageCleanup. Since this is a fixed threadpool, the 
> threads are actually never killed, defeating the purpose of the garbage 
> cleanup noted in the thread class name. On the other hand, since these threads 
> never go away, significant resources such as threadlocal variables 
> (classloaders, hiveconfs, etc.) are held even when no operation is running. 
> This can lead to escalated HS2 memory usage.
> Ideally, the threadpool should not be fixed, allowing threads to die out so 
> resources can be reclaimed. The existing config 
> {{hive.server2.async.exec.threads}} is treated as the max, and we can add a 
> min for the threadpool: {{hive.server2.async.exec.min.threads}}. The default 
> value for this config is -1, which keeps the existing behavior.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (HIVE-17586) Make HS2 BackgroundOperationPool not fixed

2017-09-25 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated HIVE-17586:
---
Status: Patch Available  (was: Open)

> Make HS2 BackgroundOperationPool not fixed
> --
>
> Key: HIVE-17586
> URL: https://issues.apache.org/jira/browse/HIVE-17586
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2
>Affects Versions: 1.1.0
>Reporter: Xuefu Zhang
>Assignee: Xuefu Zhang
> Attachments: HIVE-17586.patch
>
>
> Currently the threadpool for background asynchronous operations has a fixed 
> size controlled by {{hive.server2.async.exec.threads}}. However, the thread 
> factory supplied for this threadpool is {{ThreadFactoryWithGarbageCleanup}}, 
> which creates ThreadWithGarbageCleanup. Since this is a fixed threadpool, the 
> threads are actually never killed, defeating the purpose of the garbage 
> cleanup noted in the thread class name. On the other hand, since these threads 
> never go away, significant resources such as threadlocal variables 
> (classloaders, hiveconfs, etc.) are held even when no operation is running. 
> This can lead to escalated HS2 memory usage.
> Ideally, the threadpool should not be fixed, allowing threads to die out so 
> resources can be reclaimed. The existing config 
> {{hive.server2.async.exec.threads}} is treated as the max, and we can add a 
> min for the threadpool: {{hive.server2.async.exec.min.threads}}. The default 
> value for this config is -1, which keeps the existing behavior.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (HIVE-17586) Make HS2 BackgroundOperationPool not fixed

2017-09-22 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang reassigned HIVE-17586:
--


> Make HS2 BackgroundOperationPool not fixed
> --
>
> Key: HIVE-17586
> URL: https://issues.apache.org/jira/browse/HIVE-17586
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2
>Affects Versions: 1.1.0
>Reporter: Xuefu Zhang
>Assignee: Xuefu Zhang
>
> Currently the threadpool for background asynchronous operations has a fixed 
> size controlled by {{hive.server2.async.exec.threads}}. However, the thread 
> factory supplied for this threadpool is {{ThreadFactoryWithGarbageCleanup}}, 
> which creates ThreadWithGarbageCleanup. Since this is a fixed threadpool, the 
> threads are actually never killed, defeating the purpose of the garbage 
> cleanup noted in the thread class name. On the other hand, since these threads 
> never go away, significant resources such as threadlocal variables 
> (classloaders, hiveconfs, etc.) are held even when no operation is running. 
> This can lead to escalated HS2 memory usage.
> Ideally, the threadpool should not be fixed, allowing threads to die out so 
> resources can be reclaimed. The existing config 
> {{hive.server2.async.exec.threads}} is treated as the max, and we can add a 
> min for the threadpool: {{hive.server2.async.exec.min.threads}}. The default 
> value for this config is -1, which keeps the existing behavior.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (HIVE-17548) ThriftCliService reports inaccurate the number of current sessions in the log message

2017-09-17 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang reassigned HIVE-17548:
--

Assignee: Xuefu Zhang

> ThriftCliService reports inaccurate the number of current sessions in the log 
> message
> -
>
> Key: HIVE-17548
> URL: https://issues.apache.org/jira/browse/HIVE-17548
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2
>Affects Versions: 1.1.0
>Reporter: Xuefu Zhang
>Assignee: Xuefu Zhang
>
> Currently ThriftCliService uses an atomic integer to keep track of the number 
> of currently open sessions. It reports it through the following two log 
> messages:
> {code}
> 2017-09-18 04:14:31,722 INFO [HiveServer2-Handler-Pool: Thread-729979]: 
> org.apache.hive.service.cli.thrift.ThriftCLIService: Opened a session: 
> SessionHandle [99ec30d7-5c44-4a45-a8d6-0f0e7ecf4879], current sessions: 345
> 2017-09-18 04:14:41,926 INFO [HiveServer2-Handler-Pool: Thread-717542]: 
> org.apache.hive.service.cli.thrift.ThriftCLIService: Closed session: 
> SessionHandle [f38f7890-cba4-459c-872e-4c261b897e00], current sessions: 344
> {code}
> This assumes that all sessions are closed or opened through the Thrift API. 
> This assumption isn't correct because sessions may be closed by the server, 
> such as in the case of a timeout. Therefore, such log messages tend to 
> over-report the number of open sessions.
> In order to accurately report the number of outstanding sessions, session 
> manager should be consulted instead.
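
That idea can be sketched as below. This is a hypothetical illustration, not 
Hive's actual SessionManager API: the count is derived from the session map 
itself, so every close path, including server-side expiry, is reflected without 
a separate counter in the Thrift layer.

```java
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class SessionRegistry {
    // Derive the open-session count from the map of live sessions rather than
    // maintaining a separate AtomicInteger that only the Thrift path updates.
    private final ConcurrentMap<UUID, Object> sessions = new ConcurrentHashMap<>();

    UUID open() {
        UUID handle = UUID.randomUUID();
        sessions.put(handle, new Object());
        return handle;
    }

    void close(UUID handle) {            // close requested via the Thrift API...
        sessions.remove(handle);
    }

    void expire(UUID handle) {           // ...or by the server's timeout sweep
        sessions.remove(handle);
    }

    int openSessionCount() {             // always consistent with live sessions
        return sessions.size();
    }

    public static void main(String[] args) {
        SessionRegistry reg = new SessionRegistry();
        UUID a = reg.open();
        UUID b = reg.open();
        reg.expire(a);                   // server-side close, no Thrift call
        System.out.println(reg.openSessionCount()); // 1
    }
}
```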



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-14836) Test the predicate pushing down support for Parquet vectorization read path

2017-09-15 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16168207#comment-16168207
 ] 

Xuefu Zhang commented on HIVE-14836:


Hi [~Ferd], thanks for your patch. I'm a little confused. The JIRA title and 
the patch itself seem about adding tests, but the JIRA description suggests 
some feature. Could you clarify a little bit? Thanks.

> Test the predicate pushing down support for Parquet vectorization read path
> ---
>
> Key: HIVE-14836
> URL: https://issues.apache.org/jira/browse/HIVE-14836
> Project: Hive
>  Issue Type: Sub-task
>Affects Versions: 3.0.0
>Reporter: Ferdinand Xu
>Assignee: Ferdinand Xu
>  Labels: pull-request-available
> Attachments: HIVE-14836.patch
>
>
> Currently we filter blocks using predicate pushdown. We should support it 
> in the page reader as well to improve its efficiency.





[jira] [Resolved] (HIVE-7292) Hive on Spark

2017-09-11 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang resolved HIVE-7292.
---
   Resolution: Done
Fix Version/s: 1.1.0

As the feature is already released in Hive and the remaining issues have 
dedicated JIRAs to track them, I'm closing this JIRA as "done".

> Hive on Spark
> -
>
> Key: HIVE-7292
> URL: https://issues.apache.org/jira/browse/HIVE-7292
> Project: Hive
>  Issue Type: Improvement
>  Components: Spark
>Reporter: Xuefu Zhang
>Assignee: Xuefu Zhang
>  Labels: Spark-M1, Spark-M2, Spark-M3, Spark-M4, Spark-M5
> Fix For: 1.1.0
>
> Attachments: Hive-on-Spark.pdf
>
>
> Spark, as an open-source data analytics cluster computing framework, has gained 
> significant momentum recently. Many Hive users already have Spark installed 
> as their computing backbone. To take advantage of Hive, they still need to 
> have either MapReduce or Tez on their cluster. This initiative will provide 
> users a new alternative so that they can consolidate their backends. 
> Secondly, providing such an alternative further increases Hive's adoption, as 
> it exposes Spark users to a viable, feature-rich, de facto standard SQL tool 
> on Hadoop.
> Finally, allowing Hive to run on Spark also has performance benefits. Hive 
> queries, especially those involving multiple reducer stages, will run faster, 
> thus improving user experience as Tez does.
> This is an umbrella JIRA which will cover many coming subtasks. A design doc 
> will be attached here shortly, and will be on the wiki as well. Feedback from 
> the community is greatly appreciated!





[jira] [Commented] (HIVE-7292) Hive on Spark

2017-09-11 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16162071#comment-16162071
 ] 

Xuefu Zhang commented on HIVE-7292:
---

[~bastrich], thanks for your explanation. In fact, anyone can create a JIRA 
requesting a bug fix or a feature. Nevertheless, I created HIVE-17507 to 
request the support.

> Hive on Spark
> -
>
> Key: HIVE-7292
> URL: https://issues.apache.org/jira/browse/HIVE-7292
> Project: Hive
>  Issue Type: Improvement
>  Components: Spark
>Reporter: Xuefu Zhang
>Assignee: Xuefu Zhang
>  Labels: Spark-M1, Spark-M2, Spark-M3, Spark-M4, Spark-M5
> Attachments: Hive-on-Spark.pdf
>
>
> Spark, as an open-source data analytics cluster computing framework, has gained 
> significant momentum recently. Many Hive users already have Spark installed 
> as their computing backbone. To take advantage of Hive, they still need to 
> have either MapReduce or Tez on their cluster. This initiative will provide 
> users a new alternative so that they can consolidate their backends. 
> Secondly, providing such an alternative further increases Hive's adoption, as 
> it exposes Spark users to a viable, feature-rich, de facto standard SQL tool 
> on Hadoop.
> Finally, allowing Hive to run on Spark also has performance benefits. Hive 
> queries, especially those involving multiple reducer stages, will run faster, 
> thus improving user experience as Tez does.
> This is an umbrella JIRA which will cover many coming subtasks. A design doc 
> will be attached here shortly, and will be on the wiki as well. Feedback from 
> the community is greatly appreciated!





[jira] [Commented] (HIVE-7292) Hive on Spark

2017-09-10 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16160444#comment-16160444
 ] 

Xuefu Zhang commented on HIVE-7292:
---

[~bastrich], the answer is no. However, if there is a strong demand, Mesos 
support can be added.

> Hive on Spark
> -
>
> Key: HIVE-7292
> URL: https://issues.apache.org/jira/browse/HIVE-7292
> Project: Hive
>  Issue Type: Improvement
>  Components: Spark
>Reporter: Xuefu Zhang
>Assignee: Xuefu Zhang
>  Labels: Spark-M1, Spark-M2, Spark-M3, Spark-M4, Spark-M5
> Attachments: Hive-on-Spark.pdf
>
>
> Spark, as an open-source data analytics cluster computing framework, has gained 
> significant momentum recently. Many Hive users already have Spark installed 
> as their computing backbone. To take advantage of Hive, they still need to 
> have either MapReduce or Tez on their cluster. This initiative will provide 
> users a new alternative so that they can consolidate their backends. 
> Secondly, providing such an alternative further increases Hive's adoption, as 
> it exposes Spark users to a viable, feature-rich, de facto standard SQL tool 
> on Hadoop.
> Finally, allowing Hive to run on Spark also has performance benefits. Hive 
> queries, especially those involving multiple reducer stages, will run faster, 
> thus improving user experience as Tez does.
> This is an umbrella JIRA which will cover many coming subtasks. A design doc 
> will be attached here shortly, and will be on the wiki as well. Feedback from 
> the community is greatly appreciated!





[jira] [Updated] (HIVE-17401) Hive session idle timeout doesn't function properly

2017-09-06 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated HIVE-17401:
---
   Resolution: Fixed
Fix Version/s: 3.0.0
   Status: Resolved  (was: Patch Available)

> Hive session idle timeout doesn't function properly
> ---
>
> Key: HIVE-17401
> URL: https://issues.apache.org/jira/browse/HIVE-17401
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 1.1.0
>Reporter: Xuefu Zhang
>Assignee: Xuefu Zhang
> Fix For: 3.0.0
>
> Attachments: HIVE-17401.1.patch, HIVE-17401.2.patch, HIVE-17401.patch
>
>
> It's apparent in our production environment that HS2 leaks sessions, which at 
> least contributes to memory leaks in HS2. We further found that idle HS2 
> sessions rarely get timed out and the number of live sessions keeps increasing 
> over time. Eventually, HS2 becomes unresponsive and demands a restart.
> Investigation shows that the session idle timeout doesn't work properly.





[jira] [Commented] (HIVE-17401) Hive session idle timeout doesn't function properly

2017-09-06 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16155703#comment-16155703
 ] 

Xuefu Zhang commented on HIVE-17401:


The test failures above don't seem related to the patch. The patch fixes the 
implementation bug that causes the session leak, and thus the memory leak, in 
HS2, and corrects a potential synchronization problem. It also refactors the 
code a little to make it easier to read and understand.

Patch #2 is committed to master. Thanks to Peter for the review.
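The kind of behavior the fix restores, a periodic sweep that actually closes sessions whose last access is older than the configured timeout, can be sketched in a few lines. This is a hedged illustration only; the names and data shapes are hypothetical, not HiveServer2's actual API.

```python
# Hypothetical sketch: an idle-session sweep that closes any session whose
# last-access time is older than the configured timeout, so that idle
# sessions (and the memory they hold) are reliably reclaimed.

def sweep_idle_sessions(sessions, now, idle_timeout):
    """sessions: dict of handle -> last_access_time; returns closed handles."""
    expired = [h for h, last in sessions.items() if now - last > idle_timeout]
    for h in expired:
        del sessions[h]   # close and forget the session, freeing its state
    return expired
```

If such a sweep never fires (or never removes the session from the live map), the session count grows without bound, matching the leak symptom described in the issue.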

> Hive session idle timeout doesn't function properly
> ---
>
> Key: HIVE-17401
> URL: https://issues.apache.org/jira/browse/HIVE-17401
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 1.1.0
>Reporter: Xuefu Zhang
>Assignee: Xuefu Zhang
> Fix For: 3.0.0
>
> Attachments: HIVE-17401.1.patch, HIVE-17401.2.patch, HIVE-17401.patch
>
>
> It's apparent in our production environment that HS2 leaks sessions, which at 
> least contributed to memory leaks in HS2. We further found that idle HS2 
> sessions rarely get timed out and the number of live session keeps increasing 
> as time goes on. Eventually, HS2 becomes irresponsive and demands a restart.
> Investigation shows that session idle timeout doesn't work appropriately.



