[jira] [Created] (HIVE-24280) Fix a potential NPE
Xuefu Zhang created HIVE-24280: -- Summary: Fix a potential NPE Key: HIVE-24280 URL: https://issues.apache.org/jira/browse/HIVE-24280 Project: Hive Issue Type: Improvement Components: Vectorization Affects Versions: 3.1.2 Reporter: Xuefu Zhang Assignee: Xuefu Zhang
{code:java}
case STRING:
case CHAR:
case VARCHAR: {
  BytesColumnVector bcv = (BytesColumnVector) cols[colIndex];
  String sVal = value.toString();
  if (sVal == null) {
    bcv.noNulls = false;
    bcv.isNull[0] = true;
    bcv.isRepeating = true;
  } else {
    bcv.fill(sVal.getBytes());
  }
}
break;
{code}
The above code snippet assumes that sVal can be null, but it doesn't handle the case where value itself is null; if value is not null, it's unlikely that value.toString() returns null. However, we treat the partition column value for the default partition of string types as null rather than as "__HIVE_DEFAULT_PARTITION__", which Hive assumes, so we actually do hit the case where sVal is null. I propose a harmless fix, as shown in the attached patch. -- This message was sent by Atlassian Jira (v8.3.4#803005)
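The attached patch isn't reproduced here, but the shape of the fix can be sketched. The following is a hypothetical, self-contained illustration: FakeBytesColumnVector is a stand-in for Hive's BytesColumnVector, and fillStringColumn mirrors the snippet above with the null check moved to value itself.

```java
import java.nio.charset.StandardCharsets;

class ColumnFillerSketch {
    // Stand-in for Hive's BytesColumnVector, reduced to the fields the snippet touches.
    static class FakeBytesColumnVector {
        boolean noNulls = true;
        boolean isRepeating = false;
        boolean[] isNull = new boolean[1];
        byte[] filled;

        void fill(byte[] bytes) { filled = bytes; }
    }

    // Null-safe version of the snippet: check `value` itself, so a null
    // partition value (the default-partition case) no longer NPEs on toString().
    static void fillStringColumn(FakeBytesColumnVector bcv, Object value) {
        String sVal = (value == null) ? null : value.toString();
        if (sVal == null) {
            bcv.noNulls = false;
            bcv.isNull[0] = true;
            bcv.isRepeating = true;
        } else {
            bcv.fill(sVal.getBytes(StandardCharsets.UTF_8));
        }
    }
}
```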
[jira] [Created] (HIVE-17586) Make HS2 BackgroundOperationPool not fixed
Xuefu Zhang created HIVE-17586: -- Summary: Make HS2 BackgroundOperationPool not fixed Key: HIVE-17586 URL: https://issues.apache.org/jira/browse/HIVE-17586 Project: Hive Issue Type: Bug Components: HiveServer2 Affects Versions: 1.1.0 Reporter: Xuefu Zhang Assignee: Xuefu Zhang Currently the thread pool for background asynchronous operations has a fixed size controlled by {{hive.server2.async.exec.threads}}. However, the thread factory supplied for this thread pool is {{ThreadFactoryWithGarbageCleanup}}, which creates ThreadWithGarbageCleanup. Since this is a fixed thread pool, the threads are actually never killed, defeating the purpose of the garbage cleanup noted in the thread class name. On the other hand, since these threads never go away, significant resources such as thread-local variables (classloaders, HiveConfs, etc.) are held even when no operation is running. This can lead to escalated HS2 memory usage. Ideally, the thread pool should not be fixed, allowing threads to die out so resources can be reclaimed. The existing config {{hive.server2.async.exec.threads}} is treated as the max, and we can add a min for the thread pool, {{hive.server2.async.exec.min.threads}}. The default value for this configuration is -1, which keeps the existing behavior. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
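One standard way to get such an elastic pool from java.util.concurrent is sketched below (illustration only; the real patch would also wire in ThreadFactoryWithGarbageCleanup and the config names above). With an unbounded work queue a ThreadPoolExecutor never grows past its core size, so the usual trick is core == max plus allowCoreThreadTimeOut, which lets idle threads die and release their thread-locals.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class ElasticPoolSketch {
    static ThreadPoolExecutor newBackgroundPool(int maxThreads, long keepAliveSecs) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
            maxThreads, maxThreads, keepAliveSecs, TimeUnit.SECONDS,
            new LinkedBlockingQueue<>());
        // Let core threads time out too, so the pool can shrink all the way to
        // zero and per-thread resources (classloaders, HiveConfs) are released.
        pool.allowCoreThreadTimeOut(true);
        return pool;
    }
}
```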
[jira] [Created] (HIVE-17548) ThriftCliService reports an inaccurate number of current sessions in the log message
Xuefu Zhang created HIVE-17548: -- Summary: ThriftCliService reports an inaccurate number of current sessions in the log message Key: HIVE-17548 URL: https://issues.apache.org/jira/browse/HIVE-17548 Project: Hive Issue Type: Bug Components: HiveServer2 Affects Versions: 1.1.0 Reporter: Xuefu Zhang Currently ThriftCliService uses an atomic integer to keep track of the number of currently open sessions. It reports it through the following two log messages:
{code}
2017-09-18 04:14:31,722 INFO [HiveServer2-Handler-Pool: Thread-729979]: org.apache.hive.service.cli.thrift.ThriftCLIService: Opened a session: SessionHandle [99ec30d7-5c44-4a45-a8d6-0f0e7ecf4879], current sessions: 345
2017-09-18 04:14:41,926 INFO [HiveServer2-Handler-Pool: Thread-717542]: org.apache.hive.service.cli.thrift.ThriftCLIService: Closed session: SessionHandle [f38f7890-cba4-459c-872e-4c261b897e00], current sessions: 344
{code}
This assumes that all sessions are opened or closed through the Thrift API. This assumption isn't correct because sessions may be closed by the server, such as in the case of a timeout. Therefore, such log messages tend to over-report the number of open sessions. In order to accurately report the number of outstanding sessions, the session manager should be consulted instead. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
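A toy model of why the counter drifts (names are illustrative, not Hive's): sessions closed by the server bypass the Thrift handler's decrement, while the session manager's own collection stays accurate.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Toy model: the Thrift handler's counter vs. the session manager's map.
class SessionCountSketch {
    final AtomicInteger thriftCounter = new AtomicInteger();
    final Set<String> openSessions = ConcurrentHashMap.newKeySet();

    void openViaThrift(String handle) {
        openSessions.add(handle);
        thriftCounter.incrementAndGet();
    }

    void closeViaThrift(String handle) {
        openSessions.remove(handle);
        thriftCounter.decrementAndGet();
    }

    // e.g. idle timeout: the server closes the session without going through Thrift,
    // so the handler's counter is never decremented.
    void closeByServer(String handle) {
        openSessions.remove(handle);
    }

    int reportedCount() { return thriftCounter.get(); } // what the log prints today
    int actualCount()   { return openSessions.size(); } // what it should print
}
```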
[jira] [Created] (HIVE-17507) Support Mesos for Hive on Spark
Xuefu Zhang created HIVE-17507: -- Summary: Support Mesos for Hive on Spark Key: HIVE-17507 URL: https://issues.apache.org/jira/browse/HIVE-17507 Project: Hive Issue Type: Improvement Components: Spark Reporter: Xuefu Zhang From the comment in HIVE-7292:
{quote}
I see the following case: I use Mesos DC/OS and Spark on Mesos, because it's very convenient. But if I want to use Hive on Spark in Mesos DC/OS, I need the special framework Apache Myriad to run YARN on Mesos. It's very cluttered because I run one resource manager on top of another resource manager, which creates a lot of redundant abstraction levels. And there are questions about that on the Internet (e.g. http://grokbase.com/t/hive/user/15997dye2q/hive-on-spark-on-mesos) Can we create a new sub-task for this feature?
{quote}
-- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (HIVE-17401) Hive session idle timeout doesn't function properly
Xuefu Zhang created HIVE-17401: -- Summary: Hive session idle timeout doesn't function properly Key: HIVE-17401 URL: https://issues.apache.org/jira/browse/HIVE-17401 Project: Hive Issue Type: Bug Affects Versions: 1.1.0 Reporter: Xuefu Zhang Assignee: Xuefu Zhang It's apparent in our production environment that HS2 leaks sessions, which at least contributes to memory leaks in HS2. We further found that idle HS2 sessions rarely get timed out and that the number of live sessions keeps increasing over time. Eventually, HS2 becomes unresponsive and demands a restart. Investigation shows that the session idle timeout doesn't work properly. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
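For context, an idle-timeout sweep typically looks like the sketch below. This is purely illustrative: Hive's real logic lives in its session manager, and the bug could just as well be in how the last-access timestamp is refreshed rather than in the sweep itself.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustration of the mechanism only; field and method names are made up.
class IdleTimeoutSketch {
    final Map<String, Long> lastAccessTime = new ConcurrentHashMap<>();

    // Periodic sweep: close every session idle longer than the timeout.
    // If a session's timestamp is never recorded correctly, or is refreshed
    // spuriously, the session never passes this check and is leaked.
    int closeIdleSessions(long nowMillis, long idleTimeoutMillis) {
        int closed = 0;
        Iterator<Map.Entry<String, Long>> it = lastAccessTime.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> e = it.next();
            if (nowMillis - e.getValue() >= idleTimeoutMillis) {
                it.remove(); // stand-in for actually closing the session
                closed++;
            }
        }
        return closed;
    }
}
```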
[jira] [Created] (HIVE-16962) Better error msg for Hive on Spark in case user cancels query and closes session
Xuefu Zhang created HIVE-16962: -- Summary: Better error msg for Hive on Spark in case user cancels query and closes session Key: HIVE-16962 URL: https://issues.apache.org/jira/browse/HIVE-16962 Project: Hive Issue Type: Improvement Components: Spark Affects Versions: 1.1.0 Reporter: Xuefu Zhang Assignee: Xuefu Zhang In case the user cancels a query and closes the session, Hive marks the query as failed. However, the error message is a little confusing. It still says:
{quote}
org.apache.hive.service.cli.HiveSQLException: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.spark.SparkTask. Failed to create spark client. This is likely because the queue you assigned to does not have free resource at the moment to start the job. Please check your queue usage and try the query again later.
{quote}
followed by some InterruptedException. Ideally, the error should clearly indicate that the user canceled the execution. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (HIVE-16961) Hive on Spark leaks spark application in case user cancels query and closes session
Xuefu Zhang created HIVE-16961: -- Summary: Hive on Spark leaks spark application in case user cancels query and closes session Key: HIVE-16961 URL: https://issues.apache.org/jira/browse/HIVE-16961 Project: Hive Issue Type: Bug Components: Spark Affects Versions: 1.1.0 Reporter: Xuefu Zhang Assignee: Xuefu Zhang It's been found that a Spark application is leaked when the user cancels a query and closes the session while Hive is waiting for the remote driver to connect back. This was found for asynchronous query execution, but it seems equally applicable to synchronous submission when the session is abruptly closed. The leaked Spark application that runs the Spark driver connects back to Hive successfully and runs forever (until HS2 restarts), but receives no job submissions because the session is already closed. Ideally, Hive should reject the connection from the driver so the driver will exit. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (HIVE-16854) SparkClientFactory is locked too aggressively
Xuefu Zhang created HIVE-16854: -- Summary: SparkClientFactory is locked too aggressively Key: HIVE-16854 URL: https://issues.apache.org/jira/browse/HIVE-16854 Project: Hive Issue Type: Bug Components: Spark Affects Versions: 1.1.0 Reporter: Xuefu Zhang Most methods in SparkClientFactory are synchronized on the SparkClientFactory singleton. However, some of them are very expensive, such as createClient(), which returns a SparkClientImpl instance. Creating a SparkClientImpl instance requires starting a remote driver that connects back to the RPCServer, and this process can take a long time, such as in the case of a busy YARN queue. When this happens, all pending calls on SparkClientFactory have to wait for a long time. In our case, hive.spark.client.server.connect.timeout is set to 1hr, which makes some queries wait for hours before starting. The current implementation pretty much serializes all remote driver launches: if one of them takes time, the following ones have to wait. The HS2 stacktrace is attached for reference. It's based on an earlier version of Hive, so the line numbers might be slightly off. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
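A minimal sketch of the lock-narrowing remedy, with hypothetical names (the real SparkClientImpl launch waits on an RPC handshake rather than a Runnable): the slow launch runs outside the factory-wide lock, so concurrent launches no longer serialize behind one busy-queue launch.

```java
// Sketch only: the lock is held just for cheap shared bookkeeping,
// never across the expensive remote-driver launch.
class FactoryLockSketch {
    private final Object lock = new Object();
    private int clientsCreated = 0;

    String createClient(Runnable slowLaunch) {
        // Slow part (e.g. starting a remote driver and waiting for it to
        // connect back) deliberately runs WITHOUT the factory-wide lock.
        slowLaunch.run();
        synchronized (lock) { // lock held only for cheap bookkeeping
            clientsCreated++;
            return "client-" + clientsCreated;
        }
    }

    int created() {
        synchronized (lock) { return clientsCreated; }
    }
}
```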
[jira] [Created] (HIVE-16799) Control the max number of tasks for a stage in a Spark job
Xuefu Zhang created HIVE-16799: -- Summary: Control the max number of tasks for a stage in a Spark job Key: HIVE-16799 URL: https://issues.apache.org/jira/browse/HIVE-16799 Project: Hive Issue Type: Improvement Reporter: Xuefu Zhang Assignee: Xuefu Zhang HIVE-16552 gives admins an option to control the maximum number of tasks a Spark job may have. However, this may not be sufficient, as it tends to penalize jobs that have many stages while favoring jobs that have fewer stages. Ideally, we should also limit the number of tasks in a stage, which is closer to the maximum number of mappers or reducers in an MR job. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (HIVE-16552) Limit the number of tasks a Spark job may contain
Xuefu Zhang created HIVE-16552: -- Summary: Limit the number of tasks a Spark job may contain Key: HIVE-16552 URL: https://issues.apache.org/jira/browse/HIVE-16552 Project: Hive Issue Type: Improvement Components: Spark Reporter: Xuefu Zhang Assignee: Xuefu Zhang It's commonly desirable to block bad, big queries that take a lot of YARN resources. One approach, similar to mapreduce.job.max.map in MapReduce, is to stop a query that invokes a Spark job containing too many tasks. The proposal here is to introduce hive.spark.job.max.tasks with a default value of -1 (no limit), which an admin can set to block queries that trigger too many Spark tasks. Please note that this control knob applies to a single Spark job, and one query can trigger multiple Spark jobs (such as in the case of map-join). Nevertheless, the proposed approach is still helpful. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
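The intended semantics of the proposed knob can be sketched in a few lines (how the actual patch wires this into SparkTask is not shown here):

```java
class TaskLimitSketch {
    // Mirrors the proposed hive.spark.job.max.tasks semantics:
    // a negative limit (default -1) disables the check entirely.
    static boolean exceedsLimit(int totalTasks, int maxTasks) {
        return maxTasks >= 0 && totalTasks > maxTasks;
    }
}
```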
[jira] [Created] (HIVE-16196) UDFJson having thread-safety issues
Xuefu Zhang created HIVE-16196: -- Summary: UDFJson having thread-safety issues Key: HIVE-16196 URL: https://issues.apache.org/jira/browse/HIVE-16196 Project: Hive Issue Type: Bug Components: UDF Affects Versions: 1.1.0 Reporter: Xuefu Zhang Assignee: Xuefu Zhang As a followup to HIVE-16183, there seem to be some concurrency issues in UDFJson.java, especially around its static class variables. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
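The usual fix pattern for shared mutable statics in a UDF is to move the state into a ThreadLocal so each thread gets its own copy. The sketch below is a generic illustration of that pattern, not UDFJson's actual fields:

```java
import java.util.HashMap;
import java.util.Map;

class ThreadLocalCacheSketch {
    // Unsafe variant (the bug shape): a static HashMap mutated concurrently
    // by many HS2 or executor threads can corrupt itself.
    //   static final Map<String, Object> CACHE = new HashMap<>();

    // Safe variant: each thread gets its own map, so no synchronization needed.
    private static final ThreadLocal<Map<String, Object>> CACHE =
        ThreadLocal.withInitial(HashMap::new);

    static Object getOrCompute(String key) {
        return CACHE.get().computeIfAbsent(key, k -> "parsed:" + k);
    }
}
```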
[jira] [Created] (HIVE-16183) Fix potential thread safety issues with static variables
Xuefu Zhang created HIVE-16183: -- Summary: Fix potential thread safety issues with static variables Key: HIVE-16183 URL: https://issues.apache.org/jira/browse/HIVE-16183 Project: Hive Issue Type: Bug Components: HiveServer2 Reporter: Xuefu Zhang Assignee: Xuefu Zhang Many concurrency issues have been found with respect to class static variable usage. Given that HS2 supports concurrent compilation and task execution, and that some backend engines (such as Spark) run multiple tasks in a single JVM, the traditional assumption (or mindset) of single-threaded execution needs to be abandoned. The purpose of this JIRA is to do a global scan of static variables in the Hive code base and correct potential thread-safety issues. However, it's not meant to be exhaustive. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (HIVE-16179) HoS tasks may fail due to ArrayIndexOutOfBoundsException in BinarySortableSerDe
Xuefu Zhang created HIVE-16179: -- Summary: HoS tasks may fail due to ArrayIndexOutOfBoundsException in BinarySortableSerDe Key: HIVE-16179 URL: https://issues.apache.org/jira/browse/HIVE-16179 Project: Hive Issue Type: Bug Components: Serializers/Deserializers Affects Versions: 1.1.0 Reporter: Xuefu Zhang Assignee: Xuefu Zhang Stacktrace:
{code}
java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error: Unable to deserialize reduce input key from x1x100x101x97x51x49x50x97x102x45x97x98x56x52x45x52x102x52x53x45x56x49x101x99x45x49x99x100x98x55x97x51x52x100x49x49x55x0x1x128x0x0x0x0x0x0x19x1x128x0x0x0x0x0x0x3x1x128x0x66x179x1x192x244x45x90x1x85x98x101x114x0x1x76x111x115x32x65x110x103x101x108x101x115x0x1x2x128x0x0x2x50x51x57x51x0x1x192x55x238x20x122x225x71x174x1x128x0x0x0x87x240x169x195x1x50x48x49x54x45x49x48x45x48x49x32x50x51x58x51x49x58x51x49x0x1x117x98x101x114x88x0x255 with properties {columns=_col0,_col1,_col2,_col3,_col4,_col5,_col6,_col7,_col8,_col9,_col10,_col11, serialization.lib=org.apache.hadoop.hive.serde2.binarysortable.BinarySortableSerDe, serialization.sort.order=, columns.types=string,bigint,bigint,date,int,varchar(50),varchar(255),decimal(12,2),double,bigint,string,varchar(255)}
    at org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processRow(SparkReduceRecordHandler.java:339)
    at org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:54)
    at org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:28)
    at org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList$ResultIterator.hasNext(HiveBaseFunctionResultList.java:95)
    at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:41)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$1$$anonfun$apply$15.apply(AsyncRDDActions.scala:120)
    at org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$1$$anonfun$apply$15.apply(AsyncRDDActions.scala:120)
    at org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:2004)
    at org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:2004)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error: Unable to deserialize reduce input key from x1x100x101x97x51x49x50x97x102x45x97x98x56x52x45x52x102x52x53x45x56x49x101x99x45x49x99x100x98x55x97x51x52x100x49x49x55x0x1x128x0x0x0x0x0x0x19x1x128x0x0x0x0x0x0x3x1x128x0x66x179x1x192x244x45x90x1x85x98x101x114x0x1x76x111x115x32x65x110x103x101x108x101x115x0x1x2x128x0x0x2x50x51x57x51x0x1x192x55x238x20x122x225x71x174x1x128x0x0x0x87x240x169x195x1x50x48x49x54x45x49x48x45x48x49x32x50x51x58x51x49x58x51x49x0x1x117x98x101x114x88x0x255 with properties {columns=_col0,_col1,_col2,_col3,_col4,_col5,_col6,_col7,_col8,_col9,_col10,_col11, serialization.lib=org.apache.hadoop.hive.serde2.binarysortable.BinarySortableSerDe, serialization.sort.order=, columns.types=string,bigint,bigint,date,int,varchar(50),varchar(255),decimal(12,2),double,bigint,string,varchar(255)}
    at org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processRow(SparkReduceRecordHandler.java:311)
    ... 16 more
Caused by: java.lang.ArrayIndexOutOfBoundsException: 3
    at org.apache.hadoop.hive.serde2.binarysortable.BinarySortableSerDe.deserialize(BinarySortableSerDe.java:413)
    at org.apache.hadoop.hive.serde2.binarysortable.BinarySortableSerDe.deserialize(BinarySortableSerDe.java:190)
    at org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processRow(SparkReduceRecordHandler.java:309)
    ... 16 more
{code}
It seems to be a synchronization issue in BinarySortableSerDe. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (HIVE-16156) FileSinkOperator should delete existing output target when renaming
Xuefu Zhang created HIVE-16156: -- Summary: FileSinkOperator should delete existing output target when renaming Key: HIVE-16156 URL: https://issues.apache.org/jira/browse/HIVE-16156 Project: Hive Issue Type: Bug Components: Operators Affects Versions: 1.1.0 Reporter: Xuefu Zhang Assignee: Xuefu Zhang If a task gets killed (for whatever reason) after it has completed renaming the temp output to the final output during commit, subsequent task attempts will fail when renaming because the target output already exists. This can happen, however rarely. Hive should check for the existence of the target output and delete it before renaming. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
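A sketch of the commit fix using java.nio; Hive's FileSinkOperator actually goes through Hadoop's FileSystem API, but the idea is the same: a retried attempt must tolerate a target left behind by a killed-but-already-committed attempt.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class CommitRenameSketch {
    static void commit(Path tmp, Path target) {
        try {
            // Clear a target left over from a previous, killed attempt
            // before renaming, so the retry's rename cannot fail.
            Files.deleteIfExists(target);
            Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Self-check: simulates a retried attempt whose target already exists,
    // and returns the final content of the target.
    static String demo() {
        try {
            Path dir = Files.createTempDirectory("commit-sketch");
            Path tmp = Files.write(dir.resolve("tmp"), "attempt2".getBytes());
            Path target = Files.write(dir.resolve("out"), "attempt1".getBytes());
            commit(tmp, target);
            return new String(Files.readAllBytes(target));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```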
[jira] [Created] (HIVE-15893) Followup on HIVE-15671
Xuefu Zhang created HIVE-15893: -- Summary: Followup on HIVE-15671 Key: HIVE-15893 URL: https://issues.apache.org/jira/browse/HIVE-15893 Project: Hive Issue Type: Improvement Components: Spark Affects Versions: 2.2.0 Reporter: Xuefu Zhang Assignee: Xuefu Zhang In HIVE-15671, we fixed a typo where server.connect.timeout was used in place of client.connect.timeout. This might solve some potential problems, but the original problem reported in HIVE-15671 might still exist. (Not sure if HIVE-15860 helps.) Here is the proposal suggested by Marcelo:
{quote}
bq. server detecting a driver problem after it has connected back to the server.
Hmm. That is definitely not any of the "connect" timeouts, which probably means it isn't configured and is just using netty's default (which is probably no timeout?). Would probably need something using io.netty.handler.timeout.IdleStateHandler, and also some periodic "ping" so that the connection isn't torn down without reason.
{quote}
We will use this JIRA to track the issue. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (HIVE-15683) Measure performance impact on group by by HIVE-15580
Xuefu Zhang created HIVE-15683: -- Summary: Measure performance impact on group by by HIVE-15580 Key: HIVE-15683 URL: https://issues.apache.org/jira/browse/HIVE-15683 Project: Hive Issue Type: Improvement Components: Spark Affects Versions: 2.2.0 Reporter: Xuefu Zhang Assignee: Xuefu Zhang HIVE-15580 changed the way data is shuffled for order by: instead of using Spark's groupByKey to shuffle data, Hive on Spark now uses repartitionAndSortWithinPartitions(), which generates (key, value) pairs instead of the original (key, value iterator). This might have some performance implications, but it's needed to get rid of the unbounded memory usage of {{groupByKey}}. Here we'd like to compare group by performance with and without HIVE-15580. If the impact is significant, we can provide a configuration that allows the user to switch back to the original way of shuffling. This work should ideally be done after HIVE-15682, as the optimization there should help the performance here as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-15682) Eliminate the dummy iterator and optimize the per row based reducer-side processing
Xuefu Zhang created HIVE-15682: -- Summary: Eliminate the dummy iterator and optimize the per row based reducer-side processing Key: HIVE-15682 URL: https://issues.apache.org/jira/browse/HIVE-15682 Project: Hive Issue Type: Improvement Components: Spark Affects Versions: 2.2.0 Reporter: Xuefu Zhang Assignee: Xuefu Zhang HIVE-15580 introduced a dummy iterator per input row, which can be eliminated because {{SparkReduceRecordHandler}} is already able to handle single key-value pairs. We can refactor this part of the code 1. to remove the need for an iterator and 2. to optimize the code path for per-(key, value) processing (instead of (key, value iterator) processing). It would also be great if we can measure the performance after the optimizations and compare it to the performance prior to HIVE-15580. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-15671) RPCServer.registerClient() erroneously uses server/client handshake timeout for connection timeout
Xuefu Zhang created HIVE-15671: -- Summary: RPCServer.registerClient() erroneously uses server/client handshake timeout for connection timeout Key: HIVE-15671 URL: https://issues.apache.org/jira/browse/HIVE-15671 Project: Hive Issue Type: Bug Components: Spark Affects Versions: 1.1.0 Reporter: Xuefu Zhang Assignee: Xuefu Zhang
{code}
/**
 * Tells the RPC server to expect a connection from a new client.
 * ...
 */
public Future registerClient(final String clientId, String secret, RpcDispatcher serverDispatcher) {
  return registerClient(clientId, secret, serverDispatcher, config.getServerConnectTimeoutMs());
}
{code}
config.getServerConnectTimeoutMs() returns the value of hive.spark.client.server.connect.timeout, which is meant to be the timeout for the handshake between the Hive client and the remote Spark driver. Instead, the timeout used here should be hive.spark.client.connect.timeout, which is the timeout for the remote Spark driver connecting back to the Hive client. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-15580) Replace Spark's groupByKey operator with something with bounded memory
Xuefu Zhang created HIVE-15580: -- Summary: Replace Spark's groupByKey operator with something with bounded memory Key: HIVE-15580 URL: https://issues.apache.org/jira/browse/HIVE-15580 Project: Hive Issue Type: Improvement Components: Spark Reporter: Xuefu Zhang Assignee: Xuefu Zhang -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-15543) Don't try to get memory/cores to decide parallelism when Spark dynamic allocation is enabled
Xuefu Zhang created HIVE-15543: -- Summary: Don't try to get memory/cores to decide parallelism when Spark dynamic allocation is enabled Key: HIVE-15543 URL: https://issues.apache.org/jira/browse/HIVE-15543 Project: Hive Issue Type: Improvement Components: Spark Affects Versions: 2.2.0 Reporter: Xuefu Zhang Assignee: Xuefu Zhang Presently Hive tries to get the numbers of memory and cores from the Spark application and uses them to determine RS parallelism. However, this doesn't make sense when Spark dynamic allocation is enabled, because the current numbers don't represent the available computing resources, especially when the SparkContext is initially launched. Thus, it makes sense not to do that when dynamic allocation is enabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-15527) Memory usage is unbounded in SortByShuffler for Spark
Xuefu Zhang created HIVE-15527: -- Summary: Memory usage is unbounded in SortByShuffler for Spark Key: HIVE-15527 URL: https://issues.apache.org/jira/browse/HIVE-15527 Project: Hive Issue Type: Improvement Components: Spark Affects Versions: 1.1.0 Reporter: Xuefu Zhang Assignee: Xuefu Zhang In SortByShuffler.java, an ArrayList is used to back the iterator for values that have the same key in the shuffled result produced by the Spark transformation sortByKey. It's possible for memory to be exhausted by a large key group.
{code}
    @Override
    public Tuple2<HiveKey, List<BytesWritable>> next() {
      // TODO: implement this by accumulating rows with the same key into a list.
      // Note that this list needs to improved to prevent excessive memory usage, but this
      // can be done in later phase.
      while (it.hasNext()) {
        Tuple2<HiveKey, BytesWritable> pair = it.next();
        if (curKey != null && !curKey.equals(pair._1())) {
          HiveKey key = curKey;
          List<BytesWritable> values = curValues;
          curKey = pair._1();
          curValues = new ArrayList<BytesWritable>();
          curValues.add(pair._2());
          return new Tuple2<HiveKey, List<BytesWritable>>(key, values);
        }
        curKey = pair._1();
        curValues.add(pair._2());
      }
      if (curKey == null) {
        throw new NoSuchElementException();
      }
      // if we get here, this should be the last element we have
      HiveKey key = curKey;
      curKey = null;
      return new Tuple2<HiveKey, List<BytesWritable>>(key, curValues);
    }
{code}
Since the output from sortByKey is already sorted by key, it's possible to back the value iterable with the input iterator. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
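The proposed iterator-backed grouping might look like the following generic sketch, where Map.Entry stands in for Tuple2 and the key/value type parameters for HiveKey/BytesWritable. Each group's values stream straight off the shared input iterator, so memory stays O(1) per group, with the caveat that a group's values must be fully consumed before advancing to the next group.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;

// Groups a key-sorted iterator of (key, value) pairs into (key, value-iterator)
// pairs without buffering any group in memory.
class SortedGroupIterator<K, V> implements Iterator<Map.Entry<K, Iterator<V>>> {
    private final Iterator<Map.Entry<K, V>> input; // must be sorted by key
    private Map.Entry<K, V> lookahead;             // first pair of the next group

    SortedGroupIterator(Iterator<Map.Entry<K, V>> input) {
        this.input = input;
        this.lookahead = input.hasNext() ? input.next() : null;
    }

    @Override public boolean hasNext() { return lookahead != null; }

    @Override public Map.Entry<K, Iterator<V>> next() {
        if (lookahead == null) throw new NoSuchElementException();
        final K key = lookahead.getKey();
        // Lazy per-key value iterator backed by the shared input iterator.
        Iterator<V> values = new Iterator<V>() {
            @Override public boolean hasNext() {
                return lookahead != null && lookahead.getKey().equals(key);
            }
            @Override public V next() {
                if (!hasNext()) throw new NoSuchElementException();
                V v = lookahead.getValue();
                lookahead = input.hasNext() ? input.next() : null;
                return v;
            }
        };
        return new SimpleEntry<>(key, values);
    }

    // Self-check: groups [("a",1),("a",2),("b",3)] into "a=3;b=3".
    static String demo() {
        List<Map.Entry<String, Integer>> in = new ArrayList<>();
        in.add(new SimpleEntry<>("a", 1));
        in.add(new SimpleEntry<>("a", 2));
        in.add(new SimpleEntry<>("b", 3));
        StringBuilder sb = new StringBuilder();
        SortedGroupIterator<String, Integer> groups =
            new SortedGroupIterator<>(in.iterator());
        while (groups.hasNext()) {
            Map.Entry<String, Iterator<Integer>> g = groups.next();
            int sum = 0;
            while (g.getValue().hasNext()) sum += g.getValue().next();
            if (sb.length() > 0) sb.append(';');
            sb.append(g.getKey()).append('=').append(sum);
        }
        return sb.toString();
    }
}
```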
[jira] [Created] (HIVE-15237) Propagate Spark job failure to Hive
Xuefu Zhang created HIVE-15237: -- Summary: Propagate Spark job failure to Hive Key: HIVE-15237 URL: https://issues.apache.org/jira/browse/HIVE-15237 Project: Hive Issue Type: Bug Components: Spark Affects Versions: 2.1.0 Reporter: Xuefu Zhang If a Spark job fails for some reason, Hive doesn't get any additional error message, which makes it very hard for the user to figure out why. Here is an example:
{code}
Status: Running (Hive on Spark job[0])
Job Progress Format
CurrentTime StageId_StageAttemptId: SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount [StageCost]
2016-11-17 21:32:53,134 Stage-0_0: 0/23         Stage-1_0: 0/28
2016-11-17 21:32:55,156 Stage-0_0: 0(+1)/23     Stage-1_0: 0/28
2016-11-17 21:32:57,167 Stage-0_0: 0(+3)/23     Stage-1_0: 0/28
2016-11-17 21:33:00,216 Stage-0_0: 0(+3)/23     Stage-1_0: 0/28
2016-11-17 21:33:03,251 Stage-0_0: 0(+3)/23     Stage-1_0: 0/28
2016-11-17 21:33:06,286 Stage-0_0: 0(+4)/23     Stage-1_0: 0/28
2016-11-17 21:33:09,308 Stage-0_0: 0(+2,-3)/23  Stage-1_0: 0/28
2016-11-17 21:33:12,332 Stage-0_0: 0(+2,-3)/23  Stage-1_0: 0/28
2016-11-17 21:33:13,338 Stage-0_0: 0(+21,-3)/23 Stage-1_0: 0/28
2016-11-17 21:33:15,349 Stage-0_0: 0(+21,-5)/23 Stage-1_0: 0/28
2016-11-17 21:33:16,358 Stage-0_0: 0(+18,-8)/23 Stage-1_0: 0/28
2016-11-17 21:33:19,373 Stage-0_0: 0(+21,-8)/23 Stage-1_0: 0/28
2016-11-17 21:33:22,400 Stage-0_0: 0(+18,-14)/23        Stage-1_0: 0/28
2016-11-17 21:33:23,404 Stage-0_0: 0(+15,-20)/23        Stage-1_0: 0/28
2016-11-17 21:33:24,408 Stage-0_0: 0(+12,-23)/23        Stage-1_0: 0/28
2016-11-17 21:33:25,417 Stage-0_0: 0(+9,-26)/23 Stage-1_0: 0/28
2016-11-17 21:33:26,420 Stage-0_0: 0(+12,-26)/23        Stage-1_0: 0/28
2016-11-17 21:33:28,427 Stage-0_0: 0(+9,-29)/23 Stage-1_0: 0/28
2016-11-17 21:33:29,432 Stage-0_0: 0(+12,-29)/23        Stage-1_0: 0/28
2016-11-17 21:33:31,444 Stage-0_0: 0(+18,-29)/23        Stage-1_0: 0/28
2016-11-17 21:33:34,464 Stage-0_0: 0(+18,-29)/23        Stage-1_0: 0/28
Status: Failed
FAILED: Execution Error, return code 3 from org.apache.hadoop.hive.ql.exec.spark.SparkTask
{code}
It would be better if we could propagate the Spark error to Hive. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-14885) Support PPD for nested columns
Xuefu Zhang created HIVE-14885: -- Summary: Support PPD for nested columns Key: HIVE-14885 URL: https://issues.apache.org/jira/browse/HIVE-14885 Project: Hive Issue Type: Improvement Components: Logical Optimizer, Serializers/Deserializers Affects Versions: 2.1.0 Reporter: Xuefu Zhang It looks like PPD doesn't work for nested columns, at least not for Parquet. For a given schema
{code}
hive> desc nested;
OK
a int
b string
c struct
{code}
PPD works for a query like
{code}
select * from nested where a=1;
{code}
while NOT for
{code}
select * from nested where c.d=2;
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-14617) NPE in UDF MapValues() if input is null
Xuefu Zhang created HIVE-14617: -- Summary: NPE in UDF MapValues() if input is null Key: HIVE-14617 URL: https://issues.apache.org/jira/browse/HIVE-14617 Project: Hive Issue Type: Bug Components: HiveServer2 Affects Versions: 2.1.0 Reporter: Xuefu Zhang Assignee: Xuefu Zhang Job fails with error msg as follows:
{code}
Error: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {"ts":null,"_max_added_id":null,"identity_info":null,"vehicle_specs":null,"tracking_info":null,"color_info":null,"vehicle_traits":null,"detail_info":null,"_row_key":null,"_shard":null,"image_info":null,"vehicle_tags":null,"activation_info":null,"flavor_info":null,"sounds":null,"legacy_info":null,"images":null,"datestr":"2016-08-24"}
    at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:179)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {"ts":null,"_max_added_id":null,"identity_info":null,"vehicle_specs":null,"tracking_info":null,"color_info":null,"vehicle_traits":null,"detail_info":null,"_row_key":null,"_shard":null,"image_info":null,"vehicle_tags":null,"activation_info":null,"flavor_info":null,"sounds":null,"legacy_info":null,"images":null,"datestr":"2016-08-24"}
    at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:507)
    at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:170)
    ... 8 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Error evaluating map_values(vehicle_traits.vehicle_traits)
    at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:82)
    at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815)
    at org.apache.hadoop.hive.ql.exec.LateralViewForwardOperator.processOp(LateralViewForwardOperator.java:37)
    at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815)
    at org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:95)
    at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.forward(MapOperator.java:157)
    at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:497)
    ... 9 more
Caused by: java.lang.NullPointerException
    at org.apache.hadoop.hive.ql.udf.generic.GenericUDFMapValues.evaluate(GenericUDFMapValues.java:64)
    at org.apache.hadoop.hive.ql.exec.ExprNodeGenericFuncEvaluator._evaluate(ExprNodeGenericFuncEvaluator.java:185)
    at org.apache.hadoop.hive.ql.exec.ExprNodeEvaluator.evaluate(ExprNodeEvaluator.java:77)
    at org.apache.hadoop.hive.ql.exec.ExprNodeEvaluator.evaluate(ExprNodeEvaluator.java:65)
    at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:77)
    ... 15 more
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
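The typical fix for this class of NPE is to propagate null when the input is null, per Hive's usual null semantics. A stand-in sketch follows; the real fix would operate on ObjectInspectors inside GenericUDFMapValues.evaluate(), not on a raw Map:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class MapValuesSketch {
    static List<Object> mapValues(Map<?, ?> input) {
        if (input == null) {
            // The missing check: calling input.values() on a null map NPEs.
            return null;
        }
        return new ArrayList<Object>(input.values());
    }
}
```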
[jira] [Created] (HIVE-13873) Column pruning for nested fields
Xuefu Zhang created HIVE-13873: -- Summary: Column pruning for nested fields Key: HIVE-13873 URL: https://issues.apache.org/jira/browse/HIVE-13873 Project: Hive Issue Type: New Feature Components: Logical Optimizer Reporter: Xuefu Zhang Some columnar file formats such as Parquet store the fields of a struct type column by column as well, using the encoding described in Google's Dremel paper. It's very common in big data for data to be stored in structs while queries need only a subset of the fields in those structs. However, Hive presently still needs to read the whole struct regardless of whether all fields are selected. Therefore, pruning unwanted sub-fields of structs (nested fields) at file reading time would be a big performance boost for such scenarios. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-13276) Hive on Spark doesn't work when spark.master=local
Xuefu Zhang created HIVE-13276: -- Summary: Hive on Spark doesn't work when spark.master=local Key: HIVE-13276 URL: https://issues.apache.org/jira/browse/HIVE-13276 Project: Hive Issue Type: Bug Components: Spark Affects Versions: 2.1.0 Reporter: Xuefu Zhang Assignee: Xuefu Zhang The following problem occurs with the latest Hive master and Spark 1.6.1. I'm using the Hive CLI on a Mac.
{code}
set mapreduce.job.reduces=
java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.rdd.RDDOperationScope$
    at org.apache.spark.SparkContext.withScope(SparkContext.scala:714)
    at org.apache.spark.SparkContext.hadoopRDD(SparkContext.scala:991)
    at org.apache.spark.api.java.JavaSparkContext.hadoopRDD(JavaSparkContext.scala:419)
    at org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generateMapInput(SparkPlanGenerator.java:205)
    at org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generateParentTran(SparkPlanGenerator.java:145)
    at org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generate(SparkPlanGenerator.java:117)
    at org.apache.hadoop.hive.ql.exec.spark.LocalHiveSparkClient.execute(LocalHiveSparkClient.java:130)
    at org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionImpl.submit(SparkSessionImpl.java:71)
    at org.apache.hadoop.hive.ql.exec.spark.SparkTask.execute(SparkTask.java:94)
    at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:156)
    at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:101)
    at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1837)
    at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1578)
    at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1351)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1122)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1110)
    at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:233)
    at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:184)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:400)
    at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:778)
    at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:717)
    at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:645)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
FAILED: Execution Error, return code -101 from org.apache.hadoop.hive.ql.exec.spark.SparkTask. Could not initialize class org.apache.spark.rdd.RDDOperationScope$
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-12951) Reduce Spark executor prewarm timeout to 5s
Xuefu Zhang created HIVE-12951: -- Summary: Reduce Spark executor prewarm timeout to 5s Key: HIVE-12951 URL: https://issues.apache.org/jira/browse/HIVE-12951 Project: Hive Issue Type: Bug Components: Spark Affects Versions: 1.2.0 Reporter: Xuefu Zhang Assignee: Xuefu Zhang Currently it's set to 30s, which tends to be longer than needed. Reduce it to 5s, which accounts for JVM startup time alone. (Eventually, we may want to make this configurable.) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-12828) Update Spark version to 1.6
Xuefu Zhang created HIVE-12828: -- Summary: Update Spark version to 1.6 Key: HIVE-12828 URL: https://issues.apache.org/jira/browse/HIVE-12828 Project: Hive Issue Type: Task Components: Spark Reporter: Xuefu Zhang Assignee: Xuefu Zhang -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-12811) Give the YARN application a more meaningful name than just "Hive on Spark"
Xuefu Zhang created HIVE-12811: -- Summary: Give the YARN application a more meaningful name than just "Hive on Spark" Key: HIVE-12811 URL: https://issues.apache.org/jira/browse/HIVE-12811 Project: Hive Issue Type: Improvement Components: Spark Reporter: Xuefu Zhang Assignee: Xuefu Zhang MR uses the query text as the application name. Hopefully this can be set via spark.app.name. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
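A minimal sketch of how such a name could be derived from the query text before being passed to spark.app.name. The truncation length, whitespace normalization, and fallback string are illustrative choices, not Hive's actual behavior:

```java
public class AppName {
    // Derive a YARN application name from the query text, truncated so the
    // ResourceManager UI stays readable; fall back to the generic name when
    // there is no query to show. maxLen and the fallback are illustrative.
    static String appName(String query, int maxLen) {
        if (query == null || query.trim().isEmpty()) {
            return "Hive on Spark";
        }
        String q = query.trim().replaceAll("\\s+", " ");
        return q.length() <= maxLen ? q : q.substring(0, maxLen) + "...";
    }

    public static void main(String[] args) {
        // e.g. sparkConf.set("spark.app.name", appName(queryString, 60));
        System.out.println(appName("select   count(*)\nfrom sales", 60));
    }
}
```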
[jira] [Created] (HIVE-12708) Hive on Spark doesn't work with Kerberized HBase [Spark Branch]
Xuefu Zhang created HIVE-12708: -- Summary: Hive on Spark doesn't work with Kerberized HBase [Spark Branch] Key: HIVE-12708 URL: https://issues.apache.org/jira/browse/HIVE-12708 Project: Hive Issue Type: Bug Components: Spark Affects Versions: 1.1.0, 1.2.0, 2.0.0 Reporter: Xuefu Zhang Assignee: Xuefu Zhang The Spark application launcher (spark-submit) acquires an HBase delegation token on the Hive user's behalf when the application is launched. This mechanism doesn't work for long-running sessions and is not in line with what Hive does: Hive acquires the token automatically whenever a job needs it. The right approach would be for Spark to allow applications to dynamically add whatever tokens they need to the Spark context. While this needs work on the Spark side, we provide a workaround in Hive. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-12611) Make sure spark.yarn.queue is effective and takes the value from mapreduce.job.queuename if given [Spark Branch]
Xuefu Zhang created HIVE-12611: -- Summary: Make sure spark.yarn.queue is effective and takes the value from mapreduce.job.queuename if given [Spark Branch] Key: HIVE-12611 URL: https://issues.apache.org/jira/browse/HIVE-12611 Project: Hive Issue Type: Improvement Components: Spark Reporter: Xuefu Zhang Assignee: Xuefu Zhang Hive users sometimes specify a job queue name for the submitted MR jobs. For Spark, the property name is spark.yarn.queue. We need to make sure that users are able to submit Spark jobs to the given queue. If a user specifies the MR property, then Hive on Spark should honor that as well, for backward compatibility. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
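One possible shape of the backward-compatibility mapping, sketched against a plain Map stand-in for the configuration (the real implementation would work on HiveConf/SparkConf, not a Map):

```java
import java.util.HashMap;
import java.util.Map;

public class QueueNameMapping {
    static final String MR_QUEUE = "mapreduce.job.queuename";
    static final String SPARK_QUEUE = "spark.yarn.queue";

    // If the user set the MR queue name but not the Spark one, carry the
    // value over so existing MR-era scripts keep working unchanged.
    static void propagateQueueName(Map<String, String> conf) {
        String mrQueue = conf.get(MR_QUEUE);
        if (mrQueue != null && !mrQueue.isEmpty() && !conf.containsKey(SPARK_QUEUE)) {
            conf.put(SPARK_QUEUE, mrQueue);
        }
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(MR_QUEUE, "etl");
        propagateQueueName(conf);
        System.out.println(conf.get(SPARK_QUEUE)); // the MR value carried over
    }
}
```

Note that an explicitly set spark.yarn.queue wins; the MR property is only a fallback.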
[jira] [Created] (HIVE-12569) Excessive console message from SparkClientImpl [Spark Branch]
Xuefu Zhang created HIVE-12569: -- Summary: Excessive console message from SparkClientImpl [Spark Branch] Key: HIVE-12569 URL: https://issues.apache.org/jira/browse/HIVE-12569 Project: Hive Issue Type: Bug Components: Spark Affects Versions: 2.0.0 Reporter: Xuefu Zhang Assignee: Xuefu Zhang
{code}
15/12/02 11:00:46 INFO client.SparkClientImpl: 15/12/02 11:00:46 INFO Client: Application report for application_1442517343449_0038 (state: RUNNING)
15/12/02 11:00:47 INFO client.SparkClientImpl: 15/12/02 11:00:47 INFO Client: Application report for application_1442517343449_0038 (state: RUNNING)
15/12/02 11:00:48 INFO client.SparkClientImpl: 15/12/02 11:00:48 INFO Client: Application report for application_1442517343449_0038 (state: RUNNING)
15/12/02 11:00:49 INFO client.SparkClientImpl: 15/12/02 11:00:49 INFO Client: Application report for application_1442517343449_0038 (state: RUNNING)
15/12/02 11:00:50 INFO client.SparkClientImpl: 15/12/02 11:00:50 INFO Client: Application report for application_1442517343449_0038 (state: RUNNING)
{code}
I see this in the Hive CLI after a Spark job is launched, and it continues non-stop even after the job has finished. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-12568) Use the same logic finding HS2 host name in Spark client [Spark Branch]
Xuefu Zhang created HIVE-12568: -- Summary: Use the same logic finding HS2 host name in Spark client [Spark Branch] Key: HIVE-12568 URL: https://issues.apache.org/jira/browse/HIVE-12568 Project: Hive Issue Type: Bug Components: Spark Affects Versions: 1.1.0 Reporter: Xuefu Zhang Assignee: Xuefu Zhang The Spark client sends a pair of host name and port number to the remote driver so that the driver can connect back to HS2, where the user session is. The Spark client has its own way of determining the host name, and picks one network interface if the host happens to have multiple network interfaces. This can be problematic. For that, there is a parameter, hive.spark.client.server.address, with which the user can pick an interface. Unfortunately, this parameter isn't exposed. Instead of exposing it, we can use the same logic as Hive in determining the host name. The remote driver would then connect to HS2 over the same network interface as a regular HS2 client would. There might be a case where the user wants the remote driver to use a different network, but this is rare if it occurs at all. Thus, for now it should be sufficient to use the same network interface. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
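A sketch of the resolution order this implies: the hive.spark.client.server.address override wins when set, otherwise fall back to the JDK's local-host lookup shared with the rest of Hive. The "localhost" fallback on resolution failure is an assumption for the sketch, not Hive's actual logic:

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

public class ServerAddress {
    // Resolve the HS2 host name the same way for the Thrift endpoint and
    // the Spark remote-driver callback, so the driver connects over the
    // same interface a regular HS2 client would use.
    static String hs2HostName(String configured) {
        if (configured != null && !configured.trim().isEmpty()) {
            return configured.trim();          // explicit override wins
        }
        try {
            return InetAddress.getLocalHost().getHostName();
        } catch (UnknownHostException e) {
            return "localhost";                // conservative fallback (assumed)
        }
    }

    public static void main(String[] args) {
        System.out.println(hs2HostName("  hs2.example.com  "));
        System.out.println(hs2HostName(null));
    }
}
```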
[jira] [Created] (HIVE-12554) Fix Spark branch build after merge [Spark Branch]
Xuefu Zhang created HIVE-12554: -- Summary: Fix Spark branch build after merge [Spark Branch] Key: HIVE-12554 URL: https://issues.apache.org/jira/browse/HIVE-12554 Project: Hive Issue Type: Bug Components: Spark Reporter: Xuefu Zhang Assignee: Rui Li The previous merge from master broke the Spark branch build. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-12461) Branch-1 -Phadoop-1 build is broken
Xuefu Zhang created HIVE-12461: -- Summary: Branch-1 -Phadoop-1 build is broken Key: HIVE-12461 URL: https://issues.apache.org/jira/browse/HIVE-12461 Project: Hive Issue Type: Bug Affects Versions: 1.3.0 Reporter: Xuefu Zhang
{code}
[INFO] Executed tasks
[INFO]
[INFO] --- maven-compiler-plugin:2.3.2:compile (default-compile) @ hive-exec ---
[INFO] Compiling 2423 source files to /Users/xzhang/apache/hive-git-commit/ql/target/classes
[INFO] -
[ERROR] COMPILATION ERROR :
[INFO] -
[ERROR] /Users/xzhang/apache/hive-git-commit/ql/src/java/org/apache/hadoop/hive/ql/Context.java:[352,10] error: cannot find symbol
[INFO] 1 error
[INFO] -
[INFO]
[INFO] Reactor Summary:
[INFO]
[INFO] Hive ... SUCCESS [ 2.636 s]
[INFO] Hive Shims Common .. SUCCESS [ 3.270 s]
[INFO] Hive Shims 0.20S ... SUCCESS [ 1.052 s]
[INFO] Hive Shims 0.23 SUCCESS [ 3.550 s]
[INFO] Hive Shims Scheduler ... SUCCESS [ 1.076 s]
[INFO] Hive Shims . SUCCESS [ 1.472 s]
[INFO] Hive Common SUCCESS [ 5.989 s]
[INFO] Hive Serde . SUCCESS [ 6.923 s]
[INFO] Hive Metastore . SUCCESS [ 19.424 s]
[INFO] Hive Ant Utilities . SUCCESS [ 0.516 s]
[INFO] Spark Remote Client SUCCESS [ 3.305 s]
[INFO] Hive Query Language FAILURE [ 34.276 s]
[INFO] Hive Service ... SKIPPED
{code}
The part of the code being complained about:
{code}
343   /**
344    * Remove any created scratch directories.
345    */
346   public void removeScratchDir() {
347     for (Map.Entry entry : fsScratchDirs.entrySet()) {
348       try {
349         Path p = entry.getValue();
350         FileSystem fs = p.getFileSystem(conf);
351         fs.delete(p, true);
352         fs.cancelDeleteOnExit(p);
353       } catch (Exception e) {
354         LOG.warn("Error Removing Scratch: "
355             + StringUtils.stringifyException(e));
356       }
{code}
This might be related to HIVE-12268. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-12460) Fix branch-1 build
Xuefu Zhang created HIVE-12460: -- Summary: Fix branch-1 build Key: HIVE-12460 URL: https://issues.apache.org/jira/browse/HIVE-12460 Project: Hive Issue Type: Bug Components: Build Infrastructure Affects Versions: 1.3.0 Reporter: Xuefu Zhang Caused by a merge. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-12434) Merge spark into master 11/17/2015
Xuefu Zhang created HIVE-12434: -- Summary: Merge spark into master 11/17/2015 Key: HIVE-12434 URL: https://issues.apache.org/jira/browse/HIVE-12434 Project: Hive Issue Type: Task Components: Spark Affects Versions: 2.0.0 Reporter: Xuefu Zhang Assignee: Xuefu Zhang There are still a few patches that are in the Spark branch only. We need to merge them to master. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-12433) Merge trunk into spark 11/17/2015 [Spark Branch]
Xuefu Zhang created HIVE-12433: -- Summary: Merge trunk into spark 11/17/2015 [Spark Branch] Key: HIVE-12433 URL: https://issues.apache.org/jira/browse/HIVE-12433 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Brock Noland Assignee: Brock Noland Fix For: 1.1.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-12398) Create format checker for Parquet
Xuefu Zhang created HIVE-12398: -- Summary: Create format checker for Parquet Key: HIVE-12398 URL: https://issues.apache.org/jira/browse/HIVE-12398 Project: Hive Issue Type: Improvement Components: File Formats Affects Versions: 2.0.0 Reporter: Xuefu Zhang See HIVE-11120 and related. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-12390) Merge master to Spark branch 11/11/2015 [Spark Branch]
Xuefu Zhang created HIVE-12390: -- Summary: Merge master to Spark branch 11/11/2015 [Spark Branch] Key: HIVE-12390 URL: https://issues.apache.org/jira/browse/HIVE-12390 Project: Hive Issue Type: Task Components: Spark Reporter: Xuefu Zhang Assignee: Xuefu Zhang To fix some test failures such as those for Llap. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-12284) CLONE - Merge master to Spark branch 10/26/2015 [Spark Branch]
Xuefu Zhang created HIVE-12284: -- Summary: CLONE - Merge master to Spark branch 10/26/2015 [Spark Branch] Key: HIVE-12284 URL: https://issues.apache.org/jira/browse/HIVE-12284 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Xuefu Zhang Assignee: Xuefu Zhang Fix For: spark-branch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-12205) Spark: unify Spark statistics aggregation between local and remote spark client
Xuefu Zhang created HIVE-12205: -- Summary: Spark: unify Spark statistics aggregation between local and remote spark client Key: HIVE-12205 URL: https://issues.apache.org/jira/browse/HIVE-12205 Project: Hive Issue Type: Task Components: Spark Affects Versions: 1.1.0 Reporter: Xuefu Zhang In the classes {{LocalSparkJobStatus}} and {{RemoteSparkJobStatus}}, Spark statistics aggregation is done similarly but in different code paths. Ideally, we should have a unified approach to simplify maintenance. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-12063) Pad Decimal numbers with trailing zeros to the scale of the column
Xuefu Zhang created HIVE-12063: -- Summary: Pad Decimal numbers with trailing zeros to the scale of the column Key: HIVE-12063 URL: https://issues.apache.org/jira/browse/HIVE-12063 Project: Hive Issue Type: Improvement Components: Types Affects Versions: 1.1.0, 1.2.0, 1.0.0, 0.14.0, 0.13 Reporter: Xuefu Zhang Assignee: Xuefu Zhang HIVE-7373 was to address the problem of Hive trimming trailing zeros, which caused many problems, including treating 0.0, 0.00, and so on as 0, which has a different precision/scale. Please refer to the HIVE-7373 description. However, HIVE-7373 was reverted by HIVE-8745 while the underlying problems remained. HIVE-11835 was resolved recently to address one of the problems, where 0.0, 0.00, and so on could not be read into decimal(1,1). However, HIVE-11835 didn't address the problem of showing 0 in query results for decimal values such as 0.0, 0.00, etc. This causes confusion, as 0.0 and 0.00 have a different precision/scale than 0. The proposal here is to pad query results with zeros to the type's scale. This not only removes the confusion described above, but also aligns with many other DBs. The internal decimal number representation doesn't change, however. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
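The proposed display-side padding can be illustrated with java.math.BigDecimal (Hive's internal decimal representation differs; this only shows the formatting change the proposal describes):

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

public class DecimalPad {
    // Pad a decimal's textual form to the column's declared scale, so a 0
    // stored in a decimal(3,2) column displays as "0.00" rather than "0".
    // UNNECESSARY is safe here: increasing the scale never rounds.
    static String padToScale(BigDecimal value, int scale) {
        return value.setScale(scale, RoundingMode.UNNECESSARY).toPlainString();
    }

    public static void main(String[] args) {
        System.out.println(padToScale(new BigDecimal("0"), 2));   // 0.00
        System.out.println(padToScale(new BigDecimal("1.5"), 2)); // 1.50
    }
}
```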
[jira] [Created] (HIVE-11844) Merge master to Spark branch 9/16/2015 [Spark Branch]
Xuefu Zhang created HIVE-11844: -- Summary: Merge master to Spark branch 9/16/2015 [Spark Branch] Key: HIVE-11844 URL: https://issues.apache.org/jira/browse/HIVE-11844 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Xuefu Zhang Assignee: Xuefu Zhang Fix For: 1.2.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-11835) Type decimal(1,1) reads 0.0, 0.00, etc from text file as NULL
Xuefu Zhang created HIVE-11835: -- Summary: Type decimal(1,1) reads 0.0, 0.00, etc from text file as NULL Key: HIVE-11835 URL: https://issues.apache.org/jira/browse/HIVE-11835 Project: Hive Issue Type: Bug Components: Types Affects Versions: 1.1.0, 1.2.0, 2.0.0 Reporter: Xuefu Zhang Assignee: Xuefu Zhang Steps to reproduce:
1. Create a text file with values like 0.0, 0.00, etc.
2. Create a table in Hive with type decimal(1,1).
3. Run "load data local inpath ..." to load data into the table.
4. Run select * on the table.
You will see that NULL is displayed for 0.0, 0.00, .0, etc. Instead, these should be read as 0.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
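A sketch of why these literals can be accepted: after stripping trailing zeros, "0.00" needs no significant digits at all, so it fits decimal(1,1). The fits() helper below is illustrative, not Hive's actual validation code:

```java
import java.math.BigDecimal;

public class DecimalFit {
    // Decide whether a parsed literal fits decimal(precision, scale) once
    // trailing zeros are stripped -- so "0.00" fits decimal(1,1) instead of
    // being nulled out on read.
    static boolean fits(String literal, int precision, int scale) {
        BigDecimal d = new BigDecimal(literal).stripTrailingZeros();
        if (d.compareTo(BigDecimal.ZERO) == 0) {
            return true;                       // zero needs no digits at all
        }
        if (d.scale() > scale) {
            return false;                      // too many fractional digits
        }
        // integer digits must fit in (precision - scale)
        return d.precision() - d.scale() <= precision - scale;
    }

    public static void main(String[] args) {
        System.out.println(fits("0.00", 1, 1)); // currently read as NULL, should fit
        System.out.println(fits("1.55", 1, 1)); // genuinely too wide
    }
}
```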
[jira] [Created] (HIVE-11549) Hide Hive configuration from spark driver launching process
Xuefu Zhang created HIVE-11549: -- Summary: Hide Hive configuration from spark driver launching process Key: HIVE-11549 URL: https://issues.apache.org/jira/browse/HIVE-11549 Project: Hive Issue Type: Bug Components: Spark Affects Versions: 1.2.0 Reporter: Xuefu Zhang Assignee: Xuefu Zhang Hive uses Spark's application submission script, spark-submit, to launch the remote Spark driver. Starting from Spark 1.4, this script also does a lot of things that Hive doesn't need, for instance, accessing the metastore for delegation tokens. Hive on Spark doesn't need this, and one way to avoid it is to hide the Hive configuration from that script. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
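One way to sketch the idea: scrub Hive-related environment variables before spawning spark-submit, so the script cannot find Hive's configuration. The variable names removed here are assumptions about what the script might read, not a confirmed list:

```java
import java.util.Arrays;
import java.util.List;

public class SparkSubmitLauncher {
    // Build a launcher for spark-submit with Hive-related environment
    // variables removed from the child environment, so the script can't
    // pick up Hive's configuration (e.g. to reach the metastore).
    static ProcessBuilder scrubbedLauncher(List<String> command) {
        ProcessBuilder pb = new ProcessBuilder(command);
        pb.environment().remove("HIVE_CONF_DIR");  // assumed variable name
        pb.environment().remove("HIVE_HOME");      // assumed variable name
        return pb;
    }

    public static void main(String[] args) {
        ProcessBuilder pb = scrubbedLauncher(Arrays.asList("spark-submit", "--version"));
        System.out.println(pb.environment().containsKey("HIVE_CONF_DIR")); // false
    }
}
```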
[jira] [Created] (HIVE-11434) Followup for HIVE-10166: reuse existing configurations for prewarming Spark executors
Xuefu Zhang created HIVE-11434: -- Summary: Followup for HIVE-10166: reuse existing configurations for prewarming Spark executors Key: HIVE-11434 URL: https://issues.apache.org/jira/browse/HIVE-11434 Project: Hive Issue Type: Bug Components: Spark Affects Versions: 2.0.0 Reporter: Xuefu Zhang Assignee: Xuefu Zhang It appears that a patch other than the latest one from HIVE- was committed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-11433) NPE for a multiple inner join query
Xuefu Zhang created HIVE-11433: -- Summary: NPE for a multiple inner join query Key: HIVE-11433 URL: https://issues.apache.org/jira/browse/HIVE-11433 Project: Hive Issue Type: Bug Components: Query Processor Affects Versions: 1.2.0, 1.1.0, 2.0.0 Reporter: Xuefu Zhang A NullPointerException is thrown for a query that has multiple (more than 3) inner joins. Stacktrace for 1.1.0:
{code}
NullPointerException null
java.lang.NullPointerException
 at org.apache.hadoop.hive.ql.parse.ParseUtils.getIndex(ParseUtils.java:149)
 at org.apache.hadoop.hive.ql.parse.ParseUtils.checkJoinFilterRefersOneAlias(ParseUtils.java:166)
 at org.apache.hadoop.hive.ql.parse.ParseUtils.checkJoinFilterRefersOneAlias(ParseUtils.java:185)
 at org.apache.hadoop.hive.ql.parse.ParseUtils.checkJoinFilterRefersOneAlias(ParseUtils.java:185)
 at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.mergeJoins(SemanticAnalyzer.java:8257)
 at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.mergeJoinTree(SemanticAnalyzer.java:8422)
 at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:9805)
 at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:9714)
 at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genOPTree(SemanticAnalyzer.java:10150)
 at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:10161)
 at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:10078)
 at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:222)
 at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:421)
 at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:307)
 at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1110)
 at org.apache.hadoop.hive.ql.Driver.compileAndRespond(Driver.java:1104)
 at org.apache.hive.service.cli.operation.SQLOperation.prepare(SQLOperation.java:101)
 at org.apache.hive.service.cli.operation.SQLOperation.runInternal(SQLOperation.java:172)
 at org.apache.hive.service.cli.operation.Operation.run(Operation.java:257)
 at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:386)
 at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:373)
 at org.apache.hive.service.cli.CLIService.executeStatementAsync(CLIService.java:271)
 at org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:486)
 at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1313)
 at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1298)
 at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
 at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
 at org.apache.hadoop.hive.thrift.HadoopThriftAuthBridge$Server$TUGIAssumingProcessor.process(HadoopThriftAuthBridge.java:692)
 at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:285)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
{code}
The problem can also be reproduced in the latest master branch.
Further investigation shows that the following code (in ParseUtils.java) is problematic:
{code}
static int getIndex(String[] list, String elem) {
  for (int i = 0; i < list.length; i++) {
    if (list[i].toLowerCase().equals(elem)) {
      return i;
    }
  }
  return -1;
}
{code}
The code assumes that every element in the list is not null, which isn't true because of the following code in SemanticAnalyzer.java (method genJoinTree()):
{code}
if ((right.getToken().getType() == HiveParser.TOK_TABREF)
    || (right.getToken().getType() == HiveParser.TOK_SUBQUERY)
    || (right.getToken().getType() == HiveParser.TOK_PTBLFUNCTION)) {
  String tableName = getUnescapedUnqualifiedTableName((ASTNode) right.getChild(0))
      .toLowerCase();
  String alias = extractJoinAlias(right, tableName);
  String[] rightAliases = new String[1];
  rightAliases[0] = alias;
  joinTree.setRightAliases(rightAliases);
  String[] children = joinTree.getBaseSrc();
  if (children == null) {
    children = new String[2];
  }
  children[1] = alias;
  joinTree.setBaseSrc(children);
  joinTree.setId(qb.getId());
  joinTree.getAliasToOpInfo().put(
      getModifiedAlias(qb, alias), aliasToOpInfo.get(
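A minimal null-safe variant of the getIndex() helper quoted above, skipping the null aliases that genJoinTree() can leave in the array (whether this matches the fix that was eventually committed is not confirmed here):

```java
public class ParseUtilsFix {
    // Null-safe variant of ParseUtils.getIndex: skip null elements instead
    // of dereferencing them, which avoids the NPE during join merging.
    static int getIndex(String[] list, String elem) {
        for (int i = 0; i < list.length; i++) {
            if (list[i] != null && list[i].toLowerCase().equals(elem)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // baseSrc arrays from genJoinTree() can contain nulls, e.g.:
        String[] aliases = { null, "t1", "t2" };
        System.out.println(getIndex(aliases, "t2"));      // 2
        System.out.println(getIndex(aliases, "missing")); // -1
    }
}
```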
[jira] [Created] (HIVE-11430) Followup HIVE-10166: investigate and fix the two test failures
Xuefu Zhang created HIVE-11430: -- Summary: Followup HIVE-10166: investigate and fix the two test failures Key: HIVE-11430 URL: https://issues.apache.org/jira/browse/HIVE-11430 Project: Hive Issue Type: Bug Components: Test Affects Versions: 2.0.0 Reporter: Xuefu Zhang Assignee: Xuefu Zhang
{code}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_convert_enum_to_string
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_dynamic_rdd_cache
{code}
As shown in . -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-11363) Prewarm Hive on Spark containers [Spark Branch]
Xuefu Zhang created HIVE-11363: -- Summary: Prewarm Hive on Spark containers [Spark Branch] Key: HIVE-11363 URL: https://issues.apache.org/jira/browse/HIVE-11363 Project: Hive Issue Type: Sub-task Components: Spark Affects Versions: 1.1.0 Reporter: Xuefu Zhang Assignee: Xuefu Zhang When a Hive job is launched by Oozie, a Hive session is created and the job script is executed. The session is closed when the Hive job completes. Thus, a Hive session is not shared among Hive jobs, either within an Oozie workflow or across workflows. Since the parallelism of a Hive job executed on Spark is impacted by the available executors, such Hive jobs suffer the executor ramp-up overhead. The idea here is to wait a bit so that enough executors can come up before a job is executed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
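The waiting logic described above can be sketched as a bounded poll. The executor-count supplier is a stand-in for whatever API reports registered executors, and the 50 ms poll interval is an arbitrary choice for the sketch:

```java
import java.util.function.IntSupplier;

public class ExecutorPrewarm {
    // Block until at least minExecutors have registered, or the timeout
    // elapses -- whichever comes first. Returning false means we give up
    // waiting and run the job with whatever executors are available.
    static boolean awaitExecutors(IntSupplier currentCount, int minExecutors,
                                  long timeoutMs) {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (currentCount.getAsInt() < minExecutors) {
            if (System.currentTimeMillis() >= deadline) {
                return false;
            }
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // With a stubbed count of 8, the minimum of 4 is met immediately.
        System.out.println(awaitExecutors(() -> 8, 4, 5000));
    }
}
```

The timeout is exactly the knob HIVE-12951 tunes from 30s down to 5s.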
[jira] [Created] (HIVE-11314) Print "Execution completed successfully" as part of spark job info [Spark Branch]
Xuefu Zhang created HIVE-11314: -- Summary: Print "Execution completed successfully" as part of spark job info [Spark Branch] Key: HIVE-11314 URL: https://issues.apache.org/jira/browse/HIVE-11314 Project: Hive Issue Type: Sub-task Components: Spark Affects Versions: 1.1.0 Reporter: Xuefu Zhang Like Hive on MR, Hive on Spark should print "Execution completed successfully" as part of the spark job info. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-11276) Optimization around job submission and adding jars [Spark Branch]
Xuefu Zhang created HIVE-11276: -- Summary: Optimization around job submission and adding jars [Spark Branch] Key: HIVE-11276 URL: https://issues.apache.org/jira/browse/HIVE-11276 Project: Hive Issue Type: Sub-task Components: Spark Affects Versions: 1.1.0 Reporter: Xuefu Zhang It seems that Hive on Spark has some room for performance improvement in job submission. Specifically, we are calling refreshLocalResources() for every job submission even when there are no changes in the jar list. Since Hive on Spark reuses the containers for the whole user session, we might be able to optimize that. We do need to take into consideration the case of dynamic allocation, in which new executors might be added. This task covers some R&D in this area. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
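One possible shape of the optimization: remember which jars were already shipped in this session and call refreshLocalResources() only when something new appears. This sketch ignores the dynamic-allocation wrinkle the issue raises (a cache like this would need invalidation when new executors join):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class LocalResourceCache {
    private final Set<String> shippedJars = new HashSet<>();

    // Return only the jars not yet shipped in this session; an empty result
    // means the refreshLocalResources() call can be skipped entirely.
    synchronized List<String> newJars(List<String> requested) {
        List<String> fresh = new ArrayList<>();
        for (String jar : requested) {
            if (shippedJars.add(jar)) {
                fresh.add(jar);
            }
        }
        return fresh;
    }

    public static void main(String[] args) {
        LocalResourceCache cache = new LocalResourceCache();
        System.out.println(cache.newJars(Arrays.asList("udfs.jar", "serde.jar"))); // both new
        System.out.println(cache.newJars(Arrays.asList("udfs.jar", "serde.jar"))); // empty: skip refresh
    }
}
```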
[jira] [Created] (HIVE-11275) Merge master to beeline-cli branch 07/14/2015
Xuefu Zhang created HIVE-11275: -- Summary: Merge master to beeline-cli branch 07/14/2015 Key: HIVE-11275 URL: https://issues.apache.org/jira/browse/HIVE-11275 Project: Hive Issue Type: Sub-task Components: CLI Reporter: Xuefu Zhang Assignee: Xuefu Zhang -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-11250) Change in spark.executor.instances (and others) doesn't take effect after RSC is launched for HS2 [Spark Branch]
Xuefu Zhang created HIVE-11250: -- Summary: Change in spark.executor.instances (and others) doesn't take effect after RSC is launched for HS2 [Spark Branch] Key: HIVE-11250 URL: https://issues.apache.org/jira/browse/HIVE-11250 Project: Hive Issue Type: Bug Components: Spark Affects Versions: 1.1.0 Reporter: Xuefu Zhang Hive CLI works as expected. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-11240) Change value type from int to long for HiveConf.ConfVars.METASTORESERVERMAXMESSAGESIZE
Xuefu Zhang created HIVE-11240: -- Summary: Change value type from int to long for HiveConf.ConfVars.METASTORESERVERMAXMESSAGESIZE Key: HIVE-11240 URL: https://issues.apache.org/jira/browse/HIVE-11240 Project: Hive Issue Type: Improvement Components: Metastore Affects Versions: 1.2.0, 1.1.0 Reporter: Xuefu Zhang Currently in HiveMetaStore.java, we are getting an integer value from this property: {code} int maxMessageSize = conf.getIntVar(HiveConf.ConfVars.METASTORESERVERMAXMESSAGESIZE); {code} While this is sufficient most of the time, there can be cases where the message size needs to be greater than Integer.MAX_VALUE. We should use long instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
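A sketch of the proposed change, using java.util.Properties as a stand-in for HiveConf (the real change would use HiveConf's long-valued accessor); the property-name string corresponding to METASTORESERVERMAXMESSAGESIZE is an assumption here:

```java
import java.util.Properties;

public class MaxMessageSize {
    // Assumed property name behind HiveConf.ConfVars.METASTORESERVERMAXMESSAGESIZE.
    static final String KEY = "hive.metastore.server.max.message.size";

    // Read the knob as a long so values beyond Integer.MAX_VALUE (~2 GB)
    // remain representable; fall back to the default when unset.
    static long maxMessageSize(Properties conf, long defaultValue) {
        String v = conf.getProperty(KEY);
        return v == null ? defaultValue : Long.parseLong(v.trim());
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty(KEY, "3221225472"); // 3 GB: overflows an int
        System.out.println(maxMessageSize(conf, 104857600L));
    }
}
```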
[jira] [Created] (HIVE-11088) Investigate intermittent failure of join28.q for Spark
Xuefu Zhang created HIVE-11088: -- Summary: Investigate intermittent failure of join28.q for Spark Key: HIVE-11088 URL: https://issues.apache.org/jira/browse/HIVE-11088 Project: Hive Issue Type: Sub-task Components: Spark Affects Versions: 1.3.0 Reporter: Xuefu Zhang Assignee: Mohit Sabharwal Please refer to https://issues.apache.org/jira/browse/HIVE-10996?focusedCommentId=14598349&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14598349. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-11067) Merge master to Spark branch 6/20/2015 [Spark Branch]
Xuefu Zhang created HIVE-11067: -- Summary: Merge master to Spark branch 6/20/2015 [Spark Branch] Key: HIVE-11067 URL: https://issues.apache.org/jira/browse/HIVE-11067 Project: Hive Issue Type: Sub-task Reporter: Xuefu Zhang Assignee: Xuefu Zhang -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-11000) Hive not able to pass Hive's Kerberos credential to spark-submit process [Spark Branch]
Xuefu Zhang created HIVE-11000: -- Summary: Hive not able to pass Hive's Kerberos credential to spark-submit process [Spark Branch] Key: HIVE-11000 URL: https://issues.apache.org/jira/browse/HIVE-11000 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Xuefu Zhang The end result is that a manual kinit with Hive's keytab is needed on the host where HS2 is running, or the following error may appear: {code} 2015-04-29 15:49:34,614 INFO org.apache.hive.spark.client.SparkClientImpl: 15/04/29 15:49:34 WARN UserGroupInformation: PriviledgedActionException as:hive (auth:KERBEROS) cause:java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)] 2015-04-29 15:49:34,652 INFO org.apache.hive.spark.client.SparkClientImpl: Exception in thread "main" java.io.IOException: Failed on local exception: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]; Host Details : local host is: "secure-hos-1.ent.cloudera.com/10.20.77.79"; destination host is: "secure-hos-1.ent.cloudera.com":8032; 2015-04-29 15:49:34,653 INFO org.apache.hive.spark.client.SparkClientImpl: at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772) 2015-04-29 15:49:34,653 INFO org.apache.hive.spark.client.SparkClientImpl: at org.apache.hadoop.ipc.Client.call(Client.java:1472) 2015-04-29 15:49:34,654 INFO org.apache.hive.spark.client.SparkClientImpl: at org.apache.hadoop.ipc.Client.call(Client.java:1399) 2015-04-29 15:49:34,654 INFO org.apache.hive.spark.client.SparkClientImpl: at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232) 2015-04-29 15:49:34,654 INFO org.apache.hive.spark.client.SparkClientImpl: at com.sun.proxy.$Proxy11.getClusterMetrics(Unknown Source) 2015-04-29 15:49:34,655 INFO 
org.apache.hive.spark.client.SparkClientImpl: at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getClusterMetrics(ApplicationClientProtocolPBClientImpl.java:202) 2015-04-29 15:49:34,655 INFO org.apache.hive.spark.client.SparkClientImpl: at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 2015-04-29 15:49:34,655 INFO org.apache.hive.spark.client.SparkClientImpl: at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) 2015-04-29 15:49:34,656 INFO org.apache.hive.spark.client.SparkClientImpl: at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 2015-04-29 15:49:34,656 INFO org.apache.hive.spark.client.SparkClientImpl: at java.lang.reflect.Method.invoke(Method.java:606) 2015-04-29 15:49:34,656 INFO org.apache.hive.spark.client.SparkClientImpl: at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187) 2015-04-29 15:49:34,657 INFO org.apache.hive.spark.client.SparkClientImpl: at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) 2015-04-29 15:49:34,657 INFO org.apache.hive.spark.client.SparkClientImpl: at com.sun.proxy.$Proxy12.getClusterMetrics(Unknown Source) 2015-04-29 15:49:34,657 INFO org.apache.hive.spark.client.SparkClientImpl: at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getYarnClusterMetrics(YarnClientImpl.java:461) 2015-04-29 15:49:34,657 INFO org.apache.hive.spark.client.SparkClientImpl: at org.apache.spark.deploy.yarn.Client$$anonfun$submitApplication$1.apply(Client.scala:91) 2015-04-29 15:49:34,657 INFO org.apache.hive.spark.client.SparkClientImpl: at org.apache.spark.deploy.yarn.Client$$anonfun$submitApplication$1.apply(Client.scala:91) 2015-04-29 15:49:34,657 INFO org.apache.hive.spark.client.SparkClientImpl: at org.apache.spark.Logging$class.logInfo(Logging.scala:59) 2015-04-29 15:49:34,657 INFO org.apache.hive.spark.client.SparkClientImpl: at 
org.apache.spark.deploy.yarn.Client.logInfo(Client.scala:49) 2015-04-29 15:49:34,657 INFO org.apache.hive.spark.client.SparkClientImpl: at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:90) 2015-04-29 15:49:34,658 INFO org.apache.hive.spark.client.SparkClientImpl: at org.apache.spark.deploy.yarn.Client.run(Client.scala:619) 2015-04-29 15:49:34,658 INFO org.apache.hive.spark.client.SparkClientImpl: at org.apache.spark.deploy.yarn.Client$.main(Client.scala:647) 2015-04-29 15:49:34,658 INFO org.apache.hive.spark.client.SparkClientImpl: at org.apache.spark.deploy.yarn.Client.main(Client.scala) 2015-04-29 15:49:34,658 INFO org.apache.hive.spark.client.SparkClientImpl: at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 2015-04-
[jira] [Created] (HIVE-10999) Upgrade Spark dependency to 1.4
Xuefu Zhang created HIVE-10999: -- Summary: Upgrade Spark dependency to 1.4 Key: HIVE-10999 URL: https://issues.apache.org/jira/browse/HIVE-10999 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Xuefu Zhang Spark 1.4.0 is released. Let's update the dependency version from 1.3.1 to 1.4.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-10962) Merge master to Spark branch 6/7/2015 [Spark Branch]
Xuefu Zhang created HIVE-10962: -- Summary: Merge master to Spark branch 6/7/2015 [Spark Branch] Key: HIVE-10962 URL: https://issues.apache.org/jira/browse/HIVE-10962 Project: Hive Issue Type: Sub-task Reporter: Xuefu Zhang -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-10870) Merge Spark branch to trunk 5/29/2015
Xuefu Zhang created HIVE-10870: -- Summary: Merge Spark branch to trunk 5/29/2015 Key: HIVE-10870 URL: https://issues.apache.org/jira/browse/HIVE-10870 Project: Hive Issue Type: Task Components: Spark Reporter: Xuefu Zhang Assignee: Xuefu Zhang -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-10868) Update release note for 1.2.0 and 1.1.0
Xuefu Zhang created HIVE-10868: -- Summary: Update release note for 1.2.0 and 1.1.0 Key: HIVE-10868 URL: https://issues.apache.org/jira/browse/HIVE-10868 Project: Hive Issue Type: Task Components: Documentation Affects Versions: 1.2.0, 1.1.0 Reporter: Xuefu Zhang Assignee: Xuefu Zhang It was recently found that Hive's release notes don't contain all fixed JIRAs. This happened because some JIRAs had an incorrect or missing fix version. A large chunk of such JIRAs have fix versions that didn't get updated during a merge from a feature branch to trunk (master). This JIRA is to fix such JIRAs related to the Hive on Spark work. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-10863) Merge trunk to Spark branch 5/28/2015 [Spark Branch]
Xuefu Zhang created HIVE-10863: -- Summary: Merge trunk to Spark branch 5/28/2015 [Spark Branch] Key: HIVE-10863 URL: https://issues.apache.org/jira/browse/HIVE-10863 Project: Hive Issue Type: Sub-task Reporter: Xuefu Zhang Assignee: Xuefu Zhang -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-10855) Make HIVE-10568 work with Spark [Spark Branch]
Xuefu Zhang created HIVE-10855: -- Summary: Make HIVE-10568 work with Spark [Spark Branch] Key: HIVE-10855 URL: https://issues.apache.org/jira/browse/HIVE-10855 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Xuefu Zhang Assignee: Rui Li HIVE-10568 only works with Tez. It would be good to make it work with Spark as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-10854) Make HIVE-10001 work with Spark [Spark Branch]
Xuefu Zhang created HIVE-10854: -- Summary: Make HIVE-10001 work with Spark [Spark Branch] Key: HIVE-10854 URL: https://issues.apache.org/jira/browse/HIVE-10854 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Xuefu Zhang HIVE-10001 only works with Tez. It would be good to make it work with Spark as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-10850) Followup for HIVE-10550, check performance w.r.t. persistency level
Xuefu Zhang created HIVE-10850: -- Summary: Followup for HIVE-10550, check performance w.r.t. persistency level Key: HIVE-10850 URL: https://issues.apache.org/jira/browse/HIVE-10850 Project: Hive Issue Type: Task Components: Spark Affects Versions: 1.2.0, 1.1.0 Reporter: Xuefu Zhang Assignee: Chengxiang Li In HIVE-10550, there was a discussion on the persistence level and whether we need to give the user some control over it. This JIRA is to investigate further, especially by measuring performance under different conditions, and to continue the discussion. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-10810) Document Beeline/CLI changes
Xuefu Zhang created HIVE-10810: -- Summary: Document Beeline/CLI changes Key: HIVE-10810 URL: https://issues.apache.org/jira/browse/HIVE-10810 Project: Hive Issue Type: Sub-task Components: CLI Reporter: Xuefu Zhang -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-10671) yarn-cluster mode offers a degraded performance from yarn-client [Spark Branch]
Xuefu Zhang created HIVE-10671: -- Summary: yarn-cluster mode offers a degraded performance from yarn-client [Spark Branch] Key: HIVE-10671 URL: https://issues.apache.org/jira/browse/HIVE-10671 Project: Hive Issue Type: Bug Components: Spark Reporter: Xuefu Zhang With Hive on Spark, users noticed that in certain cases spark.master=yarn-client offers 2x or 3x better performance than spark.master=yarn-cluster. However, yarn-cluster is what we recommend and support, so we should investigate and fix the problem. One such query is TPC-H query 22. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-10579) Fix -Phadoop-1 build
Xuefu Zhang created HIVE-10579: -- Summary: Fix -Phadoop-1 build Key: HIVE-10579 URL: https://issues.apache.org/jira/browse/HIVE-10579 Project: Hive Issue Type: Bug Reporter: Xuefu Zhang Assignee: Xuefu Zhang -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-10574) Metastore to handle expired tokens inline
Xuefu Zhang created HIVE-10574: -- Summary: Metastore to handle expired tokens inline Key: HIVE-10574 URL: https://issues.apache.org/jira/browse/HIVE-10574 Project: Hive Issue Type: Bug Components: Metastore Reporter: Xuefu Zhang This is a followup for HIVE-9625. The metastore has a garbage collection thread that removes expired tokens. However, that still leaves a window (1 hour by default) in which clients could retrieve a token that has expired or is about to expire. One option is for the metastore to handle expired tokens inline. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
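The inline handling suggested above can be sketched as follows. This is an illustrative model only; the class and method names are hypothetical and not the metastore's actual delegation token store API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of inline expired-token handling: retrieval itself checks and
// removes expired tokens, closing the window left by a periodic GC
// thread. Names are hypothetical, not the real metastore token store API.
class InlineExpiryTokenStore {
    private final Map<String, Long> expiryByToken = new ConcurrentHashMap<>();

    void addToken(String token, long expiryMillis) {
        expiryByToken.put(token, expiryMillis);
    }

    /** Returns the token only if still valid; expired tokens are removed inline. */
    String getToken(String token, long nowMillis) {
        Long expiry = expiryByToken.get(token);
        if (expiry == null) {
            return null; // unknown token
        }
        if (expiry <= nowMillis) {
            expiryByToken.remove(token); // cleaned up at access time, not by the GC thread
            return null;
        }
        return token;
    }
}
```

With a check like this, a client can never retrieve a token past its expiry time, regardless of how often the background cleanup thread runs.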
[jira] [Created] (HIVE-10516) Measure Hive CLI's performance difference before and after implementation is switched
Xuefu Zhang created HIVE-10516: -- Summary: Measure Hive CLI's performance difference before and after implementation is switched Key: HIVE-10516 URL: https://issues.apache.org/jira/browse/HIVE-10516 Project: Hive Issue Type: Sub-task Components: CLI Affects Versions: 0.10.0 Reporter: Xuefu Zhang -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-10515) Create tests to cover existing (supported) Hive CLI functionality
Xuefu Zhang created HIVE-10515: -- Summary: Create tests to cover existing (supported) Hive CLI functionality Key: HIVE-10515 URL: https://issues.apache.org/jira/browse/HIVE-10515 Project: Hive Issue Type: Sub-task Components: CLI Affects Versions: 0.10.0 Reporter: Xuefu Zhang After removing HiveServer1, Hive CLI's functionality is reduced to its original use case: a thick client application. Let's identify and cover this functionality with tests so that we maintain it when the implementation is changed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-10511) Unify Hive CLI and Beeline
Xuefu Zhang created HIVE-10511: -- Summary: Unify Hive CLI and Beeline Key: HIVE-10511 URL: https://issues.apache.org/jira/browse/HIVE-10511 Project: Hive Issue Type: Bug Components: CLI Affects Versions: 0.10.0 Reporter: Xuefu Zhang Hive CLI is a legacy tool with two main use cases: 1. a thick client for SQL on Hadoop; 2. a command line tool for HiveServer1. HiveServer1 is already deprecated and removed from the Hive code base, so use case #2 is out of the question. For #1, Beeline provides (or is supposed to provide) equal functionality, yet is implemented differently from Hive CLI. Since the Hive community has been recommending the Beeline + HS2 configuration for a while now, ideally we should deprecate Hive CLI. Because Hive CLI is so widely used, we instead propose replacing Hive CLI's implementation with Beeline plus an embedded HS2, so that the Hive community only needs to maintain a single code path. In this way, Hive CLI becomes just an alias for Beeline, either at the shell-script level or at a higher code level. The goal is that no changes, or minimal changes, are required of existing user scripts that use Hive CLI. This is an umbrella JIRA covering all tasks related to this initiative. Over the last year or two, Beeline has been improved significantly to match what Hive CLI offers. Still, there may be gaps or deficiencies to be discovered and fixed. In the meantime, we also want to make sure that enough tests are included and that the performance impact is identified and addressed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
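The alias idea can be sketched at the code level as below. The `translateArgs` helper, the pass-through option mapping, and the embedded-mode JDBC URL `jdbc:hive2://` are illustrative assumptions about how a CLI invocation might be routed to Beeline, not the actual HIVE-10511 implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch: route a Hive CLI invocation to Beeline by
// prepending an embedded-HS2 connection and passing the remaining
// options through. Hypothetical helper, not the real implementation.
class CliToBeeline {
    static List<String> translateArgs(String[] cliArgs) {
        List<String> beelineArgs = new ArrayList<>();
        // Embedded mode: Beeline starts an in-process HS2, so no server
        // deployment is required, preserving the thick-client use case.
        beelineArgs.add("-u");
        beelineArgs.add("jdbc:hive2://");
        // Options like -e, -f and --hiveconf are assumed here to map
        // one-to-one between the two tools.
        beelineArgs.addAll(Arrays.asList(cliArgs));
        return beelineArgs;
    }
}
```

Under this scheme the `hive` shell script would simply invoke Beeline's main class with the translated argument list.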
[jira] [Created] (HIVE-10166) Merge Spark branch to trunk 3/31/2015
Xuefu Zhang created HIVE-10166: -- Summary: Merge Spark branch to trunk 3/31/2015 Key: HIVE-10166 URL: https://issues.apache.org/jira/browse/HIVE-10166 Project: Hive Issue Type: Sub-task Components: Spark Affects Versions: 1.1.0 Reporter: Xuefu Zhang -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-10134) Fix test failures after HIVE-10130 [Spark Branch]
Xuefu Zhang created HIVE-10134: -- Summary: Fix test failures after HIVE-10130 [Spark Branch] Key: HIVE-10134 URL: https://issues.apache.org/jira/browse/HIVE-10134 Project: Hive Issue Type: Sub-task Components: Spark Affects Versions: spark-branch Reporter: Xuefu Zhang Complete test run: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-SPARK-Build/812/#showFailuresLink *Failed tests:* {noformat} org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_nonmr_fetch org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_union31 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_union_remove_22 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_union_remove_6_subq org.apache.hive.hcatalog.hbase.TestPigHBaseStorageHandler.org.apache.hive.hcatalog.hbase.TestPigHBaseStorageHandler {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-10130) Merge from Spark branch to trunk 03/27/2015
Xuefu Zhang created HIVE-10130: -- Summary: Merge from Spark branch to trunk 03/27/2015 Key: HIVE-10130 URL: https://issues.apache.org/jira/browse/HIVE-10130 Project: Hive Issue Type: Sub-task Components: Spark Affects Versions: spark-branch Reporter: Xuefu Zhang -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-10084) Improve common join performance [Spark Branch]
Xuefu Zhang created HIVE-10084: -- Summary: Improve common join performance [Spark Branch] Key: HIVE-10084 URL: https://issues.apache.org/jira/browse/HIVE-10084 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Xuefu Zhang Benchmarks show numbers indicating that Hive on Spark's common join performance can be improved. This task is to investigate and fix the issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-9990) TestMultiSessionsHS2WithLocalClusterSpark is failing
Xuefu Zhang created HIVE-9990: - Summary: TestMultiSessionsHS2WithLocalClusterSpark is failing Key: HIVE-9990 URL: https://issues.apache.org/jira/browse/HIVE-9990 Project: Hive Issue Type: Bug Components: Spark Affects Versions: 1.2.0 Reporter: Xuefu Zhang At least sometimes. I can reproduce it with "mvn test -Dtest=TestMultiSessionsHS2WithLocalClusterSpark -Phadoop-2" consistently on my local box. {code} --- T E S T S --- Running org.apache.hive.jdbc.TestMultiSessionsHS2WithLocalClusterSpark Tests run: 2, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 92.438 sec <<< FAILURE! - in org.apache.hive.jdbc.TestMultiSessionsHS2WithLocalClusterSpark testSparkQuery(org.apache.hive.jdbc.TestMultiSessionsHS2WithLocalClusterSpark) Time elapsed: 21.514 sec <<< ERROR! java.util.concurrent.ExecutionException: java.sql.SQLException: Error while processing statement: FAILED: Execution Error, return code 3 from org.apache.hadoop.hive.ql.exec.spark.SparkTask at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:296) at org.apache.hive.jdbc.HiveStatement.executeQuery(HiveStatement.java:392) at org.apache.hive.jdbc.TestMultiSessionsHS2WithLocalClusterSpark.verifyResult(TestMultiSessionsHS2WithLocalClusterSpark.java:244) at org.apache.hive.jdbc.TestMultiSessionsHS2WithLocalClusterSpark.testKvQuery(TestMultiSessionsHS2WithLocalClusterSpark.java:220) at org.apache.hive.jdbc.TestMultiSessionsHS2WithLocalClusterSpark.access$000(TestMultiSessionsHS2WithLocalClusterSpark.java:53) {code} The error was also seen in HIVE-9934 test run. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-9889) Merge trunk to Spark branch 3/6/2015 [Spark Branch]
Xuefu Zhang created HIVE-9889: - Summary: Merge trunk to Spark branch 3/6/2015 [Spark Branch] Key: HIVE-9889 URL: https://issues.apache.org/jira/browse/HIVE-9889 Project: Hive Issue Type: Task Components: Spark Reporter: Xuefu Zhang Assignee: Xuefu Zhang -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-9863) Querying parquet tables fails with IllegalStateException [Spark Branch]
Xuefu Zhang created HIVE-9863: - Summary: Querying parquet tables fails with IllegalStateException [Spark Branch] Key: HIVE-9863 URL: https://issues.apache.org/jira/browse/HIVE-9863 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Xuefu Zhang This doesn't necessarily happen only in the Spark branch; queries such as select count(*) from table_name fail with an error: {code} hive> select * from content limit 2; OK Failed with exception java.io.IOException:java.lang.IllegalStateException: All the offsets listed in the split should be found in the file. expected: [4, 4] found: [BlockMetaData{69644, 881917418 [ColumnMetaData{GZIP [guid] BINARY [PLAIN, BIT_PACKED], 4}, ColumnMetaData{GZIP [collection_name] BINARY [PLAIN_DICTIONARY, BIT_PACKED], 389571}, ColumnMetaData{GZIP [doc_type] BINARY [PLAIN_DICTIONARY, BIT_PACKED], 389790}, ColumnMetaData{GZIP [stage] INT64 [PLAIN_DICTIONARY, BIT_PACKED], 389887}, ColumnMetaData{GZIP [meta_timestamp] INT64 [RLE, PLAIN_DICTIONARY, BIT_PACKED], 397673}, ColumnMetaData{GZIP [doc_timestamp] INT64 [RLE, PLAIN_DICTIONARY, BIT_PACKED], 422161}, ColumnMetaData{GZIP [meta_size] INT32 [RLE, PLAIN_DICTIONARY, BIT_PACKED], 460215}, ColumnMetaData{GZIP [content_size] INT32 [RLE, PLAIN_DICTIONARY, BIT_PACKED], 521728}, ColumnMetaData{GZIP [source] BINARY [RLE, PLAIN, BIT_PACKED], 683740}, ColumnMetaData{GZIP [delete_flag] BOOLEAN [RLE, PLAIN, BIT_PACKED], 683787}, ColumnMetaData{GZIP [meta] BINARY [RLE, PLAIN, BIT_PACKED], 683834}, ColumnMetaData{GZIP [content] BINARY [RLE, PLAIN, BIT_PACKED], 6992365}]}] out of: [4, 129785482, 260224757] in range 0, 134217728 Time taken: 0.253 seconds hive> {code} I can reproduce the problem in either local or yarn-cluster mode. It also seems to happen with MR, so I suspect this is a Parquet problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-9812) Merge trunk to Spark branch 02/27/2015 [Spark Branch]
Xuefu Zhang created HIVE-9812: - Summary: Merge trunk to Spark branch 02/27/2015 [Spark Branch] Key: HIVE-9812 URL: https://issues.apache.org/jira/browse/HIVE-9812 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Xuefu Zhang Assignee: Xuefu Zhang -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-9671) Support Impersonation [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332211#comment-14332211 ] Xuefu Zhang commented on HIVE-9671: --- Patch looks good. One minor nit: a space seems to be missing: {code} user =Utils.getUGI().getShortUserName(); {code} Besides that, the code additions in the shims seem identical, so it might make sense to extract a private method to reuse the code instead. > Support Impersonation [Spark Branch] > > > Key: HIVE-9671 > URL: https://issues.apache.org/jira/browse/HIVE-9671 > Project: Hive > Issue Type: Sub-task > Components: Spark >Affects Versions: spark-branch >Reporter: Brock Noland >Assignee: Brock Noland > Attachments: HIVE-9671.1-spark.patch, HIVE-9671.1-spark.patch, > HIVE-9671.2-spark.patch > > > SPARK-5493 in 1.3 implemented proxy user authentication. We need to implement > using this option in spark client. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-9671) Support Impersonation [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuefu Zhang updated HIVE-9671: -- Status: Open (was: Patch Available) > Support Impersonation [Spark Branch] > > > Key: HIVE-9671 > URL: https://issues.apache.org/jira/browse/HIVE-9671 > Project: Hive > Issue Type: Sub-task > Components: Spark >Reporter: Brock Noland >Assignee: Brock Noland > > SPARK-5493 in 1.3 implemented proxy user authentication. We need to implement > using this option in spark client. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-9745) predicate evaluation of character fields with spaces and literals with spaces returns unexpected result
[ https://issues.apache.org/jira/browse/HIVE-9745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuefu Zhang updated HIVE-9745: -- Description: The following query should return 5 rows, but Hive returns 3: {code} select rnum, tchar.cchar from tchar where not ( tchar.cchar = ' ' or ( tchar.cchar is null and ' ' is null )) {code} Consider the following projection of the base table: {code} select rnum, tchar.cchar, case tchar.cchar when ' ' then 'space' else 'not space' end, case when tchar.cchar is null then 'is null' else 'not null' end, case when ' ' is null then 'is null' else 'not null' end from tchar order by rnum {code} Row 0 is a NULL. Row 1 was loaded with a zero-length string ''. Row 2 was loaded with a single space ' '. {code} rnum tchar.cchar _c2 _c3 _c4 0 not space is null not null 1 not space not null not null 2 not space not null not null 3 BB not space not null not null 4 EE not space not null not null 5 FF not space not null not null {code} Explicitly type casting the literal, which many SQL developers would not expect to need to do, gives the expected result: 
{code} select rnum, tchar.cchar, case tchar.cchar when cast(' ' as char(1)) then 'space' else 'not space' end, case when tchar.cchar is null then 'is null' else 'not null' end, case when cast( ' ' as char(1)) is null then 'is null' else 'not null' end from tchar order by rnum rnum tchar.cchar _c2 _c3 _c4 0 not space is null not null 1 space not null not null 2 space not null not null 3 BB not space not null not null 4 EE not space not null not null 5 FF not space not null not null create table if not exists T_TCHAR ( RNUM int , CCHAR char(32) ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n' STORED AS TEXTFILE ; 0|\N 1| 2| 3|BB 4|EE 5|FF create table if not exists TCHAR ( RNUM int , CCHAR char(32) ) STORED AS orc ; insert overwrite table TCHAR select * from T_TCHAR; {code} was: The following query should return 5 rows but Hive returns 3 select rnum, tchar.cchar from tchar where not ( tchar.cchar = ' ' or ( tchar.cchar is null and ' ' is null )) Consider the following project of the base table select rnum, tchar.cchar, case tchar.cchar when ' ' then 'space' else 'not space' end, case when tchar.cchar is null then 'is null' else 'not null' end, case when ' ' is null then 'is null' else 'not null' end from tchar order by rnum Row 0 is a NULL Row 1 was loaded with a zero length string '' Row 2 was loaded with a single space ' ' rnum tchar.cchar _c2 _c3 _c4 0 not space is null not null 1 not space not null not null 2 not space not null not null 3 BB not space not null not null 4 EE not space not null not null 5 FF not space not null not null Explicitly type cast the literal which many SQL developers would not expect need to do. 
select rnum, tchar.cchar, case tchar.cchar when cast(' ' as char(1)) then 'space' else 'not space' end, case when tchar.cchar is null then 'is null' else 'not null' end, case when cast( ' ' as char(1)) is null then 'is null' else 'not null' end from tchar order by rnum rnum tchar.cchar _c2 _c3 _c4 0 not space is null not null 1 space not null not null 2 space not null not null 3 BB not space not null not null 4 EE not space not null not null 5 FF not space not null not null create table if not exists T_TCHAR ( RNUM int , CCHAR char(32) ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n' STORED AS TEXTFILE ; 0|\N 1| 2| 3|BB 4|EE 5|FF create table if not exists TCHAR ( RNUM int , CCHAR char(32) ) STORED AS orc ; insert overwrite table TCHAR select * from T_TCHAR; > predicate evaluation of character fields with spaces and literals with spaces returns unexpected result
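The behavior hinges on SQL CHAR padded-comparison semantics: trailing spaces are insignificant when two CHAR values are compared, which is why rows 1 and 2 report 'space' once the literal is cast to char(1). The sketch below models that rule in isolation; it is an illustration of the semantics, not Hive's actual HiveChar comparison code.

```java
// Model of SQL CHAR padded-comparison semantics: both operands are
// right-trimmed before comparing, so '' and ' ' are equal as CHAR.
// Illustration only, not Hive's implementation.
class PaddedCompare {
    static String rtrim(String s) {
        int end = s.length();
        while (end > 0 && s.charAt(end - 1) == ' ') {
            end--;
        }
        return s.substring(0, end);
    }

    static boolean charEquals(String a, String b) {
        return rtrim(a).equals(rtrim(b));
    }
}
```

Whether the untyped literal ' ' should also be compared with these padded semantics, as the cast(' ' as char(1)) version is, is exactly the inconsistency the report describes.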
[jira] [Commented] (HIVE-9726) Upgrade to spark 1.3 [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14328971#comment-14328971 ] Xuefu Zhang commented on HIVE-9726: --- +1 > Upgrade to spark 1.3 [Spark Branch] > --- > > Key: HIVE-9726 > URL: https://issues.apache.org/jira/browse/HIVE-9726 > Project: Hive > Issue Type: Sub-task > Components: Spark >Affects Versions: spark-branch >Reporter: Brock Noland >Assignee: Brock Noland > Attachments: HIVE-9671.1-spark.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-9703) Merge from Spark branch to trunk 02/16/2015
[ https://issues.apache.org/jira/browse/HIVE-9703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14326814#comment-14326814 ] Xuefu Zhang commented on HIVE-9703: --- No doc is needed for this JIRA. Any doc impact should be tracked by the respective JIRAs on the Spark branch. Going over the patch shows there is nothing to be documented, however. > Merge from Spark branch to trunk 02/16/2015 > --- > > Key: HIVE-9703 > URL: https://issues.apache.org/jira/browse/HIVE-9703 > Project: Hive > Issue Type: Task >Reporter: Xuefu Zhang >Assignee: Xuefu Zhang > Fix For: 1.2.0 > > Attachments: HIVE-9703.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (HIVE-7292) Hive on Spark
[ https://issues.apache.org/jira/browse/HIVE-7292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14326336#comment-14326336 ] Xuefu Zhang edited comment on HIVE-7292 at 2/18/15 6:37 PM: Formerly 0.15, now 1.1 is going to be released soon. Release candidate is out. was (Author: xuefuz): Formerly 0.15, now 1.1 is going to be release soon. Release candidate is out. > Hive on Spark > - > > Key: HIVE-7292 > URL: https://issues.apache.org/jira/browse/HIVE-7292 > Project: Hive > Issue Type: Improvement > Components: Spark >Reporter: Xuefu Zhang >Assignee: Xuefu Zhang > Labels: Spark-M1, Spark-M2, Spark-M3, Spark-M4, Spark-M5 > Attachments: Hive-on-Spark.pdf > > > Spark as an open-source data analytics cluster computing framework has gained > significant momentum recently. Many Hive users already have Spark installed > as their computing backbone. To take advantages of Hive, they still need to > have either MapReduce or Tez on their cluster. This initiative will provide > user a new alternative so that those user can consolidate their backend. > Secondly, providing such an alternative further increases Hive's adoption as > it exposes Spark users to a viable, feature-rich de facto standard SQL tools > on Hadoop. > Finally, allowing Hive to run on Spark also has performance benefits. Hive > queries, especially those involving multiple reducer stages, will run faster, > thus improving user experience as Tez does. > This is an umbrella JIRA which will cover many coming subtask. Design doc > will be attached here shortly, and will be on the wiki as well. Feedback from > the community is greatly appreciated! -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-7292) Hive on Spark
[ https://issues.apache.org/jira/browse/HIVE-7292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14326336#comment-14326336 ] Xuefu Zhang commented on HIVE-7292: --- Formerly 0.15, now 1.1 is going to be released soon. Release candidate is out. > Hive on Spark > - > > Key: HIVE-7292 > URL: https://issues.apache.org/jira/browse/HIVE-7292 > Project: Hive > Issue Type: Improvement > Components: Spark >Reporter: Xuefu Zhang >Assignee: Xuefu Zhang > Labels: Spark-M1, Spark-M2, Spark-M3, Spark-M4, Spark-M5 > Attachments: Hive-on-Spark.pdf > > > Spark as an open-source data analytics cluster computing framework has gained > significant momentum recently. Many Hive users already have Spark installed > as their computing backbone. To take advantages of Hive, they still need to > have either MapReduce or Tez on their cluster. This initiative will provide > user a new alternative so that those user can consolidate their backend. > Secondly, providing such an alternative further increases Hive's adoption as > it exposes Spark users to a viable, feature-rich de facto standard SQL tools > on Hadoop. > Finally, allowing Hive to run on Spark also has performance benefits. Hive > queries, especially those involving multiple reducer stages, will run faster, > thus improving user experience as Tez does. > This is an umbrella JIRA which will cover many coming subtask. Design doc > will be attached here shortly, and will be on the wiki as well. Feedback from > the community is greatly appreciated! -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-9561) SHUFFLE_SORT should only be used for order by query [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuefu Zhang updated HIVE-9561: -- Resolution: Fixed Fix Version/s: spark-branch Status: Resolved (was: Patch Available) [~lirui], no worries. I just committed this to the Spark branch. Thanks, Rui. > SHUFFLE_SORT should only be used for order by query [Spark Branch] > -- > > Key: HIVE-9561 > URL: https://issues.apache.org/jira/browse/HIVE-9561 > Project: Hive > Issue Type: Sub-task > Components: Spark >Reporter: Rui Li >Assignee: Rui Li > Fix For: spark-branch > > Attachments: HIVE-9561.1-spark.patch, HIVE-9561.2-spark.patch, > HIVE-9561.3-spark.patch, HIVE-9561.4-spark.patch, HIVE-9561.5-spark.patch, > HIVE-9561.6-spark.patch > > > The {{sortByKey}} shuffle launches probe jobs. Such jobs can hurt performance > and are difficult to control. So we should limit the use of {{sortByKey}} to > order by query only. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
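For context on the probe jobs: a range-based sort shuffle must first sample keys to pick partition boundaries, which costs an extra pass over the data before the real shuffle runs. The standalone sketch below models that two-phase mechanism (sample, derive bounds, route records); it is an illustration of the idea, not Spark's RangePartitioner implementation.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Models why a range shuffle needs a probe pass: partition boundaries
// must be derived from a sample of the keys before any record can be
// routed. Assumes the sample is at least numPartitions keys.
// Illustrative only, not Spark's RangePartitioner.
class RangeBounds {
    /** Derive (numPartitions - 1) split points from a sorted sample. */
    static List<Integer> bounds(List<Integer> sampledKeys, int numPartitions) {
        List<Integer> sorted = new ArrayList<>(sampledKeys);
        Collections.sort(sorted);
        List<Integer> result = new ArrayList<>();
        for (int i = 1; i < numPartitions; i++) {
            result.add(sorted.get(i * sorted.size() / numPartitions));
        }
        return result;
    }

    /** Route a key to the first partition whose upper bound exceeds it. */
    static int partitionFor(int key, List<Integer> bounds) {
        for (int i = 0; i < bounds.size(); i++) {
            if (key < bounds.get(i)) {
                return i;
            }
        }
        return bounds.size();
    }
}
```

Because the sampling pass runs before the query's own work, restricting this shuffle to genuine order-by queries avoids paying that cost where a hash shuffle would do.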
[jira] [Updated] (HIVE-9703) Merge from Spark branch to trunk 02/16/2015
[ https://issues.apache.org/jira/browse/HIVE-9703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuefu Zhang updated HIVE-9703: -- Resolution: Fixed Fix Version/s: 1.2.0 Status: Resolved (was: Patch Available) Committed to trunk. Thanks to Brock for the review. > Merge from Spark branch to trunk 02/16/2015 > --- > > Key: HIVE-9703 > URL: https://issues.apache.org/jira/browse/HIVE-9703 > Project: Hive > Issue Type: Task >Reporter: Xuefu Zhang >Assignee: Xuefu Zhang > Fix For: 1.2.0 > > Attachments: HIVE-9703.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-9607) Remove unnecessary attach-jdbc-driver execution from package/pom.xml
[ https://issues.apache.org/jira/browse/HIVE-9607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuefu Zhang updated HIVE-9607: -- Resolution: Fixed Fix Version/s: 1.2.0 Status: Resolved (was: Patch Available) Committed to trunk. Thanks, Alex. > Remove unnecessary attach-jdbc-driver execution from package/pom.xml > > > Key: HIVE-9607 > URL: https://issues.apache.org/jira/browse/HIVE-9607 > Project: Hive > Issue Type: Improvement > Components: Build Infrastructure >Reporter: Alexander Pivovarov >Assignee: Alexander Pivovarov >Priority: Minor > Fix For: 1.2.0 > > Attachments: HIVE-9607.1.patch > > > Looks like build-helper-maven-plugin block which has execution > attach-jdbc-driver is not needed in package/pom.xml > package/pom.xml has maven-dependency-plugin which copies hive-jdbc-standalone > to project.build.directory > I removed build-helper-maven-plugin block and rebuilt hive > hive-jdbc-standalone.jar is still placed to project.build.directory > {code} > $ mvn clean install -Phadoop-2 -Pdist -DskipTests > $ find . -name "apache-hive*jdbc.jar" -exec ls -la {} \; > 16844023 Feb 6 17:45 ./packaging/target/apache-hive-1.2.0-SNAPSHOT-jdbc.jar > $ find . -name "hive-jdbc*standalone.jar" -exec ls -la {} \; > 16844023 Feb 6 17:45 > ./packaging/target/apache-hive-1.2.0-SNAPSHOT-bin/apache-hive-1.2.0-SNAPSHOT-bin/lib/hive-jdbc-1.2.0-SNAPSHOT-standalone.jar > 16844023 Feb 6 17:45 ./jdbc/target/hive-jdbc-1.2.0-SNAPSHOT-standalone.jar > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-9561) SHUFFLE_SORT should only be used for order by query [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuefu Zhang updated HIVE-9561: -- Attachment: HIVE-9561.6-spark.patch > SHUFFLE_SORT should only be used for order by query [Spark Branch] > -- > > Key: HIVE-9561 > URL: https://issues.apache.org/jira/browse/HIVE-9561 > Project: Hive > Issue Type: Sub-task > Components: Spark >Reporter: Rui Li >Assignee: Rui Li > Attachments: HIVE-9561.1-spark.patch, HIVE-9561.2-spark.patch, > HIVE-9561.3-spark.patch, HIVE-9561.4-spark.patch, HIVE-9561.5-spark.patch, > HIVE-9561.6-spark.patch > > > The {{sortByKey}} shuffle launches probe jobs. Such jobs can hurt performance > and are difficult to control. So we should limit the use of {{sortByKey}} to > order by query only. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-9708) Remove testlibs directory
[ https://issues.apache.org/jira/browse/HIVE-9708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14324860#comment-14324860 ] Xuefu Zhang commented on HIVE-9708: --- +1 > Remove testlibs directory > - > > Key: HIVE-9708 > URL: https://issues.apache.org/jira/browse/HIVE-9708 > Project: Hive > Issue Type: Improvement >Affects Versions: 1.1.0 >Reporter: Brock Noland >Assignee: Brock Noland > Fix For: 1.1.0 > > Attachments: HIVE-9708.patch > > > The {{testlibs}} directory is left over from the old ant build. We can delete > it as it's downloaded by maven now: > https://github.com/apache/hive/blob/trunk/pom.xml#L610 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-9561) SHUFFLE_SORT should only be used for order by query [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuefu Zhang updated HIVE-9561: -- Attachment: HIVE-9561.5-spark.patch Rebased. > SHUFFLE_SORT should only be used for order by query [Spark Branch] > -- > > Key: HIVE-9561 > URL: https://issues.apache.org/jira/browse/HIVE-9561 > Project: Hive > Issue Type: Sub-task > Components: Spark >Reporter: Rui Li >Assignee: Rui Li > Attachments: HIVE-9561.1-spark.patch, HIVE-9561.2-spark.patch, > HIVE-9561.3-spark.patch, HIVE-9561.4-spark.patch, HIVE-9561.5-spark.patch > > > The {{sortByKey}} shuffle launches probe jobs. Such jobs can hurt performance > and are difficult to control. So we should limit the use of {{sortByKey}} to > order by query only. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-9561) SHUFFLE_SORT should only be used for order by query [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14323759#comment-14323759 ]

Xuefu Zhang commented on HIVE-9561:
-----------------------------------
Unfortunately the patch doesn't apply any more after the recent trunk-to-branch merge. Could you please rebase?

> SHUFFLE_SORT should only be used for order by query [Spark Branch]
> ------------------------------------------------------------------
>
>                 Key: HIVE-9561
>                 URL: https://issues.apache.org/jira/browse/HIVE-9561
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Spark
>            Reporter: Rui Li
>            Assignee: Rui Li
>         Attachments: HIVE-9561.1-spark.patch, HIVE-9561.2-spark.patch, HIVE-9561.3-spark.patch, HIVE-9561.4-spark.patch
>
>
> The {{sortByKey}} shuffle launches probe jobs. Such jobs can hurt performance and are difficult to control. So we should limit the use of {{sortByKey}} to order by query only.
[jira] [Commented] (HIVE-9561) SHUFFLE_SORT should only be used for order by query [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14323756#comment-14323756 ]

Xuefu Zhang commented on HIVE-9561:
-----------------------------------
+1

> SHUFFLE_SORT should only be used for order by query [Spark Branch]
> ------------------------------------------------------------------
>
>                 Key: HIVE-9561
>                 URL: https://issues.apache.org/jira/browse/HIVE-9561
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Spark
>            Reporter: Rui Li
>            Assignee: Rui Li
>         Attachments: HIVE-9561.1-spark.patch, HIVE-9561.2-spark.patch, HIVE-9561.3-spark.patch, HIVE-9561.4-spark.patch
>
>
> The {{sortByKey}} shuffle launches probe jobs. Such jobs can hurt performance and are difficult to control. So we should limit the use of {{sortByKey}} to order by query only.
[jira] [Updated] (HIVE-9696) Address RB comments for HIVE-9425 [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xuefu Zhang updated HIVE-9696:
------------------------------
    Fix Version/s: spark-branch

> Address RB comments for HIVE-9425 [Spark Branch]
> ------------------------------------------------
>
>                 Key: HIVE-9696
>                 URL: https://issues.apache.org/jira/browse/HIVE-9696
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Spark
>            Reporter: Rui Li
>            Priority: Trivial
>             Fix For: spark-branch
>
>         Attachments: HIVE-9696.1-spark.patch, HIVE-9696.1-spark.patch, HIVE-9696.1-spark.patch
>
>
> A followup task of HIVE-9425.
> The pending RB comment can be found [here|https://reviews.apache.org/r/30984/#comment118482].
[jira] [Updated] (HIVE-9696) Address RB comments for HIVE-9425 [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xuefu Zhang updated HIVE-9696:
------------------------------
    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

Committed to Spark branch. Thanks, Rui.

> Address RB comments for HIVE-9425 [Spark Branch]
> ------------------------------------------------
>
>                 Key: HIVE-9696
>                 URL: https://issues.apache.org/jira/browse/HIVE-9696
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Spark
>            Reporter: Rui Li
>            Priority: Trivial
>         Attachments: HIVE-9696.1-spark.patch, HIVE-9696.1-spark.patch, HIVE-9696.1-spark.patch
>
>
> A followup task of HIVE-9425.
> The pending RB comment can be found [here|https://reviews.apache.org/r/30984/#comment118482].
[jira] [Updated] (HIVE-9703) Merge from Spark branch to trunk 02/16/2015
[ https://issues.apache.org/jira/browse/HIVE-9703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xuefu Zhang updated HIVE-9703:
------------------------------
    Status: Patch Available  (was: Open)

> Merge from Spark branch to trunk 02/16/2015
> -------------------------------------------
>
>                 Key: HIVE-9703
>                 URL: https://issues.apache.org/jira/browse/HIVE-9703
>             Project: Hive
>          Issue Type: Task
>            Reporter: Xuefu Zhang
>            Assignee: Xuefu Zhang
>         Attachments: HIVE-9703.patch
>