[jira] [Created] (HIVE-24280) Fix a potential NPE
Xuefu Zhang created HIVE-24280: -- Summary: Fix a potential NPE Key: HIVE-24280 URL: https://issues.apache.org/jira/browse/HIVE-24280 Project: Hive Issue Type: Improvement Components: Vectorization Affects Versions: 3.1.2 Reporter: Xuefu Zhang Assignee: Xuefu Zhang
{code:java}
case STRING:
case CHAR:
case VARCHAR: {
  BytesColumnVector bcv = (BytesColumnVector) cols[colIndex];
  String sVal = value.toString();
  if (sVal == null) {
    bcv.noNulls = false;
    bcv.isNull[0] = true;
    bcv.isRepeating = true;
  } else {
    bcv.fill(sVal.getBytes());
  }
}
break;
{code}
The above code snippet assumes that sVal can be null, but it doesn't handle the case where value itself is null; if value is not null, it's unlikely that value.toString() returns null. However, we treat the partition column value for the default partition of string types as null rather than as "__HIVE_DEFAULT_PARTITION__", which Hive assumes, so we actually do hit the case where sVal is null. I propose a harmless fix, as shown in the attached patch. -- This message was sent by Atlassian Jira (v8.3.4#803005)
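The attached patch isn't reproduced here, but the shape of the fix can be sketched. The following is a hypothetical, self-contained illustration: FakeBytesColumnVector is a stand-in for Hive's BytesColumnVector, and fillStringColumn mirrors the snippet above with the null check moved to value itself.

```java
import java.nio.charset.StandardCharsets;

class ColumnFillerSketch {
    // Stand-in for Hive's BytesColumnVector, reduced to the fields the snippet touches.
    static class FakeBytesColumnVector {
        boolean noNulls = true;
        boolean isRepeating = false;
        boolean[] isNull = new boolean[1];
        byte[] filled;

        void fill(byte[] bytes) { filled = bytes; }
    }

    // Null-safe version of the snippet: check `value` itself, so a null
    // partition value (the default-partition case) no longer NPEs on toString().
    static void fillStringColumn(FakeBytesColumnVector bcv, Object value) {
        String sVal = (value == null) ? null : value.toString();
        if (sVal == null) {
            bcv.noNulls = false;
            bcv.isNull[0] = true;
            bcv.isRepeating = true;
        } else {
            bcv.fill(sVal.getBytes(StandardCharsets.UTF_8));
        }
    }
}
```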
[jira] [Created] (HIVE-17586) Make HS2 BackgroundOperationPool not fixed
Xuefu Zhang created HIVE-17586: -- Summary: Make HS2 BackgroundOperationPool not fixed Key: HIVE-17586 URL: https://issues.apache.org/jira/browse/HIVE-17586 Project: Hive Issue Type: Bug Components: HiveServer2 Affects Versions: 1.1.0 Reporter: Xuefu Zhang Assignee: Xuefu Zhang Currently the thread pool for background asynchronous operations has a fixed size controlled by {{hive.server2.async.exec.threads}}. However, the thread factory supplied for this thread pool is {{ThreadFactoryWithGarbageCleanup}}, which creates ThreadWithGarbageCleanup. Since this is a fixed thread pool, the threads are actually never killed, defeating the purpose of the garbage cleanup noted in the thread class name. On the other hand, since these threads never go away, significant resources such as thread-local variables (classloaders, HiveConfs, etc.) are held even when no operation is running. This can lead to escalated HS2 memory usage. Ideally, the thread pool should not be fixed, allowing threads to die out so resources can be reclaimed. The existing config {{hive.server2.async.exec.threads}} is treated as the max, and we can add a min for the thread pool, {{hive.server2.async.exec.min.threads}}. The default value for this configuration is -1, which keeps the existing behavior. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
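One standard way to get such an elastic pool from java.util.concurrent is sketched below (illustration only; the real patch would also wire in ThreadFactoryWithGarbageCleanup and the config names above). With an unbounded work queue a ThreadPoolExecutor never grows past its core size, so the usual trick is core == max plus allowCoreThreadTimeOut, which lets idle threads die and release their thread-locals.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class ElasticPoolSketch {
    static ThreadPoolExecutor newBackgroundPool(int maxThreads, long keepAliveSecs) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
            maxThreads, maxThreads, keepAliveSecs, TimeUnit.SECONDS,
            new LinkedBlockingQueue<>());
        // Let core threads time out too, so the pool can shrink all the way to
        // zero and per-thread resources (classloaders, HiveConfs) are released.
        pool.allowCoreThreadTimeOut(true);
        return pool;
    }
}
```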
[jira] [Created] (HIVE-17548) ThriftCliService reports an inaccurate number of current sessions in the log message
Xuefu Zhang created HIVE-17548: -- Summary: ThriftCliService reports an inaccurate number of current sessions in the log message Key: HIVE-17548 URL: https://issues.apache.org/jira/browse/HIVE-17548 Project: Hive Issue Type: Bug Components: HiveServer2 Affects Versions: 1.1.0 Reporter: Xuefu Zhang Currently ThriftCliService uses an atomic integer to keep track of the number of currently open sessions. It reports it through the following two log messages:
{code}
2017-09-18 04:14:31,722 INFO [HiveServer2-Handler-Pool: Thread-729979]: org.apache.hive.service.cli.thrift.ThriftCLIService: Opened a session: SessionHandle [99ec30d7-5c44-4a45-a8d6-0f0e7ecf4879], current sessions: 345
2017-09-18 04:14:41,926 INFO [HiveServer2-Handler-Pool: Thread-717542]: org.apache.hive.service.cli.thrift.ThriftCLIService: Closed session: SessionHandle [f38f7890-cba4-459c-872e-4c261b897e00], current sessions: 344
{code}
This assumes that all sessions are opened or closed through the Thrift API. This assumption isn't correct because sessions may be closed by the server, such as in the case of a timeout. Therefore, such log messages tend to over-report the number of open sessions. In order to accurately report the number of outstanding sessions, the session manager should be consulted instead. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
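A toy model of why the counter drifts (names are illustrative, not Hive's): sessions closed by the server bypass the Thrift handler's decrement, while the session manager's own collection stays accurate.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Toy model: the Thrift handler's counter vs. the session manager's map.
class SessionCountSketch {
    final AtomicInteger thriftCounter = new AtomicInteger();
    final Set<String> openSessions = ConcurrentHashMap.newKeySet();

    void openViaThrift(String handle) {
        openSessions.add(handle);
        thriftCounter.incrementAndGet();
    }

    void closeViaThrift(String handle) {
        openSessions.remove(handle);
        thriftCounter.decrementAndGet();
    }

    // e.g. idle timeout: the server closes the session without going through Thrift,
    // so the handler's counter is never decremented.
    void closeByServer(String handle) {
        openSessions.remove(handle);
    }

    int reportedCount() { return thriftCounter.get(); } // what the log prints today
    int actualCount()   { return openSessions.size(); } // what it should print
}
```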
[jira] [Created] (HIVE-17507) Support Mesos for Hive on Spark
Xuefu Zhang created HIVE-17507: -- Summary: Support Mesos for Hive on Spark Key: HIVE-17507 URL: https://issues.apache.org/jira/browse/HIVE-17507 Project: Hive Issue Type: Improvement Components: Spark Reporter: Xuefu Zhang From the comment in HIVE-7292:
{quote}
I see the following case: I use Mesos DC/OS and Spark on Mesos, because it's very convenient. But if I want to use Hive on Spark in Mesos DC/OS, I need the special framework Apache Myriad to run YARN on Mesos. It's very cluttered because I run one resource manager on top of another resource manager, which creates a lot of redundant abstraction levels. And there are questions about that on the Internet (e.g. http://grokbase.com/t/hive/user/15997dye2q/hive-on-spark-on-mesos) Can we create a new sub-task for this feature?
{quote}
-- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (HIVE-17401) Hive session idle timeout doesn't function properly
Xuefu Zhang created HIVE-17401: -- Summary: Hive session idle timeout doesn't function properly Key: HIVE-17401 URL: https://issues.apache.org/jira/browse/HIVE-17401 Project: Hive Issue Type: Bug Affects Versions: 1.1.0 Reporter: Xuefu Zhang Assignee: Xuefu Zhang It's apparent in our production environment that HS2 leaks sessions, which at least contributes to memory leaks in HS2. We further found that idle HS2 sessions rarely get timed out and that the number of live sessions keeps increasing over time. Eventually, HS2 becomes unresponsive and demands a restart. Investigation shows that the session idle timeout doesn't work properly. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
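For context, an idle-timeout sweep typically looks like the sketch below. This is purely illustrative: Hive's real logic lives in its session manager, and the bug could just as well be in how the last-access timestamp is refreshed rather than in the sweep itself.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustration of the mechanism only; field and method names are made up.
class IdleTimeoutSketch {
    final Map<String, Long> lastAccessTime = new ConcurrentHashMap<>();

    // Periodic sweep: close every session idle longer than the timeout.
    // If a session's timestamp is never recorded correctly, or is refreshed
    // spuriously, the session never passes this check and is leaked.
    int closeIdleSessions(long nowMillis, long idleTimeoutMillis) {
        int closed = 0;
        Iterator<Map.Entry<String, Long>> it = lastAccessTime.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> e = it.next();
            if (nowMillis - e.getValue() >= idleTimeoutMillis) {
                it.remove(); // stand-in for actually closing the session
                closed++;
            }
        }
        return closed;
    }
}
```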
[jira] [Created] (HIVE-16962) Better error msg for Hive on Spark in case user cancels query and closes session
Xuefu Zhang created HIVE-16962: -- Summary: Better error msg for Hive on Spark in case user cancels query and closes session Key: HIVE-16962 URL: https://issues.apache.org/jira/browse/HIVE-16962 Project: Hive Issue Type: Improvement Components: Spark Affects Versions: 1.1.0 Reporter: Xuefu Zhang Assignee: Xuefu Zhang In case the user cancels a query and closes the session, Hive marks the query as failed. However, the error message is a little confusing. It still says:
{quote}
org.apache.hive.service.cli.HiveSQLException: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.spark.SparkTask. Failed to create spark client. This is likely because the queue you assigned to does not have free resource at the moment to start the job. Please check your queue usage and try the query again later.
{quote}
followed by some InterruptedException. Ideally, the error should clearly indicate that the user canceled the execution. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (HIVE-16961) Hive on Spark leaks spark application in case user cancels query and closes session
Xuefu Zhang created HIVE-16961: -- Summary: Hive on Spark leaks spark application in case user cancels query and closes session Key: HIVE-16961 URL: https://issues.apache.org/jira/browse/HIVE-16961 Project: Hive Issue Type: Bug Components: Spark Affects Versions: 1.1.0 Reporter: Xuefu Zhang Assignee: Xuefu Zhang It's been found that a Spark application is leaked when the user cancels a query and closes the session while Hive is waiting for the remote driver to connect back. This was found for asynchronous query execution, but it seems equally applicable to synchronous submission when the session is abruptly closed. The leaked Spark application that runs the Spark driver connects back to Hive successfully and runs forever (until HS2 restarts), but receives no job submissions because the session is already closed. Ideally, Hive should reject the connection from the driver so the driver will exit. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (HIVE-16854) SparkClientFactory is locked too aggressively
Xuefu Zhang created HIVE-16854: -- Summary: SparkClientFactory is locked too aggressively Key: HIVE-16854 URL: https://issues.apache.org/jira/browse/HIVE-16854 Project: Hive Issue Type: Bug Components: Spark Affects Versions: 1.1.0 Reporter: Xuefu Zhang Most methods in SparkClientFactory are synchronized on the SparkClientFactory singleton. However, some of them are very expensive, such as createClient(), which returns a SparkClientImpl instance. Creating a SparkClientImpl instance requires starting a remote driver that connects back to the RPCServer, and this process can take a long time, such as in the case of a busy YARN queue. When this happens, all pending calls on SparkClientFactory have to wait for a long time. In our case, hive.spark.client.server.connect.timeout is set to 1hr, which makes some queries wait for hours before starting. The current implementation pretty much serializes all remote driver launches: if one of them takes time, the following ones have to wait. The HS2 stacktrace is attached for reference. It's based on an earlier version of Hive, so the line numbers might be slightly off. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
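A minimal sketch of the lock-narrowing remedy, with hypothetical names (the real SparkClientImpl launch waits on an RPC handshake rather than a Runnable): the slow launch runs outside the factory-wide lock, so concurrent launches no longer serialize behind one busy-queue launch.

```java
// Sketch only: the lock is held just for cheap shared bookkeeping,
// never across the expensive remote-driver launch.
class FactoryLockSketch {
    private final Object lock = new Object();
    private int clientsCreated = 0;

    String createClient(Runnable slowLaunch) {
        // Slow part (e.g. starting a remote driver and waiting for it to
        // connect back) deliberately runs WITHOUT the factory-wide lock.
        slowLaunch.run();
        synchronized (lock) { // lock held only for cheap bookkeeping
            clientsCreated++;
            return "client-" + clientsCreated;
        }
    }

    int created() {
        synchronized (lock) { return clientsCreated; }
    }
}
```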
[jira] [Created] (HIVE-16799) Control the max number of tasks for a stage in a Spark job
Xuefu Zhang created HIVE-16799: -- Summary: Control the max number of tasks for a stage in a Spark job Key: HIVE-16799 URL: https://issues.apache.org/jira/browse/HIVE-16799 Project: Hive Issue Type: Improvement Reporter: Xuefu Zhang Assignee: Xuefu Zhang HIVE-16552 gives admins an option to control the maximum number of tasks a Spark job may have. However, this may not be sufficient, as it tends to penalize jobs that have many stages while favoring jobs that have fewer stages. Ideally, we should also limit the number of tasks in a stage, which is closer to the maximum number of mappers or reducers in an MR job. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (HIVE-16552) Limit the number of tasks a Spark job may contain
Xuefu Zhang created HIVE-16552: -- Summary: Limit the number of tasks a Spark job may contain Key: HIVE-16552 URL: https://issues.apache.org/jira/browse/HIVE-16552 Project: Hive Issue Type: Improvement Components: Spark Reporter: Xuefu Zhang Assignee: Xuefu Zhang It's commonly desirable to block bad, big queries that take a lot of YARN resources. One approach, similar to mapreduce.job.max.map in MapReduce, is to stop a query that invokes a Spark job containing too many tasks. The proposal here is to introduce hive.spark.job.max.tasks with a default value of -1 (no limit), which an admin can set to block queries that trigger too many Spark tasks. Please note that this control knob applies to a single Spark job, and one query can trigger multiple Spark jobs (such as in the case of map-join). Nevertheless, the proposed approach is still helpful. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
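The intended semantics of the proposed knob can be sketched in a few lines (how the actual patch wires this into SparkTask is not shown here):

```java
class TaskLimitSketch {
    // Mirrors the proposed hive.spark.job.max.tasks semantics:
    // a negative limit (default -1) disables the check entirely.
    static boolean exceedsLimit(int totalTasks, int maxTasks) {
        return maxTasks >= 0 && totalTasks > maxTasks;
    }
}
```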
[jira] [Created] (HIVE-16196) UDFJson having thread-safety issues
Xuefu Zhang created HIVE-16196: -- Summary: UDFJson having thread-safety issues Key: HIVE-16196 URL: https://issues.apache.org/jira/browse/HIVE-16196 Project: Hive Issue Type: Bug Components: UDF Affects Versions: 1.1.0 Reporter: Xuefu Zhang Assignee: Xuefu Zhang As a followup to HIVE-16183, there seem to be some concurrency issues in UDFJson.java, especially around its static class variables. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
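The usual fix pattern for shared mutable statics in a UDF is to move the state into a ThreadLocal so each thread gets its own copy. The sketch below is a generic illustration of that pattern, not UDFJson's actual fields:

```java
import java.util.HashMap;
import java.util.Map;

class ThreadLocalCacheSketch {
    // Unsafe variant (the bug shape): a static HashMap mutated concurrently
    // by many HS2 or executor threads can corrupt itself.
    //   static final Map<String, Object> CACHE = new HashMap<>();

    // Safe variant: each thread gets its own map, so no synchronization needed.
    private static final ThreadLocal<Map<String, Object>> CACHE =
        ThreadLocal.withInitial(HashMap::new);

    static Object getOrCompute(String key) {
        return CACHE.get().computeIfAbsent(key, k -> "parsed:" + k);
    }
}
```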
[jira] [Created] (HIVE-16183) Fix potential thread safety issues with static variables
Xuefu Zhang created HIVE-16183: -- Summary: Fix potential thread safety issues with static variables Key: HIVE-16183 URL: https://issues.apache.org/jira/browse/HIVE-16183 Project: Hive Issue Type: Bug Components: HiveServer2 Reporter: Xuefu Zhang Assignee: Xuefu Zhang Many concurrency issues have been found with respect to class static variable usage. Given that HS2 supports concurrent compilation and task execution, and that some backend engines (such as Spark) run multiple tasks in a single JVM, the traditional assumption (or mindset) of single-threaded execution needs to be abandoned. The purpose of this JIRA is to do a global scan of static variables in the Hive code base and correct potential thread-safety issues. However, it's not meant to be exhaustive. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (HIVE-16179) HoS tasks may fail due to ArrayIndexOutOfBoundsException in BinarySortableSerDe
Xuefu Zhang created HIVE-16179: -- Summary: HoS tasks may fail due to ArrayIndexOutOfBoundsException in BinarySortableSerDe Key: HIVE-16179 URL: https://issues.apache.org/jira/browse/HIVE-16179 Project: Hive Issue Type: Bug Components: Serializers/Deserializers Affects Versions: 1.1.0 Reporter: Xuefu Zhang Assignee: Xuefu Zhang Stacktrace:
{code}
java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error: Unable to deserialize reduce input key from x1x100x101x97x51x49x50x97x102x45x97x98x56x52x45x52x102x52x53x45x56x49x101x99x45x49x99x100x98x55x97x51x52x100x49x49x55x0x1x128x0x0x0x0x0x0x19x1x128x0x0x0x0x0x0x3x1x128x0x66x179x1x192x244x45x90x1x85x98x101x114x0x1x76x111x115x32x65x110x103x101x108x101x115x0x1x2x128x0x0x2x50x51x57x51x0x1x192x55x238x20x122x225x71x174x1x128x0x0x0x87x240x169x195x1x50x48x49x54x45x49x48x45x48x49x32x50x51x58x51x49x58x51x49x0x1x117x98x101x114x88x0x255 with properties {columns=_col0,_col1,_col2,_col3,_col4,_col5,_col6,_col7,_col8,_col9,_col10,_col11, serialization.lib=org.apache.hadoop.hive.serde2.binarysortable.BinarySortableSerDe, serialization.sort.order=, columns.types=string,bigint,bigint,date,int,varchar(50),varchar(255),decimal(12,2),double,bigint,string,varchar(255)}
    at org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processRow(SparkReduceRecordHandler.java:339)
    at org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:54)
    at org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:28)
    at org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList$ResultIterator.hasNext(HiveBaseFunctionResultList.java:95)
    at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:41)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$1$$anonfun$apply$15.apply(AsyncRDDActions.scala:120)
    at org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$1$$anonfun$apply$15.apply(AsyncRDDActions.scala:120)
    at org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:2004)
    at org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:2004)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error: Unable to deserialize reduce input key from x1x100x101x97x51x49x50x97x102x45x97x98x56x52x45x52x102x52x53x45x56x49x101x99x45x49x99x100x98x55x97x51x52x100x49x49x55x0x1x128x0x0x0x0x0x0x19x1x128x0x0x0x0x0x0x3x1x128x0x66x179x1x192x244x45x90x1x85x98x101x114x0x1x76x111x115x32x65x110x103x101x108x101x115x0x1x2x128x0x0x2x50x51x57x51x0x1x192x55x238x20x122x225x71x174x1x128x0x0x0x87x240x169x195x1x50x48x49x54x45x49x48x45x48x49x32x50x51x58x51x49x58x51x49x0x1x117x98x101x114x88x0x255 with properties {columns=_col0,_col1,_col2,_col3,_col4,_col5,_col6,_col7,_col8,_col9,_col10,_col11, serialization.lib=org.apache.hadoop.hive.serde2.binarysortable.BinarySortableSerDe, serialization.sort.order=, columns.types=string,bigint,bigint,date,int,varchar(50),varchar(255),decimal(12,2),double,bigint,string,varchar(255)}
    at org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processRow(SparkReduceRecordHandler.java:311)
    ... 16 more
Caused by: java.lang.ArrayIndexOutOfBoundsException: 3
    at org.apache.hadoop.hive.serde2.binarysortable.BinarySortableSerDe.deserialize(BinarySortableSerDe.java:413)
    at org.apache.hadoop.hive.serde2.binarysortable.BinarySortableSerDe.deserialize(BinarySortableSerDe.java:190)
    at org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processRow(SparkReduceRecordHandler.java:309)
    ... 16 more
{code}
It seems to be a synchronization issue in BinarySortableSerDe. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (HIVE-16156) FileSinkOperator should delete existing output target when renaming
Xuefu Zhang created HIVE-16156: -- Summary: FileSinkOperator should delete existing output target when renaming Key: HIVE-16156 URL: https://issues.apache.org/jira/browse/HIVE-16156 Project: Hive Issue Type: Bug Components: Operators Affects Versions: 1.1.0 Reporter: Xuefu Zhang Assignee: Xuefu Zhang If a task gets killed (for whatever reason) after it has completed renaming the temp output to the final output during commit, subsequent task attempts will fail when renaming because the target output already exists. This can happen, however rarely. Hive should check for the existence of the target output and delete it before renaming. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
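A sketch of the commit fix using java.nio; Hive's FileSinkOperator actually goes through Hadoop's FileSystem API, but the idea is the same: a retried attempt must tolerate a target left behind by a killed-but-already-committed attempt.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class CommitRenameSketch {
    static void commit(Path tmp, Path target) {
        try {
            // Clear a target left over from a previous, killed attempt
            // before renaming, so the retry's rename cannot fail.
            Files.deleteIfExists(target);
            Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Self-check: simulates a retried attempt whose target already exists,
    // and returns the final content of the target.
    static String demo() {
        try {
            Path dir = Files.createTempDirectory("commit-sketch");
            Path tmp = Files.write(dir.resolve("tmp"), "attempt2".getBytes());
            Path target = Files.write(dir.resolve("out"), "attempt1".getBytes());
            commit(tmp, target);
            return new String(Files.readAllBytes(target));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```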
[jira] [Created] (HIVE-15893) Followup on HIVE-15671
Xuefu Zhang created HIVE-15893: -- Summary: Followup on HIVE-15671 Key: HIVE-15893 URL: https://issues.apache.org/jira/browse/HIVE-15893 Project: Hive Issue Type: Improvement Components: Spark Affects Versions: 2.2.0 Reporter: Xuefu Zhang Assignee: Xuefu Zhang In HIVE-15671, we fixed a typo where server.connect.timeout was used in place of client.connect.timeout. This might solve some potential problems, but the original problem reported in HIVE-15671 might still exist. (Not sure if HIVE-15860 helps.) Here is the proposal suggested by Marcelo:
{quote}
bq. server detecting a driver problem after it has connected back to the server.
Hmm. That is definitely not any of the "connect" timeouts, which probably means it isn't configured and is just using netty's default (which is probably no timeout?). Would probably need something using io.netty.handler.timeout.IdleStateHandler, and also some periodic "ping" so that the connection isn't torn down without reason.
{quote}
We will use this JIRA to track the issue. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (HIVE-15683) Measure performance impact on group by by HIVE-15580
Xuefu Zhang created HIVE-15683: -- Summary: Measure performance impact on group by by HIVE-15580 Key: HIVE-15683 URL: https://issues.apache.org/jira/browse/HIVE-15683 Project: Hive Issue Type: Improvement Components: Spark Affects Versions: 2.2.0 Reporter: Xuefu Zhang Assignee: Xuefu Zhang HIVE-15580 changed the way data is shuffled for order by: instead of using Spark's groupByKey to shuffle data, Hive on Spark now uses repartitionAndSortWithinPartitions(), which generates (key, value) pairs instead of the original (key, value iterator). This might have some performance implications, but it's needed to get rid of the unbounded memory usage of {{groupByKey}}. Here we'd like to compare group by performance with and without HIVE-15580. If the impact is significant, we can provide a configuration that allows the user to switch back to the original way of shuffling. This work should ideally be done after HIVE-15682, as the optimization there should help the performance here as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-15682) Eliminate the dummy iterator and optimize the per row based reducer-side processing
Xuefu Zhang created HIVE-15682: -- Summary: Eliminate the dummy iterator and optimize the per row based reducer-side processing Key: HIVE-15682 URL: https://issues.apache.org/jira/browse/HIVE-15682 Project: Hive Issue Type: Improvement Components: Spark Affects Versions: 2.2.0 Reporter: Xuefu Zhang Assignee: Xuefu Zhang HIVE-15580 introduced a dummy iterator per input row, which can be eliminated because {{SparkReduceRecordHandler}} is already able to handle single key-value pairs. We can refactor this part of the code 1. to remove the need for an iterator and 2. to optimize the code path for per-(key, value) processing (instead of (key, value iterator) processing). It would also be great if we can measure the performance after the optimizations and compare it to the performance prior to HIVE-15580. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-15671) RPCServer.registerClient() erroneously uses server/client handshake timeout for connection timeout
Xuefu Zhang created HIVE-15671: -- Summary: RPCServer.registerClient() erroneously uses server/client handshake timeout for connection timeout Key: HIVE-15671 URL: https://issues.apache.org/jira/browse/HIVE-15671 Project: Hive Issue Type: Bug Components: Spark Affects Versions: 1.1.0 Reporter: Xuefu Zhang Assignee: Xuefu Zhang
{code}
/**
 * Tells the RPC server to expect a connection from a new client.
 * ...
 */
public Future registerClient(final String clientId, String secret, RpcDispatcher serverDispatcher) {
  return registerClient(clientId, secret, serverDispatcher, config.getServerConnectTimeoutMs());
}
{code}
config.getServerConnectTimeoutMs() returns the value of hive.spark.client.server.connect.timeout, which is meant to be the timeout for the handshake between the Hive client and the remote Spark driver. Instead, the timeout used here should be hive.spark.client.connect.timeout, which is the timeout for the remote Spark driver connecting back to the Hive client. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-15580) Replace Spark's groupByKey operator with something with bounded memory
Xuefu Zhang created HIVE-15580: -- Summary: Replace Spark's groupByKey operator with something with bounded memory Key: HIVE-15580 URL: https://issues.apache.org/jira/browse/HIVE-15580 Project: Hive Issue Type: Improvement Components: Spark Reporter: Xuefu Zhang Assignee: Xuefu Zhang -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-15543) Don't try to get memory/cores to decide parallelism when Spark dynamic allocation is enabled
Xuefu Zhang created HIVE-15543: -- Summary: Don't try to get memory/cores to decide parallelism when Spark dynamic allocation is enabled Key: HIVE-15543 URL: https://issues.apache.org/jira/browse/HIVE-15543 Project: Hive Issue Type: Improvement Components: Spark Affects Versions: 2.2.0 Reporter: Xuefu Zhang Assignee: Xuefu Zhang Presently Hive tries to get the numbers of memory and cores from the Spark application and uses them to determine RS parallelism. However, this doesn't make sense when Spark dynamic allocation is enabled, because the current numbers don't represent the available computing resources, especially when the SparkContext is initially launched. Thus, it makes sense not to do that when dynamic allocation is enabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-15527) Memory usage is unbounded in SortByShuffler for Spark
Xuefu Zhang created HIVE-15527: -- Summary: Memory usage is unbounded in SortByShuffler for Spark Key: HIVE-15527 URL: https://issues.apache.org/jira/browse/HIVE-15527 Project: Hive Issue Type: Improvement Components: Spark Affects Versions: 1.1.0 Reporter: Xuefu Zhang Assignee: Xuefu Zhang In SortByShuffler.java, an ArrayList is used to back the iterator for values that have the same key in the shuffled result produced by the Spark transformation sortByKey. It's possible for memory to be exhausted by a large key group.
{code}
    @Override
    public Tuple2<HiveKey, List<BytesWritable>> next() {
      // TODO: implement this by accumulating rows with the same key into a list.
      // Note that this list needs to improved to prevent excessive memory usage, but this
      // can be done in later phase.
      while (it.hasNext()) {
        Tuple2<HiveKey, BytesWritable> pair = it.next();
        if (curKey != null && !curKey.equals(pair._1())) {
          HiveKey key = curKey;
          List<BytesWritable> values = curValues;
          curKey = pair._1();
          curValues = new ArrayList<BytesWritable>();
          curValues.add(pair._2());
          return new Tuple2<HiveKey, List<BytesWritable>>(key, values);
        }
        curKey = pair._1();
        curValues.add(pair._2());
      }
      if (curKey == null) {
        throw new NoSuchElementException();
      }
      // if we get here, this should be the last element we have
      HiveKey key = curKey;
      curKey = null;
      return new Tuple2<HiveKey, List<BytesWritable>>(key, curValues);
    }
{code}
Since the output from sortByKey is already sorted by key, it's possible to back the value iterable with the input iterator. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
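The proposed iterator-backed grouping might look like the following generic sketch, where Map.Entry stands in for Tuple2 and the key/value type parameters for HiveKey/BytesWritable. Each group's values stream straight off the shared input iterator, so memory stays O(1) per group, with the caveat that a group's values must be fully consumed before advancing to the next group.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;

// Groups a key-sorted iterator of (key, value) pairs into (key, value-iterator)
// pairs without buffering any group in memory.
class SortedGroupIterator<K, V> implements Iterator<Map.Entry<K, Iterator<V>>> {
    private final Iterator<Map.Entry<K, V>> input; // must be sorted by key
    private Map.Entry<K, V> lookahead;             // first pair of the next group

    SortedGroupIterator(Iterator<Map.Entry<K, V>> input) {
        this.input = input;
        this.lookahead = input.hasNext() ? input.next() : null;
    }

    @Override public boolean hasNext() { return lookahead != null; }

    @Override public Map.Entry<K, Iterator<V>> next() {
        if (lookahead == null) throw new NoSuchElementException();
        final K key = lookahead.getKey();
        // Lazy per-key value iterator backed by the shared input iterator.
        Iterator<V> values = new Iterator<V>() {
            @Override public boolean hasNext() {
                return lookahead != null && lookahead.getKey().equals(key);
            }
            @Override public V next() {
                if (!hasNext()) throw new NoSuchElementException();
                V v = lookahead.getValue();
                lookahead = input.hasNext() ? input.next() : null;
                return v;
            }
        };
        return new SimpleEntry<>(key, values);
    }

    // Self-check: groups [("a",1),("a",2),("b",3)] into "a=3;b=3".
    static String demo() {
        List<Map.Entry<String, Integer>> in = new ArrayList<>();
        in.add(new SimpleEntry<>("a", 1));
        in.add(new SimpleEntry<>("a", 2));
        in.add(new SimpleEntry<>("b", 3));
        StringBuilder sb = new StringBuilder();
        SortedGroupIterator<String, Integer> groups =
            new SortedGroupIterator<>(in.iterator());
        while (groups.hasNext()) {
            Map.Entry<String, Iterator<Integer>> g = groups.next();
            int sum = 0;
            while (g.getValue().hasNext()) sum += g.getValue().next();
            if (sb.length() > 0) sb.append(';');
            sb.append(g.getKey()).append('=').append(sum);
        }
        return sb.toString();
    }
}
```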
[jira] [Created] (HIVE-15237) Propagate Spark job failure to Hive
Xuefu Zhang created HIVE-15237: -- Summary: Propagate Spark job failure to Hive Key: HIVE-15237 URL: https://issues.apache.org/jira/browse/HIVE-15237 Project: Hive Issue Type: Bug Components: Spark Affects Versions: 2.1.0 Reporter: Xuefu Zhang If a Spark job fails for some reason, Hive doesn't get any additional error message, which makes it very hard for the user to figure out why. Here is an example:
{code}
Status: Running (Hive on Spark job[0])
Job Progress Format
CurrentTime StageId_StageAttemptId: SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount [StageCost]
2016-11-17 21:32:53,134 Stage-0_0: 0/23         Stage-1_0: 0/28
2016-11-17 21:32:55,156 Stage-0_0: 0(+1)/23     Stage-1_0: 0/28
2016-11-17 21:32:57,167 Stage-0_0: 0(+3)/23     Stage-1_0: 0/28
2016-11-17 21:33:00,216 Stage-0_0: 0(+3)/23     Stage-1_0: 0/28
2016-11-17 21:33:03,251 Stage-0_0: 0(+3)/23     Stage-1_0: 0/28
2016-11-17 21:33:06,286 Stage-0_0: 0(+4)/23     Stage-1_0: 0/28
2016-11-17 21:33:09,308 Stage-0_0: 0(+2,-3)/23  Stage-1_0: 0/28
2016-11-17 21:33:12,332 Stage-0_0: 0(+2,-3)/23  Stage-1_0: 0/28
2016-11-17 21:33:13,338 Stage-0_0: 0(+21,-3)/23 Stage-1_0: 0/28
2016-11-17 21:33:15,349 Stage-0_0: 0(+21,-5)/23 Stage-1_0: 0/28
2016-11-17 21:33:16,358 Stage-0_0: 0(+18,-8)/23 Stage-1_0: 0/28
2016-11-17 21:33:19,373 Stage-0_0: 0(+21,-8)/23 Stage-1_0: 0/28
2016-11-17 21:33:22,400 Stage-0_0: 0(+18,-14)/23        Stage-1_0: 0/28
2016-11-17 21:33:23,404 Stage-0_0: 0(+15,-20)/23        Stage-1_0: 0/28
2016-11-17 21:33:24,408 Stage-0_0: 0(+12,-23)/23        Stage-1_0: 0/28
2016-11-17 21:33:25,417 Stage-0_0: 0(+9,-26)/23 Stage-1_0: 0/28
2016-11-17 21:33:26,420 Stage-0_0: 0(+12,-26)/23        Stage-1_0: 0/28
2016-11-17 21:33:28,427 Stage-0_0: 0(+9,-29)/23 Stage-1_0: 0/28
2016-11-17 21:33:29,432 Stage-0_0: 0(+12,-29)/23        Stage-1_0: 0/28
2016-11-17 21:33:31,444 Stage-0_0: 0(+18,-29)/23        Stage-1_0: 0/28
2016-11-17 21:33:34,464 Stage-0_0: 0(+18,-29)/23        Stage-1_0: 0/28
Status: Failed
FAILED: Execution Error, return code 3 from org.apache.hadoop.hive.ql.exec.spark.SparkTask
{code}
It would be better if we could propagate the Spark error to Hive. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-14885) Support PPD for nested columns
Xuefu Zhang created HIVE-14885: -- Summary: Support PPD for nested columns Key: HIVE-14885 URL: https://issues.apache.org/jira/browse/HIVE-14885 Project: Hive Issue Type: Improvement Components: Logical Optimizer, Serializers/Deserializers Affects Versions: 2.1.0 Reporter: Xuefu Zhang It looks like PPD doesn't work for nested columns, at least not for Parquet. For a given schema
{code}
hive> desc nested;
OK
a int
b string
c struct
{code}
PPD works for a query like
{code}
select * from nested where a=1;
{code}
while NOT for
{code}
select * from nested where c.d=2;
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-14617) NPE in UDF MapValues() if input is null
Xuefu Zhang created HIVE-14617: -- Summary: NPE in UDF MapValues() if input is null Key: HIVE-14617 URL: https://issues.apache.org/jira/browse/HIVE-14617 Project: Hive Issue Type: Bug Components: HiveServer2 Affects Versions: 2.1.0 Reporter: Xuefu Zhang Assignee: Xuefu Zhang Job fails with error msg as follows:
{code}
Error: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {"ts":null,"_max_added_id":null,"identity_info":null,"vehicle_specs":null,"tracking_info":null,"color_info":null,"vehicle_traits":null,"detail_info":null,"_row_key":null,"_shard":null,"image_info":null,"vehicle_tags":null,"activation_info":null,"flavor_info":null,"sounds":null,"legacy_info":null,"images":null,"datestr":"2016-08-24"}
    at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:179)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {"ts":null,"_max_added_id":null,"identity_info":null,"vehicle_specs":null,"tracking_info":null,"color_info":null,"vehicle_traits":null,"detail_info":null,"_row_key":null,"_shard":null,"image_info":null,"vehicle_tags":null,"activation_info":null,"flavor_info":null,"sounds":null,"legacy_info":null,"images":null,"datestr":"2016-08-24"}
    at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:507)
    at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:170)
    ... 8 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Error evaluating map_values(vehicle_traits.vehicle_traits)
    at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:82)
    at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815)
    at org.apache.hadoop.hive.ql.exec.LateralViewForwardOperator.processOp(LateralViewForwardOperator.java:37)
    at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815)
    at org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:95)
    at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.forward(MapOperator.java:157)
    at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:497)
    ... 9 more
Caused by: java.lang.NullPointerException
    at org.apache.hadoop.hive.ql.udf.generic.GenericUDFMapValues.evaluate(GenericUDFMapValues.java:64)
    at org.apache.hadoop.hive.ql.exec.ExprNodeGenericFuncEvaluator._evaluate(ExprNodeGenericFuncEvaluator.java:185)
    at org.apache.hadoop.hive.ql.exec.ExprNodeEvaluator.evaluate(ExprNodeEvaluator.java:77)
    at org.apache.hadoop.hive.ql.exec.ExprNodeEvaluator.evaluate(ExprNodeEvaluator.java:65)
    at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:77)
    ... 15 more
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
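The typical fix for this class of NPE is to propagate null when the input is null, per Hive's usual null semantics. A stand-in sketch follows; the real fix would operate on ObjectInspectors inside GenericUDFMapValues.evaluate(), not on a raw Map:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class MapValuesSketch {
    static List<Object> mapValues(Map<?, ?> input) {
        if (input == null) {
            // The missing check: calling input.values() on a null map NPEs.
            return null;
        }
        return new ArrayList<Object>(input.values());
    }
}
```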
[jira] [Created] (HIVE-13873) Column pruning for nested fields
Xuefu Zhang created HIVE-13873: -- Summary: Column pruning for nested fields Key: HIVE-13873 URL: https://issues.apache.org/jira/browse/HIVE-13873 Project: Hive Issue Type: New Feature Components: Logical Optimizer Reporter: Xuefu Zhang Some columnar file formats such as Parquet store the fields of a struct type column by column as well, using the encoding described in Google's Dremel paper. It's very common in big data for data to be stored in structs while queries need only a subset of the fields in those structs. However, Hive presently still needs to read the whole struct regardless of whether all fields are selected. Therefore, pruning unwanted sub-fields of structs (nested fields) at file reading time would be a big performance boost for such scenarios. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-13276) Hive on Spark doesn't work when spark.master=local
Xuefu Zhang created HIVE-13276: -- Summary: Hive on Spark doesn't work when spark.master=local Key: HIVE-13276 URL: https://issues.apache.org/jira/browse/HIVE-13276 Project: Hive Issue Type: Bug Components: Spark Affects Versions: 2.1.0 Reporter: Xuefu Zhang Assignee: Xuefu Zhang The following problem occurs with the latest Hive master and Spark 1.6.1. I'm using the Hive CLI on a Mac.
{code}
set mapreduce.job.reduces=
java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.rdd.RDDOperationScope$
    at org.apache.spark.SparkContext.withScope(SparkContext.scala:714)
    at org.apache.spark.SparkContext.hadoopRDD(SparkContext.scala:991)
    at org.apache.spark.api.java.JavaSparkContext.hadoopRDD(JavaSparkContext.scala:419)
    at org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generateMapInput(SparkPlanGenerator.java:205)
    at org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generateParentTran(SparkPlanGenerator.java:145)
    at org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generate(SparkPlanGenerator.java:117)
    at org.apache.hadoop.hive.ql.exec.spark.LocalHiveSparkClient.execute(LocalHiveSparkClient.java:130)
    at org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionImpl.submit(SparkSessionImpl.java:71)
    at org.apache.hadoop.hive.ql.exec.spark.SparkTask.execute(SparkTask.java:94)
    at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:156)
    at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:101)
    at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1837)
    at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1578)
    at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1351)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1122)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1110)
    at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:233)
    at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:184)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:400)
    at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:778)
    at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:717)
    at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:645)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
FAILED: Execution Error, return code -101 from org.apache.hadoop.hive.ql.exec.spark.SparkTask. Could not initialize class org.apache.spark.rdd.RDDOperationScope$
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-12951) Reduce Spark executor prewarm timeout to 5s
Xuefu Zhang created HIVE-12951: -- Summary: Reduce Spark executor prewarm timeout to 5s Key: HIVE-12951 URL: https://issues.apache.org/jira/browse/HIVE-12951 Project: Hive Issue Type: Bug Components: Spark Affects Versions: 1.2.0 Reporter: Xuefu Zhang Assignee: Xuefu Zhang Currently it's set to 30s, which tends to be longer than needed. Reduce it to 5s, which accounts for JVM startup time alone. (Eventually, we may want to make this configurable.) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-12828) Update Spark version to 1.6
Xuefu Zhang created HIVE-12828: -- Summary: Update Spark version to 1.6 Key: HIVE-12828 URL: https://issues.apache.org/jira/browse/HIVE-12828 Project: Hive Issue Type: Task Components: Spark Reporter: Xuefu Zhang Assignee: Xuefu Zhang -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-12811) Give the YARN application a more meaningful name than just "Hive on Spark"
Xuefu Zhang created HIVE-12811: -- Summary: Give the YARN application a more meaningful name than just "Hive on Spark" Key: HIVE-12811 URL: https://issues.apache.org/jira/browse/HIVE-12811 Project: Hive Issue Type: Improvement Components: Spark Reporter: Xuefu Zhang Assignee: Xuefu Zhang MR uses the query text as the application name. Hopefully this can be set via spark.app.name. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
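A minimal sketch of how such a name could be derived from the query text before being passed to spark.app.name. The truncation length, whitespace normalization, and fallback string are illustrative choices, not Hive's actual behavior:

```java
public class AppName {
    // Derive a YARN application name from the query text, truncated so the
    // ResourceManager UI stays readable; fall back to the generic name when
    // there is no query to show. maxLen and the fallback are illustrative.
    static String appName(String query, int maxLen) {
        if (query == null || query.trim().isEmpty()) {
            return "Hive on Spark";
        }
        String q = query.trim().replaceAll("\\s+", " ");
        return q.length() <= maxLen ? q : q.substring(0, maxLen) + "...";
    }

    public static void main(String[] args) {
        // e.g. sparkConf.set("spark.app.name", appName(queryString, 60));
        System.out.println(appName("select   count(*)\nfrom sales", 60));
    }
}
```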
[jira] [Created] (HIVE-12708) Hive on Spark doesn't work with Kerberized HBase [Spark Branch]
Xuefu Zhang created HIVE-12708: -- Summary: Hive on Spark doesn't work with Kerberized HBase [Spark Branch] Key: HIVE-12708 URL: https://issues.apache.org/jira/browse/HIVE-12708 Project: Hive Issue Type: Bug Components: Spark Affects Versions: 1.1.0, 1.2.0, 2.0.0 Reporter: Xuefu Zhang Assignee: Xuefu Zhang The Spark application launcher (spark-submit) acquires an HBase delegation token on the Hive user's behalf when the application is launched. This mechanism doesn't work for long-running sessions and is not in line with what Hive does: Hive acquires the token automatically whenever a job needs it. The right approach would be for Spark to allow applications to dynamically add whatever tokens they need to the Spark context. While this needs work on the Spark side, we provide a workaround in Hive. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-12611) Make sure spark.yarn.queue is effective and takes the value from mapreduce.job.queuename if given [Spark Branch]
Xuefu Zhang created HIVE-12611: -- Summary: Make sure spark.yarn.queue is effective and takes the value from mapreduce.job.queuename if given [Spark Branch] Key: HIVE-12611 URL: https://issues.apache.org/jira/browse/HIVE-12611 Project: Hive Issue Type: Improvement Components: Spark Reporter: Xuefu Zhang Assignee: Xuefu Zhang Hive users sometimes specify a job queue name for the submitted MR jobs. For Spark, the property name is spark.yarn.queue. We need to make sure that users are able to submit Spark jobs to the given queue. If a user specifies the MR property, then Hive on Spark should honor that as well, for backward compatibility. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
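One possible shape of the backward-compatibility mapping, sketched against a plain Map stand-in for the configuration (the real implementation would work on HiveConf/SparkConf, not a Map):

```java
import java.util.HashMap;
import java.util.Map;

public class QueueNameMapping {
    static final String MR_QUEUE = "mapreduce.job.queuename";
    static final String SPARK_QUEUE = "spark.yarn.queue";

    // If the user set the MR queue name but not the Spark one, carry the
    // value over so existing MR-era scripts keep working unchanged.
    static void propagateQueueName(Map<String, String> conf) {
        String mrQueue = conf.get(MR_QUEUE);
        if (mrQueue != null && !mrQueue.isEmpty() && !conf.containsKey(SPARK_QUEUE)) {
            conf.put(SPARK_QUEUE, mrQueue);
        }
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(MR_QUEUE, "etl");
        propagateQueueName(conf);
        System.out.println(conf.get(SPARK_QUEUE)); // the MR value carried over
    }
}
```

Note that an explicitly set spark.yarn.queue wins; the MR property is only a fallback.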
[jira] [Created] (HIVE-12569) Excessive console message from SparkClientImpl [Spark Branch]
Xuefu Zhang created HIVE-12569: -- Summary: Excessive console message from SparkClientImpl [Spark Branch] Key: HIVE-12569 URL: https://issues.apache.org/jira/browse/HIVE-12569 Project: Hive Issue Type: Bug Components: Spark Affects Versions: 2.0.0 Reporter: Xuefu Zhang Assignee: Xuefu Zhang
{code}
15/12/02 11:00:46 INFO client.SparkClientImpl: 15/12/02 11:00:46 INFO Client: Application report for application_1442517343449_0038 (state: RUNNING)
15/12/02 11:00:47 INFO client.SparkClientImpl: 15/12/02 11:00:47 INFO Client: Application report for application_1442517343449_0038 (state: RUNNING)
15/12/02 11:00:48 INFO client.SparkClientImpl: 15/12/02 11:00:48 INFO Client: Application report for application_1442517343449_0038 (state: RUNNING)
15/12/02 11:00:49 INFO client.SparkClientImpl: 15/12/02 11:00:49 INFO Client: Application report for application_1442517343449_0038 (state: RUNNING)
15/12/02 11:00:50 INFO client.SparkClientImpl: 15/12/02 11:00:50 INFO Client: Application report for application_1442517343449_0038 (state: RUNNING)
{code}
I see this in the Hive CLI after a Spark job is launched, and it continues non-stop even after the job has finished. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-12568) Use the same logic finding HS2 host name in Spark client [Spark Branch]
Xuefu Zhang created HIVE-12568: -- Summary: Use the same logic finding HS2 host name in Spark client [Spark Branch] Key: HIVE-12568 URL: https://issues.apache.org/jira/browse/HIVE-12568 Project: Hive Issue Type: Bug Components: Spark Affects Versions: 1.1.0 Reporter: Xuefu Zhang Assignee: Xuefu Zhang The Spark client sends a pair of host name and port number to the remote driver so that the driver can connect back to HS2, where the user session is. The Spark client has its own way of determining the host name, and picks one network interface if the host happens to have multiple network interfaces. This can be problematic. For that, there is a parameter, hive.spark.client.server.address, with which the user can pick an interface. Unfortunately, this parameter isn't exposed. Instead of exposing it, we can use the same logic as Hive in determining the host name. The remote driver would then connect to HS2 over the same network interface as a regular HS2 client would. There might be a case where the user wants the remote driver to use a different network, but this is rare if it occurs at all. Thus, for now it should be sufficient to use the same network interface. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
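A sketch of the resolution order this implies: the hive.spark.client.server.address override wins when set, otherwise fall back to the JDK's local-host lookup shared with the rest of Hive. The "localhost" fallback on resolution failure is an assumption for the sketch, not Hive's actual logic:

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

public class ServerAddress {
    // Resolve the HS2 host name the same way for the Thrift endpoint and
    // the Spark remote-driver callback, so the driver connects over the
    // same interface a regular HS2 client would use.
    static String hs2HostName(String configured) {
        if (configured != null && !configured.trim().isEmpty()) {
            return configured.trim();          // explicit override wins
        }
        try {
            return InetAddress.getLocalHost().getHostName();
        } catch (UnknownHostException e) {
            return "localhost";                // conservative fallback (assumed)
        }
    }

    public static void main(String[] args) {
        System.out.println(hs2HostName("  hs2.example.com  "));
        System.out.println(hs2HostName(null));
    }
}
```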
[jira] [Created] (HIVE-12554) Fix Spark branch build after merge [Spark Branch]
Xuefu Zhang created HIVE-12554: -- Summary: Fix Spark branch build after merge [Spark Branch] Key: HIVE-12554 URL: https://issues.apache.org/jira/browse/HIVE-12554 Project: Hive Issue Type: Bug Components: Spark Reporter: Xuefu Zhang Assignee: Rui Li The previous merge from master broke the Spark branch build. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-12461) Branch-1 -Phadoop-1 build is broken
Xuefu Zhang created HIVE-12461: -- Summary: Branch-1 -Phadoop-1 build is broken Key: HIVE-12461 URL: https://issues.apache.org/jira/browse/HIVE-12461 Project: Hive Issue Type: Bug Affects Versions: 1.3.0 Reporter: Xuefu Zhang
{code}
[INFO] Executed tasks
[INFO]
[INFO] --- maven-compiler-plugin:2.3.2:compile (default-compile) @ hive-exec ---
[INFO] Compiling 2423 source files to /Users/xzhang/apache/hive-git-commit/ql/target/classes
[INFO] -
[ERROR] COMPILATION ERROR :
[INFO] -
[ERROR] /Users/xzhang/apache/hive-git-commit/ql/src/java/org/apache/hadoop/hive/ql/Context.java:[352,10] error: cannot find symbol
[INFO] 1 error
[INFO] -
[INFO]
[INFO] Reactor Summary:
[INFO]
[INFO] Hive ... SUCCESS [ 2.636 s]
[INFO] Hive Shims Common .. SUCCESS [ 3.270 s]
[INFO] Hive Shims 0.20S ... SUCCESS [ 1.052 s]
[INFO] Hive Shims 0.23 SUCCESS [ 3.550 s]
[INFO] Hive Shims Scheduler ... SUCCESS [ 1.076 s]
[INFO] Hive Shims . SUCCESS [ 1.472 s]
[INFO] Hive Common SUCCESS [ 5.989 s]
[INFO] Hive Serde . SUCCESS [ 6.923 s]
[INFO] Hive Metastore . SUCCESS [ 19.424 s]
[INFO] Hive Ant Utilities . SUCCESS [ 0.516 s]
[INFO] Spark Remote Client SUCCESS [ 3.305 s]
[INFO] Hive Query Language FAILURE [ 34.276 s]
[INFO] Hive Service ... SKIPPED
{code}
The part of the code being complained about:
{code}
343   /**
344    * Remove any created scratch directories.
345    */
346   public void removeScratchDir() {
347     for (Map.Entry entry : fsScratchDirs.entrySet()) {
348       try {
349         Path p = entry.getValue();
350         FileSystem fs = p.getFileSystem(conf);
351         fs.delete(p, true);
352         fs.cancelDeleteOnExit(p);
353       } catch (Exception e) {
354         LOG.warn("Error Removing Scratch: "
355             + StringUtils.stringifyException(e));
356       }
{code}
This might be related to HIVE-12268. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-12460) Fix branch-1 build
Xuefu Zhang created HIVE-12460: -- Summary: Fix branch-1 build Key: HIVE-12460 URL: https://issues.apache.org/jira/browse/HIVE-12460 Project: Hive Issue Type: Bug Components: Build Infrastructure Affects Versions: 1.3.0 Reporter: Xuefu Zhang Caused by a merge. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-12434) Merge spark into master 11/17/2015
Xuefu Zhang created HIVE-12434: -- Summary: Merge spark into master 11/17/2015 Key: HIVE-12434 URL: https://issues.apache.org/jira/browse/HIVE-12434 Project: Hive Issue Type: Task Components: Spark Affects Versions: 2.0.0 Reporter: Xuefu Zhang Assignee: Xuefu Zhang There are still a few patches that are in the Spark branch only. We need to merge them to master. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-12433) Merge trunk into spark 11/17/2015 [Spark Branch]
Xuefu Zhang created HIVE-12433: -- Summary: Merge trunk into spark 11/17/2015 [Spark Branch] Key: HIVE-12433 URL: https://issues.apache.org/jira/browse/HIVE-12433 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Brock Noland Assignee: Brock Noland Fix For: 1.1.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-12398) Create format checker for Parquet
Xuefu Zhang created HIVE-12398: -- Summary: Create format checker for Parquet Key: HIVE-12398 URL: https://issues.apache.org/jira/browse/HIVE-12398 Project: Hive Issue Type: Improvement Components: File Formats Affects Versions: 2.0.0 Reporter: Xuefu Zhang See HIVE-11120 and related. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-12390) Merge master to Spark branch 11/11/2015 [Spark Branch]
Xuefu Zhang created HIVE-12390: -- Summary: Merge master to Spark branch 11/11/2015 [Spark Branch] Key: HIVE-12390 URL: https://issues.apache.org/jira/browse/HIVE-12390 Project: Hive Issue Type: Task Components: Spark Reporter: Xuefu Zhang Assignee: Xuefu Zhang To fix some test failures such as those for Llap. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-12284) CLONE - Merge master to Spark branch 10/26/2015 [Spark Branch]
Xuefu Zhang created HIVE-12284: -- Summary: CLONE - Merge master to Spark branch 10/26/2015 [Spark Branch] Key: HIVE-12284 URL: https://issues.apache.org/jira/browse/HIVE-12284 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Xuefu Zhang Assignee: Xuefu Zhang Fix For: spark-branch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-12205) Spark: unify Spark statistics aggregation between local and remote spark client
Xuefu Zhang created HIVE-12205: -- Summary: Spark: unify Spark statistics aggregation between local and remote spark client Key: HIVE-12205 URL: https://issues.apache.org/jira/browse/HIVE-12205 Project: Hive Issue Type: Task Components: Spark Affects Versions: 1.1.0 Reporter: Xuefu Zhang In the classes {{LocalSparkJobStatus}} and {{RemoteSparkJobStatus}}, Spark statistics aggregation is done similarly but in different code paths. Ideally, we should have a unified approach to simplify maintenance. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-12063) Pad Decimal numbers with trailing zeros to the scale of the column
Xuefu Zhang created HIVE-12063: -- Summary: Pad Decimal numbers with trailing zeros to the scale of the column Key: HIVE-12063 URL: https://issues.apache.org/jira/browse/HIVE-12063 Project: Hive Issue Type: Improvement Components: Types Affects Versions: 1.1.0, 1.2.0, 1.0.0, 0.14.0, 0.13 Reporter: Xuefu Zhang Assignee: Xuefu Zhang HIVE-7373 was to address the problem of Hive trimming trailing zeros, which caused many problems, including treating 0.0, 0.00, and so on as 0, which has a different precision/scale. Please refer to the HIVE-7373 description. However, HIVE-7373 was reverted by HIVE-8745 while the underlying problems remained. HIVE-11835 was resolved recently to address one of the problems, where 0.0, 0.00, and so on could not be read into decimal(1,1). However, HIVE-11835 didn't address the problem of showing 0 in query results for decimal values such as 0.0, 0.00, etc. This causes confusion, as 0.0 and 0.00 have a different precision/scale than 0. The proposal here is to pad query results with zeros to the type's scale. This not only removes the confusion described above, but also aligns with many other DBs. The internal decimal number representation doesn't change, however. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
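The proposed display-side padding can be illustrated with java.math.BigDecimal (Hive's internal decimal representation differs; this only shows the formatting change the proposal describes):

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

public class DecimalPad {
    // Pad a decimal's textual form to the column's declared scale, so a 0
    // stored in a decimal(3,2) column displays as "0.00" rather than "0".
    // UNNECESSARY is safe here: increasing the scale never rounds.
    static String padToScale(BigDecimal value, int scale) {
        return value.setScale(scale, RoundingMode.UNNECESSARY).toPlainString();
    }

    public static void main(String[] args) {
        System.out.println(padToScale(new BigDecimal("0"), 2));   // 0.00
        System.out.println(padToScale(new BigDecimal("1.5"), 2)); // 1.50
    }
}
```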
[jira] [Created] (HIVE-11844) Merge master to Spark branch 9/16/2015 [Spark Branch]
Xuefu Zhang created HIVE-11844: -- Summary: Merge master to Spark branch 9/16/2015 [Spark Branch] Key: HIVE-11844 URL: https://issues.apache.org/jira/browse/HIVE-11844 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Xuefu Zhang Assignee: Xuefu Zhang Fix For: 1.2.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-11835) Type decimal(1,1) reads 0.0, 0.00, etc from text file as NULL
Xuefu Zhang created HIVE-11835: -- Summary: Type decimal(1,1) reads 0.0, 0.00, etc from text file as NULL Key: HIVE-11835 URL: https://issues.apache.org/jira/browse/HIVE-11835 Project: Hive Issue Type: Bug Components: Types Affects Versions: 1.1.0, 1.2.0, 2.0.0 Reporter: Xuefu Zhang Assignee: Xuefu Zhang Steps to reproduce:
1. Create a text file with values like 0.0, 0.00, etc.
2. Create a table in Hive with type decimal(1,1).
3. Run "load data local inpath ..." to load data into the table.
4. Run select * on the table.
You will see that NULL is displayed for 0.0, 0.00, .0, etc. Instead, these should be read as 0.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
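A sketch of why these literals can be accepted: after stripping trailing zeros, "0.00" needs no significant digits at all, so it fits decimal(1,1). The fits() helper below is illustrative, not Hive's actual validation code:

```java
import java.math.BigDecimal;

public class DecimalFit {
    // Decide whether a parsed literal fits decimal(precision, scale) once
    // trailing zeros are stripped -- so "0.00" fits decimal(1,1) instead of
    // being nulled out on read.
    static boolean fits(String literal, int precision, int scale) {
        BigDecimal d = new BigDecimal(literal).stripTrailingZeros();
        if (d.compareTo(BigDecimal.ZERO) == 0) {
            return true;                       // zero needs no digits at all
        }
        if (d.scale() > scale) {
            return false;                      // too many fractional digits
        }
        // integer digits must fit in (precision - scale)
        return d.precision() - d.scale() <= precision - scale;
    }

    public static void main(String[] args) {
        System.out.println(fits("0.00", 1, 1)); // currently read as NULL, should fit
        System.out.println(fits("1.55", 1, 1)); // genuinely too wide
    }
}
```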
[jira] [Created] (HIVE-11549) Hide Hive configuration from spark driver launching process
Xuefu Zhang created HIVE-11549: -- Summary: Hide Hive configuration from spark driver launching process Key: HIVE-11549 URL: https://issues.apache.org/jira/browse/HIVE-11549 Project: Hive Issue Type: Bug Components: Spark Affects Versions: 1.2.0 Reporter: Xuefu Zhang Assignee: Xuefu Zhang Hive uses Spark's application submission script, spark-submit, to launch the remote Spark driver. Starting from Spark 1.4, this script also does a lot of things that Hive doesn't need, for instance, accessing the metastore for delegation tokens. Hive on Spark doesn't need this, and one way to avoid it is to hide the Hive configuration from that script. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
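One way to sketch the idea: scrub Hive-related environment variables before spawning spark-submit, so the script cannot find Hive's configuration. The variable names removed here are assumptions about what the script might read, not a confirmed list:

```java
import java.util.Arrays;
import java.util.List;

public class SparkSubmitLauncher {
    // Build a launcher for spark-submit with Hive-related environment
    // variables removed from the child environment, so the script can't
    // pick up Hive's configuration (e.g. to reach the metastore).
    static ProcessBuilder scrubbedLauncher(List<String> command) {
        ProcessBuilder pb = new ProcessBuilder(command);
        pb.environment().remove("HIVE_CONF_DIR");  // assumed variable name
        pb.environment().remove("HIVE_HOME");      // assumed variable name
        return pb;
    }

    public static void main(String[] args) {
        ProcessBuilder pb = scrubbedLauncher(Arrays.asList("spark-submit", "--version"));
        System.out.println(pb.environment().containsKey("HIVE_CONF_DIR")); // false
    }
}
```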
[jira] [Created] (HIVE-11434) Followup for HIVE-10166: reuse existing configurations for prewarming Spark executors
Xuefu Zhang created HIVE-11434: -- Summary: Followup for HIVE-10166: reuse existing configurations for prewarming Spark executors Key: HIVE-11434 URL: https://issues.apache.org/jira/browse/HIVE-11434 Project: Hive Issue Type: Bug Components: Spark Affects Versions: 2.0.0 Reporter: Xuefu Zhang Assignee: Xuefu Zhang It appears that a patch other than the latest one from HIVE- was committed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-11433) NPE for a multiple inner join query
Xuefu Zhang created HIVE-11433: -- Summary: NPE for a multiple inner join query Key: HIVE-11433 URL: https://issues.apache.org/jira/browse/HIVE-11433 Project: Hive Issue Type: Bug Components: Query Processor Affects Versions: 1.2.0, 1.1.0, 2.0.0 Reporter: Xuefu Zhang A NullPointerException is thrown for a query that has multiple (more than 3) inner joins. Stacktrace for 1.1.0:
{code}
NullPointerException null
java.lang.NullPointerException
 at org.apache.hadoop.hive.ql.parse.ParseUtils.getIndex(ParseUtils.java:149)
 at org.apache.hadoop.hive.ql.parse.ParseUtils.checkJoinFilterRefersOneAlias(ParseUtils.java:166)
 at org.apache.hadoop.hive.ql.parse.ParseUtils.checkJoinFilterRefersOneAlias(ParseUtils.java:185)
 at org.apache.hadoop.hive.ql.parse.ParseUtils.checkJoinFilterRefersOneAlias(ParseUtils.java:185)
 at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.mergeJoins(SemanticAnalyzer.java:8257)
 at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.mergeJoinTree(SemanticAnalyzer.java:8422)
 at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:9805)
 at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:9714)
 at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genOPTree(SemanticAnalyzer.java:10150)
 at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:10161)
 at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:10078)
 at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:222)
 at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:421)
 at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:307)
 at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1110)
 at org.apache.hadoop.hive.ql.Driver.compileAndRespond(Driver.java:1104)
 at org.apache.hive.service.cli.operation.SQLOperation.prepare(SQLOperation.java:101)
 at org.apache.hive.service.cli.operation.SQLOperation.runInternal(SQLOperation.java:172)
 at org.apache.hive.service.cli.operation.Operation.run(Operation.java:257)
 at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:386)
 at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:373)
 at org.apache.hive.service.cli.CLIService.executeStatementAsync(CLIService.java:271)
 at org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:486)
 at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1313)
 at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1298)
 at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
 at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
 at org.apache.hadoop.hive.thrift.HadoopThriftAuthBridge$Server$TUGIAssumingProcessor.process(HadoopThriftAuthBridge.java:692)
 at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:285)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
{code}
The problem can also be reproduced in the latest master branch.
Further investigation shows that the following code (in ParseUtils.java) is problematic:
{code}
static int getIndex(String[] list, String elem) {
  for (int i = 0; i < list.length; i++) {
    if (list[i].toLowerCase().equals(elem)) {
      return i;
    }
  }
  return -1;
}
{code}
The code assumes that every element in the list is not null, which isn't true because of the following code in SemanticAnalyzer.java (method genJoinTree()):
{code}
if ((right.getToken().getType() == HiveParser.TOK_TABREF)
    || (right.getToken().getType() == HiveParser.TOK_SUBQUERY)
    || (right.getToken().getType() == HiveParser.TOK_PTBLFUNCTION)) {
  String tableName = getUnescapedUnqualifiedTableName((ASTNode) right.getChild(0))
      .toLowerCase();
  String alias = extractJoinAlias(right, tableName);
  String[] rightAliases = new String[1];
  rightAliases[0] = alias;
  joinTree.setRightAliases(rightAliases);
  String[] children = joinTree.getBaseSrc();
  if (children == null) {
    children = new String[2];
  }
  children[1] = alias;
  joinTree.setBaseSrc(children);
  joinTree.setId(qb.getId());
  joinTree.getAliasToOpInfo().put(
      getModifiedAlias(qb, alias), aliasToOpInfo.get(
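A minimal null-safe variant of the getIndex() helper quoted above, skipping the null aliases that genJoinTree() can leave in the array (whether this matches the fix that was eventually committed is not confirmed here):

```java
public class ParseUtilsFix {
    // Null-safe variant of ParseUtils.getIndex: skip null elements instead
    // of dereferencing them, which avoids the NPE during join merging.
    static int getIndex(String[] list, String elem) {
        for (int i = 0; i < list.length; i++) {
            if (list[i] != null && list[i].toLowerCase().equals(elem)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // baseSrc arrays from genJoinTree() can contain nulls, e.g.:
        String[] aliases = { null, "t1", "t2" };
        System.out.println(getIndex(aliases, "t2"));      // 2
        System.out.println(getIndex(aliases, "missing")); // -1
    }
}
```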
[jira] [Created] (HIVE-11430) Followup HIVE-10166: investigate and fix the two test failures
Xuefu Zhang created HIVE-11430: -- Summary: Followup HIVE-10166: investigate and fix the two test failures Key: HIVE-11430 URL: https://issues.apache.org/jira/browse/HIVE-11430 Project: Hive Issue Type: Bug Components: Test Affects Versions: 2.0.0 Reporter: Xuefu Zhang Assignee: Xuefu Zhang
{code}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_convert_enum_to_string
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_dynamic_rdd_cache
{code}
As shown in . -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-11363) Prewarm Hive on Spark containers [Spark Branch]
Xuefu Zhang created HIVE-11363: -- Summary: Prewarm Hive on Spark containers [Spark Branch] Key: HIVE-11363 URL: https://issues.apache.org/jira/browse/HIVE-11363 Project: Hive Issue Type: Sub-task Components: Spark Affects Versions: 1.1.0 Reporter: Xuefu Zhang Assignee: Xuefu Zhang When a Hive job is launched by Oozie, a Hive session is created and the job script is executed. The session is closed when the Hive job completes. Thus, a Hive session is not shared among Hive jobs, either within an Oozie workflow or across workflows. Since the parallelism of a Hive job executed on Spark is impacted by the available executors, such Hive jobs suffer the executor ramp-up overhead. The idea here is to wait a bit so that enough executors can come up before a job is executed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
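The waiting logic described above can be sketched as a bounded poll. The executor-count supplier is a stand-in for whatever API reports registered executors, and the 50 ms poll interval is an arbitrary choice for the sketch:

```java
import java.util.function.IntSupplier;

public class ExecutorPrewarm {
    // Block until at least minExecutors have registered, or the timeout
    // elapses -- whichever comes first. Returning false means we give up
    // waiting and run the job with whatever executors are available.
    static boolean awaitExecutors(IntSupplier currentCount, int minExecutors,
                                  long timeoutMs) {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (currentCount.getAsInt() < minExecutors) {
            if (System.currentTimeMillis() >= deadline) {
                return false;
            }
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // With a stubbed count of 8, the minimum of 4 is met immediately.
        System.out.println(awaitExecutors(() -> 8, 4, 5000));
    }
}
```

The timeout is exactly the knob HIVE-12951 tunes from 30s down to 5s.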
[jira] [Created] (HIVE-11314) Print "Execution completed successfully" as part of spark job info [Spark Branch]
Xuefu Zhang created HIVE-11314: -- Summary: Print "Execution completed successfully" as part of spark job info [Spark Branch] Key: HIVE-11314 URL: https://issues.apache.org/jira/browse/HIVE-11314 Project: Hive Issue Type: Sub-task Components: Spark Affects Versions: 1.1.0 Reporter: Xuefu Zhang Like Hive on MR, Hive on Spark should print "Execution completed successfully" as part of the spark job info. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-11276) Optimization around job submission and adding jars [Spark Branch]
Xuefu Zhang created HIVE-11276: -- Summary: Optimization around job submission and adding jars [Spark Branch] Key: HIVE-11276 URL: https://issues.apache.org/jira/browse/HIVE-11276 Project: Hive Issue Type: Sub-task Components: Spark Affects Versions: 1.1.0 Reporter: Xuefu Zhang It seems that Hive on Spark has some room for performance improvement in job submission. Specifically, we are calling refreshLocalResources() for every job submission even when there are no changes in the jar list. Since Hive on Spark reuses the containers for the whole user session, we might be able to optimize that. We do need to take into consideration the case of dynamic allocation, in which new executors might be added. This task covers some R&D in this area. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
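One possible shape of the optimization: remember which jars were already shipped in this session and call refreshLocalResources() only when something new appears. This sketch ignores the dynamic-allocation wrinkle the issue raises (a cache like this would need invalidation when new executors join):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class LocalResourceCache {
    private final Set<String> shippedJars = new HashSet<>();

    // Return only the jars not yet shipped in this session; an empty result
    // means the refreshLocalResources() call can be skipped entirely.
    synchronized List<String> newJars(List<String> requested) {
        List<String> fresh = new ArrayList<>();
        for (String jar : requested) {
            if (shippedJars.add(jar)) {
                fresh.add(jar);
            }
        }
        return fresh;
    }

    public static void main(String[] args) {
        LocalResourceCache cache = new LocalResourceCache();
        System.out.println(cache.newJars(Arrays.asList("udfs.jar", "serde.jar"))); // both new
        System.out.println(cache.newJars(Arrays.asList("udfs.jar", "serde.jar"))); // empty: skip refresh
    }
}
```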
[jira] [Created] (HIVE-11275) Merge master to beeline-cli branch 07/14/2015
Xuefu Zhang created HIVE-11275: -- Summary: Merge master to beeline-cli branch 07/14/2015 Key: HIVE-11275 URL: https://issues.apache.org/jira/browse/HIVE-11275 Project: Hive Issue Type: Sub-task Components: CLI Reporter: Xuefu Zhang Assignee: Xuefu Zhang -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-11250) Change in spark.executor.instances (and others) doesn't take effect after RSC is launched for HS2 [Spark Branch]
Xuefu Zhang created HIVE-11250: -- Summary: Change in spark.executor.instances (and others) doesn't take effect after RSC is launched for HS2 [Spark Branch] Key: HIVE-11250 URL: https://issues.apache.org/jira/browse/HIVE-11250 Project: Hive Issue Type: Bug Components: Spark Affects Versions: 1.1.0 Reporter: Xuefu Zhang Hive CLI works as expected. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-11240) Change value type from int to long for HiveConf.ConfVars.METASTORESERVERMAXMESSAGESIZE
Xuefu Zhang created HIVE-11240: -- Summary: Change value type from int to long for HiveConf.ConfVars.METASTORESERVERMAXMESSAGESIZE Key: HIVE-11240 URL: https://issues.apache.org/jira/browse/HIVE-11240 Project: Hive Issue Type: Improvement Components: Metastore Affects Versions: 1.2.0, 1.1.0 Reporter: Xuefu Zhang Currently in HiveMetaStore.java, we are getting an integer value from this property: {code} int maxMessageSize = conf.getIntVar(HiveConf.ConfVars.METASTORESERVERMAXMESSAGESIZE); {code} While this is sufficient most of the time, there can be cases where the message size needs to be greater than Integer.MAX_VALUE. We should use long instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
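A sketch of the proposed change, using java.util.Properties as a stand-in for HiveConf (the real change would use HiveConf's long-valued accessor); the property-name string corresponding to METASTORESERVERMAXMESSAGESIZE is an assumption here:

```java
import java.util.Properties;

public class MaxMessageSize {
    // Assumed property name behind HiveConf.ConfVars.METASTORESERVERMAXMESSAGESIZE.
    static final String KEY = "hive.metastore.server.max.message.size";

    // Read the knob as a long so values beyond Integer.MAX_VALUE (~2 GB)
    // remain representable; fall back to the default when unset.
    static long maxMessageSize(Properties conf, long defaultValue) {
        String v = conf.getProperty(KEY);
        return v == null ? defaultValue : Long.parseLong(v.trim());
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty(KEY, "3221225472"); // 3 GB: overflows an int
        System.out.println(maxMessageSize(conf, 104857600L));
    }
}
```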
[jira] [Created] (HIVE-11088) Investigate intermittent failure of join28.q for Spark
Xuefu Zhang created HIVE-11088: -- Summary: Investigate intermittent failure of join28.q for Spark Key: HIVE-11088 URL: https://issues.apache.org/jira/browse/HIVE-11088 Project: Hive Issue Type: Sub-task Components: Spark Affects Versions: 1.3.0 Reporter: Xuefu Zhang Assignee: Mohit Sabharwal Please refer to https://issues.apache.org/jira/browse/HIVE-10996?focusedCommentId=14598349&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14598349. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-11067) Merge master to Spark branch 6/20/2015 [Spark Branch]
Xuefu Zhang created HIVE-11067: -- Summary: Merge master to Spark branch 6/20/2015 [Spark Branch] Key: HIVE-11067 URL: https://issues.apache.org/jira/browse/HIVE-11067 Project: Hive Issue Type: Sub-task Reporter: Xuefu Zhang Assignee: Xuefu Zhang -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-11000) Hive not able to pass Hive's Kerberos credential to spark-submit process [Spark Branch]
Xuefu Zhang created HIVE-11000: -- Summary: Hive not able to pass Hive's Kerberos credential to spark-submit process [Spark Branch] Key: HIVE-11000 URL: https://issues.apache.org/jira/browse/HIVE-11000 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Xuefu Zhang The end result is that a manual kinit with Hive's keytab is needed on the host where HS2 is running, or the following error may appear: {code} 2015-04-29 15:49:34,614 INFO org.apache.hive.spark.client.SparkClientImpl: 15/04/29 15:49:34 WARN UserGroupInformation: PriviledgedActionException as:hive (auth:KERBEROS) cause:java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)] 2015-04-29 15:49:34,652 INFO org.apache.hive.spark.client.SparkClientImpl: Exception in thread "main" java.io.IOException: Failed on local exception: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]; Host Details : local host is: "secure-hos-1.ent.cloudera.com/10.20.77.79"; destination host is: "secure-hos-1.ent.cloudera.com":8032; 2015-04-29 15:49:34,653 INFO org.apache.hive.spark.client.SparkClientImpl: at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772) 2015-04-29 15:49:34,653 INFO org.apache.hive.spark.client.SparkClientImpl: at org.apache.hadoop.ipc.Client.call(Client.java:1472) 2015-04-29 15:49:34,654 INFO org.apache.hive.spark.client.SparkClientImpl: at org.apache.hadoop.ipc.Client.call(Client.java:1399) 2015-04-29 15:49:34,654 INFO org.apache.hive.spark.client.SparkClientImpl: at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232) 2015-04-29 15:49:34,654 INFO org.apache.hive.spark.client.SparkClientImpl: at com.sun.proxy.$Proxy11.getClusterMetrics(Unknown Source) 2015-04-29 15:49:34,655 INFO 
org.apache.hive.spark.client.SparkClientImpl: at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getClusterMetrics(ApplicationClientProtocolPBClientImpl.java:202) 2015-04-29 15:49:34,655 INFO org.apache.hive.spark.client.SparkClientImpl: at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 2015-04-29 15:49:34,655 INFO org.apache.hive.spark.client.SparkClientImpl: at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) 2015-04-29 15:49:34,656 INFO org.apache.hive.spark.client.SparkClientImpl: at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 2015-04-29 15:49:34,656 INFO org.apache.hive.spark.client.SparkClientImpl: at java.lang.reflect.Method.invoke(Method.java:606) 2015-04-29 15:49:34,656 INFO org.apache.hive.spark.client.SparkClientImpl: at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187) 2015-04-29 15:49:34,657 INFO org.apache.hive.spark.client.SparkClientImpl: at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) 2015-04-29 15:49:34,657 INFO org.apache.hive.spark.client.SparkClientImpl: at com.sun.proxy.$Proxy12.getClusterMetrics(Unknown Source) 2015-04-29 15:49:34,657 INFO org.apache.hive.spark.client.SparkClientImpl: at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getYarnClusterMetrics(YarnClientImpl.java:461) 2015-04-29 15:49:34,657 INFO org.apache.hive.spark.client.SparkClientImpl: at org.apache.spark.deploy.yarn.Client$$anonfun$submitApplication$1.apply(Client.scala:91) 2015-04-29 15:49:34,657 INFO org.apache.hive.spark.client.SparkClientImpl: at org.apache.spark.deploy.yarn.Client$$anonfun$submitApplication$1.apply(Client.scala:91) 2015-04-29 15:49:34,657 INFO org.apache.hive.spark.client.SparkClientImpl: at org.apache.spark.Logging$class.logInfo(Logging.scala:59) 2015-04-29 15:49:34,657 INFO org.apache.hive.spark.client.SparkClientImpl: at 
org.apache.spark.deploy.yarn.Client.logInfo(Client.scala:49) 2015-04-29 15:49:34,657 INFO org.apache.hive.spark.client.SparkClientImpl: at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:90) 2015-04-29 15:49:34,658 INFO org.apache.hive.spark.client.SparkClientImpl: at org.apache.spark.deploy.yarn.Client.run(Client.scala:619) 2015-04-29 15:49:34,658 INFO org.apache.hive.spark.client.SparkClientImpl: at org.apache.spark.deploy.yarn.Client$.main(Client.scala:647) 2015-04-29 15:49:34,658 INFO org.apache.hive.spark.client.SparkClientImpl: at org.apache.spark.deploy.yarn.Client.main(Client.scala) 2015-04-29 15:49:34,658 INFO org.apache.hive.spark.client.SparkClientImpl: at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 2015-04-
[jira] [Created] (HIVE-10999) Upgrade Spark dependency to 1.4
Xuefu Zhang created HIVE-10999: -- Summary: Upgrade Spark dependency to 1.4 Key: HIVE-10999 URL: https://issues.apache.org/jira/browse/HIVE-10999 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Xuefu Zhang Spark 1.4.0 is released. Let's update the dependency version from 1.3.1 to 1.4.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-10962) Merge master to Spark branch 6/7/2015 [Spark Branch]
Xuefu Zhang created HIVE-10962: -- Summary: Merge master to Spark branch 6/7/2015 [Spark Branch] Key: HIVE-10962 URL: https://issues.apache.org/jira/browse/HIVE-10962 Project: Hive Issue Type: Sub-task Reporter: Xuefu Zhang -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-10870) Merge Spark branch to trunk 5/29/2015
Xuefu Zhang created HIVE-10870: -- Summary: Merge Spark branch to trunk 5/29/2015 Key: HIVE-10870 URL: https://issues.apache.org/jira/browse/HIVE-10870 Project: Hive Issue Type: Task Components: Spark Reporter: Xuefu Zhang Assignee: Xuefu Zhang -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-10868) Update release note for 1.2.0 and 1.1.0
Xuefu Zhang created HIVE-10868: -- Summary: Update release note for 1.2.0 and 1.1.0 Key: HIVE-10868 URL: https://issues.apache.org/jira/browse/HIVE-10868 Project: Hive Issue Type: Task Components: Documentation Affects Versions: 1.2.0, 1.1.0 Reporter: Xuefu Zhang Assignee: Xuefu Zhang It was recently found that Hive's release notes don't contain all fixed JIRAs. This happened because some JIRAs had an incorrect or missing fix version. A large chunk of such JIRAs have fix versions that didn't get updated during a merge from a feature branch to trunk (master). This JIRA is to fix such JIRAs related to the Hive on Spark work. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-10863) Merge trunk to Spark branch 5/28/2015 [Spark Branch]
Xuefu Zhang created HIVE-10863: -- Summary: Merge trunk to Spark branch 5/28/2015 [Spark Branch] Key: HIVE-10863 URL: https://issues.apache.org/jira/browse/HIVE-10863 Project: Hive Issue Type: Sub-task Reporter: Xuefu Zhang Assignee: Xuefu Zhang -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-10855) Make HIVE-10568 work with Spark [Spark Branch]
Xuefu Zhang created HIVE-10855: -- Summary: Make HIVE-10568 work with Spark [Spark Branch] Key: HIVE-10855 URL: https://issues.apache.org/jira/browse/HIVE-10855 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Xuefu Zhang Assignee: Rui Li HIVE-10568 only works with Tez. It would be good to make it work with Spark as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-10854) Make HIVE-10001 work with Spark [Spark Branch]
Xuefu Zhang created HIVE-10854: -- Summary: Make HIVE-10001 work with Spark [Spark Branch] Key: HIVE-10854 URL: https://issues.apache.org/jira/browse/HIVE-10854 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Xuefu Zhang HIVE-10001 only works with Tez. It would be good to make it work with Spark as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-10850) Followup for HIVE-10550, check performance w.r.t. persistency level
Xuefu Zhang created HIVE-10850: -- Summary: Followup for HIVE-10550, check performance w.r.t. persistency level Key: HIVE-10850 URL: https://issues.apache.org/jira/browse/HIVE-10850 Project: Hive Issue Type: Task Components: Spark Affects Versions: 1.2.0, 1.1.0 Reporter: Xuefu Zhang Assignee: Chengxiang Li In HIVE-10550, there was a discussion on the persistence level and whether we need to give the user some control over it. This JIRA is to investigate further, especially by measuring performance under different conditions, and to continue the discussion. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-10810) Document Beeline/CLI changes
Xuefu Zhang created HIVE-10810: -- Summary: Document Beeline/CLI changes Key: HIVE-10810 URL: https://issues.apache.org/jira/browse/HIVE-10810 Project: Hive Issue Type: Sub-task Components: CLI Reporter: Xuefu Zhang -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-10671) yarn-cluster mode offers a degraded performance from yarn-client [Spark Branch]
Xuefu Zhang created HIVE-10671: -- Summary: yarn-cluster mode offers a degraded performance from yarn-client [Spark Branch] Key: HIVE-10671 URL: https://issues.apache.org/jira/browse/HIVE-10671 Project: Hive Issue Type: Bug Components: Spark Reporter: Xuefu Zhang With Hive on Spark, users noticed that in certain cases spark.master=yarn-client offers 2x or 3x better performance than spark.master=yarn-cluster. However, yarn-cluster is what we recommend and support, so we should investigate and fix the problem. One such query is TPC-H query 22. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-10579) Fix -Phadoop-1 build
Xuefu Zhang created HIVE-10579: -- Summary: Fix -Phadoop-1 build Key: HIVE-10579 URL: https://issues.apache.org/jira/browse/HIVE-10579 Project: Hive Issue Type: Bug Reporter: Xuefu Zhang Assignee: Xuefu Zhang -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-10574) Metastore to handle expired tokens inline
Xuefu Zhang created HIVE-10574: -- Summary: Metastore to handle expired tokens inline Key: HIVE-10574 URL: https://issues.apache.org/jira/browse/HIVE-10574 Project: Hive Issue Type: Bug Components: Metastore Reporter: Xuefu Zhang This is a followup for HIVE-9625. The metastore has a garbage collection thread that removes expired tokens. However, that still leaves a window (1 hour by default) in which clients could retrieve a token that has expired or is about to expire. One option is for the metastore to handle expired tokens inline. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
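The inline handling suggested above can be sketched as follows. This is an illustrative model only; the class and method names are hypothetical and not the metastore's actual delegation token store API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of inline expired-token handling: retrieval itself checks and
// removes expired tokens, closing the window left by a periodic GC
// thread. Names are hypothetical, not the real metastore token store API.
class InlineExpiryTokenStore {
    private final Map<String, Long> expiryByToken = new ConcurrentHashMap<>();

    void addToken(String token, long expiryMillis) {
        expiryByToken.put(token, expiryMillis);
    }

    /** Returns the token only if still valid; expired tokens are removed inline. */
    String getToken(String token, long nowMillis) {
        Long expiry = expiryByToken.get(token);
        if (expiry == null) {
            return null; // unknown token
        }
        if (expiry <= nowMillis) {
            expiryByToken.remove(token); // cleaned up at access time, not by the GC thread
            return null;
        }
        return token;
    }
}
```

With a check like this, a client can never retrieve a token past its expiry time, regardless of how often the background cleanup thread runs.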
[jira] [Created] (HIVE-10516) Measure Hive CLI's performance difference before and after implementation is switched
Xuefu Zhang created HIVE-10516: -- Summary: Measure Hive CLI's performance difference before and after implementation is switched Key: HIVE-10516 URL: https://issues.apache.org/jira/browse/HIVE-10516 Project: Hive Issue Type: Sub-task Components: CLI Affects Versions: 0.10.0 Reporter: Xuefu Zhang -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-10515) Create tests to cover existing (supported) Hive CLI functionality
Xuefu Zhang created HIVE-10515: -- Summary: Create tests to cover existing (supported) Hive CLI functionality Key: HIVE-10515 URL: https://issues.apache.org/jira/browse/HIVE-10515 Project: Hive Issue Type: Sub-task Components: CLI Affects Versions: 0.10.0 Reporter: Xuefu Zhang After removing HiveServer1, Hive CLI's functionality is reduced to its original use case: a thick client application. Let's identify and cover this functionality with tests so that we maintain it when the implementation is changed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-10511) Unify Hive CLI and Beeline
Xuefu Zhang created HIVE-10511: -- Summary: Unify Hive CLI and Beeline Key: HIVE-10511 URL: https://issues.apache.org/jira/browse/HIVE-10511 Project: Hive Issue Type: Bug Components: CLI Affects Versions: 0.10.0 Reporter: Xuefu Zhang Hive CLI is a legacy tool with two main use cases: 1. a thick client for SQL on Hadoop; 2. a command line tool for HiveServer1. HiveServer1 is already deprecated and removed from the Hive code base, so use case #2 is out of the question. For #1, Beeline provides (or is supposed to provide) equal functionality, yet is implemented differently from Hive CLI. Since the Hive community has been recommending the Beeline + HS2 configuration for a while now, ideally we should deprecate Hive CLI. Because Hive CLI is so widely used, we instead propose replacing Hive CLI's implementation with Beeline plus an embedded HS2, so that the Hive community only needs to maintain a single code path. In this way, Hive CLI becomes just an alias for Beeline, either at the shell-script level or at a higher code level. The goal is that no changes, or minimal changes, are required of existing user scripts that use Hive CLI. This is an umbrella JIRA covering all tasks related to this initiative. Over the last year or two, Beeline has been improved significantly to match what Hive CLI offers. Still, there may be gaps or deficiencies to be discovered and fixed. In the meantime, we also want to make sure that enough tests are included and that the performance impact is identified and addressed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
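The alias idea can be sketched at the code level as below. The `translateArgs` helper, the pass-through option mapping, and the embedded-mode JDBC URL `jdbc:hive2://` are illustrative assumptions about how a CLI invocation might be routed to Beeline, not the actual HIVE-10511 implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch: route a Hive CLI invocation to Beeline by
// prepending an embedded-HS2 connection and passing the remaining
// options through. Hypothetical helper, not the real implementation.
class CliToBeeline {
    static List<String> translateArgs(String[] cliArgs) {
        List<String> beelineArgs = new ArrayList<>();
        // Embedded mode: Beeline starts an in-process HS2, so no server
        // deployment is required, preserving the thick-client use case.
        beelineArgs.add("-u");
        beelineArgs.add("jdbc:hive2://");
        // Options like -e, -f and --hiveconf are assumed here to map
        // one-to-one between the two tools.
        beelineArgs.addAll(Arrays.asList(cliArgs));
        return beelineArgs;
    }
}
```

Under this scheme the `hive` shell script would simply invoke Beeline's main class with the translated argument list.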
[jira] [Created] (HIVE-10166) Merge Spark branch to trunk 3/31/2015
Xuefu Zhang created HIVE-10166: -- Summary: Merge Spark branch to trunk 3/31/2015 Key: HIVE-10166 URL: https://issues.apache.org/jira/browse/HIVE-10166 Project: Hive Issue Type: Sub-task Components: Spark Affects Versions: 1.1.0 Reporter: Xuefu Zhang -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-10134) Fix test failures after HIVE-10130 [Spark Branch]
Xuefu Zhang created HIVE-10134: -- Summary: Fix test failures after HIVE-10130 [Spark Branch] Key: HIVE-10134 URL: https://issues.apache.org/jira/browse/HIVE-10134 Project: Hive Issue Type: Sub-task Components: Spark Affects Versions: spark-branch Reporter: Xuefu Zhang Complete test run: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-SPARK-Build/812/#showFailuresLink *Failed tests:* {noformat} org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_nonmr_fetch org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_union31 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_union_remove_22 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_union_remove_6_subq org.apache.hive.hcatalog.hbase.TestPigHBaseStorageHandler.org.apache.hive.hcatalog.hbase.TestPigHBaseStorageHandler {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-10130) Merge from Spark branch to trunk 03/27/2015
Xuefu Zhang created HIVE-10130: -- Summary: Merge from Spark branch to trunk 03/27/2015 Key: HIVE-10130 URL: https://issues.apache.org/jira/browse/HIVE-10130 Project: Hive Issue Type: Sub-task Components: Spark Affects Versions: spark-branch Reporter: Xuefu Zhang -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-10084) Improve common join performance [Spark Branch]
Xuefu Zhang created HIVE-10084: -- Summary: Improve common join performance [Spark Branch] Key: HIVE-10084 URL: https://issues.apache.org/jira/browse/HIVE-10084 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Xuefu Zhang Benchmarks show numbers indicating that Hive on Spark's common join performance can be improved. This task is to investigate and fix the issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-9990) TestMultiSessionsHS2WithLocalClusterSpark is failing
Xuefu Zhang created HIVE-9990: - Summary: TestMultiSessionsHS2WithLocalClusterSpark is failing Key: HIVE-9990 URL: https://issues.apache.org/jira/browse/HIVE-9990 Project: Hive Issue Type: Bug Components: Spark Affects Versions: 1.2.0 Reporter: Xuefu Zhang At least sometimes. I can reproduce it with "mvn test -Dtest=TestMultiSessionsHS2WithLocalClusterSpark -Phadoop-2" consistently on my local box. {code} --- T E S T S --- Running org.apache.hive.jdbc.TestMultiSessionsHS2WithLocalClusterSpark Tests run: 2, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 92.438 sec <<< FAILURE! - in org.apache.hive.jdbc.TestMultiSessionsHS2WithLocalClusterSpark testSparkQuery(org.apache.hive.jdbc.TestMultiSessionsHS2WithLocalClusterSpark) Time elapsed: 21.514 sec <<< ERROR! java.util.concurrent.ExecutionException: java.sql.SQLException: Error while processing statement: FAILED: Execution Error, return code 3 from org.apache.hadoop.hive.ql.exec.spark.SparkTask at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:296) at org.apache.hive.jdbc.HiveStatement.executeQuery(HiveStatement.java:392) at org.apache.hive.jdbc.TestMultiSessionsHS2WithLocalClusterSpark.verifyResult(TestMultiSessionsHS2WithLocalClusterSpark.java:244) at org.apache.hive.jdbc.TestMultiSessionsHS2WithLocalClusterSpark.testKvQuery(TestMultiSessionsHS2WithLocalClusterSpark.java:220) at org.apache.hive.jdbc.TestMultiSessionsHS2WithLocalClusterSpark.access$000(TestMultiSessionsHS2WithLocalClusterSpark.java:53) {code} The error was also seen in HIVE-9934 test run. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-9889) Merge trunk to Spark branch 3/6/2015 [Spark Branch]
Xuefu Zhang created HIVE-9889: - Summary: Merge trunk to Spark branch 3/6/2015 [Spark Branch] Key: HIVE-9889 URL: https://issues.apache.org/jira/browse/HIVE-9889 Project: Hive Issue Type: Task Components: Spark Reporter: Xuefu Zhang Assignee: Xuefu Zhang -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-9863) Querying parquet tables fails with IllegalStateException [Spark Branch]
Xuefu Zhang created HIVE-9863: - Summary: Querying parquet tables fails with IllegalStateException [Spark Branch] Key: HIVE-9863 URL: https://issues.apache.org/jira/browse/HIVE-9863 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Xuefu Zhang This doesn't necessarily happen only in the Spark branch; queries such as select count(*) from table_name fail with an error: {code} hive> select * from content limit 2; OK Failed with exception java.io.IOException:java.lang.IllegalStateException: All the offsets listed in the split should be found in the file. expected: [4, 4] found: [BlockMetaData{69644, 881917418 [ColumnMetaData{GZIP [guid] BINARY [PLAIN, BIT_PACKED], 4}, ColumnMetaData{GZIP [collection_name] BINARY [PLAIN_DICTIONARY, BIT_PACKED], 389571}, ColumnMetaData{GZIP [doc_type] BINARY [PLAIN_DICTIONARY, BIT_PACKED], 389790}, ColumnMetaData{GZIP [stage] INT64 [PLAIN_DICTIONARY, BIT_PACKED], 389887}, ColumnMetaData{GZIP [meta_timestamp] INT64 [RLE, PLAIN_DICTIONARY, BIT_PACKED], 397673}, ColumnMetaData{GZIP [doc_timestamp] INT64 [RLE, PLAIN_DICTIONARY, BIT_PACKED], 422161}, ColumnMetaData{GZIP [meta_size] INT32 [RLE, PLAIN_DICTIONARY, BIT_PACKED], 460215}, ColumnMetaData{GZIP [content_size] INT32 [RLE, PLAIN_DICTIONARY, BIT_PACKED], 521728}, ColumnMetaData{GZIP [source] BINARY [RLE, PLAIN, BIT_PACKED], 683740}, ColumnMetaData{GZIP [delete_flag] BOOLEAN [RLE, PLAIN, BIT_PACKED], 683787}, ColumnMetaData{GZIP [meta] BINARY [RLE, PLAIN, BIT_PACKED], 683834}, ColumnMetaData{GZIP [content] BINARY [RLE, PLAIN, BIT_PACKED], 6992365}]}] out of: [4, 129785482, 260224757] in range 0, 134217728 Time taken: 0.253 seconds hive> {code} I can reproduce the problem in either local or yarn-cluster mode. It also seems to happen with MR, so I suspect this is a Parquet problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-9812) Merge trunk to Spark branch 02/27/2015 [Spark Branch]
Xuefu Zhang created HIVE-9812: - Summary: Merge trunk to Spark branch 02/27/2015 [Spark Branch] Key: HIVE-9812 URL: https://issues.apache.org/jira/browse/HIVE-9812 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Xuefu Zhang Assignee: Xuefu Zhang -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-9671) Support Impersonation [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332211#comment-14332211 ] Xuefu Zhang commented on HIVE-9671: --- Patch looks good. One minor nit: a space seems to be missing: {code} user =Utils.getUGI().getShortUserName(); {code} Besides that, the code additions in the shims seem identical, so it might make sense to extract a private method to reuse the code instead. > Support Impersonation [Spark Branch] > > > Key: HIVE-9671 > URL: https://issues.apache.org/jira/browse/HIVE-9671 > Project: Hive > Issue Type: Sub-task > Components: Spark >Affects Versions: spark-branch >Reporter: Brock Noland >Assignee: Brock Noland > Attachments: HIVE-9671.1-spark.patch, HIVE-9671.1-spark.patch, > HIVE-9671.2-spark.patch > > > SPARK-5493 in 1.3 implemented proxy user authentication. We need to implement > using this option in spark client. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-9671) Support Impersonation [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuefu Zhang updated HIVE-9671: -- Status: Open (was: Patch Available) > Support Impersonation [Spark Branch] > > > Key: HIVE-9671 > URL: https://issues.apache.org/jira/browse/HIVE-9671 > Project: Hive > Issue Type: Sub-task > Components: Spark >Reporter: Brock Noland >Assignee: Brock Noland > > SPARK-5493 in 1.3 implemented proxy user authentication. We need to implement > using this option in spark client. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-9745) predicate evaluation of character fields with spaces and literals with spaces returns unexpected result
[ https://issues.apache.org/jira/browse/HIVE-9745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuefu Zhang updated HIVE-9745: -- Description: The following query should return 5 rows, but Hive returns 3: {code} select rnum, tchar.cchar from tchar where not ( tchar.cchar = ' ' or ( tchar.cchar is null and ' ' is null )) {code} Consider the following projection of the base table: {code} select rnum, tchar.cchar, case tchar.cchar when ' ' then 'space' else 'not space' end, case when tchar.cchar is null then 'is null' else 'not null' end, case when ' ' is null then 'is null' else 'not null' end from tchar order by rnum {code} Row 0 is a NULL. Row 1 was loaded with a zero-length string ''. Row 2 was loaded with a single space ' '. {code} rnum tchar.cchar _c2 _c3 _c4 0 not space is null not null 1 not space not null not null 2 not space not null not null 3 BB not space not null not null 4 EE not space not null not null 5 FF not space not null not null {code} Explicitly type casting the literal, which many SQL developers would not expect to need to do, gives the expected result: 
{code} select rnum, tchar.cchar, case tchar.cchar when cast(' ' as char(1)) then 'space' else 'not space' end, case when tchar.cchar is null then 'is null' else 'not null' end, case when cast( ' ' as char(1)) is null then 'is null' else 'not null' end from tchar order by rnum rnum tchar.cchar _c2 _c3 _c4 0 not space is null not null 1 space not null not null 2 space not null not null 3 BB not space not null not null 4 EE not space not null not null 5 FF not space not null not null create table if not exists T_TCHAR ( RNUM int , CCHAR char(32) ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n' STORED AS TEXTFILE ; 0|\N 1| 2| 3|BB 4|EE 5|FF create table if not exists TCHAR ( RNUM int , CCHAR char(32) ) STORED AS orc ; insert overwrite table TCHAR select * from T_TCHAR; {code} was: The following query should return 5 rows but Hive returns 3 select rnum, tchar.cchar from tchar where not ( tchar.cchar = ' ' or ( tchar.cchar is null and ' ' is null )) Consider the following project of the base table select rnum, tchar.cchar, case tchar.cchar when ' ' then 'space' else 'not space' end, case when tchar.cchar is null then 'is null' else 'not null' end, case when ' ' is null then 'is null' else 'not null' end from tchar order by rnum Row 0 is a NULL Row 1 was loaded with a zero length string '' Row 2 was loaded with a single space ' ' rnum tchar.cchar _c2 _c3 _c4 0 not space is null not null 1 not space not null not null 2 not space not null not null 3 BB not space not null not null 4 EE not space not null not null 5 FF not space not null not null Explicitly type cast the literal which many SQL developers would not expect need to do. 
select rnum, tchar.cchar, case tchar.cchar when cast(' ' as char(1)) then 'space' else 'not space' end, case when tchar.cchar is null then 'is null' else 'not null' end, case when cast( ' ' as char(1)) is null then 'is null' else 'not null' end from tchar order by rnum rnum tchar.cchar _c2 _c3 _c4 0 not space is null not null 1 space not null not null 2 space not null not null 3 BB not space not null not null 4 EE not space not null not null 5 FF not space not null not null create table if not exists T_TCHAR ( RNUM int , CCHAR char(32) ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n' STORED AS TEXTFILE ; 0|\N 1| 2| 3|BB 4|EE 5|FF create table if not exists TCHAR ( RNUM int , CCHAR char(32) ) STORED AS orc ; insert overwrite table TCHAR select * from T_TCHAR; > predicate evaluation of character fields with spaces and literals with spaces returns unexpected result
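The behavior hinges on SQL CHAR padded-comparison semantics: trailing spaces are insignificant when two CHAR values are compared, which is why rows 1 and 2 report 'space' once the literal is cast to char(1). The sketch below models that rule in isolation; it is an illustration of the semantics, not Hive's actual HiveChar comparison code.

```java
// Model of SQL CHAR padded-comparison semantics: both operands are
// right-trimmed before comparing, so '' and ' ' are equal as CHAR.
// Illustration only, not Hive's implementation.
class PaddedCompare {
    static String rtrim(String s) {
        int end = s.length();
        while (end > 0 && s.charAt(end - 1) == ' ') {
            end--;
        }
        return s.substring(0, end);
    }

    static boolean charEquals(String a, String b) {
        return rtrim(a).equals(rtrim(b));
    }
}
```

Whether the untyped literal ' ' should also be compared with these padded semantics, as the cast(' ' as char(1)) version is, is exactly the inconsistency the report describes.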
[jira] [Commented] (HIVE-9726) Upgrade to spark 1.3 [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14328971#comment-14328971 ] Xuefu Zhang commented on HIVE-9726: --- +1 > Upgrade to spark 1.3 [Spark Branch] > --- > > Key: HIVE-9726 > URL: https://issues.apache.org/jira/browse/HIVE-9726 > Project: Hive > Issue Type: Sub-task > Components: Spark >Affects Versions: spark-branch >Reporter: Brock Noland >Assignee: Brock Noland > Attachments: HIVE-9671.1-spark.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-9703) Merge from Spark branch to trunk 02/16/2015
[ https://issues.apache.org/jira/browse/HIVE-9703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14326814#comment-14326814 ] Xuefu Zhang commented on HIVE-9703: --- No doc is needed for this JIRA. Any doc impact should be tracked by the respective JIRAs on the Spark branch. Going over the patch shows there is nothing to be documented, however. > Merge from Spark branch to trunk 02/16/2015 > --- > > Key: HIVE-9703 > URL: https://issues.apache.org/jira/browse/HIVE-9703 > Project: Hive > Issue Type: Task >Reporter: Xuefu Zhang >Assignee: Xuefu Zhang > Fix For: 1.2.0 > > Attachments: HIVE-9703.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (HIVE-7292) Hive on Spark
[ https://issues.apache.org/jira/browse/HIVE-7292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14326336#comment-14326336 ] Xuefu Zhang edited comment on HIVE-7292 at 2/18/15 6:37 PM: Formerly 0.15, now 1.1 is going to be released soon. Release candidate is out. was (Author: xuefuz): Formerly 0.15, now 1.1 is going to be release soon. Release candidate is out. > Hive on Spark > - > > Key: HIVE-7292 > URL: https://issues.apache.org/jira/browse/HIVE-7292 > Project: Hive > Issue Type: Improvement > Components: Spark >Reporter: Xuefu Zhang >Assignee: Xuefu Zhang > Labels: Spark-M1, Spark-M2, Spark-M3, Spark-M4, Spark-M5 > Attachments: Hive-on-Spark.pdf > > > Spark as an open-source data analytics cluster computing framework has gained > significant momentum recently. Many Hive users already have Spark installed > as their computing backbone. To take advantages of Hive, they still need to > have either MapReduce or Tez on their cluster. This initiative will provide > user a new alternative so that those user can consolidate their backend. > Secondly, providing such an alternative further increases Hive's adoption as > it exposes Spark users to a viable, feature-rich de facto standard SQL tools > on Hadoop. > Finally, allowing Hive to run on Spark also has performance benefits. Hive > queries, especially those involving multiple reducer stages, will run faster, > thus improving user experience as Tez does. > This is an umbrella JIRA which will cover many coming subtask. Design doc > will be attached here shortly, and will be on the wiki as well. Feedback from > the community is greatly appreciated! -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-7292) Hive on Spark
[ https://issues.apache.org/jira/browse/HIVE-7292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14326336#comment-14326336 ] Xuefu Zhang commented on HIVE-7292: --- Formerly 0.15, now 1.1 is going to be released soon. Release candidate is out. > Hive on Spark > - > > Key: HIVE-7292 > URL: https://issues.apache.org/jira/browse/HIVE-7292 > Project: Hive > Issue Type: Improvement > Components: Spark >Reporter: Xuefu Zhang >Assignee: Xuefu Zhang > Labels: Spark-M1, Spark-M2, Spark-M3, Spark-M4, Spark-M5 > Attachments: Hive-on-Spark.pdf > > > Spark as an open-source data analytics cluster computing framework has gained > significant momentum recently. Many Hive users already have Spark installed > as their computing backbone. To take advantages of Hive, they still need to > have either MapReduce or Tez on their cluster. This initiative will provide > user a new alternative so that those user can consolidate their backend. > Secondly, providing such an alternative further increases Hive's adoption as > it exposes Spark users to a viable, feature-rich de facto standard SQL tools > on Hadoop. > Finally, allowing Hive to run on Spark also has performance benefits. Hive > queries, especially those involving multiple reducer stages, will run faster, > thus improving user experience as Tez does. > This is an umbrella JIRA which will cover many coming subtask. Design doc > will be attached here shortly, and will be on the wiki as well. Feedback from > the community is greatly appreciated! -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-9561) SHUFFLE_SORT should only be used for order by query [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuefu Zhang updated HIVE-9561: -- Resolution: Fixed Fix Version/s: spark-branch Status: Resolved (was: Patch Available) [~lirui], no worries. I just committed this to the Spark branch. Thanks, Rui. > SHUFFLE_SORT should only be used for order by query [Spark Branch] > -- > > Key: HIVE-9561 > URL: https://issues.apache.org/jira/browse/HIVE-9561 > Project: Hive > Issue Type: Sub-task > Components: Spark >Reporter: Rui Li >Assignee: Rui Li > Fix For: spark-branch > > Attachments: HIVE-9561.1-spark.patch, HIVE-9561.2-spark.patch, > HIVE-9561.3-spark.patch, HIVE-9561.4-spark.patch, HIVE-9561.5-spark.patch, > HIVE-9561.6-spark.patch > > > The {{sortByKey}} shuffle launches probe jobs. Such jobs can hurt performance > and are difficult to control. So we should limit the use of {{sortByKey}} to > order by query only. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
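For context on the probe jobs: a range-based sort shuffle must first sample keys to pick partition boundaries, which costs an extra pass over the data before the real shuffle runs. The standalone sketch below models that two-phase mechanism (sample, derive bounds, route records); it is an illustration of the idea, not Spark's RangePartitioner implementation.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Models why a range shuffle needs a probe pass: partition boundaries
// must be derived from a sample of the keys before any record can be
// routed. Assumes the sample is at least numPartitions keys.
// Illustrative only, not Spark's RangePartitioner.
class RangeBounds {
    /** Derive (numPartitions - 1) split points from a sorted sample. */
    static List<Integer> bounds(List<Integer> sampledKeys, int numPartitions) {
        List<Integer> sorted = new ArrayList<>(sampledKeys);
        Collections.sort(sorted);
        List<Integer> result = new ArrayList<>();
        for (int i = 1; i < numPartitions; i++) {
            result.add(sorted.get(i * sorted.size() / numPartitions));
        }
        return result;
    }

    /** Route a key to the first partition whose upper bound exceeds it. */
    static int partitionFor(int key, List<Integer> bounds) {
        for (int i = 0; i < bounds.size(); i++) {
            if (key < bounds.get(i)) {
                return i;
            }
        }
        return bounds.size();
    }
}
```

Because the sampling pass runs before the query's own work, restricting this shuffle to genuine order-by queries avoids paying that cost where a hash shuffle would do.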
[jira] [Updated] (HIVE-9703) Merge from Spark branch to trunk 02/16/2015
[ https://issues.apache.org/jira/browse/HIVE-9703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuefu Zhang updated HIVE-9703: -- Resolution: Fixed Fix Version/s: 1.2.0 Status: Resolved (was: Patch Available) Committed to trunk. Thanks to Brock for the review. > Merge from Spark branch to trunk 02/16/2015 > --- > > Key: HIVE-9703 > URL: https://issues.apache.org/jira/browse/HIVE-9703 > Project: Hive > Issue Type: Task >Reporter: Xuefu Zhang >Assignee: Xuefu Zhang > Fix For: 1.2.0 > > Attachments: HIVE-9703.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-9607) Remove unnecessary attach-jdbc-driver execution from package/pom.xml
[ https://issues.apache.org/jira/browse/HIVE-9607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuefu Zhang updated HIVE-9607: -- Resolution: Fixed Fix Version/s: 1.2.0 Status: Resolved (was: Patch Available) Committed to trunk. Thanks, Alex. > Remove unnecessary attach-jdbc-driver execution from package/pom.xml > > > Key: HIVE-9607 > URL: https://issues.apache.org/jira/browse/HIVE-9607 > Project: Hive > Issue Type: Improvement > Components: Build Infrastructure >Reporter: Alexander Pivovarov >Assignee: Alexander Pivovarov >Priority: Minor > Fix For: 1.2.0 > > Attachments: HIVE-9607.1.patch > > > Looks like build-helper-maven-plugin block which has execution > attach-jdbc-driver is not needed in package/pom.xml > package/pom.xml has maven-dependency-plugin which copies hive-jdbc-standalone > to project.build.directory > I removed build-helper-maven-plugin block and rebuilt hive > hive-jdbc-standalone.jar is still placed to project.build.directory > {code} > $ mvn clean install -Phadoop-2 -Pdist -DskipTests > $ find . -name "apache-hive*jdbc.jar" -exec ls -la {} \; > 16844023 Feb 6 17:45 ./packaging/target/apache-hive-1.2.0-SNAPSHOT-jdbc.jar > $ find . -name "hive-jdbc*standalone.jar" -exec ls -la {} \; > 16844023 Feb 6 17:45 > ./packaging/target/apache-hive-1.2.0-SNAPSHOT-bin/apache-hive-1.2.0-SNAPSHOT-bin/lib/hive-jdbc-1.2.0-SNAPSHOT-standalone.jar > 16844023 Feb 6 17:45 ./jdbc/target/hive-jdbc-1.2.0-SNAPSHOT-standalone.jar > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-9561) SHUFFLE_SORT should only be used for order by query [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuefu Zhang updated HIVE-9561: -- Attachment: HIVE-9561.6-spark.patch > SHUFFLE_SORT should only be used for order by query [Spark Branch] > -- > > Key: HIVE-9561 > URL: https://issues.apache.org/jira/browse/HIVE-9561 > Project: Hive > Issue Type: Sub-task > Components: Spark >Reporter: Rui Li >Assignee: Rui Li > Attachments: HIVE-9561.1-spark.patch, HIVE-9561.2-spark.patch, > HIVE-9561.3-spark.patch, HIVE-9561.4-spark.patch, HIVE-9561.5-spark.patch, > HIVE-9561.6-spark.patch > > > The {{sortByKey}} shuffle launches probe jobs. Such jobs can hurt performance > and are difficult to control. So we should limit the use of {{sortByKey}} to > order by query only. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-9708) Remove testlibs directory
[ https://issues.apache.org/jira/browse/HIVE-9708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14324860#comment-14324860 ] Xuefu Zhang commented on HIVE-9708: --- +1 > Remove testlibs directory > - > > Key: HIVE-9708 > URL: https://issues.apache.org/jira/browse/HIVE-9708 > Project: Hive > Issue Type: Improvement >Affects Versions: 1.1.0 >Reporter: Brock Noland >Assignee: Brock Noland > Fix For: 1.1.0 > > Attachments: HIVE-9708.patch > > > The {{testlibs}} directory is left over from the old ant build. We can delete > it as it's downloaded by maven now: > https://github.com/apache/hive/blob/trunk/pom.xml#L610 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-9561) SHUFFLE_SORT should only be used for order by query [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuefu Zhang updated HIVE-9561: -- Attachment: HIVE-9561.5-spark.patch Rebased. > SHUFFLE_SORT should only be used for order by query [Spark Branch] > -- > > Key: HIVE-9561 > URL: https://issues.apache.org/jira/browse/HIVE-9561 > Project: Hive > Issue Type: Sub-task > Components: Spark >Reporter: Rui Li >Assignee: Rui Li > Attachments: HIVE-9561.1-spark.patch, HIVE-9561.2-spark.patch, > HIVE-9561.3-spark.patch, HIVE-9561.4-spark.patch, HIVE-9561.5-spark.patch > > > The {{sortByKey}} shuffle launches probe jobs. Such jobs can hurt performance > and are difficult to control. So we should limit the use of {{sortByKey}} to > order by query only. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-9561) SHUFFLE_SORT should only be used for order by query [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14323759#comment-14323759 ]

Xuefu Zhang commented on HIVE-9561:
-----------------------------------
Unfortunately the patch doesn't apply any more after the recent trunk-to-branch merge. Could you please rebase?

> SHUFFLE_SORT should only be used for order by query [Spark Branch]
> ------------------------------------------------------------------
>
>                 Key: HIVE-9561
>                 URL: https://issues.apache.org/jira/browse/HIVE-9561
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Spark
>            Reporter: Rui Li
>            Assignee: Rui Li
>         Attachments: HIVE-9561.1-spark.patch, HIVE-9561.2-spark.patch, HIVE-9561.3-spark.patch, HIVE-9561.4-spark.patch
>
>
> The {{sortByKey}} shuffle launches probe jobs. Such jobs can hurt performance and are difficult to control. So we should limit the use of {{sortByKey}} to order by query only.
[jira] [Commented] (HIVE-9561) SHUFFLE_SORT should only be used for order by query [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14323756#comment-14323756 ]

Xuefu Zhang commented on HIVE-9561:
-----------------------------------
+1

> SHUFFLE_SORT should only be used for order by query [Spark Branch]
> ------------------------------------------------------------------
>
>                 Key: HIVE-9561
>                 URL: https://issues.apache.org/jira/browse/HIVE-9561
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Spark
>            Reporter: Rui Li
>            Assignee: Rui Li
>         Attachments: HIVE-9561.1-spark.patch, HIVE-9561.2-spark.patch, HIVE-9561.3-spark.patch, HIVE-9561.4-spark.patch
>
>
> The {{sortByKey}} shuffle launches probe jobs. Such jobs can hurt performance and are difficult to control. So we should limit the use of {{sortByKey}} to order by query only.
[jira] [Updated] (HIVE-9696) Address RB comments for HIVE-9425 [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xuefu Zhang updated HIVE-9696:
------------------------------
    Fix Version/s: spark-branch

> Address RB comments for HIVE-9425 [Spark Branch]
> ------------------------------------------------
>
>                 Key: HIVE-9696
>                 URL: https://issues.apache.org/jira/browse/HIVE-9696
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Spark
>            Reporter: Rui Li
>            Priority: Trivial
>             Fix For: spark-branch
>
>         Attachments: HIVE-9696.1-spark.patch, HIVE-9696.1-spark.patch, HIVE-9696.1-spark.patch
>
>
> A followup task of HIVE-9425.
> The pending RB comment can be found [here|https://reviews.apache.org/r/30984/#comment118482].
[jira] [Updated] (HIVE-9696) Address RB comments for HIVE-9425 [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xuefu Zhang updated HIVE-9696:
------------------------------
    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

Committed to Spark branch. Thanks, Rui.

> Address RB comments for HIVE-9425 [Spark Branch]
> ------------------------------------------------
>
>                 Key: HIVE-9696
>                 URL: https://issues.apache.org/jira/browse/HIVE-9696
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Spark
>            Reporter: Rui Li
>            Priority: Trivial
>         Attachments: HIVE-9696.1-spark.patch, HIVE-9696.1-spark.patch, HIVE-9696.1-spark.patch
>
>
> A followup task of HIVE-9425.
> The pending RB comment can be found [here|https://reviews.apache.org/r/30984/#comment118482].
[jira] [Updated] (HIVE-9703) Merge from Spark branch to trunk 02/16/2015
[ https://issues.apache.org/jira/browse/HIVE-9703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xuefu Zhang updated HIVE-9703:
------------------------------
    Status: Patch Available  (was: Open)

> Merge from Spark branch to trunk 02/16/2015
> -------------------------------------------
>
>                 Key: HIVE-9703
>                 URL: https://issues.apache.org/jira/browse/HIVE-9703
>             Project: Hive
>          Issue Type: Task
>            Reporter: Xuefu Zhang
>            Assignee: Xuefu Zhang
>         Attachments: HIVE-9703.patch
>