[jira] [Created] (HIVE-21372) Use Apache Commons IO To Read Stream To String

2019-03-01 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-21372:
--

 Summary: Use Apache Commons IO To Read Stream To String
 Key: HIVE-21372
 URL: https://issues.apache.org/jira/browse/HIVE-21372
 Project: Hive
  Issue Type: Improvement
Reporter: BELUGA BEHR
 Fix For: 4.0.0






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HIVE-21371) Make NonSyncByteArrayOutputStream Overflow Conscious

2019-03-01 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-21371:
--

 Summary: Make NonSyncByteArrayOutputStream Overflow Conscious 
 Key: HIVE-21371
 URL: https://issues.apache.org/jira/browse/HIVE-21371
 Project: Hive
  Issue Type: Improvement
Affects Versions: 4.0.0, 3.2.0
Reporter: BELUGA BEHR
 Attachments: HIVE-21371.1.patch

{code:java|title=NonSyncByteArrayOutputStream}
  private int enLargeBuffer(int increment) {
    int temp = count + increment;
    int newLen = temp;
    if (temp > buf.length) {
      if ((buf.length << 1) > temp) {
        newLen = buf.length << 1;
      }
      byte newbuf[] = new byte[newLen];
      System.arraycopy(buf, 0, newbuf, 0, count);
      buf = newbuf;
    }
    return newLen;
  }
{code}

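This breaks down once the array reaches 1GB: {{buf.length << 1}} overflows 
{{int}} and goes negative, so the buffer no longer doubles and instead grows by 
only the requested increment (copying the whole array each time).  In addition, 
{{count + increment}} can itself overflow, since the code gives no 
consideration to the 2GB ({{Integer.MAX_VALUE}} bytes) limit on arrays.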
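
One possible shape for an overflow-conscious growth policy is sketched below. This is only an illustration, not the attached patch: the class name {{GrowthSketch}}, the method {{newCapacity}} and the {{MAX_ARRAY_SIZE}} margin are all assumptions made for the example. The idea is to do the arithmetic in {{long}} and clamp the result.

```java
final class GrowthSketch {
  // Conservative cap (an assumption for this sketch): some JVMs cannot
  // allocate arrays of exactly Integer.MAX_VALUE elements.
  static final int MAX_ARRAY_SIZE = Integer.MAX_VALUE - 8;

  /**
   * Returns the capacity a buffer of currentLength bytes should have so that
   * it can hold count existing bytes plus increment new ones.
   */
  static int newCapacity(int currentLength, int count, int increment) {
    // Use long arithmetic so neither the sum nor the doubling can wrap around.
    long required = (long) count + increment;
    if (required > MAX_ARRAY_SIZE) {
      throw new OutOfMemoryError("Required buffer size exceeds array limit: " + required);
    }
    if (required <= currentLength) {
      return currentLength; // already large enough, no growth needed
    }
    long doubled = (long) currentLength << 1;
    // Prefer doubling, but never go below what is required or above the cap.
    return (int) Math.min(Math.max(doubled, required), MAX_ARRAY_SIZE);
  }
}
```

With this approach a 1GB buffer that needs one more byte grows to the cap rather than wrapping around, and a request that cannot fit in any array fails loudly.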





[jira] [Created] (HIVE-21370) JsonSerDe cannot handle json file with empty lines - Branch 3

2019-03-01 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-21370:
--

 Summary: JsonSerDe cannot handle json file with empty lines - 
Branch 3
 Key: HIVE-21370
 URL: https://issues.apache.org/jira/browse/HIVE-21370
 Project: Hive
  Issue Type: Improvement
  Components: Serializers/Deserializers
Affects Versions: 3.1.0, 3.0.0, 3.2.0
Reporter: BELUGA BEHR
 Fix For: 3.2.0
 Attachments: HIVE-21370.1.patch







[jira] [Created] (HIVE-21356) Upgrade Jackson to 2.9.8

2019-02-28 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-21356:
--

 Summary: Upgrade Jackson to 2.9.8
 Key: HIVE-21356
 URL: https://issues.apache.org/jira/browse/HIVE-21356
 Project: Hive
  Issue Type: Improvement
  Components: Serializers/Deserializers
Affects Versions: 4.0.0, 3.2.0
Reporter: BELUGA BEHR
 Fix For: 4.0.0


Currently at:

{code}
2.9.5
{code}

Upgrade to 2.9.8 - contains some improvements for processing Base64 data.





[jira] [Created] (HIVE-21354) Lock The Entire Table If Majority Of Partitions Are Locked

2019-02-28 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-21354:
--

 Summary: Lock The Entire Table If Majority Of Partitions Are Locked
 Key: HIVE-21354
 URL: https://issues.apache.org/jira/browse/HIVE-21354
 Project: Hive
  Issue Type: Improvement
  Components: HiveServer2
Affects Versions: 4.0.0, 3.2.0
Reporter: BELUGA BEHR


One of the bottlenecks of any Hive query is the ZooKeeper locking mechanism.  
When a Hive query interacts with a table which has a lot of partitions, this 
may put a lot of stress on the ZK system.

Please add a heuristic that works like this:

# Count the number of partitions that a query is required to lock
# Obtain the total number of partitions in the table
# If the number of partitions accessed by the query is greater than or equal to 
half the total number of partitions, simply create one ZNode lock at the table 
level.

This would improve performance of many queries, but in particular, a 
{{select count(1) from table}} ... or ... {{select * from table limit 5}} 
where the table has many partitions.
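
The three steps above could be sketched roughly as follows. The names {{LockGranularitySketch}}, {{LockScope}} and {{chooseScope}} are hypothetical and do not exist in Hive; the handling of an unpartitioned table is also just an assumption.

```java
final class LockGranularitySketch {
  enum LockScope { TABLE, PARTITIONS }

  /**
   * Decides whether to take one table-level lock or individual partition locks.
   *
   * @param partitionsToLock number of partitions the query must lock
   * @param totalPartitions  total number of partitions in the table
   */
  static LockScope chooseScope(long partitionsToLock, long totalPartitions) {
    if (totalPartitions <= 0) {
      // Assumption for this sketch: no partitions means a table-level lock.
      return LockScope.TABLE;
    }
    // "Greater than or equal to half", written without division so that odd
    // partition counts are not affected by integer rounding.
    return (partitionsToLock * 2 >= totalPartitions)
        ? LockScope.TABLE
        : LockScope.PARTITIONS;
  }
}
```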





[jira] [Created] (HIVE-21352) Drop INDEX from 3.0 Schema

2019-02-28 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-21352:
--

 Summary: Drop INDEX from 3.0 Schema
 Key: HIVE-21352
 URL: https://issues.apache.org/jira/browse/HIVE-21352
 Project: Hive
  Issue Type: Improvement
  Components: Metastore
Affects Versions: 4.0.0, 3.2.0
Reporter: BELUGA BEHR


We dropped support for Hive indexes starting in 3.0; however, there are still 
tables in the Metastore to support them.  Please remove them.





[jira] [Created] (HIVE-21347) Store Partition Count in TBLS

2019-02-28 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-21347:
--

 Summary: Store Partition Count in TBLS
 Key: HIVE-21347
 URL: https://issues.apache.org/jira/browse/HIVE-21347
 Project: Hive
  Issue Type: Improvement
  Components: Metastore
Affects Versions: 4.0.0, 3.2.0
Reporter: BELUGA BEHR


Please store a count of the number of partitions each table has in the 
{{TBLS}} table.  This will allow very quick lookups for tables with many 
partitions.





[jira] [Created] (HIVE-21328) Call To Hadoop Text getBytes() Without Call to getLength()

2019-02-26 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-21328:
--

 Summary: Call To Hadoop Text getBytes() Without Call to getLength()
 Key: HIVE-21328
 URL: https://issues.apache.org/jira/browse/HIVE-21328
 Project: Hive
  Issue Type: Bug
  Components: Query Planning
Affects Versions: 4.0.0, 3.2.0
Reporter: BELUGA BEHR


I'm not sure if there is actually a bug, but this looks highly suspect:

{code:java}
  public Object set(final Object o, final Text text) {
    return new BytesWritable(text == null ? null : text.getBytes());
  }
{code}

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/serde/primitive/ParquetStringInspector.java#L104-L106

There are two components to a Text object.  There are the internal bytes and 
the length of the bytes.  The two are independent.  I.e., a quick "reset" on 
the Text object simply sets the internal length counter to zero.  This code is 
potentially looking at obsolete data that it shouldn't be seeing because it is 
not considering the length of the Text.
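
To illustrate the concern without pulling in Hadoop, here is a small stand-in. {{ReusableBuffer}} is a hypothetical class written only for this example; it mimics a reusable buffer that keeps a backing array separate from its valid length, and it is not the real {{Text}} implementation.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical stand-in for a reusable buffer: backing array plus valid length. */
final class ReusableBuffer {
  private byte[] bytes = new byte[0];
  private int length;

  void set(String s) {
    byte[] src = s.getBytes(StandardCharsets.UTF_8);
    if (src.length > bytes.length) {
      bytes = new byte[src.length]; // grows when needed, never shrinks
    }
    System.arraycopy(src, 0, bytes, 0, src.length);
    length = src.length;
  }

  /** Whole backing array; bytes past getLength() may be stale. */
  byte[] getBytes() {
    return bytes;
  }

  /** Number of valid bytes at the start of the backing array. */
  int getLength() {
    return length;
  }

  /** Copies only the valid prefix of the backing array. */
  byte[] copyValidBytes() {
    return Arrays.copyOf(bytes, length);
  }
}
```

After {{set("hello world")}} followed by {{set("hi")}}, {{getBytes()}} still returns all 11 bytes, while {{copyValidBytes()}} returns only the 2 valid ones.  As far as I know, Hadoop's {{Text}} offers {{copyBytes()}} for the same purpose.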





[jira] [Created] (HIVE-21321) Remove Class HiveIOExceptionHandlerChain

2019-02-25 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-21321:
--

 Summary: Remove Class HiveIOExceptionHandlerChain
 Key: HIVE-21321
 URL: https://issues.apache.org/jira/browse/HIVE-21321
 Project: Hive
  Issue Type: Task
Affects Versions: 4.0.0
Reporter: BELUGA BEHR


I recently stumbled upon this code when tracking down some issue: 
{{HiveIOExceptionHandlerChain.java}}

Is anyone using this feature? It has a configuration associated with it: 
{{hive.io.exception.handlers}}. 

The code doesn't seem to have any unit tests.

Can this feature simply be removed?





[jira] [Created] (HIVE-21317) Unit Test kafka_storage_handler Is Failing Regularly

2019-02-25 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-21317:
--

 Summary: Unit Test kafka_storage_handler Is Failing Regularly
 Key: HIVE-21317
 URL: https://issues.apache.org/jira/browse/HIVE-21317
 Project: Hive
  Issue Type: Task
Affects Versions: 4.0.0
Reporter: BELUGA BEHR


{code}
org.apache.hadoop.hive.cli.TestMiniHiveKafkaCliDriver.testCliDriver[kafka_storage_handler]
 (batchId=275)
{code}





[jira] [Created] (HIVE-21303) Update TextRecordReader

2019-02-21 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-21303:
--

 Summary: Update TextRecordReader
 Key: HIVE-21303
 URL: https://issues.apache.org/jira/browse/HIVE-21303
 Project: Hive
  Issue Type: Improvement
  Components: Query Processor
Affects Versions: 4.0.0, 3.2.0
Reporter: BELUGA BEHR
Assignee: BELUGA BEHR
 Attachments: HIVE-21303.1.patch

Remove use of Deprecated 
{{org.apache.hadoop.mapred.LineRecordReader.LineReader}}

For every call to {{next}}, the code dives into the configuration map to see if 
this feature is enabled.  Just look it up once and cache the value.

{code:java}
public int next(Writable row) throws IOException {
  ...
  if (HiveConf.getBoolVar(conf, HiveConf.ConfVars.HIVESCRIPTESCAPE)) {
    return HiveUtils.unescapeText((Text) row);
  }
  return bytesConsumed;
}
{code}

Other clean up.
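
A rough sketch of the "look it up once" idea, using a plain {{Map}} in place of {{HiveConf}}. The class {{EscapingReaderSketch}}, the key {{example.script.escape}} and the simplified unescaping are all hypothetical and only stand in for the real reader.

```java
import java.util.Map;

final class EscapingReaderSketch {
  // Hypothetical configuration key, used only for this illustration.
  static final String ESCAPE_KEY = "example.script.escape";

  private final boolean escapeEnabled;

  EscapingReaderSketch(Map<String, String> conf) {
    // Look the flag up once, at construction time, and cache the result.
    this.escapeEnabled = Boolean.parseBoolean(conf.get(ESCAPE_KEY));
  }

  /** Per-row hot path: consults only the cached boolean, never the map. */
  String next(String row) {
    // Simplified stand-in for unescaping: turn the two characters '\' 't' into a tab.
    return escapeEnabled ? row.replace("\\t", "\t") : row;
  }
}
```

One trade-off to keep in mind: a change to the configuration after the reader is constructed is no longer picked up.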





[jira] [Created] (HIVE-21289) Expect EQ and LIKE to Generate the Identical Explain Plans

2019-02-19 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-21289:
--

 Summary: Expect EQ and LIKE to Generate the Identical Explain Plans
 Key: HIVE-21289
 URL: https://issues.apache.org/jira/browse/HIVE-21289
 Project: Hive
  Issue Type: Improvement
  Components: Logical Optimizer
Affects Versions: 2.3.4
Reporter: BELUGA BEHR


I generated some test data with the UUID function.

{code:sql}
explain select * from test_like where a like 'abce6254-d437-426b-8873-2cbc153ddfbc';
explain select * from test_like where a = 'abce6254-d437-426b-8873-2cbc153ddfbc';
{code}

{code|title=LIKE}
Explain
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: test_like
            filterExpr: (a like 'abce6254-d437-426b-8873-2cbc153ddfbc') (type: boolean)
            Statistics: Num rows: 262144 Data size: 9437184 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: (a like 'abce6254-d437-426b-8873-2cbc153ddfbc') (type: boolean)
              Statistics: Num rows: 131072 Data size: 4718592 Basic stats: COMPLETE Column stats: NONE
              Select Operator
                expressions: a (type: string)
                outputColumnNames: _col0
                Statistics: Num rows: 131072 Data size: 4718592 Basic stats: COMPLETE Column stats: NONE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 131072 Data size: 4718592 Basic stats: COMPLETE Column stats: NONE
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
{code}

{code|title=EQ}
Explain
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: test_like
            filterExpr: (a = 'abce6254-d437-426b-8873-2cbc153ddfbc') (type: boolean)
            Statistics: Num rows: 262144 Data size: 9437184 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: (a = 'abce6254-d437-426b-8873-2cbc153ddfbc') (type: boolean)
              Statistics: Num rows: 131072 Data size: 4718592 Basic stats: COMPLETE Column stats: NONE
              Select Operator
                expressions: 'abce6254-d437-426b-8873-2cbc153ddfbc' (type: string)
                outputColumnNames: _col0
                Statistics: Num rows: 131072 Data size: 4718592 Basic stats: COMPLETE Column stats: NONE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 131072 Data size: 4718592 Basic stats: COMPLETE Column stats: NONE
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
{code}

They may be the same under the covers, but I would expect the EXPLAIN plan to 
be exactly the same.





[jira] [Created] (HIVE-21277) Make HBaseSerde First-Class SerDe

2019-02-15 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-21277:
--

 Summary: Make HBaseSerde First-Class SerDe
 Key: HIVE-21277
 URL: https://issues.apache.org/jira/browse/HIVE-21277
 Project: Hive
  Issue Type: New Feature
  Components: HBase Handler
Reporter: BELUGA BEHR


Make HBase integration with Hive first class.

{code:sql}
CREATE TABLE...STORED AS HBASE;
{code}

https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration





[jira] [Created] (HIVE-21275) Lower Logging Level in Operator Class

2019-02-15 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-21275:
--

 Summary: Lower Logging Level in Operator Class
 Key: HIVE-21275
 URL: https://issues.apache.org/jira/browse/HIVE-21275
 Project: Hive
  Issue Type: Improvement
Affects Versions: 4.0.0, 3.2.0
Reporter: BELUGA BEHR


There is an incredible amount of logging generated by the {{Operator}} during 
the Q-Tests.

I counted more than 1 *million* lines of pretty useless logging.  Please lower 
to TRACE level.







[jira] [Created] (HIVE-21264) Improvements Around CharTypeInfo

2019-02-13 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-21264:
--

 Summary: Improvements Around CharTypeInfo
 Key: HIVE-21264
 URL: https://issues.apache.org/jira/browse/HIVE-21264
 Project: Hive
  Issue Type: Improvement
Affects Versions: 4.0.0, 3.2.0
Reporter: BELUGA BEHR


The {{CharTypeInfo}} stores the type name of the data type (char/varchar) and 
the length (1-255).  {{CharTypeInfo}} objects are often getting cached once 
they are created.

The {{hashCode()}} and {{equals()}} implementations of its sub-classes 
(varchar and char) are inconsistent in this regard.

* Make {{hashCode()}} and {{equals()}} consistent (and fast)
* Simplify the {{getQualifiedName}} implementation and reduce the scope to 
protected
* Other related nits
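
As a generic illustration of what "consistent" means here (not the actual {{CharTypeInfo}} code), the sketch below derives both methods from the same two fields. The class {{CharTypeSketch}} is hypothetical.

```java
final class CharTypeSketch {
  private final String typeName; // e.g. "char" or "varchar"
  private final int length;      // declared length of the type

  CharTypeSketch(String typeName, int length) {
    this.typeName = typeName;
    this.length = length;
  }

  @Override
  public boolean equals(Object other) {
    if (this == other) {
      return true;
    }
    if (other == null || getClass() != other.getClass()) {
      return false;
    }
    CharTypeSketch that = (CharTypeSketch) other;
    // Compare the cheap int first, then the type name.
    return length == that.length && typeName.equals(that.typeName);
  }

  @Override
  public int hashCode() {
    // Uses the same two fields as equals(), so equal objects always hash alike.
    return 31 * typeName.hashCode() + length;
  }
}
```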





[jira] [Created] (HIVE-21258) Streamline and Add Warning to PrimitiveObjectInspectorFactory.java

2019-02-12 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-21258:
--

 Summary: Streamline and Add Warning to 
PrimitiveObjectInspectorFactory.java
 Key: HIVE-21258
 URL: https://issues.apache.org/jira/browse/HIVE-21258
 Project: Hive
  Issue Type: Improvement
Affects Versions: 4.0.0, 3.2.0
Reporter: BELUGA BEHR


I just got bit by something pretty good, so I would like to propose adding a 
developer-facing warning into the logs to avoid this situation again.  Also, 
fix up the related cache a bit.





[jira] [Created] (HIVE-21252) LazyTimestamp - Use String Equals

2019-02-12 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-21252:
--

 Summary: LazyTimestamp - Use String Equals
 Key: HIVE-21252
 URL: https://issues.apache.org/jira/browse/HIVE-21252
 Project: Hive
  Issue Type: Improvement
  Components: Serializers/Deserializers
Affects Versions: 4.0.0, 3.2.0
Reporter: BELUGA BEHR


{code:java|title=LazyTimestamp.java}
if (s.compareTo("NULL") == 0) {
  isNull = true;
  logExceptionMessage(bytes, start, length, "TIMESTAMP");
}
{code}

{{compareTo}} generates a number to represent the difference between the two 
Strings.  It's faster to simply call {{equals}}, which compares the two 
Strings directly and returns a boolean.
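
A minimal sketch of the suggested check; the helper name {{isNullLiteral}} is made up for this example. Note that putting the constant first also changes the behavior for a {{null}} argument, which is an addition of this sketch and not something the original code does.

```java
final class NullLiteralCheck {
  /** Returns true only when s is exactly the four characters "NULL". */
  static boolean isNullLiteral(String s) {
    // Constant first: a null argument yields false instead of a NullPointerException.
    return "NULL".equals(s);
  }
}
```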





[jira] [Created] (HIVE-21246) Un-bury DelimitedJSONSerDe from PlanUtils.java

2019-02-11 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-21246:
--

 Summary: Un-bury DelimitedJSONSerDe from PlanUtils.java
 Key: HIVE-21246
 URL: https://issues.apache.org/jira/browse/HIVE-21246
 Project: Hive
  Issue Type: Improvement
Reporter: BELUGA BEHR
 Attachments: HIVE-21246.1.patch

Ultimately, I'd like to get rid of 
{{org.apache.hadoop.hive.serde2.DelimitedJSONSerDe}}, but for now, trying to 
make it easier to get rid of later.  It's currently buried in 
{{PlanUtils.java}}.

A SerDe and a flag get passed into the utilities.  If the boolean flag is set, 
the passed-in SerDe is overwritten.  This is not documented anywhere and it is 
a weird thing to do: just pass in the SerDe to use, instead of the SerDe you 
don't want to use plus a flag to change it.





[jira] [Created] (HIVE-21245) Make hive.fetch.output.serde Default to LazySimpleSerde

2019-02-11 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-21245:
--

 Summary: Make hive.fetch.output.serde Default to LazySimpleSerde
 Key: HIVE-21245
 URL: https://issues.apache.org/jira/browse/HIVE-21245
 Project: Hive
  Issue Type: Improvement
  Components: Serializers/Deserializers
Affects Versions: 4.0.0, 3.2.0
Reporter: BELUGA BEHR
 Fix For: 4.0.0


For all intents and purposes, it already is:

{code:java|title=HiveSessionImpl.java}
  private static final String FETCH_WORK_SERDE_CLASS =
      "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe";

  @Override
  public HiveConf getHiveConf() {
    sessionConf.setVar(HiveConf.ConfVars.HIVEFETCHOUTPUTSERDE, FETCH_WORK_SERDE_CLASS);
    return sessionConf;
  }
{code}

https://github.com/apache/hive/blob/master/service/src/java/org/apache/hive/service/cli/session/HiveSessionImpl.java#L489-L492

Ultimately, I'd like to get rid of 
{{org.apache.hadoop.hive.serde2.DelimitedJSONSerDe}} altogether.  It's a weird 
thing.





[jira] [Created] (HIVE-21243) Create Default DB in default.db

2019-02-11 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-21243:
--

 Summary: Create Default DB in default.db
 Key: HIVE-21243
 URL: https://issues.apache.org/jira/browse/HIVE-21243
 Project: Hive
  Issue Type: Improvement
Reporter: BELUGA BEHR


When a database is created in Hive, it is stored in 
{{/user/hive/warehouse/[MyDatabase.db]/}}.  It is very confusing that the Hive 
default database is not located in {{/user/hive/warehouse/[default.db]/}}. 
Please address this and make it consistent.





[jira] [Created] (HIVE-21242) Calcite Planner Logging Indicates UTF-16 Encoding

2019-02-11 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-21242:
--

 Summary: Calcite Planner Logging Indicates UTF-16 Encoding
 Key: HIVE-21242
 URL: https://issues.apache.org/jira/browse/HIVE-21242
 Project: Hive
  Issue Type: Improvement
  Components: CBO
Affects Versions: 4.0.0, 3.2.0
Reporter: BELUGA BEHR


I noticed some debug logging from calcite and it is using UTF-16.   I would 
expect UTF-8.

{code}
2019-02-10T19:08:06,393 DEBUG [7db4d3c5-0f88-49db-88fa-ad6428c23784 main] 
parse.CalcitePlanner: Plan after decorrelation:
HiveSortLimit(offset=[0], fetch=[2])
  HiveProject(_o__c0=[array(3, 2, 1)], _o__c1=[map(1, 2001-01-01, 2, null)], 
_o__c2=[named_struct(_UTF-16LE'c1', 123456, _UTF-16LE'c2', _UTF-16LE'hello', 
_UTF-16LE'c3', array(_UTF-16LE'aa', _UTF-16LE'bb', _UTF-16LE'cc'), 
_UTF-16LE'c4', map(_UTF-16LE'abc', 123, _UTF-16LE'xyz', 456), _UTF-16LE'c5', 
named_struct(_UTF-16LE'c5_1', _UTF-16LE'bye', _UTF-16LE'c5_2', 88))])
HiveTableScan(table=[[default, src]], table:alias=[src])
{code}

I'm not sure if this is a calcite internal thing which can be configured or if 
this is only an artifact of the way the logging works.





[jira] [Created] (HIVE-21241) Migrate TimeStamp Parser From Joda Time

2019-02-11 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-21241:
--

 Summary: Migrate TimeStamp Parser From Joda Time
 Key: HIVE-21241
 URL: https://issues.apache.org/jira/browse/HIVE-21241
 Project: Hive
  Issue Type: Improvement
  Components: HiveServer2
Affects Versions: 3.2.0
Reporter: BELUGA BEHR
 Fix For: 4.0.0


Hive uses Joda time for its TimeStampParser.

{quote}
Joda-Time is the de facto standard date and time library for Java prior to Java 
SE 8. Users are now asked to migrate to java.time (JSR-310).

https://www.joda.org/joda-time/
{quote}

Migrate TimeStampParser to {{java.time}}

I also added a couple new pre-canned timestamp parsers for convenience:

* ISO 8601
* RFC 1123





[jira] [Created] (HIVE-21240) JSON SerDe Re-Write

2019-02-11 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-21240:
--

 Summary: JSON SerDe Re-Write
 Key: HIVE-21240
 URL: https://issues.apache.org/jira/browse/HIVE-21240
 Project: Hive
  Issue Type: Improvement
  Components: Serializers/Deserializers
Affects Versions: 3.1.1, 4.0.0
Reporter: BELUGA BEHR
 Fix For: 4.0.0


The JSON SerDe has a few issues; I will link them to this JIRA.

* Use Jackson Tree parser instead of manually parsing
* Added support for base-64 encoded data (the expected format when using JSON)
* Added support to skip blank lines (returns all columns as null values)
* Current JSON parser accepts, but does not apply, custom timestamp formats in 
most cases
* Added some unit tests
* Added cache for column-name to column-index searches, currently O\(n\) for 
each row processed, for each column in the row
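
The column-name cache in the last bullet could look roughly like this. {{ColumnIndexCache}} is a hypothetical name, and the case-insensitive matching is an assumption of the sketch, not a statement about the patch.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class ColumnIndexCache {
  private final Map<String, Integer> indexByName = new HashMap<>();

  /** Builds the name-to-index map once, when the column list is known. */
  ColumnIndexCache(List<String> columnNames) {
    for (int i = 0; i < columnNames.size(); i++) {
      // putIfAbsent keeps the first occurrence, as a linear scan would.
      indexByName.putIfAbsent(columnNames.get(i).toLowerCase(Locale.ROOT), i);
    }
  }

  /** Constant-time lookup on average; returns -1 for an unknown column. */
  int indexOf(String name) {
    Integer idx = indexByName.get(name.toLowerCase(Locale.ROOT));
    return idx == null ? -1 : idx;
  }
}
```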





[jira] [Created] (HIVE-21210) CombineHiveInputFormat Thread Pool Sizing

2019-02-04 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-21210:
--

 Summary: CombineHiveInputFormat Thread Pool Sizing
 Key: HIVE-21210
 URL: https://issues.apache.org/jira/browse/HIVE-21210
 Project: Hive
  Issue Type: Improvement
Affects Versions: 4.0.0, 3.2.0
Reporter: BELUGA BEHR


Threadpools.

Hive uses threadpools in several different places and each implementation is a 
little different and requires different configurations. I think that Hive needs 
to rein in and standardize the way that threadpools are used, and threadpools 
should scale automatically without manual configuration. At any given time, 
there are many hundreds of threads running in the HS2 as the number of 
simultaneous connections increases, and they surely cause contention with 
one another.

Here is an example:
{code:java|title=CombineHiveInputFormat.java}
  // max number of threads we can use to check non-combinable paths
  private static final int MAX_CHECK_NONCOMBINABLE_THREAD_NUM = 50;
  private static final int DEFAULT_NUM_PATH_PER_THREAD = 100;
{code}
When building the splits for a MR job, there are up to 50 threads running per 
query and there is not much scaling here: it is simply a ratio of 1 thread per 
100 files.  This implies that 5000 files are processed by 50 threads, and 
beyond that, 50 threads are still used. Many Hive jobs these days involve more 
than 5000 files, so it does not scale well at bigger sizes.

This is not configurable (even manually), it doesn't change when the hardware 
specs increase, and 50 threads seems like a lot when a service must support up 
to 80 connections:

[https://www.cloudera.com/documentation/enterprise/5/latest/topics/admin_hive_tuning.html]

Not to mention, I have never seen a scenario where HS2 is running on a host all 
by itself and has the entire system dedicated to it. Therefore it should be 
more friendly and spin up fewer threads.

I am attaching a patch here that provides a few features:
 * Common module that produces an {{ExecutorService}} which caps the number of 
threads it spins up at the number of processors a host has. Keep in mind that a 
class may submit as many work units ({{Callable}}s) as it would like, but the 
number of threads in the pool is capped.
 * Common module for partitioning work. That is, allow for a generic framework 
for dividing work into partitions (i.e. batches)
 * Modify {{CombineHiveInputFormat}} to take advantage of both modules, 
performing its same duties in a more Java OO way than is currently implemented
 * Add a partitioning (batching) implementation that enforces partitioning of a 
{{Collection}} based on the natural log of the {{Collection}} size so that it 
scales more slowly than a simple 1:100 ratio.
 * Simplify unit test code for {{CombineHiveInputFormat}}
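
To give a feel for the first and fourth bullets, here is a minimal sketch. It is not the attached patch: {{WorkPartitionSketch}}, {{newCappedPool}}, {{batchCount}} and {{partition}} are hypothetical names, and rounding the natural log up is just one way to turn it into a batch count.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class WorkPartitionSketch {
  /** Fixed-size pool capped at the number of processors available to the JVM. */
  static ExecutorService newCappedPool() {
    return Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
  }

  /** Number of batches grows with ln(size) instead of linearly with size. */
  static int batchCount(int size) {
    if (size <= 1) {
      return 1;
    }
    return Math.max(1, (int) Math.ceil(Math.log(size)));
  }

  /** Splits items into roughly equal batches; the batch count comes from batchCount. */
  static <T> List<List<T>> partition(List<T> items) {
    int batches = batchCount(items.size());
    int batchSize = (items.size() + batches - 1) / batches; // ceiling division
    List<List<T>> result = new ArrayList<>();
    for (int i = 0; i < items.size(); i += batchSize) {
      int end = Math.min(i + batchSize, items.size());
      result.add(new ArrayList<>(items.subList(i, end)));
    }
    return result;
  }
}
```

Under this particular rounding, 5000 paths would be split into 9 batches rather than 50, and each batch would be submitted as one {{Callable}} to the capped pool.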





[jira] [Created] (HIVE-21195) Review of DefaultGraphWalker Class

2019-01-31 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-21195:
--

 Summary: Review of DefaultGraphWalker Class
 Key: HIVE-21195
 URL: https://issues.apache.org/jira/browse/HIVE-21195
 Project: Hive
  Issue Type: Improvement
Affects Versions: 4.0.0, 3.2.0
Reporter: BELUGA BEHR


{code:java}
protected final List toWalk = new ArrayList();
...
while (toWalk.size() > 0) {
  Node nd = toWalk.remove(0);
{code}

Every time this loop runs, the first item of a list is removed.  For an 
{{ArrayList}}, this means that every time the first item is removed, all of the 
remaining items in the list are copied down one position so that the first item 
is always at array index 0.  This is expensive in a tight loop.  Use a 
{{Queue}} implementation that does not have this behavior, such as 
{{ArrayDeque}}:

{quote}
This class is likely to be faster than Stack when used as a stack, and faster 
than LinkedList when used as a queue.
{quote}

https://docs.oracle.com/javase/7/docs/api/java/util/ArrayDeque.html

Add a little bit extra cleanup since it's being looked at.
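
A small sketch of the same loop on top of an {{ArrayDeque}}. The class {{WalkSketch}} and the generic element type are placeholders for illustration; the real walker dispatches on {{Node}} objects and does more work per step.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Queue;

final class WalkSketch {
  /** Drains the work queue in FIFO order and returns the visit order. */
  static <T> List<T> walk(Collection<T> startNodes) {
    Queue<T> toWalk = new ArrayDeque<>(startNodes);
    List<T> visited = new ArrayList<>();
    while (!toWalk.isEmpty()) {
      // remove() takes the head in constant time; nothing is shifted down.
      T nd = toWalk.remove();
      visited.add(nd);
    }
    return visited;
  }
}
```

One caveat: {{ArrayDeque}} does not permit {{null}} elements, so this only works if {{null}} nodes are never queued.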





[jira] [Created] (HIVE-21193) Support LZO Compression with CombineHiveInputFormat

2019-01-31 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-21193:
--

 Summary: Support LZO Compression with CombineHiveInputFormat
 Key: HIVE-21193
 URL: https://issues.apache.org/jira/browse/HIVE-21193
 Project: Hive
  Issue Type: Improvement
  Components: Compression
Affects Versions: 4.0.0, 3.2.0
Reporter: BELUGA BEHR


In regards to LZO compression with Hive...

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LZO

It does not work out of the box if there are {{.lzo.index}} files present.  As 
I understand it, this is because the default Hive input format, 
{{CombineHiveInputFormat}}, does not handle this correctly.  It does not like 
that there is a mix of data files and index files: it lumps them all 
together when making the combined splits, and Mappers fail when they try to 
process the {{.lzo.index}} files as data.  When using the original 
{{HiveInputFormat}}, it correctly identifies the {{.lzo.index}} files because 
it considers each file individually.

Allow {{CombineHiveInputFormat}} to short-circuit LZO files and to not combine 
them.





[jira] [Created] (HIVE-21192) TestReplicationScenariosIncrementalLoadAcidTables Fails Regularly

2019-01-31 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-21192:
--

 Summary: TestReplicationScenariosIncrementalLoadAcidTables Fails 
Regularly
 Key: HIVE-21192
 URL: https://issues.apache.org/jira/browse/HIVE-21192
 Project: Hive
  Issue Type: Improvement
Affects Versions: 4.0.0
Reporter: BELUGA BEHR


Several of my patches are failing in YETUS due to the following unit test 
failure:

{code}
TestReplicationScenariosIncrementalLoadAcidTables - did not produce a 
TEST-*.xml file (likely timed out) (batchId=251)
{code}





[jira] [Created] (HIVE-21179) Move SampleHBaseKeyFactory* Into Main Code Line

2019-01-29 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-21179:
--

 Summary: Move SampleHBaseKeyFactory* Into Main Code Line
 Key: HIVE-21179
 URL: https://issues.apache.org/jira/browse/HIVE-21179
 Project: Hive
  Issue Type: Improvement
  Components: HBase Handler
Affects Versions: 3.1.0, 4.0.0
Reporter: BELUGA BEHR


https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration

{quote}
"hbase.composite.key.factory" should be the fully qualified class name of a 
class implementing HBaseKeyFactory. See SampleHBaseKeyFactory2 for a fixed 
length example in the same package. This class must be on your classpath in 
order for the above example to work. TODO: place these in an accessible place; 
they're currently only in test code.
{quote}





[jira] [Created] (HIVE-21175) Use StandardCharsets Where Possible (Part 2)

2019-01-28 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-21175:
--

 Summary: Use StandardCharsets Where Possible (Part 2)
 Key: HIVE-21175
 URL: https://issues.apache.org/jira/browse/HIVE-21175
 Project: Hive
  Issue Type: Improvement
Affects Versions: 3.2.0
Reporter: BELUGA BEHR
 Fix For: 4.0.0


Additional work not already addressed by [HIVE-21148].





[jira] [Created] (HIVE-21148) Remove Use StandardCharsets Where Possible

2019-01-22 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-21148:
--

 Summary: Remove Use StandardCharsets Where Possible
 Key: HIVE-21148
 URL: https://issues.apache.org/jira/browse/HIVE-21148
 Project: Hive
  Issue Type: Improvement
Affects Versions: 4.0.0
Reporter: BELUGA BEHR
 Fix For: 4.0.0


Starting in Java 1.7, JDKs must support a set of standard charsets.  When using 
this facility instead of passing the name (string) of the character set, there 
is no need to catch an {{UnsupportedEncodingException}}.
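
A minimal before/after illustration; {{CharsetSketch}} is a made-up class for this example.

```java
import java.nio.charset.StandardCharsets;

final class CharsetSketch {
  // Before: s.getBytes("UTF-8") declares the checked UnsupportedEncodingException
  // and forces a try/catch that can never trigger for a standard charset.

  /** Encodes with the Charset constant; no checked exception is declared. */
  static byte[] encode(String s) {
    return s.getBytes(StandardCharsets.UTF_8);
  }

  /** Decodes with the Charset constant; no checked exception is declared. */
  static String decode(byte[] bytes) {
    return new String(bytes, StandardCharsets.UTF_8);
  }
}
```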





[jira] [Created] (HIVE-21147) Remove Contrib RegexSerDe

2019-01-22 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-21147:
--

 Summary: Remove Contrib RegexSerDe
 Key: HIVE-21147
 URL: https://issues.apache.org/jira/browse/HIVE-21147
 Project: Hive
  Issue Type: Improvement
  Components: Serializers/Deserializers
Affects Versions: 4.0.0
Reporter: BELUGA BEHR
 Fix For: 4.0.0


https://github.com/apache/hive/blob/f37c5de6c32b9395d1b34fa3c02ed06d1bfbf6eb/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/RegexSerDe.java

https://github.com/apache/hive/blob/ae008b79b5d52ed6a38875b73025a505725828eb/serde/src/java/org/apache/hadoop/hive/serde2/RegexSerDe.java

Merge any difference in functionality and remove the version in the 'contrib' 
library





[jira] [Created] (HIVE-21073) Remove Extra String Object

2018-12-27 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-21073:
--

 Summary: Remove Extra String Object
 Key: HIVE-21073
 URL: https://issues.apache.org/jira/browse/HIVE-21073
 Project: Hive
  Issue Type: Improvement
Affects Versions: 3.1.1, 4.0.0
Reporter: BELUGA BEHR


{code}
  public static String generatePath(Path baseURI, String filename) {
    String path = new String(baseURI + Path.SEPARATOR + filename);
    return path;
  }

  public static String generateFileName(Byte tag, String bigBucketFileName) {
    String fileName = new String("MapJoin-" + tag + "-" + bigBucketFileName + suffix);
    return fileName;
  }
{code}

It's a bit odd to be performing string concatenation and then wrapping the 
results in a new string.  This is creating superfluous String objects. 
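
Something like the following would avoid the extra objects. To keep the sketch free of Hadoop dependencies, the base URI is a plain {{String}} and the separator and suffix are passed in as parameters; the class name {{PathNameSketch}} and those signature changes are assumptions for illustration only.

```java
final class PathNameSketch {
  /** The concatenation already yields a new String; return it directly. */
  static String generatePath(String baseURI, String separator, String filename) {
    return baseURI + separator + filename;
  }

  /** Same idea for the file name: no wrapping in new String(...). */
  static String generateFileName(Byte tag, String bigBucketFileName, String suffix) {
    return "MapJoin-" + tag + "-" + bigBucketFileName + suffix;
  }
}
```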





[jira] [Created] (HIVE-21071) Improve getInputSummary

2018-12-26 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-21071:
--

 Summary: Improve getInputSummary
 Key: HIVE-21071
 URL: https://issues.apache.org/jira/browse/HIVE-21071
 Project: Hive
  Issue Type: Improvement
  Components: HiveServer2
Affects Versions: 3.1.1, 3.0.0, 4.0.0
Reporter: BELUGA BEHR


There is a global lock in the {{getInputSummary}} code, so it is important that 
it be fast.  The current implementation has quite a bit of overhead that can be 
re-engineered.

For example, the current implementation keeps a map of File Path to 
ContentSummary object.  This map is populated by several threads concurrently. 
At the end, the method loops through the map in a single thread to add up 
all of the ContentSummary objects, ignoring the paths.  The code can be 
re-engineered to not use a map, or any collection at all, to store the results 
and instead just keep a running tally.  By keeping a tally, there is no O(n) 
operation at the end to perform the addition.

There are other things that can be improved.  The method returns an object 
which is never used anywhere, so change the method to a void return type.
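
The running tally could be as simple as a couple of {{LongAdder}} fields that the worker threads update directly. This is a sketch under assumptions: {{SummaryTally}} is a hypothetical class, and it tracks only two of the values a real ContentSummary carries.

```java
import java.util.concurrent.atomic.LongAdder;

final class SummaryTally {
  // LongAdder is designed for many threads adding concurrently.
  private final LongAdder length = new LongAdder();
  private final LongAdder fileCount = new LongAdder();

  /** Called by each worker thread as soon as it has a result for one path. */
  void add(long pathLength, long pathFileCount) {
    length.add(pathLength);
    fileCount.add(pathFileCount);
  }

  /** Total length added so far; read this after all workers have finished. */
  long totalLength() {
    return length.sum();
  }

  /** Total file count added so far; read this after all workers have finished. */
  long totalFiles() {
    return fileCount.sum();
  }
}
```

With this shape there is no map to populate and no final loop; the totals are available as soon as the last worker completes.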





[jira] [Created] (HIVE-20956) Avro Result File Format

2018-11-21 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-20956:
--

 Summary: Avro Result File Format
 Key: HIVE-20956
 URL: https://issues.apache.org/jira/browse/HIVE-20956
 Project: Hive
  Issue Type: New Feature
  Components: HiveServer2
Affects Versions: 3.1.1, 4.0.0
Reporter: BELUGA BEHR


*hive.query.result.fileformat*
 * Default Value:
 ** Hive 0.x, 1.x, and 2.0: {{TextFile}}
 ** Hive 2.1 onward: {{SequenceFile}}
 * Added In: Hive 0.7.0 with HIVE-1598

File format to use for a query's intermediate results. Options are TextFile, 
SequenceFile, and RCfile. Default value is changed to SequenceFile since Hive 
2.1.0 (HIVE-1608).

 

Add AVRO to this list





[jira] [Created] (HIVE-20947) Add User Agent String to Hive Client API

2018-11-20 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-20947:
--

 Summary: Add User Agent String to Hive Client API
 Key: HIVE-20947
 URL: https://issues.apache.org/jira/browse/HIVE-20947
 Project: Hive
  Issue Type: New Feature
  Components: Clients, Diagnosability, JDBC, ODBC
Affects Versions: 4.0.0
Reporter: BELUGA BEHR


Allow users to specify a user agent string as part of their JDBC/ODBC 
connection string and print the information in the HS2 logs.  This will give 
us the opportunity to identify misbehaving clients.

Variable: {{userAgent}}

https://en.wikipedia.org/wiki/User_agent#Format_for_human-operated_web_browsers
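
As a rough illustration of what the server side could do with such a variable, the sketch below pulls a {{userAgent}} value out of the semicolon-separated session variables of a connection string.  Everything here is an assumption: {{userAgent}} is only the name proposed above, the class {{UserAgentParam}} is hypothetical, and the host name in the example is made up.

```java
// Hypothetical parser for the proposed userAgent session variable.
class UserAgentParam {
  // Returns the userAgent value from ";key=value" session variables, or null.
  static String userAgent(String jdbcUrl) {
    // Ignore the "?conf" and "#var" sections that may follow the session variables.
    int end = jdbcUrl.length();
    int q = jdbcUrl.indexOf('?');
    if (q >= 0) end = Math.min(end, q);
    int h = jdbcUrl.indexOf('#');
    if (h >= 0) end = Math.min(end, h);

    String[] parts = jdbcUrl.substring(0, end).split(";");
    for (int i = 1; i < parts.length; i++) { // parts[0] is the host/db portion
      String[] kv = parts[i].split("=", 2);
      if (kv.length == 2 && kv[0].equals("userAgent")) {
        return kv[1];
      }
    }
    return null;
  }

  public static void main(String[] args) {
    System.out.println(userAgent(
        "jdbc:hive2://hs2.example.com:10000/default;userAgent=ReportTool/1.2"));
  }
}
```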





[jira] [Created] (HIVE-20895) Cleanup JdbcColumn Class

2018-11-08 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-20895:
--

 Summary: Cleanup JdbcColumn Class
 Key: HIVE-20895
 URL: https://issues.apache.org/jira/browse/HIVE-20895
 Project: Hive
  Issue Type: Improvement
  Components: JDBC
Affects Versions: 3.1.1, 4.0.0
Reporter: BELUGA BEHR








[jira] [Created] (HIVE-20894) Clean Up JDBC HiveQueryResultSet

2018-11-08 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-20894:
--

 Summary: Clean Up JDBC HiveQueryResultSet
 Key: HIVE-20894
 URL: https://issues.apache.org/jira/browse/HIVE-20894
 Project: Hive
  Issue Type: Improvement
  Components: JDBC
Affects Versions: 4.0.0
Reporter: BELUGA BEHR








[jira] [Created] (HIVE-20849) Review of ConstantPropagateProcFactory

2018-10-31 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-20849:
--

 Summary: Review of ConstantPropagateProcFactory
 Key: HIVE-20849
 URL: https://issues.apache.org/jira/browse/HIVE-20849
 Project: Hive
  Issue Type: Improvement
  Components: Logical Optimizer
Affects Versions: 3.1.0, 4.0.0
Reporter: BELUGA BEHR
 Attachments: HIVE-20849.1.patch

I was looking at this class because it blasts a lot of useless (to an admin) 
information to the logs.  Especially if the table has a lot of columns, I see 
big blocks of logging that are meaningless to me.  I request that the logging 
be toned down to DEBUG level, along with some other improvements to the code.





[jira] [Created] (HIVE-20847) Review of NullScanCode

2018-10-31 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-20847:
--

 Summary: Review of NullScanCode
 Key: HIVE-20847
 URL: https://issues.apache.org/jira/browse/HIVE-20847
 Project: Hive
  Issue Type: Improvement
  Components: Physical Optimizer
Affects Versions: 3.1.0, 4.0.0
Reporter: BELUGA BEHR


What got me looking at this class was the verbosity of some of the logging.  
I would like to request that we lower this logging to DEBUG level, since this 
level of detail means nothing to a cluster admin.

Also... this {{contains}} call would be better applied onto a {{HashSet}} 
instead of an {{ArrayList}}.

{code:java|title=NullScanTaskDispatcher.java}
  private void processAlias(MapWork work, Path path, ArrayList<String>
      aliasesAffected, ArrayList<String> aliases) {
    // the aliases that are allowed to map to a null scan.
    ArrayList<String> allowed = new ArrayList<String>();
    for (String alias : aliasesAffected) {
      if (aliases.contains(alias)) {
        allowed.add(alias);
      }
    }
{code}
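
A minimal sketch of the {{HashSet}} suggestion, assuming the loop only needs membership tests against {{aliases}}.  The class {{AliasFilter}} and the method {{allowedAliases}} are hypothetical names for illustration; this is not the actual patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-alone version of the filtering loop.
class AliasFilter {
  // Returns the entries of aliasesAffected that also appear in aliases,
  // preserving the order of aliasesAffected.
  static List<String> allowedAliases(List<String> aliasesAffected, List<String> aliases) {
    // Build the set once: contains() is then O(1) expected instead of O(n).
    Set<String> lookup = new HashSet<>(aliases);
    List<String> allowed = new ArrayList<>();
    for (String alias : aliasesAffected) {
      if (lookup.contains(alias)) {
        allowed.add(alias);
      }
    }
    return allowed;
  }

  public static void main(String[] args) {
    System.out.println(allowedAliases(Arrays.asList("a", "b", "c"), Arrays.asList("c", "a")));
  }
}
```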






[jira] [Created] (HIVE-20844) Cache Instances of CacheManager in DummyTxnManager

2018-10-31 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-20844:
--

 Summary: Cache Instances of CacheManager in DummyTxnManager
 Key: HIVE-20844
 URL: https://issues.apache.org/jira/browse/HIVE-20844
 Project: Hive
  Issue Type: Improvement
  Components: HiveServer2, Locking
Affects Versions: 3.1.0, 2.3.2, 4.0.0
Reporter: BELUGA BEHR


I noticed that the {{DummyTxnManager}} class instantiates quite a few instances 
of {{ZooKeeperHiveLockManager}}. The ZooKeeper LM creates a connection to ZK 
for each instance created.  It also does some initialization steps that are 
almost always just noise and pressure on ZooKeeper because it has already been 
initialized and the steps are therefore NOOPs.  {{ZooKeeperHiveLockManager}} 
should be a singleton class with one long-lived connection to the ZooKeeper 
service. Perhaps the {{HiveLockManager}} interface could have a 
{{isSingleton()}} method which indicates that the LM should only be 
instantiated once and cached for subsequent sessions.

 
{code:java}
2018-05-14 22:45:30,574  INFO  
org.apache.hadoop.hive.ql.lockmgr.DummyTxnManager: 
[HiveServer2-Background-Pool: Thread-1252389]: Creating lock manager of type 
org.apache.hadoop.hive.ql.lockmgr.zookeeper.ZooKeeperHiveLockManager
2018-05-14 22:51:27,865  INFO  
org.apache.hadoop.hive.ql.lockmgr.DummyTxnManager: 
[HiveServer2-Background-Pool: Thread-1252671]: Creating lock manager of type 
org.apache.hadoop.hive.ql.lockmgr.zookeeper.ZooKeeperHiveLockManager
2018-05-14 22:51:37,552  INFO  
org.apache.hadoop.hive.ql.lockmgr.DummyTxnManager: 
[HiveServer2-Background-Pool: Thread-1252686]: Creating lock manager of type 
org.apache.hadoop.hive.ql.lockmgr.zookeeper.ZooKeeperHiveLockManager
2018-05-14 22:51:49,046  INFO  
org.apache.hadoop.hive.ql.lockmgr.DummyTxnManager: 
[HiveServer2-Background-Pool: Thread-1252736]: Creating lock manager of type 
org.apache.hadoop.hive.ql.lockmgr.zookeeper.ZooKeeperHiveLockManager
2018-05-14 22:51:50,664  INFO  
org.apache.hadoop.hive.ql.lockmgr.DummyTxnManager: 
[HiveServer2-Background-Pool: Thread-1252742]: Creating lock manager of type 
org.apache.hadoop.hive.ql.lockmgr.zookeeper.ZooKeeperHiveLockManager
2018-05-14 23:00:54,314  INFO  
org.apache.hadoop.hive.ql.lockmgr.DummyTxnManager: 
[HiveServer2-Background-Pool: Thread-1253479]: Creating lock manager of type 
org.apache.hadoop.hive.ql.lockmgr.zookeeper.ZooKeeperHiveLockManager
2018-05-14 23:17:26,867  INFO  
org.apache.hadoop.hive.ql.lockmgr.DummyTxnManager: 
[HiveServer2-Background-Pool: Thread-1254180]: Creating lock manager of type 
org.apache.hadoop.hive.ql.lockmgr.zookeeper.ZooKeeperHiveLockManager
2018-05-14 23:24:25,426  INFO  
org.apache.hadoop.hive.ql.lockmgr.DummyTxnManager: 
[HiveServer2-Background-Pool: Thread-1255493]: Creating lock manager of type 
org.apache.hadoop.hive.ql.lockmgr.zookeeper.ZooKeeperHiveLockManager
{code}
{code:java|title=DummyTxnManager.java}
  @Override
  public HiveLockManager getLockManager() throws LockException {
    if (lockMgr == null) {
      boolean supportConcurrency =
          conf.getBoolVar(HiveConf.ConfVars.HIVE_SUPPORT_CONCURRENCY);
      if (supportConcurrency) {
        String lockMgrName =
            conf.getVar(HiveConf.ConfVars.HIVE_LOCK_MANAGER);
        if ((lockMgrName == null) || (lockMgrName.isEmpty())) {
          throw new LockException(ErrorMsg.LOCKMGR_NOT_SPECIFIED.getMsg());
        }

        try {
          // CACHE LM HERE
          LOG.info("Creating lock manager of type " + lockMgrName);
          lockMgr = (HiveLockManager)ReflectionUtils.newInstance(
              conf.getClassByName(lockMgrName), conf);
          lockManagerCtx = new HiveLockManagerCtx(conf);
          lockMgr.setContext(lockManagerCtx);
        } catch (Exception e) {
...
{code}
[https://github.com/apache/hive/blob/f37c5de6c32b9395d1b34fa3c02ed06d1bfbf6eb/ql/src/java/org/apache/hadoop/hive/ql/lockmgr/HiveLockManager.java]

 {code:java|title=ZooKeeperHiveLockManager Initialization}
    try {
      curatorFramework = CuratorFrameworkSingleton.getInstance(conf);
      parent = conf.getVar(HiveConf.ConfVars.HIVE_ZOOKEEPER_NAMESPACE);
      try {
        curatorFramework.create().withMode(CreateMode.PERSISTENT).forPath("/" +
            parent, new byte[0]);
      } catch (Exception e) {
        // ignore if the parent already exists
        if (!(e instanceof KeeperException) || ((KeeperException)e).code() !=
            KeeperException.Code.NODEEXISTS) {
          LOG.warn("Unexpected ZK exception when creating parent node /" +
              parent, e);
        }
      }
{code}
 
https://github.com/apache/hive/blob/f37c5de6c32b9395d1b34fa3c02ed06d1bfbf6eb/ql/src/java/org/apache/hadoop/hive/ql/lockmgr/zookeeper/ZooKeeperHiveLockManager.java#L96-L106
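
One way the caching at the "CACHE LM HERE" marker might look is sketched below.  This is only an assumption about a possible design: {{LockManagerCache}} is a hypothetical class, and the factory function stands in for the reflective instantiation shown above.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

// Hypothetical sketch: caches one lock manager instance per implementation
// class name, so expensive setup (e.g. a ZooKeeper connection) runs once.
class LockManagerCache<T> {
  private final ConcurrentMap<String, T> cache = new ConcurrentHashMap<>();

  // The factory is invoked at most once per class name, even under concurrency.
  T getOrCreate(String lockMgrName, Function<String, T> factory) {
    return cache.computeIfAbsent(lockMgrName, factory);
  }

  public static void main(String[] args) {
    LockManagerCache<Object> cache = new LockManagerCache<>();
    Object first = cache.getOrCreate("ZooKeeperHiveLockManager", name -> new Object());
    Object second = cache.getOrCreate("ZooKeeperHiveLockManager", name -> new Object());
    System.out.println(first == second);
  }
}
```

A real change would also need to decide when a cached manager is closed, which this sketch does not address.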





[jira] [Created] (HIVE-20832) Locate Operation Logs Under Session Directory

2018-10-29 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-20832:
--

 Summary: Locate Operation Logs Under Session Directory
 Key: HIVE-20832
 URL: https://issues.apache.org/jira/browse/HIVE-20832
 Project: Hive
  Issue Type: Improvement
  Components: HiveServer2
Affects Versions: 3.1.0, 4.0.0
Reporter: BELUGA BEHR


If I understand the Hive entity relationship model correctly, each Hive Session 
will have 0 or more Operations associated with it.

Here is the current session setup sequence:

{code}
2018-10-24 21:06:03,771  INFO  org.apache.hadoop.hive.ql.session.SessionState: 
[HiveServer2-Handler-Pool: Thread-510932]: Created local directory: 
/tmp/hive/7650c8ff-2ba5-4bbb-964f-fcecc45e5193
2018-10-24 21:06:03,779  INFO  org.apache.hadoop.hive.ql.session.SessionState: 
[HiveServer2-Handler-Pool: Thread-510932]: Created HDFS directory: 
/tmp/hive/hive/7650c8ff-2ba5-4bbb-964f-fcecc45e5193/_tmp_space.db
2018-10-24 21:06:03,782  INFO  
org.apache.hive.service.cli.session.HiveSessionImpl: [HiveServer2-Handler-Pool: 
Thread-510932]: Operation log session directory is created: 
/var/log/hive/operation_logs/7650c8ff-2ba5-4bbb-964f-fcecc45e5193
{code}

The Hive Session gets its own directory on the local FS and operation logs get 
their own space as well.  Can we please merge so that all of the Operation 
directories are stored within their associated Hive session directory?  
Something like...

{code}
2018-10-24 21:06:03,771  INFO  org.apache.hadoop.hive.ql.session.SessionState: 
[HiveServer2-Handler-Pool: Thread-510932]: Created local directory: 
/tmp/hive/7650c8ff-2ba5-4bbb-964f-fcecc45e5193
2018-10-24 21:06:03,779  INFO  org.apache.hadoop.hive.ql.session.SessionState: 
[HiveServer2-Handler-Pool: Thread-510932]: Created HDFS directory: 
/tmp/hive/hive/7650c8ff-2ba5-4bbb-964f-fcecc45e5193/_tmp_space.db
2018-10-24 21:06:03,782  INFO  
org.apache.hive.service.cli.session.HiveSessionImpl: [HiveServer2-Handler-Pool: 
Thread-510932]: Operation log session directory is created:  
/tmp/hive/7650c8ff-2ba5-4bbb-964f-fcecc45e5193/operation_logs/7650c8ff-2ba5-4bbb-964f-fcecc45e5193
{code}

This would also allow removal of the 
{{hive.server2.logging.operation.log.location}} configuration: one less thing an 
operator needs to worry or know about.  Today one set of logs is in {{/tmp}} and 
the other is in {{/var}}, which is a bit confusing.
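
For illustration, the proposed layout can be derived from the session's local directory alone.  The sketch below assumes the directory name is the session ID, as in the log lines above; {{OperationLogLocation}} is a hypothetical helper, not Hive code.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper showing the proposed directory layout.
class OperationLogLocation {
  // <session dir>/operation_logs/<session id>, where the session ID is
  // taken from the last element of the session directory path.
  static Path operationLogDir(Path sessionLocalDir) {
    return sessionLocalDir.resolve("operation_logs").resolve(sessionLocalDir.getFileName());
  }

  public static void main(String[] args) {
    Path session = Paths.get("/tmp/hive/7650c8ff-2ba5-4bbb-964f-fcecc45e5193");
    System.out.println(operationLogDir(session));
  }
}
```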





[jira] [Created] (HIVE-20831) Add Session ID to Operation Logging

2018-10-29 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-20831:
--

 Summary: Add Session ID to Operation Logging
 Key: HIVE-20831
 URL: https://issues.apache.org/jira/browse/HIVE-20831
 Project: Hive
  Issue Type: Improvement
  Components: HiveServer2
Affects Versions: 3.1.0, 4.0.0
Reporter: BELUGA BEHR


{code:java|title=OperationManager.java}
LOG.info("Adding operation: " + operation.getHandle());
{code}

Please add additional logging to explicitly state which Hive session this 
operation is being added to.

https://github.com/apache/hive/blob/3963c729fabf90009cb67d277d40fe5913936358/service/src/java/org/apache/hive/service/cli/operation/OperationManager.java#L201





[jira] [Created] (HIVE-20797) Print Number of Locks Acquired

2018-10-24 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-20797:
--

 Summary: Print Number of Locks Acquired
 Key: HIVE-20797
 URL: https://issues.apache.org/jira/browse/HIVE-20797
 Project: Hive
  Issue Type: Improvement
  Components: HiveServer2, Locking
Affects Versions: 4.0.0
Reporter: BELUGA BEHR


The number of locks acquired by a query can greatly influence the performance 
and stability of the system, especially for ZK locks.  Please add INFO level 
logging with the number of locks each query obtains.

Log here:
https://github.com/apache/hive/blob/3963c729fabf90009cb67d277d40fe5913936358/ql/src/java/org/apache/hadoop/hive/ql/Driver.java#L1670-L1672

{quote}
A list of acquired locks will be stored in the 
org.apache.hadoop.hive.ql.Context object and can be retrieved via 
org.apache.hadoop.hive.ql.Context#getHiveLocks.
{quote}

https://github.com/apache/hive/blob/758ff449099065a84c46d63f9418201c8a6731b1/ql/src/java/org/apache/hadoop/hive/ql/lockmgr/HiveTxnManager.java#L115-L127





[jira] [Created] (HIVE-20665) Hive Parallel Tasks - Hive Configuration ConcurrentModificationException

2018-10-01 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-20665:
--

 Summary: Hive Parallel Tasks - Hive Configuration 
ConcurrentModificationException
 Key: HIVE-20665
 URL: https://issues.apache.org/jira/browse/HIVE-20665
 Project: Hive
  Issue Type: Bug
  Components: HiveServer2
Affects Versions: 3.1.0, 2.3.2, 4.0.0
Reporter: BELUGA BEHR


When parallel tasks are enabled in Hive, all of the resulting queries share the 
same Hive configuration.  This is problematic as each query will modify the 
same {{HiveConf}} object with things like query ID and query text.  This will 
overwrite each other and cause {{ConcurrentModificationException}} issues.
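
One possible direction, sketched here with {{java.util.Properties}} standing in for {{HiveConf}}, is to give each query its own copy of the configuration.  This is an assumption about a fix, not a description of what Hive does; {{PerQueryConf}} is a hypothetical class and the property keys are used only as examples.

```java
import java.util.Properties;

// Hypothetical sketch: Properties stands in for HiveConf.
class PerQueryConf {
  // Returns an independent copy so that one query's settings (query ID,
  // query text) cannot overwrite or race with another's.
  static Properties copyFor(Properties shared, String queryId) {
    Properties copy = new Properties();
    copy.putAll(shared);
    copy.setProperty("hive.query.id", queryId);
    return copy;
  }

  public static void main(String[] args) {
    Properties shared = new Properties();
    shared.setProperty("hive.exec.parallel", "true");
    Properties q1 = copyFor(shared, "query-1");
    Properties q2 = copyFor(shared, "query-2");
    System.out.println(q1.getProperty("hive.query.id") + " " + q2.getProperty("hive.query.id"));
  }
}
```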







[jira] [Created] (HIVE-20619) Include MultiDelimitSerDe in HiveServer2 By Default

2018-09-21 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-20619:
--

 Summary: Include MultiDelimitSerDe in HiveServer2 By Default
 Key: HIVE-20619
 URL: https://issues.apache.org/jira/browse/HIVE-20619
 Project: Hive
  Issue Type: Improvement
  Components: HiveServer2, Serializers/Deserializers
Affects Versions: 3.0.0, 4.0.0
Reporter: BELUGA BEHR


In [HIVE-20020], the hive-contrib JAR file was removed from the HiveServer2 
classpath.  With this change, the {{MultiDelimitSerDe}} is no longer included.  
This is fine, because {{MultiDelimitSerDe}} was a pain in that environment 
anyway.  It was available to HiveServer2, and therefore would work with a 
limited set of queries (select * from table limit 1) but any other query on 
that table which launched a MapReduce job would fail because the 
hive-contrib JAR file was not sent out with the rest of the Hive JARs for 
MapReduce jobs.

Please bring {{MultiDelimitSerDe}} back into the fold so that it's available to 
users out of the box without having to install the hive-contrib JAR into the 
HiveServer2 auxiliary directory.





[jira] [Created] (HIVE-20484) Disable Block Cache By Default With HBase SerDe

2018-08-28 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-20484:
--

 Summary: Disable Block Cache By Default With HBase SerDe
 Key: HIVE-20484
 URL: https://issues.apache.org/jira/browse/HIVE-20484
 Project: Hive
  Issue Type: Improvement
  Components: HBase Handler
Affects Versions: 1.2.3, 2.4.0, 4.0.0, 3.2.0
Reporter: BELUGA BEHR


{quote}
Scan instances can be set to use the block cache in the RegionServer via the 
setCacheBlocks method. For input Scans to MapReduce jobs, this should be false. 

https://hbase.apache.org/book.html#perf.hbase.client.blockcache
{quote}

However, from the Hive code, we can see that this is not the case.

{code}
public static final String HBASE_SCAN_CACHEBLOCKS = "hbase.scan.cacheblock";

...

String scanCacheBlocks = 
tableProperties.getProperty(HBaseSerDe.HBASE_SCAN_CACHEBLOCKS);
if (scanCacheBlocks != null) {
  jobProperties.put(HBaseSerDe.HBASE_SCAN_CACHEBLOCKS, scanCacheBlocks);
}

...

String scanCacheBlocks = jobConf.get(HBaseSerDe.HBASE_SCAN_CACHEBLOCKS);
if (scanCacheBlocks != null) {
  scan.setCacheBlocks(Boolean.parseBoolean(scanCacheBlocks));
}
{code}

In the Hive code, we can see that if {{hbase.scan.cacheblock}} is not specified 
in the {{SERDEPROPERTIES}} then {{setCacheBlocks}} is not called and the 
default value of the HBase {{Scan}} class is used.

{code:java|title=Scan.java}
  /**
   * Set whether blocks should be cached for this Scan.
   * 
   * This is true by default.  When true, default settings of the table and
   * family are used (this will never override caching blocks if the block
   * cache is disabled for that family or entirely).
   *
   * @param cacheBlocks if false, default settings are overridden and blocks
   * will not be cached
   */
  public Scan setCacheBlocks(boolean cacheBlocks) {
    this.cacheBlocks = cacheBlocks;
    return this;
  }
{code}

Hive is doing full scans of the table with MapReduce/Spark and therefore, 
according to the HBase docs, the default behavior here should be that blocks 
are not cached.  Hive should set this value to "false" by default unless the 
table {{SERDEPROPERTIES}} override this.

{code:sql}
-- Commands for HBase
-- create 'test', 't'

CREATE EXTERNAL TABLE test(value map, row_key string) 
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
"hbase.columns.mapping" = "t:,:key",
"hbase.scan.cacheblock" = "false"
);
{code}
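
The requested default could be expressed as below.  This is a sketch only: {{ScanCacheDefault}} is a hypothetical class, and {{java.util.Properties}} stands in for the table properties that the handler reads.

```java
import java.util.Properties;

// Hypothetical sketch of "false unless the table overrides it".
class ScanCacheDefault {
  static final String HBASE_SCAN_CACHEBLOCKS = "hbase.scan.cacheblock";

  // Block caching stays off unless the table's properties explicitly enable it.
  static boolean cacheBlocks(Properties tableProperties) {
    return Boolean.parseBoolean(
        tableProperties.getProperty(HBASE_SCAN_CACHEBLOCKS, "false"));
  }

  public static void main(String[] args) {
    Properties unset = new Properties();
    Properties enabled = new Properties();
    enabled.setProperty(HBASE_SCAN_CACHEBLOCKS, "true");
    System.out.println(cacheBlocks(unset) + " " + cacheBlocks(enabled));
  }
}
```

The result would then always be passed to {{setCacheBlocks}}, rather than only when the property is present.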





[jira] [Created] (HIVE-20317) Spark Dynamic Partition Pruning - Use Stats to Determine Partition Count

2018-08-06 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-20317:
--

 Summary: Spark Dynamic Partition Pruning - Use Stats to Determine 
Partition Count
 Key: HIVE-20317
 URL: https://issues.apache.org/jira/browse/HIVE-20317
 Project: Hive
  Issue Type: Improvement
  Components: Spark
Affects Versions: 3.1.0, 4.0.0
Reporter: BELUGA BEHR


{code:xml|hive-site.xml}


<property>
  <name>hive.metastore.limit.partition.request</name>
  <value>2</value>
</property>

{code}

{code:sql}
CREATE TABLE partitioned_user(
firstname VARCHAR(64),
lastname  VARCHAR(64)
) PARTITIONED BY (country VARCHAR(64))
STORED AS PARQUET;

CREATE TABLE country(
name VARCHAR(64)
) STORED AS PARQUET;

insert into partitioned_user partition (country='USA') values ("John", "Doe");
insert into partitioned_user partition (country='UK') values ("Sir", "Arthur");
insert into partitioned_user partition (country='FR') values ("Jacque", 
"Martin");

insert into country values ('USA');

set hive.execution.engine=spark;
set hive.spark.dynamic.partition.pruning=true;
explain select * from partitioned_user u where u.country in (select c.name from 
country c);
-- Error while compiling statement: FAILED: SemanticException 
MetaException(message:Number of partitions scanned (=3) on table 
'partitioned_user' exceeds limit (=2). This is controlled on the metastore 
server by hive.metastore.limit.partition.request.)
{code}

The EXPLAIN plan generation fails because there are three partitions involved 
in this query.  However, since Spark DPP is enabled, Hive should be able to use 
table stats to know that the {{country}} table has only one record, and 
therefore that only one partition will need to be scanned, and allow this 
query to execute.





[jira] [Created] (HIVE-20258) Should Synchronize getInstance in ReplChangeManager

2018-07-27 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-20258:
--

 Summary: Should Synchronize getInstance in ReplChangeManager
 Key: HIVE-20258
 URL: https://issues.apache.org/jira/browse/HIVE-20258
 Project: Hive
  Issue Type: Bug
  Components: Metastore
Affects Versions: 3.0.0, 4.0.0
Reporter: BELUGA BEHR


{code:java}
  public static ReplChangeManager getInstance(Configuration conf) throws 
MetaException {
if (instance == null) {
  instance = new ReplChangeManager(conf);
}
return instance;
  }
{code}

This method needs to be synchronized; otherwise two concurrent callers can both 
see a 'null' value and each create their own manager.
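
A minimal sketch of the synchronized variant follows.  {{ChangeManagerSketch}} is a hypothetical stand-in for {{ReplChangeManager}}, and the creation counter exists only so the example can show that a single instance is built.

```java
// Hypothetical stand-in for ReplChangeManager.
class ChangeManagerSketch {
  private static ChangeManagerSketch instance;
  private static int created = 0; // only touched while holding the class lock

  private ChangeManagerSketch() {
    created++;
  }

  // synchronized: the null check and the assignment happen atomically,
  // so concurrent callers cannot each construct their own instance.
  static synchronized ChangeManagerSketch getInstance() {
    if (instance == null) {
      instance = new ChangeManagerSketch();
    }
    return instance;
  }

  static synchronized int instancesCreated() {
    return created;
  }

  public static void main(String[] args) throws InterruptedException {
    Thread[] threads = new Thread[8];
    for (int i = 0; i < threads.length; i++) {
      threads[i] = new Thread(ChangeManagerSketch::getInstance);
      threads[i].start();
    }
    for (Thread t : threads) {
      t.join();
    }
    System.out.println("instances created: " + instancesCreated());
  }
}
```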





[jira] [Created] (HIVE-20257) Improvements to Hive.java

2018-07-27 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-20257:
--

 Summary: Improvements to Hive.java
 Key: HIVE-20257
 URL: https://issues.apache.org/jira/browse/HIVE-20257
 Project: Hive
  Issue Type: Improvement
Affects Versions: 3.0.0, 4.0.0
Reporter: BELUGA BEHR
 Fix For: 4.0.0


Various fixes to {{Hive.java}}
 * Use Log4J parameters in logging statements
 * Fix check styles
 * Make code more concise
 * Remove "log and throw" code
 * Replace calls to deprecated code
 * Remove superfluous calls to {{toString}}

"Log and throw" is considered an anti-pattern.  Only the highest-level catch 
should be providing detailed logging; otherwise we print the same stack trace to 
the logs several times and with different context (for example, when an 
exception is wrapped, we get two different logging events).

 

https://community.oracle.com/docs/DOC-983543#logAndThrow





[jira] [Created] (HIVE-20255) Review LevelOrderWalker.java

2018-07-27 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-20255:
--

 Summary: Review LevelOrderWalker.java
 Key: HIVE-20255
 URL: https://issues.apache.org/jira/browse/HIVE-20255
 Project: Hive
  Issue Type: Improvement
  Components: Query Planning
Affects Versions: 3.0.0, 4.0.0
Reporter: BELUGA BEHR
 Fix For: 3.1.0
 Attachments: HIVE-20255.1.patch

https://github.com/apache/hive/blob/6d890faf22fd1ede3658a5eed097476eab3c67e9/ql/src/java/org/apache/hadoop/hive/ql/lib/LevelOrderWalker.java

* Make code more concise
* Fix some check style issues

{code}
  if (toWalk.get(index).getChildren() != null) {
for(Node child : toWalk.get(index).getChildren()) {
{code}

Actually, the underlying implementation of {{getChildren()}} has to do some 
real work, so do not throw away the work after checking for null.  Simply call 
once and store the results.
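
The "call once and store" change might look like the sketch below.  The {{Node}} interface here is a simplified stand-in for the real one, and {{ChildWalker}}, {{CountingNode}}, and {{collectChildren}} are hypothetical names used only for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of calling getChildren() a single time per node.
class ChildWalker {
  interface Node {
    List<? extends Node> getChildren();
  }

  // Test double that records how often getChildren() is invoked.
  static final class CountingNode implements Node {
    int calls = 0;
    private final List<Node> children;

    CountingNode(List<Node> children) {
      this.children = children;
    }

    @Override
    public List<? extends Node> getChildren() {
      calls++;
      return children;
    }
  }

  static List<Node> collectChildren(List<Node> toWalk, int index) {
    List<Node> next = new ArrayList<>();
    // Single call; the result is reused for both the null check and the loop.
    List<? extends Node> children = toWalk.get(index).getChildren();
    if (children != null) {
      for (Node child : children) {
        next.add(child);
      }
    }
    return next;
  }

  public static void main(String[] args) {
    CountingNode leaf = new CountingNode(null);
    CountingNode parent = new CountingNode(Arrays.<Node>asList(leaf));
    List<Node> toWalk = new ArrayList<>();
    toWalk.add(parent);
    collectChildren(toWalk, 0);
    System.out.println("getChildren() calls: " + parent.calls);
  }
}
```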






[jira] [Created] (HIVE-20239) Do Not Print StackTraces to STDERR in MapJoinProcessor

2018-07-25 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-20239:
--

 Summary: Do Not Print StackTraces to STDERR in MapJoinProcessor
 Key: HIVE-20239
 URL: https://issues.apache.org/jira/browse/HIVE-20239
 Project: Hive
  Issue Type: Improvement
Affects Versions: 3.0.0
Reporter: BELUGA BEHR
 Fix For: 4.0.0


{code:java|title=MapJoinProcessor.java}
} catch (Exception e) {
  e.printStackTrace();
  throw new SemanticException("Failed to generate new mapJoin operator " +
  "by exception : " + e.getMessage());
}
{code}

Please change to... something like...

{code}
} catch (Exception e) {
  throw new SemanticException("Failed to generate new mapJoin operator", e);
}
{code}
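
The point of passing {{e}} as the cause, rather than appending {{e.getMessage()}}, is that the original exception and its stack trace travel with the new one.  A small self-contained sketch, where {{WrapDemo}} and {{SemanticLikeException}} are hypothetical stand-ins for the Hive classes:

```java
// Hypothetical demonstration of exception chaining.
class WrapDemo {
  // Stand-in for SemanticException with a (message, cause) constructor.
  static class SemanticLikeException extends Exception {
    SemanticLikeException(String message, Throwable cause) {
      super(message, cause);
    }
  }

  static void generate() throws SemanticLikeException {
    try {
      throw new IllegalStateException("bad operator tree");
    } catch (Exception e) {
      // No printStackTrace(): the cause is preserved for whoever logs it.
      throw new SemanticLikeException("Failed to generate new mapJoin operator", e);
    }
  }

  public static void main(String[] args) {
    try {
      generate();
    } catch (SemanticLikeException e) {
      System.out.println(e.getMessage() + " <- " + e.getCause());
    }
  }
}
```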





[jira] [Created] (HIVE-20238) Remove stringifyException Method

2018-07-25 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-20238:
--

 Summary: Remove stringifyException Method
 Key: HIVE-20238
 URL: https://issues.apache.org/jira/browse/HIVE-20238
 Project: Hive
  Issue Type: Improvement
Affects Versions: 3.0.0, 4.0.0
Reporter: BELUGA BEHR
 Fix For: 4.0.0


Remove the method {{stringifyException}}

https://github.com/apache/hive/blob/c2940a07cf0891e922672782b73ec22551a7eedd/common/src/java/org/apache/hive/common/util/HiveStringUtils.java#L146

The code already exists in Hadoop proper:

https://github.com/apache/hadoop/blob/2b2399d623539ab68e71a38fa9fbfc9a405bddb8/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/StringUtils.java#L86

And beyond that, I was told on the Hadoop dev mailing list that this function 
should not be used anymore.  Developers should just be using the SLF4J 
facilities and not this home-grown thing.





[jira] [Created] (HIVE-20237) Do Not Print StackTraces to STDERR in HiveMetaStore

2018-07-25 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-20237:
--

 Summary: Do Not Print StackTraces to STDERR in HiveMetaStore
 Key: HIVE-20237
 URL: https://issues.apache.org/jira/browse/HIVE-20237
 Project: Hive
  Issue Type: Improvement
  Components: Standalone Metastore
Affects Versions: 3.0.0, 4.0.0
Reporter: BELUGA BEHR


{code:java|title=HiveMetaStore.java}
} catch (Throwable x) {
  x.printStackTrace();
  HMSHandler.LOG.error(StringUtils.stringifyException(x));
  throw x;
}
{code}

Bad design here of "log and throw".  Don't do it.  Just throw the exception and 
let it be handled, and logged, in one place.  At the very least, we don't need 
the error message to go into the STDERR logs with {{printStackTrace}}; please 
remove it.  And remove the {{stringifyException}} code.  Just use the normal 
logging facilities:

{code}
HMSHandler.LOG.error("Error", x);
{code}





[jira] [Created] (HIVE-20236) Do Not Print StackTraces to STDERR in DDLTask

2018-07-25 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-20236:
--

 Summary: Do Not Print StackTraces to STDERR in DDLTask
 Key: HIVE-20236
 URL: https://issues.apache.org/jira/browse/HIVE-20236
 Project: Hive
  Issue Type: Improvement
Affects Versions: 3.0.0, 4.0.0
Reporter: BELUGA BEHR


{code:java|title=DDLTask.java}
try {
  ret = ToolRunner.run(fss, args.toArray(new String[0]));
} catch (Exception e) {
  e.printStackTrace();
  throw new HiveException(e);
}
{code}

Don't print the stack trace to STDERR; let the error be handled up the call 
stack via the {{HiveException}}.





[jira] [Created] (HIVE-20233) Review Operator.java

2018-07-24 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-20233:
--

 Summary: Review Operator.java
 Key: HIVE-20233
 URL: https://issues.apache.org/jira/browse/HIVE-20233
 Project: Hive
  Issue Type: Improvement
  Components: Query Planning
Affects Versions: 3.0.0, 4.0.0
Reporter: BELUGA BEHR


Various improvements to {{Operator.java}}





[jira] [Created] (HIVE-20224) ReplChangeManager.java Remove Logging Guards

2018-07-23 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-20224:
--

 Summary: ReplChangeManager.java Remove Logging Guards
 Key: HIVE-20224
 URL: https://issues.apache.org/jira/browse/HIVE-20224
 Project: Hive
  Issue Type: Improvement
  Components: Metastore, Standalone Metastore
Affects Versions: 3.0.0, 4.0.0
Reporter: BELUGA BEHR


{code:java|title=metastore-server/src/main/java/org/apache/hadoop/hive/metastore/ReplChangeManager.java}
if (LOG.isDebugEnabled()) {
  LOG.debug("A file with the same content of {} already exists, ignore", 
path.toString());
}
// >
LOG.debug("A file with the same content of {} already exists, ignore", path);


if (LOG.isDebugEnabled()) {
  LOG.debug("Encoded URI: " + encodedUri);
}
// >
LOG.debug("Encoded URI: {}", encodedUri);


if (LOG.isDebugEnabled()) {
  LOG.debug("Move " + file.toString() + " to trash");
}
// >
 LOG.debug("Move {} to trash", file);

... others
{code}





[jira] [Created] (HIVE-20223) SmallTableCache.java SLF4J Parameterized Logging

2018-07-23 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-20223:
--

 Summary: SmallTableCache.java SLF4J Parameterized Logging
 Key: HIVE-20223
 URL: https://issues.apache.org/jira/browse/HIVE-20223
 Project: Hive
  Issue Type: Improvement
  Components: Spark
Affects Versions: 3.0.0, 4.0.0
Reporter: BELUGA BEHR


{code:java|title=org/apache/hadoop/hive/ql/exec/spark/SmallTableCache.java}

if (LOG.isDebugEnabled()) {
LOG.debug("Cleaned up small table cache for query " + queryId);
}

if (tableContainerMap.putIfAbsent(path, tableContainer) == null && 
LOG.isDebugEnabled()) {
  LOG.debug("Cached small table file " + path + " for query " + queryId);
}

if (tableContainer != null && LOG.isDebugEnabled()) {
  LOG.debug("Loaded small table file " + path + " from cache for query " + 
queryId);
}
{code}
 

Remove {{isDebugEnabled}} and replace with parameterized logging.

https://www.slf4j.org/faq.html#logging_performance
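
To show why the guard becomes unnecessary, the sketch below uses a tiny stand-in logger (not SLF4J itself; {{DebugLog}} is hypothetical) that mimics the "{}" placeholder style: when DEBUG is off, the arguments are never turned into strings.

```java
// Hypothetical stand-in for a parameterized logger; not the SLF4J API.
class DebugLog {
  private final boolean debugEnabled;
  final StringBuilder out = new StringBuilder();

  DebugLog(boolean debugEnabled) {
    this.debugEnabled = debugEnabled;
  }

  // Replaces each "{}" with the next argument, but only when DEBUG is on.
  void debug(String format, Object... args) {
    if (!debugEnabled) {
      return; // arguments are never converted to strings
    }
    StringBuilder sb = new StringBuilder();
    int from = 0;
    for (Object arg : args) {
      int at = format.indexOf("{}", from);
      if (at < 0) {
        break;
      }
      sb.append(format, from, at).append(arg);
      from = at + 2;
    }
    sb.append(format.substring(from));
    out.append(sb).append('\n');
  }

  public static void main(String[] args) {
    DebugLog log = new DebugLog(true);
    log.debug("Cached small table file {} for query {}", "/tmp/t1", "q1");
    System.out.print(log.out);
  }
}
```

Note that where the guard is combined with another condition, such as the {{putIfAbsent}} check above, only the {{isDebugEnabled}} part goes away; the other condition still has to be evaluated.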





[jira] [Created] (HIVE-20222) Enable Skew Join Optimization For Outer Joins

2018-07-23 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-20222:
--

 Summary: Enable Skew Join Optimization For Outer Joins
 Key: HIVE-20222
 URL: https://issues.apache.org/jira/browse/HIVE-20222
 Project: Hive
  Issue Type: New Feature
  Components: Logical Optimizer
Affects Versions: 3.0.0, 4.0.0
Reporter: BELUGA BEHR


{code}
// We are trying to adding map joins to handle skew keys, and map join right
// now does not work with outer joins
if (!GenMRSkewJoinProcessor.skewJoinEnabled(parseCtx.getConf(), joinOp))
return;
{code}





[jira] [Created] (HIVE-20190) Report Client IP Address When Opening New Session

2018-07-16 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-20190:
--

 Summary: Report Client IP Address When Opening New Session
 Key: HIVE-20190
 URL: https://issues.apache.org/jira/browse/HIVE-20190
 Project: Hive
  Issue Type: Improvement
  Components: HiveServer2
Affects Versions: 2.3.2, 3.0.0, 4.0.0
Reporter: BELUGA BEHR


https://github.com/apache/hive/blob/e7d1781ec4662e088dcd6ffbe3f866738792ad9b/service/src/java/org/apache/hive/service/cli/thrift/ThriftCLIService.java#L320

There are times when a misbehaving client can knock a HS2 instance offline 
because it opens many simultaneous connections and takes up all of the 
resources.  It would be nice if we could log the source IP address of each 
connection along with the "Client protocol version" information.





[jira] [Created] (HIVE-20171) Make hive.stats.autogather Per Table

2018-07-13 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-20171:
--

 Summary: Make hive.stats.autogather Per Table
 Key: HIVE-20171
 URL: https://issues.apache.org/jira/browse/HIVE-20171
 Project: Hive
  Issue Type: Improvement
  Components: HiveServer2, Standalone Metastore
Affects Versions: 3.0.0, 4.0.0
Reporter: BELUGA BEHR


{{hive.stats.autogather}}
{{hive.stats.column.autogather}}

https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties

These are currently global-level settings.  Make these global settings the 
'default' values for tables, but allow these configurations to be overridden 
by the table's properties.

We recently started seeing tables backed by S3 that are not regularly queried, 
but for which CREATE TABLE is very slow (30+ minutes) because it collects stats 
for all of the files in the table.  We would like to turn this feature off for 
certain S3 tables.
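
The lookup order being requested is simple: table property first, global setting as the fallback.  A sketch, where {{AutogatherSetting}} is a hypothetical helper and a plain map stands in for the table's properties:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of "table property overrides global default".
class AutogatherSetting {
  static final String KEY = "hive.stats.autogather";

  // Uses the table-level value when present, otherwise the global default.
  static boolean isEnabled(Map<String, String> tableProperties, boolean globalDefault) {
    String value = tableProperties.get(KEY);
    return value == null ? globalDefault : Boolean.parseBoolean(value);
  }

  public static void main(String[] args) {
    Map<String, String> s3Table = new HashMap<>();
    s3Table.put(KEY, "false");
    Map<String, String> otherTable = new HashMap<>();
    System.out.println(isEnabled(s3Table, true) + " " + isEnabled(otherTable, true));
  }
}
```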





[jira] [Created] (HIVE-20170) Improve JoinOperator "rows for join key" Logging

2018-07-13 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-20170:
--

 Summary: Improve JoinOperator "rows for join key" Logging
 Key: HIVE-20170
 URL: https://issues.apache.org/jira/browse/HIVE-20170
 Project: Hive
  Issue Type: Improvement
  Components: Operators
Affects Versions: 3.0.0, 4.0.0
Reporter: BELUGA BEHR


{code}
2018-06-25 09:37:33,193 INFO [main] 
org.apache.hadoop.hive.ql.exec.CommonJoinOperator: table 0 has 5728000 rows for 
join key [333, 22]
2018-06-25 09:37:33,901 INFO [main] 
org.apache.hadoop.hive.ql.exec.CommonJoinOperator: table 0 has 5828000 rows for 
join key [333, 22]
2018-06-25 09:37:34,623 INFO [main] 
org.apache.hadoop.hive.ql.exec.CommonJoinOperator: table 0 has 5928000 rows for 
join key [333, 22]
2018-06-25 09:37:35,342 INFO [main] 
org.apache.hadoop.hive.ql.exec.CommonJoinOperator: table 0 has 6028000 rows for 
join key [333, 22]
{code}

https://github.com/apache/hive/blob/6d890faf22fd1ede3658a5eed097476eab3c67e9/ql/src/java/org/apache/hadoop/hive/ql/exec/JoinOperator.java#L120

This logging should use the same facilities as the other Operators for emitting 
this type of log message. [HIVE-10078]  Maybe this feature should be refactored 
into an AbstractOperator class?

Also, it should print a final count for each join value.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HIVE-20169) Print Final Rows Processed in MapOperator

2018-07-13 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-20169:
--

 Summary: Print Final Rows Processed in MapOperator
 Key: HIVE-20169
 URL: https://issues.apache.org/jira/browse/HIVE-20169
 Project: Hive
  Issue Type: Improvement
  Components: Operators
Affects Versions: 3.0.0, 4.0.0
Reporter: BELUGA BEHR


https://github.com/apache/hive/blob/ac6b2a3fb195916e22b2e5f465add2ffbcdc7430/ql/src/java/org/apache/hadoop/hive/ql/exec/MapOperator.java#L573-L582

This class emits a log message every time a certain number of records are 
processed, but it does not print a final count.

Override the {{closeOp}} method in the {{MapOperator}} class to print a final 
log message providing the total number of rows read by this mapper.
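
A rough sketch of the idea, not Hive's actual {{MapOperator}}.  The class 
{{RowCountingOperator}} is hypothetical, and a list of strings stands in for 
the logger so the behavior is easy to inspect:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for an operator that counts rows; not Hive's MapOperator. */
final class RowCountingOperator {
  private final long logEveryNRows;
  private long numRows;
  /** Stands in for a logger so the emitted messages can be inspected. */
  final List<String> messages = new ArrayList<>();

  RowCountingOperator(long logEveryNRows) {
    if (logEveryNRows <= 0) {
      throw new IllegalArgumentException("logEveryNRows must be positive");
    }
    this.logEveryNRows = logEveryNRows;
  }

  /** Counts one row and emits a progress message every logEveryNRows rows. */
  void process() {
    numRows++;
    if (numRows % logEveryNRows == 0) {
      messages.add("records read - " + numRows);
    }
  }

  /** Mirrors the proposal: always report the total when the operator closes. */
  void close() {
    messages.add("final records read - " + numRows);
  }
}
```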



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HIVE-20168) ReduceSinkOperator Logging Hidden

2018-07-13 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-20168:
--

 Summary: ReduceSinkOperator Logging Hidden
 Key: HIVE-20168
 URL: https://issues.apache.org/jira/browse/HIVE-20168
 Project: Hive
  Issue Type: Bug
  Components: Operators
Affects Versions: 3.0.0, 4.0.0
Reporter: BELUGA BEHR


[https://github.com/apache/hive/blob/ac6b2a3fb195916e22b2e5f465add2ffbcdc7430/ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java]

 
{code:java}
if (LOG.isTraceEnabled()) {
  if (numRows == cntr) {
    cntr = logEveryNRows == 0 ? cntr * 10 : numRows + logEveryNRows;
    if (cntr < 0 || numRows < 0) {
      cntr = 0;
      numRows = 1;
    }
    LOG.info(toString() + ": records written - " + numRows);
  }
}

...

if (LOG.isTraceEnabled()) {
  LOG.info(toString() + ": records written - " + numRows);
}
{code}

There are logging guards here checking for TRACE level debugging but the 
logging is actually INFO.  This is important logging for detecting data skew.  
Please change guards to check for INFO... or I would prefer that the guards are 
removed altogether since it's very rare that a service is running with only 
WARN level logging.
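
For illustration only, a guard that checks the same level it logs at.  This 
sketch uses {{java.util.logging}} so that it is self-contained; Hive logs 
through SLF4J, where the equivalent guard would be {{LOG.isInfoEnabled()}}.  
The class {{GuardedLogging}} is hypothetical:

```java
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical example of a logging guard whose level matches the log call. */
final class GuardedLogging {
  private GuardedLogging() {}

  /** Logs at INFO only when INFO is enabled; returns true when the message was emitted. */
  static boolean logRecordsWritten(Logger log, String operator, long numRows) {
    if (log.isLoggable(Level.INFO)) {
      log.info(operator + ": records written - " + numRows);
      return true;
    }
    return false;
  }
}
```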



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HIVE-20166) LazyBinaryStruct Warn Level Logging

2018-07-13 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-20166:
--

 Summary: LazyBinaryStruct Warn Level Logging
 Key: HIVE-20166
 URL: https://issues.apache.org/jira/browse/HIVE-20166
 Project: Hive
  Issue Type: Improvement
  Components: Serializers/Deserializers
Affects Versions: 3.0.0, 4.0.0
Reporter: BELUGA BEHR


https://github.com/apache/hive/blob/6d890faf22fd1ede3658a5eed097476eab3c67e9/serde/src/java/org/apache/hadoop/hive/serde2/lazybinary/LazyBinaryStruct.java#L177-L180

{code}
// Extra bytes at the end?
if (!extraFieldWarned && lastFieldByteEnd < structByteEnd) {
  extraFieldWarned = true;
  LOG.warn("Extra bytes detected at the end of the row! " +
      "Last field end " + lastFieldByteEnd + " and serialize buffer end " +
      structByteEnd + ". " +
      "Ignoring similar problems.");
}

// Missing fields?
if (!missingFieldWarned && lastFieldByteEnd > structByteEnd) {
  missingFieldWarned = true;
  LOG.info("Missing fields! Expected " + fields.length + " fields but " +
      "only got " + fieldId + "! " +
      "Last field end " + lastFieldByteEnd + " and serialize buffer end " +
      structByteEnd + ". " +
      "Ignoring similar problems.");
}
{code}

The first log statement is a 'warn' level logging, the second is an 'info' 
level logging.  Please change the second log to also be a 'warn'.  This seems 
like it could be a problem that the user would like to know about.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HIVE-20163) Simplify StringSubstrColStart Initialization

2018-07-12 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-20163:
--

 Summary: Simplify StringSubstrColStart Initialization
 Key: HIVE-20163
 URL: https://issues.apache.org/jira/browse/HIVE-20163
 Project: Hive
  Issue Type: Improvement
  Components: Query Processor
Affects Versions: 3.0.0, 4.0.0
Reporter: BELUGA BEHR
 Attachments: HIVE-20163.1.patch

* Remove code
* Remove exception handling
* Remove {{printStackTrace}} call



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HIVE-20162) Do Not Print StackTraces to STDERR in AbstractJoinTaskDispatcher

2018-07-12 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-20162:
--

 Summary: Do Not Print StackTraces to STDERR in 
AbstractJoinTaskDispatcher
 Key: HIVE-20162
 URL: https://issues.apache.org/jira/browse/HIVE-20162
 Project: Hive
  Issue Type: Improvement
  Components: Query Planning
Affects Versions: 3.0.0, 4.0.0
Reporter: BELUGA BEHR


https://github.com/apache/hive/blob/6d890faf22fd1ede3658a5eed097476eab3c67e9/ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/AbstractJoinTaskDispatcher.java

{code}
} catch (Exception e) {
  e.printStackTrace();
  throw new SemanticException("Generate Map Join Task Error: " +
      e.getMessage());
}
{code}

Remove the call to {{printStackTrace}} and just throw the error.  If the stack 
trace really is needed (doubtful), then pass it to the {{SemanticException}} 
constructor.
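
A sketch of the second option, assuming the exception type accepts a cause.  
{{SketchSemanticException}} and {{MapJoinTaskSketch}} are hypothetical 
stand-ins and are not Hive classes:

```java
/** Hypothetical stand-in for Hive's SemanticException; assumed to accept a cause. */
class SketchSemanticException extends Exception {
  SketchSemanticException(String message, Throwable cause) {
    super(message, cause);
  }
}

/** Hypothetical dispatcher fragment showing wrap-and-throw without printStackTrace. */
final class MapJoinTaskSketch {
  private MapJoinTaskSketch() {}

  static void generate(Runnable work) throws SketchSemanticException {
    try {
      work.run();
    } catch (Exception e) {
      // No printStackTrace(): the original stack trace travels with the cause.
      throw new SketchSemanticException(
          "Generate Map Join Task Error: " + e.getMessage(), e);
    }
  }
}
```

Whoever catches the exception can then log it once, with the full cause chain, 
through a logger.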



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HIVE-20161) Do Not Print StackTraces to STDERR in ParseDriver

2018-07-12 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-20161:
--

 Summary: Do Not Print StackTraces to STDERR in ParseDriver
 Key: HIVE-20161
 URL: https://issues.apache.org/jira/browse/HIVE-20161
 Project: Hive
  Issue Type: Improvement
  Components: Query Planning
Affects Versions: 3.0.0, 4.0.0
Reporter: BELUGA BEHR


https://github.com/apache/hive/blob/6d890faf22fd1ede3658a5eed097476eab3c67e9/ql/src/java/org/apache/hadoop/hive/ql/exec/JoinOperator.java

{code}
// Do not print stack trace to STDERR - remove this, just throw the HiveException
} catch (Exception e) {
  e.printStackTrace();
  throw new HiveException(e);
}
...
// Do not log and throw.  log *or* throw.  In this case, just throw. Remove logging.
// Remove explicit 'return' call. No need for it.
  try {
    skewJoinKeyContext.endGroup();
  } catch (IOException e) {
    LOG.error(e.getMessage(), e);
    throw new HiveException(e);
  }
  return;
{code}




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HIVE-20160) Do Not Print StackTraces to STDERR in OperatorFactory

2018-07-12 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-20160:
--

 Summary: Do Not Print StackTraces to STDERR in OperatorFactory
 Key: HIVE-20160
 URL: https://issues.apache.org/jira/browse/HIVE-20160
 Project: Hive
  Issue Type: Improvement
  Components: Query Processor
Affects Versions: 3.0.0, 4.0.0
Reporter: BELUGA BEHR


https://github.com/apache/hive/blob/ac6b2a3fb195916e22b2e5f465add2ffbcdc7430/ql/src/java/org/apache/hadoop/hive/ql/exec/OperatorFactory.java#L158

{code}
} catch (Exception e) {
  e.printStackTrace();
  throw new HiveException(...
{code}

Do not print the stack trace.  The error is being wrapped in a HiveException.  
Allow the code catching this exception to print the error to a logger instead 
of dumping it here to STDERR.  There are several instances of this in the class.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HIVE-20159) Do Not Print StackTraces to STDERR in ConditionalResolverSkewJoin

2018-07-12 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-20159:
--

 Summary: Do Not Print StackTraces to STDERR in 
ConditionalResolverSkewJoin
 Key: HIVE-20159
 URL: https://issues.apache.org/jira/browse/HIVE-20159
 Project: Hive
  Issue Type: Improvement
  Components: Query Planning
Affects Versions: 3.0.0, 4.0.0
Reporter: BELUGA BEHR


https://github.com/apache/hive/blob/6d890faf22fd1ede3658a5eed097476eab3c67e9/ql/src/java/org/apache/hadoop/hive/ql/plan/ConditionalResolverSkewJoin.java#L121

{code}
} catch (IOException e) {
  e.printStackTrace();
}
{code}

Introduce an SLF4J logger to this class and print a WARN level log message if 
the {{IOException}} from {{Utilities.listStatusIfExists}} is generated.  I 
suggest WARN because the entire operation doesn't fail if this error happens.  
It continues on its way with the data that it was able to collect.  I'm not 
sure if this is the intended behavior, but for now, an error message in the 
logging would be better.
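
A self-contained sketch of "warn and continue".  It uses {{java.util.logging}} 
({{Level.WARNING}}) in place of SLF4J, and the {{Lister}} interface is a 
hypothetical stand-in for a call such as {{Utilities.listStatusIfExists}}:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical example: keep going with partial results, but log the failure. */
final class PartialListingSketch {
  private static final Logger LOG =
      Logger.getLogger(PartialListingSketch.class.getName());

  private PartialListingSketch() {}

  /** Hypothetical lister; stands in for a file-system listing call. */
  interface Lister {
    List<String> list(String dir) throws IOException;
  }

  /** Lists every directory it can; failures are logged at WARN and skipped. */
  static List<String> listAll(List<String> dirs, Lister lister) {
    List<String> found = new ArrayList<>();
    for (String dir : dirs) {
      try {
        found.addAll(lister.list(dir));
      } catch (IOException e) {
        // WARN rather than ERROR: the operation continues with what it has.
        LOG.log(Level.WARNING, "Could not list " + dir + "; continuing", e);
      }
    }
    return found;
  }
}
```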



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HIVE-20158) Do Not Print StackTraces to STDERR in Base64TextOutputFormat

2018-07-12 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-20158:
--

 Summary: Do Not Print StackTraces to STDERR in 
Base64TextOutputFormat
 Key: HIVE-20158
 URL: https://issues.apache.org/jira/browse/HIVE-20158
 Project: Hive
  Issue Type: Improvement
  Components: Contrib
Affects Versions: 3.0.0, 4.0.0
Reporter: BELUGA BEHR


https://github.com/apache/hive/blob/6d890faf22fd1ede3658a5eed097476eab3c67e9/contrib/src/java/org/apache/hadoop/hive/contrib/fileformat/base64/Base64TextOutputFormat.java

{code}
  try {
    String signatureString = job.get("base64.text.output.format.signature");
    if (signatureString != null) {
      signature = signatureString.getBytes("UTF-8");
    } else {
      signature = new byte[0];
    }
  } catch (UnsupportedEncodingException e) {
    e.printStackTrace();
  }
{code}

The {{UnsupportedEncodingException}} is coming from the {{getBytes}} method 
call.  Instead, use the {{Charset}} version of the method.  It does not throw 
this checked exception, so the 'try' block can simply be removed.  Every JVM 
is required to support UTF-8.

https://docs.oracle.com/javase/7/docs/api/java/lang/String.html#getBytes(java.nio.charset.Charset)
https://docs.oracle.com/javase/7/docs/api/java/nio/charset/StandardCharsets.html#UTF_8
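
A short sketch of the suggested replacement.  The class {{SignatureBytes}} is 
hypothetical; only {{String.getBytes(Charset)}} and 
{{StandardCharsets.UTF_8}} are taken from the JDK:

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper showing the Charset overload, which needs no try/catch. */
final class SignatureBytes {
  private SignatureBytes() {}

  /** Returns the UTF-8 bytes of the signature, or an empty array when it is unset. */
  static byte[] toBytes(String signatureString) {
    return signatureString != null
        ? signatureString.getBytes(StandardCharsets.UTF_8)
        : new byte[0];
  }
}
```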



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HIVE-20157) Do Not Print StackTraces to STDERR

2018-07-12 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-20157:
--

 Summary: Do Not Print StackTraces to STDERR
 Key: HIVE-20157
 URL: https://issues.apache.org/jira/browse/HIVE-20157
 Project: Hive
  Issue Type: Improvement
  Components: Parser
Affects Versions: 3.0.0, 4.0.0
Reporter: BELUGA BEHR


{{org/apache/hadoop/hive/ql/parse/ParseDriver.java}}

{code}
catch (RecognitionException e) {
  e.printStackTrace();
  throw new ParseException(parser.errors);
}
{code}

Do not use {{e.printStackTrace()}} and print to STDERR.  Either remove or 
replace with a debug-level log statement.  I would vote to simply remove.  
There are several occurrences of this pattern in this class.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HIVE-20156) Printing Stacktrace to STDERR

2018-07-12 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-20156:
--

 Summary: Printing Stacktrace to STDERR
 Key: HIVE-20156
 URL: https://issues.apache.org/jira/browse/HIVE-20156
 Project: Hive
  Issue Type: Improvement
  Components: HiveServer2
Affects Versions: 3.0.0, 4.0.0
Reporter: BELUGA BEHR


Class {{org.apache.hadoop.hive.ql.exec.JoinOperator}} has the following code:

{code}
} catch (Exception e) {
  e.printStackTrace();
  throw new HiveException(e);
}
{code}

Do not print the stack trace to STDERR with a call to {{printStackTrace()}}.  
Please remove that line and let the code catching the {{HiveException}} worry 
about printing any messages through a logger.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HIVE-20136) Code Review of ArchiveUtils Class

2018-07-10 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-20136:
--

 Summary: Code Review of ArchiveUtils Class
 Key: HIVE-20136
 URL: https://issues.apache.org/jira/browse/HIVE-20136
 Project: Hive
  Issue Type: Improvement
  Components: HiveServer2
Affects Versions: 3.0.0, 4.0.0
Reporter: BELUGA BEHR


General code review of {{ArchiveUtils}}





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HIVE-20023) Give Indication That Query is Blocked on a Lock in HS2 UI

2018-06-28 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-20023:
--

 Summary: Give Indication That Query is Blocked on a Lock in HS2 UI
 Key: HIVE-20023
 URL: https://issues.apache.org/jira/browse/HIVE-20023
 Project: Hive
  Issue Type: Improvement
  Components: HiveServer2, Web UI
Affects Versions: 3.0.0, 4.0.0
Reporter: BELUGA BEHR


Please provide a clear indication that a Hive query is blocked by a 
table/partition lock in the HS2 WebUI.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HIVE-20015) Populate ArrayList with Constructor

2018-06-27 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-20015:
--

 Summary: Populate ArrayList with Constructor
 Key: HIVE-20015
 URL: https://issues.apache.org/jira/browse/HIVE-20015
 Project: Hive
  Issue Type: Improvement
  Components: HiveServer2
Affects Versions: 3.0.0, 4.0.0
Reporter: BELUGA BEHR


{code:title=MapWork.java}
  public ArrayList<Operator<?>> getWorks() {
    return new ArrayList<Operator<?>>(aliasToWork.values());
  }

  public ArrayList<Path> getPaths() {
    ArrayList<Path> ret=new ArrayList<>();
    ret.addAll(pathToAliases.keySet());
    return ret;
  }
{code}

The {{getWorks}} method correctly uses the constructor to populate the 
{{ArrayList}}.  Please update the {{getPaths}} method to do the same, instead 
of creating an empty list and then allocating a new backing array to 
accommodate the {{addAll}} request.
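
As a generic illustration (the class {{CopyViaConstructor}} is hypothetical), 
the copy constructor sizes the backing array once from the source collection:

```java
import java.util.ArrayList;
import java.util.Collection;

/** Hypothetical helper: populate an ArrayList through its copy constructor. */
final class CopyViaConstructor {
  private CopyViaConstructor() {}

  /** Returns a new ArrayList holding the elements of the source collection. */
  static <T> ArrayList<T> copyOf(Collection<? extends T> source) {
    return new ArrayList<>(source);
  }
}
```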



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HIVE-19977) Improve Output of SHOW PARTITIONS

2018-06-25 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-19977:
--

 Summary: Improve Output of SHOW PARTITIONS
 Key: HIVE-19977
 URL: https://issues.apache.org/jira/browse/HIVE-19977
 Project: Hive
  Issue Type: Improvement
  Components: Hive
Affects Versions: 3.0.0, 4.0.0
Reporter: BELUGA BEHR


{code:sql}
> create table partition_test (a string) partitioned by (b string, c string);
> insert into table partition_test partition (b='z', c='z') VALUES ('top');
> show partitions partition_test;
b=z/c=z
{code}

I think it would be more informative in a table format:


||b||c||
|z|z|

This clearly provides the information and prevents users from doing something 
like...

{code:sql}
> alter table partition_test drop partition ("b=z/c=z");
{code}
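
A rough sketch of how a partition name could be split into columns for a 
tabular display.  The class {{PartitionNameSketch}} is hypothetical, and it 
assumes that names and values contain no escaped '/' or '=' characters, which 
a real implementation would have to handle:

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: splits "b=z/c=z" into ordered column/value pairs. */
final class PartitionNameSketch {
  private PartitionNameSketch() {}

  static Map<String, String> toColumns(String partitionName) {
    Map<String, String> columns = new LinkedHashMap<>();
    for (String part : partitionName.split("/")) {
      int eq = part.indexOf('=');
      if (eq < 0) {
        throw new IllegalArgumentException("Not a key=value pair: " + part);
      }
      // Assumption: no escaping, so the first '=' separates name from value.
      columns.put(part.substring(0, eq), part.substring(eq + 1));
    }
    return columns;
  }
}
```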



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HIVE-19960) Add Column Mapping to JsonSerDe

2018-06-21 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-19960:
--

 Summary: Add Column Mapping to JsonSerDe
 Key: HIVE-19960
 URL: https://issues.apache.org/jira/browse/HIVE-19960
 Project: Hive
  Issue Type: New Feature
  Components: Serializers/Deserializers
Affects Versions: 3.0.0, 4.0.0
Reporter: BELUGA BEHR


Enhance the JSON SerDe to accept a SerDe property that allows for arbitrary 
mapping of JSON parameter names to column names.  This would be very similar to 
Hive HBase integration.

{code}
{"fname":"John","lname":"Doe"}

CREATE TABLE (
first_name string,
last_name string,
...
WITH SERDEPROPERTIES (
"json.columns.mapping" = "fname:first_name,lname:last_name"
);
{code}
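
A minimal sketch of parsing the proposed property value.  Both the property 
name {{json.columns.mapping}} and its {{json_name:column_name}} format are 
only this ticket's proposal, and the class {{JsonColumnMappingSketch}} is 
hypothetical:

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical parser for a "jsonName:columnName,..." mapping string. */
final class JsonColumnMappingSketch {
  private JsonColumnMappingSketch() {}

  /** Returns a map from JSON field name to Hive column name, in declaration order. */
  static Map<String, String> parse(String mapping) {
    Map<String, String> jsonToColumn = new LinkedHashMap<>();
    for (String entry : mapping.split(",")) {
      String[] pair = entry.trim().split(":");
      if (pair.length != 2 || pair[0].isEmpty() || pair[1].isEmpty()) {
        throw new IllegalArgumentException("Invalid mapping entry: '" + entry + "'");
      }
      jsonToColumn.put(pair[0], pair[1]);
    }
    return jsonToColumn;
  }
}
```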



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HIVE-19933) ALTER TABLE DROP PARTITION - Partition Not Found

2018-06-18 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-19933:
--

 Summary: ALTER TABLE DROP PARTITION - Partition Not Found
 Key: HIVE-19933
 URL: https://issues.apache.org/jira/browse/HIVE-19933
 Project: Hive
  Issue Type: Improvement
  Components: HiveServer2
Affects Versions: 1.2.2
Reporter: BELUGA BEHR


{code:sql}
ALTER TABLE web_logsz DROP PARTITION (`date`='xyz')
-- SemanticException [Error 10001]: Table not found web_logsz

ALTER TABLE web_logs DROP PARTITION (`date`='xyz')
-- Success.
{code}

There is no 'xyz' partition for the 'date' column.  To make this more 
consistent, the query should fail if the user tries to drop a partition that 
does not exist.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HIVE-19901) Consolidate JsonSerde Classes

2018-06-14 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-19901:
--

 Summary: Consolidate JsonSerde Classes
 Key: HIVE-19901
 URL: https://issues.apache.org/jira/browse/HIVE-19901
 Project: Hive
  Issue Type: Improvement
Affects Versions: 2.3.2, 3.0.0
Reporter: BELUGA BEHR
 Fix For: 4.0.0


There's a {{serde/src/java/org/apache/hadoop/hive/serde2/JsonSerDe.java}} and a 
{{hcatalog/core/src/main/java/org/apache/hive/hcatalog/data/JsonSerDe.java}} in 
the Hive project.  Please consolidate down into a single JsonSerDe, preferably 
in the 'serde2' package.

Please note, however, that there is only a {{TestJsonSerDe.java}} for the 
_hcatalog_ package, not the _serde2_ package.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HIVE-19900) HiveCLI HoS Performs Invalid Impersonation If User Name Truncated

2018-06-14 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-19900:
--

 Summary: HiveCLI HoS Performs Invalid Impersonation If User Name 
Truncated
 Key: HIVE-19900
 URL: https://issues.apache.org/jira/browse/HIVE-19900
 Project: Hive
  Issue Type: Improvement
  Components: CLI, Spark
Affects Versions: 2.3.2, 3.0.0, 1.2.2, 4.0.0
Reporter: BELUGA BEHR


The HiveCLI HoS code relies on the system property {{user.name}} when 
performing impersonations. The code decides to do an impersonation if the 
{{user.name}} system property does not match the current user who is launching 
the HiveCLI client.  However, when confronted with a long user name, some 
shells and Linux distros may opt to truncate the user name to a certain size to 
conserve screen space. In these scenarios, the current user name does not match 
the {{user.name}} system property and never will, so impersonation will always 
happen, even though the user is trying to impersonate themselves. If YARN is 
not set up to allow the current user to impersonate, YARN will reject the 
request.
{code:java}
if (hiveConf.getBoolVar(HiveConf.ConfVars.HIVE_SERVER2_ENABLE_DOAS)) {
  try {
    String currentUser = Utils.getUGI().getShortUserName();
    // do not do impersonation in CLI mode
    if (!currentUser.equals(System.getProperty("user.name"))) {
      LOG.info("Attempting impersonation of " + currentUser);
      addProxyUser(currentUser);
    }
  } catch (Exception e) {
    String msg = "Cannot obtain username: " + e;
    throw new IllegalStateException(msg, e);
  }
}
{code}
 

[https://github.com/apache/hive/blob/da66386662fbbcbde9501b4a7b27d076bcc790d4/spark-client/src/main/java/org/apache/hive/spark/client/AbstractSparkClient.java#L354-L366]

Assuming a Kerberos-enabled environment, the error message in the YARN Resource 
Manager will be:
{code:java}
my-really-really-long-user-n...@hadoop.domain.com is not allowed to impersonate 
my-really-really-long-user-name
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HIVE-19847) Create Separate getInputSummary Service

2018-06-10 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-19847:
--

 Summary: Create Separate getInputSummary Service
 Key: HIVE-19847
 URL: https://issues.apache.org/jira/browse/HIVE-19847
 Project: Hive
  Issue Type: Improvement
  Components: HiveServer2
Affects Versions: 3.0.0, 4.0.0
Reporter: BELUGA BEHR
 Attachments: HIVE-19847.1.patch

The Hive {{org.apache.hadoop.hive.ql.exec.Utilities.java}} file has taken on a 
life of its own.  We should consider separating out the various components into 
their own classes.  For this ticket, I propose separating out the 
{{getInputSummary}} functionality into its own class.

There are several issues with the current implementation:

# It is 
[synchronized|https://github.com/apache/hive/blob/f27c38ff55902827499192a4f8cf8ed37d6fd967/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java#L2383].
  Only one query can get file input summary at a time.  For a query which deals 
with a large data set with a large number of files, this can block other 
queries for a long period of time.  This is especially painful when most 
queries use a small data set, but a large data set is submitted on occasion.
# For each query, time is spent setting up and tearing down a ThreadPool
# It uses deprecated code

I propose breaking it out into its own class and creating a single thread pool 
that all queries pull from.  In this way, the bottleneck will be on the 
number of available threads, not on a single query, and if a big query is 
running and a small query is also submitted, the smaller query will be able to 
proceed.
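
A sketch of the shape such a class might take, under the assumption that each 
path lookup can be expressed as a {{Callable}}.  {{InputSummaryServiceSketch}} 
is a hypothetical name and does not reflect the attached patch:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical service: one pool shared by all queries, no method-level lock. */
final class InputSummaryServiceSketch implements AutoCloseable {
  private final ExecutorService pool;

  InputSummaryServiceSketch(int threads) {
    // Created once for the life of the service, not once per query.
    this.pool = Executors.newFixedThreadPool(threads);
  }

  /** Sums per-path sizes; each lookup is submitted to the shared pool. */
  long totalLength(List<Callable<Long>> perPathLookups)
      throws InterruptedException, ExecutionException {
    List<Future<Long>> futures = new ArrayList<>();
    for (Callable<Long> lookup : perPathLookups) {
      futures.add(pool.submit(lookup));
    }
    long total = 0;
    for (Future<Long> future : futures) {
      total += future.get();
    }
    return total;
  }

  @Override
  public void close() {
    pool.shutdown();
  }
}
```

Because {{totalLength}} is not synchronized, lookups from a small query are 
queued alongside those of a large query instead of waiting for the large 
query to finish.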



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HIVE-19846) Removed Deprecated Calls From FileUtils-getJarFilesByPath

2018-06-10 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-19846:
--

 Summary: Removed Deprecated Calls From FileUtils-getJarFilesByPath
 Key: HIVE-19846
 URL: https://issues.apache.org/jira/browse/HIVE-19846
 Project: Hive
  Issue Type: Improvement
  Components: HiveServer2
Affects Versions: 3.0.0, 4.0.0
Reporter: BELUGA BEHR
 Attachments: HIVE-19846.1.patch





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HIVE-19845) Create Table Stored By MIME Type

2018-06-09 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-19845:
--

 Summary: Create Table Stored By MIME Type
 Key: HIVE-19845
 URL: https://issues.apache.org/jira/browse/HIVE-19845
 Project: Hive
  Issue Type: Improvement
  Components: HiveServer2, Serializers/Deserializers
Affects Versions: 2.3.2, 3.0.0, 4.0.0
Reporter: BELUGA BEHR


Instead of doing 'STORED BY XXX' perhaps we can standardize on 
[MIME|https://en.wikipedia.org/wiki/MIME] types.

 

 

 
||Mime||Type||
|"text/csv"|Comma Separated Value|
|"text/tab-separated-values"|Tab Separated Value|
|"{{application/json}}"|JSON|
|"text/xml"|XML|
|"{{application/avro}}"|Apache Avro|
|"{{application/parquet}}"|Apache Parquet|
|"{{application/orc}}"|Apache Hive ORC|

 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HIVE-19844) Make CSV SerDe First-Class SerDe

2018-06-09 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-19844:
--

 Summary: Make CSV SerDe First-Class SerDe
 Key: HIVE-19844
 URL: https://issues.apache.org/jira/browse/HIVE-19844
 Project: Hive
  Issue Type: Improvement
  Components: HiveServer2, Serializers/Deserializers
Affects Versions: 2.3.2, 3.0.0, 4.0.0
Reporter: BELUGA BEHR


According to the [Hive SerDe 
Docs|https://cwiki.apache.org/confluence/display/Hive/CSV+Serde], there are 
some extra steps involved in getting the CSV SerDe working with Hive.

{code}
CREATE TABLE my_table(a string, b string, ...)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
   "separatorChar" = "\t",
   "quoteChar" = "'",
   "escapeChar"= "\\"
)  
STORED AS TEXTFILE;
{code}

I would like to propose that we move this SerDe into first-class status:

{{STORED AS TEXT_CSV}}
{{STORED AS TEXT_TSV}}

The user should have to perform no additional steps to use this SerDe.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HIVE-19843) Add Function Calls To Standardized Authorization Model

2018-06-09 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-19843:
--

 Summary: Add Function Calls To Standardized Authorization Model
 Key: HIVE-19843
 URL: https://issues.apache.org/jira/browse/HIVE-19843
 Project: Hive
  Issue Type: Improvement
  Components: Authorization, HiveServer2
Affects Versions: 2.3.2, 3.0.0, 4.0.0
Reporter: BELUGA BEHR


As far as I know, the CREATE FUNCTION statement is protected by the standard 
Hive authorization model.  However, actually using the function is open to 
anyone once it has been created by the Hive administrator.  Please also make 
each call to a function subject to an authorization check to protect the 
functions from unauthorized users.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HIVE-19842) Improve Hive UDF Usability

2018-06-09 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-19842:
--

 Summary: Improve Hive UDF Usability
 Key: HIVE-19842
 URL: https://issues.apache.org/jira/browse/HIVE-19842
 Project: Hive
  Issue Type: Task
  Components: HiveServer2, UDF
Affects Versions: 2.3.2, 3.0.0, 4.0.0
Reporter: BELUGA BEHR


Deploying and Managing Hive UDFs is very cumbersome.  Please improve.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HIVE-19841) Upgrade commons-collections to commons-collections4

2018-06-09 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-19841:
--

 Summary: Upgrade commons-collections to commons-collections4
 Key: HIVE-19841
 URL: https://issues.apache.org/jira/browse/HIVE-19841
 Project: Hive
  Issue Type: Task
Affects Versions: 3.0.0, 4.0.0
Reporter: BELUGA BEHR


Perhaps time to drink the Apache champagne (eat the Apache dog food) and 
upgrade the commons-collections library from 3.x to 4.x.

{code}
3.2.2
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HIVE-19834) Clear Context Map of Paths to ContentSummary

2018-06-08 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-19834:
--

 Summary: Clear Context Map of Paths to ContentSummary
 Key: HIVE-19834
 URL: https://issues.apache.org/jira/browse/HIVE-19834
 Project: Hive
  Issue Type: Improvement
  Components: HiveServer2
Affects Versions: 2.3.2, 3.0.0, 4.0.0
Reporter: BELUGA BEHR
 Attachments: HIVE-19834.1.patch

The {{Context}} class has a {{clear}} method.  When it is called, various 
files are deleted and in-memory maps are cleared.  I would like to propose 
that we clear out an additional in-memory map structure that may contain a lot 
of data so that it can be GC'ed as soon as possible.  This map contains a 
mapping of "File Path"->"Content Summary".  For a query with a large file set, 
this can be quite large.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HIVE-19809) Remove Deprecated Code From Utilities Class

2018-06-05 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-19809:
--

 Summary: Remove Deprecated Code From Utilities Class
 Key: HIVE-19809
 URL: https://issues.apache.org/jira/browse/HIVE-19809
 Project: Hive
  Issue Type: Improvement
Reporter: BELUGA BEHR


{quote}
This can go away once hive moves to support only JDK 7  and can use 
Files.createTempDirectory
{quote}

Remove the {{createTempDir}} method from the {{Utilities}} class.
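
For reference, a minimal sketch of the JDK call that the quoted comment points 
to.  The wrapper class {{TempDirSketch}} and the prefix argument are 
illustrative only:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical wrapper around the JDK 7+ temp-directory API. */
final class TempDirSketch {
  private TempDirSketch() {}

  /** Creates a uniquely named directory under the default temp location. */
  static Path newScratchDir(String prefix) throws IOException {
    return Files.createTempDirectory(prefix);
  }
}
```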



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HIVE-19807) Add Useful Error Message To Table Header/Footer Parsing

2018-06-05 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-19807:
--

 Summary: Add Useful Error Message To Table Header/Footer Parsing
 Key: HIVE-19807
 URL: https://issues.apache.org/jira/browse/HIVE-19807
 Project: Hive
  Issue Type: Improvement
Affects Versions: 2.3.2, 3.0.0
Reporter: BELUGA BEHR


Add some useful logging messages to invalid value parsing of 
{{skip.header.line.count}} and {{skip.footer.line.count}} for better 
troubleshooting.
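
A sketch of what a more helpful failure could look like.  The class 
{{LineCountParser}} is hypothetical, and treating an unset property as 0 is an 
assumption of this example:

```java
/** Hypothetical parser that names the property and the bad value when parsing fails. */
final class LineCountParser {
  private LineCountParser() {}

  static int parse(String property, String rawValue) {
    if (rawValue == null) {
      return 0; // Assumption: an unset property means "skip nothing".
    }
    int count;
    try {
      count = Integer.parseInt(rawValue.trim());
    } catch (NumberFormatException e) {
      throw new IllegalArgumentException(
          "Invalid value '" + rawValue + "' for table property " + property
              + "; expected a non-negative integer", e);
    }
    if (count < 0) {
      throw new IllegalArgumentException(
          "Negative value " + count + " for table property " + property);
    }
    return count;
  }
}
```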



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HIVE-19805) TableScanDesc Use Commons Library

2018-06-05 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-19805:
--

 Summary: TableScanDesc Use Commons Library
 Key: HIVE-19805
 URL: https://issues.apache.org/jira/browse/HIVE-19805
 Project: Hive
  Issue Type: Improvement
  Components: Query Planning
Affects Versions: 4.0.0
Reporter: BELUGA BEHR


Use commons library and remove some code



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HIVE-19480) Implement and Incorporate MAPREDUCE-207

2018-05-09 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-19480:
--

 Summary: Implement and Incorporate MAPREDUCE-207
 Key: HIVE-19480
 URL: https://issues.apache.org/jira/browse/HIVE-19480
 Project: Hive
  Issue Type: New Feature
  Components: HiveServer2
Affects Versions: 3.0.0
Reporter: BELUGA BEHR


* HiveServer2 has the ability to run many MapReduce jobs in parallel.
* Each MapReduce application calculates the job's file splits at the client 
level.
* Result: HiveServer2 loads many file splits at the same time, putting 
pressure on memory.

{quote}"The client running the job calculates the splits for the job by calling 
getSplits(), then sends them to the application master, which uses their 
storage locations to schedule map tasks that will process them on the cluster."
 - "Hadoop: The Definitive Guide"{quote}
MAPREDUCE-207 should address this memory pressure by moving split calculations 
into ApplicationMaster. Spark and Tez already take this approach.

Once MAPREDUCE-207 is completed, leverage the capability in HiveServer2.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HIVE-19444) Create View - Table not found _dummy_table

2018-05-07 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-19444:
--

 Summary: Create View - Table not found _dummy_table
 Key: HIVE-19444
 URL: https://issues.apache.org/jira/browse/HIVE-19444
 Project: Hive
  Issue Type: Bug
  Components: Views
Affects Versions: 1.1.0
Reporter: BELUGA BEHR


{code:sql}
CREATE VIEW view_s1 AS select 1;

-- FAILED: SemanticException 
org.apache.hadoop.hive.ql.metadata.InvalidTableException: Table not found 
_dummy_table
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HIVE-19424) NPE In MetaDataFormatters

2018-05-04 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-19424:
--

 Summary: NPE In MetaDataFormatters
 Key: HIVE-19424
 URL: https://issues.apache.org/jira/browse/HIVE-19424
 Project: Hive
  Issue Type: Bug
  Components: HiveServer2, Metastore, Standalone Metastore
Affects Versions: 3.0.0, 2.4.0
Reporter: BELUGA BEHR


h2. Overview

According to the Hive Schema definition, a table's {{INPUT_FORMAT}} class can 
be set to NULL.  However, there are places in the code where we do not account 
for this NULL value, in particular the {{MetaDataFormatters}} classes 
{{TextMetaDataFormatter}} and {{JsonMetaDataFormatter}}.  In addition, there is 
no debug level logging in the {{MetaDataFormatters}} classes to tell me which 
table in particular is causing the problem.

{code:sql|title=hive-schema-2.2.0.mysql.sql}
CREATE TABLE IF NOT EXISTS `SDS` (
  `SD_ID` bigint(20) NOT NULL,
  `CD_ID` bigint(20) DEFAULT NULL,
  `INPUT_FORMAT` varchar(4000) CHARACTER SET latin1 COLLATE latin1_bin DEFAULT NULL,
  `IS_COMPRESSED` bit(1) NOT NULL,
...
{code}

{code:java|title=TextMetaDataFormatter.java}
// Not checking for a null return from getInputFormatClass
inputFormattCls = par.getInputFormatClass().getName();
outputFormattCls = par.getOutputFormatClass().getName();
{code}
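
One possible null-safe form, shown as a generic sketch.  The helper 
{{NullSafeClassName}} is hypothetical; how a missing input format should be 
rendered in the output is left open here:

```java
/** Hypothetical helper: avoids the NPE when a format class is not set. */
final class NullSafeClassName {
  private NullSafeClassName() {}

  /** Returns the class name, or null when no class is set. */
  static String nameOrNull(Class<?> cls) {
    return cls == null ? null : cls.getName();
  }
}
```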

h2. Reproduction

{code:sql}
-- MySQL Backend
update SDS SET INPUT_FORMAT=NULL WHERE SD_ID=XXX;
{code}

{code}
// Hive
SHOW TABLE EXTENDED FROM default LIKE '*';

// HS2 Logs
[HiveServer2-Background-Pool: Thread-464]: Error running hive query: 
org.apache.hive.service.cli.HiveSQLException: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Exception while processing show table status
    at org.apache.hive.service.cli.operation.Operation.toSQLException(Operation.java:400)
    at org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:238)
    at org.apache.hive.service.cli.operation.SQLOperation.access$300(SQLOperation.java:89)
    at org.apache.hive.service.cli.operation.SQLOperation$3$1.run(SQLOperation.java:301)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
    at org.apache.hive.service.cli.operation.SQLOperation$3.run(SQLOperation.java:314)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Exception while processing show table status
    at org.apache.hadoop.hive.ql.exec.DDLTask.showTableStatus(DDLTask.java:3025)
    at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:405)
    at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:214)
    at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:99)
    at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:2052)
    at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1748)
    at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1501)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1285)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1280)
    at org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:236)
    ... 11 more
Caused by: java.lang.NullPointerException
    at org.apache.hadoop.hive.ql.metadata.formatting.TextMetaDataFormatter.showTableStatus(TextMetaDataFormatter.java:202)
    at org.apache.hadoop.hive.ql.exec.DDLTask.showTableStatus(DDLTask.java:3020)
    ... 20 more
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HIVE-19404) Revise DDL Task Result Logging

2018-05-03 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-19404:
--

 Summary: Revise DDL Task Result Logging
 Key: HIVE-19404
 URL: https://issues.apache.org/jira/browse/HIVE-19404
 Project: Hive
  Issue Type: Improvement
  Components: HiveServer2
Affects Versions: 3.0.0, 2.4.0
Reporter: BELUGA BEHR


There is some logging in {{DDLTask}} that can be made better:

{code}
2018-05-03 03:08:32,524 INFO  hive.ql.exec.DDLTask: 
[HiveServer2-Background-Pool: Thread-101980]: results : 706
{code}

This logging should be demoted to _debug_ level, be given additional context, 
or both.  For example:

{code}
2018-05-03 03:08:32,524 INFO  hive.ql.exec.DDLTask: 
[HiveServer2-Background-Pool: Thread-101980]: Found 706 tables that match the 
SHOW DATABASE statement
{code}
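
A rough sketch of how such a message could be built and emitted at debug level. It uses {{java.util.logging}} only to stay self-contained (Hive itself logs through a different facade), and the class {{DdlResultLogging}} and both method names are hypothetical.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class DdlResultLogging {
  private static final Logger LOG =
      Logger.getLogger(DdlResultLogging.class.getName());

  // Hypothetical helper: states what was counted and for which statement,
  // instead of a bare "results : N".
  public static String describeResults(int count, String statement) {
    return "Found " + count + " tables that match the " + statement
        + " statement";
  }

  // Emits the message at FINE, which is java.util.logging's closest
  // equivalent to a debug level.
  public static void logResults(int count, String statement) {
    if (LOG.isLoggable(Level.FINE)) {
      LOG.fine(describeResults(count, statement));
    }
  }

  public static void main(String[] args) {
    System.out.println(describeResults(706, "SHOW TABLE EXTENDED"));
  }
}
```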





[jira] [Created] (HIVE-19403) Demote 'Pattern' Logging

2018-05-03 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-19403:
--

 Summary: Demote 'Pattern' Logging
 Key: HIVE-19403
 URL: https://issues.apache.org/jira/browse/HIVE-19403
 Project: Hive
  Issue Type: Improvement
  Components: HiveServer2
Affects Versions: 3.0.0, 2.4.0
Reporter: BELUGA BEHR


In the {{DDLTask}} class, there is some logging that is not helpful to a 
cluster admin and should be demoted to _debug_ level logging.  In fact, in one 
place in the code, it already is.

{code}
LOG.info("pattern: {}", showDatabasesDesc.getPattern());
LOG.debug("pattern: {}", pattern);
LOG.info("pattern: {}", showFuncs.getPattern());
LOG.info("pattern: {}", showTblStatus.getPattern());
{code}

Here is an example.  As an admin, I can already see the pattern in the 
statement logged by the {{Driver}}, so the extra line provides no additional 
context.

{code:java|title=Example}
2018-05-03 03:08:26,354 INFO  org.apache.hadoop.hive.ql.Driver: 
[HiveServer2-Background-Pool: Thread-101980]: Executing 
command(queryId=hive_20180503030808_e53c26ef-2280-4eca-929b-668503105e2e): SHOW 
TABLE EXTENDED FROM my_db LIKE '*'
2018-05-03 03:08:26,355 INFO  hive.ql.exec.DDLTask: 
[HiveServer2-Background-Pool: Thread-101980]: pattern: *
{code}





[jira] [Created] (HIVE-19380) Hive Lock Error Too Verbose

2018-05-01 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-19380:
--

 Summary: Hive Lock Error Too Verbose
 Key: HIVE-19380
 URL: https://issues.apache.org/jira/browse/HIVE-19380
 Project: Hive
  Issue Type: Improvement
  Components: HiveServer2, Locking
Affects Versions: 3.0.0, 2.4.0
Reporter: BELUGA BEHR


When a query fails to acquire a lock on a table, the following error message 
appears in beeline.  The error message does not need a stack trace, and it would 
be helpful to know how long the query waited before it gave up.  It should also 
state exactly which lock it was unable to obtain.

{code:java}
ERROR : FAILED: Error in acquiring locks: Locks on the underlying objects cannot be acquired. retry after some time
org.apache.hadoop.hive.ql.lockmgr.LockException: Locks on the underlying objects cannot be acquired. retry after some time
    at org.apache.hadoop.hive.ql.lockmgr.DummyTxnManager.acquireLocks(DummyTxnManager.java:171)
    at org.apache.hadoop.hive.ql.Driver.acquireLocksAndOpenTxn(Driver.java:1205)
    at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1489)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1285)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1280)
    at org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:236)
    at org.apache.hive.service.cli.operation.SQLOperation.access$300(SQLOperation.java:89)
    at org.apache.hive.service.cli.operation.SQLOperation$3$1.run(SQLOperation.java:301)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
    at org.apache.hive.service.cli.operation.SQLOperation$3.run(SQLOperation.java:314)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
 {code}
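
As an illustration of the requested message, here is a small sketch that reports the lock name, the number of attempts, and the time waited, without a stack trace. The class {{LockFailureMessage}}, its method, and the example lock name are hypothetical and not taken from the Hive code base.

```java
import java.util.concurrent.TimeUnit;

public class LockFailureMessage {
  // Hypothetical formatter: names the lock that could not be obtained and
  // reports how many attempts were made and how long the query waited.
  public static String format(String lockName, int attempts, long waitedMillis) {
    long waitedSeconds = TimeUnit.MILLISECONDS.toSeconds(waitedMillis);
    return "Unable to acquire lock '" + lockName + "' after " + attempts
        + " attempts (waited " + waitedSeconds + " s). Retry after some time.";
  }

  public static void main(String[] args) {
    // Example values only: 100 attempts with 60 s between them.
    System.out.println(format("default/my_table", 100, 6_000_000L));
  }
}
```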





[jira] [Created] (HIVE-19379) Docs Are Incorrect for "hive.lock.sleep.between.retries"

2018-05-01 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-19379:
--

 Summary: Docs Are Incorrect for "hive.lock.sleep.between.retries"
 Key: HIVE-19379
 URL: https://issues.apache.org/jira/browse/HIVE-19379
 Project: Hive
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 3.0.0, 2.4.0
Reporter: BELUGA BEHR


https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-Locking

{quote}
Default Value: 60
Added In: Hive 0.7.0 with HIVE-1293
The sleep time (in seconds) between various retries.
{quote}

Actually, it is simply the _sleep time_ (the description should not specify a 
unit), and the default value is _60s_.  If no unit is specified, a value of 60 
would be read as 60 milliseconds.

{code:java|title=ZooKeeperHiveLockManager.java}
  public void refresh() {
    HiveConf conf = ctx.getConf();
    sleepTime = conf.getTimeVar(
        HiveConf.ConfVars.HIVE_LOCK_SLEEP_BETWEEN_RETRIES,
        TimeUnit.MILLISECONDS);
    numRetriesForLock = conf.getIntVar(HiveConf.ConfVars.HIVE_LOCK_NUMRETRIES);
    numRetriesForUnLock =
        conf.getIntVar(HiveConf.ConfVars.HIVE_UNLOCK_NUMRETRIES);
  }
 {code}
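
To show why the unit matters, here is a simplified sketch of parsing a time value with an optional suffix. It is not Hive's {{HiveConf}} parser: the class {{TimeValueSketch}} is hypothetical, it understands only the {{ms}} and {{s}} suffixes, and the unit applied to a bare number is passed in by the caller as an assumption.

```java
import java.util.Locale;
import java.util.concurrent.TimeUnit;

public class TimeValueSketch {
  // Simplified illustration, NOT HiveConf's implementation: supports only
  // the suffixes "ms" and "s"; a bare number falls back to defaultUnit.
  public static long toMillis(String value, TimeUnit defaultUnit) {
    String v = value.trim().toLowerCase(Locale.ROOT);
    if (v.endsWith("ms")) {
      return Long.parseLong(v.substring(0, v.length() - 2).trim());
    }
    if (v.endsWith("s")) {
      long seconds = Long.parseLong(v.substring(0, v.length() - 1).trim());
      return TimeUnit.SECONDS.toMillis(seconds);
    }
    return defaultUnit.toMillis(Long.parseLong(v));
  }

  public static void main(String[] args) {
    // With an explicit suffix the result does not depend on the default unit.
    System.out.println(toMillis("60s", TimeUnit.MILLISECONDS));
    // Without a suffix the result depends entirely on the assumed default.
    System.out.println(toMillis("60", TimeUnit.MILLISECONDS));
    System.out.println(toMillis("60", TimeUnit.SECONDS));
  }
}
```

Under these assumptions, "60s" and a bare "60" differ by a factor of 1000 when the default unit is milliseconds, which is why the documentation should mention the suffix.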





[jira] [Created] (HIVE-19378) "hive.lock.numretries" Is Misleading

2018-05-01 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-19378:
--

 Summary: "hive.lock.numretries" Is Misleading
 Key: HIVE-19378
 URL: https://issues.apache.org/jira/browse/HIVE-19378
 Project: Hive
  Issue Type: Improvement
  Components: HiveServer2
Affects Versions: 3.0.0, 2.4.0
Reporter: BELUGA BEHR


Configuration 'hive.lock.numretries' is confusing.  It's not actually a 'retry' 
count; it's the total number of attempts:

 

{code:java|title=ZooKeeperHiveLockManager.java}
do {
  lastException = null;
  tryNum++;
  try {
    if (tryNum > 1) {
      Thread.sleep(sleepTime);
      prepareRetry();
    }
    ret = lockPrimitive(key, mode, keepAlive, parentCreated,
        conflictingLocks);
    ...
} while (tryNum < numRetriesForLock);
{code}

From this code you can see that {{tryNum}} is 1 on the first pass through the 
loop, so if num*retries* is configured as 1, there is only one attempt in 
total.  With a *retry* value of 1, I would expect one initial attempt and one 
additional retry.  Please change to:

{code}
while (tryNum <= numRetriesForLock);
{code}
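
The difference between the two loop conditions can be checked with a small stand-alone simulation in which every lock attempt fails. The class {{RetryCount}} is a hypothetical illustration that mirrors only the shape of the loop quoted above, not the Hive code itself.

```java
public class RetryCount {
  // Counts attempts made when every attempt fails.
  // inclusive=false models "tryNum < numRetries" (current behaviour);
  // inclusive=true models "tryNum <= numRetries" (proposed change).
  public static int attempts(int numRetries, boolean inclusive) {
    int tryNum = 0;
    do {
      tryNum++;
      // A failed lock attempt would happen here.
    } while (inclusive ? tryNum <= numRetries : tryNum < numRetries);
    return tryNum;
  }

  public static void main(String[] args) {
    System.out.println(attempts(1, false)); // current condition
    System.out.println(attempts(1, true));  // proposed condition
  }
}
```

With a setting of 1, the current condition gives a single attempt, while the proposed condition gives the initial attempt plus one retry.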





[jira] [Created] (HIVE-19302) Logging Too Verbose For TableNotFound

2018-04-25 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-19302:
--

 Summary: Logging Too Verbose For TableNotFound
 Key: HIVE-19302
 URL: https://issues.apache.org/jira/browse/HIVE-19302
 Project: Hive
  Issue Type: Improvement
  Components: HiveServer2
Affects Versions: 2.2.0, 3.0.0
Reporter: BELUGA BEHR
 Attachments: table_not_found_cdh6.txt

There is far too much logging when a user submits a query against a table that 
does not exist.  In an ad-hoc setting, it is quite normal for a user to 
mistype a table name.  Yet, from the perspective of the Hive administrator, 
the volume and severity of the logging suggest that a major issue occurred.  
Please change the logging to INFO level, and do not present a stack trace, for 
such a trivial error.

 

See the attached file for a sample of what logging a single "table not found" 
query generates.




