[jira] [Created] (HIVE-23180) Remove unused variables from tez build dag

2020-04-10 Thread Mustafa Iman (Jira)
Mustafa Iman created HIVE-23180:
---

 Summary: Remove unused variables from tez build dag
 Key: HIVE-23180
 URL: https://issues.apache.org/jira/browse/HIVE-23180
 Project: Hive
  Issue Type: Improvement
Reporter: Mustafa Iman
Assignee: Mustafa Iman
 Attachments: HIVE-23180.patch

This is a simple refactoring of the TezTask build-DAG functionality. Unused 
options are removed from function calls, and some variables are given 
meaningful names. It also gets rid of an unnecessary filesystem creation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-23179) Show create table is not showing SerDe Properties in unicode

2020-04-10 Thread Naresh P R (Jira)
Naresh P R created HIVE-23179:
-

 Summary: Show create table is not showing SerDe Properties in 
unicode
 Key: HIVE-23179
 URL: https://issues.apache.org/jira/browse/HIVE-23179
 Project: Hive
  Issue Type: Bug
Reporter: Naresh P R
Assignee: Naresh P R


Tables with special-character delimiters are not shown correctly in SHOW CREATE TABLE output.

e.g., 
create external table test(age int, name string) ROW FORMAT DELIMITED FIELDS 
TERMINATED BY '\u0001' stored as textfile;
Show create output:
+----------------------------------------------------------------+
|                         createtab_stmt                         |
+----------------------------------------------------------------+
| CREATE EXTERNAL TABLE `test`(                                  |
|   `age` int,                                                   |
|   `name` string)                                               |
| ROW FORMAT SERDE                                               |
|   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'         |
| WITH SERDEPROPERTIES (                                         |
|   'field.delim'='',                                            |
|   'serialization.format'='')                                   |
| STORED AS INPUTFORMAT                                          |
|   'org.apache.hadoop.mapred.TextInputFormat'                   |
| OUTPUTFORMAT                                                   |
|   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' |
| LOCATION                                                       |
|   'hdfs://abcd:8020/warehouse/tablespace/external/hive/testca' |
| TBLPROPERTIES (                                                |
|   'bucketing_version'='2',                                     |
|   'discover.partitions'='true',                                |
|   'transient_lastDdlTime'='1577162310')                        |
+----------------------------------------------------------------+
Some client consoles cannot display ^A (Ctrl + A) properly. It would be better 
to show the delimiter in Unicode-escaped form, as desc formatted does:
| Storage Desc Params:  | NULL                  | NULL   |
|                       | field.delim           | \u0001 |
|                       | serialization.format  | \u0001 |
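
A minimal sketch of the kind of escaping being requested, assuming that control characters in SerDe property values should be printed as a backslash-u escape. The class and method names are hypothetical and this is not Hive's actual implementation:

```java
// Hypothetical helper, for illustration only; not taken from the Hive code base.
final class DelimiterEscaper {

    // Replaces ASCII control characters with a 4-digit hex escape so that
    // consoles which cannot render them still show an unambiguous value.
    static String escapeNonPrintable(String value) {
        StringBuilder out = new StringBuilder(value.length());
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (c < 0x20 || c == 0x7f) {
                out.append(String.format("\\u%04x", (int) c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Builds the property line the way it might appear in show create output.
        System.out.println("'field.delim'='" + escapeNonPrintable("\u0001") + "',");
    }
}
```

Printable delimiters such as a comma pass through unchanged, so only values that a console may fail to render are rewritten.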





[jira] [Created] (HIVE-23178) Add Tez Total Order Partitioner

2020-04-10 Thread Roohi Syeda (Jira)
Roohi Syeda created HIVE-23178:
--

 Summary: Add Tez Total Order Partitioner
 Key: HIVE-23178
 URL: https://issues.apache.org/jira/browse/HIVE-23178
 Project: Hive
  Issue Type: Bug
Reporter: Roohi Syeda








Re: Many ANTLR Tokens

2020-04-10 Thread David Mollitor
Hello Gang,

I've been investigating this issue.

This should no longer be an issue with ANTLR4 (and ANTLR3 stopped seeing
development circa 2014).  However, ANTLR4 is not fully
backwards compatible with ANTLR3.  In particular, ANTLR4 changes how it
approaches "rewrite rules" operations.  Hive's ANTLR3 grammar heavily uses these
operations and therefore it is quite a lift to get this upgrade done.  Not
to mention, as I work on fixing some of these things, we may want to
backport to Hive 3.x branches.

https://issues.apache.org/jira/browse/HIVE-23177


I also looked at possibly writing a tool that will break up the Java file
that ANTLR3 produces into smaller pieces, but this would require that I
create another Maven module in Hive just for this purpose.  It would be a
custom Maven plugin that performs this action of reading in the source code
and then chopping it up a bit to make the compiler happy.  This is
possible, but adds quite a bit of overhead to the project (yet another
Maven module to manage).
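
For context, javac reports "code too large" when a single method, including a class's static initializer, exceeds the JVM limit of 64 KB of bytecode. A rough sketch of what chopping up a large array initializer could look like follows. The class name, the helper names, and the KW_EXAMPLE / TOK_EXAMPLE entries are hypothetical, and this is not the output of any existing tool:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration: the token table is built from several small
// helper methods instead of one huge array literal in the static initializer.
final class ChunkedTokenNames {

    // Each helper holds one slice of the table, so its own bytecode stays small.
    private static String[] chunk0() {
        return new String[] { "KW_TRUNCATE", "TOK_TRUNCATETABLE" };
    }

    private static String[] chunk1() {
        // Placeholder names, not real Hive tokens.
        return new String[] { "KW_EXAMPLE", "TOK_EXAMPLE" };
    }

    // The static initializer now only contains two calls and a concatenation.
    static final String[] TOKEN_NAMES = concat(chunk0(), chunk1());

    private static String[] concat(String[]... chunks) {
        List<String> all = new ArrayList<>();
        for (String[] chunk : chunks) {
            all.addAll(Arrays.asList(chunk));
        }
        return all.toArray(new String[0]);
    }
}
```

The resulting array has the same contents and order as a single literal would, so code that indexes into it would not need to change under this assumption.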


We can also just remove the duplicate token names.  I understand that its
design grants flexibility, but SQL is a pretty tight standard at this point
and I don't see Hive leveraging this in any meaningful way.  This would be
the path of least resistance.

Thoughts?

Thanks.

On Thu, Apr 9, 2020 at 6:36 PM David Mollitor  wrote:

> Hello Gang,
>
> I am investigating HIVE-23172 and I am having a problem addressing this
> because I am getting the following error from compiling the grammar:
>
> hive-parser: Compilation failure
> [ERROR]
> /home/apache/hive/hive/parser/target/generated-sources/antlr3/org/apache/hadoop/hive/ql/parse/HiveParser.java:[40,38]
> code too large
>
> I traced it down to the fact that there are too many tokens defined.  In
> HiveParser.java, it has the following:
>
>  public static final String[] tokenNames = new String[] { ... };
>
> That list is so long, it's breaking Java compilation.  Someone else came
> across this awhile ago: HIVE-15577.
>
> I observed that the parser defines two tokens for most elements, for
> example:
>
> KW_TRUNCATE / TOK_TRUNCATETABLE
>
> What is the value of having both?  Can we consolidate this down to one and
> conserve some space?  I would propose just using TOK_TRUNCATE and getting rid
> of the KW version.
>
> Does anyone have any insight into why things are set up the way they are?
>


[jira] [Created] (HIVE-23177) Upgrade to ANTLR4

2020-04-10 Thread David Mollitor (Jira)
David Mollitor created HIVE-23177:
-

 Summary: Upgrade to ANTLR4
 Key: HIVE-23177
 URL: https://issues.apache.org/jira/browse/HIVE-23177
 Project: Hive
  Issue Type: Improvement
Reporter: David Mollitor


Upgrade Hive to ANTLR4; ANTLR3 lost support many moons ago.

This is going to be a big lift.  Many of the Hive rules use the "rule rewrite" 
feature, which no longer exists in ANTLR4 and must be completely 
re-implemented:

https://stackoverflow.com/questions/14565794/antlr-4-tree-inject-rewrite-operator





[jira] [Created] (HIVE-23176) Remove REGEX Column Feature

2020-04-10 Thread David Mollitor (Jira)
David Mollitor created HIVE-23176:
-

 Summary: Remove REGEX Column Feature
 Key: HIVE-23176
 URL: https://issues.apache.org/jira/browse/HIVE-23176
 Project: Hive
  Issue Type: Improvement
Reporter: David Mollitor


Remove the Hive feature: REGEX Column.

 

Hive has this interesting feature for using a REGEX to SELECT multiple columns.  
This needs to go.  It is not SQL standard and, as currently implemented, it is 
impossible to determine if a column identifier is a REGEX or the actual name of 
the column.  If a column name is enclosed in backticks, then any UTF-8 
character is valid in the column name.
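
The ambiguity can be illustrated with a small sketch using java.util.regex. It assumes a table with the hypothetical columns col.1, colx1 and other, and does not reproduce Hive's actual resolution logic:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration of why a quoted identifier is ambiguous when it
// may be either a regular expression or a literal column name.
final class RegexColumnAmbiguity {

    // Interprets the identifier as a regex and returns every matching column.
    static List<String> asRegex(String identifier, List<String> columns) {
        Pattern pattern = Pattern.compile(identifier);
        List<String> matched = new ArrayList<>();
        for (String column : columns) {
            if (pattern.matcher(column).matches()) {
                matched.add(column);
            }
        }
        return matched;
    }

    // Interprets the identifier as a literal name and returns exact matches only.
    static List<String> asLiteral(String identifier, List<String> columns) {
        List<String> matched = new ArrayList<>();
        for (String column : columns) {
            if (column.equals(identifier)) {
                matched.add(column);
            }
        }
        return matched;
    }
}
```

With the identifier col.1, the regex reading selects both col.1 and colx1, while the literal reading selects only col.1, so the same query text can legitimately mean two different things.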

 

[https://dev.mysql.com/doc/refman/8.0/en/identifiers.html]

[https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Select]


