[jira] [Created] (HIVE-21072) NPE when running partitioned CTAS statements

2018-12-26 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created HIVE-21072:
--

 Summary: NPE when running partitioned CTAS statements
 Key: HIVE-21072
 URL: https://issues.apache.org/jira/browse/HIVE-21072
 Project: Hive
  Issue Type: Bug
Reporter: Liang-Chi Hsieh


HIVE-20241 adds support for partitioned CTAS statements:
{code:sql}
CREATE TABLE partition_ctas_1 PARTITIONED BY (key) AS
SELECT value, key FROM src where key > 200 and key < 300;{code}
 
However, I tried this feature on a checkout of the latest branch-3 and 
encountered an NPE:
{code:java}
hive> CREATE TABLE t PARTITIONED BY (part) AS SELECT 1 as id, "a" as part;
FAILED: NullPointerException null
{code}

I also ran the query test partition_ctas.q. The test passes when using 
TestMiniLlapLocalCliDriver, but when I run it manually with TestCliDriver, 
it also throws a NullPointerException:
{code}
2018-12-25T05:58:22,221 ERROR [a96009a7-3dda-4d95-9536-e2e16d976856 main] ql.Driver: FAILED: NullPointerException null
java.lang.NullPointerException
at org.apache.hadoop.hive.ql.optimizer.GenMapRedUtils.usePartitionColumns(GenMapRedUtils.java:2103)
at org.apache.hadoop.hive.ql.optimizer.GenMapRedUtils.createMRWorkForMergingFiles(GenMapRedUtils.java:1323)
at org.apache.hadoop.hive.ql.optimizer.GenMRFileSink1.process(GenMRFileSink1.java:113)
at org.apache.hadoop.hive.ql.lib.DefaultRuleDispatcher.dispatch(DefaultRuleDispatcher.java:90)
at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatchAndReturn(DefaultGraphWalker.java:105)
at org.apache.hadoop.hive.ql.parse.GenMapRedWalker.walk(GenMapRedWalker.java:54)
at org.apache.hadoop.hive.ql.parse.GenMapRedWalker.walk(GenMapRedWalker.java:65)
at org.apache.hadoop.hive.ql.parse.GenMapRedWalker.walk(GenMapRedWalker.java:65)
at org.apache.hadoop.hive.ql.parse.GenMapRedWalker.walk(GenMapRedWalker.java:65)
at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.startWalking(DefaultGraphWalker.java:120)
at org.apache.hadoop.hive.ql.parse.MapReduceCompiler.generateTaskTree(MapReduceCompiler.java:323)
at org.apache.hadoop.hive.ql.parse.TaskCompiler.compile(TaskCompiler.java:244)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:12503)
at org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:357)
at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:285)
at org.apache.hadoop.hive.ql.parse.ExplainSemanticAnalyzer.analyzeInternal(ExplainSemanticAnalyzer.java:166)
at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:285)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:664)
at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1854)
at org.apache.hadoop.hive.ql.Driver.compileAndRespond(Driver.java:1801)
at org.apache.hadoop.hive.ql.Driver.compileAndRespond(Driver.java:1796)
at org.apache.hadoop.hive.ql.reexec.ReExecDriver.compileAndRespond(ReExecDriver.java:126)
at org.apache.hadoop.hive.ql.reexec.ReExecDriver.run(ReExecDriver.java:214)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:239)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:188)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:402)
{code}
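For context, the failure pattern behind a trace like this is code that dereferences a partition-column list which was never populated. The snippet below is a hypothetical illustration only (the names are invented, not the actual GenMapRedUtils code); it shows how a CTAS plan that left the list null would produce exactly this NPE inside a {{usePartitionColumns}}-style check:

```java
import java.util.List;

public class PartitionColumnsNpeDemo {
    // Hypothetical stand-in for plan metadata; in the real stack trace the
    // null surfaces inside GenMapRedUtils.usePartitionColumns (line 2103).
    static boolean usePartitionColumns(List<String> partCols) {
        // Dereferencing without a null check: partCols.isEmpty() throws
        // NullPointerException when the CTAS path never set the list.
        return !partCols.isEmpty();
    }

    public static void main(String[] args) {
        try {
            usePartitionColumns(null);
        } catch (NullPointerException e) {
            System.out.println("NPE, as in the report");
        }
    }
}
```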

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


insert data into hadoop / hive cluster

2018-12-26 Thread Daniel Takacs
I'm working on an ETL that requires me to import a continuous stream of CSVs 
into a Hadoop/Hive cluster. For now let's assume the CSVs need to end up in the 
same database.table, but newer CSVs might introduce additional columns 
(hence I want the script to alter the table and add the new columns as it 
encounters them).



e.g.

csv1.csv
a,b
1,2
2,4

csv2.csv
a,b,c
3,8,0
4,10,2


What is the best way to write such an ETL for Hive? Should I use hive with -f 
to run scripts like:


upsert.hql:

CREATE TABLE IF NOT EXISTS mydbname.testtable(a INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\,';
SET hive.cli.errors.ignore=true;
ALTER TABLE mydbname.testtable ADD COLUMNS (b string);
SET hive.cli.errors.ignore=false;
LOAD DATA LOCAL INPATH '/home/pathtodata/testdata.csv' INTO TABLE mydbname.testtable;



(A disadvantage is that when LOAD DATA encounters a string that is invalid for 
an integer column, NULL is inserted and I do not get notified.)

Should I do it from beeline?

Should I write a Pig script?

Should I write a Java program?

Should I use a tool like https://github.com/enahwe/Csv2Hive ?

What's the recommended approach here?
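Whichever runner ends up issuing the statements (hive -f, beeline, or JDBC), the "add columns as they appear" part can be scripted by diffing the CSV header against the known table columns and emitting an ALTER TABLE only when something is new. The sketch below is illustrative (class and table names are invented, new columns are assumed to be append-only, and all new columns are typed STRING since a CSV header carries no type information):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class CsvSchemaDiff {
    // Returns the header columns that the table does not have yet.
    static List<String> newColumns(List<String> tableCols, String csvHeader) {
        Set<String> existing = new LinkedHashSet<>(tableCols);
        List<String> added = new ArrayList<>();
        for (String col : csvHeader.split(",")) {
            if (!existing.contains(col.trim())) {
                added.add(col.trim());
            }
        }
        return added;
    }

    // Builds the DDL to run before LOAD DATA; returns null when the
    // header introduces nothing new.
    static String alterStatement(String table, List<String> newCols) {
        if (newCols.isEmpty()) {
            return null;
        }
        StringBuilder sb = new StringBuilder("ALTER TABLE " + table + " ADD COLUMNS (");
        for (int i = 0; i < newCols.size(); i++) {
            if (i > 0) sb.append(", ");
            sb.append(newCols.get(i)).append(" STRING");
        }
        return sb.append(")").toString();
    }

    public static void main(String[] args) {
        List<String> cols = Arrays.asList("a", "b");
        List<String> added = newColumns(cols, "a,b,c");
        System.out.println(alterStatement("mydbname.testtable", added));
        // ALTER TABLE mydbname.testtable ADD COLUMNS (c STRING)
    }
}
```

Issuing the ALTER only for genuinely new columns would also remove the need for the hive.cli.errors.ignore toggle in the script above.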



[jira] [Created] (HIVE-21071) Improve getInputSummary

2018-12-26 Thread BELUGA BEHR (JIRA)
BELUGA BEHR created HIVE-21071:
--

 Summary: Improve getInputSummary
 Key: HIVE-21071
 URL: https://issues.apache.org/jira/browse/HIVE-21071
 Project: Hive
  Issue Type: Improvement
  Components: HiveServer2
Affects Versions: 3.1.1, 3.0.0, 4.0.0
Reporter: BELUGA BEHR


There is a global lock in the {{getInputSummary}} code, so it is important that 
it be fast.  The current implementation has quite a bit of overhead that can be 
re-engineered.

For example, the current implementation keeps a map of file path to 
ContentSummary object. This map is populated by several threads concurrently. 
The method then loops through the map in a single thread at the end to add up 
all of the ContentSummary objects, ignoring the paths. The code can be 
re-engineered to keep a running tally instead of storing the results in a map 
(or any collection). By keeping a tally, there is no O(n) pass at the end to 
perform the addition.

There are other things that can be improved.  The method returns an object 
which is never used anywhere, so the method can be changed to a void return 
type.
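The running-tally idea can be sketched with one {{LongAdder}} per aggregated metric instead of a concurrent map of ContentSummary objects. The names below are illustrative, not the actual Hive code:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

public class InputSummaryTally {
    // One adder per metric; worker threads accumulate directly, so there
    // is no per-path map and no O(n) summation pass at the end.
    private final LongAdder length = new LongAdder();
    private final LongAdder fileCount = new LongAdder();
    private final LongAdder directoryCount = new LongAdder();

    // Called concurrently by the threads that stat each input path.
    void add(long len, long files, long dirs) {
        length.add(len);
        fileCount.add(files);
        directoryCount.add(dirs);
    }

    long totalLength() { return length.sum(); }
    long totalFiles() { return fileCount.sum(); }

    public static void main(String[] args) throws InterruptedException {
        InputSummaryTally tally = new InputSummaryTally();
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int i = 0; i < 100; i++) {
            pool.execute(() -> tally.add(10, 1, 0));
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        System.out.println(tally.totalLength()); // 1000
        System.out.println(tally.totalFiles());  // 100
    }
}
```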





[jira] [Created] (HIVE-21070) HiveSchemaTool doesn't load hivemetastore-site.xml

2018-12-26 Thread peng bo (JIRA)
peng bo created HIVE-21070:
--

 Summary: HiveSchemaTool doesn't load hivemetastore-site.xml
 Key: HIVE-21070
 URL: https://issues.apache.org/jira/browse/HIVE-21070
 Project: Hive
  Issue Type: Bug
  Components: Beeline
Affects Versions: 2.3.3
Reporter: peng bo
Assignee: peng bo


HiveSchemaTool doesn't load hivemetastore-site.xml when the MetaStore is not 
embedded.

The javax.jdo.option.* properties are server-side metastore settings that are 
typically defined in hivemetastore-site.xml, so it seems reasonable for 
HiveSchemaTool to always read this file.
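The fix amounts to layering hivemetastore-site.xml on top of the base configuration so that server-side values win. The sketch below illustrates the layering with plain java.util.Properties rather than Hadoop's Configuration class; the keys and values are examples, not the actual HiveSchemaTool code:

```java
import java.util.Properties;

public class SchemaToolConfigDemo {
    // Later layers override earlier ones, mirroring a hivemetastore-site.xml
    // loaded after hive-site.xml: the server-side value takes precedence.
    static Properties layered(Properties hiveSite, Properties metastoreSite) {
        Properties merged = new Properties();
        merged.putAll(hiveSite);
        merged.putAll(metastoreSite);
        return merged;
    }

    public static void main(String[] args) {
        Properties hiveSite = new Properties();
        hiveSite.setProperty("javax.jdo.option.ConnectionURL",
                "jdbc:derby:;databaseName=metastore_db");

        Properties metastoreSite = new Properties();
        metastoreSite.setProperty("javax.jdo.option.ConnectionURL",
                "jdbc:mysql://db-host/metastore");

        Properties merged = layered(hiveSite, metastoreSite);
        System.out.println(merged.getProperty("javax.jdo.option.ConnectionURL"));
        // jdbc:mysql://db-host/metastore
    }
}
```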





[jira] [Created] (HIVE-21069) Timestamp statistics in orc is wrong if read with useUTCTimestamp=true

2018-12-26 Thread Rei Mai (JIRA)
Rei Mai created HIVE-21069:
--

 Summary: Timestamp statistics in orc is wrong if read with useUTCTimestamp=true
 Key: HIVE-21069
 URL: https://issues.apache.org/jira/browse/HIVE-21069
 Project: Hive
  Issue Type: Bug
Affects Versions: 3.1.0
 Environment: timezone for both client and server "Europe/Moscow" (UTC+3)
 hive version 3.1.0.3.0.1.0-187
Reporter: Rei Mai
 Attachments: 00_0

We're using external ORC tables and the timezone "Europe/Moscow" (UTC+3) for 
both client and server. After switching to Hive 3.1.0, which uses ORC 1.5.1, 
we've got an issue with predicate push down filtering out matching stripes by 
timestamp. E.g. consider a table (its ORC data is in the attachment):
{code:sql}
create external table test_ts (ts timestamp) stored as orc;
insert into test_ts values ("2018-12-24 18:30:00");

-- No rows selected
select * from test_ts where ts < "2018-12-24 19:00:00";

-- the lowest filter to return the value
select * from test_ts where ts <= "2018-12-24 21:30:00";
{code}
The issue only affects the statistics of external ORC tables. Turning PPD off 
with _set hive.optimize.index.filter=false;_ helps.

We believe it was introduced by https://jira.apache.org/jira/browse/ORC-341.

The org.apache.orc.impl.SerializationUtils UTC conversion is rather strange:
{code:java}
public static long convertToUtc(TimeZone local, long time) {
  int offset = local.getOffset(time - local.getRawOffset());
  return time - offset;
}
{code}
This adds a 3 hour offset to our timestamp in the UTC+3 timezone (shouldn't it 
subtract 3 hours, by the way?).
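To make the arithmetic concrete, here is a standalone re-implementation of the quoted method (not the actual ORC class), evaluated for a fixed instant in Europe/Moscow. It only demonstrates what the quoted code computes; whether that sign is the correct conversion is exactly the question this report raises:

```java
import java.util.TimeZone;

public class ConvertToUtcDemo {
    // Re-implementation of the convertToUtc logic quoted from
    // org.apache.orc.impl.SerializationUtils, for illustration only.
    static long convertToUtc(TimeZone local, long time) {
        int offset = local.getOffset(time - local.getRawOffset());
        return time - offset;
    }

    public static void main(String[] args) {
        TimeZone msk = TimeZone.getTimeZone("Europe/Moscow");
        long threeHours = 3 * 60 * 60 * 1000L;
        // Millis for 2018-12-24T18:30:00Z, the timestamp from the example.
        long time = 1545676200000L;
        // Europe/Moscow has a fixed +03:00 offset and no DST, so
        // getOffset(...) is always 10800000 ms here and the method
        // returns time shifted by exactly three hours.
        System.out.println(convertToUtc(msk, time) == time - threeHours); // true
    }
}
```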

If org.apache.orc.impl.TimestampStatisticsImpl is used with 
useUTCTimestamp=false, the timestamp is converted back in a compatible way via 
SerializationUtils.convertFromUtc. But Hive seems to override the default 
org.apache.orc.OrcFile.ReaderOptions with 
org.apache.hadoop.hive.ql.io.orc.ReaderOptions, which has useUTCTimestamp(true) 
in its constructor. With useUTCTimestamp=true, the evaluatePredicateProto 
predicate uses TimestampStatisticsImpl.getMaximumUTC(), which returns the 
timestamp as is, i.e. in the example it's "2018-12-24 21:30:00 UTC+3".

At the same time the search predicate is not shifted (the value in this Tez log 
is in UTC+3):
{quote}2018-12-24 22:12:16,205 [INFO] [InputInitializer \{Map 1} #0] |orc.OrcInputFormat|: ORC pushdown predicate: leaf-0 = (LESS_THAN ts 2018-12-24 19:00:00.0), expr = leaf-0
{quote}


