[jira] [Created] (HIVE-21072) NPE when running partitioned CTAS statements
Liang-Chi Hsieh created HIVE-21072:
--
Summary: NPE when running partitioned CTAS statements
Key: HIVE-21072
URL: https://issues.apache.org/jira/browse/HIVE-21072
Project: Hive
Issue Type: Bug
Reporter: Liang-Chi Hsieh

HIVE-20241 adds support for partitioned CTAS statements:
{code:sql}
CREATE TABLE partition_ctas_1 PARTITIONED BY (key) AS
SELECT value, key FROM src where key > 200 and key < 300;
{code}
However, I tried this feature after checking out the latest branch-3 and encountered an NPE:
{code:java}
hive> CREATE TABLE t PARTITIONED BY (part) AS SELECT 1 as id, "a" as part;
FAILED: NullPointerException null
{code}
I also ran the query test partition_ctas.q. The test passes when using TestMiniLlapLocalCliDriver, but when I run it manually with TestCliDriver, it also throws a NullPointerException:
{code}
2018-12-25T05:58:22,221 ERROR [a96009a7-3dda-4d95-9536-e2e16d976856 main] ql.Driver: FAILED: NullPointerException null
java.lang.NullPointerException
	at org.apache.hadoop.hive.ql.optimizer.GenMapRedUtils.usePartitionColumns(GenMapRedUtils.java:2103)
	at org.apache.hadoop.hive.ql.optimizer.GenMapRedUtils.createMRWorkForMergingFiles(GenMapRedUtils.java:1323)
	at org.apache.hadoop.hive.ql.optimizer.GenMRFileSink1.process(GenMRFileSink1.java:113)
	at org.apache.hadoop.hive.ql.lib.DefaultRuleDispatcher.dispatch(DefaultRuleDispatcher.java:90)
	at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatchAndReturn(DefaultGraphWalker.java:105)
	at org.apache.hadoop.hive.ql.parse.GenMapRedWalker.walk(GenMapRedWalker.java:54)
	at org.apache.hadoop.hive.ql.parse.GenMapRedWalker.walk(GenMapRedWalker.java:65)
	at org.apache.hadoop.hive.ql.parse.GenMapRedWalker.walk(GenMapRedWalker.java:65)
	at org.apache.hadoop.hive.ql.parse.GenMapRedWalker.walk(GenMapRedWalker.java:65)
	at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.startWalking(DefaultGraphWalker.java:120)
	at org.apache.hadoop.hive.ql.parse.MapReduceCompiler.generateTaskTree(MapReduceCompiler.java:323)
	at org.apache.hadoop.hive.ql.parse.TaskCompiler.compile(TaskCompiler.java:244)
	at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:12503)
	at org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:357)
	at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:285)
	at org.apache.hadoop.hive.ql.parse.ExplainSemanticAnalyzer.analyzeInternal(ExplainSemanticAnalyzer.java:166)
	at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:285)
	at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:664)
	at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1854)
	at org.apache.hadoop.hive.ql.Driver.compileAndRespond(Driver.java:1801)
	at org.apache.hadoop.hive.ql.Driver.compileAndRespond(Driver.java:1796)
	at org.apache.hadoop.hive.ql.reexec.ReExecDriver.compileAndRespond(ReExecDriver.java:126)
	at org.apache.hadoop.hive.ql.reexec.ReExecDriver.run(ReExecDriver.java:214)
	at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:239)
	at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:188)
	at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:402)
{code}
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
insert data into Hadoop / Hive cluster
I'm working on an ETL that requires me to import a continuous stream of CSVs into a Hadoop/Hive cluster. For now let's assume the CSVs need to end up in the same database.table. But newer CSVs might introduce additional columns (hence I want the script to alter the table and add the additional columns as it encounters them). E.g.:

csv1.csv
a,b
1,2
2,4

csv2.csv
a,b,c
3,8,0
4,10,2

What is the best way to write such an ETL for Hive? Should I use hive with -f to run scripts like:

upsert.hql:
CREATE TABLE IF NOT EXISTS mydbname.testtable(a INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\,';
SET hive.cli.errors.ignore=true;
ALTER TABLE mydbname.testtable ADD COLUMNS (b string);
SET hive.cli.errors.ignore=false;
LOAD DATA LOCAL INPATH '/home/pathtodata/testdata.csv' INTO TABLE mydbname.testtable;

(A disadvantage is that when LOAD DATA encounters an invalid column string for an integer field, the value NULL is inserted and I do not get notified.)

Should I do it from beeline? Should I write a Pig script? Should I write a Java program? Should I use a tool like https://github.com/enahwe/Csv2Hive ? What's the recommended approach here?
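One way to frame the schema-evolution part of the question (a minimal Python sketch, not a recommendation; the function name `evolve_statements` and the STRING-by-default typing are hypothetical choices, and real code would fetch the current columns from the metastore, e.g. via DESCRIBE, and run the output through beeline):

```python
def evolve_statements(header, table, existing_cols):
    """Build the ALTER TABLE statements needed before loading a CSV
    whose header row is `header` into `table`.
    New columns default to STRING here (an assumption for the sketch);
    `existing_cols` would come from DESCRIBE against the metastore."""
    new_cols = [c for c in header if c not in existing_cols]
    stmts = []
    if new_cols:
        cols = ", ".join(f"{c} STRING" for c in new_cols)
        stmts.append(f"ALTER TABLE {table} ADD COLUMNS ({cols})")
    return stmts

# csv2.csv from the example introduces column c
print(evolve_statements(["a", "b", "c"], "mydbname.testtable", {"a", "b"}))
# → ['ALTER TABLE mydbname.testtable ADD COLUMNS (c STRING)']
```

Generating the ALTER explicitly, instead of relying on hive.cli.errors.ignore to swallow "column already exists" failures, also means genuine errors are not silently ignored.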
[jira] [Created] (HIVE-21071) Improve getInputSummary
BELUGA BEHR created HIVE-21071:
--
Summary: Improve getInputSummary
Key: HIVE-21071
URL: https://issues.apache.org/jira/browse/HIVE-21071
Project: Hive
Issue Type: Improvement
Components: HiveServer2
Affects Versions: 3.1.1, 3.0.0, 4.0.0
Reporter: BELUGA BEHR

There is a global lock in the {{getInputSummary}} code, so it is important that it be fast. The current implementation has quite a bit of overhead that can be re-engineered. For example, it keeps a map of file Path to ContentSummary object. This map is populated concurrently by several threads. The method then loops through the map in a single thread at the end to add up all of the ContentSummary objects, ignoring the paths. The code can be re-engineered to not use a map, or a collection at all, to store the results, and instead just keep a running tally. By keeping a tally, there is no O(n) operation at the end to perform the addition. Other things can be improved as well: the method returns an object which is never used anywhere, so the method could be changed to a void return type.
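The "running tally" idea can be sketched as follows (in Python rather than Hive's Java, purely for illustration; the paths and sizes are hypothetical stand-ins for what getContentSummary would return):

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for per-path ContentSummary sizes
path_sizes = {"/warehouse/a": 100, "/warehouse/b": 250, "/warehouse/c": 50}

total = 0
lock = threading.Lock()

def add_summary(path):
    size = path_sizes[path]  # stands in for fs.getContentSummary(path)
    global total
    with lock:               # running tally instead of a Map<Path, ContentSummary>
        total += size

with ThreadPoolExecutor(max_workers=3) as pool:
    list(pool.map(add_summary, path_sizes))

# no O(n) summing pass over a result map is needed at the end
print(total)  # → 400
```

Each worker folds its result into the shared total under a short lock, so the final answer is ready the moment the last worker finishes.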
[jira] [Created] (HIVE-21070) HiveSchemaTool doesn't load hivemetastore-site.xml
peng bo created HIVE-21070:
--
Summary: HiveSchemaTool doesn't load hivemetastore-site.xml
Key: HIVE-21070
URL: https://issues.apache.org/jira/browse/HIVE-21070
Project: Hive
Issue Type: Bug
Components: Beeline
Affects Versions: 2.3.3
Reporter: peng bo
Assignee: peng bo

HiveSchemaTool doesn't load hivemetastore-site.xml in the case of a non-embedded metastore. The javax.jdo.option connection settings are server-side metastore properties that are always defined in hivemetastore-site.xml, so it seems reasonable for HiveSchemaTool to always read this file.
[jira] [Created] (HIVE-21069) Timestamp statistics in orc is wrong if read with useUTCTimestamp=true
Rei Mai created HIVE-21069:
--
Summary: Timestamp statistics in orc is wrong if read with useUTCTimestamp=true
Key: HIVE-21069
URL: https://issues.apache.org/jira/browse/HIVE-21069
Project: Hive
Issue Type: Bug
Affects Versions: 3.1.0
Environment: timezone for both client and server "Europe/Moscow" (UTC+3), hive version 3.1.0.3.0.1.0-187
Reporter: Rei Mai
Attachments: 00_0

We're using external ORC tables and the timezone "Europe/Moscow" (UTC+3) for both client and server. After switching to Hive 3.1.0, which uses ORC 1.5.1, we've got an issue with predicate push-down filtering out matching stripes by timestamp. E.g. consider a table (its ORC data is in the attachment):
{code:sql}
create external table test_ts (ts timestamp) stored as orc;
insert into test_ts values ("2018-12-24 18:30:00");

-- No rows selected
select * from test_ts where ts < "2018-12-24 19:00:00";
-- the lowest filter to return the value
select * from test_ts where ts <= "2018-12-24 21:30:00";
{code}
The issue only affects external ORC table statistics. Turning PPD off with _set hive.optimize.index.filter=false;_ helps. We believe it was introduced by https://jira.apache.org/jira/browse/ORC-341. The UTC conversion in org.apache.orc.impl.SerializationUtils is rather strange:
{code:java}
public static long convertToUtc(TimeZone local, long time) {
  int offset = local.getOffset(time - local.getRawOffset());
  return time - offset;
}
{code}
This adds a 3 hour offset to our timestamp in the UTC+3 timezone (shouldn't it subtract 3 hours, btw?). If org.apache.orc.impl.TimestampStatisticsImpl is used with useUTCTimestamp=false, the timestamp is converted back in a compatible way via SerializationUtils.convertFromUtc. But Hive seems to override the default org.apache.orc.OrcFile.ReaderOptions with org.apache.hadoop.hive.ql.io.orc.ReaderOptions, which has useUTCTimestamp(true) in its constructor.
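The arithmetic quoted above can be illustrated with a small sketch (a Python approximation assuming a fixed +3h offset for Europe/Moscow, i.e. ignoring DST and the getOffset/getRawOffset subtlety; the millisecond value is arbitrary):

```python
MOSCOW_OFFSET_MS = 3 * 60 * 60 * 1000  # local.getOffset(...) for a fixed UTC+3

def convert_to_utc(time_ms, offset_ms=MOSCOW_OFFSET_MS):
    # mirrors SerializationUtils.convertToUtc: return time - offset
    return time_ms - offset_ms

def convert_from_utc(time_ms, offset_ms=MOSCOW_OFFSET_MS):
    # mirrors the assumed inverse, SerializationUtils.convertFromUtc
    return time_ms + offset_ms

t = 1_545_669_000_000  # an arbitrary epoch-millis value

# a reader with useUTCTimestamp=false round-trips the value, so it sees no shift
assert convert_from_utc(convert_to_utc(t)) == t

# a reader with useUTCTimestamp=true sees the raw stored value, 3 hours away
# from the original, which is the discrepancy described in this report
assert abs(t - convert_to_utc(t)) == 3 * 60 * 60 * 1000
```

This is why only the useUTCTimestamp=true path misbehaves: the shift applied at write time is never undone at read time.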
With useUTCTimestamp=true, the evaluatePredicateProto predicate evaluation uses TimestampStatisticsImpl.getMaximumUTC(), which returns the timestamp as is, i.e. in the example it's "2018-12-24 21:30:00 UTC+3". At the same time the search predicate is not shifted (the value in this Tez log is in UTC+3):
{quote}2018-12-24 22:12:16,205 [INFO] [InputInitializer \{Map 1} #0|#0] |orc.OrcInputFormat|: ORC pushdown predicate: leaf-0 = (LESS_THAN ts 2018-12-24 19:00:00.0), expr = leaf-0
{quote}