Re: Hive query on ORC table is really slow compared to Presto

2017-04-04 Thread Gopal Vijayaraghavan
> SELECT COUNT(*), COUNT(DISTINCT id) FROM accounts; … > 0:01 [8.59M rows, 113MB] [11M rows/s, 146MB/s] I'm hoping this is not rewriting to the approx_distinct() in Presto. > I got similar performance with Hive + LLAP too. This is a logical plan issue, so I don't know if LLAP helps a lot. A

RE: Is it possible to use LOAD DATA INPATH with a PARTITIONED, STORED AS PARQUET table?

2017-04-04 Thread Ryan Harris
For A) I’d recommend mapping an EXTERNAL table to the raw/original source files…then you can just run a SELECT query from the EXTERNAL source and INSERT into your destination. LOAD DATA can be very useful when you are trying to move data between two tables that share the same schema but 1

Hive query on ORC table is really slow compared to Presto

2017-04-04 Thread Premal Shah
Hi, I have an ORC table with around 9 million rows. It has an ID column. We are running a query to make sure that there are no duplicate IDs. This is the query *SELECT COUNT(*), COUNT(DISTINCT id) FROM accounts;* This is the output from the presto shell presto:test> SELECT COUNT(*), COUNT(DISTINCT
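
For readers following the duplicate-ID check in this message, here is a minimal sketch of what the query computes. It uses Python's built-in sqlite3 module as a stand-in engine, not Hive or Presto, so it says nothing about performance. The `accounts` table and `id` column come from the message; the `duplicate_report` helper and the sample data are assumptions made for illustration. The second query (GROUP BY with HAVING) is an addition that lists which IDs repeat, which the original query does not do.

```python
import sqlite3


def duplicate_report(ids):
    """Return (total_rows, distinct_ids, duplicated_ids) for a list of IDs.

    Hypothetical helper: loads the IDs into an in-memory SQLite table named
    `accounts` and runs the same COUNT(*) / COUNT(DISTINCT id) query quoted
    in the message.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (id INTEGER)")
    conn.executemany("INSERT INTO accounts (id) VALUES (?)", [(i,) for i in ids])

    # If the two counts are equal, no ID appears more than once.
    total, distinct = conn.execute(
        "SELECT COUNT(*), COUNT(DISTINCT id) FROM accounts"
    ).fetchone()

    # Not in the original message: list the IDs that occur more than once.
    dupes = [
        row[0]
        for row in conn.execute(
            "SELECT id FROM accounts GROUP BY id HAVING COUNT(*) > 1 ORDER BY id"
        )
    ]
    conn.close()
    return total, distinct, dupes


if __name__ == "__main__":
    print(duplicate_report([1, 2, 2, 3, 3, 3]))  # (6, 3, [2, 3])
```

When the first two values match, the table has no duplicate IDs; when they differ, the third value shows which IDs to investigate.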

Re: Is it possible to use LOAD DATA INPATH with a PARTITIONED, STORED AS PARQUET table?

2017-04-04 Thread Dmitry Goldenberg
Right, that makes sense, Dudu. So basically, if we have our data in "some form", and a goal of loading it into a parquet, partitioned table in Hive, we have two choices: A. Load this data into a temporary table first. Presumably, for this we should be able to do a LOAD INPATH, from delimited

RE: Is it possible to use LOAD DATA INPATH with a PARTITIONED, STORED AS PARQUET table?

2017-04-04 Thread Markovitz, Dudu
“LOAD” is very misleading here. It is all done at the metadata level. The data is not being touched. The data is not being verified. The “system” does not have any clue whether the files' format matches the table definition and whether they can actually be used. The data files are being “moved” (again, a

Re: Is it possible to use LOAD DATA INPATH with a PARTITIONED, STORED AS PARQUET table?

2017-04-04 Thread Dmitry Goldenberg
Thanks, Dudu. I think there's a disconnect here. We're using LOAD INPATH on a few tables to achieve the effect of actual insertion of records. Is it not the case that the LOAD causes the data to get inserted into Hive? Based on that I'd like to understand whether we can get away with using LOAD

RE: Is it possible to use LOAD DATA INPATH with a PARTITIONED, STORED AS PARQUET table?

2017-04-04 Thread Markovitz, Dudu
I just want to verify that you understand the following: · LOAD DATA INPATH is just an HDFS file movement operation. You can achieve the same results by using hdfs dfs -mv … · LOAD DATA LOCAL INPATH is just a file copying operation from the shell to HDFS. You can
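
The point made in this message, that a load is only a file move with no validation, can be sketched with the local filesystem standing in for HDFS. This is an analogy under stated assumptions, not Hive's implementation: the `move_into_table_dir` helper, the directory layout, and the sample row are all hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def move_into_table_dir(src_file: Path, table_dir: Path) -> Path:
    """Move src_file under table_dir without reading or validating it.

    Hypothetical stand-in for the file movement described in the message:
    nothing checks that the file's format matches the table's storage format.
    """
    table_dir.mkdir(parents=True, exist_ok=True)
    dest = table_dir / src_file.name
    shutil.move(str(src_file), str(dest))
    return dest


def demo():
    """Move a delimited file into a made-up 'table' directory and inspect it."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        src = root / "incoming" / "data.csv"
        src.parent.mkdir()
        src.write_text("1,2017-04-04,hello\n")

        # Assumed layout: a directory that represents the table's location.
        table_dir = root / "warehouse" / "mytable"
        dest = move_into_table_dir(src, table_dir)

        # The source is gone and the bytes are unchanged: still delimited text,
        # whatever storage format the table was declared with.
        return src.exists(), dest.read_text()


if __name__ == "__main__":
    print(demo())
```

The file ends up under the table directory byte for byte as it was, which is why a delimited file moved under a table declared STORED AS PARQUET would not become Parquet.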

Re: Is it possible to use LOAD DATA INPATH with a PARTITIONED, STORED AS PARQUET table?

2017-04-04 Thread Dmitry Goldenberg
Dudu, This is still in design stages, so we have a way to get the data from its source. The data is *not* in the Parquet format. It's up to us to format it the best and most efficient way. We can roll with CSV or Parquet; ultimately the data must make it into a pre-defined PARQUET, PARTITIONED

RE: Is it possible to use LOAD DATA INPATH with a PARTITIONED, STORED AS PARQUET table?

2017-04-04 Thread Markovitz, Dudu
Are your files already in Parquet format? From: Dmitry Goldenberg [mailto:dgoldenb...@hexastax.com] Sent: Tuesday, April 04, 2017 7:03 PM To: user@hive.apache.org Subject: Re: Is it possible to use LOAD DATA INPATH with a PARTITIONED, STORED AS PARQUET table? Thanks, Dudu. Just to re-iterate;

Re: Is it possible to use LOAD DATA INPATH with a PARTITIONED, STORED AS PARQUET table?

2017-04-04 Thread Dmitry Goldenberg
Thanks, Dudu. Just to re-iterate; the way I'm reading your response is that yes, we can use LOAD INPATH for a PARQUET, PARTITIONED table, provided that the data in the delimited file is properly formatted. Then we can LOAD it into the table (mytable in my example) directly and avoid the creation

RE: Is it possible to use LOAD DATA INPATH with a PARTITIONED, STORED AS PARQUET table?

2017-04-04 Thread Markovitz, Dudu
Since LOAD DATA INPATH only moves files the answer is very simple. If your files are already in a format that matches the destination table (storage type, number and types of columns etc.) then – yes and if not, then – no. But – You don’t need to load the files into an intermediary table. You

Is it possible to use LOAD DATA INPATH with a PARTITIONED, STORED AS PARQUET table?

2017-04-04 Thread Dmitry Goldenberg
We have a table such as the following defined: CREATE TABLE IF NOT EXISTS db.mytable ( `item_id` string, `timestamp` string, `item_comments` string) PARTITIONED BY (`date`, `content_type`) STORED AS PARQUET; Currently we insert data into this PARQUET, PARTITIONED table as follows, using an

Re: Request write access to Hive wiki

2017-04-04 Thread Lefty Leverenz
You have write access now, Sankar. Welcome to the Hive wiki team! -- Lefty On Mon, Apr 3, 2017 at 11:15 PM, Sankar Hariappan < shariap...@hortonworks.com> wrote: > Hi, > > I’m currently working on Hive Replication feature and need access to > update some wiki pages. > Confluence ID: sankarh >

Request write access to Hive wiki

2017-04-04 Thread Sankar Hariappan
Hi, I’m currently working on Hive Replication feature and need access to update some wiki pages. Confluence ID: sankarh Best regards Sankar