Re: Finding Max of a column without using any Aggregation functions

2014-04-23 Thread Furcy Pin
Hi, note that if your table contains the max value several times, all the occurrences will be returned. Note also that if it contains a NULL, it will be returned too. How about this? SELECT score FROM student ORDER BY score DESC LIMIT 1 ; Note that on this query Impala is inconsistent with Hive
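A sketch of a NULL-safe variant of the query above (column and table names are from the thread; the WHERE clause is my addition):

```sql
-- Filtering NULLs out explicitly avoids any surprise,
-- whichever way the engine sorts them.
SELECT score
FROM student
WHERE score IS NOT NULL
ORDER BY score DESC
LIMIT 1;
```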

Multiple insert on same table with INSERT OVERWRITE + INSERT INTO ?

2014-06-30 Thread Furcy Pin
Hi all, Does anyone know what is the expected behavior on a query like this : FROM source INSERT OVERWRITE TABLE dest SELECT * INSERT INTO TABLE dest SELECT * is it the same as FROM source INSERT INTO TABLE dest SELECT * INSERT OVERWRITE TABLE dest SELECT * ? I'm asking because I ran a
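For reference, the two multi-insert forms being compared in the question, reconstructed from the truncated snippet above (whether their results differ is precisely what the thread asks):

```sql
-- Form 1: OVERWRITE first, then INTO
FROM source
INSERT OVERWRITE TABLE dest SELECT *
INSERT INTO      TABLE dest SELECT * ;

-- Form 2: INTO first, then OVERWRITE
FROM source
INSERT INTO      TABLE dest SELECT *
INSERT OVERWRITE TABLE dest SELECT * ;
```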

Re: Multiple insert on same table with INSERT OVERWRITE + INSERT INTO ?

2014-07-01 Thread Furcy Pin
? 2014-06-30 9:15 GMT+02:00 Furcy Pin furcy@flaminem.com: Hi all, Does anyone know what is the expected behavior on a query like this : FROM source INSERT OVERWRITE TABLE dest SELECT * INSERT INTO TABLE dest SELECT * is it the same as FROM source INSERT INTO TABLE dest SELECT

about collect_set and ordering

2014-07-30 Thread Furcy Pin
Hi all, I just wanted to point out a little gotcha we got while using the collect_set UDF : you should not directly perform a GROUP BY over a collect_set(...), because the set is cast as an array and is not necessarily sorted. For example, we ran a query looking like this : SELECT set,
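A hedged sketch of the workaround: sorting the array before grouping makes equal sets compare equal (sort_array is a Hive built-in; table and column names are invented):

```sql
-- Without sort_array, two identical sets collected in a
-- different order would land in different groups.
SELECT sorted_set, COUNT(1)
FROM (
  SELECT sort_array(collect_set(item)) AS sorted_set
  FROM events
  GROUP BY user_id
) t
GROUP BY sorted_set;
```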

Help needed: Out of memory with windowing functions

2014-08-20 Thread Furcy Pin
Hi all, I have an event table with (user_id, timestamp, event) and I'm trying to write a query to get the first 10 events for each user. My query goes like this : SELECT user_id, event FROM ( SELECT user_id, event, ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY time) as rownum FROM
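The truncated query above presumably continues along these lines (a sketch; the outer filter on rownum is the standard top-N-per-group pattern, and the table name is an assumption):

```sql
SELECT user_id, event
FROM (
  SELECT user_id, event,
         ROW_NUMBER() OVER (PARTITION BY user_id
                            ORDER BY `timestamp`) AS rownum
  FROM events
) t
-- keep only the first 10 events of each user
WHERE rownum <= 10;
```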

Re: Help needed: Out of memory with windowing functions

2014-08-20 Thread Furcy Pin
/browse/HIVE-7143 5. https://issues.apache.org/jira/browse/HIVE-7344 Hope it helps, Ashutosh On Wed, Aug 20, 2014 at 5:34 AM, Furcy Pin furcy@flaminem.com wrote: Hi all, I have an event table with (user_id, timestamp, event) and I'm trying to write a query to get the first 10 events

Re: MatchPath UDF usage info ?

2014-09-03 Thread Furcy Pin
Hi Muhammad, From what I googled a few months ago on the subject, the MatchPath UDF has been removed from the Cloudera and Hortonworks releases because Teradata claims it violates one of their patents (apparently renaming it did not suffice). I guess that if you really need it, it might be possible to

Re: UDTF KryoException: unable create/find class error in hive 0.13

2014-09-09 Thread Furcy Pin
Hi, I think I encountered this kind of serialization problem when writing UDFs. Usually, marking every field of the UDF as *transient* does the trick. I guess the error means that Kryo tries to serialize the UDF class and everything inside it, and by marking them as transient you ensure

Re: UDTF KryoException: unable create/find class error in hive 0.13

2014-09-10 Thread Furcy Pin
failure. failedVertices:1 killedVertices:0 FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask could be something else? On Tue, Sep 9, 2014 at 1:31 AM, Furcy Pin furcy@flaminem.com wrote: Hi, I think I encountered this kind of serialization problem

Re: How to match three letter month name in Hive queries

2014-12-21 Thread Furcy Pin
Hi Thimut, I believe that the UDF unix_timestamp uses the Java class SimpleDateFormat. http://docs.oracle.com/javase/6/docs/api/java/text/SimpleDateFormat.html From the doc, you can see that m denotes a minute while M denotes a month. For your problem, -MMM-dd should do the trick. Regards,
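A sketch of the fix (the exact pattern in the reply above is truncated; the full pattern and sample value here are assumptions, to be adjusted to the actual field order of the data):

```sql
-- MMM matches a three-letter month name via SimpleDateFormat;
-- m would match minutes, which was the original mistake.
SELECT unix_timestamp('2014-Dec-21', 'yyyy-MMM-dd');
```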

Re: Hash algorithm issues

2015-01-21 Thread Furcy Pin
Hi Murali, As you can see from the source code of Hive's hash UDF: https://github.com/apache/hive/blob/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFHash.java

Re: CSV file reading in hive

2015-02-13 Thread Furcy Pin
Hi Sreeman, Unfortunately, I don't think that Hive's built-in formats can currently read csv files with fields enclosed in double quotes. More generally, having ingested quite a lot of messy csv files myself, I would recommend writing a MapReduce (or Spark) job to clean your csv before

Re: Hive order by not working properly

2015-01-07 Thread Furcy Pin
still using 0.12 though, maybe later versions changed some things. Hope this helps, Furcy Pin PS: On the other hand, Hive has the opposite behavior in GROUP BY clauses, e.g. this will not work: SELECT concat('test_', ACCOUNTID) as tempColumn, COUNT(1) from table GROUP BY tempColumn order

Re: WHERE ... NOT IN (...) + NULL values = BUG

2015-07-07 Thread Furcy Pin
On Tue, Jul 7, 2015 at 11:46 AM, Furcy Pin furcy@flaminem.com wrote: Hi folks, just to let my fellow Hive users know that we found a bug with subquery in where clauses and created a JIRA for it. https://issues.apache.org/jira/browse/HIVE-11192 The latest version seems to be affected

Re: locks are held on tables even when no job running

2015-10-24 Thread Furcy Pin
E NULL >> 1445586471827 NULL hduser rhes564 >> >> >> >> I am not sure this is behaving like classic RDBMS like Sybase or MSSQL >> where the rollback is happening after KILL command and the locks are held >> until rollback is complete. Killing a pro

Re: locks are held on tables even when no job running

2015-10-23 Thread Furcy Pin
Hi Mich, I believe the duration of locks is defined by hive.txn.timeout, which is 5 min by default. https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties Retry your SHOW LOCKS command and check that the Last HeartBeat is not changing. If it is, it means your query is still

Re: Best way to load CSV file into Hive

2015-11-01 Thread Furcy Pin
Hi Vijaya, If you need some nice ETL capabilities, you may want to try https://github.com/databricks/spark-csv Among other things, spark-csv lets you read the csv as is and create and insert a copy of the data into a Hive table with any format you like (Parquet, ORC, etc.) If you have a header

Alias vs Assignment

2015-10-08 Thread Furcy Pin
Hi folks, I would like to start a discussion with the Hive user and developer community about an element of syntax present in SQL Server that could be nice to have in Hive. Back in 2012, before I started using Hive, when I was still using SQL Server, I came across this post :

Re: transactional table + vectorization + where = bug

2015-09-23 Thread Furcy Pin
as such. > > Alan. > > Furcy Pin <furcy@flaminem.com> > September 19, 2015 at 2:36 > Hi, > > > We bumped into a bug when using vectorization on a transactional table. > Here is a minimal example : > > DROP TABLE IF EXISTS vectorization_transactional_test

transactional table + vectorization + where = bug

2015-09-19 Thread Furcy Pin
Hi, We bumped into a bug when using vectorization on a transactional table. Here is a minimal example : DROP TABLE IF EXISTS vectorization_transactional_test ; CREATE TABLE vectorization_transactional_test ( id INT ) CLUSTERED BY (id) into 3 buckets STORED AS ORC

Re: Any clue on this error, Exception in thread "main" java.lang.NoSuchFieldError: SPARK_RPC_CLIENT_CONNECT_TIMEOUT

2015-12-03 Thread Furcy Pin
Maybe you compiled and are running against different versions of Spark? On Thu, Dec 3, 2015 at 6:54 PM, Mich Talebzadeh wrote: > Trying to run Hive on Spark 1.3 engine, I get > > > > conf hive.spark.client.channel.log.level=null --conf > hive.spark.client.rpc.max.size=52428800

Re: Any clue on this error, Exception in thread "main" java.lang.NoSuchFieldError: SPARK_RPC_CLIENT_CONNECT_TIMEOUT

2015-12-03 Thread Furcy Pin
recipient to ensure that this email is virus > free, therefore neither Peridale Ltd, its subsidiaries nor their employees > accept any responsibility. > > > > *From:* Furcy Pin [mailto:furcy@flaminem.com] > *Sent:* 03 December 2015 18:07 > *To:* user@hive.apache.org > *

Discussion: permanent UDF with database name

2015-12-17 Thread Furcy Pin
Hi Hive users, I would like to pursue the discussion that happened during the design of the feature: https://issues.apache.org/jira/browse/HIVE-6167 Some concerns were raised back then, and I think that now that it has been implemented, some user feedback could bring grist to the mill.

Re: Could i use Hive SQL parser in our application?

2016-05-18 Thread Furcy Pin
Hi, This is feasible. I already did it in some tool I developed. Unfortunately I can't open-source it yet, but I can tell you how I did it. There is this class in the Hive source code

Re: Doubt on Hive Partitioning.

2016-08-01 Thread Furcy Pin
Hi Abhishek, Yes, it can happen. The only such scenarios I can think of are when you use a WHERE clause with a non-constant predicate. As far as I know, partition pruning only works on constant predicates, because Hive has to evaluate them *before* starting the query in order to prune the partitions. For

Re: Re-arrange columns

2016-07-26 Thread Furcy Pin
Hi, I think I had similar issues to yours. Did you look in the Hive documentation at what the CASCADE keyword does on ADD or CHANGE COLUMNS statements? https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterColumn From what I understand, the behavior of Hive

Re: Iterating over partitions using the metastore API

2016-08-04 Thread Furcy Pin
Hi Elliot, I guess you can use IMetaStoreClient.listPartitionsNames instead, and then use IMetaStoreClient.getPartition for each partition. This might be slow though, as you will have to make 10 000 calls to get them. Another option I'd consider is connecting directly to the Hive metastore. This

Re: Doubt on Hive Partitioning.

2016-08-02 Thread Furcy Pin
I'm using Hive 1.1 on MR and dynamic partition pruning does not seem to work. Since MR is deprecated in 2.0, I assume we should not expect any future perf optimisation on this side. It has been implemented for Hive on Spark, though. https://issues.apache.org/jira/browse/HIVE-9152 On Tue, Aug

Re: Populating tables using hive and spark

2016-08-22 Thread Furcy Pin
Hi Nitin, I confirm that there is something odd here. I did the following test : create table test_orc (id int, name string, dept string) stored as ORC; insert into table test_orc values (1, 'abc', 'xyz'); insert into table test_orc values (2, 'def', 'xyz'); insert into table test_orc values

Re: Table substitution

2016-08-25 Thread Furcy Pin
Hi Rakesh, this is an interesting question, Did you look at the cube and rollup possibilities? https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation,+Cube,+Grouping+and+Rollup I believe that you could use this feature to merge your table 1 and table 2 in one single table with a

Re: Hive metadata on Hbase

2016-10-24 Thread Furcy Pin
Hi Mich, the umbrella JIRA for this gives a few reasons. https://issues.apache.org/jira/browse/HIVE-9452 (with even more details in the attached pdf https://issues.apache.org/jira/secure/attachment/12697601/HBaseMetastoreApproach.pdf ) In my experience, Hive tables with a lot of partitions (> 10

Re: Hive metadata on Hbase

2016-10-25 Thread Furcy Pin
Hi Mich, I mostly agree with you, but I would comment on the part about using HBase as a maintenance-free core product: I would say that most medium-sized companies using Hadoop rely on Hortonworks or Cloudera, which both provide a pre-packaged HBase installation. It would probably make sense for them to

Re: Hive on Spark not working

2016-11-28 Thread Furcy Pin
ClassNotFoundException generally means that jars are missing from your class path. You probably need to link the spark jar to $HIVE_HOME/lib https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started#HiveonSpark:GettingStarted-ConfiguringHive On Tue, Nov 29, 2016 at 2:03

Re: Hive metadata on Hbase

2016-10-26 Thread Furcy Pin
r any > loss, damage or destruction of data or any other property which may arise > from relying on this email's technical content is explicitly disclaimed. > The author will in no case be liable for any monetary damages arising from > such loss, damage or destruction. > > > > On 25

Re: How to move tasks under reducer to Mapper phase

2016-12-10 Thread Furcy Pin
Hi Mahender, it's hard to say what happens without seeing the actual query. Hive has several ways to perform joins. There is a complete description of how it does them here: https://cwiki.apache.org/confluence/display/Hive/MapJoinOptimization Sadly, the illustrations are broken. There is also

Re: PARTITION error because different columns size

2016-12-13 Thread Furcy Pin
Hi Joaquin, Suresh was faster than me ... Also, you should check this : https://cwiki.apache.org/confluence/display/Hive/Tutorial#Tutorial-Dynamic-PartitionInsert On Tue, Dec 13, 2016 at 3:19 PM, Suresh Kumar Sethuramaswamy < rock...@gmail.com> wrote: > Hi Joaquin > > In hive , when u run

Re: PARTITION error because different columns size

2016-12-13 Thread Furcy Pin
Actually, there is a legacy feature in Hive that would do exactly what you need : https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Select#LanguageManualSelect-REGEXColumnSpecification Another way would simply be to use dynamic partitioning : INSERT INTO TABLE employee_orc PARTITION
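A sketch combining the two suggestions, dynamic partitioning plus the REGEX column specification (the source table name and the dept column are assumptions; employee_orc is from the thread):

```sql
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- `(dept)?+.+` selects every column except dept (on recent Hive
-- versions this requires hive.support.quoted.identifiers=none);
-- dept is then re-appended to feed the dynamic partition column.
INSERT INTO TABLE employee_orc PARTITION (dept)
SELECT `(dept)?+.+`, dept
FROM employee_staging;
```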

Re: Table not found in the definition of view

2016-12-01 Thread Furcy Pin
Hi, you should replace WITH table AS (subquery) SELECT ... FROM table with SELECT ... FROM( subquery ) table Regards. On Thu, Dec 1, 2016 at 12:32 PM, Priyanka Raghuvanshi wrote: > Hi All > > > Getting error 'Table not found in the definition of view ' if a view is
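The suggested rewrite, side by side with generic names (a sketch; per the thread, the CTE form fails inside a view definition):

```sql
-- Fails when used inside a view definition:
WITH t AS (SELECT id FROM source)
SELECT id FROM t;

-- Equivalent inline-subquery form that works:
SELECT id
FROM (SELECT id FROM source) t;
```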

Re: hivevar in partition clause

2017-03-21 Thread Furcy Pin
Hello Jan, I guess this is because you wrote ${hivevar=PNAME} with an = instead of a : regards, Furcy On Tue, Mar 21, 2017 at 5:50 PM, Brotanek, Jan wrote: > Hi, I have following shell script: > > > > #!/bin/bash > > PNAME=20170320_203225 > > TNAME=tablea1 > >
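A minimal sketch of the correct hivevar syntax (the variable name and value are from the thread; the ALTER TABLE statement is an invented usage example):

```sql
SET hivevar:PNAME=20170320_203225;
-- Referenced with a colon, not an equals sign:
ALTER TABLE tablea1 ADD PARTITION (pname='${hivevar:PNAME}');
```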

Re: Is it possible to use LOAD DATA INPATH with a PARTITIONED, STORED AS PARQUET table?

2017-04-06 Thread Furcy Pin
Hi Dmitry, If I understand what you said correctly: At the beginning you have csv files on hdfs, and at the end you want a partitioned Hive table as parquet. And your question is: "can I do this using only one Hive table and a LOAD statement?" The answer to that question is "no". The correct

Re: Is it possible to use LOAD DATA INPATH with a PARTITIONED, STORED AS PARQUET table?

2017-04-07 Thread Furcy Pin
Hi Dmitry, I believe what you are referring to is similar to what the MSCK REPAIR TABLE command does in Hive: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-RecoverPartitions(MSCKREPAIRTABLE) Simply put: if you define a partitioned Hive table via a CREATE

Re: How can I identify permissions issues given problematic SQL?

2017-08-16 Thread Furcy Pin
Hi Giles, Did you try running the query with the EXPLAIN AUTHORIZATION clause? https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Explain#LanguageManualExplain-TheAUTHORIZATIONClause On Wed, Aug 16, 2017 at 3:08 AM, Bear Giles wrote: > Hi, an issue came up
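A sketch of the suggestion; any query can be prefixed this way (the table name is invented):

```sql
-- Prints the inputs and outputs of the query and which
-- privileges the current user would need to run it.
EXPLAIN AUTHORIZATION
SELECT * FROM some_table;
```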

Introducing a new open source project: flamy

2017-07-05 Thread Furcy Pin
the repository and to spread the word! Thanks, Furcy Pin CTO @ Flaminem

Re: drop table - external - aws

2017-05-17 Thread Furcy Pin
Did you try ALTER TABLE ... SET LOCATION ... ? maybe it could have worked. On Wed, May 17, 2017 at 6:57 PM, Vihang Karajgaonkar wrote: > This is interesting and possibly a bug. Did you try changing them to > managed tables and then dropping or truncating them? How do we

Re: drop table - external - aws

2017-05-17 Thread Furcy Pin
; > On Wed, May 17, 2017 at 10:07 AM, Furcy Pin <furcy@flaminem.com> > wrote: > >> Did you try ALTER TABLE ... SET LOCATION ... ? maybe it could have worked. >> >> >> On Wed, May 17, 2017 at 6:57 PM, Vihang Karajgaonkar <vih...@cloudera.com >> >

Re: Format dillema

2017-06-20 Thread Furcy Pin
Another option would be to try Facebook's Presto https://prestodb.io/ Like Impala, Presto is designed for fast interactive querying over Hive tables, but it is also capable of querying data from many other SQL sources (mySQL, postgreSQL, Kafka, Cassandra, ...

Re: Why is a single INSERT very slow in Hive?

2017-09-12 Thread Furcy Pin
collect the records and do > a bulk load instead of using single insertion for each record. > > > Jorn and Furcy, thanks again for the explanations and suggestions! > > > On Sep 12, 2017 4:53 AM, "Furcy Pin" <furcy@flaminem.com> wrote: > > Hi, > >

Re: Why is a single INSERT very slow in Hive?

2017-09-12 Thread Furcy Pin
Hi, this is a very common question, as many people knowing SQL are used to RDBMS like MySQL, Oracle, or SQL Server. The first thing you need to know about Hive is that, in the first place, it has not been designed to replace such databases. Not when they are used for transaction processing

Re: One column into multiple column.

2017-08-24 Thread Furcy Pin
Hello, You can use the split(string, pattern) UDF, that returns an array. You can compute this array in a subquery, and then assign a[0], a[1], a[2] ... to each column. The split UDF will only be called once per row. On Thu, Aug 24, 2017 at 12:11 AM, Deepak Khandelwal <
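A sketch of the pattern described (the delimiter, table and column names are assumptions):

```sql
-- split() is called once per row in the subquery;
-- the outer query fans the array out into columns.
SELECT a[0] AS col1,
       a[1] AS col2,
       a[2] AS col3
FROM (
  SELECT split(full_string, ',') AS a
  FROM source_table
) t;
```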

Re: ORC Transaction Table - Spark

2017-08-24 Thread Furcy Pin
As far as I know, Spark can't read Hive's transactional tables yet: https://issues.apache.org/jira/browse/SPARK-16996 On Thu, Aug 24, 2017 at 4:34 AM, Aviral Agarwal wrote: > So, there is no way possible right now for Spark to read Hive 2.x data ? > > On Thu, Aug 24,

Re: LEFT JOIN and WHERE CLAUSE - How to handle

2017-08-23 Thread Furcy Pin
I would suggest to use a subquery WITH unique_currency AS ( SELECT CCY_CD, MAX(CNTRY_DESC) as CNTRY_DESC FROM CURRENCY GROUP BY CCY_CD ) and then perform your left join on it. Some SQL engine (e.g. Presto) have aggregation functions like arbitrary(col) that take any value and are
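A sketch completing the suggestion, joining the deduplicated CTE back to the main table (fact_table and the join key usage are assumptions; the CTE is from the thread):

```sql
WITH unique_currency AS (
  -- one arbitrary-but-deterministic description per currency code
  SELECT CCY_CD, MAX(CNTRY_DESC) AS CNTRY_DESC
  FROM CURRENCY
  GROUP BY CCY_CD
)
SELECT f.*, u.CNTRY_DESC
FROM fact_table f
LEFT JOIN unique_currency u
  ON f.CCY_CD = u.CCY_CD;
```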

Re: [Questio]Which record does Hive give the bigger number when I use row_number

2017-10-11 Thread Furcy Pin
Hello, Either one can receive the bigger row_num, in a nondeterministic fashion (which is NOT equivalent to random). Simply put, it will be whichever is processed last by Hive, which you have no way to know. If your two rows differ on other columns, you might want to add them to your ORDER BY
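A sketch of the tie-breaking advice (table and column names are invented):

```sql
SELECT *,
       ROW_NUMBER() OVER (
         PARTITION BY user_id
         -- extra columns break ties, making the
         -- numbering deterministic across runs
         ORDER BY event_time, event_id
       ) AS row_num
FROM events;
```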

Re: Hive locking mechanism on read partition.

2017-10-12 Thread Furcy Pin
According to the documentation link you gave: "A 'S' lock on table and relevant partition is acquired when a read is being performed. For all other operations, an 'X' lock is taken on the partition. However, if the change is only applicable to the newer partitions, a 'S' lock is acquired on the

Re: migrate hive cli to beeline

2017-11-22 Thread Furcy Pin
Hello, As the table explains, it really depends on the number of concurrent connections you will have on your HiveServer2, but I confirm that you can experience OOM failures on either the HiveServer2 or the Metastore if too many concurrent queries are run simultaneously. For larger clusters, as

Re: "IS [NOT] NULL for a complex type"

2017-11-29 Thread Furcy Pin
Hello Jinchul, in SQL, values of any type (even complex types such as structs) can be NULL. And this happens as often as you use (LEFT | RIGHT | FULL) JOINs: when a record is not matched in the joined table, it will be NULL. As far as I remember, directly creating NULLs with complex types (for testing

Re: Error "Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient"

2017-11-21 Thread Furcy Pin
Hi, I recommend you try using HiveServer2 and beeline as explained here: https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-RunningHiveServer2andBeeline.1 In particular, please make sure you have run "bin/schematool -dbType derby -initSchema" and that it returned

Re: Options for connecting to Apache Hive

2017-11-11 Thread Furcy Pin
Hi Jacob, Like Elliot and Alan said, cross-language exchanges seem to be handled using Thrift, and there are some ODBC drivers too. I didn't find a C library, even though there are Python, Ruby and NodeJS libraries. Also, when googling, beware not to confuse HiveServer1 (sometimes referred to as

Re: Cannot create external table on S3; class S3AFileSystem not found

2017-12-10 Thread Furcy Pin
Hi Scott, I ran into similar issues before. I think that you need to do two things to enable Hive access to s3: 1. add the following jars to your Hive classpath, for instance by using symlinks ln -s $HADOOP_HOME/share/hadoop/tools/lib/aws-java-sdk-1.7.4.jar $HIVE_HOME/lib/. ln -s

Re: partitioned hive table

2017-10-31 Thread Furcy Pin
Hi, If you want to load pre-existing records, instead of inserting data in this partition, you should use the ADD PARTITION statement https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AddPartitions or simply the MSCK REPAIR TABLE statement
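The two statements mentioned, sketched (table name, partition column and path are invented):

```sql
-- Register one pre-existing directory as a partition:
ALTER TABLE my_table ADD PARTITION (ds='2017-10-31')
LOCATION '/data/my_table/ds=2017-10-31';

-- Or scan the table location and register every
-- partition directory found there:
MSCK REPAIR TABLE my_table;
```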

Re: Partition Pruning using UDF

2018-05-16 Thread Furcy Pin
/r1.2.2/api/org/apache/hadoop/hive/ql/udf/UDFType.html> > ) > > I think that I need something like 'eager evaluation' --> evaluate UDFs > before build physical plan (if not you can't do partition pruning) > > On 15 May 2018 at 09:21, Furcy Pin <pin.fu...@gmail.co

Re: Partition Pruning using UDF

2018-05-15 Thread Furcy Pin
Hi Alberto, If I'm not mistaken, to make sure that this works you need to give the proper annotations in your UDF code (deterministic, and maybe some others). You may also need to return a constant ObjectInspector in the initialize method so that Hive knows that it can perform partition pruning with it.

Re: Question about efficiency of SELECT DISTINCT

2018-07-02 Thread Furcy Pin
Hi, They are rigorously equivalent. You can see this with the following queries: CREATE TABLE t1 (a INT, b INT, c INT) ; EXPLAIN SELECT DISTINCT a,b,c FROM t1 ; EXPLAIN SELECT a,b,c FROM t1 GROUP BY a,b,c ; Both queries will return the exact same query plan: Stage-0 Fetch Operator

Re: Exploding arrays of structs

2018-05-01 Thread Furcy Pin
Hi Tom, High-level manipulation of arrays of structs is generally where SQL reaches its limits in terms of expressivity and where Spark starts to get my preference. Unfortunately I don't know any public UDF that can do exactly what you want, but coding one is not that hard. If you use Scala,

Re: insert overwrite to hive orc table in aws

2018-05-01 Thread Furcy Pin
I suggest that you do a DESCRIBE FORMATTED orc and check directly on s3 the location of your data. Did you also try a SELECT * FROM orc LIMIT 10; ? Did it return something? On Fri, 27 Apr 2018, 21:30 Sowjanya Kakarala, wrote: > > > > >> Hi Guys, >> >> I am new to

Re: Dynamic vs Static partition

2018-01-09 Thread Furcy Pin
You need to enable nonstrict mode to be able to use dynamic partitioning. Dynamic partitioning is more automatic, but might take more time because Hive has to write to multiple outputs at once. With MapReduce, this uses MultipleOutputs, and can fail with OOM when writing too many partitions at
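The settings referred to, as a sketch:

```sql
-- Required before a dynamic-partition INSERT where no
-- static value is given for the partition column:
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
```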

Re: HQL parser internals

2018-02-17 Thread Furcy Pin
Hi Elliot, Actually, I have done quite similar work regarding Hive custom Parsing, you should have a look at my project: https://github.com/flaminem/flamy The Hive parsing related stuff is here: https://github.com/flaminem/flamy/tree/master/src/main/scala/com/flaminem/flamy/parsing/hive A good

Re: Why the filter push down does not reduce the read data record count

2018-02-23 Thread Furcy Pin
wrote: > I got your point and thanks for the nice slides info. > > > So the parquet filter is not an easy thing and I will try that according > to the deck. > > > Thanks ! > ------ > *From:* Furcy Pin <pin.fu...@gmail.com> >

Re: Why the filter push down does not reduce the read data record count

2018-02-23 Thread Furcy Pin
Hi, Unless your table is partitioned or bucketed by myid, Hive generally has to read through all the records to find those that match your predicate. In other words, Hive tables are generally not indexed for single-record retrieval the way you would expect RDBMS tables or Vertica tables

Re: 答复: Re: Does Hive SQL support reading data without locking?

2018-01-02 Thread Furcy Pin
You can disable locking by setting SET hive.support.concurrency=false ; It can be set in a single session before a query. However, locking exists for a reason: what do you think will happen if you overwrite the files of a table while Tableau is reading them? It really depends on your use case, but

Re: Optimal approach for changing file format of a partitioned table

2018-08-06 Thread Furcy Pin
Hi Elliot, From your description of the problem, I'm assuming that you are doing a INSERT OVERWRITE table PARTITION(p1, p2) SELECT * FROM table or something close, like a CREATE TABLE AS ... maybe. If this is the case, I suspect that your shuffle phase comes from dynamic partitioning, and in
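A sketch of a common mitigation for dynamic-partition memory pressure (not necessarily what the thread settled on; table names are generic): routing rows so that each reducer receives whole partitions and keeps only a few files open.

```sql
INSERT OVERWRITE TABLE target PARTITION (p1, p2)
SELECT *
FROM source
-- each reducer now writes a small number of partitions
-- instead of one open file per partition per task
DISTRIBUTE BY p1, p2;
```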

Re: Hive output file 000000_0

2018-08-08 Thread Furcy Pin
Sujeet Singh Pardeshi > > Software Specialist > > SAS Research and Development (India) Pvt. Ltd. > > Level 2A and Level 3, Cybercity, Magarpatta, Hadapsar Pune, Maharashtra, > 411 013 > *o*ff: +91-20-30418810 > [image: Description: untitled] > > *"When the solu

Re: External Table Creation is slow/hangs

2018-08-16 Thread Furcy Pin
Hi, I can't tell for sure where your problem is coming from, but from what you said, I guess that the Hive Metastore is performing some list or scan operation on the files and that operation is taking a very long time. maybe setting *hive.stats.autogather* to false might help. Also, beware that

Re: Auto Refresh Hive Table Metadata

2018-08-10 Thread Furcy Pin
Hi Chintan, Yes, this sounds weird... "REFRESH TABLES" is the kind of statement required by SQL engines such as Impala, Presto or Spark-SQL that cache metadata from the Metastore, but vanilla Hive usually doesn't cache it and queries the metastore every time (unless some new feature was added

Re: Hive output file 000000_0

2018-08-08 Thread Furcy Pin
It might sound silly, but isn't that what Hive is supposed to do, being a distributed computation framework and all? Hive will write one file per reducer, called 000000_0, 000001_0, etc., where the number corresponds to the number of your reducer. Sometimes the _0 will be a _1 or _2 or more depending

Re: what's the best practice to create an external hive table based on a csv file on HDFS with 618 columns in header?

2018-07-24 Thread Furcy Pin
Hello, To load your data as parquet, you can either: A. use spark: https://docs.databricks.com/spark/latest/data-sources/read-csv.html and write it directly as a parquet file (df.write.format("parquet").saveAsTable("parquet_table")) B. Load it as a csv file in Hive, and perform a CREATE TABLE
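Option B from the reply, sketched (names, delimiter and path are assumptions; the real staging table would list all 618 columns):

```sql
-- 1) Expose the raw csv as an external table:
CREATE EXTERNAL TABLE csv_staging (
  col1 STRING,
  col2 STRING
  -- ... plus the remaining 616 columns
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/csv_staging';

-- 2) Copy it into a Parquet-backed table:
CREATE TABLE parquet_table
STORED AS PARQUET
AS SELECT * FROM csv_staging;
```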

Re: Will partition pruning will happen on HIVE views?

2018-01-23 Thread Furcy Pin
Hi, I might be wrong, but I would expect views to work as a simple replacement at the query level (like a subquery or a CTE). This means that partition pruning should work as long as predicate pushdown properly works. For your specific use case, it should be easy enough to test, and you can use

Re: Building Datwarehouse Application in Spark

2018-04-04 Thread Furcy Pin
Hi Mahender, Did you look at this? https://www.snappydata.io/blog/the-spark-database But I believe that most people handle this use case by either using: - Their favorite regular RDBMS (mySQL, postgres, Oracle, SQL-Server, ...) if the data is not too big - Their favorite New-SQL storage

Re: ALTER TABLE DROP PARTITION not working on S3

2018-04-05 Thread Furcy Pin
Hi Richard, could you please check whether your table is EXTERNAL? You can see it with a "DESCRIBE FORMATTED table_name ;" That's what external tables are for: they don't delete the underlying data when you drop them. On 5 April 2018 at 15:18, Richard A. Bross wrote: > I think

Re: ALTER TABLE DROP PARTITION not working on S3

2018-04-05 Thread Furcy Pin
t; Compressed: No > Num Buckets:61 > Bucket Columns: [crs_id] > Sort Columns: [] > Storage Desc Params: > serialization.format1 > Time taken: 0.467 seconds, Fetched: 98 row(s) > > > - Original Message - > From: "Furcy Pin" &

Re: ALTER TABLE DROP PARTITION not working on S3

2018-04-05 Thread Furcy Pin
¯\_(ツ)_/¯ On 5 April 2018 at 16:02, Richard A. Bross <r...@oaktreepeak.com> wrote: > I can't duplicate the issue now. Works like it always has . . > > - Original Message - > From: "Furcy Pin" <pin.fu...@gmail.com> > To: user@hive.apache.org >

Re: HQL parser internals

2018-04-13 Thread Furcy Pin
Dharmesh Kakadia <dhkaka...@gmail.com> > wrote: > > +1 for using ParseDriver for this. I also have used it to intercept and > augment query AST. > > > > Also, I would echo others sentiment that its quite ugly. It would be great > if we can refactor/standardize this. T

Re: Does Hive support Hbase-synced partitioned tables?

2018-04-22 Thread Furcy Pin
Hi Oleksiy, I must say that I don't know whether partitioned HBase-backed tables are supported in Hive, but I don't understand why you would need that. What are you trying to do exactly? I suspect that you could do it by using composite keys (Department, doc_id). Also, I would advise against using

Re: HQL parser internals

2018-03-20 Thread Furcy Pin
I would echo others sentiment that its quite ugly. It would be >>> great if we can refactor/standardize this. That will make integrating >>> other system a lot easier. >>> >>> Thanks, >>> Dharmesh >>> >>> On Sat, Feb 17, 2018 at 12:0

Re: Comparing Google Cloud Platform BiqQuery with Hive

2019-01-14 Thread Furcy Pin
Hi Mich, Contrary to what you said, I can confirm that BQ is able to read ORC files compressed with Snappy. However, BQ requires performing a load operation from the ORC file on Google Storage to convert it into a BQ table. The main advantages I see with BQ are the guaranteed very high

Re: just released: Docker image of a minimal Hive server

2019-02-21 Thread Furcy Pin
Hello! If that might help, I did this repo a while ago: https://github.com/FurcyPin/docker-hive-spark It provides a pre-installed Hive Metastore and a HiveServer running on Spark (Spark-SQL, not Hive on Spark). I also did some config to access AWS s3 data with it. Cheers, Furcy On Thu, 21 Feb

Re: How to manage huge partitioned table with 1000+ columns in Hive

2019-11-26 Thread Furcy Pin
Hello, Sorry for the late reply, but this problem is very interesting. How did you end up solving it in the end? I have an idea which is very ugly but might work: create a big view that is a union of all partitions SELECT '2019-10-01' as ds, * FROM test_1 a JOIN test_2 b ON a.id = b.id JOIN