Re: How to manage huge partitioned table with 1000+ columns in Hive

2019-11-26 Thread Furcy Pin
Hello,

Sorry for the late reply, but this problem is very interesting. How did you
end up solving it in the end?

I have an idea which is very ugly but might work:

Create a big view that is an union of all partitions

SELECT
'2019-10-01' as ds, *
FROM test_1 a
JOIN test_2 b ON a.id = b.id
JOIN test_3 c ON c.id = a.id
WHERE a.ds = '2019-10-01' AND b.ds = '2019-10-01' AND c.ds = '2019-10-01'
UNION ALL
SELECT
'2019-10-02' as ds, *
FROM test_1 a
JOIN test_2 b ON a.id = b.id
JOIN test_3 c ON c.id = a.id
WHERE a.ds = '2019-10-02' AND b.ds = '2019-10-02' AND c.ds = '2019-10-02'
UNION ALL
...
;


That way, each part of the union will use the *sortedmerge* optimization,
and hopefully, if the optimizer is smart enough, when you filter on several
days it will be able to prune the parts of the union that don't match.
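
For illustration, a minimal sketch of how this could be wrapped and queried
(the view name daily_joined is hypothetical, and the column list reuses the
test tables shown below):

CREATE VIEW daily_joined AS
SELECT '2019-10-01' as ds, a.id, a.c1, a.c2, b.c3, b.c4, c.c5, c.c6
FROM test_1 a
JOIN test_2 b ON a.id = b.id
JOIN test_3 c ON c.id = a.id
WHERE a.ds = '2019-10-01' AND b.ds = '2019-10-01' AND c.ds = '2019-10-01'
UNION ALL
...
;

-- if the optimizer prunes the union branches, this should only read one day:
SELECT * FROM daily_joined WHERE ds = '2019-10-01';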

I haven't tested it, so I can't tell whether it works or not. It's just an idea.
If Hive has a limitation on the maximum resolved query size, it might reach it,
though.

Hope this helps.

Furcy


On Wed, 2 Oct 2019 at 11:09, Pau Tallada  wrote:

> Hi,
>
> I would say the most efficient way would be option (3), where all the
> subtables are partitioned by date, and clustered+**sorted** by id.
> This way, efficient SMB map joins can be performed over the 10 tables of
> the same partition.
>
> Unfortunately, I haven't found a way to achieve SMB map joins* over more
> than one* partition :(
>
> Example:
>
> CREATE TABLE test_1 (id INT, c1 FLOAT, c2 FLOAT)
> PARTITIONED BY (ds STRING)
> CLUSTERED BY (id)
> SORTED BY (id ASC)
> INTO 4 BUCKETS
> STORED AS ORC;
>
> CREATE TABLE test_2 (id INT, c3 FLOAT, c4 FLOAT)
> PARTITIONED BY (ds STRING)
> CLUSTERED BY (id)
> SORTED BY (id ASC)
> INTO 4 BUCKETS
> STORED AS ORC;
>
> CREATE TABLE test_3 (id INT, c5 FLOAT, c6 FLOAT)
> PARTITIONED BY (ds STRING)
> CLUSTERED BY (id)
> SORTED BY (id ASC)
> INTO 4 BUCKETS
> STORED AS ORC;
>
> Over these tables, one can perform efficient single-stage SMB map joins, *if
> you filter on a single partition*:
>
> set hive.execution.engine=tez;
> set hive.enforce.sortmergebucketmapjoin=false;
> set hive.optimize.bucketmapjoin=true;
> set hive.optimize.bucketmapjoin.sortedmerge=true;
> set hive.auto.convert.sortmerge.join=true;
> set hive.auto.convert.join=true;
> set hive.auto.convert.join.noconditionaltask.size=0;
>
> EXPLAIN
> SELECT *
> FROM test_1 a
> JOIN test_2 b
>   ON a.id = b.id AND a.ds = b.ds
> JOIN test_3 c
>   ON b.id = c.id AND b.ds = c.ds
> WHERE a.ds = '2019-10-01'
> ;
>
> When you try to query on two partitions, then it does a shuffle :(
>
> set hive.execution.engine=tez;
> set hive.enforce.sortmergebucketmapjoin=false;
> set hive.optimize.bucketmapjoin=true;
> set hive.optimize.bucketmapjoin.sortedmerge=true;
> set hive.auto.convert.sortmerge.join=true;
> set hive.auto.convert.join=true;
> set hive.auto.convert.join.noconditionaltask.size=0;
>
> EXPLAIN
> SELECT *
> FROM test_1 a
> JOIN test_2 b
>   ON a.id = b.id AND a.ds = b.ds
> JOIN test_3 c
>   ON b.id = c.id AND b.ds = c.ds
> WHERE a.ds IN ('2019-10-01', '2019-10-02')
> ;
>
>
> My problem is very similar, so let's hope someone out there has the answer
> :)
>
> Cheers,
>
> Pau.
>
> Message from Saurabh Santhosh on Tue, 1 Oct 2019 at 08:48:
>
>> Hi,
>>
>> I am facing the following problem while trying to store/use a huge
>> partitioned table with 1000+ columns in Hive. I would like to know how to
>> solve this problem either using hive or any other store.
>>
>> Requirement:
>>
>> 1).There is a table with around 1000+ columns which is partitioned by
>> date.
>> 2).Every day consist of data about around 500 million entities. There
>> will be an id column with the id of the entity and around 1000+ columns
>> which represent attributes of given entity for each day.
>> 3).We have to store data for around 2 years
>> 4). New columns may be added/logic of existing column may be changed any
>> day and when this happens we have to populate data for the given column for
>> last 2 years
>>
>>
>> Our Solution 1:
>>
>> 1). We created a table with 1000+ columns and partitioned by date.
>> 2). Every day we create a new partition and delete partition older than 2
>> years
>>
>> Problems Faced in Solution 1:
>>
>> Whenever we had to add/modify certain columns, the backfill of data took
>> a long time and it was taking months to backfill the data for 2 years (this
>> was because there is lot of IO due to the read/write of each partition)
>>
>>
>> Our Solution 2:
>>
>> 1). We created 10 tables with around 100+ columns each and each of them
>> was partitioned by date.
>> 2). Every day we create a new partition in each of the small tables and
>> delete partition older than 2 years
>> 3). Created a view which was a join between all the tables with id, date
>> as join key
>>
>>
>> Problems Faced in Solution 2:
>>
>> Now the backfill time was considerably reduced from months to weeks as we
>> need to only refresh the small table which contained the columns to be
>> backfilled thus 

Re: just released: Docker image of a minimal Hive server

2019-02-21 Thread Furcy Pin
Hello!

If that might help, I did this repo a while ago:
https://github.com/FurcyPin/docker-hive-spark
It provides a pre-installed Hive Metastore and a HiveServer running on
Spark (Spark-SQL not Hive on Spark)

I also added some config to access AWS S3 data with it.

Cheers,

Furcy

On Thu, 21 Feb 2019 at 18:30, Edward Capriolo  wrote:

> Good deal and great name!
>
> On Thu, Feb 21, 2019 at 11:31 AM Aidan L Feldman (CENSUS/ADEP FED) <
> aidan.l.feld...@census.gov> wrote:
>
>> Hi there-
>>
>> I am a new Hive user, working at the US Census Bureau. I was interested
>> in getting Hive running locally, but wanted to keep the dependencies
>> isolated. I could find Hadoop and Hive Docker images, but not one that had
>> both. Therefore, I present:  WeeHive , *a
>> minimal-as-possible Hive deployment*! This allows getting Hive up and
>> running (for development, not production) in a handful of steps.
>>
>>
>> I'm new to Hadoop and Hive, so I'm sure there are improvements that could
>> be made. Feedback (email, issues, or pull requests) welcome.
>>
>>
>> Enjoy!
>>
>>
>> Aidan Feldman
>>
>> xD (Experimental Data) team
>>
>> Office of Program, Performance, and Stakeholder Integration (PPSI)
>>
>> Office of the Director
>>
>> Census Bureau
>>
>>


Re: Comparing Google Cloud Platform BiqQuery with Hive

2019-01-14 Thread Furcy Pin
Hi Mich,

Contrary to what you said, I can confirm that BQ is able to read ORC
files compressed with Snappy.
However, BQ requires you to perform a load operation from the ORC file on
Google Storage and convert it into a BQ table.

The main advantages I see with BQ are the guaranteed very high scalability
and low query latency without having to manage a Hadoop cluster yourself.

I would not say, however, that you can simply plug your existing HQL queries
into BQ. All the useful analytics functions are indeed there, but in many cases
they have a different name.
For instance, the equivalent of Hive's UDF *trunc* in BQ is *date_trunc*.
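
As a quick sketch (worth double-checking against each engine's documentation),
truncating a date to the start of its month looks like this:

-- Hive
SELECT trunc('2019-01-14', 'MM');             -- 2019-01-01
-- BigQuery
SELECT DATE_TRUNC(DATE '2019-01-14', MONTH);  -- 2019-01-01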

In my use case I use pyspark for complex transformations and BQ as a
warehouse to plug Power BI into.
So for a fair comparison, I think you should compare BQ with Vertica,
Presto, Impala or Hive LLAP rather than just Hive.

Regards,

Furcy




On Fri, 11 Jan 2019 at 11:18, Mich Talebzadeh 
wrote:

> Hi,
>
> Has anyone got some benchmarks on comparing Hive with Google Cloud
> Platform (GCP) BiqQuery (BQ)?
>
> From my experience, BQ supports both Avro and ORC file types.
> There is no support for compressed ORC or AVRO. So if you want to load a
> Hive table into BQ, you will need to create a table with no compression. In
> short you need to perform ETL to move a Hive table to BQ.
>
> On the other hand BQ seems to support all analytical functions available
> in Hive so your queries should run without any modification in BQ.
>
> On the other hand Dataproc tool in GCP also supports Hive (though I have
> not tried it myself). So the question is are there any advantages taking a
> Hive table into BQ itself?
>
> Thanks,
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>


Re: External Table Creation is slow/hangs

2018-08-16 Thread Furcy Pin
Hi,

I can't tell for sure where your problem is coming from, but from what you
said, I guess that the Hive Metastore is performing some list or scan
operation on the files
and that operation is taking a very long time.

Maybe setting *hive.stats.autogather* to false might help.
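
For instance (a minimal sketch, assuming the setting is picked up by your
HiveServer session):

SET hive.stats.autogather=false;
-- then re-run your CREATE EXTERNAL TABLE ... statement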

Also, beware that some configuration parameters that apply to the Metastore
cannot be changed via a SET operation,
and require you to change the configuration file of your Metastore service
and restart it.
Maybe that's why some of the conf changes you tried had no effect...

Also, don't hesitate to provide more details about what type of query you
run (e.g. is your table partitioned? etc.)
and what configuration tweaks you tried already.

Hope this helps,

Furcy







On Tue, 14 Aug 2018 at 21:39, Luong, Dickson 
wrote:

> I have a dataset up on S3 in partitioned folders. I'm trying to create an
> external hive table pointing to the location of that data. The table schema
> is set up to have the column partitions matching how the folders are set up
> on S3.
>
> I've done this quite a few times successfully, but when the data is large
> the table creation query is either extremely slow or it hangs (We can't
> tell).
>
> I've followed some of the tips in
> https://hortonworks.github.io/hdp-aws/s3-hive/index.html#general-performance-tips
> by configuring some of the parameters involving file permission and file
> size checks to adjust for S3 but still no luck.
>
> We're using EMR 5.12.1 which contains Hive 2.3.2. The table creation query
> does not show up in the Tez UI, but it does show up in the HiveServer UI as
> running, but we're not sure if it actually is or just hung (most likely the
> latter).
>
> Our (very roundabout) solution so far is to copy all the files in that
> master folder to another directory, delete the files, create the external
> table when the directory is empty, and to transfer the files back. We need
> to keep the original directory name as other processes depend on it and
> can't simply just start in a fresh directory, so this whole method is
> obviously not ideal.
>
> Any tips / solutions to this problem we've been tackling would be greatly
> appreciated.
>
> Dickson
>


Re: Auto Refresh Hive Table Metadata

2018-08-10 Thread Furcy Pin
Hi Chintan,

Yes, this sounds weird...

"REFRESH TABLES" is the kind of statement required by SQL engines such as
Impala, Presto or Spark-SQL that cache metadata from the Metastore, but
vanilla Hive usually don't cache it and query the metastore every time
(unless some new feature was added recently, in which case it is probably
be possible to disable it with some option).
In other words, as long as you add new files to your existing partitions,
they should be automatically readable from Hive.
If you add new partitions, that's a different story, of course.
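
In that case, the usual way to make new partitions visible (a sketch, with a
hypothetical table name) is something like:

MSCK REPAIR TABLE my_table;
-- or, for a single known partition:
ALTER TABLE my_table ADD IF NOT EXISTS PARTITION (ds='2018-08-10');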

Are you sure you are using Hive here, and not Spark-SQL or something else?

By the way, if you want near-real-time tables with Hive, maybe you should
have a look at this project from Uber: https://uber.github.io/hudi/
I don't know how mature it is yet, but I think it aims at solving that kind
of challenge.

Regards,

Furcy

On Thu, 9 Aug 2018 at 18:30, Will Du  wrote:

> I never experienced this kind of issue. Once data is loaded to HDFS by
> sink, the data is available in hive.
>
> Sent from my iPhone
>
> On Aug 9, 2018, at 10:18, Chintan Patel  wrote:
>
> Hello Will Du,
>
> I'm using Kafka connector to create hive database. All the data are stored
> in s3 bucket and using mysql database for metastore.
>
> For example, if the connector adds new records to a hive table and I run a
> query, it does not return the latest data and I have to run refresh table
> {table_name} to clear the metastore cache. Now if I have 1000 hive tables
> and I want to update those tables every 5 mins, running a refresh query is
> not a good idea I guess.
>
> So I was thinking that if hive has some type of mechanism to do it in the
> background, that would be good.
>
>
> On 9 August 2018 at 17:51, Will Du  wrote:
>
>> any reason to do this?
>>
>> Sent from my iPhone
>>
>> > On Aug 9, 2018, at 07:57, Chintan Patel  wrote:
>> >
>> > Hello,
>> >
>> > I want to refresh external type hive table metadata on some regular
>> interval without using "refresh table {table_name}".
>> >
>> > Thanks & Regards
>> >
>>
>
>


Re: Hive output file 000000_0

2018-08-08 Thread Furcy Pin
Indeed, if your table is big, then you DO want to parallelise and setting
the number of reducers to 1 is clearly not a good idea.

From what you explained, I still don't understand what your exact problem
is.
Can't you just leave your query as is and be okay with many part_X
files?




On Wed, 8 Aug 2018 at 16:20, Sujeet Pardeshi 
wrote:

> Thanks Furcy. Will the below setting not hit performance? Just one
> reducer? I am processing huge data and cannot compromise on performance.
>
>
>
> SET mapred.reduce.tasks = 1
>
>
>
>
>
> Regards,
>
> Sujeet Singh Pardeshi
>
> Software Specialist
>
> SAS Research and Development (India) Pvt. Ltd.
>
> Level 2A and Level 3, Cybercity, Magarpatta, Hadapsar  Pune, Maharashtra,
> 411 013
> *o*ff: +91-20-30418810
>
>  *"When the solution is simple, God is answering…" *
>
>
>
> *From:* Furcy Pin 
> *Sent:* 08 August 2018 PM 06:37
> *To:* user@hive.apache.org
> *Subject:* Re: Hive output file 000000_0
>
>
>
> *EXTERNAL*
>
> It might sound silly, but isn't it what Hive is supposed to do, being a
> distributed computation framework and all ?
>
>
>
> Hive will write one file per reducer, called 000000_0, 000001_0, etc. where
> the number corresponds to the number of your reducer.
>
> Sometimes the _0 will be a _1 or _2 or more depending on the retries or
> speculative execution.
>
>
>
> This is how MapReduce works, and this is how Tez and Spark work too: you
> can't have several executors writing simultaneously in the same file on
> HDFS
>
> (maybe you can on s3, but not through the HDFS api). So each executor
> writes in its own file.
>
> If you want to read back the data with MapReduce, Tez or Spark, you simply
> specify the folder path, and all files in it will be read. That's what Hive
> does too.
>
>
>
> Now, if you really want only one file (for instance to export it as a csv
> file), I guess you can try to add this before your query
>
> (if you use MapReduce, there probably are similar configurations for Hive
> on Tez or Hive on Spark)
>
> SET mapred.reduce.tasks = 1
>
>
>
> And if you really suspect that some files are not correctly overwritten
> (like it happened to me once with Spark on Azure blob storage), I suggest
> that you check the file timestamp
>
> to determine which execution of your query wrote it.
>
>
>
>
>
> Hope this helps,
>
>
>
> Furcy
>
>
>
> On Wed, 8 Aug 2018 at 14:25, Sujeet Pardeshi 
> wrote:
>
> Hi Deepak,
> Thanks for your response. The table is not bucketed or clustered. It can
> be seen below.
>
> DROP TABLE IF EXISTS ${SCHEMA_NM}. daily_summary;
> CREATE EXTERNAL TABLE ${SCHEMA_NM}.daily_summary
> (
>   bouncer VARCHAR(12),
>   device_type VARCHAR(52),
>   visitor_type VARCHAR(10),
>   visit_origination_type VARCHAR(65),
>   visit_origination_name VARCHAR(260),
>   pg_domain_name VARCHAR(215),
>   class1_id VARCHAR(650),
>   class2_id VARCHAR(650),
>   bouncers INT,
>   rv_revenue DECIMAL(17,2),
>   visits INT,
>   active_page_view_time INT,
>   total_page_view_time BIGINT,
>   average_visit_duration INT,
>   co_conversions INT,
>   page_views INT,
>   landing_page_url VARCHAR(1332),
>   dt DATE
>
> )
> PARTITIONED BY (datelocal DATE)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY '\001'
> LOCATION '${OUTPUT_PATH}/daily_summary/'
> TBLPROPERTIES ('serialization.null.format'='');
>
> MSCK REPAIR TABLE ${SCHEMA_NM}.daily_summary;
>
> Regards,
> Sujeet Singh Pardeshi
> Software Specialist
> SAS Research and Development (India) Pvt. Ltd.
> Level 2A and Level 3, Cybercity, Magarpatta, Hadapsar  Pune, Maharashtra,
> 411 013
> off: +91-20-30418810
>
>  "When the solution is simple, God is answering…"
>
> -Original Message-
> From: Deepak Jaiswal 
> Sent: 07 August 2018 PM 11:19
> To: user@hive.apache.org
> Subject: Re: Hive output file 000000_0
>
> EXTERNAL
>
> Hi Sujeet,
>
> I am assuming that the table is bucketed? If so, then the name represents
> which bucket the file belongs to as Hive creates 1 file per bucket for each
> operation.
>
> In this case, the file 000003_0 belongs to bucket 3.
> To always have files named 000000_0, the table must be unbucketed.
> I hope it helps.
>
> Regards,
> Deepak
>
> On 8/7/18, 1:33 AM, "Sujeet Pardeshi"  wrote:
>
> Hi All,
> I am doing an Insert overwrite operation through a hive external table
> onto AWS S3. Hive creates an output file 000000_0 on S3. However at times
> I am noticing that it creates files with other names like 00

Re: Hive output file 000000_0

2018-08-08 Thread Furcy Pin
It might sound silly, but isn't it what Hive is supposed to do, being a
distributed computation framework and all ?

Hive will write one file per reducer, called 000000_0, 000001_0, etc., where
the number corresponds to the number of your reducer.
Sometimes the _0 will be a _1 or _2 or more depending on the retries or
speculative execution.

This is how MapReduce works, and this is how Tez and Spark work too: you
can't have several executors writing simultaneously in the same file on
HDFS
(maybe you can on s3, but not through the HDFS api). So each executor
writes in its own file.
If you want to read back the data with MapReduce, Tez or Spark, you simply
specify the folder path, and all files in it will be read. That's what Hive
does too.

Now, if you really want only one file (for instance to export it as a csv
file), I guess you can try to add this before your query
(if you use MapReduce, there probably are similar configurations for Hive
on Tez or Hive on Spark)
SET mapred.reduce.tasks = 1
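
Alternatively, if the goal is only to avoid lots of small files rather than to
get exactly one, Hive's file-merge settings might help (a sketch; availability
and defaults depend on your Hive version and execution engine):

SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;   -- or hive.merge.tezfiles=true on Tez
SET hive.merge.smallfiles.avgsize=128000000;
SET hive.merge.size.per.task=256000000;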

And if you really suspect that some files are not correctly overwritten
(like it happened to me once with Spark on Azure blob storage), I suggest
that you check the file timestamp
to determine which execution of your query wrote it.


Hope this helps,

Furcy

On Wed, 8 Aug 2018 at 14:25, Sujeet Pardeshi 
wrote:

> Hi Deepak,
> Thanks for your response. The table is not bucketed or clustered. It can
> be seen below.
>
> DROP TABLE IF EXISTS ${SCHEMA_NM}. daily_summary;
> CREATE EXTERNAL TABLE ${SCHEMA_NM}.daily_summary
> (
>   bouncer VARCHAR(12),
>   device_type VARCHAR(52),
>   visitor_type VARCHAR(10),
>   visit_origination_type VARCHAR(65),
>   visit_origination_name VARCHAR(260),
>   pg_domain_name VARCHAR(215),
>   class1_id VARCHAR(650),
>   class2_id VARCHAR(650),
>   bouncers INT,
>   rv_revenue DECIMAL(17,2),
>   visits INT,
>   active_page_view_time INT,
>   total_page_view_time BIGINT,
>   average_visit_duration INT,
>   co_conversions INT,
>   page_views INT,
>   landing_page_url VARCHAR(1332),
>   dt DATE
>
> )
> PARTITIONED BY (datelocal DATE)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY '\001'
> LOCATION '${OUTPUT_PATH}/daily_summary/'
> TBLPROPERTIES ('serialization.null.format'='');
>
> MSCK REPAIR TABLE ${SCHEMA_NM}.daily_summary;
>
> Regards,
> Sujeet Singh Pardeshi
> Software Specialist
> SAS Research and Development (India) Pvt. Ltd.
> Level 2A and Level 3, Cybercity, Magarpatta, Hadapsar  Pune, Maharashtra,
> 411 013
> off: +91-20-30418810
>
>  "When the solution is simple, God is answering…"
>
> -Original Message-
> From: Deepak Jaiswal 
> Sent: 07 August 2018 PM 11:19
> To: user@hive.apache.org
> Subject: Re: Hive output file 000000_0
>
> EXTERNAL
>
> Hi Sujeet,
>
> I am assuming that the table is bucketed? If so, then the name represents
> which bucket the file belongs to as Hive creates 1 file per bucket for each
> operation.
>
> In this case, the file 000003_0 belongs to bucket 3.
> To always have files named 000000_0, the table must be unbucketed.
> I hope it helps.
>
> Regards,
> Deepak
>
> On 8/7/18, 1:33 AM, "Sujeet Pardeshi"  wrote:
>
> Hi All,
> I am doing an Insert overwrite operation through a hive external table
> onto AWS S3. Hive creates an output file 000000_0 on S3. However at times
> I am noticing that it creates files with other names like 000003_0 etc. I
> always need to overwrite the existing file but with inconsistent file names
> I am unable to do so. How do I force hive to always create a consistent
> filename like 000000_0? Below is an example of what my code looks like,
> where tab_content is a hive external table.
>
> INSERT OVERWRITE TABLE tab_content
> PARTITION(datekey)
> select * from source
>
> Regards,
> Sujeet Singh Pardeshi
> Software Specialist
> SAS Research and Development (India) Pvt. Ltd.
> Level 2A and Level 3, Cybercity, Magarpatta, Hadapsar  Pune,
> Maharashtra, 411 013
> off: +91-20-30418810
>
>  "When the solution is simple, God is answering…"
>
>
>


Re: Optimal approach for changing file format of a partitioned table

2018-08-06 Thread Furcy Pin
Hi Elliot,

From your description of the problem, I'm assuming that you are doing an
INSERT OVERWRITE table PARTITION(p1, p2) SELECT * FROM table

or something close, like a CREATE TABLE AS ... maybe.

If this is the case, I suspect that your shuffle phase comes from dynamic
partitioning, and in particular from this option (quote from the doc)

hive.optimize.sort.dynamic.partition
>
>- Default Value: true in Hive 0.13.0 and 0.13.1; false in Hive 0.14.0
>and later (HIVE-8151 )
>
>
>- Added In: Hive 0.13.0 with HIVE-6455
>
>
> When enabled, dynamic partitioning column will be globally sorted. This
> way we can keep only one record writer open for each partition value in the
> reducer thereby reducing the memory pressure on reducers.


This option has been added to avoid OOM exceptions when doing dynamic
partitioned insertions, but it has disastrous performance for table copy
operations, where a map-only job should suffice. Disabling this option
before your query should do the trick.
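
Concretely, something like this (a sketch, assuming a dynamic-partitioned
INSERT OVERWRITE; the table names are placeholders):

SET hive.optimize.sort.dynamic.partition=false;

INSERT OVERWRITE TABLE orc_table PARTITION (p1, p2)
SELECT * FROM json_table;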

Also, beware that reading from and inserting to the same partitioned table
may create deadlock issues: https://issues.apache.org/jira/browse/HIVE-12258

Regards,

Furcy


On Sat, 4 Aug 2018 at 13:28, Elliot West  wrote:

> Hi,
>
> I’m trying to simply change the format of a very large partitioned table
> from Json to ORC. I’m finding that it is unexpectedly resource intensive,
> primarily due to a shuffle phase with the partition key. I end up running
> out of disk space in what looks like a spill to disk in the reducers.
> However, the partitioning scheme is identical on both the source and the
> destination so my expectation is a map only job that simply rencodes each
> file.
>
> I’m using INSERT OVERWRITE TABLE with dynamic partitioning. I suspect I
> could resolve my issue by allocating more storage to the task nodes.
> However, can anyone advise a more resource and time efficient approach?
>
> Cheers,
>
> Elliot.
>


Re: what's the best practice to create an external hive table based on a csv file on HDFS with 618 columns in header?

2018-07-24 Thread Furcy Pin
Hello,

To load your data as parquet, you can either:

A. use spark:
https://docs.databricks.com/spark/latest/data-sources/read-csv.html and
write it directly as a parquet file
(df.write.format("parquet").saveAsTable("parquet_table"))
B. Load it as a csv file in Hive, and perform a CREATE TABLE parquet_table
STORED AS PARQUET AS SELECT * FROM csv_table (see the sketch below)
C. Try to use Hue, or any ETL tool to do it (Talend, Dataiku, etc.)

This list is not exhaustive; surely there are other methods out there.
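
As a rough sketch of option B (the table names, column names, delimiter and
header handling are assumptions to adapt to your file):

CREATE EXTERNAL TABLE csv_table (
  col_1 STRING,
  col_2 STRING
  -- ... and the remaining 616 columns
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/path/to/csv/folder'
TBLPROPERTIES ('skip.header.line.count'='1');

CREATE TABLE parquet_table STORED AS PARQUET AS
SELECT * FROM csv_table;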


Now, concerning the fact that your table has 618 columns, there are a few
possibilities:

A. You don't need all of them, and you can select the ones you want while
you transform your csv table into a parquet table
B. You need all of them (after all, if your company gives you a csv with
618 columns, they are probably using all of them at some point);
you can then use the Spark solution above to create the table without
specifying the fields (and you can then do a SHOW CREATE TABLE
parquet_table to get the DDL of the new table if you want to keep it)
C. You can try to normalize your data (i.e. split it into multiple tables).
I would recommend that for a traditional RDBMS, but not for Hive
D. You can group your columns into structs to make them more organized and
easier to select and manage (that's what I would do; see the sketch below)
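
A rough sketch of option D (the group and column names are hypothetical):

CREATE TABLE grouped_table (
  id STRING,
  billing STRUCT<address:STRING,amount:DOUBLE>,
  shipping STRUCT<address:STRING,delay_days:INT>
)
STORED AS PARQUET;

-- grouped columns are then selected with dotted paths:
SELECT id, billing.amount, shipping.delay_days FROM grouped_table;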

Last but not least, if you prefer writing plain SQL, I would advise that
you train yourself to perform multi-line editing like this:
https://youtu.be/ZJyJqI1UVaA?t=65

I would advise trying Sublime Text for that, but most modern IDEs support
this too.
https://www.youtube.com/watch?v=kyiBiXFbbQs_channel=MattThiessen

Hope this helps,

cheers, and good luck

Furcy

On Mon, 23 Jul 2018 at 21:47, Raymond Xie  wrote:

> We are using Cloudera CDH 5.11
>
> I have seen solution for small xlsx files with only handful columns in
> header, in my case the csv file to be loaded into a new hive table has 618
> columns.
>
> 1. Would it be saved as parquet by default if I upload it (save it to csv
> first) through HUE -> File Browser? If not, where can I specify the file
> format?
> 2. What would be the best way to create an external Impala table based on
> that location? It would definitely be unbelievable if I had to create the
> DDL/schema manually as there are so many columns.
>
> Thank you very much.
>
>
> **
> *Sincerely yours,*
>
>
> *Raymond*
>


Re: Question about efficiency of SELECT DISTINCT

2018-07-02 Thread Furcy Pin
Hi,

They are rigorously equivalent.

You can see this with the following queries:

CREATE TABLE t1 (a INT, b INT, c INT) ;

EXPLAIN
SELECT DISTINCT a,b,c
FROM t1
;

EXPLAIN
SELECT a,b,c
FROM t1
GROUP BY a,b,c
;


Both queries will return the exact same query plan:

Stage-0
   Fetch Operator
  limit:-1
  Stage-1
 Reducer 2 vectorized
 File Output Operator [FS_8]
compressed:false
Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL Column
stats: NONE
table:{"input
format:":"org.apache.hadoop.mapred.TextInputFormat","output
format:":"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat","serde:":"org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"}
Group By Operator [OP_7]
|  keys:KEY._col0 (type: int), KEY._col1 (type: int), KEY._col2
(type: int)
|  outputColumnNames:["_col0","_col1","_col2"]
|  Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL
Column stats: NONE
|<-Map 1 [SIMPLE_EDGE]
   Reduce Output Operator [RS_3]
  key expressions:_col0 (type: int), _col1 (type: int),
_col2 (type: int)
  Map-reduce partition columns:_col0 (type: int), _col1
(type: int), _col2 (type: int)
  sort order:+++
  Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL
Column stats: NONE
  Group By Operator [GBY_2]
 keys:a (type: int), b (type: int), c (type: int)
 outputColumnNames:["_col0","_col1","_col2"]
 Statistics:Num rows: 1 Data size: 0 Basic stats:
PARTIAL Column stats: NONE
 Select Operator [SEL_1]
outputColumnNames:["a","b","c"]
Statistics:Num rows: 1 Data size: 0 Basic stats:
PARTIAL Column stats: NONE
TableScan [TS_0]
   alias:t1
   Statistics:Num rows: 1 Data size: 0 Basic stats:
PARTIAL Column stats: NONE





However, these two queries are NOT equivalent:

SELECT COUNT(DISTINCT a,b,c)
FROM t1
;

SELECT COUNT(1)
FROM (
  SELECT a,b,c
  FROM t1
  GROUP BY a,b,c
  ) T
;

In general, the first one is faster, except that it can fail if Hive
optimizes it poorly, while the second one is slower but more reliable.
Also, most importantly, they don't give the same results, as COUNT(DISTINCT
a, b, c) will ignore any row where a, b or c is NULL.
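
A small illustration of that last point (a sketch):

-- suppose t1 contains two rows: (1, 1, 1) and (1, 1, NULL)
SELECT COUNT(DISTINCT a, b, c) FROM t1;
-- returns 1: the row containing a NULL is ignored
SELECT COUNT(1) FROM (SELECT a, b, c FROM t1 GROUP BY a, b, c) T;
-- returns 2: the (1, 1, NULL) group is counted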

Their respective query plans are :

Stage-0
   Fetch Operator
  limit:-1
  Stage-1
 Reducer 2
 File Output Operator [FS_6]
compressed:false
Statistics:Num rows: 1 Data size: 16 Basic stats: COMPLETE
Column stats: NONE
table:{"input
format:":"org.apache.hadoop.mapred.TextInputFormat","output
format:":"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat","serde:":"org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"}
Group By Operator [GBY_4]
|  aggregations:["count(DISTINCT KEY._col0:0._col0,
KEY._col0:0._col1, KEY._col0:0._col2)"]
|  outputColumnNames:["_col0"]
|  Statistics:Num rows: 1 Data size: 16 Basic stats: COMPLETE
Column stats: NONE
|<-Map 1 [SIMPLE_EDGE]
   Reduce Output Operator [RS_3]
  key expressions:_col0 (type: int), _col1 (type: int),
_col2 (type: int)
  sort order:+++
  Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL
Column stats: NONE
  Group By Operator [GBY_2]
 aggregations:["count(DISTINCT a, b, c)"]
 keys:a (type: int), b (type: int), c (type: int)
 outputColumnNames:["_col0","_col1","_col2","_col3"]
 Statistics:Num rows: 1 Data size: 0 Basic stats:
PARTIAL Column stats: NONE
 Select Operator [SEL_1]
outputColumnNames:["a","b","c"]
Statistics:Num rows: 1 Data size: 0 Basic stats:
PARTIAL Column stats: NONE
TableScan [TS_0]
   alias:t1
   Statistics:Num rows: 1 Data size: 0 Basic stats:
PARTIAL Column stats: NONE

and

Stage-0
   Fetch Operator
  limit:-1
  Stage-1
 Reducer 3 vectorized
 File Output Operator [FS_13]
compressed:false
Statistics:Num rows: 1 Data size: 8 Basic stats: COMPLETE
Column stats: NONE
table:{"input
format:":"org.apache.hadoop.mapred.TextInputFormat","output
format:":"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat","serde:":"org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"}
Group By Operator [OP_12]
|  aggregations:["count(VALUE._col0)"]
|  outputColumnNames:["_col0"]
|  Statistics:Num rows: 1 Data size: 8 Basic stats: COMPLETE
Column stats: NONE
|<-Reducer 2 

Re: Partition Pruning using UDF

2018-05-16 Thread Furcy Pin
Yes, I believe that's what Constant Object Inspectors are used for.

The initialize method returns object inspectors that are used at query
compilation time to check the type safety and perform potential
optimizations. If you return a StringConstantObjectInspector containing the
value that your method is supposed to return, the Hive optimiser will know
that it can safely perform partition pruning on it.
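
One way to check whether the pruning actually kicks in (a sketch, assuming the
UDF is registered as findpartition) is to look at the partitions listed by:

EXPLAIN DEPENDENCY
SELECT * FROM tb WHERE partitioncol = findpartition();

If only the matching partition appears in the input list, the constant was
propagated and pruning worked.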

On Tue, 15 May 2018, 23:32 Alberto Ramón, <a.ramonporto...@gmail.com> wrote:

> Yes, I checked, by default all UDF are deterministic (LINK
> <https://hive.apache.org/javadocs/r1.2.2/api/org/apache/hadoop/hive/ql/udf/UDFType.html>
> )
>
> I think that I need something like 'eager evaluation' --> evaluate UDFs
> before building the physical plan (otherwise you can't do partition pruning)
>
> On 15 May 2018 at 09:21, Furcy Pin <pin.fu...@gmail.com> wrote:
>
>> Hi Alberto,
>>
>>
>> If I'm not mistaken, to make sure that this work you need to give the
>> proper annotation in your UDF code (deterministic, and maybe some other).
>> You may also need to return a Constant Object Inspector in the initialize
>> method so that Hive knows that it can perform partition pruning with it.
>>
>> On Wed, 9 May 2018, 19:23 Alberto Ramón, <a.ramonporto...@gmail.com>
>> wrote:
>>
>>> Hello
>>>
>>> We have a UDF to select the correct partition to read, 'FindPartition':
>>> Select * from TB where partitionCol =FindPartition();
>>>
>>> How I can avoid a full scan of all partitions?
>>>
>>>
>>> (Set MyPartition=FindPartition();  // Is not valid in Hive)
>>>
>>
>


Re: Partition Pruning using UDF

2018-05-15 Thread Furcy Pin
Hi Alberto,


If I'm not mistaken, to make sure that this works you need to add the
proper annotation in your UDF code (deterministic, and maybe some others).
You may also need to return a Constant Object Inspector in the initialize
method so that Hive knows that it can perform partition pruning with it.

On Wed, 9 May 2018, 19:23 Alberto Ramón,  wrote:

> Hello
>
> We have a UDF to select the correct partition to read, 'FindPartition':
> Select * from TB where partitionCol =FindPartition();
>
> How I can avoid a full scan of all partitions?
>
>
> (Set MyPartition=FindPartition();  // Is not valid in Hive)
>


Re: insert overwrite to hive orc table in aws

2018-05-01 Thread Furcy Pin
I suggest that you do a

DESCRIBE FORMATTED orc

and check the location of your data directly on S3.

Did you also try a SELECT * FROM orc LIMIT 10; ? Did it return something?



On Fri, 27 Apr 2018, 21:30 Sowjanya Kakarala,  wrote:

>
>
>
>
>> Hi Guys,
>>
>> I am new to aws, I am not sure how the hive works and store the data in
>> aws.
>>
>> I have created a text table and orc table. where text table have data and
>> when I do insert overwrite from text table the qury runs fine and when I do
>> a count on orc table its zero.
>>
>> I have no clue where the data is going, any suggestions or ideas?
>>
>> Here is my query:
>>
>> hive> insert into table orc select * from text;
>
> Query ID = hadoop_20180427190946_03edcce9-7a48-4876-aadb-6a7a1a862e50
>
> Total jobs = 1
>
> Launching Job 1 out of 1
>
> Tez session was closed. Reopening...
>
> Session re-established.
>
> Status: Running (Executing on YARN cluster with App id
> application_1524689752360_0058)
>
>
>
> --
>
> *VERTICES  MODESTATUS  TOTAL  COMPLETED  RUNNING
> PENDING  FAILED  KILLED  *
>
>
> --
>
> Map 1 .. container SUCCEEDED  7  70
>   0   0   0
>
> Reducer 2 .. container SUCCEEDED 10 100
>   0   0   0
>
>
> --
>
> *VERTICES: 02/02  [==>>] 100%  ELAPSED TIME: 23.02
> s*
>
>
> --
>
> Loading data to table db.orc
>
> OK
>
> Time taken: 29.982 seconds
>
> hive> select count(*) from orc;
>
> Query ID = hadoop_20180427191026_faa4fa26-df05-4b52-aa31-f0c878b9c0bd
>
> Total jobs = 1
>
> Launching Job 1 out of 1
>
> Status: Running (Executing on YARN cluster with App id
> application_1524689752360_0058)
>
>
>
> --
>
> *VERTICES  MODESTATUS  TOTAL  COMPLETED  RUNNING
> PENDING  FAILED  KILLED  *
>
>
> --
>
> Map 1container SUCCEEDED  0  00
>   0   0   0
>
> Reducer 2 .. container SUCCEEDED  1  10
>   0   0   0
>
>
> --
>
> *VERTICES: 01/02  [==>>] 100%  ELAPSED TIME: 3.49
> s *
>
>
> --
>
> OK
>
> 0
>
> Time taken: 4.425 seconds, Fetched: 1 row(s)
>
> hive> select count(*) from text;
>
> Query ID = hadoop_20180427192401_fc0c4e26-0bb7-4d56-a7b9-17493cd5c88d
>
> Total jobs = 1
>
> Launching Job 1 out of 1
>
> Tez session was closed. Reopening...
>
> Session re-established.
>
> Status: Running (Executing on YARN cluster with App id
> application_1524689752360_0059)
>
>
>
> --
>
> *VERTICES  MODESTATUS  TOTAL  COMPLETED  RUNNING
> PENDING  FAILED  KILLED  *
>
>
> --
>
> Map 1 .. container SUCCEEDED  7  70
>   0   0   0
>
> Reducer 2 .. container SUCCEEDED  1  10
>   0   0   0
>
>
> --
>
> *VERTICES: 02/02  [==>>] 100%  ELAPSED TIME: 10.19
> s*
>
>
> --
>
> OK
>
> 37504
>
> Time taken: 15.506 seconds, Fetched: 1 row(s)
>
>
> Thanks in advance.
>
>


Re: Exploding arrays of structs

2018-05-01 Thread Furcy Pin
Hi Tom,

High-level manipulation of arrays of structs is generally where SQL
reaches its limit in terms of expressivity and where Spark starts to get my
preference. Unfortunately I don't know of any public UDF that can do exactly
what you want, but coding one is not that hard.
If you use Scala, Spark UDFs are even easier to write (but not to
share, I guess).

Concerning the self-UNION ALL, a common trick to avoid it is to create an
array containing what you need and explode it immediately, like this:

explode(ARRAY(host_details, metrics))

Of course this can only work if host_details and metrics have the same type.
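
Applied to your table, that would look roughly like this (a sketch, assuming
both arrays share the same STRUCT element type with key and value fields):

SELECT m.id, kv.key, kv.value
FROM metrics m
LATERAL VIEW explode(ARRAY(m.host_details, m.metrics)) x AS arr
LATERAL VIEW inline(arr) kv AS key, value;

You could then apply your collect() / GROUP BY id on top of it, with a single
scan of the table and no UNION ALL.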
Another trick I learned recently that could help you: I think you can write
"metrics.key", which will give you the Array of all keys
(like metrics.map(_.keys) in Scala Spark).

Hope this helps

Furcy.

On Mon, 30 Apr 2018, 11:51 Tom Davis,  wrote:

> Just spotted a typo in the email below, the table structure is:
>
> id : STRING
> host_details : ARRAY
> metrics : ARRAY
>
> I.e. value is a STRING, not a STRUCT.
>
> Thanks,
>
> Tom
>
> On Mon, 30 Apr 2018 at 09:13, Tom Davis  wrote:
>
>> Hi all,
>>
>>
>> I have a table with a lot of logging / metrics data that looks like this:
>>
>>
>> id: STRING
>>
>> host_details: ARRAY
>>
>> metrics: ARRAY
>>
>>
>> I would like to be able to extract known keys and their values (from each
>> of the host_details and metrics arrays-of-struct). Yes, this format sucks
>> and could be a map, but technically the keys aren’t unique, although I am
>> happy to make that compromise at this point. In reality the table has more
>> fields in the struct.
>>
>>
>> In Spark or MR, I could extract all the fields I need and project them to
>> columns with only a single pass over the table. I’m struggling to achieve
>> the same in Hive, which may be either my SQL ability or that there isn’t a
>> UDF available to meet my needs.
>>
>>
>> The most efficient approach I have found so far, uses the collect()
>> Brickhouse UDF and two lateral views generated from the inline() UDTF:
>>
>>
>> SELECT
>>
>> id,
>>
>> collect(t.key, t.value) AS map_h_m
>>
>> FROM
>>
>> (SELECT id, hds.key AS key, hds.value AS value
>>
>> FROM metrics
>>
>> LATERAL VIEW inline(metrics.host_details) hds
>>
>> UNION ALL
>>
>> SELECT id, ms.key AS key, ms.value AS value
>>
>> FROM metrics
>>
>> LATERAL VIEW inline(metrics.metrics)) t
>>
>> GROUP BY t.id
>>
>>
>> I can create this as a VIEW and then extract columns from the map using
>> the standard map_h_m[‘error_count’] notation.
>>
>>
>> Using collect() / GROUP BY seems more efficient than doing a join by id
>> (I think it cuts out a shuffle), but still has problems:
>>
>>- it’s still scanning the metrics table multiple times
>>- It assumes that keys are unique across both host_details and
>>metrics.
>>
>>
>> Does anyone have any thoughts before I attempt to write my own UDF(s) to
>> assist?
>>
>>
>> Many thanks,
>>
>> Tom
>>
>


Re: Does Hive support Hbase-synced partitioned tables?

2018-04-22 Thread Furcy Pin
Hi Oleksiy,

I must say that I don't know if partitioned HBase-backed tables are
supported in Hive, but I don't understand why you would need it. What are
you trying to do exactly? I suspect that you could do it by using composite
keys (Department, doc_id).
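
A rough sketch of what I mean (the DDL reuses your example; the key layout and
table names are just assumptions):

create table hbase_employees(row_key STRING, EmployeeID Int, FirstName String,
Designation String, Salary Int)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" =
":key,cf:EmployeeID,cf:FirstName,cf:Designation,cf:Salary")
TBLPROPERTIES("hbase.table.name" = "hbase_employees");

-- build the composite key at insert time instead of using a partition column:
INSERT INTO TABLE hbase_employees
SELECT concat(Department, '|', doc_id), EmployeeID, FirstName, Designation,
Salary
FROM some_source_table;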


Also, I would advise against using multiple column families for the example
you are describing. I don't think it would lead to better performance.

Hope this helps,

Furcy


On Sun, 22 Apr 2018, 14:06 Oleksiy S,  wrote:

> Any updates?
>
> On Fri, Apr 20, 2018 at 10:54 AM, Oleksiy S  > wrote:
>
>> Hi all.
>>
>> I can create following table
>>
>> create table hbase_partitioned(doc_id STRING, EmployeeID Int, FirstName
>> String, Designation  String, Salary Int) PARTITIONED BY (Department String)
>> STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH
>> SERDEPROPERTIES ("hbase.columns.mapping" =
>> ":key,boolsCF:EmployeeID,intsCF:FirstName,intsCF:Designation,intsCF:Salary")
>> TBLPROPERTIES("hbase.table.name" = "hbase_partitioned");
>>
>>
>> But when I want to insert data, I have an exception. Is it expected
>> behavior?
>>
>> INSERT INTO TABLE hbase_partitioned PARTITION(department='A') values
>> ('1', 1, 'John Connor', 'New York', 2300),
>> ('2', 2, 'Max Plank', 'Las Vegas', 1300),
>> ('3', 3, 'Arni Shwarz', 'Los Angelos', 7700),
>> ('4', 4, 'Sarah Connor', 'Oakland', 9700);
>>
>>
>>
>> WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in
>> the future versions. Consider using a different execution engine (i.e.
>> spark, tez) or using Hive 1.X releases.
>> Query ID = mapr_20180420074356_b13d8652-1ff6-4fe1-975c-7318db6037de
>> Total jobs = 3
>> Launching Job 1 out of 3
>> Number of reduce tasks is set to 0 since there's no reduce operator
>> java.io.IOException: org.apache.hadoop.hive.ql.metadata.HiveException:
>> java.lang.IllegalArgumentException: Must specify table name
>> at
>> org.apache.hadoop.hive.ql.exec.FileSinkOperator.checkOutputSpecs(FileSinkOperator.java:1136)
>> at org.apache.hadoop.hive.ql.io
>> .HiveOutputFormatImpl.checkOutputSpecs(HiveOutputFormatImpl.java:67)
>> at
>> org.apache.hadoop.mapreduce.JobSubmitter.checkSpecs(JobSubmitter.java:271)
>> at
>> org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:142)
>> at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
>> at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
>> at java.security.AccessController.doPrivileged(Native Method)
>> at javax.security.auth.Subject.doAs(Subject.java:422)
>> at
>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1595)
>> at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
>> at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:575)
>> at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:570)
>> at java.security.AccessController.doPrivileged(Native Method)
>> at javax.security.auth.Subject.doAs(Subject.java:422)
>> at
>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1595)
>> at
>> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:570)
>> at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:561)
>> at
>> org.apache.hadoop.hive.ql.exec.mr.ExecDriver.execute(ExecDriver.java:434)
>> at
>> org.apache.hadoop.hive.ql.exec.mr.MapRedTask.execute(MapRedTask.java:138)
>> at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:197)
>> at
>> org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:100)
>> at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:2074)
>> at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1745)
>> at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1454)
>> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1172)
>> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1162)
>> at
>> org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:238)
>> at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:186)
>> at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:405)
>> at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:791)
>> at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:729)
>> at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:652)
>> at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:647)
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> at
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>> at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> at java.lang.reflect.Method.invoke(Method.java:498)
>> at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
>> at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
>> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException:
>> java.lang.IllegalArgumentException: Must specify table name
>> at
>> 

Re: HQL parser internals

2018-04-13 Thread Furcy Pin
Hi Jay,

I noticed the same thing when I built my tool, and it makes sense, as
syntactically they are both equivalent, so the Hive parser does not care.
You could probably update the HiveParser so that it keeps the information
in the AST, but not without breaking every part of the code that reads that
AST, I'm afraid.

In the long run, I believe that it would be really valuable if someone had
the courage to completely rewrite Hive parsing using best practices
and good tools (e.g. ANTLR 4), but I doubt that it will ever happen
unfortunately :-(




On 13 April 2018 at 14:25, Jay Green-Stevens <t-jgreenstev...@hotels.com>
wrote:

> Afternoon all,
>
>
>
> We have successfully managed to build a java tool which will translate a
> hive query into a syntax tree, and then turn this back into a hive query
> equivalent to the input.
>
>
>
> But we have found that for a query such as
>
> *select a *
>
> *from (*
>
> *select a *
>
> *from b*
>
> *) as c*
>
>
>
> the hive parser is stripping out the ‘as’ when building the tree. This
> then means that when the query string is rebuilt the output is ‘*select a
> from (select a from b) c’*, and although this is technically valid it is
> not equivalent to the input query.
>
>
>
> Is there any way we can fix this issue?
>
> Could we change the parser in some way to stop the ‘as’ from being
> stripped out?
>
>
>
> Any ideas would be greatly appreciated!
>
>
>
> Thanks,
>
>
>
> Jay
>
>
>
>
>
>
>
>
> From: *Elliot West* <tea...@gmail.com>
> Date: 19 March 2018 at 21:33
> Subject: Re: HQL parser internals
> To: user@hive.apache.org
>
> Hello again,
>
>
>
> We're now testing our system against a corpus of Hive SQL statements in an
> effort to quickly highlight edge cases, limitations etc. We're finding that
> org.apache.hadoop.hive.ql.parse.ParseUtils is stumbling on variables such
> as ${hiveconf:varname}. Are variable substitutions handled prior to
> parsing or within the parser itself? If in a pre-procesing stage, is there
> any code or utility classes within Hive that we can use as a reference, or
> to provide this functionality?
>
>
>
> Cheers,
>
>
>
> Elliot.
>
>
>
> On 19 February 2018 at 11:10, Elliot West <tea...@gmail.com> wrote:
>
> Thank you all for your rapid responses; some really useful information and
> pointers in there.
>
>
>
> We'll keep the list updated with our progress.
>
>
>
> On 18 February 2018 at 19:00, Dharmesh Kakadia <dhkaka...@gmail.com>
> wrote:
>
> +1 for using ParseDriver for this. I also have used it to intercept and
> augment query AST.
>
>
>
> Also, I would echo others sentiment that its quite ugly. It would be great
> if we can refactor/standardize this. That will make integrating other
> system a lot easier.
>
>
>
> Thanks,
>
> Dharmesh
>
>
>
> On Sat, Feb 17, 2018 at 12:07 AM, Furcy Pin <pin.fu...@gmail.com> wrote:
>
> Hi Elliot,
>
>
>
> Actually, I have done quite similar work regarding Hive custom Parsing,
> you should have a look at my project: https://github.com/flaminem/flamy
>
>
>
> The Hive parsing related stuff is here: https://github.com/
> flaminem/flamy/tree/master/src/main/scala/com/flaminem/flamy/parsing/hive
> A good starting point to see how to parse queries is here:
>
> https://github.com/flaminem/flamy/blob/master/src/main/
> scala/com/flaminem/flamy/parsing/hive/PopulateParserInfo.scala#L492
>
>
>
>
>
> Basically, all you need is to use a org.apache.hadoop.hive.ql.
> parse.ParseDriver.
>
>
>
> val pd: ParseDriver = new ParseDriver
>
> val tree: ASTNode = pd.parse(query, hiveContext)
>
> You then get the ASTNode, that you can freely parse and change.
>
> Also, I must say that it is quite ugly to manipulate, and the Presto
> Parser seems to be much better designed (but it is not the same syntax,
> unfortunately),
>
> I recommend to look at it to get better design ideas.
>
>
>
>
>
> If you want to enrich your Hive syntax like I did (I wanted to be able to
> parse ${VARS} in queries),
>
> you will not be able to use the HiveParser without some workaround.
>
> What I did was replacing these ${VARS} by strings "${VARS}" that the
> HiveParser would agree to parse,
>
> and that I could recognize afterwards...
>
>
>
> Also, if you are familiar with Scala, I recommend using it, it helps a
> lot...
>
>
>
> For instance, I have this class that transforms an AST back into a string
> query:
>
> https://github.com/flaminem/flamy/blob/master/src/main/
> scala/com/flami

Re: ALTER TABLE DROP PARTITION not working on S3

2018-04-05 Thread Furcy Pin
¯\_(ツ)_/¯

On 5 April 2018 at 16:02, Richard A. Bross <r...@oaktreepeak.com> wrote:

> I can't duplicate the issue now.  Works like it always has . .
>
> - Original Message -
> From: "Furcy Pin" <pin.fu...@gmail.com>
> To: user@hive.apache.org
> Sent: Thursday, April 5, 2018 9:57:39 AM
> Subject: Re: ALTER TABLE DROP PARTITION not working on S3
>
>
> Indeed.
>
>
> If I remember correctly, s3 does not really have the concept of "folder"
> like HDFS has, and Hive sort of makes up for it by creating a descriptor
> file
> where the partition "folder" is supposed to be. Maybe this is what is
> missing here.
>
>
> Perhaps you could try doing a "MSCK REPAIR TABLE tablename" to make sure
> that the partitions are correctly loaded and then try again dropping that
> particular partition?
>
>
> Or look at your s3 folder if you see any such "partition folder file" and
> check if it is missing for this particular partition?
>
>
>
>
> On 5 April 2018 at 15:40, Richard A. Bross < r...@oaktreepeak.com > wrote:
>
>
> Leaving the column list out, here you go:
>
> # Detailed Table Information
> Database: default
> Owner: hadoop
> CreateTime: Thu Apr 05 13:24:33 UTC 2018
> LastAccessTime: UNKNOWN
> Retention: 0
> Location: s3://zoomi-proto-warehouse-measurements/
> Table Type: MANAGED_TABLE
> Table Parameters:
> COLUMN_STATS_ACCURATE {\"BASIC_STATS\":\"true\"}
> numFiles 10
> numPartitions 7
> numRows 153
> orc.compress ZLIB
> orc.create.index true
> rawDataSize 113563
> totalSize 37801
> transient_lastDdlTime 1522934673
>
> # Storage Information
> SerDe Library: org.apache.hadoop.hive.ql.io.orc.OrcSerde
> InputFormat: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
> OutputFormat: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
> Compressed: No
> Num Buckets: 61
> Bucket Columns: [crs_id]
> Sort Columns: []
> Storage Desc Params:
> serialization.format 1
> Time taken: 0.467 seconds, Fetched: 98 row(s)
>
>
> - Original Message -
> From: "Furcy Pin" < pin.fu...@gmail.com >
> To: user@hive.apache.org
> Sent: Thursday, April 5, 2018 9:21:06 AM
>
>
> Subject: Re: ALTER TABLE DROP PARTITION not working on S3
>
>
> Hi Richard,
>
>
> could you please check if your table is EXTERNAL?
> You can see it with a "DESCRIBE FORMATTED table_name ;"
>
>
> That's what external tables are for, they don't delete underlying data
> when you drop them.
>
>
>
>
> On 5 April 2018 at 15:18, Richard A. Bross < r...@oaktreepeak.com > wrote:
>
>
> I think that someone put a file in there manually. Would that prevent Hive
> from dropping the partition. I also did a "drop table" and the s3 object
> keys persisted.
>
>
>
> - Original Message -
> From: "Richard A. Bross" < r...@oaktreepeak.com >
> To: user@hive.apache.org
> Sent: Thursday, April 5, 2018 9:14:52 AM
> Subject: ALTER TABLE DROP PARTITION not working on S3
>
> Hi,
>
> I have a Hive managed table on S3, "api_measurements". I've tried dropping
> a partition like so:
>
> hive> alter table api_measurements drop if exists
> partition(daydate='2018-04-04', epoch=1522876500);
> Dropped the partition daydate=2018-04-04/epoch=1522876500
> OK
> Time taken: 2.109 seconds
>
> Yet the data is still on S3. Because object keys on S3 are always strings,
> I also tried this:
>
> hive> alter table api_measurements drop partition(daydate='2018-04-04',
> epoch='1522876500');
> OK
> Time taken: 0.135 seconds
>
> Yet the object keys and data are still there. I assume that I'm missing
> something really simple. Can anyone shed some light on this?
>
> Thanks
>
>
>
>


Re: ALTER TABLE DROP PARTITION not working on S3

2018-04-05 Thread Furcy Pin
Indeed.

If I remember correctly, s3 does not really have the concept of "folder"
like HDFS has, and Hive sort of makes up for it by creating a descriptor
file
where the partition "folder" is supposed to be. Maybe this is what is
missing here.

Perhaps you could try doing a "MSCK REPAIR TABLE tablename" to make sure
that the partitions are correctly loaded and then try again dropping that
particular partition?

Or look at your s3 folder if you see any such "partition folder file" and
check if it is missing for this particular partition?


On 5 April 2018 at 15:40, Richard A. Bross <r...@oaktreepeak.com> wrote:

> Leaving the column list out, here you go:
>
> # Detailed Table Information
> Database:   default
> Owner:  hadoop
> CreateTime: Thu Apr 05 13:24:33 UTC 2018
> LastAccessTime: UNKNOWN
> Retention:  0
> Location:   s3://zoomi-proto-warehouse-measurements/
> Table Type: MANAGED_TABLE
> Table Parameters:
> COLUMN_STATS_ACCURATE   {\"BASIC_STATS\":\"true\"}
> numFiles10
> numPartitions   7
> numRows 153
> orc.compressZLIB
> orc.create.indextrue
> rawDataSize 113563
> totalSize   37801
> transient_lastDdlTime   1522934673
>
> # Storage Information
> SerDe Library:  org.apache.hadoop.hive.ql.io.orc.OrcSerde
> InputFormat:org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
> OutputFormat:   org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
> Compressed: No
> Num Buckets:61
> Bucket Columns: [crs_id]
> Sort Columns:   []
> Storage Desc Params:
> serialization.format1
> Time taken: 0.467 seconds, Fetched: 98 row(s)
>
>
> - Original Message -
> From: "Furcy Pin" <pin.fu...@gmail.com>
> To: user@hive.apache.org
> Sent: Thursday, April 5, 2018 9:21:06 AM
> Subject: Re: ALTER TABLE DROP PARTITION not working on S3
>
>
> Hi Richard,
>
>
> could you please check if your table is EXTERNAL?
> You can see it with a "DESCRIBE FORMATTED table_name ;"
>
>
> That's what external tables are for, they don't delete underlying data
> when you drop them.
>
>
>
>
> On 5 April 2018 at 15:18, Richard A. Bross < r...@oaktreepeak.com > wrote:
>
>
> I think that someone put a file in there manually. Would that prevent Hive
> from dropping the partition. I also did a "drop table" and the s3 object
> keys persisted.
>
>
>
> - Original Message -
> From: "Richard A. Bross" < r...@oaktreepeak.com >
> To: user@hive.apache.org
> Sent: Thursday, April 5, 2018 9:14:52 AM
> Subject: ALTER TABLE DROP PARTITION not working on S3
>
> Hi,
>
> I have a Hive managed table on S3, "api_measurements". I've tried dropping
> a partition like so:
>
> hive> alter table api_measurements drop if exists
> partition(daydate='2018-04-04', epoch=1522876500);
> Dropped the partition daydate=2018-04-04/epoch=1522876500
> OK
> Time taken: 2.109 seconds
>
> Yet the data is still on S3. Because object keys on S3 are always strings,
> I also tried this:
>
> hive> alter table api_measurements drop partition(daydate='2018-04-04',
> epoch='1522876500');
> OK
> Time taken: 0.135 seconds
>
> Yet the object keys and data are still there. I assume that I'm missing
> something really simple. Can anyone shed some light on this?
>
> Thanks
>
>
>


Re: ALTER TABLE DROP PARTITION not working on S3

2018-04-05 Thread Furcy Pin
Hi Richard,

could you please check if your table is EXTERNAL?
You can see it with a "DESCRIBE FORMATTED table_name ;"

That's what external tables are for, they don't delete underlying data when
you drop them.
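
For instance, a quick check (a sketch, using the table name from your first
message):

DESCRIBE FORMATTED api_measurements;
-- in the output, look at the "Table Type:" line:
--   MANAGED_TABLE   -> data is deleted when you drop the table or partition
--   EXTERNAL_TABLE  -> only the metadata is dropped, the files stay on S3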



On 5 April 2018 at 15:18, Richard A. Bross  wrote:

> I think that someone put a file in there manually.  Would that prevent
> Hive from dropping the partition.  I also did a "drop table" and the s3
> object keys persisted.
>
> - Original Message -
> From: "Richard A. Bross" 
> To: user@hive.apache.org
> Sent: Thursday, April 5, 2018 9:14:52 AM
> Subject: ALTER TABLE DROP PARTITION not working on S3
>
> Hi,
>
> I have a Hive managed table on S3, "api_measurements".  I've tried
> dropping a partition like so:
>
> hive> alter table api_measurements drop if exists
> partition(daydate='2018-04-04', epoch=1522876500);
> Dropped the partition daydate=2018-04-04/epoch=1522876500
> OK
> Time taken: 2.109 seconds
>
> Yet the data is still on S3.  Because object keys on S3 are always
> strings, I also tried this:
>
> hive> alter table api_measurements drop partition(daydate='2018-04-04',
> epoch='1522876500');
> OK
> Time taken: 0.135 seconds
>
> Yet the object keys and data are still there.  I assume that I'm missing
> something really simple.  Can anyone shed some light on this?
>
> Thanks
>
>


Re: Building Data Warehouse Application in Spark

2018-04-04 Thread Furcy Pin
Hi Mahender,

Did you look at this? https://www.snappydata.io/blog/the-spark-database

But I believe that most people handle this use case by either using:
- Their favorite regular RDBMS (mySQL, postgres, Oracle, SQL-Server, ...)
if the data is not too big
- Their favorite NoSQL storage (Cassandra, HBase) if the data is too big
and needs to be distributed

Spark generally makes it easy enough to query these other databases to
allow you to perform analytics.

Hive and Spark have been designed as OLAP tools, not OLTP.
I'm not sure what features you are seeking for your SCD but they probably
won't be part of Spark's core design.

Hope this helps,

Furcy



On 4 April 2018 at 11:29, Mahender Sarangam 
wrote:

> Hi,
> Does anyone have a good architecture document/design principle for building
> a warehouse application using Spark?
>
> Is it better way of having Hive Context created with HQL and perform
> transformation or Directly loading  files in dataframe and perform data
> transformation.
>
> We need to implement SCD 2 Type in Spark, Is there any better
> document/reference for building Type 2 warehouse object
>
> Thanks in advace
>
> /Mahender
>


Re: HQL parser internals

2018-03-20 Thread Furcy Pin
Hi Elliot,

Yes, the variable substitution is done before the parsing. This makes
generic query validation much more complicated.

As I explained in my previous message, what I did was replacing these
${VARS} by strings "${VARS}" that the HiveParser would agree to parse,
and that I could recognize afterwards...



On 19 March 2018 at 22:33, Elliot West <tea...@gmail.com> wrote:

> Hello again,
>
> We're now testing our system against a corpus of Hive SQL statements in an
> effort to quickly highlight edge cases, limitations etc. We're finding that
> org.apache.hadoop.hive.ql.parse.ParseUtils is stumbling on variables such
> as ${hiveconf:varname}. Are variable substitutions handled prior to
> parsing or within the parser itself? If in a pre-processing stage, is there
> any code or utility classes within Hive that we can use as a reference, or
> to provide this functionality?
>
> Cheers,
>
> Elliot.
>
> On 19 February 2018 at 11:10, Elliot West <tea...@gmail.com> wrote:
>
>> Thank you all for your rapid responses; some really useful information
>> and pointers in there.
>>
>> We'll keep the list updated with our progress.
>>
>> On 18 February 2018 at 19:00, Dharmesh Kakadia <dhkaka...@gmail.com>
>> wrote:
>>
>>> +1 for using ParseDriver for this. I also have used it to intercept and
>>> augment query AST.
>>>
>>> Also, I would echo others sentiment that its quite ugly. It would be
>>> great if we can refactor/standardize this. That will make integrating
>>> other system a lot easier.
>>>
>>> Thanks,
>>> Dharmesh
>>>
>>> On Sat, Feb 17, 2018 at 12:07 AM, Furcy Pin <pin.fu...@gmail.com> wrote:
>>>
>>>> Hi Elliot,
>>>>
>>>> Actually, I have done quite similar work regarding Hive custom Parsing,
>>>> you should have a look at my project: https://github.com/flaminem/flamy
>>>>
>>>> The Hive parsing related stuff is here: https://github.com/flami
>>>> nem/flamy/tree/master/src/main/scala/com/flaminem/flamy/parsing/hive
>>>> A good starting point to see how to parse queries is here:
>>>> https://github.com/flaminem/flamy/blob/master/src/main/scala
>>>> /com/flaminem/flamy/parsing/hive/PopulateParserInfo.scala#L492
>>>>
>>>>
>>>> Basically, all you need is to use a org.apache.hadoop.hive.ql.pars
>>>> e.ParseDriver.
>>>>
>>>> val pd: ParseDriver = new ParseDriver
>>>> val tree: ASTNode = pd.parse(query, hiveContext)
>>>>
>>>> You then get the ASTNode, that you can freely parse and change.
>>>> Also, I must say that it is quite ugly to manipulate, and the Presto
>>>> Parser seems to be much better designed (but it is not the same syntax,
>>>> unfortunately),
>>>> I recommend to look at it to get better design ideas.
>>>>
>>>>
>>>> If you want to enrich your Hive syntax like I did (I wanted to be able
>>>> to parse ${VARS} in queries),
>>>> you will not be able to use the HiveParser without some workaround.
>>>> What I did was replacing these ${VARS} by strings "${VARS}" that the
>>>> HiveParser would agree to parse,
>>>> and that I could recognize afterwards...
>>>>
>>>> Also, if you are familiar with Scala, I recommend using it, it helps a
>>>> lot...
>>>>
>>>> For instance, I have this class that transforms an AST back into a
>>>> string query:
>>>> https://github.com/flaminem/flamy/blob/master/src/main/scala
>>>> /com/flaminem/flamy/parsing/hive/ast/SQLFormatter.scala
>>>> I could never have done something that good looking in Java...
>>>>
>>>> Finally this method helps a lot to understand how the hell the AST
>>>> works:
>>>> https://github.com/flaminem/flamy/blob/master/src/main/scala
>>>> /com/flaminem/flamy/parsing/hive/HiveParserUtils.scala#L593
>>>>
>>>> Make sure to write *tons* of unit tests too, you'll need them.
>>>>
>>>> Hope this helps,
>>>>
>>>> Furcy
>>>>
>>>>
>>>>
>>>> On 16 February 2018 at 21:20, Gopal Vijayaraghavan <gop...@apache.org>
>>>> wrote:
>>>>
>>>>>
>>>>> > However, ideally we wish to manipulate the original query as
>>>>> delivered by the user (or as close to it as possible), and we’re finding
>>>>> 

Re: Why the filter push down does not reduce the read data record count

2018-02-23 Thread Furcy Pin
And if you come across a comprehensive documentation of parquet
configuration, please share it!!!

The Parquet documentation says that it can be configured but doesn't
explain how: http://parquet.apache.org/documentation/latest/
and apparently, both TAJO (
http://tajo.apache.org/docs/0.8.0/table_management/parquet.html) and Drill (
https://drill.apache.org/docs/parquet-format/) seem to have some
configuration parameters for Parquet.
If Hive has configuration parameters for Parquet too, I couldn't find them
documented anywhere.



On 23 February 2018 at 16:48, Sun, Keith <ai...@ebay.com> wrote:

> I got your point and thanks for the nice slides info.
>
>
> So the parquet filter is not an easy thing and I will try that according
> to the deck.
>
>
> Thanks !
> ------
> *From:* Furcy Pin <pin.fu...@gmail.com>
> *Sent:* Friday, February 23, 2018 3:37:52 AM
> *To:* user@hive.apache.org
> *Subject:* Re: Why the filter push down does not reduce the read data
> record count
>
> Hi,
>
> Unless your table is partitioned or bucketed by myid, Hive generally
> requires to read through all the records to find the records that match
> your predicate.
>
> In other words, Hive table are generally not indexed for single record
> retrieval like you would expect RDBMs tables or Vertica tables to be
> indexed to allow single record.
> Some file formats like ORC (and maybe Parquet, I'm not sure) allow to add
> bloom filters on specific columns of a table
> <https://snippetessay.wordpress.com/2015/07/25/hive-optimizations-with-indexes-bloom-filters-and-statistics/>,
> which could work as a kind of index.
> Also, depending on the query engine you are using (Hive, Spark-SQL,
> Impala, Presto...) and its version, they may or may not be able to leverage
> certain storage optimization.
> For example, Spark still does not support Hive Bucketed Table
> optimization. But it might come in the upcoming Spark 2.3.
>
>
> I'm much less familiar with Parquet, so if anyone has links to a good
> documentation for Parquet fine tuning (or even better a comparison with ORC
> features) that would be really helpful.
> By googling, I found these slides where someone at Netflix
> <https://www.slideshare.net/RyanBlue3/parquet-performance-tuning-the-missing-guide>
> seems to have tried the same kind of optimization as you in Parquet.
>
>
>
>
>
> On 23 February 2018 at 12:02, Sun, Keith <ai...@ebay.com> wrote:
>
> Hi,
>
>
> Why does Hive still read so many "records" even with filter pushdown enabled,
> when the returned dataset would be a very small amount (4k out of
> 30 billion records)?
>
>
> The "RECORDS_IN" counter of Hive which still showed the 30billion count
> and also the output in the map reduce log like this :
>
> org.apache.hadoop.hive.ql.exec.MapOperator: MAP[4]: records read - 10
>
>
> BTW, I am using parquet as storage format and the filter pushdown did work,
> as I see this in the log:
>
>
> AM INFO: parquet.filter2.compat.FilterCompat: Filtering using predicate: 
> eq(myid, 223)
>
>
> Thanks,
>
> Keith
>
>
>
>


Re: Why the filter push down does not reduce the read data record count

2018-02-23 Thread Furcy Pin
Hi,

Unless your table is partitioned or bucketed by myid, Hive generally
requires reading through all the records to find the ones that match
your predicate.

In other words, Hive tables are generally not indexed for single-record
retrieval the way you would expect RDBMS or Vertica tables to be.
Some file formats like ORC (and maybe Parquet, I'm not sure) allow adding
bloom filters on specific columns of a table
<https://snippetessay.wordpress.com/2015/07/25/hive-optimizations-with-indexes-bloom-filters-and-statistics/>,
which can work as a kind of index.
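
For example, here is a hedged sketch of how such a bloom filter can be declared
on an ORC table (hypothetical table and column names; the exact table
properties may depend on your Hive/ORC version):

CREATE TABLE events (
  myid INT,
  payload STRING
)
STORED AS ORC
TBLPROPERTIES (
  'orc.bloom.filter.columns' = 'myid',  -- columns that get a bloom filter
  'orc.bloom.filter.fpp'     = '0.05'   -- target false positive rate
);

With a predicate like WHERE myid = 223, a reader that supports it can then skip
the stripes whose bloom filter rules out the value.
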
Also, depending on the query engine you are using (Hive, Spark-SQL, Impala,
Presto...) and its version, it may or may not be able to leverage certain
storage optimizations.
For example, Spark still does not support the Hive bucketed table optimization,
but it might come in the upcoming Spark 2.3.


I'm much less familiar with Parquet, so if anyone has links to good
documentation for Parquet fine tuning (or even better, a comparison with ORC
features), that would be really helpful.
By googling, I found these slides where someone at Netflix
<https://www.slideshare.net/RyanBlue3/parquet-performance-tuning-the-missing-guide>
seems to have tried the same kind of optimization as you in Parquet.





On 23 February 2018 at 12:02, Sun, Keith  wrote:

> Hi,
>
>
> Why does Hive still read so many "records" even with filter pushdown enabled,
> when the returned dataset would be a very small amount (4k out of
> 30 billion records)?
>
>
> The "RECORDS_IN" counter of Hive which still showed the 30billion count
> and also the output in the map reduce log like this :
>
> org.apache.hadoop.hive.ql.exec.MapOperator: MAP[4]: records read - 10
>
>
> BTW, I am using parquet as storage format and the filter pushdown did work,
> as I see this in the log:
>
>
> AM INFO: parquet.filter2.compat.FilterCompat: Filtering using predicate: 
> eq(myid, 223)
>
>
> Thanks,
>
> Keith
>
>
>


Re: HQL parser internals

2018-02-17 Thread Furcy Pin
Hi Elliot,

Actually, I have done quite similar work regarding Hive custom Parsing, you
should have a look at my project: https://github.com/flaminem/flamy

The Hive parsing related stuff is here:
https://github.com/flaminem/flamy/tree/master/src/main/scala/com/flaminem/flamy/parsing/hive
A good starting point to see how to parse queries is here:
https://github.com/flaminem/flamy/blob/master/src/main/scala/com/flaminem/flamy/parsing/hive/PopulateParserInfo.scala#L492


Basically, all you need is to use a
org.apache.hadoop.hive.ql.parse.ParseDriver.

val pd: ParseDriver = new ParseDriver
val tree: ASTNode = pd.parse(query, hiveContext)

You then get the ASTNode, that you can freely parse and change.
Also, I must say that it is quite ugly to manipulate, and the Presto Parser
seems to be much better designed (but it is not the same syntax,
unfortunately),
I recommend to look at it to get better design ideas.


If you want to enrich your Hive syntax like I did (I wanted to be able to
parse ${VARS} in queries),
you will not be able to use the HiveParser without some workaround.
What I did was replacing these ${VARS} by strings "${VARS}" that the
HiveParser would agree to parse,
and that I could recognize afterwards...

Also, if you are familiar with Scala, I recommend using it, it helps a
lot...

For instance, I have this class that transforms an AST back into a string
query:
https://github.com/flaminem/flamy/blob/master/src/main/scala/com/flaminem/flamy/parsing/hive/ast/SQLFormatter.scala
I could never have done something that good looking in Java...

Finally this method helps a lot to understand how the hell the AST works:
https://github.com/flaminem/flamy/blob/master/src/main/scala/com/flaminem/flamy/parsing/hive/HiveParserUtils.scala#L593

Make sure to write *tons* of unit tests too, you'll need them.

Hope this helps,

Furcy



On 16 February 2018 at 21:20, Gopal Vijayaraghavan 
wrote:

>
> > However, ideally we wish to manipulate the original query as delivered
> by the user (or as close to it as possible), and we’re finding that the
> tree has been modified significantly by the time it hits the hook
>
> That's CBO. It takes the Query -> AST -> Calcite Tree -> AST -> hook -
> the bushy join conversion is already done by the time the hook gets called.
>
> We need a Parser hook to hook it ahead of CBO, not a Semantic Analyzer
> hook.
>
> > Additionally we wish to track back ASTNodes to the character sequences
> in the source HQL that were their origin (where sensible), and ultimately
> hope to be able regenerate the query text from the AST.
>
> I started work on a Hive-unparser a while back based on this class, but it is
> a world of verbose coding.
>
> https://github.com/apache/hive/blob/master/ql/src/java/
> org/apache/hadoop/hive/ql/optimizer/calcite/translator/
> ASTConverter.java#L850
>
> If you're doing active work on this, I'd like to help, because I need the
> AST -> query to debug CBO.
>
> > The use case, if you are interested, is a mutation testing framework for
> HQL. The testing of mutants is operational, but now we need to report on
> survivors, hence the need to track back from specific query elements to
> character sequences in the original query string.
>
> This sounds a lot like the fuzzing random-query-gen used in Cloudera to
> have Impala vs Hive bug-for-bug compat.
>
> https://cwiki.apache.org/confluence/download/attachments/27362054/Random%
> 20Query%20Gen-%20Hive%20Meetup.pptx
>
> Cheers,
> Gopal
>
>
>


Re: Will partition pruning will happen on HIVE views?

2018-01-23 Thread Furcy Pin
Hi, I might be wrong, but I would expect views to work as a simple
replacement at the query level (like a subquery or a CTE).
This means that partition pruning should work as long as predicate pushdown
properly works.

For your specific use case, it should be easy enough to test, and you can
use an EXPLAIN DEPENDENCY to see if the partition pruning is correct.
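
For example, a sketch of such a test with the names from your message (Table1
partitioned by txn_date, and a view that is a plain SELECT *):

CREATE VIEW view_table1 AS SELECT * FROM table1;

EXPLAIN DEPENDENCY
SELECT * FROM view_table1 WHERE txn_date = '2018-01-01';
-- the output lists the input tables and partitions that would be read;
-- if only the txn_date='2018-01-01' partition of table1 shows up,
-- pruning worked through the view.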

2018-01-23 14:39 GMT+01:00 Ramasubramanian Narayanan <
ramasubramanian.naraya...@gmail.com>:

> Hi,
>
> I have a scenario.
>
> Hive table "Table1" created with Partition with txn date.
>
> A view "VIEW_TABLE1" is created on top of "Table" with the query 'select *
> from table1'.
>
> If I query the view with the where clause using txn_date, will partition
> pruning happen? Please throw some light how it will behave in HIVE.
>
>
> regards,
> Rams
>


Re: Dynamic vs Static partition

2018-01-09 Thread Furcy Pin
You need to enable nonstrict mode to be able to use dynamic partitioning.

Dynamic partitioning is more automatic, but might take more time because
Hive has to write to multiple outputs at once.
With MapReduce, this uses MultipleOutputs and can fail with OOM when
writing too many partitions at once (especially with ORC).
There is a fix for that (hive.optimize.sort.dynamic.partition, true by
default) which causes Hive to sort the output before writing it when
dynamic partitioning is used; this might make some queries much slower
(especially when copying a partitioned table).

Also, I think dynamic partitioning requires putting an exclusive write lock
on the table, meaning it can't be read while you write to it.
When you use static partitioning, you should be able to write a new
partition and still be able to read from the other partitions at the same
time.

In short: if you know where your partitions are going and can generate your
Hive queries programmatically, use static partitioning; otherwise, use
dynamic partitioning.
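
To illustrate the difference, a minimal sketch with a hypothetical table sales
partitioned by ds, loaded from a hypothetical staging table:

-- static partitioning: the target partition is named explicitly
INSERT OVERWRITE TABLE sales PARTITION (ds='2018-01-09')
SELECT id, amount
FROM staging
WHERE ds = '2018-01-09';

-- dynamic partitioning: Hive routes each row using the last column(s) of the SELECT
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE sales PARTITION (ds)
SELECT id, amount, ds
FROM staging;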



2018-01-08 20:21 GMT+01:00 Alan Gates :

> When doing dynamic partitioning, Hive needs to look at each record and
> determine which partition to place it in.  In the case where all records go
> to the same partition, it is more efficient to tell Hive up front, that is,
> to use static partitioning.  So you can use dynamic partition for large
> files, it will just take more processing power and time.
>
> Alan.
>
> On Tue, Jan 2, 2018 at 5:53 PM, Sachit Murarka 
> wrote:
>
>> Hi,
>>
>> I am unclear about Dynamic and static partition.
>>
>>1. Why static partition is good for loading large files and why can’t
>>we use dynamic partition for the same?
>>2. Why does dynamic partition take more time in loading data than
>>static partitions?
>>
>>Also please explain when to use strict and nonstrict mode.
>>
>> Kind Regards,
>> Sachit Murarka
>>
>
>


Re: Reply: Re: Does Hive SQL support reading data without locking?

2018-01-02 Thread Furcy Pin
You can disable locking by setting

SET hive.support.concurrency=false ;

It can be used in a single session before a query.

However locking exists for a reason: what do you think will happen if you
overwrite files from a table while Tableau is reading them?
It really depends on your use case, but if possible, partitioning your
table would be better.

For instance, if the table is not too big, and if you bulk-import your
table every day
you could have one partition per day and a view that exposes the last
partition.
Some tools like Airflow support the easy retrieval of the latest available
partition.
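
A hedged sketch of that pattern, with hypothetical table and view names (the
nightly import job recreates the view once the new partition is written):

CREATE TABLE measurements_daily (id INT, amount DOUBLE)
PARTITIONED BY (ds STRING)
STORED AS ORC;

-- the nightly bulk import writes a brand new partition
INSERT OVERWRITE TABLE measurements_daily PARTITION (ds='2018-01-02')
SELECT id, amount FROM staging_measurements;

-- Tableau only ever queries the view, re-pointed after each successful load
CREATE OR REPLACE VIEW measurements AS
SELECT * FROM measurements_daily WHERE ds = '2018-01-02';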






2018-01-02 10:28 GMT+01:00 :

> Bulk import and no partition
>
>
> From: Jörn Franke
> To: user@hive.apache.org
> Date: 2018/01/02 14:56
> Subject: Re: Does Hive SQL support reading data without locking?
> --
>
>
>
> How do you import data ? Bulk import?
>
> What about using partitions ( or is the data too small for daily
> partitions?)
>
> On 2. Jan 2018, at 04:59, *l...@china-inv.cn*  wrote:
>
> Hi, All,
>
> We are using Hive to persist our data and we run cron jobs to import new
> data into Hive daily.
>
> At the same time, our users may query data from Hive at the same time by
> using a third party software like Tableau
>
> We found an issue that importing job will fail while the table is queried
> by user because of failing to get an exclusive lock
>
> Does Hive SQL support reading data without locking (read lock) just like
> 'with (nolock)' provided by SQL server?
>
> Thanks
>
> Boying
>


Re: Cannot create external table on S3; class S3AFileSystem not found

2017-12-10 Thread Furcy Pin
Hi Scott,

I ran into similar issues before.


I think that you need to do two things to enable Hive access to s3:


1. add the following jars to your Hive classpath, for instance by using
symlinks

ln -s $HADOOP_HOME/share/hadoop/tools/lib/aws-java-sdk-1.7.4.jar
$HIVE_HOME/lib/.
ln -s $HADOOP_HOME/share/hadoop/tools/lib/hadoop-aws-2.7.3.jar
$HIVE_HOME/lib/.


2. Configure your $HADOOP_HOME/conf/hdfs-site.xml with the following
parameters
(replace the $AWS_ values with your correct values, environment variable
substitution doesn't work until Hadoop 3 I believe)



  
  <property>
    <name>fs.s3a.access.key</name>
    <value>$AWS_ACCESS_KEY_ID</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>$AWS_SECRET_ACCESS_KEY</value>
  </property>




You can also check out the docker I made with Hive and Spark pre-installed
and pre-configured for Amazon
https://github.com/FurcyPin/docker-hive-spark

You can either use it directly or just look at how I did it.

Regards,

Furcy




2017-12-09 20:01 GMT+01:00 Scott Halgrim :

> Thanks, Elliott.
>
> I’m using Ubuntu 14.04.5 LTS, Hadoop 2.9.0, and Hive 2.3.2.
>
> I hadn’t had a CLASSPATH environment variable set, but I did just set it
> to have that jar on it and I get the same answer.
>
> I don’t really have the installation in “/path/to/…” but everything else
> in the path is correct. You’re right, I’m just obfuscating the actual value.
>
> A new odd behavior that didn’t happen yesterday, is that in hive, ‘SELECT
> * FROM TBLS;’ returns an error: ‘Table not found: TBLS’. Yet, I can connect
> to the MySQL backend and query it fine. In hive I can also query tables I
> know are there…just not TBLS.
>
> Hope that makes sense,
>
> Thanks again!
>
> Scott
>
> On Dec 9, 2017, 9:46 AM -0800, Elliot West , wrote:
>
> Which distribution are you using? Do you have hadoop-aws on the class
> path? Is ‘/path/to/hadoop/install’ a literal value or a placeholder that
> you’ using for the actual location?
>
> Cheers,
>
> Elliot.
>
> On Sat, 9 Dec 2017 at 00:08, Scott Halgrim 
> wrote:
>
>> Hi,
>>
>> I’ve been struggling with this for a few hours, hopefully somebody here
>> can help me out.
>>
>> We have a lot of data in parquet format on S3 and we want to use Hive to
>> query it. I’m running on ubuntu and we have a MySQL metadata store on AWS
>> RDS.
>>
>> The command in the hive client I’m trying to run is:
>>
>> CREATE EXTERNAL TABLE
>> my_schema.my_table
>> (account_id INT,
>> action VARCHAR(282),
>> another_id INT
>> yaid INT,
>> `date` TIMESTAMP,
>> deleted_at TIMESTAMP,
>> id INT,
>> lastchanged TIMESTAMP,
>> thing_index DOUBLE,
>> old_id INT,
>> parent_id INT,
>> running INT,
>> other_id INT,
>> another_thing VARCHAR(282),
>> title VARCHAR(282),
>> type_of VARCHAR(282))
>> PARTITIONED BY (snapshot_date DATE)
>> STORED AS parquet
>> LOCATION 's3a://bucket/folder/foo_my_schema.my_table';
>>
>>
>> The error I get is:
>>
>> FAILED: SemanticException java.lang.RuntimeException: 
>> java.lang.ClassNotFoundException:
>> Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
>>
>>
>> I have this in my hive-site.xml file:
>>
>> <property>
>>   <name>hive.aux.jars.path</name>
>>   <value>/path/to/hadoop-install/hadoop-2.9.0/share/hadoop/tools/lib</value>
>> </property>
>>
>>
>>
>> Another thing I tried was to create the external table without a location
>> and then alter it to have the location:
>>
>> alter table my_schema.my_table set location "s3a://bucket/folder/foo_my_
>> schema.my_table";
>>
>>
>> And then I get a different error:
>>
>> FAILED: SemanticException Cannot connect to namenode, please check if
>> host/port pair for s3a://bucket/folder/foo_my_schema.my_table is valid
>>
>>
>> What could I be missing?
>>
>> Thanks!
>>
>> Scott
>>
>


Re: "IS [NOT] NULL for a complex type"

2017-11-29 Thread Furcy Pin
Hello Jinchul,

in SQL, any type (even complex types such as structs) can be NULL.
And this happens as often as (LEFT | RIGHT | FULL) JOINs are used:
when a record is not matched by the joined table, it will be NULL.
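
For example, a small sketch with hypothetical tables (t_left holding ids,
t_right holding an id plus a struct column) that produces a genuinely NULL
struct through an unmatched LEFT JOIN:

CREATE TABLE t_left  (id INT);
CREATE TABLE t_right (id INT, a STRUCT<c1:INT, c2:STRING, c3:DOUBLE>);

INSERT INTO TABLE t_left VALUES (1), (2);
INSERT INTO TABLE t_right SELECT 1, named_struct('c1', 100, 'c2', 'test', 'c3', 1.234);

SELECT l.id, r.a, r.a IS NULL
FROM t_left l
LEFT JOIN t_right r ON l.id = r.id;
-- id=1 is matched, so r.a holds the struct and "r.a IS NULL" should be false
-- id=2 has no match, so r.a comes back NULL and "r.a IS NULL" should be true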

As far as I remember, directly creating NULLs with complex types (for
testing purposes) is complicated in Hive
because of type checking: you can cast a NULL into a primitive type, but it
is not that easy to obtain a NULL of a complex type.

Finally, about the meaning of NULL: it helped me a lot when I realized that
in SQL, a NULL should be interpreted as a "We don't know"
or "This value is missing and could be anything": this is why TRUE, FALSE,
and NULL implement the three-valued logic, as explained here:

https://en.wikipedia.org/wiki/Null_(SQL)#Comparisons_with_NULL_and_the_three-valued_logic_.283VL.29


Regards,

Furcy




2017-11-29 1:18 GMT+01:00 Jin Chul Kim :

> Hi,
>
> May I know the meaning of IS [NOT] NULL for a complex type such as STRUCT?
> As far as I know, we cannot assign NULL to struct directly.
> So, I expected them:
> 1) NULL returns if any of the elements in struct has NULL
> 2) NULL returns if all of the elements in struct have NULL
>
> By the way, my assumption was wrong in my example below. Could you let me
> know when struct is null?
>
> For example,
> create table t1(a struct<c1:int, c2:string, c3:double>);
> insert into t1 select named_struct('c1', 100, 'c2', 'test', 'c3', 1.234);
> insert into t1 select named_struct('c1', cast(null as int), 'c2', 'test',
> 'c3', 1.234);
> insert into t1 select named_struct('c1', 100, 'c2', cast(null as string),
> 'c3', 1.234);
> insert into t1 select named_struct('c1', 100, 'c2', 'test', 'c3',
> cast(null as double));
> insert into t1 select named_struct('c1', cast(null as int), 'c2',
> cast(null as string), 'c3', cast(null as double));
> select a is null, * from t1;
> false   {"c1":100,"c2":"test","c3":1.234}
> false   {"c1":null,"c2":"test","c3":1.234}
> false   {"c1":100,"c2":null,"c3":1.234}
> false   {"c1":100,"c2":"test","c3":null}
> false   {"c1":null,"c2":null,"c3":null}
>
> Best regards,
> Jinchul
>


Re: migrate hive cli to beeline

2017-11-22 Thread Furcy Pin
Hello,

As the table explains, it really depends on the number of concurrent
connections you will have on your HiveServer2,
but I confirm that you can experience OOM failures on either the
HiveServer2 or the Metastore if too many concurrent
queries are run simultaneously.

For larger clusters, as Cloudera recommends, you will have to use multiple
HiveServer2 instances with load-balancing.
Performances on the Metastore's SQL backend can also become a bottleneck at
some point.

Also, when using beeline to retrieve large amounts of data as your query
output, don't forget to use this option, or it will hang and ultimately
fail:
"beeline --incremental=true"




2017-11-22 8:04 GMT+01:00 游垂凯 :

> Hello everyone:
> Recently, I want to migrate from the hive cli to beeline, but I referred to
> the cloudera official site about hive beeline (https://www.cloudera.
> com/documentation/enterprise/5-8-x/topics/cdh_ig_hive_
> install.html#concept_alp_4kl_3q); it describes that the beeline cli should
> use a heap size of at least 2GB, and that hiveserver2 and the hive metastore
> require sufficient memory to run correctly.
> I want to know whether the hiveserver2 and hive metastore memory sizes that
> cloudera recommends are really accurate. Thanks for your help.
> Number of Concurrent Connections | HiveServer2 Heap Size Recommended Range | Hive Metastore Heap Size Recommended Range
> Up to 40 concurrent connections (*) | 12 - 16 GB | 12 - 16 GB
> Up to 20 concurrent connections | 6 - 12 GB | 10 - 12 GB
> Up to 10 concurrent connections | 4 - 6 GB | 4 - 10 GB
> One connection | 4 GB | 4 GB
>
> (*) Cloudera recommends splitting HiveServer2 into multiple instances and
> load-balancing once you start allocating over 16 GB to HiveServer2. This
> reduces the impact of Java garbage collection on active processing by the
> service.
>
> As my English is rather weak (I am still working on improving it), let me
> restate what I want to describe. Recently I wanted to migrate from the hive
> cli to beeline, but I saw the cloudera official site describe hiveserver2
> like this:
> https://www.cloudera.com/documentation/enterprise/5-8-
> x/topics/cdh_ig_hive_install.html#concept_alp_4kl_3q
> I don't know whether the concurrent connection counts and the hiveserver2 and
> metastore memory configurations written on the official site are really
> accurate. Has anyone tested this? I heard some developer colleagues say that
> hiveserver2 may need to be restarted after running for a long time because of
> performance issues. Hoping for some help, thanks.
>
>
>  youck
>


Re: Error "Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient"

2017-11-21 Thread Furcy Pin
Hi,

I recommend you try using HiveServer2 and beeline as explained here:
https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-RunningHiveServer2andBeeline.1

In particular, please make sure you have run "bin/schematool -dbType derby
-initSchema" and that it returned successfully,
as this command is required to initialize your derby backend.

Hope this helps,

Furcy



2017-11-21 1:37 GMT+01:00 Fabricio Pedroso Jorge :

> Hi all,
>
> Hadoop version...: 2.6.0 (Oracle Linux 7)
> Hive version.: 2.3.2
>
> I am quite a newbie at the Hive world, and because of that, I am getting
> the following error every time I try to execute a command at the hive CLI:
>
>
> *hive> show tables;FAILED: SemanticException
> org.apache.hadoop.hive.ql.metadata.HiveException:
> java.lang.RuntimeException: Unable to instantiate
> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient*
>
> My metastore is at $DERBY_HOME/data and my hive-site.xml has the following
> configuration, regarding the Derby connection:
>
>
>
>
>
>
>
> <property>
>   <name>javax.jdo.option.ConnectionURL</name>
>   <value>jdbc:derby://h-master:1527/metastore_db;create=true</value>
>   <description>JDBC connect string for a JDBC metastore. To use SSL
>   to encrypt/authenticate the connection, provide database-specific SSL
>   flag in the connection URL. For example,
>   jdbc:postgresql://myhost/db?ssl=true for postgres database.</description>
> </property>
> [...]
> <property>
>   <name>javax.jdo.option.ConnectionDriverName</name>
>   <value>org.apache.derby.jdbc.ClientDriver</value>
>   <description>Driver class name for a JDBC metastore</description>
> </property>
>
> I am using the Derby Database at the Network Server mode.
>
> Any help will be appreciated. Thanks.
>
> --
> *Fabrício Pedroso Jorge.*
>
> Administrador de Banco de Dados
>
> *Resumo Profissional:*
> http://br.linkedin.com/in/fabriciojorge
>
> *Contatos:*
> + 55 91 988991116 <+55%2091%2098899-1116>
> skype: fabricio.pedroso.jorge
> fpjb...@gmail.com
>


Re: Options for connecting to Apache Hive

2017-11-11 Thread Furcy Pin
Hi Jacob,

Like Elliot and Alan said, cross-language exchanges seem to be handled
using Thrift, and there are some ODBC drivers too.
I didn't find a C library, even though there are python, ruby and nodeJS
libraries.

Also, when Googling, beware not to confuse HiveServer1 (sometimes referred
to as Thrift Server) and HiveServer2.
For instance this page seem to show some Python Clients for HiveServer1
https://cwiki.apache.org/confluence/display/Hive/HiveClient#HiveClient-ThriftC++Client

links for Python and Ruby clients are given at the end of this page:
https://cwiki.apache.org/confluence/display/Hive/Setting+Up+HiveServer2#SettingUpHiveServer2-PythonClientDriver

But beware that the python client has been deprecated and you have to jump
a few links to find the most recent one.

Anyway, I also wrote some productivity tool for Hive (
https://github.com/FurcyPin/flamy), which is only a Command-Line-Interface
for now, but I am currently working on making a GUI with Scala Play +
ReactJS.
I believe it is hard to make a tool which is good for both PostgreSQL and
Hive, because despite being SQL-based, they have
different use-cases, different behaviors (Hive is generally much slower to
respond to small queries) and different syntax.

If you decide not to spend time on adding Hive integration to your project,
perhaps you could redirect the users
that made that feature request to me. I would be happy to discuss their
need with them, and for now,
this is a one man project and I am mostly looking for users, so my work
will remain free and open source.

Regards,

Furcy



2017-11-10 11:28 GMT+01:00 Jakob Egger :

> Hi!
>
> I'm the developer of a database client tool, and I've received a request
> to add support for querying Apache Hive.
>
> (My tool lets the user execute SQL queries, and it allows browsing tables
> etc.)
>
> As a first step of evaluating this suggestion, I'm trying to find out if
> there is a convenient way to connect to Hive.
>
> From reading the documentation, it seems that the preferred way to connect
> seems to be using the JDBC driver. Since my app is not written in Java,
> this is probably not the way to go. Apart from that, I didn't find much on
> this topic in the docs.
>
> I have a few questions:
>
> 1) What ways are there to connect to Apache Hive?
>
> 2) Is there a C client library?
>
> 3) Is there any documentation on the wire protocol that Hive uses for
> client / server communication?
>
> I'd appreciate if someone who knows more about the project could point me
> in the right direction!
>
> Best regards,
> Jakob


Re: partitioned hive table

2017-10-31 Thread Furcy Pin
Hi,

If you want to load pre-existing records, instead of inserting data in this
partition, you should use the ADD PARTITION statement
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AddPartitions

or simply the MSCK REPAIR TABLE statement
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-RecoverPartitions(MSCKREPAIRTABLE)
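
For example, with the table and partition layout from your message (a sketch;
adjust the bucket and paths to your real ones):

-- register one known partition explicitly
ALTER TABLE table1 ADD IF NOT EXISTS
PARTITION (dt='2017-10-28', bar='hello')
LOCATION 's3://bucket/table1/dt=2017-10-28/bar=hello';

-- or let Hive discover every dt=.../bar=... directory under the table location
MSCK REPAIR TABLE table1;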


Something that new Hive users often miss is that Hive does not automatically
detect external data, and that a table's data and metadata can sometimes be
out of sync.

What happened in your case is that Hive keeps a record count for the
partition in its metadata:
when you insert data into the partition, Hive updates that count on the fly
with the number of rows you have inserted (here: 1).

So when you do a SELECT *, all your json files are read, but when you do a
SELECT COUNT(*), Hive will just fetch that number to respond faster.

By running COMPUTE STATISTICS this number is updated, but the correct way
is to use MSCK REPAIR TABLE to tell Hive to update its partition metadata.

Regards,

Furcy









2017-10-31 1:25 GMT+01:00 Jiewen Shao :

> Thanks Mich,
> ANALYZE TABLE PARTITION(dt='2017-08-20, bar='hello'')  COMPUTE STATISTICS
> indeed make count(*) returns correct value (for the partition only).
>
> but my hive table was not able to get data from those pre-existed json
> file unless I insert one record for the partition AND run ANALYZE TABLE
> ... COMPUTE STATISTICS for the partition. I must have missed something.
>
> How to make those preexisted json visible in hive table?
>
> On Mon, Oct 30, 2017 at 4:53 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> have you analyzed table for the partition?
>>
>> ANALYZE TABLE test_table PARTITION('2017-08-20, bar='hello'') COMPUTE
>> STATISTICS;
>>
>> and do count(*) from table
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 30 October 2017 at 22:29, Jiewen Shao  wrote:
>>
>>> Hi, I have persisted lots of JSON files on S3 under partitioned
>>> directories such as /bucket/table1/dt=2017-10-28/bar=hello/*
>>>
>>> 1. Now I created a hive table:
>>> CREATE EXTERNAL TABLE table1 ( )
>>> PARTITIONED BY (dt string, bar string) ROW FORMAT serde
>>> 'org.apache.hive.hcatalog.data.JsonSerDe' LOCATION 's3://bucket/table1';
>>>
>>> 2. Under hive commandline
>>> select * from table1;// return nothing
>>>
>>> 3. INSERT INTO TABLE  table1 PARTITION (dt='2017-08-28', bar='hello')
>>> select ;
>>>
>>> 4. now select * from table1;// return all the data from that
>>> partition
>>>
>>> 5. select count(*) from table1;  // returns 1
>>>
>>> Can someone explain what did  I miss?
>>>
>>> Thanks a lot!
>>>
>>
>>
>

Re: Hive locking mechanism on read partition.

2017-10-12 Thread Furcy Pin
According to the documentation link you gave:

"A 'S' lock on table and relevant partition is acquired when a read is
being performed. For all other operations, an 'X' lock is taken on the
partition. However, if the change is only applicable to the newer
partitions, a 'S' lock is acquired on the table, whereas if the change is
applicable to all partitions, a 'X' lock is acquired on the table..."

They also give an example:

*insert into T2(partition P2) select .. T1 partition P1*

*S on T2, T1, T1.P1 and X on T2.P2*


However, it seems that someone else wrote this bellow too:

*The recipe listed above will not work as specified, because of the
hierarchical nature of locks.*

The 'S' lock for table T is specified as follows:

   - Call create( ) to create a node with pathname "/warehouse/T/read-".
   This is the lock node used later in the protocol. Make sure to set the
   sequence and ephemeral flag.
   - *Call getChildren( ) on the lock node without setting the watch flag.*
   - *If there is a child with a pathname starting with "write-" and a
   lower sequence number than the one obtained, the lock cannot be acquired.
   Delete the node created in the first step and return.*
   - Otherwise the lock is granted.

The 'X' lock for table T is specified as follows:

   - Call create( ) to create a node with pathname "/warehouse/T/write-".
   This is the lock node used later in the protocol. Make sure to set the
   sequence and ephemeral flag.
   - Call getChildren( ) on the lock node without setting the watch flag.
   - If there is a child with a pathname starting with "read-" or "write-"
   and a lower sequence number than the one obtained, the lock cannot be
   acquired. Delete the node created in the first step and return.
   - Otherwise the lock is granted.



So, as I understand it, the S lock on results_table is blocked by the X
lock on partition task_id=5556,
which means that getChildren() on the results_table finds a child
(partition task_id=5556) with an X lock
and doesn't grant the lock.

I am not part of the Hive development team, but I would say that this looks
like a bug, or at least
a misconception to me. But I found many very old JIRA issues about locking
that are still open, so I don't know
if this feature is actively maintained.

It would be preferable to test this on a more recent version before
submitting a JIRA, though.


As an unsafe workaround, you can add this line before each of your read
queries:

SET hive.support.concurrency=false ;
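
Applied to the read query from your example, that would look like this (again,
unsafe if a concurrent write can overwrite the partition you are reading):

SET hive.support.concurrency=false;
SELECT * FROM results_table WHERE task_id = 5535;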





2017-10-12 12:58 GMT+02:00 Igor Kuzmenko :

> Hello, I'm using HDP 2.5.0.0 with the included hive 1.2.1, and I have a
> problem with the locking mechanism.
>
> Most of my queries to hive looks like this.
>
> *(1)insert into table results_table partition(task_id=${task_id})*
> *select * from data_table  where ;*
>
> results_table is partitioned by the task_id field and I expect to get an
> exclusive lock on the corresponding partition, which is true:
>
> Lock ID      Database  Table          Partition      State     Blocked By   Type       Transaction ID
> 136639682.4  default   results_table  task_id=5556   ACQUIRED               EXCLUSIVE  NULL
>
>
>
> Another type of query is fetching data from results_table:
>
> *(2)  select * from results_table where task_id = ${task_id}*
>
> This select doesn't require any map reduce and executes fast. This is
> exactly what I want.
> But if I execute these two queries at the same time, I can't perform a read
> from one results_table partition while inserting data into another.
>
> Locks looks like this:
>
> Lock ID      Database  Table          Partition      State     Blocked By   Type         Transaction ID
> 136639682.4  default   results_table  task_id=5556   ACQUIRED               EXCLUSIVE    NULL
> 136639700.1  default   results_table  NULL           WAITING   136639682.4  SHARED_READ  NULL
> 136639700.2  default   results_table  task_id=5535   WAITING                SHARED_READ  NULL
>
>
> Reading data from a specified partition requires a shared lock on the whole
> table. This prevents me from getting data until the first query completes.
>
> As I can see on this page, this is expected behavior. But I don't understand
> why we need a lock on the whole table.
> Can I get rid of the shared lock on the whole table, while still having a
> shared lock on the specific partition?
>
>
>
>


Re: [Question] Which record does Hive give the bigger number when I use row_number

2017-10-11 Thread Furcy Pin
Hello,

Either one can receive the bigger row_num, in a nondeterministic fashion
(which is NOT equivalent to random).
Simply put, it will be whichever row is processed last by Hive, which you
have no way to know.

If your two rows differ on other columns, you might want to add them to
your ORDER BY clause to ensure consistency.
If you do want to have them randomly shuffled, you can simply use "ORDER BY
cost, rand()"
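
For example, with the columns from your query, adding cost_date as a
tie-breaker makes the numbering reproducible (a sketch, assuming cost_date
distinguishes the two rows):

SELECT
    user_id,
    cost_date,
    cost,
    row_number() OVER (PARTITION BY user_id ORDER BY cost, cost_date) AS row_num
FROM table_A;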

Finally, there are other variants to row_number that behave slightly
differently, check out this link:
https://blog.jooq.org/2014/08/12/the-difference-between-row_number-rank-and-dense_rank/






On Wed, Oct 11, 2017 at 4:33 PM, 孙志禹  wrote:

> Dear all,
>Thanks since it's the first time for me to have a honor to ask
> questions here.
> I used the hql script below:
> -- -
> select
> user_id
> , cost_date  -- datetime
> , cost  -- int
> , row_number() over( partition by user_id order by cost )
> as row_num
> from table_A
> -- -.
> * The question is,* if for a special *user_id*( e.g. *user_id *=
> '1'),  there are two records with the same *cost *in the table, and I
> know by using the function *row_number *Hive will give  different
> *row_nums *for both records, so which one will get the bigger *row_num*?
> Thanks! And it's also okay to me if you give me a web-link which can
> give the answer.
> 
> Anci Sun from China
>


Re: Why is a single INSERT very slow in Hive?

2017-09-12 Thread Furcy Pin
You're welcome :-)

On Tue, Sep 12, 2017 at 1:06 PM, Jinhui Qin <qin.jin...@gmail.com> wrote:

> Furcy,
>
> Thank you so much for the detailed explanation, I realized that those
> overhead seems unavoidable as each statement needs to be turned into a
> MapReduce job before it can be executed, and the execution of MapReduce
> jobs also has some overhead such as provisioning the resources for the
> computation and persisting the results on the hdfs as you mentioned, these
> are the routines and unavoidable.
>
>  I will consider what Jorn suggested trying to collect the records and do
> a bulk load instead of using single insertion for each record.
>
>
> Jorn and Furcy, thanks again for the explanations and suggestions!
>
>
> On Sep 12, 2017 4:53 AM, "Furcy Pin" <furcy@flaminem.com> wrote:
>
> Hi,
>
> this is a very common question, as many people knowing SQL are used to
> RDBMS like MySQL, Oracle, or SQL Server.
> The first thing you need to know about Hive is that, in the first place,
> it has not been designed to replace
> such databases. Not when they are used for transaction processing anyway.
>
> This is stated in the Hive Tutorial, in the section "What Hive is NOT"
> https://cwiki.apache.org/confluence/display/Hive/Tutorial
>
> Transcationality is a feature that has been added afterwards, and as Jörn
> stated, it can be fast when used with TEZ + LLAP.
>
> In the logs you sent, you can see that Hive is running a MapReduce job to
> perform your task:
>
> MapReduce Jobs Launched:
> Stage-Stage-1: Map: 1   Cumulative CPU: 2.7 sec   HDFS Read: 3836 HDFS
> Write: 81 SUCCESS
> Total MapReduce CPU Time Spent: 2 seconds 700 msec
> OK
>
> If you are familiar with MapReduce, you should now that however small your
> dataset is, what takes time here
> is:
>
>1. Provisionning executors on YARN
>2. Starting one JVM per mapper and reducer (here you only have 1
>mapper and 0 reducer, as stated by the logs). The starting time of a JVM is
>generally a few seconds nowadays, except when to have to load 200Mb of
>Hadoop jars to start, in that case it takes around 10 seconds.
>3. Perform the task (a few milliseconds in your case)
>4. Persisting the results on HDFS (which requires a few hdfs
>operations and can take a few seconds to minutes if you write a lot of
>files, but in your case should be quick)
>
> The main improvement that a Tez LLAP or Spark backend will do is that the
> first to steps are already done
> and waiting for your queries to run.
> Hive as been designed in the first place to go faster than RDBMS in the
> cases where:
> - your data is too large to fit on a single instance, and sharding is
> painful
> - your jobs mostly consist in analytical processing, like full table
> aggregations
>
> In such case, the correct way to use Hive is by partitioning your table by
> day (if you run nightly batches)
> and generate a new partition every day.
> If you want to change something in your table (e.g. fix a bug), you just
> regenerate it.
>
> If you need fast response time for updating and fetching records, and
> scalability, perhaps you should look into HBase, Cassandra, or Kudu.
>
>
>
>
> On Mon, Sep 11, 2017 at 9:18 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>
>> Why do you want to do single inserts?
>> It has been more designed for bulk loads.
>> In any case newer version of Hive 2 using TEZ +llap improve it
>> significantly (also for bulk analysis). Nevertheless, it is good practice
>> to not use single inserts in an analysis systems, but try to combine and
>> bulk-load them.
>>
>> On 11. Sep 2017, at 21:01, Jinhui Qin <qin.jin...@gmail.com> wrote:
>>
>>
>>
>> Hi,
>>
>> I am new to Hive. I just created a simple table in hive and inserted two
>> records, the first insertion took 16.4 sec, while the second took 14.3 sec.
>> Why is that very slow? is this the normal performance you get in Hive using
>> INSERT ? Is there a way to improve the performance of a single "insert" in
>> Hive? Any help would be really appreciated. Thanks!
>>
>> Here is the record from a terminal in Hive shell:
>>
>> =
>>
>> hive> show tables;
>> OK
>> Time taken: 2.758 seconds
>> hive> create table people(id int, name string, age int);
>> OK
>> Time taken: 0.283 seconds
>> hive> insert into table people(1,'Tom A', 20);
>> Query ID = hive_20170911134052_04680c79-432a-43e0-827b-29a4212fbbc0
>> Total jobs = 3
>> Launching Job 1 out of 3
>> Number of reduce tasks is set to 0 si

Re: Why is a single INSERT very slow in Hive?

2017-09-12 Thread Furcy Pin
Hi,

this is a very common question, as many people knowing SQL are used to
RDBMS like MySQL, Oracle, or SQL Server.
The first thing you need to know about Hive is that, in the first place, it
has not been designed to replace
such databases. Not when they are used for transaction processing anyway.

This is stated in the Hive Tutorial, in the section "What Hive is NOT"
https://cwiki.apache.org/confluence/display/Hive/Tutorial

Transactionality is a feature that has been added afterwards, and as Jörn
stated, it can be fast when used with TEZ + LLAP.

In the logs you sent, you can see that Hive is running a MapReduce job to
perform your task:

MapReduce Jobs Launched:
Stage-Stage-1: Map: 1   Cumulative CPU: 2.7 sec   HDFS Read: 3836 HDFS
Write: 81 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 700 msec
OK

If you are familiar with MapReduce, you should know that however small your
dataset is, what takes time here
is:

   1. Provisioning executors on YARN
   2. Starting one JVM per mapper and reducer (here you only have 1 mapper
   and 0 reducers, as stated by the logs). The starting time of a JVM is
   generally a few seconds nowadays, except when it has to load 200 MB of
   Hadoop jars to start, in which case it takes around 10 seconds.
   3. Perform the task (a few milliseconds in your case)
   4. Persisting the results on HDFS (which requires a few hdfs operations
   and can take a few seconds to minutes if you write a lot of files, but in
   your case should be quick)

The main improvement that a Tez LLAP or Spark backend brings is that the
first two steps are already done
and waiting for your queries to run.
Hive has been designed in the first place to go faster than RDBMS in the
cases where:
- your data is too large to fit on a single instance, and sharding is
painful
- your jobs mostly consist of analytical processing, like full table
aggregations

In such cases, the correct way to use Hive is by partitioning your table by
day (if you run nightly batches)
and generating a new partition every day.
If you want to change something in your table (e.g. fix a bug), you just
regenerate it.
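
As a sketch of that pattern (hypothetical table names, reusing the columns
from your test):

CREATE TABLE people_daily (id INT, name STRING, age INT)
PARTITIONED BY (ds STRING)
STORED AS ORC;

-- one bulk INSERT per nightly batch, instead of many single-row INSERTs
INSERT OVERWRITE TABLE people_daily PARTITION (ds='2017-09-12')
SELECT id, name, age
FROM staging_people;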

If you need fast response time for updating and fetching records, and
scalability, perhaps you should look into HBase, Cassandra, or Kudu.




On Mon, Sep 11, 2017 at 9:18 PM, Jörn Franke  wrote:

> Why do you want to do single inserts?
> It has been more designed for bulk loads.
> In any case newer version of Hive 2 using TEZ +llap improve it
> significantly (also for bulk analysis). Nevertheless, it is good practice
> to not use single inserts in an analysis systems, but try to combine and
> bulk-load them.
>
> On 11. Sep 2017, at 21:01, Jinhui Qin  wrote:
>
>
>
> Hi,
>
> I am new to Hive. I just created a simple table in hive and inserted two
> records, the first insertion took 16.4 sec, while the second took 14.3 sec.
> Why is that very slow? is this the normal performance you get in Hive using
> INSERT ? Is there a way to improve the performance of a single "insert" in
> Hive? Any help would be really appreciated. Thanks!
>
> Here is the record from a terminal in Hive shell:
>
> =
>
> hive> show tables;
> OK
> Time taken: 2.758 seconds
> hive> create table people(id int, name string, age int);
> OK
> Time taken: 0.283 seconds
> hive> insert into table people(1,'Tom A', 20);
> Query ID = hive_20170911134052_04680c79-432a-43e0-827b-29a4212fbbc0
> Total jobs = 3
> Launching Job 1 out of 3
> Number of reduce tasks is set to 0 since there's no reduce operator
> Starting Job = job_1505146047428_0098, Tracking URL =
> http://iop-hadoop-bi.novalocal:8088/proxy/application_1505146047428_0098/
> Kill Command = /usr/iop/4.1.0.0/hadoop/bin/hadoop job  -kill
> job_1505146047428_0098
> Hadoop job information for Stage-1: number of mappers: 1; number of
> reducers: 0
> 2017-09-11 13:41:01,492 Stage-1 map = 0%,  reduce = 0%
> 2017-09-11 13:41:06,940 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU
> 2.7 sec
> MapReduce Total cumulative CPU time: 2 seconds 700 msec
> Ended Job = job_1505146047428_0098
> Stage-4 is selected by condition resolver.
> Stage-3 is filtered out by condition resolver.
> Stage-5 is filtered out by condition resolver.
> Moving data to: hdfs://iop-hadoop-bi.novalocal:8020/apps/hive/warehouse/
> people/.hive-staging_hive_2017-09-11_13-40-52_106_462156758110461544
> 1-1/-ext-1
> Loading data to table default.people
> Table default.people stats: [numFiles=1, numRows=1, totalSize=11,
> rawDataSize=10]
> MapReduce Jobs Launched:
> Stage-Stage-1: Map: 1   Cumulative CPU: 2.7 sec   HDFS Read: 3836 HDFS
> Write: 81 SUCCESS
> Total MapReduce CPU Time Spent: 2 seconds 700 msec
> OK
> Time taken: 16.417 seconds
> hive> insert into table people values(1,'Tom A', 20);
> Query ID = hive_20170911134128_c8f46977-7718-4496-9a98-cce0f89ced79
> Total jobs = 3
> Launching Job 1 out of 3
> Number of reduce tasks is set to 0 since there's no reduce operator
> 

Re: ORC Transaction Table - Spark

2017-08-24 Thread Furcy Pin
As far as I know, Spark can't read Hive's transactional tables yet:
https://issues.apache.org/jira/browse/SPARK-16996




On Thu, Aug 24, 2017 at 4:34 AM, Aviral Agarwal 
wrote:

> So, there is no way possible right now for Spark to read Hive 2.x data ?
>
> On Thu, Aug 24, 2017 at 12:17 AM, Eugene Koifman  > wrote:
>
>> This looks like you have some data written by Hive 2.x and Hive 1.x code
>> trying to read it.
>>
>> That is not supported.
>>
>>
>>
>> *From: *Aviral Agarwal 
>> *Reply-To: *"user@hive.apache.org" 
>> *Date: *Wednesday, August 23, 2017 at 12:24 AM
>> *To: *"user@hive.apache.org" 
>> *Subject: *Re: ORC Transaction Table - Spark
>>
>>
>>
>> Hi,
>>
>> Yes it caused by wrong naming convention of the delta directory :
>>
>> /apps/hive/warehouse/foo.db/bar/year=2017/month=5/delta_0645
>> 253_0645253_0001
>>
>> How do I solve this ?
>>
>> Thanks !
>> Aviral Agarwal
>>
>>
>>
>> On Tue, Aug 22, 2017 at 11:50 PM, Eugene Koifman <
>> ekoif...@hortonworks.com> wrote:
>>
>> Could you do recursive “ls” in your table or partition that you are
>> trying to read?
>>
>> Most likely you have files that don’t follow expected naming convention
>>
>>
>>
>> Eugene
>>
>>
>>
>>
>>
>> *From: *Aviral Agarwal 
>> *Reply-To: *"user@hive.apache.org" 
>> *Date: *Tuesday, August 22, 2017 at 5:39 AM
>> *To: *"user@hive.apache.org" 
>> *Subject: *ORC Transaction Table - Spark
>>
>>
>>
>> Hi,
>>
>>
>>
>> I am trying to read hive orc transaction table through Spark but I am
>> getting the following error
>>
>>
>> Caused by: java.lang.RuntimeException: serious problem
>> at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSpli
>> tsInfo(OrcInputFormat.java:1021)
>> at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(Or
>> cInputFormat.java:1048)
>> at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
>> .
>> Caused by: java.util.concurrent.ExecutionException:
>> java.lang.NumberFormatException: For input string: "0645253_0001"
>> at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>> at java.util.concurrent.FutureTask.get(FutureTask.java:192)
>> at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSpli
>> tsInfo(OrcInputFormat.java:998)
>> ... 118 more
>>
>>
>> Any help would be appreciated.
>>
>> Thanks and Regards,
>> Aviral Agarwal
>>
>>
>>
>
>


Re: One column into multiple column.

2017-08-24 Thread Furcy Pin
Hello,

You can use the split(string, pattern) UDF, which returns an array.
You can compute this array in a subquery, and then assign a[0], a[1], a[2]
... to each column.
That way, the split UDF will only be called once per row.
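
For instance, a rough sketch (the names t1, t2, c1 and the pipe delimiter
come from the question below; everything else is an assumption on my side):

INSERT INTO TABLE t2
SELECT a[0], a[1], a[2], a[3], a[4],
       a[5], a[6], a[7], a[8], a[9]
FROM (
  SELECT split(c1, '\\|') AS a  -- split takes a regex, so the pipe must be escaped
  FROM t1
) s;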



On Thu, Aug 24, 2017 at 12:11 AM, Deepak Khandelwal <
dkhandelwal@gmail.com> wrote:

> Can someone tell the best way to implement below in hive.
>
>  how can we take input from column c1 from tab Such that c1 has multiple
> values delimited by pipe. Each of the delimited value from col c1 of table
> t1 needs to be inserted into separate column in table t2.
>
> I can write a UDF for this but this udf I have to call 10 times to
> retrieve 10 delimited values from c1 of table t1 and then insert them into
> 10 separate cols of table t2. I am just trying to find out if there is a
> better way to do this so that I don't have to call udf 10 times
>  Thanks
>


Re: LEFT JOIN and WHERE CLAUSE - How to handle

2017-08-23 Thread Furcy Pin
I would suggest using a subquery:

WITH unique_currency AS (
  SELECT
CCY_CD,
MAX(CNTRY_DESC) as CNTRY_DESC
  FROM CURRENCY
  GROUP BY CCY_CD
)

and then perform your left join on it.
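
For example, appended to the WITH clause above (a sketch reusing the table
and column names from the question below):

SELECT t.ROW_NUM, t.CCY_CD, t.TXN_DT, c.CNTRY_DESC
FROM TXN t
LEFT JOIN unique_currency c
  ON t.CCY_CD = c.CCY_CD;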

Some SQL engines (e.g. Presto) have aggregation functions like arbitrary(col)
that take any value and are a little less costly than a max.
Sometimes, they also have functions like max_by(x, y) that
would allow you to get the most recent description.

It is a shame that this function is not included in Hive yet, but you can
still find some UDAF implementations on GitHub.



On Wed, Aug 23, 2017 at 3:37 PM, Ramasubramanian Narayanan <
ramasubramanian.naraya...@gmail.com> wrote:

> Hi,
>
> Need your suggestion on the below.
>
> Have two tables TXN and CURRENCY.
>
> Need all records in TXN and hence doing Left Join with CURRENCY.
>
> *Two problems :*
> 1. CURRENCY table may contain duplicate records hence it needs to be
> handled through RANK or some other function.
> 2. If we equate TXN_DT between EFF_ST_DT and EFF_END_DT in the 'where
> clause' then we will loose the EUR records which should not happen.
>
> Please suggest a solution to over come both the problems. For duplicated
> records it is fine if we select any of the CNTRY_DESC.
>
> *Table : CURRENCY*
>
> CCY_CD | CNTRY_DESC              | EFF_ST_DT | EFF_END_DT
> INR    | Indian Rupee            | 1-Jan-15  | 20-Feb-16
> INR    | Indian Rupee New        | 21-Feb-16 | 20-Feb-99
> USD    | US Dollar               | 1-Jan-15  | 20-Feb-16
> SGD    | Singapore Dollar        | 1-Jan-15  | 20-Feb-17
> SGD    | Singapore Dollar New    | 15-Feb-17 | 20-Feb-99
> SGD    | Singapore Dollar Latest | 16-Feb-17 | 16-Feb-17
>
> *Table : TXN*
>
> ROW_NUM | CCY_CD | TXN_DT
> 1       | INR    | 16-Feb-17
> 2       | USD    | 16-Feb-17
> 3       | SGD    | 16-Feb-17
> 4       | EUR    | 16-Feb-17
>
> *Expected Output*
>
> ROW_NUM | CCY_CD | TXN_DT    | CNTRY_DESC
> 1       | INR    | 16-Feb-17 | Indian Rupee
> 2       | USD    | 16-Feb-17 | US Dollar
> 3       | SGD    | 16-Feb-17 | Singapore Dollar Latest (any of the three valid values is fine)
> 4       | EUR    | 16-Feb-17 | NULL
>
>
>
> *Query : *Select ROW_NUM,CCY_CD,TXN_DT,CNTRY_DESC
> from CURRENCY LEFT JOIN TXN on (CURRENCY.CCY_CD = TXN.CCY_CD)
> where
> TXN_DT between EFF_ST_DT and EFF_END_DT;
>
>
>
> This query will drop the "EUR" record because of the where clause used.
> It cannot be handled with case statement instead of 'where clause' as we
> have   more than one record for 'SGD' when  TXN_DT is 16-FEB.
>
> regards,
> Rams
>


Re: How can I identify permissions issues given problematic SQL?

2017-08-16 Thread Furcy Pin
Hi Giles,

Did you try running the query with the EXPLAIN AUTHORIZATION clause?

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Explain#LanguageManualExplain-TheAUTHORIZATIONClause
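
For example (a sketch reusing one of the queries mentioned below):

EXPLAIN AUTHORIZATION SELECT guid FROM foo LIMIT 1;

The output lists the INPUTS, OUTPUTS, CURRENT_USER and OPERATION of the
query, which should show which objects the user is (or is not) allowed to
read.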

On Wed, Aug 16, 2017 at 3:08 AM, Bear Giles  wrote:

> Hi, an issue came up during recent AT and I'm hoping someone will have
> ideas on how we can check for this in the future, esp. if we can automate
> it somehow.
>
> In this case we had a client using our Hive client in the existing release
> without problems. When they copied their process to the UAT system for the
> final User Acceptance Testing it failed and they reported it as a defect in
> our pending release.
>
> I looked at it - when I ran the query I got a generic SQL Exception
> message. (If there were further details we didn't capture them.) It
> consistently took about 20 minutes to get the error. I simplified the query
> repeatedly in order to reduce the number of unknowns but it was always a
> generic SQL Exception that took about 20 minutes... except for the one time
> it took 4 hours(!).
>
> However we had a valid connection. I could execute 'select 1 as x' and get
> an immediate response. (Maybe 20s, but "immediate" when you've been waiting
> 20 minutes or more.)  Clearly I didn't have bad credentials.
>
> I tried running DESCRIBE table and SHOW tables but didn't get anything
> useful due to a bug in our software - we're treating those as a statement
> instead of a query so I didn't get anything useful. I couldn't push a fix
> either since we were in UAT and the other machines weren't whitelisted to
> access their servers.
>
> I made an informed guess that it could be a permissions issue with the
> underlying tables and they should check that. Nothing else made sense - we
> were clearly establishing a connection and the only difference between
> production and UAT was the host running the software. The customer was
> skeptical but eventually double-checked the connection properties and
> permissions and determined that there was a difference between the systems.
> I don't know what the ultimate problem was. I'm pretty sure it wasn't just
> a different account that didn't have the table - IIRC I did a test where I
> tried to select records from a non-existent table and it returned quickly
> with a meaningful error.
>
> My questions:
>
> 1. What conditions could result in the ability to establish a connection,
> run a query such as 'select 1 as x', but take 20 minutes to throw an error
> when I run a query like 'select guid from foo limit 1'?
>
> 2. How can I test for them? Now that we're post-UAT (and this is a known
> concern) we can add logic to perform standard tests for missing permissions
> if we know what to do.
>
> Thanks.
>
>
> Bear Giles
>
> Sr. Java Application Engineer
> bgi...@snaplogic.com
> Mobile: 720-749-7876
>
>
> 
>
>
>
> *SnapLogic Inc | 929 Pearl St #200 | 80303 CO 80302 | USA*
>
> *SnapLogic Inc | 2 W 5th Avenue 4th Floor | San Mateo CA 94402 | USA   *
>
>
> This message is confidential. It may also be privileged or otherwise
> protected by work product immunity or other legal rules. If you have
> received it by mistake, please let us know by e-mail reply and delete it
> from your system; you may not copy this message or disclose its contents to
> anyone. The integrity and security of this message cannot be guaranteed on
> the Internet.
>


Introducing a new open source project: flamy

2017-07-05 Thread Furcy Pin
Dear Hive users,

it is my pleasure to announce that my company Flaminem just open-sourced
a productivity tool for Hive developers: Flamy.

https://github.com/flaminem/flamy

In short, Flamy is a tool that helps organising, validating and running SQL
queries and managing their dependencies.
By analyzing queries, Flamy can find dependencies between tables, draw the
dependency graph, and run the queries in the right order.
It is also a great tool to quickly validate your hive queries without
having to actually run them.
It is currently compatible with Hive and Spark-SQL, and is especially
helpful when using Hive on Amazon's EMR.

This tool has been developed and used internally since 2014 by our
company,
giving us a huge boost in productivity for designing, deploying, running
and maintaining Hive workflows.
We would like to give back to the open source community by open sourcing
this project
and encouraging Hive users to try it.

In the repo's wiki you will find everything required to install and use
flamy, along with a tutorial:
https://github.com/flaminem/flamy/wiki


Feedback of any type would be most welcome: comments, suggestions,
questions...
And if you like it, be sure to star the repository and to spread the word!

Thanks,

Furcy Pin
CTO @ Flaminem


Re: Format dillema

2017-06-20 Thread Furcy Pin
Another option would be to try Facebook's Presto https://prestodb.io/

Like Impala, Presto is designed for fast interactive querying over Hive
tables, but it is also capable of querying data from many other SQL sources
(mySQL, postgreSQL, Kafka, Cassandra, ...
https://prestodb.io/docs/current/connector.html)

In terms of performance on small queries, it seems to be as fast as
Impala, a league above Spark-SQL, and of course two leagues above Hive.

Unlike Impala, Presto is also able to read the ORC file format, and make the
most of it (e.g. read pre-aggregated values from ORC headers).

It can also make use of Hive's bucketing feature, while Impala still cannot:
https://github.com/prestodb/presto/issues/
https://issues.apache.org/jira/browse/IMPALA-3118

Regards,

Furcy





On Tue, Jun 20, 2017 at 5:36 AM, Sruthi Kumar Annamneedu <
sruthikumar...@gmail.com> wrote:

> Try using Parquet with Snappy compression and Impala will work with this
> combination.
>
> On Sun, Jun 18, 2017 at 3:35 AM, rakesh sharma  > wrote:
>
>> We are facing an issue of format. We would like to do bi style queries
>> from hive using impala and that supports parquet but we would like the data
>> to be compressed to the best ratio like orc. But impala cannot query orc
>> formats. What can be a design consideration for this. Any help
>>
>> Thanks
>> Rakesh
>>
>> Get Outlook for Android 
>>
>>
>


Re: drop table - external - aws

2017-05-17 Thread Furcy Pin
for that, sublime text + multi-line edit is your friend !

https://www.youtube.com/watch?v=-paR5m6m-Nw

On Wed, May 17, 2017 at 7:24 PM, Stephen Sprague <sprag...@gmail.com> wrote:

> yeah. that's a potential idea too.  gotta put the time in to script it
> with 200+ tables though.
>
> On Wed, May 17, 2017 at 10:07 AM, Furcy Pin <furcy@flaminem.com>
> wrote:
>
>> Did you try ALTER TABLE ... SET LOCATION ... ? maybe it could have worked.
>>
>>
>> On Wed, May 17, 2017 at 6:57 PM, Vihang Karajgaonkar <vih...@cloudera.com
>> > wrote:
>>
>>> This is interesting and possibly a bug. Did you try changing them to
>>> managed tables and then dropping or truncating them? How do we reproduce
>>> this on our setup?
>>>
>>> On Tue, May 16, 2017 at 6:38 PM, Stephen Sprague <sprag...@gmail.com>
>>> wrote:
>>>
>>>> fwiw. i ended up re-creating the ec2 cluster with that same host name
>>>> just so i could drop those tables from the metastore.
>>>>
>>>> note to self.  be careful - be real careful - with "sharing" hive
>>>> metastores between different compute paradigms.
>>>>
>>>> Regards,
>>>> Stephen.
>>>>
>>>> On Tue, May 16, 2017 at 6:38 AM, Stephen Sprague <sprag...@gmail.com>
>>>> wrote:
>>>>
>>>>> hey guys,
>>>>> here's something bizarre.   i created about 200 external tables with a
>>>>> location something like this 'hdfs:///path'.  this was three
>>>>> months ago and now i'm revisiting and want to drop these tables.
>>>>>
>>>>> ha! no can do!
>>>>>
>>>>> that  is long gone.
>>>>>
>>>>> Upon issuing the drop table command i get this:
>>>>>
>>>>> Error while processing statement: FAILED: Execution Error, return code
>>>>> 1 from org.apache.hadoop.hive.ql.exec.DDLTask.
>>>>> MetaException(message:java.lang.IllegalArgumentException:
>>>>> java.net.UnknownHostException: )
>>>>>
>>>>> where  is that old host name.
>>>>>
>>>>> so i ask is there a work around for this?  given they are external
>>>>> tables i'm surprised it "checks" that that location exists (or not.)
>>>>>
>>>>> thanks,
>>>>> Stephen
>>>>>
>>>>
>>>>
>>>
>>
>


Re: drop table - external - aws

2017-05-17 Thread Furcy Pin
Did you try ALTER TABLE ... SET LOCATION ... ? maybe it could have worked.


On Wed, May 17, 2017 at 6:57 PM, Vihang Karajgaonkar 
wrote:

> This is interesting and possibly a bug. Did you try changing them to
> managed tables and then dropping or truncating them? How do we reproduce
> this on our setup?
>
> On Tue, May 16, 2017 at 6:38 PM, Stephen Sprague 
> wrote:
>
>> fwiw. i ended up re-creating the ec2 cluster with that same host name
>> just so i could drop those tables from the metastore.
>>
>> note to self.  be careful - be real careful - with "sharing" hive
>> metastores between different compute paradigms.
>>
>> Regards,
>> Stephen.
>>
>> On Tue, May 16, 2017 at 6:38 AM, Stephen Sprague 
>> wrote:
>>
>>> hey guys,
>>> here's something bizarre.   i created about 200 external tables with a
>>> location something like this 'hdfs:///path'.  this was three
>>> months ago and now i'm revisiting and want to drop these tables.
>>>
>>> ha! no can do!
>>>
>>> that  is long gone.
>>>
>>> Upon issuing the drop table command i get this:
>>>
>>> Error while processing statement: FAILED: Execution Error, return code 1
>>> from org.apache.hadoop.hive.ql.exec.DDLTask.
>>> MetaException(message:java.lang.IllegalArgumentException:
>>> java.net.UnknownHostException: )
>>>
>>> where  is that old host name.
>>>
>>> so i ask is there a work around for this?  given they are external
>>> tables i'm surprised it "checks" that that location exists (or not.)
>>>
>>> thanks,
>>> Stephen
>>>
>>
>>
>


Re: Is it possible to use LOAD DATA INPATH with a PARTITIONED, STORED AS PARQUET table?

2017-04-07 Thread Furcy Pin
Hi Dmitry,

I believe what you are referring to is similar to what the MSCK REPAIR
TABLE command does in Hive:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-RecoverPartitions(MSCKREPAIRTABLE)

Simply put: if you define a partitioned Hive table via a CREATE statement,
and then put data on HDFS that matches the way Hive organizes data,
in your case something like:
hdfs:///user/hive/warehouse/db.db/my_table/date=YYMMDD/content_type=Presentation/data_file_01.parquet
(and you can have multiple parquet files per folder),
then you can either use the ADD PARTITION statement to tell Hive that the
data is there, or just use MSCK REPAIR TABLE to have Hive automatically
create the partitions corresponding to the files that you added on HDFS
manually (as long as their paths follow Hive's partition naming convention).
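
For example, once the files are in place, something like this should work
(a sketch based on the table described below; the partition values are
placeholders):

ALTER TABLE db.mytable ADD IF NOT EXISTS
  PARTITION (`date`='YYMMDD', content_type='Presentation');

-- or, to let Hive discover all partition directories at once:
MSCK REPAIR TABLE db.mytable;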


However, Hive is not able to infer by itself the column mapping, and you
have to write the CREATE statement corresponding to your parquet files
first.
If you want a tool that can automatically create a Hive table with the
correct columns given a parquet file, you should have a look at Spark.

Lastly, I don't know what your input application does nor what your data
is, but rather than csv you should perhaps
be considering Avro (easy to serialize, like JSON but more optimized, and
compatible with the Hadoop ecosystem including Hive), especially if you
have nested data.









On Thu, Apr 6, 2017 at 11:19 PM, Dmitry Goldenberg  wrote:

> I'm assuming, given this:
>
> CREATE TABLE IF NOT EXISTS db.mytable (
>   `item_id` string,
>   `timestamp` string,
>   `item_comments` string)
> PARTITIONED BY (`date`, `content_type`)
> STORED AS PARQUET;
>
> we'd have to organize the input Parquet files into subdirectories where
> each subdirectory contains data just for the given 'date' (YYMMDD), then
> within that subdirectory, content would be organized by content_type, one
> file per content_type value.  How does Hive make the association of a
> partition with a subdirectory naming or know to look for files for
> content_type, and how would it match content_type='Presentation' -- would
> the file just need to be named "Presentation"?
>
>
> On Thu, Apr 6, 2017 at 5:05 PM, Dmitry Goldenberg <
> dgoldenb...@hexastax.com> wrote:
>
>> >> properly split and partition your data before using LOAD if you want
>> hive to be able to find it again.
>>
>> If the destination table is defined as
>> CREATE TABLE IF NOT EXISTS db.mytable (
>>   `item_id` string,
>>   `timestamp` string,
>>   `item_comments` string)
>> PARTITIONED BY (`date`, `content_type`)
>> STORED AS PARQUET;
>>
>> and supposing that we have the data "in hand" (in memory or as CSV files)
>> how does one go about the 'proper split and partition' so it adheres to:
>> PARTITIONED BY (`date`, `content_type`)  ?
>>
>> Thanks
>>
>>
>> On Thu, Apr 6, 2017 at 12:29 PM, Ryan Harris <
>> ryan.har...@zionsbancorp.com> wrote:
>>
>>> “If we represent our data as delimited files” ….the question is how you
>>> plan on getting your data into these parquet files since it doesn’t sound
>>> like your data is already in that format….
>>>
>>>
>>>
>>> If your data is not already in parquet format, you are going to need to
>>> run **some** process to get it into that format…why not just use hive
>>> (running a query on an external table) to perform the conversion?
>>>
>>>
>>>
>>> “and Hive represents it as Parquet internally” That entirely depends on
>>> the declared STORED AS format when you define the table.  The files backing
>>> the hive table **could** be TEXT, sequence, RC, ORC, Parquet…  If you
>>> declared the table to be backed by delimited text, you could format your
>>> data into standard text files (not parquet) and then add the data to the
>>> hive table using LOAD DATA.
>>>
>>>
>>>
>>> So, why NOT use text data for the table storage?  There is no way to
>>> optimize future queries against that data.
>>>
>>>
>>>
>>> One hypothetical workflow assuming that your data is currently
>>> delimited….
>>>
>>>
>>>
>>> You could either have a hive managed table, with the table data stored
>>> as TEXTFILE using some delimiter based SerDe, and you could then use LOAD
>>> DATA to put your original raw files into this table.   OR, you could use an
>>> external table (not managed by hive) to point to the data wherever it
>>> currently resides.  (The only difference between the two choices here is
>>> whether the original raw files end up in ‘/user/hive/warehouse/tablename’
>>> or the current HDFS path where they reside.
>>>
>>>
>>>
>>> From there, you could query FROM that temp table, INSERT into your final
>>> destination table, and the data will be formatted according to the data
>>> definition of your destination table.
>>>
>>>
>>>
>>>
>>>
>>> If you want to (for whatever reason) use LOAD DATA INPATH to shove the
>>> original data directly into your final destination table you must
>>>
>>> 1)  Ensure 

Re: Is it possible to use LOAD DATA INPATH with a PARTITIONED, STORED AS PARQUET table?

2017-04-06 Thread Furcy Pin
Hi Dmitry,

If I understand what you said correctly:

At the beginning you have csv files on hdfs,
and at the end you want a partitioned Hive table as parquet.

And your question is: "can I do this using only one Hive table and a LOAD
statement?"

The answer to that question is "no".


The correct way to do this is what you are currently doing right now:

Create a table db.origtable STORED AS TEXTFILE and use LOAD DATA to put the
csv data in it (it will stay in csv format).
Create a table db.mytable STORED AS PARQUET and then use Hive to insert data
into it from db.origtable.

Hive is only able to:
- query data from a Hive table
- write into another Hive table
- load data "as is" into a table without changing its format


The LOAD DATA statement cannot transform your data, nor change its format.
If you want to transform your data and/or change the format, you need to use
a regular INSERT INTO|OVERWRITE TABLE ... SELECT query.

By the way I would recommend using INSERT OVERWRITE to make your query
idempotent
and avoid issues if you accidentally run it twice.
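
Something like this, for instance (a sketch based on the table definitions
from this thread; the exact column list is an assumption):

SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE db.mytable PARTITION (`date`, content_type)
SELECT item_id, `timestamp`, item_comments, `date`, content_type
FROM db.origtable;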

Now, if you really want to only have *one* hive table, I guess you can
either:
- make table db.origtable temporary
- directly use spark to read the csv files and insert into Hive as Parquet.
- or, if you can, write your input data directly as parquet instead of csv,
but you would still need 2 tables if you also need to perform a GROUP BY.

Hope this helps.


On Thu, Apr 6, 2017 at 2:48 PM, Dmitry Goldenberg 
wrote:

> Thanks, Ryan.
>
> I was actually more curious about scenario B. If we represent our data as
> delimited files, why don't we just use LOAD DATA INPATH and load it right
> into the final, parquet, partitioned table in one step, bypassing dealing
> with the temp table?
>
> Are there any advantages to having a temp table besides the validation?
> One advantage could possibly be making it a transactional table and being
> able to run direct INSERT's into the temp table, avoiding having to deal
> with delimited files and LOAD DATA INPATH.
>
> If we go with route B, LOAD DATA INPATH directly into the parquet,
> partitioned table, would we have to:
>
> 1) represent the input files as Parquet? - it looks like the data is still
> delimited, and Hive represents it as Parquet internally
> 2) do anything specific in the input files / with the input files in order
> to make partitioning work, or does Hive just take the data and take full
> care of partitioning it?
>
>
>
> On Tue, Apr 4, 2017 at 6:14 PM, Ryan Harris 
> wrote:
>
>> For A) I’d recommend mapping an EXTERNAL table to the raw/original source
>> files…then you can just run a SELECT query from the EXTERNAL source and
>> INSERT into your destination.
>>
>>
>>
>> LOAD DATA can be very useful when you are trying to move data between two
>> tables that share the same schema but 1 table is partitioned and the other
>> table is NOT partitioned…once the files have been inserted into the
>> unpartitioned table the source files from the hive warehouse can be added
>> to the partitioned table using LOAD DATA.  Another place I’ve frequently
>> used LOAD DATA is when synchronizing hive table data between two clusters,
>> the hive warehouse data files can be copied from one cluster to the other
>> with distcp and then loading the data flies to the duplicate cluster using
>> LOAD DATA to ensure the metadata is recorded in hive metastore.
>>
>>
>>
>> *From:* Dmitry Goldenberg [mailto:dgoldenb...@hexastax.com]
>> *Sent:* Tuesday, April 04, 2017 3:31 PM
>> *To:* user@hive.apache.org
>> *Subject:* Re: Is it possible to use LOAD DATA INPATH with a
>> PARTITIONED, STORED AS PARQUET table?
>>
>>
>>
>> [External Email]
>> --
>>
>> Right, that makes sense, Dudu.
>>
>>
>>
>> So basically, if we have our data in "some form", and a goal of loading
>> it into a parquet, partitioned table in Hive, we have two choices:
>>
>>
>>
>> A. Load this data into a temporary table first. Presumably, for this we
>> should be able to do a LOAD INPATH, from delimited data files. Perhaps we
>> could designate the temp table as transactional and then simply do direct
>> INSERT's into this temp table - ? Then, as the second step, we'd do an
>> INSERT... SELECT, to move the data into the destination table, and then
>> DROP the temp table.
>>
>>
>>
>> B. Represent the data as a delimited format and do a LOAD INPATH directly
>> into the destination table. Understandably, we lose the 'data verification'
>> this way. If we go this route, must the data in the input files be in the
>> PARQUET format or in a delimited format?  I would guess, the former.  And,
>> how does partitioning play into it?  How would the input data need to be
>> organized and inserted so as to adhere to the partitions (the 'date' and
>> 'content-type' columns, in my example)?
>>
>>
>>
>>
>>
>>
>>
>> On Tue, Apr 4, 2017 at 2:22 PM, Markovitz, Dudu 
>> wrote:
>>
>> “LOAD” is very 

Re: hivevar in partition clause

2017-03-21 Thread Furcy Pin
Hello Jan,

I guess this is because you wrote ${hivevar=PNAME} with an = instead of a :
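
i.e. the partition clause should presumably read (the variable should also
expand inside the quotes, since substitution happens before parsing):

insert into table a1.${hivevar:TNAME}_orc PARTITION (part_col='${hivevar:PNAME}')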

regards,

Furcy

On Tue, Mar 21, 2017 at 5:50 PM, Brotanek, Jan 
wrote:

> Hi, I have following shell script:
>
>
>
> #!/bin/bash
>
> PNAME=20170320_203225
>
> TNAME=tablea1
>
> /usr/bin/hive -hiveconf hive.cli.errors.ignore=true -hivevar PNAME=$PNAME
> -hivevar TNAME=$TNAME -v -f ./query.hql
>
>
>
> Trying to pass PNAME hivevar to query.hql script:
>
>
>
> insert into table a1.${hivevar:TNAME}_orc PARTITION
> (part_col=${hivevar=PNAME})
>
> select
>
> a1,
>
> a2,
>
> a3
>
> from a1.${hivevar:TNAME}
>
> ;
>
>
>
> Substitution works fine in table name and from clause. However in
> (part_col=${hivevar=PNAME}) I get error cannot recognize input near '$' '{'
> 'hivevar' in constant.
>
>
>
> If I add quotes (part_col=”${hivevar=PNAME}”), it works fine, but never
> expands variable, resulting having text ”${hivevar=PNAME}”) in my partition.
>
> How should I quote PNAME to avoid error and having my variable expanded?
>
>
>
> Can anyone help?
>
> Thank you!
>
>
>


Re: PARTITION error because different columns size

2016-12-13 Thread Furcy Pin
Actually, there is a legacy feature in Hive that would do exactly what you
need :

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Select#LanguageManualSelect-REGEXColumnSpecification
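
With it, the insert could presumably be written as below (an untested sketch;
note that recent Hive versions require hive.support.quoted.identifiers=none
for this syntax):

SET hive.support.quoted.identifiers=none;

INSERT INTO TABLE employee_orc PARTITION (country='USA', office='HQ-TX')
SELECT `(country|office)?+.+`
FROM employee
WHERE country='USA' AND office='HQ-TX';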


Another way would simply be to use dynamic partitioning:
INSERT INTO TABLE employee_orc PARTITION *(country, office)* select * from
employee where country='USA' and office='HQ-TX';

The only downside of this method (that I know of) is that Hive will put a
lock on every partition, even if only one is inserted.


I also know a third solution, but I can't talk about it yet.



On Tue, Dec 13, 2016 at 3:27 PM, Joaquin Alzola 
wrote:

> Hi Suresh
>
>
>
> I choose the * and not the specific fields because I have 520 columns.
>
> The data that I tested was only a testing ground.
>
>
>
> I suppose then that I need to select the 520 fileds. L
>
>
>
>
>
>
>
> *From:* Suresh Kumar Sethuramaswamy [mailto:rock...@gmail.com]
> *Sent:* 13 December 2016 14:19
> *To:* user@hive.apache.org
> *Subject:* Re: PARTITION error because different columns size
>
>
>
> Hi Joaquin
>
>
>
> In hive , when u run 'select * from employee' it is going to return
> the partitioned columns also at the end,  whereas you don't want that to be
> inserted into ur ORC table , so ur insert query should look like
>
>
>
>   INSERT INTO TABLE employee_orc PARTITION (country='USA',
> office='HQ-TX') select eid,salary from employee where country='USA' and
> office='HQ-TX';
>
>
>
>
>
>  Remember partition in hive is a physical folder name
>
>
>
> Regards
>
> Suresh
>
>
>
>
>
>
>
> On Tue, Dec 13, 2016 at 6:37 AM Joaquin Alzola 
> wrote:
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> Hi List
>
>
>
>
>
>
>
> I change Spark to 2.0.2 and Hive 2.0.1.
>
>
>
> I have the bellow tables but the INSERT INTO TABLE employee_orc PARTITION
> (country='USA', office='HQ-TX') select * from employee where country='USA'
> and office='HQ-TX';
>
>
>
> Is giving me à Cannot insert into table `default`.`employee_orc` because
> the number of columns are different: need 4 columns, but query has 6
> columns.;
>
>
>
>
>
>
>
> When doing select it is adding the Partition as columns ….
>
>
>
>
>
>
>
> CREATE TABLE IF NOT EXISTS employee ( eid int, name String,
>
>
>
> salary String, destination String)
>
>
>
> COMMENT 'Employee details'
>
>
>
> PARTITIONED BY(country string, office string)
>
>
>
> ROW FORMAT DELIMITED
>
>
>
> FIELDS TERMINATED BY '\t'
>
>
>
> LINES TERMINATED BY '\n'
>
>
>
> STORED AS TEXTFILE;
>
>
>
>
>
>
>
> CREATE TABLE IF NOT EXISTS employee_orc ( eid int, name String,
>
>
>
> salary String, destination String)
>
>
>
> COMMENT 'Employee details'
>
>
>
> PARTITIONED BY(country string, office string)
>
>
>
> STORED AS ORC tblproperties ("orc.compress"="ZLIB");
>
>
>
>
>
>
>
> 0: jdbc:hive2://localhost:1> LOAD DATA LOCAL INPATH
> '/mnt/sample.txt.gz' INTO TABLE employee PARTITION (country='USA',
> office='HQ-TX');
>
>
>
> +-+--+
>
>
>
> | Result  |
>
>
>
> +-+--+
>
>
>
> +-+--+
>
>
>
> No rows selected (0.685 seconds)
>
>
>
> 0: jdbc:hive2://localhost:1> select * from employee;
>
>
>
> +---+--+-++-
> -+-+--+
>
>
>
> |  eid  | name | salary  |destination | country  | office
> |
>
>
>
> +---+--+-++-
> -+-+--+
>
>
>
> | 1201  | Gopal| 45000   | Technical manager  | USA  | HQ-TX
> |
>
>
>
> | 1202  | Manisha  | 45000   | Proof reader   | USA  | HQ-TX
> |
>
>
>
> | 1203  | Masthanvali  | 4   | Technical writer   | USA  | HQ-TX
> |
>
>
>
> | 1204  | Kiran| 4   | Hr Admin   | USA  | HQ-TX
> |
>
>
>
> | 1205  | Kranthi  | 3   | Op Admin   | USA  | HQ-TX
> |
>
>
>
> +---+--+-++-
> -+-+--+
>
>
>
> 5 rows selected (0.358 seconds)
>
>
>
> 0: jdbc:hive2://localhost:1> INSERT INTO TABLE employee_orc PARTITION
> (country='USA', office='HQ-TX') select * from employee where country='USA'
> and office='HQ-TX';
>
>
>
> Error: org.apache.spark.sql.AnalysisException: Cannot insert into table
> `default`.`employee_orc` because the number of columns are different: need
> 4 columns, but query has 6 columns.; (state=,code=0)
>
>
>
>
>
>
>
>
>
>
>
> 0: jdbc:hive2://localhost:1> describe employee_orc;
>
>
>
> +--++--+--+
>
>
>
> | col_name | data_type  | comment  |
>
>
>
> +--++--+--+
>
>
>
> | eid  | int| NULL |
>
>
>
> | name | string | NULL |
>
>
>
> | salary   | string | NULL |
>
>
>
> | destination  | string | NULL |
>
>
>
> | country  | string | NULL |
>
>
>
> | office   | string | 

Re: PARTITION error because different columns size

2016-12-13 Thread Furcy Pin
Hi Joaquin,

Suresh was faster than me ...

Also, you should check this :
https://cwiki.apache.org/confluence/display/Hive/Tutorial#Tutorial-Dynamic-PartitionInsert

On Tue, Dec 13, 2016 at 3:19 PM, Suresh Kumar Sethuramaswamy <
rock...@gmail.com> wrote:

> Hi Joaquin
>
> In hive , when u run 'select * from employee' it is going to return
> the partitioned columns also at the end,  whereas you don't want that to be
> inserted into ur ORC table , so ur insert query should look like
>
>   INSERT INTO TABLE employee_orc PARTITION (country='USA',
> office='HQ-TX') select eid,salary from employee where country='USA' and
> office='HQ-TX';
>
>
>  Remember partition in hive is a physical folder name
>
> Regards
> Suresh
>
>
>
> On Tue, Dec 13, 2016 at 6:37 AM Joaquin Alzola 
> wrote:
>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> Hi List
>>
>>
>>
>>
>>
>> I change Spark to 2.0.2 and Hive 2.0.1.
>>
>>
>> I have the bellow tables but the INSERT INTO TABLE employee_orc PARTITION
>> (country='USA', office='HQ-TX') select * from employee where country='USA'
>> and office='HQ-TX';
>>
>>
>> Is giving me à Cannot insert into table `default`.`employee_orc` because
>> the number of columns are different: need 4 columns, but query has 6
>> columns.;
>>
>>
>>
>>
>>
>> When doing select it is adding the Partition as columns ….
>>
>>
>>
>>
>>
>>
>> CREATE TABLE IF NOT EXISTS employee ( eid int, name String,
>>
>>
>> salary String, destination String)
>>
>>
>> COMMENT 'Employee details'
>>
>>
>> PARTITIONED BY(country string, office string)
>>
>>
>> ROW FORMAT DELIMITED
>>
>>
>> FIELDS TERMINATED BY '\t'
>>
>>
>> LINES TERMINATED BY '\n'
>>
>>
>> STORED AS TEXTFILE;
>>
>>
>>
>>
>>
>> CREATE TABLE IF NOT EXISTS employee_orc ( eid int, name String,
>>
>>
>> salary String, destination String)
>>
>>
>> COMMENT 'Employee details'
>>
>>
>> PARTITIONED BY(country string, office string)
>>
>>
>> STORED AS ORC tblproperties ("orc.compress"="ZLIB");
>>
>>
>>
>>
>>
>> 0: jdbc:hive2://localhost:1> LOAD DATA LOCAL INPATH
>> '/mnt/sample.txt.gz' INTO TABLE employee PARTITION (country='USA',
>> office='HQ-TX');
>>
>>
>> +-+--+
>>
>>
>> | Result  |
>>
>>
>> +-+--+
>>
>>
>> +-+--+
>>
>>
>> No rows selected (0.685 seconds)
>>
>>
>> 0: jdbc:hive2://localhost:1> select * from employee;
>>
>>
>> +---+--+-++-
>> -+-+--+
>>
>>
>> |  eid  | name | salary  |destination | country  |
>> office  |
>>
>>
>> +---+--+-++-
>> -+-+--+
>>
>>
>> | 1201  | Gopal| 45000   | Technical manager  | USA  |
>> HQ-TX   |
>>
>>
>> | 1202  | Manisha  | 45000   | Proof reader   | USA  |
>> HQ-TX   |
>>
>>
>> | 1203  | Masthanvali  | 4   | Technical writer   | USA  |
>> HQ-TX   |
>>
>>
>> | 1204  | Kiran| 4   | Hr Admin   | USA  |
>> HQ-TX   |
>>
>>
>> | 1205  | Kranthi  | 3   | Op Admin   | USA  |
>> HQ-TX   |
>>
>>
>> +---+--+-++-
>> -+-+--+
>>
>>
>> 5 rows selected (0.358 seconds)
>>
>>
>> 0: jdbc:hive2://localhost:1> INSERT INTO TABLE employee_orc PARTITION
>> (country='USA', office='HQ-TX') select * from employee where country='USA'
>> and office='HQ-TX';
>>
>>
>> Error: org.apache.spark.sql.AnalysisException: Cannot insert into table
>> `default`.`employee_orc` because the number of columns are different: need
>> 4 columns, but query has 6 columns.; (state=,code=0)
>>
>>
>>
>>
>>
>>
>>
>>
>> 0: jdbc:hive2://localhost:1> describe employee_orc;
>>
>>
>> +--++--+--+
>>
>>
>> | col_name | data_type  | comment  |
>>
>>
>> +--++--+--+
>>
>>
>> | eid  | int| NULL |
>>
>>
>> | name | string | NULL |
>>
>>
>> | salary   | string | NULL |
>>
>>
>> | destination  | string | NULL |
>>
>>
>> | country  | string | NULL |
>>
>>
>> | office   | string | NULL |
>>
>>
>> | # Partition Information  ||  |
>>
>>
>> | # col_name   | data_type  | comment  |
>>
>>
>> | country  | string | NULL |
>>
>>
>> | office   | string | NULL |
>>
>>
>> +--++--+--+
>>
>>
>>
>>
>>
>> 0: jdbc:hive2://localhost:1>  describe employee;
>>
>>
>> +--++--+--+
>>
>>
>> | col_name | data_type  | comment  |
>>
>>
>> +--++--+--+
>>
>>
>> | eid  | int| NULL |
>>
>>
>> | name | string | NULL |
>>
>>
>> | salary   | 

Re: How to move tasks under reducer to Mapper phase

2016-12-10 Thread Furcy Pin
Hi Mahender,

it's hard to say what happens without seeing the actual query.

Hive has several ways to perform joins. There is a complete description of
how it does it here:

https://cwiki.apache.org/confluence/display/Hive/MapJoinOptimization
Sadly, the illustrations are broken.

There is also this presentation :
https://www.youtube.com/watch?v=OB4H3Yt5VWM

And the corresponding slides :
https://cwiki.apache.org/confluence/download/attachments/27362054/Hive+Summit+2011-join.pdf

However, these docs are from the Map-Reduce era and quite old now.
So it is hard to tell if everything works the same way with Tez today.

If all your tables are big, I would say there is not much to optimize
except trying to bucket and sort them beforehand.


Last but not least:

When I get this kind of behavior (reducer stuck during a JOIN), more often
than not,
it is simply because the JOIN clause is incorrect, and the reducer
generates way too much data.

Just imagine what would happen if you did a "JOIN ON 1 = 1"
between two tables with 10^9 records... you can actually kill a cluster
with this,
if you let it run long enough.





On Fri, Dec 9, 2016 at 10:31 PM, Mahender Sarangam <
mahender.bigd...@outlook.com> wrote:

> Hi,
>
> We are performing left joining on 5-6 larger tables. We see job is hanging
> around 95%. All the mappers completed fast and some of the reducer are also
> completed fast. but some of reducer are hanging state because single task
> is running on large data. Below are the Mapper and Reducer captured.
>
>
>
>- Is there a way to move task running under Reducer phase to Mapper
>phase. I mean tweaking with memory settings or modifying the query to have
>more mapper tasks than reducer task.
>
>
>- Is there a way to know what part of query is taken by task which is
>running for long time. or what amount of rows this task is running upon (
>so that i can think of partition or alternate approach)
>- Any other memory setting to resolve hanging issue. Below is our
>memory settings
>
> SET hive.tez.container.size = -1;
> SET hive.execution.engine=tez;
> SET hive.mapjoin.hybridgrace.hashtable=FALSE;
> SET hive.optimize.ppd=true;
> SET hive.cbo.enable =true;
> SET hive.compute.query.using.stats =true;
> SET hive.exec.parallel=true;
> SET hive.vectorized.execution.enabled=true;
> SET hive.exec.dynamic.partition=true;
> SET hive.exec.dynamic.partition.mode=nonstrict;
> SET hive.auto.convert.join=false;
> SET hive.auto.convert.join.noconditionaltask=false;
> set hive.tez.java.opts = "-Xmx3481m";
> set hive.tez.container.size = 4096;
> --SET mapreduce.map.memory.mb=4096;
> --SET mapreduce.map.java.opts = -Xmx3000M;
> --SET mapreduce.reduce.memory.mb = 2048;
> --SET mapreduce.reduce.java.opts = -Xmx1630M;
> SET fs.block.size=67108864;
>
>
> Thanks in advance
>
>
> -Mahender
>
>
>
>
>


Re: Table not found in the definition of view

2016-12-01 Thread Furcy Pin
Hi,

you should replace

WITH table AS (subquery)
SELECT ...
FROM table

with

SELECT ...
FROM(
  subquery
) table

Regards.

On Thu, Dec 1, 2016 at 12:32 PM, Priyanka Raghuvanshi 
wrote:

> Hi All
>
>
> Getting error 'Table not found in the definition of view ' if a view is
> created using 'WITH' clause to use the  result of one query in another.
>
>
> This issue has been resolved for Hive 1.3.0 and 2.0.0 but mine is 0.13
>
> Regards
>
> Priyanka Raghuvanshi
>


Re: Hive on Spark not working

2016-11-28 Thread Furcy Pin
ClassNotFoundException generally means that jars are missing from your
class path.

You probably need to link the spark jar to $HIVE_HOME/lib
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started#HiveonSpark:GettingStarted-ConfiguringHive

On Tue, Nov 29, 2016 at 2:03 AM, Joaquin Alzola 
wrote:

> Hi Guys
>
>
>
> No matter what I do that when I execute “select count(*) from employee” I
> get the following output on the logs:
>
> It is quiet funny because if I put hive.execution.engine=mr the output is
> correct. If I put hive.execution.engine=spark then I get the bellow errors.
>
> If I do the search directly through spark-shell it work great.
>
> +---+
>
> |_c0|
>
> +---+
>
> |1005635|
>
> +---+
>
> So there has to be a problem from hive to spark.
>
>
>
> Seems as the RPC(??) connection is not setup …. Can somebody guide me on
> what to look for.
>
> spark.master=spark://172.16.173.31:7077
>
> hive.execution.engine=spark
>
> spark.executor.extraClassPath/mnt/spark/lib/spark-1.6.2-
> yarn-shuffle.jar:/mnt/hive/lib/hive-exec-2.0.1.jar
>
>
>
> Hive2.0.1à Spark 1.6.2 –> Hadoop – 2.6.5 à Scala 2.10
>
>
>
> 2016-11-29T00:35:11,099 WARN  [RPC-Handler-2]: rpc.RpcDispatcher
> (RpcDispatcher.java:handleError(142)) - Received error
> message:io.netty.handler.codec.DecoderException: 
> java.lang.NoClassDefFoundError:
> org/apache/hive/spark/client/Job
>
> at io.netty.handler.codec.ByteToMessageDecoder.callDecode(
> ByteToMessageDecoder.java:358)
>
> at io.netty.handler.codec.ByteToMessageDecoder.channelRead(
> ByteToMessageDecoder.java:230)
>
> at io.netty.handler.codec.ByteToMessageCodec.channelRead(
> ByteToMessageCodec.java:103)
>
> at io.netty.channel.AbstractChannelHandlerContext.
> invokeChannelRead(AbstractChannelHandlerContext.java:308)
>
> at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(
> AbstractChannelHandlerContext.java:294)
>
> at io.netty.channel.ChannelInboundHandlerAdapter.channelRead(
> ChannelInboundHandlerAdapter.java:86)
>
> at io.netty.channel.AbstractChannelHandlerContext.
> invokeChannelRead(AbstractChannelHandlerContext.java:308)
>
> at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(
> AbstractChannelHandlerContext.java:294)
>
> at io.netty.channel.DefaultChannelPipeline.fireChannelRead(
> DefaultChannelPipeline.java:846)
>
> at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(
> AbstractNioByteChannel.java:131)
>
> at io.netty.channel.nio.NioEventLoop.processSelectedKey(
> NioEventLoop.java:511)
>
> at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(
> NioEventLoop.java:468)
>
> at io.netty.channel.nio.NioEventLoop.processSelectedKeys(
> NioEventLoop.java:382)
>
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
>
> at io.netty.util.concurrent.SingleThreadEventExecutor$2.
> run(SingleThreadEventExecutor.java:111)
>
> at java.lang.Thread.run(Thread.java:745)
>
> Caused by: java.lang.NoClassDefFoundError: org/apache/hive/spark/client/
> Job
>
> at java.lang.ClassLoader.defineClass1(Native Method)
>
> at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
>
> at java.security.SecureClassLoader.defineClass(
> SecureClassLoader.java:142)
>
> at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
>
> at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
>
> at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
>
> at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
>
> at java.security.AccessController.doPrivileged(Native Method)
>
> at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
>
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
>
> at java.lang.ClassLoader.loadClass(ClassLoader.java:411)
>
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>
> at java.lang.Class.forName0(Native Method)
>
> at java.lang.Class.forName(Class.java:348)
>
> at org.apache.hive.com.esotericsoftware.kryo.util.
> DefaultClassResolver.readName(DefaultClassResolver.java:154)
>
> at org.apache.hive.com.esotericsoftware.kryo.util.
> DefaultClassResolver.readClass(DefaultClassResolver.java:133)
>
> at org.apache.hive.com.esotericsoftware.kryo.Kryo.
> readClass(Kryo.java:670)
>
> at org.apache.hive.com.esotericsoftware.kryo.
> serializers.ObjectField.read(ObjectField.java:118)
>
> at org.apache.hive.com.esotericsoftware.kryo.
> serializers.FieldSerializer.read(FieldSerializer.java:551)
>
> at org.apache.hive.com.esotericsoftware.kryo.Kryo.
> readClassAndObject(Kryo.java:790)
>
> at org.apache.hive.spark.client.rpc.KryoMessageCodec.decode(
> 

Re: Hive metadata on Hbase

2016-10-26 Thread Furcy Pin
Hi Mich,

No, I am not using HBase as a metastore now, but I am eager for it to
become production ready and released in CDH and HDP.

Concerning locks, I think HBase would do fine because it is ACID at the row
level. It only appends data on HDFS, but
it works by keeping regions in RAM, plus a write-ahead-log for failure
recovery.
So updates on rows are atomic and ACID.
This allows to have acid guarantees between elements that are stored on the
same row.
Since HBase supports a great number of dynamic columns in each row
(large-columnar store, like Cassandra), the
smart way to design your tables is quite different from an RDBMS.
I would expect that they will have something like an HBase table with one
row per Hive table, with all the associated data in it. This would make
all modifications on a table atomic.

Concerning locks, as they involve multiple tables, I guess they would have
to manually put a global lock on the "hbase lock table" before editing it.

I agree that you should not touch the system tables too much, but sometimes
you have to remove the deadlock or fix an inconsistency yourself. I guess
removing deadlocks in HBase should not be much harder, using the
hbase-shell (new syntax to learn, however)

It would be nice if Hive had some syntax to manually remove deadlocks when
they happen, you would not have to worry about the metastore implementation
then.



On Wed, Oct 26, 2016 at 12:58 AM, Mich Talebzadeh <mich.talebza...@gmail.com
> wrote:

> Hi Furcy,
>
> Having used Hbase for part of Batch layer in Lambda Architecture I have
> come to conclusion that it is a very good product despite the fact that
> because of its cryptic nature it is not much loved or appreciated. However,
> it may be useful to have a Hive metastore skin on top of Hbase tables so
> admin and others can interrogate Hbase tables. Definitely there is a need
> for some sort of interface to Hive metastore on Hbase, whether through Hive
> or Phoenix.
>
> Then we still have to handle lock and concurrency on metastore tables.
> RDBMS is transactional and ACID compliant. I do not know enough about
> Hbase. As far as I know Hbase appends data. Currently when I have an issue
> with transactions and locks I go to metadata and do some plastic surgery on
> TRXN and LOCKS tables that resolves the issue. I am not sure how I am going
> to achieve that in Hbase. Puritans might argue that one should not touch
> these system tables but things are not generally that simple.
>
> Are you using Hbase as Hive metastore now?
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 25 October 2016 at 13:44, Furcy Pin <furcy@flaminem.com> wrote:
>
>> Hi Mich,
>>
>> I mostly agree with you, but I would comment on the part about using
>> HBase as a maintenance free core product:
>> I would say that most medium company using Hadoop rely on Hortonworks or
>> Cloudera, that both provides a pre-packaged HBase installation. It would
>> probably make sense for them to ship pre-installed versions of Hive relying
>> on HBase as metastore.
>> And as Alan stated, it would also be a good way to improve the
>> integration between Hive and HBase.
>>
>> I am not well placed to give an opinion on this, but I agree that
>> maintaining integration between both HBase and regular RDBMS might be a
>> real pain.
>> I am also worried about the fact that if indeed HBase grant us the
>> possibility to have all nodes calling the metastore, then any optimization
>> making use
>> of this will only work for a cluster with a Hive metastore on HBase?
>>
>> Anyway, I am still looking forward to this, as despite working in a small
>> company, our metastore sometimes seems to be a bottleneck, especially
>> when running more than 20 queries on tables with 10 000 partitions...
>> But perhaps migrating it on a bigger host would be enough for us...
>>
>>
>>
>> On Mon, Oct 24, 2016 at 10:21 PM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Thanks Alan for detailed explanation.
>>>
>>> Please bear in mind that any tool that needs to work with som

Re: Hive metadata on Hbase

2016-10-25 Thread Furcy Pin
Hi Mich,

I mostly agree with you, but I would comment on the part about using HBase
as a maintenance free core product:
I would say that most medium-sized companies using Hadoop rely on Hortonworks
or Cloudera, which both provide a pre-packaged HBase installation. It would
probably make sense for them to ship pre-installed versions of Hive relying
on HBase as the metastore.
And as Alan stated, it would also be a good way to improve the integration
between Hive and HBase.

I am not well placed to give an opinion on this, but I agree that
maintaining integration with both HBase and regular RDBMSs might be a
real pain.
I am also worried that, if HBase indeed grants us the possibility
to have all nodes calling the metastore, then any optimization making use
of this will only work for clusters with a Hive metastore on HBase.

Anyway, I am still looking forward to this, as despite working in a small
company, our metastore sometimes seems to be a bottleneck, especially
when running more than 20 queries on tables with 10 000 partitions...
But perhaps migrating it to a bigger host would be enough for us...



On Mon, Oct 24, 2016 at 10:21 PM, Mich Talebzadeh  wrote:

> Thanks Alan for detailed explanation.
>
> Please bear in mind that any tool that needs to work with some repository
> (Oracle TimesTen IMDB has its metastore on Oracle classic), SAP Replication
> Server has its repository RSSD on SAP ASE and others
> First thing they do, they go and cache those tables and keep it in memory
> of the big brother database until they are shutdown. I reversed engineered
> and created Hive data model from physical schema (on Oracle). There are
> around 194 tables in total that can be easily cached.
>
> For small medium enterprise (SME), they don't really have much data so
> anything will do and they are the ones that use open source databases. For
> bigger companies, they already pay bucks for Oracle and alike and they are
> the one that would not touch an open source database (not talking about big
> data), because in this new capital-sensitive risk-averse world, they do
> not want to expose themselves to unnecessary risk.  So I am not sure
> whether they will take something like Hbase as a core product, unless it is
> going to be maintenance free.
>
> Going back to your point
>
> ".. but you have to pay for an expensive commercial license to make the
> metadata really work well is a non-starter"
>
> They already do and pay more if they have to. We will stick with Hive
> metadata on Oracle with schema on SSD
> .
>
> HTH
>
>
>
>
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 24 October 2016 at 20:14, Alan Gates  wrote:
>
>> Some thoughts on this:
>>
>> First, there’s no plan to remove the option to use an RDBMS such as
>> Oracle as your backend.  Hive’s RawStore interface is built such that
>> various implementations of the metadata storage can easily coexist.
>> Obviously different users will make different choices about what metadata
>> store makes sense for them.
>>
>> As to why HBase:
>> 1) We desperately need to get rid of the ORM layer.  It’s causing us
>> performance problems, as evidenced by things like it taking several minutes
>> to fetch all of the partition data for queries that span many partitions.
>> HBase is a way to achieve this, not the only way.  See in particular
>> Yahoo’s work on optimizing Oracle access https://issues.apache.org/jira
>> /browse/HIVE-14870  The question around this is whether we can optimize
>> for Oracle, MySQL, Postgres, and SQLServer without creating a maintenance
>> and testing nightmare for ourselves.  I’m skeptical, but others think it’s
>> possible.  See comments on that JIRA.
>>
>> 2) We’d like to scale to much larger sizes, both in terms of data and
>> access from nodes.  Not that we’re worried about the amount of metadata,
>> but we’d like to be able to cache more stats, file splits, etc.  And we’d
>> like to allow nodes in the cluster to contact the metastore, which we do
>> not today since many RDBMSs don’t handle a thousand plus simultaneous
>> connections well.  Obviously both data and connection scale can be met with
>> high end commercial stores.  But saying that we have this great open source
>> database but you have to pay for an expensive commercial license to make
>> the metadata really work well is a non-starter.
>>
>> 3) By using tools within the 

Re: Hive metadata on Hbase

2016-10-24 Thread Furcy Pin
Hi Mich,

the umbrella JIRA for this gives a few reasons:
https://issues.apache.org/jira/browse/HIVE-9452
(with even more details in the attached pdf
https://issues.apache.org/jira/secure/attachment/12697601/HBaseMetastoreApproach.pdf
)

In my experience, Hive tables with a lot of partitions (> 10 000) may
become really slow, especially with Spark.
The latency induced by the metastore can be really big compared to the
whole duration of the query itself,
because the driver needs to fetch a lot of info about partitions just to
optimize the query, before even running it.

I guess another advantage is that using an RDBMS as the metastore makes it a
SPOF unless you set up replication, etc., while HBase would give HA for free.



On Mon, Oct 24, 2016 at 9:06 AM, Mich Talebzadeh 
wrote:

> @Per
>
> We run full transactional enabled Hive metadb on an Oracle DB.
>
> I don't have statistics now but will collect from AWR reports no problem.
>
> @Jorn,
>
> The primary reason Oracle was chosen is because the company has global
> licenses for Oracle + MSSQL + SAP and they are classified as Enterprise
> Grade databases.
>
> None of MySQL and others are classified as such so they cannot be deployed
> in production.
>
> Besides, for us to have Hive metadata on Oracle makes sense as our
> infrastructure does all the support, HA etc for it and they have trained
> DBAs to look after it 24x7.
>
> Admittedly we are now relying on HDFS itself plus Hbase as well for
> persistent storage. So the situation might change.
>
> HTH
>
>
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 24 October 2016 at 06:46, Per Ullberg  wrote:
>
>> I thought the main gain was to get ACID on Hive performant enough.
>>
>> @Mich: Do you run with ACID-enabled tables? How many
>> Create/Update/Deletes do you do per second?
>>
>> best regards
>> /Pelle
>>
>> On Mon, Oct 24, 2016 at 7:39 AM, Jörn Franke 
>> wrote:
>>
>>> I think the main gain is more about getting rid of a dedicated database
>>> including maintenance and potential license cost.
>>> For really large clusters and a lot of users this might be even more
>>> beneficial. You can avoid clustering the database etc.
>>>
>>> On 24 Oct 2016, at 00:46, Mich Talebzadeh 
>>> wrote:
>>>
>>>
>>> A while back there was some notes on having Hive metastore on Hbase as
>>> opposed to conventional RDBMSs
>>>
>>> I am currently involved with some hefty work with Hbase and Phoenix for
>>> batch ingestion of trade data. As long as you define your Hbase table
>>> through Phoenix and with secondary Phoenix indexes on Hbase, the speed is
>>> impressive.
>>>
>>> I am not sure how much having Hbase as Hive metastore is going to add to
>>> Hive performance. We use Oracle 12c as Hive metastore and the Hive
>>> database/schema is built on solid state disks. Never had any issues with
>>> lock and concurrency.
>>>
>>> Therefore I am not sure what one is going to gain by having Hbase as the
>>> Hive metastore? I trust that we can still use our existing schemas on
>>> Oracle.
>>>
>>> HTH
>>>
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>
>>
>> --
>>
>> *Per Ullberg*
>> Data Vault Tech Lead
>> Odin Uppsala
>> +46 701612693 <+46+701612693>
>>
>> Klarna AB (publ)
>> Sveavägen 46, 111 34 Stockholm
>> Tel: +46 8 120 120 00 <+46812012000>
>> Reg no: 556737-0431
>> klarna.com
>>
>>
>


Re: Table substitution

2016-08-25 Thread Furcy Pin
Hi Rakesh,

this is an interesting question,

Did you look at the cube and rollup possibilities?

https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation,+Cube,+Grouping+and+Rollup

I believe that you could use this feature to merge your table 1 and table 2
in one single table with a
GROUP BY business_unit, balance, account WITH ROLLUP

partition the result table with GROUPING__ID

and then your select query would have to be

select business_unit, balance from new_table where business_unit='mybu' and
GROUPING__ID = 1
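
To make this a bit more concrete, here is a rough, untested sketch of the merged
table (I am assuming the detailed data sits in a table called detail_table and that
balance is the measure being summed; also, the exact values taken by GROUPING__ID
vary between Hive versions, so check them before hard-coding a filter):

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

CREATE TABLE new_table (business_unit STRING, account STRING, balance DOUBLE)
PARTITIONED BY (grouping_id INT)
STORED AS ORC;

INSERT OVERWRITE TABLE new_table PARTITION (grouping_id)
SELECT business_unit, account, SUM(balance) AS balance, GROUPING__ID AS grouping_id
FROM detail_table
GROUP BY business_unit, account WITH ROLLUP;

-- then check which grouping_id partitions were created before filtering on them:
SHOW PARTITIONS new_table;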

If you want to get something more automatic after that, I guess there must
be tools
out there capable of providing a GUI on top of a cube, automatically
generating the SQL
and running it via JDBC connectors to Hive, Spark or Impala...

After googling a little, I believe that Apache Lens matches the description.

https://lens.apache.org/user/olap-cube.html

I also found this old presentation related to it :
http://events.linuxfoundation.org/sites/events/files/slides/ApacheCon-Datacubes_in_Apache_Hive_0.pdf

Hope this helps,

Furcy





On Thu, Aug 25, 2016 at 1:27 PM, Rakesh Kant  wrote:

> I have multiple aggregate tables created at different levels of
> aggregation grouped by smaller subset of keys. As an example-
> Table 1 : business_unit, account, balance
> Table 2 : business_unit, balance
>
> When a query is written as select business_unit, balance from table1 where
> business_unit='mybu'; it would be faster to use table 2 instead but how do
> I substitute table 2 instead on table 1 in the query?
>
> Is there a way to add custom handlers either in the driver, optimizer or
> other points ?
>
> Any help or pointers will be appreciated.
>
> RK
>
> Get Outlook for iOS 
>
>


Re: Populating tables using hive and spark

2016-08-22 Thread Furcy Pin
Hi Nitin,

I confirm that there is something odd here.

I did the following test :

create table test_orc (id int, name string, dept string) stored as ORC;
insert into table test_orc values (1, 'abc', 'xyz');
insert into table test_orc values (2, 'def', 'xyz');
insert into table test_orc values (3, 'pqr', 'xyz');
insert into table test_orc values (4, 'ghi', 'xyz');


I ended up with 4 files on hdfs:

00_0
00_0_copy_1
00_0_copy_2
00_0_copy_3


Then I renamed 00_0_copy_2 to part-0, and I still got COUNT(*) = 4
with hive.
So this is not a file name issue.

I then removed one of the files, and I got this :

> SELECT COUNT(1) FROM test_orc ;
+------+
| _c0  |
+------+
| 4    |
+------+

> SELECT * FROM test_orc ;
+--------------+----------------+----------------+
| test_orc.id  | test_orc.name  | test_orc.dept  |
+--------------+----------------+----------------+
| 1            | abc            | xyz            |
| 2            | def            | xyz            |
| 4            | ghi            | xyz            |
+--------------+----------------+----------------+
3 rows selected (0.162 seconds)

So, my guess is that when Hive inserts data, it must keep somewhere in the
metastore the number of rows in the table.
However, if the files are modified by someone other than Hive itself
(either manually or with Spark), you end up with an inconsistency.

So I guess we can call it a bug:

Hive should detect that the files changed and invalidate its pre-calculated
count.
Optionally, Spark should be nice with Hive and update the count when
inserting.
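
In the meantime, a workaround that should help (I have not tested it on your exact
setup, but both the setting and the command below are standard Hive) is to either
stop answering such queries from the stored statistics, or to recompute the
statistics after the files have been rewritten:

SET hive.compute.query.using.stats=false;

-- or, to bring the stored statistics back in line with the files:
ANALYZE TABLE test_orc COMPUTE STATISTICS;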

I don't know if this bug has already been reported, and I tested on Hive
1.1.0, so perhaps it is already solved in later releases.

Regards,

Furcy


On Mon, Aug 22, 2016 at 9:34 AM, Nitin Kumar 
wrote:

> Hi!
>
> I've noticed that hive has problems in registering new data records if the
> same table is written to using both the hive terminal and spark sql. The
> problem is demonstrated through the commands listed below
>
> 
> hive> use default;
> hive> create table test_orc (id int, name string, dept string) stored as
> ORC;
> hive> insert into table test_orc values (1, 'abc', 'xyz');
> hive> insert into table test_orc values (2, 'def', 'xyz');
> hive> select count(*) from test_orc;
> OK
> 2
> hive> select distinct(name) from test_orc;
> OK
> abc
> def
>
> *** files in hdfs path in warehouse for the created table ***
>
>
>
> >>> data_points = [(3, 'pqr', 'xyz'), (4, 'ghi', 'xyz')]
> >>> column_names = ['identity_id', 'emp_name', 'dept_name']
> >>> data_df = sqlContext.createDataFrame(data_points, column_names)
> >>> data_df.show()
>
> +-----------+--------+---------+
> |identity_id|emp_name|dept_name|
> +-----------+--------+---------+
> |          3|     pqr|      xyz|
> |          4|     ghi|      xyz|
> +-----------+--------+---------+
>
> >>> data_df.registerTempTable('temp_table')
> >>> sqlContext.sql('insert into table default.test_orc select * from
> temp_table')
>
> *** files in hdfs path in warehouse for the created table ***
>
> hive> select count(*) from test_orc; (Does not launch map-reduce job)
> OK
> 2
> hive> select distinct(name) from test_orc; (Launches map-reduce job)
> abc
> def
> ghi
> pqr
>
> hive> create table test_orc_new like test_orc stored as ORC;
> hive> insert into table test_orc_new select * from test_orc;
> hive> select count(*) from test_orc_new;
> OK
> 4
> ==
>
> Even if I restart the hive services I cannot get the proper count output
> from hive. This problem only occurs if the table is written to using both
> hive and spark. If only spark is used to insert records into the table
> multiple times, the count query in the hive terminal works perfectly fine.
>
> This problem occurs for tables stored with different storage formats as
> well (textFile etc.)
>
> Is this because of the different naming conventions used by hive and spark
> to write records to hdfs? Or maybe it is not a recommended practice to
> write tables using different services?
>
> Your thoughts and comments on this matter would be highly appreciated!
>
> Thanks!
> Nitin
>
>
>


Re: Iterating over partitions using the metastore API

2016-08-04 Thread Furcy Pin
Hi Elliot,

I guess you can use IMetaStoreClient.listPartitionNames instead, and then
use IMetaStoreClient.getPartition for each partition.
This might be slow though, as you will have to make 10 000 calls to get
them.

Another option I'd consider is connecting directly to the Hive metastore.
This requires a little more configuration (granting your process read-only access
to the metastore database), and might make your implementation dependent
on the metastore's underlying implementation (mysql, postgres, derby), unless
you use an ORM to query it.
Anyway, you could ask the metastore directly via JDBC for all the
partitions, and get java.sql.ResultSet that can be iterated over.
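
For example, on a MySQL-backed metastore, something along these lines should list
all the partition names of a table (treat it as a sketch: the schema table and
column names can vary with the metastore version, and the database and table names
below are made up):

SELECT p.PART_NAME
FROM PARTITIONS p
JOIN TBLS t ON t.TBL_ID = p.TBL_ID
JOIN DBS d ON d.DB_ID = t.DB_ID
WHERE d.NAME = 'my_database'
  AND t.TBL_NAME = 'my_table';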

Regards,

Furcy


On Thu, Aug 4, 2016 at 1:29 PM, Elliot West  wrote:

> Hello,
>
> I have a process that needs to iterate over all of the partitions in a
> table using the metastore API. The process should not need to know about the
> structure or meaning of the partition key values (i.e. whether they are
> dates, numbers, country names etc), or be required to know the existing
> range of partition values. Note that the process only needs to know about
> one partition at any given time.
>
> Currently I am naively using the IMetaStoreClient.listPartitions(String,
> String, short) method to retrieve all partitions but clearly this is not
> scalable for tables with many 10,000s of partitions. I'm finding that even
> with relatively large heaps I'm running into OOM exceptions when the
> metastore API is building the List<Partition> return value. I've
> experimented with using IMetaStoreClient.listPartitionSpecs(String,
> String, int) but this too seems to have high memory requirements.
>
> Can anyone suggest how I can better iterate over partitions in a manner
> that is more considerate of memory usage?
>
> Thanks,
>
> Elliot.
>
>


Re: Doubt on Hive Partitioning.

2016-08-02 Thread Furcy Pin
I'm using Hive 1.1 on MR and dynamic partition pruning does not seem to
work.

Since MR is deprecated in 2.0, I assume we should not expect any future
perf optimisation on this side.

It has been implemented for Hive on Spark, though.
https://issues.apache.org/jira/browse/HIVE-9152
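
By the way, the EXPLAIN DEPENDENCY command that Jörn mentions below is a cheap way
to check whether pruning actually happens, since it lists the partitions a query
would read. For instance, with the table and partition column from the original
question:

EXPLAIN DEPENDENCY
SELECT count(*) FROM t WHERE p = '201604';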




On Tue, Aug 2, 2016 at 3:45 AM, Qiuzhuang Lian 
wrote:

> Is this partition pruning fixed in MR too except for TEZ in newer hive
> version?
>
> Regards,
> Q
>
> On Mon, Aug 1, 2016 at 8:48 PM, Jörn Franke  wrote:
>
>> It happens in old hive versions if the filter is only in the where clause
>> and NOT in the join clause. This should not happen in newer hive versions.
>> You can check it by executing explain dependency query.
>>
>> On 01 Aug 2016, at 11:07, Abhishek Dubey > > wrote:
>>
>> Hi All,
>>
>>
>>
>> I have a very big table *t* with billions of rows and it is partitioned
>> on a column *p*. Column *p * has datatype text and values like ‘201601’,
>> ‘201602’…upto ‘201612’.
>>
>> And, I am running a query like : *Select columns from t where
>> p=’201604’.*
>>
>>
>>
>> My question is : Can there be a scenario/condition/probability that my
>> query will do a complete table scan on *t* instead of only reading data
>> for specified partition key. If yes, please put some light on those
>> scenario.
>>
>>
>>
>> I’m asking this because someone told me that there is a probability that
>> the query will ignore the partitioning and do a complete table scan to
>> fetch output.
>>
>>
>>
>> *Thanks & Regards,*
>> *Abhishek Dubey*
>>
>>
>>
>>
>


Re: Doubt on Hive Partitioning.

2016-08-01 Thread Furcy Pin
Hi Abhishek,

Yes, it can happen.

The only such scenarios I can think of are when you use a WHERE clause with
a non-constant predicate.
As far as I know, partition pruning only works on constant predicates, because Hive
has to evaluate them *before* starting the query in order to prune the
partitions.

For instance:

WHERE p = otherColumn
> here the predicate will depend on the row being read, thus all rows must
be read.
> if otherColumn is a partition, I don't think it works either

WHERE p IN (SELECT p FROM t2)
> here we could argue that Hive could optimize this by computing the sub
query first,
> and then do the partition pruning, but sadly I don't think this
optimisation has been implemented yet


WHERE f(p) = 'constant'
or
WHERE p = f('constant')

where f is a non-deterministic or stateful UDF.
Examples of non-deterministic functions are rand() and unix_timestamp(),
because they are evaluated differently at each row.

So if you want today's partition, you should use instead current_date(),
which is deterministic,  since it takes the time of compilation of the
query.
It is only available since Hive 1.2.0 though.
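
To illustrate on your table (just a sketch, assuming the values of p look like
'201604' as in your example):

-- constant predicate: only the '201604' partition should be read
SELECT * FROM t WHERE p = '201604';

-- unix_timestamp() is non-deterministic, so this will most likely scan all partitions
SELECT * FROM t WHERE p = from_unixtime(unix_timestamp(), 'yyyyMM');

-- current_date() (Hive 1.2.0+) is evaluated at compile time, so pruning can happen
SELECT * FROM t WHERE p = date_format(current_date(), 'yyyyMM');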

You can know if a Hive UDF is deterministic and stateful by looking at the
class annotation UDFType in it's source code.
If you plan on writing your own UDF, don't forget to specify this
annotation as well.

hope this helps,

Furcy




On Mon, Aug 1, 2016 at 11:07 AM, Abhishek Dubey 
wrote:

> Hi All,
>
>
>
> I have a very big table *t* with billions of rows and it is partitioned
> on a column *p*. Column *p * has datatype text and values like ‘201601’,
> ‘201602’…upto ‘201612’.
>
> And, I am running a query like : *Select columns from t where p=’201604’.*
>
>
>
> My question is : Can there be a scenario/condition/probability that my
> query will do a complete table scan on *t* instead of only reading data
> for specified partition key. If yes, please put some light on those
> scenario.
>
>
>
> I’m asking this because someone told me that there is a probability that
> the query will ignore the partitioning and do a complete table scan to
> fetch output.
>
>
>
> *Thanks & Regards,*
> *Abhishek Dubey*
>
>
>


Re: Re-arrange columns

2016-07-26 Thread Furcy Pin
Hi,

I think I had similar issues to yours.

Did you look in the Hive documentation at what the CASCADE keyword does on
ADD or CHANGE COLUMNS statements?
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterColumn

From what I understand, the behavior of Hive when adding or changing
columns is different from standard RDBMS:
I think that adding or moving a column in Hive will only change the
metadata and not the underlying data.
To cope with that, the metastore keeps separate schema information for the
table itself and for its partitions.
This way, when you change the schema of your table, the old partition can
still be read according to the old schema,
and new partitions will be created with the new schema (unless you use
CASCADE)

There are two dangers with this, however:
- if you create a new column, it will not be accessible in the old
partitions.
- if you regenerate (overwrite) an existing partition, the new data will
correspond to the new schema, but the partition's metadata will not be
updated.
   (I believe this could be considered as a bug, with a workaround since
the "ALTER TABLE table_name ADD COLUMNS ..."
   can also be applied to one specific partition at a time: "ALTER TABLE
table_name PARTITION partition_spec ADD COLUMNS ...")

If you use CASCADE, the change you apply to the table will be immediately
applied to its partitions as well.
But if you don't regenerate your existing partitions, I believe you will
have problems as well, since your partition's schema will not match the
underlying data.


So, I guess it mostly depends on whether (and when) you plan to regenerate your
partitions to add the new column to your existing data.
But you can either:

A. Drop your table and do a msck repair table.

B. To reduce unavailability: create another table, populate it, and then
swap your tables (a minimal sketch of this is given after option E below).


C. Do your column change with CASCADE, and regenerate your partitions
immediately,
but they might not be correctly readable between the time you make your
change and regenerate them.

D. Do your column change without CASCADE: you can still query the old
partitions (without the new column though)
and after regenerating a partition, change its schema with a "ALTER TABLE
table_name PARTITION partition_spec ADD COLUMN ..."

E. (I'm not sure this one works) Do your column change without CASCADE,
create a copy of the table, generate the partitions there, and then use
EXCHANGE PARTITION (
https://cwiki.apache.org/confluence/display/Hive/Exchange+Partition) to
move the partitions from the new table to the old.
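
Here is the minimal sketch of option B I mentioned above (untested; it reuses the
column names from your CREATE statement and glosses over the EXTERNAL/LOCATION
details of your actual table):

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

CREATE TABLE tbl_new (col1 INT, col2 INT, col3 STRING)
PARTITIONED BY (dt STRING)
STORED AS PARQUET;

INSERT OVERWRITE TABLE tbl_new PARTITION (dt)
SELECT col1, CAST(NULL AS INT) AS col2, col3, dt
FROM tbl;

-- swap:
ALTER TABLE tbl RENAME TO tbl_old;
ALTER TABLE tbl_new RENAME TO tbl;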

Hope this helps, let me know how it turns out,

Furcy







On Tue, Jul 26, 2016 at 2:34 AM, Binh Nguyen Van  wrote:

> Hi,
>
> I am writing an application that insert data into Hive and it depends on
> the order of columns in table so I have to rearrange my columns to make
> work around that but I have difficulty of doing that. Could you please help?
>
> Here is my problem:
>
> I have table 'tbl' which is created by using this statement:
> CREATE EXTERNAL TABLE tbl (col1 int, col3 string) PARTITIONED BY (dt string) 
> STORED AS PARQUET LOCATION '/wh/db/tb'.
> Now I want to add new column 'col2' with type int to that table and I want to 
> put it after 'col1' and before 'col3' so I use these two statements:
> ALTER TABLE tbl ADD COLUMNS (col2 int);
> ALTER TABLE tbl CHANGE col2 col2 int AFTER col1;
>
> With Hive 1.2.1, these two statements are executed successfully but I got 
> type cast exception when I query data back.
> With Hive 2.x, The second alter statement failed with exception: 'Unable to 
> alter table. The following columns have types incompatible with the existing 
> columns in their respective positions'
>
> I could work around this by drop old table and then create a new table
> with new schema but this requires me to to run MSCK to update metadata for
> that table and this process could be very slow when I have a lot of data
> and a lot of partition so I am looking for a better one.
>
> Please help!
> Thanks
> -Binh
> ​
>


Re: Could i use Hive SQL parser in our application?

2016-05-18 Thread Furcy Pin
Hi,


This is feasible.

I already did it in some tool I developed.
Unfortunately I can't open-source it yet, but I can tell you how I did it.

There is this class in Hive source code

https://github.com/apache/hive/blob/26b5c7b56a4f28ce3eabc0207566cce46b29b558/ql/src/java/org/apache/hadoop/hive/ql/tools/LineageInfo.java

which shows a simple example where the HiveParser is used.

Getting the AST from a query is as simple as:

ParseDriver pd = new ParseDriver();
ASTNode tree = pd.parse(query);

all you have to do next is deal with the AST to do what you want to do, but
that's where things get complicated.




On Wed, May 18, 2016 at 12:43 PM, Heng Chen 
wrote:

> Hi, guys.
>
>   Recently,  we need to integrate Hive SQL parser in our application.
> Is there any way to do it?
>
> Thanks!
>


Discussion: permanent UDF with database name

2015-12-17 Thread Furcy Pin
Hi Hive users,

I would like to pursue the discussion that happened during the design of
the feature:
https://issues.apache.org/jira/browse/HIVE-6167

Some concerns were raised back then, and I think that now that it has
been implemented, some user feedback could bring water to the mill.

Even if I understand the utility of grouping UDFs inside databases, I find
it really annoying not to be able to define my UDFs globally.

For me, one of the main interests of UDFs is to extend the built-in Hive
functions with the company's user-defined functions, either because some
useful generic function are missing in the built-in functions or to add
business-specific functions.

In the latter case, I understand very well the necessity of qualifying them
with a business-specific database name. But in the former case?


Let's take an example:
It happened several times that we needed a Hive UDF that did not exist
yet in the Hive version that we were currently running. To use it, all we
had to do was take the UDF's source code from a more recent version of
Hive, build it into a JAR, and add the UDF manually.

When we upgraded, we only had to remove our UDF since it was now built-in.

(To be more specific it happened with collect_list prior to Hive 0.13).

With HIVE-6167, this became impossible, since we have to create a
"database_name.function_name" and use it as is. Hence, when upgrading, we
need to replace "database_name.function_name" with "function_name"
everywhere.

This is just an example, but I would like to emphasize the point that
sometimes we want to create permanent UDFs that are as global as built-in
UDFs and not bother if it is a built-in or user-defined function. As
someone pointed out in HIVE-6167's discussion, imagine if all the built-in
UDFs had to be called with "sys.function_name".

I would just like to have other Hive user's feedback on that matter.

Did anyone else have similar issues with this behavior? How did you handle
them?

Maybe it would make sense to create a feature request for being able to
specify a GLOBAL keyword when creating a permanent UDF, when we really want
it to be global?

What do you think?

Regards,

Furcy


Re: Any clue on this error, Exception in thread "main" java.lang.NoSuchFieldError: SPARK_RPC_CLIENT_CONNECT_TIMEOUT

2015-12-03 Thread Furcy Pin
maybe you compile and run against different versions of spark?

On Thu, Dec 3, 2015 at 6:54 PM, Mich Talebzadeh  wrote:

> Trying to run Hive on Spark 1.3 engine, I get
>
>
>
> conf hive.spark.client.channel.log.level=null --conf
> hive.spark.client.rpc.max.size=52428800 --conf
> hive.spark.client.rpc.threads=8 --conf hive.spark.client.secret.bits=256
>
> 15/12/03 17:53:18 [stderr-redir-1]: INFO client.SparkClientImpl: Spark
> assembly has been built with Hive, including Datanucleus jars on classpath
>
> 15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl: Warning:
> Ignoring non-spark config property: hive.spark.client.connect.timeout=1000
>
> 15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl: Warning:
> Ignoring non-spark config property: hive.spark.client.rpc.threads=8
>
> 15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl: Warning:
> Ignoring non-spark config property: hive.spark.client.rpc.max.size=52428800
>
> 15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl: Warning:
> Ignoring non-spark config property: hive.spark.client.secret.bits=256
>
> 15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl: Warning:
> Ignoring non-spark config property:
> hive.spark.client.server.connect.timeout=9
>
> 15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl: 15/12/03
> 17:53:19 INFO client.RemoteDriver: Connecting to: rhes564:36577
>
> *15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl:
> Exception in thread "main" java.lang.NoSuchFieldError:
> SPARK_RPC_CLIENT_CONNECT_TIMEOUT*
>
> 15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl:at
> org.apache.hive.spark.client.rpc.RpcConfiguration.(RpcConfiguration.java:46)
>
> 15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl:at
> org.apache.hive.spark.client.RemoteDriver.(RemoteDriver.java:146)
>
> 15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl:at
> org.apache.hive.spark.client.RemoteDriver.main(RemoteDriver.java:556)
>
> 15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl:at
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>
> 15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl:at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>
> 15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl:at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>
> 15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl:at
> java.lang.reflect.Method.invoke(Method.java:606)
>
> 15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl:at
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
>
> 15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl:at
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
>
> 15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl:at
> org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
>
> 15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl:at
> org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
>
> 15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl:at
> org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>
>
>
> Any clues?
>
>
>
>
>
> Mich Talebzadeh
>
>
>
> *Sybase ASE 15 Gold Medal Award 2008*
>
> A Winning Strategy: Running the most Critical Financial Data on ASE 15
>
>
> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
>
> Author of the books* "A Practitioner’s Guide to Upgrading to Sybase ASE
> 15", ISBN 978-0-9563693-0-7*.
>
> co-author *"Sybase Transact SQL Guidelines Best Practices", ISBN
> 978-0-9759693-0-4*
>
> *Publications due shortly:*
>
> *Complex Event Processing in Heterogeneous Environments*, ISBN:
> 978-0-9563693-3-8
>
> *Oracle and Sybase, Concepts and Contrasts*, ISBN: 978-0-9563693-1-4, volume
> one out shortly
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> NOTE: The information in this email is proprietary and confidential. This
> message is for the designated recipient only, if you are not the intended
> recipient, you should destroy it immediately. Any information in this
> message shall not be understood as given or endorsed by Peridale Technology
> Ltd, its subsidiaries or their employees, unless expressly so stated. It is
> the responsibility of the recipient to ensure that this email is virus
> free, therefore neither Peridale Ltd, its subsidiaries nor their employees
> accept any responsibility.
>
>
>


Re: Any clue on this error, Exception in thread "main" java.lang.NoSuchFieldError: SPARK_RPC_CLIENT_CONNECT_TIMEOUT

2015-12-03 Thread Furcy Pin
The field SPARK_RPC_CLIENT_CONNECT_TIMEOUT seems to have been added to Hive
in the 1.1.0 release

https://github.com/apache/hive/blob/release-1.1.0/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java

Are you using an older version of Hive somewhere?


On Thu, Dec 3, 2015 at 7:15 PM, Mich Talebzadeh <m...@peridale.co.uk> wrote:

> Thanks, I tried all :(
>
>
>
> I am trying to make Hive use Spark and apparently Hive can use version 1.3
> of Spark as execution engine. Frankly I don’t know why this is not working!
>
>
>
> Mich Talebzadeh
>
>
>
> *Sybase ASE 15 Gold Medal Award 2008*
>
> A Winning Strategy: Running the most Critical Financial Data on ASE 15
>
>
> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
>
> Author of the books* "A Practitioner’s Guide to Upgrading to Sybase ASE
> 15", ISBN 978-0-9563693-0-7*.
>
> co-author *"Sybase Transact SQL Guidelines Best Practices", ISBN
> 978-0-9759693-0-4*
>
> *Publications due shortly:*
>
> *Complex Event Processing in Heterogeneous Environments*, ISBN:
> 978-0-9563693-3-8
>
> *Oracle and Sybase, Concepts and Contrasts*, ISBN: 978-0-9563693-1-4, volume
> one out shortly
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> NOTE: The information in this email is proprietary and confidential. This
> message is for the designated recipient only, if you are not the intended
> recipient, you should destroy it immediately. Any information in this
> message shall not be understood as given or endorsed by Peridale Technology
> Ltd, its subsidiaries or their employees, unless expressly so stated. It is
> the responsibility of the recipient to ensure that this email is virus
> free, therefore neither Peridale Ltd, its subsidiaries nor their employees
> accept any responsibility.
>
>
>
> *From:* Furcy Pin [mailto:furcy@flaminem.com]
> *Sent:* 03 December 2015 18:07
> *To:* user@hive.apache.org
> *Cc:* u...@spark.apache.org
> *Subject:* Re: Any clue on this error, Exception in thread "main"
> java.lang.NoSuchFieldError: SPARK_RPC_CLIENT_CONNECT_TIMEOUT
>
>
>
> maybe you compile and run against different versions of spark?
>
>
>
> On Thu, Dec 3, 2015 at 6:54 PM, Mich Talebzadeh <m...@peridale.co.uk>
> wrote:
>
> Trying to run Hive on Spark 1.3 engine, I get
>
>
>
> conf hive.spark.client.channel.log.level=null --conf
> hive.spark.client.rpc.max.size=52428800 --conf
> hive.spark.client.rpc.threads=8 --conf hive.spark.client.secret.bits=256
>
> 15/12/03 17:53:18 [stderr-redir-1]: INFO client.SparkClientImpl: Spark
> assembly has been built with Hive, including Datanucleus jars on classpath
>
> 15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl: Warning:
> Ignoring non-spark config property: hive.spark.client.connect.timeout=1000
>
> 15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl: Warning:
> Ignoring non-spark config property: hive.spark.client.rpc.threads=8
>
> 15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl: Warning:
> Ignoring non-spark config property: hive.spark.client.rpc.max.size=52428800
>
> 15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl: Warning:
> Ignoring non-spark config property: hive.spark.client.secret.bits=256
>
> 15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl: Warning:
> Ignoring non-spark config property:
> hive.spark.client.server.connect.timeout=9
>
> 15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl: 15/12/03
> 17:53:19 INFO client.RemoteDriver: Connecting to: rhes564:36577
>
> *15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl:
> Exception in thread "main" java.lang.NoSuchFieldError:
> SPARK_RPC_CLIENT_CONNECT_TIMEOUT*
>
> 15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl:at
> org.apache.hive.spark.client.rpc.RpcConfiguration.(RpcConfiguration.java:46)
>
> 15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl:at
> org.apache.hive.spark.client.RemoteDriver.(RemoteDriver.java:146)
>
> 15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl:at
> org.apache.hive.spark.client.RemoteDriver.main(RemoteDriver.java:556)
>
> 15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl:at
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>
> 15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl:at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>
> 15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl:at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMetho

Re: Best way to load CSV file into Hive

2015-11-01 Thread Furcy Pin
Hi Vijaya,

If you need some nice ETL capabilities, you may want to try
https://github.com/databricks/spark-csv

Among other things, spark-csv lets you read the csv as is and create and
insert a copy of the
data into a Hive table with any format you like (Parquet, ORC, etc.)

If you have a header row, it can remove it and use it to get the column names
directly, and it can also perform automatic type detection.
You can specify the delimiterChar and the quoteChar, but I did not see the
escapeChar in the doc.

In the end, it's as easy as :

val df : DataFrame =
sqlContext
.read
.format("com.databricks.spark.csv")
.option("header", "true") // Use first line of all files as header
.option("delimiter", ",") // Specify delimiter
.option("quote", "\"") // Specify quoteChar
.option("inferSchema", "true") // Automatically infer data types
.load("path/to/data.csv")

df.write
  .format("orc")
  .saveAsTable("db_name.table_name")


I believe HDP now supports spark.




On Sat, Oct 31, 2015 at 10:30 PM, Jörn Franke  wrote:

> You clearly need to escape those characters as for any other tool. You may
> want to use avro instead of csv , xml or JSON etc
>
> On 30 Oct 2015, at 19:16, Vijaya Narayana Reddy Bhoomi Reddy <
> vijaya.bhoomire...@whishworks.com> wrote:
>
> Hi,
>
> I have a CSV file which contains hunderd thousand rows and about 200+
> columns. Some of the columns have free text information, which means it
> might contain characters like comma, colon, quotes etc with in the column
> content.
>
> What is the best way to load such CSV file into Hive?
>
> Another serious issue, I have stored the file in a location in HDFS and
> then created an external hive table on it. However, upon running Create
> external table using HDP Hive View, the original CSV is no longer present
> in the folder where it is meant to be. Not sure on how HDP processes and
> where it is stored? My understanding was that EXTERNAL table wouldnt be
> moved from their original HDFS location?
>
> Request someone to help out!
>
>
> Thanks & Regards
> Vijay
>
>
>
> The contents of this e-mail are confidential and for the exclusive use of
> the intended recipient. If you receive this e-mail in error please delete
> it from your system immediately and notify us either by e-mail or
> telephone. You should not copy, forward or otherwise disclose the content
> of the e-mail. The views expressed in this communication may not
> necessarily be the view held by WHISHWORKS.
>
>


Re: locks are held on tables even when no job running

2015-10-24 Thread Furcy Pin
lto:ekoif...@hortonworks.com]
>> *Sent:* 23 October 2015 17:13
>>
>> *To:* user@hive.apache.org
>> *Subject:* Re: locks are held on tables even when no job running
>>
>>
>>
>> Mich,
>>
>> how were you running/killing the job?  was it ^C of CLI or something
>> else?
>>
>> (The only time you’d get Exclusive lock is to drop an object.  (With
>> DbTxnManager which looks like what you are using))
>>
>> The locks will timeout but
>> https://issues.apache.org/jira/browse/HIVE-11317 may be relevant.
>>
>>
>>
>> Furcy,
>>
>> could you file a Jira with a repro for the deadlock you are describing?
>>
>>
>>
>> Thanks,
>>
>> Eugene
>>
>>
>>
>>
>>
>>
>>
>> *From: *Mich Talebzadeh <m...@peridale.co.uk>
>> *Reply-To: *"user@hive.apache.org" <user@hive.apache.org>
>> *Date: *Friday, October 23, 2015 at 1:18 AM
>> *To: *"user@hive.apache.org" <user@hive.apache.org>
>> *Subject: *RE: locks are held on tables even when no job running
>>
>>
>>
>> Hi Furcy,
>>
>>
>>
>> Thanks for the info.
>>
>>
>>
>> I ran the same job twice, killing it first time and starting again.
>> Actually your point about 5 min duration seems to be correct. my process
>> basically creates a new hive table with two additional columns and populate
>> it from an existing table hence the locks
>>
>>
>>
>> Even if the job is killed “WAITING EXCLUSIVE” locks are still held on the
>> new table (see below) and that is the cause of the issue
>>
>>
>>
>> Lock ID DatabaseTable   Partition   State   Type
>> Transaction ID  Last Hearbeat   Acquired At UserHostname
>>
>> 14031   asehadoop   t   NULLWAITING EXCLUSIVE   NULL
>> 1445586539232   NULLhduser  rhes564
>>
>> 14029   asehadoop   t_staging   NULLACQUIRED
>> SHARED_READ NULL1445586247044   1445585940653   hduser  rhes564
>>
>> 14029   asehadoop   t   NULLACQUIREDSHARED_READ
>> NULL1445586247044   1445585940654   hduser  rhes564
>>
>> 14030   asehadoop   t   NULLWAITING EXCLUSIVE   NULL
>> 1445586471827   NULL    hduser  rhes564
>>
>>
>>
>> I am not sure this is behaving like classic RDBMS like Sybase or MSSQL
>> where the rollback is happening after KILL command and the locks are held
>> until rollback is complete. Killing a process itself will not release the
>> locks!
>>
>>
>>
>> Regards,
>>
>>
>>
>>
>>
>> Mich Talebzadeh
>>
>>
>>
>> *Sybase ASE 15 Gold Medal Award 2008*
>>
>> A Winning Strategy: Running the most Critical Financial Data on ASE 15
>>
>>
>> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
>>
>> Author of the books* "A Practitioner’s Guide to Upgrading to Sybase ASE
>> 15", ISBN 978-0-9563693-0-7*.
>>
>> co-author *"Sybase Transact SQL Guidelines Best Practices", ISBN
>> 978-0-9759693-0-4*
>>
>> *Publications due shortly:*
>>
>> *Complex Event Processing in Heterogeneous Environments*, ISBN:
>> 978-0-9563693-3-8
>>
>> *Oracle and Sybase, Concepts and Contrasts*, ISBN: 978-0-9563693-1-4,
>> volume one out shortly
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> NOTE: The information in this email is proprietary and confidential. This
>> message is for the designated recipient only, if you are not the intended
>> recipient, you should destroy it immediately. Any information in this
>> message shall not be understood as given or endorsed by Peridale Technology
>> Ltd, its subsidiaries or their employees, unless expressly so stated. It is
>> the responsibility of the recipient to ensure that this email is virus
>> free, therefore neither Peridale Ltd, its subsidiaries nor their employees
>> accept any responsibility.
>>
>>
>>
>> *From:* Furcy Pin [mailto:furcy@flaminem.com <furcy@flaminem.com>]
>>
>> *Sent:* 23 October 2015 09:08
>> *To:* user@hive.apache.org
>> *Subject:* Re: locks are held on tables even when no job running
>>
>>
>>
>> Hi Mich,
>>
>>
>>
>> I believe the duration of locks is defined by hive.txn.timeout, which is
>> 5 min by default.
>>
>> https://cwiki.apache.org/confl

Re: locks are held on tables even when no job running

2015-10-23 Thread Furcy Pin
Hi Mich,

I believe the duration of locks is defined by hive.txn.timeout, which is 5
min by default.
https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties

Retry your SHOW LOCKS command and check that the Last HeartBeat is not
changing.
If it is, it means your query is still active somehow. If it isn't, the
lock should disappear by itself after the timeout.
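
For reference, the commands I mean (the table name t is taken from your output):

SHOW LOCKS;
SHOW LOCKS t EXTENDED;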

Also, I don't know if this is a known issue in Hive's Jira, but we noticed
that trying to read and write data
into the same table within a single query creates a deadlock.

If that is what you were trying to do, you should rather write your data in
a temporary file and then move it back into the table.








On Fri, Oct 23, 2015 at 9:56 AM, Mich Talebzadeh 
wrote:

> Hi,
>
>
>
>
>
> What is the duration of locks held in Hive?
>
>
>
> I have got the following locks in Hive, although I have already killed the
> jobs!
>
>
>
>
>
> Lock ID DatabaseTable   Partition   State   Type
> Transaction ID  Last Hearbeat   Acquired At UserHostname
>
> 14031   asehadoop   t   NULLWAITING EXCLUSIVE   NULL
> 1445586539232   NULLhduser  rhes564
>
> 14029   asehadoop   t_staging   NULLACQUIRED
> SHARED_READ NULL1445586247044   1445585940653   hduser  rhes564
>
> 14029   asehadoop   t   NULLACQUIREDSHARED_READ
> NULL1445586247044   1445585940654   hduser  rhes564
>
> 14030   asehadoop   t   NULLWAITING EXCLUSIVE   NULL
> 1445586471827   NULLhduser  rhes564
>
>
>
>
>
> mapred job -list
>
> Total jobs:0
>
>   JobId  State   StartTime
> UserName   Queue  Priority   UsedContainers
> RsvdContainers  UsedMem RsvdMem NeededMem AM info
>
>
>
> No locks are held in metastore (Oracle in my case) as well.
>
>
>
> Thanks
>
>
>
>
>
> Mich Talebzadeh
>
>
>
> *Sybase ASE 15 Gold Medal Award 2008*
>
> A Winning Strategy: Running the most Critical Financial Data on ASE 15
>
>
> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
>
> Author of the books* "A Practitioner’s Guide to Upgrading to Sybase ASE
> 15", ISBN 978-0-9563693-0-7*.
>
> co-author *"Sybase Transact SQL Guidelines Best Practices", ISBN
> 978-0-9759693-0-4*
>
> *Publications due shortly:*
>
> *Complex Event Processing in Heterogeneous Environments*, ISBN:
> 978-0-9563693-3-8
>
> *Oracle and Sybase, Concepts and Contrasts*, ISBN: 978-0-9563693-1-4, volume
> one out shortly
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> NOTE: The information in this email is proprietary and confidential. This
> message is for the designated recipient only, if you are not the intended
> recipient, you should destroy it immediately. Any information in this
> message shall not be understood as given or endorsed by Peridale Technology
> Ltd, its subsidiaries or their employees, unless expressly so stated. It is
> the responsibility of the recipient to ensure that this email is virus
> free, therefore neither Peridale Ltd, its subsidiaries nor their employees
> accept any responsibility.
>
>
>


Alias vs Assignment

2015-10-08 Thread Furcy Pin
Hi folks,


I would like to start a discussion with the Hive user and developper
community about an element of syntax present in SQL Server that could be
nice to have in Hive.


Back in 2012, before I started Hive, and was using SQL Server, I came
accross this post :

http://sqlblog.com/blogs/aaron_bertrand/archive/2012/01/23/bad-habits-to-kick-using-as-instead-of-for-column-aliases.aspx

that convinced me to write my queries like

#1
SELECT
myColumn = someFunction(someColumn),
myOtherColumn = someOtherFunction(someOtherColumn)
FROM ...

rather than

#2
SELECT
someFunction(someColumn) as myColumn
someOtherFunction(someOtherColumn) as myOtherColumn
FROM ...

The two syntax are equivalent in SQL Server, but only the second is allowed
in Hive.

In my opinion, there are two advantages of using #1 over #2 (and it seems
the blog post I mention above only mentions the first) :

   1. Readability: usually the name of the columns you are computing
   matters more than how you compute them.
   2. Updates: #1 can easily be transformed into an update query, #2
   requires some rewriting (thank god I discovered Sublime Text and its
   multi-line editing)


On the other hand, #1 is unfortunately not ISO compliant, even though IMHO
ISO did not pick the best choice this time... Besides, it would not be
Hive's first deviation from ISO.

I would like to hear what do you people think, would it be a good idea to
implement this in Hive?

Cheers,

Furcy


Re: transactional table + vectorization + where = bug

2015-09-23 Thread Furcy Pin
I just created it :


   1. HIVE-11933 <https://issues.apache.org/jira/browse/HIVE-11933>


On Mon, Sep 21, 2015 at 6:37 PM, Alan Gates <alanfga...@gmail.com> wrote:

> I am not aware of this issue.  Please file a JIRA, and if it does turn out
> to be a duplicate we can mark it as such.
>
> Alan.
>
> Furcy Pin <furcy@flaminem.com>
> September 19, 2015 at 2:36
> Hi,
>
>
> We bumped into a bug when using vectorization on a transactional table.
> Here is a minimal example :
>
> DROP TABLE IF EXISTS vectorization_transactional_test ;
> CREATE TABLE vectorization_transactional_test (
> id INT
> )
> CLUSTERED BY (id) into 3 buckets
> STORED AS ORC
> TBLPROPERTIES('transactional'='true') ;
>
> INSERT INTO TABLE vectorization_transactional_test values
> (1)
> ;
>
> SET hive.vectorized.execution.enabled=true ;
>
> SELECT
> *
> FROM vectorization_transactional_test
> WHERE id = 1
> ;
>
> With vectorization enabled, the last query will fail with an
> ArrayIndexOutOfBoundsException in the mappers.
> Here is the full stack:
>
> FATAL [main] org.apache.hadoop.hive.ql.exec.mr.ExecMapper:
> org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while
> processing row
> at
> org.apache.hadoop.hive.ql.exec.vector.VectorMapOperator.process(VectorMapOperator.java:52)
> at
> org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:170)
> at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
> at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Error
> evaluating 1
> at
> org.apache.hadoop.hive.ql.exec.vector.VectorSelectOperator.processOp(VectorSelectOperator.java:126)
> at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815)
> at
> org.apache.hadoop.hive.ql.exec.vector.VectorFilterOperator.processOp(VectorFilterOperator.java:111)
> at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815)
> at
> org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:95)
> at
> org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.forward(MapOperator.java:157)
> at
> org.apache.hadoop.hive.ql.exec.vector.VectorMapOperator.process(VectorMapOperator.java:45)
> ... 9 more
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 1
> at
> org.apache.hadoop.hive.ql.exec.vector.expressions.ConstantVectorExpression.evaluateLong(ConstantVectorExpression.java:102)
> at
> org.apache.hadoop.hive.ql.exec.vector.expressions.ConstantVectorExpression.evaluate(ConstantVectorExpression.java:150)
> at
> org.apache.hadoop.hive.ql.exec.vector.VectorSelectOperator.processOp(VectorSelectOperator.java:124)
> ... 15 more
>
>
> Of course, disabling vectorization removes the bug.
>
> More annoyingly, when the table is used in a JOIN, the job doesn't fail
> but returns a wrong result instead :
> for instance an empty table, while disabling vectorization returns a
> non-empty one. This behavior is harder to reproduce with a minimal example.
>
> We experienced this bug in version 1.1.0-cdh5.4.2.
>
> We didn't find any JIRA related to this, is it a known bug, or should we
> create a new JIRA?
>
> Best,
>
> Furcy
>
>
>
>
>
>


transactional table + vectorization + where = bug

2015-09-19 Thread Furcy Pin
Hi,


We bumped into a bug when using vectorization on a transactional table.
Here is a minimal example :

DROP TABLE IF EXISTS vectorization_transactional_test ;
CREATE TABLE vectorization_transactional_test (
id INT
)
CLUSTERED BY (id) into 3 buckets
STORED AS ORC
TBLPROPERTIES('transactional'='true') ;

INSERT INTO TABLE vectorization_transactional_test values
(1)
;

SET hive.vectorized.execution.enabled=true ;

SELECT
*
FROM vectorization_transactional_test
WHERE id = 1
;

With vectorization enabled, the last query will fail with an
ArrayIndexOutOfBoundsException in the mappers.
Here is the full stack:

FATAL [main] org.apache.hadoop.hive.ql.exec.mr.ExecMapper:
org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while
processing row
at
org.apache.hadoop.hive.ql.exec.vector.VectorMapOperator.process(VectorMapOperator.java:52)
at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:170)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Error
evaluating 1
at
org.apache.hadoop.hive.ql.exec.vector.VectorSelectOperator.processOp(VectorSelectOperator.java:126)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815)
at
org.apache.hadoop.hive.ql.exec.vector.VectorFilterOperator.processOp(VectorFilterOperator.java:111)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815)
at
org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:95)
at
org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.forward(MapOperator.java:157)
at
org.apache.hadoop.hive.ql.exec.vector.VectorMapOperator.process(VectorMapOperator.java:45)
... 9 more
Caused by: java.lang.ArrayIndexOutOfBoundsException: 1
at
org.apache.hadoop.hive.ql.exec.vector.expressions.ConstantVectorExpression.evaluateLong(ConstantVectorExpression.java:102)
at
org.apache.hadoop.hive.ql.exec.vector.expressions.ConstantVectorExpression.evaluate(ConstantVectorExpression.java:150)
at
org.apache.hadoop.hive.ql.exec.vector.VectorSelectOperator.processOp(VectorSelectOperator.java:124)
... 15 more


Of course, disabling vectorization removes the bug.

More annoyingly, when the table is used in a JOIN, the job doesn't fail but
returns a wrong result instead :
for instance an empty table, while disabling vectorization returns a
non-empty one. This behavior is harder to reproduce with a minimal example.

We experienced this bug in version 1.1.0-cdh5.4.2.

We didn't find any JIRA related to this, is it a known bug, or should we
create a new JIRA?

Best,

Furcy


Re: WHERE ... NOT IN (...) + NULL values = BUG

2015-07-07 Thread Furcy Pin
Thanks matshyeq,

you are right, I tested it on other sql engines and the result is the same.
(but I still find this confusing...)

SELECT 1 IN (1,2,3,NULL) ;
 true

SELECT 1 IN (2,3) ;
 false

SELECT 1 IN (2,3,NULL) ;
 NULL

SELECT 1 NOT IN (1,2,3,NULL) ;
 false

SELECT 1 NOT IN (2,3,NULL) ;
 NULL

SELECT 1 NOT IN (2,3) ;
 true
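
So the usual workaround, when NULLs may appear in the list or subquery, is to
filter them out explicitly. Purely illustrative, with made-up tables a and b:

SELECT a.*
FROM a
WHERE a.id NOT IN (SELECT b.id FROM b WHERE b.id IS NOT NULL);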






On Tue, Jul 7, 2015 at 5:24 PM, Grant Overby (groverby) grove...@cisco.com
wrote:

  “I call it my billion-dollar mistake. It was the invention of the null
 reference in 1965.”
 — Tony Hoare


 *Grant Overby*
 Software Engineer
 Cisco.com http://www.cisco.com/
 grove...@cisco.com
 Mobile: *865 724 4910 865%20724%204910*



Think before you print.

 This email may contain confidential and privileged material for the sole
 use of the intended recipient. Any review, use, distribution or disclosure
 by others is strictly prohibited. If you are not the intended recipient (or
 authorized to receive for the recipient), please contact the sender by
 reply email and delete all copies of this message.

 Please click here
 http://www.cisco.com/web/about/doing_business/legal/cri/index.html for
 Company Registration Information.




   From: matshyeq matsh...@gmail.com
 Reply-To: user@hive.apache.org user@hive.apache.org
 Date: Tuesday, July 7, 2015 at 9:25 AM
 To: user user@hive.apache.org
 Subject: Re: WHERE ... NOT IN (...) + NULL values = BUG

   Obviously, the expected answer is always 2.

  That's incorrect.
 It's expected behaviour, SQL standard and I would expect every other DBs
 behave same way.
 The direct comparison to NULL returns FALSE. Always. Doesn't matter if
  used as  ,=, IN, NOT IN.
 IS (NOT) NULL is the right way to handle such cases. COALESCE is some
 alternative too.

  Thank you,
 Kind Regards
 ~Maciek

 On Tue, Jul 7, 2015 at 11:46 AM, Furcy Pin furcy@flaminem.com wrote:

 Hi folks,

  just to let my fellow Hive users know that we found a bug with subquery
 in where clauses and created a JIRA for it.

  https://issues.apache.org/jira/browse/HIVE-11192

  The latest version seems to be affected.

  Regards,

  Furcy Pin





Re: CSV file reading in hive

2015-02-13 Thread Furcy Pin
Hi Sreeman,

Unfortunately, I don't think that Hive's built-in formats can currently read
csv files with fields enclosed in double quotes.
More generally, having ingested quite a lot of messy csv files myself,
I would recommend writing a MapReduce (or Spark) job
to clean your csv before giving it to Hive. This is what I did.
The (other) kind of issue I've met were among :

   - File not encoded in utf-8, making special characters unreadable for
   Hive
   - Some lines with missing or too many columns, which could shift your
   columns and ruin your stats.
   - Some lines with unreadable characters (probably data corruption)
   - I even got some lines with java stack traces in it

I hope your csv is cleaner than that. If you have control over how it is
generated, I would recommend replacing your current separator with a tab
(and escaping inline tabs as \t), or something like that.
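
Once the file is cleaned, declaring the table is straightforward; here is a sketch
with made-up column names and location:

CREATE EXTERNAL TABLE my_clean_table (
  id STRING,
  amount STRING,
  free_text STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/data/clean/my_clean_table';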

There might be some open source tools for data cleaning already out there.
I plan to release mine one day, once I've migrated it to Spark maybe, and
if my company agrees.

If you're lazy, I heard that Dataiku Studio (which has a free version) can
do such thing, though I never used it myself.

Hope this helps,

Furcy



2015-02-13 7:30 GMT+01:00 Slava Markeyev slava.marke...@upsight.com:

 You can use lazy simple serde with ROW FORMAT DELIMITED FIELDS TERMINATED
 BY ',' ESCAPED BY '\'. Check the DDL for details
 https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL



 On Thu, Feb 12, 2015 at 8:19 PM, Sreeman sreebalin...@gmail.com wrote:

  Hi All,

 How all of you are creating hive/Impala table when the CSV file has some
 values with COMMA in between. it is like

 sree,12345,payment made,but it is not successful





 I know opencsv serde is there but it is not available in lower versions
 of Hive 14.0






 --

 Slava Markeyev | Engineering | Upsight
 Find me on LinkedIn http://www.linkedin.com/in/slavamarkeyev
 http://www.linkedin.com/in/slavamarkeyev



Re: Hash algorithm issues

2015-01-21 Thread Furcy Pin
Hi Murali,

As you can see from the source code of Hive's hash UDF:

https://github.com/apache/hive/blob/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFHash.java
https://github.com/apache/hive/blob/trunk/serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/ObjectInspectorUtils.java

it is basically using Java's hashCode method.

In Java, I believe hashCode is mostly used for HashMaps because it is
really efficient to compute; however, it is really poor at avoiding
collisions.

For instance, in Java, 100L.hashCode() == 100.

Moreover, the .hashCode() method (as well as hive's hash) returns an int.

So if you hash more than 2^32 bigints into Integers of size 32 you'll end
up with... a lot of collisions.

I suggest you write your own UDF that directly computes the hash in
hex format, and perhaps uses another hash function, e.g. md5.
It is pretty easy to find md5 UDF implementations you can start from,
eg: https://gist.github.com/dataminelab/1050002

There is no guarantee of quality however, I do not personally use this one,
so I don't know if it works fine or not.
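
If you prefer not to deploy a custom UDF at all, another route (which I have not
benchmarked) is the built-in reflect() function, assuming commons-codec is
available on your cluster's classpath; the table name below is made up:

SELECT reflect('org.apache.commons.codec.digest.DigestUtils', 'md5Hex',
               CAST(acct_num AS STRING)) AS acct_num_hash
FROM accounts_staging;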















2015-01-21 22:38 GMT+01:00 murali parimi muralikrishna.par...@icloud.com:

 Hello team,

 We are extracting data from netezza and loading into hive tables. While
 loading data, we are using hash function to mask few PII data for security
 reasons.

 One such column is acct_num stored as bigint in netezza, which we are
 storing in a string column after converting hash of that acct_num to a hex
 format.

 Now the issue is we found same value is generated for distinct acct_num in
 most of the records. So any known issues with the algorithm that hash
 function uses in hive?

 Thanks,
 Murali

 Sent from my iPhone


Re: Hive order by not working properly

2015-01-07 Thread Furcy Pin
Hi Vengatesh,

AFAIK, in Hive, the columns used in the ORDER BY clause need to be present
in the SELECT clause.
I guess this is because they have to be passed to the final reducer.

This is why in your case the first query works and the second does not.
The following query, however should work too:

SELECT ACCOUNTID, *concat('test_',ACCOUNTID)* from table *order by*
 ACCOUNTID;

As you already found, the correct way to write this query is to alias your
column and order with it.
Note that this works too:

SELECT *concat('test_',ACCOUNTID) as **ACCOUNTID* from table *order by*
ACCOUNTID;
I am still using 0.12 though, maybe later versions changed some things.

Hope this helps,

Furcy Pin


PS:

On the other hand, Hive has the opposite behavior in GROUP BY clauses, eg
this will not work:

SELECT *concat('test_',ACCOUNTID) as **tempColumn, COUNT(1)* from table
GROUP BY *tempColumn* *order by* tempColumn;

while this will work:

SELECT *concat('test_',ACCOUNTID) as **tempColumn, COUNT(1)* from table
GROUP BY *concat('test_',ACCOUNTID)* *order by* tempColumn;

and so will this:

SELECT *concat('test_',ACCOUNTID) as **tempColumn, COUNT(1)* from table
GROUP BY *concat('test_',ACCOUNTID)* *order by* *concat('test_',ACCOUNTID) *
;

From my understanding, the reason is that the parsing of the query
compiles the GROUP BY clause before the SELECT (as it will depend on it),
so it doesn't know yet that *concat('test_',ACCOUNTID)* will be named
tempColumn.
On the other hand, ORDER BY is compiled after the SELECT so it does not
recognise *concat('test_',ACCOUNTID)* which has been named tempColumn.





2015-01-07 11:11 GMT+01:00 vengatesh.babu vengatesh.b...@zohocorp.com:

 Hi,

 Hive *order by* not working properly with columns involved in *functions.*

 *Column without function:*

 SELECT *ACCOUNTID* from table *order by* ACCOUNTID;

 (Above query works fine in both HIVE  MYSQL)

 *Column with function:*

 SELECT *concat('test_',ACCOUNTID)* from table *order by* ACCOUNTID;

 In mysql , Above query working fine.

 In hive, below exception thrown

 *FAILED: SemanticException [Error 10004]: Line 1:59 Invalid table alias or
 column reference 'ACCOUNTID': (possible column names are: _c0)*

 Temporary Solution :

 SELECT *concat('test_',ACCOUNTID) as tempColumn* from table *order by*
  tempcolumn;

 Is there any solution available without assigning column as tempcolumn ?

 Thanks

 Vengatesh Babu K M






Re: How to match three letter month name in Hive queries

2014-12-21 Thread Furcy Pin
Hi Thimut,

I believe that the UDF unix_timestamp uses the java class SimpleDateFormat.
http://docs.oracle.com/javase/6/docs/api/java/text/SimpleDateFormat.html

From the doc, you can see that 'm' denotes a minute while 'M' denotes a
month.
For your problem, yyyy-MMM-dd should do the trick.
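
For instance, something like this should return a valid timestamp (assuming an
English JVM locale so that 'Dec' is recognized):

SELECT unix_timestamp('2014-Dec-20', 'yyyy-MMM-dd');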

Regards,

Furcy


2014-12-21 4:36 GMT+01:00 Thimuth Amarakoon thim...@gmail.com:

 Hi,

 How can we match a date value like *2014-Dec-20* in unix_timestamp()? The
 pattern *yyyy-MM-dd* matches 2014-12-20 format. But yyyy-mmm-dd or
 yyyy-m-dd is not doing the trick for matching the month name.

 Thanks and regards,
 Thimuth



Re: UDTF KryoException: unable create/find class error in hive 0.13

2014-09-10 Thread Furcy Pin
Hi,

If I get your code right, you are trying to implement a sort of explode
function for Map<String, Map<String, String>> that lists the aFieldName values
of the inner maps.
I think you could achieve this with a simple
LATERAL VIEW explode(mapmap) T as key, map
and then selecting map['aFieldName'].
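
For instance, assuming a table src with a column event_map of type
MAP<STRING, MAP<STRING, STRING>> (names made up, adjust to your schema):

SELECT t.event_id, t.details['aFieldName'] AS field_value
FROM src
LATERAL VIEW explode(event_map) t AS event_id, details;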

If you want to go on with your UDTF (to make it more complex, I presume), I suggest
you have a look at the implementation of the explode UDTF
https://github.com/apache/hive/blob/b8250ac2f30539f6b23ce80a20a9e338d3d31458/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDTFExplode.java
and try to do everything in a similar way.

For instance, I noticed that your forwardObj is of type ArrayList<String>,
while it should probably be a Text[].

Regards,

Furcy


2014-09-09 21:33 GMT+02:00 Echo Li echo...@gmail.com:

 Hi Furcy,

 Thanks for sharing. I modified my code to mark the map variables
 transient, but I still got the same error. This is the code:

 public class fun_name extends GenericUDTF {

     private PrimitiveObjectInspector stringOI = null;

     // marked transient so that Kryo does not try to serialize them with the operator plan
     transient Map<String, Map<String, String>> mapObject;
     transient Map<String, String> eventDetails;

     @Override
     public void close() throws HiveException {
     }

     @Override
     public StructObjectInspector initialize(ObjectInspector[] args) throws UDFArgumentException {

         if (args.length != 1) {
             throw new UDFArgumentException("fun_name() takes exactly one argument");
         }

         if (args[0].getCategory() != ObjectInspector.Category.PRIMITIVE
                 || ((PrimitiveObjectInspector) args[0]).getPrimitiveCategory()
                     != PrimitiveObjectInspector.PrimitiveCategory.STRING) {
             throw new UDFArgumentException("fun_name() takes a string as a parameter");
         }

         stringOI = (PrimitiveObjectInspector) args[0];

         // output schema: two string columns
         ArrayList<String> fieldNames = new ArrayList<String>(1);
         ArrayList<ObjectInspector> fieldOIs = new ArrayList<ObjectInspector>(1);

         fieldNames.add("field1");
         fieldNames.add("field2");

         fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
         fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);

         return ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames, fieldOIs);
     }

     @Override
     public void process(Object[] record) throws HiveException {

         ObjectMapper mapper = new ObjectMapper();

         try {

             final String document = (String) stringOI.getPrimitiveJavaObject(record[0]);

             mapObject = mapper.readValue(document,
                     new TypeReference<Map<String, Map<String, String>>>() {
                     });

             for (String someId : mapObject.keySet()) {

                 eventDetails = mapObject.get(someId);

                 List<String> forwardObj = new ArrayList<String>();

                 forwardObj.add(someId);
                 forwardObj.add(eventDetails.get("aFieldName"));

                 forward(forwardObj);
             }

         } catch (IOException e) {
             // (error handling assumed: rethrow parse failures as a HiveException)
             throw new HiveException(e);
         }
     }
 }

 To compile and build the jar:

 javac -Xlint:deprecation fun_name.java
 jar cvf from_json.jar from_json.class 'fun_name$1.class'

 To add the jar and create the function in Hive:

 add jar fun_name.jar;
 create temporary function fun_name as 'fun_name';

 Both succeeded.

 When I run a query, I get this error:

 Vertex failed, vertexName=Map 1, vertexId=vertex_1407434664593_24842_1_00,
 diagnostics=[Vertex Input: impression_test initializer failed.,
 org.apache.hive.com.esotericsoftware.kryo.KryoException:
 Unable to find class: fun_name
 Serialization trace:
 genericUDTF (org.apache.hadoop.hive.ql.plan.UDTFDesc)
 conf (org.apache.hadoop.hive.ql.exec.UDTFOperator)
 childOperators (org.apache.hadoop.hive.ql.exec.SelectOperator)
 childOperators (org.apache.hadoop.hive.ql.exec.TableScanOperator)
 aliasToWork (org.apache.hadoop.hive.ql.plan.MapWork)]
 DAG failed due to vertex failure. failedVertices:1 killedVertices:0
 FAILED: Execution Error, return code 2 from
 org.apache.hadoop.hive.ql.exec.tez.TezTask


 Could it be something else?

 On Tue, Sep 9, 2014 at 1:31 AM, Furcy Pin furcy@flaminem.com wrote:

 Hi,

 I think I encountered this kind of serialization problem when writing
 UDFs.
  Usually, marking every field of the UDF as *transient* does the trick.

  I guess the error means that Kryo tries to serialize the UDF class and
  everything inside it; by marking the fields as transient, you ensure that
  it will not, and that they will be instantiated in the default constructor
  or during the call to initialize().

 Please keep me informed if it works or not,

 Regards,

 Furcy


 2014-09-09 1:44 GMT+02:00 Echo Li echo...@gmail.com:

  I wrote a UDTF in Hive 0.13. The function parses a column which is a JSON
  string and returns a table. The function compiles successfully by adding
  hive-exec-0.13.0.2.1.2.1-471.jar to the classpath; however, when the jar is
  added to Hive and a function is created using the jar, then when I try to
  run a query using that function, I got

Re: UDTF KryoException: unable create/find class error in hive 0.13

2014-09-09 Thread Furcy Pin
Hi,

I think I encountered this kind of serialization problem when writing UDFs.
Usually, marking every field of the UDF as *transient* does the trick.

I guess the error means that Kryo tries to serialize the UDF class and
everything inside it; by marking the fields as transient, you ensure that
it will not, and that they will be instantiated in the default constructor
or during the call to initialize().

Please keep me informed if it works or not,

Regards,

Furcy


2014-09-09 1:44 GMT+02:00 Echo Li echo...@gmail.com:

 I wrote a UDTF in Hive 0.13. The function parses a column which is a JSON
 string and returns a table. The function compiles successfully by adding
 hive-exec-0.13.0.2.1.2.1-471.jar to the classpath; however, when the jar is
 added to Hive and a function is created using the jar, then when I try to
 run a query using that function, I got the error:

 org.apache.hive.com.esotericsoftware.kryo.KryoException: Unable to find
 class: class_name

 I went through all the steps in a lower version of Hive (0.10) and
 everything works fine. I searched around and it seems that this is caused
 by the 'kryo' serde, so my question is: is there a fix? And where can I
 find it?

 thank you.



Re: MatchPath UDF usage info ?

2014-09-03 Thread Furcy Pin
Hi Muhammad,

From what I googled a few months ago on the subject, the MatchPath UDF has
been removed from the Cloudera and Hortonworks releases because TeraData
claims it violates one of their patents (apparently renaming it did not
suffice).

I guess that if you really need it, it might be possible to add it yourself
as an external UDF, since the code is still available out there, but I have
no idea whether TeraData would have the right to come after you (or not?)
if you do.

By the way, if anyone has news on the current situation with MatchPath and
TeraData, that would be welcome.

Furcy




2014-09-03 17:18 GMT+02:00 Muhammad Asif Abbasi asif.abb...@gmail.com:

 Hi,

 Many thanks for sending these links. Looking forward to more documentation
 around this.

 BTW, why does hive-exec-0.13.0.2.1.1.0-385.jar not have any class
 files for the MatchPath UDF? Have they been chucked out to a separate JAR
 file?
 I can see that hive-exec-0.13.0.jar has the appropriate class files,
 and I have tried to use them. They work well with the demo data set, but we
 certainly need more documentation around this.

 Regards,
 Asif Abbasi




 On Tue, Aug 26, 2014 at 6:42 AM, Lefty Leverenz leftylever...@gmail.com
 wrote:

 Thanks for pointing out that we still need documentation for this in the
 wiki. (I've added a doc comment to HIVE-5087:
 https://issues.apache.org/jira/browse/HIVE-5087.) In the meantime,
 googling "Hive npath" turned up these sources of information:

 - https://github.com/hbutani/SQLWindowing/wiki
 - http://www.slideshare.net/Hadoop_Summit/analytical-queries-with-hive (slides 20-21)
 - http://www.justinjworkman.com/big-data/using-npath-with-apache-hive/


 -- Lefty


 On Mon, Aug 25, 2014 at 8:27 AM, Muhammad Asif Abbasi 
 asif.abb...@gmail.com wrote:

 Hi All,

 I am trying to use MatchPath UDF (Previously called NPath). Does anybody
 have a document around its syntax and usage?

 Regards,
 Asif






Help needed: Out of memory with windowing functions

2014-08-20 Thread Furcy Pin
Hi all,

I have an event table with (user_id, timestamp, event)
and I'm trying to write a query to get the first 10 events for each user.


My query goes like this :

SELECT user_id, event
FROM
(
SELECT
user_id,
event,
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY time) as rownum
FROM eventTable
) T
WHERE rownum <= 10

However, the table may contain millions of events for the same user and I'm
getting
an OutOfMemory Error in the reduce phase, inside the following method:

org.apache.hadoop.hive.ql.udf.generic.GenericUDAFRowNumber$RowNumberBuffer.incr(GenericUDAFRowNumber.java:80)



It seems that the windowing functions were designed to store a buffer
containing all the results for each PARTITION, and to write everything out
once all the rows of that partition have been read.

This makes windowing with Hive not very scalable...


My questions are:


a) Is there a reason why it was implemented this way rather than in a
streaming fashion?


b) Do you know how I could rewrite the query to avoid the problem (if
possible without having to write my own UDF)?



Thanks,


Furcy


Re: Help needed: Out of memory with windowing functions

2014-08-20 Thread Furcy Pin
Thank you very much for your answer Ashutosh,

it seems non-trivial indeed !


2014-08-20 17:51 GMT+02:00 Ashutosh Chauhan hashut...@apache.org:

 Hi Furcy,

  Sorry that you ran into this, but this is a non-trivial problem. The good
  news is that Harish has done some good work in this area. See the
  following links. All of this will be available in the upcoming release,
  the date for which is under discussion on the dev list.

 1. https://issues.apache.org/jira/browse/HIVE-6999
 2. https://issues.apache.org/jira/browse/HIVE-7062
 3. https://issues.apache.org/jira/browse/HIVE-7063
 4. https://issues.apache.org/jira/browse/HIVE-7143
 5. https://issues.apache.org/jira/browse/HIVE-7344

 Hope it helps,
 Ashutosh


 On Wed, Aug 20, 2014 at 5:34 AM, Furcy Pin furcy@flaminem.com wrote:

 Hi all,

 I have an event table with (user_id, timestamp, event)
 and I'm trying to write a query to get the first 10 events for each user.


  My query goes like this :

 SELECT user_id, event
 FROM
 (
 SELECT
 user_id,
 event,
 ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY time) as rownum
 FROM eventTable
 ) T
 WHERE rownum <= 10

 However, the table may contain millions of events for the same user and
 I'm getting
 an OutOfMemory Error in the reduce phase, inside the following method:


 org.apache.hadoop.hive.ql.udf.generic.GenericUDAFRowNumber$RowNumberBuffer.incr(GenericUDAFRowNumber.java:80)





 It seems that the windowing functions were designed to store a buffer
 containing all the results for each PARTITION, and to write everything out
 once all the rows of that partition have been read.

 This makes windowing with Hive not very scalable...


 My questions are:


 a) Is there a reason why it was implemented this way rather than in a 
 streaming fashion?


 b) Do you know how I could rewrite the query to avoid the problem (if 
 possible without having to write my own UDF)?







 Thanks,




 Furcy










about collect_set and ordering

2014-07-30 Thread Furcy Pin
Hi all,

I just wanted to point out a little gotcha we hit while using the
collect_set UDF:

You should not directly perform a GROUP BY over a collect_set(...), because
the set is cast as an array and is not necessarily sorted.

For example, we ran a query looking like this:

SELECT
  set,
  COUNT(1) as nb
FROM
(
  SELECT
    colA,
    collect_set(colB) as set
  FROM db.table
  GROUP BY colA
) T
GROUP BY set
;

and got :

[A] 10205
[B] 93856
[A,B] 34865
[B,A] 48324

We had to replace it with a sort_array(collect_set(...)).
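
For reference, a sketch of the corrected query (same hypothetical db.table
as above, untested):

SELECT
  sorted_set,
  COUNT(1) as nb
FROM
(
  SELECT
    colA,
    sort_array(collect_set(colB)) as sorted_set
  FROM db.table
  GROUP BY colA
) T
GROUP BY sorted_set
;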

I just wanted to point out that perhaps this subtlety should be mentioned
in the doc of the collect_set UDF...

As a corollary, do you guys think a collect_sorted_set using a TreeSet
would be useful (and/or more efficient than using sort_array(collect_set))?

Regards,

Furcy


Re: Multiple insert on same table with INSERT OVERWRITE + INSERT INTO ?

2014-07-01 Thread Furcy Pin
After doing some simple tests: multiple inserts into the same partition of
the same table are disallowed by Hive,
except when using dynamic partitioning...
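
For reference, a minimal sketch of the dynamic-partitioning case I mean
(hypothetical tables src and dest sharing a ds partition column, untested):

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

FROM src
INSERT OVERWRITE TABLE dest PARTITION (ds)
SELECT id, value, ds
INSERT INTO TABLE dest PARTITION (ds)
SELECT id, value, ds;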

In that specific case, I don't know if the behavior should be considered
undefined behavior, a feature, or a bug...

What do you think?



2014-06-30 9:15 GMT+02:00 Furcy Pin furcy@flaminem.com:

 Hi all,


 Does anyone know what the expected behavior is for a query like this:

 FROM source
 INSERT OVERWRITE TABLE dest
 SELECT *
 INSERT INTO TABLE dest
 SELECT *

 is it the same as

 FROM source
 INSERT INTO TABLE dest
 SELECT *
 INSERT OVERWRITE TABLE dest
 SELECT *

 ?

 I'm asking because I ran a query looking like the first one on Hive 0.12,
 and it seems to me that the OVERWRITE was completely ignored:
 I mean that the old data (which was there before the query) was still
 there...
 And I checked, it was not an access-rights problem: the table was 777, with
 no sticky bit (I have already had issues with ghost data on OVERWRITE
 because of that).

 One last thing: my table was partitioned and I used nonstrict dynamic
 partitioning, so perhaps that had an impact on the behavior too.

 Thanks,

 Furcy






Multiple insert on same table with INSERT OVERWRITE + INSERT INTO ?

2014-06-30 Thread Furcy Pin
Hi all,


Does anyone know what the expected behavior is for a query like this:

FROM source
INSERT OVERWRITE TABLE dest
SELECT *
INSERT INTO TABLE dest
SELECT *

is it the same as

FROM source
INSERT INTO TABLE dest
SELECT *
INSERT OVERWRITE TABLE dest
SELECT *

?

I'm asking because I ran a query looking like the first one on Hive 0.12,
and it seems to me that the OVERWRITE was completely ignored:
I mean that the old data (which was there before the query) was still
there...
And I checked, it was not an access-rights problem: the table was 777, with
no sticky bit (I have already had issues with ghost data on OVERWRITE
because of that).

One last thing: my table was partitioned and I used nonstrict dynamic
partitioning, so perhaps that had an impact on the behavior too.

Thanks,

Furcy


Re: Finding Max of a column without using any Aggregation functions

2014-04-23 Thread Furcy Pin
Hi,

note that if your table contains the max value several times, all the
occurrences will be returned. Note also that if it contains a NULL, it will
be returned too.

How about this?

 SELECT score FROM student ORDER BY score DESC LIMIT 1 ;


Note that on this query Impala is inconsistent with Hive and MySQL, since
Impala considers that NULL is *greater* than everything, while Hive and
MySQL consider NULL to be *smaller* than everything.
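
If you want to avoid the NULL issue altogether, regardless of the engine,
a simple (untested) variant:

 SELECT score FROM student WHERE score IS NOT NULL ORDER BY score DESC LIMIT 1 ;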

And just for fun, you can also try in Hive:

FROM
(
  SELECT score FROM student
  DISTRIBUTE BY '1'
  SORT BY score DESC
) M
SELECT score
LIMIT 1
;

FROM
(
  SELECT score FROM student
  DISTRIBUTE BY '1'
) M
REDUCE score USING 'sort -rn | head -1' AS score
;

The second one is just there to demonstrate the possibilities of a custom
reduce, but it is greatly inefficient (in both speed and memory).
And be sure to run SET mapred.reduce.tasks=1; beforehand if you don't want
idle reduce tasks...

(I guess I have some spare time too!)



2014-04-23 3:27 GMT+02:00 Subramanian, Sanjay (HQP) 
sanjay.subraman...@roberthalf.com:

  Hey guys

  TABLE=STUDENT
  COLUMN=SCORE

  U want to find the max value in the column without using any aggregation
 functions.

 It's easy in an RDB context, but I was trying to get a solution in Hive
 (clearly I have some spare time on my hands - LOL)

 select
   nfr.score
 from
   student nfr
 left outer join
   (select
      a.score as fra,
      b.score as frb
    from
      (select
         '1' as dummy,
         score
       from
         student
      ) a
    join
      (select
         '1' as dummy,
         score
       from
         student
      ) b
    on
      a.dummy = b.dummy
    where
      a.score < b.score
   ) frab
 on
   frab.fra = nfr.score
 where
   frab.fra is null

  Thanks

 Warm Regards


  Sanjay





Re: Finding Max of a column without using any Aggregation functions

2014-04-23 Thread Furcy Pin
Ok,

At any rate, you could replace your dummy join by a cross join:

SELECT
  nfr.score
FROM student nfr
LEFT OUTER JOIN
(
  SELECT
    a.score as fra
  FROM student a
  CROSS JOIN student b
  WHERE a.score < b.score
) frab
ON
  frab.fra = nfr.score
WHERE
  frab.fra is null and nfr.score is not null
;

Regards,

Furcy



2014-04-23 16:49 GMT+02:00 Sanjay Subramanian sanjaysubraman...@yahoo.com:

 Thanks
 For the sake of this question I wanted to avoid all order by and limit
 syntax. It's more of a challenge question.
 Regards
 Sanjay

 Sent from my iPhone

 On Apr 23, 2014, at 2:51 AM, Furcy Pin furcy@flaminem.com wrote:

 Hi,

 note that if your table contains the max value several times, all the
 occurrences will be returned. Note also that if it contains a NULL, it will
 be returned too.

 How about this?

  SELECT score FROM student ORDER BY score DESC LIMIT 1 ;


 Note that on this query Impala is inconsistent with Hive and MySQL, since
 Impala considers that NULL is *greater* than everything, while Hive and
 MySQL consider NULL to be *smaller* than everything.

 And just for fun, you can also try in Hive:

  FROM
 (
   SELECT score FROM student
   DISTRIBUTE BY '1'
   SORT BY score DESC
 ) M
 SELECT score
 LIMIT 1
 ;

 FROM
 (
   SELECT score FROM student
   DISTRIBUTE BY '1'
 ) M
 REDUCE score USING 'sort -rn | head -1' AS score
 ;

 The second one is just there to demonstrate the possibilities of a custom
 reduce, but it is greatly inefficient (in both speed and memory).
 And be sure to run SET mapred.reduce.tasks=1; beforehand if you don't want
 idle reduce tasks...

 (I guess I have some spare time too!)



 2014-04-23 3:27 GMT+02:00 Subramanian, Sanjay (HQP) 
 sanjay.subraman...@roberthalf.com:

  Hey guys

  TABLE=STUDENT
  COLUMN=SCORE

  U want to find the max value in the column without using any
 aggregation functions.

 It's easy in an RDB context, but I was trying to get a solution in Hive
 (clearly I have some spare time on my hands - LOL)

 select
   nfr.score
 from
   student nfr
 left outer join
   (select
      a.score as fra,
      b.score as frb
    from
      (select
         '1' as dummy,
         score
       from
         student
      ) a
    join
      (select
         '1' as dummy,
         score
       from
         student
      ) b
    on
      a.dummy = b.dummy
    where
      a.score < b.score
   ) frab
 on
   frab.fra = nfr.score
 where
   frab.fra is null

  Thanks

 Warm Regards


  Sanjay