Re: ODBC-hiveserver2 question

2018-02-23 Thread Andrew Sears

Add JAR works with HDFS, though perhaps not with ODBC drivers.

ADD JAR hdfs://:8020/hive_jars/hive-contrib-2.1.1.jar should work (depending on
your namenode port; also confirm this file exists).

Alternative syntax:

ADD JAR hdfs:/hive_jars/hive-contrib-2.1.1.jar

The ODBC driver could be having an issue with the forward slashes.

The guaranteed method is to create a permanent association by adding the JAR to
hive/lib or hadoop/lib on the HiveServer2 node. Copying it to hive-client/auxlib/
and restarting Hive is an option. Adding the following property to hive-env.sh is
another option:

HIVE_AUX_JARS_PATH=

There may be a trace function for your ODBC driver to see a more detailed error.
Some ODBC drivers may not support the ADD JAR syntax.

cheers,
Andrew

On February 23, 2018 at 3:27 PM Jörn Franke wrote:

> Add jar works only with local files on the Hive server.
>
> On 23. Feb 2018, at 21:08, Andy Srine <andy.sr...@gmail.com> wrote:
>
> Team,
>
> Is ADD JAR from HDFS (ADD JAR hdfs:///hive_jars/hive-contrib-2.1.1.jar;)
> supported in hiveserver2 via an ODBC connection? Some relevant points:
>
> - I am able to do it in Hive 2.1.1 via JDBC (beeline), but not via an ODBC
>   client.
> - In Hive 1.2.1, I can add a jar from the local node, but not a JAR on HDFS.
> - Some old blogs online say HiveServer2 doesn't support "ADD JAR", period.
>   But that's not what I experience via beeline.
>
> Let me know your thoughts and experiences.
>
> Thanks,
> Andy
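
A quick sanity check for the syntax variants listed at the top of this reply,
assuming you can reach the same HiveServer2 from beeline to compare against the
ODBC client (the namenode host and local path below are placeholders, not values
from this thread):

  -- fully qualified HDFS URI (host and port are placeholders)
  ADD JAR hdfs://namenode.example.com:8020/hive_jars/hive-contrib-2.1.1.jar;
  -- local path on the HiveServer2 host
  ADD JAR /opt/hive/auxlib/hive-contrib-2.1.1.jar;
  -- confirm what the session actually registered
  LIST JARS;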
 


Re: Proposal: File based metastore

2018-02-23 Thread Alexander Kolbasov
Would it be useful to have a tool that can save database(s), table(s) and
partition(s) metadata in a file and then import this file into another
metastore? These files could be stored together with the data files or elsewhere.

This would allow for targeted exchange of metadata between multiple HMS
services.

- Alex
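
For comparison, Hive's existing EXPORT/IMPORT statements already do a targeted
exchange at the table level, although they ship the data files along with the
metadata rather than the metadata alone (the table name and paths below are
illustrative):

  -- on the source cluster
  EXPORT TABLE orders TO '/tmp/hms_exchange/orders';
  -- on the destination cluster, after copying the export directory across
  IMPORT TABLE orders FROM '/tmp/hms_exchange/orders';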


Re: ODBC-hiveserver2 question

2018-02-23 Thread Jörn Franke
Add jar works only with local files on the Hive server.

> On 23. Feb 2018, at 21:08, Andy Srine  wrote:
> 
> Team,
> 
> Is ADD JAR from HDFS (ADD JAR hdfs:///hive_jars/hive-contrib-2.1.1.jar;) 
> supported in hiveserver2 via an ODBC connection? 
> 
> Some relevant points:
> I am able to do it in Hive 2.1.1 via JDBC (beeline), but not via an ODBC 
> client.
> In Hive 1.2.1, I can add a jar from the local node, but not a JAR on HDFS.
> Some old blogs online say HiveServer2 doesn't support "ADD JAR", period. But 
> that's not what I experience via beeline.
> Let me know your thoughts and experiences.
> 
> Thanks,
> Andy
> 


ODBC-hiveserver2 question

2018-02-23 Thread Andy Srine
Team,

Is ADD JAR from HDFS (ADD JAR hdfs:///hive_jars/hive-contrib-2.1.1.jar;)
supported in hiveserver2 via an ODBC connection?

Some relevant points:

   - I am able to do it in Hive 2.1.1 via JDBC (beeline), but not via an
   ODBC client.
   - In Hive 1.2.1, I can add a jar from the local node, but not a JAR on
   HDFS.
   - Some old blogs online say HiveServer2 doesn't support "ADD JAR",
   period. But that's not what I experience via beeline.

Let me know your thoughts and experiences.

Thanks,
Andy


Re: Why the filter push down does not reduce the read data record count

2018-02-23 Thread Furcy Pin
And if you come across comprehensive documentation of Parquet configuration,
please share it!!!

The Parquet documentation says that it can be configured but doesn't explain
how: http://parquet.apache.org/documentation/latest/
and apparently both TAJO (http://tajo.apache.org/docs/0.8.0/table_management/parquet.html)
and Drill (https://drill.apache.org/docs/parquet-format/) seem to have some
configuration parameters for Parquet.
If Hive has configuration parameters for Parquet too, I couldn't find them
documented anywhere.
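
For what it's worth, a few knobs that come up for Hive-on-Parquet; treat the
names below as things to verify against your Hive and parquet-mr versions rather
than a definitive list:

  -- planner-side predicate pushdown (on by default in recent Hive releases)
  SET hive.optimize.ppd=true;
  -- let readers use file-format statistics/indexes where the format supports it
  SET hive.optimize.index.filter=true;
  -- Parquet row-group size in bytes (a write-time setting); smaller row groups
  -- give stats-based filtering finer granularity at the cost of more metadata
  SET parquet.block.size=67108864;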



On 23 February 2018 at 16:48, Sun, Keith  wrote:

> I got your point and thanks for the nice slides info.
>
>
> So the parquet filter is not an easy thing and I will try that according
> to the deck.
>
>
> Thanks !
> --
> *From:* Furcy Pin 
> *Sent:* Friday, February 23, 2018 3:37:52 AM
> *To:* user@hive.apache.org
> *Subject:* Re: Why the filter push down does not reduce the read data
> record count
>
> Hi,
>
> Unless your table is partitioned or bucketed by myid, Hive generally has to
> read through all the records to find the ones that match your predicate.
>
> In other words, Hive tables are generally not indexed for single-record
> retrieval the way you would expect RDBMS tables or Vertica tables to be.
> Some file formats like ORC (and maybe Parquet, I'm not sure) allow adding
> bloom filters on specific columns of a table, which could work as a kind of
> index.
> Also, depending on the query engine you are using (Hive, Spark-SQL,
> Impala, Presto...) and its version, it may or may not be able to leverage
> certain storage optimizations.
> For example, Spark still does not support the Hive bucketed table
> optimization, but it might come in the upcoming Spark 2.3.
>
>
> I'm much less familiar with Parquet, so if anyone has links to good
> documentation on Parquet fine tuning (or even better a comparison with ORC
> features) that would be really helpful.
> By googling, I found these slides where someone at Netflix seems to have
> tried the same kind of optimization as you in Parquet.
>
>
>
>
>
> On 23 February 2018 at 12:02, Sun, Keith  wrote:
>
> Hi,
>
>
> Why does Hive still read so many "records" even with filter pushdown
> enabled, when the returned dataset is a very small amount (4k out of
> 30 billion records)?
>
>
> The "RECORDS_IN" counter of Hive still shows the 30 billion count, and the
> map reduce log shows output like this:
>
> org.apache.hadoop.hive.ql.exec.MapOperator: MAP[4]: records read - 10
>
>
> BTW, I am using Parquet as the storage format, and the filter pushdown did
> work, as I see this in the log:
>
>
> AM INFO: parquet.filter2.compat.FilterCompat: Filtering using predicate: 
> eq(myid, 223)
>
>
> Thanks,
>
> Keith
>
>
>
>


Re: Why the filter push down does not reduce the read data record count

2018-02-23 Thread Sun, Keith
I got your point and thanks for the nice slides info.


So the parquet filter is not an easy thing and I will try that according to the 
deck.


Thanks !


From: Furcy Pin 
Sent: Friday, February 23, 2018 3:37:52 AM
To: user@hive.apache.org
Subject: Re: Why the filter push down does not reduce the read data record count

Hi,

Unless your table is partitioned or bucketed by myid, Hive generally has to
read through all the records to find the ones that match your predicate.

In other words, Hive tables are generally not indexed for single-record
retrieval the way you would expect RDBMS tables or Vertica tables to be.
Some file formats like ORC (and maybe Parquet, I'm not sure) allow adding
bloom filters on specific columns of a table, which could work as a kind of
index.
Also, depending on the query engine you are using (Hive, Spark-SQL, Impala,
Presto...) and its version, it may or may not be able to leverage certain
storage optimizations.
For example, Spark still does not support the Hive bucketed table
optimization, but it might come in the upcoming Spark 2.3.


I'm much less familiar with Parquet, so if anyone has links to good
documentation on Parquet fine tuning (or even better a comparison with ORC
features) that would be really helpful.
By googling, I found these slides where someone at Netflix seems to have
tried the same kind of optimization as you in Parquet.





On 23 February 2018 at 12:02, Sun, Keith wrote:

Hi,


Why does Hive still read so many "records" even with filter pushdown enabled,
when the returned dataset is a very small amount (4k out of 30 billion
records)?


The "RECORDS_IN" counter of Hive still shows the 30 billion count, and the
map reduce log shows output like this:

org.apache.hadoop.hive.ql.exec.MapOperator: MAP[4]: records read - 10


BTW, I am using Parquet as the storage format, and the filter pushdown did
work, as I see this in the log:


AM INFO: parquet.filter2.compat.FilterCompat: Filtering using predicate: 
eq(myid, 223)


Thanks,

Keith




Hive Sum Query on Decimal Column Returns Zero When Expected Result Has Too Many Digits

2018-02-23 Thread William Garvie
Hello,

I have an issue where I'm running a sum query on a decimal column and I'm
getting 0 as the result whenever my expected result is around 22 digits or
larger.

I created a stack overflow question here:
https://stackoverflow.com/questions/48836455/hive-sum-query-on-decimal-column-returns-zero-when-expected-result-has-too-many

Is this a known issue, or is it likely an issue with the way I have Hive set up?

Any help would be greatly appreciated.

Thank you,
Evan
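
One thing worth checking, offered as a guess rather than a diagnosis: Hive caps
DECIMAL at 38 digits of total precision, so a SUM over a column declared with a
wide scale (for example DECIMAL(38,18)) can overflow once the total needs more
than about 20 integer digits, which lines up with the ~22-digit threshold
described above. Trading scale for integer digits before aggregating avoids
that; the table and column names below are made up:

  SELECT SUM(CAST(amount AS DECIMAL(38, 6))) AS total
  FROM ledger;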


Re: Why the filter push down does not reduce the read data record count

2018-02-23 Thread Furcy Pin
Hi,

Unless your table is partitioned or bucketed by myid, Hive generally has to
read through all the records to find the ones that match your predicate.

In other words, Hive tables are generally not indexed for single-record
retrieval the way you would expect RDBMS tables or Vertica tables to be.
Some file formats like ORC (and maybe Parquet, I'm not sure) allow adding
bloom filters on specific columns of a table, which could work as a kind of
index.
Also, depending on the query engine you are using (Hive, Spark-SQL, Impala,
Presto...) and its version, it may or may not be able to leverage certain
storage optimizations.
For example, Spark still does not support the Hive bucketed table
optimization, but it might come in the upcoming Spark 2.3.


I'm much less familiar with Parquet, so if anyone has links to good
documentation on Parquet fine tuning (or even better a comparison with ORC
features) that would be really helpful.
By googling, I found these slides where someone at Netflix seems to have
tried the same kind of optimization as you in Parquet.
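
As a rough illustration of the bloom filter idea on the ORC side (table, column
and partition names are made up; the relevant parts are the two TBLPROPERTIES
keys):

  CREATE TABLE events_orc (
    myid    BIGINT,
    payload STRING
  )
  PARTITIONED BY (event_date STRING)
  STORED AS ORC
  TBLPROPERTIES (
    'orc.bloom.filter.columns' = 'myid',  -- build bloom filters for this column
    'orc.bloom.filter.fpp'     = '0.05'   -- target false-positive rate
  );

  -- point lookups on myid can then skip stripes/row groups whose bloom filter
  -- rules the value out
  SELECT * FROM events_orc WHERE myid = 223;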





On 23 February 2018 at 12:02, Sun, Keith  wrote:

> Hi,
>
>
> Why does Hive still read so many "records" even with filter pushdown
> enabled, when the returned dataset is a very small amount (4k out of
> 30 billion records)?
>
>
> The "RECORDS_IN" counter of Hive still shows the 30 billion count, and the
> map reduce log shows output like this:
>
> org.apache.hadoop.hive.ql.exec.MapOperator: MAP[4]: records read - 10
>
>
> BTW, I am using Parquet as the storage format, and the filter pushdown did
> work, as I see this in the log:
>
>
> AM INFO: parquet.filter2.compat.FilterCompat: Filtering using predicate: 
> eq(myid, 223)
>
>
> Thanks,
>
> Keith
>
>
>


Why the filter push down does not reduce the read data record count

2018-02-23 Thread Sun, Keith
Hi,


Why does Hive still read so many "records" even with filter pushdown enabled,
when the returned dataset is a very small amount (4k out of 30 billion
records)?


The "RECORDS_IN" counter of Hive still shows the 30 billion count, and the
map reduce log shows output like this:

org.apache.hadoop.hive.ql.exec.MapOperator: MAP[4]: records read - 10


BTW, I am using Parquet as the storage format, and the filter pushdown did
work, as I see this in the log:


AM INFO: parquet.filter2.compat.FilterCompat: Filtering using predicate: 
eq(myid, 223)


Thanks,

Keith
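
One way to confirm how much of the predicate actually reaches the scan is the
query plan: with pushdown in effect the TableScan operator typically carries a
filterExpr entry in addition to the later Filter Operator's predicate (the table
name below is a placeholder):

  EXPLAIN
  SELECT *
  FROM my_table
  WHERE myid = 223;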