Re: Can hive bear high throughput streaming data ingest?

2020-03-20 Thread Jörn Franke
Why don’t you write them directly on local storage and then write them all to 
HDFS?

Then you can create an external table in Hive on them and do analyses
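A minimal sketch of that pattern (paths, schema and partition layout below are illustrative, not from the thread): batch the log files on the edge node, push whole files to HDFS, then point an external table at the directory.

-- Files are batched locally and pushed in whole-file chunks, e.g.
--   hdfs dfs -put app-2020-03-20.log /data/app_logs/dt=2020-03-20/
CREATE EXTERNAL TABLE app_logs (
  log_line STRING
)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/app_logs';

-- Register each day's directory once its files have landed:
ALTER TABLE app_logs ADD IF NOT EXISTS PARTITION (dt='2020-03-20');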

> On 20.03.2020, at 08:30, "wangl...@geekplus.com.cn" wrote:
> 
> 
> https://cwiki.apache.org/confluence/display/Hive/Streaming+Data+Ingest+V2
> 
> I want to stream my app log to Hive using Flume on the edge app server.
> Since HDFS is not friendly to frequent writes, I am afraid this approach cannot 
> sustain high throughput.
> 
> Any suggestions on this?
> 
> Thanks,
> Lei
> 
> wangl...@geekplus.com.cn 
> 


Re: Less than(<) & Greater than(>) condition failing on string column but min/max is working

2019-08-20 Thread Jörn Franke
You could just move the ints outside the Map.
Alternatively you can convert the string to an int: cast(strcolumn as int).
See: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-TypeConversionFunctions
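A short sketch of that cast against the table from the thread; the map key name 'val' is an assumption, since the real key is not shown:

-- Range filter on the integer-valued key, compared numerically:
SELECT *
FROM anup.test_map
WHERE CAST(merge['val'] AS INT) > 100;

-- min/max on the same key, also as numbers rather than strings:
SELECT MIN(CAST(merge['val'] AS INT)),
       MAX(CAST(merge['val'] AS INT))
FROM anup.test_map;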

> On 12.08.2019, at 21:41, Anup Tiwari wrote:
> 
> Hi All,
> 
> I have a use case where I wanted to store multiple columns in a single map-type 
> column, but while doing so I came across a case where I mainly have two types of 
> column, bigint and string, so I stored them in a map column defined as 
> map<string,string> and then I queried the key which holds integer values in that 
> map column.
> 
> Case 1 :- when I apply the >, < conditional operators, the returned output 
> is incorrect.
> Case 2 :- when I take min/max on the integer-valued key, the output it gives me 
> is correct.
> 
> Now I understand that this is due to the datatype of the map column, which is 
> "string", but just a question: why is min/max working on the string while the 
> <, > conditions are failing?
> 
> Please note that the column which I am querying will always contain int 
> values, so let me know if you have any suggestions. 
> 
> Combining and storing values in a map datatype is necessary because each row can 
> have a different number of metadata fields depending upon the event name.
> 
> Please find table structure below :-
> 
> CREATE TABLE `anup.test_map`(
>   `SID` string, 
>   `eventName` string, 
>   `timestamp` timestamp, 
>   `merge` map<string,string>)
> STORED AS PARQUET;
> 
> Regards,
> Anup Tiwari


Re: Hive external table not working in sparkSQL when subdirectories are present

2019-08-07 Thread Jörn Franke
Do you use the HiveContext in Spark? Do you configure the same options there? 
Can you share some code?
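One hedged sketch of what "the same options" could look like on the Spark side, run e.g. through spark.sql(...) or beeline against the Spark Thrift server; whether they take effect depends on whether Spark reads the table through the Hive code path or its own native ORC reader:

SET hive.mapred.supports.subdirectories=true;
SET mapred.input.dir.recursive=true;
SELECT COUNT(*) FROM ExtTable;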

> On 07.08.2019, at 08:50, Rishikesh Gawade wrote:
> 
> Hi.
> I am using Spark 2.3.2 and Hive 3.1.0. 
> Even if I use Parquet files the result would be the same, because after all 
> Spark SQL isn't able to descend into the subdirectories over which the table 
> is created. Could there be any other way?
> Thanks,
> Rishikesh
> 
>> On Tue, Aug 6, 2019, 1:03 PM Mich Talebzadeh  
>> wrote:
>> Which versions of Spark and Hive are you using?
>> 
>> What will happen if you use Parquet tables instead?
>> 
>> HTH
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> http://talebzadehmich.wordpress.com
>> 
>>  
>> 
>> 
>>> On Tue, 6 Aug 2019 at 07:58, Rishikesh Gawade  
>>> wrote:
>>> Hi.
>>> I have built a Hive external table on top of a directory 'A' which has data 
>>> stored in ORC format. This directory has several subdirectories inside it, 
>>> each of which contains the actual ORC files.
>>> These subdirectories are actually created by spark jobs which ingest data 
>>> from other sources and write it into this directory.
>>> I tried creating a table and setting the table properties of the same as 
>>> hive.mapred.supports.subdirectories=TRUE and 
>>> mapred.input.dir.recursive=TRUE.
>>> As a result of this, when I fire the simplest query, select count(*) from 
>>> ExtTable, via the Hive CLI, it successfully gives me the expected count of 
>>> records in the table.
>>> However, when I fire the same query via Spark SQL, I get count = 0.
>>> 
>>> I think Spark SQL isn't able to descend into the subdirectories for 
>>> getting the data, while Hive is able to do so.
>>> Are there any configurations needed to be set on the spark side so that 
>>> this works as it does via hive cli? 
>>> I am using Spark on YARN.
>>> 
>>> Thanks,
>>> Rishikesh
>>> 
>>> Tags: subdirectories, subdirectory, recursive, recursion, hive external 
>>> table, orc, sparksql, yarn


Re: Converting Hive Column from Varchar to String

2019-07-18 Thread Jörn Franke
You have to create a new table with this column as string and do an insert ... 
select from the old table. 
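A minimal sketch of that insert-select; table names, the other column and the storage format are made up, only the converted column matters:

-- New table declares the column as STRING instead of VARCHAR(n):
CREATE TABLE my_table_new (
  id  INT,
  col STRING
)
STORED AS ORC;

-- Rewriting the data converts the values; the varchar-to-string cast is implicit:
INSERT OVERWRITE TABLE my_table_new
SELECT id, col FROM my_table;

-- Optionally swap the names afterwards:
-- ALTER TABLE my_table RENAME TO my_table_old;
-- ALTER TABLE my_table_new RENAME TO my_table;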

> On 18.07.2019, at 01:14, William Shen wrote:
> 
> Hi all,
> 
> I assumed that it would be compatible to convert a column of type varchar to 
> string; however, after running ALTER TABLE table CHANGE col col STRING, I 
> encounter the following error when querying the column from Hive:
> 
> Failed with exception 
> java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException: 
> java.lang.ClassCastException: 
> org.apache.hadoop.hive.serde2.io.HiveVarcharWritable cannot be cast to 
> org.apache.hadoop.io.Text
> 
> Anyone encountered this before, or know how to work around this?
> 
> Thank you!
> 
> - Will


Re: Out Of Memory Error

2019-01-09 Thread Jörn Franke
Which hive version and engine?

If it is Tez then you can also try MR as the engine (set hive.execution.engine=mr); 
that will use less memory. Also check the max heap space configuration on the 
nodes. Maybe you have 16 GB of physical memory but the Java process only gets 
4 GB or so.

Maybe your query could also be expressed differently, but I am missing background 
information on the use case.

Last but not least: size on disk != size in memory, especially if you go beyond 
simple queries.

Also try with a smaller subset of the data to see when you reach the memory limit.
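A few of the knobs mentioned above, as they would be set per session; the values are examples only, not a sizing recommendation:

SET hive.execution.engine=mr;             -- fall back from Tez to MapReduce
SET mapreduce.map.memory.mb=4096;         -- container size for mappers
SET mapreduce.map.java.opts=-Xmx3276m;    -- JVM heap inside the map container
SET mapreduce.reduce.memory.mb=4096;
SET mapreduce.reduce.java.opts=-Xmx3276m;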

> On 10.01.2019, at 07:57, Sujeet Pardeshi wrote:
> 
> Hi Pals,
> I have the below Hive SQL which is hitting the following error: “at 
> java.lang.Thread.run(Thread.java:745) Caused by: java.lang.OutOfMemoryError: 
> Java heap space at”. It’s basically going out of memory. The table on which 
> the query is being hit has 246,608,473 (246 million) records and its size is 
> around 43 GB. I am running this SQL on a Hadoop cluster which has 4 nodes; 
> every node has 16 GB memory and 128 GB disk space. I can definitely increase 
> the memory and scale out with more nodes, but is there something that I can 
> do to make this query work without having to touch the cluster or the 
> memory?
>  
> create table t1_content_pages_agg_by_month stored as orc
> as
> select * from (
> select A.dt
>,A.year
>,A.month
>,A.bouncer
>,A.visitor_type
>,A.device_type
>,A.pg_domain_name
>,A.pg_page_url
>,A.class1_id
>,A.class2_id
>,A.total_page_view_time
>,row_number() over ( PARTITION BY A.dt,A.year, A.month, 
> A.bouncer,A.visitor_type,A.device_type) as rank
> from content_pages_agg_by_month A
> )AA
> ;
>  
> Regards,
> 
> Sujeet Singh Pardeshi
> 
> Software Specialist
> 
> SAS Research and Development (India) Pvt. Ltd.
> 
> Level 2A and Level 3, Cybercity, Magarpatta, Hadapsar  Pune, Maharashtra, 411 
> 013
> off: +91-20-30418810  
> 
>  "When the solution is simple, God is answering…" 
>  


Re: Dynamic partition pruning

2018-12-19 Thread Jörn Franke
Can you please provide us more details:
Number of rows in each table and per partition, the table structure, hive 
version, table format, is table sorted or partitioned on dt?

Why don’t you use a join, potentially with a mapjoin hint?
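A sketch of the join rewrite (assuming dt is the partition column on both sides); the hint and the auto-convert setting are two ways of getting a map-side join:

SET hive.auto.convert.join=true;

SELECT /*+ MAPJOIN(b) */ a.*
FROM large_table a
JOIN (SELECT DISTINCT dt FROM small_table2) b
  ON a.dt = b.dt
LIMIT 5;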

> On 19.12.2018, at 09:02, Prabhakar Reddy wrote:
> 
> Hello,
> 
> I have a table large_table with more than 50K partitions, and when I run the 
> query below it runs forever. The other table, small_table2, has only five 
> partitions, and whenever I run the query below it seems to be scanning all 
> partitions rather than scanning only the five partitions which are in the 
> smaller table.
> 
> select * from large_table a  where a.dt in (select dt from small_table2) 
> limit 5;
> 
> Could you please confirm if this is the expected behavior or any way we can 
> tune this query to fetch results faster?
> 
> Regards
> Prabhakar Reddy


Re: [feature request] auto-increment field in Hive

2018-09-16 Thread Jörn Franke
He did not say that it is for an acid table... 

Then how the state (i.e. the sequence) is managed within the UDF is another 
story, but there are various options.
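A sketch of the row_number-plus-offset pattern Shawn describes below (all table and column names are made up, and as he notes it is not safe for concurrent writers):

INSERT INTO TABLE dim_customer
SELECT ROW_NUMBER() OVER (ORDER BY s.natural_key) + m.max_id AS customer_key,
       s.natural_key,
       s.customer_name
FROM staging_customer s
CROSS JOIN (SELECT COALESCE(MAX(customer_key), 0) AS max_id
            FROM dim_customer) m;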

> On 15. Sep 2018, at 20:10, Shawn Weeks  wrote:
> 
> It doesn't help if you need concurrent threads writing to a table but we are 
> just using the row_number analytic and a max value subquery to generate 
> sequences on our star schema warehouse. It has worked pretty well so far. To 
> provide true sequence support would require changes on the hive meta database 
> side as well as locking so nothing has been done on it in a long time. A 
> simple UDF isn't capable of providing true unique sequence support.
> 
> Thanks
> Shawn
> 
> -Original Message-
> From: Jörn Franke  
> Sent: Saturday, September 15, 2018 6:09 AM
> To: user@hive.apache.org
> Subject: Re: [feature request] auto-increment field in Hive
> 
> If you really need it then you can write an UDF for it. 
> 
>> On 15. Sep 2018, at 11:54, Nicolas Paris  wrote:
>> 
>> Hi
>> 
>> Hive does not provide auto-increment columns (=sequences). Is there any
>> chance that feature will be provided in the future ?
>> 
>> This is one of the highest limitation in hive data warehousing in
>> replacement of RDBMS right now.
>> 
>> Thanks,
>> 
>> -- 
>> nicolas


Re: [feature request] auto-increment field in Hive

2018-09-15 Thread Jörn Franke
If you really need it then you can write an UDF for it. 

> On 15. Sep 2018, at 11:54, Nicolas Paris  wrote:
> 
> Hi
> 
> Hive does not provide auto-increment columns (=sequences). Is there any
> chance that feature will be provided in the future ?
> 
> This is one of the highest limitation in hive data warehousing in
> replacement of RDBMS right now.
> 
> Thanks,
> 
> -- 
> nicolas


Re: Improve performance of Analyze table compute statistics

2018-08-26 Thread Jörn Franke
You can partition it and only compute statistics for new partitions...
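A sketch of the per-partition variant; the partition column load_date and its value are assumptions, and the table in the thread would first need to be partitioned that way:

-- Gather basic stats only for the partition that was just loaded:
ANALYZE TABLE dept_data_services.lc_credit_2018_08_20_temp
  PARTITION (load_date='2018-08-24')
  COMPUTE STATISTICS;

-- Column-level stats can be limited the same way:
ANALYZE TABLE dept_data_services.lc_credit_2018_08_20_temp
  PARTITION (load_date='2018-08-24')
  COMPUTE STATISTICS FOR COLUMNS;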

> On 26. Aug 2018, at 12:43, Prabhakar Reddy  wrote:
> 
> Hello,
> 
> Are there any properties that I can set to improve the performance of the 
> Analyze table compute statistics statement? My data sits in S3 and I see it's 
> taking one second per file to read the schema of each file from S3.
> 
> 2018-08-24T03:25:57,525 INFO  [2b2c0a06-7da5-4fcd-83a7-4931b8e1b4b1 
> main([])]: orc.ReaderImpl (ReaderImpl.java:rowsOptions(79)) - Reading ORC 
> rows from s3://file_1
> 2018-08-24T03:25:57,526 INFO  [2b2c0a06-7da5-4fcd-83a7-4931b8e1b4b1 
> main([])]: impl.RecordReaderImpl (RecordReaderImpl.java:(187)) - Reader 
> schema not provided -- using file schema 
> 
> 2018-08-24T03:25:58,395 INFO  [2b2c0a06-7da5-4fcd-83a7-4931b8e1b4b1 
> main([])]: orc.ReaderImpl (ReaderImpl.java:rowsOptions(79)) - Reading ORC 
> rows from s3://file_2
> 2018-08-24T03:25:58,395 INFO  [2b2c0a06-7da5-4fcd-83a7-4931b8e1b4b1 
> main([])]: impl.RecordReaderImpl (RecordReaderImpl.java:(187)) - Reader 
> schema not provided -- using file schema
> 
> It takes around 80 seconds for 76 files with total size of 23 GB.
> 
> 
> 2018-08-24T03:27:07,673 INFO  [2b2c0a06-7da5-4fcd-83a7-4931b8e1b4b1 
> main([])]: exec.Task (SessionState.java:printInfo()) - Table 
> dept_data_services.lc_credit_2018_08_20_temp stats: [numFiles=76, 
> numRows=101341845, totalSize=26500166568, rawDataSize=294491898741]
> 2018-08-24T03:27:07,673 INFO  [2b2c0a06-7da5-4fcd-83a7-4931b8e1b4b1 
> main([])]: ql.Driver (Driver.java:execute(2050)) - Completed executing 
> command(queryId=lcapp_20180824032545_aba21e71-ea4b-4214-8793-705a5e0367f0); 
> Time taken: 81.169 seconds
> 2018-08-24T03:27:07,674 INFO  [2b2c0a06-7da5-4fcd-83a7-4931b8e1b4b1 
> main([])]: ql.Driver (SessionState.java:printInfo()) - OK
> 2018-08-24T03:27:07,681 INFO  [2b2c0a06-7da5-4fcd-83a7-4931b8e1b4b1 
> main([])]: CliDriver (SessionState.java:printInfo()) - Time taken: 81.992 
> seconds
> 
> If I run the same command with a few columns then the query runs 60% faster. Is 
> there any property that I can modify to reduce the time taken for this read?
> 
> Regards
> Prabhakar Reddy
> 
> 
>  


Re: Enabling Snappy compression on Parquet

2018-08-22 Thread Jörn Franke
No, Parquet and ORC have internal compression, which should be used rather than 
the external compression that you are referring to.

Internal compression can be decompressed in parallel, which is significantly 
faster. Internally, Parquet supports only snappy, gzip, lzo, brotli (2.4), lz4 
(2.4) and zstd (2.4).
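For reference, a hedged sketch of two ways to end up with Snappy inside Parquet; the table-property form fixes the codec on the table rather than per session (column names below are made up):

-- 1) Session level, before the insert (as in Patrick's mail):
SET parquet.compression=SNAPPY;

-- 2) Or on the table definition itself:
CREATE TABLE events_parquet (
  id  BIGINT,
  msg STRING
)
STORED AS PARQUET
TBLPROPERTIES ('parquet.compression'='SNAPPY');

-- 'parquet-tools meta <file>' should show the codec per column chunk.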

> On 22. Aug 2018, at 07:33, Tanvi Thacker  wrote:
> 
> Hi Patrick,
> 
> What are other formats supported? 
> - As far as I know, you can set any compression with any format (ORC, Text 
> with snappy ,gzip etc). Are you looking for any specific format or 
> compression?
> 
> How can I verify a file is compressed and with what algorithm? 
> -  you may check parquet-tools if they provide any meta information about 
> compression.
> 
> And, on another note, if you already have uncompressed data and you 
> are creating a table with snappy compression, you need to use "CREATE TABLE 
> new_compressed_table AS SELECT * FROM un_compressed_table" in order to 
> actually compress the data.
> 
> Regards,
> Tanvi Thacker
> 
>> On Fri, Aug 10, 2018 at 6:30 AM Patrick Duin  wrote:
>> Hi,
>> 
>> I got some hive tables in Parquet format and I am trying to find out how 
>> best to enable compression.
>> 
>> Done a bit of searching and the information is a bit scattered, but I found I 
>> can use this Hive property to enable compression. It needs to be set before 
>> doing an insert.
>> 
>> set parquet.compression=SNAPPY;
>> 
>> What other formats are supported? 
>> How can I verify a file is compressed and with what algorithm? 
>> 
>> Thanks,
>> Patrick


Re: 回复: Does Hive 3.0 only works with hadoop3.x.y?

2018-07-22 Thread Jörn Franke
Hadoop 3.0 anyway brings some interesting benefits such as reduced storage 
needs (with erasure coding you no longer need to replicate 3 times for 
reliability), so that may be convincing.

> On 22. Jul 2018, at 08:28, 彭鱼宴 <461292...@qq.com> wrote:
> 
> Hi Tanvi,
> 
> Thanks! I will check that and have a talk with my colleagues to consider 
> the upgrade.
> 
> Best,
> Zhefu Peng
> 
> 
> -- Original Message --
> From: "Tanvi Thacker";
> Sent: Saturday, July 21, 2018, 3:24 PM
> To: "user";
> Subject: Re: Does Hive 3.0 only works with hadoop3.x.y?
> 
> I would recommend upgrading to Hadoop 3.0 or 3.1 because of the following 
> reasons:-
> 
> It may be possible that Hadoop 2.x transitively brings in some dependencies 
> which may conflict with libraries used by Hive (like the unpredictable Google 
> Guava library etc.), which will affect your runtime environment.
> Hive might be utilizing some of the new public APIs which are exposed in the 3.x 
> line of Hadoop, so with Hadoop 2.x you may see some 
> ClassNotFound/NoSuchMethod errors at runtime if your query hits such a code 
> path.
> In production, you must use the same dependencies with which Hive is 
> compiled and tested.
> https://github.com/apache/hive/blob/rel/release-3.0.0/pom.xml#L149
> 
> Thanks,
> Tanvi Thacker
> 
> 
>> On Thu, Jul 19, 2018 at 8:15 PM, Sungwoo Park  wrote:
>> I would say yes (because I am actually running Hive 3.0 on Hadoop 2.7.6 and 
>> HDP 2.7.5), provided that you make small changes to the source code to Hive 
>> 3.0. However, I have not tested Hive 3.0 on Spark.
>> 
>> --- Sungwoo 
>> 
>>> On Thu, Jul 19, 2018 at 10:34 PM, 彭鱼宴 <461292...@qq.com> wrote:
>>> Hi Sungwoo,
>>> 
>>> Just want to confirm, does that mean I just need to update the hive 
>>> version, without updating the hadoop version?
>>> 
>>> Thanks!
>>> 
>>> Best,
>>> Zhefu Peng
>>> 
>>> 
>>> -- Original Message --
>>> From: "Sungwoo Park";
>>> Sent: Thursday, July 19, 2018, 8:20 PM
>>> To: "user";
>>> Subject: Re: Does Hive 3.0 only works with hadoop3.x.y?
>>> 
>>> Hive 3.0 makes a few function calls that depend on Hadoop 3.x, but they are 
>>> easy to replace with code that compiles okay on Hadoop 2.8+. I am currently 
>>> running Hive 3.0 on Hadoop 2.7.6 and HDP 2.6.4 to test with the TPC-DS 
>>> benchmark, and have not encountered any compatibility issue yet. I 
>>> previously posted a diff file that lets us compile Hive 3.0 on Hadoop 
>>> 2.8+.
>>> 
>>> http://mail-archives.apache.org/mod_mbox/hive-user/201806.mbox/%3CCAKHFPXDDFn52buKetHzSXTtjzX3UMHf%3DQvxm9QNNkv9r5xBs-Q%40mail.gmail.com%3E
>>>  
>>> 
>>> --- Sungwoo Park
>>> 
>>> 
>>>> On Thu, Jul 19, 2018 at 8:21 PM, 彭鱼宴 <461292...@qq.com> wrote:
>>>> Hi,
>>>> 
>>>> I already deployed Hive 2.2.0 on our Hadoop cluster. And recently, we 
>>>> deployed a Spark 2.3.0 cluster, aiming at using the Hive-on-Spark engine 
>>>> feature. However, when I checked the Hive release website, I found the 
>>>> text below:
>>>> 21 May 2018 : release 3.0.0 available
>>>> This release works with Hadoop 3.x.y.
>>>> 
>>>> Now the Hadoop version we deployed is Hadoop 2.7.6. I wonder, does Hive 
>>>> 3.0 only work with Hadoop 3.x.y? Or, if we want to use Hive 3.0, do we 
>>>> have to update the Hadoop version to 3.x.y?
>>>> 
>>>> Looking forward to your reply and help.
>>>> 
>>>> Best,
>>>> 
>>>> Zhefu Peng
>>>> 
>>> 
>> 
> 


Re: Error Starting hive thrift server, hive 3 on Hadoop 3.1

2018-07-03 Thread Jörn Franke
It would be good if you can document this on the Hive wiki so that other users 
know it.

On the other hand there is Apache Bigtop which tests integration of various Big 
Data components  - but it is complicated. Behind a big data distribution there 
is a lot of effort.

> On 3. Jul 2018, at 23:08, Mich Talebzadeh  wrote:
> 
> Resolved this by getting rid of HADOOP_CLASSPATH that I had to add to make 
> Hbase 2 work with Hadoop 3.1. It did not help and I had to revert back to 
> Hbase 1.2.6. But left that CLASSPATH in ENV file.
> 
> This is becoming a retrofitting exercise where making one artefact work with 
> Hadoop impacts other artefacts and results in an unnecessary waste of time.
> 
> HTH
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
> 
>  
> 
> 
>> On Tue, 3 Jul 2018 at 17:48, Mich Talebzadeh  
>> wrote:
>> This is hive 3 on Hadoop 3.1
>> 
>> I am getting this error in a loop
>> 
>> 2018-07-03 17:43:44,929 INFO  [main] SessionState: Hive Session ID = 
>> 5f38c8a3-f269-42e0-99d8-9ddff676f009
>> 2018-07-03 17:43:44,929 INFO  [main] server.HiveServer2: Shutting down 
>> HiveServer2
>> 2018-07-03 17:43:44,929 INFO  [main] server.HiveServer2: 
>> Stopping/Disconnecting tez sessions.
>> 2018-07-03 17:43:44,930 WARN  [main] server.HiveServer2: Error starting 
>> HiveServer2 on attempt 5, will retry in 6ms
>> java.lang.NoSuchMethodError: 
>> org.apache.hadoop.tracing.TraceUtils.wrapHadoopConf(Ljava/lang/String;Lorg/apache/hadoop/conf/Configuration;)Lorg/apache/htrace/HTraceConfiguration;
>> 
>> Any ideas?
>> 
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> http://talebzadehmich.wordpress.com
>> 
>>  


Re: Which version of Hive can handle creating XML table?

2018-06-09 Thread Jörn Franke
Well, you are always free to create a SerDe on top that works with AbstractSerDe 
in any version. I don’t think it will be difficult since the input format is 
already there.
I don’t know exactly when the interface SerDe was removed.

> On 9. Jun 2018, at 09:09, Mich Talebzadeh  wrote:
> 
> Thanks Jörn. The only alternative is to use the xpath UDFs? Works as shown below, 
> but it is tedious. 
> 
> Like the example below
> 
> $cat employees.xml
> <employee>
> <id>1</id>
> <name>Satish Kumar</name>
> <designation>Technical Lead</designation>
> </employee>
> <employee>
> <id>2</id>
> <name>Ramya</name>
> <designation>Testing</designation>
> </employee>
> 
> Step:1 Bring each record to one line, by executing below command
> 
> $cat employees.xml | tr -d '&' | tr '\n' ' ' | tr '\r' ' ' | sed 
> 's|</employee>|</employee>\n|g' | grep -v '^\s*$' > employees_records.xml
> 
> $cat employees_records.xml
> <employee> <id>1</id> <name>Satish Kumar</name> <designation>Technical 
> Lead</designation> </employee>
> <employee> <id>2</id> <name>Ramya</name> <designation>Testing</designation> </employee>
> 
> 
> Step:2 Load the file to HDFS
> 
> $hadoop fs -mkdir /user/hive/sample-xml-inputs
> 
> $hadoop fs -put employees_records.xml /user/hive/sample-xml-inputs
> 
> $hadoop fs -cat /user/hive/sample-xml-inputs/employees_records.xml
> <employee> <id>1</id> <name>Satish Kumar</name> <designation>Technical 
> Lead</designation> </employee>
> <employee> <id>2</id> <name>Ramya</name> <designation>Testing</designation> </employee>
> 
> 
> Step:3 Create a Hive table and point to xml file
> 
> hive>create external table xml_table_org( xmldata string) LOCATION 
> '/user/hive/sample-xml-inputs/';
> 
> hive> select * from xml_table_org;
> OK
> <employee> <id>1</id> <name>Satish Kumar</name> <designation>Technical 
> Lead</designation> </employee>
> <employee> <id>2</id> <name>Ramya</name> <designation>Testing</designation> </employee>
> 
> 
> Step 4: From the staging table we can query the elements and load them into 
> another table.
> 
> hive> CREATE TABLE xml_table AS SELECT 
> xpath_int(xmldata,'employee/id'),xpath_string(xmldata,'employee/name'),xpath_string(xmldata,'employee/designation')
>  FROM xml_table_org;
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
> 
>  
> 
>> On 9 June 2018 at 07:42, Jörn Franke  wrote:
>> Yes.
>> 
>> Serde must have been removed then in 2.x.
>> 
>> 
>> 
>>> On 8. Jun 2018, at 23:52, Mich Talebzadeh  wrote:
>>> 
>>> Ok I am looking at this jar file
>>> 
>>>  jar tf hive-serde-3.0.0.jar|grep -i abstractserde
>>> org/apache/hadoop/hive/serde2/AbstractSerDe.class
>>> 
>>> Is this the correct one?
>>> 
>>> Thanks
>>> 
>>> 
>>> Dr Mich Talebzadeh
>>>  
>>> LinkedIn  
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>  
>>> http://talebzadehmich.wordpress.com
>>> 
>>>  
>>> 
>>>> On 8 June 2018 at 22:34, Mich Talebzadeh  wrote:
>>>> Thanks Jorn so what is the resolution? do I need another jar file?
>>>> 
>>>> Dr Mich Talebzadeh
>>>>  
>>>> LinkedIn  
>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>  
>>>> http://talebzadehmich.wordpress.com
>>>> 
>>>>  
>>>> 
>>>>> On 8 June 2018 at 21:56, Jörn Franke  wrote:
>>>>> Ah, I see now that SerDe is a deprecated interface; if I am not wrong it has 
>>>>> been replaced by the abstract class AbstractSerDe. 
>>>>> 
>>>>>> On 8. Jun 2018, at 22:22, Mich Talebzadeh  
>>>>>> wrote:
>>>>>> 
>>>>>> Thanks Jorn.
>>>>>> 
>>>>>> Spark 2.3.3 (labelled as

Re: Which version of Hive can handle creating XML table?

2018-06-09 Thread Jörn Franke
Yes.

Serde must have been removed then in 2.x.



> On 8. Jun 2018, at 23:52, Mich Talebzadeh  wrote:
> 
> Ok I am looking at this jar file
> 
>  jar tf hive-serde-3.0.0.jar|grep -i abstractserde
> org/apache/hadoop/hive/serde2/AbstractSerDe.class
> 
> Is this the correct one?
> 
> Thanks
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
> 
>  
> 
>> On 8 June 2018 at 22:34, Mich Talebzadeh  wrote:
>> Thanks Jorn so what is the resolution? do I need another jar file?
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> http://talebzadehmich.wordpress.com
>> 
>>  
>> 
>>> On 8 June 2018 at 21:56, Jörn Franke  wrote:
>>> Ah, I see now that SerDe is a deprecated interface; if I am not wrong it has 
>>> been replaced by the abstract class AbstractSerDe. 
>>> 
>>>> On 8. Jun 2018, at 22:22, Mich Talebzadeh  
>>>> wrote:
>>>> 
>>>> Thanks Jorn.
>>>> 
>>>> Spark 2.3.3 (labelled as stable)
>>>> 
>>>> First I put the jar file hivexmlserde-1.0.5.3.jar under $HIVE_HOME/lib and 
>>>> explicitly loaded with ADD JAR as well in hive session
>>>> 
>>>> hive> ADD JAR hdfs://rhes75:9000/jars/hivexmlserde-1.0.5.3.jar;
>>>> Added 
>>>> [/tmp/hive/7feb5165-780b-4ab6-aca8-f516d0388823_resources/hivexmlserde-1.0.5.3.jar]
>>>>  to class path
>>>> Added resources: [hdfs://rhes75:9000/jars/hivexmlserde-1.0.5.3.jar]
>>>> 
>>>> Then I ran a simple code given here
>>>> 
>>>> hive> CREATE TABLE xml_41 (imap map<string,string>)
>>>> > ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
>>>> > WITH SERDEPROPERTIES (
>>>> > "column.xpath.imap"="/file-format/data-set/element",
>>>> > "xml.map.specification.element"="@name->#content"
>>>> > )
>>>> > STORED AS
>>>> > INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
>>>> > OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
>>>> > TBLPROPERTIES (
>>>> > "xmlinput.start"="",
>>>> > "xmlinput.end"=""
>>>> > );
>>>> FAILED: Execution Error, return code 1 from 
>>>> org.apache.hadoop.hive.ql.exec.DDLTask. org/apache/hadoop/hive/serde2/SerDe
>>>> 
>>>> And this is full error
>>>> 
>>>> 2018-06-08T21:17:20,775  INFO [7feb5165-780b-4ab6-aca8-f516d0388823 main] 
>>>> ql.Driver: Starting task [Stage-0:DDL] in serial mode
>>>> 2018-06-08T21:17:20,776 ERROR [7feb5165-780b-4ab6-aca8-f516d0388823 main] 
>>>> exec.DDLTask: java.lang.NoClassDefFoundError: 
>>>> org/apache/hadoop/hive/serde2/SerDe
>>>> at java.lang.ClassLoader.defineClass1(Native Method)
>>>> at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
>>>> at 
>>>> java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
>>>> at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
>>>> at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
>>>> at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
>>>> at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
>>>> at java.security.AccessController.doPrivileged(Native Method)
>>>> at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
>>>>   

Re: Which version of Hive can handle creating XML table?

2018-06-08 Thread Jörn Franke
p.util.RunJar.run(RunJar.java:221)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
> Caused by: java.lang.ClassNotFoundException: 
> org.apache.hadoop.hive.serde2.SerDe
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> ... 40 more
> 
> The jar file has the classes!
> 
> jar tf hivexmlserde-1.0.5.3.jar
> META-INF/
> META-INF/MANIFEST.MF
> com/
> com/ibm/
> com/ibm/spss/
> com/ibm/spss/hive/
> com/ibm/spss/hive/serde2/
> com/ibm/spss/hive/serde2/xml/
> com/ibm/spss/hive/serde2/xml/objectinspector/
> com/ibm/spss/hive/serde2/xml/processor/
> com/ibm/spss/hive/serde2/xml/processor/java/
> com/ibm/spss/hive/serde2/xml/HiveXmlRecordReader.class
> com/ibm/spss/hive/serde2/xml/objectinspector/XmlListObjectInspector.class
> com/ibm/spss/hive/serde2/xml/objectinspector/XmlMapObjectInspector.class
> com/ibm/spss/hive/serde2/xml/objectinspector/XmlObjectInspectorFactory$1.class
> com/ibm/spss/hive/serde2/xml/objectinspector/XmlObjectInspectorFactory.class
> com/ibm/spss/hive/serde2/xml/objectinspector/XmlStructObjectInspector$1.class
> com/ibm/spss/hive/serde2/xml/objectinspector/XmlStructObjectInspector.class
> com/ibm/spss/hive/serde2/xml/processor/AbstractXmlProcessor$1.class
> com/ibm/spss/hive/serde2/xml/processor/AbstractXmlProcessor$2.class
> com/ibm/spss/hive/serde2/xml/processor/AbstractXmlProcessor.class
> com/ibm/spss/hive/serde2/xml/processor/java/JavaXmlProcessor$1.class
> com/ibm/spss/hive/serde2/xml/processor/java/JavaXmlProcessor$2.class
> com/ibm/spss/hive/serde2/xml/processor/java/JavaXmlProcessor.class
> com/ibm/spss/hive/serde2/xml/processor/java/JavaXmlQuery.class
> com/ibm/spss/hive/serde2/xml/processor/java/NodeArray.class
> com/ibm/spss/hive/serde2/xml/processor/SerDeArray.class
> com/ibm/spss/hive/serde2/xml/processor/XmlMapEntry.class
> com/ibm/spss/hive/serde2/xml/processor/XmlMapFacet$Type.class
> com/ibm/spss/hive/serde2/xml/processor/XmlMapFacet.class
> com/ibm/spss/hive/serde2/xml/processor/XmlNode$1.class
> com/ibm/spss/hive/serde2/xml/processor/XmlNode$2.class
> com/ibm/spss/hive/serde2/xml/processor/XmlNode.class
> com/ibm/spss/hive/serde2/xml/processor/XmlNodeArray.class
> com/ibm/spss/hive/serde2/xml/processor/XmlProcessor.class
> com/ibm/spss/hive/serde2/xml/processor/XmlProcessorContext.class
> com/ibm/spss/hive/serde2/xml/processor/XmlQuery.class
> com/ibm/spss/hive/serde2/xml/processor/XmlTransformer.class
> com/ibm/spss/hive/serde2/xml/processor/XmlUtils$1.class
> com/ibm/spss/hive/serde2/xml/processor/XmlUtils.class
> com/ibm/spss/hive/serde2/xml/SplittableXmlInputFormat.class
> com/ibm/spss/hive/serde2/xml/XmlInputFormat$XmlRecordReader.class
> com/ibm/spss/hive/serde2/xml/XmlInputFormat.class
> com/ibm/spss/hive/serde2/xml/XmlSerDe$1.class
> com/ibm/spss/hive/serde2/xml/XmlSerDe.class
> META-INF/maven/
> META-INF/maven/com.ibm.spss.hive.serde2.xml/
> META-INF/maven/com.ibm.spss.hive.serde2.xml/hivexmlserde/
> META-INF/maven/com.ibm.spss.hive.serde2.xml/hivexmlserde/pom.xml
> META-INF/maven/com.ibm.spss.hive.serde2.xml/hivexmlserde/pom.properties
> 
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
> 
>  
> 
>> On 8 June 2018 at 17:58, Jörn Franke  wrote:
>> Can you get the log files and start Hive with more detailed logs?
>> It could be that not all libraries are loaded (I don’t remember anymore, but 
>> I think this one needs more; I can look next week in my docs) or that it 
>> does not support maps (not sure). 
>> You can try first with a simpler extraction with a String field to see 
>> if it works.
>> 
>> Hive has always had external libraries for xml support and I used the one 
>> below with Hive 1.x, but it should also work with 2.x (3 not sure, but it 
>> should if it works in 2.x)
>> 
>> 
>>> On 8. Jun 2018, at 17:53, Mich Talebzadeh  wrote:
>>> 
>>> I tried Hive 2.0.1, 2.3.2 and now Hive 3/
>>> 
>>> I explicitly added hivexmlserde  jar file as ADD JAR shown below
>>> 
>>&

Re: Which version of Hive can handle creating XML table?

2018-06-08 Thread Jörn Franke
Can you get the log files and start Hive with more detailed logs?
It could be that not all libraries are loaded (I don’t remember anymore, but I 
think this one needs more; I can look next week in my docs) or that it does 
not support maps (not sure). 
You can try first with a simpler extraction with a String field to see if 
it works.

Hive has always had external libraries for xml support and I used the one below 
with Hive 1.x, but it should also work with 2.x (3 not sure, but it should if 
it works in 2.x)


> On 8. Jun 2018, at 17:53, Mich Talebzadeh  wrote:
> 
> I tried Hive 2.0.1, 2.3.2 and now Hive 3/
> 
> I explicitly added hivexmlserde  jar file as ADD JAR shown below
> 
> 0: jdbc:hive2://rhes75:10099/default> ADD JAR 
> hdfs://rhes75:9000/jars/hivexmlserde-1.0.5.3.jar;
> No rows affected (0.002 seconds)
> 
> But still cannot create an xml table
> 
> 0: jdbc:hive2://rhes75:10099/default> CREATE  TABLE xml_41 (imap 
> map<string,string>) ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe' 
> WITH SERDEPROPERTIES 
> ("column.xpath.imap"="/file-format/data-set/element","xml.map.specification.element"="@name->#content")
>   STORED AS INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat' 
> OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat' 
> TBLPROPERTIES 
> ("xmlinput.start"="","xmlinput.end"="");
> 
> Error: Error while processing statement: FAILED: Execution Error, return code 
> 1 from org.apache.hadoop.hive.ql.exec.DDLTask. 
> org/apache/hadoop/hive/serde2/SerDe (state=08S01,code=1)
> 
> Does anyone know the cause of this or which version of Hive supports creating 
> an XML table?
> 
> Thanks
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
> 
>  


Re: What does the ORC SERDE do

2018-05-13 Thread Jörn Franke
You have in AbstractSerDe a method to return very basic stats related to your 
file format (mostly size of the data and number of rows etc.):

https://github.com/apache/hive/blob/master/serde/src/java/org/apache/hadoop/hive/serde2/SerDeStats.java

 In method initialize of your Serde you can retrieve properties related to 
partitions and include this information in your file format, if needed (you 
don’t need to create folders etc for partitions - this is done by Hive)


> On 13. May 2018, at 19:09, Elliot West <tea...@gmail.com> wrote:
> 
> Hi Jörn,
> 
> I’m curious to know how the SerDe framework provides the means to deal with 
> partitions, table properties, and statistics? I was under the impression that 
> these were in the domain of the metastore and I’ve not found anything in the 
> SerDe interface related to these. I would appreciate if you could point me in 
> the direction of anything I’ve missed.
> 
> Thanks,
> 
> Elliot.
> 
>> On Sun, 13 May 2018 at 15:42, Jörn Franke <jornfra...@gmail.com> wrote:
>> In detail you can check the source code, but a Serde needs to translate an 
>> object to a Hive object and vice versa. Usually this is very simple (simply 
>> passing the object or create A HiveDecimal etc). It also provides an 
>> ObjectInspector that basically describes an object in more detail (eg to be 
>> processed by an UDF). For example, it can tell you precision and scale of an 
>> objects. In case of ORC it describes also how a bunch of objects 
>> (vectorized) can be mapped to hive objects and the other way around. 
>> Furthermore, it provides statistics and provides means to deal with 
>> partitions as well as table properties (!=input/outputformat properties).
>> Although it sounds complex, hive provides most of the functionality so 
>> implementing a serde is most of the times easy.
>> 
>> > On 13. May 2018, at 16:34, 侯宗田 <zongtian...@icloud.com> wrote:
>> > 
>> > Hello,everyone
>> >   I know the json serde turn fields in a row to a json format, csv serde 
>> > turn it to csv format with their serdeproperties. But I wonder what the 
>> > orc serde does when I choose to stored as orc file format. And why is 
>> > there still escaper, separator in orc serdeproperties. Also with RC 
>> > Parquet. I think they are just about how to stored and compressed with 
>> > their input and output format respectively, but I don’t know what their 
>> > serde does, can anyone give some hint?  


Re: What does the ORC SERDE do

2018-05-13 Thread Jörn Franke
Yes, this was what I did when writing the Hive part of the HadoopOffice / 
HadoopCryptoledger library. Be aware that ORC also uses some internal Hive 
APIs / extends the existing ones (e.g. VectorizedSerde).

I don’t have access to the Hive Wiki otherwise I could update it a little bit.

> On 13. May 2018, at 17:08, 侯宗田 <zongtian...@icloud.com> wrote:
> 
> Thank you, it makes the concept clearer to me. I think I need to look up the 
> source code for some details.
>> On 13 May 2018, at 10:42 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>> 
>> In detail you can check the source code, but a Serde needs to translate an 
>> object to a Hive object and vice versa. Usually this is very simple (simply 
>> passing the object or create A HiveDecimal etc). It also provides an 
>> ObjectInspector that basically describes an object in more detail (eg to be 
>> processed by an UDF). For example, it can tell you precision and scale of an 
>> objects. In case of ORC it describes also how a bunch of objects 
>> (vectorized) can be mapped to hive objects and the other way around. 
>> Furthermore, it provides statistics and provides means to deal with 
>> partitions as well as table properties (!=input/outputformat properties).
>> Although it sounds complex, hive provides most of the functionality so 
>> implementing a serde is most of the times easy.
>> 
>>> On 13. May 2018, at 16:34, 侯宗田 <zongtian...@icloud.com> wrote:
>>> 
>>> Hello,everyone
>>> I know the json serde turn fields in a row to a json format, csv serde turn 
>>> it to csv format with their serdeproperties. But I wonder what the orc 
>>> serde does when I choose to stored as orc file format. And why is there 
>>> still escaper, separator in orc serdeproperties. Also with RC Parquet. I 
>>> think they are just about how to stored and compressed with their input and 
>>> output format respectively, but I don’t know what their serde does, can 
>>> anyone give some hint?  
> 


Re: What does the ORC SERDE do

2018-05-13 Thread Jörn Franke
In detail you can check the source code, but a Serde needs to translate an 
object to a Hive object and vice versa. Usually this is very simple (simply 
passing the object or create A HiveDecimal etc). It also provides an 
ObjectInspector that basically describes an object in more detail (eg to be 
processed by an UDF). For example, it can tell you precision and scale of an 
objects. In case of ORC it describes also how a bunch of objects (vectorized) 
can be mapped to hive objects and the other way around. Furthermore, it 
provides statistics and provides means to deal with partitions as well as table 
properties (!=input/outputformat properties).
Although it sounds complex, Hive provides most of the functionality, so 
implementing a SerDe is easy most of the time.

> On 13. May 2018, at 16:34, 侯宗田  wrote:
> 
> Hello,everyone
>   I know the json serde turn fields in a row to a json format, csv serde turn 
> it to csv format with their serdeproperties. But I wonder what the orc serde 
> does when I choose to stored as orc file format. And why is there still 
> escaper, separator in orc serdeproperties. Also with RC Parquet. I think they 
> are just about how to stored and compressed with their input and output 
> format respectively, but I don’t know what their serde does, can anyone give 
> some hint?  


Re: Need to read JSON File

2018-04-22 Thread Jörn Franke
Check the Json serde:

https://cwiki.apache.org/confluence/display/Hive/SerDe
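Two hedged sketches of what that can look like; table and column names are made up, and the SerDe shown is the one shipped in hive-hcatalog-core (the openx SerDe mentioned below works the same way with its own class name):

-- 1) A table whose rows are parsed by a JSON SerDe (one JSON object per line;
--    requires the hive-hcatalog-core jar on the classpath):
CREATE EXTERNAL TABLE raw_events (
  id    STRING,
  event STRING,
  ts    STRING
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION '/data/raw_events';

-- 2) Or keep each line as a plain string and extract fields ad hoc
--    (assumes a table raw_events_text with a single 'line' STRING column):
SELECT get_json_object(line, '$.id')    AS id,
       get_json_object(line, '$.event') AS event
FROM raw_events_text;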

> On 22. Apr 2018, at 09:09, Mahender Sarangam  
> wrote:
> 
> Hi,
> 
> We have to read Gz-compressed JSON files from a source system. I see there are 3 
> different ways of reading JSON data. Our data doesn't have nesting at all. 
> One option we see is get_json_object and another is get_json_tuple. 
> I heard they are time-consuming, whereas the third is 
> org.openx.data.jsonserde, but I could not find where I can download this JAR 
> file. I heard this (org.openx.data.jsonserde) serde is fast. Can anyone provide 
> a location where I can download this SerDe? Or are there any disadvantages of 
> using org.openx.data.jsonserde?
> 
> Also, I see in my Hive version 1.2 there is a library called Json-20090211.jar. 
> Is this for JSON parsing?
> 
> 
> 
> 


Re: Business Rules Engine for Hive

2018-04-16 Thread Jörn Franke
The question is what do your rules do? Do you need to maintain a factbase or do 
they just check data quality within certain tables?

> On 16. Apr 2018, at 22:28, Joel D <games2013@gmail.com> wrote:
> 
> Ok. 
> 
> Rough ideas:
> To keep the business logic outside the code, I was thinking of providing a custom UI.
> 
> Then read the rules from the UI data and build UDFs using the rules defined outside the UDF.
> 
> 1 UDF per data object.
> 
> Not sure; these are just thoughts. 
> 
>> On Mon, Apr 16, 2018 at 1:40 PM Jörn Franke <jornfra...@gmail.com> wrote:
>> I would not use Drools with Spark, it does not scale to the distributed 
>> setting.
>> 
>> You could translate the rules to hive queries but this would not be exactly 
>> the same thing.
>> 
>> > On 16. Apr 2018, at 17:59, Joel D <games2013@gmail.com> wrote:
>> > 
>> > Hi,
>> > 
>> > Any suggestions on how to implement Business Rules Engine with Hive ETLs?
>> > 
>> > For spark based Etl jobs, I was exploring Drools but not sure about Hive.
>> > 
>> > Thanks. 


Re: Business Rules Engine for Hive

2018-04-16 Thread Jörn Franke
I would not use Drools with Spark, it does not scale to the distributed setting.

You could translate the rules to hive queries but this would not be exactly the 
same thing.

> On 16. Apr 2018, at 17:59, Joel D  wrote:
> 
> Hi,
> 
> Any suggestions on how to implement Business Rules Engine with Hive ETLs?
> 
> For spark based Etl jobs, I was exploring Drools but not sure about Hive.
> 
> Thanks. 


Re: ODBC-hiveserver2 question

2018-02-24 Thread Jörn Franke
HDFS support depends on the version. For a long time it was not supported.

> On 23. Feb 2018, at 21:08, Andy Srine  wrote:
> 
> Team,
> 
> Is ADD JAR from HDFS (ADD JAR hdfs:///hive_jars/hive-contrib-2.1.1.jar;) 
> supported in hiveserver2 via an ODBC connection? 
> 
> Some relevant points:
> I am able to do it in Hive 2.1.1 via JDBC (beeline), but not via an ODBC 
> client.
> In Hive 1.2.1, I can add a jar from the local node, but not a JAR on HDFS.
> Some old blogs online say HiveServer2 doesn't support "ADD JAR" period. But 
> that's not what I experience via beeline.
> Let me know your thoughts and experiences.
> 
> Thanks,
> Andy
> 


Re: ODBC-hiveserver2 question

2018-02-23 Thread Jörn Franke
Add jar works only with local files on the Hive server.

> On 23. Feb 2018, at 21:08, Andy Srine  wrote:
> 
> Team,
> 
> Is ADD JAR from HDFS (ADD JAR hdfs:///hive_jars/hive-contrib-2.1.1.jar;) 
> supported in hiveserver2 via an ODBC connection? 
> 
> Some relevant points:
> I am able to do it in Hive 2.1.1 via JDBC (beeline), but not via an ODBC 
> client.
> In Hive 1.2.1, I can add a jar from the local node, but not a JAR on HDFS.
> Some old blogs online say HiveServer2 doesn't support "ADD JAR" period. But 
> that's not what I experience via beeline.
> Let me know your thoughts and experiences.
> 
> Thanks,
> Andy
> 


Re: Question on accessing LLAP as data cache from external containers

2018-01-29 Thread Jörn Franke
Are you looking for sth like this:
https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-hdfs/CentralizedCacheManagement.html

To answer your original question: why not implement the whole job in Hive? Or 
orchestrate it using Oozie, with some parts in MR and some in Hive.

> On 30. Jan 2018, at 05:15, Sungwoo Park  wrote:
> 
> Hello all,
> 
> I wonder if an external YARN container can send requests to LLAP daemon to 
> read data from its in-memory cache. For example, YARN containers owned by a 
> typical MapReduce job (e.g., TeraSort) could fetch data directly from LLAP 
> instead of contacting HDFS. In this scenario, LLAP daemon just serves IO 
> requests from YARN containers and does not run its executors to perform 
> non-trivial computation. 
> 
> If this is feasible, LLAP daemon can be shared by all services running in the 
> cluster. Any comment would be appreciated. Thanks a lot.
> 
> -- Gla Park
> 


Re: HIVE Parquet column names issue

2018-01-26 Thread Jörn Franke
Drop the old parquet table beforehand and then create it with explicit statements. 
The statement above keeps using the old parquet table if it existed.
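A sketch following that advice, reusing the names from the script quoted below but trimmed to two columns for brevity; note that dropping an EXTERNAL table does not remove files already sitting at the LOCATION:

DROP TABLE IF EXISTS EVENTS_PARQUET;

CREATE EXTERNAL TABLE EVENTS_PARQUET (
  `release`  STRING,
  `customer` STRING
)
STORED AS PARQUET
LOCATION '${OUTPUT}';

INSERT OVERWRITE TABLE EVENTS_PARQUET
SELECT `release`, `customer` FROM EVENTS;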

> On 26. Jan 2018, at 17:35, Brandon Cooke  wrote:
> 
> Hi Prasad,
> 
> I actually have tried this and I had that same result. 
> Although I am certainly willing to try again.
> 
> Sincerely,
> 
> Brandon Cooke
> 
>> On Jan 26, 2018, at 11:29 AM, Prasad Nagaraj Subramanya 
>>  wrote:
>> 
>> Hi Brandon,
>> 
>> Have you tried creating an external table with the required names for 
>> parquet -
>> 
>> CREATE EXTERNAL TABLE IF NOT EXISTS EVENTS_PARQUET(
>> `release` STRING,
>> `customer` STRING,
>> `cookie` STRING,
>> `category` STRING,
>> `end_time` STRING,
>> `start_time` STRING,
>> `first_name` STRING,
>> `email` STRING,
>> `phone` STRING,
>> `last_name` STRING,
>> `site` STRING,
>> `source` STRING,
>> `subject` STRING,
>> `raw` STRING
>> )
>> STORED AS PARQUET
>> LOCATION '${OUTPUT}';
>> 
>> And then inserting data into this table from your csv table -
>> 
>> INSERT OVERWRITE TABLE EVENTS_PARQUET SELECT * FROM EVENTS;
>> 
>> This will create a parquet file at the specified location (${OUTPUT})
>> 
>> Thanks,
>> Prasad
>> 
>>> On Fri, Jan 26, 2018 at 7:45 AM, Brandon Cooke  
>>> wrote:
>>> Hello,
>>> 
>>> I posted the following on a Cloudera forum but haven’t had much luck.
>>> I’m hoping someone here can tell me what step I have probably missed:
>>> 
>>> Hello,
>>>  
>>> I'm using HIVE (v1.2.1) to convert our data files from CSV into Parquet for 
>>> use in AWS Athena.
>>> However, no mater what I try the resulting Parquet always has columns 
>>> titles [_col0, _col1, ..., _colN]
>>>  
>>> After researching, I read that the line SET 
>>> parquet.column.index.access=false was supposed to allow for Parquet to use 
>>> the column titles of my HIVE table; however, it has been unsuccessful so 
>>> far.
>>>  
>>> Below is an example script I use to create the Parquet from data
>>>  
>>> SET parquet.column.index.access=false;
>>> 
>>> CREATE EXTERNAL TABLE IF NOT EXISTS EVENTS(
>>> `release` STRING,
>>> `customer` STRING,
>>> `cookie` STRING,
>>> `category` STRING,
>>> `end_time` STRING,
>>> `start_time` STRING,
>>> `first_name` STRING,
>>> `email` STRING,
>>> `phone` STRING,
>>> `last_name` STRING,
>>> `site` STRING,
>>> `source` STRING,
>>> `subject` STRING,
>>> `raw` STRING
>>> )
>>> ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
>>> LOCATION '${INPUT}';
>>> 
>>> INSERT OVERWRITE DIRECTORY '${OUTPUT}/parquet'
>>> STORED AS PARQUET
>>> SELECT *
>>> FROM EVENTS;
>>>  
>>> Using parquet-tools, I read the resulting file and below is an example 
>>> output:
>>>  
>>> _col0 = 0.1.2
>>> _col1 = customer1
>>> _col2 = NULL
>>> _col3 = api
>>> _col4 = 2018-01-21T06:57:57Z 
>>> _col5 = 2018-01-21T06:57:56Z 
>>> _col6 = Brandon
>>> _col7 = bran...@fakesite.com
>>> _col8 = 999-999-
>>> _col9 = Pompei
>>> _col10 = Boston
>>> _col11 = Wifi
>>> _col12 = NULL
>>> _col13 = 
>>> eyJlbmdhZ2VtZW50TWVkaXVtIjoibm9uZSIsImVudHJ5UG9pbnRJZCI6ImQ5YjYwN2UzLTFlN2QtNGY1YS1iZWQ4LWQ4Yjk3NmRkZTQ3MiIsIkVDWF9FVkVOVF9DQVRFR09SWV9BUElfTkFNRSI6IkVDWF9FQ19TSVRFVFJBQ0tfU0lURV9WSVNJVCIsIkVDWF9TSVRFX1JFR0lPTl9BUElfTkFNRSI
>>>  
>>> This is problematic because it is impossible to transfer it to an Athena 
>>> table (or even back to HIVE) without using these index-based column titles. 
>>> I need HIVE's column titles to transfer over to the Parquet file. 
>>>  
>>> I've searched for a very long time and have come up short. Am I doing 
>>> something wrong? 
>>> Please let me know if I can provide more information. Thank you!
>>> 
>>> I appreciate your time.
>>> Sincerely,
>>> 
>>> Brandon Cooke
>>> Software Engineer
>>> engage.cx
>>> 5500 Interstate N Parkway Suite 130
>> 
> 


Re: Does Hive SQL support reading data without locking?

2018-01-01 Thread Jörn Franke
How do you import data? Bulk import?

What about using partitions (or is the data too small for daily partitions)?
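A sketch of the daily-partition layout (all names are illustrative); each bulk import writes its own partition, which can keep the writer's lock scope away from the partitions readers are scanning:

CREATE TABLE fact_events (
  id      BIGINT,
  payload STRING
)
PARTITIONED BY (load_dt STRING)
STORED AS ORC;

-- The daily cron job loads only the new day's partition:
INSERT OVERWRITE TABLE fact_events PARTITION (load_dt='2018-01-02')
SELECT id, payload FROM staging_events;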

> On 2. Jan 2018, at 04:59, l...@china-inv.cn wrote:
> 
> Hi, All, 
> 
> We are using Hive to persist our data and we run cron jobs to import new data 
> into Hive daily. 
> 
> At the same time, our users may query data from Hive by 
> using third-party software like Tableau. 
> 
> We found an issue where the import job fails while the table is being queried by 
> a user, because it fails to get an exclusive lock. 
> 
> Does Hive SQL support reading data without locking (read lock) just like 
> 'with (nolock)' provided by SQL server? 
> 
> Thanks 
> 
> Boying 
> 
> 
>
> This email message may contain confidential and/or privileged information. If 
> you are not the intended recipient, please do not read, save, forward, 
> disclose or copy the contents of this email or open any file attached to this 
> email. We will be grateful if you could advise the sender immediately by 
> replying this email, and delete this email and any attachment or links to 
> this email completely and immediately from your computer system.
> 
> 
> 


Re: For Apache Hive HS2 , what is the largest heap size setting that works well?

2017-11-28 Thread Jörn Franke
I also recommend it; you will also get performance improvements with JDK 8 in 
general (use the latest version). 
Also keep in mind that more and more big data libraries etc. will drop JDK 7 
support soon (aside from that, JDK 7 is not maintained anymore anyway).

> On 29. Nov 2017, at 01:31, Johannes Alberti  wrote:
> 
> Yes, I would recommend to go to Java 8 and give it a shot with G1 and report 
> back :)
> 
> Sent from my iPhone
> 
>> On Nov 28, 2017, at 3:30 PM, Sharanya Santhanam  
>> wrote:
>> 
>> HI Johannes ,
>> 
>> We are running on Java version jdk1.7.0_67 . We are using 
>> ConcurrentMarkAndSweep.  Would you recommend using G1GC ? 
>> 
>> 
>> These are our current settings 
>>  
>> -XX:NewRatio=8 -XX:+UseParNewGC -XX:-UseGCOverheadLimit -XX:PermSize=256m 
>> -Xloggc:<> -XX:HeapDumpPath=oom -XX:+PrintGCDetails -XX:+PrintGCDateStamps 
>> -XX:ErrorFile=/oom/hs2jvmerror%p.log -XX:+UseGCLogFileRotation 
>> -XX:NumberOfGCLogFiles=5 -XX:GCLogFileSize=128M 
>> -XX:+CMSClassUnloadingEnabled -XX:+CMSPermGenSweepingEnabled 
>> -XX:+HeapDumpOnOutOfMemoryError -XX:+UseConcMarkSweepGC 
>> -XX:+CMSParallelRemarkEnabled -XX:MaxPermSize=1024m -Xmx69427m -Xms128m 
>> -XX:MaxHeapFreeRatio=30 -XX:MinHeapFreeRatio=10 -XX:+UseParNewGC 
>> -XX:-UseGCOverheadLimit -XX:PermSize=256m 
>> 
>> 
>> Thanks ,
>> Sharanya 
>> 
>>> On Tue, Nov 28, 2017 at 2:19 PM, Johannes Alberti  
>>> wrote:
>>> Hi Sharanya,
>>> 
>>> Can you share your current GC settings and Java version. Are you using Java 
>>> 8/9 w/ G1 already?
>>> 
>>> Regards,
>>> 
>>> Johannes
>>> 
>>> Sent from my iPhone
>>> 
>>>> On Nov 28, 2017, at 12:57 PM, Sharanya Santhanam  wrote:
>>>> 
>>>> Hello, 
>>>> 
>>>> I am currently trying to upgrade the Hive version on our prod clusters from 
>>>> v1.2 to v2.1. 
>>>> We also want to adopt HS2 on the new upgraded cluster. Earlier all queries 
>>>> were submitted via the Hive CLI. 
>>>> 
>>>> I would like to understand how large a single HS2 heap size can be. And is 
>>>> there any formula to figure out how many concurrent sessions I can 
>>>> support with this particular heap setting? 
>>>> 
>>>> We currently have an upper limit of supporting 300 concurrent sessions 
>>>> (hive.server2.thrift.max.worker.threads=300). Based on this we set the max 
>>>> heap size to 70 GB, but are seeing many long GC pauses. 
>>>> 
>>>> I would like to understand what the industry standard for a max HS2 heap 
>>>> size is. Are there any recommendations on what JVM GC settings work best for 
>>>> supporting such a high number of concurrent sessions? 
>>>> 
>>>> Thanks,
>>>> Sharanya 
>> 


Re: Issues with hive storage based authorization

2017-11-15 Thread Jörn Franke
What kind of access do you need for a user?

From a distance it is quite difficult to judge, because we do not have all the 
information, and the Kerberos setup can be rather tricky (if not using a Hadoop 
distribution that facilitates it).

Usually fine-grained access is supported by using Apache Ranger or Apache 
Sentry.

> On 15. Nov 2017, at 12:19, Vijay Toshniwal <vijay.toshni...@gmail.com> wrote:
> 
> Hi ,
> 
> As per the suggestion I did kerberize the cluster; however, I am getting the same 
> issue. Any user, after authenticating using a keytab, can go and create 
> databases.
> 
> One thing I observed was that dfs.permissions.enabled in hdfs-site.xml was set to 
> false. After setting it to true, only a user with the required privileges on the 
> warehouse dir was able to create a database. However, that works without 
> even enabling the Hive storage-based authorization. So I am not sure how Hive 
> storage-based authorization will provide additional security. Definitely I am 
> missing something.
> 
> Please suggest.
> 
> Thanks,
> Vijay
> 
>> On Thu, Nov 9, 2017 at 1:55 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>> Then you need to kerberize it to support what you want
>> 
>>> On 9. Nov 2017, at 09:18, Vijay Toshniwal <vijay.toshni...@gmail.com> wrote:
>>> 
>>> No its not. 
>>> 
>>> Thanks,
>>> Vijay
>>> 
>>>> On Thu, Nov 9, 2017 at 1:09 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>>>> Is your Hadoop cluster kerberized?
>>>> 
>>>>> On 9. Nov 2017, at 06:57, Vijay Toshniwal <vijay.toshni...@gmail.com> 
>>>>> wrote:
>>>>> 
>>>>> Hi Team,
>>>>> 
>>>>>  
>>>>> 
>>>>> I am facing issues while configuring hive storage based authorization. I 
>>>>> followed the steps mentioned in 
>>>>> https://cwiki.apache.org/confluence/display/Hive/Storage+Based+Authorization+in+the+Metastore+Server
>>>>>  however still any user can create database in hive (using beeline and 
>>>>> cli) at will though not able to delete other users databases. My hive 
>>>>> directory permission is set to 770 (hive:hadoop).Below are the parameters 
>>>>> that I added to hive-site.xml:
>>>>> 
>>>>>  
>>>>> 
>>>>> hive.metastore.pre.event.listeners: 
>>>>> org.apache.hadoop.hive.ql.security.authorization.AuthorizationPreEventListener
>>>>> 
>>>>> hive.security.metastore.authorization.auth.reads: true
>>>>> 
>>>>> hive.security.metastore.authenticator.manager:org.apache.hadoop.hive.ql.security.HadoopDefaultMetastoreAuthenticator
>>>>> 
>>>>> hive.security.metastore.authorization.manager: 
>>>>> org.apache.hadoop.hive.ql.security.authorization.StorageBasedAuthorizationProvider
>>>>> 
>>>>> hive.metastore.execute.setugi: true
>>>>> 
>>>>> hive.server2.enable.doAs:true
>>>>> 
>>>>>  
>>>>> 
>>>>> hive version: 1.2.1
>>>>> 
>>>>> Hadoop version: 2.7.3
>>>>> 
>>>>>  
>>>>> 
>>>>> My understanding was only those users having write access to 
>>>>> /user/hive/warehouse should be able to create the database. Please 
>>>>> suggest.
>>>>> 
>>>>>  
>>>>> 
>>>>> 
>>>>> I also found one similar question 
>>>>> https://stackoverflow.com/questions/43734947/does-the-storage-based-authorization-or-sql-standards-based-hive-authorization-w?rq=1
>>>>>  where the default authorization is not working as expected.
>>>>>  
>>>>> Request you to provide your inputs on the same.
>>>>> 
>>>>> Thanks,
>>>>> Vijay
>>> 
> 


Re: Issues with hive storage based authorization

2017-11-09 Thread Jörn Franke
Then you need to kerberize it to support what you want

> On 9. Nov 2017, at 09:18, Vijay Toshniwal <vijay.toshni...@gmail.com> wrote:
> 
> No its not. 
> 
> Thanks,
> Vijay
> 
>> On Thu, Nov 9, 2017 at 1:09 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>> Is your Hadoop cluster kerberized?
>> 
>>> On 9. Nov 2017, at 06:57, Vijay Toshniwal <vijay.toshni...@gmail.com> wrote:
>>> 
>>> Hi Team,
>>> 
>>>  
>>> 
>>> I am facing issues while configuring hive storage based authorization. I 
>>> followed the steps mentioned in 
>>> https://cwiki.apache.org/confluence/display/Hive/Storage+Based+Authorization+in+the+Metastore+Server
>>>  however still any user can create database in hive (using beeline and cli) 
>>> at will though not able to delete other users databases. My hive directory 
>>> permission is set to 770 (hive:hadoop).Below are the parameters that I 
>>> added to hive-site.xml:
>>> 
>>>  
>>> 
>>> hive.metastore.pre.event.listeners: 
>>> org.apache.hadoop.hive.ql.security.authorization.AuthorizationPreEventListener
>>> 
>>> hive.security.metastore.authorization.auth.reads: true
>>> 
>>> hive.security.metastore.authenticator.manager:org.apache.hadoop.hive.ql.security.HadoopDefaultMetastoreAuthenticator
>>> 
>>> hive.security.metastore.authorization.manager: 
>>> org.apache.hadoop.hive.ql.security.authorization.StorageBasedAuthorizationProvider
>>> 
>>> hive.metastore.execute.setugi: true
>>> 
>>> hive.server2.enable.doAs:true
>>> 
>>>  
>>> 
>>> hive version: 1.2.1
>>> 
>>> Hadoop version: 2.7.3
>>> 
>>>  
>>> 
>>> My understanding was only those users having write access to 
>>> /user/hive/warehouse should be able to create the database. Please suggest.
>>> 
>>>  
>>> 
>>> 
>>> I also found one similar question 
>>> https://stackoverflow.com/questions/43734947/does-the-storage-based-authorization-or-sql-standards-based-hive-authorization-w?rq=1
>>>  where the default authorization is not working as expected.
>>>  
>>> Request you to provide your inputs on the same.
>>> 
>>> Thanks,
>>> Vijay
> 


Re: Issues with hive storage based authorization

2017-11-08 Thread Jörn Franke
Is your Hadoop cluster kerberized?

> On 9. Nov 2017, at 06:57, Vijay Toshniwal  wrote:
> 
> Hi Team,
> 
>  
> 
> I am facing issues while configuring hive storage based authorization. I 
> followed the steps mentioned in 
> https://cwiki.apache.org/confluence/display/Hive/Storage+Based+Authorization+in+the+Metastore+Server
>  however still any user can create database in hive (using beeline and cli) 
> at will though not able to delete other users databases. My hive directory 
> permission is set to 770 (hive:hadoop).Below are the parameters that I added 
> to hive-site.xml:
> 
>  
> 
> hive.metastore.pre.event.listeners: 
> org.apache.hadoop.hive.ql.security.authorization.AuthorizationPreEventListener
> 
> hive.security.metastore.authorization.auth.reads: true
> 
> hive.security.metastore.authenticator.manager:org.apache.hadoop.hive.ql.security.HadoopDefaultMetastoreAuthenticator
> 
> hive.security.metastore.authorization.manager: 
> org.apache.hadoop.hive.ql.security.authorization.StorageBasedAuthorizationProvider
> 
> hive.metastore.execute.setugi: true
> 
> hive.server2.enable.doAs:true
> 
>  
> 
> hive version: 1.2.1
> 
> Hadoop version: 2.7.3
> 
>  
> 
> My understanding was only those users having write access to 
> /user/hive/warehouse should be able to create the database. Please suggest.
> 
>  
> 
> 
> I also found one similar question 
> https://stackoverflow.com/questions/43734947/does-the-storage-based-authorization-or-sql-standards-based-hive-authorization-w?rq=1
>  where the default authorization is not working as expected.
>  
> Request you to provide your inputs on the same.
> 
> Thanks,
> Vijay


Re: Parameterized views

2017-10-02 Thread Jörn Franke
Yes, it is better to push the columns into the view so that users can filter on them.

Alternatively you can use HPL/SQL on Hive.

I think there are very few (if any) use cases that would require parameterized views.
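
For illustration, a minimal sketch of the column-push approach (table and column names are taken from your example; the date column is backquoted to be safe since date is a keyword):

CREATE VIEW x AS
  SELECT a, `date`
  FROM y;

-- callers then filter at query time instead of passing a parameter
SELECT a
FROM x
WHERE `date` = '2017-10-01';

The main difference to variable substitution is that the filter is supplied by the caller, so the view itself cannot enforce it.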

> On 2. Oct 2017, at 16:12, Elliot West  wrote:
> 
> Hello,
> 
> Does any version of Hive support parameterized views. I'm thinking of 
> something like the following contrived example:
> 
> CREATE VIEW x AS
>   SELECT a
>   FROM y
>   WHERE date = ${mydate}
> 
> I've not been able to get this to work in Hive 1.2.1 or 2.1.0 and wonder if 
> this is the intended behavior, or I'm not constructing the view correctly.
> 
> Is it perhaps better instead to push up the filter columns to the view so 
> that they are accessible to the caller there?:
> 
> CREATE VIEW x AS
>   SELECT a, date
>   FROM y
> 
> Would this have any limitations in comparison with the variable substitution 
> approach?
> 
> Thanks,
> 
> Elliot.
> 


Re: hive on spark - why is it so hard?

2017-10-02 Thread Jörn Franke
You should try with TEZ+LLAP.

Additionally you will need to compare different configurations.

Finally, any generic comparison is of limited value: you should benchmark with the 
queries, data and file formats that your users will actually be using later.
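
As a minimal sketch of how such a like-for-like comparison could be driven from the Hive side (assuming Tez/LLAP and Hive on Spark are both installed and configured):

SET hive.execution.engine=tez;
-- run the representative queries of your users here

SET hive.execution.engine=spark;
-- run exactly the same queries again, on the same data and file formats

Everything else (data, file formats, statistics, cluster load) should be kept identical between the two runs.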

> On 2. Oct 2017, at 03:06, Stephen Sprague  wrote:
> 
> so...  i made some progress after much copying of jar files around (as 
> alluded to by Gopal previously on this thread).
> 
> 
> following the instructions here: 
> https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started
> 
> and doing this as instructed will leave off about a dozen or so jar files 
> that spark'll need:
>   ./dev/make-distribution.sh --name "hadoop2-without-hive" --tgz 
> "-Pyarn,hadoop-provided,hadoop-2.7,parquet-provided"
> 
> i ended copying the missing jars to $SPARK_HOME/jars but i would have 
> preferred to just add a path(s) to the spark class path but i did not find 
> any effective way to do that. In hive you can specify HIVE_AUX_JARS_PATH but 
> i don't see the analagous var in spark - i don't think it inherits the hive 
> classpath.
> 
> anyway a simple query is now working under Hive On Spark so i think i might 
> be over the hump.  Now its a matter of comparing the performance with Tez.
> 
> Cheers,
> Stephen.
> 
> 
>> On Wed, Sep 27, 2017 at 9:37 PM, Stephen Sprague  wrote:
>> ok.. getting further.  seems now i have to deploy hive to all nodes in the 
>> cluster - don't think i had to do that before but not a big deal to do it 
>> now.
>> 
>> for me:
>> HIVE_HOME=/usr/lib/apache-hive-2.3.0-bin/
>> SPARK_HOME=/usr/lib/spark-2.2.0-bin-hadoop2.6
>> 
>> on all three nodes now.
>> 
>> i started spark master on the namenode and i started spark slaves (2) on two 
>> datanodes of the cluster. 
>> 
>> so far so good.
>> 
>> now i run my usual test command.
>> 
>> $ hive --hiveconf hive.root.logger=DEBUG,console -e 'set 
>> hive.execution.engine=spark; select date_key, count(*) from 
>> fe_inventory.merged_properties_hist group by 1 order by 1;'
>> 
>> i get a little further now and find the stderr from the Spark Web UI 
>> interface (nice) and it reports this:
>> 
>> 17/09/27 20:47:35 INFO WorkerWatcher: Successfully connected to 
>> spark://Worker@172.19.79.127:40145
>> Exception in thread "main" java.lang.reflect.InvocationTargetException
>>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>  at 
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>  at 
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>  at java.lang.reflect.Method.invoke(Method.java:483)
>>  at 
>> org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:58)
>>  at 
>> org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
>> Caused by: java.lang.NoSuchFieldError: SPARK_RPC_SERVER_ADDRESS
>>  at 
>> org.apache.hive.spark.client.rpc.RpcConfiguration.<init>(RpcConfiguration.java:47)
>>  at 
>> org.apache.hive.spark.client.RemoteDriver.<init>(RemoteDriver.java:134)
>>  at org.apache.hive.spark.client.RemoteDriver.main(RemoteDriver.java:516)
>>  ... 6 more
>> 
>> 
>> searching around the internet i find this is probably a compatibility issue.
>> 
>> i know. i know. no surprise here.  
>> 
>> so i guess i just got to the point where everybody else is... build spark 
>> w/o hive. 
>> 
>> lemme see what happens next.
>> 
>> 
>> 
>> 
>> 
>>> On Wed, Sep 27, 2017 at 7:41 PM, Stephen Sprague  wrote:
>>> thanks.  I haven't had a chance to dig into this again today but i do 
>>> appreciate the pointer.  I'll keep you posted.
>>> 
 On Wed, Sep 27, 2017 at 10:14 AM, Sahil Takiar  
 wrote:
 You can try increasing the value of hive.spark.client.connect.timeout. 
 Would also suggest taking a look at the HoS Remote Driver logs. The driver 
 gets launched in a YARN container (assuming you are running Spark in 
 yarn-client mode), so you just have to find the logs for that container.
 
 --Sahil
 
> On Tue, Sep 26, 2017 at 9:17 PM, Stephen Sprague  
> wrote:
> i _seem_ to be getting closer.  Maybe its just wishful thinking.   Here's 
> where i'm at now.
> 
> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl: 
> 17/09/26 21:10:38 INFO rest.RestSubmissionClient: Server responded with 
> CreateSubmissionResponse:
> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl: {
> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:   
> "action" : "CreateSubmissionResponse",
> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:   
> "message" : "Driver successfully submitted as driver-20170926211038-0003",
> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:   
> "serverSparkVersion" : "2.2.0",
> 

Re: Hive - Avro - Schema Manipulation

2017-09-24 Thread Jörn Franke
insert into dep_av values(8,null) should do what you intend.
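
A minimal sketch, using the dep_av table from your mail (Hive expects a value for every column, so the null has to be passed explicitly):

INSERT INTO TABLE dep_av VALUES (8, NULL);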

> On 24. Sep 2017, at 03:03, BD  wrote:
> 
> Hi ,
> 
> I have imported (using sqoop) departments table from retail_db in hdfs as 
> avro file. Have created an external table stored as hive and used the avro 
> schema generated by sqoop. 
> 
> I want to modify the avro schema so that a column is non nullable and if not 
> specified in insert query then a default value is inserted into the table. 
> Have tried modifying the avro schema as following, but it does not help. 
> 
> Avro Schema
> 
> {
>   "type" : "record",
>   "name" : "departments",
>   "doc" : "Sqoop import of departments",
>   "fields" : [ {
> "name" : "department_id",
> "type" :  "int" ,
> "columnName" : "department_id",
> "sqlType" : "4"
>   }, {
> "name" : "department_name",
> "type" : "string",
> "default" : "default_dep_name" ,
> "columnName" : "department_name",
> "sqlType" : "12"
>   } ],
>   "tableName" : "departments"
> }
> 
> 
> If i do not provide the value for department name then hive gives error 
> stating that two columns expected. Is this a valid use case? if so any 
> suggestion?
> 
> 
> hive> insert into dep_av values(8);
> FAILED: SemanticException [Error 10044]: Line 1:12 Cannot insert into target 
> table because column number/types are different 'dep_av': Table insclause-0 
> has 2 columns, but query has 1 columns.
> 
> 
> regards


Re: EMR 5.8 & Hue/Hive Performance/Stability Specifics

2017-09-12 Thread Jörn Franke
Test it, because it really depends on what you do. Since you use Hue you seem to be 
interested in interactive analysis, so the best option is to use Tez with LLAP as the 
Hive engine. Also make sure that you use ORC or Parquet as the Hive storage 
format. Leverage the built-in ORC or Parquet indexes by sorting the data on the 
filtering column. If possible, partition the data.

I do not remember at the moment whether Tez is included in the Hive package of EMR 
or whether you have to install an additional package. 
However, given your current Hive version you should see an improvement for sure.
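
A minimal sketch of the storage advice (table, column and partition names are made up here; assume most queries filter on user_id):

CREATE TABLE events_orc (
  user_id    BIGINT,
  event_type STRING,
  event_time TIMESTAMP
)
PARTITIONED BY (event_date STRING)
STORED AS ORC;

-- sorting on the filter column lets the built-in ORC min/max indexes skip stripes
INSERT OVERWRITE TABLE events_orc PARTITION (event_date='2017-09-12')
SELECT user_id, event_type, event_time
FROM events_text
WHERE event_date='2017-09-12'
SORT BY user_id;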

> On 12. Sep 2017, at 18:41, Mr. Glenn Waldman  wrote:
> 
> Hi,
> We are upgrading our EMR from 4.7 - 5.8, along with Hue/Hive conjointly.  
> Hue: from 1.0 - 2.3
> Hive: to 3.12
> 
> Can somebody explain some high-level, specific stability and performance 
> increases we should see from these upgrades?
> 
> We run daily queries through Hue/Hive and are looking for how these upgrades 
> should make them run more quickly with less breaks.
> 
> Thoughts!?
> 
> Thanks!!


Re: Why is a single INSERT very slow in Hive?

2017-09-11 Thread Jörn Franke
Why do you want to do single inserts? 
Hive has been designed more for bulk loads.
In any case, newer versions of Hive 2 using Tez + LLAP improve this significantly 
(also for bulk analysis). Nevertheless, it is good practice not to use single 
inserts in an analytics system, but to combine them and bulk-load them.
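
A minimal sketch with the people table from your mail - one statement carrying many rows runs as a single job instead of one job per row:

INSERT INTO TABLE people VALUES
  (1, 'Tom A', 20),
  (2, 'Ann B', 31),
  (3, 'Joe C', 27);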

> On 11. Sep 2017, at 21:01, Jinhui Qin  wrote:
> 
>  
> 
> Hi, 
> I am new to Hive. I just created a simple table in hive and inserted two 
> records, the first insertion took 16.4 sec, while the second took 14.3 sec. 
> Why is that very slow? is this the normal performance you get in Hive using 
> INSERT ? Is there a way to improve the performance of a single "insert" in 
> Hive? Any help would be really appreciated. Thanks!
> 
> Here is the record from a terminal in Hive shell:
> 
> =
> 
> hive> show tables;
> OK
> Time taken: 2.758 seconds
> hive> create table people(id int, name string, age int);
> OK
> Time taken: 0.283 seconds
> hive> insert into table people values(1,'Tom A', 20);
> Query ID = hive_20170911134052_04680c79-432a-43e0-827b-29a4212fbbc0
> Total jobs = 3
> Launching Job 1 out of 3
> Number of reduce tasks is set to 0 since there's no reduce operator
> Starting Job = job_1505146047428_0098, Tracking URL = 
> http://iop-hadoop-bi.novalocal:8088/proxy/application_1505146047428_0098/
> Kill Command = /usr/iop/4.1.0.0/hadoop/bin/hadoop job  -kill 
> job_1505146047428_0098
> Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
> 2017-09-11 13:41:01,492 Stage-1 map = 0%,  reduce = 0%
> 2017-09-11 13:41:06,940 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.7 
> sec
> MapReduce Total cumulative CPU time: 2 seconds 700 msec
> Ended Job = job_1505146047428_0098
> Stage-4 is selected by condition resolver.
> Stage-3 is filtered out by condition resolver.
> Stage-5 is filtered out by condition resolver.
> Moving data to: 
> hdfs://iop-hadoop-bi.novalocal:8020/apps/hive/warehouse/people/.hive-staging_hive_2017-09-11_13-40-52_106_462156758110461544
> 1-1/-ext-1
> Loading data to table default.people
> Table default.people stats: [numFiles=1, numRows=1, totalSize=11, 
> rawDataSize=10]
> MapReduce Jobs Launched: 
> Stage-Stage-1: Map: 1   Cumulative CPU: 2.7 sec   HDFS Read: 3836 HDFS Write: 
> 81 SUCCESS
> Total MapReduce CPU Time Spent: 2 seconds 700 msec
> OK
> Time taken: 16.417 seconds
> hive> insert into table people values(1,'Tom A', 20);
> Query ID = hive_20170911134128_c8f46977-7718-4496-9a98-cce0f89ced79
> Total jobs = 3
> Launching Job 1 out of 3
> Number of reduce tasks is set to 0 since there's no reduce operator
> Starting Job = job_1505146047428_0099, Tracking URL = 
> http://iop-hadoop-bi.novalocal:8088/proxy/application_1505146047428_0099/
> Kill Command = /usr/iop/4.1.0.0/hadoop/bin/hadoop job  -kill 
> job_1505146047428_0099
> Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
> 2017-09-11 13:41:36,289 Stage-1 map = 0%,  reduce = 0%
> 2017-09-11 13:41:40,721 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.28 
> sec
> MapReduce Total cumulative CPU time: 2 seconds 280 msec
> Ended Job = job_1505146047428_0099
> Stage-4 is selected by condition resolver.
> Stage-3 is filtered out by condition resolver.
> Stage-5 is filtered out by condition resolver.
> Moving data to: 
> hdfs://iop-hadoop-bi.novalocal:8020/apps/hive/warehouse/people/.hive-staging_hive_2017-09-11_13-41-28_757_445847252207124056
> 7-1/-ext-1
> Loading data to table default.people
> Table default.people stats: [numFiles=2, numRows=2, totalSize=22, 
> rawDataSize=20]
> MapReduce Jobs Launched: 
> Stage-Stage-1: Map: 1   Cumulative CPU: 2.28 sec   HDFS Read: 3924 HDFS 
> Write: 81 SUCCESS
> Total MapReduce CPU Time Spent: 2 seconds 280 msec
> OK
> Time taken: 14.288 seconds
> hive> exit;
> =
> 
>  
> Jinhui
> 
> 


Re: Hive index + Tez engine = no performance gain?!

2017-08-21 Thread Jörn Franke
Parquet also has internal indexes, so there is no need for a Hive index there.
For fast ad-hoc queries you can use Tez + LLAP. Here you could keep Parquet or 
easily convert to ORC via CTAS. However, you need to check whether ORC is faster than 
Parquet for your data, queries and configuration (bloom filters, 
internal indexes etc.).
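
A minimal sketch of such a CTAS conversion (table and column names are made up; the bloom filter column should be one you frequently filter on):

CREATE TABLE events_orc
STORED AS ORC
TBLPROPERTIES (
  'orc.compress'='ZLIB',
  'orc.bloom.filter.columns'='user_id'
)
AS
SELECT * FROM events_parquet;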

> On 22. Aug 2017, at 03:22, Thai Bui  wrote:
> 
> This seems out of the blue but my initial benchmarks have shown that there's 
> no performance gain when Hive index is used with Tez engine. I'm not sure 
> why, but several posts online have suggested that Tez engine does not support 
> Hive index (bitmap, compact). Is true? If yes, that is sad.
> 
> I understand that ORC format is a much better alternative if you manage your 
> own tables. However, at my company, we have several teams that pick our own 
> technology and thus, most teams would use Parquet due to its ease of 
> integrations with various external systems. 
> 
> Nonetheless, we still want to have fast ad-hoc query via Hive LLAP / Tez. I 
> think that index is a perfect solution for non-ORC file format since you can 
> selectively build an index table and leverage Tez to only look at those 
> blocks and/or files that we need to scan.
> 
> Thanks for any input,
> Thai


Re: Can one classify Hive as an analytical tool besides storage?

2017-08-14 Thread Jörn Franke
It depends on your definition of an analytical and a storage tool. I think Hive 
(especially with Tez + LLAP) would qualify as an analytical tool, especially 
because you can extend it with SQL procedures, can use any Java function 
directly in SQL, and it has a wide range of analytical functions.
The storage part is also covered with ORC and Parquet (and many more formats).
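
For example, a windowed RANK like the one you mention (illustrative table and columns):

SELECT customer_id,
       amount,
       RANK() OVER (PARTITION BY customer_id ORDER BY amount DESC) AS amount_rank
FROM sales;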

> On 14. Aug 2017, at 18:36, Mich Talebzadeh  wrote:
> 
> This may be rather mute point.
> 
> We had a discussion in the office about Hive and Impala!
> 
> A colleague insisted that Hive and Impala are storage tools. Changed his mind 
> when I mentioned that one can use Impala on Hive tables for faster query 
> access.
> 
> With regard to Hive, my view is that Hive supports HQL that in turn has 
> analytical functions like RANK etc built in. So in effect it is not only a 
> storage, but can be used as an analytical tool as well?
> 
> What are your views?
> 
> Thanks,
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  


Re: Using JAR files located on HDFS for SerDe

2017-04-12 Thread Jörn Franke
I do not think it is supported. The jar for Hive must be on a local filesystem 
of the Hive server (not necessarily on all nodes).
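
For example, with a local path it would look like this (path is made up):

ADD JAR /opt/hive/auxlib/custom-serde.jar;
LIST JARS;   -- verify that the jar was registered in the session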

> On 12. Apr 2017, at 16:57, Mahdi Mohammadinasab  wrote:
> 
> Hello,
> 
> I am trying to add a JAR file which is located on HDFS to be later used as a 
> SerDe. This is completely possible using "ADD JAR" command but I prefer to 
> use hive.aux.jars.path setting in "hive-site.xml" or "HIVE_AUX_JARS_PATH" 
> environment variable (Because then I don't need to update all of my scripts 
> with ADD JAR command).
> 
> Unfortunately, none of these methods seem to work. I tried different 
> addressing methods (hdfs://, hdfs:///, hdfs://cluster_name/path/file.jar, 
> hdfs://localhost:9/path/file.jar, ...) but they don't work.
> 
> So how can I do this?
> 
> Thanks,
> Mahdi


Re: Using Sqoop to get data from Impala/Hive to another Hive table

2017-02-21 Thread Jörn Franke
From Hive to Hive on another cluster I would use Hive Import/Export, if possible 
orchestrated with Falcon, instead of Sqoop. Sqoop always needs to do 
serialization/deserialization, which is fine if the systems are different (e.g. 
Oracle -> Hive), but if it is the same system then it usually makes sense to 
use the system's own tool (e.g. Hive import/export) for performance/resource 
usage reasons.
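
A minimal sketch of the Hive import/export route (table name and paths are made up):

-- on the source cluster
EXPORT TABLE sales TO '/tmp/export/sales';

-- copy the export directory to the target cluster (e.g. with distcp), then on the target:
IMPORT TABLE sales FROM '/tmp/export/sales';

EXPORT writes the table metadata next to the data, which is why no extra schema handling is needed on the target side.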


> On 21 Feb 2017, at 11:31, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
> 
> thanks Jorn.
> 
> the idea is to test a big data high availability tool. by ingesting data to 
> the target cluster. I have provisioned aan oracle schema for it so we can use 
> Sqoop to get data into Hive.
> 
> someone suggested try using sqoop to ingest from a hive table in one cluster 
> to the target cluster.
> 
> this is not really a test is iut?
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
>> On 21 February 2017 at 10:26, Jörn Franke <jornfra...@gmail.com> wrote:
>> Hallo,
>> 
>> I have not tried it, but sqoop supports any jdbc driver. However, since the 
>> SQL syntax is not necessarily standardized you may face issues or 
>> performance problems. Hive itself has a nice import and export tool that 
>> supports also the metadata import/export. It can be orchestrated from Oozie 
>> and/or Falcon.
>> 
>> Best regards 
>> 
>>> On 21 Feb 2017, at 11:16, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>> 
>>> Hi,
>>> 
>>> I have not tried this but someone mentioned that it is possible to use 
>>> Sqoop to get data from one Impala/Hive table in one cluster to another?
>>> 
>>> The clusters are in different zones. This is to test the cluster. Has 
>>> anyone done such a thing?
>>> 
>>> Thanks
>>> 
>>> 
>>> 
>>> Dr Mich Talebzadeh
>>>  
>>> LinkedIn  
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>  
>>> http://talebzadehmich.wordpress.com
>>> 
>>> Disclaimer: Use it at your own risk. Any and all responsibility for any 
>>> loss, damage or destruction of data or any other property which may arise 
>>> from relying on this email's technical content is explicitly disclaimed. 
>>> The author will in no case be liable for any monetary damages arising from 
>>> such loss, damage or destruction.
>>>  
> 


Re: Using Sqoop to get data from Impala/Hive to another Hive table

2017-02-21 Thread Jörn Franke
Hello,

I have not tried it, but Sqoop supports any JDBC driver. However, since the SQL 
syntax is not necessarily standardized, you may face issues or performance 
problems. Hive itself has a nice import and export tool that also supports 
metadata import/export. It can be orchestrated from Oozie and/or Falcon.

Best regards 

> On 21 Feb 2017, at 11:16, Mich Talebzadeh  wrote:
> 
> Hi,
> 
> I have not tried this but someone mentioned that it is possible to use Sqoop 
> to get data from one Impala/Hive table in one cluster to another?
> 
> The clusters are in different zones. This is to test the cluster. Has anyone 
> done such a thing?
> 
> Thanks
> 
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  


Re: Need help on connecting Tableau to Hive

2017-01-22 Thread Jörn Franke
You need to install the Hortonworks ODBC drivers (just Google them) on Windows. 
Tableau does not include any drivers, only the software that uses these drivers to 
connect to whatever database you need.

> On 22 Jan 2017, at 03:15, Raymond Xie  wrote:
> 
> Hello,
> 
> I have Tableau installed on my Win 7 desktop, I have Hive 2.0.1 installed on 
> my hadoop node on VM Ware.
> 
> The host and guest communication is ping-able.
> 
> I tried to connect Tableau to HiveServer2 using the credential found in 
> hive-site.xml, Tableau returns the following error message:
> 
> Unable to connect to the ODBC Data Source. Check that the necessary drivers 
> are installed and that the connection properties are valid.
> [Hortonworks][Hardy] (34) Error from server: connect() failed: errno = 10061.
> Unable to connect to the server "192.168.112.150". Check that the server is 
> running and that you have access privileges to the requested database.
> 
> Here is the screenshot:
> 
> 
> 
> What am I missing here?
> 
> I already started Hive on hadoop node.
> 
> But I don't see anything when I run:
> netstat -ntauple | grep hive
> 
> 
> 
> Thank you very much.
> 
> 
> 
> 
> 
> Sincerely yours,
> 
> 
> Raymond


Re: File not found of TEZ libraries with tez.lib.uris configuration

2017-01-16 Thread Jörn Franke
Sorry, never mind my previous mail... in the stack trace it seems to look for exactly 
this file. Can you try to download the file? Can you check whether these are all the 
files needed? I think you need to extract the .tar.gz and point to the jars 
(check the Tez web site for the config).

> On 17 Jan 2017, at 07:44, wenxing zheng  wrote:
> 
> Dear all, 
> 
> I met an issue in the TEZ configuration for HIVE, as from the HIVE logs file:
> 
>> Caused by: java.io.FileNotFoundException: File does not exist: 
>> hdfs://hdfscluster/apps/tez-0.8.4/tez.tar.gz
>> at 
>> org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1309)
>>  ~[hadoop-hdfs-2.7.3.jar:?]
>> at 
>> org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301)
>>  ~[hadoop-hdfs-2.7.3.jar:?]
>> at 
>> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>>  ~[hadoop-common-2.7.3.jar:?]
>> at 
>> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1301)
>>  ~[hadoop-hdfs-2.7.3.jar:?]
>> at org.apache.hadoop.fs.FileSystem.resolvePath(FileSystem.java:753) 
>> ~[hadoop-common-2.7.3.jar:?]
>> at 
>> org.apache.tez.client.TezClientUtils.addLocalResources(TezClientUtils.java:218)
>>  ~[tez-api-0.8.4.jar:0.8.4]
>> at 
>> org.apache.tez.client.TezClientUtils.setupTezJarsLocalResources(TezClientUtils.java:183)
>>  ~[tez-api-0.8.4.jar:0.8.4]
>> at 
>> org.apache.tez.client.TezClient.getTezJarResources(TezClient.java:1057) 
>> ~[tez-api-0.8.4.jar:0.8.4]
>> at org.apache.tez.client.TezClient.start(TezClient.java:447) 
>> ~[tez-api-0.8.4.jar:0.8.4]
>> at 
>> org.apache.hadoop.hive.ql.exec.tez.TezSessionState.startSessionAndContainers(TezSessionState.java:396)
>>  ~[hive-exec-2.1.1.jar:2.1.1]
>> at 
>> org.apache.hadoop.hive.ql.exec.tez.TezSessionState.access$000(TezSessionState.java:97)
>>  ~[hive-exec-2.1.1.jar:2.1.1]
>> at 
>> org.apache.hadoop.hive.ql.exec.tez.TezSessionState$1.call(TezSessionState.java:333)
>>  ~[hive-exec-2.1.1.jar:2.1.1]
>> at 
>> org.apache.hadoop.hive.ql.exec.tez.TezSessionState$1.call(TezSessionState.java:329)
>>  ~[hive-exec-2.1.1.jar:2.1.1]
>> at java.util.concurrent.FutureTask.run(FutureTask.java:266) 
>> ~[?:1.8.0_101]
>> at java.lang.Thread.run(Thread.java:745) ~[?:1.8.0_101]
> 
> 
> But in our configuration file, we already set the tez.lib.uris as below:
>> <property>
>>   <name>tez.lib.uris</name>
>>   <value>hdfs://hdfscluster/apps/tez-0.8.4/tez.tar.gz</value>
>> </property>
>> <property>
>>   <name>hive.execution.engine</name>
>>   <value>tez</value>
>> </property>
> 
> 
> And we did copy the tar.gz file to the HDFS filesystem:
>> [hadoop@hdfs-hdp-202 tez]$ hadoop fs -ls /apps//tez-0.8.4/tez.tar.gz
>> -rw-r--r--   3 hadoop supergroup   42892768 2016-12-21 18:36 
>> /apps/tez-0.8.4/tez.tar.gz
> 
> 
> Appreciated for any advice.
> Kind Regards, Wenxing


Re: File not found of TEZ libraries with tez.lib.uris configuration

2017-01-16 Thread Jörn Franke
Maybe the wrong configuration file is picked up? 

> On 17 Jan 2017, at 07:44, wenxing zheng  wrote:
> 
> Dear all, 
> 
> I met an issue in the TEZ configuration for HIVE, as from the HIVE logs file:
> 
>> Caused by: java.io.FileNotFoundException: File does not exist: 
>> hdfs://hdfscluster/apps/tez-0.8.4/tez.tar.gz
>> at 
>> org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1309)
>>  ~[hadoop-hdfs-2.7.3.jar:?]
>> at 
>> org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301)
>>  ~[hadoop-hdfs-2.7.3.jar:?]
>> at 
>> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>>  ~[hadoop-common-2.7.3.jar:?]
>> at 
>> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1301)
>>  ~[hadoop-hdfs-2.7.3.jar:?]
>> at org.apache.hadoop.fs.FileSystem.resolvePath(FileSystem.java:753) 
>> ~[hadoop-common-2.7.3.jar:?]
>> at 
>> org.apache.tez.client.TezClientUtils.addLocalResources(TezClientUtils.java:218)
>>  ~[tez-api-0.8.4.jar:0.8.4]
>> at 
>> org.apache.tez.client.TezClientUtils.setupTezJarsLocalResources(TezClientUtils.java:183)
>>  ~[tez-api-0.8.4.jar:0.8.4]
>> at 
>> org.apache.tez.client.TezClient.getTezJarResources(TezClient.java:1057) 
>> ~[tez-api-0.8.4.jar:0.8.4]
>> at org.apache.tez.client.TezClient.start(TezClient.java:447) 
>> ~[tez-api-0.8.4.jar:0.8.4]
>> at 
>> org.apache.hadoop.hive.ql.exec.tez.TezSessionState.startSessionAndContainers(TezSessionState.java:396)
>>  ~[hive-exec-2.1.1.jar:2.1.1]
>> at 
>> org.apache.hadoop.hive.ql.exec.tez.TezSessionState.access$000(TezSessionState.java:97)
>>  ~[hive-exec-2.1.1.jar:2.1.1]
>> at 
>> org.apache.hadoop.hive.ql.exec.tez.TezSessionState$1.call(TezSessionState.java:333)
>>  ~[hive-exec-2.1.1.jar:2.1.1]
>> at 
>> org.apache.hadoop.hive.ql.exec.tez.TezSessionState$1.call(TezSessionState.java:329)
>>  ~[hive-exec-2.1.1.jar:2.1.1]
>> at java.util.concurrent.FutureTask.run(FutureTask.java:266) 
>> ~[?:1.8.0_101]
>> at java.lang.Thread.run(Thread.java:745) ~[?:1.8.0_101]
> 
> 
> But in our configuration file, we already set the tez.lib.uris as below:
>> <property>
>>   <name>tez.lib.uris</name>
>>   <value>hdfs://hdfscluster/apps/tez-0.8.4/tez.tar.gz</value>
>> </property>
>> <property>
>>   <name>hive.execution.engine</name>
>>   <value>tez</value>
>> </property>
> 
> 
> And we did copy the tar.gz file to the HDFS filesystem:
>> [hadoop@hdfs-hdp-202 tez]$ hadoop fs -ls /apps//tez-0.8.4/tez.tar.gz
>> -rw-r--r--   3 hadoop supergroup   42892768 2016-12-21 18:36 
>> /apps/tez-0.8.4/tez.tar.gz
> 
> 
> Appreciated for any advice.
> Kind Regards, Wenxing


Re: Hive Stored Textfile to Stored ORC taking long time

2016-12-09 Thread Jörn Franke
Ok.
No, do not split into smaller files. This is done automatically. Your behavior 
looks strange. For that file size I would expect it to take below one minute. 
Maybe you hit a bug in the Hive on Spark engine. You could try with a file with 
fewer columns, but the same size. I assume that this is a Hive table with simple 
columns (nothing deeply nested) and that you do not apply any transformations.
What is the CTAS query?
Do you enable vectorization in Hive?

If you just need a simple mapping from CSV to ORC you can use any framework 
(MR, Tez, Spark etc.), because performance does not differ much in these 
cases, especially for the small amount of data you process.
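
For the vectorization check, a minimal sketch (standard Hive settings):

SET hive.vectorized.execution.enabled=true;
SET hive.vectorized.execution.reduce.enabled=true;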

> On 9 Dec 2016, at 11:02, Joaquin Alzola <joaquin.alz...@lebara.com> wrote:
> 
> Hi Jorn
>  
> The file is about 1.5GB with 1.5 milion records and about 550 fields in each 
> row.
>  
> ORC is compress as Zlib.
>  
> I am using a standalone solution before expanding it, so everything is on the 
> same node.
> Hive 2.0.1 -> Spark 1.6.3 -> HDFS 2.6.5
>  
> The configuration is much more as standard and have not change anything much.
>  
> It cannot be a network issue because all the apps are on the same node.
>  
> Since I am doing all of this translation on the Hive point (from textfile to 
> ORC) I wanted to know if I could do it quicker on the Spark or HDFS level 
> (doing the file conversion some other way) not on the top of the “stack”
>  
> We take the files every day once so if I put them in textfile and then to ORC 
> it will take me almost half a day just to display the data.
>  
> It is basicly a time consuming task, and want to do it much quicker. A better 
> solution of course would be to put smaller files with FLUME but this I will 
> do it in the future.
>  
> From: Jörn Franke [mailto:jornfra...@gmail.com] 
> Sent: 09 December 2016 09:48
> To: user@hive.apache.org
> Subject: Re: Hive Stored Textfile to Stored ORC taking long time
>  
> How large is the file? Might IO be an issue? How many disks have you on the 
> only node?
>  
> Do you compress the ORC (snappy?). 
>  
> What is the Hadoop distribution? Configuration baseline? Hive version?
>  
> Not sure if i understood your setup, but might network be an issue?
> 
> On 9 Dec 2016, at 02:08, Joaquin Alzola <joaquin.alz...@lebara.com> wrote:
> 
> HI List
>  
> The transformation from textfile table to stored ORC table takes quite a long 
> time.
>  
> Steps follow>
>  
> 1.Create one normal table using textFile format
> 
> 2.Load the data normally into this table
> 
> 3.Create one table with the schema of the expected results of your normal 
> hive table using stored as orcfile
> 
> 4.Insert overwrite query to copy the data from textFile table to orcfile table
> 
>  
> I have about 1,5 million records with about 550 fields in each row.
>  
> Doing step 4 takes about 30 minutes (moving from one format to the other).
>  
> I have spark with only one worker (same for HDFS) so running now a standalone 
> server but with 25G and 14 cores on that worker.
>  
> BR
>  
> Joaquin
> This email is confidential and may be subject to privilege. If you are not 
> the intended recipient, please do not copy or disclose its content but 
> contact the sender immediately upon receipt.
> This email is confidential and may be subject to privilege. If you are not 
> the intended recipient, please do not copy or disclose its content but 
> contact the sender immediately upon receipt.


Re: Hive Stored Textfile to Stored ORC taking long time

2016-12-09 Thread Jörn Franke
How large is the file? Might IO be an issue? How many disks have you on the 
only node?

Do you compress the ORC (snappy?). 

What is the Hadoop distribution? Configuration baseline? Hive version?

Not sure if I understood your setup, but might network be an issue?

> On 9 Dec 2016, at 02:08, Joaquin Alzola  wrote:
> 
> HI List
>  
> The transformation from textfile table to stored ORC table takes quite a long 
> time.
>  
> Steps follow>
>  
> 1.Create one normal table using textFile format
> 
> 2.Load the data normally into this table
> 
> 3.Create one table with the schema of the expected results of your normal 
> hive table using stored as orcfile
> 
> 4.Insert overwrite query to copy the data from textFile table to orcfile table
> 
>  
> I have about 1,5 million records with about 550 fields in each row.
>  
> Doing step 4 takes about 30 minutes (moving from one format to the other).
>  
> I have spark with only one worker (same for HDFS) so running now a standalone 
> server but with 25G and 14 cores on that worker.
>  
> BR
>  
> Joaquin
> This email is confidential and may be subject to privilege. If you are not 
> the intended recipient, please do not copy or disclose its content but 
> contact the sender immediately upon receipt.


Re: s3a and hive

2016-11-14 Thread Jörn Franke
Is it a permission issue on the folder?

> On 15 Nov 2016, at 06:28, Stephen Sprague  wrote:
> 
> so i figured i try and set hive.metastore.warehouse.dir=s3a://bucket/hive and 
> see what would happen.   
> 
> running this query:
> 
> insert overwrite table omniture.hit_data_aws partition 
> (date_key=20161113) select * from staging.hit_data_aws_ext_20161113 limit 1;
> 
> yields this error:
> 
>Failed with exception java.io.IOException: rename for src path: 
> s3a://trulia-dwr-cluster-dev/hive/omniture.db/hit_data_aws/date_key=20161113/.hive-
>  
> staging_hive_2016-11-15_04-57-52_085_7825126612479617470-1/-ext-1/00_0
>  to dest 
> path:s3a://trulia-dwr-cluster-dev/hive/omniture.db/hit_data_aws/date_key=20161113/00_0
>  returned false
> FAILED: Execution Error, return code 1 from 
> org.apache.hadoop.hive.ql.exec.MoveTask. java.io.IOException: rename for src 
> path: 
> s3a://trulia-dwr-cluster-dev/hive/omniture.db/hit_data_aws/date_key=20161113/.hive-staging_hive_2016-11-15_04-57-52_085_7825126612479617470-1/-ext-1/00_0
>  to dest 
> path:s3a://trulia-dwr-cluster-dev/hive/omniture.db/hit_data_aws/date_key=20161113/00_0
>  returned false
> 
> 
> is there any workaround?   i'm running hive 2.1.0 and hadoop version 
> 2.6.0-cdh5.7.1  .
> 
> 
> thanks,
> Stephen.


Re: Hive metadata on Hbase

2016-10-23 Thread Jörn Franke
I think the main gain is more about getting rid of a dedicated database 
including maintenance and potential license cost. 
For really large clusters and a lot of users this might be even more 
beneficial. You can avoid clustering the database etc.

> On 24 Oct 2016, at 00:46, Mich Talebzadeh  wrote:
> 
> 
> A while back there was some notes on having Hive metastore on Hbase as 
> opposed to conventional RDBMSs
> 
> I am currently involved with some hefty work with Hbase and Phoenix for batch 
> ingestion of trade data. As long as you define your Hbase table through 
> Phoenix and with secondary Phoenix indexes on Hbase, the speed is impressive.
> 
> I am not sure how much having Hbase as Hive metastore is going to add to Hive 
> performance. We use Oracle 12c as Hive metastore and the Hive database/schema 
> is built on solid state disks. Never had any issues with lock and concurrency.
> 
> Therefore I am not sure what one is going to gain by having Hbase as the Hive 
> metastore? I trust that we can still use our existing schemas on Oracle.
> 
> HTH
> 
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  


Re: Query consuming all resources

2016-09-28 Thread Jörn Franke
You need to configure queues in YARN and use the fair scheduler. From your use 
case it looks like you also need to configure preemption.
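
Once the queues exist, a Hive session can be pointed at one of them, e.g. (queue name is made up; assuming Tez as the engine):

SET tez.queue.name=adhoc;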

> On 28 Sep 2016, at 00:52, Jose Rozanec  wrote:
> 
> Hi, 
> 
> We have a Hive cluster. We notice that some queries consume all resources, 
> which is not desirable to us, since we want to grant some degree of 
> parallelism to incoming ones: any incoming query should be able to do at 
> least some progress, not just wait the big one finish.
> 
> Is there way to do so? We use Hive 2.1.0 with Tez engine.
> 
> Thank you in advance,
> 
> Joze.


Re: populating Hive table periodically from files on HDFS

2016-09-25 Thread Jörn Franke
I think what you propose makes sense. If you did a delta load you would not gain 
much performance benefit (most likely you would lose performance, because you 
need to figure out what has changed, and you have the typical issues of distributed 
systems: some changes may arrive later, error handling etc.), especially 
given the volumes.
You could partition at a finer grain, but given the volumes that is not really needed. 

> On 25 Sep 2016, at 12:32, Mich Talebzadeh  wrote:
> 
> 
> Hi,
> 
> I have trade data delivered through kafka and flume as csv files to HDFS. 
> There are 100 prices every 2 seconds so in a minute there are 3000 new rows, 
> 18K rows an hour and in a day 4,320,000 new rows.
> 
> Flume creates a new sub directory partition ever day in the format -MM-DD 
> like prices/2015-09-25 on HDFS
> 
> There is an external Hive table pointing to new directory by simply altering 
> external table location
> 
> ALTER TABLE ${DATABASE}.externalMarketData set location 
> 'hdfs://rhes564:9000/data/prices/${TODAY}';
> 
> This means that the external Hive table only points to the current directory.
> 
> The target internal table in Hive is partitioned by  DateStamp ="-MM-DD"
> 
> PARTITIONED BY (DateStamp  string)
> 
> to populate the Hive table a cron job runs every 15 minutes and does simply
> 
> INSERT OVERWRITE TABLE ${DATABASE}.marketData PARTITION (DateStamp = 
> "${TODAY}")
> SELECT
> '
> )
> FROM ${DATABASE}.externalMarketData
> 
> So effectively every 15 minutes today's partition is overwritten by new data 
> from the external table.
> 
> This seems to be OK. 
> 
> The other option is only add new rows since last time with INSERT INTO WHERE 
> rows do not exist in target table.
> 
> Any other suggestions?
> 
> 
> Thanks
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  


Re: need help with hive session getting killed

2016-09-20 Thread Jörn Franke
Increase the timeout, or let the result of the query be written into a dedicated table.
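
A minimal sketch of the second option (names are made up) - the long query materializes its result, and a later short-lived connection just reads it:

CREATE TABLE query_results AS
SELECT customer_id, COUNT(*) AS cnt
FROM big_events
GROUP BY customer_id;

-- later, from a fresh connection
SELECT * FROM query_results LIMIT 100;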

> On 20 Sep 2016, at 16:57, anup ahire  wrote:
> 
> 
> 
> 
> Hello,
> 
> I am using hive-jdbc-1.2.1 to run a query. Query runs around an hour and 
> eventually completes.
> But my hive session terminats before I can get results from completed query 
> and job fails.
> java.sql.SQLException: org.apache.http.NoHttpResponseException: The target 
> server failed to respond
> 
> at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:296) 
> ~[hive-jdbc-1.2.1.jar!/:1.2.1]
> 
> What is the best way to address this ?
> 
> 
> 
> Thanks ,
> 
> Anup
> 
> 


Re: Hive 2.x usage

2016-09-14 Thread Jörn Franke
If you are using a distribution (which you should if you go to production - 
Apache releases should not be used due to the maintainability, complexity and 
interaction with other components, such as Hadoop etc.) then wait until a 
distribution with 2.x is out. As far as I am aware there is currently no such 
distribution. As far as I know, Hortonworks and probably also Cloudera test 
their distributions on large-scale real production systems beforehand.

I would not use MR even with 1.x and would go for Tez (except if you are using some 
very specific outdated functionality). Spark is another option, but I do not see 
Hive on Spark as being as stable, and it has less functionality - this may change in 
the future.

> On 14 Sep 2016, at 22:36, RD  wrote:
> 
> Hi Folks,
>   We  (at my org) are currently planning our move to Hive-2.x. As part of 
> this I wanted to get a sense of how stable the Hive-2.x release is.  I 
> thought it would be good to conduct a brief survey on this. I've added a few 
> questions below. It would really be a ton of help if folks could provide 
> their feedback
> 
> * Are you using Hive-2.x at your org and at what scale?
> * Is the release stable enough? Did you notice any correctness issues?
> * MR is deprecated in Hive-2.x (Though almost all the qtests still use MR). 
> Are you still using MR with Hive-2.x?
> * Are you using the apache release or HDP ?
> 
> -Best,
> 
> 
> 
> 


Re: Hive or Pig - Which one gives best performance for reading HBase data

2016-09-14 Thread Jörn Franke
They should be rather similar. You may gain some performance by using Tez or Spark 
as the execution engine, but in an export scenario do not expect large performance 
improvements.
In any scenario, avoid having only one reducer; use several, e.g. by 
exporting to multiple output files instead of one. This avoids network load.
It might be worth checking Apache Pherf to export selected columns from HBase.
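
On the Hive side, a minimal sketch of such an export (the HBase-backed table and its columns are made up; select only the roughly 10 columns you need; the ROW FORMAT clause needs a reasonably recent Hive):

INSERT OVERWRITE DIRECTORY '/tmp/hbase_export'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
SELECT rowkey, col1, col2, col3 FROM hbase_backed_events;

The output directory will contain one file per task, which also avoids funnelling everything through a single reducer.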

> On 14 Sep 2016, at 19:19, Nagabhushanam Bheemisetty  
> wrote:
> 
> Hi,
> 
> I have a situation where I need to read data from huge HBase table and dump 
> it into other location as a flat file. I am not interested in all the columns 
> rather I need only lets 10 out of 100+ columns. So which technology Hive/Pig 
> gives better performance. I believe both of them will use serde' to extract 
> and build record.
> 
> Any thoughts?
> 
> Thanks


Re: Concurrency support of Apache Hive for streaming data ingest at 7K RPS into multiple tables

2016-08-24 Thread Jörn Franke
This is also a good option.

With respect to Hive transactional tables: I do not think they have been 
designed for massive inserts of single items. On the other hand, you would not 
insert a lot of events using single inserts in a relational database either. The same 
restrictions apply; it is not the use case you want to implement.


> On 24 Aug 2016, at 13:55, Kit Menke <kitme...@gmail.com> wrote:
> 
> Joel,
> Another option which you have is to use the Storm HDFS bolt to stream data 
> into Hive external tables. The external tables then get loaded into ORC 
> history tables for long term storage. We use this in a HDP cluster with 
> similar load so I know it works. :)
> 
> I'm with Jörn on this one. My impression of hive transactions is that it is a 
> new feature not totally ready for production.
> Thanks,
> Kit
> 
> 
>> On Aug 24, 2016 3:07 AM, "Joel Victor" <joelsvic...@gmail.com> wrote:
>> @Jörn: If I understood correctly even later versions of Hive won't be able 
>> to handle these kinds of workloads?
>> 
>>> On Wed, Aug 24, 2016 at 1:26 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>>> I think Hive especially these old versions have not been designed for this. 
>>> Why not store them in Hbase and run a oozie job regularly that puts them 
>>> all into Hive /Orc or parquet in a bulk job?
>>> 
>>>> On 24 Aug 2016, at 09:35, Joel Victor <joelsvic...@gmail.com> wrote:
>>>> 
>>>> Currently I am using Apache Hive 0.14 that ships with HDP 2.2. We are 
>>>> trying perform streaming ingestion with it.
>>>> We are using the Storm Hive bolt and we have 7 tables in which we are 
>>>> trying to insert. The RPS (requests per second) of our bolts ranges from 
>>>> 7000 to 5000 and our commit policies are configured accordingly i.e 100k 
>>>> events or 15 seconds.
>>>> 
>>>> We see that there are many commitTxn exceptions due to serialization 
>>>> errors in the metastore (we are using PostgreSQL 9.5 as metastore)
>>>> The serialization errors will cause the topology to start lagging in terms 
>>>> of events processed as it will try to reprocess the batches that have 
>>>> failed.
>>>> 
>>>> I have already backported this HIVE-10500 to 0.14 and there isn't much 
>>>> improvement.
>>>> I went through most of the JIRA's about transaction and I found the 
>>>> following HIVE-11948, HIVE-13013. I would like to backport them to 0.14.
>>>> Going through the patches gives me an impression that I need to mostly 
>>>> update the queries and transaction levels.
>>>> Do these patches also require me to update the schema in the metastore? 
>>>> Please also let me know if there are any other patches that I missed.
>>>> 
>>>> I would also like to know whether Apache Hive can handle inserts to the 
>>>> same/different tables concurrently from multiple clients in 1.2.1 or later 
>>>> versions without many serialization errors in Hive metastore?
>>>> 
>>>> -Joel


Re: Loading Sybase to hive using sqoop

2016-08-24 Thread Jörn Franke

Is your Sybase server ready to deliver a large amount of data? (Network, 
memory, CPU, parallel access, resources etc.) This is usually the bottleneck when 
loading data from a relational database, rather than Sqoop / MR or Spark. 
Then, you should use a recent Hive version and store the data in ORC or Parquet, 
compressed (e.g. Snappy), not in a text-based format.
Another alternative would be to use one of the export tools supplied with 
Sybase, export to a compressed file, put the file on HDFS and load it into 
Hive. This only makes sense if the Sybase export tool outperforms a 
JDBC connection (which can happen depending on the relational database).
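
A minimal sketch of the file-based route (paths, table and column names are made up):

-- external table over the exported, delimited files already on HDFS
CREATE EXTERNAL TABLE sybase_staging (
  id      BIGINT,
  payload STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/staging/sybase_export';

-- rewrite into a compressed ORC table for querying
CREATE TABLE sybase_orc
STORED AS ORC
TBLPROPERTIES ('orc.compress'='SNAPPY')
AS
SELECT id, payload FROM sybase_staging;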





> On 23 Aug 2016, at 21:48, Rahul Channe  wrote:
> 
> Hi All,
> 
> We are trying to load data from Sybase Iq table to hive using sqoop. The hive 
> table is partitioned and expecting to hold 29M records per day.
> 
> The sqoop job takes 7 hours to load 15 days of data, even while setting the 
> direct load option to 6. Hive is using MR framework.
> 
> Is there is way to speed up the process.
> 
> Note - the aim is to load 1 year of data


Re: HIVE on Windows

2016-08-24 Thread Jörn Franke
Only with a Hadoop distribution that supports Windows. Generally I recommend 
avoiding installing single pieces of the ecosystem; use a proper distribution instead. 
Their underlying default configuration is usually better and they have usually 
tested the integration properly. 

> On 24 Aug 2016, at 15:23, Ajay Chander <hadoopde...@gmail.com> wrote:
> 
> Hi,
> 
> Were you able to get Hive up and running on Windows machine ? I have 
> installed Hadoop on Windows now I want to install Hive too. I couldn't find 
> binaries to run on Windows machine. Can anyone tell me is it possible to run 
> Hive on Windows machine ? Thanks 
> 
>> On Wednesday, May 18, 2016, Me To <ektapaliwal2...@gmail.com> wrote:
>> Thanks so  much for replying:)
>> 
>> so without distribution, I will not able to do that?
>> 
>>> On Wed, May 18, 2016 at 12:27 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>>> Use a distribution, such as Hortonworks 
>>> 
>>> 
>>>> On 18 May 2016, at 19:09, Me To <ektapaliwal2...@gmail.com> wrote:
>>>> 
>>>> Hello,
>>>> 
>>>> I want to install hive on my windows machine but I am unable to find any 
>>>> resource out there. I am trying to set up it from one month but unable to 
>>>> accomplish that. I have successfully set up Hadoop on my windows machine. 
>>>> According to this guide 
>>>> 
>>>> https://cwiki.apache.org/confluence/display/Hive/AdminManual+Installation
>>>> 
>>>> There are different steps involved to install and run it on Windows but 
>>>> where are those steps documented? Please help me with this problem. I have 
>>>> posted this question in almost all forums like Stackoverflow but nobody 
>>>> knows the answer. 
>>>> 
>>>> I am using Windows 8 and Hadoop2.7 running on my desktop. I want to run 
>>>> hive and beeline. Please help me.
>>>> 
>>>> Looking forward for response.
>>>> 
>>>> Thank you.
>>>> Ekta Paliwal
>> 


Re: Concurrency support of Apache Hive for streaming data ingest at 7K RPS into multiple tables

2016-08-24 Thread Jörn Franke
I think Hive, especially these old versions, has not been designed for this. Why 
not store the events in HBase and run an Oozie job regularly that puts them all into 
Hive / ORC or Parquet in a bulk job?

> On 24 Aug 2016, at 09:35, Joel Victor  wrote:
> 
> Currently I am using Apache Hive 0.14 that ships with HDP 2.2. We are trying 
> perform streaming ingestion with it.
> We are using the Storm Hive bolt and we have 7 tables in which we are trying 
> to insert. The RPS (requests per second) of our bolts ranges from 7000 to 
> 5000 and our commit policies are configured accordingly i.e 100k events or 15 
> seconds.
> 
> We see that there are many commitTxn exceptions due to serialization errors 
> in the metastore (we are using PostgreSQL 9.5 as metastore)
> The serialization errors will cause the topology to start lagging in terms of 
> events processed as it will try to reprocess the batches that have failed.
> 
> I have already backported this HIVE-10500 to 0.14 and there isn't much 
> improvement.
> I went through most of the JIRA's about transaction and I found the following 
> HIVE-11948, HIVE-13013. I would like to backport them to 0.14.
> Going through the patches gives me an impression that I need to mostly update 
> the queries and transaction levels.
> Do these patches also require me to update the schema in the metastore? 
> Please also let me know if there are any other patches that I missed.
> 
> I would also like to know whether Apache Hive can handle inserts to the 
> same/different tables concurrently from multiple clients in 1.2.1 or later 
> versions without many serialization errors in Hive metastore?
> 
> -Joel


Re: hive concurrency not working

2016-08-06 Thread Jörn Franke
I am pretty sure that they will support it because the Spark option is 
supported.
Not sure where you get your information from though...

> On 05 Aug 2016, at 11:01, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
> 
> great in that case they can try it and I am pretty sure if they are stuck 
> they can come and ask you for expert advice since Hortonworks do not support 
> Hive on Spark and I know that
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
>> On 5 August 2016 at 09:01, Jörn Franke <jornfra...@gmail.com> wrote:
>> That is not correct the option is there to install it.
>> 
>>> On 05 Aug 2016, at 08:41, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>> 
>>> You won't have this problem if you use Spark as the execution engine! This 
>>> set up handles concurrency but Hive with Spark is not part of the HW distro.
>>> 
>>> HTH
>>> 
>>> Dr Mich Talebzadeh
>>>  
>>> LinkedIn  
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>  
>>> http://talebzadehmich.wordpress.com
>>> 
>>> Disclaimer: Use it at your own risk. Any and all responsibility for any 
>>> loss, damage or destruction of data or any other property which may arise 
>>> from relying on this email's technical content is explicitly disclaimed. 
>>> The author will in no case be liable for any monetary damages arising from 
>>> such loss, damage or destruction.
>>>  
>>> 
>>>> On 5 August 2016 at 07:39, Mich Talebzadeh <mich.talebza...@gmail.com> 
>>>> wrote:
>>>> you won't have this problem if you use Spark as the execution engine? That 
>>>> handles concurrency OK
>>>> 
>>>> Dr Mich Talebzadeh
>>>>  
>>>> LinkedIn  
>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>  
>>>> http://talebzadehmich.wordpress.com
>>>> 
>>>> Disclaimer: Use it at your own risk. Any and all responsibility for any 
>>>> loss, damage or destruction of data or any other property which may arise 
>>>> from relying on this email's technical content is explicitly disclaimed. 
>>>> The author will in no case be liable for any monetary damages arising from 
>>>> such loss, damage or destruction.
>>>>  
>>>> 
>>>>> On 5 August 2016 at 06:23, Raj hadoop <raj.had...@gmail.com> wrote:
>>>>> Thanks everyone..
>>>>> 
>>>>> we are raising case with Hortonworks
>>>>> 
>>>>>> On Wed, Aug 3, 2016 at 6:44 PM, Raj hadoop <raj.had...@gmail.com> wrote:
>>>>>> Dear All,
>>>>>> 
>>>>>> In need or your help,
>>>>>> 
>>>>>> we have horton works 4 node cluster,and the problem is hive is allowing 
>>>>>> only one user at a time,
>>>>>> 
>>>>>> if any second resource need to login hive is not working,
>>>>>> 
>>>>>> could someone please help me in this
>>>>>> 
>>>>>> Thanks,
>>>>>> Rajesh
> 


Re: hive concurrency not working

2016-08-05 Thread Jörn Franke
Depends on how you configured scheduling in yarn ...

> On 05 Aug 2016, at 08:39, Mich Talebzadeh  wrote:
> 
> you won't have this problem if you use Spark as the execution engine? That 
> handles concurrency OK
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
>> On 5 August 2016 at 06:23, Raj hadoop  wrote:
>> Thanks everyone..
>> 
>> we are raising case with Hortonworks
>> 
>>> On Wed, Aug 3, 2016 at 6:44 PM, Raj hadoop  wrote:
>>> Dear All,
>>> 
>>> In need or your help,
>>> 
>>> we have horton works 4 node cluster,and the problem is hive is allowing 
>>> only one user at a time,
>>> 
>>> if any second resource need to login hive is not working,
>>> 
>>> could someone please help me in this
>>> 
>>> Thanks,
>>> Rajesh
> 


Re: hive concurrency not working

2016-08-05 Thread Jörn Franke
That is not correct; the option is there to install it.

> On 05 Aug 2016, at 08:41, Mich Talebzadeh  wrote:
> 
> You won't have this problem if you use Spark as the execution engine! This 
> set up handles concurrency but Hive with Spark is not part of the HW distro.
> 
> HTH
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
>> On 5 August 2016 at 07:39, Mich Talebzadeh  wrote:
>> you won't have this problem if you use Spark as the execution engine? That 
>> handles concurrency OK
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> http://talebzadehmich.wordpress.com
>> 
>> Disclaimer: Use it at your own risk. Any and all responsibility for any 
>> loss, damage or destruction of data or any other property which may arise 
>> from relying on this email's technical content is explicitly disclaimed. The 
>> author will in no case be liable for any monetary damages arising from such 
>> loss, damage or destruction.
>>  
>> 
>>> On 5 August 2016 at 06:23, Raj hadoop  wrote:
>>> Thanks everyone..
>>> 
>>> we are raising case with Hortonworks
>>> 
 On Wed, Aug 3, 2016 at 6:44 PM, Raj hadoop  wrote:
 Dear All,
 
 In need or your help,
 
 we have horton works 4 node cluster,and the problem is hive is allowing 
 only one user at a time,
 
 if any second resource need to login hive is not working,
 
 could someone please help me in this
 
 Thanks,
 Rajesh
> 


Re: Vectorised Query Execution extension

2016-08-04 Thread Jörn Franke
Even if it were possible, it would only make sense up to a certain limit given by your 
CPU and CPU caches.

> On 04 Aug 2016, at 22:57, Mich Talebzadeh  wrote:
> 
> As I understand from the manual:
> 
> Vectorized query execution is a Hive feature that greatly reduces the CPU 
> usage for typical query operations like scans, filters, aggregates, and 
> joins. A standard query execution system processes one row at a time. This 
> involves long code .. Vectorized query execution streamlines operations 
> by processing a block of 1024 rows at a time. Within the block, each column 
> is stored as a vector (an array of a primitive data type).
> 
> As fart as I can see Vectorized query execution (VQE) can be applied to most 
> columns and sql operations. Is it therefore possible to extend it beyond 1024 
> rows to include the whole column in table?
> 
> VQE would be very useful especially with ORC as it basically means that one 
> can process the whole column separately thus improving performance of the 
> query.
> 
> HTH
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  


Re: hive concurrency not working

2016-08-03 Thread Jörn Franke
You need to configure the YARN scheduler (fair or capacity, depending on your needs).
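
A minimal capacity-scheduler.xml sketch with two queues (the queue names and the 50/50 
split are only examples, size them for your workload):

<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>default,hive</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.default.capacity</name>
  <value>50</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.hive.capacity</name>
  <value>50</value>
</property>

With the fair scheduler you would define the queues in the allocation file (typically 
fair-scheduler.xml) instead.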

> On 03 Aug 2016, at 15:14, Raj hadoop  wrote:
> 
> Dear All,
> 
> In need or your help,
> 
> we have horton works 4 node cluster,and the problem is hive is allowing only 
> one user at a time,
> 
> if any second resource need to login hive is not working,
> 
> could someone please help me in this
> 
> Thanks,
> Rajesh


Re: Doubt on Hive Partitioning.

2016-08-02 Thread Jörn Franke
Partition pruning also works with older Hive versions, but you have to put the filter 
in the join clause and not only in the where clause.
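
For example, a sketch with hypothetical tables fact f (partitioned on p) and dim d:

SELECT f.col1, d.col2
FROM fact f
JOIN dim d
  ON  f.p = d.p
  AND f.p = '201604'   -- repeating the partition filter here lets older versions prune
WHERE f.p = '201604';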

> On 02 Aug 2016, at 09:53, Furcy Pin <furcy@flaminem.com> wrote:
> 
> I'm using Hive 1.1 on MR and dynamic partition pruning does not seem to work.
> 
> Since MR is deprecated in 2.0, I assume we should not expect any future perf 
> optimisation on this side.
> 
> It has been implemented for Hive on Spark, though.
> https://issues.apache.org/jira/browse/HIVE-9152
> 
> 
> 
> 
>> On Tue, Aug 2, 2016 at 3:45 AM, Qiuzhuang Lian <qiuzhuang.l...@gmail.com> 
>> wrote:
>> Is this partition pruning fixed in MR too except for TEZ in newer hive 
>> version?
>> 
>> Regards,
>> Q
>> 
>>> On Mon, Aug 1, 2016 at 8:48 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>>> It happens in old hive version of the filter is only in the where clause 
>>> and NOT in the join clause. This should not happen in newer hive version. 
>>> You can check it by executing explain dependency query. 
>>> 
>>>> On 01 Aug 2016, at 11:07, Abhishek Dubey <abhishek.du...@xoriant.com> 
>>>> wrote:
>>>> 
>>>> Hi All,
>>>> 
>>>>  
>>>> 
>>>> I have a very big table t with billions of rows and it is partitioned on a 
>>>> column p. Column p  has datatype text and values like ‘201601’, 
>>>> ‘201602’…upto ‘201612’.
>>>> 
>>>> And, I am running a query like : Select columns from t where p=’201604’.
>>>> 
>>>>  
>>>> 
>>>> My question is : Can there be a scenario/condition/probability that my 
>>>> query will do a complete table scan on t instead of only reading data for 
>>>> specified partition key. If yes, please put some light on those scenario.
>>>> 
>>>>  
>>>> 
>>>> I’m asking this because someone told me that there is a probability that 
>>>> the query will ignore the partitioning and do a complete table scan to 
>>>> fetch output.
>>>> 
>>>>  
>>>> 
>>>> 
>>>> Thanks & Regards,
>>>> Abhishek Dubey
> 


Re: Doubt on Hive Partitioning.

2016-08-02 Thread Jörn Franke
I do not think so, but I have never tested it.

> On 02 Aug 2016, at 03:45, Qiuzhuang Lian <qiuzhuang.l...@gmail.com> wrote:
> 
> Is this partition pruning fixed in MR too except for TEZ in newer hive 
> version?
> 
> Regards,
> Q
> 
>> On Mon, Aug 1, 2016 at 8:48 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>> It happens in old hive version of the filter is only in the where clause and 
>> NOT in the join clause. This should not happen in newer hive version. You 
>> can check it by executing explain dependency query. 
>> 
>>> On 01 Aug 2016, at 11:07, Abhishek Dubey <abhishek.du...@xoriant.com> wrote:
>>> 
>>> Hi All,
>>> 
>>>  
>>> 
>>> I have a very big table t with billions of rows and it is partitioned on a 
>>> column p. Column p  has datatype text and values like ‘201601’, 
>>> ‘201602’…upto ‘201612’.
>>> 
>>> And, I am running a query like : Select columns from t where p=’201604’.
>>> 
>>>  
>>> 
>>> My question is : Can there be a scenario/condition/probability that my 
>>> query will do a complete table scan on t instead of only reading data for 
>>> specified partition key. If yes, please put some light on those scenario.
>>> 
>>>  
>>> 
>>> I’m asking this because someone told me that there is a probability that 
>>> the query will ignore the partitioning and do a complete table scan to 
>>> fetch output.
>>> 
>>>  
>>> 
>>> 
>>> Thanks & Regards,
>>> Abhishek Dubey
> 


Re: Doubt on Hive Partitioning.

2016-08-01 Thread Jörn Franke
It happens in old Hive versions if the filter is only in the where clause and NOT in 
the join clause. This should not happen in newer Hive versions. You can check it by 
executing an EXPLAIN DEPENDENCY query.
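
For example, using the query from the question below (replace "columns" with the real 
column list):

EXPLAIN DEPENDENCY
SELECT columns FROM t WHERE p = '201604';

The output lists the input tables and input partitions of the query; if only the 
partition p='201604' shows up, pruning works as expected.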

> On 01 Aug 2016, at 11:07, Abhishek Dubey  wrote:
> 
> Hi All,
>  
> I have a very big table t with billions of rows and it is partitioned on a 
> column p. Column p  has datatype text and values like ‘201601’, ‘201602’…upto 
> ‘201612’.
> And, I am running a query like : Select columns from t where p=’201604’.
>  
> My question is : Can there be a scenario/condition/probability that my query 
> will do a complete table scan on t instead of only reading data for specified 
> partition key. If yes, please put some light on those scenario.
>  
> I’m asking this because someone told me that there is a probability that the 
> query will ignore the partitioning and do a complete table scan to fetch 
> output.
>  
> Thanks & Regards,
> Abhishek Dubey
>  


Re: Hive External Storage Handlers

2016-07-19 Thread Jörn Franke
The main reason is that if you compile it yourself then nobody else can understand 
what you did. With a distribution, anyone can download the same packages and follow 
what you did. As far as I recall you had described several problems that the 
distributions did not have (e.g. you could not compile Tez, Spark was only available 
in an outdated version, etc.). Furthermore, the distributions provide a clear 
configuration baseline for several complex pieces of software.

Hence, even for production use, a self-compiled version of something as complex as the 
Hadoop/Hive/Spark toolkit is clearly a no-go.

> On 19 Jul 2016, at 08:25, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
> 
> 
> 
> "So not use a self-compiled hive or Spark version, but only the ones supplied 
> by distributions (cloudera, Hortonworks, Bigtop...) You will face performance 
> problems, strange errors etc when building and testing your code using 
> self-compiled versions."
> 
> This comment does not make sense and is meaningless without any evidence. 
> Either you provide evidence that you have done this work and you encountered 
> errors or better not mention it. Sounds like scaremongering.
> 
> 
> 
> 
> 
> 
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
>> On 19 July 2016 at 06:51, Jörn Franke <jornfra...@gmail.com> wrote:
>> So not use a self-compiled hive or Spark version, but only the ones supplied 
>> by distributions (cloudera, Hortonworks, Bigtop...) You will face 
>> performance problems, strange errors etc when building and testing your code 
>> using self-compiled versions.
>> 
>> If you use the Hive APIs then the engine should not be relevant for your 
>> storage handler. Nevertheless, the APIs of the storage handler might have 
>> changed. 
>> 
>> However, I wonder why a 1-1 mapping does not work for you.
>> 
>>> On 18 Jul 2016, at 22:46, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>> 
>>> Hi,
>>> 
>>> You can move up to Hive 2 that works fine and pretty stable. You can opt 
>>> for Hive 1.2.1 if yoy wish.
>>> 
>>> If you want to use Spark (the replacement for Shark) as the execution 
>>> engine for Hive then the version that works (that I have managed to make it 
>>> work with Hive is Spark 1.3.1) that you will need to build from source.
>>> 
>>> It works and it is table.
>>> 
>>> Otherwise you may decide to use Spark Thrift Server (STS) that allows JDBC 
>>> access to Spark SQL (through beeline, Squirrel , Zeppelin) that has Hive 
>>> SQL context built into it as if you were using Hive Thrift Server (HSS)
>>> 
>>> HTH
>>> 
>>> 
>>> Dr Mich Talebzadeh
>>>  
>>> LinkedIn  
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>  
>>> http://talebzadehmich.wordpress.com
>>> 
>>> Disclaimer: Use it at your own risk. Any and all responsibility for any 
>>> loss, damage or destruction of data or any other property which may arise 
>>> from relying on this email's technical content is explicitly disclaimed. 
>>> The author will in no case be liable for any monetary damages arising from 
>>> such loss, damage or destruction.
>>>  
>>> 
>>>> On 18 July 2016 at 21:38, Lavelle, Shawn <shawn.lave...@osii.com> wrote:
>>>> Hello,
>>>> 
>>>>  
>>>> 
>>>> I am working with an external storage handler written for Hive 0.11 
>>>> and run on a Shark execution engine.  I’d like to move forward and upgrade 
>>>> to hive 1.2.1 on spark 1.6 or even 2.0.  
>>>> 
>>>>This storage has a need to run queries across tables existing in 
>>>> different databases in the external data store, so existing drivers that 
>>>> map hive to external storage in 1 to 1 mappings are insufficient. I have 
>>>> attempted this upgrade already, but found out that predicate pushdown was 
>>>> not occurring.  Was this changed in 1.2?
>>>> 
>>>>Can I update and use the same storage handler in Hive or has this 
>>>> concept been replaced by the RDDs and DataFrame API?  
>>>>
>>>> 
>>>>Are these questions better for the Spark list?
>>>> 
>>>>  
>>>> 
>>>>Thank you,
>>>> 
>>>>  
>>>> 
>>>> ~ Shawn M Lavelle
>>>> 
>>>>  
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> Shawn Lavelle
>>>> Software Development
>>>> 
>>>> 4101 Arrowhead Drive
>>>> Medina, Minnesota 55340-9457
>>>> Phone: 763 551 0559
>>>> Fax: 763 551 0750
>>>> Email: shawn.lave...@osii.com
>>>> Website: www.osii.com
> 


Re: hive external table on gzip

2016-07-19 Thread Jörn Franke
Gzip is handled transparently by Hive (at least by the formats that ship with Hive; 
for a custom format it depends on that format). What format is the data in (CSV? 
JSON?)? Depending on that you simply choose the corresponding SerDe and the 
decompression happens transparently. Keep in mind that gzip is not splittable, which 
means it cannot be decompressed in parallel. Try to go for bzip2 to enable parallel 
decompression, or split the large file into several smaller files (at minimum the size 
of an HDFS block).
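
A minimal sketch for the CSV case, assuming comma-separated text files with a .gz 
extension sitting in one HDFS directory (column names and path are made up):

CREATE EXTERNAL TABLE logs_gz (
  ts  string,
  msg string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/logs';

-- the .gz files are decompressed on the fly when the table is read;
-- no extra compression setting is needed for reading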

> On 19 Jul 2016, at 13:03, Amatucci, Mario, Vodafone Group 
>  wrote:
> 
>  
> Hi I have huge gzip on hdfs and |I’d like to create an external table on top 
> of them
> Any code example? Cheers
> Ps
> I cannot use snappy or lzo for some constraints
>  
> --
> Kind regards
> Mario Amatucci
> CG TB PS GDC PRAGUE THINK BIG
>  


Re: Hive External Storage Handlers

2016-07-18 Thread Jörn Franke
Do not use a self-compiled Hive or Spark version, but only the ones supplied by 
distributions (Cloudera, Hortonworks, Bigtop...). You will face performance problems, 
strange errors etc. when building and testing your code using self-compiled versions.

If you use the Hive APIs then the engine should not be relevant for your 
storage handler. Nevertheless, the APIs of the storage handler might have 
changed. 

However, I wonder why a 1-1 mapping does not work for you.

> On 18 Jul 2016, at 22:46, Mich Talebzadeh  wrote:
> 
> Hi,
> 
> You can move up to Hive 2 that works fine and pretty stable. You can opt for 
> Hive 1.2.1 if yoy wish.
> 
> If you want to use Spark (the replacement for Shark) as the execution engine 
> for Hive then the version that works (that I have managed to make it work 
> with Hive is Spark 1.3.1) that you will need to build from source.
> 
> It works and it is table.
> 
> Otherwise you may decide to use Spark Thrift Server (STS) that allows JDBC 
> access to Spark SQL (through beeline, Squirrel , Zeppelin) that has Hive SQL 
> context built into it as if you were using Hive Thrift Server (HSS)
> 
> HTH
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
>> On 18 July 2016 at 21:38, Lavelle, Shawn  wrote:
>> Hello,
>> 
>>  
>> 
>> I am working with an external storage handler written for Hive 0.11 and 
>> run on a Shark execution engine.  I’d like to move forward and upgrade to 
>> hive 1.2.1 on spark 1.6 or even 2.0.  
>> 
>>This storage has a need to run queries across tables existing in 
>> different databases in the external data store, so existing drivers that map 
>> hive to external storage in 1 to 1 mappings are insufficient. I have 
>> attempted this upgrade already, but found out that predicate pushdown was 
>> not occurring.  Was this changed in 1.2?
>> 
>>Can I update and use the same storage handler in Hive or has this concept 
>> been replaced by the RDDs and DataFrame API?  
>>
>> 
>>Are these questions better for the Spark list?
>> 
>>  
>> 
>>Thank you,
>> 
>>  
>> 
>> ~ Shawn M Lavelle
>> 
>>  
>> 
>> 
>> 
>> 
>> 
>> Shawn Lavelle
>> Software Development
>> 
>> 4101 Arrowhead Drive
>> Medina, Minnesota 55340-9457
>> Phone: 763 551 0559
>> Fax: 763 551 0750
>> Email: shawn.lave...@osii.com
>> Website: www.osii.com
> 


Re: Hive on TEZ + LLAP

2016-07-15 Thread Jörn Franke
I would recommend a distribution such as Hortonworks where everything is already 
configured. As far as I know LLAP is currently not part of any distribution.
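
Once Tez itself is installed and configured (which such a distribution takes care of), 
switching the engine for a session is a one-liner:

set hive.execution.engine=tez;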

> On 15 Jul 2016, at 17:04, Ashok Kumar  wrote:
> 
> Hi,
> 
> Has anyone managed to make Hive work with Tez + LLAP as the query engine in 
> place of Map-reduce please?
> 
> If you configured it yourself which version of Tez and LLAP work with Hive 2. 
> Do I need to build Tez from source for example
> 
> Thanks


Re: Any way in hive to have functionality like SQL Server collation on Case sensitivity

2016-07-14 Thread Jörn Franke
I think both are the same.
Can you elaborate a little bit more on your use case, e.g. a query you currently run 
and what the exact issue is?

> On 14 Jul 2016, at 09:36, Markovitz, Dudu <dmarkov...@paypal.com> wrote:
> 
> Are you referring to ‘java_method‘ (or ‘reflect’)?
>  
> e.g.
>  
> hive> select java_method  ('java.lang.Math','min',45,9)  ;
> 9
>  
> I’m not sure how it serves out purpose.
>  
> Dudu
>  
> From: Jörn Franke [mailto:jornfra...@gmail.com] 
> Sent: Thursday, July 14, 2016 8:55 AM
> To: user@hive.apache.org
> Subject: Re: Any way in hive to have functionality like SQL Server collation 
> on Case sensitivity
>  
> 
> You can use use any Java function in Hive without (!) the need to wrap it in 
> an UDF via the reflect command. 
> however not sure if this meets your use case.
>  
> 
> 
> Sent from my iPhone
> On 13 Jul 2016, at 19:50, Markovitz, Dudu <dmarkov...@paypal.com> wrote:
> 
> Hi
>  
> I’m personally not aware of other methods to achieve case insensitivity 
> comparison but to use lower() / upper()
>  
> Dudu
>  
> From: Mahender Sarangam [mailto:mahender.bigd...@outlook.com] 
> Sent: Wednesday, July 13, 2016 12:56 AM
> To: user@hive.apache.org
> Subject: Re: Any way in hive to have functionality like SQL Server collation 
> on Case sensitivity
>  
> Thanks Dudu,
> 
> I would like to know dealing with case in-sensitivity in other project. is 
> every one converting to toLower() or toUpper() in the Joins ? . Is there any 
> setting applied at Hive Server level which gets reflected in all the queries ?
> 
>  
> 
> /MS
> 
>  
> On 5/25/2016 9:05 AM, Markovitz, Dudu wrote:
> It will not be suitable for JOIN operation since it will cause a Cartesian 
> product.
> Any chosen solution should determine a single representation for any given 
> string.
>  
> Dudu
>  
> From: Mich Talebzadeh [mailto:mich.talebza...@gmail.com] 
> Sent: Wednesday, May 25, 2016 1:31 AM
> To: user <user@hive.apache.org>
> Subject: Re: Any way in hive to have functionality like SQL Server collation 
> on Case sensitivity
>  
> I would rather go for something like compare() that allows one to directly 
> compare two character strings based on alternate collation rules.
>  
> Hive does not have it. This is from SAP ASE
>  
> 1> select compare ("aaa","bbb")
> 2> go
>  ---
>   -1
> (1 row affected)
> 1> select compare ("aaa","Aaa")
> 2> go
>  ---
>1
> (1 row affected)
> 
> 1> select compare ("aaa","AAA")
> 2> go
>  ---
>1
> ·  The compare function returns the following values, based on the collation 
> rules that you chose:
> 
> · 1 – indicates that char_expression1 or uchar_expression1 is greater 
> than char_expression2 or uchar_expression2.
> 
> · 0 – indicates that char_expression1 or uchar_expression1 is equal 
> to char_expression2 or uchar_expression2.
> 
> · -1 – indicates that char_expression1 or uchar_expression1 is less 
> than char_expression2 or uchar expression2.
> 
>  
> hive> select compare("aaa", "bbb");
> FAILED: SemanticException [Error 10011]: Line 1:7 Invalid function 'compare'
>  
> HTH
> 
>  
>  
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
>  
>  
> On 24 May 2016 at 21:15, mahender bigdata <mahender.bigd...@outlook.com> 
> wrote:
> Hi,
> 
> We would like to have feature in Hive where string comparison should ignore 
> case sensitivity while joining on String Columns in hive. This feature helps 
> us in reducing code of calling Upper or Lower function on Join columns. If it 
> is already there, please let me know settings to enable this feature.
> 
> /MS
> 
>  
>  


Re: Any way in hive to have functionality like SQL Server collation on Case sensitivity

2016-07-13 Thread Jörn Franke

You can use any Java function in Hive without (!) the need to wrap it in a UDF, via 
the reflect command.
However, I am not sure if this meets your use case.
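
A small sketch of the syntax (the Math call is arbitrary; any static Java method on 
the classpath can be invoked this way, and java_method is an alias for reflect):

select reflect('java.lang.Math', 'min', 45, 9);   -- returns 9

For the case-insensitive comparison itself you would still need a suitable static 
method; otherwise lower()/upper() on the join keys remains the simplest option.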



Sent from my iPhone
> On 13 Jul 2016, at 19:50, Markovitz, Dudu  wrote:
> 
> Hi
>  
> I’m personally not aware of other methods to achieve case insensitivity 
> comparison but to use lower() / upper()
>  
> Dudu
>  
> From: Mahender Sarangam [mailto:mahender.bigd...@outlook.com] 
> Sent: Wednesday, July 13, 2016 12:56 AM
> To: user@hive.apache.org
> Subject: Re: Any way in hive to have functionality like SQL Server collation 
> on Case sensitivity
>  
> Thanks Dudu,
> 
> I would like to know dealing with case in-sensitivity in other project. is 
> every one converting to toLower() or toUpper() in the Joins ? . Is there any 
> setting applied at Hive Server level which gets reflected in all the queries ?
> 
>  
> 
> /MS
> 
>  
> On 5/25/2016 9:05 AM, Markovitz, Dudu wrote:
> It will not be suitable for JOIN operation since it will cause a Cartesian 
> product.
> Any chosen solution should determine a single representation for any given 
> string.
>  
> Dudu
>  
> From: Mich Talebzadeh [mailto:mich.talebza...@gmail.com] 
> Sent: Wednesday, May 25, 2016 1:31 AM
> To: user 
> Subject: Re: Any way in hive to have functionality like SQL Server collation 
> on Case sensitivity
>  
> I would rather go for something like compare() that allows one to directly 
> compare two character strings based on alternate collation rules.
>  
> Hive does not have it. This is from SAP ASE
>  
> 1> select compare ("aaa","bbb")
> 2> go
>  ---
>   -1
> (1 row affected)
> 1> select compare ("aaa","Aaa")
> 2> go
>  ---
>1
> (1 row affected)
> 
> 1> select compare ("aaa","AAA")
> 2> go
>  ---
>1
> ·  The compare function returns the following values, based on the collation 
> rules that you chose:
> 
> · 1 – indicates that char_expression1 or uchar_expression1 is greater 
> than char_expression2 or uchar_expression2.
> 
> · 0 – indicates that char_expression1 or uchar_expression1 is equal 
> to char_expression2 or uchar_expression2.
> 
> · -1 – indicates that char_expression1 or uchar_expression1 is less 
> than char_expression2 or uchar expression2.
> 
>  
> hive> select compare("aaa", "bbb");
> FAILED: SemanticException [Error 10011]: Line 1:7 Invalid function 'compare'
>  
> HTH
> 
>  
>  
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
>  
>  
> On 24 May 2016 at 21:15, mahender bigdata  
> wrote:
> Hi,
> 
> We would like to have feature in Hive where string comparison should ignore 
> case sensitivity while joining on String Columns in hive. This feature helps 
> us in reducing code of calling Upper or Lower function on Join columns. If it 
> is already there, please let me know settings to enable this feature.
> 
> /MS
> 
>  
>  


Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-12 Thread Jörn Franke

I think the comparison with the Oracle RDBMS and Oracle TimesTen is not so good. There 
are times when the in-memory database of Oracle is slower than the RDBMS (especially 
in the case of Exadata), because in-memory - as in Spark - means everything is in 
memory and everything is always processed (no storage indexes, no bloom filters etc.), 
which explains this behavior quite well.

Hence, I do not agree with the statement that Tez is basically MR with DAG (or that 
LLAP is basically in-memory, which is also not correct). This is a wrong 
oversimplification and I do not think it is useful for the community; better is to 
understand when something can be used and when not. In-memory is also not the solution 
to everything, and if you look for example behind SAP HANA or NoSQL there is much more 
around this, which is not even on the roadmap of Spark.

Anyway, discovering good use case patterns should be done on standardized benchmarks 
going beyond the simple select count etc.

> On 12 Jul 2016, at 11:16, Mich Talebzadeh  wrote:
> 
> That is only a plan not what execution engine is doing.
> 
> As I stated before Spark uses DAG + in-memory computing. MR is serial on 
> disk. 
> 
> The key is the execution here or rather the execution engine.
> 
> In general
> 
> The standard MapReduce  as I know reads the data from HDFS, apply map-reduce 
> algorithm and writes back to HDFS. If there are many iterations of map-reduce 
> then, there will be many intermediate writes to HDFS. This is all serial 
> writes to disk. Each map-reduce step is completely independent of other 
> steps, and the executing engine does not have any global knowledge of what 
> map-reduce steps are going to come after each map-reduce step. For many 
> iterative algorithms this is inefficient as the data between each map-reduce 
> pair gets written and read from the file system.
> 
> The equivalent to parallelism in Big Data is deploying what is known as 
> Directed Acyclic Graph (DAG) algorithm. In a nutshell deploying DAG results 
> in a fuller picture of global optimisation by deploying parallelism, 
> pipelining consecutive map steps into one and not writing intermediate data 
> to HDFS. So in short this prevents writing data back and forth after every 
> reduce step which for me is a significant improvement, compared to the 
> classical MapReduce algorithm.
> 
> Now Tez is basically MR with DAG. With Spark you get DAG + in-memory 
> computing. Think of it as a comparison between a classic RDBMS like Oracle 
> and IMDB like Oracle TimesTen with in-memory processing.
> 
> The outcome is that Hive using Spark as execution engine is pretty 
> impressive. You have the advantage of Hive CBO + In-memory computing. If you 
> use Spark for all this (say Spark SQL) but no Hive, Spark uses its own 
> optimizer called Catalyst that does not have CBO yet plus in memory computing.
> 
> As usual your mileage varies.
> 
> HTH
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
>> On 12 July 2016 at 09:33, Markovitz, Dudu  wrote:
>> I don’t see how this explains the time differences.
>> 
>>  
>> 
>> Dudu
>> 
>>  
>> 
>> From: Mich Talebzadeh [mailto:mich.talebza...@gmail.com] 
>> Sent: Tuesday, July 12, 2016 10:56 AM
>> To: user 
>> Cc: user @spark 
>> 
>> 
>> Subject: Re: Using Spark on Hive with Hive also using Spark as its execution 
>> engine
>>  
>> 
>> This the whole idea. Spark uses DAG + IM, MR is classic
>> 
>>  
>> 
>>  
>> 
>> This is for Hive on Spark
>> 
>>  
>> 
>> hive> explain select max(id) from dummy_parquet;
>> OK
>> STAGE DEPENDENCIES:
>>   Stage-1 is a root stage
>>   Stage-0 depends on stages: Stage-1
>> 
>> STAGE PLANS:
>>   Stage: Stage-1
>> Spark
>>   Edges:
>> Reducer 2 <- Map 1 (GROUP, 1)
>>   DagName: hduser_20160712083219_632c2749-7387-478f-972d-9eaadd9932c6:1
>>   Vertices:
>> Map 1
>> Map Operator Tree:
>> TableScan
>>   alias: dummy_parquet
>>   Statistics: Num rows: 1 Data size: 7 Basic 
>> stats: COMPLETE Column stats: NONE
>>   Select Operator
>> expressions: id (type: int)
>> outputColumnNames: id
>> Statistics: Num rows: 1 Data size: 7 
>> Basic stats: COMPLETE Column stats: NONE
>> Group By Operator
>>   aggregations: 

Re: Trouble trying to get started with hive

2016-07-11 Thread Jörn Franke
Please use a Hadoop distribution to avoid these configuration issues (in the 
beginning).

> On 05 Jul 2016, at 12:06, Kari Pahula  wrote:
> 
> Hi. I'm trying to familiarize myself with Hadoop and various projects related 
> to it.
> 
> I've been following 
> https://cwiki.apache.org/confluence/display/Hive/GettingStarted
> 
> I'd like to start by giving a bit of feedback from the bits I have got to 
> work.
> 
> Should the document tell about using schematool?  The first hive command I 
> tried was hive, since that was what the guide had as the first command. That 
> didn't work without running schematool first, but I got more errors when I 
> tried schematool since hive had already created the metastore_db directory 
> and schematool got confused by that. I had to look up what went wrong and 
> found my answers from stackexhange. In the end, I removed metastore_db and 
> ran "bin/schematool -initSchema -dbType derby".
> 
> After that, the next problem I ran into was, when I tried to connect to the 
> database with beeline:
> Error: Failed to open new session: java.lang.RuntimeException: 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.authorize.AuthorizationException):
>  User: kaol is not allowed to impersonate anonymous (state=,code=0)
> 
> Again, I had to go and figure that I needed to have a conf/core-site.xml like 
> this:
> 
> 
> 
>   
> hadoop.proxyuser.hive.groups
> *
>   
>   
> hadoop.proxyuser.hive.hosts
> *
>   
> 
> 
> Should this have worked out of box?
> 
> This is as far as I've gotten and now I'm stumped with trying to create a 
> table. My command is "create table users (user_id int, item_id int, rating 
> int, stamp int);" but all I get as a response is
> Error: Error while processing statement: FAILED: Execution Error, return code 
> 1 from org.apache.hadoop.hive.ql.exec.DDLTask. 
> MetaException(message:file:/user/hive/warehouse/users is not a directory or 
> unable to create one) (state=08S01,code=1)
> 
> I've searched for this error and what I've found is that it's a permission 
> error. However, I've still not found what I should make /user/hive/warehouse 
> to make it work.  Last I tried was to have the directory world writable, to 
> no effect. What could I try next?


Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-11 Thread Jörn Franke
> Remember:
>> 
>> Spark -> DAG + in-memory caching
>> TEZ = MR on DAG
>> TEZ + LLAP => DAG + in-memory caching
>> 
>> OK it is another way getting the same result. However, my concerns:
>> 
>> Spark has a wide user base. I judge this from Spark user group traffic
>> TEZ user group has no traffic I am afraid
>> LLAP I don't know
>> Sounds like Hortonworks promote TEZ and Cloudera does not want to know 
>> anything about Hive. and they promote Impala but that sounds like a sinking 
>> ship these days.
>> 
>> Having said that I will try TEZ + LLAP :) No pun intended
>> 
>> Regards
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> http://talebzadehmich.wordpress.com
>>  
>> 
>> On 31 May 2016 at 08:19, Jörn Franke <jornfra...@gmail.com> wrote:
>> Thanks very interesting explanation. Looking forward to test it.
>> 
>> > On 31 May 2016, at 07:51, Gopal Vijayaraghavan <gop...@apache.org> wrote:
>> >
>> >
>> >> That being said all systems are evolving. Hive supports tez+llap which
>> >> is basically the in-memory support.
>> >
>> > There is a big difference between where LLAP & SparkSQL, which has to do
>> > with access pattern needs.
>> >
>> > The first one is related to the lifetime of the cache - the Spark RDD
>> > cache is per-user-session which allows for further operation in that
>> > session to be optimized.
>> >
>> > LLAP is designed to be hammered by multiple user sessions running
>> > different queries, designed to automate the cache eviction & selection
>> > process. There's no user visible explicit .cache() to remember - it's
>> > automatic and concurrent.
>> >
>> > My team works with both engines, trying to improve it for ORC, but the
>> > goals of both are different.
>> >
>> > I will probably have to write a proper academic paper & get it
>> > edited/reviewed instead of send my ramblings to the user lists like this.
>> > Still, this needs an example to talk about.
>> >
>> > To give a qualified example, let's leave the world of single use clusters
>> > and take the use-case detailed here
>> >
>> > http://hortonworks.com/blog/impala-vs-hive-performance-benchmark/
>> >
>> >
>> > There are two distinct problems there - one is that a single day sees upto
>> > 100k independent user sessions running queries and that most queries cover
>> > the last hour (& possibly join/compare against a similar hour aggregate
>> > from the past).
>> >
>> > The problem with having independent 100k user-sessions from different
>> > connections was that the SparkSQL layer drops the RDD lineage & cache
>> > whenever a user ends a session.
>> >
>> > The scale problem in general for Impala was that even though the data size
>> > was in multiple terabytes, the actual hot data was approx <20Gb, which
>> > resides on <10 machines with locality.
>> >
>> > The same problem applies when you apply RDD caching with something like
>> > un-replicated like Tachyon/Alluxio, since the same RDD will be exceeding
>> > popular that the machines which hold those blocks run extra hot.
>> >
>> > A cache model per-user session is entirely wasteful and a common cache +
>> > MPP model effectively overloads 2-3% of cluster, while leaving the other
>> > machines idle.
>> >
>> > LLAP was designed specifically to prevent that hotspotting, while
>> > maintaining the common cache model - within a few minutes after an hour
>> > ticks over, the whole cluster develops temporal popularity for the hot
>> > data and nearly every rack has at least one cached copy of the same data
>> > for availability/performance.
>> >
>> > Since data stream tend to be extremely wide table (Omniture) comes to
>> > mine, so the cache actually does not hold all columns in a table and since
>> > Zipf distributions are extremely common in these real data sets, the cache
>> > does not hold all rows either.
>> >
>> > select count(clicks) from table where zipcode = 695506;
>> >
>> > with ORC data bucketed + *sorted* by zipcode, the row-groups which are in
>> > the cache will be the only 2 columns (clicks & zipcode) & all bloomfilter
>> > indexes for all files will be loaded into memory, all misses on the bloom
>> > will not even feature in the cache.
>> >
>> > A subsequent query for
>> >
>> > select count(clicks) from table where zipcode = 695586;
>> >
>> > will run against the collected indexes, before deciding which files need
>> > to be loaded into cache.
>> >
>> >
>> > Then again,
>> >
>> > select count(clicks)/count(impressions) from table where zipcode = 695586;
>> >
>> > will load only impressions out of the table into cache, to add it to the
>> > columnar cache without producing another complete copy (RDDs are not
>> > mutable, but LLAP cache is additive).
>> >
>> > The column split cache & index-cache separation allows for this to be
>> > cheaper than a full rematerialization - both are evicted as they fill up,
>> > with different priorities.
>> >
>> > Following the same vein, LLAP can do a bit of clairvoyant pre-processing,
>> > with a bit of input from UX patterns observed from Tableau/Microstrategy
>> > users to give it the impression of being much faster than the engine
>> > really can be.
>> >
>> > Illusion of performance is likely to be indistinguishable from actual -
>> > I'm actually looking for subjects for that experiment :)
>> >
>> > Cheers,
>> > Gopal
>> >
>> >
> 


Re: Optimize Hive Query

2016-06-23 Thread Jörn Franke
The query looks a little too complex for what it is supposed to do. Can you 
reformulate it and restrict the data in a where clause (most selective restriction 
first)? Another hint would be to use the ORC format (with indexes and optionally bloom 
filters) with Snappy compression, as well as sorting the data on the column you choose 
to restrict in the where part.
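
A sketch of what that could look like for the table below (the choice of sort and 
bloom filter columns is only a guess based on the window partitioning; adjust it to 
whatever you actually filter on):

CREATE TABLE tuning_dd_key_orc (
  m_d_key smallint, sb_gu_key bigint, t_ev_st_dt date,
  a_z_key int, c_dt date, e_p_dt date, sq_nbr int)
STORED AS ORC
TBLPROPERTIES ('orc.compress'='SNAPPY',
               'orc.bloom.filter.columns'='sb_gu_key');

INSERT OVERWRITE TABLE tuning_dd_key_orc
SELECT * FROM prddb.tuning_dd_key
DISTRIBUTE BY m_d_key, sb_gu_key
SORT BY m_d_key, sb_gu_key, t_ev_st_dt;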

> On 23 Jun 2016, at 02:42, @Sanjiv Singh  wrote:
> 
> Hi All,
> 
> I am running performance issue with below query. Its took 2-3 hours to 
> complete in hive.
> 
> Try tried to partition and bucketing changes on this tables, but without luck.
> 
> Please help me in optimizing this query. 
> 
> what schema level changes can be done ? 
> other parameters recommendations ?
> 
> 
> Below are complete details :
> 
> Hive Table DDL :
> 
>> CREATE TABLE `tuning_dd_key`(
>>   m_d_key smallint,
>>   sb_gu_key bigint, 
>>   t_ev_st_dt date, 
>>   a_z_key int, 
>>   c_dt date, 
>>   e_p_dt date, 
>>   sq_nbr int);
>   
> Total data size : 
>> 250 GB  
>   
> Long running query :
>   
>> SELECT 
>> sb_gu_key, m_d_key, t_ev_st_dt, 
>> LAG( t_ev_st_dt )  OVER ( PARTITION BY  m_d_key , sb_gu_key  ORDER BY  
>> t_ev_st_dt ) AS LAG_START_DT, 
>> a_z_key, 
>> c_dt, 
>> e_p_dt, 
>> sq_nbr, 
>> CASE WHEN LAG( t_ev_st_dt )  OVER ( PARTITION BY  m_d_key , sb_gu_key  ORDER 
>> BY  t_ev_st_dt ) IS NULL  OR a_z_key <> LAG( a_z_key , 1 , -999 )  OVER ( 
>> PARTITION BY  m_d_key , sb_gu_key  ORDER BY  t_ev_st_dt )  THEN 'S'  ELSE 
>> NULL  END AS ST_FLAG 
>> FROM  `PRDDB`.tuning_dd_key ;
>  
> 
> More info :
> 
>> number of distinct value in column m_d_key : 29
>> number of distinct value in column sb_gu_key : 15434343
> 
> 
> 
> 
> Regards
> Sanjiv Singh
> Mob :  +091 9990-447-339


Re: if else condition in hive

2016-06-21 Thread Jörn Franke
I recommend rethinking this as a bulk transfer, potentially even using separate 
partitions. It will be much faster.
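
As a sketch of the bulk pattern, assuming a target table t(id, val) and a staging 
table s holding the incoming rows (table and column names are made up):

-- build a merged copy in one pass: rows present in staging win,
-- all other rows are carried over unchanged
CREATE TABLE t_merged AS
SELECT COALESCE(s.id,  t.id)  AS id,
       COALESCE(s.val, t.val) AS val
FROM   t
FULL OUTER JOIN s ON t.id = s.id;

Afterwards you replace t with t_merged (or INSERT OVERWRITE t from it), instead of 
issuing per-row update/insert statements.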

> On 21 Jun 2016, at 13:22, raj hive  wrote:
> 
> Hi friends,
> 
> INSERT,UPDATE,DELETE commands are working fine in my Hive environment after 
> changing the configuration and all. Now, I have to execute a query like below 
> sql  in hive.
> 
> If exists(select * from tablename where columnname=something)
>   update table set column1=something where columnname=something
>  else
>   insert into tablename values ...
> 
> Can any one help me how to do it in Hive? 
> 
> Thanks
> Raj


Re: Network throughput from HiveServer2 to JDBC client too low

2016-06-21 Thread Jörn Franke
Well, see also the comment that it is NOT advisable to use JDBC for these data 
transfers, but to consider the alternatives mentioned below. The alternatives are more 
reliable and you will save yourself a lot of trouble.
I also doubt that beeline is suitable for these volumes in general. So yes, it could 
be that beeline is not efficient for these purposes.

> On 21 Jun 2016, at 08:52, David Nies <david.n...@adition.com> wrote:
> 
> In my test case below, I’m using `beeline` as the Java application receiving 
> the JDBC stream. As I understand, this is the reference command line 
> interface to Hive. Are you saying that the reference command line interface 
> is not efficiently implemented? :)
> 
> -David Nies
> 
>> Am 20.06.2016 um 17:46 schrieb Jörn Franke <jornfra...@gmail.com>:
>> 
>> Aside from this the low network performance could also stem from the Java 
>> application receiving the JDBC stream (not threaded / not efficiently 
>> implemented etc). However that being said, do not use jdbc for this.
>> 
>>> On 20 Jun 2016, at 17:28, Jörn Franke <jornfra...@gmail.com> wrote:
>>> 
>>> Hallo,
>>> 
>>> For no databases (including traditional ones) it is advisable to fetch this 
>>> amount through jdbc. Jdbc is not designed for this (neither for import nor 
>>> for export of large data volumes). It is a highly questionable approach 
>>> from a reliability point of view.
>>> 
>>> Export it as file to HDFS and fetch it from there or use oozie to dump the 
>>> file from HDFS to a sftp or other server. There are alternatives depending 
>>> on your use case.
>>> 
>>> Best regards
>>> 
>>>> On 20 Jun 2016, at 16:43, David Nies <david.n...@adition.com> wrote:
>>>> 
>>>> Dear Hive mailing list,
>>>> 
>>>> in my setup, network throughput from the HiveServer2 to the client seems 
>>>> to be the bottleneck and I’m seeking a way do increase throughput. Let me 
>>>> elaborate my use case:
>>>> 
>>>> I’m using Hive version 1.1.0 that is bundeled with Clouders 5.5.1.
>>>> 
>>>> I want to fetch a huge amount of data from our Hive cluster. By huge I 
>>>> mean something around 100 million rows. The Hive table I’m querying is an 
>>>> external table whose data is stored in .avro. On HDFS, the data I want to 
>>>> fetch (i.e. the aforementioned 100 million rows) is about 5GB in size. A 
>>>> cleverer filtering strategy (to reduce the amount of data) is no option, 
>>>> sadly, since I need all the data.
>>>> 
>>>> I was able to reduce the time the MapReduce job takes to an agreeable 
>>>> interval fiddling around with 
>>>> `mapreduce.input.fileinputformat.split.maxsize`. The part that is taking 
>>>> ages comes after MapReduce. I’m observing that the Hadoop namenode that is 
>>>> hosting the HiveServer2 is merely sending data with around 3 MB/sec. Our 
>>>> network is capable of much more. Playing around with `fetchSize` did not 
>>>> increase throughput. 
>>>> 
>>>> As I identified network throughput to be the bottleneck, I restricted my 
>>>> efforts to trying to increase it. For this, I simply run the query I’d 
>>>> normally run through JDBC (from Clojure/Java) via `beeline` and dumping 
>>>> the output to `/dev/null`. My `beeline` query looks something like that:
>>>> 
>>>> beeline \
>>>> -u jdbc:hive2://srv:1/db \
>>>> -n user -p password \
>>>> --outputformat=csv2 \
>>>> --incremental=true \
>>>> --hiveconf mapreduce.input.fileinputformat.split.maxsize=33554432 \
>>>> -e 'SELECT  FROM `db`.`table` WHERE (year=2016 AND 
>>>> month=6 AND day=1 AND hour=10)' > /dev/null
>>>> 
>>>> I already tried playing around with additional `—hiveconf`s:
>>>> 
>>>> --hiveconf hive.exec.compress.output=true \
>>>> --hiveconf mapred.output.compression.type=BLOCK \
>>>> --hiveconf 
>>>> mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec \
>>>> 
>>>> without success. 
>>>> 
>>>> In all cases, Hive is able only to utilize a tiny fraction of the 
>>>> bandwidth that is available. Is there a possibility to increase network 
>>>> throughput?
>>>> 
>>>> Thank you in advance!
>>>> 
>>>> Yours
>>>> 
>>>> David Nies
>>>> Entwickler Business Intelligence
>>>>  
>>>> ADITION technologies AG
>>>>  
>>>> Oststraße 55, D-40211 Düsseldorf
>>>> Schwarzwaldstraße 78b, D-79117 Freiburg im Breisgau
>>>>  
>>>> T +49 211 987400 30
>>>> F +49 211 987400 33
>>>> E david.n...@adition.com
>>>>  
>>>> Technischen Support erhalten Sie unter der +49 1805 2348466
>>>> (Festnetzpreis: 14 ct/min; Mobilfunkpreise: maximal 42 ct/min)
>>>>  
>>>> Abonnieren Sie uns auf XING oder besuchen Sie uns unter www.adition.com.
>>>>  
>>>> Vorstände: Andreas Kleiser, Jörg Klekamp, Dr. Lutz Lowis, Marcus Schlüter 
>>>> Aufsichtsratsvorsitzender: Joachim Schneidmadl
>>>> Eingetragen beim Amtsgericht Düsseldorf unter HRB 54076 
>>>> UStIDNr.: DE 218 858 434
> 


Re: Network throughput from HiveServer2 to JDBC client too low

2016-06-20 Thread Jörn Franke
Aside from this, the low network performance could also stem from the Java application 
receiving the JDBC stream (not threaded, not efficiently implemented, etc.). That 
being said, do not use JDBC for this.

> On 20 Jun 2016, at 17:28, Jörn Franke <jornfra...@gmail.com> wrote:
> 
> Hallo,
> 
> For no databases (including traditional ones) it is advisable to fetch this 
> amount through jdbc. Jdbc is not designed for this (neither for import nor 
> for export of large data volumes). It is a highly questionable approach from 
> a reliability point of view.
> 
> Export it as file to HDFS and fetch it from there or use oozie to dump the 
> file from HDFS to a sftp or other server. There are alternatives depending on 
> your use case.
> 
> Best regards
> 
>> On 20 Jun 2016, at 16:43, David Nies <david.n...@adition.com> wrote:
>> 
>> Dear Hive mailing list,
>> 
>> in my setup, network throughput from the HiveServer2 to the client seems to 
>> be the bottleneck and I’m seeking a way do increase throughput. Let me 
>> elaborate my use case:
>> 
>> I’m using Hive version 1.1.0 that is bundeled with Clouders 5.5.1.
>> 
>> I want to fetch a huge amount of data from our Hive cluster. By huge I mean 
>> something around 100 million rows. The Hive table I’m querying is an 
>> external table whose data is stored in .avro. On HDFS, the data I want to 
>> fetch (i.e. the aforementioned 100 million rows) is about 5GB in size. A 
>> cleverer filtering strategy (to reduce the amount of data) is no option, 
>> sadly, since I need all the data.
>> 
>> I was able to reduce the time the MapReduce job takes to an agreeable 
>> interval fiddling around with 
>> `mapreduce.input.fileinputformat.split.maxsize`. The part that is taking 
>> ages comes after MapReduce. I’m observing that the Hadoop namenode that is 
>> hosting the HiveServer2 is merely sending data with around 3 MB/sec. Our 
>> network is capable of much more. Playing around with `fetchSize` did not 
>> increase throughput. 
>> 
>> As I identified network throughput to be the bottleneck, I restricted my 
>> efforts to trying to increase it. For this, I simply run the query I’d 
>> normally run through JDBC (from Clojure/Java) via `beeline` and dumping the 
>> output to `/dev/null`. My `beeline` query looks something like that:
>> 
>> beeline \
>> -u jdbc:hive2://srv:1/db \
>> -n user -p password \
>> --outputformat=csv2 \
>> --incremental=true \
>> --hiveconf mapreduce.input.fileinputformat.split.maxsize=33554432 \
>> -e 'SELECT  FROM `db`.`table` WHERE (year=2016 AND 
>> month=6 AND day=1 AND hour=10)' > /dev/null
>> 
>> I already tried playing around with additional `—hiveconf`s:
>> 
>> --hiveconf hive.exec.compress.output=true \
>> --hiveconf mapred.output.compression.type=BLOCK \
>> --hiveconf 
>> mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec \
>> 
>> without success. 
>> 
>> In all cases, Hive is able only to utilize a tiny fraction of the bandwidth 
>> that is available. Is there a possibility to increase network throughput?
>> 
>> Thank you in advance!
>> 
>> Yours
>> 
>> David Nies
>> Entwickler Business Intelligence
>>  
>> ADITION technologies AG
>>  
>> Oststraße 55, D-40211 Düsseldorf
>> Schwarzwaldstraße 78b, D-79117 Freiburg im Breisgau
>>  
>> T +49 211 987400 30
>> F +49 211 987400 33
>> E david.n...@adition.com
>>  
>> Technischen Support erhalten Sie unter der +49 1805 2348466
>> (Festnetzpreis: 14 ct/min; Mobilfunkpreise: maximal 42 ct/min)
>>  
>> Abonnieren Sie uns auf XING oder besuchen Sie uns unter www.adition.com.
>>  
>> Vorstände: Andreas Kleiser, Jörg Klekamp, Dr. Lutz Lowis, Marcus Schlüter 
>> Aufsichtsratsvorsitzender: Joachim Schneidmadl
>> Eingetragen beim Amtsgericht Düsseldorf unter HRB 54076 
>> UStIDNr.: DE 218 858 434
>> 


Re: Network throughput from HiveServer2 to JDBC client too low

2016-06-20 Thread Jörn Franke
Hello,

For no database (including traditional ones) is it advisable to fetch this amount of 
data through JDBC. JDBC is not designed for this (neither for import nor for export of 
large data volumes). It is a highly questionable approach from a reliability point of 
view.

Export the data as a file to HDFS and fetch it from there, or use Oozie to push the 
file from HDFS to an SFTP or other server. There are alternatives depending on your 
use case.
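
A sketch of the HDFS export route (the path and delimiter are just examples):

INSERT OVERWRITE DIRECTORY '/tmp/export/table_2016060110'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT * FROM `db`.`table`
WHERE year=2016 AND month=6 AND day=1 AND hour=10;

The result ends up as one or more delimited files under that directory, which you can 
then collect with hdfs dfs -get (or ship elsewhere via Oozie).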

Best regards

> On 20 Jun 2016, at 16:43, David Nies  wrote:
> 
> Dear Hive mailing list,
> 
> in my setup, network throughput from the HiveServer2 to the client seems to 
> be the bottleneck and I’m seeking a way do increase throughput. Let me 
> elaborate my use case:
> 
> I’m using Hive version 1.1.0 that is bundeled with Clouders 5.5.1.
> 
> I want to fetch a huge amount of data from our Hive cluster. By huge I mean 
> something around 100 million rows. The Hive table I’m querying is an external 
> table whose data is stored in .avro. On HDFS, the data I want to fetch (i.e. 
> the aforementioned 100 million rows) is about 5GB in size. A cleverer 
> filtering strategy (to reduce the amount of data) is no option, sadly, since 
> I need all the data.
> 
> I was able to reduce the time the MapReduce job takes to an agreeable 
> interval fiddling around with 
> `mapreduce.input.fileinputformat.split.maxsize`. The part that is taking ages 
> comes after MapReduce. I’m observing that the Hadoop namenode that is hosting 
> the HiveServer2 is merely sending data with around 3 MB/sec. Our network is 
> capable of much more. Playing around with `fetchSize` did not increase 
> throughput. 
> 
> As I identified network throughput to be the bottleneck, I restricted my 
> efforts to trying to increase it. For this, I simply run the query I’d 
> normally run through JDBC (from Clojure/Java) via `beeline` and dumping the 
> output to `/dev/null`. My `beeline` query looks something like that:
> 
> beeline \
> -u jdbc:hive2://srv:1/db \
> -n user -p password \
> --outputformat=csv2 \
> --incremental=true \
> --hiveconf mapreduce.input.fileinputformat.split.maxsize=33554432 \
> -e 'SELECT  FROM `db`.`table` WHERE (year=2016 AND 
> month=6 AND day=1 AND hour=10)' > /dev/null
> 
> I already tried playing around with additional `—hiveconf`s:
> 
> --hiveconf hive.exec.compress.output=true \
> --hiveconf mapred.output.compression.type=BLOCK \
> --hiveconf 
> mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec \
> 
> without success. 
> 
> In all cases, Hive is able only to utilize a tiny fraction of the bandwidth 
> that is available. Is there a possibility to increase network throughput?
> 
> Thank you in advance!
> 
> Yours
> 
> David Nies
> Entwickler Business Intelligence
>  
> ADITION technologies AG
>  
> Oststraße 55, D-40211 Düsseldorf
> Schwarzwaldstraße 78b, D-79117 Freiburg im Breisgau
>  
> T +49 211 987400 30
> F +49 211 987400 33
> E david.n...@adition.com
>  
> Technischen Support erhalten Sie unter der +49 1805 2348466
> (Festnetzpreis: 14 ct/min; Mobilfunkpreise: maximal 42 ct/min)
>  
> Abonnieren Sie uns auf XING oder besuchen Sie uns unter www.adition.com.
>  
> Vorstände: Andreas Kleiser, Jörg Klekamp, Dr. Lutz Lowis, Marcus Schlüter 
> Aufsichtsratsvorsitzender: Joachim Schneidmadl
> Eingetragen beim Amtsgericht Düsseldorf unter HRB 54076 
> UStIDNr.: DE 218 858 434
> 


Re: Hive indexes without improvement of performance

2016-06-16 Thread Jörn Franke
The indexes are based on the HDFS block size, which is usually around 128 MB. This 
means that for hitting a single row you must always load the full block. In 
traditional databases the block size is much smaller, so such lookups are much faster. 
If the optimizer does not pick up the index, you can query the index directly (it is 
just a table!). Keep in mind that you should also use an adequate storage format for 
the index, such as ORC or Parquet.

You should not use the traditional indexes, but use Hive+Tez and the ORC format with 
storage indexes and bloom filters (i.e. at least Hive 1.2). It is of key importance 
that you insert the data sorted on the columns that you use in the where clause. You 
should compress the table with Snappy. Additionally, partitions make sense. Finally, 
please use the right data types: storage indexes work best with ints etc.; for text 
fields you can try bloom filters.

That being said, also in other relational databases such as Oracle Exadata the use of 
traditional indexes is discouraged for warehouse scenarios; storage indexes and 
columnar formats including compression bring the most performance.
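
A sketch of the ORC route for this kind of point lookup (the new table name is made up 
and the properties will need tuning for 6 billion rows):

CREATE TABLE doc_t_orc
STORED AS ORC
TBLPROPERTIES ('orc.compress'='SNAPPY',
               'orc.bloom.filter.columns'='id')
AS
SELECT * FROM my_schema_name.doc_t
DISTRIBUTE BY id SORT BY id;   -- sorted data keeps the min/max storage indexes selective

set hive.optimize.index.filter=true;   -- also enables ORC predicate pushdown
select count(*) from doc_t_orc where id = '3723445235879';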

> On 16 Jun 2016, at 22:50, Vadim Dedkov  wrote:
> 
> Hello!
> 
> I use Hive 1.1.0-cdh5.5.0 and try to use indexes support.
>  
> My index creation:
> CREATE INDEX doc_id_idx on TABLE my_schema_name.doc_t (id) AS 'COMPACT' WITH 
> DEFERRED REBUILD;
> ALTER INDEX doc_id_idx ON my_schema_name.doc_t REBUILD;
> 
> Then I set configs:
> set hive.optimize.autoindex=true;
> set hive.optimize.index.filter=true;
> set hive.optimize.index.filter.compact.minsize=0;
> set hive.index.compact.query.max.size=-1;
> set hive.index.compact.query.max.entries=-1; 
> 
> And my query is:
> select count(*) from my_schema_name.doc_t WHERE id = '3723445235879';
> 
> Sometimes I have improvement of performance, but most of cases - not.
> 
> In cases when I have improvement:
> 1. my query is
> select count(*) from my_schema_name.doc_t WHERE id = '3723445235879';
> give me NullPointerException (in logs I see that Hive doesn't find my index 
> table)
> 2. then I write:
> USE my_schema_name;
> select count(*) from doc_t WHERE id = '3723445235879';
> and have result with improvement
> (172 sec)
> 
> In case when I don't have improvement, I can use either
> select count(*) from my_schema_name.doc_t WHERE id = '3723445235879';
> without exception, either
> USE my_schema_name;
> select count(*) from doc_t WHERE id = '3723445235879';
> and have result
> (1153 sec)
> 
> My table is about 6 billion rows.
> I tried various combinations on index configs, including only these two: 
> set hive.optimize.index.filter=true;
> set hive.optimize.index.filter.compact.minsize=0;
> My hadoop version is 2.6.0-cdh5.5.0
> 
> What I do wrong?
> 
> Thank you.
> 
> -- 
> ___ ___
> Best regards,С уважением
> Vadim Dedkov.  Вадим Дедков.


Re: insert query in hive

2016-06-08 Thread Jörn Franke
This is not the recommended way to load large data volumes into Hive. Check the 
external table feature, scoop, and the Orc/parquet formats
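
For example, a sketch of the external-table route (path, columns and delimiter are 
made up):

-- write the rows from the Java program into a delimited file, put it on HDFS,
-- then expose it to Hive without any per-row INSERT statements
CREATE EXTERNAL TABLE staging_raw (
  col1 string,
  col2 int
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/staging_raw';

-- optionally convert it once into an efficient columnar format
CREATE TABLE target STORED AS ORC AS SELECT * FROM staging_raw;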

> On 08 Jun 2016, at 14:03, raj hive  wrote:
> 
> Hi Friends,
> 
> I have to insert the data into hive table from Java program.  Insert query 
> will work in Hive directly? like below sql command
> 
> insert into tablename values(value1,value2)
> 
> Thanks
> Raj


Re: Convert date in string format to timestamp in table definition

2016-06-05 Thread Jörn Franke
Never use string when you can use int - the performance will be much better, 
especially for tables in ORC/Parquet format.
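
For example, a sketch with an integer partition key (table and column names are 
trimmed down and made up):

CREATE EXTERNAL TABLE threads_by_day (
  url      string,
  posttext string
)
PARTITIONED BY (yyyymmdd int)
STORED AS PARQUET
LOCATION '/tmp/threads_by_day';

SELECT count(*)
FROM threads_by_day
WHERE yyyymmdd BETWEEN 20160301 AND 20160331;   -- March 2016, still prunes partitions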

> On 04 Jun 2016, at 22:31, Igor Kravzov  wrote:
> 
> Thanks Dudu. 
> So if I need actual date I will use view.
> Regarding partition column:  I can create 2 external tables based on the same 
> data with integer or string column partition and see which one is more 
> convenient for our use.
> 
>> On Sat, Jun 4, 2016 at 2:20 PM, Markovitz, Dudu  
>> wrote:
>> I’m not aware of an option to do what you request in the external table 
>> definition but you might want to that using a view.
>> 
>>  
>> 
>> P.s.
>> 
>> I seems to me that defining the partition column as a string would be more 
>> user friendly than integer, e.g. –
>> 
>>  
>> 
>> select * from threads_test where mmdd like ‘2016%’ – year 2016;
>> 
>> select * from threads_test where mmdd like ‘201603%’ –- March 2016;
>> 
>> select * from threads_test where mmdd like ‘__01’ -- first of every 
>> month;
>> 
>>  
>> 
>>  
>> 
>>  
>> 
>>  
>> 
>>  
>> 
>> $ hdfs dfs -ls -R /tmp/threads_test
>> 
>> drwxr-xr-x   - cloudera supergroup  0 2016-06-04 10:45 
>> /tmp/threads_test/20160604
>> 
>> -rw-r--r--   1 cloudera supergroup136 2016-06-04 10:45 
>> /tmp/threads_test/20160604/data.txt
>> 
>>  
>> 
>> $ hdfs dfs -cat /tmp/threads_test/20160604/data.txt
>> 
>> {"url":"www.blablabla.com","pageType":"pg1","addDate":"2016-05-17T02:10:44.527","postDate":"2016-05-16T02:08:55","postText":"YadaYada"}
>> 
>>  
>> 
>> 
>> 
>>  
>> 
>>  
>> 
>> hive> add jar /usr/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar;
>> 
>>  
>> 
>> hive>
>> 
>> create external table threads_test
>> 
>> (
>> 
>> url string
>> 
>>,pagetypestring
>> 
>>,adddate string
>> 
>>,postdatestring
>> 
>>,posttextstring
>> 
>> )
>> 
>> partitioned by (mmdd string)
>> 
>> row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
>> 
>> location '/tmp/threads_test'
>> 
>> ;
>> 
>>  
>> 
>> hive> alter table threads_test add partition (mmdd=20160604) location 
>> '/tmp/threads_test/20160604';
>> 
>>  
>> 
>> hive> select * from threads_test;
>> 
>>  
>> 
>> www.blablabla.compg12016-05-17T02:10:44.5272016-05-16T02:08:55  
>> YadaYada  20160604
>> 
>>  
>> 
>> hive>
>> 
>> create view threads_test_v
>> 
>> as
>> 
>> select  url
>> 
>>,pagetype
>> 
>>,cast (concat_ws(' ',substr (adddate ,1,10),substr (adddate ,12)) 
>> as timestamp)  as adddate 
>> 
>>,cast (concat_ws(' ',substr (postdate,1,10),substr (postdate,12)) 
>> as timestamp)  as postdate 
>> 
>>,posttext
>> 
>>
>> 
>> fromthreads_test
>> 
>> ;
>> 
>>  
>> 
>> hive> select * from threads_test_v;
>> 
>>  
>> 
>> www.blablabla.compg12016-05-17 02:10:44.5272016-05-16 02:08:55  
>> YadaYada
>> 
>>  
>> 
>>  
>> 
>> From: Igor Kravzov [mailto:igork.ine...@gmail.com] 
>> Sent: Saturday, June 04, 2016 8:13 PM
>> To: user@hive.apache.org
>> Subject: Convert date in string format to timestamp in table definition
>> 
>>  
>> 
>> Hi,
>> 
>>  
>> 
>> I have 2 dates in Json file defined like this
>> 
>> "addDate": "2016-05-17T02:10:44.527",
>> 
>>   "postDate": "2016-05-16T02:08:55",
>> 
>>  
>> 
>> Right now I define external table based on this file like this:
>> 
>> CREATE external TABLE threads_test 
>> 
>> (url string,
>> 
>>  pagetype string,
>> 
>>  adddate string,
>> 
>>  postdate string,
>> 
>>  posttext string)
>> 
>> partitioned by (mmdd int)
>> 
>> ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
>> 
>> location 'my location';
>> 
>>  
>> 
>> is it possible to define these 2 dates as timestamp? 
>> 
>> Do I need to change date format in the file? is it possible to specify date 
>> format in table definition?
>> 
>> Or I better off with string?
>> 
>>  
>> 
>> Thanks in advance.
>> 
> 


Re: Internode Encryption with HiveServer2

2016-06-03 Thread Jörn Franke
This can be configured at the Hadoop level, e.g. with hadoop.rpc.protection=privacy 
for encrypted RPC, dfs.encrypt.data.transfer=true for HDFS data transfers, and 
encrypted shuffle for the MapReduce stage.

> On 03 Jun 2016, at 10:59, Nick Corbett  wrote:
> 
> Hi
> 
> 
> I am deploying Hive in a regulated environment - all data needs to be 
> encrypted when transferred and at rest.
> 
> 
> If I run a 'select' statement, using HiveServer2, then a map reduce job(s) 
> get started and results files are generated on nodes of my cluster.  The 
> final stage of the process must be to transfer all these results files back 
> to HiveServer2 before they are sent to the client.
> 
> 
> Does anyone know how to encrypt this stage (ie results sent from cluster 
> nodes to HiveServer2)?  If I run tcpdump on my cluster I can see that this 
> traffic is not encrypted.
> 
> 
> Many thanks
> 
> 
> Nick 


Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-31 Thread Jörn Franke
Thanks, very interesting explanation. Looking forward to testing it.

> On 31 May 2016, at 07:51, Gopal Vijayaraghavan  wrote:
> 
> 
>> That being said all systems are evolving. Hive supports tez+llap which
>> is basically the in-memory support.
> 
> There is a big difference between where LLAP & SparkSQL, which has to do
> with access pattern needs.
> 
> The first one is related to the lifetime of the cache - the Spark RDD
> cache is per-user-session which allows for further operation in that
> session to be optimized.
> 
> LLAP is designed to be hammered by multiple user sessions running
> different queries, designed to automate the cache eviction & selection
> process. There's no user visible explicit .cache() to remember - it's
> automatic and concurrent.
> 
> My team works with both engines, trying to improve it for ORC, but the
> goals of both are different.
> 
> I will probably have to write a proper academic paper & get it
> edited/reviewed instead of send my ramblings to the user lists like this.
> Still, this needs an example to talk about.
> 
> To give a qualified example, let's leave the world of single use clusters
> and take the use-case detailed here
> 
> http://hortonworks.com/blog/impala-vs-hive-performance-benchmark/
> 
> 
> There are two distinct problems there - one is that a single day sees upto
> 100k independent user sessions running queries and that most queries cover
> the last hour (& possibly join/compare against a similar hour aggregate
> from the past).
> 
> The problem with having independent 100k user-sessions from different
> connections was that the SparkSQL layer drops the RDD lineage & cache
> whenever a user ends a session.
> 
> The scale problem in general for Impala was that even though the data size
> was in multiple terabytes, the actual hot data was approx <20Gb, which
> resides on <10 machines with locality.
> 
> The same problem applies when you apply RDD caching with something like
> un-replicated like Tachyon/Alluxio, since the same RDD will be exceeding
> popular that the machines which hold those blocks run extra hot.
> 
> A cache model per-user session is entirely wasteful and a common cache +
> MPP model effectively overloads 2-3% of cluster, while leaving the other
> machines idle.
> 
> LLAP was designed specifically to prevent that hotspotting, while
> maintaining the common cache model - within a few minutes after an hour
> ticks over, the whole cluster develops temporal popularity for the hot
> data and nearly every rack has at least one cached copy of the same data
> for availability/performance.
> 
> Since data streams tend to be extremely wide tables (Omniture comes to
> mind), the cache actually does not hold all columns in a table, and since
> Zipf distributions are extremely common in these real data sets, the cache
> does not hold all rows either.
> 
> select count(clicks) from table where zipcode = 695506;
> 
> with ORC data bucketed + *sorted* by zipcode, the row-groups which are in
> the cache will be the only 2 columns (clicks & zipcode) & all bloomfilter
> indexes for all files will be loaded into memory, all misses on the bloom
> will not even feature in the cache.
> 
> A subsequent query for
> 
> select count(clicks) from table where zipcode = 695586;
> 
> will run against the collected indexes, before deciding which files need
> to be loaded into cache.
> 
> 
> Then again, 
> 
> select count(clicks)/count(impressions) from table where zipcode = 695586;
> 
> will load only impressions out of the table into cache, to add it to the
> columnar cache without producing another complete copy (RDDs are not
> mutable, but LLAP cache is additive).
> 
> The column split cache & index-cache separation allows for this to be
> cheaper than a full rematerialization - both are evicted as they fill up,
> with different priorities.
> 
> Following the same vein, LLAP can do a bit of clairvoyant pre-processing,
> with a bit of input from UX patterns observed from Tableau/Microstrategy
> users to give it the impression of being much faster than the engine
> really can be.
> 
> Illusion of performance is likely to be indistinguishable from actual -
> I'm actually looking for subjects for that experiment :)
> 
> Cheers,
> Gopal
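
For reference, a table laid out the way described above ("bucketed + sorted by 
zipcode" with bloom filters) could be declared roughly as follows; the table 
name, column list and bucket count are assumptions:

CREATE TABLE clicks_by_zip (
  zipcode     STRING,
  clicks      BIGINT,
  impressions BIGINT
)
CLUSTERED BY (zipcode) SORTED BY (zipcode) INTO 32 BUCKETS
STORED AS ORC
TBLPROPERTIES ("orc.bloom.filter.columns"="zipcode");

-- With hive.optimize.index.filter=true, a lookup such as
--   SELECT count(clicks) FROM clicks_by_zip WHERE zipcode = '695506';
-- only touches the zipcode and clicks columns and can skip stripes via
-- min/max indexes and bloom filters.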
> 
> 


Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-30 Thread Jörn Franke
I do not think that in-memory itself will make things faster in all cases. 
Especially if you use Tez with ORC or Parquet. 
Especially for ad hoc queries on large datasets (independently of whether they 
fit in-memory or not) this will have a significant impact. This is an experience 
I have also had with the in-memory databases of Oracle and SQL Server. It might 
sound surprising, but it has some explanations. ORC and Parquet have min/max 
indexes, store and process data very efficiently (important: choose the right 
datatype; if everything is varchar then it is your fault that the database is 
not performing), and only load into memory what is needed. This is not the case 
for in-memory systems. Usually everything is loaded into memory and not only the 
parts which are needed. This means that, due to the absence of min/max indexes, 
you have to go through everything. Let us assume the table has a size of 10 TB. 
There are different ad hoc queries that each only process 1 GB (each one 
addresses different areas). In Hive+Tez this is currently rather efficient: you 
load 1 GB (negligible in a cluster) and process 1 GB. In Spark you would cache 
10 TB (you do not know which part will be addressed), which takes a lot of time 
to load first, and each query needs to go through 10 TB in memory. This might be 
an extreme case, but it is not uncommon. An exception is of course machine 
learning algorithms (the original purpose of Spark), where I see more advantages 
for Spark. Most of the traditional companies probably have both use cases (maybe 
with a bias towards the first); internet companies lean more towards the last.

That being said, all systems are evolving. Hive supports Tez+LLAP, which is 
basically the in-memory support. Spark stores the data more efficiently in 1.5 
and 1.6 (in the Dataset API and DataFrame; the issue here is that it is not the 
same format as the files on disk). Let's see if there will be a convergence - my 
bet is that both systems will be used, optimized for their use cases.

The bottom line is that you have to first optimize and think about what you need 
to do before going in-memory. Never load everything in-memory. You will be 
surprised. Have multiple technologies in your ecosystem. Understand them. 
Unfortunately most of the consulting companies have only poor experience and 
understanding of the complete picture and thus they fail with both technologies, 
which is sad, because both can be extremely powerful and a competitive advantage.

> On 30 May 2016, at 21:49, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
> 
> Yep, Hortonworks supports Tez for one reason or other, which I am hopefully 
> going to test as the query engine for Hive. Though I think Spark will 
> be faster because of its in-memory support.
> 
> Also if you are independent then you better off dealing with Spark and Hive 
> without the need to support another stack like Tez.
> 
> Cloudera supports Impala instead of Hive but it is not something I have used.
> 
> HTH
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
>  
> 
>> On 30 May 2016 at 20:19, Michael Segel <msegel_had...@hotmail.com> wrote:
>> Mich, 
>> 
>> Most people use vendor releases because they need to have the support. 
>> Hortonworks is the vendor who has the most skin in the game when it comes to 
>> Tez. 
>> 
>> If memory serves, Tez isn’t going to be M/R but a local execution engine? 
>> Then LLAP is the in-memory piece to speed up Tez? 
>> 
>> HTH
>> 
>> -Mike
>> 
>>> On May 29, 2016, at 1:35 PM, Mich Talebzadeh <mich.talebza...@gmail.com> 
>>> wrote:
>>> 
>>> thanks I think the problem is that the TEZ user group is exceptionally 
>>> quiet. Just sent an email to Hive user group to see anyone has managed to 
>>> built a vendor independent version.
>>> 
>>> 
>>> Dr Mich Talebzadeh
>>>  
>>> LinkedIn  
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>  
>>> http://talebzadehmich.wordpress.com
>>>  
>>> 
>>>> On 29 May 2016 at 21:23, Jörn Franke <jornfra...@gmail.com> wrote:
>>>> Well I think it is different from MR. It has some optimizations which you 
>>>> do not find in MR. Especially the LLAP option in Hive2 makes it 
>>>> interesting. 
>>>> 
>>>> I think Hive 1.2 works with Tez 0.7 and Hive 2.0 with Tez 0.8. At least for 1.2 it is 
>>>> integrated in the Hortonworks distribution. 
>>>> 
>>>> 
>>>>> On 29 May 2016, at 21:43, Mich Talebzadeh <mich.talebza...@gmail.com> 
>>>>> wrote:

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-29 Thread Jörn Franke
Well I think it is different from MR. It has some optimizations which you do 
not find in MR. Especially the LLAP option in Hive2 makes it interesting. 

I think Hive 1.2 works with Tez 0.7 and Hive 2.0 with Tez 0.8. At least for 1.2 it is 
integrated in the Hortonworks distribution. 


> On 29 May 2016, at 21:43, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
> 
> Hi Jorn,
> 
> I started building apache-tez-0.8.2 but got few errors. Couple of guys from 
> TEZ user group kindly gave a hand but I could not go very far (or may be I 
> did not make enough efforts) making it work.
> 
> That TEZ user group is very quiet as well.
> 
> My understanding is TEZ is MR with DAG but of course Spark has both plus 
> in-memory capability.
> 
> It would be interesting to see what version of TEZ works as execution engine 
> with Hive. 
> 
> Vendors are divided on this (use Hive with TEZ) or use Impala instead of Hive 
> etc as I am sure you already know.
> 
> Cheers,
> 
> 
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
>  
> 
>> On 29 May 2016 at 20:19, Jörn Franke <jornfra...@gmail.com> wrote:
>> Very interesting do you plan also a test with TEZ?
>> 
>>> On 29 May 2016, at 13:40, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>> 
>>> Hi,
>>> 
>>> I did another study of Hive using Spark engine compared to Hive with MR.
>>> 
>>> Basically took the original table imported using Sqoop and created and 
>>> populated a new ORC table partitioned by year and month into 48 partitions 
>>> as follows:
>>> 
>>> 
>>> ​ 
>>> Connections use JDBC via beeline. Now for each partition using MR it takes 
>>> an average of 17 minutes as seen below for each PARTITION..  Now that is 
>>> just an individual partition and there are 48 partitions.
>>> 
>>> In contrast doing the same operation with Spark engine took 10 minutes all 
>>> inclusive. I just gave up on MR. You can see the StartTime and FinishTime 
>>> from below
>>> 
>>> 
>>> 
>>> This by no means indicates that Spark is much better than MR, but it shows 
>>> that some very good results can be achieved using the Spark engine.
>>> 
>>> 
>>> Dr Mich Talebzadeh
>>>  
>>> LinkedIn  
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>  
>>> http://talebzadehmich.wordpress.com
>>>  
>>> 
>>>> On 24 May 2016 at 08:03, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>> Hi,
>>>> 
>>>> We use Hive as the database and use Spark as an all purpose query tool.
>>>> 
>>>> Whether Hive is the right database for the purpose or one is better off with 
>>>> something like Phoenix on HBase, well, the answer is it depends and your 
>>>> mileage varies. 
>>>> 
>>>> So fit for purpose.
>>>> 
>>>> Ideally what one wants is to use the fastest method to get the results. How 
>>>> fast is something we confine to our SLA agreements in production, and that keeps us 
>>>> from unnecessary further work, as we technologists like to play around.
>>>> 
>>>> So in short, we use Spark most of the time and use Hive as the backend 
>>>> engine for data storage, mainly ORC tables.
>>>> 
>>>> We use Hive on Spark and with Hive 2 on Spark 1.3.1 for now we have a 
>>>> combination that works. Granted it helps to use Hive 2 on Spark 1.6.1 but 
>>>> at the moment it is one of my projects.
>>>> 
>>>> We do not use any vendor's products as it enables us to move away  from 
>>>> being tied down after years of SAP, Oracle and MS dependency to yet 
>>>> another vendor. Besides there is some politics going on with one promoting 
>>>> Tez and another Spark as a backend. That is fine but obviously we prefer 
>>>> an independent assessment ourselves.
>>>> 
>>>> My gut feeling is that one needs to look at the use case. Recently we had 
>>>> to import a very large table from Oracle to Hive and decided to use Spark 
>>>> 1.6.1 with Hive 2 on Spark 1.3.1 and that worked fine. We just used JDBC 
>>>> connection with temp table and it was good. We could have used sqoop but 
>>>> decided to settle for Spark so it all depends on use case.
>>>> 
>>>> H

Re: How to run large Hive queries in PySpark 1.2.1

2016-05-26 Thread Jörn Franke
Both are outdated versions; usually one can support you better if you upgrade 
to the newest.
Firewall could be an issue here.
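
If the goal is just to get a large result set out, one option that avoids 
collect()-ing 400,000 rows into the driver is to let Hive write the result to 
HDFS and read the files from there; a rough sketch (the output path is an 
assumption, and it can usually be submitted through sqlContext.sql(...) or 
directly in the Hive CLI/beeline):

-- The result lands as files in HDFS instead of in driver memory
INSERT OVERWRITE DIRECTORY '/tmp/large_query_output'
SELECT column FROM database.table;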


> On 26 May 2016, at 10:11, Nikolay Voronchikhin  
> wrote:
> 
> Hi PySpark users,
> 
> We need to be able to run large Hive queries in PySpark 1.2.1. Users are 
> running PySpark on an Edge Node, and submit jobs to a Cluster that allocates 
> YARN resources to the clients.
> We are using MapR as the Hadoop Distribution on top of Hive 0.13 and Spark 
> 1.2.1.
> 
> 
> Currently, our process for writing queries works only for small result sets, 
> for example:
> from pyspark.sql import HiveContext
> sqlContext = HiveContext(sc)
> results = sqlContext.sql("select column from database.table limit 
> 10").collect()
> results
> 
> 
> 
> How do I save the HiveQL query to RDD first, then output the results?
> 
> This is the error I get when running a query that requires output of 400,000 
> rows:
> from pyspark.sql import HiveContext
> sqlContext = HiveContext(sc)
> results = sqlContext.sql("select column from database.table").collect()
> results
> ...
> /path/to/mapr/spark/spark-1.2.1/python/pyspark/sql.py in collect(self)
>1976 """
>1977 with SCCallSiteSync(self.context) as css:
> -> 1978 bytesInJava = 
> self._jschema_rdd.baseSchemaRDD().collectToPython().iterator()
>1979 cls = _create_cls(self.schema())
>1980 return map(cls, 
> self._collect_iterator_through_file(bytesInJava))
> 
> /path/to/mapr/spark/spark-1.2.1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py
>  in __call__(self, *args)
> 536 answer = self.gateway_client.send_command(command)
> 537 return_value = get_return_value(answer, self.gateway_client,
> --> 538 self.target_id, self.name)
> 539 
> 540 for temp_arg in temp_args:
> 
> /path/to/mapr/spark/spark-1.2.1/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py
>  in get_return_value(answer, gateway_client, target_id, name)
> 298 raise Py4JJavaError(
> 299 'An error occurred while calling {0}{1}{2}.\n'.
> --> 300 format(target_id, '.', name), value)
> 301 else:
> 302 raise Py4JError(
> 
> Py4JJavaError: An error occurred while calling o76.collectToPython.
> : org.apache.spark.SparkException: Job aborted due to stage failure: 
> Exception while getting task result: java.io.IOException: Failed to connect 
> to cluster_node/IP_address:port
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1202)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
>   at scala.Option.foreach(Option.scala:236)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:696)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1420)
>   at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1375)
>   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
>   at akka.actor.ActorCell.invoke(ActorCell.scala:487)
>   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
>   at akka.dispatch.Mailbox.run(Mailbox.scala:220)
>   at 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
>   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>   at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> 
> 
> 
> For this example, ideally, this query should output the 400,000 row resultset.
> 
> 
> Thanks for your help,
> Nikolay Voronchikhin
> https://www.linkedin.com/in/nvoronchikhin
> E-mail: nvoronchik...@gmail.com
> 
> 


Re: Copying all Hive tables from Prod to UAT

2016-05-26 Thread Jörn Franke
Or use Falcon ...

The Spark JDBC route I would try to avoid. JDBC is not designed for these big 
data bulk operations; e.g. data has to be transferred uncompressed, and there is 
the serialization/deserialization overhead (query result -> protocol -> Java 
objects -> writing to the specific storage format, etc.).
This costs more time than you may think.
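
For reference, the EXPORT/IMPORT route mentioned below looks roughly like this 
per table (HDFS paths are assumptions):

-- On the production cluster
EXPORT TABLE mydb.mytable TO '/tmp/hive_export/mydb/mytable';

-- After copying /tmp/hive_export across (e.g. with distcp), on the UAT cluster
IMPORT TABLE mydb.mytable FROM '/tmp/hive_export/mydb/mytable';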

> On 25 May 2016, at 18:05, Mich Talebzadeh  wrote:
> 
> There are multiple ways of doing this without relying on any vendor's release.
> 
> 1) Using  hive EXPORT/IMPORT utility
> 
> EXPORT TABLE table_or_partition TO hdfs_path;
> IMPORT [[EXTERNAL] TABLE table_or_partition] FROM hdfs_path [LOCATION 
> [table_location]];
> 2) This works for individual tables but you can easily write a generic script 
> to pick up name of tables for a given database from Hive metadata.
>  example
> 
> SELECT
>   t.owner AS Owner
> , d.NAME AS DBName
> , t.TBL_NAME AS Tablename
> , TBL_TYPE
> FROM tbls t, dbs d
> WHERE
>   t.DB_ID = d.DB_ID
> AND
>   TBL_TYPE IN ('MANAGED_TABLE','EXTERNAL_TABLE')
> ORDER BY 1,2
> 
> Then a Linux shell script will take 5 min max to create and you have full 
> control of the code. You can even do multiple EXPORT/IMPORT at the same time.
> 
> 3) Easier  to create a shared NFS mount between PROD and UAT so you can put 
> the tables data and metadata on this NFS
> 
> 2) Use Spark shell script to get data via JDBC from the source database and 
> push schema and data into the new env. Again this is no different from 
> getting the underlying data from Oracle or Sybase database and putting in Hive
> 
> 3) Using vendor's product to do the same. I am not sure vendors do 
> parallelise this sort of things.
> 
> HTH
> 
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
>  
> 
>> On 25 May 2016 at 14:50, Suresh Kumar Sethuramaswamy  
>> wrote:
>> Hi
>> 
>>   If you are using CDH, via CM , Backup->replications you could do inter 
>> cluster hive data transfer including metadata
>> 
>> Regards
>> Suresh
>> 
>> 
>>> On Wednesday, May 25, 2016, mahender bigdata  
>>> wrote:
>>> Any Document on it. 
>>> 
 On 4/8/2016 6:28 PM, Will Du wrote:
 did you try export and import statement in HQL?
 
> On Apr 8, 2016, at 6:24 PM, Ashok Kumar  wrote:
> 
> Hi,
> 
> Anyone has suggestions how to create and copy Hive and Spark tables from 
> Production to UAT.
> 
> One way would be to copy table data to external files and then move the 
> external files to a local target directory and populate the tables in 
> target Hive with data.
> 
> Is there an easier way of doing so?
> 
> thanks
> 


Re: Hive and XML

2016-05-22 Thread Jörn Franke
XML is generally slow in any software. It is not recommended for large data 
volumes.
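
If XML nevertheless has to be queried directly, Hive's built-in xpath UDFs can 
be applied to a string column; a small sketch (the table layout and XPath 
expressions are assumptions):

CREATE TABLE xml_raw (doc STRING) STORED AS TEXTFILE;

SELECT xpath_string(doc, '/order/customer/name') AS customer,
       xpath_double(doc, '/order/total')         AS total
FROM xml_raw;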

> On 22 May 2016, at 10:15, Maciek  wrote:
> 
> Have you had to load XML data into Hive? Did you run into any problems or 
> experienced any pain points, e.g. complex schemas or performance?
> 
> I have done a lot of research on the role of XML in Big Data analytics 
> recently and would like to get your view on it. It would be very helpful if 
> you could fill out this brief survey https://mamont914.typeform.com/to/Ewo4aD
> 
> 
> 
> -- 
> Thank you,
> Kind Regards
> ~Maciek
> 


Re: HIVE on Windows

2016-05-18 Thread Jörn Franke
Use a distribution, such as Hortonworks 


> On 18 May 2016, at 19:09, Me To  wrote:
> 
> Hello,
> 
> I want to install Hive on my Windows machine but I am unable to find any 
> resources out there. I have been trying to set it up for one month but have 
> been unable to accomplish that. I have successfully set up Hadoop on my Windows machine. 
> According to this guide 
> 
> https://cwiki.apache.org/confluence/display/Hive/AdminManual+Installation
> 
> There are different steps involved to install and run it on Windows but where 
> are those steps documented? Please help me with this problem. I have posted 
> this question in almost all forums like Stackoverflow but nobody knows the 
> answer. 
> 
> I am using Windows 8 and Hadoop2.7 running on my desktop. I want to run hive 
> and beeline. Please help me.
> 
> Looking forward for response.
> 
> Thank you.
> Ekta Paliwal


Re: Query Failing while querying on ORC Format

2016-05-17 Thread Jörn Franke
I do not remember exactly, but I think it worked simply by adding a new 
partition to the old table with the additional columns.

> On 17 May 2016, at 15:00, Mich Talebzadeh  wrote:
> 
> Hi Mahendar,
> 
> That version 1.2 is reasonable.
> 
> One alternative is to create a new table (new_table) in Hive with columns 
> from old_table plus the added column new_column as ORC etc
> 
> Do an INSERT/SELECT from old_table to new_table
> 
> INSERT INTO new_table SELECT *, NULL AS new_column FROM old_table;
> ALTER TABLE old_table RENAME TO old_table_KEEP;
> ALTER TABLE new_table RENAME TO old_table;
> 
> That should work. Check the syntax.
> 
> HTH
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
>  
> 
>> On 16 May 2016 at 23:53, mahender bigdata  
>> wrote:
>> I'm on Hive 1.2
>> 
>>> On 5/16/2016 12:02 PM, Matthew McCline   wrote:
>>> ​
>>> What version of Hive are you on?
>>> 
>>> From: Mahender Sarangam 
>>> Sent: Saturday, May 14, 2016 3:29 PM
>>> To: user@hive.apache.org
>>> Subject: Query Failing while querying on ORC Format
>>>  
>>> Hi,
>>> We are dumping our data into ORC Partition Bucketed table. We have loaded 
>>> almost 6 months data and here month is Partition by column. Now we have 
>>> modified ORC partition bucketed table schema. We have added 2 more columns 
>>> to the ORC table. Now whenever we are running a select statement for an older 
>>> month which does not have the new columns (even though these columns are not 
>>> part of the select clause / projection columns), it is throwing an exception.
>>>  
>>> There is JIRA bug for this kind of requirement has already been raised.
>>> https://issues.apache.org/jira/browse/HIVE-11981
>>>  
>>> Can any one please tell me know alternative workaround for reading old 
>>> previous columns of ORC partition table.
>>>  
>>> Thanks
> 


Re: Performance for hive external to hbase with serval terabyte or more data

2016-05-11 Thread Jörn Franke
Why don't you export the data from HBase to Hive, e.g. in ORC format? You should 
not use MR with Hive, but Tez. Also use a recent Hive version (at least 1.2). 
You can then run queries there. For large log file processing in real time, one 
alternative, depending on your needs, could be Solr on Hadoop.
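
A rough sketch of that export (the HBase table name, column family mapping and 
target table are assumptions):

-- Map the existing HBase table into Hive
CREATE EXTERNAL TABLE logs_hbase (
  rowkey   STRING,
  log_line STRING
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,d:line")
TBLPROPERTIES ("hbase.table.name" = "logs");

-- Copy it once (or periodically) into ORC for analytical queries
CREATE TABLE logs_orc STORED AS ORC AS
SELECT rowkey, log_line FROM logs_hbase;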

> On 12 May 2016, at 03:06, Yi Jiang  wrote:
> 
> Hi, Guys
> Recently we are debating the usage for hbase as our destination for data 
> pipeline job.
> Basically, we want to save our logs into hbase, and our pipeline can generate 
> 2-4 terabytes of data every day, but our IT department thinks it is not a good idea 
> to scan HBase that much, as it will cause performance and memory issues. And they 
> ask us to just keep 15 minutes' worth of data in HBase for real-time analysis.
> For now, I am using a Hive table external to HBase, but what I am wondering is: 
> for a MapReduce job, what kind of mapper does it use to scan the data from 
> HBase? Is it TableInputFormatBase? And how many mappers will Hive use to 
> scan HBase? Is it efficient or not? Will it cause performance issues 
> if we have a couple of TBs or an even larger data volume?
> I am also trying to index some columns that we might use to query. But  I am 
> not sure if it is good idea to keep so much history data in the hbase for 
> query.
> Thank you
> Jacky
>  


Re: Making sqoop import use Spark engine as opposed to MapReduce for Hive

2016-04-30 Thread Jörn Franke
I do not think you make it faster by setting the execution engine to Spark, 
especially with such an old Spark version. 
For simple things such as "dump" bulk imports and exports, it matters much less, 
if at all, what execution engine you use.
There was recently a discussion on that on the Spark? or Sqoop mailing list.
With 1 billion rows it will be more the Oracle database that is the bottleneck.

> On 30 Apr 2016, at 16:05, Mich Talebzadeh  wrote:
> 
> Hi Marcin,
>  
> It is the speed really, the speed at which data is ingested into Hive.
>  
> Sqoop is two stage as I understand.
>  
> Take the data out of the RDBMS via JDBC and put it in an external HDFS file
> Read that file and insert into a Hive table
>  The issue is the second part. In general I use Hive 2 with Spark 1.3.1 
> engine to put data into Hive table. I wondered if there was such a parameter 
> in Sqoop to use Spark engine.
> 
> Well I gather this is easier said than done.  I am importing a 1 billion row 
> table from Oracle
> 
> sqoop import --connect "jdbc:oracle:thin:@rhes564:1521:mydb12" --username 
> scratchpad -P \
> --query "select * from scratchpad.dummy where \
> \$CONDITIONS" \
> --split-by ID \
> --hive-import  --hive-table "oraclehadoop.dummy" --target-dir "dummy"
> 
> 
> Now the fact that in hive-site.xml I have set hive.execution.engine=spark 
> does not matter. Sqoop seems to internally set  hive.execution.engine=mr 
> anyway.
> 
> Maybe there should be an option --hive-execution-engine='mr/tez/spark' etc. 
> in the above command?
> 
> Cheers,
> 
> Mich
> 
> 
>  
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
>  
> 
>> On 30 April 2016 at 14:51, Marcin Tustin  wrote:
>> They're not simply interchangeable. sqoop is written to use mapreduce.
>> 
>> I actually implemented my own replacement for sqoop-export in spark, which 
>> was extremely simple. It wasn't any faster, because the bottleneck was the 
>> receiving database.
>> 
>> Is your motivation here speed? Or correctness?
>> 
>>> On Sat, Apr 30, 2016 at 8:45 AM, Mich Talebzadeh 
>>>  wrote:
>>> Hi,
>>> 
>>> What is the simplest way of making sqoop import use spark engine as opposed 
>>> to the default mapreduce when putting data into hive table. I did not see 
>>> any parameter for this in sqoop command line doc.
>>> 
>>> Thanks
>>> 
>>> Dr Mich Talebzadeh
>>>  
>>> LinkedIn  
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>  
>>> http://talebzadehmich.wordpress.com
>> 
>> 
>> 
>> 
> 


Re: Container out of memory: ORC format with many dynamic partitions

2016-04-30 Thread Jörn Franke
I would still need some time to dig deeper in this. Are you using a specific 
distribution? Would it be possible to upgrade to a more recent Hive version?

However, having so many small partitions is a bad practice which seriously 
affects performance. Each partition should contain at least several ORC stripes 
of data; 300 rows is definitely too little. If in doubt, having fewer partitions 
is better than having more. In this respect it is no different from relational 
databases. Subpartitions in particular should be avoided if possible. Partitions 
are also not a good solution if the partitions differ greatly in size. Instead 
you should insert the data sorted on those columns to leverage ORC indexes and 
bloom filters at their best. Having only string types is also very bad for 
performance in any database. Additionally you can use buckets.
Alternatively, if your use case is search then you may look for Solr or 
ElasticSearch on Hadoop.
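
As an illustration of that advice only (the shortened column list, the bucket 
count and the choice to keep title out of the partition spec are assumptions):

-- Partition by dt only; keep title as a bucketed, sorted column instead of a partition
CREATE EXTERNAL TABLE partitioned_table_alt (
  field1     STRING,
  field2     STRING,
  title      STRING,
  title_type STRING
)
PARTITIONED BY (dt STRING)
CLUSTERED BY (title) SORTED BY (title) INTO 32 BUCKETS
STORED AS ORC
LOCATION '/hive/warehouse/partitioned_table_alt';

SET hive.enforce.bucketing=true;

INSERT OVERWRITE TABLE partitioned_table_alt PARTITION (dt='2016-04-05')
SELECT field1, field2, title, title_type
FROM original_table
WHERE dt = '2016-04-05';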


> On 30 Apr 2016, at 03:49, Matt Olson  wrote:
> 
> Hi all,
> 
> I am using Hive 1.0.1 and trying to do a simple insert into an ORC table, 
> creating dynamic partitions. I am selecting from a table partitioned by dt 
> and category, and inserting into a table partitioned by dt, title, and 
> title_type. Other than the partitioning, the tables have the same schemas. 
> Both title and title_type are fields in the first table, and when I insert 
> into the second table, I am using them to create dynamic partitions. The .q 
> file with the CREATE and INSERT statements is copied below.
> 
> SET hive.optimize.sort.dynamic.partition=true;
> SET hive.exec.orc.memory.pool=1.0;
> SET hive.exec.max.dynamic.partitions = 5000;
> SET hive.exec.max.dynamic.partitions.pernode = 5000;
> SET hive.merge.mapfiles = true;
> SET mapred.min.split.size=134217728;
> SET mapred.min.split.size.per.node=134217728;
> SET mapred.min.split.size.per.rack=134217728;
> SET mapred.output.compression.codec=com.hadoop.compression.lzo.LzoCodec;
> SET mapred.map.output.compression.codec=com.hadoop.compression.lzo.LzoCodec;
> SET mapred.max.split.size=134217728;
> SET hive.map.aggr.hash.percentmemory=0.125;
> SET hive.exec.parallel=true;
> SET hive.exec.compress.intermediate=true;
> SET hive.exec.compress.output=true;
> SET mapred.map.child.java.opts=-Xmx2048M; 
> SET mapred.child.java.opts=-Xmx2048M;
> SET mapred.task.profile=false;
> 
> CREATE EXTERNAL TABLE IF NOT EXISTS dynamic_partition_table (
> field1 string,
> field2 string,
> ...
> field26 string
> )
> PARTITIONED BY (dt string, title string, title_type string)
> STORED AS ORC 
> LOCATION '/hive/warehouse/partitioned_table'
> TBLPROPERTIES ("orc.compress.size"="16000");
> 
> INSERT OVERWRITE TABLE dynamic_partition_table PARTITION (dt="2016-04-05", 
> title, title_type)
> SELECT 
> field1,
> field2,
> ...
> title,
> title_type
> FROM original_table
> WHERE dt = "2016-04-05";
> 
> The original table has about 250 GB of data for 2016-04-05, and about 260 
> different titles (some titles have very little data, some have ~20 GB). There 
> is generally only one title_type per title. The INSERT action succeeds on 
> that data set, but when I add 2000 new titles with 300 rows each to the 
> original table, I get the following error during the INSERT:
> 
> Container [pid=6278,containerID=container_e26_1460661845156_49295_01_000244] 
> is running beyond physical memory limits. Current usage: 2.2 GB of 2 GB 
> physical memory used; 2.7 GB of 4.2 GB virtual memory used. Killing container.
> 
> I've found a couple questions online about this same error message for ORC 
> files with lots of dynamic partitions, on an older version of Hive:
> https://qnalist.com/questions/4836037/hive-0-12-orc-heap-issues-on-write
> 
> Based on that and the information about configuration properties at 
> https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-ORCFileFormat,
>  I have tried setting hive.exec.orc.memory.pool=1.0 in order to give as much 
> heap space as possible to the ORC file writers. As you can see from the 
> CREATE TABLE statement, I also decreased the orc.compress.size from the 
> default 256 kb to 16 kb. After making these changes, the INSERT is still 
> failing with the "beyond physical memory limits" error.
> 
> I've tried inserting into a table stored as RCFile rather than ORC, and in 
> that case the action succeeds even with the additional 2000 titles.
> 
> Can anyone explain how exactly the two ORC parameters above affect the 
> writing of dynamic partitions in ORC files, and why I'm not getting the OOM 
> error when I use the RCFile format instead?  I'd also appreciate any 
> suggestions for other tuning I could do to fix the memory management when 
> using ORC.
> 
> Thanks for any help,
> Matt


Analyzing Bitcoin blockchain data with Hive

2016-04-29 Thread Jörn Franke
Dear all,

I prepared a small Serde to analyze Bitcoin blockchain data with Hive:
https://snippetessay.wordpress.com/2016/04/28/hive-bitcoin-analytics-on-blockchain-data-with-sql/

There are some example queries, but I will add some in the future.
Additionally, more unit tests will be added.

Let me know if this is useful for you and of course please report bugs ;)

Thank you.

Cheers


Re: Sqoop_Sql_blob_types

2016-04-27 Thread Jörn Franke
You could try them as binary. Is it just for storing the blobs or for doing 
analyses on them? In the first case you may think about storing them as files in 
HDFS and including in Hive just a string containing the file name (to make 
analysis of the other data faster). In the latter case you should think about an 
optimal analysis format in Hive.
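
Two minimal sketches of those options (table and column names are made up):

-- Option 1: keep the BLOB content itself in Hive as BINARY
CREATE TABLE documents_bin (
  doc_id   STRING,
  doc_meta STRING,
  doc_body BINARY
)
STORED AS ORC;

-- Option 2: store the blob as a file in HDFS and keep only a reference for analysis
CREATE TABLE documents_ref (
  doc_id   STRING,
  doc_meta STRING,
  doc_path STRING   -- e.g. hdfs:///data/blobs/<doc_id>.bin
)
STORED AS ORC;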

> On 27 Apr 2016, at 22:13, Ajay Chander  wrote:
> 
> Hi Everyone,
> 
> I have a table which has few columns as blob types with huge data. Is there 
> any best way to 'sqoop import' it to hive tables with out losing any data ? 
> Any help is highly appreciated.
> 
> Thank you!


Re: Hive external indexes incorporation into Hive CBO

2016-04-21 Thread Jörn Franke
I am still not sure why you think they are not used. The main issue is that the 
block size is usually very large (e.g. 256 MB compared to kilobytes or sometimes 
a few megabytes in traditional databases) and the indexes refer to blocks. This 
makes it less likely that you can leverage them for small datasets in the area of 
GBs; they make more sense for TBs or PBs, unless you reduce the block size. 
However, given ORC, traditional indexes probably make less sense. I tried 
to give an overview here: https://t.co/DtkjKWOM7G
However, comments to improve it are very welcome.



Sent from my iPhone
> On 21 Apr 2016, at 12:02, Mich Talebzadeh  wrote:
> 
> 
> Hi,
> 
> As we have discussed this few times, Hive external indexes (as opposed to 
> Store Indexes in ORC tables) are there but are not currently utilised.
> 
> For Hive to be effective it needs to use these indexes for a variety of 
> reasons and the CBO should leverage these indexes.
> 
> I am not sure how far we are down the road with Work In Progress on this. 
> However, I am happy to help with this.
> 
> HTH
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
>  


Re: Hive footprint

2016-04-20 Thread Jörn Franke
Hive has working indexes. However many people overlook that a block is usually 
much larger than in a relational database and thus do not use them right.

> On 19 Apr 2016, at 09:31, Mich Talebzadeh  wrote:
> 
> The issue is that Hive has indexes (not index store) but they don't work, so 
> there we go. Maybe in later releases we can make use of these indexes for 
> faster queries. Hive even allows bitmap indexes on the fact table but they are 
> never used by the CBO.
> 
> show indexes on sales;
> 
> +---+---+---+--+---+--+--+
> |   idx_name|   tab_name|   col_names   | 
>   idx_tab_name   |   idx_type| comment  |
> +---+---+---+--+---+--+--+
> | sales_cust_bix| sales | cust_id   | 
> oraclehadoop__sales_sales_cust_bix__ | bitmap|  |
> | sales_channel_bix | sales | channel_id| 
> oraclehadoop__sales_sales_channel_bix__  | bitmap|  |
> | sales_prod_bix| sales | prod_id   | 
> oraclehadoop__sales_sales_prod_bix__ | bitmap|  |
> | sales_promo_bix   | sales | promo_id  | 
> oraclehadoop__sales_sales_promo_bix__| bitmap|  |
> | sales_time_bix| sales | time_id   | 
> oraclehadoop__sales_sales_time_bix__ | bitmap|  |
> +---+---+---+--+---+--+--+
> 
> 
> 
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
>  
> 
>> On 18 April 2016 at 23:51, Marcin Tustin  wrote:
>> We use a hive with ORC setup now. Queries may take thousands of seconds with 
>> joins, and potentially tens of seconds with selects on very large tables. 
>> 
>> My understanding is that the goal of hbase is to provide much lower latency 
>> for queries. Obviously, this comes at the cost of not being able to perform 
>> joins. I don't actually use hbase, so I hesitate to say more about it. 
>> 
>>> On Mon, Apr 18, 2016 at 6:48 PM, Mich Talebzadeh 
>>>  wrote:
>>> Thanks Marcin.
>>> 
>>> What is the definition of low latency here? Are you referring to the 
>>> performance of SQL against HBase tables compared to Hive. As I understand 
>>> HBase is a columnar database. Would it be possible to use Hive against ORC 
>>> to achieve the same?
>>> 
>>> Dr Mich Talebzadeh
>>>  
>>> LinkedIn  
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>  
>>> http://talebzadehmich.wordpress.com
>>>  
>>> 
 On 18 April 2016 at 23:43, Marcin Tustin  wrote:
 HBase has a different use case - it's for low-latency querying of big 
 tables. If you combined it with Hive, you might have something nice for 
 certain queries, but I wouldn't think of them as direct competitors.
 
> On Mon, Apr 18, 2016 at 6:34 PM, Mich Talebzadeh 
>  wrote:
> Hi,
> 
> I notice that Impala is rarely mentioned these days.  I may be missing 
> something. However, I gather it is coming to an end now as I don't recall 
> many use cases for it (or customers asking for it). In contrast, Hive has 
> held its ground with the new addition of Spark and Tez as execution 
> engines, support for ACID and ORC, and new stuff in Hive 2. In addition, 
> provided a good choice for its metastore, it scales well.
> 
> If Hive had the ability (organic) to have local variable and stored 
> procedure support then it would be top notch Data Warehouse. Given its 
> metastore, I don't see any technical reason why it cannot support these 
> constructs.
> 
> I was recently asked to comment on migration from commercial DWs to Big 
> Data (primarily for TCO reason) and really could not recall any better 
> candidate than Hive. Is HBase a viable alternative? Obviously whatever 
> one decides there is still HDFS, a good engine for Hive (sounds like many 
> prefer TEZ although I am a Spark fan) and the ubiquitous YARN.
> 
> Let me know your thoughts.
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
 
 

Re: Hive footprint

2016-04-20 Thread Jörn Franke
It really depends on what you want to do. Hive is more for queries involving a lot of 
data, whereas HBase+Phoenix is more for OLTP scenarios or sensor ingestion.

I think the reason is that Hive has been the entry point for many engines and 
formats. Additionally there are a lot of tuning capabilities, from hardware to 
software, to make it fast. Thus, other software has always had it a little bit 
difficult.


> On 19 Apr 2016, at 00:34, Mich Talebzadeh  wrote:
> 
> Hi,
> 
> I notice that Impala is rarely mentioned these days.  I may be missing 
> something. However, I gather it is coming to an end now as I don't recall many 
> use cases for it (or customers asking for it). In contrast, Hive has held its 
> ground with the new addition of Spark and Tez as execution engines, support 
> for ACID and ORC, and new stuff in Hive 2. In addition, provided a good choice 
> for its metastore, it scales well.
> 
> If Hive had the ability (organic) to have local variable and stored procedure 
> support then it would be top notch Data Warehouse. Given its metastore, I 
> don't see any technical reason why it cannot support these constructs.
> 
> I was recently asked to comment on migration from commercial DWs to Big Data 
> (primarily for TCO reason) and really could not recall any better candidate 
> than Hive. Is HBase a viable alternative? Obviously whatever one decides 
> there is still HDFS, a good engine for Hive (sounds like many prefer TEZ 
> although I am a Spark fan) and the ubiquitous YARN.
> 
> Let me know your thoughts.
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
>  


Re: Moving Hive metastore to Solid State Disks

2016-04-17 Thread Jörn Franke

You could also explore the in-memory option of Oracle 12c. However, I am not sure 
how beneficial it is for OLTP scenarios.

I am excited to see how the performance will be on hbase as a hive metastore.

Nevertheless, your results on Oracle/SSD will be beneficial for the community.

> On 17 Apr 2016, at 11:52, Mich Talebzadeh  wrote:
> 
> Hi,
> 
> I have had my Hive metastore database on Oracle 11g supporting concurrency 
> (with added transactional capability)
> 
> Over the past few days I created a new schema on Oracle 12c on Solid State 
> Disks (SSD) and used databump (exdp, imdp) to migrate Hive database from 
> Oracle 11g to Oracle 12c on SSD.
> 
> Couple of years ago I did some work for OLTP operations (many random access 
> via index scan plus serial scans) for Oracle 11g and SAP ASE 15.7. I noticed 
> that for Random Access with Index scans the performance improves by a factor 
> of 20  because of much faster seek time for SSD
> 
> https://www.scribd.com/doc/119707722/IOUG-SELECT-Q312-Final
> 
> I have recently seen some contention for access resources in Hive database, 
> so I think going to SSD will improve the performance of Hive in general.
> 
> I will look at AWR reports to see how beneficial this set up is as this 
> Oracle instance is more and less dedicated to Hive.
> 
> HTH
> 
> HTH
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
>  

