Re: CREATE FUNCTION: How to automatically load extra jar file?

2015-01-14 Thread arthur.hk.c...@gmail.com
Hi,
 
I have deleted the original Hive metastore database from MySQL and re-created a
new one with character set 'latin1'. I have also put the jar file into HDFS with a
shorter file name; the 'max key length is 767 bytes' issue from MySQL is resolved.
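
A rough sketch of the recreate step, in case it helps anyone else (assuming a MySQL metastore; the database name 'metastore' below is only illustrative):

-- drop and recreate the metastore database with a latin1 default charset,
-- so index key lengths stay under MySQL's 767-byte limit
DROP DATABASE IF EXISTS metastore;
CREATE DATABASE metastore CHARACTER SET latin1;
-- after this, the Hive metastore schema script (e.g. the bundled
-- hive-schema-*.mysql.sql) has to be re-applied before restarting Hive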


Tried again:
1) drop function sysdate;
2) CREATE FUNCTION sysdate AS 'com.nexr.platform.hive.udf.UDFSysDate' using JAR 
'hdfs://mycluster/hadoop/nexr.jar';

3) (check the hive log)
2015-01-15 15:05:43,133 INFO  [main]: ql.Driver (Driver.java:getSchema(238)) - 
Returning Hive schema: Schema(fieldSchemas:null, properties:null)
2015-01-15 15:05:43,133 INFO  [main]: log.PerfLogger 
(PerfLogger.java:PerfLogEnd(135)) - 
2015-01-15 15:05:43,133 INFO  [main]: log.PerfLogger 
(PerfLogger.java:PerfLogBegin(108)) - 
2015-01-15 15:05:43,133 INFO  [main]: ql.Driver (Driver.java:execute(1192)) - 
Starting command: CREATE FUNCTION sysdate AS 
'com.nexr.platform.hive.udf.UDFSysDate' using JAR 
'hdfs://mycluster/hadoop/nexr.jar'
2015-01-15 15:05:43,134 INFO  [main]: log.PerfLogger 
(PerfLogger.java:PerfLogEnd(135)) - 
2015-01-15 15:05:43,134 INFO  [main]: log.PerfLogger 
(PerfLogger.java:PerfLogBegin(108)) - 
2015-01-15 15:05:43,134 INFO  [main]: log.PerfLogger 
(PerfLogger.java:PerfLogBegin(108)) - 
2015-01-15 15:05:43,135 INFO  [main]: SessionState 
(SessionState.java:printInfo(536)) - converting to local  
hdfs://mycluster/hadoop/nexr.jar
2015-01-15 15:05:43,142 INFO  [main]: SessionState 
(SessionState.java:printInfo(536)) - Added 
/tmp/606e6a26-775f-40c8-be18-a670deee2f7e_resources/nexr.jar to class path
2015-01-15 15:05:43,142 INFO  [main]: SessionState 
(SessionState.java:printInfo(536)) - Added resource: 
/tmp/606e6a26-775f-40c8-be18-a670deee2f7e_resources/nexr.jar

4) select sysdate();

converting to local hdfs://mycluster/hadoop/nexr.jar
Added /tmp/abce45b1-6041-40b6-83ed-8c6491216360_resources/nexr.jar to class path
Added resource: /tmp/abce45b1-6041-40b6-83ed-8c6491216360_resources/nexr.jar
Automatically selecting local only mode for query
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
java.io.FileNotFoundException: File does not exist: 
hdfs://mycluster/tmp/abce45b1-6041-40b6-83ed-8c6491216360_resources/nexr.jar
at 
org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1128)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1120)
at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1120)
at 
org.apache.hadoop.mapreduce.filecache.ClientDistributedCacheManager.getFileStatus(ClientDistributedCacheManager.java:288)
at 
org.apache.hadoop.mapreduce.filecache.ClientDistributedCacheManager.getFileStatus(ClientDistributedCacheManager.java:224)
at 
org.apache.hadoop.mapreduce.filecache.ClientDistributedCacheManager.determineTimestamps(ClientDistributedCacheManager.java:93)
at 
org.apache.hadoop.mapreduce.filecache.ClientDistributedCacheManager.determineTimestampsAndCacheVisibilities(ClientDistributedCacheManager.java:57)
at 
org.apache.hadoop.mapreduce.JobSubmitter.copyAndConfigureFiles(JobSubmitter.java:265)
at 
org.apache.hadoop.mapreduce.JobSubmitter.copyAndConfigureFiles(JobSubmitter.java:301)
at 
org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:389)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1285)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1282)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1282)
at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:562)
at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:557)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556)
at 
org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:557)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:548)
at 
org.apache.hadoop.hive.ql.exec.mr.ExecDriver.execute(ExecDriver.java:420)
at 
org.apache.hadoop.hive.ql.exec.mr.ExecDriver.main(ExecDriver.java:740)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccesso
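
As an aside for the archive, a session-scoped registration is a possible fallback while the permanent CREATE FUNCTION ... USING JAR path is being debugged (sketch only; /tmp/nexr.jar below stands for a local copy of the jar):

-- register the jar for this session only and create a temporary function
ADD JAR /tmp/nexr.jar;
CREATE TEMPORARY FUNCTION sysdate AS 'com.nexr.platform.hive.udf.UDFSysDate';
SELECT sysdate();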

sum a double in ORC bug?

2015-01-14 Thread Martin, Nick
*Hive 13

I'm storing a sales amount column as a double in an ORC table and when I do:

select sum(x) from sometable

I get a value like 4.79165141174808E9

A visual inspection of the column values reveals no glaring anomalies...all 
looks pretty normal. If I do the same thing in a textfile table I get a 
perfectly fine aggregation of the double field.
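
One hedged way to narrow it down is to check whether the ORC and textfile results differ in the actual value or only in how the double is rendered (names follow the query above; the decimal precision is arbitrary):

-- compare the raw sum against explicitly formatted versions of the same value
select sum(x),
       cast(sum(x) as decimal(20,4)),
       printf('%.4f', sum(x))
from sometable;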

I couldn't find anything like this already reported so I filed a bug (HIVE-9385).


RE: Set variable via query

2015-01-14 Thread Martin, Nick
Thx..went with something similar-ish to fit the need.

From: Xuefu Zhang [mailto:xzh...@cloudera.com]
Sent: Tuesday, January 13, 2015 9:48 PM
To: user@hive.apache.org
Subject: Re: Set variable via query

select * from someothertable where dt IN (select max(dt) from sometable);
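
The reason the original attempt can't work as written: hiveconf variables are substituted as plain text before the statement runs, they are never evaluated as queries. A small sketch (values illustrative):

-- the variable is pasted in literally; Hive does not execute the string as a query
set startdt=2015-01-13;
select * from someothertable where dt = '${hiveconf:startdt}';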

On Tue, Jan 13, 2015 at 4:39 PM, Martin, Nick <nimar...@pssd.com> wrote:
Hi all,

I'm looking to set a variable in Hive and use the resulting value in a 
subsequent query. Something like:

set startdt='select max(dt) from sometable';
select * from someothertable where dt=${hiveconf:startdt};

I found this is still open: HIVE-2165

Any options? Tried a flavor of above via CLI and it didn't work.

On Hive 13

Thanks!
Nick



jdbc:hive vs jdbc:hive2

2015-01-14 Thread Edward Capriolo
Just a heads up. For anyone that has been using jdbc:hive

I noticed a recent hive...

jdbc:hive2://myhost:port

SQL exception "Invalid URL"

It might be better if the exception said " Invalid URL. Url must start with
jdbc:hive"


Re: Adding new columns to parquet based Hive table

2015-01-14 Thread Kumar V
Hi,
Thanks for your response. I can't do another insert as the data is already in
the table. Also, since there is a lot of data in the table already, I am trying
to find a way to avoid reprocessing/reloading.
Thanks.

On Wednesday, January 14, 2015 2:47 PM, Daniel Haviv wrote:

Hi Kumar,
Altering the table just updates Hive's metadata without updating Parquet's
schema. I believe that if you insert into your table (after adding the column)
you'll be able to select all 3 columns later on.
Daniel
On 14 Jan 2015, at 21:34, Kumar V wrote:


Hi,
Any ideas on how to go about this? Any insights you have would be helpful.
I am kinda stuck here.

Here are the steps I followed on Hive 0.13:

1) create table t (f1 String, f2 string) stored as Parquet;
2) upload parquet files with 2 fields
3) select * from t; < Works fine.
4) alter table t add columns (f3 string);
5) Select * from t; <- ERROR "Caused by: java.lang.IllegalStateException: Column f3 at index 2 does not exist
  at org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.init(DataWritableReadSupport.java:116)
  at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.getSplit(ParquetRecordReaderWrapper.java:204)
  at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:79)
  at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:66)
  at org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:51)
  at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.<init>(CombineHiveRecordReader.java:65)


 

On Wednesday, January 7, 2015 2:55 PM, Kumar V wrote:

Hi,
I have a Parquet format Hive table with a few columns. I have loaded a lot of
data to this table already and it seems to work. I have to add a few new
columns to this table. If I add new columns, queries don't work anymore since
I have not reloaded the old data. Is there a way to add new fields to the table
and not reload the old Parquet files and make the query work?

I tried this in Hive 0.10 and also on Hive 0.13. Getting an error in both
versions.

Please let me know how to handle this.

Regards,
Kumar.




   

Re: Hive being slow

2015-01-14 Thread Ashutosh Chauhan
Can you run your query with the following config:

hive> set hive.fetch.task.conversion=none;

and run your two queries with this. Let's see if this makes a difference. My
expectation is that this will result in an MR job getting launched and thus the
runtimes might be different.
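
In other words, something like this (a sketch reusing the table name and predicates from the thread below), so both statements go through an actual MR job instead of the fetch-task shortcut:

set hive.fetch.task.conversion=none;
-- both queries should now launch a job, which makes their runtimes comparable
select * from table1 where id = 'value';
select * from table1 where id > 'zzz';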

On Sat, Jan 10, 2015 at 4:54 PM, Abhishek kumar wrote:

> First I tried running the query: select * from table1 where id = 'value';
> It was very fast, as expected, since HBase returned the results very fast.
> In this case, I observed no map/reduce tasks getting spawned.
>
> Now, for the query select * from table1 where id > 'zzz', I expected the
> filter pushdown to happen (at least, that is what the 0.14 code suggests). And since
> there were no results found, HBase will again reply very fast and thus Hive
> should output the query's result very fast. But this is not happening, and
> from the datanode logs it looks like a lot of reads are happening
> (close to a full table scan of 10 GB of data). I expected the response time
> to be very close to the above query's time.
>
> I will check the number of tasks getting launched.
>
> My questions are:
> * Why was there no filter pushdown (id > 'zzz') happening for this
> very simple query?
> * Since this query can only be resolved from HBase, will Hive launch map
> tasks? (Last time, I believe I observed no map task getting launched.)
>
> --
> Abhishek
>
> On Sat, Jan 10, 2015 at 4:14 AM, Ashutosh Chauhan wrote:
>
>> Hi Abhishek,
>>
>> How are you determining that it's resulting in a full table scan? One way to
>> ascertain that the filter got pushed down is to see how many tasks were
>> launched for your query, with and without the filter. One would expect a lower #
>> of splits (and thus tasks) for the query having the filter.
>>
>> Thanks,
>> Ashutosh
>>
>> On Sun, Dec 28, 2014 at 8:38 PM, Abhishek kumar wrote:
>>
>>> Hi,
>>>
>>> I am using Hive 0.14 which runs over HBase (having ~10 GB of data). I am
>>> facing issues in terms of slowness when querying over HBase. My query looks
>>> like the following:
>>>
>>> select * from table1 where id > '';  (id is the row-key)
>>>
>>> As per the Hive code, id > 'zzz' is getting pushed to the HBase scanner as
>>> 'startKey'. Now, given there are no such row-keys (id) which satisfy this
>>> criteria, this query should be extremely fast. But Hive is taking a lot of
>>> time; it looks like a full HBase table scan.
>>> Can someone let me know where I am wrong in understanding the whole
>>> thing?
>>>
>>> --
>>> Abhishek
>>>
>>
>>
>


Re: Adding new columns to parquet based Hive table

2015-01-14 Thread Daniel Haviv
Hi Kumar,
Altering the table just updates Hive's metadata without updating Parquet's
schema.
I believe that if you insert into your table (after adding the column) you'll
be able to select all 3 columns later on.
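
A minimal sketch of that sequence, using the table from your steps (the staging source and the literal value are placeholders I made up):

-- newly written files will carry f3 in their Parquet schema;
-- the files written before the ALTER still will not
alter table t add columns (f3 string);
insert into table t select f1, f2, 'f3_value' from t_staging;  -- t_staging is hypothetical
select f1, f2, f3 from t;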

Daniel

> On 14 Jan 2015, at 21:34, Kumar V wrote:
> 
> Hi,
> 
> Any ideas on how to go about this ? Any insights you have would be 
> helpful. I am kinda stuck here.
> 
> Here are the steps I followed on hive 0.13
> 
> 1) create table t (f1 String, f2 string) stored as Parquet;
> 2) upload parquet files with 2 fields
> 3) select * from t; < Works fine.
> 4) alter table t add columns (f3 string);
> 5) Select * from t; <- ERROR  "Caused by: 
> java.lang.IllegalStateException: Column f3 at index 2 does not exist 
> at 
> org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.init(DataWritableReadSupport.java:116)
>   at 
> org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.getSplit(ParquetRecordReaderWrapper.java:204)
>   at 
> org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:79)
>   at 
> org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:66)
>   at 
> org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:51)
>   at 
> org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.<init>(CombineHiveRecordReader.java:65)
> 
> 
> 
> 
> 
> On Wednesday, January 7, 2015 2:55 PM, Kumar V wrote:
> 
> 
> Hi,
> I have a Parquet format Hive table with a few columns.  I have loaded a 
> lot of data to this table already and it seems to work.
> I have to add a few new columns to this table.  If I add new columns, queries 
> don't work anymore since I have not reloaded the old data.
> Is there a way to add new fields to the table and not reload the old Parquet 
> files and make the query work?
> 
> I tried this in Hive 0.10 and also on hive 0.13.  Getting an error in both 
> versions.
> 
> Please let me know how to handle this.
> 
> Regards,
> Kumar. 
> 
> 


Re: Adding new columns to parquet based Hive table

2015-01-14 Thread Kumar V
Hi,
Any ideas on how to go about this? Any insights you have would be helpful.
I am kinda stuck here.
Here are the steps I followed on Hive 0.13:

1) create table t (f1 String, f2 string) stored as Parquet;
2) upload parquet files with 2 fields
3) select * from t; < Works fine.
4) alter table t add columns (f3 string);
5) Select * from t; <- ERROR "Caused by: java.lang.IllegalStateException: Column f3 at index 2 does not exist
  at org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.init(DataWritableReadSupport.java:116)
  at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.getSplit(ParquetRecordReaderWrapper.java:204)
  at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:79)
  at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:66)
  at org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:51)
  at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.<init>(CombineHiveRecordReader.java:65)


 

On Wednesday, January 7, 2015 2:55 PM, Kumar V wrote:

Hi,
I have a Parquet format Hive table with a few columns. I have loaded a lot of
data to this table already and it seems to work. I have to add a few new
columns to this table. If I add new columns, queries don't work anymore since
I have not reloaded the old data. Is there a way to add new fields to the table
and not reload the old Parquet files and make the query work?

I tried this in Hive 0.10 and also on Hive 0.13. Getting an error in both
versions.

Please let me know how to handle this.

Regards,
Kumar.

   

Re: cannot store value into partition column

2015-01-14 Thread Patcharee Thongtra
After I changed org.apache.hcatalog.pig.HCatStorer() to 
org.apache.hive.hcatalog.pig.HCatStorer(), it worked.
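
In case it helps someone else, a quick check (sketch only) that the partition values actually landed after switching to the new package name:

-- verify the partition column was populated
show partitions cossin;
select zone, count(*) from cossin group by zone;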


Patcharee

On 01/14/2015 02:57 PM, Patcharee Thongtra wrote:

Hi,

I am having a weird problem. I created a table in orc format:


Create table

create external table cossin (x int, y int, cos float, sin float) 
PARTITIONED BY(zone int) stored as orc location 
'/apps/hive/warehouse/wrf_tables/cossin' tblproperties 
("orc.compress"="ZLIB");


I run a pig script below to import data into this table 'cossin'.


Pig script

...
r_three_dim = FOREACH result_three_dim GENERATE
 $ZONE as zone: int,
 result::x as x: int, result::y as y: int,
 result::cos as cos: float, result::sin as sin: float;

x = FILTER r_three_dim by x < 5 and y < 5;
dump x;
describe x;

store x into 'cossin' using org.apache.hcatalog.pig.HCatStorer();


Dump x

(2,3,3,0.9883806,-0.15199915)
(2,3,4,0.98836243,-0.15211758)
(2,4,1,0.98830783,-0.15247186)
(2,4,2,0.9882811,-0.15264522)
(2,4,3,0.9882628,-0.15276346)
(2,4,4,0.98824626,-0.15287022)
x: {zone: int,x: int,y: int,cos: float,sin: float}

But when I checked the table 'cossin', zone is NULL instead of 2.

Any ideas?

BR,
Patcharee





cannot store value into partition column

2015-01-14 Thread Patcharee Thongtra

Hi,

I am having a weird problem. I created a table in orc format:


Create table

create external table cossin (x int, y int, cos float, sin float) 
PARTITIONED BY(zone int) stored as orc location 
'/apps/hive/warehouse/wrf_tables/cossin' tblproperties 
("orc.compress"="ZLIB");


I run a pig script below to import data into this table 'cossin'.


Pig script

...
r_three_dim = FOREACH result_three_dim GENERATE
 $ZONE as zone: int,
 result::x as x: int, result::y as y: int,
 result::cos as cos: float, result::sin as sin: float;

x = FILTER r_three_dim by x < 5 and y < 5;
dump x;
describe x;

store x into 'cossin' using org.apache.hcatalog.pig.HCatStorer();


Dump x

(2,3,3,0.9883806,-0.15199915)
(2,3,4,0.98836243,-0.15211758)
(2,4,1,0.98830783,-0.15247186)
(2,4,2,0.9882811,-0.15264522)
(2,4,3,0.9882628,-0.15276346)
(2,4,4,0.98824626,-0.15287022)
x: {zone: int,x: int,y: int,cos: float,sin: float}

But when I checked the table 'cossin', zone is NULL instead of 2.

Any ideas?

BR,
Patcharee



struggling with Hive ACID - need help

2015-01-14 Thread Devopam Mittra
Env: Hive 0.14 , HDP2.2, Centos 6.6

I created a simple table (ref syntax below)
DROP TABLE IF EXISTS  twitter_user_info;

CREATE TABLE twitter_user_info
( userid BIGINT COMMENT 'User Id',
  username STRING COMMENT 'User Name',
  is_latest BOOLEAN COMMENT 'Currently active',
  last_modified_ts TIMESTAMP COMMENT 'Last modified timestamp'
)
COMMENT 'Twitter User Info Table'
CLUSTERED BY (userid,is_latest) SORTED BY (userid) INTO 30 BUCKETS
STORED AS ORC
TBLPROPERTIES('transactional'='true');


I set the following parameters in the Hive config via Ambari:

set hive.support.concurrency=true;
set hive.enforce.bucketing=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
set hive.compactor.initiator.on=true;
set hive.compactor.worker.threads=100;

I inserted a few rows via the following query (not really relevant, but pasting it anyway):
INSERT INTO TABLE twitter_user_info
SELECT created_by_id, created_by_name,'Y',MAX(created_at)
FROM twitter_data B
WHERE b.created_by_id IS NOT NULL
AND b.created_by_id NOT IN (
 SELECT created_by_id FROM
(SELECT created_by_id,COUNT(DISTINCT
created_by_name)
 FROM twitter_data
 WHERE created_by_id IS NOT NULL
 GROUP BY created_by_id
 HAVING COUNT(DISTINCT created_by_name) >1) A
)
GROUP BY b.created_by_id,b.created_by_name;

For each operation below I am hitting the same error, and am clueless at
this point.

hive> UPDATE twitter_user_info
> SET is_latest = false
> WHERE userid = 298779403
> AND last_modified_ts = '2015-01-13 14:25:14';
*FAILED: NoMatchingMethodException No matching method for class
org.apache.hadoop.hive.ql.udf.UDFToLong with
(struct). Possible choices:
_FUNC_(bigint)  _FUNC_(boolean)  _FUNC_(decimal(38,18))  _FUNC_(double)
_FUNC_(float)  _FUNC_(int)  _FUNC_(smallint)  _FUNC_(string)
_FUNC_(timestamp)  _FUNC_(tinyint)  _FUNC_(void)  *
hive> DELETE FROM twitter_user_info WHERE userid=100;

*FAILED: NoMatchingMethodException No matching method for class
org.apache.hadoop.hive.ql.udf.UDFToLong with
(struct). Possible choices:
_FUNC_(bigint)  _FUNC_(boolean)  _FUNC_(decimal(38,18))  _FUNC_(double)
_FUNC_(float)  _FUNC_(int)  _FUNC_(smallint)  _FUNC_(string)
_FUNC_(timestamp)  _FUNC_(tinyint)  _FUNC_(void)  *
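
For reference, the ACID examples in the Hive documentation cluster on a single column and do not use SORTED BY; a reworked DDL in that shape is sketched below purely as something to test against, not as a confirmed fix:

-- same table, but clustered on userid only and without SORTED BY
CREATE TABLE twitter_user_info
( userid BIGINT COMMENT 'User Id',
  username STRING COMMENT 'User Name',
  is_latest BOOLEAN COMMENT 'Currently active',
  last_modified_ts TIMESTAMP COMMENT 'Last modified timestamp'
)
COMMENT 'Twitter User Info Table'
CLUSTERED BY (userid) INTO 30 BUCKETS
STORED AS ORC
TBLPROPERTIES('transactional'='true');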

Any pointers are welcome

-- 
Devopam Mittra
Life and Relations are not binary