Double quotes in csv data

2015-10-21 Thread michael.england
Hi,

I have some CSV data in which each field is enclosed in double quotes, e.g.
“hello”, “”, “test”, “”.

I noticed that there is a CSV Serde now available in Hive 0.14. Is it possible 
to use this to strip the quotes when querying an external Hive table?

https://cwiki.apache.org/confluence/display/Hive/CSV+Serde

e.g. select * from table limit 5 should return:

hello test   
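For reference, a sketch of what such a table definition could look like with the CSV SerDe from that wiki page (table name, column names, and path are made up for illustration; note that this SerDe treats every column as STRING):

```sql
-- Hypothetical external table over quoted CSV data, using the CSV SerDe
-- available from Hive 0.14. The SerDe strips the enclosing double quotes
-- at read time.
CREATE EXTERNAL TABLE quoted_csv (
  col1 STRING,
  col2 STRING,
  col3 STRING,
  col4 STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = ",",
  "quoteChar"     = "\""
)
LOCATION '/path/to/csv/data';

SELECT * FROM quoted_csv LIMIT 5;
```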


This e-mail (including any attachments) is private and confidential, may 
contain proprietary or privileged information and is intended for the named 
recipient(s) only. Unintended recipients are strictly prohibited from taking 
action on the basis of information in this e-mail and must contact the sender 
immediately, delete this e-mail (and all attachments) and destroy any hard 
copies. Nomura will not accept responsibility or liability for the accuracy or 
completeness of, or the presence of any virus or disabling code in, this 
e-mail. If verification is sought please request a hard copy. Any reference to 
the terms of executed transactions should be treated as preliminary only and 
subject to formal written confirmation by Nomura. Nomura reserves the right to 
retain, monitor and intercept e-mail communications through its networks 
(subject to and in accordance with applicable laws). No confidentiality or 
privilege is waived or lost by Nomura by any mistransmission of this e-mail. 
Any reference to "Nomura" is a reference to any entity in the Nomura Holdings, 
Inc. group. Please read our Electronic Communications Legal Notice which forms 
part of this e-mail: http://www.Nomura.com/email_disclaimer.htm



Re: Storm HiveBolt missing records due to batching of Hive transactions

2015-10-21 Thread Harshit Raikar
The withTickTupleInterval parameter is not available in the Storm version
which I am using.

On 21 October 2015 at 14:02, Harshit Raikar 
wrote:

> Hi Aaron,
>
> Thanks for the information.
> Do I need to update my Storm version? Currently I am using version 0.10.0.
> Can you please guide me on which parameters need to be set to use tick tuples?
>
> Regards,
> Harshit Raikar
>
> On 9 October 2015 at 14:49, Aaron.Dossett 
> wrote:
>
>> STORM-938 adds a periodic flush to the HiveBolt using tick tuples that
>> would address this situation.
>>
>> From: Harshit Raikar 
>> Reply-To: "user@hive.apache.org" 
>> Date: Friday, October 9, 2015 at 4:05 AM
>> To: "user@hive.apache.org" 
>> Subject: Storm HiveBolt missing records due to batching of Hive
>> transactions
>>
>>
>> To store the processed records I am using HiveBolt in a Storm topology with
>> the following arguments.
>>
>> - id: "MyHiveOptions"
>> className: "org.apache.storm.hive.common.HiveOptions"
>>   - "${metastore.uri}"   # metaStoreURI
>>   - "${hive.database}"   # databaseName
>>   - "${hive.table}"  # tableName
>> configMethods:
>>   - name: "withTxnsPerBatch"
>> args:
>>   - 2
>>   - name: "withBatchSize"
>> args:
>>   - 100
>>   - name: "withIdleTimeout"
>> args:
>>   - 2  #default value 0
>>   - name: "withMaxOpenConnections"
>> args:
>>   - 200 #default value 500
>>   - name: "withCallTimeout"
>> args:
>>   - 3 #default value 1
>>   - name: "withHeartBeatInterval"
>> args:
>>   - 240 #default value 240
>>
>> There are missing transactions in Hive because the batch is not being
>> completed and the records are therefore not flushed. (For example: 1330
>> records are processed but only 1200 records are in Hive; 130 records are
>> missing.)
>>
>> How can I overcome this situation? How can I fill the batch so that the
>> transaction is triggered and the records are stored in Hive?
>>
>> Topology : Kafka-Spout --> DataProcessingBolt
>>DataProcessingBolt -->HiveBolt (Sink)
>>DataProcessingBolt -->JdbcBolt (Sink)
>>
>>
>> --
>> Thanks and Regards,
>> Harshit Raikar
>>
>>
>>
>> --
>> Thanks and Regards,
>> Harshit Raikar
>> Phone No. +4917655471932
>>
>
>
>
> --
> Thanks and Regards,
> Harshit Raikar
> Phone No. +4917655471932
>



-- 
Thanks and Regards,
Harshit Raikar
Phone No. +4917655471932


Re: Double quotes in csv data

2015-10-21 Thread Vikas Parashar
Hi Michael,

You can write a small Python or Perl script to transform the CSV.
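As a minimal sketch of that approach, using Python's standard csv module (the curly quotes in the example data are presumably an email artifact; this assumes the file uses plain ASCII double quotes):

```python
import csv
import io

def strip_quotes(csv_text):
    """Parse quoted CSV text and re-emit it without the enclosing quotes."""
    out = io.StringIO()
    # skipinitialspace tolerates a space after each comma, as in the example
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    writer = csv.writer(out, quoting=csv.QUOTE_NONE, escapechar="\\")
    for row in reader:
        writer.writerow(row)
    return out.getvalue()

print(strip_quotes('"hello", "", "test", ""'))
```

For a real file you would read line by line (or pass the file object straight to csv.reader) rather than holding everything in memory.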

On Wed, Oct 21, 2015 at 8:57 PM,  wrote:

> Hi,
>
>
>
> I have some CSV data in which each field is enclosed in double quotes,
> e.g. “hello”, “”, “test”, “”.
>
>
>
> I noticed that there is a CSV Serde now available in Hive 0.14. Is it
> possible to use this to strip the quotes when querying an external Hive
> table?
>
>
>
> https://cwiki.apache.org/confluence/display/Hive/CSV+Serde
>
>
>
> e.g. select * from table limit 5 should return:
>
>
>
> hello test   
>


Re: How to use grouping__id in a query

2015-10-21 Thread Jesus Camacho Rodriguez
I created HIVE-12223 to track this issue.

Thanks,
Jesús


From: Jesus Camachorodriguez
Reply-To: "user@hive.apache.org"
Date: Friday, October 16, 2015 at 8:00 AM
To: "user@hive.apache.org"
Subject: Re: How to use grouping__id in a query

Hi Michal,

Sorry I didn't catch your message before. The change of behavior might be due 
to a bug; certainly we should filter or at least produce a proper error.

Could you file a JIRA case and assign it to me? I'll check further.

Thanks,
Jesús



From: Michal Krawczyk
Reply-To: "user@hive.apache.org"
Date: Friday, October 16, 2015 at 8:15 AM
To: "user@hive.apache.org"
Subject: Re: How to use grouping__id in a query

Hi all,

Unfortunately I didn't get any answer on this one; perhaps I asked the question 
incorrectly. I'll try another one then ;).

Should it be possible to use the grouping__id function in a HAVING clause to 
filter out NULL values in the same query? It used to work in Hive 0.11 and 
0.13, but it doesn't work in Hive 1.0.

Thanks,
Michal

On Fri, Sep 25, 2015 at 1:14 PM, Michal Krawczyk 
> wrote:
Hi all,

During the migration from Hive 0.11 to 1.0 on Amazon EMR I ran into an issue 
with the grouping__id function. I'd like to use it to filter out NULL values 
that didn't come from grouping sets. Here's an example:

We have a simple table with some data:

hive> create table grouping_test (col1 string, col2 string);
hive> insert into grouping_test values (1, 2), (1, 3), (1, null), (null, 2);
hive> select * from grouping_test;
OK
1       2
1       3
1       NULL
NULL    2

hive> select col1, col2, GROUPING__ID, count(*)
from grouping_test
group by col1, col2
grouping sets ((), (col1))
having !(col1 IS NULL AND ((CAST(GROUPING__ID as int) & 1) > 0))

I expect the query above to filter out NULL col1 for the col1 grouping set, it 
used to work on Hive 0.11. But on Hive 1.0 it doesn't filter any values and 
still returns NULL col1:

NULL    NULL    0    4
NULL    NULL    1    1    <=== this row is expected to be removed by the having clause
1       NULL    1    3

I tried also a few other conditions on grouping__id in having clause and none 
of them seem to work correctly:

select col1, col2, GROUPING__ID, count(*)
from grouping_test
group by col1, col2
grouping sets ((), (col1))
having GROUPING__ID = '1'

This query doesn't return any data.


I also tried to embed it into a subquery, but still no luck. It finally worked 
when I saved the output of the main query to a temporary table and filtered 
out the data with a WHERE clause, but this looks like overkill.
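For reference, the temporary-table workaround described above might be sketched like this (table name is illustrative; this assumes the temporary tables introduced in Hive 0.14, but a plain CREATE TABLE would work the same way):

```sql
-- Workaround: materialize the grouped output, then filter with WHERE
-- instead of HAVING.
CREATE TEMPORARY TABLE grouping_tmp AS
SELECT col1, col2, GROUPING__ID AS gid, count(*) AS cnt
FROM grouping_test
GROUP BY col1, col2
GROUPING SETS ((), (col1));

SELECT col1, col2, gid, cnt
FROM grouping_tmp
WHERE !(col1 IS NULL AND ((CAST(gid AS int) & 1) > 0));
```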

So my question is: how can I filter out values using grouping__id in Hive 1.0?

Thanks for your help,
Michal


--
Michal Krawczyk
Project Manager / Tech Lead
Union Square Internet Development
http://www.u2i.com/



--
Michal Krawczyk
Project Manager / Tech Lead
Union Square Internet Development
http://www.u2i.com/


Re: Question about hive-jdbc

2015-10-21 Thread Alan Gates
The way to keep track of when things are getting done in Hive is to 
check JIRA: https://issues.apache.org/jira/browse/HIVE. I'm not 
aware of anyone working on those issues at the moment, but a search of 
JIRA will tell you whether anyone has filed a bug for them.


Alan.


Hafiz Mujadid 
October 19, 2015 at 2:59
Hi all!

I have seen that various methods are not implemented in the Hive JDBC driver, 
such as getURL in the HiveDatabaseMetaData class.
Can somebody tell me when these methods are going to be implemented 
and available as a Maven dependency?


Re: Storm HiveBolt missing records due to batching of Hive transactions

2015-10-21 Thread Artem Ervits
Please try this version
https://github.com/apache/storm/blob/master/external/storm-hive/pom.xml
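Assuming a storm-hive build that includes STORM-938, the periodic flush could presumably be enabled alongside the other HiveOptions settings; the method name below is the one mentioned earlier in this thread, and the value is illustrative:

```yaml
configMethods:
  - name: "withTickTupleInterval"
    args:
      - 15   # seconds; periodically flushes partially filled batches (STORM-938)
```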
On Oct 21, 2015 11:19 AM, "Harshit Raikar"  wrote:

> The withTickTupleInterval parameter is not available in the Storm version
> which I am using.
>
> On 21 October 2015 at 14:02, Harshit Raikar 
> wrote:
>
>> Hi Aaron,
>>
>> Thanks for the information.
>> Do I need to update my Storm version? Currently I am using version 0.10.0.
>> Can you please guide me on which parameters need to be set to use tick tuples?
>>
>> Regards,
>> Harshit Raikar
>>
>> On 9 October 2015 at 14:49, Aaron.Dossett 
>> wrote:
>>
>>> STORM-938 adds a periodic flush to the HiveBolt using tick tuples that
>>> would address this situation.
>>>
>>> From: Harshit Raikar 
>>> Reply-To: "user@hive.apache.org" 
>>> Date: Friday, October 9, 2015 at 4:05 AM
>>> To: "user@hive.apache.org" 
>>> Subject: Storm HiveBolt missing records due to batching of Hive
>>> transactions
>>>
>>>
>>> To store the processed records I am using HiveBolt in a Storm topology
>>> with the following arguments.
>>>
>>> - id: "MyHiveOptions"
>>> className: "org.apache.storm.hive.common.HiveOptions"
>>>   - "${metastore.uri}"   # metaStoreURI
>>>   - "${hive.database}"   # databaseName
>>>   - "${hive.table}"  # tableName
>>> configMethods:
>>>   - name: "withTxnsPerBatch"
>>> args:
>>>   - 2
>>>   - name: "withBatchSize"
>>> args:
>>>   - 100
>>>   - name: "withIdleTimeout"
>>> args:
>>>   - 2  #default value 0
>>>   - name: "withMaxOpenConnections"
>>> args:
>>>   - 200 #default value 500
>>>   - name: "withCallTimeout"
>>> args:
>>>   - 3 #default value 1
>>>   - name: "withHeartBeatInterval"
>>> args:
>>>   - 240 #default value 240
>>>
>>> There are missing transactions in Hive because the batch is not being
>>> completed and the records are therefore not flushed. (For example: 1330
>>> records are processed but only 1200 records are in Hive; 130 records are
>>> missing.)
>>>
>>> How can I overcome this situation? How can I fill the batch so that the
>>> transaction is triggered and the records are stored in Hive?
>>>
>>> Topology : Kafka-Spout --> DataProcessingBolt
>>>DataProcessingBolt -->HiveBolt (Sink)
>>>DataProcessingBolt -->JdbcBolt (Sink)
>>>
>>>
>>> --
>>> Thanks and Regards,
>>> Harshit Raikar
>>>
>>>
>>>
>>> --
>>> Thanks and Regards,
>>> Harshit Raikar
>>> Phone No. +4917655471932
>>>
>>
>>
>>
>> --
>> Thanks and Regards,
>> Harshit Raikar
>> Phone No. +4917655471932
>>
>
>
>
> --
> Thanks and Regards,
> Harshit Raikar
> Phone No. +4917655471932
>


Re: Storm HiveBolt missing records due to batching of Hive transactions

2015-10-21 Thread Harshit Raikar
Hi Aaron,

Thanks for the information.
Do I need to update my Storm version? Currently I am using version 0.10.0.
Can you please guide me on which parameters need to be set to use tick tuples?

Regards,
Harshit Raikar

On 9 October 2015 at 14:49, Aaron.Dossett  wrote:

> STORM-938 adds a periodic flush to the HiveBolt using tick tuples that
> would address this situation.
>
> From: Harshit Raikar 
> Reply-To: "user@hive.apache.org" 
> Date: Friday, October 9, 2015 at 4:05 AM
> To: "user@hive.apache.org" 
> Subject: Storm HiveBolt missing records due to batching of Hive
> transactions
>
>
> To store the processed records I am using HiveBolt in a Storm topology with
> the following arguments.
>
> - id: "MyHiveOptions"
> className: "org.apache.storm.hive.common.HiveOptions"
>   - "${metastore.uri}"   # metaStoreURI
>   - "${hive.database}"   # databaseName
>   - "${hive.table}"  # tableName
> configMethods:
>   - name: "withTxnsPerBatch"
> args:
>   - 2
>   - name: "withBatchSize"
> args:
>   - 100
>   - name: "withIdleTimeout"
> args:
>   - 2  #default value 0
>   - name: "withMaxOpenConnections"
> args:
>   - 200 #default value 500
>   - name: "withCallTimeout"
> args:
>   - 3 #default value 1
>   - name: "withHeartBeatInterval"
> args:
>   - 240 #default value 240
>
> There are missing transactions in Hive because the batch is not being
> completed and the records are therefore not flushed. (For example: 1330
> records are processed but only 1200 records are in Hive; 130 records are
> missing.)
>
> How can I overcome this situation? How can I fill the batch so that the
> transaction is triggered and the records are stored in Hive?
>
> Topology : Kafka-Spout --> DataProcessingBolt
>DataProcessingBolt -->HiveBolt (Sink)
>DataProcessingBolt -->JdbcBolt (Sink)
>
>
> --
> Thanks and Regards,
> Harshit Raikar
>
>
>
> --
> Thanks and Regards,
> Harshit Raikar
> Phone No. +4917655471932
>



-- 
Thanks and Regards,
Harshit Raikar
Phone No. +4917655471932