Re: Hive Serialization issues

2016-11-23 Thread Edward Capriolo
I believe JSON itself has encoding rules. What I suggest you do is build
your own input format or SerDe and escape those fields, possibly by
converting them to hex.
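
As a rough illustration of the hex-escaping idea (a sketch only, not a Hive SerDe or input format), the preprocessing step below rewrites every byte that is not valid UTF-8 as a `%XX` hex token before the record reaches a JSON parser. The handler name `percent_hex` and the function `escape_record` are hypothetical names chosen for this example, and the `%XX` notation is an assumed convention; a literal `%` already present in the data is not escaped, so the mapping is not reversible as written.

```python
import codecs
import json


def _percent_hex(err):
    """Decode-error handler: replace each undecodable byte with a %XX token."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    # Return the replacement text and the position at which decoding resumes.
    return "".join("%{:02X}".format(b) for b in bad), err.end


# "percent_hex" is a hypothetical handler name used only in this sketch.
codecs.register_error("percent_hex", _percent_hex)


def escape_record(raw: bytes) -> str:
    """Decode a raw record as UTF-8, hex-escaping any invalid bytes."""
    return raw.decode("utf-8", errors="percent_hex")


# A record containing one byte (0xFF) that can never appear in valid UTF-8.
record = b'{"uid":"xx","city":"new delhi\xff"}'
clean = escape_record(record)
print(clean)
print(json.loads(clean)["city"])
```

The same logic could live inside a custom input format or SerDe on the Hive side; running it as a separate cleaning pass over the files is just the simplest way to show the transformation.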


Re: Hive Serialization issues

2016-11-23 Thread Dana Ram Meghwal
Hey,
Any leads?


-- 
Dana Ram Meghwal
Software Engineer
dana...@saavn.com


Fwd: Hive Serialization issues

2016-11-22 Thread Dana Ram Meghwal
Hey All,

I am using Hive 2.0 with an external metastore on EMR 5.0.0, with Tez as
the execution engine.
Our data is stored in JSON format, so for serialization and deserialization
we are planning to use the lazy SerDe
(class name 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe').

My table definition is

CREATE EXTERNAL TABLE IF NOT EXISTS
daily_active_users_summary_json_partition_dt_paths_v1
(uid string, city string, user string, songcount string, songid_list array  )
PARTITIONED BY (dt string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ('paths'='uid,city,user,songcount,songid_list')
LOCATION 's3:///users/daily_active_users_summary_json_partition_dt';


The data looks like this:

{"uid":"xx","listening_user_flag":"non_listening","platform":"android","model":"micromax a110q","aquisition_channel":"organic","state":"delhi","app_version":"3.2:","country":"IN","city":"new delhi","new_listening_user_flag":"non_listening","manufacturer":"Micromax","login_mode":"loggedout","new_user_flag":"returning","digital_channel":"Not Source"}


Note: I have pasted one record from the table here.


Now, when I run the query

select * from daily_active_users_summary_json_partition_dt_paths_v1 limit 5;


the first field of the table takes the complete record and the rest of the
fields show NULL.
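
A likely explanation for this result: LazySimpleSerDe parses delimited text rather than JSON, and its default field delimiter is Ctrl-A (`\x01`). A JSON line contains no such delimiter, so the whole line lands in the first column and the remaining columns come back NULL. The sketch below is a simplified model of that splitting, not Hive's actual code; it ignores collection delimiters, escaping, and type conversion, and `split_like_delimited_text` is a hypothetical name.

```python
# Default field delimiter of LazySimpleSerDe (Ctrl-A).
FIELD_DELIM = "\x01"

# Column names taken from the table definition in this message.
COLUMNS = ["uid", "city", "user", "songcount", "songid_list"]


def split_like_delimited_text(line, columns=COLUMNS, delim=FIELD_DELIM):
    """Simplified model of delimited-text parsing: split, then pad with None."""
    parts = line.split(delim)[:len(columns)]
    # Columns with no corresponding field are reported as NULL (None here).
    parts += [None] * (len(columns) - len(parts))
    return dict(zip(columns, parts))


json_line = '{"uid":"xx","city":"new delhi","country":"IN"}'
row = split_like_delimited_text(json_line)
print(row["uid"])   # the entire JSON line
print(row["city"])  # None
```

Under this model, no delimiter setting makes a delimited-text SerDe map JSON keys to columns; that mapping needs a JSON-aware SerDe or a custom one.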

When I use a different SerDe, 'org.apache.hive.hcatalog.data.JsonSerDe',
the above query works fine and the data is deserialized correctly. We want
to use the lazy SerDe because our data contains non-UTF-8 characters, and
the latter SerDe does not support serializing or deserializing non-UTF-8
characters.


Can you please help me solve this? We would prefer to use only the lazy
SerDe, as we have already experimented with other SerDes and none of them
works for us. Is there any configuration that enables
serialization/deserialization while using the lazy SerDe?

Or is there any other SerDe that can properly process non-UTF-8 characters
on Hive 2 with Tez?

Thank you


Best Regards,
Dana Ram Meghwal
Software Engineer
dana...@saavn.com