Re: Hive Serialization issues

2016-11-23 Thread Edward Capriolo
I believe JSON itself has encoding rules. What I suggest is that you build
your own input format or SerDe and escape those fields, possibly by
converting them to hex.
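
A minimal sketch of the hex-escaping idea, assuming it is done as a pre-processing step outside Hive rather than inside a custom SerDe or input format. The function name `hex_escape_fields` and the sample record are hypothetical and only illustrate the approach: each line is decoded as latin-1 (which never fails and maps every byte to exactly one character), parsed as JSON, and the chosen fields are replaced by the hex form of their original bytes, so the output is plain ASCII.

```python
import json


def hex_escape_fields(raw_line, field_names):
    """Return an ASCII-only JSON line with the named string fields hex-encoded.

    Assumption: the record contains no \\uXXXX escapes above U+00FF in the
    selected fields, since those cannot be mapped back to a single byte.
    """
    # latin-1 maps bytes 0x00-0xFF one-to-one onto code points, so decoding
    # cannot fail and the original bytes can be recovered by encoding again.
    record = json.loads(raw_line.decode("latin-1"), strict=False)
    for name in field_names:
        value = record.get(name)
        if isinstance(value, str):
            original_bytes = value.encode("latin-1")
            record[name] = original_bytes.hex()
    # json.dumps escapes any remaining non-ASCII characters by default.
    return json.dumps(record)


def hex_unescape(hex_value):
    """Recover the original bytes of a hex-encoded field."""
    return bytes.fromhex(hex_value)


if __name__ == "__main__":
    # Hypothetical record whose "city" value ends in a byte that is not valid UTF-8.
    sample = b'{"uid":"xx","city":"new delh\xef"}'
    print(hex_escape_fields(sample, ["city"]))
```

On the Hive side, a column escaped this way could presumably be turned back into binary with the built-in `unhex()` function; that step is not tested here.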

On Wednesday, November 23, 2016, Dana Ram Meghwal wrote:

> Hey,
> Any leads?
>
> On Tue, Nov 22, 2016 at 5:35 PM, Dana Ram Meghwal wrote:
>
>> Hey All,
>>
>> I am using Hive 2.0 with an external metastore on EMR-5.0.0, with Tez as
>> the execution engine.
>> Our data is stored in JSON format, and for serialization and
>> deserialization we are planning to use the lazy SerDe
>> (class name 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe').
>>
>> My table definition is
>>
>> CREATE EXTERNAL TABLE IF NOT EXISTS 
>> daily_active_users_summary_json_partition_dt_paths_v1
>> (uid string, city string, user string, songcount string, songid_list
>> array  ) PARTITIONED BY ( dt string)
>>
>>  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
>>
>>  WITH SERDEPROPERTIES ('paths'='uid,city,user,songcount,songid_list')
>>
>>  LOCATION 's3:///users/daily_active_users_summary_json_partition_dt';
>>
>>
>> and the data looks like this:
>>
>> {"uid":"xx","listening_user_flag":"non_listening","platform":"android","model":"micromax a110q","aquisition_channel":"organic","state":"delhi","app_version":"3.2:","country":"IN","city":"new delhi","new_listening_user_flag":"non_listening","manufacturer":"Micromax","login_mode":"loggedout","new_user_flag":"returning","digital_channel":"Not Source"}
>>
>>
>> Note: I have pasted one record from the table here.
>>
>>
>> Now, when I run the query
>>
>> select * from daily_active_users_summary_json_partition_dt_paths_v1
>> limit 5;
>>
>>
>> the first field of the table contains the complete record, and the rest
>> of the fields show as NULL.
>>
>> When I use a different SerDe, 'org.apache.hive.hcatalog.data.JsonSerDe',
>> the above query works fine and is able to serialize the data
>> perfectly fine. We want to use the lazy SerDe because our data contains
>> non-UTF-8 characters, and the latter SerDe does not support non-UTF-8
>> character serialization/deserialization.
>>
>>
>> Can you please help me solve this? We mostly want to use the lazy SerDe
>> only, as we have already experimented with other SerDes and none of them
>> works for us. Is there any configuration that enables
>> serialization/deserialization while using the lazy SerDe?
>>
>> Or is there any other SerDe that can process non-UTF-8 characters in
>> Hive 2 on Tez?
>>
>> Thank you
>>
>>
>> Best Regards,
>> Dana Ram Meghwal
>> Software Engineer
>> dana...@saavn.com 
>>
>>
>
>
> --
> Dana Ram Meghwal
> Software Engineer
> dana...@saavn.com 
>
>

-- 
Sorry this was sent from mobile. Will do less grammar and spell check than
usual.


Re: Hive Serialization issues

2016-11-23 Thread Dana Ram Meghwal
Hey,
Any leads?

On Tue, Nov 22, 2016 at 5:35 PM, Dana Ram Meghwal wrote:

> Hey All,
>
> I am using Hive 2.0 with an external metastore on EMR-5.0.0, with Tez as
> the execution engine.
> Our data is stored in JSON format, and for serialization and
> deserialization we are planning to use the lazy SerDe
> (class name 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe').
>
> My table definition is
>
> CREATE EXTERNAL TABLE IF NOT EXISTS 
> daily_active_users_summary_json_partition_dt_paths_v1
> (uid string, city string, user string, songcount string, songid_list
> array  ) PARTITIONED BY ( dt string)
>
>  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
>
>  WITH SERDEPROPERTIES ('paths'='uid,city,user,songcount,songid_list')
>
>  LOCATION 's3:///users/daily_active_users_summary_json_partition_dt';
>
>
> and the data looks like this:
>
> {"uid":"xx","listening_user_flag":"non_listening","platform":"android","model":"micromax a110q","aquisition_channel":"organic","state":"delhi","app_version":"3.2:","country":"IN","city":"new delhi","new_listening_user_flag":"non_listening","manufacturer":"Micromax","login_mode":"loggedout","new_user_flag":"returning","digital_channel":"Not Source"}
>
>
> Note: I have pasted one record from the table here.
>
>
> Now, when I run the query
>
> select * from daily_active_users_summary_json_partition_dt_paths_v1 limit
> 5;
>
>
> the first field of the table contains the complete record, and the rest
> of the fields show as NULL.
>
> When I use a different SerDe, 'org.apache.hive.hcatalog.data.JsonSerDe',
> the above query works fine and is able to serialize the data
> perfectly fine. We want to use the lazy SerDe because our data contains
> non-UTF-8 characters, and the latter SerDe does not support non-UTF-8
> character serialization/deserialization.
>
>
> Can you please help me solve this? We mostly want to use the lazy SerDe
> only, as we have already experimented with other SerDes and none of them
> works for us. Is there any configuration that enables
> serialization/deserialization while using the lazy SerDe?
>
> Or is there any other SerDe that can process non-UTF-8 characters in
> Hive 2 on Tez?
>
> Thank you
>
>
> Best Regards,
> Dana Ram Meghwal
> Software Engineer
> dana...@saavn.com
>
>


-- 
Dana Ram Meghwal
Software Engineer
dana...@saavn.com