Finally I got it to work. The issue resolved once I removed the arrays within the position structure, so that seems to be a limitation of the SerDe. I changed 'industries' to STRING and 'jobFunctions' to MAP<STRING,STRING>, and now I can query the table just fine. Here is the complete DDL for reference:
create external table linkedin_Jobsearch (
  jobs STRUCT<
    values : ARRAY<STRUCT<
      company : STRUCT<
        id : STRING,
        name : STRING>,
      postingDate : STRUCT<
        year : STRING,
        day : STRING,
        month : STRING>,
      descriptionSnippet : STRING,
      expirationDate : STRUCT<
        year : STRING,
        day : STRING,
        month : STRING>,
      position : STRUCT<
        jobFunctions : MAP<STRING,STRING>,  -- this and industries were arrays of structs in my previous attempts
        industries : STRING,
        title : STRING,
        jobType : STRUCT<
          code : STRING,
          name : STRING>,
        experienceLevel : STRUCT<
          code : STRING,
          name : STRING>>,
      id : STRING,
      customerJobCode : STRING,
      skillsAndExperience : STRING,
      salary : STRING,
      jobPoster : STRUCT<
        id : STRING,
        firstName : STRING,
        lastName : STRING,
        headline : STRING>,
      referralBonus : STRING,
      locationDescription : STRING>>>
)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION '/user/sunita/tables/jobs';

Thanks, Stephen, for sharing your thoughts. It helped.

Also, if someone (Stephen?) could help me display this information in a useful manner, that would be great. Right now all the values show up as arrays. Here is what I mean. For a query like this:

hive> select jobs.values.company.name, jobs.values.position.title, jobs.values.locationdescription from linkedin_jobsearch;

this is the output:

["CyberCoders","CyberCoders","CyberCoders","Management Science Associates","Google","Google","CyberCoders","CyberCoders","HP","Sigmaways","Global Data Consultancy","Global Data Consultancy","CyberCoders","CyberCoders","CyberCoders","VMware","CD IT Recruitment","CD IT Recruitment","Digital Reasoning Systems","AOL"]
["Software Engineer-Hadoop, HDFS, HBase, Pig- Vertica Analytics","Software Engineer-Hadoop, HDFS, HBase, Pig- Vertica Analytics","Software Engineer-Hadoop, HDFS, HBase, Pig- Vertica Analytics","Data Architect","Systems Engineer, Site Reliability Engineering","Systems Engineer, Site Reliability Engineering","NoSQL Engineer - MongoDB for big data, web crawling - RELO OFFER","NoSQL Engineer - MongoDB for big data, web crawling - RELO OFFER","Hadoop Database Administrator Medicare","Hadoop / Big Data Consultant","Lead Hadoop developer","Head of Big Data - Hadoop","Hadoop Engineer - Hadoop, Operations, Linux Admin, Java, Storage","Sr. Hadoop Administrator - Hadoop, MapReduce, HDFS","Sr. Hadoop Administrator - Hadoop, MapReduce, HDFS","Software Engineer - Big Data","Hadoop Team Lead Consultant - Global Leader in Big Data solutions","Hadoop Administrator Consultant - Global Leader in Big Data solutions","Java Developer","Sr.Software Engineer-Big Data-Hadoop"]
["Pittsburgh, PA","Pittsburgh, PA","Harrisburg, PA","Pittsburgh, PA (Shadyside area near Bakery Square)","Pittsburgh, PA, USA","Pittsburgh, PA","Cleveland, OH","Akron, OH","Herndon, VA","Cupertino, CA","London, United Kingdom","London, United Kingdom","Mountain View, CA","san jose, CA","Santa Clara, CA","Palo Alto, CA","Home based - Live anywhere in the UK or Benelux","Home based - Live anywhere in the UK or Benelux","Herndon, VA","Dulles, VA"]
Time taken: 8.518 seconds

All company names come back in one array, all position titles in another, and all location descriptions in yet another, so I cannot map one value to the other.

The query below gives decent output where the individual columns can at least be mapped to each other:

hive> select jobs.values[0].company.name, jobs.values[0].position.title, jobs.values[0].locationdescription from linkedin_jobsearch;
CyberCoders    Software Engineer-Hadoop, HDFS, HBase, Pig- Vertica Analytics    Pittsburgh, PA
Time taken: 8.543 seconds

But if I want the whole list, this does not work. I have tried setting input and output formats and setting SerDe properties to map to columns, but the output is the same. I haven't tried LATERAL VIEW json_tuple as yet; I found it cryptic, and I hope there is something simpler. I could write a UDF which loops over the length of the array and accesses the values array incrementally, but the catch there is: what if one of the records has some of its fields missing? Is there a better solution?
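One thing I plan to try (untested as yet, so the syntax may well be off) is exploding the array with a LATERAL VIEW, which I believe should produce one row per element of jobs.values:

hive> SELECT v.company.name, v.position.title, v.locationDescription
    > FROM linkedin_jobsearch
    > LATERAL VIEW explode(jobs.values) vals AS v;

If explode() accepts the array of structs here, each company name should line up with its own title and location on a single row.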
regards
Sunita

On Wed, Jun 19, 2013 at 3:29 PM, Sunita Arvind <sunitarv...@gmail.com> wrote:

> Thanks Stephen,
> Let me explore options. I will let you all know once I am successful.
>
> regards
> Sunita
>
>
> On Wed, Jun 19, 2013 at 3:08 PM, Stephen Sprague <sprag...@gmail.com> wrote:
>
>> try_parsed_json is not trivial imho :)
>>
>> start with the very, very basic, for example, { "jobs" : "foo" }. Get
>> that to work first. :) When that works, add a level of nesting and see
>> what happens. Keep building on it until you either break it (and then you
>> know the last thing you added broke it and can concentrate on that) or
>> you'll have worked out all the bugs and your final example will work.
>> Nothing fancy here, except old school trial and error.
>>
>> An alternative I keep bringing up when native semantics don't go one's
>> way is the transform() function. use python, perl, ruby or whatever to
>> parse the json and go nuts with the rich features of said language. just
>> write your output to stdout as a delimited serialization of what you want
>> to store and that's it. That would be another way to get your scalars,
>> arrays and structs to work.
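>>
>> For instance, a rough, untested sketch of that idea (jobs_raw and
>> parse_jobs.py are made-up names, and the field names are just guesses
>> from your json):
>>
>> ADD FILE parse_jobs.py;
>>
>> SELECT TRANSFORM (json_line)
>> USING 'python parse_jobs.py'
>> AS (company_name STRING, title STRING, location STRING)
>> FROM jobs_raw;
>>
>> where jobs_raw is a one-STRING-column external table over the raw file,
>> and parse_jobs.py is something like:
>>
>> import json
>> import sys
>>
>> # each input line is one json document; emit one tab-delimited row
>> # per element of jobs.values
>> for line in sys.stdin:
>>     doc = json.loads(line)
>>     for v in doc["jobs"]["values"]:
>>         company = v.get("company", {}).get("name", "")
>>         title = v.get("position", {}).get("title", "")
>>         location = v.get("locationDescription", "")
>>         print("\t".join([company, title, location]))
>>
>> the .get() calls mean a record with a missing field just comes out as an
>> empty string instead of blowing up the whole job.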
>>
>> Don't give up yet though on the JsonSerde! :) It's probably something
>> very easy that we just can't see.
>>
>> On Wed, Jun 19, 2013 at 10:00 AM, Sunita Arvind <sunitarv...@gmail.com> wrote:
>>
>>> Thanks for looking into it Ramki.
>>> Yes, I had tried these options. Here is what I get (I renamed the table to
>>> have a meaningful name):
>>>
>>> hive> select jobs.values[1].id from linkedinjobsearch;
>>> ......mapreduce task details....
>>> OK
>>> NULL
>>> Time taken: 9.586 seconds
>>>
>>> hive> select jobs.values[0].position.title from linkedinjobsearch;
>>> Total MapReduce jobs = 1
>>> Launching Job 1 out of 1
>>>
>>> OK
>>> NULL
>>> Time taken: 9.617 seconds
>>>
>>> I am trying to connect btrace to the process to be able to trace the
>>> code, but can't get it to respond. Here is what I tried:
>>>
>>> [sunita@node01 ~]$ hive --debug, recursive=y, port=7000,mainSuspend=y, childSuspend=y
>>> ERROR: Cannot load this JVM TI agent twice, check your java command line
>>> for duplicate jdwp options.
>>> Error occurred during initialization of VM
>>> agent library failed to init: jdwp
>>>
>>> I tried changing the port also. Any idea regarding the debuggers that can
>>> be used? I also tried EXPLAIN on the query, and that does not show any
>>> issues either.
>>>
>>> regards
>>> Sunita
>>>
>>> On Wed, Jun 19, 2013 at 12:11 PM, Ramki Palle <ramki.pa...@gmail.com> wrote:
>>>
>>>> Can you run some other queries from the jobs1 table and see if any query
>>>> returns some data?
>>>>
>>>> I am guessing your query "select jobs.values.position.title from
>>>> jobs1;" may have some issue. Maybe it should be:
>>>>
>>>> select jobs.values[0].position.title from jobs1;
>>>>
>>>> Regards,
>>>> Ramki.
>>>>
>>>> On Wed, Jun 19, 2013 at 8:24 AM, Sunita Arvind <sunitarv...@gmail.com> wrote:
>>>>
>>>>> Thanks Stephen,
>>>>>
>>>>> That's just what I tried with the try_parsed table. It is exactly the same
>>>>> data, with less nesting in the structure and fewer entries.
>>>>> Do you mean to say that highly nested JSONs can lead to issues? What
>>>>> are typical solutions to such issues? Write UDFs in Hive, or parse the JSON
>>>>> into a delimited file?
>>>>> I have heard of custom SerDes also. Not sure if UDFs and custom SerDes
>>>>> are one and the same.
>>>>>
>>>>> regards
>>>>> Sunita
>>>>>
>>>>> On Wed, Jun 19, 2013 at 10:38 AM, Stephen Sprague <sprag...@gmail.com> wrote:
>>>>>
>>>>>> I think you might have to start small here instead of going for the
>>>>>> home run on the first swing. when all else fails, start with a trivial json
>>>>>> object and then build up from there and see what additional step breaks
>>>>>> it. that way, if the trivial example fails, you know it's something
>>>>>> fundamental and not the complexity of your json object that's throwing
>>>>>> things off.
>>>>>>
>>>>>> On Wed, Jun 19, 2013 at 4:34 AM, Sunita Arvind <sunitarv...@gmail.com> wrote:
>>>>>>
>>>>>>> Thanks for sharing your experience, Richa.
>>>>>>> I do have timestamps, but in the format of year : INT, day : INT,
>>>>>>> month : INT.
>>>>>>> As per your suggestion, I changed them all to STRING, but I still get
>>>>>>> null as the output.
>>>>>>>
>>>>>>> regards
>>>>>>> Sunita
>>>>>>>
>>>>>>> On Wed, Jun 19, 2013 at 2:17 AM, Richa Sharma <mailtorichasha...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Do you have any timestamp fields in the table that might contain
>>>>>>>> null values?
>>>>>>>>
>>>>>>>> I faced a similar situation some time back - changing the data type
>>>>>>>> to string made it work.
>>>>>>>>
>>>>>>>> But I was working on delimited text files.
>>>>>>>> Not sure if it applies to JSON .. but it's still worth a try!!
>>>>>>>>
>>>>>>>> Richa
>>>>>>>>
>>>>>>>> On Wed, Jun 19, 2013 at 7:28 AM, Sunita Arvind <sunitarv...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Having a column name the same as the table name is a problem, due
>>>>>>>>> to which I was not able to reference jobs.values.id from jobs.
>>>>>>>>> Changing the table name to jobs1 resolved the semantic error.
>>>>>>>>> However, the query still returns null:
>>>>>>>>>
>>>>>>>>> hive> select jobs.values.position.title from jobs1;
>>>>>>>>> Total MapReduce jobs = 1
>>>>>>>>> Launching Job 1 out of 1
>>>>>>>>> Number of reduce tasks is set to 0 since there's no reduce operator
>>>>>>>>> Starting Job = job_201306080116_0036, Tracking URL =
>>>>>>>>> http://node01.expressanalytics.net:50030/jobdetails.jsp?jobid=job_201306080116_0036
>>>>>>>>> Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_201306080116_0036
>>>>>>>>> Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
>>>>>>>>> 2013-06-18 18:55:52,381 Stage-1 map = 0%, reduce = 0%
>>>>>>>>> 2013-06-18 18:55:56,394 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.88 sec
>>>>>>>>> 2013-06-18 18:55:57,400 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.88 sec
>>>>>>>>> 2013-06-18 18:55:58,407 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 0.88 sec
>>>>>>>>> MapReduce Total cumulative CPU time: 880 msec
>>>>>>>>> Ended Job = job_201306080116_0036
>>>>>>>>> MapReduce Jobs Launched:
>>>>>>>>> Job 0: Map: 1  Cumulative CPU: 0.88 sec  HDFS Read: 35374  HDFS Write: 3  SUCCESS
>>>>>>>>> Total MapReduce CPU Time Spent: 880 msec
>>>>>>>>> OK
>>>>>>>>> null
>>>>>>>>> Time taken: 9.591 seconds
>>>>>>>>>
>>>>>>>>> regards
>>>>>>>>> Sunita
>>>>>>>>>
>>>>>>>>> On Tue, Jun 18, 2013 at 9:35 PM, Sunita Arvind <sunitarv...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Ok.
>>>>>>>>>> The data files are quite small: around 35 KB and 1 KB each.
>>>>>>>>>>
>>>>>>>>>> [sunita@node01 tables]$ hadoop fs -ls /user/sunita/tables/jobs
>>>>>>>>>> Found 1 items
>>>>>>>>>> -rw-r--r-- 3 sunita hdfs 35172 2013-06-18 18:31 /user/sunita/tables/jobs/jobs_noSite_parsed.json
>>>>>>>>>>
>>>>>>>>>> [sunita@node01 tables]$ hadoop fs -text /user/sunita/tables/jobs/jobs_noSite_parsed.json | more
>>>>>>>>>> {"jobs": {"_total": 1812, "_count": 20, "_start": 0, "values":
>>>>>>>>>> [{"company": {"id": 21836, "name": "CyberCoders"}, "postingDate":
>>>>>>>>>> {"year": 2013, "day": 10, "month": 6}, "descriptionSnippet":
>>>>>>>>>> "Software Engineer-Hadoop, HDFS, HBase, Pig- Vertica Analytics
>>>>>>>>>> Senior Hadoop Engineer - Skills Required - Hadoop, HDFS, HBase,
>>>>>>>>>> Pig, SQL, Industrial Software Development, System Integration,
>>>>>>>>>> Java, high performance, multi-threading, VerticaWe are a well known
>>>>>>>>>> consumer product development company and we are looking to add a
>>>>>>>>>> Hadoop Engineer to our Engineering team. You will be working with
>>>>>>>>>> the latest ", "expirationDate": {"year": 2013, "day"
>>>>>>>>>>
>>>>>>>>>> It's a single line, so I used 'more' rather than 'head'. But
>>>>>>>>>> effectively, the file exists and has the data.
>>>>>>>>>>
>>>>>>>>>> regards
>>>>>>>>>> Sunita
>>>>>>>>>>
>>>>>>>>>> On Tue, Jun 18, 2013 at 8:38 PM, Stephen Sprague <sprag...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> As Nitin alluded to, it's best to confirm the data is definitely
>>>>>>>>>>> in hdfs using hdfs semantics rather than hive for the first step.
>>>>>>>>>>>
>>>>>>>>>>> 1. how big is it? hadoop fs -ls <your hdfs dir>
>>>>>>>>>>> 2. cat a bit of it and see if anything is there. hadoop fs -text <your hdfs dir>/<filename> | head -10
>>>>>>>>>>>
>>>>>>>>>>> do you see any data from step #2?
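>>>>>>>>>>>
>>>>>>>>>>> and if both of those check out, a third, untested sanity check
>>>>>>>>>>> (directory and table name made up here) is to point the serde at a
>>>>>>>>>>> trivial one-field file, so you know the basics are wired up before
>>>>>>>>>>> blaming the nesting:
>>>>>>>>>>>
>>>>>>>>>>> hadoop fs -mkdir /user/sunita/tables/jobs_tiny
>>>>>>>>>>> echo '{"jobs": "foo"}' | hadoop fs -put - /user/sunita/tables/jobs_tiny/tiny.json
>>>>>>>>>>>
>>>>>>>>>>> create external table jobs_tiny (jobs STRING)
>>>>>>>>>>> ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
>>>>>>>>>>> LOCATION '/user/sunita/tables/jobs_tiny';
>>>>>>>>>>>
>>>>>>>>>>> select jobs from jobs_tiny;
>>>>>>>>>>>
>>>>>>>>>>> if that returns foo, the serde itself is fine.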
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Jun 18, 2013 at 3:58 PM, Sunita Arvind <sunitarv...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I ran some complex queries, something to the extent of
>>>>>>>>>>>>
>>>>>>>>>>>> select jobs from jobs;
>>>>>>>>>>>>
>>>>>>>>>>>> which triggers map reduce jobs but does not show errors, and
>>>>>>>>>>>> produces the same output, "null". If I try referencing the struct
>>>>>>>>>>>> elements, I get an error, which seems to be the root cause.
>>>>>>>>>>>>
>>>>>>>>>>>> Attached are the select statement outputs with the
>>>>>>>>>>>> corresponding hive logs.
>>>>>>>>>>>>
>>>>>>>>>>>> I have also attached my usage details of another table,
>>>>>>>>>>>> try_parsed, which has a subset of the same data and seems to work
>>>>>>>>>>>> fine. Also attached is the input file for this table, try_parsed.json.
>>>>>>>>>>>> Thanks for your help
>>>>>>>>>>>>
>>>>>>>>>>>> Sunita
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Jun 18, 2013 at 4:35 PM, Nitin Pawar <nitinpawar...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> can you run a little more complex query?
>>>>>>>>>>>>>
>>>>>>>>>>>>> select uniq across columns or do some maths, so we know when
>>>>>>>>>>>>> it fires up a mapreduce
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Jun 19, 2013 at 1:59 AM, Sunita Arvind <sunitarv...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks for responding, Nitin. Yes, I am sure that the serde is
>>>>>>>>>>>>>> working fine and the json file is being picked up, based on all
>>>>>>>>>>>>>> the errors that showed up till this stage. What sort of error are
>>>>>>>>>>>>>> you suspecting: file not present, or serde not parsing it?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tuesday, June 18, 2013, Nitin Pawar wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> select * from table is as good as hdfs -cat
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> are you sure there is any data in the table?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, Jun 18, 2013 at 11:54 PM, Sunita Arvind <sunitarv...@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I am able to parse the input JSON file and load it into
>>>>>>>>>>>>>>>> hive. I do not see any errors with create table, so I am
>>>>>>>>>>>>>>>> assuming that part worked. But when I try to read the data,
>>>>>>>>>>>>>>>> I get null:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> hive> select * from jobs;
>>>>>>>>>>>>>>>> OK
>>>>>>>>>>>>>>>> null
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I have validated the JSON with JSONLint and the Notepad++ JSON
>>>>>>>>>>>>>>>> plugin, and it is valid JSON. Here is my create table
>>>>>>>>>>>>>>>> statement, and attached is the json input file.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> create external table jobs (
>>>>>>>>>>>>>>>>   jobs STRUCT<
>>>>>>>>>>>>>>>>     values : ARRAY<STRUCT<
>>>>>>>>>>>>>>>>       company : STRUCT<
>>>>>>>>>>>>>>>>         id : STRING,
>>>>>>>>>>>>>>>>         name : STRING>,
>>>>>>>>>>>>>>>>       postingDate : STRUCT<
>>>>>>>>>>>>>>>>         year : INT,
>>>>>>>>>>>>>>>>         day : INT,
>>>>>>>>>>>>>>>>         month : INT>,
>>>>>>>>>>>>>>>>       descriptionSnippet : STRING,
>>>>>>>>>>>>>>>>       expirationDate : STRUCT<
>>>>>>>>>>>>>>>>         year : INT,
>>>>>>>>>>>>>>>>         day : INT,
>>>>>>>>>>>>>>>>         month : INT>,
>>>>>>>>>>>>>>>>       position : STRUCT<
>>>>>>>>>>>>>>>>         title : STRING,
>>>>>>>>>>>>>>>>         jobFunctions : ARRAY<STRUCT<
>>>>>>>>>>>>>>>>           code : STRING,
>>>>>>>>>>>>>>>>           name : STRING>>,
>>>>>>>>>>>>>>>>         industries : ARRAY<STRUCT<
>>>>>>>>>>>>>>>>           code : STRING,
>>>>>>>>>>>>>>>>           id : STRING,
>>>>>>>>>>>>>>>>           name : STRING>>,
>>>>>>>>>>>>>>>>         jobType : STRUCT<
>>>>>>>>>>>>>>>>           code : STRING,
>>>>>>>>>>>>>>>>           name : STRING>,
>>>>>>>>>>>>>>>>         experienceLevel : STRUCT<
>>>>>>>>>>>>>>>>           code : STRING,
>>>>>>>>>>>>>>>>           name : STRING>>,
>>>>>>>>>>>>>>>>       id : STRING,
>>>>>>>>>>>>>>>>       customerJobCode : STRING,
>>>>>>>>>>>>>>>>       skillsAndExperience : STRING,
>>>>>>>>>>>>>>>>       salary : STRING,
>>>>>>>>>>>>>>>>       jobPoster : STRUCT<
>>>>>>>>>>>>>>>>         id : STRING,
>>>>>>>>>>>>>>>>         firstName : STRING,
>>>>>>>>>>>>>>>>         lastName : STRING,
>>>>>>>>>>>>>>>>         headline : STRING>,
>>>>>>>>>>>>>>>>       referralBonus : STRING,
>>>>>>>>>>>>>>>>       locationDescription : STRING>>>
>>>>>>>>>>>>>>>> )
>>>>>>>>>>>>>>>> ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
>>>>>>>>>>>>>>>> LOCATION '/user/sunita/tables/jobs';
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The table creation works fine, but when I attempt to query,
>>>>>>>>>>>>>>>> I get null as the result.
>>>>>>>>>>>>>>>> I tried adding input/output formats and SerDe properties;
>>>>>>>>>>>>>>>> nothing seems to have any impact.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I am of the opinion that the libraries cannot handle this
>>>>>>>>>>>>>>>> level of nesting, and I will probably have to write a custom
>>>>>>>>>>>>>>>> serde or a parser myself. Just wanted to seek guidance before
>>>>>>>>>>>>>>>> I get into that. Appreciate your help and guidance.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> regards
>>>>>>>>>>>>>>>> Sunita
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Nitin Pawar