Finally I got it to work. The issue resolved once I removed the arrays within the position structure, so that seems to be a limitation of the SerDe. I changed 'industries' to STRING and 'jobFunctions' to MAP<STRING,STRING>, and now I can query the table just fine. Here is the complete DDL for reference:
create external table linkedin_Jobsearch (
  jobs STRUCT<
    values : ARRAY<STRUCT<
      company : STRUCT<
        id : STRING,
        name : STRING>,
      postingDate : STRUCT<
        year : STRING,
        day : STRING,
        month : STRING>,
      descriptionSnippet : STRING,
      expirationDate : STRUCT<
        year : STRING,
        day : STRING,
        month : STRING>,
      position : STRUCT<
        jobFunctions : MAP<STRING,STRING>,  -- this and industries were arrays of structs in my previous attempts
        industries : STRING,
        title : STRING,
        jobType : STRUCT<
          code : STRING,
          name : STRING>,
        experienceLevel : STRUCT<
          code : STRING,
          name : STRING>>,
      id : STRING,
      customerJobCode : STRING,
      skillsAndExperience : STRING,
      salary : STRING,
      jobPoster : STRUCT<
        id : STRING,
        firstName : STRING,
        lastName : STRING,
        headline : STRING>,
      referralBonus : STRING,
      locationDescription : STRING>>>
)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION '/user/sunita/tables/jobs';

Thanks, Stephen, for sharing your thoughts. It helped.

Also, if someone (Stephen?) could help me display this information in a useful manner, that would be great. Right now all the values show up as arrays. Here is what I mean. For a query like this:

hive> select jobs.values.company.name, jobs.values.position.title, jobs.values.locationdescription from linkedin_jobsearch;

this is the output:

["CyberCoders","CyberCoders","CyberCoders","Management Science Associates","Google","Google","CyberCoders","CyberCoders","HP","Sigmaways","Global Data Consultancy","Global Data Consultancy","CyberCoders","CyberCoders","CyberCoders","VMware","CD IT Recruitment","CD IT Recruitment","Digital Reasoning Systems","AOL"]
["Software Engineer-Hadoop, HDFS, HBase, Pig- Vertica Analytics","Software Engineer-Hadoop, HDFS, HBase, Pig- Vertica Analytics","Software Engineer-Hadoop, HDFS, HBase, Pig- Vertica Analytics","Data Architect","Systems Engineer, Site Reliability Engineering","Systems Engineer, Site Reliability Engineering","NoSQL Engineer - MongoDB for big data, web crawling - RELO OFFER","NoSQL Engineer - MongoDB for big data, web crawling - RELO OFFER","Hadoop Database Administrator Medicare","Hadoop / Big Data Consultant","Lead Hadoop developer","Head of Big Data - Hadoop","Hadoop Engineer - Hadoop, Operations, Linux Admin, Java, Storage","Sr. Hadoop Administrator - Hadoop, MapReduce, HDFS","Sr. Hadoop Administrator - Hadoop, MapReduce, HDFS","Software Engineer - Big Data","Hadoop Team Lead Consultant - Global Leader in Big Data solutions","Hadoop Administrator Consultant - Global Leader in Big Data solutions","Java Developer","Sr.Software Engineer-Big Data-Hadoop"]
["Pittsburgh, PA","Pittsburgh, PA","Harrisburg, PA","Pittsburgh, PA (Shadyside area near Bakery Square)","Pittsburgh, PA, USA","Pittsburgh, PA","Cleveland, OH","Akron, OH","Herndon, VA","Cupertino, CA","London, United Kingdom","London, United Kingdom","Mountain View, CA","san jose, CA","Santa Clara, CA","Palo Alto, CA","Home based - Live anywhere in the UK or Benelux","Home based - Live anywhere in the UK or Benelux","Herndon, VA","Dulles, VA"]
Time taken: 8.518 seconds

All company names come back in one array, all position titles in another, and all location descriptions in yet another, so I cannot map one value to the other.

The query below gives decent output where the individual columns can at least be mapped to each other:

hive> select jobs.values[0].company.name, jobs.values[0].position.title, jobs.values[0].locationdescription from linkedin_jobsearch;
CyberCoders    Software Engineer-Hadoop, HDFS, HBase, Pig- Vertica Analytics    Pittsburgh, PA
Time taken: 8.543 seconds

But if I want the whole list, this does not work. I have tried setting input and output formats and setting SerDe properties to map to columns, but the output is the same. I haven't tried LATERAL VIEW json_tuple as yet; I found it cryptic, and I hope there is something simpler. I could write a UDF which loops over the length of the array and accesses the values array incrementally, but the catch there is: what if one of the records has some of its fields missing? Is there a better solution?
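One thing I plan to try (untested as yet, so the syntax may well be off) is exploding the array with a LATERAL VIEW, which I believe should produce one row per element of jobs.values:

hive> SELECT v.company.name, v.position.title, v.locationDescription
    > FROM linkedin_jobsearch
    > LATERAL VIEW explode(jobs.values) vals AS v;

If explode() accepts the array of structs here, each company name should line up with its own title and location on a single row.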
regards
Sunita

On Wed, Jun 19, 2013 at 3:29 PM, Sunita Arvind <sunitarv...@gmail.com> wrote:

> Thanks Stephen,
> Let me explore options. I will let you all know once I am successful.
>
> regards
> Sunita
>
>
> On Wed, Jun 19, 2013 at 3:08 PM, Stephen Sprague <sprag...@gmail.com> wrote:
>
>> try_parsed_json is not trivial imho :)
>>
>> start with the very, very basic, for example, { "jobs" : "foo" }. Get
>> that to work first. :) When that works, add a level of nesting and see
>> what happens. Keep building on it until you either break it (and then you
>> know the last thing you added broke it and can concentrate on that) or
>> you'll have worked out all the bugs and your final example will work.
>> Nothing fancy here, except old school trial and error.
>>
>> An alternative I keep bringing up when native semantics don't go one's
>> way is the transform() function. use python, perl, ruby or whatever to
>> parse the json and go nuts with the rich features of said language. just
>> write your output to stdout as a delimited serialization of what you want
>> to store and that's it. That would be another way to get your scalars,
>> arrays and structs to work.
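>>
>> For instance, a rough, untested sketch of that idea (jobs_raw and
>> parse_jobs.py are made-up names, and the field names are just guesses
>> from your json):
>>
>> ADD FILE parse_jobs.py;
>>
>> SELECT TRANSFORM (json_line)
>> USING 'python parse_jobs.py'
>> AS (company_name STRING, title STRING, location STRING)
>> FROM jobs_raw;
>>
>> where jobs_raw is a one-STRING-column external table over the raw file,
>> and parse_jobs.py is something like:
>>
>> import json
>> import sys
>>
>> # each input line is one json document; emit one tab-delimited row
>> # per element of jobs.values
>> for line in sys.stdin:
>>     doc = json.loads(line)
>>     for v in doc["jobs"]["values"]:
>>         company = v.get("company", {}).get("name", "")
>>         title = v.get("position", {}).get("title", "")
>>         location = v.get("locationDescription", "")
>>         print("\t".join([company, title, location]))
>>
>> the .get() calls mean a record with a missing field just comes out as an
>> empty string instead of blowing up the whole job.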
>>
>> Don't give up yet though on the JsonSerde! :) It's probably something
>> very easy that we just can't see.
>>
>> On Wed, Jun 19, 2013 at 10:00 AM, Sunita Arvind <sunitarv...@gmail.com> wrote:
>>
>>> Thanks for looking into it Ramki.
>>> Yes, I had tried these options. Here is what I get (I renamed the table to
>>> have a meaningful name):
>>>
>>> hive> select jobs.values[1].id from linkedinjobsearch;
>>> ......mapreduce task details....
>>> OK
>>> NULL
>>> Time taken: 9.586 seconds
>>>
>>> hive> select jobs.values[0].position.title from linkedinjobsearch;
>>> Total MapReduce jobs = 1
>>> Launching Job 1 out of 1
>>>
>>> OK
>>> NULL
>>> Time taken: 9.617 seconds
>>>
>>> I am trying to connect btrace to the process to be able to trace the
>>> code, but can't get it to respond. Here is what I tried:
>>>
>>> [sunita@node01 ~]$ hive --debug, recursive=y, port=7000,mainSuspend=y, childSuspend=y
>>> ERROR: Cannot load this JVM TI agent twice, check your java command line
>>> for duplicate jdwp options.
>>> Error occurred during initialization of VM
>>> agent library failed to init: jdwp
>>>
>>> I tried changing the port also. Any idea regarding the debuggers that can
>>> be used? I also tried EXPLAIN on the query, and that does not show any
>>> issues either.
>>>
>>> regards
>>> Sunita
>>>
>>> On Wed, Jun 19, 2013 at 12:11 PM, Ramki Palle <ramki.pa...@gmail.com> wrote:
>>>
>>>> Can you run some other queries from the jobs1 table and see if any query
>>>> returns some data?
>>>>
>>>> I am guessing your query "select jobs.values.position.title from
>>>> jobs1;" may have some issue. Maybe it should be:
>>>>
>>>> select jobs.values[0].position.title from jobs1;
>>>>
>>>> Regards,
>>>> Ramki.
>>>>
>>>> On Wed, Jun 19, 2013 at 8:24 AM, Sunita Arvind <sunitarv...@gmail.com> wrote:
>>>>
>>>>> Thanks Stephen,
>>>>>
>>>>> That's just what I tried with the try_parsed table. It is exactly the same
>>>>> data, with less nesting in the structure and fewer entries.
>>>>> Do you mean to say that highly nested JSONs can lead to issues? What
>>>>> are typical solutions to such issues? Write UDFs in Hive, or parse the JSON
>>>>> into a delimited file?
>>>>> I have heard of custom SerDes also. Not sure if UDFs and custom SerDes
>>>>> are one and the same.
>>>>>
>>>>> regards
>>>>> Sunita
>>>>>
>>>>> On Wed, Jun 19, 2013 at 10:38 AM, Stephen Sprague <sprag...@gmail.com> wrote:
>>>>>
>>>>>> I think you might have to start small here instead of going for the
>>>>>> home run on the first swing. when all else fails, start with a trivial json
>>>>>> object and then build up from there and see what additional step breaks
>>>>>> it. that way, if the trivial example fails, you know it's something
>>>>>> fundamental and not the complexity of your json object that's throwing
>>>>>> things off.
>>>>>>
>>>>>> On Wed, Jun 19, 2013 at 4:34 AM, Sunita Arvind <sunitarv...@gmail.com> wrote:
>>>>>>
>>>>>>> Thanks for sharing your experience, Richa.
>>>>>>> I do have timestamps, but in the format of year : INT, day : INT,
>>>>>>> month : INT.
>>>>>>> As per your suggestion, I changed them all to STRING, but I still get
>>>>>>> null as the output.
>>>>>>>
>>>>>>> regards
>>>>>>> Sunita
>>>>>>>
>>>>>>> On Wed, Jun 19, 2013 at 2:17 AM, Richa Sharma <mailtorichasha...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Do you have any timestamp fields in the table that might contain
>>>>>>>> null values?
>>>>>>>>
>>>>>>>> I faced a similar situation some time back - changing the data type
>>>>>>>> to string made it work.
>>>>>>>>
>>>>>>>> But I was working on delimited text files.
>>>>>>>> Not sure if it applies to JSON .. but it's still worth a try!!
>>>>>>>>
>>>>>>>> Richa
>>>>>>>>
>>>>>>>> On Wed, Jun 19, 2013 at 7:28 AM, Sunita Arvind <sunitarv...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Having a column name the same as the table name is a problem, due
>>>>>>>>> to which I was not able to reference jobs.values.id from jobs.
>>>>>>>>> Changing the table name to jobs1 resolved the semantic error.
>>>>>>>>> However, the query still returns null:
>>>>>>>>>
>>>>>>>>> hive> select jobs.values.position.title from jobs1;
>>>>>>>>> Total MapReduce jobs = 1
>>>>>>>>> Launching Job 1 out of 1
>>>>>>>>> Number of reduce tasks is set to 0 since there's no reduce operator
>>>>>>>>> Starting Job = job_201306080116_0036, Tracking URL =
>>>>>>>>> http://node01.expressanalytics.net:50030/jobdetails.jsp?jobid=job_201306080116_0036
>>>>>>>>> Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_201306080116_0036
>>>>>>>>> Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
>>>>>>>>> 2013-06-18 18:55:52,381 Stage-1 map = 0%, reduce = 0%
>>>>>>>>> 2013-06-18 18:55:56,394 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.88 sec
>>>>>>>>> 2013-06-18 18:55:57,400 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.88 sec
>>>>>>>>> 2013-06-18 18:55:58,407 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 0.88 sec
>>>>>>>>> MapReduce Total cumulative CPU time: 880 msec
>>>>>>>>> Ended Job = job_201306080116_0036
>>>>>>>>> MapReduce Jobs Launched:
>>>>>>>>> Job 0: Map: 1  Cumulative CPU: 0.88 sec  HDFS Read: 35374  HDFS Write: 3  SUCCESS
>>>>>>>>> Total MapReduce CPU Time Spent: 880 msec
>>>>>>>>> OK
>>>>>>>>> null
>>>>>>>>> Time taken: 9.591 seconds
>>>>>>>>>
>>>>>>>>> regards
>>>>>>>>> Sunita
>>>>>>>>>
>>>>>>>>> On Tue, Jun 18, 2013 at 9:35 PM, Sunita Arvind <sunitarv...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Ok.
>>>>>>>>>> The data files are quite small: around 35 KB and 1 KB each.
>>>>>>>>>>
>>>>>>>>>> [sunita@node01 tables]$ hadoop fs -ls /user/sunita/tables/jobs
>>>>>>>>>> Found 1 items
>>>>>>>>>> -rw-r--r-- 3 sunita hdfs 35172 2013-06-18 18:31 /user/sunita/tables/jobs/jobs_noSite_parsed.json
>>>>>>>>>>
>>>>>>>>>> [sunita@node01 tables]$ hadoop fs -text /user/sunita/tables/jobs/jobs_noSite_parsed.json | more
>>>>>>>>>> {"jobs": {"_total": 1812, "_count": 20, "_start": 0, "values":
>>>>>>>>>> [{"company": {"id": 21836, "name": "CyberCoders"}, "postingDate":
>>>>>>>>>> {"year": 2013, "day": 10, "month": 6}, "descriptionSnippet":
>>>>>>>>>> "Software Engineer-Hadoop, HDFS, HBase, Pig- Vertica Analytics
>>>>>>>>>> Senior Hadoop Engineer - Skills Required - Hadoop, HDFS, HBase,
>>>>>>>>>> Pig, SQL, Industrial Software Development, System Integration,
>>>>>>>>>> Java, high performance, multi-threading, VerticaWe are a well known
>>>>>>>>>> consumer product development company and we are looking to add a
>>>>>>>>>> Hadoop Engineer to our Engineering team. You will be working with
>>>>>>>>>> the latest ", "expirationDate": {"year": 2013, "day"
>>>>>>>>>>
>>>>>>>>>> It's a single line, so I used 'more' rather than 'head'. But
>>>>>>>>>> effectively, the file exists and has the data.
>>>>>>>>>>
>>>>>>>>>> regards
>>>>>>>>>> Sunita
>>>>>>>>>>
>>>>>>>>>> On Tue, Jun 18, 2013 at 8:38 PM, Stephen Sprague <sprag...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> As Nitin alluded to, it's best to confirm the data is definitely
>>>>>>>>>>> in hdfs using hdfs semantics rather than hive for the first step.
>>>>>>>>>>>
>>>>>>>>>>> 1. how big is it? hadoop fs -ls <your hdfs dir>
>>>>>>>>>>> 2. cat a bit of it and see if anything is there. hadoop fs -text <your hdfs dir>/<filename> | head -10
>>>>>>>>>>>
>>>>>>>>>>> do you see any data from step #2?
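>>>>>>>>>>>
>>>>>>>>>>> and if both of those check out, a third, untested sanity check
>>>>>>>>>>> (directory and table name made up here) is to point the serde at a
>>>>>>>>>>> trivial one-field file, so you know the basics are wired up before
>>>>>>>>>>> blaming the nesting:
>>>>>>>>>>>
>>>>>>>>>>> hadoop fs -mkdir /user/sunita/tables/jobs_tiny
>>>>>>>>>>> echo '{"jobs": "foo"}' | hadoop fs -put - /user/sunita/tables/jobs_tiny/tiny.json
>>>>>>>>>>>
>>>>>>>>>>> create external table jobs_tiny (jobs STRING)
>>>>>>>>>>> ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
>>>>>>>>>>> LOCATION '/user/sunita/tables/jobs_tiny';
>>>>>>>>>>>
>>>>>>>>>>> select jobs from jobs_tiny;
>>>>>>>>>>>
>>>>>>>>>>> if that returns foo, the serde itself is fine.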
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Jun 18, 2013 at 3:58 PM, Sunita Arvind <sunitarv...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I ran some complex queries, something to the extent of
>>>>>>>>>>>>
>>>>>>>>>>>> select jobs from jobs;
>>>>>>>>>>>>
>>>>>>>>>>>> which triggers map reduce jobs but does not show errors, and
>>>>>>>>>>>> produces the same output, "null". If I try referencing the struct
>>>>>>>>>>>> elements, I get an error, which seems to be the root cause.
>>>>>>>>>>>>
>>>>>>>>>>>> Attached are the select statement outputs with the
>>>>>>>>>>>> corresponding hive logs.
>>>>>>>>>>>>
>>>>>>>>>>>> I have also attached my usage details of another table,
>>>>>>>>>>>> try_parsed, which has a subset of the same data and seems to work
>>>>>>>>>>>> fine. Also attached is the input file for this table, try_parsed.json.
>>>>>>>>>>>> Thanks for your help
>>>>>>>>>>>>
>>>>>>>>>>>> Sunita
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Jun 18, 2013 at 4:35 PM, Nitin Pawar <nitinpawar...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> can you run a little more complex query?
>>>>>>>>>>>>>
>>>>>>>>>>>>> select uniq across columns or do some maths, so we know when
>>>>>>>>>>>>> it fires up a mapreduce
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Jun 19, 2013 at 1:59 AM, Sunita Arvind <sunitarv...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks for responding, Nitin. Yes, I am sure that the serde is
>>>>>>>>>>>>>> working fine and the json file is being picked up, based on all
>>>>>>>>>>>>>> the errors that showed up till this stage. What sort of error are
>>>>>>>>>>>>>> you suspecting: file not present, or serde not parsing it?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tuesday, June 18, 2013, Nitin Pawar wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> select * from table is as good as hdfs -cat
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> are you sure there is any data in the table?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, Jun 18, 2013 at 11:54 PM, Sunita Arvind <sunitarv...@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I am able to parse the input JSON file and load it into
>>>>>>>>>>>>>>>> hive. I do not see any errors with create table, so I am
>>>>>>>>>>>>>>>> assuming that part worked. But when I try to read the data,
>>>>>>>>>>>>>>>> I get null:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> hive> select * from jobs;
>>>>>>>>>>>>>>>> OK
>>>>>>>>>>>>>>>> null
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I have validated the JSON with JSONLint and the Notepad++ JSON
>>>>>>>>>>>>>>>> plugin, and it is valid JSON. Here is my create table
>>>>>>>>>>>>>>>> statement, and attached is the json input file.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> create external table jobs (
>>>>>>>>>>>>>>>>   jobs STRUCT<
>>>>>>>>>>>>>>>>     values : ARRAY<STRUCT<
>>>>>>>>>>>>>>>>       company : STRUCT<
>>>>>>>>>>>>>>>>         id : STRING,
>>>>>>>>>>>>>>>>         name : STRING>,
>>>>>>>>>>>>>>>>       postingDate : STRUCT<
>>>>>>>>>>>>>>>>         year : INT,
>>>>>>>>>>>>>>>>         day : INT,
>>>>>>>>>>>>>>>>         month : INT>,
>>>>>>>>>>>>>>>>       descriptionSnippet : STRING,
>>>>>>>>>>>>>>>>       expirationDate : STRUCT<
>>>>>>>>>>>>>>>>         year : INT,
>>>>>>>>>>>>>>>>         day : INT,
>>>>>>>>>>>>>>>>         month : INT>,
>>>>>>>>>>>>>>>>       position : STRUCT<
>>>>>>>>>>>>>>>>         title : STRING,
>>>>>>>>>>>>>>>>         jobFunctions : ARRAY<STRUCT<
>>>>>>>>>>>>>>>>           code : STRING,
>>>>>>>>>>>>>>>>           name : STRING>>,
>>>>>>>>>>>>>>>>         industries : ARRAY<STRUCT<
>>>>>>>>>>>>>>>>           code : STRING,
>>>>>>>>>>>>>>>>           id : STRING,
>>>>>>>>>>>>>>>>           name : STRING>>,
>>>>>>>>>>>>>>>>         jobType : STRUCT<
>>>>>>>>>>>>>>>>           code : STRING,
>>>>>>>>>>>>>>>>           name : STRING>,
>>>>>>>>>>>>>>>>         experienceLevel : STRUCT<
>>>>>>>>>>>>>>>>           code : STRING,
>>>>>>>>>>>>>>>>           name : STRING>>,
>>>>>>>>>>>>>>>>       id : STRING,
>>>>>>>>>>>>>>>>       customerJobCode : STRING,
>>>>>>>>>>>>>>>>       skillsAndExperience : STRING,
>>>>>>>>>>>>>>>>       salary : STRING,
>>>>>>>>>>>>>>>>       jobPoster : STRUCT<
>>>>>>>>>>>>>>>>         id : STRING,
>>>>>>>>>>>>>>>>         firstName : STRING,
>>>>>>>>>>>>>>>>         lastName : STRING,
>>>>>>>>>>>>>>>>         headline : STRING>,
>>>>>>>>>>>>>>>>       referralBonus : STRING,
>>>>>>>>>>>>>>>>       locationDescription : STRING>>>
>>>>>>>>>>>>>>>> )
>>>>>>>>>>>>>>>> ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
>>>>>>>>>>>>>>>> LOCATION '/user/sunita/tables/jobs';
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The table creation works fine, but when I attempt to query,
>>>>>>>>>>>>>>>> I get null as the result.
>>>>>>>>>>>>>>>> I tried adding input/output formats and SerDe properties;
>>>>>>>>>>>>>>>> nothing seems to have any impact.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I am of the opinion that the libraries cannot handle this
>>>>>>>>>>>>>>>> level of nesting, and I will probably have to write a custom
>>>>>>>>>>>>>>>> serde or a parser myself. Just wanted to seek guidance before
>>>>>>>>>>>>>>>> I get into that. Appreciate your help and guidance.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> regards
>>>>>>>>>>>>>>>> Sunita
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Nitin Pawar