Ok, I found that the elephant-bird JsonLoader cannot handle pretty-printed JSON documents (i.e., documents expanding over multiple lines). The entire JSON document has to be on a single line. After I reformatted some of the source files accordingly, I am now getting the expected output.
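In case anyone else runs into this: the reformatting step can be as simple as parsing each document and re-serializing it without whitespace. A rough sketch of the idea (untested against my real data; it assumes local, already-uncompressed copies of the files, and the paths are just placeholders):

    #!/usr/bin/env python
    # Collapse a pretty-printed JSON file onto a single line so that
    # elephant-bird's JsonLoader can read it.
    import json
    import sys

    def collapse(in_path, out_path):
        # json.load accepts pretty-printed input; json.dump with compact
        # separators re-serializes the document with no newlines at all.
        with open(in_path) as src, open(out_path, "w") as dst:
            json.dump(json.load(src), dst, separators=(",", ":"))
            dst.write("\n")

    if __name__ == "__main__":
        # e.g.: python collapse_json.py pretty.json single-line.json
        collapse(sys.argv[1], sys.argv[2])

The collapsed files then need to be re-uploaded to HDFS before running the LOAD again.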
On Wed, Jun 19, 2013 at 2:47 PM, Tecno Brain <[email protected]> wrote:

> I also tried:
>
>   doc = LOAD '/json-pcr/pcr-000001.json'
>         USING com.twitter.elephantbird.pig.load.JsonLoader() AS (json:map[]);
>   flat = FOREACH doc GENERATE (chararray)json#'a' AS first, (long)json#'b' AS second;
>   DUMP flat;
>
> but I got no output either.
>
>   Input(s):
>   Successfully read 0 records (35863 bytes) from: "/json-pcr/pcr-000001.json"
>
>   Output(s):
>   Successfully stored 0 records in: "hdfs://localhost:9000/tmp/temp-1239058872/tmp-1260892210"
>
> On Wed, Jun 19, 2013 at 2:36 PM, Tecno Brain <[email protected]> wrote:
>
>> I got Pig and Hive working on a single node and I am able to run some
>> scripts/queries over regular text files (access log files), with one
>> record per line.
>>
>> Now I want to process some JSON files.
>>
>> As mentioned before, it seems that ElephantBird would be a good solution
>> for reading JSON files.
>>
>> I uploaded 5 files to HDFS. Each file contains a single JSON document.
>> The documents are NOT on a single line; they contain pretty-printed JSON
>> expanding over multiple lines.
>>
>> I'm trying something simple: extracting two (primitive) attributes at
>> the top of the document:
>>
>>   {
>>     a : "some value",
>>     ...
>>     b : 133,
>>     ...
>>   }
>>
>> So, let's start with a LOAD of a single file (a single JSON document):
>>
>>   REGISTER 'bunch of JAR files from elephant-bird and its dependencies';
>>   doc = LOAD '/json-pcr/pcr-000001.json'
>>         USING com.twitter.elephantbird.pig.load.JsonLoader();
>>   flat = FOREACH doc GENERATE (chararray)$0#'a' AS first, (long)$0#'b' AS second;
>>   DUMP flat;
>>
>> Apparently the job runs without problems, but I get no output. The
>> output I get includes this message:
>>
>>   Input(s):
>>   Successfully read 0 records (35863 bytes) from: "/json-pcr/pcr-000001.json"
>>
>> I was expecting to get
>>
>>   ( "some value", 133 )
>>
>> Any idea on what I am doing wrong?
>>
>> On Thu, Jun 13, 2013 at 3:05 PM, Michael Segel <[email protected]> wrote:
>>
>>> I think you have a misconception of HBase.
>>>
>>> You don't need to actually have mutable data for it to be effective.
>>> The key is that you need to have access to specific records and work a
>>> very small subset of the data, not the complete data set.
>>>
>>> On Jun 13, 2013, at 11:59 AM, Tecno Brain <[email protected]> wrote:
>>>
>>> Hi Mike,
>>>
>>> Yes, I have also thought about HBase or Cassandra, but my data is
>>> pretty much a snapshot; it does not require updates. Most of my
>>> aggregations will also need to be computed only once and won't change
>>> over time, with the exception of some aggregations based on the last N
>>> days of data. Should I still consider HBase? I think it will probably
>>> be good for the aggregated data.
>>>
>>> I have no idea what sequence files are, but I will take a look. My raw
>>> data is stored in the cloud, not in my Hadoop cluster.
>>>
>>> I'll keep looking at Pig with ElephantBird.
>>> Thanks,
>>>
>>> -Jorge
>>>
>>> On Wed, Jun 12, 2013 at 7:26 PM, Michael Segel <[email protected]> wrote:
>>>
>>>> Hi..
>>>>
>>>> Have you thought about HBase?
>>>>
>>>> I would suggest that if you're using Hive or Pig, you look at taking
>>>> these files and putting the JSON records into a sequence file, or a
>>>> set of sequence files.... (Then look at HBase to help index them...)
>>>> 200KB is small.
>>>>
>>>> That would be the same for either Pig or Hive.
>>>>
>>>> In terms of SerDes, I've worked with Pig and ElephantBird; it's pretty
>>>> nice. And yes, you get each record as a row; however, you can always
>>>> flatten them as needed.
>>>>
>>>> Hive?
>>>> I haven't worked with the latest SerDe, but maybe Dean Wampler or
>>>> Edward Capriolo could give you a better answer. Going from memory, I
>>>> don't know that there is a good SerDe that would write JSON, just read
>>>> it. (Hive)
>>>>
>>>> IMHO Pig/ElephantBird is the best so far, but then again I may be
>>>> dated and biased.
>>>>
>>>> I think you're on the right track, or at least the right train of
>>>> thought.
>>>>
>>>> HTH
>>>>
>>>> -Mike
>>>>
>>>> On Jun 12, 2013, at 7:57 PM, Tecno Brain <[email protected]> wrote:
>>>>
>>>> Hello,
>>>> I'm new to Hadoop.
>>>> I have a large quantity of JSON documents with a structure similar to
>>>> what is shown below.
>>>>
>>>>   {
>>>>     g : "some-group-identifier",
>>>>     sg : "some-subgroup-identifier",
>>>>     j : "some-job-identifier",
>>>>     page : 23,
>>>>     ... // other fields omitted
>>>>     important-data : [
>>>>       {
>>>>         f1 : "abc",
>>>>         f2 : "a",
>>>>         f3 : "/",
>>>>         ...
>>>>       },
>>>>       ...
>>>>       {
>>>>         f1 : "xyz",
>>>>         f2 : "q",
>>>>         f3 : "/",
>>>>         ...
>>>>       },
>>>>     ],
>>>>     ... // other fields omitted
>>>>     other-important-data : [
>>>>       {
>>>>         x1 : "ford",
>>>>         x2 : "green",
>>>>         x3 : 35,
>>>>         map : {
>>>>           "free-field" : "value",
>>>>           "other-free-field" : "value2"
>>>>         }
>>>>       },
>>>>       ...
>>>>       {
>>>>         x1 : "vw",
>>>>         x2 : "red",
>>>>         x3 : 54,
>>>>         ...
>>>>       },
>>>>     ]
>>>>   }
>>>>
>>>> Each file contains a single JSON document (gzip compressed; roughly
>>>> 200KB of pretty-printed JSON text per document when uncompressed).
>>>>
>>>> I am interested in analyzing only the "important-data" array and the
>>>> "other-important-data" array.
>>>> My source data would be easier to analyze if it looked like a couple
>>>> of tables with a fixed set of columns. Only the column "map" would be
>>>> a complex column; all others would be primitives.
>>>>
>>>>   ( g, sg, j, page, f1, f2, f3 )
>>>>
>>>>   ( g, sg, j, page, x1, x2, x3, map )
>>>>
>>>> So, for each JSON document, I would like to "create" several rows, but
>>>> I would like to avoid the intermediate step of persisting -and
>>>> duplicating- the "flattened" data.
>>>>
>>>> To avoid persisting the flattened data, I thought I had to write my
>>>> own map-reduce in Java, but I discovered that others have had the same
>>>> problem of using JSON as a source, and there are somewhat "standard"
>>>> solutions.
>>>>
>>>> From reading about the SerDe approach for Hive, I get the impression
>>>> that each JSON document is transformed into a single "row" of the
>>>> table, with some columns being an array or a map of other nested
>>>> structures.
>>>> a) Is there a way to break each JSON document into several "rows" for
>>>> a Hive external table?
>>>> b) It seems there are too many JSON SerDe libraries! Are any of them
>>>> considered the de-facto standard?
>>>>
>>>> The Pig approach also seems promising, using Elephant Bird. Does
>>>> anybody have pointers to more user documentation for this project? Or
>>>> is browsing through the examples on GitHub my only source?
>>>>
>>>> Thanks

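P.S. For anyone finding this thread in the archives: since I ended up rewriting the input files anyway, the same preprocessing pass could in principle emit the flattened rows directly, e.g. the ( g, sg, j, page, f1, f2, f3 ) table discussed above, without involving Pig or Hive at all. A hypothetical, untested sketch along the same lines (field names taken from my sample document; tab-separated output; assumes local, uncompressed copies of the files):

    # Emit one tab-separated row per element of the "important-data" array,
    # repeating the top-level fields alongside each element.
    import json
    import sys

    def rows(doc):
        head = [doc.get("g"), doc.get("sg"), doc.get("j"), doc.get("page")]
        for item in doc.get("important-data", []):
            yield head + [item.get("f1"), item.get("f2"), item.get("f3")]

    if __name__ == "__main__":
        for path in sys.argv[1:]:
            with open(path) as f:
                for row in rows(json.load(f)):
                    print("\t".join("" if v is None else str(v) for v in row))

Whether that is acceptable depends on how much duplicated flattened data you can afford to persist, which is exactly what I wanted to avoid, so I am sticking with the Pig route for now.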