For most Hive JSON SerDes you are going to want what some people call JSON record format. This is essentially a text file with one JSON document per line, each representing a record, with a reasonably consistent structure. You can achieve this by ensuring your JSON is not pretty-printed (one document per line) and then just using binary concatenation in the MergeContent processor Bryan mentioned.
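As a rough illustration of the target file layout (not of NiFi itself), the sketch below builds a JSON-record-format payload in Python. The sample tweets and the function name to_record_format are made up for the example, and it assumes a newline ends up between documents, for instance as the MergeContent demarcator.

```python
import json


def to_record_format(docs, demarcator="\n"):
    """Serialize each document on a single line and join them with the demarcator."""
    # json.dumps without indent emits no raw newlines; line breaks inside
    # string values are escaped as \n, so each document stays on one line.
    return demarcator.join(json.dumps(doc) for doc in docs)


# Hypothetical tweets standing in for what GetTwitter would produce.
tweets = [
    {"id": 1, "text": "first tweet\nwith a line break"},
    {"id": 2, "text": "second tweet"},
]

merged = to_record_format(tweets)

# Every line of the merged payload parses on its own as one record.
records = [json.loads(line) for line in merged.split("\n")]
```

If the incoming documents are pretty-printed, they would need to be compacted in the same way before merging, otherwise a single record would span several lines.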
Simon

On 21 Apr 2016, at 22:38, Bryan Bende wrote:

Also, this blog has a picture of what I described with MergeContent:
https://blogs.apache.org/nifi/entry/indexing_tweets_with_nifi_and

-Bryan

On Thu, Apr 21, 2016 at 4:37 PM, Bryan Bende wrote:

Hi Igor,

I don't know that much about Hive so I can't really say what format it needs to be in for Hive to understand it.

If it needs to be a valid array of JSON documents, in MergeContent change the Delimiter Strategy to "Text", which means it will use whatever values you type directly into Header, Footer, and Demarcator, and then specify [ ] , respectively as the values. That will get you something like this, where {...} are the incoming documents:

[
{...},
{...}
]

-Bryan

On Thu, Apr 21, 2016 at 4:06 PM, Igor Kravzov wrote:

Hi Bryan,

I am aware of this example. But I want to store the JSON as it is and create an external table, like in this example:
http://hortonworks.com/blog/howto-use-hive-to-sqlize-your-own-tweets-part-two-loading-hive-sql-queries/

What I don't know is how to properly merge multiple JSON documents into one file so that Hive can read it properly.

On Thu, Apr 21, 2016 at 2:33 PM, Bryan Bende wrote:

Hello,

I believe this example shows an approach to do it (it includes Hive even though the title is Solr/banana):
https://community.hortonworks.com/articles/1282/sample-hdfnifi-flow-to-push-tweets-into-solrbanana.html

The short version is that it extracts several attributes from each tweet using EvaluateJsonPath, then uses ReplaceText to replace the FlowFile content with a pipe-delimited string of those attributes, and then creates a Hive table that knows how to handle that delimiter. With this approach you don't need to set the header, footer, and demarcator in MergeContent.

create table if not exists tweets_text_partition(
  tweet_id bigint,
  created_unixtime bigint,
  created_time string,
  displayname string,
  msg string,
  fulltext string
)
row format delimited fields terminated by "|"
location "/tmp/tweets_staging";

-Bryan

On Thu, Apr 21, 2016 at 1:52 PM, Igor Kravzov wrote:

Hi guys,

I want to create the following workflow:

1. Fetch tweets using the GetTwitter processor.
2. Merge tweets into a bigger file using the MergeContent processor.
3. Store the merged files in HDFS.
4. On the Hadoop/Hive side, create an external table based on these tweets.

There are examples of how to do this, but what I am missing is how to configure the MergeContent processor: what to set as header, footer, and demarcator, and what to use on the Hive side as the separator so that it will split the merged tweets into rows.

I hope I described this clearly. Thanks in advance.
