Re: Apache NiFi/Hive - store merged tweets in HDFS, create table in hive

Simon Ball Thu, 21 Apr 2016 13:42:56 -0700

For most hive JSON serdes you are going to want what some people call JSON 
record format. This is essentially a text file with a JSON document per line 
which represents a record, with reasonably consistent structure. You can 
achieve this by ensuring your JSON is not pretty-printed (one doc per line) 
and then just using binary concatenation in the MergeContent processor Bryan 
mentioned.
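
For the Hive side of that layout, a minimal sketch of an external table over 
one-document-per-line JSON might look like the following. It assumes the 
JsonSerDe from the hive-hcatalog-core jar is available to Hive; the table name, 
column list, and location are hypothetical and would need to match your own 
tweets and HDFS path.

```sql
-- Sketch only: table name, columns, and location are hypothetical.
-- Assumes the hive-hcatalog-core jar (which provides JsonSerDe) is on Hive's classpath.
CREATE EXTERNAL TABLE IF NOT EXISTS tweets_json (
  id BIGINT,          -- top-level "id" key of each tweet document
  created_at STRING,  -- top-level "created_at" key
  text STRING,        -- top-level "text" key
  lang STRING         -- top-level "lang" key
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
LOCATION '/tmp/tweets_json';

-- Each line of each file under the location is parsed as one record.
SELECT id, text FROM tweets_json LIMIT 10;
```

If the incoming tweets do not already end with a newline, the merge would 
probably also need a newline as the Demarcator so that each document lands on 
its own line.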


Simon


On 21 Apr 2016, at 22:38, Bryan Bende wrote:

Also, this blog has a picture of what I described with MergeContent:

https://blogs.apache.org/nifi/entry/indexing_tweets_with_nifi_and

-Bryan

On Thu, Apr 21, 2016 at 4:37 PM, Bryan Bende wrote:
Hi Igor,

I don't know that much about Hive so I can't really say what format it needs to 
be in for Hive to understand it.

If it needs to be a valid array of JSON documents, in MergeContent change the 
Delimiter Strategy to "Text", which means it will use whatever values you type 
directly into Header, Footer, and Demarcator, and then specify [, ], and , 
respectively as the values.

That will get you something like this where {...} are the incoming documents:

[
{...},
{...}
]

-Bryan


On Thu, Apr 21, 2016 at 4:06 PM, Igor Kravzov wrote:
Hi Bryan,

I am aware of this example, but I want to store the JSON as it is and create an 
external table, as in this example: 
http://hortonworks.com/blog/howto-use-hive-to-sqlize-your-own-tweets-part-two-loading-hive-sql-queries/
What I don't know is how to properly merge multiple JSON documents into one file 
so that Hive can read it.

On Thu, Apr 21, 2016 at 2:33 PM, Bryan Bende wrote:
Hello,

I believe this example shows an approach to do it (it includes Hive even though 
the title is Solr/banana):
https://community.hortonworks.com/articles/1282/sample-hdfnifi-flow-to-push-tweets-into-solrbanana.html

The short version is that it extracts several attributes from each tweet using 
EvaluateJsonPath, then uses ReplaceText to replace the FlowFile content with a 
pipe delimited string of those attributes, and then creates a Hive table that 
knows how to handle that delimiter. With this approach you don't need to set 
the header, footer, and demarcator in MergeContent.

create table if not exists tweets_text_partition(
tweet_id bigint,
created_unixtime bigint,
created_time string,
displayname string,
msg string,
fulltext string
)
row format delimited fields terminated by "|"
location "/tmp/tweets_staging";
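
Once pipe-delimited files have landed in that location, a query along these 
lines could serve as a quick sanity check that rows and columns are being 
split as expected. This is only a sketch against the table defined above; the 
aggregation itself is just an illustration.

```sql
-- Sketch: count tweets per display name in the staging table defined above.
SELECT displayname, COUNT(*) AS tweet_count
FROM tweets_text_partition
GROUP BY displayname
ORDER BY tweet_count DESC
LIMIT 10;
```

Note that with a delimited text table, any tweet text that itself contains a 
"|" or a line break would shift columns or split rows, so those characters 
would need to be stripped or replaced before the merge.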

-Bryan


On Thu, Apr 21, 2016 at 1:52 PM, Igor Kravzov wrote:
Hi guys,

I want to create the following workflow:

1. Fetch tweets using the GetTwitter processor.
2. Merge tweets into a bigger file using the MergeContent processor.
3. Store the merged files in HDFS.
4. On the Hadoop/Hive side, create an external table based on these 
tweets.

There are examples of how to do this, but what I am missing is how to configure 
the MergeContent processor: what to set as header, footer, and demarcator, and 
what to use on the Hive side as a separator so that it will split the merged 
tweets into rows. I hope I have described this clearly.

Thanks in advance.
