Thanks, guys. I think it will work.
One thing: the merged file comes out without an extension. How do I add an
extension to the merged file?

On Thu, Apr 21, 2016 at 4:42 PM, Simon Ball <sb...@hortonworks.com> wrote:

> For most Hive JSON serdes you are going to want what some people call JSON
> record format. This is essentially a text file with one JSON document per
> line, where each document represents a record with a reasonably consistent
> structure. You can achieve this by ensuring your JSON is not pretty-printed
> (one document per line) and then just using binary concatenation in the
> MergeContent processor Bryan mentioned.
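>
> As a minimal sketch, a Hive table over such a file might look like the DDL
> below (this assumes the JsonSerDe from hive-hcatalog-core is available on
> your cluster, and the column names are hypothetical; they must match your
> JSON keys):
>
> create external table if not exists tweets_raw (
>   id bigint,
>   created_at string,
>   `text` string
> )
> row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
> location '/tmp/tweets_staging';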
>
> Simon
>
>
> On 21 Apr 2016, at 22:38, Bryan Bende <bbe...@gmail.com> wrote:
>
> Also, this blog has a picture of what I described with MergeContent:
>
> https://blogs.apache.org/nifi/entry/indexing_tweets_with_nifi_and
>
> -Bryan
>
> On Thu, Apr 21, 2016 at 4:37 PM, Bryan Bende <bbe...@gmail.com> wrote:
>
>> Hi Igor,
>>
>> I don't know that much about Hive so I can't really say what format it
>> needs to be in for Hive to understand it.
>>
>> If it needs to be a valid array of JSON documents, in MergeContent change
>> the Delimiter Strategy to "Text", which means it will use whatever values
>> you type directly into the Header, Footer, and Demarcator properties; then
>> specify "[", "]", and "," respectively as the values.
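>>
>> For example, a sketch of the relevant MergeContent properties (a newline
>> after the comma is optional and can usually be entered with Shift+Enter):
>>
>> Delimiter Strategy: Text
>> Header: [
>> Footer: ]
>> Demarcator: ,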
>>
>> That will get you something like this, where {...} are the incoming
>> documents:
>>
>> [
>> {...},
>> {...}
>> ]
>>
>> -Bryan
>>
>>
>> On Thu, Apr 21, 2016 at 4:06 PM, Igor Kravzov <igork.ine...@gmail.com>
>> wrote:
>>
>>> Hi Bryan,
>>>
>>> I am aware of this example, but I want to store the JSON as-is and create
>>> an external table, like in this example:
>>> http://hortonworks.com/blog/howto-use-hive-to-sqlize-your-own-tweets-part-two-loading-hive-sql-queries/
>>> What I don't know is how to properly merge multiple JSON documents into
>>> one file so that Hive can read it properly.
>>>
>>> On Thu, Apr 21, 2016 at 2:33 PM, Bryan Bende <bbe...@gmail.com> wrote:
>>>
>>>> Hello,
>>>>
>>>> I believe this example shows an approach to do it (it includes Hive
>>>> even though the title is Solr/banana):
>>>>
>>>> https://community.hortonworks.com/articles/1282/sample-hdfnifi-flow-to-push-tweets-into-solrbanana.html
>>>>
>>>> The short version is that it extracts several attributes from each
>>>> tweet using EvaluateJsonPath, then uses ReplaceText to replace the FlowFile
>>>> content with a pipe-delimited string of those attributes, and then creates
>>>> a Hive table that knows how to handle that delimiter. With this approach
>>>> you don't need to set the header, footer, and demarcator in MergeContent.
>>>>
>>>> create table if not exists tweets_text_partition(
>>>>   tweet_id bigint,
>>>>   created_unixtime bigint,
>>>>   created_time string,
>>>>   displayname string,
>>>>   msg string,
>>>>   fulltext string
>>>> )
>>>> row format delimited fields terminated by "|"
>>>> location "/tmp/tweets_staging";
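>>>>
>>>> A minimal sketch of the matching ReplaceText Replacement Value, using
>>>> NiFi Expression Language (the attribute names here are hypothetical and
>>>> must match whatever you extract with EvaluateJsonPath):
>>>>
>>>> ${twitter.tweet_id}|${twitter.unixtime}|${twitter.created_time}|${twitter.displayname}|${twitter.msg}|${twitter.fulltext}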
>>>>
>>>> -Bryan
>>>>
>>>>
>>>> On Thu, Apr 21, 2016 at 1:52 PM, Igor Kravzov <igork.ine...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi guys,
>>>>>
>>>>> I want to create the following workflow:
>>>>>
>>>>> 1. Fetch tweets using the GetTwitter processor.
>>>>> 2. Merge tweets into a bigger file using the MergeContent processor.
>>>>> 3. Store the merged files in HDFS.
>>>>> 4. On the Hadoop/Hive side, I want to create an external table based on
>>>>> these tweets.
>>>>>
>>>>> There are examples of how to do this, but what I am missing is how to
>>>>> configure the MergeContent processor: what to set as the header, footer,
>>>>> and demarcator, and what to use on the Hive side as a separator so that
>>>>> it will split the merged tweets into rows. I hope I described it clearly.
>>>>>
>>>>> Thanks in advance.
>>>>>
>>>>
>>>>
>>>
>>
>
