I've been doing a lot of this recently, into both Hive and Spark. One thing that will make life a lot easier is to use the JSON record file format: essentially one JSON document per line of a text file. That means you can use NiFi's MergeContent processor to handle batching into HDFS. Avro also makes a lot of sense, and can be generated directly out of NiFi.
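For example, once MergeContent has batched the JSON-per-line files into an HDFS directory, you can point an external Hive table at it. A minimal sketch (the table name, columns, and /data/events path are hypothetical placeholders; assumes the JsonSerDe from hive-hcatalog-core is on Hive's classpath):

    -- External table over the HDFS directory NiFi writes to.
    -- Each line of each file is one JSON document, i.e. one row;
    -- column names are matched to the JSON keys by the SerDe.
    CREATE EXTERNAL TABLE events (
      event_time STRING,
      event_type STRING,
      payload    STRING
    )
    ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
    STORED AS TEXTFILE
    LOCATION '/data/events/';

New files landed by NiFi are picked up on the next query, with no further setup needed on the Hive side.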
Simon

-
Simon Elliston Ball
Product Solutions Architect
+44 7930 424111
Hortonworks - Powering the future of data

On 2 Mar 2016, at 11:33, Mike Harding <[email protected]> wrote:

Hi All,

I currently have a small Hadoop cluster running with HDFS and Hive. My ultimate goal is to leverage NiFi's ingestion and flow capabilities to store real-time external JSON-formatted event data.

What I am unclear about is the best strategy/design for storing FlowFile data (i.e. JSON events in my case) within HDFS so that it can then be accessed and analysed in Hive tables. Is most of the storage design handled in the NiFi flow, or do I need to set something up external to NiFi to ensure I can query each JSON-formatted event as a record in a Hive log table, for example?

Any examples or suggestions much appreciated,

Thanks,
M
