Hi Conrad,

Thanks for the heads up; I will investigate Apache Drill. I also forgot to mention that I have downstream requirements about which tools the data modellers are comfortable using: they want to use Hive and Spark as their primary data access engines, so the data needs to be persisted in HDFS in a way that these services can easily access.
But you're right - there are multiple ways of doing this, and I'm hoping NiFi would help scope/simplify the pipeline design.

Cheers,
M

On 2 March 2016 at 10:38, Conrad Crampton <[email protected]> wrote:

> Hi,
> I am doing something similar, but having wrestled with Hive data
> population (not from NiFi) and its performance I am currently looking at
> Apache Drill as my SQL abstraction layer over my Hadoop cluster (similar
> size to yours). To this end, I have chosen Avro as my ‘persistence’ format
> and using a number of processors to get from raw data through mapping
> attributes to json to avro (via schemas) and ultimately storing in HDFS.
> Querying this with Drill is a breeze then as the schema is already
> specified within the data which Drill understands. The schema can also be
> extended without impacting existing data too.
> HTH – I’m sure there are a ton of other ways to skin this particular cat
> though,
> Conrad
>
> From: Mike Harding <[email protected]>
> Reply-To: "[email protected]" <[email protected]>
> Date: Wednesday, 2 March 2016 at 10:33
> To: "[email protected]" <[email protected]>
> Subject: Nifi JSON event storage in HDFS
>
> Hi All,
>
> I currently have a small hadoop cluster running with HDFS and Hive. My
> ultimate goal is to leverage NiFi's ingestion and flow capabilities to
> store real-time external JSON formatted event data.
>
> What I am unclear about is what the best strategy/design is for storing
> FlowFile data (i.e. JSON events in my case) within HDFS that can then be
> accessed and analysed in Hive tables.
>
> Is much of the design in terms of storage handled in the NiFi flow or do I
> need to set something up external of NiFi to ensure I can query each JSON
> formatted event as a record in a Hive log table for example?
>
> Any examples or suggestions much appreciated,
>
> Thanks,
> M
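
As a rough illustration of the storage question in the original message: one possible layout (an assumption, not something settled in this thread) is to write one JSON event per line into date-partitioned directories. Line-oriented Hive JSON SerDes and Spark's JSON reader can both read data in that shape. The sketch below uses only the Python standard library and builds the file contents in memory rather than writing to HDFS. The `/data/events` base path, the `dt=` partition naming, and the epoch-seconds `timestamp` field are all hypothetical. In a NiFi flow this work would be done by processors rather than a script.

```python
import json
from collections import defaultdict
from datetime import datetime, timezone
from pathlib import PurePosixPath

BASE_DIR = PurePosixPath("/data/events")  # hypothetical HDFS base path


def partition_path(event, base=BASE_DIR):
    """Return the partition directory for an event.

    Assumes each event carries a 'timestamp' field in epoch seconds
    (a hypothetical field name chosen for this sketch).
    """
    day = datetime.fromtimestamp(event["timestamp"], tz=timezone.utc)
    return base / f"dt={day.strftime('%Y-%m-%d')}"


def to_json_lines(events):
    """Group events by partition and serialise each group as
    newline-delimited JSON (one record per line)."""
    grouped = defaultdict(list)
    for event in events:
        # Compact single-line JSON so each line is exactly one record.
        line = json.dumps(event, separators=(",", ":"), sort_keys=True)
        grouped[str(partition_path(event))].append(line)
    return {path: "\n".join(lines) + "\n" for path, lines in grouped.items()}


# Made-up sample events, purely for illustration.
events = [
    {"timestamp": 1456914780, "type": "login", "user": "a"},
    {"timestamp": 1456914800, "type": "logout", "user": "a"},
    {"timestamp": 1457001200, "type": "login", "user": "b"},
]
batches = to_json_lines(events)
```

Each key in `batches` is a partition directory and each value is the content of one file to place in it. An external Hive table partitioned by `dt` could then be pointed at the base path, though the table definition and SerDe choice are left open here.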
