I am exploring using the Kite processor to store data into Hadoop. I hope this lets me change the storage engine from HDFS to Hive to HBase later. Since my Hadoop distribution is MapR, I haven't had full success yet.

Sumo
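For what it's worth, here is a rough, untested sketch (plain Java against the Kite SDK rather than the NiFi processor itself) of what I mean by switching storage engines through the dataset URI. The dataset name, the paths and the two-field schema are placeholders I made up for illustration:

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.kitesdk.data.Dataset;
import org.kitesdk.data.DatasetDescriptor;
import org.kitesdk.data.DatasetWriter;
import org.kitesdk.data.Datasets;

public class KiteUriSketch {
    public static void main(String[] args) {
        // Placeholder schema for an event; the real schema would come from my data.
        Schema schema = SchemaBuilder.record("Event").fields()
                .requiredString("id")
                .requiredLong("ts")
                .endRecord();

        // The storage engine is chosen by the URI alone, e.g.
        //   dataset:hdfs:/data/events        (files in HDFS)
        //   dataset:hive:default/events      (Hive-managed dataset)
        // and, as far as I understand, an hbase: URI for HBase.
        String uri = "dataset:hdfs:/data/events";

        DatasetDescriptor descriptor = new DatasetDescriptor.Builder()
                .schema(schema)
                .build();
        Dataset<GenericRecord> events = Datasets.create(uri, descriptor, GenericRecord.class);

        // Write one placeholder record.
        GenericRecord record = new GenericData.Record(schema);
        record.put("id", "evt-1");
        record.put("ts", System.currentTimeMillis());

        DatasetWriter<GenericRecord> writer = events.newWriter();
        try {
            writer.write(record);
        } finally {
            writer.close();
        }
    }
}

The rest of the code stays the same when the URI changes, which is the portability I'm after - but again, I haven't got this working end to end on MapR yet.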
> On Mar 2, 2016, at 2:54 AM, Mike Harding <[email protected]> wrote:
>
> Hi Conrad,
>
> Thanks for the heads up, I will investigate Apache Drill. I also forgot to
> mention that I have downstream requirements about which tools the data
> modellers are comfortable using - they want to use Hive and Spark as the
> primary data access engines, so the data needs to be persisted in HDFS in a
> way that it can be easily accessed by these services.
>
> But you're right - there are multiple ways of doing this, and I'm hoping NiFi
> will help scope/simplify the pipeline design.
>
> Cheers,
> M
>
>> On 2 March 2016 at 10:38, Conrad Crampton <[email protected]>
>> wrote:
>> Hi,
>> I am doing something similar, but having wrestled with Hive data population
>> (not from NiFi) and its performance, I am currently looking at Apache Drill
>> as my SQL abstraction layer over my Hadoop cluster (similar size to yours).
>> To this end, I have chosen Avro as my ‘persistence’ format and use a number
>> of processors to get from raw data, through mapping attributes to JSON, to
>> Avro (via schemas), ultimately storing in HDFS. Querying this with Drill is
>> then a breeze, as the schema is already specified within the data, which
>> Drill understands. The schema can also be extended without impacting
>> existing data.
>> HTH – I’m sure there are a ton of other ways to skin this particular cat,
>> though.
>> Conrad
>>
>> From: Mike Harding <[email protected]>
>> Reply-To: "[email protected]" <[email protected]>
>> Date: Wednesday, 2 March 2016 at 10:33
>> To: "[email protected]" <[email protected]>
>> Subject: Nifi JSON event storage in HDFS
>>
>> Hi All,
>>
>> I currently have a small Hadoop cluster running with HDFS and Hive. My
>> ultimate goal is to leverage NiFi's ingestion and flow capabilities to store
>> real-time external JSON-formatted event data.
>>
>> What I am unclear about is what the best strategy/design is for storing
>> FlowFile data (i.e. JSON events in my case) within HDFS so that it can then
>> be accessed and analysed in Hive tables.
>>
>> Is much of the design in terms of storage handled in the NiFi flow, or do I
>> need to set something up external to NiFi to ensure I can query each
>> JSON-formatted event as a record in, for example, a Hive log table?
>>
>> Any examples or suggestions much appreciated,
>>
>> Thanks,
>> M
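PS - for anyone following the Avro route Conrad describes above, here is a minimal, hypothetical sketch of the JSON-to-Avro step done outside NiFi: it decodes a single JSON event against a schema and writes it as an Avro container file, so the schema travels with the data (which, as Conrad notes, is why Drill can query it directly). The schema, field names and output path are placeholders; inside NiFi something like the ConvertJSONToAvro and PutHDFS processors would cover the same ground.

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.JsonDecoder;

public class JsonToAvroSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder schema matching the placeholder JSON event below.
        Schema schema = SchemaBuilder.record("Event").fields()
                .requiredString("id")
                .requiredLong("ts")
                .endRecord();

        String json = "{\"id\": \"evt-1\", \"ts\": 1456912345000}";

        // Decode the JSON text into an Avro GenericRecord using the schema.
        JsonDecoder decoder = DecoderFactory.get().jsonDecoder(schema, json);
        GenericRecord record = new GenericDatumReader<GenericRecord>(schema).read(null, decoder);

        // Write an Avro container file; the schema is embedded in the file
        // header, so Drill (or a Hive external table over Avro) can read it
        // without a separately managed schema.
        DataFileWriter<GenericRecord> writer =
                new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema));
        writer.create(schema, new File("/tmp/events.avro"));
        writer.append(record);
        writer.close();
    }
}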
