Thank you both, Bryan and Giovanni, for giving me so much insight on this matter.
I see why you would strongly prefer Kite over this, now that I have landed on a tutorial on the kite-dataset CLI <http://blog.cloudera.com/blog/2014/12/how-to-ingest-data-quickly-using-the-kite-cli/> and its documentation page <http://kitesdk.org/docs/1.1.0> (thanks for pointing the name out). I also noticed that NIFI-238 <https://issues.apache.org/jira/browse/NIFI-238?focusedCommentId=14350688&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14350688> (pull request <https://github.com/apache/nifi/pull/24#discussion_r24779170>) incorporated Kite into NiFi back in 2015, and NIFI-1193 <https://issues.apache.org/jira/browse/NIFI-1193> extended it to Hive in 2016, making three processors available. I am confused, though, because they are no longer in the documentation <https://nifi.apache.org/docs/nifi-docs/>; I only see StoreInKiteDataset <https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.kite.StoreInKiteDataset/index.html>, which appears to be a newer version of what was called 'KiteStorageProcessor' on GitHub, but I don't see the other two.

My original goal was to have one HDFS storage dedicated to raw data alone, and a second HDFS dedicated to storing pre-processed data and analysis. If I were to do this with Kite and NiFi, the way I currently see it being done is:

---------

*Raw Data HDFS:*

- Apache NiFi
  - A set of GetFile and GetHTTP processors to acquire the data from the multiple sources we have.
  - A PutHDFS processor to store the raw data in HDFS.

*Pre-Processed & Analysis HDFS:*

- Apache NiFi
  - A set of GetHDFS processors to get data from the *Raw Data HDFS*.
  - A set of ExecuteScript processors to convert XML files to JSON or CSV.
  - A set of ConvertCSVToAvro and ConvertJSONToAvro processors, since the Kite processor requires Avro input.
  - A StoreInKiteDataset processor to store all data in either Avro or Parquet format.
- Apache Spark
  - Batch jobs to pre-process the data into analysis sets to be exported elsewhere (dashboard, machine learning, etc.).
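For the ExecuteScript step above, the kind of conversion logic I have in mind is roughly this (a stdlib-only sketch; the element names are made up for illustration, and a real NiFi script would read from the flowfile content rather than a string):

```python
import json
import xml.etree.ElementTree as ET

def xml_to_json(xml_text):
    """Flatten one level of child elements into a JSON object."""
    root = ET.fromstring(xml_text)
    record = {child.tag: child.text for child in root}
    return json.dumps(record)

# Hypothetical sample record, just to show the shape of the output.
sample = "<reading><sensor>s1</sensor><value>42</value></reading>"
print(xml_to_json(sample))  # {"sensor": "s1", "value": "42"}
```

Inside NiFi this would presumably run via one of the ExecuteScript engines (e.g. Jython), wrapped in the usual session/flowfile plumbing.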
---------

However, a few things I am still confused about: (1) Is this the best way to go about storing the raw data? (2) Would ExecuteScript allow for MapReduce, or would it become a bottleneck? (3) I originally considered using the Spark Streaming module for mini-batches, integrated with NiFi, to at least pre-process the data as it arrives, but I am a bit unclear on how to go about this now. Would creating a port through NiFi be the way to go? (That is the only way I saw it being done in tutorials.)

Thank you,

On Wed, Feb 15, 2017 at 7:02 AM, Giovanni Lanzani <[email protected]> wrote:

> Hi Carlos,
>
> I'm just chiming in, but if I wouldn't use Kite (disclaimer: I would in
> this case), the workflow would look like this:
>
> - do stuff with NiFi
> - convert flowfiles to Avro
> - (optional: merge Avro files)
> - PutHDFS into a temp folder
> - periodically run Spark on that temp folder to convert to Parquet.
>
> I believe you can work out the first four points by yourself. The last
> point would just be a Python file that looks like this:
>
>     from pyspark.sql import SparkSession
>
>     spark = (SparkSession.builder
>              .appName("Python Spark SQL basic example")
>              .config("spark.some.config.option", "some-value")
>              .getOrCreate())
>
>     (spark.read.format('com.databricks.spark.avro')
>      .load('/tmp/path/dataset.avro')
>      .write.format('parquet')
>      .mode('append')
>      .save('/path/to/outfile'))
>
> You can then periodically invoke this file with spark-submit filename.py.
>
> For optimal usage, I'd explore the option of having the temporary path
> folder partitioned by hour (or day) and then invoking the above script
> once per temporary folder.
>
> That said, a few remarks:
>
> - this is a rather complicated flow for something so simple.
> Kite-dataset would work better;
> - however, if you need more complicated processing, you have all the
> options to do so;
> - as Parquet is columnar storage, having little files is useless. So when
> you're merging them, make sure you have enough data (>~ 50 MB, and up to
> several tens of GBs) in the final file;
> - the above code is trivially portable to Scala if you prefer, as I'm
> using Python as a mere DSL on top of Spark (no serializations outside the
> JVM).
>
> Cheers,
>
> Giovanni
>
> *From:* Carlos Paradis [mailto:[email protected]]
> *Sent:* Tuesday, February 14, 2017 11:12 PM
> *To:* [email protected]
> *Subject:* Re: Integration between Apache NiFi and Parquet or Workaround?
>
> Hi James,
>
> Thank you for pointing the issue out! :-) I wanted to point out another
> alternative solution to Kite I observed, to hear whether you have any
> insight on this approach too, if you don't mind.
>
> When I saw a presentation of NiFi and Parquet being used in a guest
> project, although not many implementation details were discussed, it was
> mentioned that they also used Apache Spark, (apparently) only leaving a
> port from NiFi to read in the data. Someone at Hortonworks posted a
> tutorial on it
> <https://community.hortonworks.com/articles/12708/nifi-feeding-data-to-spark-streaming.html>
> (github
> <https://github.com/hortonworks-meetup-content/intro-to-apache-spark-streaming-with-apache-nifi-and-apache-kafka>)
> in Jan 2016 that seems to head in that direction.
> The configuration looked as follows, according to the tutorial's image:
>
> https://community.hortonworks.com/storage/attachments/1669-screen-shot-2016-01-31-at-21029-pm.png
>
> The group presentation also used Spark, but I am not sure whether they
> used the same port approach; this is all I have:
>
> *PackageToParquetRunner* <-> getFilePaths() <-> datalake [RDD<String, String>]
>
> *PackageToParquetRunner* -> FileProcessorClass -> RDD Filter ->
> RDD flatMap -> RDD Map -> RDD<Row> -> *PackageToParquetRunner* ->
> Create Data Frame (SQL Context) -> Write Parquet (DataFrame).
>
> When you say,
>
> "then running periodic jobs to build Parquet data sets,"
>
> would such a Spark setup count as periodic jobs? I am only minimally
> acquainted with how Spark goes about MapReduce using RDDs, and I am not
> certain to what extent it would support the NiFi pipeline for this purpose
> (not to mention that, the way it appears, it leaves a hole in the NiFi
> diagram as a port, which makes it impossible to monitor for data
> provenance).
>
> ---
>
> Do you think these details and the Kite details would be worth mentioning
> as a comment on the JIRA issue you pointed out?
>
> Thanks!
>
> On Tue, Feb 14, 2017 at 11:46 AM, James Wing <[email protected]> wrote:
>
> Carlos,
>
> Welcome to NiFi! I believe the Kite dataset is currently the most direct,
> built-in solution for writing Parquet files from NiFi.
>
> I'm not an expert on Parquet, but I understand columnar formats like
> Parquet and ORC are not easily written to in the incremental, streaming
> fashion that NiFi excels at (I hope writing this will prompt expert
> correction). Other alternatives typically involve NiFi writing to more
> stream-friendly data stores or formats directly, then running periodic
> jobs to build Parquet data sets. Hive, Drill, and similar tools can do
> this.
> You are certainly not alone in wanting better Parquet support; there is
> at least one JIRA ticket for it as well:
>
> Add processors for Google Cloud Storage Fetch/Put/Delete
> https://issues.apache.org/jira/browse/NIFI-2725
>
> You might want to chime in with some details of your use case, or create
> a new ticket if that's not a fit for you.
>
> Thanks,
>
> James
>
> On Mon, Feb 13, 2017 at 3:13 PM, Carlos Paradis <[email protected]> wrote:
>
> Hi,
>
> Our group has recently started trying to prototype a setup of
> Hadoop + Spark + NiFi + Parquet, and I have been having trouble finding
> any documentation other than scarce discussion on using Kite as a
> workaround to integrate NiFi and Parquet.
>
> Are there any future plans for this integration from NiFi, or would
> anyone be able to give me some insight into the scenarios in which this
> workaround would (not) be worthwhile, and alternatives?
>
> The most recent discussion
> <http://apache-nifi-developer-list.39713.n7.nabble.com/parquet-format-td10145.html>
> I found on this list dates from May 11, 2016. I also saw some interest in
> doing this on Stack Overflow here
> <http://stackoverflow.com/questions/37149331/apache-nifi-hdfs-parquet-format>
> and here
> <http://stackoverflow.com/questions/37165764/convert-incoming-message-to-parquet-format>.
>
> Thanks,
>
> --
> Carlos Paradis
> http://carlosparadis.com <http://carlosandrade.co>
>
> --
> Carlos Paradis
> http://carlosparadis.com <http://carlosandrade.co>

--
Carlos Paradis
http://carlosparadis.com <http://carlosandrade.co>
