Hi James,

Thank you for pointing that issue out! :-) I wanted to mention another alternative to Kite that I came across, to hear whether you have any insight on this approach as well, if you don't mind.
I saw a presentation of NiFi and Parquet being used in a guest project. Not many implementation details were discussed, but it was mentioned that they also used Apache Spark, apparently leaving only an output port in NiFi for Spark to read the data from. Someone at Hortonworks posted a tutorial <https://community.hortonworks.com/articles/12708/nifi-feeding-data-to-spark-streaming.html> (GitHub: <https://github.com/hortonworks-meetup-content/intro-to-apache-spark-streaming-with-apache-nifi-and-apache-kafka>) in January 2016 that seems to head in that direction. The configuration looked as follows, according to the tutorial's image:

https://community.hortonworks.com/storage/attachments/1669-screen-shot-2016-01-31-at-21029-pm.png

The group presentation also used Spark, but I am not sure whether they used the same port approach; this is all I have:

*PackageToParquetRunner* <-> getFilePaths() <-> datalake [RDD<String, String>]
*PackageToParquetRunner* -> FileProcessorClass -> RDD filter -> RDD flatMap -> RDD map -> RDD<Row> -> *PackageToParquetRunner* -> create DataFrame (SQLContext) -> write Parquet (DataFrame)

When you say "then running periodic jobs to build Parquet data sets", would a Spark setup like this be what you mean by periodic jobs? I am only minimally acquainted with how Spark does MapReduce-style processing with RDDs, so I am not certain to what extent it would support the NiFi pipeline for this purpose (not to mention that, the way it appears, it leaves a hole in the NiFi diagram at the output port, beyond which data provenance can no longer be monitored).

---

Do you think these details, and the Kite details, would be worth mentioning as a comment on the JIRA issue you pointed out?

Thanks!

On Tue, Feb 14, 2017 at 11:46 AM, James Wing <[email protected]> wrote:

> Carlos,
>
> Welcome to NiFi! I believe the Kite dataset is currently the most direct,
> built-in solution for writing Parquet files from NiFi.
>
> I'm not an expert on Parquet, but I understand columnar formats like
> Parquet and ORC are not easily written to in the incremental, streaming
> fashion that NiFi excels at (I hope writing this will prompt expert
> correction). Other alternatives typically involve NiFi writing to more
> stream-friendly data stores or formats directly, then running periodic jobs
> to build Parquet data sets. Hive, Drill, and similar tools can do this.
>
> You are certainly not alone in wanting better Parquet support, there is at
> least one JIRA ticket for it as well:
>
> Add processors for Google Cloud Storage Fetch/Put/Delete
> https://issues.apache.org/jira/browse/NIFI-2725
>
> You might want to chime in with some details of your use case, or create a
> new ticket if that's not a fit for you.
>
> Thanks,
>
> James
>
> On Mon, Feb 13, 2017 at 3:13 PM, Carlos Paradis <[email protected]> wrote:
>
>> Hi,
>>
>> Our group has recently started trying to prototype a setup of
>> Hadoop+Spark+NiFi+Parquet and I have been having trouble finding any
>> documentation other than a scarce discussion on using Kite as a workaround
>> to integrate NiFi and Parquet.
>>
>> Are there any future plans for this integration from NiFi or would anyone
>> be able to give me some insight in which scenario this workaround would
>> (not) be worthwhile and alternatives?
>>
>> The most recent discussion
>> <http://apache-nifi-developer-list.39713.n7.nabble.com/parquet-format-td10145.html>
>> I found in this list dates from May 11, 2016. I also saw some interest in
>> doing this on Stackoverflow here
>> <http://stackoverflow.com/questions/37149331/apache-nifi-hdfs-parquet-format>,
>> and here
>> <http://stackoverflow.com/questions/37165764/convert-incoming-message-to-parquet-format>.
>>
>> Thanks,
>>
>> --
>> Carlos Paradis
>> http://carlosparadis.com <http://carlosandrade.co>
>

--
Carlos Paradis
http://carlosparadis.com <http://carlosandrade.co>
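P.S. In case it helps clarify what I am asking about, here is a rough sketch of how I read the filter -> flatMap -> map -> DataFrame steps in the diagram from my message. This is plain Python standing in for the PySpark RDD calls, and all the data and names in it are made up by me, not taken from the presentation:

```python
# Plain-Python stand-in for the RDD pipeline sketched above; in actual
# PySpark these list comprehensions would be rdd.filter / rdd.flatMap /
# rdd.map, followed by createDataFrame(...).write.parquet(...).

# Hypothetical file payloads, as NiFi might hand them over (made-up data).
payloads = ["1,alpha\n2,beta", "", "3,gamma"]

# RDD filter: drop empty payloads.
non_empty = [p for p in payloads if p]

# RDD flatMap: split each payload into individual record lines.
lines = [line for p in non_empty for line in p.split("\n")]

# RDD map: turn each line into a structured row (a dict here, standing in
# for pyspark.sql.Row).
rows = [dict(zip(("id", "value"), line.split(","))) for line in lines]

print(rows)

# In Spark, the final step in the diagram would then be roughly:
#   sqlContext.createDataFrame(rows).write.parquet("hdfs://.../out.parquet")
```

My question is essentially whether a job like this, scheduled to run periodically against what NiFi has already landed, is what you had in mind.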
