I'll caveat this by saying that up until 10 mins ago I had never looked at Parquet, so I could be completely wrong, but...
The Parquet API seems heavily geared towards HDFS. For example, take the AvroParquetWriter:

https://github.com/Parquet/parquet-mr/blob/master/parquet-avro/src/main/java/parquet/avro/AvroParquetWriter.java

You have to give it a Hadoop Path object to write the data to, so this wouldn't really work in the middle of a NiFi flow if you wanted to have a processor like ConvertXyzToParquet: the processor needs to write its output to an OutputStream backed by NiFi's internal repositories, not HDFS.

It could make sense at the end of a flow when writing to HDFS, so you could probably implement a custom processor, similar to PutHDFS, that uses the Parquet libraries to write the data to HDFS as Parquet (assuming you merged together a bunch of data before this). This is probably what the Kite processors are already doing, but I'm not sure.

-Bryan

On Tue, Feb 14, 2017 at 5:12 PM, Carlos Paradis <[email protected]> wrote:
> Hi James,
>
> Thank you for pointing the issue out! :-) I wanted to point out another
> alternative solution to Kite that I observed, to hear your insight on
> this approach too, if you don't mind.
>
> When I saw a presentation of NiFi and Parquet being used in a guest
> project, although not many implementation details were discussed, it was
> mentioned that they also used Apache Spark, (apparently only) leaving a
> port from NiFi to read in the data. Someone at Hortonworks posted a
> tutorial on it (GitHub) in Jan 2016 that seems to head in that direction.
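[Editor's note: the Hadoop Path coupling Bryan describes above shows up directly in the writer's API. A minimal sketch, assuming parquet-avro and the Hadoop client on the classpath; the schema, record, and output location are hypothetical:]

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import parquet.avro.AvroParquetWriter;

public class ParquetPathExample {
    public static void main(String[] args) throws Exception {
        // A hypothetical one-field Avro schema for illustration.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Event\",\"fields\":[" +
            "{\"name\":\"id\",\"type\":\"long\"}]}");

        // The writer is constructed from a Hadoop Path, not an OutputStream --
        // this is the mismatch with NiFi's repository-backed streams.
        Path target = new Path("hdfs://namenode:8020/data/events.parquet");
        AvroParquetWriter<GenericRecord> writer =
            new AvroParquetWriter<GenericRecord>(target, schema);

        GenericRecord record = new GenericData.Record(schema);
        record.put("id", 1L);
        writer.write(record);
        writer.close();
    }
}
```

A NiFi processor callback, by contrast, hands you an OutputStream, which is why a hypothetical ConvertXyzToParquet would have nowhere to point this Path.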
>
> The configuration looked as follows, according to the tutorial's image:
>
> https://community.hortonworks.com/storage/attachments/1669-screen-shot-2016-01-31-at-21029-pm.png
>
> The group presentation also used Spark, but I am not sure if they used
> the same port approach; this is all I have:
>
> PackageToParquetRunner <-> getFilePaths() <-> datalake [RDD<String, String>]
>
> PackageToParquetRunner -> FileProcessorClass -> RDD filter -> RDD flatMap ->
> RDD map -> RDD<Row> -> PackageToParquetRunner -> Create DataFrame
> (SQLContext) -> Write Parquet (DataFrame).
>
> When you say,
>
>> then running periodic jobs to build Parquet data sets.
>
> would such a Spark setup count as periodic jobs? I am only minimally
> acquainted with how Spark goes about MapReduce using RDDs, and I am not
> certain to what extent it would support the NiFi pipeline for this purpose
> (not to mention that, the way it appears, it leaves a hole in the NiFi
> diagram as a port, which makes it impossible to monitor for data
> provenance).
>
> ---
>
> Do you think these details and the Kite details would be worth mentioning
> as a comment on the JIRA issue you pointed out?
>
> Thanks!
>
> On Tue, Feb 14, 2017 at 11:46 AM, James Wing <[email protected]> wrote:
>>
>> Carlos,
>>
>> Welcome to NiFi! I believe the Kite dataset is currently the most direct,
>> built-in solution for writing Parquet files from NiFi.
>>
>> I'm not an expert on Parquet, but I understand that columnar formats like
>> Parquet and ORC are not easily written in the incremental, streaming
>> fashion that NiFi excels at (I hope writing this will prompt expert
>> correction). Other alternatives typically involve NiFi writing to more
>> stream-friendly data stores or formats directly, then running periodic
>> jobs to build Parquet data sets. Hive, Drill, and similar tools can do
>> this.
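[Editor's note: the RDD filter -> flatMap -> map -> DataFrame -> Parquet pipeline Carlos outlines maps roughly onto the sketch below, written against the Spark 1.6-era Java API (SQLContext/DataFrame) that matches the tutorial's timeframe. The class name, HDFS paths, input format, and schema are all hypothetical:]

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class PackageToParquetSketch {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
            new SparkConf().setAppName("package-to-parquet"));
        SQLContext sql = new SQLContext(sc);

        // Read raw lines from the landing directory NiFi delivers to, then
        // filter / flatMap / map down to Rows, as in the outlined pipeline.
        JavaRDD<Row> rows = sc.textFile("hdfs:///datalake/raw/")
            .filter(line -> !line.isEmpty())
            .flatMap(line -> Arrays.asList(line.split(",")))
            .map(field -> RowFactory.create(field));

        // A trivial single-column schema, purely for illustration.
        StructType schema = DataTypes.createStructType(Arrays.asList(
            DataTypes.createStructField("value", DataTypes.StringType, true)));

        // Create the DataFrame via the SQLContext and write it out as Parquet.
        DataFrame df = sql.createDataFrame(rows, schema);
        df.write().parquet("hdfs:///datalake/parquet/packages");
    }
}
```

Run on a schedule (cron, Oozie, etc.), a job like this would be one concrete form of the "periodic jobs to build Parquet data sets" James describes.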
>>
>> You are certainly not alone in wanting better Parquet support; there is
>> at least one JIRA ticket for it as well:
>>
>> Add processors for Google Cloud Storage Fetch/Put/Delete
>> https://issues.apache.org/jira/browse/NIFI-2725
>>
>> You might want to chime in with some details of your use case, or create
>> a new ticket if that's not a fit for you.
>>
>> Thanks,
>>
>> James
>>
>> On Mon, Feb 13, 2017 at 3:13 PM, Carlos Paradis <[email protected]> wrote:
>>>
>>> Hi,
>>>
>>> Our group has recently started prototyping a setup of
>>> Hadoop + Spark + NiFi + Parquet, and I have been having trouble finding
>>> any documentation other than a scarce discussion on using Kite as a
>>> workaround to integrate NiFi and Parquet.
>>>
>>> Are there any future plans for this integration from NiFi, or would
>>> anyone be able to give me some insight into which scenarios this
>>> workaround would (not) be worthwhile in, and alternatives?
>>>
>>> The most recent discussion I found on this list dates from May 11, 2016.
>>> I also saw some interest in doing this on Stack Overflow, here and here.
>>>
>>> Thanks,
>>>
>>> --
>>> Carlos Paradis
>>> http://carlosparadis.com
>
> --
> Carlos Paradis
> http://carlosparadis.com
