Hi Carlos,
I’m just chiming in, but if I weren’t using Kite (disclaimer: I would in this
case), the workflow would look like this:
- do stuff with NiFi
- convert flowfiles to Avro
- (optional: merge Avro files)
- PutHDFS into a temp folder
- periodically run Spark on that temp folder to convert to Parquet.
I believe you can work out the first four points by yourself. The last point
would just be a Python file that looks like this:
from pyspark.sql import SparkSession

# Build (or reuse) a Spark session.
spark = (SparkSession.builder
         .appName("Python Spark SQL basic example")
         .config("spark.some.config.option", "some-value")
         .getOrCreate())

# Read the staged Avro files (requires the spark-avro package on the
# classpath) and append them to a Parquet dataset.
(spark.read.format('com.databricks.spark.avro').load('/tmp/path/dataset.avro')
 .write.format('parquet')
 .mode('append')
 .save('/path/to/outfile'))
You can then periodically invoke this file with spark-submit filename.py
(adding the spark-avro package via --packages if it is not already on the
classpath). For optimal usage, I’d explore having the temporary path
partitioned by hour (or day) and then invoking the above script once per
temporary folder.
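A minimal sketch of that per-folder driver (the folder layout is an assumption,
and it presumes the script above is adapted to take its input folder as a
command-line argument):

```python
# Sketch: invoke the Avro->Parquet job once per hourly temp folder.
# The /base/YYYY-MM-DD/HH layout and the script name are illustrative.
import subprocess
from datetime import datetime, timedelta


def hourly_folders(base, day):
    """Return the 24 hourly partition paths under base for a given date."""
    start = datetime(day.year, day.month, day.day)
    return [(start + timedelta(hours=h)).strftime(base + "/%Y-%m-%d/%H")
            for h in range(24)]


def submit_all(base, day, script="filename.py"):
    """Run one spark-submit per hourly folder of the given day."""
    for folder in hourly_folders(base, day):
        # Each run converts one hour's worth of Avro files to Parquet.
        subprocess.check_call(["spark-submit", script, folder])
```

You would schedule something like this from cron, passing yesterday’s date (or
the last closed hour) so only completed partitions are compacted.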
That said, a few remarks:
- this is a rather complicated flow for something so simple; Kite-dataset would
work better;
- however, if you need more complicated processing, you have all the options to
do so;
- as Parquet is a columnar storage format, many small files defeat its purpose.
So when you’re merging, make sure you have enough data (from roughly 50 MB up
to several tens of GB) in the final file;
- The above code is trivially portable to Scala if you prefer, as I’m using
Python as a mere DSL on top of Spark (no serializations outside the JVM).
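To act on the file-size remark above, here is a small pure-Python helper for a
local staging directory (for HDFS you would use the equivalent filesystem
listing instead; the threshold and paths are illustrative):

```python
# Decide whether a staging folder holds enough data to be worth compacting
# into Parquet. Default threshold mirrors the ~50 MB guideline above.
import os


def folder_bytes(path):
    """Total size in bytes of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


def ready_to_compact(path, threshold=50 * 1024 * 1024):
    """True once the folder has accumulated at least `threshold` bytes."""
    return folder_bytes(path) >= threshold
```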
Cheers,
Giovanni
From: Carlos Paradis [mailto:[email protected]]
Sent: Tuesday, February 14, 2017 11:12 PM
To: [email protected]
Subject: Re: Integration between Apache NiFi and Parquet or Workaround?
Hi James,
Thank you for pointing the issue out! :-) I wanted to mention another
alternative to Kite that I came across, to hear if you have any insight on this
approach too, if you don't mind.
I saw a presentation of NiFi and Parquet being used in a guest project;
although not many implementation details were discussed, it was mentioned that
they also used Apache Spark, apparently just leaving a port from NiFi for
Spark to read in the data. Someone at Hortonworks posted a tutorial on
it<https://community.hortonworks.com/articles/12708/nifi-feeding-data-to-spark-streaming.html>
(github<https://github.com/hortonworks-meetup-content/intro-to-apache-spark-streaming-with-apache-nifi-and-apache-kafka>)
in Jan 2016 that seems to head in that direction.
The configuration looked as follows according to the tutorial's image:
https://community.hortonworks.com/storage/attachments/1669-screen-shot-2016-01-31-at-21029-pm.png
The group presentation also used Spark, but I am not sure if they used the same
port approach; this is all I have:
PackageToParquetRunner <-> getFilePaths() <-> datalake [RDD<String, String>]
PackageToParquetRunner -> FileProcessorClass -> RDD filter -> RDD flatMap ->
RDD map -> RDD<Row> -> PackageToParquetRunner -> create DataFrame (SQLContext)
-> write Parquet (DataFrame).
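For intuition, the filter -> flatMap -> map chain in that pipeline can be
mimicked on plain Python lists; the (path, contents) record format and the
parsing rules here are invented for illustration, not taken from the
presentation:

```python
# Plain-Python analogue of the described RDD pipeline:
# filter -> flatMap -> map -> Row-like dicts ready for a DataFrame.
def to_rows(files):
    """files: list of (path, contents) pairs, standing in for the RDD."""
    kept = [(p, body) for p, body in files if body.strip()]            # filter: drop empty files
    lines = [(p, ln) for p, body in kept for ln in body.splitlines()]  # flatMap: one record per line
    return [{"path": p, "line": ln} for p, ln in lines]                # map: build Row-like dicts
```

In Spark the same shape would be rdd.filter(...).flatMap(...).map(...),
followed by spark.createDataFrame(rows) and a .write.parquet(...) call.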
When you say,
then running periodic jobs to build Parquet data sets.
Would such a Spark setup count as those periodic jobs? I am only minimally
acquainted with how Spark handles MapReduce-style processing using RDDs, but I
am not certain to what extent it would support the NiFi pipeline for this
purpose (not to mention that, the way it appears, it leaves a hole in the NiFi
diagram as a port, which makes the data impossible to monitor for provenance).
---
Do you think these details and Kite details would be worth mentioning as a
comment on the JIRA issue you pointed out?
Thanks!
On Tue, Feb 14, 2017 at 11:46 AM, James Wing
<[email protected]<mailto:[email protected]>> wrote:
Carlos,
Welcome to NiFi! I believe the Kite dataset is currently the most direct,
built-in solution for writing Parquet files from NiFi.
I'm not an expert on Parquet, but I understand columnar formats like Parquet
and ORC are not easily written to in the incremental, streaming fashion that
NiFi excels at (I hope writing this will prompt expert correction). Other
alternatives typically involve NiFi writing to more stream-friendly data stores
or formats directly, then running periodic jobs to build Parquet data sets.
Hive, Drill, and similar tools can do this.
You are certainly not alone in wanting better Parquet support; there is at
least one JIRA ticket for it as well:
Add processors for Google Cloud Storage Fetch/Put/Delete
https://issues.apache.org/jira/browse/NIFI-2725
You might want to chime in with some details of your use case, or create a new
ticket if that's not a fit for you.
Thanks,
James
On Mon, Feb 13, 2017 at 3:13 PM, Carlos Paradis
<[email protected]<mailto:[email protected]>> wrote:
Hi,
Our group has recently started trying to prototype a setup of
Hadoop+Spark+NiFi+Parquet, and I have been having trouble finding any
documentation other than sparse discussion of using Kite as a workaround to
integrate NiFi and Parquet.
Are there any future plans for this integration in NiFi? Or could anyone give
me some insight into the scenarios in which this workaround would (or would
not) be worthwhile, and into alternatives?
The most recent
discussion<http://apache-nifi-developer-list.39713.n7.nabble.com/parquet-format-td10145.html>
I found in this list dates from May 11, 2016. I also saw some interest in
doing this on Stackoverflow
here<http://stackoverflow.com/questions/37149331/apache-nifi-hdfs-parquet-format>,
and
here<http://stackoverflow.com/questions/37165764/convert-incoming-message-to-parquet-format>.
Thanks,
--
Carlos Paradis
http://carlosparadis.com<http://carlosandrade.co>