Hi Carlos,
I’m just chiming in, but if I weren’t using Kite (disclaimer: I would in this
case), the workflow would look like this:
- do stuff with NiFi
- convert flowfiles to Avro
- (optional: merge Avro files)
- PutHDFS into a temp folder
- periodically run Spark on that temp folder to convert to Parquet.
I believe you can work out the first four points by yourself. The last point
would just be a Python file that looks like this:
from pyspark.sql import SparkSession

# Build (or reuse) a Spark session.
spark = (SparkSession.builder
         .appName("Python Spark SQL basic example")
         .config("spark.some.config.option", "some-value")
         .getOrCreate())

# Read the staged Avro files (requires the spark-avro package on the
# classpath) and append them to a Parquet dataset.
(spark.read.format('com.databricks.spark.avro').load('/tmp/path/dataset.avro')
 .write.format('parquet')
 .mode('append')
 .save('/path/to/outfile'))
You can then periodically invoke this file with spark-submit filename.py
(adding the spark-avro package via --packages if it is not already on the
classpath). For optimal usage, I’d explore having the temporary path
partitioned by hour (or day) and then invoking the above script once per
temporary folder.
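A minimal sketch of that per-folder driver (the folder layout is an assumption,
and it presumes the script above is adapted to take its input folder as a
command-line argument):

```python
# Sketch: invoke the Avro->Parquet job once per hourly temp folder.
# The /base/YYYY-MM-DD/HH layout and the script name are illustrative.
import subprocess
from datetime import datetime, timedelta


def hourly_folders(base, day):
    """Return the 24 hourly partition paths under base for a given date."""
    start = datetime(day.year, day.month, day.day)
    return [(start + timedelta(hours=h)).strftime(base + "/%Y-%m-%d/%H")
            for h in range(24)]


def submit_all(base, day, script="filename.py"):
    """Run one spark-submit per hourly folder of the given day."""
    for folder in hourly_folders(base, day):
        # Each run converts one hour's worth of Avro files to Parquet.
        subprocess.check_call(["spark-submit", script, folder])
```

You would schedule something like this from cron, passing yesterday’s date (or
the last closed hour) so only completed partitions are compacted.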
That said, a few remarks:
- this is a rather complicated flow for something so simple; Kite-dataset would
work better;
- however, if you need more complicated processing, you have all the options to
do so;
- as Parquet is a columnar storage format, many small files defeat its purpose.
So when you’re merging, make sure you have enough data (from roughly 50 MB up
to several tens of GB) in the final file;
- The above code is trivially portable to Scala if you prefer, as I’m using
Python as a mere DSL on top of Spark (no serializations outside the JVM).
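To act on the file-size remark above, here is a small pure-Python helper for a
local staging directory (for HDFS you would use the equivalent filesystem
listing instead; the threshold and paths are illustrative):

```python
# Decide whether a staging folder holds enough data to be worth compacting
# into Parquet. Default threshold mirrors the ~50 MB guideline above.
import os


def folder_bytes(path):
    """Total size in bytes of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


def ready_to_compact(path, threshold=50 * 1024 * 1024):
    """True once the folder has accumulated at least `threshold` bytes."""
    return folder_bytes(path) >= threshold
```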
Cheers,
Giovanni
From: Carlos Paradis [mailto:[email protected]]
Sent: Tuesday, February 14, 2017 11:12 PM
To: [email protected]
Subject: Re: Integration between Apache NiFi and Parquet or Workaround?
Hi James,
Thank you for pointing the issue out! :-) I wanted to mention another
alternative to Kite that I came across, to hear if you have any insight on this
approach too, if you don't mind.
I saw a presentation of NiFi and Parquet being used in a guest project;
although not many implementation details were discussed, it was mentioned that
they also used Apache Spark, apparently just leaving a port from NiFi for
Spark to read in the data. Someone at Hortonworks posted a tutorial on
it<https://community.hortonworks.com/articles/12708/nifi-feeding-data-to-spark-streaming.html>
(github<https://github.com/hortonworks-meetup-content/intro-to-apache-spark-streaming-with-apache-nifi-and-apache-kafka>)
in Jan 2016 that seems to head in that direction.
The configuration looked as follows according to the tutorial's image:
https://community.hortonworks.com/storage/attachments/1669-screen-shot-2016-01-31-at-21029-pm.png
The group presentation also used Spark, but I am not sure if they used the same
port approach; this is all I have:
PackageToParquetRunner <-> getFilePaths() <-> datalake [RDD<String, String>]
PackageToParquetRunner -> FileProcessorClass -> RDD filter -> RDD flatMap ->
RDD map -> RDD<Row> -> PackageToParquetRunner -> create DataFrame (SQLContext)
-> write Parquet (DataFrame).
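For intuition, the filter -> flatMap -> map chain in that pipeline can be
mimicked on plain Python lists; the (path, contents) record format and the
parsing rules here are invented for illustration, not taken from the
presentation:

```python
# Plain-Python analogue of the described RDD pipeline:
# filter -> flatMap -> map -> Row-like dicts ready for a DataFrame.
def to_rows(files):
    """files: list of (path, contents) pairs, standing in for the RDD."""
    kept = [(p, body) for p, body in files if body.strip()]            # filter: drop empty files
    lines = [(p, ln) for p, body in kept for ln in body.splitlines()]  # flatMap: one record per line
    return [{"path": p, "line": ln} for p, ln in lines]                # map: build Row-like dicts
```

In Spark the same shape would be rdd.filter(...).flatMap(...).map(...),
followed by spark.createDataFrame(rows) and a .write.parquet(...) call.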
When you say,
then running periodic jobs to build Parquet data sets.
Would such a Spark setup count as those periodic jobs? I am only minimally
acquainted with how Spark handles MapReduce-style processing using RDDs, but I
am not certain to what extent it would support the NiFi pipeline for this
purpose (not to mention that, the way it appears, it leaves a hole in the NiFi
diagram as a port, which makes the data impossible to monitor for provenance).
---
Do you think these details and Kite details would be worth mentioning as a
comment on the JIRA issue you pointed out?
Thanks!
On Tue, Feb 14, 2017 at 11:46 AM, James Wing
<[email protected]<mailto:[email protected]>> wrote:
Carlos,
Welcome to NiFi! I believe the Kite dataset is currently the most direct,
built-in solution for writing Parquet files from NiFi.
I'm not an expert on Parquet, but I understand columnar formats like Parquet
and ORC are not easily written to in the incremental, streaming fashion that
NiFi excels at (I hope writing this will prompt expert correction). Other
alternatives typically involve NiFi writing to more stream-friendly data stores
or formats directly, then running periodic jobs to build Parquet data sets.
Hive, Drill, and similar tools can do this.
You are certainly not alone in wanting better Parquet support; there is at
least one JIRA ticket for it as well:
Add processors for Google Cloud Storage Fetch/Put/Delete
https://issues.apache.org/jira/browse/NIFI-2725
You might want to chime in with some details of your use case, or create a new
ticket if that's not a fit for you.
Thanks,
James
On Mon, Feb 13, 2017 at 3:13 PM, Carlos Paradis
<[email protected]<mailto:[email protected]>> wrote:
Hi,
Our group has recently started trying to prototype a setup of
Hadoop+Spark+NiFi+Parquet, and I have been having trouble finding any
documentation other than sparse discussion of using Kite as a workaround to
integrate NiFi and Parquet.
Are there any future plans for this integration in NiFi? Or could anyone give
me some insight into the scenarios in which this workaround would (or would
not) be worthwhile, and into alternatives?
The most recent
discussion<http://apache-nifi-developer-list.39713.n7.nabble.com/parquet-format-td10145.html>
I found in this list dates from May 11, 2016. I also saw some interest in
doing this on Stackoverflow
here<http://stackoverflow.com/questions/37149331/apache-nifi-hdfs-parquet-format>,
and
here<http://stackoverflow.com/questions/37165764/convert-incoming-message-to-parquet-format>.
Thanks,
--
Carlos Paradis
http://carlosparadis.com<http://carlosandrade.co>