Thank you both, Bryan and Giovanni, for giving me so much insight on this matter.
I see why you would strongly prefer Kite over this, now that I have landed on a tutorial on the kite-dataset CLI <http://blog.cloudera.com/blog/2014/12/how-to-ingest-data-quickly-using-the-kite-cli/> and its documentation page <http://kitesdk.org/docs/1.1.0> (thanks for pointing the name out). I also noticed that NIFI-238 <https://issues.apache.org/jira/browse/NIFI-238?focusedCommentId=14350688&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14350688> (pull request <https://github.com/apache/nifi/pull/24#discussion_r24779170>) incorporated Kite into NiFi back in 2015, and NIFI-1193 <https://issues.apache.org/jira/browse/NIFI-1193> extended it to Hive in 2016, making three processors available. I am confused, though, because they are no longer in the documentation <https://nifi.apache.org/docs/nifi-docs/>; I only see StoreInKiteDataset <https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.kite.StoreInKiteDataset/index.html>, which appears to be a newer version of what was called 'KiteStorageProcessor' on GitHub, but I don't see the other two.

My original goal was to have one HDFS storage dedicated to raw data alone, and a second HDFS dedicated to storing pre-processed data and analysis. If I were to do this with Kite and NiFi, the way I currently see it being done is:

---------

*Raw Data HDFS:*

- Apache NiFi
  - A set of GetFile and GetHTTP processors to acquire the data from the multiple sources we have.
  - A PutHDFS processor to store the raw data in HDFS.

*Pre-Processed & Analysis HDFS:*

- Apache NiFi
  - A set of GetHDFS processors to get data from the *Raw Data HDFS*.
  - A set of ExecuteScript processors to convert XML files to JSON or CSV.
  - A set of ConvertCSVToAvro and ConvertJSONToAvro processors, since the Kite processor requires Avro input.
  - A StoreInKiteDataset processor to store all data in either Avro or Parquet format.
- Apache Spark
  - Batch jobs to pre-process the data into analysis sets to be exported elsewhere (dashboard, machine learning, etc.).
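For the ExecuteScript step above, the kind of conversion logic I have in mind is roughly this (a stdlib-only sketch; the element names are made up for illustration, and a real NiFi script would read from the flowfile content rather than a string):

```python
import json
import xml.etree.ElementTree as ET

def xml_to_json(xml_text):
    """Flatten one level of child elements into a JSON object."""
    root = ET.fromstring(xml_text)
    record = {child.tag: child.text for child in root}
    return json.dumps(record)

# Hypothetical sample record, just to show the shape of the output.
sample = "<reading><sensor>s1</sensor><value>42</value></reading>"
print(xml_to_json(sample))  # {"sensor": "s1", "value": "42"}
```

Inside NiFi this would presumably run via one of the ExecuteScript engines (e.g. Jython), wrapped in the usual session/flowfile plumbing.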
---------

However, a few things I am still confused about: (1) Is this the best way to go about storing the raw data? (2) Would ExecuteScript allow for MapReduce, or would it become a bottleneck? (3) I originally considered using the Spark Streaming module for mini-batches, integrated with NiFi, to at least pre-process the data as it arrives, but I am a bit unclear on how to go about this now. Would creating a port through NiFi be the way to go? (That is the only way I saw it being done in tutorials.)

Thank you,

On Wed, Feb 15, 2017 at 7:02 AM, Giovanni Lanzani <[email protected]> wrote:

> Hi Carlos,
>
> I'm just chiming in, but if I wouldn't use Kite (disclaimer: I would in
> this case), the workflow would look like this:
>
> - do stuff with NiFi
> - convert flowfiles to Avro
> - (optional: merge Avro files)
> - PutHDFS into a temp folder
> - periodically run Spark on that temp folder to convert to Parquet.
>
> I believe you can work out the first four points by yourself. The last
> point would just be a Python file that looks like this:
>
>     from pyspark.sql import SparkSession
>
>     spark = (SparkSession.builder
>              .appName("Python Spark SQL basic example")
>              .config("spark.some.config.option", "some-value")
>              .getOrCreate())
>
>     (spark.read.format('com.databricks.spark.avro')
>      .load('/tmp/path/dataset.avro')
>      .write.format('parquet')
>      .mode('append')
>      .save('/path/to/outfile'))
>
> You can then periodically invoke this file with spark-submit filename.py.
>
> For optimal usage, I'd explore the option of having the temporary path
> folder partitioned by hour (or day) and then invoking the above script
> once per temporary folder.
>
> That said, a few remarks:
>
> - this is a rather complicated flow for something so simple.
> Kite-dataset would work better;
> - however, if you need more complicated processing, you have all the
> options to do so;
> - as Parquet is columnar storage, having little files is useless. So when
> you're merging them, make sure you have enough data (>~ 50 MB, and up to
> several tens of GBs) in the final file;
> - the above code is trivially portable to Scala if you prefer, as I'm
> using Python as a mere DSL on top of Spark (no serializations outside the
> JVM).
>
> Cheers,
>
> Giovanni
>
> *From:* Carlos Paradis [mailto:[email protected]]
> *Sent:* Tuesday, February 14, 2017 11:12 PM
> *To:* [email protected]
> *Subject:* Re: Integration between Apache NiFi and Parquet or Workaround?
>
> Hi James,
>
> Thank you for pointing the issue out! :-) I wanted to point out another
> alternative solution to Kite I observed, to hear whether you have any
> insight on this approach too, if you don't mind.
>
> When I saw a presentation of NiFi and Parquet being used in a guest
> project, although not many implementation details were discussed, it was
> mentioned that they also used Apache Spark, (apparently) only leaving a
> port from NiFi to read in the data. Someone at Hortonworks posted a
> tutorial on it
> <https://community.hortonworks.com/articles/12708/nifi-feeding-data-to-spark-streaming.html>
> (github
> <https://github.com/hortonworks-meetup-content/intro-to-apache-spark-streaming-with-apache-nifi-and-apache-kafka>)
> in Jan 2016 that seems to head in that direction.
> The configuration looked as follows, according to the tutorial's image:
>
> https://community.hortonworks.com/storage/attachments/1669-screen-shot-2016-01-31-at-21029-pm.png
>
> The group presentation also used Spark, but I am not sure whether they
> used the same port approach; this is all I have:
>
> *PackageToParquetRunner* <-> getFilePaths() <-> datalake [RDD<String, String>]
>
> *PackageToParquetRunner* -> FileProcessorClass -> RDD Filter ->
> RDD flatMap -> RDD Map -> RDD<Row> -> *PackageToParquetRunner* ->
> Create Data Frame (SQL Context) -> Write Parquet (DataFrame).
>
> When you say,
>
> "then running periodic jobs to build Parquet data sets,"
>
> would such a Spark setup count as periodic jobs? I am only minimally
> acquainted with how Spark goes about MapReduce using RDDs, and I am not
> certain to what extent it would support the NiFi pipeline for this purpose
> (not to mention that, the way it appears, it leaves a hole in the NiFi
> diagram as a port, which makes it impossible to monitor for data
> provenance).
>
> ---
>
> Do you think these details and the Kite details would be worth mentioning
> as a comment on the JIRA issue you pointed out?
>
> Thanks!
>
> On Tue, Feb 14, 2017 at 11:46 AM, James Wing <[email protected]> wrote:
>
> Carlos,
>
> Welcome to NiFi! I believe the Kite dataset is currently the most direct,
> built-in solution for writing Parquet files from NiFi.
>
> I'm not an expert on Parquet, but I understand columnar formats like
> Parquet and ORC are not easily written to in the incremental, streaming
> fashion that NiFi excels at (I hope writing this will prompt expert
> correction). Other alternatives typically involve NiFi writing to more
> stream-friendly data stores or formats directly, then running periodic
> jobs to build Parquet data sets. Hive, Drill, and similar tools can do
> this.
> You are certainly not alone in wanting better Parquet support; there is
> at least one JIRA ticket for it as well:
>
> Add processors for Google Cloud Storage Fetch/Put/Delete
> https://issues.apache.org/jira/browse/NIFI-2725
>
> You might want to chime in with some details of your use case, or create
> a new ticket if that's not a fit for you.
>
> Thanks,
>
> James
>
> On Mon, Feb 13, 2017 at 3:13 PM, Carlos Paradis <[email protected]> wrote:
>
> Hi,
>
> Our group has recently started trying to prototype a setup of
> Hadoop + Spark + NiFi + Parquet, and I have been having trouble finding
> any documentation other than scarce discussion on using Kite as a
> workaround to integrate NiFi and Parquet.
>
> Are there any future plans for this integration from NiFi, or would
> anyone be able to give me some insight into the scenarios in which this
> workaround would (not) be worthwhile, and alternatives?
>
> The most recent discussion
> <http://apache-nifi-developer-list.39713.n7.nabble.com/parquet-format-td10145.html>
> I found on this list dates from May 11, 2016. I also saw some interest in
> doing this on Stack Overflow here
> <http://stackoverflow.com/questions/37149331/apache-nifi-hdfs-parquet-format>
> and here
> <http://stackoverflow.com/questions/37165764/convert-incoming-message-to-parquet-format>.
>
> Thanks,
>
> --
> Carlos Paradis
> http://carlosparadis.com <http://carlosandrade.co>
>
> --
> Carlos Paradis
> http://carlosparadis.com <http://carlosandrade.co>

--
Carlos Paradis
http://carlosparadis.com <http://carlosandrade.co>
