Thanks Bryan. I'll give the Hadoop client a try.
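In case it helps anyone who finds this thread later, here is roughly what I
plan to try. The "null\bin\winutils.exe" error suggests the Hadoop client
resolves winutils.exe under %HADOOP_HOME%\bin (or the hadoop.home.dir JVM
property), so the idea is to place the binary there and point NiFi's JVM at
it in conf/bootstrap.conf. The C:\hadoop path and the argument number below
are placeholders for my setup, not something NiFi ships with:

    # winutils.exe placed at C:\hadoop\bin\winutils.exe (path is my choice)
    # add to conf/bootstrap.conf, using an unused java.arg number:
    java.arg.20=-Dhadoop.home.dir=C:\hadoop

Bryan's core-site.xml should then get picked up by pointing the processor's
Hadoop Configuration Resources property at it, if I'm reading the processor
docs correctly.
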
On Wed, Aug 15, 2018 at 7:51 AM, Bryan Bende <[email protected]> wrote:
> I think there is a good chance that installing the Hadoop client would
> solve the issue, but I can't say for sure since I don't have a Windows
> machine to test.
>
> The processor depends on the Apache Parquet Java client library, which
> depends on the Apache Hadoop client [1], and the Hadoop client has a
> limitation on Windows where it requires something additional.
>
> [1] https://github.com/apache/parquet-mr/blob/master/parquet-avro/pom.xml#L62-L65
>
>
> On Wed, Aug 15, 2018 at 10:16 AM, scott <[email protected]> wrote:
> > If I install a Hadoop client on my NiFi host, would I be able to get past
> > this error?
> > I don't understand why this processor depends on Hadoop. Other projects
> > like Drill and Spark don't have such a dependency to be able to write
> > Parquet files.
> >
> > On Tue, Aug 14, 2018 at 2:58 PM, Juan Pablo Gardella
> > <[email protected]> wrote:
> >>
> >> It's a warning. You can ignore that.
> >>
> >> On Tue, 14 Aug 2018 at 18:53 Bryan Bende <[email protected]> wrote:
> >>>
> >>> Scott,
> >>>
> >>> Sorry, I did not realize the Hadoop client would be looking for this
> >>> winutils.exe when running on Windows.
> >>>
> >>> On Linux and macOS you don't need anything external installed outside
> >>> of NiFi, so I wasn't expecting this.
> >>>
> >>> Not sure if there is any other good option here regarding Parquet.
> >>>
> >>> Thanks,
> >>>
> >>> Bryan
> >>>
> >>>
> >>> On Tue, Aug 14, 2018 at 5:31 PM, scott <[email protected]> wrote:
> >>> > Hi Bryan,
> >>> > I'm fine if I have to trick the API, but don't I still need Hadoop
> >>> > installed somewhere? After creating the core-site.xml as you
> >>> > described, I get the following errors:
> >>> >
> >>> > Failed to locate the winutils binary in the hadoop binary path
> >>> > IOException: Could not locate executable null\bin\winutils.exe in
> >>> > the Hadoop binaries
> >>> > Unable to load native-hadoop library for your platform... using
> >>> > builtin-java classes where applicable
> >>> > Failed to write due to java.io.IOException: No FileSystem for scheme
> >>> >
> >>> > BTW, I'm using NiFi version 1.5
> >>> >
> >>> > Thanks,
> >>> > Scott
> >>> >
> >>> >
> >>> > On Tue, Aug 14, 2018 at 12:44 PM, Bryan Bende <[email protected]> wrote:
> >>> >>
> >>> >> Scott,
> >>> >>
> >>> >> Unfortunately, the Parquet API itself is tied to the Hadoop
> >>> >> FileSystem object, which is why NiFi can't read and write Parquet
> >>> >> directly to flow files (i.e. it doesn't provide a way to read/write
> >>> >> to/from Java input and output streams).
> >>> >>
> >>> >> The best you can do is trick the Hadoop API into using the local
> >>> >> file system by creating a core-site.xml with the following:
> >>> >>
> >>> >> <configuration>
> >>> >>     <property>
> >>> >>         <name>fs.defaultFS</name>
> >>> >>         <value>file:///</value>
> >>> >>     </property>
> >>> >> </configuration>
> >>> >>
> >>> >> That will make PutParquet or FetchParquet work with your local
> >>> >> file system.
> >>> >>
> >>> >> Thanks,
> >>> >>
> >>> >> Bryan
> >>> >>
> >>> >>
> >>> >> On Tue, Aug 14, 2018 at 3:22 PM, scott <[email protected]> wrote:
> >>> >> > Hello NiFi community,
> >>> >> > Is there a simple way to read CSV files and write them out as
> >>> >> > Parquet files without Hadoop? I run NiFi on Windows and don't have
> >>> >> > access to a Hadoop environment. I'm trying to write the output of
> >>> >> > my ETL in a compressed and still queryable format.
> >>> >> > Is there something I should be using instead of Parquet?
> >>> >> >
> >>> >> > Thanks for your time,
> >>> >> > Scott
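
P.S. For the archives, my rough understanding of why the Hadoop dependency
is unavoidable: the parquet-avro writer the processor builds on (see the pom
Bryan linked) takes a Hadoop Path rather than a plain Java OutputStream,
which is exactly the coupling Bryan describes. A minimal sketch of writing
one record to local disk; the schema, field values, and output path are
made up for illustration:

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetWriter;
    import org.apache.parquet.hadoop.ParquetWriter;

    public class LocalParquetSketch {
        public static void main(String[] args) throws Exception {
            // Made-up two-field schema, just for illustration.
            Schema schema = new Schema.Parser().parse(
                    "{\"type\":\"record\",\"name\":\"Row\",\"fields\":["
                    + "{\"name\":\"id\",\"type\":\"long\"},"
                    + "{\"name\":\"name\",\"type\":\"string\"}]}");

            // The builder takes org.apache.hadoop.fs.Path, not an
            // OutputStream. With an explicit file:/// URI (or with
            // fs.defaultFS=file:///) this writes to the local disk.
            try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
                    .<GenericRecord>builder(new Path("file:///C:/tmp/rows.parquet"))
                    .withSchema(schema)
                    .build()) {
                GenericRecord rec = new GenericData.Record(schema);
                rec.put("id", 1L);
                rec.put("name", "example");
                writer.write(rec);
            }
        }
    }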

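P.P.S. If that sketch runs from a plain java command with just the Parquet
and Hadoop client jars on the classpath (no cluster, no running services),
it would confirm Bryan's point that the dependency is only on the client
library, plus winutils.exe when on Windows.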