Scott,

You can also try Docker on Windows. Something like this should work:

docker run -d --name nifi-test -v C:/nifi_temp:/opt/data_output -p 8080:8080 apache/nifi:latest

I don't have Windows either, but Docker seems to work fine for my colleagues who have to use it on Windows. That should bind-mount C:\nifi_temp to /opt/data_output between host and container and publish the container's port 8080 on localhost:8080, so you don't have to mess with a Hadoop client just to try out some Parquet stuff.
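If you want a quick sanity check that the mapping works, create C:\nifi_temp before you start the container, then try something like this (a sketch I haven't tested myself, and it assumes Docker Desktop has been allowed to share the C: drive; nifi-test is just the container name from the command above):

echo hello > C:\nifi_temp\test.txt
docker exec nifi-test ls /opt/data_output
docker exec nifi-test cat /opt/data_output/test.txt

If test.txt shows up inside the container, anything NiFi writes under /opt/data_output should land in C:\nifi_temp on the host.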
Mike
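P.S. If you'd rather run NiFi natively on Windows, the usual workaround for that winutils error (again untested on my end) is to grab a winutils.exe built for your Hadoop client version, for example from the steveloughran/winutils repo on GitHub, drop it into a folder like C:\hadoop\bin, and set HADOOP_HOME to the parent folder before starting NiFi:

setx HADOOP_HOME C:\hadoop

The C:\hadoop path is just an example; any folder works as long as winutils.exe ends up in its bin subfolder. Also, the "No FileSystem for scheme" error further down the thread often just means the processor never loaded Bryan's core-site.xml, so double-check that the Hadoop Configuration Resources property on PutParquet points at the full path of that file.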
On Wed, Aug 15, 2018 at 11:20 AM scott <[email protected]> wrote:

> Thanks Bryan. I'll give the Hadoop client a try.
>
> On Wed, Aug 15, 2018 at 7:51 AM, Bryan Bende <[email protected]> wrote:
>
>> I think there is a good chance that installing the Hadoop client would solve the issue, but I can't say for sure since I don't have a Windows machine to test.
>>
>> The processor depends on the Apache Parquet Java client library, which depends on the Apache Hadoop client [1], and the Hadoop client has a limitation on Windows where it requires something additional.
>>
>> [1] https://github.com/apache/parquet-mr/blob/master/parquet-avro/pom.xml#L62-L65
>>
>> On Wed, Aug 15, 2018 at 10:16 AM, scott <[email protected]> wrote:
>>> If I install a Hadoop client on my NiFi host, would I be able to get past this error?
>>> I don't understand why this processor depends on Hadoop. Other projects like Drill and Spark don't have such a dependency to be able to write Parquet files.
>>>
>>> On Tue, Aug 14, 2018 at 2:58 PM, Juan Pablo Gardella <[email protected]> wrote:
>>>>
>>>> It's a warning. You can ignore that.
>>>>
>>>> On Tue, 14 Aug 2018 at 18:53 Bryan Bende <[email protected]> wrote:
>>>>>
>>>>> Scott,
>>>>>
>>>>> Sorry, I did not realize the Hadoop client would be looking for this winutils.exe when running on Windows.
>>>>>
>>>>> On Linux and macOS you don't need anything external installed outside of NiFi, so I wasn't expecting this.
>>>>>
>>>>> Not sure if there is any other good option here regarding Parquet.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Bryan
>>>>>
>>>>> On Tue, Aug 14, 2018 at 5:31 PM, scott <[email protected]> wrote:
>>>>>> Hi Bryan,
>>>>>> I'm fine if I have to trick the API, but don't I still need Hadoop installed somewhere? After creating the core-site.xml as you described, I get the following errors:
>>>>>>
>>>>>> Failed to locate the winutils binary in the hadoop binary path
>>>>>> IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries
>>>>>> Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>>>>>> Failed to write due to java.io.IOException: No FileSystem for scheme
>>>>>>
>>>>>> BTW, I'm using NiFi version 1.5
>>>>>>
>>>>>> Thanks,
>>>>>> Scott
>>>>>>
>>>>>> On Tue, Aug 14, 2018 at 12:44 PM, Bryan Bende <[email protected]> wrote:
>>>>>>>
>>>>>>> Scott,
>>>>>>>
>>>>>>> Unfortunately the Parquet API itself is tied to the Hadoop Filesystem object, which is why NiFi can't read and write Parquet directly to flow files (i.e. they don't provide a way to read/write to/from Java input and output streams).
>>>>>>>
>>>>>>> The best you can do is trick the Hadoop API into using the local file system by creating a core-site.xml with the following:
>>>>>>>
>>>>>>> <configuration>
>>>>>>>   <property>
>>>>>>>     <name>fs.defaultFS</name>
>>>>>>>     <value>file:///</value>
>>>>>>>   </property>
>>>>>>> </configuration>
>>>>>>>
>>>>>>> That will make PutParquet or FetchParquet work with your local file system.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Bryan
>>>>>>>
>>>>>>> On Tue, Aug 14, 2018 at 3:22 PM, scott <[email protected]> wrote:
>>>>>>>> Hello NiFi community,
>>>>>>>> Is there a simple way to read CSV files and write them out as Parquet files without Hadoop? I run NiFi on Windows and don't have access to a Hadoop environment. I'm trying to write the output of my ETL in a compressed and still queryable format. Is there something I should be using instead of Parquet?
>>>>>>>>
>>>>>>>> Thanks for your time,
>>>>>>>> Scott
