Mike, that's a good tip. I'll test that, but unfortunately, I've already committed to Windows. What about a script? Is there some tool you know of that can just be called by NiFi to convert an input CSV file to a Parquet file?
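For example, I was picturing a small Python script that NiFi could call through an ExecuteStreamCommand processor, passing the input and output paths as arguments. Just a rough sketch of the idea, assuming pyarrow is available on the NiFi host (the script name and paths below are placeholders, not anything real):

    # csv_to_parquet.py -- hypothetical helper, assumes pyarrow is
    # installed (pip install pyarrow)
    # usage: python csv_to_parquet.py input.csv output.parquet
    import sys

    import pyarrow.csv as pv
    import pyarrow.parquet as pq

    # Read the whole CSV into an Arrow table, inferring column types.
    table = pv.read_csv(sys.argv[1])

    # Write it back out as Snappy-compressed Parquet.
    pq.write_table(table, sys.argv[2], compression="snappy")

Is that roughly the right approach, or is there an existing tool people use for this?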
On Wed, Aug 15, 2018 at 8:32 AM, Mike Thomsen <[email protected]> wrote:

> Scott,
>
> You can also try Docker on Windows. Something like this should work:
>
> docker run -d --name nifi-test -v C:/nifi_temp:/opt/data_output -p 8080:8080 apache/nifi:latest
>
> I don't have Windows either, but Docker seems to work fine for my colleagues who have to use it on Windows. That should bridge C:\nifi_temp and /opt/data_output between the host and the container, and map localhost:8080 on the host to port 8080 in the container, so you don't have to mess with a Hadoop client just to try out some Parquet stuff.
>
> Mike
>
> On Wed, Aug 15, 2018 at 11:20 AM scott <[email protected]> wrote:
>
>> Thanks Bryan. I'll give the Hadoop client a try.
>>
>> On Wed, Aug 15, 2018 at 7:51 AM, Bryan Bende <[email protected]> wrote:
>>
>>> I think there is a good chance that installing the Hadoop client would solve the issue, but I can't say for sure since I don't have a Windows machine to test.
>>>
>>> The processor depends on the Apache Parquet Java client library, which depends on the Apache Hadoop client [1], and the Hadoop client has a limitation on Windows where it requires something additional.
>>>
>>> [1] https://github.com/apache/parquet-mr/blob/master/parquet-avro/pom.xml#L62-L65
>>>
>>> On Wed, Aug 15, 2018 at 10:16 AM, scott <[email protected]> wrote:
>>> > If I install a Hadoop client on my NiFi host, would I be able to get past this error?
>>> > I don't understand why this processor depends on Hadoop. Other projects like Drill and Spark don't have such a dependency just to be able to write Parquet files.
>>> >
>>> > On Tue, Aug 14, 2018 at 2:58 PM, Juan Pablo Gardella <[email protected]> wrote:
>>> >>
>>> >> It's a warning. You can ignore that.
>>> >>
>>> >> On Tue, 14 Aug 2018 at 18:53 Bryan Bende <[email protected]> wrote:
>>> >>>
>>> >>> Scott,
>>> >>>
>>> >>> Sorry, I did not realize the Hadoop client would be looking for this winutils.exe when running on Windows.
>>> >>>
>>> >>> On Linux and macOS you don't need anything external installed outside of NiFi, so I wasn't expecting this.
>>> >>>
>>> >>> Not sure if there is any other good option here regarding Parquet.
>>> >>>
>>> >>> Thanks,
>>> >>>
>>> >>> Bryan
>>> >>>
>>> >>> On Tue, Aug 14, 2018 at 5:31 PM, scott <[email protected]> wrote:
>>> >>> > Hi Bryan,
>>> >>> > I'm fine if I have to trick the API, but don't I still need Hadoop installed somewhere? After creating the core-site.xml as you described, I get the following errors:
>>> >>> >
>>> >>> > Failed to locate the winutils binary in the hadoop binary path
>>> >>> > IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries
>>> >>> > Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>>> >>> > Failed to write due to java.io.IOException: No FileSystem for scheme
>>> >>> >
>>> >>> > BTW, I'm using NiFi version 1.5
>>> >>> >
>>> >>> > Thanks,
>>> >>> > Scott
>>> >>> >
>>> >>> > On Tue, Aug 14, 2018 at 12:44 PM, Bryan Bende <[email protected]> wrote:
>>> >>> >>
>>> >>> >> Scott,
>>> >>> >>
>>> >>> >> Unfortunately the Parquet API itself is tied to the Hadoop Filesystem object, which is why NiFi can't read and write Parquet directly to flow files (i.e. it doesn't provide a way to read/write to/from Java input and output streams).
>>> >>> >>
>>> >>> >> The best you can do is trick the Hadoop API into using the local file-system by creating a core-site.xml with the following:
>>> >>> >>
>>> >>> >> <configuration>
>>> >>> >>     <property>
>>> >>> >>         <name>fs.defaultFS</name>
>>> >>> >>         <value>file:///</value>
>>> >>> >>     </property>
>>> >>> >> </configuration>
>>> >>> >>
>>> >>> >> That will make PutParquet or FetchParquet work with your local file-system.
>>> >>> >>
>>> >>> >> Thanks,
>>> >>> >>
>>> >>> >> Bryan
>>> >>> >>
>>> >>> >> On Tue, Aug 14, 2018 at 3:22 PM, scott <[email protected]> wrote:
>>> >>> >> > Hello NiFi community,
>>> >>> >> > Is there a simple way to read CSV files and write them out as Parquet files without Hadoop? I run NiFi on Windows and don't have access to a Hadoop environment. I'm trying to write the output of my ETL in a compressed and still queryable format. Is there something I should be using instead of Parquet?
>>> >>> >> >
>>> >>> >> > Thanks for your time,
>>> >>> >> > Scott
