Just tested in my CentOS VM, and it worked like a charm without Hadoop. I'll open a Jira bug on PutParquet, since it doesn't seem to run on Windows. I'm still not sure what I can do, though; converting our production Windows NiFi install to Docker would be a major effort. Has anyone heard of a Parquet writer tool I can download and call from NiFi?
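[One possible answer to that last question, sketched rather than tested: a tiny standalone converter built on the Python pyarrow library, which NiFi can call per flow file through ExecuteStreamCommand or ExecuteProcess. The script name and arguments below are made up for illustration, and it assumes pyarrow has been installed (pip install pyarrow) on the NiFi host.

    # csv_to_parquet.py -- minimal CSV-to-Parquet converter (hypothetical
    # helper script, not part of NiFi; assumes the pyarrow package is installed)
    import sys
    import pyarrow.csv as pv
    import pyarrow.parquet as pq

    # Read the whole CSV into an in-memory Arrow table, inferring column types.
    table = pv.read_csv(sys.argv[1])

    # Write the table back out as Parquet (Snappy-compressed by default).
    pq.write_table(table, sys.argv[2])

Invoked as, e.g., python csv_to_parquet.py input.csv output.parquet. Because it runs in a separate process, it sidesteps the Hadoop client entirely; the trade-off is that the whole file is held in memory during conversion.]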
On Wed, Aug 15, 2018 at 12:01 PM, Mike Thomsen <[email protected]> wrote:
>> Mike, that's a good tip. I'll test that, but unfortunately, I've already
>> committed to Windows.
>
> You can run both Docker and the standard NiFi docker image on Windows.
>
> On Wed, Aug 15, 2018 at 2:52 PM scott <[email protected]> wrote:
>> Mike, that's a good tip. I'll test that, but unfortunately, I've already
>> committed to Windows.
>> What about a script? Is there some tool you know of that can just be
>> called by NiFi to convert an input CSV file to a Parquet file?
>>
>> On Wed, Aug 15, 2018 at 8:32 AM, Mike Thomsen <[email protected]> wrote:
>>> Scott,
>>>
>>> You can also try Docker on Windows. Something like this should work:
>>>
>>> docker run -d --name nifi-test -v C:/nifi_temp:/opt/data_output -p 8080:8080 apache/nifi:latest
>>>
>>> I don't have Windows either, but Docker seems to work fine for my
>>> colleagues who have to use it on Windows. That should bridge C:\nifi_temp
>>> and /opt/data_output between host and container and map localhost:8080 to
>>> port 8080 in the container, so you don't have to mess with a Hadoop client
>>> just to try out some Parquet stuff.
>>>
>>> Mike
>>>
>>> On Wed, Aug 15, 2018 at 11:20 AM scott <[email protected]> wrote:
>>>> Thanks Bryan. I'll give the Hadoop client a try.
>>>>
>>>> On Wed, Aug 15, 2018 at 7:51 AM, Bryan Bende <[email protected]> wrote:
>>>>> I think there is a good chance that installing the Hadoop client
>>>>> would solve the issue, but I can't say for sure since I don't have
>>>>> a Windows machine to test.
>>>>>
>>>>> The processor depends on the Apache Parquet Java client library,
>>>>> which depends on the Apache Hadoop client [1], and the Hadoop client
>>>>> has a limitation on Windows where it requires something additional.
>>>>>
>>>>> [1] https://github.com/apache/parquet-mr/blob/master/parquet-avro/pom.xml#L62-L65
>>>>>
>>>>> On Wed, Aug 15, 2018 at 10:16 AM, scott <[email protected]> wrote:
>>>>>> If I install a Hadoop client on my NiFi host, would I be able to
>>>>>> get past this error?
>>>>>> I don't understand why this processor depends on Hadoop. Other
>>>>>> projects like Drill and Spark don't have such a dependency to be
>>>>>> able to write Parquet files.
>>>>>>
>>>>>> On Tue, Aug 14, 2018 at 2:58 PM, Juan Pablo Gardella <[email protected]> wrote:
>>>>>>> It's a warning. You can ignore that.
>>>>>>>
>>>>>>> On Tue, 14 Aug 2018 at 18:53 Bryan Bende <[email protected]> wrote:
>>>>>>>> Scott,
>>>>>>>>
>>>>>>>> Sorry, I did not realize the Hadoop client would be looking for
>>>>>>>> this winutils.exe when running on Windows.
>>>>>>>>
>>>>>>>> On Linux and macOS you don't need anything external installed
>>>>>>>> outside of NiFi, so I wasn't expecting this.
>>>>>>>>
>>>>>>>> Not sure if there is any other good option here regarding Parquet.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Bryan
>>>>>>>>
>>>>>>>> On Tue, Aug 14, 2018 at 5:31 PM, scott <[email protected]> wrote:
>>>>>>>>> Hi Bryan,
>>>>>>>>> I'm fine if I have to trick the API, but don't I still need
>>>>>>>>> Hadoop installed somewhere?
>>>>>>>>> After creating the core-site.xml as you described, I get the
>>>>>>>>> following errors:
>>>>>>>>>
>>>>>>>>> Failed to locate the winutils binary in the hadoop binary path
>>>>>>>>> IOException: Could not locate executable null\bin\winutils.exe
>>>>>>>>> in the Hadoop binaries
>>>>>>>>> Unable to load native-hadoop library for your platform... using
>>>>>>>>> builtin-java classes where applicable
>>>>>>>>> Failed to write due to java.io.IOException: No FileSystem for scheme
>>>>>>>>>
>>>>>>>>> BTW, I'm using NiFi version 1.5.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Scott
>>>>>>>>>
>>>>>>>>> On Tue, Aug 14, 2018 at 12:44 PM, Bryan Bende <[email protected]> wrote:
>>>>>>>>>> Scott,
>>>>>>>>>>
>>>>>>>>>> Unfortunately, the Parquet API itself is tied to the Hadoop
>>>>>>>>>> FileSystem object, which is why NiFi can't read and write
>>>>>>>>>> Parquet directly to flow files (i.e. the API doesn't provide a
>>>>>>>>>> way to read/write to/from Java input and output streams).
>>>>>>>>>>
>>>>>>>>>> The best you can do is trick the Hadoop API into using the local
>>>>>>>>>> file system by creating a core-site.xml with the following:
>>>>>>>>>>
>>>>>>>>>> <configuration>
>>>>>>>>>>     <property>
>>>>>>>>>>         <name>fs.defaultFS</name>
>>>>>>>>>>         <value>file:///</value>
>>>>>>>>>>     </property>
>>>>>>>>>> </configuration>
>>>>>>>>>>
>>>>>>>>>> That will make PutParquet or FetchParquet work with your local
>>>>>>>>>> file system.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>> Bryan
>>>>>>>>>>
>>>>>>>>>> On Tue, Aug 14, 2018 at 3:22 PM, scott <[email protected]> wrote:
>>>>>>>>>>> Hello NiFi community,
>>>>>>>>>>> Is there a simple way to read CSV files and write them out as
>>>>>>>>>>> Parquet files without Hadoop? I run NiFi on Windows and don't
>>>>>>>>>>> have access to a Hadoop environment. I'm trying to write the
>>>>>>>>>>> output of my ETL in a compressed and still queryable format.
>>>>>>>>>>> Is there something I should be using instead of Parquet?
>>>>>>>>>>>
>>>>>>>>>>> Thanks for your time,
>>>>>>>>>>> Scott
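[A note for anyone who hits the errors quoted above, based on general Hadoop-on-Windows behavior rather than anything verified in this thread: the "null\bin\winutils.exe" message usually means HADOOP_HOME is unset (hence the "null"), since the Windows Hadoop client looks for %HADOOP_HOME%\bin\winutils.exe; placing a winutils.exe build there and setting the variable may clear that particular failure. The "No FileSystem for scheme" error suggests the core-site.xml was never loaded; NiFi's Hadoop-based processors only read it if the Hadoop Configuration Resources property points at the file. The paths below are examples only:

    HADOOP_HOME=C:\hadoop        (so that C:\hadoop\bin\winutils.exe exists)
    PutParquet -> Hadoop Configuration Resources: C:\nifi\conf\core-site.xml]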
