I don't think you have to install all of Hadoop on Windows to get it
to work, just winutils.exe, placed where the Hadoop client looks for
it. That location is controlled by the HADOOP_HOME environment
variable; the client expects to find %HADOOP_HOME%\bin\winutils.exe,
which is why your error below says "null\bin\winutils.exe" when the
variable isn't set.
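
For example (untested on my end since I don't have a Windows box, and
C:\hadoop is just an arbitrary location), from a command prompt:

    mkdir C:\hadoop\bin
    rem put the winutils.exe you downloaded into C:\hadoop\bin
    setx HADOOP_HOME "C:\hadoop"

and then restart NiFi so it picks up the new environment variable.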

There are pre-built binaries [1] for various versions of Hadoop. Even
though you'll only be writing to the local file system, you'll want to
match the version of winutils.exe to the version of the Hadoop client
your NiFi bundles (usually 2.7.3 for slightly older NiFi versions, or
3.0.0 for the latest version(s), I think) for best results.
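
One way to check which Hadoop client version your NiFi actually
bundles (a guess at the layout on my part, since NARs get unpacked
under the work directory on startup) is to search for the
hadoop-common jar from the NiFi home directory:

    dir /s /b work\nar | findstr hadoop-common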

Regards,
Matt

[1] https://github.com/steveloughran/winutils

On Wed, Aug 15, 2018 at 3:23 PM scott <[email protected]> wrote:
>
> Just tested in my CentOS VM; it worked like a charm without Hadoop. I'll
> open a Jira bug on PutParquet, since it doesn't seem to run on Windows.
> Still not sure what I can do. Converting our production Windows NiFi install 
> to Docker would be a major effort.
> Has anyone heard of a Parquet writer tool I can download and call from NiFi?
>
> On Wed, Aug 15, 2018 at 12:01 PM, Mike Thomsen <[email protected]> wrote:
>>
>> > Mike, that's a good tip. I'll test that, but unfortunately, I've already 
>> > committed to Windows.
>>
>> You can run Docker on Windows, and the standard NiFi Docker image works
>> fine under it.
>>
>> On Wed, Aug 15, 2018 at 2:52 PM scott <[email protected]> wrote:
>>>
>>> Mike, that's a good tip. I'll test that, but unfortunately, I've already 
>>> committed to Windows.
>>> What about a script? Is there some tool you know of that can just be called 
>>> by NiFi to convert an input CSV file to a Parquet file?
>>>
>>> On Wed, Aug 15, 2018 at 8:32 AM, Mike Thomsen <[email protected]> wrote:
>>>>
>>>> Scott,
>>>>
>>>> You can also try Docker on Windows. Something like this should work:
>>>>
>>>> docker run -d --name nifi-test -v C:/nifi_temp:/opt/data_output -p 8080:8080 apache/nifi:latest
>>>>
>>>> I don't have Windows either, but Docker seems to work fine for my
>>>> colleagues who have to use it on Windows. That should mount C:\nifi_temp
>>>> at /opt/data_output inside the container and map localhost:8080 on the
>>>> host to port 8080 in the container, so you don't have to mess with a
>>>> Hadoop client just to try out some Parquet stuff.
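>>>>
>>>> (A quick sanity check, assuming the container name above: "docker logs
>>>> -f nifi-test" should show NiFi starting up, and once it does the UI
>>>> should come up at http://localhost:8080/nifi.)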
>>>>
>>>> Mike
>>>>
>>>> On Wed, Aug 15, 2018 at 11:20 AM scott <[email protected]> wrote:
>>>>>
>>>>> Thanks Bryan. I'll give the Hadoop client a try.
>>>>>
>>>>> On Wed, Aug 15, 2018 at 7:51 AM, Bryan Bende <[email protected]> wrote:
>>>>>>
>>>>>> I think there is a good chance that installing the Hadoop client would
>>>>>> solve the issue, but I can't say for sure since I don't have a Windows
>>>>>> machine to test.
>>>>>>
>>>>>> The processor depends on the Apache Parquet Java client library,
>>>>>> which depends on the Apache Hadoop client [1], and the Hadoop client
>>>>>> has a limitation on Windows where it requires an additional native
>>>>>> binary (the winutils.exe discussed earlier in this thread).
>>>>>>
>>>>>> [1] https://github.com/apache/parquet-mr/blob/master/parquet-avro/pom.xml#L62-L65
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Aug 15, 2018 at 10:16 AM, scott <[email protected]> wrote:
>>>>>> > If I install a Hadoop client on my NiFi host, would I be able to
>>>>>> > get past this error?
>>>>>> > I don't understand why this processor depends on Hadoop. Other
>>>>>> > projects like Drill and Spark don't have such a dependency to be
>>>>>> > able to write Parquet files.
>>>>>> >
>>>>>> > On Tue, Aug 14, 2018 at 2:58 PM, Juan Pablo Gardella
>>>>>> > <[email protected]> wrote:
>>>>>> >>
>>>>>> >> It's a warning. You can ignore that.
>>>>>> >>
>>>>>> >> On Tue, 14 Aug 2018 at 18:53 Bryan Bende <[email protected]> wrote:
>>>>>> >>>
>>>>>> >>> Scott,
>>>>>> >>>
>>>>>> >>> Sorry, I did not realize the Hadoop client would be looking for
>>>>>> >>> this winutils.exe when running on Windows.
>>>>>> >>>
>>>>>> >>> On Linux and macOS you don't need anything installed outside of
>>>>>> >>> NiFi, so I wasn't expecting this.
>>>>>> >>>
>>>>>> >>> Not sure if there is any other good option here regarding Parquet.
>>>>>> >>>
>>>>>> >>> Thanks,
>>>>>> >>>
>>>>>> >>> Bryan
>>>>>> >>>
>>>>>> >>>
>>>>>> >>> On Tue, Aug 14, 2018 at 5:31 PM, scott <[email protected]> wrote:
>>>>>> >>> > Hi Bryan,
>>>>>> >>> > I'm fine if I have to trick the API, but don't I still need
>>>>>> >>> > Hadoop installed somewhere? After creating the core-site.xml as
>>>>>> >>> > you described, I get the following errors:
>>>>>> >>> >
>>>>>> >>> > Failed to locate the winutils binary in the hadoop binary path
>>>>>> >>> > IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries
>>>>>> >>> > Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>>>>>> >>> > Failed to write due to java.io.IOException: No FileSystem for scheme
>>>>>> >>> >
>>>>>> >>> > BTW, I'm using NiFi version 1.5
>>>>>> >>> >
>>>>>> >>> > Thanks,
>>>>>> >>> > Scott
>>>>>> >>> >
>>>>>> >>> >
>>>>>> >>> > On Tue, Aug 14, 2018 at 12:44 PM, Bryan Bende <[email protected]> wrote:
>>>>>> >>> >>
>>>>>> >>> >> Scott,
>>>>>> >>> >>
>>>>>> >>> >> Unfortunately the Parquet API itself is tied to the Hadoop
>>>>>> >>> >> FileSystem object, which is why NiFi can't read and write
>>>>>> >>> >> Parquet directly to/from flow files (i.e. the API doesn't
>>>>>> >>> >> provide a way to read from or write to Java input and output
>>>>>> >>> >> streams).
>>>>>> >>> >>
>>>>>> >>> >> The best you can do is trick the Hadoop API into using the local
>>>>>> >>> >> file-system by creating a core-site.xml with the following:
>>>>>> >>> >>
>>>>>> >>> >> <configuration>
>>>>>> >>> >>     <property>
>>>>>> >>> >>         <name>fs.defaultFS</name>
>>>>>> >>> >>         <value>file:///</value>
>>>>>> >>> >>     </property>
>>>>>> >>> >> </configuration>
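>>>>>> >>> >>
>>>>>> >>> >> and then point the processor's "Hadoop Configuration Resources"
>>>>>> >>> >> property at the path of that core-site.xml so it actually gets
>>>>>> >>> >> picked up.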
>>>>>> >>> >>
>>>>>> >>> >> That will make PutParquet or FetchParquet work with your local
>>>>>> >>> >> file-system.
>>>>>> >>> >>
>>>>>> >>> >> Thanks,
>>>>>> >>> >>
>>>>>> >>> >> Bryan
>>>>>> >>> >>
>>>>>> >>> >>
>>>>>> >>> >> On Tue, Aug 14, 2018 at 3:22 PM, scott <[email protected]> wrote:
>>>>>> >>> >> > Hello NiFi community,
>>>>>> >>> >> > Is there a simple way to read CSV files and write them out as
>>>>>> >>> >> > Parquet files without Hadoop? I run NiFi on Windows and don't
>>>>>> >>> >> > have access to a Hadoop environment. I'm trying to write the
>>>>>> >>> >> > output of my ETL in a compressed and still queryable format. Is
>>>>>> >>> >> > there something I should be using instead of Parquet?
>>>>>> >>> >> >
>>>>>> >>> >> > Thanks for your time,
>>>>>> >>> >> > Scott
>>>>>> >>> >
>>>>>> >>> >
>>>>>> >
>>>>>> >
>>>>>
>>>>>
>>>
>
