You can run Docker on Windows, and the standard NiFi Docker image works fine under it.
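
As for the script question further down the thread: if you have Python on the
box, a small pyarrow script can do the CSV-to-Parquet conversion with no
Hadoop pieces at all. A rough, untested sketch (the script name, the paths,
and the snappy codec are just my assumptions), meant to be called from
ExecuteStreamCommand or ExecuteProcess:

# csv_to_parquet.py -- hypothetical helper, untested on Windows.
# Assumes pyarrow is installed: pip install pyarrow
import sys

import pyarrow.csv as pv
import pyarrow.parquet as pq

def convert(src_path, dst_path):
    # read_csv infers column names and types from the file
    table = pv.read_csv(src_path)
    # snappy keeps the output compressed but still queryable
    pq.write_table(table, dst_path, compression="snappy")

if __name__ == "__main__":
    convert(sys.argv[1], sys.argv[2])

You would pass the input and output paths as command arguments, e.g.
python csv_to_parquet.py C:\nifi_temp\in.csv C:\nifi_temp\out.parquet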

On Wed, Aug 15, 2018 at 2:52 PM scott <tcots8...@gmail.com> wrote:

> Mike, that's a good tip. I'll test that, but unfortunately, I've already
> committed to Windows.
> What about a script? Is there some tool you know of that can just be
> called by NiFi to convert an input CSV file to a Parquet file?
>
> On Wed, Aug 15, 2018 at 8:32 AM, Mike Thomsen <mikerthom...@gmail.com>
> wrote:
>
>> Scott,
>>
>> You can also try Docker on Windows. Something like this should work:
>>
>> docker run -d --name nifi-test -v C:/nifi_temp:/opt/data_output -p 8080:8080 apache/nifi:latest
>>
>> I don't have Windows either, but Docker seems to work fine for my
>> colleagues who have to use it on Windows. That should bridge C:\nifi_temp
>> and /opt/data_output between host and container, and map localhost:8080 to
>> the container's 8080, so you don't have to mess with a Hadoop client just
>> to try out some Parquet stuff.
>>
>> Mike
>>
>> On Wed, Aug 15, 2018 at 11:20 AM scott <tcots8...@gmail.com> wrote:
>>
>>> Thanks Bryan. I'll give the Hadoop client a try.
>>>
>>> On Wed, Aug 15, 2018 at 7:51 AM, Bryan Bende <bbe...@gmail.com> wrote:
>>>
>>>> I think there is a good chance that installing the Hadoop client would
>>>> solve the issue, but I can't say for sure since I don't have a Windows
>>>> machine to test.
>>>>
>>>> The processor depends on the Apache Parquet Java client library, which
>>>> depends on the Apache Hadoop client [1], and the Hadoop client has a
>>>> limitation on Windows: it additionally requires the winutils.exe binary.
>>>>
>>>> [1]
>>>> https://github.com/apache/parquet-mr/blob/master/parquet-avro/pom.xml#L62-L65
>>>>
>>>>
>>>>
>>>> On Wed, Aug 15, 2018 at 10:16 AM, scott <tcots8...@gmail.com> wrote:
>>>> > If I install a Hadoop client on my NiFi host, would I be able to get
>>>> > past this error?
>>>> > I don't understand why this processor depends on Hadoop. Other
>>>> > projects like Drill and Spark don't have such a dependency to be
>>>> > able to write Parquet files.
>>>> >
>>>> > On Tue, Aug 14, 2018 at 2:58 PM, Juan Pablo Gardella
>>>> > <gardellajuanpa...@gmail.com> wrote:
>>>> >>
>>>> >> It's a warning. You can ignore that.
>>>> >>
>>>> >> On Tue, 14 Aug 2018 at 18:53 Bryan Bende <bbe...@gmail.com> wrote:
>>>> >>>
>>>> >>> Scott,
>>>> >>>
>>>> >>> Sorry, I did not realize the Hadoop client would be looking for
>>>> >>> this winutils.exe when running on Windows.
>>>> >>>
>>>> >>> On Linux and macOS you don't need anything external installed
>>>> >>> outside of NiFi, so I wasn't expecting this.
>>>> >>>
>>>> >>> Not sure if there is any other good option here regarding Parquet.
>>>> >>>
>>>> >>> Thanks,
>>>> >>>
>>>> >>> Bryan
>>>> >>>
>>>> >>>
>>>> >>> On Tue, Aug 14, 2018 at 5:31 PM, scott <tcots8...@gmail.com> wrote:
>>>> >>> > Hi Bryan,
>>>> >>> > I'm fine if I have to trick the API, but don't I still need
>>>> >>> > Hadoop installed somewhere? After creating the core-site.xml as
>>>> >>> > you described, I get the following errors:
>>>> >>> >
>>>> >>> > Failed to locate the winutils binary in the hadoop binary path
>>>> >>> > IOException: Could not locate executable null\bin\winutils.exe
>>>> >>> > in the Hadoop binaries
>>>> >>> > Unable to load native-hadoop library for your platform... using
>>>> >>> > builtin-java classes where applicable
>>>> >>> > Failed to write due to java.io.IOException: No FileSystem for
>>>> >>> > scheme
>>>> >>> >
>>>> >>> > BTW, I'm using NiFi version 1.5
>>>> >>> >
>>>> >>> > Thanks,
>>>> >>> > Scott
>>>> >>> >
>>>> >>> >
>>>> >>> > On Tue, Aug 14, 2018 at 12:44 PM, Bryan Bende <bbe...@gmail.com>
>>>> >>> > wrote:
>>>> >>> >>
>>>> >>> >> Scott,
>>>> >>> >>
>>>> >>> >> Unfortunately the Parquet API itself is tied to the Hadoop
>>>> >>> >> FileSystem object, which is why NiFi can't read and write
>>>> >>> >> Parquet directly to flow files (i.e. the library doesn't
>>>> >>> >> provide a way to read/write to/from Java input and output
>>>> >>> >> streams).
>>>> >>> >>
>>>> >>> >> The best you can do is trick the Hadoop API into using the local
>>>> >>> >> file-system by creating a core-site.xml with the following:
>>>> >>> >>
>>>> >>> >> <configuration>
>>>> >>> >>     <property>
>>>> >>> >>         <name>fs.defaultFS</name>
>>>> >>> >>         <value>file:///</value>
>>>> >>> >>     </property>
>>>> >>> >> </configuration>
>>>> >>> >>
>>>> >>> >> That will make PutParquet or FetchParquet work with your local
>>>> >>> >> file-system (point the processor's Hadoop Configuration Resources
>>>> >>> >> property at that core-site.xml so it gets picked up).
>>>> >>> >>
>>>> >>> >> Thanks,
>>>> >>> >>
>>>> >>> >> Bryan
>>>> >>> >>
>>>> >>> >>
>>>> >>> >> On Tue, Aug 14, 2018 at 3:22 PM, scott <tcots8...@gmail.com>
>>>> >>> >> wrote:
>>>> >>> >> > Hello NiFi community,
>>>> >>> >> > Is there a simple way to read CSV files and write them out as
>>>> >>> >> > Parquet files without Hadoop? I run NiFi on Windows and don't
>>>> >>> >> > have access to a Hadoop environment. I'm trying to write the
>>>> >>> >> > output of my ETL in a compressed and still query-able format.
>>>> >>> >> > Is there something I should be using instead of Parquet?
>>>> >>> >> >
>>>> >>> >> > Thanks for your time,
>>>> >>> >> > Scott
>>>> >>> >
>>>> >>> >
>>>> >
>>>> >
>>>>
>>>
>>>
>
