Re: Using hudi with pyspark

2019-09-11 Thread Vinoth Chandar
Awesome. Also, you could try building off master (0.5.0-SNAPSHOT) if you are
having trouble with the bundles.

Would greatly appreciate it if you could share your progress/feedback.

On Wed, Sep 11, 2019 at 1:55 AM Rodrigo Dominguez 
wrote:

> Hi Kabeer
>
> I was able to write a simple script in Python and submit it with:
>
> spark-submit --jars
> $HUDI_SRC/packaging/hoodie-spark-bundle/target/hoodie-spark-bundle-0.4.7.jar
> --packages com.databricks:spark-avro_2.11:4.0.0 --conf
> spark.serializer=org.apache.spark.serializer.KryoSerializer ./test.py
>
> Yes, the idea is to use upsert, I’ll take a look at the project.
>
> Thank you
>
> Rodrigo Dominguez
> www.rorra.com.ar
>
>
> > On Sep 10, 2019, at 10:03 PM, Kabeer Ahmed  wrote:
> >
> > Hi Rodrigo,
> >
> > Welcome to the HUDI users group. The entire Hudi code base is Java and
> > Scala based. But there is nothing stopping you from using it through
> > Python (pyspark). You should be able to copy all the packaging jars into
> > your Spark installation and use them. But please note that you wouldn't
> > be able to define your own CombineAndUpdate logic (as far as I know).
> > For example, if you wanted to write your own logic to compare the records
> > being ingested against the ones persisted, I am not aware of how to do
> > that when using PySpark.
> > If you are only after using Python with HUDI for upsert use cases, then I
> > would highly recommend that you look into the Metorikku project at:
> > https://github.com/YotpoLtd/metorikku. The project does quite a lot
> > without writing any code at all. It is based on HUDI.
> > If you are still after a Python example, then I can try to write one and
> > share it with you.
> > Hope this helps,
> > Kabeer.
> >
> > On Sep 10 2019, at 4:07 pm, Rodrigo Dominguez 
> wrote:
> >> I’m new to Hudi, and I’m wondering whether I can use it with Python
> >> (pyspark) and, if so, how.
> >>
> >> I was able to download the source code, compile the project, and run the
> >> Scala and Java samples, but I didn’t see any Python source code, so I’m
> >> wondering whether this is possible.
> >> Thank you
> >> Rodrigo Dominguez
> >> [email protected]
> >>
> >
>
>


Re: Using hudi with pyspark

2019-09-11 Thread Rodrigo Dominguez
Hi Kabeer

I was able to write a simple script in Python and submit it with:

spark-submit --jars 
$HUDI_SRC/packaging/hoodie-spark-bundle/target/hoodie-spark-bundle-0.4.7.jar 
--packages com.databricks:spark-avro_2.11:4.0.0 --conf 
spark.serializer=org.apache.spark.serializer.KryoSerializer ./test.py 
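The `test.py` being submitted above could look roughly like the sketch below. The table name, base path, and field names are made-up for illustration, and the option keys follow the Hudi Spark datasource conventions (double-check them against the docs for your version); note that the 0.4.x bundles registered the datasource as `com.uber.hoodie`, while later releases use `org.apache.hudi`:

```python
# test.py -- hypothetical minimal upsert helper; submit with the Hudi
# spark bundle on the classpath, as in the spark-submit command above.

def hudi_upsert_options(table_name, record_key, precombine_key):
    """Write options for a Hudi upsert (option names per the Hudi Spark
    datasource; verify them against the docs for your version)."""
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.precombine.field": precombine_key,
    }


def upsert(df, base_path, options):
    """Upsert a DataFrame into the Hudi dataset at base_path.
    0.4.x bundles registered the source as "com.uber.hoodie";
    newer releases use "org.apache.hudi"."""
    (df.write
       .format("com.uber.hoodie")
       .options(**options)
       .mode("append")
       .save(base_path))
```

Inside the script you would obtain a session with `SparkSession.builder.getOrCreate()`, build a DataFrame, and call e.g. `upsert(df, "file:///tmp/hudi/test_table", hudi_upsert_options("test_table", "uuid", "ts"))`.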

Yes, the idea is to use upsert, I’ll take a look at the project.

Thank you

Rodrigo Dominguez
www.rorra.com.ar


> On Sep 10, 2019, at 10:03 PM, Kabeer Ahmed  wrote:
> 
> Hi Rodrigo,
> 
> Welcome to the HUDI users group. The entire Hudi code base is Java and Scala
> based. But there is nothing stopping you from using it through Python
> (pyspark). You should be able to copy all the packaging jars into your Spark
> installation and use them. But please note that you wouldn't be able to
> define your own CombineAndUpdate logic (as far as I know). For example, if
> you wanted to write your own logic to compare the records being ingested
> against the ones persisted, I am not aware of how to do that when using
> PySpark.
> If you are only after using Python with HUDI for upsert use cases, then I
> would highly recommend that you look into the Metorikku project at:
> https://github.com/YotpoLtd/metorikku. The project does quite a lot without
> writing any code at all. It is based on HUDI.
> If you are still after a Python example, then I can try to write one and
> share it with you.
> Hope this helps,
> Kabeer.
> 
> On Sep 10 2019, at 4:07 pm, Rodrigo Dominguez  wrote:
>> I’m new to Hudi, and I’m wondering whether I can use it with Python
>> (pyspark) and, if so, how.
>>
>> I was able to download the source code, compile the project, and run the
>> Scala and Java samples, but I didn’t see any Python source code, so I’m
>> wondering whether this is possible.
>> Thank you
>> Rodrigo Dominguez
>> [email protected]
>> 
> 



Re: Using hudi with pyspark

2019-09-10 Thread Kabeer Ahmed
Hi Rodrigo,

Welcome to the HUDI users group. The entire Hudi code base is Java and Scala
based. But there is nothing stopping you from using it through Python
(pyspark). You should be able to copy all the packaging jars into your Spark
installation and use them. But please note that you wouldn't be able to define
your own CombineAndUpdate logic (as far as I know). For example, if you wanted
to write your own logic to compare the records being ingested against the ones
persisted, I am not aware of how to do that when using PySpark.
If you are only after using Python with HUDI for upsert use cases, then I would
highly recommend that you look into the Metorikku project at:
https://github.com/YotpoLtd/metorikku. The project does quite a lot without
writing any code at all. It is based on HUDI.
If you are still after a Python example, then I can try to write one and share
it with you.
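For instance, here is a rough sketch of the read side from PySpark once the bundle jar is on the classpath (e.g. launched via `pyspark --jars .../hoodie-spark-bundle-0.4.7.jar`). The path is made-up, and the glob convention is an assumption based on how older Hudi releases loaded datasets:

```python
def hudi_load_glob(base_path, partition_depth=0):
    """Glob to pass to spark.read.load() for older Hudi releases: the base
    path plus one `*` per partition level, plus one more for the data files
    themselves (so a non-partitioned table is just `base_path/*`)."""
    return "/".join([base_path.rstrip("/")] + ["*"] * (partition_depth + 1))


def read_hudi(spark, base_path, partition_depth=0):
    """Read a Hudi dataset into a DataFrame. 0.4.x bundles register the
    datasource as "com.uber.hoodie"; newer releases use "org.apache.hudi"."""
    return (spark.read
            .format("com.uber.hoodie")
            .load(hudi_load_glob(base_path, partition_depth)))
```

For a year/month/day-partitioned table you would call something like `read_hudi(spark, "file:///tmp/hudi/test_table", partition_depth=3).show()`.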
Hope this helps,
Kabeer.

On Sep 10 2019, at 4:07 pm, Rodrigo Dominguez  wrote:
> I’m new to Hudi, and I’m wondering whether I can use it with Python (pyspark)
> and, if so, how.
>
> I was able to download the source code, compile the project, and run the
> Scala and Java samples, but I didn’t see any Python source code, so I’m
> wondering whether this is possible.
> Thank you
> Rodrigo Dominguez
> [email protected]
>