Hi Juan,
Of course! My prototype is here:
https://github.com/OpenRefine/OpenRefine/tree/spark-prototype
I suspect it can be quite hard for you to jump into the code at this
stage of the project, but here are some concise pointers:
The or-spark module contains the Spark-based implementation of our
Would you be able to send the code you are running? It would be great if you
could also include some sample data.
Is that possible?
On Saturday, July 4, 2020 at 13:09:23 ART, Antonin Delpeuch (lists)
wrote:
Hi Stephen and Juan,
Thanks both for your replies - you are right, I used the wrong
terminology! The local mode is what fits our needs best (and what I have
been benchmarking so far).
That being said, the problems I mention are still applicable to this
context. There is still a serialization overhead
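For reference, the usual way to reduce that overhead in Spark (a standard configuration sketch, not something specific to the OpenRefine prototype) is to switch from Java serialization to Kryo:

```
# spark-defaults.conf (sketch): Kryo is typically faster and more
# compact than the default Java serialization
spark.serializer    org.apache.spark.serializer.KryoSerializer
```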
Hi Antonin.
It seems you are confusing Standalone with Local mode. They are 2 different
modes.
From the Spark in Action book: "In local mode, there is only one executor in
the same client JVM as the driver, but this executor can spawn several
threads to run tasks."
In local mode, Spark uses your
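For concreteness, the two modes differ only in the master URL (a configuration sketch, assuming a standard Spark install; the host/port below are placeholders):

```
# spark-defaults.conf (sketch)
spark.master    local[4]             # local mode: one executor with 4 threads, inside the driver JVM
# spark.master  spark://host:7077    # standalone mode: a separate cluster manager process
```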
Spark in local mode (which is different from standalone) is a solution for
many use cases. I use it in conjunction with (and sometimes instead of)
pandas/pandasql due to its much wider ETL-related capabilities. On the JVM
side it is an even more obvious choice - given there is no equivalent to
pandas.
Hi,
I am working on revamping the architecture of OpenRefine, an ETL tool,
to execute workflows on datasets which do not fit in RAM.
Spark's RDD API is a great fit for the tool's operations, and provides
everything we need: partitioning and lazy evaluation.
However, OpenRefine is a lightweight tool
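To illustrate what the RDD API's lazy evaluation buys here (this is a plain-Python sketch of the concept, not Spark or OpenRefine code; `LazyDataset` is a made-up name): transformations only record a plan, and nothing is computed until an action such as collect() runs, which is what lets a tool process datasets larger than RAM one partition at a time.

```python
# Minimal sketch (not Spark itself) of the lazy-evaluation model the
# RDD API provides: map/filter build up a plan; collect() executes it.

class LazyDataset:
    def __init__(self, source):
        self._source = source   # an iterable, never materialized eagerly
        self._ops = []          # recorded transformations

    def map(self, fn):
        out = LazyDataset(self._source)
        out._ops = self._ops + [("map", fn)]
        return out

    def filter(self, pred):
        out = LazyDataset(self._source)
        out._ops = self._ops + [("filter", pred)]
        return out

    def collect(self):
        # only here does any computation actually happen
        items = iter(self._source)
        for kind, fn in self._ops:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)

rows = LazyDataset(range(10))
pipeline = rows.map(lambda x: x * 2).filter(lambda x: x % 3 == 0)
# nothing has run yet; collect() triggers evaluation of the whole chain
print(pipeline.collect())  # [0, 6, 12, 18]
```

Because evaluation is deferred, the source can just as well be a streamed file as an in-memory range, which is the property that matters for the revamp.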