Hi Juan,
Of course! My prototype is here:
https://github.com/OpenRefine/OpenRefine/tree/spark-prototype
I suspect it can be quite hard for you to jump into the code at this
stage of the project, but here are some concise pointers:
The or-spark module contains the Spark-based implementation of our
Would you be able to send the code you are running? It would be great if you
could also include some sample data.
Is that possible?
On Saturday, July 4, 2020 at 13:09:23 ART, Antonin Delpeuch (lists)
wrote:
Hi Stephen and Juan,
Thanks both for your replies - you are right, I used the wrong
terminology! The local mode is what fits our needs best (and what I have
been benchmarking so far).
That being said, the problems I mention are still applicable to this
context. There is still a serialization overhead
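For reference, the usual way to reduce that overhead in Spark (a standard configuration sketch, not something specific to the OpenRefine prototype) is to switch from Java serialization to Kryo:

```
# spark-defaults.conf (sketch): Kryo is typically faster and more
# compact than the default Java serialization
spark.serializer    org.apache.spark.serializer.KryoSerializer
```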
Hi Antonin.
It seems you are confusing Standalone with Local mode. They are 2 different
modes.
From the Spark in Action book: "In local mode, there is only one executor in
the same client JVM as the driver, but this executor can spawn several
threads to run tasks."
In local mode, Spark uses your
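For concreteness, the two modes differ only in the master URL (a configuration sketch, assuming a standard Spark install; the host/port below are placeholders):

```
# spark-defaults.conf (sketch)
spark.master    local[4]             # local mode: one executor with 4 threads, inside the driver JVM
# spark.master  spark://host:7077    # standalone mode: a separate cluster manager process
```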
Spark in local mode (which is different from standalone) is a solution for
many use cases. I use it in conjunction with (and sometimes instead of)
pandas/pandasql due to its much wider ETL-related capabilities. On the JVM
side it is an even more obvious choice - given there is no equivalent to
pandas.
Hi,
I am working on revamping the architecture of OpenRefine, an ETL tool,
to execute workflows on datasets which do not fit in RAM.
Spark's RDD API is a great fit for the tool's operations, and provides
everything we need: partitioning and lazy evaluation.
However, OpenRefine is a lightweight tool
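To illustrate what the RDD API's lazy evaluation buys here (this is a plain-Python sketch of the concept, not Spark or OpenRefine code; `LazyDataset` is a made-up name): transformations only record a plan, and nothing is computed until an action such as collect() runs, which is what lets a tool process datasets larger than RAM one partition at a time.

```python
# Minimal sketch (not Spark itself) of the lazy-evaluation model the
# RDD API provides: map/filter build up a plan; collect() executes it.

class LazyDataset:
    def __init__(self, source):
        self._source = source   # an iterable, never materialized eagerly
        self._ops = []          # recorded transformations

    def map(self, fn):
        out = LazyDataset(self._source)
        out._ops = self._ops + [("map", fn)]
        return out

    def filter(self, pred):
        out = LazyDataset(self._source)
        out._ops = self._ops + [("filter", pred)]
        return out

    def collect(self):
        # only here does any computation actually happen
        items = iter(self._source)
        for kind, fn in self._ops:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)

rows = LazyDataset(range(10))
pipeline = rows.map(lambda x: x * 2).filter(lambda x: x % 3 == 0)
# nothing has run yet; collect() triggers evaluation of the whole chain
print(pipeline.collect())  # [0, 6, 12, 18]
```

Because evaluation is deferred, the source can just as well be a streamed file as an in-memory range, which is the property that matters for the revamp.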