Re: JavaSerializerInstance is slow

2021-09-02 Thread Antonin Delpeuch (lists)
Hi Kohki, Serialization of tasks happens in local mode too and as far as I am aware there is no way to disable this (although it would definitely be useful in my opinion). You can see the local mode as a testing mode, in which you would want to catch any serialization errors, before they appear i
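
To illustrate the kind of error that serializing tasks in local mode would surface early, here is a minimal sketch in plain Java. It is not Spark code: it only uses the standard `java.io.ObjectOutputStream` mechanism on which Java serialization rests. The names `SerializationCheck`, `GoodTask`, `BadTask` and `isSerializable` are hypothetical and do not come from the thread.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

public class SerializationCheck {

    /** Hypothetical task whose state is entirely serializable. */
    static class GoodTask implements Runnable, Serializable {
        private static final long serialVersionUID = 1L;
        final String label = "ok";
        @Override public void run() {}
    }

    /** Hypothetical task that holds a reference to a non-serializable object. */
    static class BadTask implements Runnable, Serializable {
        private static final long serialVersionUID = 1L;
        // java.lang.Object does not implement Serializable, so writing this field fails.
        final Object handle = new Object();
        @Override public void run() {}
    }

    /** Tries to serialize the task in memory and reports whether that succeeded. */
    static boolean isSerializable(Object task) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(task);
            return true;
        } catch (NotSerializableException e) {
            // The same class of failure a task would hit when shipped to a remote executor.
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isSerializable(new GoodTask()));
        System.out.println(isSerializable(new BadTask()));
    }
}
```

Running `main` should report that `GoodTask` serializes and `BadTask` does not. A check like this on a single machine catches the problem before any cluster is involved, which is the testing benefit described above, at the cost of the serialization overhead the thread is about.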

Async API to save RDDs?

2020-08-05 Thread Antonin Delpeuch (lists)
Hi, The RDD API provides async variants of a few RDD methods, which let the user execute the corresponding jobs asynchronously. This makes it possible to cancel the jobs for instance: https://spark.apache.org/docs/latest/api/java/org/apache/spark/rdd/AsyncRDDActions.html There does not seem to be

Re: RDD-like API for entirely local workflows?

2020-07-04 Thread Antonin Delpeuch (lists)
our help so far! Antonin On 04/07/2020 19:19, Juan Martín Guillén wrote: > Would you be able to send the code you are running? > That would be great if you include some sample data. > Is that possible? > > > On Saturday, 4 July 2020 13:09:23 ART, Antonin Delpeuch (lists)

Re: RDD-like API for entirely local workflows?

2020-07-04 Thread Antonin Delpeuch (lists)
> https://spark.apache.org/docs/latest/submitting-applications.html#master-urls > > Regards, > Juan Martín. > > > > > On Saturday, 4 July 2020 12:17:01 ART, Antonin Delpeuch (lists) > wrote: > > > Hi, > > I am working on revamping the archit

RDD-like API for entirely local workflows?

2020-07-04 Thread Antonin Delpeuch (lists)
Hi, I am working on revamping the architecture of OpenRefine, an ETL tool, to execute workflows on datasets which do not fit in RAM. Spark's RDD API is a great fit for the tool's operations, and provides everything we need: partitioning and lazy evaluation. However, OpenRefine is a lightweight t