Re: [DISCUSS] Execute dataset scan tasks in distributed system

2020-08-04 Thread Joris Van den Bossche
Hi Hongze, I am not too familiar with distributed systems in general, but I did work on using the Arrow Dataset API in the Python Dask library, which can work in a distributed way (https://dask.org/). For Dask, we used the second idea of sending serialized data to the workers, but on the level of

Re: [DISCUSS] Execute dataset scan tasks in distributed system

2020-07-31 Thread Micah Kornfield
Hi Hongze, > Does anyone ever try using Arrow Dataset API in a distributed system? My understanding is that the Dataset project was initially intended for running on a single-node machine. It might be reasonable to extend it to be usable in a distributed system, but I'll let the primary contrib