Re: parquet/beam

2018-06-01 Thread Reuven Lax
This is an interesting direction. I had looked at Arrow as a default Coder replacement for schema PCollections, and that didn't seem fruitful as we would need to create Arrow batches of size 1. However, using Arrow to encode batches over the FnAPI might indeed be an interesting approach. Streaming
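
A minimal pyarrow sketch of the batch-size concern above (illustrative only, not code from the thread; the schema and field names are invented): encoding a RecordBatch holding a single row versus one holding many rows shows why per-element Arrow batches carry a lot of fixed overhead.

# Sketch: fixed per-batch overhead dominates when a batch holds one row.
import pyarrow as pa

def encode_batch(rows):
    """Serialize a list of (user_id, name) tuples as one Arrow IPC stream."""
    ids, names = zip(*rows)
    batch = pa.RecordBatch.from_arrays(
        [pa.array(ids, type=pa.int64()), pa.array(names, type=pa.string())],
        names=["user_id", "name"])
    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, batch.schema) as writer:
        writer.write_batch(batch)
    return sink.getvalue()

single = encode_batch([(1, "alice")])                        # batch of size 1
bulk = encode_batch([(i, "user%d" % i) for i in range(1000)])
print(single.size, bulk.size)  # bytes per row is far lower in the large batch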

Re: parquet/beam

2018-05-31 Thread Lukasz Cwik
It really needs someone to take a deep dive and look into whether Arrow is a good fit now, considering all the use cases that Apache Beam has. I took a look about a year ago when designing the Fn Data API and concluded at that point in time it wasn't great for several reasons, but mainly due to the

Re: parquet/beam

2018-05-31 Thread Lukasz Cwik
Kenn, it can be done, but it requires explicit flow control communication between the Runner -> SDK and SDK -> Runner to be developed to support sub-bundle groupings. Transports and in-memory layouts are related, but improving our coders to use in-memory layouts would give us most of the benefit. For

Re: parquet/beam

2018-05-31 Thread Kenneth Knowles
For the latter, can we have the Fn API data plane transmit sub-bundle groupings to benefit from the memory layout? On input the runner controls, on output the SDK controls (spilling)? Just random thoughts. Kenn On Thu, May 31, 2018 at 8:21 AM Lukasz Cwik wrote: > Tyler and I had reached out to

Re: parquet/beam

2018-05-31 Thread Lukasz Cwik
Tyler and I had reached out to Arrow folks[1] asking how we could support KV<K, Iterable<V>> when the iterable of values is beyond memory size limits. There is an open JIRA about adding support for large byte[], string, and list types in ARROW-750[2]. Robert had pointed out that we could do the same
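
To make the KV<K, Iterable<V>> point concrete (an invented sketch, not from the thread): one natural Arrow layout is a key column plus a list-typed values column, and the 32-bit offsets of the standard list/binary types are exactly the per-cell limit that the "large" types tracked in ARROW-750 address.

# Sketch: KV<K, Iterable<V>> laid out as one Arrow row per key.
# Standard list/string offsets are 32-bit, so a single cell is capped well
# below what an arbitrarily large Iterable<V> can hold.
import pyarrow as pa

keys = pa.array(["k1", "k2"], type=pa.string())
values = pa.array([[1, 2, 3], list(range(10))], type=pa.list_(pa.int64()))
batch = pa.RecordBatch.from_arrays([keys, values], names=["key", "values"])
print(batch.num_rows, batch.schema)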

Re: parquet/beam

2018-05-31 Thread Reuven Lax
I've looked at Arrow, and there's some trickiness. Beam has a record model and Arrow works best with large batches of records. We could do per-record encoding, but that might be inefficient in Arrow. On Thu, May 31, 2018, 5:50 PM Ismaël Mejía wrote: > If I understand correctly Arrow allows a

Re: parquet/beam

2018-05-31 Thread Ismaël Mejía
If I understand correctly, Arrow allows a common multi-language in-memory data representation; basically it is a columnar data format that you can use to transfer data between libraries in Python (pandas, numpy, etc.), Java, and other languages. This avoids the round trip to disk to do so. So we
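
A small pyarrow sketch of the hand-off Ismaël describes (illustrative, with invented column names): a pandas DataFrame is converted to an Arrow table and written to the Arrow IPC stream format, which an Arrow reader in Java or another language can consume without a round trip to disk.

import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"id": [1, 2, 3], "score": [0.1, 0.2, 0.3]})
table = pa.Table.from_pandas(df)

# Serialize to the Arrow IPC stream format; any Arrow implementation can read it.
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, table.schema) as writer:
    writer.write_table(table)
buf = sink.getvalue()

# Reading it back (here in Python, but equally possible from Java, etc.).
round_tripped = pa.ipc.open_stream(buf).read_all().to_pandas()
print(round_tripped.equals(df))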

Re: parquet/beam

2018-05-30 Thread Chamikara Jayalath
On Wed, May 30, 2018 at 4:43 PM Lukasz Cwik wrote: > For Python Parquet support, hopefully we can have cross-language pipelines > solve this so we only need to implement it once. If it is really popular, > having it implemented more than once may be worthwhile. > I'd say Parquet format is

Re: parquet/beam

2018-05-30 Thread Lukasz Cwik
For Python Parquet support, hopefully we can have cross-language pipelines solve this so we only need to implement it once. If it is really popular, having it implemented more than once may be worthwhile. Would the point of Arrow be to treat it as an IO connector similar to ParquetIO or JdbcIO (I
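
For the "IO connector" framing, a hypothetical sketch (there was no Python ParquetIO at the time of this thread; the class name and file path below are invented) of how a DoFn built on pyarrow.parquet could expose Parquet files to a Python pipeline:

import apache_beam as beam
import pyarrow.parquet as pq

class ReadParquetRows(beam.DoFn):
    """Reads a Parquet file with pyarrow and emits each row as a dict."""
    def process(self, file_path):
        table = pq.read_table(file_path)
        for row in table.to_pylist():
            yield row

with beam.Pipeline() as p:
    _ = (p
         | beam.Create(["/tmp/example.parquet"])  # illustrative path
         | beam.ParDo(ReadParquetRows())
         | beam.Map(print))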

parquet/beam

2018-05-30 Thread Austin Bennett
I can see great use cases with S3/Parquet, so that's a great addition (which JB is addressing, for Java)! It would be even more useful for the use cases I find myself around if there were Python Parquet support as well, so perhaps for this next release: Would it make sense to be exploring: