Re: Arrow as a common open standard for machine learning data

2020-07-02 Thread Joaquin Vanschoren
Possibly, although updating an embedding will likely change every value in the dataset. That seems to call for file versioning and meta-data about the process that generated it. > Thanks, you may mention me as a contributor to the blog post if you'd like! > Done ;). Thanks again, Joaquin

Re: Arrow as a common open standard for machine learning data

2020-07-02 Thread Joaquin Vanschoren
nd run git diff. DeltaLake would help here, but again, it seems that it only 'tracks' Spark operations done directly on the file? Thanks! Joaquin PS. Nick, would you like to be mentioned as a contributor in the blog post? Your comments helped a lot to improve it ;). On Tue, Jun 30, 2020 at 6:4

Re: Arrow as a common open standard for machine learning data

2020-06-30 Thread Joaquin Vanschoren
r" will invoke the same code > paths as the Arrow protocol file reader > > - Wes > > On Thu, Jun 20, 2019 at 4:12 PM Joaquin Vanschoren > wrote: > > > > Thank you all for your very detailed answers! I also read in other > threads > > that the 1.0.0 release m

Re: Arrow as a common open standard for machine learning data

2019-06-20 Thread Joaquin Vanschoren
Thank you all for your very detailed answers! I also read in other threads that the 1.0.0 release might be coming sometime this fall? I'm really looking forward to that. @Wes: will there be any practical difference between Feather and Arrow after the 1.0.0 release? Is it just an alias? What would

Re: Arrow as a common open standard for machine learning data

2019-06-12 Thread Joaquin Vanschoren
them: https://github.com/apache/arrow/blob/master/site/faq.md > > Neal > > On Wed, Jun 12, 2019 at 3:39 AM Joaquin Vanschoren < > joaquin.vanscho...@gmail.com> wrote: > > > Dear all, > > > > Thanks for creating Arrow! I'm part of OpenML.org, an open sourc

Arrow as a common open standard for machine learning data

2019-06-12 Thread Joaquin Vanschoren
Dear all, Thanks for creating Arrow! I'm part of OpenML.org, an open source initiative/platform for sharing machine learning datasets and models. We are currently storing data in either ARFF or Parquet, but are looking into whether e.g. Feather or a mix of Feather and Parquet could be the new