This has been asked several times in the past, but I'm not aware of anything "dataframe-like" in Java that's built against Arrow (or otherwise) and fills the kind of need that pandas does. Some years ago there was a Scala project, Saddle [1] (not Arrow-based), built initially by one of the early pandas developers, but I don't think it's still being actively developed. Building a higher-level Java API on top of the Arrow Java libraries would be incredibly useful to the community, I'm sure.

[1]: https://github.com/saddle/saddle

On Tue, Mar 16, 2021 at 5:06 PM Paul Whalen <[email protected]> wrote:
>
> Hi,
>
> I've been using Arrow for some time now, mostly in the context of Arrow
> Flight between Java and Python. While it's quite easy to convert Arrow data
> in Python to a pandas dataframe and manipulate it, I'm struggling to find an
> obvious analogue on the Java side. VectorSchemaRoot is useful for
> loading/unloading/moving data, but clumsy for doing higher level operations,
> especially joins/aggregations/etc across "tables".
>
> In other words, if I wanted to load non Arrow formatted data from somewhere
> into Java, manipulate it with a dataframe like API, and then send the result
> somewhere via Flight, what library would be the best/simplest way to
> accomplish that? I see lots of progress in other languages, but I'm
> wondering what would be recommended for Java.
>
> I'm currently looking at Spark SQL just in-application, but that seems a
> touch heavyweight, and I'm not sure it would do exactly what I've described
> (nor am I terribly familiar with Spark in the first place).
>
> If the premise of this question is flawed, please feel free to correct me.
>
> Thanks!
> Paul
