There is a JVM based dataframe library: https://github.com/techascent/tech.ml.dataset
There are dplyr-like bindings for it: https://github.com/scicloj/tablecloth
It supports mmap/in-place loading of arrow files (which the Java SDK does not): https://techascent.com/blog/memory-mapping-arrow.html
And it performs just fine whether you use parquet or arrow: https://github.com/zero-one-group/geni-performance-benchmark
It also supports graal native compilation so you can have a graal native executable that reads/writes/mmaps arrow data.

On Tue, Mar 16, 2021 at 5:52 PM Andy Grove <[email protected]> wrote:

> This isn't directly related to the question, but I was reading about the
> newly released JDK 16 today and there is initial support for explicit
> vectorized operations, which might be interesting to explore for anyone
> considering building a Java DataFrame implementation.
>
> https://openjdk.java.net/jeps/338
>
> On Tue, Mar 16, 2021 at 5:43 PM Andrew Melo <[email protected]> wrote:
>
>> I can't speak to how complete it is, but I looked earlier for
>> something similar and ran across
>> https://github.com/deeplearning4j/nd4j .. it's probably not an exact
>> fit, but it does appear to be able to consume arrow buffers and expose
>> them to java.
>>
>> Cheers
>> Andrew
>>
>> On Tue, Mar 16, 2021 at 6:36 PM Wes McKinney <[email protected]> wrote:
>> >
>> > This has been asked several times in the past but I'm not aware of
>> > anything "dataframe-like" in Java that's built against Arrow (or
>> > otherwise) that fills the kind of need that pandas does. There was a
>> > Scala project some years ago Saddle [1] (not Arrow-based) built
>> > initially by one of the early pandas developers but I don't think it's
>> > still being actively developed. To build a higher-level Java API on
>> > top of the Arrow Java libraries would be incredibly useful to the
>> > community I'm sure.
>> >
>> > [1]: https://github.com/saddle/saddle
>> >
>> > On Tue, Mar 16, 2021 at 5:06 PM Paul Whalen <[email protected]> wrote:
>> > >
>> > > Hi,
>> > >
>> > > I've been using Arrow for some time now, mostly in the context of
>> > > Arrow Flight between Java and Python. While it's quite easy to convert
>> > > Arrow data in Python to a pandas dataframe and manipulate it, I'm
>> > > struggling to find an obvious analogue on the Java side. VectorSchemaRoot
>> > > is useful for loading/unloading/moving data, but clumsy for doing higher
>> > > level operations, especially joins/aggregations/etc across "tables".
>> > >
>> > > In other words, if I wanted to load non Arrow formatted data from
>> > > somewhere into Java, manipulate it with a dataframe like API, and then send
>> > > the result somewhere via Flight, what library would be the best/simplest
>> > > way to accomplish that? I see lots of progress in other languages, but I'm
>> > > wondering what would be recommended for Java.
>> > >
>> > > I'm currently looking at Spark SQL just in-application, but that
>> > > seems a touch heavyweight, and I'm not sure it would do exactly what I've
>> > > described (nor am I terribly familiar with Spark in the first place).
>> > >
>> > > If the premise of this question is flawed, please feel free to
>> > > correct me.
>> > >
>> > > Thanks!
>> > > Paul
>>
>
