Hello,
I've been playing around with DataFusion to explore the feasibility of
replacing our current Python/pandas data processing jobs with Rust/DataFusion.
Ultimately, I'm looking to improve performance and decrease cost.
I started with some simple tests to measure the performance difference on a
basic task: read a CSV[1] and filter it.
Reading the CSV, DataFusion seemed to outperform pandas by around 30%, which
was nice:
* Rust took around 20-25 ms to read the CSV (compared to ~32 ms for pandas).
However, when filtering the data I was surprised to see that pandas was much
faster:
* Rust took around 500-600 ms to filter the CSV (compared to ~1 ms for pandas).
My code for each is below. I know I should be running the DataFusion timings
through something similar to Python's %timeit (a rough timing-loop sketch is
included after the Rust code), but I didn't have that immediately to hand, so I
ran the program many times and confirmed the numbers were roughly consistent.
Is this performance expected, or am I using DataFusion incorrectly?
Any insight is much appreciated!
[Rust]
```
use datafusion::error::Result;
use datafusion::prelude::*;
use std::time::Instant;

#[tokio::main]
async fn main() -> Result<()> {
    let start = Instant::now();

    // Build the execution context and read the CSV into a DataFrame.
    let mut ctx = ExecutionContext::new();
    let ratings_csv = "ratings_small.csv";
    let df = ctx.read_csv(ratings_csv, CsvReadOptions::new()).unwrap();
    println!("Read CSV Duration: {:?}", start.elapsed());

    // Filter to a single user and collect the results.
    let q_start = Instant::now();
    let results = df
        .filter(col("userId").eq(lit(1)))?
        .collect()
        .await
        .unwrap();
    println!("Filter duration: {:?}", q_start.elapsed());
    println!("Duration: {:?}", start.elapsed());

    Ok(())
}
```
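
For something closer to %timeit on the Rust side, here is the timing-loop
sketch I mentioned above. It reuses the same file, read options, and filter
expression as the snippet above; the run count of 10 is just an arbitrary
choice.

```
use datafusion::error::Result;
use datafusion::prelude::*;
use std::time::{Duration, Instant};

#[tokio::main]
async fn main() -> Result<()> {
    let mut ctx = ExecutionContext::new();
    // Same file and read options as the snippet above.
    let df = ctx.read_csv("ratings_small.csv", CsvReadOptions::new())?;

    // Run filter + collect several times and report the mean,
    // as a rough stand-in for Python's %timeit.
    let runs: u32 = 10;
    let mut total = Duration::from_secs(0);
    for _ in 0..runs {
        let start = Instant::now();
        let _batches = df.filter(col("userId").eq(lit(1)))?.collect().await?;
        total += start.elapsed();
    }
    println!("Mean filter duration over {} runs: {:?}", runs, total / runs);

    Ok(())
}
```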
[Python]
```
In [1]: import pandas as pd

In [2]: df = pd.read_csv("ratings_small.csv")
32.4 ms ± 210 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [3]: df.query("userId==1")
1.16 ms ± 24.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
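
In case my use of the DataFrame API is the problem, here is an equivalent
SQL-based version of the same filter that I plan to try as a cross-check. This
is just a sketch assuming the same DataFusion version as above, where
register_csv and sql are synchronous; the quoted "userId" is there to preserve
the column's case.

```
use datafusion::error::Result;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    let mut ctx = ExecutionContext::new();
    // Register the CSV as a table so it can be queried with SQL.
    ctx.register_csv("ratings", "ratings_small.csv", CsvReadOptions::new())?;

    // Same filter as the DataFrame version, expressed in SQL.
    let df = ctx.sql("SELECT * FROM ratings WHERE \"userId\" = 1")?;
    let batches = df.collect().await?;
    println!("Result batches: {}", batches.len());

    Ok(())
}
```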
[1]: https://www.kaggle.com/rounakbanik/the-movies-dataset?select=ratings.csv
Matthew M. Turner
Email: [email protected]
Phone: (908)-868-2786