Hello,

I've been playing around with DataFusion to explore the feasibility of 
replacing current Python/pandas data processing jobs with Rust/DataFusion.  
Ultimately, I'm looking to improve performance and decrease cost.

I started with some simple tests to measure the performance difference on a 
basic task (read a CSV [1] and filter it).

When reading the CSV, DataFusion seemed to outperform pandas by around 30%, 
which was nice.
* Rust took around 20-25ms to read the CSV (compared to 32ms for pandas)

However, when filtering the data, I was surprised to see that pandas was far 
faster.
* Rust took around 500-600ms to filter the CSV (compared to 1ms for pandas)

My code for each is below.  I know I should be running the DataFusion timings 
through something similar to Python's %timeit, but I didn't have that 
immediately accessible, so I ran the program many times to confirm the numbers 
were roughly consistent.

Is this performance expected? Or am I using DataFusion incorrectly?

Any insight is much appreciated!

[Rust]
```
use datafusion::error::Result;
use datafusion::prelude::*;
use std::time::Instant;

#[tokio::main]
async fn main() -> Result<()> {
    let start = Instant::now();

    let mut ctx = ExecutionContext::new();

    let ratings_csv = "ratings_small.csv";

    // Time the read_csv call on its own.
    let df = ctx.read_csv(ratings_csv, CsvReadOptions::new())?;
    println!("Read CSV Duration: {:?}", start.elapsed());

    // Time the filter, from building it through collecting the results.
    let q_start = Instant::now();
    let _results = df
        .filter(col("userId").eq(lit(1)))?
        .collect()
        .await?;
    println!("Filter duration: {:?}", q_start.elapsed());

    println!("Duration: {:?}", start.elapsed());

    Ok(())
}
```
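
For reference, something roughly like %timeit could be sketched in plain Rust as below. This is only a minimal sketch using the standard library: `time_it` and `demo` are hypothetical names, not part of DataFusion, and the workload in `demo` is a stand-in in-memory filter rather than the CSV query. The helper is synchronous, so timing the async `collect()` call above would need an async variant or a blocking call, which is not shown here.

```
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Hypothetical helper: run `f` `runs` times and return the
/// (mean, standard deviation) of the wall-clock time per run.
fn time_it<F: FnMut()>(runs: u32, mut f: F) -> (Duration, Duration) {
    assert!(runs > 0, "need at least one run");
    let mut samples = Vec::with_capacity(runs as usize);
    for _ in 0..runs {
        let start = Instant::now();
        f();
        samples.push(start.elapsed().as_secs_f64());
    }
    let n = samples.len() as f64;
    let mean = samples.iter().sum::<f64>() / n;
    // Population variance of the per-run times.
    let var = samples.iter().map(|s| (s - mean).powi(2)).sum::<f64>() / n;
    (
        Duration::from_secs_f64(mean),
        Duration::from_secs_f64(var.sqrt()),
    )
}

/// Example use with a stand-in workload (not DataFusion):
/// time a simple in-memory filter over 100,000 integers.
fn demo() -> (Duration, Duration) {
    let data: Vec<u64> = (0..100_000).collect();
    time_it(20, || {
        let hits = data.iter().filter(|&&x| x % 7 == 0).count();
        // Keep the compiler from optimizing the work away.
        black_box(hits);
    })
}
```

Calling `demo()` and printing the pair with `{:?}` gives a "mean, std. dev." summary per run. Unlike %timeit, this sketch does no warm-up and does not pick the loop count automatically.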

[Python]
```
In [1]: df = pd.read_csv("ratings_small.csv")
32.4 ms ± 210 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [2]: df.query("userId==1")
1.16 ms ± 24.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```

[1]: https://www.kaggle.com/rounakbanik/the-movies-dataset?select=ratings.csv


Matthew M. Turner
Email: [email protected]
Phone: (908)-868-2786
