This is an automated email from the ASF dual-hosted git repository.

sunchao pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow-datafusion-comet.git


The following commit(s) were added to refs/heads/main by this push:
     new 27f167b  doc: Add Quickstart Comet doc section (#125)
27f167b is described below

commit 27f167bad4ac2b90457ecce49682ef2932726c3b
Author: comphead <comph...@users.noreply.github.com>
AuthorDate: Tue Feb 27 16:22:54 2024 -0800

    doc: Add Quickstart Comet doc section (#125)
    
    Co-authored-by: o_voievodin <o_voievo...@apple.com>
---
 README.md | 46 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 46 insertions(+)

diff --git a/README.md b/README.md
index f078609..f48dfd9 100644
--- a/README.md
+++ b/README.md
@@ -58,3 +58,49 @@ Linux, Apple OSX (Intel and M1)
 - Apache Spark 3.2, 3.3, or 3.4
 - JDK 8 and up
 - GLIBC 2.17 (Centos 7) and up
+
+## Getting started
+
+Make sure the requirements above are met and the required software is installed on your machine.
+
+### Clone repo
+```commandline
+git clone https://github.com/apache/arrow-datafusion-comet.git
+```
+
+### Specify the Spark version and build Comet
+Spark 3.4 is used in this example.
+```
+cd arrow-datafusion-comet
+make release PROFILES="-Pspark-3.4"
+```
+
+### Run Spark with Comet enabled
+Make sure `SPARK_HOME` points to the same Spark version that Comet was built for.
+
+```
+$SPARK_HOME/bin/spark-shell --jars spark/target/comet-spark-spark3.4_2.12-0.1.0-SNAPSHOT.jar \
+--conf spark.sql.extensions=org.apache.comet.CometSparkSessionExtensions \
+--conf spark.comet.enabled=true \
+--conf spark.comet.exec.enabled=true \
+--conf spark.comet.exec.all.enabled=true
+```
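+
+As a sketch not included in this commit, the same settings could instead be placed in `$SPARK_HOME/conf/spark-defaults.conf`, so every Spark session picks them up without repeating the flags; the jar path below assumes the build output location shown above:
+```
+# Hypothetical spark-defaults.conf entries mirroring the spark-shell flags above
+spark.jars                    spark/target/comet-spark-spark3.4_2.12-0.1.0-SNAPSHOT.jar
+spark.sql.extensions          org.apache.comet.CometSparkSessionExtensions
+spark.comet.enabled           true
+spark.comet.exec.enabled      true
+spark.comet.exec.all.enabled  true
+```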
+
+### Verify Comet is enabled for a Spark SQL query
+
+Create a test Parquet source
+```scala
+scala> (0 until 10).toDF("a").write.mode("overwrite").parquet("/tmp/test")
+```
+
+Query the data from the test source and check that:
+- The INFO message shows the native Comet library has been initialized.
+- The query plan reflects Comet operators being used for this query instead of the Spark ones.
+```scala
+scala> spark.read.parquet("/tmp/test").createOrReplaceTempView("t1"); spark.sql("select * from t1 where a > 5").explain
+INFO src/lib.rs: Comet native library initialized
+== Physical Plan ==
+        *(1) ColumnarToRow
+        +- CometFilter [a#14], (isnotnull(a#14) AND (a#14 > 5))
++- CometScan parquet [a#14] Batched: true, DataFilters: [isnotnull(a#14), (a#14 > 5)], Format: CometParquet, Location: InMemoryFileIndex(1 paths)[file:/tmp/test], PartitionFilters: [], PushedFilters: [IsNotNull(a), GreaterThan(a,5)], ReadSchema: struct<a:int>
+```
\ No newline at end of file
