Re: Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-19 Thread Chao Sun
Hi Mich,

> Also have you got some benchmark results from your tests that you can
possibly share?

We only have some partial benchmark results internally so far. Once shuffle
and better memory management have been introduced, we plan to publish the
benchmark results (at least TPC-H) in the repo.

> Compared to standard Spark, what kind of performance gains can be
expected with Comet?

Currently, users can benefit from Comet in a few areas (a hedged config
sketch follows this list):
- Parquet read: several improvements have been made to reading from S3
in particular, so users can expect better scan performance in that scenario
- Hash aggregation
- Columnar shuffle
- Decimals (Java's BigDecimal is pretty slow)
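
A minimal sketch of how these paths can be switched on, assuming the
config names currently in the repo (the spark.comet.* keys and the plugin
class below may evolve; the README is authoritative):

  $SPARK_HOME/bin/spark-shell \
    --jars $COMET_JAR \
    --conf spark.plugins=org.apache.spark.CometPlugin \
    --conf spark.comet.enabled=true \
    --conf spark.comet.exec.enabled=true

Columnar shuffle will additionally need a Comet shuffle-manager setting
once that piece is open sourced.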

> Can one use Comet on k8s in conjunction with something like a Volcano
addon?

I think so. Comet is mostly orthogonal to the Spark scheduler framework.
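
As an untested sketch, combining the two should just be a matter of
stacking configs, since Volcano is configured on the scheduler side while
Comet is a driver/executor plugin. The Volcano feature-step classes below
are the ones documented for Spark on Kubernetes; the Comet keys are the
same assumptions as above:

  $SPARK_HOME/bin/spark-submit \
    --master k8s://https://<api-server>:6443 \
    --deploy-mode cluster \
    --conf spark.kubernetes.scheduler.name=volcano \
    --conf spark.kubernetes.driver.pod.featureSteps=org.apache.spark.deploy.k8s.features.VolcanoFeatureStep \
    --conf spark.kubernetes.executor.pod.featureSteps=org.apache.spark.deploy.k8s.features.VolcanoFeatureStep \
    --conf spark.plugins=org.apache.spark.CometPlugin \
    --conf spark.comet.enabled=true \
    ...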

Chao

On Fri, Feb 16, 2024 at 5:39 AM Mich Talebzadeh wrote:

> Hi Chao,
>
> As a cool feature:
>
> - Compared to standard Spark, what kind of performance gains can be
>   expected with Comet?
> - Can one use Comet on k8s in conjunction with something like a
>   Volcano addon?
>
>
> HTH
>
> Mich Talebzadeh,
> Dad | Technologist | Solutions Architect | Engineer
> London
> United Kingdom
>
>
> View my LinkedIn profile:
> https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge, sourced from both personal expertise and other resources, but of
> course cannot be guaranteed. It is essential to note that, as with any
> advice, one verified and tested result holds more weight than a thousand
> expert opinions.
>
>
> On Tue, 13 Feb 2024 at 20:42, Chao Sun  wrote:
>
>> Hi all,
>>
>> We are very happy to announce that Project Comet, a plugin to
>> accelerate Spark query execution by leveraging DataFusion and Arrow,
>> has now been open sourced under the Apache Arrow umbrella. Please
>> check the project repo
>> https://github.com/apache/arrow-datafusion-comet for more details if
>> you are interested. We'd love to collaborate with people from the open
>> source community who share similar goals.
>>
>> Thanks,
>> Chao
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>


Re: Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-14 Thread Chao Sun
Hi Praveen,

We will add a "Getting Started" section to the README soon, but basically
comet-spark-shell
<https://github.com/apache/arrow-datafusion-comet/blob/main/bin/comet-spark-shell>
in the repo should provide a basic tool to build Comet and launch a Spark
shell with it.
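
For anyone who wants to see the moving parts, the script roughly amounts
to something like the following (the build invocation and jar name here
are my approximations, not the script's exact contents):

  # build the Comet jar from the repo root
  make release PROFILES="-Pspark-3.4"
  # launch a shell with Comet on the classpath and enabled
  $SPARK_HOME/bin/spark-shell \
    --jars spark/target/comet-spark-spark3.4_2.12-SNAPSHOT.jar \
    --conf spark.plugins=org.apache.spark.CometPlugin \
    --conf spark.comet.enabled=true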

Note that we haven't open sourced several features yet, including shuffle
support, which the aggregate operation depends on. Please stay tuned!

Chao


On Wed, Feb 14, 2024 at 2:44 PM praveen sinha wrote:

> Hi Chao,
>
> Is there any example app/gist/repo which can help me use this plugin? I
> wanted to try out some real-time aggregate performance on top of Parquet
> and Spark DataFrames.
>
> Thanks and Regards
> Praveen
>
>
> On Wed, Feb 14, 2024 at 9:20 AM Chao Sun  wrote:
>
>> > Out of interest, what are the differences in the approach between this
>> and Gluten?
>>
>> Overall they are similar, although Gluten supports multiple backends
>> including Velox and ClickHouse. One major difference is (obviously) that
>> Comet is based on DataFusion and Arrow, and written in Rust, while
>> Gluten is mostly C++.
>> I haven't looked very deeply into Gluten yet, but there could be other
>> differences, such as how strictly the engine follows Spark's semantics,
>> table format support (Iceberg, Delta, etc.), fallback mechanism
>> (coarse-grained fallback on the stage level or more fine-grained fallback
>> within stages), UDF support (Comet hasn't started on this yet),
>> shuffle support, memory management, etc.
>>
>> Both engines are backed by very strong and vibrant open source
>> communities (Velox, ClickHouse, Arrow & DataFusion), so it's very
>> exciting to see how the projects will grow in the future.
>>
>> Best,
>> Chao
>>
>> On Tue, Feb 13, 2024 at 10:06 PM John Zhuge  wrote:
>> >
>> > Congratulations! Excellent work!
>> >
>> > On Tue, Feb 13, 2024 at 8:04 PM Yufei Gu  wrote:
>> >>
>> >> Absolutely thrilled to see the project going open-source! Huge
>> congrats to Chao and the entire team on this milestone!
>> >>
>> >> Yufei
>> >>
>> >>
>> >> On Tue, Feb 13, 2024 at 12:43 PM Chao Sun  wrote:
>> >>>
>> >>> Hi all,
>> >>>
>> >>> We are very happy to announce that Project Comet, a plugin to
>> >>> accelerate Spark query execution by leveraging DataFusion and Arrow,
>> >>> has now been open sourced under the Apache Arrow umbrella. Please
>> >>> check the project repo
>> >>> https://github.com/apache/arrow-datafusion-comet for more details if
>> >>> you are interested. We'd love to collaborate with people from the open
>> >>> source community who share similar goals.
>> >>>
>> >>> Thanks,
>> >>> Chao
>> >>>
>> >>> -
>> >>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> >>>
>> >
>> >
>> > --
>> > John Zhuge
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>


Re: Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-14 Thread Chao Sun
> Out of interest, what are the differences in the approach between this and
> Gluten?

Overall they are similar, although Gluten supports multiple backends
including Velox and ClickHouse. One major difference is (obviously) that
Comet is based on DataFusion and Arrow, and written in Rust, while
Gluten is mostly C++.
I haven't looked very deeply into Gluten yet, but there could be other
differences, such as how strictly the engine follows Spark's semantics,
table format support (Iceberg, Delta, etc.), fallback mechanism
(coarse-grained fallback on the stage level or more fine-grained fallback
within stages), UDF support (Comet hasn't started on this yet),
shuffle support, memory management, etc.
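
One practical way to see where Comet ends and the fallback begins is to
look at the physical plan; a rough spark-shell sketch (the exact operator
names are whatever the current Comet code emits, e.g. CometScan):

  scala> val df = spark.read.parquet("/tmp/t").groupBy("a").count()
  scala> df.explain()
  // operators Comet took over show up with Comet-prefixed names in the
  // plan; anything left with the stock Spark operator name fell back to
  // regular JVM execution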

Both engines are backed by very strong and vibrant open source
communities (Velox, ClickHouse, Arrow & DataFusion), so it's very
exciting to see how the projects will grow in the future.

Best,
Chao

On Tue, Feb 13, 2024 at 10:06 PM John Zhuge  wrote:
>
> Congratulations! Excellent work!
>
> On Tue, Feb 13, 2024 at 8:04 PM Yufei Gu  wrote:
>>
>> Absolutely thrilled to see the project going open-source! Huge congrats to 
>> Chao and the entire team on this milestone!
>>
>> Yufei
>>
>>
>> On Tue, Feb 13, 2024 at 12:43 PM Chao Sun  wrote:
>>>
>>> Hi all,
>>>
>>> We are very happy to announce that Project Comet, a plugin to
>>> accelerate Spark query execution by leveraging DataFusion and Arrow,
>>> has now been open sourced under the Apache Arrow umbrella. Please
>>> check the project repo
>>> https://github.com/apache/arrow-datafusion-comet for more details if
>>> you are interested. We'd love to collaborate with people from the open
>>> source community who share similar goals.
>>>
>>> Thanks,
>>> Chao
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>
>
> --
> John Zhuge

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-13 Thread Chao Sun
Hi all,

We are very happy to announce that Project Comet, a plugin to
accelerate Spark query execution by leveraging DataFusion and Arrow,
has now been open sourced under the Apache Arrow umbrella. Please
check the project repo
https://github.com/apache/arrow-datafusion-comet for more details if
you are interested. We'd love to collaborate with people from the open
source community who share similar goals.

Thanks,
Chao

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark 3.4.1 and Hive 3.1.3

2023-09-07 Thread Chao Sun
Hi Sanket,

Spark 3.4.1 currently only works with Hive 2.3.9, and it would require a
lot of work to upgrade the Hive version to 3.x or above.

Normally though, you only need the Hive client in Spark to talk to the
Hive Metastore (HMS) for things like table or partition metadata. In that
case, the Hive 2.3.9 client used by Spark is already capable of
communicating with an HMS of other versions, such as Hive 3.x. So, could
you share a bit of context on why you want to use Hive 3.1.3 with Spark?
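
For example, a Spark 3.4.1 build with the default Hive 2.3.9 client should
be able to query tables in a Hive 3.x metastore given nothing more than
the metastore URI (host and port below are placeholders):

  $SPARK_HOME/bin/spark-sql \
    --conf spark.sql.catalogImplementation=hive \
    --conf spark.hadoop.hive.metastore.uris=thrift://<hms-host>:9083 \
    -e "SHOW DATABASES"

No Hive 3.x jars are needed on the Spark side just for metadata access.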

Chao


On Thu, Sep 7, 2023 at 6:22 AM Agrawal, Sanket wrote:

> Hi
>
>
>
> I tried using the Maven option and it's working. But we are not allowed to
> download jars at runtime from Maven because of some security restrictions.
>
>
>
> So, I tried again, downloading Hive 3.1.3 and giving the location of the
> jars, and it worked this time. But now our Docker image has 40 new
> Critical vulnerabilities due to Hive (scanned by AWS Inspector).
>
> So, the only solution I see here is to build *Spark 3.4.1* *with Hive
> 3.1.3*. But when I do so, the build fails while compiling the files
> in /spark/sql/hive. When I build *Spark 3.4.1* *with Hive 2.3.9*, the
> build completes successfully.
>
>
>
> Has anyone tried building Spark 3.4.1 with Hive 3.1.3 or higher?
>
>
>
> Thanks,
>
> Sanket A.
>
>
>
> *From:* Yeachan Park 
> *Sent:* Tuesday, September 5, 2023 8:52 PM
> *To:* Agrawal, Sanket 
> *Cc:* user@spark.apache.org
> *Subject:* [EXT] Re: Spark 3.4.1 and Hive 3.1.3
>
>
>
> What's the full traceback when you run the same thing via spark-shell? So
> something like:
>
>
>
> $SPARK_HOME/bin/spark-shell \
>--conf "spark.sql.hive.metastore.version=3.1.3" \
>--conf "spark.sql.hive.metastore.jars=path" \
>--conf "spark.sql.hive.metastore.jars.path=/opt/hive/lib/*.jar"
>
>
>
> W.r.t. building Hive, there's no need - either download it from
> https://downloads.apache.org/hive/hive-3.1.3/
> or use the Maven option like Yasukazu suggested. If you do want to build
> it, make sure you are using Java 8 to do so.
>
>
>
> On Tue, Sep 5, 2023 at 12:00 PM Agrawal, Sanket wrote:
>
> Hi,
>
>
>
> I tried pointing to Hive 3.1.3 using the below command, but I'm still
> getting an error. I see that spark-hive-thriftserver_2.12:3.4.1 and
> spark-hive_2.12:3.4.1 have a dependency on Hive 2.3.9.
>
>
>
> Command: pyspark --conf "spark.sql.hive.metastore.version=3.1.3" --conf
> "spark.sql.hive.metastore.jars=path" --conf
> "spark.sql.hive.metastore.jars.path=file://opt/hive/lib/*.jar"
>
>
>
> Error:
>
> [inline error screenshot not preserved in the archive]
>
> Also, when I try to build Spark with Hive 3.1.3, I get the following
> error:
>
> [inline error screenshot not preserved in the archive]
>
> If anyone can give me some direction, it would be of great help.
>
>
>
> Thanks,
>
> Sanket
>
>
>
> *From:* Yeachan Park 
> *Sent:* Tuesday, September 5, 2023 1:32 AM
> *To:* Agrawal, Sanket 
> *Cc:* user@spark.apache.org
> *Subject:* [EXT] Re: Spark 3.4.1 and Hive 3.1.3
>
>
>
> Hi,
>
>
>
> Why not download/build the Hive 3.1.3 bundle and tell Spark to use that?
> See https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html
>
>
>
> Basically, set:
>
> spark.sql.hive.metastore.version 3.1.3
>
> spark.sql.hive.metastore.jars path
>
> spark.sql.hive.metastore.jars.path 
>
>
>
> On Mon, Sep 4, 2023 at 7:42 PM Agrawal, Sanket <sankeagra...@deloitte.com.invalid> wrote:
>
> Hi,
>
>
>
> Has anyone tried building Spark 3.4.1 with Hive 3.1.3? I tried making the
> below changes in the Spark pom.xml, but it's failing.
>
>
>
> Pom.xml:
>
> [inline snippet not preserved in the archive]
>
> Error:
>
> [inline error screenshot not preserved in the archive]
>
> Can anyone help me with the required configurations?
>
>
>
> Thanks,
>
> SA
>

[ANNOUNCE] Apache Spark 3.2.3 released

2022-11-29 Thread Chao Sun
We are happy to announce the availability of Apache Spark 3.2.3!

Spark 3.2.3 is a maintenance release containing stability fixes. This
release is based on the branch-3.2 maintenance branch of Spark. We strongly
recommend that all 3.2 users upgrade to this stable release.

To download Spark 3.2.3, head over to the download page:
https://spark.apache.org/downloads.html

To view the release notes:
https://spark.apache.org/releases/spark-release-3-2-3.html

We would like to acknowledge all community members for contributing to this
release. This release would not have been possible without you.

Chao

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: [ANNOUNCE] Apache Spark 3.3.1 released

2022-10-26 Thread Chao Sun
Congrats, everyone! And thanks, Yuming, for driving the release!

On Wed, Oct 26, 2022 at 7:37 AM beliefer  wrote:
>
> Congratulations to everyone who has contributed to this release.
>
>
> At 2022-10-26 14:21:36, "Yuming Wang"  wrote:
>
> We are happy to announce the availability of Apache Spark 3.3.1!
>
> Spark 3.3.1 is a maintenance release containing stability fixes. This
> release is based on the branch-3.3 maintenance branch of Spark. We strongly
> recommend that all 3.3 users upgrade to this stable release.
>
> To download Spark 3.3.1, head over to the download page:
> https://spark.apache.org/downloads.html
>
> To view the release notes:
> https://spark.apache.org/releases/spark-release-3-3-1.html
>
> We would like to acknowledge all community members for contributing to this
> release. This release would not have been possible without you.
>
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark 3.3.0/3.2.2: java.io.IOException: can not read class org.apache.parquet.format.PageHeader: don't know what type: 15

2022-09-01 Thread Chao Sun
Hi Fengyu,

Do you still have the Parquet file that caused the error? Could you
open a JIRA and attach the file to it? I can take a look.
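
In the meantime, one way to narrow it down before filing (a rough sketch;
the paths are placeholders):

  $SPARK_HOME/bin/spark-shell
  scala> spark.read.parquet("hdfs://.../suspect.parquet").count()
  // the stack trace goes through the vectorized reader; if the error
  // persists with it off, the problem is in parquet-mr itself rather
  // than the vectorized path
  scala> spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
  scala> spark.read.parquet("hdfs://.../suspect.parquet").count()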

Chao

On Thu, Sep 1, 2022 at 4:03 AM FengYu Cao  wrote:
>
> I'm trying to upgrade our Spark (3.2.1 now), but with Spark 3.3.0 and
> Spark 3.2.2 we hit an error with a specific Parquet file.
>
> Is anyone else having the same problem? Or do I need to provide any
> information to the devs?
>
> ```
>
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in 
> stage 1.0 failed 4 times, most recent failure: Lost task 3.3 in stage 1.0 
> (TID 7) (10.113.39.118 executor 1): java.io.IOException: can not read class 
> org.apache.parquet.format.PageHeader: don't know what type: 15
> at org.apache.parquet.format.Util.read(Util.java:365)
> at org.apache.parquet.format.Util.readPageHeader(Util.java:132)
> at 
> org.apache.parquet.hadoop.ParquetFileReader$Chunk.readPageHeader(ParquetFileReader.java:1382)
> at 
> org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:1429)
> at 
> org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:1402)
> at 
> org.apache.parquet.hadoop.ParquetFileReader.readChunkPages(ParquetFileReader.java:1023)
> at 
> org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:928)
> at 
> org.apache.parquet.hadoop.ParquetFileReader.readNextFilteredRowGroup(ParquetFileReader.java:972)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:338)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:293)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:196)
> at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:104)
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:191)
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:104)
> at 
> org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:522)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithKeys_0$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
> at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
> at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
> at 
> org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
> at org.apache.spark.scheduler.Task.run(Task.scala:131)
> at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1491)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
> at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
> at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown 
> Source)
> at java.base/java.lang.Thread.run(Unknown Source)
> Caused by: shaded.parquet.org.apache.thrift.protocol.TProtocolException: 
> don't know what type: 15
> at 
> shaded.parquet.org.apache.thrift.protocol.TCompactProtocol.getTType(TCompactProtocol.java:894)
> at 
> shaded.parquet.org.apache.thrift.protocol.TCompactProtocol.readFieldBegin(TCompactProtocol.java:560)
> at 
> org.apache.parquet.format.InterningProtocol.readFieldBegin(InterningProtocol.java:155)
> at 
> shaded.parquet.org.apache.thrift.protocol.TProtocolUtil.skip(TProtocolUtil.java:108)
> at 
> shaded.parquet.org.apache.thrift.protocol.TProtocolUtil.skip(TProtocolUtil.java:60)
> at 
> org.apache.parquet.format.PageHeader$PageHeaderStandardScheme.read(PageHeader.java:1100)
> at 
> org.apache.parquet.format.PageHeader$PageHeaderStandardScheme.read(PageHeader.java:1019)
> at org.apache.parquet.format.PageHeader.read(PageHeader.java:896)
> at org.apache.parquet.format.Util.read(Util.java:362)
> ... 32 more
>
>
> ```
>
> similar to https://issues.apache.org/jira/browse/SPARK-11844, but we 

Re: [ANNOUNCE] Apache Spark 3.2.1 released

2022-01-28 Thread Chao Sun
Thanks Huaxin for driving the release!

On Fri, Jan 28, 2022 at 5:37 PM Ruifeng Zheng  wrote:

> It's great!
> Congrats and thanks, Huaxin!
>
>
> -- Original Message --
> *From:* "huaxin gao" ;
> *Sent:* Saturday, January 29, 2022, 9:07 AM
> *To:* "dev";"user";
> *Subject:* [ANNOUNCE] Apache Spark 3.2.1 released
>
> We are happy to announce the availability of Spark 3.2.1!
>
> Spark 3.2.1 is a maintenance release containing stability fixes. This
> release is based on the branch-3.2 maintenance branch of Spark. We strongly
> recommend that all 3.2 users upgrade to this stable release.
>
> To download Spark 3.2.1, head over to the download page:
> https://spark.apache.org/downloads.html
>
> To view the release notes:
> https://spark.apache.org/releases/spark-release-3-2-1.html
>
> We would like to acknowledge all community members for contributing to this
> release. This release would not have been possible without you.
>
> Huaxin Gao
>


Re: Spark with Hadoop 3.3 distribution release in Download page

2021-01-19 Thread Chao Sun
Hi Gabriel,

The distribution won’t be available for download until Spark supports
Hadoop 3.3.x. At the moment, Spark cannot use Hadoop 3.3.0 because of
various issues. Our best bet is to wait until Hadoop 3.3.1 comes out.

Best,
Chao

On Tue, Jan 19, 2021 at 8:00 PM Gabriel Magno wrote:

> Correction: spark-3.0.1-bin-hadoop3.3.tgz
>
> --
> Gabriel Magno
>
>
> On Wed, Jan 20, 2021 at 00:34, Gabriel Magno <gabrielmag...@gmail.com> wrote:
>
>> Are there any plans to provide a distribution release of Spark with
>> Hadoop 3.3 (spark-3.0.1-bin-hadoop3.2.tgz) directly on the Spark Downloads
>> page (https://spark.apache.org/downloads.html)?
>>
>> --
>> Gabriel Magno
>>
>