Re: Enabling push-based shuffle in Spark

2020-01-27 Thread Long, Andrew
The easiest would be to create a fork of the code on GitHub. I can also accept diffs. Cheers Andrew From: Min Shen Date: Monday, January 27, 2020 at 12:48 PM To: "Long, Andrew" , "dev@spark.apache.org" Subject: Re: Enabling push-based shuffle in Spark Hi Andrew, We

Re: How to implement a "saveAsBinaryFile" function?

2020-01-16 Thread Long, Andrew
Hey Bing, There are a couple of different approaches you could take. The quickest and easiest would be to use the existing APIs val bytes = spark.range(1000) bytes.foreachPartition(bytes => { // WARNING: anything used in here will need to be serializable. // There's some magic to serializing the
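A minimal sketch of that existing-API approach, assuming spark is an active SparkSession and hdfs:///tmp/binary-output is a made-up destination; it writes one binary file per partition via the Hadoop FileSystem API and is not an official saveAsBinaryFile, just one way it could look:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.TaskContext

    val outputDir = "hdfs:///tmp/binary-output"  // hypothetical path
    val ds = spark.range(1000)

    ds.rdd.foreachPartition { rows =>
      // WARNING: anything referenced in here must be serializable, so build
      // the Hadoop Configuration and FileSystem on the executor side.
      val fs = FileSystem.get(new java.net.URI(outputDir), new Configuration())
      val out = fs.create(new Path(s"$outputDir/part-${TaskContext.getPartitionId()}.bin"))
      try rows.foreach(r => out.writeLong(r)) finally out.close()
    }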

Re: SortMergeJoinExec: Utilizing child partitioning when joining

2020-01-07 Thread Long, Andrew
, January 7, 2020 at 12:00 AM To: "Long, Andrew" Cc: "dev@spark.apache.org" Subject: Re: SortMergeJoinExec: Utilizing child partitioning when joining 1. Where can I find information on how to run standard performance tests/benchmarks? 2. Are performance degradations to existing quer

Re: SortMergeJoinExec: Utilizing child partitioning when joining

2020-01-02 Thread Long, Andrew
"Thoughts on this approach?" Just to warn you, this is a hazardous optimization without cardinality information. Removing columns from the hash exchange reduces entropy, potentially resulting in skew. Also keep in mind that if you reduce the number of columns on one side of the join you need
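To make the entropy point concrete, here is a tiny illustrative sketch (column names and data are made up, and col comes from org.apache.spark.sql.functions): hashing on both keys spreads rows across many partitions, while hashing on the low-cardinality key alone collapses nearly everything into a couple of partitions, i.e. skew.

    import org.apache.spark.sql.functions.col

    // Two very "hot" customer ids, many distinct order ids.
    val orders = spark.range(1000000).selectExpr(
      "id as order_id",
      "cast(rand() < 0.5 as int) as customer_id")

    val byBoth = orders.repartition(col("customer_id"), col("order_id"))  // spread fairly evenly
    val byOne  = orders.repartition(col("customer_id"))                   // ~2 non-empty partitions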

CR for adding bucket join support to V2 Datasources

2019-11-18 Thread Long, Andrew
Hey Friends, I recently created a pull request to add optional support for bucket joins to V2 Datasources, via a concrete class representing the Spark-style hash partitioning. If anyone has some free time I'd appreciate a code review. This also adds a concrete implementation of V2

Timeline for Spark 3.0

2019-06-28 Thread Long, Andrew
Hey Friends, Is there a timeline for Spark 3.0 in terms of the first RC and final release? Cheers Andrew

Bucketing and catalyst

2019-05-02 Thread Long, Andrew
Hey Friends, How aware of bucketing is Catalyst? I’ve been trying to piece together how Catalyst knows that it can remove a sort and shuffle given that both tables are bucketed and sorted the same way. Are there any classes in particular I should look at? Cheers Andrew
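As far as I recall (hedged, from memory), the shuffle side of this comes from the file scan's reported output partitioning being matched against the join's required distribution when exchanges are planned, so FileSourceScanExec's outputPartitioning/outputOrdering and the EnsureRequirements rule are reasonable places to start. The easiest way to watch the behavior is a sketch like this, with made-up table names:

    // Both tables bucketed (and sorted) on the join key.
    df1.write.bucketBy(64, "id").sortBy("id").saveAsTable("t1")
    df2.write.bucketBy(64, "id").sortBy("id").saveAsTable("t2")

    val joined = spark.table("t1").join(spark.table("t2"), "id")
    // If bucketing is being picked up, the sort-merge join's physical plan
    // should show no Exchange on either side (and, when the sort metadata is
    // usable, no extra Sort either).
    joined.explain()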

Re: Stage 152 contains a task of very large size (12747 KB). The maximum recommended task size is 100 KB

2019-05-01 Thread Long, Andrew
Thursday, April 25, 2019 at 8:47 AM To: "Long, Andrew" Cc: dev Subject: Re: FW: Stage 152 contains a task of very large size (12747 KB). The maximum recommended task size is 100 KB I usually only see that in regards to folks parallelizing very large objects. From what I know, it's real

FW: Stage 152 contains a task of very large size (12747 KB). The maximum recommended task size is 100 KB

2019-04-23 Thread Long, Andrew
Hey Friends, Is there an easy way of figuring out what's being pulled into the task context? I’ve been getting the following message, which I suspect means I’ve unintentionally caught some large objects, but figuring out what those objects are is stumping me. 19/04/23 13:52:13 WARN
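One hedged way to narrow it down, using a made-up helper rather than any official API: Java-serialize the suspect objects yourself to see how big they come out, then bisect by removing references from the closure.

    import java.io.{ByteArrayOutputStream, ObjectOutputStream}

    // Rough size check for anything you suspect the closure is dragging along
    // (fields of the enclosing class are a common culprit).
    def serializedSizeOf(obj: AnyRef): Int = {
      val buffer = new ByteArrayOutputStream()
      val out = new ObjectOutputStream(buffer)
      out.writeObject(obj)
      out.close()
      buffer.size()
    }

    val suspect = Map("example" -> Array.fill(1 << 20)(0.0))  // stand-in object
    println(s"~${serializedSizeOf(suspect) / 1024} KB when Java-serialized")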

Sort order in bucketing in a custom datasource

2019-04-16 Thread Long, Andrew
Hey Friends, Is it possible to specify the sort order or bucketing in a way that can be used by the optimizer in spark? Cheers Andrew

Which parts of a parquet read happen on the driver vs the executor?

2019-04-11 Thread Long, Andrew
Hey Friends, I’m working on a POC that involves reading and writing parquet files mid-DAG. Writes are working, but I’m struggling to get reads working due to serialization issues. I’ve got code that works in master=local but not on YARN. So here are my questions. 1. Is there an
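On the serialization side, one pattern that has worked for me as a sketch (the SerializableConf wrapper below is my own hypothetical helper, written out because Hadoop's Configuration isn't Java-serializable and Spark's internal wrapper isn't public API) is to broadcast the Hadoop configuration and build everything else executor-side, so nothing in the closure touches the SparkSession:

    import org.apache.hadoop.conf.Configuration

    class SerializableConf(@transient var value: Configuration) extends Serializable {
      private def writeObject(out: java.io.ObjectOutputStream): Unit = {
        out.defaultWriteObject(); value.write(out)
      }
      private def readObject(in: java.io.ObjectInputStream): Unit = {
        in.defaultReadObject(); value = new Configuration(false); value.readFields(in)
      }
    }

    val confBc = spark.sparkContext.broadcast(
      new SerializableConf(spark.sparkContext.hadoopConfiguration))

    spark.range(100).rdd.mapPartitions { rows =>
      val hadoopConf = confBc.value.value  // usable on the executor
      // ... open/read/write parquet files here with hadoopConf instead of the
      // SparkSession, which only exists on the driver ...
      rows
    }.count()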

Re: Manually reading parquet files.

2019-03-21 Thread Long, Andrew
la:305) at com.amazon.horizon.azulene.ParquetReadTests$$anonfun$2.apply(ParquetReadTests.scala:100) at com.amazon.horizon.azulene.ParquetReadTests$$anonfun$2.apply(ParquetReadTests.scala:100) From: Ryan Blue Reply-To: "rb...@netflix.com" Date: Thursday, March 21, 2019 at 3:32 PM T

Manually reading parquet files.

2019-03-21 Thread Long, Andrew
Hello Friends, I’m working on a performance improvement that reads additional parquet files in the middle of a lambda and I’m running into some issues. This is what I’d like to do: ds.mapPartitions(x => { // read a parquet file in and perform an operation with x }) Here’s my current POC code but
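Here is roughly the shape I'd expect the working version to take, as a hedged sketch: it assumes ds is the Dataset from above, the parquet-avro artifact is on the classpath, and the side-file path is made up; the reader is constructed inside the partition so nothing driver-only gets captured.

    import org.apache.avro.generic.GenericRecord
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.parquet.avro.AvroParquetReader

    val sideFile = "hdfs:///tmp/lookup/part-00000.parquet"  // hypothetical

    val result = ds.rdd.mapPartitions { rows =>
      // Everything is built executor-side; the closure captures only the path string.
      val reader = AvroParquetReader.builder[GenericRecord](new Path(sideFile))
        .withConf(new Configuration())
        .build()
      val lookup = Iterator.continually(reader.read()).takeWhile(_ != null).toList
      reader.close()
      rows.map { r =>
        // ... combine r with whatever was read from the parquet file ...
        r
      }
    }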

Re: Spark data quality bug when reading parquet files from hive metastore

2018-09-07 Thread Long, Andrew
Thanks Fokko, I will definitely take a look at this. Cheers Andrew From: "Driesprong, Fokko" Date: Friday, August 24, 2018 at 2:39 AM To: "reubensaw...@hotmail.com" Cc: "dev@spark.apache.org" Subject: Re: Spark data quality bug when reading parquet files from hive metastore Hi Andrew,

Spark data quality bug when reading parquet files from hive metastore

2018-08-22 Thread Long, Andrew
Hello Friends, I’ve encountered a bug where Spark silently corrupts data when reading from a parquet Hive table where the table schema does not match the file schema. I’d like to take a shot at adding some extra validations to the code to handle this corner case and I was wondering if anyone
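For reference, a hedged sketch of the kind of check I have in mind (the file path is made up; it just pulls the parquet footer for one data file and lines its columns up against what the metastore-backed table reports):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.parquet.format.converter.ParquetMetadataConverter
    import org.apache.parquet.hadoop.ParquetFileReader

    val footer = ParquetFileReader.readFooter(
      new Configuration(),
      new Path("hdfs:///warehouse/db.db/tbl/part-00000.parquet"),  // hypothetical
      ParquetMetadataConverter.NO_FILTER)

    val fileColumns  = footer.getFileMetaData.getSchema.getFields  // from the file
    val tableColumns = spark.table("db.tbl").schema.fieldNames     // from the metastore

    // A mismatch here is exactly the case where a positional read can silently
    // put values into the wrong columns.
    println(s"file:  $fileColumns")
    println(s"table: ${tableColumns.mkString(", ")}")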

Feedback on first commit + jira issue I opened

2018-05-31 Thread Long, Andrew
Hello Friends, I’m a new committer and I’ve submitted my first patch, and I had some questions about documentation standards. In my patch (JIRA below) I’ve added a config parameter to adjust the number of records shown when a user calls .show() on a dataframe. I was hoping someone could double
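For context on the change, a minimal sketch of the behavior being made configurable (df is any existing DataFrame, and the config key below is purely illustrative, not the actual name used in the patch):

    // Today the row count is an explicit argument, defaulting to 20.
    df.show()    // prints 20 rows
    df.show(50)  // prints 50 rows

    // The patch adds a session-level default along these lines
    // (key name here is made up for illustration):
    spark.conf.set("spark.sql.show.defaultNumRows", 50)
    df.show()    // would now print 50 rows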