The easiest would be to create a fork of the code on GitHub. I can also
accept diffs.
Cheers Andrew
From: Min Shen
Date: Monday, January 27, 2020 at 12:48 PM
To: "Long, Andrew" , "dev@spark.apache.org"
Subject: Re: Enabling push-based shuffle in Spark
Hi Andrew,
We
Hey Bing,
There’s a couple different approaches you could take. The quickest and easiest
would be to use the existing APIs
val bytes = spark.range(1000)
bytes.foreachPartition { partition =>
  // WARNING: anything used in here will need to be serializable.
  // There's some magic to serializing the closure, so watch what it captures.
}
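For illustration, here is a self-contained sketch of the pattern (the body of the closure is made up; the point is that everything the closure references gets serialized and shipped to the executors):

import org.apache.spark.sql.SparkSession

object ForeachPartitionExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("foreach-partition-example").getOrCreate()

    // Keep captured state small and Serializable; it travels with every task.
    spark.range(1000).foreachPartition { (rows: Iterator[java.lang.Long]) =>
      // Open per-partition resources (connections, writers) here, not per row.
      rows.foreach(n => println(n))
    }

    spark.stop()
  }
}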
Tuesday, January 7, 2020 at 12:00 AM
To: "Long, Andrew"
Cc: "dev@spark.apache.org"
Subject: Re: SortMergeJoinExec: Utilizing child partitioning when joining
1. Where can I find information on how to run standard performance
tests/benchmarks?
2. Are performance degradations to existing queries
“Thoughts on this approach?”
Just to warn you, this is a hazardous optimization without cardinality
information. Removing columns from the hash exchange reduces entropy,
potentially resulting in skew. Also keep in mind that if you reduce the number
of columns on one side of the join, you need to do the same on the other side
or the two sides’ partitionings will no longer match.
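To make the entropy point concrete, a toy sketch (column names invented): if one of the remaining key columns is low-cardinality, hashing on it alone piles most rows into a few partitions.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit}

val spark = SparkSession.builder().appName("skew-demo").getOrCreate()

// "country" is nearly constant, "id" is unique.
val df = spark.range(1000000).withColumn("country", lit("US"))

// Hash exchange on (country, id): rows spread evenly across partitions.
val wide = df.repartition(200, col("country"), col("id"))

// Hash exchange on (country) alone: almost every row lands in one partition.
val narrow = df.repartition(200, col("country"))

wide.rdd.glom().map(_.length).collect()   // roughly uniform sizes
narrow.rdd.glom().map(_.length).collect() // one huge partition, rest empty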
Hey Friends,
I recently created a pull request to add optional support for bucket joins
to V2 Datasources, via a concrete class representing Spark-style hash
partitioning. If anyone has some free time I’d appreciate a code review. This
also adds a concrete implementation of V2
Hey Friends,
Is there a timeline for Spark 3.0 in terms of the first RC and final release?
Cheers Andrew
Hey Friends,
How aware of bucketing is Catalyst? I’ve been trying to piece together how
Catalyst knows that it can remove a sort and shuffle given that both tables are
bucketed and sorted the same way. Are there any classes in particular I should
look at?
Cheers Andrew
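A quick experiment for anyone poking at the same question (a sketch; the table names are invented). The classes to read are EnsureRequirements, which inserts exchanges and sorts only when a child's outputPartitioning/outputOrdering don't already satisfy the join's requirements, and HashPartitioning:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
// Disable broadcast joins so we actually get a sort-merge join.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

val df = spark.range(100000).withColumnRenamed("id", "key")

// Both sides bucketed and sorted identically.
df.write.bucketBy(8, "key").sortBy("key").saveAsTable("t1")
df.write.bucketBy(8, "key").sortBy("key").saveAsTable("t2")

// Each side's outputPartitioning/outputOrdering already satisfy the join's
// requirements, so the plan should show no Exchange and no extra Sort.
spark.table("t1").join(spark.table("t2"), "key").explain()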
Thursday, April 25, 2019 at 8:47 AM
To: "Long, Andrew"
Cc: dev
Subject: Re: FW: Stage 152 contains a task of very large size (12747 KB). The
maximum recommended task size is 100 KB
I usually only see that when folks parallelize very large objects.
From what I know, it's real
Hey Friends,
Is there an easy way of figuring out what’s being pulled into the task context?
I’ve been getting the following message, which I suspect means I’ve
unintentionally caught some large objects, but figuring out what those objects
are is stumping me.
19/04/23 13:52:13 WARN
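For what it's worth, a minimal way to reproduce that warning (a sketch, assuming a live SparkContext named sc) is to capture a big object in a task closure:

// ~16 MB captured by the closure, far above the 100 KB recommendation.
val big = Array.fill(16 * 1024 * 1024)(0.toByte)

// `big` is serialized into every task, so the scheduler logs:
//   WARN TaskSetManager: Stage ... contains a task of very large size ...
sc.parallelize(1 to 8).map(i => big.length + i).collect()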
Hey Friends,
Is it possible to specify the sort order or bucketing in a way that can be used
by the optimizer in Spark?
Cheers Andrew
Hey Friends,
I’m working on a POC that involves reading and writing parquet files mid-DAG.
Writes are working, but I’m struggling to get reads working due to
serialization issues. I’ve got code that works with master=local but not on YARN.
So here are my questions.
1. Is there an
la:305)
at com.amazon.horizon.azulene.ParquetReadTests$$anonfun$2.apply(ParquetReadTests.scala:100)
at com.amazon.horizon.azulene.ParquetReadTests$$anonfun$2.apply(ParquetReadTests.scala:100)
From: Ryan Blue
Reply-To: "rb...@netflix.com"
Date: Thursday, March 21, 2019 at 3:32 PM
T
Hello Friends,
I’m working on a performance improvement that reads additional parquet files in
the middle of a lambda, and I’m running into some issues. This is what I’d like
to do:
ds.mapPartitions { x =>
  // read a parquet file in and perform an operation with x
}
Here’s my current POC code but
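One way to dodge the usual serialization trap here is to avoid touching the SparkSession on executors and instead open a plain parquet-avro reader inside the closure. A sketch under those assumptions (parquet-avro on the classpath, an encoder in scope for ds via import spark.implicits._; the path and the combine step are placeholders):

import org.apache.avro.generic.GenericRecord
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.AvroParquetReader

val extraPath = "s3://some-bucket/lookup.parquet" // placeholder

ds.mapPartitions { rows =>
  // Build all Hadoop/Parquet objects inside the closure; capturing a
  // Configuration, Path, or the SparkSession from the driver is what
  // usually breaks on YARN while appearing to work with master=local.
  val reader = AvroParquetReader.builder[GenericRecord](new Path(extraPath)).build()
  val lookup = Iterator.continually(reader.read()).takeWhile(_ != null).toVector
  reader.close()
  rows.map(row => row /* combine row with lookup here */)
}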
Thanks Fokko,
I will definitely take a look at this.
Cheers Andrew
From: "Driesprong, Fokko"
Date: Friday, August 24, 2018 at 2:39 AM
To: "reubensaw...@hotmail.com"
Cc: "dev@spark.apache.org"
Subject: Re: Spark data quality bug when reading parquet files from hive
metastore
Hi Andrew,
Hello Friends,
I’ve encountered a bug where Spark silently corrupts data when reading from a
parquet Hive table whose table schema does not match the file schema. I’d like
to take a shot at adding some extra validations to the code to handle this
corner case, and I was wondering if anyone
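Until a proper fix lands, a manual guard is possible. A sketch (table and path names invented) that compares the metastore schema against what Spark infers from the files themselves:

// Schema as recorded in the Hive metastore.
val tableSchema = spark.table("db.events").schema

// Schema actually present in the underlying parquet files.
val fileSchema = spark.read.parquet("/warehouse/db.db/events").schema

// Coarse check: names and types must line up positionally,
// otherwise reads can silently return wrong values.
val expected = tableSchema.fields.map(f => (f.name, f.dataType))
val actual   = fileSchema.fields.map(f => (f.name, f.dataType))
require(expected.sameElements(actual),
  s"Table/file schema mismatch:\n$tableSchema\nvs\n$fileSchema")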
Hello Friends,
I’m a new committer and I’ve submitted my first patch, and I had some questions
about documentation standards. In my patch (JIRA below) I’ve added a config
parameter to adjust the number of records shown when a user calls .show() on a
dataframe. I was hoping someone could double-check
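For context, .show() defaults to 20 rows and already takes an explicit count; the patch layers a config on top. The config key below is a placeholder, not necessarily the one in the JIRA:

// Existing API: explicit per-call override.
df.show(50)

// Hypothetical session-wide default via the new config
// (placeholder key; see the JIRA for the real name).
spark.conf.set("spark.sql.show.defaultNumRows", "50")
df.show()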