Re: Spark writing API

2023-08-16 Thread Andrew Melo
like with arrow's off-heap storage), it's crazy inefficient to try and do the equivalent of realloc() to grow the buffer size. Thanks Andrew > On Mon, Aug 7, 2023 at 8:27 PM Steve Loughran > wrote: > >> >> >> On Thu, 1 Jun 2023 at 00:58, Andrew Melo wrote: >>

Re: Spark writing API

2023-08-02 Thread Andrew Melo
Hello Spark Devs Could anyone help me with this? Thanks, Andrew On Wed, May 31, 2023 at 20:57 Andrew Melo wrote: > Hi all > > I've been developing for some time a Spark DSv2 plugin "Laurelin" ( > https://github.com/spark-root/laurelin > ) to read the ROOT (https

Spark writing API

2023-05-31 Thread Andrew Melo
e to bring the API to parity? Or instead is it just a YMMV commitment? Thanks! Andrew -- It's dark in this basement.

Re: Spark on Kube (virtual) coffee/tea/pop times

2023-02-07 Thread Andrew Melo
I'm Central US time (AKA UTC -6:00) On Tue, Feb 7, 2023 at 5:32 PM Holden Karau wrote: > > Awesome, I guess I should have asked folks for timezones that they’re in. > > On Tue, Feb 7, 2023 at 3:30 PM Andrew Melo wrote: >> >> Hello Holden, >> >> We are inter

Re: Spark on Kube (virtual) coffee/tea/pop times

2023-02-07 Thread Andrew Melo
Hello Holden, We are interested in Spark on k8s and would like the opportunity to speak with devs about what we're looking for slash better ways to use spark. Thanks! Andrew On Tue, Feb 7, 2023 at 5:24 PM Holden Karau wrote: > > Hi Folks, > > It seems like we could maybe use som

Re: Apache Spark 3.2.2 Release?

2022-07-07 Thread Andrew Ray
+1 (non-binding) Thanks! On Thu, Jul 7, 2022 at 7:00 AM Yang,Jie(INF) wrote: > +1 (non-binding) Thank you Dongjoon ~ > > > > *From:* Ruifeng Zheng > *Date:* Thursday, July 7, 2022, 16:28 > *To:* dev > *Subject:* Re: Apache Spark 3.2.2 Release? > > > > +1 thank you Dongjoon! > > >

Re: Apache Spark 3.3 Release

2022-03-16 Thread Andrew Melo
have been sitting around for a while, but these are really important to our use-cases, and it would be nice to have them merged in. Cheers Andrew On Wed, Mar 16, 2022 at 6:21 PM Holden Karau wrote: > > I'd like to add/backport the logging in > https://github.com/apache/spark/pull/35881 PR

Re: Time to start publishing Spark Docker Images?

2021-08-17 Thread Andrew Melo
Hi Mich, By default, pip caches downloaded binaries to somewhere like $HOME/.cache/pip. So after doing any "pip install", you'll want to either delete that directory, or pass the "--no-cache-dir" option to pip to prevent the downloaded binaries from being added to the image.
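
A minimal sketch of that in a Dockerfile (base image and package names here are placeholders, not from the thread):

    FROM python:3.9-slim
    # --no-cache-dir keeps the downloaded wheels out of this image layer
    RUN pip install --no-cache-dir pyspark numpy
    # equivalently, purge the cache in the same RUN step so it never lands in a layer:
    # RUN pip install pyspark numpy && rm -rf /root/.cache/pip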

Re: Time to start publishing Spark Docker Images?

2021-08-17 Thread Andrew Melo
Silly Q, did you blow away the pip cache before committing the layer? That always trips me up. Cheers Andrew On Tue, Aug 17, 2021 at 10:56 Mich Talebzadeh wrote: > With no additional python packages etc we get 1.4GB compared to 2.19GB > before > > REPOSITO

Re: WholeStageCodeGen + DSv2

2021-05-19 Thread Andrew Melo
> Thanks, > Shubham > > On Wed, May 19, 2021 at 5:34 PM Takeshi Yamamuro > wrote: >> >> hi, Andrew, >> >> Welcome any improvement proposal for that. >> Could you file an issue in jira first to show us your idea and an example >> query >> to r

WholeStageCodeGen + DSv2

2021-05-18 Thread Andrew Melo
act like 'c1', 'c2', etc. Would there be any interest in accepting a patch that shortens these variable names to try and stay under the limit? Thanks Andrew

Secrets store for DSv2

2021-05-18 Thread Andrew Melo
! Andrew

Re: [DISCUSS] Support pandas API layer on PySpark

2021-03-16 Thread Andrew Melo
to performantly interchange jagged/ragged lists to/from python UDFs. Cheers Andrew On Tue, Mar 16, 2021 at 8:15 PM Hyukjin Kwon wrote: > > Thank you guys for all your feedback. I will start working on SPIP with > Koalas team. > I would expect the SPIP can be sent late this week or ear

Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-16 Thread Andrew Melo
Hello Ryan, This proposal looks very interesting. Would future goals for this functionality include both support for aggregation functions, as well as support for processing ColumnarBatch-es (instead of Row/InternalRow)? Thanks Andrew On Mon, Feb 15, 2021 at 12:44 PM Ryan Blue wrote: > >

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2020-06-24 Thread Andrew Melo
e, so they work in their own personal python virtual environments. There are no "release artifacts" to publish, per se, since each user is working independently and can install whatever they'd like into their virtual environments. Cheers Andrew > > On Wed, Jun 24, 2020 at 12:11 PM Wen

Re: DSv2 & DataSourceRegister

2020-04-16 Thread Andrew Melo
Hi again, Does anyone have thoughts on either the idea or the implementation? Thanks, Andrew On Thu, Apr 9, 2020 at 11:32 PM Andrew Melo wrote: > > Hi all, > > I've opened a WIP PR here https://github.com/apache/spark/pull/28159 > I'm a novice at Scala, so I'm sure the code

Re: DSv2 & DataSourceRegister

2020-04-09 Thread Andrew Melo
: spark.read.format("root").option("tree","tvec").load("stdvector.root") Additionally, I did a very rough POC for spark2.4, which you can find at https://github.com/PerilousApricot/spark/tree/feature/registerv2-24 . The same jar/inputfile works there as well. T

Re: DSv2 & DataSourceRegister

2020-04-08 Thread Andrew Melo
o `DataSourceV2`? You're right, that was a typo. Since the whole point is to separate the (stable) registration interface from the (evolving) DSv2 API, it defeats the purpose to then directly reference the DSv2 API within the registration interface. I'll put together a PR. Thanks again, Andrew

Re: DSv2 & DataSourceRegister

2020-04-07 Thread Andrew Melo
n from META-INF and pass in the full class name to the DataFrameReader. Thanks Andrew > On Wed, Apr 8, 2020 at 6:58 AM Andrew Melo wrote: > >> Hi Ryan, >> >> On Tue, Apr 7, 2020 at 5:21 PM Ryan Blue wrote: >> > >> > Hi Andrew, >> > >>
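
For reference, the registration mechanism discussed here is Java's ServiceLoader: a provider lists its implementation class in a services file, and shortName() is what users pass to .format(...). A hedged sketch with a hypothetical class name (a real source would also implement one of the provider interfaces):

    # src/main/resources/META-INF/services/org.apache.spark.sql.sources.DataSourceRegister
    com.example.RootDataSource

    // Minimal Scala sketch of the registered class
    import org.apache.spark.sql.sources.DataSourceRegister

    class RootDataSource extends DataSourceRegister {
      // lets users write spark.read.format("root") instead of the full class name
      override def shortName(): String = "root"
    }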

Re: DSv2 & DataSourceRegister

2020-04-07 Thread Andrew Melo
Hi Ryan, On Tue, Apr 7, 2020 at 5:21 PM Ryan Blue wrote: > > Hi Andrew, > > With DataSourceV2, I recommend plugging in a catalog instead of using > DataSource. As you've noticed, the way that you plug in data sources isn't > very flexible. That's one of the reasons why we

DSv2 & DataSourceRegister

2020-04-07 Thread Andrew Melo
& format regardless of where they were executing their code. If no services were registered with this new trait, the functionality would remain the same as before. I think this functionality will be useful as DSv2 continues to evolve, please let me know your

Re: [DISCUSS] Remove multiple workers on the same host support from Standalone backend

2020-03-14 Thread Andrew Melo
dle and the desire to increase utilization. Thanks Andrew Sean > > On Fri, Mar 13, 2020 at 6:33 PM Andrew Melo wrote: > > > > Hi Xingbo, Sean, > > > > On Fri, Mar 13, 2020 at 12:31 PM Xingbo Jiang > wrote: > >> > >> Andrew, could you provide mor

Re: [DISCUSS] Remove multiple workers on the same host support from Standalone backend

2020-03-13 Thread Andrew Melo
Hi Xingbo, Sean, On Fri, Mar 13, 2020 at 12:31 PM Xingbo Jiang wrote: > Andrew, could you provide more context of your use case please? Is it like > you deploy homogeneous containers on hosts with available resources, and > each container launches one worker? Or you deploy workers

Re: [DISCUSS] Remove multiple workers on the same host support from Standalone backend

2020-03-13 Thread Andrew Melo
end up with >1 worker per host. If I understand correctly, this proposal would make our use case unsupported. Thanks, Andrew > Thanks! > > Xingbo > -- It's dark in this basement.
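
For context, the multi-worker setup under discussion is what standalone mode exposes via conf/spark-env.sh today (values here are illustrative):

    # conf/spark-env.sh -- launch two workers per host, each with its own slice
    SPARK_WORKER_INSTANCES=2
    SPARK_WORKER_CORES=8
    SPARK_WORKER_MEMORY=32g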

Re: Enabling push-based shuffle in Spark

2020-01-27 Thread Long, Andrew
The easiest would be to create a fork of the code in github. I can also accept diffs. Cheers Andrew From: Min Shen Date: Monday, January 27, 2020 at 12:48 PM To: "Long, Andrew" , "dev@spark.apache.org" Subject: Re: Enabling push-based shuffle in Spark Hi Andrew, We

Re: How to implement a "saveAsBinaryFile" function?

2020-01-16 Thread Long, Andrew
Hey Bing, There are a couple of different approaches you could take. The quickest and easiest would be to use the existing APIs: val bytes = spark.range(1000); bytes.foreachPartition(bytes => { // WARNING: anything used in here will need to be serializable. // There's some magic to serializing the
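
A fuller sketch of that first approach, assuming an active SparkSession named spark and a hypothetical local output directory (this is an illustration in the same spirit, not the snippet from the thread):

    import java.io.{DataOutputStream, File, FileOutputStream}
    import org.apache.spark.TaskContext

    val ds = spark.range(1000)
    ds.rdd.foreachPartition { rows =>
      // runs on the executors; everything captured here must be serializable
      val dir = new File("/tmp/binary-out") // hypothetical path
      dir.mkdirs()
      val out = new DataOutputStream(
        new FileOutputStream(new File(dir, s"part-${TaskContext.getPartitionId()}")))
      try rows.foreach(v => out.writeLong(v))
      finally out.close()
    }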

Re: SortMergeJoinExec: Utilizing child partitioning when joining

2020-01-07 Thread Long, Andrew
existing queries. For example regressing a couple queries by 5% is fine BUT causing a query that would have previously run to crash is not ok. Additionally we have a sample of user queries and ETL processes that we try not to break either. Cheers Andrew From: Brett Marcott Date: Tuesday

Re: SortMergeJoinExec: Utilizing child partitioning when joining

2020-01-02 Thread Long, Andrew
(…)) left/right outer does look surprising though. You should see something like… left.execute().zipPartitions(right.execute()) { (leftIter, rightIter) => Cheers Andrew From: Brett Marcott Date: Tuesday, December 31, 2019 at 11:49 PM To: "dev@spark.apache.org" Subject: Sort

Re: how to get partition column info in Data Source V2 writer

2019-12-17 Thread Andrew Melo
>2.4 and then again for 2.4->3.0. Cheers Andrew > > > Thanks, > Aakash >

CR for adding bucket join support to V2 Datasources

2019-11-18 Thread Long, Andrew
ClusteredDistribution to make specifying ClusteredDistribution easier. https://github.com/apache/spark/pull/26511 Cheers Andrew

Re: DSv2 reader lifecycle

2019-11-06 Thread Andrew Melo
Hi Ryan, Thanks for the pointers On Thu, Nov 7, 2019 at 8:13 AM Ryan Blue wrote: > Hi Andrew, > > This is expected behavior for DSv2 in 2.4. A separate reader is configured > for each operation because the configuration will change. A count, for > example, doesn't need to proj

DSv2 reader lifecycle

2019-11-05 Thread Andrew Melo
sive to deserialize all the various metadata, so I was holding the deserialized version in the DataSourceReader, but if Spark is repeatedly constructing new ones, then that doesn't help. If this is the expected behavior, how should I handle this as a consumer of the API? Thanks! Andrew

Re: Exposing functions to pyspark

2019-10-08 Thread Andrew Melo
Hello again, Is it possible to grab a handle to the underlying DataSourceReader backing a DataFrame? I see that there's no nice way to add extra methods to Dataset, so being able to grab the DataSource backing the dataframe would be a good escape hatch. Cheers Andrew On Mon, Sep 30, 2019 at 3

Exposing functions to pyspark

2019-09-30 Thread Andrew Melo
') \ .option("tree", "tree") \ .load('small-flat-tree.root') They don't have a reference to any of my DS objects -- "df" is a DataFrame object, which I don't own. Does an

Re: Thoughts on Spark 3 release, or a preview release

2019-09-13 Thread Andrew Melo
ere's a couple little API things I think could be useful, I've just not had time to write here/open a JIRA about. Thanks Andrew > On Fri, Sep 13, 2019 at 12:26 PM Dongjoon Hyun > wrote: > >> Ur, Sean. >> >> I prefer a full release like 2.0.0-preview. >>

Fwd: Custom aggregations: modular and lightweight solutions?

2019-08-21 Thread Andrew Leverentz
mple that I think illustrates the core stumbling block I'm running into. Thanks, ~ Andrew -- Forwarded message - From: Andrew Leverentz Date: Tue, Aug 13, 2019 at 12:59 PM Subject: Re: Custom aggregations: modular and lightweight solutions? To: Here's a simpler example that I t

Timeline for Spark 3.0

2019-06-28 Thread Long, Andrew
Hey Friends, Is there a timeline for Spark 3.0 in terms of the first RC and final release? Cheers Andrew

DSV2 API Question

2019-06-25 Thread Andrew Melo
Hello, I've (nearly) implemented a DSV2-reader interface to read particle physics data stored in the ROOT (https://root.cern.ch/) file format. You can think of these ROOT files as roughly parquet-like: column-wise and nested (i.e. a column can be of type "float[]", meaning each row in the column

Re: Detect executor core count

2019-06-18 Thread Andrew Melo
hat in any case, there should be a SparkSession available if I'm in the executor context, so I can fallback to something sensible just in case. Thanks for the help, everyone! > On Tue, Jun 18, 2019 at 8:13 PM Ilya Matiach > wrote: > >> Hi Andrew, >> >> I tried to do

Detect executor core count

2019-06-18 Thread Andrew Melo
Hello, Is there a way to detect the number of cores allocated for an executor within a java-based InputPartitionReader? Thanks! Andrew
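
One workaround (an assumption on my part, not an official API — SparkEnv is a DeveloperApi and may change) is to read the executor-side SparkConf from inside the task:

    import org.apache.spark.SparkEnv

    // spark.executor.cores is only present when the deployment sets it,
    // hence the fallback value
    val coresPerExecutor: Int = SparkEnv.get.conf.getInt("spark.executor.cores", 1)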

Re: DataSourceV2Reader Q

2019-05-21 Thread Andrew Melo
hich I was improperly passing in instead of Metadata.empty() Thanks again, Andrew > > On Tue, May 21, 2019 at 11:39 AM Andrew Melo wrote: >> >> Hello, >> >> I'm developing a DataSourceV2 reader for the ROOT (https://root.cern/) >> file format to replace a previous DSV

DataSourceV2Reader Q

2019-05-21 Thread Andrew Melo
and the constructors. I followed the pattern from "JavaBatchDataSourceV2.java" -- is it possible that test-case isn't up to date? Are there any other example Java DSV2 readers out in the wild I could compare against? Thanks! Andrew [1] java.lang.NullPointe

Bucketing and catalyst

2019-05-02 Thread Long, Andrew
Hey Friends, How aware of bucketing is Catalyst? I've been trying to piece together how Catalyst knows that it can remove a sort and shuffle given that both tables are bucketed and sorted the same way. Are there any classes in particular I should look at? Cheers Andrew

Re: Stage 152 contains a task of very large size (12747 KB). The maximum recommended task size is 100 KB

2019-05-01 Thread Long, Andrew
It turned out that I was unintentionally shipping multiple copies of the Hadoop config to every partition in an RDD. >.< I was able to debug this by setting a break point on the warning message and inspecting the partition object itself. Cheers Andrew From: Russell Spitzer Date: Th
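
The usual fix for that pattern is to broadcast the config once rather than letting each task closure capture its own copy. A sketch with a hand-rolled wrapper (Spark carries a similar class internally; sc and rdd are assumed to be in scope):

    import java.io.{ObjectInputStream, ObjectOutputStream}
    import org.apache.hadoop.conf.Configuration

    // Configuration is Writable but not Serializable, so wrap it
    class SerializableHadoopConf(@transient var value: Configuration) extends Serializable {
      private def writeObject(out: ObjectOutputStream): Unit = {
        out.defaultWriteObject()
        value.write(out)
      }
      private def readObject(in: ObjectInputStream): Unit = {
        in.defaultReadObject()
        value = new Configuration(false)
        value.readFields(in)
      }
    }

    val confBc = sc.broadcast(new SerializableHadoopConf(sc.hadoopConfiguration))
    rdd.foreachPartition { _ =>
      val conf = confBc.value.value // one deserialized copy per executor, not per task
    }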

FW: Stage 152 contains a task of very large size (12747 KB). The maximum recommended task size is 100 KB

2019-04-23 Thread Long, Andrew
org.apache.spark.internal.Logging$class TaskSetManager: Stage 152 contains a task of very large size (12747 KB). The maximum recommended task size is 100 KB Cheers Andrew

Sort order in bucketing in a custom datasource

2019-04-16 Thread Long, Andrew
Hey Friends, Is it possible to specify the sort order or bucketing in a way that can be used by the optimizer in spark? Cheers Andrew

Which parts of a parquet read happen on the driver vs the executor?

2019-04-11 Thread Long, Andrew
ion//relation.sparkSession.sessionState.newHadoopConfWithOptions(relation.options)) ) import scala.collection.JavaConverters._ val i: Iterator[Any] = readFile(pFile) val rows = i.flatMap(_ match { case r: InternalRow => Seq(r) case b: ColumnarBatch => b.rowIterator().asScala }) rows } Cheers Andrew

DataSourceV2 exceptions

2019-04-08 Thread Andrew Melo
to Spark? The DSV2 interface (unless I'm misreading it) doesn't specify any checked exceptions that can be thrown in the DS, so should I instead catch/rethrow any exceptions as unchecked exceptions? If so, is there a recommended hierarchy to throw from? thanks! Andrew

Re: [ANNOUNCE] Announcing Apache Spark 2.4.1

2019-04-05 Thread Andrew Melo
On Fri, Apr 5, 2019 at 9:41 AM Jungtaek Lim wrote: > > Thanks Andrew for reporting this. I just submitted the fix. > https://github.com/apache/spark/pull/24304 Thanks! > > On Fri, Apr 5, 2019 at 3:21 PM Andrew Melo wrote: >> >> Hello, >> >> I'm not sur

Re: [ANNOUNCE] Announcing Apache Spark 2.4.1

2019-04-05 Thread Andrew Melo
the 2.4.0 version of the docs renders correctly. Cheers Andrew On Fri, Apr 5, 2019 at 7:59 AM DB Tsai wrote: > > +user list > > We are happy to announce the availability of Spark 2.4.1! > > Apache Spark 2.4.1 is a maintenance release, based on the branch-2.4 > maintenan

Re: Manually reading parquet files.

2019-03-21 Thread Long, Andrew
la:305) at com.amazon.horizon.azulene.ParquetReadTests$$anonfun$2.apply(ParquetReadTests.scala:100) at com.amazon.horizon.azulene.ParquetReadTests$$anonfun$2.apply(ParquetReadTests.scala:100) From: Ryan Blue Reply-To: "rb...@netflix.com" Date: Thursday, March 21, 2019 at 3:32 PM T

Manually reading parquet files.

2019-03-21 Thread Long, Andrew
st([0,1,5b,24,66647361]) //??this is wrong I think Has anyone attempted something similar? Cheers Andrew

Re: SPIP: Accelerator-aware Scheduling

2019-03-01 Thread Andrew Melo
we can still > add Mesos support in the future if we observe valid use cases. First time caller, long time listener. We have GPUs in our Mesos-based Spark cluster, and it would be nice to use them with Spark-based GPU-enabled frameworks (our use case is deep learning applications). Cheers And

Re: Feature request: split dataset based on condition

2019-02-04 Thread Andrew Melo
e'll need to calculate the sum of their 4-d momenta, while samples with <2 electrons will need to subtract two different physical quantities -- several more steps before we get to the point where we'll histogram the different subsamples for the outputs. Cheers Andrew > > On Mon, Feb 4, 2019 at

Re: Feature request: split dataset based on condition

2019-02-04 Thread Andrew Melo
Hello Ryan, On Mon, Feb 4, 2019 at 10:52 AM Ryan Blue wrote: > > Andrew, can you give us more information about why partitioning the output > data doesn't work for your use case? > > It sounds like all you need to do is to create a table partitioned by A and > B, then you w

Re: Feature request: split dataset based on condition

2019-02-04 Thread Andrew Melo
outputs (AB, !AB, A!B, !A!B). As we add more conditions, the combinatorics explode like 2^n, when we could produce them all up front with this "multi filter" (or however it would be called). Cheers Andrew > > Best Regards > Moein > > -- > > Moein Hosseini > Data Engi
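
Concretely, each subsample today is its own filter and thus its own pass over the input; with n conditions that is 2^n scans once the results are materialized (column names below are hypothetical):

    val a = df("n_electrons") >= 2
    val b = df("n_muons") >= 2
    // four separate scans of the same data once each result is acted on --
    // exactly what a single-pass "multi filter" would collapse into one
    val ab   = df.filter(a && b)
    val aNb  = df.filter(a && !b)
    val nAb  = df.filter(!a && b)
    val nAnb = df.filter(!a && !b)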

Re: Spark data quality bug when reading parquet files from hive metastore

2018-09-07 Thread Long, Andrew
Thanks Fokko, I will definitely take a look at this. Cheers Andrew From: "Driesprong, Fokko" Date: Friday, August 24, 2018 at 2:39 AM To: "reubensaw...@hotmail.com" Cc: "dev@spark.apache.org" Subject: Re: Spark data quality bug when reading parquet files f

Re: SparkContext singleton get w/o create?

2018-08-27 Thread Andrew Melo
Hi, I'm a long-time listener, first-time committer to spark, so this is good to get my feet wet. I'm particularly interested in SPARK-23836, which is an itch I may want to dive into and scratch myself in the next month or so since it's pretty painful for our use-case. Thanks! Andrew On Mon, Aug

Re: SparkContext singleton get w/o create?

2018-08-27 Thread Andrew Melo
Hi Holden, I'm agnostic to the approach (though it seems cleaner to have an explicit API for it). If you would like, I can take that JIRA and implement it (should be a 3-line function). Cheers Andrew On Mon, Aug 27, 2018 at 2:14 PM, Holden Karau wrote: > Seems reasonable. We should proba

Re: SparkContext singleton get w/o create?

2018-08-27 Thread Andrew Melo
Hello Sean, others - Just to confirm, is it OK for client applications to access SparkContext._active_spark_context, if it wraps the accesses in `with SparkContext._lock:`? If that's acceptable to Spark, I'll implement the modifications in the Jupyter extensions. thanks! Andrew On Tue, Aug 7

Spark data quality bug when reading parquet files from hive metastore

2018-08-22 Thread Long, Andrew
had any suggestions for where to start looking in the spark code. Cheers Andrew

Re: SparkContext singleton get w/o create?

2018-08-07 Thread Andrew Melo
; and subject to change. Is that something I should be unconcerned about? The other thought is that the accesses with SparkContext are protected by "SparkContext._lock" -- should I also use that lock? Thanks for your help! Andrew > > On Tue, Aug 7, 2018 at 5:34 PM Andr

Re: SparkContext singleton get w/o create?

2018-08-07 Thread Andrew Melo
ion and causing a JVM to start. Is there an easy way to call getActiveSession that doesn't start a JVM? Cheers Andrew > > On Tue, Aug 7, 2018 at 5:11 PM Andrew Melo wrote: >> >> Hello, >> >> One pain point with various Jupyter extensions [1][2] that provide >&

SparkContext singleton get w/o create?

2018-08-07 Thread Andrew Melo
but this would be my first contribution to Spark, and I want to make sure my plan was kosher before I implemented it. Thanks! Andrew [1] https://krishnan-r.github.io/sparkmonitor/ [2] https://github.com/mozilla/jupyter-spark --

Feedback on first commit + jira issue I opened

2018-05-31 Thread Long, Andrew
check my small diff to make sure I wasn’t making any rookie mistakes before I submit a pull request. https://issues.apache.org/jira/browse/SPARK-24442 Cheers Andrew

Re: [VOTE] Spark 2.3.0 (RC2)

2018-02-01 Thread Andrew Ash
e on RC3 -- SPARK-23274 > <https://issues.apache.org/jira/browse/SPARK-23274> was resolved > yesterday and tests have been quite healthy throughout this week and the > last. I'll cut the new RC as soon as the remaining blocker (SPARK-23202 > <https://issues.apache.org/jira/browse/SP

Re: [VOTE] Spark 2.3.0 (RC2)

2018-01-30 Thread Andrew Ash
I'd like to nominate SPARK-23274 as a potential blocker for the 2.3.0 release as well, due to being a regression from 2.2.0. The ticket has a simple repro included, showing a query that works in prior releases but now fails with an exception in

Re: Kubernetes: why use init containers?

2018-01-12 Thread Andrew Ash
ners, you'll need to mess with the > >> configuration so that spark-submit never sees spark.jars / > >> spark.files, so it doesn't trigger its dependency download code. (YARN > >> does something similar, btw.) That will surely mean different changes > >> in the curre

Re: Kubernetes: why use init containers?

2018-01-10 Thread Andrew Ash
It seems we have two standard practices for resource distribution in place here: - the Spark way is that the application (Spark) distributes the resources *during* app execution, and does this by exposing files/jars on an http server on the driver (or pre-staged elsewhere), and executors

Re: Palantir replease under org.apache.spark?

2018-01-09 Thread Andrew Ash
That source repo is at https://github.com/palantir/spark/ with artifacts published to Palantir's bintray at https://palantir.bintray.com/releases/org/apache/spark/ If you're seeing any of them in Maven Central please flag, as that's a mistake! Andrew On Tue, Jan 9, 2018 at 10:10 AM, Sean Owen

Re: Leveraging S3 select

2017-12-08 Thread Andrew Duffy
Hey Steve, Happen to have a link to the TPC-DS benchmark data w/random S3 reads? I've done a decent amount of digging, but all I've found is a reference in a slide deck and some jira tickets. From: Steve Loughran Date: Tuesday, December 5, 2017 at 09:44 To: "Lalwani,

Faster and Lower memory implementation toPandas

2017-11-16 Thread Andrew Andrade
y partition (with 100 partitions) had an overhead of 76.30 MM and took almost half of the time to run. I realize that Arrow solves this but the modification is quite small and would greatly assist anyone who isn't able to use Arrow. Would a PR [1] from me to address this issue be welcome? Thanks, An

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path

2017-09-07 Thread Andrew Ash
be an unequivocal +1. Right now it feels like this SPIP is focused more on getting the basics right for what many datasources are already doing in API V1 combined with other private APIs, vs pushing forward state of the art for performance. Andrew On Wed, Sep 6, 2017 at 10:56 PM, Suresh Thalamati

Re: SPIP: Spark on Kubernetes

2017-08-15 Thread Andrew Ash
+1 (non-binding) We're moving large amounts of infrastructure from a combination of open source and homegrown cluster management systems to unify on Kubernetes and want to bring Spark workloads along with us. On Tue, Aug 15, 2017 at 2:29 PM, liyinan926 wrote: > +1

Re: Use Apache ORC in Apache Spark 2.3

2017-08-10 Thread Andrew Ash
@Reynold no I don't use the HiveCatalog -- I'm using a custom implementation of ExternalCatalog instead. On Thu, Aug 10, 2017 at 3:34 PM, Dong Joon Hyun <dh...@hortonworks.com> wrote: > Thank you, Andrew and Reynold. > > > > Yes, it will reduce the old Hive dependency even

Re: Use Apache ORC in Apache Spark 2.3

2017-08-10 Thread Andrew Ash
I would support moving ORC from sql/hive -> sql/core because it brings me one step closer to eliminating Hive from my Spark distribution by removing -Phive at build time. On Thu, Aug 10, 2017 at 9:48 AM, Dong Joon Hyun wrote: > Thank you again for coming and reviewing

Re: [VOTE] Apache Spark 2.2.0 (RC1)

2017-04-28 Thread Andrew Ash
issues.apache.org/jira/browse/SPARK-20364> and has a PR fix up at PR #17680 <https://github.com/apache/spark/pull/17680>. I nominate SPARK-20364 <https://issues.apache.org/jira/browse/SPARK-20364> as a release blocker due to the data correctness regression. Thanks! Andrew On Thu

Re: Scala left join with multiple columns Join condition is missing or trivial. Use the CROSS JOIN syntax to allow cartesian products between these relations.

2017-04-03 Thread Andrew Ray
You probably don't want null safe equals (<=>) with a left join. On Mon, Apr 3, 2017 at 5:46 PM gjohnson35 wrote: > The join condition with && is throwing an exception: > > val df = baseDF.join(mccDF, mccDF("medical_claim_id") <=> > baseDF("medical_claim_id") >
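
The reason: <=> treats NULL as equal to NULL, so with a left join, rows with NULL keys on both sides match each other instead of getting the usual NULL padding. A plain === is normally what's wanted (frame and column names taken from the question):

    val df = baseDF.join(
      mccDF,
      baseDF("medical_claim_id") === mccDF("medical_claim_id"),
      "left_outer")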

Re: Broadcast big dataset

2016-09-28 Thread Andrew Duffy
Have you tried upping executor memory? There's a separate spark conf for that: spark.executor.memory In general driver configurations don't automatically apply to executors. On Wed, Sep 28, 2016 at 7:03 AM -0700, "WangJianfei" wrote: Hi Devs In

Re: What's the use of RangePartitioner.hashCode

2016-09-21 Thread Andrew Duffy
Pedantic note about hashCode and equals: the equality doesn't need to be bidirectional; you just need to ensure that a.hashCode == b.hashCode when a.equals(b). The bidirectional case is usually harder to satisfy due to the possibility of collisions. Good info:
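
In code form, the contract is one-directional (a minimal Scala illustration):

    case class Point(x: Int, y: Int)
    val a = Point(1, 2); val b = Point(1, 2)
    // required: equal objects share a hash code
    assert(a.equals(b) && a.hashCode == b.hashCode)
    // not required the other way: collisions are legal
    assert("Aa".hashCode == "BB".hashCode && !"Aa".equals("BB"))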

master snapshots not publishing?

2016-07-21 Thread Andrew Duffy
-snapshots/ Looking at the Jenkins page it says that the master-maven build is disabled, is this purposeful? -Andrew

Re: [DISCUSS] Removing or changing maintainer process

2016-05-19 Thread Andrew Or
+1, some maintainers are hard to find 2016-05-19 9:03 GMT-07:00 Imran Rashid : > +1 (binding) on removal of maintainers > > I dont' have a strong opinion yet on how to have a system for finding the > right reviewers. I agree it would be nice to have something to help you >

Re: HDFS as Shuffle Service

2016-04-28 Thread Andrew Ray
Yes, HDFS has serious problems with creating lots of files. But we can always just create a single merged file on HDFS per task. On Apr 28, 2016 11:17 AM, "Reynold Xin" wrote: Hm while this is an attractive idea in theory, in practice I think you are substantially

Re: SparkSQL - Limit pushdown on BroadcastHashJoin

2016-04-18 Thread Andrew Ray
While you can't automatically push the limit *through* the join, we could push it *into* the join (stop processing after generating 10 records). I believe that is what Rajesh is suggesting. On Tue, Apr 12, 2016 at 7:46 AM, Herman van Hövell tot Westerflier < hvanhov...@questtec.nl> wrote: > I am

Re: [discuss] ending support for Java 7 in Spark 2.0

2016-03-25 Thread Andrew Ray
for the future (Java 9, Scala 2.12) and Spark 2.0 is the only chance we are going to have to do so for a long time. --Andrew On Thu, Mar 24, 2016 at 10:55 PM, Mridul Muralidharan <mri...@gmail.com> wrote: > > I do agree w.r.t scala 2.10 as well; similar arguments apply (though there &g

Re: [discuss] ending support for Java 7 in Spark 2.0

2016-03-24 Thread Andrew Ash
Spark 2.x has to be the time for Java 8. I'd rather increase JVM major version on a Spark major version than on a Spark minor version, and I'd rather Spark do that upgrade for the 2.x series than the 3.x series (~2yr from now based on the lifetime of Spark 1.x). If we wait until the next

Re: java.lang.OutOfMemoryError: Unable to acquire bytes of memory

2016-03-21 Thread Andrew Or
@Nezih, can you try again after setting `spark.memory.useLegacyMode` to true? Can you still reproduce the OOM that way? 2016-03-21 10:29 GMT-07:00 Nezih Yigitbasi : > Hi Spark devs, > I am using 1.6.0 with dynamic allocation on yarn. I am trying to run a >
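
For anyone following along, that flag is a Spark 1.6-era setting and can go on the submit command line or the SparkConf:

    // spark-submit --conf spark.memory.useLegacyMode=true ...
    val conf = new org.apache.spark.SparkConf()
      .set("spark.memory.useLegacyMode", "true") // revert to the pre-1.6 static memory manager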

Re: [VOTE] Release Apache Spark 1.6.1 (RC1)

2016-03-08 Thread Andrew Or
+1 2016-03-08 10:59 GMT-08:00 Yin Huai : > +1 > > On Mon, Mar 7, 2016 at 12:39 PM, Reynold Xin wrote: > >> +1 (binding) >> >> >> On Sun, Mar 6, 2016 at 12:08 PM, Egor Pahomov >> wrote: >> >>> +1 >>> >>> Spark ODBC server is

Re: Welcoming two new committers

2016-02-08 Thread Andrew Or
Welcome! 2016-02-08 10:55 GMT-08:00 Bhupendra Mishra : > Congratulations to both. and welcome to group. > > On Mon, Feb 8, 2016 at 10:45 PM, Matei Zaharia > wrote: > >> Hi all, >> >> The PMC has recently added two new Spark committers --

Spark 1.6: Why Including hive-jdbc in assembly when -Phive-provided is set?

2016-02-03 Thread Andrew Lee
Hi All, I have a question regarding the hive-jdbc library that is being included in the assembly JAR. Build command: mvn -U -X -Phadoop-2.6 -Phadoop-provided -Phive-provided -Pyarn -Phive-thriftserver -Psparkr -DskipTests install In the pom.xml file, the scope for hive JARs is set to

Re: [VOTE] Release Apache Spark 1.6.0 (RC4)

2015-12-22 Thread Andrew Or
+1 2015-12-22 12:43 GMT-08:00 Reynold Xin : > +1 > > > On Tue, Dec 22, 2015 at 12:29 PM, Michael Armbrust > wrote: > >> I'll kick the voting off with a +1. >> >> On Tue, Dec 22, 2015 at 12:10 PM, Michael Armbrust < >> mich...@databricks.com> wrote:

Re: [VOTE] Release Apache Spark 1.6.0 (RC2)

2015-12-14 Thread Andrew Or
+1 Ran PageRank on standalone mode with 4 nodes and noticed a speedup after the specific commits that were in RC2 but not RC1: c247b6a Dec 10 [SPARK-12155][SPARK-12253] Fix executor OOM in unified memory management 05e441e Dec 9 [SPARK-12165][SPARK-12189] Fix bugs in eviction of storage memory

Re: Removing the Mesos fine-grained mode

2015-11-23 Thread Andrew Or
@Jerry Lam Can someone confirm if it is true that dynamic allocation on mesos "is > designed to run one executor per slave with the configured amount of > resources." I copied this sentence from the documentation. Does this mean > there is at most 1 executor per node? Therefore, if you have a

Re: Spark 1.4.2 release and votes conversation?

2015-11-16 Thread Andrew Lee
2015 1:31 PM To: Andrew Lee Cc: dev@spark.apache.org Subject: Re: Spark 1.4.2 release and votes conversation? In the interim, you can just build it off branch-1.4 if you want. On Fri, Nov 13, 2015 at 1:30 PM, Reynold Xin <r...@databricks.com> wrote: I actua

Spark 1.4.2 release and votes conversation?

2015-11-13 Thread Andrew Lee
Hi All, I'm wondering if Spark 1.4.2 had been voted by any chance or if I have overlooked and we are targeting 1.4.3? By looking at the JIRA https://issues.apache.org/jira/browse/SPARK/fixforversion/12332833/?selectedTab=com.atlassian.jira.jira-projects-plugin:version-summary-panel All

Re: Support for local disk columnar storage for DataFrames

2015-11-12 Thread Andrew Duffy
Relevant link: http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files On Wed, Nov 11, 2015 at 7:31 PM, Reynold Xin wrote: > Thanks for the email. Can you explain what the difference is between this > and existing formats such as Parquet/ORC? > > > On

Re: Concurrency issue in SQLExecution.withNewExecutionId

2015-09-10 Thread Andrew Or
@Olivier, did you use scala's parallel collections by any chance? If not, what form of concurrency were you using? 2015-09-10 13:01 GMT-07:00 Andrew Or <and...@databricks.com>: > Thanks for reporting this, I have filed > https://issues.apache.org/jira/browse/SPARK-10548. > > 2

Re: Concurrency issue in SQLExecution.withNewExecutionId

2015-09-10 Thread Andrew Or
Thanks for reporting this, I have filed https://issues.apache.org/jira/browse/SPARK-10548. 2015-09-10 9:09 GMT-07:00 Olivier Toupin : > Look at this code: > > >

Re: Flaky test in DAGSchedulerSuite?

2015-09-04 Thread Andrew Or
(merge into master, thanks for the quick fix Pete). 2015-09-04 15:58 GMT-07:00 Cheolsoo Park : > Thank you Pete! > > On Fri, Sep 4, 2015 at 1:40 PM, Pete Robbins wrote: > >> raised https://issues.apache.org/jira/browse/SPARK-10454 and PR >> >> On 4

Re: What is the difference between SlowSparkPullRequestBuilder and SparkPullRequestBuilder?

2015-07-22 Thread Andrew Or
. Functionally there is currently no difference; the latter came about recently in an ongoing experiment to make unit tests run faster. -Andrew 2015-07-21 22:47 GMT-07:00 Yu Ishikawa yuu.ishikawa+sp...@gmail.com: Hi all, When we send a PR, it seems that two requests to run tests are thrown
