Re: [VOTE] Spark 2.3.0 (RC2)

2018-02-01 Thread Andrew Ash
e on RC3 -- SPARK-23274 > <https://issues.apache.org/jira/browse/SPARK-23274> was resolved > yesterday and tests have been quite healthy throughout this week and the > last. I'll cut the new RC as soon as the remaining blocker (SPARK-23202 > <https://issues.apache.org/jira/browse/SP

Re: [VOTE] Spark 2.3.0 (RC2)

2018-01-30 Thread Andrew Ash
I'd like to nominate SPARK-23274 as a potential blocker for the 2.3.0 release as well, due to being a regression from 2.2.0. The ticket has a simple repro included, showing a query that works in prior releases but now fails with an exception in

Re: Kubernetes: why use init containers?

2018-01-12 Thread Andrew Ash
+1 on the first release being marked experimental. Many major features coming into Spark in the past have gone through a stabilization process On Fri, Jan 12, 2018 at 1:18 PM, Marcelo Vanzin wrote: > BTW I most probably will not have time to get back to this at any time >

Re: Kubernetes: why use init containers?

2018-01-10 Thread Andrew Ash
It seems we have two standard practices for resource distribution in place here: - the Spark way is that the application (Spark) distributes the resources *during* app execution, and does this by exposing files/jars on an http server on the driver (or pre-staged elsewhere), and executors

Re: Palantir release under org.apache.spark?

2018-01-09 Thread Andrew Ash
That source repo is at https://github.com/palantir/spark/ with artifacts published to Palantir's bintray at https://palantir.bintray.com/releases/org/apache/spark/ If you're seeing any of them in Maven Central please flag, as that's a mistake! Andrew On Tue, Jan 9, 2018 at 10:10 AM, Sean Owen

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path

2017-09-07 Thread Andrew Ash
+0 (non-binding) I think there are benefits to unifying all the Spark-internal datasources into a common public API for sure. It will serve as a forcing function to ensure that those internal datasources aren't advantaged vs datasources developed externally as plugins to Spark, and that all

Re: SPIP: Spark on Kubernetes

2017-08-15 Thread Andrew Ash
+1 (non-binding) We're moving large amounts of infrastructure from a combination of open source and homegrown cluster management systems to unify on Kubernetes and want to bring Spark workloads along with us. On Tue, Aug 15, 2017 at 2:29 PM, liyinan926 wrote: > +1

Re: Use Apache ORC in Apache Spark 2.3

2017-08-10 Thread Andrew Ash
tually, at least, ORC > codes. > > > > And, Spark without `-Phive` can ORC like Parquet. > > > > This is one milestone for `Feature parity for ORC with Parquet > (SPARK-20901)`. > > > > Bests, > > Dongjoon > > > > *From: *Reynold Xin <r.

Re: Use Apache ORC in Apache Spark 2.3

2017-08-10 Thread Andrew Ash
I would support moving ORC from sql/hive -> sql/core because it brings me one step closer to eliminating Hive from my Spark distribution by removing -Phive at build time. On Thu, Aug 10, 2017 at 9:48 AM, Dong Joon Hyun wrote: > Thank you again for coming and reviewing

Re: [VOTE] Apache Spark 2.2.0 (RC1)

2017-04-28 Thread Andrew Ash
-1 due to regression from 2.1.1 In 2.2.0-rc1 we bumped the Parquet version from 1.8.1 to 1.8.2 in commit 26a4cba3ff. Parquet 1.8.2 includes a backport from 1.9.0: PARQUET-389 in commit

Re: [discuss] ending support for Java 7 in Spark 2.0

2016-03-24 Thread Andrew Ash
Spark 2.x has to be the time for Java 8. I'd rather increase JVM major version on a Spark major version than on a Spark minor version, and I'd rather Spark do that upgrade for the 2.x series than the 3.x series (~2yr from now based on the lifetime of Spark 1.x). If we wait until the next

Re: [VOTE] Release Apache Spark 1.4.1

2015-06-25 Thread Andrew Ash
I would guess that many tickets targeted at 1.4.1 were set that way during the tail end of the 1.4.0 voting process as people realized they wouldn't make the .0 release in time. In that case, they were likely aiming for a 1.4.x release, not necessarily 1.4.1 specifically. Maybe creating a 1.4.x

Re: DataFrame.withColumn very slow when used iteratively?

2015-06-02 Thread Andrew Ash
Would it be valuable to create a .withColumns([colName], [ColumnObject]) method that adds in bulk rather than iteratively? Alternatively, effort might be better spent making the singular .withColumn() faster. On Tue, Jun 2, 2015 at 3:46 PM, Reynold Xin r...@databricks.com wrote: We improved this
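
A toy model (not Spark code) of why iterative withColumn can be slow: each call wraps the logical plan in one more projection, so N calls build an N-deep plan for the analyzer to walk, while a hypothetical bulk withColumns adds a single projection. All names here are illustrative.

```python
def with_column(plan, name):
    """Mimic DataFrame.withColumn: one new projection node per call."""
    return ("Project", [name], plan)

def with_columns(plan, names):
    """Hypothetical bulk variant: a single projection for all new columns."""
    return ("Project", list(names), plan)

def depth(plan):
    """Count nested projection nodes above the base relation."""
    d = 0
    while isinstance(plan, tuple):
        d += 1
        plan = plan[2]
    return d

base = "Scan"
iterative = base
for name in ["a", "b", "c", "d"]:
    iterative = with_column(iterative, name)  # plan grows each iteration
bulk = with_columns(base, ["a", "b", "c", "d"])

print(depth(iterative))  # 4 projections to analyze
print(depth(bulk))       # 1 projection
```

The cost difference in the real API is in plan analysis, not data movement, which is why the per-call overhead compounds when used in a loop.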

Re: Release Scala version vs Hadoop version (was: [VOTE] Release Apache Spark 1.3.0 (RC3))

2015-03-09 Thread Andrew Ash
Does the Apache project team have any ability to measure download counts of the various releases? That data could be useful when it comes time to sunset vendor-specific releases, like CDH4 for example. On Mon, Mar 9, 2015 at 5:34 AM, Mridul Muralidharan mri...@gmail.com wrote: In ideal

Re: Block Transfer Service encryption support

2015-03-08 Thread Andrew Ash
I'm interested in seeing this data transfer occurring over encrypted communication channels as well. Many customers require that all network transfer occur encrypted to prevent the soft underbelly that's often found inside a corporate network. On Fri, Mar 6, 2015 at 4:20 PM, turp1twin

Re: Streaming partitions to driver for use in .toLocalIterator

2015-02-24 Thread Andrew Ash
, though you would probably be unhappy with the overhead. On Wed, Feb 18, 2015 at 9:09 AM, Andrew Ash and...@andrewash.com wrote: Hi Spark devs, I'm creating a streaming export functionality for RDDs and am having some trouble with large partitions. The RDD.toLocalIterator() call

Streaming partitions to driver for use in .toLocalIterator

2015-02-18 Thread Andrew Ash
Hi Spark devs, I'm creating a streaming export functionality for RDDs and am having some trouble with large partitions. The RDD.toLocalIterator() call pulls over a partition at a time to the driver, and then streams the RDD out from that partition before pulling in the next one. When you have

Re: talk on interface design

2015-01-26 Thread Andrew Ash
In addition to the references you have at the end of the presentation, there's a great set of practical examples based on the learnings from Qt posted here: http://www21.in.tum.de/~blanchet/api-design.pdf Chapter 4's way of showing a principle and then an example from Qt is particularly

Re: Join implementation in SparkSQL

2015-01-15 Thread Andrew Ash
What Reynold is describing is a performance optimization in implementation, but the semantics of the join (cartesian product plus relational algebra filter) should be the same and produce the same results. On Thu, Jan 15, 2015 at 1:36 PM, Reynold Xin r...@databricks.com wrote: It's a bunch of

Maintainer for Mesos

2015-01-05 Thread Andrew Ash
Hi Spark devs, I'm interested in having a committer look at a PR [1] for Mesos, but there's not an entry for Mesos in the maintainers specialties on the wiki [2]. Which Spark committers have expertise in the Mesos features? Thanks! Andrew [1] https://github.com/apache/spark/pull/3074 [2]

Re: Announcing Spark Packages

2014-12-22 Thread Andrew Ash
Hi Xiangrui, That link is currently returning a 503 Over Quota error message. Would you mind pinging back out when the page is back up? Thanks! Andrew On Mon, Dec 22, 2014 at 12:37 PM, Xiangrui Meng men...@gmail.com wrote: Dear Spark users and developers, I’m happy to announce Spark

Re: More general submitJob API

2014-12-22 Thread Andrew Ash
Hi Alex, SparkContext.submitJob() is marked as experimental -- most client programs shouldn't be using it. What are you looking to do? For multiplexing jobs, one thing you can do is have multiple threads in your client JVM each submit jobs on your SparkContext. This is described here in
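
The multiplexing approach described above, sketched with plain threads: several client threads submit independent "jobs" against one shared context object, and each blocks only on its own result. `FakeContext` is a stand-in for a shared SparkContext, which is thread-safe for job submission.

```python
from concurrent.futures import ThreadPoolExecutor

class FakeContext:
    """Stand-in for a SparkContext: runs a job and returns its results."""
    def run_job(self, data, func):
        return [func(x) for x in data]

ctx = FakeContext()  # one context shared by all client threads

def client(job_id):
    # each thread submits its own job against the shared context
    return ctx.run_job(range(5), lambda x: x + job_id)

with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(client, [0, 100, 200]))

print(results[1])  # [100, 101, 102, 103, 104]
```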

Re: Spark JIRA Report

2014-12-15 Thread Andrew Ash
is intended to help with. Nick On Sun Dec 14 2014 at 2:49:00 AM Andrew Ash and...@andrewash.com wrote: The goal of increasing visibility on open issues is a good one. How is this different from just a link to Jira though? Some might say this adds noise to the mailing list and doesn't

Governance of the Jenkins whitelist

2014-12-13 Thread Andrew Ash
Jenkins is a really valuable tool for increasing quality of incoming patches to Spark, but I've noticed that there are often a lot of patches waiting for testing because they haven't been approved for testing. Certain users can instruct Jenkins to run on a PR, or add other users to a whitelist.

Re: Spark JIRA Report

2014-12-13 Thread Andrew Ash
The goal of increasing visibility on open issues is a good one. How is this different from just a link to Jira though? Some might say this adds noise to the mailing list and doesn't contain any information not already available in Jira. The idea seems good but the formatting leaves a little to

Re: Tachyon in Spark

2014-12-11 Thread Andrew Ash
I'm interested in understanding this as well. One of the main ways Tachyon is supposed to realize performance gains without sacrificing durability is by storing the lineage of data rather than full copies of it (similar to Spark). But if Spark isn't sending lineage information into Tachyon, then

Re: Is there a way for scala compiler to catch unserializable app code?

2014-11-16 Thread Andrew Ash
Hi Jay, I just came across SPARK-720 Statically guarantee serialization will succeed https://issues.apache.org/jira/browse/SPARK-720 which sounds like exactly what you're referring to. Like Reynold I think it's not possible at this time but it would be good to get your feedback on that ticket.

Re: Regarding RecordReader of spark

2014-11-16 Thread Andrew Ash
Filed as https://issues.apache.org/jira/browse/SPARK-4437 On Sun, Nov 16, 2014 at 4:49 PM, Reynold Xin r...@databricks.com wrote: I don't think the code is immediately obvious. Davies - I think you added the code, and Josh reviewed it. Can you guys explain and maybe submit a patch to add

Raise Java dependency from 6 to 7

2014-10-17 Thread Andrew Ash
Hi Spark devs, I've heard a few times that keeping support for Java 6 is a priority for Apache Spark. Given that Java 6 has been publicly EOL'd since Feb 2013 http://www.oracle.com/technetwork/java/eol-135779.html and the last public update was Apr 2013

Re: Spark on Mesos 0.20

2014-10-06 Thread Andrew Ash
Hi Gurvinder, Is there a SPARK ticket tracking the issue you describe? On Mon, Oct 6, 2014 at 2:44 AM, Gurvinder Singh gurvinder.si...@uninett.no wrote: On 10/06/2014 08:19 AM, Fairiz Azizi wrote: The Spark online docs indicate that Spark is compatible with Mesos 0.18.1 I've gotten it to

Re: Parquet schema migrations

2014-10-05 Thread Andrew Ash
Hi Cody, I wasn't aware there were different versions of the parquet format. What's the difference between raw parquet and the Hive-written parquet files? As for your migration question, the approaches I've often seen are convert-on-read and convert-all-at-once. Apache Cassandra for example

Re: [VOTE] Release Apache Spark 1.1.0 (RC2)

2014-08-29 Thread Andrew Ash
FWIW we use CDH4 extensively and would very much appreciate having a prebuilt version of Spark for it. We're doing a CDH 4.4 to 4.7 upgrade across all the clusters now and have plans for a 5.x transition after that. On Aug 28, 2014 11:57 PM, Sean Owen so...@cloudera.com wrote: On Fri, Aug 29,

Re: take() reads every partition if the first one is empty

2014-08-22 Thread Andrew Ash
Hi Paul, I agree that jumping straight from reading N rows from 1 partition to N rows from ALL partitions is pretty aggressive. The exponential growth strategy of doubling the partition count every time seems better -- 1, 2, 4, 8, 16, ... will be much more likely to prevent OOMs than the 1 - ALL
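
The doubling strategy discussed above, as a sketch: scan 1 partition first, then 2 more, then 4, and so on, stopping once enough rows are found, instead of jumping from 1 partition straight to all of them.

```python
def take(partitions, n):
    """Return the first n rows, scanning exponentially more partitions per round."""
    taken, scanned, batch = [], 0, 1
    while len(taken) < n and scanned < len(partitions):
        for part in partitions[scanned:scanned + batch]:
            taken.extend(part)
        scanned += batch
        batch *= 2           # 1, 2, 4, 8, ... partitions per round
    return taken[:n], scanned

# First partition empty; rows live in later partitions.
parts = [[], [1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
rows, scanned = take(parts, 3)
print(rows, scanned)  # [1, 2, 3] after scanning 3 partitions, not all 6
```

An empty first partition costs one extra round here rather than forcing a full scan, which is the OOM-prone behavior the thread describes.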

Re: take() reads every partition if the first one is empty

2014-08-22 Thread Andrew Ash
Yep, anyone can create a bug at https://issues.apache.org/jira/browse/SPARK Then if you make a pull request on GitHub and have the bug number in the header like [SPARK-1234] Make take() less OOM-prone, then the PR gets linked to the Jira ticket. I think that's the best way to get feedback on a

Hang on Executor classloader lookup for the remote REPL URL classloader

2014-08-21 Thread Andrew Ash
Hi Spark devs, I'm seeing a stacktrace where the classloader that reads from the REPL is hung, and blocking all progress on that executor. Below is that hung thread's stacktrace, and also the stacktrace of another hung thread. I thought maybe there was an issue with the REPL's JVM on the other

FileNotFoundException with _temporary in the name

2014-08-12 Thread Andrew Ash
Hi Spark devs, Several people on the mailing list have seen issues with FileNotFoundExceptions related to _temporary in the name. I've personally observed this several times, as have a few of my coworkers on various Spark clusters. Any ideas what might be going on? I've collected the various

Re: Exception in Spark 1.0.1: com.esotericsoftware.kryo.KryoException: Buffer underflow

2014-08-01 Thread Andrew Ash
, Jul 31, 2014 at 10:47 AM, Andrew Ash and...@andrewash.com wrote: Hi everyone, I'm seeing the below exception coming out of Spark 1.0.1 when I call it from my application. I can't share the source to that application, but the quick gist is that it uses Spark's Java APIs to read from Avro

Exception in Spark 1.0.1: com.esotericsoftware.kryo.KryoException: Buffer underflow

2014-07-31 Thread Andrew Ash
Hi everyone, I'm seeing the below exception coming out of Spark 1.0.1 when I call it from my application. I can't share the source to that application, but the quick gist is that it uses Spark's Java APIs to read from Avro files in HDFS, do processing, and write back to Avro files. It does this

Re: [VOTE] Release Apache Spark 1.0.2 (RC1)

2014-07-27 Thread Andrew Ash
Is that a regression since 1.0.0? On Jul 27, 2014 10:43 AM, witgo wi...@qq.com wrote: -1 The following bug should be fixed: https://issues.apache.org/jira/browse/SPARK-2677‍ -- Original -- From: Tathagata Das;tathagata.das1...@gmail.com; Date: Sat, Jul

Re: RFC: Supporting the Scala drop Method for Spark RDDs

2014-07-21 Thread Andrew Ash
Personally I'd find the method useful -- I've often had a .csv file with a header row that I want to drop, so I filter it out, which touches all partitions anyway. I don't have any comments on the implementation quite yet though. On Mon, Jul 21, 2014 at 8:24 AM, Erik Erlandson e...@redhat.com
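
The header-dropping filter mentioned above, in the shape of a mapPartitionsWithIndex function: only partition 0 contains the CSV header, so only that partition skips its first line. The driver loop below is a stand-in for Spark applying the function per partition.

```python
def drop_header(index, iterator):
    """Skip the first row of partition 0; pass other partitions through."""
    it = iter(iterator)
    if index == 0:
        next(it, None)  # discard the header row
    return it

partitions = [["name,age", "alice,30"], ["bob,25"]]
rows = [row for i, part in enumerate(partitions)
        for row in drop_header(i, part)]
print(rows)  # ['alice,30', 'bob,25']
```

In PySpark this is the usual `rdd.mapPartitionsWithIndex(drop_header)` idiom; unlike a hypothetical `drop(n)`, it avoids a second pass over the data.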

Re: Hadoop's Configuration object isn't threadsafe

2014-07-17 Thread Andrew Ash
PM, Andrew Ash and...@andrewash.com wrote: Hi Patrick, thanks for taking a look. I filed as https://issues.apache.org/jira/browse/SPARK-2546 Would you recommend I pursue the cloned Configuration object approach now and send in a PR? Reynold's recent announcement of the broadcast RDD

Re: Hadoop's Configuration object isn't threadsafe

2014-07-16 Thread Andrew Ash
at the same time. It won't deal with reader writer conflicts where some of our initialization code touches state that is needed during normal execution of other tasks. - Patrick On Tue, Jul 15, 2014 at 12:56 PM, Andrew Ash and...@andrewash.com wrote: Hi Shengzhe, Even if we did make

Re: Hadoop's Configuration object isn't threadsafe

2014-07-15 Thread Andrew Ash
). -Shengzhe On Mon, Jul 14, 2014 at 10:22 PM, Andrew Ash and...@andrewash.com wrote: Hi Spark devs, We discovered a very interesting bug in Spark at work last week in Spark 0.9.1 — that the way Spark uses the Hadoop Configuration object is prone to thread safety issues. I believe it still

Re: Reproducible deadlock in 1.0.1, possibly related to Spark-1097

2014-07-14 Thread Andrew Ash
I'm not sure either of those PRs will fix the concurrent adds to Configuration issue I observed. I've got a stack trace and writeup I'll share in an hour or two (traveling today). On Jul 14, 2014 9:50 PM, scwf wangf...@huawei.com wrote: hi,Cody i met this issue days before and i post a PR for

Re: Reproducible deadlock in 1.0.1, possibly related to Spark-1097

2014-07-14 Thread Andrew Ash
is addressing regressions between these two releases. On Mon, Jul 14, 2014 at 9:05 PM, Andrew Ash and...@andrewash.com wrote: I'm not sure either of those PRs will fix the concurrent adds to Configuration issue I observed. I've got a stack trace and writeup I'll share in an hour or two (traveling

Re: [VOTE] Release Apache Spark 1.0.1 (RC1)

2014-06-30 Thread Andrew Ash
bugs. For this reason, it probably can't block a release (I'm not even sure if it should go into a maintenance release where we fix critical bugs for Spark core). We should definitely include them for 1.1.0 though (~Aug). On Sun, Jun 29, 2014 at 11:09 PM, Andrew Ash and...@andrewash.com wrote

Re: Contributing to MLlib on GLM

2014-06-17 Thread Andrew Ash
Hi Xiaokai, Also take a look through Xiangrui's slides from HadoopSummit a few weeks back: http://www.slideshare.net/xrmeng/m-llib-hadoopsummit The roadmap starting at slide 51 will probably be interesting to you. Andrew On Tue, Jun 17, 2014 at 7:37 PM, Sandy Ryza sandy.r...@cloudera.com

Compile failure with SBT on master

2014-06-16 Thread Andrew Ash
I can't run sbt/sbt gen-idea on a clean checkout of Spark master. I get resolution errors on junit#junit;4.10!junit.zip(source) As shown below: aash@aash-mbp /tmp/git/spark$ sbt/sbt gen-idea Using /Library/Java/JavaVirtualMachines/jdk1.7.0_45.jdk/Contents/Home as default JAVA_HOME. Note, this

Re: Compile failure with SBT on master

2014-06-16 Thread Andrew Ash
, 2014 at 9:29 PM, Andrew Ash and...@andrewash.com wrote: I can't run sbt/sbt gen-idea on a clean checkout of Spark master. I get resolution errors on junit#junit;4.10!junit.zip(source) As shown below: aash@aash-mbp /tmp/git/spark$ sbt/sbt gen-idea Using /Library/Java

Re: Implementing rdd.scanLeft()

2014-06-05 Thread Andrew Ash
Is that something that documentation on the method can solve? On Thu, Jun 5, 2014 at 10:47 AM, Reynold Xin r...@databricks.com wrote: I think the main concern is this would require scanning the data twice, and maybe the user should be aware of it ... On Thu, Jun 5, 2014 at 10:29 AM, Andrew

Re: Timestamp support in v1.0

2014-05-29 Thread Andrew Ash
I can confirm that the commit is included in the 1.0.0 release candidates (it was committed before branch-1.0 split off from master), but I can't confirm that it works in PySpark. Generally the Python and Java interfaces lag a little behind the Scala interface to Spark, but we're working to keep

Re: all values for a key must fit in memory

2014-05-25 Thread Andrew Ash
Hi Nilesh, That change from Matei, turning (Key, Seq[Value]) into (Key, Iterable[Value]), was to enable the optimization in future releases without breaking the API. Currently though, all values on a single key are still held in memory on a single machine. The way I've gotten around this is by
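
Why (Key, Iterable[Value]) leaves room for the optimization mentioned above: an iterable can stream values lazily, while a Seq implies they are all materialized. Sketched here with itertools.groupby over key-sorted records, which hands each group to the consumer as a lazy iterator rather than a list.

```python
from itertools import groupby

records = sorted([("a", 1), ("b", 2), ("a", 3), ("b", 4)])

for key, group in groupby(records, key=lambda kv: kv[0]):
    # `group` is a lazy iterator over one key's values, not a list
    total = sum(v for _, v in group)
    print(key, total)  # a 4, then b 6
```

With the records sorted (or spilled to disk) by key, a consumer that only folds over the iterator never needs all of one key's values in memory at once, which is exactly what a Seq-based API would have ruled out.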

Re: Should SPARK_HOME be needed with Mesos?

2014-05-22 Thread Andrew Ash
executor in place. -kr, Gerard. [1] https://issues.apache.org/jira/browse/SPARK-1110 On Thu, May 22, 2014 at 6:19 AM, Andrew Ash and...@andrewash.com wrote: Hi Gerard, I agree that your second option seems preferred. You shouldn't have to specify a SPARK_HOME if the executor is going

Re: Should SPARK_HOME be needed with Mesos?

2014-05-21 Thread Andrew Ash
Hi Gerard, I agree that your second option seems preferred. You shouldn't have to specify a SPARK_HOME if the executor is going to use the spark.executor.uri instead. Can you send in a pull request that includes your proposed changes? Andrew On Wed, May 21, 2014 at 10:19 AM, Gerard Maas

Re: Sorting partitions in Java

2014-05-20 Thread Andrew Ash
Voted :) https://issues.apache.org/jira/browse/SPARK-983 On Tue, May 20, 2014 at 10:21 AM, Sandy Ryza sandy.r...@cloudera.com wrote: There is: SPARK-545 On Tue, May 20, 2014 at 10:16 AM, Andrew Ash and...@andrewash.com wrote: Sandy, is there a Jira ticket for that? On Tue, May 20

TorrentBroadcast aka Cornet?

2014-05-19 Thread Andrew Ash
Hi Spark devs, Is the algorithm for TorrentBroadcast https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/broadcast/TorrentBroadcast.scala the same as Cornet from the below paper? http://www.mosharaf.com/wp-content/uploads/orchestra-sigcomm11.pdf If so it would be nice

Re: Calling external classes added by sc.addJar needs to be through reflection

2014-05-19 Thread Andrew Ash
Sounds like the problem is that classloaders always look in their parents before themselves, and Spark users want executors to pick up classes from their custom code before the ones in Spark plus its dependencies. Would a custom classloader that delegates to the parent after first checking itself
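
The delegation-order question above, reduced to a lookup sketch: parent-first resolution (the JVM default) finds the copy of a class bundled with Spark, while child-first checks user-supplied classes before delegating up. The dicts stand in for a parent and child classloader's visible classes; the names are illustrative.

```python
parent = {"com.example.Lib": "spark-bundled-1.0"}   # Spark + dependencies
child = {"com.example.Lib": "user-supplied-2.0"}    # user's custom jars

def parent_first(name):
    """JVM default: ask the parent loader before looking locally."""
    return parent.get(name) or child.get(name)

def child_first(name):
    """Proposed order: check the user's classes, then delegate to parent."""
    return child.get(name) or parent.get(name)

print(parent_first("com.example.Lib"))  # spark-bundled-1.0
print(child_first("com.example.Lib"))   # user-supplied-2.0
```

A real child-first URLClassLoader has to carve out exceptions (e.g. `java.*` must always come from the parent), which is where most of the subtlety lives.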

Re: [jira] [Created] (SPARK-1855) Provide memory-and-local-disk RDD checkpointing

2014-05-18 Thread Andrew Ash
The nice thing about putting discussion on the Jira is that everything about the bug is in one place. So people looking to understand the discussion a few years from now only have to look on the jira ticket rather than also search the mailing list archives and hope commenters all put the string

Re: Matrix Multiplication of two RDD[Array[Double]]'s

2014-05-18 Thread Andrew Ash
Hi Liquan, There is some work being done on implementing linear algebra algorithms on Spark for use in higher-level machine learning algorithms. That work is happening in the MLlib project, which has an org.apache.spark.mllib.linalg package you may find useful. See

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-17 Thread Andrew Ash
+1 on the next release feeling more like a 0.10 than a 1.0 On May 17, 2014 4:38 AM, Mridul Muralidharan mri...@gmail.com wrote: I had echoed similar sentiments a while back when there was a discussion around 0.10 vs 1.0 ... I would have preferred 0.10 to stabilize the api changes, add missing

Updating docs for running on Mesos

2014-05-15 Thread Andrew Ash
The docs for how to run Spark on Mesos have changed very little since 0.6.0, but setting it up is much easier now than then. Does it make sense to revamp with the below changes? You no longer need to build mesos yourself as pre-built versions are available from Mesosphere:

Re: Requirements of objects stored in RDDs

2014-05-13 Thread Andrew Ash
An RDD can hold objects of any type. If you generally think of it as a distributed Collection, then you won't ever be that far off. As far as serialization, the contents of an RDD must be serializable. There are two serialization libraries you can use with Spark: normal Java serialization or
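
The serializability requirement described above, illustrated with Python's pickle (the analogue of Java serialization on the JVM side): plain data round-trips cleanly, but an object holding unserializable state fails, which is what Spark would surface when shipping tasks or shuffling data. `HoldsSocket` is a contrived example class.

```python
import pickle

record = {"id": 7, "tags": ["a", "b"]}
roundtrip = pickle.loads(pickle.dumps(record))  # fine as RDD contents
print(roundtrip == record)

class HoldsSocket:
    def __init__(self):
        import socket
        self.sock = socket.socket()  # OS handle: not serializable

try:
    pickle.dumps(HoldsSocket())
    print("serialized")
except TypeError:
    print("not serializable")  # what Spark would hit at task time
```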

Re: Updating docs for running on Mesos

2014-05-13 Thread Andrew Ash
...@gmail.com wrote: Andrew, Updating these docs would be great! I think this would be a welcome change. In terms of packaging, it would be good to mention the binaries produced by the upstream project as well, in addition to Mesosphere. - Patrick On Thu, May 8, 2014 at 12:51 AM, Andrew Ash

Re: Preliminary Parquet numbers and including .count() in Catalyst

2014-05-13 Thread Andrew Ash
Thanks for filing -- I'm keeping my eye out for updates on that ticket. Cheers! Andrew On Tue, May 13, 2014 at 2:40 PM, Michael Armbrust mich...@databricks.com wrote: It looks like currently the .count() on parquet is handled incredibly inefficiently and all the columns are materialized.

Re: Kryo not default?

2014-05-12 Thread Andrew Ash
As an example of where it sometimes doesn't work, in older versions of Kryo / Chill the Joda LocalDate class didn't serialize properly -- https://groups.google.com/forum/#!topic/cascalog-user/35cdnNIamKU On Mon, May 12, 2014 at 4:39 PM, Reynold Xin r...@databricks.com wrote: The main reason is

Preliminary Parquet numbers and including .count() in Catalyst

2014-05-12 Thread Andrew Ash
Hi Spark devs, First of all, huge congrats on the parquet integration with SparkSQL! This is an incredible direction forward and something I can see being very broadly useful. I was doing some preliminary tests to see how it works with one of my workflows, and wanted to share some numbers that

Re: reading custom input format in Spark

2014-04-08 Thread Andrew Ash
Are you using the PatternInputFormat from this blog post? https://hadoopi.wordpress.com/2013/05/31/custom-recordreader-processing-string-pattern-delimited-records/ If so you need to set the pattern in the configuration before attempting to read data with that InputFormat: String regex =
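
A sketch of what a pattern-delimited record reader does: the regex marking the start of each record must be configured before reading, and the input then splits wherever the pattern matches. This is pure Python standing in for the Hadoop InputFormat; the log lines and date pattern are invented for illustration.

```python
import re

def read_records(text, pattern):
    """Split text into records, each beginning where `pattern` matches."""
    starts = [m.start() for m in re.finditer(pattern, text, re.MULTILINE)]
    bounds = zip(starts, starts[1:] + [len(text)])
    return [text[a:b].rstrip("\n") for a, b in bounds]

# Two multi-line records, each starting with a date at column 0.
log = "2014-04-08 start\n  detail\n2014-04-09 next\n  more\n"
records = read_records(log, r"^\d{4}-\d{2}-\d{2}")
print(len(records))  # 2 records, continuation lines kept with their record
```

The Hadoop version additionally has to handle records spanning split boundaries, but the reader's dependence on a pre-configured pattern is the same: without the regex set, it has no way to know where one record ends and the next begins.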

Re: SPARK-942 patch review

2014-02-25 Thread Andrew Ash
if your PR is being ignored. We'll implement some kind of cleanup (at least manually) to close the old ones. Matei On Feb 24, 2014, at 1:30 PM, Andrew Ash and...@andrewash.com wrote: Yep that's the one thanks! That's quite a few more people than I thought Sent from my mobile phone