Re: [VOTE] SPARK-44444: Use ANSI SQL mode by default

2024-04-15 Thread Josh Rosen
+1 On Mon, Apr 15, 2024 at 11:26 AM Maciej wrote: > +1 > > Best regards, > Maciej Szymkiewicz > > Web: https://zero323.net > PGP: A30CEF0C31A501EC > > On 4/15/24 8:16 PM, Rui Wang wrote: > > +1, non-binding. > > Thanks Dongjoon to drive this! > > > -Rui > > On Mon, Apr 15, 2024 at 10:10 AM

Re: In Apache Spark JIRA, spark/dev/github_jira_sync.py not running properly

2020-04-29 Thread Josh Rosen
ration / sync script. On Wed, Apr 29, 2020 at 6:21 PM Hyukjin Kwon wrote: > Let me actually just take a look by myself and bring some updates soon. > > On Thu, Apr 30, 2020 at 9:13 AM, Hyukjin Kwon wrote: > >> WDYT @Josh Rosen ? >> Seems >> https://github.co

Spark SQL upgrade / migration guide: discoverability and content organization

2019-07-14 Thread Josh Rosen
I'd like to discuss the Spark SQL migration / upgrade guides in the Spark documentation: these are valuable resources and I think we could increase that value by making these docs easier to discover and by adding a bit more structure to the existing content. For folks who aren't familiar with

Re: Resolving all JIRAs affecting EOL releases

2019-05-15 Thread Josh Rosen
+1 in favor of some sort of JIRA cleanup. My only request is that we attach some sort of 'bulk-closed' label to issues that we close via JIRA filter batch operations (and resolve the issues as "Timed Out" / "Cannot Reproduce", not "Fixed"). Using a label makes it easier to audit what was closed,

Re: In Apache Spark JIRA, spark/dev/github_jira_sync.py not running properly

2019-04-25 Thread Josh Rosen
The code for this runs in http://spark-prs.appspot.com (see https://github.com/databricks/spark-pr-dashboard/blob/1e799c9e510fa8cdc9a6c084a777436bebeabe10/sparkprs/controllers/tasks.py#L137 ) I checked the AppEngine logs and it looks like we're getting error responses, possibly due to a

Re: spark-tests.appspot status?

2017-12-14 Thread Josh Rosen
Yep, it turns out that there was a problem with the Jenkins job. I've restarted it and it should be backfilling now (this might take a while). On Thu, Dec 14, 2017 at 1:57 PM Xin Lu wrote: > Most likely the job that uploads this stuff at databricks is broken. > > On Thu,

Re: Spark build is failing in amplab Jenkins

2017-11-05 Thread Josh Rosen
ines are hosted. the master is on UPS but the workers > aren't... and when they come back, the PATH variable specified in the > workers' configs get dropped and we see behavior like this. > > josh rosen (whom i am talking with over chat) will be restarting the > ssh/worker processes o

Re: Raise Jenkins timeout?

2017-10-09 Thread Josh Rosen
I bumped the timeouts up to 255 minutes (to exceed https://github.com/apache/spark/blame/master/dev/run-tests-jenkins.py#L185). Let's see if this resolves the problem. On Mon, Oct 9, 2017 at 9:30 AM shane knapp wrote: > ++joshrosen > > On Mon, Oct 9, 2017 at 1:48 AM, Sean

Re: Are there multiple processes out there running JIRA <-> Github maintenance tasks?

2017-08-30 Thread Josh Rosen
ails. > > On Mon, Aug 28, 2017 at 2:34 PM, Josh Rosen <joshro...@databricks.com> > wrote: > > This should be fixed now. The problem was that debug code had been pushed > > while investigating the JIRA linkage failure but was not removed and this > > problem went unnotice

Re: Are there multiple processes out there running JIRA <-> Github maintenance tasks?

2017-08-28 Thread Josh Rosen
This should be fixed now. The problem was that debug code had been pushed while investigating the JIRA linkage failure but was not removed and this problem went unnoticed because linking was failing well before the debug code was hit. Once the JIRA connectivity issues were resolved, the

Re: Some PRs not automatically linked to JIRAs

2017-08-02 Thread Josh Rosen
Usually the backend of https://spark-prs.appspot.com does the linking while processing PR update tasks. It appears that the site's connections to JIRA have started failing: ConnectionError: ('Connection aborted.', HTTPException('Deadline exceeded while waiting for HTTP response from URL:

Crowdsourced triage Scapegoat compiler plugin warnings

2017-05-24 Thread Josh Rosen
t itself and eliminate common false-positives. Thanks and happy bug-hunting, Josh Rosen

Re: New Optimizer Hint

2017-05-01 Thread Josh Rosen
The issue of UDFs which return structs being evaluated many times when accessing the returned struct's fields sounds like https://issues.apache.org/jira/browse/SPARK-17728; that issue mentions a trick of using *array* and *explode* to prevent project collapsing. On Thu, Apr 20, 2017 at 8:55 AM

Re: RFC: deprecate SparkStatusTracker, remove JobProgressListener

2017-03-24 Thread Josh Rosen
I think that it should be safe to remove JobProgressListener but I'd like to keep the SparkStatusTracker API. SparkStatusTracker was originally developed to provide a stable programmatic status API for use by Hive on Spark. SparkStatusTracker predated the Spark REST APIs for status tracking which

Re: Nightly builds for master branch have been failing

2017-02-24 Thread Josh Rosen
I spotted the problem and it appears to be a misconfiguration / missing entry in the template which generates the packaging jobs. I've corrected the problem but now the jobs appear to be hanging / flaking on the Git clone. Hopefully this is just a transient issue, so let's retry tonight and see

Re: File JIRAs for all flaky test failures

2017-02-15 Thread Josh Rosen
A useful tool for investigating test flakiness is my Jenkins Test Explorer service, running at https://spark-tests.appspot.com/ This has some useful timeline views for debugging flaky builds. For instance, at https://spark-tests.appspot.com/jobs/spark-master-test-maven-hadoop-2.6 (may be slow to

Re: [VOTE] Release Apache Spark 2.0.2 (RC3)

2016-11-14 Thread Josh Rosen
He pushed the 2.0.2 release docs but there's a problem with Git mirroring of the Spark website repo which is interfering with the publishing: https://issues.apache.org/jira/browse/INFRA-12913 On Mon, Nov 14, 2016 at 1:15 PM Shivaram Venkataraman < shiva...@eecs.berkeley.edu> wrote: > The

Re: [VOTE] Release Apache Spark 2.0.1 (RC3)

2016-09-25 Thread Josh Rosen
+1 On Sun, Sep 25, 2016 at 1:16 PM Yin Huai wrote: > +1 > > On Sun, Sep 25, 2016 at 11:40 AM, Dongjoon Hyun > wrote: > >> +1 (non binding) >> >> RC3 is compiled and tested on the following two systems, too. All tests >> passed. >> >> * CentOS 7.2 /

Re: Unable to run docker jdbc integrations test ?

2016-09-07 Thread Josh Rosen
I think that these tests are valuable so I'd like to keep them. If possible, though, we should try to get rid of our dependency on the Spotify docker-client library, since it's a dependency hell nightmare. Given our relatively simple use of Docker here, I wonder whether we could just write some

Re: master snapshots not publishing?

2016-07-24 Thread Josh Rosen
be working. On Thu, Jul 21, 2016 at 3:36 PM Andrew Duffy <r...@aduffy.org> wrote: > Gotcha, that'd be great! > > On Thu, Jul 21, 2016 at 8:52 PM, Josh Rosen <joshro...@databricks.com> > wrote: > >> Yeah, it's on purpose: we had to disable it back when both the mast

Re: master snapshots not publishing?

2016-07-21 Thread Josh Rosen
Yeah, it's on purpose: we had to disable it back when both the master and branch-2.0 branches had the same versions in their POMs because that was causing the master snapshots to overwrite the 2.0.0-SNAPSHOTS which are generated off of branch-2.0. I can go ahead and re-enable it later today. On

Re: RFC: Remove "HBaseTest" from examples?

2016-04-19 Thread Josh Rosen
+1; I think that it's preferable for code examples, especially third-party integration examples, to live outside of Spark. On Tue, Apr 19, 2016 at 10:29 AM Reynold Xin wrote: > Yea in general I feel examples that bring in a large amount of > dependencies should be outside

Re: Spark 1.6.1 Hadoop 2.6 package on S3 corrupt?

2016-04-06 Thread Josh Rosen
at > tar: Child returned status 1 > tar: Error is not recoverable: exiting now > > $ ls -l !$ > ls -l spark-1.6.1-bin-hadoop2.4.tgz > -rw-r--r--. 1 hbase hadoop 323614720 Apr 5 19:25 > spark-1.6.1-bin-hadoop2.4.tgz > > Thanks > > On Wed, Apr 6, 2016 at 12:19 PM, Josh Ro

Re: Spark 1.6.1 Hadoop 2.6 package on S3 corrupt?

2016-04-06 Thread Josh Rosen
I downloaded the Spark 1.6.1 artifacts from the Apache mirror network and re-uploaded them to the spark-related-packages S3 bucket, so hopefully these packages should be fixed now. On Mon, Apr 4, 2016 at 3:37 PM Nicholas Chammas wrote: > Thanks, that was the command.

Re: Updating Spark PR builder and 2.x test jobs to use Java 8 JDK

2016-04-05 Thread Josh Rosen
to the platform's default javac, which happens to be Java 7. To fix this, I'm going to modify the build to just prepend $JAVA_HOME/bin to $PATH while setting up the test environment On Tue, Apr 5, 2016 at 5:09 PM Josh Rosen <joshro...@databricks.com> wrote: > I've reverted the bulk of the conf change
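
For anyone hitting the same "wrong javac" problem locally, the PATH change described above can be sketched roughly like this. This is only an illustration in Python, not the actual Jenkins or build-script change; the helper name `prepend_java_home` and the example paths are made up.

```python
import os


def prepend_java_home(env):
    """Return a copy of env with $JAVA_HOME/bin moved to the front of PATH.

    Hypothetical helper for illustration; not part of Spark's build scripts.
    """
    java_home = env.get("JAVA_HOME")
    if not java_home:
        # Without JAVA_HOME there is nothing to prepend; leave PATH untouched.
        return dict(env)
    java_bin = os.path.join(java_home, "bin")
    # Drop empty entries and any existing copy of java_bin so it appears once.
    parts = [p for p in env.get("PATH", "").split(os.pathsep)
             if p and p != java_bin]
    updated = dict(env)
    updated["PATH"] = os.pathsep.join([java_bin] + parts)
    return updated
```

Because the JDK's bin directory now comes first, a bare `javac` lookup resolves to the JDK named by JAVA_HOME rather than the platform default.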

Re: Updating Spark PR builder and 2.x test jobs to use Java 8 JDK

2016-04-05 Thread Josh Rosen
have noticed the following error ( > https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.7/566/console > ): > > [error] javac: invalid source release: 1.8 > [error] Usage: javac > [error] use -help for a list of possible options > > > On Tue, Apr

Updating Spark PR builder and 2.x test jobs to use Java 8 JDK

2016-04-05 Thread Josh Rosen
In order to be able to run Java 8 API compatibility tests, I'm going to push a new set of Jenkins configurations for Spark's test and PR builders so that those jobs use a Java 8 JDK. I tried this once in the past and it seemed to introduce some rare, transient flakiness in certain tests, so if

Re: Understanding PySpark Internals

2016-03-30 Thread Josh Rosen
One clarification: there *are* Python interpreters running on executors so that Python UDFs and RDD API code can be executed. Some slightly-outdated but mostly-correct reference material for this can be found at https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals. See also: search

Re: Spark build with scala-2.10 fails ?

2016-03-20 Thread Josh Rosen
It looks like the Scala 2.10 Jenkins build is working: https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-compile-sbt-scala-2.10/ Can you share more details about how you're compiling with 2.10 (e.g. which commands you ran, git SHA, etc)? On Wed, Mar 16, 2016 at

Re: Apache Spark Exception in thread “main” java.lang.NoClassDefFoundError: scala/collection/GenTraversableOnce$class

2016-03-19 Thread Josh Rosen
See the instructions in the Spark documentation: https://spark.apache.org/docs/latest/building-spark.html#building-for-scala-211 On Wed, Mar 16, 2016 at 7:05 PM satyajit vegesna wrote: > > > Hi, > > Scala version:2.11.7(had to upgrade the scala version to enable case

Re: Apache Spark Exception in thread “main” java.lang.NoClassDefFoundError: scala/collection/GenTraversableOnce$class

2016-03-19 Thread Josh Rosen
Err, whoops, looks like this is a user app and not building Spark itself, so you'll have to change your deps to use the 2.11 versions of Spark. e.g. spark-streaming_2.10 -> spark-streaming_2.11. On Wed, Mar 16, 2016 at 7:07 PM Josh Rosen <joshro...@databricks.com> wrote: > See the

Does anyone implement org.apache.spark.serializer.Serializer in their own code?

2016-03-07 Thread Josh Rosen
Does anyone implement Spark's serializer interface (org.apache.spark.serializer.Serializer) in your own third-party code? If so, please let me know because I'd like to change this interface from a DeveloperAPI to private[spark] in Spark 2.0 in order to do some cleanup and refactoring. I think that

Re: Spark 1.6.1

2016-02-26 Thread Josh Rosen
I updated the release packaging scripts to use SFTP via the *lftp* client: https://github.com/apache/spark/pull/11350 I'm starting the process of cutting a 1.6.1-RC1 tag and release artifacts right now, so please be extra careful about merging into branch-1.6 until after the release. Once the RC

Re: BUILD FAILURE...again?! :( Spark Project External Flume on fire

2016-01-11 Thread Josh Rosen
I've got a hotfix which should address it: https://github.com/apache/spark/pull/10693 On Sun, Jan 10, 2016 at 11:50 PM, Jacek Laskowski wrote: > Hi, > > It appears that the last commit [1] broke the build. Is anyone working > on it? I can when told so. > > ➜ spark

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Josh Rosen
If users are able to install Spark 2.0 on their RHEL clusters, then I imagine that they're also capable of installing a standalone Python alongside that Spark version (without changing Python systemwide). For instance, Anaconda/Miniconda make it really easy to install Python 2.7.x/3.x without

New processes / tools for changing dependencies in Spark

2015-12-30 Thread Josh Rosen
I just merged https://github.com/apache/spark/pull/10461, a PR that adds new automated tooling to help us reason about dependency changes in Spark. Here's a summary of the changes: - The dev/run-tests script (used in the SBT Jenkins builds and for testing Spark pull requests) now generates

Re: Is there any way to stop a jenkins build

2015-12-29 Thread Josh Rosen
Yeah, I thought that my quick fix might address the HiveThriftBinaryServerSuite hanging issue, but it looks like it didn't work so I'll now have to do the more principled fix of using a UDF which sleeps for some amount of time. In order to stop builds, you need to have a Jenkins account with the

Re: [VOTE] Release Apache Spark 1.6.0 (RC4)

2015-12-22 Thread Josh Rosen
+1 On Tue, Dec 22, 2015 at 7:00 PM, Jeff Zhang wrote: > +1 > > On Wed, Dec 23, 2015 at 7:36 AM, Mark Hamstra > wrote: > >> +1 >> >> On Tue, Dec 22, 2015 at 12:10 PM, Michael Armbrust < >> mich...@databricks.com> wrote: >> >>> Please vote on releasing

Re: Spark fails after 6000s because of akka

2015-12-20 Thread Josh Rosen
Would you mind copying this information into a JIRA ticket to make it easier to discover / track? Thanks! On Sun, Dec 20, 2015 at 11:35 AM Alexander Pivovarov wrote: > Usually Spark EMR job fails with the following exception in 1 hour 40 min > - Job cancelled because

Re: JIRA: Wrong dates from imported JIRAs

2015-12-16 Thread Josh Rosen
Personally, I'd rather avoid the risk of breaking things during the reimport. In my experience we've had a lot of unforeseen problems with JIRA import/export and the benefit here doesn't seem huge (this issue only impacts people that are searching for the oldest JIRAs across all projects, which I

Re: Fastest way to build Spark from scratch

2015-12-09 Thread Josh Rosen
time over many months of Spark > development. Is that right? > > > On Tue, Dec 8, 2015 at 12:33 PM Josh Rosen <joshro...@databricks.com> > wrote: > >> @Nick, on a fresh EC2 instance a significant chunk of the initial build >> time might be due to artifact resolution +

Re: Spark doesn't unset HADOOP_CONF_DIR when testing ?

2015-12-06 Thread Josh Rosen
I agree that we should unset this in our tests. Want to file a JIRA and submit a PR to do this? On Thu, Dec 3, 2015 at 6:40 PM Jeff Zhang wrote: > I try to do test on HiveSparkSubmitSuite on local box, but fails. The > cause is that spark is still using my local single node

Re: IntelliJ license for committers?

2015-12-02 Thread Josh Rosen
Yep, I'm the point of contact between us and JetBrains. I forwarded the 2015 license renewal email to the private@ list, so it should be accessible via the archives. I'll go ahead and forward you a copy of our project license, which will have to be renewed in January of next year. On Wed, Dec 2,

Re: Bringing up JDBC Tests to trunk

2015-11-30 Thread Josh Rosen
estion about > how the jdbc drivers are actually being setup for the other datasources > (MySQL and PostgreSQL), are these setup directly on the Jenkins slaves ? I > didn't see the jars or anything specific on the pom or other files... > > > Thanks > > On Wed, Oct 21, 2015 at

Re: VerifyError running Spark SQL code?

2015-11-25 Thread Josh Rosen
I think I've seen this issue as well, but in a different suite. I wasn't able to easily get to the bottom of it, though. What JDK / JRE are you using? I'm on Java(TM) SE Runtime Environment (build 1.7.0_65-b17) Java HotSpot(TM) 64-Bit Server VM (build 24.65-b04, mixed mode) on OSX. On

Re: Does anyone meet the issue that jars under lib_managed is never downloaded ?

2015-11-17 Thread Josh Rosen
Can you file a JIRA issue to help me triage this further? Thanks! On Tue, Nov 17, 2015 at 4:08 PM Jeff Zhang <zjf...@gmail.com> wrote: > Sure, hive profile is enabled. > > On Wed, Nov 18, 2015 at 6:12 AM, Josh Rosen <joshro...@databricks.com> > wrote: > >> Is

Re: Does anyone meet the issue that jars under lib_managed is never downloaded ?

2015-11-17 Thread Josh Rosen
tadata.SessionHiveMetaStoreClient.<init>(SessionHiveMetaStoreClient.java:74) >>> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) >>> at >>> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) >>> at >

Re: Does anyone meet the issue that jars under lib_managed is never downloaded ?

2015-11-16 Thread Josh Rosen
As of https://github.com/apache/spark/pull/9575, Spark's build will no longer place every dependency JAR into lib_managed. Can you say more about how this affected spark-shell for you (maybe share a stacktrace)? On Mon, Nov 16, 2015 at 12:03 AM, Jeff Zhang wrote: > >

Re: A proposal for Spark 2.0

2015-11-10 Thread Josh Rosen
There's a proposal / discussion of the assembly-less distributions at https://github.com/vanzin/spark/pull/2/files / https://issues.apache.org/jira/browse/SPARK-11157. On Tue, Nov 10, 2015 at 3:53 PM, Reynold Xin wrote: > > On Tue, Nov 10, 2015 at 3:35 PM, Nicholas Chammas

Re: pyspark with pypy not work for spark-1.5.1

2015-11-05 Thread Josh Rosen
I noticed that you're using PyPy 2.2.1, but it looks like Spark 1.5.1's docs say that we only support PyPy 2.3+. Could you try using a newer PyPy version to see if that works? I just checked and it looks like our Jenkins tests are running against PyPy 2.5.1, so that version is known to work. I'm

Re: pyspark with pypy not work for spark-1.5.1

2015-11-05 Thread Josh Rosen
> On Thu, Nov 5, 2015 at 4:14 PM, Chang Ya-Hsuan <sumti...@gmail.com> wrote: > >> Thanks for your quick reply. >> >> I will test several pypy versions and report the result later. >> >> On Thu, Nov 5, 2015 at 4:06 PM, Josh Rosen <rosenvi...@gmail.com>

Re: Bringing up JDBC Tests to trunk

2015-10-21 Thread Josh Rosen
Hey Luciano, This sounds like a reasonable plan to me. One of my colleagues has written some Dockerized MySQL testing utilities, so I'll take a peek at those to see if there are any specifics of their solution that we should adapt for Spark. On Wed, Oct 21, 2015 at 1:16 PM, Luciano Resende

Re: Spark Event Listener

2015-10-16 Thread Josh Rosen
The reason for having two separate interfaces is developer API backwards-compatibility, as far as I know. SparkFirehoseListener came later. On Tue, Oct 13, 2015 at 4:36 PM, Jakob Odersky wrote: > the path of the source file defining the event API is >

Re: If you use Spark 1.5 and disabled Tungsten mode ...

2015-10-15 Thread Josh Rosen
To clarify, we're asking about the *spark.sql.tungsten.enabled* flag, which was introduced in Spark 1.5 and enables Project Tungsten optimizations in Spark SQL. This option is set to *true* by default in Spark 1.5+ and exists primarily to allow users to disable the new code paths if they encounter

Re: Spark Event Listener

2015-10-13 Thread Josh Rosen
Check out SparkFirehoseListener, an adapter which forwards all events to a single `onEvent` method in order to let you do pattern-matching as you have described: https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/SparkFirehoseListener.java On Tue, Oct 13, 2015 at 4:29
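
To make the adapter idea concrete, here is a rough sketch of the pattern in plain Python. It does not use Spark's classes: the `Listener` interface, the event classes, and `JobLogger` are all hypothetical stand-ins, and the real SparkFirehoseListener is the Java class linked above.

```python
class JobStart:
    def __init__(self, job_id):
        self.job_id = job_id


class JobEnd:
    def __init__(self, job_id, succeeded):
        self.job_id = job_id
        self.succeeded = succeeded


class TaskEnd:
    def __init__(self, job_id, task_id):
        self.job_id = job_id
        self.task_id = task_id


class Listener:
    """Hypothetical listener interface: one callback per event type."""

    def on_job_start(self, event):
        pass

    def on_job_end(self, event):
        pass

    def on_task_end(self, event):
        pass


class FirehoseListener(Listener):
    """Adapter: every per-type callback forwards to a single on_event."""

    def on_event(self, event):
        raise NotImplementedError

    def on_job_start(self, event):
        self.on_event(event)

    def on_job_end(self, event):
        self.on_event(event)

    def on_task_end(self, event):
        self.on_event(event)


class JobLogger(FirehoseListener):
    """Handles only the event types it cares about, in one place."""

    def __init__(self):
        self.lines = []

    def on_event(self, event):
        if isinstance(event, JobStart):
            self.lines.append("job %d started" % event.job_id)
        elif isinstance(event, JobEnd):
            status = "succeeded" if event.succeeded else "failed"
            self.lines.append("job %d %s" % (event.job_id, status))
        # Any other event type (e.g. TaskEnd) is ignored here.
```

The subclass overrides one method and dispatches on the event's type, instead of overriding a separate callback for each kind of event.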

Re: Pyspark dataframe read

2015-10-06 Thread Josh Rosen
Could someone please file a JIRA to track this? https://issues.apache.org/jira/browse/SPARK On Tue, Oct 6, 2015 at 1:21 AM, Koert Kuipers wrote: > i ran into the same thing in scala api. we depend heavily on comma > separated paths, and it no longer works. > > > On Tue, Oct

Re: Spark 1.5.1 - Scala 2.10 - Hadoop 1 package is missing from S3

2015-10-05 Thread Josh Rosen
I'm working on a fix for this right now. I'm planning to re-run a modified copy of the release packaging scripts which will emit only the missing artifacts (so we won't upload new artifacts with different SHAs for the builds which *did* succeed). I expect to have this finished in the next day or

Does anyone use ShuffleDependency directly?

2015-09-18 Thread Josh Rosen
Does anyone use ShuffleDependency directly in their Spark code or libraries? If so, how do you use it? Similarly, does anyone use ShuffleHandle

Re: Building with sbt impossible to get artifacts when data has not been loaded

2015-08-26 Thread Josh Rosen
I ran into a similar problem while working on the spark-redshift library and was able to fix it by bumping that library's ScalaTest version. I'm still fighting some mysterious Scala issues while trying to test the spark-csv library against 1.5.0-RC1, so it's possible that a build or dependency

Re: Automatically deleting pull request comments left by AmplabJenkins

2015-08-14 Thread Josh Rosen
, since our custom SparkQA provides nicer output. On Fri, Aug 14, 2015 at 1:57 AM, Iulian Dragoș iulian.dra...@typesafe.com wrote: On Fri, Aug 14, 2015 at 4:21 AM, Josh Rosen rosenvi...@gmail.com wrote: Prototype is at https://github.com/databricks/spark-pr-dashboard/pull/59 On Wed, Aug 12

Re: Automatically deleting pull request comments left by AmplabJenkins

2015-08-14 Thread Josh Rosen
The updated prototype listed in https://github.com/databricks/spark-pr-dashboard/pull/59 is now running live on spark-prs as part of its PR comment update task. On Fri, Aug 14, 2015 at 10:51 AM, Josh Rosen rosenvi...@gmail.com wrote: I think that I'm still going to want some custom code

Re: Automatically deleting pull request comments left by AmplabJenkins

2015-08-13 Thread Josh Rosen
Prototype is at https://github.com/databricks/spark-pr-dashboard/pull/59 On Wed, Aug 12, 2015 at 7:51 PM, Josh Rosen rosenvi...@gmail.com wrote: *TL;DR*: would anyone object if I wrote a script to auto-delete pull request comments from AmplabJenkins? Currently there are two bots which post

Re: Avoiding unnecessary build changes until tests are in better shape

2015-08-05 Thread Josh Rosen
+1. I've been holding off on reviewing / merging patches like the run-tests-jenkins Python refactoring for exactly this reason. On 8/5/15 11:24 AM, Patrick Wendell wrote: Hey All, Was wondering if people would be willing to avoid merging build changes until we have put the tests in better

Master JIRA ticket for tracking Spark 1.5.0 configuration renames, defaults changes, and configuration deprecation

2015-08-02 Thread Josh Rosen
To help us track planned / finished configuration renames, defaults changes, and configuration deprecation for the upcoming 1.5.0 release, I have created https://issues.apache.org/jira/browse/SPARK-9550. As you make configuration changes or think of configurations that need to be audited, please

Re: Should spark-ec2 get its own repo?

2015-08-01 Thread Josh Rosen
I don't think that using git submodules is a good idea here: - The extra `git submodule init && git submodule update` step can lead to confusing problems in certain workflows. - We'd wind up with many commits that serve only to bump the submodule SHA; these commits will be hard to

Re: Came across Spark SQL hang/Error issue with Spark 1.5 Tungsten feature

2015-07-31 Thread Josh Rosen
It would also be great to test this with codegen and unsafe enabled but while continuing to use sort shuffle manager instead of the new tungsten-sort one. On Fri, Jul 31, 2015 at 1:39 AM, Reynold Xin r...@databricks.com wrote: Is this deterministically reproducible? Can you try this on the

Re: Worker memory leaks?

2015-07-27 Thread Josh Rosen
. That is we create a single context per application submitted and close it upon success/failure completion of the application. Thanks, On Mon, Jul 20, 2015 at 3:20 PM, Josh Rosen joshro...@databricks.com wrote: Hi Richard, Thanks for your detailed investigation of this issue. I agree

Re: non-deprecation compiler warnings are upgraded to build errors now

2015-07-26 Thread Josh Rosen
Given that 2.11 may be more stringent with respect to warnings, we might consider building with 2.11 instead of 2.10 in the pull request builder. This would also have some secondary benefits in terms of letting us use tools like Scapegoat or SCoverage highlighting. On Sat, Jul 25, 2015 at 8:52

Re: Worker memory leaks?

2015-07-20 Thread Josh Rosen
Hi Richard, Thanks for your detailed investigation of this issue. I agree with your observation that the finishedExecutors hashmap is a source of memory leaks for very-long-lived clusters. It looks like the finishedExecutors map is only read when rendering the Worker Web UI and in constructing
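
One generic way to keep a map like this from growing without bound is to cap the number of retained entries and evict the oldest first. The sketch below is a plain-Python illustration of that idea only; `BoundedRegistry` and its parameters are hypothetical names, and this is not a description of the fix that went into the Worker.

```python
from collections import OrderedDict


class BoundedRegistry:
    """Keeps at most max_entries finished records, evicting the oldest first.

    Hypothetical class for illustration; not Spark's Worker code.
    """

    def __init__(self, max_entries):
        if max_entries < 1:
            raise ValueError("max_entries must be at least 1")
        self._max = max_entries
        self._entries = OrderedDict()

    def add(self, key, value):
        # Re-adding an existing key moves it to the "newest" position.
        self._entries.pop(key, None)
        self._entries[key] = value
        # Evict from the oldest end until we are back under the cap.
        while len(self._entries) > self._max:
            self._entries.popitem(last=False)

    def items(self):
        return list(self._entries.items())
```

Since the map is only read for display purposes, dropping the oldest finished entries trades a little UI history for bounded memory use.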

Re: KinesisStreamSuite failing in master branch

2015-07-19 Thread Josh Rosen
Yep, I emailed TD about it; I think that we may need to make a change to the pull request builder to fix this. Pending that, we could just revert the commit that added this. On Sun, Jul 19, 2015 at 5:32 PM, Ted Yu yuzhih...@gmail.com wrote: Hi, I noticed that KinesisStreamSuite fails for both

Re: KryoSerializer gives class cast exception

2015-07-17 Thread Josh Rosen
We've run into other problems caused by our old Kryo versions. I agree that the Chill dependency is one of the main blockers to upgrading Kryo, but I don't think that it's insurmountable: if necessary, we could just publish our own forked version of Chill under our own namespace, similar to what

Re: why doesn't jenkins like me?

2015-07-17 Thread Josh Rosen
The "It is not a test" failed test message means that something went wrong in a suite-wide setup or teardown method. This could be some sort of race or flakiness. If this problem persists, we should file a JIRA and label it with flaky-test so that we can find it later. On Thu, Jul 16, 2015 at

Re: problems with build of latest the master

2015-07-15 Thread Josh Rosen
We may be able to fix this from the Spark side by adding appropriate exclusions in our Hadoop dependencies, right? If possible, I think that we should do this. On Wed, Jul 15, 2015 at 7:10 AM, Ted Yu yuzhih...@gmail.com wrote: I attached a patch for HADOOP-12235 BTW openstack was not

Re: Joining Apache Spark

2015-07-13 Thread Josh Rosen
Also, check out https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark On Mon, Jul 13, 2015 at 4:08 PM, Marcelo Vanzin van...@cloudera.com wrote: Hello, welcome, and please start by going through the web site ( http://spark.apache.org/), especially the Contributors section at

Re: Spark master broken?

2015-07-12 Thread Josh Rosen
I think it is just broken for 2.11 since pull requests are building properly. Sent from my phone On Jul 12, 2015, at 8:22 AM, René Treffer rtref...@gmail.com wrote: Java 8, make-distribution Jenkins does show the same error, though:

Re: The latest master branch didn't compile with -Phive?

2015-07-09 Thread Josh Rosen
Jenkins runs compile-only builds for Maven as an early warning system for this type of issue; you can see from https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Compile/ that the Maven compilation is now broken in master. On Thu, Jul 9, 2015 at 8:48 AM, Ted Yu yuzhih...@gmail.com wrote: I

Re: [VOTE] Release Apache Spark 1.4.1 (RC3)

2015-07-08 Thread Josh Rosen
I've filed https://issues.apache.org/jira/browse/SPARK-8903 to fix the DataFrameStatSuite test failure. The problem turned out to be caused by a mistake made while resolving a merge-conflict when backporting that patch to branch-1.4. I've submitted https://github.com/apache/spark/pull/7295 to fix

Re: Spark 1.5.0-SNAPSHOT broken with Scala 2.11

2015-06-28 Thread Josh Rosen
The 2.11 compile build is going to be green because this is an issue with tests, not compilation. On Sun, Jun 28, 2015 at 6:30 PM, Ted Yu yuzhih...@gmail.com wrote: Spark-Master-Scala211-Compile build is green. However it is not clear what the actual command is: [EnvInject] - Variables

Re: [SQL] codegen on wide dataset throws StackOverflow

2015-06-26 Thread Josh Rosen
Which Spark version are you using? Can you file a JIRA for this issue? On Thu, Jun 25, 2015 at 6:35 AM, Peter Rudenko petro.rude...@gmail.com wrote: Hi, i have a small but very wide dataset (2000 columns). Trying to optimize Dataframe pipeline for it, since it behaves very poorly comparing

Re: [jenkins] ERROR: Publisher 'Publish JUnit test result report' failed: No test report files were found. Configuration error?

2015-06-21 Thread Josh Rosen
This is a side effect of the new pull request tester script interacting badly with a Jenkins plugin, not anything caused by your changes. I'm working on a fix but in the meantime I'd just trust what SparkQA says. Sent from my phone On Jun 21, 2015, at 1:54 PM, Yu Ishikawa

Re: [Tungsten] NPE in UnsafeShuffleWriter.java

2015-06-20 Thread Josh Rosen
I've filed https://issues.apache.org/jira/browse/SPARK-8498 to fix this error-handling code. On Fri, Jun 19, 2015 at 11:51 AM, Josh Rosen rosenvi...@gmail.com wrote: Hey Peter, I think that this is actually due to an error-handling issue: if you look at the stack trace that you posted

Re: [Tungsten] NPE in UnsafeShuffleWriter.java

2015-06-19 Thread Josh Rosen
Hey Peter, I think that this is actually due to an error-handling issue: if you look at the stack trace that you posted, the NPE is being thrown from an error-handling branch of a `finally` block: @Override public void write(scala.collection.Iterator<Product2<K, V>> records) throws IOException {
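
The Java snippet is cut off in this listing, so here is a small Python analogue of the general failure mode rather than the actual UnsafeShuffleWriter code. All names (`FakeWriter`, `open_writer`, `write_unsafe`, `write_safe`) are invented for the example: cleanup code in a `finally` block touches a resource that was never initialized, and the resulting error is what the caller sees instead of the original one.

```python
class FakeWriter:
    """Stand-in resource; hypothetical, for illustration only."""

    def close(self):
        pass


def open_writer(fail):
    if fail:
        raise OSError("disk full")
    return FakeWriter()


def write_unsafe(fail):
    writer = None
    try:
        writer = open_writer(fail)
        return "ok"
    finally:
        # Bug: if open_writer raised, writer is still None, so this line
        # raises AttributeError and that is the error the caller sees.
        writer.close()


def write_safe(fail):
    writer = None
    try:
        writer = open_writer(fail)
        return "ok"
    finally:
        # Only clean up what was actually initialized, so the original
        # OSError propagates unchanged.
        if writer is not None:
            writer.close()
```

In Python the original error is still reachable through the new exception's `__context__`; in Java an exception thrown from `finally` simply replaces the pending one, which makes the root cause even harder to spot in a stack trace.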

Re: Sidebar: issues targeted for 1.4.0

2015-06-16 Thread Josh Rosen
Whatever you do, DO NOT use the built-in JIRA 'releases' feature to migrate issues from 1.4.0 to another version: the JIRA feature will have the side-effect of automatically changing the target versions for issues that have been closed, which is going to be really confusing. I've made this mistake

Re: PySpark on PyPi

2015-06-05 Thread Josh Rosen
This has been proposed before: https://issues.apache.org/jira/browse/SPARK-1267 There's currently tighter coupling between the Python and Java halves of PySpark than just requiring SPARK_HOME to be set; if we did this, I bet we'd run into tons of issues when users try to run a newer version of

Re: Possible space improvements to shuffle

2015-06-02 Thread Josh Rosen
The relevant JIRA that springs to mind is https://issues.apache.org/jira/browse/SPARK-2926 If an aggregator and ordering are both defined, then the map side of sort-based shuffle will sort based on the key ordering so that map-side spills can be efficiently merged. We do not currently do a
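
To illustrate why sorting on the map side makes spills cheap to merge, here is a rough plain-Python sketch. It is not Spark's implementation; `merge_spills` and the sample data shape are assumptions made for the example. Each spill is already sorted by key, so a streaming k-way merge brings equal keys together and the aggregator can combine them in a single pass.

```python
import heapq
from functools import reduce
from itertools import groupby
from operator import itemgetter


def merge_spills(spills, combine):
    """Merge key-sorted spills and combine values for equal keys.

    spills: iterables of (key, value) pairs, each sorted by key.
    combine: two-argument function that merges two values for one key.
    Hypothetical helper for illustration only.
    """
    # heapq.merge streams the inputs in key order without re-sorting them.
    merged = heapq.merge(*spills, key=itemgetter(0))
    # Equal keys are now adjacent, so groupby can collect them lazily.
    for key, group in groupby(merged, key=itemgetter(0)):
        yield key, reduce(combine, (value for _, value in group))
```

For example, `list(merge_spills([[("a", 1), ("c", 3)], [("a", 2), ("b", 5)]], lambda x, y: x + y))` walks both spills once and yields one combined record per key.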

Re: ClosureCleaner slowing down Spark SQL queries

2015-05-29 Thread Josh Rosen
Hey, want to file a JIRA for this? This will make it easier to track progress on this issue. Definitely upload the profiler screenshots there, too, since that's helpful information. https://issues.apache.org/jira/browse/SPARK On Wed, May 27, 2015 at 11:12 AM, Nitin Goyal

Re: Kryo option changed

2015-05-23 Thread Josh Rosen
Which commit of master are you building off? It looks like there was a bugfix for an issue related to KryoSerializer buffer configuration: https://github.com/apache/spark/pull/5934 That patch was committed two weeks ago, but you mentioned that you're building off a newer version of master.

Re: Testing spark applications

2015-05-22 Thread Josh Rosen
I think that @holdenk's *spark-testing-base* project publishes some of these test classes as well as some helper classes for testing streaming jobs: https://github.com/holdenk/spark-testing-base On Thu, May 21, 2015 at 10:39 PM, Reynold Xin r...@databricks.com wrote: It is just 15 lines of code

Re: [build system] scheduled datacenter downtime, sunday may 17th

2015-05-17 Thread Josh Rosen
Reminder: the network migration has started this morning, so Jenkins is currently down. Status updates on the migration are being published at http://ucbsystems.org/ On Wed, May 13, 2015 at 5:12 PM, shane knapp skn...@berkeley.edu wrote: our datacenter is rejiggering our network (read: fully

Re: How to link code pull request with JIRA ID?

2015-05-14 Thread Josh Rosen
Spark PRs didn't always handle the JIRA linking. We used to rely on a Jenkins job that ran https://github.com/apache/spark/blob/master/dev/github_jira_sync.py. We switched this over to Spark PRs at a time when the Jenkins GitHub Pull Request Builder plugin was having flakiness issues,

Re: large volume spark job spends most of the time in AppendOnlyMap.changeValue

2015-05-08 Thread Josh Rosen
Do you have any more specific profiling data that you can share? I'm curious to know where AppendOnlyMap.changeValue is being called from. On Fri, May 8, 2015 at 1:26 PM, Michal Haris michal.ha...@visualdna.com wrote: +dev On 6 May 2015 10:45, Michal Haris michal.ha...@visualdna.com wrote:

Re: Github auth problems = some test results not posting

2015-04-05 Thread Josh Rosen
Thanks for catching this. It looks like a recent Jenkins job configuration change inadvertently renamed the GITHUB_OAUTH_KEY environment variable to something else, causing this to break. I've rolled back that change, so hopefully the GitHub posting should start working again. - Josh On Sun,

Re: JavaRDD Aggregate initial value - Closure-serialized zero value reasoning?

2015-02-18 Thread Josh Rosen
It looks like this was fixed in https://issues.apache.org/jira/browse/SPARK-4743 / https://github.com/apache/spark/pull/3605. Can you see whether that patch fixes this issue for you? On Tue, Feb 17, 2015 at 8:31 PM, Matt Cheah mch...@palantir.com wrote: Hi everyone, I was using

Re: Unit tests

2015-02-09 Thread Josh Rosen
Hi Iulian, I think the AkkaUtilsSuite failure that you observed has been fixed in https://issues.apache.org/jira/browse/SPARK-5548 / https://github.com/apache/spark/pull/4343 On February 9, 2015 at 5:47:59 AM, Iulian Dragoș (iulian.dra...@typesafe.com) wrote: Hi Patrick, Thanks for the

Re: Temporary jenkins issue

2015-02-08 Thread Josh Rosen
It looks like this may be fixed soon in Jenkins: https://issues.jenkins-ci.org/browse/JENKINS-25446 https://github.com/jenkinsci/flaky-test-handler-plugin/pull/1 On February 2, 2015 at 7:38:19 PM, Patrick Wendell (pwend...@gmail.com) wrote: Hey All, I made a change to the Jenkins

Re: Results of tests

2015-01-09 Thread Josh Rosen
The Test Result pages for Jenkins builds show some nice statistics for the test run, including individual test times: https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-1.2-Maven-with-YARN/lastSuccessfulBuild/HADOOP_PROFILE=hadoop-2.4,label=centos/testReport/ Currently this only covers

Re: jenkins redirect down (but jenkins is up!), lots of potential

2015-01-05 Thread Josh Rosen
The pull request builder and SCM-polling builds appear to be working fine, but the links in pull request comments won't work because the AMP Lab webserver is still down. In the meantime, though, you can continue to access Jenkins through https://hadrian.ist.berkeley.edu/jenkins/ On Mon, Jan 5,

Re: Is there any way to tell if compute is being called from a retry?

2014-12-30 Thread Josh Rosen
This is timely, since I just ran into this issue myself while trying to write a test to reproduce a bug related to speculative execution (I wanted to configure a job so that the first attempt to compute a partition would run slow so that a second, fast speculative copy would be launched). I've

Re: cleaning up cache files left by SPARK-2713

2014-12-24 Thread Josh Rosen
I reviewed and merged that PR, in case you want to try out the fix. - Josh On December 22, 2014 at 10:40:35 AM, Marcelo Vanzin (van...@cloudera.com) wrote: https://github.com/apache/spark/pull/3705 On Mon, Dec 22, 2014 at 10:19 AM, Cody Koeninger c...@koeninger.org wrote: Is there a reason

Re: Confirming race condition in DagScheduler (NoSuchElementException)

2014-12-24 Thread Josh Rosen
I’m investigating this issue and left some comments on the proposed fix: https://github.com/apache/spark/pull/3345#issuecomment-68014353 To summarize, I agree with your description of the problem but think that the right fix may be a bit more involved than what’s proposed in that PR (that PR’s
