Re: [VOTE] SPARK-44444: Use ANSI SQL mode by default

2024-04-15 Thread Josh Rosen
+1

On Mon, Apr 15, 2024 at 11:26 AM Maciej  wrote:

> +1
>
> Best regards,
> Maciej Szymkiewicz
>
> Web: https://zero323.net
> PGP: A30CEF0C31A501EC
>
> On 4/15/24 8:16 PM, Rui Wang wrote:
>
> +1, non-binding.
>
> Thanks Dongjoon for driving this!
>
>
> -Rui
>
> On Mon, Apr 15, 2024 at 10:10 AM Xinrong Meng  wrote:
>
>> +1
>>
>> Thank you @Dongjoon Hyun  !
>>
>> On Mon, Apr 15, 2024 at 6:33 AM beliefer  wrote:
>>
>>> +1
>>>
>>>
>>> On 2024-04-15 15:54:07, "Peter Toth" wrote:
>>>
>>> +1
>>>
>>> Wenchen Fan wrote (on Mon, Apr 15, 2024, 9:08):
>>>
 +1

 On Sun, Apr 14, 2024 at 6:28 AM Dongjoon Hyun 
 wrote:

> I'll start from my +1.
>
> Dongjoon.
>
> On 2024/04/13 22:22:05 Dongjoon Hyun wrote:
> > Please vote on SPARK-44444 to use ANSI SQL mode by default.
> > The technical scope is defined in the following PR which is
> > one line of code change and one line of migration guide.
> >
> > - DISCUSSION:
> > https://lists.apache.org/thread/ztlwoz1v1sn81ssks12tb19x37zozxlz
> > - JIRA: https://issues.apache.org/jira/browse/SPARK-44444
> > - PR: https://github.com/apache/spark/pull/46013
> >
> > The vote is open until April 17th 1AM (PST) and passes
> > if a majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
> >
> > [ ] +1 Use ANSI SQL mode by default
> > [ ] -1 Do not use ANSI SQL mode by default because ...
> >
> > Thank you in advance.
> >
> > Dongjoon
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: In Apache Spark JIRA, spark/dev/github_jira_sync.py not running properly

2020-04-29 Thread Josh Rosen
(Catching up on a backlog of emails, hence my belated reply)

I just checked the spark-prs app engine logs and it appears that our JIRA
API calls are failing due to a CAPTCHA check (same issue as before). Based
on https://jira.atlassian.com/browse/JRASERVER-40362, it sounds like this
is a fairly common problem.

I manually logged into the 'apachespark' JIRA account and completed the
CAPTCHA, so *hopefully* things should be temporarily unbroken.

To permanently fix this issue, we might need to use OAuth tokens for
connecting to JIRA (instead of basic username + password auth). It looks
like the Python JIRA library supports this (
https://jira.readthedocs.io/en/master/examples.html#oauth) and I found some
promising-looking instructions on how to generate the OAuth tokens:
https://www.redradishtech.com/display/KB/How+to+write+a+Python+script+authenticating+with+Jira+via+OAuth
.
However, it looks like you need to be a JIRA administrator in order to
configure the applink so I can't fix this by myself.
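
To make the idea concrete, connecting with OAuth would look roughly like the
sketch below (using the Python JIRA library mentioned above; all of the token
values and the key file path are placeholders, not real credentials):

from jira import JIRA

# Placeholder credentials: these would come out of the applink setup
# described in the instructions linked above.
with open('jira_private_key.pem') as key_file:
    key_cert = key_file.read()

jira_client = JIRA(
    server='https://issues.apache.org/jira',
    oauth={
        'access_token': '<access-token>',
        'access_token_secret': '<access-token-secret>',
        'consumer_key': '<consumer-key>',
        'key_cert': key_cert,
    })

# OAuth-authenticated calls should no longer trip the CAPTCHA check.
print(jira_client.server_info())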

It would be great if the native GitHub <-> JIRA integration could meet our
needs; this probably either didn't exist or wasn't configurable by us when
we first wrote our own integration / sync script.

On Wed, Apr 29, 2020 at 6:21 PM Hyukjin Kwon  wrote:

> Actually, let me just take a look myself and bring some updates soon.
>
> On Thu, Apr 30, 2020 at 9:13 AM, Hyukjin Kwon wrote:
>
>> WDYT @Josh Rosen ?
>> Seems
>> https://github.com/databricks/spark-pr-dashboard/blob/1e799c9e510fa8cdc9a6c084a777436bebeabe10/sparkprs/controllers/tasks.py#L131-L142
>>  this
>> isn't working anymore.
>> Does it make sense to move it to native Jira-GitHub integration
>> <https://confluence.atlassian.com/adminjiracloud/connect-jira-cloud-to-github-814188429.html>
>> ?
>> It won't change JIRA status as we used to do but it might be better from
>> a cursory look. However, maybe I missed some context.
>>
>>
>> On Thu, Apr 30, 2020 at 2:46 AM, Nicholas Chammas wrote:
>>
>>> Not sure what you mean. The native integration will auto-link from a
>>> Jira ticket to the PRs that mention that ticket. I don't think it will
>>> update the ticket's status, though.
>>>
>>> Would you like me to file a ticket with Infra and see what they say?
>>>
>>> On Tue, Apr 28, 2020 at 12:21 AM Hyukjin Kwon 
>>> wrote:
>>>
>>>> Maybe it's time to switch. Do you know if we can still link the JIRA
>>>> against Github?
>>>> The script used to change the status of the JIRA too, but it stopped working
>>>> a long time ago - I suspect this isn't a big deal.
>>>>
>>>> On Sat, Apr 25, 2020 at 10:31 AM, Nicholas Chammas wrote:
>>>>
>>>>> Have we asked Infra recently about enabling the native Jira-GitHub
>>>>> integration
>>>>> <https://confluence.atlassian.com/adminjiracloud/connect-jira-cloud-to-github-814188429.html>?
>>>>> Maybe we can deprecate the part of this script that updates Jira tickets
>>>>> with links to the PR and rely on the native integration instead. We use it
>>>>> at my day job, for example.
>>>>>
>>>>> On Fri, Apr 24, 2020 at 12:39 AM Hyukjin Kwon 
>>>>> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> Seems like this github_jira_sync.py
>>>>>> <https://github.com/apache/spark/blob/master/dev/github_jira_sync.py>
>>>>>> script has stopped working completely now.
>>>>>>
>>>>>> https://issues.apache.org/jira/browse/SPARK-31532 <>
>>>>>> https://github.com/apache/spark/pull/28316
>>>>>> https://issues.apache.org/jira/browse/SPARK-31529 <>
>>>>>> https://github.com/apache/spark/pull/28315
>>>>>> https://issues.apache.org/jira/browse/SPARK-31528 <>
>>>>>> https://github.com/apache/spark/pull/28313
>>>>>>
>>>>>> Josh, would you mind taking a look please when you find some time?
>>>>>> There are a bunch of JIRAs now, and it is very confusing which JIRAs are
>>>>>> in progress with a PR and which are not.
>>>>>>
>>>>>>
>>>>>> On Fri, Jul 26, 2019 at 1:20 PM, Hyukjin Kwon wrote:
>>>>>>
>>>>>>> Just FYI, I had to come up with a better JQL to filter out the JIRAs
>>>>>>> that already have linked PRs.
>>>>>>> In case it helps someone, I use this JQL now to look through the

Spark SQL upgrade / migration guide: discoverability and content organization

2019-07-14 Thread Josh Rosen
I'd like to discuss the Spark SQL migration / upgrade guides in the Spark
documentation: these are valuable resources and I think we could increase
that value by making these docs easier to discover and by adding a bit more
structure to the existing content.

For folks who aren't familiar with these docs: the Spark docs have a "SQL
Migration Guide" which lists the deprecations and changes of behavior in
each release:

   - Latest published version:
   https://spark.apache.org/docs/latest/sql-migration-guide-upgrade.html
   - Master branch version (will become 3.0):
   
https://github.com/apache/spark/blob/master/docs/sql-migration-guide-upgrade.md

A lot of community work went into crafting this doc and I really appreciate
those efforts.

This doc is a little hard to find, though, because it's not consistently
linked from release notes pages: the 2.4.0 page links it under "Changes of
Behavior" (
https://spark.apache.org/releases/spark-release-2-4-0.html#changes-of-behavior)
but subsequent maintenance releases do not link to it (
https://spark.apache.org/releases/spark-release-2-4-1.html). It's also not
very cross-linked from the rest of the Spark docs (e.g. the Overview doc,
doc drop-down menus, etc).

I'm also concerned that the doc may be overwhelming to end users (as
opposed to Spark developers):

   - *Entries aren't grouped by component*, so users need to read the
   entire document to spot changes relevant to their use of Spark (for
   example, PySpark changes are not grouped together).
   - *Entries aren't ordered by size / risk of change,* e.g. performance
   impact vs. loud behavior change (stopping with an explicit exception) vs.
   silent behavior changes (e.g. changing default rounding behavior). If we
   assume limited reader attention then it may be important to prioritize the
   order in which we list entries, putting the highest-expected-impact /
   lowest-organic-discoverability changes first.
   - *We don't link JIRAs*, forcing users to do their own archaeology to
   learn more about a specific change.

The existing ML migration guide addresses some of these issues, so maybe we
can emulate it in the SQL guide:
https://spark.apache.org/docs/latest/ml-guide.html#migration-guide

I think that documentation clarity is especially important with Spark 3.0
around the corner: many folks will seek out this information when they
upgrade, so improving this guide can be a high-leverage, high-impact
activity.

What do folks think? Does anyone have examples from other projects which do
a notably good job of crafting release notes / migration guides? I'd be
glad to help with pre-release editing after we decide on a structure and
style.

Cheers,
Josh


Re: Resolving all JIRAs affecting EOL releases

2019-05-15 Thread Josh Rosen
+1 in favor of some sort of JIRA cleanup.

My only request is that we attach some sort of 'bulk-closed' label to
issues that we close via JIRA filter batch operations (and resolve the
issues as "Timed Out" / "Cannot Reproduce", not "Fixed"). Using a label
makes it easier to audit what was closed, simplifying the process of
identifying and re-opening valid issues caught in our dragnet.
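
To make this concrete, a bulk close along those lines could be scripted with
the Python JIRA library roughly as follows (a sketch only: the JQL filter,
credentials, transition name, and resolution value are placeholders that
would need to match whatever we actually agree on):

from jira import JIRA

jira_client = JIRA(server='https://issues.apache.org/jira',
                   basic_auth=('<username>', '<password>'))

# Hypothetical filter: unresolved issues whose affected versions are all EOL.
jql = 'project = SPARK AND resolution = Unresolved AND affectedVersion <= "2.2.3"'

for issue in jira_client.search_issues(jql, maxResults=200):
    # Label the ticket so bulk-closed issues are easy to audit and re-open.
    labels = issue.fields.labels
    if 'bulk-closed' not in labels:
        issue.update(fields={'labels': labels + ['bulk-closed']})
    # Resolve as "Timed Out" rather than "Fixed".
    jira_client.transition_issue(issue, '<resolve-transition-name>',
                                 fields={'resolution': {'name': 'Timed Out'}})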


On Wed, May 15, 2019 at 7:19 AM Sean Owen  wrote:

> I gave up looking through JIRAs a long time ago, so, big respect for
> continuing to try to triage them. I am afraid we're missing a few
> important bug reports in the torrent, but most JIRAs are not
> well-formed, just questions, stale, or simply things that won't be
> added. I do think it's important to reflect that reality, and so I'm
> always in favor of more aggressively closing JIRAs. I think this is
> more standard practice, from projects like TensorFlow/Keras, pandas,
> etc to just automatically drop Issues that don't see activity for N
> days. We won't do that, but, are probably on the other hand far too
> lax in closing them.
>
> Remember that JIRAs stay searchable and can be reopened, so it's not
> like we lose much information.
>
> I'd close anything that hasn't had activity in 2 years (?), as a start.
> I like the idea of closing things that only affect an EOL release,
> but, many items aren't marked, so may need to cast the net wider.
>
> I think only then does it make sense to look at bothering to reproduce
> or evaluate the 1000s that will still remain.
>
> On Wed, May 15, 2019 at 4:25 AM Hyukjin Kwon  wrote:
> >
> > Hi all,
> >
> > I would like to propose to resolve all JIRAs that affect EOL releases -
> 2.2 and below - or that have no affected version
> > specified. I was rather against this approach and considered it a last
> resort when we discussed it roughly 3 years ago.
> > Now I think we should go ahead with this. See below.
> >
> > I have been taking care of this for a long time, almost every day for
> those 3 years. The number of JIRAs
> > keeps increasing and it never goes down. Now the number is going over
> 2500 JIRAs.
> > Did you guys know? In JIRA, we can only go through page by page up to
> 1000 items. So, currently we're even
> > having difficulties going through every JIRA. We should manually filter
> out and check each.
> > The number is going over the manageable size.
> >
> > I am not suggesting this without actually trying anything. This is what
> we have tried within my visibility:
> >
> >   1. Roughly 3 years ago, Sean tried to gather committers and even
> non-committers to sort
> > out this number. At that time, we were only able to keep this number
> as is. After we lost this momentum,
> > it kept increasing again.
> >   2. I scanned _all_ the previous JIRAs at least two times and resolved
> them. Roughly
> > once a year. The rest of them are mostly obsolete but lack enough
> information to investigate further.
> >   3. I strictly stick to "Contributing to JIRA Maintenance"
> https://spark.apache.org/contributing.html and
> > resolve JIRAs.
> >   4. Promoting other people to comment on JIRA or actively resolve them.
> >
> > One of the facts I realised is that increasing the number of committers
> doesn't really help this much (although
> > it might be helpful if somebody active in JIRA becomes a committer.)
> >
> > One important thing I should note is that it's now pretty
> difficult to reproduce and test the
> > issues found in EOL releases. We should git clone, checkout, build and
> test. And then, see if that issue
> > still exists in upstream, and fix. This is non-trivial overhead.
> >
> > Therefore, I would like to propose resolving _all_ the JIRAs that
> target EOL releases - 2.2 and below.
> > Please let me know if anyone has some concerns or objections.
> >
> > Thanks.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: In Apache Spark JIRA, spark/dev/github_jira_sync.py not running properly

2019-04-25 Thread Josh Rosen
The code for this runs in http://spark-prs.appspot.com (see
https://github.com/databricks/spark-pr-dashboard/blob/1e799c9e510fa8cdc9a6c084a777436bebeabe10/sparkprs/controllers/tasks.py#L137
)

I checked the AppEngine logs and it looks like we're getting error
responses, possibly due to a credentials issue:

Exception when starting progress on JIRA issue SPARK-27355
> (/base/data/home/apps/s~spark-prs/live.412416057856832734/sparkprs/controllers/tasks.py:142)
> Traceback (most recent call last):
>   File "/base/data/home/apps/s~spark-prs/live.412416057856832734/sparkprs/controllers/tasks.py", line 138, in update_pr
>     start_issue_progress("%s-%s" % (app.config['JIRA_PROJECT'], issue_number))
>   File "/base/data/home/apps/s~spark-prs/live.412416057856832734/sparkprs/jira_api.py", line 27, in start_issue_progress
>     jira_client = get_jira_client()
>   File "/base/data/home/apps/s~spark-prs/live.412416057856832734/sparkprs/jira_api.py", line 18, in get_jira_client
>     app.config['JIRA_PASSWORD']))
>   File "/base/data/home/apps/s~spark-prs/live.412416057856832734/lib/jira/client.py", line 472, in __init__
>     si = self.server_info()
>   File "/base/data/home/apps/s~spark-prs/live.412416057856832734/lib/jira/client.py", line 2133, in server_info
>     j = self._get_json('serverInfo')
>   File "/base/data/home/apps/s~spark-prs/live.412416057856832734/lib/jira/client.py", line 2549, in _get_json
>     r = self._session.get(url, params=params)
>   File "/base/data/home/apps/s~spark-prs/live.412416057856832734/lib/jira/resilientsession.py", line 151, in get
>     return self.__verb('GET', url, **kwargs)
>   File "/base/data/home/apps/s~spark-prs/live.412416057856832734/lib/jira/resilientsession.py", line 147, in __verb
>     raise_on_error(response, verb=verb, **kwargs)
>   File "/base/data/home/apps/s~spark-prs/live.412416057856832734/lib/jira/resilientsession.py", line 57, in raise_on_error
>     r.status_code, error, r.url, request=request, response=r, **kwargs)
> JIRAError: JiraError HTTP 403 url:
> https://issues.apache.org/jira/rest/api/2/serverInfo text:
> CAPTCHA_CHALLENGE; login-url=https://issues.apache.org/jira/login.jsp

Re: spark-tests.appspot status?

2017-12-14 Thread Josh Rosen
Yep, it turns out that there was a problem with the Jenkins job. I've
restarted it and it should be backfilling now (this might take a while).

On Thu, Dec 14, 2017 at 1:57 PM Xin Lu  wrote:

> Most likely the job that uploads this stuff at databricks is broken.
>
> On Thu, Dec 14, 2017 at 12:41 PM, Imran Rashid 
> wrote:
>
>> Hi,
>>
>> I was trying to look at some flaky tests and old jiras, and noticed that
>> spark-tests.appspot.com is still live, but hasn't updated with any
>> builds from the last 2 months.  I was curious what the status is --
>> intentionally deprecated?  just needs a restart?  more dev work required?
>>
>> It's pretty handy for dealing with flaky tests; I could help get it up again
>> if it's something small.
>>
>> thanks,
>> Imran
>>
>
>


Re: Spark build is failing in amplab Jenkins

2017-11-05 Thread Josh Rosen
Disconnecting and reconnecting each Jenkins worker appears to have resolved
the PATH issue: in the System Info page for each worker, I now see a PATH
which includes Anaconda.

To restart the worker processes, I only needed to hit the "Disconnect"
button in the Jenkins master UI for each worker, wait a few seconds, then
hit the "Relaunch slave agent" button. It's fortunate that this could be
done entirely from the Jenkins UI without having to actually SSH into the
individual worker machines.

It looks like all workers should now be in a good state. If you see any new
failures due to PATH issues, though, then please ping this thread.

On Sun, Nov 5, 2017 at 9:21 AM shane knapp <skn...@berkeley.edu> wrote:

> hello from the canary islands!  ;)
>
> i just saw this thread, and another one about a quick power loss at the
> colo where our machines are hosted.  the master is on UPS but the workers
> aren't...  and when they come back, the PATH variable specified in the
> workers' configs get dropped and we see behavior like this.
>
> josh rosen (whom i am talking with over chat) will be restarting the
> ssh/worker processes on all of the worker nodes immediately.  this will fix
> the problem.
>
> now, back to my holiday!  :)
>
> On Sun, Nov 5, 2017 at 5:01 PM, Xin Lu <x...@salesforce.com> wrote:
>
>> Also another thing to look at is whether you guys have any kind of nightly
>> cleanup scripts for these workers that completely nuke the conda
>> environments.  If there is one maybe that's why some of them recover after
>> a while.  I don't know enough about your infra right now to understand all
>> the things that could cause the current unstable behavior so these are just
>> some guesses.  Anyway, I sent a previous email about running spark tests in
>> docker and no one responded.  At Databricks the whole build infra to run
>> spark tests was very different.  Spark tests were run in docker and had a
>> jenkins that was dedicated to it.  Perhaps that's something that can be
>> replicated for OSS.
>>
>> On Sun, Nov 5, 2017 at 8:45 AM, Xin Lu <x...@salesforce.com> wrote:
>>
>>> So, right now it looks like 2 and 6 are still broken, but 7 has
>>> recovered:
>>>
>>>
>>> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/SparkPullRequestBuilder/buildTimeTrend
>>>
>>> What I am suggesting is to just perhaps modify the
>>> SparkPullRequestBuilder configuration and run "which python" and then
>>> "python -V" to see what the pull request builder is seeing before it exits.
>>> Perhaps the sparkpullrequest builders are erroneously targeting a different
>>> conda environment because you have multiple nodes on each worker.   It
>>> looks like there is some build that's changing the environment and that's
>>> causing the workers to break and recover somewhat randomly.
>>>
>>> Xin
>>>
>>> On Sun, Nov 5, 2017 at 8:29 AM, Alyssa Morrow <morrowaly...@gmail.com>
>>> wrote:
>>>
>>>> Hi Xin,
>>>>
>>>> The extent of which our projects set exports are:
>>>>
>>>> export JAVA_HOME=/usr/java/jdk1.8.0_60
>>>> export CONDA_BIN=/home/anaconda/bin/
>>>> export
>>>> MVN_BIN=/home/jenkins/tools/hudson.tasks.Maven_MavenInstallation/Maven_3.1.1/bin/
>>>> export PATH=${JAVA_HOME}/bin/:${MVN_BIN}:${CONDA_BIN}:${PATH}
>>>>
>>>> As for python, *which python* gives us python installed in the conda
>>>> virtual environment:
>>>> ~/.conda/envs/build/bin/python
>>>>
>>>> These steps look similar to how spark sets up its build. Not sure if
>>>> this helps. Let me know if any other information would be helpful.
>>>>
>>>> Best,
>>>>
>>>> Alyssa Morrow
>>>> akmor...@berkeley.edu
>>>> 414-254-6645 <(414)%20254-6645>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Nov 5, 2017, at 8:15 AM, Xin Lu <x...@salesforce.com> wrote:
>>>>
>>>> Thanks, I actually don't have access to the machines or build configs
>>>> to do proper debugging on this.  It looks like these  workers are shared
>>>> with other build configurations  like avocado and cannoli as well and
>>>> really any of the shared configs could be changing your JAVA_HOME and
>>>> python environments.   It is fairly easy to debug if you can just change
>>>> the spark build to run "which python"  and run it on on

Re: Raise Jenkins timeout?

2017-10-09 Thread Josh Rosen
I bumped the timeouts up to 255 minutes (to exceed
https://github.com/apache/spark/blame/master/dev/run-tests-jenkins.py#L185).
Let's see if this resolves the problem.

On Mon, Oct 9, 2017 at 9:30 AM shane knapp  wrote:

> ++joshrosen
>
> On Mon, Oct 9, 2017 at 1:48 AM, Sean Owen  wrote:
>
>> I'm seeing jobs killed regularly, presumably because the time out (210
>> minutes?)
>>
>>
>> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.6/3907/console
>>
>> Possibly related: this master-SBT-2.7 build hasn't passed in weeks:
>>
>>
>> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/
>>
>> All seem to be timeouts, explicilty or implicitly.
>>
>> I know the answer is make tests faster, but failing that, can we raise
>> the timeout again to ... 4 hours ? or maybe I misread why these are being
>> killed?
>>
>> If somehow it's load on Jenkins servers, we could consider getting rid of
>> the two different Hadoop builds as I think it serves little purpose to
>> build separately for the two (or even support 2.6 specially).
>>
>
>


Re: Are there multiple processes out there running JIRA <-> Github maintenance tasks?

2017-08-30 Thread Josh Rosen
I think that's because https://issues.apache.org/jira/browse/SPARK-21728 was
re-opened in JIRA and had a new PR associated with it, so the bot did the
temporary issue re-assignment in order to be able to transition the issue
status from "reopened" to "in progress".

On Wed, Aug 30, 2017 at 1:18 PM Marcelo Vanzin <van...@cloudera.com> wrote:

> I'm still seeing some odd behavior.
>
> I just deleted my repo's branch for
> https://github.com/apache/spark/pull/19013 and the script seems to
> have done some update to the bug, since I got a bunch of e-mails.
>
> On Mon, Aug 28, 2017 at 2:34 PM, Josh Rosen <joshro...@databricks.com>
> wrote:
> > This should be fixed now. The problem was that debug code had been pushed
> > while investigating the JIRA linkage failure but was not removed and this
> > problem went unnoticed because linking was failing well before the debug
> > code was hit. Once the JIRA connectivity issues were resolved, the
> > problematic code was running and causing the linking operation to fail
> > mid-way through, triggering a finally block which undid the JIRA
> assignment.
> >
> > I've rolled back the bad code and enabled additional monitoring in
> > StackDriver to raise an alert if we see new linking failures.
> >
> > On Mon, Aug 28, 2017 at 12:02 PM Marcelo Vanzin <van...@cloudera.com>
> wrote:
> >>
> >> It seems a little wonky, though. Feels like it's updating JIRA every
> >> time you comment on a PR. Or maybe it's still working through the
> >> backlog...
> >>
> >> On Mon, Aug 28, 2017 at 9:57 AM, Reynold Xin <r...@databricks.com>
> wrote:
> >> > The process for doing that was down before, and might've come back up
> >> > and
> >> > are going through the huge backlog.
> >> >
> >> >
> >> > On Mon, Aug 28, 2017 at 6:56 PM, Sean Owen <so...@cloudera.com>
> wrote:
> >> >>
> >> >> Like whatever reassigns JIRAs after a PR is closed?
> >> >>
> >> >> It seems to be going crazy, or maybe there are many running. Not sure
> >> >> who
> >> >> owns that, but can he/she take a look?
> >> >>
> >> >
> >>
> >>
> >>
> >> --
> >> Marcelo
> >>
> >> -
> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >>
> >
>
>
>
> --
> Marcelo
>


Re: Are there multiple processes out there running JIRA <-> Github maintenance tasks?

2017-08-28 Thread Josh Rosen
This should be fixed now. The problem was that debug code had been pushed
while investigating the JIRA linkage failure but was not removed and this
problem went unnoticed because linking was failing well before the debug
code was hit. Once the JIRA connectivity issues were resolved, the
problematic code was running and causing the linking operation to fail
mid-way through, triggering a finally block which undid the JIRA assignment.

I've rolled back the bad code and enabled additional monitoring in
StackDriver to raise an alert if we see new linking failures.

On Mon, Aug 28, 2017 at 12:02 PM Marcelo Vanzin  wrote:

> It seems a little wonky, though. Feels like it's updating JIRA every
> time you comment on a PR. Or maybe it's still working through the
> backlog...
>
> On Mon, Aug 28, 2017 at 9:57 AM, Reynold Xin  wrote:
> > The process for doing that was down before, and might've come back up and
> > are going through the huge backlog.
> >
> >
> > On Mon, Aug 28, 2017 at 6:56 PM, Sean Owen  wrote:
> >>
> >> Like whatever reassigns JIRAs after a PR is closed?
> >>
> >> It seems to be going crazy, or maybe there are many running. Not sure
> who
> >> owns that, but can he/she take a look?
> >>
> >
>
>
>
> --
> Marcelo
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Some PRs not automatically linked to JIRAs

2017-08-02 Thread Josh Rosen
Usually the backend of https://spark-prs.appspot.com does the linking while
processing PR update tasks. It appears that the site's connections to JIRA
have started failing:

ConnectionError: ('Connection aborted.', HTTPException('Deadline exceeded
while waiting for HTTP response from URL:
https://issues.apache.org/jira/rest/api/2/serverInfo',))

From Stackdriver's log-based metrics, I can spot that this problem started
around July 24th. We're already using a much-higher-than-default URL fetch
timeout, so it's possible that the problem is related to access
credentials, IP blocks, outdated client libraries, or something else.
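
For reference, the timeout tweak amounts to something like the following on
the App Engine Python runtime (a sketch; the 60-second value is illustrative
and not necessarily what spark-prs actually uses):

# Illustrative only: raise the default URL fetch deadline so that slow
# JIRA responses are not cut off at App Engine's much lower default.
from google.appengine.api import urlfetch

urlfetch.set_default_fetch_deadline(60)  # seconds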

On Wed, Aug 2, 2017 at 1:10 PM Bryan Cutler  wrote:

> Thanks Hyukjin!  I didn't see your previous message..  It looks like your
> manual run worked pretty well for the JIRAs I'm following, the only thing
> is that it didn't mark them as "in progress", but that's not a big deal.
> Otherwise that helps until we can find out why it's not doing this
> automatically.  I'm not familiar with that script, can anyone run it to
> apply to a single JIRA they are working on?
>
> On Wed, Aug 2, 2017 at 12:09 PM, Hyukjin Kwon  wrote:
>
>> I was wondering about this too..
>>
>>
>> Yes, actually, I have been manually adding some links by following the
>> same steps as the script before.
>>
>> I was thinking it'd be nicer to run this manually once, so I
>> ran this against a single JIRA
>>
>> first - https://issues.apache.org/jira/browse/SPARK-21526 - to show how it
>> looks and check if there
>>
>> is any issue or objection, just in case.
>>
>>
>> Will run this manually now once. I will revert all my action manually if
>> there is any issue by doing this.
>>
>>
>> 2017-08-03 3:50 GMT+09:00 Sean Owen :
>>
>>> Hyukjin mentioned this here earlier today and had run it manually, but
>>> yeah I'm not sure where it normally runs or why it hasn't. Shane not sure
>>> if you're the person to ask?
>>>
>>>
>>> On Wed, Aug 2, 2017 at 7:47 PM Bryan Cutler  wrote:
>>>
 Hi Devs,

 I've noticed a couple PRs recently have not been automatically linked
 to the related JIRAs.  This was one of mine (I linked it manually)
 https://issues.apache.org/jira/browse/SPARK-21583, but I've seen it
 happen elsewhere.  I think this is the script that does it, but it hasn't
 been changed recently
 https://github.com/apache/spark/blob/master/dev/github_jira_sync.py.
 Anyone else seen this or know what's going on?

 Thanks,
 Bryan

>>>
>>
>


Crowdsourced triage Scapegoat compiler plugin warnings

2017-05-24 Thread Josh Rosen
I'm interested in using the Scapegoat
<https://github.com/sksamuel/scapegoat> Scala compiler plugin to find
potential bugs and performance problems in Spark. Scapegoat has a useful
built-in set of inspections and is pretty easy to extend with custom ones.
For example, I added an inspection to spot places where we call *.apply()* on
a Seq which is not an IndexedSeq
<https://github.com/sksamuel/scapegoat/pull/159> in order to make it easier
to spot potential O(n^2) performance bugs.

There are lots of false-positives and benign warnings (as with any linter /
static analyzer) so I don't think it's feasible to us to include this as a
blocking step in our regular build. I am planning to build tooling to
surface only new warnings so going forward this can become a useful
code-review aid.

The current codebase has roughly 1700 warnings that I would like to triage
and categorize as false-positives or real bugs. I can't do this alone, so
here's how you can help:

   - Visit the Google Docs spreadsheet at
   
https://docs.google.com/spreadsheets/d/1z7xNMjx7VCJLCiHOHhTth7Hh4R0F6LwcGjEwCDzrCiM/edit?usp=sharing
and
   find an un-triaged warning.
   - In the columns at the right of the sheet, enter your name in the
   appropriate column to mark a warning as a false-positive or as a real bug
   and/or performance issue. If think a warning is a real issue then use the
   "comments" column for providing additional detail.
   - Please don't file JIRAs or PRs for individual warnings; I suspect that
   we'll find clusters of issues which are best fixed in a few larger PRs vs.
   lots of smaller ones. Certain warnings are probably simply style issues so
   we should discuss those before trying to fix them.

The sheet has hidden columns capturing the Spark revision and Scapegoat
revision. I can use this to programmatically update the sheet and remap
lines after updating either Scapegoat (to suppress false-positives) or
Spark (to incorporate fixes and surface new warnings). For those who are
interested, the sheet was produced with this script:
https://gist.github.com/JoshRosen/1ae12a979880d9a98988aa87d70ff2a8

Depending on the results of this experiment we might want to integrate a
high-signal subset of the Scapegoat warnings into our build. I'm also
hoping that we'll be able to build a useful corpus of triaged warnings in
order to help improve Scapegoat itself and eliminate common false-positives.

Thanks and happy bug-hunting,
Josh Rosen


Re: New Optimizer Hint

2017-05-01 Thread Josh Rosen
The issue of UDFS which return structs being evaluated many times when
accessing the returned struct's fields sounds like
https://issues.apache.org/jira/browse/SPARK-17728; that issue mentions a
trick of using *array* and *explode* to prevent project collapsing.
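
Roughly, the workaround looks like this in PySpark (a sketch only, reusing
df1 and the struct-returning user_agent_details UDF from Michael's example
quoted below; I haven't re-verified the resulting plans here):

from pyspark.sql import functions as F

# Wrap the UDF result in array() and then explode() it. Because explode()
# is a generator, CollapseProject cannot merge the two projections, so the
# UDF ends up being evaluated only once per row.
df2 = df1.withColumn(
    "ua", F.explode(F.array(user_agent_details(df1["user_agent"]))))
df3 = df2.select(df2["ua"].getField("device_form_factor").alias("c1"),
                 df2["ua"].getField("browser_version").alias("c2"))
df3.explain(True)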

On Thu, Apr 20, 2017 at 8:55 AM Reynold Xin  wrote:

> Doesn't common sub expression elimination address this issue as well?
>
> On Thu, Apr 20, 2017 at 6:40 AM Herman van Hövell tot Westerflier <
> hvanhov...@databricks.com> wrote:
>
>> Hi Michael,
>>
>> This sounds like a good idea. Can you open a JIRA to track this?
>>
>> My initial feedback on your proposal would be that you might want to
>> express the no_collapse at the expression level and not at the plan
>> level.
>>
>> HTH
>>
>> On Thu, Apr 20, 2017 at 3:31 PM, Michael Styles <
>> michael.sty...@shopify.com> wrote:
>>
>>> Hello,
>>>
>>> I am in the process of putting together a PR that introduces a new hint
>>> called NO_COLLAPSE. This hint is essentially identical to Oracle's NO_MERGE
>>> hint.
>>>
>>> Let me first give an example of why I am proposing this.
>>>
>>> df1 = sc.sql.createDataFrame([(1, "abc")], ["id", "user_agent"])
>>> df2 = df1.withColumn("ua", user_agent_details(df1["user_agent"]))
>>> df3 = df2.select(df2["ua"].device_form_factor.alias("c1"),
>>> df2["ua"].browser_version.alias("c2"))
>>> df3.explain(True)
>>>
>>> == Parsed Logical Plan ==
>>> 'Project [ua#85[device_form_factor] AS c1#90, ua#85[browser_version] AS
>>> c2#91]
>>> +- Project [id#80L, user_agent#81, UDF(user_agent#81) AS ua#85]
>>>+- LogicalRDD [id#80L, user_agent#81]
>>>
>>> == Analyzed Logical Plan ==
>>> c1: string, c2: string
>>> Project [ua#85.device_form_factor AS c1#90, ua#85.browser_version AS
>>> c2#91]
>>> +- Project [id#80L, user_agent#81, UDF(user_agent#81) AS ua#85]
>>>+- LogicalRDD [id#80L, user_agent#81]
>>>
>>> == Optimized Logical Plan ==
>>> Project [UDF(user_agent#81).device_form_factor AS c1#90,
>>> UDF(user_agent#81).browser_version AS c2#91]
>>> +- LogicalRDD [id#80L, user_agent#81]
>>>
>>> == Physical Plan ==
>>> *Project [UDF(user_agent#81).device_form_factor AS c1#90,
>>> UDF(user_agent#81).browser_version AS c2#91]
>>> +- Scan ExistingRDD[id#80L,user_agent#81]
>>>
>>> user_agent_details is a user-defined function that returns a struct. As
>>> can be seen from the generated query plan, the function is being executed
>>> multiple times which could lead to performance issues. This is due to the
>>> CollapseProject optimizer rule that collapses adjacent projections.
>>>
>>> I'm proposing a hint that prevent the optimizer from collapsing adjacent
>>> projections. A new function called 'no_collapse' would be introduced for
>>> this purpose. Consider the following example and generated query plan.
>>>
>>> df1 = sc.sql.createDataFrame([(1, "abc")], ["id", "user_agent"])
>>> df2 = F.no_collapse(df1.withColumn("ua",
>>> user_agent_details(df1["user_agent"])))
>>> df3 = df2.select(df2["ua"].device_form_factor.alias("c1"),
>>> df2["ua"].browser_version.alias("c2"))
>>> df3.explain(True)
>>>
>>> == Parsed Logical Plan ==
>>> 'Project [ua#69[device_form_factor] AS c1#75, ua#69[browser_version] AS
>>> c2#76]
>>> +- NoCollapseHint
>>>+- Project [id#64L, user_agent#65, UDF(user_agent#65) AS ua#69]
>>>   +- LogicalRDD [id#64L, user_agent#65]
>>>
>>> == Analyzed Logical Plan ==
>>> c1: string, c2: string
>>> Project [ua#69.device_form_factor AS c1#75, ua#69.browser_version AS
>>> c2#76]
>>> +- NoCollapseHint
>>>+- Project [id#64L, user_agent#65, UDF(user_agent#65) AS ua#69]
>>>   +- LogicalRDD [id#64L, user_agent#65]
>>>
>>> == Optimized Logical Plan ==
>>> Project [ua#69.device_form_factor AS c1#75, ua#69.browser_version AS
>>> c2#76]
>>> +- NoCollapseHint
>>>+- Project [UDF(user_agent#65) AS ua#69]
>>>   +- LogicalRDD [id#64L, user_agent#65]
>>>
>>> == Physical Plan ==
>>> *Project [ua#69.device_form_factor AS c1#75, ua#69.browser_version AS
>>> c2#76]
>>> +- *Project [UDF(user_agent#65) AS ua#69]
>>>+- Scan ExistingRDD[id#64L,user_agent#65]
>>>
>>> As can be seen from the query plan, the user-defined function is now
>>> evaluated once per row.
>>>
>>> I would like to get some feedback on this proposal.
>>>
>>> Thanks.
>>>
>>>
>>
>>
>> --
>>
>> Herman van Hövell
>>
>> Software Engineer
>>
>> Databricks Inc.
>>
>> hvanhov...@databricks.com
>>
>> +31 6 420 590 27
>>
>> databricks.com
>>
>


Re: RFC: deprecate SparkStatusTracker, remove JobProgressListener

2017-03-24 Thread Josh Rosen
I think that it should be safe to remove JobProgressListener but I'd like
to keep the SparkStatusTracker API.

SparkStatusTracker was originally developed to provide a stable
programmatic status API for use by Hive on Spark. SparkStatusTracker
predated the Spark REST APIs for status tracking which is why there's some
overlap of functionality between those APIs. Given that SparkStatusTracker
is a longstanding stable public API I'd prefer to not remove it because
there may be a lot of existing user code that depends on it. It's also a
relatively easy-to-support API because it presents a clean query
abstraction and doesn't expose mutable data structures via its public
interface, so we should be able to support this interface with an
implementation based on the new UI database.
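
For example, the kind of usage we'd be committing to support looks roughly
like the following in PySpark (a sketch; the job group name and the toy job
are made up for illustration):

# Query job and stage progress through the stable status API instead of
# reaching into listener internals.
sc.setJobGroup("example-group", "status tracker demo")
sc.parallelize(range(100000), 8).count()

tracker = sc.statusTracker()
for job_id in tracker.getJobIdsForGroup("example-group"):
    job_info = tracker.getJobInfo(job_id)
    print("job", job_id, "status:", job_info.status)
    for stage_id in job_info.stageIds:
        stage_info = tracker.getStageInfo(stage_id)
        if stage_info is not None:
            print("  stage", stage_id, ":", stage_info.numCompletedTasks,
                  "of", stage_info.numTasks, "tasks complete")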

JobProgressListener, on the other hand, has a messy interface which was not
designed for use by code outside of Spark. This interface was marked as
@DeveloperAPI as part of Spark 1.0 (see
https://github.com/apache/spark/pull/648) but I think that decision was a
mistake because the interface exposes mutable internal state. For example,
if a user wanted to query completed stages using JobProgressListener then
they would access a field declared as

  val completedStages = ListBuffer[StageInfo]()

which requires the user to explicitly synchronize on the
JobProgressListener instance in order to safely access this field. This is
a bad API and it's not really possible to cleanly present this same
interface with a database-backed implementation. In addition, this
interface has not been fully stable over time and there's currently no
public / DeveloperAPI mechanism to get access to the Spark-constructed
instance of JobProgressListener.

Given all of this this, I think that it's unlikely that users are relying
on JobProgressListener since Spark has other APIs for status tracking which
are more stable and easier to work with. If anyone is relying on this then
they could inline the JobProgressListener source in their own project and
instantiate and register the listener themselves.

Thus I think it's fine to remove JobProgressListener but think we should
keep SparkStatusTracker. I think that the decision of whether we want to
make a next-generation "V2" programmatic status API based on the REST API
types can happen later / independently.

On Thu, Mar 23, 2017 at 1:32 PM Marcelo Vanzin  wrote:

> Hello all,
>
> For those not following, I'm working on SPARK-18085, where my goal is
> to decouple the storage of UI data from the actual UI implementation.
> This is mostly targeted at the history server, so that it's possible
> to quickly load a "database" with UI information instead of the
> existing way of re-parsing event logs, but I think it also helps with
> the live UI, since it doesn't require storing UI information in memory
> and thus relieves some memory pressure on the driver. (I may still add
> an in-memory database in that project, but that's digressing from the
> topic at hand.)
>
> One of my (unwritten) goals in that project was to get rid of
> JobProgressListener. Now that I'm at a point where I can do that from
> the UI's p.o.v., I ran into SparkStatusTracker. So I'd like to get
> people's views on two topics.
>
> (i) deprecate SparkStatusTracker, provide a new API based on the
> public REST types.
>
> SparkStatusTracker provides yet another way of getting job, stage and
> executor information (aside from the UI and the API). It has its own
> types that model those, which are based on the existing UI types but
> not the same. It could be replaced by making REST calls to the UI
> endpoint, but that's sub-optimal since it doesn't make a lot of sense
> to do that when you already have an instance of SparkContext to play
> with.
>
> Since that's a public, stable API, it can't be removed right away. But
> I'd like to propose that we deprecate it, and provide a new API that
> is based on the REST types (which, with my work, are also used in the
> UI). The existing "SparkStatusTracker" would still exist until we can
> remove it, of course.
>
> What do people think about this approach? Another option is to not add
> the new API, but keep SparkStatusTracker around using the new UI
> database to back it.
>
> (ii) Remove JobProgressListener
>
> I didn't notice it before, but JobProgressListener is public-ish
> (@DeveloperApi). I'm not sure why that is, and it's a weird thing
> because it exposes non-public types (from UIData.scala) in its API.
> With the work I'm doing, and the above suggestion about
> SparkStatusTracker, JobProgressListener becomes unused in Spark
> itself, and keeping it would just mean the driver keeps using unneeded
> memory.
>
> Are there concerns about removing that class? Its functionality is
> available in both SparkStatusTracker and the REST API, so it's mostly
> redundant.
>
>
> So, thoughts?
>
>
> Note to self: (i) above means I'd have to scale back some of my goals
> for SPARK-18085. More 

Re: Nightly builds for master branch have been failing

2017-02-24 Thread Josh Rosen
I spotted the problem and it appears to be a misconfiguration / missing
entry in the template which generates the packaging jobs. I've corrected
the problem but now the jobs appear to be hanging / flaking on the Git
clone. Hopefully this is just a transient issue, so let's retry tonight and
see whether things are fixed.

On Fri, Feb 24, 2017 at 7:26 AM Sean Owen  wrote:

> This job is using Java 7 still. Josh are you the person to ask? I think it
> needs to set JAVA_HOME to Java 8.
>
> On Fri, Feb 24, 2017, 12:37 Liwei Lin  wrote:
>
> Nightly builds for master branch have been failing:
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-maven-snapshots/buildTimeTrend
>
> It'd be great if we can get this fixed. Thanks.
>
>
>
> Cheers,
> Liwei
>
>


Re: File JIRAs for all flaky test failures

2017-02-15 Thread Josh Rosen
A useful tool for investigating test flakiness is my Jenkins Test Explorer
service, running at https://spark-tests.appspot.com/

This has some useful timeline views for debugging flaky builds. For
instance, at
https://spark-tests.appspot.com/jobs/spark-master-test-maven-hadoop-2.6 (may
be slow to load) you can see this chart: https://i.imgur.com/j8LV3pX.png.
Here, each column represents a test run and each row represents a test
which failed at least once over the displayed time period.

In that linked example screenshot you'll notice that a few columns have
grey squares indicating that tests were skipped but lack any red squares to
indicate test failures. This usually indicates that the build failed due to
a problem other than an individual test failure. For example, I clicked
into one of those builds and found that one test suite failed in test setup
because the previous suite had not properly cleaned up its SparkContext
(I'll file a JIRA for this).

You can click through the interface to drill down to reports on individual
builds, tests, suites, etc. As an example of an individual test's detail
page,
https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.rdd.LocalCheckpointSuite&test_name=missing+checkpoint+block+fails+with+informative+message
shows
the patterns of flakiness in a streaming checkpoint test.

Finally, there's an experimental "interesting new test failures" report
which tries to surface tests which have started failing very recently:
https://spark-tests.appspot.com/failed-tests/new. Specifically, entries in
this feed are test failures which a) occurred in the last week, b) were not
part of a build which had 20 or more failed tests, and c) were not observed
to fail in during the previous week (i.e. no failures from [2 weeks ago, 1
week ago)), and d) which represent the first time that the test failed this
week (i.e. a test case will appear at most once in the results list). I've
also exposed this as an RSS feed at
https://spark-tests.appspot.com/rss/failed-tests/new.
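
For anyone curious, the filter behind that report boils down to roughly the
following logic (a simplified sketch, not the actual implementation; the
failure-record fields here are made up):

from datetime import datetime, timedelta

def interesting_new_failures(failures, now=None):
    """failures: list of dicts with made-up 'test_name', 'time' (datetime),
    and 'failures_in_build' fields."""
    now = now or datetime.utcnow()
    week_ago = now - timedelta(days=7)
    two_weeks_ago = now - timedelta(days=14)

    # (c) tests that already failed during the previous week are not "new".
    failed_last_week = {f['test_name'] for f in failures
                        if two_weeks_ago <= f['time'] < week_ago}

    seen = set()
    for f in sorted(failures, key=lambda f: f['time']):
        if f['time'] < week_ago:                 # (a) only the last week
            continue
        if f['failures_in_build'] >= 20:         # (b) skip broken-build noise
            continue
        if f['test_name'] in failed_last_week:   # (c) not newly flaky
            continue
        if f['test_name'] in seen:               # (d) report each test once
            continue
        seen.add(f['test_name'])
        yield f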


On Wed, Feb 15, 2017 at 12:51 PM Saikat Kanjilal 
wrote:

I would recommend we just open JIRAs for unit tests based on module
(core/ml/sql etc.) and fix them one module at a time; this at least keeps
the number of unit tests needing fixing down to a manageable number.


--
*From:* Armin Braun 
*Sent:* Wednesday, February 15, 2017 12:48 PM
*To:* Saikat Kanjilal
*Cc:* Kay Ousterhout; dev@spark.apache.org
*Subject:* Re: File JIRAs for all flaky test failures

I think one thing that is contributing to this a lot too is the general
issue of the tests taking up a lot of file descriptors (10k+ if I run them
on a standard Debian machine).
There are a few suits that contribute to this in particular like
`org.apache.spark.ExecutorAllocationManagerSuite` which, like a few others,
appears to consume a lot of fds.

Wouldn't it make sense to open JIRAs about those and actively try to reduce
the resource consumption of these tests?
Seems to me these can cause a lot of unpredictable behavior (making the
reason for flaky tests hard to identify especially when there's timeouts
etc. involved) + they make it prohibitively expensive for many to test
locally imo.

On Wed, Feb 15, 2017 at 9:24 PM, Saikat Kanjilal 
wrote:

I was working on something to address this a while ago
(https://issues.apache.org/jira/browse/SPARK-9487), but the difficulty in
testing locally made things a lot more complicated to fix for each of the
unit tests. Should we resurface this JIRA again? I would wholeheartedly
agree with the flakiness assessment of the unit tests.



--
*From:* Kay Ousterhout 
*Sent:* Wednesday, February 15, 2017 12:10 PM
*To:* dev@spark.apache.org
*Subject:* File JIRAs for all flaky test failures

Hi all,

I've noticed the Spark tests getting increasingly flaky -- it seems more
common than not now that the tests need to be re-run at least once on PRs
before they pass.  This is both annoying and problematic because it makes
it harder to tell when a PR is introducing new flakiness.

To try to clean this up, I'd propose filing a JIRA *every time* Jenkins
fails on a PR (for a reason unrelated to the PR).  Just provide a quick
description of the failure -- e.g., "Flaky test: DagSchedulerSuite" or
"Tests failed because 250m timeout expired", a link to the failed build,
and include the "Tests" component.  If there's already a JIRA for the
issue, just comment with a link to the latest failure.  I know folks don't
always have time to track down why a test failed, but this it at least
helpful to someone else who, later on, 

Re: [VOTE] Release Apache Spark 2.0.2 (RC3)

2016-11-14 Thread Josh Rosen
He pushed the 2.0.2 release docs but there's a problem with Git mirroring
of the Spark website repo which is interfering with the publishing:
https://issues.apache.org/jira/browse/INFRA-12913


On Mon, Nov 14, 2016 at 1:15 PM Shivaram Venkataraman <
shiva...@eecs.berkeley.edu> wrote:

> The release is available on http://www.apache.org/dist/spark/ and its
> on Maven central
> http://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.0.2/
>
> I guess Reynold hasn't yet put together the release notes / updates to
> the website.
>
> Thanks
> Shivaram
>
> On Mon, Nov 14, 2016 at 12:49 PM, Nicholas Chammas
>  wrote:
> > Has the release already been made? I didn't see any announcement, but
> > Homebrew has already updated to 2.0.2.
> > On Fri, Nov 11, 2016 at 2:59 PM, Reynold Xin wrote:
> >>
> >> The vote has passed with the following +1s and no -1. I will work on
> >> packaging the release.
> >>
> >> +1:
> >>
> >> Reynold Xin*
> >> Herman van Hövell tot Westerflier
> >> Ricardo Almeida
> >> Shixiong (Ryan) Zhu
> >> Sean Owen*
> >> Michael Armbrust*
> >> Dongjoon Hyun
> >> Jagadeesan As
> >> Liwei Lin
> >> Weiqing Yang
> >> Vaquar Khan
> >> Denny Lee
> >> Yin Huai*
> >> Ryan Blue
> >> Pratik Sharma
> >> Kousuke Saruta
> >> Tathagata Das*
> >> Mingjie Tang
> >> Adam Roberts
> >>
> >> * = binding
> >>
> >>
> >> On Mon, Nov 7, 2016 at 10:09 PM, Reynold Xin 
> wrote:
> >>>
> >>> Please vote on releasing the following candidate as Apache Spark
> version
> >>> 2.0.2. The vote is open until Thu, Nov 10, 2016 at 22:00 PDT and
> passes if a
> >>> majority of at least 3 +1 PMC votes are cast.
> >>>
> >>> [ ] +1 Release this package as Apache Spark 2.0.2
> >>> [ ] -1 Do not release this package because ...
> >>>
> >>>
> >>> The tag to be voted on is v2.0.2-rc3
> >>> (584354eaac02531c9584188b143367ba694b0c34)
> >>>
> >>> This release candidate resolves 84 issues:
> >>> https://s.apache.org/spark-2.0.2-jira
> >>>
> >>> The release files, including signatures, digests, etc. can be found at:
> >>> http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc3-bin/
> >>>
> >>> Release artifacts are signed with the following key:
> >>> https://people.apache.org/keys/committer/pwendell.asc
> >>>
> >>> The staging repository for this release can be found at:
> >>>
> https://repository.apache.org/content/repositories/orgapachespark-1214/
> >>>
> >>> The documentation corresponding to this release can be found at:
> >>>
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc3-docs/
> >>>
> >>>
> >>> Q: How can I help test this release?
> >>> A: If you are a Spark user, you can help us test this release by taking
> >>> an existing Spark workload and running on this release candidate, then
> >>> reporting any regressions from 2.0.1.
> >>>
> >>> Q: What justifies a -1 vote for this release?
> >>> A: This is a maintenance release in the 2.0.x series. Bugs already
> >>> present in 2.0.1, missing features, or bugs related to new features
> will not
> >>> necessarily block this release.
> >>>
> >>> Q: What fix version should I use for patches merging into branch-2.0
> from
> >>> now on?
> >>> A: Please mark the fix version as 2.0.3, rather than 2.0.2. If a new RC
> >>> (i.e. RC4) is cut, I will change the fix version of those patches to
> 2.0.2.
> >>
> >>
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [VOTE] Release Apache Spark 2.0.1 (RC3)

2016-09-25 Thread Josh Rosen
+1

On Sun, Sep 25, 2016 at 1:16 PM Yin Huai  wrote:

> +1
>
> On Sun, Sep 25, 2016 at 11:40 AM, Dongjoon Hyun 
> wrote:
>
>> +1 (non binding)
>>
>> RC3 is compiled and tested on the following two systems, too. All tests
>> passed.
>>
>> * CentOS 7.2 / Oracle JDK 1.8.0_77 / R 3.3.1
>>with -Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive -Phive-thriftserver
>> -Dsparkr
>> * CentOS 7.2 / Open JDK 1.8.0_102
>>with -Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive -Phive-thriftserver
>>
>> Cheers,
>> Dongjoon
>>
>>
>>
>> On Saturday, September 24, 2016, Reynold Xin  wrote:
>>
>>> Please vote on releasing the following candidate as Apache Spark version
>>> 2.0.1. The vote is open until Tue, Sep 27, 2016 at 15:30 PDT and passes if
>>> a majority of at least 3 +1 PMC votes are cast.
>>>
>>> [ ] +1 Release this package as Apache Spark 2.0.1
>>> [ ] -1 Do not release this package because ...
>>>
>>>
>>> The tag to be voted on is v2.0.1-rc3
>>> (9d28cc10357a8afcfb2fa2e6eecb5c2cc2730d17)
>>>
>>> This release candidate resolves 290 issues:
>>> https://s.apache.org/spark-2.0.1-jira
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> http://people.apache.org/~pwendell/spark-releases/spark-2.0.1-rc3-bin/
>>>
>>> Release artifacts are signed with the following key:
>>> https://people.apache.org/keys/committer/pwendell.asc
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1201/
>>>
>>> The documentation corresponding to this release can be found at:
>>> http://people.apache.org/~pwendell/spark-releases/spark-2.0.1-rc3-docs/
>>>
>>>
>>> Q: How can I help test this release?
>>> A: If you are a Spark user, you can help us test this release by taking
>>> an existing Spark workload and running on this release candidate, then
>>> reporting any regressions from 2.0.0.
>>>
>>> Q: What justifies a -1 vote for this release?
>>> A: This is a maintenance release in the 2.0.x series.  Bugs already
>>> present in 2.0.0, missing features, or bugs related to new features will
>>> not necessarily block this release.
>>>
>>> Q: What fix version should I use for patches merging into branch-2.0
>>> from now on?
>>> A: Please mark the fix version as 2.0.2, rather than 2.0.1. If a new RC
>>> (i.e. RC4) is cut, I will change the fix version of those patches to 2.0.1.
>>>
>>>
>>>
>


Re: Unable to run docker jdbc integrations test ?

2016-09-07 Thread Josh Rosen
I think that these tests are valuable so I'd like to keep them. If
possible, though, we should try to get rid of our dependency on the Spotify
docker-client library, since it's a dependency hell nightmare. Given our
relatively simple use of Docker here, I wonder whether we could just write
some simple scripting over the `docker` command-line tool instead of
pulling in such a problematic library.
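
For example, the kind of thin wrapper I have in mind would look something
like this (a sketch using Python's subprocess module; the image name and port
in the comments are just examples, and error handling is omitted):

import subprocess

def start_container(image, container_port):
    # e.g. start_container("mysql:5.7", "3306"); -p with no host port lets
    # Docker pick a free ephemeral port on the host.
    return subprocess.check_output(
        ["docker", "run", "-d", "-p", container_port, image]).decode().strip()

def host_port(container_id, container_port):
    # "docker port" reports the host address bound to the container port,
    # e.g. "0.0.0.0:32768"; return just the port number.
    out = subprocess.check_output(
        ["docker", "port", container_id, container_port]).decode().strip()
    return out.rsplit(":", 1)[-1]

def stop_container(container_id):
    subprocess.check_call(["docker", "rm", "-f", container_id])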

On Wed, Sep 7, 2016 at 2:36 PM Luciano Resende  wrote:

> It looks like there is nobody running these tests, and after some
> dependency upgrades in Spark 2.0 this has stopped working. I have tried to
> bring this up but I am having some issues with getting the right
> dependencies loaded and satisfying the docker-client expectations.
>
> The question then is: does the community find value in having these tests
> available? If so, we can focus on bringing them up and I can push my
> previous experiments as a WIP PR. Otherwise we should just get rid of these
> tests.
>
> Thoughts ?
>
>
> On Tue, Sep 6, 2016 at 4:05 PM, Suresh Thalamati <
> suresh.thalam...@gmail.com> wrote:
>
>> Hi,
>>
>>
>> I am getting the following error when I am trying to run jdbc docker
>> integration tests on my laptop.   Any ideas what I might be doing
>> wrong?
>>
>> build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0  -Phive-thriftserver
>> -Phive -DskipTests clean install
>> build/mvn -Pdocker-integration-tests -pl
>> :spark-docker-integration-tests_2.11  compile test
>>
>> Java HotSpot(TM) 64-Bit Server VM warning: ignoring option
>> MaxPermSize=512m; support was removed in 8.0
>> Discovery starting.
>> Discovery completed in 200 milliseconds.
>> Run starting. Expected test count is: 10
>> MySQLIntegrationSuite:
>>
>> Error:
>> 16/09/06 11:52:00 INFO BlockManagerMaster: Registered BlockManager
>> BlockManagerId(driver, 9.31.117.25, 51868)
>> *** RUN ABORTED ***
>>   java.lang.AbstractMethodError:
>>   at
>> org.glassfish.jersey.model.internal.CommonConfig.configureAutoDiscoverableProviders(CommonConfig.java:622)
>>   at
>> org.glassfish.jersey.client.ClientConfig$State.configureAutoDiscoverableProviders(ClientConfig.java:357)
>>   at
>> org.glassfish.jersey.client.ClientConfig$State.initRuntime(ClientConfig.java:392)
>>   at
>> org.glassfish.jersey.client.ClientConfig$State.access$000(ClientConfig.java:88)
>>   at
>> org.glassfish.jersey.client.ClientConfig$State$3.get(ClientConfig.java:120)
>>   at
>> org.glassfish.jersey.client.ClientConfig$State$3.get(ClientConfig.java:117)
>>   at
>> org.glassfish.jersey.internal.util.collection.Values$LazyValueImpl.get(Values.java:340)
>>   at
>> org.glassfish.jersey.client.ClientConfig.getRuntime(ClientConfig.java:726)
>>   at
>> org.glassfish.jersey.client.ClientRequest.getConfiguration(ClientRequest.java:285)
>>   at
>> org.glassfish.jersey.client.JerseyInvocation.validateHttpMethodAndEntity(JerseyInvocation.java:126)
>>   ...
>> 16/09/06 11:52:00 INFO SparkContext: Invoking stop() from shutdown hook
>> 16/09/06 11:52:00 INFO MapOutputTrackerMasterEndpoint:
>> MapOutputTrackerMasterEndpoint stopped!
>>
>>
>>
>> Thanks
>> -suresh
>>
>>
>
>
> --
> Luciano Resende
> http://twitter.com/lresende1975
> http://lresende.blogspot.com/
>


Re: master snapshots not publishing?

2016-07-24 Thread Josh Rosen
Should be back and building now:
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/1595/console

I see a 2.1.0-SNAPSHOT in
https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-core_2.11/,
so it looks like everything should be working.

On Thu, Jul 21, 2016 at 3:36 PM Andrew Duffy <r...@aduffy.org> wrote:

> Gotcha, that'd be great!
>
> On Thu, Jul 21, 2016 at 8:52 PM, Josh Rosen <joshro...@databricks.com>
> wrote:
>
>> Yeah, it's on purpose: we had to disable it back when both the master and
>> branch-2.0 branches had the same versions in their POMs because that was
>> causing the master snapshots to overwrite the 2.0.0-SNAPSHOTS which are
>> generated off of branch-2.0.
>>
>> I can go ahead and re-enable it later today.
>>
>> On Thu, Jul 21, 2016 at 11:10 AM Andrew Duffy <r...@aduffy.org> wrote:
>>
>>> I’m trying to use a Snapshot build off of master, and after looking
>>> through Jenkins it appears that the last commit where the snapshot was
>>> built is back on 757dc2c09d23400dacac22e51f52062bbe471136, 22 days ago:
>>> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/
>>>
>>>
>>>
>>> Looking at the Jenkins page it says that the master-maven build is
>>> disabled, is this purposeful?
>>>
>>>
>>>
>>> -Andrew
>>>
>>
>
>
> --
> Andrew Duffy
> aduffy.org
>


Re: master snapshots not publishing?

2016-07-21 Thread Josh Rosen
Yeah, it's on purpose: we had to disable it back when both the master and
branch-2.0 branches had the same versions in their POMs because that was
causing the master snapshots to overwrite the 2.0.0-SNAPSHOTS which are
generated off of branch-2.0.

I can go ahead and re-enable it later today.

On Thu, Jul 21, 2016 at 11:10 AM Andrew Duffy  wrote:

> I’m trying to use a Snapshot build off of master, and after looking
> through Jenkins it appears that the last commit where the snapshot was
> built is back on 757dc2c09d23400dacac22e51f52062bbe471136, 22 days ago:
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/
>
>
>
> Looking at the Jenkins page it says that the master-maven build is
> disabled, is this purposeful?
>
>
>
> -Andrew
>


Re: RFC: Remove "HBaseTest" from examples?

2016-04-19 Thread Josh Rosen
+1; I think that it's preferable for code examples, especially third-party
integration examples, to live outside of Spark.

On Tue, Apr 19, 2016 at 10:29 AM Reynold Xin  wrote:

> Yea in general I feel examples that bring in a large amount of
> dependencies should be outside Spark.
>
>
> On Tue, Apr 19, 2016 at 10:15 AM, Marcelo Vanzin 
> wrote:
>
>> Hey all,
>>
>> Two reasons why I think we should remove that from the examples:
>>
>> - HBase now has Spark integration in its own repo, so that really
>> should be the template for how to use HBase from Spark, making that
>> example less useful, even misleading.
>>
>> - It brings up a lot of extra dependencies that make the size of the
>> Spark distribution grow.
>>
>> Any reason why we shouldn't drop that example?
>>
>> --
>> Marcelo
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>>
>


Re: Spark 1.6.1 Hadoop 2.6 package on S3 corrupt?

2016-04-06 Thread Josh Rosen
Sure, I'll take a look. Planning to do full verification in a bit.

On Wed, Apr 6, 2016 at 12:54 PM Ted Yu <yuzhih...@gmail.com> wrote:

> Josh:
> Can you check spark-1.6.1-bin-hadoop2.4.tgz ?
>
> $ tar zxf spark-1.6.1-bin-hadoop2.4.tgz
>
> gzip: stdin: not in gzip format
> tar: Child returned status 1
> tar: Error is not recoverable: exiting now
>
> $ ls -l !$
> ls -l spark-1.6.1-bin-hadoop2.4.tgz
> -rw-r--r--. 1 hbase hadoop 323614720 Apr  5 19:25
> spark-1.6.1-bin-hadoop2.4.tgz
>
> Thanks
>
> On Wed, Apr 6, 2016 at 12:19 PM, Josh Rosen <joshro...@databricks.com>
> wrote:
>
>> I downloaded the Spark 1.6.1 artifacts from the Apache mirror network and
>> re-uploaded them to the spark-related-packages S3 bucket, so hopefully
>> these packages should be fixed now.
>>
>> On Mon, Apr 4, 2016 at 3:37 PM Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> Thanks, that was the command. :thumbsup:
>>>
>>> On Mon, Apr 4, 2016 at 6:28 PM Jakob Odersky <ja...@odersky.com> wrote:
>>>
>>>> I just found out how the hash is calculated:
>>>>
>>>> gpg --print-md sha512 <file>.tgz
>>>>
>>>> you can use that to check if the resulting output matches the contents
>>>> of <file>.tgz.sha
>>>>
>>>> On Mon, Apr 4, 2016 at 3:19 PM, Jakob Odersky <ja...@odersky.com>
>>>> wrote:
>>>> > The published hash is a SHA512.
>>>> >
>>>> > You can verify the integrity of the packages by running `sha512sum` on
>>>> > the archive and comparing the computed hash with the published one.
>>>> > Unfortunately however, I don't know what tool is used to generate the
>>>> > hash and I can't reproduce the format, so I ended up manually
>>>> > comparing the hashes.
>>>> >
>>>> > On Mon, Apr 4, 2016 at 2:39 PM, Nicholas Chammas
>>>> > <nicholas.cham...@gmail.com> wrote:
>>>> >> An additional note: The Spark packages being served off of
>>>> CloudFront (i.e.
>>>> >> the “direct download” option on spark.apache.org) are also corrupt.
>>>> >>
>>>> >> Btw what’s the correct way to verify the SHA of a Spark package?
>>>> I’ve tried
>>>> >> a few commands on working packages downloaded from Apache mirrors,
>>>> but I
>>>> >> can’t seem to reproduce the published SHA for
>>>> spark-1.6.1-bin-hadoop2.6.tgz.
>>>> >>
>>>> >>
>>>> >> On Mon, Apr 4, 2016 at 11:45 AM Ted Yu <yuzhih...@gmail.com> wrote:
>>>> >>>
>>>> >>> Maybe temporarily take out the artifacts on S3 before the root
>>>> cause is
>>>> >>> found.
>>>> >>>
>>>> >>> On Thu, Mar 24, 2016 at 7:25 AM, Nicholas Chammas
>>>> >>> <nicholas.cham...@gmail.com> wrote:
>>>> >>>>
>>>> >>>> Just checking in on this again as the builds on S3 are still
>>>> broken. :/
>>>> >>>>
>>>> >>>> Could it have something to do with us moving release-build.sh?
>>>> >>>>
>>>> >>>>
>>>> >>>> On Mon, Mar 21, 2016 at 1:43 PM Nicholas Chammas
>>>> >>>> <nicholas.cham...@gmail.com> wrote:
>>>> >>>>>
>>>> >>>>> Is someone going to retry fixing these packages? It's still a
>>>> problem.
>>>> >>>>>
>>>> >>>>> Also, it would be good to understand why this is happening.
>>>> >>>>>
>>>> >>>>> On Fri, Mar 18, 2016 at 6:49 PM Jakob Odersky <ja...@odersky.com>
>>>> wrote:
>>>> >>>>>>
>>>> >>>>>> I just realized you're using a different download site. Sorry
>>>> for the
>>>> >>>>>> confusion, the link I get for a direct download of Spark 1.6.1 /
>>>> >>>>>> Hadoop 2.6 is
>>>> >>>>>>
>>>> http://d3kbcqa49mib13.cloudfront.net/spark-1.6.1-bin-hadoop2.6.tgz
>>>> >>>>>>
>>>> >>>>>> On Fri, Mar 18, 2016 at 3:20 PM, Nicholas Chammas
>>>> >>>>>> <nicholas.cham...@gmail.com> wrote:
>>>> >>&

Re: Spark 1.6.1 Hadoop 2.6 package on S3 corrupt?

2016-04-06 Thread Josh Rosen
I downloaded the Spark 1.6.1 artifacts from the Apache mirror network and
re-uploaded them to the spark-related-packages S3 bucket, so hopefully
these packages should be fixed now.
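
For reference, re-verifying one of the re-uploaded packages could look roughly like the sketch below (the filename and S3 URL are the ones quoted later in this thread; both gpg --print-md and sha512sum are mentioned there as ways to compute the digest):

    # Fetch the archive, compute its SHA-512, and confirm it unpacks
    wget https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.6.tgz
    gpg --print-md sha512 spark-1.6.1-bin-hadoop2.6.tgz   # same layout as the published .sha file
    sha512sum spark-1.6.1-bin-hadoop2.6.tgz               # plain hex digest for manual comparison
    tar ztf spark-1.6.1-bin-hadoop2.6.tgz > /dev/null && echo "archive unpacks cleanly"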

On Mon, Apr 4, 2016 at 3:37 PM Nicholas Chammas 
wrote:

> Thanks, that was the command. :thumbsup:
>
> On Mon, Apr 4, 2016 at 6:28 PM Jakob Odersky  wrote:
>
>> I just found out how the hash is calculated:
>>
>> gpg --print-md sha512 <file>.tgz
>>
>> you can use that to check if the resulting output matches the contents
>> of <file>.tgz.sha
>>
>> On Mon, Apr 4, 2016 at 3:19 PM, Jakob Odersky  wrote:
>> > The published hash is a SHA512.
>> >
>> > You can verify the integrity of the packages by running `sha512sum` on
>> > the archive and comparing the computed hash with the published one.
>> > Unfortunately however, I don't know what tool is used to generate the
>> > hash and I can't reproduce the format, so I ended up manually
>> > comparing the hashes.
>> >
>> > On Mon, Apr 4, 2016 at 2:39 PM, Nicholas Chammas
>> >  wrote:
>> >> An additional note: The Spark packages being served off of CloudFront
>> (i.e.
>> >> the “direct download” option on spark.apache.org) are also corrupt.
>> >>
>> >> Btw what’s the correct way to verify the SHA of a Spark package? I’ve
>> tried
>> >> a few commands on working packages downloaded from Apache mirrors, but
>> I
>> >> can’t seem to reproduce the published SHA for
>> spark-1.6.1-bin-hadoop2.6.tgz.
>> >>
>> >>
>> >> On Mon, Apr 4, 2016 at 11:45 AM Ted Yu  wrote:
>> >>>
>> >>> Maybe temporarily take out the artifacts on S3 before the root cause
>> is
>> >>> found.
>> >>>
>> >>> On Thu, Mar 24, 2016 at 7:25 AM, Nicholas Chammas
>> >>>  wrote:
>> 
>>  Just checking in on this again as the builds on S3 are still broken.
>> :/
>> 
>>  Could it have something to do with us moving release-build.sh?
>> 
>> 
>>  On Mon, Mar 21, 2016 at 1:43 PM Nicholas Chammas
>>   wrote:
>> >
>> > Is someone going to retry fixing these packages? It's still a
>> problem.
>> >
>> > Also, it would be good to understand why this is happening.
>> >
>> > On Fri, Mar 18, 2016 at 6:49 PM Jakob Odersky 
>> wrote:
>> >>
>> >> I just realized you're using a different download site. Sorry for
>> the
>> >> confusion, the link I get for a direct download of Spark 1.6.1 /
>> >> Hadoop 2.6 is
>> >> http://d3kbcqa49mib13.cloudfront.net/spark-1.6.1-bin-hadoop2.6.tgz
>> >>
>> >> On Fri, Mar 18, 2016 at 3:20 PM, Nicholas Chammas
>> >>  wrote:
>> >> > I just retried the Spark 1.6.1 / Hadoop 2.6 download and got a
>> >> > corrupt ZIP
>> >> > file.
>> >> >
>> >> > Jakob, are you sure the ZIP unpacks correctly for you? Is it the
>> same
>> >> > Spark
>> >> > 1.6.1/Hadoop 2.6 package you had a success with?
>> >> >
>> >> > On Fri, Mar 18, 2016 at 6:11 PM Jakob Odersky > >
>> >> > wrote:
>> >> >>
>> >> >> I just experienced the issue, however retrying the download a
>> second
>> >> >> time worked. Could it be that there is some load balancer/cache
>> in
>> >> >> front of the archive and some nodes still serve the corrupt
>> >> >> packages?
>> >> >>
>> >> >> On Fri, Mar 18, 2016 at 8:00 AM, Nicholas Chammas
>> >> >>  wrote:
>> >> >> > I'm seeing the same. :(
>> >> >> >
>> >> >> > On Fri, Mar 18, 2016 at 10:57 AM Ted Yu 
>> >> >> > wrote:
>> >> >> >>
>> >> >> >> I tried again this morning :
>> >> >> >>
>> >> >> >> $ wget
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.6.tgz
>> >> >> >> --2016-03-18 07:55:30--
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.6.tgz
>> >> >> >> Resolving s3.amazonaws.com... 54.231.19.163
>> >> >> >> ...
>> >> >> >> $ tar zxf spark-1.6.1-bin-hadoop2.6.tgz
>> >> >> >>
>> >> >> >> gzip: stdin: unexpected end of file
>> >> >> >> tar: Unexpected EOF in archive
>> >> >> >> tar: Unexpected EOF in archive
>> >> >> >> tar: Error is not recoverable: exiting now
>> >> >> >>
>> >> >> >> On Thu, Mar 17, 2016 at 8:57 AM, Michael Armbrust
>> >> >> >> 
>> >> >> >> wrote:
>> >> >> >>>
>> >> >> >>> Patrick reuploaded the artifacts, so it should be fixed now.
>> >> >> >>>
>> >> >> >>> On Mar 16, 2016 5:48 PM, "Nicholas Chammas"
>> >> >> >>> 
>> >> >> >>> wrote:
>> >> >> 
>> >> >>  Looks like the other packages may also be corrupt. I’m
>> getting
>> >> >>  the
>> >> >>  

Re: Updating Spark PR builder and 2.x test jobs to use Java 8 JDK

2016-04-05 Thread Josh Rosen
I finally figured out the problem: it seems that my *export
JAVA_HOME=/path/to/java8/home* was somehow not affecting the javac
executable that Zinc's SBT incremental compiler uses when it forks out to
javac to handle Java source files. As a result, we were passing a -source
1.8 flag to the platform's default javac, which happens to be Java 7.

To fix this, I'm going to modify the build to just prepend $JAVA_HOME/bin
to $PATH while setting up the test environment.
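
As a hedged sketch, the environment tweak described above might amount to something like this in a Jenkins shell step (paths are illustrative):

    # Ensure the Java 8 javac is the one Zinc/SBT forks out to,
    # instead of the platform's default Java 7 compiler
    export JAVA_HOME=/path/to/java8/home
    export PATH="$JAVA_HOME/bin:$PATH"
    java -version    # should report 1.8
    javac -version   # should also report 1.8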

On Tue, Apr 5, 2016 at 5:09 PM Josh Rosen <joshro...@databricks.com> wrote:

> I've reverted the bulk of the conf changes while I investigate. I think
> that Zinc might be handling JAVA_HOME in a weird way and am SSH'ing to
> Jenkins to try to reproduce the problem in isolation.
>
> On Tue, Apr 5, 2016 at 4:14 PM Ted Yu <yuzhih...@gmail.com> wrote:
>
>> Josh:
>> You may have noticed the following error (
>> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.7/566/console
>> ):
>>
>> [error] javac: invalid source release: 1.8
>> [error] Usage: javac <options> <source files>
>> [error] use -help for a list of possible options
>>
>>
>> On Tue, Apr 5, 2016 at 2:14 PM, Josh Rosen <joshro...@databricks.com>
>> wrote:
>>
>>> In order to be able to run Java 8 API compatibility tests, I'm going to
>>> push a new set of Jenkins configurations for Spark's test and PR builders
>>> so that those jobs use a Java 8 JDK. I tried this once in the past and it
>>> seemed to introduce some rare, transient flakiness in certain tests, so if
>>> anyone observes new test failures please email me and I'll investigate
>>> right away.
>>>
>>> Note that this change has no impact on Spark's supported JDK versions
>>> and our build will still target Java 7 and emit Java 7 bytecode; the
>>> purpose of this change is simply to allow the Java 8 lambda tests to be run
>>> as part of PR builder runs.
>>>
>>> - Josh
>>>
>>
>>


Re: Updating Spark PR builder and 2.x test jobs to use Java 8 JDK

2016-04-05 Thread Josh Rosen
I've reverted the bulk of the conf changes while I investigate. I think
that Zinc might be handling JAVA_HOME in a weird way and am SSH'ing to
Jenkins to try to reproduce the problem in isolation.

On Tue, Apr 5, 2016 at 4:14 PM Ted Yu <yuzhih...@gmail.com> wrote:

> Josh:
> You may have noticed the following error (
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.7/566/console
> ):
>
> [error] javac: invalid source release: 1.8
> [error] Usage: javac <options> <source files>
> [error] use -help for a list of possible options
>
>
> On Tue, Apr 5, 2016 at 2:14 PM, Josh Rosen <joshro...@databricks.com>
> wrote:
>
>> In order to be able to run Java 8 API compatibility tests, I'm going to
>> push a new set of Jenkins configurations for Spark's test and PR builders
>> so that those jobs use a Java 8 JDK. I tried this once in the past and it
>> seemed to introduce some rare, transient flakiness in certain tests, so if
>> anyone observes new test failures please email me and I'll investigate
>> right away.
>>
>> Note that this change has no impact on Spark's supported JDK versions and
>> our build will still target Java 7 and emit Java 7 bytecode; the purpose of
>> this change is simply to allow the Java 8 lambda tests to be run as part of
>> PR builder runs.
>>
>> - Josh
>>
>
>


Updating Spark PR builder and 2.x test jobs to use Java 8 JDK

2016-04-05 Thread Josh Rosen
In order to be able to run Java 8 API compatibility tests, I'm going to
push a new set of Jenkins configurations for Spark's test and PR builders
so that those jobs use a Java 8 JDK. I tried this once in the past and it
seemed to introduce some rare, transient flakiness in certain tests, so if
anyone observes new test failures please email me and I'll investigate
right away.

Note that this change has no impact on Spark's supported JDK versions and
our build will still target Java 7 and emit Java 7 bytecode; the purpose of
this change is simply to allow the Java 8 lambda tests to be run as part of
PR builder runs.

- Josh


Re: Understanding PySpark Internals

2016-03-30 Thread Josh Rosen
One clarification: there *are* Python interpreters running on executors so
that Python UDFs and RDD API code can be executed. Some slightly-outdated
but mostly-correct reference material for this can be found at
https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals.

See also: search the Spark codebase for PythonRDD and look at
python/pyspark/worker.py
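
For anyone following that pointer in a local checkout, two hedged starting points (paths reflect the layout at the time of writing and may move between releases):

    # Executor-side plumbing that launches and talks to Python worker processes
    grep -rn "class PythonRDD" core/src/main/scala/org/apache/spark/api/python/

    # The Python half of the worker protocol that actually runs on executors
    less python/pyspark/worker.py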

On Tue, Mar 29, 2016 at 8:21 AM Adam Roberts  wrote:

> Hi, I'm interested in figuring out how the Python API for Spark works.
> I've come to the following conclusion and want to share this with the
> community; it could be of use in the PySpark docs here,
> specifically the "Execution and pipelining part".
>
> *Any sanity checking would be much appreciated, here's the trivial Python
> example I've traced:*
>
>
> from pyspark import SparkContext
> sc = SparkContext("local[1]", "Adam test")
> sc.setCheckpointDir("foo checkpoint dir")
>
> *Added this JVM option:*
> *export
> IBM_JAVA_OPTIONS="-Xtrace:methods={org/apache/spark/*,py4j/*},print=mt"*
>
> *Prints added in py4j-java/src/py4j/commands/CallCommand.java -
> specifically in the execute method. Built and replaced existing class in
> the py4j 0.9 jar in my Spark assembly jar. Example output is:*
> *In execute for CallCommand, commandName: c*
> *target object id: o0*
> *methodName: get*
>
> *I'll launch the Spark application with:*
> *$SPARK_HOME/bin/spark-submit --master local[1] Adam.py > checkme.txt 2>&1*
>
> I've quickly put together the following WIP diagram of what I think is
> happening:
> http://postimg.org/image/nihylmset/
>
> To summarise I think:
>
>- We're heavily using reflection (as evidenced by Py4j's
>ReflectionEngine and MethodInvoker classes) to invoke Spark's API in a JVM
>from Python
>- There's an agreed protocol (in Py4j's Protocol.java) for handling
>commands: said commands are exchanged using a local socket between Python
>and our JVM (the driver based on docs, not the master)
>- The Spark API is accessible by means of commands exchanged using
>said socket using the agreed protocol
>- Commands are read/written using BufferedReader/Writer
>- Type conversion is also performed from Python to Java (not looked at
>in detail yet)
>- We keep track of the objects with, for example, o0 representing the
>first object we know about
>
> Does this sound correct?
>
> I've only checked the trace output in local mode, curious as to what
> happens when we're running in standalone mode (I didn't see a Python
> interpreter appearing on all workers in order to process partitions of
> data, I assume in standalone mode we use Python solely as an orchestrator -
> the driver - and not as an executor for distributed computing?).
>
> Happy to provide the full trace output on request (omitted timestamps,
> logging info, added spacing), I expect there's a O*JDK method tracing
> equivalent so the above can easily be reproduced regardless of Java vendor.
>
> Cheers,
>
>
> Unless stated otherwise above:
> IBM United Kingdom Limited - Registered in England and Wales with number
> 741598.
> Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU
>


Re: Spark build with scala-2.10 fails ?

2016-03-20 Thread Josh Rosen
It looks like the Scala 2.10 Jenkins build is working:
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-compile-sbt-scala-2.10/

Can you share more details about how you're compiling with 2.10 (e.g. which
commands you ran, git SHA, etc)?

On Wed, Mar 16, 2016 at 11:46 PM Jeff Zhang  wrote:

> Anyone can pass the spark build with scala-2.10 ?
>
>
> [info] Compiling 475 Scala sources and 78 Java sources to
> /Users/jzhang/github/spark/core/target/scala-2.10/classes...
> [error]
> /Users/jzhang/github/spark/core/src/main/scala/org/apache/spark/deploy/mesos/MesosExternalShuffleService.scala:30:
> object ShuffleServiceHeartbeat is not a member of package
> org.apache.spark.network.shuffle.protocol.mesos
> [error] import
> org.apache.spark.network.shuffle.protocol.mesos.{RegisterDriver,
> ShuffleServiceHeartbeat}
> [error]^
> [error]
> /Users/jzhang/github/spark/core/src/main/scala/org/apache/spark/deploy/mesos/MesosExternalShuffleService.scala:87:
> not found: type ShuffleServiceHeartbeat
> [error] def unapply(h: ShuffleServiceHeartbeat): Option[String] =
> Some(h.getAppId)
> [error]^
> [error]
> /Users/jzhang/github/spark/core/src/main/scala/org/apache/spark/deploy/mesos/MesosExternalShuffleService.scala:83:
> value getHeartbeatTimeoutMs is not a member of
> org.apache.spark.network.shuffle.protocol.mesos.RegisterDriver
> [error]   Some((r.getAppId, new AppState(r.getHeartbeatTimeoutMs,
> System.nanoTime(
> [error]^
> [error]
> /Users/jzhang/github/spark/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/CoarseMesosSchedulerBackend.scala:451:
> too many arguments for method registerDriverWithShuffleService: (x$1:
> String, x$2: Int)Unit
> [error]   .registerDriverWithShuffleService(
> [error]^
> [error] four errors found
> [error] Compile failed at Mar 17, 2016 2:45:22 PM [13.105s]
> --
> Best Regards
>
> Jeff Zhang
>


Re: Apache Spark Exception in thread “main” java.lang.NoClassDefFoundError: scala/collection/GenTraversableOnce$class

2016-03-19 Thread Josh Rosen
See the instructions in the Spark documentation:
https://spark.apache.org/docs/latest/building-spark.html#building-for-scala-211
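
At the time of this thread, the documented steps boiled down to roughly the following (the profiles shown are illustrative; check the linked page for the exact flags for your version):

    # Switch the build to Scala 2.11, then build Spark itself against it
    ./dev/change-scala-version.sh 2.11
    ./build/mvn -Pyarn -Phadoop-2.6 -Dscala-2.11 -DskipTests clean package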

On Wed, Mar 16, 2016 at 7:05 PM satyajit vegesna 
wrote:

>
>
> Hi,
>
> Scala version: 2.11.7 (had to upgrade the Scala version to enable case
> classes to accept more than 22 parameters.)
>
> Spark version: 1.6.1.
>
> PFB pom.xml
>
> Getting below error when trying to setup spark on intellij IDE,
>
> 16/03/16 18:36:44 INFO spark.SparkContext: Running Spark version 1.6.1
> Exception in thread "main" java.lang.NoClassDefFoundError:
> scala/collection/GenTraversableOnce$class at
> org.apache.spark.util.TimeStampedWeakValueHashMap.(TimeStampedWeakValueHashMap.scala:42)
> at org.apache.spark.SparkContext.(SparkContext.scala:298) at
> com.examples.testSparkPost$.main(testSparkPost.scala:27) at
> com.examples.testSparkPost.main(testSparkPost.scala) at
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606) at
> com.intellij.rt.execution.application.AppMain.main(AppMain.java:140) Caused
> by: java.lang.ClassNotFoundException:
> scala.collection.GenTraversableOnce$class at
> java.net.URLClassLoader$1.run(URLClassLoader.java:366) at
> java.net.URLClassLoader$1.run(URLClassLoader.java:355) at
> java.security.AccessController.doPrivileged(Native Method) at
> java.net.URLClassLoader.findClass(URLClassLoader.java:354) at
> java.lang.ClassLoader.loadClass(ClassLoader.java:425) at
> sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308) at
> java.lang.ClassLoader.loadClass(ClassLoader.java:358) ... 9 more
>
> pom.xml:
>
> http://maven.apache.org/POM/4.0.0; 
> xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance;
>  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 
> http://maven.apache.org/maven-v4_0_0.xsd;>
> 4.0.0
> StreamProcess
> StreamProcess
> 0.0.1-SNAPSHOT
> ${project.artifactId}
> This is a boilerplate maven project to start using Spark in 
> Scala
> 2010
>
> 
> 1.6
> 1.6
> UTF-8
> 2.10
> 
> 2.11.7
> 
>
> 
> 
> 
> cloudera-repo-releases
> https://repository.cloudera.com/artifactory/repo/
> 
> 
>
> 
> src/main/scala
> src/test/scala
> 
> 
> 
> maven-assembly-plugin
> 
> 
> package
> 
> single
> 
> 
> 
> 
> 
> 
> jar-with-dependencies
> 
> 
> 
> 
> 
> net.alchim31.maven
> scala-maven-plugin
> 3.2.2
> 
> 
> 
> compile
> testCompile
> 
> 
> 
> 
> -dependencyfile
> 
> ${project.build.directory}/.scala_dependencies
> 
> 
> 
> 
> 
>
> 
> 
> maven-assembly-plugin
> 2.4.1
> 
> 
> jar-with-dependencies
> 
> 
> 
> 
> make-assembly
> package
> 
> single
> 
> 
> 
> 
> 
> 
> 
> 
> org.scala-lang
> scala-library
> ${scala.version}
> 
> 
> org.mongodb.mongo-hadoop
> mongo-hadoop-core
> 1.4.2
> 
> 
> javax.servlet
> servlet-api
> 
> 
> 
> 
> org.mongodb
> mongodb-driver
> 3.2.2
> 
> 
> javax.servlet
> servlet-api
> 
> 
> 
> 
> org.mongodb
> mongodb-driver
> 3.2.2
> 
> 
> javax.servlet
>

Re: Apache Spark Exception in thread “main” java.lang.NoClassDefFoundError: scala/collection/GenTraversableOnce$class

2016-03-19 Thread Josh Rosen
Err, whoops, looks like this is a user app and not building Spark itself,
so you'll have to change your deps to use the 2.11 versions of Spark.
e.g. spark-streaming_2.10 -> spark-streaming_2.11.

On Wed, Mar 16, 2016 at 7:07 PM Josh Rosen <joshro...@databricks.com> wrote:

> See the instructions in the Spark documentation:
> https://spark.apache.org/docs/latest/building-spark.html#building-for-scala-211
>
> On Wed, Mar 16, 2016 at 7:05 PM satyajit vegesna <
> satyajit.apas...@gmail.com> wrote:
>
>>
>>
>> Hi,
>>
>> Scala version: 2.11.7 (had to upgrade the Scala version to enable case
>> classes to accept more than 22 parameters.)
>>
>> Spark version: 1.6.1.
>>
>> PFB pom.xml
>>
>> Getting below error when trying to setup spark on intellij IDE,
>>
>> 16/03/16 18:36:44 INFO spark.SparkContext: Running Spark version 1.6.1
>> Exception in thread "main" java.lang.NoClassDefFoundError:
>> scala/collection/GenTraversableOnce$class at
>> org.apache.spark.util.TimeStampedWeakValueHashMap.(TimeStampedWeakValueHashMap.scala:42)
>> at org.apache.spark.SparkContext.(SparkContext.scala:298) at
>> com.examples.testSparkPost$.main(testSparkPost.scala:27) at
>> com.examples.testSparkPost.main(testSparkPost.scala) at
>> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>> at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> at java.lang.reflect.Method.invoke(Method.java:606) at
>> com.intellij.rt.execution.application.AppMain.main(AppMain.java:140) Caused
>> by: java.lang.ClassNotFoundException:
>> scala.collection.GenTraversableOnce$class at
>> java.net.URLClassLoader$1.run(URLClassLoader.java:366) at
>> java.net.URLClassLoader$1.run(URLClassLoader.java:355) at
>> java.security.AccessController.doPrivileged(Native Method) at
>> java.net.URLClassLoader.findClass(URLClassLoader.java:354) at
>> java.lang.ClassLoader.loadClass(ClassLoader.java:425) at
>> sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308) at
>> java.lang.ClassLoader.loadClass(ClassLoader.java:358) ... 9 more
>>
>> pom.xml:
>>
>> http://maven.apache.org/POM/4.0.0; 
>> xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance;
>>  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 
>> http://maven.apache.org/maven-v4_0_0.xsd;>
>> 4.0.0
>> StreamProcess
>> StreamProcess
>> 0.0.1-SNAPSHOT
>> ${project.artifactId}
>> This is a boilerplate maven project to start using Spark in 
>> Scala
>> 2010
>>
>> 
>> 1.6
>> 1.6
>> UTF-8
>> 2.10
>> 
>> 2.11.7
>> 
>>
>> 
>> 
>> 
>> cloudera-repo-releases
>> https://repository.cloudera.com/artifactory/repo/
>> 
>> 
>>
>> 
>> src/main/scala
>> src/test/scala
>> 
>> 
>> 
>> maven-assembly-plugin
>> 
>> 
>> package
>> 
>> single
>> 
>> 
>> 
>> 
>> 
>> 
>> jar-with-dependencies
>> 
>> 
>> 
>> 
>> 
>> net.alchim31.maven
>> scala-maven-plugin
>> 3.2.2
>> 
>> 
>> 
>> compile
>> testCompile
>> 
>> 
>> 
>> 
>> -dependencyfile
>> 
>> ${project.build.directory}/.scala_dependencies
>> 
>> 
>> 
>> 
>> 
>>
>> 
>> 
>> maven-assembly-plugin
>> 2.4.1
>> 

Does anyone implement org.apache.spark.serializer.Serializer in their own code?

2016-03-07 Thread Josh Rosen
Does anyone implement Spark's serializer interface
(org.apache.spark.serializer.Serializer) in your own third-party code? If
so, please let me know because I'd like to change this interface from a
DeveloperAPI to private[spark] in Spark 2.0 in order to do some cleanup and
refactoring. I think that the only reason it was a DeveloperAPI was Shark,
but I'd like to confirm this by asking the community.

Thanks,
Josh


Re: Spark 1.6.1

2016-02-26 Thread Josh Rosen
I updated the release packaging scripts to use SFTP via the *lftp* client:
https://github.com/apache/spark/pull/11350

I'm starting the process of cutting a 1.6.1-RC1 tag and release artifacts
right now, so please be extra careful about merging into branch-1.6 until
after the release. Once the RC packaging completes, Michael or I will email
the list to start a vote thread.

- Josh

On Wed, Feb 24, 2016 at 5:44 PM Yin Yang  wrote:

> Have you tried using scp ?
>
> scp file i...@people.apache.org
>
> Thanks
>
> On Wed, Feb 24, 2016 at 5:04 PM, Michael Armbrust 
> wrote:
>
>> Unfortunately I don't think thats sufficient as they don't seem to
>> support sftp in the same way they did before.  We'll still need to update
>> our release scripts.
>>
>> On Wed, Feb 24, 2016 at 2:09 AM, Yin Yang  wrote:
>>
>>> Looks like access to people.apache.org has been restored.
>>>
>>> FYI
>>>
>>> On Mon, Feb 22, 2016 at 10:07 PM, Luciano Resende 
>>>  wrote:
>>>


 On Mon, Feb 22, 2016 at 9:08 PM, Michael Armbrust <
 mich...@databricks.com> wrote:

> An update: people.apache.org has been shut down so the release
> scripts are broken. Will try again after we fix them.
>
>
 If you skip uploading to people.a.o, it should still be available in
 nexus for review.

 The other option is to add the RC into
 https://dist.apache.org/repos/dist/dev/



 --
 Luciano Resende
 http://people.apache.org/~lresende
 http://twitter.com/lresende1975
 http://lresende.blogspot.com/


>>
>


Re: BUILD FAILURE...again?! :( Spark Project External Flume on fire

2016-01-11 Thread Josh Rosen
I've got a hotfix which should address it:
https://github.com/apache/spark/pull/10693
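
For anyone bisecting this locally, one hedged way to see which artifact does (or does not) bring the old org.jboss.netty classes onto the Flume module's classpath (the module path matches the error messages below):

    # Inspect the flume module's resolved dependency tree and look for Netty artifacts
    ./build/mvn -pl external/flume -am dependency:tree | grep -i netty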



On Sun, Jan 10, 2016 at 11:50 PM, Jacek Laskowski  wrote:

> Hi,
>
> It appears that the last commit [1] broke the build. Is anyone working
> on it? I can when told so.
>
> ➜  spark git:(master) ✗ ./build/mvn -Pyarn -Phadoop-2.6
> -Dhadoop.version=2.7.1 -Dscala-2.11 -Phive -Phive-thriftserver
> -DskipTests clean install
> ...
> [info] Compiling 8 Scala sources and 1 Java source to
> /Users/jacek/dev/oss/spark/external/flume/target/scala-2.11/classes...
> [error]
> /Users/jacek/dev/oss/spark/external/flume/src/main/scala/org/apache/spark/streaming/flume/FlumeInputDStream.scala:33:
> object jboss is not a member of package org
> [error] import org.jboss.netty.handler.codec.compression._
> [error]^
> [error]
> /Users/jacek/dev/oss/spark/external/flume/src/main/scala/org/apache/spark/streaming/flume/FlumeInputDStream.scala:31:
> object jboss is not a member of package org
> [error] import org.jboss.netty.channel.{ChannelPipeline,
> ChannelPipelineFactory, Channels}
> [error]^
> [error]
> /Users/jacek/dev/oss/spark/external/flume/src/main/scala/org/apache/spark/streaming/flume/FlumeInputDStream.scala:32:
> object jboss is not a member of package org
> [error] import
> org.jboss.netty.channel.socket.nio.NioServerSocketChannelFactory
> [error]^
> [warn] Class org.jboss.netty.channel.ChannelFactory not found -
> continuing with a stub.
> [warn] Class org.jboss.netty.channel.ChannelFactory not found -
> continuing with a stub.
> [warn] Class org.jboss.netty.channel.ChannelPipelineFactory not found
> - continuing with a stub.
> [warn] Class org.jboss.netty.handler.execution.ExecutionHandler not
> found - continuing with a stub.
> [warn] Class org.jboss.netty.channel.ChannelFactory not found -
> continuing with a stub.
> [warn] Class org.jboss.netty.handler.execution.ExecutionHandler not
> found - continuing with a stub.
> [warn] Class org.jboss.netty.channel.group.ChannelGroup not found -
> continuing with a stub.
> [error]
> /Users/jacek/dev/oss/spark/external/flume/src/main/scala/org/apache/spark/streaming/flume/FlumeInputDStream.scala:149:
> not found: type NioServerSocketChannelFactory
> [error]   val channelFactory = new
> NioServerSocketChannelFactory(Executors.newCachedThreadPool(),
> [error]^
> [error]
> /Users/jacek/dev/oss/spark/external/flume/src/main/scala/org/apache/spark/streaming/flume/FlumeInputDStream.scala:196:
> not found: type ChannelPipelineFactory
> [error]   class CompressionChannelPipelineFactory extends
> ChannelPipelineFactory {
> [error]   ^
> [error] Class org.jboss.netty.channel.ChannelFactory not found -
> continuing with a stub.
> [error] Class org.jboss.netty.channel.ChannelPipelineFactory not found
> - continuing with a stub.
> [error] Class org.jboss.netty.handler.execution.ExecutionHandler not
> found - continuing with a stub.
> [error]
> /Users/jacek/dev/oss/spark/external/flume/src/main/scala/org/apache/spark/streaming/flume/FlumeInputDStream.scala:197:
> not found: type ChannelPipeline
> [error] def getPipeline(): ChannelPipeline = {
> [error]^
> [error]
> /Users/jacek/dev/oss/spark/external/flume/src/main/scala/org/apache/spark/streaming/flume/FlumeInputDStream.scala:198:
> not found: value Channels
> [error]   val pipeline = Channels.pipeline()
> [error]  ^
> [error]
> /Users/jacek/dev/oss/spark/external/flume/src/main/scala/org/apache/spark/streaming/flume/FlumeInputDStream.scala:199:
> not found: type ZlibEncoder
> [error]   val encoder = new ZlibEncoder(6)
> [error] ^
> [error]
> /Users/jacek/dev/oss/spark/external/flume/src/main/scala/org/apache/spark/streaming/flume/FlumePollingInputDStream.scala:29:
> object jboss is not a member of package org
> [error] import
> org.jboss.netty.channel.socket.nio.NioClientSocketChannelFactory
> [error]^
> [error]
> /Users/jacek/dev/oss/spark/external/flume/src/main/scala/org/apache/spark/streaming/flume/FlumePollingInputDStream.scala:73:
> not found: type NioClientSocketChannelFactory
> [error] new NioClientSocketChannelFactory(channelFactoryExecutor,
> channelFactoryExecutor)
> [error] ^
> [warn] Class org.jboss.netty.channel.ChannelFuture not found -
> continuing with a stub.
> [warn] Class org.jboss.netty.channel.ChannelFactory not found -
> continuing with a stub.
> [warn] Class org.jboss.netty.channel.ChannelFactory not found -
> continuing with a stub.
> [warn] Class org.jboss.netty.channel.ChannelFactory not found -
> continuing with a stub.
> [warn] Class org.jboss.netty.channel.ChannelFactory not found -
> continuing with a stub.
> [warn] Class org.jboss.netty.channel.ChannelUpstreamHandler not found
> - continuing with a stub.
> [error] Class org.jboss.netty.channel.ChannelFactory not found -
> 

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Josh Rosen
If users are able to install Spark 2.0 on their RHEL clusters, then I
imagine that they're also capable of installing a standalone Python
alongside that Spark version (without changing Python systemwide). For
instance, Anaconda/Miniconda make it really easy to install Python
2.7.x/3.x without impacting / changing the system Python and doesn't
require any special permissions to install (you don't need root / sudo
access). Does this address the Python versioning concerns for RHEL users?

On Tue, Jan 5, 2016 at 2:33 PM, Koert Kuipers  wrote:

> yeah, the practical concern is that we have no control over java or python
> version on large company clusters. our current reality for the vast
> majority of them is java 7 and python 2.6, no matter how outdated that is.
>
> i dont like it either, but i cannot change it.
>
> we currently don't use pyspark so i have no stake in this, but if we did i
> can assure you we would not upgrade to spark 2.x if python 2.6 was dropped.
> no point in developing something that doesnt run for majority of customers.
>
> On Tue, Jan 5, 2016 at 5:19 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> As I pointed out in my earlier email, RHEL will support Python 2.6 until
>> 2020. So I'm assuming these large companies will have the option of riding
>> out Python 2.6 until then.
>>
>> Are we seriously saying that Spark should likewise support Python 2.6 for
>> the next several years? Even though the core Python devs stopped supporting
>> it in 2013?
>>
>> If that's not what we're suggesting, then when, roughly, can we drop
>> support? What are the criteria?
>>
>> I understand the practical concern here. If companies are stuck using
>> 2.6, it doesn't matter to them that it is deprecated. But balancing that
>> concern against the maintenance burden on this project, I would say that
>> "upgrade to Python 2.7 or stay on Spark 1.6.x" is a reasonable position to
>> take. There are many tiny annoyances one has to put up with to support 2.6.
>>
>> I suppose if our main PySpark contributors are fine putting up with those
>> annoyances, then maybe we don't need to drop support just yet...
>>
>> Nick
>> On Tue, Jan 5, 2016 at 2:27 PM, Julio Antonio Soto de Vicente wrote:
>>
>>> Unfortunately, Koert is right.
>>>
>>> I've been in a couple of projects using Spark (banking industry) where
>>> CentOS + Python 2.6 is the toolbox available.
>>>
>>> That said, I believe it should not be a concern for Spark. Python 2.6 is
>>> old and busted, which is totally opposite to the Spark philosophy IMO.
>>>
>>>
>>> El 5 ene 2016, a las 20:07, Koert Kuipers  escribió:
>>>
>>> rhel/centos 6 ships with python 2.6, doesnt it?
>>>
>>> if so, i still know plenty of large companies where python 2.6 is the
>>> only option. asking them for python 2.7 is not going to work
>>>
>>> so i think its a bad idea
>>>
>>> On Tue, Jan 5, 2016 at 1:52 PM, Juliet Hougland <
>>> juliet.hougl...@gmail.com> wrote:
>>>
 I don't see a reason Spark 2.0 would need to support Python 2.6. At
 this point, Python 3 should be the default that is encouraged.
 Most organizations acknowledge that 2.7 is common, but lagging behind
 the version they should theoretically use. Dropping Python 2.6
 support sounds very reasonable to me.

 On Tue, Jan 5, 2016 at 5:45 AM, Nicholas Chammas <
 nicholas.cham...@gmail.com> wrote:

> +1
>
> Red Hat supports Python 2.6 on RHEL 5 until 2020,
> but otherwise yes, Python 2.6 is ancient history and the core Python
> developers stopped supporting it in 2013. RHEL 5 is not a good enough
> reason to continue support for Python 2.6 IMO.
>
> We should aim to support Python 2.7 and Python 3.3+ (which I believe
> we currently do).
>
> Nick
>
> On Tue, Jan 5, 2016 at 8:01 AM Allen Zhang 
> wrote:
>
>> plus 1,
>>
>> we are currently using python 2.7.2 in production environment.
>>
>>
>>
>>
>>
>> On 2016-01-05 18:11:45, "Meethu Mathew" wrote:
>>
>> +1
>> We use Python 2.7
>>
>> Regards,
>>
>> Meethu Mathew
>>
>> On Tue, Jan 5, 2016 at 12:47 PM, Reynold Xin 
>> wrote:
>>
>>> Does anybody here care about us dropping support for Python 2.6 in
>>> Spark 2.0?
>>>
>>> Python 2.6 is ancient, and is pretty slow in many aspects (e.g. json
>>> parsing) when compared with Python 2.7. Some libraries that Spark 
>>> depend on
>>> stopped supporting 2.6. We can still convince the library maintainers to
>>> support 2.6, but it will be extra work. I'm curious if anybody still 
>>> uses
>>> Python 2.6 to run Spark.
>>>
>>> Thanks.
>>>
>>>
>>>
>>

>>>
>


New processes / tools for changing dependencies in Spark

2015-12-30 Thread Josh Rosen
I just merged https://github.com/apache/spark/pull/10461, a PR that adds
new automated tooling to help us reason about dependency changes in Spark.
Here's a summary of the changes:

   - The dev/run-tests script (used in the SBT Jenkins builds and for
   testing Spark pull requests) now generates a file which contains Spark's
   resolved runtime classpath for each Hadoop profile, then compares that file
   to a copy which is checked into the repository. These dependency lists are
   found at https://github.com/apache/spark/tree/master/dev/deps; there is
   a separate list for each Hadoop profile.

   - If a pull request changes dependencies without updating these manifest
   files, our test script will fail the build and the build console output
   will list the dependency diff.

   - If you are intentionally changing dependencies, run
./dev/test-dependencies.sh
   --replace-manifest to re-generate these dependency manifests then commit
   the changed files and include them with your pull request.

The goal of this change is to make it simpler to reason about build
changes: it should now be much easier to verify whether dependency
exclusions worked properly or determine whether transitive dependencies
changed in a way that affects the final classpath.
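
In practice, a dependency-changing PR might follow a flow like this (a hedged sketch; the two scripts are the ones named above):

    # 1. Make the intended build change, then regenerate the checked-in manifests
    ./dev/test-dependencies.sh --replace-manifest

    # 2. Review exactly what changed on the resolved runtime classpath
    git diff dev/deps/

    # 3. Commit the updated manifests together with the build change;
    #    dev/run-tests (and thus the PR builder) fails if they are stale
    git add dev/deps/ && git commit -m "Update dependency manifests"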

Let me know if you have any questions about this change and, as always,
feel free to submit pull requests if you would like to make any
enhancements to this script.

Thanks,
Josh


Re: Is there any way to stop a jenkins build

2015-12-29 Thread Josh Rosen
Yeah, I thought that my quick fix might address the
HiveThriftBinaryServerSuite hanging issue, but it looks like it didn't work
so I'll now have to do the more principled fix of using a UDF which sleeps
for some amount of time.

In order to stop builds, you need to have a Jenkins account with the proper
permissions. I believe that it's generally only Spark committers and AMPLab
members who have accounts + Jenkins SSH access.

I've gone ahead killed the build for you. It looks like someone had
configured the pull request builder timeout to be 300 minutes (5 hours),
but I think we should consider decreasing that to match the timeout used by
the Spark full test jobs.

On Tue, Dec 29, 2015 at 10:04 AM, Herman van Hövell tot Westerflier <
hvanhov...@questtec.nl> wrote:

> Thanks. I'll merge the most recent master...
>
> Still curious if we can stop a build.
>
> Kind regards,
>
> Herman van Hövell tot Westerflier
>
> 2015-12-29 18:59 GMT+01:00 Ted Yu :
>
>> HiveThriftBinaryServerSuite got stuck.
>>
>> I thought Josh has fixed this issue:
>>
>> [SPARK-11823][SQL] Fix flaky JDBC cancellation test in
>> HiveThriftBinaryServerSuite
>>
>> On Tue, Dec 29, 2015 at 9:56 AM, Herman van Hövell tot Westerflier <
>> hvanhov...@questtec.nl> wrote:
>>
>>> My AMPLAB jenkins build has been stuck for a few hours now:
>>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48414/consoleFull
>>>
>>> Is there a way for me to stop the build?
>>>
>>> Kind regards,
>>>
>>> Herman van Hövell
>>>
>>>
>>
>


Re: [VOTE] Release Apache Spark 1.6.0 (RC4)

2015-12-22 Thread Josh Rosen
+1

On Tue, Dec 22, 2015 at 7:00 PM, Jeff Zhang  wrote:

> +1
>
> On Wed, Dec 23, 2015 at 7:36 AM, Mark Hamstra 
> wrote:
>
>> +1
>>
>> On Tue, Dec 22, 2015 at 12:10 PM, Michael Armbrust <
>> mich...@databricks.com> wrote:
>>
>>> Please vote on releasing the following candidate as Apache Spark version
>>> 1.6.0!
>>>
>>> The vote is open until Friday, December 25, 2015 at 18:00 UTC and
>>> passes if a majority of at least 3 +1 PMC votes are cast.
>>>
>>> [ ] +1 Release this package as Apache Spark 1.6.0
>>> [ ] -1 Do not release this package because ...
>>>
>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>
>>> The tag to be voted on is *v1.6.0-rc4
>>> (4062cda3087ae42c6c3cb24508fc1d3a931accdf)
>>> *
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc4-bin/
>>>
>>> Release artifacts are signed with the following key:
>>> https://people.apache.org/keys/committer/pwendell.asc
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1176/
>>>
>>> The test repository (versioned as v1.6.0-rc4) for this release can be
>>> found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1175/
>>>
>>> The documentation corresponding to this release can be found at:
>>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc4-docs/
>>>
>>> ===
>>> == How can I help test this release? ==
>>> ===
>>> If you are a Spark user, you can help us test this release by taking an
>>> existing Spark workload and running on this release candidate, then
>>> reporting any regressions.
>>>
>>> 
>>> == What justifies a -1 vote for this release? ==
>>> 
>>> This vote is happening towards the end of the 1.6 QA period, so -1 votes
>>> should only occur for significant regressions from 1.5. Bugs already
>>> present in 1.5, minor regressions, or bugs related to new features will not
>>> block this release.
>>>
>>> ===
>>> == What should happen to JIRA tickets still targeting 1.6.0? ==
>>> ===
>>> 1. It is OK for documentation patches to target 1.6.0 and still go into
>>> branch-1.6, since documentations will be published separately from the
>>> release.
>>> 2. New features for non-alpha-modules should target 1.7+.
>>> 3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the
>>> target version.
>>>
>>>
>>> ==
>>> == Major changes to help you focus your testing ==
>>> ==
>>>
>>> Notable changes since 1.6 RC3
>>>
>>>   - SPARK-12404 - Fix serialization error for Datasets with
>>> Timestamps/Arrays/Decimal
>>>   - SPARK-12218 - Fix incorrect pushdown of filters to parquet
>>>   - SPARK-12395 - Fix join columns of outer join for DataFrame using
>>>   - SPARK-12413 - Fix mesos HA
>>>
>>> Notable changes since 1.6 RC2
>>> - SPARK_VERSION has been set correctly
>>> - SPARK-12199 ML Docs are publishing correctly
>>> - SPARK-12345 Mesos cluster mode has been fixed
>>>
>>> Notable changes since 1.6 RC1
>>> Spark Streaming
>>>
>>>- SPARK-2629  
>>>trackStateByKey has been renamed to mapWithState
>>>
>>> Spark SQL
>>>
>>>- SPARK-12165 
>>>SPARK-12189  Fix
>>>bugs in eviction of storage memory by execution.
>>>- SPARK-12258  correct
>>>passing null into ScalaUDF
>>>
>>> Notable Features Since 1.5Spark SQL
>>>
>>>- SPARK-11787  Parquet
>>>Performance - Improve Parquet scan performance when using flat
>>>schemas.
>>>- SPARK-10810 
>>>Session Management - Isolated default database (i.e. USE mydb) even
>>>on shared clusters.
>>>- SPARK-   Dataset
>>>API - A type-safe API (similar to RDDs) that performs many
>>>operations on serialized binary data and code generation (i.e. Project
>>>Tungsten).
>>>- SPARK-1  Unified
>>>Memory Management - Shared memory for execution and caching instead
>>>of exclusive division of the regions.
>>>- SPARK-11197  SQL
>>>Queries on Files - Concise syntax for running SQL queries over 

Re: Spark fails after 6000s because of akka

2015-12-20 Thread Josh Rosen
Would you mind copying this information into a JIRA ticket to make it
easier to discover / track? Thanks!

On Sun, Dec 20, 2015 at 11:35 AM Alexander Pivovarov 
wrote:

> Usually Spark EMR job fails with the following exception in 1 hour 40 min
> - Job cancelled because SparkContext was shut down
>
> java.util.concurrent.RejectedExecutionException: Task 
> scala.concurrent.impl.CallbackRunnable@2d602a14 rejected from 
> java.util.concurrent.ThreadPoolExecutor@46a9e52[Terminated, pool size = 0, 
> active threads = 0, queued tasks = 0, completed tasks = 6294]
>   at 
> java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2048)
>   at 
> java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:821)
>   at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1372)
>   at 
> scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:133)
>   at 
> scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
>   at scala.concurrent.Promise$class.complete(Promise.scala:55)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:153)
>   at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:324)
>   at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:324)
>   at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
>   at 
> org.spark-project.guava.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293)
>   at 
> scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:133)
>   at 
> scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
>   at scala.concurrent.Promise$class.complete(Promise.scala:55)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:153)
>   at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235)
>   at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235)
>   at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
>   at 
> scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.processBatch$1(Future.scala:643)
>   at 
> scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.apply$mcV$sp(Future.scala:658)
>   at 
> scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.apply(Future.scala:635)
>   at 
> scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.apply(Future.scala:635)
>   at 
> scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
>   at 
> scala.concurrent.Future$InternalCallbackExecutor$Batch.run(Future.scala:634)
>   at 
> scala.concurrent.Future$InternalCallbackExecutor$.scala$concurrent$Future$InternalCallbackExecutor$$unbatchedExecute(Future.scala:694)
>   at 
> scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:685)
>   at 
> scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
>   at scala.concurrent.Promise$class.complete(Promise.scala:55)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:153)
>   at scala.concurrent.Future$$anonfun$flatMap$1.apply(Future.scala:249)
>   at scala.concurrent.Future$$anonfun$flatMap$1.apply(Future.scala:249)
>   at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
>   at 
> org.spark-project.guava.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293)
>   at 
> scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:133)
>   at 
> scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
>   at 
> akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:334)
>   at akka.actor.Scheduler$$anon$7.run(Scheduler.scala:117)
>   at 
> scala.concurrent.Future$InternalCallbackExecutor$.scala$concurrent$Future$InternalCallbackExecutor$$unbatchedExecute(Future.scala:694)
>   at 
> scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:691)
>   at 
> akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(Scheduler.scala:467)
>   at 
> akka.actor.LightArrayRevolverScheduler$$anon$8.executeBucket$1(Scheduler.scala:419)
>   at 
> akka.actor.LightArrayRevolverScheduler$$anon$8.nextTick(Scheduler.scala:423)
>   at 
> akka.actor.LightArrayRevolverScheduler$$anon$8.run(Scheduler.scala:375)
>   at java.lang.Thread.run(Thread.java:745)
> Exception in 

Re: JIRA: Wrong dates from imported JIRAs

2015-12-16 Thread Josh Rosen
Personally, I'd rather avoid the risk of breaking things during the
reimport. In my experience we've had a lot of unforeseen problems with JIRA
import/export and the benefit here doesn't seem huge (this issue only
impacts people that are searching for the oldest JIRAs across all projects,
which I think is pretty uncommon). Just my two cents.

- Josh

On Wed, Dec 16, 2015 at 8:45 AM, Lars Francke 
wrote:

> Any other opinions on this?
>
> On Fri, Dec 11, 2015 at 9:54 AM, Lars Francke 
> wrote:
>
>> That's a good point. I assume there's always a small risk but it's at
>> least the documented way from Atlassian to change the creation date so I'd
>> hope it should be okay. I'd build the minimal CSV file.
>>
>> I agree that probably not a lot of people are going to search across
>> projects but on the other hand it's a one-time fix and who knows how long
>> the Apache Jira is going to live :)
>>
>> On Fri, Dec 11, 2015 at 9:05 AM, Reynold Xin  wrote:
>>
>>> Thanks for looking at this. Is it worth fixing? Is there a risk
>>> (although small) that the re-import would break other things?
>>>
>>> Most of those are done and I don't know how often people search JIRAs by
>>> date across projects.
>>>
>>> On Fri, Dec 11, 2015 at 3:40 PM, Lars Francke 
>>> wrote:
>>>
 Hi,

 I've been digging into JIRA a bit and found a couple of old issues
 (~250) and I just assume that they are all from the old JIRA.

 Here's one example:

 Old: 
 New: 

 created": "0012-08-21T09:03:00.000-0800",

 That's quite impressive but wrong :)

 That means when you sort all Apache JIRAs by creation date Spark comes
 first: <
 https://issues.apache.org/jira/issues/?jql=order%20By%20createdDate%20ASC=250
 >

 The dates were already wrong in the source JIRA.

 Now it seems as if those can be fixed using a CSV import. I still
 remember how painful the initial import was but this looks relatively
 straight forward <
 https://confluence.atlassian.com/display/JIRAKB/How+to+change+the+issue+creation+date+using+CSV+import
 >

 If everyone's okay with it I'd raise it with INFRA (and would prepare
 the necessary CSV file) but as I'm not a committer it'd be great if
 one/some of the committers could give me a +1

 Cheers,
 Lars

>>>
>>>
>>
>


Re: Fastest way to build Spark from scratch

2015-12-09 Thread Josh Rosen
Yeah, this is the same idea behind having Travis cache the ivy2 folder to
speed up builds. In Amplab Jenkins each individual build workspace has its
own individual Ivy cache which is preserved across build runs but which is
only used by one active run at a time in order to avoid SBT ivy lock
contention (this shouldn't be an issue in most environments though).
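
For the "fresh EC2 instance" case discussed below, one hedged way to reuse those caches across machines (bucket name and paths are illustrative):

    # On a machine that has already built Spark once, snapshot the resolver caches
    tar czf spark-build-caches.tgz -C "$HOME" .ivy2 .m2/repository
    aws s3 cp spark-build-caches.tgz s3://my-bucket/spark-build-caches.tgz

    # On a freshly launched instance, restore them before the first build
    aws s3 cp s3://my-bucket/spark-build-caches.tgz .
    tar xzf spark-build-caches.tgz -C "$HOME"
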
On Tue, Dec 8, 2015 at 10:32 AM Nicholas Chammas <nicholas.cham...@gmail.com>
wrote:

> Interesting. As long as Spark's dependencies don't change that often, the
> same caches could save "from scratch" build time over many months of Spark
> development. Is that right?
>
>
> On Tue, Dec 8, 2015 at 12:33 PM Josh Rosen <joshro...@databricks.com>
> wrote:
>
>> @Nick, on a fresh EC2 instance a significant chunk of the initial build
>> time might be due to artifact resolution + downloading. Putting
>> pre-populated Ivy and Maven caches onto your EC2 machine could shave a
>> decent chunk of time off that first build.
>>
>> On Tue, Dec 8, 2015 at 9:16 AM, Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> Thanks for the tips, Jakob and Steve.
>>>
>>> It looks like my original approach is the best for me since I'm
>>> installing Spark on newly launched EC2 instances and can't take advantage
>>> of incremental compilation.
>>>
>>> Nick
>>>
>>> On Tue, Dec 8, 2015 at 7:01 AM Steve Loughran <ste...@hortonworks.com>
>>> wrote:
>>>
>>>> On 7 Dec 2015, at 19:07, Jakob Odersky <joder...@gmail.com> wrote:
>>>>
>>>> make-distribution and the second code snippet both create a
>>>> distribution from a clean state. They therefore require that every source
>>>> file be compiled and that takes time (you can maybe tweak some settings or
>>>> use a newer compiler to gain some speed).
>>>>
>>>> I'm inferring from your question that for your use-case deployment
>>>> speed is a critical issue, furthermore you'd like to build Spark for lots
>>>> of (every?) commit in a systematic way. In that case I would suggest you
>>>> try using the second code snippet without the `clean` task and only resort
>>>> to it if the build fails.
>>>>
>>>> On my local machine, an assembly without a clean drops from 6 minutes
>>>> to 2.
>>>>
>>>> regards,
>>>> --Jakob
>>>>
>>>>
>>>> 1. you can use zinc -where possible- to speed up scala compilations
>>>> 2. you might also consider setting up a local jenkins VM, hooked to
>>>> whatever git repo & branch you are working off, and have it do the builds
>>>> and tests for you. Not so great for interactive dev,
>>>>
>>>> finally, on the mac, the "say" command is pretty handy at letting you
>>>> know when some work in a terminal is ready, so you can do the
>>>> first-thing-in-the morning build-of-the-SNAPSHOTS
>>>>
>>>> mvn install -DskipTests -Pyarn,hadoop-2.6 -Dhadoop.version=2.7.1; say
>>>> moo
>>>>
>>>> After that you can work on the modules you care about (via the -pl)
>>>> option). That doesn't work if you are running on an EC2 instance though
>>>>
>>>>
>>>>
>>>>
>>>> On 23 November 2015 at 20:18, Nicholas Chammas <
>>>> nicholas.cham...@gmail.com> wrote:
>>>>
>>>>> Say I want to build a complete Spark distribution against Hadoop 2.6+
>>>>> as fast as possible from scratch.
>>>>>
>>>>> This is what I’m doing at the moment:
>>>>>
>>>>> ./make-distribution.sh -T 1C -Phadoop-2.6
>>>>>
>>>>> -T 1C instructs Maven to spin up 1 thread per available core. This
>>>>> takes around 20 minutes on an m3.large instance.
>>>>>
>>>>> I see that spark-ec2, on the other hand, builds Spark as follows
>>>>> <https://github.com/amplab/spark-ec2/blob/a990752575cd8b0ab25731d7820a55c714798ec3/spark/init.sh#L21-L22>
>>>>> when you deploy Spark at a specific git commit:
>>>>>
>>>>> sbt/sbt clean assembly
>>>>> sbt/sbt publish-local
>>>>>
>>>>> This seems slower than using make-distribution.sh, actually.
>>>>>
>>>>> Is there a faster way to do this?
>>>>>
>>>>> Nick
>>>>> ​
>>>>>
>>>>
>>>>
>>>>
>>


Re: Spark doesn't unset HADOOP_CONF_DIR when testing ?

2015-12-06 Thread Josh Rosen
I agree that we should unset this in our tests. Want to file a JIRA and
submit a PR to do this?
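
Until such a change lands, a hedged local workaround is simply to clear the variables in the shell that runs the suite (the exact set of variables a local Hadoop install exports will vary, and the sbt invocation is just one way to run that suite):

    # Keep the test JVM from picking up the local single-node cluster's config
    unset HADOOP_CONF_DIR YARN_CONF_DIR
    ./build/sbt "hive/test-only *HiveSparkSubmitSuite"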

On Thu, Dec 3, 2015 at 6:40 PM Jeff Zhang  wrote:

> I try to do test on HiveSparkSubmitSuite on local box, but fails. The
> cause is that spark is still using my local single node cluster hadoop when
> doing the unit test. I don't think it make sense to do that. These
> environment variable should be unset before the testing. And I suspect
> dev/run-tests also
> didn't do that either.
>
> Here's the error message:
>
> Cause: java.lang.RuntimeException: java.lang.RuntimeException: The root
> scratch dir: /tmp/hive on HDFS should be writable. Current permissions are:
> rwxr-xr-x
> [info]   at
> org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
> [info]   at
> org.apache.spark.sql.hive.client.ClientWrapper.(ClientWrapper.scala:171)
> [info]   at
> org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:162)
> [info]   at
> org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:160)
>
>
>
> --
> Best Regards
>
> Jeff Zhang
>


Re: IntelliJ license for committers?

2015-12-02 Thread Josh Rosen
Yep, I'm the point of contact between us and JetBrains. I forwarded the
2015 license renewal email to the private@ list, so it should be accessible
via the archives. I'll go ahead and forward you a copy of our project
license, which will have to be renewed in January of next year.

On Wed, Dec 2, 2015 at 10:24 AM, Sean Owen  wrote:

> Thanks, yes I've seen this, though I recall from another project that
> at some point they said, wait, we already gave your project a license!
> and I had to track down who had it. I think Josh might be the keeper?
> Not a big deal, just making sure I didn't miss an update there.
>
> On Wed, Dec 2, 2015 at 6:18 PM, Yin Huai  wrote:
> > I think they can renew your license. In
> > https://www.jetbrains.com/buy/opensource/?product=idea, you can find
> "Update
> > Open Source License".
> >
> > On Wed, Dec 2, 2015 at 7:47 AM, Sean Owen  wrote:
> >>
> >> I'm aware that IntelliJ has (at least in the past) made licenses
> >> available to committers in bona fide open source projects, and I
> >> recall they did the same for Spark. I believe I'm using that license
> >> now, but it seems to have expired? If anyone knows the status of that
> >> (or of any renewals to the license), I wonder if you could share that
> >> with me, offline of course.
> >>
> >> -
> >> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> >> For additional commands, e-mail: dev-h...@spark.apache.org
> >>
> >
>


Re: Bringing up JDBC Tests to trunk

2015-11-30 Thread Josh Rosen
The JDBC drivers are currently being pulled in as test-scope dependencies
of the `sql/core` module:
https://github.com/apache/spark/blob/f2fbfa444f6e8d27953ec2d1c0b3abd603c963f9/sql/core/pom.xml#L91

In SBT, these wind up on the Docker JDBC tests' classpath as a transitive
dependency of the `spark-sql` test JAR. However, what we *should* be doing
is adding them as explicit test dependencies of the
`docker-integration-tests` subproject, since Maven handles transitive test
JAR dependencies differently than SBT (see
https://github.com/apache/spark/pull/9876#issuecomment-158593498 for some
discussion). If you choose to make that fix as part of your PR, be sure to
move the version handling to the root POM's <dependencyManagement> section
so that the versions in both modules stay in sync. We might also be able to
simply move the JDBC driver dependencies to docker-integration-tests'
POM if it turns out that they're not used anywhere else (that's my hunch).
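
In SBT syntax the explicit test-scope declarations would look roughly like
this (versions are illustrative; the Maven change is the analogous pair of
test-scoped <dependency> entries, with the versions managed from the root POM):

// Hypothetical additions to the docker-integration-tests subproject's settings.
libraryDependencies ++= Seq(
  "mysql"          % "mysql-connector-java" % "5.1.38"   % "test",
  "org.postgresql" % "postgresql"           % "9.4.1207" % "test"
)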

On Sun, Nov 22, 2015 at 6:49 PM, Luciano Resende <luckbr1...@gmail.com>
wrote:

> Hey Josh,
>
> Thanks for helping bringing this up, I have just pushed a WIP PR for
> bringing the DB2 tests to be running on Docker, and I have a question about
> how the jdbc drivers are actually being setup for the other datasources
> (MySQL and PostgreSQL), are these setup directly on the Jenkins slaves ? I
> didn't see the jars or anything specific on the pom or other files...
>
>
> Thanks
>
> On Wed, Oct 21, 2015 at 1:26 PM, Josh Rosen <rosenvi...@gmail.com> wrote:
>
>> Hey Luciano,
>>
>> This sounds like a reasonable plan to me. One of my colleagues has
>> written some Dockerized MySQL testing utilities, so I'll take a peek at
>> those to see if there are any specifics of their solution that we should
>> adapt for Spark.
>>
>> On Wed, Oct 21, 2015 at 1:16 PM, Luciano Resende <luckbr1...@gmail.com>
>> wrote:
>>
>>> I have started looking into PR-8101 [1] and what is required to merge it
>>> into trunk which will also unblock me around SPARK-10521 [2].
>>>
>>> So here is the minimal plan I was thinking about :
>>>
>>> - make the docker image version fixed so we make sure we are using the
>>> same image all the time
>>> - pull the required images on the Jenkins executors so tests are not
>>> delayed/timedout because it is waiting for docker images to download
>>> - create a profile to run the JDBC tests
>>> - create daily jobs for running the JDBC tests
>>>
>>>
>>> In parallel, I learned that Alan Chin from my team is working with the
>>> AmpLab team to expand the build capacity for Spark, so I will use some of
>>> the nodes he is preparing to test/run these builds for now.
>>>
>>> Please let me know if there is anything else needed around this.
>>>
>>>
>>> [1] https://github.com/apache/spark/pull/8101
>>> [2] https://issues.apache.org/jira/browse/SPARK-10521
>>>
>>> --
>>> Luciano Resende
>>> http://people.apache.org/~lresende
>>> http://twitter.com/lresende1975
>>> http://lresende.blogspot.com/
>>>
>>
>>
>
>
> --
> Luciano Resende
> http://people.apache.org/~lresende
> http://twitter.com/lresende1975
> http://lresende.blogspot.com/
>


Re: VerifyError running Spark SQL code?

2015-11-25 Thread Josh Rosen
I think I've also seen this issue as well, but in a different suite. I
wasn't able to easily get to the bottom of it, though. What JDK / JRE are
you using? I'm on


Java(TM) SE Runtime Environment (build 1.7.0_65-b17)
Java HotSpot(TM) 64-Bit Server VM (build 24.65-b04, mixed mode)

on OSX.
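
(If it helps with comparing environments, the same details can be dumped from
inside a Scala REPL or a forked test JVM:)

// Prints roughly the same information as `java -version` for the running JVM.
println(System.getProperty("java.version"))
println(System.getProperty("java.vm.name") + " (build " +
  System.getProperty("java.vm.version") + ")")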

On Wed, Nov 25, 2015 at 4:51 PM, Marcelo Vanzin  wrote:

> I've been running into this error when running Spark SQL recently; no
> matter what I try (completely clean build or anything else) doesn't
> seem to fix it. Anyone has some idea of what's wrong?
>
> [info] Exception encountered when attempting to run a suite with class
> name: org.apache.spark.sql.execution.ui.SQLListenerMemoryLeakSuite ***
> ABORTED *** (4 seconds, 111 milliseconds)
> [info]   java.lang.VerifyError: Bad <init> method call from inside of a
> branch
> [info] Exception Details:
> [info]   Location:
> [info]
>  
>  org/apache/spark/sql/catalyst/expressions/aggregate/HyperLogLogPlusPlus.<init>(Lorg/apache/spark/sql/catalyst/expressions/Expression;Lorg/apache/spark/sql/catalyst/expressions/Expression;)V
> @82: invokespecial
>
> Same happens with spark shell (when instantiating SQLContext), so not
> an issue with the test code...
>
> --
> Marcelo
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: Does anyone meet the issue that jars under lib_managed is never downloaded ?

2015-11-17 Thread Josh Rosen
Can you file a JIRA issue to help me triage this further? Thanks!

On Tue, Nov 17, 2015 at 4:08 PM Jeff Zhang <zjf...@gmail.com> wrote:

> Sure, hive profile is enabled.
>
> On Wed, Nov 18, 2015 at 6:12 AM, Josh Rosen <joshro...@databricks.com>
> wrote:
>
>> Is the Hive profile enabled? I think it may need to be turned on in order
>> for those JARs to be deployed.
>>
>> On Tue, Nov 17, 2015 at 2:27 AM Jeff Zhang <zjf...@gmail.com> wrote:
>>
>>> BTW, After I revert  SPARK-7841, I can see all the jars under
>>> lib_managed/jars
>>>
>>> On Tue, Nov 17, 2015 at 2:46 PM, Jeff Zhang <zjf...@gmail.com> wrote:
>>>
>>>> Hi Josh,
>>>>
>>>> I notice the comments in https://github.com/apache/spark/pull/9575 said
>>>> that Datanucleus related jars will still be copied to
>>>> lib_managed/jars. But I don't see any jars under lib_managed/jars.
>>>> The weird thing is that I see the jars on another machine, but could not
>>>> see jars on my laptop even after I delete the whole spark project and start
>>>> from scratch. Does it related with environments ? I try to add the
>>>> following code in SparkBuild.scala to track the issue, it shows that the
>>>> jars is empty. Any thoughts on that ?
>>>>
>>>>
>>>> deployDatanucleusJars := {
>>>>   val jars: Seq[File] = (fullClasspath in
>>>> assembly).value.map(_.data)
>>>> .filter(_.getPath.contains("org.datanucleus"))
>>>>   // this is what I added
>>>>   println("*")
>>>>   println("fullClasspath:"+fullClasspath)
>>>>   println("assembly:"+assembly)
>>>>   println("jars:"+jars.map(_.getAbsolutePath()).mkString(","))
>>>>   //
>>>>
>>>>
>>>> On Mon, Nov 16, 2015 at 4:51 PM, Jeff Zhang <zjf...@gmail.com> wrote:
>>>>
>>>>> This is the exception I got
>>>>>
>>>>> 15/11/16 16:50:48 WARN metastore.HiveMetaStore: Retrying creating
>>>>> default database after error: Class
>>>>> org.datanucleus.api.jdo.JDOPersistenceManagerFactory was not found.
>>>>> javax.jdo.JDOFatalUserException: Class
>>>>> org.datanucleus.api.jdo.JDOPersistenceManagerFactory was not found.
>>>>> at
>>>>> javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1175)
>>>>> at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:808)
>>>>> at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:701)
>>>>> at
>>>>> org.apache.hadoop.hive.metastore.ObjectStore.getPMF(ObjectStore.java:365)
>>>>> at
>>>>> org.apache.hadoop.hive.metastore.ObjectStore.getPersistenceManager(ObjectStore.java:394)
>>>>> at
>>>>> org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:291)
>>>>> at
>>>>> org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:258)
>>>>> at
>>>>> org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73)
>>>>> at
>>>>> org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
>>>>> at
>>>>> org.apache.hadoop.hive.metastore.RawStoreProxy.<init>(RawStoreProxy.java:57)
>>>>> at
>>>>> org.apache.hadoop.hive.metastore.RawStoreProxy.getProxy(RawStoreProxy.java:66)
>>>>> at
>>>>> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newRawStore(HiveMetaStore.java:593)
>>>>> at
>>>>> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:571)
>>>>> at
>>>>> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:620)
>>>>> at
>>>>> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:461)
>>>>> at
>>>>> org.apache.hadoop.hive.metastore.RetryingHMSHandler.<init>(RetryingHMSHandler.java:66)
>>>>> at
>>>>> org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:72)
>>>>> at
>>>>> org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:5762)
>>>>> at
>>>>>

Re: Does anyone meet the issue that jars under lib_managed is never downloaded ?

2015-11-17 Thread Josh Rosen
Is the Hive profile enabled? I think it may need to be turned on in order
for those JARs to be deployed.
On Tue, Nov 17, 2015 at 2:27 AM Jeff Zhang <zjf...@gmail.com> wrote:

> BTW, After I revert  SPARK-7841, I can see all the jars under
> lib_managed/jars
>
> On Tue, Nov 17, 2015 at 2:46 PM, Jeff Zhang <zjf...@gmail.com> wrote:
>
>> Hi Josh,
>>
>> I notice the comments in https://github.com/apache/spark/pull/9575 said
>> that Datanucleus related jars will still be copied to lib_managed/jars.
>> But I don't see any jars under lib_managed/jars. The weird thing is that I
>> see the jars on another machine, but could not see jars on my laptop even
>> after I delete the whole spark project and start from scratch. Does it
>> related with environments ? I try to add the following code in
>> SparkBuild.scala to track the issue, it shows that the jars is empty. Any
>> thoughts on that ?
>>
>>
>> deployDatanucleusJars := {
>>   val jars: Seq[File] = (fullClasspath in assembly).value.map(_.data)
>> .filter(_.getPath.contains("org.datanucleus"))
>>   // this is what I added
>>   println("*")
>>   println("fullClasspath:"+fullClasspath)
>>   println("assembly:"+assembly)
>>   println("jars:"+jars.map(_.getAbsolutePath()).mkString(","))
>>   //
>>
>>
>> On Mon, Nov 16, 2015 at 4:51 PM, Jeff Zhang <zjf...@gmail.com> wrote:
>>
>>> This is the exception I got
>>>
>>> 15/11/16 16:50:48 WARN metastore.HiveMetaStore: Retrying creating
>>> default database after error: Class
>>> org.datanucleus.api.jdo.JDOPersistenceManagerFactory was not found.
>>> javax.jdo.JDOFatalUserException: Class
>>> org.datanucleus.api.jdo.JDOPersistenceManagerFactory was not found.
>>> at
>>> javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1175)
>>> at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:808)
>>> at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:701)
>>> at
>>> org.apache.hadoop.hive.metastore.ObjectStore.getPMF(ObjectStore.java:365)
>>> at
>>> org.apache.hadoop.hive.metastore.ObjectStore.getPersistenceManager(ObjectStore.java:394)
>>> at
>>> org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:291)
>>> at
>>> org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:258)
>>> at
>>> org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73)
>>> at
>>> org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
>>> at
>>> org.apache.hadoop.hive.metastore.RawStoreProxy.<init>(RawStoreProxy.java:57)
>>> at
>>> org.apache.hadoop.hive.metastore.RawStoreProxy.getProxy(RawStoreProxy.java:66)
>>> at
>>> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newRawStore(HiveMetaStore.java:593)
>>> at
>>> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:571)
>>> at
>>> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:620)
>>> at
>>> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:461)
>>> at
>>> org.apache.hadoop.hive.metastore.RetryingHMSHandler.<init>(RetryingHMSHandler.java:66)
>>> at
>>> org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:72)
>>> at
>>> org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:5762)
>>> at
>>> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:199)
>>> at
>>> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.<init>(SessionHiveMetaStoreClient.java:74)
>>> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>>> at
>>> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>>> at
>>> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>>> at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>>> at
>>> org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1521)
>>> at
>>> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:86)
>>>
>>> On Mon, Nov 16, 2015 at 4:47 PM, Jeff Zhang <

Re: Does anyone meet the issue that jars under lib_managed is never downloaded ?

2015-11-16 Thread Josh Rosen
As of https://github.com/apache/spark/pull/9575, Spark's build will no
longer place every dependency JAR into lib_managed. Can you say more about
how this affected spark-shell for you (maybe share a stacktrace)?

On Mon, Nov 16, 2015 at 12:03 AM, Jeff Zhang  wrote:

>
> Sometimes, the jars under lib_managed is missing. And after I rebuild the
> spark, the jars under lib_managed is still not downloaded. This would cause
> the spark-shell fail due to jars missing. Anyone has hit this weird issue ?
>
>
>
> --
> Best Regards
>
> Jeff Zhang
>


Re: A proposal for Spark 2.0

2015-11-10 Thread Josh Rosen
There's a proposal / discussion of the assembly-less distributions at
https://github.com/vanzin/spark/pull/2/files /
https://issues.apache.org/jira/browse/SPARK-11157.

On Tue, Nov 10, 2015 at 3:53 PM, Reynold Xin  wrote:

>
> On Tue, Nov 10, 2015 at 3:35 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>>
>> > 3. Assembly-free distribution of Spark: don’t require building an
>> enormous assembly jar in order to run Spark.
>>
>> Could you elaborate a bit on this? I'm not sure what an assembly-free
>> distribution means.
>>
>>
> Right now we ship Spark using a single assembly jar, which causes a few
> different problems:
>
> - total number of classes are limited on some configurations
>
> - dependency swapping is harder
>
>
> The proposal is to just avoid a single fat jar.
>
>
>


Re: pyspark with pypy not work for spark-1.5.1

2015-11-05 Thread Josh Rosen
I noticed that you're using PyPy 2.2.1, but it looks like Spark 1.5.1's
docs say that we only support PyPy 2.3+. Could you try using a newer PyPy
version to see if that works?

I just checked and it looks like our Jenkins tests are running against PyPy
2.5.1, so that version is known to work. I'm not sure what the actual
minimum supported PyPy version is. Would you be interested in helping to
investigate so that we can update the documentation or produce a fix to
restore compatibility with earlier PyPy builds?

On Wed, Nov 4, 2015 at 11:56 PM, Chang Ya-Hsuan  wrote:

> Hi all,
>
> I am trying to run pyspark with pypy, and it is work when using
> spark-1.3.1 but failed when using spark-1.4.1 and spark-1.5.1
>
> my pypy version:
>
> $ /usr/bin/pypy --version
> Python 2.7.3 (2.2.1+dfsg-1ubuntu0.3, Sep 30 2015, 15:18:40)
> [PyPy 2.2.1 with GCC 4.8.4]
>
> works with spark-1.3.1
>
> $ PYSPARK_PYTHON=/usr/bin/pypy
> ~/Tool/spark-1.3.1-bin-hadoop2.6/bin/pyspark
> Python 2.7.3 (2.2.1+dfsg-1ubuntu0.3, Sep 30 2015, 15:18:40)
> [PyPy 2.2.1 with GCC 4.8.4] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> 15/11/05 15:50:30 WARN Utils: Your hostname, xx resolves to a loopback
> address: 127.0.1.1; using xxx.xxx.xxx.xxx instead (on interface eth0)
> 15/11/05 15:50:30 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to
> another address
> 15/11/05 15:50:31 WARN NativeCodeLoader: Unable to load native-hadoop
> library for your platform... using builtin-java classes where applicable
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/__ / .__/\_,_/_/ /_/\_\   version 1.3.1
>   /_/
>
> Using Python version 2.7.3 (2.2.1+dfsg-1ubuntu0.3, Sep 30 2015)
> SparkContext available as sc, HiveContext available as sqlContext.
> And now for something completely different: ``Armin: "Prolog is a mess.",
> CF:
> "No, it's very cool!", Armin: "Isn't this what I said?"''
> >>>
>
> error message for 1.5.1
>
> $ PYSPARK_PYTHON=/usr/bin/pypy
> ~/Tool/spark-1.5.1-bin-hadoop2.6/bin/pyspark
> Python 2.7.3 (2.2.1+dfsg-1ubuntu0.3, Sep 30 2015, 15:18:40)
> [PyPy 2.2.1 with GCC 4.8.4] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> Traceback (most recent call last):
>   File "app_main.py", line 72, in run_toplevel
>   File "app_main.py", line 614, in run_it
>   File
> "/home/yahsuan/Tool/spark-1.5.1-bin-hadoop2.6/python/pyspark/shell.py",
> line 30, in <module>
> import pyspark
>   File
> "/home/yahsuan/Tool/spark-1.5.1-bin-hadoop2.6/python/pyspark/__init__.py",
> line 41, in <module>
> from pyspark.context import SparkContext
>   File
> "/home/yahsuan/Tool/spark-1.5.1-bin-hadoop2.6/python/pyspark/context.py",
> line 26, in <module>
> from pyspark import accumulators
>   File
> "/home/yahsuan/Tool/spark-1.5.1-bin-hadoop2.6/python/pyspark/accumulators.py",
> line 98, in <module>
> from pyspark.serializers import read_int, PickleSerializer
>   File
> "/home/yahsuan/Tool/spark-1.5.1-bin-hadoop2.6/python/pyspark/serializers.py",
> line 400, in <module>
> _hijack_namedtuple()
>   File
> "/home/yahsuan/Tool/spark-1.5.1-bin-hadoop2.6/python/pyspark/serializers.py",
> line 378, in _hijack_namedtuple
> _old_namedtuple = _copy_func(collections.namedtuple)
>   File
> "/home/yahsuan/Tool/spark-1.5.1-bin-hadoop2.6/python/pyspark/serializers.py",
> line 376, in _copy_func
> f.__defaults__, f.__closure__)
> AttributeError: 'function' object has no attribute '__closure__'
> And now for something completely different: ``the traces don't lie''
>
> is this a known issue? any suggestion to resolve it? or how can I help to
> fix this problem?
>
> Thanks.
>


Re: pyspark with pypy not work for spark-1.5.1

2015-11-05 Thread Josh Rosen
You could try running PySpark's own unit tests. Try ./python/run-tests
--help for instructions.

On Thu, Nov 5, 2015 at 12:31 AM Chang Ya-Hsuan <sumti...@gmail.com> wrote:

> I've test on following pypy version against to spark-1.5.1
>
>   pypy-2.2.1
>   pypy-2.3
>   pypy-2.3.1
>   pypy-2.4.0
>   pypy-2.5.0
>   pypy-2.5.1
>   pypy-2.6.0
>   pypy-2.6.1
>
> I run
>
> $ PYSPARK_PYTHON=/path/to/pypy-xx.xx/bin/pypy
> /path/to/spark-1.5.1/bin/pyspark
>
> and only pypy-2.2.1 failed.
>
> Any suggestion to run advanced test?
>
> On Thu, Nov 5, 2015 at 4:14 PM, Chang Ya-Hsuan <sumti...@gmail.com> wrote:
>
>> Thanks for your quickly reply.
>>
>> I will test several pypy versions and report the result later.
>>
>> On Thu, Nov 5, 2015 at 4:06 PM, Josh Rosen <rosenvi...@gmail.com> wrote:
>>
>>> I noticed that you're using PyPy 2.2.1, but it looks like Spark 1.5.1's
>>> docs say that we only support PyPy 2.3+. Could you try using a newer PyPy
>>> version to see if that works?
>>>
>>> I just checked and it looks like our Jenkins tests are running against
>>> PyPy 2.5.1, so that version is known to work. I'm not sure what the actual
>>> minimum supported PyPy version is. Would you be interested in helping to
>>> investigate so that we can update the documentation or produce a fix to
>>> restore compatibility with earlier PyPy builds?
>>>
>>> On Wed, Nov 4, 2015 at 11:56 PM, Chang Ya-Hsuan <sumti...@gmail.com>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I am trying to run pyspark with pypy, and it is work when using
>>>> spark-1.3.1 but failed when using spark-1.4.1 and spark-1.5.1
>>>>
>>>> my pypy version:
>>>>
>>>> $ /usr/bin/pypy --version
>>>> Python 2.7.3 (2.2.1+dfsg-1ubuntu0.3, Sep 30 2015, 15:18:40)
>>>> [PyPy 2.2.1 with GCC 4.8.4]
>>>>
>>>> works with spark-1.3.1
>>>>
>>>> $ PYSPARK_PYTHON=/usr/bin/pypy
>>>> ~/Tool/spark-1.3.1-bin-hadoop2.6/bin/pyspark
>>>> Python 2.7.3 (2.2.1+dfsg-1ubuntu0.3, Sep 30 2015, 15:18:40)
>>>> [PyPy 2.2.1 with GCC 4.8.4] on linux2
>>>> Type "help", "copyright", "credits" or "license" for more information.
>>>> 15/11/05 15:50:30 WARN Utils: Your hostname, xx resolves to a
>>>> loopback address: 127.0.1.1; using xxx.xxx.xxx.xxx instead (on interface
>>>> eth0)
>>>> 15/11/05 15:50:30 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to
>>>> another address
>>>> 15/11/05 15:50:31 WARN NativeCodeLoader: Unable to load native-hadoop
>>>> library for your platform... using builtin-java classes where applicable
>>>> Welcome to
>>>>     __
>>>>  / __/__  ___ _/ /__
>>>> _\ \/ _ \/ _ `/ __/  '_/
>>>>/__ / .__/\_,_/_/ /_/\_\   version 1.3.1
>>>>   /_/
>>>>
>>>> Using Python version 2.7.3 (2.2.1+dfsg-1ubuntu0.3, Sep 30 2015)
>>>> SparkContext available as sc, HiveContext available as sqlContext.
>>>> And now for something completely different: ``Armin: "Prolog is a
>>>> mess.", CF:
>>>> "No, it's very cool!", Armin: "Isn't this what I said?"''
>>>> >>>
>>>>
>>>> error message for 1.5.1
>>>>
>>>> $ PYSPARK_PYTHON=/usr/bin/pypy
>>>> ~/Tool/spark-1.5.1-bin-hadoop2.6/bin/pyspark
>>>> Python 2.7.3 (2.2.1+dfsg-1ubuntu0.3, Sep 30 2015, 15:18:40)
>>>> [PyPy 2.2.1 with GCC 4.8.4] on linux2
>>>> Type "help", "copyright", "credits" or "license" for more information.
>>>> Traceback (most recent call last):
>>>>   File "app_main.py", line 72, in run_toplevel
>>>>   File "app_main.py", line 614, in run_it
>>>>   File
>>>> "/home/yahsuan/Tool/spark-1.5.1-bin-hadoop2.6/python/pyspark/shell.py",
>>>> line 30, in <module>
>>>> import pyspark
>>>>   File
>>>> "/home/yahsuan/Tool/spark-1.5.1-bin-hadoop2.6/python/pyspark/__init__.py",
>>>> line 41, in <module>
>>>> from pyspark.context import SparkContext
>>>>   File
>>>> "/home/yahsuan/Tool/spark-1.5.1-bin-hadoop2.6/python/pyspark/context.py",
>>>> line 26, in <module>
>>>> from pyspark import accumulators
>>>>   File
>>>> "/home/yahsuan/Tool/spark-1.5.1-bin-hadoop2.6/python/pyspark/accumulators.py",
>>>> line 98, in <module>
>>>> from pyspark.serializers import read_int, PickleSerializer
>>>>   File
>>>> "/home/yahsuan/Tool/spark-1.5.1-bin-hadoop2.6/python/pyspark/serializers.py",
>>>> line 400, in <module>
>>>> _hijack_namedtuple()
>>>>   File
>>>> "/home/yahsuan/Tool/spark-1.5.1-bin-hadoop2.6/python/pyspark/serializers.py",
>>>> line 378, in _hijack_namedtuple
>>>> _old_namedtuple = _copy_func(collections.namedtuple)
>>>>   File
>>>> "/home/yahsuan/Tool/spark-1.5.1-bin-hadoop2.6/python/pyspark/serializers.py",
>>>> line 376, in _copy_func
>>>> f.__defaults__, f.__closure__)
>>>> AttributeError: 'function' object has no attribute '__closure__'
>>>> And now for something completely different: ``the traces don't lie''
>>>>
>>>> is this a known issue? any suggestion to resolve it? or how can I help
>>>> to fix this problem?
>>>>
>>>> Thanks.
>>>>
>>>
>>>
>>
>>
>> --
>> -- 張雅軒
>>
>
>
>
> --
> -- 張雅軒
>


Re: Bringing up JDBC Tests to trunk

2015-10-21 Thread Josh Rosen
Hey Luciano,

This sounds like a reasonable plan to me. One of my colleagues has written
some Dockerized MySQL testing utilities, so I'll take a peek at those to
see if there are any specifics of their solution that we should adapt for
Spark.

On Wed, Oct 21, 2015 at 1:16 PM, Luciano Resende 
wrote:

> I have started looking into PR-8101 [1] and what is required to merge it
> into trunk which will also unblock me around SPARK-10521 [2].
>
> So here is the minimal plan I was thinking about :
>
> - make the docker image version fixed so we make sure we are using the
> same image all the time
> - pull the required images on the Jenkins executors so tests are not
> delayed/timedout because it is waiting for docker images to download
> - create a profile to run the JDBC tests
> - create daily jobs for running the JDBC tests
>
>
> In parallel, I learned that Alan Chin from my team is working with the
> AmpLab team to expand the build capacity for Spark, so I will use some of
> the nodes he is preparing to test/run these builds for now.
>
> Please let me know if there is anything else needed around this.
>
>
> [1] https://github.com/apache/spark/pull/8101
> [2] https://issues.apache.org/jira/browse/SPARK-10521
>
> --
> Luciano Resende
> http://people.apache.org/~lresende
> http://twitter.com/lresende1975
> http://lresende.blogspot.com/
>


Re: Spark Event Listener

2015-10-16 Thread Josh Rosen
The reason for having two separate interfaces is developer API
backwards-compatibility, as far as I know. SparkFirehoseListener came later.

On Tue, Oct 13, 2015 at 4:36 PM, Jakob Odersky  wrote:

> the path of the source file defining the event API is
> `core/src/main/scala/org/apache/spark/scheduler/SparkListener.scala`
>
> On 13 October 2015 at 16:29, Jakob Odersky  wrote:
>
>> Hi,
>> I came across the spark listener API while checking out possible UI
>> extensions recently. I noticed that all events inherit from a sealed trait
>> `SparkListenerEvent` and that a SparkListener has a corresponding
>> `onEventXXX(event)` method for every possible event.
>>
>> Considering that events inherit from a sealed trait and thus all events
>> are known during compile-time, what is the rationale of using specific
>> methods for every event rather than a single method that would let a client
>> pattern match on the type of event?
>>
>> I don't know the internals of the pattern matcher, but again, considering
>> events are sealed, I reckon that matching performance should not be an
>> issue.
>>
>> thanks,
>> --Jakob
>>
>
>


Re: If you use Spark 1.5 and disabled Tungsten mode ...

2015-10-15 Thread Josh Rosen
To clarify, we're asking about the *spark.sql.tungsten.enabled* flag, which
was introduced in Spark 1.5 and enables Project Tungsten optimizations in
Spark SQL. This option is set to *true* by default in Spark 1.5+ and exists
primarily to allow users to disable the new code paths if they encounter
bugs or performance regressions.

If anyone sets spark.sql.tungsten.enabled=*false *in their SparkConf in
order to *disable* these optimizations, we'd like to hear from you in order
to figure out why you disabled them and to see whether we can make
improvements to allow your workload to run with Tungsten enabled.
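
For reference, the opt-out we're asking about is a one-line SparkConf setting;
a minimal sketch, assuming a Spark 1.5.x application:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// spark.sql.tungsten.enabled defaults to true in 1.5+; setting it to false is the opt-out.
val conf = new SparkConf()
  .setAppName("tungsten-opt-out-example")
  .set("spark.sql.tungsten.enabled", "false")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)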

Thanks,
Josh

On Thu, Oct 15, 2015 at 9:33 AM, mkhaitman  wrote:

> Are you referring to spark.shuffle.manager=tungsten-sort? If so, we saw the
> default value as still being as the regular sort, and since it was only
> first introduced in 1.5, were actually waiting a bit to see if anyone
> ENABLED it as opposed to DISABLING it since - it's disabled by default! :)
>
> I recall enabling it during testing within our dev environment, but didn't
> have a comparable workload and environment to our production cluster, so we
> were going to play it safe and wait until 1.6 in case there were any major
> changes / regressions that weren't seen during 1.5 testing!
>
> Mark.
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/If-you-use-Spark-1-5-and-disabled-Tungsten-mode-tp14604p14627.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: Spark Event Listener

2015-10-13 Thread Josh Rosen
Check out SparkFirehoseListener, an adapter which forwards all events to a
single `onEvent` method in order to let you do pattern-matching as you have
described:
https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/SparkFirehoseListener.java
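
A minimal sketch of what that looks like in user code (the listener name and
log messages are just illustrative):

import org.apache.spark.SparkFirehoseListener
import org.apache.spark.scheduler._

class FirehoseExampleListener extends SparkFirehoseListener {
  override def onEvent(event: SparkListenerEvent): Unit = event match {
    case jobStart: SparkListenerJobStart =>
      println(s"Job ${jobStart.jobId} started with ${jobStart.stageIds.size} stage(s)")
    case taskEnd: SparkListenerTaskEnd =>
      println(s"Task finished in stage ${taskEnd.stageId}")
    case _ => // ignore all other event types
  }
}

// Registered like any other listener, e.g. sc.addSparkListener(new FirehoseExampleListener())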

On Tue, Oct 13, 2015 at 4:29 PM, Jakob Odersky  wrote:

> Hi,
> I came across the spark listener API while checking out possible UI
> extensions recently. I noticed that all events inherit from a sealed trait
> `SparkListenerEvent` and that a SparkListener has a corresponding
> `onEventXXX(event)` method for every possible event.
>
> Considering that events inherit from a sealed trait and thus all events
> are known during compile-time, what is the rationale of using specific
> methods for every event rather than a single method that would let a client
> pattern match on the type of event?
>
> I don't know the internals of the pattern matcher, but again, considering
> events are sealed, I reckon that matching performance should not be an
> issue.
>
> thanks,
> --Jakob
>


Re: Pyspark dataframe read

2015-10-06 Thread Josh Rosen
Could someone please file a JIRA to track this?
https://issues.apache.org/jira/browse/SPARK
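
Until that's fixed, one possible workaround in 1.5 is to go through
sc.textFile, which still accepts comma-separated paths, and then use the
RDD[String] overload of read.json (a sketch, assuming sc and sqlContext are
already in scope):

val raw = sc.textFile("file1,file2")   // comma-separated paths still work at this level
val df = sqlContext.read.json(raw)     // RDD[String] overload of json()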

On Tue, Oct 6, 2015 at 1:21 AM, Koert Kuipers  wrote:

> i ran into the same thing in scala api. we depend heavily on comma
> separated paths, and it no longer works.
>
>
> On Tue, Oct 6, 2015 at 3:02 AM, Blaž Šnuderl  wrote:
>
>> Hello everyone.
>>
>> It seems pyspark dataframe read is broken for reading multiple files.
>>
>> sql.read.json( "file1,file2") fails with java.io.IOException: No input
>> paths specified in job.
>>
>> This used to work in spark 1.4 and also still work with sc.textFile
>>
>> Blaž
>>
>
>


Re: Spark 1.5.1 - Scala 2.10 - Hadoop 1 package is missing from S3

2015-10-05 Thread Josh Rosen
I'm working on a fix for this right now. I'm planning to re-run a modified
copy of the release packaging scripts which will emit only the missing
artifacts (so we won't upload new artifacts with different SHAs for the
builds which *did* succeed).

I expect to have this finished in the next day or so; I'm currently blocked
by some infra downtime but expect that to be resolved soon.

- Josh

On Mon, Oct 5, 2015 at 8:46 AM, Nicholas Chammas  wrote:

> Blaž said:
>
> Also missing is
> http://s3.amazonaws.com/spark-related-packages/spark-1.5.1-bin-hadoop1.tgz
> which breaks spark-ec2 script.
>
> This is the package I am referring to in my original email.
>
> Nick said:
>
> It appears that almost every version of Spark up to and including 1.5.0
> has included a —bin-hadoop1.tgz release (e.g. spark-1.5.0-bin-hadoop1.tgz).
> However, 1.5.1 has no such package.
>
> Nick
> ​
>
> On Mon, Oct 5, 2015 at 3:27 AM Blaž Šnuderl  wrote:
>
>> Also missing is http://s3.amazonaws.com/spark-related-packages/spark-
>> 1.5.1-bin-hadoop1.tgz which breaks spark-ec2 script.
>>
>> On Mon, Oct 5, 2015 at 5:20 AM, Ted Yu  wrote:
>>
>>> hadoop1 package for Scala 2.10 wasn't in RC1 either:
>>> http://people.apache.org/~pwendell/spark-releases/spark-1.5.1-rc1-bin/
>>>
>>> On Sun, Oct 4, 2015 at 5:17 PM, Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>
 I’m looking here:

 https://s3.amazonaws.com/spark-related-packages/

 I believe this is where one set of official packages is published.
 Please correct me if this is not the case.

 It appears that almost every version of Spark up to and including 1.5.0
 has included a --bin-hadoop1.tgz release (e.g.
 spark-1.5.0-bin-hadoop1.tgz).

 However, 1.5.1 has no such package. There is a
 spark-1.5.1-bin-hadoop1-scala2.11.tgz package, but this is a separate
 thing. (1.5.0 also has a hadoop1-scala2.11 package.)

 Was this intentional?

 More importantly, is there some rough specification for what packages
 we should be able to expect in this S3 bucket with every release?

 This is important for those of us who depend on this publishing venue
 (e.g. spark-ec2 and related tools).

 Nick
 ​

>>>
>>>
>>


Does anyone use ShuffleDependency directly?

2015-09-18 Thread Josh Rosen
Does anyone use ShuffleDependency

directly in their Spark code or libraries? If so, how do you use it?

Similarly, does anyone use ShuffleHandle

directly?
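
As one purely illustrative example of the kind of "direct" usage I mean:
code that reaches into an RDD's lineage and matches on the dependency type.

import org.apache.spark.ShuffleDependency
import org.apache.spark.rdd.RDD

// Collect the shuffle IDs of the shuffles feeding directly into an RDD.
def directShuffleIds(rdd: RDD[_]): Seq[Int] =
  rdd.dependencies.collect {
    case dep: ShuffleDependency[_, _, _] => dep.shuffleId
  }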


Re: Building with sbt impossible to get artifacts when data has not been loaded

2015-08-26 Thread Josh Rosen
I ran into a similar problem while working on the spark-redshift library
and was able to fix it by bumping that library's ScalaTest version. I'm
still fighting some mysterious Scala issues while trying to test the
spark-csv library against 1.5.0-RC1, so it's possible that a build or
dependency change in Spark might be responsible for this.
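
(For an SBT-built library, that kind of bump is just a change to the
dependency line, roughly like the following; the version shown is illustrative:)

libraryDependencies += "org.scalatest" %% "scalatest" % "2.2.4" % "test"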

On 8/26/15 2:27 PM, Marcelo Vanzin wrote:
 I ran into the same error (different dependency) earlier today. In my
 case, the maven pom files and the sbt dependencies had a conflict
 (different versions of the same artifact) and ivy got confused. Not
 sure whether that will help in your case or not...

 On Wed, Aug 26, 2015 at 2:23 PM, Holden Karau hol...@pigscanfly.ca wrote:
 Has anyone else run into impossible to get artifacts when data has not been
 loaded. IvyNode = org.scala-lang#scala-library;2.10.3 during hive/update
 when building with sbt. Working around it is pretty simple (just add it as a
 dependency), but I'm wondering if its impacting anyone else and I should
 make a PR for it or if its something funky with my local build setup.

 --
 Cell : 425-233-8271
 Twitter: https://twitter.com/holdenkarau
 Linked In: https://www.linkedin.com/in/holdenkarau




-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Automatically deleting pull request comments left by AmplabJenkins

2015-08-14 Thread Josh Rosen
I think that I'm still going to want some custom code to remove the build
start messages from SparkQA and it's hardly any code, so I'm going to stick
with the custom approach for now. The problem is that I don't want _any_
posts from AmplabJenkins, even if they're improved to be more informative,
since our custom SparkQA provides nicer output.

On Fri, Aug 14, 2015 at 1:57 AM, Iulian Dragoș iulian.dra...@typesafe.com
wrote:



 On Fri, Aug 14, 2015 at 4:21 AM, Josh Rosen rosenvi...@gmail.com wrote:

 Prototype is at https://github.com/databricks/spark-pr-dashboard/pull/59

 On Wed, Aug 12, 2015 at 7:51 PM, Josh Rosen rosenvi...@gmail.com wrote:

 *TL;DR*: would anyone object if I wrote a script to auto-delete pull
 request comments from AmplabJenkins?

 Currently there are two bots which post Jenkins test result comments to
 GitHub, AmplabJenkins and SparkQA.

 SparkQA is the account which post the detailed Jenkins start and finish
 messages that contain information on which commit is being tested and which
 tests have failed. This bot is controlled via the dev/run-tests-jenkins
 script.

 AmplabJenkins is controlled by the Jenkins GitHub Pull Request Builder
 plugin. This bot posts relatively uninformative comments (Merge build
 triggered, Merge build started, Merge build failed) that do not
 contain any links or details specific to the tests being run.


 Some of these can be configured. For instance, make sure to disable Use
 comments to report intermediate phases: triggered et al, and if you add a
 publicly accessible URL in Published Jenkins URL, you will get a link to
 the test result in the test result comment. I know these are global
 settings, but the Jenkins URL is unique anyway, and intermediate phases are
 probably equally annoying to everyone.

 You can see the only comment posted for a successful PR build here:
 https://github.com/scala-ide/scala-ide/pull/991#issuecomment-128016214

 I'd avoid more custom code if possible.

 my 2c,
 iulian




 It is technically non-trivial to prevent these AmplabJenkins comments from
 being posted in the first place (see
 https://issues.apache.org/jira/browse/SPARK-4216).

 However, as a short-term hack I'd like to deploy a script which
 automatically deletes these comments as soon as they're posted, with an
 exemption carved out for the Can an admin approve this patch for testing?
 messages. This will help to significantly de-clutter pull request
 discussions in the GitHub UI.

 If nobody objects, I'd like to deploy this script sometime in the next
 few days.

 (From a technical perspective, my script uses the GitHub REST API and
 AmplabJenkins' own OAuth token to delete the comments.  The final
 deployment environment will most likely be the backend of
 http://spark-prs.appspot.com).

 - Josh





 --

 --
 Iulian Dragos

 --
 Reactive Apps on the JVM
 www.typesafe.com




Re: Automatically deleting pull request comments left by AmplabJenkins

2015-08-14 Thread Josh Rosen
The updated prototype listed in
https://github.com/databricks/spark-pr-dashboard/pull/59 is now running
live on spark-prs as part of its PR comment update task.

On Fri, Aug 14, 2015 at 10:51 AM, Josh Rosen rosenvi...@gmail.com wrote:

 I think that I'm still going to want some custom code to remove the build
 start messages from SparkQA and it's hardly any code, so I'm going to stick
 with the custom approach for now. The problem is that I don't want _any_
 posts from AmplabJenkins, even if they're improved to be more informative,
 since our custom SparkQA provides nicer output.

 On Fri, Aug 14, 2015 at 1:57 AM, Iulian Dragoș iulian.dra...@typesafe.com
  wrote:



 On Fri, Aug 14, 2015 at 4:21 AM, Josh Rosen rosenvi...@gmail.com wrote:

 Prototype is at https://github.com/databricks/spark-pr-dashboard/pull/59

 On Wed, Aug 12, 2015 at 7:51 PM, Josh Rosen rosenvi...@gmail.com
 wrote:

 *TL;DR*: would anyone object if I wrote a script to auto-delete pull
 request comments from AmplabJenkins?

 Currently there are two bots which post Jenkins test result comments to
 GitHub, AmplabJenkins and SparkQA.

 SparkQA is the account which post the detailed Jenkins start and finish
 messages that contain information on which commit is being tested and which
 tests have failed. This bot is controlled via the dev/run-tests-jenkins
 script.

 AmplabJenkins is controlled by the Jenkins GitHub Pull Request Builder
 plugin. This bot posts relatively uninformative comments (Merge build
 triggered, Merge build started, Merge build failed) that do not
 contain any links or details specific to the tests being run.


 Some of these can be configured. For instance, make sure to disable Use
 comments to report intermediate phases: triggered et al, and if you add a
 publicly accessible URL in Published Jenkins URL, you will get a link to
 the test result in the test result comment. I know these are global
 settings, but the Jenkins URL is unique anyway, and intermediate phases are
 probably equally annoying to everyone.

 You can see the only comment posted for a successful PR build here:
 https://github.com/scala-ide/scala-ide/pull/991#issuecomment-128016214

 I'd avoid more custom code if possible.

 my 2c,
 iulian




 It is technically non-trivial to prevent these AmplabJenkins comments from
 being posted in the first place (see
 https://issues.apache.org/jira/browse/SPARK-4216).

 However, as a short-term hack I'd like to deploy a script which
 automatically deletes these comments as soon as they're posted, with an
 exemption carved out for the Can an admin approve this patch for testing?
 messages. This will help to significantly de-clutter pull request
 discussions in the GitHub UI.

 If nobody objects, I'd like to deploy this script sometime in the next
 few days.

 (From a technical perspective, my script uses the GitHub REST API and
 AmplabJenkins' own OAuth token to delete the comments.  The final
 deployment environment will most likely be the backend of
 http://spark-prs.appspot.com).

 - Josh





 --

 --
 Iulian Dragos

 --
 Reactive Apps on the JVM
 www.typesafe.com





Re: Automatically deleting pull request comments left by AmplabJenkins

2015-08-13 Thread Josh Rosen
Prototype is at https://github.com/databricks/spark-pr-dashboard/pull/59
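
The core of the deletion logic is just a DELETE against GitHub's
issue-comments endpoint using the bot's own token; the prototype above is
Python, but roughly the same thing fits in a few lines of Scala, shown here
purely for illustration:

import java.net.{HttpURLConnection, URL}

// Deletes a single pull-request (issue) comment; GitHub returns 204 No Content on success.
def deleteComment(owner: String, repo: String, commentId: Long, token: String): Int = {
  val url = new URL(s"https://api.github.com/repos/$owner/$repo/issues/comments/$commentId")
  val conn = url.openConnection().asInstanceOf[HttpURLConnection]
  try {
    conn.setRequestMethod("DELETE")
    conn.setRequestProperty("Authorization", s"token $token")
    conn.getResponseCode
  } finally {
    conn.disconnect()
  }
}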

On Wed, Aug 12, 2015 at 7:51 PM, Josh Rosen rosenvi...@gmail.com wrote:

 *TL;DR*: would anyone object if I wrote a script to auto-delete pull
 request comments from AmplabJenkins?

 Currently there are two bots which post Jenkins test result comments to
 GitHub, AmplabJenkins and SparkQA.

 SparkQA is the account which post the detailed Jenkins start and finish
 messages that contain information on which commit is being tested and which
 tests have failed. This bot is controlled via the dev/run-tests-jenkins
 script.

 AmplabJenkins is controlled by the Jenkins GitHub Pull Request Builder
 plugin. This bot posts relatively uninformative comments (Merge build
 triggered, Merge build started, Merge build failed) that do not
 contain any links or details specific to the tests being run.

 It is technically non-trivial to prevent these AmplabJenkins comments from
 being posted in the first place (see
 https://issues.apache.org/jira/browse/SPARK-4216).

 However, as a short-term hack I'd like to deploy a script which
 automatically deletes these comments as soon as they're posted, with an
 exemption carved out for the Can an admin approve this patch for testing?
 messages. This will help to significantly de-clutter pull request
 discussions in the GitHub UI.

 If nobody objects, I'd like to deploy this script sometime in the next few
 days.

 (From a technical perspective, my script uses the GitHub REST API and
 AmplabJenkins' own OAuth token to delete the comments.  The final
 deployment environment will most likely be the backend of
 http://spark-prs.appspot.com).

 - Josh



Re: Avoiding unnecessary build changes until tests are in better shape

2015-08-05 Thread Josh Rosen
+1.  I've been holding off on reviewing / merging patches like the 
run-tests-jenkins Python refactoring for exactly this reason.


On 8/5/15 11:24 AM, Patrick Wendell wrote:

Hey All,

Was wondering if people would be willing to avoid merging build
changes until we have put the tests in better shape. The reason is
that build changes are the most likely to cause downstream issues with
the test matrix and it's very difficult to reverse engineer which
patches caused which problems when the tests are not in a stable
state. For instance, the updates to Hive 1.2.1 caused cascading
failures that have lasted several days now and in the mean time a few
other build related patches were also merged - as these pile up it
gets harder for us to have confidence those other patches didn't
introduce problems.

https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Test/

- Patrick

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org




-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Master JIRA ticket for tracking Spark 1.5.0 configuration renames, defaults changes, and configuration deprecation

2015-08-02 Thread Josh Rosen
To help us track planned / finished configuration renames, defaults
changes, and configuration deprecation for the upcoming 1.5.0 release, I
have created https://issues.apache.org/jira/browse/SPARK-9550.

As you make configuration changes or think of configurations that need to
be audited, please update that ticket by editing it or posting a comment.

This ticket will help us later when it comes time to draft release notes.

Thanks,
Josh


Re: Should spark-ec2 get its own repo?

2015-08-01 Thread Josh Rosen
I don't think that using git submodules is a good idea here:

   - The extra `git submodule init && git submodule update` step can lead
   to confusing problems in certain workflows.
   - We'd wind up with many commits that serve only to bump the submodule
   SHA; these commits will be hard to review since they won't contain line
   diffs (the author will have to manually provide a link to the diff of code
   changes).


On Sat, Aug 1, 2015 at 10:08 AM, Matt Goodman meawo...@gmail.com wrote:

 I think that is a good idea, and slated to happen.  At the very least a
 README or some such.  Is this a use case for git submodules?  I am
 considering porting some of this to a more general spark-cloud launcher,
 including google/aliyun/rackspace.  It shouldn't be hard at all given the
 current approach for setup/install.

 --Matthew Goodman

 =
 Check Out My Website: http://craneium.net
 Find me on LinkedIn: http://tinyurl.com/d6wlch

 On Fri, Jul 31, 2015 at 6:50 PM, Patrick Wendell pwend...@gmail.com
 wrote:

 Hey All,

 I've mostly kept quiet since I am not very active in maintaining this
 code anymore. However, it is a bit odd that the project is
 split-brained with a lot of the code being on github and some in the
 Spark repo.

 If the consensus is to migrate everything to github, that seems okay
 with me. I would vouch for having user continuity, for instance still
 have a shim ec2/spark-ec2 script that could perhaps just download
 and unpack the real script from github.

 - Patrick

 On Fri, Jul 31, 2015 at 2:13 PM, Shivaram Venkataraman
 shiva...@eecs.berkeley.edu wrote:
  Yes - It is still in progress, but I have just not gotten time to get to
  this. I think getting the repo moved from mesos to amplab in the
 codebase by
  1.5 should be possible.
 
  Thanks
  Shivaram
 
  On Fri, Jul 31, 2015 at 3:08 AM, Sean Owen so...@cloudera.com wrote:
 
  PS is this still in progress? it feels like something that would be
  good to do before 1.5.0, if it's going to happen soon.
 
  On Wed, Jul 22, 2015 at 6:59 AM, Shivaram Venkataraman
  shiva...@eecs.berkeley.edu wrote:
   Yeah I'll send a note to the mesos dev list just to make sure they
 are
   informed.
  
   Shivaram
  
   On Tue, Jul 21, 2015 at 11:47 AM, Sean Owen so...@cloudera.com
 wrote:
  
   I agree it's worth informing Mesos devs and checking that there are
 no
   big objections. I presume Shivaram is plugged in enough to Mesos
 that
   there won't be any surprises there, and that the project would also
   agree with moving this Spark-specific bit out. they may also want to
   leave a pointer to the new location in the mesos repo of course.
  
   I don't think it is something that requires a formal vote. It's not
 a
   question of ownership -- neither Apache nor the project PMC owns the
   code. I don't think it's different from retiring or removing any
 other
   code.
  
  
  
  
  
   On Tue, Jul 21, 2015 at 7:03 PM, Mridul Muralidharan 
 mri...@gmail.com
   wrote:
If I am not wrong, since the code was hosted within mesos project
repo, I assume (atleast part of it) is owned by mesos project and
 so
its PMC ?
   
- Mridul
   
On Tue, Jul 21, 2015 at 9:22 AM, Shivaram Venkataraman
shiva...@eecs.berkeley.edu wrote:
There is technically no PMC for the spark-ec2 project (I guess we
are
kind
of establishing one right now). I haven't heard anything from the
Spark
PMC
on the dev list that might suggest a need for a vote so far. I
 will
send
another round of email notification to the dev list when we have
 a
JIRA
/ PR
that actually moves the scripts (right now the only thing that
changed
is
the location of some scripts in mesos/ to amplab/).
   
Thanks
Shivaram
   
  
  
 
 

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org





Re: Came across Spark SQL hang/Error issue with Spark 1.5 Tungsten feature

2015-07-31 Thread Josh Rosen
It would also be great to test this with codegen and unsafe enabled while
continuing to use the sort shuffle manager instead of the new
tungsten-sort one.

On Fri, Jul 31, 2015 at 1:39 AM, Reynold Xin r...@databricks.com wrote:

 Is this deterministically reproducible? Can you try this on the latest
 master branch?

 Would be great to turn debug logging and and dump the generated code. Also
 would be great to dump the array size at your line 314 in UnsafeRow (and
 whatever master branch's appropriate line is).

 On Fri, Jul 31, 2015 at 1:31 AM, james yiaz...@gmail.com wrote:

 Another error:
 15/07/31 16:15:28 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send
 map output locations for shuffle 3 to bignode1:40443
 15/07/31 16:15:28 INFO spark.MapOutputTrackerMaster: Size of output
 statuses
 for shuffle 3 is 583 bytes
 15/07/31 16:15:28 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send
 map output locations for shuffle 3 to bignode1:40474
 15/07/31 16:15:28 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send
 map output locations for shuffle 3 to bignode2:34052
 15/07/31 16:15:28 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send
 map output locations for shuffle 3 to bignode3:46929
 15/07/31 16:15:28 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send
 map output locations for shuffle 3 to bignode3:50890
 15/07/31 16:15:28 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send
 map output locations for shuffle 3 to bignode2:47795
 15/07/31 16:15:28 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send
 map output locations for shuffle 3 to bignode4:35120
 15/07/31 16:15:28 INFO scheduler.TaskSetManager: Finished task 32.0 in
 stage
 151.0 (TID 1203) in 155 ms on bignode3 (1/50)
 15/07/31 16:15:28 INFO scheduler.TaskSetManager: Finished task 35.0 in
 stage
 151.0 (TID 1204) in 157 ms on bignode2 (2/50)
 15/07/31 16:15:28 INFO scheduler.TaskSetManager: Finished task 8.0 in
 stage
 151.0 (TID 1196) in 168 ms on bignode3 (3/50)
 15/07/31 16:15:28 WARN scheduler.TaskSetManager: Lost task 46.0 in stage
 151.0 (TID 1184, bignode1): java.lang.NegativeArraySizeException
 at

 org.apache.spark.sql.catalyst.expressions.UnsafeRow.getBinary(UnsafeRow.java:314)
 at

 org.apache.spark.sql.catalyst.expressions.UnsafeRow.getUTF8String(UnsafeRow.java:297)
 at SC$SpecificProjection.apply(Unknown Source)
 at

 org.apache.spark.sql.catalyst.expressions.FromUnsafeProjection.apply(Projection.scala:152)
 at

 org.apache.spark.sql.catalyst.expressions.FromUnsafeProjection.apply(Projection.scala:140)
 at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 at scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
 at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 at

 org.apache.spark.shuffle.unsafe.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:148)
 at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:71)
 at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 at org.apache.spark.scheduler.Task.run(Task.scala:86)
 at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
 at

 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at

 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)





 --
 View this message in context:
 http://apache-spark-developers-list.1001551.n3.nabble.com/Came-across-Spark-SQL-hang-Error-issue-with-Spark-1-5-Tungsten-feature-tp13537p13538.html
 Sent from the Apache Spark Developers List mailing list archive at
 Nabble.com.

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org





Re: Worker memory leaks?

2015-07-27 Thread Josh Rosen
I have promoted https://issues.apache.org/jira/browse/SPARK-9202 to a
blocker to ensure that we get a fix for it before 1.5.0. I'm pretty swamped
with other tasks for the next few days, but I'd be happy to shepherd a
bugfix patch for this (this should be pretty straightforward and the JIRA
ticket contains a sketch of how I'd do it).

On Mon, Jul 20, 2015 at 12:37 PM, Richard Marscher rmarsc...@localytics.com
 wrote:

 Hi,

 thanks for the follow up. You are right regarding the invalidation of
 observation #2. I later realized the Worker UI page directly displays the
 entries in the executors map and can see in our production UI it's in a
 proper state.

 As for the Killed vs Exited, it's less relevant now since the theory about
 the executors map is invalid. However to answer your question, the current
 setup is that the SparkContext lifecycle encapsulates exactly one
 application. That is we create a single context per application submitted
 and close it upon success/failure completion of the application.

 Thanks,

 On Mon, Jul 20, 2015 at 3:20 PM, Josh Rosen joshro...@databricks.com
 wrote:

 Hi Richard,

 Thanks for your detailed investigation of this issue.  I agree with your
 observation that the finishedExecutors hashmap is a source of memory leaks
 for very-long-lived clusters.  It looks like the finishedExecutors map is
 only read when rendering the Worker Web UI and in constructing REST API
 responses.  I think that we could address this leak by adding a
 configuration to cap the maximum number of retained executors,
 applications, etc.  We already have similar caps in the driver UI.  If we
 add this configuration, I think that we should pick some sensible default
 value rather than an unlimited one.  This is technically a user-facing
 behavior change but I think it's okay since the current behavior is to
 crash / OOM.

 Regarding `KillExecutor`, I think that there might be some asynchrony and
 indirection masking the cleanup here.  Based on a quick glance through the
 code, it looks like ExecutorRunner's thread will end an
 ExecutorStateChanged RPC back to the Worker after the executor is killed,
 so I think that the cleanup will be triggered by that RPC.  Since this
 isn't clear from reading the code, though, it would be great to add some
 comments to the code to explain this, plus a unit test to make sure that
 this indirect cleanup mechanism isn't broken in the future.

 I'm not sure what's causing the Killed vs Exited issue, but I have one
 theory: does the behavior vary based on whether your application cleanly
 shuts down the SparkContext via SparkContext.stop()? It's possible that
 omitting the stop() could lead to a Killed exit status, but I don't know
 for sure.  (This could probably also be clarified with a unit test).

 To my knowledge, the spark-perf suite does not contain the sort of
 scale-testing workload that would expose these types of memory leaks; we
 have some tests for very long-lived individual applications, but not tests
 for long-lived clusters that run thousands of applications between
 restarts.  I'm going to create some tickets to add such tests.

 I've filed https://issues.apache.org/jira/browse/SPARK-9202 to follow up
 on the finishedExecutors leak.

 - Josh

 On Mon, Jul 20, 2015 at 9:56 AM, Richard Marscher 
 rmarsc...@localytics.com wrote:

 Hi,

 we have been experiencing issues in production over the past couple
 weeks with Spark Standalone Worker JVMs seeming to have memory leaks. They
 accumulate Old Gen until it reaches max and then reach a failed state that
 starts critically failing some applications running against the cluster.

 I've done some exploration of the Spark code base related to Worker in
 search of potential sources of this problem and am looking for some
 commentary on a couple theories I have:

 Observation 1: The `finishedExecutors` HashMap seem to only accumulate
 new entries over time unbounded. It only seems to be appended and never
 periodically purged or cleaned of older executors in line with something
 like the worker cleanup scheduler.
 https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala#L473

 I feel somewhat confident that over time this will exhibit a leak. I
 quote it just because it may be intentional to hold these references to
 support functionality versus a true leak where you just accidentally hold
 onto memory.

 Observation 2: I feel much less certain about this, but it seemed like
 if the Worker is messaged with `KillExecutor` then it only kills the `
 ExecutorRunner` but does not clean it up from the executor map.
 https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala#L492

 I haven't been able to sort out whether I'm missing something indirect
 where it before/after cleans that executor from the map. However, if it
 does not, then it may be leaking references on this map.

 One final observation related to our

Re: non-deprecation compiler warnings are upgraded to build errors now

2015-07-26 Thread Josh Rosen
Given that 2.11 may be more stringent with respect to warnings, we might
consider building with 2.11 instead of 2.10 in the pull request builder.
This would also have some secondary benefits in terms of letting us use
tools like Scapegoat or SCoverage highlighting.
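
For reference, the 2.11-only warning discussed in the quoted thread below
comes from annotating a plain constructor parameter; a minimal sketch of the
pattern and two possible fixes (class names here are made up):

import scala.annotation.meta.param

// Warns under 2.11: the annotation has no valid target on a plain parameter.
class Before(@transient conf: String)

// Option 1: make the parameter a (possibly private) val so there is a field to mark.
class AfterVal(@transient private val conf: String)

// Option 2: follow the compiler's own suggestion and target the parameter explicitly.
class AfterMeta(@(transient @param) conf: String)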

On Sat, Jul 25, 2015 at 8:52 AM, Iulian Dragoș iulian.dra...@typesafe.com
wrote:

 On Fri, Jul 24, 2015 at 8:19 PM, Reynold Xin r...@databricks.com wrote:

 Jenkins only run Scala 2.10. I'm actually not sure what the behavior is
 with 2.11 for that patch.

 iulian - can you take a look into it and see if it is working as expected?

 It is, in the sense that warnings fail the build. Unfortunately there are
 warnings in 2.11 that were not there in 2.10, and that fail the build. For
 instance:

 [error] /Users/dragos/workspace/git/spark/core/src/main/scala/org/apache/spark/rdd/BinaryFileRDD.scala:31: no valid targets for annotation on value conf - it is discarded unused. You may specify targets with meta-annotations, e.g. @(transient @param)
 [error] @transient conf: Configuration,
 [error]

 Currently the 2.11 build is broken. I don’t think fixing these is too
 hard, but it requires these parameters to become vals. I haven’t looked
 at all warnings, but I think this is the most common one (if not the only
 one).
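
 For what it's worth, the fix the warning itself suggests is small; a sketch of the
 two obvious options (illustrative only, not the actual Spark patch):

```
import scala.annotation.meta.param
import org.apache.hadoop.conf.Configuration

// (a) promote the parameter to a (transient) val so the annotation has a field to target:
class FixedWithVal(@transient private val conf: Configuration)

// (b) or keep it a plain constructor parameter and silence the warning with the
//     meta-annotation the compiler message itself suggests:
class FixedWithMeta(@(transient @param) conf: Configuration)
```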

 iulian



 On Fri, Jul 24, 2015 at 10:24 AM, Iulian Dragoș 
 iulian.dra...@typesafe.com wrote:

 On Thu, Jul 23, 2015 at 6:08 AM, Reynold Xin r...@databricks.com
 wrote:

 Hi all,

 FYI, we just merged a patch that fails a build if there is a scala
 compiler warning (if it is not deprecation warning).

 I’m a bit confused, since I see quite a lot of warnings in
 semi-legitimate code.

 For instance, @transient (plenty of instances like this in
 spark-streaming) might generate warnings like:

 abstract class ReceiverInputDStream[T: ClassTag](@transient ssc_ : StreamingContext)
   extends InputDStream[T](ssc_) {

 // and the warning is:
 no valid targets for annotation on value ssc_ - it is discarded unused. You
 may specify targets with meta-annotations, e.g. @(transient @param)

 At least that’s what happens if I build with Scala 2.11, not sure if
 this setting is only for 2.10, or something really weird is happening on my
 machine that doesn’t happen on others.

 iulian


 In the past, many compiler warnings were actually caused by legitimate
 bugs that we needed to address. However, if we don't fail the build on
 warnings, people don't pay attention to the warnings at all (it is also
 tough to pay attention since there are a lot of deprecation warnings due to
 unit tests exercising deprecated APIs and our reliance on deprecated Hadoop
 APIs).

 Note that ideally we should be able to mark deprecation warnings as
 errors as well. However, due to the lack of ability to suppress individual
 warning messages in the Scala compiler, we cannot do that (since we do need
 to access deprecated APIs in Hadoop).


 --

 --
 Iulian Dragos

 --
 Reactive Apps on the JVM
 www.typesafe.com


 --

 --
 Iulian Dragos

 --
 Reactive Apps on the JVM
 www.typesafe.com




Re: Worker memory leaks?

2015-07-20 Thread Josh Rosen
Hi Richard,

Thanks for your detailed investigation of this issue.  I agree with your
observation that the finishedExecutors hashmap is a source of memory leaks
for very-long-lived clusters.  It looks like the finishedExecutors map is
only read when rendering the Worker Web UI and in constructing REST API
responses.  I think that we could address this leak by adding a
configuration to cap the maximum number of retained executors,
applications, etc.  We already have similar caps in the driver UI.  If we
add this configuration, I think that we should pick some sensible default
value rather than an unlimited one.  This is technically a user-facing
behavior change but I think it's okay since the current behavior is to
crash / OOM.
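
A rough sketch of the kind of bounded retention I have in mind (the config name,
its default, and the switch to a LinkedHashMap are assumptions for illustration,
not existing Spark code):

```
import scala.collection.mutable

// Assumed/proposed setting, not a real Spark config at the time of writing.
val retainedExecutors = 1000  // e.g. from a spark.worker.ui.retainedExecutors setting

// Assumes finishedExecutors becomes a LinkedHashMap so insertion order
// (and hence "oldest first") is preserved when trimming.
val finishedExecutors = new mutable.LinkedHashMap[String, AnyRef]()

def trimFinishedExecutorsIfNecessary(): Unit = {
  if (finishedExecutors.size > retainedExecutors) {
    val toRemove = finishedExecutors.size - retainedExecutors
    // Materialize the keys first so we don't mutate the map while iterating it.
    finishedExecutors.keys.take(toRemove).toList.foreach(finishedExecutors.remove)
  }
}
```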

Regarding `KillExecutor`, I think that there might be some asynchrony and
indirection masking the cleanup here.  Based on a quick glance through the
code, it looks like ExecutorRunner's thread will send an
ExecutorStateChanged RPC back to the Worker after the executor is killed,
so I think that the cleanup will be triggered by that RPC.  Since this
isn't clear from reading the code, though, it would be great to add some
comments to the code to explain this, plus a unit test to make sure that
this indirect cleanup mechanism isn't broken in the future.

I'm not sure what's causing the Killed vs Exited issue, but I have one
theory: does the behavior vary based on whether your application cleanly
shuts down the SparkContext via SparkContext.stop()? It's possible that
omitting the stop() could lead to a Killed exit status, but I don't know
for sure.  (This could probably also be clarified with a unit test).

To my knowledge, the spark-perf suite does not contain the sort of
scale-testing workload that would expose these types of memory leaks; we
have some tests for very long-lived individual applications, but not tests
for long-lived clusters that run thousands of applications between
restarts.  I'm going to create some tickets to add such tests.

I've filed https://issues.apache.org/jira/browse/SPARK-9202 to follow up on
the finishedExecutors leak.

- Josh

On Mon, Jul 20, 2015 at 9:56 AM, Richard Marscher rmarsc...@localytics.com
wrote:

 Hi,

 we have been experiencing issues in production over the past couple weeks
 with Spark Standalone Worker JVMs seeming to have memory leaks. They
 accumulate Old Gen until it reaches max and then reach a failed state that
 starts critically failing some applications running against the cluster.

 I've done some exploration of the Spark code base related to Worker in
 search of potential sources of this problem and am looking for some
 commentary on a couple theories I have:

 Observation 1: The `finishedExecutors` HashMap seems to only accumulate
 new entries over time, unbounded. It only seems to be appended to and never
 periodically purged or cleaned of older executors in line with something
 like the worker cleanup scheduler.
 https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala#L473

 I feel somewhat confident that over time this will exhibit a leak. I
 quote it just because it may be intentional to hold these references to
 support functionality versus a true leak where you just accidentally hold
 onto memory.

 Observation 2: I feel much less certain about this, but it seemed like if
 the Worker is messaged with `KillExecutor` then it only kills the `
 ExecutorRunner` but does not clean it up from the executor map.
 https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala#L492

 I haven't been able to sort out whether I'm missing something indirect
 that cleans that executor from the map before or after. However, if it
 does not, then it may be leaking references in this map.

 One final observation related to our production metrics and not the
 codebase itself. We used to periodically see that our completed
 applications had the status of Killed instead of Exited for all the
 executors. However, now we see every completed application has a final
 state of Killed for all the executors. I might speculatively correlate
 this to Observation 2 as a potential reason we have started seeing this
 issue more recently.

 We also have a larger and increasing workload over the past few weeks and
 possibly code changes to the application description that could be
 exacerbating these potential underlying issues. We run a lot of smaller
 applications per day, something in the range of hundreds to maybe 1000
 applications per day with 16 executors per application.

 Thanks
 --
 *Richard Marscher*
 Software Engineer
 Localytics
 Localytics.com http://localytics.com/ | Our Blog
 http://localytics.com/blog | Twitter http://twitter.com/localytics |
 Facebook http://facebook.com/localytics | LinkedIn
 http://www.linkedin.com/company/1148792?trk=tyah



Re: KinesisStreamSuite failing in master branch

2015-07-19 Thread Josh Rosen
Yep, I emailed TD about it; I think that we may need to make a change to
the pull request builder to fix this.  Pending that, we could just revert
the commit that added this.

On Sun, Jul 19, 2015 at 5:32 PM, Ted Yu yuzhih...@gmail.com wrote:

 Hi,
 I noticed that KinesisStreamSuite fails for both hadoop profiles in master
 Jenkins builds.

 From
 https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/HADOOP_PROFILE=hadoop-2.4,label=centos/3011/console
 :

  KinesisStreamSuite:
  *** RUN ABORTED ***
    java.lang.AssertionError: assertion failed: Kinesis test not enabled, should not attempt to get AWS credentials
    at scala.Predef$.assert(Predef.scala:179)
    at org.apache.spark.streaming.kinesis.KinesisTestUtils$.getAWSCredentials(KinesisTestUtils.scala:189)
    at org.apache.spark.streaming.kinesis.KinesisTestUtils.org$apache$spark$streaming$kinesis$KinesisTestUtils$$kinesisClient$lzycompute(KinesisTestUtils.scala:59)
    at org.apache.spark.streaming.kinesis.KinesisTestUtils.org$apache$spark$streaming$kinesis$KinesisTestUtils$$kinesisClient(KinesisTestUtils.scala:58)
    at org.apache.spark.streaming.kinesis.KinesisTestUtils.describeStream(KinesisTestUtils.scala:121)
    at org.apache.spark.streaming.kinesis.KinesisTestUtils.findNonExistentStreamName(KinesisTestUtils.scala:157)
    at org.apache.spark.streaming.kinesis.KinesisTestUtils.createStream(KinesisTestUtils.scala:78)
    at org.apache.spark.streaming.kinesis.KinesisStreamSuite.beforeAll(KinesisStreamSuite.scala:45)
    at org.scalatest.BeforeAndAfterAll$class.beforeAll(BeforeAndAfterAll.scala:187)
    at org.apache.spark.streaming.kinesis.KinesisStreamSuite.beforeAll(KinesisStreamSuite.scala:33)


 FYI




Re: KryoSerializer gives class cast exception

2015-07-17 Thread Josh Rosen
We've run into other problems caused by our old Kryo versions. I agree that
the Chill dependency is one of the main blockers to upgrading Kryo, but I
don't think that it's insurmountable: if necessary, we could just publish
our own forked version of Chill under our own namespace, similar to what we
used to do with Pyrolite.

A bigger concern, perhaps, is dependency conflicts with user-specified Kryo
versions.

See https://github.com/apache/spark/pull/6361 and
https://issues.apache.org/jira/browse/SPARK-7708 for some more previous
discussions RE: Kryo upgrade.

Anyhow, I'm not sure what the right solution is yet, but just wanted to
link to some previous context / discussions.

- Josh

On Thu, Jul 16, 2015 at 7:57 AM, Eugene Morozov fathers...@list.ru wrote:

 Hi, some time ago we found that it's better to use the Kryo serializer instead
 of the Java one.
 So, we turned it on and use it everywhere.

 I have pretty complex objects, which I can't change. Previously my algo
 was building such objects and then storing them into external storage.
 It was not required to reshuffle partitions. Now, it seems I have to
 reshuffle them, but I'm stuck with a ClassCastException. I investigated it a
 little and it seems to me that KryoSerializer does not clear its state at
 some point, so it tries to use StringSerializer for my non-String object.
 My objects are pretty complex, so it'd be pretty hard to make them
 serializable.

 Caused by: java.lang.ClassCastException:
 com.company.metadata.model.cleanse.CleanseInfoSequence cannot be cast to
 java.lang.String
 at
 com.esotericsoftware.kryo.serializers.DefaultSerializers$StringSerializer.write(DefaultSerializers.java:146)
 at com.esotericsoftware.kryo.Kryo.writeObjectOrNull(Kryo.java:549)
 at
 com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:68)
 at
 com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:18)
 at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
 at
 com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564)
 ... 71 more

 I've found this state issue in the Kryo JIRA, and it's been fixed after
 2.21 (the current Kryo version in Spark). But Spark cannot update, because of
 Chill, and Chill cannot be updated because of some dependencies on their
 side. So, Spark is sort of stuck with Kryo version 2.21.

 My own thoughts on how I could work around this:
 1. Rewrite the algo so that my objects don't need to be reshuffled. But at some
 point it'd be required.
 2. Make my objects implement Serializable and be stuck with Java
 serialization forever.
 3. My object inside of Kryo looks like an ArrayList containing my object, so I'm
 not sure it's possible to register my class with a custom serializer in Kryo
 (a sketch of what that could look like follows below).
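
 To make option 3 concrete, here is a rough sketch of registering a hand-written
 Kryo serializer through Spark's registrator hook. The CleanseInfo case class is a
 hypothetical stand-in for the real CleanseInfoSequence, and the write/read logic
 would have to match the real class:

```
import com.esotericsoftware.kryo.{Kryo, Serializer}
import com.esotericsoftware.kryo.io.{Input, Output}
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

// Hypothetical stand-in for com.company.metadata.model.cleanse.CleanseInfoSequence.
case class CleanseInfo(id: Long, fields: Seq[String])

// Hand-written serializer so Kryo never falls back to its default field serializers
// for this type.
class CleanseInfoSerializer extends Serializer[CleanseInfo] {
  override def write(kryo: Kryo, out: Output, obj: CleanseInfo): Unit = {
    out.writeLong(obj.id)
    out.writeInt(obj.fields.size)
    obj.fields.foreach(f => out.writeString(f))
  }
  override def read(kryo: Kryo, in: Input, clazz: Class[CleanseInfo]): CleanseInfo = {
    val id = in.readLong()
    val n = in.readInt()
    CleanseInfo(id, Seq.fill(n)(in.readString()))
  }
}

class MyKryoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[CleanseInfo], new CleanseInfoSerializer)
  }
}

// Wire it up in the application's SparkConf:
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", classOf[MyKryoRegistrator].getName)
```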

 Any advice would be highly appreciated.
 Thanks.
 --
 Eugene Morozov
 fathers...@list.ru







Re: why doesn't jenkins like me?

2015-07-17 Thread Josh Rosen
The "It is not a test" failed-test message means that something went wrong
in a suite-wide setup or teardown method.  This could be some sort of race
or flakiness.  If this problem persists, we should file a JIRA and label it
with "flaky-test" so that we can find it later.

On Thu, Jul 16, 2015 at 5:44 AM, Steve Loughran ste...@hortonworks.com
wrote:


  One of my pull requests is failing in a test that I have gone nowhere
 near


 https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/37491/testReport/junit/org.apache.spark/DistributedSuite/_It_is_not_a_test_/

  This isn't the only pull request that's failing, and I've merged in the
 master branch to make sure its not something fixed in the source.

  is there some intermittent race condition or similar?

  -steve



Re: problems with build of latest the master

2015-07-15 Thread Josh Rosen
We may be able to fix this from the Spark side by adding appropriate
exclusions in our Hadoop dependencies, right?  If possible, I think that we
should do this.
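
In SBT terms, a sketch of what such an exclusion could look like (illustrative only;
the Maven pom.xml equivalent is an <exclusions> block on the same artifact, and the
version here is just a placeholder):

```
// build.sbt sketch: exclude the compile-scoped mockito-all that hadoop-openstack
// (per this thread) drags in. Coordinates are from the thread; the version is a placeholder.
val hadoopVersion = "2.7.0"

libraryDependencies += ("org.apache.hadoop" % "hadoop-openstack" % hadoopVersion)
  .exclude("org.mockito", "mockito-all")
```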

On Wed, Jul 15, 2015 at 7:10 AM, Ted Yu yuzhih...@gmail.com wrote:

 I attached a patch for HADOOP-12235

 BTW openstack was not mentioned in the first email from Gil.
 My email and Gil's second email were sent around the same moment.

 Cheers

 On Wed, Jul 15, 2015 at 2:06 AM, Steve Loughran ste...@hortonworks.com
 wrote:


  On 14 Jul 2015, at 12:22, Ted Yu yuzhih...@gmail.com wrote:

  Looking at Jenkins, master branch compiles.

  Can you try the following command ?

 mvn -Phive -Phadoop-2.6 -DskipTests clean package

  What version of Java are you using ?


  Ted, Giles has stuck in hadoop-openstack, it's that which is creating
 the problem

  Giles, I don't know why hadoop-openstack has a mockito dependency as
  it should be test time only

  Looking at the POM, its dependency tag in hadoop-2.7 is scoped to compile:

   <dependency>
     <groupId>org.mockito</groupId>
     <artifactId>mockito-all</artifactId>
     <scope>compile</scope>
   </dependency>

  it should be provided, shouldn't it?

  Created https://issues.apache.org/jira/browse/HADOOP-12235 : if someone
 supplies a patch I'll get it in.

  -steve





Re: Joining Apache Spark

2015-07-13 Thread Josh Rosen
Also, check out
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

On Mon, Jul 13, 2015 at 4:08 PM, Marcelo Vanzin van...@cloudera.com wrote:

 Hello, welcome, and please start by going through the web site (
 http://spark.apache.org/), especially the Contributors section at the
 bottom.

 On Mon, Jul 13, 2015 at 3:58 PM, Animesh Tripathy a.tripathy...@gmail.com
  wrote:

 I would like to join the Apache Spark Development Team in order to
 contribute code for further improvement of Apache Spark. I was referred
 here from EECS professor Anthony after the completion of Big Data with
 Apache Spark.

 Sincerely,
 Animesh Tripathy




 --
 Marcelo



Re: Spark master broken?

2015-07-12 Thread Josh Rosen
I think it is just broken for 2.11 since pull requests are building properly.

Sent from my phone

 On Jul 12, 2015, at 8:22 AM, René Treffer rtref...@gmail.com wrote:
 
 Java 8, make-distribution
 
 Jenkins does show the same error, though: 
 https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-Snapshots/325/console
 
 On Sun, Jul 12, 2015 at 4:32 PM, Ted Yu yuzhih...@gmail.com wrote:
 Jenkins shows green builds.
 
 What Java version did you use ?
 
 Cheers
 
 On Sun, Jul 12, 2015 at 3:49 AM, René Treffer rtref...@gmail.com wrote:
 Hi *,
 
 I'm currently trying to build master but it fails with
 
  [error] Picked up JAVA_TOOL_OPTIONS: -javaagent:/usr/share/java/jayatanaag.jar
  [error] /home/rtreffer/work/spark-master/sql/catalyst/src/main/java/org/apache/spark/sql/execution/UnsafeExternalRowSorter.java:135:
  error: anonymous org.apache.spark.sql.execution.UnsafeExternalRowSorter$1 is not abstract
  and does not override abstract method <B>minBy(Function1<InternalRow,B>,Ordering<B>) in TraversableOnce
  [error]   return new AbstractScalaRowIterator() {
  [error] ^
  [error]   where B,A are type-variables:
  [error]     B extends Object declared in method <B>minBy(Function1<A,B>,Ordering<B>)
  [error]     A extends Object declared in interface TraversableOnce
  [error] 1 error
  [error] Compile failed at Jul 12, 2015 12:17:56 PM [20.565s]
  [INFO] 
  [INFO] Reactor Summary:
  [INFO] 
  [INFO] Spark Project Parent POM .. SUCCESS [6.094s]
  [INFO] Spark Project Core  SUCCESS [2:52.035s]
  [INFO] Spark Project Bagel ... SUCCESS [22.506s]
  [INFO] Spark Project GraphX .. SUCCESS [19.076s]
  [INFO] Spark Project ML Library .. SUCCESS [1:15.520s]
  [INFO] Spark Project Tools ... SUCCESS [2.041s]
  [INFO] Spark Project Networking .. SUCCESS [8.741s]
  [INFO] Spark Project Shuffle Streaming Service ... SUCCESS [7.298s]
  [INFO] Spark Project Streaming ... SUCCESS [29.154s]
  [INFO] Spark Project Catalyst  FAILURE [21.048s]
 
  I've tried to build for 2.11 and 2.10 without success. Is there a known 
 issue on master?
 
 Regards,
   Rene Treffer
 


Re: The latest master branch didn't compile with -Phive?

2015-07-09 Thread Josh Rosen
Jenkins runs compile-only builds for Maven as an early warning system for
this type of issue; you can see from
https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Compile/ that the
Maven compilation is now broken in master.

On Thu, Jul 9, 2015 at 8:48 AM, Ted Yu yuzhih...@gmail.com wrote:

 I guess the compilation issue didn't surface in QA run because sbt was
 used:

 [info] Building Spark (w/Hive 0.13.1) using SBT with these arguments:  -Pyarn 
 -Phadoop-2.3 -Dhadoop.version=2.3.0 -Pkinesis-asl -Phive-thriftserver -Phive 
 package assembly/assembly streaming-kafka-assembly/assembly 
 streaming-flume-assembly/assembly


 Cheers


 On Thu, Jul 9, 2015 at 7:58 AM, Ted Yu yuzhih...@gmail.com wrote:

 From
 https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/HADOOP_PROFILE=hadoop-2.4,label=centos/2875/consoleFull
 :

 + build/mvn -DzincPort=3439 -DskipTests -Phadoop-2.4 -Pyarn -Phive 
 -Phive-thriftserver -Pkinesis-asl clean package


 FYI


 On Thu, Jul 9, 2015 at 7:51 AM, Sean Owen so...@cloudera.com wrote:

 This is an error from scalac and not Spark. I find it happens
 frequently for me but goes away on a clean build. *shrug*


 On Thu, Jul 9, 2015 at 3:45 PM, Ted Yu yuzhih...@gmail.com wrote:
  Looking at
 
 https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/HADOOP_PROFILE=hadoop-2.4,label=centos/2875/consoleFull
  :
 
  [error]
  [error]  while compiling:
 
 /home/jenkins/workspace/Spark-Master-Maven-with-YARN/HADOOP_PROFILE/hadoop-2.4/label/centos/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala
  [error] during phase: typer
  [error]  library version: version 2.10.4
  [error] compiler version: version 2.10.4
 
 
  I traced back to build #2869 and the error was there - didn't go back
  further.
 
 
  FYI
 
 
  On Thu, Jul 9, 2015 at 7:24 AM, Yijie Shen henry.yijies...@gmail.com
  wrote:
 
  Hi,
 
  I use the clean version just clone from the master branch, build with:
 
  build/mvn -Phive -Phadoop-2.4 -DskipTests package
 
  And BUILD FAILURE at last, due to:
 
  [error]  while compiling:
 
 /Users/yijie/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala
  [error] during phase: typer
  [error]  library version: version 2.10.4
  [error] compiler version: version 2.10.4
  ...
  [error]
  [error]   last tree to typer: Ident(Warehouse)
  [error]   symbol: none (flags: )
  [error]symbol definition: none
  [error]symbol owners:
  [error]   context owners: lazy value hiveWarehouse - class
  HiveMetastoreCatalog - package hive
  [error]
  [error] == Enclosing template or block ==
  [error]
  [error] Template( // val local HiveMetastoreCatalog: notype in
 class
  HiveMetastoreCatalog
  [error]   Catalog, Logging // parents
  [error]   ValDef(
  [error] private
  [error] _
  [error] tpt
  [error] empty
  [error]   )
  [error]   // 24 statements
  [error]   ValDef( // private[this] val client:
  org.apache.spark.sql.hive.client.ClientInterface in class
  HiveMetastoreCatalog
  [error] private local paramaccessor
  [error] client
  [error] ClientInterface
  [error] empty
  …
 
 
 https://gist.github.com/yijieshen/e0925e2227a312ae4c64#file-build_failure
 
  Did I make a silly mistake?
 
  Thanks, Yijie
 
 






Re: [VOTE] Release Apache Spark 1.4.1 (RC3)

2015-07-08 Thread Josh Rosen
I've filed https://issues.apache.org/jira/browse/SPARK-8903 to fix the
DataFrameStatSuite test failure. The problem turned out to be caused by a
mistake made while resolving a merge-conflict when backporting that patch
to branch-1.4.

I've submitted https://github.com/apache/spark/pull/7295 to fix this issue.

On Wed, Jul 8, 2015 at 11:30 AM, Sean Owen so...@cloudera.com wrote:

 I see, but shouldn't this test not be run when Hive isn't in the build?

 On Wed, Jul 8, 2015 at 7:13 PM, Andrew Or and...@databricks.com wrote:
  @Sean You actually need to run HiveSparkSubmitSuite with `-Phive` and
  `-Phive-thriftserver`. The MissingRequirementsError is just complaining
 that
  it can't find the right classes. The other one (DataFrameStatSuite) is a
  little more concerning.
 

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Re: Spark 1.5.0-SNAPSHOT broken with Scala 2.11

2015-06-28 Thread Josh Rosen
The 2.11 compile build is going to be green because this is an issue with
tests, not compilation.

On Sun, Jun 28, 2015 at 6:30 PM, Ted Yu yuzhih...@gmail.com wrote:

 Spark-Master-Scala211-Compile build is green.

 However it is not clear what the actual command is:

 [EnvInject] - Variables injected successfully.
 [Spark-Master-Scala211-Compile] $ /bin/bash /tmp/hudson8945334776362889961.sh


 FYI


 On Sun, Jun 28, 2015 at 6:02 PM, Alessandro Baretta alexbare...@gmail.com
  wrote:

 I am building the current master branch with Scala 2.11 following these
 instructions:

 Building for Scala 2.11

 To produce a Spark package compiled with Scala 2.11, use the -Dscala-2.11
  property:

 dev/change-version-to-2.11.sh
 mvn -Pyarn -Phadoop-2.4 -Dscala-2.11 -DskipTests clean package


 Here's what I'm seeing:

 log4j:WARN No appenders could be found for logger
 (org.apache.hadoop.security.Groups).
 log4j:WARN Please initialize the log4j system properly.
 log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for
 more info.
 Using Spark's repl log4j profile:
 org/apache/spark/log4j-defaults-repl.properties
 To adjust logging level use sc.setLogLevel(INFO)
 Welcome to
     __
  / __/__  ___ _/ /__
 _\ \/ _ \/ _ `/ __/  '_/
/___/ .__/\_,_/_/ /_/\_\   version 1.5.0-SNAPSHOT
   /_/

 Using Scala version 2.10.4 (OpenJDK 64-Bit Server VM, Java 1.7.0_79)
 Type in expressions to have them evaluated.
 Type :help for more information.
 15/06/29 00:42:20 ERROR ActorSystemImpl: Uncaught fatal error from thread
 [sparkDriver-akka.remote.default-remote-dispatcher-6] shutting down
 ActorSystem [sparkDriver]
 java.lang.VerifyError: class akka.remote.WireFormats$AkkaControlMessage
 overrides final method
 getUnknownFields.()Lcom/google/protobuf/UnknownFieldSet;
 at java.lang.ClassLoader.defineClass1(Native Method)
 at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
 at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
 at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
 at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
 at
 akka.remote.transport.AkkaPduProtobufCodec$.constructControlMessagePdu(AkkaPduCodec.scala:231)
 at
 akka.remote.transport.AkkaPduProtobufCodec$.init(AkkaPduCodec.scala:153)
 at
 akka.remote.transport.AkkaPduProtobufCodec$.clinit(AkkaPduCodec.scala)
 at akka.remote.EndpointManager$$anonfun$9.apply(Remoting.scala:733)
 at akka.remote.EndpointManager$$anonfun$9.apply(Remoting.scala:703)

 What am I doing wrong?





Re: [SQL] codegen on wide dataset throws StackOverflow

2015-06-26 Thread Josh Rosen
Which Spark version are you using?  Can you file a JIRA for this issue?

On Thu, Jun 25, 2015 at 6:35 AM, Peter Rudenko petro.rude...@gmail.com
wrote:

  Hi, i have a small but very wide dataset (2000 columns). Trying to
 optimize Dataframe pipeline for it, since it behaves very poorly comparing
 to rdd operation.
 With spark.sql.codegen=true it throws StackOverflow:

 15/06/25 16:27:16 INFO CacheManager: Partition rdd_12_3 not found, computing 
 it
 15/06/25 16:27:16 INFO HadoopRDD: Input split: 
 file:/home/peter/validation.csv:0+337768
 15/06/25 16:27:16 INFO CacheManager: Partition rdd_12_1 not found, computing 
 it
 15/06/25 16:27:16 INFO HadoopRDD: Input split: 
 file:/home/peter/work/train.csv:0+15540706
 15/06/25 16:27:16 INFO CacheManager: Partition rdd_12_0 not found, computing 
 it
 15/06/25 16:27:16 INFO HadoopRDD: Input split: 
 file:/home/peter/holdout.csv:0+336296
 15/06/25 16:27:16 INFO CacheManager: Partition rdd_12_2 not found, computing 
 it
 15/06/25 16:27:16 INFO HadoopRDD: Input split: 
 file:/home/peter/train.csv:15540706+14866642
 15/06/25 16:27:17 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID 2)
 org.spark-project.guava.util.concurrent.ExecutionError: 
 java.lang.StackOverflowError
   at 
 org.spark-project.guava.cache.LocalCache$Segment.get(LocalCache.java:2261)
   at org.spark-project.guava.cache.LocalCache.get(LocalCache.java:4000)
   at 
 org.spark-project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004)
   at 
 org.spark-project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
   at 
 org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:105)
   at 
 org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:102)
   at 
 org.apache.spark.sql.execution.SparkPlan.newMutableProjection(SparkPlan.scala:170)
   at 
 org.apache.spark.sql.execution.Project.buildProjection$lzycompute(basicOperators.scala:38)
   at 
 org.apache.spark.sql.execution.Project.buildProjection(basicOperators.scala:38)
   at 
 org.apache.spark.sql.execution.Project$$anonfun$1.apply(basicOperators.scala:41)
   at 
 org.apache.spark.sql.execution.Project$$anonfun$1.apply(basicOperators.scala:40)
   at 
 org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
   at 
 org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
   at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
   at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
   at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:242)
   at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
   at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
   at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
   at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
   at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
   at org.apache.spark.scheduler.Task.run(Task.scala:70)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
   at java.lang.Thread.run(Thread.java:745)
 Caused by: java.lang.StackOverflowError
   at 
 scala.reflect.internal.Symbols$Symbol.fullNameInternal(Symbols.scala:1042)
   at 
 scala.reflect.internal.Symbols$Symbol.fullNameAsName(Symbols.scala:1047)
   at 
 scala.reflect.internal.Symbols$Symbol.fullNameInternal(Symbols.scala:1044)
   at 
 scala.reflect.internal.Symbols$Symbol.fullNameAsName(Symbols.scala:1047)
   at 
 scala.reflect.internal.Symbols$Symbol.fullNameInternal(Symbols.scala:1044)
   at 
 scala.reflect.internal.Symbols$Symbol.fullNameAsName(Symbols.scala:1047)
   at 
 scala.reflect.internal.Symbols$Symbol.fullNameInternal(Symbols.scala:1044)
   at 
 scala.reflect.internal.Symbols$Symbol.fullNameAsName(Symbols.scala:1047)
   at 
 

Re: [jenkins] ERROR: Publisher 'Publish JUnit test result report' failed: No test report files were found. Configuration error?

2015-06-21 Thread Josh Rosen
This is a side effect of the new pull request tester script interacting badly 
with a Jenkins plugin, not anything caused by your changes. I'm working on a 
fix but in the meantime I'd just trust what SparkQA says.

Sent from my phone

 On Jun 21, 2015, at 1:54 PM, Yu Ishikawa yuu.ishikawa+sp...@gmail.com wrote:
 
 Hi all,
 
 How do I deal with the error on the official Jenkins?
 https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/35412/console
 
 ```
 Archiving unit tests logs...
 Send successful.
 Attempting to post to Github...
 Post successful.
 Archiving artifacts
 WARN: No artifacts found that match the file pattern
 **/target/unit-tests.log. Configuration error?
 WARN: java.lang.InterruptedException: no matches found within 1
 Recording test results
 ERROR: Publisher 'Publish JUnit test result report' failed: No test report
 files were found. Configuration error?
 Finished: FAILURE
 ```
 
 It seems that the unit testing related to the PR passed. However,
 Jenkins posted "Merged build finished. Test FAILed." to github.
 https://github.com/apache/spark/pull/6926
 
 Thanks
 Yu
 
 
 
 
 -
 -- Yu Ishikawa
 --
 View this message in context: 
 http://apache-spark-developers-list.1001551.n3.nabble.com/jenkins-ERROR-Publisher-Publish-JUnit-test-result-report-failed-No-test-report-files-were-found-Conf-tp12823.html
 Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
 
 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org
 

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [Tungsten] NPE in UnsafeShuffleWriter.java

2015-06-20 Thread Josh Rosen
I've filed https://issues.apache.org/jira/browse/SPARK-8498 to fix this
error-handling code.

On Fri, Jun 19, 2015 at 11:51 AM, Josh Rosen rosenvi...@gmail.com wrote:

 Hey Peter,

 I think that this is actually due to an error-handling issue: if you look
 at the stack trace that you posted, the NPE is being thrown from an
 error-handling branch of a `finally` block:

 @Override
 public void write(scala.collection.Iterator<Product2<K, V>> records) throws IOException {
   boolean success = false;
   try {
     while (records.hasNext()) {
       insertRecordIntoSorter(records.next());
     }
     closeAndWriteOutput();
     success = true;
   } finally {
     if (!success) {
       sorter.cleanupAfterError();  // <-- this is the line throwing the error
     }
   }
 }

 I suspect that what's happening is that an exception is being thrown from
 user / upstream code in the initial call to records.next(), but the
 error-handling block is failing because sorter == null since we haven't
 initialized it yet.

 I'm going to file a JIRA for this and will try to add a set of regression
 tests to the ShuffleSuite to make sure exceptions from user code aren't
 swallowed like this.

 On Fri, Jun 19, 2015 at 11:36 AM, Peter Rudenko petro.rude...@gmail.com
 wrote:

  Hi want to try new tungsten-sort shuffle manager, but on 1 stage
 executors start to die with NPE:

 15/06/19 17:53:35 WARN TaskSetManager: Lost task 38.0 in stage 41.0 (TID
 3176, ip-10-50-225-214.ec2.internal): java.lang.NullPointerException
 at
 org.apache.spark.shuffle.unsafe.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:151)
 at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
 at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 at org.apache.spark.scheduler.Task.run(Task.scala:70)
 at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)


 Any suggestions?

 Thanks,
 Peter Rudenko





Re: [Tungsten] NPE in UnsafeShuffleWriter.java

2015-06-19 Thread Josh Rosen
Hey Peter,

I think that this is actually due to an error-handling issue: if you look
at the stack trace that you posted, the NPE is being thrown from an
error-handling branch of a `finally` block:

@Override
public void write(scala.collection.Iterator<Product2<K, V>> records) throws IOException {
  boolean success = false;
  try {
    while (records.hasNext()) {
      insertRecordIntoSorter(records.next());
    }
    closeAndWriteOutput();
    success = true;
  } finally {
    if (!success) {
      sorter.cleanupAfterError();  // <-- this is the line throwing the error
    }
  }
}

I suspect that what's happening is that an exception is being thrown from
user / upstream code in the initial call to records.next(), but the
error-handling block is failing because sorter == null since we haven't
initialized it yet.

I'm going to file a JIRA for this and will try to add a set of regression
tests to the ShuffleSuite to make sure exceptions from user code aren't
swallowed like this.
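
For reference, the kind of regression test I have in mind would look roughly like
the sketch below (illustrative only, not the actual patch; whether this particular
job exercises the unsafe shuffle path depends on the shuffle dependency, so treat it
as a shape rather than a finished test):

```
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext, SparkException}
import org.scalatest.FunSuite

// Sketch: an exception thrown from user code during a shuffle write should surface
// as the task failure cause instead of being masked by an NPE from cleanup code.
class UserExceptionNotSwallowedSuite extends FunSuite {
  test("user exception propagates through the tungsten-sort shuffle writer") {
    val conf = new SparkConf()
      .setMaster("local[2]")
      .setAppName("user-exception-not-swallowed")
      .set("spark.shuffle.manager", "tungsten-sort")
    val sc = new SparkContext(conf)
    try {
      val e = intercept[SparkException] {
        sc.parallelize(1 to 10, 2)
          .map { i => if (i == 5) throw new IllegalStateException("boom"); (i, i) }
          .partitionBy(new HashPartitioner(2))
          .count()
      }
      // The original cause should be visible, not a NullPointerException from cleanup.
      assert(e.getMessage.contains("boom"))
    } finally {
      sc.stop()
    }
  }
}
```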

On Fri, Jun 19, 2015 at 11:36 AM, Peter Rudenko petro.rude...@gmail.com
wrote:

  Hi want to try new tungsten-sort shuffle manager, but on 1 stage
 executors start to die with NPE:

 15/06/19 17:53:35 WARN TaskSetManager: Lost task 38.0 in stage 41.0 (TID
 3176, ip-10-50-225-214.ec2.internal): java.lang.NullPointerException
 at
 org.apache.spark.shuffle.unsafe.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:151)
 at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
 at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 at org.apache.spark.scheduler.Task.run(Task.scala:70)
 at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)


 Any suggestions?

 Thanks,
 Peter Rudenko



Re: Sidebar: issues targeted for 1.4.0

2015-06-16 Thread Josh Rosen
Whatever you do, DO NOT use the built-in JIRA 'releases' feature to migrate
issues from 1.4.0 to another version: the JIRA feature will have the
side-effect of automatically changing the target versions for issues that
have been closed, which is going to be really confusing. I've made this
mistake once myself and it was a bit of a hassle to clean up.

On Tue, Jun 16, 2015 at 5:24 AM, Sean Owen so...@cloudera.com wrote:

 Question: what would happen if I cleared Target Version for everything
 still marked Target Version = 1.4.0? There are 76 right now, and
 clearly that's not correct.

 56 were opened by committers, including issues like "Do X for 1.4".
 I'd like to understand whether these are resolved but just weren't
 closed, or else why so many issues are being filed as a todo and not
 resolved? Slipping things here or there is OK, but these weren't even
 slipped, just forgotten.

 On Sat, May 30, 2015 at 3:55 PM, Sean Owen so...@cloudera.com wrote:
  In an ideal world,  Target Version really is what's going to go in as
  far as anyone knows and when new stuff comes up, we all have to figure
  out what gets dropped to fit by the release date. Boring, standard
  software project management practice. I don't know how realistic that
  is, but, I'm wondering how people feel about this, who have filed
  these JIRAs?
 
  Concretely, should non-Critical issues for 1.4.0 be un-Targeted?
  should they all be un-Targeted after the release?

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Re: PySpark on PyPi

2015-06-05 Thread Josh Rosen
This has been proposed before:
https://issues.apache.org/jira/browse/SPARK-1267

There's currently tighter coupling between the Python and Java halves of
PySpark than just requiring SPARK_HOME to be set; if we did this, I bet
we'd run into tons of issues when users try to run a newer version of the
Python half of PySpark against an older set of Java components or
vice-versa.

On Thu, Jun 4, 2015 at 10:45 PM, Olivier Girardot 
o.girar...@lateral-thoughts.com wrote:

 Hi everyone,
 Considering the python API as just a front needing the SPARK_HOME defined
 anyway, I think it would be interesting to deploy the Python part of Spark
 on PyPi in order to handle the dependencies in a Python project needing
 PySpark via pip.

 For now I just symlink the python/pyspark in my python install dir
 site-packages/ in order for PyCharm or other lint tools to work properly.
 I can do the setup.py work or anything.

 What do you think ?

 Regards,

 Olivier.



Re: Possible space improvements to shuffle

2015-06-02 Thread Josh Rosen
The relevant JIRA that springs to mind is
https://issues.apache.org/jira/browse/SPARK-2926

If an aggregator and ordering are both defined, then the map side of
sort-based shuffle will sort based on the key ordering so that map-side
spills can be efficiently merged.  We do not currently do a sort-based
merge on the reduce side; implementing this is a little tricky because it
will require more map partitions' output to be buffered on the reduce
side.  I think that SPARK-2926 has some proposals of how to deal with this,
including hierarchical merging of reduce outputs.

RE: ExternalSorter#partitionedIterator, I don't think it's safe to do
!ordering.isDefined && !aggregator.isDefined.  If an aggregator is defined but we don't have an
ordering, then I don't think it makes sense to sort the keys based on their
hashcodes or some default ordering, since hashcode collisions would lead to
incorrect results for sort-based aggregation.
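
To make the collision point concrete, a tiny, self-contained illustration (plain
Scala, nothing Spark-specific): two distinct keys can share a hashCode, so an
ordering on hashCode alone cannot distinguish them, and a naive sort-based combine
keyed only on the hash would merge values that belong to different keys.

```
object HashCollisionExample extends App {
  val a = "Aa"
  val b = "BB"
  assert(a.hashCode == b.hashCode)  // both hash to 2112 on the JVM
  assert(a != b)                    // ...yet they are different keys
}
```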

On Tue, Jun 2, 2015 at 1:50 PM, John Carrino john.carr...@gmail.com wrote:

 One thing I have noticed with ExternalSorter is that if an ordering is not
 defined, it does the sort using only the partition_id, instead of
 (partition_id, hash).  This means that on the reduce side you need to pull
 the entire dataset into memory before you can begin iterating over the
 results.

 I figure that since we are doing a sort of the data anyway, it doesn't seem more
 expensive to sort by (partition, hash).  That way the reducer can do a merge
 and only has to hold in memory the data for a single int hashCode before
 it can combine them and start returning results from the iterator.

 Has this already been discussed?  If so, can someone point me in the right
 direction to find out more?

 Thanks for any help!
 -jc

 p.s. I am using spark version 1.3.1.  The code I am looking at below is
 from ExternalSorter#partitionedIterator. I think maybe
 !ordering.isDefined should also include && !aggregator.isDefined

    if (spills.isEmpty && partitionWriters == null) {
      // Special case: if we have only in-memory data, we don't need to merge streams, and perhaps
      // we don't even need to sort by anything other than partition ID
      if (!ordering.isDefined) {
        // The user hasn't requested sorted keys, so only sort by partition ID, not key
        groupByPartition(collection.destructiveSortedIterator(partitionComparator))
      } else {
        // We do need to sort by both partition ID and key
        groupByPartition(collection.destructiveSortedIterator(partitionKeyComparator))
      }



Re: ClosureCleaner slowing down Spark SQL queries

2015-05-29 Thread Josh Rosen
Hey, want to file a JIRA for this?  This will make it easier to track
progress on this issue.  Definitely upload the profiler screenshots there,
too, since that's helpful information.

https://issues.apache.org/jira/browse/SPARK



On Wed, May 27, 2015 at 11:12 AM, Nitin Goyal nitin2go...@gmail.com wrote:

 Hi Ted,

 Thanks a lot for replying. First of all, moving to 1.4.0 RC2 is not easy for
 us, as the migration cost is big since a lot has changed in Spark SQL since 1.2.

 Regarding SPARK-7233, I had already looked at it few hours back and it
 solves the problem for concurrent queries but my problem is just for a
 single query. I also looked at the fix's code diff and it wasn't related to
 the problem which seems to exist in Closure Cleaner code.

 Thanks
 -Nitin



 --
 View this message in context:
 http://apache-spark-developers-list.1001551.n3.nabble.com/ClosureCleaner-slowing-down-Spark-SQL-queries-tp12466p12468.html
 Sent from the Apache Spark Developers List mailing list archive at
 Nabble.com.

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Re: Kryo option changed

2015-05-23 Thread Josh Rosen
Which commit of master are you building off?  It looks like there was a
bugfix for an issue related to KryoSerializer buffer configuration:
https://github.com/apache/spark/pull/5934

That patch was committed two weeks ago, but you mentioned that you're
building off a newer version of master.  Could you confirm the commit that
you're running?  If this used to work but now throws an error, then this is
a regression that should be fixed; we shouldn't require you to perform an
mb-to-kb conversion to work around this.
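
For reference, a small sketch of the two spellings discussed in this thread (which
one works depends on whether the build includes the #5934 fix):

```
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryoserializer.buffer", "8m")       // size string with a unit (post-fix behavior)
  // .set("spark.kryoserializer.buffer", "8192k") // workaround suggested earlier in this thread
```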

On Sat, May 23, 2015 at 6:37 PM, Ted Yu yuzhih...@gmail.com wrote:

 Pardon me.

 Please use '8192k'

 Cheers

 On Sat, May 23, 2015 at 6:24 PM, Debasish Das debasish.da...@gmail.com
 wrote:

 Tried 8mb...still I am failing on the same error...

 On Sat, May 23, 2015 at 6:10 PM, Ted Yu yuzhih...@gmail.com wrote:

  bq. it should be 8mb

 Please use the above syntax.

 Cheers

 On Sat, May 23, 2015 at 6:04 PM, Debasish Das debasish.da...@gmail.com
 wrote:

 Hi,

 I am on last week's master but all the examples that set up the
 following

  .set("spark.kryoserializer.buffer", "8m")

 are failing with the following error:

  Exception in thread "main" java.lang.IllegalArgumentException:
  spark.kryoserializer.buffer must be less than 2048 mb, got: + 8192 mb.
  Looks like buffer.mb is deprecated... Is 8m not the right syntax to
  get an 8mb kryo buffer, or should it be 8mb?

 Thanks.
 Deb







Re: Testing spark applications

2015-05-22 Thread Josh Rosen
I think that @holdenk's *spark-testing-base* project publishes some of
these test classes as well as some helper classes for testing streaming
jobs: https://github.com/holdenk/spark-testing-base
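
For reference, a sketch of a SharedSparkContext-style trait, roughly the "15 lines"
Reynold mentions below (illustrative only, not the exact class from Spark's test
sources):

```
import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.{BeforeAndAfterAll, Suite}

trait SharedLocalSparkContext extends BeforeAndAfterAll { self: Suite =>
  @transient private var _sc: SparkContext = _
  def sc: SparkContext = _sc

  override def beforeAll(): Unit = {
    super.beforeAll()
    _sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName(suiteName))
  }

  override def afterAll(): Unit = {
    try {
      if (_sc != null) _sc.stop()
      _sc = null
    } finally {
      super.afterAll()
    }
  }
}
```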

On Thu, May 21, 2015 at 10:39 PM, Reynold Xin r...@databricks.com wrote:

 It is just 15 lines of code to copy, isn't it?

 On Thu, May 21, 2015 at 7:46 PM, Nathan Kronenfeld 
 nkronenfeld@uncharted.software wrote:

 see discussions about Spark not really liking multiple contexts in the
 same JVM


 Speaking of this - is there a standard way of writing unit tests that
 require a SparkContext?

 We've ended up copying out the code of SharedSparkContext to our own
 testing hierarchy, but it occurs to me someone would have published a test
 jar by now if that was the best way.

   -Nathan





Re: [build system] scheduled datacenter downtime, sunday may 17th

2015-05-17 Thread Josh Rosen
Reminder: the network migration has started this morning, so Jenkins is
currently down.

Status updates on the migration are being published at
http://ucbsystems.org/

On Wed, May 13, 2015 at 5:12 PM, shane knapp skn...@berkeley.edu wrote:

 our datacenter is rejiggering our network (read: fully re-engineering large
 portions from the ground up) and has downtime scheduled from 9am-3pm PDT,
 this sunday may17th.

 this means our jenkins instance will not be available to the outside world,
 and i will be putting jenkins in to quiet mode the night before.  this will
 allow any running builds to finish, and to save me from getting up @ 6am on
 my day off.  :)

 once things are back up and running (~3pm or earlier), i will purge the
 build queue and bring jenkins out of quiet mode.

 of course, stay tuned to this bat-channel for future, and potentially
 riveting updates!



Re: How to link code pull request with JIRA ID?

2015-05-14 Thread Josh Rosen
Spark PRs didn't always used to handle the JIRA linking.  We used to rely
on a Jenkins job that ran
https://github.com/apache/spark/blob/master/dev/github_jira_sync.py.  We
switched this over to Spark PRs at a time when the Jenkins GitHub Pull
Request Builder plugin was having flakiness issues, but as far as I know
that old script should still work.

On Wed, May 13, 2015 at 9:40 PM, Nicholas Chammas 
nicholas.cham...@gmail.com wrote:

 There's no magic to it. We're doing the same, except Josh automated it in
 the PR dashboard he created.

 https://spark-prs.appspot.com/

 Nick

 On Wed, May 13, 2015 at 6:20 PM Markus Weimer mar...@weimo.de wrote:

  Hi,
 
  how did you set this up? Over in the REEF incubation project, we
  painstakingly create the forwards- and backwards links despite having
  the IDs in the PR descriptions...
 
  Thanks!
 
  Markus
 
 
  On 2015-05-13 11:56, Ted Yu wrote:
   Subproject tag should follow SPARK JIRA number.
   e.g.
  
   [SPARK-5277][SQL] ...
  
   Cheers
  
   On Wed, May 13, 2015 at 11:50 AM, Stephen Boesch java...@gmail.com
  wrote:
  
   following up from Nicholas, it is
  
   [SPARK-12345] Your PR description
  
   where 12345 is the jira number.
  
  
   One thing I tend to forget is when/where to include the subproject tag
  e.g.
[MLLIB]
  
  
   2015-05-13 11:11 GMT-07:00 Nicholas Chammas 
 nicholas.cham...@gmail.com
  :
  
   That happens automatically when you open a PR with the JIRA key in
 the
  PR
   title.
  
   On Wed, May 13, 2015 at 2:10 PM Chandrashekhar Kotekar 
   shekhar.kote...@gmail.com wrote:
  
   Hi,
  
   I am new to open source contribution and trying to understand the
   process
   starting from pulling code to uploading patch.
  
   I have managed to pull code from GitHub. In JIRA I saw that each
 JIRA
   issue
   is connected with pull request. I would like to know how do people
   attach
   pull request details to JIRA issue?
  
   Thanks,
   Chandrash3khar Kotekar
   Mobile - +91 8600011455
  
  
  
  
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
  For additional commands, e-mail: dev-h...@spark.apache.org
 
 



Re: large volume spark job spends most of the time in AppendOnlyMap.changeValue

2015-05-08 Thread Josh Rosen
Do you have any more specific profiling data that you can share?  I'm
curious to know where AppendOnlyMap.changeValue is being called from.

On Fri, May 8, 2015 at 1:26 PM, Michal Haris michal.ha...@visualdna.com
wrote:

 +dev
 On 6 May 2015 10:45, Michal Haris michal.ha...@visualdna.com wrote:

  Just wanted to check if somebody has seen similar behaviour or knows what
  we might be doing wrong. We have a relatively complex spark application
  which processes half a terabyte of data at various stages. We have
 profiled
  it in several ways and everything seems to point to one place where 90%
 of
  the time is spent:  AppendOnlyMap.changeValue. The job scales and is
  relatively faster than its map-reduce alternative but it still feels
 slower
  than it should be. I am suspecting too much spill but I haven't seen any
  improvement by increasing number of partitions to 10k. Any idea would be
  appreciated.
 
  --
  Michal Haris
  Technical Architect
  direct line: +44 (0) 207 749 0229
  www.visualdna.com | t: +44 (0) 207 734 7033,
 



Re: Github auth problems = some test results not posting

2015-04-05 Thread Josh Rosen
Thanks for catching this.  It looks like a recent Jenkins job configuration
change inadvertently renamed the GITHUB_OAUTH_KEY environment variable to
something else, causing this to break.  I've rolled back that change, so
hopefully the GitHub posting should start working again.

- Josh

On Sun, Apr 5, 2015 at 6:40 AM, Sean Owen so...@cloudera.com wrote:

 I noticed recent pull request build results weren't posting results of
 MiMa checks, etc.

 I think it's due to Github auth issues:

 Attempting to post to Github...
   http_code: 401.
   api_response: {
     "message": "Bad credentials",
     "documentation_url": "https://developer.github.com/v3"
   }

 I've heard another colleague say they're having trouble with
 credentials today. Anyone else?

 I don't know if it's transient or what, but for today, just be aware
 you'll have to look at the end of the Jenkins output to see if these
 other checks passed.

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Re: JavaRDD Aggregate initial value - Closure-serialized zero value reasoning?

2015-02-18 Thread Josh Rosen
It looks like this was fixed in
https://issues.apache.org/jira/browse/SPARK-4743 /
https://github.com/apache/spark/pull/3605.  Can you see whether that patch
fixes this issue for you?



On Tue, Feb 17, 2015 at 8:31 PM, Matt Cheah mch...@palantir.com wrote:

 Hi everyone,

 I was using JavaPairRDD’s combineByKey() to compute all of my aggregations
 before, since I assumed that every aggregation required a key. However, I
 realized I could do my analysis using JavaRDD’s aggregate() instead and not
 use a key.

 I have set spark.serializer to use Kryo. As a result, JavaRDD’s
 combineByKey requires that a “createCombiner” function is provided, and the
 return value from that function must be serializable using Kryo. When I
 switched to using rdd.aggregate I assumed that the zero value would also be
 strictly Kryo serialized, as it is a data item and not part of a closure or
 the aggregation functions. However, I got a serialization exception as the
 closure serializer (only valid serializer is the Java serializer) was used
 instead.

 I was wondering the following:

1. What is the rationale for making the zero value be serialized using
the closure serializer? This isn’t part of the closure, but is an initial
data item.
2. Would it make sense for us to perhaps write a version of
rdd.aggregate() that takes a function as a parameter, that generates the
zero value? This would be more intuitive to be serialized using the closure
serializer.

 I believe aggregateByKey is also affected.
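
 A minimal sketch of the idea in point 2 (a hypothetical helper written outside
 Spark, not an existing API): the zero value is produced by a factory function on
 each executor, so only the closure-serialized factory is shipped and the zero value
 itself is never serialized.

```
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Hypothetical helper; seqOp folds within a partition, combOp merges per-partition
// results on the driver. Each partition builds its own zero via zeroFactory().
def aggregateWithZeroFactory[T, U: ClassTag](rdd: RDD[T])(zeroFactory: () => U)(
    seqOp: (U, T) => U, combOp: (U, U) => U): U = {
  val perPartition = rdd.mapPartitions { iter =>
    Iterator(iter.foldLeft(zeroFactory())(seqOp))  // zero built locally on the executor
  }.collect()
  perPartition.foldLeft(zeroFactory())(combOp)     // final zero built on the driver
}
```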

 Thanks,

 -Matt Cheah



Re: Unit tests

2015-02-09 Thread Josh Rosen
Hi Iulian,

I think the AkkaUtilsSuite failure that you observed has been fixed in
https://issues.apache.org/jira/browse/SPARK-5548 /
https://github.com/apache/spark/pull/4343
On February 9, 2015 at 5:47:59 AM, Iulian Dragoș (iulian.dra...@typesafe.com) 
wrote:

Hi Patrick,  

Thanks for the heads up. I was trying to set up our own infrastructure for  
testing Spark (essentially, running `run-tests` every night) on EC2. I  
stumbled upon a number of flaky tests, but none of them look similar to  
anything in Jira with the flaky-test tag. I wonder if there's something  
wrong with our infrastructure, or I should simply open Jira tickets with  
the failures I find. For example, one that appears fairly often on our  
setup is in AkkaUtilsSuite remote fetch ssl on - untrusted server  
(exception `ActorNotFound`, instead of `TimeoutException`).  

thanks,  
iulian  


On Fri, Feb 6, 2015 at 9:55 PM, Patrick Wendell pwend...@gmail.com wrote:  

 Hey All,  
  
 The tests are in a not-amazing state right now due to a few compounding  
 factors:  
  
 1. We've merged a large volume of patches recently.  
 2. The load on jenkins has been relatively high, exposing races and  
 other behavior not seen at lower load.  
  
 For those not familiar, the main issue is flaky (non deterministic)  
 test failures. Right now I'm trying to prioritize keeping the  
 PullReqeustBuilder in good shape since it will block development if it  
 is down.  
  
 For other tests, let's try to keep filing JIRA's when we see issues  
 and use the flaky-test label (see http://bit.ly/1yRif9S):  
  
 I may contact people regarding specific tests. This is a very high  
 priority to get in good shape. This kind of thing is no one's fault  
 but just the result of a lot of concurrent development, and everyone  
 needs to pitch in to get back in a good place.  
  
 - Patrick  
  
 -  
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org  
 For additional commands, e-mail: dev-h...@spark.apache.org  
  
  


--  

--  
Iulian Dragos  

--  
Reactive Apps on the JVM  
www.typesafe.com  


Re: Temporary jenkins issue

2015-02-08 Thread Josh Rosen
It looks like this may be fixed soon in Jenkins:

https://issues.jenkins-ci.org/browse/JENKINS-25446
https://github.com/jenkinsci/flaky-test-handler-plugin/pull/1

On February 2, 2015 at 7:38:19 PM, Patrick Wendell (pwend...@gmail.com) wrote:

Hey All, 

I made a change to the Jenkins configuration that caused most builds 
to fail (attempting to enable a new plugin), I've reverted the change 
effective about 10 minutes ago. 

If you've seen recent build failures like below, this was caused by 
that change. Sorry about that. 

 
ERROR: Publisher 
com.google.jenkins.flakyTestHandler.plugin.JUnitFlakyResultArchiver 
aborted due to exception 
java.lang.NoSuchMethodError: 
hudson.model.AbstractBuild.getTestResultAction()Lhudson/tasks/test/AbstractTestResultAction;
 
at 
com.google.jenkins.flakyTestHandler.plugin.FlakyTestResultAction.init(FlakyTestResultAction.java:78)
 
at 
com.google.jenkins.flakyTestHandler.plugin.JUnitFlakyResultArchiver.perform(JUnitFlakyResultArchiver.java:89)
 
at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20) 
at 
hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:770)
 
at 
hudson.model.AbstractBuild$AbstractBuildExecution.performAllBuildSteps(AbstractBuild.java:734)
 
at hudson.model.Build$BuildExecution.post2(Build.java:183) 
at 
hudson.model.AbstractBuild$AbstractBuildExecution.post(AbstractBuild.java:683) 
at hudson.model.Run.execute(Run.java:1784) 
at hudson.matrix.MatrixRun.run(MatrixRun.java:146) 
at hudson.model.ResourceController.execute(ResourceController.java:89) 
at hudson.model.Executor.run(Executor.java:240) 
 

- Patrick 

- 
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org 
For additional commands, e-mail: dev-h...@spark.apache.org 



Re: Results of tests

2015-01-09 Thread Josh Rosen
The "Test Result" pages for Jenkins builds show some nice statistics for
the test run, including individual test times:

https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-1.2-Maven-with-YARN/lastSuccessfulBuild/HADOOP_PROFILE=hadoop-2.4,label=centos/testReport/

Currently this only covers the Java / Scala tests, but we might be able to
integrate the PySpark tests here, too (I think it's just a matter of
getting the Python test runner to generate the correct test result XML
output).

On Fri, Jan 9, 2015 at 10:47 AM, Ted Yu yuzhih...@gmail.com wrote:

 For a build which uses JUnit, we would see a summary such as the following
 (
 https://builds.apache.org/job/HBase-TRUNK/6007/console):

 Tests run: 2199, Failures: 0, Errors: 0, Skipped: 25


 In
 https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-1.2-Maven-with-YARN/lastSuccessfulBuild/HADOOP_PROFILE=hadoop-2.4,label=centos/consoleFull
 , I don't see such statistics.


 Looks like scalatest-maven-plugin can be enhanced :-)


 On Fri, Jan 9, 2015 at 3:52 AM, Sean Owen so...@cloudera.com wrote:

  Hey Tony, the number of tests run could vary depending on how the
  build is configured. For example, YARN-related tests would only run
  when the yarn profile is turned on. Java 8 tests would only run under
  Java 8.
 
  Although I don't know that there's any reason to believe the IBM JVM
  has a problem with Spark, I see this issue that is potentially related
  to endian-ness : https://issues.apache.org/jira/browse/SPARK-2018 I
  don't know if that was a Spark issue. Certainly, would be good for you
  to investigate if you are interested in resolving it.
 
  The Jenkins output shows you exactly what tests were run and how --
  have a look at the logs.
 
 
 
 https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-1.2-Maven-with-YARN/lastSuccessfulBuild/HADOOP_PROFILE=hadoop-2.4,label=centos/consoleFull
 
  On Fri, Jan 9, 2015 at 9:15 AM, Tony Reix tony.r...@bull.net wrote:
   Hi Ted
  
   Thanks for the info.
   However, I'm still unable to understand how the page:
  
 
 https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-1.2-Maven-with-YARN/lastSuccessfulBuild/HADOOP_PROFILE=hadoop-2.4,label=centos/testReport/
   has been built.
   This page contains details that I do not find on the page you pointed
  me to:
  
 
 https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-1.2-Maven-with-YARN/lastSuccessfulBuild/HADOOP_PROFILE=hadoop-2.4,label=centos/consoleFull
  
   As an example, I'm still unable to find these details:
    Package                          Duration   Fail   Skip   Pass   Total
    org.apache.spark                 12 min     0      1      247    248
    org.apache.spark.api.python      20 ms      0      0      2      2
    org.apache.spark.bagel           7.7 s      0      0      4      4
    org.apache.spark.broadcast       43 s       0      0      17     17
    org.apache.spark.deploy          16 s       0      0      29     29
    org.apache.spark.deploy.worker   0.55 s     0      0      12     12
   
    (Each package name links to its own per-package page under the testReport
    URL above, e.g.
    https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-1.2-Maven-with-YARN/lastSuccessfulBuild/HADOOP_PROFILE=hadoop-2.4,label=centos/testReport/org.apache.spark/
    for the first row.)
  
   
  
  
   Moreover, in my Ubuntu/x86_64 environment, I do not find 3745 tests and
  0 failures, but 3485 tests and 4 failures (when using the Oracle 1.7 JVM).
  When using the IBM JVM, there are only 2566 tests and 5 failures (in the
  same component: Streaming).
  
   On my PPC64BE (BE = Big-Endian) environment, the tests hang after a couple
  of hundred tests.
   Is Spark independent of little-/big-endian issues?
  
   On my PPC64LE (LE = Little-Endian) environment, I get only 3485 tests
  (as on Ubuntu/x86_64 with the IBM JVM), with 6 or 285 failures...
  
   So, I need to learn more about how your Jenkins environment extracts
  details about the results.
   Moreover, which JVM is used?
  
   Do you plan to test with the IBM JVM in order to check that Spark and the
  IBM JVM are compatible? (They already do not appear to be 100% compatible...)
  
   Thanks
  
   Tony
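
A side note on the big-endian question above: a quick way to see which byte order a host reports, and why serializing with the platform's native order (rather than a fixed one) can produce different bytes on PPC64BE versus x86_64/PPC64LE, is a few lines of Python. This is purely illustrative and not part of Spark's test suite:

    # Hedged illustration: the same 32-bit integer packed with the native byte
    # order differs between big- and little-endian hosts, while an explicit
    # order ('>' big-endian, '<' little-endian) is identical everywhere.
    import struct
    import sys

    print("platform byte order:", sys.byteorder)     # 'little' or 'big'
    print("native :", struct.pack("=I", 1).hex())    # host-dependent bytes
    print("big    :", struct.pack(">I", 1).hex())    # '00000001' on every host
    print("little :", struct.pack("<I", 1).hex())    # '01000000' on every host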
  
 

Re: jenkins redirect down (but jenkins is up!), lots of potential

2015-01-05 Thread Josh Rosen
The pull request builder and SCM-polling builds appear to be working fine,
but the links in pull request comments won't work because the AMP Lab
webserver is still down.  In the meantime, though, you can continue to
access Jenkins through https://hadrian.ist.berkeley.edu/jenkins/

On Mon, Jan 5, 2015 at 10:37 AM, shane knapp skn...@berkeley.edu wrote:

 UC Berkeley had some major maintenance done this past weekend and, long
 story short, not everything came back.  Our primary webserver's NFS is down,
 which means we're not serving websites, so the redirect to
 Jenkins is failing.

 Jenkins is still up and building some jobs, but we will probably see pull
 request builder failures and other transient issues.  SCM-polling builds
 should be fine.

 There is no ETA on when this will be fixed, but once our
 amplab.cs.berkeley.edu/jenkins redirect is working, I will let everyone know.
 I'm trying to get more status updates as they come.

 I'm really sorry about the inconvenience.

 shane



Re: Is there any way to tell if compute is being called from a retry?

2014-12-30 Thread Josh Rosen
This is timely, since I just ran into this issue myself while trying to
write a test to reproduce a bug related to speculative execution (I wanted
to configure a job so that the first attempt to compute a partition would
run slowly, so that a second, faster speculative copy would be launched).

I've opened a PR with a proposed fix:
https://github.com/apache/spark/pull/3849



On Tue, Dec 30, 2014 at 12:24 PM, Cody Koeninger c...@koeninger.org wrote:

 It looks like taskContext.attemptId doesn't mean what one thinks it might
 mean, based on


 http://apache-spark-developers-list.1001551.n3.nabble.com/Get-attempt-number-in-a-closure-td8853.html

 and the unresolved

 https://issues.apache.org/jira/browse/SPARK-4014



 Is there any alternative way to tell if compute is being called from a
 retry?  Barring that, does anyone have any tips on how it might be possible
 to get the attempt count propagated to executors?

  It would be extremely useful for the Kafka RDD's preferred-location
  awareness.
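
As a rough illustration of the kind of check the PR above is meant to enable (reading an attempt number off TaskContext inside the task body), here is a hedged PySpark sketch. Note that pyspark.TaskContext only became available in much later releases (roughly 2.2+), so this shows the idea rather than what was available at the time of this thread:

    # Hedged sketch: detect whether a partition is being recomputed by checking
    # TaskContext.attemptNumber() inside the partition function. attemptNumber()
    # is 0 for the first attempt, so anything greater indicates a retry or a
    # speculative duplicate. Assumes a PySpark version where pyspark.TaskContext
    # exists (2.2+); names below are illustrative.
    from pyspark import SparkContext, TaskContext


    def process_partition(index, rows):
        ctx = TaskContext.get()
        if ctx is not None and ctx.attemptNumber() > 0:
            # This message ends up in the executor logs, not the driver console.
            print("partition %d is a re-attempt (#%d)" % (index, ctx.attemptNumber()))
        return rows


    if __name__ == "__main__":
        sc = SparkContext(appName="retry-detection-sketch")
        sc.parallelize(range(100), 4).mapPartitionsWithIndex(process_partition).count()
        sc.stop()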



Re: cleaning up cache files left by SPARK-2713

2014-12-24 Thread Josh Rosen
I reviewed and merged that PR, in case you want to try out the fix.

- Josh

On December 22, 2014 at 10:40:35 AM, Marcelo Vanzin (van...@cloudera.com) wrote:

https://github.com/apache/spark/pull/3705 

On Mon, Dec 22, 2014 at 10:19 AM, Cody Koeninger c...@koeninger.org wrote: 
 Is there a reason not to go ahead and move the _cache and _lock files 
 created by Utils.fetchFiles into the work directory, so they can be cleaned 
 up more easily? I saw comments to that effect in the discussion of the PR 
 for 2713, but it doesn't look like it got done. 
 
 And no, I didn't just have a machine fill up the /tmp directory; why do you 
 ask? :) 



-- 
Marcelo 




Re: Confirming race condition in DagScheduler (NoSuchElementException)

2014-12-24 Thread Josh Rosen
I’m investigating this issue and left some comments on the proposed fix: 
https://github.com/apache/spark/pull/3345#issuecomment-68014353

To summarize, I agree with your description of the problem but think that the 
right fix may be a bit more involved than what’s proposed in that PR (that PR’s 
fix shouldn’t actually work, as far as I can tell).

- Josh

On December 19, 2014 at 10:57:41 AM, thlee (ti...@ooyala.com) wrote:

any comments?  



--  
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Confirming-race-condition-in-DagScheduler-NoSuchElementException-tp9798p9855.html
  
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.  



