Re: [VOTE] Spark 2.3.1 (RC4)

2018-06-03 Thread Ricardo Almeida
+1 (non-binding)

On 3 June 2018 at 09:23, Dongjoon Hyun  wrote:

> +1
>
> Bests,
> Dongjoon.
>
> On Sat, Jun 2, 2018 at 8:09 PM, Denny Lee  wrote:
>
>> +1
>>
>> On Sat, Jun 2, 2018 at 4:53 PM Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> I'll give that a try, but I'll still have to figure out what to do if
>>> none of the release builds work with hadoop-aws, since Flintrock deploys
>>> Spark release builds to set up a cluster. Building Spark is slow, so we
>>> only do it if the user specifically requests a Spark version by git hash.
>>> (This is basically how spark-ec2 did things, too.)
>>>
>>>
>>> On Sat, Jun 2, 2018 at 6:54 PM Marcelo Vanzin 
>>> wrote:
>>>
 If you're building your own Spark, definitely try the hadoop-cloud
 profile. Then you don't even need to pull anything at runtime,
 everything is already packaged with Spark.

 On Fri, Jun 1, 2018 at 6:51 PM, Nicholas Chammas
  wrote:
 > pyspark --packages org.apache.hadoop:hadoop-aws:2.7.3 didn’t work
 for me
 > either (even building with -Phadoop-2.7). I guess I’ve been relying
 on an
 > unsupported pattern and will need to figure something else out going
 forward
 > in order to use s3a://.
 >
 >
 > On Fri, Jun 1, 2018 at 9:09 PM Marcelo Vanzin 
 wrote:
 >>
 >> I have personally never tried to include hadoop-aws that way. But at
 >> the very least, I'd try to use the same version of Hadoop as the
 Spark
 >> build (2.7.3 IIRC). I don't really expect a different version to
 work,
 >> and if it did in the past it definitely was not by design.
 >>
 >> On Fri, Jun 1, 2018 at 5:50 PM, Nicholas Chammas
 >>  wrote:
 >> > Building with -Phadoop-2.7 didn’t help, and if I remember
 correctly,
 >> > building with -Phadoop-2.8 worked with hadoop-aws in the 2.3.0
 release,
 >> > so
 >> > it appears something has changed since then.
 >> >
 >> > I wasn’t familiar with -Phadoop-cloud, but I can try that.
 >> >
 >> > My goal here is simply to confirm that this release of Spark works
 with
 >> > hadoop-aws like past releases did, particularly for Flintrock
 users who
 >> > use
 >> > Spark with S3A.
 >> >
 >> > We currently provide -hadoop2.6, -hadoop2.7, and -without-hadoop
 builds
 >> > with
 >> > every Spark release. If the -hadoop2.7 release build won’t work
 with
 >> > hadoop-aws anymore, are there plans to provide a new build type
 that
 >> > will?
 >> >
 >> > Apologies if the question is poorly formed. I’m batting a bit
 outside my
 >> > league here. Again, my goal is simply to confirm that I/my users
 still
 >> > have
 >> > a way to use s3a://. In the past, that way was simply to call
 pyspark
 >> > --packages org.apache.hadoop:hadoop-aws:2.8.4 or something very
 similar.
 >> > If
 >> > that will no longer work, I’m trying to confirm that the change of
 >> > behavior
 >> > is intentional or acceptable (as a review for the Spark project)
 and
 >> > figure
 >> > out what I need to change (as due diligence for Flintrock’s users).
 >> >
 >> > Nick
 >> >
 >> >
 >> > On Fri, Jun 1, 2018 at 8:21 PM Marcelo Vanzin >>> >
 >> > wrote:
 >> >>
 >> >> Using the hadoop-aws package is probably going to be a little more
 >> >> complicated than that. The best bet is to use a custom build of
 Spark
 >> >> that includes it (use -Phadoop-cloud). Otherwise you're probably
 >> >> looking at some nasty dependency issues, especially if you end up
 >> >> mixing different versions of Hadoop.
 >> >>
 >> >> On Fri, Jun 1, 2018 at 4:01 PM, Nicholas Chammas
 >> >>  wrote:
 >> >> > I was able to successfully launch a Spark cluster on EC2 at
 2.3.1 RC4
 >> >> > using
 >> >> > Flintrock. However, trying to load the hadoop-aws package gave
 me
 >> >> > some
 >> >> > errors.
 >> >> >
 >> >> > $ pyspark --packages org.apache.hadoop:hadoop-aws:2.8.4
 >> >> >
 >> >> > 
 >> >> >
 >> >> > :: problems summary ::
 >> >> >  WARNINGS
 >> >> > [NOT FOUND  ]
 >> >> > com.sun.jersey#jersey-json;1.9!jersey-json.jar(bundle) (2ms)
 >> >> >  local-m2-cache: tried
 >> >> >
 >> >> >
 >> >> >
>> >> > file:/home/ec2-user/.m2/repository/com/sun/jersey/jersey-json/1.9/jersey-json-1.9.jar
 >> >> > [NOT FOUND  ]
 >> >> > com.sun.jersey#jersey-server;1.9!jersey-server.jar(bundle)
 (0ms)
 >> >> >  local-m2-cache: tried
 >> >> >
 >> >> >
 >> >> >
>> >> > file:/home/ec2-user/.m2/repository/com/sun/jersey/jersey-server/1.9/jersey-server-1.9.jar
 >> >> > [NOT FOUND  ]
 >> >> > org.codehaus.jettison#jettison;1.1!jettison.jar(bundle) (1ms)
 >> >> >  
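
For reference, the -Phadoop-cloud route recommended in the thread above amounts to building Spark with the cloud connectors bundled in; a minimal sketch (only the profile names come from the thread, the remaining Maven options are assumptions):

# build Spark with hadoop-aws and friends packaged in, so nothing has to be
# pulled with --packages at runtime
./build/mvn -DskipTests -Phadoop-2.7 -Phadoop-cloud clean package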

Re: [VOTE] Spark 2.3.0 (RC5)

2018-02-24 Thread Ricardo Almeida
+1 (non-binding)

same as previous RC

On 24 February 2018 at 11:10, Hyukjin Kwon  wrote:

> +1
>
> 2018-02-24 16:57 GMT+09:00 Bryan Cutler :
>
>> +1
>> Tests passed and additionally ran Arrow related tests and did some perf
>> checks with python 2.7.14
>>
>> On Fri, Feb 23, 2018 at 6:18 PM, Holden Karau 
>> wrote:
>>
>>> Note: given the state of Jenkins I'd love to see Bryan Cutler or someone
>>> with Arrow experience sign off on this release.
>>>
>>> On Fri, Feb 23, 2018 at 6:13 PM, Cheng Lian 
>>> wrote:
>>>
 +1 (binding)

 Passed all the tests, looks good.

 Cheng

 On 2/23/18 15:00, Holden Karau wrote:

 +1 (binding)
 PySpark artifacts install in a fresh Py3 virtual env

 On Feb 23, 2018 7:55 AM, "Denny Lee"  wrote:

> +1 (non-binding)
>
> On Fri, Feb 23, 2018 at 07:08 Josh Goldsborough <
> joshgoldsboroughs...@gmail.com> wrote:
>
>> New to testing out Spark RCs for the community but I was able to run
>> some of the basic unit tests without error so for what it's worth, I'm a 
>> +1.
>>
>> On Thu, Feb 22, 2018 at 4:23 PM, Sameer Agarwal 
>> wrote:
>>
>>> Please vote on releasing the following candidate as Apache Spark
>>> version 2.3.0. The vote is open until Tuesday February 27, 2018 at 
>>> 8:00:00
>>> am UTC and passes if a majority of at least 3 PMC +1 votes are cast.
>>>
>>>
>>> [ ] +1 Release this package as Apache Spark 2.3.0
>>>
>>> [ ] -1 Do not release this package because ...
>>>
>>>
>>> To learn more about Apache Spark, please see
>>> https://spark.apache.org/
>>>
>>> The tag to be voted on is v2.3.0-rc5: https://github.com/apache/spark/tree/v2.3.0-rc5 (992447fb30ee9ebb3cf794f2d06f4d63a2d792db)
>>>
>>> List of JIRA tickets resolved in this release can be found here:
>>> https://issues.apache.org/jira/projects/SPARK/versions/12339551
>>>
>>> The release files, including signatures, digests, etc. can be found
>>> at:
>>> https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc5-bin/
>>>
>>> Release artifacts are signed with the following key:
>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1266/
>>>
>>> The documentation corresponding to this release can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc5-docs/_site/index.html
>>>
>>>
>>> FAQ
>>>
>>> ===
>>> What are the unresolved issues targeted for 2.3.0?
>>> ===
>>>
>>> Please see https://s.apache.org/oXKi. At the time of writing, there
>>> are currently no known release blockers.
>>>
>>> =
>>> How can I help test this release?
>>> =
>>>
>>> If you are a Spark user, you can help us test this release by taking
>>> an existing Spark workload and running on this release candidate, then
>>> reporting any regressions.
>>>
>>> If you're working in PySpark you can set up a virtual env and
>>> install the current RC and see if anything important breaks, in the
>>> Java/Scala you can add the staging repository to your projects resolvers
>>> and test with the RC (make sure to clean up the artifact cache 
>>> before/after
>>> so you don't end up building with a out of date RC going forward).
>>>
>>> ===
>>> What should happen to JIRA tickets still targeting 2.3.0?
>>> ===
>>>
>>> Committers should look at those and triage. Extremely important bug
>>> fixes, documentation, and API tweaks that impact compatibility should be
>>> worked on immediately. Everything else please retarget to 2.3.1 or 
>>> 2.4.0 as
>>> appropriate.
>>>
>>> ===
>>> Why is my bug not fixed?
>>> ===
>>>
>>> In order to make timely releases, we will typically not hold the
>>> release unless the bug in question is a regression from 2.2.0. That 
>>> being
>>> said, if there is something which is a regression from 2.2.0 and has not
>>> been correctly targeted please ping me or a committer to help target the
>>> issue (you can see the open issues listed as impacting Spark 2.3.0 at
>>> https://s.apache.org/WmoI).
>>>
>>
>>

>>>
>>>
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>>
>>
>>
>
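
To make the Java/Scala part of the testing FAQ above concrete, pointing a project at the staging repository looks roughly like this (a sketch assuming sbt; the dependency line and the 2.3.0 artifact version are illustrative assumptions, not part of the vote email):

// build.sbt
resolvers += "Apache Spark 2.3.0 RC5 staging" at
  "https://repository.apache.org/content/repositories/orgapachespark-1266/"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.0"

Cleaning the local artifact caches (typically ~/.ivy2/cache/org.apache.spark and ~/.m2/repository/org/apache/spark) before and after keeps the RC artifacts from leaking into later builds.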


Re: [VOTE] Spark 2.3.0 (RC4)

2018-02-18 Thread Ricardo Almeida
+1 (non-binding)

Built and tested on macOS 10.12.6 Java 8 (build 1.8.0_111). No regressions
detected so far.


On 18 February 2018 at 16:12, Sean Owen  wrote:

> +1 from me as last time, same outcome.
>
> I saw one test fail, but passed on a second run, so just seems flaky.
>
> - subscribing topic by name from latest offsets (failOnDataLoss: true) ***
> FAILED ***
>   Error while stopping stream:
>   query.exception() is not empty after clean stop: org.apache.spark.sql.
> streaming.StreamingQueryException: Writing job failed.
>   === Streaming Query ===
>   Identifier: [id = cdd647ec-d7f0-437b-9950-ce9d79d691d1, runId =
> 3a7cf7ec-670a-48b6-8185-8b6cd7e27f96]
>   Current Committed Offsets: {KafkaSource[Subscribe[topic-4]]:
> {"topic-4":{"2":1,"4":1,"1":0,"3":0,"0":2}}}
>   Current Available Offsets: {}
>
>   Current State: TERMINATED
>   Thread State: RUNNABLE
>
> On Sat, Feb 17, 2018 at 3:41 PM Sameer Agarwal 
> wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 2.3.0. The vote is open until Thursday February 22, 2018 at 8:00:00 am UTC
>> and passes if a majority of at least 3 PMC +1 votes are cast.
>>
>>
>> [ ] +1 Release this package as Apache Spark 2.3.0
>>
>> [ ] -1 Do not release this package because ...
>>
>>
>> To learn more about Apache Spark, please see https://spark.apache.org/
>>
>> The tag to be voted on is v2.3.0-rc4: https://github.com/apache/spark/tree/v2.3.0-rc4 (44095cb65500739695b0324c177c19dfa1471472)
>>
>> List of JIRA tickets resolved in this release can be found here:
>> https://issues.apache.org/jira/projects/SPARK/versions/12339551
>>
>> The release files, including signatures, digests, etc. can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc4-bin/
>>
>> Release artifacts are signed with the following key:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1265/
>>
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc4-docs/_site/index.html
>>
>>
>> FAQ
>>
>> ===
>> What are the unresolved issues targeted for 2.3.0?
>> ===
>>
>> Please see https://s.apache.org/oXKi. At the time of writing, there are
>> currently no known release blockers.
>>
>> =
>> How can I help test this release?
>> =
>>
>> If you are a Spark user, you can help us test this release by taking an
>> existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> If you're working in PySpark you can set up a virtual env and install the
>> current RC and see if anything important breaks, in the Java/Scala you can
>> add the staging repository to your projects resolvers and test with the RC
>> (make sure to clean up the artifact cache before/after so you don't end up
>> building with a out of date RC going forward).
>>
>> ===
>> What should happen to JIRA tickets still targeting 2.3.0?
>> ===
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should be
>> worked on immediately. Everything else please retarget to 2.3.1 or 2.4.0 as
>> appropriate.
>>
>> ===
>> Why is my bug not fixed?
>> ===
>>
>> In order to make timely releases, we will typically not hold the release
>> unless the bug in question is a regression from 2.2.0. That being said, if
>> there is something which is a regression from 2.2.0 and has not been
>> correctly targeted please ping me or a committer to help target the issue
>> (you can see the open issues listed as impacting Spark 2.3.0 at
>> https://s.apache.org/WmoI).
>>
>


Re: [VOTE] Spark 2.1.2 (RC4)

2017-10-07 Thread Ricardo Almeida
+1 (non-binding)

Built and tested on

   - macOS 10.12.5 Java 8 (build 1.8.0_131)
   - Ubuntu 17.04, Java 8 (OpenJDK 1.8.0_111)


On 3 October 2017 at 08:24, Holden Karau  wrote:

> Please vote on releasing the following candidate as Apache Spark version 2
> .1.2. The vote is open until Saturday October 7th at 9:00 PST and passes
> if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 2.1.2
> [ ] -1 Do not release this package because ...
>
>
> To learn more about Apache Spark, please see https://spark.apache.org/
>
> The tag to be voted on is v2.1.2-rc4 (2abaea9e40fce81cd4626498e0f5c28a70917499)
>
> List of JIRA tickets resolved in this release can be found with this
> filter.
> 
>
> The release files, including signatures, digests, etc. can be found at:
> https://home.apache.org/~holden/spark-2.1.2-rc4-bin/
>
> Release artifacts are signed with a key from:
> https://people.apache.org/~holden/holdens_keys.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1252
>
> The documentation corresponding to this release can be found at:
> https://people.apache.org/~holden/spark-2.1.2-rc4-docs/
>
>
> *FAQ*
>
> *How can I help test this release?*
>
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install the
> current RC and see if anything important breaks, in the Java/Scala you
> can add the staging repository to your projects resolvers and test with the
> RC (make sure to clean up the artifact cache before/after so you don't
> end up building with a out of date RC going forward).
>
> *What should happen to JIRA tickets still targeting 2.1.2?*
>
> Committers should look at those and triage. Extremely important bug fixes,
> documentation, and API tweaks that impact compatibility should be worked on
> immediately. Everything else please retarget to 2.1.3.
>
> *But my bug isn't fixed!??!*
>
> In order to make timely releases, we will typically not hold the release
> unless the bug in question is a regression from 2.1.1. That being said if
> there is something which is a regression form 2.1.1 that has not been
> correctly targeted please ping a committer to help target the issue (you
> can see the open issues listed as impacting Spark 2.1.1 & 2.1.2)
>
> *What are the unresolved* issues targeted for 2.1.2?
>
> At this time there are no open unresolved issues.
>
> *Is there anything different about this release?*
>
> This is the first release in awhile not built on the AMPLAB Jenkins. This
> is good because it means future releases can more easily be built and
> signed securely (and I've been updating the documentation in
> https://github.com/apache/spark-website/pull/66 as I progress), however
> the chances of a mistake are higher with any change like this. If there
> something you normally take for granted as correct when checking a release,
> please double check this time :)
>
> *Should I be committing code to branch-2.1?*
>
> Thanks for asking! Please treat this stage in the RC process as "code
> freeze" so bug fixes only. If you're uncertain if something should be back
> ported please reach out. If you do commit to branch-2.1 please tag your
> JIRA issue fix version for 2.1.3 and if we cut another RC I'll move the 2.1.3
> fixed into 2.1.2 as appropriate.
>
> *What happened to RC3?*
>
> Some R+zinc interactions kept it from getting out the door.
> --
> Twitter: https://twitter.com/holdenkarau
>


Re: [VOTE] Apache Spark 2.2.0 (RC6)

2017-07-02 Thread Ricardo Almeida
+1 (non-binding)

Built and tested with -Phadoop-2.7 -Dhadoop.version=2.7.3 -Pyarn -Phive
-Phive-thriftserver -Pscala-2.11 on

   - macOS 10.12.5 Java 8 (build 1.8.0_131)
   - Ubuntu 17.04, Java 8 (OpenJDK 1.8.0_111)
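
For anyone reproducing the run above, the flags translate into roughly the following invocation (a sketch; only the profiles listed in this message are taken from it, the rest is assumed):

# build first, then run the test suite, with the same profiles
./build/mvn -Phadoop-2.7 -Dhadoop.version=2.7.3 -Pyarn -Phive \
  -Phive-thriftserver -Pscala-2.11 -DskipTests clean package
./build/mvn -Phadoop-2.7 -Dhadoop.version=2.7.3 -Pyarn -Phive \
  -Phive-thriftserver -Pscala-2.11 test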





On 1 Jul 2017 02:45, "Michael Armbrust"  wrote:

Please vote on releasing the following candidate as Apache Spark version
2.2.0. The vote is open until Friday, July 7th, 2017 at 18:00 PST and
passes if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 2.2.0
[ ] -1 Do not release this package because ...


To learn more about Apache Spark, please see https://spark.apache.org/

The tag to be voted on is v2.2.0-rc6 (a2c7b2133cfee7fa9abfaa2bfbfb637155466783)

List of JIRA tickets resolved can be found with this filter.

The release files, including signatures, digests, etc. can be found at:
https://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc6-bin/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1245/

The documentation corresponding to this release can be found at:
https://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc6-docs/


*FAQ*

*How can I help test this release?*

If you are a Spark user, you can help us test this release by taking an
existing Spark workload and running on this release candidate, then
reporting any regressions.

*What should happen to JIRA tickets still targeting 2.2.0?*

Committers should look at those and triage. Extremely important bug fixes,
documentation, and API tweaks that impact compatibility should be worked on
immediately. Everything else please retarget to 2.3.0 or 2.2.1.

*But my bug isn't fixed!??!*

In order to make timely releases, we will typically not hold the release
unless the bug in question is a regression from 2.1.1.


Re: [VOTE] Apache Spark 2.2.0 (RC4)

2017-06-07 Thread Ricardo Almeida
+1 (non-binding)

Built and tested with -Phadoop-2.7 -Dhadoop.version=2.7.3 -Pyarn -Phive
-Phive-thriftserver -Pscala-2.11 on

   - Ubuntu 17.04, Java 8 (OpenJDK 1.8.0_111)
   - macOS 10.12.5 Java 8 (build 1.8.0_131)


On 5 June 2017 at 21:14, Michael Armbrust  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 2.2.0. The vote is open until Thurs, June 8th, 2017 at 12:00 PST and
> passes if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 2.2.0
> [ ] -1 Do not release this package because ...
>
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v2.2.0-rc4 (377cfa8ac7ff7a8a6a6d273182e18ea7dc25ce7e)
>
> List of JIRA tickets resolved can be found with this filter.
>
> The release files, including signatures, digests, etc. can be found at:
> http://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc4-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1241/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc4-docs/
>
>
> *FAQ*
>
> *How can I help test this release?*
>
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> *What should happen to JIRA tickets still targeting 2.2.0?*
>
> Committers should look at those and triage. Extremely important bug fixes,
> documentation, and API tweaks that impact compatibility should be worked on
> immediately. Everything else please retarget to 2.3.0 or 2.2.1.
>
> *But my bug isn't fixed!??!*
>
> In order to make timely releases, we will typically not hold the release
> unless the bug in question is a regression from 2.1.1.
>


Re: Expand the Spark SQL programming guide?

2016-12-20 Thread Ricardo Almeida
The examples look great indeed. They seem like a good addition to the existing
documentation.
I understand the UDAF examples don't apply to Python, but is there any
particular reason to skip the Python API altogether in this window functions
documentation?
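
For context, window functions of the kind discussed below are already usable from the Python API even though the guide doesn't show them yet; a minimal sketch (the DataFrame contents and column names are invented for illustration):

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a", 1), ("a", 3), ("b", 2)], ["key", "value"])
# rank rows within each key, ordered by value
w = Window.partitionBy("key").orderBy("value")
df.withColumn("rank", F.rank().over(w)).show()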

On 20 December 2016 at 16:56, Jim Hughes  wrote:

> Hi Anton,
>
> Your example and documentation looks great!  I left some comments
> suggesting a few additions, but the PR in its current state is a great
> improvement!
>
> Thanks,
>
> Jim
>
>
> On 12/18/2016 09:09 AM, Anton Okolnychyi wrote:
>
> Any comments/suggestions are more than welcome.
>
> Thanks,
> Anton
>
> 2016-12-18 15:08 GMT+01:00 Anton Okolnychyi :
>
>> Here is the pull request: 
>> https://github.com/apache/spark/pull/16329
>>
>>
>>
>> 2016-12-16 20:54 GMT+01:00 Jim Hughes < jn...@ccri.com>:
>>
>>> I'd be happy to review a PR.  At the minute, I'm still learning Spark
>>> SQL, so writing documentation might be a bit of a stretch, but reviewing
>>> would be fine.
>>>
>>> Thanks!
>>>
>>>
>>> On 12/16/2016 08:39 AM, Thakrar, Jayesh wrote:
>>>
>>> Yes - that sounds good Anton, I can work on documenting the window
>>> functions.
>>>
>>>
>>>
>>> *From: *Anton Okolnychyi 
>>>  
>>> *Date: *Thursday, December 15, 2016 at 4:34 PM
>>> *To: *Conversant 
>>>  
>>> *Cc: *Michael Armbrust 
>>> , Jim Hughes 
>>> , "dev@spark.apache.org" 
>>>  
>>> *Subject: *Re: Expand the Spark SQL programming guide?
>>>
>>>
>>>
>>> I think it will make sense to show a sample implementation of
>>> UserDefinedAggregateFunction for DataFrames, and an example of the
>>> Aggregator API for typed Datasets.
>>>
>>>
>>>
>>> Jim, what if I submit a PR and you join the review process? I also do
>>> not mind to split this if you want, but it seems to be an overkill for this
>>> part.
>>>
>>>
>>>
>>> Jayesh, shall I skip the window functions part since you are going to
>>> work on that?
>>>
>>>
>>>
>>> 2016-12-15 22:48 GMT+01:00 Thakrar, Jayesh <
>>> jthak...@conversantmedia.com>:
>>>
>>> I too am interested in expanding the documentation for Spark SQL.
>>>
>>> For my work I needed to get some info/examples/guidance on window
>>> functions and have been using
>>> 
>>> https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html
>>> .
>>>
>>> How about divide and conquer?
>>>
>>>
>>>
>>>
>>>
>>> *From: *Michael Armbrust < 
>>> mich...@databricks.com>
>>> *Date: *Thursday, December 15, 2016 at 3:21 PM
>>> *To: *Jim Hughes < jn...@ccri.com>
>>> *Cc: *" dev@spark.apache.org" <
>>> dev@spark.apache.org>
>>> *Subject: *Re: Expand the Spark SQL programming guide?
>>>
>>>
>>>
>>> Pull requests would be welcome for any major missing features in the
>>> guide:
>>> 
>>> https://github.com/apache/spark/blob/master/docs/sql-programming-guide.md
>>>
>>>
>>>
>>> On Thu, Dec 15, 2016 at 11:48 AM, Jim Hughes < 
>>> jn...@ccri.com> wrote:
>>>
>>> Hi Anton,
>>>
>>> I'd like to see this as well.  I've been working on implementing
>>> geospatial user-defined types and functions.  Having examples of
>>> aggregations and window functions would be awesome!
>>>
>>> I did test out implementing a distributed convex hull as a
>>> UserDefinedAggregateFunction, and that seemed to work sensibly.
>>>
>>> Cheers,
>>>
>>> Jim
>>>
>>>
>>>
>>> On 12/15/2016 03:28 AM, Anton Okolnychyi wrote:
>>>
>>> Hi,
>>>
>>>
>>>
>>> I am wondering whether it makes sense to expand the Spark SQL
>>> programming guide with examples of aggregations (including user-defined via
>>> the Aggregator API) and window functions.  For instance, there might be a
>>> separate subsection under "Getting Started" for each functionality.
>>>
>>>
>>>
>>> SPARK-16046 seems to be related but there is no activity for more than 4
>>> months.
>>>
>>>
>>>
>>> Best regards,
>>>
>>> Anton
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>
>
>


Re: SparkUI via proxy

2016-11-25 Thread Ricardo Almeida
Marco,

Depending on your configuration, maybe what you're looking for is:
localhost:4040

Check these two StackOverflow answers:
http://stackoverflow.com/questions/31460079/spark-ui-on-aws-emr
or similar questions. This is not a specific Spark issue.

Please check StackOverflow or post to the User mailing list next time for this
type of question.
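
A minimal sketch of the SSH tunnel + localhost:4040 combination (the user and proxy host names are placeholders; 192.168.1.204:4040 is the UI address from the log message quoted below):

# forward local port 4040, via the proxy host, to the Spark UI on the master
ssh -L 4040:192.168.1.204:4040 user@proxy-host
# then open http://localhost:4040 in a local browser (no SOCKS setting needed)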


On 25 November 2016 at 17:19, marco rocchi <
rocchi.1407...@studenti.uniroma1.it> wrote:

> Thanks for the help.
> I've created my SSH tunnel on port 4040 and set the Firefox SOCKS proxy
> to localhost:4040.
> Now when I run a job I can see the INFO message: "SparkUI activated at
> http://192.168.1.204:4040". But if I open the browser and type localhost
> or http://192.168.1.204:4040, the web UI doesn't appear.
> Where am I going wrong?
> The question may be a silly one, but I have never worked with Spark on a cluster
> :)
>
> Thanks
> Marco
>
> 2016-11-25 10:19 GMT+01:00 Ewan Leith :
>
>> This is more of a question for the spark user’s list, but if you look at
>> FoxyProxy and SSH tunnels it’ll get you going.
>>
>>
>>
>> These instructions from AWS for accessing EMR are a good start
>>
>>
>>
>> http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-ssh-tunnel.html
>>
>>
>>
>> http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-connect-master-node-proxy.html
>>
>>
>>
>> Ewan
>>
>>
>>
>> *From:* Georg Heiler [mailto:georg.kf.hei...@gmail.com]
>> *Sent:* 24 November 2016 16:41
>> *To:* marco rocchi ;
>> dev@spark.apache.org
>> *Subject:* Re: SparkUI via proxy
>>
>>
>>
>> SSH port forwarding will help you out.
>>
>> marco rocchi  schrieb am Do. 24.
>> Nov. 2016 um 16:33:
>>
>> Hi,
>>
>> I'm working with Apache Spark in order to develop my master's thesis. I'm
>> new to Spark and to working with a cluster. I searched the internet but
>> didn't find a way to solve this.
>>
>> My problem is the following one: from my PC I can access a master node
>> of a cluster only via a proxy.
>>
>> To connect to the proxy and then to the master node, I have to set up an SSH
>> tunnel, but from a practical point of view I have no idea how, with this
>> setup, I can interact with the Spark web UI.
>>
>> Anyone can help me?
>>
>> Thanks in advance
>>
>>
>>
>>
>
>


Re: separate spark and hive

2016-11-16 Thread Ricardo Almeida
Great to know about the "spark.sql.catalogImplementation" configuration
property.
I can't find this anywhere but in Jacek Laskowski's "Mastering Apache Spark
2.0" Gitbook.

I guess we should document it on the Spark Configuration page.
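
For the record, the same thing from the Python API would look roughly like this (the config key comes from the quoted thread below; the rest is standard SparkSession usage):

from pyspark.sql import SparkSession

# ask for the in-memory catalog instead of the Hive-backed one
spark = (SparkSession.builder
         .config("spark.sql.catalogImplementation", "in-memory")
         .getOrCreate())
print(spark.conf.get("spark.sql.catalogImplementation"))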

On 15 November 2016 at 11:49, Herman van Hövell tot Westerflier <
hvanhov...@databricks.com> wrote:

> You can start a spark without hive support by setting the spark.sql.
> catalogImplementation configuration to in-memory, for example:
>>
>> ./bin/spark-shell --master local[*] --conf spark.sql.
>> catalogImplementation=in-memory
>
>
> I would not change the default from Hive to Spark-only just yet.
>
> On Tue, Nov 15, 2016 at 9:38 AM, assaf.mendelson 
> wrote:
>
>> After looking at the code, I found that spark.sql.catalogImplementation
>> is set to “hive”. I would proposed that it should be set to “in-memory” by
>> default (or at least have this in the documentation, the configuration
>> documentation at http://spark.apache.org/docs/latest/configuration.html
>> has no mentioning of hive at all)
>>
>> Assaf.
>>
>>
>>
>> *From:* Mendelson, Assaf
>> *Sent:* Tuesday, November 15, 2016 10:11 AM
>> *To:* 'rxin [via Apache Spark Developers List]'
>> *Subject:* RE: separate spark and hive
>>
>>
>>
>> Spark shell (and pyspark) by default create the spark session with hive
>> support (also true when the session is created using getOrCreate, at least
>> in pyspark)
>>
>> At a minimum there should be a way to configure it using
>> spark-defaults.conf
>>
>> Assaf.
>>
>>
>>
>> *From:* rxin [via Apache Spark Developers List] [[hidden email]
>> ]
>> *Sent:* Tuesday, November 15, 2016 9:46 AM
>> *To:* Mendelson, Assaf
>> *Subject:* Re: separate spark and hive
>>
>>
>>
>> If you just start a SparkSession without calling enableHiveSupport it
>> actually won't use the Hive catalog support.
>>
>>
>>
>>
>>
>> On Mon, Nov 14, 2016 at 11:44 PM, Mendelson, Assaf <[hidden email]
>> > wrote:
>>
>> The default generation of spark context is actually a hive context.
>>
>> I tried to find on the documentation what are the differences between
>> hive context and sql context and couldn’t find it for spark 2.0 (I know for
>> previous versions there were a couple of functions which required hive
>> context as well as window functions but those seem to have all been fixed
>> for spark 2.0).
>>
>> Furthermore, I can’t seem to find a way to configure spark not to use
>> hive. I can only find how to compile it without hive (and having to build
>> from source each time is not a good idea for a production system).
>>
>>
>>
>> I would suggest that working without hive should be either a simple
>> configuration or even the default and that if there is any missing
>> functionality it should be documented.
>>
>> Assaf.
>>
>>
>>
>>
>>
>> *From:* Reynold Xin [mailto:[hidden email]
>> ]
>> *Sent:* Tuesday, November 15, 2016 9:31 AM
>> *To:* Mendelson, Assaf
>> *Cc:* [hidden email]
>> 
>> *Subject:* Re: separate spark and hive
>>
>>
>>
>> I agree with the high level idea, and thus SPARK-15691.
>>
>>
>>
>> In reality, it's a huge amount of work to create & maintain a custom
>> catalog. It might actually make sense to do, but it just seems a lot of
>> work to do right now and it'd take a toll on interoperability.
>>
>>
>>
>> If you don't need persistent catalog, you can just run Spark without Hive
>> mode, can't you?
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> On Mon, Nov 14, 2016 at 11:23 PM, assaf.mendelson <[hidden email]
>> > wrote:
>>
>> Hi,
>>
>> Today, we basically force people to use hive if they want to get the full
>> use of spark SQL.
>>
>> When doing the default installation this means that a derby.log and
>> metastore_db directory are created where we run from.
>>
>> The problem with this is that if we run multiple scripts from the same
>> working directory we have a problem.
>>
>> The solution we employ locally is to always run from different directory
>> as we ignore hive in practice (this of course means we lose the ability to
>> use some of the catalog options in spark session).
>>
>> The only other solution is to create a full blown hive installation with
>> proper configuration (probably for a JDBC solution).
>>
>>
>>
>> I would propose that in most cases there shouldn’t be any hive use at
>> all. Even for catalog elements such as saving a permanent table, we should
>> be able to configure a target directory and simply write to it (doing
>> everything file based to avoid the need for locking). Hive should be
>> reserved for those who actually use it (probably for backward
>> compatibility).
>>
>>
>>
>> Am I missing something here?
>>
>> Assaf.
>>
>>

Re: [VOTE] Release Apache Spark 2.0.2 (RC3)

2016-11-08 Thread Ricardo Almeida
+1 (non-binding)

over Ubuntu 16.10, Java 8 (OpenJDK 1.8.0_111) built with Hadoop 2.7.3,
YARN, Hive


On 8 November 2016 at 12:38, Herman van Hövell tot Westerflier <
hvanhov...@databricks.com> wrote:

> +1
>
> On Tue, Nov 8, 2016 at 7:09 AM, Reynold Xin  wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 2.0.2. The vote is open until Thu, Nov 10, 2016 at 22:00 PDT and passes if
>> a majority of at least 3+1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 2.0.2
>> [ ] -1 Do not release this package because ...
>>
>>
>> The tag to be voted on is v2.0.2-rc3 (584354eaac02531c9584188b143367ba694b0c34)
>>
>> This release candidate resolves 84 issues: https://s.apache.org/spark-2.0.2-jira
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc3-bin/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1214/
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc3-docs/
>>
>>
>> Q: How can I help test this release?
>> A: If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running on this release candidate, then
>> reporting any regressions from 2.0.1.
>>
>> Q: What justifies a -1 vote for this release?
>> A: This is a maintenance release in the 2.0.x series. Bugs already
>> present in 2.0.1, missing features, or bugs related to new features will
>> not necessarily block this release.
>>
>> Q: What fix version should I use for patches merging into branch-2.0 from
>> now on?
>> A: Please mark the fix version as 2.0.3, rather than 2.0.2. If a new RC
>> (i.e. RC4) is cut, I will change the fix version of those patches to 2.0.2.
>>
>
>


Re: Handling questions in the mailing lists

2016-11-07 Thread Ricardo Almeida
ys then add the medium tag etc. Downvote people who
> don’t go by the process. This would mean that committers for example can
> look at advanced only tag and have a manageable number of questions they
> can help with while others can answer medium and basic.
>
>
>
> I agree that some things are not good for SO. Basically stuff which asks
> for opinion is such but most cases in the mailing list are either “how do I
> solve this bug” or “how do I do X”. Either of those two are good for SO.
>
>
>
>
>
> Assaf.
>
>
>
>
>
>
>
> *From:* rxin [via Apache Spark Developers List] [mailto:ml-node+[hidden
> email]]
> *Sent:* Monday, November 07, 2016 8:33 AM
> *To:* Mendelson, Assaf
> *Subject:* Re: Handling questions in the mailing lists
>
>
>
> This is an excellent point. If we do go ahead and feature SO as a way for
> users to ask questions more prominently, as someone who knows SO very well,
> would you be willing to help write a short guideline (ideally the shorter
> the better, which makes it hard) to direct what goes to user@ and what
> goes to SO?
>
>
>
>
>
> On Sun, Nov 6, 2016 at 9:54 PM, Maciej Szymkiewicz <[hidden email]> wrote:
>
> Damn, I always thought that mailing list is only for nice and welcoming
> people and there is nothing to do for me here >:)
>
> To be serious though, there are many questions on the users list which
> would fit just fine on SO but it is not true in general. There are dozens
> of questions which are to broad, opinion based, ask for external resources
> and so on. If you want to direct users to SO you have to help them to
> decide if it is the right channel. Otherwise it will just create a really
> bad experience for both seeking help and active answerers. Former ones will
> be downvoted and bashed, latter ones will have to deal with handling all
> the junk and the number of active Spark users with moderation privileges is
> really low (with only Massg and me being able to directly close duplicates).
>
> Believe me, I've seen this before.
>
> On 11/07/2016 05:08 AM, Reynold Xin wrote:
>
> You have substantially underestimated how opinionated people can be on
> mailing lists too :)
>
> On Sunday, November 6, 2016, Maciej Szymkiewicz <[hidden email]> wrote:
>
> You have to remember that Stack Overflow crowd (like me) is highly
> opinionated, so many questions, which could be just fine on the mailing
> list, will be quickly downvoted and / or closed as off-topic. Just
> saying...
>
> --
>
> Best,
>
> Maciej
>
>
>
> On 11/07/2016 04:03 AM, Reynold Xin wrote:
>
> OK I've checked on the ASF member list (which is private so there is no
> public archive).
>
>
>
> It is not against any ASF rule to recommend StackOverflow as a place for
> users to ask questions. I don't think we can or should delete the existing
> user@spark list either, but we can certainly make SO more visible than it
> is.
>
>
>
>
>
>
>
> On Wed, Nov 2, 2016 at 10:21 AM, Reynold Xin <[hidden email]> wrote:
>
> Actually after talking with more ASF members, I believe the only policy is
> that development decisions have to be made and announced on ASF properties
> (dev list or jira), but user questions don't have to.
>
>
>
> I'm going to double check this. If it is true, I would actually recommend
> us moving entirely over the Q part of the user list to stackoverflow, or
> at least make that the recommended way rather than the existing user list
> which is not very scalable.
>
>
>
> On Wednesday, November 2, 2016, Nicholas Chammas <[hidden email]> wrote:
>
> We’ve discussed several times upgrading our communication tools, as far
> back as 2014 and maybe even before that too. The bottom line is that we
> can’t due to ASF rules requiring the use of ASF-managed mailing lists.
>
> For some history, see this discussion:
>
> 1.  https://mail-archives.apache.org/mod_mbox/spark-user/201412.mbox/%
> 3CCAOhmDzfL2COdysV8r5hZN8f=NqXM=f=oY5NO2dHWJ_kVEoP+Ng@...%3E
>
> 2.  https://mail-archives.apache.org/mod_mbox/spark-user/201501.mbox/%
> 3CCAOhmDzec1JdsXQq3dDwAv7eLnzRidSkrsKKG0xKw=TKTxY_sYw@...%3E

Re: [VOTE] Release Apache Spark 1.6.3 (RC2)

2016-11-04 Thread Ricardo Almeida
+1 (non-binding)

tested over Ubuntu / OpenJDK 1.8.0_111

On 4 November 2016 at 10:00, Sean Owen  wrote:

> Likewise, ran my usual tests on Ubuntu with 
> yarn/hive/hive-thriftserver/hadoop-2.6
> on JDK 8 and all passed. Sigs and licenses are OK. +1
>
>
> On Thu, Nov 3, 2016 at 7:57 PM Herman van Hövell tot Westerflier <
> hvanhov...@databricks.com> wrote:
>
>> +1
>>
>> On Thu, Nov 3, 2016 at 6:58 PM, Michael Armbrust 
>> wrote:
>>
>> +1
>>
>> On Wed, Nov 2, 2016 at 5:40 PM, Reynold Xin  wrote:
>>
>> Please vote on releasing the following candidate as Apache Spark version
>> 1.6.3. The vote is open until Sat, Nov 5, 2016 at 18:00 PDT and passes if a
>> majority of at least 3+1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 1.6.3
>> [ ] -1 Do not release this package because ...
>>
>>
>> The tag to be voted on is v1.6.3-rc2 (1e860747458d74a4ccbd081103a0542a2367b14b)
>>
>> This release candidate addresses 52 JIRA tickets:
>> https://s.apache.org/spark-1.6.3-jira
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.3-rc2-bin/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1212/
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.3-rc2-docs/
>>
>>
>> ===
>> == How can I help test this release?
>> ===
>> If you are a Spark user, you can help us test this release by taking an
>> existing Spark workload and running on this release candidate, then
>> reporting any regressions from 1.6.2.
>>
>> 
>> == What justifies a -1 vote for this release?
>> 
>> This is a maintenance release in the 1.6.x series.  Bugs already present
>> in 1.6.2, missing features, or bugs related to new features will not
>> necessarily block this release.
>>
>>
>>
>>


Re: Handling questions in the mailing lists

2016-11-02 Thread Ricardo Almeida
I feel Assaf's point is quite relevant if we want to move this project
forward from the Spark user perspective (as I do). In fact, we're still
using 20th-century tools (mailing lists) with some add-ons (like Stack
Overflow).

As usual, Sean's and Cody's contributions are very much to the point.
I feel it is indeed a matter of culture (hard to enforce) and tools
(much easier). Isn't it?

On 2 November 2016 at 16:36, Cody Koeninger  wrote:

> So concrete things people could do
>
> - users could tag subject lines appropriately to the component they're
> asking about
>
> - contributors could monitor user@ for tags relating to components
> they've worked on.
> I'd be surprised if my miss rate for any mailing list questions
> well-labeled as Kafka was higher than 5%
>
> - committers could be more aggressive about soliciting and merging PRs
> to improve documentation.
> It's a lot easier to answer even poorly-asked questions with a link to
> relevant docs.
>
> On Wed, Nov 2, 2016 at 7:39 AM, Sean Owen  wrote:
> > There's already reviews@ and issues@. dev@ is for project development
> itself
> > and I think is OK. You're suggesting splitting up user@ and I sympathize
> > with the motivation. Experience tells me that we'll have a beginner@
> that's
> > then totally ignored, and people will quickly learn to post to advanced@
> to
> > get attention, and we'll be back where we started. Putting it in JIRA
> > doesn't help. I don't think this a problem that is merely down to lack of
> > process. It actually requires cultivating a culture change on the
> community
> > list.
> >
> > On Wed, Nov 2, 2016 at 12:11 PM Mendelson, Assaf <
> assaf.mendel...@rsa.com>
> > wrote:
> >>
> >> What I am suggesting is basically to fix that.
> >>
> >> For example, we might say that mailing list A is only for voting,
> mailing
> >> list B is only for PR and have something like stack overflow for
> developer
> >> questions (I would even go as far as to have beginner, intermediate and
> >> advanced mailing list for users and beginner/advanced for dev).
> >>
> >>
> >>
> >> This can easily be done using stack overflow tags, however, that would
> >> probably be harder to manage.
> >>
> >> Maybe using special jira tags and manage it in jira?
> >>
> >>
> >>
> >> Anyway as I said, the main issue is not user questions (except maybe
> >> advanced ones) but more for dev questions. It is so easy to get lost in
> the
> >> chatter that it makes it very hard for people to learn spark internals…
> >>
> >> Assaf.
> >>
> >>
> >>
> >> From: Sean Owen [mailto:so...@cloudera.com]
> >> Sent: Wednesday, November 02, 2016 2:07 PM
> >> To: Mendelson, Assaf; dev@spark.apache.org
> >> Subject: Re: Handling questions in the mailing lists
> >>
> >>
> >>
> >> I think that unfortunately mailing lists don't scale well. This one has
> >> thousands of subscribers with different interests and levels of
> experience.
> >> For any given person, most messages will be irrelevant. I also find
> that a
> >> lot of questions on user@ are not well-asked, aren't an SSCCE
> >> (http://sscce.org/), not something most people are going to bother
> replying
> >> to even if they could answer. I almost entirely ignore user@ because
> there
> >> are higher-priority channels like PRs to deal with, that already have
> >> hundreds of messages per day. This is why little of it gets an answer
> -- too
> >> noisy.
> >>
> >>
> >>
> >> We have to have official mailing lists, in any event, to have some
> >> official channel for things like votes and announcements. It's not
> wrong to
> >> ask questions on user@ of course, but a lot of the questions I see
> could
> >> have been answered with research of existing docs or looking at the
> code. I
> >> think that given the scale of the list, it's not wrong to assert that
> this
> >> is sort of a prerequisite for asking thousands of people to answer one's
> >> question. But we can't enforce that.
> >>
> >>
> >>
> >> The situation will get better to the extent people ask better questions,
> >> help other people ask better questions, and answer good questions. I'd
> >> encourage anyone feeling this way to try to help along those dimensions.
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >> On Wed, Nov 2, 2016 at 11:32 AM assaf.mendelson <
> assaf.mendel...@rsa.com>
> >> wrote:
> >>
> >> Hi,
> >>
> >> I know this is a little off topic but I wanted to raise an issue about
> >> handling questions in the mailing list (this is true both for the user
> >> mailing list and the dev but since there are other options such as stack
> >> overflow for user questions, this is more problematic in dev).
> >>
> >> Let’s say I ask a question (as I recently did). Unfortunately this was
> >> during spark summit in Europe so probably people were busy. In any case
> no
> >> one answered.
> >>
> >> The problem is, that if no one answers very soon, the question will
> almost
> >> certainly remain unanswered because new messages will 

Re: [VOTE] Release Apache Spark 2.0.2 (RC1)

2016-10-27 Thread Ricardo Almeida
+1 (non-binding)

built and tested without regressions from 2.0.1.



On 27 October 2016 at 19:07, vaquar khan  wrote:

> +1
>
>
>
> On Thu, Oct 27, 2016 at 11:56 AM, Davies Liu 
> wrote:
>
>> +1
>>
>> On Thu, Oct 27, 2016 at 12:18 AM, Reynold Xin 
>> wrote:
>> > Greetings from Spark Summit Europe at Brussels.
>> >
>> > Please vote on releasing the following candidate as Apache Spark version
>> > 2.0.2. The vote is open until Sun, Oct 30, 2016 at 00:30 PDT and passes
>> if a
>> > majority of at least 3+1 PMC votes are cast.
>> >
>> > [ ] +1 Release this package as Apache Spark 2.0.2
>> > [ ] -1 Do not release this package because ...
>> >
>> >
>> > The tag to be voted on is v2.0.2-rc1
>> > (1c2908eeb8890fdc91413a3f5bad2bb3d114db6c)
>> >
>> > This release candidate resolves 75 issues:
>> > https://s.apache.org/spark-2.0.2-jira
>> >
>> > The release files, including signatures, digests, etc. can be found at:
>> > http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc1-bin/
>> >
>> > Release artifacts are signed with the following key:
>> > https://people.apache.org/keys/committer/pwendell.asc
>> >
>> > The staging repository for this release can be found at:
>> > https://repository.apache.org/content/repositories/orgapachespark-1208/
>> >
>> > The documentation corresponding to this release can be found at:
>> > http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc1-docs/
>> >
>> >
>> > Q: How can I help test this release?
>> > A: If you are a Spark user, you can help us test this release by taking
>> an
>> > existing Spark workload and running on this release candidate, then
>> > reporting any regressions from 2.0.1.
>> >
>> > Q: What justifies a -1 vote for this release?
>> > A: This is a maintenance release in the 2.0.x series. Bugs already
>> present
>> > in 2.0.1, missing features, or bugs related to new features will not
>> > necessarily block this release.
>> >
>> > Q: What fix version should I use for patches merging into branch-2.0
>> from
>> > now on?
>> > A: Please mark the fix version as 2.0.3, rather than 2.0.2. If a new RC
>> > (i.e. RC2) is cut, I will change the fix version of those patches to
>> 2.0.2.
>> >
>> >
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>
>
> --
> Regards,
> Vaquar Khan
> +1 -224-436-0783
>
> IT Architect / Lead Consultant
> Greater Chicago
>


Re: [VOTE] Release Apache Spark 2.0.1 (RC4)

2016-09-29 Thread Ricardo Almeida
+1 (non-binding)

Built (-Phadoop-2.7 -Dhadoop.version=2.7.3 -Phive -Phive-thriftserver
-Pyarn) and tested on:
- Ubuntu 16.04 / OpenJDK 1.8.0_91
- CentOS / Oracle Java 1.7.0_55

No regressions from 2.0.0 found while running our workloads (Python API)


On 29 September 2016 at 08:10, Reynold Xin  wrote:

> I will kick it off with my own +1.
>
>
> On Wed, Sep 28, 2016 at 7:14 PM, Reynold Xin  wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 2.0.1. The vote is open until Sat, Oct 1, 2016 at 20:00 PDT and passes if a
>> majority of at least 3+1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 2.0.1
>> [ ] -1 Do not release this package because ...
>>
>>
>> The tag to be voted on is v2.0.1-rc4 (933d2c1ea4e5f5c4ec8d375b5ccaa4577ba4be38)
>>
>> This release candidate resolves 301 issues:
>> https://s.apache.org/spark-2.0.1-jira
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-2.0.1-rc4-bin/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1203/
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-2.0.1-rc4-docs/
>>
>>
>> Q: How can I help test this release?
>> A: If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running on this release candidate, then
>> reporting any regressions from 2.0.0.
>>
>> Q: What justifies a -1 vote for this release?
>> A: This is a maintenance release in the 2.0.x series.  Bugs already
>> present in 2.0.0, missing features, or bugs related to new features will
>> not necessarily block this release.
>>
>> Q: What fix version should I use for patches merging into branch-2.0 from
>> now on?
>> A: Please mark the fix version as 2.0.2, rather than 2.0.1. If a new RC
>> (i.e. RC5) is cut, I will change the fix version of those patches to 2.0.1.
>>
>>
>>
>


Re: [VOTE] Release Apache Spark 2.0.1 (RC3)

2016-09-25 Thread Ricardo Almeida
+1 (non-binding)

Built and tested on
- Ubuntu 16.04 / OpenJDK 1.8.0_91
- CentOS / Oracle Java 1.7.0_55
(-Phadoop-2.7 -Dhadoop.version=2.7.3 -Phive -Phive-thriftserver -Pyarn)


On 25 September 2016 at 22:35, Matei Zaharia 
wrote:

> +1
>
> Matei
>
> On Sep 25, 2016, at 1:25 PM, Josh Rosen  wrote:
>
> +1
>
> On Sun, Sep 25, 2016 at 1:16 PM Yin Huai  wrote:
>
>> +1
>>
>> On Sun, Sep 25, 2016 at 11:40 AM, Dongjoon Hyun 
>> wrote:
>>
>>> +1 (non binding)
>>>
>>> RC3 is compiled and tested on the following two systems, too. All tests
>>> passed.
>>>
>>> * CentOS 7.2 / Oracle JDK 1.8.0_77 / R 3.3.1
>>>with -Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive -Phive-thriftserver
>>> -Dsparkr
>>> * CentOS 7.2 / Open JDK 1.8.0_102
>>>with -Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive -Phive-thriftserver
>>>
>>> Cheers,
>>> Dongjoon
>>>
>>>
>>>
>>> On Saturday, September 24, 2016, Reynold Xin 
>>> wrote:
>>>
 Please vote on releasing the following candidate as Apache Spark
 version 2.0.1. The vote is open until Tue, Sep 27, 2016 at 15:30 PDT and
 passes if a majority of at least 3+1 PMC votes are cast.

 [ ] +1 Release this package as Apache Spark 2.0.1
 [ ] -1 Do not release this package because ...


 The tag to be voted on is v2.0.1-rc3 (9d28cc10357a8afcfb2fa2e6eecb5c2cc2730d17)

 This release candidate resolves 290 issues:
 https://s.apache.org/spark-2.0.1-jira

 The release files, including signatures, digests, etc. can be found at:
 http://people.apache.org/~pwendell/spark-releases/spark-2.0.1-rc3-bin/

 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/pwendell.asc

 The staging repository for this release can be found at:
 https://repository.apache.org/content/repositories/orgapachespark-1201/

 The documentation corresponding to this release can be found at:
 http://people.apache.org/~pwendell/spark-releases/spark-2.0.1-rc3-docs/


 Q: How can I help test this release?
 A: If you are a Spark user, you can help us test this release by taking
 an existing Spark workload and running on this release candidate, then
 reporting any regressions from 2.0.0.

 Q: What justifies a -1 vote for this release?
 A: This is a maintenance release in the 2.0.x series.  Bugs already
 present in 2.0.0, missing features, or bugs related to new features will
 not necessarily block this release.

 Q: What fix version should I use for patches merging into branch-2.0
 from now on?
 A: Please mark the fix version as 2.0.2, rather than 2.0.1. If a new RC
 (i.e. RC4) is cut, I will change the fix version of those patches to 2.0.1.



>>
>


Re: [VOTE] Release Apache Spark 2.0.1 (RC2)

2016-09-23 Thread Ricardo Almeida
+1 (non-binding)

Build:
OK, but can no longer use the "--tgz" option when
calling make-distribution.sh (maybe a problem on my side?)

Run:
No regressions from 2.0.0 detected. Tested our pipelines on a standalone
cluster (Python API)
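
For reference, the distribution build mentioned above is normally driven through dev/make-distribution.sh; a sketch of a typical invocation (the profiles are assumptions based on the build flags reported in earlier votes, not taken from this message):

./dev/make-distribution.sh --tgz -Phadoop-2.7 -Phive -Phive-thriftserver -Pyarn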



On 23 September 2016 at 08:01, Reynold Xin  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 2.0.1. The vote is open until Sunday, Sep 25, 2016 at 23:59 PDT and passes
> if a majority of at least 3+1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 2.0.1
> [ ] -1 Do not release this package because ...
>
>
> The tag to be voted on is v2.0.1-rc2 (04141ad49806a48afccc236b699827997142bd57)
>
> This release candidate resolves 284 issues: https://s.apache.org/spark-2.0.1-jira
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.1-rc2-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1199
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.1-rc2-docs/
>
>
> Q: How can I help test this release?
> A: If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions from 2.0.0.
>
> Q: What justifies a -1 vote for this release?
> A: This is a maintenance release in the 2.0.x series.  Bugs already
> present in 2.0.0, missing features, or bugs related to new features will
> not necessarily block this release.
>
> Q: What happened to 2.0.1 RC1?
> A: There was an issue with RC1 R documentation during release candidate
> preparation. As a result, rc1 was canceled before a vote was called.
>
>


Spark Streaming - Twitter on Python current status

2016-05-28 Thread Ricardo Almeida
As far as I could understand...
1. Using Python (PySpark), the use of Twitter Streaming (TwitterUtils) as
well as Custom Receivers is restricted to the Scala and Java APIs on Spark
1.6.1;
2. Maven linking of Twitter/spark-streaming-twitter_2.10 is being removed
from the Spark Streaming core Scala/Java API (Maven Linking);
3. There are no plans to support a Twitter Streaming Python API (pyspark) on
Spark 2.0 or later;
4. Twitter API usage from Python (Spark pyspark) is, and will continue to
be, restricted to the Twitter REST API.


Could you please confirm whether these 4 assumptions are correct, or is
there any alternative way of using the Twitter Streaming API from PySpark on
Spark 2.0?


Re: [vote] Apache Spark 2.0.0-preview release (rc1)

2016-05-20 Thread Ricardo Almeida
+1


Ricardo Almeida

On 20 May 2016 at 18:33, Mark Hamstra <m...@clearstorydata.com> wrote:

> This isn't yet a release candidate since, as Reynold mentioned in his
> opening post, preview releases are "not meant to be functional, i.e. they
> can and highly likely will contain critical bugs or documentation errors."
>  Once we're at the point where we expect there not to be such bugs and
> errors, then the release candidates will start.
>
> On Fri, May 20, 2016 at 4:40 AM, Ross Lawley <ross.law...@gmail.com>
> wrote:
>
>> +1 Having an rc1 would help me get stable feedback on using my library
>> with Spark, compared to relying on 2.0.0-SNAPSHOT.
>>
>>
>> On Fri, 20 May 2016 at 05:57 Xiao Li <gatorsm...@gmail.com> wrote:
>>
>>> Changed my vote to +1. Thanks!
>>>
>>> 2016-05-19 13:28 GMT-07:00 Xiao Li <gatorsm...@gmail.com>:
>>>
>>>> Will do. Thanks!
>>>>
>>>> 2016-05-19 13:26 GMT-07:00 Reynold Xin <r...@databricks.com>:
>>>>
>>>>> Xiao, thanks for posting. Please file a bug in JIRA. Again, as I said in
>>>>> the email, this is not meant to be a functional release and will contain
>>>>> bugs.
>>>>>
>>>>> On Thu, May 19, 2016 at 1:20 PM, Xiao Li <gatorsm...@gmail.com> wrote:
>>>>>
>>>>>> -1
>>>>>>
>>>>>> Unable to use the Hive metastore in the pyspark shell. Tried both
>>>>>> HiveContext and SparkSession. Both failed. It always uses the in-memory
>>>>>> catalog. Has anybody else hit the same issue?
>>>>>>
>>>>>>
>>>>>> Method 1: SparkSession
>>>>>>
>>>>>> >>> from pyspark.sql import SparkSession
>>>>>>
>>>>>> >>> spark = SparkSession.builder.enableHiveSupport().getOrCreate()
>>>>>>
>>>>>> >>>
>>>>>>
>>>>>> >>> spark.sql("CREATE TABLE IF NOT EXISTS src (key INT, value
>>>>>> STRING)")
>>>>>>
>>>>>> DataFrame[]
>>>>>>
>>>>>> >>> spark.sql("LOAD DATA LOCAL INPATH
>>>>>> 'examples/src/main/resources/kv1.txt' INTO TABLE src")
>>>>>>
>>>>>> Traceback (most recent call last):
>>>>>>
>>>>>>   File "", line 1, in 
>>>>>>
>>>>>>   File
>>>>>> "/Users/xiaoli/IdeaProjects/sparkDelivery/python/pyspark/sql/session.py",
>>>>>> line 494, in sql
>>>>>>
>>>>>> return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
>>>>>>
>>>>>>   File
>>>>>> "/Users/xiaoli/IdeaProjects/sparkDelivery/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py",
>>>>>> line 933, in __call__
>>>>>>
>>>>>>   File
>>>>>> "/Users/xiaoli/IdeaProjects/sparkDelivery/python/pyspark/sql/utils.py",
>>>>>> line 57, in deco
>>>>>>
>>>>>> return f(*a, **kw)
>>>>>>
>>>>>>   File
>>>>>> "/Users/xiaoli/IdeaProjects/sparkDelivery/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py",
>>>>>> line 312, in get_return_value
>>>>>>
>>>>>> py4j.protocol.Py4JJavaError: An error occurred while calling o21.sql.
>>>>>>
>>>>>> : java.lang.UnsupportedOperationException: loadTable is not
>>>>>> implemented
>>>>>>
>>>>>> at
>>>>>> org.apache.spark.sql.catalyst.catalog.InMemoryCatalog.loadTable(InMemoryCatalog.scala:297)
>>>>>>
>>>>>> at
>>>>>> org.apache.spark.sql.catalyst.catalog.SessionCatalog.loadTable(SessionCatalog.scala:280)
>>>>>>
>>>>>> at
>>>>>> org.apache.spark.sql.execution.command.LoadData.run(tables.scala:263)
>>>>>>
>>>>>> at
>>>>>> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:57)
>>>>>>
>>>>>> at
>>>>>> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:55)
>>&
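
For reference, a minimal sketch of how one might confirm which catalog
implementation a 2.0-style PySpark session actually picked up before issuing
Hive-dependent DDL such as LOAD DATA. It assumes a Spark build that includes
Hive support on the classpath; the app name and fallback value are
illustrative, and this only diagnoses the in-memory-catalog fallback reported
above rather than fixing it.

    from pyspark.sql import SparkSession

    # Build (or reuse) a session with Hive support requested.
    spark = (SparkSession.builder
             .appName("hive-catalog-check")
             .enableHiveSupport()
             .getOrCreate())

    # "hive" means the metastore-backed catalog is active; "in-memory" means
    # the session fell back to the in-memory catalog, and commands like
    # LOAD DATA will fail as in the traceback above.
    impl = spark.sparkContext.getConf().get(
        "spark.sql.catalogImplementation", "in-memory")
    print("catalog implementation: " + impl)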

Re: [discuss] using deep learning to improve Spark

2016-04-01 Thread Ricardo Almeida
Amazing! I'll fund $1/2 million for such an interesting initiative.
Oh, wait... I have only $4 in my pocket

Cheers :)

On 1 April 2016 at 11:40, Takeshi Yamamuro  wrote:

> Oh, the annual event...
>
> On Fri, Apr 1, 2016 at 4:37 PM, Xiao Li  wrote:
>
>> April 1st... : )
>>
>> 2016-04-01 0:33 GMT-07:00 Michael Malak :
>>
>>> I see you've been burning the midnight oil.
>>>
>>>
>>> --
>>> *From:* Reynold Xin 
>>> *To:* "dev@spark.apache.org" 
>>> *Sent:* Friday, April 1, 2016 1:15 AM
>>> *Subject:* [discuss] using deep learning to improve Spark
>>>
>>> Hi all,
>>>
>>> Hope you all enjoyed the Tesla 3 unveiling earlier tonight.
>>>
>>> I'd like to bring your attention to a project called DeepSpark that we
>>> have been working on for the past three years. We realized that scaling
>>> software development was challenging. A large fraction of software
>>> engineering has been manual and mundane: writing test cases, fixing bugs,
>>> implementing features according to specs, and reviewing pull requests. So
>>> we started this project to see how much we could automate.
>>>
>>> After three years of development and one year of testing, we now have
>>> enough confidence that this could work well in practice. For example, Matei
>>> confessed to me today: "It looks like DeepSpark has a better understanding
>>> of Spark internals than I ever will. It updated several pieces of code I
>>> wrote long ago that even I no longer understood.”
>>>
>>>
>>> I think it's time to discuss as a community about how we want to
>>> continue this project to ensure Spark is stable, secure, and easy to use
>>> yet able to progress as fast as possible. I'm still working on a more
>>> formal design doc, and it might take a little bit more time since I haven't
>>> been able to fully grasp DeepSpark's capabilities yet. Based on my
>>> understanding right now, I've written a blog post about DeepSpark here:
>>> https://databricks.com/blog/2016/04/01/unreasonable-effectiveness-of-deep-learning-on-spark.html
>>>
>>>
>>> Please take a look and share your thoughts. Obviously, this is an
>>> ambitious project and could take many years to fully implement. One major
>>> challenge is cost. The current Spark Jenkins infrastructure provided by the
>>> AMPLab has only 8 machines, but DeepSpark uses 12000 machines. I'm not sure
>>> whether AMPLab or Databricks can fund DeepSpark's operation for a long
>>> period of time. Perhaps AWS can help out here. Let me know if you have
>>> other ideas.
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>
>
>
> --
> ---
> Takeshi Yamamuro
>


Re: [VOTE] Release Apache Spark 1.6.0 (RC4)

2015-12-25 Thread Ricardo Almeida
+1 (non-binding)
Tested Python API, Spark Core, Spark SQL, and Spark MLlib on a standalone
cluster.






Re: [VOTE] Release Apache Spark 1.6.0 (RC2)

2015-12-14 Thread Ricardo Almeida
+1 (non-binding)

Tested our workloads on a standalone cluster:
- Spark Core
- Spark SQL
- Spark MLlib
- Python API



On 12 December 2015 at 18:39, Michael Armbrust 
wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 1.6.0!
>
> The vote is open until Tuesday, December 15, 2015 at 6:00 UTC and passes
> if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.6.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is *v1.6.0-rc2
> (23f8dfd45187cb8f2216328ab907ddb5fbdffd0b)
> *
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc2-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1169/
>
> The test repository (versioned as v1.6.0-rc2) for this release can be
> found at:
> https://repository.apache.org/content/repositories/orgapachespark-1168/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc2-docs/
>
> ===
> == How can I help test this release? ==
> ===
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> 
> == What justifies a -1 vote for this release? ==
> 
> This vote is happening towards the end of the 1.6 QA period, so -1 votes
> should only occur for significant regressions from 1.5. Bugs already
> present in 1.5, minor regressions, or bugs related to new features will not
> block this release.
>
> ===
> == What should happen to JIRA tickets still targeting 1.6.0? ==
> ===
> 1. It is OK for documentation patches to target 1.6.0 and still go into
> branch-1.6, since documentation will be published separately from the
> release.
> 2. New features for non-alpha-modules should target 1.7+.
> 3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the target
> version.
>
>
> ==
> == Major changes to help you focus your testing ==
> ==
>
> Spark 1.6.0 Preview
>
> Notable changes since 1.6 RC1
>
> Spark Streaming
>
>- SPARK-2629  
>trackStateByKey has been renamed to mapWithState
>
> Spark SQL
>
>- SPARK-12165 
>SPARK-12189  Fix
>bugs in eviction of storage memory by execution.
>- SPARK-12258  correct
>passing null into ScalaUDF
>
> Notable Features Since 1.5
>
> Spark SQL
>
>- SPARK-11787  Parquet
>Performance - Improve Parquet scan performance when using flat schemas.
>- SPARK-10810 
>    Session Management - Isolated default database (i.e. USE mydb) even on
>shared clusters.
>- SPARK-   Dataset
>API - A type-safe API (similar to RDDs) that performs many operations
>on serialized binary data and code generation (i.e. Project Tungsten).
>- SPARK-1  Unified
>Memory Management - Shared memory for execution and caching instead of
>exclusive division of the regions.
>- SPARK-11197  SQL
>Queries on Files - Concise syntax for running SQL queries over files
>    of any supported format without registering a table (see the sketch
>    after this list).
>- SPARK-11745  Reading
>non-standard JSON files - Added options to read non-standard JSON
>files (e.g. single-quotes, unquoted attributes)
>- SPARK-10412  
> Per-operator
>    Metrics for SQL Execution - Display statistics on a per-operator basis
>for memory usage and spilled data size.
>- SPARK-11329  Star
>    (*) expansion for StructTypes - Makes it easier to nest and unnest
>arbitrary numbers of columns
>- SPARK-10917 ,
>SPARK-11149
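
As an illustration of the SPARK-11197 item above, here is a minimal PySpark
sketch of the files-as-tables syntax, assuming a Spark 1.6 shell where
sqlContext is already defined and a Parquet file exists at the (hypothetical)
path below:

    # Query a file directly, without registering a temporary table first.
    parquet_path = "/tmp/events.parquet"
    df = sqlContext.sql("SELECT * FROM parquet.`%s`" % parquet_path)
    df.show()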