Re: ASF board report for November 2019

2019-11-11 Thread Matei Zaharia
Good catch, thanks.

> On Nov 11, 2019, at 6:46 PM, Jungtaek Lim wrote:
> 
> nit: - The latest committer was added on Sept 4th, 2019 (Dongjoon Hyun). <= 
> s/committer/PMC member
> 
> Thanks,
> Jungtaek Lim (HeartSaVioR)
> 
> On Tue, Nov 12, 2019 at 11:38 AM Matei Zaharia wrote:
> Hi all,
> 
> It’s time to send our quarterly report to the ASF board. Here is my draft — 
> please feel free to suggest any changes.
> 
> 
> 
> Apache Spark is a fast and general engine for large-scale data processing. It
> offers high-level APIs in Java, Scala, Python and R as well as a rich set of
> libraries including stream processing, machine learning, and graph analytics.
> 
> Project status:
> 
> - We made the first preview release for Spark 3.0 on November 6th. This
>   release aims to get early feedback on the new APIs and functionality
>   targeting Spark 3.0 but does not provide API or stability guarantees. We
>   encourage community members to try this release and leave feedback on
>   JIRA. More info about what’s new and how to report feedback is found at
>   https://spark.apache.org/news/spark-3.0.0-preview.html.
> 
> - We published Spark 2.4.4 and 2.3.4 as maintenance releases to fix bugs
>   in the 2.4 and 2.3 branches.
> 
> - We added one new PMC member and six committers to the project
>   in August and September, covering data sources, streaming, SQL, ML
>   and other components of the project.
> 
> Trademarks:
> 
> - Nothing new to report since August.
> 
> Latest releases:
> 
> - Spark 3.0.0-preview was released on Nov 6th, 2019.
> - Spark 2.3.4 was released on Sept 9th, 2019.
> - Spark 2.4.4 was released on Sept 1st, 2019.
> 
> Committers and PMC:
> 
> - The latest committer was added on Sept 4th, 2019 (Dongjoon Hyun).
> - The latest committer was added on Sept 9th, 2019 (Weichen Xu). We
>   also added Ryan Blue, L.C. Hsieh, Gengliang Wang, Yuming Wang and
>   Ruifeng Zhang as committers in the past three months.
> 
> 
> 



Re: Ask for ARM CI for spark

2019-11-11 Thread Tianhua huang
Hi all,

The Spark ARM jobs have been building for some time, and there are now two
jobs[1]: spark-master-test-maven-arm and spark-master-test-python-arm. We can
see some build failures, but they are due to the poor performance of the ARM
instance. We have now begun building the Spark ARM jobs on other
high-performance instances, and the builds/tests all succeed; we plan to
donate the instance to AMPLab later. Based on the build history, we are very
happy to say that Spark is supported on the aarch64 platform, and I suggest
adding this good news to the Spark 3.0.0 release notes. Perhaps the community
could also provide an ARM-supported release of Spark in the meantime?

[1]
https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-arm/
https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-python-arm/

PS: the JIRA https://issues.apache.org/jira/browse/SPARK-29106 tracks the
whole effort. Thank you very much, Shane :)

On Thu, Oct 17, 2019 at 2:52 PM bo zhaobo 
wrote:

> Just a note: the JIRA issue link is
> https://issues.apache.org/jira/browse/SPARK-29106
>
>
> On Thu, Oct 17, 2019 at 10:47 AM, Tianhua huang wrote:
>
>> OK, let's update the info there. Thanks.
>>
>> On Thu, Oct 17, 2019 at 1:52 AM Shane Knapp  wrote:
>>
>>> i totally missed the spark jira from earlier...  let's move the
>>> conversation there!
>>>
>>> On Tue, Oct 15, 2019 at 6:21 PM bo zhaobo 
>>> wrote:
>>>
 Shane, awesome! We will try our best to finish the tests and the
 requests on the VM soon. Once we finish those things, we will send you
 an email, and then we can continue with the next steps. Thank you very much.

 Best Regards,

 ZhaoBo

 On Wed, Oct 16, 2019 at 3:47 AM, Shane Knapp wrote:

> ok!  i'm able to successfully log in to the VM!
>
> i also have created a jenkins worker entry:
> https://amplab.cs.berkeley.edu/jenkins/computer/spark-arm-vm/
>
> it's a pretty bare-bones VM, so i have some suggestions/requests
> before we can actually proceed w/testing.  i will not be able to perform
> any system configuration, as i don't have the cycles to reverse-engineer
> the ansible setup and test it all out.
>
> * java is not installed, please install the following:
>   - java8 min version 1.8.0_191
>   - java11 min version 11.0.1
>
> * it appears from the ansible playbook that there are other deps that
> need to be installed.
>   - please install all deps
>   - manually run the tests until they pass
>
> * the jenkins user should NEVER have sudo or any root-level access!
>
> * once the arm tests pass when manually run, take a snapshot of this
> image so we can recreate it w/o needing to reinstall everything
>
> after that's done i can finish configuring the jenkins worker and set
> up a build...
>
> thanks!
>
> shane
>
>
> On Mon, Oct 14, 2019 at 8:34 PM Shane Knapp 
> wrote:
>
>> yes, i will get to that tomorrow.  today was spent cleaning up the
>> mess from last week.
>>
>> On Mon, Oct 14, 2019 at 6:18 PM bo zhaobo <
>> bzhaojyathousa...@gmail.com> wrote:
>>
>>> Hi shane,
>>>
>>> That's great news that Amplab is back. ;-) If possible, could you
>>> please take a few minutes to check that the ARM VM is accessible from your
>>> side? And do you have a plan for the whole ARM test integration?
>>> (How about we finish it this month?) Thanks.
>>>
>>> Best regards,
>>>
>>> ZhaoBo
>>>
>>>
>>>
>>> On Thu, Oct 10, 2019 at 8:29 AM, bo zhaobo wrote:
>>>
 Oh, sorry that we missed that email. If possible, could you please take a
 few minutes to test that the ARM VM is accessible through your ssh private
 key with the jenkins user? We plan to have the whole integration process
 and tests done before the end of this month. We are very happy to work
 together with you to move this forward, if you are free and agree. :)
 Thank you very much.

 Best Regards,
 Zhao Bo


 On Wed, Oct 9, 2019 at 11:10 PM, Shane Knapp wrote:

> i spent yesterday dealing w/a power outage on campus.  please 

Re: Is RDD thread safe?

2019-11-11 Thread Weichen Xu
Hi Chang,

RDDs and DataFrames are immutable and lazily computed. They are thread safe.

Thanks!
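
For instance, here is a minimal, self-contained sketch of that pattern (the
app name, thread count, and toy queries are all made up for illustration):

import org.apache.spark.sql.SparkSession

object ConcurrentQueriesSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("concurrent-df-sketch")
      .master("local[4]")
      .getOrCreate()
    import spark.implicits._

    // Cache the shared source RDD once. It is immutable, so concurrent
    // readers need no extra synchronization.
    val source = spark.sparkContext.parallelize(1 to 1000000).cache()

    // Each thread derives its own DataFrame from the same cached RDD.
    val threads = (1 to 4).map { i =>
      new Thread(() => {
        val df = source.map(v => (v % (i + 1), v)).toDF("bucket", "value")
        println(s"thread $i buckets: " + df.select("bucket").distinct().count())
      })
    }
    threads.foreach(_.start())
    threads.foreach(_.join())
    spark.stop()
  }
}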

On Tue, Nov 12, 2019 at 12:31 PM Chang Chen  wrote:

> Hi all
>
> I have a case where I need to cache a source RDD and then create different
> DataFrames from it in different threads to accelerate queries.
>
> I know that SparkSession is thread safe
> (https://issues.apache.org/jira/browse/SPARK-15135), but I am not sure
> whether RDDs are thread safe or not.
>
> Thanks
>


Re: Adding JIRA ID as the prefix for the test case name

2019-11-11 Thread Hyukjin Kwon
In a few days, I will write this into our guidelines, probably after
rewording it a bit:

1. Add a prefix to a test name when a PR adds a couple of tests.
2. Use the "SPARK-X: test name" format.

Please let me know if you have a different opinion about what or when to
write the JIRA ID as the prefix.
I would like to make sure this simple rule stays close to your actual
practice.


On Tue, Nov 12, 2019 at 8:41 AM, Gengliang wrote:

> +1 for making it a guideline. This is helpful when the test cases are
> moved to a different file.
>
> On Mon, Nov 11, 2019 at 3:23 PM Takeshi Yamamuro 
> wrote:
>
>> +1 for having that consistent rule in test names.
>> This is a trivial problem, though; I think documenting this rule in the
>> contribution guide might make reviewer overhead a little smaller.
>>
>> Bests,
>> Takeshi
>>
>> On Tue, Nov 12, 2019 at 1:46 AM Hyukjin Kwon  wrote:
>>
>>> Hi all,
>>>
>>> Maybe it's not a big deal, but it has brought some confusion into Spark
>>> dev and the community from time to time. I think it's time to discuss
>>> when, and in which format, to add a JIRA ID as a prefix for test case
>>> names in Scala test cases.
>>>
>>> Currently we have many test case names with prefixes as below:
>>>
>>>- test("SPARK-X blah blah")
>>>- test("SPARK-X: blah blah")
>>>- test("SPARK-X - blah blah")
>>>- test("[SPARK-X] blah blah")
>>>- …
>>>
>>> It is good practice to have the JIRA ID in general because, for instance,
>>> it takes less effort to track commit histories (even when files are moved
>>> around) or to track information related to failed tests. Considering how
>>> big Spark is getting, I think it's good to document.
>>>
>>> I would like to suggest this and document it in our guideline:
>>>
>>> 1. Add a prefix to a test name when a PR adds a couple of tests.
>>> 2. Use the "SPARK-X: test name" format, which is used most often in our
>>>   code base[1].
>>>
>>> We should make it simple and clear but close to the actual practice.
>>> So, I would like to hear what other people think, and I would appreciate
>>> feedback about when to add the JIRA prefix. One alternative is that we
>>> only add the prefix when the JIRA's type is bug.
>>>
>>> [1]
>>> git grep -E 'test\("\SPARK-([0-9]+):' | wc -l
>>>  923
>>> git grep -E 'test\("\SPARK-([0-9]+) ' | wc -l
>>>  477
>>> git grep -E 'test\("\[SPARK-([0-9]+)\]' | wc -l
>>>   16
>>> git grep -E 'test\("\SPARK-([0-9]+) -' | wc -l
>>>   13
>>>
>>>
>>>
>>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>


Is RDD thread safe?

2019-11-11 Thread Chang Chen
Hi all

I have a case where I need to cache a source RDD and then create different
DataFrames from it in different threads to accelerate queries.

I know that SparkSession is thread safe
(https://issues.apache.org/jira/browse/SPARK-15135), but I am not sure
whether RDDs are thread safe or not.

Thanks


Re: ASF board report for November 2019

2019-11-11 Thread Jungtaek Lim
nit: - The latest committer was added on Sept 4th, 2019 (Dongjoon Hyun). <=
s/committer/PMC member

Thanks,
Jungtaek Lim (HeartSaVioR)

On Tue, Nov 12, 2019 at 11:38 AM Matei Zaharia 
wrote:

> Hi all,
>
> It’s time to send our quarterly report to the ASF board. Here is my draft
> — please feel free to suggest any changes.
>
> 
>
> Apache Spark is a fast and general engine for large-scale data processing.
> It
> offers high-level APIs in Java, Scala, Python and R as well as a rich set
> of
> libraries including stream processing, machine learning, and graph
> analytics.
>
> Project status:
>
> - We made the first preview release for Spark 3.0 on November 6th. This
>   release aims to get early feedback on the new APIs and functionality
>   targeting Spark 3.0 but does not provide API or stability guarantees. We
>   encourage community members to try this release and leave feedback on
>   JIRA. More info about what’s new and how to report feedback is found at
>   https://spark.apache.org/news/spark-3.0.0-preview.html.
>
> - We published Spark 2.4.4 and 2.3.4 as maintenance releases to fix bugs
>   in the 2.4 and 2.3 branches.
>
> - We added one new PMC member and six committers to the project
>   in August and September, covering data sources, streaming, SQL, ML
>   and other components of the project.
>
> Trademarks:
>
> - Nothing new to report since August.
>
> Latest releases:
>
> - Spark 3.0.0-preview was released on Nov 6th, 2019.
> - Spark 2.3.4 was released on Sept 9th, 2019.
> - Spark 2.4.4 was released on Sept 1st, 2019.
>
> Committers and PMC:
>
> - The latest committer was added on Sept 4th, 2019 (Dongjoon Hyun).
> - The latest committer was added on Sept 9th, 2019 (Weichen Xu). We
>   also added Ryan Blue, L.C. Hsieh, Gengliang Wang, Yuming Wang and
>   Ruifeng Zhang as committers in the past three months.
>
> 
>
>


ASF board report for November 2019

2019-11-11 Thread Matei Zaharia
Hi all,

It’s time to send our quarterly report to the ASF board. Here is my draft — 
please feel free to suggest any changes.



Apache Spark is a fast and general engine for large-scale data processing. It
offers high-level APIs in Java, Scala, Python and R as well as a rich set of
libraries including stream processing, machine learning, and graph analytics.

Project status:

- We made the first preview release for Spark 3.0 on November 6th. This
  release aims to get early feedback on the new APIs and functionality
  targeting Spark 3.0 but does not provide API or stability guarantees. We
  encourage community members to try this release and leave feedback on
  JIRA. More info about what’s new and how to report feedback is found at
  https://spark.apache.org/news/spark-3.0.0-preview.html.

- We published Spark 2.4.4 and 2.3.4 as maintenance releases to fix bugs
  in the 2.4 and 2.3 branches.

- We added one new PMC member and six committers to the project
  in August and September, covering data sources, streaming, SQL, ML
  and other components of the project.

Trademarks:

- Nothing new to report since August.

Latest releases:

- Spark 3.0.0-preview was released on Nov 6th, 2019.
- Spark 2.3.4 was released on Sept 9th, 2019.
- Spark 2.4.4 was released on Sept 1st, 2019.

Committers and PMC:

- The latest committer was added on Sept 4th, 2019 (Dongjoon Hyun).
- The latest committer was added on Sept 9th, 2019 (Weichen Xu). We
  also added Ryan Blue, L.C. Hsieh, Gengliang Wang, Yuming Wang and
  Ruifeng Zhang as committers in the past three months.





Re: Adding JIRA ID as the prefix for the test case name

2019-11-11 Thread Gengliang
+1 for making it a guideline. This is helpful when the test cases are moved
to a different file.

On Mon, Nov 11, 2019 at 3:23 PM Takeshi Yamamuro 
wrote:

> +1 for having that consistent rule in test names.
> This is a trivial problem, though; I think documenting this rule in the
> contribution guide might make reviewer overhead a little smaller.
>
> Bests,
> Takeshi
>
> On Tue, Nov 12, 2019 at 1:46 AM Hyukjin Kwon  wrote:
>
>> Hi all,
>>
>> Maybe it's not a big deal, but it has brought some confusion into Spark
>> dev and the community from time to time. I think it's time to discuss
>> when, and in which format, to add a JIRA ID as a prefix for test case
>> names in Scala test cases.
>>
>> Currently we have many test case names with prefixes as below:
>>
>>- test("SPARK-X blah blah")
>>- test("SPARK-X: blah blah")
>>- test("SPARK-X - blah blah")
>>- test("[SPARK-X] blah blah")
>>- …
>>
>> It is good practice to have the JIRA ID in general because, for instance,
>> it takes less effort to track commit histories (even when files are moved
>> around) or to track information related to failed tests. Considering how
>> big Spark is getting, I think it's good to document.
>>
>> I would like to suggest this and document it in our guideline:
>>
>> 1. Add a prefix to a test name when a PR adds a couple of tests.
>> 2. Use the "SPARK-X: test name" format, which is used most often in our
>>   code base[1].
>>
>> We should make it simple and clear but close to the actual practice. So,
>> I would like to hear what other people think, and I would appreciate
>> feedback about when to add the JIRA prefix. One alternative is that we
>> only add the prefix when the JIRA's type is bug.
>>
>> [1]
>> git grep -E 'test\("\SPARK-([0-9]+):' | wc -l
>>  923
>> git grep -E 'test\("\SPARK-([0-9]+) ' | wc -l
>>  477
>> git grep -E 'test\("\[SPARK-([0-9]+)\]' | wc -l
>>   16
>> git grep -E 'test\("\SPARK-([0-9]+) -' | wc -l
>>   13
>>
>>
>>
>>
>
> --
> ---
> Takeshi Yamamuro
>


Re: Adding JIRA ID as the prefix for the test case name

2019-11-11 Thread Takeshi Yamamuro
+1 for having that consistent rule in test names.
This is a trivial problem, though; I think documenting this rule in the
contribution guide might make reviewer overhead a little smaller.

Bests,
Takeshi

On Tue, Nov 12, 2019 at 1:46 AM Hyukjin Kwon  wrote:

> Hi all,
>
> Maybe it's not a big deal, but it has brought some confusion into Spark
> dev and the community from time to time. I think it's time to discuss
> when, and in which format, to add a JIRA ID as a prefix for test case
> names in Scala test cases.
>
> Currently we have many test case names with prefixes as below:
>
>- test("SPARK-X blah blah")
>- test("SPARK-X: blah blah")
>- test("SPARK-X - blah blah")
>- test("[SPARK-X] blah blah")
>- …
>
> It is good practice to have the JIRA ID in general because, for instance,
> it takes less effort to track commit histories (even when files are moved
> around) or to track information related to failed tests. Considering how
> big Spark is getting, I think it's good to document.
>
> I would like to suggest this and document it in our guideline:
>
> 1. Add a prefix to a test name when a PR adds a couple of tests.
> 2. Use the "SPARK-X: test name" format, which is used most often in our
>   code base[1].
>
> We should make it simple and clear but close to the actual practice. So,
> I would like to hear what other people think, and I would appreciate
> feedback about when to add the JIRA prefix. One alternative is that we
> only add the prefix when the JIRA's type is bug.
>
> [1]
> git grep -E 'test\("\SPARK-([0-9]+):' | wc -l
>  923
> git grep -E 'test\("\SPARK-([0-9]+) ' | wc -l
>  477
> git grep -E 'test\("\[SPARK-([0-9]+)\]' | wc -l
>   16
> git grep -E 'test\("\SPARK-([0-9]+) -' | wc -l
>   13
>
>
>
>

-- 
---
Takeshi Yamamuro


Adding JIRA ID as the prefix for the test case name

2019-11-11 Thread Hyukjin Kwon
Hi all,

Maybe it's not a big deal, but it has brought some confusion into Spark dev
and the community from time to time. I think it's time to discuss when, and
in which format, to add a JIRA ID as a prefix for test case names in Scala
test cases.

Currently we have many test case names with prefixes as below:

   - test("SPARK-X blah blah")
   - test("SPARK-X: blah blah")
   - test("SPARK-X - blah blah")
   - test("[SPARK-X] blah blah")
   - …

It is good practice to have the JIRA ID in general because, for instance,
it takes less effort to track commit histories (even when files are moved
around) or to track information related to failed tests. Considering how
big Spark is getting, I think it's good to document.

I would like to suggest this and document it in our guideline:

1. Add a prefix to a test name when a PR adds a couple of tests.
2. Use the "SPARK-X: test name" format, which is used most often in our
  code base[1].

We should make it simple and clear but close to the actual practice. So, I
would like to hear what other people think, and I would appreciate feedback
about when to add the JIRA prefix. One alternative is that we only add the
prefix when the JIRA's type is bug.

[1]
git grep -E 'test\("\SPARK-([0-9]+):' | wc -l
 923
git grep -E 'test\("\SPARK-([0-9]+) ' | wc -l
 477
git grep -E 'test\("\[SPARK-([0-9]+)\]' | wc -l
  16
git grep -E 'test\("\SPARK-([0-9]+) -' | wc -l
  13


Does StreamingSymmetricHashJoinExec work with watermark? I don't think so

2019-11-11 Thread Jacek Laskowski
Hi,

I think the watermark does not work for StreamingSymmetricHashJoinExec, for
the following reasons:

1. leftKeys and rightKeys have no spark.watermarkDelayMs metadata entry at
planning time [1].
2. Since the left and right keys had no watermark delay at planning time,
the code [2] won't find it at execution time.

Is my understanding correct? If not, can you point me at examples with a
watermark on 1) join keys and 2) values? Thank you very much.

[1]
https://github.com/apache/spark/blob/v3.0.0-preview/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala#L477-L478

[2]
https://github.com/apache/spark/blob/v3.0.0-preview/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinHelper.scala#L156-L164
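
For reference, a minimal sketch of a watermarked stream-stream join in the
style of the Structured Streaming guide (the rate sources, column names, and
intervals are all made up; note the watermarks sit on the event-time value
columns, not on the join keys):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object WatermarkedJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("watermarked-join-sketch")
      .master("local[2]")
      .getOrCreate()
    import spark.implicits._

    // Two rate streams standing in for real sources.
    val impressions = spark.readStream.format("rate").load()
      .select(($"value" % 100).as("impressionAdId"), $"timestamp".as("impressionTime"))
      .withWatermark("impressionTime", "10 seconds")

    val clicks = spark.readStream.format("rate").load()
      .select(($"value" % 100).as("clickAdId"), $"timestamp".as("clickTime"))
      .withWatermark("clickTime", "20 seconds")

    // Equi-join on the ad id plus a time-range condition; the range together
    // with the watermarks above is what lets Spark bound the join state.
    val joined = impressions.join(
      clicks,
      expr("""
        clickAdId = impressionAdId AND
        clickTime >= impressionTime AND
        clickTime <= impressionTime + interval 1 minute
      """))

    joined.writeStream.format("console").start().awaitTermination()
  }
}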

Regards,
Jacek Laskowski

https://about.me/JacekLaskowski
The Internals of Spark SQL https://bit.ly/spark-sql-internals
The Internals of Spark Structured Streaming
https://bit.ly/spark-structured-streaming
The Internals of Apache Kafka https://bit.ly/apache-kafka-internals
Follow me at https://twitter.com/jaceklaskowski