Re: [k8s] Spark operator (the Java one)

2019-10-10 Thread Yinan Li
+1. This and the GCP Spark Operator, although very useful for k8s
users, are not something needed by all Spark users, nor even by all Spark
on k8s users.


On Thu, Oct 10, 2019 at 6:34 PM Stavros Kontopoulos <
stavros.kontopou...@lightbend.com> wrote:

> Hi all,
>
> Hi all,
>
> I also left a comment on the PR with more details. I don't see why the Java
> operator should be maintained by the Spark project.
> This is an interesting project and could thrive on its own as an external
> operator project.
>
> Best,
> Stavros
>
> On Thu, Oct 10, 2019 at 7:51 PM Sean Owen  wrote:
>
>> I'd have the same question on the PR - why does this need to be in the
>> Apache Spark project vs where it is now? Yes, it's not a Spark package
>> per se, but it seems like this is a tool for K8S to use Spark rather
>> than a core Spark tool.
>>
>> Yes of course all the packages, licenses, etc have to be overhauled,
>> but that kind of underscores that this is a dump of a third party tool
>> that works fine on its own?
>>
>> On Thu, Oct 10, 2019 at 9:30 AM Jiri Kremser  wrote:
>> >
>> > Hello,
>> >
>> >
>> > Spark Operator is a tool that can deploy/scale and help with monitoring
>> of Spark clusters on Kubernetes. It follows the operator pattern [1]
>> introduced by CoreOS, so it watches for changes in custom resources
>> representing the desired state of the clusters and takes the steps to
>> achieve this state in Kubernetes using the K8s client. It’s written
>> in Java and there is an overlap with the Spark dependencies (logging, k8s
>> client, apache-commons-*, fasterxml-jackson, etc.). The operator also
>> contains metadata that allows it to be deployed smoothly via operatorhub.io
>> [2]. For basic info, check the README on the project page, including
>> the gif :) Another unique feature of this operator is the (optional)
>> ability to compile itself to a native image using the GraalVM compiler,
>> so it starts fast and has a very low memory footprint.
>> >
>> >
>> > We would like to contribute this project to Spark’s code base. It can’t
>> be distributed as a Spark package, because it’s not a library that can be
>> used from the Spark environment. So if you are interested, the directory
>> under resource-managers/kubernetes/spark-operator/ could be a suitable
>> destination.
>> >
>> >
>> > The current repository is radanalyticsio/spark-operator [3] on GitHub and
>> it also contains a test suite [4] that verifies that the operator works
>> well on K8s (using minikube) and also on OpenShift. I am not sure how to
>> transfer those tests in case you are interested in those as well.
>> >
>> >
>> > I’ve already opened the PR [5], but it got closed, so I am opening the
>> discussion here first. The PR contained old package names with our
>> organisation called radanalytics.io, but we are willing to change that to
>> anything more aligned with the existing Spark conventions; the same holds
>> for the license headers in all the source files.
>> >
>> >
>> > jk
>> >
>> >
>> >
>> > [1]: https://kubernetes.io/docs/concepts/extend-kubernetes/operator/
>> >
>> > [2]: https://operatorhub.io/operator/radanalytics-spark
>> >
>> > [3]: https://github.com/radanalyticsio/spark-operator
>> >
>> > [4]: https://travis-ci.org/radanalyticsio/spark-operator
>> >
>> > [5]: https://github.com/apache/spark/pull/26075
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>


Re: [k8s] Spark operator (the Java one)

2019-10-10 Thread Stavros Kontopoulos
Hi all,

I also left a comment on the PR with more details. I don't see why the Java
operator should be maintained by the Spark project.
This is an interesting project and could thrive on its own as an external
operator project.

Best,
Stavros

On Thu, Oct 10, 2019 at 7:51 PM Sean Owen  wrote:

> I'd have the same question on the PR - why does this need to be in the
> Apache Spark project vs where it is now? Yes, it's not a Spark package
> per se, but it seems like this is a tool for K8S to use Spark rather
> than a core Spark tool.
>
> Yes of course all the packages, licenses, etc have to be overhauled,
> but that kind of underscores that this is a dump of a third party tool
> that works fine on its own?
>
> On Thu, Oct 10, 2019 at 9:30 AM Jiri Kremser  wrote:
> >
> > Hello,
> >
> >
> > Spark Operator is a tool that can deploy/scale and help with monitoring
> of Spark clusters on Kubernetes. It follows the operator pattern [1]
> introduced by CoreOS, so it watches for changes in custom resources
> representing the desired state of the clusters and takes the steps to
> achieve this state in Kubernetes using the K8s client. It’s written
> in Java and there is an overlap with the Spark dependencies (logging, k8s
> client, apache-commons-*, fasterxml-jackson, etc.). The operator also
> contains metadata that allows it to be deployed smoothly via operatorhub.io
> [2]. For basic info, check the README on the project page, including
> the gif :) Another unique feature of this operator is the (optional)
> ability to compile itself to a native image using the GraalVM compiler,
> so it starts fast and has a very low memory footprint.
> >
> >
> > We would like to contribute this project to Spark’s code base. It can’t
> be distributed as a Spark package, because it’s not a library that can be
> used from the Spark environment. So if you are interested, the directory
> under resource-managers/kubernetes/spark-operator/ could be a suitable
> destination.
> >
> >
> > The current repository is radanalyticsio/spark-operator [3] on GitHub and
> it also contains a test suite [4] that verifies that the operator works
> well on K8s (using minikube) and also on OpenShift. I am not sure how to
> transfer those tests in case you are interested in those as well.
> >
> >
> > I’ve already opened the PR [5], but it got closed, so I am opening the
> discussion here first. The PR contained old package names with our
> organisation called radanalytics.io, but we are willing to change that to
> anything more aligned with the existing Spark conventions; the same holds
> for the license headers in all the source files.
> >
> >
> > jk
> >
> >
> >
> > [1]: https://kubernetes.io/docs/concepts/extend-kubernetes/operator/
> >
> > [2]: https://operatorhub.io/operator/radanalytics-spark
> >
> > [3]: https://github.com/radanalyticsio/spark-operator
> >
> > [4]: https://travis-ci.org/radanalyticsio/spark-operator
> >
> > [5]: https://github.com/apache/spark/pull/26075
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Spark 3.0 preview release feature list and major changes

2019-10-10 Thread Weichen Xu
Wait... I have some additions:

*New API:*
SPARK-25097 Support prediction on single instance in KMeans/BiKMeans/GMM
SPARK-28045 add missing RankingEvaluator
SPARK-29121 Support Dot Product for Vectors

*Behavior change or new API with behavior change:*
SPARK-23265 Update multi-column error handling logic in QuantileDiscretizer
SPARK-22798 Add multiple column support to PySpark StringIndexer
SPARK-11215 Add multiple columns support to StringIndexer
SPARK-24102 RegressionEvaluator should use sample weight data
SPARK-24101 MulticlassClassificationEvaluator should use sample weight data
SPARK-24103 BinaryClassificationEvaluator should use sample weight data
SPARK-23469 HashingTF should use corrected MurmurHash3 implementation

*Deprecated API removal:*
SPARK-25382 Remove ImageSchema.readImages in 3.0
SPARK-26133 Remove deprecated OneHotEncoder and rename
OneHotEncoderEstimator to OneHotEncoder
SPARK-25867 Remove KMeans computeCost
SPARK-28243 remove setFeatureSubsetStrategy and setSubsamplingRate from
Python TreeEnsembleParams

Thanks!

Weichen

On Fri, Oct 11, 2019 at 6:11 AM Xingbo Jiang  wrote:

> Hi all,
>
> Here is the updated feature list:
>
>
> SPARK-11215  Multiple
> columns support added to various Transformers: StringIndexer
>
> SPARK-11150  Implement
> Dynamic Partition Pruning
>
> SPARK-13677  Support
> Tree-Based Feature Transformation
>
> SPARK-16692  Add
> MultilabelClassificationEvaluator
>
> SPARK-19591  Add
> sample weights to decision trees
>
> SPARK-19712  Pushing
> Left Semi and Left Anti joins through Project, Aggregate, Window, Union etc.
>
> SPARK-19827  R API for
> Power Iteration Clustering
>
> SPARK-20286  Improve
> logic for timing out executors in dynamic allocation
>
> SPARK-20636  Eliminate
> unnecessary shuffle with adjacent Window expressions
>
> SPARK-22148  Acquire
> new executors to avoid hang because of blacklisting
>
> SPARK-22796  Multiple
> columns support added to various Transformers: PySpark QuantileDiscretizer
>
> SPARK-23128  A new
> approach to do adaptive execution in Spark SQL
>
> SPARK-23155  Apply
> custom log URL pattern for executor log URLs in SHS
>
> SPARK-23539  Add
> support for Kafka headers
>
> SPARK-23674  Add Spark
> ML Listener for Tracking ML Pipeline Status
>
> SPARK-23710  Upgrade
> the built-in Hive to 2.3.5 for hadoop-3.2
>
> SPARK-24333  Add fit
> with validation set to Gradient Boosted Trees: Python API
>
> SPARK-24417  Build and
> Run Spark on JDK11
>
> SPARK-24615 
> Accelerator-aware task scheduling for Spark
>
> SPARK-24920  Allow
> sharing Netty's memory pool allocators
>
> SPARK-25250  Fix race
> condition with tasks running when new attempt for same stage is created
> leads to other task in the next attempt running on the same partition id
> retry multiple times
>
> SPARK-25341  Support
> rolling back a shuffle map stage and re-generate the shuffle files
>
> SPARK-25348  Data
> source for binary files
>
> SPARK-25390  data
> source V2 API refactoring
>
> SPARK-25501  Add Kafka
> delegation token support
>
> SPARK-25603 
> Generalize Nested Column Pruning
>
> SPARK-26132  Remove
> support for Scala 2.11 in Spark 3.0.0
>
> SPARK-26215  define
> reserved keywords after SQL standard
>
> SPARK-26412  Allow
> Pandas UDF to take an iterator of pd.DataFrames
>
> SPARK-26651  Use
> Proleptic Gregorian calendar
>
> SPARK-26759  Arrow
> optimization in SparkR's interoperability
>
> SPARK-26848 

DataSourceV2 sync notes - 2 October 2019

2019-10-10 Thread Ryan Blue
Here are my notes from last week's DSv2 sync.

*Attendees*:

Ryan Blue
Terry Kim
Wenchen Fan

*Topics*:

   - SchemaPruning only supports Parquet and ORC?
   - Out of order optimizer rules
   - 3.0 work
  - Rename session catalog to spark_catalog
  - Finish TableProvider update to avoid another API change: pass all
  table config from metastore
  - Catalog behavior fix:
  https://issues.apache.org/jira/browse/SPARK-29014
  - Stats push-down optimization:
  https://github.com/apache/spark/pull/25955
  - DataFrameWriter v1/v2 compatibility progress
   - Open PRs
  - Update identifier resolution and table resolution:
  https://github.com/apache/spark/pull/25747
  - Expose SerializableConfiguration:
  https://github.com/apache/spark/pull/26005
  - Early DSv2 pushdown: https://github.com/apache/spark/pull/25955

*Discussion*:

   - Update identifier and table resolution
  - Wenchen: Will not handle SPARK-29014, it is a pure refactor
   - Ryan: I think this should separate the v2 rules from the v1 fallback,
   to keep table and identifier resolution separate. The only time that
   table resolution needs to be done at the same time is for v1 fallback.
  - This was merged last week
   - Update to use spark_catalog
  - Wenchen: this will be a separate PR.
  - Now open: https://github.com/apache/spark/pull/26071
   - Early DSv2 pushdown
   - Ryan: this depends on fixing a few more tests. To validate there
   are no calls to computeStats with the DSv2 relation, I’ve temporarily
   removed the method. Other than a few remaining test failures where the
   old relation was expected, it looks like there are no uses of
   computeStats before early pushdown in the optimizer.
  - Wenchen: agreed that the batch was in the correct place in the
  optimizer
   - Ryan: once tests are passing, will add the computeStats
   implementation back with Utils.isTesting to fail during testing when
   called before early pushdown, but will not fail at runtime
   - Wenchen: when using v2, there is no way to configure custom options
   for a JDBC table. For v1, the table was created and stored in the session
   catalog, at which point Spark-specific properties like parallelism could be
   stored. In v2, the catalog is the source of truth, so tables don’t get
   created in the same way. Options are only passed in a create statement.
   - Ryan: this could be fixed by allowing users to pass options as
   table properties. We mix the two today, but if we used a prefix for table
   properties, "options.", then you could use SET TBLPROPERTIES to get
   around this. That’s also better for compatibility. I’ll open a PR for this.
  - Ryan: this could also be solved by adding an OPTIONS clause or hint
  to SELECT
   - Wenchen: There are commands without v2 statements. We should add v2
   statements to reject non-v1 uses.
  - Ryan: Doesn’t the parser only parse up to 2 identifiers for these?
  That would handle the majority of cases
  - Wenchen: Yes, but there is still a problem for identifiers with 1
  part in v2 catalogs, like catalog.table. Commands that don’t support v2
  will use catalog.table in the v1 catalog.
  - Ryan: Sounds like a good plan to update the parser and add
  statements for these. Do we have a list of commands to update?
  - Wenchen: REFRESH TABLE, ANALYZE TABLE, ALTER TABLE PARTITION, etc.
  Will open an umbrella JIRA with a list.

-- 
Ryan Blue
Software Engineer
Netflix
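
For anyone following the "options." discussion above, here is a hypothetical
sketch of what the user-facing flow could look like if data source options were
exposed as prefixed table properties. This is not implemented Spark behavior at
the time of these notes; the "testcat" catalog, the "db.jdbc_tbl" table, and
the "fetchsize" key are all made up for illustration.

```scala
// Hypothetical sketch of the "options." prefix proposal discussed in the
// notes above. Assumptions (not real Spark behavior as of these notes): a v2
// catalog named "testcat" is configured, a table "db.jdbc_tbl" exists in it,
// and Spark maps table properties prefixed with "options." onto data source
// options.
import org.apache.spark.sql.SparkSession

object OptionsAsTablePropertiesSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("options-as-table-properties-sketch")
      .master("local[*]")
      .getOrCreate()

    // Under the proposal, a source-specific read option (here a made-up
    // "fetchsize") could be changed after table creation via SET TBLPROPERTIES
    // instead of only being supplied in a CREATE statement.
    spark.sql(
      """ALTER TABLE testcat.db.jdbc_tbl
        |SET TBLPROPERTIES ('options.fetchsize' = '1000')""".stripMargin)

    spark.stop()
  }
}
```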


Re: Spark 3.0 preview release feature list and major changes

2019-10-10 Thread Xingbo Jiang
Hi all,

Here is the updated feature list:


SPARK-11215  Multiple
columns support added to various Transformers: StringIndexer

SPARK-11150  Implement
Dynamic Partition Pruning

SPARK-13677  Support
Tree-Based Feature Transformation

SPARK-16692  Add
MultilabelClassificationEvaluator

SPARK-19591  Add sample
weights to decision trees

SPARK-19712  Pushing
Left Semi and Left Anti joins through Project, Aggregate, Window, Union etc.

SPARK-19827  R API for
Power Iteration Clustering

SPARK-20286  Improve
logic for timing out executors in dynamic allocation

SPARK-20636  Eliminate
unnecessary shuffle with adjacent Window expressions

SPARK-22148  Acquire new
executors to avoid hang because of blacklisting

SPARK-22796  Multiple
columns support added to various Transformers: PySpark QuantileDiscretizer

SPARK-23128  A new
approach to do adaptive execution in Spark SQL

SPARK-23155  Apply
custom log URL pattern for executor log URLs in SHS

SPARK-23539  Add support
for Kafka headers

SPARK-23674  Add Spark
ML Listener for Tracking ML Pipeline Status

SPARK-23710  Upgrade the
built-in Hive to 2.3.5 for hadoop-3.2

SPARK-24333  Add fit
with validation set to Gradient Boosted Trees: Python API

SPARK-24417  Build and
Run Spark on JDK11

SPARK-24615 
Accelerator-aware task scheduling for Spark

SPARK-24920  Allow
sharing Netty's memory pool allocators

SPARK-25250  Fix race
condition with tasks running when new attempt for same stage is created
leads to other task in the next attempt running on the same partition id
retry multiple times

SPARK-25341  Support
rolling back a shuffle map stage and re-generate the shuffle files

SPARK-25348  Data source
for binary files

SPARK-25390  data source
V2 API refactoring

SPARK-25501  Add Kafka
delegation token support

SPARK-25603  Generalize
Nested Column Pruning

SPARK-26132  Remove
support for Scala 2.11 in Spark 3.0.0

SPARK-26215  define
reserved keywords after SQL standard

SPARK-26412  Allow
Pandas UDF to take an iterator of pd.DataFrames

SPARK-26651  Use
Proleptic Gregorian calendar

SPARK-26759  Arrow
optimization in SparkR's interoperability

SPARK-26848  Introduce
new option to Kafka source: offset by timestamp (starting/ending)

SPARK-27064  create
StreamingWrite at the beginning of streaming execution

SPARK-27119  Do not
infer schema when reading Hive serde table with native data source

SPARK-27225  Implement
join strategy hints

SPARK-27240  Use pandas
DataFrame for struct type argument in Scalar Pandas UDF

SPARK-27338  Fix
deadlock between TaskMemoryManager and
UnsafeExternalSorter$SpillableIterator

SPARK-27396  Public APIs
for extended Columnar Processing Support

SPARK-27463  Support
Dataframe Cogroup via Pandas UDFs

SPARK-27589 
Re-implement file sources with data source V2 API

SPARK-27677 
Disk-persisted RDD blocks served by shuffle service, and ignored for
Dynamic Allocation

SPARK-27699 

Re: [VOTE][SPARK-28885] Follow ANSI store assignment rules in table insertion by default

2019-10-10 Thread Dongjoon Hyun
+1

Bests,
Dongjoon

On Thu, Oct 10, 2019 at 10:14 Ryan Blue  wrote:

> +1
>
> Thanks for fixing this!
>
> On Thu, Oct 10, 2019 at 6:30 AM Xiao Li  wrote:
>
>> +1
>>
>> On Thu, Oct 10, 2019 at 2:13 AM Hyukjin Kwon  wrote:
>>
>>> +1 (binding)
>>>
>>> On Thu, Oct 10, 2019 at 5:11 PM, Takeshi Yamamuro wrote:
>>>
 Thanks for the great work, Gengliang!

 +1 for that.
 As I said before, the behaviour is pretty common in DBMSs, so the change
 helps DBMS users.

 Bests,
 Takeshi


 On Mon, Oct 7, 2019 at 5:24 PM Gengliang Wang <
 gengliang.w...@databricks.com> wrote:

> Hi everyone,
>
> I'd like to call for a new vote on SPARK-28885
>  "Follow ANSI
> store assignment rules in table insertion by default" after revising the
> ANSI store assignment policy(SPARK-29326
> ).
> When inserting a value into a column with a different data type,
> Spark performs type coercion. Currently, we support 3 policies for the
> store assignment rules: ANSI, legacy and strict, which can be set via the
> option "spark.sql.storeAssignmentPolicy":
> 1. ANSI: Spark performs the store assignment as per ANSI SQL. In
> practice, the behavior is mostly the same as PostgreSQL. It disallows
> certain unreasonable type conversions such as converting `string` to `int`
> and `double` to `boolean`. It will throw a runtime exception if the value
> is out-of-range(overflow).
> 2. Legacy: Spark allows the store assignment as long as it is a valid
> `Cast`, which is very loose. E.g., converting either `string` to `int` or
> `double` to `boolean` is allowed. It is the current behavior in Spark 2.x
> for compatibility with Hive. When inserting an out-of-range value to an
> integral field, the low-order bits of the value are inserted (the same as
> Java/Scala numeric type casting). For example, if 257 is inserted into a
> field of Byte type, the result is 1.
> 3. Strict: Spark doesn't allow any possible precision loss or data
> truncation in store assignment, e.g., converting either `double` to `int`
> or `decimal` to `double` is not allowed. The rules are originally for Dataset
> encoder. As far as I know, no mainstream DBMS is using this policy by
> default.
>
> Currently, the V1 data source uses "Legacy" policy by default, while
> V2 uses "Strict". This proposal is to use "ANSI" policy by default for 
> both
> V1 and V2 in Spark 3.0.
>
> This vote is open until Friday (Oct. 11).
>
> [ ] +1: Accept the proposal
> [ ] +0
> [ ] -1: I don't think this is a good idea because ...
>
> Thank you!
>
> Gengliang
>


 --
 ---
 Takeshi Yamamuro

>>> --
>> [image: Databricks Summit - Watch the talks]
>> 
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
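
For readers who want to see the proposed default in action, below is a minimal
sketch, assuming a Spark 3.0-line build where spark.sql.storeAssignmentPolicy
exists with the LEGACY/ANSI/STRICT values described above. The table name and
values are made up; the expected outcomes are the ones stated in the proposal.

```scala
// Minimal sketch of the store assignment policies described in the vote.
// Assumes a Spark 3.0-line build with spark.sql.storeAssignmentPolicy; the
// table and inserted values are illustrative only.
import org.apache.spark.sql.SparkSession

object StoreAssignmentPolicySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("store-assignment-policy-sketch")
      .master("local[*]")
      .getOrCreate()

    spark.sql("CREATE TABLE byte_tbl (b BYTE) USING parquet")

    // Legacy: any valid Cast is accepted, so 257 overflows the BYTE column and
    // only the low-order bits are kept (stored value: 1), as described above.
    spark.conf.set("spark.sql.storeAssignmentPolicy", "LEGACY")
    spark.sql("INSERT INTO byte_tbl VALUES (257)")

    // ANSI (the proposed default): the out-of-range value is rejected with a
    // runtime error instead of silently wrapping.
    spark.conf.set("spark.sql.storeAssignmentPolicy", "ANSI")
    try {
      spark.sql("INSERT INTO byte_tbl VALUES (257)")
    } catch {
      case e: Exception =>
        println(s"ANSI policy rejected the insert: ${e.getMessage}")
    }

    spark.sql("SELECT * FROM byte_tbl").show()
    spark.stop()
  }
}
```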


Re: Spark 3.0 preview release feature list and major changes

2019-10-10 Thread Sean Owen
See the JIRA - this is too open-ended and not obviously just due to
choices in data representation, what you're trying to do, etc. It's
correctly closed IMHO.
However, identifying the issue more narrowly, and something that looks
ripe for optimization, would be useful.

On Thu, Oct 10, 2019 at 12:30 PM antonkulaga  wrote:
>
> I think for sure  SPARK-28547
> 
> At the moment there are some flaws in Spark's architecture, and it performs
> miserably or even freezes wherever the column count exceeds 10-15K
> (even a simple describe call takes ages, while the same functions with
> pandas and no Spark take seconds). In many fields (like bioinformatics), wide
> datasets with both large numbers of rows and columns are very common (gene
> expression data is a good example here), and Spark is totally useless there.
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Spark 3.0 preview release feature list and major changes

2019-10-10 Thread antonkulaga
I think for sure  SPARK-28547
  
At the moment there are some flaws in Spark's architecture, and it performs
miserably or even freezes wherever the column count exceeds 10-15K
(even a simple describe call takes ages, while the same functions with
pandas and no Spark take seconds). In many fields (like bioinformatics), wide
datasets with both large numbers of rows and columns are very common (gene
expression data is a good example here), and Spark is totally useless there.
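
To make the reported scenario concrete, here is a small sketch of the kind of
job being described: a very wide DataFrame on which describe() is called. The
column count and data are made up (not a benchmark); the point is only that
per-column work grows with schema width.

```scala
// Illustrative sketch of the workload described above: a wide DataFrame on
// which describe() is called. Column count and values are invented; in the
// 10-15K column range the slowdown is reported to be severe.
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

object WideDescribeSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("wide-describe-sketch")
      .master("local[*]")
      .getOrCreate()

    val numCols = 5000
    val schema = StructType((0 until numCols).map(i => StructField(s"c$i", DoubleType)))
    val rows = spark.sparkContext.parallelize(
      Seq.tabulate(100)(r => Row.fromSeq(Seq.fill(numCols)(r.toDouble))))
    val df = spark.createDataFrame(rows, schema)

    // describe() builds aggregate expressions for every column, which is where
    // the reported slowdown shows up on wide schemas.
    df.describe().show(5, truncate = false)

    spark.stop()
  }
}
```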



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE][SPARK-28885] Follow ANSI store assignment rules in table insertion by default

2019-10-10 Thread Ryan Blue
+1

Thanks for fixing this!

On Thu, Oct 10, 2019 at 6:30 AM Xiao Li  wrote:

> +1
>
> On Thu, Oct 10, 2019 at 2:13 AM Hyukjin Kwon  wrote:
>
>> +1 (binding)
>>
>> On Thu, Oct 10, 2019 at 5:11 PM, Takeshi Yamamuro wrote:
>>
>>> Thanks for the great work, Gengliang!
>>>
>>> +1 for that.
>>> As I said before, the behaviour is pretty common in DBMSs, so the change
>>> helps DBMS users.
>>>
>>> Bests,
>>> Takeshi
>>>
>>>
>>> On Mon, Oct 7, 2019 at 5:24 PM Gengliang Wang <
>>> gengliang.w...@databricks.com> wrote:
>>>
 Hi everyone,

 I'd like to call for a new vote on SPARK-28885
  "Follow ANSI store
 assignment rules in table insertion by default" after revising the ANSI
 store assignment policy(SPARK-29326
 ).
 When inserting a value into a column with a different data type,
 Spark performs type coercion. Currently, we support 3 policies for the
 store assignment rules: ANSI, legacy and strict, which can be set via the
 option "spark.sql.storeAssignmentPolicy":
 1. ANSI: Spark performs the store assignment as per ANSI SQL. In
 practice, the behavior is mostly the same as PostgreSQL. It disallows
 certain unreasonable type conversions such as converting `string` to `int`
 and `double` to `boolean`. It will throw a runtime exception if the value
 is out-of-range(overflow).
 2. Legacy: Spark allows the store assignment as long as it is a valid
 `Cast`, which is very loose. E.g., converting either `string` to `int` or
 `double` to `boolean` is allowed. It is the current behavior in Spark 2.x
 for compatibility with Hive. When inserting an out-of-range value to an
 integral field, the low-order bits of the value are inserted (the same as
 Java/Scala numeric type casting). For example, if 257 is inserted into a
 field of Byte type, the result is 1.
 3. Strict: Spark doesn't allow any possible precision loss or data
 truncation in store assignment, e.g., converting either `double` to `int`
 or `decimal` to `double` is not allowed. The rules are originally for Dataset
 encoder. As far as I know, no mainstream DBMS is using this policy by
 default.

 Currently, the V1 data source uses "Legacy" policy by default, while V2
 uses "Strict". This proposal is to use "ANSI" policy by default for both V1
 and V2 in Spark 3.0.

 This vote is open until Friday (Oct. 11).

 [ ] +1: Accept the proposal
 [ ] +0
 [ ] -1: I don't think this is a good idea because ...

 Thank you!

 Gengliang

>>>
>>>
>>> --
>>> ---
>>> Takeshi Yamamuro
>>>
>> --
> [image: Databricks Summit - Watch the talks]
> 
>


-- 
Ryan Blue
Software Engineer
Netflix


Re: [k8s] Spark operator (the Java one)

2019-10-10 Thread Sean Owen
I'd have the same question on the PR - why does this need to be in the
Apache Spark project vs where it is now? Yes, it's not a Spark package
per se, but it seems like this is a tool for K8S to use Spark rather
than a core Spark tool.

Yes of course all the packages, licenses, etc have to be overhauled,
but that kind of underscores that this is a dump of a third party tool
that works fine on its own?

On Thu, Oct 10, 2019 at 9:30 AM Jiri Kremser  wrote:
>
> Hello,
>
>
> Spark Operator is a tool that can deploy/scale and help with monitoring of
> Spark clusters on Kubernetes. It follows the operator pattern [1] introduced
> by CoreOS, so it watches for changes in custom resources representing the
> desired state of the clusters and takes the steps to achieve this state in
> Kubernetes using the K8s client. It’s written in Java and there is an
> overlap with the Spark dependencies (logging, k8s client, apache-commons-*,
> fasterxml-jackson, etc.). The operator also contains metadata that allows it
> to be deployed smoothly via operatorhub.io [2]. For basic info, check
> the README on the project page, including the gif :) Another unique feature
> of this operator is the (optional) ability to compile itself to a native
> image using the GraalVM compiler, so it starts fast and has a very low
> memory footprint.
>
>
> We would like to contribute this project to Spark’s code base. It can’t be
> distributed as a Spark package, because it’s not a library that can be used
> from the Spark environment. So if you are interested, the directory under
> resource-managers/kubernetes/spark-operator/ could be a suitable destination.
>
>
> The current repository is radanalyticsio/spark-operator [3] on GitHub and it
> also contains a test suite [4] that verifies that the operator works well on
> K8s (using minikube) and also on OpenShift. I am not sure how to transfer
> those tests in case you are interested in those as well.
>
>
> I’ve already opened the PR [5], but it got closed, so I am opening the
> discussion here first. The PR contained old package names with our
> organisation called radanalytics.io, but we are willing to change that to
> anything more aligned with the existing Spark conventions; the same holds
> for the license headers in all the source files.
>
>
> jk
>
>
>
> [1]: https://kubernetes.io/docs/concepts/extend-kubernetes/operator/
>
> [2]: https://operatorhub.io/operator/radanalytics-spark
>
> [3]: https://github.com/radanalyticsio/spark-operator
>
> [4]: https://travis-ci.org/radanalyticsio/spark-operator
>
> [5]: https://github.com/apache/spark/pull/26075

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



[k8s] Spark operator (the Java one)

2019-10-10 Thread Jiri Kremser
Hello,

Spark Operator is a tool that can deploy/scale and help with monitoring of
Spark clusters on Kubernetes. It follows the operator pattern [1]
introduced by CoreOS, so it watches for changes in custom resources
representing the desired state of the clusters and takes the steps to
achieve this state in Kubernetes using the K8s client. It’s written
in Java and there is an overlap with the Spark dependencies (logging, k8s
client, apache-commons-*, fasterxml-jackson, etc.). The operator also
contains metadata that allows it to be deployed smoothly via operatorhub.io
[2]. For basic info, check the README on the project page, including
the gif :) Another unique feature of this operator is the (optional) ability
to compile itself to a native image using the GraalVM compiler, so it starts
fast and has a very low memory footprint.

We would like to contribute this project to Spark’s code base. It can’t be
distributed as a Spark package, because it’s not a library that can be used
from the Spark environment. So if you are interested, the directory under
resource-managers/kubernetes/spark-operator/ could be a suitable
destination.

The current repository is radanalyticsio/spark-operator [3] on GitHub and it
also contains a test suite [4] that verifies that the operator works well
on K8s (using minikube) and also on OpenShift. I am not sure how to
transfer those tests in case you are interested in those as well.

I’ve already opened the PR [5], but it got closed, so I am opening the
discussion here first. The PR contained old package names with our
organisation called radanalytics.io, but we are willing to change that to
anything more aligned with the existing Spark conventions; the same holds
for the license headers in all the source files.

jk


[1]: https://kubernetes.io/docs/concepts/extend-kubernetes/operator/

[2]: https://operatorhub.io/operator/radanalytics-spark

[3]: https://github.com/radanalyticsio/spark-operator

[4]: https://travis-ci.org/radanalyticsio/spark-operator
[5]: https://github.com/apache/spark/pull/26075
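
For readers unfamiliar with the operator pattern referenced in [1], here is a
minimal, self-contained sketch of the watch-and-reconcile loop it describes.
All types and client calls below are invented for illustration; this is not the
radanalyticsio/spark-operator code, nor the API of any real Kubernetes client.

```scala
// Sketch of the operator pattern: custom resources declare a desired state,
// the operator compares it with the observed state and takes steps to
// converge. Everything here is a stub invented for illustration.
object OperatorPatternSketch {

  // Desired state declared by a (hypothetical) SparkCluster custom resource.
  final case class SparkClusterSpec(name: String, workers: Int)
  // State actually observed in the cluster.
  final case class ObservedState(workers: Int)

  // Stubbed cluster interactions; a real operator would use a K8s client here.
  trait ClusterClient {
    def observe(name: String): ObservedState
    def scaleWorkers(name: String, delta: Int): Unit
  }

  // One reconcile step: converge the observed state toward the desired state.
  def reconcile(spec: SparkClusterSpec, client: ClusterClient): Unit = {
    val observed = client.observe(spec.name)
    val delta = spec.workers - observed.workers
    if (delta != 0) client.scaleWorkers(spec.name, delta)
  }

  def main(args: Array[String]): Unit = {
    // In-memory stand-in for the cluster, so the sketch runs on its own.
    val client = new ClusterClient {
      private var workers = 1
      def observe(name: String): ObservedState = ObservedState(workers)
      def scaleWorkers(name: String, delta: Int): Unit = {
        workers += delta
        println(s"scaled $name by $delta, now $workers workers")
      }
    }
    // A real operator would receive spec changes from a watch on the custom
    // resource; here two desired states are applied in sequence.
    Seq(SparkClusterSpec("my-cluster", 3), SparkClusterSpec("my-cluster", 5))
      .foreach(spec => reconcile(spec, client))
  }
}
```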


Re: Committing while Jenkins down?

2019-10-10 Thread Shane Knapp
for running k8s tests locally, i have a section dedicated to that here:
https://spark.apache.org/developer-tools.html

minikube and friends is pretty straightforward to set up, but we're
running an older version of the former.  i am planning on addressing
that (and moving us to a recent release) by the end of this month...

...if we get power back.  ;)

shane

On Thu, Oct 10, 2019 at 9:16 AM Holden Karau  wrote:
>
>
>
> On Thu, Oct 10, 2019 at 9:13 AM Xiao Li  wrote:
>>
>> Thanks! Shane!
>>
>> AFAIK, it normally takes more than 5/6 hours to run all the tests. Any major 
>> changes in Core/SQL require running all the tests. If any committer did it 
>> before merging the code, I think it is fine to merge it.
>
> Glad we're on the same page. Personally I have a desktop I can kick off the
> tests on while I keep working on other things. Just need a separate check out.
>
> Note: Some of the k8s integration tests don’t work outside of Jenkins without 
> extra configuration, so I think we need to be careful with k8s related 
> changes.
>>
>>
>> Xiao
>>
>> Holden Karau  wrote on Thu, Oct 10, 2019 at 9:11 AM:
>>>
>>> Awesome, thanks Shane :)
>>>
>>> In the meantime I think committers can just run tests locally and it’ll be 
>>> a slower process but I don’t think we need to halt all merging.
>>>
>>> On Thu, Oct 10, 2019 at 9:07 AM Shane Knapp  wrote:

 if we do get power back before the weekend, i can have my sysadmin
 head down to the colo friday afternoon and power up jenkins.  he knows
 the drill.


 On Thu, Oct 10, 2019 at 8:50 AM Holden Karau  wrote:
 >
 > I think a reasonable, albeit slow, option is to run the tests locally. 
 > Since the outage could be as long as five days I’d rather not just have 
 > PRs pile up for that entire period.
 >
 > On Thu, Oct 10, 2019 at 8:38 AM Xiao Li  wrote:
 >>
 >> I think we are unable to merge any major PR if we do not know whether 
 >> the tests can pass.
 >>
 >> Xiao
 >>
 >> Xiao Li  wrote on Thu, Oct 10, 2019 at 8:36 AM:
 >>>
 >>> Please check the note from Shane.
 >>>
 >>> [build system] IMPORTANT! northern california fire danger, potential 
 >>> power outage(s)
 >>>
 >>> Thomas graves  wrote on Thu, Oct 10, 2019 at 8:35 AM:
 
  This is directed towards committers/PMC members.
 
  It looks like Jenkins will be down for a while, what are everyone's
  thoughts on committing PRs while it's down?  Do we want to wait for
  Jenkins to come back up, manually run things ourselves and commit?
 
  Tom
 
  -
  To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
 
 > --
 > Twitter: https://twitter.com/holdenkarau
 > Books (Learning Spark, High Performance Spark, etc.): 
 > https://amzn.to/2MaRAG9
 > YouTube Live Streams: https://www.youtube.com/user/holdenkarau



 --
 Shane Knapp
 UC Berkeley EECS Research / RISELab Staff Technical Lead
 https://rise.cs.berkeley.edu
>>>
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>> Books (Learning Spark, High Performance Spark, etc.): 
>>> https://amzn.to/2MaRAG9
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau



-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Committing while Jenkins down?

2019-10-10 Thread Shane Knapp
yeah, as long as tests are run locally i'm ok w/merging.  once
california is elevated from 'developing country' status to 'holy crap,
the magic of electricity has returned!' and we can automatically build
again, any errors that slipped through will be caught in jenkins.

this means we'll need to keep a close eye on the 2.4 and master builds
for most of next week to ensure correctness.

shane

On Thu, Oct 10, 2019 at 9:13 AM Xiao Li  wrote:
>
> Thanks! Shane!
>
> AFAIK, it normally takes more than 5/6 hours to run all the tests. Any major 
> changes in Core/SQL require running all the tests. If any committer did it 
> before merging the code, I think it is fine to merge it.
>
> Xiao
>
> Holden Karau  wrote on Thu, Oct 10, 2019 at 9:11 AM:
>>
>> Awesome, thanks Shane :)
>>
>> In the meantime I think committers can just run tests locally and it’ll be a 
>> slower process but I don’t think we need to halt all merging.
>>
>> On Thu, Oct 10, 2019 at 9:07 AM Shane Knapp  wrote:
>>>
>>> if we do get power back before the weekend, i can have my sysadmin
>>> head down to the colo friday afternoon and power up jenkins.  he knows
>>> the drill.
>>>
>>>
>>> On Thu, Oct 10, 2019 at 8:50 AM Holden Karau  wrote:
>>> >
>>> > I think a reasonable, albeit slow, option is to run the tests locally. 
>>> > Since the outage could be as long as five days I’d rather not just have 
>>> > PRs pile up for that entire period.
>>> >
>>> > On Thu, Oct 10, 2019 at 8:38 AM Xiao Li  wrote:
>>> >>
>>> >> I think we are unable to merge any major PR if we do not know whether 
>>> >> the tests can pass.
>>> >>
>>> >> Xiao
>>> >>
>>> >> Xiao Li  wrote on Thu, Oct 10, 2019 at 8:36 AM:
>>> >>>
>>> >>> Please check the note from Shane.
>>> >>>
>>> >>> [build system] IMPORTANT! northern california fire danger, potential 
>>> >>> power outage(s)
>>> >>>
>>> >>> Thomas graves  wrote on Thu, Oct 10, 2019 at 8:35 AM:
>>> 
>>>  This is directed towards committers/PMC members.
>>> 
>>>  It looks like Jenkins will be down for a while, what are everyone's
>>>  thoughts on committing PRs while it's down?  Do we want to wait for
>>>  Jenkins to come back up, manually run things ourselves and commit?
>>> 
>>>  Tom
>>> 
>>>  -
>>>  To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>> 
>>> > --
>>> > Twitter: https://twitter.com/holdenkarau
>>> > Books (Learning Spark, High Performance Spark, etc.): 
>>> > https://amzn.to/2MaRAG9
>>> > YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>
>>>
>>>
>>> --
>>> Shane Knapp
>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>> https://rise.cs.berkeley.edu
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau



-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Committing while Jenkins down?

2019-10-10 Thread Holden Karau
On Thu, Oct 10, 2019 at 9:13 AM Xiao Li  wrote:

> Thanks! Shane!
>
> AFAIK, it normally takes *more than 5/6 hours* to run all the tests. Any
> major changes in Core/SQL require running all the tests. If any committer
> did it before merging the code, I think it is fine to merge it.
>
Glad we're on the same page. Personally I have a desktop I can kick off the
tests on while I keep working on other things. Just need a separate check
out.

Note: Some of the k8s integration tests don’t work outside of Jenkins
without extra configuration, so I think we need to be careful with k8s
related changes.

>
> Xiao
>
> Holden Karau  wrote on Thu, Oct 10, 2019 at 9:11 AM:
>
>> Awesome, thanks Shane :)
>>
>> In the meantime I think committers can just run tests locally and it’ll
>> be a slower process but I don’t think we need to halt all merging.
>>
>> On Thu, Oct 10, 2019 at 9:07 AM Shane Knapp  wrote:
>>
>>> if we do get power back before the weekend, i can have my sysadmin
>>> head down to the colo friday afternoon and power up jenkins.  he knows
>>> the drill.
>>>
>>>
>>> On Thu, Oct 10, 2019 at 8:50 AM Holden Karau 
>>> wrote:
>>> >
>>> > I think a reasonable, albeit slow, option is to run the tests locally.
>>> Since the outage could be as long as five days I’d rather not just have PRs
>>> pile up for that entire period.
>>> >
>>> > On Thu, Oct 10, 2019 at 8:38 AM Xiao Li  wrote:
>>> >>
>>> >> I think we are unable to merge any major PR if we do not know whether
>>> the tests can pass.
>>> >>
>>> >> Xiao
>>> >>
>>> >> Xiao Li  wrote on Thu, Oct 10, 2019 at 8:36 AM:
>>> >>>
>>> >>> Please check the note from Shane.
>>> >>>
>>> >>> [build system] IMPORTANT! northern california fire danger, potential
>>> power outage(s)
>>> >>>
>>> >>> Thomas graves  wrote on Thu, Oct 10, 2019 at 8:35 AM:
>>> 
>>>  This is directed towards committers/PMC members.
>>> 
>>>  It looks like Jenkins will be down for a while, what are everyone's
>>>  thoughts on committing PRs while it's down?  Do we want to wait for
>>>  Jenkins to come back up, manually run things ourselves and commit?
>>> 
>>>  Tom
>>> 
>>> 
>>> -
>>>  To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>> 
>>> > --
>>> > Twitter: https://twitter.com/holdenkarau
>>> > Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9
>>> > YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>
>>>
>>>
>>> --
>>> Shane Knapp
>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>> https://rise.cs.berkeley.edu
>>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9  
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
> --
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: Committing while Jenkins down?

2019-10-10 Thread Xiao Li
Thanks! Shane!

AFAIK, it normally takes *more than 5/6 hours* to run all the tests. Any
major changes in Core/SQL require running all the tests. If any committer
did it before merging the code, I think it is fine to merge it.

Xiao

Holden Karau  wrote on Thu, Oct 10, 2019 at 9:11 AM:

> Awesome, thanks Shane :)
>
> In the meantime I think committers can just run tests locally and it’ll be
> a slower process but I don’t think we need to halt all merging.
>
> On Thu, Oct 10, 2019 at 9:07 AM Shane Knapp  wrote:
>
>> if we do get power back before the weekend, i can have my sysadmin
>> head down to the colo friday afternoon and power up jenkins.  he knows
>> the drill.
>>
>>
>> On Thu, Oct 10, 2019 at 8:50 AM Holden Karau 
>> wrote:
>> >
>> > I think a reasonable, albeit slow, option is to run the tests locally.
>> Since the outage could be as long as five days I’d rather not just have PRs
>> pile up for that entire period.
>> >
>> > On Thu, Oct 10, 2019 at 8:38 AM Xiao Li  wrote:
>> >>
>> >> I think we are unable to merge any major PR if we do not know whether
>> the tests can pass.
>> >>
>> >> Xiao
>> >>
>> >> Xiao Li  wrote on Thu, Oct 10, 2019 at 8:36 AM:
>> >>>
>> >>> Please check the note from Shane.
>> >>>
>> >>> [build system] IMPORTANT! northern california fire danger, potential
>> power outage(s)
>> >>>
>> >>> Thomas graves  wrote on Thu, Oct 10, 2019 at 8:35 AM:
>> 
>>  This is directed towards committers/PMC members.
>> 
>>  It looks like Jenkins will be down for a while, what are everyone's
>>  thoughts on committing PRs while it's down?  Do we want to wait for
>>  Jenkins to come back up, manually run things ourselves and commit?
>> 
>>  Tom
>> 
>>  -
>>  To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> 
>> > --
>> > Twitter: https://twitter.com/holdenkarau
>> > Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9
>> > YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>>
>>
>> --
>> Shane Knapp
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


Re: Committing while Jenkins down?

2019-10-10 Thread Holden Karau
Awesome, thanks Shane :)

In the meantime I think committers can just run tests locally and it’ll be
a slower process but I don’t think we need to halt all merging.

On Thu, Oct 10, 2019 at 9:07 AM Shane Knapp  wrote:

> if we do get power back before the weekend, i can have my sysadmin
> head down to the colo friday afternoon and power up jenkins.  he knows
> the drill.
>
>
> On Thu, Oct 10, 2019 at 8:50 AM Holden Karau  wrote:
> >
> > I think a reasonable, albeit slow, option is to run the tests locally.
> Since the outage could be as long as five days I’d rather not just have PRs
> pile up for that entire period.
> >
> > On Thu, Oct 10, 2019 at 8:38 AM Xiao Li  wrote:
> >>
> >> I think we are unable to merge any major PR if we do not know whether
> the tests can pass.
> >>
> >> Xiao
> >>
> >> Xiao Li  wrote on Thu, Oct 10, 2019 at 8:36 AM:
> >>>
> >>> Please check the note from Shane.
> >>>
> >>> [build system] IMPORTANT! northern california fire danger, potential
> power outage(s)
> >>>
> >>> Thomas graves  wrote on Thu, Oct 10, 2019 at 8:35 AM:
> 
>  This is directed towards committers/PMC members.
> 
 It looks like Jenkins will be down for a while, what are everyone's
 thoughts on committing PRs while it's down?  Do we want to wait for
>  Jenkins to come back up, manually run things ourselves and commit?
> 
>  Tom
> 
>  -
>  To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> 
> > --
> > Twitter: https://twitter.com/holdenkarau
> > Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9
> > YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>
>
>
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>
-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: Committing while Jenkins down?

2019-10-10 Thread Shane Knapp
if we do get power back before the weekend, i can have my sysadmin
head down to the colo friday afternoon and power up jenkins.  he knows
the drill.


On Thu, Oct 10, 2019 at 8:50 AM Holden Karau  wrote:
>
> I think a reasonable, albeit slow, option is to run the tests locally. Since 
> the outage could be as long as five days I’d rather not just have PRs pile up 
> for that entire period.
>
> On Thu, Oct 10, 2019 at 8:38 AM Xiao Li  wrote:
>>
>> I think we are unable to merge any major PR if we do not know whether the 
>> tests can pass.
>>
>> Xiao
>>
>>> Xiao Li  wrote on Thu, Oct 10, 2019 at 8:36 AM:
>>>
>>> Please check the note from Shane.
>>>
>>> [build system] IMPORTANT! northern california fire danger, potential power 
>>> outage(s)
>>>
 Thomas graves  wrote on Thu, Oct 10, 2019 at 8:35 AM:

 This is directed towards committers/PMC members.

 It looks like Jenkins will be down for a while, what are everyone's
 thoughts on committing PRs while it's down?  Do we want to wait for
 Jenkins to come back up, manually run things ourselves and commit?

 Tom

 -
 To unsubscribe e-mail: dev-unsubscr...@spark.apache.org

> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau



-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [build system] IMPORTANT! northern california fire danger, potential power outage(s)

2019-10-10 Thread Shane Knapp
another quick update:

campus lost power ~1130pm, and is closed for the entirety of today.

no word on power restoration, campus status, etc etc.

updates as they come.  :\

On Wed, Oct 9, 2019 at 2:34 PM Shane Knapp  wrote:
>
> quick update:
>
> campus is losing power @ 8pm.  this is after we were told 4am, 8am,
> noon, and 2-4pm.  :)
>
> PG&E expects to start bringing alameda county back online at noon
> tomorrow, but i believe that target to be fluid and take longer than
> expected.
>
> this means that the earliest that we can bring the build system back
> up is friday, but there's a much greater than non-zero chance of this
> not happening until monday morning.  i will be leaving town for the
> weekend friday afternoon, which means i won't be physically present to
> turn on all of our servers in the colo (about ~80 servers including
> jenkins) until monday.
>
> more updates as they come.  thanks for your patience!
>
> On Tue, Oct 8, 2019 at 7:32 PM Shane Knapp  wrote:
> >
> > jenkins is going down now.
> >
> > On Tue, Oct 8, 2019 at 4:21 PM Shane Knapp  wrote:
> > >
> > > quick update:
> > >
> > > we are definitely going to have our power shut off starting early
> > > tomorrow morning (by 4am PDT oct 9th), and expect at least 48 hours
> > > before it is restored.
> > >
> > > i will be shutting jenkins down some time this evening, and will
> > > update everyone here when i get more information.
> > >
> > > full service will be restored (i HOPE) by friday morning.
> > >
> > > shane (who doesn't ever want to check this list's archives and count
> > > how many times we've had power issues)
> > >
> > > On Tue, Oct 8, 2019 at 12:50 PM Shane Knapp  wrote:
> > > >
> > > > here in the lovely bay area, we are currently experiencing some
> > > > absolutely lovely weather:  temps around 20C, light winds, and not a
> > > > drop of moisture anywhere.
> > > >
> > > > this means that wildfire season is here, and our utilities company
> > > > (PG&E) is very concerned about fires like last year's Camp Fire
> > > > (https://en.wikipedia.org/wiki/Camp_Fire_(2018)), the 2018 fires
> > > > (https://en.wikipedia.org/wiki/2018_California_wildfires) and 2017
> > > > fires (https://en.wikipedia.org/wiki/2017_California_wildfires).
> > > >
> > > > because conditions are absolutely perfect for wildfires, we may lose
> > > > power here in berkeley tomorrow and thursday.
> > > >
> > > > there will be little to no notice of then this might happen, and if it
> > > > does that means that jenkins will most definitely go down.
> > > >
> > > > i will continue to keep a close eye on this and give updates as they
> > > > happen.  sadly, the pg&e website is down because they apparently
> > > > didn't think that they needed load balancers.  :\
> > > >
> > > > shane
> > > > --
> > > > Shane Knapp
> > > > UC Berkeley EECS Research / RISELab Staff Technical Lead
> > > > https://rise.cs.berkeley.edu
> > >
> > >
> > >
> > > --
> > > Shane Knapp
> > > UC Berkeley EECS Research / RISELab Staff Technical Lead
> > > https://rise.cs.berkeley.edu
> >
> >
> >
> > --
> > Shane Knapp
> > UC Berkeley EECS Research / RISELab Staff Technical Lead
> > https://rise.cs.berkeley.edu
>
>
>
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu



-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Committing while Jenkins down?

2019-10-10 Thread Holden Karau
I think a reasonable, albeit slow, option is to run the tests locally.
Since the outage could be as long as five days I’d rather not just have PRs
pile up for that entire period.

On Thu, Oct 10, 2019 at 8:38 AM Xiao Li  wrote:

> I think we are unable to merge any major PR if we do not know whether the
> tests can pass.
>
> Xiao
>
> Xiao Li  wrote on Thu, Oct 10, 2019 at 8:36 AM:
>
>> Please check the note from Shane.
>>
>> [build system] IMPORTANT! northern california fire danger, potential
>> power outage(s)
>>
>> Thomas graves  wrote on Thu, Oct 10, 2019 at 8:35 AM:
>>
> This is directed towards committers/PMC members.
>>>
>>> It looks like Jenkins will be down for a while, what are everyone's
>>> thoughts on committing PRs while it's down?  Do we want to wait for
>>> Jenkins to come back up, manually run things ourselves and commit?
>>>
>>> Tom
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>> --
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: Committing while Jenkins down?

2019-10-10 Thread Xiao Li
Please check the note from Shane.

[build system] IMPORTANT! northern california fire danger, potential power
outage(s)

Thomas graves  wrote on Thu, Oct 10, 2019 at 8:35 AM:

> This is directed towards committers/PMC members.
>
> It looks like Jenkins will be down for a while, what are everyone's
> thoughts on committing PRs while it's down?  Do we want to wait for
> Jenkins to come back up, manually run things ourselves and commit?
>
> Tom
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Committing while Jenkins down?

2019-10-10 Thread Xiao Li
I think we are unable to merge any major PR if we do not know whether the
tests can pass.

Xiao

Xiao Li  wrote on Thu, Oct 10, 2019 at 8:36 AM:

> Please check the note from Shane.
>
> [build system] IMPORTANT! northern california fire danger, potential power
> outage(s)
>
> Thomas graves  wrote on Thu, Oct 10, 2019 at 8:35 AM:
>
>> This is directed towards committers/PMC members.
>>
>> It looks like Jenkins will be down for a while, what are everyone's
>> thoughts on committing PRs while it's down?  Do we want to wait for
>> Jenkins to come back up, manually run things ourselves and commit?
>>
>> Tom
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Committing while Jenkins down?

2019-10-10 Thread Thomas graves
This is directed towards committers/PMC members.

It looks like Jenkins will be down for a while, what are everyone's
thoughts on committing PRs while it's down?  Do we want to wait for
Jenkins to come back up, manually run things ourselves and commit?

Tom

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE][SPARK-28885] Follow ANSI store assignment rules in table insertion by default

2019-10-10 Thread Xiao Li
+1

On Thu, Oct 10, 2019 at 2:13 AM Hyukjin Kwon  wrote:

> +1 (binding)
>
> On Thu, Oct 10, 2019 at 5:11 PM, Takeshi Yamamuro wrote:
>
>> Thanks for the great work, Gengliang!
>>
>> +1 for that.
>> As I said before, the behaviour is pretty common in DBMSs, so the change
>> helps DBMS users.
>>
>> Bests,
>> Takeshi
>>
>>
>> On Mon, Oct 7, 2019 at 5:24 PM Gengliang Wang <
>> gengliang.w...@databricks.com> wrote:
>>
>>> Hi everyone,
>>>
>>> I'd like to call for a new vote on SPARK-28885
>>>  "Follow ANSI store
>>> assignment rules in table insertion by default" after revising the ANSI
>>> store assignment policy(SPARK-29326
>>> ).
>>> When inserting a value into a column with a different data type, Spark
>>> performs type coercion. Currently, we support 3 policies for the store
>>> assignment rules: ANSI, legacy and strict, which can be set via the option
>>> "spark.sql.storeAssignmentPolicy":
>>> 1. ANSI: Spark performs the store assignment as per ANSI SQL. In
>>> practice, the behavior is mostly the same as PostgreSQL. It disallows
>>> certain unreasonable type conversions such as converting `string` to `int`
>>> and `double` to `boolean`. It will throw a runtime exception if the value
>>> is out-of-range(overflow).
>>> 2. Legacy: Spark allows the store assignment as long as it is a valid
>>> `Cast`, which is very loose. E.g., converting either `string` to `int` or
>>> `double` to `boolean` is allowed. It is the current behavior in Spark 2.x
>>> for compatibility with Hive. When inserting an out-of-range value to an
>>> integral field, the low-order bits of the value are inserted (the same as
>>> Java/Scala numeric type casting). For example, if 257 is inserted into a
>>> field of Byte type, the result is 1.
>>> 3. Strict: Spark doesn't allow any possible precision loss or data
>>> truncation in store assignment, e.g., converting either `double` to `int`
>>> or `decimal` to `double` is not allowed. The rules are originally for Dataset
>>> encoder. As far as I know, no mainstream DBMS is using this policy by
>>> default.
>>>
>>> Currently, the V1 data source uses "Legacy" policy by default, while V2
>>> uses "Strict". This proposal is to use "ANSI" policy by default for both V1
>>> and V2 in Spark 3.0.
>>>
>>> This vote is open until Friday (Oct. 11).
>>>
>>> [ ] +1: Accept the proposal
>>> [ ] +0
>>> [ ] -1: I don't think this is a good idea because ...
>>>
>>> Thank you!
>>>
>>> Gengliang
>>>
>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
> --
[image: Databricks Summit - Watch the talks]



Re: [SS] How to create a streaming DataFrame (for a custom Source in Spark 2.4.4 / MicroBatch / DSv1)?

2019-10-10 Thread Jacek Laskowski
Hi,

Thanks a lot for such a thorough conversation. Enjoyed it very much.

> Source/Sink traits are in org.apache.spark.sql.execution and thus they
are private.

That would explain why I couldn't find scaladocs.

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
The Internals of Spark SQL https://bit.ly/spark-sql-internals
The Internals of Spark Structured Streaming
https://bit.ly/spark-structured-streaming
The Internals of Apache Kafka https://bit.ly/apache-kafka-internals
Follow me at https://twitter.com/jaceklaskowski



On Wed, Oct 9, 2019 at 7:46 AM Wenchen Fan  wrote:

> > Would you mind if I ask what the conditions are for being a public API?
>
> The module doesn't matter, but the package matters. We have many public
> APIs in the catalyst module as well. (e.g. DataType)
>
> There are 3 packages in Spark SQL that are meant to be private:
> 1. org.apache.spark.sql.catalyst
> 2. org.apache.spark.sql.execution
> 3. org.apache.spark.sql.internal
>
> You can check out the full list of private packages of Spark in
> project/SparkBuild.scala#Unidoc#ignoreUndocumentedPackages
>
> Basically, classes/interfaces that don't appear in the official Spark API
> doc are private.
>
> Source/Sink traits are in org.apache.spark.sql.execution and thus they are
> private.
>
> On Tue, Oct 8, 2019 at 6:19 AM Jungtaek Lim 
> wrote:
>
>> Would you mind if I ask what the conditions are for being a public API?
>> Source/Sink traits are not marked as @DeveloperApi but they're defined as
>> public, and located in sql-core, so they are not even semantically private
>> (for catalyst); it's easy to get the signal that they're public APIs.
>>
>> Also, if I'm not missing something here, creating a streaming DataFrame via
>> RDD[Row] is not available even through private APIs. There are some other
>> approaches using private APIs: 1) SQLContext.internalCreateDataFrame - as it
>> requires RDD[InternalRow], callers also have to depend on catalyst and deal
>> with InternalRow, which the Spark community seems to want to change
>> eventually; 2) Dataset.ofRows - it requires a LogicalPlan, which is also in
>> catalyst. So they not only need to apply the "package hack" but also need to
>> depend on catalyst.
>>
>>
>> On Mon, Oct 7, 2019 at 9:45 PM Wenchen Fan  wrote:
>>
>>> AFAIK there is no public streaming data source API before DS v2. The
>>> Source and Sink API is private and is only for builtin streaming sources.
>>> Advanced users can still implement custom stream sources with private Spark
>>> APIs (you can put your classes under the org.apache.spark.sql package to
>>> access the private methods).
>>>
>>> That said, DS v2 is the first public streaming data source API. It's
>>> really hard to design a stable, efficient and flexible data source API that
>>> is unified between batch and streaming. DS v2 has evolved a lot in the
>>> master branch and hopefully there will be no big breaking changes anymore.
>>>
>>>
>>> On Sat, Oct 5, 2019 at 12:24 PM Jungtaek Lim <
>>> kabhwan.opensou...@gmail.com> wrote:
>>>
 I remembered the actual case from developer who implements custom data
 source.


 https://lists.apache.org/thread.html/c1a210510b48bb1fea89828c8e2f5db8c27eba635e0079a97b0c7faf@%3Cdev.spark.apache.org%3E

 Quoting here:
 We started implementing DSv2 in the 2.4 branch, but quickly discovered
 that the DSv2 in 3.0 was a complete breaking change (to the point where it
 could have been named DSv3 and it wouldn’t have come as a surprise). Since
 the DSv2 in 3.0 has a compatibility layer for DSv1 datasources, we decided
 to fall back into DSv1 in order to ease the future transition to Spark 3.

 Given that DSv2 for Spark 2.x and 3.x have diverged a lot, a realistic
 solution for dealing with the DSv2 breaking change is having DSv1 as a
 temporary solution, even when DSv2 for 3.x becomes available. They need some
 time to make the transition.

 I would file an issue to support streaming data source on DSv1 and
 submit a patch unless someone objects.


 On Wed, Oct 2, 2019 at 4:08 PM Jacek Laskowski  wrote:

> Hi Jungtaek,
>
> Thanks a lot for your very prompt response!
>
> > Looks like it's missing, or intended to force custom streaming
> source implemented as DSv2.
>
> That's exactly my understanding = no more DSv1 data sources. That
> however is not consistent with the official message, is it? Spark 2.4.4
> does not actually say "we're abandoning DSv1", and people would not really
> want to jump on DSv2 since it's not recommended (unless I missed that).
>
> I love surprises (as that's where people pay more for consulting :)),
> but not necessarily before public talks (with one at SparkAISummit in two
> weeks!) Gonna be challenging! Hope I won't spread a wrong word.
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://about.me/JacekLaskowski
> The Internals of Spark SQL https://bit.ly/spark-sql-internals
> The Internals of Spark 

Re: [VOTE][SPARK-28885] Follow ANSI store assignment rules in table insertion by default

2019-10-10 Thread Hyukjin Kwon
+1 (binding)

On Thu, Oct 10, 2019 at 5:11 PM, Takeshi Yamamuro wrote:

> Thanks for the great work, Gengliang!
>
> +1 for that.
> As I said before, the behaviour is pretty common in DBMSs, so the change
> helps for DMBS users.
>
> Bests,
> Takeshi
>
>
> On Mon, Oct 7, 2019 at 5:24 PM Gengliang Wang <
> gengliang.w...@databricks.com> wrote:
>
>> Hi everyone,
>>
>> I'd like to call for a new vote on SPARK-28885
>>  "Follow ANSI store
>> assignment rules in table insertion by default" after revising the ANSI
>> store assignment policy(SPARK-29326
>> ).
>> When inserting a value into a column with a different data type, Spark
>> performs type coercion. Currently, we support 3 policies for the store
>> assignment rules: ANSI, legacy and strict, which can be set via the option
>> "spark.sql.storeAssignmentPolicy":
>> 1. ANSI: Spark performs the store assignment as per ANSI SQL. In
>> practice, the behavior is mostly the same as PostgreSQL. It disallows
>> certain unreasonable type conversions such as converting `string` to `int`
>> and `double` to `boolean`. It will throw a runtime exception if the value
>> is out-of-range(overflow).
>> 2. Legacy: Spark allows the store assignment as long as it is a valid
>> `Cast`, which is very loose. E.g., converting either `string` to `int` or
>> `double` to `boolean` is allowed. It is the current behavior in Spark 2.x
>> for compatibility with Hive. When inserting an out-of-range value to an
>> integral field, the low-order bits of the value are inserted (the same as
>> Java/Scala numeric type casting). For example, if 257 is inserted into a
>> field of Byte type, the result is 1.
>> 3. Strict: Spark doesn't allow any possible precision loss or data
>> truncation in store assignment, e.g., converting either `double` to `int`
>> or `decimal` to `double` is not allowed. The rules are originally for Dataset
>> encoder. As far as I know, no mainstream DBMS is using this policy by
>> default.
>>
>> Currently, the V1 data source uses "Legacy" policy by default, while V2
>> uses "Strict". This proposal is to use "ANSI" policy by default for both V1
>> and V2 in Spark 3.0.
>>
>> This vote is open until Friday (Oct. 11).
>>
>> [ ] +1: Accept the proposal
>> [ ] +0
>> [ ] -1: I don't think this is a good idea because ...
>>
>> Thank you!
>>
>> Gengliang
>>
>
>
> --
> ---
> Takeshi Yamamuro
>


Re: [VOTE][SPARK-28885] Follow ANSI store assignment rules in table insertion by default

2019-10-10 Thread Takeshi Yamamuro
Thanks for the great work, Gengliang!

+1 for that.
As I said before, the behaviour is pretty common in DBMSs, so the change
helps DBMS users.

Bests,
Takeshi


On Mon, Oct 7, 2019 at 5:24 PM Gengliang Wang 
wrote:

> Hi everyone,
>
> I'd like to call for a new vote on SPARK-28885
>  "Follow ANSI store
> assignment rules in table insertion by default" after revising the ANSI
> store assignment policy(SPARK-29326
> ).
> When inserting a value into a column with a different data type, Spark
> performs type coercion. Currently, we support 3 policies for the store
> assignment rules: ANSI, legacy and strict, which can be set via the option
> "spark.sql.storeAssignmentPolicy":
> 1. ANSI: Spark performs the store assignment as per ANSI SQL. In practice,
> the behavior is mostly the same as PostgreSQL. It disallows certain
> unreasonable type conversions such as converting `string` to `int` and
> `double` to `boolean`. It will throw a runtime exception if the value is
> out-of-range(overflow).
> 2. Legacy: Spark allows the store assignment as long as it is a valid
> `Cast`, which is very loose. E.g., converting either `string` to `int` or
> `double` to `boolean` is allowed. It is the current behavior in Spark 2.x
> for compatibility with Hive. When inserting an out-of-range value to an
> integral field, the low-order bits of the value are inserted (the same as
> Java/Scala numeric type casting). For example, if 257 is inserted into a
> field of Byte type, the result is 1.
> 3. Strict: Spark doesn't allow any possible precision loss or data
> truncation in store assignment, e.g., converting either `double` to `int`
> or `decimal` to `double` is not allowed. The rules are originally for Dataset
> encoder. As far as I know, no mainstream DBMS is using this policy by
> default.
>
> Currently, the V1 data source uses "Legacy" policy by default, while V2
> uses "Strict". This proposal is to use "ANSI" policy by default for both V1
> and V2 in Spark 3.0.
>
> This vote is open until Friday (Oct. 11).
>
> [ ] +1: Accept the proposal
> [ ] +0
> [ ] -1: I don't think this is a good idea because ...
>
> Thank you!
>
> Gengliang
>


-- 
---
Takeshi Yamamuro