Re: dynamic executor scaling spark on kubernetes client mode

2020-05-12 Thread Steven Stetzler
Oh, thanks for mentioning that, it looks like dynamic allocation on Kubernetes
works in client mode in Spark 3.0.0. I just had to set the following
configurations:

spark.dynamicAllocation.enabled=true

spark.dynamicAllocation.shuffleTracking.enabled=true


to enable dynamic allocation and remove the need for the external shuffle
service (which looks like it is experimental right now). My executor pods
couldn't connect to the external shuffle service when it was enabled. This
seems to be working okay for me.
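
For reference, here is a minimal PySpark sketch of how these settings come
together in client mode; the master URL and container image are placeholders,
not from the original message:

```
# A minimal sketch, assuming Spark 3.0.0 on Kubernetes in client mode.
# The master URL and container image below are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("k8s://https://<cluster-ip>:<port>")                       # placeholder
    .config("spark.kubernetes.container.image", "<spark-image-url>")   # placeholder
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
    .getOrCreate()
)
```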

Thanks,
Steven


On Tue, May 12, 2020 at 4:42 AM Pradeepta Choudhury <
pradeeptachoudhu...@gmail.com> wrote:

> Hey guys, I was able to run dynamic scaling in both cluster and client mode.
> I will document it and send it over this weekend.
>
> On Tue 12 May, 2020, 1:26 PM Roland Johann, 
> wrote:
>
>> Hi all,
>>
>> I don't want to interrupt the conversation, but I am keen to know where I
>> can find information regarding dynamic allocation on Kubernetes. As far as
>> I know, the docs just point to future work.
>>
>> Thanks a lot,
>> Roland
>>
>>
>>
>> On 12.05.2020 at 09:25, Steven Stetzler wrote:
>>
>> Hi all,
>>
>> I am interested in this as well. My use-case could benefit from dynamic
>> executor scaling but we are restricted to using client mode since we are
>> only using Spark shells.
>>
>> Could anyone help me understand the barriers to getting dynamic executor
>> scaling to work in client mode on Kubernetes?
>>
>> Thanks,
>> Steven
>>
>> On Sat, May 9, 2020 at 9:48 AM Pradeepta Choudhury <
>> pradeeptachoudhu...@gmail.com> wrote:
>>
>>> Hiii ,
>>>
>>> Dynamic executor scaling is working fine for Spark on Kubernetes
>>> (latest from the Spark master repository) in cluster mode. Is dynamic
>>> executor scaling available for client mode? If yes, where can I find the
>>> usage doc for it?
>>> If not, is there a PR open for this?
>>>
>>> Thanks ,
>>> Pradeepta
>>>
>>
>>


Re: dynamic executor scaling spark on kubernetes client mode

2020-05-12 Thread Steven Stetzler
Hi all,

I am interested in this as well. My use-case could benefit from dynamic
executor scaling but we are restricted to using client mode since we are
only using Spark shells.

Could anyone help me understand the barriers to getting dynamic executor
scaling to work in client mode on Kubernetes?

Thanks,
Steven

On Sat, May 9, 2020 at 9:48 AM Pradeepta Choudhury <
pradeeptachoudhu...@gmail.com> wrote:

> Hiii ,
>
> Dynamic executor scaling is working fine for Spark on Kubernetes
> (latest from the Spark master repository) in cluster mode. Is dynamic
> executor scaling available for client mode? If yes, where can I find the
> usage doc for it?
> If not, is there a PR open for this?
>
> Thanks ,
> Pradeepta
>


Re: Fitting only the intercept for LinearRegression

2020-03-21 Thread Steven Stetzler
Hi Eugen,

You should be able to do this without the LinearRegression API. I believe
for a linear regression model (
https://en.wikipedia.org/wiki/Simple_linear_regression)

y = \alpha + \beta x

the best estimator for the intercept will be

\hat{\alpha} = \overline{y} - \hat{\beta} \overline{x}

where \overline{y} is the average of the target variable and \overline{x}
is the average of the features in your new training data. \hat{\beta} will
remain your previously fit slope. This lets you fit a new intercept
(\alpha) given the previous slope (\beta) and the averages of your new
training data, which you should be able to compute in a straightforward
manner using Spark.
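
A minimal PySpark sketch of this computation, assuming toy column names
("x", "y") and a made-up previously fit slope, purely for illustration:

```
# A minimal sketch: compute a new intercept from new training data while
# holding the previously fit slope fixed. Column names and the slope value
# are placeholder assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

beta_hat = 2.5  # previously fit slope (assumed known)

new_data = spark.createDataFrame(
    [(1.0, 3.1), (2.0, 5.4), (3.0, 7.9)], ["x", "y"]
)

# alpha_hat = mean(y) - beta_hat * mean(x)
means = new_data.agg(F.avg("x").alias("x_bar"), F.avg("y").alias("y_bar")).first()
alpha_hat = means["y_bar"] - beta_hat * means["x_bar"]
print(alpha_hat)
```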

I imagine there might be a way to do this with the LinearRegression API
though.

Thanks,
Steven


On Sat, Mar 21, 2020 at 12:49 AM  wrote:

> Hi,
> I was wondering if it would be possible to fit only the intercept on a
> LinearRegression instance by providing a known coefficient?
>
> Here is some background information: we have a problem where linear
> regression is well suited as a predictor. However, the model requires
> continuous adoption. During an initial training, the coefficient and the
> intercept of the linear model are determined from a given set of training
> data. Later, this model requires adoption during which the intercept has to
> be adopted to a new set of training data (the coefficient, in other words
> the slope, remains the same as obtained from the initial model).
>
> I had a look on the Java API for LinearRegression and could not find a way
> how to only fit the intercept and set initial parameters for a fit. Am I
> missing something? Is there a way how to to do this with the
> LinearRegression class in Sparks' ML package or do I have to use a
> different approach?
>
> Thanks in advance.
>
> regards
> Eugen
>
>


Re: Can't get Spark to interface with S3A Filesystem with correct credentials

2020-03-04 Thread Steven Stetzler
To successfully read from S3 using s3a, I've had to also set
```
spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
```
in addition to `spark.hadoop.fs.s3a.access.key` and
`spark.hadoop.fs.s3a.secret.key`. I've also needed to ensure Spark has
access to the AWS SDK jar. I have downloaded `aws-java-sdk-1.7.4.jar` (from
Maven) paired with `hadoop-aws-2.7.3.jar` in `$SPARK_HOME/jars`.

These additional configurations don't seem related to credentials and
security (and may not even be needed in my case), but perhaps they will help
you.
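
For example, here is a hedged sketch of how those settings might be combined
in PySpark; the bucket, credential values, and file path are placeholders,
and the jar versions must match your Hadoop build:

```
# A minimal sketch of reading from S3 via s3a with explicit credentials.
# Bucket name, credentials, and path below are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("s3a-example")
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .config("spark.hadoop.fs.s3a.access.key", "<access-key>")
    .config("spark.hadoop.fs.s3a.secret.key", "<secret-key>")
    .getOrCreate()
)

# Read a text file through the s3a:// scheme.
lines = spark.sparkContext.textFile("s3a://<your-bucket>/path/to/file.txt")
print(lines.take(5))
```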

Thanks,
Steven

On Wed, Mar 4, 2020 at 1:11 PM Devin Boyer 
wrote:

> Hello,
>
> I'm attempting to run Spark within a Docker container with the hope of
> eventually running Spark on Kubernetes. Nearly all the data we currently
> process with Spark is stored in S3, so I need to be able to interface with
> it using the S3A filesystem.
>
> I feel like I've gotten close to getting this working but for some reason
> cannot get my local Spark installations to correctly interface with S3 yet.
>
> A basic example of what I've tried:
>
>- Build Kubernetes docker images by downloading the
>spark-2.4.5-bin-hadoop2.7.tgz archive and building the
>kubernetes/dockerfiles/spark/Dockerfile image.
>- Run an interactive docker container using the above built image.
>- Within that container, run spark-shell. This command passes valid
>AWS credentials by setting spark.hadoop.fs.s3a.access.key and
>spark.hadoop.fs.s3a.secret.key using --conf flags, and downloads the
>hadoop-aws package by specifying the --packages
>org.apache.hadoop:hadoop-aws:2.7.3 flag.
>- Try to access the simple public file as outlined in the "Integration
>with Cloud Infrastructures" documentation by running:
>sc.textFile("s3a://landsat-pds/scene_list.gz").take(5)
>- Observe this to fail with a 403 Forbidden exception thrown by S3
>
>
> I've tried a variety of other means of setting credentials (like exporting
> the standard AWS_ACCESS_KEY_ID environment variable before launching
> spark-shell), and other means of building a Spark image and including the
> appropriate libraries (see this Github repo:
> https://github.com/drboyer/spark-s3a-demo), all with the same results.
> I've tried also accessing objects within our AWS account, rather than the
> object from the public landsat-pds bucket, with the same 403 error being
> thrown.
>
> Can anyone help explain why I can't seem to connect to S3 successfully
> using Spark, or even explain where I could look for additional clues as to
> what's misconfigured? I've tried turning up the logging verbosity and
> didn't see much that was particularly useful, but happy to share additional
> log output too.
>
> Thanks for any help you can provide!
>
> Best,
> Devin Boyer
>


Spark on k8s: Mount config map in executor

2019-08-27 Thread Steven Stetzler
Hello everyone,

I am wondering if there is a way to mount a Kubernetes ConfigMap into a
directory in a Spark executor on Kubernetes. Poking around the docs, the
only volume mounting options I can find are for a PVC, a directory on the
host machine, and an empty volume. I am trying to pass in configuration
files that alter the startup of the container for a specialized Spark
executor image, and a ConfigMap seems to be the natural Kubernetes solution
for storing and accessing these files in the cluster. However, I have no way
for the Spark executor to access them.
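
For reference, the volume mount options I am referring to look roughly like
the sketch below (hostPath variant shown); the volume name and paths are
placeholders, and note this still does not cover ConfigMaps:

```
# A minimal sketch of the existing executor volume-mount configuration
# (hostPath variant). Volume name, mount path, and host path are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.kubernetes.executor.volumes.hostPath.config-vol.mount.path", "/opt/config")
    .config("spark.kubernetes.executor.volumes.hostPath.config-vol.mount.readOnly", "true")
    .config("spark.kubernetes.executor.volumes.hostPath.config-vol.options.path", "/etc/my-config")
    .getOrCreate()
)
```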

I appreciate any help or insight the userbase can offer for this issue.

Thanks,
Steven Stetzler


Re: [Beginner] Run compute on large matrices and return the result in seconds?

2019-07-11 Thread Steven Stetzler
Hi Gautham,

I am a beginner Spark user too and I may not have a complete understanding
of your question, but I thought I would start a discussion anyway. Have you
looked into using Spark's built-in Correlation function? (
https://spark.apache.org/docs/latest/ml-statistics.html) This might let you
get what you want (per-row correlation against the same matrix) without
having to deal with parallelizing the computation yourself.

Also, I think the question of how quickly you can get your results is largely
a data-access question rather than a how-fast-is-Spark question. As long as
you can exploit data parallelism (i.e. you can partition your data), Spark
will give you a speedup.

You can imagine that if you had a large machine with many cores and ~100 GB
of RAM (e.g. an m5.12xlarge EC2 instance), you could fit your problem in main
memory and perform your computation with thread-based parallelism. This might
get your result relatively quickly. For a dedicated application with
well-constrained memory and compute requirements, it might not be a bad
option to do everything on one machine. Accessing an external database and
distributing work over a large number of computers can add overhead that
might be out of your control.
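
As a starting point, here is a minimal sketch of that built-in Correlation
utility on a toy vector column (the data and the column name are placeholders):

```
# A minimal sketch of pyspark.ml.stat.Correlation on a small toy dataset.
# The data and the "features" column name are placeholders.
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.stat import Correlation

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [
        (Vectors.dense([1.0, 0.0, 3.0]),),
        (Vectors.dense([2.0, 5.0, 1.0]),),
        (Vectors.dense([4.0, 2.0, 8.0]),),
    ],
    ["features"],
)

# Returns the full Pearson correlation matrix of the vector column.
corr_matrix = Correlation.corr(df, "features").head()[0]
print(corr_matrix)
```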

Thanks,
Steven

On Thu, Jul 11, 2019 at 9:24 AM Gautham Acharya 
wrote:

> Ping? I would really appreciate advice on this! Thank you!
>
>
>
> *From:* Gautham Acharya
> *Sent:* Tuesday, July 9, 2019 4:22 PM
> *To:* user@spark.apache.org
> *Subject:* [Beginner] Run compute on large matrices and return the result
> in seconds?
>
>
>
> This is my first email to this mailing list, so I apologize if I made any
> errors.
>
>
>
> My team is going to be building an application and I'm investigating some
> options for distributed compute systems. We want to perform computations
> on large matrices.
>
>
>
> The requirements are as follows:
>
>
>
> 1. The matrices can be expected to be up to 50,000 columns x 3
> million rows. The values are all integers (except for the row/column
> headers).
>
> 2. The application needs to select a specific row, and calculate the
> correlation coefficient (
> https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html
>  )
> against every other row. This means up to 3 million different calculations.
>
> 3. A sorted list of the correlation coefficients and their
> corresponding row keys need to be returned in under 5 seconds.
>
> 4. Users will eventually request random row/column subsets to run
> calculations on, so precomputing our coefficients is not an option. This
> needs to be done on request.
>
>
>
> I've been looking at many compute solutions, but I'd consider Spark first
> due to its widespread use and community. I currently have my data loaded
> into Apache HBase for a different scenario (random access of rows/columns).
> I've naively tried loading a dataframe from the CSV using a Spark instance
> hosted on AWS EMR, but getting the results for even a single correlation
> takes over 20 seconds.
>
>
>
> Thank you!
>
>
>
>
>
> --gautham
>
>
>


Re: Problem running Spark on Kubernetes: Certificate error

2018-12-20 Thread Steven Stetzler
s.KubernetesClusterManager.createSchedulerBackend(KubernetesClusterManager.scala:64)
> at
> org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2788)
> ... 13 more
> Caused by: java.security.cert.CertificateException: Could not parse
> certificate: java.io.IOException: Empty input
> at
> sun.security.provider.X509Factory.engineGenerateCertificate(X509Factory.java:110)
> at
> java.security.cert.CertificateFactory.generateCertificate(CertificateFactory.java:339)
> at
> io.fabric8.kubernetes.client.internal.CertUtils.createTrustStore(CertUtils.java:93)
> at
> io.fabric8.kubernetes.client.internal.CertUtils.createTrustStore(CertUtils.java:71)
> at
> io.fabric8.kubernetes.client.internal.SSLUtils.trustManagers(SSLUtils.java:114)
> at
> io.fabric8.kubernetes.client.internal.SSLUtils.trustManagers(SSLUtils.java:93)
> at
> io.fabric8.kubernetes.client.utils.HttpClientUtils.createHttpClient(HttpClientUtils.java:63)
> ... 16 more
> Caused by: java.io.IOException: Empty input
> at
> sun.security.provider.X509Factory.engineGenerateCertificate(X509Factory.java:106)
> ... 22 more


On the other hand, I have no problem running

from pyspark import *
sc = SparkContext()


and then performing a computation. However, I believe that this will only be
able to use the resources provided to the Kubernetes pod that the Jupyter
environment is deployed on and not the whole cluster, so it's not really
what I'm looking for. I have tried running the same Python code from
outside of the cluster (on my own machine with PySpark installed). Note
that it doesn't matter what IP I use for the master, the same error occurs.
Running from outside of the cluster does seem to allow PySpark to
communicate with the cluster. However, there appear to be issues allocating
resources / creating workers. (But that seems like an issue for another
thread.)

Any idea what can be done here? Note that if it was really passing the file
and the file was not found, it would raise an error to indicate that. (I
have tried.) The error raised indicates instead that no file was ever
passed. I would assume that
conf.set('spark.kubernetes.authenticate.driver.caCertFile',
'ca.crt') would have identical behavior to --conf
spark.kubernetes.authenticate.driver.caCertFile=ca.crt in spark-submit.
Sorry if I am misunderstanding something basic in how PySpark and
spark-submit work.
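
For completeness, this is roughly what I am doing on the PySpark side; the
master URL, image, and certificate path are placeholders:

```
# A minimal sketch of setting the CA certificate through SparkConf, which I
# would expect to mirror the --conf flag used with spark-submit.
# Master URL, image, and certificate path below are placeholders.
from pyspark import SparkConf, SparkContext

conf = (
    SparkConf()
    .setMaster("k8s://https://<cluster-ip>:<port>")
    .set("spark.kubernetes.container.image", "<spark-image-url>")
    .set("spark.kubernetes.authenticate.driver.caCertFile", "/path/to/ca.crt")
)
sc = SparkContext(conf=conf)
```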

Thanks,
Steven


On Thu, Dec 13, 2018 at 7:59 PM Matt Cheah  wrote:

> Hi Steven,
>
>
>
> What I think is happening is that your machine has a CA certificate that
> is used for communicating with your API server, particularly because you’re
> using Digital Ocean’s cluster manager. However, it’s unclear if your pod
> has the same CA certificate or if the pod needs that certificate file. You
> can use configurations to have your pod use a particular CA certificate
> file to communicate with the API server. If you set 
> spark.kubernetes.authenticate.driver.caCertFile
> to the path of your CA certificate on your local disk, spark-submit will
> create a secret that contains that certificate file and use that
> certificate to configure TLS for the driver pod’s communication with the
> API server.
>
>
>
> It's clear that your driver pod doesn’t have the right TLS certificate to
> communicate with the API server, so I would try to introspect the driver
> pod and see what certificate it’s using for that communication. If there’s
> a fix that needs to happen in Spark, feel free to indicate as such.
>
>
>
> -Matt Cheah
>
>
>
> *From: *Steven Stetzler 
> *Date: *Thursday, December 13, 2018 at 1:49 PM
> *To: *"user@spark.apache.org" 
> *Subject: *Problem running Spark on Kubernetes: Certificate error
>
>
>
> Hello,
>
> I am following the tutorial here
> (https://spark.apache.org/docs/latest/running-on-kubernetes.html)
> to get Spark running on a Kubernetes cluster. My Kubernetes cluster is
> hosted with Digital Ocean's Kubernetes cluster manager. I have changed the
> KUBECONFIG environment variable to point to my cluster access credentials,
> so both Spark and kubectl can speak with the nodes.
>
> I am running into an issue when trying to run the SparkPi example as
> described in the Spark on Kubernetes tutorials. The command I am running
> is:
>
> ./bin/spark-submit --master k8s://$CLUSTERIP --deploy-mode cluster --name
> spark-pi --class org.apache.spark.examples.SparkPi --conf
> 

Problem running Spark on Kubernetes: Certificate error

2018-12-13 Thread Steven Stetzler
Hello,

I am following the tutorial here (
https://spark.apache.org/docs/latest/running-on-kubernetes.html) to get
Spark running on a Kubernetes cluster. My Kubernetes cluster is hosted with
Digital Ocean's Kubernetes cluster manager. I have changed the KUBECONFIG
environment variable to point to my cluster access credentials, so both
Spark and kubectl can speak with the nodes.

I am running into an issue when trying to run the SparkPi example as
described in the Spark on Kubernetes tutorials. The command I am running
is:

./bin/spark-submit --master k8s://$CLUSTERIP --deploy-mode cluster --name
spark-pi --class org.apache.spark.examples.SparkPi --conf
spark.executor.instances=1 --conf
spark.kubernetes.container.image=$IMAGEURL --conf
spark.kubernetes.authenticate.driver.serviceAccountName=spark
local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar

where CLUSTERIP contains the IP of my cluster and IMAGEURL contains the URL
of the Spark Docker image I am using (
https://hub.docker.com/r/stevenstetzler/spark/). This Docker image was
built and pushed with the script included in the Spark 2.4 distribution. I
have created a service account for Spark to ensure that it has proper
permissions to create pods etc., which I checked using

kubectl auth can-i create pods --as=system:serviceaccount:default:spark

When I try to run the SparkPi example using the above command, I get the
following output:

2018-12-12 06:26:15 WARN  Utils:66 - Your hostname, docker-test resolves to
a loopback address: 127.0.1.1; using 10.46.0.6 instead (on interface eth0)
2018-12-12 06:26:15 WARN  Utils:66 - Set SPARK_LOCAL_IP if you need to bind
to another address
2018-12-12 06:26:19 INFO  LoggingPodStatusWatcherImpl:54 - State changed,
new state:
 pod name: spark-pi-1544595975520-driver
 namespace: default
 labels: spark-app-selector ->
spark-ec5eb54644d348e7a213f8178b8ef61f, spark-role -> driver
 pod uid: d5d6bdc7-fdd6-11e8-b666-8e815d3815b2
 creation time: 2018-12-12T06:26:18Z
 service account name: spark
 volumes: spark-local-dir-1, spark-conf-volume, spark-token-qf9dt
 node name: N/A
 start time: N/A
 container images: N/A
 phase: Pending
 status: []
2018-12-12 06:26:19 INFO  LoggingPodStatusWatcherImpl:54 - State changed,
new state:
 pod name: spark-pi-1544595975520-driver
 namespace: default
 labels: spark-app-selector ->
spark-ec5eb54644d348e7a213f8178b8ef61f, spark-role -> driver
 pod uid: d5d6bdc7-fdd6-11e8-b666-8e815d3815b2
 creation time: 2018-12-12T06:26:18Z
 service account name: spark
 volumes: spark-local-dir-1, spark-conf-volume, spark-token-qf9dt
 node name: flamboyant-darwin-3rhc
 start time: N/A
 container images: N/A
 phase: Pending
 status: []
2018-12-12 06:26:19 INFO  LoggingPodStatusWatcherImpl:54 - State changed,
new state:
 pod name: spark-pi-1544595975520-driver
 namespace: default
 labels: spark-app-selector ->
spark-ec5eb54644d348e7a213f8178b8ef61f, spark-role -> driver
 pod uid: d5d6bdc7-fdd6-11e8-b666-8e815d3815b2
 creation time: 2018-12-12T06:26:18Z
 service account name: spark
 volumes: spark-local-dir-1, spark-conf-volume, spark-token-qf9dt
 node name: flamboyant-darwin-3rhc
 start time: 2018-12-12T06:26:18Z
 container images: docker.io/stevenstetzler/spark:v1
 phase: Pending
 status: [ContainerStatus(containerID=null, image=
docker.io/stevenstetzler/spark:v1, imageID=,
lastState=ContainerState(running=null, terminated=null, waiting=null,
additionalProperties={}), name=spark-kubernetes-driver, ready=false,
restartCount=0, state=ContainerState(running=null, terminated=null,
waiting=ContainerStateWaiting(message=null, reason=ContainerCreating,
additionalProperties={}), additionalProperties={}),
additionalProperties={})]
2018-12-12 06:26:19 INFO  Client:54 - Waiting for application spark-pi to
finish...
2018-12-12 06:26:21 INFO  LoggingPodStatusWatcherImpl:54 - State changed,
new state:
 pod name: spark-pi-1544595975520-driver
 namespace: default
 labels: spark-app-selector ->
spark-ec5eb54644d348e7a213f8178b8ef61f, spark-role -> driver
 pod uid: d5d6bdc7-fdd6-11e8-b666-8e815d3815b2
 creation time: 2018-12-12T06:26:18Z
 service account name: spark
 volumes: spark-local-dir-1, spark-conf-volume, spark-token-qf9dt
 node name: flamboyant-darwin-3rhc
 start time: 2018-12-12T06:26:18Z
 container images: stevenstetzler/spark:v1
 phase: Running
 status:
[ContainerStatus(containerID=docker://b923c6ff02b93557c8c104c01a4eeb1c05f3d0c0123ec4e5895bfd6be398a03a,
image=stevenstetzler/spark:v1,
imageID=docker-pullable://stevenstetzler/spark@sha256:dc4bce1e410ebd7b14a88caa46a4282a61ff058c6374b7cf721b7498829bb041,