Re: Release Apache Spark 2.4.4 before 3.0.0

2019-07-16 Thread Stavros Kontopoulos
Hi Dongjoon,

Should we also consider fixing
https://issues.apache.org/jira/browse/SPARK-27812 before the cut?

Best,
Stavros

On Mon, Jul 15, 2019 at 7:04 PM Dongjoon Hyun 
wrote:

> Hi, Apache Spark PMC members.
>
> Can we cut Apache Spark 2.4.4 next Monday (22nd July)?
>
> Bests,
> Dongjoon.
>
>
> On Fri, Jul 12, 2019 at 3:18 PM Dongjoon Hyun 
> wrote:
>
>> Thank you, Jacek.
>>
>> BTW, I added `@private` since we need PMC's help to make an Apache Spark
>> release.
>>
>> Can I get more feedback from the other PMC members?
>>
>> Please let me know if you have any concerns (e.g. release date or release
>> manager).
>>
>> As one of the community members, I assumed the following (if we stay on
>> schedule).
>>
>> - 2.4.4 at the end of July
>> - 2.3.4 at the end of August (since 2.3.0 was released at the end of
>> February 2018)
>> - 3.0.0 (possibly September?)
>> - 3.1.0 (January 2020?)
>>
>> Bests,
>> Dongjoon.
>>
>>
>> On Thu, Jul 11, 2019 at 1:30 PM Jacek Laskowski  wrote:
>>
>>> Hi,
>>>
>>> Thanks Dongjoon Hyun for stepping up as a release manager!
>>> Much appreciated.
>>>
>>> If there's a volunteer to cut a release, I'm always happy to support it.
>>>
>>> In addition, the more frequent the releases, the better for end users: they
>>> can choose to upgrade and get all the latest fixes, or wait. It's their
>>> call, not ours (otherwise we'd be the ones keeping them waiting).
>>>
>>> My big two yeses for the release!
>>>
>>> Jacek
>>>
>>>
>>> On Tue, 9 Jul 2019, 18:15 Dongjoon Hyun, 
>>> wrote:
>>>
 Hi, All.

 Spark 2.4.3 was released two months ago (8th May).

 As of today (9th July), there exist 45 fixes in `branch-2.4` including
 the following correctness or blocker issues.

 - SPARK-26038 Decimal toScalaBigInt/toJavaBigInteger not work for
 decimals not fitting in long
 - SPARK-26045 Error in the spark 2.4 release package with the
 spark-avro_2.11 dependency
 - SPARK-27798 from_avro can modify variables in other rows in local
 mode
 - SPARK-27907 HiveUDAF should return NULL in case of 0 rows
 - SPARK-28157 Make SHS clear KVStore LogInfo for the blacklist
 entries
 - SPARK-28308 CalendarInterval sub-second part should be padded
 before parsing

 It would be great if we could have Spark 2.4.4 out before we get busier
 with 3.0.0.
 If it's okay, I'd like to volunteer as the 2.4.4 release manager and roll
 it next Monday (15th July).
 What do you think?

 Bests,
 Dongjoon.

>>>


Re: Spark on K8S - --packages not working for cluster mode?

2019-06-06 Thread Stavros Kontopoulos
Hi,

This has been fixed here: https://github.com/apache/spark/pull/23546. It will
be available in Spark 3.0.0.
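
Until then, a common workaround on 2.4 is to bake the AWS jars into the
container image (or assemble them into the application jar, as you noted
already works). With the dependencies on the classpath of both the driver and
executor images, the s3a read itself is straightforward; a minimal sketch with
placeholder bucket/key names (not your exact setup):

import org.apache.spark.sql.SparkSession

// Sketch only: assumes hadoop-aws and aws-java-sdk are already on the
// classpath of both the driver and executor images.
val spark = SparkSession.builder().appName("SimpleApp").getOrCreate()

// Credentials can also be passed via --conf, as in your submit command.
spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))

spark.read.text("s3a://some-bucket/some/input.txt").show()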

Best,
Stavros

On Wed, Jun 5, 2019 at 11:18 PM pacuna  wrote:

> I'm trying to run sample code that reads a file from S3, so I need the AWS
> SDK and the AWS Hadoop dependencies.
> If I assemble these deps into the main jar everything works fine. But when
> I
> try using --packages, the deps are not seen by the pods.
>
> This is my submit command:
>
> spark-submit
> --master k8s://https://xx.xx.xx.xx
> --class "SimpleApp"
> --deploy-mode cluster
> --conf spark.kubernetes.container.image=docker.io/pacuna/spark:0.2
> --conf
> spark.kubernetes.authenticate.driver.serviceAccountName=spark-test-user
> --packages
> com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3
> --conf spark.hadoop.fs.s3a.access.key=...
> --conf spark.hadoop.fs.s3a.secret.key=...
> https://x/simple-project_2.11-1.0.jar
>
> And the error I'm getting in the driver pod is:
>
> 19/06/05 20:13:50 ERROR SparkContext: Failed to add
>
> file:///home/dev/.ivy2/jars/com.fasterxml.jackson.core_jackson-core-2.2.3.jar
> to Spark environment
> java.io.FileNotFoundException: Jar
> /home/dev/.ivy2/jars/com.fasterxml.jackson.core_jackson-core-2.2.3.jar not
> found
>
> I'm getting that error for all the deps jars needed.
>
> Any ideas?
>
> Thanks.
>
>
>
>
>
>


Re: K8s-Spark client mode : Executor image not able to download application jar from driver

2019-04-28 Thread Stavros Kontopoulos
Yes, here is why the initial attempt didn't work, explained a bit better. As I
mentioned earlier, SparkContext adds your jars/files (declared with the
related conf properties) to its file server. If a jar is local to the
container's filesystem (i.e. it has the local: scheme), it is simply resolved
to file: + the absolute path (
https://github.com/apache/spark/blob/20a3ef7259490e0c9f6348f13db1e99da5f0df83/core/src/main/scala/org/apache/spark/SparkContext.scala#L1805-L1814).
 Check:

// A JAR file which exists only on the driver node
case null =>
  // SPARK-22585 path without schema is not url encoded
  addJarFile(new File(uri.getRawPath))
// A JAR file which exists only on the driver node
case "file" => addJarFile(new File(uri.getPath))
// A JAR file which exists locally on every worker node
case "local" => "file:" + uri.getPath
case _ => path

That means that when the task description is deserialized on the executor
side, the task will have a URI to resolve that starts with file://.
updateDependencies on the executor side will then call doFetchFile and take
this path:
case "file" =>
  // In the case of a local file, copy the local file to the target
directory.
  // Note the difference between uri vs url.
  val sourceFile = if (uri.isAbsolute) new File(uri) else new File(url)
  copyFile(url, sourceFile, targetFile, fileOverwrite)

But since the file does not exist in your executor image, it fails.
The solution you have works because you initially pass a file:// URI, which
SparkContext adds from the driver's filesystem to its file server, so it
becomes available on the executor side under a new URI with the spark://
scheme (there is a call to addJarFile for file://, while for local:// there
is not). Hope this helps.
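
For reference, here is a minimal client-mode launcher sketch using the file:
scheme for the app resource (all values below are placeholders, not your exact
setup):

import org.apache.spark.launcher.SparkLauncher

// Sketch only (placeholder values). With the file: scheme the driver serves
// the jar from its own filesystem via its file server, so executors fetch it
// over spark:// instead of expecting it on their local filesystem.
val handle = new SparkLauncher()
  .setMaster("k8s://https://xx.xx.xx.xx")
  .setDeployMode("client")
  .setAppResource("file:///opt/spark/work-dir/my-app.jar")
  .setMainClass("com.example.SimpleApp")
  .setConf("spark.kubernetes.container.image", "docker.io/example/spark:latest")
  .startApplication()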

Stavros

On Sun, Apr 28, 2019 at 7:29 AM Nikhil Chinnapa 
wrote:

> Hi Stavros,
>
> Thanks a lot for pointing me in the right direction. I got stuck on a
> release, so I didn't get to this earlier.
>
> The mistake was in “LINUX_APP_RESOURCE”: I was using “local” when it should
> have been “file”. I only got there thanks to your email.
>
> What I understood:
> Driver image: $SPARK_HOME/bin, $SPARK_HOME/jars, and the application jar.
> Executor image: just the $SPARK_HOME/bin and $SPARK_HOME/jars folders will
> suffice.
>
>
>
>
>
>


Re: K8s-Spark client mode : Executor image not able to download application jar from driver

2019-04-19 Thread Stavros Kontopoulos
Hi Nikhil,

The application jar is by default added to spark.jars, so it is fetched by
executors when tasks are launched. Behind the scenes, SparkContext adds these
files to the driver's file server and the TaskSetManager attaches them to the
tasks, so when tasks are deserialized on the executor side they carry the
files' URIs, but with a different scheme, namely spark://, which is the file
server's scheme. Executors then fetch them via the updateDependencies call.
You should see something like this:

19/04/19 23:15:05 INFO Executor: Fetching
spark://spark-pi-2775276a37e147f8-driver-svc.spark.svc:7078/jars/spark-examples_2.12-3.0.0-SNAPSHOT.jar
with timestamp 1555715697141
19/04/19 23:15:05 INFO TransportClientFactory: Successfully created
connection to spark-pi-2775276a37e147f8-driver-svc.spark.svc/172.17.0.4:7078
after 4 ms (0 ms spent in bootstraps)
19/04/19 23:15:05 INFO Utils: Fetching
spark://spark-pi-2775276a37e147f8-driver-svc.spark.svc:7078/jars/spark-examples_2.12-3.0.0-SNAPSHOT.jar
to
/var/data/spark-7bb5652a-7289-43b2-8e0a-2b4687eddb51/spark-4fccc535-47e8-49c1-818f-a44eb268f09e/fetchFileTemp951703538232079938.tmp
19/04/19 23:15:05 INFO Utils: Copying
/var/data/spark-7bb5652a-7289-43b2-8e0a-2b4687eddb51/spark-4fccc535-47e8-49c1-818f-a44eb268f09e/11085522591555715697141_cache
to /opt/spark/work-dir/./spark-examples_2.12-3.0.0-SNAPSHOT.jar

Could you add --verbose so we can see what args spark-submit gets? What is
the value of LINUX_APP_RESOURCE?
One trick would be to set spark.jars (or add the jar manually with
sc.addJar), but I would still like to know what is happening, e.g. whether
args.primaryResource is set.
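
For the spark.jars trick, a minimal sketch using the jar path from your logs:

// Sketch only: either declare the jar up front...
val spark = org.apache.spark.sql.SparkSession.builder()
  .config("spark.jars", "file:///opt/spark/examples/jars/reno-spark-codebase-0.1.0.jar")
  .getOrCreate()

// ...or register it on an existing context. Both routes put the jar on the
// driver's file server, so executors can fetch it via a spark:// uri.
spark.sparkContext.addJar("file:///opt/spark/examples/jars/reno-spark-codebase-0.1.0.jar")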

The reason you see that output is that, for some reason, the file URI is
passed as-is to the executors, and the executors then try to fetch the file
from their local filesystem, where it does not exist.

Best,
Stavros

On Tue, Apr 16, 2019 at 11:29 AM Nikhil Chinnapa <
nishant.ran...@renovite.com> wrote:

> Environment:
> Spark: 2.4.0
> Kubernetes:1.14
>
> Query: Does application jar needs to be part of both Driver and Executor
> image?
>
> Invocation point (from Java code):
> sparkLaunch = new SparkLauncher()
>
> .setMaster(LINUX_MASTER)
> .setAppResource(LINUX_APP_RESOURCE)
> .setConf("spark.app.name",APP_NAME)
> .setMainClass(MAIN_CLASS)
>
> .setConf("spark.executor.instances",EXECUTOR_COUNT)
>
> .setConf("spark.kubernetes.container.image",CONTAINER_IMAGE)
> .setConf("spark.kubernetes.driver.pod.name
> ",DRIVER_POD_NAME)
>
> .setConf("spark.kubernetes.container.image.pullSecrets",REGISTRY_SECRET)
>
>
> .setConf("spark.kubernetes.authenticate.driver.serviceAccountName",SERVICE_ACCOUNT_NAME)
> .setConf("spark.driver.host", SERVICE_NAME
> + "." + NAMESPACE +
> ".svc.cluster.local")
> .setConf("spark.driver.port",
> DRIVER_PORT)
> .setDeployMode("client")
> ;
>
> Scenario:
> I am trying to run Spark on K8s in client mode. When I put application jar
> image both in driver and executor then program work fines.
>
> But, if I put application jar in driver image only then I get following
> error:
>
> 2019-04-16 06:36:44 INFO  Executor:54 - Fetching
> file:/opt/spark/examples/jars/reno-spark-codebase-0.1.0.jar with timestamp
> 1555396592768
> 2019-04-16 06:36:44 INFO  Utils:54 - Copying
> /opt/spark/examples/jars/reno-spark-codebase-0.1.0.jar to
>
> /var/data/spark-d24c8fbc-4fe7-4968-9310-f891a097d1e7/spark-31ba5cbb-3132-408c-991a-795
> 2019-04-16 06:36:44 ERROR Executor:91 - Exception in task 0.1 in stage 0.0
> (TID 2)
> java.nio.file.NoSuchFileException:
> /opt/spark/examples/jars/reno-spark-codebase-0.1.0.jar
> at
>
> java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
> at
>
> java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
> at
>
> java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:116)
> at java.base/sun.nio.fs.UnixCopyFile.copy(UnixCopyFile.java:548)
> at
>
> java.base/sun.nio.fs.UnixFileSystemProvider.copy(UnixFileSystemProvider.java:254)
> at java.base/java.nio.file.Files.copy(Files.java:1294)
> at
>
> org.apache.spark.util.Utils$.org$apache$spark$util$Utils$$copyRecursive(Utils.scala:664)
> at org.apache.spark.util.Utils$.copyFile(Utils.scala:635)
> at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:719)
> at org.apache.spark.util.Utils$.fetchFile(Utils.scala:496)
> at
>
> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:805

Re: [ANNOUNCE] Announcing Apache Spark 2.4.0

2018-11-08 Thread Stavros Kontopoulos
Awesome!

On Thu, Nov 8, 2018 at 9:36 PM, Jules Damji  wrote:

> Indeed!
>
> Sent from my iPhone
> Pardon the dumb thumb typos :)
>
> On Nov 8, 2018, at 11:31 AM, Dongjoon Hyun 
> wrote:
>
> Finally, thank you all. Especially, thanks to the release manager, Wenchen!
>
> Bests,
> Dongjoon.
>
>
> On Thu, Nov 8, 2018 at 11:24 AM Wenchen Fan  wrote:
>
>> + user list
>>
>> On Fri, Nov 9, 2018 at 2:20 AM Wenchen Fan  wrote:
>>
>>> resend
>>>
>>> On Thu, Nov 8, 2018 at 11:02 PM Wenchen Fan  wrote:
>>>


 -- Forwarded message -
 From: Wenchen Fan 
 Date: Thu, Nov 8, 2018 at 10:55 PM
 Subject: [ANNOUNCE] Announcing Apache Spark 2.4.0
 To: Spark dev list 


 Hi all,

 Apache Spark 2.4.0 is the fifth release in the 2.x line. This release
 adds Barrier Execution Mode for better integration with deep learning
 frameworks, introduces 30+ built-in and higher-order functions to deal with
 complex data types more easily, and improves the K8s integration, along with
 experimental Scala 2.12 support. Other major updates include the built-in
 Avro data source, Image data source, flexible streaming sinks, elimination
 of the 2GB block size limitation during transfer, and Pandas UDF improvements.
 In addition, this release continues to focus on usability, stability, and
 polish while resolving around 1100 tickets.
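
 For example (illustrative snippet only, assuming an active SparkSession
 named spark), the new higher-order functions work directly in SQL:

 // transform applies a lambda to each element of an array column.
 spark.sql("SELECT transform(array(1, 2, 3), x -> x + 1) AS incremented").show()
 // incremented: [2, 3, 4]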

 We'd like to thank our contributors and users for their contributions
 and early feedback to this release. This release would not have been
 possible without you.

 To download Spark 2.4.0, head over to the download page:
 http://spark.apache.org/downloads.html

 To view the release notes: https://spark.apache.org/
 releases/spark-release-2-4-0.html

 Thanks,
 Wenchen

 PS: If you see any issues with the release notes, webpage or published
 artifacts, please contact me directly off-list.

>>>


custom sink & model transformation

2018-09-09 Thread Stavros Kontopoulos
Hi,

Is it unsafe to do model prediction within a custom sink, e.g.
model.transform(df)?
As far as I can tell, the only transformation done is adding a prediction
column.
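
Roughly the pattern I have in mind, as a minimal sketch against the internal
Sink API (placeholder names; whether the transform/write inside addBatch is
safe is exactly what I am asking):

import org.apache.spark.ml.PipelineModel
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.streaming.Sink

// Sketch only: score each micro-batch with a pre-fitted model inside addBatch.
class ModelScoringSink(model: PipelineModel, outputPath: String) extends Sink {
  override def addBatch(batchId: Long, data: DataFrame): Unit = {
    val predictions = model.transform(data) // adds a "prediction" column
    predictions.write.mode("append").parquet(outputPath)
  }
}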

Stavros