Re: Issues with Apache Spark tgz file

2019-12-30 Thread Marcelo Vanzin
That first URL is not the file. It's a web page with links to the file
in different mirrors. I just looked at the actual file in one of the
mirrors and it looks fine.

On Mon, Dec 30, 2019 at 1:34 PM rsinghania  wrote:
>
> Hi,
>
> I'm trying to open the file
> https://www.apache.org/dyn/closer.lua/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
> downloaded from https://spark.apache.org/downloads.html using wget, and
> getting the following messages:
>
> gzip: stdin: not in gzip format
> tar: Child returned status 1
> tar: Error is not recoverable: exiting now
>
> It looks like there's something wrong with the original tgz file; its size
> is only 32 KB.
>
> Could one of the developers please have a look?
>
> Thanks very much,
> Rajat
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>


-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Is it possible to obtain the full command to be invoked by SparkLauncher?

2019-04-24 Thread Marcelo Vanzin
BTW the SparkLauncher API has hooks to capture the stderr of the
spark-submit process into the logging system of the parent process.
Check the API javadocs since it's been forever since I looked at that.
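
For illustration, a minimal sketch of those hooks (the app jar, main class and logger name below are placeholders, not taken from this thread). redirectToLog sends the spark-submit child's stdout and stderr into java.util.logging under the given logger name; redirectOutput/redirectError variants exist if you want the raw streams instead.

    import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

    // Capture all output of the spark-submit child process in the parent's
    // java.util.logging system under the logger name below (a placeholder).
    val handle: SparkAppHandle = new SparkLauncher()
      .setAppResource("/path/to/my-app.jar")
      .setMainClass("com.example.MyApp")
      .setMaster("yarn")
      .redirectToLog("my-app.spark-submit")
      .startApplication()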

On Wed, Apr 24, 2019 at 1:58 PM Marcelo Vanzin  wrote:
>
> Setting the SPARK_PRINT_LAUNCH_COMMAND env variable to 1 in the
> launcher env will make Spark code print the command to stderr. Not
> optimal but I think it's the only current option.
>
> On Wed, Apr 24, 2019 at 1:55 PM Jeff Evans
>  wrote:
> >
> > The org.apache.spark.launcher.SparkLauncher is used to construct a
> > spark-submit invocation programmatically, via a builder pattern.  In
> > our application, which uses a SparkLauncher internally, I would like
> > to log the full spark-submit command that it will invoke to our log
> > file, in order to aid in debugging/support.  However, I can't figure
> > out a way to do this.  This snippet would work, except for the fact
> > that the createBuilder method is private.
> >
> > sparkLauncher.createBuilder().command()
> >
> > Is there an alternate way of doing this?  The Spark version is
> > 2.11:2.4.0.  Thanks.
> >
> > -
> > To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> >
>
>
> --
> Marcelo



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Is it possible to obtain the full command to be invoked by SparkLauncher?

2019-04-24 Thread Marcelo Vanzin
Setting the SPARK_PRINT_LAUNCH_COMMAND env variable to 1 in the
launcher env will make Spark code print the command to stderr. Not
optimal but I think it's the only current option.
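
For illustration, a hedged sketch of passing that variable through the SparkLauncher API (app jar and class are placeholders); the environment map given to the constructor is applied to the spark-submit child process.

    import java.util.{HashMap => JHashMap}
    import org.apache.spark.launcher.SparkLauncher

    // Environment for the spark-submit child; this variable makes the launcher
    // scripts print the final java command before running it.
    val env = new JHashMap[String, String]()
    env.put("SPARK_PRINT_LAUNCH_COMMAND", "1")

    val handle = new SparkLauncher(env)
      .setAppResource("/path/to/my-app.jar")
      .setMainClass("com.example.MyApp")
      .redirectError()   // the command is printed to stderr; merge it into stdout
      .startApplication()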

On Wed, Apr 24, 2019 at 1:55 PM Jeff Evans
 wrote:
>
> The org.apache.spark.launcher.SparkLauncher is used to construct a
> spark-submit invocation programmatically, via a builder pattern.  In
> our application, which uses a SparkLauncher internally, I would like
> to log the full spark-submit command that it will invoke to our log
> file, in order to aid in debugging/support.  However, I can't figure
> out a way to do this.  This snippet would work, except for the fact
> that the createBuilder method is private.
>
> sparkLauncher.createBuilder().command()
>
> Is there an alternate way of doing this?  The Spark version is
> 2.11:2.4.0.  Thanks.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>


-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: spark.submit.deployMode: cluster

2019-03-26 Thread Marcelo Vanzin
If you're not using spark-submit, then that option does nothing.

If by "context creation API" you mean "new SparkContext()" or an
equivalent, then you're explicitly creating the driver inside your
application.
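
For reference, one hedged way for a server process to get a real cluster-mode driver is to go through spark-submit programmatically, for example via the launcher API (master URL, jar and class names below are placeholders):

    import org.apache.spark.launcher.SparkLauncher

    // spark-submit honors the deploy mode; new SparkContext() in the server never
    // will, because the JVM that creates the context is the driver by definition.
    val handle = new SparkLauncher()
      .setMaster("spark://master-host:7077")       // standalone master, placeholder
      .setDeployMode("cluster")                    // driver runs inside the cluster
      .setAppResource("hdfs:///apps/my-app.jar")   // must be reachable from the workers
      .setMainClass("com.example.MyApp")
      .startApplication()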

On Tue, Mar 26, 2019 at 1:56 PM Pat Ferrel  wrote:
>
> I have a server that starts a Spark job using the context creation API. It 
> DOES NOT use spark-submit.
>
> I set spark.submit.deployMode = “cluster”
>
> In the GUI I see 2 workers with 2 executors. The link for running application 
> “name” goes back to my server, the machine that launched the job.
>
> This is spark.submit.deployMode = “client” according to the docs. I set the 
> Driver to run on the cluster but it runs on the client, ignoring the 
> spark.submit.deployMode.
>
> Is this as expected? I can't find it documented anywhere.
>


-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: RPC timeout error for AES based encryption between driver and executor

2019-03-26 Thread Marcelo Vanzin
I don't think "spark.authenticate" works properly with k8s in 2.4
(which would make it impossible to enable encryption since it requires
authentication). I'm pretty sure I fixed it in master, though.

On Tue, Mar 26, 2019 at 2:29 AM Sinha, Breeta (Nokia - IN/Bangalore)
 wrote:
>
> Hi All,
>
>
>
> We are trying to enable RPC encryption between driver and executor. Currently 
> we're working on Spark 2.4 on Kubernetes.
>
>
>
> According to Apache Spark Security document 
> (https://spark.apache.org/docs/latest/security.html) and our understanding on 
> the same, it is clear that Spark supports AES-based encryption for RPC 
> connections. There is also support for SASL-based encryption, although it 
> should be considered deprecated.
>
>
>
> Setting spark.network.crypto.enabled to true will enable AES-based RPC encryption.
>
> However, when we enable AES based encryption between driver and executor, we 
> could observe a very sporadic behaviour in communication between driver and 
> executor in the logs.
>
>
>
> Following are the options and the values we used for enabling 
> encryption:-
>
>
>
> spark.authenticate true
>
> spark.authenticate.secret 
>
> spark.network.crypto.enabled true
>
> spark.network.crypto.keyLength 256
>
> spark.network.crypto.saslFallback false
>
>
>
> A snippet of the executor log is provided below:-
>
> Exception in thread "main" 19/02/26 07:27:08 ERROR RpcOutboxMessage: Ask 
> timeout before connecting successfully
>
> Caused by: java.util.concurrent.TimeoutException: Cannot receive any reply 
> from sts-spark-thrift-server-1551165767426-driver-svc.default.svc:7078 in 120 
> seconds
>
>
>
> But, there is no error message or any message from executor seen in the 
> driver log for the same timestamp.
>
>
>
> We also tried increasing spark.network.timeout, but no luck.
>
>
>
> This issue is seen sporadically, as the following observations were noted:-
>
> 1) Sometimes, enabling AES encryption works completely fine.
>
> 2) Sometimes, enabling AES encryption works fine for around 10 consecutive 
> spark-submits, but the next spark-submit goes into a hung state with 
> the above-mentioned error in the executor log.
>
> 3) Also, there are times when enabling AES encryption does not work at all, 
> as it keeps on spawning more than 50 executors, where the executors fail 
> with the above-mentioned error.
>
> Even, setting spark.network.crypto.saslFallback to true didn't help.
>
>
>
> Things are working fine when we enable SASL encryption, that is, only setting 
> the following parameters:-
>
> spark.authenticate true
>
> spark.authenticate.secret 
>
>
>
> I have attached the log file containing the detailed error message. Please let us 
> know if any configuration is missing or if anyone has faced the same issue.
>
>
>
> Any leads would be highly appreciated!!
>
>
>
> Kind Regards,
>
> Breeta Sinha
>
>
>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Multiple context in one Driver

2019-03-14 Thread Marcelo Vanzin
It doesn't work (except if you're extremely lucky), it will eat your
lunch and will also kick your dog.

And it's not even going to be an option in the next version of Spark.
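
The usual alternative, shown here only as a sketch (it is not from this thread): keep a single SparkContext and isolate workloads with separate SparkSessions.

    import org.apache.spark.sql.SparkSession

    // One SparkContext per JVM; isolation comes from separate sessions instead.
    val base = SparkSession.builder().appName("shared-driver").getOrCreate()

    val sessionA = base.newSession()   // isolated SQL conf, temp views, UDF registrations
    val sessionB = base.newSession()

    sessionA.conf.set("spark.sql.shuffle.partitions", "50")
    sessionB.conf.set("spark.sql.shuffle.partitions", "400")
    // Both sessions still share the same SparkContext, executors and cached data.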

On Wed, Mar 13, 2019 at 11:38 PM Ido Friedman  wrote:
>
> Hi,
>
> I am researching the use of multiple sparkcontext in one Driver - 
> spark.driver.allowMultipleContexts
>
> I found various opinions and notes on the subject, most were against it and 
> some said it is "work in progress"
>
> What is "official" approach on this subject? Can we use this as a significant 
> part of our spark implementation?
>
> 10x
>
> Ido Friedman
>
> --



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: How to force-quit a Spark application?

2019-01-24 Thread Marcelo Vanzin
Hi,

On Tue, Jan 22, 2019 at 11:30 AM Pola Yao  wrote:
> "Thread-1" #19 prio=5 os_prio=0 tid=0x7f9b6828e800 nid=0x77cb waiting on 
> condition [0x7f9a123e3000]
>java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for  <0x0005408a5420> (a 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
> at 
> java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:350)
> at 
> org.apache.spark.scheduler.AsyncEventQueue.stop(AsyncEventQueue.scala:131)

This looks a little weird. Are you sure this thread is not making any
progress (i.e. did you take multiple stack snapshots)? I wouldn't
expect that call to block.

At first I was suspicious of SPARK-24309 but that looks different from
what you're seeing.

-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: How to force-quit a Spark application?

2019-01-16 Thread Marcelo Vanzin
ct monitor)
> at java.lang.Object.wait(Native Method)
> at java.lang.Thread.join(Thread.java:1249)
> - locked <0x00064056f6a0> (a org.apache.hadoop.util.ShutdownHookManager$1)
> at java.lang.Thread.join(Thread.java:1323)
> at 
> java.lang.ApplicationShutdownHooks.runHooks(ApplicationShutdownHooks.java:106)
> at java.lang.ApplicationShutdownHooks$1.run(ApplicationShutdownHooks.java:46)
> at java.lang.Shutdown.runHooks(Shutdown.java:123)
> at java.lang.Shutdown.sequence(Shutdown.java:167)
> at java.lang.Shutdown.exit(Shutdown.java:212)
> - locked <0x0006404e65b8> (a java.lang.Class for java.lang.Shutdown)
> at java.lang.Runtime.exit(Runtime.java:109)
> at java.lang.System.exit(System.java:971)
> at scala.sys.package$.exit(package.scala:40)
> at scala.sys.package$.exit(package.scala:33)
> at 
> actionmodel.ParallelAdvertiserBeaconModel$.main(ParallelAdvertiserBeaconModel.scala:252)
> at 
> actionmodel.ParallelAdvertiserBeaconModel.main(ParallelAdvertiserBeaconModel.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
> at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:879)
> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:197)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:227)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>
> "VM Thread" os_prio=0 tid=0x7f56cc0c1800 nid=0x1c91 runnable
> ...
> '''
>
> I have no clear idea what went wrong. I did call  awaitTermination to 
> terminate the thread pool. Or is there any way to force close all those 
> 'WAITING' threads associated with my spark application?
>
> On Wed, Jan 16, 2019 at 8:31 AM Marcelo Vanzin  wrote:
>>
>> If System.exit() doesn't work, you may have a bigger problem
>> somewhere. Check your threads (using e.g. jstack) to see what's going
>> on.
>>
>> On Wed, Jan 16, 2019 at 8:09 AM Pola Yao  wrote:
>> >
>> > Hi Marcelo,
>> >
>> > Thanks for your reply! It made sense to me. However, I've tried many ways 
>> > to exit the spark (e.g., System.exit()), but failed. Is there an explicit 
>> > way to shutdown all the alive threads in the spark application and then 
>> > quit afterwards?
>> >
>> >
>> > On Tue, Jan 15, 2019 at 2:38 PM Marcelo Vanzin  wrote:
>> >>
>> >> You should check the active threads in your app. Since your pool uses
>> >> non-daemon threads, that will prevent the app from exiting.
>> >>
>> >> spark.stop() should have stopped the Spark jobs in other threads, at
>> >> least. But if something is blocking one of those threads, or if
>> >> something is creating a non-daemon thread that stays alive somewhere,
>> >> you'll see that.
>> >>
>> >> Or you can force quit with sys.exit.
>> >>
>> >> On Tue, Jan 15, 2019 at 1:30 PM Pola Yao  wrote:
>> >> >
>> >> > I submitted a Spark job through ./spark-submit command, the code was 
>> >> > executed successfully, however, the application got stuck when trying 
>> >> > to quit spark.
>> >> >
>> >> > My code snippet:
>> >> > '''
>> >> > {
>> >> >
>> >> > val spark = SparkSession.builder.master(...).getOrCreate
>> >> >
>> >> > val pool = Executors.newFixedThreadPool(3)
>> >> > implicit val xc = ExecutionContext.fromExecutorService(pool)
>> >> > val taskList = List(train1, train2, train3)  // where train* is a 
>> >> > Future function which wrapped up some data reading and feature 
>> >> > engineering and machine learning steps
>> >> > val results = Await.result(Future.sequence(taskList), 20 minutes)
>> >> >
>> >> > println("Shutting down pool and executor service")
>> >> > pool.shutdown()
>> >> > xc.shutdown()
>> >> >
>> >> > println("Exiting spark")
>> >> > spark.stop()
>> >> >
>> >> > }
>> >> > '''
>> >> >
>> >> > After I submitted the job, from the terminal I could see the code was 
>> >> > executed and printed "Exiting spark"; however, after printing that 
>> >> > line, it never exited Spark, it just got stuck.
>> >> >
>> >> > Does anybody know what the reason is? Or how to force it to quit?
>> >> >
>> >> > Thanks!
>> >> >
>> >> >
>> >>
>> >>
>> >> --
>> >> Marcelo
>>
>>
>>
>> --
>> Marcelo



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: How to force-quit a Spark application?

2019-01-16 Thread Marcelo Vanzin
If System.exit() doesn't work, you may have a bigger problem
somewhere. Check your threads (using e.g. jstack) to see what's going
on.

On Wed, Jan 16, 2019 at 8:09 AM Pola Yao  wrote:
>
> Hi Marcelo,
>
> Thanks for your reply! It made sense to me. However, I've tried many ways to 
> exit the spark (e.g., System.exit()), but failed. Is there an explicit way to 
> shutdown all the alive threads in the spark application and then quit 
> afterwards?
>
>
> On Tue, Jan 15, 2019 at 2:38 PM Marcelo Vanzin  wrote:
>>
>> You should check the active threads in your app. Since your pool uses
>> non-daemon threads, that will prevent the app from exiting.
>>
>> spark.stop() should have stopped the Spark jobs in other threads, at
>> least. But if something is blocking one of those threads, or if
>> something is creating a non-daemon thread that stays alive somewhere,
>> you'll see that.
>>
>> Or you can force quit with sys.exit.
>>
>> On Tue, Jan 15, 2019 at 1:30 PM Pola Yao  wrote:
>> >
>> > I submitted a Spark job through ./spark-submit command, the code was 
>> > executed successfully, however, the application got stuck when trying to 
>> > quit spark.
>> >
>> > My code snippet:
>> > '''
>> > {
>> >
>> > val spark = SparkSession.builder.master(...).getOrCreate
>> >
>> > val pool = Executors.newFixedThreadPool(3)
>> > implicit val xc = ExecutionContext.fromExecutorService(pool)
>> > val taskList = List(train1, train2, train3)  // where train* is a Future 
>> > function which wrapped up some data reading and feature engineering and 
>> > machine learning steps
>> > val results = Await.result(Future.sequence(taskList), 20 minutes)
>> >
>> > println("Shutting down pool and executor service")
>> > pool.shutdown()
>> > xc.shutdown()
>> >
>> > println("Exiting spark")
>> > spark.stop()
>> >
>> > }
>> > '''
>> >
>> > After I submitted the job, from the terminal I could see the code was 
>> > executed and printed "Exiting spark"; however, after printing that line, 
>> > it never exited Spark, it just got stuck.
>> >
>> > Does anybody know what the reason is? Or how to force it to quit?
>> >
>> > Thanks!
>> >
>> >
>>
>>
>> --
>> Marcelo



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: How to force-quit a Spark application?

2019-01-15 Thread Marcelo Vanzin
You should check the active threads in your app. Since your pool uses
non-daemon threads, that will prevent the app from exiting.

spark.stop() should have stopped the Spark jobs in other threads, at
least. But if something is blocking one of those threads, or if
something is creating a non-daemon thread that stays alive somewhere,
you'll see that.

Or you can force quit with sys.exit.
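
A minimal sketch combining both suggestions, reusing the spark and pool names from the snippet quoted below: either build the pool from daemon threads so it cannot keep the JVM alive, or force the exit after spark.stop().

    import java.util.concurrent.{Executors, ThreadFactory}

    // Option 1: a pool whose threads are daemons, so they never block JVM shutdown.
    val daemonFactory = new ThreadFactory {
      override def newThread(r: Runnable): Thread = {
        val t = new Thread(r)
        t.setDaemon(true)
        t
      }
    }
    val pool = Executors.newFixedThreadPool(3, daemonFactory)

    // ... submit the futures, Await.result(...), then:
    pool.shutdown()
    spark.stop()

    // Option 2 (last resort): end the JVM even if stray non-daemon threads remain.
    sys.exit(0)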

On Tue, Jan 15, 2019 at 1:30 PM Pola Yao  wrote:
>
> I submitted a Spark job through the ./spark-submit command; the code executed 
> successfully, however the application got stuck when trying to quit Spark.
>
> My code snippet:
> '''
> {
>
> val spark = SparkSession.builder.master(...).getOrCreate
>
> val pool = Executors.newFixedThreadPool(3)
> implicit val xc = ExecutionContext.fromExecutorService(pool)
> val taskList = List(train1, train2, train3)  // where train* is a Future 
> function which wrapped up some data reading and feature engineering and 
> machine learning steps
> val results = Await.result(Future.sequence(taskList), 20 minutes)
>
> println("Shutting down pool and executor service")
> pool.shutdown()
> xc.shutdown()
>
> println("Exiting spark")
> spark.stop()
>
> }
> '''
>
> After I submitted the job, from the terminal I could see the code was executed 
> and printed "Exiting spark"; however, after printing that line, it never 
> exited Spark, it just got stuck.
>
> Does anybody know what the reason is? Or how to force it to quit?
>
> Thanks!
>
>


-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: How to reissue a delegated token after max lifetime passes for a spark streaming application on a Kerberized cluster

2019-01-03 Thread Marcelo Vanzin
Ah, man, there are a few known issues with KMS delegation tokens. The main
one we've run into is HADOOP-14445, but it's only fixed in new versions of
Hadoop. I wouldn't expect you guys to be running those, but if you are, it
would be good to know.

In our forks we added a hack to work around that issue, maybe you can try
it out:
https://github.com/cloudera/spark/commit/108c1312d3a2b52090cb2713e7f8d68b9a0be8b1#diff-585a75e78c688c892d640281cfc56fed


On Thu, Jan 3, 2019 at 10:12 AM Paolo Platter 
wrote:

> Hi,
>
>
>
> The Spark default behaviour is to request a brand new token every 24
> hours; it is not going to renew delegation tokens, and that is the better
> approach for long-running applications like streaming ones.
>
>
>
> In our use case using keytab and principal is working fine with
> hdfs_delegation_token but is NOT working with “kms-dt”.
>
>
>
> Anyone knows why this is happening ? Any suggestion to make it working
> with KMS ?
>
>
>
> Thanks
>
>
>
>
>
>
>
>
> *Paolo Platter*
>
> *CTO*
>
> E-mail:paolo.plat...@agilelab.it
>
> Web Site:   www.agilelab.it
>
>
>
>
> --
> *From:* Marcelo Vanzin 
> *Sent:* Thursday, January 3, 2019 7:03:22 PM
> *To:* alinazem...@gmail.com
> *Cc:* user
> *Subject:* Re: How to reissue a delegated token after max lifetime passes
> for a spark streaming application on a Kerberized cluster
>
> If you are using the principal / keytab params, Spark should create
> tokens as needed. If it's not, something else is going wrong, and only
> looking at full logs for the app would help.
> On Wed, Jan 2, 2019 at 5:09 PM Ali Nazemian  wrote:
> >
> > Hi,
> >
> > We are using a headless keytab to run our long-running spark streaming
> application. The token is renewed automatically every 1 day until it hits
> the max life limit. The problem is token is expired after max life (7 days)
> and we need to restart the job. Is there any way we can re-issue the token
> and pass it to a job that is already running? It doesn't feel right at all
> to restart the job every 7 days only due to the token issue.
> >
> > P.S: We use  "--keytab /path/to/the/headless-keytab", "--principal
> principalNameAsPerTheKeytab" and "--conf
> spark.hadoop.fs.hdfs.impl.disable.cache=true" as the arguments for
> spark-submit command.
> >
> > Thanks,
> > Ali
>
>
>
> --
> Marcelo
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>

-- 
Marcelo


Re: How to reissue a delegated token after max lifetime passes for a spark streaming application on a Kerberized cluster

2019-01-03 Thread Marcelo Vanzin
If you are using the principal / keytab params, Spark should create
tokens as needed. If it's not, something else is going wrong, and only
looking at full logs for the app would help.
On Wed, Jan 2, 2019 at 5:09 PM Ali Nazemian  wrote:
>
> Hi,
>
> We are using a headless keytab to run our long-running spark streaming 
> application. The token is renewed automatically every 1 day until it hits the 
> max life limit. The problem is token is expired after max life (7 days) and 
> we need to restart the job. Is there any way we can re-issue the token and 
> pass it to a job that is already running? It doesn't feel right at all to 
> restart the job every 7 days only due to the token issue.
>
> P.S: We use  "--keytab /path/to/the/headless-keytab", "--principal 
> principalNameAsPerTheKeytab" and "--conf 
> spark.hadoop.fs.hdfs.impl.disable.cache=true" as the arguments for 
> spark-submit command.
>
> Thanks,
> Ali



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Custom Metric Sink on Executor Always ClassNotFound

2018-12-20 Thread Marcelo Vanzin
First, it's really weird to use "org.apache.spark" for a class that is
not in Spark.

For executors, the jar file of the sink needs to be in the system
classpath; the application jar is not in the system classpath, so that
does not work. There are different ways for you to get it there, most
of them manual (YARN is, I think, the only RM supported in Spark where
the application itself can do it).

On Thu, Dec 20, 2018 at 1:48 PM prosp4300  wrote:
>
> Hi, Spark Users
>
> I'm playing with Spark metric monitoring, and want to add a custom sink, an 
> HttpSink that sends the metrics through a RESTful API.
> A subclass of Sink, "org.apache.spark.metrics.sink.HttpSink", is created and 
> packaged within the application jar.
>
> It works for the driver instance, but once enabled for the executor instance, the 
> following ClassNotFoundException is thrown. This seems to be because the 
> MetricsSystem is started very early for the executor, before the application jar is 
> loaded.
>
> I wonder if there is any way or best practice to add a custom sink for the executor 
> instance?
>
> 18/12/21 04:58:32 ERROR MetricsSystem: Sink class 
> org.apache.spark.metrics.sink.HttpSink cannot be instantiated
> 18/12/21 04:58:32 WARN UserGroupInformation: PriviledgedActionException 
> as:yarn (auth:SIMPLE) cause:java.lang.ClassNotFoundException: 
> org.apache.spark.metrics.sink.HttpSink
> Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1933)
> at 
> org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:66)
> at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:188)
> at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:284)
> at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
> Caused by: java.lang.ClassNotFoundException: 
> org.apache.spark.metrics.sink.HttpSink
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:348)
> at org.apache.spark.util.Utils$.classForName(Utils.scala:230)
> at 
> org.apache.spark.metrics.MetricsSystem$$anonfun$registerSinks$1.apply(MetricsSystem.scala:198)
> at 
> org.apache.spark.metrics.MetricsSystem$$anonfun$registerSinks$1.apply(MetricsSystem.scala:194)
> at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
> at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
> at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
> at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
> at scala.collection.mutable.HashMap.foreach(HashMap.scala:99)
> at 
> org.apache.spark.metrics.MetricsSystem.registerSinks(MetricsSystem.scala:194)
> at org.apache.spark.metrics.MetricsSystem.start(MetricsSystem.scala:102)
> at org.apache.spark.SparkEnv$.create(SparkEnv.scala:366)
> at org.apache.spark.SparkEnv$.createExecutorEnv(SparkEnv.scala:201)
> at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:223)
> at 
> org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:67)
> at 
> org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:66)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
> ... 4 more
> stdout0, *container_e81_1541584460930_3814_01_05
> spark.log: 18/12/21 04:58:00 ERROR 
> org.apache.spark.metrics.MetricsSystem.logError:70 - Sink class 
> org.apache.spark.metrics.sink.HttpSink cannot be instantiated
>
>
>
>



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: [ANNOUNCE] Announcing Apache Spark 2.4.0

2018-11-08 Thread Marcelo Vanzin
+user@

>> -- Forwarded message -
>> From: Wenchen Fan 
>> Date: Thu, Nov 8, 2018 at 10:55 PM
>> Subject: [ANNOUNCE] Announcing Apache Spark 2.4.0
>> To: Spark dev list 
>>
>>
>> Hi all,
>>
>> Apache Spark 2.4.0 is the fifth release in the 2.x line. This release adds 
>> Barrier Execution Mode for better integration with deep learning frameworks, 
>> introduces 30+ built-in and higher-order functions to deal with complex data 
>> type easier, improves the K8s integration, along with experimental Scala 
>> 2.12 support. Other major updates include the built-in Avro data source, 
>> Image data source, flexible streaming sinks, elimination of the 2GB block 
>> size limitation during transfer, Pandas UDF improvements. In addition, this 
>> release continues to focus on usability, stability, and polish while 
>> resolving around 1100 tickets.
>>
>> We'd like to thank our contributors and users for their contributions and 
>> early feedback to this release. This release would not have been possible 
>> without you.
>>
>> To download Spark 2.4.0, head over to the download page: 
>> http://spark.apache.org/downloads.html
>>
>> To view the release notes: 
>> https://spark.apache.org/releases/spark-release-2-4-0.html
>>
>> Thanks,
>> Wenchen
>>
>> PS: If you see any issues with the release notes, webpage or published 
>> artifacts, please contact me directly off-list.



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: [Spark UI] Spark 2.3.1 UI no longer respects spark.ui.retainedJobs

2018-10-25 Thread Marcelo Vanzin
Ah that makes more sense. Could you file a bug with that information
so we don't lose track of this?

Thanks
On Wed, Oct 24, 2018 at 6:13 PM Patrick Brown
 wrote:
>
> On my production application I am running ~200 jobs at once, but continue to 
> submit jobs in this manner for sometimes ~1 hour.
>
> The reproduction code above generally only has 4 ish jobs running at once, 
> and as you can see runs through 50k jobs in this manner.
>
> I guess I should clarify my above statement, the issue seems to appear when 
> running multiple jobs at once as well as in sequence for a while and may as 
> well have something to do with high master CPU usage (thus the collect in the 
> code). My rough guess would be whatever is managing clearing out completed 
> jobs gets overwhelmed (my master was a 4 core machine while running this, and 
> htop reported almost full CPU usage across all 4 cores).
>
> The attached screenshot shows the state of the webui after running the repro 
> code, you can see the ui is displaying some 43k completed jobs (takes a long 
> time to load) after a few minutes of inactivity this will clear out, however 
> as my production application continues to submit jobs every once in a while, 
> the issue persists.
>
> On Wed, Oct 24, 2018 at 5:05 PM Marcelo Vanzin  wrote:
>>
>> When you say many jobs at once, what ballpark are you talking about?
>>
>> The code in 2.3+ does try to keep data about all running jobs and
>> stages regardless of the limit. If you're running into issues because
>> of that we may have to look again at whether that's the right thing to
>> do.
>> On Tue, Oct 23, 2018 at 10:02 AM Patrick Brown
>>  wrote:
>> >
>> > I believe I may be able to reproduce this now, it seems like it may be 
>> > something to do with many jobs at once:
>> >
>> > Spark 2.3.1
>> >
>> > > spark-shell --conf spark.ui.retainedJobs=1
>> >
>> > scala> import scala.concurrent._
>> > scala> import scala.concurrent.ExecutionContext.Implicits.global
>> > scala> for (i <- 0 until 5) { Future { println(sc.parallelize(0 until 
>> > i).collect.length) } }
>> >
>> > On Mon, Oct 22, 2018 at 11:25 AM Marcelo Vanzin  
>> > wrote:
>> >>
>> >> Just tried on 2.3.2 and worked fine for me. UI had a single job and a
>> >> single stage (+ the tasks related to that single stage), same thing in
>> >> memory (checked with jvisualvm).
>> >>
>> >> On Sat, Oct 20, 2018 at 6:45 PM Marcelo Vanzin  
>> >> wrote:
>> >> >
>> >> > On Tue, Oct 16, 2018 at 9:34 AM Patrick Brown
>> >> >  wrote:
>> >> > > I recently upgraded to spark 2.3.1 I have had these same settings in 
>> >> > > my spark submit script, which worked on 2.0.2, and according to the 
>> >> > > documentation appear to not have changed:
>> >> > >
>> >> > > spark.ui.retainedTasks=1
>> >> > > spark.ui.retainedStages=1
>> >> > > spark.ui.retainedJobs=1
>> >> >
>> >> > I tried that locally on the current master and it seems to be working.
>> >> > I don't have 2.3 easily in front of me right now, but will take a look
>> >> > Monday.
>> >> >
>> >> > --
>> >> > Marcelo
>> >>
>> >>
>> >>
>> >> --
>> >> Marcelo
>>
>>
>>
>> --
>> Marcelo



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: [Spark UI] Spark 2.3.1 UI no longer respects spark.ui.retainedJobs

2018-10-24 Thread Marcelo Vanzin
When you say many jobs at once, what ballpark are you talking about?

The code in 2.3+ does try to keep data about all running jobs and
stages regardless of the limit. If you're running into issues because
of that we may have to look again at whether that's the right thing to
do.
On Tue, Oct 23, 2018 at 10:02 AM Patrick Brown
 wrote:
>
> I believe I may be able to reproduce this now, it seems like it may be 
> something to do with many jobs at once:
>
> Spark 2.3.1
>
> > spark-shell --conf spark.ui.retainedJobs=1
>
> scala> import scala.concurrent._
> scala> import scala.concurrent.ExecutionContext.Implicits.global
> scala> for (i <- 0 until 5) { Future { println(sc.parallelize(0 until 
> i).collect.length) } }
>
> On Mon, Oct 22, 2018 at 11:25 AM Marcelo Vanzin  wrote:
>>
>> Just tried on 2.3.2 and worked fine for me. UI had a single job and a
>> single stage (+ the tasks related to that single stage), same thing in
>> memory (checked with jvisualvm).
>>
>> On Sat, Oct 20, 2018 at 6:45 PM Marcelo Vanzin  wrote:
>> >
>> > On Tue, Oct 16, 2018 at 9:34 AM Patrick Brown
>> >  wrote:
>> > > I recently upgraded to spark 2.3.1 I have had these same settings in my 
>> > > spark submit script, which worked on 2.0.2, and according to the 
>> > > documentation appear to not have changed:
>> > >
>> > > spark.ui.retainedTasks=1
>> > > spark.ui.retainedStages=1
>> > > spark.ui.retainedJobs=1
>> >
>> > I tried that locally on the current master and it seems to be working.
>> > I don't have 2.3 easily in front of me right now, but will take a look
>> > Monday.
>> >
>> > --
>> > Marcelo
>>
>>
>>
>> --
>> Marcelo



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: [Spark UI] Spark 2.3.1 UI no longer respects spark.ui.retainedJobs

2018-10-22 Thread Marcelo Vanzin
Just tried on 2.3.2 and worked fine for me. UI had a single job and a
single stage (+ the tasks related to that single stage), same thing in
memory (checked with jvisualvm).

On Sat, Oct 20, 2018 at 6:45 PM Marcelo Vanzin  wrote:
>
> On Tue, Oct 16, 2018 at 9:34 AM Patrick Brown
>  wrote:
> > I recently upgraded to spark 2.3.1 I have had these same settings in my 
> > spark submit script, which worked on 2.0.2, and according to the 
> > documentation appear to not have changed:
> >
> > spark.ui.retainedTasks=1
> > spark.ui.retainedStages=1
> > spark.ui.retainedJobs=1
>
> I tried that locally on the current master and it seems to be working.
> I don't have 2.3 easily in front of me right now, but will take a look
> Monday.
>
> --
> Marcelo



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: [Spark UI] Spark 2.3.1 UI no longer respects spark.ui.retainedJobs

2018-10-20 Thread Marcelo Vanzin
On Tue, Oct 16, 2018 at 9:34 AM Patrick Brown
 wrote:
> I recently upgraded to spark 2.3.1 I have had these same settings in my spark 
> submit script, which worked on 2.0.2, and according to the documentation 
> appear to not have changed:
>
> spark.ui.retainedTasks=1
> spark.ui.retainedStages=1
> spark.ui.retainedJobs=1

I tried that locally on the current master and it seems to be working.
I don't have 2.3 easily in front of me right now, but will take a look
Monday.

-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: kerberos auth for MS SQL server jdbc driver

2018-10-15 Thread Marcelo Vanzin
Spark only does Kerberos authentication on the driver. For executors it
currently only supports Hadoop's delegation tokens for Kerberos.

To use something that does not support delegation tokens you have to
manually manage the Kerberos login in your code that runs in executors,
which might be tricky. It means distributing the keytab yourself (not with
Spark's --keytab argument) and calling into the UserGroupInformation API
directly.

I don't have any examples of that, though, maybe someone does. (We have a
similar example for Kafka on our blog somewhere, but not sure how far that
will get you with MS SQL.)
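
A rough sketch of that manual approach (the principal, keytab name, table logic and JDBC URL options below are assumptions, not something from this thread): ship the keytab yourself, e.g. with --files or sc.addFile(), then log in with the UserGroupInformation API inside each partition before opening the JDBC connection.

    import java.security.PrivilegedExceptionAction
    import java.sql.DriverManager

    import org.apache.hadoop.security.UserGroupInformation
    import org.apache.spark.SparkFiles

    // df is the DataFrame to write; "app.keytab" was shipped with --files or sc.addFile.
    df.rdd.foreachPartition { rows =>
      val ugi = UserGroupInformation.loginUserFromKeytabAndReturnUGI(
        "app-user@EXAMPLE.COM", SparkFiles.get("app.keytab"))
      ugi.doAs(new PrivilegedExceptionAction[Unit] {
        override def run(): Unit = {
          val conn = DriverManager.getConnection(
            "jdbc:sqlserver://dbhost:1433;databaseName=mydb;" +
              "integratedSecurity=true;authenticationScheme=JavaKerberos")
          try {
            rows.foreach { row =>
              // INSERT the row over conn, e.g. via a PreparedStatement (omitted here)
              ()
            }
          } finally {
            conn.close()
          }
        }
      })
    }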


On Mon, Oct 15, 2018 at 12:04 AM Foster Langbein <
foster.langb...@riskfrontiers.com> wrote:

> Has anyone gotten spark to write to SQL server using Kerberos
> authentication with Microsoft's JDBC driver? I'm having limited success,
> though in theory it should work.
>
> I'm using a YARN-mode 4-node Spark 2.3.0 cluster and trying to write a
> simple table to SQL Server 2016. I can get it to work if I use SQL server
> credentials, however this is not an option in my application. I need to
> use windows authentication - so-called integratedSecurity - and in
> particular I want to use a keytab file.
>
> The solution half works - the spark driver creates a table on SQL server -
> so I'm pretty confident the Kerberos implementation/credentials etc are
> setup correctly and valid. However the executors then fail to write any
> data to the table with an exception: "java.security.PrivilegedActionException:
> GSSException: No valid credentials provided (Mechanism level: Failed to
> find any Kerberos tgt)"
>
> After much tracing/debugging it seems executors are behaving differently
> to the spark driver and ignoring the specification to use the credentials
> supplied in the keytab and instead trying to use the default spark cluster
> user. I simply haven't been able to force them to use what's in the keytab
> after trying many, many variations.
>
> Very grateful if anyone has any help/suggestions/ideas on how to get this
> to work.
>
>
> --
>
>
>
> *Dr Foster Langbein* | Chief Technology Officer | Risk Frontiers
>
> Level 2, 100 Christie St, St Leonards, NSW, 2065
>
>
> Telephone: +61 2 8459 9777
>
> Email: foster.langb...@riskfrontiers.com | Website: www.riskfrontiers.com
>
>
>
>
> *Risk Modelling | Risk Management | Resilience | Disaster Management
> | Social Research Australia | New Zealand | Asia Pacific*
>
>
>


-- 
Marcelo


Re: Specifying different version of pyspark.zip and py4j files on worker nodes with Spark pre-installed

2018-10-05 Thread Marcelo Vanzin
Sorry, I can't help you if that doesn't work. Your YARN RM really
should not have SPARK_HOME set if you want to use more than one Spark
version.
On Thu, Oct 4, 2018 at 9:54 PM Jianshi Huang  wrote:
>
> Hi Marcelo,
>
> I see what you mean. Tried it but still got same error message.
>
>> Error from python worker:
>>   Traceback (most recent call last):
>> File "/usr/local/Python-3.6.4/lib/python3.6/runpy.py", line 183, in 
>> _run_module_as_main
>>   mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
>> File "/usr/local/Python-3.6.4/lib/python3.6/runpy.py", line 109, in 
>> _get_module_details
>>   __import__(pkg_name)
>> File 
>> "/usr/lib/spark-current/python/lib/pyspark.zip/pyspark/__init__.py", line 
>> 46, in 
>> File "/usr/lib/spark-current/python/lib/pyspark.zip/pyspark/context.py", 
>> line 29, in 
>>   ModuleNotFoundError: No module named 'py4j'
>> PYTHONPATH was:
>>   
>> /usr/lib/spark-current/python/lib/pyspark.zip:/usr/lib/spark-current/python/lib/py4j-0.10.7-src.zip:/mnt/disk3/yarn/usercache/jianshi.huang/filecache/134/__spark_libs__8468485589501316413.zip/spark-core_2.11-2.3.2.jar
>
>
> On Fri, Oct 5, 2018 at 1:25 AM Marcelo Vanzin  wrote:
>>
>> Try "spark.executorEnv.SPARK_HOME=$PWD" (in quotes so it does not get
>> expanded by the shell).
>>
>> But it's really weird to be setting SPARK_HOME in the environment of
>> your node managers. YARN shouldn't need to know about that.
>> On Thu, Oct 4, 2018 at 10:22 AM Jianshi Huang  
>> wrote:
>> >
>> > https://github.com/apache/spark/blob/88e7e87bd5c052e10f52d4bb97a9d78f5b524128/core/src/main/scala/org/apache/spark/api/python/PythonUtils.scala#L31
>> >
>> > The code shows Spark will try to find the path if SPARK_HOME is specified. 
>> > And on my worker node, SPARK_HOME is specified in .bashrc , for the 
>> > pre-installed 2.2.1 path.
>> >
>> > I don't want to make any changes to worker node configuration, so any way 
>> > to override the order?
>> >
>> > Jianshi
>> >
>> > On Fri, Oct 5, 2018 at 12:11 AM Marcelo Vanzin  wrote:
>> >>
>> >> Normally the version of Spark installed on the cluster does not
>> >> matter, since Spark is uploaded from your gateway machine to YARN by
>> >> default.
>> >>
>> >> You probably have some configuration (in spark-defaults.conf) that
>> >> tells YARN to use a cached copy. Get rid of that configuration, and
>> >> you can use whatever version you like.
>> >> On Thu, Oct 4, 2018 at 2:19 AM Jianshi Huang  
>> >> wrote:
>> >> >
>> >> > Hi,
>> >> >
>> >> > I have a problem using multiple versions of Pyspark on YARN, the driver 
>> >> > and worker nodes are all preinstalled with Spark 2.2.1, for production 
>> >> > tasks. And I want to use 2.3.2 for my personal EDA.
>> >> >
>> >> > I've tried both 'pyFiles=' option and sparkContext.addPyFiles(), 
>> >> > however on the worker node, the PYTHONPATH still uses the system 
>> >> > SPARK_HOME.
>> >> >
>> >> > Anyone knows how to override the PYTHONPATH on worker nodes?
>> >> >
>> >> > Here's the error message,
>> >> >>
>> >> >>
>> >> >> Py4JJavaError: An error occurred while calling o75.collectToPython.
>> >> >> : org.apache.spark.SparkException: Job aborted due to stage failure: 
>> >> >> Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 
>> >> >> in stage 0.0 (TID 3, emr-worker-8.cluster-68492, executor 2): 
>> >> >> org.apache.spark.SparkException:
>> >> >> Error from python worker:
>> >> >> Traceback (most recent call last):
>> >> >> File "/usr/local/Python-3.6.4/lib/python3.6/runpy.py", line 183, in 
>> >> >> _run_module_as_main
>> >> >> mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
>> >> >> File "/usr/local/Python-3.6.4/lib/python3.6/runpy.py", line 109, in 
>> >> >> _get_module_details
>> >> >> __import__(pkg_name)
>> >> >> File 
>> >> >> "/usr/lib/spark-current/python/lib/pyspark.zip/pyspark/__init__.py", 
>> >> >> line 46, in 
>> >> >> File 
>> >> >>

Re: Specifying different version of pyspark.zip and py4j files on worker nodes with Spark pre-installed

2018-10-04 Thread Marcelo Vanzin
Try "spark.executorEnv.SPARK_HOME=$PWD" (in quotes so it does not get
expanded by the shell).

But it's really weird to be setting SPARK_HOME in the environment of
your node managers. YARN shouldn't need to know about that.
On Thu, Oct 4, 2018 at 10:22 AM Jianshi Huang  wrote:
>
> https://github.com/apache/spark/blob/88e7e87bd5c052e10f52d4bb97a9d78f5b524128/core/src/main/scala/org/apache/spark/api/python/PythonUtils.scala#L31
>
> The code shows Spark will try to find the path if SPARK_HOME is specified. 
> And on my worker node, SPARK_HOME is specified in .bashrc , for the 
> pre-installed 2.2.1 path.
>
> I don't want to make any changes to worker node configuration, so any way to 
> override the order?
>
> Jianshi
>
> On Fri, Oct 5, 2018 at 12:11 AM Marcelo Vanzin  wrote:
>>
>> Normally the version of Spark installed on the cluster does not
>> matter, since Spark is uploaded from your gateway machine to YARN by
>> default.
>>
>> You probably have some configuration (in spark-defaults.conf) that
>> tells YARN to use a cached copy. Get rid of that configuration, and
>> you can use whatever version you like.
>> On Thu, Oct 4, 2018 at 2:19 AM Jianshi Huang  wrote:
>> >
>> > Hi,
>> >
>> > I have a problem using multiple versions of Pyspark on YARN, the driver 
>> > and worker nodes are all preinstalled with Spark 2.2.1, for production 
>> > tasks. And I want to use 2.3.2 for my personal EDA.
>> >
>> > I've tried both 'pyFiles=' option and sparkContext.addPyFiles(), however 
>> > on the worker node, the PYTHONPATH still uses the system SPARK_HOME.
>> >
>> > Anyone knows how to override the PYTHONPATH on worker nodes?
>> >
>> > Here's the error message,
>> >>
>> >>
>> >> Py4JJavaError: An error occurred while calling o75.collectToPython.
>> >> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 
>> >> 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in 
>> >> stage 0.0 (TID 3, emr-worker-8.cluster-68492, executor 2): 
>> >> org.apache.spark.SparkException:
>> >> Error from python worker:
>> >> Traceback (most recent call last):
>> >> File "/usr/local/Python-3.6.4/lib/python3.6/runpy.py", line 183, in 
>> >> _run_module_as_main
>> >> mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
>> >> File "/usr/local/Python-3.6.4/lib/python3.6/runpy.py", line 109, in 
>> >> _get_module_details
>> >> __import__(pkg_name)
>> >> File "/usr/lib/spark-current/python/lib/pyspark.zip/pyspark/__init__.py", 
>> >> line 46, in 
>> >> File "/usr/lib/spark-current/python/lib/pyspark.zip/pyspark/context.py", 
>> >> line 29, in 
>> >> ModuleNotFoundError: No module named 'py4j'
>> >> PYTHONPATH was:
>> >> /usr/lib/spark-current/python/lib/pyspark.zip:/usr/lib/spark-current/python/lib/py4j-0.10.7-src.zip:/mnt/disk1/yarn/usercache/jianshi.huang/filecache/130/__spark_libs__5227988272944669714.zip/spark-core_2.11-2.3.2.jar
>> >
>> >
>> > And here's how I started Pyspark session in Jupyter.
>> >>
>> >>
>> >> %env SPARK_HOME=/opt/apps/ecm/service/spark/2.3.2-bin-hadoop2.7
>> >> %env PYSPARK_PYTHON=/usr/bin/python3
>> >> import findspark
>> >> findspark.init()
>> >> import pyspark
>> >> sparkConf = pyspark.SparkConf()
>> >> sparkConf.setAll([
>> >> ('spark.cores.max', '96')
>> >> ,('spark.driver.memory', '2g')
>> >> ,('spark.executor.cores', '4')
>> >> ,('spark.executor.instances', '2')
>> >> ,('spark.executor.memory', '4g')
>> >> ,('spark.network.timeout', '800')
>> >> ,('spark.scheduler.mode', 'FAIR')
>> >> ,('spark.shuffle.service.enabled', 'true')
>> >> ,('spark.dynamicAllocation.enabled', 'true')
>> >> ])
>> >> py_files = 
>> >> ['hdfs://emr-header-1.cluster-68492:9000/lib/py4j-0.10.7-src.zip']
>> >> sc = pyspark.SparkContext(appName="Jianshi", master="yarn-client", 
>> >> conf=sparkConf, pyFiles=py_files)
>> >>
>> >
>> >
>> > Thanks,
>> > --
>> > Jianshi Huang
>> >
>>
>>
>> --
>> Marcelo
>
>
>
> --
> Jianshi Huang
>
> LinkedIn: jianshi
> Twitter: @jshuang
> Github & Blog: http://huangjs.github.com/



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Specifying different version of pyspark.zip and py4j files on worker nodes with Spark pre-installed

2018-10-04 Thread Marcelo Vanzin
Normally the version of Spark installed on the cluster does not
matter, since Spark is uploaded from your gateway machine to YARN by
default.

You probably have some configuration (in spark-defaults.conf) that
tells YARN to use a cached copy. Get rid of that configuration, and
you can use whatever version you like.
On Thu, Oct 4, 2018 at 2:19 AM Jianshi Huang  wrote:
>
> Hi,
>
> I have a problem using multiple versions of Pyspark on YARN, the driver and 
> worker nodes are all preinstalled with Spark 2.2.1, for production tasks. And 
> I want to use 2.3.2 for my personal EDA.
>
> I've tried both 'pyFiles=' option and sparkContext.addPyFiles(), however on 
> the worker node, the PYTHONPATH still uses the system SPARK_HOME.
>
> Anyone knows how to override the PYTHONPATH on worker nodes?
>
> Here's the error message,
>>
>>
>> Py4JJavaError: An error occurred while calling o75.collectToPython.
>> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
>> in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 
>> (TID 3, emr-worker-8.cluster-68492, executor 2): 
>> org.apache.spark.SparkException:
>> Error from python worker:
>> Traceback (most recent call last):
>> File "/usr/local/Python-3.6.4/lib/python3.6/runpy.py", line 183, in 
>> _run_module_as_main
>> mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
>> File "/usr/local/Python-3.6.4/lib/python3.6/runpy.py", line 109, in 
>> _get_module_details
>> __import__(pkg_name)
>> File "/usr/lib/spark-current/python/lib/pyspark.zip/pyspark/__init__.py", 
>> line 46, in 
>> File "/usr/lib/spark-current/python/lib/pyspark.zip/pyspark/context.py", 
>> line 29, in 
>> ModuleNotFoundError: No module named 'py4j'
>> PYTHONPATH was:
>> /usr/lib/spark-current/python/lib/pyspark.zip:/usr/lib/spark-current/python/lib/py4j-0.10.7-src.zip:/mnt/disk1/yarn/usercache/jianshi.huang/filecache/130/__spark_libs__5227988272944669714.zip/spark-core_2.11-2.3.2.jar
>
>
> And here's how I started Pyspark session in Jupyter.
>>
>>
>> %env SPARK_HOME=/opt/apps/ecm/service/spark/2.3.2-bin-hadoop2.7
>> %env PYSPARK_PYTHON=/usr/bin/python3
>> import findspark
>> findspark.init()
>> import pyspark
>> sparkConf = pyspark.SparkConf()
>> sparkConf.setAll([
>> ('spark.cores.max', '96')
>> ,('spark.driver.memory', '2g')
>> ,('spark.executor.cores', '4')
>> ,('spark.executor.instances', '2')
>> ,('spark.executor.memory', '4g')
>> ,('spark.network.timeout', '800')
>> ,('spark.scheduler.mode', 'FAIR')
>> ,('spark.shuffle.service.enabled', 'true')
>> ,('spark.dynamicAllocation.enabled', 'true')
>> ])
>> py_files = ['hdfs://emr-header-1.cluster-68492:9000/lib/py4j-0.10.7-src.zip']
>> sc = pyspark.SparkContext(appName="Jianshi", master="yarn-client", 
>> conf=sparkConf, pyFiles=py_files)
>>
>
>
> Thanks,
> --
> Jianshi Huang
>


-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: deploy-mode cluster. FileNotFoundException

2018-09-05 Thread Marcelo Vanzin
See SPARK-4160. Long story short: you need to upload the files and
jars to some shared storage (like HDFS) manually.
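
A sketch of that manual staging step using the Hadoop FileSystem API (the namenode address and HDFS paths are placeholders; the jar name is taken from the command quoted below). Once the artifacts are on HDFS, reference hdfs:// URIs in --files and for the application jar.

    import java.net.URI

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    // One-off upload so every node (including a cluster-mode driver) can read the files.
    val fs = FileSystem.get(new URI("hdfs://namenode:8020"), new Configuration())
    fs.copyFromLocalFile(new Path("conf/appConf.json"),
      new Path("/apps/myapp/conf/appConf.json"))
    fs.copyFromLocalFile(new Path("iris-core-0.0.1-SNAPSHOT.jar"),
      new Path("/apps/myapp/iris-core-0.0.1-SNAPSHOT.jar"))
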
On Wed, Sep 5, 2018 at 2:17 AM Guillermo Ortiz Fernández
 wrote:
>
> I'm using standalone cluster and the final command I'm trying is:
> spark-submit --verbose --deploy-mode cluster --driver-java-options 
> "-Dlogback.configurationFile=conf/i${1}Logback.xml" \
> --class com.example.Launcher --driver-class-path 
> lib/spark-streaming-kafka-0-10_2.11-2.0.2.jar:lib/kafka-clients-0.10.0.1.jar  
> \
> --files conf/${1}Conf.json iris-core-0.0.1-SNAPSHOT.jar conf/${1}Conf.json
>
> On Wed, Sep 5, 2018 at 11:11, Guillermo Ortiz Fernández 
> () wrote:
>>
>> I want to execute my processes in cluster mode. As I don't know where the 
>> driver will be executed, I have to make available all the files it needs. I 
>> understand that there are two options: copy all the files to all nodes, or copy 
>> them to HDFS.
>>
>> My doubt is, if I want to put all the files in HDFS, isn't that automatic 
>> with the --files and --jars parameters of the spark-submit command? Or do I have 
>> to copy them to HDFS manually?
>>
>> My idea is to execute something like:
>> spark-submit --driver-java-options 
>> "-Dlogback.configurationFile=conf/${1}Logback.xml" \
>> --class com.example.Launcher --driver-class-path 
>> lib/spark-streaming-kafka-0-10_2.11-2.0.2.jar:lib/kafka-clients-1.0.0.jar \
>> --files /conf/${1}Conf.json example-0.0.1-SNAPSHOT.jar conf/${1}Conf.json
>> I have tried with --files hdfs:// without copying anything to HDFS 
>> and it doesn't work either.
>>


-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Issue upgrading to Spark 2.3.1 (Maintenance Release)

2018-06-15 Thread Marcelo Vanzin
I'm not familiar with PyCharm. But if you can run "pyspark" from the
command line and not hit this, then this might be an issue with
PyCharm or your environment - e.g. having an old version of the
pyspark code around, or maybe PyCharm itself might need to be updated.

On Thu, Jun 14, 2018 at 10:01 PM, Aakash Basu
 wrote:
> Hi,
>
> Downloaded the latest Spark version because of the fix for "ERROR
> AsyncEventQueue:70 - Dropping event from queue appStatus."
>
> After setting environment variables and running the same code in PyCharm,
> I'm getting this error, which I can't find a solution for.
>
> Exception in thread "main" java.util.NoSuchElementException: key not found:
> _PYSPARK_DRIVER_CONN_INFO_PATH
> at scala.collection.MapLike$class.default(MapLike.scala:228)
> at scala.collection.AbstractMap.default(Map.scala:59)
> at scala.collection.MapLike$class.apply(MapLike.scala:141)
> at scala.collection.AbstractMap.apply(Map.scala:59)
> at
> org.apache.spark.api.python.PythonGatewayServer$.main(PythonGatewayServer.scala:64)
> at
> org.apache.spark.api.python.PythonGatewayServer.main(PythonGatewayServer.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
> at
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:894)
> at
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:198)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:228)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>
>
> Any help?
>
> Thanks,
> Aakash.



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark user classpath setting

2018-06-14 Thread Marcelo Vanzin
I only know of a way to do that with YARN.

You can distribute the jar files using "--files" and add just their
names (not the full path) to the "extraClassPath" configs. You don't
need "userClassPathFirst" in that case.

On Thu, Jun 14, 2018 at 1:28 PM, Arjun kr  wrote:
> Hi All,
>
>
> I am trying to execute a sample Spark script (that uses Spark JDBC) which
> has dependencies on a set of custom jars. These custom jars need to be added
> first in the classpath. Currently, I have copied the custom lib directory to all
> the nodes and am able to execute it with the below command.
>
>
> bin/spark-shell  --conf spark.driver.extraClassPath=/custom-jars/* --conf
> "spark.driver.userClassPathFirst=true" --conf
> spark.executor.extraClassPath=/custom-jars/* --conf
> "spark.executor.userClassPathFirst=true" --master yarn -i
> /tmp/spark-test.scala
>
>
> Are there any options that do not require jars to be copied to all the nodes
> (with the option to be added to the classpath first)? The --jars and --archives
> options seem to not be working for me. Any suggestion would be appreciated!
>
>
> Thanks,
>
>
> Arjun
>
>
>



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



[ANNOUNCE] Announcing Apache Spark 2.3.1

2018-06-11 Thread Marcelo Vanzin
We are happy to announce the availability of Spark 2.3.1!

Apache Spark 2.3.1 is a maintenance release, based on the branch-2.3
maintenance branch of Spark. We strongly recommend all 2.3.x users to
upgrade to this stable release.

To download Spark 2.3.1, head over to the download page:
http://spark.apache.org/downloads.html

To view the release notes:
https://spark.apache.org/releases/spark-release-2-3-1.html

We would like to acknowledge all community members for contributing to
this release. This release would not have been possible without you.


-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: [SparkLauncher] stateChanged event not received in standalone cluster mode

2018-06-06 Thread Marcelo Vanzin
That feature has not been implemented yet.
https://issues.apache.org/jira/browse/SPARK-11033

On Wed, Jun 6, 2018 at 5:18 AM, Behroz Sikander  wrote:
> I have a client application which launches multiple jobs in Spark Cluster
> using SparkLauncher. I am using Standalone cluster mode. Launching jobs
> works fine till now. I use launcher.startApplication() to launch.
>
> But now, I have a requirement to check the states of my Driver process. I
> added a Listener implementing the SparkAppHandle.Listener but I don't get
> any events. I am following the approach mentioned here
> https://www.linkedin.com/pulse/spark-launcher-amol-kale
>
> I tried the same code with client code and I receive all the events as
> expected.
>
> So, I am guessing that something different needs to be done in cluster mode.
> Is there any example with cluster mode?
>
> Regards,
> Behroz



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Submit many spark applications

2018-05-25 Thread Marcelo Vanzin
I already gave my recommendation in my very first reply to this thread...

On Fri, May 25, 2018 at 10:23 AM, raksja  wrote:
> ok, when to use what?
> do you have any recommendation?
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Submit many spark applications

2018-05-25 Thread Marcelo Vanzin
On Fri, May 25, 2018 at 10:18 AM, raksja  wrote:
> InProcessLauncher would just start a subprocess as you mentioned earlier.

No. As the name says, it runs things in the same process.

-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Submit many spark applications

2018-05-25 Thread Marcelo Vanzin
That's what Spark uses.

On Fri, May 25, 2018 at 10:09 AM, raksja  wrote:
> thanks for the reply.
>
> Have you tried submitting a Spark job directly to YARN using YarnClient?
> https://hadoop.apache.org/docs/r2.6.0/api/org/apache/hadoop/yarn/client/api/YarnClient.html
>
> Not sure whether it's performant and scalable?
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Submit many spark applications

2018-05-23 Thread Marcelo Vanzin
On Wed, May 23, 2018 at 12:04 PM, raksja  wrote:
> So InProcessLauncher wouldn't use the native memory, so will it overload the
> memory of the parent process?

I will still use "native memory" (since the parent process will still
use memory), just less of it. But yes, it will use more memory in the
parent process.

> Is there any way that we can overcome this?

Try to launch less applications concurrently.


-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Encounter 'Could not find or load main class' error when submitting spark job on kubernetes

2018-05-22 Thread Marcelo Vanzin
On Tue, May 22, 2018 at 12:45 AM, Makoto Hashimoto
 wrote:
> local:///usr/local/oss/spark-2.3.0-bin-hadoop2.7/examples/jars/spark-examples_2.11-2.3.0.jar

Is that the path of the jar inside your docker image? The default
image puts that in /opt/spark IIRC.

-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Submit many spark applications

2018-05-16 Thread Marcelo Vanzin
You can either:

- set spark.yarn.submit.waitAppCompletion=false, which will make
spark-submit go away once the app starts in cluster mode.
- use the (new in 2.3) InProcessLauncher class + some custom Java code
to submit all the apps from the same "launcher" process.
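A rough sketch of the second option (app resource, main class and master below
are placeholders; the builder methods are the ones shared with SparkLauncher):

import org.apache.spark.launcher.InProcessLauncher

// Launch many apps from one long-lived "launcher" JVM; each call returns a
// SparkAppHandle instead of forking a separate spark-submit process.
val handle = new InProcessLauncher()
  .setMaster("yarn")
  .setDeployMode("cluster")
  .setAppResource("/path/to/app.jar")            // placeholder
  .setMainClass("com.example.MyApp")             // placeholder
  .setConf("spark.yarn.submit.waitAppCompletion", "false")
  .startApplication()

println(handle.getState)   // poll, or register a SparkAppHandle.Listener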

On Wed, May 16, 2018 at 1:45 PM, Shiyuan  wrote:
> Hi Spark-users,
>  I want to submit as many spark applications as the resources permit. I am
> using cluster mode on a yarn cluster.  Yarn can queue and launch these
> applications without problems. The problem lies on spark-submit itself.
> Spark-submit starts a jvm which could fail due to insufficient memory on the
> machine where I run spark-submit if many spark-submit jvm are running. Any
> suggestions on how to solve this problem? Thank you!



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark UI Source Code

2018-05-09 Thread Marcelo Vanzin
(-dev)

The KVStore API is private to Spark, it's not really meant to be used
by others. You're free to try, and there's a lot of javadocs on the
different interfaces, but it's not a general purpose database, so
you'll need to figure out things like that by yourself.

On Tue, May 8, 2018 at 9:53 PM, Anshi Shrivastava
<anshi.shrivast...@exadatum.com> wrote:
> Hi Marcelo, Dev,
>
> Thanks for your response.
> I have used SparkListeners to fetch the metrics (the public REST API uses
> the same) but to monitor these metrics over time, I have to persist them
> (using KVStore library of spark).  Is there a way to fetch data from this
> KVStore (which uses levelDb for storage) and filter it on basis on
> timestamp?
>
> Thanks,
> Anshi
>
> On Mon, May 7, 2018 at 9:51 PM, Marcelo Vanzin [via Apache Spark User List]
> <ml+s1001560n32114...@n3.nabble.com> wrote:
>>
>> On Mon, May 7, 2018 at 1:44 AM, Anshi Shrivastava
>> <[hidden email]> wrote:
>> > I've found a KVStore wrapper which stores all the metrics in a LevelDb
>> > store. This KVStore wrapper is available as a spark-dependency but we
>> > cannot
>> > access the metrics directly from spark since they are all private.
>>
>> I'm not sure what it is you're trying to do exactly, but there's a
>> public REST API that exposes all the data Spark keeps about
>> applications. There's also a programmatic status tracker
>> (SparkContext.statusTracker) that's easier to use from within the
>> running Spark app, but has a lot less info.
>>
>> > Can we use this store to store our own metrics?
>>
>> No.
>>
>> > Also can we retrieve these metrics based on timestamp?
>>
>> Only if the REST API has that feature, don't remember off the top of my
>> head.
>>
>>
>> --
>> Marcelo
>>
>> -
>> To unsubscribe e-mail: [hidden email]
>>
>>
>>
>> 
>
>
>
>



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Guava dependency issue

2018-05-08 Thread Marcelo Vanzin
Using a custom Guava version with Spark is not that simple. Spark
shades Guava, but a lot of libraries Spark uses do not - the main one
being all of the Hadoop ones, and they need a quite old Guava.

So you have two options: shade/relocate Guava in your application, or
use spark.{driver|executor}.userClassPathFirst.

There really isn't anything easier until we get shaded Hadoop client
libraries...
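For the first option, a minimal maven-shade-plugin relocation along these lines
is the usual approach (the shadedPattern prefix is arbitrary):

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
      <configuration>
        <relocations>
          <relocation>
            <pattern>com.google.common</pattern>
            <shadedPattern>myapp.shaded.com.google.common</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>

Your code then compiles against regular Guava but runs against the relocated
copy, so it no longer collides with the older Guava the Hadoop jars need.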

On Tue, May 8, 2018 at 8:44 AM, Stephen Boesch  wrote:
>
> I downgraded to spark 2.0.1 and it fixed that particular runtime exception:
> but then a similar one appears when saving to parquet:
>
> An  SOF question on this was created a month ago and today further details
> plus an open bounty were added to it:
>
> https://stackoverflow.com/questions/49713485/spark-error-with-google-guava-library-java-lang-nosuchmethoderror-com-google-c
>
> The new but similar exception is shown below:
>
> The hack to downgrade to 2.0.1 does help - i.e. execution proceeds further :
> but then when writing out to parquet the above error does happen.
>
> 8/05/07 11:26:11 ERROR Executor: Exception in task 0.0 in stage 2741.0 (TID
> 2618)
> java.lang.NoSuchMethodError:
> com.google.common.cache.CacheBuilder.build(Lcom/google/common/cache/CacheLoader;)Lcom/google/common/cache/LoadingCache;
> at
> org.apache.hadoop.io.compress.CodecPool.createCache(CodecPool.java:62)
> at org.apache.hadoop.io.compress.CodecPool.(CodecPool.java:74)
> at
> org.apache.parquet.hadoop.CodecFactory$BytesCompressor.(CodecFactory.java:92)
> at
> org.apache.parquet.hadoop.CodecFactory.getCompressor(CodecFactory.java:169)
> at
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:303)
> at
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:262)
> at
> org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.(ParquetFileFormat.scala:562)
> at
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:139)
> at
> org.apache.spark.sql.execution.datasources.BaseWriterContainer.newOutputWriter(WriterContainer.scala:131)
> at
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:247)
> at
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
> at
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> at org.apache.spark.scheduler.Task.run(Task.scala:86)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:6
>
>
>
> 2018-05-07 10:30 GMT-07:00 Stephen Boesch :
>>
>> I am intermittently running into guava dependency issues across mutiple
>> spark projects.  I have tried maven shade / relocate but it does not resolve
>> the issues.
>>
>> The current project is extremely simple: *no* additional dependencies
>> beyond scala, spark, and scalatest - yet the issues remain (and yes mvn
>> clean was re-applied).
>>
>> Is there a reliable approach to handling the versioning for guava within
>> spark dependency projects?
>>
>>
>> [INFO]
>> 
>> [INFO] Building ccapps_final 1.0-SNAPSHOT
>> [INFO]
>> 
>> [INFO]
>> [INFO] --- exec-maven-plugin:1.6.0:java (default-cli) @ ccapps_final ---
>> 18/05/07 10:24:00 WARN NativeCodeLoader: Unable to load native-hadoop
>> library for your platform... using builtin-java classes where applicable
>> [WARNING]
>> java.lang.NoSuchMethodError:
>> com.google.common.cache.CacheBuilder.refreshAfterWrite(JLjava/util/concurrent/TimeUnit;)Lcom/google/common/cache/CacheBuilder;
>> at org.apache.hadoop.security.Groups.(Groups.java:96)
>> at org.apache.hadoop.security.Groups.(Groups.java:73)
>> at
>> org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:293)
>> at
>> org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:283)
>> at
>> org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:260)
>> at
>> org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:789)
>> at
>> org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:774)
>> at
>> org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:647)
>> at
>> 

Re: Spark UI Source Code

2018-05-07 Thread Marcelo Vanzin
On Mon, May 7, 2018 at 1:44 AM, Anshi Shrivastava
 wrote:
> I've found a KVStore wrapper which stores all the metrics in a LevelDb
> store. This KVStore wrapper is available as a spark-dependency but we cannot
> access the metrics directly from spark since they are all private.

I'm not sure what it is you're trying to do exactly, but there's a
public REST API that exposes all the data Spark keeps about
applications. There's also a programmatic status tracker
(SparkContext.statusTracker) that's easier to use from within the
running Spark app, but has a lot less info.
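For the status tracker route, a minimal sketch (assuming an existing
SparkContext called sc):

// Coarse-grained job / stage / executor status from inside the running app.
val tracker = sc.statusTracker
tracker.getActiveJobIds().foreach { id =>
  tracker.getJobInfo(id).foreach { info =>
    println(s"job $id: status=${info.status} stages=${info.stageIds.mkString(",")}")
  }
}
tracker.getExecutorInfos.foreach { e =>
  println(s"executor ${e.host}:${e.port} running ${e.numRunningTasks} tasks")
}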

> Can we use this store to store our own metrics?

No.

> Also can we retrieve these metrics based on timestamp?

Only if the REST API has that feature, don't remember off the top of my head.


-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark launcher listener not getting invoked k8s Spark 2.3

2018-04-30 Thread Marcelo Vanzin
Please include the mailing list in your replies.

Yes, you'll be able to launch the jobs, but the k8s backend isn't
hooked up to the listener functionality.

On Mon, Apr 30, 2018 at 8:13 PM, purna m <kittu45...@gmail.com> wrote:
> I’m able to submit the job though !! I mean spark application is running on
> k8 but listener is not getting invoked
>
>
> On Monday, April 30, 2018, Marcelo Vanzin <van...@cloudera.com> wrote:
>>
>> I'm pretty sure this feature hasn't been implemented for the k8s backend.
>>
>> On Mon, Apr 30, 2018 at 4:51 PM, purna m <kittu45...@gmail.com> wrote:
>> > HI im using below code to submit a spark 2.3 application on kubernetes
>> > cluster in scala using play framework
>> >
>> > I have also tried as a simple scala program without using play framework
>> >
>> > im trying to spark submit which was mentioned below
>> >
>> > programaticallyhttps://spark.apache.org/docs/latest/running-on-kubernetes.html
>> >
>> >
>> >
>> > $ bin/spark-submit \
>> >
>> > --master k8s://https://: \
>> >
>> > --deploy-mode cluster \
>> >
>> > --name spark-pi \
>> >
>> > --class org.apache.spark.examples.SparkPi \
>> >
>> > --conf spark.executor.instances=5 \
>> >
>> > --conf spark.kubernetes.container.image= \
>> >
>> > local:///path/to/examples.jar
>> >
>> >
>> >
>> >   def index = Action {
>> >
>> > try
>> >
>> > {
>> >
>> > val spark = new SparkLauncher()
>> >
>> >   .setMaster("my k8 apiserver host")
>> >
>> >   .setVerbose(true)
>> >
>> >   .addSparkArg("--verbose")
>> >
>> >   .setAppResource("hdfs://server/inputs/my.jar")
>> >
>> >   .setConf("spark.app.name","myapp")
>> >
>> >   .setConf("spark.executor.instances","5")
>> >
>> >   .setConf("spark.kubernetes.container.image","mydockerimage")
>> >
>> >   .setDeployMode("cluster")
>> >
>> >   .startApplication(new SparkAppHandle.Listener(){
>> >
>> > def infoChanged(handle: SparkAppHandle): Unit = {
>> >
>> >   System.out.println("Spark App Id [" + handle.getAppId + "]
>> > Info
>> > Changed.  State [" + handle.getState + "]")
>> >
>> > }
>> >
>> >def stateChanged(handle: SparkAppHandle): Unit = {
>> >
>> >   System.out.println("Spark App Id [" + handle.getAppId + "]
>> > State
>> > Changed. State [" + handle.getState + "]")
>> >
>> >   if (handle.getState.toString == "FINISHED") System.exit(0)
>> >
>> > }
>> >
>> >   } )
>> >
>> > Ok(spark.getState().toString())
>> >
>> > }
>> >
>> > catch
>> >
>> > {
>> >
>> >   case NonFatal(e)=>{
>> >
>> > println("failed with exception: " + e)
>> >
>> >   }
>> >
>> > }
>> >
>> > Ok
>> >
>> >   }
>>
>>
>>
>> --
>> Marcelo



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark LOCAL mode and external jar (extraClassPath)

2018-04-13 Thread Marcelo Vanzin
There are two things you're doing wrong here:

On Thu, Apr 12, 2018 at 6:32 PM, jb44  wrote:
> Then I can add the alluxio client library like so:
> sparkSession.conf.set("spark.driver.extraClassPath", ALLUXIO_SPARK_CLIENT)

First one, you can't modify JVM configuration after it has already
started. So this line does nothing since it can't re-launch your
application with a new JVM.

> sparkSession.conf.set("spark.executor.extraClassPath", ALLUXIO_SPARK_CLIENT)

There is a lot of configuration that you cannot set after the
application has already started. For example, after the session is
created, most probably this option will be ignored, since executors
will already have started.

I'm not so sure about what happens when you use dynamic allocation,
but these post-hoc config changes in general are not expected to take
effect.

The documentation could be clearer about this (especially stuff that
only applies to spark-submit), but that's the gist of it.
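In practice that means putting the jar on the class path before the JVM comes
up, e.g. on the spark-submit command line (the path stands in for the poster's
ALLUXIO_SPARK_CLIENT jar):

spark-submit \
  --conf spark.driver.extraClassPath=/path/to/alluxio-spark-client.jar \
  --conf spark.executor.extraClassPath=/path/to/alluxio-spark-client.jar \
  ...

or by putting those same two properties in conf/spark-defaults.conf.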


-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark on Kubernetes (minikube) 2.3 fails with class not found exception

2018-04-10 Thread Marcelo Vanzin
This is the problem:

> :/opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar;/opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar

Seems like some code is confusing things when mixing OSes. It's using
the Windows separator when building a command line to be run on a
Linux host.


On Tue, Apr 10, 2018 at 11:02 AM, Dmitry  wrote:
> Previous example was bad paste( I tried a lot of variants, so sorry for
> wrong paste )
> PS C:\WINDOWS\system32> spark-submit --master k8s://https://ip:8443
> --deploy-mode cluster  --name spark-pi --class
> org.apache.spark.examples.SparkPi --conf spark.executor.instances=1
> --executor-memory 1G --conf spark.kubernete
> s.container.image=andrusha/spark-k8s:2.3.0-hadoop2.7
> local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar
> Returns
> Image:
> andrusha/spark-k8s:2.3.0-hadoop2.7
> Environment variables:
> SPARK_DRIVER_MEMORY: 1g
> SPARK_DRIVER_CLASS: org.apache.spark.examples.SparkPi
> SPARK_DRIVER_ARGS:
> SPARK_DRIVER_BIND_ADDRESS:
> SPARK_MOUNTED_CLASSPATH:
> /opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar;/opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar
> SPARK_JAVA_OPT_0:
> -Dspark.kubernetes.driver.pod.name=spark-pi-46f48a0974d43341886076bc3c5f31c4-driver
> SPARK_JAVA_OPT_1:
> -Dspark.kubernetes.executor.podNamePrefix=spark-pi-46f48a0974d43341886076bc3c5f31c4
> SPARK_JAVA_OPT_2: -Dspark.app.name=spark-pi
> SPARK_JAVA_OPT_3:
> -Dspark.driver.host=spark-pi-46f48a0974d43341886076bc3c5f31c4-driver-svc.default.svc
> SPARK_JAVA_OPT_4: -Dspark.submit.deployMode=cluster
> SPARK_JAVA_OPT_5: -Dspark.driver.blockManager.port=7079
> SPARK_JAVA_OPT_6: -Dspark.master=k8s://https://ip:8443
> SPARK_JAVA_OPT_7:
> -Dspark.jars=/opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar,/opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar
> SPARK_JAVA_OPT_8:
> -Dspark.kubernetes.container.image=andrusha/spark-k8s:2.3.0-hadoop2.7
> SPARK_JAVA_OPT_9: -Dspark.executor.instances=1
> SPARK_JAVA_OPT_10: -Dspark.app.id=spark-16eb67d8953e418aba96c2d12deecd11
> SPARK_JAVA_OPT_11: -Dspark.executor.memory=1G
> SPARK_JAVA_OPT_12: -Dspark.driver.port=7078
>
>
> -Dspark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS $SPARK_DRIVER_CLASS
> $SPARK_DRIVER_ARGS)
> + exec /sbin/tini -s -- /usr/lib/jvm/java-1.8-openjdk/bin/java
> -Dspark.app.id=spark-16eb67d8953e418aba96c2d12deecd11
> -Dspark.executor.memory=1G -Dspark.driver.port=7078
> -Dspark.driver.blockManager.port=7079 -Dspark.submit.deployMode=cluster
> -Dspark.jars=/opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar,/opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar
> -Dspark.master=k8s://https://172.20.10.12:8443
> -Dspark.kubernetes.executor.podNamePrefix=spark-pi-46f48a0974d43341886076bc3c5f31c4
> -Dspark.kubernetes.driver.pod.name=spark-pi-46f48a0974d43341886076bc3c5f31c4-driver
> -Dspark.driver.host=spark-pi-46f48a0974d43341886076bc3c5f31c4-driver-svc.default.svc
> -Dspark.app.name=spark-pi -Dspark.executor.instances=1
> -Dspark.kubernetes.container.image=andrusha/spark-k8s:2.3.0-hadoop2.7 -cp
> ':/opt/spark/jars/*:/opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar;/opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar'
> -Xms1g -Xmx1g -Dspark.driver.bindAddress=172.17.0.2
> org.apache.spark.examples.SparkPi
> Error: Could not find or load main class org.apache.spark.examples.SparkPi
>
> Found this stackoverflow question
> https://stackoverflow.com/questions/49331570/spark-2-3-minikube-kubernetes-windows-demo-sparkpi-not-found
> but there is no answer.
> I also checked container file system, it contains
> /opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar
>
>
> 2018-04-11 1:17 GMT+08:00 Yinan Li :
>>
>> The example jar path should be
>> local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar.
>>
>> On Tue, Apr 10, 2018 at 1:34 AM, Dmitry  wrote:
>>>
>>> Hello spent a lot of time to find what I did wrong , but not found.
>>> I have a minikube WIndows based cluster ( Hyper V as hypervisor ) and try
>>> to run examples against Spark 2.3. Tried several  docker images builds:
>>> * several  builds that I build myself
>>> * andrusha/spark-k8s:2.3.0-hadoop2.7 from docker  hub
>>> But when I try to submit job driver log returns  class not found
>>> exception
>>> org.apache.spark.examples.SparkPi
>>>
>>> spark-submit --master k8s://https://ip:8443  --deploy-mode cluster
>>> --name spark-pi --class org.apache.spark.examples.SparkPi --conf
>>> spark.executor.instances=1 --executor-memory 1G --conf spark.kubernete
>>> s.container.image=andrusha/spark-k8s:2.3.0-hadoop2.7
>>> local:///opt/spark/examples/spark-examples_2.11-2.3.0.jar
>>>
>>> I tried to use https://github.com/apache-spark-on-k8s/spark fork and it
>>> is works without problems, more complex examples work also.
>>
>>
>



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: all spark settings end up being system properties

2018-03-30 Thread Marcelo Vanzin
Why: it's part historical, part "how else would you do it".

SparkConf needs to read properties passed on the command line, but
SparkConf is something that user code instantiates, so we can't easily
make it read data from arbitrary locations. You could use thread
locals and other tricks, but user code can always break those.

Where: this is done by the SparkSubmit class (look for the Scala
version, "sys.props").


On Fri, Mar 30, 2018 at 11:41 AM, Koert Kuipers  wrote:
> does anyone know why all spark settings end up being system properties, and
> where this is done?
>
> for example when i pass "--conf spark.foo=bar" into spark-submit then
> System.getProperty("spark.foo") will be equal to "bar"
>
> i grepped the spark codebase for System.setProperty or System.setProperties
> and i see it being used in some places but never for all spark settings.
>
> we are running into some weird side effects because of this since we use
> typesafe config which has system properties as overrides so we see them pop
> up there again unexpectedly.



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Local dirs

2018-03-26 Thread Marcelo Vanzin
On Mon, Mar 26, 2018 at 1:08 PM, Gauthier Feuillen
 wrote:
> Is there a way to change this value without changing yarn-site.xml ?

No. Local dirs are defined by the NodeManager, and Spark cannot override them.

-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark logs compression

2018-03-26 Thread Marcelo Vanzin
On Mon, Mar 26, 2018 at 11:01 AM, Fawze Abujaber  wrote:
> Weird, I just ran spark-shell and its log is compressed, but my spark jobs
> that are scheduled using Oozie are not getting compressed.

Ah, then it's probably a problem with how Oozie is generating the
config for the Spark job. Given your env it's potentially related to
Cloudera Manager so I'd try to ask questions in the Cloudera forums
first...

-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark logs compression

2018-03-26 Thread Marcelo Vanzin
You're either doing something wrong, or talking about different logs.
I just added that to my config and ran spark-shell.

$ hdfs dfs -ls /user/spark/applicationHistory | grep
application_1522085988298_0002
-rwxrwx---   3 blah blah   9844 2018-03-26 10:54
/user/spark/applicationHistory/application_1522085988298_0002.snappy



On Mon, Mar 26, 2018 at 10:48 AM, Fawze Abujaber <fawz...@gmail.com> wrote:
> I distributed this config to all the nodes cross the cluster and with no
> success, new spark logs still uncompressed.
>
> On Mon, Mar 26, 2018 at 8:12 PM, Marcelo Vanzin <van...@cloudera.com> wrote:
>>
>> Spark should be using the gateway's configuration. Unless you're
>> launching the application from a different node, if the setting is
>> there, Spark should be using it.
>>
>> You can also look in the UI's environment page to see the
>> configuration that the app is using.
>>
>> On Mon, Mar 26, 2018 at 10:10 AM, Fawze Abujaber <fawz...@gmail.com>
>> wrote:
>> > I see this configuration only on the spark gateway server, and my spark
>> > is
>> > running on Yarn, so I think I missing something ...
>> >
>> > I’m using cloudera manager to set this parameter, maybe I need to add
>> > this
>> > parameter in other configuration
>> >
>> > On Mon, 26 Mar 2018 at 20:05 Marcelo Vanzin <van...@cloudera.com> wrote:
>> >>
>> >> If the spark-defaults.conf file in the machine where you're starting
>> >> the Spark app has that config, then that's all that should be needed.
>> >>
>> >> On Mon, Mar 26, 2018 at 10:02 AM, Fawze Abujaber <fawz...@gmail.com>
>> >> wrote:
>> >> > Thanks Marcelo,
>> >> >
>> >> > Yes I was was expecting to see the new apps compressed but I don’t ,
>> >> > do
>> >> > I
>> >> > need to perform restart to spark or Yarn?
>> >> >
>> >> > On Mon, 26 Mar 2018 at 19:53 Marcelo Vanzin <van...@cloudera.com>
>> >> > wrote:
>> >> >>
>> >> >> Log compression is a client setting. Doing that will make new apps
>> >> >> write event logs in compressed format.
>> >> >>
>> >> >> The SHS doesn't compress existing logs.
>> >> >>
>> >> >> On Mon, Mar 26, 2018 at 9:17 AM, Fawze Abujaber <fawz...@gmail.com>
>> >> >> wrote:
>> >> >> > Hi All,
>> >> >> >
>> >> >> > I'm trying to compress the logs at SPark history server, i added
>> >> >> > spark.eventLog.compress=true to spark-defaults.conf to spark Spark
>> >> >> > Client
>> >> >> > Advanced Configuration Snippet (Safety Valve) for
>> >> >> > spark-conf/spark-defaults.conf
>> >> >> >
>> >> >> > which i see applied only to the spark gateway servers spark conf.
>> >> >> >
>> >> >> > What i missing to get this working ?
>> >> >>
>> >> >>
>> >> >>
>> >> >> --
>> >> >> Marcelo
>> >>
>> >>
>> >>
>> >> --
>> >> Marcelo
>>
>>
>>
>> --
>> Marcelo
>
>



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark logs compression

2018-03-26 Thread Marcelo Vanzin
Spark should be using the gateway's configuration. Unless you're
launching the application from a different node, if the setting is
there, Spark should be using it.

You can also look in the UI's environment page to see the
configuration that the app is using.

On Mon, Mar 26, 2018 at 10:10 AM, Fawze Abujaber <fawz...@gmail.com> wrote:
> I see this configuration only on the spark gateway server, and my spark is
> running on Yarn, so I think I missing something ...
>
> I’m using cloudera manager to set this parameter, maybe I need to add this
> parameter in other configuration
>
> On Mon, 26 Mar 2018 at 20:05 Marcelo Vanzin <van...@cloudera.com> wrote:
>>
>> If the spark-defaults.conf file in the machine where you're starting
>> the Spark app has that config, then that's all that should be needed.
>>
>> On Mon, Mar 26, 2018 at 10:02 AM, Fawze Abujaber <fawz...@gmail.com>
>> wrote:
>> > Thanks Marcelo,
>> >
>> > Yes I was was expecting to see the new apps compressed but I don’t , do
>> > I
>> > need to perform restart to spark or Yarn?
>> >
>> > On Mon, 26 Mar 2018 at 19:53 Marcelo Vanzin <van...@cloudera.com> wrote:
>> >>
>> >> Log compression is a client setting. Doing that will make new apps
>> >> write event logs in compressed format.
>> >>
>> >> The SHS doesn't compress existing logs.
>> >>
>> >> On Mon, Mar 26, 2018 at 9:17 AM, Fawze Abujaber <fawz...@gmail.com>
>> >> wrote:
>> >> > Hi All,
>> >> >
>> >> > I'm trying to compress the logs at SPark history server, i added
>> >> > spark.eventLog.compress=true to spark-defaults.conf to spark Spark
>> >> > Client
>> >> > Advanced Configuration Snippet (Safety Valve) for
>> >> > spark-conf/spark-defaults.conf
>> >> >
>> >> > which i see applied only to the spark gateway servers spark conf.
>> >> >
>> >> > What i missing to get this working ?
>> >>
>> >>
>> >>
>> >> --
>> >> Marcelo
>>
>>
>>
>> --
>> Marcelo



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark logs compression

2018-03-26 Thread Marcelo Vanzin
If the spark-defaults.conf file in the machine where you're starting
the Spark app has that config, then that's all that should be needed.

On Mon, Mar 26, 2018 at 10:02 AM, Fawze Abujaber <fawz...@gmail.com> wrote:
> Thanks Marcelo,
>
> Yes I was was expecting to see the new apps compressed but I don’t , do I
> need to perform restart to spark or Yarn?
>
> On Mon, 26 Mar 2018 at 19:53 Marcelo Vanzin <van...@cloudera.com> wrote:
>>
>> Log compression is a client setting. Doing that will make new apps
>> write event logs in compressed format.
>>
>> The SHS doesn't compress existing logs.
>>
>> On Mon, Mar 26, 2018 at 9:17 AM, Fawze Abujaber <fawz...@gmail.com> wrote:
>> > Hi All,
>> >
>> > I'm trying to compress the logs at SPark history server, i added
>> > spark.eventLog.compress=true to spark-defaults.conf to spark Spark
>> > Client
>> > Advanced Configuration Snippet (Safety Valve) for
>> > spark-conf/spark-defaults.conf
>> >
>> > which i see applied only to the spark gateway servers spark conf.
>> >
>> > What i missing to get this working ?
>>
>>
>>
>> --
>> Marcelo



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark logs compression

2018-03-26 Thread Marcelo Vanzin
Log compression is a client setting. Doing that will make new apps
write event logs in compressed format.

The SHS doesn't compress existing logs.

On Mon, Mar 26, 2018 at 9:17 AM, Fawze Abujaber  wrote:
> Hi All,
>
> I'm trying to compress the logs at SPark history server, i added
> spark.eventLog.compress=true to spark-defaults.conf to spark Spark Client
> Advanced Configuration Snippet (Safety Valve) for
> spark-conf/spark-defaults.conf
>
> which i see applied only to the spark gateway servers spark conf.
>
> What i missing to get this working ?



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: HadoopDelegationTokenProvider

2018-03-21 Thread Marcelo Vanzin
They should be available in the current user.

UserGroupInformation.getCurrentUser().getCredentials()
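A minimal sketch of reading it back inside a task (the token alias is an
assumption; it is whatever alias your provider used when adding the token):

import org.apache.hadoop.io.Text
import org.apache.hadoop.security.UserGroupInformation

// The credentials Spark ships to the executors are attached to the current UGI,
// so a token added by a delegation token provider can be looked up by alias.
val creds = UserGroupInformation.getCurrentUser().getCredentials()
val token = creds.getToken(new Text("accumulo.delegation.token"))  // alias assumed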

On Wed, Mar 21, 2018 at 7:32 AM, Jorge Machado  wrote:
> Hey spark group,
>
> I want to create a Delegation Token Provider for Accumulo I have One
> Question:
>
> How can I get the token that I added to the credentials from the Executor
> side ?  the SecurityManager class is private…
>
> Thanks
>
>
> Jorge Machado
>
>
>
>
>



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Accessing a file that was passed via --files to spark submit

2018-03-19 Thread Marcelo Vanzin
From spark-submit -h:

  --files FILES   Comma-separated list of files to be
placed in the working
 directory of each executor. File paths of
these files
 in executors can be accessed via
SparkFiles.get(fileName).
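For example, a sketch in Scala (pyspark has the same call as
pyspark.SparkFiles.get); the input path and file name are placeholders:

import org.apache.spark.SparkFiles
import scala.io.Source

// Submitted with:  spark-submit --files /local/path/lookup.txt ...
val filtered = sc.textFile("hdfs:///some/input")            // placeholder input
  .mapPartitions { lines =>
    val path = SparkFiles.get("lookup.txt")                 // file name only
    val lookup = Source.fromFile(path).getLines().toSet     // read once per partition
    lines.filter(lookup.contains)
  }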

On Sun, Mar 18, 2018 at 1:06 AM, Vitaliy Pisarev
 wrote:
> I am submitting a script to spark-submit and passing it a file using --files
> property. Later on I need to read it in a worker.
>
> I don't understand what API I should use to do that. I figured I'd try just:
>
> with open('myfile'):
>
> but this did not work.
>
> I am able to pass the file using the addFile mechanism but it may not be
> good enough for me.
>
> This may seem like a very simple question but I did not find any
> comprehensive documentation on spark-submit. The docs sure don't cover it.
>
>



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: How to run spark shell using YARN

2018-03-12 Thread Marcelo Vanzin
Looks like you either have a misconfigured HDFS service, or you're
using the wrong configuration on the client.

BTW, as I said in the previous response, the message you saw initially
is *not* an error. If you're just trying things out, you don't need to
do anything and Spark should still work.

On Mon, Mar 12, 2018 at 6:13 PM, kant kodali <kanth...@gmail.com> wrote:
> Hi,
>
> I read that doc several times now. I am stuck with the below error message
> when I run ./spark-shell --master yarn --deploy-mode client.
>
> I have my HADOOP_CONF_DIR set to /usr/local/hadoop-2.7.3/etc/hadoop and
> SPARK_HOME set to /usr/local/spark on all 3 machines (1 node for Resource
> Manager and NameNode, 2 Nodes for Node Manager and DataNodes).
>
> Any idea?
>
>
>
> 18/03/13 00:19:13 INFO LineBufferedStream: stdout:
> org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
> /user/centos/.sparkStaging/application_1520898664848_0003/__spark_libs__2434167523839846774.zip
> could only be replicated to 0 nodes instead of minReplication (=1).  There
> are 2 datanode(s) running and no node(s) are excluded in this operation.
>
>
> 18/03/13 00:19:13 INFO LineBufferedStream: stdout:  at
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1571)
> 18/03/13 00:19:13 INFO LineBufferedStream: stdout:  at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getNewBlockTargets(FSNamesystem.java:3107)
> 18/03/13 00:19:13 INFO LineBufferedStream: stdout:  at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3031)
> 18/03/13 00:19:13 INFO LineBufferedStream: stdout:  at
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:725)
> 18/03/13 00:19:13 INFO LineBufferedStream: stdout:  at
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:492)
> 18/03/13 00:19:13 INFO LineBufferedStream: stdout:  at
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> 18/03/13 00:19:13 INFO LineBufferedStream: stdout:  at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> 18/03/13 00:19:13 INFO LineBufferedStream: stdout:  at
> org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
> 18/03/13 00:19:13 INFO LineBufferedStream: stdout:  at
> org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
> 18/03/13 00:19:13 INFO LineBufferedStream: stdout:  at
> org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
> 18/03/13 00:19:13 INFO LineBufferedStream: stdout:  at
> java.security.AccessController.doPrivileged(Native Method)
> 18/03/13 00:19:13 INFO LineBufferedStream: stdout:  at
> javax.security.auth.Subject.doAs(Subject.java:422)
> 18/03/13 00:19:13 INFO LineBufferedStream: stdout:  at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
> 18/03/13 00:19:13 INFO LineBufferedStream: stdout:  at
> org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
> 18/03/13
>
>
> Thanks!
>
>
> On Mon, Mar 12, 2018 at 4:46 PM, Marcelo Vanzin <van...@cloudera.com> wrote:
>>
>> That's not an error, just a warning. The docs [1] have more info about
>> the config options mentioned in that message.
>>
>> [1] http://spark.apache.org/docs/latest/running-on-yarn.html
>>
>> On Mon, Mar 12, 2018 at 4:42 PM, kant kodali <kanth...@gmail.com> wrote:
>> > Hi All,
>> >
>> > I am trying to use YARN for the very first time. I believe I configured
>> > all
>> > the resource manager and name node fine. And then I run the below
>> > command
>> >
>> > ./spark-shell --master yarn --deploy-mode client
>> >
>> > I get the below output and it hangs there forever (I had been waiting
>> > over
>> > 10 minutes)
>> >
>> > 18/03/12 23:36:32 WARN Client: Neither spark.yarn.jars nor
>> > spark.yarn.archive is set, falling back to uploading libraries under
>> > SPARK_HOME.
>> >
>> > Any idea?
>> >
>> > Thanks!
>>
>>
>>
>> --
>> Marcelo
>
>



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: How to run spark shell using YARN

2018-03-12 Thread Marcelo Vanzin
That's not an error, just a warning. The docs [1] have more info about
the config options mentioned in that message.

[1] http://spark.apache.org/docs/latest/running-on-yarn.html
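For example, staging the Spark jars on HDFS once and pointing the config at
them makes the warning go away (paths are placeholders):

spark.yarn.jars      hdfs:///apps/spark/jars/*.jar
# or, alternatively, a single archive of those jars:
spark.yarn.archive   hdfs:///apps/spark/spark-libs.zip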

On Mon, Mar 12, 2018 at 4:42 PM, kant kodali  wrote:
> Hi All,
>
> I am trying to use YARN for the very first time. I believe I configured all
> the resource manager and name node fine. And then I run the below command
>
> ./spark-shell --master yarn --deploy-mode client
>
> I get the below output and it hangs there forever (I had been waiting over
> 10 minutes)
>
> 18/03/12 23:36:32 WARN Client: Neither spark.yarn.jars nor
> spark.yarn.archive is set, falling back to uploading libraries under
> SPARK_HOME.
>
> Any idea?
>
> Thanks!



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: [spark-sql] Custom Query Execution listener via conf properties

2018-02-16 Thread Marcelo Vanzin
According to https://issues.apache.org/jira/browse/SPARK-19558 this
feature was added in 2.3.

On Fri, Feb 16, 2018 at 12:43 AM, kurian vs  wrote:
> Hi,
>
> I was trying to create a custom Query execution listener by extending the
> org.apache.spark.sql.util.QueryExecutionListener class. My custom listener
> just contains some logging statements. But i do not see those logging
> statements when i run a spark job.
>
> Here are the steps that i did:
>
> Create a custom listener by extending the QueryExecutionLIstener class
> Created a jar file for the above project
> Edited spark-defaults.conf to add the following properties:
> spark.sql.queryExecutionListeners
> com.customListener.spark.customSparkListener spark.driver.extraClassPath
> /pathtoJarFile/CustomListener-1.0-SNAPSHOT.jar
> Restarted everything using SPARK-HOME/sbin/start-all.sh
> Ran a sample job using spark-submit
>
> I do not see any of the logging statements from the custom listener being
> printed[i don't see them in the console]
>
> Is there anything else that i need to do to make it work other than the
> above steps?
> I'm adding this in the config properties because i need some info from all
> the spark jobs being executed on that cluster. My assumption is that this
> will prevent the need to do it from the code by adding an extra
> ExecutionListenerManager.register(customListener)  line. Is this assumption
> correct?
> From which version of spark is this supported? (i'm using spark V 2.2.1)
>
>
> Can someone point me in the right direction?
>



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Is spark-env.sh sourced by Application Master and Executor for Spark on YARN?

2018-01-04 Thread Marcelo Vanzin
On Wed, Jan 3, 2018 at 8:18 PM, John Zhuge  wrote:
> Something like:
>
> Note: When running Spark on YARN, environment variables for the executors
> need to be set using the spark.yarn.executorEnv.[EnvironmentVariableName]
> property in your conf/spark-defaults.conf file or on the command line.
> Environment variables that are set in spark-env.sh will not be reflected in
> the executor process.

I'm not against adding docs, but that's probably true for all
backends. No backend I know sources spark-env.sh before starting
executors.

For example, the standalone worker sources spark-env.sh before
starting the daemon, and those env variables "leak" to the executors.
But you can't customize an individual executor's environment that way
without restarting the service.
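As a concrete illustration of the YARN-specific way (variable name and value
are placeholders):

spark.yarn.appMasterEnv.MY_VAR    some-value
spark.yarn.executorEnv.MY_VAR     some-value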

-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Is spark-env.sh sourced by Application Master and Executor for Spark on YARN?

2018-01-03 Thread Marcelo Vanzin
Because spark-env.sh is something that makes sense only on the gateway
machine (where the app is being submitted from).

On Wed, Jan 3, 2018 at 6:46 PM, John Zhuge <john.zh...@gmail.com> wrote:
> Thanks Jacek and Marcelo!
>
> Any reason it is not sourced? Any security consideration?
>
>
> On Wed, Jan 3, 2018 at 9:59 AM, Marcelo Vanzin <van...@cloudera.com> wrote:
>>
>> On Tue, Jan 2, 2018 at 10:57 PM, John Zhuge <jzh...@apache.org> wrote:
>> > I am running Spark 2.0.0 and 2.1.1 on YARN in a Hadoop 2.7.3 cluster. Is
>> > spark-env.sh sourced when starting the Spark AM container or the
>> > executor
>> > container?
>>
>> No, it's not.
>>
>> --
>> Marcelo
>
>
>
>
> --
> John



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Is spark-env.sh sourced by Application Master and Executor for Spark on YARN?

2018-01-03 Thread Marcelo Vanzin
On Tue, Jan 2, 2018 at 10:57 PM, John Zhuge  wrote:
> I am running Spark 2.0.0 and 2.1.1 on YARN in a Hadoop 2.7.3 cluster. Is
> spark-env.sh sourced when starting the Spark AM container or the executor
> container?

No, it's not.

-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: flatMap() returning large class

2017-12-14 Thread Marcelo Vanzin
This sounds like something mapPartitions should be able to do, not
sure if there's an easier way.
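A hedged sketch of that idea, assuming ds is a Dataset[InputRecord] with
spark.implicits._ in scope, and with the case classes standing in for the real
types:

case class InputRecord(id: Long)
case class BigDataStructure(id: Long, chunk: Int, payload: Array[Byte])

// mapPartitions takes an Iterator and returns an Iterator, so each large object
// is produced lazily as it is consumed downstream instead of being collected
// into a Seq per input record first.
val out = ds.mapPartitions { records =>
  records.flatMap { rec =>
    Iterator.tabulate(10)(i => BigDataStructure(rec.id, i, Array.ofDim[Byte](1024)))
  }
}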

On Thu, Dec 14, 2017 at 10:20 AM, Don Drake  wrote:
> I'm looking for some advice when I have a flatMap on a Dataset that is
> creating and returning a sequence of a new case class
> (Seq[BigDataStructure]) that contains a very large amount of data, much
> larger than the single input record (think images).
>
> In python, you can use generators (yield) to bypass creating a large list of
> structures and returning the list.
>
> I'm programming this in Scala and was wondering if there are any similar
> tricks to optimally return a list of classes?? I found the for/yield
> semantics, but it appears the compiler is just creating a sequence for you
> and this will blow through my Heap given the number of elements in the list
> and the size of each element.
>
> Is there anything else I can use?
>
> Thanks.
>
> -Don
>
> --
> Donald Drake
> Drake Consulting
> http://www.drakeconsulting.com/
> https://twitter.com/dondrake
> 800-733-2143



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Why do I see five attempts on my Spark application

2017-12-13 Thread Marcelo Vanzin
On Wed, Dec 13, 2017 at 11:21 AM, Toy  wrote:
> I'm wondering why am I seeing 5 attempts for my Spark application? Does Spark 
> application restart itself?

It restarts itself if it fails (up to a limit that can be configured
either per Spark application or globally in YARN).
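The per-application knob looks like this (the YARN-side ceiling is
yarn.resourcemanager.am.max-attempts):

spark-submit --conf spark.yarn.maxAppAttempts=1 ...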


-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Loading a spark dataframe column into T-Digest using java

2017-12-11 Thread Marcelo Vanzin
The closure in your "foreach" loop runs in a remote executor, not the
local JVM, so it's updating its own copy of the t-digest instance. The
one on the driver side is never touched.
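If the end goal is just quantiles of a column, Spark can compute them in a
distributed way directly; and if a t-digest is really needed, build one per
partition and merge the partial digests (e.g. with rdd.aggregate) instead of
mutating a driver-side instance in foreach. A sketch of the first route, with
a placeholder column name:

// relativeError 0.0 gives exact quantiles (more expensive); 0.001 is approximate.
val Array(q25, q50, q75) =
  df.stat.approxQuantile("value", Array(0.25, 0.5, 0.75), 0.001)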

On Sun, Dec 10, 2017 at 10:27 PM, Himasha de Silva  wrote:
> Hi,
>
> I want to load a spark dataframe column into T-Digest using java to
> calculate quantile values. I write this code to do this, but it's giving
> zero for size of tdigest. values are not added to tDigest.
>
> my code - https://gist.github.com/anonymous/1f2e382fdda002580154b5c43fbe9b3a
>
> Thank you.
>
> Himasha De Silva
>
>



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Programmatically get status of job (WAITING/RUNNING)

2017-12-07 Thread Marcelo Vanzin
That's the Spark Master's view of the application. I don't know
exactly what it means in the different run modes, I'm more familiar
with YARN. But I wouldn't be surprised if, as with others, it mostly
tracks the driver's state.

On Thu, Dec 7, 2017 at 12:06 PM, bsikander  wrote:
> 
>
> See the image. I am referring to this state when I say "Application State".
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Programmatically get status of job (WAITING/RUNNING)

2017-12-07 Thread Marcelo Vanzin
On Thu, Dec 7, 2017 at 11:40 AM, bsikander  wrote:
> For example, if an application wanted 4 executors
> (spark.executor.instances=4) but the spark cluster can only provide 1
> executor. This means that I will only receive 1 onExecutorAdded event. Will
> the application state change to RUNNING (even if 1 executor was allocated)?

What application state are you talking about? That's the thing that
you seem to be confused about here.

As you've already learned, SparkLauncher only cares about the driver.
So RUNNING means the driver is running.

And there's no concept of running anywhere else I know of that is
exposed to Spark applications. So I don't know which code you're
referring to when you say "the application state change to RUNNING".

-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Programmatically get status of job (WAITING/RUNNING)

2017-12-05 Thread Marcelo Vanzin
On Tue, Dec 5, 2017 at 12:43 PM, bsikander  wrote:
> 2) If I use context.addSparkListener, I can customize the listener but then
> I miss the onApplicationStart event. Also, I don't know the Spark's logic to
> changing the state of application from WAITING -> RUNNING.

I'm not sure I follow you here. This is something that you are
defining, not Spark.

"SparkLauncher" has its own view of that those mean, and it doesn't match yours.

"SparkListener" has no notion of whether an app is running or not.

It's up to you to define what waiting and running mean in your code,
and map the events Spark provides you to those concepts.

e.g., a job is running after your listener gets an "onJobStart" event.
But the application might have been running already before that job
started.
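For illustration, a bare-bones listener of that kind (what counts as "running"
is application-defined; register it with sc.addSparkListener or the
spark.extraListeners config):

import org.apache.spark.scheduler._

class MyStateListener extends SparkListener {
  @volatile var running = false

  override def onApplicationStart(ev: SparkListenerApplicationStart): Unit =
    println(s"app started: ${ev.appName}")

  override def onExecutorAdded(ev: SparkListenerExecutorAdded): Unit =
    println(s"executor added: ${ev.executorId}")

  override def onJobStart(ev: SparkListenerJobStart): Unit = {
    running = true   // here "running" means: at least one job has started
    println(s"job ${ev.jobId} started")
  }
}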

-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Programmatically get status of job (WAITING/RUNNING)

2017-12-05 Thread Marcelo Vanzin
SparkLauncher operates at a different layer than Spark applications.
It doesn't know about executors or driver or anything, just whether
the Spark application was started or not. So it doesn't work for your
case.

The best option for your case is to install a SparkListener and
monitor events. But that will not tell you when things do not happen,
just when they do happen, so maybe even that is not enough for you.


On Mon, Dec 4, 2017 at 1:06 AM, bsikander  wrote:
> So, I tried to use SparkAppHandle.Listener with SparkLauncher as you
> suggested. The behavior of Launcher is not what I expected.
>
> 1- If I start the job (using SparkLauncher) and my Spark cluster has enough
> cores available, I receive events in my class extending
> SparkAppHandle.Listener and I see the status getting changed from
> UNKOWN->CONNECTED -> SUBMITTED -> RUNNING. All good here.
>
> 2- If my Spark cluster has cores only for my Driver process (running in
> cluster mode) but no cores for my executor, then I still receive the RUNNING
> event. I was expecting something else since my executor has no cores and
> Master UI shows WAITING state for executors, listener should respond with
> SUBMITTED state instead of RUNNING.
>
> 3- If my Spark cluster has no cores for even the driver process then
> SparkLauncher invokes no events at all. The state stays in UNKNOWN. I would
> have expected it to be in SUBMITTED state atleast.
>
> *Is there any way with which I can reliably get the WAITING state of job?*
> Driver=RUNNING, executor=RUNNING, overall state should be RUNNING
> Driver=RUNNING, executor=WAITING overall state should be SUBMITTED/WAITING
> Driver=WAITING, executor=WAITING overall state should be
> CONNECTED/SUBMITTED/WAITING
>
>
>
>
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Does the builtin hive jars talk of spark to HiveMetaStore(2.1) without any issues?

2017-11-09 Thread Marcelo Vanzin
I'd recommend against using the built-in jars for a different version
of Hive. You don't need to build your own Spark; just set
spark.sql.hive.metastore.jars / spark.sql.hive.metastore.version (see
documentation).
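For example, something along these lines in spark-defaults.conf (version and
jar directory are placeholders for whatever Hive 2.1 client you actually have):

spark.sql.hive.metastore.version   2.1.1
spark.sql.hive.metastore.jars      /opt/hive-2.1.1/lib/*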

On Thu, Nov 9, 2017 at 2:10 AM, yaooqinn  wrote:
> Hi, all
> The builtin hive version for spark 2.x is hive-1.2.1.spark2, I'd like know
> whether it works for hive meta store version 2.1 or not.
>
> If not, I'd like to build a spark package with -Dhive.version=2.x.spark2 but
> find no such a maven artifact there, is there any process to deploy one?
>
> Or I just need to specify *spark.sql.hive.metastore.jars *to the hive 2.1
> client jars.
>
> Best Regards!
> Kent Yao
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: HDFS or NFS as a cache?

2017-10-02 Thread Marcelo Vanzin
You don't need to collect data in the driver to save it. The code in
the original question doesn't use "collect()", so it's actually doing
a distributed write.


On Mon, Oct 2, 2017 at 11:26 AM, JG Perrin  wrote:
> Steve,
>
>
>
> If I refer to the collect() API, it says “Running collect requires moving
> all the data into the application's driver process, and doing so on a very
> large dataset can crash the driver process with OutOfMemoryError.” So why
> would you need a distributed FS?
>
>
>
> jg
>
>
>
> From: Steve Loughran [mailto:ste...@hortonworks.com]
> Sent: Saturday, September 30, 2017 6:10 AM
> To: JG Perrin 
> Cc: Alexander Czech ; user@spark.apache.org
> Subject: Re: HDFS or NFS as a cache?
>
>
>
>
>
> On 29 Sep 2017, at 20:03, JG Perrin  wrote:
>
>
>
> You will collect in the driver (often the master) and it will save the data,
> so for saving, you will not have to set up HDFS.
>
>
>
> no, it doesn't work quite like that.
>
>
>
> 1. workers generate their data and save somwhere
>
> 2. on "task commit" they move their data to some location where it will be
> visible for "job commit" (rename, upload, whatever)
>
> 3. job commit —which is done in the driver,— takes all the committed task
> data and makes it visible in the destination directory.
>
> 4. Then they create a _SUCCESS file to say "done!"
>
>
>
>
>
> This is done with Spark talking between workers and drivers to guarantee
> that only one task working on a specific part of the data commits their
> work, only
>
> committing the job once all tasks have finished
>
>
>
> The v1 mapreduce committer implements (2) by moving files under a job
> attempt dir, and (3) by moving it from the job attempt dir to the
> destination. one rename per task commit, another rename of every file on job
> commit. In HDFS, Azure wasb and other stores with an O(1) atomic rename,
> this isn't *too* expensive, though that final job commit rename still takes
> time to list and move lots of files
>
>
>
> The v2 committer implements (2) by renaming to the destination directory and
> (3) as a no-op. Rename in the tasks then, but not not that second,
> serialized one at the end
>
>
>
> There's no copy of data from workers to driver, instead you need a shared
> output filesystem so that the job committer can do its work alongside the
> tasks.
>
>
>
> There are alternatives committer agorithms,
>
>
>
> 1. look at Ryan Blue's talk: https://www.youtube.com/watch?v=BgHrff5yAQo
>
> 2. IBM Stocator paper (https://arxiv.org/abs/1709.01812) and code
> (https://github.com/SparkTC/stocator/)
>
> 3. Ongoing work in Hadoop itself for better committers. Goal: year end &
> Hadoop 3.1 https://issues.apache.org/jira/browse/HADOOP-13786 . The oode is
> all there, Parquet is a troublespot, and more testing is welcome from anyone
> who wants to help.
>
> 4. Databricks have "something"; specifics aren't covered, but I assume its
> dynamo DB based
>
>
>
>
>
> -Steve
>
>
>
>
>
>
>
>
>
>
>
>
>
> From: Alexander Czech [mailto:alexander.cz...@googlemail.com]
> Sent: Friday, September 29, 2017 8:15 AM
> To: user@spark.apache.org
> Subject: HDFS or NFS as a cache?
>
>
>
> I have a small EC2 cluster with 5 c3.2xlarge nodes and I want to write
> parquet files to S3. But the S3 performance for various reasons is bad when
> I access s3 through the parquet write method:
>
> df.write.parquet('s3a://bucket/parquet')
>
> Now I want to setup a small cache for the parquet output. One output is
> about 12-15 GB in size. Would it be enough to setup a NFS-directory on the
> master, write the output to it and then move it to S3? Or should I setup a
> HDFS on the Master? Or should I even opt for an additional cluster running a
> HDFS solution on more than one node?
>
> thanks!
>
>



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: --jars from spark-submit on master on YARN don't get added properly to the executors - ClassNotFoundException

2017-08-09 Thread Marcelo Vanzin
Jars distributed using --jars are not added to the system classpath,
so log4j cannot see them.

To work around that, you need to manually add the jar *name* to the
driver and executor classpaths:

spark.driver.extraClassPath=some.jar
spark.executor.extraClassPath=some.jar

In client mode you should use spark.yarn.dist.jars instead of --jars,
and change the driver classpath above to point to the local copy of
the jar.
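Putting that together for the cluster-mode command in this thread would look
roughly like this (application jar left as a placeholder):

spark-submit --deploy-mode cluster --master yarn \
  --jars /home/hadoop/lib/jsonevent-layout-1.7.jar,/home/hadoop/lib/json-smart-1.1.1.jar \
  --conf spark.driver.extraClassPath=jsonevent-layout-1.7.jar:json-smart-1.1.1.jar \
  --conf spark.executor.extraClassPath=jsonevent-layout-1.7.jar:json-smart-1.1.1.jar \
  --class com.mlbam.emr.XXX <application jar>

The bare names work because --jars localizes those files into each container's
working directory.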


On Wed, Aug 9, 2017 at 2:52 PM, Mikhailau, Alex  wrote:
> I have log4j json layout jars added via spark-submit on EMR
>
>
>
> /usr/lib/spark/bin/spark-submit --deploy-mode cluster --master yarn --jars
> /home/hadoop/lib/jsonevent-layout-1.7.jar,/home/hadoop/lib/json-smart-1.1.1.jar
> --driver-java-options "-XX:+AlwaysPreTouch -XX:MaxPermSize=6G" --class
> com.mlbam.emr.XXX  s3://xxx/aa/jars/ spark-job-assembly-1.4.1-SNAPSHOT.jar
> ActionOnFailure=CONTINUE
>
>
>
>
>
> this is the process running on the executor:
>
>
>
> /usr/lib/jvm/java-1.8.0/bin/java -server -Xmx8192m -XX:+AlwaysPreTouch
> -XX:MaxPermSize=6G
> -Djava.io.tmpdir=/mnt/yarn/usercache/hadoop/appcache/application_1502310393755_0003/container_1502310393755_0003_01_05/tmp
> -Dspark.driver.port=32869 -Dspark.history.ui.port=18080 -Dspark.ui.port=0
> -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1502310393755_0003/container_1502310393755_0003_01_05
> -XX:OnOutOfMemoryError=kill %p
> org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url
> spark://CoarseGrainedScheduler@10.202.138.158:32869 --executor-id 3
> --hostname ip-10-202-138-98.mlbam.qa.us-east-1.bamgrid.net --cores 8
> --app-id application_1502310393755_0003 --user-class-path
> file:/mnt/yarn/usercache/hadoop/appcache/application_1502310393755_0003/container_1502310393755_0003_01_05/__app__.jar
> --user-class-path
> file:/mnt/yarn/usercache/hadoop/appcache/application_1502310393755_0003/container_1502310393755_0003_01_05/jsonevent-layout-1.7.jar
> --user-class-path
> file:/mnt/yarn/usercache/hadoop/appcache/application_1502310393755_0003/container_1502310393755_0003_01_05/json-smart-1.1.1.jar
>
>
>
> I see that jsonevent-layout-1.7.jar is passed as –user-class-path to the job
> (see the above process), yet, I see the following log exception in my
> stderr:
>
>
>
> log4j:ERROR Could not instantiate class
> [net.logstash.log4j.JSONEventLayoutV1].
>
> java.lang.ClassNotFoundException: net.logstash.log4j.JSONEventLayoutV1
>
>
>
>
>
> Am I doing something wrong?
>
>
>
> Thank you,
>
>
>
> Alex



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark2.1 installation issue

2017-07-27 Thread Marcelo Vanzin
Hello,

This is a CDH-specific issue, please use the Cloudera forums / support
line instead of the Apache group.

On Thu, Jul 27, 2017 at 10:54 AM, Vikash Kumar
 wrote:
> I have installed spark2 parcel through cloudera CDH 12.0. I see some issue
> there. Look like it didn't got configured properly.
>
> $ spark2-shell
> Exception in thread "main" java.lang.NoClassDefFoundError:
> org/apache/hadoop/fs/FSDataInputStream
> at
> org.apache.spark.deploy.SparkSubmitArguments$$anonfun$mergeDefaultSparkProperties$1.apply(SparkSubmitArguments.scala:118)
> at
> org.apache.spark.deploy.SparkSubmitArguments$$anonfun$mergeDefaultSparkProperties$1.apply(SparkSubmitArguments.scala:118)
> at scala.Option.getOrElse(Option.scala:121)
> at
> org.apache.spark.deploy.SparkSubmitArguments.mergeDefaultSparkProperties(SparkSubmitArguments.scala:118)
> at
> org.apache.spark.deploy.SparkSubmitArguments.(SparkSubmitArguments.scala:104)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.lang.ClassNotFoundException:
> org.apache.hadoop.fs.FSDataInputStream
> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>
>  I have Hadoop version:
>
> $ hadoop version
> Hadoop 2.6.0-cdh5.12.0
> Subversion http://github.com/cloudera/hadoop -r
> dba647c5a8bc5e09b572d76a8d29481c78d1a0dd
> Compiled by jenkins on 2017-06-29T11:31Z
> Compiled with protoc 2.5.0
> From source with checksum 7c45ae7a4592ce5af86bc4598c5b4
> This command was run using
> /opt/cloudera/parcels/CDH-5.12.0-1.cdh5.12.0.p0.29/jars/hadoop-common-2.6.0-cdh5.12.0.jar
>
> also ,
>
> $ ls /etc/spark/conf shows :
>
> classpath.txt__cloudera_metadata__
> navigator.lineage.client.properties  spark-env.sh
> __cloudera_generation__  log4j.properties   spark-defaults.conf
> yarn-conf
>
>
> while, /etc/spark2/conf is empty .
>
>
> How should I fix this ? Do I need to do any manual configuration ?
>
>
>
> Regards,
> Vikash



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: running spark application compiled with 1.6 on spark 2.1 cluster

2017-07-27 Thread Marcelo Vanzin
On Wed, Jul 26, 2017 at 10:45 PM, satishl  wrote:
> is this a supported scenario - i.e., can I run app compiled with spark 1.6
> on a 2.+ spark cluster?

In general, no.

-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: how to set the assignee in JIRA please?

2017-07-24 Thread Marcelo Vanzin
On Mon, Jul 24, 2017 at 6:04 PM, Hyukjin Kwon  wrote:
> However, I see some JIRAs are assigned to someone time to time. Were those
> mistakes or would you mind if I ask when someone is assigned?

I'm not sure if there are any guidelines of when to assign; since
there has been an agreement that bugs should remain unassigned I don't
think I've personally done it, although I have seen others do it. In
general I'd say it's ok if there's a good justification for it (e.g.
"this is a large change and this person who is an active contributor
will work on it"), but in the general case should be avoided.

I agree it's a little confusing, especially comparing to other
projects, but it's how it's been done for a couple of years at least
(or at least what I have understood).


-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: how to set the assignee in JIRA please?

2017-07-24 Thread Marcelo Vanzin
We don't generally set assignees. Submit a PR on github and the PR
will be linked on JIRA; if your PR is merged, the bug is assigned to you.

On Mon, Jul 24, 2017 at 5:57 PM, 萝卜丝炒饭 <1427357...@qq.com> wrote:
> Hi all,
> If I want to do some work about an issue registed in JIRA, how to set the
> assignee to me please?
>
> thanks
>
>



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark on Cloudera Configuration (Scheduler Mode = FAIR)

2017-07-21 Thread Marcelo Vanzin
On Fri, Jul 21, 2017 at 5:00 AM, Gokula Krishnan D  wrote:
> Is there any way we can set up the scheduler mode at the Spark cluster level,
> besides the application (SC) level?

That's called the cluster (or resource) manager. e.g., configure
separate queues in YARN with a maximum number of resources for each.
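
For illustration only (the queue names below are made up), an app is then pinned
to a queue at submit time:

  spark-submit --master yarn --queue etl ...
  # equivalently: --conf spark.yarn.queue=etl

and the per-queue limits live on the YARN side, e.g. a capacity-scheduler.xml
sketch roughly like:

  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>etl,adhoc</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.etl.capacity</name>
    <value>70</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.adhoc.capacity</name>
    <value>30</value>
  </property>

(Fair scheduler pools via fair-scheduler.xml work similarly; check the YARN docs
for the authoritative property names.)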

-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Question regarding Sparks new Internal authentication mechanism

2017-07-20 Thread Marcelo Vanzin
Also, things seem to work with all your settings if you disable use of
the shuffle service (which also means no dynamic allocation), if that
helps you make progress in what you wanted to do.

On Thu, Jul 20, 2017 at 4:25 PM, Marcelo Vanzin <van...@cloudera.com> wrote:
> Hmm... I tried this with the new shuffle service (I generally have an
> old one running) and also see failures. I also noticed some odd things
> in your logs that I'm also seeing in mine, but it's better to track
> these in a bug instead of e-mail.
>
> Please file a bug and attach your logs there, I'll take a look at this.
>
> On Thu, Jul 20, 2017 at 2:06 PM, Udit Mehrotra
> <udit.mehrotr...@gmail.com> wrote:
>> Hi Marcelo,
>>
>> I ran with setting DEBUG level logging for 'org.apache.spark.network.crypto'
>> for both Spark and Yarn.
>>
>> However, the DEBUG logs still do not convey anything meaningful. Please find
>> it attached. Can you please take a quick look, and let me know if you see
>> anything suspicious ?
>>
>> If not, do you think I should open a JIRA for this ?
>>
>> Thanks !
>>
>> On Wed, Jul 19, 2017 at 3:14 PM, Marcelo Vanzin <van...@cloudera.com> wrote:
>>>
>>> Hmm... that's not enough info and logs are intentionally kept silent
>>> to avoid flooding, but if you enable DEBUG level logging for
>>> org.apache.spark.network.crypto in both YARN and the Spark app, that
>>> might provide more info.
>>>
>>> On Wed, Jul 19, 2017 at 2:58 PM, Udit Mehrotra
>>> <udit.mehrotr...@gmail.com> wrote:
>>> > So I added these settings in yarn-site.xml as well. Now I get a
>>> > completely
>> > different error, but at least it seems like it is using the crypto
>>> > library:
>>> >
>>> > ExecutorLostFailure (executor 1 exited caused by one of the running
>>> > tasks)
>>> > Reason: Unable to create executor due to Unable to register with
>>> > external
>>> > shuffle server due to : java.lang.IllegalArgumentException:
>>> > Authentication
>>> > failed.
>>> > at
>>> >
>>> > org.apache.spark.network.crypto.AuthRpcHandler.receive(AuthRpcHandler.java:125)
>>> > at
>>> >
>>> > org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:157)
>>> > at
>>> >
>>> > org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:105)
>>> > at
>>> >
>>> > org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:118)
>>> >
>>> > Any clue about this ?
>>> >
>>> >
>>> > On Wed, Jul 19, 2017 at 1:13 PM, Marcelo Vanzin <van...@cloudera.com>
>>> > wrote:
>>> >>
>>> >> On Wed, Jul 19, 2017 at 1:10 PM, Udit Mehrotra
>>> >> <udit.mehrotr...@gmail.com> wrote:
>>> >> > Is there any additional configuration I need for external shuffle
>>> >> > besides
>>> >> > setting the following:
>>> >> > spark.network.crypto.enabled true
>>> >> > spark.network.crypto.saslFallback false
>>> >> > spark.authenticate   true
>>> >>
>>> >> Have you set these options on the shuffle service configuration too
>>> >> (which is the YARN xml config file, not spark-defaults.conf)?
>>> >>
>>> >> If you have there might be an issue, and you should probably file a
>>> >> bug and include your NM's log file.
>>> >>
>>> >> --
>>> >> Marcelo
>>> >
>>> >
>>>
>>>
>>>
>>> --
>>> Marcelo
>>
>>
>
>
>
> --
> Marcelo



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Question regarding Sparks new Internal authentication mechanism

2017-07-20 Thread Marcelo Vanzin
Hmm... I tried this with the new shuffle service (I generally have an
old one running) and also see failures. I also noticed some odd things
in your logs that I'm also seeing in mine, but it's better to track
these in a bug instead of e-mail.

Please file a bug and attach your logs there, I'll take a look at this.

On Thu, Jul 20, 2017 at 2:06 PM, Udit Mehrotra
<udit.mehrotr...@gmail.com> wrote:
> Hi Marcelo,
>
> I ran with setting DEBUG level logging for 'org.apache.spark.network.crypto'
> for both Spark and Yarn.
>
> However, the DEBUG logs still do not convey anything meaningful. Please find
> it attached. Can you please take a quick look, and let me know if you see
> anything suspicious ?
>
> If not, do you think I should open a JIRA for this ?
>
> Thanks !
>
> On Wed, Jul 19, 2017 at 3:14 PM, Marcelo Vanzin <van...@cloudera.com> wrote:
>>
>> Hmm... that's not enough info and logs are intentionally kept silent
>> to avoid flooding, but if you enable DEBUG level logging for
>> org.apache.spark.network.crypto in both YARN and the Spark app, that
>> might provide more info.
>>
>> On Wed, Jul 19, 2017 at 2:58 PM, Udit Mehrotra
>> <udit.mehrotr...@gmail.com> wrote:
>> > So I added these settings in yarn-site.xml as well. Now I get a
>> > completely
>> > different error, but at least it seems like it is using the crypto
>> > library:
>> >
>> > ExecutorLostFailure (executor 1 exited caused by one of the running
>> > tasks)
>> > Reason: Unable to create executor due to Unable to register with
>> > external
>> > shuffle server due to : java.lang.IllegalArgumentException:
>> > Authentication
>> > failed.
>> > at
>> >
>> > org.apache.spark.network.crypto.AuthRpcHandler.receive(AuthRpcHandler.java:125)
>> > at
>> >
>> > org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:157)
>> > at
>> >
>> > org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:105)
>> > at
>> >
>> > org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:118)
>> >
>> > Any clue about this ?
>> >
>> >
>> > On Wed, Jul 19, 2017 at 1:13 PM, Marcelo Vanzin <van...@cloudera.com>
>> > wrote:
>> >>
>> >> On Wed, Jul 19, 2017 at 1:10 PM, Udit Mehrotra
>> >> <udit.mehrotr...@gmail.com> wrote:
>> >> > Is there any additional configuration I need for external shuffle
>> >> > besides
>> >> > setting the following:
>> >> > spark.network.crypto.enabled true
>> >> > spark.network.crypto.saslFallback false
>> >> > spark.authenticate   true
>> >>
>> >> Have you set these options on the shuffle service configuration too
>> >> (which is the YARN xml config file, not spark-defaults.conf)?
>> >>
>> >> If you have there might be an issue, and you should probably file a
>> >> bug and include your NM's log file.
>> >>
>> >> --
>> >> Marcelo
>> >
>> >
>>
>>
>>
>> --
>> Marcelo
>
>



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Question regarding Sparks new Internal authentication mechanism

2017-07-19 Thread Marcelo Vanzin
Hmm... that's not enough info and logs are intentionally kept silent
to avoid flooding, but if you enable DEBUG level logging for
org.apache.spark.network.crypto in both YARN and the Spark app, that
might provide more info.

On Wed, Jul 19, 2017 at 2:58 PM, Udit Mehrotra
<udit.mehrotr...@gmail.com> wrote:
> So I added these settings in yarn-site.xml as well. Now I get a completely
> different error, but at least it seems like it is using the crypto library:
>
> ExecutorLostFailure (executor 1 exited caused by one of the running tasks)
> Reason: Unable to create executor due to Unable to register with external
> shuffle server due to : java.lang.IllegalArgumentException: Authentication
> failed.
> at
> org.apache.spark.network.crypto.AuthRpcHandler.receive(AuthRpcHandler.java:125)
> at
> org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:157)
> at
> org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:105)
> at
> org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:118)
>
> Any clue about this ?
>
>
> On Wed, Jul 19, 2017 at 1:13 PM, Marcelo Vanzin <van...@cloudera.com> wrote:
>>
>> On Wed, Jul 19, 2017 at 1:10 PM, Udit Mehrotra
>> <udit.mehrotr...@gmail.com> wrote:
>> > Is there any additional configuration I need for external shuffle
>> > besides
>> > setting the following:
>> > spark.network.crypto.enabled true
>> > spark.network.crypto.saslFallback false
>> > spark.authenticate   true
>>
>> Have you set these options on the shuffle service configuration too
>> (which is the YARN xml config file, not spark-defaults.conf)?
>>
>> If you have there might be an issue, and you should probably file a
>> bug and include your NM's log file.
>>
>> --
>> Marcelo
>
>



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Question regarding Sparks new Internal authentication mechanism

2017-07-19 Thread Marcelo Vanzin
On Wed, Jul 19, 2017 at 1:10 PM, Udit Mehrotra
 wrote:
> Is there any additional configuration I need for external shuffle besides
> setting the following:
> spark.network.crypto.enabled true
> spark.network.crypto.saslFallback false
> spark.authenticate   true

Have you set these options on the shuffle service configuration too
(which is the YARN xml config file, not spark-defaults.conf)?

If you have there might be an issue, and you should probably file a
bug and include your NM's log file.
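
For reference, a rough sketch of the NodeManager side of that configuration in
yarn-site.xml, assuming the standard YarnShuffleService aux-service setup (the
service picks up spark.* entries from the Hadoop configuration):

  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>spark_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
    <value>org.apache.spark.network.yarn.YarnShuffleService</value>
  </property>
  <property>
    <name>spark.authenticate</name>
    <value>true</value>
  </property>
  <property>
    <name>spark.network.crypto.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>spark.network.crypto.saslFallback</name>
    <value>false</value>
  </property>

The NodeManagers need a restart after changing this.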

-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Question regarding Sparks new Internal authentication mechanism

2017-07-19 Thread Marcelo Vanzin
Well, how did you install the Spark shuffle service on YARN? It's not
part of YARN.

If you really have the Spark 2.2 shuffle service jar deployed in your
YARN service, then perhaps you didn't configure it correctly to use
the new auth mechanism.

On Wed, Jul 19, 2017 at 12:47 PM, Udit Mehrotra
<udit.mehrotr...@gmail.com> wrote:
> Sorry about that. Will keep the list in my replies.
>
> So, just to clarify, I am not using an older version of Spark's shuffle
> service. This is a brand new cluster with just Spark 2.2.0 installed
> alongside hadoop 2.7.3. Could there be anything else I am missing, or I can
> try differently ?
>
>
> Thanks !
>
>
> On Wed, Jul 19, 2017 at 12:03 PM, Marcelo Vanzin <van...@cloudera.com>
> wrote:
>>
>> Please include the list on your replies, so others can benefit from
>> the discussion too.
>>
>> On Wed, Jul 19, 2017 at 11:43 AM, Udit Mehrotra
>> <udit.mehrotr...@gmail.com> wrote:
>> > Hi Marcelo,
>> >
>> > Thanks a lot for confirming that. Can you explain what you mean by
>> > upgrading
>> > the version of the shuffle service? Won't it automatically use the
>> > corresponding
>> > class from spark 2.2.0 to start the external shuffle service ?
>>
>> That depends on how you deploy your shuffle service. Normally YARN
>> will have no idea that your application is using a new Spark - it will
>> still have the old version of the service jar in its classpath.
>>
>>
>> --
>> Marcelo
>
>



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Question regarding Sparks new Internal authentication mechanism

2017-07-19 Thread Marcelo Vanzin
Please include the list on your replies, so others can benefit from
the discussion too.

On Wed, Jul 19, 2017 at 11:43 AM, Udit Mehrotra
 wrote:
> Hi Marcelo,
>
> Thanks a lot for confirming that. Can you explain what you mean by upgrading
> the version of the shuffle service? Won't it automatically use the corresponding
> class from spark 2.2.0 to start the external shuffle service ?

That depends on how you deploy your shuffle service. Normally YARN
will have no idea that your application is using a new Spark - it will
still have the old version of the service jar in its classpath.


-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Question regarding Sparks new Internal authentication mechanism

2017-07-19 Thread Marcelo Vanzin
On Wed, Jul 19, 2017 at 11:19 AM, Udit Mehrotra
 wrote:
> spark.network.crypto.saslFallback false
> spark.authenticate   true
>
> This seems to work fine with internal shuffle service of Spark. However,
> when I try it with Yarn’s external shuffle service the executors are
> unable to register with the shuffle service as it still expects SASL
> authentication. Here is the error I get:
>
> Can someone confirm that this is expected behavior? Or provide some
> guidance, on how I can make it work with external shuffle service ?

Yes, that's the expected behavior, since you disabled SASL fallback in
your configuration. If you set it back on, then you can talk to the
old shuffle service.

Or you could upgrade the version of the shuffle service running on
your YARN cluster so that it also supports the new auth mechanism.

-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark history server running on Mongo

2017-07-19 Thread Marcelo Vanzin
On Tue, Jul 18, 2017 at 7:21 PM, Ivan Sadikov  wrote:
> Repository that I linked to does not require rebuilding Spark and could be
> used with current distribution, which is preferable in my case.

Fair enough, although that means that you're re-implementing the Spark
UI, which makes that project have to constantly be modified to keep up
with UI changes in Spark (or create its own UI and forget about what
Spark does). Which is what Spree does too.

In the long term I believe having these sort of enhancements in Spark
itself would benefit more people.

-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark history server running on Mongo

2017-07-18 Thread Marcelo Vanzin
See SPARK-18085. That has much of the same goals re: SHS resource
usage, and also provides a (currently non-public) API where you could
just create a MongoDB implementation if you want.

On Tue, Jul 18, 2017 at 12:56 AM, Ivan Sadikov  wrote:
> Hello everyone!
>
> I have been working on a Spark history server that uses MongoDB as a datastore
> for processed events, to iterate on the idea that the Spree project uses for
> the Spark UI. The project was originally designed to improve on the standalone
> history server with a reduced memory footprint.
>
> Project lives here: https://github.com/lightcopy/history-server
>
> These are just very early days of the project, sort of pre-alpha (some
> features are missing, and metrics in some failed jobs cases are
> questionable). Code is being tested on several 8gb and 2gb logs and aims to
> lower resource usage since we run history server together with several other
> systems.
>
> Would greatly appreciate any feedback on repository (issues/pull
> requests/suggestions/etc.). Thanks a lot!
>
>
> Cheers,
>
> Ivan
>



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: running spark job with fat jar file

2017-07-17 Thread Marcelo Vanzin
Yes.

On Mon, Jul 17, 2017 at 10:47 AM, Mich Talebzadeh
<mich.talebza...@gmail.com> wrote:
> thanks Marcelo.
>
> are these files distributed through hdfs?
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> Disclaimer: Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed. The
> author will in no case be liable for any monetary damages arising from such
> loss, damage or destruction.
>
>
>
>
> On 17 July 2017 at 18:46, Marcelo Vanzin <van...@cloudera.com> wrote:
>>
>> The YARN backend distributes all files and jars you submit with your
>> application.
>>
>> On Mon, Jul 17, 2017 at 10:45 AM, Mich Talebzadeh
>> <mich.talebza...@gmail.com> wrote:
>> > thanks guys.
>> >
>> > just to clarify let us assume i am doing spark-submit as below:
>> >
>> > ${SPARK_HOME}/bin/spark-submit \
>> > --packages ${PACKAGES} \
>> > --driver-memory 2G \
>> > --num-executors 2 \
>> > --executor-memory 2G \
>> > --executor-cores 2 \
>> > --master yarn \
>> > --deploy-mode client \
>> > --conf "${SCHEDULER}" \
>> > --conf "${EXTRAJAVAOPTIONS}" \
>> > --jars ${JARS} \
>> > --class "${FILE_NAME}" \
>> > --conf "${SPARKUIPORT}" \
>> > --conf "${SPARKDRIVERPORT}" \
>> > --conf "${SPARKFILESERVERPORT}" \
>> > --conf "${SPARKBLOCKMANAGERPORT}" \
>> > --conf "${SPARKKRYOSERIALIZERBUFFERMAX}" \
>> > ${JAR_FILE}
>> >
>> > The ${JAR_FILE} is the one. As I understand Spark should distribute that
>> > ${JAR_FILE} to each container?
>> >
>> > Also --jars ${JARS} are the list of normal jar files that need to exist
>> > in
>> > the same directory on each executor node?
>> >
>> > cheers,
>> >
>> >
>> >
>> > Dr Mich Talebzadeh
>> >
>> >
>> >
>> > LinkedIn
>> >
>> > https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> >
>> >
>> >
>> > http://talebzadehmich.wordpress.com
>> >
>> >
>> > Disclaimer: Use it at your own risk. Any and all responsibility for any
>> > loss, damage or destruction of data or any other property which may
>> > arise
>> > from relying on this email's technical content is explicitly disclaimed.
>> > The
>> > author will in no case be liable for any monetary damages arising from
>> > such
>> > loss, damage or destruction.
>> >
>> >
>> >
>> >
>> > On 17 July 2017 at 18:18, ayan guha <guha.a...@gmail.com> wrote:
>> >>
>> >> Hi Mitch
>> >>
>> >> your jar file can be anywhere in the file system, including hdfs.
>> >>
>> >> If using yarn, preferably use cluster mode in terms of deployment.
>> >>
>> >> Yarn will distribute the jar to each container.
>> >>
>> >> Best
>> >> Ayan
>> >>
>> >> On Tue, 18 Jul 2017 at 2:17 am, Marcelo Vanzin <van...@cloudera.com>
>> >> wrote:
>> >>>
>> >>> Spark distributes your application jar for you.
>> >>>
>> >>> On Mon, Jul 17, 2017 at 8:41 AM, Mich Talebzadeh
>> >>> <mich.talebza...@gmail.com> wrote:
>> >>> > hi guys,
>> >>> >
>> >>> >
>> >>> > an uber/fat jar file has been created to run with spark in CDH yarn
>> >>> > client
>> >>> > mode.
>> >>> >
>> >>> > As usual job is submitted to the edge node.
>> >>> >
>> >>> > does the jar file have to be placed in the same directory where spark
>> >>> > is
>> >>> > running in the cluster to make it work?
>> >>> >
>> >>> > Also w

Re: running spark job with fat jar file

2017-07-17 Thread Marcelo Vanzin
The YARN backend distributes all files and jars you submit with your
application.

On Mon, Jul 17, 2017 at 10:45 AM, Mich Talebzadeh
<mich.talebza...@gmail.com> wrote:
> thanks guys.
>
> just to clarify let us assume i am doing spark-submit as below:
>
> ${SPARK_HOME}/bin/spark-submit \
> --packages ${PACKAGES} \
> --driver-memory 2G \
> --num-executors 2 \
> --executor-memory 2G \
> --executor-cores 2 \
> --master yarn \
> --deploy-mode client \
> --conf "${SCHEDULER}" \
> --conf "${EXTRAJAVAOPTIONS}" \
> --jars ${JARS} \
> --class "${FILE_NAME}" \
> --conf "${SPARKUIPORT}" \
> --conf "${SPARKDRIVERPORT}" \
> --conf "${SPARKFILESERVERPORT}" \
> --conf "${SPARKBLOCKMANAGERPORT}" \
> --conf "${SPARKKRYOSERIALIZERBUFFERMAX}" \
> ${JAR_FILE}
>
> The ${JAR_FILE} is the one. As I understand Spark should distribute that
> ${JAR_FILE} to each container?
>
> Also --jars ${JARS} are the list of normal jar files that need to exist in
> the same directory on each executor node?
>
> cheers,
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> Disclaimer: Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed. The
> author will in no case be liable for any monetary damages arising from such
> loss, damage or destruction.
>
>
>
>
> On 17 July 2017 at 18:18, ayan guha <guha.a...@gmail.com> wrote:
>>
>> Hi Mitch
>>
>> your jar file can be anywhere in the file system, including hdfs.
>>
>> If using yarn, preferably use cluster mode in terms of deployment.
>>
>> Yarn will distribute the jar to each container.
>>
>> Best
>> Ayan
>>
>> On Tue, 18 Jul 2017 at 2:17 am, Marcelo Vanzin <van...@cloudera.com>
>> wrote:
>>>
>>> Spark distributes your application jar for you.
>>>
>>> On Mon, Jul 17, 2017 at 8:41 AM, Mich Talebzadeh
>>> <mich.talebza...@gmail.com> wrote:
>>> > hi guys,
>>> >
>>> >
>> >>> > an uber/fat jar file has been created to run with spark in CDH yarn
>>> > client
>>> > mode.
>>> >
>>> > As usual job is submitted to the edge node.
>>> >
>> >>> > does the jar file have to be placed in the same directory where spark is
>>> > running in the cluster to make it work?
>>> >
>>> > Also what will happen if say out of 9 nodes running spark, 3 have not
>>> > got
>> >>> > the jar file. will that job fail or will it carry on on the remaining 6
>>> > nodes
>>> > that have that jar file?
>>> >
>>> > thanks
>>> >
>>> > Dr Mich Talebzadeh
>>> >
>>> >
>>> >
>>> > LinkedIn
>>> >
>>> > https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> >
>>> >
>>> >
>>> > http://talebzadehmich.wordpress.com
>>> >
>>> >
>>> > Disclaimer: Use it at your own risk. Any and all responsibility for any
>>> > loss, damage or destruction of data or any other property which may
>>> > arise
>>> > from relying on this email's technical content is explicitly
>>> > disclaimed. The
>>> > author will in no case be liable for any monetary damages arising from
>>> > such
>>> > loss, damage or destruction.
>>> >
>>> >
>>>
>>>
>>>
>>> --
>>> Marcelo
>>>
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>
>> --
>> Best Regards,
>> Ayan Guha
>
>



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: running spark job with fat jar file

2017-07-17 Thread Marcelo Vanzin
Spark distributes your application jar for you.

On Mon, Jul 17, 2017 at 8:41 AM, Mich Talebzadeh
 wrote:
> hi guys,
>
>
> an uber/fat jar file has been created to run with spark in CDH yarn client
> mode.
>
> As usual job is submitted to the edge node.
>
> does the jar file have to be placed in the same directory where spark is
> running in the cluster to make it work?
>
> Also what will happen if say out of 9 nodes running spark, 3 have not got
> the jar file. will that job fail or will it carry on on the remaining 6 nodes
> that have that jar file?
>
> thanks
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> Disclaimer: Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed. The
> author will in no case be liable for any monetary damages arising from such
> loss, damage or destruction.
>
>



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark job profiler results showing high TCP cpu time

2017-06-23 Thread Marcelo Vanzin
That thread looks like the connection between the Spark process and
jvisualvm. It's expected to show high up when doing sampling if the
app is not doing much else.

On Fri, Jun 23, 2017 at 10:46 AM, Reth RM  wrote:
> Running a spark job on local machine and profiler results indicate that
> highest time spent in sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.
> Screenshot of profiler result can be seen here : https://jpst.it/10i-V
>
> Spark job(program) is performing IO (sc.wholeTextFile method of spark apis),
> Reads files from local file system and analyses the text to obtain tokens.
>
> Any thoughts and suggestions?
>
> Thanks.
>



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: SparkAppHandle.Listener.infoChanged behaviour

2017-06-04 Thread Marcelo Vanzin
On Sat, Jun 3, 2017 at 7:16 PM, Mohammad Tariq  wrote:
> I am having a bit of difficulty in understanding the exact behaviour of
> SparkAppHandle.Listener.infoChanged(SparkAppHandle handle) method. The
> documentation says :
>
> Callback for changes in any information that is not the handle's state.
>
> What exactly is meant by any information here? Apart from state other pieces
> of information I can see is ID

So, you answered your own question.

If there's ever any new kind of information, it would use the same event.
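
For what it's worth, a rough (untested) Java sketch of wiring up a listener; the
app resource and main class below are placeholders:

  import java.io.IOException;
  import org.apache.spark.launcher.SparkAppHandle;
  import org.apache.spark.launcher.SparkLauncher;

  static SparkAppHandle launch() throws IOException {
    return new SparkLauncher()
        .setAppResource("/path/to/app.jar")   // placeholder
        .setMainClass("com.example.Main")     // placeholder
        .setMaster("yarn")
        .startApplication(new SparkAppHandle.Listener() {
          @Override public void stateChanged(SparkAppHandle h) {
            System.out.println("state: " + h.getState());
          }
          @Override public void infoChanged(SparkAppHandle h) {
            // in practice this currently fires when e.g. the app ID becomes known
            System.out.println("app id: " + h.getAppId());
          }
        });
  }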

-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: SparkAppHandle - get Input and output streams

2017-05-18 Thread Marcelo Vanzin
On Thu, May 18, 2017 at 10:10 AM, Nipun Arora  wrote:
> I wanted to know how to get the the input and output streams from
> SparkAppHandle?

You can't. You can redirect the output, but not directly get the streams.
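
A rough sketch of the redirection hooks (file paths are placeholders; check the
SparkLauncher javadocs for the exact set available in your version):

  import java.io.File;
  import org.apache.spark.launcher.SparkLauncher;

  SparkLauncher launcher = new SparkLauncher()
      .setAppResource("/path/to/app.jar")                // placeholder
      .setMainClass("com.example.Main")                  // placeholder
      .redirectOutput(new File("/tmp/app-stdout.log"))   // child stdout -> file
      .redirectError(new File("/tmp/app-stderr.log"));   // child stderr -> file
  // or launcher.redirectToLog("my.logger") to route output into java.util.logging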

-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: scalastyle violation on mvn install but not on mvn package

2017-05-17 Thread Marcelo Vanzin
scalastyle runs on the "verify" phase, which is after package but
before install.
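
If the goal is just to get "install" through without changing failOnViolation,
skipping the plugin for that one run may be enough - assuming the standard
scalastyle-maven-plugin skip property is exposed in the build:

  ./build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests \
    -Dscalastyle.skip=true clean install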

On Wed, May 17, 2017 at 5:47 PM, yiskylee  wrote:
> ./build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean
> package
> works, but
> ./build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean
> install
> triggers scalastyle violation error.
>
> Is the scalastyle check not used on package but only on install? To install,
> should I turn off "failOnViolation" in the pom?
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/scalastyle-violation-on-mvn-install-but-not-on-mvn-package-tp28693.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark Shuffle Encryption

2017-05-12 Thread Marcelo Vanzin
http://spark.apache.org/docs/latest/configuration.html#shuffle-behavior

All the options you need to know are there.
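
In short (double-check that page for your exact version), the SPARK-5682 work
shows up as the spark.io.encryption.* options, which sit on top of
spark.authenticate; a minimal spark-defaults.conf sketch:

  spark.authenticate              true
  spark.io.encryption.enabled     true
  spark.io.encryption.keySizeBits 128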

On Fri, May 12, 2017 at 9:11 AM, Shashi Vishwakarma
 wrote:
> Hi
>
> I was doing research on encrypting spark shuffle data and found that Spark
> 2.1 has got that feature.
>
> https://issues.apache.org/jira/browse/SPARK-5682
>
> Does anyone have more documentation around it? How do I use this
> feature in a real production environment, keeping in mind that I need to
> secure the spark job?
>
> Thanks
> Shashi



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: --jars does not take remote jar?

2017-05-02 Thread Marcelo Vanzin
On Tue, May 2, 2017 at 9:07 AM, Nan Zhu  wrote:
> I have no easy way to pass jar path to those forked Spark
> applications? (except that I download jar from a remote path to a local temp
> dir after resolving some permission issues, etc.?)

Yes, that's the only way currently in client mode.

-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: --jars does not take remote jar?

2017-05-02 Thread Marcelo Vanzin
Remote jars are added to executors' classpaths, but not the driver's.
In YARN cluster mode, they would also be added to the driver's class
path.
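
So as a practical matter, if an hdfs:// jar has to be visible to the driver as
well, cluster mode is the easy route; roughly (paths are placeholders):

  spark-submit --master yarn --deploy-mode cluster \
    --jars hdfs:///libs/dep1.jar,hdfs:///libs/dep2.jar \
    --class com.example.Main \
    hdfs:///apps/app.jar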

On Tue, May 2, 2017 at 8:43 AM, Nan Zhu  wrote:
> Hi, all
>
> For some reason, I tried to pass in a HDFS path to the --jars option in
> spark-submit
>
> According to the document,
> http://spark.apache.org/docs/latest/submitting-applications.html#advanced-dependency-management,
> --jars would accept remote path
>
> However, in the implementation,
> https://github.com/apache/spark/blob/c622a87c44e0621e1b3024fdca9b2aa3c508615b/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L757,
> it does not look like so
>
> Did I miss anything?
>
> Best,
>
> Nan



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Problem with Java and Scala interoperability // streaming

2017-04-19 Thread Marcelo Vanzin
I see a bunch of getOrCreate methods in that class. They were all
added in SPARK-6752, a long time ago.
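
i.e. something along these lines should compile - a rough, untested sketch,
assuming createStreamingContext() returns a JavaStreamingContext:

  import org.apache.spark.streaming.api.java.JavaStreamingContext;

  JavaStreamingContext ssc = JavaStreamingContext.getOrCreate(
      hdfsCheckpointDir,
      () -> createStreamingContext());  // org.apache.spark.api.java.function.Function0
  ssc.start();
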

On Wed, Apr 19, 2017 at 1:51 PM, kant kodali <kanth...@gmail.com> wrote:
> There is no getOrCreate for JavaStreamingContext however I do use
> JavaStreamingContext inside createStreamingContext() from my code in the
> previous email.
>
> On Wed, Apr 19, 2017 at 1:46 PM, Marcelo Vanzin <van...@cloudera.com> wrote:
>>
>> Why are you not using JavaStreamingContext if you're writing Java?
>>
>> On Wed, Apr 19, 2017 at 1:42 PM, kant kodali <kanth...@gmail.com> wrote:
>> > Hi All,
>> >
>> > I get the following errors whichever way I try either lambda or
>> > generics. I
>> > am using
>> > spark 2.1 and Scala 2.11.8
>> >
>> >
>> > StreamingContext ssc = StreamingContext.getOrCreate(hdfsCheckpointDir,
>> > () ->
>> > {return createStreamingContext();}, null, false);
>> >
>> > ERROR
>> >
>> > StreamingContext ssc = StreamingContext.getOrCreate(hdfsCheckpointDir,
>> > () ->
>> > {return createStreamingContext();}, null, false);
>> >
>> > multiple non-overriding abstract methods found in interface Function0
>> >
>> > Note: Some messages have been simplified; recompile with -Xdiags:verbose
>> > to
>> > get full output
>> >
>> > 1 error
>> >
>> > :compileJava FAILED
>> >
>> >
>> > StreamingContext ssc = StreamingContext.getOrCreate(hdfsCheckpointDir,
>> > new
>> > Function0() {
>> > @Override
>> > public StreamingContext apply() {
>> > return createStreamingContext();
>> > }
>> > }, null, false);
>> >
>> >
>> > ERROR
>> >
>> > is not abstract and does not override abstract method apply$mcV$sp() in
>> > Function0
>> >
>> > StreamingContext ssc = StreamingContext.getOrCreate(hdfsCheckpointDir,
>> > new
>> > Function0() {
>> > ^
>> >
>> > 1 error
>> >
>> > :compileJava FAILED
>> >
>> >
>> > Thanks!
>> >
>> >
>>
>>
>>
>> --
>> Marcelo
>
>



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Problem with Java and Scala interoperability // streaming

2017-04-19 Thread Marcelo Vanzin
Why are you not using JavaStreamingContext if you're writing Java?

On Wed, Apr 19, 2017 at 1:42 PM, kant kodali  wrote:
> Hi All,
>
> I get the following errors whichever way I try either lambda or generics. I
> am using
> spark 2.1 and Scala 2.11.8
>
>
> StreamingContext ssc = StreamingContext.getOrCreate(hdfsCheckpointDir, () ->
> {return createStreamingContext();}, null, false);
>
> ERROR
>
> StreamingContext ssc = StreamingContext.getOrCreate(hdfsCheckpointDir, () ->
> {return createStreamingContext();}, null, false);
>
> multiple non-overriding abstract methods found in interface Function0
>
> Note: Some messages have been simplified; recompile with -Xdiags:verbose to
> get full output
>
> 1 error
>
> :compileJava FAILED
>
>
> StreamingContext ssc = StreamingContext.getOrCreate(hdfsCheckpointDir, new
> Function0() {
> @Override
> public StreamingContext apply() {
> return createStreamingContext();
> }
> }, null, false);
>
>
> ERROR
>
> is not abstract and does not override abstract method apply$mcV$sp() in
> Function0
>
> StreamingContext ssc = StreamingContext.getOrCreate(hdfsCheckpointDir, new
> Function0() {
> ^
>
> 1 error
>
> :compileJava FAILED
>
>
> Thanks!
>
>



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Monitoring ongoing Spark Job when run in Yarn Cluster mode

2017-03-13 Thread Marcelo Vanzin
It's linked from the YARN RM's Web UI (see the "Application Master"
link for the running application).
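
In other words, while the app is running the driver UI is reachable through the
RM proxy; roughly (host and port are the usual defaults, adjust for your cluster):

  yarn application -list        # find the application ID
  # then follow the "Application Master" link in the RM UI, which proxies to
  # something like: http://<rm-host>:8088/proxy/<application_id>/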

On Mon, Mar 13, 2017 at 6:53 AM, Sourav Mazumder
 wrote:
> Hi,
>
> Is there a way to monitor an ongoing Spark Job when running in Yarn Cluster
> mode ?
>
> In my understanding, in Yarn cluster mode the Spark monitoring UI for the
> ongoing job would not be available on port 4040. So is there an alternative?
>
> Regards,
> Sourav



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: spark-submit question

2017-02-28 Thread Marcelo Vanzin
You're either running a really old version of Spark where there might
have been issues in that code, or you're actually missing some
backslashes in the command you pasted in your message.
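
i.e. the line continuations have to go all the way down to the last argument,
roughly:

  ...
  --total-executor-cores 100 \
  /path/to/examples.jar \
  --num-decimals=1000 \
  --second-argument=Arg2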

On Tue, Feb 28, 2017 at 2:05 PM, Joe Olson <jo4...@outlook.com> wrote:
>> Everything after the jar path is passed to the main class as parameters.
>
> I don't think that is accurate if your application arguments contain double
> dashes. I've tried with several permutations of with and without '\'s and
> newlines.
>
> Just thought I'd ask here before I have to re-configure and re-compile all
> my jars.
>
> ./bin/spark-submit \
>   --class org.apache.spark.examples.SparkPi \
>   --master spark://207.184.161.138:7077 \
>   --deploy-mode cluster \
>   --supervise \
>   --executor-memory 20G \
>   --total-executor-cores 100 \
>   /path/to/examples.jar
>   --num-decimals=1000
>   --second-argument=Arg2
>
> {
>   "action" : "CreateSubmissionResponse",
>   "serverSparkVersion" : "2.1.0",
>   "submissionId" : "driver-20170228155848-0016",
>   "success" : true
> }
> ./test3.sh: line 15: --num-decimals=1000: command not found
> ./test3.sh: line 16: --second-argument=Arg2: command not found
>
>
> 
> From: Marcelo Vanzin <van...@cloudera.com>
> Sent: Tuesday, February 28, 2017 12:17:49 PM
> To: Joe Olson
> Cc: user@spark.apache.org
> Subject: Re: spark-submit question
>
> Everything after the jar path is passed to the main class as
> parameters. So if it's not working you're probably doing something
> wrong in your code (that you haven't posted).
>
> On Tue, Feb 28, 2017 at 7:05 AM, Joe Olson <jo4...@outlook.com> wrote:
>> For spark-submit, I know I can submit application level command line
>> parameters to my .jar.
>>
>>
>> However, can I prefix them with switches? My command line params are
>> processed in my applications using JCommander. I've tried several
>> variations
>> of the below with no success.
>>
>>
>> An example of what I am trying to do is below in the --num-decimals
>> argument.
>>
>>
>> ./bin/spark-submit \
>>   --class org.apache.spark.examples.SparkPi \
>>   --master spark://207.184.161.138:7077 \
>>   --deploy-mode cluster \
>>   --supervise \
>>   --executor-memory 20G \
>>   --total-executor-cores 100 \
>>   /path/to/examples.jar \
>>   --num-decimals=1000 \
>>   --second-argument=Arg2
>>
>>
>
>
>
> --
> Marcelo



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: spark-submit question

2017-02-28 Thread Marcelo Vanzin
Everything after the jar path is passed to the main class as
parameters. So if it's not working you're probably doing something
wrong in your code (that you haven't posted).

On Tue, Feb 28, 2017 at 7:05 AM, Joe Olson  wrote:
> For spark-submit, I know I can submit application level command line
> parameters to my .jar.
>
>
> However, can I prefix them with switches? My command line params are
> processed in my applications using JCommander. I've tried several variations
> of the below with no success.
>
>
> An example of what I am trying to do is below in the --num-decimals
> argument.
>
>
> ./bin/spark-submit \
>   --class org.apache.spark.examples.SparkPi \
>   --master spark://207.184.161.138:7077 \
>   --deploy-mode cluster \
>   --supervise \
>   --executor-memory 20G \
>   --total-executor-cores 100 \
>   /path/to/examples.jar \
>   --num-decimals=1000 \
>   --second-argument=Arg2
>
>



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: SPark - YARN Cluster Mode

2017-02-27 Thread Marcelo Vanzin
>  none of my Config settings

Is it none of the configs or just the queue? You can't set the YARN
queue in cluster mode through code, it has to be set in the command
line. It's a chicken & egg problem (in cluster mode, the YARN app is
created before your code runs).

--properties-file works the same as setting options in the command
line, so you can use that instead.
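
For the pyspark example below, that would look something like this (values taken
from the script):

  spark-submit --master yarn --deploy-mode cluster \
    --queue root.Applications \
    --num-executors 50 --executor-memory 22g --executor-cores 4 \
    --conf spark.yarn.executor.memoryOverhead=4096 \
    --conf spark.sql.hive.convertMetastoreParquet=false \
    ayan_test.py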


On Sun, Feb 26, 2017 at 4:52 PM, ayan guha  wrote:
> Hi
>
> I am facing an issue with Cluster Mode, with pyspark
>
> Here is my code:
>
> conf = SparkConf()
> conf.setAppName("Spark Ingestion")
> conf.set("spark.yarn.queue","root.Applications")
> conf.set("spark.executor.instances","50")
> conf.set("spark.executor.memory","22g")
> conf.set("spark.yarn.executor.memoryOverhead","4096")
> conf.set("spark.executor.cores","4")
> conf.set("spark.sql.hive.convertMetastoreParquet", "false")
> sc = SparkContext(conf = conf)
> sqlContext = HiveContext(sc)
>
> r = sc.parallelize(xrange(1,1))
> print r.count()
>
> sc.stop()
>
> The problem is none of my Config settings are passed on to Yarn.
>
> spark-submit --master yarn --deploy-mode cluster ayan_test.py
>
> I tried the same code with deploy-mode=client and all config are passing
> fine.
>
> Am I missing something? Will introducing --properties-file be of any help? Can
> anybody share some working example?
>
> Best
> Ayan
>
> --
> Best Regards,
> Ayan Guha



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Jars directory in Spark 2.0

2017-02-01 Thread Marcelo Vanzin
Spark has never shaded dependencies (in the sense of renaming the classes),
with a couple of exceptions (Guava and Jetty). So that behavior is nothing
new. Spark's dependencies themselves have a lot of other dependencies, so
doing that would have limited benefits anyway.
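
So the usual way out, as Koert says below, is to relocate your own copies of the
conflicting packages; a rough maven-shade-plugin sketch (package names are just
examples):

  <plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-shade-plugin</artifactId>
    <executions>
      <execution>
        <phase>package</phase>
        <goals><goal>shade</goal></goals>
        <configuration>
          <relocations>
            <relocation>
              <pattern>com.google.common</pattern>
              <shadedPattern>myapp.shaded.com.google.common</shadedPattern>
            </relocation>
          </relocations>
        </configuration>
      </execution>
    </executions>
  </plugin>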

On Tue, Jan 31, 2017 at 11:23 PM, Sidney Feiner 
wrote:

> Is this done on purpose? Because it really makes it hard to deploy
> applications. Is there a reason they didn't shade the jars they use to
> begin with?
>
>
>
> Sidney Feiner  /  SW Developer
>
> M: +972.528197720  /  Skype: sidney.feiner.startapp
>
>
>
>
>
>
> From: Koert Kuipers [mailto:ko...@tresata.com]
> Sent: Tuesday, January 31, 2017 7:26 PM
> To: Sidney Feiner
> Cc: user@spark.apache.org
> Subject: Re: Jars directory in Spark 2.0
>
>
>
> you basically have to keep your versions of dependencies in line with
> sparks or shade your own dependencies.
>
> you cannot just replace the jars in sparks jars folder. if you wan to
> update them you have to build spark yourself with updated dependencies and
> confirm it compiles, passes tests etc.
>
>
>
> On Tue, Jan 31, 2017 at 3:40 AM, Sidney Feiner 
> wrote:
>
> Hey,
>
> While migrating to Spark 2.X from 1.6, I've had many issues with jars that
> come preloaded with Spark in the "jars/" directory and I had to shade most
> of my packages.
>
> Can I replace the jars in this folder with more up-to-date versions? Are
> those jars used for anything internal in Spark, which means I can't blindly
> replace them?
>
>
>
> Thanks :)
>
>
>
>
>
> Sidney Feiner  /  SW Developer
>
> M: +972.528197720  /  Skype: sidney.feiner.startapp
>
>
>
>
>
>
> 
>
>   
>



-- 
Marcelo


Re: why does spark web UI keeps changing its port?

2017-01-23 Thread Marcelo Vanzin
As I said.

Each app gets its own UI. Look at the logs printed to the output.
The port will depend on whether they're running on the same host at
the same time.

This is irrespective of how they are run.

On Mon, Jan 23, 2017 at 12:40 PM, kant kodali <kanth...@gmail.com> wrote:
> yes I meant submitting through spark-submit.
>
> so if I do spark-submit A.jar and spark-submit A.jar again, do I get two
> UIs or one UI? And which ports do they run on when using the standalone
> mode?
>
> On Mon, Jan 23, 2017 at 12:19 PM, Marcelo Vanzin <van...@cloudera.com>
> wrote:
>>
>> Depends on what you mean by "job". Which is why I prefer "app", which
>> is clearer (something you submit using "spark-submit", for example).
>>
>> But really, I'm not sure what you're asking now.
>>
>> On Mon, Jan 23, 2017 at 12:15 PM, kant kodali <kanth...@gmail.com> wrote:
>> > hmm..I guess in that case my assumption of "app" is wrong. I thought the
>> > app
>> > is a client jar that you submit. no? If so, say I submit multiple jobs
>> > then
>> > I get two UIs?
>> >
>> > On Mon, Jan 23, 2017 at 12:07 PM, Marcelo Vanzin <van...@cloudera.com>
>> > wrote:
>> >>
>> >> No. Each app has its own UI which runs (starting on) port 4040.
>> >>
>> >> On Mon, Jan 23, 2017 at 12:05 PM, kant kodali <kanth...@gmail.com>
>> >> wrote:
>> >> > I am using standalone mode so wouldn't be 8080 for my app web ui as
>> >> > well?
>> >> > There is nothing running on 4040 in my cluster.
>> >> >
>> >> >
>> >> > http://spark.apache.org/docs/latest/security.html#standalone-mode-only
>> >> >
>> >> > On Mon, Jan 23, 2017 at 11:51 AM, Marcelo Vanzin
>> >> > <van...@cloudera.com>
>> >> > wrote:
>> >> >>
>> >> >> That's the Master, whose default port is 8080 (not 4040). The
>> >> >> default
>> >> >> port for the app's UI is 4040.
>> >> >>
>> >> >> On Mon, Jan 23, 2017 at 11:47 AM, kant kodali <kanth...@gmail.com>
>> >> >> wrote:
>> >> >> > I am not sure why Spark web UI keeps changing its port every time
>> >> >> > I
>> >> >> > restart
>> >> >> > a cluster? how can I make it run always on one port? I did make
>> >> >> > sure
>> >> >> > there
>> >> >> > is no process running on 4040(spark default web ui port) however
>> >> >> > it
>> >> >> > still
>> >> >> > starts at 8080. any ideas?
>> >> >> >
>> >> >> >
>> >> >> > MasterWebUI: Bound MasterWebUI to 0.0.0.0, and started at
>> >> >> > http://x.x.x.x:8080
>> >> >> >
>> >> >> >
>> >> >> > Thanks!
>> >> >>
>> >> >>
>> >> >>
>> >> >> --
>> >> >> Marcelo
>> >> >
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> Marcelo
>> >
>> >
>>
>>
>>
>> --
>> Marcelo
>
>



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: why does spark web UI keeps changing its port?

2017-01-23 Thread Marcelo Vanzin
Depends on what you mean by "job". Which is why I prefer "app", which
is clearer (something you submit using "spark-submit", for example).

But really, I'm not sure what you're asking now.

On Mon, Jan 23, 2017 at 12:15 PM, kant kodali <kanth...@gmail.com> wrote:
> hmm..I guess in that case my assumption of "app" is wrong. I thought the app
> is a client jar that you submit. no? If so, say I submit multiple jobs then
> I get two UIs?
>
> On Mon, Jan 23, 2017 at 12:07 PM, Marcelo Vanzin <van...@cloudera.com>
> wrote:
>>
>> No. Each app has its own UI which runs (starting on) port 4040.
>>
>> On Mon, Jan 23, 2017 at 12:05 PM, kant kodali <kanth...@gmail.com> wrote:
>> > I am using standalone mode so wouldn't be 8080 for my app web ui as
>> > well?
>> > There is nothing running on 4040 in my cluster.
>> >
>> > http://spark.apache.org/docs/latest/security.html#standalone-mode-only
>> >
>> > On Mon, Jan 23, 2017 at 11:51 AM, Marcelo Vanzin <van...@cloudera.com>
>> > wrote:
>> >>
>> >> That's the Master, whose default port is 8080 (not 4040). The default
>> >> port for the app's UI is 4040.
>> >>
>> >> On Mon, Jan 23, 2017 at 11:47 AM, kant kodali <kanth...@gmail.com>
>> >> wrote:
>> >> > I am not sure why Spark web UI keeps changing its port every time I
>> >> > restart
>> >> > a cluster? how can I make it run always on one port? I did make sure
>> >> > there
>> >> > is no process running on 4040(spark default web ui port) however it
>> >> > still
>> >> > starts at 8080. any ideas?
>> >> >
>> >> >
>> >> > MasterWebUI: Bound MasterWebUI to 0.0.0.0, and started at
>> >> > http://x.x.x.x:8080
>> >> >
>> >> >
>> >> > Thanks!
>> >>
>> >>
>> >>
>> >> --
>> >> Marcelo
>> >
>> >
>>
>>
>>
>> --
>> Marcelo
>
>



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: why does spark web UI keeps changing its port?

2017-01-23 Thread Marcelo Vanzin
No. Each app has its own UI which runs (starting on) port 4040.

On Mon, Jan 23, 2017 at 12:05 PM, kant kodali <kanth...@gmail.com> wrote:
> I am using standalone mode so wouldn't be 8080 for my app web ui as well?
> There is nothing running on 4040 in my cluster.
>
> http://spark.apache.org/docs/latest/security.html#standalone-mode-only
>
> On Mon, Jan 23, 2017 at 11:51 AM, Marcelo Vanzin <van...@cloudera.com>
> wrote:
>>
>> That's the Master, whose default port is 8080 (not 4040). The default
>> port for the app's UI is 4040.
>>
>> On Mon, Jan 23, 2017 at 11:47 AM, kant kodali <kanth...@gmail.com> wrote:
>> > I am not sure why Spark web UI keeps changing its port every time I
>> > restart
>> > a cluster? how can I make it run always on one port? I did make sure
>> > there
>> > is no process running on 4040(spark default web ui port) however it
>> > still
>> > starts at 8080. any ideas?
>> >
>> >
>> > MasterWebUI: Bound MasterWebUI to 0.0.0.0, and started at
>> > http://x.x.x.x:8080
>> >
>> >
>> > Thanks!
>>
>>
>>
>> --
>> Marcelo
>
>



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: why does spark web UI keeps changing its port?

2017-01-23 Thread Marcelo Vanzin
That's the Master, whose default port is 8080 (not 4040). The default
port for the app's UI is 4040.
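
If you want them pinned, the relevant knobs are roughly (defaults shown):

  # spark-env.sh
  SPARK_MASTER_WEBUI_PORT=8080   # standalone Master UI
  SPARK_WORKER_WEBUI_PORT=8081   # standalone Worker UI

  # spark-defaults.conf (per-application UI; falls back to 4041, 4042, ... if busy)
  spark.ui.port 4040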

On Mon, Jan 23, 2017 at 11:47 AM, kant kodali  wrote:
> I am not sure why Spark web UI keeps changing its port every time I restart
> a cluster? how can I make it run always on one port? I did make sure there
> is no process running on 4040(spark default web ui port) however it still
> starts at 8080. any ideas?
>
>
> MasterWebUI: Bound MasterWebUI to 0.0.0.0, and started at
> http://x.x.x.x:8080
>
>
> Thanks!



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Is restarting of SparkContext allowed?

2016-12-15 Thread Marcelo Vanzin
(-dev, +user. dev is for Spark development, not for questions about
using Spark.)

You haven't posted code here or the actual error. But you might be
running into SPARK-15754. Or into other issues with yarn-client mode
and "--principal / --keytab" (those have known issues in client mode).

If you have the above fix, you should be able to run the SparkContext
in client mode inside a UGI.doAs() block, after you login the user,
and later stop the context and start a new one. (And don't use
"--principal" / "--keytab" in that case.)


On Thu, Dec 15, 2016 at 1:46 PM, Alexey Klimov  wrote:
> Hello, my question is the continuation of a problem I described  here
> 
> .
>
> I've done some investigation and found out that nameNode.getDelegationToken
> is called while constructing a SparkContext, even if a delegation token is
> already present in the token list of the currently logged-in user in the
> UserGroupInformation object. The problem doesn't occur when the waiting time
> before constructing a new context is less than 10 seconds, because the rpc
> connection to the namenode just isn't reset, due to the
> ipc.client.connection.maxidletime property.
>
> As a workaround for this problem I log in from the keytab before every
> construction of a SparkContext, which basically just resets the token list of
> the currently logged-in user (as well as the whole user structure), and the
> problem seems to be gone. Still, I'm not really sure that this is the correct
> way to deal with SparkContext.
>
> Having found the reason for the problem, I now have 2 assumptions:
> First - SparkContext was designed to be restarted during a JVM run, and the
> behaviour above is just a bug.
> Second - it wasn't, and I'm just using SparkContext in the wrong manner.
>
> Since I haven't found any related bug in JIRA, any solution on the
> internet, or many other users facing this error, I tend to think that
> it is rather a disallowed usage of SparkContext.
>
> Is that correct?
>
>
>
> --
> View this message in context: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Is-restarting-of-SparkContext-allowed-tp20240.html
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org


