RE: How to support dependency jars and files on HDFS in standalone cluster mode?

2015-06-14 Thread Dong Lei
JIRA created: https://issues.apache.org/jira/browse/SPARK-8369

And I’m working on a PR.

Thanks
Dong Lei



Re: How to support dependency jars and files on HDFS in standalone cluster mode?

2015-06-12 Thread Cheng Lian

Would you mind filing a JIRA for this? Thanks!

Cheng




Re: How to support dependency jars and files on HDFS in standalone cluster mode?

2015-06-11 Thread Cheng Lian
Oh sorry, I mistook --jars for --files. Yeah, jars need to be added to the 
classpath, which is different from regular files.


Cheng
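
For concreteness, a minimal sketch of that difference (the path and class name 
below are made up): a plain file can simply be downloaded and read, whereas a 
downloaded jar is only useful once a classloader knows about it.

    import java.net.{URL, URLClassLoader}

    object WhyJarsNeedClasspath {
      def main(args: Array[String]): Unit = {
        // A regular file would just be read from disk; a jar has to be
        // registered with a classloader before its classes are usable.
        val jarUrl = new URL("file:/tmp/deps/1.jar")            // hypothetical local copy
        val loader = new URLClassLoader(Array(jarUrl), getClass.getClassLoader)
        val clazz  = loader.loadClass("com.example.Foo")        // hypothetical class
        println(clazz.getName)
      }
    }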




RE: How to support dependency jars and files on HDFS in standalone cluster mode?

2015-06-11 Thread Dong Lei
I think in standalone cluster mode, Spark is supposed to:

1.   Download the jars and files to the driver

2.   Set the driver's classpath

3.   Have the driver set up an HTTP file server to distribute these files

4.   Have the workers download from the driver and set up their classpaths

Right?

But somehow the first step fails.
Even if I make the first step work (using option 1 below), it seems that the 
classpath on the driver is not set correctly.
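
One quick way to check point 2, independent of Spark itself, is to dump the JVM 
classpath from inside the driver process; a minimal, hypothetical probe 
application:

    object PrintDriverClasspath {
      def main(args: Array[String]): Unit = {
        // main() runs on the driver, so this prints the driver's effective classpath.
        System.getProperty("java.class.path")
          .split(java.io.File.pathSeparator)
          .foreach(println)
      }
    }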

Thanks
Dong Lei




RE: How to support dependency jars and files on HDFS in standalone cluster mode?

2015-06-11 Thread Dong Lei
Thanks Cheng,

If I do not use --jars, how can I tell Spark to look for the jars (and files) on 
HDFS?

Do you mean that in this scenario the driver will not need to set up an HTTP file 
server, and the workers will fetch the jars and files from HDFS directly?

Thanks
Dong Lei




Re: How to support dependency jars and files on HDFS in standalone cluster mode?

2015-06-10 Thread Cheng Lian
Since the jars are already on HDFS, you can access them directly in your 
Spark application without using --jars.


Cheng
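
For example (the HDFS path below is hypothetical), the application itself can 
register a jar that already lives on HDFS and the executors will fetch it 
directly, without going through --jars:

    import org.apache.spark.{SparkConf, SparkContext}

    object DirectHdfsJar {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("direct-hdfs-jar"))
        // addJar accepts an hdfs:// URI; executors download it themselves,
        // so the driver's HTTP file server is not involved for this jar.
        sc.addJar("hdfs://namenode:8020/user/dong/1.jar")
        // ... application logic ...
        sc.stop()
      }
    }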





How to support dependency jars and files on HDFS in standalone cluster mode?

2015-06-10 Thread Dong Lei
Hi spark-dev:

I cannot use an HDFS location for the --jars or --files option when doing a 
spark-submit in standalone cluster mode. For example:
spark-submit ... --jars hdfs://ip/1.jar ... hdfs://ip/app.jar   (standalone cluster mode)
will not download 1.jar to the driver's HTTP file server (although app.jar is 
downloaded to the driver's directory).

I figured out that the reason Spark does not download the jars is that when 
sc.addJar adds a file to the HTTP file server, the call used underneath is 
Files.copy, which does not support a remote location.
And I think that even if Spark did download the jars and add them to the HTTP 
file server, the classpath would still not be set correctly, because it would 
contain the remote locations.

So I'm trying to make this work and have come up with two options, but neither 
of them seems elegant, and I would like to hear your advice:

Option 1:
Modify HTTPFileServer.addFileToDir so that it recognizes an hdfs:// prefix.

This is not good because I think it oversteps the scope of the HTTP file server.
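
For reference, a rough sketch of what Option 1 could look like (the object and 
method shape are illustrative, not the actual Spark source): branch on the URI 
scheme and fetch remote sources with the Hadoop FileSystem API before serving 
them.

    import java.io.File
    import java.net.URI
    import com.google.common.io.Files
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    object HdfsAwareFileServerSketch {
      // Sketch only: copy `source` (a local path or an hdfs:// URI) into `dir`
      // and return the location of the file that will be served.
      def addFileToDir(source: String, dir: File): File = {
        val target = new File(dir, new Path(source).getName)
        if (source.startsWith("hdfs://")) {
          val fs = FileSystem.get(new URI(source), new Configuration())
          fs.copyToLocalFile(new Path(source), new Path(target.getAbsolutePath))
        } else {
          Files.copy(new File(source), target)  // the existing local-copy behavior
        }
        target
      }
    }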

Option 2:
Modify DriverRunner.downloadUserJar so that it downloads all the --jars and 
--files along with the application jar.

This sounds more reasonable than option 1 for downloading the files. But this 
way I need to read spark.jars and spark.files in downloadUserJar or 
DriverRunner.start and replace them with local paths. How can I do that?
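
A hedged sketch of the rewriting step Option 2 would need (the helper name and 
the exact call site are assumptions, not existing Spark code): download each 
remote entry next to the user jar and point spark.jars / spark.files at the 
local copies.

    import java.io.File
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.SparkConf

    object LocalizeDepsSketch {
      // Sketch only: rewrite spark.jars and spark.files to local paths after
      // downloading any hdfs:// entries into the driver's working directory.
      def localizeDeps(conf: SparkConf, driverDir: File): Unit = {
        def download(uri: String): String =
          if (!uri.startsWith("hdfs://")) uri  // already local, leave untouched
          else {
            val src = new Path(uri)
            val dst = new File(driverDir, src.getName)
            val fs  = FileSystem.get(src.toUri, new Configuration())
            fs.copyToLocalFile(src, new Path(dst.getAbsolutePath))
            dst.getAbsolutePath
          }
        for (key <- Seq("spark.jars", "spark.files")) {
          val entries = conf.get(key, "").split(",").filter(_.nonEmpty)
          if (entries.nonEmpty) conf.set(key, entries.map(download).mkString(","))
        }
      }
    }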


Do you have a more elegant solution, or is there a plan to support this in the 
future?

Thanks
Dong Lei