[jira] [Commented] (SPARK-3769) SparkFiles.get gives me the wrong fully qualified path

2014-10-03 Thread Tom Weber (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14157946#comment-14157946 ]

Tom Weber commented on SPARK-3769:
--

I believe I originally called it on the driver side, but the addFile call makes 
a local copy, so when you call SparkFiles.get there, you get the path of that 
local copy, which isn't the same path as where the file ends up on the remote 
worker nodes.
I'm good with stripping the path off and passing only the file name itself to 
the get call.
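
For reference, a minimal sketch of that approach (the RDD and the body of the 
function are illustrative placeholders, not from my actual job):

{code}
// imports assumed: org.apache.spark.SparkFiles,
//                  org.apache.spark.api.java.function.FlatMapFunction
// Driver side: register the file by its full path, but keep only the
// base name to use as the lookup key in the worker closure.
String path = "/opt/tom/SparkFiles.sas";
sc.addFile(path);
final String fileName = new java.io.File(path).getName(); // "SparkFiles.sas"

// Worker side: resolve the bare name against the executor's work directory.
// (Assumes a JavaRDD<String> named rdd and the Spark 1.x Java API.)
rdd.mapPartitions(new FlatMapFunction<java.util.Iterator<String>, String>() {
    public Iterable<String> call(java.util.Iterator<String> it) {
        String pgm = SparkFiles.get(fileName); // correct per-executor path
        // ... use the local copy at pgm against this partition ...
        return java.util.Collections.<String>emptyList();
    }
});
{code}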

 SparkFiles.get gives me the wrong fully qualified path
 --

 Key: SPARK-3769
 URL: https://issues.apache.org/jira/browse/SPARK-3769
 Project: Spark
  Issue Type: Bug
  Components: Java API
Affects Versions: 1.0.2, 1.1.0
 Environment: linux host, and linux grid.
Reporter: Tom Weber
Priority: Minor

 My Spark program runs on my host, submitting work to my grid:
 {code}
 JavaSparkContext sc = new JavaSparkContext(conf);
 final String path = args[1];
 sc.addFile(path); /* args[1] = /opt/tom/SparkFiles.sas */
 {code}
 The log shows:
 {noformat}
 14/10/02 16:07:14 INFO Utils: Copying /opt/tom/SparkFiles.sas to
 /tmp/spark-4c661c3f-cb57-4c9f-a0e9-c2162a89db77/SparkFiles.sas
 14/10/02 16:07:15 INFO SparkContext: Added file /opt/tom/SparkFiles.sas at
 http://10.20.xx.xx:49587/files/SparkFiles.sas with timestamp 1412280434986
 {noformat}
 Those are paths on my host machine. The location this file ends up at on the 
 grid nodes is:
 /opt/tom/spark-1.1.0-bin-hadoop2.4/work/app-20141002160704-0002/1/SparkFiles.sas
 The call to get the path, in code that runs in my mapPartitions function on 
 the grid nodes, is:
 {code}
 String pgm = SparkFiles.get(path);
 {code}
 and it returns the following string:
 /opt/tom/spark-1.1.0-bin-hadoop2.4/work/app-20141002160704-0002/1/./opt/tom/SparkFiles.sas
 So, am I expected to take the qualified path that was given to me, parse out 
 only the file name at the end, and then concatenate that to the result of the 
 SparkFiles.getRootDirectory() call in order to get this to work? Or pass only 
 the parsed file name to the SparkFiles.get method? It seems as though I should 
 be able to pass the same file specification to both sc.addFile() and 
 SparkFiles.get() and get the correct location of the file.
 Thanks,
 Tom





[jira] [Commented] (SPARK-3769) SparkFiles.get gives me the wrong fully qualified path

2014-10-02 Thread Sean Owen (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14157156#comment-14157156 ]

Sean Owen commented on SPARK-3769:
--

My understanding is that you execute:

{code}
sc.addFile("/opt/tom/SparkFiles.sas");
...
SparkFiles.get("SparkFiles.sas");
{code}

I would not expect the key used by remote workers to depend on the location on 
the driver that the file came from. The path may not be absolute in all cases 
anyway. I can see the argument that it feels like both should be the same key, 
but really the key being set is the file name, not the path.

You don't have to parse it by hand though. Usually you might do something like 
this anyway:

{code}
File myFile = new File(args[1]);
sc.addFile(myFile.getAbsolutePath()); // register the full path on the driver
String fileName = myFile.getName();   // just the base name, e.g. "SparkFiles.sas"
...
SparkFiles.get(fileName);             // look up by base name on the workers
{code}

AFAIK this is as intended.



[jira] [Commented] (SPARK-3769) SparkFiles.get gives me the wrong fully qualified path

2014-10-02 Thread Tom Weber (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14157263#comment-14157263 ]

Tom Weber commented on SPARK-3769:
--

Thanks for the quick turnaround!

I can see that it wouldn't necessarily make sense for a fully qualified path 
(relative to the driver program's filesystem) to be what the .get method takes 
on the worker node systems. At the same time, though, .get seems to just take 
whatever you give it and blindly concatenate it to the .getRootDirectory 
result, without even validating it or failing if that file doesn't exist.
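
For what it's worth, a small guard like this (purely illustrative, reusing the 
fileName idea from the earlier snippet) would catch the mismatch early rather 
than failing later on a confusing path:

{code}
// Hypothetical defensive check: SparkFiles.get() just concatenates its
// argument onto getRootDirectory(), so verify the result actually exists.
String pgm = SparkFiles.get(fileName);
if (!new java.io.File(pgm).exists()) {
    throw new IllegalStateException(
        "SparkFiles.get returned a non-existent path: " + pgm);
}
{code}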

I appreciate the File object methods for pulling the path name apart; I'll use 
that and it will work just fine. This is my first time playing around with all 
of this, so sometimes what you expect it to do is just a matter of thinking 
about it a particular way :)

You can close this ticket, as I'm sure I'll be fine using the full path on the 
driver side and only the file name on the worker side. It might be convenient, 
though, if this matched set of routines handled that themselves, since the 
driver side needs a qualified path to find the file, and the worker side, by 
definition, strips that off and puts the file in the designated work directory 
(which makes sense, of course). No big deal though.

Thanks again,
Tom






[jira] [Commented] (SPARK-3769) SparkFiles.get gives me the wrong fully qualified path

2014-10-02 Thread Josh Rosen (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14157389#comment-14157389 ]

Josh Rosen commented on SPARK-3769:
---

I think that {{SparkFiles.get()}} can be called from driver code, too, so 
that's one option if you'd like to achieve consistency between driver and 
executor code.
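
A minimal sketch of what that symmetric usage might look like (illustrative 
only; assumes the file was added via sc.addFile as above):

{code}
// Use the bare file name as the key everywhere, so the same call works
// in both driver and executor code.
sc.addFile("/opt/tom/SparkFiles.sas");
String driverCopy = SparkFiles.get("SparkFiles.sas"); // on the driver
// ...and inside a task running on an executor:
// String workerCopy = SparkFiles.get("SparkFiles.sas");
{code}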




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org