[jira] [Commented] (SPARK-11046) Pass schema from R to JVM using JSON format
[ https://issues.apache.org/jira/browse/SPARK-11046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15046004#comment-15046004 ]

Nakul Jindal commented on SPARK-11046:
--------------------------------------

[~shivaram], [~sunrui] - Is it ok to depend on / import the [jsonlite|https://cran.r-project.org/web/packages/jsonlite/index.html] package?

> Pass schema from R to JVM using JSON format
> -------------------------------------------
>
>                 Key: SPARK-11046
>                 URL: https://issues.apache.org/jira/browse/SPARK-11046
>             Project: Spark
>          Issue Type: Improvement
>          Components: SparkR
>    Affects Versions: 1.5.1
>            Reporter: Sun Rui
>            Priority: Minor
>
> Currently, SparkR passes a DataFrame schema from R to the JVM backend using a
> regular expression. However, Spark now supports schemas in JSON format.
> So enhance SparkR to use the schema in JSON format.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11046) Pass schema from R to JVM using JSON format
[ https://issues.apache.org/jira/browse/SPARK-11046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15046269#comment-15046269 ]

Nakul Jindal commented on SPARK-11046:
--------------------------------------

I am trying to understand the benefit of doing this with JSON as opposed to the format currently in use. We have 3 cases:

Case 1 - Leave things the way they are. Here is what we have currently. Say our type is {{array<map<string, struct<a:integer, b:long, c:string>>>}}:
- The R function structField.character (in schema.R) is passed this exact string.
- In turn it calls checkType to recursively validate the schema string.
- The scala function SQLUtils.getSQLDataType (in SQLUtils.scala) recursively converts this to an object of type DataType.

Case 2 - Expect the user to specify the input schema in JSON. If we converted the schema format to JSON (based on what DataType.fromJson expects), it would look like this:

{code}
{
  "type": "array",
  "elementType": {
    "type": "map",
    "keyType": "string",
    "valueType": {
      "type": "struct",
      "fields": [
        { "name": "a", "type": "integer", "nullable": true, "metadata": {} },
        { "name": "b", "type": "long", "nullable": true, "metadata": {} },
        { "name": "c", "type": "string", "nullable": true, "metadata": {} }
      ]
    },
    "valueContainsNull": false
  },
  "containsNull": true
}
{code}

which places far too much burden on the SparkR user.
- I am not entirely sure about this, but I think we do not have (or simply haven't implemented) a way to communicate exceptions encountered in the scala code back to R.
- We'd need to write a way to validate the JSON schema in R code (or use a JSON parsing library to do it in some way).
- The code in SQLUtils.getSQLDataType would be significantly reduced, as we could just call DataType.fromJson.

Case 3 - Convert the schema to JSON in R code before calling the JVM function org.apache.spark.sql.api.r.SQLUtils.createStructField.
- This essentially moves the work done in SQLUtils.getSQLDataType to R code. IMHO this is significantly more complicated to write and maintain.

TLDR: At the cost of inconvenience to the SparkR user, we would change schema specification from its current (IMHO simple) form to JSON. [~shivaram], [~sunrui] - Any thoughts?
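For what it's worth, the Case 3 conversion is mechanical for the non-struct types. A rough sketch of the idea (illustrative Python rather than SparkR code; struct support, nullability options, and validation are omitted, and the function names are made up):

```python
# Hypothetical sketch of "Case 3": turn a SparkR-style type string such as
# "array<map<string,integer>>" into the JSON structure DataType.fromJson
# expects. Simplified: primitives stay bare strings, structs are not handled.

def type_to_json(s):
    s = s.strip()
    if s.startswith("array<") and s.endswith(">"):
        return {"type": "array",
                "elementType": type_to_json(s[6:-1]),
                "containsNull": True}
    if s.startswith("map<") and s.endswith(">"):
        key, value = split_top_level(s[4:-1])
        return {"type": "map",
                "keyType": type_to_json(key),
                "valueType": type_to_json(value),
                "valueContainsNull": True}
    return s  # primitive types appear as plain strings in DataType JSON

def split_top_level(s):
    # split on the first comma that is not nested inside <...>
    depth = 0
    for i, c in enumerate(s):
        if c == "<":
            depth += 1
        elif c == ">":
            depth -= 1
        elif c == "," and depth == 0:
            return s[:i], s[i + 1:]
    raise ValueError("expected two type arguments: " + s)
```

The recursion mirrors what getSQLDataType does on the scala side today, which is why moving it into R just relocates the complexity rather than removing it.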
[jira] [Commented] (SPARK-11046) Pass schema from R to JVM using JSON format
[ https://issues.apache.org/jira/browse/SPARK-11046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15042513#comment-15042513 ]

Nakul Jindal commented on SPARK-11046:
--------------------------------------

Hi, I am trying to look into this. When you say that SparkR passes a DataFrame schema from R to the JVM backend using a regular expression, do you mean this format: {{map<...>}} or {{array<...>}}?

Also, is "structField.character" the only function where this "regular expression" format is passed from R to the JVM (using "org.apache.spark.sql.api.r.SQLUtils", "createDF")?
[jira] [Commented] (SPARK-11439) Optimization of creating sparse feature without dense one
[ https://issues.apache.org/jira/browse/SPARK-11439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15007487#comment-15007487 ]

Nakul Jindal commented on SPARK-11439:
--------------------------------------

Thanks [~lewuathe]. I've also updated the comment in the LinearRegressionSuite.scala file with an R snippet to reproduce the results.

> Optimization of creating sparse feature without dense one
> ----------------------------------------------------------
>
>                 Key: SPARK-11439
>                 URL: https://issues.apache.org/jira/browse/SPARK-11439
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>            Reporter: Kai Sasaki
>            Priority: Minor
>
> Currently, the sparse features generated in {{LinearDataGenerator}} need to
> be created as dense vectors first. It would be more cost-efficient to avoid
> generating the dense feature when creating sparse features.
[jira] [Commented] (SPARK-11392) GroupedIterator's hasNext is not idempotent
[ https://issues.apache.org/jira/browse/SPARK-11392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15003184#comment-15003184 ]

Nakul Jindal commented on SPARK-11392:
--------------------------------------

Sorry, it's been a while since I last worked on this. [~yhuai] - After looking at the code, I am not entirely clear on what you mean when you say
{quote}
If we call GroupedIterator's hasNext immediately after its next, we will generate an extra group (CoGroupedIterator has this behavior).
{quote}
The title however makes sense to me - about {{hasNext}} not being idempotent. Per my understanding, {{hasNext}} in iterators should not modify the underlying iterator in general, but it does for GroupedIterator.

I can think of two things we can do to make {{hasNext}} idempotent, both of which are less than ideal:
* Eagerly evaluate the GroupedIterator - this is probably not what we want to do.
* Do the work done in {{fetchNextGroupIterator}} twice, specifically this loop: [L118-L120|https://github.com/apache/spark/blob/14d08b99085d4e609aeae0cf54d4584e860eb552/sql/core/src/main/scala/org/apache/spark/sql/execution/GroupedIterator.scala#L118-L120]
{code}
while (input.hasNext && keyOrdering.compare(currentGroup, currentRow) == 0) {
  currentRow = input.next()
}
{code}
once for {{hasNext}} and once for {{next}}. This obviously introduces some inefficiency.

*Thoughts?*

> GroupedIterator's hasNext is not idempotent
> --------------------------------------------
>
>                 Key: SPARK-11392
>                 URL: https://issues.apache.org/jira/browse/SPARK-11392
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Yin Huai
>
> If we call
> [GroupedIterator|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/GroupedIterator.scala]'s
> {{hasNext}} immediately after its {{next}}, we will generate an extra group
> ([CoGroupedIterator|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/CoGroupedIterator.scala]
> has this behavior).
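As a toy model of the caching alternative (plain Python, not the Spark implementation, and it sidesteps the real difficulty by materializing each group eagerly rather than returning lazy per-group iterators): hasNext becomes idempotent when it only inspects a cached lookahead row and never advances the underlying input itself.

```python
_SENTINEL = object()

class GroupedIterator:
    # Groups consecutive rows sharing a key, like Spark's GroupedIterator,
    # except has_next only peeks at the cached lookahead row: calling it
    # repeatedly never consumes input, so it is idempotent.
    def __init__(self, rows, key):
        self._it = iter(rows)
        self._key = key
        self._lookahead = next(self._it, _SENTINEL)

    def has_next(self):
        # no side effects: only reads cached state
        return self._lookahead is not _SENTINEL

    def next_group(self):
        if not self.has_next():
            raise StopIteration
        k = self._key(self._lookahead)
        group = []
        # advance the input here, not in has_next
        while self._lookahead is not _SENTINEL and self._key(self._lookahead) == k:
            group.append(self._lookahead)
            self._lookahead = next(self._it, _SENTINEL)
        return k, group
```

The trade-off is the one noted above: making each group a concrete list is a form of eager evaluation, which the lazy per-group-iterator design was trying to avoid.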
[jira] [Commented] (SPARK-11439) Optimization of creating sparse feature without dense one
[ https://issues.apache.org/jira/browse/SPARK-11439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14994861#comment-14994861 ]

Nakul Jindal commented on SPARK-11439:
--------------------------------------

This is the piece of R code that is used as reference for the test:

{code}
predictions <- predict(fit, newx=features)
residuals <- label - predictions
mean(residuals^2)          # MSE
mean(abs(residuals))       # MAD
cor(predictions, label)^2  # r^2
{code}

How do I create the "fit" object?

NOTE: I have no experience with R and have scrounged whatever little knowledge I could get by asking around and from the internet. I tried this, in a Spark REPL:

{code}
import org.apache.spark.mllib.util.LinearDataGenerator
val data = sc.parallelize(LinearDataGenerator.generateLinearInput(6.3, Array(4.7, 7.2), Array(0.9, -1.3), Array(0.7, 1.2), 1, 42, 0.1), 2)
data.map(x => x.label + ", " + x.features(0) + ", " + x.features(1)).coalesce(1).saveAsTextFile("path")
{code}

Then, in an R shell:

{code}
library("glmnet")
d1 <- read.csv("path/part-0", header=FALSE, stringsAsFactors=FALSE)
features <- as.matrix(data.frame(as.numeric(d1$V2), as.numeric(d1$V3)))
label <- as.numeric(d1$V1)
fit <- glmnet(features, label, family="gaussian", alpha = 0, lambda = 0)
{code}

I then used this fit object in the earlier snippet of R code. The results were way off:

{code}
> mean(residuals^2)
[1] 10885.15
> mean(abs(residuals))
[1] 103.959
> cor(predictions, label)^2
          [,1]
s0   0.9998749
{code}

So, I guess, that is not how you create the "fit" object. How do you create the "fit" object?
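For reference, the three numbers that R snippet reports are plain definitions and can be checked independently of glmnet. A small sketch (illustrative Python with made-up data; nothing Spark- or glmnet-specific):

```python
# Mirrors the metrics in the R snippet above: MSE, MAD, and squared Pearson
# correlation (r^2) between predictions and labels.

def regression_metrics(predictions, labels):
    residuals = [l - p for p, l in zip(predictions, labels)]
    n = len(residuals)
    mse = sum(r * r for r in residuals) / n          # mean(residuals^2)
    mad = sum(abs(r) for r in residuals) / n         # mean(abs(residuals))
    # cor(predictions, label)^2
    mp = sum(predictions) / n
    ml = sum(labels) / n
    cov = sum((p - mp) * (l - ml) for p, l in zip(predictions, labels))
    vp = sum((p - mp) ** 2 for p in predictions)
    vl = sum((l - ml) ** 2 for l in labels)
    r2 = cov * cov / (vp * vl)
    return mse, mad, r2
```

A near-perfect r^2 alongside a huge MSE, as in the output above, usually means the predictions track the labels but are offset or scaled, which points at the intercept or coefficient handling in the fit rather than at the metrics.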
[jira] [Commented] (SPARK-11439) Optimization of creating sparse feature without dense one
[ https://issues.apache.org/jira/browse/SPARK-11439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14992828#comment-14992828 ]

Nakul Jindal commented on SPARK-11439:
--------------------------------------

I seem to be running into a problem.
1. [This|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/util/LinearDataGenerator.scala#L124-L165] is the current implementation.
2. [This|https://gist.github.com/nakul02/9341a9ed67cd192d98df] is the implementation that I tried first (and it passed all tests).
3. [This|https://gist.github.com/nakul02/4f5392c7d5997871da7b] is an improved implementation that doesn't form the "x" array, but it fails tests in these suites:
* org.apache.spark.ml.regression.LinearRegressionSuite
* org.apache.spark.ml.evaluation.RegressionEvaluatorSuite

The difference between 2 and 3 is the way in which the random number generator is used. Could this possibly cause the tests to fail? Maybe I am doing something obviously stupid here. This is frustrating and any insight would help!
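Changing how the generator is used can indeed do it on its own: a seeded RNG produces one fixed stream, so drawing the same values in a different order (or interleaving) changes every value that follows, even when the arithmetic is otherwise identical. A tiny illustration (plain Python, nothing Spark-specific):

```python
import random

# Two "equivalent" computations that differ only in the order of RNG draws.

def draws_a(seed, n):
    # interleaved: uniform, gaussian, uniform, gaussian, ...
    rng = random.Random(seed)
    return [rng.random() + rng.gauss(0, 1) for _ in range(n)]

def draws_b(seed, n):
    # batched: all uniforms first, then all gaussians
    rng = random.Random(seed)
    us = [rng.random() for _ in range(n)]
    gs = [rng.gauss(0, 1) for _ in range(n)]
    return [u + g for u, g in zip(us, gs)]
```

So if gist 3 reorders or batches the draws relative to gist 2, seeded tests that assert on specific numeric results (like the LinearRegressionSuite tolerances) will see different data and fail, without anything being logically wrong.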
[jira] [Commented] (SPARK-11439) Optimization of creating sparse feature without dense one
[ https://issues.apache.org/jira/browse/SPARK-11439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14993281#comment-14993281 ]

Nakul Jindal commented on SPARK-11439:
--------------------------------------

{code}
[info] - linear regression model training summary *** FAILED *** (966 milliseconds)
[info]   Expected 0.009955579236410212 and 0.00972035 to be within 1.0E-5 using relative tolerance. (TestingUtils.scala:78)
[info]   org.scalatest.exceptions.TestFailedException:
[info]   at org.apache.spark.mllib.util.TestingUtils$DoubleWithAlmostEquals.$tilde$eq$eq(TestingUtils.scala:78)
[info]   at org.apache.spark.ml.regression.LinearRegressionSuite$$anonfun$11$$anonfun$apply$mcV$sp$9.apply(LinearRegressionSuite.scala:606)
[info]   at org.apache.spark.ml.regression.LinearRegressionSuite$$anonfun$11$$anonfun$apply$mcV$sp$9.apply(LinearRegressionSuite.scala:559)
  ...
{code}
[jira] [Commented] (SPARK-11439) Optimization of creating sparse feature without dense one
[ https://issues.apache.org/jira/browse/SPARK-11439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14990273#comment-14990273 ]

Nakul Jindal commented on SPARK-11439:
--------------------------------------

Yes, this sounds good. Also, for the sake of uniformity, it would make sense to convert the other blas.ddot call to the one from BLAS.scala.
[jira] [Commented] (SPARK-11439) Optimization of creating sparse feature without dense one
[ https://issues.apache.org/jira/browse/SPARK-11439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985890#comment-14985890 ]

Nakul Jindal commented on SPARK-11439:
--------------------------------------

I will work on this.
[jira] [Commented] (SPARK-11439) Optimization of creating sparse feature without dense one
[ https://issues.apache.org/jira/browse/SPARK-11439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14986684#comment-14986684 ]

Nakul Jindal commented on SPARK-11439:
--------------------------------------

[~holdenk] [~lewuathe] - A couple of places where there could be work savings:
1. [L144|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/util/LinearDataGenerator.scala#L144] - Here is where sparsity data is first populated. The index array and values array could be maintained and populated at this line. The problem is that this won't sit well with blas.ddot at line [L153|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/util/LinearDataGenerator.scala#L153]. Either a new weights array would need to be created or the ddot function would need to be rewritten.
2. [L162|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/util/LinearDataGenerator.scala#L162] - If done here, we would essentially be doing what toSparse does internally.

Neither of these options makes sense to me. Suggestions on what direction to take?
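To make option 1 concrete, the shape of the idea is roughly the following (illustrative Python, not the Scala code; the function name is made up, and the intercept and label-noise terms of the generator are left out). The sparse (indices, values) arrays are built directly while sampling, and the label is computed with a dot product that walks only the stored entries, so no dense feature array is ever formed:

```python
import random

def sparse_point(weights, density, rng):
    # Build the sparse representation directly: no dense array, no toSparse.
    indices, values = [], []
    for j in range(len(weights)):
        if rng.random() < density:      # feature j is present
            indices.append(j)
            values.append(rng.random())
    # Sparse-dense dot product replacing blas.ddot: touches only non-zeros.
    label = sum(weights[j] * v for j, v in zip(indices, values))
    return label, indices, values
```

This is the "rewrite the ddot" branch of option 1: instead of creating a new weights array to feed blas.ddot, the dot product itself is specialized to sparse-times-dense, which is what BLAS.scala's Vector-level dot already does for SparseVector inputs.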
[jira] [Commented] (SPARK-11392) GroupedIterator's hasNext is not idempotent
[ https://issues.apache.org/jira/browse/SPARK-11392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14983487#comment-14983487 ]

Nakul Jindal commented on SPARK-11392:
--------------------------------------

[~yhuai], [~cloud_fan] - SPARK-11393 works around the problem mentioned in this JIRA. Would we need to revert the changes made by the associated PR if this JIRA were to be resolved?
[jira] [Commented] (SPARK-11392) GroupedIterator's hasNext is not idempotent
[ https://issues.apache.org/jira/browse/SPARK-11392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14982949#comment-14982949 ]

Nakul Jindal commented on SPARK-11392:
--------------------------------------

I will work on this.
[jira] [Commented] (SPARK-11385) Add foreach API to MLLib's vector API
[ https://issues.apache.org/jira/browse/SPARK-11385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14978991#comment-14978991 ]

Nakul Jindal commented on SPARK-11385:
--------------------------------------

I'll be working on this.

> Add foreach API to MLLib's vector API
> --------------------------------------
>
>                 Key: SPARK-11385
>                 URL: https://issues.apache.org/jira/browse/SPARK-11385
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>            Reporter: holdenk
>            Priority: Minor
>
> Add a foreach API to MLLib's vector.
[jira] [Commented] (SPARK-11386) Refactor appropriate uses of Vector to use the new foreach API
[ https://issues.apache.org/jira/browse/SPARK-11386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14978980#comment-14978980 ]

Nakul Jindal commented on SPARK-11386:
--------------------------------------

I'll be working on this.

> Refactor appropriate uses of Vector to use the new foreach API
> ---------------------------------------------------------------
>
>                 Key: SPARK-11386
>                 URL: https://issues.apache.org/jira/browse/SPARK-11386
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>            Reporter: holdenk
>            Priority: Minor
>
> Once SPARK-11385 (Add foreach API to MLLib's vector API) is in, look for
> places where it should be used internally.
[jira] [Commented] (SPARK-11332) WeightedLeastSquares should use ml features generic Instance class instead of private
[ https://issues.apache.org/jira/browse/SPARK-11332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14975436#comment-14975436 ]

Nakul Jindal commented on SPARK-11332:
--------------------------------------

I'll be working on this.

> WeightedLeastSquares should use ml features generic Instance class instead of
> private
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-11332
>                 URL: https://issues.apache.org/jira/browse/SPARK-11332
>             Project: Spark
>          Issue Type: Improvement
>            Reporter: holdenk
>            Priority: Minor
>
> WeightedLeastSquares should use the common Instance class in ml.feature
> instead of a private one.
[jira] [Comment Edited] (SPARK-10436) spark-submit overwrites spark.files defaults with the job script filename
[ https://issues.apache.org/jira/browse/SPARK-10436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14941225#comment-14941225 ]

Nakul Jindal edited comment on SPARK-10436 at 10/2/15 3:01 PM:
---------------------------------------------------------------

I am new to Spark and will be working on this.

was (Author: nakul02):
I am new to Spark and will take a look at it too.

> spark-submit overwrites spark.files defaults with the job script filename
> --------------------------------------------------------------------------
>
>                 Key: SPARK-10436
>                 URL: https://issues.apache.org/jira/browse/SPARK-10436
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Submit
>    Affects Versions: 1.4.0
>         Environment: Ubuntu, Spark 1.4.0 Standalone
>            Reporter: axel dahl
>            Priority: Minor
>              Labels: easyfix, feature
>
> In my spark-defaults.conf I have configured a set of libraries to be
> uploaded to my Spark 1.4.0 Standalone cluster. The entry appears as:
> spark.files  libarary.zip,file1.py,file2.py
> When I execute spark-submit -v test.py
> I see that spark-submit reads the defaults correctly, but that it overwrites
> the "spark.files" default entry and replaces it with the name of the job
> script, i.e. "test.py".
> This behavior doesn't seem intuitive. test.py should be added to the spark
> working folder, but it should not overwrite the "spark.files" defaults.
[jira] [Commented] (SPARK-10436) spark-submit overwrites spark.files defaults with the job script filename
[ https://issues.apache.org/jira/browse/SPARK-10436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14941225#comment-14941225 ]

Nakul Jindal commented on SPARK-10436:
--------------------------------------

I am new to Spark and will take a look at it too.