[jira] [Commented] (SPARK-13634) Assigning spark context to variable results in serialization error
[ https://issues.apache.org/jira/browse/SPARK-13634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15381236#comment-15381236 ]

Rahul Palamuttam commented on SPARK-13634:
------------------------------------------

Understood, and thank you for explaining. I agree that it is fairly implicit that context-like objects cannot be serialized, but it is a little strange when the object gets pulled in without the user writing any code that explicitly does so (in the shell). I agree with your latter point as well, and will take it into consideration. It could just be too specific to our use case.

> Assigning spark context to variable results in serialization error
> ------------------------------------------------------------------
>
>                 Key: SPARK-13634
>                 URL: https://issues.apache.org/jira/browse/SPARK-13634
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Shell
>            Reporter: Rahul Palamuttam
>            Priority: Minor
>
> The following lines of code cause a task serialization error when executed
> in the spark-shell. Note that the error does not occur when submitting the
> code as a batch job via spark-submit.
>
> val temp = 10
> val newSC = sc
> val newRDD = newSC.parallelize(0 to 100).map(p => p + temp)
>
> For some reason, when temp is pulled into the referencing environment of
> the closure, so is the SparkContext.
> We originally hit this issue in the SciSpark project, when referencing a
> string variable inside a lambda expression in RDD.map(...).
> Any insight into how this could be resolved would be appreciated.
> While the above code is trivial, SciSpark uses a wrapper around the
> SparkContext to read from various file formats. We want to keep this class
> structure and also use it in notebook and shell environments.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
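The failing snippet from the description, together with the {{@transient}} workaround discussed in this thread, can be sketched as a spark-shell session. This is a sketch, not standalone code: it assumes a running spark-shell, where {{sc}} is the SparkContext the shell creates for you, and the name {{safeSC}} is mine, chosen to avoid clashing with {{newSC}} above.

{code}
// In the spark-shell; `sc` is the shell-provided SparkContext.
val temp = 10

// Aliasing sc turns it into a field of the REPL's generated wrapper object.
// A closure defined in the shell can then drag that wrapper -- and with it
// the non-serializable SparkContext -- into the task closure:
val newSC = sc
// newSC.parallelize(0 to 100).map(p => p + temp)   // Task not serializable

// Marking the alias @transient keeps it out of serialized closures:
@transient val safeSC = sc
val rdd = safeSC.parallelize(0 to 100).map(p => p + temp)
{code}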
[jira] [Commented] (SPARK-13634) Assigning spark context to variable results in serialization error
[ https://issues.apache.org/jira/browse/SPARK-13634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15381169#comment-15381169 ]

Sean Owen commented on SPARK-13634:
-----------------------------------

Go ahead, though in general I think it's pretty implicit that you can't serialize context-like objects anywhere. This may in fact be just a hack, and you need to redesign your code so that objects that are sent around do not capture a context object to begin with. Your use case is not normal shell usage; you're writing a custom framework. You can suggest doc changes (in a PR); just consider what is quite specific to your usage versus what is likely widely applicable enough to go in the docs.
[jira] [Commented] (SPARK-13634) Assigning spark context to variable results in serialization error
[ https://issues.apache.org/jira/browse/SPARK-13634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15381168#comment-15381168 ]

Rahul Palamuttam commented on SPARK-13634:
------------------------------------------

Kai Chen, thank you, and I apologize for not responding sooner. This does resolve our issue.

As a little background: we use a wrapper class around the SparkContext, and setting the SparkContext variable inside the class to transient did not resolve our issue. Instead, attaching the @transient tag to an instance of the wrapper class did.

Before:
{code}
val SciSc = new SciSparkContext(sc)
{code}
After:
{code}
@transient val SciSc = new SciSparkContext(sc)
{code}

We use the SciSparkContext wrapper to delegate to functions like binaryFiles to read file formats like NetCDF, while abstracting away the details of actually reading that format.

Sean Owen and Chris A. Mattmann - thank you for allowing the JIRA to be re-opened. I would like to resolve the issue, but first I did want to point out that I didn't see much, if any, documentation on this issue. I was looking at the quick start here: http://spark.apache.org/docs/latest/quick-start.html#interactive-analysis-with-the-spark-shell (I may have just missed it elsewhere). The spark-shell as a mode of interacting with Spark seems to be becoming more common, especially with notebook projects like Zeppelin (which we are using). I do think this is worth pointing out and mentioning, even if it is really an issue with Scala. If we are in agreement, I would like to change this JIRA to a documentation JIRA and submit the patch (I've never submitted a doc patch, and it would be a nice experience for me). I'll also respond sooner next time.
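A minimal sketch of the wrapper pattern described above (the class and method names here are illustrative stand-ins; the real SciSparkContext lives in the SciSpark repository):

{code}
import org.apache.spark.SparkContext

// Illustrative stand-in for a SciSparkContext-style wrapper. Note that
// marking the wrapped context @transient *inside* the class reportedly
// did not help in this case; annotating the shell-bound instance did.
class SparkContextWrapper(@transient val sc: SparkContext) extends Serializable {
  // Hypothetical delegate: read a custom file format via binaryFiles.
  def readBinary(path: String) = sc.binaryFiles(path)
}

// In the spark-shell:
@transient val wrapped = new SparkContextWrapper(sc)
{code}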
[jira] [Commented] (SPARK-13634) Assigning spark context to variable results in serialization error
[ https://issues.apache.org/jira/browse/SPARK-13634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15277427#comment-15277427 ]

Kai Chen commented on SPARK-13634:
----------------------------------

[~Rahul Palamuttam] and [~chrismattmann]

Try
{code}
@transient val newSC = sc
{code}
in the REPL to prevent the SparkContext from being dragged into the serialization graph. Cheers!
[jira] [Commented] (SPARK-13634) Assigning spark context to variable results in serialization error
[ https://issues.apache.org/jira/browse/SPARK-13634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15184597#comment-15184597 ]

Chris A. Mattmann commented on SPARK-13634:
-------------------------------------------

Sean, thanks for your reply. We can agree to disagree on the semantics. I've been doing open source for a long time, and leaving JIRAs open for longer than 43 minutes is not damaging by any means. As a former Spark mentor during its Incubation, and its Champion, I also disagree; I was involved in Spark from its early inception here at the ASF, and have not always seen this type of behavior, which is why it's troubling to me. Your comparison of one end of the spectrum (10 JIRAs) to thousands, in size and activity, also leaves a bit of a sour taste in my mouth. I know Spark gets lots of activity. So do many of the projects I've helped start and contribute to (Hadoop, Lucene/Solr, Nutch during its heyday, etc.). I left JIRAs open for longer than 43 minutes in those projects, as did many others wiser than me who have been around a lot longer in open source.

Thanks for taking the time to think through what may be causing it. I'll choose to take the positive away from your reply, and try to report back more on our workarounds in SciSpark and on our project.

--Chris
[jira] [Commented] (SPARK-13634) Assigning spark context to variable results in serialization error
[ https://issues.apache.org/jira/browse/SPARK-13634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15184563#comment-15184563 ]

Sean Owen commented on SPARK-13634:
-----------------------------------

JIRAs can be reopened, and should be if there's a change, like: you have a pull request to propose, or a different example, or more analysis that suggests it's not just a Scala REPL thing. People can still comment on JIRAs too. All else equal, a reply in 43 minutes is a good thing.

While I can appreciate that, ideally, we'd always let the reporter explicitly confirm they're done or something, that's not feasible in this project. On average a JIRA is opened every _hour_, many of which never receive any follow-up. Leaving them open is damaging too, since people inevitably parse that as "legitimate issue I should work on or wait on". If I see a quite-likely answer, I'd rather reflect it in JIRA, and once in a while it gets overturned, since reopening is a normal, lightweight operation that can be performed by the reporter. Further, the reality is that about half of those JIRAs are not problems: badly described, poorly researched, etc. (not this one), and they actually _need_ rapid pushback, with pointers to the contribution guide, to discourage more of that behavior. This is why some things get resolved fast in general; the intent is to put limited time to the best use for the most people, and to get most people some quick feedback. I understand it's not how a project with 10 JIRAs a month probably operates, but I disagree that my reply was wrong or impolite.

Instead, I'd certainly welcome materially more information and a proposed change if you want to pursue and reopen this. For example, off the top of my head: does the ClosureCleaner specially treat {{sc}}? It may do so because there isn't supposed to be a second context in the application.

However, if this is your real code, I strongly suspect you have a simple workaround in refactoring the third line into a function on an {{object}} (i.e. static). The layer of indirection, or something similar, likely avoids tripping on this. This is what I've suggested you pursue next. If that works, that's great info to paste here, at least as confirmation. Or if not, add it here anyway to show what else doesn't work.
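The object-indirection workaround suggested above, hoisting the mapped function onto a standalone {{object}} so the closure no longer closes over REPL state, might look like the following. This is a sketch under that assumption, not the actual SciSpark code; {{AddOffset}} is a hypothetical name.

{code}
// Hypothetical refactoring: the function lives on an object, so the task
// should serialize only the small function value returned by AddOffset(temp),
// not the shell wrapper object that holds newSC.
object AddOffset {
  def apply(offset: Int)(p: Int): Int = p + offset
}

val temp = 10
val newSC = sc
val rdd = newSC.parallelize(0 to 100).map(AddOffset(temp))
{code}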
[jira] [Commented] (SPARK-13634) Assigning spark context to variable results in serialization error
[ https://issues.apache.org/jira/browse/SPARK-13634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15184536#comment-15184536 ]

Chris A. Mattmann commented on SPARK-13634:
-------------------------------------------

I'm CC'ed because I'm the PI of the SciSpark project, and I asked Rahul to file this issue here. It's not a toy example; it's a real example from our system. We have a workaround, but were wondering if Apache Spark had thought of anything better or seen something similar. Our code is here: https://github.com/Scispark/scispark/

The question I was asking was related to etiquette. I don't think it's good etiquette to close tickets on which the reporter has weighed in. This was closed literally in 43 minutes, without even waiting for Rahul to chime back in. Is it really that urgent to close an issue a user has reported, without hearing back from them to see if your suggestion helped or answered their question?
[jira] [Commented] (SPARK-13634) Assigning spark context to variable results in serialization error
[ https://issues.apache.org/jira/browse/SPARK-13634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15184505#comment-15184505 ]

Sean Owen commented on SPARK-13634:
-----------------------------------

Chris, I resolved this as a duplicate of an issue that's "WontFix". I'm not suggesting there is a resolution in Spark. The implicit workaround here is to not declare newSC, of course. There may be others, and that may matter, since I suspect this is just a toy example. Without seeing real code, I couldn't say more about other workarounds. I'm not sure why you were CC'ed, but what are you taking issue with?
[jira] [Commented] (SPARK-13634) Assigning spark context to variable results in serialization error
[ https://issues.apache.org/jira/browse/SPARK-13634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15184442#comment-15184442 ]

Chris A. Mattmann commented on SPARK-13634:
-------------------------------------------

Hi [~srowen], it would have been nice to make sure this resolves [~Rahul Palamuttam]'s issue before closing it. Isn't that simply good etiquette?
[jira] [Commented] (SPARK-13634) Assigning spark context to variable results in serialization error
[ https://issues.apache.org/jira/browse/SPARK-13634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15177093#comment-15177093 ]

Rahul Palamuttam commented on SPARK-13634:
------------------------------------------

[~chrismattmann]