[jira] [Commented] (SPARK-6235) Address various 2G limits
[ https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17017438#comment-17017438 ]

Samuel Shepard commented on SPARK-6235:

[~irashid] I followed your suggestion of looking in the user archive and [found an old PR|https://github.com/apache/spark/pull/17907] that tried to fix the PCA call itself. It was closed, but I linked it back here. [~srowen] is also on the thread. I leave this comment as much to help direct users to a workaround as to encourage a future fix. Thanks for all you guys do.

> Address various 2G limits
> -------------------------
>
> Key: SPARK-6235
> URL: https://issues.apache.org/jira/browse/SPARK-6235
> Project: Spark
> Issue Type: Umbrella
> Components: Shuffle, Spark Core
> Reporter: Reynold Xin
> Priority: Major
> Fix For: 2.4.0
>
> Attachments: SPARK-6235_Design_V0.02.pdf
>
>
> An umbrella ticket to track the various 2G limit we have in Spark, due to the
> use of byte arrays and ByteBuffers.

--
This message was sent by Atlassian Jira (v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6235) Address various 2G limits
[ https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16998577#comment-16998577 ]

Imran Rashid commented on SPARK-6235:

[~sammysheep] Spark's ML library uses the same jira project, if that is what you meant -- but I don't know what specifically has already been implemented in Spark to deal with large PCA, or whether there is another specific issue just for that. I'd suggest you first ask u...@spark.apache.org, since the first question is whether there is another way of dealing with this.
[jira] [Commented] (SPARK-6235) Address various 2G limits
[ https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16997798#comment-16997798 ]

Samuel Shepard commented on SPARK-6235:

[~irashid] I meant the former (task result > 2G), as best I understand the architecture. Is there a different Jira for the ML library, since it affects PCA, that would be more appropriate? Thanks for the suggestions.

Spark is a beautiful system with a lot of kind effort put into it. Computational biology has huge feature spaces all over the place. The two could really work well together, I think. This issue feels like some sort of leftover from 32-bit Java, cramping Spark's style. :(
[jira] [Commented] (SPARK-6235) Address various 2G limits
[ https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16997631#comment-16997631 ]

Imran Rashid commented on SPARK-6235:

[~sammysheep] Are you discussing the use case for task results > 2G? Or large records? Or did you mean one of the parts that was supposed to be fixed in the plan above?

I don't deny there is _some_ use for large task results -- I just haven't heard much demand for it (in fact you're the first person I've heard from). Given that, I don't expect to see it fixed immediately. You could open another jira, though honestly for the moment I think it would be more of a place for folks to voice their interest.

(I'm pretty sure nothing has changed since 2.4.0 on what is fixed and what is not.)
[jira] [Commented] (SPARK-6235) Address various 2G limits
[ https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16995889#comment-16995889 ]

Samuel Shepard commented on SPARK-6235:

One use case could be fetching large results to the driver when computing PCA on large square matrices (e.g., distance matrices, similar to Classical MDS). This is very helpful in bioinformatics. Sorry if this is already fixed past 2.4.0...
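To give a rough sense of when a square-matrix result runs into the limit discussed in this thread, here is a small arithmetic sketch. It assumes a dense n x n matrix of 8-byte doubles with no serialization overhead, which is a simplification; the class and method names are hypothetical and not part of Spark.

```java
// Illustrative arithmetic only; DenseMatrixSize is a hypothetical helper, not a Spark class.
final class DenseMatrixSize {

    /** Bytes needed for a dense n x n matrix of 8-byte doubles, using overflow-checked long math. */
    static long denseBytes(long n) {
        return Math.multiplyExact(Math.multiplyExact(n, n), (long) Double.BYTES);
    }

    /** True if the payload cannot fit in a single int-indexed byte[] or ByteBuffer. */
    static boolean exceedsIntIndexedBuffer(long bytes) {
        return bytes > Integer.MAX_VALUE;
    }

    public static void main(String[] args) {
        for (long n : new long[] {10_000L, 16_383L, 16_384L, 50_000L}) {
            long bytes = denseBytes(n);
            System.out.printf("n=%d -> %d bytes, over the 2G limit: %b%n",
                n, bytes, exceedsIntIndexedBuffer(bytes));
        }
    }
}
```

Under these assumptions the crossover sits between n = 16,383 and n = 16,384, since 16,384 x 16,384 x 8 bytes is exactly 2^31 bytes, one more than Integer.MAX_VALUE. Real task results carry extra framing, so the practical threshold is likely somewhat lower.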
[jira] [Commented] (SPARK-6235) Address various 2G limits
[ https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16820309#comment-16820309 ]

Imran Rashid commented on SPARK-6235:

[~glal14] Actually, this was fixed in 2.4. There was one open issue, SPARK-24936, but I just closed that, as it's just improving an error message, which I think isn't really worth fixing just for Spark 3.0. So I also resolved this umbrella.
[jira] [Commented] (SPARK-6235) Address various 2G limits
[ https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16820229#comment-16820229 ]

Gowtam Lal commented on SPARK-6235:

It would be great to see this go out. Any updates?
[jira] [Commented] (SPARK-6235) Address various 2G limits
[ https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16484069#comment-16484069 ]

Imran Rashid commented on SPARK-6235:

Would be nice to find a better home for this, but for now I wanted to share the test code I'm running to see if there are cases I'm missing: https://github.com/squito/spark_2gb_test/blob/master/src/main/scala/com/cloudera/sparktest/LargeBlocks.scala
[jira] [Commented] (SPARK-6235) Address various 2G limits
[ https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16482901#comment-16482901 ]

Imran Rashid commented on SPARK-6235:

[~tgraves] WAL -- write-ahead log for receiver-based streaming. This wouldn't affect a streaming source like the KafkaDirectDstream, which isn't receiver based. It might not be that hard to fix this, but I don't know this code that well, and I don't think it's nearly so important.

I've also seen records larger than 2 GB. Actually, this would probably be a good thing to support eventually as well. But I don't think it's as important; I just want to put it out of scope here.

For task results, I mean the results sent back to the driver in an action, from each partition. It would be hard to imagine that working if RDD records couldn't be greater than 2 GB in general; I just thought it was worth calling out, as I've seen users try to send back large results. A compelling use case might be if you're updating a statistical model in memory in your RDD action, and you want to send back the updates in a reduce to merge the updates together.
[jira] [Commented] (SPARK-6235) Address various 2G limits
[ https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16482746#comment-16482746 ]

Thomas Graves commented on SPARK-6235:

>> Still unsupported:
>> * large task results
>> * large blocks in the WAL
>> * individual records larger than 2 GB

Can you clarify what WAL is?

I have seen individual records larger than 2 GB, though I don't think it's as common.

Also, can you clarify large task results?
[jira] [Commented] (SPARK-6235) Address various 2G limits
[ https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16482699#comment-16482699 ]

Imran Rashid commented on SPARK-6235:

Linked a [design doc|https://docs.google.com/document/d/1ZialnQ0RSOkyYYND7nU609NJYBC6lnhS4xyl2YqG03A/edit?usp=sharing]
[jira] [Commented] (SPARK-6235) Address various 2G limits
[ https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16477986#comment-16477986 ]

Imran Rashid commented on SPARK-6235:

Derp, I missed a pretty basic case -- if you cache a large block in memory, then any remote request (either for replication or a remote read) will fail also. After a bit more testing I'll file some more jiras ...
[jira] [Commented] (SPARK-6235) Address various 2G limits
[ https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16477827#comment-16477827 ]

Imran Rashid commented on SPARK-6235:

I've been testing the current state of the 2GB block limit, and I think this has actually been whittled away quite a bit, with the appropriate configs, in particular setting "spark.maxRemoteBlockSizeFetchToMem" to something less than 2 GB so that large data is fetched to disk, rather than directly to memory. The only major thing which is not supported is replicating cached RDD blocks that are larger than 2 GB -- I plan to address that.

Still unsupported:
* large task results
* large blocks in the WAL
* individual records larger than 2 GB

None of those seem particularly important, so I do not plan on addressing them (though users can speak up if I'm mistaken).

I do not see a compelling reason to enable a fetch directly to memory for blocks larger than 2 GB. (If there is a reason, let's just open a separate issue for that, as I wouldn't call it a bug fix.) So I intend to also open a jira to change the default value of spark.maxRemoteBlockSizeFetchToMem to something just under 2 GB, as those requests were doomed to fail anyway.

I'll post the test Spark jobs I used here in a bit (I need to do a little cleanup first), to get some feedback on whether I'm missing cases.
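As a rough illustration of "something just under 2 GB" for spark.maxRemoteBlockSizeFetchToMem, the sketch below derives a byte threshold below Integer.MAX_VALUE and formats it as a spark-submit style argument. The helper class, the 512-byte safety margin, and the assumption that the setting accepts a plain byte count are illustrative choices of this note, not something stated in the comment above.

```java
// Hypothetical helper for illustration; only the config key comes from the thread.
final class FetchToMemThreshold {

    static final String KEY = "spark.maxRemoteBlockSizeFetchToMem";

    /** Largest int-indexable size minus a caller-chosen safety margin, in bytes. */
    static long justUnderTwoGiB(int marginBytes) {
        if (marginBytes < 0) {
            throw new IllegalArgumentException("marginBytes must be non-negative");
        }
        return (long) Integer.MAX_VALUE - marginBytes;
    }

    /** Formats the threshold as a "--conf key=value" argument (assumes a plain byte count is accepted). */
    static String asSubmitArg(long thresholdBytes) {
        return "--conf " + KEY + "=" + thresholdBytes;
    }

    public static void main(String[] args) {
        // The 512-byte margin is an assumed value leaving headroom for framing overhead.
        long threshold = justUnderTwoGiB(512);
        System.out.println(asSubmitArg(threshold));
    }
}
```

Blocks larger than the chosen threshold would then be fetched to disk rather than into a single in-memory buffer, which is the behavior the comment relies on.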
[jira] [Commented] (SPARK-6235) Address various 2G limits
[ https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16469991#comment-16469991 ]

Cyanny commented on SPARK-6235:

Hi, when will this Jira feature be included in a Spark release? [~rxin]
[jira] [Commented] (SPARK-6235) Address various 2G limits
[ https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16009808#comment-16009808 ]

Jamie Hutton commented on SPARK-6235:

Hi there, Is there any update on when this will be included in a spark release?
[jira] [Commented] (SPARK-6235) Address various 2G limits
[ https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15515042#comment-15515042 ]

Guoqiang Li commented on SPARK-6235:

ping [~rxin]
[jira] [Commented] (SPARK-6235) Address various 2G limits
[ https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15508670#comment-15508670 ]

Guoqiang Li commented on SPARK-6235:

[~rxin] Any comments?
[jira] [Commented] (SPARK-6235) Address various 2G limits
[ https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15470410#comment-15470410 ]

Apache Spark commented on SPARK-6235:

User 'witgo' has created a pull request for this issue: https://github.com/apache/spark/pull/14995
[jira] [Commented] (SPARK-6235) Address various 2G limits
[ https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15467563#comment-15467563 ] Apache Spark commented on SPARK-6235: - User 'witgo' has created a pull request for this issue: https://github.com/apache/spark/pull/14977 > Address various 2G limits > - > > Key: SPARK-6235 > URL: https://issues.apache.org/jira/browse/SPARK-6235 > Project: Spark > Issue Type: Umbrella > Components: Shuffle, Spark Core >Reporter: Reynold Xin > Attachments: SPARK-6235_Design_V0.01.pdf > > > An umbrella ticket to track the various 2G limit we have in Spark, due to the > use of byte arrays and ByteBuffers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6235) Address various 2G limits
[ https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15421048#comment-15421048 ]

Guoqiang Li commented on SPARK-6235:

Yes, it contains a lot of minor changes, e.g., replacing ByteBuffer with ChunkedByteBuffer.
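For readers unfamiliar with the idea behind swapping a single ByteBuffer for a chunked one, here is a minimal sketch of the general technique: the data is kept as a list of fixed-size chunks, and the total length is tracked as a long, so it is not bounded by one int-indexed buffer. This is not Spark's ChunkedByteBuffer; the class name, methods, and chunking policy are assumptions made for illustration.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a chunked buffer; not Spark's ChunkedByteBuffer. */
final class SimpleChunkedBuffer {
    private final List<ByteBuffer> chunks = new ArrayList<>();
    private final int chunkSize;
    private long size = 0L; // long, so the total may exceed Integer.MAX_VALUE

    SimpleChunkedBuffer(int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        this.chunkSize = chunkSize;
    }

    /** Appends bytes, allocating a new chunk whenever the last one is full. */
    void write(byte[] src) {
        int offset = 0;
        while (offset < src.length) {
            ByteBuffer last = chunks.isEmpty() ? null : chunks.get(chunks.size() - 1);
            if (last == null || !last.hasRemaining()) {
                last = ByteBuffer.allocate(chunkSize);
                chunks.add(last);
            }
            int n = Math.min(last.remaining(), src.length - offset);
            last.put(src, offset, n);
            offset += n;
            size += n;
        }
    }

    /** Reads one byte at a long position by locating its chunk and the offset within it. */
    byte get(long position) {
        if (position < 0 || position >= size) {
            throw new IndexOutOfBoundsException("position " + position);
        }
        int chunkIndex = (int) (position / chunkSize);
        int within = (int) (position % chunkSize);
        return chunks.get(chunkIndex).get(within);
    }

    long size() { return size; }

    int chunkCount() { return chunks.size(); }

    public static void main(String[] args) {
        SimpleChunkedBuffer buf = new SimpleChunkedBuffer(4);
        buf.write(new byte[] {0, 1, 2, 3, 4, 5, 6, 7, 8, 9});
        System.out.println("size=" + buf.size() + ", chunks=" + buf.chunkCount()
            + ", byte at 9=" + buf.get(9L));
    }
}
```

The point of the design is that no single allocation has to hold the whole payload; each chunk stays well within what a byte array can address.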
[jira] [Commented] (SPARK-6235) Address various 2G limits
[ https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15421018#comment-15421018 ]

Apache Spark commented on SPARK-6235:

User 'witgo' has created a pull request for this issue: https://github.com/apache/spark/pull/14647
[jira] [Commented] (SPARK-6235) Address various 2G limits
[ https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15420746#comment-15420746 ]

Sean Owen commented on SPARK-6235:

How does this relate to the existing subtasks and their design/content? This looks like a piece of managing chunked data, but this isn't the hard part at all.
[jira] [Commented] (SPARK-6235) Address various 2G limits
[ https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15420554#comment-15420554 ]

Guoqiang Li commented on SPARK-6235:

[~hvanhovell] The main changes:

1. Replace the DiskStore method {{def getBytes(blockId: BlockId): ChunkedByteBuffer}} with {{def getBlockData(blockId: BlockId): ManagedBuffer}}.
2. ManagedBuffer's nioByteBuffer method returns a ChunkedByteBuffer.
3. Add the class ChunkFetchInputStream, used for flow control. The code is as follows:

{noformat}
package org.apache.spark.network.client;

import java.io.IOException;
import java.io.InputStream;
import java.nio.channels.ClosedChannelException;
import java.util.Iterator;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicReference;

import com.google.common.primitives.UnsignedBytes;
import io.netty.buffer.ByteBuf;
import io.netty.channel.Channel;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import org.apache.spark.network.buffer.ChunkedByteBuffer;
import org.apache.spark.network.buffer.ManagedBuffer;
import org.apache.spark.network.protocol.StreamChunkId;
import org.apache.spark.network.util.LimitedInputStream;
import org.apache.spark.network.util.TransportFrameDecoder;

public class ChunkFetchInputStream extends InputStream {
  private final Logger logger = LoggerFactory.getLogger(ChunkFetchInputStream.class);

  private final TransportResponseHandler handler;
  private final Channel channel;
  private final StreamChunkId streamId;
  private final long byteCount;
  private final ChunkReceivedCallback callback;
  // Bounded queue of received buffers; the bound provides the flow control.
  private final LinkedBlockingQueue<ByteBuf> buffers = new LinkedBlockingQueue<>(1024);
  public final TransportFrameDecoder.Interceptor interceptor;

  private ByteBuf curChunk;
  private boolean isCallbacked = false;
  private long writerIndex = 0;

  private final AtomicReference<Throwable> cause = new AtomicReference<>(null);
  private final AtomicBoolean isClosed = new AtomicBoolean(false);

  public ChunkFetchInputStream(
      TransportResponseHandler handler,
      Channel channel,
      StreamChunkId streamId,
      long byteCount,
      ChunkReceivedCallback callback) {
    this.handler = handler;
    this.channel = channel;
    this.streamId = streamId;
    this.byteCount = byteCount;
    this.callback = callback;
    this.interceptor = new StreamInterceptor();
  }

  @Override
  public int read() throws IOException {
    if (isClosed.get()) return -1;
    pullChunk();
    if (curChunk != null) {
      byte b = curChunk.readByte();
      return UnsignedBytes.toInt(b);
    } else {
      return -1;
    }
  }

  @Override
  public int read(byte[] dest, int offset, int length) throws IOException {
    if (isClosed.get()) return -1;
    pullChunk();
    if (curChunk != null) {
      int amountToGet = Math.min(curChunk.readableBytes(), length);
      curChunk.readBytes(dest, offset, amountToGet);
      return amountToGet;
    } else {
      return -1;
    }
  }

  @Override
  public long skip(long bytes) throws IOException {
    if (isClosed.get()) return 0L;
    pullChunk();
    if (curChunk != null) {
      int amountToSkip = (int) Math.min(bytes, curChunk.readableBytes());
      curChunk.skipBytes(amountToSkip);
      return amountToSkip;
    } else {
      return 0L;
    }
  }

  @Override
  public void close() throws IOException {
    if (!isClosed.get()) {
      releaseCurChunk();
      isClosed.set(true);
      resetChannel();
      // Release any buffers that were queued but never read.
      Iterator<ByteBuf> itr = buffers.iterator();
      while (itr.hasNext()) {
        itr.next().release();
      }
      buffers.clear();
    }
  }

  private void pullChunk() throws IOException {
    if (curChunk != null && !curChunk.isReadable()) releaseCurChunk();
    if (curChunk == null && cause.get() == null && !isClosed.get()) {
      try {
        curChunk = buffers.take();
        // If channel.read() is not invoked automatically, invoke it here.
        if (!channel.config().isAutoRead()) channel.read();
      } catch (Throwable e) {
        setCause(e);
      }
    }
    if (cause.get() != null) throw new IOException(cause.get());
  }

  private void setCause(Throwable e) {
    if (cause.get() == null) cause.set(e);
  }

  private void releaseCurChunk() {
    if (curChunk != null) {
      curChunk.release();
      curChunk = null;
    }
  }

  private void onSuccess() throws IOException {
    if (isCallbacked) return;
    if (cause.get() != null) {
      callback.onFailure(streamId.chunkIndex, cause.get());
    } else {
      InputStream inputStream = new LimitedInputStream(this, byteCount);
      ManagedBuffer managedBuffer = new InputStreamManagedBuffer(inputStream, byteCount);
      callback.onSuccess(streamId.chunkIndex, managedBuffer);
    }
    isCallbacked = true;
  }

  private void resetChannel() {
    if (!channel.config().isAutoRead()) {
      channel.confi
[jira] [Commented] (SPARK-6235) Address various 2G limits
[ https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15419151#comment-15419151 ]

Herman van Hovell commented on SPARK-6235:

[~gq] It might be a good idea to share some design before pressing ahead. This seems to be a complex issue that probably needs some discussion on the approach before going ahead with a PR. If we don't take this precaution, you might end up putting a lot of time into a very complex and very difficult-to-review PR.
[jira] [Commented] (SPARK-6235) Address various 2G limits
[ https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15418225#comment-15418225 ]

Guoqiang Li commented on SPARK-6235:

I'm doing this work and I'll put the patch in this month.
[jira] [Commented] (SPARK-6235) Address various 2G limits
[ https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15413055#comment-15413055 ]

Sean Owen commented on SPARK-6235:

I think the short answer is, it's very hard. I am not sure it's useful to say "I guess you all don't care". Please have a look at Imran's tickets and jump in. In practice, it's not a big limit, since hitting it means something else in the app can be designed better.
[jira] [Commented] (SPARK-6235) Address various 2G limits
[ https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15412727#comment-15412727 ]

Brian commented on SPARK-6235:

How is it possible that Spark 2.0 comes out and this bug isn't solved? A quick Google search for "Spark 2GB limit" or "Spark Integer.MAX_VALUE" shows that this is a very real problem that affects lots of users. From the outside looking in, it seems like the Spark developers don't have an interest in solving this bug, since it's been around for years at this point (including the jiras this consolidated ticket replaced). Can you provide some sort of an update? If you don't plan on fixing this issue, maybe you can close the ticket or mark it as won't fix. At least that way we'd have some insight into your plans. Thanks!
[jira] [Commented] (SPARK-6235) Address various 2G limits
[ https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14957554#comment-14957554 ] Glenn Strycker commented on SPARK-6235: --- I don't think so, but I can check. My RDD came from an RDD of type (K,V) that was partitioned by key and worked just fine... my new RDD that is failing is attempting to map the value V to the K, so that (V, K) is now going to be partitioned by the value (now the key) instead. So I can try running some checks of multiplicity to see if my values have some kind of skew... unfortunately most of those checks are going to involve reduceByKey-like operations that will probably result in 2GB failures themselves... I was hoping to get the mapping and partitioning of (K,V) -> (V,K) accomplished first before running such checks. Thanks for the suggestion, though!
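The multiplicity check described in this comment can be tried on a small random sample first, so the check itself does not need a full shuffle. What follows is only a minimal sketch in plain Python, not Spark code: the toy records, the 10% sampling fraction and the helper names `swap` and `sampled_key_counts` are all hypothetical and chosen for illustration.

```python
import random
from collections import Counter

def swap(pairs):
    """Turn (K, V) records into (V, K) records, so the old value becomes the key."""
    return [(v, k) for k, v in pairs]

def sampled_key_counts(pairs, fraction, seed=0):
    """Count keys on a random sample so the skew check itself stays small."""
    rng = random.Random(seed)
    return Counter(key for key, _ in pairs if rng.random() < fraction)

# Hypothetical toy data: nine out of ten records share the value "hot".
records = [(i, "hot" if i % 10 else f"v{i}") for i in range(10_000)]

swapped = swap(records)
counts = sampled_key_counts(swapped, fraction=0.1)

# The share of the sample taken by the most frequent new key is a rough skew signal.
top_key, top_count = counts.most_common(1)[0]
share = top_count / sum(counts.values())
print(f"most frequent new key: {top_key!r}, share of sample: {share:.2f}")
```

If a single new key takes a large share of the sample, the (V, K) partitioning is likely to put most of the data into one partition, which is the situation the skew question was probing for.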
[jira] [Commented] (SPARK-6235) Address various 2G limits
[ https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14957546#comment-14957546 ] Reynold Xin commented on SPARK-6235: Is your data skewed? i.e. maybe there is a single key that's enormous?
[jira] [Commented] (SPARK-6235) Address various 2G limits
[ https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14957540#comment-14957540 ] Glenn Strycker commented on SPARK-6235: --- Until this issue and sub-issue tickets are solved, are there any known work-arounds? Increase number of partitions, or decrease? Split up RDDs into parts, run your command, and then union? Turn off Kryo? Use dataframes? Help!! I am encountering the 2GB bug on attempting to simply (re)partition by key an RDD of modest size (84GB) and low skew (AFAIK). I have my memory requests per executor, per master node, per Java, etc. all cranked up as far as they'll go, and I'm currently attempting to partition this RDD across 6800 partitions. Unless my skew is really bad, I don't see why 12MB per partition would be causing a shuffle to hit the 2GB limit, unless the overhead of so many partitions is actually hurting rather than helping. I'm going to try adjusting my partition number and see what happens, but I wanted to know if there is a standard work-around answer to this 2GB issue.
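The arithmetic behind the "12MB per partition" estimate can be written out as a quick check. This is a back-of-the-envelope sketch only: it assumes the 84GB is spread evenly over the 6800 partitions and that the limit applies to one partition-sized block of at most Integer.MAX_VALUE bytes. The function names are made up for illustration.

```python
# Cap on a single byte array / ByteBuffer: Integer.MAX_VALUE bytes.
INT_MAX_BYTES = 2**31 - 1

def avg_partition_bytes(total_bytes, num_partitions):
    """Average partition size, assuming the data is spread evenly."""
    return total_bytes / num_partitions

def skew_factor_to_hit_limit(total_bytes, num_partitions, limit=INT_MAX_BYTES):
    """How many times larger than average one partition must be to reach the limit."""
    return limit / avg_partition_bytes(total_bytes, num_partitions)

total = 84 * 1024**3   # 84GB RDD, taken as GiB (assumption)
parts = 6800           # partition count from the comment above

avg_mib = avg_partition_bytes(total, parts) / 1024**2
factor = skew_factor_to_hit_limit(total, parts)
print(f"average partition: {avg_mib:.1f} MiB")
print(f"one partition must be about {factor:.0f}x the average to reach 2GB")
```

Under these assumptions the average comes out between 12 and 13 MiB, in line with the estimate in the comment, and a single partition would have to be more than a hundred times the average to reach the limit. That is why heavy skew on the new key is the first thing to rule out.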
[jira] [Commented] (SPARK-6235) Address various 2G limits
[ https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14744746#comment-14744746 ] Sean McKibben commented on SPARK-6235: -- When reading from HBase into Spark, the regions seem to dictate the Spark partition and thus the block size. Makes things very difficult.
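If each region does become one partition, as this comment suggests, a rough way to see which regions are a problem is to compare their sizes against the limit and estimate how many pieces each would need to be split into. The sketch below is plain Python with made-up region names and sizes and a hypothetical 1 GiB target per piece. It does not call any HBase or Spark API.

```python
import math

# Cap on a single byte array / ByteBuffer: Integer.MAX_VALUE bytes.
INT_MAX_BYTES = 2**31 - 1

def splits_needed(region_bytes, target_bytes):
    """Number of pieces a region must be cut into so each piece is at most target_bytes."""
    return max(1, math.ceil(region_bytes / target_bytes))

# Hypothetical region sizes in bytes; real values would come from the cluster.
regions = {
    "region-a": 512 * 1024**2,   # 0.5 GiB
    "region-b": 5 * 1024**3,     # 5 GiB
    "region-c": 10 * 1024**3,    # 10 GiB
}

target = 1024**3  # assumed target of 1 GiB per piece, well under the limit

oversized = [name for name, size in regions.items() if size > INT_MAX_BYTES]
plan = {name: splits_needed(size, target) for name, size in regions.items()}

print("regions over the limit:", oversized)
print("pieces per region:", plan)
```

The design choice here is to pick a target well below the limit rather than the limit itself, which leaves headroom for regions that grow between the size check and the read.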
[jira] [Commented] (SPARK-6235) Address various 2G limits
[ https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14744605#comment-14744605 ] Ram Gande commented on SPARK-6235: -- Any progress on this? We are seeing this issue constantly in our Spark jobs. We would really appreciate it if you could provide us with an update. :)
[jira] [Commented] (SPARK-6235) Address various 2G limits
[ https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14724609#comment-14724609 ] Zhen Peng commented on SPARK-6235: -- Hi [~rxin], is there any update for this issue?