[jira] [Commented] (SPARK-6235) Address various 2G limits

2020-01-16 Thread Samuel Shepard (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17017438#comment-17017438
 ] 

Samuel Shepard commented on SPARK-6235:
---

[~irashid] I followed your suggestion of looking in the user archive and [found 
an old PR |https://github.com/apache/spark/pull/17907] that tried to fix the 
PCA call itself.  It was closed, but I linked it back here. [~srowen] is also 
on the thread. I leave this comment as much to help direct users to a 
workaround as to encourage a future fix.

Thanks for all you guys do.

> Address various 2G limits
> -
>
> Key: SPARK-6235
> URL: https://issues.apache.org/jira/browse/SPARK-6235
> Project: Spark
>  Issue Type: Umbrella
>  Components: Shuffle, Spark Core
>Reporter: Reynold Xin
>Priority: Major
> Fix For: 2.4.0
>
> Attachments: SPARK-6235_Design_V0.02.pdf
>
>
> An umbrella ticket to track the various 2G limit we have in Spark, due to the 
> use of byte arrays and ByteBuffers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6235) Address various 2G limits

2019-12-17 Thread Imran Rashid (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16998577#comment-16998577
 ] 

Imran Rashid commented on SPARK-6235:
-

[~sammysheep] Spark's ML library uses the same Jira project, if that is what 
you meant -- but I don't know what specifically has already been implemented 
in Spark to deal with large PCA, or whether there is another issue just for 
that.  I'd suggest you first ask u...@spark.apache.org, since the first 
question is whether there is another way of dealing with this.




[jira] [Commented] (SPARK-6235) Address various 2G limits

2019-12-16 Thread Samuel Shepard (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16997798#comment-16997798
 ] 

Samuel Shepard commented on SPARK-6235:
---

[~irashid] I meant the former (task result > 2G) as best I understand the 
architecture. Is there a different Jira for the ML library, since it affects 
PCA, that would be more appropriate?

Thanks for the suggestions. Spark is a beautiful system with a lot of kind 
effort put into it. Computational biology has huge feature spaces all over the 
place. The two could really work well together, I think. This issue feels like 
some sort of leftover from 32-bit Java, cramping Spark's style. :(




[jira] [Commented] (SPARK-6235) Address various 2G limits

2019-12-16 Thread Imran Rashid (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16997631#comment-16997631
 ] 

Imran Rashid commented on SPARK-6235:
-

[~sammysheep] are you discussing the use case for task results > 2G?  Or large 
records?  Or did you mean one of the parts that was supposed to be fixed in the 
plan above?

I don't deny there is _some_ use for large task results -- I just haven't heard 
much demand for it (in fact you're the first person I've heard from).  Given 
that, I don't expect to see it fixed immediately.  You could open another jira, 
though honestly for the moment I think it would be more of a place for folks to 
voice their interest.

(I'm pretty sure nothing has changed since 2.4.0 on what is fixed and what is 
not.)




[jira] [Commented] (SPARK-6235) Address various 2G limits

2019-12-13 Thread Samuel Shepard (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16995889#comment-16995889
 ] 

Samuel Shepard commented on SPARK-6235:
---

One use case could be fetching large results to the driver when computing PCA 
on large square matrices (e.g., distance matrices, similar to Classical MDS). 
This is very helpful in bioinformatics. Sorry if this is already fixed past 
2.4.0...
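
For a rough sense of scale, here is a back-of-the-envelope sketch of when a 
square matrix stops fitting in a single Java byte array. It assumes a dense 
matrix of 8-byte doubles and ignores any serialization overhead; the class and 
method names are made up for illustration and are not part of Spark.

```java
// Illustrative arithmetic only; DenseMatrixSize is a hypothetical helper, not a Spark class.
final class DenseMatrixSize {

    /** Payload bytes for an n x n matrix of 8-byte doubles (assumes dense storage). */
    static long denseBytes(long n) {
        return Math.multiplyExact(Math.multiplyExact(n, n), (long) Double.BYTES);
    }

    /** True if the payload fits in one byte[] (at most Integer.MAX_VALUE bytes). */
    static boolean fitsInOneByteArray(long n) {
        return denseBytes(n) <= Integer.MAX_VALUE;
    }

    public static void main(String[] args) {
        for (long n : new long[] {10_000, 16_383, 16_384, 50_000}) {
            System.out.printf("n=%d -> %d bytes, fits=%b%n",
                n, denseBytes(n), fitsInOneByteArray(n));
        }
    }
}
```

Under these assumptions the cutoff is n = 16384: a 16384 x 16384 matrix needs 
exactly 2^31 bytes, one more than a byte array can hold.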




[jira] [Commented] (SPARK-6235) Address various 2G limits

2019-04-17 Thread Imran Rashid (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16820309#comment-16820309
 ] 

Imran Rashid commented on SPARK-6235:
-

[~glal14] actually this was fixed in 2.4.  There was one open issue, 
SPARK-24936, but I just closed that as it's just improving an error message, 
which I think isn't really worth fixing just for Spark 3.0, and so I also 
resolved this umbrella.




[jira] [Commented] (SPARK-6235) Address various 2G limits

2019-04-17 Thread Gowtam Lal (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16820229#comment-16820229
 ] 

Gowtam Lal commented on SPARK-6235:
---

It would be great to see this go out. Any updates?

> Address various 2G limits
> -
>
> Key: SPARK-6235
> URL: https://issues.apache.org/jira/browse/SPARK-6235
> Project: Spark
>  Issue Type: Umbrella
>  Components: Shuffle, Spark Core
>Reporter: Reynold Xin
>Priority: Major
> Attachments: SPARK-6235_Design_V0.02.pdf
>
>
> An umbrella ticket to track the various 2G limit we have in Spark, due to the 
> use of byte arrays and ByteBuffers.






[jira] [Commented] (SPARK-6235) Address various 2G limits

2018-05-22 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16484069#comment-16484069
 ] 

Imran Rashid commented on SPARK-6235:
-

Would be nice to find a better home for this, but for now I wanted to share the 
test code I'm running to see if there are cases I'm missing:

https://github.com/squito/spark_2gb_test/blob/master/src/main/scala/com/cloudera/sparktest/LargeBlocks.scala




[jira] [Commented] (SPARK-6235) Address various 2G limits

2018-05-21 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16482901#comment-16482901
 ] 

Imran Rashid commented on SPARK-6235:
-

[~tgraves]
WAL -- write-ahead log for receiver-based streaming. This wouldn't affect a 
streaming source like the KafkaDirectDstream, which isn't receiver-based.  It 
might not be that hard to fix this, but I don't know this code that well, and 
I don't think it's nearly as important.


I've also seen records larger than 2 GB.  Actually, this would probably be a 
good thing to support eventually as well.  But I don't think it's as 
important; I just want to put it out of scope here.

For task results, I mean the results sent back to the driver in an action, from 
each partition.  It would be hard to imagine that working if RDD records 
couldn't be greater than 2 GB in general; I just thought it was worth calling 
out separately, as I've seen users try to send back large results.  A 
compelling use case might be if you're updating a statistical model in memory 
in your RDD action, and you want to send back the updates in a reduce to merge 
them together.




[jira] [Commented] (SPARK-6235) Address various 2G limits

2018-05-21 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16482746#comment-16482746
 ] 

Thomas Graves commented on SPARK-6235:
--

>> Still unsupported:
 * large task results
 * large blocks in the WAL
 * individual records larger than 2 GB

Can you clarify what WAL is?

I have seen individual records larger than 2 GB; I don't think it's as common, 
though.

Also, can you clarify large task results? 




[jira] [Commented] (SPARK-6235) Address various 2G limits

2018-05-21 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16482699#comment-16482699
 ] 

Imran Rashid commented on SPARK-6235:
-

Linked a [design 
doc|https://docs.google.com/document/d/1ZialnQ0RSOkyYYND7nU609NJYBC6lnhS4xyl2YqG03A/edit?usp=sharing]




[jira] [Commented] (SPARK-6235) Address various 2G limits

2018-05-16 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16477986#comment-16477986
 ] 

Imran Rashid commented on SPARK-6235:
-

Derp, I missed a pretty basic case -- if you cache a large block in 
memory, then any remote request (either for replication or a remote read) will 
fail also.  After a bit more testing I'll file some more jiras ...




[jira] [Commented] (SPARK-6235) Address various 2G limits

2018-05-16 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16477827#comment-16477827
 ] 

Imran Rashid commented on SPARK-6235:
-

I've been testing the current state of the 2GB block limit, and I think this 
has actually been whittled away quite a bit, with the appropriate configs, in 
particular setting "spark.maxRemoteBlockSizeFetchToMem" to something less than 
2 GB so that large data is fetched to disk, rather than directly to memory.

The only major thing which is not supported is replicating cached rdd blocks 
that are larger than 2 GB -- I plan to address that.

Still unsupported:
* large task results
* large blocks in the WAL
* individual records larger than 2 GB

None of those seem particularly important, so I do not plan on addressing them 
(though users can speak up if I'm mistaken).

I do not see a compelling reason to enable a fetch directly to memory for 
blocks larger than 2 GB.  (If there is a reason, let's just open a separate 
issue for that, as I wouldn't call it a bug fix.)  So I intend to also open a 
jira to change the default value of spark.maxRemoteBlockSizeFetchToMem to 
something just under 2 GB, as those requests were doomed to fail anyway.

In a bit, I'll post the test Spark jobs I used here, to get some feedback on 
whether I'm missing cases (I need to do a little cleanup first).
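
As a minimal sketch of what "something just under 2 GB" could look like, the 
snippet below computes a threshold slightly below Integer.MAX_VALUE and formats 
it as a spark-submit `--conf` argument. The 512-byte margin and the class name 
are assumptions chosen for illustration, and it assumes a bare number is read 
as a byte count for this setting; check the configuration docs for your Spark 
version before relying on it.

```java
// Hypothetical helper for illustration; not part of Spark.
final class FetchToMemThreshold {

    // Config key named in the comment above.
    static final String KEY = "spark.maxRemoteBlockSizeFetchToMem";

    // Assumed safety margin below the 2 GB ceiling (illustrative value).
    static final long MARGIN_BYTES = 512L;

    /** A byte count just under Integer.MAX_VALUE, so larger blocks go to disk. */
    static long justUnderTwoGiB() {
        return Integer.MAX_VALUE - MARGIN_BYTES;
    }

    /** Formats the setting as a spark-submit style "--conf key=value" argument. */
    static String asConfArg() {
        return "--conf " + KEY + "=" + justUnderTwoGiB();
    }

    public static void main(String[] args) {
        System.out.println(asConfArg());
    }
}
```

Any remote block larger than the chosen value would then be fetched to disk 
rather than into a single in-memory buffer.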




[jira] [Commented] (SPARK-6235) Address various 2G limits

2018-05-09 Thread Cyanny (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16469991#comment-16469991
 ] 

Cyanny commented on SPARK-6235:
---

Hi, when will this Jira feature be included in a Spark release? [~rxin]




[jira] [Commented] (SPARK-6235) Address various 2G limits

2017-05-14 Thread Jamie Hutton (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16009808#comment-16009808
 ] 

Jamie Hutton commented on SPARK-6235:
-

Hi there, Is there any update on when this will be included in a spark release?

> Address various 2G limits
> -
>
> Key: SPARK-6235
> URL: https://issues.apache.org/jira/browse/SPARK-6235
> Project: Spark
>  Issue Type: Umbrella
>  Components: Shuffle, Spark Core
>Reporter: Reynold Xin
> Attachments: SPARK-6235_Design_V0.02.pdf
>
>
> An umbrella ticket to track the various 2G limit we have in Spark, due to the 
> use of byte arrays and ByteBuffers.






[jira] [Commented] (SPARK-6235) Address various 2G limits

2016-09-22 Thread Guoqiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15515042#comment-15515042
 ] 

Guoqiang Li commented on SPARK-6235:


ping [~rxin]




[jira] [Commented] (SPARK-6235) Address various 2G limits

2016-09-20 Thread Guoqiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15508670#comment-15508670
 ] 

Guoqiang Li commented on SPARK-6235:


[~rxin] Any comments?




[jira] [Commented] (SPARK-6235) Address various 2G limits

2016-09-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15470410#comment-15470410
 ] 

Apache Spark commented on SPARK-6235:
-

User 'witgo' has created a pull request for this issue:
https://github.com/apache/spark/pull/14995




[jira] [Commented] (SPARK-6235) Address various 2G limits

2016-09-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15467563#comment-15467563
 ] 

Apache Spark commented on SPARK-6235:
-

User 'witgo' has created a pull request for this issue:
https://github.com/apache/spark/pull/14977

> Address various 2G limits
> -
>
> Key: SPARK-6235
> URL: https://issues.apache.org/jira/browse/SPARK-6235
> Project: Spark
>  Issue Type: Umbrella
>  Components: Shuffle, Spark Core
>Reporter: Reynold Xin
> Attachments: SPARK-6235_Design_V0.01.pdf
>
>
> An umbrella ticket to track the various 2G limit we have in Spark, due to the 
> use of byte arrays and ByteBuffers.






[jira] [Commented] (SPARK-6235) Address various 2G limits

2016-08-15 Thread Guoqiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15421048#comment-15421048
 ] 

Guoqiang Li commented on SPARK-6235:


Yes, it contains a lot of minor changes, e.g., replacing ByteBuffer with 
ChunkedByteBuffer.
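
To illustrate why chunking helps: a single java.nio.ByteBuffer is indexed by 
int, so it cannot address more than Integer.MAX_VALUE bytes, whereas a list of 
smaller buffers can be addressed with a long position. The class below is a 
hypothetical, simplified sketch of that idea and is not Spark's 
ChunkedByteBuffer; all names in it are assumptions.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the chunking idea; NOT Spark's ChunkedByteBuffer.
final class SimpleChunkedBuffer {
    private final List<ByteBuffer> chunks = new ArrayList<>();
    private final int chunkSize;

    /** Allocates enough fixed-size chunks (the last may be shorter) to hold totalBytes. */
    SimpleChunkedBuffer(long totalBytes, int chunkSize) {
        if (totalBytes < 0 || chunkSize <= 0) {
            throw new IllegalArgumentException("totalBytes must be >= 0 and chunkSize > 0");
        }
        this.chunkSize = chunkSize;
        long remaining = totalBytes;
        while (remaining > 0) {
            int size = (int) Math.min(remaining, chunkSize);
            chunks.add(ByteBuffer.allocate(size));
            remaining -= size;
        }
    }

    /** Total capacity as a long, so it may exceed Integer.MAX_VALUE. */
    long size() {
        long total = 0;
        for (ByteBuffer chunk : chunks) {
            total += chunk.capacity();
        }
        return total;
    }

    int chunkCount() {
        return chunks.size();
    }

    /** Writes one byte at a long position: pick the chunk, then the offset within it. */
    void put(long position, byte value) {
        chunks.get((int) (position / chunkSize)).put((int) (position % chunkSize), value);
    }

    /** Reads one byte at a long position. */
    byte get(long position) {
        return chunks.get((int) (position / chunkSize)).get((int) (position % chunkSize));
    }

    public static void main(String[] args) {
        SimpleChunkedBuffer buf = new SimpleChunkedBuffer(10, 4); // chunks of 4, 4, and 2 bytes
        buf.put(9, (byte) 42);
        System.out.println(buf.chunkCount() + " chunks, " + buf.size() + " bytes, last=" + buf.get(9));
    }
}
```

The example uses tiny sizes so it runs anywhere; the same addressing works for 
totals above 2 GB given enough memory.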

> Address various 2G limits
> -
>
> Key: SPARK-6235
> URL: https://issues.apache.org/jira/browse/SPARK-6235
> Project: Spark
>  Issue Type: Umbrella
>  Components: Shuffle, Spark Core
>Reporter: Reynold Xin
>
> An umbrella ticket to track the various 2G limit we have in Spark, due to the 
> use of byte arrays and ByteBuffers.






[jira] [Commented] (SPARK-6235) Address various 2G limits

2016-08-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15421018#comment-15421018
 ] 

Apache Spark commented on SPARK-6235:
-

User 'witgo' has created a pull request for this issue:
https://github.com/apache/spark/pull/14647




[jira] [Commented] (SPARK-6235) Address various 2G limits

2016-08-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15420746#comment-15420746
 ] 

Sean Owen commented on SPARK-6235:
--

How does this relate to the existing subtasks and their design/content? This 
looks like a piece of managing chunked data, but this isn't the hard part at 
all.




[jira] [Commented] (SPARK-6235) Address various 2G limits

2016-08-14 Thread Guoqiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15420554#comment-15420554
 ] 

Guoqiang Li commented on SPARK-6235:


[~hvanhovell]
The main changes:

1. Replace the DiskStore method {{def getBytes(blockId: BlockId): 
ChunkedByteBuffer}} with {{def getBlockData(blockId: BlockId): ManagedBuffer}}.

2. ManagedBuffer's nioByteBuffer method returns ChunkedByteBuffer.

3. Add a class, ChunkFetchInputStream, used for flow control. The code is as 
follows:

{noformat}

package org.apache.spark.network.client;

import java.io.IOException;
import java.io.InputStream;
import java.nio.channels.ClosedChannelException;
import java.util.Iterator;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicReference;

import com.google.common.primitives.UnsignedBytes;
import io.netty.buffer.ByteBuf;
import io.netty.channel.Channel;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import org.apache.spark.network.buffer.ChunkedByteBuffer;
import org.apache.spark.network.buffer.ManagedBuffer;
import org.apache.spark.network.protocol.StreamChunkId;
import org.apache.spark.network.util.LimitedInputStream;
import org.apache.spark.network.util.TransportFrameDecoder;

public class ChunkFetchInputStream extends InputStream {
  private final Logger logger = LoggerFactory.getLogger(ChunkFetchInputStream.class);

  private final TransportResponseHandler handler;
  private final Channel channel;
  private final StreamChunkId streamId;
  private final long byteCount;
  private final ChunkReceivedCallback callback;
  // Chunks waiting to be consumed by pullChunk().
  private final LinkedBlockingQueue<ByteBuf> buffers = new LinkedBlockingQueue<>(1024);
  public final TransportFrameDecoder.Interceptor interceptor;

  private ByteBuf curChunk;
  private boolean isCallbacked = false;
  private long writerIndex = 0;

  // Failure recorded by setCause(), if any; rethrown to the reader by pullChunk().
  private final AtomicReference<Throwable> cause = new AtomicReference<>(null);
  private final AtomicBoolean isClosed = new AtomicBoolean(false);

  public ChunkFetchInputStream(
      TransportResponseHandler handler,
      Channel channel,
      StreamChunkId streamId,
      long byteCount,
      ChunkReceivedCallback callback) {
    this.handler = handler;
    this.channel = channel;
    this.streamId = streamId;
    this.byteCount = byteCount;
    this.callback = callback;
    this.interceptor = new StreamInterceptor();
  }

  @Override
  public int read() throws IOException {
    if (isClosed.get()) return -1;
    pullChunk();
    if (curChunk != null) {
      byte b = curChunk.readByte();
      return UnsignedBytes.toInt(b);
    } else {
      return -1;
    }
  }

  @Override
  public int read(byte[] dest, int offset, int length) throws IOException {
    if (isClosed.get()) return -1;
    pullChunk();
    if (curChunk != null) {
      // Read at most what is left in the current chunk.
      int amountToGet = Math.min(curChunk.readableBytes(), length);
      curChunk.readBytes(dest, offset, amountToGet);
      return amountToGet;
    } else {
      return -1;
    }
  }

  @Override
  public long skip(long bytes) throws IOException {
    if (isClosed.get()) return 0L;
    pullChunk();
    if (curChunk != null) {
      // Skip at most what is left in the current chunk.
      int amountToSkip = (int) Math.min(bytes, curChunk.readableBytes());
      curChunk.skipBytes(amountToSkip);
      return amountToSkip;
    } else {
      return 0L;
    }
  }

  @Override
  public void close() throws IOException {
    if (!isClosed.get()) {
      releaseCurChunk();
      isClosed.set(true);
      resetChannel();
      // Release any chunks that were queued but never read.
      Iterator<ByteBuf> itr = buffers.iterator();
      while (itr.hasNext()) {
        itr.next().release();
      }
      buffers.clear();
    }
  }

  private void pullChunk() throws IOException {
    if (curChunk != null && !curChunk.isReadable()) releaseCurChunk();
    if (curChunk == null && cause.get() == null && !isClosed.get()) {
      try {
        curChunk = buffers.take();
        // If auto-read is disabled, channel.read() is not invoked
        // automatically, so request the next read here.
        if (!channel.config().isAutoRead()) channel.read();
      } catch (Throwable e) {
        setCause(e);
      }
    }
    if (cause.get() != null) throw new IOException(cause.get());
  }

  private void setCause(Throwable e) {
    if (cause.get() == null) cause.set(e);
  }

  private void releaseCurChunk() {
    if (curChunk != null) {
      curChunk.release();
      curChunk = null;
    }
  }

  private void onSuccess() throws IOException {
    if (isCallbacked) return;
    if (cause.get() != null) {
      callback.onFailure(streamId.chunkIndex, cause.get());
    } else {
      InputStream inputStream = new LimitedInputStream(this, byteCount);
      ManagedBuffer managedBuffer = new InputStreamManagedBuffer(inputStream, byteCount);
      callback.onSuccess(streamId.chunkIndex, managedBuffer);
    }
    isCallbacked = true;
  }

  private void resetChannel() {
    if (!channel.config().isAutoRead()) {
  channel.confi

[jira] [Commented] (SPARK-6235) Address various 2G limits

2016-08-12 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15419151#comment-15419151
 ] 

Herman van Hovell commented on SPARK-6235:
--

[~gq] it might be a good idea to share some design before pressing ahead. This 
seems to be a complex issue, that probably needs some discussion on the 
approach, before pressing ahead with a PR.

If we don't take this precaution, you might end up putting a lot of time in a 
very complex and very difficult to review PR.




[jira] [Commented] (SPARK-6235) Address various 2G limits

2016-08-11 Thread Guoqiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15418225#comment-15418225
 ] 

Guoqiang Li commented on SPARK-6235:


 I'm doing this work and I'll put the patch in this month.




[jira] [Commented] (SPARK-6235) Address various 2G limits

2016-08-08 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15413055#comment-15413055
 ] 

Sean Owen commented on SPARK-6235:
--

I think the short answer is, it's very hard. I am not sure it's useful to say 
"I guess you all don't care". Please have a look at Imran's tickets and jump 
in. In practice, it's not a big limit, since hitting it means something else in 
the app can be designed better.




[jira] [Commented] (SPARK-6235) Address various 2G limits

2016-08-08 Thread Brian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15412727#comment-15412727
 ] 

Brian commented on SPARK-6235:
--

How is it possible that Spark 2.0 comes out and this bug isn't solved?  A quick 
Google search for "Spark 2GB limit" or "Spark Integer.MAX_VALUE" shows that 
this is a very real problem that affects lots of users.  From the outside 
looking in, it seems like the Spark developers don't have an interest in 
solving this bug, since it's been around for years at this point (including the 
jiras this consolidated ticket replaced).  Can you provide some sort of 
update?  If you don't plan on fixing this issue, maybe you can close the ticket 
or mark it as won't fix.  At least that way we'd have some insight into your 
plans.  Thanks!




[jira] [Commented] (SPARK-6235) Address various 2G limits

2015-10-14 Thread Glenn Strycker (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14957554#comment-14957554
 ] 

Glenn Strycker commented on SPARK-6235:
---

I don't think so, but I can check.  My RDD came from an RDD of type (K,V) that 
was partitioned by key and worked just fine... my new RDD that is failing is 
attempting to map the value V to the K, so that (V, K) is now going to be 
partitioned by the value (now the key) instead.  So I can try running some 
checks of multiplicity to see if my values have some kind of skew... 
unfortunately most of those checks are going to involve reduceByKey-like 
operations that will probably result in 2GB failures themselves... I was hoping 
to get the mapping and partitioning of (K,V) -> (V,K) accomplished first before 
running such checks.  Thanks for the suggestion, though!
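
One way to look for skew without a full reduceByKey-style shuffle is to count key multiplicities on a small sample instead of the whole dataset. The sketch below is plain Java with no Spark dependency; it assumes the sampled keys have already been brought into a local list (for example from a small random sample of the RDD), and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for a local skew check on sampled keys; not Spark code.
final class SkewCheck {
    /** Fraction of sampled records that share the most frequent key (0.0 for an empty sample). */
    static <K> double heaviestKeyShare(List<K> sampledKeys) {
        if (sampledKeys.isEmpty()) {
            return 0.0;
        }
        Map<K, Long> counts = new HashMap<>();
        long max = 0;
        for (K key : sampledKeys) {
            // merge returns the updated count for this key.
            long count = counts.merge(key, 1L, Long::sum);
            if (count > max) {
                max = count;
            }
        }
        return (double) max / sampledKeys.size();
    }

    public static void main(String[] args) {
        // Toy sample: key "a" appears in 4 of 6 records.
        List<String> sample = Arrays.asList("a", "b", "a", "c", "a", "a");
        System.out.println("heaviest key share: " + heaviestKeyShare(sample));
    }
}
```

If one key accounts for a large share of the sample, the partition that receives that key after the (V, K) repartitioning is likely to be far larger than the average, which is the situation the skew question is probing for. A sample only gives an estimate, so a clean result does not rule out skew.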




[jira] [Commented] (SPARK-6235) Address various 2G limits

2015-10-14 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14957546#comment-14957546
 ] 

Reynold Xin commented on SPARK-6235:


Is your data skewed? i.e. maybe there is a single key that's enormous?





[jira] [Commented] (SPARK-6235) Address various 2G limits

2015-10-14 Thread Glenn Strycker (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14957540#comment-14957540
 ] 

Glenn Strycker commented on SPARK-6235:
---

Until this issue and sub-issue tickets are solved, are there any known 
work-arounds?  Increase number of partitions, or decrease?  Split up RDDs into 
parts, run your command, and then union?  Turn off Kryo?  Use dataframes?  
Help!!

I am encountering the 2GB bug on attempting to simply (re)partition by key an 
RDD of modest size (84GB) and low skew (AFAIK).  I have my memory requests per 
executor, per master node, per Java, etc. all cranked up as far as they'll go, 
and I'm currently attempting to partition this RDD across 6800 partitions.  
Unless my skew is really bad, I don't see why 12MB per partition would be 
causing a shuffle to hit the 2GB limit, unless the overhead of so many 
partitions is actually hurting rather than helping.  I'm going to try adjusting 
my partition number and see what happens, but I wanted to know if there is a 
standard work-around answer to this 2GB issue.
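
The arithmetic behind the "12MB per partition" figure can be checked with a back-of-the-envelope sketch in plain Java. It assumes the 84GB is spread evenly and treats the limit as Integer.MAX_VALUE bytes; the class and method names are hypothetical, and real shuffle block sizes also depend on serialization, compression, and how blocks are grouped, none of which is modeled here.

```java
// Hypothetical back-of-the-envelope sizing helper; not Spark code.
final class PartitionSizing {
    static final long MIB = 1024L * 1024;
    static final long GIB = 1024L * MIB;
    static final long LIMIT_BYTES = Integer.MAX_VALUE;

    /** Average bytes per partition, assuming the data is spread perfectly evenly. */
    static long averagePartitionBytes(long totalBytes, int numPartitions) {
        if (totalBytes < 0 || numPartitions <= 0) {
            throw new IllegalArgumentException("totalBytes must be >= 0 and numPartitions > 0");
        }
        return totalBytes / numPartitions;
    }

    /** How many times larger than average one partition must be to reach the limit. */
    static double skewFactorToReachLimit(long totalBytes, int numPartitions) {
        return (double) LIMIT_BYTES / averagePartitionBytes(totalBytes, numPartitions);
    }

    /** Smallest partition count that keeps an evenly spread dataset under the limit. */
    static long minPartitionsUnderLimit(long totalBytes) {
        // Ceiling division in long arithmetic.
        return (totalBytes + LIMIT_BYTES - 1) / LIMIT_BYTES;
    }

    public static void main(String[] args) {
        long total = 84 * GIB;
        int partitions = 6800;
        System.out.println("average MiB per partition: "
                + averagePartitionBytes(total, partitions) / MIB);
        System.out.println("skew factor needed to reach the limit: "
                + skewFactorToReachLimit(total, partitions));
        System.out.println("minimum partitions if perfectly even: "
                + minPartitionsUnderLimit(total));
    }
}
```

Under these assumptions, a single partition would have to be more than a hundred times the average size before it reached the limit, which is why heavy skew on one key is the first thing to rule out.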




[jira] [Commented] (SPARK-6235) Address various 2G limits

2015-09-14 Thread Sean McKibben (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14744746#comment-14744746
 ] 

Sean McKibben commented on SPARK-6235:
--

When reading from HBase into Spark, the regions seem to dictate the Spark 
partitions and thus the block size. This makes things very difficult.




[jira] [Commented] (SPARK-6235) Address various 2G limits

2015-09-14 Thread Ram Gande (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14744605#comment-14744605
 ] 

Ram Gande commented on SPARK-6235:
--

Any progress on this?  We are seeing this issue constantly in our Spark jobs. 
We would really appreciate it if you could provide us with an update. :)




[jira] [Commented] (SPARK-6235) Address various 2G limits

2015-08-31 Thread Zhen Peng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14724609#comment-14724609
 ] 

Zhen Peng commented on SPARK-6235:
--

Hi [~rxin], is there any update for this issue?
