[jira] [Commented] (SPARK-5629) Add spark-ec2 action to return info about an existing cluster

2015-02-20 Thread Florian Verhein (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329940#comment-14329940
 ] 

Florian Verhein commented on SPARK-5629:


Agree. 

My point was more about avoiding tying `--machine-readable` to a particular 
encoding, since adding more encodings later (if needed, and this discussion 
suggests that's possible) would make backwards compatibility ugly/hard to 
maintain.

My vote would be for json too.

Aside: I saw some value in bash (export variables) because spark_ec2.py is 
suited to use via its CLI, so scripting bash around it is natural. On further 
thought, though, I don't think that would be a good idea, because a) it's ugly 
to implement, and b) once a script gets to the complexity of requiring these 
variables, it should really be refactored into something more suitable like 
Python. Also, the existing `--get-master` should be sufficient for most use 
cases.


> Add spark-ec2 action to return info about an existing cluster
> -
>
> Key: SPARK-5629
> URL: https://issues.apache.org/jira/browse/SPARK-5629
> Project: Spark
>  Issue Type: Improvement
>  Components: EC2
>Reporter: Nicholas Chammas
>Priority: Minor
>
> You can launch multiple clusters using spark-ec2. At some point, you might 
> just want to get some information about an existing cluster.
> Use cases include:
> * Wanting to check something about your cluster in the EC2 web console.
> * Wanting to feed information about your cluster to another tool (e.g. as 
> described in [SPARK-5627]).
> So, in addition to the [existing 
> actions|https://github.com/apache/spark/blob/9b746f380869b54d673e3758ca5e4475f76c864a/ec2/spark_ec2.py#L115]:
> * {{launch}}
> * {{destroy}}
> * {{login}}
> * {{stop}}
> * {{start}}
> * {{get-master}}
> * {{reboot-slaves}}
> We add a new action, {{describe}}, which describes an existing cluster if 
> given a cluster name, and all clusters if not.
> Some examples:
> {code}
> # describes all clusters launched by spark-ec2
> spark-ec2 describe
> {code}
> {code}
> # describes cluster-1
> spark-ec2 describe cluster-1
> {code}
> In combination with the proposal in [SPARK-5627]:
> {code}
> # describes cluster-3 in a machine-readable way (e.g. JSON)
> spark-ec2 describe cluster-3 --machine-readable
> {code}
> Parallels in similar tools include:
> * [{{juju status}}|https://juju.ubuntu.com/docs/] from Ubuntu Juju
> * [{{starcluster 
> listclusters}}|http://star.mit.edu/cluster/docs/latest/manual/getting_started.html?highlight=listclusters#logging-into-a-worker-node]
>  from MIT StarCluster






[jira] [Commented] (SPARK-1537) Add integration with Yarn's Application Timeline Server

2015-02-20 Thread Zhan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329885#comment-14329885
 ] 

Zhan Zhang commented on SPARK-1537:
---

[~vanzin]  We should centralize all comments and reviews in one place, instead 
of going to different links. Also, we want the code under review to be up to 
date, instead of based on some old version.

Back to the technical points:

1. We all agree on the point about the timeline client, and this is why it is 
an alpha feature. Hive is a good example, but nobody can deny its importance in 
Spark.
2. ACL is included in the patch, but not in the spec.
3. I understand your question, but the scope of my response may have been too 
broad. To solve this, more work is needed on the entity design.

Let's keep an eye on these issues. 


> Add integration with Yarn's Application Timeline Server
> ---
>
> Key: SPARK-1537
> URL: https://issues.apache.org/jira/browse/SPARK-1537
> Project: Spark
>  Issue Type: New Feature
>  Components: YARN
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
> Attachments: SPARK-1537.txt, spark-1573.patch
>
>
> It would be nice to have Spark integrate with Yarn's Application Timeline 
> Server (see YARN-321, YARN-1530). This would allow users running Spark on 
> Yarn to have a single place to go for all their history needs, and avoid 
> having to manage a separate service (Spark's built-in server).
> At the moment, there's a working version of the ATS in the Hadoop 2.4 branch, 
> although there is still some ongoing work. But the basics are there, and I 
> wouldn't expect them to change (much) at this point.






[jira] [Commented] (SPARK-5928) Remote Shuffle Blocks cannot be more than 2 GB

2015-02-20 Thread Kay Ousterhout (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329854#comment-14329854
 ] 

Kay Ousterhout commented on SPARK-5928:
---

One more question: are you sure this is a problem with the maximum remote 
shuffle block size, as opposed to a problem with the maximum record size? If 
you change your code to have more records, but each record is smaller, does it 
fix the problem?
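
One way to set up that experiment (an untested sketch): keep the same total 
shuffle volume, but spread it over many keys so that each grouped record stays 
small while the single reduce partition still fetches a >2 GB block.

{code}
val rdd = sc.parallelize(1 to 1e6.toInt, 1).map { i =>
  val arr = new Array[Byte](3e3.toInt)
  // keep the data incompressible, as in the original repro
  scala.util.Random.nextBytes(arr)
  (i % 10000, arr)  // many keys instead of a single key
}
// groupByKey(1): one reduce partition, so the remote block is still ~3 GB,
// but each grouped record is only ~0.3 MB
rdd.groupByKey(1).count()
{code}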

> Remote Shuffle Blocks cannot be more than 2 GB
> --
>
> Key: SPARK-5928
> URL: https://issues.apache.org/jira/browse/SPARK-5928
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Imran Rashid
>
> If a shuffle block is over 2GB, the shuffle fails, with an uninformative 
> exception.  The tasks get retried a few times and then eventually the job 
> fails.
> Here is an example program which can cause the exception:
> {code}
> val rdd = sc.parallelize(1 to 1e6.toInt, 1).map{ ignore =>
>   val n = 3e3.toInt
>   val arr = new Array[Byte](n)
>   //need to make sure the array doesn't compress to something small
>   scala.util.Random.nextBytes(arr)
>   arr
> }
> rdd.map { x => (1, x)}.groupByKey().count()
> {code}
> Note that you can't trigger this exception in local mode; it only happens on 
> remote fetches.   I triggered these exceptions running with 
> {{MASTER=yarn-client spark-shell --num-executors 2 --executor-memory 4000m}}
> {noformat}
> 15/02/20 11:10:23 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 3, 
> imran-3.ent.cloudera.com): FetchFailed(BlockManagerId(1, 
> imran-2.ent.cloudera.com, 55028), shuffleId=1, mapId=0, reduceId=0, message=
> org.apache.spark.shuffle.FetchFailedException: Adjusted frame length exceeds 
> 2147483647: 3021252889 - discarded
>   at 
> org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67)
>   at 
> org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
>   at 
> org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:125)
>   at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
>   at 
> org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:46)
>   at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>   at org.apache.spark.scheduler.Task.run(Task.scala:56)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: io.netty.handler.codec.TooLongFrameException: Adjusted frame 
> length exceeds 2147483647: 3021252889 - discarded
>   at 
> io.netty.handler.codec.LengthFieldBasedFrameDecoder.fail(LengthFieldBasedFrameDecoder.java:501)
>   at 
> io.netty.handler.codec.LengthFieldBasedFrameDecoder.failIfNecessary(LengthFieldBasedFrameDecoder.java:477)
>   at 
> io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:403)
>   at 
> io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:343)
>   at 
> io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:249)
>   at 
> io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:149)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
>   at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
>   at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventL

[jira] [Commented] (SPARK-1537) Add integration with Yarn's Application Timeline Server

2015-02-20 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329852#comment-14329852
 ] 

Marcelo Vanzin commented on SPARK-1537:
---

Hi [~zhzhan],

bq. But it is hard to comment on or review a patch given only a hyperlink. 

Perhaps you're not familiar with all of Github's features, but you can click on 
each individual commit and comment on the code right there, just like you can 
on a PR created from those commits. Even if that doesn't sound very appealing, 
it's not hard to copy & paste the code and comment here if you really want to. 
Or generate a downloadable diff from the commits (just add ".diff" at the end 
of the commit URL, e.g. 
https://github.com/vanzin/spark/commit/c1365e0de264daa015c61a2248c80dfdea705786.diff).

bq. REST client: Currently the Timeline client does not provide a retrieve API.

That's the main reason why this feature hasn't moved forward. Using internal 
APIs to achieve that is something we're not willing to do in Spark, because it 
exposes us to future breakages and makes compatibility harder to maintain (just 
look at what has been done for Hive). So we either need the new API in Yarn, or 
we need to invest time to create a client API that does not use Yarn's classes.

bq. ACL: The Timeline service has ACL control as of hadoop-2.6

I'll believe you here since I haven't looked at that code yet. But it seems 
like it requires work on the client side, which is not currently covered in 
your spec.

bq. Read overhead and scalability: This effort is on the roadmap for the YARN 
timeline service, and it is a critical feature for using it. The current HDFS 
approach in Spark may not scale for similar reasons

I think we're talking about different things. What I'm referring to is that the 
current code that reads from the ATS reads all events of a particular entity at 
the same time. If that entity has a large number of events, that will require a 
lot of memory on the ATS side to serialize the data, and a lot of memory on the 
Spark History Server side to deserialize it. It's orthogonal to whether the 
backing store is scalable or not.

> Add integration with Yarn's Application Timeline Server
> ---
>
> Key: SPARK-1537
> URL: https://issues.apache.org/jira/browse/SPARK-1537
> Project: Spark
>  Issue Type: New Feature
>  Components: YARN
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
> Attachments: SPARK-1537.txt, spark-1573.patch
>
>
> It would be nice to have Spark integrate with Yarn's Application Timeline 
> Server (see YARN-321, YARN-1530). This would allow users running Spark on 
> Yarn to have a single place to go for all their history needs, and avoid 
> having to manage a separate service (Spark's built-in server).
> At the moment, there's a working version of the ATS in the Hadoop 2.4 branch, 
> although there is still some ongoing work. But the basics are there, and I 
> wouldn't expect them to change (much) at this point.






[jira] [Commented] (SPARK-5928) Remote Shuffle Blocks cannot be more than 2 GB

2015-02-20 Thread Kay Ousterhout (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329851#comment-14329851
 ] 

Kay Ousterhout commented on SPARK-5928:
---

Is it possible this happens because the shuffle block is larger than 
spark.io.compression.snappy.block.size, so Snappy can't decompress it?

I'm not sure why it sometimes fails as a fetch failure and sometimes from 
Snappy.  But I think the reason that these two exceptions lead to slightly 
different outcomes is as follows: when Spark gets a fetch failure, IIRC, we 
assume that the executor we were trying to fetch from failed, and remove it 
from the list of active executors.  That's where (I think) you see the case 
with the block manager re-registering.  On the other hand, if a task just has 
an exception failure, then we assume the executor is still fine.  Does that 
make sense?
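
One quick way to test that hypothesis (a sketch I haven't run; the 64 MB value 
is arbitrary) would be to raise that setting when building the context and see 
whether the failure mode changes:

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Purely to probe the hypothesis above: bump the snappy block size and re-run.
val conf = new SparkConf()
  .set("spark.io.compression.snappy.block.size", (64 * 1024 * 1024).toString)
val sc = new SparkContext(conf)
{code}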

> Remote Shuffle Blocks cannot be more than 2 GB
> --
>
> Key: SPARK-5928
> URL: https://issues.apache.org/jira/browse/SPARK-5928
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Imran Rashid
>
> If a shuffle block is over 2GB, the shuffle fails, with an uninformative 
> exception.  The tasks get retried a few times and then eventually the job 
> fails.
> Here is an example program which can cause the exception:
> {code}
> val rdd = sc.parallelize(1 to 1e6.toInt, 1).map{ ignore =>
>   val n = 3e3.toInt
>   val arr = new Array[Byte](n)
>   //need to make sure the array doesn't compress to something small
>   scala.util.Random.nextBytes(arr)
>   arr
> }
> rdd.map { x => (1, x)}.groupByKey().count()
> {code}
> Note that you can't trigger this exception in local mode; it only happens on 
> remote fetches.   I triggered these exceptions running with 
> {{MASTER=yarn-client spark-shell --num-executors 2 --executor-memory 4000m}}
> {noformat}
> 15/02/20 11:10:23 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 3, 
> imran-3.ent.cloudera.com): FetchFailed(BlockManagerId(1, 
> imran-2.ent.cloudera.com, 55028), shuffleId=1, mapId=0, reduceId=0, message=
> org.apache.spark.shuffle.FetchFailedException: Adjusted frame length exceeds 
> 2147483647: 3021252889 - discarded
>   at 
> org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67)
>   at 
> org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
>   at 
> org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:125)
>   at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
>   at 
> org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:46)
>   at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>   at org.apache.spark.scheduler.Task.run(Task.scala:56)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: io.netty.handler.codec.TooLongFrameException: Adjusted frame 
> length exceeds 2147483647: 3021252889 - discarded
>   at 
> io.netty.handler.codec.LengthFieldBasedFrameDecoder.fail(LengthFieldBasedFrameDecoder.java:501)
>   at 
> io.netty.handler.codec.LengthFieldBasedFrameDecoder.failIfNecessary(LengthFieldBasedFrameDecoder.java:477)
>   at 
> io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:403)
>   at 
> io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:343)
>   at 
> io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:249)
>   at 
> io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:149)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
>   at 
> io.netty.channel.AbstractChannelHandlerCont

[jira] [Commented] (SPARK-1537) Add integration with Yarn's Application Timeline Server

2015-02-20 Thread Zhan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329828#comment-14329828
 ] 

Zhan Zhang commented on SPARK-1537:
---

[~sowen] In JIRA, we share the code so that other people can comment and 
review. I am not waiting for a patch. But it is hard to comment on or review a 
patch given only a hyperlink. 

I never intended to push my change alone. Actually, from the beginning I have 
acknowledged his contribution, and I don't mind closing my PR and helping to 
review his, as the PR record shows. Do you agree?

You mention that you sense some insinuation and conspiracy. I didn't sense it. 
Can you please point it out to me if you figure out what it is?

Let's go back to the technical points. Overall, this is early adoption of the 
timeline service. It is an alpha feature, but most functionality is working, 
albeit with some workarounds.

REST client: Currently the Timeline client does not provide a retrieve API. So 
we work around it with an approach similar to the timeline client's own 
implementation. This needs to change once the timeline component provides a 
more mature API.

Read overhead and scalability: This effort is on the roadmap for the YARN 
timeline service, and it is a critical feature for using it. The current HDFS 
approach in Spark may not scale for similar reasons (correct me if I am wrong), 
and the timeline service may be more promising, although it is not there yet.

Security: Security is handled transparently by the timeline client.

ACL: The Timeline service has ACL control as of hadoop-2.6, and the client can 
create and set a domain with read/write access to control permissions.





> Add integration with Yarn's Application Timeline Server
> ---
>
> Key: SPARK-1537
> URL: https://issues.apache.org/jira/browse/SPARK-1537
> Project: Spark
>  Issue Type: New Feature
>  Components: YARN
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
> Attachments: SPARK-1537.txt, spark-1573.patch
>
>
> It would be nice to have Spark integrate with Yarn's Application Timeline 
> Server (see YARN-321, YARN-1530). This would allow users running Spark on 
> Yarn to have a single place to go for all their history needs, and avoid 
> having to manage a separate service (Spark's built-in server).
> At the moment, there's a working version of the ATS in the Hadoop 2.4 branch, 
> although there is still some ongoing work. But the basics are there, and I 
> wouldn't expect them to change (much) at this point.






[jira] [Commented] (SPARK-5937) [YARN] ClientSuite must set YARN mode to true to ensure correct SparkHadoopUtil implementation is used.

2015-02-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329814#comment-14329814
 ] 

Apache Spark commented on SPARK-5937:
-

User 'harishreedharan' has created a pull request for this issue:
https://github.com/apache/spark/pull/4711

> [YARN] ClientSuite must set YARN mode to true to ensure correct 
> SparkHadoopUtil implementation is used.
> ---
>
> Key: SPARK-5937
> URL: https://issues.apache.org/jira/browse/SPARK-5937
> Project: Spark
>  Issue Type: Bug
>Reporter: Hari Shreedharan
>







[jira] [Updated] (SPARK-2336) Approximate k-NN Models for MLLib

2015-02-20 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-2336:
-
Target Version/s: 1.4.0

> Approximate k-NN Models for MLLib
> -
>
> Key: SPARK-2336
> URL: https://issues.apache.org/jira/browse/SPARK-2336
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Brian Gawalt
>Priority: Minor
>  Labels: features, newbie
>
> After tackling the general k-Nearest Neighbor model as per 
> https://issues.apache.org/jira/browse/SPARK-2335 , there's an opportunity to 
> also offer approximate k-Nearest Neighbor. A promising approach would involve 
> building a kd-tree variant within each partition, a la
> http://www.autonlab.org/autonweb/14714.html?branch=1&language=2
> This could offer a simple non-linear ML model that can label new data with 
> much lower latency than the plain-vanilla kNN versions.
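
For what it's worth, here is a rough sketch of the per-partition idea, with 
plain brute force standing in for the kd-tree and a single query point for 
simplicity (the helper names are made up):

{code}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

def sqDist(a: Vector, b: Vector): Double = {
  val x = a.toArray; val y = b.toArray
  var s = 0.0; var i = 0
  while (i < x.length) { val d = x(i) - y(i); s += d * d; i += 1 }
  s
}

// Each partition returns its own k best candidates (this is where a local
// kd-tree or other approximate index would replace the brute-force scan),
// and the driver merges them into the global top k.
def approxNeighbors(data: RDD[LabeledPoint], query: Vector, k: Int): Array[(Double, Double)] =
  data.mapPartitions { iter =>
    iter.map(p => (sqDist(p.features, query), p.label))
      .toArray.sortBy(_._1).take(k).iterator
  }.takeOrdered(k)
{code}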






[jira] [Updated] (SPARK-2335) k-Nearest Neighbor classification and regression for MLLib

2015-02-20 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-2335:
-
Target Version/s: 1.4.0

> k-Nearest Neighbor classification and regression for MLLib
> --
>
> Key: SPARK-2335
> URL: https://issues.apache.org/jira/browse/SPARK-2335
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Brian Gawalt
>Priority: Minor
>  Labels: features
>
> The k-Nearest Neighbor model for classification and regression problems is a 
> simple and intuitive approach, offering a straightforward path to creating 
> non-linear decision/estimation contours. Its downsides -- high variance 
> (sensitivity to the known training data set) and computational intensity for 
> estimating new point labels -- both play to Spark's big data strengths: lots 
> of data mitigates data concerns; lots of workers mitigate computational 
> latency. 
> We should include kNN models as options in MLLib.






[jira] [Created] (SPARK-5937) [YARN] ClientSuite must set YARN mode to true to ensure correct SparkHadoopUtil implementation is used.

2015-02-20 Thread Hari Shreedharan (JIRA)
Hari Shreedharan created SPARK-5937:
---

 Summary: [YARN] ClientSuite must set YARN mode to true to ensure 
correct SparkHadoopUtil implementation is used.
 Key: SPARK-5937
 URL: https://issues.apache.org/jira/browse/SPARK-5937
 Project: Spark
  Issue Type: Bug
Reporter: Hari Shreedharan









[jira] [Created] (SPARK-5936) Automatically convert a StructType to a MapType when the number of fields exceed a threshold.

2015-02-20 Thread Yin Huai (JIRA)
Yin Huai created SPARK-5936:
---

 Summary: Automatically convert a StructType to a MapType when the 
number of fields exceed a threshold.
 Key: SPARK-5936
 URL: https://issues.apache.org/jira/browse/SPARK-5936
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Yin Huai









[jira] [Commented] (SPARK-5935) Accept MapType in the schema provided to a JSON dataset.

2015-02-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329803#comment-14329803
 ] 

Apache Spark commented on SPARK-5935:
-

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/4710

> Accept MapType in the schema provided to a JSON dataset.
> 
>
> Key: SPARK-5935
> URL: https://issues.apache.org/jira/browse/SPARK-5935
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>







[jira] [Created] (SPARK-5935) Accept MapType in the schema provided to a JSON dataset.

2015-02-20 Thread Yin Huai (JIRA)
Yin Huai created SPARK-5935:
---

 Summary: Accept MapType in the schema provided to a JSON dataset.
 Key: SPARK-5935
 URL: https://issues.apache.org/jira/browse/SPARK-5935
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Yin Huai









[jira] [Updated] (SPARK-2138) The KMeans algorithm in the MLlib can lead to the Serialized Task size become bigger and bigger

2015-02-20 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-2138:
-
Target Version/s: 1.4.0

> The KMeans algorithm in the MLlib can lead to the Serialized Task size become 
> bigger and bigger
> ---
>
> Key: SPARK-2138
> URL: https://issues.apache.org/jira/browse/SPARK-2138
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 0.9.0, 0.9.1
>Reporter: DjvuLee
>Assignee: Xiangrui Meng
>
> When the algorithm reaches a certain stage and runs the reduceByKey() 
> function, it can lead to executors and tasks being lost; after several 
> retries, the application exits.
> When this error occurs, the size of the serialized task is bigger than 10 MB, 
> and the size becomes larger as the iterations increase.
> the data generation file: https://gist.github.com/djvulee/7e3b2c9eb33ff0037622
> the running code: https://gist.github.com/djvulee/6bf00e60885215e3bfd5






[jira] [Closed] (SPARK-1892) Add an OWL-QN optimizer for L1 regularized optimizations.

2015-02-20 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng closed SPARK-1892.

Resolution: Duplicate

> Add an OWL-QN optimizer for L1 regularized optimizations.
> -
>
> Key: SPARK-1892
> URL: https://issues.apache.org/jira/browse/SPARK-1892
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Sung Chung
>
> OWL-QN is a modified version of LBFGS to handle L1 regularization.
> The original paper is at 
> http://machinelearning.wustl.edu/mlpapers/paper_files/icml2007_AndrewG07.pdf
> This implementation extends LBFGS and uses the OWL-QN implementation from 
> breeze.






[jira] [Updated] (SPARK-1856) Standardize MLlib interfaces

2015-02-20 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-1856:
-
Issue Type: Umbrella  (was: New Feature)

> Standardize MLlib interfaces
> 
>
> Key: SPARK-1856
> URL: https://issues.apache.org/jira/browse/SPARK-1856
> Project: Spark
>  Issue Type: Umbrella
>  Components: MLlib
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Critical
>
> Instead of expanding MLlib based on the current class naming scheme 
> (ProblemWithAlgorithm), we should standardize MLlib's interfaces so that they 
> clearly separate datasets, formulations, algorithms, parameter sets, and 
> models.






[jira] [Closed] (SPARK-1794) Generic ADMM implementation for SVM, lasso, and L1-regularized logistic regression

2015-02-20 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng closed SPARK-1794.

Resolution: Duplicate

> Generic ADMM implementation for SVM, lasso, and L1-regularized logistic 
> regression
> --
>
> Key: SPARK-1794
> URL: https://issues.apache.org/jira/browse/SPARK-1794
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Andrew Tulloch
>Priority: Minor
>







[jira] [Closed] (SPARK-1673) GLMNET implementation in Spark

2015-02-20 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng closed SPARK-1673.

Resolution: Duplicate

> GLMNET implementation in Spark
> --
>
> Key: SPARK-1673
> URL: https://issues.apache.org/jira/browse/SPARK-1673
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Sung Chung
>
> This is a Spark implementation of GLMNET by Jerome Friedman, Trevor Hastie, 
> Rob Tibshirani.
> http://www.jstatsoft.org/v33/i01/paper
> It's a straightforward implementation of the Coordinate-Descent based L1/L2 
> regularized linear models, including Linear/Logistic/Multinomial regressions.






[jira] [Updated] (SPARK-1655) In naive Bayes, store conditional probabilities distributively.

2015-02-20 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-1655:
-
Target Version/s: 1.4.0

> In naive Bayes, store conditional probabilities distributively.
> ---
>
> Key: SPARK-1655
> URL: https://issues.apache.org/jira/browse/SPARK-1655
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Xiangrui Meng
>Assignee: Aaron Staple
>
> In the current implementation, we collect all conditional probabilities to 
> the driver node. When there are many labels and many features, this puts 
> heavy load on the driver. For scalability, we should provide a way to store 
> conditional probabilities distributively.






[jira] [Closed] (SPARK-1418) Python MLlib's _get_unmangled_rdd should uncache RDDs when training is done

2015-02-20 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng closed SPARK-1418.

   Resolution: Implemented
Fix Version/s: 1.2.0

> Python MLlib's _get_unmangled_rdd should uncache RDDs when training is done
> ---
>
> Key: SPARK-1418
> URL: https://issues.apache.org/jira/browse/SPARK-1418
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, PySpark
>Reporter: Matei Zaharia
> Fix For: 1.2.0
>
>
> Right now when PySpark converts a Python RDD of NumPy vectors to a Java one, 
> it caches the Java one, since many of the algorithms are iterative. We should 
> call unpersist() at the end of the algorithm though to free cache space. In 
> addition it may be good to persist the Java RDD with 
> StorageLevel.MEMORY_AND_DISK instead of going back through the NumPy 
> conversion; it will almost certainly be faster.






[jira] [Updated] (SPARK-1359) SGD implementation is not efficient

2015-02-20 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-1359:
-
Target Version/s: 1.4.0

> SGD implementation is not efficient
> ---
>
> Key: SPARK-1359
> URL: https://issues.apache.org/jira/browse/SPARK-1359
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 0.9.0, 1.0.0
>Reporter: Xiangrui Meng
>
> The SGD implementation samples a mini-batch to compute the stochastic 
> gradient. This is not efficient because examples are provided via an iterator 
> interface. We have to scan all of them to obtain a sample.
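
For reference, the mini-batch is drawn with something along these lines (a 
paraphrase, not the exact MLlib code), and {{RDD.sample}} still has to traverse 
every element to produce the sample:

{code}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Each iteration draws a mini-batch by sampling; because the data is only
// exposed through an iterator, the sample is still a full scan.
def drawMiniBatch(data: RDD[LabeledPoint], miniBatchFraction: Double, iter: Int): RDD[LabeledPoint] =
  data.sample(withReplacement = false, fraction = miniBatchFraction, seed = 42 + iter)
{code}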






[jira] [Created] (SPARK-5934) DStreamGraph.clearMetadata attempts to unpersist the same RDD multiple times

2015-02-20 Thread Nick Pritchard (JIRA)
Nick Pritchard created SPARK-5934:
-

 Summary: DStreamGraph.clearMetadata attempts to unpersist the same 
RDD multiple times
 Key: SPARK-5934
 URL: https://issues.apache.org/jira/browse/SPARK-5934
 Project: Spark
  Issue Type: Bug
  Components: Block Manager, Streaming
Affects Versions: 1.2.1
Reporter: Nick Pritchard
Priority: Minor


It seems that because DStream.clearMetadata calls itself recursively on the 
dependencies, it attempts to unpersist the same RDD multiple times, which 
results in WARN logs like this:
{quote}
WARN BlockManager: Asked to remove block rdd_2_1, which does not exist
{quote}

or this:
{quote}
WARN BlockManager: Block rdd_2_1 could not be removed as it was not found in 
either the disk, memory, or tachyon store
{quote}

This is preceded by logs like:
{quote}
DEBUG TransformedDStream: Unpersisting old RDDs: 2
DEBUG QueueInputDStream: Unpersisting old RDDs: 2
{quote}

Here is a reproducible case:
{code:scala}
import scala.collection.mutable

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object Test {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("Test")
    val ssc = new StreamingContext(conf, Seconds(1))
    val queue = new mutable.Queue[RDD[Int]]

    val input = ssc.queueStream(queue)
    val output = input.cache().transform(x => x)
    output.print()

    ssc.start()
    for (i <- 1 to 5) {
      val rdd = ssc.sparkContext.parallelize(Seq(i))
      queue.enqueue(rdd)
      Thread.sleep(1000)
    }
    ssc.stop()
  }
}
{code}

It doesn't seem to be a fatal error, but the WARN messages are a bit unsettling.






[jira] [Closed] (SPARK-1014) MultilogisticRegressionWithSGD

2015-02-20 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng closed SPARK-1014.

Resolution: Duplicate

We support multinomial logistic regression with LBFGS in 1.3. I marked this 
JIRA as a duplicate.

> MultilogisticRegressionWithSGD
> --
>
> Key: SPARK-1014
> URL: https://issues.apache.org/jira/browse/SPARK-1014
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 0.9.0
>Reporter: Kun Yang
>
> Multilogistic Regression With SGD based on mllib packages
> Use labeledpoint, gradientDescent to train the model






[jira] [Updated] (SPARK-5888) Add OneHotEncoder as a Transformer

2015-02-20 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-5888:
-
Summary: Add OneHotEncoder as a Transformer  (was: Add OneHotEncoder)

> Add OneHotEncoder as a Transformer
> --
>
> Key: SPARK-5888
> URL: https://issues.apache.org/jira/browse/SPARK-5888
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Xiangrui Meng
>
> `OneHotEncoder` takes a categorical column and outputs a vector column, which 
> stores the category info as binary indicator values.
> {code}
> val ohe = new OneHotEncoder()
>   .setInputCol("countryIndex")
>   .setOutputCol("countries")
> {code}
> It should read the category info from the metadata and assign feature names 
> properly in the output column. We need to discuss the default naming scheme 
> and whether we should let it process multiple categorical columns at the same 
> time.
> One category (the most frequent one) should be removed from the output to 
> make the output columns linearly independent. Alternatively, this could be an 
> option turned on by default.
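
A tiny worked example of the drop-one-category point (the country values and 
0-based indices are made up): with three countries indexed 0.0, 1.0 and 2.0 and 
the most frequent one used as the reference, two indicator columns are enough.

{code}
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// countryIndex -> indicator vector; "US" is the reference category and maps to all zeros
val encode: Map[Double, Vector] = Map(
  0.0 -> Vectors.dense(0.0, 0.0), // US (most frequent, dropped)
  1.0 -> Vectors.dense(1.0, 0.0), // UK
  2.0 -> Vectors.dense(0.0, 1.0)  // FR
)
{code}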






[jira] [Updated] (SPARK-1473) Feature selection for high dimensional datasets

2015-02-20 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-1473:
-
Issue Type: Umbrella  (was: New Feature)

> Feature selection for high dimensional datasets
> ---
>
> Key: SPARK-1473
> URL: https://issues.apache.org/jira/browse/SPARK-1473
> Project: Spark
>  Issue Type: Umbrella
>  Components: MLlib
>Reporter: Ignacio Zendejas
>Priority: Minor
>  Labels: features
>
> For classification tasks involving large feature spaces in the order of tens 
> of thousands or higher (e.g., text classification with n-grams, where n > 1), 
> it is often useful to rank and filter features that are irrelevant thereby 
> reducing the feature space by at least one or two orders of magnitude without 
> impacting performance on key evaluation metrics (accuracy/precision/recall).
> A flexible feature evaluation interface needs to be designed, and at least 
> two methods should be implemented, with Information Gain being a priority as 
> it has been shown to be amongst the most reliable.
> Special consideration should be taken in the design to account for wrapper 
> methods (see research papers below) which are more practical for lower 
> dimensional data.
> Relevant research:
> * Brown, G., Pocock, A., Zhao, M. J., & Luján, M. (2012). Conditional
> likelihood maximisation: a unifying framework for information theoretic
> feature selection.*The Journal of Machine Learning Research*, *13*, 27-66.
> * Forman, George. "An extensive empirical study of feature selection metrics 
> for text classification." The Journal of machine learning research 3 (2003): 
> 1289-1305.






[jira] [Commented] (SPARK-5912) Programming guide for feature selection

2015-02-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329773#comment-14329773
 ] 

Apache Spark commented on SPARK-5912:
-

User 'avulanov' has created a pull request for this issue:
https://github.com/apache/spark/pull/4709

> Programming guide for feature selection
> ---
>
> Key: SPARK-5912
> URL: https://issues.apache.org/jira/browse/SPARK-5912
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>
> The new ChiSqSelector for feature selection should have a section in the 
> Programming Guide.  It should probably be under the feature extraction and 
> transformation section as a new subsection for feature selection.
> If we get more feature selection methods later on, we could expand it to a 
> larger section of the guide.
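
For the guide, a minimal usage sketch (assuming the MLlib 1.3 
{{ChiSqSelector}} API, an existing SparkContext {{sc}}, and 
categorical/discretized features; the toy data is made up):

{code}
import org.apache.spark.mllib.feature.ChiSqSelector
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val data = sc.parallelize(Seq(
  LabeledPoint(0.0, Vectors.dense(0.0, 1.0, 3.0)),
  LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))
))

// Keep the 2 features most correlated with the label according to the chi-squared test.
val selector = new ChiSqSelector(2)
val model = selector.fit(data)
val filtered = data.map(lp => LabeledPoint(lp.label, model.transform(lp.features)))
{code}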






[jira] [Commented] (SPARK-5629) Add spark-ec2 action to return info about an existing cluster

2015-02-20 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329771#comment-14329771
 ] 

Nicholas Chammas commented on SPARK-5629:
-

[~florianverhein] - Hmm... Thinking about this for a bit, I'd be against 
introducing such flexibility right off the bat. Every additional option or 
non-standard flow will impose a maintenance burden. 

By that argument, actually, we should also just stick to JSON even though it's 
less readable. As in, {{describe}} outputs pretty-printed JSON and that's it.

> Add spark-ec2 action to return info about an existing cluster
> -
>
> Key: SPARK-5629
> URL: https://issues.apache.org/jira/browse/SPARK-5629
> Project: Spark
>  Issue Type: Improvement
>  Components: EC2
>Reporter: Nicholas Chammas
>Priority: Minor
>
> You can launch multiple clusters using spark-ec2. At some point, you might 
> just want to get some information about an existing cluster.
> Use cases include:
> * Wanting to check something about your cluster in the EC2 web console.
> * Wanting to feed information about your cluster to another tool (e.g. as 
> described in [SPARK-5627]).
> So, in addition to the [existing 
> actions|https://github.com/apache/spark/blob/9b746f380869b54d673e3758ca5e4475f76c864a/ec2/spark_ec2.py#L115]:
> * {{launch}}
> * {{destroy}}
> * {{login}}
> * {{stop}}
> * {{start}}
> * {{get-master}}
> * {{reboot-slaves}}
> We add a new action, {{describe}}, which describes an existing cluster if 
> given a cluster name, and all clusters if not.
> Some examples:
> {code}
> # describes all clusters launched by spark-ec2
> spark-ec2 describe
> {code}
> {code}
> # describes cluster-1
> spark-ec2 describe cluster-1
> {code}
> In combination with the proposal in [SPARK-5627]:
> {code}
> # describes cluster-3 in a machine-readable way (e.g. JSON)
> spark-ec2 describe cluster-3 --machine-readable
> {code}
> Parallels in similar tools include:
> * [{{juju status}}|https://juju.ubuntu.com/docs/] from Ubuntu Juju
> * [{{starcluster 
> listclusters}}|http://star.mit.edu/cluster/docs/latest/manual/getting_started.html?highlight=listclusters#logging-into-a-worker-node]
>  from MIT StarCluster






[jira] [Resolved] (SPARK-5896) toDF in python doesn't work with tuple/list w/o names

2015-02-20 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-5896.
-
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 4679
[https://github.com/apache/spark/pull/4679]

> toDF in python doesn't work with tuple/list w/o names
> -
>
> Key: SPARK-5896
> URL: https://issues.apache.org/jira/browse/SPARK-5896
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Davies Liu
>Priority: Critical
> Fix For: 1.3.0
>
>
> {code}
> rdd = sc.parallelize(range(10)).map(lambda x: (str(x), x))
> kvdf = rdd.toDF()
> {code}
> {code}
> ---
> ValueErrorTraceback (most recent call last)
>  in ()
>   1 rdd = sc.parallelize(range(10)).map(lambda x: (str(x), x))
> > 2 kvdf = rdd.toDF()
> /home/ubuntu/databricks/spark/python/pyspark/sql/context.pyc in toDF(self, 
> schema, sampleRatio)
>  53 [Row(name=u'Alice', age=1)]
>  54 """
> ---> 55 return sqlCtx.createDataFrame(self, schema, sampleRatio)
>  56 
>  57 RDD.toDF = toDF
> /home/ubuntu/databricks/spark/python/pyspark/sql/context.pyc in 
> createDataFrame(self, data, schema, samplingRatio)
> 395 
> 396 if schema is None:
> --> 397 return self.inferSchema(data, samplingRatio)
> 398 
> 399 if isinstance(schema, (list, tuple)):
> /home/ubuntu/databricks/spark/python/pyspark/sql/context.pyc in 
> inferSchema(self, rdd, samplingRatio)
> 228 raise TypeError("Cannot apply schema to DataFrame")
> 229 
> --> 230 schema = self._inferSchema(rdd, samplingRatio)
> 231 converter = _create_converter(schema)
> 232 rdd = rdd.map(converter)
> /home/ubuntu/databricks/spark/python/pyspark/sql/context.pyc in 
> _inferSchema(self, rdd, samplingRatio)
> 158 
> 159 if samplingRatio is None:
> --> 160 schema = _infer_schema(first)
> 161 if _has_nulltype(schema):
> 162 for row in rdd.take(100)[1:]:
> /home/ubuntu/databricks/spark/python/pyspark/sql/types.pyc in 
> _infer_schema(row)
> 646 items = row
> 647 else:
> --> 648 raise ValueError("Can't infer schema from tuple")
> 649 
> 650 elif hasattr(row, "__dict__"):  # object
> ValueError: Can't infer schema from tuple
> {code}
> Nearly the same code works if you give names (and this works without names in 
> scala and calls the columns _1, _2, ...)






[jira] [Resolved] (SPARK-5898) Can't create DataFrame from Pandas data frame

2015-02-20 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-5898.
-
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 4679
[https://github.com/apache/spark/pull/4679]

> Can't create DataFrame from Pandas data frame
> -
>
> Key: SPARK-5898
> URL: https://issues.apache.org/jira/browse/SPARK-5898
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Davies Liu
>Priority: Critical
> Fix For: 1.3.0
>
>
> {code}
> data = sqlContext.table("sparkCommits")
> p = data.toPandas()
> sqlContext.createDataFrame(p)
> {code}
> {code}
> ---
> AttributeErrorTraceback (most recent call last)
>  in ()
>   1 data = sqlContext.table("sparkCommits")
>   2 p = data.toPandas()
> > 3 sqlContext.createDataFrame(p)
> /home/ubuntu/databricks/spark/python/pyspark/sql/context.pyc in 
> createDataFrame(self, data, schema, samplingRatio)
> 385 data = self._sc.parallelize(data.to_records(index=False))
> 386 if schema is None:
> --> 387 schema = list(data.columns)
> 388 
> 389 if not isinstance(data, RDD):
> AttributeError: 'RDD' object has no attribute 'columns'






[jira] [Commented] (SPARK-1537) Add integration with Yarn's Application Timeline Server

2015-02-20 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329739#comment-14329739
 ] 

Sean Owen commented on SPARK-1537:
--

[~zzhan] You have provided a patch as a PR, right? Anyone can try it. Request 
granted.

Given the YARN JIRAs already referenced here, some of which have patches ready 
to go too, I think it has been discussed in YARN too? What isn't happening with 
YARN that should be, and, can you help with it? I'm not sure if that's where 
you are saying the waiting is. That is: hasn't this been blocked on YARN 
changes for a long time?

I get it, one person's 'outstanding bug' is another's 'will not fix' but that's 
the give and take of OSS. If you want this feature in Spark, and people are 
asking that it should depend on some YARN changes -- then what do you think 
about lobbying for those YARN changes? or do you disagree that they're 
necessary, and can you argue that here please?

I don't understand your second reply. Yes, it sounds like two people have a 
similar solution with a similar problem with YARN APIs. You say you're not 
waiting on code now, but have repeatedly asked Marcelo to share some (other?) 
code. It's odd since, yes, it's very clear you acknowledge you've already seen 
his code and reused a bit, which is entirely fine. I hope we're done with that 
exchange.

I sense some insinuation that code is being 'hidden' in bad faith, but I can't 
figure out the conspiracy. I see every willingness to make *your* change alone 
here, if you propose something that addresses the YARN issues raised here. You 
are *not* blocked on anyone else's patch. However all of us are 'blocked' on 
the consensus of community / committers that care about this issue, and it 
looks like the response is clear so far: not until YARN API stuff is sorted out 
one way or the other.

Are you suggesting this patch should be committed without the YARN changes? Or 
that you're working on the YARN changes? What do you want to take over and do 
next?

> Add integration with Yarn's Application Timeline Server
> ---
>
> Key: SPARK-1537
> URL: https://issues.apache.org/jira/browse/SPARK-1537
> Project: Spark
>  Issue Type: New Feature
>  Components: YARN
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
> Attachments: SPARK-1537.txt, spark-1573.patch
>
>
> It would be nice to have Spark integrate with Yarn's Application Timeline 
> Server (see YARN-321, YARN-1530). This would allow users running Spark on 
> Yarn to have a single place to go for all their history needs, and avoid 
> having to manage a separate service (Spark's built-in server).
> At the moment, there's a working version of the ATS in the Hadoop 2.4 branch, 
> although there is still some ongoing work. But the basics are there, and I 
> wouldn't expect them to change (much) at this point.






[jira] [Issue Comment Deleted] (SPARK-1537) Add integration with Yarn's Application Timeline Server

2015-02-20 Thread Zhan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhan Zhang updated SPARK-1537:
--
Comment: was deleted

(was: [~sowen] By the way, I am not waiting for someone to give me the patch. 
It is because someone declared the patch was almost ready half a year ago. 
After I submitted mine, someone kept saying my patch is not much different from 
his.)

> Add integration with Yarn's Application Timeline Server
> ---
>
> Key: SPARK-1537
> URL: https://issues.apache.org/jira/browse/SPARK-1537
> Project: Spark
>  Issue Type: New Feature
>  Components: YARN
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
> Attachments: SPARK-1537.txt, spark-1573.patch
>
>
> It would be nice to have Spark integrate with Yarn's Application Timeline 
> Server (see YARN-321, YARN-1530). This would allow users running Spark on 
> Yarn to have a single place to go for all their history needs, and avoid 
> having to manage a separate service (Spark's built-in server).
> At the moment, there's a working version of the ATS in the Hadoop 2.4 branch, 
> although there is still some ongoing work. But the basics are there, and I 
> wouldn't expect them to change (much) at this point.






[jira] [Updated] (SPARK-4081) Categorical feature indexing

2015-02-20 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-4081:
-
Target Version/s: 1.4.0  (was: 1.2.0)

> Categorical feature indexing
> 
>
> Key: SPARK-4081
> URL: https://issues.apache.org/jira/browse/SPARK-4081
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.1.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Minor
>
> DecisionTree and RandomForest require that categorical features and labels be 
> indexed 0, 1, 2, etc. There is currently no code to aid with indexing a dataset. 
>  This is a proposal for a helper class for computing indices (and also 
> deciding which features to treat as categorical).
> Proposed functionality:
> * This helps process a dataset of unknown vectors into a dataset with some 
> continuous features and some categorical features. The choice between 
> continuous and categorical is based upon a maxCategories parameter.
> * This can also map categorical feature values to 0-based indices.
> Usage:
> {code}
> val myData1: RDD[Vector] = ...
> val myData2: RDD[Vector] = ...
> val datasetIndexer = new DatasetIndexer(maxCategories)
> datasetIndexer.fit(myData1)
> val indexedData1: RDD[Vector] = datasetIndexer.transform(myData1)
> datasetIndexer.fit(myData2)
> val indexedData2: RDD[Vector] = datasetIndexer.transform(myData2)
> val categoricalFeaturesInfo: Map[Double, Int] = 
> datasetIndexer.getCategoricalFeatureIndexes()
> {code}






[jira] [Updated] (SPARK-3249) Fix links in ScalaDoc that cause warning messages in `sbt/sbt unidoc`

2015-02-20 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-3249:
-
Target Version/s:   (was: 1.2.0)

> Fix links in ScalaDoc that cause warning messages in `sbt/sbt unidoc`
> -
>
> Key: SPARK-3249
> URL: https://issues.apache.org/jira/browse/SPARK-3249
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, MLlib
>Affects Versions: 1.1.0
>Reporter: Xiangrui Meng
>Priority: Minor
>
> If there are multiple overloaded versions of a method, we should make the 
> links more specific. Otherwise, `sbt/sbt unidoc` generates warning messages 
> like the following:
> {code}
> [warn] 
> mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala:305: The 
> link target "org.apache.spark.mllib.tree.DecisionTree$#trainClassifier" is 
> ambiguous. Several members fit the target:
> [warn] (input: 
> org.apache.spark.api.java.JavaRDD[org.apache.spark.mllib.regression.LabeledPoint],numClassesForClassification:
>  Int,categoricalFeaturesInfo: java.util.Map[Integer,Integer],impurity: 
> String,maxDepth: Int,maxBins: Int): 
> org.apache.spark.mllib.tree.model.DecisionTreeModel in object DecisionTree 
> [chosen]
> [warn] (input: 
> org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint],numClassesForClassification:
>  Int,categoricalFeaturesInfo: Map[Int,Int],impurity: String,maxDepth: 
> Int,maxBins: Int): org.apache.spark.mllib.tree.model.DecisionTreeModel in 
> object DecisionTree
> {code}
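
One low-risk way to silence this kind of warning (a sketch, not necessarily how 
this ticket was resolved) is to link to the unambiguous companion object and put 
the method name in the link text, instead of linking to the overloaded method 
directly:
{code}
/**
 * See [[org.apache.spark.mllib.tree.DecisionTree$ DecisionTree.trainClassifier]]
 * for the available training entry points.
 */
object ExampleDocs
{code}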



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5516) ActorSystemImpl: Uncaught fatal error from thread [sparkDriver-akka.actor.default-dispatcher-22] shutting down ActorSystem [sparkDriver] java.lang.OutOfMemoryError: Jav

2015-02-20 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329718#comment-14329718
 ] 

Xiangrui Meng commented on SPARK-5516:
--

[~wuyukai] Could you provide all the parameters you used? The most important 
ones are the number of features, maxDepth, and maxBins. Please also remember to 
set `--driver-memory` to a sufficiently large value when using spark-submit. 

> ActorSystemImpl: Uncaught fatal error from thread 
> [sparkDriver-akka.actor.default-dispatcher-22] shutting down ActorSystem 
> [sparkDriver] java.lang.OutOfMemoryError: Java heap space
> 
>
> Key: SPARK-5516
> URL: https://issues.apache.org/jira/browse/SPARK-5516
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.2.0
> Environment: centos 6.5   
>Reporter: wuyukai
>
> When we ran the Gradient Boosting Tree model, it threw the exception below. 
> The data we used is only 45 MB. We ran it on 4 computers, each with 4 cores 
> and 16 GB of RAM, and set the parameter "gradientboostedtrees.maxiteration" 
> to 50.
> 15/02/01 01:39:48 INFO DAGScheduler: Job 965 failed: collectAsMap at 
> DecisionTree.scala:653, took 1.616976 s
> Exception in thread "main" org.apache.spark.SparkException: Job cancelled 
> because SparkContext was shut down
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:702)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:701)
>   at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
>   at 
> org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:701)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessActor.postStop(DAGScheduler.scala:1428)
>   at akka.actor.Actor$class.aroundPostStop(Actor.scala:475)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundPostStop(DAGScheduler.scala:1375)
>   at 
> akka.actor.dungeon.FaultHandling$class.akka$actor$dungeon$FaultHandling$$finishTerminate(FaultHandling.scala:210)
>   at 
> akka.actor.dungeon.FaultHandling$class.terminate(FaultHandling.scala:172)
>   at akka.actor.ActorCell.terminate(ActorCell.scala:369)
>   at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:462)
>   at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478)
>   at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263)
>   at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>   at 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
>   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>   at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> 15/02/01 01:39:48 ERROR ActorSystemImpl: Uncaught fatal error from thread 
> [sparkDriver-akka.actor.default-dispatcher-22] shutting down ActorSystem 
> [sparkDriver]
> java.lang.OutOfMemoryError: Java heap space
>   at java.util.Arrays.copyOf(Arrays.java:2271)
>   at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
>   at 
> java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
>   at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
>   at 
> java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876)
>   at 
> java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1785)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1188)
>   at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
>   at scala.collection.immutable.$colon$colon.writeObject(List.scala:379)
>   at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:988)
>   at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1495)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
>   at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
>   at 
> java.io.ObjectOutputStream.writ

[jira] [Commented] (SPARK-5928) Remote Shuffle Blocks cannot be more than 2 GB

2015-02-20 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329720#comment-14329720
 ] 

Imran Rashid commented on SPARK-5928:
-

Actually, there *is* some weirdness in how Spark handles this failure.  
[~kayousterhout], maybe you can take a look -- I think it might be related to 
this change: 
https://github.com/apache/spark/commit/18ad59e2c6b7bd009e8ba5ebf8fcf99630863029,
 specifically here: 
https://github.com/apache/spark/commit/18ad59e2c6b7bd009e8ba5ebf8fcf99630863029#diff-bad3987c83bd22d46416d3dd9d208e76R488.
  Maybe the issue existed before that change, I am not sure, but at least you 
might be a bit more knowledgeable about this code :)

So the strange thing is, every time there is a {{FetchFailedException}}, Spark 
marks the stage as failed, with log output something like:

{noformat}
15/02/20 13:09:24 INFO YarnClientClusterScheduler: Removed TaskSet 1.0, whose 
tasks have all completed, from pool 
15/02/20 13:09:24 INFO DAGScheduler: Marking Stage 1 (count at <console>:15) as 
failed due to a fetch failure from Stage 0 (map at <console>:15)
15/02/20 13:09:24 INFO DAGScheduler: Stage 1 (count at <console>:15) failed in 
0.736 s
15/02/20 13:09:24 INFO DAGScheduler: Resubmitting Stage 0 (map at <console>:15) 
and Stage 1 (count at <console>:15) due to fetch failure
15/02/20 13:09:24 INFO DAGScheduler: Executor lost: 2 (epoch 1)
15/02/20 13:09:24 INFO BlockManagerMasterActor: Trying to remove executor 2 
from BlockManagerMaster.
15/02/20 13:09:24 INFO BlockManagerMasterActor: Removing block manager 
BlockManagerId(2, imran-3.ent.cloudera.com, 45980)
{noformat}

but then it shortly re-registers that same block manager:

{noformat}
15/02/20 13:09:24 INFO BlockManagerMasterActor: Registering block manager 
imran-3.ent.cloudera.com:45980 with 2.0 GB RAM, BlockManagerId(2, 
imran-3.ent.cloudera.com, 45980)
{noformat}

then it reruns the same stage, and goes through the same thing over and over 
again.  It only breaks out of this loop if instead I get the 
{{java.io.IOException}} from snappy mentioned above (no idea why that exception 
only occurs sometimes).  In that case, we instead get a message like:

{noformat}
15/02/20 13:19:23 WARN TaskSetManager: Lost task 0.0 in stage 1.3 (TID 7, 
imran-2.ent.cloudera.com): java.io.IOException: failed to uncompress the chunk: 
PARSING_ERROR(2)
at 
org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:361)
...
15/02/20 13:19:23 INFO TaskSetManager: Starting task 0.1 in stage 1.3 (TID 8, 
imran-3.ent.cloudera.com, PROCESS_LOCAL, 1056 bytes)
{noformat}

so it just retries the *task*, rather than failing the stage.  If we're lucky 
enough to have that exception occur 4 times in a row, before any 
{{FetchFailedException}}s, then Spark aborts the job and the stage-retrying 
loop is finally broken.  In my last try with the code given for this issue, it 
resulted in 6 retries of the stage.

This should probably go under another issue, but I don't quite understand what 
is going on well enough to formulate that issue clearly.  To me it looks like 
{{FetchFailedException}} should just result in a normal task failure, but it 
looks like that special handling was put in there for some reason.  But I can 
just file another issue if it's not clear what is going on here.

thanks
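
As an aside, a quick sanity check of the numbers in the frame-length error quoted 
in the description below (a sketch; the limit in the message is simply 
{{Int.MaxValue}}):
{code}
// The reported frame size does not fit in a signed 32-bit Int, which is the
// 2147483647 limit named in the error message.
object FrameSizeCheck extends App {
  val reportedFrameLength = 3021252889L        // from "Adjusted frame length exceeds ..."
  println(Int.MaxValue)                        // 2147483647
  println(reportedFrameLength > Int.MaxValue)  // true
}
{code}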

> Remote Shuffle Blocks cannot be more than 2 GB
> --
>
> Key: SPARK-5928
> URL: https://issues.apache.org/jira/browse/SPARK-5928
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Imran Rashid
>
> If a shuffle block is over 2GB, the shuffle fails, with an uninformative 
> exception.  The tasks get retried a few times and then eventually the job 
> fails.
> Here is an example program which can cause the exception:
> {code}
> val rdd = sc.parallelize(1 to 1e6.toInt, 1).map{ ignore =>
>   val n = 3e3.toInt
>   val arr = new Array[Byte](n)
>   //need to make sure the array doesn't compress to something small
>   scala.util.Random.nextBytes(arr)
>   arr
> }
> rdd.map { x => (1, x)}.groupByKey().count()
> {code}
> Note that you can't trigger this exception in local mode, it only happens on 
> remote fetches.   I triggered these exceptions running with 
> {{MASTER=yarn-client spark-shell --num-executors 2 --executor-memory 4000m}}
> {noformat}
> 15/02/20 11:10:23 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 3, 
> imran-3.ent.cloudera.com): FetchFailed(BlockManagerId(1, 
> imran-2.ent.cloudera.com, 55028), shuffleId=1, mapId=0, reduceId=0, message=
> org.apache.spark.shuffle.FetchFailedException: Adjusted frame length exceeds 
> 2147483647: 3021252889 - discarded
>   at 
> org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67)
>  

[jira] [Updated] (SPARK-5516) ActorSystemImpl: Uncaught fatal error from thread [sparkDriver-akka.actor.default-dispatcher-22] shutting down ActorSystem [sparkDriver] java.lang.OutOfMemoryError: Java

2015-02-20 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-5516:
-
Fix Version/s: (was: 1.2.2)

> ActorSystemImpl: Uncaught fatal error from thread 
> [sparkDriver-akka.actor.default-dispatcher-22] shutting down ActorSystem 
> [sparkDriver] java.lang.OutOfMemoryError: Java heap space
> 
>
> Key: SPARK-5516
> URL: https://issues.apache.org/jira/browse/SPARK-5516
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.2.0
> Environment: centos 6.5   
>Reporter: wuyukai
>
> When we ran the Gradient Boosting Tree model, it threw the exception below. 
> The data we used is only 45 MB. We ran it on 4 computers, each with 4 cores 
> and 16 GB of RAM, and set the parameter "gradientboostedtrees.maxiteration" 
> to 50.
> 15/02/01 01:39:48 INFO DAGScheduler: Job 965 failed: collectAsMap at 
> DecisionTree.scala:653, took 1.616976 s
> Exception in thread "main" org.apache.spark.SparkException: Job cancelled 
> because SparkContext was shut down
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:702)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:701)
>   at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
>   at 
> org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:701)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessActor.postStop(DAGScheduler.scala:1428)
>   at akka.actor.Actor$class.aroundPostStop(Actor.scala:475)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundPostStop(DAGScheduler.scala:1375)
>   at 
> akka.actor.dungeon.FaultHandling$class.akka$actor$dungeon$FaultHandling$$finishTerminate(FaultHandling.scala:210)
>   at 
> akka.actor.dungeon.FaultHandling$class.terminate(FaultHandling.scala:172)
>   at akka.actor.ActorCell.terminate(ActorCell.scala:369)
>   at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:462)
>   at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478)
>   at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263)
>   at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>   at 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
>   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>   at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> 15/02/01 01:39:48 ERROR ActorSystemImpl: Uncaught fatal error from thread 
> [sparkDriver-akka.actor.default-dispatcher-22] shutting down ActorSystem 
> [sparkDriver]
> java.lang.OutOfMemoryError: Java heap space
>   at java.util.Arrays.copyOf(Arrays.java:2271)
>   at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
>   at 
> java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
>   at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
>   at 
> java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876)
>   at 
> java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1785)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1188)
>   at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
>   at scala.collection.immutable.$colon$colon.writeObject(List.scala:379)
>   at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:988)
>   at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1495)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
>   at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
>   at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
>   at 
> java.io.O

[jira] [Updated] (SPARK-5516) ActorSystemImpl: Uncaught fatal error from thread [sparkDriver-akka.actor.default-dispatcher-22] shutting down ActorSystem [sparkDriver] java.lang.OutOfMemoryError: Java

2015-02-20 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-5516:
-
Target Version/s: 1.4.0  (was: 1.2.0)

> ActorSystemImpl: Uncaught fatal error from thread 
> [sparkDriver-akka.actor.default-dispatcher-22] shutting down ActorSystem 
> [sparkDriver] java.lang.OutOfMemoryError: Java heap space
> 
>
> Key: SPARK-5516
> URL: https://issues.apache.org/jira/browse/SPARK-5516
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.2.0
> Environment: centos 6.5   
>Reporter: wuyukai
>
> When we ran the Gradient Boosting Tree model, it threw the exception below. 
> The data we used is only 45 MB. We ran it on 4 computers, each with 4 cores 
> and 16 GB of RAM, and set the parameter "gradientboostedtrees.maxiteration" 
> to 50.
> 15/02/01 01:39:48 INFO DAGScheduler: Job 965 failed: collectAsMap at 
> DecisionTree.scala:653, took 1.616976 s
> Exception in thread "main" org.apache.spark.SparkException: Job cancelled 
> because SparkContext was shut down
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:702)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:701)
>   at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
>   at 
> org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:701)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessActor.postStop(DAGScheduler.scala:1428)
>   at akka.actor.Actor$class.aroundPostStop(Actor.scala:475)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundPostStop(DAGScheduler.scala:1375)
>   at 
> akka.actor.dungeon.FaultHandling$class.akka$actor$dungeon$FaultHandling$$finishTerminate(FaultHandling.scala:210)
>   at 
> akka.actor.dungeon.FaultHandling$class.terminate(FaultHandling.scala:172)
>   at akka.actor.ActorCell.terminate(ActorCell.scala:369)
>   at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:462)
>   at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478)
>   at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263)
>   at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>   at 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
>   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>   at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> 15/02/01 01:39:48 ERROR ActorSystemImpl: Uncaught fatal error from thread 
> [sparkDriver-akka.actor.default-dispatcher-22] shutting down ActorSystem 
> [sparkDriver]
> java.lang.OutOfMemoryError: Java heap space
>   at java.util.Arrays.copyOf(Arrays.java:2271)
>   at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
>   at 
> java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
>   at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
>   at 
> java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876)
>   at 
> java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1785)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1188)
>   at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
>   at scala.collection.immutable.$colon$colon.writeObject(List.scala:379)
>   at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:988)
>   at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1495)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
>   at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
>   at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
>   at 
> jav

[jira] [Updated] (SPARK-4406) SVD should check for k < 1

2015-02-20 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-4406:
-
Target Version/s: 1.3.0  (was: 1.2.0)

> SVD should check for k < 1
> --
>
> Key: SPARK-4406
> URL: https://issues.apache.org/jira/browse/SPARK-4406
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Joseph K. Bradley
>Assignee: Manoj Kumar
>Priority: Minor
> Fix For: 1.3.0
>
>
> When SVD is called with k < 1, it still tries to compute the SVD, causing a 
> lower-level error.  It should fail early.
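
The fix amounts to a fail-fast argument check at the entry point. A minimal 
sketch (the method shown is illustrative, not the exact patch):
{code}
object SVDArgCheck {
  // Illustrative fail-fast check for an SVD-style entry point.
  def computeSVD(k: Int): Unit = {
    require(k >= 1, s"Requested $k singular values, but k must be at least 1.")
    // ... proceed with the actual computation ...
  }
}
{code}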



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1537) Add integration with Yarn's Application Timeline Server

2015-02-20 Thread Zhan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329704#comment-14329704
 ] 

Zhan Zhang commented on SPARK-1537:
---

[~sowen] By the way, I am not waiting for someone to give me the patch. It is 
because someone declared the patch was almost ready half a year ago. After I 
submitted mine, someone kept saying my patch is not much different from his.

> Add integration with Yarn's Application Timeline Server
> ---
>
> Key: SPARK-1537
> URL: https://issues.apache.org/jira/browse/SPARK-1537
> Project: Spark
>  Issue Type: New Feature
>  Components: YARN
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
> Attachments: SPARK-1537.txt, spark-1573.patch
>
>
> It would be nice to have Spark integrate with Yarn's Application Timeline 
> Server (see YARN-321, YARN-1530). This would allow users running Spark on 
> Yarn to have a single place to go for all their history needs, and avoid 
> having to manage a separate service (Spark's built-in server).
> At the moment, there's a working version of the ATS in the Hadoop 2.4 branch, 
> although there is still some ongoing work. But the basics are there, and I 
> wouldn't expect them to change (much) at this point.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1537) Add integration with Yarn's Application Timeline Server

2015-02-20 Thread Zhan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329700#comment-14329700
 ] 

Zhan Zhang commented on SPARK-1537:
---

[~sowen] From the whole context, I believe you understand what happened here. 
Let's be professional. 

My request is: "if someone wants to try this alpha feature, we can at least 
provide a patch so that people can give it a try, even if it cannot go upstream 
for various reasons." 

As for the YARN blocker, we should discuss it with the YARN community instead of 
filing a bug and waiting forever. 

> Add integration with Yarn's Application Timeline Server
> ---
>
> Key: SPARK-1537
> URL: https://issues.apache.org/jira/browse/SPARK-1537
> Project: Spark
>  Issue Type: New Feature
>  Components: YARN
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
> Attachments: SPARK-1537.txt, spark-1573.patch
>
>
> It would be nice to have Spark integrate with Yarn's Application Timeline 
> Server (see YARN-321, YARN-1530). This would allow users running Spark on 
> Yarn to have a single place to go for all their history needs, and avoid 
> having to manage a separate service (Spark's built-in server).
> At the moment, there's a working version of the ATS in the Hadoop 2.4 branch, 
> although there is still some ongoing work. But the basics are there, and I 
> wouldn't expect them to change (much) at this point.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5912) Programming guide for feature selection

2015-02-20 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329692#comment-14329692
 ] 

Joseph K. Bradley commented on SPARK-5912:
--

Great, thanks!  I build and view them using jekyll in the docs/ directory:
* Run "jekyll build" to compile everything.
* Then run "jekyll serve --watch" and view the guides on localhost:4000

You may need to install jekyll and maybe some other libraries to do this.

> Programming guide for feature selection
> ---
>
> Key: SPARK-5912
> URL: https://issues.apache.org/jira/browse/SPARK-5912
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>
> The new ChiSqSelector for feature selection should have a section in the 
> Programming Guide.  It should probably be under the feature extraction and 
> transformation section as a new subsection for feature selection.
> If we get more feature selection methods later on, we could expand it to a 
> larger section of the guide.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1537) Add integration with Yarn's Application Timeline Server

2015-02-20 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329691#comment-14329691
 ] 

Sean Owen commented on SPARK-1537:
--

[~zzhan] I also can't figure out what you are suggesting here. You have 
proposed a patch, and you've been given feedback with specific reasons it 
shouldn't be committed to Spark. I agree with those, FWIW, though I think they 
can be overcome soon. I assume others agree, given the silence (?). You haven't 
responded to these specific points. As it stands I think that's your answer: 
these YARN issues need to be addressed -- either fixed or agreed to be not an 
issue.

Nobody needs to 'take over'. I'm not clear why you think you have been waiting 
on something or someone to give you code. Right now the only thing this is 
waiting on is for you or [~zjshen] or anyone to address the YARN API issues. 
Rather than keep the broken record going, why not address the YARN API issues 
highlighted here? Sorry, the answer may be that you can't get the patch you want 
committed by yourself, but that's just how OSS works.

> Add integration with Yarn's Application Timeline Server
> ---
>
> Key: SPARK-1537
> URL: https://issues.apache.org/jira/browse/SPARK-1537
> Project: Spark
>  Issue Type: New Feature
>  Components: YARN
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
> Attachments: SPARK-1537.txt, spark-1573.patch
>
>
> It would be nice to have Spark integrate with Yarn's Application Timeline 
> Server (see YARN-321, YARN-1530). This would allow users running Spark on 
> Yarn to have a single place to go for all their history needs, and avoid 
> having to manage a separate service (Spark's built-in server).
> At the moment, there's a working version of the ATS in the Hadoop 2.4 branch, 
> although there is still some ongoing work. But the basics are there, and I 
> wouldn't expect them to change (much) at this point.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5912) Programming guide for feature selection

2015-02-20 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329685#comment-14329685
 ] 

Alexander Ulanov commented on SPARK-5912:
-

I've almost finished writing the ChiSquared section in the corresponding file. I 
was able to generate the API docs with `build/sbt doc`; however, I don't see the 
"mllib-*-*" guide pages being generated as well. Could you suggest how I should 
generate them?

> Programming guide for feature selection
> ---
>
> Key: SPARK-5912
> URL: https://issues.apache.org/jira/browse/SPARK-5912
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>
> The new ChiSqSelector for feature selection should have a section in the 
> Programming Guide.  It should probably be under the feature extraction and 
> transformation section as a new subsection for feature selection.
> If we get more feature selection methods later on, we could expand it to a 
> larger section of the guide.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1537) Add integration with Yarn's Application Timeline Server

2015-02-20 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329681#comment-14329681
 ] 

Marcelo Vanzin commented on SPARK-1537:
---

It's impossible to submit a patch when the implementation is currently blocked 
on a feature that doesn't exist in Yarn. Please check the "is blocked by" link 
at the top of this bug.

If you're willing to write the code to work around that missing feature, please 
include that in your spec and patch. I am not and would rather wait for Yarn 
instead.

> Add integration with Yarn's Application Timeline Server
> ---
>
> Key: SPARK-1537
> URL: https://issues.apache.org/jira/browse/SPARK-1537
> Project: Spark
>  Issue Type: New Feature
>  Components: YARN
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
> Attachments: SPARK-1537.txt, spark-1573.patch
>
>
> It would be nice to have Spark integrate with Yarn's Application Timeline 
> Server (see YARN-321, YARN-1530). This would allow users running Spark on 
> Yarn to have a single place to go for all their history needs, and avoid 
> having to manage a separate service (Spark's built-in server).
> At the moment, there's a working version of the ATS in the Hadoop 2.4 branch, 
> although there is still some ongoing work. But the basics are there, and I 
> wouldn't expect them to change (much) at this point.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1537) Add integration with Yarn's Application Timeline Server

2015-02-20 Thread Zhan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329678#comment-14329678
 ] 

Zhan Zhang commented on SPARK-1537:
---

[~vanzin] I have been saying "integrate your code" since the first submission of 
my PR. Do you want to count how many times you keep saying this? 

 "Here's the link to the comment with the link to my code, dated August '14".  
Now Spark is under the vote for 1.3, and today is 2/20/2015.  Is it so 
difficult to submit a workable patch and design doc? 

> Add integration with Yarn's Application Timeline Server
> ---
>
> Key: SPARK-1537
> URL: https://issues.apache.org/jira/browse/SPARK-1537
> Project: Spark
>  Issue Type: New Feature
>  Components: YARN
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
> Attachments: SPARK-1537.txt, spark-1573.patch
>
>
> It would be nice to have Spark integrate with Yarn's Application Timeline 
> Server (see YARN-321, YARN-1530). This would allow users running Spark on 
> Yarn to have a single place to go for all their history needs, and avoid 
> having to manage a separate service (Spark's built-in server).
> At the moment, there's a working version of the ATS in the Hadoop 2.4 branch, 
> although there is still some ongoing work. But the basics are there, and I 
> wouldn't expect them to change (much) at this point.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5933) Centralize deprecated configs in SparkConf

2015-02-20 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-5933:
-
Description: Deprecated configs are currently all strewn across the code 
base. It would be good to simplify the handling of the deprecated configs in a 
central location to avoid duplicating the deprecation logic everywhere.

> Centralize deprecated configs in SparkConf
> --
>
> Key: SPARK-5933
> URL: https://issues.apache.org/jira/browse/SPARK-5933
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> Deprecated configs are currently all strewn across the code base. It would be 
> good to simplify the handling of the deprecated configs in a central location 
> to avoid duplicating the deprecation logic everywhere.
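
A minimal sketch of what centralizing could look like: one table of deprecated 
keys plus a single translation function, so callers never duplicate the warning 
logic. The helper and its entries are hypothetical (the example keys come from 
the sister issues' proposals), not the actual SparkConf change.
{code}
object DeprecatedConfigs {
  // old key -> (replacement key, version in which it was deprecated); example entries only
  private val deprecated: Map[String, (String, String)] = Map(
    "spark.kryoserializer.buffer.mb" -> ("spark.kryoserializer.buffer", "1.4"),
    "spark.reducer.maxMbInFlight"    -> ("spark.reducer.maxSizeInFlight", "1.4")
  )

  /** Translate a possibly deprecated key to its current name, warning when needed. */
  def translate(key: String): String = deprecated.get(key) match {
    case Some((newKey, since)) =>
      System.err.println(s"Config '$key' is deprecated as of $since; use '$newKey' instead.")
      newKey
    case None => key
  }
}
{code}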



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5933) Centralize deprecated configs in SparkConf

2015-02-20 Thread Andrew Or (JIRA)
Andrew Or created SPARK-5933:


 Summary: Centralize deprecated configs in SparkConf
 Key: SPARK-5933
 URL: https://issues.apache.org/jira/browse/SPARK-5933
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: Andrew Or
Assignee: Andrew Or






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5932) Use consistent naming for byte properties

2015-02-20 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-5932:
-
Description: 
This is SPARK-5931's sister issue.

The naming of existing byte configs is inconsistent. We currently have the 
following throughout the code base:
{code}
spark.reducer.maxMbInFlight // megabytes
spark.kryoserializer.buffer.mb // megabytes
spark.shuffle.file.buffer.kb // kilobytes
spark.broadcast.blockSize // kilobytes
spark.executor.logs.rolling.size.maxBytes // bytes
spark.io.compression.snappy.block.size // bytes
{code}
Instead, my proposal is to simplify the config name itself and make everything 
accept sizes using the following format: 500b, 2k, 100m, 46g, similar to what we 
currently use for our memory settings. For instance:
{code}
spark.reducer.maxSizeInFlight = 10m
spark.kryoserializer.buffer = 2m
spark.shuffle.file.buffer = 10k
spark.broadcast.blockSize = 20k
spark.executor.logs.rolling.maxSize = 500b
spark.io.compression.snappy.blockSize = 200b
{code}
All existing configs that are relevant will be deprecated in favor of the new 
ones. We should do this soon before we keep introducing more byte configs.

  was:
This is SPARK-5931's sister issue.

The naming of existing byte configs is inconsistent. We currently have the 
following throughout the code base:
{code}
spark.reducer.maxMbInFlight (mb)
spark.kryoserializer.buffer.mb (mb)
spark.shuffle.file.buffer.kb (kb)
spark.broadcast.blockSize (kb)
spark.executor.logs.rolling.size.maxBytes (bytes)
spark.io.compression.snappy.block.size (bytes)
{code}
Instead, my proposal is to simplify the config name itself and make everything 
accept sizes using the following format: 500b, 2k, 100m, 46g, similar to what we 
currently use for our memory settings. For instance:
{code}
spark.reducer.maxSizeInFlight = 10m
spark.kryoserializer.buffer = 2m
spark.shuffle.file.buffer = 10k
spark.broadcast.blockSize = 20k
spark.executor.logs.rolling.maxSize = 500b
spark.io.compression.snappy.blockSize = 200b
{code}
All existing configs that are relevant will be deprecated in favor of the new 
ones. We should do this soon before we keep introducing more byte configs.


> Use consistent naming for byte properties
> -
>
> Key: SPARK-5932
> URL: https://issues.apache.org/jira/browse/SPARK-5932
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> This is SPARK-5931's sister issue.
> The naming of existing byte configs is inconsistent. We currently have the 
> following throughout the code base:
> {code}
> spark.reducer.maxMbInFlight // megabytes
> spark.kryoserializer.buffer.mb // megabytes
> spark.shuffle.file.buffer.kb // kilobytes
> spark.broadcast.blockSize // kilobytes
> spark.executor.logs.rolling.size.maxBytes // bytes
> spark.io.compression.snappy.block.size // bytes
> {code}
> Instead, my proposal is to simplify the config name itself and make 
> everything accept sizes using the following format: 500b, 2k, 100m, 46g, 
> similar to what we currently use for our memory settings. For instance:
> {code}
> spark.reducer.maxSizeInFlight = 10m
> spark.kryoserializer.buffer = 2m
> spark.shuffle.file.buffer = 10k
> spark.broadcast.blockSize = 20k
> spark.executor.logs.rolling.maxSize = 500b
> spark.io.compression.snappy.blockSize = 200b
> {code}
> All existing configs that are relevant will be deprecated in favor of the new 
> ones. We should do this soon before we keep introducing more byte configs.
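
A sketch of the suffix parsing this implies, assuming the b/k/m/g suffixes from 
the examples above (hypothetical helper, not the eventual SparkConf API):
{code}
object ByteStrings {
  // Parse byte values like "500b", "2k", "100m", "46g" into a byte count.
  def byteStringToBytes(s: String): Long = {
    val units = Map('b' -> 1L, 'k' -> 1024L, 'm' -> 1024L * 1024, 'g' -> 1024L * 1024 * 1024)
    val trimmed = s.trim.toLowerCase
    require(trimmed.nonEmpty && units.contains(trimmed.last), s"Unknown size suffix in '$s'")
    trimmed.dropRight(1).toLong * units(trimmed.last)
  }
}

// ByteStrings.byteStringToBytes("10m") == 10485760L
{code}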



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5931) Use consistent naming for time properties

2015-02-20 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-5931:
-
Description: 
This is SPARK-5932's sister issue.

The naming of existing time configs is inconsistent. We currently have the 
following throughout the code base:
{code}
spark.network.timeout // seconds
spark.executor.heartbeatInterval // milliseconds
spark.storage.blockManagerSlaveTimeoutMs // milliseconds
spark.yarn.scheduler.heartbeat.interval-ms // milliseconds
{code}
Instead, my proposal is to simplify the config name itself and make everything 
accept time using the following format: 5s, 2ms, 100us. For instance:
{code}
spark.network.timeout = 5s
spark.executor.heartbeatInterval = 500ms
spark.storage.blockManagerSlaveTimeout = 100ms
spark.yarn.scheduler.heartbeatInterval = 400ms
{code}
All existing configs that are relevant will be deprecated in favor of the new 
ones. We should do this soon before we keep introducing more time configs.

  was:
This is SPARK-5932's sister issue.

The naming of existing time configs is inconsistent. We currently have the 
following throughout the code base:
{code}
spark.network.timeout (seconds)
spark.executor.heartbeatInterval (milliseconds)
spark.storage.blockManagerSlaveTimeoutMs (milliseconds)
spark.yarn.scheduler.heartbeat.interval-ms (milliseconds)
{code}
Instead, my proposal is to simplify the config name itself and make everything 
accept time using the following format: 5s, 2ms, 100us. For instance:
{code}
spark.network.timeout = 5s
spark.executor.heartbeatInterval = 500ms
spark.storage.blockManagerSlaveTimeout = 100ms
spark.yarn.scheduler.heartbeatInterval = 400ms
{code}
All existing configs that are relevant will be deprecated in favor of the new 
ones. We should do this soon before we keep introducing more time configs.


> Use consistent naming for time properties
> -
>
> Key: SPARK-5931
> URL: https://issues.apache.org/jira/browse/SPARK-5931
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> This is SPARK-5932's sister issue.
> The naming of existing time configs is inconsistent. We currently have the 
> following throughout the code base:
> {code}
> spark.network.timeout // seconds
> spark.executor.heartbeatInterval // milliseconds
> spark.storage.blockManagerSlaveTimeoutMs // milliseconds
> spark.yarn.scheduler.heartbeat.interval-ms // milliseconds
> {code}
> Instead, my proposal is to simplify the config name itself and make 
> everything accept time using the following format: 5s, 2ms, 100us. For 
> instance:
> {code}
> spark.network.timeout = 5s
> spark.executor.heartbeatInterval = 500ms
> spark.storage.blockManagerSlaveTimeout = 100ms
> spark.yarn.scheduler.heartbeatInterval = 400ms
> {code}
> All existing configs that are relevant will be deprecated in favor of the new 
> ones. We should do this soon before we keep introducing more time configs.
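
A sketch of the corresponding suffix parsing for time values, assuming the 
us/ms/s suffixes from the examples above (hypothetical helper, not the eventual 
SparkConf API):
{code}
object TimeStrings {
  // Normalize time values like "5s", "500ms", "100us" to microseconds.
  def timeStringToMicros(s: String): Long = {
    val (digits, suffix) = s.trim.toLowerCase.span(_.isDigit)
    val factor = suffix match {
      case "us" => 1L
      case "ms" => 1000L
      case "s"  => 1000L * 1000
      case _    => throw new IllegalArgumentException(s"Unknown time suffix in '$s'")
    }
    digits.toLong * factor
  }
}

// TimeStrings.timeStringToMicros("500ms") == 500000L
{code}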



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5931) Use consistent naming for time properties

2015-02-20 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-5931:
-
Description: 
This is SPARK-5932's sister issue.

The naming of existing time configs is inconsistent. We currently have the 
following throughout the code base:
{code}
spark.network.timeout (seconds)
spark.executor.heartbeatInterval (milliseconds)
spark.storage.blockManagerSlaveTimeoutMs (milliseconds)
spark.yarn.scheduler.heartbeat.interval-ms (milliseconds)
{code}
Instead, my proposal is to simplify the config name itself and make everything 
accept time using the following format: 5s, 2ms, 100us. For instance:
{code}
spark.network.timeout = 5s
spark.executor.heartbeatInterval = 500ms
spark.storage.blockManagerSlaveTimeout = 100ms
spark.yarn.scheduler.heartbeatInterval = 400ms
{code}
All existing configs that are relevant will be deprecated in favor of the new 
ones. We should do this soon before we keep introducing more time configs.

  was:
This is SPARK-5932's sister issue.

The naming of existing time configs is inconsistent. We currently have the 
following throughout the code base:

spark.network.timeout (seconds)
spark.executor.heartbeatInterval (milliseconds)
spark.storage.blockManagerSlaveTimeoutMs (milliseconds)
spark.yarn.scheduler.heartbeat.interval-ms (milliseconds)

Instead, my proposal is to simplify the config name itself and make everything 
accept time using the following format: 5s, 2ms, 100us. For instance:

spark.network.timeout = 5s
spark.executor.heartbeatInterval = 500ms
spark.storage.blockManagerSlaveTimeout = 100ms
spark.yarn.scheduler.heartbeatInterval = 400ms

We should do this soon before we keep introducing more time configs.


> Use consistent naming for time properties
> -
>
> Key: SPARK-5931
> URL: https://issues.apache.org/jira/browse/SPARK-5931
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> This is SPARK-5932's sister issue.
> The naming of existing time configs is inconsistent. We currently have the 
> following throughout the code base:
> {code}
> spark.network.timeout (seconds)
> spark.executor.heartbeatInterval (milliseconds)
> spark.storage.blockManagerSlaveTimeoutMs (milliseconds)
> spark.yarn.scheduler.heartbeat.interval-ms (milliseconds)
> {code}
> Instead, my proposal is to simplify the config name itself and make 
> everything accept time using the following format: 5s, 2ms, 100us. For 
> instance:
> {code}
> spark.network.timeout = 5s
> spark.executor.heartbeatInterval = 500ms
> spark.storage.blockManagerSlaveTimeout = 100ms
> spark.yarn.scheduler.heartbeatInterval = 400ms
> {code}
> All existing configs that are relevant will be deprecated in favor of the new 
> ones. We should do this soon before we keep introducing more time configs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5932) Use consistent naming for byte properties

2015-02-20 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-5932:
-
Description: 
This is SPARK-5931's sister issue.

The naming of existing byte configs is inconsistent. We currently have the 
following throughout the code base:
{code}
spark.reducer.maxMbInFlight (mb)
spark.kryoserializer.buffer.mb (mb)
spark.shuffle.file.buffer.kb (kb)
spark.broadcast.blockSize (kb)
spark.executor.logs.rolling.size.maxBytes (bytes)
spark.io.compression.snappy.block.size (bytes)
{code}
Instead, my proposal is to simplify the config name itself and make everything 
accept sizes using the following format: 500b, 2k, 100m, 46g, similar to what we 
currently use for our memory settings. For instance:
{code}
spark.reducer.maxSizeInFlight = 10m
spark.kryoserializer.buffer = 2m
spark.shuffle.file.buffer = 10k
spark.broadcast.blockSize = 20k
spark.executor.logs.rolling.maxSize = 500b
spark.io.compression.snappy.blockSize = 200b
{code}
All existing configs that are relevant will be deprecated in favor of the new 
ones. We should do this soon before we keep introducing more byte configs.

  was:
This is SPARK-5931's sister issue.

The naming of existing byte configs is inconsistent. We currently have the 
following throughout the code base:
{code}
spark.reducer.maxMbInFlight (mb)
spark.kryoserializer.buffer.mb (mb)
spark.shuffle.file.buffer.kb (kb)
spark.broadcast.blockSize (kb)
spark.executor.logs.rolling.size.maxBytes (bytes)
spark.io.compression.snappy.block.size (bytes)
{code}
Instead, my proposal is to simplify the config name itself and make everything 
accept sizes using the following format: 500b, 2k, 100m, 46g, similar to what we 
currently use for our memory settings. For instance:
{code}
spark.reducer.maxSizeInFlight = 10m
spark.kryoserializer.buffer = 2m
spark.shuffle.file.buffer = 10k
spark.broadcast.blockSize = 20k
spark.executor.logs.rolling.maxSize = 500b
spark.io.compression.snappy.blockSize = 200b
{code}
We should do this soon before we keep introducing more byte configs.


> Use consistent naming for byte properties
> -
>
> Key: SPARK-5932
> URL: https://issues.apache.org/jira/browse/SPARK-5932
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> This is SPARK-5931's sister issue.
> The naming of existing byte configs is inconsistent. We currently have the 
> following throughout the code base:
> {code}
> spark.reducer.maxMbInFlight (mb)
> spark.kryoserializer.buffer.mb (mb)
> spark.shuffle.file.buffer.kb (kb)
> spark.broadcast.blockSize (kb)
> spark.executor.logs.rolling.size.maxBytes (bytes)
> spark.io.compression.snappy.block.size (bytes)
> {code}
> Instead, my proposal is to simplify the config name itself and make 
> everything accept sizes using the following format: 500b, 2k, 100m, 46g, 
> similar to what we currently use for our memory settings. For instance:
> {code}
> spark.reducer.maxSizeInFlight = 10m
> spark.kryoserializer.buffer = 2m
> spark.shuffle.file.buffer = 10k
> spark.broadcast.blockSize = 20k
> spark.executor.logs.rolling.maxSize = 500b
> spark.io.compression.snappy.blockSize = 200b
> {code}
> All existing configs that are relevant will be deprecated in favor of the new 
> ones. We should do this soon before we keep introducing more byte configs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5932) Use consistent naming for byte properties

2015-02-20 Thread Andrew Or (JIRA)
Andrew Or created SPARK-5932:


 Summary: Use consistent naming for byte properties
 Key: SPARK-5932
 URL: https://issues.apache.org/jira/browse/SPARK-5932
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Andrew Or
Assignee: Andrew Or


This is SPARK-5931's sister issue.

The naming of existing byte configs is inconsistent. We currently have the 
following throughout the code base:

spark.reducer.maxMbInFlight (mb)
spark.kryoserializer.buffer.mb (mb)
spark.shuffle.file.buffer.kb (kb)
spark.broadcast.blockSize (kb)
spark.executor.logs.rolling.size.maxBytes (bytes)
spark.io.compression.snappy.block.size (bytes)

Instead, my proposal is to simplify the config name itself and make everything 
accept sizes using the following format: 500b, 2k, 100m, 46g, similar to what we 
currently use for our memory settings. For instance:

spark.reducer.maxSizeInFlight = 10m
spark.kryoserializer.buffer = 2m
spark.shuffle.file.buffer = 10k
spark.broadcast.blockSize = 20k
spark.executor.logs.rolling.maxSize = 500b
spark.io.compression.snappy.blockSize = 200b

We should do this soon before we keep introducing more byte configs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5932) Use consistent naming for byte properties

2015-02-20 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-5932:
-
Description: 
This is SPARK-5931's sister issue.

The naming of existing byte configs is inconsistent. We currently have the 
following throughout the code base:
{code}
spark.reducer.maxMbInFlight (mb)
spark.kryoserializer.buffer.mb (mb)
spark.shuffle.file.buffer.kb (kb)
spark.broadcast.blockSize (kb)
spark.executor.logs.rolling.size.maxBytes (bytes)
spark.io.compression.snappy.block.size (bytes)
{code}
Instead, my proposal is to simplify the config name itself and make everything 
accept sizes using the following format: 500b, 2k, 100m, 46g, similar to what we 
currently use for our memory settings. For instance:
{code}
spark.reducer.maxSizeInFlight = 10m
spark.kryoserializer.buffer = 2m
spark.shuffle.file.buffer = 10k
spark.broadcast.blockSize = 20k
spark.executor.logs.rolling.maxSize = 500b
spark.io.compression.snappy.blockSize = 200b
{code}
We should do this soon before we keep introducing more byte configs.

  was:
This is SPARK-5931's sister issue.

The naming of existing byte configs is inconsistent. We currently have the 
following throughout the code base:

spark.reducer.maxMbInFlight (mb)
spark.kryoserializer.buffer.mb (mb)
spark.shuffle.file.buffer.kb (kb)
spark.broadcast.blockSize (kb)
spark.executor.logs.rolling.size.maxBytes (bytes)
spark.io.compression.snappy.block.size (bytes)

Instead, my proposal is to simplify the config name itself and make everything 
accept sizes using the following format: 500b, 2k, 100m, 46g, similar to what we 
currently use for our memory settings. For instance:

spark.reducer.maxSizeInFlight = 10m
spark.kryoserializer.buffer = 2m
spark.shuffle.file.buffer = 10k
spark.broadcast.blockSize = 20k
spark.executor.logs.rolling.maxSize = 500b
spark.io.compression.snappy.blockSize = 200b

We should do this soon before we keep introducing more byte configs.


> Use consistent naming for byte properties
> -
>
> Key: SPARK-5932
> URL: https://issues.apache.org/jira/browse/SPARK-5932
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> This is SPARK-5931's sister issue.
> The naming of existing byte configs is inconsistent. We currently have the 
> following throughout the code base:
> {code}
> spark.reducer.maxMbInFlight (mb)
> spark.kryoserializer.buffer.mb (mb)
> spark.shuffle.file.buffer.kb (kb)
> spark.broadcast.blockSize (kb)
> spark.executor.logs.rolling.size.maxBytes (bytes)
> spark.io.compression.snappy.block.size (bytes)
> {code}
> Instead, my proposal is to simplify the config name itself and make 
> everything accept sizes using the following format: 500b, 2k, 100m, 46g, 
> similar to what we currently use for our memory settings. For instance:
> {code}
> spark.reducer.maxSizeInFlight = 10m
> spark.kryoserializer.buffer = 2m
> spark.shuffle.file.buffer = 10k
> spark.broadcast.blockSize = 20k
> spark.executor.logs.rolling.maxSize = 500b
> spark.io.compression.snappy.blockSize = 200b
> {code}
> We should do this soon before we keep introducing more byte configs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5931) Use consistent naming for time properties

2015-02-20 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-5931:
-
Description: 
This is SPARK-5932's sister issue.

The naming of existing time configs is inconsistent. We currently have the 
following throughout the code base:

spark.network.timeout (seconds)
spark.executor.heartbeatInterval (milliseconds)
spark.storage.blockManagerSlaveTimeoutMs (milliseconds)
spark.yarn.scheduler.heartbeat.interval-ms (milliseconds)

Instead, my proposal is to simplify the config name itself and make everything 
accept time using the following format: 5s, 2ms, 100us. For instance:

spark.network.timeout = 5s
spark.executor.heartbeatInterval = 500ms
spark.storage.blockManagerSlaveTimeout = 100ms
spark.yarn.scheduler.heartbeatInterval = 400ms

We should do this soon before we keep introducing more time configs.

  was:
The naming of existing time configs is inconsistent. We currently have the 
following throughout the code base:

spark.network.timeout (seconds)
spark.executor.heartbeatInterval (milliseconds)
spark.storage.blockManagerSlaveTimeoutMs (milliseconds)
spark.yarn.scheduler.heartbeat.interval-ms (milliseconds)

Instead, my proposal is to simplify the config name itself and make everything 
accept time using the following format: 5s, 2ms, 100us. For instance:

spark.network.timeout = 5s
spark.executor.heartbeatInterval = 500ms
spark.storage.blockManagerSlaveTimeout = 100ms
spark.yarn.scheduler.heartbeatInterval = 400ms

We should do this soon before we keep introducing more time configs.


> Use consistent naming for time properties
> -
>
> Key: SPARK-5931
> URL: https://issues.apache.org/jira/browse/SPARK-5931
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> This is SPARK-5932's sister issue.
> The naming of existing time configs is inconsistent. We currently have the 
> following throughout the code base:
> spark.network.timeout (seconds)
> spark.executor.heartbeatInterval (milliseconds)
> spark.storage.blockManagerSlaveTimeoutMs (milliseconds)
> spark.yarn.scheduler.heartbeat.interval-ms (milliseconds)
> Instead, my proposal is to simplify the config name itself and make 
> everything accept time using the following format: 5s, 2ms, 100us. For 
> instance:
> spark.network.timeout = 5s
> spark.executor.heartbeatInterval = 500ms
> spark.storage.blockManagerSlaveTimeout = 100ms
> spark.yarn.scheduler.heartbeatInterval = 400ms
> We should do this soon before we keep introducing more time configs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1537) Add integration with Yarn's Application Timeline Server

2015-02-20 Thread Zhan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329664#comment-14329664
 ] 

Zhan Zhang commented on SPARK-1537:
---

[~vanzin] If you don't have the bandwidth, or don't know how to move this JIRA 
forward after such a long time, I don't mind taking it over.

> Add integration with Yarn's Application Timeline Server
> ---
>
> Key: SPARK-1537
> URL: https://issues.apache.org/jira/browse/SPARK-1537
> Project: Spark
>  Issue Type: New Feature
>  Components: YARN
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
> Attachments: SPARK-1537.txt, spark-1573.patch
>
>
> It would be nice to have Spark integrate with Yarn's Application Timeline 
> Server (see YARN-321, YARN-1530). This would allow users running Spark on 
> Yarn to have a single place to go for all their history needs, and avoid 
> having to manage a separate service (Spark's built-in server).
> At the moment, there's a working version of the ATS in the Hadoop 2.4 branch, 
> although there is still some ongoing work. But the basics are there, and I 
> wouldn't expect them to change (much) at this point.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1537) Add integration with Yarn's Application Timeline Server

2015-02-20 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329660#comment-14329660
 ] 

Marcelo Vanzin commented on SPARK-1537:
---

Hi [~zzhan],

I already posted the link to my code in this bug several times. The reason why 
I haven't sent a PR is the exact issue I raised about your spec and your patch: 
it uses private Yarn APIs. I've said this several times, and I really don't 
understand what part of it you don't understand. Pardon me if I haven't been 
clear about it.

Also note that there's a Yarn bug in the list of blockers for this one. That's 
because my p.o.c. code depends on that bug being fixed before it can move 
forward. If you have a design that is not blocked by that bug, and does not use 
internal APIs, feel free to remove the link and post it.

Here's the link to the comment with the link to my code, dated August '14:
https://issues.apache.org/jira/browse/SPARK-1537?focusedCommentId=14088438&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14088438

A link you have already seen, since you used parts of that code in your patch.

So please, can you reply to my actual comments instead of repeatedly going back 
to this issue? My comments have nothing to do with the fact that I've written a 
p.o.c. for this feature. They're issues that exist in your spec and your code, 
independent of anything I've done.

> Add integration with Yarn's Application Timeline Server
> ---
>
> Key: SPARK-1537
> URL: https://issues.apache.org/jira/browse/SPARK-1537
> Project: Spark
>  Issue Type: New Feature
>  Components: YARN
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
> Attachments: SPARK-1537.txt, spark-1573.patch
>
>
> It would be nice to have Spark integrate with Yarn's Application Timeline 
> Server (see YARN-321, YARN-1530). This would allow users running Spark on 
> Yarn to have a single place to go for all their history needs, and avoid 
> having to manage a separate service (Spark's built-in server).
> At the moment, there's a working version of the ATS in the Hadoop 2.4 branch, 
> although there is still some ongoing work. But the basics are there, and I 
> wouldn't expect them to change (much) at this point.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5931) Use consistent naming for time properties

2015-02-20 Thread Andrew Or (JIRA)
Andrew Or created SPARK-5931:


 Summary: Use consistent naming for time properties
 Key: SPARK-5931
 URL: https://issues.apache.org/jira/browse/SPARK-5931
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Andrew Or
Assignee: Andrew Or


The naming of existing time configs is inconsistent. We currently have the 
following throughout the code base:

spark.network.timeout (seconds)
spark.executor.heartbeatInterval (milliseconds)
spark.storage.blockManagerSlaveTimeoutMs (milliseconds)
spark.yarn.scheduler.heartbeat.interval-ms (milliseconds)

Instead, my proposal is to simplify the config name itself and make everything 
accept time using the following format: 5s, 2ms, 100us. For instance:

spark.network.timeout = 5s
spark.executor.heartbeatInterval = 500ms
spark.storage.blockManagerSlaveTimeout = 100ms
spark.yarn.scheduler.heartbeatInterval = 400ms

We should do this soon before we keep introducing more time configs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-1537) Add integration with Yarn's Application Timeline Server

2015-02-20 Thread Zhan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329649#comment-14329649
 ] 

Zhan Zhang edited comment on SPARK-1537 at 2/20/15 10:14 PM:
-

[~vanzin] Thanks for the comments. I don't understand why you keep saying "my 
code does not have many differences from your code." We are working on an 
Apache project, and we all follow Apache policy. Here is the link to the Apache 
license details:
http://www.apache.org/licenses/LICENSE-2.0.

Since you think your prototype has been ready for half a year, and as I have 
requested several times, why not post your working patch and design and move 
forward? I will then explain to you clearly what the major differences are 
between the core design of my code and yours. The patch is small and the design 
is not complicated, but I am sure I can show you where that core design comes 
from.

After you post your design and code, we can start from there.

Thanks.

Zhan Zhang



was (Author: zzhan):
[~vanzin] Thanks for the comments.  I  don't understand you keep saying "my 
code does not have many differences form your code."   We are working for 
apache project, and we all follow apache policy. Here is the link for apache 
license details:
http://www.apache.org/licenses/LICENSE-2.0.

As I request several times, why not post your workable patch and design. I will 
explain to you clearly "what's the major difference of  the core design of my 
code from yours" . The patch size is small, and the design is not so 
complicated, but I am sure to show you where those core design come from.

After you post your design and code, we can start from there.

Thanks.

Zhan Zhang


> Add integration with Yarn's Application Timeline Server
> ---
>
> Key: SPARK-1537
> URL: https://issues.apache.org/jira/browse/SPARK-1537
> Project: Spark
>  Issue Type: New Feature
>  Components: YARN
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
> Attachments: SPARK-1537.txt, spark-1573.patch
>
>
> It would be nice to have Spark integrate with Yarn's Application Timeline 
> Server (see YARN-321, YARN-1530). This would allow users running Spark on 
> Yarn to have a single place to go for all their history needs, and avoid 
> having to manage a separate service (Spark's built-in server).
> At the moment, there's a working version of the ATS in the Hadoop 2.4 branch, 
> although there is still some ongoing work. But the basics are there, and I 
> wouldn't expect them to change (much) at this point.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1537) Add integration with Yarn's Application Timeline Server

2015-02-20 Thread Zhan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329649#comment-14329649
 ] 

Zhan Zhang commented on SPARK-1537:
---

[~vanzin] Thanks for the comments. I don't understand why you keep saying "my 
code does not have many differences from your code." We are working on an 
Apache project, and we all follow Apache policy. Here is the link to the Apache 
license details:
http://www.apache.org/licenses/LICENSE-2.0.

As I have requested several times, why not post your working patch and design? 
I will then explain to you clearly what the major differences are between the 
core design of my code and yours. The patch is small and the design is not 
complicated, but I am sure I can show you where that core design comes from.

After you post your design and code, we can start from there.

Thanks.

Zhan Zhang


> Add integration with Yarn's Application Timeline Server
> ---
>
> Key: SPARK-1537
> URL: https://issues.apache.org/jira/browse/SPARK-1537
> Project: Spark
>  Issue Type: New Feature
>  Components: YARN
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
> Attachments: SPARK-1537.txt, spark-1573.patch
>
>
> It would be nice to have Spark integrate with Yarn's Application Timeline 
> Server (see YARN-321, YARN-1530). This would allow users running Spark on 
> Yarn to have a single place to go for all their history needs, and avoid 
> having to manage a separate service (Spark's built-in server).
> At the moment, there's a working version of the ATS in the Hadoop 2.4 branch, 
> although there is still some ongoing work. But the basics are there, and I 
> wouldn't expect them to change (much) at this point.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3368) Spark cannot be used with Avro and Parquet

2015-02-20 Thread Daniel Fry (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329645#comment-14329645
 ] 

Daniel Fry commented on SPARK-3368:
---

Hey, FWIW I encountered this recently with Spark 1.1.1, parquet-avro 1.6.0rc4, 
and Avro 1.7.6. We're running a standalone driver app against the Mesos backend 
scheduler. [~theclaymethod]'s hadoop2 classifier trick didn't work for me, but 
[~cdgore]'s approach did, via spark.executorEnv.SPARK_CLASSPATH.
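
For anyone else hitting this, a rough sketch of that workaround (the jar paths 
are placeholders for wherever the jars live on the workers, and SPARK_CLASSPATH 
is a deprecated mechanism, so treat this as a stopgap rather than a fix):

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Expose the Avro/Parquet jars to the executors' class loader by exporting
// SPARK_CLASSPATH into the executor environment. Paths below are placeholders.
val conf = new SparkConf()
  .setAppName("avro-parquet-workaround")
  .set("spark.executorEnv.SPARK_CLASSPATH",
    "/opt/jars/parquet-avro-1.6.0rc4.jar:/opt/jars/avro-1.7.6.jar")
val sc = new SparkContext(conf)
{code}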

> Spark cannot be used with Avro and Parquet
> --
>
> Key: SPARK-3368
> URL: https://issues.apache.org/jira/browse/SPARK-3368
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.2
>Reporter: Graham Dennis
>
> Spark cannot currently (as of 1.0.2) use any Parquet write support classes 
> that are not part of the spark assembly jar (at least when launched using 
> `spark-submit`).  This prevents using Avro with Parquet.
> See https://github.com/GrahamDennis/spark-avro-parquet for a test case to 
> reproduce this issue.
> The problem appears in the master logs as:
> {noformat}
> 14/09/03 17:31:10 ERROR Executor: Exception in task ID 0
> parquet.hadoop.BadConfigurationException: could not instanciate class 
> parquet.avro.AvroWriteSupport set in job conf at parquet.write.support.class
>   at 
> parquet.hadoop.ParquetOutputFormat.getWriteSupportClass(ParquetOutputFormat.java:121)
>   at 
> parquet.hadoop.ParquetOutputFormat.getWriteSupport(ParquetOutputFormat.java:302)
>   at 
> parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:262)
>   at 
> parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:252)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:714)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:699)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
>   at org.apache.spark.scheduler.Task.run(Task.scala:51)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ClassNotFoundException: parquet.avro.AvroWriteSupport
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:190)
>   at 
> parquet.hadoop.ParquetOutputFormat.getWriteSupportClass(ParquetOutputFormat.java:115)
>   ... 11 more
> {noformat}
> The root cause of the problem is that the class loader that's used to find 
> the Parquet write support class only searches the spark assembly jar and 
> doesn't also search the application jar.  A solution would be to ensure that 
> the application jar is always available on the executor classpath.  This is 
> the same underlying issue as SPARK-2878, and SPARK-3166



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5930) Documented default of spark.shuffle.io.retryWait is confusing

2015-02-20 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-5930:
-
Description: 
The description makes it sound like the retryWait itself defaults to 15 
seconds, when it's actually 5. We should clarify this by changing the wording a 
little...
{code}

  spark.shuffle.io.retryWait
  5
  
(Netty only) Seconds to wait between retries of fetches. The maximum delay 
caused by retrying
is simply maxRetries * retryWait, by default 15 seconds.
  

{code}
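
For what it's worth, the arithmetic behind that wording (assuming 
spark.shuffle.io.maxRetries keeps its documented default of 3) is just:

{code}
// Assuming spark.shuffle.io.maxRetries keeps its documented default of 3.
val retryWaitSeconds = 5                              // spark.shuffle.io.retryWait
val maxRetries = 3                                    // spark.shuffle.io.maxRetries
val maxDelaySeconds = maxRetries * retryWaitSeconds   // = 15, the "15 seconds" above
{code}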

  was:
The description makes it sound like the retryWait itself defaults to 15 
seconds, when it's actually 5.
{code}

  spark.shuffle.io.retryWait
  5
  
(Netty only) Seconds to wait between retries of fetches. The maximum delay 
caused by retrying
is simply maxRetries * retryWait, by default 15 seconds.
  

{code}


> Documented default of spark.shuffle.io.retryWait is confusing
> -
>
> Key: SPARK-5930
> URL: https://issues.apache.org/jira/browse/SPARK-5930
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.2.0
>Reporter: Andrew Or
>Priority: Trivial
>
> The description makes it sound like the retryWait itself defaults to 15 
> seconds, when it's actually 5. We should clarify this by changing the wording 
> a little...
> {code}
> 
>   spark.shuffle.io.retryWait
>   5
>   
> (Netty only) Seconds to wait between retries of fetches. The maximum 
> delay caused by retrying
> is simply maxRetries * retryWait, by default 15 seconds.
>   
> 
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4655) Split Stage into ShuffleMapStage and ResultStage subclasses

2015-02-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329641#comment-14329641
 ] 

Apache Spark commented on SPARK-4655:
-

User 'ilganeli' has created a pull request for this issue:
https://github.com/apache/spark/pull/4708

> Split Stage into ShuffleMapStage and ResultStage subclasses
> ---
>
> Key: SPARK-4655
> URL: https://issues.apache.org/jira/browse/SPARK-4655
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Josh Rosen
>Assignee: Ilya Ganelin
>
> The scheduler's {{Stage}} class has many fields which are only applicable to 
> result stages or shuffle map stages.  As a result, I think that it makes 
> sense to make {{Stage}} into an abstract base class with two subclasses, 
> {{ResultStage}} and {{ShuffleMapStage}}.  This would improve the 
> understandability of the DAGScheduler code. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5930) Documented default of spark.shuffle.io.retryWait is confusing

2015-02-20 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-5930:
-
Affects Version/s: 1.2.0

> Documented default of spark.shuffle.io.retryWait is confusing
> -
>
> Key: SPARK-5930
> URL: https://issues.apache.org/jira/browse/SPARK-5930
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.2.0
>Reporter: Andrew Or
>Priority: Trivial
>
> The description makes it sound like the retryWait itself defaults to 15 
> seconds, when it's actually 5.
> {code}
> 
>   spark.shuffle.io.retryWait
>   5
>   
> (Netty only) Seconds to wait between retries of fetches. The maximum 
> delay caused by retrying
> is simply maxRetries * retryWait, by default 15 seconds.
>   
> 
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5930) Documented default of spark.shuffle.io.retryWait is confusing

2015-02-20 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-5930:
-
Priority: Trivial  (was: Minor)

> Documented default of spark.shuffle.io.retryWait is confusing
> -
>
> Key: SPARK-5930
> URL: https://issues.apache.org/jira/browse/SPARK-5930
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.2.0
>Reporter: Andrew Or
>Priority: Trivial
>
> The description makes it sound like the retryWait itself defaults to 15 
> seconds, when it's actually 5.
> {code}
> 
>   spark.shuffle.io.retryWait
>   5
>   
> (Netty only) Seconds to wait between retries of fetches. The maximum 
> delay caused by retrying
> is simply maxRetries * retryWait, by default 15 seconds.
>   
> 
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5930) Documented default of spark.shuffle.io.retryWait is confusing

2015-02-20 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-5930:
-
Summary: Documented default of spark.shuffle.io.retryWait is confusing  
(was: Documented default of spark.shuffle.io.retryWait is confusing.)

> Documented default of spark.shuffle.io.retryWait is confusing
> -
>
> Key: SPARK-5930
> URL: https://issues.apache.org/jira/browse/SPARK-5930
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.2.0
>Reporter: Andrew Or
>Priority: Minor
>
> 5 != 15:
> {code}
> 
>   spark.shuffle.io.retryWait
>   5
>   
> (Netty only) Seconds to wait between retries of fetches. The maximum 
> delay caused by retrying
> is simply maxRetries * retryWait, by default 15 seconds.
>   
> 
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5930) Documented default of spark.shuffle.io.retryWait is not consistent

2015-02-20 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-5930:
-
Priority: Minor  (was: Major)

> Documented default of spark.shuffle.io.retryWait is not consistent
> --
>
> Key: SPARK-5930
> URL: https://issues.apache.org/jira/browse/SPARK-5930
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.2.0
>Reporter: Andrew Or
>Priority: Minor
>
> 5 != 15:
> {code}
> 
>   spark.shuffle.io.retryWait
>   5
>   
> (Netty only) Seconds to wait between retries of fetches. The maximum 
> delay caused by retrying
> is simply maxRetries * retryWait, by default 15 seconds.
>   
> 
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5930) Documented default of spark.shuffle.io.retryWait is confusing

2015-02-20 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-5930:
-
Description: 
The description makes it sound like the retryWait itself defaults to 15 
seconds, when it's actually 5.
{code}

  spark.shuffle.io.retryWait
  5
  
(Netty only) Seconds to wait between retries of fetches. The maximum delay 
caused by retrying
is simply maxRetries * retryWait, by default 15 seconds.
  

{code}

  was:
5 != 15:
{code}

  spark.shuffle.io.retryWait
  5
  
(Netty only) Seconds to wait between retries of fetches. The maximum delay 
caused by retrying
is simply maxRetries * retryWait, by default 15 seconds.
  

{code}


> Documented default of spark.shuffle.io.retryWait is confusing
> -
>
> Key: SPARK-5930
> URL: https://issues.apache.org/jira/browse/SPARK-5930
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.2.0
>Reporter: Andrew Or
>Priority: Minor
>
> The description makes it sound like the retryWait itself defaults to 15 
> seconds, when it's actually 5.
> {code}
> 
>   spark.shuffle.io.retryWait
>   5
>   
> (Netty only) Seconds to wait between retries of fetches. The maximum 
> delay caused by retrying
> is simply maxRetries * retryWait, by default 15 seconds.
>   
> 
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5930) Documented default of spark.shuffle.io.retryWait is confusing.

2015-02-20 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-5930:
-
Summary: Documented default of spark.shuffle.io.retryWait is confusing.  
(was: Documented default of spark.shuffle.io.retryWait is not consistent)

> Documented default of spark.shuffle.io.retryWait is confusing.
> --
>
> Key: SPARK-5930
> URL: https://issues.apache.org/jira/browse/SPARK-5930
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.2.0
>Reporter: Andrew Or
>Priority: Minor
>
> 5 != 15:
> {code}
> 
>   spark.shuffle.io.retryWait
>   5
>   
> (Netty only) Seconds to wait between retries of fetches. The maximum 
> delay caused by retrying
> is simply maxRetries * retryWait, by default 15 seconds.
>   
> 
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5930) Documented default of spark.shuffle.io.retryWait is not consistent

2015-02-20 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-5930:
-
Description: 
5 != 15:
{code}

  spark.shuffle.io.retryWait
  5
  
(Netty only) Seconds to wait between retries of fetches. The maximum delay 
caused by retrying
is simply maxRetries * retryWait, by default 15 seconds.
  

{code}

  was:
5 != 15:
{code}

  spark.shuffle.io.retryWait
  5
  
(Netty only) Seconds to wait between retries of fetches. The maximum delay 
caused by retrying
is simply maxRetries * retryWait, by default 15 seconds.
  

{/code}


> Documented default of spark.shuffle.io.retryWait is not consistent
> --
>
> Key: SPARK-5930
> URL: https://issues.apache.org/jira/browse/SPARK-5930
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Reporter: Andrew Or
>
> 5 != 15:
> {code}
> 
>   spark.shuffle.io.retryWait
>   5
>   
> (Netty only) Seconds to wait between retries of fetches. The maximum 
> delay caused by retrying
> is simply maxRetries * retryWait, by default 15 seconds.
>   
> 
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5930) Documented default of spark.shuffle.io.retryWait is not consistent

2015-02-20 Thread Andrew Or (JIRA)
Andrew Or created SPARK-5930:


 Summary: Documented default of spark.shuffle.io.retryWait is not 
consistent
 Key: SPARK-5930
 URL: https://issues.apache.org/jira/browse/SPARK-5930
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Reporter: Andrew Or


5 != 15:
{code}

  spark.shuffle.io.retryWait
  5
  
(Netty only) Seconds to wait between retries of fetches. The maximum delay 
caused by retrying
is simply maxRetries * retryWait, by default 15 seconds.
  

{/code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5281) Registering table on RDD is giving MissingRequirementError

2015-02-20 Thread Sebastian YEPES FERNANDEZ (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329628#comment-14329628
 ] 

Sebastian YEPES FERNANDEZ commented on SPARK-5281:
--

Also having this issue with 1.2.1, using the standard context (sc).

> Registering table on RDD is giving MissingRequirementError
> --
>
> Key: SPARK-5281
> URL: https://issues.apache.org/jira/browse/SPARK-5281
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: sarsol
>Priority: Critical
>
> Application crashes on this line  rdd.registerTempTable("temp")  in 1.2 
> version when using sbt or Eclipse SCALA IDE
> Stacktrace 
> Exception in thread "main" scala.reflect.internal.MissingRequirementError: 
> class org.apache.spark.sql.catalyst.ScalaReflection in JavaMirror with 
> primordial classloader with boot classpath 
> [C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-library.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-reflect.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-actor.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-swing.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-compiler.jar;C:\Program
>  Files\Java\jre7\lib\resources.jar;C:\Program 
> Files\Java\jre7\lib\rt.jar;C:\Program 
> Files\Java\jre7\lib\sunrsasign.jar;C:\Program 
> Files\Java\jre7\lib\jsse.jar;C:\Program 
> Files\Java\jre7\lib\jce.jar;C:\Program 
> Files\Java\jre7\lib\charsets.jar;C:\Program 
> Files\Java\jre7\lib\jfr.jar;C:\Program Files\Java\jre7\classes] not found.
>   at 
> scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16)
>   at 
> scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17)
>   at 
> scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:48)
>   at 
> scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:61)
>   at 
> scala.reflect.internal.Mirrors$RootsBase.staticModuleOrClass(Mirrors.scala:72)
>   at 
> scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:119)
>   at 
> scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:21)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$$typecreator1$1.apply(ScalaReflection.scala:115)
>   at 
> scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231)
>   at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231)
>   at scala.reflect.api.TypeTags$class.typeOf(TypeTags.scala:335)
>   at scala.reflect.api.Universe.typeOf(Universe.scala:59)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:115)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:33)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:100)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:33)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$class.attributesFor(ScalaReflection.scala:94)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.attributesFor(ScalaReflection.scala:33)
>   at org.apache.spark.sql.SQLContext.createSchemaRDD(SQLContext.scala:111)
>   at 
> com.sar.spark.dq.poc.SparkPOC$delayedInit$body.apply(SparkPOC.scala:43)
>   at scala.Function0$class.apply$mcV$sp(Function0.scala:40)
>   at 
> scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
>   at scala.App$$anonfun$main$1.apply(App.scala:71)
>   at scala.App$$anonfun$main$1.apply(App.scala:71)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32)
>   at scala.App$class.main(App.scala:71)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5281) Registering table on RDD is giving MissingRequirementError

2015-02-20 Thread Sebastian YEPES FERNANDEZ (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329628#comment-14329628
 ] 

Sebastian YEPES FERNANDEZ edited comment on SPARK-5281 at 2/20/15 9:54 PM:
---

Also having this issue with 1.2.1, using the standard context (sc).


was (Author: syepes):
Also having this issue with 1.2.1 with the standard context (sc)

> Registering table on RDD is giving MissingRequirementError
> --
>
> Key: SPARK-5281
> URL: https://issues.apache.org/jira/browse/SPARK-5281
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: sarsol
>Priority: Critical
>
> Application crashes on this line  rdd.registerTempTable("temp")  in 1.2 
> version when using sbt or Eclipse SCALA IDE
> Stacktrace 
> Exception in thread "main" scala.reflect.internal.MissingRequirementError: 
> class org.apache.spark.sql.catalyst.ScalaReflection in JavaMirror with 
> primordial classloader with boot classpath 
> [C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-library.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-reflect.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-actor.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-swing.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-compiler.jar;C:\Program
>  Files\Java\jre7\lib\resources.jar;C:\Program 
> Files\Java\jre7\lib\rt.jar;C:\Program 
> Files\Java\jre7\lib\sunrsasign.jar;C:\Program 
> Files\Java\jre7\lib\jsse.jar;C:\Program 
> Files\Java\jre7\lib\jce.jar;C:\Program 
> Files\Java\jre7\lib\charsets.jar;C:\Program 
> Files\Java\jre7\lib\jfr.jar;C:\Program Files\Java\jre7\classes] not found.
>   at 
> scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16)
>   at 
> scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17)
>   at 
> scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:48)
>   at 
> scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:61)
>   at 
> scala.reflect.internal.Mirrors$RootsBase.staticModuleOrClass(Mirrors.scala:72)
>   at 
> scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:119)
>   at 
> scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:21)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$$typecreator1$1.apply(ScalaReflection.scala:115)
>   at 
> scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231)
>   at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231)
>   at scala.reflect.api.TypeTags$class.typeOf(TypeTags.scala:335)
>   at scala.reflect.api.Universe.typeOf(Universe.scala:59)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:115)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:33)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:100)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:33)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$class.attributesFor(ScalaReflection.scala:94)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.attributesFor(ScalaReflection.scala:33)
>   at org.apache.spark.sql.SQLContext.createSchemaRDD(SQLContext.scala:111)
>   at 
> com.sar.spark.dq.poc.SparkPOC$delayedInit$body.apply(SparkPOC.scala:43)
>   at scala.Function0$class.apply$mcV$sp(Function0.scala:40)
>   at 
> scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
>   at scala.App$$anonfun$main$1.apply(App.scala:71)
>   at scala.App$$anonfun$main$1.apply(App.scala:71)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32)
>   at scala.App$class.main(App.scala:71)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5928) Remote Shuffle Blocks cannot be more than 2 GB

2015-02-20 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329623#comment-14329623
 ] 

Imran Rashid commented on SPARK-5928:
-

Sometimes this also results in exceptions like the following (I have no idea why 
it shows up as one or the other):

{noformat}
15/02/20 13:42:15 WARN TaskSetManager: Lost task 0.3 in stage 1.6 (TID 19, 
imran-2.ent.cloudera.com): java.io.IOException: failed to uncompress the chunk: 
PARSING_ERROR(2)
at 
org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:361)
at org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:383)
at 
java.io.ObjectInputStream$PeekInputStream.peek(ObjectInputStream.java:2293)
at 
java.io.ObjectInputStream$BlockDataInputStream.peek(ObjectInputStream.java:2586)
at 
java.io.ObjectInputStream$BlockDataInputStream.peekByte(ObjectInputStream.java:2596)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1318)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at 
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
at 
org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
at 
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at 
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at 
org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:125)
at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
at 
org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:46)
at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
{noformat}

> Remote Shuffle Blocks cannot be more than 2 GB
> --
>
> Key: SPARK-5928
> URL: https://issues.apache.org/jira/browse/SPARK-5928
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Imran Rashid
>
> If a shuffle block is over 2GB, the shuffle fails, with an uninformative 
> exception.  The tasks get retried a few times and then eventually the job 
> fails.
> Here is an example program which can cause the exception:
> {code}
> val rdd = sc.parallelize(1 to 1e6.toInt, 1).map{ ignore =>
>   val n = 3e3.toInt
>   val arr = new Array[Byte](n)
>   //need to make sure the array doesn't compress to something small
>   scala.util.Random.nextBytes(arr)
>   arr
> }
> rdd.map { x => (1, x)}.groupByKey().count()
> {code}
> Note that you can't trigger this exception in local mode, it only happens on 
> remote fetches.   I triggered these exceptions running with 
> {{MASTER=yarn-client spark-shell --num-executors 2 --executor-memory 4000m}}
> {noformat}
> 15/02/20 11:10:23 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 3, 
> imran-3.ent.cloudera.com): FetchFailed(BlockManagerId(1, 
> imran-2.ent.cloudera.com, 55028), shuffleId=1, mapId=0, reduceId=0, message=
> org.apache.spark.shuffle.FetchFailedException: Adjusted frame length exceeds 
> 2147483647: 3021252889 - discarded
>   at 
> org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67)
>   at 
> org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
>   at 
> org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:125)
>   at org.apache.spark.Aggregator

[jira] [Commented] (SPARK-1391) BlockManager cannot transfer blocks larger than 2G in size

2015-02-20 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329613#comment-14329613
 ] 

Imran Rashid commented on SPARK-1391:
-

Here is a minimal program to demonstrate the problem:

{code}
sc.parallelize(1 to 1e6.toInt, 1).map{i => new 
Array[Byte](2.2e3.toInt)}.persist(StorageLevel.DISK_ONLY).count()
{code}

This only demonstrates the problem with {{DiskStore}}, but a solution to this 
should apply to the other cases if done correctly. (We probably need to come up 
with more test cases.)

> BlockManager cannot transfer blocks larger than 2G in size
> --
>
> Key: SPARK-1391
> URL: https://issues.apache.org/jira/browse/SPARK-1391
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, Shuffle
>Affects Versions: 1.0.0
>Reporter: Shivaram Venkataraman
> Attachments: SPARK-1391.diff
>
>
> If a task tries to remotely access a cached RDD block, I get an exception 
> when the block size is > 2G. The exception is pasted below.
> Memory capacities are huge these days (> 60G), and many workflows depend on 
> having large blocks in memory, so it would be good to fix this bug.
> I don't know if the same thing happens on shuffles if one transfer (from 
> mapper to reducer) is > 2G.
> {noformat}
> 14/04/02 02:33:10 ERROR storage.BlockManagerWorker: Exception handling buffer 
> message
> java.lang.ArrayIndexOutOfBoundsException
> at 
> it.unimi.dsi.fastutil.io.FastByteArrayOutputStream.write(FastByteArrayOutputStream.java:96)
> at 
> it.unimi.dsi.fastutil.io.FastBufferedOutputStream.dumpBuffer(FastBufferedOutputStream.java:134)
> at 
> it.unimi.dsi.fastutil.io.FastBufferedOutputStream.write(FastBufferedOutputStream.java:164)
> at 
> java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876)
> at 
> java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1785)
> at 
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1188)
> at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
> at 
> org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:38)
> at 
> org.apache.spark.serializer.SerializationStream$class.writeAll(Serializer.scala:93)
> at 
> org.apache.spark.serializer.JavaSerializationStream.writeAll(JavaSerializer.scala:26)
> at 
> org.apache.spark.storage.BlockManager.dataSerializeStream(BlockManager.scala:913)
> at 
> org.apache.spark.storage.BlockManager.dataSerialize(BlockManager.scala:922)
> at 
> org.apache.spark.storage.MemoryStore.getBytes(MemoryStore.scala:102)
> at 
> org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:348)
> at 
> org.apache.spark.storage.BlockManager.getLocalBytes(BlockManager.scala:323)
> at 
> org.apache.spark.storage.BlockManagerWorker.getBlock(BlockManagerWorker.scala:90)
> at 
> org.apache.spark.storage.BlockManagerWorker.processBlockMessage(BlockManagerWorker.scala:69)
> at 
> org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:44)
> at 
> org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:44)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at 
> org.apache.spark.storage.BlockMessageArray.foreach(BlockMessageArray.scala:28)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at 
> org.apache.spark.storage.BlockMessageArray.map(BlockMessageArray.scala:28)
> at 
> org.apache.spark.storage.BlockManagerWorker.onBlockMessageReceive(BlockManagerWorker.scala:44)
> at 
> org.apache.spark.storage.BlockManagerWorker$$anonfun$1.apply(BlockManagerWorker.scala:34)
> at 
> org.apache.spark.storage.BlockManagerWorker$$anonfun$1.apply(BlockManagerWorker.scala:34)
> at 
> org.apache.spark.network.ConnectionManager.org$apache$spark$network$ConnectionManager$$handleMessage(ConnectionManager.scala:661)
> at 
> org.apache.spark.network.ConnectionManager$$anon$9.run(ConnectionManager.scala:503)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:

[jira] [Commented] (SPARK-1391) BlockManager cannot transfer blocks larger than 2G in size

2015-02-20 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329605#comment-14329605
 ] 

Imran Rashid commented on SPARK-1391:
-

[~coderplay], I assume you are no longer looking at this, right?  I'm going to 
take a crack at this issue if you don't mind.  Here is my plan, copied from 
SPARK-1476 (now that I've untangled those issues a little bit):

I'd like to start on it, with the following very minimal goals:

1. Make it possible for blocks to be bigger than 2GB
2. Maintain performance on smaller blocks

i.e., I'm not going to try to do anything fancy to optimize the performance of 
large blocks. To that end, my plan is to:

1. create a {{LargeByteBuffer}} interface, which just has the same methods we 
use on {{ByteBuffer}}
2. have one implementation that just wraps one {{ByteBuffer}}, and another which 
wraps a completely static set of {{ByteBuffer}}s (e.g., if you map a 3 GB file, 
it'll just immediately map it to two {{ByteBuffer}}s; nothing fancy like only 
mapping the first half of the file until the second half is needed)
3. change {{ByteBuffer}} to {{LargeByteBuffer}} in {{BlockStore}}

I see that about a year back there was a lot of discussion on this in 
SPARK-1476, along with some alternate proposals. I'd like to push forward with a 
POC to try to move the discussion along again. I know there was some discussion 
about how important this is, and whether or not we want to support it. IMO this 
is a big limitation and results in a lot of frustration for users; we really 
need a solution for this.

I could still be missing something, but I believe this should also solve 
SPARK-3151
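
A bare-bones sketch of what steps 1 and 2 above might look like (illustrative 
only; the method set and names are made up and just mirror the few 
{{ByteBuffer}} calls we'd need):

{code}
import java.nio.ByteBuffer

// Illustrative only: a minimal interface plus the two implementations sketched
// above -- one wrapping a single buffer, one wrapping a fixed set of buffers.
trait LargeByteBuffer {
  def size: Long
  def get(dest: Array[Byte], destOffset: Int, length: Int): Unit
}

class WrappedByteBuffer(buf: ByteBuffer) extends LargeByteBuffer {
  def size: Long = buf.remaining()
  def get(dest: Array[Byte], destOffset: Int, length: Int): Unit =
    buf.duplicate().get(dest, destOffset, length)
}

// E.g. a >2GB file mapped up front as several regions.
class ChunkedByteBuffer(bufs: Seq[ByteBuffer]) extends LargeByteBuffer {
  def size: Long = bufs.map(_.remaining().toLong).sum
  def get(dest: Array[Byte], destOffset: Int, length: Int): Unit = {
    var written = 0
    val it = bufs.iterator
    while (written < length && it.hasNext) {
      val b = it.next().duplicate()
      val n = math.min(length - written, b.remaining())
      b.get(dest, destOffset + written, n)
      written += n
    }
  }
}
{code}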

> BlockManager cannot transfer blocks larger than 2G in size
> --
>
> Key: SPARK-1391
> URL: https://issues.apache.org/jira/browse/SPARK-1391
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, Shuffle
>Affects Versions: 1.0.0
>Reporter: Shivaram Venkataraman
> Attachments: SPARK-1391.diff
>
>
> If a task tries to remotely access a cached RDD block, I get an exception 
> when the block size is > 2G. The exception is pasted below.
> Memory capacities are huge these days (> 60G), and many workflows depend on 
> having large blocks in memory, so it would be good to fix this bug.
> I don't know if the same thing happens on shuffles if one transfer (from 
> mapper to reducer) is > 2G.
> {noformat}
> 14/04/02 02:33:10 ERROR storage.BlockManagerWorker: Exception handling buffer 
> message
> java.lang.ArrayIndexOutOfBoundsException
> at 
> it.unimi.dsi.fastutil.io.FastByteArrayOutputStream.write(FastByteArrayOutputStream.java:96)
> at 
> it.unimi.dsi.fastutil.io.FastBufferedOutputStream.dumpBuffer(FastBufferedOutputStream.java:134)
> at 
> it.unimi.dsi.fastutil.io.FastBufferedOutputStream.write(FastBufferedOutputStream.java:164)
> at 
> java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876)
> at 
> java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1785)
> at 
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1188)
> at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
> at 
> org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:38)
> at 
> org.apache.spark.serializer.SerializationStream$class.writeAll(Serializer.scala:93)
> at 
> org.apache.spark.serializer.JavaSerializationStream.writeAll(JavaSerializer.scala:26)
> at 
> org.apache.spark.storage.BlockManager.dataSerializeStream(BlockManager.scala:913)
> at 
> org.apache.spark.storage.BlockManager.dataSerialize(BlockManager.scala:922)
> at 
> org.apache.spark.storage.MemoryStore.getBytes(MemoryStore.scala:102)
> at 
> org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:348)
> at 
> org.apache.spark.storage.BlockManager.getLocalBytes(BlockManager.scala:323)
> at 
> org.apache.spark.storage.BlockManagerWorker.getBlock(BlockManagerWorker.scala:90)
> at 
> org.apache.spark.storage.BlockManagerWorker.processBlockMessage(BlockManagerWorker.scala:69)
> at 
> org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:44)
> at 
> org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:44)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at scala.collection.It

[jira] [Commented] (SPARK-5928) Remote Shuffle Blocks cannot be more than 2 GB

2015-02-20 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329584#comment-14329584
 ] 

Imran Rashid commented on SPARK-5928:
-

Here are some thoughts on how we *might* fix this. I am definitely not saying an 
eventual solution has to do it this way; I just figured I should write down my 
thoughts in case they're useful to anyone who takes a look at this in the 
future. There may be alternate solutions that are worth considering as well.

Right now, we always try to send at least one full shuffle block in each 
message. We could change the implementation to break one shuffle block into 
multiple messages. The key location on the receiving end is 
https://github.com/apache/spark/blob/master/network/shuffle/src/main/java/org/apache/spark/network/shuffle/OneForOneBlockFetcher.java#L93,
 which assumes each "chunk" corresponds to exactly one block. On the sending 
side, it happens here: 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/network/netty/NettyBlockRpcServer.scala#L55.
 The protocol there could change to allow one block to be sent in multiple 
chunks.

OK, I know that isn't a lot of info, but maybe at least the pointers to the code 
are helpful to somebody :)
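
To make the chunking idea slightly more concrete, here is a toy sketch of 
splitting one oversized block into sub-2GB (offset, length) ranges; this is 
purely illustrative and not the actual protocol change:

{code}
// Toy illustration only: split a block larger than 2GB into chunk ranges that
// each fit within a signed-Int frame. Names and structure are made up.
val maxChunkSize: Long = Int.MaxValue.toLong  // ~2GB, the current frame limit

def chunkRanges(blockSize: Long): Seq[(Long, Long)] =
  (0L until blockSize by maxChunkSize).map { offset =>
    (offset, math.min(maxChunkSize, blockSize - offset))
  }

chunkRanges(3021252889L)
// Vector((0,2147483647), (2147483647,873769242)) -- i.e. two messages instead of one
{code}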


> Remote Shuffle Blocks cannot be more than 2 GB
> --
>
> Key: SPARK-5928
> URL: https://issues.apache.org/jira/browse/SPARK-5928
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Imran Rashid
>
> If a shuffle block is over 2GB, the shuffle fails, with an uninformative 
> exception.  The tasks get retried a few times and then eventually the job 
> fails.
> Here is an example program which can cause the exception:
> {code}
> val rdd = sc.parallelize(1 to 1e6.toInt, 1).map{ ignore =>
>   val n = 3e3.toInt
>   val arr = new Array[Byte](n)
>   //need to make sure the array doesn't compress to something small
>   scala.util.Random.nextBytes(arr)
>   arr
> }
> rdd.map { x => (1, x)}.groupByKey().count()
> {code}
> Note that you can't trigger this exception in local mode, it only happens on 
> remote fetches.   I triggered these exceptions running with 
> {{MASTER=yarn-client spark-shell --num-executors 2 --executor-memory 4000m}}
> {noformat}
> 15/02/20 11:10:23 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 3, 
> imran-3.ent.cloudera.com): FetchFailed(BlockManagerId(1, 
> imran-2.ent.cloudera.com, 55028), shuffleId=1, mapId=0, reduceId=0, message=
> org.apache.spark.shuffle.FetchFailedException: Adjusted frame length exceeds 
> 2147483647: 3021252889 - discarded
>   at 
> org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67)
>   at 
> org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
>   at 
> org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:125)
>   at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
>   at 
> org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:46)
>   at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>   at org.apache.spark.scheduler.Task.run(Task.scala:56)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: io.netty.handler.codec.TooLongFrameException: Adjusted frame 
> length exceeds 2147483647: 3021252889 - discarded
>   at 
> io.netty.handler.codec.LengthFieldBasedFrameDecoder.fail(LengthFieldBasedFrameDecoder.java:501)
>   at 
> io.netty.handler.codec.LengthFieldBasedFrameDecoder.failIfNecessary(LengthFieldBasedFrameDecoder.java:477)
>   at 
> io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:403)
>   at 
> io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFi

[jira] [Updated] (SPARK-4705) Driver retries in cluster mode always fail if event logging is enabled

2015-02-20 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-4705:
-
Summary: Driver retries in cluster mode always fail if event logging is 
enabled  (was: Driver retries in yarn-cluster mode always fail if event logging 
is enabled)

> Driver retries in cluster mode always fail if event logging is enabled
> --
>
> Key: SPARK-4705
> URL: https://issues.apache.org/jira/browse/SPARK-4705
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 1.2.0
>Reporter: Marcelo Vanzin
> Attachments: Screen Shot 2015-02-10 at 6.27.49 pm.png
>
>
> yarn-cluster mode will retry running the driver in certain failure modes. If 
> event logging is enabled, this will most probably fail, because:
> {noformat}
> Exception in thread "Driver" java.io.IOException: Log directory 
> hdfs://vanzin-krb-1.vpc.cloudera.com:8020/user/spark/applicationHistory/application_1417554558066_0003
>  already exists!
> at org.apache.spark.util.FileLogger.createLogDir(FileLogger.scala:129)
> at org.apache.spark.util.FileLogger.start(FileLogger.scala:115)
> at 
> org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:74)
> at org.apache.spark.SparkContext.(SparkContext.scala:353)
> {noformat}
> The event log path should be "more unique". Or perhaps retries of the same app 
> should clean up the old logs first.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5928) Remote Shuffle Blocks cannot be more than 2 GB

2015-02-20 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329568#comment-14329568
 ] 

Imran Rashid commented on SPARK-5928:
-

(Just edited the description -- I mistakenly thought that Spark waited forever 
after the failed fetch, but it just retries the tasks/stages a few times as 
normal and then eventually fails the job.)

> Remote Shuffle Blocks cannot be more than 2 GB
> --
>
> Key: SPARK-5928
> URL: https://issues.apache.org/jira/browse/SPARK-5928
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Imran Rashid
>
> If a shuffle block is over 2GB, the shuffle fails, with an uninformative 
> exception.  The tasks get retried a few times and then eventually the job 
> fails.
> Here is an example program which can cause the exception:
> {code}
> val rdd = sc.parallelize(1 to 1e6.toInt, 1).map{ ignore =>
>   val n = 3e3.toInt
>   val arr = new Array[Byte](n)
>   //need to make sure the array doesn't compress to something small
>   scala.util.Random.nextBytes(arr)
>   arr
> }
> rdd.map { x => (1, x)}.groupByKey().count()
> {code}
> Note that you can't trigger this exception in local mode, it only happens on 
> remote fetches.   I triggered these exceptions running with 
> {{MASTER=yarn-client spark-shell --num-executors 2 --executor-memory 4000m}}
> {noformat}
> 15/02/20 11:10:23 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 3, 
> imran-3.ent.cloudera.com): FetchFailed(BlockManagerId(1, 
> imran-2.ent.cloudera.com, 55028), shuffleId=1, mapId=0, reduceId=0, message=
> org.apache.spark.shuffle.FetchFailedException: Adjusted frame length exceeds 
> 2147483647: 3021252889 - discarded
>   at 
> org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67)
>   at 
> org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
>   at 
> org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:125)
>   at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
>   at 
> org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:46)
>   at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>   at org.apache.spark.scheduler.Task.run(Task.scala:56)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: io.netty.handler.codec.TooLongFrameException: Adjusted frame 
> length exceeds 2147483647: 3021252889 - discarded
>   at 
> io.netty.handler.codec.LengthFieldBasedFrameDecoder.fail(LengthFieldBasedFrameDecoder.java:501)
>   at 
> io.netty.handler.codec.LengthFieldBasedFrameDecoder.failIfNecessary(LengthFieldBasedFrameDecoder.java:477)
>   at 
> io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:403)
>   at 
> io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:343)
>   at 
> io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:249)
>   at 
> io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:149)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
>   at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
>   at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>   at 
> io.netty.channel.ni

[jira] [Updated] (SPARK-5928) Remote Shuffle Blocks cannot be more than 2 GB

2015-02-20 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid updated SPARK-5928:

Description: 
If a shuffle block is over 2GB, the shuffle fails, with an uninformative 
exception.  The tasks get retried a few times and then eventually the job fails.

Here is an example program which can cause the exception:
{code}
val rdd = sc.parallelize(1 to 1e6.toInt, 1).map{ ignore =>
  val n = 3e3.toInt
  val arr = new Array[Byte](n)
  //need to make sure the array doesn't compress to something small
  scala.util.Random.nextBytes(arr)
  arr
}
rdd.map { x => (1, x)}.groupByKey().count()
{code}


Note that you can't trigger this exception in local mode; it only happens on 
remote fetches. I triggered these exceptions running with 
{{MASTER=yarn-client spark-shell --num-executors 2 --executor-memory 4000m}}

{noformat}
15/02/20 11:10:23 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 3, 
imran-3.ent.cloudera.com): FetchFailed(BlockManagerId(1, 
imran-2.ent.cloudera.com, 55028), shuffleId=1, mapId=0, reduceId=0, message=
org.apache.spark.shuffle.FetchFailedException: Adjusted frame length exceeds 
2147483647: 3021252889 - discarded
at 
org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67)
at 
org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
at 
org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at 
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at 
org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:125)
at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
at 
org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:46)
at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: io.netty.handler.codec.TooLongFrameException: Adjusted frame length 
exceeds 2147483647: 3021252889 - discarded
at 
io.netty.handler.codec.LengthFieldBasedFrameDecoder.fail(LengthFieldBasedFrameDecoder.java:501)
at 
io.netty.handler.codec.LengthFieldBasedFrameDecoder.failIfNecessary(LengthFieldBasedFrameDecoder.java:477)
at 
io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:403)
at 
io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:343)
at 
io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:249)
at 
io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:149)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
at 
io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
at 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
... 1 more

)
{noformat}

or if you use "spark.shuffle.blockTransferService=nio", then you get:

{noformat}
15/02/20 12:48:07 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1, 
imran-2.ent.cloudera.com): FetchFailed(BlockManagerId(2, 
imran-3.ent.cloudera.com, 42827), shuffleId=0, mapId=0, reduceId=0, message=
org.apache.spark.shuffle.FetchFailedException: sendMessageReliably failed with 
ACK that signalled a remote erro

[jira] [Commented] (SPARK-1476) 2GB limit in spark for blocks

2015-02-20 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329554#comment-14329554
 ] 

Imran Rashid commented on SPARK-1476:
-

I spent a little time with [~sandyr] on this today, and I realized that the 
shuffle limit and the cache limit are actually quite distinct.  (Sorry if this 
was already obvious to everyone else.)  I've made another issue SPARK-5928 to 
deal w/ the shuffle issue.  Then I say we make SPARK-1391 focus more on the 
cache limit (and broadcast limit etc.).  I'm going to make this issue require 
both of those.

I'm going to pursue a solution to *only* SPARK-1391 (basically what I outlined 
above); I'll move further discussion of the particulars of what I'm doing over 
there.

> 2GB limit in spark for blocks
> -
>
> Key: SPARK-1476
> URL: https://issues.apache.org/jira/browse/SPARK-1476
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
> Environment: all
>Reporter: Mridul Muralidharan
>Assignee: Mridul Muralidharan
>Priority: Critical
> Attachments: 2g_fix_proposal.pdf
>
>
> The underlying abstraction for blocks in Spark is a ByteBuffer, which limits 
> the size of a block to 2GB.
> This has implications not just for managed blocks in use, but also for shuffle 
> blocks (memory-mapped blocks are limited to 2GB, even though the API allows 
> for long), ser/deser via byte-array-backed output streams (SPARK-1391), etc.
> This is a severe limitation for the use of Spark on non-trivial 
> datasets.
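
As a side note for readers unfamiliar with where the ceiling comes from, a minimal illustration (nothing Spark-specific is assumed here, only the java.nio API):

{code}
import java.nio.ByteBuffer

// ByteBuffer capacities are Ints, so a single buffer tops out at
// Integer.MAX_VALUE (2147483647) bytes, i.e. just under 2GB.
val maxCapacity: Int = Int.MaxValue

// ByteBuffer.allocate takes an Int capacity (there is no Long overload), and
// FileChannel.map rejects mapping sizes above Int.MaxValue, which is why any
// block backed by a single ByteBuffer hits this wall.
val buf = ByteBuffer.allocate(16)
println(buf.capacity())
{code}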



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5929) Pyspark: Register a pip requirements file with spark_context

2015-02-20 Thread Buck (JIRA)
Buck created SPARK-5929:
---

 Summary: Pyspark: Register a pip requirements file with 
spark_context
 Key: SPARK-5929
 URL: https://issues.apache.org/jira/browse/SPARK-5929
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Reporter: Buck
Priority: Minor


I've been doing a lot of work shipping dependencies to workers, 
as it is non-trivial for me to have my workers include the proper dependencies 
in their own environments.

To circumvent this, I added an addRequirementsFile() method that takes a pip 
requirements file, downloads the packages, repackages them so they can be registered 
with addPyFiles, and ships them to workers.

Here is a comparison of what I've done on the Palantir fork 

https://github.com/buckheroux/spark/compare/palantir:master...master



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5928) Remote Shuffle Blocks cannot be more than 2 GB

2015-02-20 Thread Imran Rashid (JIRA)
Imran Rashid created SPARK-5928:
---

 Summary: Remote Shuffle Blocks cannot be more than 2 GB
 Key: SPARK-5928
 URL: https://issues.apache.org/jira/browse/SPARK-5928
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Imran Rashid


If a shuffle block is over 2GB, the shuffle fails, with an uninformative 
exception.  Furthermore, the job doesn't fail -- it simply hangs there, waiting 
for a task to complete that isn't actually making any progress.

Here is an example program which can cause the exception:
{code}
val rdd = sc.parallelize(1 to 1e6.toInt, 1).map{ ignore =>
  val n = 3e3.toInt
  val arr = new Array[Byte](n)
  //need to make sure the array doesn't compress to something small
  scala.util.Random.nextBytes(arr)
  arr
}
rdd.map { x => (1, x)}.groupByKey().count()
{code}


Note that you can't trigger this exception in local mode; it only happens on 
remote fetches. I triggered these exceptions running with 
{{MASTER=yarn-client spark-shell --num-executors 2 --executor-memory 4000m}}

{noformat}
15/02/20 11:10:23 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 3, 
imran-3.ent.cloudera.com): FetchFailed(BlockManagerId(1, 
imran-2.ent.cloudera.com, 55028), shuffleId=1, mapId=0, reduceId=0, message=
org.apache.spark.shuffle.FetchFailedException: Adjusted frame length exceeds 
2147483647: 3021252889 - discarded
at 
org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67)
at 
org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
at 
org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at 
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at 
org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:125)
at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
at 
org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:46)
at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: io.netty.handler.codec.TooLongFrameException: Adjusted frame length 
exceeds 2147483647: 3021252889 - discarded
at 
io.netty.handler.codec.LengthFieldBasedFrameDecoder.fail(LengthFieldBasedFrameDecoder.java:501)
at 
io.netty.handler.codec.LengthFieldBasedFrameDecoder.failIfNecessary(LengthFieldBasedFrameDecoder.java:477)
at 
io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:403)
at 
io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:343)
at 
io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:249)
at 
io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:149)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
at 
io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
at 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
... 1 more

)
{noformat}

or if you use "spark.shuffle.blockTransferService=nio", then you get:

{noformat}
15/02/20 12:48:07 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1, 
imran-2.ent.cloudera.com): F

[jira] [Commented] (SPARK-1537) Add integration with Yarn's Application Timeline Server

2015-02-20 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329460#comment-14329460
 ] 

Marcelo Vanzin commented on SPARK-1537:
---

Hi [~zzhan], thanks for uploading the document.

Reading through it, I don't see anything that is really that much different 
from my initial proof-of-concept. The points I'd like to highlight are:

- It still depends on YARN-2423, or at least on some effort to write a REST 
client that does not depend on internal Yarn classes.
- What about the overhead of the read code? Large jobs with lots of tasks, or 
really long jobs such as Spark Streaming jobs, will have a really large number 
of events. Fetching them all in one batch would require a lot of memory for 
serializing the data on both sides (ATS and History Server).
- Any security considerations? I haven't really kept up-to-date with the 
security changes in the ATS after I ran into issues with my p.o.c.; but mainly, 
does the Spark job need any special tokens to talk to the ATS when security is 
enabled? Does the ATS guarantee that only the job itself (or someone with the 
right credentials) can add events to its timeline? Or is that all handled 
transparently, somehow, by the client library?
- Does YARN-2928 affect the design in any way? I took a quick look at the data 
model, so hopefully they'll keep things backwards compatible. But it would 
kinda suck to add support for an API with a limited shelf life if that's not 
the case.


> Add integration with Yarn's Application Timeline Server
> ---
>
> Key: SPARK-1537
> URL: https://issues.apache.org/jira/browse/SPARK-1537
> Project: Spark
>  Issue Type: New Feature
>  Components: YARN
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
> Attachments: SPARK-1537.txt, spark-1573.patch
>
>
> It would be nice to have Spark integrate with Yarn's Application Timeline 
> Server (see YARN-321, YARN-1530). This would allow users running Spark on 
> Yarn to have a single place to go for all their history needs, and avoid 
> having to manage a separate service (Spark's built-in server).
> At the moment, there's a working version of the ATS in the Hadoop 2.4 branch, 
> although there is still some ongoing work. But the basics are there, and I 
> wouldn't expect them to change (much) at this point.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5918) Spark Thrift server reports metadata for VARCHAR column as STRING in result set schema

2015-02-20 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329417#comment-14329417
 ] 

Michael Armbrust commented on SPARK-5918:
-

This was a conscious design decision, since the optimizations that fixed-size 
strings allow you to do are not very relevant when you aren't managing your own 
memory. We want to be tolerant of schemas from other systems, but we don't 
optimize for this ourselves. We can revisit this if there are use cases that need 
varchar.

> Spark Thrift server reports metadata for VARCHAR column as STRING in result 
> set schema
> --
>
> Key: SPARK-5918
> URL: https://issues.apache.org/jira/browse/SPARK-5918
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.1, 1.2.0
>Reporter: Holman Lan
>Assignee: Cheng Lian
>
> This is reproducible using the open source JDBC driver by executing a query 
> that returns a VARCHAR column and then retrieving the result set metadata. 
> The type name returned by the JDBC driver is VARCHAR, which is expected, but 
> the column type is reported as string[12] and the precision/column length as 
> 2147483647 (which is what the JDBC driver would return for a STRING column), 
> even though we created a VARCHAR column with a max length of 1000.
> Further investigation indicates that the GetResultSetMetadata Thrift client API 
> call returns the incorrect metadata.
> We have confirmed this behaviour in versions 1.1.1 and 1.2.0. We have not 
> yet tested this against 1.2.1 but will do so and report our findings.
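
For reference, a minimal reproduction sketch of the check described above (the connection URL, credentials, and table name are assumptions, and the Hive JDBC driver is assumed to be on the classpath):

{code}
import java.sql.DriverManager

// Hypothetical reproduction: query a VARCHAR(1000) column through the Thrift
// server and inspect the result set metadata the driver reports.
Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "user", "")
val rs = conn.createStatement().executeQuery("SELECT varchar_col FROM some_table LIMIT 1")
val md = rs.getMetaData
// Expected for VARCHAR(1000): precision 1000; observed: 2147483647, the STRING value.
println(md.getColumnTypeName(1) + " / precision " + md.getPrecision(1))
conn.close()
{code}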



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5095) Support launching multiple mesos executors in coarse grained mesos mode

2015-02-20 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-5095:
-
Assignee: Timothy Chen

> Support launching multiple mesos executors in coarse grained mesos mode
> ---
>
> Key: SPARK-5095
> URL: https://issues.apache.org/jira/browse/SPARK-5095
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 1.0.0
>Reporter: Timothy Chen
>Assignee: Timothy Chen
>
> Currently in coarse-grained Mesos mode, it's expected that we only launch one 
> Mesos executor that launches one JVM process to launch multiple Spark 
> executors.
> However, this becomes a problem when the launched JVM process is larger than 
> an ideal size (30GB is the recommended value from Databricks), which causes the GC 
> problems reported on the mailing list.
> We should support launching multiple executors when large enough resources 
> are available for Spark to use and these resources are still under the 
> configured limit.
> This is also applicable when users want to specify the number of executors to be 
> launched on each node.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5095) Support launching multiple mesos executors in coarse grained mesos mode

2015-02-20 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-5095:
-
Affects Version/s: 1.0.0

> Support launching multiple mesos executors in coarse grained mesos mode
> ---
>
> Key: SPARK-5095
> URL: https://issues.apache.org/jira/browse/SPARK-5095
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 1.0.0
>Reporter: Timothy Chen
>
> Currently in coarse-grained Mesos mode, it's expected that we only launch one 
> Mesos executor that launches one JVM process to launch multiple Spark 
> executors.
> However, this becomes a problem when the launched JVM process is larger than 
> an ideal size (30GB is the recommended value from Databricks), which causes the GC 
> problems reported on the mailing list.
> We should support launching multiple executors when large enough resources 
> are available for Spark to use and these resources are still under the 
> configured limit.
> This is also applicable when users want to specify the number of executors to be 
> launched on each node.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5926) [SQL] DataFrame.explain() return false result for DDL command

2015-02-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329329#comment-14329329
 ] 

Apache Spark commented on SPARK-5926:
-

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/4707

> [SQL] DataFrame.explain() return false result for DDL command
> -
>
> Key: SPARK-5926
> URL: https://issues.apache.org/jira/browse/SPARK-5926
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yanbo Liang
>
> This bug is easy to reproduce: the following two queries should print out the 
> same explain result, but they do not.
> sql("create table tb as select * from src where key > 490").explain(true)
> sql("explain extended create table tb as select * from src where key > 490")



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5926) [SQL] DataFrame.explain() return false result for DDL command

2015-02-20 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329269#comment-14329269
 ] 

Yanbo Liang edited comment on SPARK-5926 at 2/20/15 6:34 PM:
-

This is because, for DDL-like commands with side effects, DataFrame forces 
execution right away. However, if we just want to know the execution plan, 
we do not need it to execute.
{code:title=DataFrame.scala|borderStyle=solid} 
@transient protected[sql] override lazy val logicalPlan: LogicalPlan = 
queryExecution.logical match {
// For various commands (like DDL) and queries with side effects, we force 
query optimization to
// happen right away to let these side effects take place eagerly.
case _: Command |
 _: InsertIntoTable |
 _: CreateTableAsSelect[_] |
 _: CreateTableUsingAsSelect |
 _: WriteToFile =>
  LogicalRDD(queryExecution.analyzed.output, 
queryExecution.toRdd)(sqlContext)
case _ =>
  queryExecution.logical
  }
{code} 
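
A minimal, hedged sketch of the inspection route this implies (illustrative only, and it assumes queryExecution is accessible in your Spark build, where it is exposed as a developer API; nothing here is the actual fix):

{code}
// Look at the parsed/analyzed plans directly through queryExecution, which
// still holds the CreateTableAsSelect command, instead of explain(), which
// goes through the logicalPlan that the snippet above replaces with a
// LogicalRDD for command-like plans.
val df = sql("create table tb as select * from src where key > 490")
println(df.queryExecution.logical)   // parsed plan
println(df.queryExecution.analyzed)  // analyzed CreateTableAsSelect plan
{code}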


was (Author: yanboliang):
This is because that for DDL like commands with side effects, DataFrame forces 
it to execute right away. However if we just want to know the execution plan, 
we do not need it to execute.
{code:title=DataFrameImpl.scala|borderStyle=solid} 
@transient protected[sql] override lazy val logicalPlan: LogicalPlan = 
queryExecution.logical match {
// For various commands (like DDL) and queries with side effects, we force 
query optimization to
// happen right away to let these side effects take place eagerly.
case _: Command |
 _: InsertIntoTable |
 _: CreateTableAsSelect[_] |
 _: CreateTableUsingAsSelect |
 _: WriteToFile =>
  LogicalRDD(queryExecution.analyzed.output, 
queryExecution.toRdd)(sqlContext)
case _ =>
  queryExecution.logical
  }
{code} 

> [SQL] DataFrame.explain() return false result for DDL command
> -
>
> Key: SPARK-5926
> URL: https://issues.apache.org/jira/browse/SPARK-5926
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yanbo Liang
>
> This bug is easy to reproduce: the following two queries should print out the 
> same explain result, but they do not.
> sql("create table tb as select * from src where key > 490").explain(true)
> sql("explain extended create table tb as select * from src where key > 490")



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5927) Modify FPGrowth's partition strategy to reduce transactions in partitions

2015-02-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329306#comment-14329306
 ] 

Apache Spark commented on SPARK-5927:
-

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/4706

> Modify FPGrowth's partition strategy to reduce transactions in partitions
> -
>
> Key: SPARK-5927
> URL: https://issues.apache.org/jira/browse/SPARK-5927
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Liang-Chi Hsieh
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5925) YARN - Spark progress bar stucks at 10% but after finishing shows 100%

2015-02-20 Thread Laszlo Fesus (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329303#comment-14329303
 ] 

Laszlo Fesus commented on SPARK-5925:
-

Yes, but I thought it would be quite useful if the progress bar could be 
"forwarded" from the more detailed view (the Spark web interface) to the master 
YARN interface. That could fix the problem. Maybe this feature could also be 
implemented for the _spark.yarn.historyServer.address_ functionality, 
which actually does redirect us to the proper job details on the Spark web 
interface. (And it would be even better if we could retrieve this _updated_ 
progress bar from the YARN RESTful interface as well.)

> YARN - Spark progress bar stucks at 10% but after finishing shows 100%
> --
>
> Key: SPARK-5925
> URL: https://issues.apache.org/jira/browse/SPARK-5925
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 1.2.1
>Reporter: Laszlo Fesus
>Priority: Minor
>
> I set up a YARN cluster (CDH5) and Spark (1.2.1), and also started the Spark 
> History Server. Now I am able to click on more details in YARN's web 
> interface and get redirected to the appropriate Spark logs both during job 
> execution and after the job has finished. 
> My only concern is that while a Spark job is being executed (either 
> yarn-client or yarn-cluster), the progress bar gets stuck at 10% and doesn't 
> increase as it does for MapReduce jobs. After finishing, it shows 100% properly, but 
> we are losing the real-time tracking capability of the status bar. 
> I also tested the YARN RESTful web interface, and it again reports 10% during 
> (YARN) Spark job execution, and works well again after the job finishes. (I suppose 
> for the time being I should have a look at Spark Job Server and see if it's 
> possible to track the job via its RESTful web interface.)
> Did anyone else experience this behaviour? Thanks in advance.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5926) [SQL] DataFrame.explain() return false result for DDL command

2015-02-20 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329269#comment-14329269
 ] 

Yanbo Liang edited comment on SPARK-5926 at 2/20/15 6:14 PM:
-

This is because, for DDL-like commands with side effects, DataFrame forces 
execution right away. However, if we just want to know the execution plan, 
we do not need it to execute.
{code:title=DataFrameImpl.scala|borderStyle=solid} 
@transient protected[sql] override lazy val logicalPlan: LogicalPlan = 
queryExecution.logical match {
// For various commands (like DDL) and queries with side effects, we force 
query optimization to
// happen right away to let these side effects take place eagerly.
case _: Command |
 _: InsertIntoTable |
 _: CreateTableAsSelect[_] |
 _: CreateTableUsingAsSelect |
 _: WriteToFile =>
  LogicalRDD(queryExecution.analyzed.output, 
queryExecution.toRdd)(sqlContext)
case _ =>
  queryExecution.logical
  }
{code} 


was (Author: yanboliang):
This is because that for DDL like queries with side effects, and DataFrame 
force it happen right away. We should use the former queryExecution.logical to 
explain.
{code:title=DataFrameImpl.scala|borderStyle=solid} 
@transient protected[sql] override lazy val logicalPlan: LogicalPlan = 
queryExecution.logical match {
// For various commands (like DDL) and queries with side effects, we force 
query optimization to
// happen right away to let these side effects take place eagerly.
case _: Command |
 _: InsertIntoTable |
 _: CreateTableAsSelect[_] |
 _: CreateTableUsingAsSelect |
 _: WriteToFile =>
  LogicalRDD(queryExecution.analyzed.output, 
queryExecution.toRdd)(sqlContext)
case _ =>
  queryExecution.logical
  }
{code} 

> [SQL] DataFrame.explain() return false result for DDL command
> -
>
> Key: SPARK-5926
> URL: https://issues.apache.org/jira/browse/SPARK-5926
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yanbo Liang
>
> This bug is easy to reproduce: the following two queries should print out the 
> same explain result, but they do not.
> sql("create table tb as select * from src where key > 490").explain(true)
> sql("explain extended create table tb as select * from src where key > 490")



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5927) Modify FPGrowth's partition strategy to reduce transactions in partitions

2015-02-20 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-5927:
--

 Summary: Modify FPGrowth's partition strategy to reduce 
transactions in partitions
 Key: SPARK-5927
 URL: https://issues.apache.org/jira/browse/SPARK-5927
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Liang-Chi Hsieh
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5832) Add Affinity Propagation clustering algorithm

2015-02-20 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329284#comment-14329284
 ] 

Liang-Chi Hsieh commented on SPARK-5832:


The time complexity O(nnz * K) holds only for a sparse directed graph in which 
every pair of distinct vertices is connected by a single unique edge, so it 
comes with that constraint. If the directed graph is sparse but every pair of distinct 
vertices is still connected by a pair of unique edges, the time complexity is 
O(nnz^2 * K).


> Add Affinity Propagation clustering algorithm
> -
>
> Key: SPARK-5832
> URL: https://issues.apache.org/jira/browse/SPARK-5832
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5926) [SQL] DataFrame.explain() return false result for DDL command

2015-02-20 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329269#comment-14329269
 ] 

Yanbo Liang edited comment on SPARK-5926 at 2/20/15 6:06 PM:
-

This is because, for DDL-like queries with side effects, DataFrame 
forces execution to happen right away. We should use the original queryExecution.logical to 
explain.
{code:title=DataFrameImpl.scala|borderStyle=solid} 
@transient protected[sql] override lazy val logicalPlan: LogicalPlan = 
queryExecution.logical match {
// For various commands (like DDL) and queries with side effects, we force 
query optimization to
// happen right away to let these side effects take place eagerly.
case _: Command |
 _: InsertIntoTable |
 _: CreateTableAsSelect[_] |
 _: CreateTableUsingAsSelect |
 _: WriteToFile =>
  LogicalRDD(queryExecution.analyzed.output, 
queryExecution.toRdd)(sqlContext)
case _ =>
  queryExecution.logical
  }
{code} 


was (Author: yanboliang):
This is because that in DataFrameImpl 
{code:title=DataFrameImpl.scala|borderStyle=solid} 
@transient protected[sql] override lazy val logicalPlan: LogicalPlan = 
queryExecution.logical match {
// For various commands (like DDL) and queries with side effects, we force 
query optimization to
// happen right away to let these side effects take place eagerly.
case _: Command |
 _: InsertIntoTable |
 _: CreateTableAsSelect[_] |
 _: CreateTableUsingAsSelect |
 _: WriteToFile =>
  LogicalRDD(queryExecution.analyzed.output, 
queryExecution.toRdd)(sqlContext)
case _ =>
  queryExecution.logical
  }
{code} 

> [SQL] DataFrame.explain() return false result for DDL command
> -
>
> Key: SPARK-5926
> URL: https://issues.apache.org/jira/browse/SPARK-5926
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yanbo Liang
>
> This bug is easy to reproduce: the following two queries should print out the 
> same explain result, but they do not.
> sql("create table tb as select * from src where key > 490").explain(true)
> sql("explain extended create table tb as select * from src where key > 490")



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5926) [SQL] DataFrame.explain() return false result for DDL command

2015-02-20 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329269#comment-14329269
 ] 

Yanbo Liang edited comment on SPARK-5926 at 2/20/15 6:03 PM:
-

This is because of the following in DataFrameImpl:
{code:title=DataFrameImpl.scala|borderStyle=solid} 
@transient protected[sql] override lazy val logicalPlan: LogicalPlan = 
queryExecution.logical match {
// For various commands (like DDL) and queries with side effects, we force 
query optimization to
// happen right away to let these side effects take place eagerly.
case _: Command |
 _: InsertIntoTable |
 _: CreateTableAsSelect[_] |
 _: CreateTableUsingAsSelect |
 _: WriteToFile =>
  LogicalRDD(queryExecution.analyzed.output, 
queryExecution.toRdd)(sqlContext)
case _ =>
  queryExecution.logical
  }
{code} 


was (Author: yanboliang):
This is because that in DataFrameImpl 
{code:title=Bar.java|borderStyle=solid} 
@transient protected[sql] override lazy val logicalPlan: LogicalPlan = 
queryExecution.logical match {
// For various commands (like DDL) and queries with side effects, we force 
query optimization to
// happen right away to let these side effects take place eagerly.
case _: Command |
 _: InsertIntoTable |
 _: CreateTableAsSelect[_] |
 _: CreateTableUsingAsSelect |
 _: WriteToFile =>
  LogicalRDD(queryExecution.analyzed.output, 
queryExecution.toRdd)(sqlContext)
case _ =>
  queryExecution.logical
  }
{code} 

> [SQL] DataFrame.explain() return false result for DDL command
> -
>
> Key: SPARK-5926
> URL: https://issues.apache.org/jira/browse/SPARK-5926
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yanbo Liang
>
> This bug is easy to reproduce: the following two queries should print out the 
> same explain result, but they do not.
> sql("create table tb as select * from src where key > 490").explain(true)
> sql("explain extended create table tb as select * from src where key > 490")



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5926) [SQL] DataFrame.explain() return false result for DDL command

2015-02-20 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329251#comment-14329251
 ] 

Yanbo Liang edited comment on SPARK-5926 at 2/20/15 6:01 PM:
-

The following is the output of each query
#1:
{panel}
== Parsed Logical Plan ==
LogicalRDD [], ParallelCollectionRDD[7] at parallelize at commands.scala:65

== Analyzed Logical Plan ==
LogicalRDD [], ParallelCollectionRDD[7] at parallelize at commands.scala:65

== Optimized Logical Plan ==
LogicalRDD [], ParallelCollectionRDD[7] at parallelize at commands.scala:65

== Physical Plan ==
PhysicalRDD [], ParallelCollectionRDD[7] at parallelize at commands.scala:65

#2:
== Parsed Logical Plan ==
'CreateTableAsSelect None, tb, false, Some(TOK_CREATETABLE)
 'Project [*]
  'Filter ('key > 490)
   'UnresolvedRelation [src], None

== Analyzed Logical Plan ==
CreateTableAsSelect [Database:default, TableName: tb, InsertIntoHiveTable]
Project [key#43,value#44]
 Filter (key#43 > 490)
  MetastoreRelation default, src, None


== Optimized Logical Plan ==
CreateTableAsSelect [Database:default, TableName: tb, InsertIntoHiveTable]
Project [key#43,value#44]
 Filter (key#43 > 490)
  MetastoreRelation default, src, None


== Physical Plan ==
ExecutedCommand (CreateTableAsSelect [Database:default, TableName: tb, 
InsertIntoHiveTable]
Project [key#43,value#44]
 Filter (key#43 > 490)
  MetastoreRelation default, src, None
)
{panel}


was (Author: yanboliang):
The following is the output of each query
#1:
{code|borderStyle=solid} 
== Parsed Logical Plan ==
LogicalRDD [], ParallelCollectionRDD[7] at parallelize at commands.scala:65

== Analyzed Logical Plan ==
LogicalRDD [], ParallelCollectionRDD[7] at parallelize at commands.scala:65

== Optimized Logical Plan ==
LogicalRDD [], ParallelCollectionRDD[7] at parallelize at commands.scala:65

== Physical Plan ==
PhysicalRDD [], ParallelCollectionRDD[7] at parallelize at commands.scala:65

#2:
== Parsed Logical Plan ==
'CreateTableAsSelect None, tb, false, Some(TOK_CREATETABLE)
 'Project [*]
  'Filter ('key > 490)
   'UnresolvedRelation [src], None

== Analyzed Logical Plan ==
CreateTableAsSelect [Database:default, TableName: tb, InsertIntoHiveTable]
Project [key#43,value#44]
 Filter (key#43 > 490)
  MetastoreRelation default, src, None


== Optimized Logical Plan ==
CreateTableAsSelect [Database:default, TableName: tb, InsertIntoHiveTable]
Project [key#43,value#44]
 Filter (key#43 > 490)
  MetastoreRelation default, src, None


== Physical Plan ==
ExecutedCommand (CreateTableAsSelect [Database:default, TableName: tb, 
InsertIntoHiveTable]
Project [key#43,value#44]
 Filter (key#43 > 490)
  MetastoreRelation default, src, None
)
{code}

> [SQL] DataFrame.explain() return false result for DDL command
> -
>
> Key: SPARK-5926
> URL: https://issues.apache.org/jira/browse/SPARK-5926
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yanbo Liang
>
> This bug is easy to reproduce: the following two queries should print out the 
> same explain result, but they do not.
> sql("create table tb as select * from src where key > 490").explain(true)
> sql("explain extended create table tb as select * from src where key > 490")



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5926) [SQL] DataFrame.explain() return false result for DDL command

2015-02-20 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329251#comment-14329251
 ] 

Yanbo Liang edited comment on SPARK-5926 at 2/20/15 5:59 PM:
-

The following is the output of each query
#1:
{code:borderStyle=solid} 
== Parsed Logical Plan ==
LogicalRDD [], ParallelCollectionRDD[7] at parallelize at commands.scala:65

== Analyzed Logical Plan ==
LogicalRDD [], ParallelCollectionRDD[7] at parallelize at commands.scala:65

== Optimized Logical Plan ==
LogicalRDD [], ParallelCollectionRDD[7] at parallelize at commands.scala:65

== Physical Plan ==
PhysicalRDD [], ParallelCollectionRDD[7] at parallelize at commands.scala:65

#2:
== Parsed Logical Plan ==
'CreateTableAsSelect None, tb, false, Some(TOK_CREATETABLE)
 'Project [*]
  'Filter ('key > 490)
   'UnresolvedRelation [src], None

== Analyzed Logical Plan ==
CreateTableAsSelect [Database:default, TableName: tb, InsertIntoHiveTable]
Project [key#43,value#44]
 Filter (key#43 > 490)
  MetastoreRelation default, src, None


== Optimized Logical Plan ==
CreateTableAsSelect [Database:default, TableName: tb, InsertIntoHiveTable]
Project [key#43,value#44]
 Filter (key#43 > 490)
  MetastoreRelation default, src, None


== Physical Plan ==
ExecutedCommand (CreateTableAsSelect [Database:default, TableName: tb, 
InsertIntoHiveTable]
Project [key#43,value#44]
 Filter (key#43 > 490)
  MetastoreRelation default, src, None
)
{code}


was (Author: yanboliang):
The following is the output of each query
#1:
== Parsed Logical Plan ==
LogicalRDD [], ParallelCollectionRDD[7] at parallelize at commands.scala:65

== Analyzed Logical Plan ==
LogicalRDD [], ParallelCollectionRDD[7] at parallelize at commands.scala:65

== Optimized Logical Plan ==
LogicalRDD [], ParallelCollectionRDD[7] at parallelize at commands.scala:65

== Physical Plan ==
PhysicalRDD [], ParallelCollectionRDD[7] at parallelize at commands.scala:65

#2:
== Parsed Logical Plan ==
'CreateTableAsSelect None, tb, false, Some(TOK_CREATETABLE)
 'Project [*]
  'Filter ('key > 490)
   'UnresolvedRelation [src], None

== Analyzed Logical Plan ==
CreateTableAsSelect [Database:default, TableName: tb, InsertIntoHiveTable]
Project [key#43,value#44]
 Filter (key#43 > 490)
  MetastoreRelation default, src, None


== Optimized Logical Plan ==
CreateTableAsSelect [Database:default, TableName: tb, InsertIntoHiveTable]
Project [key#43,value#44]
 Filter (key#43 > 490)
  MetastoreRelation default, src, None


== Physical Plan ==
ExecutedCommand (CreateTableAsSelect [Database:default, TableName: tb, 
InsertIntoHiveTable]
Project [key#43,value#44]
 Filter (key#43 > 490)
  MetastoreRelation default, src, None
)


> [SQL] DataFrame.explain() return false result for DDL command
> -
>
> Key: SPARK-5926
> URL: https://issues.apache.org/jira/browse/SPARK-5926
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yanbo Liang
>
> This bug is easy to reproduce: the following two queries should print out the 
> same explain result, but they do not.
> sql("create table tb as select * from src where key > 490").explain(true)
> sql("explain extended create table tb as select * from src where key > 490")



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5926) [SQL] DataFrame.explain() return false result for DDL command

2015-02-20 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329251#comment-14329251
 ] 

Yanbo Liang edited comment on SPARK-5926 at 2/20/15 6:01 PM:
-

The following is the output of each query
#1:
{panel}
== Parsed Logical Plan ==
LogicalRDD [], ParallelCollectionRDD[7] at parallelize at commands.scala:65

== Analyzed Logical Plan ==
LogicalRDD [], ParallelCollectionRDD[7] at parallelize at commands.scala:65

== Optimized Logical Plan ==
LogicalRDD [], ParallelCollectionRDD[7] at parallelize at commands.scala:65

== Physical Plan ==
PhysicalRDD [], ParallelCollectionRDD[7] at parallelize at commands.scala:65
{panel}
#2:
{panel}
== Parsed Logical Plan ==
'CreateTableAsSelect None, tb, false, Some(TOK_CREATETABLE)
 'Project [*]
  'Filter ('key > 490)
   'UnresolvedRelation [src], None

== Analyzed Logical Plan ==
CreateTableAsSelect [Database:default, TableName: tb, InsertIntoHiveTable]
Project [key#43,value#44]
 Filter (key#43 > 490)
  MetastoreRelation default, src, None


== Optimized Logical Plan ==
CreateTableAsSelect [Database:default, TableName: tb, InsertIntoHiveTable]
Project [key#43,value#44]
 Filter (key#43 > 490)
  MetastoreRelation default, src, None


== Physical Plan ==
ExecutedCommand (CreateTableAsSelect [Database:default, TableName: tb, 
InsertIntoHiveTable]
Project [key#43,value#44]
 Filter (key#43 > 490)
  MetastoreRelation default, src, None
)
{panel}


was (Author: yanboliang):
The following is the output of each query
#1:
{panel}
== Parsed Logical Plan ==
LogicalRDD [], ParallelCollectionRDD[7] at parallelize at commands.scala:65

== Analyzed Logical Plan ==
LogicalRDD [], ParallelCollectionRDD[7] at parallelize at commands.scala:65

== Optimized Logical Plan ==
LogicalRDD [], ParallelCollectionRDD[7] at parallelize at commands.scala:65

== Physical Plan ==
PhysicalRDD [], ParallelCollectionRDD[7] at parallelize at commands.scala:65

#2:
== Parsed Logical Plan ==
'CreateTableAsSelect None, tb, false, Some(TOK_CREATETABLE)
 'Project [*]
  'Filter ('key > 490)
   'UnresolvedRelation [src], None

== Analyzed Logical Plan ==
CreateTableAsSelect [Database:default, TableName: tb, InsertIntoHiveTable]
Project [key#43,value#44]
 Filter (key#43 > 490)
  MetastoreRelation default, src, None


== Optimized Logical Plan ==
CreateTableAsSelect [Database:default, TableName: tb, InsertIntoHiveTable]
Project [key#43,value#44]
 Filter (key#43 > 490)
  MetastoreRelation default, src, None


== Physical Plan ==
ExecutedCommand (CreateTableAsSelect [Database:default, TableName: tb, 
InsertIntoHiveTable]
Project [key#43,value#44]
 Filter (key#43 > 490)
  MetastoreRelation default, src, None
)
{panel}

> [SQL] DataFrame.explain() return false result for DDL command
> -
>
> Key: SPARK-5926
> URL: https://issues.apache.org/jira/browse/SPARK-5926
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yanbo Liang
>
> This bug is easy to reproduce: the following two queries should print out the 
> same explain result, but they do not.
> sql("create table tb as select * from src where key > 490").explain(true)
> sql("explain extended create table tb as select * from src where key > 490")



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5926) [SQL] DataFrame.explain() return false result for DDL command

2015-02-20 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329269#comment-14329269
 ] 

Yanbo Liang commented on SPARK-5926:


This is because of the following in DataFrameImpl:
{code:title=Bar.java|borderStyle=solid} 
@transient protected[sql] override lazy val logicalPlan: LogicalPlan = 
queryExecution.logical match {
// For various commands (like DDL) and queries with side effects, we force 
query optimization to
// happen right away to let these side effects take place eagerly.
case _: Command |
 _: InsertIntoTable |
 _: CreateTableAsSelect[_] |
 _: CreateTableUsingAsSelect |
 _: WriteToFile =>
  LogicalRDD(queryExecution.analyzed.output, 
queryExecution.toRdd)(sqlContext)
case _ =>
  queryExecution.logical
  }
{code} 

> [SQL] DataFrame.explain() return false result for DDL command
> -
>
> Key: SPARK-5926
> URL: https://issues.apache.org/jira/browse/SPARK-5926
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yanbo Liang
>
> This bug is easy to reproduce: the following two queries should print out the 
> same explain result, but they do not.
> sql("create table tb as select * from src where key > 490").explain(true)
> sql("explain extended create table tb as select * from src where key > 490")



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5832) Add Affinity Propagation clustering algorithm

2015-02-20 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-5832:
-
Target Version/s: 1.4.0

> Add Affinity Propagation clustering algorithm
> -
>
> Key: SPARK-5832
> URL: https://issues.apache.org/jira/browse/SPARK-5832
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


