[jira] [Comment Edited] (SPARK-1153) Generalize VertexId in GraphX so that UUIDs can be used as vertex IDs.

2014-05-27 Thread npanj (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14010846#comment-14010846
 ] 

npanj edited comment on SPARK-1153 at 5/28/14 6:48 AM:
---

An alternative approach that I have been using:
1. Use a preprocessing step that maps each UUID to a Long.
2. Build the graph based on Longs.

For the mapping in step 1:
- Rank your UUIDs.
- Some kind of hash function?

For option 1, GraphX could provide a tool to generate the map.

I would like to hear how others are building graphs out of non-Long node types.
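
A minimal sketch of the ranking idea (option 1), assuming the UUIDs and raw edges arrive as RDDs and using Spark 1.x-era imports; buildUuidGraph and its parameters are illustrative names, not a GraphX API:

{code}
import java.util.UUID
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD
import org.apache.spark.graphx._

def buildUuidGraph(uuids: RDD[UUID], rawEdges: RDD[(UUID, UUID)]): Graph[UUID, Int] = {
  // Assign each distinct UUID a unique Long, and keep the mapping so results
  // can be translated back to UUIDs afterwards.
  val idMap: RDD[(UUID, VertexId)] = uuids.distinct().zipWithUniqueId()
  val vertices: RDD[(VertexId, UUID)] = idMap.map(_.swap)
  val edges: RDD[Edge[Int]] = rawEdges
    .join(idMap)                                        // (srcUuid, (dstUuid, srcId))
    .map { case (_, (dstUuid, srcId)) => (dstUuid, srcId) }
    .join(idMap)                                        // (dstUuid, (srcId, dstId))
    .map { case (_, (srcId, dstId)) => Edge(srcId, dstId, 1) }
  Graph(vertices, edges)
}
{code}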





was (Author: npanj):
An alternative approach that I have been using:
1. Use a preprocessing step that maps each UUID to a Long.
2. Build the graph based on Longs.

For the mapping in step 1:
- Rank your UUIDs.
- Some kind of hash function?

For option 1, GraphX could provide a tool to generate the map.

I would like to hear how others are building graphs out of non-Long node types.




> Generalize VertexId in GraphX so that UUIDs can be used as vertex IDs.
> --
>
> Key: SPARK-1153
> URL: https://issues.apache.org/jira/browse/SPARK-1153
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Affects Versions: 0.9.0
>Reporter: Deepak Nulu
>
> Currently, {{VertexId}} is a type synonym for {{Long}}. I would like to be 
> able to use {{UUID}} as the vertex ID type because the data I want to process 
> with GraphX uses that type for its primary keys. Others might have a different 
> type for their primary keys. Generalizing {{VertexId}} (with a type class) 
> will help in such cases.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1552) GraphX performs type comparison incorrectly

2014-05-27 Thread npanj (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14010851#comment-14010851
 ] 

npanj commented on SPARK-1552:
--

Does it make sense to get rid of this optimization until richer TypeTags are 
available? 

> GraphX performs type comparison incorrectly
> ---
>
> Key: SPARK-1552
> URL: https://issues.apache.org/jira/browse/SPARK-1552
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Reporter: Ankur Dave
>
> In GraphImpl, mapVertices and outerJoinVertices use a more efficient 
> implementation when the map function preserves vertex attribute types. This 
> is implemented by comparing the ClassTags of the old and new vertex attribute 
> types. However, ClassTags store _erased_ types, so the comparison will return 
> a false positive for types with different type parameters, such as 
> Option[Int] and Option[Double].
> Thanks to Pierre-Alexandre Fonta for reporting this bug on the [mailing 
> list|http://apache-spark-user-list.1001560.n3.nabble.com/GraphX-Cast-error-when-comparing-a-vertex-attribute-after-its-type-has-changed-td4119.html].
> Demo in the Scala shell:
> scala> import scala.reflect.{classTag, ClassTag}
> scala> def typesEqual[A: ClassTag, B: ClassTag](a: A, b: B): Boolean = 
> classTag[A] equals classTag[B]
> scala> typesEqual(Some(1), Some(2.0)) // should return false
> res2: Boolean = true
> We can require richer TypeTags for these methods, or just take a flag from 
> the caller specifying whether the types are equal.
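
A hedged sketch of the "richer TypeTags" alternative mentioned above: TypeTag retains type parameters where ClassTag does not, so the comparison becomes exact.

{code}
import scala.reflect.runtime.universe._

def typesEqual[A: TypeTag, B: TypeTag](a: A, b: B): Boolean =
  typeOf[A] =:= typeOf[B]

// Option[Int] vs Option[Double] now compare as different, unlike the ClassTag version.
assert(!typesEqual(Some(1): Option[Int], Some(2.0): Option[Double]))
assert(typesEqual(Some(1): Option[Int], Some(2): Option[Int]))
{code}

The trade-off noted in the description still applies: requiring TypeTags on mapVertices and outerJoinVertices changes their signatures, whereas a caller-supplied flag does not.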



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1153) Generalize VertexId in GraphX so that UUIDs can be used as vertex IDs.

2014-05-27 Thread npanj (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14010846#comment-14010846
 ] 

npanj commented on SPARK-1153:
--

An alternative approach that I have been using:
1. Use a preprocessing step that maps each UUID to a Long.
2. Build the graph based on Longs.

For the mapping in step 1:
- Rank your UUIDs.
- Some kind of hash function?

For option 1, GraphX could provide a tool to generate the map.

I would like to hear how others are building graphs out of non-Long node types.




> Generalize VertexId in GraphX so that UUIDs can be used as vertex IDs.
> --
>
> Key: SPARK-1153
> URL: https://issues.apache.org/jira/browse/SPARK-1153
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Affects Versions: 0.9.0
>Reporter: Deepak Nulu
>
> Currently, {{VertexId}} is a type synonym for {{Long}}. I would like to be 
> able to use {{UUID}} as the vertex ID type because the data I want to process 
> with GraphX uses that type for its primary keys. Others might have a different 
> type for their primary keys. Generalizing {{VertexId}} (with a type class) 
> will help in such cases.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1945) Add full Java examples in MLlib docs

2014-05-27 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-1945:


 Summary: Add full Java examples in MLlib docs
 Key: SPARK-1945
 URL: https://issues.apache.org/jira/browse/SPARK-1945
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, MLlib
Reporter: Matei Zaharia


Right now some of the Java tabs only say the following:

"All of MLlib’s methods use Java-friendly types, so you can import and call 
them there the same way you do in Scala. The only caveat is that the methods 
take Scala RDD objects, while the Spark Java API uses a separate JavaRDD class. 
You can convert a Java RDD to a Scala one by calling .rdd() on your JavaRDD 
object."

It would be nice to translate the Scala code into Java instead.

Also, a few pages (most notably the Matrix one) don't have Java examples at all.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1944) Document --verbse in spark-shell -h

2014-05-27 Thread Andrew Ash (JIRA)
Andrew Ash created SPARK-1944:
-

 Summary: Document --verbse in spark-shell -h
 Key: SPARK-1944
 URL: https://issues.apache.org/jira/browse/SPARK-1944
 Project: Spark
  Issue Type: Documentation
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Andrew Ash
Priority: Minor


The below help for spark-submit should make mention of the {{--verbose}} option

{noformat}
aash@aash-mbp ~/git/spark$ ./bin/spark-submit -h
Usage: spark-submit [options]  [app options]
Options:
  --master MASTER_URL spark://host:port, mesos://host:port, yarn, or 
local.
  --deploy-mode DEPLOY_MODE   Mode to deploy the app in, either 'client' or 
'cluster'.
  --class CLASS_NAME  Name of your app's main class (required for Java 
apps).
  --arg ARG   Argument to be passed to your application's main 
class. This
  option can be specified multiple times for 
multiple args.
  --name NAME The name of your application (Default: 'Spark').
  --jars JARS A comma-separated list of local jars to include 
on the
  driver classpath and that SparkContext.addJar 
will work
  with. Doesn't work on standalone with 'cluster' 
deploy mode.
  --files FILES   Comma separated list of files to be placed in the 
working dir
  of each executor.
  --properties-file FILE  Path to a file from which to load extra 
properties. If not
  specified, this will look for 
conf/spark-defaults.conf.

  --driver-memory MEM Memory for driver (e.g. 1000M, 2G) (Default: 
512M).
  --driver-java-options   Extra Java options to pass to the driver
  --driver-library-path   Extra library path entries to pass to the driver
  --driver-class-path Extra class path entries to pass to the driver. 
Note that
  jars added with --jars are automatically included 
in the
  classpath.

  --executor-memory MEM   Memory per executor (e.g. 1000M, 2G) (Default: 
1G).

 Spark standalone with cluster deploy mode only:
  --driver-cores NUM  Cores for driver (Default: 1).
  --supervise If given, restarts the driver on failure.

 Spark standalone and Mesos only:
  --total-executor-cores NUM  Total cores for all executors.

 YARN-only:
  --executor-cores NUMNumber of cores per executor (Default: 1).
  --queue QUEUE_NAME  The YARN queue to submit to (Default: 'default').
  --num-executors NUM Number of executors to launch (Default: 2).
  --archives ARCHIVES Comma separated list of archives to be extracted 
into the
  working dir of each executor.
aash@aash-mbp ~/git/spark$
{noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1944) Document --verbose in spark-shell -h

2014-05-27 Thread Andrew Ash (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Ash updated SPARK-1944:
--

Summary: Document --verbose in spark-shell -h  (was: Document --verbse in 
spark-shell -h)

> Document --verbose in spark-shell -h
> 
>
> Key: SPARK-1944
> URL: https://issues.apache.org/jira/browse/SPARK-1944
> Project: Spark
>  Issue Type: Documentation
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Andrew Ash
>Priority: Minor
>
> The below help for spark-submit should make mention of the {{--verbose}} 
> option
> {noformat}
> aash@aash-mbp ~/git/spark$ ./bin/spark-submit -h
> Usage: spark-submit [options]  [app options]
> Options:
>   --master MASTER_URL spark://host:port, mesos://host:port, yarn, or 
> local.
>   --deploy-mode DEPLOY_MODE   Mode to deploy the app in, either 'client' or 
> 'cluster'.
>   --class CLASS_NAME  Name of your app's main class (required for 
> Java apps).
>   --arg ARG   Argument to be passed to your application's 
> main class. This
>   option can be specified multiple times for 
> multiple args.
>   --name NAME The name of your application (Default: 'Spark').
>   --jars JARS A comma-separated list of local jars to include 
> on the
>   driver classpath and that SparkContext.addJar 
> will work
>   with. Doesn't work on standalone with 'cluster' 
> deploy mode.
>   --files FILES   Comma separated list of files to be placed in 
> the working dir
>   of each executor.
>   --properties-file FILE  Path to a file from which to load extra 
> properties. If not
>   specified, this will look for 
> conf/spark-defaults.conf.
>   --driver-memory MEM Memory for driver (e.g. 1000M, 2G) (Default: 
> 512M).
>   --driver-java-options   Extra Java options to pass to the driver
>   --driver-library-path   Extra library path entries to pass to the driver
>   --driver-class-path Extra class path entries to pass to the driver. 
> Note that
>   jars added with --jars are automatically 
> included in the
>   classpath.
>   --executor-memory MEM   Memory per executor (e.g. 1000M, 2G) (Default: 
> 1G).
>  Spark standalone with cluster deploy mode only:
>   --driver-cores NUM  Cores for driver (Default: 1).
>   --supervise If given, restarts the driver on failure.
>  Spark standalone and Mesos only:
>   --total-executor-cores NUM  Total cores for all executors.
>  YARN-only:
>   --executor-cores NUMNumber of cores per executor (Default: 1).
>   --queue QUEUE_NAME  The YARN queue to submit to (Default: 
> 'default').
>   --num-executors NUM Number of executors to launch (Default: 2).
>   --archives ARCHIVES Comma separated list of archives to be 
> extracted into the
>   working dir of each executor.
> aash@aash-mbp ~/git/spark$
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1938) ApproxCountDistinctMergeFunction should return Int value.

2014-05-27 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-1938.


   Resolution: Fixed
Fix Version/s: 1.0.1
   1.1.0
 Assignee: Takuya Ueshin

> ApproxCountDistinctMergeFunction should return Int value.
> -
>
> Key: SPARK-1938
> URL: https://issues.apache.org/jira/browse/SPARK-1938
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
> Fix For: 1.1.0, 1.0.1
>
>
> {{ApproxCountDistinctMergeFunction}} should return {{Int}} value because the 
> {{dataType}} of {{ApproxCountDistinct}} is {{IntegerType}}.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1943) Testing use of target version field

2014-05-27 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-1943:
--

 Summary: Testing use of target version field
 Key: SPARK-1943
 URL: https://issues.apache.org/jira/browse/SPARK-1943
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Patrick Wendell






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1943) Testing use of target version field

2014-05-27 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1943:
---

Target Version/s:   (was: 0.9.2, 1.0.1)

> Testing use of target version field
> ---
>
> Key: SPARK-1943
> URL: https://issues.apache.org/jira/browse/SPARK-1943
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.0.0
>Reporter: Patrick Wendell
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1942) Stop clearing spark.driver.port in unit tests

2014-05-27 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-1942:


 Summary: Stop clearing spark.driver.port in unit tests
 Key: SPARK-1942
 URL: https://issues.apache.org/jira/browse/SPARK-1942
 Project: Spark
  Issue Type: Task
  Components: Spark Core
Reporter: Matei Zaharia


Since the SparkConf change, this should no longer be needed.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1825) Windows Spark fails to work with Linux YARN

2014-05-27 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-1825:
-

Fix Version/s: 1.1.0

> Windows Spark fails to work with Linux YARN
> ---
>
> Key: SPARK-1825
> URL: https://issues.apache.org/jira/browse/SPARK-1825
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.0.0
>Reporter: Taeyun Kim
> Fix For: 1.1.0
>
>
> Windows Spark fails to work with Linux YARN.
> This is a cross-platform problem.
> On the YARN side, Hadoop 2.4.0 resolved the issue as follows:
> https://issues.apache.org/jira/browse/YARN-1824
> But the Spark YARN module does not incorporate the new YARN API yet, so the 
> problem persists for Spark.
> First, the following source files should be changed:
> - /yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala
> - 
> /yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ExecutorRunnableUtil.scala
> The change is as follows:
> - Replace .$() with .$$()
> - Replace File.pathSeparator for Environment.CLASSPATH.name with 
> ApplicationConstants.CLASS_PATH_SEPARATOR (import 
> org.apache.hadoop.yarn.api.ApplicationConstants is required for this)
> Unless the above changes are applied, launch_container.sh will contain invalid 
> shell script statements (since they will contain Windows-specific separators), 
> and the job will fail.
> Also, the following symptoms should be fixed (I could not find the relevant 
> source code):
> - The SPARK_HOME environment variable is copied straight into 
> launch_container.sh. It should be changed to the path format of the server OS, 
> or, better, a separate environment variable or configuration variable should 
> be created.
> - The '%HADOOP_MAPRED_HOME%' string still exists in launch_container.sh after 
> the above change is applied. Maybe I missed a few lines.
> I'm not sure whether this is all, since I'm new to both Spark and YARN.
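
A rough sketch of the classpath change described above (not the actual Spark patch); it assumes Hadoop 2.4.0+, where Environment.$$() and ApplicationConstants.CLASS_PATH_SEPARATOR let the node manager expand the separator for its own OS:

{code}
import scala.collection.mutable
import org.apache.hadoop.yarn.api.ApplicationConstants
import org.apache.hadoop.yarn.api.ApplicationConstants.Environment

// Append an entry to the container's CLASSPATH using the cluster-side separator
// instead of the client OS's java.io.File.pathSeparator, so a Windows client can
// submit to a Linux YARN cluster (and vice versa).
def addClasspathEntry(env: mutable.Map[String, String], entry: String): Unit = {
  val key = Environment.CLASSPATH.name()
  env(key) = env.get(key) match {
    case Some(existing) => existing + ApplicationConstants.CLASS_PATH_SEPARATOR + entry
    case None           => entry
  }
}

val env = mutable.Map[String, String]()
addClasspathEntry(env, Environment.PWD.$$())        // $$() defers expansion to the node manager
addClasspathEntry(env, Environment.PWD.$$() + "/*")
{code}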



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1825) Windows Spark fails to work with Linux YARN

2014-05-27 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-1825:
-

Fix Version/s: (was: 1.0.0)

> Windows Spark fails to work with Linux YARN
> ---
>
> Key: SPARK-1825
> URL: https://issues.apache.org/jira/browse/SPARK-1825
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.0.0
>Reporter: Taeyun Kim
>
> Windows Spark fails to work with Linux YARN.
> This is a cross-platform problem.
> On the YARN side, Hadoop 2.4.0 resolved the issue as follows:
> https://issues.apache.org/jira/browse/YARN-1824
> But the Spark YARN module does not incorporate the new YARN API yet, so the 
> problem persists for Spark.
> First, the following source files should be changed:
> - /yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala
> - 
> /yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ExecutorRunnableUtil.scala
> The change is as follows:
> - Replace .$() with .$$()
> - Replace File.pathSeparator for Environment.CLASSPATH.name with 
> ApplicationConstants.CLASS_PATH_SEPARATOR (import 
> org.apache.hadoop.yarn.api.ApplicationConstants is required for this)
> Unless the above changes are applied, launch_container.sh will contain invalid 
> shell script statements (since they will contain Windows-specific separators), 
> and the job will fail.
> Also, the following symptoms should be fixed (I could not find the relevant 
> source code):
> - The SPARK_HOME environment variable is copied straight into 
> launch_container.sh. It should be changed to the path format of the server OS, 
> or, better, a separate environment variable or configuration variable should 
> be created.
> - The '%HADOOP_MAPRED_HOME%' string still exists in launch_container.sh after 
> the above change is applied. Maybe I missed a few lines.
> I'm not sure whether this is all, since I'm new to both Spark and YARN.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1518) Spark master doesn't compile against hadoop-common trunk

2014-05-27 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14010578#comment-14010578
 ] 

Colin Patrick McCabe commented on SPARK-1518:
-

Sounds good.  https://github.com/apache/spark/pull/898

> Spark master doesn't compile against hadoop-common trunk
> 
>
> Key: SPARK-1518
> URL: https://issues.apache.org/jira/browse/SPARK-1518
> Project: Spark
>  Issue Type: Bug
>Reporter: Marcelo Vanzin
>Assignee: Colin Patrick McCabe
>Priority: Critical
>
> FSDataOutputStream::sync() has disappeared from trunk in Hadoop; 
> FileLogger.scala is calling it.
> I've changed it locally to hsync() so I can compile the code, but haven't 
> checked yet whether those are equivalent. hsync() seems to have been there 
> forever, so it hopefully works with all versions Spark cares about.
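
For context, a minimal sketch of the substitution being discussed, assuming a Hadoop 2.x FSDataOutputStream (the path below is illustrative only):

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new Configuration())
val out = fs.create(new Path("/tmp/spark-events/app-example"))  // hypothetical path
out.writeBytes("some event\n")
// FSDataOutputStream.sync() is gone from hadoop-common trunk; hsync() is the
// durable-flush call the issue proposes using instead.
out.hsync()
out.close()
{code}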



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1941) Update streamlib to 2.7.0 and use HyperLogLogPlus instead of HyperLogLog

2014-05-27 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-1941:
--

 Summary: Update streamlib to 2.7.0 and use HyperLogLogPlus instead 
of HyperLogLog
 Key: SPARK-1941
 URL: https://issues.apache.org/jira/browse/SPARK-1941
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Reynold Xin
 Fix For: 1.1.0






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1915) AverageFunction should not count if the evaluated value is null.

2014-05-27 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14010521#comment-14010521
 ] 

Cheng Lian commented on SPARK-1915:
---

Added bug reproduction steps.

> AverageFunction should not count if the evaluated value is null.
> 
>
> Key: SPARK-1915
> URL: https://issues.apache.org/jira/browse/SPARK-1915
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
> Fix For: 1.1.0, 1.0.1
>
>
> Average values differ depending on whether the calculation is done partially 
> or not.
> This is because {{AverageFunction}} (in the non-partial calculation) counts a 
> row even if the evaluated value is null.
> To reproduce this bug, run the following in {{sbt/sbt hive/console}}:
> {code}
> scala> sql("SELECT AVG(key) FROM src1").collect().foreach(println)
> ...
> == Query Plan ==
> Aggregate false, [], [(CAST(SUM(PartialSum#648), DoubleType) / 
> CAST(SUM(PartialCount#649), DoubleType)) AS c0#644]
>  Exchange SinglePartition
>   Aggregate true, [], [COUNT(key#646) AS PartialCount#649,SUM(key#646) AS 
> PartialSum#648]
>HiveTableScan [key#646], (MetastoreRelation default, src1, None), None), 
> which is now runnable
> 14/05/28 07:04:33 INFO scheduler.DAGScheduler: Submitting 1 missing tasks 
> from Stage 8 (SchemaRDD[45] at RDD at SchemaRDD.scala:98
> == Query Plan ==
> Aggregate false, [], [(CAST(SUM(PartialSum#648), DoubleType) / 
> CAST(SUM(PartialCount#649), DoubleType)) AS c0#644]
>  Exchange SinglePartition
>   Aggregate true, [], [COUNT(key#646) AS PartialCount#649,SUM(key#646) AS 
> PartialSum#648]
>HiveTableScan [key#646], (MetastoreRelation default, src1, None), None)
> ...
> [237.06]
> scala> sql("SELECT AVG(key), COUNT(DISTINCT key) FROM 
> src1").collect().foreach(println)
> ...
> == Query Plan ==
> Aggregate false, [], [AVG(key#672) AS c0#668,COUNT(DISTINCT key#672}) AS 
> c1#669]
>  Exchange SinglePartition
>   HiveTableScan [key#672], (MetastoreRelation default, src1, None), None), 
> which is now runnable
> 14/05/28 07:21:31 INFO scheduler.DAGScheduler: Submitting 1 missing tasks 
> from Stage 12 (SchemaRDD[67] at RDD at SchemaRDD.scala:98
> == Query Plan ==
> Aggregate false, [], [AVG(key#672) AS c0#668,COUNT(DISTINCT key#672}) AS 
> c1#669]
>  Exchange SinglePartition
>   HiveTableScan [key#672], (MetastoreRelation default, src1, None), None)
> ...
> [142.24,15]
> {code}
> In the first query, {{AVG}} is broken into partial aggregation, and gives the 
> right answer (null values ignored). In the second query, since 
> {{COUNT(DISTINCT key)}} can't be turned into partial aggregation, {{AVG}} 
> isn't either, and the bug is triggered.
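
A minimal sketch of the intended semantics (this is not Spark SQL's actual AverageFunction): the average should only count rows whose evaluated value is non-null.

{code}
// Hypothetical accumulator illustrating the expected null handling.
case class NullAwareAverage(var sum: Double = 0.0, var count: Long = 0L) {
  def update(value: Option[Double]): Unit = value.foreach { v =>
    sum += v
    count += 1  // only non-null values are counted
  }
  def result: Option[Double] = if (count == 0) None else Some(sum / count)
}

val avg = NullAwareAverage()
Seq(Some(1.0), None, Some(3.0)).foreach(avg.update)
assert(avg.result == Some(2.0))  // the null row does not drag the average down
{code}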



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1915) AverageFunction should not count if the evaluated value is null.

2014-05-27 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-1915:
--

Description: 
Average values differ depending on whether the calculation is done partially or 
not.

This is because {{AverageFunction}} (in the non-partial calculation) counts a 
row even if the evaluated value is null.

To reproduce this bug, run the following in {{sbt/sbt hive/console}}:

{code}
scala> sql("SELECT AVG(key) FROM src1").collect().foreach(println)
...
== Query Plan ==
Aggregate false, [], [(CAST(SUM(PartialSum#648), DoubleType) / 
CAST(SUM(PartialCount#649), DoubleType)) AS c0#644]
 Exchange SinglePartition
  Aggregate true, [], [COUNT(key#646) AS PartialCount#649,SUM(key#646) AS 
PartialSum#648]
   HiveTableScan [key#646], (MetastoreRelation default, src1, None), None), 
which is now runnable
14/05/28 07:04:33 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from 
Stage 8 (SchemaRDD[45] at RDD at SchemaRDD.scala:98
== Query Plan ==
Aggregate false, [], [(CAST(SUM(PartialSum#648), DoubleType) / 
CAST(SUM(PartialCount#649), DoubleType)) AS c0#644]
 Exchange SinglePartition
  Aggregate true, [], [COUNT(key#646) AS PartialCount#649,SUM(key#646) AS 
PartialSum#648]
   HiveTableScan [key#646], (MetastoreRelation default, src1, None), None)
...
[237.06]

scala> sql("SELECT AVG(key), COUNT(DISTINCT key) FROM 
src1").collect().foreach(println)
...
== Query Plan ==
Aggregate false, [], [AVG(key#672) AS c0#668,COUNT(DISTINCT key#672}) AS c1#669]
 Exchange SinglePartition
  HiveTableScan [key#672], (MetastoreRelation default, src1, None), None), 
which is now runnable
14/05/28 07:21:31 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from 
Stage 12 (SchemaRDD[67] at RDD at SchemaRDD.scala:98
== Query Plan ==
Aggregate false, [], [AVG(key#672) AS c0#668,COUNT(DISTINCT key#672}) AS c1#669]
 Exchange SinglePartition
  HiveTableScan [key#672], (MetastoreRelation default, src1, None), None)
...
[142.24,15]
{code}

In the first query, {{AVG}} is broken into partial aggregation, and gives the 
right answer (null values ignored). In the second query, since {{COUNT(DISTINCT 
key)}} can't be turned into partial aggregation, {{AVG}} isn't either, and the 
bug is triggered.

  was:
Average values differ depending on whether the calculation is done partially or 
not.

This is because {{AverageFunction}} (in the non-partial calculation) counts a 
row even if the evaluated value is null.


> AverageFunction should not count if the evaluated value is null.
> 
>
> Key: SPARK-1915
> URL: https://issues.apache.org/jira/browse/SPARK-1915
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
> Fix For: 1.1.0, 1.0.1
>
>
> Average values differ depending on whether the calculation is done partially 
> or not.
> This is because {{AverageFunction}} (in the non-partial calculation) counts a 
> row even if the evaluated value is null.
> To reproduce this bug, run the following in {{sbt/sbt hive/console}}:
> {code}
> scala> sql("SELECT AVG(key) FROM src1").collect().foreach(println)
> ...
> == Query Plan ==
> Aggregate false, [], [(CAST(SUM(PartialSum#648), DoubleType) / 
> CAST(SUM(PartialCount#649), DoubleType)) AS c0#644]
>  Exchange SinglePartition
>   Aggregate true, [], [COUNT(key#646) AS PartialCount#649,SUM(key#646) AS 
> PartialSum#648]
>HiveTableScan [key#646], (MetastoreRelation default, src1, None), None), 
> which is now runnable
> 14/05/28 07:04:33 INFO scheduler.DAGScheduler: Submitting 1 missing tasks 
> from Stage 8 (SchemaRDD[45] at RDD at SchemaRDD.scala:98
> == Query Plan ==
> Aggregate false, [], [(CAST(SUM(PartialSum#648), DoubleType) / 
> CAST(SUM(PartialCount#649), DoubleType)) AS c0#644]
>  Exchange SinglePartition
>   Aggregate true, [], [COUNT(key#646) AS PartialCount#649,SUM(key#646) AS 
> PartialSum#648]
>HiveTableScan [key#646], (MetastoreRelation default, src1, None), None)
> ...
> [237.06]
> scala> sql("SELECT AVG(key), COUNT(DISTINCT key) FROM 
> src1").collect().foreach(println)
> ...
> == Query Plan ==
> Aggregate false, [], [AVG(key#672) AS c0#668,COUNT(DISTINCT key#672}) AS 
> c1#669]
>  Exchange SinglePartition
>   HiveTableScan [key#672], (MetastoreRelation default, src1, None), None), 
> which is now runnable
> 14/05/28 07:21:31 INFO scheduler.DAGScheduler: Submitting 1 missing tasks 
> from Stage 12 (SchemaRDD[67] at RDD at SchemaRDD.scala:98
> == Query Plan ==
> Aggregate false, [], [AVG(key#672) AS c0#668,COUNT(DISTINCT key#672}) AS 
> c1#669]
>  Exchange SinglePartition
>   HiveTableScan [key#672], (MetastoreRelation default, src1, None), None)
> ...
> [142.24,15]
> {code}
> In the first query, {{AVG}} is broken into partial aggregation, and gives the 
> right answer (null values ignored). In the second query, since 
> {{COUNT(DISTINCT key)}} can't be turned into partial aggregation, {{AVG}} 
> isn't either, and the bug is triggered.

[jira] [Commented] (SPARK-1566) Consolidate the Spark Programming Guide with tabs for all languages

2014-05-27 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14010496#comment-14010496
 ] 

Matei Zaharia commented on SPARK-1566:
--

https://github.com/apache/spark/pull/896

> Consolidate the Spark Programming Guide with tabs for all languages
> ---
>
> Key: SPARK-1566
> URL: https://issues.apache.org/jira/browse/SPARK-1566
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Matei Zaharia
>Assignee: Matei Zaharia
> Fix For: 1.0.0
>
>
> Right now it's Scala-only and the other ones say "look at the Scala 
> programming guide first".



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1922) hql query throws "RuntimeException: Unsupported dataType" if struct field of a table has a column with underscore in name

2014-05-27 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-1922.


Resolution: Fixed

> hql query throws "RuntimeException: Unsupported dataType" if struct field of 
> a table has a column with underscore in name
> -
>
> Key: SPARK-1922
> URL: https://issues.apache.org/jira/browse/SPARK-1922
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.0
>Reporter: llai
>Assignee: llai
> Fix For: 1.1.0, 1.0.1
>
>
> If table A has a struct field A_strct , when doing an hql 
> query like "select", "RuntimeException: Unsupported dataType" is thrown.
> Running a query with "sbt/sbt hive/console":
> {code}
> scala> hql("SELECT utc_time, pkg FROM pkg_table where year=2014 and month=1 
> limit 10").collect().foreach(x => println(x(1)))
> {code}
> Console output:
> {code}
> 14/05/25 19:50:27 INFO parse.ParseDriver: Parsing command: SELECT utc_time, 
> pkg FROM pkg_table where year=2014 and month=1 limit 10
> 14/05/25 19:50:27 INFO parse.ParseDriver: Parse Completed
> 14/05/25 19:50:28 INFO analysis.Analyzer: Max iterations (2) reached for 
> batch MultiInstanceRelations
> 14/05/25 19:50:28 INFO analysis.Analyzer: Max iterations (2) reached for 
> batch CaseInsensitiveAttributeReferences
> 14/05/25 19:50:28 INFO hive.metastore: Trying to connect to metastore with 
> URI thrift://x
> 14/05/25 19:50:28 INFO hive.metastore: Waiting 1 seconds before next 
> connection attempt.
> 14/05/25 19:50:29 INFO hive.metastore: Connected to metastore.
> java.lang.RuntimeException: Unsupported dataType: 
> struct
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.hive.HiveMetastoreTypes$.toDataType(HiveMetastoreCatalog.scala:219)
>   at 
> org.apache.spark.sql.hive.MetastoreRelation$SchemaAttribute.toAttribute(HiveMetastoreCatalog.scala:273)
>   at 
> org.apache.spark.sql.hive.MetastoreRelation$$anonfun$8.apply(HiveMetastoreCatalog.scala:283)
>   at 
> org.apache.spark.sql.hive.MetastoreRelation$$anonfun$8.apply(HiveMetastoreCatalog.scala:283)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1922) hql query throws "RuntimeException: Unsupported dataType" if struct field of a table has a column with underscore in name

2014-05-27 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-1922:
---

Assignee: Reynold Xin

> hql query throws "RuntimeException: Unsupported dataType" if struct field of 
> a table has a column with underscore in name
> -
>
> Key: SPARK-1922
> URL: https://issues.apache.org/jira/browse/SPARK-1922
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.0
>Reporter: llai
>Assignee: Reynold Xin
> Fix For: 1.1.0, 1.0.1
>
>
> If table A has a struct field A_strct , when doing an hql 
> query like "select", "RuntimeException: Unsupported dataType" is thrown.
> Running a query with "sbt/sbt hive/console":
> {code}
> scala> hql("SELECT utc_time, pkg FROM pkg_table where year=2014 and month=1 
> limit 10").collect().foreach(x => println(x(1)))
> {code}
> Console output:
> {code}
> 14/05/25 19:50:27 INFO parse.ParseDriver: Parsing command: SELECT utc_time, 
> pkg FROM pkg_table where year=2014 and month=1 limit 10
> 14/05/25 19:50:27 INFO parse.ParseDriver: Parse Completed
> 14/05/25 19:50:28 INFO analysis.Analyzer: Max iterations (2) reached for 
> batch MultiInstanceRelations
> 14/05/25 19:50:28 INFO analysis.Analyzer: Max iterations (2) reached for 
> batch CaseInsensitiveAttributeReferences
> 14/05/25 19:50:28 INFO hive.metastore: Trying to connect to metastore with 
> URI thrift://x
> 14/05/25 19:50:28 INFO hive.metastore: Waiting 1 seconds before next 
> connection attempt.
> 14/05/25 19:50:29 INFO hive.metastore: Connected to metastore.
> java.lang.RuntimeException: Unsupported dataType: 
> struct
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.hive.HiveMetastoreTypes$.toDataType(HiveMetastoreCatalog.scala:219)
>   at 
> org.apache.spark.sql.hive.MetastoreRelation$SchemaAttribute.toAttribute(HiveMetastoreCatalog.scala:273)
>   at 
> org.apache.spark.sql.hive.MetastoreRelation$$anonfun$8.apply(HiveMetastoreCatalog.scala:283)
>   at 
> org.apache.spark.sql.hive.MetastoreRelation$$anonfun$8.apply(HiveMetastoreCatalog.scala:283)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1922) hql query throws "RuntimeException: Unsupported dataType" if struct field of a table has a column with underscore in name

2014-05-27 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-1922:
---

Assignee: llai  (was: Reynold Xin)

> hql query throws "RuntimeException: Unsupported dataType" if struct field of 
> a table has a column with underscore in name
> -
>
> Key: SPARK-1922
> URL: https://issues.apache.org/jira/browse/SPARK-1922
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.0
>Reporter: llai
>Assignee: llai
> Fix For: 1.1.0, 1.0.1
>
>
> If table A has a struct field A_strct , when doing an hql 
> query like "select", "RuntimeException: Unsupported dataType" is thrown.
> Running a query with "sbt/sbt hive/console":
> {code}
> scala> hql("SELECT utc_time, pkg FROM pkg_table where year=2014 and month=1 
> limit 10").collect().foreach(x => println(x(1)))
> {code}
> Console output:
> {code}
> 14/05/25 19:50:27 INFO parse.ParseDriver: Parsing command: SELECT utc_time, 
> pkg FROM pkg_table where year=2014 and month=1 limit 10
> 14/05/25 19:50:27 INFO parse.ParseDriver: Parse Completed
> 14/05/25 19:50:28 INFO analysis.Analyzer: Max iterations (2) reached for 
> batch MultiInstanceRelations
> 14/05/25 19:50:28 INFO analysis.Analyzer: Max iterations (2) reached for 
> batch CaseInsensitiveAttributeReferences
> 14/05/25 19:50:28 INFO hive.metastore: Trying to connect to metastore with 
> URI thrift://x
> 14/05/25 19:50:28 INFO hive.metastore: Waiting 1 seconds before next 
> connection attempt.
> 14/05/25 19:50:29 INFO hive.metastore: Connected to metastore.
> java.lang.RuntimeException: Unsupported dataType: 
> struct
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.hive.HiveMetastoreTypes$.toDataType(HiveMetastoreCatalog.scala:219)
>   at 
> org.apache.spark.sql.hive.MetastoreRelation$SchemaAttribute.toAttribute(HiveMetastoreCatalog.scala:273)
>   at 
> org.apache.spark.sql.hive.MetastoreRelation$$anonfun$8.apply(HiveMetastoreCatalog.scala:283)
>   at 
> org.apache.spark.sql.hive.MetastoreRelation$$anonfun$8.apply(HiveMetastoreCatalog.scala:283)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1922) hql query throws "RuntimeException: Unsupported dataType" if struct field of a table has a column with underscore in name

2014-05-27 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-1922:
---

Fix Version/s: 1.0.1
   1.1.0

> hql query throws "RuntimeException: Unsupported dataType" if struct field of 
> a table has a column with underscore in name
> -
>
> Key: SPARK-1922
> URL: https://issues.apache.org/jira/browse/SPARK-1922
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.0
>Reporter: llai
> Fix For: 1.1.0, 1.0.1
>
>
> If table A has a struct field A_strct , when doing an hql 
> query like "select", "RuntimeException: Unsupported dataType" is thrown.
> Running a query with "sbt/sbt hive/console":
> {code}
> scala> hql("SELECT utc_time, pkg FROM pkg_table where year=2014 and month=1 
> limit 10").collect().foreach(x => println(x(1)))
> {code}
> Console output:
> {code}
> 14/05/25 19:50:27 INFO parse.ParseDriver: Parsing command: SELECT utc_time, 
> pkg FROM pkg_table where year=2014 and month=1 limit 10
> 14/05/25 19:50:27 INFO parse.ParseDriver: Parse Completed
> 14/05/25 19:50:28 INFO analysis.Analyzer: Max iterations (2) reached for 
> batch MultiInstanceRelations
> 14/05/25 19:50:28 INFO analysis.Analyzer: Max iterations (2) reached for 
> batch CaseInsensitiveAttributeReferences
> 14/05/25 19:50:28 INFO hive.metastore: Trying to connect to metastore with 
> URI thrift://x
> 14/05/25 19:50:28 INFO hive.metastore: Waiting 1 seconds before next 
> connection attempt.
> 14/05/25 19:50:29 INFO hive.metastore: Connected to metastore.
> java.lang.RuntimeException: Unsupported dataType: 
> struct
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.hive.HiveMetastoreTypes$.toDataType(HiveMetastoreCatalog.scala:219)
>   at 
> org.apache.spark.sql.hive.MetastoreRelation$SchemaAttribute.toAttribute(HiveMetastoreCatalog.scala:273)
>   at 
> org.apache.spark.sql.hive.MetastoreRelation$$anonfun$8.apply(HiveMetastoreCatalog.scala:283)
>   at 
> org.apache.spark.sql.hive.MetastoreRelation$$anonfun$8.apply(HiveMetastoreCatalog.scala:283)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1939) Improve takeSample method in RDD

2014-05-27 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-1939:
-

Assignee: Doris Xin

> Improve takeSample method in RDD
> 
>
> Key: SPARK-1939
> URL: https://issues.apache.org/jira/browse/SPARK-1939
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Doris Xin
>Assignee: Doris Xin
>  Labels: newbie
>
> reimplement takeSample with the ScaSRS algorithm



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1940) Enable rolling of executor logs (stdout / stderr)

2014-05-27 Thread Tathagata Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14010413#comment-14010413
 ] 

Tathagata Das commented on SPARK-1940:
--

This is solved in https://github.com/apache/spark/pull/895

The PR solves this by implementing a simple RollingFileAppender within Spark 
(disabled by default). When enabled (via the configuration parameter 
spark.executor.rollingLogs.enabled), the logs get rolled over at a time 
interval (set with spark.executor.rollingLogs.cleanupTtl, daily by default). 
Old logs (older than 7 days, by default) get deleted automatically. The web UI 
can show the logs across the rolled-over files.
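
A rough sketch of the general idea (not the PR's implementation; the class and parameter names below are made up): wrap the executor's output in an appender that rolls the file on a fixed interval and prunes old files.

{code}
import java.io.{File, FileOutputStream, OutputStream}

class SimpleRollingFileAppender(dir: File, baseName: String,
                                rollIntervalMs: Long, maxRetained: Int) {
  private var current: OutputStream = open()
  private var lastRoll = System.currentTimeMillis()

  // Each rolled file is suffixed with its creation timestamp.
  private def open(): OutputStream =
    new FileOutputStream(new File(dir, s"$baseName.${System.currentTimeMillis()}"), true)

  def append(bytes: Array[Byte]): Unit = synchronized {
    if (System.currentTimeMillis() - lastRoll >= rollIntervalMs) {
      current.close()
      current = open()
      lastRoll = System.currentTimeMillis()
      cleanup()
    }
    current.write(bytes)
  }

  // Keep only the newest maxRetained files so old logs are removed automatically.
  private def cleanup(): Unit = {
    val rolled = Option(dir.listFiles()).getOrElse(Array.empty[File])
      .filter(_.getName.startsWith(baseName)).sortBy(_.getName)
    rolled.dropRight(maxRetained).foreach(_.delete())
  }

  def close(): Unit = synchronized { current.close() }
}
{code}

The web-UI side would additionally need to enumerate and concatenate these files when serving logs, which is the part log4j's own RollingFileAppender cannot provide.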



> Enable rolling of executor logs (stdout / stderr)
> -
>
> Key: SPARK-1940
> URL: https://issues.apache.org/jira/browse/SPARK-1940
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Tathagata Das
>
> Currently, in the default log4j configuration, all the executor logs get sent 
> to the file [executor-working-dir]/stderr. This does not allow log 
> files to be rolled, so old logs cannot be removed. 
> Using the log4j RollingFileAppender allows log4j logs to be rolled, but all 
> the logs get sent to a different set of files, other than the files 
> stdout and stderr. So the logs are no longer visible in 
> the Spark web UI, as the Spark web UI only reads the files 
> stdout and stderr. Furthermore, it still does not 
> allow stdout and stderr to be cleared periodically in case a large amount 
> of output gets written to them (e.g. by an explicit println inside a map function).
> Solving this requires rolling the logs in such a way that the Spark web UI is 
> aware of it and can retrieve the logs across the rolled-over files.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Assigned] (SPARK-1940) Enable rolling of executor logs (stdout / stderr)

2014-05-27 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das reassigned SPARK-1940:


Assignee: Tathagata Das

> Enable rolling of executor logs (stdout / stderr)
> -
>
> Key: SPARK-1940
> URL: https://issues.apache.org/jira/browse/SPARK-1940
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>
> Currently, in the default log4j configuration, all the executor logs get sent 
> to the file [executor-working-dir]/stderr. This does not allow log 
> files to be rolled, so old logs cannot be removed. 
> Using the log4j RollingFileAppender allows log4j logs to be rolled, but all 
> the logs get sent to a different set of files, other than the files 
> stdout and stderr. So the logs are no longer visible in 
> the Spark web UI, as the Spark web UI only reads the files 
> stdout and stderr. Furthermore, it still does not 
> allow stdout and stderr to be cleared periodically in case a large amount 
> of output gets written to them (e.g. by an explicit println inside a map function).
> Solving this requires rolling the logs in such a way that the Spark web UI is 
> aware of it and can retrieve the logs across the rolled-over files.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1518) Spark master doesn't compile against hadoop-common trunk

2014-05-27 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14010395#comment-14010395
 ] 

Patrick Wendell commented on SPARK-1518:


bq. Thanks, Patrick. This is useful info... I didn't realize there was still 
interest in running Spark against CDH3. Certainly we'll never support it 
directly, since CDH3 was end-of-lifed last year. So we don't really support 
doing anything on CDH3... except upgrading it to CDH4 (or hopefully, 5 which 
has Spark. 

Yeah, the upstream project tries pretty hard to maintain compatibility with as 
many Hadoop versions as possible since we see many people running even fairly 
old ones in the wild. Of course, I'm sure Cloudera will commercially support 
only the most recent ones.

bq. Re: versioning - I was just asking whether this changes the oldest version 
we are compatible with. That's a question we should ask of all Hadoop patches. 
I just tested Spark and it doesn't compile against 0.20.X, so this is a no-op 
in terms of compatibility anyways.

It won't make the 1.0.0 window but we should get it into a 1.0.X release. My 
concern is that once newer Hadoop versions come out, I want people to be able 
to compile Spark against them. Again, since Spark is distributed independently 
from HDFS, this is something that happens a lot: people try to compile older 
Spark releases against newer Hadoop releases.

> Spark master doesn't compile against hadoop-common trunk
> 
>
> Key: SPARK-1518
> URL: https://issues.apache.org/jira/browse/SPARK-1518
> Project: Spark
>  Issue Type: Bug
>Reporter: Marcelo Vanzin
>Assignee: Colin Patrick McCabe
>Priority: Critical
>
> FSDataOutputStream::sync() has disappeared from trunk in Hadoop; 
> FileLogger.scala is calling it.
> I've changed it locally to hsync() so I can compile the code, but haven't 
> checked yet whether those are equivalent. hsync() seems to have been there 
> forever, so it hopefully works with all versions Spark cares about.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1940) Enable rolling of executor logs (stdout / stderr)

2014-05-27 Thread Tathagata Das (JIRA)
Tathagata Das created SPARK-1940:


 Summary: Enable rolling of executor logs (stdout / stderr)
 Key: SPARK-1940
 URL: https://issues.apache.org/jira/browse/SPARK-1940
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Tathagata Das


Currently, in the default log4j configuration, all the executor logs get sent 
to the file [executor-working-dir]/stderr. This does not allow log 
files to be rolled, so old logs cannot be removed. 

Using the log4j RollingFileAppender allows log4j logs to be rolled, but all the 
logs get sent to a different set of files, other than the files 
stdout and stderr. So the logs are no longer visible in 
the Spark web UI, as the Spark web UI only reads the files 
stdout and stderr. Furthermore, it still does not 
allow stdout and stderr to be cleared periodically in case a large amount 
of output gets written to them (e.g. by an explicit println inside a map function).

Solving this requires rolling the logs in such a way that the Spark web UI is 
aware of it and can retrieve the logs across the rolled-over files.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1914) Simplify CountFunction not to traverse to evaluate all child expressions.

2014-05-27 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-1914:
---

Fix Version/s: 1.1.0

> Simplify CountFunction not to traverse to evaluate all child expressions.
> -
>
> Key: SPARK-1914
> URL: https://issues.apache.org/jira/browse/SPARK-1914
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
> Fix For: 1.1.0, 1.0.1
>
>
> {{CountFunction}} should count up only if the child's evaluated value is not 
> null.
> Because it traverses and evaluates all child expressions, it counts up even 
> if the child is null, as long as any one of the children is not null.
> To reproduce this bug in {{sbt hive/console}}:
> {code}
> scala> hql("SELECT COUNT(*) FROM src1").collect()
> res1: Array[org.apache.spark.sql.Row] = Array([25])
> scala> hql("SELECT COUNT(*) FROM src1 WHERE key IS NULL").collect()
> res2: Array[org.apache.spark.sql.Row] = Array([10])
> scala> hql("SELECT COUNT(key + 1) FROM src1").collect()
> res3: Array[org.apache.spark.sql.Row] = Array([25])
> {code}
> {{res3}} should be 15 since there are 10 null keys.
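
A tiny sketch of the expected semantics (not Spark SQL's CountFunction): COUNT over an expression should skip rows where the expression evaluates to null, while COUNT(*) still sees every row.

{code}
// 10 null keys plus 15 non-null keys, mirroring the src1 example above.
val keys: Seq[Option[Int]] = Seq.fill(10)(Option.empty[Int]) ++ (1 to 15).map(Option(_))

val countStar = keys.size                           // COUNT(*): 25
val countExpr = keys.count(_.map(_ + 1).isDefined)  // COUNT(key + 1): null propagates, so 15

assert(countStar == 25)
assert(countExpr == 15)
{code}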



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1914) Simplify CountFunction not to traverse to evaluate all child expressions.

2014-05-27 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-1914:
---

Fix Version/s: (was: 1.0.0)
   1.0.1

> Simplify CountFunction not to traverse to evaluate all child expressions.
> -
>
> Key: SPARK-1914
> URL: https://issues.apache.org/jira/browse/SPARK-1914
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
> Fix For: 1.0.1
>
>
> {{CountFunction}} should count up only if the child's evaluated value is not 
> null.
> Because it traverses and evaluates all child expressions, it counts up even 
> if the child is null, as long as any one of the children is not null.
> To reproduce this bug in {{sbt hive/console}}:
> {code}
> scala> hql("SELECT COUNT(*) FROM src1").collect()
> res1: Array[org.apache.spark.sql.Row] = Array([25])
> scala> hql("SELECT COUNT(*) FROM src1 WHERE key IS NULL").collect()
> res2: Array[org.apache.spark.sql.Row] = Array([10])
> scala> hql("SELECT COUNT(key + 1) FROM src1").collect()
> res3: Array[org.apache.spark.sql.Row] = Array([25])
> {code}
> {{res3}} should be 15 since there are 10 null keys.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1939) Improve takeSample method in RDD

2014-05-27 Thread Doris Xin (JIRA)
Doris Xin created SPARK-1939:


 Summary: Improve takeSample method in RDD
 Key: SPARK-1939
 URL: https://issues.apache.org/jira/browse/SPARK-1939
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Doris Xin


reimplement takeSample with the ScaSRS algorithm



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1518) Spark master doesn't compile against hadoop-common trunk

2014-05-27 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14010364#comment-14010364
 ] 

Colin Patrick McCabe commented on SPARK-1518:
-

bq. It would be good to list the oldest upstream Hadoop version we support. 
Many people still run Spark against Hadoop 1.X/CDH3 variants. We get high 
download rates for those pre-built packages. I think this is a little different 
than e.g. CDH where people upgrade all components at the same time... people 
download newer versions of Spark and run it with old filesystems very often.

Thanks, Patrick.  This is useful info... I didn't realize there was still 
interest in running Spark against CDH3.  Certainly we'll never support it 
directly, since CDH3 was end-of-lifed last year.  So we don't really support 
doing anything on CDH3... except upgrading it to CDH4 (or hopefully, 5 which 
has Spark. :)

bq. Re: versioning - I was just asking whether this changes the oldest version 
we are compatible with. That's a question we should ask of all Hadoop patches. 
I just tested Spark and it doesn't compile against 0.20.X, so this is a no-op 
in terms of compatibility anyways.

It sounds like we're good to go on replacing the {{flush}} with {{hsync}} then? 
 I notice you marked this as "critical" recently; do you think it's important 
to 1.0?

> Spark master doesn't compile against hadoop-common trunk
> 
>
> Key: SPARK-1518
> URL: https://issues.apache.org/jira/browse/SPARK-1518
> Project: Spark
>  Issue Type: Bug
>Reporter: Marcelo Vanzin
>Assignee: Colin Patrick McCabe
>Priority: Critical
>
> FSDataOutputStream::sync() has disappeared from trunk in Hadoop; 
> FileLogger.scala is calling it.
> I've changed it locally to hsync() so I can compile the code, but haven't 
> checked yet whether those are equivalent. hsync() seems to have been there 
> forever, so it hopefully works with all versions Spark cares about.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1926) Nullability of Max/Min/First should be true.

2014-05-27 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-1926.


   Resolution: Fixed
Fix Version/s: 1.0.1
   1.1.0
 Assignee: Takuya Ueshin

> Nullability of Max/Min/First should be true.
> 
>
> Key: SPARK-1926
> URL: https://issues.apache.org/jira/browse/SPARK-1926
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
> Fix For: 1.1.0, 1.0.1
>
>
> Nullability of {{Max}}/{{Min}}/{{First}} should be {{true}} because they 
> return {{null}} if there are no rows.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1915) AverageFunction should not count if the evaluated value is null.

2014-05-27 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-1915.


   Resolution: Fixed
Fix Version/s: 1.0.1
   1.1.0
 Assignee: Takuya Ueshin

> AverageFunction should not count if the evaluated value is null.
> 
>
> Key: SPARK-1915
> URL: https://issues.apache.org/jira/browse/SPARK-1915
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
> Fix For: 1.1.0, 1.0.1
>
>
> Average values are difference between the calculation is done partially or 
> not partially.
> Because {{AverageFunction}} (in not-partially calculation) counts even if the 
> evaluated value is null.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-983) Support external sorting for RDD#sortByKey()

2014-05-27 Thread Aaron Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14010190#comment-14010190
 ] 

Aaron Davidson edited comment on SPARK-983 at 5/27/14 9:54 PM:
---

Does sound reasonable. For some reason it does not allow me to assign the issue 
to you, though.

Edit: Figured it out, thanks [~pwendell]!


was (Author: ilikerps):
Does sound reasonable. For some reason it does not allow me to assign the issue 
to you, though.

> Support external sorting for RDD#sortByKey()
> 
>
> Key: SPARK-983
> URL: https://issues.apache.org/jira/browse/SPARK-983
> Project: Spark
>  Issue Type: New Feature
>Affects Versions: 0.9.0
>Reporter: Reynold Xin
>Assignee: Madhu Siddalingaiah
>
> Currently, RDD#sortByKey() is implemented by a mapPartitions which creates a 
> buffer to hold the entire partition, then sorts it. This will cause an OOM if 
> an entire partition cannot fit in memory, which is especially problematic for 
> skewed data. Rather than OOMing, the behavior should be similar to the 
> [ExternalAppendOnlyMap|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/collection/ExternalAppendOnlyMap.scala],
>  where we fall back to disk if we detect memory pressure.
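
For illustration, a minimal external merge sort sketch (hypothetical, not Spark's planned implementation): sort bounded chunks in memory, spill each sorted run to a temporary file, then merge the runs lazily so memory stays bounded by the chunk size.

{code}
import java.io.{File, PrintWriter}
import scala.io.Source

def externalSort(input: Iterator[Int], chunkSize: Int): Iterator[Int] = {
  // Spill sorted runs to disk instead of buffering the whole partition.
  val runs = input.grouped(chunkSize).map { chunk =>
    val f = File.createTempFile("run", ".txt")
    f.deleteOnExit()
    val w = new PrintWriter(f)
    chunk.sorted.foreach(v => w.println(v))
    w.close()
    f
  }.toSeq

  // Lazy k-way merge over the sorted runs.
  val iters = runs.map(f => Source.fromFile(f).getLines().map(_.toInt).buffered)
  Iterator.continually {
    val active = iters.filter(_.hasNext)
    if (active.isEmpty) None else Some(active.minBy(_.head).next())
  }.takeWhile(_.isDefined).map(_.get)
}

// Sorts 10 values while never holding more than 3 in memory at once.
assert(externalSort(Iterator(5, 3, 9, 1, 7, 2, 8, 6, 4, 0), 3).toList == (0 to 9).toList)
{code}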



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-983) Support external sorting for RDD#sortByKey()

2014-05-27 Thread Aaron Davidson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron Davidson updated SPARK-983:
-

Assignee: Madhu Siddalingaiah

> Support external sorting for RDD#sortByKey()
> 
>
> Key: SPARK-983
> URL: https://issues.apache.org/jira/browse/SPARK-983
> Project: Spark
>  Issue Type: New Feature
>Affects Versions: 0.9.0
>Reporter: Reynold Xin
>Assignee: Madhu Siddalingaiah
>
> Currently, RDD#sortByKey() is implemented by a mapPartitions which creates a 
> buffer to hold the entire partition, then sorts it. This will cause an OOM if 
> an entire partition cannot fit in memory, which is especially problematic for 
> skewed data. Rather than OOMing, the behavior should be similar to the 
> [ExternalAppendOnlyMap|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/collection/ExternalAppendOnlyMap.scala],
>  where we fall back to disk if we detect memory pressure.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-1931) Graph.partitionBy does not reconstruct routing tables

2014-05-27 Thread Ankur Dave (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14010340#comment-14010340
 ] 

Ankur Dave edited comment on SPARK-1931 at 5/27/14 9:50 PM:


Since the fix didn't make it into Spark 1.0.0, a workaround is to partition the 
edges before constructing the graph, as follows:

{code}
import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD
import org.apache.spark.graphx._

def partitionBy[ED](edges: RDD[Edge[ED]], partitionStrategy: 
PartitionStrategy): RDD[Edge[ED]] = {
  val numPartitions = edges.partitions.size
  edges.map(e => (partitionStrategy.getPartition(e.srcId, e.dstId, 
numPartitions), e))
.partitionBy(new HashPartitioner(numPartitions))
.mapPartitions(_.map(_._2), preservesPartitioning = true)
}

val g = Graph(
  sc.parallelize(List((0L, "a"), (1L, "b"), (2L, "c"))),
  sc.parallelize(List(Edge(0L, 1L, 1), Edge(0L, 2L, 1)), 2))
assert(g.triplets.collect.map(_.toTuple).toSet ==
  Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1)))

val gPart = Graph(
  g.vertices,
  partitionBy(g.edges, PartitionStrategy.EdgePartition2D))
assert(gPart.triplets.collect.map(_.toTuple).toSet ==
  Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1)))
{code}


was (Author: ankurd):
Since the fix didn't make it into Spark 1.0.0, a workaround is to partition the 
edges before constructing the graph, as follows:

{code}
import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD
import org.apache.spark.graphx._

def partitionBy[ED](edges: RDD[Edge[ED]], partitionStrategy: 
PartitionStrategy): RDD[Edge[ED]] = {
  val numPartitions = edges.partitions.size
  edges.map(e => (partitionStrategy.getPartition(e.srcId, e.dstId, 
numPartitions), e))
.partitionBy(new HashPartitioner(numPartitions))
.mapPartitions(_.map(_._2), preservesPartitioning = true)
}

val g = Graph(
  sc.parallelize(List((0L, "a"), (1L, "b"), (2L, "c"))),
  sc.parallelize(List(Edge(0L, 1L, 1), Edge(0L, 2L, 1)), 2))
assert(g.triplets.collect.map(_.toTuple).toSet ==
  Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1)))

val gPart = Graph(
  sc.parallelize(List((0L, "a"), (1L, "b"), (2L, "c"))),
  partitionBy(sc.parallelize(List(Edge(0L, 1L, 1), Edge(0L, 2L, 1)), 2), 
PartitionStrategy.EdgePartition2D))
assert(gPart.triplets.collect.map(_.toTuple).toSet ==
  Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1)))
{code}

> Graph.partitionBy does not reconstruct routing tables
> -
>
> Key: SPARK-1931
> URL: https://issues.apache.org/jira/browse/SPARK-1931
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.0.0
>Reporter: Ankur Dave
>Assignee: Ankur Dave
> Fix For: 1.0.1
>
>
> Commit 905173df57b90f90ebafb22e43f55164445330e6 introduced a bug in 
> partitionBy where, after repartitioning the edges, it reuses the VertexRDD 
> without updating the routing tables to reflect the new edge layout. This 
> causes the following test to fail:
> {code}
>   import org.apache.spark.graphx._
>   val g = Graph(
> sc.parallelize(List((0L, "a"), (1L, "b"), (2L, "c"))),
> sc.parallelize(List(Edge(0L, 1L, 1), Edge(0L, 2L, 1)), 2))
>   assert(g.triplets.collect.map(_.toTuple).toSet ==
> Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1)))
>   val gPart = g.partitionBy(PartitionStrategy.EdgePartition2D)
>   assert(gPart.triplets.collect.map(_.toTuple).toSet ==
> Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1)))
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-1931) Graph.partitionBy does not reconstruct routing tables

2014-05-27 Thread Ankur Dave (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14010340#comment-14010340
 ] 

Ankur Dave edited comment on SPARK-1931 at 5/27/14 9:47 PM:


Since the fix didn't make it into Spark 1.0.0, a workaround is to partition the 
edges before constructing the graph, as follows:

{code}
import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD
import org.apache.spark.graphx._

def partitionBy[ED](edges: RDD[Edge[ED]], partitionStrategy: 
PartitionStrategy): RDD[Edge[ED]] = {
  val numPartitions = edges.partitions.size
  edges.map(e => (partitionStrategy.getPartition(e.srcId, e.dstId, 
numPartitions), e))
.partitionBy(new HashPartitioner(numPartitions))
.mapPartitions(_.map(_._2), preservesPartitioning = true)
}

val g = Graph(
  sc.parallelize(List((0L, "a"), (1L, "b"), (2L, "c"))),
  sc.parallelize(List(Edge(0L, 1L, 1), Edge(0L, 2L, 1)), 2))
assert(g.triplets.collect.map(_.toTuple).toSet ==
  Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1)))

val gPart = Graph(
  sc.parallelize(List((0L, "a"), (1L, "b"), (2L, "c"))),
  partitionBy(sc.parallelize(List(Edge(0L, 1L, 1), Edge(0L, 2L, 1)), 2), 
PartitionStrategy.EdgePartition2D))
assert(gPart.triplets.collect.map(_.toTuple).toSet ==
  Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1)))
{code}


was (Author: ankurd):
Since the fix didn't make it into Spark 1.0.0, a workaround is to partition the 
edges before constructing the graph as follows:

{code}
import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD
import org.apache.spark.graphx._

def partitionBy[ED](edges: RDD[Edge[ED]], partitionStrategy: 
PartitionStrategy): RDD[Edge[ED]] = {
  val numPartitions = edges.partitions.size
  edges.map(e => (partitionStrategy.getPartition(e.srcId, e.dstId, 
numPartitions), e))
.partitionBy(new HashPartitioner(numPartitions))
.mapPartitions(_.map(_._2), preservesPartitioning = true)
}

val g = Graph(
  sc.parallelize(List((0L, "a"), (1L, "b"), (2L, "c"))),
  sc.parallelize(List(Edge(0L, 1L, 1), Edge(0L, 2L, 1)), 2))
assert(g.triplets.collect.map(_.toTuple).toSet ==
  Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1)))

val gPart = Graph(
  sc.parallelize(List((0L, "a"), (1L, "b"), (2L, "c"))),
  partitionBy(sc.parallelize(List(Edge(0L, 1L, 1), Edge(0L, 2L, 1)), 2), 
PartitionStrategy.EdgePartition2D))
assert(gPart.triplets.collect.map(_.toTuple).toSet ==
  Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1)))
{code}

> Graph.partitionBy does not reconstruct routing tables
> -
>
> Key: SPARK-1931
> URL: https://issues.apache.org/jira/browse/SPARK-1931
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.0.0
>Reporter: Ankur Dave
>Assignee: Ankur Dave
> Fix For: 1.0.1
>
>
> Commit 905173df57b90f90ebafb22e43f55164445330e6 introduced a bug in 
> partitionBy where, after repartitioning the edges, it reuses the VertexRDD 
> without updating the routing tables to reflect the new edge layout. This 
> causes the following test to fail:
> {code}
>   import org.apache.spark.graphx._
>   val g = Graph(
> sc.parallelize(List((0L, "a"), (1L, "b"), (2L, "c"))),
> sc.parallelize(List(Edge(0L, 1L, 1), Edge(0L, 2L, 1)), 2))
>   assert(g.triplets.collect.map(_.toTuple).toSet ==
> Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1)))
>   val gPart = g.partitionBy(PartitionStrategy.EdgePartition2D)
>   assert(gPart.triplets.collect.map(_.toTuple).toSet ==
> Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1)))
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1518) Spark master doesn't compile against hadoop-common trunk

2014-05-27 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14010339#comment-14010339
 ] 

Patrick Wendell commented on SPARK-1518:


Re: versioning - I was just asking whether this changes the oldest version we 
are compatible with. That's a question we should ask of all Hadoop patches. I 
just tested Spark and it doesn't compile against 0.20.X, so this is a no-op in 
terms of compatibility anyway.

It would be good to list the oldest upstream Hadoop version we support. Many 
people still run Spark against Hadoop 1.X/CDH3 variants. We get high download 
rates for those pre-built packages. I think this is a little different from, 
e.g., CDH, where people upgrade all components at the same time... people very 
often download newer versions of Spark and run them against old filesystems.

> Spark master doesn't compile against hadoop-common trunk
> 
>
> Key: SPARK-1518
> URL: https://issues.apache.org/jira/browse/SPARK-1518
> Project: Spark
>  Issue Type: Bug
>Reporter: Marcelo Vanzin
>Assignee: Colin Patrick McCabe
>Priority: Critical
>
> FSDataOutputStream::sync() has disappeared from trunk in Hadoop; 
> FileLogger.scala is calling it.
> I've changed it locally to hsync() so I can compile the code, but haven't 
> checked yet whether those are equivalent. hsync() seems to have been there 
> forever, so it hopefully works with all versions Spark cares about.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1931) Graph.partitionBy does not reconstruct routing tables

2014-05-27 Thread Ankur Dave (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14010340#comment-14010340
 ] 

Ankur Dave commented on SPARK-1931:
---

Since the fix didn't make it into Spark 1.0.0, a workaround is to partition the 
edges before constructing the graph as follows:

{code}
import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD
import org.apache.spark.graphx._

def partitionBy[ED](edges: RDD[Edge[ED]], partitionStrategy: 
PartitionStrategy): RDD[Edge[ED]] = {
  val numPartitions = edges.partitions.size
  edges.map(e => (partitionStrategy.getPartition(e.srcId, e.dstId, 
numPartitions), e))
.partitionBy(new HashPartitioner(numPartitions))
.mapPartitions(_.map(_._2), preservesPartitioning = true)
}

val g = Graph(
  sc.parallelize(List((0L, "a"), (1L, "b"), (2L, "c"))),
  sc.parallelize(List(Edge(0L, 1L, 1), Edge(0L, 2L, 1)), 2))
assert(g.triplets.collect.map(_.toTuple).toSet ==
  Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1)))

val gPart = Graph(
  sc.parallelize(List((0L, "a"), (1L, "b"), (2L, "c"))),
  partitionBy(sc.parallelize(List(Edge(0L, 1L, 1), Edge(0L, 2L, 1)), 2), 
PartitionStrategy.EdgePartition2D))
assert(gPart.triplets.collect.map(_.toTuple).toSet ==
  Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1)))
{code}

> Graph.partitionBy does not reconstruct routing tables
> -
>
> Key: SPARK-1931
> URL: https://issues.apache.org/jira/browse/SPARK-1931
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.0.0
>Reporter: Ankur Dave
>Assignee: Ankur Dave
> Fix For: 1.0.1
>
>
> Commit 905173df57b90f90ebafb22e43f55164445330e6 introduced a bug in 
> partitionBy where, after repartitioning the edges, it reuses the VertexRDD 
> without updating the routing tables to reflect the new edge layout. This 
> causes the following test to fail:
> {code}
>   import org.apache.spark.graphx._
>   val g = Graph(
> sc.parallelize(List((0L, "a"), (1L, "b"), (2L, "c"))),
> sc.parallelize(List(Edge(0L, 1L, 1), Edge(0L, 2L, 1)), 2))
>   assert(g.triplets.collect.map(_.toTuple).toSet ==
> Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1)))
>   val gPart = g.partitionBy(PartitionStrategy.EdgePartition2D)
>   assert(gPart.triplets.collect.map(_.toTuple).toSet ==
> Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1)))
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1518) Spark master doesn't compile against hadoop-common trunk

2014-05-27 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14010322#comment-14010322
 ] 

Sean Owen commented on SPARK-1518:
--

RE: Hadoop versions, in my reckoning of the twisted world of Hadoop releases, 
the 0.23.x branch is still active and so is effectively later than 1.0.x. It 
may be easier to retain 0.23 compatibility than 1.0.x compatibility, for 
example.

> Spark master doesn't compile against hadoop-common trunk
> 
>
> Key: SPARK-1518
> URL: https://issues.apache.org/jira/browse/SPARK-1518
> Project: Spark
>  Issue Type: Bug
>Reporter: Marcelo Vanzin
>Assignee: Colin Patrick McCabe
>Priority: Critical
>
> FSDataOutputStream::sync() has disappeared from trunk in Hadoop; 
> FileLogger.scala is calling it.
> I've changed it locally to hsync() so I can compile the code, but haven't 
> checked yet whether those are equivalent. hsync() seems to have been there 
> forever, so it hopefully works with all versions Spark cares about.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1518) Spark master doesn't compile against hadoop-common trunk

2014-05-27 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14010305#comment-14010305
 ] 

Marcelo Vanzin commented on SPARK-1518:
---

Hmm, may I suggest a different approach?

Andrew, who wrote the code, might have more info. But from my understanding, 
the flushes were needed because the history server might read logs from 
applications that had not yet finished. So the flush was a best-effort attempt 
to avoid having the HS read files containing partial JSON objects (and failing 
to parse them).

But since then the HS has been changed to only read logs from finished 
applications. I think it's safe to assume that finished applications are no 
longer writing to the event log, so the above scenario doesn't exist anymore.

So could we just get rid of the explicit flush instead?

> Spark master doesn't compile against hadoop-common trunk
> 
>
> Key: SPARK-1518
> URL: https://issues.apache.org/jira/browse/SPARK-1518
> Project: Spark
>  Issue Type: Bug
>Reporter: Marcelo Vanzin
>Assignee: Colin Patrick McCabe
>Priority: Critical
>
> FSDataOutputStream::sync() has disappeared from trunk in Hadoop; 
> FileLogger.scala is calling it.
> I've changed it locally to hsync() so I can compile the code, but haven't 
> checked yet whether those are equivalent. hsync() seems to have been there 
> forever, so it hopefully works with all versions Spark cares about.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1518) Spark master doesn't compile against hadoop-common trunk

2014-05-27 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14010296#comment-14010296
 ] 

Colin Patrick McCabe commented on SPARK-1518:
-

I wonder if we should have a discussion on the mailing list about the oldest 
version of Hadoop we should support.  I would argue that it should be 0.23.  
Yahoo! is still using that version.  Perhaps other people have more information 
than I do, though.

If we decide to support 0.20, I will create a patch that does this using 
reflection.  But I'd rather get your opinions on whether that makes sense 
first.
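
For illustration, a minimal sketch of the reflection approach (my reading of 
it, not the actual patch): prefer hflush() when the running Hadoop version has 
it, and fall back to the old sync() otherwise, so the same bytecode works 
against both old and new hadoop-common.

{code}
import org.apache.hadoop.fs.FSDataOutputStream

// Best-effort flush that works against both old and new Hadoop: hflush()
// (Hadoop >= 0.21) is looked up reflectively; if it is missing we fall back
// to the deprecated sync().
def bestEffortFlush(out: FSDataOutputStream): Unit = {
  try {
    out.getClass.getMethod("hflush").invoke(out)
  } catch {
    case _: NoSuchMethodException =>
      out.getClass.getMethod("sync").invoke(out)
  }
}
{code}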

> Spark master doesn't compile against hadoop-common trunk
> 
>
> Key: SPARK-1518
> URL: https://issues.apache.org/jira/browse/SPARK-1518
> Project: Spark
>  Issue Type: Bug
>Reporter: Marcelo Vanzin
>Assignee: Colin Patrick McCabe
>Priority: Critical
>
> FSDataOutputStream::sync() has disappeared from trunk in Hadoop; 
> FileLogger.scala is calling it.
> I've changed it locally to hsync() so I can compile the code, but haven't 
> checked yet whether those are equivalent. hsync() seems to have been there 
> forever, so it hopefully works with all versions Spark cares about.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1518) Spark master doesn't compile against hadoop-common trunk

2014-05-27 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14010294#comment-14010294
 ] 

Sean Owen commented on SPARK-1518:
--

0.20.x stopped in early 2010. It is ancient.

> Spark master doesn't compile against hadoop-common trunk
> 
>
> Key: SPARK-1518
> URL: https://issues.apache.org/jira/browse/SPARK-1518
> Project: Spark
>  Issue Type: Bug
>Reporter: Marcelo Vanzin
>Assignee: Colin Patrick McCabe
>Priority: Critical
>
> FSDataOutputStream::sync() has disappeared from trunk in Hadoop; 
> FileLogger.scala is calling it.
> I've changed it locally to hsync() so I can compile the code, but haven't 
> checked yet whether those are equivalent. hsync() seems to have been there 
> forever, so it hopefully works with all versions Spark cares about.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1518) Spark master doesn't compile against hadoop-common trunk

2014-05-27 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14010284#comment-14010284
 ] 

Colin Patrick McCabe commented on SPARK-1518:
-

I think it's very, very, very unlikely that anyone will want to run Hadoop 0.20 
against Spark.  Why don't we:
* fix the compile against Hadoop trunk
* wait for someone to show up who wants compatibility with hadoop 0.20 before 
we work on it?

It seems like if there is interest in Spark on Hadoop 0.20, there will quickly 
be a patch submitted to get it compiling there.  If there is no such interest, 
then we'll be done here without doing a lot of work up front.

If you agree then I'll create a pull req.  I have verified that fixing the 
flush thing un-breaks the compile on master.

> Spark master doesn't compile against hadoop-common trunk
> 
>
> Key: SPARK-1518
> URL: https://issues.apache.org/jira/browse/SPARK-1518
> Project: Spark
>  Issue Type: Bug
>Reporter: Marcelo Vanzin
>Assignee: Colin Patrick McCabe
>Priority: Critical
>
> FSDataOutputStream::sync() has disappeared from trunk in Hadoop; 
> FileLogger.scala is calling it.
> I've changed it locally to hsync() so I can compile the code, but haven't 
> checked yet whether those are equivalent. hsync() seems to have been there 
> forever, so it hopefully works with all versions Spark cares about.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1931) Graph.partitionBy does not reconstruct routing tables

2014-05-27 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1931:
---

Fix Version/s: (was: 1.0.0)
   1.0.1

> Graph.partitionBy does not reconstruct routing tables
> -
>
> Key: SPARK-1931
> URL: https://issues.apache.org/jira/browse/SPARK-1931
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.0.0
>Reporter: Ankur Dave
>Assignee: Ankur Dave
> Fix For: 1.0.1
>
>
> Commit 905173df57b90f90ebafb22e43f55164445330e6 introduced a bug in 
> partitionBy where, after repartitioning the edges, it reuses the VertexRDD 
> without updating the routing tables to reflect the new edge layout. This 
> causes the following test to fail:
> {code}
>   import org.apache.spark.graphx._
>   val g = Graph(
> sc.parallelize(List((0L, "a"), (1L, "b"), (2L, "c"))),
> sc.parallelize(List(Edge(0L, 1L, 1), Edge(0L, 2L, 1)), 2))
>   assert(g.triplets.collect.map(_.toTuple).toSet ==
> Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1)))
>   val gPart = g.partitionBy(PartitionStrategy.EdgePartition2D)
>   assert(gPart.triplets.collect.map(_.toTuple).toSet ==
> Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1)))
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1518) Spark master doesn't compile against hadoop-common trunk

2014-05-27 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1518:
---

Priority: Critical  (was: Major)

> Spark master doesn't compile against hadoop-common trunk
> 
>
> Key: SPARK-1518
> URL: https://issues.apache.org/jira/browse/SPARK-1518
> Project: Spark
>  Issue Type: Bug
>Reporter: Marcelo Vanzin
>Assignee: Colin Patrick McCabe
>Priority: Critical
>
> FSDataOutputStream::sync() has disappeared from trunk in Hadoop; 
> FileLogger.scala is calling it.
> I've changed it locally to hsync() so I can compile the code, but haven't 
> checked yet whether those are equivalent. hsync() seems to have been there 
> forever, so it hopefully works with all versions Spark cares about.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1930) The Container is running beyond physical memory limits, so as to be killed.

2014-05-27 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14010258#comment-14010258
 ] 

Patrick Wendell commented on SPARK-1930:


PySpark might also be affected by this, because it launches Python VMs that 
consume memory.

> The Container is running beyond physical memory limits, so as to be killed.
> ---
>
> Key: SPARK-1930
> URL: https://issues.apache.org/jira/browse/SPARK-1930
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Reporter: Guoqiang Li
> Fix For: 1.0.1
>
>
> When the containers occupy 8G of memory, the containers are killed.
> YARN node manager log:
> {code}
> 2014-05-23 13:35:30,776 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
>  Container [pid=4947,containerID=container_1400809535638_0015_01_05] is 
> running beyond physical memory limits. Current usage: 8.6 GB of 8.5 GB 
> physical memory used; 10.0 GB of 17.8 GB virtual memory used. Killing 
> container.
> Dump of the process-tree for container_1400809535638_0015_01_05 :
> |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) 
> SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
> |- 4947 25417 4947 4947 (bash) 0 0 110804992 335 /bin/bash -c 
> /usr/java/jdk1.7.0_45-cloudera/bin/java -server -XX:OnOutOfMemoryError='kill 
> %p' -Xms8192m -Xmx8192m  -Xss2m 
> -Djava.io.tmpdir=/yarn/nm/usercache/spark/appcache/application_1400809535638_0015/container_1400809535638_0015_01_05/tmp
>   -Dlog4j.configuration=log4j-spark-container.properties 
> -Dspark.akka.askTimeout="120" -Dspark.akka.timeout="120" 
> -Dspark.akka.frameSize="20" 
> org.apache.spark.executor.CoarseGrainedExecutorBackend 
> akka.tcp://sp...@10dian71.domain.test:45477/user/CoarseGrainedScheduler 3 
> 10dian72.domain.test 4 1> 
> /var/log/hadoop-yarn/container/application_1400809535638_0015/container_1400809535638_0015_01_05/stdout
>  2> 
> /var/log/hadoop-yarn/container/application_1400809535638_0015/container_1400809535638_0015_01_05/stderr
>  
> |- 4957 4947 4947 4947 (java) 157809 12620 10667016192 2245522 
> /usr/java/jdk1.7.0_45-cloudera/bin/java -server -XX:OnOutOfMemoryError=kill 
> %p -Xms8192m -Xmx8192m -Xss2m 
> -Djava.io.tmpdir=/yarn/nm/usercache/spark/appcache/application_1400809535638_0015/container_1400809535638_0015_01_05/tmp
>  -Dlog4j.configuration=log4j-spark-container.properties 
> -Dspark.akka.askTimeout=120 -Dspark.akka.timeout=120 
> -Dspark.akka.frameSize=20 
> org.apache.spark.executor.CoarseGrainedExecutorBackend 
> akka.tcp://sp...@10dian71.domain.test:45477/user/CoarseGrainedScheduler 3 
> 10dian72.domain.test 4 
> 2014-05-23 13:35:30,776 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
>  Removed ProcessTree with root 4947
> 2014-05-23 13:35:30,776 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
>  Container container_1400809535638_0015_01_05 transitioned from RUNNING 
> to KILLING
> 2014-05-23 13:35:30,777 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
>  Cleaning up container container_1400809535638_0015_01_05
> 2014-05-23 13:35:30,788 WARN 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code 
> from container container_1400809535638_0015_01_05 is : 143
> 2014-05-23 13:35:30,829 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
>  Container container_1400809535638_0015_01_05 transitioned from KILLING 
> to CONTAINER_CLEANEDUP_AFTER_KILL
> 2014-05-23 13:35:30,830 INFO 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting 
> absolute path : 
> /yarn/nm/usercache/spark/appcache/application_1400809535638_0015/container_1400809535638_0015_01_05
> 2014-05-23 13:35:30,830 INFO 
> org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=spark
> OPERATION=Container Finished - Killed   TARGET=ContainerImpl
> RESULT=SUCCESS  APPID=application_1400809535638_0015
> CONTAINERID=container_1400809535638_0015_01_05
> 2014-05-23 13:35:30,830 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
>  Container container_1400809535638_0015_01_05 transitioned from 
> CONTAINER_CLEANEDUP_AFTER_KILL to DONE
> 2014-05-23 13:35:30,830 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application:
>  Removing container_1400809535638_0015_01_05 from application 
> application_1400809535638_0015
> {code}
> I think it should be related to {{YarnAllocationHandler.MEMORY_OVERHEAD}}  
> https://github.com/apache/spark/blob/master/yarn

[jira] [Commented] (SPARK-1935) Explicitly add commons-codec 1.5 as a dependency

2014-05-27 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14010241#comment-14010241
 ] 

Yin Huai commented on SPARK-1935:
-

I have updated my PR to use codec 1.5.
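
For reference, the fix is to declare the dependency explicitly in the build so 
the newer commons-codec wins over the 1.3 pulled in transitively. The actual 
change is in the Maven pom; an sbt-style equivalent (illustrative only) would 
be:

{code}
// build.sbt: pin commons-codec explicitly so the 1.3 brought in transitively
// by jets3t 0.7.1 does not end up on the classpath.
libraryDependencies += "commons-codec" % "commons-codec" % "1.5"
{code}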

> Explicitly add commons-codec 1.5 as a dependency
> 
>
> Key: SPARK-1935
> URL: https://issues.apache.org/jira/browse/SPARK-1935
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 0.9.1
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Minor
>
> Right now, commons-codec is a transitive dependency. When Spark is built with 
> Maven for Hadoop 1, jets3t 0.7.1 will pull in commons-codec 1.3, which is an 
> older version (Hadoop 1.0.4 depends on 1.4). This older version can cause 
> problems because 1.4 introduces incompatible changes and new methods.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1935) Explicitly add commons-codec 1.5 as a dependency

2014-05-27 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-1935:


Summary: Explicitly add commons-codec 1.5 as a dependency  (was: Explicitly 
add commons-codec 1.4 as a dependency)

> Explicitly add commons-codec 1.5 as a dependency
> 
>
> Key: SPARK-1935
> URL: https://issues.apache.org/jira/browse/SPARK-1935
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 0.9.1
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Minor
>
> Right now, commons-codec is a transitive dependency. When Spark is built with 
> Maven for Hadoop 1, jets3t 0.7.1 will pull in commons-codec 1.3, which is an 
> older version (Hadoop 1.0.4 depends on 1.4). This older version can cause 
> problems because 1.4 introduces incompatible changes and new methods.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1931) Graph.partitionBy does not reconstruct routing tables

2014-05-27 Thread Ankur Dave (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankur Dave updated SPARK-1931:
--

Description: 
Commit 905173df57b90f90ebafb22e43f55164445330e6 introduced a bug in partitionBy 
where, after repartitioning the edges, it reuses the VertexRDD without updating 
the routing tables to reflect the new edge layout. This causes the following 
test to fail:

{code}
  import org.apache.spark.graphx._
  val g = Graph(
sc.parallelize(List((0L, "a"), (1L, "b"), (2L, "c"))),
sc.parallelize(List(Edge(0L, 1L, 1), Edge(0L, 2L, 1)), 2))
  assert(g.triplets.collect.map(_.toTuple).toSet ==
Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1)))
  val gPart = g.partitionBy(PartitionStrategy.EdgePartition2D)
  assert(gPart.triplets.collect.map(_.toTuple).toSet ==
Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1)))
{code}

  was:
Commit 905173df57b90f90ebafb22e43f55164445330e6 introduced a bug in partitionBy 
where, after repartitioning the edges, it reuses the VertexRDD without updating 
the routing tables to reflect the new edge layout. This causes the following 
test to fail:

{code}
  val g = Graph(
sc.parallelize(List((0L, "a"), (1L, "b"), (2L, "c"))),
sc.parallelize(List(Edge(0L, 1L, 1), Edge(0L, 2L, 1)), 2))
  assert(g.triplets.collect.map(_.toTuple).toSet ==
Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1)))
  val gPart = g.partitionBy(PartitionStrategy.EdgePartition2D)
  assert(gPart.triplets.collect.map(_.toTuple).toSet ==
Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1)))
{code}


> Graph.partitionBy does not reconstruct routing tables
> -
>
> Key: SPARK-1931
> URL: https://issues.apache.org/jira/browse/SPARK-1931
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.0.0
>Reporter: Ankur Dave
>Assignee: Ankur Dave
> Fix For: 1.0.0
>
>
> Commit 905173df57b90f90ebafb22e43f55164445330e6 introduced a bug in 
> partitionBy where, after repartitioning the edges, it reuses the VertexRDD 
> without updating the routing tables to reflect the new edge layout. This 
> causes the following test to fail:
> {code}
>   import org.apache.spark.graphx._
>   val g = Graph(
> sc.parallelize(List((0L, "a"), (1L, "b"), (2L, "c"))),
> sc.parallelize(List(Edge(0L, 1L, 1), Edge(0L, 2L, 1)), 2))
>   assert(g.triplets.collect.map(_.toTuple).toSet ==
> Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1)))
>   val gPart = g.partitionBy(PartitionStrategy.EdgePartition2D)
>   assert(gPart.triplets.collect.map(_.toTuple).toSet ==
> Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1)))
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1931) Graph.partitionBy does not reconstruct routing tables

2014-05-27 Thread Ankur Dave (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankur Dave updated SPARK-1931:
--

Description: 
Commit 905173df57b90f90ebafb22e43f55164445330e6 introduced a bug in partitionBy 
where, after repartitioning the edges, it reuses the VertexRDD without updating 
the routing tables to reflect the new edge layout. This causes the following 
test to fail:

{code}
  val g = Graph(
sc.parallelize(List((0L, "a"), (1L, "b"), (2L, "c"))),
sc.parallelize(List(Edge(0L, 1L, 1), Edge(0L, 2L, 1)), 2))
  assert(g.triplets.collect.map(_.toTuple).toSet ==
Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1)))
  val gPart = g.partitionBy(PartitionStrategy.EdgePartition2D)
  assert(gPart.triplets.collect.map(_.toTuple).toSet ==
Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1)))
{code}

  was:
Commit 905173df57b90f90ebafb22e43f55164445330e6 introduced a bug in partitionBy 
where, after repartitioning the edges, it reuses the VertexRDD without updating 
the routing tables to reflect the new edge layout. This causes the following 
test to fail:

{code}
  val g = Graph(
sc.parallelize(List((0L, "a"), (1L, "b"), (2L, "c"))),
sc.parallelize(List(Edge(0L, 1L, 1), Edge(0L, 2L, 1)), 2))
  assert(g.triplets.collect.map(_.toTuple).toSet ===
Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1)))
  val gPart = g.partitionBy(PartitionStrategy.EdgePartition2D)
  assert(gPart.triplets.collect.map(_.toTuple).toSet ===
Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1)))
{code}


> Graph.partitionBy does not reconstruct routing tables
> -
>
> Key: SPARK-1931
> URL: https://issues.apache.org/jira/browse/SPARK-1931
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.0.0
>Reporter: Ankur Dave
>Assignee: Ankur Dave
> Fix For: 1.0.0
>
>
> Commit 905173df57b90f90ebafb22e43f55164445330e6 introduced a bug in 
> partitionBy where, after repartitioning the edges, it reuses the VertexRDD 
> without updating the routing tables to reflect the new edge layout. This 
> causes the following test to fail:
> {code}
>   val g = Graph(
> sc.parallelize(List((0L, "a"), (1L, "b"), (2L, "c"))),
> sc.parallelize(List(Edge(0L, 1L, 1), Edge(0L, 2L, 1)), 2))
>   assert(g.triplets.collect.map(_.toTuple).toSet ==
> Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1)))
>   val gPart = g.partitionBy(PartitionStrategy.EdgePartition2D)
>   assert(gPart.triplets.collect.map(_.toTuple).toSet ==
> Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1)))
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1931) Graph.partitionBy does not reconstruct routing tables

2014-05-27 Thread Ankur Dave (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankur Dave updated SPARK-1931:
--

Description: 
Commit 905173df57b90f90ebafb22e43f55164445330e6 introduced a bug in partitionBy 
where, after repartitioning the edges, it reuses the VertexRDD without updating 
the routing tables to reflect the new edge layout. This causes the following 
test to fail:

{code}
  val g = Graph(
sc.parallelize(List((0L, "a"), (1L, "b"), (2L, "c"))),
sc.parallelize(List(Edge(0L, 1L, 1), Edge(0L, 2L, 1)), 2))
  assert(g.triplets.collect.map(_.toTuple).toSet ===
Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1)))
  val gPart = g.partitionBy(PartitionStrategy.EdgePartition2D)
  assert(gPart.triplets.collect.map(_.toTuple).toSet ===
Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1)))
{code}

  was:
Commit 905173df57b90f90ebafb22e43f55164445330e6 introduced a bug in partitionBy 
where, after repartitioning the edges, it reuses the VertexRDD without updating 
the routing tables to reflect the new edge layout. This causes the following 
test to fail:

{code}
  val g = Graph(
sc.parallelize(List((0L, "a"), (1L, "b"), (2L, "c"))),
sc.parallelize(List(Edge(0L, 1L, 1), Edge(0L, 2L, 1)), 2))
  assert(g.triplets.collect.map(_.toTuple).toSet ===
Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1)))
  val gPart = g.partitionBy(EdgePartition2D)
  assert(gPart.triplets.collect.map(_.toTuple).toSet ===
Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1)))
{code}


> Graph.partitionBy does not reconstruct routing tables
> -
>
> Key: SPARK-1931
> URL: https://issues.apache.org/jira/browse/SPARK-1931
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.0.0
>Reporter: Ankur Dave
>Assignee: Ankur Dave
> Fix For: 1.0.0
>
>
> Commit 905173df57b90f90ebafb22e43f55164445330e6 introduced a bug in 
> partitionBy where, after repartitioning the edges, it reuses the VertexRDD 
> without updating the routing tables to reflect the new edge layout. This 
> causes the following test to fail:
> {code}
>   val g = Graph(
> sc.parallelize(List((0L, "a"), (1L, "b"), (2L, "c"))),
> sc.parallelize(List(Edge(0L, 1L, 1), Edge(0L, 2L, 1)), 2))
>   assert(g.triplets.collect.map(_.toTuple).toSet ===
> Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1)))
>   val gPart = g.partitionBy(PartitionStrategy.EdgePartition2D)
>   assert(gPart.triplets.collect.map(_.toTuple).toSet ===
> Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1)))
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-983) Support external sorting for RDD#sortByKey()

2014-05-27 Thread Aaron Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14010190#comment-14010190
 ] 

Aaron Davidson commented on SPARK-983:
--

That does sound reasonable. For some reason JIRA does not allow me to assign 
the issue to you, though.

> Support external sorting for RDD#sortByKey()
> 
>
> Key: SPARK-983
> URL: https://issues.apache.org/jira/browse/SPARK-983
> Project: Spark
>  Issue Type: New Feature
>Affects Versions: 0.9.0
>Reporter: Reynold Xin
>
> Currently, RDD#sortByKey() is implemented by a mapPartitions which creates a 
> buffer to hold the entire partition, then sorts it. This will cause an OOM if 
> an entire partition cannot fit in memory, which is especially problematic for 
> skewed data. Rather than OOMing, the behavior should be similar to the 
> [ExternalAppendOnlyMap|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/collection/ExternalAppendOnlyMap.scala],
>  where we fall back to disk if we detect memory pressure.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-983) Support external sorting for RDD#sortByKey()

2014-05-27 Thread Madhu Siddalingaiah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14010182#comment-14010182
 ] 

Madhu Siddalingaiah commented on SPARK-983:
---

Understood. Here are my thoughts: I can implement this feature using 
SizeEstimator, even though it's not ideal. Once there is greater consensus on 
how to manage memory more efficiently, I think it should not be hard to adapt 
the code. At a minimum, the spill/merge code will be in place. I'm watching 
[SPARK-1021|https://issues.apache.org/jira/browse/SPARK-1021], so if that's 
done first, I'll merge it in.

Does that sound reasonable? [Aaron 
Davidson|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=ilikerps], 
feel free to assign this feature to me.
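
To make the shape of that concrete, here is a minimal, self-contained sketch of 
the spill-and-merge idea (illustrative only; it uses record length as a crude 
stand-in for SizeEstimator and plain text files instead of Spark's shuffle 
machinery): buffer records until a size threshold is crossed, spill each sorted 
buffer to disk, then merge the sorted spill files.

{code}
import java.io._
import scala.collection.mutable

def externalSort(records: Iterator[String], maxBufferedBytes: Long): Iterator[String] = {
  val spills = mutable.ArrayBuffer[File]()
  val buffer = mutable.ArrayBuffer[String]()
  var buffered = 0L

  // Sort the current buffer and write it to a temporary spill file.
  def spill(): Unit = if (buffer.nonEmpty) {
    val f = File.createTempFile("sort-spill", ".txt")
    f.deleteOnExit()
    val w = new PrintWriter(new BufferedWriter(new FileWriter(f)))
    buffer.sorted.foreach(line => w.println(line))
    w.close()
    spills += f
    buffer.clear()
    buffered = 0L
  }

  records.foreach { r =>
    buffer += r
    buffered += r.length                      // crude size estimate
    if (buffered > maxBufferedBytes) spill()  // fall back to disk under memory pressure
  }
  spill()

  // k-way merge of the sorted spill files using a min-heap over their head lines.
  // (Readers are left open for brevity.)
  val readers = spills.map(f => new BufferedReader(new FileReader(f)))
  val heap = mutable.PriorityQueue.empty[(String, Int)](
    Ordering.by[(String, Int), String](_._1).reverse)
  readers.zipWithIndex.foreach { case (r, i) =>
    val line = r.readLine(); if (line != null) heap.enqueue((line, i))
  }
  new Iterator[String] {
    def hasNext: Boolean = heap.nonEmpty
    def next(): String = {
      val (value, i) = heap.dequeue()
      val line = readers(i).readLine()
      if (line != null) heap.enqueue((line, i))
      value
    }
  }
}
{code}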

> Support external sorting for RDD#sortByKey()
> 
>
> Key: SPARK-983
> URL: https://issues.apache.org/jira/browse/SPARK-983
> Project: Spark
>  Issue Type: New Feature
>Affects Versions: 0.9.0
>Reporter: Reynold Xin
>
> Currently, RDD#sortByKey() is implemented by a mapPartitions which creates a 
> buffer to hold the entire partition, then sorts it. This will cause an OOM if 
> an entire partition cannot fit in memory, which is especially problematic for 
> skewed data. Rather than OOMing, the behavior should be similar to the 
> [ExternalAppendOnlyMap|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/collection/ExternalAppendOnlyMap.scala],
>  where we fall back to disk if we detect memory pressure.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1518) Spark master doesn't compile against hadoop-common trunk

2014-05-27 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14010117#comment-14010117
 ] 

Patrick Wendell commented on SPARK-1518:


Ah okay. I'm not sure what the oldest pre-0.21 version of Hadoop that Spark 
compiles against is, but it's worth knowing whether this change would cause us 
to drop support for some of the older versions.

> Spark master doesn't compile against hadoop-common trunk
> 
>
> Key: SPARK-1518
> URL: https://issues.apache.org/jira/browse/SPARK-1518
> Project: Spark
>  Issue Type: Bug
>Reporter: Marcelo Vanzin
>Assignee: Colin Patrick McCabe
>
> FSDataOutputStream::sync() has disappeared from trunk in Hadoop; 
> FileLogger.scala is calling it.
> I've changed it locally to hsync() so I can compile the code, but haven't 
> checked yet whether those are equivalent. hsync() seems to have been there 
> forever, so it hopefully works with all versions Spark cares about.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1103) Garbage collect RDD information inside of Spark

2014-05-27 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14010099#comment-14010099
 ] 

Andrew Ash commented on SPARK-1103:
---

https://github.com/apache/spark/pull/126

> Garbage collect RDD information inside of Spark
> ---
>
> Key: SPARK-1103
> URL: https://issues.apache.org/jira/browse/SPARK-1103
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Patrick Wendell
>Assignee: Tathagata Das
>Priority: Blocker
> Fix For: 1.0.0
>
>
> When Spark jobs run for a long period of time, state accumulates. This is 
> dealt with now using TTL-based cleaning. Instead we should do proper garbage 
> collection using weak references.
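
A minimal sketch of the weak-reference idea (hypothetical, not the actual 
cleaner implementation): register each RDD with a ReferenceQueue, and once the 
RDD object itself has been garbage collected, drop whatever state was tracked 
for it.

{code}
import java.lang.ref.{ReferenceQueue, WeakReference}
import scala.collection.mutable

// Each weak reference remembers the id whose state should be dropped once the
// referent (the RDD object) has been garbage collected.
class TrackedRef(rdd: AnyRef, val rddId: Int, q: ReferenceQueue[AnyRef])
  extends WeakReference[AnyRef](rdd, q)

class CleanupTracker(doCleanup: Int => Unit) {
  private val queue = new ReferenceQueue[AnyRef]()
  // Hold the TrackedRef wrappers strongly so they are not collected themselves.
  private val refs = mutable.Set[TrackedRef]()

  def register(rdd: AnyRef, rddId: Int): Unit =
    refs.synchronized { refs += new TrackedRef(rdd, rddId, queue) }

  // Call periodically (e.g. from a daemon thread): any registered RDD that has
  // been garbage collected shows up on the queue and its state can be cleaned.
  def pollAndClean(): Unit = {
    var ref = queue.poll()
    while (ref != null) {
      val tracked = ref.asInstanceOf[TrackedRef]
      refs.synchronized { refs -= tracked }
      doCleanup(tracked.rddId)
      ref = queue.poll()
    }
  }
}
{code}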



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1518) Spark master doesn't compile against hadoop-common trunk

2014-05-27 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14010068#comment-14010068
 ] 

Colin Patrick McCabe commented on SPARK-1518:
-

bq. Hey Colin Patrick McCabe, what is the oldest version of Hadoop that 
contains hflush? /cc Andrew Or who IIRC looked into this a bunch when writing 
the logger.

The oldest Apache release I know about with hflush is Hadoop 0.21.

> Spark master doesn't compile against hadoop-common trunk
> 
>
> Key: SPARK-1518
> URL: https://issues.apache.org/jira/browse/SPARK-1518
> Project: Spark
>  Issue Type: Bug
>Reporter: Marcelo Vanzin
>Assignee: Colin Patrick McCabe
>
> FSDataOutputStream::sync() has disappeared from trunk in Hadoop; 
> FileLogger.scala is calling it.
> I've changed it locally to hsync() so I can compile the code, but haven't 
> checked yet whether those are equivalent. hsync() seems to have been there 
> forever, so it hopefully works with all versions Spark cares about.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1930) The Container is running beyond physical memory limits, so as to be killed.

2014-05-27 Thread Guoqiang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guoqiang Li updated SPARK-1930:
---

Summary: The Container is running beyond physical memory limits, so as to 
be killed.  (was:  Container  is running beyond physical memory limits)

> The Container is running beyond physical memory limits, so as to be killed.
> ---
>
> Key: SPARK-1930
> URL: https://issues.apache.org/jira/browse/SPARK-1930
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Reporter: Guoqiang Li
> Fix For: 1.0.1
>
>
> When the containers occupy 8G of memory, the containers are killed.
> YARN node manager log:
> {code}
> 2014-05-23 13:35:30,776 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
>  Container [pid=4947,containerID=container_1400809535638_0015_01_05] is 
> running beyond physical memory limits. Current usage: 8.6 GB of 8.5 GB 
> physical memory used; 10.0 GB of 17.8 GB virtual memory used. Killing 
> container.
> Dump of the process-tree for container_1400809535638_0015_01_05 :
> |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) 
> SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
> |- 4947 25417 4947 4947 (bash) 0 0 110804992 335 /bin/bash -c 
> /usr/java/jdk1.7.0_45-cloudera/bin/java -server -XX:OnOutOfMemoryError='kill 
> %p' -Xms8192m -Xmx8192m  -Xss2m 
> -Djava.io.tmpdir=/yarn/nm/usercache/spark/appcache/application_1400809535638_0015/container_1400809535638_0015_01_05/tmp
>   -Dlog4j.configuration=log4j-spark-container.properties 
> -Dspark.akka.askTimeout="120" -Dspark.akka.timeout="120" 
> -Dspark.akka.frameSize="20" 
> org.apache.spark.executor.CoarseGrainedExecutorBackend 
> akka.tcp://sp...@10dian71.domain.test:45477/user/CoarseGrainedScheduler 3 
> 10dian72.domain.test 4 1> 
> /var/log/hadoop-yarn/container/application_1400809535638_0015/container_1400809535638_0015_01_05/stdout
>  2> 
> /var/log/hadoop-yarn/container/application_1400809535638_0015/container_1400809535638_0015_01_05/stderr
>  
> |- 4957 4947 4947 4947 (java) 157809 12620 10667016192 2245522 
> /usr/java/jdk1.7.0_45-cloudera/bin/java -server -XX:OnOutOfMemoryError=kill 
> %p -Xms8192m -Xmx8192m -Xss2m 
> -Djava.io.tmpdir=/yarn/nm/usercache/spark/appcache/application_1400809535638_0015/container_1400809535638_0015_01_05/tmp
>  -Dlog4j.configuration=log4j-spark-container.properties 
> -Dspark.akka.askTimeout=120 -Dspark.akka.timeout=120 
> -Dspark.akka.frameSize=20 
> org.apache.spark.executor.CoarseGrainedExecutorBackend 
> akka.tcp://sp...@10dian71.domain.test:45477/user/CoarseGrainedScheduler 3 
> 10dian72.domain.test 4 
> 2014-05-23 13:35:30,776 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
>  Removed ProcessTree with root 4947
> 2014-05-23 13:35:30,776 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
>  Container container_1400809535638_0015_01_05 transitioned from RUNNING 
> to KILLING
> 2014-05-23 13:35:30,777 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
>  Cleaning up container container_1400809535638_0015_01_05
> 2014-05-23 13:35:30,788 WARN 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code 
> from container container_1400809535638_0015_01_05 is : 143
> 2014-05-23 13:35:30,829 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
>  Container container_1400809535638_0015_01_05 transitioned from KILLING 
> to CONTAINER_CLEANEDUP_AFTER_KILL
> 2014-05-23 13:35:30,830 INFO 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting 
> absolute path : 
> /yarn/nm/usercache/spark/appcache/application_1400809535638_0015/container_1400809535638_0015_01_05
> 2014-05-23 13:35:30,830 INFO 
> org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=spark
> OPERATION=Container Finished - Killed   TARGET=ContainerImpl
> RESULT=SUCCESS  APPID=application_1400809535638_0015
> CONTAINERID=container_1400809535638_0015_01_05
> 2014-05-23 13:35:30,830 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
>  Container container_1400809535638_0015_01_05 transitioned from 
> CONTAINER_CLEANEDUP_AFTER_KILL to DONE
> 2014-05-23 13:35:30,830 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application:
>  Removing container_1400809535638_0015_01_05 from application 
> application_1400809535638_0015
> {code}
> I think it should be related to {{YarnAllocationHandler.MEMORY_OVERHEAD}}  
> https://github.com/apache/spark/blob/master/yarn/stable/sr

[jira] [Updated] (SPARK-1930) Container is running beyond physical memory limits

2014-05-27 Thread Guoqiang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guoqiang Li updated SPARK-1930:
---

Fix Version/s: 1.0.1

>  Container  is running beyond physical memory limits
> 
>
> Key: SPARK-1930
> URL: https://issues.apache.org/jira/browse/SPARK-1930
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Reporter: Guoqiang Li
> Fix For: 1.0.1
>
>
> When the containers occupy 8G of memory, the containers are killed.
> YARN node manager log:
> {code}
> 2014-05-23 13:35:30,776 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
>  Container [pid=4947,containerID=container_1400809535638_0015_01_05] is 
> running beyond physical memory limits. Current usage: 8.6 GB of 8.5 GB 
> physical memory used; 10.0 GB of 17.8 GB virtual memory used. Killing 
> container.
> Dump of the process-tree for container_1400809535638_0015_01_05 :
> |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) 
> SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
> |- 4947 25417 4947 4947 (bash) 0 0 110804992 335 /bin/bash -c 
> /usr/java/jdk1.7.0_45-cloudera/bin/java -server -XX:OnOutOfMemoryError='kill 
> %p' -Xms8192m -Xmx8192m  -Xss2m 
> -Djava.io.tmpdir=/yarn/nm/usercache/spark/appcache/application_1400809535638_0015/container_1400809535638_0015_01_05/tmp
>   -Dlog4j.configuration=log4j-spark-container.properties 
> -Dspark.akka.askTimeout="120" -Dspark.akka.timeout="120" 
> -Dspark.akka.frameSize="20" 
> org.apache.spark.executor.CoarseGrainedExecutorBackend 
> akka.tcp://sp...@10dian71.domain.test:45477/user/CoarseGrainedScheduler 3 
> 10dian72.domain.test 4 1> 
> /var/log/hadoop-yarn/container/application_1400809535638_0015/container_1400809535638_0015_01_05/stdout
>  2> 
> /var/log/hadoop-yarn/container/application_1400809535638_0015/container_1400809535638_0015_01_05/stderr
>  
> |- 4957 4947 4947 4947 (java) 157809 12620 10667016192 2245522 
> /usr/java/jdk1.7.0_45-cloudera/bin/java -server -XX:OnOutOfMemoryError=kill 
> %p -Xms8192m -Xmx8192m -Xss2m 
> -Djava.io.tmpdir=/yarn/nm/usercache/spark/appcache/application_1400809535638_0015/container_1400809535638_0015_01_05/tmp
>  -Dlog4j.configuration=log4j-spark-container.properties 
> -Dspark.akka.askTimeout=120 -Dspark.akka.timeout=120 
> -Dspark.akka.frameSize=20 
> org.apache.spark.executor.CoarseGrainedExecutorBackend 
> akka.tcp://sp...@10dian71.domain.test:45477/user/CoarseGrainedScheduler 3 
> 10dian72.domain.test 4 
> 2014-05-23 13:35:30,776 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
>  Removed ProcessTree with root 4947
> 2014-05-23 13:35:30,776 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
>  Container container_1400809535638_0015_01_05 transitioned from RUNNING 
> to KILLING
> 2014-05-23 13:35:30,777 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
>  Cleaning up container container_1400809535638_0015_01_05
> 2014-05-23 13:35:30,788 WARN 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code 
> from container container_1400809535638_0015_01_05 is : 143
> 2014-05-23 13:35:30,829 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
>  Container container_1400809535638_0015_01_05 transitioned from KILLING 
> to CONTAINER_CLEANEDUP_AFTER_KILL
> 2014-05-23 13:35:30,830 INFO 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting 
> absolute path : 
> /yarn/nm/usercache/spark/appcache/application_1400809535638_0015/container_1400809535638_0015_01_05
> 2014-05-23 13:35:30,830 INFO 
> org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=spark
> OPERATION=Container Finished - Killed   TARGET=ContainerImpl
> RESULT=SUCCESS  APPID=application_1400809535638_0015
> CONTAINERID=container_1400809535638_0015_01_05
> 2014-05-23 13:35:30,830 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
>  Container container_1400809535638_0015_01_05 transitioned from 
> CONTAINER_CLEANEDUP_AFTER_KILL to DONE
> 2014-05-23 13:35:30,830 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application:
>  Removing container_1400809535638_0015_01_05 from application 
> application_1400809535638_0015
> {code}
> I think it should be related to {{YarnAllocationHandler.MEMORY_OVERHEAD}}  
> https://github.com/apache/spark/blob/master/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala#L562
> Relative to 8G, 384 MB is too small
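
As a rough illustration of the arithmetic in the report (assuming a fixed 
overhead constant, as the linked code suggests), the container limit ends up 
only slightly above the executor heap, which off-heap usage can easily exceed:

{code}
// Hypothetical numbers taken from the log above: an 8 GB executor heap plus a
// fixed 384 MB overhead gives a requested container size of roughly 8.4 GB,
// while the process tree was observed using 8.6 GB of physical memory.
// (YARN may additionally round the request up to its allocation increment,
// which would explain the 8.5 GB limit shown in the log.)
val executorHeapMb = 8192
val memoryOverheadMb = 384                       // assumed fixed constant
val containerRequestMb = executorHeapMb + memoryOverheadMb
println(f"Container request: ${containerRequestMb / 1024.0}%.1f GB")  // ~8.4 GB
{code}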



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-983) Support external sorting for RDD#sortByKey()

2014-05-27 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14009875#comment-14009875
 ] 

Mark Hamstra edited comment on SPARK-983 at 5/27/14 4:40 PM:
-

I'm hoping these can be kept orthogonal, but I think that it is worth noting 
the existence of SPARK-1021 and the fact that sortByKey as it currently exists 
breaks Spark's "transformations of RDDs are lazy" contract.  I'm currently 
working on that issue, which is undoubtedly going to require at least some 
merge work to be compatible with the resolution of this issue.


was (Author: markhamstra):
I'm hoping these can be kept orthogonal, but I think that it is worth noting 
the existence of https://issues.apache.org/jira/browse/SPARK-1021 and the fact 
that sortByKey as it currently exists breaks Spark's "transformations of RDDs 
are lazy" contract.  I'm currently working on that issue, which is undoubtedly 
going to require at least some merge work to be compatible with the resolution 
of this issue.

> Support external sorting for RDD#sortByKey()
> 
>
> Key: SPARK-983
> URL: https://issues.apache.org/jira/browse/SPARK-983
> Project: Spark
>  Issue Type: New Feature
>Affects Versions: 0.9.0
>Reporter: Reynold Xin
>
> Currently, RDD#sortByKey() is implemented by a mapPartitions which creates a 
> buffer to hold the entire partition, then sorts it. This will cause an OOM if 
> an entire partition cannot fit in memory, which is especially problematic for 
> skewed data. Rather than OOMing, the behavior should be similar to the 
> [ExternalAppendOnlyMap|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/collection/ExternalAppendOnlyMap.scala],
>  where we fall back to disk if we detect memory pressure.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1930) Container is running beyond physical memory limits

2014-05-27 Thread Guoqiang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guoqiang Li updated SPARK-1930:
---

Summary:  Container  is running beyond physical memory limits  (was: 
Container memory beyond limit, were killed)

>  Container  is running beyond physical memory limits
> 
>
> Key: SPARK-1930
> URL: https://issues.apache.org/jira/browse/SPARK-1930
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Reporter: Guoqiang Li
>
> When the containers occupy 8G of memory, the containers are killed.
> YARN node manager log:
> {code}
> 2014-05-23 13:35:30,776 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
>  Container [pid=4947,containerID=container_1400809535638_0015_01_05] is 
> running beyond physical memory limits. Current usage: 8.6 GB of 8.5 GB 
> physical memory used; 10.0 GB of 17.8 GB virtual memory used. Killing 
> container.
> Dump of the process-tree for container_1400809535638_0015_01_05 :
> |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) 
> SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
> |- 4947 25417 4947 4947 (bash) 0 0 110804992 335 /bin/bash -c 
> /usr/java/jdk1.7.0_45-cloudera/bin/java -server -XX:OnOutOfMemoryError='kill 
> %p' -Xms8192m -Xmx8192m  -Xss2m 
> -Djava.io.tmpdir=/yarn/nm/usercache/spark/appcache/application_1400809535638_0015/container_1400809535638_0015_01_05/tmp
>   -Dlog4j.configuration=log4j-spark-container.properties 
> -Dspark.akka.askTimeout="120" -Dspark.akka.timeout="120" 
> -Dspark.akka.frameSize="20" 
> org.apache.spark.executor.CoarseGrainedExecutorBackend 
> akka.tcp://sp...@10dian71.domain.test:45477/user/CoarseGrainedScheduler 3 
> 10dian72.domain.test 4 1> 
> /var/log/hadoop-yarn/container/application_1400809535638_0015/container_1400809535638_0015_01_05/stdout
>  2> 
> /var/log/hadoop-yarn/container/application_1400809535638_0015/container_1400809535638_0015_01_05/stderr
>  
> |- 4957 4947 4947 4947 (java) 157809 12620 10667016192 2245522 
> /usr/java/jdk1.7.0_45-cloudera/bin/java -server -XX:OnOutOfMemoryError=kill 
> %p -Xms8192m -Xmx8192m -Xss2m 
> -Djava.io.tmpdir=/yarn/nm/usercache/spark/appcache/application_1400809535638_0015/container_1400809535638_0015_01_05/tmp
>  -Dlog4j.configuration=log4j-spark-container.properties 
> -Dspark.akka.askTimeout=120 -Dspark.akka.timeout=120 
> -Dspark.akka.frameSize=20 
> org.apache.spark.executor.CoarseGrainedExecutorBackend 
> akka.tcp://sp...@10dian71.domain.test:45477/user/CoarseGrainedScheduler 3 
> 10dian72.domain.test 4 
> 2014-05-23 13:35:30,776 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
>  Removed ProcessTree with root 4947
> 2014-05-23 13:35:30,776 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
>  Container container_1400809535638_0015_01_05 transitioned from RUNNING 
> to KILLING
> 2014-05-23 13:35:30,777 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
>  Cleaning up container container_1400809535638_0015_01_05
> 2014-05-23 13:35:30,788 WARN 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code 
> from container container_1400809535638_0015_01_05 is : 143
> 2014-05-23 13:35:30,829 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
>  Container container_1400809535638_0015_01_05 transitioned from KILLING 
> to CONTAINER_CLEANEDUP_AFTER_KILL
> 2014-05-23 13:35:30,830 INFO 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting 
> absolute path : 
> /yarn/nm/usercache/spark/appcache/application_1400809535638_0015/container_1400809535638_0015_01_05
> 2014-05-23 13:35:30,830 INFO 
> org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=spark
> OPERATION=Container Finished - Killed   TARGET=ContainerImpl
> RESULT=SUCCESS  APPID=application_1400809535638_0015
> CONTAINERID=container_1400809535638_0015_01_05
> 2014-05-23 13:35:30,830 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
>  Container container_1400809535638_0015_01_05 transitioned from 
> CONTAINER_CLEANEDUP_AFTER_KILL to DONE
> 2014-05-23 13:35:30,830 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application:
>  Removing container_1400809535638_0015_01_05 from application 
> application_1400809535638_0015
> {code}
> I think it should be related to {{YarnAllocationHandler.MEMORY_OVERHEAD}}  
> https://github.com/apache/spark/blob/master/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala#L562
> Relative to 8G, 384 MB is too small

[jira] [Commented] (SPARK-983) Support external sorting for RDD#sortByKey()

2014-05-27 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14009875#comment-14009875
 ] 

Mark Hamstra commented on SPARK-983:


I'm hoping these can be kept orthogonal, but it is worth noting the existence 
of https://issues.apache.org/jira/browse/SPARK-1021 and the fact that 
sortByKey, as it currently exists, breaks Spark's "transformations of RDDs 
are lazy" contract.  I'm currently working on that issue, which will 
undoubtedly require at least some merge work to stay compatible with the 
resolution of this one.

> Support external sorting for RDD#sortByKey()
> 
>
> Key: SPARK-983
> URL: https://issues.apache.org/jira/browse/SPARK-983
> Project: Spark
>  Issue Type: New Feature
>Affects Versions: 0.9.0
>Reporter: Reynold Xin
>
> Currently, RDD#sortByKey() is implemented with a mapPartitions call that creates a 
> buffer holding the entire partition and then sorts it. This causes an OOM if 
> an entire partition cannot fit in memory, which is especially problematic for 
> skewed data. Rather than OOMing, the behavior should be similar to the 
> [ExternalAppendOnlyMap|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/collection/ExternalAppendOnlyMap.scala],
>  where we fall back to disk if we detect memory pressure.
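A minimal sketch of the in-memory approach described above and of where it OOMs (illustrative only, not the actual sortByKey code):
{code}
// Current approach, roughly: materialize the whole partition, then sort it.
// If the partition does not fit in memory, the toVector call OOMs.
def sortPartitionInMemory[K, V](iter: Iterator[(K, V)])
                               (implicit ord: Ordering[K]): Iterator[(K, V)] = {
  val buf = iter.toVector          // entire partition buffered here
  buf.sortBy(_._1).iterator        // sorted entirely in memory
}

// The requested behavior: sort bounded in-memory chunks, spill each sorted
// chunk to disk under memory pressure, and merge the spills on iteration,
// analogous to what ExternalAppendOnlyMap does for aggregation.
{code}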



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Assigned] (SPARK-1021) sortByKey() launches a cluster job when it shouldn't

2014-05-27 Thread Mark Hamstra (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Hamstra reassigned SPARK-1021:
---

Assignee: Mark Hamstra

> sortByKey() launches a cluster job when it shouldn't
> 
>
> Key: SPARK-1021
> URL: https://issues.apache.org/jira/browse/SPARK-1021
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 0.8.0, 0.9.0
>Reporter: Andrew Ash
>Assignee: Mark Hamstra
>  Labels: starter
>
> The sortByKey() method is listed as a transformation, not an action, in the 
> documentation.  But it launches a cluster job regardless.
> http://spark.incubator.apache.org/docs/latest/scala-programming-guide.html
> Some discussion on the mailing list suggested that this is a problem with the 
> rdd.count() call inside Partitioner.scala's rangeBounds method.
> https://github.com/apache/incubator-spark/blob/master/core/src/main/scala/org/apache/spark/Partitioner.scala#L102
> Josh Rosen suggests that rangeBounds should be made into a lazy variable:
> {quote}
> I wonder whether making RangePartitioner.rangeBounds into a lazy val would 
> fix this 
> (https://github.com/apache/incubator-spark/blob/6169fe14a140146602fb07cfcd13eee6efad98f9/core/src/main/scala/org/apache/spark/Partitioner.scala#L95).
>   We'd need to make sure that rangeBounds() is never called before an action 
> is performed.  This could be tricky because it's called in the 
> RangePartitioner.equals() method.  Maybe it's sufficient to just compare the 
> number of partitions, the ids of the RDDs used to create the 
> RangePartitioner, and the sort ordering.  This still supports the case where 
> I range-partition one RDD and pass the same partitioner to a different RDD.  
> It breaks support for the case where two range partitioners created on 
> different RDDs happen to have the same rangeBounds(), but this seems unlikely 
> to really harm performance since range partitioners are rarely equal by 
> chance.
> {quote}
> Can we please make this happen?  I'll send a PR on GitHub to start the 
> discussion and testing.
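A standalone sketch of the lazy-val idea from the quote above (hypothetical class and member names, not Spark's actual RangePartitioner):
{code}
class LazyBoundsPartitioner[K](val numPartitions: Int,
                               val sourceRddId: Int,
                               computeBounds: () => Array[K]) {
  // Deferred: constructing the partitioner no longer runs a job; the bounds,
  // which require sampling the RDD, are only computed on first use.
  lazy val rangeBounds: Array[K] = computeBounds()

  // equals() must not force rangeBounds, so it compares only cheap identity
  // information: the partition count and the id of the RDD the bounds came from.
  override def equals(other: Any): Boolean = other match {
    case o: LazyBoundsPartitioner[_] =>
      o.numPartitions == numPartitions && o.sourceRddId == sourceRddId
    case _ => false
  }
  override def hashCode(): Int = (numPartitions, sourceRddId).hashCode()
}
{code}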



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1935) Explicitly add commons-codec 1.4 as a dependency

2014-05-27 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14009856#comment-14009856
 ] 

Yin Huai commented on SPARK-1935:
-

[~srowen] yes, that is the reason. 

I did a quick search in parquet-mr, and it seems parquet does not have any 
imports related to commons-codec. I also checked the release notes of 
commons-codec 1.5, and 1.5 looks fine. So, should we upgrade to 1.5?
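If the decision is to move to 1.5, a minimal sketch of what pinning it could look like on the sbt side (assuming a plain dependencyOverrides entry; the exact spot in Spark's build files is not shown here):
{code}
// Force a single explicit commons-codec version so that Maven's nearest-first
// and sbt's latest-first conflict resolution stop picking different versions.
dependencyOverrides += "commons-codec" % "commons-codec" % "1.5"
{code}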

> Explicitly add commons-codec 1.4 as a dependency
> 
>
> Key: SPARK-1935
> URL: https://issues.apache.org/jira/browse/SPARK-1935
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 0.9.1
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Minor
>
> Right now, commons-codec is a transitive dependency. When Spark is built by 
> maven for Hadoop 1, jets3t 0.7.1 will pull in commons-codec 1.3 which is an 
> older version (Hadoop 1.0.4 depends on 1.4). This older version can cause 
> problems because 1.4 introduces incompatible changes and new methods.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1938) ApproxCountDistinctMergeFunction should return Int value.

2014-05-27 Thread Takuya Ueshin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14009555#comment-14009555
 ] 

Takuya Ueshin commented on SPARK-1938:
--

Opened a PR: https://github.com/apache/spark/pull/893

> ApproxCountDistinctMergeFunction should return Int value.
> -
>
> Key: SPARK-1938
> URL: https://issues.apache.org/jira/browse/SPARK-1938
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Takuya Ueshin
>
> {{ApproxCountDistinctMergeFunction}} should return an {{Int}} value because the 
> {{dataType}} of {{ApproxCountDistinct}} is {{IntegerType}}.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1938) ApproxCountDistinctMergeFunction should return Int value.

2014-05-27 Thread Takuya Ueshin (JIRA)
Takuya Ueshin created SPARK-1938:


 Summary: ApproxCountDistinctMergeFunction should return Int value.
 Key: SPARK-1938
 URL: https://issues.apache.org/jira/browse/SPARK-1938
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Takuya Ueshin


{{ApproxCountDistinctMergeFunction}} should return an {{Int}} value because the 
{{dataType}} of {{ApproxCountDistinct}} is {{IntegerType}}.
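A minimal sketch of the mismatch being fixed (a hypothetical helper, not the actual Catalyst code):
{code}
// The aggregate's declared dataType is IntegerType, so callers expect an Int.
// A HyperLogLog-style counter naturally yields a Long cardinality, which has
// to be narrowed before being returned as the merge result.
def mergeResult(cardinality: Long): Int = cardinality.toInt
{code}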



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1928) DAGScheduler suspended by local task OOM

2014-05-27 Thread Zhen Peng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14009532#comment-14009532
 ] 

Zhen Peng commented on SPARK-1928:
--

[~gq] I hit this case in our local-mode Spark Streaming application. 
In the unit tests, I have added a test case to simulate it.
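A hedged sketch of how such a case can be simulated in local mode (illustrative only, not the actual test added in the PR):
{code}
import org.apache.spark.{SparkConf, SparkContext}

object LocalTaskOomRepro {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("oom-repro"))
    try {
      // The task throws OutOfMemoryError; without the fix the DAGScheduler
      // never receives a task result and the job appears to hang forever.
      sc.parallelize(1 to 2).map[Int] { _ =>
        throw new OutOfMemoryError("simulated task OOM")
      }.count()
    } finally {
      sc.stop()
    }
  }
}
{code}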

> DAGScheduler suspended by local task OOM
> 
>
> Key: SPARK-1928
> URL: https://issues.apache.org/jira/browse/SPARK-1928
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 0.9.0
>Reporter: Zhen Peng
> Fix For: 1.0.0
>
>
> DAGScheduler does not handle local task OOM properly, and will wait for the 
> job result forever.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1937) Tasks can be submitted before executors are registered

2014-05-27 Thread Rui Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rui Li updated SPARK-1937:
--

Attachment: RSBTest.scala

Attached is the program that triggers the problem.
With the patch, the total execution time of the job drops from nearly 600s to 
around 250s.

> Tasks can be submitted before executors are registered
> --
>
> Key: SPARK-1937
> URL: https://issues.apache.org/jira/browse/SPARK-1937
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Rui Li
> Attachments: After-patch.PNG, Before-patch.png, RSBTest.scala
>
>
> During construction, TaskSetManager assigns tasks to several pending lists 
> according to the tasks’ preferred locations. If a desired location is 
> unavailable, it assigns the task to “pendingTasksWithNoPrefs”, a list 
> containing tasks without preferred locations.
> The problem is that tasks may be submitted before the executors are 
> registered with the driver, in which case TaskSetManager assigns all the 
> tasks to pendingTasksWithNoPrefs. Later, when it looks for a task to 
> schedule, it picks one from this list and assigns it to an arbitrary 
> executor, since TaskSetManager considers that the tasks can run equally well 
> on any node.
> This deprives the job of the benefits of data locality, slows the whole job 
> down, and can cause imbalance between executors.
> I ran into this issue when running a Spark program on a 7-node cluster 
> (node6~node12). The program processes 100GB of data.
> Since the data is uploaded to HDFS from node6, this node has a complete copy 
> of the data; as a result, node6 finishes tasks much faster, which in turn 
> makes it complete disproportionately more tasks than the other nodes.
> To solve this issue, I think we shouldn't check the availability of 
> executors/hosts when constructing TaskSetManager. If a task prefers a node, 
> we simply add the task to that node’s pending list. When the node is added 
> later, TaskSetManager can schedule the task at the proper locality level. If 
> the preferred node(s) never get added, TaskSetManager can still schedule the 
> task at locality level “ANY”.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Closed] (SPARK-1929) DAGScheduler suspended by local task OOM

2014-05-27 Thread Guoqiang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guoqiang Li closed SPARK-1929.
--

Resolution: Duplicate

> DAGScheduler suspended by local task OOM
> 
>
> Key: SPARK-1929
> URL: https://issues.apache.org/jira/browse/SPARK-1929
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 0.9.0
>Reporter: Zhen Peng
> Fix For: 1.0.0
>
>
> DAGScheduler does not handle local task OOM properly, and will wait for the 
> job result forever.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1929) DAGScheduler suspended by local task OOM

2014-05-27 Thread Devaraj K (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14009492#comment-14009492
 ] 

Devaraj K commented on SPARK-1929:
--

Duplicate of SPARK-1928.

> DAGScheduler suspended by local task OOM
> 
>
> Key: SPARK-1929
> URL: https://issues.apache.org/jira/browse/SPARK-1929
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 0.9.0
>Reporter: Zhen Peng
> Fix For: 1.0.0
>
>
> DAGScheduler does not handle local task OOM properly, and will wait for the 
> job result forever.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1937) Tasks can be submitted before executors are registered

2014-05-27 Thread Rui Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rui Li updated SPARK-1937:
--

Attachment: Before-patch.png

> Tasks can be submitted before executors are registered
> --
>
> Key: SPARK-1937
> URL: https://issues.apache.org/jira/browse/SPARK-1937
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Rui Li
> Attachments: After-patch.PNG, Before-patch.png
>
>
> During construction, TaskSetManager assigns tasks to several pending lists 
> according to the tasks’ preferred locations. If a desired location is 
> unavailable, it assigns the task to “pendingTasksWithNoPrefs”, a list 
> containing tasks without preferred locations.
> The problem is that tasks may be submitted before the executors are 
> registered with the driver, in which case TaskSetManager assigns all the 
> tasks to pendingTasksWithNoPrefs. Later, when it looks for a task to 
> schedule, it picks one from this list and assigns it to an arbitrary 
> executor, since TaskSetManager considers that the tasks can run equally well 
> on any node.
> This deprives the job of the benefits of data locality, slows the whole job 
> down, and can cause imbalance between executors.
> I ran into this issue when running a Spark program on a 7-node cluster 
> (node6~node12). The program processes 100GB of data.
> Since the data is uploaded to HDFS from node6, this node has a complete copy 
> of the data; as a result, node6 finishes tasks much faster, which in turn 
> makes it complete disproportionately more tasks than the other nodes.
> To solve this issue, I think we shouldn't check the availability of 
> executors/hosts when constructing TaskSetManager. If a task prefers a node, 
> we simply add the task to that node’s pending list. When the node is added 
> later, TaskSetManager can schedule the task at the proper locality level. If 
> the preferred node(s) never get added, TaskSetManager can still schedule the 
> task at locality level “ANY”.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1937) Tasks can be submitted before executors are registered

2014-05-27 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14009490#comment-14009490
 ] 

Rui Li commented on SPARK-1937:
---

Here's a quick fix:
https://github.com/apache/spark/pull/892

> Tasks can be submitted before executors are registered
> --
>
> Key: SPARK-1937
> URL: https://issues.apache.org/jira/browse/SPARK-1937
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Rui Li
> Attachments: After-patch.PNG, Before-patch.png
>
>
> During construction, TaskSetManager assigns tasks to several pending lists 
> according to the tasks’ preferred locations. If a desired location is 
> unavailable, it assigns the task to “pendingTasksWithNoPrefs”, a list 
> containing tasks without preferred locations.
> The problem is that tasks may be submitted before the executors are 
> registered with the driver, in which case TaskSetManager assigns all the 
> tasks to pendingTasksWithNoPrefs. Later, when it looks for a task to 
> schedule, it picks one from this list and assigns it to an arbitrary 
> executor, since TaskSetManager considers that the tasks can run equally well 
> on any node.
> This deprives the job of the benefits of data locality, slows the whole job 
> down, and can cause imbalance between executors.
> I ran into this issue when running a Spark program on a 7-node cluster 
> (node6~node12). The program processes 100GB of data.
> Since the data is uploaded to HDFS from node6, this node has a complete copy 
> of the data; as a result, node6 finishes tasks much faster, which in turn 
> makes it complete disproportionately more tasks than the other nodes.
> To solve this issue, I think we shouldn't check the availability of 
> executors/hosts when constructing TaskSetManager. If a task prefers a node, 
> we simply add the task to that node’s pending list. When the node is added 
> later, TaskSetManager can schedule the task at the proper locality level. If 
> the preferred node(s) never get added, TaskSetManager can still schedule the 
> task at locality level “ANY”.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1937) Tasks can be submitted before executors are registered

2014-05-27 Thread Rui Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rui Li updated SPARK-1937:
--

Attachment: After-patch.PNG

> Tasks can be submitted before executors are registered
> --
>
> Key: SPARK-1937
> URL: https://issues.apache.org/jira/browse/SPARK-1937
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Rui Li
> Attachments: After-patch.PNG, Before-patch.png
>
>
> During construction, TaskSetManager assigns tasks to several pending lists 
> according to the tasks’ preferred locations. If a desired location is 
> unavailable, it assigns the task to “pendingTasksWithNoPrefs”, a list 
> containing tasks without preferred locations.
> The problem is that tasks may be submitted before the executors are 
> registered with the driver, in which case TaskSetManager assigns all the 
> tasks to pendingTasksWithNoPrefs. Later, when it looks for a task to 
> schedule, it picks one from this list and assigns it to an arbitrary 
> executor, since TaskSetManager considers that the tasks can run equally well 
> on any node.
> This deprives the job of the benefits of data locality, slows the whole job 
> down, and can cause imbalance between executors.
> I ran into this issue when running a Spark program on a 7-node cluster 
> (node6~node12). The program processes 100GB of data.
> Since the data is uploaded to HDFS from node6, this node has a complete copy 
> of the data; as a result, node6 finishes tasks much faster, which in turn 
> makes it complete disproportionately more tasks than the other nodes.
> To solve this issue, I think we shouldn't check the availability of 
> executors/hosts when constructing TaskSetManager. If a task prefers a node, 
> we simply add the task to that node’s pending list. When the node is added 
> later, TaskSetManager can schedule the task at the proper locality level. If 
> the preferred node(s) never get added, TaskSetManager can still schedule the 
> task at locality level “ANY”.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1928) DAGScheduler suspended by local task OOM

2014-05-27 Thread Zhen Peng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14009477#comment-14009477
 ] 

Zhen Peng commented on SPARK-1928:
--

https://github.com/apache/spark/pull/883 

> DAGScheduler suspended by local task OOM
> 
>
> Key: SPARK-1928
> URL: https://issues.apache.org/jira/browse/SPARK-1928
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 0.9.0
>Reporter: Zhen Peng
> Fix For: 1.0.0
>
>
> DAGScheduler does not handle local task OOM properly, and will wait for the 
> job result forever.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1935) Explicitly add commons-codec 1.4 as a dependency

2014-05-27 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-1935:
---

Assignee: Yin Huai

> Explicitly add commons-codec 1.4 as a dependency
> 
>
> Key: SPARK-1935
> URL: https://issues.apache.org/jira/browse/SPARK-1935
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 0.9.1
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Minor
>
> Right now, commons-codec is a transitive dependency. When Spark is built by 
> maven for Hadoop 1, jets3t 0.7.1 will pull in commons-codec 1.3 which is an 
> older version (Hadoop 1.0.4 depends on 1.4). This older version can cause 
> problems because 1.4 introduces incompatible changes and new methods.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1935) Explicitly add commons-codec 1.4 as a dependency

2014-05-27 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14009472#comment-14009472
 ] 

Sean Owen commented on SPARK-1935:
--

Yeah, I think this is a matter of Maven's nearest-first vs SBT's latest-first 
conflict resolution strategy.
It should be safe to manage this dependency explicitly at 1.5, I believe.

> Explicitly add commons-codec 1.4 as a dependency
> 
>
> Key: SPARK-1935
> URL: https://issues.apache.org/jira/browse/SPARK-1935
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 0.9.1
>Reporter: Yin Huai
>Priority: Minor
>
> Right now, commons-codec is a transitive dependency. When Spark is built by 
> maven for Hadoop 1, jets3t 0.7.1 will pull in commons-codec 1.3 which is an 
> older version (Hadoop 1.0.4 depends on 1.4). This older version can cause 
> problems because 1.4 introduces incompatible changes and new methods.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1937) Tasks can be submitted before executors are registered

2014-05-27 Thread Rui Li (JIRA)
Rui Li created SPARK-1937:
-

 Summary: Tasks can be submitted before executors are registered
 Key: SPARK-1937
 URL: https://issues.apache.org/jira/browse/SPARK-1937
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Rui Li


During construction, TaskSetManager assigns tasks to several pending lists 
according to the tasks’ preferred locations. If a desired location is 
unavailable, it assigns the task to “pendingTasksWithNoPrefs”, a list 
containing tasks without preferred locations.
The problem is that tasks may be submitted before the executors are registered 
with the driver, in which case TaskSetManager assigns all the tasks to 
pendingTasksWithNoPrefs. Later, when it looks for a task to schedule, it picks 
one from this list and assigns it to an arbitrary executor, since 
TaskSetManager considers that the tasks can run equally well on any node.
This deprives the job of the benefits of data locality, slows the whole job 
down, and can cause imbalance between executors.
I ran into this issue when running a Spark program on a 7-node cluster 
(node6~node12). The program processes 100GB of data.
Since the data is uploaded to HDFS from node6, this node has a complete copy of 
the data; as a result, node6 finishes tasks much faster, which in turn makes it 
complete disproportionately more tasks than the other nodes.
To solve this issue, I think we shouldn't check the availability of 
executors/hosts when constructing TaskSetManager. If a task prefers a node, we 
simply add the task to that node’s pending list. When the node is added later, 
TaskSetManager can schedule the task at the proper locality level. If the 
preferred node(s) never get added, TaskSetManager can still schedule the task 
at locality level “ANY”.
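A rough sketch of the proposed change (simplified, hypothetical structures rather than TaskSetManager itself):
{code}
import scala.collection.mutable

class PendingTaskIndex {
  private val pendingTasksForHost =
    mutable.HashMap.empty[String, mutable.ArrayBuffer[Int]]
  private val pendingTasksWithNoPrefs = mutable.ArrayBuffer.empty[Int]

  // Register the task under its preferred hosts even if no executor on those
  // hosts has registered yet; only tasks with genuinely no preference go to
  // the no-prefs list, so late-registering executors can still pick up
  // node-local work instead of everything being scheduled at "ANY".
  def addPendingTask(taskIndex: Int, preferredHosts: Seq[String]): Unit = {
    if (preferredHosts.isEmpty) {
      pendingTasksWithNoPrefs += taskIndex
    } else {
      preferredHosts.foreach { host =>
        pendingTasksForHost.getOrElseUpdate(host, mutable.ArrayBuffer.empty[Int]) += taskIndex
      }
    }
  }
}
{code}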



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1936) Add apache header and remove author tags

2014-05-27 Thread Devaraj K (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14009352#comment-14009352
 ] 

Devaraj K commented on SPARK-1936:
--

Pull Request added: https://github.com/apache/spark/pull/890

> Add apache header and remove author tags
> 
>
> Key: SPARK-1936
> URL: https://issues.apache.org/jira/browse/SPARK-1936
> Project: Spark
>  Issue Type: Bug
>Reporter: Devaraj K
>Priority: Minor
>
> The files below don’t have the Apache license header and contain author tags.
> {code:xml}
> spark\repl\src\main\scala\org\apache\spark\repl\SparkExprTyper.scala
> spark\repl\src\main\scala\org\apache\spark\repl\SparkILoop.scala
> spark\repl\src\main\scala\org\apache\spark\repl\SparkILoopInit.scala
> spark\repl\src\main\scala\org\apache\spark\repl\SparkIMain.scala
> spark\repl\src\main\scala\org\apache\spark\repl\SparkImports.scala
> spark\repl\src\main\scala\org\apache\spark\repl\SparkJLineCompletion.scala
> spark\repl\src\main\scala\org\apache\spark\repl\SparkJLineReader.scala
> spark\repl\src\main\scala\org\apache\spark\repl\SparkMemberHandlers.scala
> {code}
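For reference, the standard ASF license header (the same text carried by Spark's other source files) would be prepended to each of them as a block comment:
{code}
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
{code}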



--
This message was sent by Atlassian JIRA
(v6.2#6252)