[jira] [Comment Edited] (SPARK-1153) Generalize VertexId in GraphX so that UUIDs can be used as vertex IDs.
[ https://issues.apache.org/jira/browse/SPARK-1153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14010846#comment-14010846 ] npanj edited comment on SPARK-1153 at 5/28/14 6:48 AM: --- An alternative approach that I have been using: 1. Use a preprocessing step that maps each UUID to a Long. 2. Build the graph from the Longs. For the mapping in step 1: - Rank your UUIDs. - Or use some kind of hash function? For step 1, GraphX could provide a tool to generate the map. I would like to hear how others are building graphs out of non-Long node types. was (Author: npanj): An alternative approach that I have been using: 1. Use a preprocessing step that maps each UUID to a Long. 2. Build the graph from the Longs. For the mapping in step 1: - Rank your UUIDs. - Or use some kind of hash function? For step 1, GraphX could provide a tool to generate the map. I would like to hear how others are building graphs out of non-Long node types > Generalize VertexId in GraphX so that UUIDs can be used as vertex IDs. > -- > > Key: SPARK-1153 > URL: https://issues.apache.org/jira/browse/SPARK-1153 > Project: Spark > Issue Type: Improvement > Components: GraphX >Affects Versions: 0.9.0 >Reporter: Deepak Nulu > > Currently, {{VertexId}} is a type synonym for {{Long}}. I would like to be > able to use {{UUID}} as the vertex ID type because the data I want to process > with GraphX uses that type for its primary keys. Others might have a different > type for their primary keys. Generalizing {{VertexId}} (with a type class) > will help in such cases. -- This message was sent by Atlassian JIRA (v6.2#6252)
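The two-step workaround in the comment above (map each UUID to a Long in a preprocessing pass, then build the graph over the Longs) can be sketched in plain Java. This is a hypothetical illustration, not GraphX code; the class and method names are invented for the example. It assigns dense, stable Long ids in first-seen order, which is the simplest "rank your UUIDs" style mapping:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

public class UuidToLongMapper {
    private final Map<UUID, Long> ids = new HashMap<>();
    private long next = 0L;

    // Step 1: map each UUID to a dense Long id (first-seen order).
    // The same UUID always gets the same Long back.
    public long idFor(UUID u) {
        return ids.computeIfAbsent(u, k -> next++);
    }

    public static void main(String[] args) {
        UuidToLongMapper m = new UuidToLongMapper();
        UUID a = UUID.randomUUID();
        UUID b = UUID.randomUUID();
        long ia = m.idFor(a);
        long ib = m.idFor(b);
        // Step 2 would build the graph's edges over ia/ib instead of a/b,
        // keeping the map around to translate results back to UUIDs.
        System.out.println(ia != ib);          // distinct UUIDs -> distinct Longs
        System.out.println(ia == m.idFor(a));  // the mapping is stable
    }
}
```

A hash-based mapping (the other option mentioned) avoids the shared map but risks collisions in the 64-bit space; the lookup-map approach trades memory for guaranteed uniqueness.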
[jira] [Commented] (SPARK-1552) GraphX performs type comparison incorrectly
[ https://issues.apache.org/jira/browse/SPARK-1552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14010851#comment-14010851 ] npanj commented on SPARK-1552: -- Does it make sense to get rid of this optimization until richer TypeTags are available? > GraphX performs type comparison incorrectly > --- > > Key: SPARK-1552 > URL: https://issues.apache.org/jira/browse/SPARK-1552 > Project: Spark > Issue Type: Bug > Components: GraphX >Reporter: Ankur Dave > > In GraphImpl, mapVertices and outerJoinVertices use a more efficient > implementation when the map function preserves vertex attribute types. This > is implemented by comparing the ClassTags of the old and new vertex attribute > types. However, ClassTags store _erased_ types, so the comparison will return > a false positive for types with different type parameters, such as > Option[Int] and Option[Double]. > Thanks to Pierre-Alexandre Fonta for reporting this bug on the [mailing > list|http://apache-spark-user-list.1001560.n3.nabble.com/GraphX-Cast-error-when-comparing-a-vertex-attribute-after-its-type-has-changed-td4119.html]. > Demo in the Scala shell: > scala> import scala.reflect.{classTag, ClassTag} > scala> def typesEqual[A: ClassTag, B: ClassTag](a: A, b: B): Boolean = > classTag[A] equals classTag[B] > scala> typesEqual(Some(1), Some(2.0)) // should return false > res2: Boolean = true > We can require richer TypeTags for these methods, or just take a flag from > the caller specifying whether the types are equal. -- This message was sent by Atlassian JIRA (v6.2#6252)
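The erasure pitfall behind SPARK-1552 is not Scala-specific; the same false positive can be reproduced in plain Java (an illustrative sketch, parallel to the Scala-shell demo in the report): because type parameters are erased at runtime, the runtime classes of `Optional<Integer>` and `Optional<Double>` are identical, just as two ClassTags for `Option[Int]` and `Option[Double]` compare equal.

```java
import java.util.Optional;

public class ErasureDemo {
    public static void main(String[] args) {
        Optional<Integer> a = Optional.of(1);
        Optional<Double> b = Optional.of(2.0);
        // Like ClassTags, runtime Class objects carry only the erased type:
        // both values are java.util.Optional at runtime, so an equality
        // check on the types wrongly "succeeds".
        System.out.println(a.getClass() == b.getClass()); // prints "true"
    }
}
```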
[jira] [Commented] (SPARK-1153) Generalize VertexId in GraphX so that UUIDs can be used as vertex IDs.
[ https://issues.apache.org/jira/browse/SPARK-1153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14010846#comment-14010846 ] npanj commented on SPARK-1153: -- An alternative approach that I have been using: 1. Use a preprocessing step that maps each UUID to a Long. 2. Build the graph from the Longs. For the mapping in step 1: - Rank your UUIDs. - Or use some kind of hash function? For step 1, GraphX could provide a tool to generate the map. I would like to hear how others are building graphs out of non-Long node types > Generalize VertexId in GraphX so that UUIDs can be used as vertex IDs. > -- > > Key: SPARK-1153 > URL: https://issues.apache.org/jira/browse/SPARK-1153 > Project: Spark > Issue Type: Improvement > Components: GraphX >Affects Versions: 0.9.0 >Reporter: Deepak Nulu > > Currently, {{VertexId}} is a type synonym for {{Long}}. I would like to be > able to use {{UUID}} as the vertex ID type because the data I want to process > with GraphX uses that type for its primary keys. Others might have a different > type for their primary keys. Generalizing {{VertexId}} (with a type class) > will help in such cases. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1945) Add full Java examples in MLlib docs
Matei Zaharia created SPARK-1945: Summary: Add full Java examples in MLlib docs Key: SPARK-1945 URL: https://issues.apache.org/jira/browse/SPARK-1945 Project: Spark Issue Type: Sub-task Components: Documentation, MLlib Reporter: Matei Zaharia Right now some of the Java tabs only say the following: "All of MLlib’s methods use Java-friendly types, so you can import and call them there the same way you do in Scala. The only caveat is that the methods take Scala RDD objects, while the Spark Java API uses a separate JavaRDD class. You can convert a Java RDD to a Scala one by calling .rdd() on your JavaRDD object." Would be nice to translate the Scala code into Java instead. Also, a few pages (most notably the Matrix one) don't have Java examples at all. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1944) Document --verbse in spark-shell -h
Andrew Ash created SPARK-1944: - Summary: Document --verbse in spark-shell -h Key: SPARK-1944 URL: https://issues.apache.org/jira/browse/SPARK-1944 Project: Spark Issue Type: Documentation Components: Spark Core Affects Versions: 1.0.0 Reporter: Andrew Ash Priority: Minor The below help for spark-submit should make mention of the {{--verbose}} option {noformat} aash@aash-mbp ~/git/spark$ ./bin/spark-submit -h Usage: spark-submit [options] [app options] Options: --master MASTER_URL spark://host:port, mesos://host:port, yarn, or local. --deploy-mode DEPLOY_MODE Mode to deploy the app in, either 'client' or 'cluster'. --class CLASS_NAME Name of your app's main class (required for Java apps). --arg ARG Argument to be passed to your application's main class. This option can be specified multiple times for multiple args. --name NAME The name of your application (Default: 'Spark'). --jars JARS A comma-separated list of local jars to include on the driver classpath and that SparkContext.addJar will work with. Doesn't work on standalone with 'cluster' deploy mode. --files FILES Comma separated list of files to be placed in the working dir of each executor. --properties-file FILE Path to a file from which to load extra properties. If not specified, this will look for conf/spark-defaults.conf. --driver-memory MEM Memory for driver (e.g. 1000M, 2G) (Default: 512M). --driver-java-options Extra Java options to pass to the driver --driver-library-path Extra library path entries to pass to the driver --driver-class-path Extra class path entries to pass to the driver. Note that jars added with --jars are automatically included in the classpath. --executor-memory MEM Memory per executor (e.g. 1000M, 2G) (Default: 1G). Spark standalone with cluster deploy mode only: --driver-cores NUM Cores for driver (Default: 1). --supervise If given, restarts the driver on failure. Spark standalone and Mesos only: --total-executor-cores NUM Total cores for all executors. 
YARN-only: --executor-cores NUM Number of cores per executor (Default: 1). --queue QUEUE_NAME The YARN queue to submit to (Default: 'default'). --num-executors NUM Number of executors to launch (Default: 2). --archives ARCHIVES Comma separated list of archives to be extracted into the working dir of each executor. aash@aash-mbp ~/git/spark$ {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1944) Document --verbose in spark-shell -h
[ https://issues.apache.org/jira/browse/SPARK-1944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Ash updated SPARK-1944: -- Summary: Document --verbose in spark-shell -h (was: Document --verbse in spark-shell -h) > Document --verbose in spark-shell -h > > > Key: SPARK-1944 > URL: https://issues.apache.org/jira/browse/SPARK-1944 > Project: Spark > Issue Type: Documentation > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Andrew Ash >Priority: Minor > > The below help for spark-submit should make mention of the {{--verbose}} > option > {noformat} > aash@aash-mbp ~/git/spark$ ./bin/spark-submit -h > Usage: spark-submit [options] [app options] > Options: > --master MASTER_URL spark://host:port, mesos://host:port, yarn, or > local. > --deploy-mode DEPLOY_MODE Mode to deploy the app in, either 'client' or > 'cluster'. > --class CLASS_NAME Name of your app's main class (required for > Java apps). > --arg ARG Argument to be passed to your application's > main class. This > option can be specified multiple times for > multiple args. > --name NAME The name of your application (Default: 'Spark'). > --jars JARS A comma-separated list of local jars to include > on the > driver classpath and that SparkContext.addJar > will work > with. Doesn't work on standalone with 'cluster' > deploy mode. > --files FILES Comma separated list of files to be placed in > the working dir > of each executor. > --properties-file FILE Path to a file from which to load extra > properties. If not > specified, this will look for > conf/spark-defaults.conf. > --driver-memory MEM Memory for driver (e.g. 1000M, 2G) (Default: > 512M). > --driver-java-options Extra Java options to pass to the driver > --driver-library-path Extra library path entries to pass to the driver > --driver-class-path Extra class path entries to pass to the driver. > Note that > jars added with --jars are automatically > included in the > classpath. 
> --executor-memory MEM Memory per executor (e.g. 1000M, 2G) (Default: > 1G). > Spark standalone with cluster deploy mode only: > --driver-cores NUM Cores for driver (Default: 1). > --supervise If given, restarts the driver on failure. > Spark standalone and Mesos only: > --total-executor-cores NUM Total cores for all executors. > YARN-only: > --executor-cores NUM Number of cores per executor (Default: 1). > --queue QUEUE_NAME The YARN queue to submit to (Default: > 'default'). > --num-executors NUM Number of executors to launch (Default: 2). > --archives ARCHIVES Comma separated list of archives to be > extracted into the > working dir of each executor. > aash@aash-mbp ~/git/spark$ > {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-1938) ApproxCountDistinctMergeFunction should return Int value.
[ https://issues.apache.org/jira/browse/SPARK-1938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-1938. Resolution: Fixed Fix Version/s: 1.0.1 1.1.0 Assignee: Takuya Ueshin > ApproxCountDistinctMergeFunction should return Int value. > - > > Key: SPARK-1938 > URL: https://issues.apache.org/jira/browse/SPARK-1938 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin > Fix For: 1.1.0, 1.0.1 > > > {{ApproxCountDistinctMergeFunction}} should return {{Int}} value because the > {{dataType}} of {{ApproxCountDistinct}} is {{IntegerType}}. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1943) Testing use of target version field
Patrick Wendell created SPARK-1943: -- Summary: Testing use of target version field Key: SPARK-1943 URL: https://issues.apache.org/jira/browse/SPARK-1943 Project: Spark Issue Type: Bug Affects Versions: 1.0.0 Reporter: Patrick Wendell -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1943) Testing use of target version field
[ https://issues.apache.org/jira/browse/SPARK-1943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1943: --- Target Version/s: (was: 0.9.2, 1.0.1) > Testing use of target version field > --- > > Key: SPARK-1943 > URL: https://issues.apache.org/jira/browse/SPARK-1943 > Project: Spark > Issue Type: Bug >Affects Versions: 1.0.0 >Reporter: Patrick Wendell > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1942) Stop clearing spark.driver.port in unit tests
Matei Zaharia created SPARK-1942: Summary: Stop clearing spark.driver.port in unit tests Key: SPARK-1942 URL: https://issues.apache.org/jira/browse/SPARK-1942 Project: Spark Issue Type: Task Components: Spark Core Reporter: Matei Zaharia Since the SparkConf change, this should no longer be needed. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1825) Windows Spark fails to work with Linux YARN
[ https://issues.apache.org/jira/browse/SPARK-1825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-1825: - Fix Version/s: 1.1.0 > Windows Spark fails to work with Linux YARN > --- > > Key: SPARK-1825 > URL: https://issues.apache.org/jira/browse/SPARK-1825 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.0.0 >Reporter: Taeyun Kim > Fix For: 1.1.0 > > > Windows Spark fails to work with Linux YARN. > This is a cross-platform problem. > On the YARN side, Hadoop 2.4.0 resolved the issue as follows: > https://issues.apache.org/jira/browse/YARN-1824 > But the Spark YARN module does not incorporate the new YARN API yet, so the problem > persists for Spark. > First, the following source files should be changed: > - /yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala > - > /yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ExecutorRunnableUtil.scala > The changes are as follows: > - Replace .$() with .$$() > - Replace File.pathSeparator for Environment.CLASSPATH.name with > ApplicationConstants.CLASS_PATH_SEPARATOR (import > org.apache.hadoop.yarn.api.ApplicationConstants is required for this) > Unless the above are applied, launch_container.sh will contain invalid shell > script statements (since they will contain Windows-specific separators), and the > job will fail. > Also, the following symptoms should be fixed (I could not find the > relevant source code): > - The SPARK_HOME environment variable is copied straight to launch_container.sh. > It should be changed to the path format for the server OS, or, better, a > separate environment variable or a configuration variable should be created. > - The '%HADOOP_MAPRED_HOME%' string still exists in launch_container.sh after > the above change is applied. Maybe I missed a few lines. > I'm not sure whether this is all, since I'm new to both Spark and YARN. -- This message was sent by Atlassian JIRA (v6.2#6252)
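The root cause described above can be shown in a small plain-Java sketch (illustrative only; the class and method names are invented, and the `<CPS>` token mirrors the separator placeholder introduced by the YARN-1824 fix rather than anything quoted in this thread). If the client bakes its own `File.pathSeparator` into launch_container.sh, a Windows client emits ';', which a Linux node cannot use; emitting a placeholder token lets the node substitute its own separator at launch time.

```java
import java.io.File;
import java.util.List;

public class ClasspathDemo {
    // Hypothetical stand-in for YARN's cross-platform separator token.
    static final String CPS = "<CPS>";

    // Broken approach: bakes the *client* OS separator into the script.
    // On Windows this yields "spark.jar;app.jar", invalid on a Linux node.
    static String clientSide(List<String> entries) {
        return String.join(File.pathSeparator, entries);
    }

    // Portable approach: emit a token instead of a concrete separator.
    static String portable(List<String> entries) {
        return String.join(CPS, entries);
    }

    // The *node* expands the token with its own separator at launch time.
    static String expandOnNode(String cp, String nodeSeparator) {
        return cp.replace(CPS, nodeSeparator);
    }

    public static void main(String[] args) {
        List<String> cp = List.of("spark.jar", "app.jar");
        System.out.println(expandOnNode(portable(cp), ":")); // spark.jar:app.jar
    }
}
```

The same reasoning explains the SPARK_HOME symptom in the report: any client-side path or separator copied verbatim into the script is wrong whenever client and node OSes differ.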
[jira] [Updated] (SPARK-1825) Windows Spark fails to work with Linux YARN
[ https://issues.apache.org/jira/browse/SPARK-1825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-1825: - Fix Version/s: (was: 1.0.0) > Windows Spark fails to work with Linux YARN > --- > > Key: SPARK-1825 > URL: https://issues.apache.org/jira/browse/SPARK-1825 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.0.0 >Reporter: Taeyun Kim > > Windows Spark fails to work with Linux YARN. > This is a cross-platform problem. > On the YARN side, Hadoop 2.4.0 resolved the issue as follows: > https://issues.apache.org/jira/browse/YARN-1824 > But the Spark YARN module does not incorporate the new YARN API yet, so the problem > persists for Spark. > First, the following source files should be changed: > - /yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala > - > /yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ExecutorRunnableUtil.scala > The changes are as follows: > - Replace .$() with .$$() > - Replace File.pathSeparator for Environment.CLASSPATH.name with > ApplicationConstants.CLASS_PATH_SEPARATOR (import > org.apache.hadoop.yarn.api.ApplicationConstants is required for this) > Unless the above are applied, launch_container.sh will contain invalid shell > script statements (since they will contain Windows-specific separators), and the > job will fail. > Also, the following symptoms should be fixed (I could not find the > relevant source code): > - The SPARK_HOME environment variable is copied straight to launch_container.sh. > It should be changed to the path format for the server OS, or, better, a > separate environment variable or a configuration variable should be created. > - The '%HADOOP_MAPRED_HOME%' string still exists in launch_container.sh after > the above change is applied. Maybe I missed a few lines. > I'm not sure whether this is all, since I'm new to both Spark and YARN. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1518) Spark master doesn't compile against hadoop-common trunk
[ https://issues.apache.org/jira/browse/SPARK-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14010578#comment-14010578 ] Colin Patrick McCabe commented on SPARK-1518: - Sounds good. https://github.com/apache/spark/pull/898 > Spark master doesn't compile against hadoop-common trunk > > > Key: SPARK-1518 > URL: https://issues.apache.org/jira/browse/SPARK-1518 > Project: Spark > Issue Type: Bug >Reporter: Marcelo Vanzin >Assignee: Colin Patrick McCabe >Priority: Critical > > FSDataOutputStream::sync() has disappeared from trunk in Hadoop; > FileLogger.scala is calling it. > I've changed it locally to hsync() so I can compile the code, but haven't > checked yet whether those are equivalent. hsync() seems to have been there > forever, so it hopefully works with all versions Spark cares about. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1941) Update streamlib to 2.7.0 and use HyperLogLogPlus instead of HyperLogLog
Reynold Xin created SPARK-1941: -- Summary: Update streamlib to 2.7.0 and use HyperLogLogPlus instead of HyperLogLog Key: SPARK-1941 URL: https://issues.apache.org/jira/browse/SPARK-1941 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Reynold Xin Fix For: 1.1.0 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1915) AverageFunction should not count if the evaluated value is null.
[ https://issues.apache.org/jira/browse/SPARK-1915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14010521#comment-14010521 ] Cheng Lian commented on SPARK-1915: --- Added bug reproduction steps. > AverageFunction should not count if the evaluated value is null. > > > Key: SPARK-1915 > URL: https://issues.apache.org/jira/browse/SPARK-1915 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin > Fix For: 1.1.0, 1.0.1 > > > Average values differ depending on whether the calculation is done partially or > not, because {{AverageFunction}} (in the non-partial calculation) counts even if the > evaluated value is null. > To reproduce this bug, run the following in {{sbt/sbt hive/console}}: > {code} > scala> sql("SELECT AVG(key) FROM src1").collect().foreach(println) > ... > == Query Plan == > Aggregate false, [], [(CAST(SUM(PartialSum#648), DoubleType) / > CAST(SUM(PartialCount#649), DoubleType)) AS c0#644] > Exchange SinglePartition > Aggregate true, [], [COUNT(key#646) AS PartialCount#649,SUM(key#646) AS > PartialSum#648] >HiveTableScan [key#646], (MetastoreRelation default, src1, None), None), > which is now runnable > 14/05/28 07:04:33 INFO scheduler.DAGScheduler: Submitting 1 missing tasks > from Stage 8 (SchemaRDD[45] at RDD at SchemaRDD.scala:98 > == Query Plan == > Aggregate false, [], [(CAST(SUM(PartialSum#648), DoubleType) / > CAST(SUM(PartialCount#649), DoubleType)) AS c0#644] > Exchange SinglePartition > Aggregate true, [], [COUNT(key#646) AS PartialCount#649,SUM(key#646) AS > PartialSum#648] >HiveTableScan [key#646], (MetastoreRelation default, src1, None), None) > ... > [237.06] > scala> sql("SELECT AVG(key), COUNT(DISTINCT key) FROM > src1").collect().foreach(println) > ... 
> == Query Plan == > Aggregate false, [], [AVG(key#672) AS c0#668,COUNT(DISTINCT key#672}) AS > c1#669] > Exchange SinglePartition > HiveTableScan [key#672], (MetastoreRelation default, src1, None), None), > which is now runnable > 14/05/28 07:21:31 INFO scheduler.DAGScheduler: Submitting 1 missing tasks > from Stage 12 (SchemaRDD[67] at RDD at SchemaRDD.scala:98 > == Query Plan == > Aggregate false, [], [AVG(key#672) AS c0#668,COUNT(DISTINCT key#672}) AS > c1#669] > Exchange SinglePartition > HiveTableScan [key#672], (MetastoreRelation default, src1, None), None) > ... > [142.24,15] > {code} > In the first query, {{AVG}} is broken into partial aggregation, and gives the > right answer (null values ignored). In the second query, since > {{COUNT(DISTINCT key)}} can't be turned into partial aggregation, {{AVG}} > isn't either, and the bug is triggered. -- This message was sent by Atlassian JIRA (v6.2#6252)
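The SPARK-1915 bug boils down to whether null rows are included in the average's count divisor. A plain-Java sketch of the two behaviors (hypothetical code, not the Catalyst `AverageFunction` implementation):

```java
import java.util.Arrays;
import java.util.List;

public class AvgNullDemo {
    // Correct behavior: a null contributes to neither the sum nor the count.
    static double avgIgnoringNulls(List<Integer> xs) {
        long count = 0;
        double sum = 0;
        for (Integer x : xs) {
            if (x != null) { sum += x; count++; }
        }
        return sum / count;
    }

    // Buggy behavior (like the non-partial path in the report):
    // nulls are skipped in the sum but still bump the count.
    static double avgCountingNulls(List<Integer> xs) {
        long count = 0;
        double sum = 0;
        for (Integer x : xs) {
            if (x != null) sum += x;
            count++;
        }
        return sum / count;
    }

    public static void main(String[] args) {
        List<Integer> xs = Arrays.asList(1, 2, null);
        System.out.println(avgIgnoringNulls(xs)); // 1.5  (sum 3 / count 2)
        System.out.println(avgCountingNulls(xs)); // 1.0  (sum 3 / count 3)
    }
}
```

This matches the reported symptom: the partial-aggregation plan divides SUM by a null-excluding COUNT and gets the right answer, while the non-partial path inflates the divisor.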
[jira] [Updated] (SPARK-1915) AverageFunction should not count if the evaluated value is null.
[ https://issues.apache.org/jira/browse/SPARK-1915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-1915: -- Description: Average values differ depending on whether the calculation is done partially or not, because {{AverageFunction}} (in the non-partial calculation) counts even if the evaluated value is null. To reproduce this bug, run the following in {{sbt/sbt hive/console}}: {code} scala> sql("SELECT AVG(key) FROM src1").collect().foreach(println) ... == Query Plan == Aggregate false, [], [(CAST(SUM(PartialSum#648), DoubleType) / CAST(SUM(PartialCount#649), DoubleType)) AS c0#644] Exchange SinglePartition Aggregate true, [], [COUNT(key#646) AS PartialCount#649,SUM(key#646) AS PartialSum#648] HiveTableScan [key#646], (MetastoreRelation default, src1, None), None), which is now runnable 14/05/28 07:04:33 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from Stage 8 (SchemaRDD[45] at RDD at SchemaRDD.scala:98 == Query Plan == Aggregate false, [], [(CAST(SUM(PartialSum#648), DoubleType) / CAST(SUM(PartialCount#649), DoubleType)) AS c0#644] Exchange SinglePartition Aggregate true, [], [COUNT(key#646) AS PartialCount#649,SUM(key#646) AS PartialSum#648] HiveTableScan [key#646], (MetastoreRelation default, src1, None), None) ... [237.06] scala> sql("SELECT AVG(key), COUNT(DISTINCT key) FROM src1").collect().foreach(println) ... == Query Plan == Aggregate false, [], [AVG(key#672) AS c0#668,COUNT(DISTINCT key#672}) AS c1#669] Exchange SinglePartition HiveTableScan [key#672], (MetastoreRelation default, src1, None), None), which is now runnable 14/05/28 07:21:31 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from Stage 12 (SchemaRDD[67] at RDD at SchemaRDD.scala:98 == Query Plan == Aggregate false, [], [AVG(key#672) AS c0#668,COUNT(DISTINCT key#672}) AS c1#669] Exchange SinglePartition HiveTableScan [key#672], (MetastoreRelation default, src1, None), None) ... 
[142.24,15] {code} In the first query, {{AVG}} is broken into partial aggregation, and gives the right answer (null values ignored). In the second query, since {{COUNT(DISTINCT key)}} can't be turned into partial aggregation, {{AVG}} isn't either, and the bug is triggered. was: Average values differ depending on whether the calculation is done partially or not, because {{AverageFunction}} (in the non-partial calculation) counts even if the evaluated value is null. > AverageFunction should not count if the evaluated value is null. > > > Key: SPARK-1915 > URL: https://issues.apache.org/jira/browse/SPARK-1915 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin > Fix For: 1.1.0, 1.0.1 > > > Average values differ depending on whether the calculation is done partially or > not, because {{AverageFunction}} (in the non-partial calculation) counts even if the > evaluated value is null. > To reproduce this bug, run the following in {{sbt/sbt hive/console}}: > {code} > scala> sql("SELECT AVG(key) FROM src1").collect().foreach(println) > ... > == Query Plan == > Aggregate false, [], [(CAST(SUM(PartialSum#648), DoubleType) / > CAST(SUM(PartialCount#649), DoubleType)) AS c0#644] > Exchange SinglePartition > Aggregate true, [], [COUNT(key#646) AS PartialCount#649,SUM(key#646) AS > PartialSum#648] >HiveTableScan [key#646], (MetastoreRelation default, src1, None), None), > which is now runnable > 14/05/28 07:04:33 INFO scheduler.DAGScheduler: Submitting 1 missing tasks > from Stage 8 (SchemaRDD[45] at RDD at SchemaRDD.scala:98 > == Query Plan == > Aggregate false, [], [(CAST(SUM(PartialSum#648), DoubleType) / > CAST(SUM(PartialCount#649), DoubleType)) AS c0#644] > Exchange SinglePartition > Aggregate true, [], [COUNT(key#646) AS PartialCount#649,SUM(key#646) AS > PartialSum#648] >HiveTableScan [key#646], (MetastoreRelation default, src1, None), None) > ... 
> [237.06] > scala> sql("SELECT AVG(key), COUNT(DISTINCT key) FROM > src1").collect().foreach(println) > ... > == Query Plan == > Aggregate false, [], [AVG(key#672) AS c0#668,COUNT(DISTINCT key#672}) AS > c1#669] > Exchange SinglePartition > HiveTableScan [key#672], (MetastoreRelation default, src1, None), None), > which is now runnable > 14/05/28 07:21:31 INFO scheduler.DAGScheduler: Submitting 1 missing tasks > from Stage 12 (SchemaRDD[67] at RDD at SchemaRDD.scala:98 > == Query Plan == > Aggregate false, [], [AVG(key#672) AS c0#668,COUNT(DISTINCT key#672}) AS > c1#669] > Exchange SinglePartition > HiveTableScan [key#672], (MetastoreRelation default, src1, None), None) > ... > [142.24,15] > {code} > In the first query, {{AVG}} is broken into partial aggregation, and gives the > right answer (null values ignored). In the second query, since > {{COUNT(DISTINCT key)}} can't be turned into partial aggregation, {{AVG}} > isn't either, and the bug is triggered. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1566) Consolidate the Spark Programming Guide with tabs for all languages
[ https://issues.apache.org/jira/browse/SPARK-1566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14010496#comment-14010496 ] Matei Zaharia commented on SPARK-1566: -- https://github.com/apache/spark/pull/896 > Consolidate the Spark Programming Guide with tabs for all languages > --- > > Key: SPARK-1566 > URL: https://issues.apache.org/jira/browse/SPARK-1566 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Reporter: Matei Zaharia >Assignee: Matei Zaharia > Fix For: 1.0.0 > > > Right now it's Scala-only and the other ones say "look at the Scala > programming guide first". -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-1922) hql query throws "RuntimeException: Unsupported dataType" if struct field of a table has a column with underscore in name
[ https://issues.apache.org/jira/browse/SPARK-1922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-1922. Resolution: Fixed > hql query throws "RuntimeException: Unsupported dataType" if struct field of > a table has a column with underscore in name > - > > Key: SPARK-1922 > URL: https://issues.apache.org/jira/browse/SPARK-1922 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.0.0 >Reporter: llai >Assignee: llai > Fix For: 1.1.0, 1.0.1 > > > If table A has a struct field A_strct, then doing an hql > query like "select" throws "RuntimeException: Unsupported dataType". > Running a query with "sbt/sbt hive/console": > {code} > scala> hql("SELECT utc_time, pkg FROM pkg_table where year=2014 and month=1 > limit 10").collect().foreach(x => println(x(1))) > {code} > Console output: > {code} > 14/05/25 19:50:27 INFO parse.ParseDriver: Parsing command: SELECT utc_time, > pkg FROM pkg_table where year=2014 and month=1 limit 10 > 14/05/25 19:50:27 INFO parse.ParseDriver: Parse Completed > 14/05/25 19:50:28 INFO analysis.Analyzer: Max iterations (2) reached for > batch MultiInstanceRelations > 14/05/25 19:50:28 INFO analysis.Analyzer: Max iterations (2) reached for > batch CaseInsensitiveAttributeReferences > 14/05/25 19:50:28 INFO hive.metastore: Trying to connect to metastore with > URI thrift://x > 14/05/25 19:50:28 INFO hive.metastore: Waiting 1 seconds before next > connection attempt. > 14/05/25 19:50:29 INFO hive.metastore: Connected to metastore. 
> java.lang.RuntimeException: Unsupported dataType: > struct > at scala.sys.package$.error(package.scala:27) > at > org.apache.spark.sql.hive.HiveMetastoreTypes$.toDataType(HiveMetastoreCatalog.scala:219) > at > org.apache.spark.sql.hive.MetastoreRelation$SchemaAttribute.toAttribute(HiveMetastoreCatalog.scala:273) > at > org.apache.spark.sql.hive.MetastoreRelation$$anonfun$8.apply(HiveMetastoreCatalog.scala:283) > at > org.apache.spark.sql.hive.MetastoreRelation$$anonfun$8.apply(HiveMetastoreCatalog.scala:283) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1922) hql query throws "RuntimeException: Unsupported dataType" if struct field of a table has a column with underscore in name
[ https://issues.apache.org/jira/browse/SPARK-1922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-1922: --- Assignee: Reynold Xin > hql query throws "RuntimeException: Unsupported dataType" if struct field of > a table has a column with underscore in name > - > > Key: SPARK-1922 > URL: https://issues.apache.org/jira/browse/SPARK-1922 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.0.0 >Reporter: llai >Assignee: Reynold Xin > Fix For: 1.1.0, 1.0.1 > > > If table A has a struct field A_strct, then doing an hql > query like "select" throws "RuntimeException: Unsupported dataType". > Running a query with "sbt/sbt hive/console": > {code} > scala> hql("SELECT utc_time, pkg FROM pkg_table where year=2014 and month=1 > limit 10").collect().foreach(x => println(x(1))) > {code} > Console output: > {code} > 14/05/25 19:50:27 INFO parse.ParseDriver: Parsing command: SELECT utc_time, > pkg FROM pkg_table where year=2014 and month=1 limit 10 > 14/05/25 19:50:27 INFO parse.ParseDriver: Parse Completed > 14/05/25 19:50:28 INFO analysis.Analyzer: Max iterations (2) reached for > batch MultiInstanceRelations > 14/05/25 19:50:28 INFO analysis.Analyzer: Max iterations (2) reached for > batch CaseInsensitiveAttributeReferences > 14/05/25 19:50:28 INFO hive.metastore: Trying to connect to metastore with > URI thrift://x > 14/05/25 19:50:28 INFO hive.metastore: Waiting 1 seconds before next > connection attempt. > 14/05/25 19:50:29 INFO hive.metastore: Connected to metastore. 
> java.lang.RuntimeException: Unsupported dataType: > struct > at scala.sys.package$.error(package.scala:27) > at > org.apache.spark.sql.hive.HiveMetastoreTypes$.toDataType(HiveMetastoreCatalog.scala:219) > at > org.apache.spark.sql.hive.MetastoreRelation$SchemaAttribute.toAttribute(HiveMetastoreCatalog.scala:273) > at > org.apache.spark.sql.hive.MetastoreRelation$$anonfun$8.apply(HiveMetastoreCatalog.scala:283) > at > org.apache.spark.sql.hive.MetastoreRelation$$anonfun$8.apply(HiveMetastoreCatalog.scala:283) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1922) hql query throws "RuntimeException: Unsupported dataType" if struct field of a table has a column with underscore in name
[ https://issues.apache.org/jira/browse/SPARK-1922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-1922: --- Assignee: llai (was: Reynold Xin) > hql query throws "RuntimeException: Unsupported dataType" if struct field of > a table has a column with underscore in name > - > > Key: SPARK-1922 > URL: https://issues.apache.org/jira/browse/SPARK-1922 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.0.0 >Reporter: llai >Assignee: llai > Fix For: 1.1.0, 1.0.1 > > > If table A has a struct field A_strct, then doing an hql > query like "select" throws "RuntimeException: Unsupported dataType". > Running a query with "sbt/sbt hive/console": > {code} > scala> hql("SELECT utc_time, pkg FROM pkg_table where year=2014 and month=1 > limit 10").collect().foreach(x => println(x(1))) > {code} > Console output: > {code} > 14/05/25 19:50:27 INFO parse.ParseDriver: Parsing command: SELECT utc_time, > pkg FROM pkg_table where year=2014 and month=1 limit 10 > 14/05/25 19:50:27 INFO parse.ParseDriver: Parse Completed > 14/05/25 19:50:28 INFO analysis.Analyzer: Max iterations (2) reached for > batch MultiInstanceRelations > 14/05/25 19:50:28 INFO analysis.Analyzer: Max iterations (2) reached for > batch CaseInsensitiveAttributeReferences > 14/05/25 19:50:28 INFO hive.metastore: Trying to connect to metastore with > URI thrift://x > 14/05/25 19:50:28 INFO hive.metastore: Waiting 1 seconds before next > connection attempt. > 14/05/25 19:50:29 INFO hive.metastore: Connected to metastore. 
> java.lang.RuntimeException: Unsupported dataType: > struct > at scala.sys.package$.error(package.scala:27) > at > org.apache.spark.sql.hive.HiveMetastoreTypes$.toDataType(HiveMetastoreCatalog.scala:219) > at > org.apache.spark.sql.hive.MetastoreRelation$SchemaAttribute.toAttribute(HiveMetastoreCatalog.scala:273) > at > org.apache.spark.sql.hive.MetastoreRelation$$anonfun$8.apply(HiveMetastoreCatalog.scala:283) > at > org.apache.spark.sql.hive.MetastoreRelation$$anonfun$8.apply(HiveMetastoreCatalog.scala:283) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1922) hql query throws "RuntimeException: Unsupported dataType" if struct field of a table has a column with underscore in name
[ https://issues.apache.org/jira/browse/SPARK-1922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-1922: --- Fix Version/s: 1.0.1 1.1.0 > hql query throws "RuntimeException: Unsupported dataType" if struct field of > a table has a column with underscore in name > - > > Key: SPARK-1922 > URL: https://issues.apache.org/jira/browse/SPARK-1922 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.0.0 >Reporter: llai > Fix For: 1.1.0, 1.0.1 > > > If table A has a struct field A_strct, when doing an hql > query like "select", "RuntimeException: Unsupported dataType" is thrown. > Running a query with "sbt/sbt hive/console": > {code} > scala> hql("SELECT utc_time, pkg FROM pkg_table where year=2014 and month=1 > limit 10").collect().foreach(x => println(x(1))) > {code} > Console output: > {code} > 14/05/25 19:50:27 INFO parse.ParseDriver: Parsing command: SELECT utc_time, > pkg FROM pkg_table where year=2014 and month=1 limit 10 > 14/05/25 19:50:27 INFO parse.ParseDriver: Parse Completed > 14/05/25 19:50:28 INFO analysis.Analyzer: Max iterations (2) reached for > batch MultiInstanceRelations > 14/05/25 19:50:28 INFO analysis.Analyzer: Max iterations (2) reached for > batch CaseInsensitiveAttributeReferences > 14/05/25 19:50:28 INFO hive.metastore: Trying to connect to metastore with > URI thrift://x > 14/05/25 19:50:28 INFO hive.metastore: Waiting 1 seconds before next > connection attempt. > 14/05/25 19:50:29 INFO hive.metastore: Connected to metastore. 
> java.lang.RuntimeException: Unsupported dataType: > struct > at scala.sys.package$.error(package.scala:27) > at > org.apache.spark.sql.hive.HiveMetastoreTypes$.toDataType(HiveMetastoreCatalog.scala:219) > at > org.apache.spark.sql.hive.MetastoreRelation$SchemaAttribute.toAttribute(HiveMetastoreCatalog.scala:273) > at > org.apache.spark.sql.hive.MetastoreRelation$$anonfun$8.apply(HiveMetastoreCatalog.scala:283) > at > org.apache.spark.sql.hive.MetastoreRelation$$anonfun$8.apply(HiveMetastoreCatalog.scala:283) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
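The stack trace points at {{HiveMetastoreTypes$.toDataType}}, which parses the metastore's type string (e.g. "struct<utc_time:string,pkg:string>"). A plausible reading of the bug is that the field-name pattern in that parser does not accept underscores, so a field like "utc_time" fails to match. The following is a hedged, self-contained sketch of that failure mode; {{StructParseSketch}}, {{narrowPattern}}, and {{broadPattern}} are illustrative names, not Spark's actual code.

```scala
// Illustrative sketch: parsing a metastore struct type string.
// A field-name pattern without '_' fails on "utc_time"; widening it succeeds.
object StructParseSketch {
  // narrowPattern mirrors the suspected bug; broadPattern the likely fix.
  private val narrowPattern = "([a-zA-Z0-9]+):([a-zA-Z0-9]+)".r
  private val broadPattern  = "([a-zA-Z0-9_]+):([a-zA-Z0-9_]+)".r

  /** Returns the (name, type) pairs, or None if any field fails to parse. */
  def parseFields(s: String, allowUnderscore: Boolean): Option[Seq[(String, String)]] = {
    val inner = s.stripPrefix("struct<").stripSuffix(">")
    val pat = if (allowUnderscore) broadPattern else narrowPattern
    val parsed = inner.split(",").toSeq.map {
      case pat(name, tpe) => Some(name -> tpe)
      case _              => None  // real code raises "Unsupported dataType" here
    }
    if (parsed.forall(_.isDefined)) Some(parsed.flatten) else None
  }
}
```

With the narrow pattern, "struct<utc_time:string,pkg:string>" fails to parse, matching the reported behavior for columns whose names contain underscores.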
[jira] [Updated] (SPARK-1939) Improve takeSample method in RDD
[ https://issues.apache.org/jira/browse/SPARK-1939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1939: - Assignee: Doris Xin > Improve takeSample method in RDD > > > Key: SPARK-1939 > URL: https://issues.apache.org/jira/browse/SPARK-1939 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Doris Xin >Assignee: Doris Xin > Labels: newbie > > reimplement takeSample with the ScaSRS algorithm -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1940) Enable rolling of executor logs (stdout / stderr)
[ https://issues.apache.org/jira/browse/SPARK-1940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14010413#comment-14010413 ] Tathagata Das commented on SPARK-1940: -- This is addressed by https://github.com/apache/spark/pull/895, which implements a simple RollingFileAppender within Spark (disabled by default). When enabled (using the configuration parameter spark.executor.rollingLogs.enabled), the logs get rolled over by time interval (set with spark.executor.rollingLogs.cleanupTtl, daily by default). Old logs (older than 7 days, by default) are deleted automatically. The web UI can show the logs across the rolled-over files. > Enable rolling of executor logs (stdout / stderr) > - > > Key: SPARK-1940 > URL: https://issues.apache.org/jira/browse/SPARK-1940 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Tathagata Das > > Currently, in the default log4j configuration, all the executor logs get sent > to the file [executor-working-dir]/stderr. This does not allow log > files to be rolled, so old logs cannot be removed. > Using log4j RollingFileAppender allows log4j logs to be rolled, but all the > logs get sent to a different set of files, other than the files > stdout and stderr. So the logs are not visible in > the Spark web UI any more, as the Spark web UI only reads the files > stdout and stderr. Furthermore, it still does not > allow stdout and stderr to be cleared periodically in case a large amount > of output gets written to them (e.g. by explicit println inside a map function). > Solving this requires rolling the logs in such a way that the Spark web UI is > aware of it and can retrieve the logs across the rolled-over files. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (SPARK-1940) Enable rolling of executor logs (stdout / stderr)
[ https://issues.apache.org/jira/browse/SPARK-1940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das reassigned SPARK-1940: Assignee: Tathagata Das > Enable rolling of executor logs (stdout / stderr) > - > > Key: SPARK-1940 > URL: https://issues.apache.org/jira/browse/SPARK-1940 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Tathagata Das >Assignee: Tathagata Das > > Currently, in the default log4j configuration, all the executor logs get sent > to the file [executor-working-dir]/stderr. This does not allow log > files to be rolled, so old logs cannot be removed. > Using log4j RollingFileAppender allows log4j logs to be rolled, but all the > logs get sent to a different set of files, other than the files > stdout and stderr. So the logs are not visible in > the Spark web UI any more, as the Spark web UI only reads the files > stdout and stderr. Furthermore, it still does not > allow stdout and stderr to be cleared periodically in case a large amount > of output gets written to them (e.g. by explicit println inside a map function). > Solving this requires rolling the logs in such a way that the Spark web UI is > aware of it and can retrieve the logs across the rolled-over files. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1518) Spark master doesn't compile against hadoop-common trunk
[ https://issues.apache.org/jira/browse/SPARK-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14010395#comment-14010395 ] Patrick Wendell commented on SPARK-1518: bq. Thanks, Patrick. This is useful info... I didn't realize there was still interest in running Spark against CDH3. Certainly we'll never support it directly, since CDH3 was end-of-lifed last year. So we don't really support doing anything on CDH3... except upgrading it to CDH4 (or hopefully, 5, which has Spark). Yeah, the upstream project tries pretty hard to maintain compatibility with as many Hadoop versions as possible since we see many people running even fairly old ones in the wild. Of course, I'm sure Cloudera will commercially support only the most recent ones. bq. Re: versioning - I was just asking whether this changes the oldest version we are compatible with. That's a question we should ask of all Hadoop patches. I just tested Spark and it doesn't compile against 0.20.X, so this is a no-op in terms of compatibility anyways. It won't make the 1.0.0 window, but we should get it into a 1.0.X release. My concern is that once newer Hadoop versions come out, I want people to be able to compile Spark against them. Again, since Spark is distributed independently from HDFS, this happens a lot: people try to compile older Spark releases against newer Hadoop releases. > Spark master doesn't compile against hadoop-common trunk > > > Key: SPARK-1518 > URL: https://issues.apache.org/jira/browse/SPARK-1518 > Project: Spark > Issue Type: Bug >Reporter: Marcelo Vanzin >Assignee: Colin Patrick McCabe >Priority: Critical > > FSDataOutputStream::sync() has disappeared from trunk in Hadoop; > FileLogger.scala is calling it. > I've changed it locally to hsync() so I can compile the code, but haven't > checked yet whether those are equivalent. hsync() seems to have been there > forever, so it hopefully works with all versions Spark cares about. 
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1940) Enable rolling of executor logs (stdout / stderr)
Tathagata Das created SPARK-1940: Summary: Enable rolling of executor logs (stdout / stderr) Key: SPARK-1940 URL: https://issues.apache.org/jira/browse/SPARK-1940 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Tathagata Das Currently, in the default log4j configuration, all the executor logs get sent to the file [executor-working-dir]/stderr. This does not allow log files to be rolled, so old logs cannot be removed. Using log4j RollingFileAppender allows log4j logs to be rolled, but all the logs get sent to a different set of files, other than the files stdout and stderr. So the logs are not visible in the Spark web UI any more, as the Spark web UI only reads the files stdout and stderr. Furthermore, it still does not allow stdout and stderr to be cleared periodically in case a large amount of output gets written to them (e.g. by explicit println inside a map function). Solving this requires rolling the logs in such a way that the Spark web UI is aware of it and can retrieve the logs across the rolled-over files. -- This message was sent by Atlassian JIRA (v6.2#6252)
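The rolling scheme described above (roll the active log on a time interval, prune files older than a retention window) can be sketched in a few lines. This is a hedged, minimal illustration; {{TimeRollingSketch}} and its parameters are invented for this example and are not Spark's actual RollingFileAppender, which also has to cooperate with the web UI.

```scala
import java.io.{File, FileWriter}

// Minimal illustrative time-based log roller: roll the active file when the
// interval elapses, and delete rolled files older than the retention window.
class TimeRollingSketch(dir: File, activeName: String,
                        rollIntervalMs: Long, retentionMs: Long,
                        clock: () => Long = () => System.currentTimeMillis()) {
  dir.mkdirs()
  private var lastRoll = clock()

  def append(line: String): Unit = {
    val now = clock()
    if (now - lastRoll >= rollIntervalMs) rollOver(now)
    val w = new FileWriter(new File(dir, activeName), true)
    try w.write(line + "\n") finally w.close()
  }

  private def rollOver(now: Long): Unit = {
    val active = new File(dir, activeName)
    // Rename the active file with its roll timestamp, e.g. "stderr.12345".
    if (active.exists()) active.renameTo(new File(dir, s"$activeName.$lastRoll"))
    lastRoll = now
    // Prune rolled files older than the retention window (7 days in the PR).
    dir.listFiles().filter { f =>
      f.getName.startsWith(activeName + ".") &&
        now - f.getName.stripPrefix(activeName + ".").toLong > retentionMs
    }.foreach(_.delete())
  }
}
```

Because the rolled files keep a predictable naming scheme next to the active file, a UI that knows the scheme can list and serve them in order, which is the part the JIRA says log4j's own RollingFileAppender cannot provide.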
[jira] [Updated] (SPARK-1914) Simplify CountFunction not to traverse to evaluate all child expressions.
[ https://issues.apache.org/jira/browse/SPARK-1914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-1914: --- Fix Version/s: 1.1.0 > Simplify CountFunction not to traverse to evaluate all child expressions. > - > > Key: SPARK-1914 > URL: https://issues.apache.org/jira/browse/SPARK-1914 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin > Fix For: 1.1.0, 1.0.1 > > > {{CountFunction}} should count up only if the child's evaluated value is not > null. > Because it traverses and evaluates all child expressions, it counts up even when > the child's value is null, as long as any of the children is not null. > To reproduce this bug in {{sbt hive/console}}: > {code} > scala> hql("SELECT COUNT(*) FROM src1").collect() > res1: Array[org.apache.spark.sql.Row] = Array([25]) > scala> hql("SELECT COUNT(*) FROM src1 WHERE key IS NULL").collect() > res2: Array[org.apache.spark.sql.Row] = Array([10]) > scala> hql("SELECT COUNT(key + 1) FROM src1").collect() > res3: Array[org.apache.spark.sql.Row] = Array([25]) > {code} > {{res3}} should be 15 since there are 10 null keys. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1914) Simplify CountFunction not to traverse to evaluate all child expressions.
[ https://issues.apache.org/jira/browse/SPARK-1914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-1914: --- Fix Version/s: (was: 1.0.0) 1.0.1 > Simplify CountFunction not to traverse to evaluate all child expressions. > - > > Key: SPARK-1914 > URL: https://issues.apache.org/jira/browse/SPARK-1914 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin > Fix For: 1.0.1 > > > {{CountFunction}} should count up only if the child's evaluated value is not > null. > Because it traverses and evaluates all child expressions, it counts up even when > the child's value is null, as long as any of the children is not null. > To reproduce this bug in {{sbt hive/console}}: > {code} > scala> hql("SELECT COUNT(*) FROM src1").collect() > res1: Array[org.apache.spark.sql.Row] = Array([25]) > scala> hql("SELECT COUNT(*) FROM src1 WHERE key IS NULL").collect() > res2: Array[org.apache.spark.sql.Row] = Array([10]) > scala> hql("SELECT COUNT(key + 1) FROM src1").collect() > res3: Array[org.apache.spark.sql.Row] = Array([25]) > {code} > {{res3}} should be 15 since there are 10 null keys. -- This message was sent by Atlassian JIRA (v6.2#6252)
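The intended COUNT(expr) semantics behind SPARK-1914 can be shown without Spark SQL's evaluation machinery. This is a hedged sketch (plain Scala collections, not the actual {{CountFunction}} code): the counter must increment only when the expression evaluates to a non-null value, modeled here with {{Option}}.

```scala
// Rows of src1 where `key` may be null, modeled as Option[Int].
val keys: Seq[Option[Int]] = Seq(Some(1), None, Some(3), None, Some(5))

// Buggy behavior described above: count every row, regardless of nullness,
// as long as evaluation of the children happened at all.
val buggy = keys.size                 // counts all 5 rows

// Intended behavior: COUNT(key + 1) counts only rows where key is not null,
// because null + 1 evaluates to null in SQL.
val correct = keys.count(_.isDefined) // counts 3 rows
```

In the JIRA's repro, src1 has 25 rows of which 10 have null keys, so COUNT(key + 1) should return 15, exactly the `correct`-style count rather than the `buggy` one.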
[jira] [Created] (SPARK-1939) Improve takeSample method in RDD
Doris Xin created SPARK-1939: Summary: Improve takeSample method in RDD Key: SPARK-1939 URL: https://issues.apache.org/jira/browse/SPARK-1939 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Doris Xin reimplement takeSample with the ScaSRS algorithm -- This message was sent by Atlassian JIRA (v6.2#6252)
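The ScaSRS idea referenced above (scalable simple random sampling) can be sketched in one pass: tag every item with a uniform random key, keep items whose key falls below an acceptance threshold slightly larger than k/n so that at least k items survive with high probability, then trim to exactly k. This is a hedged illustration; the slack term in the threshold is an assumption for the example, not the bound derived in the ScaSRS paper or the eventual Spark implementation.

```scala
import scala.util.Random

// Hedged one-pass sketch of ScaSRS-style exact sampling of k items from n.
def sampleExact[T](items: Seq[T], k: Int, seed: Long): Seq[T] = {
  val n = items.size
  require(k >= 0 && k <= n)
  val rng = new Random(seed)
  val p = k.toDouble / n
  // Inflate the acceptance probability so that >= k items survive w.h.p.
  // (illustrative slack term; the real algorithm derives a tighter bound).
  val q = math.min(1.0, p + 5.0 * math.sqrt(p / n) + 10.0 / n)
  val accepted = items.map(x => (rng.nextDouble(), x)).filter(_._1 < q)
  if (accepted.size >= k) accepted.sortBy(_._1).take(k).map(_._2)
  else rng.shuffle(items).take(k) // rare shortfall: fall back to a second pass
}
```

The appeal over the naive approach is that the filter step is embarrassingly parallel across partitions; only the small accepted set needs to be collected and trimmed.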
[jira] [Commented] (SPARK-1518) Spark master doesn't compile against hadoop-common trunk
[ https://issues.apache.org/jira/browse/SPARK-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14010364#comment-14010364 ] Colin Patrick McCabe commented on SPARK-1518: - bq. It would be good to list the oldest upstream Hadoop version we support. Many people still run Spark against Hadoop 1.X/CDH3 variants. We get high download rates for those pre-built packages. I think this is a little different than e.g. CDH where people upgrade all components at the same time... people download newer versions of Spark and run it with old filesystems very often. Thanks, Patrick. This is useful info... I didn't realize there was still interest in running Spark against CDH3. Certainly we'll never support it directly, since CDH3 was end-of-lifed last year. So we don't really support doing anything on CDH3... except upgrading it to CDH4 (or hopefully, 5, which has Spark) :) bq. Re: versioning - I was just asking whether this changes the oldest version we are compatible with. That's a question we should ask of all Hadoop patches. I just tested Spark and it doesn't compile against 0.20.X, so this is a no-op in terms of compatibility anyways. It sounds like we're good to go on replacing the {{flush}} with {{hsync}} then? I notice you marked this as "critical" recently; do you think it's important for 1.0? > Spark master doesn't compile against hadoop-common trunk > > > Key: SPARK-1518 > URL: https://issues.apache.org/jira/browse/SPARK-1518 > Project: Spark > Issue Type: Bug >Reporter: Marcelo Vanzin >Assignee: Colin Patrick McCabe >Priority: Critical > > FSDataOutputStream::sync() has disappeared from trunk in Hadoop; > FileLogger.scala is calling it. > I've changed it locally to hsync() so I can compile the code, but haven't > checked yet whether those are equivalent. hsync() seems to have been there > forever, so it hopefully works with all versions Spark cares about. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-1926) Nullability of Max/Min/First should be true.
[ https://issues.apache.org/jira/browse/SPARK-1926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-1926. Resolution: Fixed Fix Version/s: 1.0.1 1.1.0 Assignee: Takuya Ueshin > Nullability of Max/Min/First should be true. > > > Key: SPARK-1926 > URL: https://issues.apache.org/jira/browse/SPARK-1926 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin > Fix For: 1.1.0, 1.0.1 > > > Nullability of {{Max}}/{{Min}}/{{First}} should be {{true}} because they > return {{null}} if there are no rows. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-1915) AverageFunction should not count if the evaluated value is null.
[ https://issues.apache.org/jira/browse/SPARK-1915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-1915. Resolution: Fixed Fix Version/s: 1.0.1 1.1.0 Assignee: Takuya Ueshin > AverageFunction should not count if the evaluated value is null. > > > Key: SPARK-1915 > URL: https://issues.apache.org/jira/browse/SPARK-1915 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin > Fix For: 1.1.0, 1.0.1 > > > Average values differ depending on whether the calculation is done partially > or not, because {{AverageFunction}} (in the non-partial calculation) counts > even if the evaluated value is null. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-983) Support external sorting for RDD#sortByKey()
[ https://issues.apache.org/jira/browse/SPARK-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14010190#comment-14010190 ] Aaron Davidson edited comment on SPARK-983 at 5/27/14 9:54 PM: --- Does sound reasonable. For some reason it does not allow me to assign the issue to you, though. Edit: Figured it out, thanks [~pwendell]! was (Author: ilikerps): Does sound reasonable. For some reason it does not allow me to assign the issue to you, though. > Support external sorting for RDD#sortByKey() > > > Key: SPARK-983 > URL: https://issues.apache.org/jira/browse/SPARK-983 > Project: Spark > Issue Type: New Feature >Affects Versions: 0.9.0 >Reporter: Reynold Xin >Assignee: Madhu Siddalingaiah > > Currently, RDD#sortByKey() is implemented by a mapPartitions which creates a > buffer to hold the entire partition, then sorts it. This will cause an OOM if > an entire partition cannot fit in memory, which is especially problematic for > skewed data. Rather than OOMing, the behavior should be similar to the > [ExternalAppendOnlyMap|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/collection/ExternalAppendOnlyMap.scala], > where we fallback to disk if we detect memory pressure. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-983) Support external sorting for RDD#sortByKey()
[ https://issues.apache.org/jira/browse/SPARK-983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Davidson updated SPARK-983: - Assignee: Madhu Siddalingaiah > Support external sorting for RDD#sortByKey() > > > Key: SPARK-983 > URL: https://issues.apache.org/jira/browse/SPARK-983 > Project: Spark > Issue Type: New Feature >Affects Versions: 0.9.0 >Reporter: Reynold Xin >Assignee: Madhu Siddalingaiah > > Currently, RDD#sortByKey() is implemented by a mapPartitions which creates a > buffer to hold the entire partition, then sorts it. This will cause an OOM if > an entire partition cannot fit in memory, which is especially problematic for > skewed data. Rather than OOMing, the behavior should be similar to the > [ExternalAppendOnlyMap|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/collection/ExternalAppendOnlyMap.scala], > where we fallback to disk if we detect memory pressure. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-1931) Graph.partitionBy does not reconstruct routing tables
[ https://issues.apache.org/jira/browse/SPARK-1931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14010340#comment-14010340 ] Ankur Dave edited comment on SPARK-1931 at 5/27/14 9:50 PM: Since the fix didn't make it into Spark 1.0.0, a workaround is to partition the edges before constructing the graph, as follows: {code} import org.apache.spark.HashPartitioner import org.apache.spark.rdd.RDD import org.apache.spark.graphx._ def partitionBy[ED](edges: RDD[Edge[ED]], partitionStrategy: PartitionStrategy): RDD[Edge[ED]] = { val numPartitions = edges.partitions.size edges.map(e => (partitionStrategy.getPartition(e.srcId, e.dstId, numPartitions), e)) .partitionBy(new HashPartitioner(numPartitions)) .mapPartitions(_.map(_._2), preservesPartitioning = true) } val g = Graph( sc.parallelize(List((0L, "a"), (1L, "b"), (2L, "c"))), sc.parallelize(List(Edge(0L, 1L, 1), Edge(0L, 2L, 1)), 2)) assert(g.triplets.collect.map(_.toTuple).toSet == Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1))) val gPart = Graph( g.vertices, partitionBy(g.edges, PartitionStrategy.EdgePartition2D)) assert(gPart.triplets.collect.map(_.toTuple).toSet == Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1))) {code} was (Author: ankurd): Since the fix didn't make it into Spark 1.0.0, a workaround is to partition the edges before constructing the graph, as follows: {code} import org.apache.spark.HashPartitioner import org.apache.spark.rdd.RDD import org.apache.spark.graphx._ def partitionBy[ED](edges: RDD[Edge[ED]], partitionStrategy: PartitionStrategy): RDD[Edge[ED]] = { val numPartitions = edges.partitions.size edges.map(e => (partitionStrategy.getPartition(e.srcId, e.dstId, numPartitions), e)) .partitionBy(new HashPartitioner(numPartitions)) .mapPartitions(_.map(_._2), preservesPartitioning = true) } val g = Graph( sc.parallelize(List((0L, "a"), (1L, "b"), (2L, "c"))), sc.parallelize(List(Edge(0L, 1L, 1), Edge(0L, 2L, 1)), 2)) 
assert(g.triplets.collect.map(_.toTuple).toSet == Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1))) val gPart = Graph( sc.parallelize(List((0L, "a"), (1L, "b"), (2L, "c"))), partitionBy(sc.parallelize(List(Edge(0L, 1L, 1), Edge(0L, 2L, 1)), 2), PartitionStrategy.EdgePartition2D)) assert(gPart.triplets.collect.map(_.toTuple).toSet == Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1))) {code} > Graph.partitionBy does not reconstruct routing tables > - > > Key: SPARK-1931 > URL: https://issues.apache.org/jira/browse/SPARK-1931 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 1.0.0 >Reporter: Ankur Dave >Assignee: Ankur Dave > Fix For: 1.0.1 > > > Commit 905173df57b90f90ebafb22e43f55164445330e6 introduced a bug in > partitionBy where, after repartitioning the edges, it reuses the VertexRDD > without updating the routing tables to reflect the new edge layout. This > causes the following test to fail: > {code} > import org.apache.spark.graphx._ > val g = Graph( > sc.parallelize(List((0L, "a"), (1L, "b"), (2L, "c"))), > sc.parallelize(List(Edge(0L, 1L, 1), Edge(0L, 2L, 1)), 2)) > assert(g.triplets.collect.map(_.toTuple).toSet == > Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1))) > val gPart = g.partitionBy(PartitionStrategy.EdgePartition2D) > assert(gPart.triplets.collect.map(_.toTuple).toSet == > Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1))) > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-1931) Graph.partitionBy does not reconstruct routing tables
[ https://issues.apache.org/jira/browse/SPARK-1931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14010340#comment-14010340 ] Ankur Dave edited comment on SPARK-1931 at 5/27/14 9:47 PM: Since the fix didn't make it into Spark 1.0.0, a workaround is to partition the edges before constructing the graph, as follows: {code} import org.apache.spark.HashPartitioner import org.apache.spark.rdd.RDD import org.apache.spark.graphx._ def partitionBy[ED](edges: RDD[Edge[ED]], partitionStrategy: PartitionStrategy): RDD[Edge[ED]] = { val numPartitions = edges.partitions.size edges.map(e => (partitionStrategy.getPartition(e.srcId, e.dstId, numPartitions), e)) .partitionBy(new HashPartitioner(numPartitions)) .mapPartitions(_.map(_._2), preservesPartitioning = true) } val g = Graph( sc.parallelize(List((0L, "a"), (1L, "b"), (2L, "c"))), sc.parallelize(List(Edge(0L, 1L, 1), Edge(0L, 2L, 1)), 2)) assert(g.triplets.collect.map(_.toTuple).toSet == Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1))) val gPart = Graph( sc.parallelize(List((0L, "a"), (1L, "b"), (2L, "c"))), partitionBy(sc.parallelize(List(Edge(0L, 1L, 1), Edge(0L, 2L, 1)), 2), PartitionStrategy.EdgePartition2D)) assert(gPart.triplets.collect.map(_.toTuple).toSet == Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1))) {code} was (Author: ankurd): Since the fix didn't make it into Spark 1.0.0, a workaround is to partition the edges before constructing the graph as follows: {code} import org.apache.spark.HashPartitioner import org.apache.spark.rdd.RDD import org.apache.spark.graphx._ def partitionBy[ED](edges: RDD[Edge[ED]], partitionStrategy: PartitionStrategy): RDD[Edge[ED]] = { val numPartitions = edges.partitions.size edges.map(e => (partitionStrategy.getPartition(e.srcId, e.dstId, numPartitions), e)) .partitionBy(new HashPartitioner(numPartitions)) .mapPartitions(_.map(_._2), preservesPartitioning = true) } val g = Graph( sc.parallelize(List((0L, "a"), (1L, "b"), (2L, 
"c"))), sc.parallelize(List(Edge(0L, 1L, 1), Edge(0L, 2L, 1)), 2)) assert(g.triplets.collect.map(_.toTuple).toSet == Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1))) val gPart = Graph( sc.parallelize(List((0L, "a"), (1L, "b"), (2L, "c"))), partitionBy(sc.parallelize(List(Edge(0L, 1L, 1), Edge(0L, 2L, 1)), 2), PartitionStrategy.EdgePartition2D)) assert(gPart.triplets.collect.map(_.toTuple).toSet == Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1))) {code} > Graph.partitionBy does not reconstruct routing tables > - > > Key: SPARK-1931 > URL: https://issues.apache.org/jira/browse/SPARK-1931 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 1.0.0 >Reporter: Ankur Dave >Assignee: Ankur Dave > Fix For: 1.0.1 > > > Commit 905173df57b90f90ebafb22e43f55164445330e6 introduced a bug in > partitionBy where, after repartitioning the edges, it reuses the VertexRDD > without updating the routing tables to reflect the new edge layout. This > causes the following test to fail: > {code} > import org.apache.spark.graphx._ > val g = Graph( > sc.parallelize(List((0L, "a"), (1L, "b"), (2L, "c"))), > sc.parallelize(List(Edge(0L, 1L, 1), Edge(0L, 2L, 1)), 2)) > assert(g.triplets.collect.map(_.toTuple).toSet == > Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1))) > val gPart = g.partitionBy(PartitionStrategy.EdgePartition2D) > assert(gPart.triplets.collect.map(_.toTuple).toSet == > Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1))) > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1518) Spark master doesn't compile against hadoop-common trunk
[ https://issues.apache.org/jira/browse/SPARK-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14010339#comment-14010339 ] Patrick Wendell commented on SPARK-1518: Re: versioning - I was just asking whether this changes the oldest version we are compatible with. That's a question we should ask of all Hadoop patches. I just tested Spark and it doesn't compile against 0.20.X, so this is a no-op in terms of compatibility anyways. It would be good to list the oldest upstream Hadoop version we support. Many people still run Spark against Hadoop 1.X/CDH3 variants. We get high download rates for those pre-built packages. I think this is a little different than e.g. CDH where people upgrade all components at the same time... people download newer versions of Spark and run it with old filesystems very often. > Spark master doesn't compile against hadoop-common trunk > > > Key: SPARK-1518 > URL: https://issues.apache.org/jira/browse/SPARK-1518 > Project: Spark > Issue Type: Bug >Reporter: Marcelo Vanzin >Assignee: Colin Patrick McCabe >Priority: Critical > > FSDataOutputStream::sync() has disappeared from trunk in Hadoop; > FileLogger.scala is calling it. > I've changed it locally to hsync() so I can compile the code, but haven't > checked yet whether those are equivalent. hsync() seems to have been there > forever, so it hopefully works with all versions Spark cares about. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1931) Graph.partitionBy does not reconstruct routing tables
[ https://issues.apache.org/jira/browse/SPARK-1931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14010340#comment-14010340 ] Ankur Dave commented on SPARK-1931: --- Since the fix didn't make it into Spark 1.0.0, a workaround is to partition the edges before constructing the graph as follows: {code} import org.apache.spark.HashPartitioner import org.apache.spark.rdd.RDD import org.apache.spark.graphx._ def partitionBy[ED](edges: RDD[Edge[ED]], partitionStrategy: PartitionStrategy): RDD[Edge[ED]] = { val numPartitions = edges.partitions.size edges.map(e => (partitionStrategy.getPartition(e.srcId, e.dstId, numPartitions), e)) .partitionBy(new HashPartitioner(numPartitions)) .mapPartitions(_.map(_._2), preservesPartitioning = true) } val g = Graph( sc.parallelize(List((0L, "a"), (1L, "b"), (2L, "c"))), sc.parallelize(List(Edge(0L, 1L, 1), Edge(0L, 2L, 1)), 2)) assert(g.triplets.collect.map(_.toTuple).toSet == Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1))) val gPart = Graph( sc.parallelize(List((0L, "a"), (1L, "b"), (2L, "c"))), partitionBy(sc.parallelize(List(Edge(0L, 1L, 1), Edge(0L, 2L, 1)), 2), PartitionStrategy.EdgePartition2D)) assert(gPart.triplets.collect.map(_.toTuple).toSet == Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1))) {code} > Graph.partitionBy does not reconstruct routing tables > - > > Key: SPARK-1931 > URL: https://issues.apache.org/jira/browse/SPARK-1931 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 1.0.0 >Reporter: Ankur Dave >Assignee: Ankur Dave > Fix For: 1.0.1 > > > Commit 905173df57b90f90ebafb22e43f55164445330e6 introduced a bug in > partitionBy where, after repartitioning the edges, it reuses the VertexRDD > without updating the routing tables to reflect the new edge layout. 
This > causes the following test to fail: > {code} > import org.apache.spark.graphx._ > val g = Graph( > sc.parallelize(List((0L, "a"), (1L, "b"), (2L, "c"))), > sc.parallelize(List(Edge(0L, 1L, 1), Edge(0L, 2L, 1)), 2)) > assert(g.triplets.collect.map(_.toTuple).toSet == > Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1))) > val gPart = g.partitionBy(PartitionStrategy.EdgePartition2D) > assert(gPart.triplets.collect.map(_.toTuple).toSet == > Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1))) > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1518) Spark master doesn't compile against hadoop-common trunk
[ https://issues.apache.org/jira/browse/SPARK-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14010322#comment-14010322 ] Sean Owen commented on SPARK-1518: -- RE: Hadoop versions, in my reckoning of the twisted world of Hadoop versions, the 0.23.x branch is still active and so is kind of later than 1.0.x. It may be easier to retain 0.23 compatibility than 1.0.x for example. > Spark master doesn't compile against hadoop-common trunk > > > Key: SPARK-1518 > URL: https://issues.apache.org/jira/browse/SPARK-1518 > Project: Spark > Issue Type: Bug >Reporter: Marcelo Vanzin >Assignee: Colin Patrick McCabe >Priority: Critical > > FSDataOutputStream::sync() has disappeared from trunk in Hadoop; > FileLogger.scala is calling it. > I've changed it locally to hsync() so I can compile the code, but haven't > checked yet whether those are equivalent. hsync() seems to have been there > forever, so it hopefully works with all versions Spark cares about. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1518) Spark master doesn't compile against hadoop-common trunk
[ https://issues.apache.org/jira/browse/SPARK-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14010305#comment-14010305 ] Marcelo Vanzin commented on SPARK-1518: --- Hmm, may I suggest a different approach? Andrew, who wrote the code, might have more info. But from my understanding, the flushes were needed because the history server might read logs from applications that were not yet finished. So the flush was a best-effort to avoid having the HS read files that contained partial JSON objects (and fail to parse them). But since then the HS was changed to only read logs from finished applications. I think it's safe to assume that finished applications are not writing to the event log anymore, so the above scenario doesn't exist. So could we just get rid of the explicit flush instead? > Spark master doesn't compile against hadoop-common trunk > > > Key: SPARK-1518 > URL: https://issues.apache.org/jira/browse/SPARK-1518 > Project: Spark > Issue Type: Bug >Reporter: Marcelo Vanzin >Assignee: Colin Patrick McCabe >Priority: Critical > > FSDataOutputStream::sync() has disappeared from trunk in Hadoop; > FileLogger.scala is calling it. > I've changed it locally to hsync() so I can compile the code, but haven't > checked yet whether those are equivalent. hsync() seems to have been there > forever, so it hopefully works with all versions Spark cares about. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1518) Spark master doesn't compile against hadoop-common trunk
[ https://issues.apache.org/jira/browse/SPARK-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14010296#comment-14010296 ] Colin Patrick McCabe commented on SPARK-1518: - I wonder if we should have a discussion on the mailing list about the oldest version of Hadoop we should support. I would argue that it should be 0.23. Yahoo! is still using that version. Perhaps other people have more information than I do, though. If we decide to support 0.20, I will create a patch that does this using reflection. But I'd rather get your opinions on whether that makes sense first. > Spark master doesn't compile against hadoop-common trunk > > > Key: SPARK-1518 > URL: https://issues.apache.org/jira/browse/SPARK-1518 > Project: Spark > Issue Type: Bug >Reporter: Marcelo Vanzin >Assignee: Colin Patrick McCabe >Priority: Critical > > FSDataOutputStream::sync() has disappeared from trunk in Hadoop; > FileLogger.scala is calling it. > I've changed it locally to hsync() so I can compile the code, but haven't > checked yet whether those are equivalent. hsync() seems to have been there > forever, so it hopefully works with all versions Spark cares about. -- This message was sent by Atlassian JIRA (v6.2#6252)
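If 0.20 support were ever needed, the reflection approach mentioned above could look roughly like this. This is a pure-Scala sketch, not the actual patch: the `ModernStream`/`LegacyStream` classes are hypothetical stand-ins for Hadoop's `FSDataOutputStream` before and after the API change, and only the method names `hsync`/`sync` are taken from Hadoop's API.

```scala
// Sketch: prefer hsync() when the class provides it, fall back to sync(),
// resolved via reflection so the same code compiles against any version.
// ModernStream/LegacyStream are hypothetical stand-ins for
// FSDataOutputStream with and without hsync().
class ModernStream { def hsync(): String = "hsync" }
class LegacyStream { def sync(): String = "sync" }

object SyncCompat {
  def flush(stream: AnyRef): String = {
    val cls = stream.getClass
    val method =
      try cls.getMethod("hsync")
      catch { case _: NoSuchMethodException => cls.getMethod("sync") }
    method.invoke(stream).asInstanceOf[String]
  }
}
```

Looking up the method once per class (rather than per call) would matter in a hot path, but for an event-log flush the simple lookup above is likely fine.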
[jira] [Commented] (SPARK-1518) Spark master doesn't compile against hadoop-common trunk
[ https://issues.apache.org/jira/browse/SPARK-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14010294#comment-14010294 ] Sean Owen commented on SPARK-1518: -- 0.20.x stopped in early 2010. It is ancient. > Spark master doesn't compile against hadoop-common trunk > > > Key: SPARK-1518 > URL: https://issues.apache.org/jira/browse/SPARK-1518 > Project: Spark > Issue Type: Bug >Reporter: Marcelo Vanzin >Assignee: Colin Patrick McCabe >Priority: Critical > > FSDataOutputStream::sync() has disappeared from trunk in Hadoop; > FileLogger.scala is calling it. > I've changed it locally to hsync() so I can compile the code, but haven't > checked yet whether those are equivalent. hsync() seems to have been there > forever, so it hopefully works with all versions Spark cares about. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1518) Spark master doesn't compile against hadoop-common trunk
[ https://issues.apache.org/jira/browse/SPARK-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14010284#comment-14010284 ] Colin Patrick McCabe commented on SPARK-1518: - I think it's very, very, very unlikely that anyone will want to run Hadoop 0.20 against Spark. Why don't we: * fix the compile against Hadoop trunk * wait for someone to show up who wants compatibility with hadoop 0.20 before we work on it? It seems like if there is interest in Spark on Hadoop 0.20, there will quickly be a patch submitted to get it compiling there. If there is no such interest, then we'll be done here without doing a lot of work up front. If you agree then I'll create a pull req. I have verified that fixing the flush thing un-breaks the compile on master. > Spark master doesn't compile against hadoop-common trunk > > > Key: SPARK-1518 > URL: https://issues.apache.org/jira/browse/SPARK-1518 > Project: Spark > Issue Type: Bug >Reporter: Marcelo Vanzin >Assignee: Colin Patrick McCabe >Priority: Critical > > FSDataOutputStream::sync() has disappeared from trunk in Hadoop; > FileLogger.scala is calling it. > I've changed it locally to hsync() so I can compile the code, but haven't > checked yet whether those are equivalent. hsync() seems to have been there > forever, so it hopefully works with all versions Spark cares about. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1931) Graph.partitionBy does not reconstruct routing tables
[ https://issues.apache.org/jira/browse/SPARK-1931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1931: --- Fix Version/s: (was: 1.0.0) 1.0.1 > Graph.partitionBy does not reconstruct routing tables > - > > Key: SPARK-1931 > URL: https://issues.apache.org/jira/browse/SPARK-1931 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 1.0.0 >Reporter: Ankur Dave >Assignee: Ankur Dave > Fix For: 1.0.1 > > > Commit 905173df57b90f90ebafb22e43f55164445330e6 introduced a bug in > partitionBy where, after repartitioning the edges, it reuses the VertexRDD > without updating the routing tables to reflect the new edge layout. This > causes the following test to fail: > {code} > import org.apache.spark.graphx._ > val g = Graph( > sc.parallelize(List((0L, "a"), (1L, "b"), (2L, "c"))), > sc.parallelize(List(Edge(0L, 1L, 1), Edge(0L, 2L, 1)), 2)) > assert(g.triplets.collect.map(_.toTuple).toSet == > Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1))) > val gPart = g.partitionBy(PartitionStrategy.EdgePartition2D) > assert(gPart.triplets.collect.map(_.toTuple).toSet == > Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1))) > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1518) Spark master doesn't compile against hadoop-common trunk
[ https://issues.apache.org/jira/browse/SPARK-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1518: --- Priority: Critical (was: Major) > Spark master doesn't compile against hadoop-common trunk > > > Key: SPARK-1518 > URL: https://issues.apache.org/jira/browse/SPARK-1518 > Project: Spark > Issue Type: Bug >Reporter: Marcelo Vanzin >Assignee: Colin Patrick McCabe >Priority: Critical > > FSDataOutputStream::sync() has disappeared from trunk in Hadoop; > FileLogger.scala is calling it. > I've changed it locally to hsync() so I can compile the code, but haven't > checked yet whether those are equivalent. hsync() seems to have been there > forever, so it hopefully works with all versions Spark cares about. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1930) The Container is running beyond physical memory limits, so as to be killed.
[ https://issues.apache.org/jira/browse/SPARK-1930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14010258#comment-14010258 ] Patrick Wendell commented on SPARK-1930: PySpark might also be affected by this, because it launches Python VMs that consume memory. > The Container is running beyond physical memory limits, so as to be killed. > --- > > Key: SPARK-1930 > URL: https://issues.apache.org/jira/browse/SPARK-1930 > Project: Spark > Issue Type: Bug > Components: YARN >Reporter: Guoqiang Li > Fix For: 1.0.1 > > > When the containers occupy 8G of memory, they are killed. > YARN node manager log: > {code} > 2014-05-23 13:35:30,776 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: > Container [pid=4947,containerID=container_1400809535638_0015_01_05] is > running beyond physical memory limits. Current usage: 8.6 GB of 8.5 GB > physical memory used; 10.0 GB of 17.8 GB virtual memory used. Killing > container. 
> Dump of the process-tree for container_1400809535638_0015_01_05 : > |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) > SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE > |- 4947 25417 4947 4947 (bash) 0 0 110804992 335 /bin/bash -c > /usr/java/jdk1.7.0_45-cloudera/bin/java -server -XX:OnOutOfMemoryError='kill > %p' -Xms8192m -Xmx8192m -Xss2m > -Djava.io.tmpdir=/yarn/nm/usercache/spark/appcache/application_1400809535638_0015/container_1400809535638_0015_01_05/tmp > -Dlog4j.configuration=log4j-spark-container.properties > -Dspark.akka.askTimeout="120" -Dspark.akka.timeout="120" > -Dspark.akka.frameSize="20" > org.apache.spark.executor.CoarseGrainedExecutorBackend > akka.tcp://sp...@10dian71.domain.test:45477/user/CoarseGrainedScheduler 3 > 10dian72.domain.test 4 1> > /var/log/hadoop-yarn/container/application_1400809535638_0015/container_1400809535638_0015_01_05/stdout > 2> > /var/log/hadoop-yarn/container/application_1400809535638_0015/container_1400809535638_0015_01_05/stderr > > |- 4957 4947 4947 4947 (java) 157809 12620 10667016192 2245522 > /usr/java/jdk1.7.0_45-cloudera/bin/java -server -XX:OnOutOfMemoryError=kill > %p -Xms8192m -Xmx8192m -Xss2m > -Djava.io.tmpdir=/yarn/nm/usercache/spark/appcache/application_1400809535638_0015/container_1400809535638_0015_01_05/tmp > -Dlog4j.configuration=log4j-spark-container.properties > -Dspark.akka.askTimeout=120 -Dspark.akka.timeout=120 > -Dspark.akka.frameSize=20 > org.apache.spark.executor.CoarseGrainedExecutorBackend > akka.tcp://sp...@10dian71.domain.test:45477/user/CoarseGrainedScheduler 3 > 10dian72.domain.test 4 > 2014-05-23 13:35:30,776 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: > Removed ProcessTree with root 4947 > 2014-05-23 13:35:30,776 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: > Container container_1400809535638_0015_01_05 transitioned from RUNNING > to KILLING > 2014-05-23 
13:35:30,777 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: > Cleaning up container container_1400809535638_0015_01_05 > 2014-05-23 13:35:30,788 WARN > org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code > from container container_1400809535638_0015_01_05 is : 143 > 2014-05-23 13:35:30,829 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: > Container container_1400809535638_0015_01_05 transitioned from KILLING > to CONTAINER_CLEANEDUP_AFTER_KILL > 2014-05-23 13:35:30,830 INFO > org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting > absolute path : > /yarn/nm/usercache/spark/appcache/application_1400809535638_0015/container_1400809535638_0015_01_05 > 2014-05-23 13:35:30,830 INFO > org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=spark > OPERATION=Container Finished - Killed TARGET=ContainerImpl > RESULT=SUCCESS APPID=application_1400809535638_0015 > CONTAINERID=container_1400809535638_0015_01_05 > 2014-05-23 13:35:30,830 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: > Container container_1400809535638_0015_01_05 transitioned from > CONTAINER_CLEANEDUP_AFTER_KILL to DONE > 2014-05-23 13:35:30,830 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: > Removing container_1400809535638_0015_01_05 from application > application_1400809535638_0015 > {code} > I think it should be related with {{YarnAllocationHandler.MEMORY_OVERHEA}} > https://github.com/apache/spark/blob/master/yarn
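The arithmetic behind the kill in the log above can be sketched as follows. The flat 384 MB overhead and the 8192 MB heap are taken from the ticket; treat the constant name and default as a reading of the linked allocation code at the time, not a definitive statement of current behavior.

```scala
// Illustrative arithmetic: YARN kills a container whose physical memory
// exceeds (executor heap + overhead). With a flat 384 MB overhead, an
// 8192 MB heap gets an 8576 MB (~8.5 GB) limit -- but JVM off-heap usage
// (thread stacks, metaspace, NIO buffers) can easily exceed 384 MB.
object OverheadSketch {
  val memoryOverheadMb = 384   // flat default discussed in the ticket
  val executorMemoryMb = 8192  // -Xmx8192m from the log above
  def containerLimitMb: Int = executorMemoryMb + memoryOverheadMb

  val observedUsageMb: Int = (8.6 * 1024).toInt  // "8.6 GB" reported by the NM
  def wouldBeKilled: Boolean = observedUsageMb > containerLimitMb
}
```

This is why the reporter argues that, relative to an 8 GB heap, a fixed 384 MB is too small; a percentage-based overhead scales better with large executors.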
[jira] [Commented] (SPARK-1935) Explicitly add commons-codec 1.5 as a dependency
[ https://issues.apache.org/jira/browse/SPARK-1935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14010241#comment-14010241 ] Yin Huai commented on SPARK-1935: - I have updated my PR to use codec 1.5. > Explicitly add commons-codec 1.5 as a dependency > > > Key: SPARK-1935 > URL: https://issues.apache.org/jira/browse/SPARK-1935 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 0.9.1 >Reporter: Yin Huai >Assignee: Yin Huai >Priority: Minor > > Right now, commons-codec is a transitive dependency. When Spark is built by > maven for Hadoop 1, jets3t 0.7.1 will pull in commons-codec 1.3 which is an > older version (Hadoop 1.0.4 depends on 1.4). This older version can cause > problems because 1.4 introduces incompatible changes and new methods. -- This message was sent by Atlassian JIRA (v6.2#6252)
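Making the transitive dependency explicit could look roughly like the Maven fragment below. The version comes from this ticket; the exact placement and scope within Spark's POM are not shown here and would follow the build's existing conventions.

```xml
<!-- Pin commons-codec explicitly so jets3t 0.7.1 cannot pull in the
     older, incompatible 1.3 on Hadoop 1 builds -->
<dependency>
  <groupId>commons-codec</groupId>
  <artifactId>commons-codec</artifactId>
  <version>1.5</version>
</dependency>
```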
[jira] [Updated] (SPARK-1935) Explicitly add commons-codec 1.5 as a dependency
[ https://issues.apache.org/jira/browse/SPARK-1935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-1935: Summary: Explicitly add commons-codec 1.5 as a dependency (was: Explicitly add commons-codec 1.4 as a dependency) > Explicitly add commons-codec 1.5 as a dependency > > > Key: SPARK-1935 > URL: https://issues.apache.org/jira/browse/SPARK-1935 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 0.9.1 >Reporter: Yin Huai >Assignee: Yin Huai >Priority: Minor > > Right now, commons-codec is a transitive dependency. When Spark is built by > maven for Hadoop 1, jets3t 0.7.1 will pull in commons-codec 1.3 which is an > older version (Hadoop 1.0.4 depends on 1.4). This older version can cause > problems because 1.4 introduces incompatible changes and new methods. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1931) Graph.partitionBy does not reconstruct routing tables
[ https://issues.apache.org/jira/browse/SPARK-1931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur Dave updated SPARK-1931: -- Description: Commit 905173df57b90f90ebafb22e43f55164445330e6 introduced a bug in partitionBy where, after repartitioning the edges, it reuses the VertexRDD without updating the routing tables to reflect the new edge layout. This causes the following test to fail: {code} import org.apache.spark.graphx._ val g = Graph( sc.parallelize(List((0L, "a"), (1L, "b"), (2L, "c"))), sc.parallelize(List(Edge(0L, 1L, 1), Edge(0L, 2L, 1)), 2)) assert(g.triplets.collect.map(_.toTuple).toSet == Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1))) val gPart = g.partitionBy(PartitionStrategy.EdgePartition2D) assert(gPart.triplets.collect.map(_.toTuple).toSet == Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1))) {code} was: Commit 905173df57b90f90ebafb22e43f55164445330e6 introduced a bug in partitionBy where, after repartitioning the edges, it reuses the VertexRDD without updating the routing tables to reflect the new edge layout. 
This causes the following test to fail: {code} val g = Graph( sc.parallelize(List((0L, "a"), (1L, "b"), (2L, "c"))), sc.parallelize(List(Edge(0L, 1L, 1), Edge(0L, 2L, 1)), 2)) assert(g.triplets.collect.map(_.toTuple).toSet == Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1))) val gPart = g.partitionBy(PartitionStrategy.EdgePartition2D) assert(gPart.triplets.collect.map(_.toTuple).toSet == Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1))) {code} > Graph.partitionBy does not reconstruct routing tables > - > > Key: SPARK-1931 > URL: https://issues.apache.org/jira/browse/SPARK-1931 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 1.0.0 >Reporter: Ankur Dave >Assignee: Ankur Dave > Fix For: 1.0.0 > > > Commit 905173df57b90f90ebafb22e43f55164445330e6 introduced a bug in > partitionBy where, after repartitioning the edges, it reuses the VertexRDD > without updating the routing tables to reflect the new edge layout. This > causes the following test to fail: > {code} > import org.apache.spark.graphx._ > val g = Graph( > sc.parallelize(List((0L, "a"), (1L, "b"), (2L, "c"))), > sc.parallelize(List(Edge(0L, 1L, 1), Edge(0L, 2L, 1)), 2)) > assert(g.triplets.collect.map(_.toTuple).toSet == > Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1))) > val gPart = g.partitionBy(PartitionStrategy.EdgePartition2D) > assert(gPart.triplets.collect.map(_.toTuple).toSet == > Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1))) > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1931) Graph.partitionBy does not reconstruct routing tables
[ https://issues.apache.org/jira/browse/SPARK-1931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur Dave updated SPARK-1931: -- Description: Commit 905173df57b90f90ebafb22e43f55164445330e6 introduced a bug in partitionBy where, after repartitioning the edges, it reuses the VertexRDD without updating the routing tables to reflect the new edge layout. This causes the following test to fail: {code} val g = Graph( sc.parallelize(List((0L, "a"), (1L, "b"), (2L, "c"))), sc.parallelize(List(Edge(0L, 1L, 1), Edge(0L, 2L, 1)), 2)) assert(g.triplets.collect.map(_.toTuple).toSet == Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1))) val gPart = g.partitionBy(PartitionStrategy.EdgePartition2D) assert(gPart.triplets.collect.map(_.toTuple).toSet == Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1))) {code} was: Commit 905173df57b90f90ebafb22e43f55164445330e6 introduced a bug in partitionBy where, after repartitioning the edges, it reuses the VertexRDD without updating the routing tables to reflect the new edge layout. 
This causes the following test to fail: {code} val g = Graph( sc.parallelize(List((0L, "a"), (1L, "b"), (2L, "c"))), sc.parallelize(List(Edge(0L, 1L, 1), Edge(0L, 2L, 1)), 2)) assert(g.triplets.collect.map(_.toTuple).toSet === Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1))) val gPart = g.partitionBy(PartitionStrategy.EdgePartition2D) assert(gPart.triplets.collect.map(_.toTuple).toSet === Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1))) {code} > Graph.partitionBy does not reconstruct routing tables > - > > Key: SPARK-1931 > URL: https://issues.apache.org/jira/browse/SPARK-1931 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 1.0.0 >Reporter: Ankur Dave >Assignee: Ankur Dave > Fix For: 1.0.0 > > > Commit 905173df57b90f90ebafb22e43f55164445330e6 introduced a bug in > partitionBy where, after repartitioning the edges, it reuses the VertexRDD > without updating the routing tables to reflect the new edge layout. This > causes the following test to fail: > {code} > val g = Graph( > sc.parallelize(List((0L, "a"), (1L, "b"), (2L, "c"))), > sc.parallelize(List(Edge(0L, 1L, 1), Edge(0L, 2L, 1)), 2)) > assert(g.triplets.collect.map(_.toTuple).toSet == > Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1))) > val gPart = g.partitionBy(PartitionStrategy.EdgePartition2D) > assert(gPart.triplets.collect.map(_.toTuple).toSet == > Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1))) > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1931) Graph.partitionBy does not reconstruct routing tables
[ https://issues.apache.org/jira/browse/SPARK-1931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur Dave updated SPARK-1931: -- Description: Commit 905173df57b90f90ebafb22e43f55164445330e6 introduced a bug in partitionBy where, after repartitioning the edges, it reuses the VertexRDD without updating the routing tables to reflect the new edge layout. This causes the following test to fail: {code} val g = Graph( sc.parallelize(List((0L, "a"), (1L, "b"), (2L, "c"))), sc.parallelize(List(Edge(0L, 1L, 1), Edge(0L, 2L, 1)), 2)) assert(g.triplets.collect.map(_.toTuple).toSet === Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1))) val gPart = g.partitionBy(PartitionStrategy.EdgePartition2D) assert(gPart.triplets.collect.map(_.toTuple).toSet === Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1))) {code} was: Commit 905173df57b90f90ebafb22e43f55164445330e6 introduced a bug in partitionBy where, after repartitioning the edges, it reuses the VertexRDD without updating the routing tables to reflect the new edge layout. 
This causes the following test to fail: {code} val g = Graph( sc.parallelize(List((0L, "a"), (1L, "b"), (2L, "c"))), sc.parallelize(List(Edge(0L, 1L, 1), Edge(0L, 2L, 1)), 2)) assert(g.triplets.collect.map(_.toTuple).toSet === Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1))) val gPart = g.partitionBy(EdgePartition2D) assert(gPart.triplets.collect.map(_.toTuple).toSet === Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1))) {code} > Graph.partitionBy does not reconstruct routing tables > - > > Key: SPARK-1931 > URL: https://issues.apache.org/jira/browse/SPARK-1931 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 1.0.0 >Reporter: Ankur Dave >Assignee: Ankur Dave > Fix For: 1.0.0 > > > Commit 905173df57b90f90ebafb22e43f55164445330e6 introduced a bug in > partitionBy where, after repartitioning the edges, it reuses the VertexRDD > without updating the routing tables to reflect the new edge layout. This > causes the following test to fail: > {code} > val g = Graph( > sc.parallelize(List((0L, "a"), (1L, "b"), (2L, "c"))), > sc.parallelize(List(Edge(0L, 1L, 1), Edge(0L, 2L, 1)), 2)) > assert(g.triplets.collect.map(_.toTuple).toSet === > Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1))) > val gPart = g.partitionBy(PartitionStrategy.EdgePartition2D) > assert(gPart.triplets.collect.map(_.toTuple).toSet === > Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1))) > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-983) Support external sorting for RDD#sortByKey()
[ https://issues.apache.org/jira/browse/SPARK-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14010190#comment-14010190 ] Aaron Davidson commented on SPARK-983: -- Does sound reasonable. For some reason it does not allow me to assign the issue to you, though. > Support external sorting for RDD#sortByKey() > > > Key: SPARK-983 > URL: https://issues.apache.org/jira/browse/SPARK-983 > Project: Spark > Issue Type: New Feature >Affects Versions: 0.9.0 >Reporter: Reynold Xin > > Currently, RDD#sortByKey() is implemented by a mapPartitions which creates a > buffer to hold the entire partition, then sorts it. This will cause an OOM if > an entire partition cannot fit in memory, which is especially problematic for > skewed data. Rather than OOMing, the behavior should be similar to the > [ExternalAppendOnlyMap|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/collection/ExternalAppendOnlyMap.scala], > where we fallback to disk if we detect memory pressure. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-983) Support external sorting for RDD#sortByKey()
[ https://issues.apache.org/jira/browse/SPARK-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14010182#comment-14010182 ] Madhu Siddalingaiah commented on SPARK-983: --- Understood. Here are my thoughts: I can implement this feature using SizeEstimator, even though it's not ideal. Once there is greater consensus on how to manage memory more efficiently, I think it should not be hard to adapt the code. At the minimum, the spill/merge code will be in place. I'm watching [SPARK-1021|https://issues.apache.org/jira/browse/SPARK-1021], so if that's done first, I'll merge it in. Does that sound reasonable? [Aaron Davidson|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=ilikerps], feel free to assign this feature to me. > Support external sorting for RDD#sortByKey() > > > Key: SPARK-983 > URL: https://issues.apache.org/jira/browse/SPARK-983 > Project: Spark > Issue Type: New Feature >Affects Versions: 0.9.0 >Reporter: Reynold Xin > > Currently, RDD#sortByKey() is implemented by a mapPartitions which creates a > buffer to hold the entire partition, then sorts it. This will cause an OOM if > an entire partition cannot fit in memory, which is especially problematic for > skewed data. Rather than OOMing, the behavior should be similar to the > [ExternalAppendOnlyMap|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/collection/ExternalAppendOnlyMap.scala], > where we fallback to disk if we detect memory pressure. -- This message was sent by Atlassian JIRA (v6.2#6252)
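The spill/merge code discussed above can be sketched in plain Scala. This is an illustration of the general technique, not Spark's `sortByKey` implementation: a simple element-count threshold stands in for `SizeEstimator`-based memory-pressure detection, and records are plain `Int`s written one per line to temp files.

```scala
import java.io.{File, PrintWriter}
import scala.collection.mutable
import scala.io.Source

// Sketch of external sorting: buffer records, spill sorted runs to disk
// when a threshold is hit (element count here, standing in for
// SizeEstimator-based memory pressure), then k-way merge the runs.
object ExternalSortSketch {
  def sort(input: Iterator[Int], maxInMemory: Int): Iterator[Int] = {
    val runs = mutable.Buffer[File]()
    val buf = mutable.Buffer[Int]()

    def spill(): Unit = {
      val f = File.createTempFile("run", ".txt"); f.deleteOnExit()
      val out = new PrintWriter(f)
      try buf.sorted.foreach(out.println) finally out.close()
      runs += f
      buf.clear()
    }

    input.foreach { x => buf += x; if (buf.size >= maxInMemory) spill() }
    if (buf.nonEmpty) spill()

    // Merge: repeatedly take the smallest head among the sorted runs.
    val heads = runs.map(f => Source.fromFile(f).getLines().map(_.toInt).buffered)
    new Iterator[Int] {
      def hasNext: Boolean = heads.exists(_.hasNext)
      def next(): Int = heads.filter(_.hasNext).minBy(_.head).next()
    }
  }
}
```

A real implementation would use a priority queue for the merge and a serialized, batched on-disk format, but the structure (bounded buffer, sorted spill files, streaming merge) is the same as in `ExternalAppendOnlyMap`.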
[jira] [Commented] (SPARK-1518) Spark master doesn't compile against hadoop-common trunk
[ https://issues.apache.org/jira/browse/SPARK-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14010117#comment-14010117 ] Patrick Wendell commented on SPARK-1518: Ah okay. I'm not sure what the oldest pre-0.21 version of Hadoop is that Spark compiles against, but it's worth knowing whether this change would cause us to drop support for some of the older versions. > Spark master doesn't compile against hadoop-common trunk > > > Key: SPARK-1518 > URL: https://issues.apache.org/jira/browse/SPARK-1518 > Project: Spark > Issue Type: Bug >Reporter: Marcelo Vanzin >Assignee: Colin Patrick McCabe > > FSDataOutputStream::sync() has disappeared from trunk in Hadoop; > FileLogger.scala is calling it. > I've changed it locally to hsync() so I can compile the code, but haven't > checked yet whether those are equivalent. hsync() seems to have been there > forever, so it hopefully works with all versions Spark cares about. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1103) Garbage collect RDD information inside of Spark
[ https://issues.apache.org/jira/browse/SPARK-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14010099#comment-14010099 ] Andrew Ash commented on SPARK-1103: --- https://github.com/apache/spark/pull/126 > Garbage collect RDD information inside of Spark > --- > > Key: SPARK-1103 > URL: https://issues.apache.org/jira/browse/SPARK-1103 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Patrick Wendell >Assignee: Tathagata Das >Priority: Blocker > Fix For: 1.0.0 > > > When Spark jobs run for a long period of time, state accumulates. This is > dealt with now using TTL-based cleaning. Instead we should do proper garbage > collection using weak references. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1518) Spark master doesn't compile against hadoop-common trunk
[ https://issues.apache.org/jira/browse/SPARK-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14010068#comment-14010068 ] Colin Patrick McCabe commented on SPARK-1518: - bq. Hey Colin Patrick McCabe, what is the oldest version of hadoop that contains hflush? /cc Andrew Or who IIRC looked into this a bunch when writing the logger. The oldest Apache release I know about with hflush is Hadoop 0.21. > Spark master doesn't compile against hadoop-common trunk > > > Key: SPARK-1518 > URL: https://issues.apache.org/jira/browse/SPARK-1518 > Project: Spark > Issue Type: Bug >Reporter: Marcelo Vanzin >Assignee: Colin Patrick McCabe > > FSDataOutputStream::sync() has disappeared from trunk in Hadoop; > FileLogger.scala is calling it. > I've changed it locally to hsync() so I can compile the code, but haven't > checked yet whether those are equivalent. hsync() seems to have been there > forever, so it hopefully works with all versions Spark cares about. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1930) The Container is running beyond physical memory limits, so as to be killed.
[ https://issues.apache.org/jira/browse/SPARK-1930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guoqiang Li updated SPARK-1930: --- Summary: The Container is running beyond physical memory limits, so as to be killed. (was: Container is running beyond physical memory limits) > The Container is running beyond physical memory limits, so as to be killed. > --- > > Key: SPARK-1930 > URL: https://issues.apache.org/jira/browse/SPARK-1930 > Project: Spark > Issue Type: Bug > Components: YARN >Reporter: Guoqiang Li > Fix For: 1.0.1 > > > When the containers occupies 8G memory ,the containers were killed > yarn node manager log: > {code} > 2014-05-23 13:35:30,776 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: > Container [pid=4947,containerID=container_1400809535638_0015_01_05] is > running beyond physical memory limits. Current usage: 8.6 GB of 8.5 GB > physical memory used; 10.0 GB of 17.8 GB virtual memory used. Killing > container. 
> Dump of the process-tree for container_1400809535638_0015_01_05 : > |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) > SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE > |- 4947 25417 4947 4947 (bash) 0 0 110804992 335 /bin/bash -c > /usr/java/jdk1.7.0_45-cloudera/bin/java -server -XX:OnOutOfMemoryError='kill > %p' -Xms8192m -Xmx8192m -Xss2m > -Djava.io.tmpdir=/yarn/nm/usercache/spark/appcache/application_1400809535638_0015/container_1400809535638_0015_01_05/tmp > -Dlog4j.configuration=log4j-spark-container.properties > -Dspark.akka.askTimeout="120" -Dspark.akka.timeout="120" > -Dspark.akka.frameSize="20" > org.apache.spark.executor.CoarseGrainedExecutorBackend > akka.tcp://sp...@10dian71.domain.test:45477/user/CoarseGrainedScheduler 3 > 10dian72.domain.test 4 1> > /var/log/hadoop-yarn/container/application_1400809535638_0015/container_1400809535638_0015_01_05/stdout > 2> > /var/log/hadoop-yarn/container/application_1400809535638_0015/container_1400809535638_0015_01_05/stderr > > |- 4957 4947 4947 4947 (java) 157809 12620 10667016192 2245522 > /usr/java/jdk1.7.0_45-cloudera/bin/java -server -XX:OnOutOfMemoryError=kill > %p -Xms8192m -Xmx8192m -Xss2m > -Djava.io.tmpdir=/yarn/nm/usercache/spark/appcache/application_1400809535638_0015/container_1400809535638_0015_01_05/tmp > -Dlog4j.configuration=log4j-spark-container.properties > -Dspark.akka.askTimeout=120 -Dspark.akka.timeout=120 > -Dspark.akka.frameSize=20 > org.apache.spark.executor.CoarseGrainedExecutorBackend > akka.tcp://sp...@10dian71.domain.test:45477/user/CoarseGrainedScheduler 3 > 10dian72.domain.test 4 > 2014-05-23 13:35:30,776 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: > Removed ProcessTree with root 4947 > 2014-05-23 13:35:30,776 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: > Container container_1400809535638_0015_01_05 transitioned from RUNNING > to KILLING > 2014-05-23 
13:35:30,777 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: > Cleaning up container container_1400809535638_0015_01_05 > 2014-05-23 13:35:30,788 WARN > org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code > from container container_1400809535638_0015_01_05 is : 143 > 2014-05-23 13:35:30,829 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: > Container container_1400809535638_0015_01_05 transitioned from KILLING > to CONTAINER_CLEANEDUP_AFTER_KILL > 2014-05-23 13:35:30,830 INFO > org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting > absolute path : > /yarn/nm/usercache/spark/appcache/application_1400809535638_0015/container_1400809535638_0015_01_05 > 2014-05-23 13:35:30,830 INFO > org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=spark > OPERATION=Container Finished - Killed TARGET=ContainerImpl > RESULT=SUCCESS APPID=application_1400809535638_0015 > CONTAINERID=container_1400809535638_0015_01_05 > 2014-05-23 13:35:30,830 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: > Container container_1400809535638_0015_01_05 transitioned from > CONTAINER_CLEANEDUP_AFTER_KILL to DONE > 2014-05-23 13:35:30,830 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: > Removing container_1400809535638_0015_01_05 from application > application_1400809535638_0015 > {code} > I think it should be related with {{YarnAllocationHandler.MEMORY_OVERHEA}} > https://github.com/apache/spark/blob/master/yarn/stable/sr
[jira] [Updated] (SPARK-1930) Container is running beyond physical memory limits
[ https://issues.apache.org/jira/browse/SPARK-1930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guoqiang Li updated SPARK-1930: --- Fix Version/s: 1.0.1 > Container is running beyond physical memory limits > > > Key: SPARK-1930 > URL: https://issues.apache.org/jira/browse/SPARK-1930 > Project: Spark > Issue Type: Bug > Components: YARN >Reporter: Guoqiang Li > Fix For: 1.0.1 > > > When the containers occupies 8G memory ,the containers were killed > yarn node manager log: > {code} > 2014-05-23 13:35:30,776 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: > Container [pid=4947,containerID=container_1400809535638_0015_01_05] is > running beyond physical memory limits. Current usage: 8.6 GB of 8.5 GB > physical memory used; 10.0 GB of 17.8 GB virtual memory used. Killing > container. > Dump of the process-tree for container_1400809535638_0015_01_05 : > |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) > SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE > |- 4947 25417 4947 4947 (bash) 0 0 110804992 335 /bin/bash -c > /usr/java/jdk1.7.0_45-cloudera/bin/java -server -XX:OnOutOfMemoryError='kill > %p' -Xms8192m -Xmx8192m -Xss2m > -Djava.io.tmpdir=/yarn/nm/usercache/spark/appcache/application_1400809535638_0015/container_1400809535638_0015_01_05/tmp > -Dlog4j.configuration=log4j-spark-container.properties > -Dspark.akka.askTimeout="120" -Dspark.akka.timeout="120" > -Dspark.akka.frameSize="20" > org.apache.spark.executor.CoarseGrainedExecutorBackend > akka.tcp://sp...@10dian71.domain.test:45477/user/CoarseGrainedScheduler 3 > 10dian72.domain.test 4 1> > /var/log/hadoop-yarn/container/application_1400809535638_0015/container_1400809535638_0015_01_05/stdout > 2> > /var/log/hadoop-yarn/container/application_1400809535638_0015/container_1400809535638_0015_01_05/stderr > > |- 4957 4947 4947 4947 (java) 157809 12620 10667016192 2245522 > /usr/java/jdk1.7.0_45-cloudera/bin/java 
-server -XX:OnOutOfMemoryError=kill > %p -Xms8192m -Xmx8192m -Xss2m > -Djava.io.tmpdir=/yarn/nm/usercache/spark/appcache/application_1400809535638_0015/container_1400809535638_0015_01_05/tmp > -Dlog4j.configuration=log4j-spark-container.properties > -Dspark.akka.askTimeout=120 -Dspark.akka.timeout=120 > -Dspark.akka.frameSize=20 > org.apache.spark.executor.CoarseGrainedExecutorBackend > akka.tcp://sp...@10dian71.domain.test:45477/user/CoarseGrainedScheduler 3 > 10dian72.domain.test 4 > 2014-05-23 13:35:30,776 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: > Removed ProcessTree with root 4947 > 2014-05-23 13:35:30,776 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: > Container container_1400809535638_0015_01_05 transitioned from RUNNING > to KILLING > 2014-05-23 13:35:30,777 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: > Cleaning up container container_1400809535638_0015_01_05 > 2014-05-23 13:35:30,788 WARN > org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code > from container container_1400809535638_0015_01_05 is : 143 > 2014-05-23 13:35:30,829 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: > Container container_1400809535638_0015_01_05 transitioned from KILLING > to CONTAINER_CLEANEDUP_AFTER_KILL > 2014-05-23 13:35:30,830 INFO > org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting > absolute path : > /yarn/nm/usercache/spark/appcache/application_1400809535638_0015/container_1400809535638_0015_01_05 > 2014-05-23 13:35:30,830 INFO > org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=spark > OPERATION=Container Finished - Killed TARGET=ContainerImpl > RESULT=SUCCESS APPID=application_1400809535638_0015 > CONTAINERID=container_1400809535638_0015_01_05 > 2014-05-23 13:35:30,830 INFO > 
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: > Container container_1400809535638_0015_01_05 transitioned from > CONTAINER_CLEANEDUP_AFTER_KILL to DONE > 2014-05-23 13:35:30,830 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: > Removing container_1400809535638_0015_01_05 from application > application_1400809535638_0015 > {code} > I think this is related to {{YarnAllocationHandler.MEMORY_OVERHEAD}} > https://github.com/apache/spark/blob/master/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala#L562 > Relative to an 8 GB heap, the fixed 384 MB overhead is too small. -- This message was sent by Atlassian JIRA (v6.2#6252)
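The arithmetic behind the kill can be reconstructed from the log above (a sketch, assuming the fixed 384 MB {{MEMORY_OVERHEAD}} default of this era, and assuming YARN rounded the request up to its allocation increment; the variable names are illustrative only):

```scala
// A JVM's resident set includes more than the heap (thread stacks, direct
// buffers, JIT and class metadata), so 384 MB of headroom over an 8 GB heap
// is easily exceeded.
val heapMb      = 8192L                 // -Xmx8192m from the container command
val overheadMb  = 384L                  // YarnAllocationHandler.MEMORY_OVERHEAD
val requestedMb = heapMb + overheadMb   // 8576 MB asked of YARN
val limitBytes  = 8704L * 1024 * 1024   // the "8.5 GB" (8.5 GiB) limit in the log,
                                        // presumably requestedMb rounded up to the
                                        // scheduler's allocation increment
val rssBytes    = 2245522L * 4096       // RSSMEM_USAGE(PAGES) x 4 KiB page size
val rssGib      = rssBytes.toDouble / (1024L * 1024 * 1024)  // ~8.57, shown as "8.6 GB"
// rssBytes > limitBytes, so the ContainersMonitor kills the container.
```

Under these assumptions the reported 8.6 GB resident set does exceed the 8.5 GiB limit, which is consistent with the kill.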
[jira] [Comment Edited] (SPARK-983) Support external sorting for RDD#sortByKey()
[ https://issues.apache.org/jira/browse/SPARK-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14009875#comment-14009875 ] Mark Hamstra edited comment on SPARK-983 at 5/27/14 4:40 PM: - I'm hoping these can be kept orthogonal, but I think that it is worth noting the existence of SPARK-1021 and the fact that sortByKey as it currently exists breaks Spark's "transformations of RDDs are lazy" contract. I'm currently working on that issue, which is undoubtedly going to require at least some merge work to be compatible with the resolution of this issue. was (Author: markhamstra): I'm hoping these can be kept orthogonal, but I think that it is worth noting the existence of https://issues.apache.org/jira/browse/SPARK-1021 and the fact that sortByKey as it currently exists breaks Spark's "transformations of RDDs are lazy" contract. I'm currently working on that issue, which is undoubtedly going to require at least some merge work to be compatible with the resolution of this issue. > Support external sorting for RDD#sortByKey() > > > Key: SPARK-983 > URL: https://issues.apache.org/jira/browse/SPARK-983 > Project: Spark > Issue Type: New Feature >Affects Versions: 0.9.0 >Reporter: Reynold Xin > > Currently, RDD#sortByKey() is implemented by a mapPartitions which creates a > buffer to hold the entire partition, then sorts it. This will cause an OOM if > an entire partition cannot fit in memory, which is especially problematic for > skewed data. Rather than OOMing, the behavior should be similar to the > [ExternalAppendOnlyMap|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/collection/ExternalAppendOnlyMap.scala], > where we fallback to disk if we detect memory pressure. -- This message was sent by Atlassian JIRA (v6.2#6252)
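The buffering behavior described in the issue can be sketched as follows; {{sortPartition}} is a hypothetical stand-in for what the {{mapPartitions}}-based implementation does per partition, not the actual Spark code:

```scala
// Sketch of the current memory-bound approach: the whole partition is
// materialized before sorting, so a partition larger than the heap OOMs.
// An external sort would instead spill sorted runs to disk under memory
// pressure and merge them, as ExternalAppendOnlyMap does for aggregation.
def sortPartition[K, V](iter: Iterator[(K, V)])(implicit ord: Ordering[K]): Iterator[(K, V)] =
  iter.toVector.sortBy(_._1).iterator   // entire partition held in memory here
```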
[jira] [Updated] (SPARK-1930) Container is running beyond physical memory limits
[ https://issues.apache.org/jira/browse/SPARK-1930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guoqiang Li updated SPARK-1930: --- Summary: Container is running beyond physical memory limits (was: Container memory beyond limit, were killed) > Container is running beyond physical memory limits > > > Key: SPARK-1930 > URL: https://issues.apache.org/jira/browse/SPARK-1930 > Project: Spark > Issue Type: Bug > Components: YARN >Reporter: Guoqiang Li > > When a container occupies 8 GB of memory, it is killed. > YARN NodeManager log: > {code} > 2014-05-23 13:35:30,776 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: > Container [pid=4947,containerID=container_1400809535638_0015_01_05] is > running beyond physical memory limits. Current usage: 8.6 GB of 8.5 GB > physical memory used; 10.0 GB of 17.8 GB virtual memory used. Killing > container. > Dump of the process-tree for container_1400809535638_0015_01_05 : > |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) > SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE > |- 4947 25417 4947 4947 (bash) 0 0 110804992 335 /bin/bash -c > /usr/java/jdk1.7.0_45-cloudera/bin/java -server -XX:OnOutOfMemoryError='kill > %p' -Xms8192m -Xmx8192m -Xss2m > -Djava.io.tmpdir=/yarn/nm/usercache/spark/appcache/application_1400809535638_0015/container_1400809535638_0015_01_05/tmp > -Dlog4j.configuration=log4j-spark-container.properties > -Dspark.akka.askTimeout="120" -Dspark.akka.timeout="120" > -Dspark.akka.frameSize="20" > org.apache.spark.executor.CoarseGrainedExecutorBackend > akka.tcp://sp...@10dian71.domain.test:45477/user/CoarseGrainedScheduler 3 > 10dian72.domain.test 4 1> > /var/log/hadoop-yarn/container/application_1400809535638_0015/container_1400809535638_0015_01_05/stdout > 2> > /var/log/hadoop-yarn/container/application_1400809535638_0015/container_1400809535638_0015_01_05/stderr > > |- 4957 4947 4947 4947 (java) 157809 
12620 10667016192 2245522 > /usr/java/jdk1.7.0_45-cloudera/bin/java -server -XX:OnOutOfMemoryError=kill > %p -Xms8192m -Xmx8192m -Xss2m > -Djava.io.tmpdir=/yarn/nm/usercache/spark/appcache/application_1400809535638_0015/container_1400809535638_0015_01_05/tmp > -Dlog4j.configuration=log4j-spark-container.properties > -Dspark.akka.askTimeout=120 -Dspark.akka.timeout=120 > -Dspark.akka.frameSize=20 > org.apache.spark.executor.CoarseGrainedExecutorBackend > akka.tcp://sp...@10dian71.domain.test:45477/user/CoarseGrainedScheduler 3 > 10dian72.domain.test 4 > 2014-05-23 13:35:30,776 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: > Removed ProcessTree with root 4947 > 2014-05-23 13:35:30,776 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: > Container container_1400809535638_0015_01_05 transitioned from RUNNING > to KILLING > 2014-05-23 13:35:30,777 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: > Cleaning up container container_1400809535638_0015_01_05 > 2014-05-23 13:35:30,788 WARN > org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code > from container container_1400809535638_0015_01_05 is : 143 > 2014-05-23 13:35:30,829 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: > Container container_1400809535638_0015_01_05 transitioned from KILLING > to CONTAINER_CLEANEDUP_AFTER_KILL > 2014-05-23 13:35:30,830 INFO > org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting > absolute path : > /yarn/nm/usercache/spark/appcache/application_1400809535638_0015/container_1400809535638_0015_01_05 > 2014-05-23 13:35:30,830 INFO > org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=spark > OPERATION=Container Finished - Killed TARGET=ContainerImpl > RESULT=SUCCESS APPID=application_1400809535638_0015 > CONTAINERID=container_1400809535638_0015_01_05 > 2014-05-23 
13:35:30,830 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: > Container container_1400809535638_0015_01_05 transitioned from > CONTAINER_CLEANEDUP_AFTER_KILL to DONE > 2014-05-23 13:35:30,830 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: > Removing container_1400809535638_0015_01_05 from application > application_1400809535638_0015 > {code} > I think this is related to {{YarnAllocationHandler.MEMORY_OVERHEAD}} > https://github.com/apache/spark/blob/master/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala#L562 > Relative to an 8 GB heap, the fixed 384 MB overhead is too small
[jira] [Commented] (SPARK-983) Support external sorting for RDD#sortByKey()
[ https://issues.apache.org/jira/browse/SPARK-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14009875#comment-14009875 ] Mark Hamstra commented on SPARK-983: I'm hoping these can be kept orthogonal, but I think that it is worth noting the existence of https://issues.apache.org/jira/browse/SPARK-1021 and the fact that sortByKey as it currently exists breaks Spark's "transformations of RDDs are lazy" contract. I'm currently working on that issue, which is undoubtedly going to require at least some merge work to be compatible with the resolution of this issue. > Support external sorting for RDD#sortByKey() > > > Key: SPARK-983 > URL: https://issues.apache.org/jira/browse/SPARK-983 > Project: Spark > Issue Type: New Feature >Affects Versions: 0.9.0 >Reporter: Reynold Xin > > Currently, RDD#sortByKey() is implemented by a mapPartitions which creates a > buffer to hold the entire partition, then sorts it. This will cause an OOM if > an entire partition cannot fit in memory, which is especially problematic for > skewed data. Rather than OOMing, the behavior should be similar to the > [ExternalAppendOnlyMap|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/collection/ExternalAppendOnlyMap.scala], > where we fallback to disk if we detect memory pressure. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (SPARK-1021) sortByKey() launches a cluster job when it shouldn't
[ https://issues.apache.org/jira/browse/SPARK-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Hamstra reassigned SPARK-1021: --- Assignee: Mark Hamstra > sortByKey() launches a cluster job when it shouldn't > > > Key: SPARK-1021 > URL: https://issues.apache.org/jira/browse/SPARK-1021 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 0.8.0, 0.9.0 >Reporter: Andrew Ash >Assignee: Mark Hamstra > Labels: starter > > The sortByKey() method is listed as a transformation, not an action, in the > documentation. But it launches a cluster job regardless. > http://spark.incubator.apache.org/docs/latest/scala-programming-guide.html > Some discussion on the mailing list suggested that this is a problem with the > rdd.count() call inside Partitioner.scala's rangeBounds method. > https://github.com/apache/incubator-spark/blob/master/core/src/main/scala/org/apache/spark/Partitioner.scala#L102 > Josh Rosen suggests that rangeBounds should be made into a lazy variable: > {quote} > I wonder whether making RangePartitoner .rangeBounds into a lazy val would > fix this > (https://github.com/apache/incubator-spark/blob/6169fe14a140146602fb07cfcd13eee6efad98f9/core/src/main/scala/org/apache/spark/Partitioner.scala#L95). > We'd need to make sure that rangeBounds() is never called before an action > is performed. This could be tricky because it's called in the > RangePartitioner.equals() method. Maybe it's sufficient to just compare the > number of partitions, the ids of the RDDs used to create the > RangePartitioner, and the sort ordering. This still supports the case where > I range-partition one RDD and pass the same partitioner to a different RDD. > It breaks support for the case where two range partitioners created on > different RDDs happened to have the same rangeBounds(), but it seems unlikely > that this would really harm performance since it's probably unlikely that the > range partitioners are equal by chance. 
> {quote} > Can we please make this happen? I'll send a PR on GitHub to start the > discussion and testing. -- This message was sent by Atlassian JIRA (v6.2#6252)
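Josh Rosen's suggestion can be sketched with a simplified toy (Int keys, a by-name {{computeBounds}} standing in for the sampling job; none of this is the real RangePartitioner code):

```scala
// Sketch: rangeBounds becomes lazy so that merely constructing the
// partitioner (a transformation) runs no job; equals compares cheap
// identity data (partition count, source RDD id) instead of forcing it.
class LazyRangePartitioner(val partitions: Int, val rddId: Int,
                           computeBounds: => Array[Int]) {
  lazy val rangeBounds: Array[Int] = computeBounds  // first forced by an action
  override def equals(other: Any): Boolean = other match {
    case o: LazyRangePartitioner => o.partitions == partitions && o.rddId == rddId
    case _ => false
  }
  override def hashCode: Int = 31 * partitions + rddId
}
```

As the quote notes, this trades away equality between partitioners that happen to compute identical bounds on different RDDs, in exchange for never launching a job just to compare two partitioners.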
[jira] [Commented] (SPARK-1935) Explicitly add commons-codec 1.4 as a dependency
[ https://issues.apache.org/jira/browse/SPARK-1935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14009856#comment-14009856 ] Yin Huai commented on SPARK-1935: - [~srowen] Yes, that is the reason. I did a quick search in parquet-mr, and Parquet does not appear to have any import related to commons-codec. I also checked the release notes of commons-codec 1.5, and 1.5 seems fine. So, shall we upgrade to 1.5? > Explicitly add commons-codec 1.4 as a dependency > > > Key: SPARK-1935 > URL: https://issues.apache.org/jira/browse/SPARK-1935 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 0.9.1 >Reporter: Yin Huai >Assignee: Yin Huai >Priority: Minor > > Right now, commons-codec is a transitive dependency. When Spark is built by > maven for Hadoop 1, jets3t 0.7.1 will pull in commons-codec 1.3 which is an > older version (Hadoop 1.0.4 depends on 1.4). This older version can cause > problems because 1.4 introduces incompatible changes and new methods. -- This message was sent by Atlassian JIRA (v6.2#6252)
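If the upgrade goes ahead, pinning the version explicitly would look roughly like this in sbt (a sketch only; the Maven build would instead declare an explicit {{commons-codec}} dependency entry):

```scala
// build.sbt fragment (sketch): force commons-codec to one version so that
// neither jets3t's 1.3 nor Hadoop's 1.4 wins by resolution-order accident
dependencyOverrides += "commons-codec" % "commons-codec" % "1.5"
```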
[jira] [Commented] (SPARK-1938) ApproxCountDistinctMergeFunction should return Int value.
[ https://issues.apache.org/jira/browse/SPARK-1938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14009555#comment-14009555 ] Takuya Ueshin commented on SPARK-1938: -- PRed: https://github.com/apache/spark/pull/893 > ApproxCountDistinctMergeFunction should return Int value. > - > > Key: SPARK-1938 > URL: https://issues.apache.org/jira/browse/SPARK-1938 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Takuya Ueshin > > {{ApproxCountDistinctMergeFunction}} should return {{Int}} value because the > {{dataType}} of {{ApproxCountDistinct}} is {{IntegerType}}. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1938) ApproxCountDistinctMergeFunction should return Int value.
Takuya Ueshin created SPARK-1938: Summary: ApproxCountDistinctMergeFunction should return Int value. Key: SPARK-1938 URL: https://issues.apache.org/jira/browse/SPARK-1938 Project: Spark Issue Type: Bug Components: SQL Reporter: Takuya Ueshin {{ApproxCountDistinctMergeFunction}} should return {{Int}} value because the {{dataType}} of {{ApproxCountDistinct}} is {{IntegerType}}. -- This message was sent by Atlassian JIRA (v6.2#6252)
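The failure mode can be reproduced outside Catalyst with plain boxing (a minimal illustration only, not the actual aggregate code): a value exposed through {{Any}} as a boxed {{Long}} cannot be unboxed by a consumer that, trusting {{dataType == IntegerType}}, expects an {{Int}}.

```scala
// The merge function produced a Long (the HyperLogLog cardinality) while
// its declared dataType was IntegerType; downstream unboxing as Int fails.
val produced: Any = 42L                        // boxed java.lang.Long
val blewUp =
  try { produced.asInstanceOf[Int]; false }    // Long box -> Int unbox: CCE
  catch { case _: ClassCastException => true }
val fixed = produced.asInstanceOf[Long].toInt  // one way to fix: convert to Int
```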
[jira] [Commented] (SPARK-1928) DAGScheduler suspended by local task OOM
[ https://issues.apache.org/jira/browse/SPARK-1928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14009532#comment-14009532 ] Zhen Peng commented on SPARK-1928: -- [~gq] I hit this case in our local-mode Spark Streaming application, and I have added a unit-test case to simulate it. > DAGScheduler suspended by local task OOM > > > Key: SPARK-1928 > URL: https://issues.apache.org/jira/browse/SPARK-1928 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 0.9.0 >Reporter: Zhen Peng > Fix For: 1.0.0 > > > DAGScheduler does not handle local task OOM properly, and will wait for the > job result forever. -- This message was sent by Atlassian JIRA (v6.2#6252)
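The fix idea can be sketched with hypothetical helper names (this is not the actual DAGScheduler code): run the local task under a guard that turns a fatal error into an explicit job failure, so the waiter is unblocked instead of waiting forever.

```scala
// Guarded local execution (sketch): an OutOfMemoryError thrown by the task
// is reported as a job failure rather than silently killing the thread.
def runLocalTask[T](task: () => T, failJob: String => Unit): Option[T] =
  try Some(task())
  catch {
    case oom: OutOfMemoryError =>
      failJob(s"Local task OOM: ${oom.getMessage}")  // unblocks the job waiter
      None
  }
```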
[jira] [Updated] (SPARK-1937) Tasks can be submitted before executors are registered
[ https://issues.apache.org/jira/browse/SPARK-1937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rui Li updated SPARK-1937: -- Attachment: RSBTest.scala The program that triggers the problem. With the patch, the job's total execution time drops from nearly 600 s to around 250 s. > Tasks can be submitted before executors are registered > -- > > Key: SPARK-1937 > URL: https://issues.apache.org/jira/browse/SPARK-1937 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Rui Li > Attachments: After-patch.PNG, Before-patch.png, RSBTest.scala > > > During construction, TaskSetManager will assign tasks to several pending > lists according to the tasks’ preferred locations. If the desired location is > unavailable, it’ll then assign this task to “pendingTasksWithNoPrefs”, a list > containing tasks without preferred locations. > The problem is that tasks may be submitted before the executors get > registered with the driver, in which case TaskSetManager will assign all the > tasks to pendingTasksWithNoPrefs. Later, when it looks for a task to schedule, > it will pick one from this list and assign it to an arbitrary executor, since > TaskSetManager considers that the tasks can run equally well on any node. > This problem forfeits the benefits of data locality, slows the whole job down, > and can cause imbalance between executors. > I ran into this issue when running a Spark program on a 7-node cluster > (node6~node12). The program processes 100 GB of data. > Since the data is uploaded to HDFS from node6, this node has a complete copy > of the data and, as a result, node6 finishes tasks much faster, which in turn > makes it complete disproportionately more tasks than other nodes. > To solve this issue, I think we shouldn't check the availability of > executors/hosts when constructing TaskSetManager. If a task prefers a node, > we simply add the task to that node’s pending list. When the node is later > added, TaskSetManager can schedule the task at the proper locality > level. If the preferred node(s) unfortunately never get added, > TaskSetManager can still schedule the task at locality level “ANY”. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Closed] (SPARK-1929) DAGScheduler suspended by local task OOM
[ https://issues.apache.org/jira/browse/SPARK-1929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guoqiang Li closed SPARK-1929. -- Resolution: Duplicate > DAGScheduler suspended by local task OOM > > > Key: SPARK-1929 > URL: https://issues.apache.org/jira/browse/SPARK-1929 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 0.9.0 >Reporter: Zhen Peng > Fix For: 1.0.0 > > > DAGScheduler does not handle local task OOM properly, and will wait for the > job result forever. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1929) DAGScheduler suspended by local task OOM
[ https://issues.apache.org/jira/browse/SPARK-1929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14009492#comment-14009492 ] Devaraj K commented on SPARK-1929: -- Duplicate of SPARK-1928. > DAGScheduler suspended by local task OOM > > > Key: SPARK-1929 > URL: https://issues.apache.org/jira/browse/SPARK-1929 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 0.9.0 >Reporter: Zhen Peng > Fix For: 1.0.0 > > > DAGScheduler does not handle local task OOM properly, and will wait for the > job result forever. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1937) Tasks can be submitted before executors are registered
[ https://issues.apache.org/jira/browse/SPARK-1937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rui Li updated SPARK-1937: -- Attachment: Before-patch.png > Tasks can be submitted before executors are registered > -- > > Key: SPARK-1937 > URL: https://issues.apache.org/jira/browse/SPARK-1937 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Rui Li > Attachments: After-patch.PNG, Before-patch.png > > > During construction, TaskSetManager will assign tasks to several pending > lists according to the tasks’ preferred locations. If the desired location is > unavailable, it’ll then assign this task to “pendingTasksWithNoPrefs”, a list > containing tasks without preferred locations. > The problem is that tasks may be submitted before the executors get > registered with the driver, in which case TaskSetManager will assign all the > tasks to pendingTasksWithNoPrefs. Later, when it looks for a task to schedule, > it will pick one from this list and assign it to an arbitrary executor, since > TaskSetManager considers that the tasks can run equally well on any node. > This problem forfeits the benefits of data locality, slows the whole job down, > and can cause imbalance between executors. > I ran into this issue when running a Spark program on a 7-node cluster > (node6~node12). The program processes 100 GB of data. > Since the data is uploaded to HDFS from node6, this node has a complete copy > of the data and, as a result, node6 finishes tasks much faster, which in turn > makes it complete disproportionately more tasks than other nodes. > To solve this issue, I think we shouldn't check the availability of > executors/hosts when constructing TaskSetManager. If a task prefers a node, > we simply add the task to that node’s pending list. When the node is later > added, TaskSetManager can schedule the task at the proper locality > level. If the preferred node(s) unfortunately never get added, > TaskSetManager can still schedule the task at locality level “ANY”. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1937) Tasks can be submitted before executors are registered
[ https://issues.apache.org/jira/browse/SPARK-1937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14009490#comment-14009490 ] Rui Li commented on SPARK-1937: --- Here's a quick fix: https://github.com/apache/spark/pull/892 > Tasks can be submitted before executors are registered > -- > > Key: SPARK-1937 > URL: https://issues.apache.org/jira/browse/SPARK-1937 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Rui Li > Attachments: After-patch.PNG, Before-patch.png > > > During construction, TaskSetManager will assign tasks to several pending > lists according to the tasks’ preferred locations. If the desired location is > unavailable, it’ll then assign this task to “pendingTasksWithNoPrefs”, a list > containing tasks without preferred locations. > The problem is that tasks may be submitted before the executors get > registered with the driver, in which case TaskSetManager will assign all the > tasks to pendingTasksWithNoPrefs. Later, when it looks for a task to schedule, > it will pick one from this list and assign it to an arbitrary executor, since > TaskSetManager considers that the tasks can run equally well on any node. > This problem forfeits the benefits of data locality, slows the whole job down, > and can cause imbalance between executors. > I ran into this issue when running a Spark program on a 7-node cluster > (node6~node12). The program processes 100 GB of data. > Since the data is uploaded to HDFS from node6, this node has a complete copy > of the data and, as a result, node6 finishes tasks much faster, which in turn > makes it complete disproportionately more tasks than other nodes. > To solve this issue, I think we shouldn't check the availability of > executors/hosts when constructing TaskSetManager. If a task prefers a node, > we simply add the task to that node’s pending list. When the node is later > added, TaskSetManager can schedule the task at the proper locality > level. If the preferred node(s) unfortunately never get added, > TaskSetManager can still schedule the task at locality level “ANY”. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1937) Tasks can be submitted before executors are registered
[ https://issues.apache.org/jira/browse/SPARK-1937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rui Li updated SPARK-1937: -- Attachment: After-patch.PNG > Tasks can be submitted before executors are registered > -- > > Key: SPARK-1937 > URL: https://issues.apache.org/jira/browse/SPARK-1937 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Rui Li > Attachments: After-patch.PNG, Before-patch.png > > > During construction, TaskSetManager will assign tasks to several pending > lists according to the tasks’ preferred locations. If the desired location is > unavailable, it’ll then assign this task to “pendingTasksWithNoPrefs”, a list > containing tasks without preferred locations. > The problem is that tasks may be submitted before the executors get > registered with the driver, in which case TaskSetManager will assign all the > tasks to pendingTasksWithNoPrefs. Later, when it looks for a task to schedule, > it will pick one from this list and assign it to an arbitrary executor, since > TaskSetManager considers that the tasks can run equally well on any node. > This problem forfeits the benefits of data locality, slows the whole job down, > and can cause imbalance between executors. > I ran into this issue when running a Spark program on a 7-node cluster > (node6~node12). The program processes 100 GB of data. > Since the data is uploaded to HDFS from node6, this node has a complete copy > of the data and, as a result, node6 finishes tasks much faster, which in turn > makes it complete disproportionately more tasks than other nodes. > To solve this issue, I think we shouldn't check the availability of > executors/hosts when constructing TaskSetManager. If a task prefers a node, > we simply add the task to that node’s pending list. When the node is later > added, TaskSetManager can schedule the task at the proper locality > level. If the preferred node(s) unfortunately never get added, > TaskSetManager can still schedule the task at locality level “ANY”. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1928) DAGScheduler suspended by local task OOM
[ https://issues.apache.org/jira/browse/SPARK-1928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14009477#comment-14009477 ] Zhen Peng commented on SPARK-1928: -- https://github.com/apache/spark/pull/883 > DAGScheduler suspended by local task OOM > > > Key: SPARK-1928 > URL: https://issues.apache.org/jira/browse/SPARK-1928 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 0.9.0 >Reporter: Zhen Peng > Fix For: 1.0.0 > > > DAGScheduler does not handle local task OOM properly, and will wait for the > job result forever. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1935) Explicitly add commons-codec 1.4 as a dependency
[ https://issues.apache.org/jira/browse/SPARK-1935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-1935: --- Assignee: Yin Huai > Explicitly add commons-codec 1.4 as a dependency > > > Key: SPARK-1935 > URL: https://issues.apache.org/jira/browse/SPARK-1935 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 0.9.1 >Reporter: Yin Huai >Assignee: Yin Huai >Priority: Minor > > Right now, commons-codec is a transitive dependency. When Spark is built by > maven for Hadoop 1, jets3t 0.7.1 will pull in commons-codec 1.3 which is an > older version (Hadoop 1.0.4 depends on 1.4). This older version can cause > problems because 1.4 introduces incompatible changes and new methods. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1935) Explicitly add commons-codec 1.4 as a dependency
[ https://issues.apache.org/jira/browse/SPARK-1935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14009472#comment-14009472 ] Sean Owen commented on SPARK-1935: -- Yeah, I think this is a matter of Maven's nearest-first vs SBT's latest-first conflict resolution strategy. It should be safe to manually manage this to 1.5, I believe. > Explicitly add commons-codec 1.4 as a dependency > > > Key: SPARK-1935 > URL: https://issues.apache.org/jira/browse/SPARK-1935 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 0.9.1 >Reporter: Yin Huai >Priority: Minor > > Right now, commons-codec is a transitive dependency. When Spark is built by > maven for Hadoop 1, jets3t 0.7.1 will pull in commons-codec 1.3 which is an > older version (Hadoop 1.0.4 depends on 1.4). This older version can cause > problems because 1.4 introduces incompatible changes and new methods. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1937) Tasks can be submitted before executors are registered
Rui Li created SPARK-1937: - Summary: Tasks can be submitted before executors are registered Key: SPARK-1937 URL: https://issues.apache.org/jira/browse/SPARK-1937 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Rui Li During construction, TaskSetManager will assign tasks to several pending lists according to the tasks’ preferred locations. If the desired location is unavailable, it’ll then assign this task to “pendingTasksWithNoPrefs”, a list containing tasks without preferred locations. The problem is that tasks may be submitted before the executors get registered with the driver, in which case TaskSetManager will assign all the tasks to pendingTasksWithNoPrefs. Later, when it looks for a task to schedule, it will pick one from this list and assign it to an arbitrary executor, since TaskSetManager considers that the tasks can run equally well on any node. This problem forfeits the benefits of data locality, slows the whole job down, and can cause imbalance between executors. I ran into this issue when running a Spark program on a 7-node cluster (node6~node12). The program processes 100 GB of data. Since the data is uploaded to HDFS from node6, this node has a complete copy of the data and, as a result, node6 finishes tasks much faster, which in turn makes it complete disproportionately more tasks than other nodes. To solve this issue, I think we shouldn't check the availability of executors/hosts when constructing TaskSetManager. If a task prefers a node, we simply add the task to that node’s pending list. When the node is later added, TaskSetManager can schedule the task at the proper locality level. If the preferred node(s) unfortunately never get added, TaskSetManager can still schedule the task at locality level “ANY”. -- This message was sent by Atlassian JIRA (v6.2#6252)
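The proposal in the last paragraph can be sketched with simplified structures (hypothetical names, not the real TaskSetManager fields): file each task under every preferred host whether or not that host has a registered executor yet, and reserve the no-prefs list for tasks that genuinely have no preference.

```scala
import scala.collection.mutable

val pendingTasksForHost     = mutable.HashMap.empty[String, List[Int]]
val pendingTasksWithNoPrefs = mutable.ListBuffer.empty[Int]

// A task with preferred hosts always lands in those hosts' lists, even if
// the host is not (yet) registered; when an executor on that host appears,
// the scheduler can serve the task at the right locality level, falling
// back to ANY only if the host never shows up.
def addPendingTask(index: Int, preferredHosts: Seq[String]): Unit =
  if (preferredHosts.isEmpty) pendingTasksWithNoPrefs += index
  else preferredHosts.foreach { h =>
    pendingTasksForHost(h) = index :: pendingTasksForHost.getOrElse(h, Nil)
  }
```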
[jira] [Commented] (SPARK-1936) Add apache header and remove author tags
[ https://issues.apache.org/jira/browse/SPARK-1936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14009352#comment-14009352 ] Devaraj K commented on SPARK-1936: -- Pull Request added : https://github.com/apache/spark/pull/890 > Add apache header and remove author tags > > > Key: SPARK-1936 > URL: https://issues.apache.org/jira/browse/SPARK-1936 > Project: Spark > Issue Type: Bug >Reporter: Devaraj K >Priority: Minor > > These below files don’t have apache header and contain author tags. > {code:xml} > spark\repl\src\main\scala\org\apache\spark\repl\SparkExprTyper.scala > spark\repl\src\main\scala\org\apache\spark\repl\SparkILoop.scala > spark\repl\src\main\scala\org\apache\spark\repl\SparkILoopInit.scala > spark\repl\src\main\scala\org\apache\spark\repl\SparkIMain.scala > spark\repl\src\main\scala\org\apache\spark\repl\SparkImports.scala > spark\repl\src\main\scala\org\apache\spark\repl\SparkJLineCompletion.scala > spark\repl\src\main\scala\org\apache\spark\repl\SparkJLineReader.scala > spark\repl\src\main\scala\org\apache\spark\repl\SparkMemberHandlers.scala > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
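The two conditions the pull request addresses can be checked mechanically; below is a sketch operating on file contents as strings ({{headerViolations}} is a hypothetical helper, not part of the PR):

```scala
// Flags a source file that lacks the standard ASF header or carries an
// @author tag (both against the Apache/Spark source conventions).
def headerViolations(source: String): List[String] = {
  val issues = List.newBuilder[String]
  if (!source.contains("Licensed to the Apache Software Foundation"))
    issues += "missing Apache license header"
  if (source.contains("@author"))
    issues += "contains @author tag"
  issues.result()
}
```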