[jira] [Created] (SPARK-2705) Wrong stage description in Web UI
Cheng Lian created SPARK-2705: - Summary: Wrong stage description in Web UI Key: SPARK-2705 URL: https://issues.apache.org/jira/browse/SPARK-2705 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.1, 1.0.2 Reporter: Cheng Lian Priority: Minor The type of the stage description object in the stage table of the Web UI should be a {{String}}, but an {{Option\[String\]}} is used. See [here|https://github.com/apache/spark/blob/aaf2b735fddbebccd28012006ee4647af3b3624f/core/src/main/scala/org/apache/spark/ui/jobs/StageTable.scala#L125]. -- This message was sent by Atlassian JIRA (v6.2#6252)
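As a minimal sketch of what the description implies (not the actual StageTable code), the description column needs the {{Option\[String\]}} unwrapped to a plain {{String}} before rendering, for example via {{getOrElse}}; the helper and fallback name below are hypothetical.
{code}
// Minimal sketch only: render a stage's Option[String] description as a String
// instead of letting the Option itself leak into the UI (e.g. "Some(...)").
object StageDescriptionExample {
  def describe(description: Option[String], fallback: String): String =
    description.getOrElse(fallback)  // unwrap, falling back to a plain stage name

  def main(args: Array[String]): Unit = {
    println(describe(Some("count at App.scala:42"), "Stage 3"))  // count at App.scala:42
    println(describe(None, "Stage 3"))                           // Stage 3
  }
}
{code}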
[jira] [Commented] (SPARK-2705) Wrong stage description in Web UI
[ https://issues.apache.org/jira/browse/SPARK-2705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14075609#comment-14075609 ] Cheng Lian commented on SPARK-2705: --- PR: https://github.com/apache/spark/pull/1524 Wrong stage description in Web UI -- Key: SPARK-2705 URL: https://issues.apache.org/jira/browse/SPARK-2705 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.1, 1.0.2 Reporter: Cheng Lian Priority: Minor The type of the stage description object in the stage table of the Web UI should be a {{String}}, but an {{Option\[String\]}} is used. See [here|https://github.com/apache/spark/blob/aaf2b735fddbebccd28012006ee4647af3b3624f/core/src/main/scala/org/apache/spark/ui/jobs/StageTable.scala#L125]. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2706) Enable Spark to support Hive 0.13
Chunjun Xiao created SPARK-2706: --- Summary: Enable Spark to support Hive 0.13 Key: SPARK-2706 URL: https://issues.apache.org/jira/browse/SPARK-2706 Project: Spark Issue Type: Dependency upgrade Components: SQL Affects Versions: 1.0.1 Reporter: Chunjun Xiao It seems Spark cannot work with Hive 0.13 well. When I compiled Spark with Hive 0.13.1, I got some error messages, as attached below. So, when can Spark be enabled to support Hive 0.13? Compiling Error: {quote} [ERROR] /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala:180: type mismatch; found : String required: Array[String] [ERROR] val proc: CommandProcessor = CommandProcessorFactory.get(tokens(0), hiveconf) [ERROR] ^ [ERROR] /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala:264: overloaded method constructor TableDesc with alternatives: (x$1: Class[_ : org.apache.hadoop.mapred.InputFormat[_, _]],x$2: Class[_],x$3: java.util.Properties)org.apache.hadoop.hive.ql.plan.TableDesc and ()org.apache.hadoop.hive.ql.plan.TableDesc cannot be applied to (Class[org.apache.hadoop.hive.serde2.Deserializer], Class[(some other)?0(in value tableDesc)(in value tableDesc)], Class[?0(in value tableDesc)(in value tableDesc)], java.util.Properties) [ERROR] val tableDesc = new TableDesc( [ERROR] ^ [ERROR] /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala:140: value getPartitionPath is not a member of org.apache.hadoop.hive.ql.metadata.Partition [ERROR] val partPath = partition.getPartitionPath [ERROR]^ [ERROR] /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveTableScan.scala:132: value appendReadColumnNames is not a member of object org.apache.hadoop.hive.serde2.ColumnProjectionUtils [ERROR] ColumnProjectionUtils.appendReadColumnNames(hiveConf, attributes.map(_.name)) [ERROR] ^ [ERROR] /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala:79: org.apache.hadoop.hive.common.type.HiveDecimal does not have a constructor [ERROR] new HiveDecimal(bd.underlying()) [ERROR] ^ [ERROR] /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala:132: type mismatch; found : org.apache.hadoop.fs.Path required: String [ERROR] SparkHiveHadoopWriter.createPathFromString(fileSinkConf.getDirName, conf)) [ERROR] ^ [ERROR] /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala:179: value getExternalTmpFileURI is not a member of org.apache.hadoop.hive.ql.Context [ERROR] val tmpLocation = hiveContext.getExternalTmpFileURI(tableLocation) [ERROR] ^ [ERROR] /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveUdfs.scala:209: org.apache.hadoop.hive.common.type.HiveDecimal does not have a constructor [ERROR] case bd: BigDecimal = new HiveDecimal(bd.underlying()) [ERROR] ^ [ERROR] 8 errors found [DEBUG] Compilation failed (CompilerInterface) [INFO] [INFO] Reactor Summary: [INFO] [INFO] Spark Project Parent POM .. SUCCESS [2.579s] [INFO] Spark Project Core SUCCESS [2:39.805s] [INFO] Spark Project Bagel ... SUCCESS [21.148s] [INFO] Spark Project GraphX .. SUCCESS [59.950s] [INFO] Spark Project ML Library .. SUCCESS [1:08.771s] [INFO] Spark Project Streaming ... SUCCESS [1:17.759s] [INFO] Spark Project Tools ... SUCCESS [15.405s] [INFO] Spark Project Catalyst SUCCESS [1:17.405s] [INFO] Spark Project SQL . SUCCESS [1:11.094s] [INFO] Spark Project Hive FAILURE [11.121s] [INFO] Spark Project REPL SKIPPED [INFO] Spark Project YARN Parent POM . 
SKIPPED [INFO] Spark Project YARN Stable API . SKIPPED [INFO] Spark Project Assembly SKIPPED [INFO] Spark Project External Twitter SKIPPED [INFO] Spark Project External Kafka .. SKIPPED [INFO] Spark Project External Flume .. SKIPPED [INFO] Spark Project External ZeroMQ . SKIPPED [INFO] Spark Project External MQTT ... SKIPPED [INFO] Spark Project Examples SKIPPED
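The compile errors above are API incompatibilities between the Hive 0.12 interfaces Spark builds against and Hive 0.13 (changed method signatures, removed helpers, a removed HiveDecimal constructor, and so on). Purely as an illustration of one way projects bridge such version differences, the sketch below shows a generic reflection-based helper that locates whichever overload exists at runtime; it is hypothetical and not Spark's actual Hive integration code.
{code}
import java.lang.reflect.Method

// Illustrative sketch only: resolve a method by name and argument count at
// runtime, so the same calling code can tolerate either Hive 0.12 or 0.13
// signatures. The arity-based lookup is a simplifying assumption.
object ReflectiveShim {
  def findMethod(target: AnyRef, name: String, argCount: Int): Option[Method] =
    target.getClass.getMethods.find { m =>
      m.getName == name && m.getParameterTypes.length == argCount
    }

  def invoke(target: AnyRef, name: String, args: AnyRef*): AnyRef =
    findMethod(target, name, args.length)
      .map(_.invoke(target, args: _*))
      .getOrElse(throw new NoSuchMethodException(
        s"$name with ${args.length} argument(s) not found on ${target.getClass.getName}"))
}
{code}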
[jira] [Resolved] (SPARK-2679) Ser/De for Double to enable calling Java API from python in MLlib
[ https://issues.apache.org/jira/browse/SPARK-2679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-2679. -- Resolution: Fixed Fix Version/s: 1.1.0 Issue resolved by pull request 1581 [https://github.com/apache/spark/pull/1581] Ser/De for Double to enable calling Java API from python in MLlib - Key: SPARK-2679 URL: https://issues.apache.org/jira/browse/SPARK-2679 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Doris Xin Fix For: 1.1.0 In order to enable Java/Scala APIs to be reused in the Python implementation of RandomRDD and Correlations, we need a set of ser/de for the type Double in _common.py and PythonMLLibAPI. -- This message was sent by Atlassian JIRA (v6.2#6252)
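For context on what a "ser/de for Double" involves, the sketch below shows the general shape of such an encoding on the JVM side: a tagged byte array that a Python counterpart could mirror. The magic byte and layout are illustrative assumptions, not the actual format used by _common.py or PythonMLLibAPI.
{code}
import java.nio.{ByteBuffer, ByteOrder}

// Illustrative sketch only: a tagged, fixed-width encoding of a Double.
// The type tag value and byte layout are assumptions, not MLlib's real format.
object DoubleSerDeSketch {
  private val DoubleMagic: Byte = 0x7f.toByte  // hypothetical type tag

  def serializeDouble(value: Double): Array[Byte] = {
    val buf = ByteBuffer.allocate(9).order(ByteOrder.BIG_ENDIAN)  // 1 tag byte + 8-byte double
    buf.put(DoubleMagic).putDouble(value)
    buf.array()
  }

  def deserializeDouble(bytes: Array[Byte]): Double = {
    val buf = ByteBuffer.wrap(bytes).order(ByteOrder.BIG_ENDIAN)
    require(buf.get() == DoubleMagic, "not a serialized Double")
    buf.getDouble
  }

  def main(args: Array[String]): Unit = {
    assert(deserializeDouble(serializeDouble(math.Pi)) == math.Pi)
  }
}
{code}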
[jira] [Updated] (SPARK-2679) Ser/De for Double to enable calling Java API from python in MLlib
[ https://issues.apache.org/jira/browse/SPARK-2679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-2679: - Assignee: Doris Xin Ser/De for Double to enable calling Java API from python in MLlib - Key: SPARK-2679 URL: https://issues.apache.org/jira/browse/SPARK-2679 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Doris Xin Assignee: Doris Xin Fix For: 1.1.0 In order to enable Java/Scala APIs to be reused in the Python implementation of RandomRDD and Correlations, we need a set of ser/de for the type Double in _common.py and PythonMLLibAPI. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2681) Spark can hang when fetching shuffle blocks
[ https://issues.apache.org/jira/browse/SPARK-2681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guoqiang Li updated SPARK-2681: --- Affects Version/s: (was: 1.0.0) 1.0.1 Spark can hang when fetching shuffle blocks --- Key: SPARK-2681 URL: https://issues.apache.org/jira/browse/SPARK-2681 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.1 Reporter: Guoqiang Li Priority: Blocker executor log : {noformat} 14/07/24 22:56:52 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 53628 14/07/24 22:56:52 INFO executor.Executor: Running task ID 53628 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_3 locally 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_18 locally 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_16 locally 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_19 locally 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_20 locally 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_21 locally 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_22 locally 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_3 locally 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_18 locally 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_16 locally 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_19 locally 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_20 locally 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_21 locally 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_22 locally 14/07/24 22:56:52 INFO spark.MapOutputTrackerWorker: Updating epoch to 236 and clearing cache 14/07/24 22:56:52 INFO spark.CacheManager: Partition rdd_51_83 not found, computing it 14/07/24 22:56:52 INFO spark.MapOutputTrackerWorker: Don't have map outputs for shuffle 9, fetching them 14/07/24 22:56:52 INFO spark.MapOutputTrackerWorker: Doing the fetch; tracker actor = Actor[akka.tcp://spark@tuan202:49488/user/MapOutputTracker#-1031481395] 14/07/24 22:56:53 INFO spark.MapOutputTrackerWorker: Got the output locations 14/07/24 22:56:53 INFO storage.BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329 14/07/24 22:56:53 INFO storage.BlockFetcherIterator$BasicBlockFetcherIterator: Getting 1024 non-empty blocks out of 1024 blocks 14/07/24 22:56:53 INFO storage.BlockFetcherIterator$BasicBlockFetcherIterator: Started 58 remote fetches in 8 ms 14/07/24 22:56:55 INFO storage.MemoryStore: ensureFreeSpace(28728) called with curMem=920109320, maxMem=4322230272 14/07/24 22:56:55 INFO storage.MemoryStore: Block rdd_51_83 stored as values to memory (estimated size 28.1 KB, free 3.2 GB) 14/07/24 22:56:55 INFO storage.BlockManagerMaster: Updated info of block rdd_51_83 14/07/24 22:56:55 INFO spark.CacheManager: Partition rdd_189_83 not found, computing it 14/07/24 22:56:55 INFO spark.MapOutputTrackerWorker: Don't have map outputs for shuffle 28, fetching them 14/07/24 22:56:55 INFO spark.MapOutputTrackerWorker: Doing the fetch; tracker actor = Actor[akka.tcp://spark@tuan202:49488/user/MapOutputTracker#-1031481395] 14/07/24 22:56:55 INFO spark.MapOutputTrackerWorker: Got the output locations 14/07/24 22:56:55 INFO storage.BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329 14/07/24 22:56:55 INFO 
storage.BlockFetcherIterator$BasicBlockFetcherIterator: Getting 1 non-empty blocks out of 1024 blocks 14/07/24 22:56:55 INFO storage.BlockFetcherIterator$BasicBlockFetcherIterator: Started 1 remote fetches in 0 ms 14/07/24 22:56:55 INFO spark.CacheManager: Partition rdd_50_83 not found, computing it 14/07/24 22:56:55 INFO storage.BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329 14/07/24 22:56:55 INFO storage.BlockFetcherIterator$BasicBlockFetcherIterator: Getting 1024 non-empty blocks out of 1024 blocks 14/07/24 22:56:55 INFO storage.BlockFetcherIterator$BasicBlockFetcherIterator: Started 58 remote fetches in 4 ms 14/07/24 22:57:09 INFO network.ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(tuan221,51153) 14/07/24 22:57:09 INFO network.ConnectionManager: Removing SendingConnection to ConnectionManagerId(tuan221,51153) 14/07/24 22:57:09 INFO network.ConnectionManager: Removing SendingConnection to ConnectionManagerId(tuan221,51153) 14/07/24 23:05:07 INFO network.ConnectionManager: Key not valid ? sun.nio.ch.SelectionKeyImpl@3dcc1da1 14/07/24 23:05:07 INFO
[jira] [Updated] (SPARK-2681) Spark can hang when fetching shuffle blocks
[ https://issues.apache.org/jira/browse/SPARK-2681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guoqiang Li updated SPARK-2681: --- Attachment: jstack-26027.log [~pwendell] Jstack output has been uploaded. Spark can hang when fetching shuffle blocks --- Key: SPARK-2681 URL: https://issues.apache.org/jira/browse/SPARK-2681 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.1 Reporter: Guoqiang Li Priority: Blocker Attachments: jstack-26027.log executor log : {noformat} 14/07/24 22:56:52 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 53628 14/07/24 22:56:52 INFO executor.Executor: Running task ID 53628 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_3 locally 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_18 locally 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_16 locally 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_19 locally 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_20 locally 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_21 locally 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_22 locally 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_3 locally 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_18 locally 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_16 locally 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_19 locally 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_20 locally 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_21 locally 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_22 locally 14/07/24 22:56:52 INFO spark.MapOutputTrackerWorker: Updating epoch to 236 and clearing cache 14/07/24 22:56:52 INFO spark.CacheManager: Partition rdd_51_83 not found, computing it 14/07/24 22:56:52 INFO spark.MapOutputTrackerWorker: Don't have map outputs for shuffle 9, fetching them 14/07/24 22:56:52 INFO spark.MapOutputTrackerWorker: Doing the fetch; tracker actor = Actor[akka.tcp://spark@tuan202:49488/user/MapOutputTracker#-1031481395] 14/07/24 22:56:53 INFO spark.MapOutputTrackerWorker: Got the output locations 14/07/24 22:56:53 INFO storage.BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329 14/07/24 22:56:53 INFO storage.BlockFetcherIterator$BasicBlockFetcherIterator: Getting 1024 non-empty blocks out of 1024 blocks 14/07/24 22:56:53 INFO storage.BlockFetcherIterator$BasicBlockFetcherIterator: Started 58 remote fetches in 8 ms 14/07/24 22:56:55 INFO storage.MemoryStore: ensureFreeSpace(28728) called with curMem=920109320, maxMem=4322230272 14/07/24 22:56:55 INFO storage.MemoryStore: Block rdd_51_83 stored as values to memory (estimated size 28.1 KB, free 3.2 GB) 14/07/24 22:56:55 INFO storage.BlockManagerMaster: Updated info of block rdd_51_83 14/07/24 22:56:55 INFO spark.CacheManager: Partition rdd_189_83 not found, computing it 14/07/24 22:56:55 INFO spark.MapOutputTrackerWorker: Don't have map outputs for shuffle 28, fetching them 14/07/24 22:56:55 INFO spark.MapOutputTrackerWorker: Doing the fetch; tracker actor = Actor[akka.tcp://spark@tuan202:49488/user/MapOutputTracker#-1031481395] 14/07/24 22:56:55 INFO spark.MapOutputTrackerWorker: Got the output locations 14/07/24 22:56:55 INFO storage.BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329 
14/07/24 22:56:55 INFO storage.BlockFetcherIterator$BasicBlockFetcherIterator: Getting 1 non-empty blocks out of 1024 blocks 14/07/24 22:56:55 INFO storage.BlockFetcherIterator$BasicBlockFetcherIterator: Started 1 remote fetches in 0 ms 14/07/24 22:56:55 INFO spark.CacheManager: Partition rdd_50_83 not found, computing it 14/07/24 22:56:55 INFO storage.BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329 14/07/24 22:56:55 INFO storage.BlockFetcherIterator$BasicBlockFetcherIterator: Getting 1024 non-empty blocks out of 1024 blocks 14/07/24 22:56:55 INFO storage.BlockFetcherIterator$BasicBlockFetcherIterator: Started 58 remote fetches in 4 ms 14/07/24 22:57:09 INFO network.ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(tuan221,51153) 14/07/24 22:57:09 INFO network.ConnectionManager: Removing SendingConnection to ConnectionManagerId(tuan221,51153) 14/07/24 22:57:09 INFO network.ConnectionManager: Removing SendingConnection to ConnectionManagerId(tuan221,51153) 14/07/24 23:05:07 INFO network.ConnectionManager: Key not valid ?
[jira] [Commented] (SPARK-2532) Fix issues with consolidated shuffle
[ https://issues.apache.org/jira/browse/SPARK-2532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14075639#comment-14075639 ] Apache Spark commented on SPARK-2532: - User 'mridulm' has created a pull request for this issue: https://github.com/apache/spark/pull/1609 Fix issues with consolidated shuffle Key: SPARK-2532 URL: https://issues.apache.org/jira/browse/SPARK-2532 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Environment: All Reporter: Mridul Muralidharan Assignee: Mridul Muralidharan Priority: Critical Fix For: 1.1.0 Will file a PR with changes as soon as the merge is done (the earlier merge unfortunately became outdated in 2 weeks :) ). Consolidated shuffle is broken in multiple ways in Spark: a) Task failure(s) can cause the state to become inconsistent. b) Multiple reverts, or combinations of close/revert/close, can cause the state to be inconsistent (as part of exception/error handling). c) Some of the APIs in the block writer cause implementation issues - for example, a revert is always followed by a close, but the implementation tries to keep them separate, creating surface area for errors. d) Fetching data from consolidated shuffle files can go badly wrong if the file is being actively written to: the segment length is computed by subtracting the current offset from the next offset (or from the file length if this is the last offset), and the latter fails when a fetch happens in parallel with a write. Note that this happens even if there are no task failures of any kind! This usually results in stream corruption or decompression errors. -- This message was sent by Atlassian JIRA (v6.2#6252)
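Point (d) above is worth a small illustration: when a segment's length is derived from the offset of the following segment (or from the file length for the last segment), a concurrent append can make that derived length wrong. The sketch below uses hypothetical names and only illustrates the hazard; it is not the actual shuffle code.
{code}
// Illustrative sketch only (hypothetical names): deriving a segment's length
// from its neighbour's offset is racy while the consolidated file is still
// being appended to, because the observed "end" may belong to a segment that
// is only partially written.
case class FileSegment(offset: Long, length: Long)

def segmentFor(index: Int, offsets: IndexedSeq[Long], fileLength: Long): FileSegment = {
  val start = offsets(index)
  // Racy under concurrent writes: either branch can observe a boundary that
  // does not yet correspond to a fully written segment.
  val end = if (index + 1 < offsets.length) offsets(index + 1) else fileLength
  FileSegment(start, end - start)
}
{code}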
[jira] [Created] (SPARK-2707) Upgrade to Akka 2.3
Yardena created SPARK-2707: -- Summary: Upgrade to Akka 2.3 Key: SPARK-2707 URL: https://issues.apache.org/jira/browse/SPARK-2707 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.0.0 Reporter: Yardena Upgrade Akka from 2.2 to 2.3. We want to be able to use new Akka and Spray features directly in the same project. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2707) Upgrade to Akka 2.3
[ https://issues.apache.org/jira/browse/SPARK-2707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14075648#comment-14075648 ] Yardena commented on SPARK-2707: Some minor source changes may be required, see http://doc.akka.io/docs/akka/snapshot/project/migration-guide-2.2.x-2.3.x.html Upgrade to Akka 2.3 --- Key: SPARK-2707 URL: https://issues.apache.org/jira/browse/SPARK-2707 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.0.0 Reporter: Yardena Upgrade Akka from 2.2 to 2.3. We want to be able to use new Akka and Spray features directly in the same project. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2707) Upgrade to Akka 2.3
[ https://issues.apache.org/jira/browse/SPARK-2707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14075652#comment-14075652 ] Aaron Davidson commented on SPARK-2707: --- It does sound mostly mechanical and I believe we don't use most of those features. Perhaps just getting it to compile (while re-shading protobuf) would be sufficient to make it work. Upgrade to Akka 2.3 --- Key: SPARK-2707 URL: https://issues.apache.org/jira/browse/SPARK-2707 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.0.0 Reporter: Yardena Upgrade Akka from 2.2 to 2.3. We want to be able to use new Akka and Spray features directly in the same project. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2550) Support regularization and intercept in pyspark's linear methods
[ https://issues.apache.org/jira/browse/SPARK-2550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14075654#comment-14075654 ] Michael Yannakopoulos commented on SPARK-2550: -- Hi Xiangrui, Is it just my problem, or does building the whole project with the command 'sbt/sbt assembly' fail in general? From what I can see, the errors come from the patches related to the UI and WebUI functionality. Thanks, Michael Support regularization and intercept in pyspark's linear methods Key: SPARK-2550 URL: https://issues.apache.org/jira/browse/SPARK-2550 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Affects Versions: 1.0.0 Reporter: Xiangrui Meng Assignee: Michael Yannakopoulos Python API doesn't provide options to set regularization parameter and intercept in linear methods, which should be fixed in v1.1. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2677) BasicBlockFetchIterator#next can wait forever
[ https://issues.apache.org/jira/browse/SPARK-2677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guoqiang Li updated SPARK-2677: --- Affects Version/s: 1.0.1 BasicBlockFetchIterator#next can wait forever - Key: SPARK-2677 URL: https://issues.apache.org/jira/browse/SPARK-2677 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0, 1.0.1 Reporter: Kousuke Saruta Priority: Blocker Fix For: 1.1.0, 1.0.3 In BasicBlockFetchIterator#next, it waits for a fetch result on results.take(). {code} override def next(): (BlockId, Option[Iterator[Any]]) = { resultsGotten += 1 val startFetchWait = System.currentTimeMillis() val result = results.take() val stopFetchWait = System.currentTimeMillis() _fetchWaitTime += (stopFetchWait - startFetchWait) if (!result.failed) bytesInFlight -= result.size while (!fetchRequests.isEmpty && (bytesInFlight == 0 || bytesInFlight + fetchRequests.front.size <= maxBytesInFlight)) { sendRequest(fetchRequests.dequeue()) } (result.blockId, if (result.failed) None else Some(result.deserialize())) } {code} But results is implemented as a LinkedBlockingQueue, so if the remote executor hangs up, the fetching executor waits forever. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2677) BasicBlockFetchIterator#next can wait forever
[ https://issues.apache.org/jira/browse/SPARK-2677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2677: --- Affects Version/s: 0.9.2 BasicBlockFetchIterator#next can wait forever - Key: SPARK-2677 URL: https://issues.apache.org/jira/browse/SPARK-2677 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.2, 1.0.0, 1.0.1 Reporter: Kousuke Saruta Priority: Blocker In BasicBlockFetchIterator#next, it waits for a fetch result on results.take(). {code} override def next(): (BlockId, Option[Iterator[Any]]) = { resultsGotten += 1 val startFetchWait = System.currentTimeMillis() val result = results.take() val stopFetchWait = System.currentTimeMillis() _fetchWaitTime += (stopFetchWait - startFetchWait) if (!result.failed) bytesInFlight -= result.size while (!fetchRequests.isEmpty && (bytesInFlight == 0 || bytesInFlight + fetchRequests.front.size <= maxBytesInFlight)) { sendRequest(fetchRequests.dequeue()) } (result.blockId, if (result.failed) None else Some(result.deserialize())) } {code} But results is implemented as a LinkedBlockingQueue, so if the remote executor hangs up, the fetching executor waits forever. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2677) BasicBlockFetchIterator#next can wait forever
[ https://issues.apache.org/jira/browse/SPARK-2677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14075672#comment-14075672 ] Patrick Wendell commented on SPARK-2677: Just as an FYI - this has also been observed in several earlier versions of Spark. I think one issue is that we don't have timeouts in the connection manager code. If a JVM goes into GC thrashing and becomes unresponsive (but still alive), then you can get stuck here. BasicBlockFetchIterator#next can wait forever - Key: SPARK-2677 URL: https://issues.apache.org/jira/browse/SPARK-2677 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.2, 1.0.0, 1.0.1 Reporter: Kousuke Saruta Priority: Blocker In BasicBlockFetchIterator#next, it waits for a fetch result on results.take(). {code} override def next(): (BlockId, Option[Iterator[Any]]) = { resultsGotten += 1 val startFetchWait = System.currentTimeMillis() val result = results.take() val stopFetchWait = System.currentTimeMillis() _fetchWaitTime += (stopFetchWait - startFetchWait) if (!result.failed) bytesInFlight -= result.size while (!fetchRequests.isEmpty && (bytesInFlight == 0 || bytesInFlight + fetchRequests.front.size <= maxBytesInFlight)) { sendRequest(fetchRequests.dequeue()) } (result.blockId, if (result.failed) None else Some(result.deserialize())) } {code} But results is implemented as a LinkedBlockingQueue, so if the remote executor hangs up, the fetching executor waits forever. -- This message was sent by Atlassian JIRA (v6.2#6252)
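As an illustration of the missing-timeout point above (a sketch only, not the actual fix), the blocking {{results.take()}} in the quoted snippet could become a bounded wait so that an unresponsive remote executor surfaces as an error instead of an indefinite hang; the types and timeout below are hypothetical stand-ins.
{code}
import java.util.concurrent.{LinkedBlockingQueue, TimeUnit, TimeoutException}

// Sketch only: bound the wait on the results queue so a dead or GC-thrashing
// remote executor turns into a reportable failure rather than an infinite hang.
// FetchResult is a stand-in type, not the real Spark class.
object BoundedFetchWaitSketch {
  case class FetchResult(blockId: String, failed: Boolean)

  val results = new LinkedBlockingQueue[FetchResult]()

  def nextResult(timeoutSeconds: Long): FetchResult = {
    val result = results.poll(timeoutSeconds, TimeUnit.SECONDS)  // instead of results.take()
    if (result == null) {
      throw new TimeoutException(
        s"No fetch result within $timeoutSeconds s; remote executor may be unresponsive")
    }
    result
  }
}
{code}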
[jira] [Updated] (SPARK-2677) BasicBlockFetchIterator#next can wait forever
[ https://issues.apache.org/jira/browse/SPARK-2677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2677: --- Target Version/s: 1.1.0, 1.0.3 Fix Version/s: (was: 1.0.3) (was: 1.1.0) BasicBlockFetchIterator#next can wait forever - Key: SPARK-2677 URL: https://issues.apache.org/jira/browse/SPARK-2677 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.2, 1.0.0, 1.0.1 Reporter: Kousuke Saruta Priority: Blocker In BasicBlockFetchIterator#next, it waits for a fetch result on results.take(). {code} override def next(): (BlockId, Option[Iterator[Any]]) = { resultsGotten += 1 val startFetchWait = System.currentTimeMillis() val result = results.take() val stopFetchWait = System.currentTimeMillis() _fetchWaitTime += (stopFetchWait - startFetchWait) if (!result.failed) bytesInFlight -= result.size while (!fetchRequests.isEmpty && (bytesInFlight == 0 || bytesInFlight + fetchRequests.front.size <= maxBytesInFlight)) { sendRequest(fetchRequests.dequeue()) } (result.blockId, if (result.failed) None else Some(result.deserialize())) } {code} But results is implemented as a LinkedBlockingQueue, so if the remote executor hangs up, the fetching executor waits forever. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2677) BasicBlockFetchIterator#next can wait forever
[ https://issues.apache.org/jira/browse/SPARK-2677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14075673#comment-14075673 ] Guoqiang Li commented on SPARK-2677: If {{yarn.scheduler.fair.preemption}} is set to true in YARN, this issue will appear frequently. BasicBlockFetchIterator#next can wait forever - Key: SPARK-2677 URL: https://issues.apache.org/jira/browse/SPARK-2677 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.2, 1.0.0, 1.0.1 Reporter: Kousuke Saruta Priority: Blocker In BasicBlockFetchIterator#next, it waits for a fetch result on results.take(). {code} override def next(): (BlockId, Option[Iterator[Any]]) = { resultsGotten += 1 val startFetchWait = System.currentTimeMillis() val result = results.take() val stopFetchWait = System.currentTimeMillis() _fetchWaitTime += (stopFetchWait - startFetchWait) if (!result.failed) bytesInFlight -= result.size while (!fetchRequests.isEmpty && (bytesInFlight == 0 || bytesInFlight + fetchRequests.front.size <= maxBytesInFlight)) { sendRequest(fetchRequests.dequeue()) } (result.blockId, if (result.failed) None else Some(result.deserialize())) } {code} But results is implemented as a LinkedBlockingQueue, so if the remote executor hangs up, the fetching executor waits forever. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-2677) BasicBlockFetchIterator#next can wait forever
[ https://issues.apache.org/jira/browse/SPARK-2677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14075673#comment-14075673 ] Guoqiang Li edited comment on SPARK-2677 at 7/27/14 6:17 PM: - If {{yarn.scheduler.fair.preemption}} is set to {{true}} in YARN, this issue will appear frequently. was (Author: gq): If {{yarn.scheduler.fair.preemption}} is set to true in YARN, this issue will appear frequently. BasicBlockFetchIterator#next can wait forever - Key: SPARK-2677 URL: https://issues.apache.org/jira/browse/SPARK-2677 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.2, 1.0.0, 1.0.1 Reporter: Kousuke Saruta Priority: Blocker In BasicBlockFetchIterator#next, it waits for a fetch result on results.take(). {code} override def next(): (BlockId, Option[Iterator[Any]]) = { resultsGotten += 1 val startFetchWait = System.currentTimeMillis() val result = results.take() val stopFetchWait = System.currentTimeMillis() _fetchWaitTime += (stopFetchWait - startFetchWait) if (!result.failed) bytesInFlight -= result.size while (!fetchRequests.isEmpty && (bytesInFlight == 0 || bytesInFlight + fetchRequests.front.size <= maxBytesInFlight)) { sendRequest(fetchRequests.dequeue()) } (result.blockId, if (result.failed) None else Some(result.deserialize())) } {code} But results is implemented as a LinkedBlockingQueue, so if the remote executor hangs up, the fetching executor waits forever. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2550) Support regularization and intercept in pyspark's linear methods
[ https://issues.apache.org/jira/browse/SPARK-2550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14075675#comment-14075675 ] Michael Yannakopoulos commented on SPARK-2550: -- The errors are related to the fact that the TaskUIData class has moved from 'org.apache.spark.ui.jobs.TaskUIData' to 'org.apache.spark.ui.jobs.UIData.TaskUIData'. Should I open a new JIRA task and resolve it? Support regularization and intercept in pyspark's linear methods Key: SPARK-2550 URL: https://issues.apache.org/jira/browse/SPARK-2550 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Affects Versions: 1.0.0 Reporter: Xiangrui Meng Assignee: Michael Yannakopoulos Python API doesn't provide options to set regularization parameter and intercept in linear methods, which should be fixed in v1.1. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2708) [APACHE-SPARK] [CORE] Build Fails: Case class 'TaskUIData' makes sbt complaining.
Michael Yannakopoulos created SPARK-2708: Summary: [APACHE-SPARK] [CORE] Build Fails: Case class 'TaskUIData' makes sbt complaining. Key: SPARK-2708 URL: https://issues.apache.org/jira/browse/SPARK-2708 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.1 Reporter: Michael Yannakopoulos Build procedure fails due to numerous errors appearing in files located in Apache Spark Core project's 'org.apache.spark.ui' directory where case class 'TaskUIData' appears to be undefined. However the problem seems more complicated since the class is imported correctly to the aforementioned files. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2688) Need a way to run multiple data pipeline concurrently
[ https://issues.apache.org/jira/browse/SPARK-2688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14075720#comment-14075720 ] Sean Owen commented on SPARK-2688: -- If you persist/cache rdd2, it is not recomputed. You can already execute operations in parallel within a SparkContext. Just execute them in parallel. Need a way to run multiple data pipeline concurrently - Key: SPARK-2688 URL: https://issues.apache.org/jira/browse/SPARK-2688 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.1 Reporter: Xuefu Zhang Suppose we want to do the following data processing:
{code}
rdd1 -> rdd2 -> rdd3
          | -> rdd4
          | -> rdd5
          \ -> rdd6
{code}
where -> represents a transformation. rdd3 to rdd6 are all derived from the intermediate rdd2. We use foreach(fn) with a dummy function to trigger the execution. However, rdd.foreach(fn) only triggers the pipeline rdd1 -> rdd2 -> rdd3. To make things worse, when we call rdd4.foreach(), rdd2 will be recomputed. This is very inefficient. Ideally, we should be able to trigger the execution of the whole graph and reuse rdd2, but there doesn't seem to be a way of doing so. Tez has already realized the importance of this (TEZ-391), so I think Spark should provide this too. This is required for Hive to support multi-insert queries. HIVE-7292. -- This message was sent by Atlassian JIRA (v6.2#6252)
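Sean Owen's suggestion can be sketched in user code roughly as follows, assuming an existing SparkContext {{sc}}; the input path and the stand-in transformations for rdd3..rdd6 are placeholders. Caching rdd2 keeps it from being recomputed, and submitting the downstream actions from separate threads (futures) lets them run as concurrent jobs on the same SparkContext.
{code}
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Hedged sketch, assuming an existing SparkContext `sc`; paths and
// transformations below are placeholders for the user's rdd1..rdd6.
val rdd1 = sc.textFile("hdfs:///input")        // hypothetical source
val rdd2 = rdd1.map(_.length).cache()          // shared intermediate result, computed once

val downstream = Seq(
  Future { rdd2.filter(_ > 10).count() },      // stands in for rdd3
  Future { rdd2.map(_ * 2).count() },          // stands in for rdd4
  Future { rdd2.distinct().count() },          // stands in for rdd5
  Future { rdd2.filter(_ % 2 == 0).count() }   // stands in for rdd6
)
val counts = Await.result(Future.sequence(downstream), Duration.Inf)
{code}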
[jira] [Commented] (SPARK-2707) Upgrade to Akka 2.3
[ https://issues.apache.org/jira/browse/SPARK-2707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14075721#comment-14075721 ] Patrick Wendell commented on SPARK-2707: What are the features we want to use here in the newer akka version? I sort of wonder whether we should just shade all of akka so that we don't expose it as an external API in Spark, and users can independently use whatever Akka version they want. Otherwise we won't ever be able to swap out our internal communication layer. Upgrade to Akka 2.3 --- Key: SPARK-2707 URL: https://issues.apache.org/jira/browse/SPARK-2707 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.0.0 Reporter: Yardena Upgrade Akka from 2.2 to 2.3. We want to be able to use new Akka and Spray features directly in the same project. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2705) Wrong stage description in Web UI
[ https://issues.apache.org/jira/browse/SPARK-2705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14075722#comment-14075722 ] Patrick Wendell commented on SPARK-2705: Fixed via: https://github.com/apache/spark/pull/1524 Wrong stage description in Web UI -- Key: SPARK-2705 URL: https://issues.apache.org/jira/browse/SPARK-2705 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.1, 1.0.2 Reporter: Cheng Lian Priority: Minor Fix For: 1.1.0 The type of the stage description object in the stage table of the Web UI should be a {{String}}, but an {{Option\[String\]}} is used. See [here|https://github.com/apache/spark/blob/aaf2b735fddbebccd28012006ee4647af3b3624f/core/src/main/scala/org/apache/spark/ui/jobs/StageTable.scala#L125]. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2705) Wrong stage description in Web UI
[ https://issues.apache.org/jira/browse/SPARK-2705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2705: --- Assignee: Cheng Lian Wrong stage description in Web UI -- Key: SPARK-2705 URL: https://issues.apache.org/jira/browse/SPARK-2705 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.1, 1.0.2 Reporter: Cheng Lian Assignee: Cheng Lian Priority: Minor Fix For: 1.1.0 The type of the stage description object in the stage table of the Web UI should be a {{String}}, but an {{Option\[String\]}} is used. See [here|https://github.com/apache/spark/blob/aaf2b735fddbebccd28012006ee4647af3b3624f/core/src/main/scala/org/apache/spark/ui/jobs/StageTable.scala#L125]. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2082) Stratified sampling implementation in PairRDDFunctions
[ https://issues.apache.org/jira/browse/SPARK-2082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2082: --- Component/s: MLlib Stratified sampling implementation in PairRDDFunctions -- Key: SPARK-2082 URL: https://issues.apache.org/jira/browse/SPARK-2082 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Doris Xin Assignee: Doris Xin Implementation of stratified sampling that guarantees exact sample size = sum(math.ceil(S_i*samplingRate)) where S_i is the size of each stratum. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1498) Spark can hang if pyspark tasks fail
[ https://issues.apache.org/jira/browse/SPARK-1498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1498: --- Component/s: PySpark Spark can hang if pyspark tasks fail Key: SPARK-1498 URL: https://issues.apache.org/jira/browse/SPARK-1498 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 0.9.0, 0.9.1, 0.9.2 Reporter: Kay Ousterhout Fix For: 1.0.0 In pyspark, when some kinds of jobs fail, Spark hangs rather than returning an error. This is partially a scheduler problem -- the scheduler sometimes thinks failed tasks succeed, even though they have a stack trace and exception. You can reproduce this problem with: ardd = sc.parallelize([(1,2,3), (4,5,6)]) brdd = sc.parallelize([(1,2,6), (4,5,9)]) ardd.join(brdd).count() The last line will run forever (the problem in this code is that the RDD entries have 3 values instead of the expected 2). I haven't verified if this is a problem for 1.0 as well as 0.9. Thanks to Shivaram for helping diagnose this issue! -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2703) Make Tachyon related unit tests execute without deploying a Tachyon system locally.
[ https://issues.apache.org/jira/browse/SPARK-2703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2703: --- Component/s: Spark Core Make Tachyon related unit tests execute without deploying a Tachyon system locally. --- Key: SPARK-2703 URL: https://issues.apache.org/jira/browse/SPARK-2703 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Haoyuan Li Fix For: 1.1.0 Use the LocalTachyonCluster class in tachyon-test.jar in 0.5.0 release. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2702) Upgrade Tachyon dependency to 0.5.0
[ https://issues.apache.org/jira/browse/SPARK-2702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2702: --- Component/s: Spark Core Upgrade Tachyon dependency to 0.5.0 --- Key: SPARK-2702 URL: https://issues.apache.org/jira/browse/SPARK-2702 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Haoyuan Li Fix For: 1.1.0 Upgrade Tachyon dependency to 0.5.0: a. Code dependency. b. Start Tachyon script. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2673) Improve Spark so that we can attach Debugger to Executors easily
[ https://issues.apache.org/jira/browse/SPARK-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2673: --- Component/s: Spark Core Improve Spark so that we can attach Debugger to Executors easily Key: SPARK-2673 URL: https://issues.apache.org/jira/browse/SPARK-2673 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Kousuke Saruta In the current implementation, it is difficult to attach a debugger to each Executor in the cluster, for the following reasons. 1) It's difficult for Executors running on the same machine to open a debug port, because we can only pass the same JVM options to all executors. 2) Even if we can open a unique debug port for each Executor running on the same machine, it's a bother to check the debug port of each executor. To solve those problems, I think the following 2 improvements are needed. 1) Enable each executor to open a unique debug port on a machine. 2) Expand the WebUI to be able to show the debug ports open in each executor. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2708) [APACHE-SPARK] [CORE] Build Fails: Case class 'TaskUIData' makes sbt complaining.
[ https://issues.apache.org/jira/browse/SPARK-2708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14075728#comment-14075728 ] Sean Owen commented on SPARK-2708: -- I don't see any failures when I run the tests from master just now. Jenkins seems to be succeeding too, or at least, the failed builds don't seem to be due to a TaskUIData class: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/ The class TaskUIData is present in org.apache.spark.ui.jobs.UIData. It was added pretty recently: https://github.com/apache/spark/commits/72e9021eaf26f31a82120505f8b764b18fbe8d48/core/src/main/scala/org/apache/spark/ui/jobs/UIData.scala Maybe you need to do a clean build? [APACHE-SPARK] [CORE] Build Fails: Case class 'TaskUIData' makes sbt complaining. - Key: SPARK-2708 URL: https://issues.apache.org/jira/browse/SPARK-2708 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.1 Reporter: Michael Yannakopoulos Assignee: Michael Yannakopoulos Labels: patch Build procedure fails due to numerous errors appearing in files located in Apache Spark Core project's 'org.apache.spark.ui' directory where case class 'TaskUIData' appears to be undefined. However the problem seems more complicated since the class is imported correctly to the aforementioned files. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2708) [APACHE-SPARK] [CORE] Build Fails: Case class 'TaskUIData' makes sbt complaining.
[ https://issues.apache.org/jira/browse/SPARK-2708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14075735#comment-14075735 ] Michael Yannakopoulos commented on SPARK-2708: -- I am doing it right now! I am going to report as soon as possible. Thanks! [APACHE-SPARK] [CORE] Build Fails: Case class 'TaskUIData' makes sbt complaining. - Key: SPARK-2708 URL: https://issues.apache.org/jira/browse/SPARK-2708 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.1 Reporter: Michael Yannakopoulos Assignee: Michael Yannakopoulos Labels: patch Build procedure fails due to numerous errors appearing in files located in Apache Spark Core project's 'org.apache.spark.ui' directory where case class 'TaskUIData' appears to be undefined. However the problem seems more complicated since the class is imported correctly to the aforementioned files. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Closed] (SPARK-2708) [APACHE-SPARK] [CORE] Build Fails: Case class 'TaskUIData' makes sbt complaining.
[ https://issues.apache.org/jira/browse/SPARK-2708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Yannakopoulos closed SPARK-2708. Resolution: Fixed [APACHE-SPARK] [CORE] Build Fails: Case class 'TaskUIData' makes sbt complaining. - Key: SPARK-2708 URL: https://issues.apache.org/jira/browse/SPARK-2708 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.1 Reporter: Michael Yannakopoulos Assignee: Michael Yannakopoulos Labels: patch Build procedure fails due to numerous errors appearing in files located in Apache Spark Core project's 'org.apache.spark.ui' directory where case class 'TaskUIData' appears to be undefined. However the problem seems more complicated since the class is imported correctly to the aforementioned files. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2708) [APACHE-SPARK] [CORE] Build Fails: Case class 'TaskUIData' makes sbt complaining.
[ https://issues.apache.org/jira/browse/SPARK-2708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14075736#comment-14075736 ] Michael Yannakopoulos commented on SPARK-2708: -- Yes, you are right! Thanks for the help. I am closing this issue as resolved. Thanks again guys! [APACHE-SPARK] [CORE] Build Fails: Case class 'TaskUIData' makes sbt complaining. - Key: SPARK-2708 URL: https://issues.apache.org/jira/browse/SPARK-2708 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.1 Reporter: Michael Yannakopoulos Assignee: Michael Yannakopoulos Labels: patch Build procedure fails due to numerous errors appearing in files located in Apache Spark Core project's 'org.apache.spark.ui' directory where case class 'TaskUIData' appears to be undefined. However the problem seems more complicated since the class is imported correctly to the aforementioned files. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2410) Thrift/JDBC Server
[ https://issues.apache.org/jira/browse/SPARK-2410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-2410. Resolution: Fixed Issue resolved by pull request 1600 [https://github.com/apache/spark/pull/1600] Thrift/JDBC Server -- Key: SPARK-2410 URL: https://issues.apache.org/jira/browse/SPARK-2410 Project: Spark Issue Type: New Feature Components: SQL Reporter: Michael Armbrust Assignee: Cheng Lian Priority: Blocker Fix For: 1.1.0 We have this, but need to make sure that it gets merged into master before the 1.1 release. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2708) [APACHE-SPARK] [CORE] Build Fails: Case class 'TaskUIData' makes sbt complaining.
[ https://issues.apache.org/jira/browse/SPARK-2708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14075738#comment-14075738 ] Sean Owen commented on SPARK-2708: -- (Nit: might mark it as Not A Problem or something, lest someone go looking for something that fixed this.) [APACHE-SPARK] [CORE] Build Fails: Case class 'TaskUIData' makes sbt complaining. - Key: SPARK-2708 URL: https://issues.apache.org/jira/browse/SPARK-2708 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.1 Reporter: Michael Yannakopoulos Assignee: Michael Yannakopoulos Labels: patch Build procedure fails due to numerous errors appearing in files located in Apache Spark Core project's 'org.apache.spark.ui' directory where case class 'TaskUIData' appears to be undefined. However the problem seems more complicated since the class is imported correctly to the aforementioned files. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2707) Upgrade to Akka 2.3
[ https://issues.apache.org/jira/browse/SPARK-2707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14075744#comment-14075744 ] Aaron Davidson commented on SPARK-2707: --- That doesn't sound like a bad idea -- actually sounds significantly more straightforward than depending on a version of Akka that only shades the internal protobuf usage. Upgrade to Akka 2.3 --- Key: SPARK-2707 URL: https://issues.apache.org/jira/browse/SPARK-2707 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.0.0 Reporter: Yardena Upgrade Akka from 2.2 to 2.3. We want to be able to use new Akka and Spray features directly in the same project. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2614) Add the spark-examples-xxx-.jar to the Debian packages created with mvn ... -Pdeb (using assembly/pom.xml)
[ https://issues.apache.org/jira/browse/SPARK-2614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Christian Tzolov updated SPARK-2614: Summary: Add the spark-examples-xxx-.jar to the Debian packages created with mvn ... -Pdeb (using assembly/pom.xml) (was: Add the spark-examples-xxx-.jar to the Debian package created by assembly/pom.xml (e.g. -Pdeb)) Add the spark-examples-xxx-.jar to the Debian packages created with mvn ... -Pdeb (using assembly/pom.xml) -- Key: SPARK-2614 URL: https://issues.apache.org/jira/browse/SPARK-2614 Project: Spark Issue Type: Improvement Components: Build, Deploy Reporter: Christian Tzolov The tar.gz distribution includes already the spark-examples.jar in the bundle. It is a common practice for installers to run SparkPi as a smoke test to verify that the installation is OK /usr/share/spark/bin/spark-submit \ --num-executors 10 --master yarn-cluster \ --class org.apache.spark.examples.SparkPi \ /usr/share/spark/jars/spark-examples-1.0.1-hadoop2.2.0.jar 10 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2665) Add EqualNS support for HiveQL
[ https://issues.apache.org/jira/browse/SPARK-2665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-2665. - Resolution: Fixed Fix Version/s: 1.1.0 Assignee: Cheng Hao Add EqualNS support for HiveQL -- Key: SPARK-2665 URL: https://issues.apache.org/jira/browse/SPARK-2665 Project: Spark Issue Type: New Feature Components: SQL Reporter: Cheng Hao Assignee: Cheng Hao Fix For: 1.1.0 Hive supports the operator {{<=>}}, which returns the same result as the EQUAL (=) operator for non-null operands, but returns TRUE if both operands are NULL and FALSE if only one of them is NULL. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2420) Change Spark build to minimize library conflicts
[ https://issues.apache.org/jira/browse/SPARK-2420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14075755#comment-14075755 ] Sean Owen commented on SPARK-2420: -- There aren't great answers to this one. I also ended up favoring downgrading as a path of least resistance. Here is the narrative behind my opinion: This did come up as an issue when Guava was upgraded to 14 :) It seems annoying that a dependency dictates a version of Guava, but c'est la vie for any dependency. It just happens that Guava is so common. Spark users are inevitably Hadoop users, so it's a dependency that exerts special influence. I think this is being improved upstream in Hadoop, by shading, but that doesn't help existing versions in the field, which will be around for years. It is causing actual problems for users, and for future efforts that are probably important to Spark, such as Hive on Spark here. Downgrading looks feasible. See my PR: https://github.com/apache/spark/pull/1610 *It does need review!* Downgrading could break Spark apps that depend on Guava 12+ through Spark. But this is really a problem with such an app, as it should depend on Guava directly. But still, a point to consider. Can one justify downgrading a dependency between 1.x and 1.(x+1)? I think so, if you view it as more a bug fix. But why can't Spark shade Guava? This is also reasonable to consider. If you're worried about breaking apps, that's a more breaking change though, and I understand not-breaking apps is high priority. Apps that rely on Guava transitively might continue to work just fine otherwise, but not if it disappears from Spark. Shading is always a bit risky, as it can't always adjust all use of reflection or other reliance on package names in the library. You can end up with two copies of singleton classes of course, if someone else brings their own Guava, which might or might not be OK. I don't have a specific problem in mind for Guava, though. A more significant reason is that I'm still not 100% sure shading in Spark fixes the collision, in stand-alone mode at least. Spark apps that bring Guava 14 may still collide with Hadoop's classpath, containing 11. Change Spark build to minimize library conflicts Key: SPARK-2420 URL: https://issues.apache.org/jira/browse/SPARK-2420 Project: Spark Issue Type: Wish Components: Build Affects Versions: 1.0.0 Reporter: Xuefu Zhang Attachments: spark_1.0.0.patch During the prototyping of HIVE-7292, many library conflicts showed up because Spark build contains versions of libraries that's vastly different from current major Hadoop version. It would be nice if we can choose versions that's in line with Hadoop or shading them in the assembly. Here are the wish list: 1. Upgrade protobuf version to 2.5.0 from current 2.4.1 2. Shading Spark's jetty and servlet dependency in the assembly. 3. guava version difference. Spark is using a higher version. I'm not sure what's the best solution for this. The list may grow as HIVE-7292 proceeds. For information only, the attached is a patch that we applied on Spark in order to make Spark work with Hive. It gives an idea of the scope of changes. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2420) Change Spark build to minimize library conflicts
[ https://issues.apache.org/jira/browse/SPARK-2420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14075754#comment-14075754 ] Apache Spark commented on SPARK-2420: - User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/1610 Change Spark build to minimize library conflicts Key: SPARK-2420 URL: https://issues.apache.org/jira/browse/SPARK-2420 Project: Spark Issue Type: Wish Components: Build Affects Versions: 1.0.0 Reporter: Xuefu Zhang Attachments: spark_1.0.0.patch During the prototyping of HIVE-7292, many library conflicts showed up because Spark build contains versions of libraries that's vastly different from current major Hadoop version. It would be nice if we can choose versions that's in line with Hadoop or shading them in the assembly. Here are the wish list: 1. Upgrade protobuf version to 2.5.0 from current 2.4.1 2. Shading Spark's jetty and servlet dependency in the assembly. 3. guava version difference. Spark is using a higher version. I'm not sure what's the best solution for this. The list may grow as HIVE-7292 proceeds. For information only, the attached is a patch that we applied on Spark in order to make Spark work with Hive. It gives an idea of the scope of changes. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-2420) Change Spark build to minimize library conflicts
[ https://issues.apache.org/jira/browse/SPARK-2420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14075758#comment-14075758 ] Patrick Wendell edited comment on SPARK-2420 at 7/27/14 9:14 PM: - I put some thought into this as well. One big issue (and this was frankly a mistake in Spark's Java API design) is that we expose Guava's Optional type in Spark's Java API. In general we should avoid relying on external types in any of our APIs - that decision was made a long time ago when we were a much smaller project. The reason why downgrading is bad for user applications is that it's not something they can just work around by declaring a newer version of Guava in their build. The whole issue here is that Guava 11 and 14 are not binary compatible. I.e. if user code depends on Guava 14, and that gets pulled in, then Spark will break. So users will actually have to roll back their source code as well if it depends on newer Guava features. This is very disruptive from a user perspective and I think it's tantamount to an API change, since users will have to re-write code. It's in some ways worse than a Spark API change, because we can't easily write a downgrade guide for Guava from 14 to 11 (there will simply be missing features). I think the best solution here is to shade Guava. And by shade I mean actually re-publish Guava under the org.spark-project namespace as we have done with a few other critical dependencies, and then depend on that in the spark build. This is much better than using something like the maven shade plug-in which is more of a hack. Then the issue is our Java API, because that currently exposes the Guava Optional class directly under its original namespace. I see two options. (1) Change Spark's API to return a Spark-specific optional class. (2) Inline the definition of Guava's Optional (under its original namespace) in Spark's source code - it's a very simple class and has been stable across several versions of Guava. The only risk with (2) is that if Guava makes an incompatible change to Optional, we are in trouble. If that happens, we could always fall back to (1) though in a future release.
Change Spark build to minimize library conflicts Key: SPARK-2420 URL: https://issues.apache.org/jira/browse/SPARK-2420 Project: Spark Issue Type: Wish Components: Build Affects Versions: 1.0.0 Reporter: Xuefu Zhang Attachments: spark_1.0.0.patch During the prototyping of HIVE-7292, many library conflicts showed up because Spark build contains versions of libraries that's vastly different from current major Hadoop version. It would be nice if we can choose versions that's in line with Hadoop or shading them in the assembly. Here are the wish list: 1. Upgrade protobuf version to 2.5.0 from current 2.4.1 2. Shading Spark's jetty and servlet dependency in the assembly. 3. guava version difference. Spark is using a higher version. I'm not sure what's the best solution for this. The list may grow as HIVE-7292 proceeds. For information only, the attached is a patch that we applied on Spark in order to make Spark work with Hive. It gives an idea of the scope of changes. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2420) Change Spark build to minimize library conflicts
[ https://issues.apache.org/jira/browse/SPARK-2420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14075758#comment-14075758 ] Patrick Wendell commented on SPARK-2420: I put some thought into this as well. One big issue (and this was frankly a mistake in Spark's Java API design) is that we expose Guava's Optional type in Spark's Java API. In general we should avoid relying on external types in any of our APIs - that decision was made a long time ago when we were a much smaller project. The reason why downgrading is bad for user applications is that it's not something they can just work around by declaring a newer version of Guava in their build. The whole issue here is that Guava 11 and 14 are not binary compatible. I.e. if user code depends on Guava 14, and that gets pulled in, then Spark will break. So users will actually have to roll back their source code as well if it depends on newer Guava features. This is very disruptive from a user perspective and I think it's tantamount to an API change, since users will have to re-write code. It's in some ways worse than a Spark API change, because we can't easily write a downgrade guide for Guava from 14 to 11 (there will simply be missing features). I think the best solution here is to shade Guava. And by shade I mean actually re-publish Guava under the org.spark-project namespace as we have done with a few other critical dependencies, and then depend on that in the spark build. This is much better than using something like the maven shade plug-in which is more of a hack. Then the issue is our Java API, because that currently exposes the Guava Optional class directly under its original namespace. I see two options. (i) Change Spark's API to return a Spark-specific optional class. (ii) Inline the definition of Guava's Optional (under its original namespace) in Spark's source code - it's a very simple class and has been stable across several versions of Guava. The only risk with (ii) is that if Guava makes an incompatible change to Optional, we are in trouble. If that happens, we could always fall back to (i) though in a future release. Change Spark build to minimize library conflicts Key: SPARK-2420 URL: https://issues.apache.org/jira/browse/SPARK-2420 Project: Spark Issue Type: Wish Components: Build Affects Versions: 1.0.0 Reporter: Xuefu Zhang Attachments: spark_1.0.0.patch During the prototyping of HIVE-7292, many library conflicts showed up because Spark build contains versions of libraries that's vastly different from current major Hadoop version. It would be nice if we can choose versions that's in line with Hadoop or shading them in the assembly. Here are the wish list: 1. Upgrade protobuf version to 2.5.0 from current 2.4.1 2. Shading Spark's jetty and servlet dependency in the assembly. 3. guava version difference. Spark is using a higher version. I'm not sure what's the best solution for this. The list may grow as HIVE-7292 proceeds. For information only, the attached is a patch that we applied on Spark in order to make Spark work with Hive. It gives an idea of the scope of changes. -- This message was sent by Atlassian JIRA (v6.2#6252)
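To make the two options concrete, the following is a minimal, hypothetical sketch (in Scala; not the actual Guava or Spark source, and all names are placeholders) of what a Spark-owned optional type could look like. It only needs the handful of members that the Java API actually relies on.
{code}
// Hypothetical sketch of a minimal Spark-owned optional type, illustrating
// options (1)/(2) above. Not the actual Guava or Spark source.
sealed abstract class SparkOptional[T] extends Serializable {
  def isPresent: Boolean
  def get: T
  def or(default: T): T = if (isPresent) get else default
  def orNull: T = if (isPresent) get else null.asInstanceOf[T]
}

object SparkOptional {
  def of[T](value: T): SparkOptional[T] = {
    require(value != null, "value must not be null")
    new Present(value)
  }
  def absent[T](): SparkOptional[T] = new Absent[T]

  private class Present[T](value: T) extends SparkOptional[T] {
    override def isPresent: Boolean = true
    override def get: T = value
  }
  private class Absent[T] extends SparkOptional[T] {
    override def isPresent: Boolean = false
    override def get: T = throw new NoSuchElementException("value is absent")
  }
}
{code}
Either option keeps user code decoupled from whichever Guava version ends up on the classpath.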
[jira] [Created] (SPARK-2709) Add a tool for certifying Spark API compatibility
Patrick Wendell created SPARK-2709: -- Summary: Add a tool for certifying Spark API compatibility Key: SPARK-2709 URL: https://issues.apache.org/jira/browse/SPARK-2709 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Patrick Wendell Assignee: Prashant Sharma As Spark is packaged by more and more distributors, it would be good to have a tool that verifies API compatibility of a provided Spark package. The tool would certify that a vendor distribution of Spark contains all of the APIs present in a particular upstream Spark version. This will help vendors make sure they remain API compliant when they make changes or backports to Spark. It will also discourage vendors from knowingly breaking APIs, because anyone can audit their distribution and see that they have removed support for certain APIs. I'm hoping a tool like this will avoid API fragmentation in the Spark community. One poor man's implementation of this is that a vendor can just run the binary compatibility checks in the spark build against an upstream version of Spark. That's a pretty good start, but it means you can't come as a third party and audit a distribution. Another approach would be to have something where anyone can come in and audit a distribution even if they don't have access to the packaging and source code. That would look something like this: 1. For each release we publish a manifest of all public APIs (we might borrow the MIMA string representation of bytecode signatures) 2. We package an auditing tool as a jar file. 3. The user runs a tool with spark-submit that reflectively walks through all exposed Spark APIs and makes sure that everything on the manifest is encountered. From the implementation side, this is just brainstorming at this point. -- This message was sent by Atlassian JIRA (v6.2#6252)
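As a rough illustration of step 3, the sketch below (hypothetical names and a hypothetical one-signature-per-line manifest format) reflectively walks the public methods of each class listed in a manifest and reports anything missing from the distribution on the classpath.
{code}
// Hypothetical sketch of the auditing idea: compare a published manifest of
// method signatures against what the distribution actually exposes, via reflection.
import scala.io.Source

object ApiAudit {
  private def signaturesOf(className: String): Set[String] = {
    val cls = Class.forName(className)
    cls.getMethods.map { m =>
      s"$className#${m.getName}(${m.getParameterTypes.map(_.getName).mkString(",")})"
    }.toSet
  }

  def main(args: Array[String]): Unit = {
    // Manifest format (assumed): one "className#method(argTypes)" entry per line.
    val manifest = Source.fromFile(args(0)).getLines().toSeq
    val missing = manifest.groupBy(_.takeWhile(_ != '#')).flatMap { case (cls, expected) =>
      expected.filterNot(signaturesOf(cls).contains)
    }
    missing.foreach(sig => println(s"MISSING: $sig"))
    if (missing.nonEmpty) sys.exit(1)
  }
}
{code}
Running something along these lines through spark-submit would let a third party audit a vendor package without access to its packaging or source code.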
[jira] [Commented] (SPARK-2420) Change Spark build to minimize library conflicts
[ https://issues.apache.org/jira/browse/SPARK-2420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14075762#comment-14075762 ] Sean Owen commented on SPARK-2420: -- Yep, there's an argument there. The downsides are that apps who relied on Guava coming in via Spark will not work. Though the fix is proper and easy. I thought that might have been a non-starter. Yeah, shading means people can bring their own Guava and it won't collide with Spark, but I think it still collides with Hadoop, and it matters in standalone mode (but not YARN mode I think? someone needs to check my understanding). I suppose I'd suggest that needs to be checked, or else it doesn't actually help Spark (+ Hadoop) users use Guava 14, and that's a lot of the users. Change Spark build to minimize library conflicts Key: SPARK-2420 URL: https://issues.apache.org/jira/browse/SPARK-2420 Project: Spark Issue Type: Wish Components: Build Affects Versions: 1.0.0 Reporter: Xuefu Zhang Attachments: spark_1.0.0.patch During the prototyping of HIVE-7292, many library conflicts showed up because Spark build contains versions of libraries that's vastly different from current major Hadoop version. It would be nice if we can choose versions that's in line with Hadoop or shading them in the assembly. Here are the wish list: 1. Upgrade protobuf version to 2.5.0 from current 2.4.1 2. Shading Spark's jetty and servlet dependency in the assembly. 3. guava version difference. Spark is using a higher version. I'm not sure what's the best solution for this. The list may grow as HIVE-7292 proceeds. For information only, the attached is a patch that we applied on Spark in order to make Spark work with Hive. It gives an idea of the scope of changes. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2420) Change Spark build to minimize library conflicts
[ https://issues.apache.org/jira/browse/SPARK-2420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14075764#comment-14075764 ] Patrick Wendell commented on SPARK-2420: Yeah, I think users having to add guava 14 to their build is (compared with alternatives) not too bad, provided they don't have to make any code changes. [~sowen] If we shade, then I don't see how in any mode we could conflict with any hadoop code. It would just be like any other dependency that Spark has but Hadoop doesn't have (?) Could you elaborate a bit more on the conflict you are anticipating? Change Spark build to minimize library conflicts Key: SPARK-2420 URL: https://issues.apache.org/jira/browse/SPARK-2420 Project: Spark Issue Type: Wish Components: Build Affects Versions: 1.0.0 Reporter: Xuefu Zhang Attachments: spark_1.0.0.patch During the prototyping of HIVE-7292, many library conflicts showed up because Spark build contains versions of libraries that's vastly different from current major Hadoop version. It would be nice if we can choose versions that's in line with Hadoop or shading them in the assembly. Here are the wish list: 1. Upgrade protobuf version to 2.5.0 from current 2.4.1 2. Shading Spark's jetty and servlet dependency in the assembly. 3. guava version difference. Spark is using a higher version. I'm not sure what's the best solution for this. The list may grow as HIVE-7292 proceeds. For information only, the attached is a patch that we applied on Spark in order to make Spark work with Hive. It gives an idea of the scope of changes. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2420) Change Spark build to minimize library conflicts
[ https://issues.apache.org/jira/browse/SPARK-2420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14075765#comment-14075765 ] Sean Owen commented on SPARK-2420: -- Spark doesn't conflict then, but the user code may conflict with Hadoop. That's the scenario. Maybe spark.files.userClassPathFirst takes care of this in general, and in YARN, you are more isolated from the Hadoop stuff. Change Spark build to minimize library conflicts Key: SPARK-2420 URL: https://issues.apache.org/jira/browse/SPARK-2420 Project: Spark Issue Type: Wish Components: Build Affects Versions: 1.0.0 Reporter: Xuefu Zhang Attachments: spark_1.0.0.patch During the prototyping of HIVE-7292, many library conflicts showed up because Spark build contains versions of libraries that's vastly different from current major Hadoop version. It would be nice if we can choose versions that's in line with Hadoop or shading them in the assembly. Here are the wish list: 1. Upgrade protobuf version to 2.5.0 from current 2.4.1 2. Shading Spark's jetty and servlet dependency in the assembly. 3. guava version difference. Spark is using a higher version. I'm not sure what's the best solution for this. The list may grow as HIVE-7292 proceeds. For information only, the attached is a patch that we applied on Spark in order to make Spark work with Hive. It gives an idea of the scope of changes. -- This message was sent by Atlassian JIRA (v6.2#6252)
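For reference, the configuration mentioned above is set when the application creates its context; this is a minimal, spark-shell-style sketch, and, as noted in the comments, whether it fully isolates a user's Guava from Hadoop's copy still needs to be verified.
{code}
// Minimal sketch: ask executors to prefer classes from the user's jars over the
// ones bundled with Spark. Illustrative only; behaviour differs by deploy mode.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("guava-isolation-check")
  .set("spark.files.userClassPathFirst", "true")

val sc = new SparkContext(conf)
{code}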
[jira] [Commented] (SPARK-2614) Add the spark-examples-xxx-.jar to the Debian packages created with mvn ... -Pdeb (using assembly/pom.xml)
[ https://issues.apache.org/jira/browse/SPARK-2614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14075768#comment-14075768 ] Apache Spark commented on SPARK-2614: - User 'tzolov' has created a pull request for this issue: https://github.com/apache/spark/pull/1611 Add the spark-examples-xxx-.jar to the Debian packages created with mvn ... -Pdeb (using assembly/pom.xml) -- Key: SPARK-2614 URL: https://issues.apache.org/jira/browse/SPARK-2614 Project: Spark Issue Type: Improvement Components: Build, Deploy Reporter: Christian Tzolov The tar.gz distribution includes already the spark-examples.jar in the bundle. It is a common practice for installers to run SparkPi as a smoke test to verify that the installation is OK /usr/share/spark/bin/spark-submit \ --num-executors 10 --master yarn-cluster \ --class org.apache.spark.examples.SparkPi \ /usr/share/spark/jars/spark-examples-1.0.1-hadoop2.2.0.jar 10 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2708) [APACHE-SPARK] [CORE] Build Fails: Case class 'TaskUIData' makes sbt complain.
[ https://issues.apache.org/jira/browse/SPARK-2708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14075770#comment-14075770 ] Michael Yannakopoulos commented on SPARK-2708: -- This issue is resolved. It turned out there was no real problem, so the issue has been closed without any patches; the solution is simply to perform a clean build of the Apache Spark project. [APACHE-SPARK] [CORE] Build Fails: Case class 'TaskUIData' makes sbt complain. - Key: SPARK-2708 URL: https://issues.apache.org/jira/browse/SPARK-2708 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.1 Reporter: Michael Yannakopoulos Assignee: Michael Yannakopoulos Labels: patch The build procedure fails due to numerous errors appearing in files located in the Apache Spark Core project's 'org.apache.spark.ui' directory, where the case class 'TaskUIData' appears to be undefined. However, the problem seems more complicated since the class is imported correctly into the aforementioned files. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1777) Pass cached blocks directly to disk if memory is not large enough
[ https://issues.apache.org/jira/browse/SPARK-1777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-1777: - Priority: Critical (was: Major) Pass cached blocks directly to disk if memory is not large enough --- Key: SPARK-1777 URL: https://issues.apache.org/jira/browse/SPARK-1777 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Patrick Wendell Assignee: Andrew Or Priority: Critical Fix For: 1.1.0 Attachments: spark-1777-design-doc.pdf Currently in Spark we entirely unroll a partition and then check whether it will cause us to exceed the storage limit. This has an obvious problem - if the partition itself is enough to push us over the storage limit (and eventually over the JVM heap), it will cause an OOM. This can happen in cases where a single partition is very large or when someone is running examples locally with a small heap. https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/CacheManager.scala#L106 We should think a bit about the most elegant way to fix this - it shares some similarities with the external aggregation code. A simple idea is to periodically check the size of the buffer as we are unrolling and see if we are over the memory limit. If we are we could prepend the existing buffer to the iterator and write that entire thing out to disk. -- This message was sent by Atlassian JIRA (v6.2#6252)
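A very rough sketch of the "periodically check the size of the buffer as we are unrolling" idea follows; the size-estimation callback, the check interval, and the return shape are illustrative assumptions, not the actual CacheManager design.
{code}
// Sketch: unroll an iterator into memory, but bail out and hand the data back as
// an iterator (to be streamed to disk) if the estimated size exceeds the budget.
import scala.collection.mutable.ArrayBuffer

def unrollSafely[T](
    values: Iterator[T],
    maxUnrollBytes: Long,
    estimateSize: AnyRef => Long): Either[Seq[T], Iterator[T]] = {
  val buffer = new ArrayBuffer[T]
  var elementsRead = 0L
  while (values.hasNext) {
    buffer += values.next()
    elementsRead += 1
    // Size estimation is expensive, so only check periodically.
    if (elementsRead % 16 == 0 && estimateSize(buffer) > maxUnrollBytes) {
      // Over budget: prepend what we have to the remaining iterator so the
      // caller can write the whole partition to disk instead of caching it.
      return Right(buffer.iterator ++ values)
    }
  }
  Left(buffer) // fully unrolled within budget; safe to cache in memory
}
{code}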
[jira] [Commented] (SPARK-2710) Build SchemaRDD from a JdbcRDD with MetaData (no hard code case class)
[ https://issues.apache.org/jira/browse/SPARK-2710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14075789#comment-14075789 ] Apache Spark commented on SPARK-2710: - User 'chutium' has created a pull request for this issue: https://github.com/apache/spark/pull/1612 Build SchemaRDD from a JdbcRDD with MetaData (no hard code case class) -- Key: SPARK-2710 URL: https://issues.apache.org/jira/browse/SPARK-2710 Project: Spark Issue Type: Improvement Components: Spark Core, SQL Reporter: Teng Qiu Spark SQL can take Parquet files or JSON files as a table directly (without a given case class to define the schema); as a SQL component, it should also be able to take a ResultSet from an RDBMS easily. I find that there is a JdbcRDD in core: core/src/main/scala/org/apache/spark/rdd/JdbcRDD.scala so I want to make a small change in this file to allow SQLContext to read the metadata from the PreparedStatement (reading metadata does not require actually executing the query). And there is a small bug in JdbcRDD in compute(), method close() {code} if (null != conn && !stmt.isClosed()) conn.close() {code} should be {code} if (null != conn && !conn.isClosed()) conn.close() {code} just a small typo :) Then, in Spark SQL, SQLContext can create a SchemaRDD with a JdbcRDD and its metadata. In the future, maybe we can add a feature to the sql-shell, so that users can use spark-thrift-server to join tables from different sources, such as: {code} CREATE TABLE jdbc_tbl1 AS JDBC connectionString username password initQuery bound ... CREATE TABLE parquet_files AS JDBC hdfs://tmp/parquet_table/ SELECT parquet_files.colX, jdbc_tbl1.colY FROM parquet_files JOIN jdbc_tbl1 ON (parquet_files.id = jdbc_tbl1.id) {code} I think such a feature will be useful, like Facebook's Presto engine does. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2514) Random RDD generator
[ https://issues.apache.org/jira/browse/SPARK-2514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-2514. -- Resolution: Fixed Fix Version/s: 1.1.0 Issue resolved by pull request 1520 [https://github.com/apache/spark/pull/1520] Random RDD generator Key: SPARK-2514 URL: https://issues.apache.org/jira/browse/SPARK-2514 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Xiangrui Meng Assignee: Doris Xin Fix For: 1.1.0 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2707) Upgrade to Akka 2.3
[ https://issues.apache.org/jira/browse/SPARK-2707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14075796#comment-14075796 ] Anand Avati commented on SPARK-2707: The changes to just get compiled with 2.3.x can be found here - https://github.com/avati/spark/commit/000441bfec9315d1132cd9b785791a6fcbf9d4d4. However that does not work, and new SparkContext keeps throwing: java.util.concurrent.TimeoutException: Futures timed out after [1 milliseconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:107) at akka.remote.Remoting.start(Remoting.scala:180) at akka.remote.RemoteActorRefProvider.init(RemoteActorRefProvider.scala:184) at akka.actor.ActorSystemImpl.liftedTree2$1(ActorSystem.scala:618) at akka.actor.ActorSystemImpl._start$lzycompute(ActorSystem.scala:615) at akka.actor.ActorSystemImpl._start(ActorSystem.scala:615) I am still investigating what other changes are needed in spark for akka 2.3.x to work Upgrade to Akka 2.3 --- Key: SPARK-2707 URL: https://issues.apache.org/jira/browse/SPARK-2707 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.0.0 Reporter: Yardena Upgrade Akka from 2.2 to 2.3. We want to be able to use new Akka and Spray features directly in the same project. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2710) Build SchemaRDD from a JdbcRDD with MetaData (no hard code case class)
[ https://issues.apache.org/jira/browse/SPARK-2710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14075799#comment-14075799 ] Teng Qiu commented on SPARK-2710: - One problem is that there is nothing to push down... I have no idea how filters could be pushed from the logical plan to the JdbcRDD... maybe only by changing the query string and rebuilding conn.prepareStatement... Build SchemaRDD from a JdbcRDD with MetaData (no hard code case class) -- Key: SPARK-2710 URL: https://issues.apache.org/jira/browse/SPARK-2710 Project: Spark Issue Type: Improvement Components: Spark Core, SQL Reporter: Teng Qiu Spark SQL can take Parquet files or JSON files as a table directly (without a given case class to define the schema); as a SQL component, it should also be able to take a ResultSet from an RDBMS easily. I find that there is a JdbcRDD in core: core/src/main/scala/org/apache/spark/rdd/JdbcRDD.scala so I want to make a small change in this file to allow SQLContext to read the metadata from the PreparedStatement (reading metadata does not require actually executing the query). And there is a small bug in JdbcRDD in compute(), method close() {code} if (null != conn && !stmt.isClosed()) conn.close() {code} should be {code} if (null != conn && !conn.isClosed()) conn.close() {code} just a small typo :) Then, in Spark SQL, SQLContext can create a SchemaRDD with a JdbcRDD and its metadata. In the future, maybe we can add a feature to the sql-shell, so that users can use spark-thrift-server to join tables from different sources, such as: {code} CREATE TABLE jdbc_tbl1 AS JDBC connectionString username password initQuery bound ... CREATE TABLE parquet_files AS JDBC hdfs://tmp/parquet_table/ SELECT parquet_files.colX, jdbc_tbl1.colY FROM parquet_files JOIN jdbc_tbl1 ON (parquet_files.id = jdbc_tbl1.id) {code} I think such a feature will be useful, like Facebook's Presto engine does. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2710) Build SchemaRDD from a JdbcRDD with MetaData (no hard code case class)
[ https://issues.apache.org/jira/browse/SPARK-2710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Teng Qiu updated SPARK-2710: Description: Spark SQL can take Parquet files or JSON files as a table directly (without a given case class to define the schema); as a SQL component, it should also be able to take a ResultSet from an RDBMS easily. I find that there is a JdbcRDD in core: core/src/main/scala/org/apache/spark/rdd/JdbcRDD.scala so I want to make a small change in this file to allow SQLContext to read the metadata from the PreparedStatement (reading metadata does not require actually executing the query). Then, in Spark SQL, SQLContext can create a SchemaRDD with a JdbcRDD and its metadata. In the future, maybe we can add a feature to the sql-shell, so that users can use spark-thrift-server to join tables from different sources, such as: {code} CREATE TABLE jdbc_tbl1 AS JDBC connectionString username password initQuery bound ... CREATE TABLE parquet_files AS PARQUET hdfs://tmp/parquet_table/ SELECT parquet_files.colX, jdbc_tbl1.colY FROM parquet_files JOIN jdbc_tbl1 ON (parquet_files.id = jdbc_tbl1.id) {code} I think such a feature will be useful, like Facebook's Presto engine does. Oh, and there is a small bug in JdbcRDD in compute(), method close() {code} if (null != conn && !stmt.isClosed()) conn.close() {code} should be {code} if (null != conn && !conn.isClosed()) conn.close() {code} just a small typo :) but such a close method will never be able to close conn... was: Spark SQL can take Parquet files or JSON files as a table directly (without a given case class to define the schema); as a SQL component, it should also be able to take a ResultSet from an RDBMS easily. I find that there is a JdbcRDD in core: core/src/main/scala/org/apache/spark/rdd/JdbcRDD.scala so I want to make a small change in this file to allow SQLContext to read the metadata from the PreparedStatement (reading metadata does not require actually executing the query). And there is a small bug in JdbcRDD in compute(), method close() {code} if (null != conn && !stmt.isClosed()) conn.close() {code} should be {code} if (null != conn && !conn.isClosed()) conn.close() {code} just a small typo :) Then, in Spark SQL, SQLContext can create a SchemaRDD with a JdbcRDD and its metadata. In the future, maybe we can add a feature to the sql-shell, so that users can use spark-thrift-server to join tables from different sources, such as: {code} CREATE TABLE jdbc_tbl1 AS JDBC connectionString username password initQuery bound ... CREATE TABLE parquet_files AS JDBC hdfs://tmp/parquet_table/ SELECT parquet_files.colX, jdbc_tbl1.colY FROM parquet_files JOIN jdbc_tbl1 ON (parquet_files.id = jdbc_tbl1.id) {code} I think such a feature will be useful, like Facebook's Presto engine does.
Build SchemaRDD from a JdbcRDD with MetaData (no hard code case class) -- Key: SPARK-2710 URL: https://issues.apache.org/jira/browse/SPARK-2710 Project: Spark Issue Type: Improvement Components: Spark Core, SQL Reporter: Teng Qiu Spark SQL can take Parquet files or JSON files as a table directly (without a given case class to define the schema); as a SQL component, it should also be able to take a ResultSet from an RDBMS easily. I find that there is a JdbcRDD in core: core/src/main/scala/org/apache/spark/rdd/JdbcRDD.scala so I want to make a small change in this file to allow SQLContext to read the metadata from the PreparedStatement (reading metadata does not require actually executing the query). Then, in Spark SQL, SQLContext can create a SchemaRDD with a JdbcRDD and its metadata. In the future, maybe we can add a feature to the sql-shell, so that users can use spark-thrift-server to join tables from different sources, such as: {code} CREATE TABLE jdbc_tbl1 AS JDBC connectionString username password initQuery bound ... CREATE TABLE parquet_files AS PARQUET hdfs://tmp/parquet_table/ SELECT parquet_files.colX, jdbc_tbl1.colY FROM parquet_files JOIN jdbc_tbl1 ON (parquet_files.id = jdbc_tbl1.id) {code} I think such a feature will be useful, like Facebook's Presto engine does. Oh, and there is a small bug in JdbcRDD in compute(), method close() {code} if (null != conn && !stmt.isClosed()) conn.close() {code} should be {code} if (null != conn && !conn.isClosed()) conn.close() {code} just a small typo :) but such a close method will never be able to close conn... -- This message was sent by Atlassian JIRA (v6.2#6252)
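The metadata part of the proposal relies on plain JDBC: a PreparedStatement can expose its ResultSetMetaData before the query is run (driver support varies). A minimal, illustrative sketch, with a hypothetical helper name and no real error handling:
{code}
// Sketch: derive column names and types from a query without executing it,
// using only standard JDBC.
import java.sql.DriverManager

def describeQuery(url: String, user: String, password: String, sql: String): Seq[(String, String)] = {
  val conn = DriverManager.getConnection(url, user, password)
  try {
    val meta = conn.prepareStatement(sql).getMetaData
    (1 to meta.getColumnCount).map { i =>
      meta.getColumnName(i) -> meta.getColumnTypeName(i)
    }
  } finally {
    if (null != conn && !conn.isClosed()) conn.close() // note: check conn, not stmt
  }
}
{code}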
[jira] [Commented] (SPARK-2550) Support regularization and intercept in pyspark's linear methods
[ https://issues.apache.org/jira/browse/SPARK-2550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14075816#comment-14075816 ] Xiangrui Meng commented on SPARK-2550: -- After you merge new changes from the master, please run `sbt/sbt clean` to clean the cache in order to build correctly. Support regularization and intercept in pyspark's linear methods Key: SPARK-2550 URL: https://issues.apache.org/jira/browse/SPARK-2550 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Affects Versions: 1.0.0 Reporter: Xiangrui Meng Assignee: Michael Yannakopoulos Python API doesn't provide options to set regularization parameter and intercept in linear methods, which should be fixed in v1.1. -- This message was sent by Atlassian JIRA (v6.2#6252)
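For context, a sketch of the knobs the Scala API already exposes and that this issue asks to surface in PySpark; the setter names here are from memory and may differ slightly between versions.
{code}
// Sketch (Scala side): regularization is configured on the optimizer, the
// intercept on the algorithm itself. Illustrative only.
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

def train(data: RDD[LabeledPoint]) = {
  val lr = new LogisticRegressionWithSGD()
  lr.setIntercept(true)          // fit an intercept term
  lr.optimizer
    .setNumIterations(100)
    .setRegParam(0.01)           // regularization strength
  lr.run(data)
}
{code}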
[jira] [Created] (SPARK-2711) Create a ShuffleMemoryManager that allocates across spilling collections in the same task
Matei Zaharia created SPARK-2711: Summary: Create a ShuffleMemoryManager that allocates across spilling collections in the same task Key: SPARK-2711 URL: https://issues.apache.org/jira/browse/SPARK-2711 Project: Spark Issue Type: Improvement Reporter: Matei Zaharia Assignee: Matei Zaharia -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2711) Create a ShuffleMemoryManager that allocates across spilling collections in the same task
[ https://issues.apache.org/jira/browse/SPARK-2711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-2711: - Description: Right now if there are two ExternalAppendOnlyMaps, they don't compete correctly for memory. This can happen e.g. in a task that is both reducing data from its parent RDD and writing it out to files for a future shuffle, for instance if you do rdd.groupByKey(...).map(...).groupByKey(...) (another key). Create a ShuffleMemoryManager that allocates across spilling collections in the same task - Key: SPARK-2711 URL: https://issues.apache.org/jira/browse/SPARK-2711 Project: Spark Issue Type: Improvement Reporter: Matei Zaharia Assignee: Matei Zaharia Right now if there are two ExternalAppendOnlyMaps, they don't compete correctly for memory. This can happen e.g. in a task that is both reducing data from its parent RDD and writing it out to files for a future shuffle, for instance if you do rdd.groupByKey(...).map(...).groupByKey(...) (another key). -- This message was sent by Atlassian JIRA (v6.2#6252)
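One way to picture the proposed manager: a per-task pool where each spilling collection asks for memory and is granted at most a fair share, spilling to disk when the grant falls short. A simplified, hypothetical sketch (not the actual implementation):
{code}
// Simplified sketch of a shuffle memory manager that splits a fixed budget
// across the spilling collections (consumers) active in the same task.
class SimpleShuffleMemoryManager(maxMemoryBytes: Long) {
  private val granted = scala.collection.mutable.Map.empty[Long, Long] // consumerId -> bytes

  def acquire(consumerId: Long, numBytes: Long): Long = synchronized {
    val consumers = math.max(granted.size + (if (granted.contains(consumerId)) 0 else 1), 1)
    val fairShare = maxMemoryBytes / consumers
    val current = granted.getOrElse(consumerId, 0L)
    val free = maxMemoryBytes - granted.values.sum
    // Never grant more than the free space or more than the consumer's fair share.
    val grant = math.max(0L, math.min(numBytes, math.min(fairShare - current, free)))
    granted(consumerId) = current + grant
    grant // if the grant is smaller than requested, the caller spills to disk
  }

  def release(consumerId: Long): Unit = synchronized { granted.remove(consumerId) }
}
{code}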
[jira] [Resolved] (SPARK-2659) HiveQL: Division operator should always perform fractional division
[ https://issues.apache.org/jira/browse/SPARK-2659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-2659. - Resolution: Fixed Fix Version/s: 1.1.0 HiveQL: Division operator should always perform fractional division --- Key: SPARK-2659 URL: https://issues.apache.org/jira/browse/SPARK-2659 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust Priority: Minor Fix For: 1.1.0 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2410) Thrift/JDBC Server
[ https://issues.apache.org/jira/browse/SPARK-2410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14075833#comment-14075833 ] Patrick Wendell commented on SPARK-2410: {code} [info] - test query execution against a Hive Thrift server *** FAILED *** [info] java.sql.SQLException: Could not open connection to jdbc:hive2://localhost:59556/: java.net.ConnectException: Connection refused [info] at org.apache.hive.jdbc.HiveConnection.openTransport(HiveConnection.java:146) [info] at org.apache.hive.jdbc.HiveConnection.init(HiveConnection.java:123) [info] at org.apache.hive.jdbc.HiveDriver.connect(HiveDriver.java:105) [info] at java.sql.DriverManager.getConnection(DriverManager.java:571) [info] at java.sql.DriverManager.getConnection(DriverManager.java:215) [info] at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2Suite.getConnection(HiveThriftServer2Suite.scala:131) [info] at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2Suite.createStatement(HiveThriftServer2Suite.scala:134) [info] at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2Suite$$anonfun$1.apply$mcV$sp(HiveThriftServer2Suite.scala:110) [info] at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2Suite$$anonfun$1.apply(HiveThriftServer2Suite.scala:107) [info] at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2Suite$$anonfun$1.apply(HiveThriftServer2Suite.scala:107) [info] ... [info] Cause: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused [info] at org.apache.thrift.transport.TSocket.open(TSocket.java:185) [info] at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:248) [info] at org.apache.thrift.transport.TSaslClientTransport.open(TSaslClientTransport.java:37) [info] at org.apache.hive.jdbc.HiveConnection.openTransport(HiveConnection.java:144) [info] at org.apache.hive.jdbc.HiveConnection.init(HiveConnection.java:123) [info] at org.apache.hive.jdbc.HiveDriver.connect(HiveDriver.java:105) [info] at java.sql.DriverManager.getConnection(DriverManager.java:571) [info] at java.sql.DriverManager.getConnection(DriverManager.java:215) [info] at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2Suite.getConnection(HiveThriftServer2Suite.scala:131) [info] at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2Suite.createStatement(HiveThriftServer2Suite.scala:134) [info] ... [info] Cause: java.net.ConnectException: Connection refused [info] at java.net.PlainSocketImpl.socketConnect(Native Method) [info] at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339) [info] at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200) [info] at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182) [info] at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) [info] at java.net.Socket.connect(Socket.java:579) [info] at org.apache.thrift.transport.TSocket.open(TSocket.java:180) [info] at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:248) [info] at org.apache.thrift.transport.TSaslClientTransport.open(TSaslClientTransport.java:37) [info] at org.apache.hive.jdbc.HiveConnection.openTransport(HiveConnection.java:144) [info] ... 
[info] CliSuite: Executing: create table hive_test1(key int, val string);, expecting output: OK [info] - simple commands *** FAILED *** [info] java.lang.AssertionError: assertion failed: Didn't find OK in the output: [info] at scala.Predef$.assert(Predef.scala:179) [info] at org.apache.spark.sql.hive.thriftserver.TestUtils$class.waitForQuery(TestUtils.scala:70) [info] at org.apache.spark.sql.hive.thriftserver.CliSuite.waitForQuery(CliSuite.scala:25) [info] at org.apache.spark.sql.hive.thriftserver.TestUtils$class.executeQuery(TestUtils.scala:62) [info] at org.apache.spark.sql.hive.thriftserver.CliSuite.executeQuery(CliSuite.scala:25) [info] at org.apache.spark.sql.hive.thriftserver.CliSuite$$anonfun$1.apply$mcV$sp(CliSuite.scala:53) [info] at org.apache.spark.sql.hive.thriftserver.CliSuite$$anonfun$1.apply(CliSuite.scala:51) [info] at org.apache.spark.sql.hive.thriftserver.CliSuite$$anonfun$1.apply(CliSuite.scala:51) [info] at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) [info] at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) {code} Thrift/JDBC Server -- Key: SPARK-2410 URL: https://issues.apache.org/jira/browse/SPARK-2410 Project: Spark Issue Type: New Feature Components: SQL Reporter: Michael Armbrust Assignee: Cheng Lian Priority: Blocker Fix For: 1.1.0 We have this, but need to make sure that it gets merged into master before the 1.1 release.
[jira] [Reopened] (SPARK-2410) Thrift/JDBC Server
[ https://issues.apache.org/jira/browse/SPARK-2410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell reopened SPARK-2410: Reopening this again due to test issues. Thrift/JDBC Server -- Key: SPARK-2410 URL: https://issues.apache.org/jira/browse/SPARK-2410 Project: Spark Issue Type: New Feature Components: SQL Reporter: Michael Armbrust Assignee: Cheng Lian Priority: Blocker Fix For: 1.1.0 We have this, but need to make sure that it gets merged into master before the 1.1 release. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2651) Add maven scalastyle plugin
[ https://issues.apache.org/jira/browse/SPARK-2651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2651: --- Assignee: Rahul Singhal Add maven scalastyle plugin --- Key: SPARK-2651 URL: https://issues.apache.org/jira/browse/SPARK-2651 Project: Spark Issue Type: Improvement Components: Build Reporter: Rahul Singhal Assignee: Rahul Singhal Priority: Minor Fix For: 1.1.0 SBT has a scalastyle plugin which can be executed to check for coding conventions. It would be nice to add the same for maven builds. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2712) Add a small note that mvn package must happen before test
Stephen Boesch created SPARK-2712: - Summary: Add a small note that mvn package must happen before test Key: SPARK-2712 URL: https://issues.apache.org/jira/browse/SPARK-2712 Project: Spark Issue Type: Documentation Components: Documentation Affects Versions: 1.0.0, 0.9.1, 1.1.1 Environment: all Reporter: Stephen Boesch Priority: Trivial Fix For: 1.1.0 Add to the building-with-maven.md: Requirement: build packages before running tests Tests must be run AFTER the package target has already been executed. The following is an example of a correct (build, test) sequence: mvn -Pyarn -Phadoop-2.3 -DskipTests -Phive clean package mvn -Pyarn -Phadoop-2.3 -Phive test BTW Reynold Xin requested this tiny doc improvement. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2713) Executors of same application in same host should only download files jars once
Zhihui created SPARK-2713: - Summary: Executors of same application in same host should only download files jars once Key: SPARK-2713 URL: https://issues.apache.org/jira/browse/SPARK-2713 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Zhihui If Spark launched multiple executors in one host for one application, every executor will download its dependent files and jars (if not using a local: url) independently. It may result in huge latency. In my case, it resulted in 20 seconds of latency to download dependent jars (about 17M) when I launched 32 executors in one host (4 hosts total). This patch will cache downloaded files and jars for executors to reduce network throughput and download latency. In my case, the latency was reduced from 20 seconds to less than 1 second. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2713) Executors of same application in same host should only download files jars once
[ https://issues.apache.org/jira/browse/SPARK-2713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhihui updated SPARK-2713: -- Description: If Spark launched multiple executors in one host for one application, every executor would download its dependent files and jars (if not using a local: url) independently. It may result in huge latency. In my case, it resulted in 20 seconds of latency to download dependent jars (about 17M) when I launched 32 executors in one host (4 hosts total). This patch will cache downloaded files and jars for executors to reduce network throughput and download latency. In my case, the latency was reduced from 20 seconds to less than 1 second. was: If Spark launched multiple executors in one host for one application, every executor will download its dependent files and jars (if not using a local: url) independently. It may result in huge latency. In my case, it resulted in 20 seconds of latency to download dependent jars (about 17M) when I launched 32 executors in one host (4 hosts total). This patch will cache downloaded files and jars for executors to reduce network throughput and download latency. In my case, the latency was reduced from 20 seconds to less than 1 second. Executors of same application in same host should only download files jars once - Key: SPARK-2713 URL: https://issues.apache.org/jira/browse/SPARK-2713 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Zhihui If Spark launched multiple executors in one host for one application, every executor would download its dependent files and jars (if not using a local: url) independently. It may result in huge latency. In my case, it resulted in 20 seconds of latency to download dependent jars (about 17M) when I launched 32 executors in one host (4 hosts total). This patch will cache downloaded files and jars for executors to reduce network throughput and download latency. In my case, the latency was reduced from 20 seconds to less than 1 second. -- This message was sent by Atlassian JIRA (v6.2#6252)
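A rough sketch of the caching idea (paths, helper names, and the locking scheme are assumptions, not the actual patch): executors on the same host share a per-URL cache directory, a file lock lets the first executor do the real download, and the rest copy from the cache.
{code}
// Sketch: per-host cache of downloaded files/jars, guarded by a file lock so that
// only one executor per host downloads a given URL. Illustrative only.
import java.io.{File, RandomAccessFile}
import java.net.{URL, URLEncoder}
import java.nio.file.{Files, StandardCopyOption}

def fetchCached(url: String, cacheDir: File, dest: File): Unit = {
  val cached = new File(cacheDir, URLEncoder.encode(url, "UTF-8"))
  val raf = new RandomAccessFile(new File(cacheDir, cached.getName + ".lock"), "rw")
  val lock = raf.getChannel.lock() // blocks until any in-progress download finishes
  try {
    if (!cached.exists()) {
      // First executor on this host to ask for the URL performs the download.
      val in = new URL(url).openStream()
      try Files.copy(in, cached.toPath, StandardCopyOption.REPLACE_EXISTING)
      finally in.close()
    }
  } finally {
    lock.release()
    raf.close()
  }
  // Every executor copies from the local cache instead of going to the network.
  Files.copy(cached.toPath, dest.toPath, StandardCopyOption.REPLACE_EXISTING)
}
{code}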
[jira] [Commented] (SPARK-2713) Executors of same application in same host should only download files jars once
[ https://issues.apache.org/jira/browse/SPARK-2713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14075914#comment-14075914 ] Zhihui commented on SPARK-2713: --- PR https://github.com/apache/spark/pull/1616 Executors of same application in same host should only download files jars once - Key: SPARK-2713 URL: https://issues.apache.org/jira/browse/SPARK-2713 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Zhihui If Spark launched multiple executors in one host for one application, every executor would download its dependent files and jars (if not using a local: url) independently. It may result in huge latency. In my case, it resulted in 20 seconds of latency to download dependent jars (about 17M) when I launched 32 executors in one host (4 hosts total). This patch will cache downloaded files and jars for executors to reduce network throughput and download latency. In my case, the latency was reduced from 20 seconds to less than 1 second. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2713) Executors of same application in same host should only download files jars once
[ https://issues.apache.org/jira/browse/SPARK-2713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14075916#comment-14075916 ] Apache Spark commented on SPARK-2713: - User 'li-zhihui' has created a pull request for this issue: https://github.com/apache/spark/pull/1616 Executors of same application in same host should only download files jars once - Key: SPARK-2713 URL: https://issues.apache.org/jira/browse/SPARK-2713 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Zhihui If Spark launched multiple executors in one host for one application, every executor would download its dependent files and jars (if not using a local: url) independently. It may result in huge latency. In my case, it resulted in 20 seconds of latency to download dependent jars (about 17M) when I launched 32 executors in one host (4 hosts total). This patch will cache downloaded files and jars for executors to reduce network throughput and download latency. In my case, the latency was reduced from 20 seconds to less than 1 second. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2484) Build should not run hive compatibility tests by default.
[ https://issues.apache.org/jira/browse/SPARK-2484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2484: --- Issue Type: Sub-task (was: Improvement) Parent: SPARK-2487 Build should not run hive compatibility tests by default. - Key: SPARK-2484 URL: https://issues.apache.org/jira/browse/SPARK-2484 Project: Spark Issue Type: Sub-task Reporter: Guoqiang Li Assignee: Guoqiang Li hive compatibility test takes a long time, in some cases, we don't need to run it. -- This message was sent by Atlassian JIRA (v6.2#6252)