[jira] [Created] (SPARK-2705) Wrong stage description in Web UI

2014-07-27 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-2705:
-

 Summary: Wrong stage description in Web UI 
 Key: SPARK-2705
 URL: https://issues.apache.org/jira/browse/SPARK-2705
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.1, 1.0.2
Reporter: Cheng Lian
Priority: Minor


The type of the stage description object in the stage table of the Web UI should be a 
{{String}}, but an {{Option\[String\]}} is used. See 
[here|https://github.com/apache/spark/blob/aaf2b735fddbebccd28012006ee4647af3b3624f/core/src/main/scala/org/apache/spark/ui/jobs/StageTable.scala#L125].
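As a minimal illustration of the mismatch (the variable names below are hypothetical, not taken from StageTable.scala), rendering the {{Option}} directly produces a "Some(...)" string in the UI, whereas unwrapping it first gives the intended description:
{code}
// Hypothetical sketch of the type mismatch; not the actual StageTable.scala code.
val description: Option[String] = Some("count at Example.scala:42")

// Rendering the Option directly shows up as "Some(count at Example.scala:42)":
val renderedWrong = description.toString

// Unwrapping first yields the plain String the table expects:
val renderedRight = description.getOrElse("")
{code}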



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2705) Wrong stage description in Web UI

2014-07-27 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14075609#comment-14075609
 ] 

Cheng Lian commented on SPARK-2705:
---

PR: https://github.com/apache/spark/pull/1524

 Wrong stage description in Web UI 
 --

 Key: SPARK-2705
 URL: https://issues.apache.org/jira/browse/SPARK-2705
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.1, 1.0.2
Reporter: Cheng Lian
Priority: Minor

 The type of the stage description object in the stage table of the Web UI should be a 
 {{String}}, but an {{Option\[String\]}} is used. See 
 [here|https://github.com/apache/spark/blob/aaf2b735fddbebccd28012006ee4647af3b3624f/core/src/main/scala/org/apache/spark/ui/jobs/StageTable.scala#L125].



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2706) Enable Spark to support Hive 0.13

2014-07-27 Thread Chunjun Xiao (JIRA)
Chunjun Xiao created SPARK-2706:
---

 Summary: Enable Spark to support Hive 0.13
 Key: SPARK-2706
 URL: https://issues.apache.org/jira/browse/SPARK-2706
 Project: Spark
  Issue Type: Dependency upgrade
  Components: SQL
Affects Versions: 1.0.1
Reporter: Chunjun Xiao


It seems Spark cannot work well with Hive 0.13.
When I compiled Spark with Hive 0.13.1, I got some error messages, as attached 
below.
So, when can Spark be enabled to support Hive 0.13?
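For illustration only, here is a hedged Scala sketch of two of the call-site changes that the compile errors quoted below seem to imply; the variable names are made up and the actual fix for Hive 0.13 may well differ:
{code}
// Hedged sketch only; not a verified Hive 0.13 migration.
import java.math.BigDecimal
import org.apache.hadoop.hive.common.type.HiveDecimal
import org.apache.hadoop.hive.conf.HiveConf
import org.apache.hadoop.hive.ql.processors.CommandProcessorFactory

val hiveconf = new HiveConf()
val tokens: Array[String] = Array("show", "tables")

// "required: Array[String]" suggests passing the whole token array, not tokens(0):
val proc = CommandProcessorFactory.get(tokens, hiveconf)

// "HiveDecimal does not have a constructor" suggests the factory method instead:
val hd = HiveDecimal.create(new BigDecimal("3.14"))
{code}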

Compiling Error:
{quote}
[ERROR] 
/ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala:180:
 type mismatch;
 found   : String
 required: Array[String]
[ERROR]   val proc: CommandProcessor = 
CommandProcessorFactory.get(tokens(0), hiveconf)
[ERROR]  ^
[ERROR] 
/ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala:264:
 overloaded method constructor TableDesc with alternatives:
  (x$1: Class[_ <: org.apache.hadoop.mapred.InputFormat[_, _]],x$2: 
Class[_],x$3: java.util.Properties)org.apache.hadoop.hive.ql.plan.TableDesc 
and
  ()org.apache.hadoop.hive.ql.plan.TableDesc
 cannot be applied to (Class[org.apache.hadoop.hive.serde2.Deserializer], 
Class[(some other)?0(in value tableDesc)(in value tableDesc)], Class[?0(in 
value tableDesc)(in value tableDesc)], java.util.Properties)
[ERROR]   val tableDesc = new TableDesc(
[ERROR]   ^
[ERROR] 
/ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala:140:
 value getPartitionPath is not a member of 
org.apache.hadoop.hive.ql.metadata.Partition
[ERROR]   val partPath = partition.getPartitionPath
[ERROR]^
[ERROR] 
/ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveTableScan.scala:132:
 value appendReadColumnNames is not a member of object 
org.apache.hadoop.hive.serde2.ColumnProjectionUtils
[ERROR] ColumnProjectionUtils.appendReadColumnNames(hiveConf, 
attributes.map(_.name))
[ERROR]   ^
[ERROR] 
/ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala:79:
 org.apache.hadoop.hive.common.type.HiveDecimal does not have a constructor
[ERROR]   new HiveDecimal(bd.underlying())
[ERROR]   ^
[ERROR] 
/ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala:132:
 type mismatch;
 found   : org.apache.hadoop.fs.Path
 required: String
[ERROR]   
SparkHiveHadoopWriter.createPathFromString(fileSinkConf.getDirName, conf))
[ERROR]   ^
[ERROR] 
/ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala:179:
 value getExternalTmpFileURI is not a member of 
org.apache.hadoop.hive.ql.Context
[ERROR] val tmpLocation = hiveContext.getExternalTmpFileURI(tableLocation)
[ERROR]   ^
[ERROR] 
/ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveUdfs.scala:209: 
org.apache.hadoop.hive.common.type.HiveDecimal does not have a constructor
[ERROR]   case bd: BigDecimal => new HiveDecimal(bd.underlying())
[ERROR]  ^
[ERROR] 8 errors found
[DEBUG] Compilation failed (CompilerInterface)
[INFO] 
[INFO] Reactor Summary:
[INFO] 
[INFO] Spark Project Parent POM .. SUCCESS [2.579s]
[INFO] Spark Project Core  SUCCESS [2:39.805s]
[INFO] Spark Project Bagel ... SUCCESS [21.148s]
[INFO] Spark Project GraphX .. SUCCESS [59.950s]
[INFO] Spark Project ML Library .. SUCCESS [1:08.771s]
[INFO] Spark Project Streaming ... SUCCESS [1:17.759s]
[INFO] Spark Project Tools ... SUCCESS [15.405s]
[INFO] Spark Project Catalyst  SUCCESS [1:17.405s]
[INFO] Spark Project SQL . SUCCESS [1:11.094s]
[INFO] Spark Project Hive  FAILURE [11.121s]
[INFO] Spark Project REPL  SKIPPED
[INFO] Spark Project YARN Parent POM . SKIPPED
[INFO] Spark Project YARN Stable API . SKIPPED
[INFO] Spark Project Assembly  SKIPPED
[INFO] Spark Project External Twitter  SKIPPED
[INFO] Spark Project External Kafka .. SKIPPED
[INFO] Spark Project External Flume .. SKIPPED
[INFO] Spark Project External ZeroMQ . SKIPPED
[INFO] Spark Project External MQTT ... SKIPPED
[INFO] Spark Project Examples  SKIPPED

[jira] [Resolved] (SPARK-2679) Ser/De for Double to enable calling Java API from python in MLlib

2014-07-27 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-2679.
--

   Resolution: Fixed
Fix Version/s: 1.1.0

Issue resolved by pull request 1581
[https://github.com/apache/spark/pull/1581]

 Ser/De for Double to enable calling Java API from python in MLlib
 -

 Key: SPARK-2679
 URL: https://issues.apache.org/jira/browse/SPARK-2679
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Reporter: Doris Xin
 Fix For: 1.1.0


 In order to enable Java/Scala APIs to be reused in the Python implementation 
 of RandomRDD and Correlations, we need a set of ser/de for the type Double in 
 _common.py and PythonMLLibAPI.
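 As an illustrative sketch only (not the actual PythonMLLibAPI code), a Double can be 
 serialized to 8 big-endian bytes on the JVM side and unpacked in Python with 
 struct.unpack('>d', ...):
 {code}
 // Illustrative Scala sketch; names and layout are assumptions, not the Spark implementation.
 import java.nio.ByteBuffer

 def serializeDouble(d: Double): Array[Byte] =
   ByteBuffer.allocate(8).putDouble(d).array()   // big-endian by default

 def deserializeDouble(bytes: Array[Byte]): Double =
   ByteBuffer.wrap(bytes).getDouble
 {code}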



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2679) Ser/De for Double to enable calling Java API from python in MLlib

2014-07-27 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-2679:
-

Assignee: Doris Xin

 Ser/De for Double to enable calling Java API from python in MLlib
 -

 Key: SPARK-2679
 URL: https://issues.apache.org/jira/browse/SPARK-2679
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Reporter: Doris Xin
Assignee: Doris Xin
 Fix For: 1.1.0


 In order to enable Java/Scala APIs to be reused in the Python implementation 
 of RandomRDD and Correlations, we need a set of ser/de for the type Double in 
 _common.py and PythonMLLibAPI.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2681) Spark can hang when fetching shuffle blocks

2014-07-27 Thread Guoqiang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guoqiang Li updated SPARK-2681:
---

Affects Version/s: (was: 1.0.0)
   1.0.1

 Spark can hang when fetching shuffle blocks
 ---

 Key: SPARK-2681
 URL: https://issues.apache.org/jira/browse/SPARK-2681
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.1
Reporter: Guoqiang Li
Priority: Blocker

 executor log :
 {noformat}
 14/07/24 22:56:52 INFO executor.CoarseGrainedExecutorBackend: Got assigned 
 task 53628
 14/07/24 22:56:52 INFO executor.Executor: Running task ID 53628
 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_3 locally
 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_18 locally
 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_16 locally
 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_19 locally
 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_20 locally
 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_21 locally
 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_22 locally
 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_3 locally
 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_18 locally
 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_16 locally
 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_19 locally
 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_20 locally
 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_21 locally
 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_22 locally
 14/07/24 22:56:52 INFO spark.MapOutputTrackerWorker: Updating epoch to 236 
 and clearing cache
 14/07/24 22:56:52 INFO spark.CacheManager: Partition rdd_51_83 not found, 
 computing it
 14/07/24 22:56:52 INFO spark.MapOutputTrackerWorker: Don't have map outputs 
 for shuffle 9, fetching them
 14/07/24 22:56:52 INFO spark.MapOutputTrackerWorker: Doing the fetch; tracker 
 actor = 
 Actor[akka.tcp://spark@tuan202:49488/user/MapOutputTracker#-1031481395]
 14/07/24 22:56:53 INFO spark.MapOutputTrackerWorker: Got the output locations
 14/07/24 22:56:53 INFO 
 storage.BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 
 50331648, targetRequestSize: 10066329
 14/07/24 22:56:53 INFO 
 storage.BlockFetcherIterator$BasicBlockFetcherIterator: Getting 1024 
 non-empty blocks out of 1024 blocks
 14/07/24 22:56:53 INFO 
 storage.BlockFetcherIterator$BasicBlockFetcherIterator: Started 58 remote 
 fetches in 8 ms
 14/07/24 22:56:55 INFO storage.MemoryStore: ensureFreeSpace(28728) called 
 with curMem=920109320, maxMem=4322230272
 14/07/24 22:56:55 INFO storage.MemoryStore: Block rdd_51_83 stored as values 
 to memory (estimated size 28.1 KB, free 3.2 GB)
 14/07/24 22:56:55 INFO storage.BlockManagerMaster: Updated info of block 
 rdd_51_83
 14/07/24 22:56:55 INFO spark.CacheManager: Partition rdd_189_83 not found, 
 computing it
 14/07/24 22:56:55 INFO spark.MapOutputTrackerWorker: Don't have map outputs 
 for shuffle 28, fetching them
 14/07/24 22:56:55 INFO spark.MapOutputTrackerWorker: Doing the fetch; tracker 
 actor = 
 Actor[akka.tcp://spark@tuan202:49488/user/MapOutputTracker#-1031481395]
 14/07/24 22:56:55 INFO spark.MapOutputTrackerWorker: Got the output locations
 14/07/24 22:56:55 INFO 
 storage.BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 
 50331648, targetRequestSize: 10066329
 14/07/24 22:56:55 INFO 
 storage.BlockFetcherIterator$BasicBlockFetcherIterator: Getting 1 non-empty 
 blocks out of 1024 blocks
 14/07/24 22:56:55 INFO 
 storage.BlockFetcherIterator$BasicBlockFetcherIterator: Started 1 remote 
 fetches in 0 ms
 14/07/24 22:56:55 INFO spark.CacheManager: Partition rdd_50_83 not found, 
 computing it
 14/07/24 22:56:55 INFO 
 storage.BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 
 50331648, targetRequestSize: 10066329
 14/07/24 22:56:55 INFO 
 storage.BlockFetcherIterator$BasicBlockFetcherIterator: Getting 1024 
 non-empty blocks out of 1024 blocks
 14/07/24 22:56:55 INFO 
 storage.BlockFetcherIterator$BasicBlockFetcherIterator: Started 58 remote 
 fetches in 4 ms
 14/07/24 22:57:09 INFO network.ConnectionManager: Removing 
 ReceivingConnection to ConnectionManagerId(tuan221,51153)
 14/07/24 22:57:09 INFO network.ConnectionManager: Removing SendingConnection 
 to ConnectionManagerId(tuan221,51153)
 14/07/24 22:57:09 INFO network.ConnectionManager: Removing SendingConnection 
 to ConnectionManagerId(tuan221,51153)
 14/07/24 23:05:07 INFO network.ConnectionManager: Key not valid ? 
 sun.nio.ch.SelectionKeyImpl@3dcc1da1
 14/07/24 23:05:07 INFO 

[jira] [Updated] (SPARK-2681) Spark can hang when fetching shuffle blocks

2014-07-27 Thread Guoqiang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guoqiang Li updated SPARK-2681:
---

Attachment: jstack-26027.log

[~pwendell] Jstack output has been uploaded.

 Spark can hang when fetching shuffle blocks
 ---

 Key: SPARK-2681
 URL: https://issues.apache.org/jira/browse/SPARK-2681
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.1
Reporter: Guoqiang Li
Priority: Blocker
 Attachments: jstack-26027.log


 executor log :
 {noformat}
 14/07/24 22:56:52 INFO executor.CoarseGrainedExecutorBackend: Got assigned 
 task 53628
 14/07/24 22:56:52 INFO executor.Executor: Running task ID 53628
 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_3 locally
 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_18 locally
 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_16 locally
 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_19 locally
 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_20 locally
 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_21 locally
 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_22 locally
 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_3 locally
 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_18 locally
 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_16 locally
 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_19 locally
 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_20 locally
 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_21 locally
 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_22 locally
 14/07/24 22:56:52 INFO spark.MapOutputTrackerWorker: Updating epoch to 236 
 and clearing cache
 14/07/24 22:56:52 INFO spark.CacheManager: Partition rdd_51_83 not found, 
 computing it
 14/07/24 22:56:52 INFO spark.MapOutputTrackerWorker: Don't have map outputs 
 for shuffle 9, fetching them
 14/07/24 22:56:52 INFO spark.MapOutputTrackerWorker: Doing the fetch; tracker 
 actor = 
 Actor[akka.tcp://spark@tuan202:49488/user/MapOutputTracker#-1031481395]
 14/07/24 22:56:53 INFO spark.MapOutputTrackerWorker: Got the output locations
 14/07/24 22:56:53 INFO 
 storage.BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 
 50331648, targetRequestSize: 10066329
 14/07/24 22:56:53 INFO 
 storage.BlockFetcherIterator$BasicBlockFetcherIterator: Getting 1024 
 non-empty blocks out of 1024 blocks
 14/07/24 22:56:53 INFO 
 storage.BlockFetcherIterator$BasicBlockFetcherIterator: Started 58 remote 
 fetches in 8 ms
 14/07/24 22:56:55 INFO storage.MemoryStore: ensureFreeSpace(28728) called 
 with curMem=920109320, maxMem=4322230272
 14/07/24 22:56:55 INFO storage.MemoryStore: Block rdd_51_83 stored as values 
 to memory (estimated size 28.1 KB, free 3.2 GB)
 14/07/24 22:56:55 INFO storage.BlockManagerMaster: Updated info of block 
 rdd_51_83
 14/07/24 22:56:55 INFO spark.CacheManager: Partition rdd_189_83 not found, 
 computing it
 14/07/24 22:56:55 INFO spark.MapOutputTrackerWorker: Don't have map outputs 
 for shuffle 28, fetching them
 14/07/24 22:56:55 INFO spark.MapOutputTrackerWorker: Doing the fetch; tracker 
 actor = 
 Actor[akka.tcp://spark@tuan202:49488/user/MapOutputTracker#-1031481395]
 14/07/24 22:56:55 INFO spark.MapOutputTrackerWorker: Got the output locations
 14/07/24 22:56:55 INFO 
 storage.BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 
 50331648, targetRequestSize: 10066329
 14/07/24 22:56:55 INFO 
 storage.BlockFetcherIterator$BasicBlockFetcherIterator: Getting 1 non-empty 
 blocks out of 1024 blocks
 14/07/24 22:56:55 INFO 
 storage.BlockFetcherIterator$BasicBlockFetcherIterator: Started 1 remote 
 fetches in 0 ms
 14/07/24 22:56:55 INFO spark.CacheManager: Partition rdd_50_83 not found, 
 computing it
 14/07/24 22:56:55 INFO 
 storage.BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 
 50331648, targetRequestSize: 10066329
 14/07/24 22:56:55 INFO 
 storage.BlockFetcherIterator$BasicBlockFetcherIterator: Getting 1024 
 non-empty blocks out of 1024 blocks
 14/07/24 22:56:55 INFO 
 storage.BlockFetcherIterator$BasicBlockFetcherIterator: Started 58 remote 
 fetches in 4 ms
 14/07/24 22:57:09 INFO network.ConnectionManager: Removing 
 ReceivingConnection to ConnectionManagerId(tuan221,51153)
 14/07/24 22:57:09 INFO network.ConnectionManager: Removing SendingConnection 
 to ConnectionManagerId(tuan221,51153)
 14/07/24 22:57:09 INFO network.ConnectionManager: Removing SendingConnection 
 to ConnectionManagerId(tuan221,51153)
 14/07/24 23:05:07 INFO network.ConnectionManager: Key not valid ? 
 

[jira] [Commented] (SPARK-2532) Fix issues with consolidated shuffle

2014-07-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14075639#comment-14075639
 ] 

Apache Spark commented on SPARK-2532:
-

User 'mridulm' has created a pull request for this issue:
https://github.com/apache/spark/pull/1609

 Fix issues with consolidated shuffle
 

 Key: SPARK-2532
 URL: https://issues.apache.org/jira/browse/SPARK-2532
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
 Environment: All
Reporter: Mridul Muralidharan
Assignee: Mridul Muralidharan
Priority: Critical
 Fix For: 1.1.0


 Will file a PR with the changes as soon as the merge is done (the earlier merge became 
 outdated in 2 weeks unfortunately :) ).
 Consolidated shuffle is broken in multiple ways in Spark:
 a) Task failure(s) can cause the state to become inconsistent.
 b) Multiple reverts, or a combination of close/revert/close, can cause the state 
 to be inconsistent (as part of exception/error handling).
 c) Some of the API in the block writer causes implementation issues - for 
 example, a revert is always followed by a close, but the implementation tries to 
 keep them separate, creating surface area for errors.
 d) Fetching data from consolidated shuffle files can go badly wrong if the 
 file is being actively written to: it computes the length by subtracting the next 
 offset from the current offset (or the file length if this is the last offset) - 
 the latter fails when a fetch happens in parallel with a write.
 Note, this happens even if there are no task failures of any kind!
 This usually results in stream corruption or decompression errors.
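 A small sketch of the length computation described in item (d) above (names are 
 hypothetical); the last-segment case is the one that breaks while a writer is still 
 appending to the file:
 {code}
 // Hypothetical sketch of the offset-difference length calculation from (d).
 def segmentLength(offsets: Array[Long], fileLength: Long, i: Int): Long =
   if (i < offsets.length - 1) offsets(i + 1) - offsets(i)  // gap to the next offset
   else fileLength - offsets(i)  // last segment: up to the current file length,
                                 // wrong if the file is still being written to
 {code}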



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2707) Upgrade to Akka 2.3

2014-07-27 Thread Yardena (JIRA)
Yardena created SPARK-2707:
--

 Summary: Upgrade to Akka 2.3
 Key: SPARK-2707
 URL: https://issues.apache.org/jira/browse/SPARK-2707
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Yardena


Upgrade Akka from 2.2 to 2.3. We want to be able to use new Akka and Spray 
features directly in the same project.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2707) Upgrade to Akka 2.3

2014-07-27 Thread Yardena (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14075648#comment-14075648
 ] 

Yardena commented on SPARK-2707:


Some minor source changes may be required, see 
http://doc.akka.io/docs/akka/snapshot/project/migration-guide-2.2.x-2.3.x.html 

 Upgrade to Akka 2.3
 ---

 Key: SPARK-2707
 URL: https://issues.apache.org/jira/browse/SPARK-2707
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Yardena

 Upgrade Akka from 2.2 to 2.3. We want to be able to use new Akka and Spray 
 features directly in the same project.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2707) Upgrade to Akka 2.3

2014-07-27 Thread Aaron Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14075652#comment-14075652
 ] 

Aaron Davidson commented on SPARK-2707:
---

It does sound mostly mechanical and I believe we don't use most of those 
features. Perhaps just getting it to compile (while re-shading protobuf) would 
be sufficient to make it work.

 Upgrade to Akka 2.3
 ---

 Key: SPARK-2707
 URL: https://issues.apache.org/jira/browse/SPARK-2707
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Yardena

 Upgrade Akka from 2.2 to 2.3. We want to be able to use new Akka and Spray 
 features directly in the same project.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2550) Support regularization and intercept in pyspark's linear methods

2014-07-27 Thread Michael Yannakopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14075654#comment-14075654
 ] 

Michael Yannakopoulos commented on SPARK-2550:
--

Hi Xiangrui,

Is it only my problem, or a general one, that building the whole project 
with the command 'sbt/sbt assembly' fails? From what I see, the errors come from 
the patches related to the UI and WebUI functionality.

Thanks,
Michael

 Support regularization and intercept in pyspark's linear methods
 

 Key: SPARK-2550
 URL: https://issues.apache.org/jira/browse/SPARK-2550
 Project: Spark
  Issue Type: Improvement
  Components: MLlib, PySpark
Affects Versions: 1.0.0
Reporter: Xiangrui Meng
Assignee: Michael Yannakopoulos

 Python API doesn't provide options to set regularization parameter and 
 intercept in linear methods, which should be fixed in v1.1.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2677) BasicBlockFetchIterator#next can wait forever

2014-07-27 Thread Guoqiang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guoqiang Li updated SPARK-2677:
---

Affects Version/s: 1.0.1

 BasicBlockFetchIterator#next can wait forever
 -

 Key: SPARK-2677
 URL: https://issues.apache.org/jira/browse/SPARK-2677
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0, 1.0.1
Reporter: Kousuke Saruta
Priority: Blocker
 Fix For: 1.1.0, 1.0.3


 In BasicBlockFetchIterator#next, it waits for the fetch result on results.take().
 {code}
 override def next(): (BlockId, Option[Iterator[Any]]) = {
   resultsGotten += 1
   val startFetchWait = System.currentTimeMillis()
   val result = results.take()
   val stopFetchWait = System.currentTimeMillis()
   _fetchWaitTime += (stopFetchWait - startFetchWait)
   if (!result.failed) bytesInFlight -= result.size
   while (!fetchRequests.isEmpty &&
     (bytesInFlight == 0 || bytesInFlight + fetchRequests.front.size <= maxBytesInFlight)) {
     sendRequest(fetchRequests.dequeue())
   }
   (result.blockId, if (result.failed) None else Some(result.deserialize()))
 }
 {code}
 But results is implemented as a LinkedBlockingQueue, so if the remote executor hangs 
 up, the fetching executor waits forever.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2677) BasicBlockFetchIterator#next can wait forever

2014-07-27 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2677:
---

Affects Version/s: 0.9.2

 BasicBlockFetchIterator#next can wait forever
 -

 Key: SPARK-2677
 URL: https://issues.apache.org/jira/browse/SPARK-2677
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.9.2, 1.0.0, 1.0.1
Reporter: Kousuke Saruta
Priority: Blocker

 In BasicBlockFetchIterator#next, it waits for the fetch result on results.take().
 {code}
 override def next(): (BlockId, Option[Iterator[Any]]) = {
   resultsGotten += 1
   val startFetchWait = System.currentTimeMillis()
   val result = results.take()
   val stopFetchWait = System.currentTimeMillis()
   _fetchWaitTime += (stopFetchWait - startFetchWait)
   if (!result.failed) bytesInFlight -= result.size
   while (!fetchRequests.isEmpty &&
     (bytesInFlight == 0 || bytesInFlight + fetchRequests.front.size <= maxBytesInFlight)) {
     sendRequest(fetchRequests.dequeue())
   }
   (result.blockId, if (result.failed) None else Some(result.deserialize()))
 }
 {code}
 But results is implemented as a LinkedBlockingQueue, so if the remote executor hangs 
 up, the fetching executor waits forever.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2677) BasicBlockFetchIterator#next can wait forever

2014-07-27 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14075672#comment-14075672
 ] 

Patrick Wendell commented on SPARK-2677:


Just as an FYI - this has also been observed in several earlier versions of 
Spark. I think one issue is that we don't have timeouts in the connection 
manager code. If a JVM goes into GC thrashing and becomes unresponsive (but 
still alive), then you can get stuck here.
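One possible shape of such a timeout, sketched here purely as an assumption (this is not the actual fix): replace the unbounded {{results.take()}} with a bounded poll and fail the fetch if nothing arrives in time.
{code}
// Hedged sketch: a bounded wait on the results queue instead of a blocking take().
import java.io.IOException
import java.util.concurrent.{LinkedBlockingQueue, TimeUnit}

case class FetchResult(blockId: String, size: Long, failed: Boolean)

val results = new LinkedBlockingQueue[FetchResult]()
val fetchTimeoutSeconds = 120L  // assumed configurable timeout

def nextResult(): FetchResult = {
  val result = results.poll(fetchTimeoutSeconds, TimeUnit.SECONDS)
  if (result == null) {
    throw new IOException(
      s"No fetch result arrived within $fetchTimeoutSeconds seconds; remote executor may be dead")
  }
  result
}
{code}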

 BasicBlockFetchIterator#next can wait forever
 -

 Key: SPARK-2677
 URL: https://issues.apache.org/jira/browse/SPARK-2677
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.9.2, 1.0.0, 1.0.1
Reporter: Kousuke Saruta
Priority: Blocker

 In BasicBlockFetchIterator#next, it waits for the fetch result on results.take().
 {code}
 override def next(): (BlockId, Option[Iterator[Any]]) = {
   resultsGotten += 1
   val startFetchWait = System.currentTimeMillis()
   val result = results.take()
   val stopFetchWait = System.currentTimeMillis()
   _fetchWaitTime += (stopFetchWait - startFetchWait)
   if (!result.failed) bytesInFlight -= result.size
   while (!fetchRequests.isEmpty &&
     (bytesInFlight == 0 || bytesInFlight + fetchRequests.front.size <= maxBytesInFlight)) {
     sendRequest(fetchRequests.dequeue())
   }
   (result.blockId, if (result.failed) None else Some(result.deserialize()))
 }
 {code}
 But results is implemented as a LinkedBlockingQueue, so if the remote executor hangs 
 up, the fetching executor waits forever.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2677) BasicBlockFetchIterator#next can wait forever

2014-07-27 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2677:
---

Target Version/s: 1.1.0, 1.0.3
   Fix Version/s: (was: 1.0.3)
  (was: 1.1.0)

 BasicBlockFetchIterator#next can wait forever
 -

 Key: SPARK-2677
 URL: https://issues.apache.org/jira/browse/SPARK-2677
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.9.2, 1.0.0, 1.0.1
Reporter: Kousuke Saruta
Priority: Blocker

 In BasicBlockFetchIterator#next, it waits for the fetch result on results.take().
 {code}
 override def next(): (BlockId, Option[Iterator[Any]]) = {
   resultsGotten += 1
   val startFetchWait = System.currentTimeMillis()
   val result = results.take()
   val stopFetchWait = System.currentTimeMillis()
   _fetchWaitTime += (stopFetchWait - startFetchWait)
   if (!result.failed) bytesInFlight -= result.size
   while (!fetchRequests.isEmpty &&
     (bytesInFlight == 0 || bytesInFlight + fetchRequests.front.size <= maxBytesInFlight)) {
     sendRequest(fetchRequests.dequeue())
   }
   (result.blockId, if (result.failed) None else Some(result.deserialize()))
 }
 {code}
 But results is implemented as a LinkedBlockingQueue, so if the remote executor hangs 
 up, the fetching executor waits forever.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2677) BasicBlockFetchIterator#next can wait forever

2014-07-27 Thread Guoqiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14075673#comment-14075673
 ] 

Guoqiang Li commented on SPARK-2677:


If {{yarn.scheduler.fair.preemption}} is set to true in YARN, this issue will 
appear frequently.

 BasicBlockFetchIterator#next can wait forever
 -

 Key: SPARK-2677
 URL: https://issues.apache.org/jira/browse/SPARK-2677
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.9.2, 1.0.0, 1.0.1
Reporter: Kousuke Saruta
Priority: Blocker

 In BasicBlockFetchIterator#next, it waits for the fetch result on results.take().
 {code}
 override def next(): (BlockId, Option[Iterator[Any]]) = {
   resultsGotten += 1
   val startFetchWait = System.currentTimeMillis()
   val result = results.take()
   val stopFetchWait = System.currentTimeMillis()
   _fetchWaitTime += (stopFetchWait - startFetchWait)
   if (!result.failed) bytesInFlight -= result.size
   while (!fetchRequests.isEmpty &&
     (bytesInFlight == 0 || bytesInFlight + fetchRequests.front.size <= maxBytesInFlight)) {
     sendRequest(fetchRequests.dequeue())
   }
   (result.blockId, if (result.failed) None else Some(result.deserialize()))
 }
 {code}
 But results is implemented as a LinkedBlockingQueue, so if the remote executor hangs 
 up, the fetching executor waits forever.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-2677) BasicBlockFetchIterator#next can wait forever

2014-07-27 Thread Guoqiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14075673#comment-14075673
 ] 

Guoqiang Li edited comment on SPARK-2677 at 7/27/14 6:17 PM:
-

If {{yarn.scheduler.fair.preemption}} is set to {{true}} in YARN, this issue 
will appear frequently.


was (Author: gq):
If {{yarn.scheduler.fair.preemption}} is set to true in yarn, This issue will 
appear frequently.

 BasicBlockFetchIterator#next can wait forever
 -

 Key: SPARK-2677
 URL: https://issues.apache.org/jira/browse/SPARK-2677
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.9.2, 1.0.0, 1.0.1
Reporter: Kousuke Saruta
Priority: Blocker

 In BasicBlockFetchIterator#next, it waits for the fetch result on results.take().
 {code}
 override def next(): (BlockId, Option[Iterator[Any]]) = {
   resultsGotten += 1
   val startFetchWait = System.currentTimeMillis()
   val result = results.take()
   val stopFetchWait = System.currentTimeMillis()
   _fetchWaitTime += (stopFetchWait - startFetchWait)
   if (!result.failed) bytesInFlight -= result.size
   while (!fetchRequests.isEmpty &&
     (bytesInFlight == 0 || bytesInFlight + fetchRequests.front.size <= maxBytesInFlight)) {
     sendRequest(fetchRequests.dequeue())
   }
   (result.blockId, if (result.failed) None else Some(result.deserialize()))
 }
 {code}
 But results is implemented as a LinkedBlockingQueue, so if the remote executor hangs 
 up, the fetching executor waits forever.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2550) Support regularization and intercept in pyspark's linear methods

2014-07-27 Thread Michael Yannakopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14075675#comment-14075675
 ] 

Michael Yannakopoulos commented on SPARK-2550:
--

The errors are related to the fact that the TaskUIData case class has moved from
'org.apache.spark.ui.jobs.TaskUIData' to 
'org.apache.spark.ui.jobs.UIData.TaskUIData'.
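In other words, code inside the Spark UI that still uses the old import needs to be updated (paths as quoted above; shown only as a sketch):
{code}
// Old location (no longer compiles):
// import org.apache.spark.ui.jobs.TaskUIData

// New location after the move:
import org.apache.spark.ui.jobs.UIData.TaskUIData
{code}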

Should I open a new Jira Task and resolve it?

 Support regularization and intercept in pyspark's linear methods
 

 Key: SPARK-2550
 URL: https://issues.apache.org/jira/browse/SPARK-2550
 Project: Spark
  Issue Type: Improvement
  Components: MLlib, PySpark
Affects Versions: 1.0.0
Reporter: Xiangrui Meng
Assignee: Michael Yannakopoulos

 Python API doesn't provide options to set regularization parameter and 
 intercept in linear methods, which should be fixed in v1.1.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2708) [APACHE-SPARK] [CORE] Build Fails: Case class 'TaskUIData' makes sbt complaining.

2014-07-27 Thread Michael Yannakopoulos (JIRA)
Michael Yannakopoulos created SPARK-2708:


 Summary: [APACHE-SPARK] [CORE] Build Fails: Case class 
'TaskUIData' makes sbt complaining.
 Key: SPARK-2708
 URL: https://issues.apache.org/jira/browse/SPARK-2708
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.1
Reporter: Michael Yannakopoulos


Build procedure fails due to numerous errors appearing in files located in 
Apache Spark Core project's 'org.apache.spark.ui' directory where case class 
'TaskUIData' appears to be undefined. However the problem seems more 
complicated since the class is imported correctly to the aforementioned files.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2688) Need a way to run multiple data pipeline concurrently

2014-07-27 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14075720#comment-14075720
 ] 

Sean Owen commented on SPARK-2688:
--

If you persist/cache rdd2, it is not recomputed. You can already execute 
operations in parallel within a single SparkContext - just submit the actions concurrently. 
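For example, a minimal sketch (assuming a SparkContext {{sc}} such as the one provided by spark-shell; the RDDs and the dummy function are made up):
{code}
// Cache the shared intermediate RDD, then run the downstream actions
// concurrently on Scala futures. Assumes an existing SparkContext `sc`.
import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration
import scala.concurrent.ExecutionContext.Implicits.global

val fn = (x: Int) => ()                          // dummy action function
val rdd1 = sc.parallelize(1 to 1000)
val rdd2 = rdd1.map(_ * 2).cache()               // computed once, reused by every branch

val branches = Seq(
  Future(rdd2.map(_ + 3).foreach(fn)),           // rdd3
  Future(rdd2.map(_ + 4).foreach(fn)),           // rdd4
  Future(rdd2.map(_ + 5).foreach(fn)),           // rdd5
  Future(rdd2.filter(_ % 6 == 0).foreach(fn)))   // rdd6

Await.result(Future.sequence(branches), Duration.Inf)
{code}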

 Need a way to run multiple data pipeline concurrently
 -

 Key: SPARK-2688
 URL: https://issues.apache.org/jira/browse/SPARK-2688
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.1
Reporter: Xuefu Zhang

 Suppose we want to do the following data processing: 
 {code}
 rdd1 -> rdd2 -> rdd3
            | -> rdd4
            | -> rdd5
            \ -> rdd6
 {code}
 where -> represents a transformation. rdd3 to rdd6 are all derived from an 
 intermediate rdd2. We use foreach(fn) with a dummy function to trigger the 
 execution. However, rdd.foreach(fn) only triggers the pipeline rdd1 -> rdd2 -> 
 rdd3. To make things worse, when we call rdd4.foreach(), rdd2 will be 
 recomputed. This is very inefficient. Ideally, we should be able to trigger 
 the execution of the whole graph and reuse rdd2, but there doesn't seem to be a 
 way of doing so. Tez already realized the importance of this (TEZ-391), so I 
 think Spark should provide this too.
 This is required for Hive to support multi-insert queries. HIVE-7292.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2707) Upgrade to Akka 2.3

2014-07-27 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14075721#comment-14075721
 ] 

Patrick Wendell commented on SPARK-2707:


What are the features we want to use here in the newer akka version? I sort of 
wonder whether we should just shade all of akka so that we don't expose it as 
an external API in Spark, and users can independently use whatever Akka version 
they want. Otherwise we won't ever be able to swap out our internal 
communication layer.

 Upgrade to Akka 2.3
 ---

 Key: SPARK-2707
 URL: https://issues.apache.org/jira/browse/SPARK-2707
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Yardena

 Upgrade Akka from 2.2 to 2.3. We want to be able to use new Akka and Spray 
 features directly in the same project.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2705) Wrong stage description in Web UI

2014-07-27 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14075722#comment-14075722
 ] 

Patrick Wendell commented on SPARK-2705:


Fixed via: https://github.com/apache/spark/pull/1524

 Wrong stage description in Web UI 
 --

 Key: SPARK-2705
 URL: https://issues.apache.org/jira/browse/SPARK-2705
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.1, 1.0.2
Reporter: Cheng Lian
Priority: Minor
 Fix For: 1.1.0


 The type of the stage description object in the stage table of the Web UI should be a 
 {{String}}, but an {{Option\[String\]}} is used. See 
 [here|https://github.com/apache/spark/blob/aaf2b735fddbebccd28012006ee4647af3b3624f/core/src/main/scala/org/apache/spark/ui/jobs/StageTable.scala#L125].



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2705) Wrong stage description in Web UI

2014-07-27 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2705:
---

Assignee: Cheng Lian

 Wrong stage description in Web UI 
 --

 Key: SPARK-2705
 URL: https://issues.apache.org/jira/browse/SPARK-2705
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.1, 1.0.2
Reporter: Cheng Lian
Assignee: Cheng Lian
Priority: Minor
 Fix For: 1.1.0


 The type of the stage description object in the stage table of the Web UI should be a 
 {{String}}, but an {{Option\[String\]}} is used. See 
 [here|https://github.com/apache/spark/blob/aaf2b735fddbebccd28012006ee4647af3b3624f/core/src/main/scala/org/apache/spark/ui/jobs/StageTable.scala#L125].



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2082) Stratified sampling implementation in PairRDDFunctions

2014-07-27 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2082:
---

Component/s: MLlib

 Stratified sampling implementation in PairRDDFunctions
 --

 Key: SPARK-2082
 URL: https://issues.apache.org/jira/browse/SPARK-2082
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Doris Xin
Assignee: Doris Xin

 Implementation of stratified sampling that guarantees exact sample size = 
 sum(math.ceil(S_i*samplingRate)), where S_i is the size of each stratum. 
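 For example, the guaranteed size works out as follows (a small sketch with made-up 
 stratum sizes):
 {code}
 // Illustration of the exact-size formula above with hypothetical strata.
 val stratumSizes = Seq(1000L, 250L, 75L)   // S_i for three strata
 val samplingRate = 0.1

 val exactSampleSize = stratumSizes.map(s => math.ceil(s * samplingRate).toLong).sum
 // = 100 + 25 + 8 = 133
 {code}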



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1498) Spark can hang if pyspark tasks fail

2014-07-27 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1498:
---

Component/s: PySpark

 Spark can hang if pyspark tasks fail
 

 Key: SPARK-1498
 URL: https://issues.apache.org/jira/browse/SPARK-1498
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 0.9.0, 0.9.1, 0.9.2
Reporter: Kay Ousterhout
 Fix For: 1.0.0


 In pyspark, when some kinds of jobs fail, Spark hangs rather than returning 
 an error.  This is partially a scheduler problem -- the scheduler sometimes 
 thinks failed tasks succeed, even though they have a stack trace and 
 exception.
 You can reproduce this problem with:
 ardd = sc.parallelize([(1,2,3), (4,5,6)])
 brdd = sc.parallelize([(1,2,6), (4,5,9)])
 ardd.join(brdd).count()
 The last line will run forever (the problem in this code is that the RDD 
 entries have 3 values instead of the expected 2).  I haven't verified if this 
 is a problem for 1.0 as well as 0.9.
 Thanks to Shivaram for helping diagnose this issue!



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2703) Make Tachyon related unit tests execute without deploying a Tachyon system locally.

2014-07-27 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2703:
---

Component/s: Spark Core

 Make Tachyon related unit tests execute without deploying a Tachyon system 
 locally.
 ---

 Key: SPARK-2703
 URL: https://issues.apache.org/jira/browse/SPARK-2703
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Haoyuan Li
 Fix For: 1.1.0


 Use the LocalTachyonCluster class in tachyon-test.jar in 0.5.0 release.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2702) Upgrade Tachyon dependency to 0.5.0

2014-07-27 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2702:
---

Component/s: Spark Core

 Upgrade Tachyon dependency to 0.5.0
 ---

 Key: SPARK-2702
 URL: https://issues.apache.org/jira/browse/SPARK-2702
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Haoyuan Li
 Fix For: 1.1.0


 Upgrade Tachyon dependency to 0.5.0:
 a. Code dependency.
 b. Start Tachyon script.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2673) Improve Spark so that we can attach Debugger to Executors easily

2014-07-27 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2673:
---

Component/s: Spark Core

 Improve Spark so that we can attach Debugger to Executors easily
 

 Key: SPARK-2673
 URL: https://issues.apache.org/jira/browse/SPARK-2673
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Kousuke Saruta

 In the current implementation, it is difficult to attach a debugger to each 
 Executor in the cluster, for the following reasons:
 1) It's difficult for Executors running on the same machine to open a debug 
 port, because we can only pass the same JVM options to all executors.
 2) Even if we can open a unique debug port for each Executor running on the 
 same machine, it's a bother to check the debug port of each executor.
 To solve these problems, I think the following two improvements are needed:
 1) Enable each executor to open a unique debug port on a machine.
 2) Expand the WebUI to be able to show the debug ports open in each executor.
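 As a purely hypothetical illustration of improvement 1), each executor JVM on a 
 host could be given its own JDWP port, for example derived from its executor id 
 (the option name and port scheme below are assumptions, not Spark settings):
 {code}
 // Hypothetical sketch of per-executor debug options.
 def debugJavaOptions(executorId: Int, basePort: Int = 5005): String =
   s"-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=${basePort + executorId}"

 // e.g. executor 3 on a host would listen for a debugger on port 5008:
 val opts = debugJavaOptions(3)
 {code}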



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2708) [APACHE-SPARK] [CORE] Build Fails: Case class 'TaskUIData' makes sbt complaining.

2014-07-27 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14075728#comment-14075728
 ] 

Sean Owen commented on SPARK-2708:
--

I don't see any failures when I run the tests from master just now. Jenkins 
seems to be succeeding too, or at least, the failed builds don't seem to be due 
to a TaskUIData class: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/  

The class TaskUIData is present in org.apache.spark.ui.jobs.UIData. It was 
added pretty recently: 
https://github.com/apache/spark/commits/72e9021eaf26f31a82120505f8b764b18fbe8d48/core/src/main/scala/org/apache/spark/ui/jobs/UIData.scala

Maybe you need to do a clean build?

 [APACHE-SPARK] [CORE] Build Fails: Case class 'TaskUIData' makes sbt 
 complaining.
 -

 Key: SPARK-2708
 URL: https://issues.apache.org/jira/browse/SPARK-2708
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.1
Reporter: Michael Yannakopoulos
Assignee: Michael Yannakopoulos
  Labels: patch

 Build procedure fails due to numerous errors appearing in files located in 
 Apache Spark Core project's 'org.apache.spark.ui' directory where case class 
 'TaskUIData' appears to be undefined. However the problem seems more 
 complicated since the class is imported correctly to the aforementioned files.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2708) [APACHE-SPARK] [CORE] Build Fails: Case class 'TaskUIData' makes sbt complaining.

2014-07-27 Thread Michael Yannakopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14075735#comment-14075735
 ] 

Michael Yannakopoulos commented on SPARK-2708:
--

I am doing it right now! I am going to report as soon as possible.

Thanks!

 [APACHE-SPARK] [CORE] Build Fails: Case class 'TaskUIData' makes sbt 
 complaining.
 -

 Key: SPARK-2708
 URL: https://issues.apache.org/jira/browse/SPARK-2708
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.1
Reporter: Michael Yannakopoulos
Assignee: Michael Yannakopoulos
  Labels: patch

 Build procedure fails due to numerous errors appearing in files located in 
 Apache Spark Core project's 'org.apache.spark.ui' directory where case class 
 'TaskUIData' appears to be undefined. However the problem seems more 
 complicated since the class is imported correctly to the aforementioned files.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Closed] (SPARK-2708) [APACHE-SPARK] [CORE] Build Fails: Case class 'TaskUIData' makes sbt complaining.

2014-07-27 Thread Michael Yannakopoulos (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Yannakopoulos closed SPARK-2708.


Resolution: Fixed

 [APACHE-SPARK] [CORE] Build Fails: Case class 'TaskUIData' makes sbt 
 complaining.
 -

 Key: SPARK-2708
 URL: https://issues.apache.org/jira/browse/SPARK-2708
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.1
Reporter: Michael Yannakopoulos
Assignee: Michael Yannakopoulos
  Labels: patch

 Build procedure fails due to numerous errors appearing in files located in 
 Apache Spark Core project's 'org.apache.spark.ui' directory where case class 
 'TaskUIData' appears to be undefined. However the problem seems more 
 complicated since the class is imported correctly to the aforementioned files.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2708) [APACHE-SPARK] [CORE] Build Fails: Case class 'TaskUIData' makes sbt complaining.

2014-07-27 Thread Michael Yannakopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14075736#comment-14075736
 ] 

Michael Yannakopoulos commented on SPARK-2708:
--

Yes, you are right! Thanks for the help. I am closing this issue as resolved.
Thanks again guys!

 [APACHE-SPARK] [CORE] Build Fails: Case class 'TaskUIData' makes sbt 
 complaining.
 -

 Key: SPARK-2708
 URL: https://issues.apache.org/jira/browse/SPARK-2708
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.1
Reporter: Michael Yannakopoulos
Assignee: Michael Yannakopoulos
  Labels: patch

 Build procedure fails due to numerous errors appearing in files located in 
 Apache Spark Core project's 'org.apache.spark.ui' directory where case class 
 'TaskUIData' appears to be undefined. However the problem seems more 
 complicated since the class is imported correctly to the aforementioned files.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-2410) Thrift/JDBC Server

2014-07-27 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-2410.


Resolution: Fixed

Issue resolved by pull request 1600
[https://github.com/apache/spark/pull/1600]

 Thrift/JDBC Server
 --

 Key: SPARK-2410
 URL: https://issues.apache.org/jira/browse/SPARK-2410
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Michael Armbrust
Assignee: Cheng Lian
Priority: Blocker
 Fix For: 1.1.0


 We have this, but need to make sure that it gets merged into master before 
 the 1.1 release.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2708) [APACHE-SPARK] [CORE] Build Fails: Case class 'TaskUIData' makes sbt complaining.

2014-07-27 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14075738#comment-14075738
 ] 

Sean Owen commented on SPARK-2708:
--

(Nit: might mark it as Not A Problem or something, lest someone go looking for 
something that fixed this.)

 [APACHE-SPARK] [CORE] Build Fails: Case class 'TaskUIData' makes sbt 
 complaining.
 -

 Key: SPARK-2708
 URL: https://issues.apache.org/jira/browse/SPARK-2708
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.1
Reporter: Michael Yannakopoulos
Assignee: Michael Yannakopoulos
  Labels: patch

 Build procedure fails due to numerous errors appearing in files located in 
 Apache Spark Core project's 'org.apache.spark.ui' directory where case class 
 'TaskUIData' appears to be undefined. However the problem seems more 
 complicated since the class is imported correctly to the aforementioned files.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2707) Upgrade to Akka 2.3

2014-07-27 Thread Aaron Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14075744#comment-14075744
 ] 

Aaron Davidson commented on SPARK-2707:
---

That doesn't sound like a bad idea -- actually sounds significantly more 
straightforward than depending on a version of Akka that only shades the 
internal protobuf usage.

 Upgrade to Akka 2.3
 ---

 Key: SPARK-2707
 URL: https://issues.apache.org/jira/browse/SPARK-2707
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Yardena

 Upgrade Akka from 2.2 to 2.3. We want to be able to use new Akka and Spray 
 features directly in the same project.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2614) Add the spark-examples-xxx-.jar to the Debian packages created with mvn ... -Pdeb (using assembly/pom.xml)

2014-07-27 Thread Christian Tzolov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christian Tzolov updated SPARK-2614:


Summary: Add the spark-examples-xxx-.jar to the Debian packages created 
with mvn ... -Pdeb (using assembly/pom.xml)  (was: Add the 
spark-examples-xxx-.jar to the Debian package created by assembly/pom.xml (e.g. 
-Pdeb))

 Add the spark-examples-xxx-.jar to the Debian packages created with mvn ... 
 -Pdeb (using assembly/pom.xml)
 --

 Key: SPARK-2614
 URL: https://issues.apache.org/jira/browse/SPARK-2614
 Project: Spark
  Issue Type: Improvement
  Components: Build, Deploy
Reporter: Christian Tzolov

 The tar.gz distribution already includes the spark-examples.jar in the 
 bundle. It is a common practice for installers to run SparkPi as a smoke test 
 to verify that the installation is OK:
 /usr/share/spark/bin/spark-submit \
   --num-executors 10  --master yarn-cluster \
   --class org.apache.spark.examples.SparkPi \
   /usr/share/spark/jars/spark-examples-1.0.1-hadoop2.2.0.jar 10



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-2665) Add EqualNS support for HiveQL

2014-07-27 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-2665.
-

   Resolution: Fixed
Fix Version/s: 1.1.0
 Assignee: Cheng Hao

 Add EqualNS support for HiveQL
 --

 Key: SPARK-2665
 URL: https://issues.apache.org/jira/browse/SPARK-2665
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Cheng Hao
Assignee: Cheng Hao
 Fix For: 1.1.0


 Hive supports the operator {{<=>}}, which returns the same result as the EQUAL (=) 
 operator for non-null operands, but returns TRUE if both are NULL, and FALSE if 
 one of them is NULL.
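 The null-safe semantics can be sketched in Scala roughly as follows (illustrative 
 only, using Option to stand in for nullable SQL values):
 {code}
 // Null-safe equality (Hive's <=>), sketched with Option as a stand-in for NULL.
 def equalNS(a: Option[Any], b: Option[Any]): Boolean = (a, b) match {
   case (None, None)          => true    // both NULL -> TRUE
   case (None, _) | (_, None) => false   // one NULL  -> FALSE
   case (Some(x), Some(y))    => x == y  // otherwise behaves like =
 }
 {code}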



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2420) Change Spark build to minimize library conflicts

2014-07-27 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14075755#comment-14075755
 ] 

Sean Owen commented on SPARK-2420:
--

There aren't great answers to this one. I also ended up favoring downgrading as 
a path of least resistance. Here is the narrative behind my opinion:


This did come up as an issue when Guava was upgraded to 14 :)

It seems annoying that a dependency dictates a version of Guava, but c'est la 
vie for any dependency. It just happens that Guava is so common. 

Spark users are inevitably Hadoop users, so it's a dependency that exerts 
special influence.

I think this is being improved upstream in Hadoop, by shading, but, that 
doesn't help existing versions in the field, which will be around for years.

It is causing actual problems for users, and for future efforts that are 
probably important to Spark, such as Hive on Spark here.

Downgrading looks feasible. See my PR: 
https://github.com/apache/spark/pull/1610 *It does need review!*

Downgrading could break Spark apps that transitively depend on Guava 12+ through 
Spark. But this is really a problem with such an app, as it should depend on 
Guava directly. Still, a point to consider.

Can one justify downgrading a dependency between 1.x and 1.(x+1)? I think so, 
if you view it as more of a bug fix.

But why can't Spark shade Guava? This is also reasonable to consider. 

If you're worried about breaking apps, though, that's a more breaking change, 
and I understand that not breaking apps is a high priority. Apps that rely on 
Guava transitively might continue to work just fine otherwise, but not if it 
disappears from Spark.

Shading is always a bit risky, as it can't always adjust all use of reflection 
or other reliance on package names in the library. You can end up with two 
copies of singleton classes of course, if someone else brings their own Guava, 
which might or might not be OK. I don't have a specific problem in mind for 
Guava, though.

A more significant reason is that I'm still not 100% sure shading in Spark 
fixes the collision, in stand-alone mode at least. Spark apps that bring Guava 
14 may still collide with Hadoop's classpath, which contains 11.

 Change Spark build to minimize library conflicts
 

 Key: SPARK-2420
 URL: https://issues.apache.org/jira/browse/SPARK-2420
 Project: Spark
  Issue Type: Wish
  Components: Build
Affects Versions: 1.0.0
Reporter: Xuefu Zhang
 Attachments: spark_1.0.0.patch


 During the prototyping of HIVE-7292, many library conflicts showed up because 
 the Spark build contains versions of libraries that are vastly different from 
 the current major Hadoop version. It would be nice if we could choose versions 
 that are in line with Hadoop, or shade them in the assembly. Here is the wish 
 list:
 1. Upgrade the protobuf version to 2.5.0 from the current 2.4.1.
 2. Shade Spark's jetty and servlet dependencies in the assembly.
 3. Guava version difference. Spark is using a higher version. I'm not sure 
 what the best solution for this is.
 The list may grow as HIVE-7292 proceeds.
 For information only, the attached is a patch that we applied on Spark in 
 order to make Spark work with Hive. It gives an idea of the scope of changes.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2420) Change Spark build to minimize library conflicts

2014-07-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14075754#comment-14075754
 ] 

Apache Spark commented on SPARK-2420:
-

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/1610

 Change Spark build to minimize library conflicts
 

 Key: SPARK-2420
 URL: https://issues.apache.org/jira/browse/SPARK-2420
 Project: Spark
  Issue Type: Wish
  Components: Build
Affects Versions: 1.0.0
Reporter: Xuefu Zhang
 Attachments: spark_1.0.0.patch


 During the prototyping of HIVE-7292, many library conflicts showed up because 
 the Spark build contains versions of libraries that are vastly different from 
 the current major Hadoop version. It would be nice if we could choose versions 
 that are in line with Hadoop, or shade them in the assembly. Here is the wish 
 list:
 1. Upgrade the protobuf version to 2.5.0 from the current 2.4.1.
 2. Shade Spark's jetty and servlet dependencies in the assembly.
 3. Guava version difference. Spark is using a higher version. I'm not sure 
 what the best solution for this is.
 The list may grow as HIVE-7292 proceeds.
 For information only, the attached is a patch that we applied on Spark in 
 order to make Spark work with Hive. It gives an idea of the scope of changes.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-2420) Change Spark build to minimize library conflicts

2014-07-27 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14075758#comment-14075758
 ] 

Patrick Wendell edited comment on SPARK-2420 at 7/27/14 9:14 PM:
-

I put some thought into this as well. One big issue (and this was frankly a 
mistake in Spark's Java API design) is that we expose guava's Optional type in 
Spark's Java API. In general we should avoid relying on external types in any 
of our APIs - that decision was made a long time ago when we were a much 
smaller project.

The reason why downgrading is bad for user applications is that it's not 
something they can just work around by declaring a newer version of Guava in 
their build. The whole issue here is that Guava 11 and 14 are not binary 
compatible. I.e. if user code depends on Guava 14, and that gets pulled in, 
then Spark will break. So users will actually have to roll back their source 
code as well if it depends on newer Guava features. This is very disruptive 
from a user perspective and I think it's tantamount to an API change, since 
users will have to re-write code. It's in some ways worse than a Spark API 
change, because we can't easily write a downgrade guide of Guava from 14 to 
11 (there will simply be missing features).

I think the best solution here is to shade guava. And by shade I mean actually 
re-publish Guava under the org.spark-project namespace as we have done with a 
few other critical dependencies, and then depend on that in the spark build. 
This is much better than using something like the maven shade plug-in which is 
more of a hack.

Then the issue is our Java API, because that currently exposes the Guava 
Optional class directly under its original namespace. I see two options. (1) 
Change Spark's API to return a Spark-specific optional class. (2) Inline the 
definition of Guava's Optional (under its original namespace) in Spark's source 
code - it's a very simple class and has been stable across several versions of 
Guava.

The only risk with (2) is that if Guava makes an incompatible change to 
Optional, we are in trouble. If that happens, we could always fall back to (1) 
though in a future release.







was (Author: pwendell):
I put some thought into this as well. One big issue (and this was frankly a 
mistake in Spark's Java API design) is that we expose guava's Optional type in 
Spark's Java API. In general we should avoid relying on external types in any 
of our API's - that decision was made a long time ago when we were a much 
smaller project.

The reason why downgrading is bad for user applications is that it's not 
something they can just work around by declaring a newer version of Guava in 
their build. The whole issue here is that Guava 11 and 14 are not binary 
compatible. I.e. if user code depends on Guava 14, and that gets pulled in, 
then Spark will break. So users will actually have to roll back their source 
code as well if it depends on newer Guava features. This is very disruptive 
from a user perspective and I think it's tantamount to an API change, since 
users will have to re-write code. It's in some ways worse than a Spark API 
change, because we can't easily write a downgrade guide of Guava from 14 to 
11 (there will simply be missing features).

I think the best solution here is to shade guava. And by shade I mean actually 
re-publish Guava under the org.spark-project namespace as we have done with a 
few other critical dependencies, and then depend on that in the spark build. 
This is much better than using something like the maven shade plug-in which is 
more of a hack.

Then the issue is our Java API, because that currently exposes the Guava 
Optional class directly under it's original namespace. I see two options. (i) 
Change Spark's API to return a Spark-specific optional class. (ii) Inline the 
definition of Guava's Optional (under its original namespace) in Spark's source 
code - it's a very simple class and has been stable across several versions of 
Guava.

The only risk with (ii) is that if Guava makes an incompatible change to 
Optional, we are in trouble. If that happens, we could always fall back to (i) 
though in a future release.






 Change Spark build to minimize library conflicts
 

 Key: SPARK-2420
 URL: https://issues.apache.org/jira/browse/SPARK-2420
 Project: Spark
  Issue Type: Wish
  Components: Build
Affects Versions: 1.0.0
Reporter: Xuefu Zhang
 Attachments: spark_1.0.0.patch


 During the prototyping of HIVE-7292, many library conflicts showed up because 
 the Spark build contains library versions that are vastly different from the 
 current major Hadoop version. It would be nice if we could choose versions in 
 line with Hadoop's, or shade them in the assembly. Here is the wish

[jira] [Commented] (SPARK-2420) Change Spark build to minimize library conflicts

2014-07-27 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14075758#comment-14075758
 ] 

Patrick Wendell commented on SPARK-2420:


I put some thought into this as well. One big issue (and this was frankly a 
mistake in Spark's Java API design) is that we expose guava's Optional type in 
Spark's Java API. In general we should avoid relying on external types in any 
of our APIs - that decision was made a long time ago when we were a much 
smaller project.

The reason why downgrading is bad for user applications is that it's not 
something they can just work around by declaring a newer version of Guava in 
their build. The whole issue here is that Guava 11 and 14 are not binary 
compatible. I.e. if user code depends on Guava 14, and that gets pulled in, 
then Spark will break. So users will actually have to roll back their source 
code as well if it depends on newer Guava features. This is very disruptive 
from a user perspective and I think it's tantamount to an API change, since 
users will have to re-write code. It's in some ways worse than a Spark API 
change, because we can't easily write a downgrade guide of Guava from 14 to 
11 (there will simply be missing features).

I think the best solution here is to shade Guava. And by shade I mean actually 
re-publish Guava under the org.spark-project namespace, as we have done with a 
few other critical dependencies, and then depend on that in the Spark build. 
This is much better than using something like the maven shade plug-in which is 
more of a hack.

Then the issue is our Java API, because that currently exposes the Guava 
Optional class directly under its original namespace. I see two options. (i) 
Change Spark's API to return a Spark-specific optional class. (ii) Inline the 
definition of Guava's Optional (under its original namespace) in Spark's source 
code - it's a very simple class and has been stable across several versions of 
Guava.

The only risk with (ii) is that if Guava makes an incompatible change to 
Optional, we are in trouble. If that happens, we could always fall back to (i) 
though in a future release.
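
To make option (i) concrete, here is a minimal sketch of what a Spark-specific 
optional type could look like; the class name, placement, and methods are 
illustrative assumptions, not an agreed API:

{code}
// Hypothetical SparkOptional: a tiny stand-in for Guava's Optional so that the
// public Java API no longer exposes a Guava type. Names are for illustration only.
sealed abstract class SparkOptional[T] extends Serializable {
  def isPresent: Boolean
  def get: T
  def orElse(default: T): T = if (isPresent) get else default
}

object SparkOptional {
  def of[T](value: T): SparkOptional[T] = {
    require(value != null, "value must not be null")
    new SparkOptional[T] {
      def isPresent: Boolean = true
      def get: T = value
    }
  }
  def absent[T](): SparkOptional[T] = new SparkOptional[T] {
    def isPresent: Boolean = false
    def get: T = throw new NoSuchElementException("value is absent")
  }
}
{code}

Option (ii) would instead copy the source of Guava's Optional itself, keeping 
the com.google.common.base package name so that existing user code keeps 
compiling.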






 Change Spark build to minimize library conflicts
 

 Key: SPARK-2420
 URL: https://issues.apache.org/jira/browse/SPARK-2420
 Project: Spark
  Issue Type: Wish
  Components: Build
Affects Versions: 1.0.0
Reporter: Xuefu Zhang
 Attachments: spark_1.0.0.patch


 During the prototyping of HIVE-7292, many library conflicts showed up because 
 the Spark build contains library versions that are vastly different from the 
 current major Hadoop version. It would be nice if we could choose versions in 
 line with Hadoop's, or shade them in the assembly. Here is the wish list:
 1. Upgrade the protobuf version from the current 2.4.1 to 2.5.0.
 2. Shade Spark's jetty and servlet dependencies in the assembly.
 3. Resolve the Guava version difference. Spark uses a higher version; I'm not 
 sure what the best solution is.
 The list may grow as HIVE-7292 proceeds.
 For information only, the attached patch is what we applied to Spark in order 
 to make it work with Hive. It gives an idea of the scope of changes.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2709) Add a tool for certifying Spark API compatibility

2014-07-27 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-2709:
--

 Summary: Add a tool for certifying Spark API compatibility
 Key: SPARK-2709
 URL: https://issues.apache.org/jira/browse/SPARK-2709
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Patrick Wendell
Assignee: Prashant Sharma


As Spark is packaged by more and more distributors, it would be good to have a 
tool that verifies the API compatibility of a provided Spark package. The tool 
would certify that a vendor distribution of Spark contains all of the APIs 
present in a particular upstream Spark version.

This will help vendors make sure they remain API-compliant when they make 
changes or back-ports to Spark. It will also discourage vendors from knowingly 
breaking APIs, because anyone can audit their distribution and see that they 
have removed support for certain APIs.

I'm hoping a tool like this will avoid API fragmentation in the Spark community.

One poor man's implementation of this is that a vendor can just run the 
binary compatibility checks in the Spark build against an upstream version of 
Spark. That's a pretty good start, but it means you can't come in as a third 
party and audit a distribution.

Another approach would be to have something where anyone can come in and audit 
a distribution even if they don't have access to the packaging and source code. 
That would look something like this:

1. For each release we publish a manifest of all public APIs (we might borrow 
the MiMa string representation of byte code signatures).
2. We package an auditing tool as a jar file.
3. The user runs a tool with spark-submit that reflectively walks through all 
exposed Spark APIs and makes sure that everything on the manifest is 
encountered.

From the implementation side, this is just brainstorming at this point.
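
For illustration of step 3, a rough sketch of the reflective walk; the manifest 
file format and the class#method entry syntax are assumptions, not a spec:

{code}
// Hypothetical audit sketch: each manifest line is "fully.qualified.ClassName#methodName".
// Run it with spark-submit against a vendor distribution and report anything missing.
import scala.io.Source

object ApiManifestAudit {
  def main(args: Array[String]): Unit = {
    val entries = Source.fromFile(args(0)).getLines().filter(_.nonEmpty).toSeq
    val missing = entries.filterNot { entry =>
      try {
        val Array(className, methodName) = entry.split("#", 2)
        Class.forName(className).getMethods.exists(_.getName == methodName)
      } catch {
        case _: Throwable => false  // unknown class or malformed entry counts as missing
      }
    }
    if (missing.isEmpty) println(s"All ${entries.size} manifest entries were found.")
    else missing.foreach(m => println(s"MISSING: $m"))
  }
}
{code}

A real tool would compare full byte code signatures (as MiMa does) rather than 
just method names, but the overall shape would be the same.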




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2420) Change Spark build to minimize library conflicts

2014-07-27 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14075762#comment-14075762
 ] 

Sean Owen commented on SPARK-2420:
--

Yep, there's an argument there. The downside is that apps that relied on Guava 
coming in via Spark will not work, though the fix is proper and easy. I thought 
that might have been a non-starter. Yeah, shading means people can bring their 
own Guava and it won't collide with Spark, but I think it still collides with 
Hadoop, and that matters in standalone mode (but not YARN mode, I think? someone 
needs to check my understanding). I suppose I'd suggest that this needs to be 
checked, or else it doesn't actually help Spark (+ Hadoop) users use Guava 14, 
and that's a lot of the users.

 Change Spark build to minimize library conflicts
 

 Key: SPARK-2420
 URL: https://issues.apache.org/jira/browse/SPARK-2420
 Project: Spark
  Issue Type: Wish
  Components: Build
Affects Versions: 1.0.0
Reporter: Xuefu Zhang
 Attachments: spark_1.0.0.patch


 During the prototyping of HIVE-7292, many library conflicts showed up because 
 the Spark build contains library versions that are vastly different from the 
 current major Hadoop version. It would be nice if we could choose versions in 
 line with Hadoop's, or shade them in the assembly. Here is the wish list:
 1. Upgrade the protobuf version from the current 2.4.1 to 2.5.0.
 2. Shade Spark's jetty and servlet dependencies in the assembly.
 3. Resolve the Guava version difference. Spark uses a higher version; I'm not 
 sure what the best solution is.
 The list may grow as HIVE-7292 proceeds.
 For information only, the attached patch is what we applied to Spark in order 
 to make it work with Hive. It gives an idea of the scope of changes.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2420) Change Spark build to minimize library conflicts

2014-07-27 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14075764#comment-14075764
 ] 

Patrick Wendell commented on SPARK-2420:


Yeah, I think users having to add Guava 14 to their build is (compared with 
the alternatives) not too bad, provided they don't have to make any code changes.

[~sowen] If we shade, then I don't see how we could conflict with any Hadoop 
code in any mode. It would just be like any other dependency that Spark has but 
Hadoop doesn't have (?). Could you elaborate a bit more on the conflict you are 
anticipating?

 Change Spark build to minimize library conflicts
 

 Key: SPARK-2420
 URL: https://issues.apache.org/jira/browse/SPARK-2420
 Project: Spark
  Issue Type: Wish
  Components: Build
Affects Versions: 1.0.0
Reporter: Xuefu Zhang
 Attachments: spark_1.0.0.patch


 During the prototyping of HIVE-7292, many library conflicts showed up because 
 the Spark build contains library versions that are vastly different from the 
 current major Hadoop version. It would be nice if we could choose versions in 
 line with Hadoop's, or shade them in the assembly. Here is the wish list:
 1. Upgrade the protobuf version from the current 2.4.1 to 2.5.0.
 2. Shade Spark's jetty and servlet dependencies in the assembly.
 3. Resolve the Guava version difference. Spark uses a higher version; I'm not 
 sure what the best solution is.
 The list may grow as HIVE-7292 proceeds.
 For information only, the attached patch is what we applied to Spark in order 
 to make it work with Hive. It gives an idea of the scope of changes.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2420) Change Spark build to minimize library conflicts

2014-07-27 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14075765#comment-14075765
 ] 

Sean Owen commented on SPARK-2420:
--

Spark doesn't conflict then, but the user code may conflict with Hadoop. That's 
the scenario. Maybe spark.files.userClassPathFirst takes care of this in 
general, and in YARN, you are more isolated from the Hadoop stuff.

 Change Spark build to minimize library conflicts
 

 Key: SPARK-2420
 URL: https://issues.apache.org/jira/browse/SPARK-2420
 Project: Spark
  Issue Type: Wish
  Components: Build
Affects Versions: 1.0.0
Reporter: Xuefu Zhang
 Attachments: spark_1.0.0.patch


 During the prototyping of HIVE-7292, many library conflicts showed up because 
 the Spark build contains library versions that are vastly different from the 
 current major Hadoop version. It would be nice if we could choose versions in 
 line with Hadoop's, or shade them in the assembly. Here is the wish list:
 1. Upgrade the protobuf version from the current 2.4.1 to 2.5.0.
 2. Shade Spark's jetty and servlet dependencies in the assembly.
 3. Resolve the Guava version difference. Spark uses a higher version; I'm not 
 sure what the best solution is.
 The list may grow as HIVE-7292 proceeds.
 For information only, the attached patch is what we applied to Spark in order 
 to make it work with Hive. It gives an idea of the scope of changes.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2614) Add the spark-examples-xxx-.jar to the Debian packages created with mvn ... -Pdeb (using assembly/pom.xml)

2014-07-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14075768#comment-14075768
 ] 

Apache Spark commented on SPARK-2614:
-

User 'tzolov' has created a pull request for this issue:
https://github.com/apache/spark/pull/1611

 Add the spark-examples-xxx-.jar to the Debian packages created with mvn ... 
 -Pdeb (using assembly/pom.xml)
 --

 Key: SPARK-2614
 URL: https://issues.apache.org/jira/browse/SPARK-2614
 Project: Spark
  Issue Type: Improvement
  Components: Build, Deploy
Reporter: Christian Tzolov

 The tar.gz distribution already includes the spark-examples.jar in the bundle. 
 It is common practice for installers to run SparkPi as a smoke test to verify 
 that the installation is OK:
 /usr/share/spark/bin/spark-submit \
   --num-executors 10  --master yarn-cluster \
   --class org.apache.spark.examples.SparkPi \
   /usr/share/spark/jars/spark-examples-1.0.1-hadoop2.2.0.jar 10



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2708) [APACHE-SPARK] [CORE] Build Fails: Case class 'TaskUIData' makes sbt complain.

2014-07-27 Thread Michael Yannakopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14075770#comment-14075770
 ] 

Michael Yannakopoulos commented on SPARK-2708:
--

This issue is resolved. There was no real problem, so the issue has been 
closed without any patch provided.
The solution is simply to perform a clean build of the Apache Spark project.

 [APACHE-SPARK] [CORE] Build Fails: Case class 'TaskUIData' makes sbt 
 complain.
 -

 Key: SPARK-2708
 URL: https://issues.apache.org/jira/browse/SPARK-2708
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.1
Reporter: Michael Yannakopoulos
Assignee: Michael Yannakopoulos
  Labels: patch

 The build procedure fails due to numerous errors in files located in the 
 Spark Core project's 'org.apache.spark.ui' directory, where the case class 
 'TaskUIData' appears to be undefined. However, the problem seems more 
 complicated, since the class is imported correctly into the aforementioned files.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1777) Pass cached blocks directly to disk if memory is not large enough

2014-07-27 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-1777:
-

Priority: Critical  (was: Major)

 Pass cached blocks directly to disk if memory is not large enough
 ---

 Key: SPARK-1777
 URL: https://issues.apache.org/jira/browse/SPARK-1777
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Patrick Wendell
Assignee: Andrew Or
Priority: Critical
 Fix For: 1.1.0

 Attachments: spark-1777-design-doc.pdf


 Currently in Spark we entirely unroll a partition and then check whether it 
 will cause us to exceed the storage limit. This has an obvious problem - if 
 the partition itself is enough to push us over the storage limit (and 
 eventually over the JVM heap), it will cause an OOM.
 This can happen in cases where a single partition is very large or when 
 someone is running examples locally with a small heap.
 https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/CacheManager.scala#L106
 We should think a bit about the most elegant way to fix this - it shares some 
 similarities with the external aggregation code.
 A simple idea is to periodically check the size of the buffer as we are 
 unrolling and see if we are over the memory limit. If we are, we could prepend 
 the existing buffer to the iterator and write that entire thing out to disk.
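
 For illustration, a minimal sketch of that idea; the method name, the 
 size-estimator parameter, and the check interval are assumptions, not the 
 actual fix:

 {code}
 // Illustrative only: unroll a partition, but re-estimate the buffer's size every
 // few elements; if the limit would be exceeded, hand back buffer ++ rest so the
 // caller can stream the whole partition to disk instead of OOMing.
 import scala.collection.mutable.ArrayBuffer

 def unrollSafely[T](values: Iterator[T],
                     maxUnrollBytes: Long,
                     estimateSize: AnyRef => Long): Either[Vector[T], Iterator[T]] = {
   val buffer = new ArrayBuffer[T]
   var count = 0L
   val checkInterval = 16  // arbitrary choice of how often to re-estimate
   while (values.hasNext) {
     buffer += values.next()
     count += 1
     if (count % checkInterval == 0 && estimateSize(buffer) > maxUnrollBytes) {
       return Right(buffer.iterator ++ values)  // spill path: partial buffer plus remainder
     }
   }
   Left(buffer.toVector)  // the partition fits in memory: fully unrolled
 }
 {code}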



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2710) Build SchemaRDD from a JdbcRDD with MetaData (no hard code case class)

2014-07-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14075789#comment-14075789
 ] 

Apache Spark commented on SPARK-2710:
-

User 'chutium' has created a pull request for this issue:
https://github.com/apache/spark/pull/1612

 Build SchemaRDD from a JdbcRDD with MetaData (no hard code case class)
 --

 Key: SPARK-2710
 URL: https://issues.apache.org/jira/browse/SPARK-2710
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, SQL
Reporter: Teng Qiu

 Spark SQL can take Parquet files or JSON files as a table directly (without 
 given a case class to define the schema)
 as a component named SQL, it should also be able to take a ResultSet from 
 RDBMS easily.
 i find that there is a JdbcRDD in core: 
 core/src/main/scala/org/apache/spark/rdd/JdbcRDD.scala
 so i want to make some small change in this file to allow SQLContext to read 
 the MetaData from the PreparedStatement (read metadata do not need to execute 
 the query really).
 and there is a small bug in JdbcRDD
 in compute(), method close()
 {code}
 if (null != conn && ! stmt.isClosed()) conn.close()
 {code}
 should be
 {code}
 if (null != conn && ! conn.isClosed()) conn.close()
 {code}
 just a small write error :)
 Then, in Spark SQL, SQLContext can create SchemaRDD with JdbcRDD and his 
 MetaData.
 In the further, maybe we can add a feature in sql-shell, so that user can 
 using spark-thrift-server join tables from different sources
 such as:
 {code}
 CREATE TABLE jdbc_tbl1 AS JDBC connectionString username password 
 initQuery bound ...
 CREATE TABLE parquet_files AS JDBC hdfs://tmp/parquet_table/
 SELECT parquet_files.colX, jdbc_tbl1.colY
   FROM parquet_files
   JOIN jdbc_tbl1
 ON (parquet_files.id = jdbc_tbl1.id)
 {code}
 I think such a feature will be useful, like facebook Presto engine does.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-2514) Random RDD generator

2014-07-27 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-2514.
--

   Resolution: Fixed
Fix Version/s: 1.1.0

Issue resolved by pull request 1520
[https://github.com/apache/spark/pull/1520]

 Random RDD generator
 

 Key: SPARK-2514
 URL: https://issues.apache.org/jira/browse/SPARK-2514
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Doris Xin
 Fix For: 1.1.0






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2707) Upgrade to Akka 2.3

2014-07-27 Thread Anand Avati (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14075796#comment-14075796
 ] 

Anand Avati commented on SPARK-2707:


The changes needed just to get Spark compiling with 2.3.x can be found here: 
https://github.com/avati/spark/commit/000441bfec9315d1132cd9b785791a6fcbf9d4d4. 
However, that does not work, and new SparkContext keeps throwing:

  java.util.concurrent.TimeoutException: Futures timed out after [1 
milliseconds]
  at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
  at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
  at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
  at 
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
  at scala.concurrent.Await$.result(package.scala:107)
  at akka.remote.Remoting.start(Remoting.scala:180)
  at akka.remote.RemoteActorRefProvider.init(RemoteActorRefProvider.scala:184)
  at akka.actor.ActorSystemImpl.liftedTree2$1(ActorSystem.scala:618)
  at akka.actor.ActorSystemImpl._start$lzycompute(ActorSystem.scala:615)
  at akka.actor.ActorSystemImpl._start(ActorSystem.scala:615)

I am still investigating what other changes are needed in Spark for Akka 2.3.x 
to work.


 Upgrade to Akka 2.3
 ---

 Key: SPARK-2707
 URL: https://issues.apache.org/jira/browse/SPARK-2707
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Yardena

 Upgrade Akka from 2.2 to 2.3. We want to be able to use new Akka and Spray 
 features directly in the same project.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2710) Build SchemaRDD from a JdbcRDD with MetaData (no hard code case class)

2014-07-27 Thread Teng Qiu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14075799#comment-14075799
 ] 

Teng Qiu commented on SPARK-2710:
-

One problem is that there is nothing to push down... I have no idea how filters 
could be pushed from the logical plan into the JdbcRDD... maybe the only option 
is to change the query string and rebuild conn.prepareStatement...

 Build SchemaRDD from a JdbcRDD with MetaData (no hard code case class)
 --

 Key: SPARK-2710
 URL: https://issues.apache.org/jira/browse/SPARK-2710
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, SQL
Reporter: Teng Qiu

 Spark SQL can take Parquet files or JSON files as a table directly (without 
 given a case class to define the schema)
 as a component named SQL, it should also be able to take a ResultSet from 
 RDBMS easily.
 i find that there is a JdbcRDD in core: 
 core/src/main/scala/org/apache/spark/rdd/JdbcRDD.scala
 so i want to make some small change in this file to allow SQLContext to read 
 the MetaData from the PreparedStatement (read metadata do not need to execute 
 the query really).
 and there is a small bug in JdbcRDD
 in compute(), method close()
 {code}
 if (null != conn && ! stmt.isClosed()) conn.close()
 {code}
 should be
 {code}
 if (null != conn && ! conn.isClosed()) conn.close()
 {code}
 just a small write error :)
 Then, in Spark SQL, SQLContext can create SchemaRDD with JdbcRDD and his 
 MetaData.
 In the further, maybe we can add a feature in sql-shell, so that user can 
 using spark-thrift-server join tables from different sources
 such as:
 {code}
 CREATE TABLE jdbc_tbl1 AS JDBC connectionString username password 
 initQuery bound ...
 CREATE TABLE parquet_files AS JDBC hdfs://tmp/parquet_table/
 SELECT parquet_files.colX, jdbc_tbl1.colY
   FROM parquet_files
   JOIN jdbc_tbl1
 ON (parquet_files.id = jdbc_tbl1.id)
 {code}
 I think such a feature will be useful, like facebook Presto engine does.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2710) Build SchemaRDD from a JdbcRDD with MetaData (no hard code case class)

2014-07-27 Thread Teng Qiu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Teng Qiu updated SPARK-2710:


Description: 
Spark SQL can take Parquet files or JSON files as a table directly (without 
being given a case class to define the schema).

As a component named SQL, it should also be able to take a ResultSet from an 
RDBMS easily.

I find that there is a JdbcRDD in core: 
core/src/main/scala/org/apache/spark/rdd/JdbcRDD.scala

So I want to make a small change in this file to allow SQLContext to read the 
metadata from the PreparedStatement (reading the metadata does not require 
actually executing the query).

Then, in Spark SQL, SQLContext can create a SchemaRDD from a JdbcRDD and its 
metadata.

Going further, maybe we can add a feature to the sql-shell, so that users can 
use spark-thrift-server to join tables from different sources,

such as:
{code}
CREATE TABLE jdbc_tbl1 AS JDBC connectionString username password 
initQuery bound ...
CREATE TABLE parquet_files AS PARQUET hdfs://tmp/parquet_table/
SELECT parquet_files.colX, jdbc_tbl1.colY
  FROM parquet_files
  JOIN jdbc_tbl1
ON (parquet_files.id = jdbc_tbl1.id)
{code}

I think such a feature would be useful, much like what the Facebook Presto 
engine does.


Oh, and there is a small bug in JdbcRDD:

in compute(), the close() method has
{code}
if (null != conn && ! stmt.isClosed()) conn.close()
{code}
which should be
{code}
if (null != conn && ! conn.isClosed()) conn.close()
{code}

It is just a small typo :)
but such a close() will never be able to close conn...
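
To make the metadata point concrete, a small sketch using only standard 
java.sql calls (the helper name and the tuple result are assumptions for 
illustration; this is not the actual patch):

{code}
// Sketch: derive column names and JDBC type codes from a PreparedStatement
// without executing the query. Note: some JDBC drivers may return null from
// getMetaData before execution, so a real implementation needs a fallback.
import java.sql.DriverManager

def describeQuery(url: String, user: String, password: String,
                  sql: String): Seq[(String, Int)] = {
  val conn = DriverManager.getConnection(url, user, password)
  try {
    val meta = conn.prepareStatement(sql).getMetaData   // no query execution
    (1 to meta.getColumnCount).map(i => (meta.getColumnName(i), meta.getColumnType(i)))
  } finally {
    if (null != conn && !conn.isClosed()) conn.close()  // the corrected close() check
  }
}
{code}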


  was:
Spark SQL can take Parquet files or JSON files as a table directly (without 
given a case class to define the schema)

as a component named SQL, it should also be able to take a ResultSet from RDBMS 
easily.

i find that there is a JdbcRDD in core: 
core/src/main/scala/org/apache/spark/rdd/JdbcRDD.scala

so i want to make some small change in this file to allow SQLContext to read 
the MetaData from the PreparedStatement (read metadata do not need to execute 
the query really).

and there is a small bug in JdbcRDD

in compute(), method close()
{code}
if (null != conn && ! stmt.isClosed()) conn.close()
{code}
should be
{code}
if (null != conn && ! conn.isClosed()) conn.close()
{code}

just a small write error :)

Then, in Spark SQL, SQLContext can create SchemaRDD with JdbcRDD and his 
MetaData.

In the further, maybe we can add a feature in sql-shell, so that user can using 
spark-thrift-server join tables from different sources

such as:
{code}
CREATE TABLE jdbc_tbl1 AS JDBC connectionString username password 
initQuery bound ...
CREATE TABLE parquet_files AS JDBC hdfs://tmp/parquet_table/
SELECT parquet_files.colX, jdbc_tbl1.colY
  FROM parquet_files
  JOIN jdbc_tbl1
ON (parquet_files.id = jdbc_tbl1.id)
{code}

I think such a feature will be useful, like facebook Presto engine does.


 Build SchemaRDD from a JdbcRDD with MetaData (no hard code case class)
 --

 Key: SPARK-2710
 URL: https://issues.apache.org/jira/browse/SPARK-2710
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, SQL
Reporter: Teng Qiu

 Spark SQL can take Parquet files or JSON files as a table directly (without 
 given a case class to define the schema)
 as a component named SQL, it should also be able to take a ResultSet from 
 RDBMS easily.
 i find that there is a JdbcRDD in core: 
 core/src/main/scala/org/apache/spark/rdd/JdbcRDD.scala
 so i want to make some small change in this file to allow SQLContext to read 
 the MetaData from the PreparedStatement (read metadata do not need to execute 
 the query really).
 Then, in Spark SQL, SQLContext can create SchemaRDD with JdbcRDD and his 
 MetaData.
 In the further, maybe we can add a feature in sql-shell, so that user can 
 using spark-thrift-server join tables from different sources
 such as:
 {code}
 CREATE TABLE jdbc_tbl1 AS JDBC connectionString username password 
 initQuery bound ...
 CREATE TABLE parquet_files AS PARQUET hdfs://tmp/parquet_table/
 SELECT parquet_files.colX, jdbc_tbl1.colY
   FROM parquet_files
   JOIN jdbc_tbl1
 ON (parquet_files.id = jdbc_tbl1.id)
 {code}
 I think such a feature will be useful, like facebook Presto engine does.
 oh, and there is a small bug in JdbcRDD
 in compute(), method close()
 {code}
 if (null != conn && ! stmt.isClosed()) conn.close()
 {code}
 should be
 {code}
 if (null != conn && ! conn.isClosed()) conn.close()
 {code}
 just a small write error :)
 but such a close method will never be able to close conn...



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2550) Support regularization and intercept in pyspark's linear methods

2014-07-27 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14075816#comment-14075816
 ] 

Xiangrui Meng commented on SPARK-2550:
--

After you merge new changes from master, please run `sbt/sbt clean` to clean 
the cache in order to build correctly.

 Support regularization and intercept in pyspark's linear methods
 

 Key: SPARK-2550
 URL: https://issues.apache.org/jira/browse/SPARK-2550
 Project: Spark
  Issue Type: Improvement
  Components: MLlib, PySpark
Affects Versions: 1.0.0
Reporter: Xiangrui Meng
Assignee: Michael Yannakopoulos

 Python API doesn't provide options to set regularization parameter and 
 intercept in linear methods, which should be fixed in v1.1.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2711) Create a ShuffleMemoryManager that allocates across spilling collections in the same task

2014-07-27 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-2711:


 Summary: Create a ShuffleMemoryManager that allocates across 
spilling collections in the same task
 Key: SPARK-2711
 URL: https://issues.apache.org/jira/browse/SPARK-2711
 Project: Spark
  Issue Type: Improvement
Reporter: Matei Zaharia
Assignee: Matei Zaharia






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2711) Create a ShuffleMemoryManager that allocates across spilling collections in the same task

2014-07-27 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-2711:
-

Description: Right now, if there are two ExternalAppendOnlyMaps, they don't 
compete correctly for memory. This can happen in a task that is both reducing 
data from its parent RDD and writing it out to files for a future shuffle, for 
instance if you do rdd.groupByKey(...).map(...).groupByKey(...) (on another 
key).
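
As a rough illustration of the idea (not the actual implementation; the names 
and the very simple policy here are assumptions), a shared pool that collections 
must ask before growing:

{code}
// Illustrative sketch: spilling collections in the same task reserve memory from
// a shared pool instead of growing independently, so two ExternalAppendOnlyMaps
// cannot jointly exceed the shuffle memory budget.
class SimpleShuffleMemoryPool(maxBytes: Long) {
  private var used = 0L

  /** Try to reserve numBytes; returns how much was actually granted (possibly 0). */
  def tryAcquire(numBytes: Long): Long = synchronized {
    val granted = math.min(numBytes, maxBytes - used)
    used += granted
    granted
  }

  /** Give memory back, e.g. after a collection spills its contents to disk. */
  def release(numBytes: Long): Unit = synchronized {
    used = math.max(0L, used - numBytes)
  }
}
{code}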

 Create a ShuffleMemoryManager that allocates across spilling collections in 
 the same task
 -

 Key: SPARK-2711
 URL: https://issues.apache.org/jira/browse/SPARK-2711
 Project: Spark
  Issue Type: Improvement
Reporter: Matei Zaharia
Assignee: Matei Zaharia

 Right now if there are two ExternalAppendOnlyMaps, they don't compete 
 correctly for memory. This can happen e.g. in a task that is both reducing 
 data from its parent RDD and writing it out to files for a future shuffle, 
 for instance if you do rdd.groupByKey(...).map(...).groupByKey(...) (another 
 key).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-2659) HiveQL: Division operator should always perform fractional division

2014-07-27 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-2659.
-

   Resolution: Fixed
Fix Version/s: 1.1.0

 HiveQL: Division operator should always perform fractional division
 ---

 Key: SPARK-2659
 URL: https://issues.apache.org/jira/browse/SPARK-2659
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Michael Armbrust
Priority: Minor
 Fix For: 1.1.0






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2410) Thrift/JDBC Server

2014-07-27 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14075833#comment-14075833
 ] 

Patrick Wendell commented on SPARK-2410:


{code}
[info] - test query execution against a Hive Thrift server *** FAILED ***
[info]   java.sql.SQLException: Could not open connection to 
jdbc:hive2://localhost:59556/: java.net.ConnectException: Connection refused
[info]   at 
org.apache.hive.jdbc.HiveConnection.openTransport(HiveConnection.java:146)
[info]   at org.apache.hive.jdbc.HiveConnection.init(HiveConnection.java:123)
[info]   at org.apache.hive.jdbc.HiveDriver.connect(HiveDriver.java:105)
[info]   at java.sql.DriverManager.getConnection(DriverManager.java:571)
[info]   at java.sql.DriverManager.getConnection(DriverManager.java:215)
[info]   at 
org.apache.spark.sql.hive.thriftserver.HiveThriftServer2Suite.getConnection(HiveThriftServer2Suite.scala:131)
[info]   at 
org.apache.spark.sql.hive.thriftserver.HiveThriftServer2Suite.createStatement(HiveThriftServer2Suite.scala:134)
[info]   at 
org.apache.spark.sql.hive.thriftserver.HiveThriftServer2Suite$$anonfun$1.apply$mcV$sp(HiveThriftServer2Suite.scala:110)
[info]   at 
org.apache.spark.sql.hive.thriftserver.HiveThriftServer2Suite$$anonfun$1.apply(HiveThriftServer2Suite.scala:107)
[info]   at 
org.apache.spark.sql.hive.thriftserver.HiveThriftServer2Suite$$anonfun$1.apply(HiveThriftServer2Suite.scala:107)
[info]   ...
[info]   Cause: org.apache.thrift.transport.TTransportException: 
java.net.ConnectException: Connection refused
[info]   at org.apache.thrift.transport.TSocket.open(TSocket.java:185)
[info]   at 
org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:248)
[info]   at 
org.apache.thrift.transport.TSaslClientTransport.open(TSaslClientTransport.java:37)
[info]   at 
org.apache.hive.jdbc.HiveConnection.openTransport(HiveConnection.java:144)
[info]   at org.apache.hive.jdbc.HiveConnection.init(HiveConnection.java:123)
[info]   at org.apache.hive.jdbc.HiveDriver.connect(HiveDriver.java:105)
[info]   at java.sql.DriverManager.getConnection(DriverManager.java:571)
[info]   at java.sql.DriverManager.getConnection(DriverManager.java:215)
[info]   at 
org.apache.spark.sql.hive.thriftserver.HiveThriftServer2Suite.getConnection(HiveThriftServer2Suite.scala:131)
[info]   at 
org.apache.spark.sql.hive.thriftserver.HiveThriftServer2Suite.createStatement(HiveThriftServer2Suite.scala:134)
[info]   ...
[info]   Cause: java.net.ConnectException: Connection refused
[info]   at java.net.PlainSocketImpl.socketConnect(Native Method)
[info]   at 
java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
[info]   at 
java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
[info]   at 
java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
[info]   at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
[info]   at java.net.Socket.connect(Socket.java:579)
[info]   at org.apache.thrift.transport.TSocket.open(TSocket.java:180)
[info]   at 
org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:248)
[info]   at 
org.apache.thrift.transport.TSaslClientTransport.open(TSaslClientTransport.java:37)
[info]   at 
org.apache.hive.jdbc.HiveConnection.openTransport(HiveConnection.java:144)
[info]   ...
[info] CliSuite:
Executing: create table hive_test1(key int, val string);, expecting output: OK
[info] - simple commands *** FAILED ***
[info]   java.lang.AssertionError: assertion failed: Didn't find OK in the 
output:
[info]   at scala.Predef$.assert(Predef.scala:179)
[info]   at 
org.apache.spark.sql.hive.thriftserver.TestUtils$class.waitForQuery(TestUtils.scala:70)
[info]   at 
org.apache.spark.sql.hive.thriftserver.CliSuite.waitForQuery(CliSuite.scala:25)
[info]   at 
org.apache.spark.sql.hive.thriftserver.TestUtils$class.executeQuery(TestUtils.scala:62)
[info]   at 
org.apache.spark.sql.hive.thriftserver.CliSuite.executeQuery(CliSuite.scala:25)
[info]   at 
org.apache.spark.sql.hive.thriftserver.CliSuite$$anonfun$1.apply$mcV$sp(CliSuite.scala:53)
[info]   at 
org.apache.spark.sql.hive.thriftserver.CliSuite$$anonfun$1.apply(CliSuite.scala:51)
[info]   at 
org.apache.spark.sql.hive.thriftserver.CliSuite$$anonfun$1.apply(CliSuite.scala:51)
[info]   at 
org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
[info]   at 
org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
{code}

 Thrift/JDBC Server
 --

 Key: SPARK-2410
 URL: https://issues.apache.org/jira/browse/SPARK-2410
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Michael Armbrust
Assignee: Cheng Lian
Priority: Blocker
 Fix For: 1.1.0


 We have this, but need to make sure that it gets merged into master before 
 the 1.1 release.




[jira] [Reopened] (SPARK-2410) Thrift/JDBC Server

2014-07-27 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell reopened SPARK-2410:



Reopening this again due to test issues.

 Thrift/JDBC Server
 --

 Key: SPARK-2410
 URL: https://issues.apache.org/jira/browse/SPARK-2410
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Michael Armbrust
Assignee: Cheng Lian
Priority: Blocker
 Fix For: 1.1.0


 We have this, but need to make sure that it gets merged into master before 
 the 1.1 release.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2651) Add maven scalastyle plugin

2014-07-27 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2651:
---

Assignee: Rahul Singhal

 Add maven scalastyle plugin
 ---

 Key: SPARK-2651
 URL: https://issues.apache.org/jira/browse/SPARK-2651
 Project: Spark
  Issue Type: Improvement
  Components: Build
Reporter: Rahul Singhal
Assignee: Rahul Singhal
Priority: Minor
 Fix For: 1.1.0


 SBT has a scalastyle plugin which can be executed to check for coding 
 conventions. It would be nice to add the same for maven builds.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2712) Add a small note that mvn package must happen before test

2014-07-27 Thread Stephen Boesch (JIRA)
Stephen Boesch created SPARK-2712:
-

 Summary: Add a small note that mvn package must happen before 
test
 Key: SPARK-2712
 URL: https://issues.apache.org/jira/browse/SPARK-2712
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 1.0.0, 0.9.1, 1.1.1
 Environment: all
Reporter: Stephen Boesch
Priority: Trivial
 Fix For: 1.1.0


Add to the building-with-maven.md:

Requirement: build packages before running tests
Tests must be run AFTER the package target has already been executed. The 
following is an example of a correct (build, test) sequence:
mvn -Pyarn -Phadoop-2.3 -DskipTests -Phive clean package
mvn -Pyarn -Phadoop-2.3 -Phive test

BTW Reynold Xin requested this tiny doc improvement.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2713) Executors of same application in same host should only download files & jars once

2014-07-27 Thread Zhihui (JIRA)
Zhihui created SPARK-2713:
-

 Summary: Executors of same application in same host should only 
download files & jars once
 Key: SPARK-2713
 URL: https://issues.apache.org/jira/browse/SPARK-2713
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Zhihui


If Spark launches multiple executors on one host for one application, every 
executor will download its dependent files and jars (if not using a local: URL) 
independently. This may result in significant latency. In my case, it took 20 
seconds to download the dependent jars (about 17 MB) when I launched 32 
executors on one host (4 hosts in total). 

This patch caches downloaded files and jars for executors, to reduce network 
traffic and download latency. In my case, the latency was reduced from 20 
seconds to less than 1 second.
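
As a sketch of the caching idea (the helper name, directory layout, and locking 
are illustrative assumptions; cross-process coordination is the hard part and 
is omitted here):

{code}
// Illustrative only: executors on the same host look in a shared cache directory
// keyed by file name before downloading. The synchronized block only guards
// threads within one JVM; real executors are separate processes and would need
// file-based locking or an atomic rename to avoid duplicate downloads.
import java.io.File

def fetchCached(url: String, cacheDir: File)(download: (String, File) => Unit): File = {
  val target = new File(cacheDir, new File(new java.net.URI(url).getPath).getName)
  cacheDir.synchronized {
    if (!target.exists()) download(url, target)
  }
  target
}
{code}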



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2713) Executors of same application in same host should only download files & jars once

2014-07-27 Thread Zhihui (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhihui updated SPARK-2713:
--

Description: 
If Spark launches multiple executors on one host for one application, every 
executor will download its dependent files and jars (if not using a local: URL) 
independently. This may result in significant latency. In my case, it took 20 
seconds to download the dependent jars (about 17 MB) when I launched 32 
executors on one host (4 hosts in total). 

This patch caches downloaded files and jars for executors, to reduce network 
traffic and download latency. In my case, the latency was reduced from 20 
seconds to less than 1 second.

  was:
If spark lunched multiple executors in one host for one application, every 
executor will download it dependent files and jars (if not using local: url) 
independently. It maybe result to huge latency. In my case, it result to 20 
seconds latency to download dependent jars(about 17M) when I lunch 32 executors 
in one host(total 4 hosts). 

This patch will cache downloaded files and jars for executors to reduce network 
throughput and download latency. I my case, the latency was reduced from 20 
seconds to less than 1 second.


 Executors of same application in same host should only download files & jars 
 once
 -

 Key: SPARK-2713
 URL: https://issues.apache.org/jira/browse/SPARK-2713
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Zhihui

 If Spark lunched multiple executors in one host for one application, every 
 executor would download it dependent files and jars (if not using local: url) 
 independently. It maybe result in huge latency. In my case, it result in 20 
 seconds latency to download dependent jars(about 17M) when I lunch 32 
 executors in one host(total 4 hosts). 
 This patch will cache downloaded files and jars for executors to reduce 
 network throughput and download latency. I my case, the latency was reduced 
 from 20 seconds to less than 1 second.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2713) Executors of same application in same host should only download files & jars once

2014-07-27 Thread Zhihui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14075914#comment-14075914
 ] 

Zhihui commented on SPARK-2713:
---

PR https://github.com/apache/spark/pull/1616

 Executors of same application in same host should only download files & jars 
 once
 -

 Key: SPARK-2713
 URL: https://issues.apache.org/jira/browse/SPARK-2713
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Zhihui

 If Spark lunched multiple executors in one host for one application, every 
 executor would download it dependent files and jars (if not using local: url) 
 independently. It maybe result in huge latency. In my case, it result in 20 
 seconds latency to download dependent jars(about 17M) when I lunch 32 
 executors in one host(total 4 hosts). 
 This patch will cache downloaded files and jars for executors to reduce 
 network throughput and download latency. I my case, the latency was reduced 
 from 20 seconds to less than 1 second.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2713) Executors of same application in same host should only download files & jars once

2014-07-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14075916#comment-14075916
 ] 

Apache Spark commented on SPARK-2713:
-

User 'li-zhihui' has created a pull request for this issue:
https://github.com/apache/spark/pull/1616

 Executors of same application in same host should only download files & jars 
 once
 -

 Key: SPARK-2713
 URL: https://issues.apache.org/jira/browse/SPARK-2713
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Zhihui

 If Spark lunched multiple executors in one host for one application, every 
 executor would download it dependent files and jars (if not using local: url) 
 independently. It maybe result in huge latency. In my case, it result in 20 
 seconds latency to download dependent jars(about 17M) when I lunch 32 
 executors in one host(total 4 hosts). 
 This patch will cache downloaded files and jars for executors to reduce 
 network throughput and download latency. I my case, the latency was reduced 
 from 20 seconds to less than 1 second.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2484) Build should not run hive compatibility tests by default.

2014-07-27 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2484:
---

Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-2487

 Build should not run hive compatibility tests by default.
 -

 Key: SPARK-2484
 URL: https://issues.apache.org/jira/browse/SPARK-2484
 Project: Spark
  Issue Type: Sub-task
Reporter: Guoqiang Li
Assignee: Guoqiang Li

 The Hive compatibility tests take a long time; in some cases, we don't need to 
 run them.



--
This message was sent by Atlassian JIRA
(v6.2#6252)