[GitHub] spark pull request: [SPARK-3536][SQL] SELECT on empty parquet tabl...

2014-09-19 Thread ravipesala
GitHub user ravipesala opened a pull request:

https://github.com/apache/spark/pull/2456

[SPARK-3536][SQL] SELECT on empty parquet table throws exception

Parquet returns null metadata when querying an empty parquet file while calculating splits, so this adds a null check and returns empty splits.
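
A minimal sketch of the guard being described (illustrative only, not the PR's actual diff; the parameter names and types are assumptions):

```scala
// If Parquet hands back null metadata for an empty table, return no splits
// instead of letting split calculation fail with a NullPointerException.
def safeSplits[M, S](metadata: java.util.List[M])(computeSplits: java.util.List[M] => Seq[S]): Seq[S] =
  if (metadata == null) Seq.empty[S] else computeSplits(metadata)
```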

Author : ravipesala ravindra.pes...@huawei.com

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ravipesala/spark SPARK-3536

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/2456.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2456


commit 1e81a50631b1f44ad7de65b83408a40218234745
Author: ravipesala ravindra.pes...@huawei.com
Date:   2014-09-18T18:02:46Z

Fixed the issue when querying on empty parquet file.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-2062][GraphX] VertexRDD.apply does not ...

2014-09-19 Thread ankurdave
Github user ankurdave commented on the pull request:

https://github.com/apache/spark/pull/1903#issuecomment-56140430
  
Thanks! Merged into master and branch-1.1.





[GitHub] spark pull request: [SPARK-2062][GraphX] VertexRDD.apply does not ...

2014-09-19 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/1903





[GitHub] spark pull request: Modify default YARN memory_overhead-- from an ...

2014-09-19 Thread nishkamravi2
Github user nishkamravi2 commented on the pull request:

https://github.com/apache/spark/pull/1391#issuecomment-56142506
  
@sryza Thanks Sandy.  Will do.





[GitHub] spark pull request: [SPARK-3250] Implement Gap Sampling optimizati...

2014-09-19 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/2455#issuecomment-56144570
  
add to whitelist





[GitHub] spark pull request: [SPARK-3250] Implement Gap Sampling optimizati...

2014-09-19 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/2455#issuecomment-56144582
  
this is ok to test





[GitHub] spark pull request: [SPARK-3491] [MLlib] [PySpark] use pickle to s...

2014-09-19 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/2378#issuecomment-56147622
  
@davies Does `PickleSerializer` compress data? If not, maybe we should 
cache the deserialized RDD instead of the one from `_.reserialize`. They have 
the same storage. I understand that batch-serialization can help GC. But 
algorithms like linear methods should only allocate short-lived objects. Is 
batch-serialization worth the tradeoff?





[GitHub] spark pull request: [SPARK-1987] EdgePartitionBuilder: More memory...

2014-09-19 Thread ankurdave
Github user ankurdave commented on the pull request:

https://github.com/apache/spark/pull/2446#issuecomment-56151121
  
Jenkins, this is ok to test.





[GitHub] spark pull request: [SPARK-3268][SQL] DoubleType, FloatType and De...

2014-09-19 Thread gvramana
GitHub user gvramana opened a pull request:

https://github.com/apache/spark/pull/2457

[SPARK-3268][SQL] DoubleType, FloatType and DecimalType modulus support

Adds support for the modulus operation using the % operator on the fractional datatypes FloatType, DoubleType and DecimalType.
Example:
SELECT 1388632775.0 % 60 FROM tablename LIMIT 1
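
For reference, the same operation in plain Scala terms (a hedged illustration, not the Catalyst implementation):

```scala
// % already works on Scala's Double and BigDecimal; the PR exposes the same
// semantics for Catalyst's FloatType, DoubleType and DecimalType.
val doubleMod  = 1388632775.0 % 60                 // 35.0
val decimalMod = BigDecimal("1388632775.0") % 60   // 35.0
```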

Author : Venkata Ramana Gollamudi ramana.gollam...@huawei.com

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gvramana/spark double_modulus_support

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/2457.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2457


commit 296d2539c0d745d0450441997390052352b8731d
Author: Venkata Ramana Gollamudi ramana.gollam...@huawei.com
Date:   2014-09-18T11:06:10Z

modified to add modulus support to fractional types float,double,decimal

commit 513d0e0ce4fdaf3faf11698d4ea079c79538f402
Author: Venkata Ramana Gollamudi ramana.gollam...@huawei.com
Date:   2014-09-18T11:06:10Z

modified to add modulus support to fractional types float,double,decimal

commit e112c09ccf0be8354afe3359a4d3e18c6346475c
Author: Venkata Ramana Gollamudi ramana.gollam...@huawei.com
Date:   2014-09-19T07:47:35Z

corrected the testcase

commit 3624471e5b65ccb92fb84d7de9303669ec79965e
Author: Venkata Ramana Gollamudi ramana.gollam...@huawei.com
Date:   2014-09-19T08:01:25Z

modified testcase







Re: [GitHub] spark pull request: [SPARK-3529] [SQL] Delete the temp files after...

2014-09-19 Thread Sean Owen
Hm, deleteOnExit should at least not hurt, and I thought it will delete dirs if they are empty, which may be the case if the temp files inside never existed or were cleaned up themselves. But yeah, always delete explicitly in the normal execution path, even in the event of normal exceptions.
On Sep 19, 2014 3:00 AM, mattf g...@git.apache.org wrote:

 Github user mattf commented on the pull request:

 https://github.com/apache/spark/pull/2393#issuecomment-56127248

 +1 lgtm

 fyi, i checked, deleteOnExit isn't an option because it cannot
 recursively delete
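
A small sketch of the point about deleteOnExit (plain java.io APIs; Spark's own `Utils.deleteRecursively` does the equivalent):

```scala
import java.io.File

// File.deleteOnExit registers a single path and only removes it if it is
// empty at JVM shutdown, so a populated temp directory must be walked explicitly.
def deleteRecursively(f: File): Unit = {
  if (f.isDirectory) {
    Option(f.listFiles()).getOrElse(Array.empty[File]).foreach(deleteRecursively)
  }
  f.delete()
}
```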






[GitHub] spark pull request: [SPARK-3597][Mesos] Implement `killTask`.

2014-09-19 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/2453#issuecomment-56152819
  
Sorry for asking - but have you tested this on a real cluster?





[GitHub] spark pull request: [SPARK-3597][Mesos] Implement `killTask`.

2014-09-19 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/2453#issuecomment-56152843
  
Oh and thanks for doing this!





[GitHub] spark pull request: [SPARK-3578] Fix upper bound in GraphGenerator...

2014-09-19 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/2439#issuecomment-56153581
  
@jegonzal you should take a look :)





[GitHub] spark pull request: [SPARK-3536][SQL] SELECT on empty parquet tabl...

2014-09-19 Thread ravipesala
Github user ravipesala commented on the pull request:

https://github.com/apache/spark/pull/2456#issuecomment-56157072
  
Please review





[GitHub] spark pull request: [SPARK-3598][SQL]cast to timestamp should be t...

2014-09-19 Thread adrian-wang
GitHub user adrian-wang opened a pull request:

https://github.com/apache/spark/pull/2458

[SPARK-3598][SQL]cast to timestamp should be the same as hive

This patch fixes timestamps smaller than 0 and casting int as timestamp.

select cast(1000 as timestamp) from src limit 1;

should return 1970-01-01 00:00:01, but we currently take 1000 as seconds.
Also, the current implementation has a bug when the time is before 1970-01-01 00:00:00.
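
A rough illustration of the two readings described above (not the PR's code; `Timestamp.toString` uses the local timezone, values shown for UTC):

```scala
import java.sql.Timestamp

val v = 1000
val hiveCompatible = new Timestamp(v.toLong)        // 1970-01-01 00:00:01.0 -- v taken as milliseconds
val currentSpark   = new Timestamp(v.toLong * 1000) // 1970-01-01 00:16:40.0 -- v taken as seconds
```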
@rxin @marmbrus @chenghao-intel

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/adrian-wang/spark timestamp

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/2458.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2458


commit 1234f666283172b28d5f17904fc3f2f5065a21ca
Author: Daoyuan Wang daoyuan.w...@intel.com
Date:   2014-09-19T10:11:49Z

fix timestamp smaller than 0 and cast int as timestamp







[GitHub] spark pull request: [SPARK-1888] enhance MEMORY_AND_DISK mode by d...

2014-09-19 Thread cloud-fan
Github user cloud-fan commented on the pull request:

https://github.com/apache/spark/pull/791#issuecomment-56164031
  
Hi @liyezhang556520 , thanks for pointing this out! I have updated my PR, 
please review @andrewor14 





[GitHub] spark pull request: [SPARK-1888] enhance MEMORY_AND_DISK mode by d...

2014-09-19 Thread liyezhang556520
Github user liyezhang556520 commented on a diff in the pull request:

https://github.com/apache/spark/pull/791#discussion_r17781184
  
--- Diff: core/src/main/scala/org/apache/spark/storage/MemoryStore.scala ---
@@ -239,18 +250,18 @@ private[spark] class MemoryStore(blockManager: BlockManager, maxMemory: Long)
       val currentSize = vector.estimateSize()
       if (currentSize >= memoryThreshold) {
         val amountToRequest = (currentSize * memoryGrowthFactor - memoryThreshold).toLong
-        // Hold the accounting lock, in case another thread concurrently puts a block that
-        // takes up the unrolling space we just ensured here
-        accountingLock.synchronized {
-          if (!reserveUnrollMemoryForThisThread(amountToRequest)) {
-            // If the first request is not granted, try again after ensuring free space
-            // If there is still not enough space, give up and drop the partition
-            val spaceToEnsure = maxUnrollMemory - currentUnrollMemory
-            if (spaceToEnsure > 0) {
-              val result = ensureFreeSpace(blockId, spaceToEnsure)
-              droppedBlocks ++= result.droppedBlocks
+        if (!reserveUnrollMemoryForThisThread(amountToRequest)) {
+          val spaceToEnsure = maxUnrollMemory - currentUnrollMemory
+          if (spaceToEnsure > 0) {
+            val task = planFreeSpace(blockId, spaceToEnsure)
--- End diff --

Hi @cloud-fan, you removed `accountingLock.synchronized` here, so more than one thread may call `planFreeSpace` here to reserve memory, and each thread will ask for memory of size `maxUnrollMemory - currentUnrollMemory`. I think the logic is not the same as the original intention.

Second question: what if `maxUnrollMemory` is large (`maxMemory*unrollFraction` might be dozens of GB), while the requested memory `amountToRequest` is small (maybe dozens of MB)? Then you only use one thread to free that amount, which is `spaceToEnsure`; this doesn't seem to solve the IO issue.

Third, since you lazily drop the to-be-dropped blocks, how can you avoid the OOM that @andrewor14 pointed out (the putting speed being faster than the dropping)?

Do these three problems exist in the current patch? Maybe I missed something.
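
A toy sketch of the first concern (assumed names, not Spark's MemoryStore code): without a shared lock around the check-and-reserve step, two threads can both observe enough free unroll memory and both reserve it, overshooting the budget.

```scala
object UnrollBudget {
  private val lock = new Object
  private var used = 0L
  val max = 1024L * 1024 * 1024 // illustrative 1 GB unroll budget

  // Atomic check-and-reserve; without `lock.synchronized` two threads could
  // each pass the check and together exceed `max`.
  def tryReserve(amount: Long): Boolean = lock.synchronized {
    if (used + amount <= max) { used += amount; true } else false
  }
}
```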





[GitHub] spark pull request: [SPARK-3584] sbin/slaves doesn't work when we ...

2014-09-19 Thread mattf
Github user mattf commented on the pull request:

https://github.com/apache/spark/pull/2444#issuecomment-56172067
  
+1 lgtm





[GitHub] spark pull request: [SPARK-3584] sbin/slaves doesn't work when we ...

2014-09-19 Thread mattf
Github user mattf commented on a diff in the pull request:

https://github.com/apache/spark/pull/2444#discussion_r17782052
  
--- Diff: sbin/slaves.sh ---
@@ -67,20 +69,26 @@ fi
 
 if [ "$HOSTLIST" = "" ]; then
   if [ "$SPARK_SLAVES" = "" ]; then
-    export HOSTLIST="${SPARK_CONF_DIR}/slaves"
+    if [ -f "${SPARK_CONF_DIR}/slaves" ]; then
+      HOSTLIST=`cat "${SPARK_CONF_DIR}/slaves"`
+    else
+      HOSTLIST=localhost
+    fi
   else
-    export HOSTLIST="${SPARK_SLAVES}"
+    HOSTLIST=`cat "${SPARK_SLAVES}"`
--- End diff --

thanks for pointing that out. i didn't read closely enough.





[GitHub] spark pull request: Modify default YARN memory_overhead-- from an ...

2014-09-19 Thread tgravescs
Github user tgravescs commented on the pull request:

https://github.com/apache/spark/pull/1391#issuecomment-56174217
  
@mridulm  any comments?

I'm ok with it if it's a consistent problem for users. One thing we definitely need to do is document it, and possibly look at including better log and error messages. We should at least log the size of the overhead it calculates. It would also be nice to log what it is when we fail to get a large enough container, or when it fails because the cluster max allocation limit was hit.





[GitHub] spark pull request: [SPARK-3477] Clean up code in Yarn Client / Cl...

2014-09-19 Thread tgravescs
Github user tgravescs commented on a diff in the pull request:

https://github.com/apache/spark/pull/2350#discussion_r17785127
  
--- Diff: yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala ---
@@ -37,154 +36,106 @@ import org.apache.hadoop.yarn.api.protocolrecords._
 import org.apache.hadoop.yarn.api.records._
 import org.apache.hadoop.yarn.conf.YarnConfiguration
 import org.apache.hadoop.yarn.util.Records
+
 import org.apache.spark.{Logging, SecurityManager, SparkConf, SparkContext, SparkException}
 
 /**
- * The entry point (starting in Client#main() and Client#run()) for launching Spark on YARN. The
- * Client submits an application to the YARN ResourceManager.
+ * The entry point (starting in Client#main() and Client#run()) for launching Spark on YARN.
+ * The Client submits an application to the YARN ResourceManager.
  */
-trait ClientBase extends Logging {
-  val args: ClientArguments
-  val conf: Configuration
-  val sparkConf: SparkConf
-  val yarnConf: YarnConfiguration
-  val credentials = UserGroupInformation.getCurrentUser().getCredentials()
-  private val SPARK_STAGING: String = ".sparkStaging"
+private[spark] trait ClientBase extends Logging {
+  import ClientBase._
+
+  protected val args: ClientArguments
+  protected val hadoopConf: Configuration
+  protected val sparkConf: SparkConf
+  protected val yarnConf: YarnConfiguration
+  protected val credentials = UserGroupInformation.getCurrentUser.getCredentials
+  protected val amMemoryOverhead = args.amMemoryOverhead // MB
+  protected val executorMemoryOverhead = args.executorMemoryOverhead // MB
   private val distCacheMgr = new ClientDistributedCacheManager()
 
-  // Staging directory is private! -> rwx--------
-  val STAGING_DIR_PERMISSION: FsPermission =
-    FsPermission.createImmutable(Integer.parseInt("700", 8).toShort)
-  // App files are world-wide readable and owner writable -> rw-r--r--
-  val APP_FILE_PERMISSION: FsPermission =
-    FsPermission.createImmutable(Integer.parseInt("644", 8).toShort)
-
-  // Additional memory overhead - in mb.
-  protected def memoryOverhead: Int = sparkConf.getInt("spark.yarn.driver.memoryOverhead",
-    YarnSparkHadoopUtil.DEFAULT_MEMORY_OVERHEAD)
-
-  // TODO(harvey): This could just go in ClientArguments.
-  def validateArgs() = {
-    Map(
-      (args.numExecutors <= 0) -> "Error: You must specify at least 1 executor!",
-      (args.amMemory <= memoryOverhead) -> ("Error: AM memory size must be" +
-        "greater than: " + memoryOverhead),
-      (args.executorMemory <= memoryOverhead) -> ("Error: Executor memory size" +
-        "must be greater than: " + memoryOverhead.toString)
-    ).foreach { case(cond, errStr) =>
-      if (cond) {
-        logError(errStr)
-        throw new IllegalArgumentException(args.getUsageMessage())
-      }
-    }
-  }
-
-  def getAppStagingDir(appId: ApplicationId): String = {
-    SPARK_STAGING + Path.SEPARATOR + appId.toString() + Path.SEPARATOR
-  }
-
-  def verifyClusterResources(app: GetNewApplicationResponse) = {
-    val maxMem = app.getMaximumResourceCapability().getMemory()
-    logInfo("Max mem capabililty of a single resource in this cluster " + maxMem)
-
-    // If we have requested more then the clusters max for a single resource then exit.
-    if (args.executorMemory > maxMem) {
-      val errorMessage =
-        "Required executor memory (%d MB), is above the max threshold (%d MB) of this cluster."
-          .format(args.executorMemory, maxMem)
-
-      logError(errorMessage)
-      throw new IllegalArgumentException(errorMessage)
-    }
-    val amMem = args.amMemory + memoryOverhead
+  /**
+   * Fail fast if we have requested more resources per container than is available in the cluster.
+   */
+  protected def verifyClusterResources(newAppResponse: GetNewApplicationResponse): Unit = {
+    val maxMem = newAppResponse.getMaximumResourceCapability().getMemory()
+    logInfo("Verifying our application has not requested more than the maximum " +
+      s"memory capability of the cluster ($maxMem MB per container)")
+    val executorMem = args.executorMemory + executorMemoryOverhead
+    if (executorMem > maxMem) {
+      throw new IllegalArgumentException(s"Required executor memory ($executorMem MB) " +
+        s"is above the max threshold ($maxMem MB) of this cluster!")
+    }
+    val amMem = args.amMemory + amMemoryOverhead
     if (amMem > maxMem) {
-
-      val errorMessage = "Required AM memory (%d) is above the max threshold (%d) of this cluster."
-        .format(amMem, maxMem)
-      logError(errorMessage)
-      throw new 
[GitHub] spark pull request: [SPARK-3477] Clean up code in Yarn Client / Cl...

2014-09-19 Thread tgravescs
Github user tgravescs commented on a diff in the pull request:

https://github.com/apache/spark/pull/2350#discussion_r17786319
  
--- Diff: yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala ---
@@ -415,41 +381,153 @@ trait ClientBase extends Logging {
       "1>", ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stdout",
       "2>", ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stderr")
 
-    logInfo("Yarn AM launch context:")
-    logInfo(s"  user class: ${args.userClass}")
-    logInfo(s"  env: $env")
-    logInfo(s"  command: ${commands.mkString(" ")}")
-
     // TODO: it would be nicer to just make sure there are no null commands here
     val printableCommands = commands.map(s => if (s == null) "null" else s).toList
     amContainer.setCommands(printableCommands)
 
-    setupSecurityToken(amContainer)
+    logDebug("===============================================================================")
+    logDebug("Yarn AM launch context:")
+    logDebug(s"    user class: ${Option(args.userClass).getOrElse("N/A")}")
+    logDebug("    env:")
+    launchEnv.foreach { case (k, v) => logDebug(s"        $k -> $v") }
+    logDebug("    resources:")
+    localResources.foreach { case (k, v) => logDebug(s"        $k -> $v")}
+    logDebug("    command:")
+    logDebug(s"        ${printableCommands.mkString(" ")}")
+    logDebug("===============================================================================")
 
     // send the acl settings into YARN to control who has access via YARN interfaces
     val securityManager = new SecurityManager(sparkConf)
     amContainer.setApplicationACLs(YarnSparkHadoopUtil.getApplicationAclsForYarn(securityManager))
+    setupSecurityToken(amContainer)
+    UserGroupInformation.getCurrentUser().addCredentials(credentials)
 
     amContainer
   }
+
+  /**
+   * Report the state of an application until it has exited, either successfully or
+   * due to some failure, then return the application state.
+   *
--- End diff --

missing the appId param





[GitHub] spark pull request: [MLLib] Fix example code variable name misspel...

2014-09-19 Thread rnowling
GitHub user rnowling opened a pull request:

https://github.com/apache/spark/pull/2459

[MLLib] Fix example code variable name misspelling in MLLib Feature 
Extraction guide



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rnowling/spark tfidf-fix

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/2459.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2459


commit b370a919451ca7e8c1b3eec1b35b941e48571717
Author: RJ Nowling rnowl...@gmail.com
Date:   2014-09-19T14:09:13Z

Fix variable name misspelling in MLLib Feature Extraction guide







[GitHub] spark pull request: [SPARK-3477] Clean up code in Yarn Client / Cl...

2014-09-19 Thread tgravescs
Github user tgravescs commented on a diff in the pull request:

https://github.com/apache/spark/pull/2350#discussion_r17786722
  
--- Diff: yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientDistributedCacheManager.scala ---
@@ -19,29 +19,24 @@ package org.apache.spark.deploy.yarn
 
 import java.net.URI
 
+import scala.collection.mutable.{HashMap, LinkedHashMap, Map}
+
 import org.apache.hadoop.conf.Configuration
-import org.apache.hadoop.fs.FileStatus
-import org.apache.hadoop.fs.FileSystem
-import org.apache.hadoop.fs.Path
+import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}
 import org.apache.hadoop.fs.permission.FsAction
-import org.apache.hadoop.yarn.api.records.LocalResource
-import org.apache.hadoop.yarn.api.records.LocalResourceVisibility
-import org.apache.hadoop.yarn.api.records.LocalResourceType
+import org.apache.hadoop.yarn.api.records._
--- End diff --

Just curious, why change this one to `._` and all the others to `{}`? I'm not sure if we have a standard for that. Generally I go for explicitly listing the ones I'm using.
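
For concreteness, the two styles being compared (class names taken from the lines above):

```scala
// wildcard import, as in the new diff
import org.apache.hadoop.yarn.api.records._
// explicit list, as the file previously used
import org.apache.hadoop.yarn.api.records.{LocalResource, LocalResourceType, LocalResourceVisibility}
```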





[GitHub] spark pull request: [SPARK-3477] Clean up code in Yarn Client / Cl...

2014-09-19 Thread tgravescs
Github user tgravescs commented on the pull request:

https://github.com/apache/spark/pull/2350#issuecomment-56183509
  
This mostly looks good.  A couple minor comments is all.  I do also still 
want to run through some tests on alpha.  






[GitHub] spark pull request: [SPARK-3578] Fix upper bound in GraphGenerator...

2014-09-19 Thread rnowling
Github user rnowling commented on the pull request:

https://github.com/apache/spark/pull/2439#issuecomment-56183914
  
@ankurdave I'd be a bit concerned about how that affects the correctness of the algorithm, especially since this will round every value down when maybe you only want to round the edge case down. Would you give me some time to check the original paper before you commit this?





[GitHub] spark pull request: [SPARK-3597][Mesos] Implement `killTask`.

2014-09-19 Thread brndnmtthws
Github user brndnmtthws commented on the pull request:

https://github.com/apache/spark/pull/2453#issuecomment-56184849
  
I did indeed test it.





[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...

2014-09-19 Thread gss2002
Github user gss2002 commented on the pull request:

https://github.com/apache/spark/pull/2241#issuecomment-56185334
  
We have been using this fix for a few weeks now against Hive 13. The only outstanding issue I see, and this could be something larger, is that the Spark Thrift service doesn't seem to support hive.server2.enable.doAs=true; it doesn't set the proxy user.





[GitHub] spark pull request: [SPARK-3477] Clean up code in Yarn Client / Cl...

2014-09-19 Thread tgravescs
Github user tgravescs commented on a diff in the pull request:

https://github.com/apache/spark/pull/2350#discussion_r17788554
  
--- Diff: yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala ---
@@ -415,41 +381,153 @@ trait ClientBase extends Logging {
       "1>", ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stdout",
       "2>", ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stderr")
 
-    logInfo("Yarn AM launch context:")
-    logInfo(s"  user class: ${args.userClass}")
-    logInfo(s"  env: $env")
-    logInfo(s"  command: ${commands.mkString(" ")}")
-
     // TODO: it would be nicer to just make sure there are no null commands here
     val printableCommands = commands.map(s => if (s == null) "null" else s).toList
     amContainer.setCommands(printableCommands)
 
-    setupSecurityToken(amContainer)
+    logDebug("===============================================================================")
+    logDebug("Yarn AM launch context:")
+    logDebug(s"    user class: ${Option(args.userClass).getOrElse("N/A")}")
+    logDebug("    env:")
+    launchEnv.foreach { case (k, v) => logDebug(s"        $k -> $v") }
+    logDebug("    resources:")
+    localResources.foreach { case (k, v) => logDebug(s"        $k -> $v")}
+    logDebug("    command:")
+    logDebug(s"        ${printableCommands.mkString(" ")}")
+    logDebug("===============================================================================")
 
     // send the acl settings into YARN to control who has access via YARN interfaces
     val securityManager = new SecurityManager(sparkConf)
     amContainer.setApplicationACLs(YarnSparkHadoopUtil.getApplicationAclsForYarn(securityManager))
+    setupSecurityToken(amContainer)
+    UserGroupInformation.getCurrentUser().addCredentials(credentials)
 
     amContainer
   }
+
+  /**
+   * Report the state of an application until it has exited, either successfully or
+   * due to some failure, then return the application state.
+   *
+   * @param returnOnRunning Whether to also return the application state when it is RUNNING.
+   * @param logApplicationReport Whether to log details of the application report every iteration.
+   * @return state of the application, one of FINISHED, FAILED, KILLED, and RUNNING.
+   */
+  def monitorApplication(
+      appId: ApplicationId,
+      returnOnRunning: Boolean = false,
+      logApplicationReport: Boolean = true): YarnApplicationState = {
+    val interval = sparkConf.getLong("spark.yarn.report.interval", 1000)
+    var lastState: YarnApplicationState = null
+    while (true) {
+      Thread.sleep(interval)
+      val report = getApplicationReport(appId)
+      val state = report.getYarnApplicationState
+
+      if (logApplicationReport) {
+        logInfo(s"Application report from ResourceManager for app ${appId.getId} (state: $state)")
--- End diff --

Seems like we wouldn't need the "from ResourceManager" here. Also, could we put the full application id here instead of just the last bit? It's much easier to copy and paste if the user wants to grab it and use it in a yarn command or the UI.
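
A hedged sketch of the suggestion (variable names assumed): log the full id, which `ApplicationId.toString` already yields, rather than only the numeric suffix from `getId`.

```scala
import org.apache.hadoop.yarn.api.records.ApplicationId

// yields e.g. "Application report for application_1411074133800_0001 (state: RUNNING)"
def reportLine(appId: ApplicationId, state: String): String =
  s"Application report for $appId (state: $state)"
```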





[GitHub] spark pull request: [SPARK-2365] Add IndexedRDD, an efficient upda...

2014-09-19 Thread squito
Github user squito commented on a diff in the pull request:

https://github.com/apache/spark/pull/1297#discussion_r17790219
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/IndexedRDDLike.scala ---
@@ -0,0 +1,338 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.rdd
+
+import scala.collection.immutable.LongMap
+import scala.language.higherKinds
+import scala.reflect.ClassTag
+
+import org.apache.spark._
+import org.apache.spark.SparkContext._
+import org.apache.spark.storage.StorageLevel
+
+import IndexedRDD.Id
+
+/**
+ * Contains members that are shared among all variants of IndexedRDD 
(e.g., IndexedRDD,
+ * VertexRDD).
+ *
+ * @tparam V the type of the values stored in the IndexedRDD
+ * @tparam P the type of the partitions making up the IndexedRDD
+ * @tparam Self the type of the implementing container. This allows 
transformation methods on any
+ * implementing container to yield a result of the same type.
+ */
+private[spark] trait IndexedRDDLike[
+    @specialized(Long, Int, Double) V,
+    P[X] <: IndexedRDDPartitionLike[X, P],
+    Self[X] <: IndexedRDDLike[X, P, Self]]
+  extends RDD[(Id, V)] {
+
+  /** A generator for ClassTags of the value type V. */
+  protected implicit def vTag: ClassTag[V]
+
+  /** A generator for ClassTags of the partition type P. */
+  protected implicit def pTag[V2]: ClassTag[P[V2]]
+
+  /** Accessor for the IndexedRDD variant that is mixing in this trait. */
+  protected def self: Self[V]
+
+  /** The underlying representation of the IndexedRDD as an RDD of 
partitions. */
+  def partitionsRDD: RDD[P[V]]
+  require(partitionsRDD.partitioner.isDefined)
+
+  def withPartitionsRDD[V2: ClassTag](partitionsRDD: RDD[P[V2]]): Self[V2]
+
+  override val partitioner = partitionsRDD.partitioner
+
+  override protected def getPartitions: Array[Partition] = 
partitionsRDD.partitions
+
+  override protected def getPreferredLocations(s: Partition): Seq[String] =
+partitionsRDD.preferredLocations(s)
+
+  override def persist(newLevel: StorageLevel): this.type = {
+partitionsRDD.persist(newLevel)
+this
+  }
+
+  override def unpersist(blocking: Boolean = true): this.type = {
+partitionsRDD.unpersist(blocking)
+this
+  }
+
+  override def count(): Long = {
+partitionsRDD.map(_.size).reduce(_ + _)
+  }
+
+  /** Provides the `RDD[(Id, V)]` equivalent output. */
+  override def compute(part: Partition, context: TaskContext): 
Iterator[(Id, V)] = {
+firstParent[P[V]].iterator(part, context).next.iterator
+  }
+
+  /** Gets the value corresponding to the specified key, if any. */
+  def get(k: Id): Option[V] = multiget(Array(k)).get(k)
+
+  /** Gets the values corresponding to the specified keys, if any. */
+  def multiget(ks: Array[Id]): Map[Id, V] = {
+    val ksByPartition = ks.groupBy(k => self.partitioner.get.getPartition(k))
+    val partitions = ksByPartition.keys.toSeq
+    def unionMaps(maps: TraversableOnce[LongMap[V]]): LongMap[V] = {
+      maps.foldLeft(LongMap.empty[V]) {
+        (accum, map) => accum.unionWith(map, (id, a, b) => a)
+      }
+    }
+// TODO: avoid sending all keys to all partitions by creating and 
zipping an RDD of keys
--- End diff --

would this be another use of the `bulkMultiget` I suggested in jira?





[GitHub] spark pull request: [SPARK-2365] Add IndexedRDD, an efficient upda...

2014-09-19 Thread squito
Github user squito commented on a diff in the pull request:

https://github.com/apache/spark/pull/1297#discussion_r17791303
  
--- Diff: core/src/main/scala/org/apache/spark/util/collection/ImmutableLongOpenHashSet.scala ---
@@ -0,0 +1,228 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.util.collection
+
+import scala.reflect._
+import com.google.common.hash.Hashing
+
+/**
+ * A fast, immutable hash set optimized for insertions and lookups (but 
not deletions) of `Long`
+ * elements. Because it exposes the position of a key in the underlying 
array, this is useful as a
+ * building block for higher level data structures such as a hash map (for 
example,
+ * IndexedRDDPartition).
+ *
+ * It uses quadratic probing with a power-of-2 hash table size, which is 
guaranteed to explore all
+ * spaces for each key (see 
http://en.wikipedia.org/wiki/Quadratic_probing).
+ */
+private[spark] class ImmutableLongOpenHashSet(
+/** Underlying array of elements used as a hash table. */
+val data: ImmutableVector[Long],
+/** Whether or not there is an element at the corresponding position 
in `data`. */
+val bitset: ImmutableBitSet,
+/**
+ * Position of a focused element. This is useful when returning a 
modified set along with a
+ * pointer to the location of modification.
+ */
+val focus: Int,
+/** Load threshold at which to grow the underlying vectors. */
+loadFactor: Double
+  ) extends Serializable {
+
+  require(loadFactor < 1.0, "Load factor must be less than 1.0")
+  require(loadFactor > 0.0, "Load factor must be greater than 0.0")
+  require(capacity == nextPowerOf2(capacity), "data capacity must be a power of 2")
+
+  import OpenHashSet.{INVALID_POS, NONEXISTENCE_MASK, POSITION_MASK, 
Hasher, LongHasher}
+
+  private val hasher: Hasher[Long] = new LongHasher
+
+  private def mask = capacity - 1
+  private def growThreshold = (loadFactor * capacity).toInt
+
+  def withFocus(focus: Int): ImmutableLongOpenHashSet =
+new ImmutableLongOpenHashSet(data, bitset, focus, loadFactor)
+
+  /** The number of elements in the set. */
+  def size: Int = bitset.cardinality
+
+  /** The capacity of the set (i.e. size of the underlying vector). */
+  def capacity: Int = data.size
+
+  /** Return true if this set contains the specified element. */
+  def contains(k: Long): Boolean = getPos(k) != INVALID_POS
+
+  /**
+   * Nondestructively add an element to the set, returning a new set. If 
the set is over capacity
+   * after the insertion, grows the set and rehashes all elements.
+   */
+  def add(k: Long): ImmutableLongOpenHashSet = {
+addWithoutResize(k).rehashIfNeeded(ImmutableLongOpenHashSet.grow, 
ImmutableLongOpenHashSet.move)
+  }
+
+  /**
+   * Add an element to the set. This one differs from add in that it 
doesn't trigger rehashing.
+   * The caller is responsible for calling rehashIfNeeded.
+   *
+   * Use (retval.focus & POSITION_MASK) to get the actual position, and
+   * (retval.focus & NONEXISTENCE_MASK) == 0 for prior existence.
+   */
+  def addWithoutResize(k: Long): ImmutableLongOpenHashSet = {
+    var pos = hashcode(hasher.hash(k)) & mask
+var i = 1
+var result: ImmutableLongOpenHashSet = null
+while (result == null) {
+  if (!bitset.get(pos)) {
+// This is a new key.
+result = new ImmutableLongOpenHashSet(
+  data.updated(pos, k), bitset.set(pos), pos | NONEXISTENCE_MASK, 
loadFactor)
+  } else if (data(pos) == k) {
+// Found an existing key.
+result = this.withFocus(pos)
+  } else {
+        val delta = i
+        pos = (pos + delta) & mask
+i += 1
+  }
+}
+result
+  }
+
+  /**
+   * Rehash the set if it is overloaded.
+   * @param allocateFunc Callback invoked when we 

[GitHub] spark pull request: [SPARK-927] detect numpy at time of use

2014-09-19 Thread mattf
Github user mattf commented on the pull request:

https://github.com/apache/spark/pull/2313#issuecomment-56197446
  
for some additional input, @pwendell - do you think requiring numpy for 
core would be acceptable?





[GitHub] spark pull request: SPARK-3580: New public method for RDD's to hav...

2014-09-19 Thread patmcdonough
Github user patmcdonough commented on a diff in the pull request:

https://github.com/apache/spark/pull/2447#discussion_r17794069
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
@@ -208,6 +208,23 @@ abstract class RDD[T: ClassTag](
   }
 
   /**
+   * Get the number of partitions in this RDD
+   *
+   * {{{
+   * scala> val rdd = sc.parallelize(1 to 4, 2)
+   * rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:12
+   *
+   * scala> rdd.getNumPartitions
+   * res1: Int = 2
+   * }}}
--- End diff --

Good point, although it's worth noting this was essentially ported directly 
from the python API (including the doc). Any doc changes should be consistent 
across both versions if possible.





[GitHub] spark pull request: [SPARK-2365] Add IndexedRDD, an efficient upda...

2014-09-19 Thread squito
Github user squito commented on the pull request:

https://github.com/apache/spark/pull/1297#issuecomment-56199798
  
This looks great! My comments are minor.

I know it's early to be discussing example docs, but I just wanted to mention that I can see caching being an area of confusion. E.g., you wouldn't want to serialize & cache each update to an IndexedRDD, as each cache would make a full copy and not get the benefits of the ImmutableVectors.





[GitHub] spark pull request: [WIP][SPARK-1486][MLlib] Multi Model Training ...

2014-09-19 Thread brkyvz
Github user brkyvz commented on the pull request:

https://github.com/apache/spark/pull/2451#issuecomment-56202513
  
@anantasty: If you could look through the code and mark places where you're like "What the heck is going on here?", it would be easier for me to write up proper comments. I'm going to add a lot today, and I can incorporate yours as well. Thanks!





[GitHub] spark pull request: SPARK-3574. Shuffle finish time always reporte...

2014-09-19 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/2440#issuecomment-56206573
  
Jenkins, test this please.





[GitHub] spark pull request: [SPARK-3597][Mesos] Implement `killTask`.

2014-09-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2453#issuecomment-56206744
  
Can one of the admins verify this patch?





[GitHub] spark pull request: [SPARK-3536][SQL] SELECT on empty parquet tabl...

2014-09-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2456#issuecomment-56206732
  
Can one of the admins verify this patch?





[GitHub] spark pull request: [SPARK-3268][SQL] DoubleType, FloatType and De...

2014-09-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2457#issuecomment-56206727
  
Can one of the admins verify this patch?





[GitHub] spark pull request: [SPARK-3446] Expose underlying job ids in Futu...

2014-09-19 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/2337#discussion_r17796741
  
--- Diff: core/src/main/scala/org/apache/spark/FutureAction.scala ---
@@ -83,6 +83,15 @@ trait FutureAction[T] extends Future[T] {
*/
   @throws(classOf[Exception])
   def get(): T = Await.result(this, Duration.Inf)
+
+  /**
+   * Returns the job IDs run by the underlying async operation.
+   *
+   * This returns the current snapshot of the job list. Certain operations 
may run multiple
+   * job, so multiple calls to this method may return different lists.
--- End diff --

multiple jobs





[GitHub] spark pull request: [SPARK-3250] Implement Gap Sampling optimizati...

2014-09-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2455#issuecomment-56206738
  
Can one of the admins verify this patch?





[GitHub] spark pull request: [SPARK-3446] Expose underlying job ids in Futu...

2014-09-19 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/2337#discussion_r17796804
  
--- Diff: core/src/test/scala/org/apache/spark/FutureActionSuite.scala ---
@@ -0,0 +1,49 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark
+
+import scala.concurrent.Await
+import scala.concurrent.duration.Duration
+
+import org.scalatest.{BeforeAndAfter, FunSuite, Matchers}
+
+import org.apache.spark.SparkContext._
+
+class FutureActionSuite extends FunSuite with BeforeAndAfter with Matchers 
with LocalSparkContext {
+
+  before {
+    sc = new SparkContext("local", "FutureActionSuite")
--- End diff --

can you add a test here for the case when multiple job id's are used?
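
A hedged sketch of the kind of test being asked for (assuming the accessor added by this PR is called `jobIds`, per the doc comment in the diff; it would sit inside the suite above):

```scala
  test("job ids of an async take that may run multiple jobs") {
    // takeAsync can launch several jobs as it scans more and more partitions
    val f = sc.parallelize(1 to 1000, 4).takeAsync(999)
    Await.result(f, Duration.Inf)
    f.jobIds.size should be >= 1
  }
```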





[GitHub] spark pull request: [SPARK-2098] All Spark processes should suppor...

2014-09-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2379#issuecomment-56207066
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20573/consoleFull)
 for   PR 2379 at commit 
[`5acc167`](https://github.com/apache/spark/commit/5acc16712f031d5e3269b9088acee8e7e6c8d431).
 * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3377] [Metrics] Metrics can be accident...

2014-09-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2432#issuecomment-56207059
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20575/consoleFull)
 for   PR 2432 at commit 
[`4a93c7f`](https://github.com/apache/spark/commit/4a93c7f7da8d829a8837f3a31aff0f08355e0c5a).
 * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3599]Avoid loaing properties file frequ...

2014-09-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2454#issuecomment-56207072
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20570/consoleFull)
 for   PR 2454 at commit 
[`2a79f26`](https://github.com/apache/spark/commit/2a79f26497f9232465aa2e9b496b0d54b9ccda75).
 * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-3574. Shuffle finish time always reporte...

2014-09-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2440#issuecomment-56207056
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20572/consoleFull)
 for   PR 2440 at commit 
[`b340956`](https://github.com/apache/spark/commit/b34095661f2fe060c1819293a203216c16cf5454).
 * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3535][Mesos] Fix resource handling.

2014-09-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2401#issuecomment-56207061
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20574/consoleFull)
 for   PR 2401 at commit 
[`56988e3`](https://github.com/apache/spark/commit/56988e31363bc07dc8acb369bdaade6b18b98f51).
 * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3491] [MLlib] [PySpark] use pickle to s...

2014-09-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2378#issuecomment-56207099
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20576/consoleFull)
 for   PR 2378 at commit 
[`dffbba2`](https://github.com/apache/spark/commit/dffbba2ba206bbbd3dfc740a55f1b0df341860e7).
 * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-1987] EdgePartitionBuilder: More memory...

2014-09-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2446#issuecomment-56207128
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20571/consoleFull)
 for   PR 2446 at commit 
[`e1a8f04`](https://github.com/apache/spark/commit/e1a8f04ba923935e26bc8a78c3e0aff03751aae4).
 * This patch **fails** unit tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `sealed trait Matrix extends Serializable `
  * `class SparseMatrix(`
  * `sealed trait Vector extends Serializable `



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-2098] All Spark processes should suppor...

2014-09-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2379#issuecomment-56207390
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20573/consoleFull)
 for   PR 2379 at commit 
[`5acc167`](https://github.com/apache/spark/commit/5acc16712f031d5e3269b9088acee8e7e6c8d431).
 * This patch **fails** unit tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `sealed trait Matrix extends Serializable `
  * `class SparseMatrix(`
  * `sealed trait Vector extends Serializable `



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3598][SQL]cast to timestamp should be t...

2014-09-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2458#issuecomment-56207035
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20569/consoleFull)
 for   PR 2458 at commit 
[`4274b1d`](https://github.com/apache/spark/commit/4274b1d10fc48746c850207fc27e5acc8630ddc9).
 * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3446] Expose underlying job ids in Futu...

2014-09-19 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/2337#issuecomment-56207872
  
It would be good to test the complex case with multiple job ids, but 
overall looks good. @rxin you added this interface - can you take a look (this 
is a very small patch)?
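
A hedged sketch of what such a test might look like, assuming the FutureActionSuite context shown elsewhere in this thread (SparkContext, ScalaTest Matchers, Await) and the `jobIds` accessor this PR adds; the partition count and element counts are illustrative only:

```scala
test("complex async action exposes multiple job ids") {
  val rdd = sc.parallelize(1 to 100, 4)
  // takeAsync may need more than one job when the first partitions
  // do not yield enough elements, so several job IDs should be recorded.
  val future = rdd.takeAsync(80)
  Await.result(future, Duration.Inf)
  future.jobIds.size should be > 1
}
```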


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3446] Expose underlying job ids in Futu...

2014-09-19 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/2337#discussion_r17797122
  
--- Diff: core/src/main/scala/org/apache/spark/FutureAction.scala ---
@@ -171,6 +179,8 @@ class ComplexFutureAction[T] extends FutureAction[T] {
   // is cancelled before the action was even run (and thus we have no 
thread to interrupt).
   @volatile private var _cancelled: Boolean = false
 
+  @volatile private var jobs: Seq[Int] = Nil
--- End diff --

Just wondering - any reason to make this a `var` instead of a `val` 
ListBuffer? And then we could return an immutable `Seq` in jobIds?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-3580: New public method for RDD's to hav...

2014-09-19 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/2447#issuecomment-56208207
  
Jenkins, retest this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-3580: New public method for RDD's to hav...

2014-09-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2447#issuecomment-56208954
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20577/consoleFull)
 for   PR 2447 at commit 
[`afc4e09`](https://github.com/apache/spark/commit/afc4e097842e45f50251a9340371b5ded0a65ae0).
 * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3491] [MLlib] [PySpark] use pickle to s...

2014-09-19 Thread davies
Github user davies commented on the pull request:

https://github.com/apache/spark/pull/2378#issuecomment-56210084
  
@mengxr PickleSerializer does not compress data; CompressSerializer can do that 
using gzip (level 1). Compression can help for doubles in a small range or for 
repeated values, but will be worse with random doubles over a large range.

BatchedSerializer can help reduce the per-object overhead of the class name. In 
the JVM, the memory of short-lived objects cannot be reused without GC, so 
batched serialization will not increase GC pressure as long as the batch size 
is not too large (depending on how GC is configured).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3491] [MLlib] [PySpark] use pickle to s...

2014-09-19 Thread davies
Github user davies commented on the pull request:

https://github.com/apache/spark/pull/2378#issuecomment-56211052
  
@mengxr In this PR, I tried to avoid any changes other than serialization; we 
could change the cache behavior or compression later.

It would be good to have some numbers on the performance regression; I only see 
a 5% regression in LogisticRegressionWithSGD.train() with a small dataset 
(locally).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3477] Clean up code in Yarn Client / Cl...

2014-09-19 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/2350#discussion_r17798663
  
--- Diff: 
yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientDistributedCacheManager.scala
 ---
@@ -19,29 +19,24 @@ package org.apache.spark.deploy.yarn
 
 import java.net.URI
 
+import scala.collection.mutable.{HashMap, LinkedHashMap, Map}
+
 import org.apache.hadoop.conf.Configuration
-import org.apache.hadoop.fs.FileStatus
-import org.apache.hadoop.fs.FileSystem
-import org.apache.hadoop.fs.Path
+import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}
 import org.apache.hadoop.fs.permission.FsAction
-import org.apache.hadoop.yarn.api.records.LocalResource
-import org.apache.hadoop.yarn.api.records.LocalResourceVisibility
-import org.apache.hadoop.yarn.api.records.LocalResourceType
+import org.apache.hadoop.yarn.api.records._
--- End diff --

Because of the IDE. I can fix it up.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-3574. Shuffle finish time always reporte...

2014-09-19 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/2440#issuecomment-56211203
  
LGTM pending tests.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3477] Clean up code in Yarn Client / Cl...

2014-09-19 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/2350#discussion_r17798689
  
--- Diff: 
yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala ---
@@ -415,41 +381,153 @@ trait ClientBase extends Logging {
 1, ApplicationConstants.LOG_DIR_EXPANSION_VAR + /stdout,
 2, ApplicationConstants.LOG_DIR_EXPANSION_VAR + /stderr)
 
-logInfo(Yarn AM launch context:)
-logInfo(s  user class: ${args.userClass})
-logInfo(s  env:$env)
-logInfo(s  command:${commands.mkString( )})
-
 // TODO: it would be nicer to just make sure there are no null 
commands here
 val printableCommands = commands.map(s = if (s == null) null else 
s).toList
 amContainer.setCommands(printableCommands)
 
-setupSecurityToken(amContainer)
+
logDebug(===)
+logDebug(Yarn AM launch context:)
+logDebug(suser class: ${Option(args.userClass).getOrElse(N/A)})
+logDebug(env:)
+launchEnv.foreach { case (k, v) = logDebug(s$k - $v) }
+logDebug(resources:)
+localResources.foreach { case (k, v) = logDebug(s$k - $v)}
+logDebug(command:)
+logDebug(s${printableCommands.mkString( )})
+
logDebug(===)
 
 // send the acl settings into YARN to control who has access via YARN 
interfaces
 val securityManager = new SecurityManager(sparkConf)
 
amContainer.setApplicationACLs(YarnSparkHadoopUtil.getApplicationAclsForYarn(securityManager))
+setupSecurityToken(amContainer)
+UserGroupInformation.getCurrentUser().addCredentials(credentials)
 
 amContainer
   }
+
+  /**
+   * Report the state of an application until it has exited, either 
successfully or
+   * due to some failure, then return the application state.
+   *
+   * @param returnOnRunning Whether to also return the application state 
when it is RUNNING.
+   * @param logApplicationReport Whether to log details of the application 
report every iteration.
+   * @return state of the application, one of FINISHED, FAILED, KILLED, 
and RUNNING.
+   */
+  def monitorApplication(
+  appId: ApplicationId,
+  returnOnRunning: Boolean = false,
+  logApplicationReport: Boolean = true): YarnApplicationState = {
+val interval = sparkConf.getLong(spark.yarn.report.interval, 1000)
+var lastState: YarnApplicationState = null
+while (true) {
+  Thread.sleep(interval)
+  val report = getApplicationReport(appId)
+  val state = report.getYarnApplicationState
+
+  if (logApplicationReport) {
+logInfo(sApplication report from ResourceManager for app 
${appId.getId} (state: $state))
--- End diff --

Ok


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3584] sbin/slaves doesn't work when we ...

2014-09-19 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/2444#discussion_r17798769
  
--- Diff: sbin/slaves.sh ---
@@ -67,20 +69,26 @@ fi
 
 if [ $HOSTLIST =  ]; then
   if [ $SPARK_SLAVES =  ]; then
-export HOSTLIST=${SPARK_CONF_DIR}/slaves
+if [ -f ${SPARK_CONF_DIR}/slaves ]; then
+  HOSTLIST=`cat ${SPARK_CONF_DIR}/slaves`
+else
+  HOSTLIST=localhost
+fi
   else
-export HOSTLIST=${SPARK_SLAVES}
+HOSTLIST=`cat ${SPARK_SLAVES}`
   fi
 fi
 
+
+
 # By default disable strict host key checking
 if [ $SPARK_SSH_OPTS =  ]; then
   SPARK_SSH_OPTS=-o StrictHostKeyChecking=no
 fi
 
-for slave in `cat $HOSTLIST|sed  s/#.*$//;/^$/d`; do
+for slave in `echo $HOSTLIST|sed  s/#.*$//;/^$/d`; do
  ssh $SPARK_SSH_OPTS $slave $${@// /\\ } \
-   21 | sed s/^/$slave: / 
+   21 | sed s/^/$slave: /
--- End diff --

I agree with matt - this will regress behavior for other users. Can we have 
a flag called `SSH_FOREGROUND` that turns this on?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: fix compile error for hadoop CDH 4.4+

2014-09-19 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/151


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-1793 - Heavily duplicated test setup cod...

2014-09-19 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/726


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3584] sbin/slaves doesn't work when we ...

2014-09-19 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/2444#discussion_r17799143
  
--- Diff: sbin/slaves.sh ---
@@ -67,20 +69,26 @@ fi
 
 if [ $HOSTLIST =  ]; then
   if [ $SPARK_SLAVES =  ]; then
-export HOSTLIST=${SPARK_CONF_DIR}/slaves
+if [ -f ${SPARK_CONF_DIR}/slaves ]; then
+  HOSTLIST=`cat ${SPARK_CONF_DIR}/slaves`
+else
+  HOSTLIST=localhost
--- End diff --

We should change the docs in `spark-standalone.md` to explain two new 
features:

1. You can set SSH_FOREGROUND if you cannot use passwordless SSH (currently, 
the docs say this is required).
2. If there is no `slaves` file in existence, it will launch a single slave 
at `localhost` by default.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3584] sbin/slaves doesn't work when we ...

2014-09-19 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/2444#discussion_r17799162
  
--- Diff: .gitignore ---
@@ -19,6 +19,7 @@ conf/*.sh
 conf/*.properties
 conf/*.conf
 conf/*.xml
+conf/slaves
--- End diff --

Okay, this is fine actually, given that we preserve the default behavior 
due to your edits below (of starting at localhost).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3584] sbin/slaves doesn't work when we ...

2014-09-19 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/2444#issuecomment-56212658
  
Made some comments. We need to guard this with a config parameter because 
otherwise it will regress behavior on large clusters where serial vs parallel 
ssh makes a big difference.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...

2014-09-19 Thread zhzhan
Github user zhzhan commented on the pull request:

https://github.com/apache/spark/pull/2241#issuecomment-56212828
  
This patch does not include the thrift patch, which will be addressed by other 
JIRAs, because I don't want the scope to get too big.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3595] Respect configured OutputCommitte...

2014-09-19 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/2450#discussion_r17799318
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala 
---
@@ -872,7 +872,12 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
   hadoopConf.set(mapred.output.compression.codec, c.getCanonicalName)
   hadoopConf.set(mapred.output.compression.type, 
CompressionType.BLOCK.toString)
 }
-hadoopConf.setOutputCommitter(classOf[FileOutputCommitter])
+
+// Useful on EMR where direct output committer is set by default
--- End diff --

For this comment I'd make it more general:

```
// Use existing output committer if already set
```

I'm guessing over time we'll run into many formats that require this.
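
A minimal sketch of the behavior being discussed, assuming Hadoop's `JobConf` API; the object name, helper name, and explicit config-key check are illustrative rather than the exact patch:

```scala
import org.apache.hadoop.mapred.{FileOutputCommitter, JobConf}

object CommitterDefaults {
  // Only fall back to FileOutputCommitter when the job conf does not already
  // name a committer, so a preconfigured one (e.g. EMR's direct committer
  // for s3:// output) is left in place.
  def ensureOutputCommitter(hadoopConf: JobConf): Unit = {
    if (hadoopConf.get("mapred.output.committer.class") == null) {
      hadoopConf.setOutputCommitter(classOf[FileOutputCommitter])
    }
  }
}
```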


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3477] Clean up code in Yarn Client / Cl...

2014-09-19 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/2350#discussion_r17799322
  
--- Diff: 
yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala ---
@@ -37,154 +36,106 @@ import org.apache.hadoop.yarn.api.protocolrecords._
 import org.apache.hadoop.yarn.api.records._
 import org.apache.hadoop.yarn.conf.YarnConfiguration
 import org.apache.hadoop.yarn.util.Records
+
 import org.apache.spark.{Logging, SecurityManager, SparkConf, 
SparkContext, SparkException}
 
 /**
- * The entry point (starting in Client#main() and Client#run()) for 
launching Spark on YARN. The
- * Client submits an application to the YARN ResourceManager.
+ * The entry point (starting in Client#main() and Client#run()) for 
launching Spark on YARN.
+ * The Client submits an application to the YARN ResourceManager.
  */
-trait ClientBase extends Logging {
-  val args: ClientArguments
-  val conf: Configuration
-  val sparkConf: SparkConf
-  val yarnConf: YarnConfiguration
-  val credentials = UserGroupInformation.getCurrentUser().getCredentials()
-  private val SPARK_STAGING: String = .sparkStaging
+private[spark] trait ClientBase extends Logging {
+  import ClientBase._
+
+  protected val args: ClientArguments
+  protected val hadoopConf: Configuration
+  protected val sparkConf: SparkConf
+  protected val yarnConf: YarnConfiguration
+  protected val credentials = 
UserGroupInformation.getCurrentUser.getCredentials
+  protected val amMemoryOverhead = args.amMemoryOverhead // MB
+  protected val executorMemoryOverhead = args.executorMemoryOverhead // MB
   private val distCacheMgr = new ClientDistributedCacheManager()
 
-  // Staging directory is private! - rwx
-  val STAGING_DIR_PERMISSION: FsPermission =
-FsPermission.createImmutable(Integer.parseInt(700, 8).toShort)
-  // App files are world-wide readable and owner writable - rw-r--r--
-  val APP_FILE_PERMISSION: FsPermission =
-FsPermission.createImmutable(Integer.parseInt(644, 8).toShort)
-
-  // Additional memory overhead - in mb.
-  protected def memoryOverhead: Int = 
sparkConf.getInt(spark.yarn.driver.memoryOverhead,
-YarnSparkHadoopUtil.DEFAULT_MEMORY_OVERHEAD)
-
-  // TODO(harvey): This could just go in ClientArguments.
-  def validateArgs() = {
-Map(
-  (args.numExecutors = 0) - Error: You must specify at least 1 
executor!,
-  (args.amMemory = memoryOverhead) - (Error: AM memory size must 
be +
-greater than:  + memoryOverhead),
-  (args.executorMemory = memoryOverhead) - (Error: Executor memory 
size +
-must be greater than:  + memoryOverhead.toString)
-).foreach { case(cond, errStr) =
-  if (cond) {
-logError(errStr)
-throw new IllegalArgumentException(args.getUsageMessage())
-  }
-}
-  }
-
-  def getAppStagingDir(appId: ApplicationId): String = {
-SPARK_STAGING + Path.SEPARATOR + appId.toString() + Path.SEPARATOR
-  }
-
-  def verifyClusterResources(app: GetNewApplicationResponse) = {
-val maxMem = app.getMaximumResourceCapability().getMemory()
-logInfo(Max mem capabililty of a single resource in this cluster  + 
maxMem)
-
-// If we have requested more then the clusters max for a single 
resource then exit.
-if (args.executorMemory  maxMem) {
-  val errorMessage =
-Required executor memory (%d MB), is above the max threshold (%d 
MB) of this cluster.
-  .format(args.executorMemory, maxMem)
-
-  logError(errorMessage)
-  throw new IllegalArgumentException(errorMessage)
-}
-val amMem = args.amMemory + memoryOverhead
+  /**
+   * Fail fast if we have requested more resources per container than is 
available in the cluster.
+   */
+  protected def verifyClusterResources(newAppResponse: 
GetNewApplicationResponse): Unit = {
+val maxMem = newAppResponse.getMaximumResourceCapability().getMemory()
+logInfo(Verifying our application has not requested more than the 
maximum  +
+  smemory capability of the cluster ($maxMem MB per container))
+val executorMem = args.executorMemory + executorMemoryOverhead
+if (executorMem  maxMem) {
+  throw new IllegalArgumentException(sRequired executor memory 
($executorMem MB)  +
+sis above the max threshold ($maxMem MB) of this cluster!)
+}
+val amMem = args.amMemory + amMemoryOverhead
 if (amMem  maxMem) {
-
-  val errorMessage = Required AM memory (%d) is above the max 
threshold (%d) of this cluster.
-.format(amMem, maxMem)
-  logError(errorMessage)
-  throw new 

[GitHub] spark pull request: SPARK-3574. Shuffle finish time always reporte...

2014-09-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2440#issuecomment-56213954
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20572/consoleFull)
 for   PR 2440 at commit 
[`b340956`](https://github.com/apache/spark/commit/b34095661f2fe060c1819293a203216c16cf5454).
 * This patch **fails** unit tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `sealed trait Matrix extends Serializable `
  * `class SparseMatrix(`
  * `sealed trait Vector extends Serializable `



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3598][SQL]cast to timestamp should be t...

2014-09-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2458#issuecomment-56214025
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20569/consoleFull)
 for   PR 2458 at commit 
[`4274b1d`](https://github.com/apache/spark/commit/4274b1d10fc48746c850207fc27e5acc8630ddc9).
 * This patch **passes** unit tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `sealed trait Matrix extends Serializable `
  * `class SparseMatrix(`
  * `sealed trait Vector extends Serializable `



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3595] Respect configured OutputCommitte...

2014-09-19 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/2450#discussion_r17799792
  
--- Diff: 
core/src/test/scala/org/apache/spark/rdd/PairRDDFunctionsSuite.scala ---
@@ -478,6 +482,15 @@ class PairRDDFunctionsSuite extends FunSuite with 
SharedSparkContext {
 pairs.saveAsNewAPIHadoopFile[ConfigTestFormat](ignored)
   }
 
+  test(saveAsHadoopFile should respect configured output committers) {
+val pairs = sc.parallelize(Array((new Integer(1), new Integer(1
+val conf = new JobConf(sc.hadoopConfiguration)
+conf.setOutputCommitter(classOf[FakeOutputCommitter])
+pairs.saveAsHadoopFile(ignored, pairs.keyClass, pairs.valueClass, 
classOf[FakeOutputFormat], conf)
+val ran = sys.props.remove(mapred.committer.ran)
--- End diff --

This use of system properties means this test can't run in parallel. 
It might be good to do two things (sketched below):

1. Guard these tests with a lock so both can't run at the same time.
2. Clear the `mapred.committer.ran` property before starting the test 
(otherwise you could get a false positive).
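
A minimal sketch of both points, assuming ScalaTest; the guard object and helper name are invented for illustration, and only the property key comes from the diff above:

```scala
object CommitterTestGuard {
  // Serializes the committer tests and clears the flag up front so a value
  // left over from an earlier run cannot produce a false positive.
  def withCommitterFlagCleared[T](body: => T): T = this.synchronized {
    sys.props.remove("mapred.committer.ran")
    body
  }
}
```

Each committer test could then wrap its body in `CommitterTestGuard.withCommitterFlagCleared { ... }` before asserting on the property.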


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3595] Respect configured OutputCommitte...

2014-09-19 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/2450#discussion_r17799890
  
--- Diff: 
core/src/test/scala/org/apache/spark/rdd/PairRDDFunctionsSuite.scala ---
@@ -478,6 +482,15 @@ class PairRDDFunctionsSuite extends FunSuite with 
SharedSparkContext {
 pairs.saveAsNewAPIHadoopFile[ConfigTestFormat](ignored)
   }
 
+  test(saveAsHadoopFile should respect configured output committers) {
+val pairs = sc.parallelize(Array((new Integer(1), new Integer(1
+val conf = new JobConf(sc.hadoopConfiguration)
--- End diff --

Could this just start with a blank JobConf rather than reading the one from 
the SparkContext?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3597][Mesos] Implement `killTask`.

2014-09-19 Thread andrewor14
Github user andrewor14 commented on the pull request:

https://github.com/apache/spark/pull/2453#issuecomment-56214417
  
There is a related PR #1940


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3597][Mesos] Implement `killTask`.

2014-09-19 Thread andrewor14
Github user andrewor14 commented on the pull request:

https://github.com/apache/spark/pull/2453#issuecomment-56214347
  
ok to test


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3595] Respect configured OutputCommitte...

2014-09-19 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/2450#discussion_r17800209
  
--- Diff: examples/src/main/scala/org/apache/spark/examples/AwsTest.scala 
---
@@ -0,0 +1,88 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.examples
+
+import org.apache.commons.logging.LogFactory
+import org.apache.hadoop.fs.{FileSystem, Path}
+import org.apache.hadoop.mapred._
+import org.apache.spark.{SparkConf, SparkContext}
+import org.apache.spark.SparkContext._
+
+/**
+ * An OutputCommitter similar to the one used by default for s3:// URLs in 
EMR.
+ */
+class DirectOutputCommitter extends OutputCommitter {
--- End diff --

It's great that you did this integration test to verify this is working. 
However, we usually won't merge things like this into the repo because tests 
that aren't run regularly as part of our harness don't provide much testing 
value (and often become out of date, etc).

AFAIK the unit test provides pretty good coverage here. Would you mind 
dropping this from the PR?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3595] Respect configured OutputCommitte...

2014-09-19 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/2450#issuecomment-56214986
  
Thanks for sending this. The approach seems solid. I made some small 
comments in a few places.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-3605. Fix typo in SchemaRDD.

2014-09-19 Thread sryza
GitHub user sryza opened a pull request:

https://github.com/apache/spark/pull/2460

SPARK-3605. Fix typo in SchemaRDD.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sryza/spark sandy-spark-3605

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/2460.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2460


commit 09d940ba78c3ed432c4982d167f979fa94a82c56
Author: Sandy Ryza sa...@cloudera.com
Date:   2014-09-19T18:20:34Z

SPARK-3605. Fix typo in SchemaRDD.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3446] Expose underlying job ids in Futu...

2014-09-19 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/2337#discussion_r17800470
  
--- Diff: core/src/test/scala/org/apache/spark/FutureActionSuite.scala ---
@@ -0,0 +1,49 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark
+
+import scala.concurrent.Await
+import scala.concurrent.duration.Duration
+
+import org.scalatest.{BeforeAndAfter, FunSuite, Matchers}
+
+import org.apache.spark.SparkContext._
+
+class FutureActionSuite extends FunSuite with BeforeAndAfter with Matchers 
with LocalSparkContext {
+
+  before {
+sc = new SparkContext(local, FutureActionSuite)
--- End diff --

Isn't that the test on L41 (complex async action)?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-3605. Fix typo in SchemaRDD.

2014-09-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2460#issuecomment-56215386
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20580/consoleFull)
 for   PR 2460 at commit 
[`09d940b`](https://github.com/apache/spark/commit/09d940ba78c3ed432c4982d167f979fa94a82c56).
 * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-3574. Shuffle finish time always reporte...

2014-09-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2440#issuecomment-56215397
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20581/consoleFull)
 for   PR 2440 at commit 
[`c81439b`](https://github.com/apache/spark/commit/c81439be1595bd2403c97065b58c4e4319bdf37e).
 * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3446] Expose underlying job ids in Futu...

2014-09-19 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/2337#discussion_r17800549
  
--- Diff: core/src/main/scala/org/apache/spark/FutureAction.scala ---
@@ -171,6 +179,8 @@ class ComplexFutureAction[T] extends FutureAction[T] {
   // is cancelled before the action was even run (and thus we have no 
thread to interrupt).
   @volatile private var _cancelled: Boolean = false
 
+  @volatile private var jobs: Seq[Int] = Nil
--- End diff --

I'm trying to avoid synchronization. Having a mutable list here means I'd 
have to synchronize when returning the immutable Seq in `jobIds`; with the 
volatile var, I'm only doing read operations on the `Seq`s themselves, so I 
don't need to synchronize.
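
To illustrate the trade-off, a hedged sketch (class and method names are illustrative, not the PR's code): publishing a fresh immutable `Seq` through a `@volatile` var lets readers snapshot it without locking, provided there is a single writer.

```scala
class JobIdTracker {
  // Each update swaps in a new immutable list; a reader sees either the old
  // or the new Seq, never a partially built one, so reads need no lock.
  @volatile private var jobs: Seq[Int] = Nil

  def recordJob(jobId: Int): Unit = { jobs = jobs :+ jobId } // assumes a single writer thread

  def jobIds: Seq[Int] = jobs
}
```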


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [WIP][SPARK-1486][MLlib] Multi Model Training ...

2014-09-19 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/2451#discussion_r17800692
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala 
---
@@ -57,13 +250,709 @@ trait Matrix extends Serializable {
  * @param numCols number of columns
  * @param values matrix entries in column major
  */
-class DenseMatrix(val numRows: Int, val numCols: Int, val values: 
Array[Double]) extends Matrix {
+class DenseMatrix(val numRows: Int, val numCols: Int, val values: 
Array[Double]) extends Matrix with Serializable {
 
-  require(values.length == numRows * numCols)
+  require(values.length == numRows * numCols, The number of values 
supplied doesn't match the  +
+ssize of the matrix! values.length: ${values.length}, numRows * 
numCols: ${numRows * numCols})
 
   override def toArray: Array[Double] = values
 
-  private[mllib] override def toBreeze: BM[Double] = new 
BDM[Double](numRows, numCols, values)
+  private[mllib] def toBreeze: BM[Double] = new BDM[Double](numRows, 
numCols, values)
+
+  private[mllib] def apply(i: Int): Double = values(i)
+
+  private[mllib] def apply(i: Int, j: Int): Double = values(index(i, j))
+
+  private[mllib] def index(i: Int, j: Int): Int = i + numRows * j
+
+  private[mllib] def update(i: Int, j: Int, v: Double): Unit = {
+values(index(i, j)) = v
+  }
+
+  override def copy = new DenseMatrix(numRows, numCols, values.clone())
+
+  private[mllib] def elementWiseOperateOnColumnsInPlace(
+  f: (Double, Double) = Double,
+  y: Matrix): DenseMatrix = {
+val y_vals = y.toArray
+val len = y_vals.length
+require(y_vals.length == numRows)
+var j = 0
+while (j  numCols){
+  var i = 0
+  while (i  len){
+val idx = index(i, j)
+values(idx) = f(values(idx), y_vals(i))
+i += 1
+  }
+  j += 1
+}
+this
+  }
+
+  private[mllib] def elementWiseOperateOnRowsInPlace(
+ f: (Double, Double) = Double,
+ y: Matrix): DenseMatrix = {
+val y_vals = y.toArray
+require(y_vals.length == numCols)
+var j = 0
+while (j  numCols){
+  var i = 0
+  while (i  numRows){
+val idx = index(i, j)
+values(idx) = f(values(idx), y_vals(j))
+i += 1
+  }
+  j += 1
+}
+this
+  }
+
+  private[mllib] def elementWiseOperateInPlace(f: (Double, Double) = 
Double, y: Matrix): DenseMatrix =  {
+val y_val = y.toArray
+val len = values.length
+require(y_val.length == values.length)
+var j = 0
+while (j  len){
+  values(j) = f(values(j), y_val(j))
+  j += 1
+}
+this
+  }
+
+  private[mllib] def elementWiseOperateScalarInPlace(f: (Double, Double) 
= Double, y: Double): DenseMatrix =  {
+var j = 0
+val len = values.length
+while (j  len){
+  values(j) = f(values(j), y)
+  j += 1
+}
+this
+  }
+
+  private[mllib] def operateInPlace(f: (Double, Double) = Double, y: 
Matrix): DenseMatrix = {
+if (y.numCols==1 || y.numRows == 1){
+  require(numCols != numRows, Operation is ambiguous. Please use 
elementWiseOperateOnRows  +
+or elementWiseOperateOnColumns instead)
+}
+if (y.numCols == 1  y.numRows == 1){
+  elementWiseOperateScalarInPlace(f, y.toArray(0))
+} else {
+  if (y.numCols==1) {
+elementWiseOperateOnColumnsInPlace(f, y)
+  }else if (y.numRows==1){
+elementWiseOperateOnRowsInPlace(f, y)
+  }else{
+elementWiseOperateInPlace(f, y)
+  }
+}
+  }
+
+  private[mllib] def elementWiseOperateOnColumns(f: (Double, Double) = 
Double, y: Matrix): DenseMatrix = {
+val dup = this.copy
+dup.elementWiseOperateOnColumnsInPlace(f, y)
+  }
+
+  private[mllib] def elementWiseOperateOnRows(f: (Double, Double) = 
Double, y: Matrix): DenseMatrix = {
+val dup = this.copy
+dup.elementWiseOperateOnRowsInPlace(f, y)
+  }
+
+  private[mllib] def elementWiseOperate(f: (Double, Double) = Double, y: 
Matrix): DenseMatrix =  {
+val dup = this.copy
+dup.elementWiseOperateInPlace(f, y)
+  }
+
+  private[mllib] def elementWiseOperateScalar(f: (Double, Double) = 
Double, y: Double): DenseMatrix =  {
+val dup = this.copy
+dup.elementWiseOperateScalarInPlace(f, y)
+  }
+
+  private[mllib] def operate(f: (Double, Double) = Double, y: Matrix): 
DenseMatrix = {
+val dup = this.copy
+dup.operateInPlace(f, y)
+  }
+
+  def map(f: Double = Double) = new 

[GitHub] spark pull request: [WIP][SPARK-1486][MLlib] Multi Model Training ...

2014-09-19 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/2451#discussion_r17800664
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala 
---
@@ -57,13 +250,709 @@ trait Matrix extends Serializable {
  * @param numCols number of columns
  * @param values matrix entries in column major
  */
-class DenseMatrix(val numRows: Int, val numCols: Int, val values: 
Array[Double]) extends Matrix {
+class DenseMatrix(val numRows: Int, val numCols: Int, val values: Array[Double]) extends Matrix with Serializable {
 
-  require(values.length == numRows * numCols)
+  require(values.length == numRows * numCols, "The number of values supplied doesn't match the " +
+    s"size of the matrix! values.length: ${values.length}, numRows * numCols: ${numRows * numCols}")
 
   override def toArray: Array[Double] = values
 
-  private[mllib] override def toBreeze: BM[Double] = new BDM[Double](numRows, numCols, values)
+  private[mllib] def toBreeze: BM[Double] = new BDM[Double](numRows, numCols, values)
+
+  private[mllib] def apply(i: Int): Double = values(i)
+
+  private[mllib] def apply(i: Int, j: Int): Double = values(index(i, j))
+
+  private[mllib] def index(i: Int, j: Int): Int = i + numRows * j
+
+  private[mllib] def update(i: Int, j: Int, v: Double): Unit = {
+    values(index(i, j)) = v
+  }
+
+  override def copy = new DenseMatrix(numRows, numCols, values.clone())
+
+  private[mllib] def elementWiseOperateOnColumnsInPlace(
+      f: (Double, Double) => Double,
+      y: Matrix): DenseMatrix = {
+    val y_vals = y.toArray
+    val len = y_vals.length
+    require(y_vals.length == numRows)
+    var j = 0
+    while (j < numCols){
+      var i = 0
+      while (i < len){
+        val idx = index(i, j)
+        values(idx) = f(values(idx), y_vals(i))
+        i += 1
+      }
+      j += 1
+    }
+    this
+  }
+
+  private[mllib] def elementWiseOperateOnRowsInPlace(
+      f: (Double, Double) => Double,
+      y: Matrix): DenseMatrix = {
+    val y_vals = y.toArray
+    require(y_vals.length == numCols)
+    var j = 0
+    while (j < numCols){
+      var i = 0
+      while (i < numRows){
+        val idx = index(i, j)
+        values(idx) = f(values(idx), y_vals(j))
+        i += 1
+      }
+      j += 1
+    }
+    this
+  }
+
+  private[mllib] def elementWiseOperateInPlace(f: (Double, Double) => Double, y: Matrix): DenseMatrix =  {
+    val y_val = y.toArray
+    val len = values.length
+    require(y_val.length == values.length)
+    var j = 0
+    while (j < len){
+      values(j) = f(values(j), y_val(j))
+      j += 1
+    }
+    this
+  }
+
+  private[mllib] def elementWiseOperateScalarInPlace(f: (Double, Double) => Double, y: Double): DenseMatrix =  {
+    var j = 0
+    val len = values.length
+    while (j < len){
+      values(j) = f(values(j), y)
+      j += 1
+    }
+    this
+  }
+
+  private[mllib] def operateInPlace(f: (Double, Double) => Double, y: Matrix): DenseMatrix = {
+    if (y.numCols==1 || y.numRows == 1){
+      require(numCols != numRows, "Operation is ambiguous. Please use elementWiseOperateOnRows " +
+        "or elementWiseOperateOnColumns instead")
+    }
+    if (y.numCols == 1 && y.numRows == 1){
+      elementWiseOperateScalarInPlace(f, y.toArray(0))
+    } else {
+      if (y.numCols==1) {
+        elementWiseOperateOnColumnsInPlace(f, y)
+      }else if (y.numRows==1){
+        elementWiseOperateOnRowsInPlace(f, y)
+      }else{
+        elementWiseOperateInPlace(f, y)
+      }
+    }
+  }
+
+  private[mllib] def elementWiseOperateOnColumns(f: (Double, Double) => Double, y: Matrix): DenseMatrix = {
+    val dup = this.copy
+    dup.elementWiseOperateOnColumnsInPlace(f, y)
+  }
+
+  private[mllib] def elementWiseOperateOnRows(f: (Double, Double) => Double, y: Matrix): DenseMatrix = {
+    val dup = this.copy
+    dup.elementWiseOperateOnRowsInPlace(f, y)
+  }
+
+  private[mllib] def elementWiseOperate(f: (Double, Double) => Double, y: Matrix): DenseMatrix =  {
+    val dup = this.copy
+    dup.elementWiseOperateInPlace(f, y)
+  }
+
+  private[mllib] def elementWiseOperateScalar(f: (Double, Double) => Double, y: Double): DenseMatrix =  {
+    val dup = this.copy
+    dup.elementWiseOperateScalarInPlace(f, y)
+  }
+
+  private[mllib] def operate(f: (Double, Double) => Double, y: Matrix): DenseMatrix = {
+    val dup = this.copy
+    dup.operateInPlace(f, y)
+  }
+
+  def map(f: Double => Double) = new 

[GitHub] spark pull request: [Docs] Fix outdated docs for standalone cluste...

2014-09-19 Thread andrewor14
GitHub user andrewor14 opened a pull request:

https://github.com/apache/spark/pull/2461

[Docs] Fix outdated docs for standalone cluster

This is now supported!

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/andrewor14/spark document-standalone-cluster

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/2461.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2461


commit 35e30eee13786b7743820145a121ccef176d627b
Author: Andrew Or andrewo...@gmail.com
Date:   2014-09-19T18:26:07Z

Fix outdated docs for standalone cluster




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [WIP][SPARK-1486][MLlib] Multi Model Training ...

2014-09-19 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/2451#discussion_r17800699
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala 
---
@@ -57,13 +250,709 @@ trait Matrix extends Serializable {
  * @param numCols number of columns
  * @param values matrix entries in column major
  */
-class DenseMatrix(val numRows: Int, val numCols: Int, val values: Array[Double]) extends Matrix {
+class DenseMatrix(val numRows: Int, val numCols: Int, val values: Array[Double]) extends Matrix with Serializable {
 
-  require(values.length == numRows * numCols)
+  require(values.length == numRows * numCols, "The number of values supplied doesn't match the " +
+    s"size of the matrix! values.length: ${values.length}, numRows * numCols: ${numRows * numCols}")
 
   override def toArray: Array[Double] = values
 
-  private[mllib] override def toBreeze: BM[Double] = new BDM[Double](numRows, numCols, values)
+  private[mllib] def toBreeze: BM[Double] = new BDM[Double](numRows, numCols, values)
+
+  private[mllib] def apply(i: Int): Double = values(i)
+
+  private[mllib] def apply(i: Int, j: Int): Double = values(index(i, j))
+
+  private[mllib] def index(i: Int, j: Int): Int = i + numRows * j
+
+  private[mllib] def update(i: Int, j: Int, v: Double): Unit = {
+    values(index(i, j)) = v
+  }
+
+  override def copy = new DenseMatrix(numRows, numCols, values.clone())
+
+  private[mllib] def elementWiseOperateOnColumnsInPlace(
+      f: (Double, Double) => Double,
+      y: Matrix): DenseMatrix = {
+    val y_vals = y.toArray
+    val len = y_vals.length
+    require(y_vals.length == numRows)
+    var j = 0
+    while (j < numCols){
+      var i = 0
+      while (i < len){
+        val idx = index(i, j)
+        values(idx) = f(values(idx), y_vals(i))
+        i += 1
+      }
+      j += 1
+    }
+    this
+  }
+
+  private[mllib] def elementWiseOperateOnRowsInPlace(
+      f: (Double, Double) => Double,
+      y: Matrix): DenseMatrix = {
+    val y_vals = y.toArray
+    require(y_vals.length == numCols)
+    var j = 0
+    while (j < numCols){
+      var i = 0
+      while (i < numRows){
+        val idx = index(i, j)
+        values(idx) = f(values(idx), y_vals(j))
+        i += 1
+      }
+      j += 1
+    }
+    this
+  }
+
+  private[mllib] def elementWiseOperateInPlace(f: (Double, Double) => Double, y: Matrix): DenseMatrix =  {
+    val y_val = y.toArray
+    val len = values.length
+    require(y_val.length == values.length)
+    var j = 0
+    while (j < len){
+      values(j) = f(values(j), y_val(j))
+      j += 1
+    }
+    this
+  }
+
+  private[mllib] def elementWiseOperateScalarInPlace(f: (Double, Double) => Double, y: Double): DenseMatrix =  {
+    var j = 0
+    val len = values.length
+    while (j < len){
+      values(j) = f(values(j), y)
+      j += 1
+    }
+    this
+  }
+
+  private[mllib] def operateInPlace(f: (Double, Double) => Double, y: Matrix): DenseMatrix = {
+    if (y.numCols==1 || y.numRows == 1){
+      require(numCols != numRows, "Operation is ambiguous. Please use elementWiseOperateOnRows " +
+        "or elementWiseOperateOnColumns instead")
+    }
+    if (y.numCols == 1 && y.numRows == 1){
+      elementWiseOperateScalarInPlace(f, y.toArray(0))
+    } else {
+      if (y.numCols==1) {
+        elementWiseOperateOnColumnsInPlace(f, y)
+      }else if (y.numRows==1){
+        elementWiseOperateOnRowsInPlace(f, y)
+      }else{
+        elementWiseOperateInPlace(f, y)
+      }
+    }
+  }
+
+  private[mllib] def elementWiseOperateOnColumns(f: (Double, Double) => Double, y: Matrix): DenseMatrix = {
+    val dup = this.copy
+    dup.elementWiseOperateOnColumnsInPlace(f, y)
+  }
+
+  private[mllib] def elementWiseOperateOnRows(f: (Double, Double) => Double, y: Matrix): DenseMatrix = {
+    val dup = this.copy
+    dup.elementWiseOperateOnRowsInPlace(f, y)
+  }
+
+  private[mllib] def elementWiseOperate(f: (Double, Double) => Double, y: Matrix): DenseMatrix =  {
+    val dup = this.copy
+    dup.elementWiseOperateInPlace(f, y)
+  }
+
+  private[mllib] def elementWiseOperateScalar(f: (Double, Double) => Double, y: Double): DenseMatrix =  {
+    val dup = this.copy
+    dup.elementWiseOperateScalarInPlace(f, y)
+  }
+
+  private[mllib] def operate(f: (Double, Double) => Double, y: Matrix): DenseMatrix = {
+    val dup = this.copy
+    dup.operateInPlace(f, y)
+  }
+
+  def map(f: Double => Double) = new 

[GitHub] spark pull request: [WIP][SPARK-1486][MLlib] Multi Model Training ...

2014-09-19 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/2451#discussion_r17800687
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala 
---
@@ -37,11 +44,197 @@ trait Matrix extends Serializable {
   private[mllib] def toBreeze: BM[Double]
 
   /** Gets the (i, j)-th element. */
-  private[mllib] def apply(i: Int, j: Int): Double = toBreeze(i, j)
+  private[mllib] def apply(i: Int, j: Int): Double
+
+  /** Return the index for the (i, j)-th element in the backing array. */
+  private[mllib] def index(i: Int, j: Int): Int
+
+  /** Update element at (i, j) */
+  private[mllib] def update(i: Int, j: Int, v: Double): Unit
+
+  /** Get a deep copy of the matrix. */
+  def copy: Matrix
 
+  /** Convenience method for `Matrix`-`Matrix` multiplication.
+    * Note: `SparseMatrix`-`SparseMatrix` multiplication is not supported */
+  def multiply(y: Matrix): DenseMatrix = {
+    val C: DenseMatrix = DenseMatrix.zeros(numRows, y.numCols)
+    BLAS.gemm(false, false, 1.0, this, y, 0.0, C)
+    C
+  }
+
+  /** Convenience method for `Matrix`-`DenseVector` multiplication. */
+  def multiply(y: DenseVector): DenseVector = {
+    val output = new DenseVector(new Array[Double](numRows))
+    BLAS.gemv(1.0, this, y, 0.0, output)
+    output
+  }
+
+  /** Convenience method for `Matrix`^T^-`Matrix` multiplication.
+    * Note: `SparseMatrix`-`SparseMatrix` multiplication is not supported */
+  def transposeMultiply(y: Matrix): DenseMatrix = {
--- End diff --

How hard would it be to have matrices store a transpose bit indicating whether 
they are transposed (without the data being moved)?  I envision:
* a transpose() function which sets this bit (so transpose is a lazy operation)
* eliminating transposeMultiply
* perhaps including a transposePhysical or transpose(physical: Boolean) method which forces data movement
I'm also OK with adding that support later on. A rough sketch of what I mean is below.
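
Illustration only, not code from this PR; the names (LazyTransposeMatrix, isTransposed) are made up:

// Column-major dense matrix with a lazy transpose flag: transpose() just
// flips the flag and swaps the dimensions, no data is moved.
class LazyTransposeMatrix(
    val numRows: Int,
    val numCols: Int,
    val values: Array[Double],
    val isTransposed: Boolean = false) {

  require(values.length == numRows * numCols)

  // (i, j) of the logical matrix; when the flag is set, the backing array
  // holds the transpose, so the index formula is swapped instead of copying data.
  def apply(i: Int, j: Int): Double =
    if (isTransposed) values(j + numCols * i) else values(i + numRows * j)

  // Lazy transpose: O(1), shares the backing array.
  def transpose: LazyTransposeMatrix =
    new LazyTransposeMatrix(numCols, numRows, values, !isTransposed)

  // Physical transpose: actually rearranges the data into column-major order,
  // for callers (e.g. BLAS) that need a contiguous, un-flagged layout.
  def transposePhysical: LazyTransposeMatrix = {
    val out = new Array[Double](values.length)
    var q = 0
    while (q < numRows) {
      var p = 0
      while (p < numCols) {
        out(p + numCols * q) = apply(q, p)
        p += 1
      }
      q += 1
    }
    new LazyTransposeMatrix(numCols, numRows, out)
  }
}

object LazyTransposeDemo {
  def main(args: Array[String]): Unit = {
    // 2 x 3 matrix [[1, 3, 5], [2, 4, 6]] stored column-major.
    val a = new LazyTransposeMatrix(2, 3, Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0))
    val at = a.transpose   // 3 x 2 view, nothing copied
    println(at(0, 1))      // 2.0, i.e. a(1, 0)
  }
}

With the flag, gemm/gemv could branch on isTransposed internally, so transposeMultiply would not need to exist as a separate entry point.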


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [WIP][SPARK-1486][MLlib] Multi Model Training ...

2014-09-19 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/2451#discussion_r17800735
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala 
---
@@ -57,13 +250,709 @@ trait Matrix extends Serializable {
  * @param numCols number of columns
  * @param values matrix entries in column major
  */
-class DenseMatrix(val numRows: Int, val numCols: Int, val values: Array[Double]) extends Matrix {
+class DenseMatrix(val numRows: Int, val numCols: Int, val values: Array[Double]) extends Matrix with Serializable {
 
-  require(values.length == numRows * numCols)
+  require(values.length == numRows * numCols, "The number of values supplied doesn't match the " +
+    s"size of the matrix! values.length: ${values.length}, numRows * numCols: ${numRows * numCols}")
 
   override def toArray: Array[Double] = values
 
-  private[mllib] override def toBreeze: BM[Double] = new BDM[Double](numRows, numCols, values)
+  private[mllib] def toBreeze: BM[Double] = new BDM[Double](numRows, numCols, values)
+
+  private[mllib] def apply(i: Int): Double = values(i)
+
+  private[mllib] def apply(i: Int, j: Int): Double = values(index(i, j))
+
+  private[mllib] def index(i: Int, j: Int): Int = i + numRows * j
+
+  private[mllib] def update(i: Int, j: Int, v: Double): Unit = {
+    values(index(i, j)) = v
+  }
+
+  override def copy = new DenseMatrix(numRows, numCols, values.clone())
+
+  private[mllib] def elementWiseOperateOnColumnsInPlace(
+      f: (Double, Double) => Double,
+      y: Matrix): DenseMatrix = {
+    val y_vals = y.toArray
+    val len = y_vals.length
+    require(y_vals.length == numRows)
+    var j = 0
+    while (j < numCols){
+      var i = 0
+      while (i < len){
+        val idx = index(i, j)
+        values(idx) = f(values(idx), y_vals(i))
+        i += 1
+      }
+      j += 1
+    }
+    this
+  }
+
+  private[mllib] def elementWiseOperateOnRowsInPlace(
+      f: (Double, Double) => Double,
+      y: Matrix): DenseMatrix = {
+    val y_vals = y.toArray
+    require(y_vals.length == numCols)
+    var j = 0
+    while (j < numCols){
+      var i = 0
+      while (i < numRows){
+        val idx = index(i, j)
+        values(idx) = f(values(idx), y_vals(j))
+        i += 1
+      }
+      j += 1
+    }
+    this
+  }
+
+  private[mllib] def elementWiseOperateInPlace(f: (Double, Double) => Double, y: Matrix): DenseMatrix =  {
+    val y_val = y.toArray
+    val len = values.length
+    require(y_val.length == values.length)
+    var j = 0
+    while (j < len){
+      values(j) = f(values(j), y_val(j))
+      j += 1
+    }
+    this
+  }
+
+  private[mllib] def elementWiseOperateScalarInPlace(f: (Double, Double) => Double, y: Double): DenseMatrix =  {
+    var j = 0
+    val len = values.length
+    while (j < len){
+      values(j) = f(values(j), y)
+      j += 1
+    }
+    this
+  }
+
+  private[mllib] def operateInPlace(f: (Double, Double) => Double, y: Matrix): DenseMatrix = {
+    if (y.numCols==1 || y.numRows == 1){
+      require(numCols != numRows, "Operation is ambiguous. Please use elementWiseOperateOnRows " +
+        "or elementWiseOperateOnColumns instead")
+    }
+    if (y.numCols == 1 && y.numRows == 1){
+      elementWiseOperateScalarInPlace(f, y.toArray(0))
+    } else {
+      if (y.numCols==1) {
+        elementWiseOperateOnColumnsInPlace(f, y)
+      }else if (y.numRows==1){
+        elementWiseOperateOnRowsInPlace(f, y)
+      }else{
+        elementWiseOperateInPlace(f, y)
+      }
+    }
+  }
+
+  private[mllib] def elementWiseOperateOnColumns(f: (Double, Double) => Double, y: Matrix): DenseMatrix = {
+    val dup = this.copy
+    dup.elementWiseOperateOnColumnsInPlace(f, y)
+  }
+
+  private[mllib] def elementWiseOperateOnRows(f: (Double, Double) => Double, y: Matrix): DenseMatrix = {
+    val dup = this.copy
+    dup.elementWiseOperateOnRowsInPlace(f, y)
+  }
+
+  private[mllib] def elementWiseOperate(f: (Double, Double) => Double, y: Matrix): DenseMatrix =  {
+    val dup = this.copy
+    dup.elementWiseOperateInPlace(f, y)
+  }
+
+  private[mllib] def elementWiseOperateScalar(f: (Double, Double) => Double, y: Double): DenseMatrix =  {
+    val dup = this.copy
+    dup.elementWiseOperateScalarInPlace(f, y)
+  }
+
+  private[mllib] def operate(f: (Double, Double) => Double, y: Matrix): DenseMatrix = {
+    val dup = this.copy
+    dup.operateInPlace(f, y)
+  }
+
+  def map(f: Double => Double) = new 

[GitHub] spark pull request: [MLLib] Fix example code variable name misspel...

2014-09-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2459#issuecomment-56216041
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20568/consoleFull)
 for   PR 2459 at commit 
[`b370a91`](https://github.com/apache/spark/commit/b370a919451ca7e8c1b3eec1b35b941e48571717).
 * This patch **passes** unit tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `sealed trait Matrix extends Serializable `
  * `class SparseMatrix(`
  * `sealed trait Vector extends Serializable `



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3446] Expose underlying job ids in Futu...

2014-09-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2337#issuecomment-56216044
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20583/consoleFull)
 for   PR 2337 at commit 
[`e166a68`](https://github.com/apache/spark/commit/e166a680575ae96032d7ca03aba4566105cdb388).
 * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3535][Mesos] Fix resource handling.

2014-09-19 Thread vanzin
Github user vanzin commented on the pull request:

https://github.com/apache/spark/pull/2401#issuecomment-56216271
  
So, I'm a little disappointed that this doesn't at least follow the Yarn 
model of one setting that defines the overhead. Instead, it has two settings, 
one for a fraction and one to define some minimum if the fraction is somehow 
less than that. That sounds too complicated.

What's the argument against Yarn's model of a single setting with an 
absolute overhead value? That doesn't require the user to do math, and makes 
things easier when for some reason the user requires lots of overhead (e.g. 
large usage of off-heap memory) that is not necessarily related to the heap 
size.
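
For comparison, a rough sketch of how the two models compute the container request (illustration only; the numbers are made up and this is not the PR's actual code or setting names):

object OverheadModels {
  // YARN-style: one setting, an absolute overhead in MB.
  def absoluteModel(executorMemoryMB: Int, overheadMB: Int): Int =
    executorMemoryMB + overheadMB

  // Fraction-plus-minimum: overhead is a fraction of the heap, floored at a minimum.
  def fractionalModel(executorMemoryMB: Int, fraction: Double, minOverheadMB: Int): Int =
    executorMemoryMB + math.max((executorMemoryMB * fraction).toInt, minOverheadMB)

  def main(args: Array[String]): Unit = {
    println(absoluteModel(8192, 512))        // 8704 MB requested
    println(fractionalModel(8192, 0.1, 384)) // 8192 + max(819, 384) = 9011 MB
    println(fractionalModel(2048, 0.1, 384)) // 2048 + max(204, 384) = 2432 MB
  }
}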


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [WIP][SPARK-1486][MLlib] Multi Model Training ...

2014-09-19 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/2451#discussion_r17801072
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala 
---
@@ -37,11 +44,197 @@ trait Matrix extends Serializable {
   private[mllib] def toBreeze: BM[Double]
 
   /** Gets the (i, j)-th element. */
-  private[mllib] def apply(i: Int, j: Int): Double = toBreeze(i, j)
+  private[mllib] def apply(i: Int, j: Int): Double
+
+  /** Return the index for the (i, j)-th element in the backing array. */
+  private[mllib] def index(i: Int, j: Int): Int
+
+  /** Update element at (i, j) */
+  private[mllib] def update(i: Int, j: Int, v: Double): Unit
+
+  /** Get a deep copy of the matrix. */
+  def copy: Matrix
 
+  /** Convenience method for `Matrix`-`Matrix` multiplication.
+    * Note: `SparseMatrix`-`SparseMatrix` multiplication is not supported */
--- End diff --

Just wondering (not sure myself): which is preferred:
`SparseMatrix`
or
[[SparseMatrix]]
in docs?
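
For what it's worth, my understanding (please correct me if wrong) is that backticks only switch to code font, while the double-bracket form asks Scaladoc to resolve and link the symbol, e.g.:

/** Illustration only.
  *
  * `SparseMatrix` renders in code font but is not linked.
  * [[org.apache.spark.mllib.linalg.Matrices]] is resolved and turned into a link
  * (and Scaladoc warns if the target cannot be found).
  */
def example(xs: List[Int]): Int = xs.sum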


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3599]Avoid loaing properties file frequ...

2014-09-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2454#issuecomment-56216397
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20570/consoleFull)
 for   PR 2454 at commit 
[`2a79f26`](https://github.com/apache/spark/commit/2a79f26497f9232465aa2e9b496b0d54b9ccda75).
 * This patch **passes** unit tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `sealed trait Matrix extends Serializable `
  * `class SparseMatrix(`
  * `sealed trait Vector extends Serializable `



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [WIP][SPARK-1486][MLlib] Multi Model Training ...

2014-09-19 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/2451#issuecomment-56216573
  
Could the methods be ordered in the file (grouped by public, private[mllib], private, etc.)?
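
Something like the following skeleton, for instance (illustration only, with made-up member names, not a proposed diff):

package org.apache.spark.mllib.linalg

class OrderedMatrixSkeleton(val numRows: Int, val numCols: Int, val values: Array[Double]) {
  // 1. Public API first.
  def toArray: Array[Double] = values
  def copy: OrderedMatrixSkeleton = new OrderedMatrixSkeleton(numRows, numCols, values.clone())

  // 2. private[mllib] helpers next.
  private[mllib] def index(i: Int, j: Int): Int = i + numRows * j
  private[mllib] def apply(i: Int, j: Int): Double = values(index(i, j))

  // 3. Truly private members last.
  private def checkSize(): Unit = require(values.length == numRows * numCols)
}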


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [WIP][SPARK-1486][MLlib] Multi Model Training ...

2014-09-19 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/2451#discussion_r17801264
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala 
---
@@ -57,13 +250,709 @@ trait Matrix extends Serializable {
  * @param numCols number of columns
  * @param values matrix entries in column major
  */
-class DenseMatrix(val numRows: Int, val numCols: Int, val values: 
Array[Double]) extends Matrix {
+class DenseMatrix(val numRows: Int, val numCols: Int, val values: 
Array[Double]) extends Matrix with Serializable {
--- End diff --

long line


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [WIP][SPARK-1486][MLlib] Multi Model Training ...

2014-09-19 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/2451#issuecomment-56216806
  
Also, is it odd that the user can't access the matrix data except via toArray (or maybe via side effects of the function given to map)?
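
For illustration, a public element accessor would be straightforward to add (hypothetical, not part of the diff above, where apply(i, j) is private[mllib]):

class PublicApplyMatrix(val numRows: Int, val numCols: Int, val values: Array[Double]) {
  require(values.length == numRows * numCols)

  private def index(i: Int, j: Int): Int = i + numRows * j

  /** Public (i, j) accessor over the column-major backing array. */
  def apply(i: Int, j: Int): Double = values(index(i, j))
}

// Usage:
//   val m = new PublicApplyMatrix(2, 2, Array(1.0, 2.0, 3.0, 4.0))
//   m(0, 1)  // 3.0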


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3535][Mesos] Fix resource handling.

2014-09-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2401#issuecomment-56216747
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20574/consoleFull)
 for   PR 2401 at commit 
[`56988e3`](https://github.com/apache/spark/commit/56988e31363bc07dc8acb369bdaade6b18b98f51).
 * This patch **passes** unit tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `sealed trait Matrix extends Serializable `
  * `class SparseMatrix(`
  * `sealed trait Vector extends Serializable `



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3377] [Metrics] Metrics can be accident...

2014-09-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2432#issuecomment-56216738
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20575/consoleFull)
 for   PR 2432 at commit 
[`4a93c7f`](https://github.com/apache/spark/commit/4a93c7f7da8d829a8837f3a31aff0f08355e0c5a).
 * This patch **passes** unit tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `sealed trait Matrix extends Serializable `
  * `class SparseMatrix(`
  * `sealed trait Vector extends Serializable `



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3491] [MLlib] [PySpark] use pickle to s...

2014-09-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2378#issuecomment-56216817
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20576/consoleFull)
 for   PR 2378 at commit 
[`dffbba2`](https://github.com/apache/spark/commit/dffbba2ba206bbbd3dfc740a55f1b0df341860e7).
 * This patch **passes** unit tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3535][Mesos] Fix resource handling.

2014-09-19 Thread brndnmtthws
Github user brndnmtthws commented on the pull request:

https://github.com/apache/spark/pull/2401#issuecomment-56216904
  
I thought there was some desire to have the same thing in #1391 as well?

Furthermore, from my experience writing frameworks, I think a much better 
model is the fractional overhead (relative to the heap size), for the reasons I 
mentioned above.  If you do some internet searching, you'll see that I've been 
doing quite a bit of this for a while.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


