[GitHub] spark pull request: [SPARK-3536][SQL] SELECT on empty parquet tabl...
GitHub user ravipesala opened a pull request: https://github.com/apache/spark/pull/2456 [SPARK-3536][SQL] SELECT on empty parquet table throws exception Parquet returns null metadata when querying an empty parquet file while calculating splits, so a null check was added that returns empty splits. Author: ravipesala ravindra.pes...@huawei.com You can merge this pull request into a Git repository by running: $ git pull https://github.com/ravipesala/spark SPARK-3536 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/2456.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2456 commit 1e81a50631b1f44ad7de65b83408a40218234745 Author: ravipesala ravindra.pes...@huawei.com Date: 2014-09-18T18:02:46Z Fixed the issue when querying on empty parquet file. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
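A minimal Java sketch of the guard the PR describes (illustrative names, not Spark's actual Parquet API): when the metadata for an empty table comes back null, return an empty split list instead of dereferencing the null.

```java
import java.util.Collections;
import java.util.List;

class EmptyTableGuard {
    // 'footers' stands in for the per-file metadata Parquet returns;
    // it can be null for an empty table.
    static List<String> getSplits(List<String> footers) {
        if (footers == null) {
            // Empty table: no splits, and no NullPointerException downstream.
            return Collections.emptyList();
        }
        return footers;
    }
}
```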
[GitHub] spark pull request: [SPARK-2062][GraphX] VertexRDD.apply does not ...
Github user ankurdave commented on the pull request: https://github.com/apache/spark/pull/1903#issuecomment-56140430 Thanks! Merged into master and branch-1.1.
[GitHub] spark pull request: [SPARK-2062][GraphX] VertexRDD.apply does not ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/1903
[GitHub] spark pull request: Modify default YARN memory_overhead-- from an ...
Github user nishkamravi2 commented on the pull request: https://github.com/apache/spark/pull/1391#issuecomment-56142506 @sryza Thanks Sandy. Will do.
[GitHub] spark pull request: [SPARK-3250] Implement Gap Sampling optimizati...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/2455#issuecomment-56144570 add to whitelist
[GitHub] spark pull request: [SPARK-3250] Implement Gap Sampling optimizati...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/2455#issuecomment-56144582 this is ok to test
[GitHub] spark pull request: [SPARK-3491] [MLlib] [PySpark] use pickle to s...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/2378#issuecomment-56147622 @davies Does `PickleSerializer` compress data? If not, maybe we should cache the deserialized RDD instead of the one from `_.reserialize`. They have the same storage. I understand that batch-serialization can help GC. But algorithms like linear methods should only allocate short-lived objects. Is batch-serialization worth the tradeoff?
[GitHub] spark pull request: [SPARK-1987] EdgePartitionBuilder: More memory...
Github user ankurdave commented on the pull request: https://github.com/apache/spark/pull/2446#issuecomment-56151121 Jenkins, this is ok to test.
[GitHub] spark pull request: [SPARK-3268][SQL] DoubleType, FloatType and De...
GitHub user gvramana opened a pull request: https://github.com/apache/spark/pull/2457 [SPARK-3268][SQL] DoubleType, FloatType and DecimalType modulus support Supported modulus operation using % operator on fractional datatypes FloatType, DoubleType and DecimalType Example: SELECT 1388632775.0 % 60 from tablename LIMIT 1 Author: Venkata Ramana Gollamudi ramana.gollam...@huawei.com You can merge this pull request into a Git repository by running: $ git pull https://github.com/gvramana/spark double_modulus_support Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/2457.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2457 commit 296d2539c0d745d0450441997390052352b8731d Author: Venkata Ramana Gollamudi ramana.gollam...@huawei.com Date: 2014-09-18T11:06:10Z modified to add modulus support to fractional types float,double,decimal commit 513d0e0ce4fdaf3faf11698d4ea079c79538f402 Author: Venkata Ramana Gollamudi ramana.gollam...@huawei.com Date: 2014-09-18T11:06:10Z modified to add modulus support to fractional types float,double,decimal commit e112c09ccf0be8354afe3359a4d3e18c6346475c Author: Venkata Ramana Gollamudi ramana.gollam...@huawei.com Date: 2014-09-19T07:47:35Z corrected the testcase commit 3624471e5b65ccb92fb84d7de9303669ec79965e Author: Venkata Ramana Gollamudi ramana.gollam...@huawei.com Date: 2014-09-19T08:01:25Z modified testcase
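For reference, the PR's own example query can be checked on the JVM directly, since `%` is already defined for `double` and `BigDecimal` has `remainder` (this is a sketch of the underlying arithmetic, not of Spark's expression evaluation code):

```java
import java.math.BigDecimal;

class FractionalModulus {
    // Modulus on DoubleType/FloatType: the JVM '%' remainder operator
    // is defined for fractional primitives.
    static double mod(double a, double b) {
        return a % b;
    }

    // Modulus on DecimalType: BigDecimal.remainder gives the analogous result.
    static BigDecimal mod(BigDecimal a, BigDecimal b) {
        return a.remainder(b);
    }
}
```

So `SELECT 1388632775.0 % 60` evaluates to 35.0, matching `1388632775.0 % 60` in Java.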
Re: [GitHub] spark pull request: [SPARK-3529] [SQL] Delete the temp files after...
Hm, deleteOnExit should at least not hurt, and I thought it would delete dirs if they are empty, which may be so if the temp files inside never existed or were cleaned up themselves. But yeah, always delete explicitly in the normal execution path, even in the event of normal exceptions. On Sep 19, 2014 3:00 AM, mattf g...@git.apache.org wrote: Github user mattf commented on the pull request: https://github.com/apache/spark/pull/2393#issuecomment-56127248 +1 lgtm fyi, i checked, deleteOnExit isn't an option because it cannot recursively delete
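A sketch of the explicit recursive delete the thread recommends. `File.deleteOnExit()` only removes a file, or a directory that is empty at JVM exit, so a non-empty temp dir has to be walked and deleted manually:

```java
import java.io.File;

class TempCleanup {
    // Delete a file, or a directory and everything under it.
    static boolean deleteRecursively(File file) {
        if (file.isDirectory()) {
            File[] children = file.listFiles();
            if (children != null) {          // null if listing fails
                for (File child : children) {
                    deleteRecursively(child);
                }
            }
        }
        // A plain delete() also fails on non-empty directories,
        // which is why the children must be removed first.
        return file.delete();
    }
}
```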
[GitHub] spark pull request: [SPARK-3597][Mesos] Implement `killTask`.
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/2453#issuecomment-56152819 Sorry for asking - but have you tested this on a real cluster?
[GitHub] spark pull request: [SPARK-3597][Mesos] Implement `killTask`.
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/2453#issuecomment-56152843 Oh and thanks for doing this!
[GitHub] spark pull request: [SPARK-3578] Fix upper bound in GraphGenerator...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/2439#issuecomment-56153581 @jegonzal you should take a look :)
[GitHub] spark pull request: [SPARK-3536][SQL] SELECT on empty parquet tabl...
Github user ravipesala commented on the pull request: https://github.com/apache/spark/pull/2456#issuecomment-56157072 Please review
[GitHub] spark pull request: [SPARK-3598][SQL]cast to timestamp should be t...
GitHub user adrian-wang opened a pull request: https://github.com/apache/spark/pull/2458 [SPARK-3598][SQL] cast to timestamp should be the same as hive This patch fixes timestamps smaller than 0 and casting int as timestamp. `select cast(1000 as timestamp) from src limit 1;` should return 1970-01-01 00:00:01, but we currently take it as 1000 seconds. Also, the current implementation has a bug when the time is before 1970-01-01 00:00:00. @rxin @marmbrus @chenghao-intel You can merge this pull request into a Git repository by running: $ git pull https://github.com/adrian-wang/spark timestamp Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/2458.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2458 commit 1234f666283172b28d5f17904fc3f2f5065a21ca Author: Daoyuan Wang daoyuan.w...@intel.com Date: 2014-09-19T10:11:49Z fix timestamp smaller than 0 and cast int as timestamp
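Reading the description, the expected semantics (an assumption drawn from the PR text, not from Spark's final implementation) are that the integer is milliseconds since the epoch, so 1000 means 00:00:01, and negative values before 1970 must also work. `java.time.Instant` illustrates both:

```java
import java.time.Instant;

class TimestampCast {
    // Interpret an integer as milliseconds since 1970-01-01T00:00:00Z.
    // Instant handles negative (pre-1970) values correctly, which is
    // the second bug the PR mentions.
    static Instant fromMillis(long millis) {
        return Instant.ofEpochMilli(millis);
    }
}
```

With this reading, `fromMillis(1000)` is 1970-01-01T00:00:01Z and `fromMillis(-1000)` is 1969-12-31T23:59:59Z.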
[GitHub] spark pull request: [SPARK-1888] enhance MEMORY_AND_DISK mode by d...
Github user cloud-fan commented on the pull request: https://github.com/apache/spark/pull/791#issuecomment-56164031 Hi @liyezhang556520 , thanks for pointing this out! I have updated my PR, please review @andrewor14
[GitHub] spark pull request: [SPARK-1888] enhance MEMORY_AND_DISK mode by d...
Github user liyezhang556520 commented on a diff in the pull request: https://github.com/apache/spark/pull/791#discussion_r17781184

--- Diff: core/src/main/scala/org/apache/spark/storage/MemoryStore.scala ---
@@ -239,18 +250,18 @@ private[spark] class MemoryStore(blockManager: BlockManager, maxMemory: Long)
         val currentSize = vector.estimateSize()
         if (currentSize >= memoryThreshold) {
           val amountToRequest = (currentSize * memoryGrowthFactor - memoryThreshold).toLong
-          // Hold the accounting lock, in case another thread concurrently puts a block that
-          // takes up the unrolling space we just ensured here
-          accountingLock.synchronized {
-            if (!reserveUnrollMemoryForThisThread(amountToRequest)) {
-              // If the first request is not granted, try again after ensuring free space
-              // If there is still not enough space, give up and drop the partition
-              val spaceToEnsure = maxUnrollMemory - currentUnrollMemory
-              if (spaceToEnsure > 0) {
-                val result = ensureFreeSpace(blockId, spaceToEnsure)
-                droppedBlocks ++= result.droppedBlocks
+          if (!reserveUnrollMemoryForThisThread(amountToRequest)) {
+            val spaceToEnsure = maxUnrollMemory - currentUnrollMemory
+            if (spaceToEnsure > 0) {
+              val task = planFreeSpace(blockId, spaceToEnsure)
--- End diff --

Hi @cloud-fan , you removed `accountingLock.synchronized` here, so more than one thread can call `planFreeSpace` at the same time to reserve memory, and each thread will ask for memory of size `maxUnrollMemory - currentUnrollMemory`. I think this logic is not the same as the original intention. Second question: what if `maxUnrollMemory` is large (`maxMemory * unrollFraction` might be dozens of GB), while the requested memory `amountToRequest` is small (maybe dozens of MB)? Then only one thread frees the whole `spaceToEnsure`, which doesn't seem to solve the IO issue. Third, since you lazily drop the to-be-dropped blocks, how can you avoid the OOM that @andrewor14 pointed out (the putting speed being faster than the dropping speed)?
Do these three problems exist in the current patch? Maybe I missed something.
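The reviewer's first concern is a classic check-then-act race. A minimal sketch (illustrative names, not Spark's): without one lock held across the free-space check and the reservation, two threads can both pass the check and over-reserve, which is what the removed `accountingLock.synchronized` prevented.

```java
class UnrollAccounting {
    private long freeBytes;

    UnrollAccounting(long freeBytes) {
        this.freeBytes = freeBytes;
    }

    // Check and update are atomic because both happen under the
    // object's monitor; dropping 'synchronized' reintroduces the race.
    synchronized boolean reserve(long amount) {
        if (amount <= freeBytes) {
            freeBytes -= amount;
            return true;
        }
        return false;
    }
}
```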
[GitHub] spark pull request: [SPARK-3584] sbin/slaves doesn't work when we ...
Github user mattf commented on the pull request: https://github.com/apache/spark/pull/2444#issuecomment-56172067 +1 lgtm
[GitHub] spark pull request: [SPARK-3584] sbin/slaves doesn't work when we ...
Github user mattf commented on a diff in the pull request: https://github.com/apache/spark/pull/2444#discussion_r17782052

--- Diff: sbin/slaves.sh ---
@@ -67,20 +69,26 @@ fi
 if [ "$HOSTLIST" = "" ]; then
   if [ "$SPARK_SLAVES" = "" ]; then
-    export HOSTLIST="${SPARK_CONF_DIR}/slaves"
+    if [ -f "${SPARK_CONF_DIR}/slaves" ]; then
+      HOSTLIST=`cat "${SPARK_CONF_DIR}/slaves"`
+    else
+      HOSTLIST=localhost
+    fi
   else
-    export HOSTLIST="${SPARK_SLAVES}"
+    HOSTLIST=`cat "${SPARK_SLAVES}"`
--- End diff --

thanks for pointing that out. i didn't read closely enough.
[GitHub] spark pull request: Modify default YARN memory_overhead-- from an ...
Github user tgravescs commented on the pull request: https://github.com/apache/spark/pull/1391#issuecomment-56174217 @mridulm any comments? I'm ok with it if it's a consistent problem for users. One thing we definitely need to do is document it, and possibly look at including better log and error messages. We should at least log the size of the overhead it calculates. It would also be nice to log it when we fail to get a large enough container, or when it fails because the cluster max allocation limit was hit.
[GitHub] spark pull request: [SPARK-3477] Clean up code in Yarn Client / Cl...
Github user tgravescs commented on a diff in the pull request: https://github.com/apache/spark/pull/2350#discussion_r17785127

--- Diff: yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala ---
@@ -37,154 +36,106 @@ import org.apache.hadoop.yarn.api.protocolrecords._
 import org.apache.hadoop.yarn.api.records._
 import org.apache.hadoop.yarn.conf.YarnConfiguration
 import org.apache.hadoop.yarn.util.Records
+
 import org.apache.spark.{Logging, SecurityManager, SparkConf, SparkContext, SparkException}

 /**
- * The entry point (starting in Client#main() and Client#run()) for launching Spark on YARN. The
- * Client submits an application to the YARN ResourceManager.
+ * The entry point (starting in Client#main() and Client#run()) for launching Spark on YARN.
+ * The Client submits an application to the YARN ResourceManager.
  */
-trait ClientBase extends Logging {
-  val args: ClientArguments
-  val conf: Configuration
-  val sparkConf: SparkConf
-  val yarnConf: YarnConfiguration
-  val credentials = UserGroupInformation.getCurrentUser().getCredentials()
-  private val SPARK_STAGING: String = ".sparkStaging"
+private[spark] trait ClientBase extends Logging {
+  import ClientBase._
+
+  protected val args: ClientArguments
+  protected val hadoopConf: Configuration
+  protected val sparkConf: SparkConf
+  protected val yarnConf: YarnConfiguration
+  protected val credentials = UserGroupInformation.getCurrentUser.getCredentials
+  protected val amMemoryOverhead = args.amMemoryOverhead // MB
+  protected val executorMemoryOverhead = args.executorMemoryOverhead // MB
   private val distCacheMgr = new ClientDistributedCacheManager()

-  // Staging directory is private! -> rwx--------
-  val STAGING_DIR_PERMISSION: FsPermission =
-    FsPermission.createImmutable(Integer.parseInt("700", 8).toShort)
-  // App files are world-wide readable and owner writable -> rw-r--r--
-  val APP_FILE_PERMISSION: FsPermission =
-    FsPermission.createImmutable(Integer.parseInt("644", 8).toShort)
-
-  // Additional memory overhead - in mb.
-  protected def memoryOverhead: Int = sparkConf.getInt("spark.yarn.driver.memoryOverhead",
-    YarnSparkHadoopUtil.DEFAULT_MEMORY_OVERHEAD)
-
-  // TODO(harvey): This could just go in ClientArguments.
-  def validateArgs() = {
-    Map(
-      (args.numExecutors <= 0) -> "Error: You must specify at least 1 executor!",
-      (args.amMemory <= memoryOverhead) -> ("Error: AM memory size must be " +
-        "greater than: " + memoryOverhead),
-      (args.executorMemory <= memoryOverhead) -> ("Error: Executor memory size " +
-        "must be greater than: " + memoryOverhead.toString)
-    ).foreach { case(cond, errStr) =>
-      if (cond) {
-        logError(errStr)
-        throw new IllegalArgumentException(args.getUsageMessage())
-      }
-    }
-  }
-
-  def getAppStagingDir(appId: ApplicationId): String = {
-    SPARK_STAGING + Path.SEPARATOR + appId.toString() + Path.SEPARATOR
-  }
-
-  def verifyClusterResources(app: GetNewApplicationResponse) = {
-    val maxMem = app.getMaximumResourceCapability().getMemory()
-    logInfo("Max mem capabililty of a single resource in this cluster " + maxMem)
-
-    // If we have requested more then the clusters max for a single resource then exit.
-    if (args.executorMemory > maxMem) {
-      val errorMessage =
-        "Required executor memory (%d MB), is above the max threshold (%d MB) of this cluster."
-          .format(args.executorMemory, maxMem)
-
-      logError(errorMessage)
-      throw new IllegalArgumentException(errorMessage)
-    }
-    val amMem = args.amMemory + memoryOverhead
+  /**
+   * Fail fast if we have requested more resources per container than is available in the cluster.
+   */
+  protected def verifyClusterResources(newAppResponse: GetNewApplicationResponse): Unit = {
+    val maxMem = newAppResponse.getMaximumResourceCapability().getMemory()
+    logInfo("Verifying our application has not requested more than the maximum " +
+      s"memory capability of the cluster ($maxMem MB per container)")
+    val executorMem = args.executorMemory + executorMemoryOverhead
+    if (executorMem > maxMem) {
+      throw new IllegalArgumentException(s"Required executor memory ($executorMem MB) " +
+        s"is above the max threshold ($maxMem MB) of this cluster!")
+    }
+    val amMem = args.amMemory + amMemoryOverhead
     if (amMem > maxMem) {
-      val errorMessage = "Required AM memory (%d) is above the max threshold (%d) of this cluster."
-        .format(amMem, maxMem)
-      logError(errorMessage)
-      throw new
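The fail-fast check in the diff above boils down to comparing requested memory plus overhead against the cluster's per-container maximum. A hedged Java sketch with simplified, illustrative names (not the actual ClientBase API):

```java
class ResourceCheck {
    // Fail fast if executor or AM memory (plus its overhead) exceeds
    // the cluster's maximum per-container allocation, all in MB.
    static void verify(int maxMem, int executorMem, int executorOverhead,
                       int amMem, int amOverhead) {
        int executorTotal = executorMem + executorOverhead;
        if (executorTotal > maxMem) {
            throw new IllegalArgumentException("Required executor memory ("
                + executorTotal + " MB) is above the max threshold ("
                + maxMem + " MB) of this cluster!");
        }
        int amTotal = amMem + amOverhead;
        if (amTotal > maxMem) {
            throw new IllegalArgumentException("Required AM memory ("
                + amTotal + " MB) is above the max threshold ("
                + maxMem + " MB) of this cluster!");
        }
    }
}
```

Note the check includes the overhead in the requested total, which is one reason the earlier thread on memory_overhead asks for the computed overhead to be logged.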
[GitHub] spark pull request: [SPARK-3477] Clean up code in Yarn Client / Cl...
Github user tgravescs commented on a diff in the pull request: https://github.com/apache/spark/pull/2350#discussion_r17786319

--- Diff: yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala ---
@@ -415,41 +381,153 @@ trait ClientBase extends Logging {
       "1>", ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stdout",
       "2>", ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stderr")
-    logInfo("Yarn AM launch context:")
-    logInfo(s"  user class: ${args.userClass}")
-    logInfo(s"  env: $env")
-    logInfo(s"  command: ${commands.mkString(" ")}")
-
     // TODO: it would be nicer to just make sure there are no null commands here
     val printableCommands = commands.map(s => if (s == null) "null" else s).toList
     amContainer.setCommands(printableCommands)

-    setupSecurityToken(amContainer)
+    logDebug("===============================")
+    logDebug("Yarn AM launch context:")
+    logDebug(s"  user class: ${Option(args.userClass).getOrElse("N/A")}")
+    logDebug("  env:")
+    launchEnv.foreach { case (k, v) => logDebug(s"    $k -> $v") }
+    logDebug("  resources:")
+    localResources.foreach { case (k, v) => logDebug(s"    $k -> $v")}
+    logDebug("  command:")
+    logDebug(s"    ${printableCommands.mkString(" ")}")
+    logDebug("===============================")

     // send the acl settings into YARN to control who has access via YARN interfaces
     val securityManager = new SecurityManager(sparkConf)
     amContainer.setApplicationACLs(YarnSparkHadoopUtil.getApplicationAclsForYarn(securityManager))
+    setupSecurityToken(amContainer)
+    UserGroupInformation.getCurrentUser().addCredentials(credentials)
     amContainer
   }
+
+  /**
+   * Report the state of an application until it has exited, either successfully or
+   * due to some failure, then return the application state.
+   *
--- End diff --

missing the appId param
[GitHub] spark pull request: [MLLib] Fix example code variable name misspel...
GitHub user rnowling opened a pull request: https://github.com/apache/spark/pull/2459 [MLLib] Fix example code variable name misspelling in MLLib Feature Extraction guide You can merge this pull request into a Git repository by running: $ git pull https://github.com/rnowling/spark tfidf-fix Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/2459.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2459 commit b370a919451ca7e8c1b3eec1b35b941e48571717 Author: RJ Nowling rnowl...@gmail.com Date: 2014-09-19T14:09:13Z Fix variable name misspelling in MLLib Feature Extraction guide
[GitHub] spark pull request: [SPARK-3477] Clean up code in Yarn Client / Cl...
Github user tgravescs commented on a diff in the pull request: https://github.com/apache/spark/pull/2350#discussion_r17786722

--- Diff: yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientDistributedCacheManager.scala ---
@@ -19,29 +19,24 @@ package org.apache.spark.deploy.yarn

 import java.net.URI

+import scala.collection.mutable.{HashMap, LinkedHashMap, Map}
+
 import org.apache.hadoop.conf.Configuration
-import org.apache.hadoop.fs.FileStatus
-import org.apache.hadoop.fs.FileSystem
-import org.apache.hadoop.fs.Path
+import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}
 import org.apache.hadoop.fs.permission.FsAction
-import org.apache.hadoop.yarn.api.records.LocalResource
-import org.apache.hadoop.yarn.api.records.LocalResourceVisibility
-import org.apache.hadoop.yarn.api.records.LocalResourceType
+import org.apache.hadoop.yarn.api.records._
--- End diff --

just curious, why change this one to `._` and all the others to `{}`? I'm not sure if we have a standard for that? Generally I go for explicitly listing the ones I'm using.
[GitHub] spark pull request: [SPARK-3477] Clean up code in Yarn Client / Cl...
Github user tgravescs commented on the pull request: https://github.com/apache/spark/pull/2350#issuecomment-56183509 This mostly looks good. A couple of minor comments is all. I do also still want to run through some tests on alpha.
[GitHub] spark pull request: [SPARK-3578] Fix upper bound in GraphGenerator...
Github user rnowling commented on the pull request: https://github.com/apache/spark/pull/2439#issuecomment-56183914 @ankurdave I'd be a bit concerned about how that affects the correctness of the algorithm, especially since this will round every value down when maybe you only want to round the edge case down. Would you give me some time to check the original paper before you commit this?
[GitHub] spark pull request: [SPARK-3597][Mesos] Implement `killTask`.
Github user brndnmtthws commented on the pull request: https://github.com/apache/spark/pull/2453#issuecomment-56184849 I did indeed test it.
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user gss2002 commented on the pull request: https://github.com/apache/spark/pull/2241#issuecomment-56185334 We have been using this fix for a few weeks now against Hive 13. The only outstanding issue I see (and this could be something larger) is that the Spark Thrift service doesn't seem to support hive.server2.enable.doAs = true; it doesn't set the proxy user.
[GitHub] spark pull request: [SPARK-3477] Clean up code in Yarn Client / Cl...
Github user tgravescs commented on a diff in the pull request: https://github.com/apache/spark/pull/2350#discussion_r17788554

--- Diff: yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala ---

@@ -415,41 +381,153 @@ trait ClientBase extends Logging {
       "1>", ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stdout",
       "2>", ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stderr")
-    logInfo("Yarn AM launch context:")
-    logInfo(s"  user class: ${args.userClass}")
-    logInfo(s"  env: $env")
-    logInfo(s"  command: ${commands.mkString(" ")}")

     // TODO: it would be nicer to just make sure there are no null commands here
     val printableCommands = commands.map(s => if (s == null) "null" else s).toList
     amContainer.setCommands(printableCommands)

-    setupSecurityToken(amContainer)
+    logDebug("===============================")
+    logDebug("Yarn AM launch context:")
+    logDebug(s"user class: ${Option(args.userClass).getOrElse("N/A")}")
+    logDebug("env:")
+    launchEnv.foreach { case (k, v) => logDebug(s"$k -> $v") }
+    logDebug("resources:")
+    localResources.foreach { case (k, v) => logDebug(s"$k -> $v") }
+    logDebug("command:")
+    logDebug(s"${printableCommands.mkString(" ")}")
+    logDebug("===============================")

     // send the acl settings into YARN to control who has access via YARN interfaces
     val securityManager = new SecurityManager(sparkConf)
     amContainer.setApplicationACLs(YarnSparkHadoopUtil.getApplicationAclsForYarn(securityManager))
+    setupSecurityToken(amContainer)
+    UserGroupInformation.getCurrentUser().addCredentials(credentials)

     amContainer
   }
+
+  /**
+   * Report the state of an application until it has exited, either successfully or
+   * due to some failure, then return the application state.
+   *
+   * @param returnOnRunning Whether to also return the application state when it is RUNNING.
+   * @param logApplicationReport Whether to log details of the application report every iteration.
+   * @return state of the application, one of FINISHED, FAILED, KILLED, and RUNNING.
+   */
+  def monitorApplication(
+      appId: ApplicationId,
+      returnOnRunning: Boolean = false,
+      logApplicationReport: Boolean = true): YarnApplicationState = {
+    val interval = sparkConf.getLong("spark.yarn.report.interval", 1000)
+    var lastState: YarnApplicationState = null
+    while (true) {
+      Thread.sleep(interval)
+      val report = getApplicationReport(appId)
+      val state = report.getYarnApplicationState
+
+      if (logApplicationReport) {
+        logInfo(s"Application report from ResourceManager for app ${appId.getId} (state: $state)")

--- End diff --

It seems like we wouldn't need the "from ResourceManager" here. Also, could we put the full application id here instead of just the last bit? It's much easier to copy and paste if the user wants to grab it and use it in a yarn command or the UI.
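The quoted monitorApplication loop polls the ResourceManager on an interval until the app reaches a terminal state (or RUNNING, if requested). A minimal Python stand-in for that polling pattern (the function and simulated report states here are hypothetical illustrations, not Spark API; the real loop also sleeps between polls):

```python
TERMINAL = {"FINISHED", "FAILED", "KILLED"}

def monitor_application(get_report, return_on_running=False):
    """Poll get_report() until the app reaches a terminal state,
    or RUNNING when return_on_running is set; return that state."""
    for state in iter(get_report, None):
        if state in TERMINAL or (return_on_running and state == "RUNNING"):
            return state

# Simulated sequence of report states from the ResourceManager.
states = iter(["ACCEPTED", "RUNNING", "RUNNING", "FINISHED"])
print(monitor_application(lambda: next(states)))         # FINISHED
states2 = iter(["ACCEPTED", "RUNNING"])
print(monitor_application(lambda: next(states2), True))  # RUNNING
```

The `returnOnRunning` flag lets a client (e.g. yarn-client mode) hand control back once the AM is up, while cluster mode waits for a terminal state.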
[GitHub] spark pull request: [SPARK-2365] Add IndexedRDD, an efficient upda...
Github user squito commented on a diff in the pull request: https://github.com/apache/spark/pull/1297#discussion_r17790219

--- Diff: core/src/main/scala/org/apache/spark/rdd/IndexedRDDLike.scala ---

@@ -0,0 +1,338 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.rdd
+
+import scala.collection.immutable.LongMap
+import scala.language.higherKinds
+import scala.reflect.ClassTag
+
+import org.apache.spark._
+import org.apache.spark.SparkContext._
+import org.apache.spark.storage.StorageLevel
+
+import IndexedRDD.Id
+
+/**
+ * Contains members that are shared among all variants of IndexedRDD (e.g., IndexedRDD,
+ * VertexRDD).
+ *
+ * @tparam V the type of the values stored in the IndexedRDD
+ * @tparam P the type of the partitions making up the IndexedRDD
+ * @tparam Self the type of the implementing container. This allows transformation methods on any
+ * implementing container to yield a result of the same type.
+ */
+private[spark] trait IndexedRDDLike[
+    @specialized(Long, Int, Double) V,
+    P[X] <: IndexedRDDPartitionLike[X, P],
+    Self[X] <: IndexedRDDLike[X, P, Self]]
+  extends RDD[(Id, V)] {
+
+  /** A generator for ClassTags of the value type V. */
+  protected implicit def vTag: ClassTag[V]
+
+  /** A generator for ClassTags of the partition type P. */
+  protected implicit def pTag[V2]: ClassTag[P[V2]]
+
+  /** Accessor for the IndexedRDD variant that is mixing in this trait. */
+  protected def self: Self[V]
+
+  /** The underlying representation of the IndexedRDD as an RDD of partitions. */
+  def partitionsRDD: RDD[P[V]]
+  require(partitionsRDD.partitioner.isDefined)
+
+  def withPartitionsRDD[V2: ClassTag](partitionsRDD: RDD[P[V2]]): Self[V2]
+
+  override val partitioner = partitionsRDD.partitioner
+
+  override protected def getPartitions: Array[Partition] = partitionsRDD.partitions
+
+  override protected def getPreferredLocations(s: Partition): Seq[String] =
+    partitionsRDD.preferredLocations(s)
+
+  override def persist(newLevel: StorageLevel): this.type = {
+    partitionsRDD.persist(newLevel)
+    this
+  }
+
+  override def unpersist(blocking: Boolean = true): this.type = {
+    partitionsRDD.unpersist(blocking)
+    this
+  }
+
+  override def count(): Long = {
+    partitionsRDD.map(_.size).reduce(_ + _)
+  }
+
+  /** Provides the `RDD[(Id, V)]` equivalent output. */
+  override def compute(part: Partition, context: TaskContext): Iterator[(Id, V)] = {
+    firstParent[P[V]].iterator(part, context).next.iterator
+  }
+
+  /** Gets the value corresponding to the specified key, if any. */
+  def get(k: Id): Option[V] = multiget(Array(k)).get(k)
+
+  /** Gets the values corresponding to the specified keys, if any. */
+  def multiget(ks: Array[Id]): Map[Id, V] = {
+    val ksByPartition = ks.groupBy(k => self.partitioner.get.getPartition(k))
+    val partitions = ksByPartition.keys.toSeq
+    def unionMaps(maps: TraversableOnce[LongMap[V]]): LongMap[V] = {
+      maps.foldLeft(LongMap.empty[V]) {
+        (accum, map) => accum.unionWith(map, (id, a, b) => a)
+      }
+    }
+    // TODO: avoid sending all keys to all partitions by creating and zipping an RDD of keys

--- End diff --

would this be another use of the `bulkMultiget` I suggested in jira?
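The `multiget` in the quoted diff routes each lookup key to its hash partition, then merges the per-partition results. A rough Python sketch of that route-and-merge idea (the dict-based `partitions` list and function names are made up for illustration; Spark would perform the per-partition lookups remotely and merge with `unionMaps`):

```python
def multiget(partitions, num_partitions, ks):
    """Group keys by target partition, look each group up in its
    partition, and merge the results into one map."""
    by_partition = {}
    for k in ks:
        by_partition.setdefault(hash(k) % num_partitions, []).append(k)
    result = {}
    for pid, keys in by_partition.items():
        part = partitions[pid]  # in Spark this would be a job against one partition
        result.update({k: part[k] for k in keys if k in part})
    return result

# Two toy "partitions" whose keys were pre-hashed by k % 2.
parts = [{0: "a", 2: "c"}, {1: "b", 3: "d"}]
print(multiget(parts, 2, [0, 1, 5]))  # {0: 'a', 1: 'b'} (5 is absent)
```

The TODO in the diff points at the same shape: only the partitions that actually own a requested key need to be touched.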
[GitHub] spark pull request: [SPARK-2365] Add IndexedRDD, an efficient upda...
Github user squito commented on a diff in the pull request: https://github.com/apache/spark/pull/1297#discussion_r17791303

--- Diff: core/src/main/scala/org/apache/spark/util/collection/ImmutableLongOpenHashSet.scala ---

@@ -0,0 +1,228 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.util.collection
+
+import scala.reflect._
+import com.google.common.hash.Hashing
+
+/**
+ * A fast, immutable hash set optimized for insertions and lookups (but not deletions) of `Long`
+ * elements. Because it exposes the position of a key in the underlying array, this is useful as a
+ * building block for higher level data structures such as a hash map (for example,
+ * IndexedRDDPartition).
+ *
+ * It uses quadratic probing with a power-of-2 hash table size, which is guaranteed to explore all
+ * spaces for each key (see http://en.wikipedia.org/wiki/Quadratic_probing).
+ */
+private[spark] class ImmutableLongOpenHashSet(
+    /** Underlying array of elements used as a hash table. */
+    val data: ImmutableVector[Long],
+    /** Whether or not there is an element at the corresponding position in `data`. */
+    val bitset: ImmutableBitSet,
+    /**
+     * Position of a focused element. This is useful when returning a modified set along with a
+     * pointer to the location of modification.
+     */
+    val focus: Int,
+    /** Load threshold at which to grow the underlying vectors. */
+    loadFactor: Double
+  ) extends Serializable {
+
+  require(loadFactor < 1.0, "Load factor must be less than 1.0")
+  require(loadFactor > 0.0, "Load factor must be greater than 0.0")
+  require(capacity == nextPowerOf2(capacity), "data capacity must be a power of 2")
+
+  import OpenHashSet.{INVALID_POS, NONEXISTENCE_MASK, POSITION_MASK, Hasher, LongHasher}
+
+  private val hasher: Hasher[Long] = new LongHasher
+
+  private def mask = capacity - 1
+  private def growThreshold = (loadFactor * capacity).toInt
+
+  def withFocus(focus: Int): ImmutableLongOpenHashSet =
+    new ImmutableLongOpenHashSet(data, bitset, focus, loadFactor)
+
+  /** The number of elements in the set. */
+  def size: Int = bitset.cardinality
+
+  /** The capacity of the set (i.e. size of the underlying vector). */
+  def capacity: Int = data.size
+
+  /** Return true if this set contains the specified element. */
+  def contains(k: Long): Boolean = getPos(k) != INVALID_POS
+
+  /**
+   * Nondestructively add an element to the set, returning a new set. If the set is over capacity
+   * after the insertion, grows the set and rehashes all elements.
+   */
+  def add(k: Long): ImmutableLongOpenHashSet = {
+    addWithoutResize(k).rehashIfNeeded(ImmutableLongOpenHashSet.grow, ImmutableLongOpenHashSet.move)
+  }
+
+  /**
+   * Add an element to the set. This one differs from add in that it doesn't trigger rehashing.
+   * The caller is responsible for calling rehashIfNeeded.
+   *
+   * Use (retval.focus & POSITION_MASK) to get the actual position, and
+   * (retval.focus & NONEXISTENCE_MASK) == 0 for prior existence.
+   */
+  def addWithoutResize(k: Long): ImmutableLongOpenHashSet = {
+    var pos = hashcode(hasher.hash(k)) & mask
+    var i = 1
+    var result: ImmutableLongOpenHashSet = null
+    while (result == null) {
+      if (!bitset.get(pos)) {
+        // This is a new key.
+        result = new ImmutableLongOpenHashSet(
+          data.updated(pos, k), bitset.set(pos), pos | NONEXISTENCE_MASK, loadFactor)
+      } else if (data(pos) == k) {
+        // Found an existing key.
+        result = this.withFocus(pos)
+      } else {
+        val delta = i
+        pos = (pos + delta) & mask
+        i += 1
+      }
+    }
+    result
+  }
+
+  /**
+   * Rehash the set if it is overloaded.
+   * @param allocateFunc Callback invoked when we
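The `addWithoutResize` in the quoted diff probes with an increasing delta (1, then 2, then 3, ...) over a power-of-2 table, i.e. quadratic probing along triangular-number offsets, which visits every slot when the capacity is 2^n. A small mutable Python sketch of the same probe sequence (simplified for illustration; the real class is immutable and returns a new set plus a focus position per insert):

```python
def insert(table, k):
    """Quadratic probing over a power-of-2 table: offsets 0, 1, 3, 6, ...
    (triangular numbers), mirroring addWithoutResize's increasing delta."""
    mask = len(table) - 1
    pos = hash(k) & mask
    i = 1
    while True:
        if table[pos] is None:      # empty slot: new key
            table[pos] = k
            return pos
        if table[pos] == k:         # found an existing key
            return pos
        pos = (pos + i) & mask      # add an increasing delta
        i += 1

table = [None] * 8
for k in (3, 11, 5):                # 3 and 11 collide: 11 & 7 == 3
    insert(table, k)
print(table.count(None))  # 5
```

Resizing is omitted; in the quoted code it is the separate `rehashIfNeeded` step driven by the load factor.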
[GitHub] spark pull request: [SPARK-927] detect numpy at time of use
Github user mattf commented on the pull request: https://github.com/apache/spark/pull/2313#issuecomment-56197446 for some additional input, @pwendell - do you think requiring numpy for core would be acceptable?
[GitHub] spark pull request: SPARK-3580: New public method for RDD's to hav...
Github user patmcdonough commented on a diff in the pull request: https://github.com/apache/spark/pull/2447#discussion_r17794069

--- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---

@@ -208,6 +208,23 @@ abstract class RDD[T: ClassTag](
   }

   /**
+   * Get the number of partitions in this RDD
+   *
+   * {{{
+   * scala> val rdd = sc.parallelize(1 to 4, 2)
+   * rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:12
+   *
+   * scala> rdd.getNumPartitions
+   * res1: Int = 2
+   * }}}

--- End diff --

Good point, although it's worth noting this was essentially ported directly from the Python API (including the doc). Any doc changes should be consistent across both versions if possible.
[GitHub] spark pull request: [SPARK-2365] Add IndexedRDD, an efficient upda...
Github user squito commented on the pull request: https://github.com/apache/spark/pull/1297#issuecomment-56199798 This looks great! My comments are minor. I know it's early to be discussing example docs, but I just wanted to mention that I can see caching being an area of confusion. E.g., you wouldn't want to serialized-cache each update to an IndexedRDD, as each cache would make a full copy and not get the benefits of the ImmutableVectors.
[GitHub] spark pull request: [WIP][SPARK-1486][MLlib] Multi Model Training ...
Github user brkyvz commented on the pull request: https://github.com/apache/spark/pull/2451#issuecomment-56202513 @anantasty: If you could look through the code and mark places where you're like "What the heck is going on here?", it would be easier for me to write up proper comments. I'm going to add a lot today; I can incorporate yours as well. Thanks!
[GitHub] spark pull request: SPARK-3574. Shuffle finish time always reporte...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/2440#issuecomment-56206573 Jenkins, test this please.
[GitHub] spark pull request: [SPARK-3597][Mesos] Implement `killTask`.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2453#issuecomment-56206744 Can one of the admins verify this patch?
[GitHub] spark pull request: [SPARK-3536][SQL] SELECT on empty parquet tabl...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2456#issuecomment-56206732 Can one of the admins verify this patch?
[GitHub] spark pull request: [SPARK-3268][SQL] DoubleType, FloatType and De...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2457#issuecomment-56206727 Can one of the admins verify this patch?
[GitHub] spark pull request: [SPARK-3446] Expose underlying job ids in Futu...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/2337#discussion_r17796741

--- Diff: core/src/main/scala/org/apache/spark/FutureAction.scala ---

@@ -83,6 +83,15 @@ trait FutureAction[T] extends Future[T] {
   */
  @throws(classOf[Exception])
  def get(): T = Await.result(this, Duration.Inf)
+
+  /**
+   * Returns the job IDs run by the underlying async operation.
+   *
+   * This returns the current snapshot of the job list. Certain operations may run multiple
+   * job, so multiple calls to this method may return different lists.

--- End diff --

multiple jobs
[GitHub] spark pull request: [SPARK-3250] Implement Gap Sampling optimizati...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2455#issuecomment-56206738 Can one of the admins verify this patch?
[GitHub] spark pull request: [SPARK-3446] Expose underlying job ids in Futu...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/2337#discussion_r17796804

--- Diff: core/src/test/scala/org/apache/spark/FutureActionSuite.scala ---

@@ -0,0 +1,49 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark
+
+import scala.concurrent.Await
+import scala.concurrent.duration.Duration
+
+import org.scalatest.{BeforeAndAfter, FunSuite, Matchers}
+
+import org.apache.spark.SparkContext._
+
+class FutureActionSuite extends FunSuite with BeforeAndAfter with Matchers with LocalSparkContext {
+
+  before {
+    sc = new SparkContext("local", "FutureActionSuite")

--- End diff --

can you add a test here for the case when multiple job id's are used?
[GitHub] spark pull request: [SPARK-2098] All Spark processes should suppor...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2379#issuecomment-56207066 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20573/consoleFull) for PR 2379 at commit [`5acc167`](https://github.com/apache/spark/commit/5acc16712f031d5e3269b9088acee8e7e6c8d431). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-3377] [Metrics] Metrics can be accident...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2432#issuecomment-56207059 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20575/consoleFull) for PR 2432 at commit [`4a93c7f`](https://github.com/apache/spark/commit/4a93c7f7da8d829a8837f3a31aff0f08355e0c5a). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-3599]Avoid loaing properties file frequ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2454#issuecomment-56207072 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20570/consoleFull) for PR 2454 at commit [`2a79f26`](https://github.com/apache/spark/commit/2a79f26497f9232465aa2e9b496b0d54b9ccda75). * This patch merges cleanly.
[GitHub] spark pull request: SPARK-3574. Shuffle finish time always reporte...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2440#issuecomment-56207056 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20572/consoleFull) for PR 2440 at commit [`b340956`](https://github.com/apache/spark/commit/b34095661f2fe060c1819293a203216c16cf5454). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-3535][Mesos] Fix resource handling.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2401#issuecomment-56207061 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20574/consoleFull) for PR 2401 at commit [`56988e3`](https://github.com/apache/spark/commit/56988e31363bc07dc8acb369bdaade6b18b98f51). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-3491] [MLlib] [PySpark] use pickle to s...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2378#issuecomment-56207099 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20576/consoleFull) for PR 2378 at commit [`dffbba2`](https://github.com/apache/spark/commit/dffbba2ba206bbbd3dfc740a55f1b0df341860e7). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-1987] EdgePartitionBuilder: More memory...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2446#issuecomment-56207128 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20571/consoleFull) for PR 2446 at commit [`e1a8f04`](https://github.com/apache/spark/commit/e1a8f04ba923935e26bc8a78c3e0aff03751aae4). * This patch **fails** unit tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `sealed trait Matrix extends Serializable ` * `class SparseMatrix(` * `sealed trait Vector extends Serializable `
[GitHub] spark pull request: [SPARK-2098] All Spark processes should suppor...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2379#issuecomment-56207390 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20573/consoleFull) for PR 2379 at commit [`5acc167`](https://github.com/apache/spark/commit/5acc16712f031d5e3269b9088acee8e7e6c8d431). * This patch **fails** unit tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `sealed trait Matrix extends Serializable ` * `class SparseMatrix(` * `sealed trait Vector extends Serializable `
[GitHub] spark pull request: [SPARK-3598][SQL]cast to timestamp should be t...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2458#issuecomment-56207035 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20569/consoleFull) for PR 2458 at commit [`4274b1d`](https://github.com/apache/spark/commit/4274b1d10fc48746c850207fc27e5acc8630ddc9). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-3446] Expose underlying job ids in Futu...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/2337#issuecomment-56207872 It would be good to test the complex case with multiple job ids, but overall looks good. @rxin you added this interface - can you take a look (this is a very small patch)?
[GitHub] spark pull request: [SPARK-3446] Expose underlying job ids in Futu...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/2337#discussion_r17797122

--- Diff: core/src/main/scala/org/apache/spark/FutureAction.scala ---

@@ -171,6 +179,8 @@ class ComplexFutureAction[T] extends FutureAction[T] {
   // is cancelled before the action was even run (and thus we have no thread to interrupt).
   @volatile private var _cancelled: Boolean = false

+  @volatile private var jobs: Seq[Int] = Nil

--- End diff --

Just wondering - any reason to make this a `var` instead of a `val` ListBuffer? And then we could return an immutable `Seq` in jobIds?
[GitHub] spark pull request: SPARK-3580: New public method for RDD's to hav...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/2447#issuecomment-56208207 Jenkins, retest this please.
[GitHub] spark pull request: SPARK-3580: New public method for RDD's to hav...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2447#issuecomment-56208954 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20577/consoleFull) for PR 2447 at commit [`afc4e09`](https://github.com/apache/spark/commit/afc4e097842e45f50251a9340371b5ded0a65ae0). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-3491] [MLlib] [PySpark] use pickle to s...
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2378#issuecomment-56210084 @mengxr PickleSerializer does not compress data; CompressSerializer can do that using gzip (level 1). Compression helps for doubles in a small range or for repeated values, but is worse for random doubles over a large range. BatchedSerializer helps reduce the per-object overhead of the class name. In the JVM, the memory of short-lived objects cannot be reused without GC, so batched serialization will not increase GC pressure if the batch size is not too large (depending on how GC is configured).
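The batching idea described above can be sketched in a few lines; `batched` is an illustrative helper, not PySpark's actual `BatchedSerializer` API — the point is only that grouping records before serializing amortizes per-record overhead such as repeated class names in a pickle stream.

```scala
// Sketch only: group records into fixed-size batches before serializing,
// so per-record overhead (e.g. the class name written for every object)
// is paid once per batch instead of once per record.
def batched[T](records: Iterator[T], batchSize: Int): Iterator[Seq[T]] =
  records.grouped(batchSize)
```

Each batch becomes short-lived garbage once serialized, so as noted it adds little GC pressure provided `batchSize` stays modest.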
[GitHub] spark pull request: [SPARK-3491] [MLlib] [PySpark] use pickle to s...
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2378#issuecomment-56211052 @mengxr In this PR I just tried to avoid any changes other than serialization; we could change the caching behavior or compression later. It would be good to have some numbers on the performance regression; I only see a 5% regression in LogisticRegressionWithSGD.train() with a small dataset (locally).
[GitHub] spark pull request: [SPARK-3477] Clean up code in Yarn Client / Cl...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/2350#discussion_r17798663 --- Diff: yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientDistributedCacheManager.scala --- @@ -19,29 +19,24 @@ package org.apache.spark.deploy.yarn import java.net.URI +import scala.collection.mutable.{HashMap, LinkedHashMap, Map} + import org.apache.hadoop.conf.Configuration -import org.apache.hadoop.fs.FileStatus -import org.apache.hadoop.fs.FileSystem -import org.apache.hadoop.fs.Path +import org.apache.hadoop.fs.{FileStatus, FileSystem, Path} import org.apache.hadoop.fs.permission.FsAction -import org.apache.hadoop.yarn.api.records.LocalResource -import org.apache.hadoop.yarn.api.records.LocalResourceVisibility -import org.apache.hadoop.yarn.api.records.LocalResourceType +import org.apache.hadoop.yarn.api.records._ --- End diff -- Because IDE. I can fix it up.
[GitHub] spark pull request: SPARK-3574. Shuffle finish time always reporte...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/2440#issuecomment-56211203 LGTM pending tests.
[GitHub] spark pull request: [SPARK-3477] Clean up code in Yarn Client / Cl...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/2350#discussion_r17798689 --- Diff: yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala --- @@ -415,41 +381,153 @@ trait ClientBase extends Logging { "1>", ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stdout", "2>", ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stderr") -logInfo("Yarn AM launch context:") -logInfo(s"  user class: ${args.userClass}") -logInfo(s"  env: $env") -logInfo(s"  command: ${commands.mkString(" ")}") - // TODO: it would be nicer to just make sure there are no null commands here val printableCommands = commands.map(s => if (s == null) "null" else s).toList amContainer.setCommands(printableCommands) -setupSecurityToken(amContainer) + logDebug("===") +logDebug("Yarn AM launch context:") +logDebug(s"user class: ${Option(args.userClass).getOrElse("N/A")}") +logDebug("env:") +launchEnv.foreach { case (k, v) => logDebug(s"$k -> $v") } +logDebug("resources:") +localResources.foreach { case (k, v) => logDebug(s"$k -> $v") } +logDebug("command:") +logDebug(s"${printableCommands.mkString(" ")}") + logDebug("===") // send the acl settings into YARN to control who has access via YARN interfaces val securityManager = new SecurityManager(sparkConf) amContainer.setApplicationACLs(YarnSparkHadoopUtil.getApplicationAclsForYarn(securityManager)) +setupSecurityToken(amContainer) +UserGroupInformation.getCurrentUser().addCredentials(credentials) amContainer } + + /** + * Report the state of an application until it has exited, either successfully or + * due to some failure, then return the application state. + * + * @param returnOnRunning Whether to also return the application state when it is RUNNING. + * @param logApplicationReport Whether to log details of the application report every iteration. + * @return state of the application, one of FINISHED, FAILED, KILLED, and RUNNING. + */ + def monitorApplication( + appId: ApplicationId, + returnOnRunning: Boolean = false, + logApplicationReport: Boolean = true): YarnApplicationState = { +val interval = sparkConf.getLong("spark.yarn.report.interval", 1000) +var lastState: YarnApplicationState = null +while (true) { + Thread.sleep(interval) + val report = getApplicationReport(appId) + val state = report.getYarnApplicationState + + if (logApplicationReport) { +logInfo(s"Application report from ResourceManager for app ${appId.getId} (state: $state)") --- End diff -- Ok
[GitHub] spark pull request: [SPARK-3584] sbin/slaves doesn't work when we ...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/2444#discussion_r17798769 --- Diff: sbin/slaves.sh --- @@ -67,20 +69,26 @@ fi if [ "$HOSTLIST" = "" ]; then if [ "$SPARK_SLAVES" = "" ]; then -export HOSTLIST="${SPARK_CONF_DIR}/slaves" +if [ -f "${SPARK_CONF_DIR}/slaves" ]; then + HOSTLIST=`cat "${SPARK_CONF_DIR}/slaves"` +else + HOSTLIST=localhost +fi else -export HOSTLIST="${SPARK_SLAVES}" +HOSTLIST=`cat "${SPARK_SLAVES}"` fi fi + + # By default disable strict host key checking if [ "$SPARK_SSH_OPTS" = "" ]; then SPARK_SSH_OPTS="-o StrictHostKeyChecking=no" fi -for slave in `cat "$HOSTLIST"|sed "s/#.*$//;/^$/d"`; do +for slave in `echo "$HOSTLIST"|sed "s/#.*$//;/^$/d"`; do ssh $SPARK_SSH_OPTS "$slave" $"${@// /\\ }" \ - 2>&1 | sed "s/^/$slave: /" & + 2>&1 | sed "s/^/$slave: /" --- End diff -- I agree with Matt - this will regress behavior for other users. Can we have a flag called `SSH_FOREGROUND` that turns this on?
[GitHub] spark pull request: fix compile error for hadoop CDH 4.4+
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/151
[GitHub] spark pull request: SPARK-1793 - Heavily duplicated test setup cod...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/726
[GitHub] spark pull request: [SPARK-3584] sbin/slaves doesn't work when we ...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/2444#discussion_r17799143 --- Diff: sbin/slaves.sh --- @@ -67,20 +69,26 @@ fi if [ "$HOSTLIST" = "" ]; then if [ "$SPARK_SLAVES" = "" ]; then -export HOSTLIST="${SPARK_CONF_DIR}/slaves" +if [ -f "${SPARK_CONF_DIR}/slaves" ]; then + HOSTLIST=`cat "${SPARK_CONF_DIR}/slaves"` +else + HOSTLIST=localhost --- End diff -- We should change the docs in `spark-standalone.md` to explain two new features: 1. You can set SSH_FOREGROUND if you cannot use passwordless SSH (currently, it says this is required). 2. If there is no `slaves` file in existence, it will launch a single slave at `localhost` by default.
[GitHub] spark pull request: [SPARK-3584] sbin/slaves doesn't work when we ...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/2444#discussion_r17799162 --- Diff: .gitignore --- @@ -19,6 +19,7 @@ conf/*.sh conf/*.properties conf/*.conf conf/*.xml +conf/slaves --- End diff -- Okay this is fine actually, given that we preserve the default behavior due to your edits below (of starting at localhost).
[GitHub] spark pull request: [SPARK-3584] sbin/slaves doesn't work when we ...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/2444#issuecomment-56212658 Made some comments. We need to guard this with a config parameter because otherwise it will regress behavior on large clusters where serial vs parallel ssh makes a big difference.
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user zhzhan commented on the pull request: https://github.com/apache/spark/pull/2241#issuecomment-56212828 This patch does not include the Thrift changes, which will be fixed by other JIRAs, because I don't want the scope to be too big.
[GitHub] spark pull request: [SPARK-3595] Respect configured OutputCommitte...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/2450#discussion_r17799318 --- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala --- @@ -872,7 +872,12 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)]) hadoopConf.set("mapred.output.compression.codec", c.getCanonicalName) hadoopConf.set("mapred.output.compression.type", CompressionType.BLOCK.toString) } -hadoopConf.setOutputCommitter(classOf[FileOutputCommitter]) + +// Useful on EMR where direct output committer is set by default --- End diff -- For this comment I'd make it more general: ``` // Use existing output committer if already set ``` I'm guessing over time we'll run into many formats that require this.
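The shape of the fix under review can be sketched as follows; `ensureOutputCommitter` is a hypothetical helper name, and checking the raw `mapred.output.committer.class` key is one way to detect a user-configured committer — a sketch of the idea, not necessarily the merged code:

```scala
import org.apache.hadoop.mapred.{FileOutputCommitter, JobConf}

// Use the existing output committer if one is already configured (e.g. EMR's
// direct committer for s3:// output); otherwise fall back to Spark's previous
// unconditional default of FileOutputCommitter.
def ensureOutputCommitter(hadoopConf: JobConf): Unit = {
  // Configuration.get returns null when the key was never set.
  if (hadoopConf.get("mapred.output.committer.class") == null) {
    hadoopConf.setOutputCommitter(classOf[FileOutputCommitter])
  }
}
```

The design point matches the review comment: the check is format-agnostic, so any output format that ships its own committer benefits, not just the EMR case.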
[GitHub] spark pull request: [SPARK-3477] Clean up code in Yarn Client / Cl...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/2350#discussion_r17799322 --- Diff: yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala --- @@ -37,154 +36,106 @@ import org.apache.hadoop.yarn.api.protocolrecords._ import org.apache.hadoop.yarn.api.records._ import org.apache.hadoop.yarn.conf.YarnConfiguration import org.apache.hadoop.yarn.util.Records + import org.apache.spark.{Logging, SecurityManager, SparkConf, SparkContext, SparkException} /** - * The entry point (starting in Client#main() and Client#run()) for launching Spark on YARN. The - * Client submits an application to the YARN ResourceManager. + * The entry point (starting in Client#main() and Client#run()) for launching Spark on YARN. + * The Client submits an application to the YARN ResourceManager. */ -trait ClientBase extends Logging { - val args: ClientArguments - val conf: Configuration - val sparkConf: SparkConf - val yarnConf: YarnConfiguration - val credentials = UserGroupInformation.getCurrentUser().getCredentials() - private val SPARK_STAGING: String = .sparkStaging +private[spark] trait ClientBase extends Logging { + import ClientBase._ + + protected val args: ClientArguments + protected val hadoopConf: Configuration + protected val sparkConf: SparkConf + protected val yarnConf: YarnConfiguration + protected val credentials = UserGroupInformation.getCurrentUser.getCredentials + protected val amMemoryOverhead = args.amMemoryOverhead // MB + protected val executorMemoryOverhead = args.executorMemoryOverhead // MB private val distCacheMgr = new ClientDistributedCacheManager() - // Staging directory is private! 
- rwx - val STAGING_DIR_PERMISSION: FsPermission = -FsPermission.createImmutable(Integer.parseInt(700, 8).toShort) - // App files are world-wide readable and owner writable - rw-r--r-- - val APP_FILE_PERMISSION: FsPermission = -FsPermission.createImmutable(Integer.parseInt(644, 8).toShort) - - // Additional memory overhead - in mb. - protected def memoryOverhead: Int = sparkConf.getInt(spark.yarn.driver.memoryOverhead, -YarnSparkHadoopUtil.DEFAULT_MEMORY_OVERHEAD) - - // TODO(harvey): This could just go in ClientArguments. - def validateArgs() = { -Map( - (args.numExecutors = 0) - Error: You must specify at least 1 executor!, - (args.amMemory = memoryOverhead) - (Error: AM memory size must be + -greater than: + memoryOverhead), - (args.executorMemory = memoryOverhead) - (Error: Executor memory size + -must be greater than: + memoryOverhead.toString) -).foreach { case(cond, errStr) = - if (cond) { -logError(errStr) -throw new IllegalArgumentException(args.getUsageMessage()) - } -} - } - - def getAppStagingDir(appId: ApplicationId): String = { -SPARK_STAGING + Path.SEPARATOR + appId.toString() + Path.SEPARATOR - } - - def verifyClusterResources(app: GetNewApplicationResponse) = { -val maxMem = app.getMaximumResourceCapability().getMemory() -logInfo(Max mem capabililty of a single resource in this cluster + maxMem) - -// If we have requested more then the clusters max for a single resource then exit. -if (args.executorMemory maxMem) { - val errorMessage = -Required executor memory (%d MB), is above the max threshold (%d MB) of this cluster. - .format(args.executorMemory, maxMem) - - logError(errorMessage) - throw new IllegalArgumentException(errorMessage) -} -val amMem = args.amMemory + memoryOverhead + /** + * Fail fast if we have requested more resources per container than is available in the cluster. 
+ */ + protected def verifyClusterResources(newAppResponse: GetNewApplicationResponse): Unit = { +val maxMem = newAppResponse.getMaximumResourceCapability().getMemory() +logInfo(Verifying our application has not requested more than the maximum + + smemory capability of the cluster ($maxMem MB per container)) +val executorMem = args.executorMemory + executorMemoryOverhead +if (executorMem maxMem) { + throw new IllegalArgumentException(sRequired executor memory ($executorMem MB) + +sis above the max threshold ($maxMem MB) of this cluster!) +} +val amMem = args.amMemory + amMemoryOverhead if (amMem maxMem) { - - val errorMessage = Required AM memory (%d) is above the max threshold (%d) of this cluster. -.format(amMem, maxMem) - logError(errorMessage) - throw new
[GitHub] spark pull request: SPARK-3574. Shuffle finish time always reporte...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2440#issuecomment-56213954 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20572/consoleFull) for PR 2440 at commit [`b340956`](https://github.com/apache/spark/commit/b34095661f2fe060c1819293a203216c16cf5454). * This patch **fails** unit tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `sealed trait Matrix extends Serializable ` * `class SparseMatrix(` * `sealed trait Vector extends Serializable `
[GitHub] spark pull request: [SPARK-3598][SQL]cast to timestamp should be t...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2458#issuecomment-56214025 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20569/consoleFull) for PR 2458 at commit [`4274b1d`](https://github.com/apache/spark/commit/4274b1d10fc48746c850207fc27e5acc8630ddc9). * This patch **passes** unit tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `sealed trait Matrix extends Serializable ` * `class SparseMatrix(` * `sealed trait Vector extends Serializable `
[GitHub] spark pull request: [SPARK-3595] Respect configured OutputCommitte...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/2450#discussion_r17799792 --- Diff: core/src/test/scala/org/apache/spark/rdd/PairRDDFunctionsSuite.scala --- @@ -478,6 +482,15 @@ class PairRDDFunctionsSuite extends FunSuite with SharedSparkContext { pairs.saveAsNewAPIHadoopFile[ConfigTestFormat]("ignored") } + test("saveAsHadoopFile should respect configured output committers") { +val pairs = sc.parallelize(Array((new Integer(1), new Integer(1)))) +val conf = new JobConf(sc.hadoopConfiguration) +conf.setOutputCommitter(classOf[FakeOutputCommitter]) +pairs.saveAsHadoopFile("ignored", pairs.keyClass, pairs.valueClass, classOf[FakeOutputFormat], conf) +val ran = sys.props.remove("mapred.committer.ran") --- End diff -- This use of system properties here means this test can't run in parallel. It might be good to do two things: 1. Guard these tests with a lock so both can't run at the same time. 2. Clear the `mapred.committer.ran` property before starting the test (otherwise you could get a false positive).
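The two suggestions (a lock so the tests cannot interleave, and clearing the flag up front to avoid false positives) could look roughly like this; `withCommitterFlag` is an illustrative helper mirroring the `mapred.committer.ran` flag from the diff, not code from the PR:

```scala
// Guards every test that touches the process-global system property, so two
// tests using the same flag cannot run at the same time.
object CommitterTestLock

def withCommitterFlag[T](body: => T): T = CommitterTestLock.synchronized {
  // Clear the flag first so a leftover value from a previous test
  // cannot produce a false positive.
  sys.props.remove("mapred.committer.ran")
  try body
  finally sys.props.remove("mapred.committer.ran")
}
```

A test would then run its save-and-assert logic inside `withCommitterFlag { ... }`, checking the property while the lock is held.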
[GitHub] spark pull request: [SPARK-3595] Respect configured OutputCommitte...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/2450#discussion_r17799890 --- Diff: core/src/test/scala/org/apache/spark/rdd/PairRDDFunctionsSuite.scala --- @@ -478,6 +482,15 @@ class PairRDDFunctionsSuite extends FunSuite with SharedSparkContext { pairs.saveAsNewAPIHadoopFile[ConfigTestFormat]("ignored") } + test("saveAsHadoopFile should respect configured output committers") { +val pairs = sc.parallelize(Array((new Integer(1), new Integer(1)))) +val conf = new JobConf(sc.hadoopConfiguration) --- End diff -- Could this just start with a blank `JobConf` rather than reading the one from the SparkContext?
[GitHub] spark pull request: [SPARK-3597][Mesos] Implement `killTask`.
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/2453#issuecomment-56214417 There is a related PR #1940
[GitHub] spark pull request: [SPARK-3597][Mesos] Implement `killTask`.
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/2453#issuecomment-56214347 ok to test
[GitHub] spark pull request: [SPARK-3595] Respect configured OutputCommitte...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/2450#discussion_r17800209 --- Diff: examples/src/main/scala/org/apache/spark/examples/AwsTest.scala --- @@ -0,0 +1,88 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.examples + +import org.apache.commons.logging.LogFactory +import org.apache.hadoop.fs.{FileSystem, Path} +import org.apache.hadoop.mapred._ +import org.apache.spark.{SparkConf, SparkContext} +import org.apache.spark.SparkContext._ + +/** + * An OutputCommitter similar to the one used by default for s3:// URLs in EMR. + */ +class DirectOutputCommitter extends OutputCommitter { --- End diff -- It's great that you did this integration test to verify this is working. However, we usually won't merge things like this into the repo because tests that aren't run regularly as part of our harness don't provide much testing value (and often become out of date, etc). AFAIK the unit test provides pretty good coverage here. Would you mind dropping this from the PR?
[GitHub] spark pull request: [SPARK-3595] Respect configured OutputCommitte...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/2450#issuecomment-56214986 Thanks for sending this. The approach seems solid. I made some small comments in a few places.
[GitHub] spark pull request: SPARK-3605. Fix typo in SchemaRDD.
GitHub user sryza opened a pull request: https://github.com/apache/spark/pull/2460 SPARK-3605. Fix typo in SchemaRDD. You can merge this pull request into a Git repository by running: $ git pull https://github.com/sryza/spark sandy-spark-3605 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/2460.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2460 commit 09d940ba78c3ed432c4982d167f979fa94a82c56 Author: Sandy Ryza sa...@cloudera.com Date: 2014-09-19T18:20:34Z SPARK-3605. Fix typo in SchemaRDD.
[GitHub] spark pull request: [SPARK-3446] Expose underlying job ids in Futu...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/2337#discussion_r17800470 --- Diff: core/src/test/scala/org/apache/spark/FutureActionSuite.scala --- @@ -0,0 +1,49 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark + +import scala.concurrent.Await +import scala.concurrent.duration.Duration + +import org.scalatest.{BeforeAndAfter, FunSuite, Matchers} + +import org.apache.spark.SparkContext._ + +class FutureActionSuite extends FunSuite with BeforeAndAfter with Matchers with LocalSparkContext { + + before { +sc = new SparkContext("local", "FutureActionSuite") --- End diff -- Isn't that the test on L41 (complex async action)?
[GitHub] spark pull request: SPARK-3605. Fix typo in SchemaRDD.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2460#issuecomment-56215386 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20580/consoleFull) for PR 2460 at commit [`09d940b`](https://github.com/apache/spark/commit/09d940ba78c3ed432c4982d167f979fa94a82c56). * This patch merges cleanly.
[GitHub] spark pull request: SPARK-3574. Shuffle finish time always reporte...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2440#issuecomment-56215397 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20581/consoleFull) for PR 2440 at commit [`c81439b`](https://github.com/apache/spark/commit/c81439be1595bd2403c97065b58c4e4319bdf37e). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-3446] Expose underlying job ids in Futu...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/2337#discussion_r17800549

--- Diff: core/src/main/scala/org/apache/spark/FutureAction.scala ---
@@ -171,6 +179,8 @@ class ComplexFutureAction[T] extends FutureAction[T] {
   // is cancelled before the action was even run (and thus we have no thread to interrupt).
   @volatile private var _cancelled: Boolean = false
 
+  @volatile private var jobs: Seq[Int] = Nil
--- End diff --

I'm trying to avoid synchronization. Having a mutable list here means I'd have to synchronize when returning the immutable Seq in `jobIds`; with the volatile var, I'm only doing read operations on the `Seq`s themselves, so I don't need to synchronize.
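The trade-off vanzin describes — a `@volatile` var holding an immutable Seq instead of a synchronized mutable list — can be sketched as follows. This is a minimal illustration of the pattern, not the actual FutureAction code; the `JobTracker` name and methods are hypothetical.

```scala
// Lock-free publication pattern: the writer replaces an entire immutable Seq
// through a @volatile var, so readers never need to take a lock.
class JobTracker {
  // Each write installs a fresh immutable Seq; @volatile guarantees that
  // readers observe the latest reference without synchronization.
  @volatile private var jobs: Seq[Int] = Nil

  // Called by the single thread driving the action.
  def recordJob(jobId: Int): Unit = {
    jobs = jobs :+ jobId // copy-on-write, then one volatile store
  }

  // Safe from any thread: a volatile read of an already-immutable value.
  def jobIds: Seq[Int] = jobs
}

object JobTrackerDemo extends App {
  val t = new JobTracker
  t.recordJob(1)
  t.recordJob(2)
  println(t.jobIds.mkString(",")) // 1,2
}
```

The key point is that mutation happens only by swapping the reference, so the Seq handed out by `jobIds` can never be modified underneath a reader.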
[GitHub] spark pull request: [WIP][SPARK-1486][MLlib] Multi Model Training ...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/2451#discussion_r17800692

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala ---
@@ -57,13 +250,709 @@ trait Matrix extends Serializable {
  * @param numCols number of columns
  * @param values matrix entries in column major
  */
-class DenseMatrix(val numRows: Int, val numCols: Int, val values: Array[Double]) extends Matrix {
+class DenseMatrix(val numRows: Int, val numCols: Int, val values: Array[Double]) extends Matrix with Serializable {
 
-  require(values.length == numRows * numCols)
+  require(values.length == numRows * numCols, s"The number of values supplied doesn't match the " +
+    s"size of the matrix! values.length: ${values.length}, numRows * numCols: ${numRows * numCols}")
 
   override def toArray: Array[Double] = values
 
-  private[mllib] override def toBreeze: BM[Double] = new BDM[Double](numRows, numCols, values)
+  private[mllib] def toBreeze: BM[Double] = new BDM[Double](numRows, numCols, values)
+
+  private[mllib] def apply(i: Int): Double = values(i)
+
+  private[mllib] def apply(i: Int, j: Int): Double = values(index(i, j))
+
+  private[mllib] def index(i: Int, j: Int): Int = i + numRows * j
+
+  private[mllib] def update(i: Int, j: Int, v: Double): Unit = {
+    values(index(i, j)) = v
+  }
+
+  override def copy = new DenseMatrix(numRows, numCols, values.clone())
+
+  private[mllib] def elementWiseOperateOnColumnsInPlace(
+      f: (Double, Double) => Double,
+      y: Matrix): DenseMatrix = {
+    val y_vals = y.toArray
+    val len = y_vals.length
+    require(y_vals.length == numRows)
+    var j = 0
+    while (j < numCols) {
+      var i = 0
+      while (i < len) {
+        val idx = index(i, j)
+        values(idx) = f(values(idx), y_vals(i))
+        i += 1
+      }
+      j += 1
+    }
+    this
+  }
+
+  private[mllib] def elementWiseOperateOnRowsInPlace(
+      f: (Double, Double) => Double,
+      y: Matrix): DenseMatrix = {
+    val y_vals = y.toArray
+    require(y_vals.length == numCols)
+    var j = 0
+    while (j < numCols) {
+      var i = 0
+      while (i < numRows) {
+        val idx = index(i, j)
+        values(idx) = f(values(idx), y_vals(j))
+        i += 1
+      }
+      j += 1
+    }
+    this
+  }
+
+  private[mllib] def elementWiseOperateInPlace(f: (Double, Double) => Double, y: Matrix): DenseMatrix = {
+    val y_val = y.toArray
+    val len = values.length
+    require(y_val.length == values.length)
+    var j = 0
+    while (j < len) {
+      values(j) = f(values(j), y_val(j))
+      j += 1
+    }
+    this
+  }
+
+  private[mllib] def elementWiseOperateScalarInPlace(f: (Double, Double) => Double, y: Double): DenseMatrix = {
+    var j = 0
+    val len = values.length
+    while (j < len) {
+      values(j) = f(values(j), y)
+      j += 1
+    }
+    this
+  }
+
+  private[mllib] def operateInPlace(f: (Double, Double) => Double, y: Matrix): DenseMatrix = {
+    if (y.numCols == 1 || y.numRows == 1) {
+      require(numCols != numRows, "Operation is ambiguous. Please use elementWiseOperateOnRows " +
+        "or elementWiseOperateOnColumns instead")
+    }
+    if (y.numCols == 1 && y.numRows == 1) {
+      elementWiseOperateScalarInPlace(f, y.toArray(0))
+    } else {
+      if (y.numCols == 1) {
+        elementWiseOperateOnColumnsInPlace(f, y)
+      } else if (y.numRows == 1) {
+        elementWiseOperateOnRowsInPlace(f, y)
+      } else {
+        elementWiseOperateInPlace(f, y)
+      }
+    }
+  }
+
+  private[mllib] def elementWiseOperateOnColumns(f: (Double, Double) => Double, y: Matrix): DenseMatrix = {
+    val dup = this.copy
+    dup.elementWiseOperateOnColumnsInPlace(f, y)
+  }
+
+  private[mllib] def elementWiseOperateOnRows(f: (Double, Double) => Double, y: Matrix): DenseMatrix = {
+    val dup = this.copy
+    dup.elementWiseOperateOnRowsInPlace(f, y)
+  }
+
+  private[mllib] def elementWiseOperate(f: (Double, Double) => Double, y: Matrix): DenseMatrix = {
+    val dup = this.copy
+    dup.elementWiseOperateInPlace(f, y)
+  }
+
+  private[mllib] def elementWiseOperateScalar(f: (Double, Double) => Double, y: Double): DenseMatrix = {
+    val dup = this.copy
+    dup.elementWiseOperateScalarInPlace(f, y)
+  }
+
+  private[mllib] def operate(f: (Double, Double) => Double, y: Matrix): DenseMatrix = {
+    val dup = this.copy
+    dup.operateInPlace(f, y)
+  }
+
+  def map(f: Double => Double) = new
[GitHub] spark pull request: [WIP][SPARK-1486][MLlib] Multi Model Training ...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/2451#discussion_r17800664 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala --- [same DenseMatrix hunk quoted above]
[GitHub] spark pull request: [Docs] Fix outdated docs for standalone cluste...
GitHub user andrewor14 opened a pull request: https://github.com/apache/spark/pull/2461 [Docs] Fix outdated docs for standalone cluster This is now supported! You can merge this pull request into a Git repository by running: $ git pull https://github.com/andrewor14/spark document-standalone-cluster Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/2461.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2461 commit 35e30eee13786b7743820145a121ccef176d627b Author: Andrew Or andrewo...@gmail.com Date: 2014-09-19T18:26:07Z Fix outdated docs for standalone cluster
[GitHub] spark pull request: [WIP][SPARK-1486][MLlib] Multi Model Training ...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/2451#discussion_r17800699 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala --- [same DenseMatrix hunk quoted above]
[GitHub] spark pull request: [WIP][SPARK-1486][MLlib] Multi Model Training ...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/2451#discussion_r17800687

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala ---
@@ -37,11 +44,197 @@ trait Matrix extends Serializable {
   private[mllib] def toBreeze: BM[Double]
 
   /** Gets the (i, j)-th element. */
-  private[mllib] def apply(i: Int, j: Int): Double = toBreeze(i, j)
+  private[mllib] def apply(i: Int, j: Int): Double
+
+  /** Return the index for the (i, j)-th element in the backing array. */
+  private[mllib] def index(i: Int, j: Int): Int
+
+  /** Update element at (i, j) */
+  private[mllib] def update(i: Int, j: Int, v: Double): Unit
+
+  /** Get a deep copy of the matrix. */
+  def copy: Matrix
+
+  /** Convenience method for `Matrix`-`Matrix` multiplication.
+   * Note: `SparseMatrix`-`SparseMatrix` multiplication is not supported */
+  def multiply(y: Matrix): DenseMatrix = {
+    val C: DenseMatrix = DenseMatrix.zeros(numRows, y.numCols)
+    BLAS.gemm(false, false, 1.0, this, y, 0.0, C)
+    C
+  }
+
+  /** Convenience method for `Matrix`-`DenseVector` multiplication. */
+  def multiply(y: DenseVector): DenseVector = {
+    val output = new DenseVector(new Array[Double](numRows))
+    BLAS.gemv(1.0, this, y, 0.0, output)
+    output
+  }
+
+  /** Convenience method for `Matrix`^T^-`Matrix` multiplication.
+   * Note: `SparseMatrix`-`SparseMatrix` multiplication is not supported */
+  def transposeMultiply(y: Matrix): DenseMatrix = {
--- End diff --

How hard would it be to have matrices store a transpose bit indicating whether they are transposed (without the data being moved)? I envision:
* a transpose() function which sets this bit (so transpose is a lazy operation)
* eliminating transposeMultiply
* perhaps a transposePhysical or transpose(physical: Boolean) method which forces data movement

I'm also OK with adding that support later on.
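The lazy-transpose idea jkbradley floats could look roughly like this. This is a hypothetical sketch, not MLlib's API: `SimpleDenseMatrix` and its members are illustrative names, and the point is only that flipping a flag makes transpose O(1) because indexing swaps (i, j) instead of moving data.

```scala
// Hypothetical column-major dense matrix carrying a lazy transpose flag.
class SimpleDenseMatrix(
    val physRows: Int,              // rows of the physical (stored) layout
    val physCols: Int,              // cols of the physical (stored) layout
    val values: Array[Double],      // column-major backing array
    val isTransposed: Boolean = false) {

  require(values.length == physRows * physCols)

  // Logical dimensions swap when the transpose bit is set.
  def numRows: Int = if (isTransposed) physCols else physRows
  def numCols: Int = if (isTransposed) physRows else physCols

  // Logical (i, j) -> offset into the column-major physical array.
  def apply(i: Int, j: Int): Double =
    if (isTransposed) values(j + physRows * i) else values(i + physRows * j)

  // O(1): no data movement, just flip the flag on a shared backing array.
  def transpose: SimpleDenseMatrix =
    new SimpleDenseMatrix(physRows, physCols, values, !isTransposed)
}

object LazyTransposeDemo extends App {
  // 2x3 matrix, columns (1,2), (3,4), (5,6).
  val m = new SimpleDenseMatrix(2, 3, Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0))
  val t = m.transpose
  println(s"${t.numRows}x${t.numCols}") // 3x2
  println(t(1, 0) == m(0, 1))           // true
}
```

A `transposeMultiply` then collapses into `a.transpose.multiply(b)`, with the BLAS layer reading the flag to pick the transposed code path.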
[GitHub] spark pull request: [WIP][SPARK-1486][MLlib] Multi Model Training ...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/2451#discussion_r17800735 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala --- [same DenseMatrix hunk quoted above]
[GitHub] spark pull request: [MLLib] Fix example code variable name misspel...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2459#issuecomment-56216041 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20568/consoleFull) for PR 2459 at commit [`b370a91`](https://github.com/apache/spark/commit/b370a919451ca7e8c1b3eec1b35b941e48571717). * This patch **passes** unit tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `sealed trait Matrix extends Serializable ` * `class SparseMatrix(` * `sealed trait Vector extends Serializable `
[GitHub] spark pull request: [SPARK-3446] Expose underlying job ids in Futu...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2337#issuecomment-56216044 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20583/consoleFull) for PR 2337 at commit [`e166a68`](https://github.com/apache/spark/commit/e166a680575ae96032d7ca03aba4566105cdb388). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-3535][Mesos] Fix resource handling.
Github user vanzin commented on the pull request: https://github.com/apache/spark/pull/2401#issuecomment-56216271 So, I'm a little disappointed that this doesn't at least follow the Yarn model of one setting that defines the overhead. Instead, it has two settings, one for a fraction and one to define some minimum if the fraction is somehow less than that. That sounds too complicated. What's the argument against Yarn's model of a single setting with an absolute overhead value? That doesn't require the user to do math, and makes things easier when for some reason the user requires lots of overhead (e.g. large usage of off-heap memory) that is not necessarily related to the heap size.
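For contrast, the two configuration models vanzin compares can be sketched numerically. The function names and example values below are illustrative only, not Spark's or Yarn's actual config keys.

```scala
// Model in the PR under review: overhead is a fraction of executor memory,
// clamped to a minimum floor when the fraction works out too small.
def overheadFractionModel(executorMemoryMB: Int,
                          fraction: Double,
                          minimumMB: Int): Int =
  math.max((executorMemoryMB * fraction).toInt, minimumMB)

// Yarn-style model vanzin prefers: one absolute value, no math for the user,
// and independent of heap size (useful for large off-heap usage).
def overheadAbsoluteModel(overheadMB: Int): Int = overheadMB

object OverheadDemo extends App {
  // A 4 GB executor with a 10% fraction and a 384 MB floor:
  println(overheadFractionModel(4096, 0.10, 384)) // 409
  // A 1 GB executor falls back to the floor:
  println(overheadFractionModel(1024, 0.10, 384)) // 384
  // The absolute model is just the configured value:
  println(overheadAbsoluteModel(512))             // 512
}
```

The fractional model scales with the heap but needs two knobs and an implicit max(); the absolute model is a single knob whose meaning the user can read off directly.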
[GitHub] spark pull request: [WIP][SPARK-1486][MLlib] Multi Model Training ...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/2451#discussion_r17801072 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala --- [same trait Matrix hunk quoted above] +   * Note: `SparseMatrix`-`SparseMatrix` multiplication is not supported */ --- End diff -- Just wondering (not sure myself): which is preferred in docs, `SparseMatrix` or [[SparseMatrix]]?
[GitHub] spark pull request: [SPARK-3599]Avoid loaing properties file frequ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2454#issuecomment-56216397 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20570/consoleFull) for PR 2454 at commit [`2a79f26`](https://github.com/apache/spark/commit/2a79f26497f9232465aa2e9b496b0d54b9ccda75). * This patch **passes** unit tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `sealed trait Matrix extends Serializable ` * `class SparseMatrix(` * `sealed trait Vector extends Serializable `
[GitHub] spark pull request: [WIP][SPARK-1486][MLlib] Multi Model Training ...
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/2451#issuecomment-56216573 Could the methods be ordered in the file (grouped by public, private[mllib], private, etc.)?
[GitHub] spark pull request: [WIP][SPARK-1486][MLlib] Multi Model Training ...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/2451#discussion_r17801264 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala --- @@ -57,13 +250,709 @@ trait Matrix extends Serializable { * @param numCols number of columns * @param values matrix entries in column major */ -class DenseMatrix(val numRows: Int, val numCols: Int, val values: Array[Double]) extends Matrix { +class DenseMatrix(val numRows: Int, val numCols: Int, val values: Array[Double]) extends Matrix with Serializable { --- End diff -- long line
[GitHub] spark pull request: [WIP][SPARK-1486][MLlib] Multi Model Training ...
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/2451#issuecomment-56216806 Also, is it odd that the user can't access the matrix data, except via toArray (or maybe side effects of the function given to map)?
[GitHub] spark pull request: [SPARK-3535][Mesos] Fix resource handling.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2401#issuecomment-56216747 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20574/consoleFull) for PR 2401 at commit [`56988e3`](https://github.com/apache/spark/commit/56988e31363bc07dc8acb369bdaade6b18b98f51). * This patch **passes** unit tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `sealed trait Matrix extends Serializable ` * `class SparseMatrix(` * `sealed trait Vector extends Serializable `
[GitHub] spark pull request: [SPARK-3377] [Metrics] Metrics can be accident...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2432#issuecomment-56216738 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20575/consoleFull) for PR 2432 at commit [`4a93c7f`](https://github.com/apache/spark/commit/4a93c7f7da8d829a8837f3a31aff0f08355e0c5a).
* This patch **passes** unit tests.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `sealed trait Matrix extends Serializable`
  * `class SparseMatrix(`
  * `sealed trait Vector extends Serializable`
[GitHub] spark pull request: [SPARK-3491] [MLlib] [PySpark] use pickle to s...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2378#issuecomment-56216817 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20576/consoleFull) for PR 2378 at commit [`dffbba2`](https://github.com/apache/spark/commit/dffbba2ba206bbbd3dfc740a55f1b0df341860e7).
* This patch **passes** unit tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-3535][Mesos] Fix resource handling.
Github user brndnmtthws commented on the pull request: https://github.com/apache/spark/pull/2401#issuecomment-56216904 I thought there was some desire to have the same thing in #1391 as well? Furthermore, from my experience writing frameworks, I think a much better model is a fractional overhead (relative to the heap size), for the reasons I mentioned above. If you do some internet searching, you'll see that I've been doing quite a bit of this for a while.
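The two overhead models under discussion can be sketched as follows (the function names and the 384 MB / 10% figures are illustrative placeholders, not necessarily what this PR or Mesos actually uses):

```scala
// Hedged sketch of the two memory-overhead models discussed above.
// The 384 MB floor and 10% fraction are illustrative defaults only.

// Fixed overhead: total = heap + constant, regardless of heap size.
def fixedOverhead(heapMb: Int, overheadMb: Int = 384): Int =
  heapMb + overheadMb

// Fractional overhead: total = heap + max(floor, fraction * heap),
// so larger heaps automatically get proportionally more headroom.
def fractionalOverhead(heapMb: Int, fraction: Double = 0.10, floorMb: Int = 384): Int =
  heapMb + math.max(floorMb, (fraction * heapMb).toInt)

fixedOverhead(2048)       // 2432
fractionalOverhead(2048)  // 2432  (the floor dominates for small heaps)
fractionalOverhead(16384) // 18022 (the fraction dominates for large heaps)
```

The point of the fractional model is the last case: a fixed constant that is adequate for a 2 GB heap tends to be too small for a 16 GB one, since off-heap usage typically grows with the heap.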