[jira] [Updated] (SPARK-3862) MultiWayBroadcastInnerHashJoin
[ https://issues.apache.org/jira/browse/SPARK-3862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-3862: --- Target Version/s: 1.6.0 (was: 1.5.0) MultiWayBroadcastInnerHashJoin -- Key: SPARK-3862 URL: https://issues.apache.org/jira/browse/SPARK-3862 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin It is common to have a single fact table inner join many small dimension tables. We can exploit this fact and create a MultiWayBroadcastInnerHashJoin (or maybe just MultiwayDimensionJoin) operator that optimizes for this pattern. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
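For context, this pattern today compiles to a chain of separate BroadcastHashJoins. A sketch of the query shape the proposed operator would collapse into a single fact-table scan probing several broadcast hash tables ({{fact}}, {{dimCustomer}}, {{dimProduct}} and the join keys are hypothetical names, and {{functions.broadcast}} assumes the 1.5+ DataFrame API):
{code}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.broadcast

// Each join below is its own BroadcastHashJoin operator today,
// re-materializing the fact-side rows at every step; the proposal is to
// probe all the broadcast hash tables in one pass over the fact table.
def enrich(fact: DataFrame, dimCustomer: DataFrame, dimProduct: DataFrame): DataFrame =
  fact
    .join(broadcast(dimCustomer), fact("customer_id") === dimCustomer("id"))
    .join(broadcast(dimProduct), fact("product_id") === dimProduct("id"))
{code}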
[jira] [Created] (SPARK-9437) SizeEstimator overflows for primitive arrays
Imran Rashid created SPARK-9437: --- Summary: SizeEstimator overflows for primitive arrays Key: SPARK-9437 URL: https://issues.apache.org/jira/browse/SPARK-9437 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.1 Reporter: Imran Rashid Assignee: Imran Rashid Priority: Minor {{SizeEstimator}} can overflow when dealing w/ large primitive arrays eg if you have an {{Array[Double]}} of size 1 << 28. This means that when you try to broadcast a large primitive array, you get: {noformat} java.lang.IllegalArgumentException: requirement failed: sizeInBytes was negative: -2147483608 at scala.Predef$.require(Predef.scala:233) at org.apache.spark.storage.BlockInfo.markReady(BlockInfo.scala:55) at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:815) at org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:638) ... {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
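The failure mode is plain Int overflow: an {{Array[Double]}} of size 1 << 28 occupies 2^31 bytes, one past Int.MaxValue. A minimal sketch of the arithmetic (the fix is simply to accumulate sizes in a Long):
{code}
// The shape of the bug: computing a primitive array's size in Int arithmetic.
val length = 1 << 28                 // 268435456 doubles
val sizeAsInt: Int = 8 * length      // 2^31 wraps negative: -2147483648
val sizeAsLong: Long = 8L * length   // correct: 2147483648
// The reported -2147483608 is the wrapped value plus ~40 bytes of headers.
{code}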
[jira] [Assigned] (SPARK-9437) SizeEstimator overflows for primitive arrays
[ https://issues.apache.org/jira/browse/SPARK-9437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9437: --- Assignee: Imran Rashid (was: Apache Spark) SizeEstimator overflows for primitive arrays Key: SPARK-9437 URL: https://issues.apache.org/jira/browse/SPARK-9437 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.1 Reporter: Imran Rashid Assignee: Imran Rashid Priority: Minor {{SizeEstimator}} can overflow when dealing w/ large primitive arrays eg if you have an {{Array[Double]}} of size 1 << 28. This means that when you try to broadcast a large primitive array, you get: {noformat} java.lang.IllegalArgumentException: requirement failed: sizeInBytes was negative: -2147483608 at scala.Predef$.require(Predef.scala:233) at org.apache.spark.storage.BlockInfo.markReady(BlockInfo.scala:55) at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:815) at org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:638) ... {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9437) SizeEstimator overflows for primitive arrays
[ https://issues.apache.org/jira/browse/SPARK-9437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9437: --- Assignee: Apache Spark (was: Imran Rashid) SizeEstimator overflows for primitive arrays Key: SPARK-9437 URL: https://issues.apache.org/jira/browse/SPARK-9437 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.1 Reporter: Imran Rashid Assignee: Apache Spark Priority: Minor {{SizeEstimator}} can overflow when dealing w/ large primitive arrays eg if you have an {{Array[Double]}} of size 1 << 28. This means that when you try to broadcast a large primitive array, you get: {noformat} java.lang.IllegalArgumentException: requirement failed: sizeInBytes was negative: -2147483608 at scala.Predef$.require(Predef.scala:233) at org.apache.spark.storage.BlockInfo.markReady(BlockInfo.scala:55) at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:815) at org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:638) ... {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9437) SizeEstimator overflows for primitive arrays
[ https://issues.apache.org/jira/browse/SPARK-9437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646342#comment-14646342 ] Apache Spark commented on SPARK-9437: - User 'squito' has created a pull request for this issue: https://github.com/apache/spark/pull/7750 SizeEstimator overflows for primitive arrays Key: SPARK-9437 URL: https://issues.apache.org/jira/browse/SPARK-9437 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.1 Reporter: Imran Rashid Assignee: Imran Rashid Priority: Minor {{SizeEstimator}} can overflow when dealing w/ large primitive arrays eg if you have an {{Array[Double]}} of size 1 << 28. This means that when you try to broadcast a large primitive array, you get: {noformat} java.lang.IllegalArgumentException: requirement failed: sizeInBytes was negative: -2147483608 at scala.Predef$.require(Predef.scala:233) at org.apache.spark.storage.BlockInfo.markReady(BlockInfo.scala:55) at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:815) at org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:638) ... {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3862) MultiWayBroadcastInnerHashJoin
[ https://issues.apache.org/jira/browse/SPARK-3862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646376#comment-14646376 ] Reynold Xin commented on SPARK-3862: I don't think that's extreme at all -- very plausible candidate for 1.6! MultiWayBroadcastInnerHashJoin -- Key: SPARK-3862 URL: https://issues.apache.org/jira/browse/SPARK-3862 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin It is common to have a single fact table inner join many small dimension tables. We can exploit this fact and create a MultiWayBroadcastInnerHashJoin (or maybe just MultiwayDimensionJoin) operator that optimizes for this pattern. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9404) UnsafeArrayData
[ https://issues.apache.org/jira/browse/SPARK-9404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9404: --- Assignee: Wenchen Fan (was: Apache Spark) UnsafeArrayData --- Key: SPARK-9404 URL: https://issues.apache.org/jira/browse/SPARK-9404 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Wenchen Fan An Unsafe-based ArrayData implementation. To begin with, we can encode data this way: the first 4 bytes are the # of elements; then each 4 bytes is the start offset of an element, unless it is negative, in which case the element is null; followed by the elements themselves. For example, [10, 11, 12, 13, null, 14] internally should be represented as (each 4 bytes): 6, 28, 32, 36, 40, -44, 48, 10, 11, 12, 13, 0, 14 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
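A minimal sketch of that layout (an illustration of the description above, not Spark's actual UnsafeArrayData; the {{encode}} helper and the boxed-Integer input are assumptions used to demonstrate nulls). Note the header math: a 4-byte count plus 6 offsets makes the first element start at byte 28, matching the example.
{code}
import java.nio.ByteBuffer

// [# elements][offset_0 .. offset_n-1][element_0 .. element_n-1], 4 bytes each;
// a negative offset marks a null slot, whose data slot is written as 0.
def encode(values: Array[java.lang.Integer]): Array[Byte] = {
  val n = values.length
  val buf = ByteBuffer.allocate(4 + 4 * n + 4 * n)
  buf.putInt(n)
  var offset = 4 + 4 * n                  // first element starts after the header
  for (v <- values) {
    buf.putInt(if (v == null) -offset else offset)
    offset += 4
  }
  for (v <- values) buf.putInt(if (v == null) 0 else v.intValue())
  buf.array()
}

// encode(Array[java.lang.Integer](10, 11, 12, 13, null, 14)) yields,
// read as 4-byte ints: 6, 28, 32, 36, 40, -44, 48, 10, 11, 12, 13, 0, 14
{code}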
[jira] [Commented] (SPARK-9404) UnsafeArrayData
[ https://issues.apache.org/jira/browse/SPARK-9404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646444#comment-14646444 ] Apache Spark commented on SPARK-9404: - User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/7752 UnsafeArrayData --- Key: SPARK-9404 URL: https://issues.apache.org/jira/browse/SPARK-9404 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Wenchen Fan An Unsafe-based ArrayData implementation. To begin with, we can encode data this way: the first 4 bytes are the # of elements; then each 4 bytes is the start offset of an element, unless it is negative, in which case the element is null; followed by the elements themselves. For example, [10, 11, 12, 13, null, 14] internally should be represented as (each 4 bytes): 6, 28, 32, 36, 40, -44, 48, 10, 11, 12, 13, 0, 14 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9312) The OneVsRest model does not provide confidence factor (not probability) along with the prediction
[ https://issues.apache.org/jira/browse/SPARK-9312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646277#comment-14646277 ] Badari Madhav commented on SPARK-9312: -- Updated the JIRA and PR, please review. The OneVsRest model does not provide confidence factor (not probability) along with the prediction - Key: SPARK-9312 URL: https://issues.apache.org/jira/browse/SPARK-9312 Project: Spark Issue Type: Improvement Components: ML, MLlib Affects Versions: 1.4.0, 1.4.1 Reporter: Badari Madhav Labels: features Original Estimate: 72h Remaining Estimate: 72h -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9416) Yarn logs say that Spark Python job has succeeded even though job has failed in Yarn cluster mode
[ https://issues.apache.org/jira/browse/SPARK-9416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646365#comment-14646365 ] Apache Spark commented on SPARK-9416: - User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/7751 Yarn logs say that Spark Python job has succeeded even though job has failed in Yarn cluster mode - Key: SPARK-9416 URL: https://issues.apache.org/jira/browse/SPARK-9416 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.4.1 Environment: 3.13.0-53-generic #89-Ubuntu SMP Wed May 20 10:34:39 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux Reporter: Elkhan Dadashov While running the Spark word count python example with an intentional mistake in Yarn cluster mode, the Spark terminal logs (Yarn logs) state the final status as SUCCEEDED, but the log files for the Spark application state the correct result, indicating that the job failed. Terminal log output and application log output contradict each other. If I run the same job in local mode then terminal logs and application logs match, where both state that the job has failed due to the expected error in the python script. More details: Scenario: While running the Spark word count python example in Yarn cluster mode, if I make an intentional error in wordcount.py by changing this line (I'm using Spark 1.4.1, but this problem exists in Spark 1.4.0 and in 1.3.0 versions - which I tested): lines = sc.textFile(sys.argv[1], 1) into this line: lines = sc.textFile(nonExistentVariable,1) where the nonExistentVariable variable was never created and initialized, then I run that example with this command (I put README.md into HDFS before running this command): ./bin/spark-submit --master yarn-cluster wordcount.py /README.md The job runs and finishes successfully according to the log printed in the terminal: Terminal logs: ... 15/07/23 16:19:17 INFO yarn.Client: Application report for application_1437612288327_0013 (state: RUNNING) 15/07/23 16:19:18 INFO yarn.Client: Application report for application_1437612288327_0013 (state: RUNNING) 15/07/23 16:19:19 INFO yarn.Client: Application report for application_1437612288327_0013 (state: RUNNING) 15/07/23 16:19:20 INFO yarn.Client: Application report for application_1437612288327_0013 (state: RUNNING) 15/07/23 16:19:21 INFO yarn.Client: Application report for application_1437612288327_0013 (state: FINISHED) 15/07/23 16:19:21 INFO yarn.Client: client token: N/A diagnostics: Shutdown hook called before final status was reported. ApplicationMaster host: 10.0.53.59 ApplicationMaster RPC port: 0 queue: default start time: 1437693551439 final status: SUCCEEDED tracking URL: http://localhost:8088/proxy/application_1437612288327_0013/history/application_1437612288327_0013/1 user: edadashov 15/07/23 16:19:21 INFO util.Utils: Shutdown hook called 15/07/23 16:19:21 INFO util.Utils: Deleting directory /tmp/spark-eba0a1b5-a216-4afa-9c54-a3cb67b16444 But if you look at the log files generated for this application in HDFS, they indicate the failure of the job with the correct reason: Application log files: ...
Traceback (most recent call last): File "wordcount.py", line 32, in <module> lines = sc.textFile(nonExistentVariable,1) NameError: name 'nonExistentVariable' is not defined The terminal output (Yarn logs too) - final status: SUCCEEDED - does not match the application log results - failure of the job (NameError: name 'nonExistentVariable' is not defined) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9434) Need how-to for resuming direct Kafka streaming consumers where they had left off before getting terminated, OR actual support for that mode in the Streaming API
[ https://issues.apache.org/jira/browse/SPARK-9434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646419#comment-14646419 ] Dmitry Goldenberg commented on SPARK-9434: -- Thanks Cody and Sean. I'll take a look at the process of submitting changes. Need how-to for resuming direct Kafka streaming consumers where they had left off before getting terminated, OR actual support for that mode in the Streaming API - Key: SPARK-9434 URL: https://issues.apache.org/jira/browse/SPARK-9434 Project: Spark Issue Type: Improvement Components: Documentation, Examples, Streaming Affects Versions: 1.4.1 Reporter: Dmitry Goldenberg We've been getting some mixed information regarding how to cause our direct streaming consumers to resume processing from where they left off in terms of the Kafka offsets. On the one hand, we're hearing: "If you are restarting the streaming app with Direct kafka from the checkpoint information (that is, restarting), then the last read offsets are automatically recovered, and the data will start processing from that offset. All the N records added in T will stay buffered in Kafka." (where T is the interval of time during which the consumer was down). On the other hand, there are tickets such as SPARK-6249 and SPARK-8833 which are marked as "Won't Fix" but seem to ask for the functionality we need, with comments like "I don't want to add more config options with confusing semantics around what is being used for the system of record for offsets, I'd rather make it easy for people to explicitly do what they need." The use-case is actually very clear and doesn't ask for confusing semantics. An API option to resume reading where you left off, in addition to the smallest or greatest auto.offset.reset, should be *very* useful, probably for quite a few folks. We're asking for this as an enhancement request. SPARK-8833 states: "I am waiting for getting enough usecase to float in before I take a final call." We're adding to that. In the meantime, can you clarify the confusion? Does direct streaming persist the progress information into DStream checkpoints or does it not? If it does, why is it that we're not seeing that happen? Our consumers start with auto.offset.reset=greatest and that causes them to read from the first offset of data that is written to Kafka *after* the consumer has been restarted, meaning we're missing data that had come in while the consumer was down. If the progress is stored in DStream checkpoints, we want to know a) how to cause that to work for us and b) where the said checkpointing data is stored physically. Conversely, if this is not accurate, then is our only choice to manually persist the offsets into Zookeeper? If that is the case then a) we'd like a clear, more complete code sample to be published, since the one in the Kafka streaming guide is incomplete (it lacks the actual lines of code persisting the offsets) and b) we'd like to request that SPARK-8833 be revisited as a feature worth implementing in the API. Thanks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
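For reference, the pattern under discussion - manually tracking the offsets a direct stream has processed so a restarted consumer can resume - looks roughly like the sketch below against the Spark 1.3+ Kafka direct API. The {{saveOffsets}} persistence callback is an assumption (you would back it with Zookeeper or a database), not part of Spark:
{code}
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils, OffsetRange}

def startWithTrackedOffsets(ssc: StreamingContext,
                            kafkaParams: Map[String, String],
                            topics: Set[String],
                            saveOffsets: Array[OffsetRange] => Unit): Unit = {
  val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, topics)
  stream.foreachRDD { rdd =>
    // Each batch's RDD knows exactly which offset ranges it covers.
    val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    // ... process rdd first, then record progress ...
    saveOffsets(ranges) // on restart, feed these back via the fromOffsets
                        // variant of createDirectStream
  }
}
{code}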
[jira] [Resolved] (SPARK-9434) Need how-to for resuming direct Kafka streaming consumers where they had left off before getting terminated, OR actual support for that mode in the Streaming API
[ https://issues.apache.org/jira/browse/SPARK-9434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitry Goldenberg resolved SPARK-9434. -- Resolution: Not A Problem Documentation on checkpointing is available in the Spark doc set. Need how-to for resuming direct Kafka streaming consumers where they had left off before getting terminated, OR actual support for that mode in the Streaming API - Key: SPARK-9434 URL: https://issues.apache.org/jira/browse/SPARK-9434 Project: Spark Issue Type: Improvement Components: Documentation, Examples, Streaming Affects Versions: 1.4.1 Reporter: Dmitry Goldenberg We've been getting some mixed information regarding how to cause our direct streaming consumers to resume processing from where they left off in terms of the Kafka offsets. On the one hand, we're hearing: "If you are restarting the streaming app with Direct kafka from the checkpoint information (that is, restarting), then the last read offsets are automatically recovered, and the data will start processing from that offset. All the N records added in T will stay buffered in Kafka." (where T is the interval of time during which the consumer was down). On the other hand, there are tickets such as SPARK-6249 and SPARK-8833 which are marked as "Won't Fix" but seem to ask for the functionality we need, with comments like "I don't want to add more config options with confusing semantics around what is being used for the system of record for offsets, I'd rather make it easy for people to explicitly do what they need." The use-case is actually very clear and doesn't ask for confusing semantics. An API option to resume reading where you left off, in addition to the smallest or greatest auto.offset.reset, should be *very* useful, probably for quite a few folks. We're asking for this as an enhancement request. SPARK-8833 states: "I am waiting for getting enough usecase to float in before I take a final call." We're adding to that. In the meantime, can you clarify the confusion? Does direct streaming persist the progress information into DStream checkpoints or does it not? If it does, why is it that we're not seeing that happen? Our consumers start with auto.offset.reset=greatest and that causes them to read from the first offset of data that is written to Kafka *after* the consumer has been restarted, meaning we're missing data that had come in while the consumer was down. If the progress is stored in DStream checkpoints, we want to know a) how to cause that to work for us and b) where the said checkpointing data is stored physically. Conversely, if this is not accurate, then is our only choice to manually persist the offsets into Zookeeper? If that is the case then a) we'd like a clear, more complete code sample to be published, since the one in the Kafka streaming guide is incomplete (it lacks the actual lines of code persisting the offsets) and b) we'd like to request that SPARK-8833 be revisited as a feature worth implementing in the API. Thanks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9404) UnsafeArrayData
[ https://issues.apache.org/jira/browse/SPARK-9404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9404: --- Assignee: Apache Spark (was: Wenchen Fan) UnsafeArrayData --- Key: SPARK-9404 URL: https://issues.apache.org/jira/browse/SPARK-9404 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Apache Spark An Unsafe-based ArrayData implementation. To begin with, we can encode data this way: first 4 bytes is the # elements then each 4 byte is the start offset of the element, unless it is negative, in which case the element is null. followed by the elements themselves For example, [10, 11, 12, 13, null, 14], internally should be represented as (each 4 bytes) 5, 28, 32, 36, 40, -44, 48, 10, 11, 12, 13, 0, 14 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9438) Restarting the Zookeeper leader causes the Spark master to die when Spark master election is assigned to Zookeeper
Amir Rad created SPARK-9438: --- Summary: Restarting the Zookeeper leader causes the Spark master to die when Spark master election is assigned to Zookeeper Key: SPARK-9438 URL: https://issues.apache.org/jira/browse/SPARK-9438 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Environment: Spark 1.2.0 and Zookeeper version: 3.4.6-1569965 Reporter: Amir Rad When Spark master election is assigned to Zookeeper, restarting the leader Zookeeper causes the Spark master to die. Steps to reproduce: Create a cluster of 3 Spark nodes. Set spark-env to: SPARK_LOCAL_DIRS=/home/sparkcde/data_spark/data SPARK_MASTER_OPTS=-Dspark.deploy.spreadOut=false SPARK_WORKER_DIR=/home/sparkcde/data_spark/worker SPARK_WORKER_OPTS=-Dspark.worker.cleanup.enabled=true SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=s1:2181,s2:2181,s3:2181" Identify the Spark master and the Zookeeper leader. Stop the Zookeeper leader; check the Spark master: it is dead. Start the Zookeeper leader; check the Spark master: still dead. If you continue the same pattern of stopping and starting the Zookeeper leader, eventually you will lose the whole Spark cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5259) Fix endless stage retry by adding Task equals() and hashCode() to avoid stage.pendingTasks being non-empty while the stage's map output is available
[ https://issues.apache.org/jira/browse/SPARK-5259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid updated SPARK-5259: Target Version/s: 1.5.0 Priority: Blocker (was: Major) Fix endless stage retry by adding Task equals() and hashCode() to avoid stage.pendingTasks being non-empty while the stage's map output is available - Key: SPARK-5259 URL: https://issues.apache.org/jira/browse/SPARK-5259 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.1, 1.2.0 Reporter: SuYan Priority: Blocker 1. While a shuffle stage is retried, there may be 2 taskSets running; call them taskSet0.0 and taskSet0.1. taskSet0.1 will re-run taskSet0.0's incomplete tasks, and if taskSet0.0 finishes all the tasks that taskSet0.1 has not completed yet but that cover its partitions, then the stage's isAvailable is true. {code}
def isAvailable: Boolean = {
  if (!isShuffleMap) {
    true
  } else {
    numAvailableOutputs == numPartitions
  }
}
{code} But stage.pendingTasks is not empty, which blocks registering the mapStatus in mapOutputTracker: when a task completes successfully, pendingTasks -= task removes the Task only by reference, because Task does not override hashCode() and equals(), while numAvailableOutputs is counted by partitionId. Here is a test case that demonstrates it: {code}
test("Make sure mapStage.pendingtasks is set() " +
  "while MapStage.isAvailable is true while stage was retry") {
  val firstRDD = new MyRDD(sc, 6, Nil)
  val firstShuffleDep = new ShuffleDependency(firstRDD, null)
  val firstShuffleId = firstShuffleDep.shuffleId
  val shuffleMapRdd = new MyRDD(sc, 6, List(firstShuffleDep))
  val shuffleDep = new ShuffleDependency(shuffleMapRdd, null)
  val shuffleId = shuffleDep.shuffleId
  val reduceRdd = new MyRDD(sc, 2, List(shuffleDep))
  submit(reduceRdd, Array(0, 1))
  complete(taskSets(0), Seq(
    (Success, makeMapStatus("hostB", 1)),
    (Success, makeMapStatus("hostB", 2)),
    (Success, makeMapStatus("hostC", 3)),
    (Success, makeMapStatus("hostB", 4)),
    (Success, makeMapStatus("hostB", 5)),
    (Success, makeMapStatus("hostC", 6))
  ))
  complete(taskSets(1), Seq(
    (Success, makeMapStatus("hostA", 1)),
    (Success, makeMapStatus("hostB", 2)),
    (Success, makeMapStatus("hostA", 1)),
    (Success, makeMapStatus("hostB", 2)),
    (Success, makeMapStatus("hostA", 1))
  ))
  runEvent(ExecutorLost("exec-hostA"))
  runEvent(CompletionEvent(taskSets(1).tasks(0), Resubmitted, null, null, null, null))
  runEvent(CompletionEvent(taskSets(1).tasks(2), Resubmitted, null, null, null, null))
  runEvent(CompletionEvent(taskSets(1).tasks(0),
    FetchFailed(null, firstShuffleId, -1, 0, "Fetch meta data failed"), null, null, null, null))
  scheduler.resubmitFailedStages()
  runEvent(CompletionEvent(taskSets(1).tasks(0), Success, makeMapStatus("hostC", 1), null, null, null))
  runEvent(CompletionEvent(taskSets(1).tasks(2), Success, makeMapStatus("hostC", 1), null, null, null))
  runEvent(CompletionEvent(taskSets(1).tasks(4), Success, makeMapStatus("hostC", 1), null, null, null))
  runEvent(CompletionEvent(taskSets(1).tasks(5), Success, makeMapStatus("hostB", 2), null, null, null))
  val stage = scheduler.stageIdToStage(taskSets(1).stageId)
  assert(stage.attemptId == 2)
  assert(stage.isAvailable)
  assert(stage.pendingTasks.size == 0)
}
{code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
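The fix the title proposes is value identity for tasks, so that removal from pendingTasks matches the retried attempt as well as the original object. A minimal sketch (a hypothetical {{Task}} class; Spark's actual Task carries more state):
{code}
// Give tasks value identity by stage and partition, so that
// `pendingTasks -= task` removes any attempt for that partition.
class Task(val stageId: Int, val partitionId: Int) {
  override def equals(other: Any): Boolean = other match {
    case t: Task => t.stageId == stageId && t.partitionId == partitionId
    case _       => false
  }
  override def hashCode: Int = 31 * stageId + partitionId
}
{code}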
[jira] [Updated] (SPARK-746) Automatically Use Avro Serialization for Avro Objects
[ https://issues.apache.org/jira/browse/SPARK-746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid updated SPARK-746: --- Assignee: Joseph Batchik Automatically Use Avro Serialization for Avro Objects - Key: SPARK-746 URL: https://issues.apache.org/jira/browse/SPARK-746 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Patrick Cogan Assignee: Joseph Batchik All generated objects extend org.apache.avro.specific.SpecificRecordBase (or there may be a higher-up class as well). Since Avro records aren't Java-serializable by default, people currently have to wrap their records. It would be good if we could use an implicit conversion to do this for them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
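One way users work around this today is a custom Kryo serializer that delegates to Avro's specific reader/writer. A sketch under that assumption (illustrative only; it is not the implicit-conversion approach this ticket proposes, and {{AvroKryoSerializer}} is a hypothetical name):
{code}
import com.esotericsoftware.kryo.{Kryo, Serializer}
import com.esotericsoftware.kryo.io.{Input, Output}
import org.apache.avro.io.{DecoderFactory, EncoderFactory}
import org.apache.avro.specific.{SpecificDatumReader, SpecificDatumWriter, SpecificRecordBase}

// Serializes any generated Avro record via its own schema, so the record
// class itself never needs to be java.io.Serializable.
class AvroKryoSerializer[T <: SpecificRecordBase](cls: Class[T]) extends Serializer[T] {
  private val schema = cls.newInstance().getSchema
  private val writer = new SpecificDatumWriter[T](schema)
  private val reader = new SpecificDatumReader[T](schema)

  override def write(kryo: Kryo, output: Output, record: T): Unit = {
    val encoder = EncoderFactory.get().binaryEncoder(output, null)
    writer.write(record, encoder)
    encoder.flush()
  }

  override def read(kryo: Kryo, input: Input, clazz: Class[T]): T =
    reader.read(null.asInstanceOf[T], DecoderFactory.get().binaryDecoder(input, null))
}
{code}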
[jira] [Commented] (SPARK-5079) Detect failed jobs / batches in Spark Streaming unit tests
[ https://issues.apache.org/jira/browse/SPARK-5079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646301#comment-14646301 ] Hunter Morgan commented on SPARK-5079: -- Would it be feasible to restrict test dependencies so that only a single slf4j implementation is present, make that implementation http://projects.lidalia.org.uk/slf4j-test/, and look for log messages to determine job failure? Detect failed jobs / batches in Spark Streaming unit tests -- Key: SPARK-5079 URL: https://issues.apache.org/jira/browse/SPARK-5079 Project: Spark Issue Type: Bug Components: Streaming Reporter: Josh Rosen Assignee: Ilya Ganelin Currently, it is possible to write Spark Streaming unit tests where Spark jobs fail but the streaming tests succeed because we rely on wall-clock time plus output comparison in order to check whether a test has passed, and hence may miss cases where errors occurred if they didn't affect these results. We should strengthen the tests to check that no job failures occurred while processing batches. See https://github.com/apache/spark/pull/3832#issuecomment-68580794 for additional context. The StreamingTestWaiter in https://github.com/apache/spark/pull/3801 might also fix this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5945) Spark should not retry a stage infinitely on a FetchFailedException
[ https://issues.apache.org/jira/browse/SPARK-5945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid updated SPARK-5945: Target Version/s: 1.5.0 Spark should not retry a stage infinitely on a FetchFailedException --- Key: SPARK-5945 URL: https://issues.apache.org/jira/browse/SPARK-5945 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Imran Rashid Assignee: Ilya Ganelin Priority: Blocker While investigating SPARK-5928, I noticed some very strange behavior in the way spark retries stages after a FetchFailedException. It seems that on a FetchFailedException, instead of simply killing the task and retrying, Spark aborts the stage and retries. If it just retried the task, the task might fail 4 times and then trigger the usual job killing mechanism. But by killing the stage instead, the max retry logic is skipped (it looks to me like there is no limit for retries on a stage). After a bit of discussion with Kay Ousterhout, it seems the idea is that if a fetch fails, we assume that the block manager we are fetching from has failed, and that it will succeed if we retry the stage w/out that block manager. In that case, it wouldn't make any sense to retry the task, since it's doomed to fail every time, so we might as well kill the whole stage. But this raises two questions: 1) Is it really safe to assume that a FetchFailedException means that the BlockManager has failed, and it will work if we just try another one? SPARK-5928 shows that there are at least some cases where that assumption is wrong. Even if we fix that case, this logic seems brittle to the next case we find. I guess the idea is that this behavior is what gives us the R in RDD ... but it seems like it's not really that robust and maybe should be reconsidered. 2) Should stages only be retried a limited number of times? It would be pretty easy to put in a limited number of retries per stage. Though again, we encounter issues with keeping things resilient: theoretically one stage could have many retries due to failures in different stages further downstream, so we might need to track the cause of each retry as well to still have the desired behavior. In general it just seems there is some flakiness in the retry logic. This is the only reproducible example I have at the moment, but I vaguely recall hitting other cases of strange behavior w/ retries when trying to run long pipelines. E.g., if one executor is stuck in a GC during a fetch, the fetch fails, but the executor eventually comes back and the stage gets retried again, but the same GC issues happen the second time around, etc. Copied from SPARK-5928, here's the example program that can regularly produce a loop of stage failures. Note that it will only fail from a remote fetch, so it can't be run locally -- I ran with {{MASTER=yarn-client spark-shell --num-executors 2 --executor-memory 4000m}} {code}
val rdd = sc.parallelize(1 to 1e6.toInt, 1).map { ignore =>
  val n = 3e3.toInt
  val arr = new Array[Byte](n)
  // need to make sure the array doesn't compress to something small
  scala.util.Random.nextBytes(arr)
  arr
}
rdd.map { x => (1, x) }.groupByKey().count()
{code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
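A minimal sketch of point (2), a bounded per-stage retry count (all names here are hypothetical, not the DAGScheduler's actual fields; the resubmit/abort hooks are passed in as assumptions):
{code}
import scala.collection.mutable

val maxStageFailures = 4
val stageFailureCounts = mutable.Map.empty[Int, Int].withDefaultValue(0)

// On a FetchFailedException, either resubmit the stage or give up once the
// stage has failed too many times, instead of looping forever.
def onStageFetchFailed(stageId: Int,
                       resubmit: Int => Unit,
                       abort: Int => Unit): Unit = {
  stageFailureCounts(stageId) += 1
  if (stageFailureCounts(stageId) > maxStageFailures) abort(stageId)
  else resubmit(stageId)
}
{code}
As the description notes, a real implementation would also need to attribute each retry to its cause, so that failures triggered by downstream stages don't exhaust an innocent stage's budget.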
[jira] [Commented] (SPARK-9246) DistributedLDAModel predict top docs per topic
[ https://issues.apache.org/jira/browse/SPARK-9246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646312#comment-14646312 ] Meihua Wu commented on SPARK-9246: -- Cool, I see. Will keep updating about the progress. I have a question: is topDocumentsPerTopic exact or approximate (like describeTopics which, according to the ScalaDoc, "may not return exactly the top-weighted terms for each topic; to get a more precise set of top terms, increase maxTermsPerTopic")? DistributedLDAModel predict top docs per topic -- Key: SPARK-9246 URL: https://issues.apache.org/jira/browse/SPARK-9246 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Joseph K. Bradley Original Estimate: 72h Remaining Estimate: 72h For each topic, return top documents based on topicDistributions. Synopsis: {code}
/**
 * @param maxDocuments Max docs to return for each topic
 * @return Array over topics of (sorted top docs, corresponding doc-topic weights)
 */
def topDocumentsPerTopic(maxDocuments: Int): Array[(Array[Long], Array[Double])]
{code} Note: We will need to make sure that the above return value format is Java-friendly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
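An exact implementation is possible with one top-k pass per topic over the model's topicDistributions (an RDD of doc id to topic mixture). A sketch, not the eventual DistributedLDAModel method ({{k}} here is the number of topics):
{code}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// For each topic, take the maxDocuments docs with the largest weight on that
// topic; takeOrdered makes each pass exact, unlike describeTopics.
def topDocumentsPerTopic(topicDistributions: RDD[(Long, Vector)],
                         k: Int,
                         maxDocuments: Int): Array[(Array[Long], Array[Double])] = {
  Array.tabulate(k) { topic =>
    val top = topicDistributions
      .map { case (docId, dist) => (docId, dist(topic)) }
      .takeOrdered(maxDocuments)(Ordering.by[(Long, Double), Double](_._2).reverse)
    (top.map(_._1), top.map(_._2))
  }
}
{code}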
[jira] [Updated] (SPARK-3181) Add Robust Regression Algorithm with Huber Estimator
[ https://issues.apache.org/jira/browse/SPARK-3181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-3181: - Shepherd: Xiangrui Meng Add Robust Regression Algorithm with Huber Estimator Key: SPARK-3181 URL: https://issues.apache.org/jira/browse/SPARK-3181 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Fan Jiang Assignee: Fan Jiang Labels: features Original Estimate: 0h Remaining Estimate: 0h Linear least squares estimates assume the error has a normal distribution and can behave badly when the errors are heavy-tailed. In practice we encounter various types of data, so we need to include robust regression, which employs a fitting criterion that is not as vulnerable as least squares. In 1973, Huber introduced M-estimation for regression, which stands for maximum likelihood type estimation. The method is resistant to outliers in the response variable and has been widely used. The new feature for MLlib will contain 3 new files: /main/scala/org/apache/spark/mllib/regression/RobustRegression.scala /test/scala/org/apache/spark/mllib/regression/RobustRegressionSuite.scala /main/scala/org/apache/spark/examples/mllib/HuberRobustRegression.scala and one new class HuberRobustGradient in /main/scala/org/apache/spark/mllib/optimization/Gradient.scala -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
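For reference, the Huber criterion is quadratic for residuals within +/- delta and linear outside, so large residuals contribute a bounded gradient. A minimal sketch of the loss and its derivative with respect to the residual r = prediction - label (1.345 is the conventional tuning constant for ~95% efficiency under normal errors):
{code}
import scala.math.abs

def huberLoss(r: Double, delta: Double = 1.345): Double =
  if (abs(r) <= delta) 0.5 * r * r                   // least-squares regime
  else delta * (abs(r) - 0.5 * delta)                // linear regime for outliers

def huberGradient(r: Double, delta: Double = 1.345): Double =
  if (abs(r) <= delta) r else delta * math.signum(r) // gradient is clipped at delta
{code}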
[jira] [Resolved] (SPARK-9127) Rand/Randn codegen fails with long seed
[ https://issues.apache.org/jira/browse/SPARK-9127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-9127. Resolution: Fixed Fix Version/s: 1.5.0 Rand/Randn codegen fails with long seed --- Key: SPARK-9127 URL: https://issues.apache.org/jira/browse/SPARK-9127 Project: Spark Issue Type: Bug Components: SQL Reporter: Davies Liu Assignee: Reynold Xin Priority: Blocker Fix For: 1.5.0 {code}
ERROR GenerateMutableProjection: failed to compile:

public Object generate(org.apache.spark.sql.catalyst.expressions.Expression[] expr) {
  return new SpecificProjection(expr);
}

class SpecificProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseMutableProjection {

  private org.apache.spark.sql.catalyst.expressions.Expression[] expressions = null;
  private org.apache.spark.sql.catalyst.expressions.MutableRow mutableRow = null;
  private org.apache.spark.util.random.XORShiftRandom rng4 =
      new org.apache.spark.util.random.XORShiftRandom(-5419823303878592871 + org.apache.spark.TaskContext.getPartitionId());

  public SpecificProjection(org.apache.spark.sql.catalyst.expressions.Expression[] expr) {
    expressions = expr;
    mutableRow = new org.apache.spark.sql.catalyst.expressions.GenericMutableRow(2);
  }

  public org.apache.spark.sql.catalyst.expressions.codegen.BaseMutableProjection target(org.apache.spark.sql.catalyst.expressions.MutableRow row) {
    mutableRow = row;
    return this;
  }

  /* Provide immutable access to the last projected row. */
  public InternalRow currentValue() {
    return (InternalRow) mutableRow;
  }

  public Object apply(Object _i) {
    InternalRow i = (InternalRow) _i;
    boolean isNull0 = i.isNullAt(0);
    long primitive1 = isNull0 ? -1L : (i.getLong(0));
    if (isNull0) mutableRow.setNullAt(0); else mutableRow.setLong(0, primitive1);
    final double primitive3 = rng4.nextDouble();
    if (false) mutableRow.setNullAt(1); else mutableRow.setDouble(1, primitive3);
    return mutableRow;
  }
}

org.codehaus.commons.compiler.CompileException: Line 10, Column 117: Value of decimal integer literal '5419823303878592871' is out of range
  at org.codehaus.janino.UnitCompiler.compileException(UnitCompiler.java:10473)
  at org.codehaus.janino.UnitCompiler.getConstantValue2(UnitCompiler.java:4696)
  at org.codehaus.janino.UnitCompiler.access$9200(UnitCompiler.java:185)
  at org.codehaus.janino.UnitCompiler$11.visitIntegerLiteral(UnitCompiler.java:4402)
  at org.codehaus.janino.Java$IntegerLiteral.accept(Java.java:4321)
  at org.codehaus.janino.UnitCompiler.getConstantValue(UnitCompiler.java:4427)
  at org.codehaus.janino.UnitCompiler.getNegatedConstantValue2(UnitCompiler.java:4856)
  at org.codehaus.janino.UnitCompiler.getNegatedConstantValue2(UnitCompiler.java:4890)
  at org.codehaus.janino.UnitCompiler.access$10400(UnitCompiler.java:185)
  at org.codehaus.janino.UnitCompiler$12.visitIntegerLiteral(UnitCompiler.java:4823)
  at org.codehaus.janino.Java$IntegerLiteral.accept(Java.java:4321)
  at org.codehaus.janino.UnitCompiler.getNegatedConstantValue(UnitCompiler.java:4848)
  at org.codehaus.janino.UnitCompiler.getConstantValue2(UnitCompiler.java:4451)
  at org.codehaus.janino.UnitCompiler.access$8800(UnitCompiler.java:185)
  at org.codehaus.janino.UnitCompiler$11.visitUnaryOperation(UnitCompiler.java:4393)
  at org.codehaus.janino.Java$UnaryOperation.accept(Java.java:3647)
  at org.codehaus.janino.UnitCompiler.getConstantValue(UnitCompiler.java:4427)
  at org.codehaus.janino.UnitCompiler.getConstantValue2(UnitCompiler.java:4498)
  at org.codehaus.janino.UnitCompiler.access$8900(UnitCompiler.java:185)
  at org.codehaus.janino.UnitCompiler$11.visitBinaryOperation(UnitCompiler.java:4394)
  at org.codehaus.janino.Java$BinaryOperation.accept(Java.java:3768)
  at org.codehaus.janino.UnitCompiler.getConstantValue(UnitCompiler.java:4427)
  at org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4360)
  at org.codehaus.janino.UnitCompiler.invokeConstructor(UnitCompiler.java:6681)
  at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4126)
  at org.codehaus.janino.UnitCompiler.access$7600(UnitCompiler.java:185)
  at org.codehaus.janino.UnitCompiler$10.visitNewClassInstance(UnitCompiler.java:3275)
  at
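The root cause is visible in the generated source above: the Long seed is emitted as a bare decimal literal, which Janino rejects once its magnitude exceeds Int range. A sketch of the bug and the fix shape (illustrative; not the actual patch):
{code}
// Emitting a Long seed into generated Java source:
val seed = -5419823303878592871L
val broken = s"new XORShiftRandom($seed + partitionId)"     // bare literal: Janino "out of range"
val fixed  = s"new XORShiftRandom(${seed}L + partitionId)"  // L suffix makes it a long literal
{code}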
[jira] [Updated] (SPARK-5945) Spark should not retry a stage infinitely on a FetchFailedException
[ https://issues.apache.org/jira/browse/SPARK-5945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid updated SPARK-5945: Priority: Blocker (was: Major) Spark should not retry a stage infinitely on a FetchFailedException --- Key: SPARK-5945 URL: https://issues.apache.org/jira/browse/SPARK-5945 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Imran Rashid Assignee: Ilya Ganelin Priority: Blocker While investigating SPARK-5928, I noticed some very strange behavior in the way spark retries stages after a FetchFailedException. It seems that on a FetchFailedException, instead of simply killing the task and retrying, Spark aborts the stage and retries. If it just retried the task, the task might fail 4 times and then trigger the usual job killing mechanism. But by killing the stage instead, the max retry logic is skipped (it looks to me like there is no limit for retries on a stage). After a bit of discussion with Kay Ousterhout, it seems the idea is that if a fetch fails, we assume that the block manager we are fetching from has failed, and that it will succeed if we retry the stage w/out that block manager. In that case, it wouldn't make any sense to retry the task, since it's doomed to fail every time, so we might as well kill the whole stage. But this raises two questions: 1) Is it really safe to assume that a FetchFailedException means that the BlockManager has failed, and it will work if we just try another one? SPARK-5928 shows that there are at least some cases where that assumption is wrong. Even if we fix that case, this logic seems brittle to the next case we find. I guess the idea is that this behavior is what gives us the R in RDD ... but it seems like it's not really that robust and maybe should be reconsidered. 2) Should stages only be retried a limited number of times? It would be pretty easy to put in a limited number of retries per stage. Though again, we encounter issues with keeping things resilient: theoretically one stage could have many retries due to failures in different stages further downstream, so we might need to track the cause of each retry as well to still have the desired behavior. In general it just seems there is some flakiness in the retry logic. This is the only reproducible example I have at the moment, but I vaguely recall hitting other cases of strange behavior w/ retries when trying to run long pipelines. E.g., if one executor is stuck in a GC during a fetch, the fetch fails, but the executor eventually comes back and the stage gets retried again, but the same GC issues happen the second time around, etc. Copied from SPARK-5928, here's the example program that can regularly produce a loop of stage failures. Note that it will only fail from a remote fetch, so it can't be run locally -- I ran with {{MASTER=yarn-client spark-shell --num-executors 2 --executor-memory 4000m}} {code}
val rdd = sc.parallelize(1 to 1e6.toInt, 1).map { ignore =>
  val n = 3e3.toInt
  val arr = new Array[Byte](n)
  // need to make sure the array doesn't compress to something small
  scala.util.Random.nextBytes(arr)
  arr
}
rdd.map { x => (1, x) }.groupByKey().count()
{code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8029) ShuffleMapTasks must be robust to concurrent attempts on the same executor
[ https://issues.apache.org/jira/browse/SPARK-8029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid updated SPARK-8029: Target Version/s: 1.5.0 Priority: Blocker (was: Major) ShuffleMapTasks must be robust to concurrent attempts on the same executor -- Key: SPARK-8029 URL: https://issues.apache.org/jira/browse/SPARK-8029 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: Imran Rashid Assignee: Imran Rashid Priority: Blocker Attachments: AlternativesforMakingShuffleMapTasksRobusttoMultipleAttempts.pdf When stages get retried, a task may have more than one attempt running at the same time, on the same executor. Currently this causes problems for ShuffleMapTasks, since all attempts try to write to the same output files. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5079) Detect failed jobs / batches in Spark Streaming unit tests
[ https://issues.apache.org/jira/browse/SPARK-5079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646441#comment-14646441 ] Hunter Morgan commented on SPARK-5079: -- I'm successfully using this pattern to test a stream element parsing helper. {code:title=Sample test code|borderStyle=solid}
assumeThat("slf4j is bound to slf4j-test",
    StaticLoggerBinder.getSingleton().getLoggerFactoryClassStr(),
    is("uk.org.lidalia.slf4jtest.TestLoggerFactory"));
TestLogger logger = TestLoggerFactory.getTestLogger(
    "org.apache.spark.scheduler.TaskSetManager");
logger.setEnabledLevelsForAllThreads(ImmutableSet.of(Level.ERROR));
// ...
List<LoggingEvent> failures = logger.getAllLoggingEvents().stream()
    .filter(loggingEvent -> loggingEvent.getMessage().contains("failed"))
    .collect(Collectors.toList());
assertEquals("jobs didn't fail", 0, failures.size());
{code} Detect failed jobs / batches in Spark Streaming unit tests -- Key: SPARK-5079 URL: https://issues.apache.org/jira/browse/SPARK-5079 Project: Spark Issue Type: Bug Components: Streaming Reporter: Josh Rosen Assignee: Ilya Ganelin Currently, it is possible to write Spark Streaming unit tests where Spark jobs fail but the streaming tests succeed because we rely on wall-clock time plus output comparison in order to check whether a test has passed, and hence may miss cases where errors occurred if they didn't affect these results. We should strengthen the tests to check that no job failures occurred while processing batches. See https://github.com/apache/spark/pull/3832#issuecomment-68580794 for additional context. The StreamingTestWaiter in https://github.com/apache/spark/pull/3801 might also fix this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9352) Add tests for standalone scheduling code
[ https://issues.apache.org/jira/browse/SPARK-9352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646486#comment-14646486 ] Andrew Or commented on SPARK-9352: -- This is actually already fixed. It was reverted at one point but it's merged back into 1.4 Add tests for standalone scheduling code Key: SPARK-9352 URL: https://issues.apache.org/jira/browse/SPARK-9352 Project: Spark Issue Type: Improvement Components: Deploy, Tests Affects Versions: 1.0.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Critical Fix For: 1.4.2, 1.5.0 There are no tests for the standalone Master scheduling code! This has caused issues like SPARK-8881 and SPARK-9260 in the past. It is crucial that we have some level of confidence that this code actually works... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9441) NoSuchMethodError: com.typesafe.config.Config.getDuration
[ https://issues.apache.org/jira/browse/SPARK-9441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nirav patel updated SPARK-9441: --- Description: I recently migrated my Spark-based REST service from 1.0.2 to 1.3.1 15/07/29 10:31:12 INFO spark.SparkContext: Running Spark version 1.3.1 15/07/29 10:31:12 INFO spark.SecurityManager: Changing view acls to: npatel 15/07/29 10:31:12 INFO spark.SecurityManager: Changing modify acls to: npatel 15/07/29 10:31:12 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(npatel); users with modify permissions: Set(npatel) Exception in thread "main" java.lang.NoSuchMethodError: com.typesafe.config.Config.getDuration(Ljava/lang/String;Ljava/util/concurrent/TimeUnit;)J at akka.util.Helpers$ConfigOps$.akka$util$Helpers$ConfigOps$$getDuration$extension(Helpers.scala:125) at akka.util.Helpers$ConfigOps$.getMillisDuration$extension(Helpers.scala:120) at akka.actor.ActorSystem$Settings.<init>(ActorSystem.scala:171) at akka.actor.ActorSystemImpl.<init>(ActorSystem.scala:504) at akka.actor.ActorSystem$.apply(ActorSystem.scala:141) at akka.actor.ActorSystem$.apply(ActorSystem.scala:118) at org.apache.spark.util.AkkaUtils$.org$apache$spark$util$AkkaUtils$$doCreateActorSystem(AkkaUtils.scala:122) at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:55) at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:54) at org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1837) at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1828) at org.apache.spark.util.AkkaUtils$.createActorSystem(AkkaUtils.scala:57) at org.apache.spark.SparkEnv$.create(SparkEnv.scala:223) at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:163) at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:269) at org.apache.spark.SparkContext.<init>(SparkContext.scala:272) I read blogs where people suggest modifying the classpath to put the right version first, putting the Scala libs first on the classpath, and similar suggestions, which is all ridiculous. I think the typesafe config package included with the spark-core lib is incorrect. I did the following with my Maven build and now it works, but I think someone needs to fix the spark-core package.
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <exclusions>
    <exclusion>
      <artifactId>config</artifactId>
      <groupId>com.typesafe</groupId>
    </exclusion>
  </exclusions>
</dependency>
<dependency>
  <groupId>com.typesafe</groupId>
  <artifactId>config</artifactId>
  <version>1.2.1</version>
</dependency>
was: I recently migrated my spark based rest service from 1.0.2 to 1.3.1 15/07/29 10:31:12 INFO spark.SparkContext: Running Spark version 1.3.1 15/07/29 10:31:12 INFO spark.SecurityManager: Changing view acls to: npatel 15/07/29 10:31:12 INFO spark.SecurityManager: Changing modify acls to: npatel 15/07/29 10:31:12 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(npatel); users with modify permissions: Set(npatel) Exception in thread "main" java.lang.NoSuchMethodError: com.typesafe.config.Config.getDuration(Ljava/lang/String;Ljava/util/concurrent/TimeUnit;)J at akka.util.Helpers$ConfigOps$.akka$util$Helpers$ConfigOps$$getDuration$extension(Helpers.scala:125) at akka.util.Helpers$ConfigOps$.getMillisDuration$extension(Helpers.scala:120) at akka.actor.ActorSystem$Settings.<init>(ActorSystem.scala:171) at akka.actor.ActorSystemImpl.<init>(ActorSystem.scala:504) at akka.actor.ActorSystem$.apply(ActorSystem.scala:141) at akka.actor.ActorSystem$.apply(ActorSystem.scala:118) at org.apache.spark.util.AkkaUtils$.org$apache$spark$util$AkkaUtils$$doCreateActorSystem(AkkaUtils.scala:122) at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:55) at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:54) at org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1837) at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1828) at org.apache.spark.util.AkkaUtils$.createActorSystem(AkkaUtils.scala:57) at org.apache.spark.SparkEnv$.create(SparkEnv.scala:223) at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:163) at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:269) at
[jira] [Commented] (SPARK-5133) Feature Importance for Decision Tree (Ensembles)
[ https://issues.apache.org/jira/browse/SPARK-5133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646535#comment-14646535 ] Joseph K. Bradley commented on SPARK-5133: -- That's my hope... I'll try to send a PR later today. Feature Importance for Decision Tree (Ensembles) Key: SPARK-5133 URL: https://issues.apache.org/jira/browse/SPARK-5133 Project: Spark Issue Type: New Feature Components: ML, MLlib Reporter: Peter Prettenhofer Original Estimate: 168h Remaining Estimate: 168h Add feature importance to decision tree model and tree ensemble models. If people are interested in this feature I could implement it given a mentor (API decisions, etc). Please find a description of the feature below: Decision trees intrinsically perform feature selection by selecting appropriate split points. This information can be used to assess the relative importance of a feature. Relative feature importance gives valuable insight into a decision tree or tree ensemble and can even be used for feature selection. More information on feature importance (via decrease in impurity) can be found in ESLII (10.13.1) or here [1]. R's randomForest package uses a different technique for assessing variable importance that is based on permutation tests. All necessary information to create relative importance scores should be available in the tree representation (class Node; split, impurity gain, (weighted) nr of samples?). [1] http://scikit-learn.org/stable/modules/ensemble.html#feature-importance-evaluation -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
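For reference, decrease-in-impurity importance can be computed with a single traversal that credits each internal node's split feature with its sample-weighted impurity gain, then normalizes. A sketch over a hypothetical node type (MLlib's actual Node class differs):
{code}
// Hypothetical tree node: split feature, impurity gain, and sample count.
case class Node(featureIndex: Int, gain: Double, numSamples: Long,
                left: Option[Node], right: Option[Node])

def featureImportances(root: Node, numFeatures: Int, totalSamples: Long): Array[Double] = {
  val scores = new Array[Double](numFeatures)
  def visit(node: Node): Unit = {
    if (node.left.nonEmpty || node.right.nonEmpty) {   // internal nodes only
      scores(node.featureIndex) += node.gain * node.numSamples / totalSamples
      node.left.foreach(visit)
      node.right.foreach(visit)
    }
  }
  visit(root)
  val total = scores.sum
  if (total > 0) scores.map(_ / total) else scores     // normalize to sum to 1
}
{code}
For an ensemble, the per-tree scores would simply be averaged before normalizing.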
[jira] [Created] (SPARK-9441) NoSuchMethodError: Com.typesafe.config.Config.getDuration
nirav patel created SPARK-9441: -- Summary: NoSuchMethodError: Com.typesafe.config.Config.getDuration Key: SPARK-9441 URL: https://issues.apache.org/jira/browse/SPARK-9441 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.3.1 Reporter: nirav patel I recently migrated my Spark-based REST service from 1.0.2 to 1.3.1 15/07/29 10:31:12 INFO spark.SparkContext: Running Spark version 1.3.1 15/07/29 10:31:12 INFO spark.SecurityManager: Changing view acls to: npatel 15/07/29 10:31:12 INFO spark.SecurityManager: Changing modify acls to: npatel 15/07/29 10:31:12 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(npatel); users with modify permissions: Set(npatel) Exception in thread "main" java.lang.NoSuchMethodError: com.typesafe.config.Config.getDuration(Ljava/lang/String;Ljava/util/concurrent/TimeUnit;)J at akka.util.Helpers$ConfigOps$.akka$util$Helpers$ConfigOps$$getDuration$extension(Helpers.scala:125) at akka.util.Helpers$ConfigOps$.getMillisDuration$extension(Helpers.scala:120) at akka.actor.ActorSystem$Settings.<init>(ActorSystem.scala:171) at akka.actor.ActorSystemImpl.<init>(ActorSystem.scala:504) at akka.actor.ActorSystem$.apply(ActorSystem.scala:141) at akka.actor.ActorSystem$.apply(ActorSystem.scala:118) at org.apache.spark.util.AkkaUtils$.org$apache$spark$util$AkkaUtils$$doCreateActorSystem(AkkaUtils.scala:122) at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:55) at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:54) at org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1837) at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1828) at org.apache.spark.util.AkkaUtils$.createActorSystem(AkkaUtils.scala:57) at org.apache.spark.SparkEnv$.create(SparkEnv.scala:223) at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:163) at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:269) at org.apache.spark.SparkContext.<init>(SparkContext.scala:272) I read blog posts where people suggest modifying the classpath and similar workarounds, which is ridiculous. I think the typesafe config package included with the spark-core lib is incorrect. I did the following with my Maven build and now it works, but I think someone needs to fix the spark-core package. <dependency> <groupId>org.apache.spark</groupId> <artifactId>spark-core_2.10</artifactId> <exclusions> <exclusion> <artifactId>config</artifactId> <groupId>com.typesafe</groupId> </exclusion> </exclusions> </dependency> <dependency> <groupId>com.typesafe</groupId> <artifactId>config</artifactId> <version>1.2.1</version> </dependency> -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
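For readers hitting the same error: Config.getDuration(String, TimeUnit) only exists in typesafe config 1.2.0 and later, so the NoSuchMethodError indicates an older config jar winning on the classpath, which is why pinning config 1.2.1 as above resolves it. A small diagnostic sketch (not from the ticket; the "timeout" config value is made up):
{code}
import java.util.concurrent.TimeUnit
import com.typesafe.config.{Config, ConfigFactory}

object ConfigVersionCheck {
  def main(args: Array[String]): Unit = {
    // Print which jar Config was loaded from; a pre-1.2.0 typesafe-config
    // jar here would explain the NoSuchMethodError above.
    val location = classOf[Config].getProtectionDomain.getCodeSource.getLocation
    println(s"com.typesafe.config.Config loaded from: $location")

    // This overload was added in typesafe config 1.2.0; with an older jar
    // on the classpath it fails at runtime exactly as in the stack trace.
    val config = ConfigFactory.parseString("timeout = 5s")
    println(config.getDuration("timeout", TimeUnit.MILLISECONDS)) // 5000
  }
}
{code}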
[jira] [Commented] (SPARK-8198) date/time function: months_between
[ https://issues.apache.org/jira/browse/SPARK-8198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646584#comment-14646584 ] Apache Spark commented on SPARK-8198: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/7754 date/time function: months_between -- Key: SPARK-8198 URL: https://issues.apache.org/jira/browse/SPARK-8198 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin months_between(date1, date2): double Returns number of months between dates date1 and date2 (as of Hive 1.2.0). If date1 is later than date2, then the result is positive. If date1 is earlier than date2, then the result is negative. If date1 and date2 are either the same days of the month or both last days of months, then the result is always an integer. Otherwise the UDF calculates the fractional portion of the result based on a 31-day month and considers the difference in time components of date1 and date2. date1 and date2 type can be date, timestamp or string in the format 'yyyy-MM-dd' or 'yyyy-MM-dd HH:mm:ss'. The result is rounded to 8 decimal places. Example: months_between('1997-02-28 10:30:00', '1996-10-30') = 3.94959677 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
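As a worked illustration of the semantics above, here is a rough Scala sketch; it follows the description directly and is not Spark's actual implementation, which lives in the SQL datetime expressions:
{code}
import java.time.LocalDateTime
import java.time.temporal.TemporalAdjusters

def monthsBetween(d1: LocalDateTime, d2: LocalDateTime): Double = {
  val monthDiff = (d1.getYear - d2.getYear) * 12 + (d1.getMonthValue - d2.getMonthValue)
  val bothLastDay =
    d1.toLocalDate == d1.toLocalDate.`with`(TemporalAdjusters.lastDayOfMonth()) &&
    d2.toLocalDate == d2.toLocalDate.`with`(TemporalAdjusters.lastDayOfMonth())
  if (d1.getDayOfMonth == d2.getDayOfMonth || bothLastDay) {
    // Same day of month, or both last days: the result is an integer.
    monthDiff.toDouble
  } else {
    // Fractional part assumes a 31-day month and includes time components.
    val secondsIntoMonth1 = (d1.getDayOfMonth - 1) * 86400L + d1.toLocalTime.toSecondOfDay
    val secondsIntoMonth2 = (d2.getDayOfMonth - 1) * 86400L + d2.toLocalTime.toSecondOfDay
    val fraction = (secondsIntoMonth1 - secondsIntoMonth2) / (31.0 * 86400.0)
    math.round((monthDiff + fraction) * 1e8) / 1e8 // round to 8 decimal places
  }
}
{code}
For the example inputs, monthDiff is 4 and the fraction is -135000 / 2678400, so the sketch returns 3.94959677, matching the expected value.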
[jira] [Assigned] (SPARK-9133) Add and Subtract should support date/timestamp and interval type
[ https://issues.apache.org/jira/browse/SPARK-9133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9133: --- Assignee: Apache Spark (was: Davies Liu) Add and Subtract should support date/timestamp and interval type Key: SPARK-9133 URL: https://issues.apache.org/jira/browse/SPARK-9133 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Apache Spark Should support: date + interval, interval + date, timestamp + interval, interval + timestamp. The best way to support this is probably to resolve this to a date add/subtract expression, rather than making add/subtract support these types. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
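A toy illustration of that resolution strategy, using simplified stand-in expression classes (Attr, Add, TimeAdd are hypothetical, not Catalyst's real expression types):
{code}
// Hypothetical mini expression tree; Catalyst's real classes differ.
sealed trait Expr { def dataType: String }
case class Attr(name: String, dataType: String) extends Expr
case class Add(left: Expr, right: Expr) extends Expr {
  val dataType: String = left.dataType
}
case class TimeAdd(time: Expr, interval: Expr) extends Expr {
  val dataType: String = time.dataType
}

// Rewrite date/timestamp + interval into a dedicated expression instead of
// teaching the generic Add about these operand types.
def resolvePlus(e: Expr): Expr = e match {
  case Add(l, r) if r.dataType == "interval" => TimeAdd(l, r)
  case Add(l, r) if l.dataType == "interval" => TimeAdd(r, l)
  case other => other
}

// resolvePlus(Add(Attr("ts", "timestamp"), Attr("i", "interval")))
//   => TimeAdd(Attr("ts", "timestamp"), Attr("i", "interval"))
{code}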
[jira] [Commented] (SPARK-8187) date/time function: date_sub
[ https://issues.apache.org/jira/browse/SPARK-8187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646582#comment-14646582 ] Apache Spark commented on SPARK-8187: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/7754 date/time function: date_sub Key: SPARK-8187 URL: https://issues.apache.org/jira/browse/SPARK-8187 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Adrian Wang date_sub(timestamp startdate, int days): timestamp date_sub(timestamp startdate, interval i): timestamp date_sub(date date, int days): date date_sub(date date, interval i): date -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9444) RemoveEvaluationFromSort reorders sort order
Michael Armbrust created SPARK-9444: --- Summary: RemoveEvaluationFromSort reorders sort order Key: SPARK-9444 URL: https://issues.apache.org/jira/browse/SPARK-9444 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Wenchen Fan Priority: Blocker {code} Seq((1,2,3,4)).toDF("a", "b", "c", "d").registerTempTable("test") scala> sqlContext.sql("SELECT * FROM test ORDER BY a + 1, b").queryExecution == Parsed Logical Plan == 'Sort [('a + 1) ASC,'b ASC], true 'Project [unresolvedalias(*)] 'UnresolvedRelation [test], None == Optimized Logical Plan == Project [a#4,b#5,c#6,d#7] Sort [b#5 ASC,_sortCondition#9 ASC], true LocalRelation [a#4,b#5,c#6,d#7,_sortCondition#9], [[1,2,3,4,2]] {code} Notice how we are now sorting by b before a + 1. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9446) Clear Active SparkContext in stop() method
[ https://issues.apache.org/jira/browse/SPARK-9446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9446: --- Assignee: Apache Spark Clear Active SparkContext in stop() method -- Key: SPARK-9446 URL: https://issues.apache.org/jira/browse/SPARK-9446 Project: Spark Issue Type: Bug Reporter: Ted Yu Assignee: Apache Spark In thread 'stopped SparkContext remaining active' on mailing list, Andres observed the following in driver log: {code} 15/07/29 15:17:09 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster has disassociated: address removed 15/07/29 15:17:09 INFO YarnClientSchedulerBackend: Shutting down all executors Exception in thread Yarn application state monitor org.apache.spark.SparkException: Error asking standalone scheduler to shut down executors at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stopExecutors(CoarseGrainedSchedulerBackend.scala:261) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stop(CoarseGrainedSchedulerBackend.scala:266) at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.stop(YarnClientSchedulerBackend.scala:158) at org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:416) at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1411) at org.apache.spark.SparkContext.stop(SparkContext.scala:1644) at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$$anon$1.run(YarnClientSchedulerBackend.scala:139) Caused by: java.lang.InterruptedException at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1325) at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208) at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:190)15/07/29 15:17:09 INFO YarnClientSchedulerBackend: Asking each executor to shut down at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102) at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:78) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stopExecutors(CoarseGrainedSchedulerBackend.scala:257) ... 6 more {code} Effect of the above exception is that a stopped SparkContext is returned to user since SparkContext.clearActiveContext() is not called. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9446) Clear Active SparkContext in stop() method
[ https://issues.apache.org/jira/browse/SPARK-9446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9446: --- Assignee: (was: Apache Spark) Clear Active SparkContext in stop() method -- Key: SPARK-9446 URL: https://issues.apache.org/jira/browse/SPARK-9446 Project: Spark Issue Type: Bug Reporter: Ted Yu In thread 'stopped SparkContext remaining active' on mailing list, Andres observed the following in driver log: {code} 15/07/29 15:17:09 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster has disassociated: address removed 15/07/29 15:17:09 INFO YarnClientSchedulerBackend: Shutting down all executors Exception in thread Yarn application state monitor org.apache.spark.SparkException: Error asking standalone scheduler to shut down executors at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stopExecutors(CoarseGrainedSchedulerBackend.scala:261) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stop(CoarseGrainedSchedulerBackend.scala:266) at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.stop(YarnClientSchedulerBackend.scala:158) at org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:416) at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1411) at org.apache.spark.SparkContext.stop(SparkContext.scala:1644) at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$$anon$1.run(YarnClientSchedulerBackend.scala:139) Caused by: java.lang.InterruptedException at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1325) at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208) at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:190)15/07/29 15:17:09 INFO YarnClientSchedulerBackend: Asking each executor to shut down at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102) at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:78) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stopExecutors(CoarseGrainedSchedulerBackend.scala:257) ... 6 more {code} Effect of the above exception is that a stopped SparkContext is returned to user since SparkContext.clearActiveContext() is not called. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9446) Clear Active SparkContext in stop() method
[ https://issues.apache.org/jira/browse/SPARK-9446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646728#comment-14646728 ] Apache Spark commented on SPARK-9446: - User 'tedyu' has created a pull request for this issue: https://github.com/apache/spark/pull/7756 Clear Active SparkContext in stop() method -- Key: SPARK-9446 URL: https://issues.apache.org/jira/browse/SPARK-9446 Project: Spark Issue Type: Bug Reporter: Ted Yu In thread 'stopped SparkContext remaining active' on mailing list, Andres observed the following in driver log: {code} 15/07/29 15:17:09 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster has disassociated: address removed 15/07/29 15:17:09 INFO YarnClientSchedulerBackend: Shutting down all executors Exception in thread Yarn application state monitor org.apache.spark.SparkException: Error asking standalone scheduler to shut down executors at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stopExecutors(CoarseGrainedSchedulerBackend.scala:261) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stop(CoarseGrainedSchedulerBackend.scala:266) at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.stop(YarnClientSchedulerBackend.scala:158) at org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:416) at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1411) at org.apache.spark.SparkContext.stop(SparkContext.scala:1644) at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$$anon$1.run(YarnClientSchedulerBackend.scala:139) Caused by: java.lang.InterruptedException at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1325) at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208) at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:190)15/07/29 15:17:09 INFO YarnClientSchedulerBackend: Asking each executor to shut down at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102) at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:78) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stopExecutors(CoarseGrainedSchedulerBackend.scala:257) ... 6 more {code} Effect of the above exception is that a stopped SparkContext is returned to user since SparkContext.clearActiveContext() is not called. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9446) Clear Active SparkContext in stop() method
Ted Yu created SPARK-9446: - Summary: Clear Active SparkContext in stop() method Key: SPARK-9446 URL: https://issues.apache.org/jira/browse/SPARK-9446 Project: Spark Issue Type: Bug Reporter: Ted Yu In thread 'stopped SparkContext remaining active' on mailing list, Andres observed the following in driver log: {code} 15/07/29 15:17:09 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster has disassociated: address removed 15/07/29 15:17:09 INFO YarnClientSchedulerBackend: Shutting down all executors Exception in thread Yarn application state monitor org.apache.spark.SparkException: Error asking standalone scheduler to shut down executors at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stopExecutors(CoarseGrainedSchedulerBackend.scala:261) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stop(CoarseGrainedSchedulerBackend.scala:266) at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.stop(YarnClientSchedulerBackend.scala:158) at org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:416) at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1411) at org.apache.spark.SparkContext.stop(SparkContext.scala:1644) at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$$anon$1.run(YarnClientSchedulerBackend.scala:139) Caused by: java.lang.InterruptedException at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1325) at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208) at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:190)15/07/29 15:17:09 INFO YarnClientSchedulerBackend: Asking each executor to shut down at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102) at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:78) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stopExecutors(CoarseGrainedSchedulerBackend.scala:257) ... 6 more {code} Effect of the above exception is that a stopped SparkContext is returned to user since SparkContext.clearActiveContext() is not called. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
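The shape of the fix is a try/finally around the shutdown path. A toy model of the problem (ActiveContext, ToyContext and shutDownExecutors are illustrative names, not Spark's actual internals):
{code}
import java.util.concurrent.atomic.AtomicReference

// Toy model of the active-context bookkeeping.
object ActiveContext {
  private val active = new AtomicReference[AnyRef]()
  def set(ctx: AnyRef): Unit = active.set(ctx)
  def clear(): Unit = active.set(null)
  def current: Option[AnyRef] = Option(active.get)
}

class ToyContext {
  ActiveContext.set(this)

  def stop(): Unit = {
    try {
      // Shutting down executors can throw (e.g. the InterruptedException
      // wrapped in a SparkException in the driver log above) ...
      shutDownExecutors()
    } finally {
      // ... so clearing must happen regardless, otherwise a stopped
      // context stays registered as the active one.
      ActiveContext.clear()
    }
  }

  private def shutDownExecutors(): Unit = ()
}
{code}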
[jira] [Assigned] (SPARK-9436) Simplify Pregel by merging joins
[ https://issues.apache.org/jira/browse/SPARK-9436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur Dave reassigned SPARK-9436: - Assignee: Ankur Dave Simplify Pregel by merging joins Key: SPARK-9436 URL: https://issues.apache.org/jira/browse/SPARK-9436 Project: Spark Issue Type: Improvement Components: GraphX Affects Versions: 1.4.0 Reporter: Alexander Ulanov Assignee: Ankur Dave Priority: Minor Fix For: 1.5.0 Original Estimate: 1h Remaining Estimate: 1h Pregel code contains two consecutive joins: ``` g.vertices.innerJoin(messages)(vprog) ... g = g.outerJoinVertices(newVerts) { (vid, old, newOpt) => newOpt.getOrElse(old) } ``` They can be replaced by one join. Ankur Dave proposed a patch based on our discussion in mailing list: https://www.mail-archive.com/dev@spark.apache.org/msg10316.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
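GraphOps.joinVertices has exactly the update-or-keep-old semantics that the two joins above compute by hand: it applies the function only to vertices with a matching entry and keeps the old attribute for the rest. So the merge can look roughly like this (a sketch of the idea, not necessarily the exact patch):
{code}
import scala.reflect.ClassTag
import org.apache.spark.graphx._

// One join instead of innerJoin + outerJoinVertices: joinVertices applies
// vprog only to vertices that received a message and keeps the old
// attribute for everyone else.
def applyMessages[VD: ClassTag, ED: ClassTag, A: ClassTag](
    g: Graph[VD, ED],
    messages: VertexRDD[A],
    vprog: (VertexId, VD, A) => VD): Graph[VD, ED] =
  g.joinVertices(messages)(vprog)
{code}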
[jira] [Assigned] (SPARK-9425) Support DecimalType in UnsafeRow
[ https://issues.apache.org/jira/browse/SPARK-9425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9425: --- Assignee: Davies Liu (was: Apache Spark) Support DecimalType in UnsafeRow Key: SPARK-9425 URL: https://issues.apache.org/jira/browse/SPARK-9425 Project: Spark Issue Type: Bug Components: SQL Reporter: Davies Liu Assignee: Davies Liu Priority: Blocker The DecimalType has a precision of up to 38 digits, so it's possible to serialize a Decimal as a bounded byte array. We could have a fast path for DecimalType with precision under 18 digits, which could fit in a single long. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
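The single-long fast path can be sketched as follows (illustrative only; UnsafeRow's real fixed-width layout and null tracking are more involved). A decimal with precision 18 has an unscaled value of at most 10^18 - 1, comfortably below Long.MaxValue (about 9.22 * 10^18):
{code}
import java.math.BigDecimal

// Fast path: a decimal with precision <= 18 has an unscaled value that fits
// in a signed 64-bit long, so it can live inline in a fixed-width slot
// instead of a variable-length byte array.
def packSmallDecimal(d: BigDecimal): Option[Long] =
  if (d.precision <= 18) Some(d.unscaledValue.longValueExact) else None

def unpackSmallDecimal(unscaled: Long, scale: Int): BigDecimal =
  BigDecimal.valueOf(unscaled, scale)
{code}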
[jira] [Resolved] (SPARK-9380) Pregel example fix in graphx-programming-guide
[ https://issues.apache.org/jira/browse/SPARK-9380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Ulanov resolved SPARK-9380. - Resolution: Fixed Pregel example fix in graphx-programming-guide -- Key: SPARK-9380 URL: https://issues.apache.org/jira/browse/SPARK-9380 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.4.0 Reporter: Alexander Ulanov Fix For: 1.4.0 Pregel operator to express single source shortest path does not work due to incorrect type of the graph: Graph[Int, Double] should be Graph[Long, Double] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9446) Clear Active SparkContext in stop() method
[ https://issues.apache.org/jira/browse/SPARK-9446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-9446: - Affects Version/s: 1.4.1 Priority: Minor (was: Major) Component/s: Spark Core [~tedyu] you need to fill out the JIRA fields when creating one Clear Active SparkContext in stop() method -- Key: SPARK-9446 URL: https://issues.apache.org/jira/browse/SPARK-9446 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.1 Reporter: Ted Yu Priority: Minor In thread 'stopped SparkContext remaining active' on mailing list, Andres observed the following in driver log: {code} 15/07/29 15:17:09 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster has disassociated: address removed 15/07/29 15:17:09 INFO YarnClientSchedulerBackend: Shutting down all executors Exception in thread Yarn application state monitor org.apache.spark.SparkException: Error asking standalone scheduler to shut down executors at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stopExecutors(CoarseGrainedSchedulerBackend.scala:261) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stop(CoarseGrainedSchedulerBackend.scala:266) at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.stop(YarnClientSchedulerBackend.scala:158) at org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:416) at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1411) at org.apache.spark.SparkContext.stop(SparkContext.scala:1644) at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$$anon$1.run(YarnClientSchedulerBackend.scala:139) Caused by: java.lang.InterruptedException at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1325) at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208) at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:190)15/07/29 15:17:09 INFO YarnClientSchedulerBackend: Asking each executor to shut down at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102) at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:78) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stopExecutors(CoarseGrainedSchedulerBackend.scala:257) ... 6 more {code} Effect of the above exception is that a stopped SparkContext is returned to user since SparkContext.clearActiveContext() is not called. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9436) Simplify Pregel by merging joins
[ https://issues.apache.org/jira/browse/SPARK-9436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur Dave updated SPARK-9436: -- Assignee: Alexander Ulanov (was: Ankur Dave) Simplify Pregel by merging joins Key: SPARK-9436 URL: https://issues.apache.org/jira/browse/SPARK-9436 Project: Spark Issue Type: Improvement Components: GraphX Affects Versions: 1.4.0 Reporter: Alexander Ulanov Assignee: Alexander Ulanov Priority: Minor Fix For: 1.5.0 Original Estimate: 1h Remaining Estimate: 1h Pregel code contains two consecutive joins: ``` g.vertices.innerJoin(messages)(vprog) ... g = g.outerJoinVertices(newVerts) { (vid, old, newOpt) => newOpt.getOrElse(old) } ``` They can be replaced by one join. Ankur Dave proposed a patch based on our discussion in mailing list: https://www.mail-archive.com/dev@spark.apache.org/msg10316.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9436) Simplify Pregel by merging joins
[ https://issues.apache.org/jira/browse/SPARK-9436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur Dave resolved SPARK-9436. --- Resolution: Fixed Fix Version/s: (was: 1.4.0) 1.5.0 Issue resolved by pull request 7749 [https://github.com/apache/spark/pull/7749] Simplify Pregel by merging joins Key: SPARK-9436 URL: https://issues.apache.org/jira/browse/SPARK-9436 Project: Spark Issue Type: Improvement Components: GraphX Affects Versions: 1.4.0 Reporter: Alexander Ulanov Priority: Minor Fix For: 1.5.0 Original Estimate: 1h Remaining Estimate: 1h Pregel code contains two consecutive joins: ``` g.vertices.innerJoin(messages)(vprog) ... g = g.outerJoinVertices(newVerts) { (vid, old, newOpt) => newOpt.getOrElse(old) } ``` They can be replaced by one join. Ankur Dave proposed a patch based on our discussion in mailing list: https://www.mail-archive.com/dev@spark.apache.org/msg10316.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9440) LocalLDAModel should save docConcentration, topicConcentration, and gammaShap
[ https://issues.apache.org/jira/browse/SPARK-9440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646812#comment-14646812 ] Feynman Liang commented on SPARK-9440: -- PR at https://github.com/apache/spark/pull/7757 LocalLDAModel should save docConcentration, topicConcentration, and gammaShap - Key: SPARK-9440 URL: https://issues.apache.org/jira/browse/SPARK-9440 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Feynman Liang Priority: Critical LocalLDAModel needs to save these parameters in order for {{logPerplexity}} and {{bound}} (see SPARK-6793) to work correctly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9425) Support DecimalType in UnsafeRow
[ https://issues.apache.org/jira/browse/SPARK-9425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646827#comment-14646827 ] Apache Spark commented on SPARK-9425: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/7758 Support DecimalType in UnsafeRow Key: SPARK-9425 URL: https://issues.apache.org/jira/browse/SPARK-9425 Project: Spark Issue Type: Bug Components: SQL Reporter: Davies Liu Assignee: Davies Liu Priority: Blocker The DecimalType has a precision of up to 38 digits, so it's possible to serialize a Decimal as a bounded byte array. We could have a fast path for DecimalType with precision under 18 digits, which could fit in a single long. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9448) GenerateUnsafeProjection should not share expressions across instances
[ https://issues.apache.org/jira/browse/SPARK-9448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9448: --- Assignee: Reynold Xin (was: Apache Spark) GenerateUnsafeProjection should not share expressions across instances -- Key: SPARK-9448 URL: https://issues.apache.org/jira/browse/SPARK-9448 Project: Spark Issue Type: Bug Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Priority: Blocker We accidentally moved the list of expressions from the generated code instance to the class wrapper, and as a result, different threads are sharing the same set of expressions, which causes problems when the expressions have mutable state. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9380) Pregel example fix in graphx-programming-guide
[ https://issues.apache.org/jira/browse/SPARK-9380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646841#comment-14646841 ] Neelesh Srinivas Salian commented on SPARK-9380: Seems to be corrected. https://github.com/apache/spark/blob/master/docs/graphx-programming-guide.md Pregel example fix in graphx-programming-guide -- Key: SPARK-9380 URL: https://issues.apache.org/jira/browse/SPARK-9380 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.4.0 Reporter: Alexander Ulanov Fix For: 1.4.0 Pregel operator to express single source shortest path does not work due to incorrect type of the graph: Graph[Int, Double] should be Graph[Long, Double] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9448) GenerateUnsafeProjection should not share expressions across instances
[ https://issues.apache.org/jira/browse/SPARK-9448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646839#comment-14646839 ] Apache Spark commented on SPARK-9448: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/7759 GenerateUnsafeProjection should not share expressions across instances -- Key: SPARK-9448 URL: https://issues.apache.org/jira/browse/SPARK-9448 Project: Spark Issue Type: Bug Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Priority: Blocker We accidentally moved the list of expressions from the generated code instance to the class wrapper, and as a result, different threads are sharing the same set of expressions, which causes problems when the expressions have mutable state. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5754) Spark AM not launching on Windows
[ https://issues.apache.org/jira/browse/SPARK-5754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646733#comment-14646733 ] Carsten Blank commented on SPARK-5754: -- And in 1.5.0. Actually, I have quickly hacked a couple of lines to fix this issue. However, I also suggest that a command helper should be written, such that we populate a List of CommandEntry objects (oho, creative naming!!) that each have a key, a value and a keyValueSeparator. We could then assemble the correct command (either as a String or a List of Strings) in a platform-dependent way. That way we could make sure that commands don't break the container, regardless of the platform. One more thought regarding the above-mentioned persisting problem with -XX:OnOutOfMemoryError='kill %p': I have checked, and it seems to come from the fact that cmd expects an environment variable at %p. Consequently everything breaks. One way to deal with this is using '%%'. Again, this kind of stuff should be in a helper class. I could contribute if that is okay. However, I never have before, so a bit of help would be appreciated! Spark AM not launching on Windows - Key: SPARK-5754 URL: https://issues.apache.org/jira/browse/SPARK-5754 Project: Spark Issue Type: Bug Components: Windows, YARN Affects Versions: 1.1.1, 1.2.0 Environment: Windows Server 2012, Hadoop 2.4.1. Reporter: Inigo I'm trying to run Spark Pi on a YARN cluster running on Windows and the AM container fails to start. The problem seems to be in the generation of the YARN command which adds single quotes (') surrounding some of the java options. In particular, the part of the code that is adding those is the escapeForShell function in YarnSparkHadoopUtil. Apparently, Windows does not like the quotes for these options. Here is an example of the command that the container tries to execute: @call %JAVA_HOME%/bin/java -server -Xmx512m -Djava.io.tmpdir=%PWD%/tmp '-Dspark.yarn.secondary.jars=' '-Dspark.app.name=org.apache.spark.examples.SparkPi' '-Dspark.master=yarn-cluster' org.apache.spark.deploy.yarn.ApplicationMaster --class 'org.apache.spark.examples.SparkPi' --jar 'file:/D:/data/spark-1.1.1-bin-hadoop2.4/bin/../lib/spark-examples-1.1.1-hadoop2.4.0.jar' --executor-memory 1024 --executor-cores 1 --num-executors 2 Once I transform it into: @call %JAVA_HOME%/bin/java -server -Xmx512m -Djava.io.tmpdir=%PWD%/tmp -Dspark.yarn.secondary.jars= -Dspark.app.name=org.apache.spark.examples.SparkPi -Dspark.master=yarn-cluster org.apache.spark.deploy.yarn.ApplicationMaster --class 'org.apache.spark.examples.SparkPi' --jar 'file:/D:/data/spark-1.1.1-bin-hadoop2.4/bin/../lib/spark-examples-1.1.1-hadoop2.4.0.jar' --executor-memory 1024 --executor-cores 1 --num-executors 2 Everything seems to start. How should I deal with this? Creating a separate function like escapeForShell for Windows and calling it whenever I detect this is for Windows? Or should I add some sanity check on YARN? I checked a little and there seem to be people who are able to run Spark on YARN on Windows, so it might be something else. I didn't find anything related on Jira either. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
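Something along the lines of the suggestion, as a sketch (CommandEntry and render are hypothetical names, and the quoting rules shown are simplified, not a complete cmd.exe escaper):
{code}
// Build the command as (key, value, separator) entries and render them
// per platform, so quoting rules live in one place.
case class CommandEntry(key: String, value: String, sep: String = "=")

def render(entries: Seq[CommandEntry], isWindows: Boolean): String =
  entries.map { e =>
    val raw = s"${e.key}${e.sep}${e.value}"
    if (isWindows) {
      // cmd.exe: avoid single quotes entirely and double '%' so that e.g.
      // 'kill %p' is not treated as an environment-variable reference.
      "\"" + raw.replace("%", "%%") + "\""
    } else {
      // POSIX shells: single-quote, escaping embedded single quotes.
      "'" + raw.replace("'", "'\\''") + "'"
    }
  }.mkString(" ")
{code}
For example, render(Seq(CommandEntry("-XX:OnOutOfMemoryError", "kill %p")), isWindows = true) yields "-XX:OnOutOfMemoryError=kill %%p", sidestepping the %p expansion problem described above.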
[jira] [Assigned] (SPARK-9448) GenerateUnsafeProjection should not share expressions across instances
[ https://issues.apache.org/jira/browse/SPARK-9448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9448: --- Assignee: Apache Spark (was: Reynold Xin) GenerateUnsafeProjection should not share expressions across instances -- Key: SPARK-9448 URL: https://issues.apache.org/jira/browse/SPARK-9448 Project: Spark Issue Type: Bug Components: SQL Reporter: Reynold Xin Assignee: Apache Spark Priority: Blocker We accidentally moved the list of expressions from the generated code instance to the class wrapper, and as a result, different threads are sharing the same set of expressions, which causes problems when the expressions have mutable state. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9380) Pregel example fix in graphx-programming-guide
[ https://issues.apache.org/jira/browse/SPARK-9380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9380: --- Assignee: (was: Apache Spark) Pregel example fix in graphx-programming-guide -- Key: SPARK-9380 URL: https://issues.apache.org/jira/browse/SPARK-9380 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.4.0 Reporter: Alexander Ulanov Fix For: 1.4.0 Pregel operator to express single source shortest path does not work due to incorrect type of the graph: Graph[Int, Double] should be Graph[Long, Double] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9380) Pregel example fix in graphx-programming-guide
[ https://issues.apache.org/jira/browse/SPARK-9380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646845#comment-14646845 ] Apache Spark commented on SPARK-9380: - User 'avulanov' has created a pull request for this issue: https://github.com/apache/spark/pull/7695 Pregel example fix in graphx-programming-guide -- Key: SPARK-9380 URL: https://issues.apache.org/jira/browse/SPARK-9380 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.4.0 Reporter: Alexander Ulanov Fix For: 1.4.0 Pregel operator to express single source shortest path does not work due to incorrect type of the graph: Graph[Int, Double] should be Graph[Long, Double] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9380) Pregel example fix in graphx-programming-guide
[ https://issues.apache.org/jira/browse/SPARK-9380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9380: --- Assignee: Apache Spark Pregel example fix in graphx-programming-guide -- Key: SPARK-9380 URL: https://issues.apache.org/jira/browse/SPARK-9380 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.4.0 Reporter: Alexander Ulanov Assignee: Apache Spark Fix For: 1.4.0 Pregel operator to express single source shortest path does not work due to incorrect type of the graph: Graph[Int, Double] should be Graph[Long, Double] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9430) Rename IntervalType to CalendarInterval
[ https://issues.apache.org/jira/browse/SPARK-9430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-9430. Resolution: Fixed Fix Version/s: 1.5.0 Rename IntervalType to CalendarInterval --- Key: SPARK-9430 URL: https://issues.apache.org/jira/browse/SPARK-9430 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Fix For: 1.5.0 Based on offline discussion with [~marmbrus]. In 1.6, I think we should just create a TimeInterval type, which stores only the interval in terms of number of microseconds. TimeInterval can then be comparable. In 1.5, we should rename the existing IntervalType to CalendarInterval, so we won't have name clashes in 1.6. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9448) GenerateUnsafeProjection should not share expressions across instances
Reynold Xin created SPARK-9448: -- Summary: GenerateUnsafeProjection should not share expressions across instances Key: SPARK-9448 URL: https://issues.apache.org/jira/browse/SPARK-9448 Project: Spark Issue Type: Bug Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Priority: Blocker We accidentally moved the list of expressions from the generated code instance to the class wrapper, and as a result, different threads are sharing the same set of expressions, which causes problems when the expressions have mutable state. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
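A toy illustration of the bug class (all names are made up; the real code is Spark's generated-projection wrapper): when evaluator instances share one mutable expression object, concurrent use corrupts results.
{code}
// An "expression" with mutable state.
class RunningSum {
  private var sum = 0L
  def eval(x: Long): Long = { sum += x; sum }
}

// Wrong: a single expression instance hung off the shared class wrapper
// means every projection instance, and hence every thread, mutates the
// same state.
object SharedExpressions {
  val expr = new RunningSum
}

// Right: each generated instance owns a fresh copy of its expressions.
class Projection {
  private val expr = new RunningSum
  def apply(x: Long): Long = expr.eval(x)
}
{code}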
[jira] [Issue Comment Deleted] (SPARK-9380) Pregel example fix in graphx-programming-guide
[ https://issues.apache.org/jira/browse/SPARK-9380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Ulanov updated SPARK-9380: Comment: was deleted (was: It seems that I did not name the PR correctly. I renamed it and resolved this issue. Sorry for inconvenience. ) Pregel example fix in graphx-programming-guide -- Key: SPARK-9380 URL: https://issues.apache.org/jira/browse/SPARK-9380 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.4.0 Reporter: Alexander Ulanov Fix For: 1.4.0 Pregel operator to express single source shortest path does not work due to incorrect type of the graph: Graph[Int, Double] should be Graph[Long, Double] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9380) Pregel example fix in graphx-programming-guide
[ https://issues.apache.org/jira/browse/SPARK-9380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646854#comment-14646854 ] Alexander Ulanov commented on SPARK-9380: - It seems that I did not name the PR correctly. I renamed it and resolved this issue. Sorry for inconvenience. Pregel example fix in graphx-programming-guide -- Key: SPARK-9380 URL: https://issues.apache.org/jira/browse/SPARK-9380 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.4.0 Reporter: Alexander Ulanov Fix For: 1.4.0 Pregel operator to express single source shortest path does not work due to incorrect type of the graph: Graph[Int, Double] should be Graph[Long, Double] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9447) Update python API to include RandomForest as classifier changes.
holdenk created SPARK-9447: -- Summary: Update python API to include RandomForest as classifier changes. Key: SPARK-9447 URL: https://issues.apache.org/jira/browse/SPARK-9447 Project: Spark Issue Type: Improvement Reporter: holdenk Priority: Minor The API should still work after SPARK-9016-make-random-forest-classifiers-implement-classification-trait gets merged in, but we might want to extend it to provide predictRaw and similar methods in the Python API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9447) Update python API to include RandomForest as classifier changes.
[ https://issues.apache.org/jira/browse/SPARK-9447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646759#comment-14646759 ] Joseph K. Bradley commented on SPARK-9447: -- Upgrading to Major priority since users should be able to specify new output column names from Python. Update python API to include RandomForest as classifier changes. Key: SPARK-9447 URL: https://issues.apache.org/jira/browse/SPARK-9447 Project: Spark Issue Type: Improvement Reporter: holdenk The API should still work after SPARK-9016-make-random-forest-classifiers-implement-classification-trait gets merged in, but we might want to extend it to provide predictRaw and similar methods in the Python API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9447) Update python API to include RandomForest as classifier changes.
[ https://issues.apache.org/jira/browse/SPARK-9447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-9447: - Shepherd: Joseph K. Bradley Update python API to include RandomForest as classifier changes. Key: SPARK-9447 URL: https://issues.apache.org/jira/browse/SPARK-9447 Project: Spark Issue Type: Improvement Reporter: holdenk The API should still work after SPARK-9016-make-random-forest-classifiers-implement-classification-trait gets merged in, but we might want to extend it to provide predictRaw and similar methods in the Python API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9447) Update python API to include RandomForest as classifier changes.
[ https://issues.apache.org/jira/browse/SPARK-9447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-9447: - Priority: Major (was: Minor) Update python API to include RandomForest as classifier changes. Key: SPARK-9447 URL: https://issues.apache.org/jira/browse/SPARK-9447 Project: Spark Issue Type: Improvement Reporter: holdenk The API should still work after SPARK-9016-make-random-forest-classifiers-implement-classification-trait gets merged in, but we might want to extend it to provide predictRaw and similar methods in the Python API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5754) Spark AM not launching on Windows
[ https://issues.apache.org/jira/browse/SPARK-5754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646768#comment-14646768 ] Inigo Goiri commented on SPARK-5754: Thank you [~cbvoxel] for looking into this. I can send any patches you need; just let me know. Spark AM not launching on Windows - Key: SPARK-5754 URL: https://issues.apache.org/jira/browse/SPARK-5754 Project: Spark Issue Type: Bug Components: Windows, YARN Affects Versions: 1.1.1, 1.2.0 Environment: Windows Server 2012, Hadoop 2.4.1. Reporter: Inigo I'm trying to run Spark Pi on a YARN cluster running on Windows and the AM container fails to start. The problem seems to be in the generation of the YARN command which adds single quotes (') surrounding some of the java options. In particular, the part of the code that is adding those is the escapeForShell function in YarnSparkHadoopUtil. Apparently, Windows does not like the quotes for these options. Here is an example of the command that the container tries to execute: @call %JAVA_HOME%/bin/java -server -Xmx512m -Djava.io.tmpdir=%PWD%/tmp '-Dspark.yarn.secondary.jars=' '-Dspark.app.name=org.apache.spark.examples.SparkPi' '-Dspark.master=yarn-cluster' org.apache.spark.deploy.yarn.ApplicationMaster --class 'org.apache.spark.examples.SparkPi' --jar 'file:/D:/data/spark-1.1.1-bin-hadoop2.4/bin/../lib/spark-examples-1.1.1-hadoop2.4.0.jar' --executor-memory 1024 --executor-cores 1 --num-executors 2 Once I transform it into: @call %JAVA_HOME%/bin/java -server -Xmx512m -Djava.io.tmpdir=%PWD%/tmp -Dspark.yarn.secondary.jars= -Dspark.app.name=org.apache.spark.examples.SparkPi -Dspark.master=yarn-cluster org.apache.spark.deploy.yarn.ApplicationMaster --class 'org.apache.spark.examples.SparkPi' --jar 'file:/D:/data/spark-1.1.1-bin-hadoop2.4/bin/../lib/spark-examples-1.1.1-hadoop2.4.0.jar' --executor-memory 1024 --executor-cores 1 --num-executors 2 Everything seems to start. How should I deal with this? Creating a separate function like escapeForShell for Windows and calling it whenever I detect this is for Windows? Or should I add some sanity check on YARN? I checked a little and there seem to be people who are able to run Spark on YARN on Windows, so it might be something else. I didn't find anything related on Jira either. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9425) Support DecimalType in UnsafeRow
[ https://issues.apache.org/jira/browse/SPARK-9425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9425: --- Assignee: Apache Spark (was: Davies Liu) Support DecimalType in UnsafeRow Key: SPARK-9425 URL: https://issues.apache.org/jira/browse/SPARK-9425 Project: Spark Issue Type: Bug Components: SQL Reporter: Davies Liu Assignee: Apache Spark Priority: Blocker The DecimalType has a precision of up to 38 digits, so it's possible to serialize a Decimal as a bounded byte array. We could have a fast path for DecimalType with precision under 18 digits, which could fit in a single long. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9380) Pregel example fix in graphx-programming-guide
[ https://issues.apache.org/jira/browse/SPARK-9380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646853#comment-14646853 ] Alexander Ulanov commented on SPARK-9380: - It seems that I did not name the PR correctly. I renamed it and resolved this issue. Sorry for inconvenience. Pregel example fix in graphx-programming-guide -- Key: SPARK-9380 URL: https://issues.apache.org/jira/browse/SPARK-9380 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.4.0 Reporter: Alexander Ulanov Fix For: 1.4.0 Pregel operator to express single source shortest path does not work due to incorrect type of the graph: Graph[Int, Double] should be Graph[Long, Double] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8414) Ensure ContextCleaner actually triggers clean ups
[ https://issues.apache.org/jira/browse/SPARK-8414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646487#comment-14646487 ] Andrew Or commented on SPARK-8414: -- Oops, you're right. I always mix them up... Ensure ContextCleaner actually triggers clean ups - Key: SPARK-8414 URL: https://issues.apache.org/jira/browse/SPARK-8414 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Critical Right now it cleans up old references only through natural GCs, which may not occur if the driver has infinite RAM. We should do a periodic GC to make sure that we actually do clean things up. Something like once per 30 minutes seems relatively inexpensive. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
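A minimal sketch of the periodic-GC idea, assuming the 30-minute period suggested above (illustrative names; not the actual ContextCleaner code):
{code}
import java.util.concurrent.{Executors, TimeUnit}

// Trigger a GC periodically so weak-reference processing happens even when
// the driver has so much heap that it never GCs naturally.
val gcTimer = Executors.newSingleThreadScheduledExecutor()
gcTimer.scheduleAtFixedRate(new Runnable {
  // System.gc() is only a hint to the JVM, but in practice it is enough
  // to get queued weak references processed so clean-ups actually run.
  def run(): Unit = System.gc()
}, 30, 30, TimeUnit.MINUTES)
{code}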
[jira] [Created] (SPARK-9442) java.lang.ArithmeticException: / by zero when reading Parquet
DB Tsai created SPARK-9442: -- Summary: java.lang.ArithmeticException: / by zero when reading Parquet Key: SPARK-9442 URL: https://issues.apache.org/jira/browse/SPARK-9442 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.1 Reporter: DB Tsai -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7157) Add approximate stratified sampling to DataFrame
[ https://issues.apache.org/jira/browse/SPARK-7157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646610#comment-14646610 ] Apache Spark commented on SPARK-7157: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/7755 Add approximate stratified sampling to DataFrame Key: SPARK-7157 URL: https://issues.apache.org/jira/browse/SPARK-7157 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Joseph K. Bradley Assignee: Xiangrui Meng Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9442) java.lang.ArithmeticException: / by zero when reading Parquet
[ https://issues.apache.org/jira/browse/SPARK-9442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646708#comment-14646708 ] DB Tsai commented on SPARK-9442: I will try to turn on the logging at the info level. Thanks. java.lang.ArithmeticException: / by zero when reading Parquet - Key: SPARK-9442 URL: https://issues.apache.org/jira/browse/SPARK-9442 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.1 Reporter: DB Tsai I am counting how many records are in my nested parquet file with this schema: {code} scala> u1aTesting.printSchema root |-- profileId: long (nullable = true) |-- country: string (nullable = true) |-- data: array (nullable = true) ||-- element: struct (containsNull = true) |||-- videoId: long (nullable = true) |||-- date: long (nullable = true) |||-- label: double (nullable = true) |||-- weight: double (nullable = true) |||-- features: vector (nullable = true) {code} The number of records in the nested data array is around 10k, and each of the parquet files is around 600MB. The total size is around 120GB. I am doing a simple count: {code} scala> u1aTesting.count parquet.io.ParquetDecodingException: Can not read value at 100 in block 0 in file hdfs://compute-1.amazonaws.com:9000/users/dbtsai/testing/u1old/20150721/part-r-00115-d70c946b-b0f0-45fe-9965-b9f062b9ec6d.gz.parquet at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213) at parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204) at org.apache.spark.sql.sources.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:163) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$6.apply(Aggregate.scala:129) at org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$6.apply(Aggregate.scala:126) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:70) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.ArithmeticException: / by zero at parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:109) at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:193) ... 21 more {code} BTW, not all the tasks fail; some of them are successful. Another note: by explicitly looping through the data to count, it works. 
{code} sqlContext.read.load(hdfsPath + s"/testing/u1snappy/${date}/").map(x => 1L).reduce((x, y) => x + y) {code} I think maybe some metadata in the parquet files is corrupted. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9440) LocalLDAModel should save docConcentration, topicConcentration, and gammaShap
[ https://issues.apache.org/jira/browse/SPARK-9440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646491#comment-14646491 ] Feynman Liang commented on SPARK-9440: -- Working on this LocalLDAModel should save docConcentration, topicConcentration, and gammaShap - Key: SPARK-9440 URL: https://issues.apache.org/jira/browse/SPARK-9440 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Feynman Liang Priority: Blocker LocalLDAModel needs to save these parameters in order for {{logPerplexity}} and {{bound}} (see SPARK-6793) to work correctly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9440) LocalLDAModel should save docConcentration, topicConcentration, and gammaShap
Feynman Liang created SPARK-9440: Summary: LocalLDAModel should save docConcentration, topicConcentration, and gammaShap Key: SPARK-9440 URL: https://issues.apache.org/jira/browse/SPARK-9440 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Feynman Liang Priority: Blocker LocalLDAModel needs to save these parameters in order for {{logPerplexity}} and {{bound}} (see SPARK-6793) to work correctly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9246) DistributedLDAModel predict top docs per topic
[ https://issues.apache.org/jira/browse/SPARK-9246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646570#comment-14646570 ] Meihua Wu commented on SPARK-9246: -- Got it. Thanks! DistributedLDAModel predict top docs per topic -- Key: SPARK-9246 URL: https://issues.apache.org/jira/browse/SPARK-9246 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Joseph K. Bradley Original Estimate: 72h Remaining Estimate: 72h For each topic, return top documents based on topicDistributions. Synopsis: {code} /** * @param maxDocuments Max docs to return for each topic * @return Array over topics of (sorted top docs, corresponding doc-topic weights) */ def topDocumentsPerTopic(maxDocuments: Int): Array[(Array[Long], Array[Double])] {code} Note: We will need to make sure that the above return value format is Java-friendly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
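A hedged usage sketch of the synopsis above, assuming a trained {{DistributedLDAModel}} named {{model}}; the method itself is what this ticket proposes, not an API that exists at the time of writing:
{code}
// For each topic, print its top 3 documents with their doc-topic weights.
val topDocs: Array[(Array[Long], Array[Double])] = model.topDocumentsPerTopic(3)
topDocs.zipWithIndex.foreach { case ((docIds, weights), topic) =>
  println(s"topic $topic: " + docIds.zip(weights).mkString(", "))
}
{code}
Note that the proposed return type uses only primitive arrays and tuples, which is consistent with the Java-friendliness requirement in the description.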
[jira] [Commented] (SPARK-9246) DistributedLDAModel predict top docs per topic
[ https://issues.apache.org/jira/browse/SPARK-9246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646571#comment-14646571 ] Meihua Wu commented on SPARK-9246: -- Got it. Thanks! DistributedLDAModel predict top docs per topic -- Key: SPARK-9246 URL: https://issues.apache.org/jira/browse/SPARK-9246 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Joseph K. Bradley Original Estimate: 72h Remaining Estimate: 72h For each topic, return top documents based on topicDistributions. Synopsis: {code} /** * @param maxDocuments Max docs to return for each topic * @return Array over topics of (sorted top docs, corresponding doc-topic weights) */ def topDocumentsPerTopic(maxDocuments: Int): Array[(Array[Long], Array[Double])] {code} Note: We will need to make sure that the above return value format is Java-friendly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-9445) Adding custom SerDe when creating Hive tables
[ https://issues.apache.org/jira/browse/SPARK-9445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yashwanth Rao Dannamaneni closed SPARK-9445. Resolution: Duplicate Adding custom SerDe when creating Hive tables - Key: SPARK-9445 URL: https://issues.apache.org/jira/browse/SPARK-9445 Project: Spark Issue Type: Bug Components: PySpark, SQL Reporter: Yashwanth Rao Dannamaneni -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9442) java.lang.ArithmeticException: / by zero when reading Parquet
[ https://issues.apache.org/jira/browse/SPARK-9442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646692#comment-14646692 ] François Garillot commented on SPARK-9442: -- This may not be a Spark issue, but rather a Parquet Hadoop reader issue. It looks like the total time spent reading may be reported incorrectly here: https://github.com/Parquet/parquet-mr/blob/master/parquet-hadoop/src/main/java/parquet/hadoop/InternalParquetRecordReader.java#L110 Note that this function outputs log messages that look like: Assembled and processed x records from y columns in t ms: n rec/ms, p cell/ms. Do you see any of those when you turn logging on at the info level? java.lang.ArithmeticException: / by zero when reading Parquet - Key: SPARK-9442 URL: https://issues.apache.org/jira/browse/SPARK-9442 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.1 Reporter: DB Tsai I am counting how many records are in my nested parquet file with this schema, {code} scala> u1aTesting.printSchema root |-- profileId: long (nullable = true) |-- country: string (nullable = true) |-- data: array (nullable = true) | |-- element: struct (containsNull = true) | | |-- videoId: long (nullable = true) | | |-- date: long (nullable = true) | | |-- label: double (nullable = true) | | |-- weight: double (nullable = true) | | |-- features: vector (nullable = true) {code} and the number of records in the nested data array is around 10k; each parquet file is around 600MB, and the total size is around 120GB. I am doing a simple count {code} scala> u1aTesting.count parquet.io.ParquetDecodingException: Can not read value at 100 in block 0 in file hdfs://compute-1.amazonaws.com:9000/users/dbtsai/testing/u1old/20150721/part-r-00115-d70c946b-b0f0-45fe-9965-b9f062b9ec6d.gz.parquet at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213) at parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204) at org.apache.spark.sql.sources.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:163) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$6.apply(Aggregate.scala:129) at org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$6.apply(Aggregate.scala:126) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:70) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.ArithmeticException: / by zero at
parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:109) at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:193) ... 21 more {code} BTW, not all the tasks fail; some of them are successful. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
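A minimal Scala sketch of the failure mode François points at, assuming (as the stack trace suggests) that {{checkRead}} divides a record count by an elapsed-time figure when computing the rec/ms rate; the variable names are illustrative, not parquet-mr's:
{code}
// If a read cycle completes in under a millisecond, the integer elapsed time
// can be 0, and the records-per-ms computation throws the reported exception.
val recordsRead: Long = 100L
val elapsedMs: Long = 0L                 // plausible on a very fast read
val recPerMs = recordsRead / elapsedMs   // java.lang.ArithmeticException: / by zero
{code}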
[jira] [Comment Edited] (SPARK-9442) java.lang.ArithmeticException: / by zero when reading Parquet
[ https://issues.apache.org/jira/browse/SPARK-9442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646705#comment-14646705 ] DB Tsai edited comment on SPARK-9442 at 7/29/15 8:28 PM: - Another note: By explicitly looping through the data to count, it works. {code} sqlContext.read.load(hdfsPath + s"/testing/u1snappy/${date}/").map(x => 1L).reduce((x, y) => x + y) {code} I think maybe some metadata in the parquet files is corrupted. was (Author: dbtsai): Another note: By explicitly looping through the data to count, it works. [code] sqlContext.read.load(hdfsPath + s"/testing/u1snappy/${date}/").map(x => 1L).reduce((x, y) => x + y) [code] I think maybe some metadata in the parquet files is corrupted. java.lang.ArithmeticException: / by zero when reading Parquet - Key: SPARK-9442 URL: https://issues.apache.org/jira/browse/SPARK-9442 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.1 Reporter: DB Tsai I am counting how many records are in my nested parquet file with this schema, {code} scala> u1aTesting.printSchema root |-- profileId: long (nullable = true) |-- country: string (nullable = true) |-- data: array (nullable = true) | |-- element: struct (containsNull = true) | | |-- videoId: long (nullable = true) | | |-- date: long (nullable = true) | | |-- label: double (nullable = true) | | |-- weight: double (nullable = true) | | |-- features: vector (nullable = true) {code} and the number of records in the nested data array is around 10k; each parquet file is around 600MB, and the total size is around 120GB. I am doing a simple count {code} scala> u1aTesting.count parquet.io.ParquetDecodingException: Can not read value at 100 in block 0 in file hdfs://compute-1.amazonaws.com:9000/users/dbtsai/testing/u1old/20150721/part-r-00115-d70c946b-b0f0-45fe-9965-b9f062b9ec6d.gz.parquet at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213) at parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204) at org.apache.spark.sql.sources.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:163) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$6.apply(Aggregate.scala:129) at org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$6.apply(Aggregate.scala:126) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:70) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by:
java.lang.ArithmeticException: / by zero at parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:109) at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:193) ... 21 more {code} BTW, not all the tasks fail; some of them are successful. Another note: By explicitly looping through the data to count, it works. {code} sqlContext.read.load(hdfsPath + s"/testing/u1snappy/${date}/").map(x => 1L).reduce((x, y) => x + y) {code} I think maybe some metadata in the parquet files is corrupted. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6884) Random forest: predict class probabilities
[ https://issues.apache.org/jira/browse/SPARK-6884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646546#comment-14646546 ] Joseph K. Bradley commented on SPARK-6884: -- That's a shame! I hope you're able to figure something out for the next patch you work on. Random forest: predict class probabilities -- Key: SPARK-6884 URL: https://issues.apache.org/jira/browse/SPARK-6884 Project: Spark Issue Type: Sub-task Components: ML Reporter: Max Kaznady Labels: prediction, probability, randomforest, tree Original Estimate: 72h Remaining Estimate: 72h Currently, there is no way to extract the class probabilities from the RandomForest classifier. I implemented a probability predictor by counting the votes from individual trees, summing the votes for class 1, and dividing by the total number of votes. I opened this ticket to keep track of changes. Will update once I push my code to master. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9441) NoSuchMethodError: Com.typesafe.config.Config.getDuration
[ https://issues.apache.org/jira/browse/SPARK-9441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646578#comment-14646578 ] Sean Owen commented on SPARK-9441: -- Spark uses Akka, which uses typesafe config 1.2.1; that version contains this method and matches what you're using. Before modifying your project, which you don't need to do, use {{mvn dependency:tree}} to verify what version of typesafe config you're getting, and from where. My guess is you're getting an older version from another dependency, and that's conflicting. There's a longer story here about Akka, Typesafe Config, and shading, but if I'm right about that, then you just need to manage the version in your app in {{dependencyManagement}} and not write exclusions. I think this is less than ideal but is still 'by design' while Akka is around here, and it should be a reliable workaround if that's your app's situation. NoSuchMethodError: Com.typesafe.config.Config.getDuration - Key: SPARK-9441 URL: https://issues.apache.org/jira/browse/SPARK-9441 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.3.1 Reporter: nirav patel I recently migrated my Spark-based REST service from 1.0.2 to 1.3.1 15/07/29 10:31:12 INFO spark.SparkContext: Running Spark version 1.3.1 15/07/29 10:31:12 INFO spark.SecurityManager: Changing view acls to: npatel 15/07/29 10:31:12 INFO spark.SecurityManager: Changing modify acls to: npatel 15/07/29 10:31:12 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(npatel); users with modify permissions: Set(npatel) Exception in thread "main" java.lang.NoSuchMethodError: com.typesafe.config.Config.getDuration(Ljava/lang/String;Ljava/util/concurrent/TimeUnit;)J at akka.util.Helpers$ConfigOps$.akka$util$Helpers$ConfigOps$$getDuration$extension(Helpers.scala:125) at akka.util.Helpers$ConfigOps$.getMillisDuration$extension(Helpers.scala:120) at akka.actor.ActorSystem$Settings.<init>(ActorSystem.scala:171) at akka.actor.ActorSystemImpl.<init>(ActorSystem.scala:504) at akka.actor.ActorSystem$.apply(ActorSystem.scala:141) at akka.actor.ActorSystem$.apply(ActorSystem.scala:118) at org.apache.spark.util.AkkaUtils$.org$apache$spark$util$AkkaUtils$$doCreateActorSystem(AkkaUtils.scala:122) at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:55) at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:54) at org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1837) at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1828) at org.apache.spark.util.AkkaUtils$.createActorSystem(AkkaUtils.scala:57) at org.apache.spark.SparkEnv$.create(SparkEnv.scala:223) at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:163) at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:269) at org.apache.spark.SparkContext.<init>(SparkContext.scala:272) I read blogs where people suggest modifying the classpath to put the right version first, putting the Scala libs first on the classpath, and similar suggestions, which is all ridiculous. I think the typesafe config package included with the spark-core lib is incorrect. I did the following with my Maven build and now it works. But I think someone needs to fix the spark-core package.
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <exclusions>
    <exclusion>
      <artifactId>config</artifactId>
      <groupId>com.typesafe</groupId>
    </exclusion>
  </exclusions>
</dependency>
<dependency>
  <groupId>com.typesafe</groupId>
  <artifactId>config</artifactId>
  <version>1.2.1</version>
</dependency>
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
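To complement Sean Owen's {{mvn dependency:tree}} suggestion, a hedged Scala sketch for checking, from inside the running app, which jar actually supplies {{com.typesafe.config.Config}} and whether it carries the 1.2.x {{getDuration}} method. This is a generic JVM reflection diagnostic, not a Spark API:
{code}
// Locate the jar the JVM loaded com.typesafe.config.Config from, and check
// for the getDuration method that only exists in config 1.2+.
val clazz = Class.forName("com.typesafe.config.Config")
println("loaded from: " + clazz.getProtectionDomain.getCodeSource.getLocation)
val hasGetDuration = clazz.getMethods.exists(_.getName == "getDuration")
println(s"has getDuration: $hasGetDuration") // false => an older config jar won the conflict
{code}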
[jira] [Updated] (SPARK-7016) Refactor dev/run-tests(-jenkins) from Bash to Python
[ https://issues.apache.org/jira/browse/SPARK-7016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-7016: -- Target Version/s: 1.6.0 (was: 1.5.0) Refactor dev/run-tests(-jenkins) from Bash to Python Key: SPARK-7016 URL: https://issues.apache.org/jira/browse/SPARK-7016 Project: Spark Issue Type: Improvement Components: Build, Project Infra Reporter: Brennon York Priority: Critical Currently the {{dev/run-tests}} and {{dev/run-tests-jenkins}} scripts are written in Bash and are becoming quite unwieldy to manage, both in their current state and for future contributions. This proposal is to refactor both scripts into Python to allow for better manageability by the community, make it easier to add features, and provide a simpler approach to calling / running the various test suites. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9442) java.lang.ArithmeticException: / by zero when reading Parquet
[ https://issues.apache.org/jira/browse/SPARK-9442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai updated SPARK-9442: --- Description: I am counting how many records are in my nested parquet file with this schema, {code} scala> u1aTesting.printSchema root |-- profileId: long (nullable = true) |-- country: string (nullable = true) |-- data: array (nullable = true) | |-- element: struct (containsNull = true) | | |-- videoId: long (nullable = true) | | |-- date: long (nullable = true) | | |-- label: double (nullable = true) | | |-- weight: double (nullable = true) | | |-- features: vector (nullable = true) {code} and the number of records in the nested data array is around 10k; each parquet file is around 600MB, and the total size is around 120GB. I am doing a simple count {code} scala> u1aTesting.count parquet.io.ParquetDecodingException: Can not read value at 100 in block 0 in file hdfs://compute-1.amazonaws.com:9000/users/dbtsai/testing/u1old/20150721/part-r-00115-d70c946b-b0f0-45fe-9965-b9f062b9ec6d.gz.parquet at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213) at parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204) at org.apache.spark.sql.sources.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:163) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$6.apply(Aggregate.scala:129) at org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$6.apply(Aggregate.scala:126) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:70) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.ArithmeticException: / by zero at parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:109) at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:193) ... 21 more {code} BTW, not all the tasks fail; some of them are successful. Another note: By explicitly looping through the data to count, it works. [code] sqlContext.read.load(hdfsPath + s"/testing/u1snappy/${date}/").map(x => 1L).reduce((x, y) => x + y) [code] I think maybe some metadata in the parquet files is corrupted.
was: I am counting how many records are in my nested parquet file with this schema, {code} scala> u1aTesting.printSchema root |-- profileId: long (nullable = true) |-- country: string (nullable = true) |-- data: array (nullable = true) | |-- element: struct (containsNull = true) | | |-- videoId: long (nullable = true) | | |-- date: long (nullable = true) | | |-- label: double (nullable = true) | | |-- weight: double (nullable = true) | | |-- features: vector (nullable = true) {code} and the number of records in the nested data array is around 10k; each parquet file is around 600MB, and the total size is around 120GB. I am doing a simple count {code} scala> u1aTesting.count parquet.io.ParquetDecodingException: Can not read value at 100 in block 0 in file hdfs://compute-1.amazonaws.com:9000/users/dbtsai/testing/u1old/20150721/part-r-00115-d70c946b-b0f0-45fe-9965-b9f062b9ec6d.gz.parquet at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213) at parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204) at org.apache.spark.sql.sources.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:163) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at
[jira] [Updated] (SPARK-9442) java.lang.ArithmeticException: / by zero when reading Parquet
[ https://issues.apache.org/jira/browse/SPARK-9442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai updated SPARK-9442: --- Description: I am counting how many records are in my nested parquet file with this schema, {code} scala> u1aTesting.printSchema root |-- profileId: long (nullable = true) |-- country: string (nullable = true) |-- data: array (nullable = true) | |-- element: struct (containsNull = true) | | |-- videoId: long (nullable = true) | | |-- date: long (nullable = true) | | |-- label: double (nullable = true) | | |-- weight: double (nullable = true) | | |-- features: vector (nullable = true) {code} and the number of records in the nested data array is around 10k; each parquet file is around 600MB, and the total size is around 120GB. I am doing a simple count {code} scala> u1aTesting.count parquet.io.ParquetDecodingException: Can not read value at 100 in block 0 in file hdfs://compute-1.amazonaws.com:9000/users/dbtsai/testing/u1old/20150721/part-r-00115-d70c946b-b0f0-45fe-9965-b9f062b9ec6d.gz.parquet at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213) at parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204) at org.apache.spark.sql.sources.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:163) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$6.apply(Aggregate.scala:129) at org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$6.apply(Aggregate.scala:126) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:70) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.ArithmeticException: / by zero at parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:109) at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:193) ... 21 more {code} BTW, not all the tasks fail; some of them are successful. Another note: By explicitly looping through the data to count, it works. {code} sqlContext.read.load(hdfsPath + s"/testing/u1snappy/${date}/").map(x => 1L).reduce((x, y) => x + y) {code} I think maybe some metadata in the parquet files is corrupted.
was: I am counting how many records are in my nested parquet file with this schema, {code} scala> u1aTesting.printSchema root |-- profileId: long (nullable = true) |-- country: string (nullable = true) |-- data: array (nullable = true) | |-- element: struct (containsNull = true) | | |-- videoId: long (nullable = true) | | |-- date: long (nullable = true) | | |-- label: double (nullable = true) | | |-- weight: double (nullable = true) | | |-- features: vector (nullable = true) {code} and the number of records in the nested data array is around 10k; each parquet file is around 600MB, and the total size is around 120GB. I am doing a simple count {code} scala> u1aTesting.count parquet.io.ParquetDecodingException: Can not read value at 100 in block 0 in file hdfs://compute-1.amazonaws.com:9000/users/dbtsai/testing/u1old/20150721/part-r-00115-d70c946b-b0f0-45fe-9965-b9f062b9ec6d.gz.parquet at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213) at parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204) at org.apache.spark.sql.sources.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:163) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at
[jira] [Commented] (SPARK-9442) java.lang.ArithmeticException: / by zero when reading Parquet
[ https://issues.apache.org/jira/browse/SPARK-9442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646705#comment-14646705 ] DB Tsai commented on SPARK-9442: Another note: By explicitly looping through the data to count, it works. [code] sqlContext.read.load(hdfsPath + s"/testing/u1snappy/${date}/").map(x => 1L).reduce((x, y) => x + y) [code] I think maybe some metadata in the parquet files is corrupted. java.lang.ArithmeticException: / by zero when reading Parquet - Key: SPARK-9442 URL: https://issues.apache.org/jira/browse/SPARK-9442 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.1 Reporter: DB Tsai I am counting how many records are in my nested parquet file with this schema, {code} scala> u1aTesting.printSchema root |-- profileId: long (nullable = true) |-- country: string (nullable = true) |-- data: array (nullable = true) | |-- element: struct (containsNull = true) | | |-- videoId: long (nullable = true) | | |-- date: long (nullable = true) | | |-- label: double (nullable = true) | | |-- weight: double (nullable = true) | | |-- features: vector (nullable = true) {code} and the number of records in the nested data array is around 10k; each parquet file is around 600MB, and the total size is around 120GB. I am doing a simple count {code} scala> u1aTesting.count parquet.io.ParquetDecodingException: Can not read value at 100 in block 0 in file hdfs://compute-1.amazonaws.com:9000/users/dbtsai/testing/u1old/20150721/part-r-00115-d70c946b-b0f0-45fe-9965-b9f062b9ec6d.gz.parquet at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213) at parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204) at org.apache.spark.sql.sources.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:163) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$6.apply(Aggregate.scala:129) at org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$6.apply(Aggregate.scala:126) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:70) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.ArithmeticException: / by zero at parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:109) at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:193) ...
21 more {code} BTW, not all the tasks fail; some of them are successful. Another note: By explicitly looping through the data to count, it works. [code] sqlContext.read.load(hdfsPath + s"/testing/u1snappy/${date}/").map(x => 1L).reduce((x, y) => x + y) [code] I think maybe some metadata in the parquet files is corrupted. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9440) LocalLDAModel should save docConcentration, topicConcentration, and gammaShap
[ https://issues.apache.org/jira/browse/SPARK-9440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646565#comment-14646565 ] Joseph K. Bradley commented on SPARK-9440: -- This JIRA is marked as a blocker since it is *required* by [SPARK-6793]. If this cannot get into 1.5, we need to revert [SPARK-6793], which introduces new fields in LDA models which are not saved by save/load. LocalLDAModel should save docConcentration, topicConcentration, and gammaShap - Key: SPARK-9440 URL: https://issues.apache.org/jira/browse/SPARK-9440 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Feynman Liang Priority: Blocker LocalLDAModel needs to save these parameters in order for {{logPerplexity}} and {{bound}} (see SPARK-6793) to work correctly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9440) LocalLDAModel should save docConcentration, topicConcentration, and gammaShap
[ https://issues.apache.org/jira/browse/SPARK-9440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-9440: - Priority: Critical (was: Blocker) LocalLDAModel should save docConcentration, topicConcentration, and gammaShap - Key: SPARK-9440 URL: https://issues.apache.org/jira/browse/SPARK-9440 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Feynman Liang Priority: Critical LocalLDAModel needs to save these parameters in order for {{logPerplexity}} and {{bound}} (see SPARK-6793) to work correctly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-9440) LocalLDAModel should save docConcentration, topicConcentration, and gammaShap
[ https://issues.apache.org/jira/browse/SPARK-9440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646565#comment-14646565 ] Joseph K. Bradley edited comment on SPARK-9440 at 7/29/15 6:28 PM: --- This JIRA is marked as critical since it is *required* by [SPARK-6793]. If this cannot get into 1.5, we need to revert [SPARK-6793], which introduces new fields in LDA models which are not saved by save/load. was (Author: josephkb): This JIRA is marked as a blocker since it is *required* by [SPARK-6793]. If this cannot get into 1.5, we need to revert [SPARK-6793], which introduces new fields in LDA models which are not saved by save/load. LocalLDAModel should save docConcentration, topicConcentration, and gammaShap - Key: SPARK-9440 URL: https://issues.apache.org/jira/browse/SPARK-9440 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Feynman Liang Priority: Critical LocalLDAModel needs to save these parameters in order for {{logPerplexity}} and {{bound}} (see SPARK-6793) to work correctly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9440) LocalLDAModel should save docConcentration, topicConcentration, and gammaShap
[ https://issues.apache.org/jira/browse/SPARK-9440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-9440: - Shepherd: Joseph K. Bradley LocalLDAModel should save docConcentration, topicConcentration, and gammaShap - Key: SPARK-9440 URL: https://issues.apache.org/jira/browse/SPARK-9440 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Feynman Liang Priority: Blocker LocalLDAModel needs to save these parameters in order for {{logPerplexity}} and {{bound}} (see SPARK-6793) to work correctly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6152) Spark does not support Java 8 compiled Scala classes
[ https://issues.apache.org/jira/browse/SPARK-6152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646699#comment-14646699 ] Steve Loughran commented on SPARK-6152: --- Chill and Kryo need to be in sync; there's also the need to be compatible with the version Hive uses (which has historically been addressed with custom versions of Hive). If Spark could jump to Kryo 3.x, the classpath conflict with Hive would go away, provided the wire formats of serialized classes were compatible: Hive's spark-client JAR uses Kryo 2.2.x to talk to Spark. Spark does not support Java 8 compiled Scala classes Key: SPARK-6152 URL: https://issues.apache.org/jira/browse/SPARK-6152 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.1 Environment: Java 8+ Scala 2.11 Reporter: Ronald Chen Priority: Minor Spark uses reflectasm to check Scala closures, which fails if the *user defined Scala closures* are compiled to the Java 8 class file version. The cause is that reflectasm does not support Java 8: https://github.com/EsotericSoftware/reflectasm/issues/35 Workaround: Don't compile Scala classes to Java 8; Scala 2.11 does not support nor require any Java 8 features. Stack trace: {code} java.lang.IllegalArgumentException at com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.<init>(Unknown Source) at com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.<init>(Unknown Source) at com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.<init>(Unknown Source) at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$getClassReader(ClosureCleaner.scala:41) at org.apache.spark.util.ClosureCleaner$.getInnerClasses(ClosureCleaner.scala:84) at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:107) at org.apache.spark.SparkContext.clean(SparkContext.scala:1478) at org.apache.spark.rdd.RDD.map(RDD.scala:288) at ...my Scala 2.11 compiled to Java 8 code calling into spark {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
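A hedged sketch of the stated workaround in sbt terms (an sbt build definition is itself Scala); the flags are standard scalac/javac options, but whether they suit a particular build is an assumption:
{code}
// build.sbt — emit Java 7 class files so Spark's shaded reflectasm,
// which cannot parse the Java 8 class format, can still read the closures.
scalaVersion := "2.11.6" // illustrative version
scalacOptions += "-target:jvm-1.7"
javacOptions ++= Seq("-source", "1.7", "-target", "1.7")
{code}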
[jira] [Updated] (SPARK-9352) Add tests for standalone scheduling code
[ https://issues.apache.org/jira/browse/SPARK-9352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-9352: - Labels: (was: backport-needed) Add tests for standalone scheduling code Key: SPARK-9352 URL: https://issues.apache.org/jira/browse/SPARK-9352 Project: Spark Issue Type: Improvement Components: Deploy, Tests Affects Versions: 1.0.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Critical Fix For: 1.4.2, 1.5.0 There are no tests for the standalone Master scheduling code! This has caused issues like SPARK-8881 and SPARK-9260 in the past. It is crucial that we have some level of confidence that this code actually works... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-9352) Add tests for standalone scheduling code
[ https://issues.apache.org/jira/browse/SPARK-9352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-9352. Resolution: Fixed Fix Version/s: 1.4.2 Add tests for standalone scheduling code Key: SPARK-9352 URL: https://issues.apache.org/jira/browse/SPARK-9352 Project: Spark Issue Type: Improvement Components: Deploy, Tests Affects Versions: 1.0.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Critical Fix For: 1.4.2, 1.5.0 There are no tests for the standalone Master scheduling code! This has caused issues like SPARK-8881 and SPARK-9260 in the past. It is crucial that we have some level of confidence that this code actually works... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9442) java.lang.ArithmeticException: / by zero when reading Parquet
[ https://issues.apache.org/jira/browse/SPARK-9442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai updated SPARK-9442: --- Description: I am counting how many records are in my nested parquet file with this schema, {code} scala> u1aTesting.printSchema root |-- profileId: long (nullable = true) |-- country: string (nullable = true) |-- data: array (nullable = true) | |-- element: struct (containsNull = true) | | |-- videoId: long (nullable = true) | | |-- date: long (nullable = true) | | |-- label: double (nullable = true) | | |-- weight: double (nullable = true) | | |-- features: vector (nullable = true) {code} and the number of records in the nested data array is around 10k; each parquet file is around 600MB, and the total size is around 120GB. I am doing a simple count {code} scala> u1aTesting.count parquet.io.ParquetDecodingException: Can not read value at 100 in block 0 in file hdfs://compute-1.amazonaws.com:9000/users/dbtsai/testing/u1old/20150721/part-r-00115-d70c946b-b0f0-45fe-9965-b9f062b9ec6d.gz.parquet at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213) at parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204) at org.apache.spark.sql.sources.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:163) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$6.apply(Aggregate.scala:129) at org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$6.apply(Aggregate.scala:126) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:70) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.ArithmeticException: / by zero at parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:109) at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:193) ...
21 more {code} java.lang.ArithmeticException: / by zero when reading Parquet - Key: SPARK-9442 URL: https://issues.apache.org/jira/browse/SPARK-9442 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.1 Reporter: DB Tsai I am counting how many records are in my nested parquet file with this schema, {code} scala> u1aTesting.printSchema root |-- profileId: long (nullable = true) |-- country: string (nullable = true) |-- data: array (nullable = true) | |-- element: struct (containsNull = true) | | |-- videoId: long (nullable = true) | | |-- date: long (nullable = true) | | |-- label: double (nullable = true) | | |-- weight: double (nullable = true) | | |-- features: vector (nullable = true) {code} and the number of records in the nested data array is around 10k; each parquet file is around 600MB, and the total size is around 120GB. I am doing a simple count {code} scala> u1aTesting.count parquet.io.ParquetDecodingException: Can not read value at 100 in block 0 in file hdfs://compute-1.amazonaws.com:9000/users/dbtsai/testing/u1old/20150721/part-r-00115-d70c946b-b0f0-45fe-9965-b9f062b9ec6d.gz.parquet at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213) at parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204) at org.apache.spark.sql.sources.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:163) at
[jira] [Commented] (SPARK-9246) DistributedLDAModel predict top docs per topic
[ https://issues.apache.org/jira/browse/SPARK-9246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646554#comment-14646554 ] Joseph K. Bradley commented on SPARK-9246: -- If it's much easier or faster computationally, it's OK if it's approximate. (It probably should be analogous to describeTopics, so I guess it will be approximate.) I think we should make both exact at some point, with a little more work, but either is OK for now. DistributedLDAModel predict top docs per topic -- Key: SPARK-9246 URL: https://issues.apache.org/jira/browse/SPARK-9246 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Joseph K. Bradley Original Estimate: 72h Remaining Estimate: 72h For each topic, return top documents based on topicDistributions. Synopsis: {code} /** * @param maxDocuments Max docs to return for each topic * @return Array over topics of (sorted top docs, corresponding doc-topic weights) */ def topDocumentsPerTopic(maxDocuments: Int): Array[(Array[Long], Array[Double])] {code} Note: We will need to make sure that the above return value format is Java-friendly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9442) java.lang.ArithmeticException: / by zero when reading Parquet
[ https://issues.apache.org/jira/browse/SPARK-9442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai updated SPARK-9442: --- Description: I am counting how many records are in my nested parquet file with this schema, {code} scala> u1aTesting.printSchema root |-- profileId: long (nullable = true) |-- country: string (nullable = true) |-- data: array (nullable = true) | |-- element: struct (containsNull = true) | | |-- videoId: long (nullable = true) | | |-- date: long (nullable = true) | | |-- label: double (nullable = true) | | |-- weight: double (nullable = true) | | |-- features: vector (nullable = true) {code} and the number of records in the nested data array is around 10k; each parquet file is around 600MB, and the total size is around 120GB. I am doing a simple count {code} scala> u1aTesting.count parquet.io.ParquetDecodingException: Can not read value at 100 in block 0 in file hdfs://compute-1.amazonaws.com:9000/users/dbtsai/testing/u1old/20150721/part-r-00115-d70c946b-b0f0-45fe-9965-b9f062b9ec6d.gz.parquet at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213) at parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204) at org.apache.spark.sql.sources.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:163) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$6.apply(Aggregate.scala:129) at org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$6.apply(Aggregate.scala:126) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:70) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.ArithmeticException: / by zero at parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:109) at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:193) ... 21 more {code} BTW, not all the tasks fail; some of them are successful.
was: I am counting how many records are in my nested parquet file with this schema, {code} scala> u1aTesting.printSchema root |-- profileId: long (nullable = true) |-- country: string (nullable = true) |-- data: array (nullable = true) | |-- element: struct (containsNull = true) | | |-- videoId: long (nullable = true) | | |-- date: long (nullable = true) | | |-- label: double (nullable = true) | | |-- weight: double (nullable = true) | | |-- features: vector (nullable = true) {code} and the number of records in the nested data array is around 10k; each parquet file is around 600MB, and the total size is around 120GB. I am doing a simple count {code} scala> u1aTesting.count parquet.io.ParquetDecodingException: Can not read value at 100 in block 0 in file hdfs://compute-1.amazonaws.com:9000/users/dbtsai/testing/u1old/20150721/part-r-00115-d70c946b-b0f0-45fe-9965-b9f062b9ec6d.gz.parquet at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213) at parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204) at org.apache.spark.sql.sources.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:163) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$6.apply(Aggregate.scala:129) at
[jira] [Created] (SPARK-9445) Adding custom SerDe when creating Hive tables
Yashwanth Rao Dannamaneni created SPARK-9445: Summary: Adding custom SerDe when creating Hive tables Key: SPARK-9445 URL: https://issues.apache.org/jira/browse/SPARK-9445 Project: Spark Issue Type: Bug Components: PySpark, SQL Reporter: Yashwanth Rao Dannamaneni -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5133) Feature Importance for Decision Tree (Ensembles)
[ https://issues.apache.org/jira/browse/SPARK-5133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646493#comment-14646493 ] Parv Oberoi commented on SPARK-5133: [~josephkb]: is the plan to still include this in SPARK 1.5 considering that the code cutoff is this week? Feature Importance for Decision Tree (Ensembles) Key: SPARK-5133 URL: https://issues.apache.org/jira/browse/SPARK-5133 Project: Spark Issue Type: New Feature Components: ML, MLlib Reporter: Peter Prettenhofer Original Estimate: 168h Remaining Estimate: 168h Add feature importance to decision tree model and tree ensemble models. If people are interested in this feature I could implement it given a mentor (API decisions, etc). Please find a description of the feature below: Decision trees intrinsically perform feature selection by selecting appropriate split points. This information can be used to assess the relative importance of a feature. Relative feature importance gives valuable insight into a decision tree or tree ensemble and can even be used for feature selection. More information on feature importance (via decrease in impurity) can be found in ESLII (10.13.1) or here [1]. R's randomForest package uses a different technique for assessing variable importance that is based on permutation tests. All necessary information to create relative importance scores should be available in the tree representation (class Node; split, impurity gain, (weighted) nr of samples?). [1] http://scikit-learn.org/stable/modules/ensemble.html#feature-importance-evaluation -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
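Where the proposal above says all the needed information lives in the tree representation, a hedged Scala sketch against the MLlib 1.x {{Node}} API can make that concrete. It credits each split's information gain to the feature it splits on and normalizes; the (weighted) sample-count weighting mentioned in the description is omitted because the public {{Node}} does not expose counts, so treat this as an illustrative approximation rather than the proposed implementation:
{code}
import org.apache.spark.mllib.tree.model.Node

// Walk the tree, accumulating each internal node's information gain against the
// feature it splits on, then normalize so the importances sum to 1.
def featureImportances(root: Node, numFeatures: Int): Array[Double] = {
  val imp = new Array[Double](numFeatures)
  def visit(node: Node): Unit = if (!node.isLeaf) {
    for (split <- node.split; stats <- node.stats) imp(split.feature) += stats.gain
    node.leftNode.foreach(visit)
    node.rightNode.foreach(visit)
  }
  visit(root)
  val total = imp.sum
  if (total > 0) imp.map(_ / total) else imp
}
{code}
For an ensemble, the per-tree vectors would be averaged, which is roughly the impurity-decrease scheme the scikit-learn reference [1] describes.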
[jira] [Assigned] (SPARK-9104) expose network layer memory usage in shuffle part
[ https://issues.apache.org/jira/browse/SPARK-9104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9104: --- Assignee: Apache Spark expose network layer memory usage in shuffle part - Key: SPARK-9104 URL: https://issues.apache.org/jira/browse/SPARK-9104 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Zhang, Liye Assignee: Apache Spark The default network transport is Netty, and when transferring blocks for shuffle, the network layer will consume a decent amount of memory; we should collect the memory usage of this part and expose it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9104) expose network layer memory usage in shuffle part
[ https://issues.apache.org/jira/browse/SPARK-9104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9104: --- Assignee: (was: Apache Spark) expose network layer memory usage in shuffle part - Key: SPARK-9104 URL: https://issues.apache.org/jira/browse/SPARK-9104 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Zhang, Liye The default network transport is Netty, and when transferring blocks for shuffle, the network layer will consume a decent amount of memory; we should collect the memory usage of this part and expose it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9104) expose network layer memory usage in shuffle part
[ https://issues.apache.org/jira/browse/SPARK-9104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646496#comment-14646496 ] Apache Spark commented on SPARK-9104: - User 'liyezhang556520' has created a pull request for this issue: https://github.com/apache/spark/pull/7753 expose network layer memory usage in shuffle part - Key: SPARK-9104 URL: https://issues.apache.org/jira/browse/SPARK-9104 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Zhang, Liye The default network transport is Netty, and when transferring blocks for shuffle, the network layer will consume a decent amount of memory; we should collect the memory usage of this part and expose it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8194) date/time function: add_months
[ https://issues.apache.org/jira/browse/SPARK-8194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646583#comment-14646583 ] Apache Spark commented on SPARK-8194: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/7754 date/time function: add_months -- Key: SPARK-8194 URL: https://issues.apache.org/jira/browse/SPARK-8194 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin add_months(string start_date, int num_months): string add_months(date start_date, int num_months): date Returns the date that is num_months after start_date. The time part of start_date is ignored. If start_date is the last day of the month or if the resulting month has fewer days than the day component of start_date, then the result is the last day of the resulting month. Otherwise, the result has the same day component as start_date. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
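The boundary rules above are easy to get wrong, so here is a minimal Scala sketch of the described semantics using java.util.Calendar; it illustrates the spec quoted above, not Spark's actual implementation:
{code}
import java.util.Calendar

// If start_date is the last day of its month, or the resulting month is shorter
// than start_date's day component, snap to the last day of the resulting month;
// otherwise keep the same day component. The time part is ignored per the spec.
def addMonths(start: Calendar, numMonths: Int): Calendar = {
  val wasLastDay =
    start.get(Calendar.DAY_OF_MONTH) == start.getActualMaximum(Calendar.DAY_OF_MONTH)
  val result = start.clone().asInstanceOf[Calendar]
  result.set(Calendar.DAY_OF_MONTH, 1) // avoid day overflow while shifting months
  result.add(Calendar.MONTH, numMonths)
  val maxDay = result.getActualMaximum(Calendar.DAY_OF_MONTH)
  val day = if (wasLastDay) maxDay
            else math.min(start.get(Calendar.DAY_OF_MONTH), maxDay)
  result.set(Calendar.DAY_OF_MONTH, day)
  result // e.g. 2015-01-31 + 1 month -> 2015-02-28; 2015-02-28 + 1 month -> 2015-03-31
}
{code}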
[jira] [Assigned] (SPARK-9133) Add and Subtract should support date/timestamp and interval type
[ https://issues.apache.org/jira/browse/SPARK-9133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9133: --- Assignee: Davies Liu (was: Apache Spark) Add and Subtract should support date/timestamp and interval type Key: SPARK-9133 URL: https://issues.apache.org/jira/browse/SPARK-9133 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Davies Liu Should support: date + interval, interval + date, timestamp + interval, interval + timestamp. The best way to support this is probably to resolve this to a date add/subtract expression, rather than making add/subtract support these types. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
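A hedged sketch of the expressions this sub-task targets, written as Spark SQL against a hypothetical live {{sqlContext}}; the exact interval literal syntax accepted by the 1.5 parser is an assumption:
{code}
// Each of the four shapes listed above; per the ticket, all should resolve to a
// dedicated date add/subtract expression rather than a widened numeric Add.
sqlContext.sql("SELECT CAST('2015-07-29' AS DATE) + INTERVAL 1 MONTH").show()
sqlContext.sql("SELECT INTERVAL 1 MONTH + CAST('2015-07-29' AS DATE)").show()
sqlContext.sql("SELECT CAST('2015-07-29 10:00:00' AS TIMESTAMP) + INTERVAL 30 SECONDS").show()
sqlContext.sql("SELECT INTERVAL 30 SECONDS + CAST('2015-07-29 10:00:00' AS TIMESTAMP)").show()
{code}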
[jira] [Commented] (SPARK-9133) Add and Subtract should support date/timestamp and interval type
[ https://issues.apache.org/jira/browse/SPARK-9133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646585#comment-14646585 ] Apache Spark commented on SPARK-9133: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/7754 Add and Subtract should support date/timestamp and interval type Key: SPARK-9133 URL: https://issues.apache.org/jira/browse/SPARK-9133 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Davies Liu Should support: date + interval, interval + date, timestamp + interval, interval + timestamp. The best way to support this is probably to resolve this to a date add/subtract expression, rather than making add/subtract support these types. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8186) date/time function: date_add
[ https://issues.apache.org/jira/browse/SPARK-8186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646581#comment-14646581 ] Apache Spark commented on SPARK-8186: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/7754 date/time function: date_add Key: SPARK-8186 URL: https://issues.apache.org/jira/browse/SPARK-8186 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Adrian Wang date_add(timestamp startdate, int days): timestamp date_add(timestamp startdate, interval i): timestamp date_add(date date, int days): date date_add(date date, interval i): date -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org