[jira] [Updated] (SPARK-10316) respect non-deterministic expressions in PhysicalOperation
[ https://issues.apache.org/jira/browse/SPARK-10316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-10316: Description: We did a lot of special handling for non-deterministic expressions in Optimizer. However, PhysicalOperation just collects all Projects and Filters and messed it up. We should respect the operators order caused by non-deterministic expressions in PhysicalOperation. (was: We did a lot of special handling for non-deterministic expressions in Optimizer. However, PhysicalOperation just collects all Projects and Filters and messed it up. We should respect the operator order caused by non-deterministic expressions in PhysicalOperation.) respect non-deterministic expressions in PhysicalOperation -- Key: SPARK-10316 URL: https://issues.apache.org/jira/browse/SPARK-10316 Project: Spark Issue Type: Bug Components: SQL Reporter: Wenchen Fan We did a lot of special handling for non-deterministic expressions in Optimizer. However, PhysicalOperation just collects all Projects and Filters and messed it up. We should respect the operators order caused by non-deterministic expressions in PhysicalOperation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10316) respect nondeterministic expressions in PhysicalOperation
Wenchen Fan created SPARK-10316: --- Summary: respect nondeterministic expressions in PhysicalOperation Key: SPARK-10316 URL: https://issues.apache.org/jira/browse/SPARK-10316 Project: Spark Issue Type: Bug Components: SQL Reporter: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10316) respect non-deterministic expressions in PhysicalOperation
[ https://issues.apache.org/jira/browse/SPARK-10316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10316: Assignee: Apache Spark respect non-deterministic expressions in PhysicalOperation -- Key: SPARK-10316 URL: https://issues.apache.org/jira/browse/SPARK-10316 Project: Spark Issue Type: Bug Components: SQL Reporter: Wenchen Fan Assignee: Apache Spark We did a lot of special handling for non-deterministic expressions in Optimizer. However, PhysicalOperation just collects all Projects and Filters and messed it up. We should respect the operators order caused by non-deterministic expressions in PhysicalOperation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10316) respect non-deterministic expressions in PhysicalOperation
[ https://issues.apache.org/jira/browse/SPARK-10316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14716786#comment-14716786 ] Apache Spark commented on SPARK-10316: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/8486 respect non-deterministic expressions in PhysicalOperation -- Key: SPARK-10316 URL: https://issues.apache.org/jira/browse/SPARK-10316 Project: Spark Issue Type: Bug Components: SQL Reporter: Wenchen Fan We did a lot of special handling for non-deterministic expressions in Optimizer. However, PhysicalOperation just collects all Projects and Filters and messed it up. We should respect the operators order caused by non-deterministic expressions in PhysicalOperation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
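To make the ordering concern concrete, here is a minimal sketch against the Spark 1.5 DataFrame API (the column name and values are made up): a Filter that sits above a Project containing rand() cannot be merged with it or pushed below it, because re-evaluating the non-deterministic projection would draw different random values.
{code}
import org.apache.spark.sql.functions.rand

// non-deterministic projection
val df = sqlContext.range(10).select((rand() * 100).cast("int").as("r"))

// this Filter must stay above the Project that produced `r`; a planner that
// blindly collects all Projects and Filters (as PhysicalOperation does) could
// re-evaluate rand() and silently change the query's result
df.filter(df("r") > 50).explain(true)
{code}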
[jira] [Updated] (SPARK-10295) Dynamic allocation in Mesos does not release when RDDs are cached
[ https://issues.apache.org/jira/browse/SPARK-10295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-10295: -- I believe that YARN currently will release executors even if they have cached data. I also recall that there's a desire to change this behavior, so that executors may stick around with cached data. I am not sure what the current or intended Mesos behavior is, but assume it's the same. Therefore, this message may need to be softened to something like "Dynamic allocation is enabled; executors may be removed even when they contain cached data", or something similar. I don't think there are hard guarantees about the behavior in any event, and the intent is just to make the user aware that it's possible for cached data to go away with dynamic allocation on. CC [~vanzin] and [~sandyr] Dynamic allocation in Mesos does not release when RDDs are cached - Key: SPARK-10295 URL: https://issues.apache.org/jira/browse/SPARK-10295 Project: Spark Issue Type: Question Components: Mesos Affects Versions: 1.5.0 Environment: Spark 1.5.0 RC1 Centos 6 java 7 oracle Reporter: Hans van den Bogert Priority: Minor When running spark in coarse grained mode with shuffle service and dynamic allocation, the driver does not release executors if a dataset is cached. The console output OTOH shows: 15/08/26 17:29:58 WARN SparkContext: Dynamic allocation currently does not support cached RDDs. Cached data for RDD 9 will be lost when executors are removed. However after the default of 1m, executors are not released. When I perform the same initial setup, loading data, etc, but without caching, the executors are released. Is this intended behaviour? If this is intended behaviour, the console warning is misleading. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
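For reference, a sketch of the dynamic-allocation settings involved (spark-defaults.conf style; the values are examples). Note that Spark 1.5 also introduced spark.dynamicAllocation.cachedExecutorIdleTimeout, whose default is effectively infinite, so executors holding cached blocks are deliberately kept alive unless that timeout is lowered; if I read it correctly, that would explain the behavior reported above rather than a Mesos-specific bug.
{code}
spark.dynamicAllocation.enabled                     true
spark.shuffle.service.enabled                       true
# idle executors are released after this timeout (default 60s, the "1m" mentioned above)
spark.dynamicAllocation.executorIdleTimeout         60s
# Spark 1.5+: executors with cached blocks use this separate timeout instead;
# its default is effectively infinite, so such executors are never released
spark.dynamicAllocation.cachedExecutorIdleTimeout   600s
{code}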
[jira] [Updated] (SPARK-10295) Dynamic allocation in Mesos does not release when RDDs are cached
[ https://issues.apache.org/jira/browse/SPARK-10295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-10295: -- Component/s: Mesos Dynamic allocation in Mesos does not release when RDDs are cached - Key: SPARK-10295 URL: https://issues.apache.org/jira/browse/SPARK-10295 Project: Spark Issue Type: Question Components: Mesos Affects Versions: 1.5.0 Environment: Spark 1.5.0 RC1 Centos 6 java 7 oracle Reporter: Hans van den Bogert Priority: Minor When running spark in coarse grained mode with shuffle service and dynamic allocation, the driver does not release executors if a dataset is cached. The console output OTOH shows: 15/08/26 17:29:58 WARN SparkContext: Dynamic allocation currently does not support cached RDDs. Cached data for RDD 9 will be lost when executors are removed. However after the default of 1m, executors are not released. When I perform the same initial setup, loading data, etc, but without caching, the executors are released. Is this intended behaviour? If this is intended behaviour, the console warning is misleading. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10316) respect non-deterministic expressions in PhysicalOperation
[ https://issues.apache.org/jira/browse/SPARK-10316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-10316: Description: We did a lot of special handling for non-deterministic expressions in Optimizer. However, PhysicalOperation just collects all Projects and Filters and messed it up. We should respect the operator order caused by non-deterministic expressions in PhysicalOperation. (was: We did a lot of special handling for non-deterministic expressions in ) respect non-deterministic expressions in PhysicalOperation -- Key: SPARK-10316 URL: https://issues.apache.org/jira/browse/SPARK-10316 Project: Spark Issue Type: Bug Components: SQL Reporter: Wenchen Fan We did a lot of special handling for non-deterministic expressions in Optimizer. However, PhysicalOperation just collects all Projects and Filters and messed it up. We should respect the operator order caused by non-deterministic expressions in PhysicalOperation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6906) Improve Hive integration support
[ https://issues.apache.org/jira/browse/SPARK-6906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14716616#comment-14716616 ] Thomas Graves commented on SPARK-6906: -- Thanks for the information. I'm trying to get it to work with our nonstandard version of Hive (0.13 plus backported patches), but I'm having issues with authentication. I'm assuming it's something with our version of Hive. Improve Hive integration support Key: SPARK-6906 URL: https://issues.apache.org/jira/browse/SPARK-6906 Project: Spark Issue Type: Story Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust Priority: Blocker Fix For: 1.5.0 Right now Spark SQL is very coupled to a specific version of Hive for two primary reasons. - Metadata: we use the Hive Metastore client to retrieve information about tables in a metastore. - Execution: UDFs, UDAFs, SerDes, HiveConf and various helper functions for configuration. Since Hive is generally not compatible across versions, we currently maintain fairly expensive shim layers to let us talk to both Hive 12 and Hive 13 metastores. Ideally we would be able to talk to more versions of Hive with less maintenance burden. This task is proposing that we separate the Hive version that is used for communicating with the metastore from the version that is used for execution. In doing so we can significantly reduce the size of the shim by only providing compatibility for metadata operations. All execution will be done with a single version of Hive (the newest version that is supported by Spark SQL). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
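For reference, the metastore/execution split described above is exposed through two configuration options in Spark 1.4+ (a sketch; the version and jar source are examples):
{code}
# talk to a Hive 0.13.1 metastore while executing with Spark's built-in Hive support
spark.sql.hive.metastore.version  0.13.1
# where to load the metastore client jars from: builtin, maven,
# or an explicit classpath containing Hive 0.13 and its dependencies
spark.sql.hive.metastore.jars     maven
{code}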
[jira] [Commented] (SPARK-10002) SSH problem during Setup of Spark(1.3.0) cluster on EC2
[ https://issues.apache.org/jira/browse/SPARK-10002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14716767#comment-14716767 ] Zero tolerance commented on SPARK-10002: I met the same problem. Adding the parameter --private-ips seems to work. SSH problem during Setup of Spark(1.3.0) cluster on EC2 --- Key: SPARK-10002 URL: https://issues.apache.org/jira/browse/SPARK-10002 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.3.0 Environment: EC2, SPARK 1.3.0 cluster setup in vpc/subnet. Reporter: Deepali Bhandari Steps to start a Spark cluster with EC2 scripts 1. I created an ec2 instance in the vpc, and subnet. Amazon Linux 2. I downloaded spark-1.3.0 3. chmod 400 key file 4. Export aws access and secret keys 5. Now ran the command ./spark-ec2 --key-pair=deepali-ec2-keypair --identity-file=/home/ec2-user/Spark/deepali-ec2-keypair.pem --region=us-west-2 --zone=us-west-2b --vpc-id=vpc-03d67b66 --subnet-id=subnet-72fd5905 --resume launch deepali-spark-nodocker 6. The master and slave instances are created, but SSH fails saying the host cannot be resolved. 7. I can ping the master and slave, I can ssh from the command line, but not from the ec2 scripts. 8. I have spent more than 2 days now, but no luck yet. 9. The EC2 scripts don't work; the code has a bug in referencing the cluster nodes via the wrong hostnames SCREEN CONSOLE log ./spark-ec2 --key-pair=deepali-ec2-keypair --identity-file=/home/ec2-user/Spark/deepali-ec2-keypair.pem --region=us-west-2 --zone=us-west-2b --vpc-id=vpc-03d67b66 --subnet-id=subnet-72fd5905 launch deepali-spark-nodocker Downloading Boto from PyPi Finished downloading Boto Setting up security groups... Creating security group deepali-spark-nodocker-master Creating security group deepali-spark-nodocker-slaves Searching for existing cluster deepali-spark-nodocker... Spark AMI: ami-9a6e0daa Launching instances... Launched 1 slaves in us-west-2b, regid = r-0d2088fb Launched master in us-west-2b, regid = r-312088c7 Waiting for AWS to propagate instance metadata... Waiting for cluster to enter 'ssh-ready' state... Warning: SSH connection error. (This could be temporary.) Host: None SSH return code: 255 SSH output: ssh: Could not resolve hostname None: Name or service not known -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10316) respect nondeterministic expressions in PhysicalOperation
[ https://issues.apache.org/jira/browse/SPARK-10316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-10316: Description: We did a lot of special handling for non-deterministic expressions in (was: We did a lot of special handling for ) respect nondeterministic expressions in PhysicalOperation - Key: SPARK-10316 URL: https://issues.apache.org/jira/browse/SPARK-10316 Project: Spark Issue Type: Bug Components: SQL Reporter: Wenchen Fan We did a lot of special handling for non-deterministic expressions in -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10316) respect non-deterministic expressions in PhysicalOperation
[ https://issues.apache.org/jira/browse/SPARK-10316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-10316: Summary: respect non-deterministic expressions in PhysicalOperation (was: respect nondeterministic expressions in PhysicalOperation) respect non-deterministic expressions in PhysicalOperation -- Key: SPARK-10316 URL: https://issues.apache.org/jira/browse/SPARK-10316 Project: Spark Issue Type: Bug Components: SQL Reporter: Wenchen Fan We did a lot of special handling for non-deterministic expressions in -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8472) Python API for DCT
[ https://issues.apache.org/jira/browse/SPARK-8472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8472: --- Assignee: Apache Spark Python API for DCT -- Key: SPARK-8472 URL: https://issues.apache.org/jira/browse/SPARK-8472 Project: Spark Issue Type: New Feature Components: ML Reporter: Feynman Liang Assignee: Apache Spark Priority: Minor We need to implement a wrapper for enabling the DCT feature transformer to be used from the Python API -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8472) Python API for DCT
[ https://issues.apache.org/jira/browse/SPARK-8472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8472: --- Assignee: (was: Apache Spark) Python API for DCT -- Key: SPARK-8472 URL: https://issues.apache.org/jira/browse/SPARK-8472 Project: Spark Issue Type: New Feature Components: ML Reporter: Feynman Liang Priority: Minor We need to implement a wrapper for enabling the DCT feature transformer to be used from the Python API -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8472) Python API for DCT
[ https://issues.apache.org/jira/browse/SPARK-8472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14716677#comment-14716677 ] Apache Spark commented on SPARK-8472: - User 'yanboliang' has created a pull request for this issue: https://github.com/apache/spark/pull/8485 Python API for DCT -- Key: SPARK-8472 URL: https://issues.apache.org/jira/browse/SPARK-8472 Project: Spark Issue Type: New Feature Components: ML Reporter: Feynman Liang Priority: Minor We need to implement a wrapper for enabling the DCT feature transformer to be used from the Python API -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
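For reference, a minimal sketch of the Scala API that the Python wrapper would mirror, assuming Spark 1.5's org.apache.spark.ml.feature.DCT (the input vector is arbitrary):
{code}
import org.apache.spark.ml.feature.DCT
import org.apache.spark.mllib.linalg.Vectors

val data = Seq(Vectors.dense(0.0, 1.0, -2.0, 3.0))
val df = sqlContext.createDataFrame(data.map(Tuple1.apply)).toDF("features")

val dct = new DCT()
  .setInputCol("features")
  .setOutputCol("featuresDCT")
  .setInverse(false) // forward DCT-II; true would apply the inverse transform

dct.transform(df).select("featuresDCT").show()
{code}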
[jira] [Assigned] (SPARK-10316) respect non-deterministic expressions in PhysicalOperation
[ https://issues.apache.org/jira/browse/SPARK-10316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10316: Assignee: (was: Apache Spark) respect non-deterministic expressions in PhysicalOperation -- Key: SPARK-10316 URL: https://issues.apache.org/jira/browse/SPARK-10316 Project: Spark Issue Type: Bug Components: SQL Reporter: Wenchen Fan We did a lot of special handling for non-deterministic expressions in Optimizer. However, PhysicalOperation just collects all Projects and Filters and messed it up. We should respect the operators order caused by non-deterministic expressions in PhysicalOperation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10314) [CORE] RDD persist to OFF_HEAP tachyon got block rdd_x_x not found exception when parallelism is bigger than data split size
[ https://issues.apache.org/jira/browse/SPARK-10314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-10314: -- Priority: Minor (was: Major) [CORE] RDD persist to OFF_HEAP tachyon got block rdd_x_x not found exception when parallelism is bigger than data split size Key: SPARK-10314 URL: https://issues.apache.org/jira/browse/SPARK-10314 Project: Spark Issue Type: Bug Components: Block Manager Affects Versions: 1.4.1 Environment: Spark 1.4.1, Hadoop 2.6.0, Tachyon 0.6.4 Reporter: Xiaoyu Wang Priority: Minor RDD persist to OFF_HEAP tachyon got block rdd_x_x not found exception when parallelism is bigger than data split size
{code}
val rdd = sc.parallelize(List(1, 2), 2)
rdd.persist(org.apache.spark.storage.StorageLevel.OFF_HEAP)
rdd.count()
{code}
is ok.
{code}
val rdd = sc.parallelize(List(1, 2), 3)
rdd.persist(org.apache.spark.storage.StorageLevel.OFF_HEAP)
rdd.count()
{code}
got exception: {noformat} 15/08/27 17:53:07 INFO SparkContext: Starting job: count at <console>:24 15/08/27 17:53:07 INFO DAGScheduler: Got job 0 (count at <console>:24) with 3 output partitions (allowLocal=false) 15/08/27 17:53:07 INFO DAGScheduler: Final stage: ResultStage 0(count at <console>:24) 15/08/27 17:53:07 INFO DAGScheduler: Parents of final stage: List() 15/08/27 17:53:07 INFO DAGScheduler: Missing parents: List() 15/08/27 17:53:07 INFO DAGScheduler: Submitting ResultStage 0 (ParallelCollectionRDD[0] at parallelize at <console>:21), which has no missing parents 15/08/27 17:53:07 INFO MemoryStore: ensureFreeSpace(1096) called with curMem=0, maxMem=741196431 15/08/27 17:53:07 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 1096.0 B, free 706.9 MB) 15/08/27 17:53:07 INFO MemoryStore: ensureFreeSpace(788) called with curMem=1096, maxMem=741196431 15/08/27 17:53:07 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 788.0 B, free 706.9 MB) 15/08/27 17:53:07 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:43776 (size: 788.0 B, free: 706.9 MB) 15/08/27 17:53:07 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:874 15/08/27 17:53:07 INFO DAGScheduler: Submitting 3 missing tasks from ResultStage 0 (ParallelCollectionRDD[0] at parallelize at <console>:21) 15/08/27 17:53:07 INFO TaskSchedulerImpl: Adding task set 0.0 with 3 tasks 15/08/27 17:53:07 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 1269 bytes) 15/08/27 17:53:07 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, PROCESS_LOCAL, 1270 bytes) 15/08/27 17:53:07 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, localhost, PROCESS_LOCAL, 1270 bytes) 15/08/27 17:53:07 INFO Executor: Running task 2.0 in stage 0.0 (TID 2) 15/08/27 17:53:07 INFO Executor: Running task 0.0 in stage 0.0 (TID 0) 15/08/27 17:53:07 INFO Executor: Running task 1.0 in stage 0.0 (TID 1) 15/08/27 17:53:07 INFO CacheManager: Partition rdd_0_2 not found, computing it 15/08/27 17:53:07 INFO CacheManager: Partition rdd_0_1 not found, computing it 15/08/27 17:53:07 INFO CacheManager: Partition rdd_0_0 not found, computing it 15/08/27 17:53:07 INFO ExternalBlockStore: ExternalBlockStore started 15/08/27 17:53:08 WARN : tachyon.home is not set. Using /mnt/tachyon_default_home as the default value. 
15/08/27 17:53:08 INFO : Tachyon client (version 0.6.4) is trying to connect master @ localhost/127.0.0.1:19998 15/08/27 17:53:08 INFO : User registered at the master localhost/127.0.0.1:19998 got UserId 109 15/08/27 17:53:08 INFO TachyonBlockManager: Created tachyon directory at /spark/spark-c6ec419f-7c7d-48a6-8448-c2431e761ea5/driver/spark-tachyon-20150827175308-6aa5 15/08/27 17:53:08 INFO : Trying to get local worker host : localhost 15/08/27 17:53:08 INFO : Connecting local worker @ localhost/127.0.0.1:29998 15/08/27 17:53:08 INFO : Folder /mnt/ramdisk/tachyonworker/users/109 was created! 15/08/27 17:53:08 INFO : /mnt/ramdisk/tachyonworker/users/109/4386235351040 was created! 15/08/27 17:53:08 INFO : /mnt/ramdisk/tachyonworker/users/109/4388382834688 was created! 15/08/27 17:53:08 INFO BlockManagerInfo: Added rdd_0_0 on ExternalBlockStore on localhost:43776 (size: 0.0 B) 15/08/27 17:53:08 INFO BlockManagerInfo: Added rdd_0_1 on ExternalBlockStore on localhost:43776 (size: 2.0 B) 15/08/27 17:53:08 INFO BlockManagerInfo: Added rdd_0_2 on ExternalBlockStore on localhost:43776 (size: 2.0 B) 15/08/27 17:53:08 INFO BlockManager: Found block rdd_0_1 locally 15/08/27 17:53:08 INFO BlockManager: Found block rdd_0_2 locally 15/08/27 17:53:08 INFO Executor:
[jira] [Updated] (SPARK-10315) remove document on spark.akka.failure-detector.threshold
[ https://issues.apache.org/jira/browse/SPARK-10315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-10315: -- Priority: Minor (was: Major) remove document on spark.akka.failure-detector.threshold Key: SPARK-10315 URL: https://issues.apache.org/jira/browse/SPARK-10315 Project: Spark Issue Type: Bug Components: Documentation Reporter: Nan Zhu Priority: Minor this parameter is no longer used, and there is a mistake in the current documentation: the key should be 'akka.remote.watch-failure-detector.threshold' -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10316) respect nondeterministic expressions in PhysicalOperation
[ https://issues.apache.org/jira/browse/SPARK-10316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-10316: Description: We did a lot of special handling for respect nondeterministic expressions in PhysicalOperation - Key: SPARK-10316 URL: https://issues.apache.org/jira/browse/SPARK-10316 Project: Spark Issue Type: Bug Components: SQL Reporter: Wenchen Fan We did a lot of special handling for -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10319) ALS training using PySpark throws a StackOverflowError
Velu nambi created SPARK-10319: -- Summary: ALS training using PySpark throws a StackOverflowError Key: SPARK-10319 URL: https://issues.apache.org/jira/browse/SPARK-10319 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.4.1 Environment: Windows 10, Spark 1.4.1 Reporter: Velu nambi When attempting to train a machine learning model using ALS in Spark's MLlib (1.4) on Windows, PySpark always terminates with a StackOverflowError. I tried adding the checkpoint as described in http://stackoverflow.com/a/31484461/36130 -- it doesn't seem to help. Here's the training code and stack trace:
{code:none}
ranks = [8, 12]
lambdas = [0.1, 10.0]
numIters = [10, 20]
bestModel = None
bestValidationRmse = float("inf")
bestRank = 0
bestLambda = -1.0
bestNumIter = -1

for rank, lmbda, numIter in itertools.product(ranks, lambdas, numIters):
    ALS.checkpointInterval = 2
    model = ALS.train(training, rank, numIter, lmbda)
    validationRmse = computeRmse(model, validation, numValidation)
    if (validationRmse < bestValidationRmse):
        bestModel = model
        bestValidationRmse = validationRmse
        bestRank = rank
        bestLambda = lmbda
        bestNumIter = numIter

testRmse = computeRmse(bestModel, test, numTest)
{code}
Stacktrace: 15/08/27 02:02:58 ERROR Executor: Exception in task 3.0 in stage 56.0 (TID 127) java.lang.StackOverflowError at java.io.ObjectInputStream$BlockDataInputStream.readInt(Unknown Source) at java.io.ObjectInputStream.readHandle(Unknown Source) at java.io.ObjectInputStream.readClassDesc(Unknown Source) at java.io.ObjectInputStream.readOrdinaryObject(Unknown Source) at java.io.ObjectInputStream.readObject0(Unknown Source) at java.io.ObjectInputStream.defaultReadFields(Unknown Source) at java.io.ObjectInputStream.readSerialData(Unknown Source) at java.io.ObjectInputStream.readOrdinaryObject(Unknown Source) at java.io.ObjectInputStream.readObject0(Unknown Source) at java.io.ObjectInputStream.defaultReadFields(Unknown Source) at java.io.ObjectInputStream.readSerialData(Unknown Source) at java.io.ObjectInputStream.readOrdinaryObject(Unknown Source) at java.io.ObjectInputStream.readObject0(Unknown Source) at java.io.ObjectInputStream.defaultReadFields(Unknown Source) at java.io.ObjectInputStream.readSerialData(Unknown Source) at java.io.ObjectInputStream.readOrdinaryObject(Unknown Source) at java.io.ObjectInputStream.readObject0(Unknown Source) at java.io.ObjectInputStream.readObject(Unknown Source) at scala.collection.immutable.$colon$colon.readObject(List.scala:362) at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at java.lang.reflect.Method.invoke(Unknown Source) at java.io.ObjectStreamClass.invokeReadObject(Unknown Source) at java.io.ObjectInputStream.readSerialData(Unknown Source) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10318) Getting issue in spark connectivity with cassandra
[ https://issues.apache.org/jira/browse/SPARK-10318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14717139#comment-14717139 ] Sean Owen commented on SPARK-10318: --- I personally don't know, but if this is a question about JDBC + Cassandra it should go to the Cassandra mailing list first. If it's about the DataStax driver ask DataStax. If you suspect it really might have to do with Spark, I'd ask u...@spark.apache.org. A JIRA isn't the right step at this point since it's not clear there is even a problem in Spark here. Getting issue in spark connectivity with cassandra -- Key: SPARK-10318 URL: https://issues.apache.org/jira/browse/SPARK-10318 Project: Spark Issue Type: Test Components: SQL Affects Versions: 1.4.0 Environment: Spark on local mode with centos 6.x Reporter: Poorvi Lashkary Priority: Minor Use case: I have to create spark sql dataframe with the table on cassandra with jdbc driver. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10320) Support new topic subscriptions without requiring restart of the streaming context
[ https://issues.apache.org/jira/browse/SPARK-10320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14717220#comment-14717220 ] Sudarshan Kadambi commented on SPARK-10320: --- There is ingest-time analytics (independent application of transforms over data published to individual topics) and query-time analytics (user queries, which require joins across RDDs holding the transformed data). However, even ingest-time analytics will potentially require joins across data published to different topics. For these reasons, this needs to be a single Spark streaming application. Support new topic subscriptions without requiring restart of the streaming context -- Key: SPARK-10320 URL: https://issues.apache.org/jira/browse/SPARK-10320 Project: Spark Issue Type: New Feature Components: Streaming Reporter: Sudarshan Kadambi Spark Streaming lacks the ability to subscribe to newer topics or unsubscribe to current ones once the streaming context has been started. Restarting the streaming context increases the latency of update handling. Consider a streaming application subscribed to n topics. Let's say 1 of the topics is no longer needed in streaming analytics and hence should be dropped. We could do this by stopping the streaming context, removing that topic from the topic list and restarting the streaming context. Since with some DStreams such as DirectKafkaStream, the per-partition offsets are maintained by Spark, we should be able to resume uninterrupted (I think?) from where we left off with a minor delay. However, in instances where expensive state initialization (from an external datastore) may be needed for datasets published to all topics, before streaming updates can be applied to it, it is more convenient to only subscribe or unsubcribe to the incremental changes to the topic list. Without such a feature, updates go unprocessed for longer than they need to be, thus affecting QoS. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
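For context, topic subscription in the current Kafka direct API is fixed when the stream is constructed, which is why any change to the topic set requires tearing down the context today. A sketch against the Spark 1.4/1.5 API (broker address and topic names are placeholders; `ssc` is an existing StreamingContext):
{code}
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
val topics = Set("topicA", "topicB") // fixed for the lifetime of the streaming context

val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)
{code}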
[jira] [Commented] (SPARK-5741) Support the path contains comma in HiveContext
[ https://issues.apache.org/jira/browse/SPARK-5741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14717235#comment-14717235 ] koert kuipers commented on SPARK-5741: -- i am reading avro and csv mostly. but we try to support multiple inputs across a wide range of formats (currently avro, csv, json, and parquet). i realize parquet supports it, but it does so by explicitly working around the general infrastructure. i am sympathetic to the idea of no longer doing string munging, but that poses some challenges since the main vehicle to carry this information is a Map[String, String] (DataFrameReader.extraOptions). if we could come up with a general way to do this that does not involve string munging, i am happy to work on it. the ideal api in my view would be something like: sqlContext.read.format(...).paths(a, b) alternatively this could be expressed as a union operation of many dataframes, but i do not have the knowledge of the relevant code to understand if that is feasible, scalable and will support predicate pushdown and such. but if that works then i have no need for multiple inputs in DataFrameReader... from what i know from other projects such as scalding, i think it is a very common request to be able to support multiple paths, and you would exclude a significant userbase by not supporting it. but that's just a guess... Support the path contains comma in HiveContext -- Key: SPARK-5741 URL: https://issues.apache.org/jira/browse/SPARK-5741 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Yadong Qi Assignee: Yadong Qi Fix For: 1.3.0 When run ```select * from nzhang_part where hr = 'file,';```, it throws exception ```java.lang.IllegalArgumentException: Can not create a Path from an empty string```. Because the path of hdfs contains comma, and FileInputFormat.setInputPaths will split path by comma. 
### SQL ### set hive.merge.mapfiles=true; set hive.merge.mapredfiles=true; set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat; set hive.exec.dynamic.partition=true; set hive.exec.dynamic.partition.mode=nonstrict; create table nzhang_part like srcpart; insert overwrite table nzhang_part partition (ds='2010-08-15', hr) select key, value, hr from srcpart where ds='2008-04-08'; insert overwrite table nzhang_part partition (ds='2010-08-15', hr=11) select key, value from srcpart where ds='2008-04-08'; insert overwrite table nzhang_part partition (ds='2010-08-15', hr) select * from ( select key, value, hr from srcpart where ds='2008-04-08' union all select '1' as key, '1' as value, 'file,' as hr from src limit 1) s; select * from nzhang_part where hr = 'file,'; ### Error log ### 15/02/10 14:33:16 ERROR SparkSQLDriver: Failed in [select * from nzhang_part where hr = 'file,'] java.lang.IllegalArgumentException: Can not create a Path from an empty string at org.apache.hadoop.fs.Path.checkPathArg(Path.java:127) at org.apache.hadoop.fs.Path.init(Path.java:135) at org.apache.hadoop.util.StringUtils.stringToPath(StringUtils.java:241) at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:400) at org.apache.spark.sql.hive.HadoopTableReader$.initializeLocalJobConfFunc(TableReader.scala:251) at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$11.apply(TableReader.scala:229) at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$11.apply(TableReader.scala:229) at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:172) at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:172) at scala.Option.map(Option.scala:145) at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:172) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:196) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:223) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:221) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:221) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:223) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:221) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:221) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
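A workaround in the spirit of koert's union suggestion above, sketched against the Spark 1.4 DataFrame API; the paths and the Avro format are just examples, and whether this preserves predicate pushdown as well as native multi-path support is exactly the open question he raises:
{code}
// union one DataFrame per path instead of passing a comma-separated path string
val paths = Seq("/data/ds=2015-08-01", "/data/ds=2015-08-02")
val df = paths
  .map(p => sqlContext.read.format("com.databricks.spark.avro").load(p))
  .reduce(_ unionAll _)
{code}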
[jira] [Commented] (SPARK-10317) start-history-server.sh CLI parsing incompatible with HistoryServer's arg parsing
[ https://issues.apache.org/jira/browse/SPARK-10317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14716819#comment-14716819 ] Steve Loughran commented on SPARK-10317: There are various possible fixes here: # {{start-history-server}} script to {{shift;}} out $1 arg then pass the remainder down. # {{start-history-server}} script to prefix $1 arg with {{-d}} while passing down the whole line. # {{HistoryServerArguments}} to convert $1 arg to a directory unless it's a recognised {{-}} option start-history-server.sh CLI parsing incompatible with HistoryServer's arg parsing - Key: SPARK-10317 URL: https://issues.apache.org/jira/browse/SPARK-10317 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.1 Reporter: Steve Loughran The history server has its argument parsing class in {{HistoryServerArguments}}. However, this doesn't get involved in the {{start-history-server.sh}} codepath where the $0 arg is assigned to {{spark.history.fs.logDirectory}} and all other arguments are discarded (e.g. {{--property-file}}). This stops the other options being usable from this script. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
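A sketch of option 3 above (not the actual Spark code, just the shape such a parser could take; the field names logDir and propertiesFile and the helper printUsageAndExit are assumptions):
{code}
private def parse(args: List[String]): Unit = args match {
  case ("--dir" | "-d") :: value :: tail =>
    logDir = value
    parse(tail)
  case "--properties-file" :: value :: tail =>
    propertiesFile = value
    parse(tail)
  // option 3: accept a bare leading argument as the log directory
  case dir :: tail if !dir.startsWith("-") =>
    logDir = dir
    parse(tail)
  case Nil => // done
  case _ =>
    printUsageAndExit(1)
}
{code}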
[jira] [Commented] (SPARK-10318) Getting issue in spark connectivity with cassandra
[ https://issues.apache.org/jira/browse/SPARK-10318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14717049#comment-14717049 ] Poorvi Lashkary commented on SPARK-10318: - I have done the following: private static final String C_DRIVER = "org.apache.cassandra.cql.jdbc.CassandraDriver"; private static final String Cassandra_USERNAME = "abc"; private static final String C_PWD = "abc123"; private static final String C_CONNECTION_URL = "jdbc:cassandra://localhost:9160/MyKeyspace?user=" + Cassandra_USERNAME + "&password=" + C_PWD; Map<String, String> options = new HashMap<String, String>(); options.put("driver", C_DRIVER); options.put("url", C_CONNECTION_URL); options.put("dbtable", "test"); DataFrame jdbcDF = sc.load("jdbc", options); jdbcDF.registerTempTable("datafrm"); DataFrame d = sc.sql("select * from datafrm"); d.count(); then I get the following error: InvalidRequestException(why:line 1:25 no viable alternative at input '1' (SELECT * FROM test WHERE [1]...)) I don't understand why the WHERE clause is here. Must we fetch with a WHERE clause? Getting issue in spark connectivity with cassandra -- Key: SPARK-10318 URL: https://issues.apache.org/jira/browse/SPARK-10318 Project: Spark Issue Type: Test Components: SQL Affects Versions: 1.4.0 Environment: Spark on local mode with centos 6.x Reporter: Poorvi Lashkary Priority: Minor Use case: I have to create spark sql dataframe with the table on cassandra with jdbc driver. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
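On the unexpected WHERE clause: Spark's JDBC data source resolves a table's schema by issuing a probe query of roughly the following shape, which CQL's parser rejects, so the failure most likely happens during schema resolution, before any rows are fetched:
{code}
-- schema-resolution probe issued by the JDBC source (approximate form)
SELECT * FROM test WHERE 1=0
{code}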
[jira] [Commented] (SPARK-10318) Getting issue in spark connectivity with cassandra
[ https://issues.apache.org/jira/browse/SPARK-10318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14717093#comment-14717093 ] Poorvi Lashkary commented on SPARK-10318: - Can you suggest a way to establish a JDBC connection with Cassandra without using DataStax, i.e. with a plain JDBC connection? Getting issue in spark connectivity with cassandra -- Key: SPARK-10318 URL: https://issues.apache.org/jira/browse/SPARK-10318 Project: Spark Issue Type: Test Components: SQL Affects Versions: 1.4.0 Environment: Spark on local mode with centos 6.x Reporter: Poorvi Lashkary Priority: Minor Use case: I have to create spark sql dataframe with the table on cassandra with jdbc driver. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10319) ALS training using PySpark throws a StackOverflowError
[ https://issues.apache.org/jira/browse/SPARK-10319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14717134#comment-14717134 ] Sean Owen commented on SPARK-10319: --- Definitely sounds like https://issues.apache.org/jira/browse/SPARK-5955, so either somehow the checkpoint interval isn't taking effect, or this is actually slightly different. If you scroll way way back, what's at the top of the stack? or is it truncated? Does it work with some number of iterations but not others? do you see evidence of checkpointing in the logs? ALS training using PySpark throws a StackOverflowError -- Key: SPARK-10319 URL: https://issues.apache.org/jira/browse/SPARK-10319 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.4.1 Environment: Windows 10, Spark 1.4.1 Reporter: Velu nambi When attempting to train a machine learning model using ALS in Spark's MLlib (1.4) on Windows, PySpark always terminates with a StackOverflowError. I tried adding the checkpoint as described in http://stackoverflow.com/a/31484461/36130 -- it doesn't seem to help. Here's the training code and stack trace:
{code:none}
ranks = [8, 12]
lambdas = [0.1, 10.0]
numIters = [10, 20]
bestModel = None
bestValidationRmse = float("inf")
bestRank = 0
bestLambda = -1.0
bestNumIter = -1

for rank, lmbda, numIter in itertools.product(ranks, lambdas, numIters):
    ALS.checkpointInterval = 2
    model = ALS.train(training, rank, numIter, lmbda)
    validationRmse = computeRmse(model, validation, numValidation)
    if (validationRmse < bestValidationRmse):
        bestModel = model
        bestValidationRmse = validationRmse
        bestRank = rank
        bestLambda = lmbda
        bestNumIter = numIter

testRmse = computeRmse(bestModel, test, numTest)
{code}
Stacktrace: 15/08/27 02:02:58 ERROR Executor: Exception in task 3.0 in stage 56.0 (TID 127) java.lang.StackOverflowError at java.io.ObjectInputStream$BlockDataInputStream.readInt(Unknown Source) at java.io.ObjectInputStream.readHandle(Unknown Source) at java.io.ObjectInputStream.readClassDesc(Unknown Source) at java.io.ObjectInputStream.readOrdinaryObject(Unknown Source) at java.io.ObjectInputStream.readObject0(Unknown Source) at java.io.ObjectInputStream.defaultReadFields(Unknown Source) at java.io.ObjectInputStream.readSerialData(Unknown Source) at java.io.ObjectInputStream.readOrdinaryObject(Unknown Source) at java.io.ObjectInputStream.readObject0(Unknown Source) at java.io.ObjectInputStream.defaultReadFields(Unknown Source) at java.io.ObjectInputStream.readSerialData(Unknown Source) at java.io.ObjectInputStream.readOrdinaryObject(Unknown Source) at java.io.ObjectInputStream.readObject0(Unknown Source) at java.io.ObjectInputStream.defaultReadFields(Unknown Source) at java.io.ObjectInputStream.readSerialData(Unknown Source) at java.io.ObjectInputStream.readOrdinaryObject(Unknown Source) at java.io.ObjectInputStream.readObject0(Unknown Source) at java.io.ObjectInputStream.readObject(Unknown Source) at scala.collection.immutable.$colon$colon.readObject(List.scala:362) at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at java.lang.reflect.Method.invoke(Unknown Source) at java.io.ObjectStreamClass.invokeReadObject(Unknown Source) at java.io.ObjectInputStream.readSerialData(Unknown Source) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
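One thing worth verifying along those lines (a sketch, not a confirmed fix): the checkpoint interval is a no-op unless a checkpoint directory has been set on the SparkContext first, and without checkpointing the ALS lineage grows until deserialization overflows the stack. The Scala equivalent would be:
{code}
import org.apache.spark.mllib.recommendation.ALS

// checkpointing only takes effect once a checkpoint directory is set
sc.setCheckpointDir("/tmp/spark-checkpoints")

val model = ALS.train(training, rank, numIter, lmbda)
{code}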
[jira] [Updated] (SPARK-10320) Support new topic subscriptions without requiring restart of the streaming context
[ https://issues.apache.org/jira/browse/SPARK-10320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sudarshan Kadambi updated SPARK-10320: -- Description: Spark Streaming lacks the ability to subscribe to newer topics or unsubscribe to current ones once the streaming context has been started. Restarting the streaming context increases the latency of update handling. Consider a streaming application subscribed to n topics. Let's say 1 of the topics is no longer needed in streaming analytics and hence should be dropped. We could do this by stopping the streaming context, removing that topic from the topic list and restarting the streaming context. Since with some DStreams such as DirectKafkaStream, the per-partition offsets are maintained by Spark, we should be able to resume uninterrupted (I think?) from where we left off with a minor delay. However, in instances where expensive state initialization (from an external datastore) may be needed for datasets published to all topics, before streaming updates can be applied to it, it is more convenient to only subscribe or unsubcribe to the incremental changes to the topic list. Without such a feature, updates go unprocessed for longer than they need to be, thus affecting QoS. was: Spark Streaming lacks the ability to subscribe to newer topics or unsubscribe to current ones once the streaming context has been started. Restarting the streaming context increases the latency of update handling. Consider a streaming application subscribed to n topics. Let's say 1 of the topics is no longer needed in streaming analytics and hence should be dropped. We could do this by stopping the streaming context, removing that topic from the topic list and restarting the streaming context. Since with some DStreams such as DirectKafkaStream, the per-partition offsets are maintained by Spark, we should be able to resume uninterrupted (I think?) from where we left off with a minor delay. However, in instances where expensive state initialization (from an external datastore) may be needed for datasets published to all topics, before streaming updates can be applied to it, it is more convenient to only subscribe or unsubcribe to the incremental changes to the topic list. Without such a feature, updates go unprocessed for longer than they need to be affecting QoS. Support new topic subscriptions without requiring restart of the streaming context -- Key: SPARK-10320 URL: https://issues.apache.org/jira/browse/SPARK-10320 Project: Spark Issue Type: New Feature Components: Streaming Reporter: Sudarshan Kadambi Spark Streaming lacks the ability to subscribe to newer topics or unsubscribe to current ones once the streaming context has been started. Restarting the streaming context increases the latency of update handling. Consider a streaming application subscribed to n topics. Let's say 1 of the topics is no longer needed in streaming analytics and hence should be dropped. We could do this by stopping the streaming context, removing that topic from the topic list and restarting the streaming context. Since with some DStreams such as DirectKafkaStream, the per-partition offsets are maintained by Spark, we should be able to resume uninterrupted (I think?) from where we left off with a minor delay. 
However, in instances where expensive state initialization (from an external datastore) may be needed for datasets published to all topics, before streaming updates can be applied to it, it is more convenient to only subscribe or unsubcribe to the incremental changes to the topic list. Without such a feature, updates go unprocessed for longer than they need to be, thus affecting QoS. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10182) GeneralizedLinearModel doesn't unpersist cached data
[ https://issues.apache.org/jira/browse/SPARK-10182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-10182. --- Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 8395 [https://github.com/apache/spark/pull/8395] GeneralizedLinearModel doesn't unpersist cached data Key: SPARK-10182 URL: https://issues.apache.org/jira/browse/SPARK-10182 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.4.1 Reporter: Vyacheslav Baranov Assignee: Vyacheslav Baranov Priority: Minor Fix For: 1.6.0 The problem might be reproduced in spark-shell with the following code snippet:
{code}
import org.apache.spark.SparkContext
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val samples = Seq[LabeledPoint](
  LabeledPoint(1.0, Vectors.dense(1.0, 0.0)),
  LabeledPoint(1.0, Vectors.dense(0.0, 1.0)),
  LabeledPoint(0.0, Vectors.dense(1.0, 1.0)),
  LabeledPoint(0.0, Vectors.dense(0.0, 0.0))
)
val rdd = sc.parallelize(samples)

for (i <- 0 until 10) {
  val model = {
    new LogisticRegressionWithLBFGS()
      .setNumClasses(2)
      .run(rdd)
      .clearThreshold()
  }
}
{code}
After code execution there are 10 {{MapPartitionsRDD}} objects on the Storage tab in the Spark application UI. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
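For reference, the fix pattern involved (a sketch; see PR 8395 for the actual change, where `data` stands for the RDD that GeneralizedLinearAlgorithm caches internally during training):
{code}
import org.apache.spark.storage.StorageLevel

// at the end of GeneralizedLinearAlgorithm.run, release the internally cached RDD
if (data.getStorageLevel != StorageLevel.NONE) {
  data.unpersist(blocking = false)
}
{code}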
[jira] [Commented] (SPARK-10320) Support new topic subscriptions without requiring restart of the streaming context
[ https://issues.apache.org/jira/browse/SPARK-10320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14717184#comment-14717184 ] Sean Owen commented on SPARK-10320: --- It sounds like you listen to topics and process them fairly independently (as you should). Why not run multiple streaming apps? Sure, you incur some overhead, but you gain isolation and simplicity. Support new topic subscriptions without requiring restart of the streaming context -- Key: SPARK-10320 URL: https://issues.apache.org/jira/browse/SPARK-10320 Project: Spark Issue Type: New Feature Components: Streaming Reporter: Sudarshan Kadambi Spark Streaming lacks the ability to subscribe to newer topics or unsubscribe to current ones once the streaming context has been started. Restarting the streaming context increases the latency of update handling. Consider a streaming application subscribed to n topics. Let's say 1 of the topics is no longer needed in streaming analytics and hence should be dropped. We could do this by stopping the streaming context, removing that topic from the topic list and restarting the streaming context. Since with some DStreams such as DirectKafkaStream, the per-partition offsets are maintained by Spark, we should be able to resume uninterrupted (I think?) from where we left off with a minor delay. However, in instances where expensive state initialization (from an external datastore) may be needed for datasets published to all topics, before streaming updates can be applied to it, it is more convenient to only subscribe or unsubcribe to the incremental changes to the topic list. Without such a feature, updates go unprocessed for longer than they need to be, thus affecting QoS. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10321) OrcRelation doesn't override sizeInBytes
Cheng Lian created SPARK-10321: -- Summary: OrcRelation doesn't override sizeInBytes Key: SPARK-10321 URL: https://issues.apache.org/jira/browse/SPARK-10321 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Cheng Lian Priority: Critical This hurts performance badly because broadcast join can never be enabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
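For context, the planner only chooses a broadcast join when a relation reports a size below spark.sql.autoBroadcastJoinThreshold; BaseRelation.sizeInBytes otherwise defaults to a value above that threshold, so broadcasting is never picked. A sketch of the kind of override implied (not the actual patch; fileStatuses is a hypothetical handle on the relation's ORC files):
{code}
// report the on-disk footprint so that small ORC tables qualify for broadcast joins
override def sizeInBytes: Long = fileStatuses.map(_.getLen).sum
{code}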
[jira] [Commented] (SPARK-4240) Refine Tree Predictions in Gradient Boosting to Improve Prediction Accuracy.
[ https://issues.apache.org/jira/browse/SPARK-4240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14717209#comment-14717209 ] Seth Hendrickson commented on SPARK-4240: - [~josephkb] I think there needs to be some discussion of how and where this fits into the current boosting package architecture. Right now, the ML GBT algorithm just calls the MLlib implementation of GBTs. While the random forest algorithm has already been moved into the ML package, the GBT algorithm has not and I assume this is because we are waiting on the implementation/result of [SPARK-7129|https://issues.apache.org/jira/browse/SPARK-7129], which calls for a generic boosting algorithm. While this JIRA is specific to gradient boosted trees, it is still affected by the overall boosting architecture. I've got some code that implements the terminal node refinements in the MLlib implementation, but I suspect that there might be some resistance to changing MLlib's implementation. I can continue implementing this in MLlib if we decide that is the route we'd like to take. Otherwise, I think this work needs to wait until GBTs are moved to the ML package. Refine Tree Predictions in Gradient Boosting to Improve Prediction Accuracy. Key: SPARK-4240 URL: https://issues.apache.org/jira/browse/SPARK-4240 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.3.0 Reporter: Sung Chung The gradient boosting as currently implemented estimates the loss-gradient in each iteration using regression trees. At every iteration, the regression trees are trained/split to minimize predicted gradient variance. Additionally, the terminal node predictions are computed to minimize the prediction variance. However, such predictions won't be optimal for loss functions other than the mean-squared error. The TreeBoosting refinement can help mitigate this issue by modifying terminal node prediction values so that those predictions would directly minimize the actual loss function. Although this still doesn't change the fact that the tree splits were done through variance reduction, it should still lead to improvement in gradient estimations, and thus better performance. The details of this can be found in the R vignette. This paper also shows how to refine the terminal node predictions. http://www.saedsayad.com/docs/gbm2.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
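For reference, the terminal-node refinement under discussion is Friedman's TreeBoost step: after fitting a regression tree to the gradient, each leaf region R_jm has its prediction re-solved against the actual loss L instead of squared error:
{code}
\gamma_{jm} = \arg\min_{\gamma} \sum_{x_i \in R_{jm}} L\bigl(y_i,\, F_{m-1}(x_i) + \gamma\bigr),
\qquad
F_m(x) = F_{m-1}(x) + \nu \sum_j \gamma_{jm}\, \mathbf{1}[x \in R_{jm}]
{code}
with learning rate \nu; the tree structure itself is unchanged, only the leaf values move.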
[jira] [Created] (SPARK-10320) Support new topic subscriptions without requiring restart of the streaming context
Sudarshan Kadambi created SPARK-10320: - Summary: Support new topic subscriptions without requiring restart of the streaming context Key: SPARK-10320 URL: https://issues.apache.org/jira/browse/SPARK-10320 Project: Spark Issue Type: New Feature Components: Streaming Reporter: Sudarshan Kadambi Spark Streaming lacks the ability to subscribe to newer topics or unsubscribe to current ones once the streaming context has been started. Restarting the streaming context increases the latency of update handling. Consider a streaming application subscribed to n topics. Let's say 1 of the topics is no longer needed in streaming analytics and hence should be dropped. We could do this by stopping the streaming context, removing that topic from the topic list and restarting the streaming context. Since with some DStreams such as DirectKafkaStream, the per-partition offsets are maintained by Spark, we should be able to resume uninterrupted (I think?) from where we left off with a minor delay. However, in instances where expensive state initialization (from an external datastore) may be needed for datasets published to all topics, before streaming updates can be applied to it, it is more convenient to only subscribe or unsubcribe to the incremental changes to the topic list. Without such a feature, updates go unprocessed for longer than they need to be affecting QoS. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5741) Support the path contains comma in HiveContext
[ https://issues.apache.org/jira/browse/SPARK-5741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14717053#comment-14717053 ] koert kuipers commented on SPARK-5741: -- i realize i am late to the party but... by doing this you are losing a very important functionality: passing in multiple input paths comma separated. globs only cover a very limited subset of what you can do with multiple paths. for example selecting partitions (by day) for the last 30 days cannot be expressed with a glob. so you are giving up major functionality just to be able to pass in a character that people would generally advise should not be part of a filename anyhow? doesn't sound like a good idea to me. Support the path contains comma in HiveContext -- Key: SPARK-5741 URL: https://issues.apache.org/jira/browse/SPARK-5741 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Yadong Qi Assignee: Yadong Qi Fix For: 1.3.0 When run ```select * from nzhang_part where hr = 'file,';```, it throws exception ```java.lang.IllegalArgumentException: Can not create a Path from an empty string```. Because the path of hdfs contains comma, and FileInputFormat.setInputPaths will split path by comma. ### SQL ### set hive.merge.mapfiles=true; set hive.merge.mapredfiles=true; set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat; set hive.exec.dynamic.partition=true; set hive.exec.dynamic.partition.mode=nonstrict; create table nzhang_part like srcpart; insert overwrite table nzhang_part partition (ds='2010-08-15', hr) select key, value, hr from srcpart where ds='2008-04-08'; insert overwrite table nzhang_part partition (ds='2010-08-15', hr=11) select key, value from srcpart where ds='2008-04-08'; insert overwrite table nzhang_part partition (ds='2010-08-15', hr) select * from ( select key, value, hr from srcpart where ds='2008-04-08' union all select '1' as key, '1' as value, 'file,' as hr from src limit 1) s; select * from nzhang_part where hr = 'file,'; ### Error log ### 15/02/10 14:33:16 ERROR SparkSQLDriver: Failed in [select * from nzhang_part where hr = 'file,'] java.lang.IllegalArgumentException: Can not create a Path from an empty string at org.apache.hadoop.fs.Path.checkPathArg(Path.java:127) at org.apache.hadoop.fs.Path.init(Path.java:135) at org.apache.hadoop.util.StringUtils.stringToPath(StringUtils.java:241) at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:400) at org.apache.spark.sql.hive.HadoopTableReader$.initializeLocalJobConfFunc(TableReader.scala:251) at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$11.apply(TableReader.scala:229) at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$11.apply(TableReader.scala:229) at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:172) at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:172) at scala.Option.map(Option.scala:145) at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:172) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:196) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:223) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:221) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:221) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:223) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:221) 
at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:221) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:223) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:221) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:221) at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at
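The use case koert describes is easy to make concrete. Below is a hedged Scala sketch (the warehouse layout, paths, and use of java.time are illustrative assumptions, not from the issue): enumerating the last 30 daily partitions is a one-liner as a path list, but has no single-glob equivalent.
{code}
import java.time.LocalDate

// Hypothetical day-partitioned layout: one directory per day.
val days = (0 until 30).map(i => LocalDate.now.minusDays(i))
val paths = days.map(d => s"hdfs:///warehouse/events/day=$d")

// The pre-1.3 behaviour under discussion: one comma-joined string handed to
// FileInputFormat, which split it back into individual input paths.
val commaSeparated = paths.mkString(",")
{code}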
[jira] [Commented] (SPARK-6817) DataFrame UDFs in R
[ https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14717079#comment-14717079 ] Indrajit commented on SPARK-6817: -- Here are some suggestions on the proposed API. If the idea is to keep the API close to R's current primitives, we should avoid introducing too many new keywords. E.g., dapplyCollect can be expressed as collect(dapply(...)). Since collect already exists in Spark, and R users are comfortable with the syntax as part of dplyr, we should reuse the keyword instead of introducing a new function dapplyCollect. Relying on existing syntax will reduce the learning curve for users. Was performance the primary intent to introduce dapplyCollect instead of collect(dapply(...))? Similarly, can we do away with gapply and gapplyCollect, and express it using dapply? In R, the function split provides grouping (https://stat.ethz.ch/R-manual/R-devel/library/base/html/split.html). One should be able to implement split using GroupBy in Spark. gapply can then be expressed in terms of dapply and split, and gapplyCollect will become collect(dapply(..split..)). Here is a simple example that uses split and lapply in R: df <- data.frame(city=c("A","B","A","D"), age=c(10,12,23,5)); print(df); s <- split(df$age, df$city); lapply(s, mean) DataFrame UDFs in R --- Key: SPARK-6817 URL: https://issues.apache.org/jira/browse/SPARK-6817 Project: Spark Issue Type: New Feature Components: SparkR, SQL Reporter: Shivaram Venkataraman This depends on some internal interface of Spark SQL, should be done after merging into Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
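As a rough illustration of the claim that split-style grouping maps onto Spark's GroupBy, here is a hedged Scala DataFrame sketch of the same mean-age-per-city computation (sqlContext and the toy data are assumptions for the example, not part of the proposal):
{code}
// Scala analogue of the R split/lapply example above.
val df = sqlContext.createDataFrame(Seq(
  ("A", 10), ("B", 12), ("A", 23), ("D", 5)
)).toDF("city", "age")

// groupBy plays the role of split; mean plays the role of lapply(s, mean).
df.groupBy("city").mean("age").show()
{code}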
[jira] [Commented] (SPARK-5741) Support the path contains comma in HiveContext
[ https://issues.apache.org/jira/browse/SPARK-5741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14717165#comment-14717165 ] Michael Armbrust commented on SPARK-5741: - What format are you trying to read? There [are still ways to read more than one file|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L258], they just don't rely on brittle string munging anymore. Support the path contains comma in HiveContext -- Key: SPARK-5741 URL: https://issues.apache.org/jira/browse/SPARK-5741 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Yadong Qi Assignee: Yadong Qi Fix For: 1.3.0 When run ```select * from nzhang_part where hr = 'file,';```, it throws exception ```java.lang.IllegalArgumentException: Can not create a Path from an empty string```. Because the path of hdfs contains comma, and FileInputFormat.setInputPaths will split path by comma. ### SQL ### set hive.merge.mapfiles=true; set hive.merge.mapredfiles=true; set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat; set hive.exec.dynamic.partition=true; set hive.exec.dynamic.partition.mode=nonstrict; create table nzhang_part like srcpart; insert overwrite table nzhang_part partition (ds='2010-08-15', hr) select key, value, hr from srcpart where ds='2008-04-08'; insert overwrite table nzhang_part partition (ds='2010-08-15', hr=11) select key, value from srcpart where ds='2008-04-08'; insert overwrite table nzhang_part partition (ds='2010-08-15', hr) select * from ( select key, value, hr from srcpart where ds='2008-04-08' union all select '1' as key, '1' as value, 'file,' as hr from src limit 1) s; select * from nzhang_part where hr = 'file,'; ### Error log ### 15/02/10 14:33:16 ERROR SparkSQLDriver: Failed in [select * from nzhang_part where hr = 'file,'] java.lang.IllegalArgumentException: Can not create a Path from an empty string at org.apache.hadoop.fs.Path.checkPathArg(Path.java:127) at org.apache.hadoop.fs.Path.init(Path.java:135) at org.apache.hadoop.util.StringUtils.stringToPath(StringUtils.java:241) at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:400) at org.apache.spark.sql.hive.HadoopTableReader$.initializeLocalJobConfFunc(TableReader.scala:251) at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$11.apply(TableReader.scala:229) at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$11.apply(TableReader.scala:229) at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:172) at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:172) at scala.Option.map(Option.scala:145) at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:172) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:196) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:223) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:221) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:221) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:223) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:221) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:221) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:223) at 
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:221) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:221) at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.immutable.List.foreach(List.scala:318) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66)
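For reference, a minimal sketch of the multi-file read Michael links to, assuming a SQLContext named sqlContext and hypothetical paths; the parquet reader takes several paths as varargs rather than one comma-joined string:
{code}
val df = sqlContext.read.parquet(
  "hdfs:///warehouse/events/day=2015-08-26",
  "hdfs:///warehouse/events/day=2015-08-27")
{code}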
[jira] [Commented] (SPARK-10318) Getting issue in spark connectivity with cassandra
[ https://issues.apache.org/jira/browse/SPARK-10318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14717078#comment-14717078 ] Sean Owen commented on SPARK-10318: --- This is a Cassandra exception. I don't see that it's traceable to Spark? Getting issue in spark connectivity with cassandra -- Key: SPARK-10318 URL: https://issues.apache.org/jira/browse/SPARK-10318 Project: Spark Issue Type: Test Components: SQL Affects Versions: 1.4.0 Environment: Spark on local mode with centos 6.x Reporter: Poorvi Lashkary Priority: Minor Use case: I have to create a Spark SQL DataFrame from a table on Cassandra with the JDBC driver. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8292) ShortestPaths run with error result
[ https://issues.apache.org/jira/browse/SPARK-8292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14717115#comment-14717115 ] Anita Tailor commented on SPARK-8292: - Not an issue; it's a directed graph and there is no incoming edge for node 0. If we add one incoming edge 5\t0 to the test data mentioned above, it will give the following results, which look correct: (4,Map(0 -> 2)) (0,Map(0 -> 0)) (6,Map()) (2,Map()) (3,Map()) (5,Map(0 -> 1)) ShortestPaths run with error result --- Key: SPARK-8292 URL: https://issues.apache.org/jira/browse/SPARK-8292 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 1.3.1 Environment: Ubuntu 64bit Reporter: Bruce Chen Priority: Minor Labels: patch Attachments: ShortestPaths.patch In graphx/lib/ShortestPaths, I run an example with input data: 0\t2 0\t4 2\t3 3\t6 4\t2 4\t5 5\t3 5\t6 then I write a function, set point '0' as the source point, and calculate the shortest path from point 0 to the other points; the code looks like this: val source: Seq[VertexId] = Seq(0) val ss = ShortestPaths.run(graph, source) then, I get the run result of each vertex's shortest path value: (4,Map()) (0,Map(0 -> 0)) (6,Map()) (3,Map()) (5,Map()) (2,Map()) but the right result should be: (4,Map(0 -> 1)) (0,Map(0 -> 0)) (6,Map(0 -> 3)) (3,Map(0 -> 2)) (5,Map(0 -> 2)) (2,Map(0 -> 1)) so, I checked the source code of spark/graphx/src/main/scala/org/apache/spark/graphx/lib/ShortestPaths.scala and found a bug. The patch is listed in the following. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
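A small sketch reproducing Anita's check, assuming a SparkContext named sc. The extra 5 -> 0 edge makes landmark 0 reachable along edge direction, which is why the distance maps stop being empty:
{code}
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.graphx.lib.ShortestPaths

// Test data from the issue, plus the 5 -> 0 edge added in the comment.
val edges = sc.parallelize(Seq(
  Edge(0L, 2L, 1), Edge(0L, 4L, 1), Edge(2L, 3L, 1), Edge(3L, 6L, 1),
  Edge(4L, 2L, 1), Edge(4L, 5L, 1), Edge(5L, 3L, 1), Edge(5L, 6L, 1),
  Edge(5L, 0L, 1)))
val graph = Graph.fromEdges(edges, defaultValue = 0)

// Distances to landmark 0, following edge direction.
ShortestPaths.run(graph, Seq(0L)).vertices.collect.foreach(println)
// e.g. (5,Map(0 -> 1)), (4,Map(0 -> 2)), (0,Map(0 -> 0))
{code}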
[jira] [Resolved] (SPARK-10253) Remove Guava dependencies in MLlib java tests
[ https://issues.apache.org/jira/browse/SPARK-10253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-10253. --- Resolution: Fixed Fix Version/s: 1.6.0 Remove Guava dependencies in MLlib java tests - Key: SPARK-10253 URL: https://issues.apache.org/jira/browse/SPARK-10253 Project: Spark Issue Type: Improvement Components: ML, MLlib Reporter: Feynman Liang Assignee: Feynman Liang Priority: Minor Fix For: 1.6.0 Many tests depend on Google Guava's {{Lists.newArrayList}} when {{java.util.Arrays.asList}} could be used instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
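The change itself is mechanical; a one-line illustration (the values are made up):
{code}
// Before (Guava) and after (JDK only): same list contents.
val before = com.google.common.collect.Lists.newArrayList(1.0, 2.0, 3.0)
val after  = java.util.Arrays.asList(1.0, 2.0, 3.0)
{code}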
[jira] [Resolved] (SPARK-10257) Remove Guava dependencies in spark.mllib JavaTests
[ https://issues.apache.org/jira/browse/SPARK-10257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-10257. --- Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 8451 [https://github.com/apache/spark/pull/8451] Remove Guava dependencies in spark.mllib JavaTests -- Key: SPARK-10257 URL: https://issues.apache.org/jira/browse/SPARK-10257 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Feynman Liang Assignee: Feynman Liang Priority: Minor Fix For: 1.6.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9890) User guide for CountVectorizer
[ https://issues.apache.org/jira/browse/SPARK-9890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9890: --- Assignee: Apache Spark User guide for CountVectorizer -- Key: SPARK-9890 URL: https://issues.apache.org/jira/browse/SPARK-9890 Project: Spark Issue Type: Documentation Components: ML Reporter: Feynman Liang Assignee: Apache Spark SPARK-8703 added a count vectorizer as a ML transformer. We should add an accompanying user guide to {{ml-features}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9890) User guide for CountVectorizer
[ https://issues.apache.org/jira/browse/SPARK-9890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9890: --- Assignee: (was: Apache Spark) User guide for CountVectorizer -- Key: SPARK-9890 URL: https://issues.apache.org/jira/browse/SPARK-9890 Project: Spark Issue Type: Documentation Components: ML Reporter: Feynman Liang SPARK-8703 added a count vectorizer as a ML transformer. We should add an accompanying user guide to {{ml-features}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6918) Secure HBase with Kerberos does not work over YARN
[ https://issues.apache.org/jira/browse/SPARK-6918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14716806#comment-14716806 ] LINTE commented on SPARK-6918: -- Is this issue really fixed? I work with secure hadoop 2.7.1 / hbase 1.0.1 / spark 1.4.0 / zookeeper 3.4.5 When I run this simple code in yarn-client mode : -- import org.apache.hadoop.fs.Path; import org.apache.hadoop.hbase.HBaseConfiguration import org.apache.hadoop.hbase.mapreduce.TableInputFormat import org.apache.hadoop.hbase.io.ImmutableBytesWritable import org.apache.hadoop.hbase.client.Result val conf = HBaseConfiguration.create() conf.set(TableInputFormat.INPUT_TABLE, "ns:table") conf.addResource(new Path("/path/to/hbase/hbase-site.xml")); val rdd = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat], classOf[ImmutableBytesWritable], classOf[Result]) rdd.count() - I have the following error on my executor: 15/08/27 16:56:37 WARN ipc.AbstractRpcClient: Exception encountered while connecting to the server : javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)] 15/08/27 16:56:37 ERROR ipc.AbstractRpcClient: SASL authentication failed. The most likely cause is missing or invalid credentials. Consider 'kinit'. javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)] at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:212) at org.apache.hadoop.hbase.security.HBaseSaslRpcClient.saslConnect(HBaseSaslRpcClient.java:179) at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.setupSaslConnection(RpcClientImpl.java:604) at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.access$600(RpcClientImpl.java:153) at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection$2.run(RpcClientImpl.java:730) at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection$2.run(RpcClientImpl.java:727) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657) at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.setupIOstreams(RpcClientImpl.java:727) at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.writeRequest(RpcClientImpl.java:880) at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.tracedWriteRequest(RpcClientImpl.java:849) at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1173) at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:216) at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:300) at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$BlockingStub.scan(ClientProtos.java:31889) at org.apache.hadoop.hbase.client.ClientSmallScanner$SmallScannerCallable.call(ClientSmallScanner.java:202) at org.apache.hadoop.hbase.client.ClientSmallScanner$SmallScannerCallable.call(ClientSmallScanner.java:181) at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:126) at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:310) at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:291) at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt) at sun.security.jgss.krb5.Krb5InitCredential.getInstance(Krb5InitCredential.java:147) at sun.security.jgss.krb5.Krb5MechFactory.getCredentialElement(Krb5MechFactory.java:121) at sun.security.jgss.krb5.Krb5MechFactory.getMechanismContext(Krb5MechFactory.java:187) at sun.security.jgss.GSSManagerImpl.getMechanismContext(GSSManagerImpl.java:223) at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:212) at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:179) at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:193) ... 24 more Secure HBase with Kerberos does not work over YARN -- Key: SPARK-6918 URL:
[jira] [Commented] (SPARK-9890) User guide for CountVectorizer
[ https://issues.apache.org/jira/browse/SPARK-9890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14716836#comment-14716836 ] Apache Spark commented on SPARK-9890: - User 'hhbyyh' has created a pull request for this issue: https://github.com/apache/spark/pull/8487 User guide for CountVectorizer -- Key: SPARK-9890 URL: https://issues.apache.org/jira/browse/SPARK-9890 Project: Spark Issue Type: Documentation Components: ML Reporter: Feynman Liang SPARK-8703 added a count vectorizer as a ML transformer. We should add an accompanying user guide to {{ml-features}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
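A sketch of the kind of snippet such a guide section might show, assuming a SQLContext named sqlContext and a made-up corpus (the transformer itself is the one SPARK-8703 added to spark.ml):
{code}
import org.apache.spark.ml.feature.CountVectorizer

val df = sqlContext.createDataFrame(Seq(
  (0, Array("a", "b", "c")),
  (1, Array("a", "b", "b", "c", "a"))
)).toDF("id", "words")

// Fit a vocabulary over the corpus, then map each word array to a count vector.
val cvModel = new CountVectorizer()
  .setInputCol("words")
  .setOutputCol("features")
  .setVocabSize(3)
  .fit(df)
cvModel.transform(df).select("features").show()
{code}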
[jira] [Commented] (SPARK-10295) Dynamic allocation in Mesos does not release when RDDs are cached
[ https://issues.apache.org/jira/browse/SPARK-10295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14716992#comment-14716992 ] Marcelo Vanzin commented on SPARK-10295: In 1.5 executors with cached data are not released by default (and there's a separate timeout configuration for them). I think we just forgot to delete the log message. Dynamic allocation in Mesos does not release when RDDs are cached - Key: SPARK-10295 URL: https://issues.apache.org/jira/browse/SPARK-10295 Project: Spark Issue Type: Question Components: Mesos Affects Versions: 1.5.0 Environment: Spark 1.5.0 RC1 Centos 6 java 7 oracle Reporter: Hans van den Bogert Priority: Minor When running spark in coarse grained mode with shuffle service and dynamic allocation, the driver does not release executors if a dataset is cached. The console output OTOH shows: 15/08/26 17:29:58 WARN SparkContext: Dynamic allocation currently does not support cached RDDs. Cached data for RDD 9 will be lost when executors are removed. However after the default of 1m, executors are not released. When I perform the same initial setup, loading data, etc, but without caching, the executors are released. Is this intended behaviour? If this is intended behaviour, the console warning is misleading. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
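A hedged configuration sketch of the 1.5 behaviour described above; the separate timeout for executors holding cached blocks is effectively infinite by default, and the 120s value here is only an example:
{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  // Idle timeout for ordinary executors.
  .set("spark.dynamicAllocation.executorIdleTimeout", "60s")
  // Separate timeout for executors with cached blocks (infinite by default).
  .set("spark.dynamicAllocation.cachedExecutorIdleTimeout", "120s")
{code}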
[jira] [Created] (SPARK-10317) start-history-server.sh CLI parsing incompatible with HistoryServer's arg parsing
Steve Loughran created SPARK-10317: -- Summary: start-history-server.sh CLI parsing incompatible with HistoryServer's arg parsing Key: SPARK-10317 URL: https://issues.apache.org/jira/browse/SPARK-10317 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.1 Reporter: Steve Loughran The history server has its argument parsing class in {{HistoryServerArguments}}. However, this doesn't get involved in the {{start-history-server.sh}} codepath, where the $0 arg is assigned to {{spark.history.fs.logDirectory}} and all other arguments are discarded (e.g. {{--property-file}}). This stops the other options from being usable from this script. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10318) Getting issue in spark connectivity with cassandra
Poorvi Lashkary created SPARK-10318: --- Summary: Getting issue in spark connectivity with cassandra Key: SPARK-10318 URL: https://issues.apache.org/jira/browse/SPARK-10318 Project: Spark Issue Type: Test Components: SQL Affects Versions: 1.4.0 Environment: Spark on local mode with centos 6.x Reporter: Poorvi Lashkary Priority: Minor Use case: I have to create a Spark SQL DataFrame from a table on Cassandra with the JDBC driver. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5741) Support the path contains comma in HiveContext
[ https://issues.apache.org/jira/browse/SPARK-5741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14717258#comment-14717258 ] Michael Armbrust commented on SPARK-5741: - It was originally just parquet that would support more than one file, but now all HadoopFSRelations should. (which covers all but CSV, and we should upgrade that library too) I would be in favor of generalizing this support for at least these sources given the following constraints: - We must keep source/binary compatibility. - We should give good errors when the source does not support this feature. - For consistency, I'd prefer if we can just add a {{load(path: String*)}} (but I'm not sure if this is possible given the above). - {{paths(path: *)}} is okay, but I think I'd prefer if it was not the terminal operator. Support the path contains comma in HiveContext -- Key: SPARK-5741 URL: https://issues.apache.org/jira/browse/SPARK-5741 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Yadong Qi Assignee: Yadong Qi Fix For: 1.3.0 When run ```select * from nzhang_part where hr = 'file,';```, it throws exception ```java.lang.IllegalArgumentException: Can not create a Path from an empty string```. Because the path of hdfs contains comma, and FileInputFormat.setInputPaths will split path by comma. ### SQL ### set hive.merge.mapfiles=true; set hive.merge.mapredfiles=true; set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat; set hive.exec.dynamic.partition=true; set hive.exec.dynamic.partition.mode=nonstrict; create table nzhang_part like srcpart; insert overwrite table nzhang_part partition (ds='2010-08-15', hr) select key, value, hr from srcpart where ds='2008-04-08'; insert overwrite table nzhang_part partition (ds='2010-08-15', hr=11) select key, value from srcpart where ds='2008-04-08'; insert overwrite table nzhang_part partition (ds='2010-08-15', hr) select * from ( select key, value, hr from srcpart where ds='2008-04-08' union all select '1' as key, '1' as value, 'file,' as hr from src limit 1) s; select * from nzhang_part where hr = 'file,'; ### Error log ### 15/02/10 14:33:16 ERROR SparkSQLDriver: Failed in [select * from nzhang_part where hr = 'file,'] java.lang.IllegalArgumentException: Can not create a Path from an empty string at org.apache.hadoop.fs.Path.checkPathArg(Path.java:127) at org.apache.hadoop.fs.Path.init(Path.java:135) at org.apache.hadoop.util.StringUtils.stringToPath(StringUtils.java:241) at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:400) at org.apache.spark.sql.hive.HadoopTableReader$.initializeLocalJobConfFunc(TableReader.scala:251) at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$11.apply(TableReader.scala:229) at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$11.apply(TableReader.scala:229) at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:172) at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:172) at scala.Option.map(Option.scala:145) at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:172) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:196) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:223) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:221) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:221) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at 
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:223) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:221) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:221) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:223) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:221) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:221) at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at
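To make the compatibility constraint concrete, a sketch of the two shapes under discussion; the varargs form is the hypothetical proposal, not an API that exists at this point:
{code}
// Existing single-path form, which must keep compiling and linking unchanged:
val one = sqlContext.read.format("parquet").load("/data/day=2015-08-27")

// Hypothetical varargs overload being proposed:
// val many = sqlContext.read.format("parquet").load(
//   "/data/day=2015-08-26", "/data/day=2015-08-27")
{code}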
[jira] [Commented] (SPARK-10319) ALS training using PySpark throws a StackOverflowError
[ https://issues.apache.org/jira/browse/SPARK-10319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14717270#comment-14717270 ] Velu nambi commented on SPARK-10319: bq. do you see evidence of checkpointing in the logs? Yes, I see a few files created in the Checkpoint directory. ALS training using PySpark throws a StackOverflowError -- Key: SPARK-10319 URL: https://issues.apache.org/jira/browse/SPARK-10319 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.4.1 Environment: Windows 10, spark - 1.4.1, Reporter: Velu nambi When attempting to train a machine learning model using ALS in Spark's MLLib (1.4) on windows, Pyspark always terminates with a StackoverflowError. I tried adding the checkpoint as described in http://stackoverflow.com/a/31484461/36130 -- doesn't seem to help. Here's the training code and stack trace: {code:none} ranks = [8, 12] lambdas = [0.1, 10.0] numIters = [10, 20] bestModel = None bestValidationRmse = float("inf") bestRank = 0 bestLambda = -1.0 bestNumIter = -1 for rank, lmbda, numIter in itertools.product(ranks, lambdas, numIters): ALS.checkpointInterval = 2 model = ALS.train(training, rank, numIter, lmbda) validationRmse = computeRmse(model, validation, numValidation) if (validationRmse < bestValidationRmse): bestModel = model bestValidationRmse = validationRmse bestRank = rank bestLambda = lmbda bestNumIter = numIter testRmse = computeRmse(bestModel, test, numTest) {code} Stacktrace: 15/08/27 02:02:58 ERROR Executor: Exception in task 3.0 in stage 56.0 (TID 127) java.lang.StackOverflowError at java.io.ObjectInputStream$BlockDataInputStream.readInt(Unknown Source) at java.io.ObjectInputStream.readHandle(Unknown Source) at java.io.ObjectInputStream.readClassDesc(Unknown Source) at java.io.ObjectInputStream.readOrdinaryObject(Unknown Source) at java.io.ObjectInputStream.readObject0(Unknown Source) at java.io.ObjectInputStream.defaultReadFields(Unknown Source) at java.io.ObjectInputStream.readSerialData(Unknown Source) at java.io.ObjectInputStream.readOrdinaryObject(Unknown Source) at java.io.ObjectInputStream.readObject0(Unknown Source) at java.io.ObjectInputStream.defaultReadFields(Unknown Source) at java.io.ObjectInputStream.readSerialData(Unknown Source) at java.io.ObjectInputStream.readOrdinaryObject(Unknown Source) at java.io.ObjectInputStream.readObject0(Unknown Source) at java.io.ObjectInputStream.defaultReadFields(Unknown Source) at java.io.ObjectInputStream.readSerialData(Unknown Source) at java.io.ObjectInputStream.readOrdinaryObject(Unknown Source) at java.io.ObjectInputStream.readObject0(Unknown Source) at java.io.ObjectInputStream.readObject(Unknown Source) at scala.collection.immutable.$colon$colon.readObject(List.scala:362) at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at java.lang.reflect.Method.invoke(Unknown Source) at java.io.ObjectStreamClass.invokeReadObject(Unknown Source) at java.io.ObjectInputStream.readSerialData(Unknown Source) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
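For context, the usual mitigation for deep ALS lineages, and the one the linked Stack Overflow answer suggests, is to give Spark a checkpoint directory; a hedged Scala sketch, assuming sc and an RDD[Rating] named ratings:
{code}
import org.apache.spark.mllib.recommendation.ALS

// Without a checkpoint directory, ALS.checkpointInterval has nothing to write
// to, and the long serialized lineage can overflow the stack on deserialization.
sc.setCheckpointDir("hdfs:///tmp/als-checkpoints")
val model = ALS.train(ratings, 8 /* rank */, 20 /* iterations */, 0.1 /* lambda */)
{code}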
[jira] [Resolved] (SPARK-9148) User-facing documentation for NaN handling semantics
[ https://issues.apache.org/jira/browse/SPARK-9148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-9148. - Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 8441 [https://github.com/apache/spark/pull/8441] User-facing documentation for NaN handling semantics Key: SPARK-9148 URL: https://issues.apache.org/jira/browse/SPARK-9148 Project: Spark Issue Type: Technical task Components: Documentation, SQL Reporter: Josh Rosen Priority: Critical Fix For: 1.5.0 Once we've finalized our NaN changes for Spark 1.5, we need to create user-facing documentation to explain our chosen semantics. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9316) Add support for filtering using `[` (synonym for filter / select)
[ https://issues.apache.org/jira/browse/SPARK-9316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14717297#comment-14717297 ] Felix Cheung commented on SPARK-9316: - https://issues.apache.org/jira/browse/SPARK-10322 Add support for filtering using `[` (synonym for filter / select) - Key: SPARK-9316 URL: https://issues.apache.org/jira/browse/SPARK-9316 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Shivaram Venkataraman Assignee: Felix Cheung Fix For: 1.6.0, 1.5.1 Will help us support queries of the form {code} air[air$UniqueCarrier %in% c("UA", "HA"), c(1,2,3,5:9)] {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10252) Update Spark SQL Programming Guide for Spark 1.5
[ https://issues.apache.org/jira/browse/SPARK-10252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-10252. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 8441 [https://github.com/apache/spark/pull/8441] Update Spark SQL Programming Guide for Spark 1.5 Key: SPARK-10252 URL: https://issues.apache.org/jira/browse/SPARK-10252 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 1.5.0 Reporter: Michael Armbrust Assignee: Michael Armbrust Priority: Critical Fix For: 1.5.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10322) Column %in% function is not working
Felix Cheung created SPARK-10322: Summary: Column %in% function is not working Key: SPARK-10322 URL: https://issues.apache.org/jira/browse/SPARK-10322 Project: Spark Issue Type: Bug Components: R Affects Versions: 1.5.0 Reporter: Felix Cheung $ sparkR ... df$age Column age filter(df$age %in% c(19, 30)) Error in filter(df$age %in% c(19, 30)) : error in evaluating the argument 'x' in selecting a method for function 'filter': Error in match(x, table, nomatch = 0L) : 'match' requires vector arguments filter(df$age %in% c(19, 30)) Error in filter(df$age %in% c(19, 30)) : error in evaluating the argument 'x' in selecting a method for function 'filter': Error in match(x, table, nomatch = 0L) : 'match' requires vector arguments filter(df$age %in% 30) Error in filter(df$age %in% 30) : error in evaluating the argument 'x' in selecting a method for function 'filter': Error in match(x, table, nomatch = 0L) : 'match' requires vector arguments -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10320) Support new topic subscriptions without requiring restart of the streaming context
[ https://issues.apache.org/jira/browse/SPARK-10320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14717342#comment-14717342 ] Cody Koeninger commented on SPARK-10320: As I said on the list, the best way to deal with this currently is to start a new app with your new code, before stopping the old app. In terms of a potential feature addition, I think there are a number of questions that would need to be cleared up... e.g. - when would you change topics? During a StreamingListener onBatchCompleted handler? From a separate thread? - when adding a topic, what would the expectations around starting offset be? As in the current api, provide explicit offsets per partition, start at beginning, or start at end? - if you add partitions for topics that currently exist, and specify a starting offset that's different from where the job is currently, what would the expectation be? - if you add, later remove, then later re-add a topic, what would the expectation regarding saved checkpoints be? Support new topic subscriptions without requiring restart of the streaming context -- Key: SPARK-10320 URL: https://issues.apache.org/jira/browse/SPARK-10320 Project: Spark Issue Type: New Feature Components: Streaming Reporter: Sudarshan Kadambi Spark Streaming lacks the ability to subscribe to newer topics or unsubscribe from current ones once the streaming context has been started. Restarting the streaming context increases the latency of update handling. Consider a streaming application subscribed to n topics. Let's say 1 of the topics is no longer needed in streaming analytics and hence should be dropped. We could do this by stopping the streaming context, removing that topic from the topic list and restarting the streaming context. Since with some DStreams such as DirectKafkaStream, the per-partition offsets are maintained by Spark, we should be able to resume uninterrupted (I think?) from where we left off with a minor delay. However, in instances where expensive state initialization (from an external datastore) may be needed for datasets published to all topics, before streaming updates can be applied to it, it is more convenient to only subscribe or unsubscribe to the incremental changes to the topic list. Without such a feature, updates go unprocessed for longer than they need to be, thus affecting QoS. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4105) FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle
[ https://issues.apache.org/jira/browse/SPARK-4105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14717374#comment-14717374 ] Marcel Mitsuto commented on SPARK-4105: --- mapPartitions at Exchange.scala:60 org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:635) org.apache.spark.sql.execution.Exchange$$anonfun$execute$1.apply(Exchange.scala:60) org.apache.spark.sql.execution.Exchange$$anonfun$execute$1.apply(Exchange.scala:49) org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:46) org.apache.spark.sql.execution.Exchange.execute(Exchange.scala:48) org.apache.spark.sql.execution.joins.HashOuterJoin.execute(HashOuterJoin.scala:188) org.apache.spark.sql.execution.Exchange$$anonfun$execute$1.apply(Exchange.scala:60) org.apache.spark.sql.execution.Exchange$$anonfun$execute$1.apply(Exchange.scala:49) org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:46) org.apache.spark.sql.execution.Exchange.execute(Exchange.scala:48) org.apache.spark.sql.execution.joins.HashOuterJoin.execute(HashOuterJoin.scala:188) org.apache.spark.sql.execution.Exchange$$anonfun$execute$1.apply(Exchange.scala:60) org.apache.spark.sql.execution.Exchange$$anonfun$execute$1.apply(Exchange.scala:49) org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:46) org.apache.spark.sql.execution.Exchange.execute(Exchange.scala:48) org.apache.spark.sql.execution.joins.HashOuterJoin.execute(HashOuterJoin.scala:188) org.apache.spark.sql.execution.Project.execute(basicOperators.scala:40) org.apache.spark.sql.parquet.InsertIntoParquetTable.execute(ParquetTableOperations.scala:265) org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:1099) org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:1099) 2015/08/27 19:32:12 2 s 4/2000 (29 failed) 61.8 MB 6.6 MB org.apache.spark.shuffle.FetchFailedException: FAILED_TO_UNCOMPRESS(5) FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle - Key: SPARK-4105 URL: https://issues.apache.org/jira/browse/SPARK-4105 Project: Spark Issue Type: Bug Components: Shuffle, Spark Core Affects Versions: 1.2.0, 1.2.1, 1.3.0 Reporter: Josh Rosen Assignee: Josh Rosen Priority: Blocker Attachments: JavaObjectToSerialize.java, SparkFailedToUncompressGenerator.scala We have seen non-deterministic {{FAILED_TO_UNCOMPRESS(5)}} errors during shuffle read.
Here's a sample stacktrace from an executor: {code} 14/10/23 18:34:11 ERROR Executor: Exception in task 1747.3 in stage 11.0 (TID 33053) java.io.IOException: FAILED_TO_UNCOMPRESS(5) at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:78) at org.xerial.snappy.SnappyNative.rawUncompress(Native Method) at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:391) at org.xerial.snappy.Snappy.uncompress(Snappy.java:427) at org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:127) at org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88) at org.xerial.snappy.SnappyInputStream.init(SnappyInputStream.java:58) at org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:128) at org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1090) at org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:116) at org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:115) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:243) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:52) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:129) at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159) at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158) at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) at
[jira] [Comment Edited] (SPARK-9316) Add support for filtering using `[` (synonym for filter / select)
[ https://issues.apache.org/jira/browse/SPARK-9316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14717297#comment-14717297 ] Felix Cheung edited comment on SPARK-9316 at 8/27/15 6:58 PM: -- (updated) Hi, there shouldn't be any change to filter(); the following filter worked previously for me: subsetdf <- filter(df, "age in (19, 30)"). I tried this just now and it worked (with Shivaram's fix). was (Author: felixcheung): https://issues.apache.org/jira/browse/SPARK-10322 Add support for filtering using `[` (synonym for filter / select) - Key: SPARK-9316 URL: https://issues.apache.org/jira/browse/SPARK-9316 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Shivaram Venkataraman Assignee: Felix Cheung Fix For: 1.6.0, 1.5.1 Will help us support queries of the form {code} air[air$UniqueCarrier %in% c("UA", "HA"), c(1,2,3,5:9)] {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10315) remove document on spark.akka.failure-detector.threshold
[ https://issues.apache.org/jira/browse/SPARK-10315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-10315. --- Resolution: Fixed Fix Version/s: 1.5.1 1.6.0 Issue resolved by pull request 8483 [https://github.com/apache/spark/pull/8483] remove document on spark.akka.failure-detector.threshold Key: SPARK-10315 URL: https://issues.apache.org/jira/browse/SPARK-10315 Project: Spark Issue Type: Bug Components: Documentation Reporter: Nan Zhu Priority: Minor Fix For: 1.6.0, 1.5.1 this parameter is no longer used, and there is a mistake in the current document; it should be 'akka.remote.watch-failure-detector.threshold' -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10322) Column %in% function is not exported
[ https://issues.apache.org/jira/browse/SPARK-10322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-10322: - Summary: Column %in% function is not exported (was: Column %in% function is not working) Column %in% function is not exported Key: SPARK-10322 URL: https://issues.apache.org/jira/browse/SPARK-10322 Project: Spark Issue Type: Bug Components: R Affects Versions: 1.5.0 Reporter: Felix Cheung $ sparkR ... df$age Column age filter(df$age %in% c(19, 30)) Error in filter(df$age %in% c(19, 30)) : error in evaluating the argument 'x' in selecting a method for function 'filter': Error in match(x, table, nomatch = 0L) : 'match' requires vector arguments filter(df$age %in% c(19, 30)) Error in filter(df$age %in% c(19, 30)) : error in evaluating the argument 'x' in selecting a method for function 'filter': Error in match(x, table, nomatch = 0L) : 'match' requires vector arguments filter(df$age %in% 30) Error in filter(df$age %in% 30) : error in evaluating the argument 'x' in selecting a method for function 'filter': Error in match(x, table, nomatch = 0L) : 'match' requires vector arguments -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9316) Add support for filtering using `[` (synonym for filter / select)
[ https://issues.apache.org/jira/browse/SPARK-9316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14717328#comment-14717328 ] Felix Cheung commented on SPARK-9316: - As for this: subsetdf <- df["age in (19, 30)", 1:2] Error in df["age in (19, 30)", 1:2] : object of type 'S4' is not subsettable I believe this should be `df["age in (19, 30)", 1:2]` instead? I had an iteration of the change to port this, but the code turns out to be convoluted, as the character vector can also have something like "age" that matches a column. Please let us know if this is something we should support. Add support for filtering using `[` (synonym for filter / select) - Key: SPARK-9316 URL: https://issues.apache.org/jira/browse/SPARK-9316 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Shivaram Venkataraman Assignee: Felix Cheung Fix For: 1.6.0, 1.5.1 Will help us support queries of the form {code} air[air$UniqueCarrier %in% c("UA", "HA"), c(1,2,3,5:9)] {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-10319) ALS training using PySpark throws a StackOverflowError
[ https://issues.apache.org/jira/browse/SPARK-10319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14717263#comment-14717263 ] Velu nambi edited comment on SPARK-10319 at 8/27/15 6:42 PM: - Yes it does seem similar to SPARK-5955, it works when I reduce the iterations to [5,10] (currently set to [10,20]). Here is the small stack trace from top of the stack, let me know 5/08/27 10:35:07 INFO DAGScheduler: Job 12 failed: count at ALS.scala:243, took 3.083999 s Traceback (most recent call last): File C:\Program Files (x86)\JetBrains\PyCharm Community Edition 4.5.3\helpers\pydev\pydevd.py, line 2358, in module globals = debugger.run(setup['file'], None, None, is_module) File C:\Program Files (x86)\JetBrains\PyCharm Community Edition 4.5.3\helpers\pydev\pydevd.py, line 1778, in run pydev_imports.execfile(file, globals, locals) # execute the script File C:/Users/PycharmProjects/MovieLensALS/MovieLensALS.py, line 129, in module model = ALS.train(training, rank, numIter, lmbda) File C:\spark-1.4.1\python\pyspark\mllib\recommendation.py, line 194, in train lambda_, blocks, nonnegative, seed) File C:\spark-1.4.1\python\pyspark\mllib\common.py, line 128, in callMLlibFunc return callJavaFunc(sc, api, *args) File C:\spark-1.4.1\python\pyspark\mllib\common.py, line 121, in callJavaFunc return _java2py(sc, func(*args)) File C:\Users\PyCharmVirtualEnv\MovieLensALSVirtEnv\lib\site-packages\py4j\java_gateway.py, line 813, in __call__ answer, self.gateway_client, self.target_id, self.name) File C:\Users\PyCharmVirtualEnv\MovieLensALSVirtEnv\lib\site-packages\py4j\protocol.py, line 308, in get_return_value format(target_id, ., name), value) py4j.protocol.Py4JJavaError: An error occurred while calling o145.trainALSModel. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 56.0 failed 1 times, most recent failure: Lost task 0.0 in stage 56.0 (TID 124, localhost): java.lang.StackOverflowError at java.io.ObjectInputStream$BlockDataInputStream.readInt(Unknown Source) at java.io.ObjectInputStream.readHandle(Unknown Source) at java.io.ObjectInputStream.readClassDesc(Unknown Source) at java.io.ObjectInputStream.readOrdinaryObject(Unknown Source) at java.io.ObjectInputStream.readObject0(Unknown Source) at java.io.ObjectInputStream.defaultReadFields(Unknown Source) at java.io.ObjectInputStream.readSerialData(Unknown Source) at java.io.ObjectInputStream.readOrdinaryObject(Unknown Source) at java.io.ObjectInputStream.readObject0(Unknown Source) at java.io.ObjectInputStream.readObject(Unknown Source) at scala.collection.immutable.$colon$colon.readObject(List.scala:362) at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at java.lang.reflect.Method.invoke(Unknown Source) at java.io.ObjectStreamClass.invokeReadObject(Unknown Source) at java.io.ObjectInputStream.readSerialData(Unknown Source) at java.io.ObjectInputStream.readOrdinaryObject(Unknown Source) at java.io.ObjectInputStream.readObject0(Unknown Source) at java.io.ObjectInputStream.defaultReadFields(Unknown Source) at java.io.ObjectInputStream.readSerialData(Unknown Source) at java.io.ObjectInputStream.readOrdinaryObject(Unknown Source) at java.io.ObjectInputStream.readObject0(Unknown Source) at java.io.ObjectInputStream.defaultReadFields(Unknown Source) at java.io.ObjectInputStream.readSerialData(Unknown Source) at java.io.ObjectInputStream.readOrdinaryObject(Unknown Source) at 
java.io.ObjectInputStream.readObject0(Unknown Source) at java.io.ObjectInputStream.readObject(Unknown Source) at scala.collection.immutable.$colon$colon.readObject(List.scala:366) at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at java.lang.reflect.Method.invoke(Unknown Source) at java.io.ObjectStreamClass.invokeReadObject(Unknown Source) at java.io.ObjectInputStream.readSerialData(Unknown Source) at java.io.ObjectInputStream.readOrdinaryObject(Unknown Source) at java.io.ObjectInputStream.readObject0(Unknown Source) at java.io.ObjectInputStream.defaultReadFields(Unknown Source) at java.io.ObjectInputStream.readSerialData(Unknown Source) at java.io.ObjectInputStream.readOrdinaryObject(Unknown Source) at java.io.ObjectInputStream.readObject0(Unknown Source) at java.io.ObjectInputStream.defaultReadFields(Unknown Source) at java.io.ObjectInputStream.readSerialData(Unknown Source) at
[jira] [Commented] (SPARK-10319) ALS training using PySpark throws a StackOverflowError
[ https://issues.apache.org/jira/browse/SPARK-10319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14717263#comment-14717263 ] Velu nambi commented on SPARK-10319: Yes it does seem similar to SPARK-5955, it works when I reduce the iterations to [5,10] (currently set to [10,20]). Here is the small stack trace from top of the stack, let me know 5/08/27 10:35:07 INFO DAGScheduler: Job 12 failed: count at ALS.scala:243, took 3.083999 s Traceback (most recent call last): File C:\Program Files (x86)\JetBrains\PyCharm Community Edition 4.5.3\helpers\pydev\pydevd.py, line 2358, in module globals = debugger.run(setup['file'], None, None, is_module) File C:\Program Files (x86)\JetBrains\PyCharm Community Edition 4.5.3\helpers\pydev\pydevd.py, line 1778, in run pydev_imports.execfile(file, globals, locals) # execute the script File C:/Users/PycharmProjects/MovieLensALS/MovieLensALS.py, line 129, in module model = ALS.train(training, rank, numIter, lmbda) File C:\spark-1.4.1\python\pyspark\mllib\recommendation.py, line 194, in train lambda_, blocks, nonnegative, seed) File C:\spark-1.4.1\python\pyspark\mllib\common.py, line 128, in callMLlibFunc return callJavaFunc(sc, api, *args) File C:\spark-1.4.1\python\pyspark\mllib\common.py, line 121, in callJavaFunc return _java2py(sc, func(*args)) File C:\Users\PyCharmVirtualEnv\MovieLensALSVirtEnv\lib\site-packages\py4j\java_gateway.py, line 813, in __call__ answer, self.gateway_client, self.target_id, self.name) File C:\Users\PyCharmVirtualEnv\MovieLensALSVirtEnv\lib\site-packages\py4j\protocol.py, line 308, in get_return_value format(target_id, ., name), value) py4j.protocol.Py4JJavaError: An error occurred while calling o145.trainALSModel. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 56.0 failed 1 times, most recent failure: Lost task 0.0 in stage 56.0 (TID 124, localhost): java.lang.StackOverflowError at java.io.ObjectInputStream$BlockDataInputStream.readInt(Unknown Source) at java.io.ObjectInputStream.readHandle(Unknown Source) at java.io.ObjectInputStream.readClassDesc(Unknown Source) at java.io.ObjectInputStream.readOrdinaryObject(Unknown Source) at java.io.ObjectInputStream.readObject0(Unknown Source) at java.io.ObjectInputStream.defaultReadFields(Unknown Source) at java.io.ObjectInputStream.readSerialData(Unknown Source) at java.io.ObjectInputStream.readOrdinaryObject(Unknown Source) at java.io.ObjectInputStream.readObject0(Unknown Source) at java.io.ObjectInputStream.readObject(Unknown Source) at scala.collection.immutable.$colon$colon.readObject(List.scala:362) at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at java.lang.reflect.Method.invoke(Unknown Source) at java.io.ObjectStreamClass.invokeReadObject(Unknown Source) at java.io.ObjectInputStream.readSerialData(Unknown Source) at java.io.ObjectInputStream.readOrdinaryObject(Unknown Source) at java.io.ObjectInputStream.readObject0(Unknown Source) at java.io.ObjectInputStream.defaultReadFields(Unknown Source) at java.io.ObjectInputStream.readSerialData(Unknown Source) at java.io.ObjectInputStream.readOrdinaryObject(Unknown Source) at java.io.ObjectInputStream.readObject0(Unknown Source) at java.io.ObjectInputStream.defaultReadFields(Unknown Source) at java.io.ObjectInputStream.readSerialData(Unknown Source) at java.io.ObjectInputStream.readOrdinaryObject(Unknown Source) at java.io.ObjectInputStream.readObject0(Unknown Source) at 
java.io.ObjectInputStream.readObject(Unknown Source) at scala.collection.immutable.$colon$colon.readObject(List.scala:366) at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at java.lang.reflect.Method.invoke(Unknown Source) at java.io.ObjectStreamClass.invokeReadObject(Unknown Source) at java.io.ObjectInputStream.readSerialData(Unknown Source) at java.io.ObjectInputStream.readOrdinaryObject(Unknown Source) at java.io.ObjectInputStream.readObject0(Unknown Source) at java.io.ObjectInputStream.defaultReadFields(Unknown Source) at java.io.ObjectInputStream.readSerialData(Unknown Source) at java.io.ObjectInputStream.readOrdinaryObject(Unknown Source) at java.io.ObjectInputStream.readObject0(Unknown Source) at java.io.ObjectInputStream.defaultReadFields(Unknown Source) at java.io.ObjectInputStream.readSerialData(Unknown Source) at java.io.ObjectInputStream.readOrdinaryObject(Unknown Source) at
[jira] [Assigned] (SPARK-9148) User-facing documentation for NaN handling semantics
[ https://issues.apache.org/jira/browse/SPARK-9148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust reassigned SPARK-9148: --- Assignee: Michael Armbrust User-facing documentation for NaN handling semantics Key: SPARK-9148 URL: https://issues.apache.org/jira/browse/SPARK-9148 Project: Spark Issue Type: Technical task Components: Documentation, SQL Reporter: Josh Rosen Assignee: Michael Armbrust Priority: Critical Fix For: 1.5.0 Once we've finalized our NaN changes for Spark 1.5, we need to create user-facing documentation to explain our chosen semantics. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10322) Column %in% function is not exported
[ https://issues.apache.org/jira/browse/SPARK-10322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14717305#comment-14717305 ] Felix Cheung commented on SPARK-10322: -- https://github.com/apache/spark/commit/ad7f0f160be096c0fdae6e6cf7e3b6ba4a606de7 SPARK-10308 Column %in% function is not exported Key: SPARK-10322 URL: https://issues.apache.org/jira/browse/SPARK-10322 Project: Spark Issue Type: Bug Components: R Affects Versions: 1.5.0 Reporter: Felix Cheung $ sparkR ... df$age Column age filter(df$age %in% c(19, 30)) Error in filter(df$age %in% c(19, 30)) : error in evaluating the argument 'x' in selecting a method for function 'filter': Error in match(x, table, nomatch = 0L) : 'match' requires vector arguments filter(df$age %in% c(19, 30)) Error in filter(df$age %in% c(19, 30)) : error in evaluating the argument 'x' in selecting a method for function 'filter': Error in match(x, table, nomatch = 0L) : 'match' requires vector arguments filter(df$age %in% 30) Error in filter(df$age %in% 30) : error in evaluating the argument 'x' in selecting a method for function 'filter': Error in match(x, table, nomatch = 0L) : 'match' requires vector arguments -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10322) Column %in% function is not exported
[ https://issues.apache.org/jira/browse/SPARK-10322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung resolved SPARK-10322. -- Resolution: Duplicate Looks like this was fixed last night. Column %in% function is not exported Key: SPARK-10322 URL: https://issues.apache.org/jira/browse/SPARK-10322 Project: Spark Issue Type: Bug Components: R Affects Versions: 1.5.0 Reporter: Felix Cheung $ sparkR ... df$age Column age filter(df$age %in% c(19, 30)) Error in filter(df$age %in% c(19, 30)) : error in evaluating the argument 'x' in selecting a method for function 'filter': Error in match(x, table, nomatch = 0L) : 'match' requires vector arguments filter(df$age %in% c(19, 30)) Error in filter(df$age %in% c(19, 30)) : error in evaluating the argument 'x' in selecting a method for function 'filter': Error in match(x, table, nomatch = 0L) : 'match' requires vector arguments filter(df$age %in% 30) Error in filter(df$age %in% 30) : error in evaluating the argument 'x' in selecting a method for function 'filter': Error in match(x, table, nomatch = 0L) : 'match' requires vector arguments -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10295) Dynamic allocation in Mesos does not release when RDDs are cached
[ https://issues.apache.org/jira/browse/SPARK-10295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14717444#comment-14717444 ] Apache Spark commented on SPARK-10295: -- User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/8489 Dynamic allocation in Mesos does not release when RDDs are cached - Key: SPARK-10295 URL: https://issues.apache.org/jira/browse/SPARK-10295 Project: Spark Issue Type: Improvement Components: Documentation, Spark Core Affects Versions: 1.5.0 Environment: Spark 1.5.0 RC1 Centos 6 java 7 oracle Reporter: Hans van den Bogert Priority: Minor When running spark in coarse grained mode with shuffle service and dynamic allocation, the driver does not release executors if a dataset is cached. The console output OTOH shows: 15/08/26 17:29:58 WARN SparkContext: Dynamic allocation currently does not support cached RDDs. Cached data for RDD 9 will be lost when executors are removed. However after the default of 1m, executors are not released. When I perform the same initial setup, loading data, etc, but without caching, the executors are released. Is this intended behaviour? If this is intended behaviour, the console warning is misleading. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10295) Dynamic allocation in Mesos does not release when RDDs are cached
[ https://issues.apache.org/jira/browse/SPARK-10295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-10295: -- Component/s: (was: Mesos) Spark Core Documentation Issue Type: Improvement (was: Question) OK, let's think of this as a simple log message update. PR coming. Dynamic allocation in Mesos does not release when RDDs are cached - Key: SPARK-10295 URL: https://issues.apache.org/jira/browse/SPARK-10295 Project: Spark Issue Type: Improvement Components: Documentation, Spark Core Affects Versions: 1.5.0 Environment: Spark 1.5.0 RC1 Centos 6 java 7 oracle Reporter: Hans van den Bogert Priority: Minor When running spark in coarse grained mode with shuffle service and dynamic allocation, the driver does not release executors if a dataset is cached. The console output OTOH shows: 15/08/26 17:29:58 WARN SparkContext: Dynamic allocation currently does not support cached RDDs. Cached data for RDD 9 will be lost when executors are removed. However after the default of 1m, executors are not released. When I perform the same initial setup, loading data, etc, but without caching, the executors are released. Is this intended behaviour? If this is intended behaviour, the console warning is misleading. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
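For reference, a sketch of the configuration under discussion. The property names are the standard dynamic-allocation settings; that spark.dynamicAllocation.cachedExecutorIdleTimeout is the knob governing executors holding cached blocks, with a default of infinity that would explain the observed behaviour, is my reading of the 1.5 docs and an assumption, not something stated in this ticket:
{code}
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("dynamic-allocation-sketch")
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")
  // Plain idle executors are released after this timeout (the "default of 1m" above).
  .set("spark.dynamicAllocation.executorIdleTimeout", "60s")
  // Executors holding cached RDD blocks use this separate timeout instead.
  .set("spark.dynamicAllocation.cachedExecutorIdleTimeout", "600s")
val sc = new SparkContext(conf)
{code}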
[jira] [Updated] (SPARK-10304) Partition discovery does not throw an exception if the dir structure is invalid
[ https://issues.apache.org/jira/browse/SPARK-10304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-10304: - Summary: Partition discovery does not throw an exception if the dir structure is invalid (was: Need to add a null check in unwrapperFor in HiveInspectors) Partition discovery does not throw an exception if the dir structure is invalid - Key: SPARK-10304 URL: https://issues.apache.org/jira/browse/SPARK-10304 Project: Spark Issue Type: Bug Components: SQL Reporter: Yin Huai Assignee: Zhan Zhang Priority: Critical {code} org.apache.spark.SparkException: Job aborted due to stage failure: Task 26 in stage 57.0 failed 4 times, most recent failure: Lost task 26.3 in stage 57.0 (TID 3504, 10.0.195.227): java.lang.NullPointerException at org.apache.spark.sql.hive.HiveInspectors$class.unwrapperFor(HiveInspectors.scala:466) at org.apache.spark.sql.hive.orc.OrcTableScan.unwrapperFor(OrcRelation.scala:224) at org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1$$anonfun$9.apply(OrcRelation.scala:261) at org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1$$anonfun$9.apply(OrcRelation.scala:261) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.immutable.List.foreach(List.scala:318) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1.apply(OrcRelation.scala:261) at org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1.apply(OrcRelation.scala:256) at scala.Option.map(Option.scala:145) at org.apache.spark.sql.hive.orc.OrcTableScan.org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject(OrcRelation.scala:256) at org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$execute$3.apply(OrcRelation.scala:318) at org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$execute$3.apply(OrcRelation.scala:316) at org.apache.spark.rdd.HadoopRDD$HadoopMapPartitionsWithSplitRDD.compute(HadoopRDD.scala:380) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10321) OrcRelation doesn't override sizeInBytes
[ https://issues.apache.org/jira/browse/SPARK-10321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10321: Assignee: Apache Spark OrcRelation doesn't override sizeInBytes Key: SPARK-10321 URL: https://issues.apache.org/jira/browse/SPARK-10321 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Cheng Lian Assignee: Apache Spark Priority: Critical This hurts performance badly because broadcast join can never be enabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
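To see why the missing override matters: a data source relation that does not define sizeInBytes inherits a default tied to spark.sql.defaultSizeInBytes, which is deliberately huge so that relations of unknown size never qualify for broadcast join under spark.sql.autoBroadcastJoinThreshold. A minimal sketch of the kind of override the fix implies (SizedRelation is a hypothetical class, not Spark's actual OrcRelation):
{code}
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.sources.BaseRelation
import org.apache.spark.sql.types.StructType

// Report the total length of the underlying files so the planner can pick
// broadcast join when the relation is small enough.
class SizedRelation(val sqlContext: SQLContext,
                    val schema: StructType,
                    paths: Seq[Path]) extends BaseRelation {
  override def sizeInBytes: Long = {
    val fs = FileSystem.get(sqlContext.sparkContext.hadoopConfiguration)
    paths.map(p => fs.getContentSummary(p).getLength).sum
  }
}
{code}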
[jira] [Assigned] (SPARK-10321) OrcRelation doesn't override sizeInBytes
[ https://issues.apache.org/jira/browse/SPARK-10321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10321: Assignee: (was: Apache Spark) OrcRelation doesn't override sizeInBytes Key: SPARK-10321 URL: https://issues.apache.org/jira/browse/SPARK-10321 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Cheng Lian Priority: Critical This hurts performance badly because broadcast join can never be enabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5741) Support the path contains comma in HiveContext
[ https://issues.apache.org/jira/browse/SPARK-5741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14717450#comment-14717450 ] koert kuipers edited comment on SPARK-5741 at 8/27/15 8:22 PM: --- given the requirement of source/binary compatibility i do not think it can be done without some kind of string munging. however the string munging could be restricted to a separate variable, set with paths(path: *) so path does not get polluted. this variable would be exclusively for HadoopFsRelationProvider, and an error thrown in ResolvedDataSource if any other RelationProvider is used with this variable set. also an error would be thrown if path and paths are both set. does this sound reasonable? if not i will keep looking for other solutions was (Author: koert): given the requirement of source/binary compatibility i do not think it can be done without some kind of string munging. however the string munging could be restricted to a separate variable, set with paths(path: *) so path does not get polluted. this variable would be exclusively for HadoopFsRelationProvider, and an error thrown in ResolvedDataSource if any other RelationProvider is used. also an error would be thrown if path and paths are both set. does this sound reasonable? if not i will keep looking for other solutions Support the path contains comma in HiveContext -- Key: SPARK-5741 URL: https://issues.apache.org/jira/browse/SPARK-5741 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Yadong Qi Assignee: Yadong Qi Fix For: 1.3.0 When run ```select * from nzhang_part where hr = 'file,';```, it throws exception ```java.lang.IllegalArgumentException: Can not create a Path from an empty string```. Because the path of hdfs contains comma, and FileInputFormat.setInputPaths will split path by comma. 
### SQL ### set hive.merge.mapfiles=true; set hive.merge.mapredfiles=true; set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat; set hive.exec.dynamic.partition=true; set hive.exec.dynamic.partition.mode=nonstrict; create table nzhang_part like srcpart; insert overwrite table nzhang_part partition (ds='2010-08-15', hr) select key, value, hr from srcpart where ds='2008-04-08'; insert overwrite table nzhang_part partition (ds='2010-08-15', hr=11) select key, value from srcpart where ds='2008-04-08'; insert overwrite table nzhang_part partition (ds='2010-08-15', hr) select * from ( select key, value, hr from srcpart where ds='2008-04-08' union all select '1' as key, '1' as value, 'file,' as hr from src limit 1) s; select * from nzhang_part where hr = 'file,'; ### Error log ### 15/02/10 14:33:16 ERROR SparkSQLDriver: Failed in [select * from nzhang_part where hr = 'file,'] java.lang.IllegalArgumentException: Can not create a Path from an empty string at org.apache.hadoop.fs.Path.checkPathArg(Path.java:127) at org.apache.hadoop.fs.Path.init(Path.java:135) at org.apache.hadoop.util.StringUtils.stringToPath(StringUtils.java:241) at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:400) at org.apache.spark.sql.hive.HadoopTableReader$.initializeLocalJobConfFunc(TableReader.scala:251) at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$11.apply(TableReader.scala:229) at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$11.apply(TableReader.scala:229) at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:172) at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:172) at scala.Option.map(Option.scala:145) at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:172) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:196) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:223) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:221) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:221) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:223) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:221) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:221) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at
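The root cause is easy to demonstrate in isolation: the String overload of setInputPaths splits its argument on commas before turning each piece into a Path, so a partition directory whose name ends in "file," yields an empty component and exactly the "Can not create a Path from an empty string" error shown above. A sketch (the path is illustrative):
{code}
import org.apache.hadoop.mapred.{FileInputFormat, JobConf}

val conf = new JobConf()
// Splits on the comma into "/warehouse/nzhang_part/hr=file" and "", and the
// empty string then throws java.lang.IllegalArgumentException:
// Can not create a Path from an empty string
FileInputFormat.setInputPaths(conf, "/warehouse/nzhang_part/hr=file,")
{code}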
[jira] [Commented] (SPARK-10321) OrcRelation doesn't override sizeInBytes
[ https://issues.apache.org/jira/browse/SPARK-10321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14717453#comment-14717453 ] Apache Spark commented on SPARK-10321: -- User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/8490 OrcRelation doesn't override sizeInBytes Key: SPARK-10321 URL: https://issues.apache.org/jira/browse/SPARK-10321 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Cheng Lian Priority: Critical This hurts performance badly because broadcast join can never be enabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10321) OrcRelation doesn't override sizeInBytes
[ https://issues.apache.org/jira/browse/SPARK-10321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu reassigned SPARK-10321: -- Assignee: Davies Liu OrcRelation doesn't override sizeInBytes Key: SPARK-10321 URL: https://issues.apache.org/jira/browse/SPARK-10321 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Cheng Lian Assignee: Davies Liu Priority: Critical This hurts performance badly because broadcast join can never be enabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10295) Dynamic allocation in Mesos does not release when RDDs are cached
[ https://issues.apache.org/jira/browse/SPARK-10295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10295: Assignee: (was: Apache Spark) Dynamic allocation in Mesos does not release when RDDs are cached - Key: SPARK-10295 URL: https://issues.apache.org/jira/browse/SPARK-10295 Project: Spark Issue Type: Improvement Components: Documentation, Spark Core Affects Versions: 1.5.0 Environment: Spark 1.5.0 RC1 Centos 6 java 7 oracle Reporter: Hans van den Bogert Priority: Minor When running spark in coarse grained mode with shuffle service and dynamic allocation, the driver does not release executors if a dataset is cached. The console output OTOH shows: 15/08/26 17:29:58 WARN SparkContext: Dynamic allocation currently does not support cached RDDs. Cached data for RDD 9 will be lost when executors are removed. However after the default of 1m, executors are not released. When I perform the same initial setup, loading data, etc, but without caching, the executors are released. Is this intended behaviour? If this is intended behaviour, the console warning is misleading. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10295) Dynamic allocation in Mesos does not release when RDDs are cached
[ https://issues.apache.org/jira/browse/SPARK-10295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10295: Assignee: Apache Spark Dynamic allocation in Mesos does not release when RDDs are cached - Key: SPARK-10295 URL: https://issues.apache.org/jira/browse/SPARK-10295 Project: Spark Issue Type: Improvement Components: Documentation, Spark Core Affects Versions: 1.5.0 Environment: Spark 1.5.0 RC1 Centos 6 java 7 oracle Reporter: Hans van den Bogert Assignee: Apache Spark Priority: Minor When running spark in coarse grained mode with shuffle service and dynamic allocation, the driver does not release executors if a dataset is cached. The console output OTOH shows: 15/08/26 17:29:58 WARN SparkContext: Dynamic allocation currently does not support cached RDDs. Cached data for RDD 9 will be lost when executors are removed. However after the default of 1m, executors are not released. When I perform the same initial setup, loading data, etc, but without caching, the executors are released. Is this intended behaviour? If this is intended behaviour, the console warning is misleading. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9316) Add support for filtering using `[` (synonym for filter / select)
[ https://issues.apache.org/jira/browse/SPARK-9316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14717473#comment-14717473 ] Deborah Siegel commented on SPARK-9316: --- Now that %in% is exported in the namespace, both the filter and the '[' syntax work with it. Thanks [~shivaram]. [~felixcheung] Not apparent to me at the moment why one would need support for quoted syntax in the brackets with %in% working. btw, although filter works with "age in (19, 30)", the bracket notation with the quotes is still getting an error both ways:
subsetdf <- df["age in (19, 30)", 1:2]
Error in df["age in (19, 30)", 1:2] : object of type 'S4' is not subsettable
subsetdf <- df["age in (19, 30)", 1:2]
Error in df["age in (19, 30)", 1:2] : object of type 'S4' is not subsettable
Add support for filtering using `[` (synonym for filter / select) - Key: SPARK-9316 URL: https://issues.apache.org/jira/browse/SPARK-9316 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Shivaram Venkataraman Assignee: Felix Cheung Fix For: 1.6.0, 1.5.1 Will help us support queries of the form {code} air[air$UniqueCarrier %in% c("UA", "HA"), c(1,2,3,5:9)] {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10304) Need to add a null check in unwrapperFor in HiveInspectors
[ https://issues.apache.org/jira/browse/SPARK-10304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14717475#comment-14717475 ] Yin Huai commented on SPARK-10304: -- Will field be null? I will try to get more info. Need to add a null check in unwrapperFor in HiveInspectors -- Key: SPARK-10304 URL: https://issues.apache.org/jira/browse/SPARK-10304 Project: Spark Issue Type: Bug Components: SQL Reporter: Yin Huai Assignee: Zhan Zhang Priority: Critical {code} org.apache.spark.SparkException: Job aborted due to stage failure: Task 26 in stage 57.0 failed 4 times, most recent failure: Lost task 26.3 in stage 57.0 (TID 3504, 10.0.195.227): java.lang.NullPointerException at org.apache.spark.sql.hive.HiveInspectors$class.unwrapperFor(HiveInspectors.scala:466) at org.apache.spark.sql.hive.orc.OrcTableScan.unwrapperFor(OrcRelation.scala:224) at org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1$$anonfun$9.apply(OrcRelation.scala:261) at org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1$$anonfun$9.apply(OrcRelation.scala:261) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.immutable.List.foreach(List.scala:318) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1.apply(OrcRelation.scala:261) at org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1.apply(OrcRelation.scala:256) at scala.Option.map(Option.scala:145) at org.apache.spark.sql.hive.orc.OrcTableScan.org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject(OrcRelation.scala:256) at org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$execute$3.apply(OrcRelation.scala:318) at org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$execute$3.apply(OrcRelation.scala:316) at org.apache.spark.rdd.HadoopRDD$HadoopMapPartitionsWithSplitRDD.compute(HadoopRDD.scala:380) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10304) Partition discovery does not throw an exception if the dir structure is invalid
[ https://issues.apache.org/jira/browse/SPARK-10304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-10304: - Description: I have a dir structure like {{/path/table1/partition_column=1/}}. When I try to use {{load("/path/")}}, it works and I get a DF. When I query this DF, if it is stored as ORC, there will be the following NPE. But, if it is Parquet, we can even return rows. We should complain to users about the dir structure because {{table1}} does not match our expected format. {code} org.apache.spark.SparkException: Job aborted due to stage failure: Task 26 in stage 57.0 failed 4 times, most recent failure: Lost task 26.3 in stage 57.0 (TID 3504, 10.0.195.227): java.lang.NullPointerException at org.apache.spark.sql.hive.HiveInspectors$class.unwrapperFor(HiveInspectors.scala:466) at org.apache.spark.sql.hive.orc.OrcTableScan.unwrapperFor(OrcRelation.scala:224) at org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1$$anonfun$9.apply(OrcRelation.scala:261) at org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1$$anonfun$9.apply(OrcRelation.scala:261) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.immutable.List.foreach(List.scala:318) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1.apply(OrcRelation.scala:261) at org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1.apply(OrcRelation.scala:256) at scala.Option.map(Option.scala:145) at org.apache.spark.sql.hive.orc.OrcTableScan.org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject(OrcRelation.scala:256) at org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$execute$3.apply(OrcRelation.scala:318) at org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$execute$3.apply(OrcRelation.scala:316) at org.apache.spark.rdd.HadoopRDD$HadoopMapPartitionsWithSplitRDD.compute(HadoopRDD.scala:380) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) {code} was: {code} org.apache.spark.SparkException: Job aborted due to stage failure: Task 26 in stage 57.0 failed 4 times, most recent failure: Lost task 26.3 in stage 57.0 (TID 3504, 10.0.195.227): java.lang.NullPointerException at org.apache.spark.sql.hive.HiveInspectors$class.unwrapperFor(HiveInspectors.scala:466) at org.apache.spark.sql.hive.orc.OrcTableScan.unwrapperFor(OrcRelation.scala:224) at org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1$$anonfun$9.apply(OrcRelation.scala:261) at org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1$$anonfun$9.apply(OrcRelation.scala:261) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.immutable.List.foreach(List.scala:318) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1.apply(OrcRelation.scala:261) at org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1.apply(OrcRelation.scala:256) at scala.Option.map(Option.scala:145) at org.apache.spark.sql.hive.orc.OrcTableScan.org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject(OrcRelation.scala:256) at org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$execute$3.apply(OrcRelation.scala:318) at org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$execute$3.apply(OrcRelation.scala:316) at org.apache.spark.rdd.HadoopRDD$HadoopMapPartitionsWithSplitRDD.compute(HadoopRDD.scala:380) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
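A sketch of the repro the new description implies (paths illustrative; ORC via a HiveContext-backed sqlContext is assumed):
{code}
// Layout on disk, with an extra "table1" level before the partition directory:
//   /path/table1/partition_column=1/part-00000.orc
val df = sqlContext.read.format("orc").load("/path/")
// Expected: an error pointing out that the layout does not match
// <base>/<partition_column>=<value>/. Observed: the load succeeds and the
// scan later fails with the NullPointerException quoted above (and with
// Parquet, rows are even returned).
df.show()
{code}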
[jira] [Commented] (SPARK-5741) Support the path contains comma in HiveContext
[ https://issues.apache.org/jira/browse/SPARK-5741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14717450#comment-14717450 ] koert kuipers commented on SPARK-5741: -- given the requirement of source/binary compatibility i do not think it can be done without some kind of string munging. however the string munging could be restricted to a separate variable, set with paths(path: *) so path does not get polluted. this variable would be exclusively for HadoopFsRelationProvider, and an error thrown in ResolvedDataSource if any other RelationProvider is used. also an error would be thrown if path and paths are both set. does this sound reasonable? if not i will keep looking for other solutions Support the path contains comma in HiveContext -- Key: SPARK-5741 URL: https://issues.apache.org/jira/browse/SPARK-5741 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Yadong Qi Assignee: Yadong Qi Fix For: 1.3.0 When run ```select * from nzhang_part where hr = 'file,';```, it throws exception ```java.lang.IllegalArgumentException: Can not create a Path from an empty string```. Because the path of hdfs contains comma, and FileInputFormat.setInputPaths will split path by comma. ### SQL ### set hive.merge.mapfiles=true; set hive.merge.mapredfiles=true; set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat; set hive.exec.dynamic.partition=true; set hive.exec.dynamic.partition.mode=nonstrict; create table nzhang_part like srcpart; insert overwrite table nzhang_part partition (ds='2010-08-15', hr) select key, value, hr from srcpart where ds='2008-04-08'; insert overwrite table nzhang_part partition (ds='2010-08-15', hr=11) select key, value from srcpart where ds='2008-04-08'; insert overwrite table nzhang_part partition (ds='2010-08-15', hr) select * from ( select key, value, hr from srcpart where ds='2008-04-08' union all select '1' as key, '1' as value, 'file,' as hr from src limit 1) s; select * from nzhang_part where hr = 'file,'; ### Error log ### 15/02/10 14:33:16 ERROR SparkSQLDriver: Failed in [select * from nzhang_part where hr = 'file,'] java.lang.IllegalArgumentException: Can not create a Path from an empty string at org.apache.hadoop.fs.Path.checkPathArg(Path.java:127) at org.apache.hadoop.fs.Path.init(Path.java:135) at org.apache.hadoop.util.StringUtils.stringToPath(StringUtils.java:241) at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:400) at org.apache.spark.sql.hive.HadoopTableReader$.initializeLocalJobConfFunc(TableReader.scala:251) at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$11.apply(TableReader.scala:229) at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$11.apply(TableReader.scala:229) at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:172) at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:172) at scala.Option.map(Option.scala:145) at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:172) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:196) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:223) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:221) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:221) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:223) at 
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:221) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:221) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:223) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:221) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:221) at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at
[jira] [Resolved] (SPARK-9901) User guide for RowMatrix Tall-and-skinny QR
[ https://issues.apache.org/jira/browse/SPARK-9901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-9901. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 8462 [https://github.com/apache/spark/pull/8462] User guide for RowMatrix Tall-and-skinny QR --- Key: SPARK-9901 URL: https://issues.apache.org/jira/browse/SPARK-9901 Project: Spark Issue Type: Documentation Components: MLlib Reporter: Feynman Liang Assignee: yuhao yang Fix For: 1.5.0 SPARK-7368 adds Tall-and-Skinny QR factorization. {{mllib-data-types#rowmatrix}} should be updated to document this feature. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9316) Add support for filtering using `[` (synonym for filter / select)
[ https://issues.apache.org/jira/browse/SPARK-9316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14717531#comment-14717531 ] Shivaram Venkataraman commented on SPARK-9316: -- I don't think supporting the "age in (19, 30)" syntax with `[` is very important, as %in% is more natural for R users. We should update the documentation to reflect this if it's misleading, though. Add support for filtering using `[` (synonym for filter / select) - Key: SPARK-9316 URL: https://issues.apache.org/jira/browse/SPARK-9316 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Shivaram Venkataraman Assignee: Felix Cheung Fix For: 1.6.0, 1.5.1 Will help us support queries of the form {code} air[air$UniqueCarrier %in% c("UA", "HA"), c(1,2,3,5:9)] {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10310) [Spark SQL] All result records will be populated into ONE line during the script transform due to missing the correct line/field delimiter
[ https://issues.apache.org/jira/browse/SPARK-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-10310: - Priority: Critical (was: Blocker) [Spark SQL] All result records will be populated into ONE line during the script transform due to missing the correct line/field delimiter -- Key: SPARK-10310 URL: https://issues.apache.org/jira/browse/SPARK-10310 Project: Spark Issue Type: Bug Components: SQL Reporter: Yi Zhou Priority: Critical There is a real case using a Python stream script in a Spark SQL query. We found that all result records were written in ONE line as input from the select pipeline for the Python script, so the script cannot identify each record. Also, the field separator in Spark SQL is '^A' ('\001'), which is inconsistent/incompatible with the '\t' used in the Hive implementation. #Key Query: CREATE VIEW temp1 AS SELECT * FROM ( FROM ( SELECT c.wcs_user_sk, w.wp_type, (wcs_click_date_sk * 24 * 60 * 60 + wcs_click_time_sk) AS tstamp_inSec FROM web_clickstreams c, web_page w WHERE c.wcs_web_page_sk = w.wp_web_page_sk AND c.wcs_web_page_sk IS NOT NULL AND c.wcs_user_sk IS NOT NULL AND c.wcs_sales_sk IS NULL --abandoned implies: no sale DISTRIBUTE BY wcs_user_sk SORT BY wcs_user_sk, tstamp_inSec ) clicksAnWebPageType REDUCE wcs_user_sk, tstamp_inSec, wp_type USING 'python sessionize.py 3600' AS ( wp_type STRING, tstamp BIGINT, sessionid STRING) ) sessionized #Key Python Script# for line in sys.stdin: user_sk, tstamp_str, value = line.strip().split("\t") Result Records example from 'select' ## ^V31^A3237764860^Afeedback^U31^A3237769106^Adynamic^T31^A3237779027^Areview Result Records example in expected format ## 31 3237764860 feedback 31 3237769106 dynamic 31 3237779027 review -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
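To make the mismatch concrete, a small sketch of the two wire formats involved (values taken from the examples above; that Hive's TRANSFORM contract is tab-separated fields with newline-separated records is the assumption here):
{code}
// What the reducer's line.strip().split("\t") expects per record:
val hiveStyle = Seq("31", "3237764860", "feedback").mkString("\t") + "\n"
// What was observed instead: fields joined by '\001' ("^A") and records run
// together on one line, so the script never sees a record boundary.
val observed = Seq("31", "3237764860", "feedback").mkString("\u0001")
{code}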
[jira] [Commented] (SPARK-10304) Partition discovery does not throw an exception if the dir structure is invalid
[ https://issues.apache.org/jira/browse/SPARK-10304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14717575#comment-14717575 ] Yin Huai commented on SPARK-10304: -- [~zhazhan] just took a look, it is not an ORC issue. It is an issue related to partition discovery. Partition discovery does not throw an exception if the dir structure is invalid - Key: SPARK-10304 URL: https://issues.apache.org/jira/browse/SPARK-10304 Project: Spark Issue Type: Bug Components: SQL Reporter: Yin Huai Assignee: Zhan Zhang Priority: Critical I have a dir structure like {{/path/table1/partition_column=1/}}. When I try to use {{load("/path/")}}, it works and I get a DF. When I query this DF, if it is stored as ORC, there will be the following NPE. But, if it is Parquet, we can even return rows. We should complain to users about the dir structure because {{table1}} does not match our expected format. {code} org.apache.spark.SparkException: Job aborted due to stage failure: Task 26 in stage 57.0 failed 4 times, most recent failure: Lost task 26.3 in stage 57.0 (TID 3504, 10.0.195.227): java.lang.NullPointerException at org.apache.spark.sql.hive.HiveInspectors$class.unwrapperFor(HiveInspectors.scala:466) at org.apache.spark.sql.hive.orc.OrcTableScan.unwrapperFor(OrcRelation.scala:224) at org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1$$anonfun$9.apply(OrcRelation.scala:261) at org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1$$anonfun$9.apply(OrcRelation.scala:261) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.immutable.List.foreach(List.scala:318) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1.apply(OrcRelation.scala:261) at org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1.apply(OrcRelation.scala:256) at scala.Option.map(Option.scala:145) at org.apache.spark.sql.hive.orc.OrcTableScan.org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject(OrcRelation.scala:256) at org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$execute$3.apply(OrcRelation.scala:318) at org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$execute$3.apply(OrcRelation.scala:316) at org.apache.spark.rdd.HadoopRDD$HadoopMapPartitionsWithSplitRDD.compute(HadoopRDD.scala:380) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10296) add preservesPartitioning parameter to RDD.map
[ https://issues.apache.org/jira/browse/SPARK-10296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14717713#comment-14717713 ] Esteban Donato commented on SPARK-10296: Sean, thanks for your response. As per your comments, a couple of things. You're right that the parameter is to support mapping key-value pairs when the key doesn't change. My point is that when you are in such a situation, and you don't want to lose the partitioner, you are forced to use the mapPartitions method instead of the map method just to use the preservesPartitioning parameter, even when the map method would be enough. On the other hand, regarding the changes in the API, I think that shouldn't be an issue if preservesPartitioning is added as the last parameter with a default value of false to make it backwards compatible. Let me know your thoughts add preservesPartitioning parameter to RDD.map - Key: SPARK-10296 URL: https://issues.apache.org/jira/browse/SPARK-10296 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Esteban Donato Priority: Minor It would be nice to add the Boolean parameter preservesPartitioning with default false to the RDD.map method just as it is in the RDD.mapPartitions method. If you agree I can submit a pull request with this enhancement. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
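A sketch of the workaround Esteban describes, assuming a SparkContext named sc: a key-preserving transformation must currently go through mapPartitions just to keep the partitioner, because map always drops it.
{code}
import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq((1, "a"), (2, "b")))
  .partitionBy(new HashPartitioner(4))

// map drops the partitioner even though the keys are untouched:
val viaMap = pairs.map { case (k, v) => (k, v.toUpperCase) }
assert(viaMap.partitioner.isEmpty)

// mapPartitions can opt in to keeping it:
val viaMapPartitions = pairs.mapPartitions(
  _.map { case (k, v) => (k, v.toUpperCase) },
  preservesPartitioning = true)
assert(viaMapPartitions.partitioner == pairs.partitioner)
{code}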
[jira] [Updated] (SPARK-10323) NPE in code-gened In expression
[ https://issues.apache.org/jira/browse/SPARK-10323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-10323: - Assignee: Davies Liu NPE in code-gened In expression --- Key: SPARK-10323 URL: https://issues.apache.org/jira/browse/SPARK-10323 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Yin Huai Assignee: Davies Liu Priority: Critical To reproduce the problem, you can run {{null in ('str')}}. Let's also take a look at InSet and other similar expressions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
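A one-line repro sketch matching the description, assuming a SQLContext named sqlContext:
{code}
// Under SQL three-valued logic this should evaluate to NULL, not throw;
// per this ticket, 1.5.0 instead hits an NPE in the generated code.
sqlContext.sql("SELECT null IN ('str')").show()
{code}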
[jira] [Comment Edited] (SPARK-8514) LU factorization on BlockMatrix
[ https://issues.apache.org/jira/browse/SPARK-8514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14717696#comment-14717696 ] Jerome edited comment on SPARK-8514 at 8/27/15 11:03 PM: - I have a draft of the LU Decomposition in BlockMatrix.scala https://github.com/nilmeier/spark/tree/SPARK-8514_LU_factorization https://github.com/nilmeier/spark/tree/SPARK-8514_LU_factorization/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala Only one unit test so far: https://github.com/nilmeier/spark/tree/SPARK-8514_LU_factorization/mllib/src/test/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrixSuite.scala The method here is slightly different than the previously proposed method in that it pre-forms large block matrices for large BlockMatrix.multiply operations. I'll be adding documentation shortly to github to describe the method. Cheers, J was (Author: nilmeier): I have a draft of the LU Decomposition in BlockMatrix.scala https://github.com/nilmeier/spark/blob/SPARK-8514_LU_factorization https://github.com/nilmeier/spark/blob/SPARK-8514_LU_factorization/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala Only one unit test so far: https://github.com/nilmeier/spark/blob/SPARK-8514_LU_factorization/mllib/src/test/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrixSuite.scala The method here is slightly different than the previously proposed method in that it pre-forms large block matrices for large BlockMatrix.multiply operations. I'll be adding documentation shortly to github to describe the method. Cheers, J LU factorization on BlockMatrix --- Key: SPARK-8514 URL: https://issues.apache.org/jira/browse/SPARK-8514 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Labels: advanced Attachments: BlockMatrixSolver.pdf, BlockPartitionMethods.py, BlockPartitionMethods.scala, LUBlockDecompositionBasic.pdf, testScript.scala LU is the most common method to solve a general linear system or invert a general matrix. A distributed version could be implemented block-wise with pipelining. A reference implementation is provided in ScaLAPACK: http://netlib.org/scalapack/slug/node178.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10287) After processing a query using JSON data, Spark SQL continuously refreshes metadata of the table
[ https://issues.apache.org/jira/browse/SPARK-10287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-10287: - Labels: releasenotes (was: ) After processing a query using JSON data, Spark SQL continuously refreshes metadata of the table Key: SPARK-10287 URL: https://issues.apache.org/jira/browse/SPARK-10287 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Yin Huai Assignee: Yin Huai Priority: Critical Labels: releasenotes Fix For: 1.5.1 I have a partitioned json table with 1824 partitions. {code}
val df = sqlContext.read.format("json").load("aPartitionedJsonData")
val columnStr = df.schema.map(_.name).mkString(",")
println(s"columns: $columnStr")
val hash = df
  .selectExpr(s"hash($columnStr) as hashValue")
  .groupBy()
  .sum("hashValue")
  .head()
  .getLong(0)
{code} Looks like for JSON, we refresh metadata when we call buildScan. For a partitioned table, we call buildScan for every partition. So, looks like we will refresh this table 1824 times. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10287) After processing a query using JSON data, Spark SQL continuously refreshes metadata of the table
[ https://issues.apache.org/jira/browse/SPARK-10287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-10287: - Target Version/s: (was: 1.5.0) After processing a query using JSON data, Spark SQL continuously refreshes metadata of the table Key: SPARK-10287 URL: https://issues.apache.org/jira/browse/SPARK-10287 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Yin Huai Assignee: Yin Huai Priority: Critical Fix For: 1.5.1 I have a partitioned json table with 1824 partitions. {code}
val df = sqlContext.read.format("json").load("aPartitionedJsonData")
val columnStr = df.schema.map(_.name).mkString(",")
println(s"columns: $columnStr")
val hash = df
  .selectExpr(s"hash($columnStr) as hashValue")
  .groupBy()
  .sum("hashValue")
  .head()
  .getLong(0)
{code} Looks like for JSON, we refresh metadata when we call buildScan. For a partitioned table, we call buildScan for every partition. So, looks like we will refresh this table 1824 times. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10287) After processing a query using JSON data, Spark SQL continuously refreshes metadata of the table
[ https://issues.apache.org/jira/browse/SPARK-10287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-10287. -- Resolution: Fixed Fix Version/s: 1.5.1 Issue resolved by pull request 8469 [https://github.com/apache/spark/pull/8469] After processing a query using JSON data, Spark SQL continuously refreshes metadata of the table Key: SPARK-10287 URL: https://issues.apache.org/jira/browse/SPARK-10287 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Yin Huai Priority: Critical Fix For: 1.5.1 I have a partitioned json table with 1824 partitions. {code}
val df = sqlContext.read.format("json").load("aPartitionedJsonData")
val columnStr = df.schema.map(_.name).mkString(",")
println(s"columns: $columnStr")
val hash = df
  .selectExpr(s"hash($columnStr) as hashValue")
  .groupBy()
  .sum("hashValue")
  .head()
  .getLong(0)
{code} Looks like for JSON, we refresh metadata when we call buildScan. For a partitioned table, we call buildScan for every partition. So, looks like we will refresh this table 1824 times. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10287) After processing a query using JSON data, Spark SQL continuously refreshes metadata of the table
[ https://issues.apache.org/jira/browse/SPARK-10287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14717737#comment-14717737 ] Yin Huai commented on SPARK-10287: -- We need to put in the following release note: "JSON data source will not automatically load new files that are created by other applications (i.e. files that are not inserted to the dataset through Spark SQL)." [SPARK-10287]. After processing a query using JSON data, Spark SQL continuously refreshes metadata of the table Key: SPARK-10287 URL: https://issues.apache.org/jira/browse/SPARK-10287 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Yin Huai Assignee: Yin Huai Priority: Critical Labels: releasenotes Fix For: 1.5.1 I have a partitioned json table with 1824 partitions. {code}
val df = sqlContext.read.format("json").load("aPartitionedJsonData")
val columnStr = df.schema.map(_.name).mkString(",")
println(s"columns: $columnStr")
val hash = df
  .selectExpr(s"hash($columnStr) as hashValue")
  .groupBy()
  .sum("hashValue")
  .head()
  .getLong(0)
{code} Looks like for JSON, we refresh metadata when we call buildScan. For a partitioned table, we call buildScan for every partition. So, looks like we will refresh this table 1824 times. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8514) LU factorization on BlockMatrix
[ https://issues.apache.org/jira/browse/SPARK-8514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14717696#comment-14717696 ] Jerome commented on SPARK-8514: --- I have a draft of the LU Decomposition in BlockMatrix.scala https://github.com/nilmeier/spark/blob/SPARK-8514_LU_factorization https://github.com/nilmeier/spark/blob/SPARK-8514_LU_factorization/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala Only one unit test so far: https://github.com/nilmeier/spark/blob/SPARK-8514_LU_factorization/mllib/src/test/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrixSuite.scala The method here is slightly different than the previously proposed method in that it pre-forms large block matrices for large BlockMatrix.multiply operations. I'll be adding documentation shortly to github to describe the method. Cheers, J LU factorization on BlockMatrix --- Key: SPARK-8514 URL: https://issues.apache.org/jira/browse/SPARK-8514 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Labels: advanced Attachments: BlockMatrixSolver.pdf, BlockPartitionMethods.py, BlockPartitionMethods.scala, LUBlockDecompositionBasic.pdf, testScript.scala LU is the most common method to solve a general linear system or invert a general matrix. A distributed version could be implemented block-wise with pipelining. A reference implementation is provided in ScaLAPACK: http://netlib.org/scalapack/slug/node178.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
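For readers following along, the standard 2x2 block recursion that block LU schemes build on; this is the textbook identity, not necessarily the exact scheme in the linked draft:
{code}
A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}
  = \begin{pmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{pmatrix}
    \begin{pmatrix} U_{11} & U_{12} \\ 0 & U_{22} \end{pmatrix}
\qquad\text{where}\qquad
\begin{aligned}
A_{11} &= L_{11} U_{11} && \text{(LU of the leading block)} \\
U_{12} &= L_{11}^{-1} A_{12}, \quad L_{21} = A_{21} U_{11}^{-1} && \text{(triangular solves)} \\
L_{22} U_{22} &= A_{22} - L_{21} U_{12} && \text{(Schur complement; recurse)}
\end{aligned}
{code}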
[jira] [Commented] (SPARK-10307) Fix regression in block matrix multiply (1.4-1.5 regression)
[ https://issues.apache.org/jira/browse/SPARK-10307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14717704#comment-14717704 ] Joseph K. Bradley commented on SPARK-10307: --- I tested this a number of times to try to reproduce the issue on branch-1.5. Weirdly, I reproduced it once, with running times: {code} "results": [{"time": 79.313}, {"time": 82.344}, {"time": 77.169}, {"time": 63.269}, {"time": 86.671}, {"time": 79.732}, {"time": 76.208}, {"time": 91.78}, {"time": 73.738}, {"time": 56.931}, {"time": 75.267}, {"time": 75.316}, {"time": 63.639}, {"time": 66.429}, {"time": 67.172}] {code} But when I tried re-running on branch-1.5 a few times (on both RC1 and the most recent branch with updates post-RC21), I got times like this: {code} "results": [{"time": 49.95}, {"time": 49.081}, {"time": 50.712}, {"time": 49.272}, {"time": 49.81}, {"time": 47.067}, {"time": 52.498}, {"time": 48.093}, {"time": 48.468}, {"time": 49.142}, {"time": 47.212}, {"time": 47.21}, {"time": 48.007}, {"time": 55.267}, {"time": 48.121}] {code} Note these were all on the same EC2 cluster. So...I'd say there is no obvious regression. If something is wrong, then it's pretty subtle. I'll close this for now. Fix regression in block matrix multiply (1.4-1.5 regression) - Key: SPARK-10307 URL: https://issues.apache.org/jira/browse/SPARK-10307 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.5.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Critical Running spark-perf on the block-matrix-mult test (BlockMatrix.multiply), I found the running time increased from 50 sec to 80 sec. This was on the default test settings, on 16 r3.2xlarge workers on EC2, and with 15 trials, dropping the first 2. The only relevant changes I found are: * [https://github.com/apache/spark/commit/520ad44b17f72e6465bf990f64b4e289f8a83447] * [https://github.com/apache/spark/commit/99c40cd0d8465525cac34dfa373b81532ef3d719] I'm testing reverting each of those now. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9680) Update programming guide section for ml.feature.StopWordsRemover
[ https://issues.apache.org/jira/browse/SPARK-9680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-9680. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 8436 [https://github.com/apache/spark/pull/8436] Update programming guide section for ml.feature.StopWordsRemover Key: SPARK-9680 URL: https://issues.apache.org/jira/browse/SPARK-9680 Project: Spark Issue Type: Documentation Components: ML Reporter: yuhao yang Assignee: Feynman Liang Priority: Minor Labels: document Fix For: 1.5.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9906) User guide for LogisticRegressionSummary
[ https://issues.apache.org/jira/browse/SPARK-9906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-9906. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 8197 [https://github.com/apache/spark/pull/8197] User guide for LogisticRegressionSummary Key: SPARK-9906 URL: https://issues.apache.org/jira/browse/SPARK-9906 Project: Spark Issue Type: Documentation Components: ML Reporter: Feynman Liang Assignee: Manoj Kumar Fix For: 1.5.0 SPARK-9112 introduces {{LogisticRegressionSummary}} to provide R-like model statistics to ML pipeline logistic regression models. This feature is not present in mllib and should be documented within {{ml-linear-methods}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10307) Fix regression in block matrix multiply (1.4-1.5 regression)
[ https://issues.apache.org/jira/browse/SPARK-10307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-10307. --- Resolution: Cannot Reproduce Fix regression in block matrix multiply (1.4-1.5 regression) - Key: SPARK-10307 URL: https://issues.apache.org/jira/browse/SPARK-10307 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.5.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Critical Running spark-perf on the block-matrix-mult test (BlockMatrix.multiply), I found the running time increased from 50 sec to 80 sec. This was on the default test settings, on 16 r3.2xlarge workers on EC2, and with 15 trials, dropping the first 2. The only relevant changes I found are: * [https://github.com/apache/spark/commit/520ad44b17f72e6465bf990f64b4e289f8a83447] * [https://github.com/apache/spark/commit/99c40cd0d8465525cac34dfa373b81532ef3d719] I'm testing reverting each of those now. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10323) NPE in code-gened In expression
Yin Huai created SPARK-10323: Summary: NPE in code-gened In expression Key: SPARK-10323 URL: https://issues.apache.org/jira/browse/SPARK-10323 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Yin Huai Priority: Critical To reproduce the problem, you can run {{null in ('str')}}. Let's also take a look at InSet and other similar expressions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4066) Make whether maven builds fail on scalastyle violation configurable
[ https://issues.apache.org/jira/browse/SPARK-4066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated SPARK-4066: -- Description: Here is the thread Koert started: http://search-hadoop.com/m/JW1q5j8z422/scalastyle+annoys+me+a+little+bitsubj=scalastyle+annoys+me+a+little+bit It would be more flexible if whether the maven build fails due to scalastyle violations were configurable. was: Here is the thread Koert started: http://search-hadoop.com/m/JW1q5j8z422/scalastyle+annoys+me+a+little+bitsubj=scalastyle+annoys+me+a+little+bit It would be more flexible if whether the maven build fails due to scalastyle violations were configurable. Make whether maven builds fail on scalastyle violation configurable Key: SPARK-4066 URL: https://issues.apache.org/jira/browse/SPARK-4066 Project: Spark Issue Type: Improvement Components: Build Reporter: Ted Yu Priority: Minor Labels: style Attachments: spark-4066-v1.txt Here is the thread Koert started: http://search-hadoop.com/m/JW1q5j8z422/scalastyle+annoys+me+a+little+bitsubj=scalastyle+annoys+me+a+little+bit It would be more flexible if whether the maven build fails due to scalastyle violations were configurable. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10323) NPE in code-gened In expression
[ https://issues.apache.org/jira/browse/SPARK-10323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14717762#comment-14717762 ] Yin Huai commented on SPARK-10323: -- Seems {{array_contains}} does not have this NPE issue. NPE in code-gened In expression --- Key: SPARK-10323 URL: https://issues.apache.org/jira/browse/SPARK-10323 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Yin Huai Assignee: Davies Liu Priority: Critical To reproduce the problem, you can run {{null in ('str')}}. Let's also take a look at InSet and other similar expressions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10324) MLlib 1.6 Roadmap
Xiangrui Meng created SPARK-10324: - Summary: MLlib 1.6 Roadmap Key: SPARK-10324 URL: https://issues.apache.org/jira/browse/SPARK-10324 Project: Spark Issue Type: Umbrella Components: ML, MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Critical -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10326) Cannot launch YARN job on Windows
Marcelo Vanzin created SPARK-10326: -- Summary: Cannot launch YARN job on Windows Key: SPARK-10326 URL: https://issues.apache.org/jira/browse/SPARK-10326 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.5.0 Reporter: Marcelo Vanzin The fix is already in master, and it's one line out of the patch for SPARK-5754; the bug is that a Windows file path cannot be used to create a URI directly, so {{File.toURI()}} needs to be called. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
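An illustrative sketch of the failure mode (the path is hypothetical; the real fix is the one-liner in the SPARK-5754 patch):
{code:scala}
// Backslashes are illegal characters in a URI string, so constructing one
// straight from a Windows path fails, while File.toURI normalizes it.
val path = "C:\\spark\\lib\\app.jar"
val broken = new java.net.URI(path)   // throws URISyntaxException
val ok = new java.io.File(path).toURI // file:/C:/spark/lib/app.jar
{code}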
[jira] [Updated] (SPARK-10329) Cost RDD in k-means initialization is not storage-efficient
[ https://issues.apache.org/jira/browse/SPARK-10329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-10329: -- Description: Currently we use `RDD[Vector]` to store point cost during k-means|| initialization, where each `Vector` has size `runs`. This is not storage-efficient because `runs` is usually 1 and then each record is a Vector of size 1. What we need is just the 8 bytes to store the cost, but we introduce two objects (DenseVector and its values array), which could cost 16 bytes. That is 200% overhead. Thanks [~Grace Huang] and Jiayin Hu from Intel for reporting this issue! There are several solutions: 1. Use `RDD[Array[Double]]` instead of `RDD[Vector]`, which saves 8 bytes per record. 2. Use `RDD[Array[Double]]` but batch the values for storage, e.g. each `Array[Double]` object covers 1024 instances, which could remove most of the overhead. Besides, using MEMORY_AND_DISK instead of MEMORY_ONLY could prevent cost RDDs kicking out the training dataset from memory. was: Currently we use `RDD[Vector]` to store point cost during k-means|| initialization, where each `Vector` has size `runs`. This is not storage-efficient because `runs` is usually 1 and then each record is a Vector of size 1. What we need is just the 8 bytes to store the cost, but we introduce two objects (DenseVector and its values array), which could cost 16 bytes. That is 200% overhead. Thanks [~Grace Huang] and Jiayin Hu from Intel for reporting this issue! There are several solutions: 1. Use `RDD[Array[Double]]` instead of `RDD[Vector]`, which saves 8 bytes per record. 2. Use `RDD[Array[Double]]`) but batch the values for storage, e.g. each `Array[Double]` object covers 1024 instances, which could remove most of the overhead. Besides, using MEMORY_AND_DISK instead of MEMORY_ONLY could prevent cost RDDs kicking out the training dataset from memory. Cost RDD in k-means initialization is not storage-efficient --- Key: SPARK-10329 URL: https://issues.apache.org/jira/browse/SPARK-10329 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.1, 1.4.1, 1.5.0 Reporter: Xiangrui Meng Labels: clustering Currently we use `RDD[Vector]` to store point cost during k-means|| initialization, where each `Vector` has size `runs`. This is not storage-efficient because `runs` is usually 1 and then each record is a Vector of size 1. What we need is just the 8 bytes to store the cost, but we introduce two objects (DenseVector and its values array), which could cost 16 bytes. That is 200% overhead. Thanks [~Grace Huang] and Jiayin Hu from Intel for reporting this issue! There are several solutions: 1. Use `RDD[Array[Double]]` instead of `RDD[Vector]`, which saves 8 bytes per record. 2. Use `RDD[Array[Double]]` but batch the values for storage, e.g. each `Array[Double]` object covers 1024 instances, which could remove most of the overhead. Besides, using MEMORY_AND_DISK instead of MEMORY_ONLY could prevent cost RDDs kicking out the training dataset from memory. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
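A minimal sketch of option 1 combined with the persistence suggestion; every name here is illustrative scaffolding, not the actual k-means|| code:
{code:scala}
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.storage.StorageLevel

def costSketch(sc: SparkContext): Unit = {
  val points = sc.parallelize(Seq(Vectors.dense(1.0, 2.0), Vectors.dense(3.0, 4.0)))
  val center = Vectors.dense(0.0, 0.0)
  // One raw Array[Double] per point (runs == 1) instead of a DenseVector
  // wrapper, avoiding the extra object and its header per record.
  val costs = points.map(p => Array(Vectors.sqdist(p, center)))
  // MEMORY_AND_DISK lets cost blocks spill to disk instead of evicting
  // the cached training data from memory.
  costs.persist(StorageLevel.MEMORY_AND_DISK)
}
{code}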
[jira] [Created] (SPARK-10329) Cost RDD in k-means initialization is not storage-efficient
Xiangrui Meng created SPARK-10329: - Summary: Cost RDD in k-means initialization is not storage-efficient Key: SPARK-10329 URL: https://issues.apache.org/jira/browse/SPARK-10329 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.4.1, 1.3.1, 1.5.0 Reporter: Xiangrui Meng Currently we use `RDD[Vector]` to store point cost during k-means|| initialization, where each `Vector` has size `runs`. This is not storage-efficient because `runs` is usually 1 and then each record is a Vector of size 1. What we need is just the 8 bytes to store the cost, but we introduce two objects (DenseVector and its values array), which could cost 16 bytes. That is 200% overhead. Thanks [~Grace Huang] and Jiayin Hu from Intel for reporting this issue! There are several solutions: 1. Use `RDD[Array[Double]]` instead of `RDD[Vector]`, which saves 8 bytes per record. 2. Use `RDD[Array[Double]]`) but batch the values for storage, e.g. each `Array[Double]` object covers 1024 instances, which could remove most of the overhead. Besides, using MEMORY_AND_DISK instead of MEMORY_ONLY could prevent cost RDDs kicking out the training dataset from memory. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org