[jira] [Updated] (SPARK-10316) respect non-deterministic expressions in PhysicalOperation

2015-08-27 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-10316:

Description: We did a lot of special handling for non-deterministic 
expressions in Optimizer. However, PhysicalOperation just collects all Projects 
and Filters and messes it up. We should respect the operator order caused by 
non-deterministic expressions in PhysicalOperation.  (was: We did a lot of 
special handling for non-deterministic expressions in Optimizer. However, 
PhysicalOperation just collects all Projects and Filters and messed it up. We 
should respect the operator order caused by non-deterministic expressions in 
PhysicalOperation.)

 respect non-deterministic expressions in PhysicalOperation
 --

 Key: SPARK-10316
 URL: https://issues.apache.org/jira/browse/SPARK-10316
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Wenchen Fan

 We did a lot of special handling for non-deterministic expressions in 
 Optimizer. However, PhysicalOperation just collects all Projects and Filters 
 and messes it up. We should respect the operator order caused by 
 non-deterministic expressions in PhysicalOperation.
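
For illustration only (not from the ticket; the query and values are made up), a minimal Scala sketch of why the Project/Filter order matters around a non-deterministic expression such as rand():

{code}
// 'r' is produced by a non-deterministic rand(). If PhysicalOperation merges the
// Filter into the Project (or reorders them), rand() is re-evaluated in a different
// position and the filter no longer sees the values the projection produced.
import org.apache.spark.sql.functions.rand

val df = sqlContext.range(0, 100).select((rand() * 10).as("r"))
val filtered = df.filter("r > 5")   // must be evaluated on the already-computed 'r'
filtered.count()
{code}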






[jira] [Created] (SPARK-10316) respect nondeterministic expressions in PhysicalOperation

2015-08-27 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-10316:
---

 Summary: respect nondeterministic expressions in PhysicalOperation
 Key: SPARK-10316
 URL: https://issues.apache.org/jira/browse/SPARK-10316
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Wenchen Fan









[jira] [Assigned] (SPARK-10316) respect non-deterministic expressions in PhysicalOperation

2015-08-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10316:


Assignee: Apache Spark

 respect non-deterministic expressions in PhysicalOperation
 --

 Key: SPARK-10316
 URL: https://issues.apache.org/jira/browse/SPARK-10316
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Wenchen Fan
Assignee: Apache Spark

 We did a lot of special handling for non-deterministic expressions in 
 Optimizer. However, PhysicalOperation just collects all Projects and Filters 
 and messes it up. We should respect the operator order caused by 
 non-deterministic expressions in PhysicalOperation.






[jira] [Commented] (SPARK-10316) respect non-deterministic expressions in PhysicalOperation

2015-08-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14716786#comment-14716786
 ] 

Apache Spark commented on SPARK-10316:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/8486

 respect non-deterministic expressions in PhysicalOperation
 --

 Key: SPARK-10316
 URL: https://issues.apache.org/jira/browse/SPARK-10316
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Wenchen Fan

 We did a lot of special handling for non-deterministic expressions in 
 Optimizer. However, PhysicalOperation just collects all Projects and Filters 
 and messes it up. We should respect the operator order caused by 
 non-deterministic expressions in PhysicalOperation.






[jira] [Updated] (SPARK-10295) Dynamic allocation in Mesos does not release when RDDs are cached

2015-08-27 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-10295:
--

I believe that YARN currently will release executors even if they have cached 
data. I also recall that there's a desire to change this behavior, so that 
executors may stick around with cached data. I am not sure what the current or 
intended Mesos behavior is, but assume it's the same.

Therefore, this message may need to be softened to something like "Dynamic 
allocation is enabled; executors may be removed even when they contain cached 
data" or something similar. I don't think there are hard guarantees about the 
behavior in any event, and the intent is just to make the user aware that it's 
possible for cached data to go away with dynamic allocation on.

CC [~vanzin] and [~sandyr]

 Dynamic allocation in Mesos does not release when RDDs are cached
 -

 Key: SPARK-10295
 URL: https://issues.apache.org/jira/browse/SPARK-10295
 Project: Spark
  Issue Type: Question
  Components: Mesos
Affects Versions: 1.5.0
 Environment: Spark 1.5.0 RC1
 Centos 6
 java 7 oracle
Reporter: Hans van den Bogert
Priority: Minor

 When running Spark in coarse-grained mode with the shuffle service and dynamic 
 allocation, the driver does not release executors if a dataset is cached.
 The console output OTOH shows:
  15/08/26 17:29:58 WARN SparkContext: Dynamic allocation currently does not 
  support cached RDDs. Cached data for RDD 9 will be lost when executors are 
  removed.
 However, after the default idle timeout of 1m, executors are not released. When I 
 perform the same initial setup, loading data, etc., but without caching, the 
 executors are released.
 Is this intended behaviour?
 If this is intended behaviour, the console warning is misleading.
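
For reference, a sketch of the configuration under discussion (the keys are the standard ones; the values are illustrative, not taken from the reporter's setup). With these settings, idle executors are normally released after spark.dynamicAllocation.executorIdleTimeout, which defaults to 60s:

{code}
// Dynamic allocation with the external shuffle service (illustrative values only).
val conf = new org.apache.spark.SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.dynamicAllocation.executorIdleTimeout", "60s")
val sc = new org.apache.spark.SparkContext(conf)
{code}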






[jira] [Updated] (SPARK-10295) Dynamic allocation in Mesos does not release when RDDs are cached

2015-08-27 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-10295:
--
Component/s: Mesos

 Dynamic allocation in Mesos does not release when RDDs are cached
 -

 Key: SPARK-10295
 URL: https://issues.apache.org/jira/browse/SPARK-10295
 Project: Spark
  Issue Type: Question
  Components: Mesos
Affects Versions: 1.5.0
 Environment: Spark 1.5.0 RC1
 Centos 6
 java 7 oracle
Reporter: Hans van den Bogert
Priority: Minor

 When running Spark in coarse-grained mode with the shuffle service and dynamic 
 allocation, the driver does not release executors if a dataset is cached.
 The console output OTOH shows:
  15/08/26 17:29:58 WARN SparkContext: Dynamic allocation currently does not 
  support cached RDDs. Cached data for RDD 9 will be lost when executors are 
  removed.
 However, after the default idle timeout of 1m, executors are not released. When I 
 perform the same initial setup, loading data, etc., but without caching, the 
 executors are released.
 Is this intended behaviour?
 If this is intended behaviour, the console warning is misleading.






[jira] [Updated] (SPARK-10316) respect non-deterministic expressions in PhysicalOperation

2015-08-27 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-10316:

Description: We did a lot of special handling for non-deterministic 
expressions in Optimizer. However, PhysicalOperation just collects all Projects 
and Filters and messes it up. We should respect the operator order caused by 
non-deterministic expressions in PhysicalOperation.  (was: We did a lot of 
special handling for non-deterministic expressions in )

 respect non-deterministic expressions in PhysicalOperation
 --

 Key: SPARK-10316
 URL: https://issues.apache.org/jira/browse/SPARK-10316
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Wenchen Fan

 We did a lot of special handling for non-deterministic expressions in 
 Optimizer. However, PhysicalOperation just collects all Projects and Filters 
 and messes it up. We should respect the operator order caused by 
 non-deterministic expressions in PhysicalOperation.






[jira] [Commented] (SPARK-6906) Improve Hive integration support

2015-08-27 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14716616#comment-14716616
 ] 

Thomas Graves commented on SPARK-6906:
--

Thanks for the information. I'm trying to get it to work with our nonstandard 
version of Hive (0.13) plus backported patches, but I am having issues with 
authentication. I'm assuming it's something with our version of Hive.

 Improve Hive integration support
 

 Key: SPARK-6906
 URL: https://issues.apache.org/jira/browse/SPARK-6906
 Project: Spark
  Issue Type: Story
  Components: SQL
Reporter: Michael Armbrust
Assignee: Michael Armbrust
Priority: Blocker
 Fix For: 1.5.0


 Right now Spark SQL is very coupled to a specific version of Hive for two 
 primary reasons.
  - Metadata: we use the Hive Metastore client to retrieve information about 
 tables in a metastore.
  - Execution: UDFs, UDAFs, SerDes, HiveConf and various helper functions for 
 configuration.
 Since Hive is generally not compatible across versions, we currently 
 maintain fairly expensive shim layers to let us talk to both Hive 12 and Hive 
 13 metastores.  Ideally we would be able to talk to more versions of Hive 
 with less maintenance burden.
 This task proposes that we separate the Hive version that is used for 
 communicating with the metastore from the version that is used for execution. 
  In doing so we can significantly reduce the size of the shim by only 
 providing compatibility for metadata operations.  All execution will be done 
 with a single version of Hive (the newest version that is supported by Spark 
 SQL).
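
As a concrete illustration of the proposal, configuration along these lines decouples the two versions; the exact keys and values below are an assumption based on later documentation, not part of this ticket:

{code}
// Sketch: the metastore client version is chosen independently of the Hive
// version Spark uses for execution.
val conf = new org.apache.spark.SparkConf()
  .set("spark.sql.hive.metastore.version", "0.13.1")   // version used to talk to the metastore
  .set("spark.sql.hive.metastore.jars", "maven")       // or a classpath containing Hive 0.13 jars
{code}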






[jira] [Commented] (SPARK-10002) SSH problem during Setup of Spark(1.3.0) cluster on EC2

2015-08-27 Thread Zero tolerance (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14716767#comment-14716767
 ] 

Zero tolerance commented on SPARK-10002:


I met the same problem. Adding the parameter --private-ips seems to work.

 SSH problem during Setup of Spark(1.3.0) cluster on EC2
 ---

 Key: SPARK-10002
 URL: https://issues.apache.org/jira/browse/SPARK-10002
 Project: Spark
  Issue Type: Bug
  Components: EC2
Affects Versions: 1.3.0
 Environment: EC2, SPARK 1.3.0 cluster setup in vpc/subnet.
Reporter: Deepali Bhandari

 Steps to start a Spark cluster with the EC2 scripts:
 1. I created an EC2 instance in the VPC and subnet (Amazon Linux).
 2. I downloaded spark-1.3.0.
 3. chmod 400 the key file.
 4. Exported the AWS access and secret keys.
 5. Ran the command
  ./spark-ec2 --key-pair=deepali-ec2-keypair 
 --identity-file=/home/ec2-user/Spark/deepali-ec2-keypair.pem 
 --region=us-west-2 --zone=us-west-2b --vpc-id=vpc-03d67b66 
 --subnet-id=subnet-72fd5905 --resume launch deepali-spark-nodocker
 6. The master and slave instances are created, but SSH fails saying the host 
 cannot be resolved.
 7. I can ping the master and slave, and I can SSH from the command line, but not 
 from the EC2 scripts.
 8. I have spent more than 2 days on this, but no luck yet.
 9. The EC2 scripts don't work: the code has a bug in referencing the cluster 
 nodes via the wrong hostnames.
  
 SCREEN CONSOLE log
  ./spark-ec2 --key-pair=deepali-ec2-keypair 
 --identity-file=/home/ec2-user/Spark/deepali-ec2-keypair.pem --region=us-west-2 
 --zone=us-west-2b --vpc-id=vpc-03d67b66 --subnet-id=subnet-72fd5905 launch 
 deepali-spark-nodocker
 Downloading Boto from PyPi
 Finished downloading Boto
 Setting up security groups...
 Creating security group deepali-spark-nodocker-master
 Creating security group deepali-spark-nodocker-slaves
 Searching for existing cluster deepali-spark-nodocker...
 Spark AMI: ami-9a6e0daa
 Launching instances...
 Launched 1 slaves in us-west-2b, regid = r-0d2088fb
 Launched master in us-west-2b, regid = r-312088c7
 Waiting for AWS to propagate instance metadata...
 Waiting for cluster to enter 'ssh-ready' state...
 Warning: SSH connection error. (This could be temporary.)
 Host: None
 SSH return code: 255
 SSH output: ssh: Could not resolve hostname None: Name or service not known






[jira] [Updated] (SPARK-10316) respect nondeterministic expressions in PhysicalOperation

2015-08-27 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-10316:

Description: We did a lot of special handling for non-deterministic 
expressions in   (was: We did a lot of special handling for )

 respect nondeterministic expressions in PhysicalOperation
 -

 Key: SPARK-10316
 URL: https://issues.apache.org/jira/browse/SPARK-10316
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Wenchen Fan

 We did a lot of special handling for non-deterministic expressions in 






[jira] [Updated] (SPARK-10316) respect non-deterministic expressions in PhysicalOperation

2015-08-27 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-10316:

Summary: respect non-deterministic expressions in PhysicalOperation  (was: 
respect nondeterministic expressions in PhysicalOperation)

 respect non-deterministic expressions in PhysicalOperation
 --

 Key: SPARK-10316
 URL: https://issues.apache.org/jira/browse/SPARK-10316
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Wenchen Fan

 We did a lot of special handling for non-deterministic expressions in 






[jira] [Assigned] (SPARK-8472) Python API for DCT

2015-08-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8472:
---

Assignee: Apache Spark

 Python API for DCT
 --

 Key: SPARK-8472
 URL: https://issues.apache.org/jira/browse/SPARK-8472
 Project: Spark
  Issue Type: New Feature
  Components: ML
Reporter: Feynman Liang
Assignee: Apache Spark
Priority: Minor

 We need to implement a wrapper for enabling the DCT feature transformer to be 
 used from the Python API
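
For context, a minimal Scala usage of the JVM-side DCT transformer that the Python wrapper would expose (column names and data are made up, and this assumes the spark.ml DCT transformer is available in this version):

{code}
import org.apache.spark.ml.feature.DCT
import org.apache.spark.mllib.linalg.Vectors

val data = sqlContext.createDataFrame(Seq(
  (0, Vectors.dense(0.0, 1.0, -2.0, 3.0)),
  (1, Vectors.dense(-1.0, 2.0, 4.0, -7.0))
)).toDF("id", "features")

val dct = new DCT()
  .setInputCol("features")
  .setOutputCol("featuresDCT")
  .setInverse(false)

dct.transform(data).select("featuresDCT").show()
{code}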






[jira] [Assigned] (SPARK-8472) Python API for DCT

2015-08-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8472:
---

Assignee: (was: Apache Spark)

 Python API for DCT
 --

 Key: SPARK-8472
 URL: https://issues.apache.org/jira/browse/SPARK-8472
 Project: Spark
  Issue Type: New Feature
  Components: ML
Reporter: Feynman Liang
Priority: Minor

 We need to implement a wrapper for enabling the DCT feature transformer to be 
 used from the Python API






[jira] [Commented] (SPARK-8472) Python API for DCT

2015-08-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14716677#comment-14716677
 ] 

Apache Spark commented on SPARK-8472:
-

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/8485

 Python API for DCT
 --

 Key: SPARK-8472
 URL: https://issues.apache.org/jira/browse/SPARK-8472
 Project: Spark
  Issue Type: New Feature
  Components: ML
Reporter: Feynman Liang
Priority: Minor

 We need to implement a wrapper for enabling the DCT feature transformer to be 
 used from the Python API






[jira] [Assigned] (SPARK-10316) respect non-deterministic expressions in PhysicalOperation

2015-08-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10316:


Assignee: (was: Apache Spark)

 respect non-deterministic expressions in PhysicalOperation
 --

 Key: SPARK-10316
 URL: https://issues.apache.org/jira/browse/SPARK-10316
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Wenchen Fan

 We did a lot of special handling for non-deterministic expressions in 
 Optimizer. However, PhysicalOperation just collects all Projects and Filters 
 and messes it up. We should respect the operator order caused by 
 non-deterministic expressions in PhysicalOperation.






[jira] [Updated] (SPARK-10314) [CORE]RDD persist to OFF_HEAP tachyon got block rdd_x_x not found exception when parallelism is bigger than data split size

2015-08-27 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-10314:
--
Priority: Minor  (was: Major)

 [CORE]RDD persist to OFF_HEAP tachyon got block rdd_x_x not found exception 
 when parallelism is bigger than data split size
 

 Key: SPARK-10314
 URL: https://issues.apache.org/jira/browse/SPARK-10314
 Project: Spark
  Issue Type: Bug
  Components: Block Manager
Affects Versions: 1.4.1
 Environment: Spark 1.4.1,Hadoop 2.6.0,Tachyon 0.6.4
Reporter: Xiaoyu Wang
Priority: Minor

 RDD persist to OFF_HEAP tachyon got block rdd_x_x not found exception when 
 parallelism is bigger than data split size
 {code}
 val rdd = sc.parallelize(List(1, 2),2)
 rdd.persist(org.apache.spark.storage.StorageLevel.OFF_HEAP)
 rdd.count()
 {code}
 is ok.
 {code}
 val rdd = sc.parallelize(List(1, 2),3)
 rdd.persist(org.apache.spark.storage.StorageLevel.OFF_HEAP)
 rdd.count()
 {code}
 got exception:
 {noformat}
 15/08/27 17:53:07 INFO SparkContext: Starting job: count at console:24
 15/08/27 17:53:07 INFO DAGScheduler: Got job 0 (count at console:24) with 3 
 output partitions (allowLocal=false)
 15/08/27 17:53:07 INFO DAGScheduler: Final stage: ResultStage 0(count at 
 console:24)
 15/08/27 17:53:07 INFO DAGScheduler: Parents of final stage: List()
 15/08/27 17:53:07 INFO DAGScheduler: Missing parents: List()
 15/08/27 17:53:07 INFO DAGScheduler: Submitting ResultStage 0 
 (ParallelCollectionRDD[0] at parallelize at console:21), which has no 
 missing parents
 15/08/27 17:53:07 INFO MemoryStore: ensureFreeSpace(1096) called with 
 curMem=0, maxMem=741196431
 15/08/27 17:53:07 INFO MemoryStore: Block broadcast_0 stored as values in 
 memory (estimated size 1096.0 B, free 706.9 MB)
 15/08/27 17:53:07 INFO MemoryStore: ensureFreeSpace(788) called with 
 curMem=1096, maxMem=741196431
 15/08/27 17:53:07 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes 
 in memory (estimated size 788.0 B, free 706.9 MB)
 15/08/27 17:53:07 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory 
 on localhost:43776 (size: 788.0 B, free: 706.9 MB)
 15/08/27 17:53:07 INFO SparkContext: Created broadcast 0 from broadcast at 
 DAGScheduler.scala:874
 15/08/27 17:53:07 INFO DAGScheduler: Submitting 3 missing tasks from 
 ResultStage 0 (ParallelCollectionRDD[0] at parallelize at console:21)
 15/08/27 17:53:07 INFO TaskSchedulerImpl: Adding task set 0.0 with 3 tasks
 15/08/27 17:53:07 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 
 localhost, PROCESS_LOCAL, 1269 bytes)
 15/08/27 17:53:07 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, 
 localhost, PROCESS_LOCAL, 1270 bytes)
 15/08/27 17:53:07 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, 
 localhost, PROCESS_LOCAL, 1270 bytes)
 15/08/27 17:53:07 INFO Executor: Running task 2.0 in stage 0.0 (TID 2)
 15/08/27 17:53:07 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
 15/08/27 17:53:07 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
 15/08/27 17:53:07 INFO CacheManager: Partition rdd_0_2 not found, computing it
 15/08/27 17:53:07 INFO CacheManager: Partition rdd_0_1 not found, computing it
 15/08/27 17:53:07 INFO CacheManager: Partition rdd_0_0 not found, computing it
 15/08/27 17:53:07 INFO ExternalBlockStore: ExternalBlockStore started
 15/08/27 17:53:08 WARN : tachyon.home is not set. Using 
 /mnt/tachyon_default_home as the default value.
 15/08/27 17:53:08 INFO : Tachyon client (version 0.6.4) is trying to connect 
 master @ localhost/127.0.0.1:19998
 15/08/27 17:53:08 INFO : User registered at the master 
 localhost/127.0.0.1:19998 got UserId 109
 15/08/27 17:53:08 INFO TachyonBlockManager: Created tachyon directory at 
 /spark/spark-c6ec419f-7c7d-48a6-8448-c2431e761ea5/driver/spark-tachyon-20150827175308-6aa5
 15/08/27 17:53:08 INFO : Trying to get local worker host : localhost
 15/08/27 17:53:08 INFO : Connecting local worker @ localhost/127.0.0.1:29998
 15/08/27 17:53:08 INFO : Folder /mnt/ramdisk/tachyonworker/users/109 was 
 created!
 15/08/27 17:53:08 INFO : /mnt/ramdisk/tachyonworker/users/109/4386235351040 
 was created!
 15/08/27 17:53:08 INFO : /mnt/ramdisk/tachyonworker/users/109/4388382834688 
 was created!
 15/08/27 17:53:08 INFO BlockManagerInfo: Added rdd_0_0 on ExternalBlockStore 
 on localhost:43776 (size: 0.0 B)
 15/08/27 17:53:08 INFO BlockManagerInfo: Added rdd_0_1 on ExternalBlockStore 
 on localhost:43776 (size: 2.0 B)
 15/08/27 17:53:08 INFO BlockManagerInfo: Added rdd_0_2 on ExternalBlockStore 
 on localhost:43776 (size: 2.0 B)
 15/08/27 17:53:08 INFO BlockManager: Found block rdd_0_1 locally
 15/08/27 17:53:08 INFO BlockManager: Found block rdd_0_2 locally
 15/08/27 17:53:08 INFO Executor: 

[jira] [Updated] (SPARK-10315) remove document on spark.akka.failure-detector.threshold

2015-08-27 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-10315:
--
Priority: Minor  (was: Major)

 remove document on spark.akka.failure-detector.threshold
 

 Key: SPARK-10315
 URL: https://issues.apache.org/jira/browse/SPARK-10315
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Reporter: Nan Zhu
Priority: Minor

 this parameter is not used any longer, and there is a mistake in the 
 current document: it should be 'akka.remote.watch-failure-detector.threshold'






[jira] [Updated] (SPARK-10316) respect nondeterministic expressions in PhysicalOperation

2015-08-27 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-10316:

Description: We did a lot of special handling for 

 respect nondeterministic expressions in PhysicalOperation
 -

 Key: SPARK-10316
 URL: https://issues.apache.org/jira/browse/SPARK-10316
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Wenchen Fan

 We did a lot of special handling for 






[jira] [Created] (SPARK-10319) ALS training using PySpark throws a StackOverflowError

2015-08-27 Thread Velu nambi (JIRA)
Velu nambi created SPARK-10319:
--

 Summary: ALS training using PySpark throws a StackOverflowError
 Key: SPARK-10319
 URL: https://issues.apache.org/jira/browse/SPARK-10319
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.4.1
 Environment: Windows 10, spark - 1.4.1,
Reporter: Velu nambi


When attempting to train a machine learning model using ALS in Spark's MLlib 
(1.4) on Windows, PySpark always terminates with a StackOverflowError. I tried 
adding the checkpoint as described in http://stackoverflow.com/a/31484461/36130 
-- it doesn't seem to help.

Here's the training code and stack trace:

{code:none}
ranks = [8, 12]
lambdas = [0.1, 10.0]
numIters = [10, 20]
bestModel = None
bestValidationRmse = float("inf")
bestRank = 0
bestLambda = -1.0
bestNumIter = -1

for rank, lmbda, numIter in itertools.product(ranks, lambdas, numIters):
ALS.checkpointInterval = 2
model = ALS.train(training, rank, numIter, lmbda)
validationRmse = computeRmse(model, validation, numValidation)

if (validationRmse < bestValidationRmse):
 bestModel = model
 bestValidationRmse = validationRmse
 bestRank = rank
 bestLambda = lmbda
 bestNumIter = numIter

testRmse = computeRmse(bestModel, test, numTest)
{code}

Stacktrace:

15/08/27 02:02:58 ERROR Executor: Exception in task 3.0 in stage 56.0 (TID 127)
java.lang.StackOverflowError
at java.io.ObjectInputStream$BlockDataInputStream.readInt(Unknown Source)
at java.io.ObjectInputStream.readHandle(Unknown Source)
at java.io.ObjectInputStream.readClassDesc(Unknown Source)
at java.io.ObjectInputStream.readOrdinaryObject(Unknown Source)
at java.io.ObjectInputStream.readObject0(Unknown Source)
at java.io.ObjectInputStream.defaultReadFields(Unknown Source)
at java.io.ObjectInputStream.readSerialData(Unknown Source)
at java.io.ObjectInputStream.readOrdinaryObject(Unknown Source)
at java.io.ObjectInputStream.readObject0(Unknown Source)
at java.io.ObjectInputStream.defaultReadFields(Unknown Source)
at java.io.ObjectInputStream.readSerialData(Unknown Source)
at java.io.ObjectInputStream.readOrdinaryObject(Unknown Source)
at java.io.ObjectInputStream.readObject0(Unknown Source)
at java.io.ObjectInputStream.defaultReadFields(Unknown Source)
at java.io.ObjectInputStream.readSerialData(Unknown Source)
at java.io.ObjectInputStream.readOrdinaryObject(Unknown Source)
at java.io.ObjectInputStream.readObject0(Unknown Source)
at java.io.ObjectInputStream.readObject(Unknown Source)
at scala.collection.immutable.$colon$colon.readObject(List.scala:362)
at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at java.io.ObjectStreamClass.invokeReadObject(Unknown Source)
at java.io.ObjectInputStream.readSerialData(Unknown Source)






[jira] [Commented] (SPARK-10318) Getting issue in spark connectivity with cassandra

2015-08-27 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14717139#comment-14717139
 ] 

Sean Owen commented on SPARK-10318:
---

I personally don't know, but if this is a question about JDBC + Cassandra it 
should go to the Cassandra mailing list first. If it's about the DataStax 
driver ask DataStax. If you suspect it really might have to do with Spark, I'd 
ask u...@spark.apache.org. A JIRA isn't the right step at this point since it's 
not clear there is even a problem in Spark here.

 Getting issue in spark connectivity with cassandra
 --

 Key: SPARK-10318
 URL: https://issues.apache.org/jira/browse/SPARK-10318
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 1.4.0
 Environment: Spark on local mode with centos 6.x
Reporter: Poorvi Lashkary
Priority: Minor

 Use case: I have to create a Spark SQL DataFrame from a table in Cassandra 
 with the JDBC driver.






[jira] [Commented] (SPARK-10320) Support new topic subscriptions without requiring restart of the streaming context

2015-08-27 Thread Sudarshan Kadambi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14717220#comment-14717220
 ] 

Sudarshan Kadambi commented on SPARK-10320:
---

There is ingest-time analytics (independent application of transforms over 
data published to individual topics) and query-time analytics (user queries 
which require joins across RDDs holding the transformed data). However, even 
ingest-time analytics will potentially require joins across data published to 
different topics. For these reasons, this needs to be a single Spark streaming 
application.

 Support new topic subscriptions without requiring restart of the streaming 
 context
 --

 Key: SPARK-10320
 URL: https://issues.apache.org/jira/browse/SPARK-10320
 Project: Spark
  Issue Type: New Feature
  Components: Streaming
Reporter: Sudarshan Kadambi

 Spark Streaming lacks the ability to subscribe to newer topics or unsubscribe 
 from current ones once the streaming context has been started. Restarting the 
 streaming context increases the latency of update handling.
 Consider a streaming application subscribed to n topics. Let's say 1 of the 
 topics is no longer needed in streaming analytics and hence should be 
 dropped. We could do this by stopping the streaming context, removing that 
 topic from the topic list and restarting the streaming context. Since with 
 some DStreams such as DirectKafkaStream, the per-partition offsets are 
 maintained by Spark, we should be able to resume uninterrupted (I think?) 
 from where we left off with a minor delay. However, in instances where 
 expensive state initialization (from an external datastore) may be needed for 
 datasets published to all topics, before streaming updates can be applied to 
 it, it is more convenient to only subscribe or unsubscribe to the incremental 
 changes to the topic list. Without such a feature, updates go unprocessed for 
 longer than they need to be, thus affecting QoS.
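
To make the restart cost concrete, a rough Scala sketch of the only option available today (broker address, topic names, batch interval, and processing are all made up); changing the topic set means tearing the context down and building a new one:

{code}
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

def startContext(topics: Set[String]): StreamingContext = {
  val ssc = new StreamingContext(sc, Seconds(10))
  val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")   // assumed broker
  val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, topics)
  stream.foreachRDD(rdd => println(rdd.count()))   // placeholder processing
  ssc.start()
  ssc
}

// Dropping or adding a topic currently requires a full restart, e.g.:
//   ssc.stop(stopSparkContext = false); val ssc2 = startContext(newTopics)
{code}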






[jira] [Commented] (SPARK-5741) Support the path contains comma in HiveContext

2015-08-27 Thread koert kuipers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14717235#comment-14717235
 ] 

koert kuipers commented on SPARK-5741:
--

I am reading Avro and CSV mostly, but we try to support multiple inputs across 
a wide range of formats (currently Avro, CSV, JSON, and Parquet).

I realize Parquet supports it, but it does so by explicitly working around the 
general infrastructure.

I am sympathetic to the idea of no longer doing string munging, but that poses 
some challenges since the main vehicle to carry this information is a 
Map[String, String] (DataFrameReader.extraOptions).

If we could come up with a general way to do this that does not involve string 
munging, I am happy to work on it. The ideal API in my view would be something 
like:
sqlContext.read.format(...).paths("a", "b")

Alternatively this could be expressed as a union operation of many dataframes, 
but I do not have the knowledge of the relevant code to understand if that is 
feasible, scalable, and will support predicate pushdown and such. But if that 
works then I have no need for multiple inputs in DataFrameReader...

From what I know from other projects such as Scalding, I think it is a very 
common request to be able to support multiple paths, and you would exclude a 
significant userbase by not supporting it. But that's just a guess...
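
A sketch of the contrast being described; the {{.paths(...)}} call is the hypothetical API proposed above and does not exist, and the Avro format name is an assumption:

{code}
// Current vehicle: a single string option ("string munging"); whether a
// comma-separated value is split into multiple paths depends on the data source.
val today = sqlContext.read.format("com.databricks.spark.avro")
  .load("/data/2015-08-01,/data/2015-08-02")

// Proposed (hypothetical): pass the paths separately, with no string munging.
// val proposed = sqlContext.read.format("com.databricks.spark.avro")
//   .paths("/data/2015-08-01", "/data/2015-08-02")
{code}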

 Support the path contains comma in HiveContext
 --

 Key: SPARK-5741
 URL: https://issues.apache.org/jira/browse/SPARK-5741
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Yadong Qi
Assignee: Yadong Qi
 Fix For: 1.3.0


 When running ```select * from nzhang_part where hr = 'file,';```, it throws the 
 exception ```java.lang.IllegalArgumentException: Can not create a Path from 
 an empty string``` because the HDFS path contains a comma, and 
 FileInputFormat.setInputPaths will split the path by comma.
 ###
 SQL
 ###
 set hive.merge.mapfiles=true; 
 set hive.merge.mapredfiles=true;
 set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
 set hive.exec.dynamic.partition=true;
 set hive.exec.dynamic.partition.mode=nonstrict;
 create table nzhang_part like srcpart;
 insert overwrite table nzhang_part partition (ds='2010-08-15', hr) select 
 key, value, hr from srcpart where ds='2008-04-08';
 insert overwrite table nzhang_part partition (ds='2010-08-15', hr=11) select 
 key, value from srcpart where ds='2008-04-08';
 insert overwrite table nzhang_part partition (ds='2010-08-15', hr)  
 select * from (
 select key, value, hr from srcpart where ds='2008-04-08'
 union all
 select '1' as key, '1' as value, 'file,' as hr from src limit 1) s;
 select * from nzhang_part where hr = 'file,';
 ###
 Error log
 ###
 15/02/10 14:33:16 ERROR SparkSQLDriver: Failed in [select * from nzhang_part 
 where hr = 'file,']
 java.lang.IllegalArgumentException: Can not create a Path from an empty string
 at org.apache.hadoop.fs.Path.checkPathArg(Path.java:127)
 at org.apache.hadoop.fs.Path.init(Path.java:135)
 at 
 org.apache.hadoop.util.StringUtils.stringToPath(StringUtils.java:241)
 at 
 org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:400)
 at 
 org.apache.spark.sql.hive.HadoopTableReader$.initializeLocalJobConfFunc(TableReader.scala:251)
 at 
 org.apache.spark.sql.hive.HadoopTableReader$$anonfun$11.apply(TableReader.scala:229)
 at 
 org.apache.spark.sql.hive.HadoopTableReader$$anonfun$11.apply(TableReader.scala:229)
 at 
 org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:172)
 at 
 org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:172)
 at scala.Option.map(Option.scala:145)
 at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:172)
 at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:196)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:223)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:221)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:221)
 at 
 org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:223)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:221)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:221)
 at 
 org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
  

[jira] [Commented] (SPARK-10317) start-history-server.sh CLI parsing incompatible with HistoryServer's arg parsing

2015-08-27 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14716819#comment-14716819
 ] 

Steve Loughran commented on SPARK-10317:


There are various possible fixes here:

# {{start-history-server}} script to {{shift;}} out the $1 arg, then pass the 
remainder down.
# {{start-history-server}} script to prefix the $1 arg with {{-d}} while passing 
down the whole line.
# {{HistoryServerArguments}} to convert the $1 arg to a directory unless it's a 
recognised "-" option (a rough sketch follows below).
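
A rough Scala sketch of option 3 (the names and structure are assumptions, not the actual {{HistoryServerArguments}} code): treat a bare leading argument as the log directory unless it looks like an option flag:

{code}
// Illustrative only -- not the real parser.
def parse(args: List[String], conf: org.apache.spark.SparkConf): Unit = args match {
  case Nil => ()
  case dir :: tail if !dir.startsWith("-") =>
    // Legacy positional form: a bare first argument is the event-log directory.
    conf.set("spark.history.fs.logDirectory", dir)
    parse(tail, conf)
  case flag :: value :: tail if flag.startsWith("--") =>
    // Recognised "--option value" pairs would be handled here.
    parse(tail, conf)
  case other :: _ =>
    sys.error(s"Unrecognized option: $other")
}
{code}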






 start-history-server.sh CLI parsing incompatible with HistoryServer's arg 
 parsing
 -

 Key: SPARK-10317
 URL: https://issues.apache.org/jira/browse/SPARK-10317
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.1
Reporter: Steve Loughran

 The history server has its argument parsing class in 
 {{HistoryServerArguments}}. However, this doesn't get involved in the 
 {{start-history-server.sh}} codepath, where the $0 arg is assigned to 
 {{spark.history.fs.logDirectory}} and all other arguments are discarded (e.g. 
 {{--property-file}}).
 This stops the other options from being usable from this script.






[jira] [Commented] (SPARK-10318) Getting issue in spark connectivity with cassandra

2015-08-27 Thread Poorvi Lashkary (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14717049#comment-14717049
 ] 

Poorvi Lashkary commented on SPARK-10318:
-

I have done the following:
private static final String C_DRIVER = 
"org.apache.cassandra.cql.jdbc.CassandraDriver";
private static final String Cassandra_USERNAME = "abc";
private static final String C_PWD = "abc123";
private static final String C_CONNECTION_URL = 
"jdbc:cassandra://localhost:9160/MyKeyspace?user=" + Cassandra_USERNAME + 
"&password=" + C_PWD;
Map<String, String> options = new HashMap<String, String>();
options.put("driver", C_DRIVER);
options.put("url", C_CONNECTION_URL);
options.put("dbtable", "test");
DataFrame jdbcDF = sc.load("jdbc", options);
jdbcDF.registerTempTable("datafrm");
DataFrame d = sc.sql("select * from datafrm");
d.count();

Then I get the following error:
InvalidRequestException(why:line 1:25 no viable alternative at input '1' 
(SELECT * FROM test WHERE [1]...))

I am not getting why the WHERE clause is here. Must we fetch with a WHERE 
clause?

 Getting issue in spark connectivity with cassandra
 --

 Key: SPARK-10318
 URL: https://issues.apache.org/jira/browse/SPARK-10318
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 1.4.0
 Environment: Spark on local mode with centos 6.x
Reporter: Poorvi Lashkary
Priority: Minor

 Use case: I have to create a Spark SQL DataFrame from a table in Cassandra 
 with the JDBC driver.






[jira] [Commented] (SPARK-10318) Getting issue in spark connectivity with cassandra

2015-08-27 Thread Poorvi Lashkary (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14717093#comment-14717093
 ] 

Poorvi Lashkary commented on SPARK-10318:
-

Can you provide a way to establish a JDBC connection with Cassandra without 
using DataStax, i.e. with a simple JDBC connection?

 Getting issue in spark connectivity with cassandra
 --

 Key: SPARK-10318
 URL: https://issues.apache.org/jira/browse/SPARK-10318
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 1.4.0
 Environment: Spark on local mode with centos 6.x
Reporter: Poorvi Lashkary
Priority: Minor

 Use case: I have to create a Spark SQL DataFrame from a table in Cassandra 
 with the JDBC driver.






[jira] [Commented] (SPARK-10319) ALS training using PySpark throws a StackOverflowError

2015-08-27 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14717134#comment-14717134
 ] 

Sean Owen commented on SPARK-10319:
---

Definitely sounds like https://issues.apache.org/jira/browse/SPARK-5955, so 
either somehow the checkpoint interval isn't taking effect, or this is actually 
slightly different. If you scroll way, way back, what's at the top of the stack, 
or is it truncated? Does it work with some number of iterations but not others? 
Do you see evidence of checkpointing in the logs?
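
For comparison, a minimal Scala sketch of checkpointing taking effect for MLlib ALS (the path, data, and parameter values are made up, and this assumes {{setCheckpointInterval}} is available on this version's ALS). Without a checkpoint directory the interval setting has nothing to write to, and a long lineage can overflow the stack during deserialization:

{code}
import org.apache.spark.mllib.recommendation.{ALS, Rating}

sc.setCheckpointDir("/tmp/spark-checkpoints")   // assumed local path
val ratings = sc.parallelize(Seq(Rating(1, 1, 5.0), Rating(1, 2, 1.0), Rating(2, 1, 4.0)))

val model = new ALS()
  .setRank(8)
  .setIterations(20)
  .setLambda(0.1)
  .setCheckpointInterval(2)   // only effective once a checkpoint dir is set
  .run(ratings)
{code}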

 ALS training using PySpark throws a StackOverflowError
 --

 Key: SPARK-10319
 URL: https://issues.apache.org/jira/browse/SPARK-10319
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.4.1
 Environment: Windows 10, spark - 1.4.1,
Reporter: Velu nambi

 When attempting to train a machine learning model using ALS in Spark's MLlib 
 (1.4) on Windows, PySpark always terminates with a StackOverflowError. I 
 tried adding the checkpoint as described in 
 http://stackoverflow.com/a/31484461/36130 -- it doesn't seem to help.
 Here's the training code and stack trace:
 {code:none}
 ranks = [8, 12]
 lambdas = [0.1, 10.0]
 numIters = [10, 20]
 bestModel = None
 bestValidationRmse = float("inf")
 bestRank = 0
 bestLambda = -1.0
 bestNumIter = -1
 for rank, lmbda, numIter in itertools.product(ranks, lambdas, numIters):
 ALS.checkpointInterval = 2
 model = ALS.train(training, rank, numIter, lmbda)
 validationRmse = computeRmse(model, validation, numValidation)
 if (validationRmse < bestValidationRmse):
  bestModel = model
  bestValidationRmse = validationRmse
  bestRank = rank
  bestLambda = lmbda
  bestNumIter = numIter
 testRmse = computeRmse(bestModel, test, numTest)
 {code}
 Stacktrace:
 15/08/27 02:02:58 ERROR Executor: Exception in task 3.0 in stage 56.0 (TID 
 127)
 java.lang.StackOverflowError
 at java.io.ObjectInputStream$BlockDataInputStream.readInt(Unknown Source)
 at java.io.ObjectInputStream.readHandle(Unknown Source)
 at java.io.ObjectInputStream.readClassDesc(Unknown Source)
 at java.io.ObjectInputStream.readOrdinaryObject(Unknown Source)
 at java.io.ObjectInputStream.readObject0(Unknown Source)
 at java.io.ObjectInputStream.defaultReadFields(Unknown Source)
 at java.io.ObjectInputStream.readSerialData(Unknown Source)
 at java.io.ObjectInputStream.readOrdinaryObject(Unknown Source)
 at java.io.ObjectInputStream.readObject0(Unknown Source)
 at java.io.ObjectInputStream.defaultReadFields(Unknown Source)
 at java.io.ObjectInputStream.readSerialData(Unknown Source)
 at java.io.ObjectInputStream.readOrdinaryObject(Unknown Source)
 at java.io.ObjectInputStream.readObject0(Unknown Source)
 at java.io.ObjectInputStream.defaultReadFields(Unknown Source)
 at java.io.ObjectInputStream.readSerialData(Unknown Source)
 at java.io.ObjectInputStream.readOrdinaryObject(Unknown Source)
 at java.io.ObjectInputStream.readObject0(Unknown Source)
 at java.io.ObjectInputStream.readObject(Unknown Source)
 at scala.collection.immutable.$colon$colon.readObject(List.scala:362)
 at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
 at java.lang.reflect.Method.invoke(Unknown Source)
 at java.io.ObjectStreamClass.invokeReadObject(Unknown Source)
 at java.io.ObjectInputStream.readSerialData(Unknown Source)






[jira] [Updated] (SPARK-10320) Support new topic subscriptions without requiring restart of the streaming context

2015-08-27 Thread Sudarshan Kadambi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sudarshan Kadambi updated SPARK-10320:
--
Description: 
Spark Streaming lacks the ability to subscribe to newer topics or unsubscribe 
from current ones once the streaming context has been started. Restarting the 
streaming context increases the latency of update handling.

Consider a streaming application subscribed to n topics. Let's say 1 of the 
topics is no longer needed in streaming analytics and hence should be dropped. 
We could do this by stopping the streaming context, removing that topic from 
the topic list and restarting the streaming context. Since with some DStreams 
such as DirectKafkaStream, the per-partition offsets are maintained by Spark, 
we should be able to resume uninterrupted (I think?) from where we left off 
with a minor delay. However, in instances where expensive state initialization 
(from an external datastore) may be needed for datasets published to all 
topics, before streaming updates can be applied to it, it is more convenient to 
only subscribe or unsubscribe to the incremental changes to the topic list. 
Without such a feature, updates go unprocessed for longer than they need to be, 
thus affecting QoS.

  was:
Spark Streaming lacks the ability to subscribe to newer topics or unsubscribe 
to current ones once the streaming context has been started. Restarting the 
streaming context increases the latency of update handling.

Consider a streaming application subscribed to n topics. Let's say 1 of the 
topics is no longer needed in streaming analytics and hence should be dropped. 
We could do this by stopping the streaming context, removing that topic from 
the topic list and restarting the streaming context. Since with some DStreams 
such as DirectKafkaStream, the per-partition offsets are maintained by Spark, 
we should be able to resume uninterrupted (I think?) from where we left off 
with a minor delay. However, in instances where expensive state initialization 
(from an external datastore) may be needed for datasets published to all 
topics, before streaming updates can be applied to it, it is more convenient to 
only subscribe or unsubcribe to the incremental changes to the topic list. 
Without such a feature, updates go unprocessed for longer than they need to be 
affecting QoS.


 Support new topic subscriptions without requiring restart of the streaming 
 context
 --

 Key: SPARK-10320
 URL: https://issues.apache.org/jira/browse/SPARK-10320
 Project: Spark
  Issue Type: New Feature
  Components: Streaming
Reporter: Sudarshan Kadambi

 Spark Streaming lacks the ability to subscribe to newer topics or unsubscribe 
 from current ones once the streaming context has been started. Restarting the 
 streaming context increases the latency of update handling.
 Consider a streaming application subscribed to n topics. Let's say 1 of the 
 topics is no longer needed in streaming analytics and hence should be 
 dropped. We could do this by stopping the streaming context, removing that 
 topic from the topic list and restarting the streaming context. Since with 
 some DStreams such as DirectKafkaStream, the per-partition offsets are 
 maintained by Spark, we should be able to resume uninterrupted (I think?) 
 from where we left off with a minor delay. However, in instances where 
 expensive state initialization (from an external datastore) may be needed for 
 datasets published to all topics, before streaming updates can be applied to 
 it, it is more convenient to only subscribe or unsubscribe to the incremental 
 changes to the topic list. Without such a feature, updates go unprocessed for 
 longer than they need to be, thus affecting QoS.






[jira] [Resolved] (SPARK-10182) GeneralizedLinearModel doesn't unpersist cached data

2015-08-27 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-10182.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8395
[https://github.com/apache/spark/pull/8395]

 GeneralizedLinearModel doesn't unpersist cached data
 

 Key: SPARK-10182
 URL: https://issues.apache.org/jira/browse/SPARK-10182
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.4.1
Reporter: Vyacheslav Baranov
Assignee: Vyacheslav Baranov
Priority: Minor
 Fix For: 1.6.0


 The problem might be reproduced in spark-shell with following code snippet:
 {code}
 import org.apache.spark.SparkContext
 import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
 import org.apache.spark.mllib.linalg.Vectors
 import org.apache.spark.mllib.regression.LabeledPoint
 val samples = Seq[LabeledPoint](
   LabeledPoint(1.0, Vectors.dense(1.0, 0.0)),
   LabeledPoint(1.0, Vectors.dense(0.0, 1.0)),
   LabeledPoint(0.0, Vectors.dense(1.0, 1.0)),
   LabeledPoint(0.0, Vectors.dense(0.0, 0.0))
 )
 val rdd = sc.parallelize(samples)
 for (i <- 0 until 10) {
   val model = {
 new LogisticRegressionWithLBFGS()
   .setNumClasses(2)
   .run(rdd)
   .clearThreshold()
   }
 }
 {code}
 After code execution there are 10 {{MapPartitionsRDD}} objects on Storage 
 tab in Spark application UI.
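
For context, a hedged sketch of the cleanup pattern the fix amounts to (names are illustrative, not the actual patch): whatever run() caches internally should be unpersisted before returning:

{code}
// Illustrative pattern only.
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

def trainWithCleanup(input: RDD[LabeledPoint]): Unit = {
  val cached = input.map(lp => (lp.label, lp.features)).cache()
  try {
    cached.count()     // stand-in for the iterative optimization work
  } finally {
    cached.unpersist() // otherwise a MapPartitionsRDD lingers on the Storage tab
  }
}
{code}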






[jira] [Commented] (SPARK-10320) Support new topic subscriptions without requiring restart of the streaming context

2015-08-27 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14717184#comment-14717184
 ] 

Sean Owen commented on SPARK-10320:
---

It sounds like you listen to topics and process them fairly independently 
(as you should). Why not run multiple streaming apps? Sure, you incur some 
overhead, but you gain isolation and simplicity.

 Support new topic subscriptions without requiring restart of the streaming 
 context
 --

 Key: SPARK-10320
 URL: https://issues.apache.org/jira/browse/SPARK-10320
 Project: Spark
  Issue Type: New Feature
  Components: Streaming
Reporter: Sudarshan Kadambi

 Spark Streaming lacks the ability to subscribe to newer topics or unsubscribe 
 from current ones once the streaming context has been started. Restarting the 
 streaming context increases the latency of update handling.
 Consider a streaming application subscribed to n topics. Let's say 1 of the 
 topics is no longer needed in streaming analytics and hence should be 
 dropped. We could do this by stopping the streaming context, removing that 
 topic from the topic list and restarting the streaming context. Since with 
 some DStreams such as DirectKafkaStream, the per-partition offsets are 
 maintained by Spark, we should be able to resume uninterrupted (I think?) 
 from where we left off with a minor delay. However, in instances where 
 expensive state initialization (from an external datastore) may be needed for 
 datasets published to all topics, before streaming updates can be applied to 
 it, it is more convenient to only subscribe or unsubscribe to the incremental 
 changes to the topic list. Without such a feature, updates go unprocessed for 
 longer than they need to be, thus affecting QoS.






[jira] [Created] (SPARK-10321) OrcRelation doesn't override sizeInBytes

2015-08-27 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-10321:
--

 Summary: OrcRelation doesn't override sizeInBytes
 Key: SPARK-10321
 URL: https://issues.apache.org/jira/browse/SPARK-10321
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: Cheng Lian
Priority: Critical


This hurts performance badly because broadcast join can never be enabled.
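
For context, a sketch of why this matters (the config key is the standard one and the value shown is just its documented default): the planner only broadcasts a side whose sizeInBytes estimate falls under this threshold, and a relation that never overrides the default (very large) estimate will never qualify:

{code}
// The auto-broadcast threshold that sizeInBytes is compared against
// (10 MB is the documented default, shown here only for illustration).
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", (10L * 1024 * 1024).toString)
{code}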






[jira] [Commented] (SPARK-4240) Refine Tree Predictions in Gradient Boosting to Improve Prediction Accuracy.

2015-08-27 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14717209#comment-14717209
 ] 

Seth Hendrickson commented on SPARK-4240:
-

[~josephkb] I think there needs to be some discussion of how and where this 
fits into the current boosting package architecture. Right now, the ML GBT 
algorithm just calls the MLlib implementation of GBTs. While the random 
forest algorithm has already been moved into the ML package, the GBT algorithm 
has not and I assume this is because we are waiting on the 
implementation/result of 
[SPARK-7129|https://issues.apache.org/jira/browse/SPARK-7129], which calls for 
a generic boosting algorithm.

While this JIRA is specific to gradient boosted trees, it is still affected by 
the overall boosting architecture. I've got some code that implements the 
terminal node refinements in the MLlib implementation, but I suspect that there 
might be some resistance to changing MLlib's implementation. I can continue 
implementing this in MLlib if we decide that is the route we'd like to take. 
Otherwise, I think this work needs to wait until GBTs are moved to the ML 
package.

 Refine Tree Predictions in Gradient Boosting to Improve Prediction Accuracy.
 

 Key: SPARK-4240
 URL: https://issues.apache.org/jira/browse/SPARK-4240
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Sung Chung

 The gradient boosting as currently implemented estimates the loss-gradient in 
 each iteration using regression trees. At every iteration, the regression 
 trees are trained/split to minimize predicted gradient variance. 
 Additionally, the terminal node predictions are computed to minimize the 
 prediction variance.
 However, such predictions won't be optimal for loss functions other than the 
 mean-squared error. The TreeBoosting refinement can help mitigate this issue 
 by modifying terminal node prediction values so that those predictions would 
 directly minimize the actual loss function. Although this still doesn't 
 change the fact that the tree splits were done through variance reduction, it 
 should still lead to improvement in gradient estimations, and thus better 
 performance.
 The details of this can be found in the R vignette. This paper also shows how 
 to refine the terminal node predictions.
 http://www.saedsayad.com/docs/gbm2.pdf
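
In standard TreeBoost notation (a reference sketch, not text from the ticket), the refinement replaces each terminal region's prediction with the value that minimizes the actual loss being optimized:

{noformat}
\gamma_{jm} = \arg\min_{\gamma} \sum_{x_i \in R_{jm}} L\big(y_i,\, F_{m-1}(x_i) + \gamma\big)
F_m(x) = F_{m-1}(x) + \nu \sum_j \gamma_{jm} \, \mathbf{1}(x \in R_{jm})
{noformat}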






[jira] [Created] (SPARK-10320) Support new topic subscriptions without requiring restart of the streaming context

2015-08-27 Thread Sudarshan Kadambi (JIRA)
Sudarshan Kadambi created SPARK-10320:
-

 Summary: Support new topic subscriptions without requiring restart 
of the streaming context
 Key: SPARK-10320
 URL: https://issues.apache.org/jira/browse/SPARK-10320
 Project: Spark
  Issue Type: New Feature
  Components: Streaming
Reporter: Sudarshan Kadambi


Spark Streaming lacks the ability to subscribe to new topics or unsubscribe 
from current ones once the streaming context has been started. Restarting the 
streaming context increases the latency of update handling.

Consider a streaming application subscribed to n topics. Let's say one of the 
topics is no longer needed in streaming analytics and hence should be dropped. 
We could do this by stopping the streaming context, removing that topic from 
the topic list, and restarting the streaming context. Since with some DStreams, 
such as DirectKafkaStream, the per-partition offsets are maintained by Spark, 
we should be able to resume uninterrupted (I think?) from where we left off 
with a minor delay. However, in instances where expensive state initialization 
(from an external datastore) may be needed for the datasets published to all 
topics before streaming updates can be applied to them, it is more convenient to 
subscribe or unsubscribe to only the incremental changes to the topic list. 
Without such a feature, updates go unprocessed for longer than they need to be, 
affecting QoS.
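
For reference, a hedged sketch of what the restart-based workaround looks like today with the Kafka direct stream (the broker address, topic names, and batch interval are placeholders; this is only an illustration of the restart cost, not a recommended pattern):

{code}
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

object TopicRestartSketch {
  // Build and start a streaming context subscribed to the given topic set.
  def startFor(topics: Set[String]): StreamingContext = {
    val ssc = new StreamingContext(new SparkConf().setAppName("topics"), Seconds(10))
    val kafkaParams = Map("metadata.broker.list" -> "broker:9092")
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)
    stream.foreachRDD(rdd => println(s"batch records: ${rdd.count()}"))
    ssc.start()
    ssc
  }

  def main(args: Array[String]): Unit = {
    var ssc = startFor(Set("orders", "clicks"))
    // Dropping "clicks" currently means a full stop/rebuild -- the restart
    // this JIRA would like to avoid.
    ssc.stop(stopSparkContext = true, stopGracefully = true)
    ssc = startFor(Set("orders"))
    ssc.awaitTermination()
  }
}
{code}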



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5741) Support the path contains comma in HiveContext

2015-08-27 Thread koert kuipers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14717053#comment-14717053
 ] 

koert kuipers commented on SPARK-5741:
--

I realize I am late to the party, but...

By doing this you are losing very important functionality: passing in 
multiple input paths comma-separated. Globs only cover a very limited subset of 
what you can do with multiple paths. For example, selecting partitions (by day) 
for the last 30 days cannot be expressed with a glob.

So you are giving up major functionality just to be able to pass in a character 
that people would generally advise should not be part of a filename anyhow? 
Doesn't sound like a good idea to me.
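
To make the last-30-days example concrete, a small sketch (the dataset layout and paths are hypothetical): the list of daily partition paths is trivial to build and join with commas, but no single glob selects exactly those 30 directories.

{code}
import java.time.LocalDate

object LastThirtyDaysPaths {
  def main(args: Array[String]): Unit = {
    val today = LocalDate.of(2015, 8, 27)
    // One path per daily partition for the last 30 days.
    val paths = (0 until 30).map(d => s"/data/events/day=${today.minusDays(d)}")
    // The old comma-separated style of multi-path input:
    println(paths.mkString(","))
  }
}
{code}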

 Support the path contains comma in HiveContext
 --

 Key: SPARK-5741
 URL: https://issues.apache.org/jira/browse/SPARK-5741
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Yadong Qi
Assignee: Yadong Qi
 Fix For: 1.3.0


 When running ```select * from nzhang_part where hr = 'file,';```, it throws 
 the exception ```java.lang.IllegalArgumentException: Can not create a Path from 
 an empty string``` because the HDFS path contains a comma and 
 FileInputFormat.setInputPaths splits paths on commas.
 ###
 SQL
 ###
 set hive.merge.mapfiles=true; 
 set hive.merge.mapredfiles=true;
 set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
 set hive.exec.dynamic.partition=true;
 set hive.exec.dynamic.partition.mode=nonstrict;
 create table nzhang_part like srcpart;
 insert overwrite table nzhang_part partition (ds='2010-08-15', hr) select 
 key, value, hr from srcpart where ds='2008-04-08';
 insert overwrite table nzhang_part partition (ds='2010-08-15', hr=11) select 
 key, value from srcpart where ds='2008-04-08';
 insert overwrite table nzhang_part partition (ds='2010-08-15', hr)  
 select * from (
 select key, value, hr from srcpart where ds='2008-04-08'
 union all
 select '1' as key, '1' as value, 'file,' as hr from src limit 1) s;
 select * from nzhang_part where hr = 'file,';
 ###
 Error log
 ###
 15/02/10 14:33:16 ERROR SparkSQLDriver: Failed in [select * from nzhang_part 
 where hr = 'file,']
 java.lang.IllegalArgumentException: Can not create a Path from an empty string
 at org.apache.hadoop.fs.Path.checkPathArg(Path.java:127)
 at org.apache.hadoop.fs.Path.init(Path.java:135)
 at 
 org.apache.hadoop.util.StringUtils.stringToPath(StringUtils.java:241)
 at 
 org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:400)
 at 
 org.apache.spark.sql.hive.HadoopTableReader$.initializeLocalJobConfFunc(TableReader.scala:251)
 at 
 org.apache.spark.sql.hive.HadoopTableReader$$anonfun$11.apply(TableReader.scala:229)
 at 
 org.apache.spark.sql.hive.HadoopTableReader$$anonfun$11.apply(TableReader.scala:229)
 at 
 org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:172)
 at 
 org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:172)
 at scala.Option.map(Option.scala:145)
 at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:172)
 at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:196)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:223)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:221)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:221)
 at 
 org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:223)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:221)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:221)
 at 
 org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:223)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:221)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:221)
 at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
 at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
 at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at 

[jira] [Commented] (SPARK-6817) DataFrame UDFs in R

2015-08-27 Thread Indrajit (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14717079#comment-14717079
 ] 

Indrajit  commented on SPARK-6817:
--

Here are some suggestions on the proposed API. If the idea is to keep the API 
close to R's current primitives, we should avoid 
introducing too many new keywords. E.g., dapplyCollect can be expressed as 
collect(dapply(...)). Since collect already exists in Spark,
and R users are comfortable with the syntax as part of dplyr, we should reuse 
the keyword instead of introducing a new function dapplyCollect. 
Relying on existing syntax will reduce the learning curve for users. Was 
performance the primary intent to introduce dapplyCollect instead of
collect(dapply(...))?

Similarly, can we do away with gapply and gapplyCollect and express them using 
dapply? In R, the function split provides grouping 
(https://stat.ethz.ch/R-manual/R-devel/library/base/html/split.html). One 
should be able to implement split using groupBy in Spark.
gapply can then be expressed in terms of dapply and split, and gapplyCollect 
will become collect(dapply(..split..)). 
Here is a simple example that uses split and lapply in R:

df <- data.frame(city=c("A","B","A","D"), age=c(10,12,23,5))
print(df)
s <- split(df$age, df$city)
lapply(s, mean)

 DataFrame UDFs in R
 ---

 Key: SPARK-6817
 URL: https://issues.apache.org/jira/browse/SPARK-6817
 Project: Spark
  Issue Type: New Feature
  Components: SparkR, SQL
Reporter: Shivaram Venkataraman

 This depends on some internal interface of Spark SQL, should be done after 
 merging into Spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5741) Support the path contains comma in HiveContext

2015-08-27 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14717165#comment-14717165
 ] 

Michael Armbrust commented on SPARK-5741:
-

What format are you trying to read?  There [are still ways to read more than 
one 
file|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L258],
 they just don't rely on brittle string munging anymore.
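
For instance, a minimal sketch of that multi-path style on DataFrameReader (the paths are placeholders; Parquet's reader accepts varargs paths as of 1.4):

{code}
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

object MultiPathReadSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("multi-path").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)

    // Several explicit paths, no comma-joined string munging.
    val df = sqlContext.read.parquet(
      "/data/events/day=2015-08-26",
      "/data/events/day=2015-08-27")
    println(df.count())
    sc.stop()
  }
}
{code}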

 Support the path contains comma in HiveContext
 --

 Key: SPARK-5741
 URL: https://issues.apache.org/jira/browse/SPARK-5741
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Yadong Qi
Assignee: Yadong Qi
 Fix For: 1.3.0


 When running ```select * from nzhang_part where hr = 'file,';```, it throws 
 the exception ```java.lang.IllegalArgumentException: Can not create a Path from 
 an empty string``` because the HDFS path contains a comma and 
 FileInputFormat.setInputPaths splits paths on commas.
 ###
 SQL
 ###
 set hive.merge.mapfiles=true; 
 set hive.merge.mapredfiles=true;
 set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
 set hive.exec.dynamic.partition=true;
 set hive.exec.dynamic.partition.mode=nonstrict;
 create table nzhang_part like srcpart;
 insert overwrite table nzhang_part partition (ds='2010-08-15', hr) select 
 key, value, hr from srcpart where ds='2008-04-08';
 insert overwrite table nzhang_part partition (ds='2010-08-15', hr=11) select 
 key, value from srcpart where ds='2008-04-08';
 insert overwrite table nzhang_part partition (ds='2010-08-15', hr)  
 select * from (
 select key, value, hr from srcpart where ds='2008-04-08'
 union all
 select '1' as key, '1' as value, 'file,' as hr from src limit 1) s;
 select * from nzhang_part where hr = 'file,';
 ###
 Error log
 ###
 15/02/10 14:33:16 ERROR SparkSQLDriver: Failed in [select * from nzhang_part 
 where hr = 'file,']
 java.lang.IllegalArgumentException: Can not create a Path from an empty string
 at org.apache.hadoop.fs.Path.checkPathArg(Path.java:127)
 at org.apache.hadoop.fs.Path.init(Path.java:135)
 at 
 org.apache.hadoop.util.StringUtils.stringToPath(StringUtils.java:241)
 at 
 org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:400)
 at 
 org.apache.spark.sql.hive.HadoopTableReader$.initializeLocalJobConfFunc(TableReader.scala:251)
 at 
 org.apache.spark.sql.hive.HadoopTableReader$$anonfun$11.apply(TableReader.scala:229)
 at 
 org.apache.spark.sql.hive.HadoopTableReader$$anonfun$11.apply(TableReader.scala:229)
 at 
 org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:172)
 at 
 org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:172)
 at scala.Option.map(Option.scala:145)
 at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:172)
 at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:196)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:223)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:221)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:221)
 at 
 org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:223)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:221)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:221)
 at 
 org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:223)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:221)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:221)
 at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
 at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
 at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at scala.collection.immutable.List.foreach(List.scala:318)
 at 
 scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
 at scala.collection.AbstractTraversable.map(Traversable.scala:105)
 at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66)
  

[jira] [Commented] (SPARK-10318) Getting issue in spark connectivity with cassandra

2015-08-27 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14717078#comment-14717078
 ] 

Sean Owen commented on SPARK-10318:
---

This is a Cassandra exception. I don't see that it's traceable to Spark?

 Getting issue in spark connectivity with cassandra
 --

 Key: SPARK-10318
 URL: https://issues.apache.org/jira/browse/SPARK-10318
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 1.4.0
 Environment: Spark on local mode with centos 6.x
Reporter: Poorvi Lashkary
Priority: Minor

 Use case: I have to create a Spark SQL DataFrame from a table in Cassandra 
 using the JDBC driver.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8292) ShortestPaths run with error result

2015-08-27 Thread Anita Tailor (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14717115#comment-14717115
 ] 

Anita Tailor commented on SPARK-8292:
-

Not an issue. It's a directed graph and there is no incoming edge for node 0. 

If we add one incoming edge (5\t0) to the test data mentioned above, it gives the 
following results, which look correct:

(4,Map(0 -> 2))
(0,Map(0 -> 0))
(6,Map())
(2,Map())
(3,Map())
(5,Map(0 -> 1))
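
A hedged sketch of that check with GraphX (local mode, edge list typed inline; it reproduces the run with the extra 5 -> 0 edge and landmark vertex 0, assuming ShortestPaths counts hops along edge direction to the landmark):

{code}
import org.apache.spark.graphx.lib.ShortestPaths
import org.apache.spark.graphx.{Graph, VertexId}
import org.apache.spark.{SparkConf, SparkContext}

object ShortestPathsCheck {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("sp-check").setMaster("local[*]"))
    // Original edge list plus the extra incoming edge 5 -> 0.
    val edges: Seq[(VertexId, VertexId)] = Seq(
      (0L, 2L), (0L, 4L), (2L, 3L), (3L, 6L),
      (4L, 2L), (4L, 5L), (5L, 3L), (5L, 6L), (5L, 0L))
    val graph = Graph.fromEdgeTuples(sc.parallelize(edges), defaultValue = 1)
    // Distances to landmark vertex 0, following edge direction.
    val result = ShortestPaths.run(graph, Seq(0L))
    result.vertices.collect().sortBy(_._1).foreach(println)
    sc.stop()
  }
}
{code}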

 ShortestPaths run with error result
 ---

 Key: SPARK-8292
 URL: https://issues.apache.org/jira/browse/SPARK-8292
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Affects Versions: 1.3.1
 Environment: Ubuntu 64bit 
Reporter: Bruce Chen
Priority: Minor
  Labels: patch
 Attachments: ShortestPaths.patch


 In graphx/lib/ShortestPaths, I ran an example with this input data:
 0\t2
 0\t4
 2\t3
 3\t6
 4\t2
 4\t5
 5\t3
 5\t6
 Then I wrote a function, set point '0' as the source point, and calculated 
 the shortest path from point 0 to the other points; the code looks like this:
 val source: Seq[VertexId] = Seq(0)
 val ss = ShortestPaths.run(graph, source)
 Then I got the result for every vertex's shortest-path value:
 (4,Map())
 (0,Map(0 -> 0))
 (6,Map())
 (3,Map())
 (5,Map())
 (2,Map())
 But the right result should be:
 (4,Map(0 -> 1))
 (0,Map(0 -> 0))
 (6,Map(0 -> 3))
 (3,Map(0 -> 2))
 (5,Map(0 -> 2))
 (2,Map(0 -> 1))
 So I checked the source code of 
 spark/graphx/src/main/scala/org/apache/spark/graphx/lib/ShortestPaths.scala 
 and found a bug.
 The patch is attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10253) Remove Guava dependencies in MLlib java tests

2015-08-27 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-10253.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

 Remove Guava dependencies in MLlib java tests
 -

 Key: SPARK-10253
 URL: https://issues.apache.org/jira/browse/SPARK-10253
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib
Reporter: Feynman Liang
Assignee: Feynman Liang
Priority: Minor
 Fix For: 1.6.0


 Many tests depend on Google Guava's {{Lists.newArrayList}} when 
 {{java.util.Arrays.asList}} could be used instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10257) Remove Guava dependencies in spark.mllib JavaTests

2015-08-27 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-10257.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8451
[https://github.com/apache/spark/pull/8451]

 Remove Guava dependencies in spark.mllib JavaTests
 --

 Key: SPARK-10257
 URL: https://issues.apache.org/jira/browse/SPARK-10257
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Feynman Liang
Assignee: Feynman Liang
Priority: Minor
 Fix For: 1.6.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9890) User guide for CountVectorizer

2015-08-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9890:
---

Assignee: Apache Spark

 User guide for CountVectorizer
 --

 Key: SPARK-9890
 URL: https://issues.apache.org/jira/browse/SPARK-9890
 Project: Spark
  Issue Type: Documentation
  Components: ML
Reporter: Feynman Liang
Assignee: Apache Spark

 SPARK-8703 added a count vectorizer as an ML transformer. We should add an 
 accompanying user guide to {{ml-features}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9890) User guide for CountVectorizer

2015-08-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9890:
---

Assignee: (was: Apache Spark)

 User guide for CountVectorizer
 --

 Key: SPARK-9890
 URL: https://issues.apache.org/jira/browse/SPARK-9890
 Project: Spark
  Issue Type: Documentation
  Components: ML
Reporter: Feynman Liang

 SPARK-8703 added a count vectorizer as an ML transformer. We should add an 
 accompanying user guide to {{ml-features}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6918) Secure HBase with Kerberos does not work over YARN

2015-08-27 Thread LINTE (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14716806#comment-14716806
 ] 

LINTE commented on SPARK-6918:
--

Is this issue really fixed?
I work with secure Hadoop 2.7.1 / HBase 1.0.1 / Spark 1.4.0 / ZooKeeper 3.4.5.

When I run this simple code in yarn-client mode:

--
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.client.Result

val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, "ns:table")
conf.addResource(new Path("/path/to/hbase/hbase-site.xml"))
val rdd = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat], 
classOf[ImmutableBytesWritable], classOf[Result])
rdd.count()
-

I have the following error on my executor:

15/08/27 16:56:37 WARN ipc.AbstractRpcClient: Exception encountered while 
connecting to the server : javax.security.sasl.SaslException: GSS initiate 
failed [Caused by GSSException: No valid credentials provided (Mechanism level: 
Failed to find any Kerberos tgt)]
15/08/27 16:56:37 ERROR ipc.AbstractRpcClient: SASL authentication failed. The 
most likely cause is missing or invalid credentials. Consider 'kinit'.
javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: 
No valid credentials provided (Mechanism level: Failed to find any Kerberos 
tgt)]
at 
com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:212)
at 
org.apache.hadoop.hbase.security.HBaseSaslRpcClient.saslConnect(HBaseSaslRpcClient.java:179)
at 
org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.setupSaslConnection(RpcClientImpl.java:604)
at 
org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.access$600(RpcClientImpl.java:153)
at 
org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection$2.run(RpcClientImpl.java:730)
at 
org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection$2.run(RpcClientImpl.java:727)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at 
org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.setupIOstreams(RpcClientImpl.java:727)
at 
org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.writeRequest(RpcClientImpl.java:880)
at 
org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.tracedWriteRequest(RpcClientImpl.java:849)
at 
org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1173)
at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:216)
at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:300)
at 
org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$BlockingStub.scan(ClientProtos.java:31889)
at 
org.apache.hadoop.hbase.client.ClientSmallScanner$SmallScannerCallable.call(ClientSmallScanner.java:202)
at 
org.apache.hadoop.hbase.client.ClientSmallScanner$SmallScannerCallable.call(ClientSmallScanner.java:181)
at 
org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:126)
at 
org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:310)
at 
org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:291)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: GSSException: No valid credentials provided (Mechanism level: Failed 
to find any Kerberos tgt)
at 
sun.security.jgss.krb5.Krb5InitCredential.getInstance(Krb5InitCredential.java:147)
at 
sun.security.jgss.krb5.Krb5MechFactory.getCredentialElement(Krb5MechFactory.java:121)
at 
sun.security.jgss.krb5.Krb5MechFactory.getMechanismContext(Krb5MechFactory.java:187)
at 
sun.security.jgss.GSSManagerImpl.getMechanismContext(GSSManagerImpl.java:223)
at 
sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:212)
at 
sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:179)
at 
com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:193)
... 24 more


 Secure HBase with Kerberos does not work over YARN
 --

 Key: SPARK-6918
 URL: 

[jira] [Commented] (SPARK-9890) User guide for CountVectorizer

2015-08-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14716836#comment-14716836
 ] 

Apache Spark commented on SPARK-9890:
-

User 'hhbyyh' has created a pull request for this issue:
https://github.com/apache/spark/pull/8487

 User guide for CountVectorizer
 --

 Key: SPARK-9890
 URL: https://issues.apache.org/jira/browse/SPARK-9890
 Project: Spark
  Issue Type: Documentation
  Components: ML
Reporter: Feynman Liang

 SPARK-8703 added a count vectorizer as an ML transformer. We should add an 
 accompanying user guide to {{ml-features}}.
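
For the guide, a minimal usage sketch of the transformer (Spark 1.5 ML API as I understand it; the column names and tiny corpus below are made up for illustration):

{code}
import org.apache.spark.ml.feature.CountVectorizer
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

object CountVectorizerSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("cv").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    val df = sqlContext.createDataFrame(Seq(
      (0, Array("a", "b", "c")),
      (1, Array("a", "b", "b", "c", "a"))
    )).toDF("id", "words")

    // Learn a vocabulary from the corpus, then map each document to a
    // sparse vector of token counts over that vocabulary.
    val model = new CountVectorizer()
      .setInputCol("words")
      .setOutputCol("features")
      .setVocabSize(3)
      .fit(df)
    model.transform(df).show()
    sc.stop()
  }
}
{code}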



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10295) Dynamic allocation in Mesos does not release when RDDs are cached

2015-08-27 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14716992#comment-14716992
 ] 

Marcelo Vanzin commented on SPARK-10295:


In 1.5 executors with cached data are not released by default (and there's a 
separate timeout configuration for them). I think we just forgot to delete the 
log message.
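
For reference, a minimal sketch of the relevant dynamic-allocation settings as I understand them in 1.5 (the cached-executor timeout property name is my reading of the configuration; verify against the docs before relying on it):

{code}
import org.apache.spark.SparkConf

object CachedExecutorTimeoutSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("dyn-alloc")
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.shuffle.service.enabled", "true")
      // Idle executors without cached blocks are released after this timeout.
      .set("spark.dynamicAllocation.executorIdleTimeout", "60s")
      // Executors holding cached blocks have their own timeout; by default it
      // is effectively infinite, so they are kept (the behavior described above).
      .set("spark.dynamicAllocation.cachedExecutorIdleTimeout", "300s")
    conf.getAll.sorted.foreach { case (k, v) => println(s"$k=$v") }
  }
}
{code}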

 Dynamic allocation in Mesos does not release when RDDs are cached
 -

 Key: SPARK-10295
 URL: https://issues.apache.org/jira/browse/SPARK-10295
 Project: Spark
  Issue Type: Question
  Components: Mesos
Affects Versions: 1.5.0
 Environment: Spark 1.5.0 RC1
 Centos 6
 java 7 oracle
Reporter: Hans van den Bogert
Priority: Minor

 When running spark in coarse grained mode with shuffle service and dynamic 
 allocation, the driver does not release executors if a dataset is cached.
 The console output OTOH shows:
  15/08/26 17:29:58 WARN SparkContext: Dynamic allocation currently does not 
  support cached RDDs. Cached data for RDD 9 will be lost when executors are 
  removed.
  However, after the default timeout of 1m, executors are not released. When I perform 
  the same initial setup (loading data, etc.) but without caching, the executors 
  are released.
  Is this intended behaviour?
  If so, the console warning is misleading. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10317) start-history-server.sh CLI parsing incompatible with HistoryServer's arg parsing

2015-08-27 Thread Steve Loughran (JIRA)
Steve Loughran created SPARK-10317:
--

 Summary: start-history-server.sh CLI parsing incompatible with 
HistoryServer's arg parsing
 Key: SPARK-10317
 URL: https://issues.apache.org/jira/browse/SPARK-10317
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.1
Reporter: Steve Loughran


The history server has its argument parsing class in 
{{HistoryServerArguments}}. However, this doesn't get involved in the 
{{start-history-server.sh}} codepath, where the {{$0}} arg is assigned to 
{{spark.history.fs.logDirectory}} and all other arguments are discarded (e.g. 
{{--property-file}}).

This stops the other options from being usable from this script.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10318) Getting issue in spark connectivity with cassandra

2015-08-27 Thread Poorvi Lashkary (JIRA)
Poorvi Lashkary created SPARK-10318:
---

 Summary: Getting issue in spark connectivity with cassandra
 Key: SPARK-10318
 URL: https://issues.apache.org/jira/browse/SPARK-10318
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 1.4.0
 Environment: Spark on local mode with centos 6.x
Reporter: Poorvi Lashkary
Priority: Minor


Use case: I have to create a Spark SQL DataFrame from a table in Cassandra using 
the JDBC driver.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5741) Support the path contains comma in HiveContext

2015-08-27 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14717258#comment-14717258
 ] 

Michael Armbrust commented on SPARK-5741:
-

It was originally just Parquet that supported more than one file, but now 
all HadoopFsRelations should (which covers all but CSV, and we should upgrade 
that library too). I would be in favor of generalizing this support for at 
least these sources, given the following constraints:

 - We must keep source/binary compatibility.
 - We should give good errors when the source does not support this feature.
 - For consistency, I'd prefer it if we could just add a {{load(path: String*)}} 
(but I'm not sure if this is possible given the above).
 - {{paths(path: *)}} is okay, but I think I'd prefer that it not be the 
terminal operator.

 Support the path contains comma in HiveContext
 --

 Key: SPARK-5741
 URL: https://issues.apache.org/jira/browse/SPARK-5741
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Yadong Qi
Assignee: Yadong Qi
 Fix For: 1.3.0


 When running ```select * from nzhang_part where hr = 'file,';```, it throws 
 the exception ```java.lang.IllegalArgumentException: Can not create a Path from 
 an empty string``` because the HDFS path contains a comma and 
 FileInputFormat.setInputPaths splits paths on commas.
 ###
 SQL
 ###
 set hive.merge.mapfiles=true; 
 set hive.merge.mapredfiles=true;
 set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
 set hive.exec.dynamic.partition=true;
 set hive.exec.dynamic.partition.mode=nonstrict;
 create table nzhang_part like srcpart;
 insert overwrite table nzhang_part partition (ds='2010-08-15', hr) select 
 key, value, hr from srcpart where ds='2008-04-08';
 insert overwrite table nzhang_part partition (ds='2010-08-15', hr=11) select 
 key, value from srcpart where ds='2008-04-08';
 insert overwrite table nzhang_part partition (ds='2010-08-15', hr)  
 select * from (
 select key, value, hr from srcpart where ds='2008-04-08'
 union all
 select '1' as key, '1' as value, 'file,' as hr from src limit 1) s;
 select * from nzhang_part where hr = 'file,';
 ###
 Error log
 ###
 15/02/10 14:33:16 ERROR SparkSQLDriver: Failed in [select * from nzhang_part 
 where hr = 'file,']
 java.lang.IllegalArgumentException: Can not create a Path from an empty string
 at org.apache.hadoop.fs.Path.checkPathArg(Path.java:127)
 at org.apache.hadoop.fs.Path.init(Path.java:135)
 at 
 org.apache.hadoop.util.StringUtils.stringToPath(StringUtils.java:241)
 at 
 org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:400)
 at 
 org.apache.spark.sql.hive.HadoopTableReader$.initializeLocalJobConfFunc(TableReader.scala:251)
 at 
 org.apache.spark.sql.hive.HadoopTableReader$$anonfun$11.apply(TableReader.scala:229)
 at 
 org.apache.spark.sql.hive.HadoopTableReader$$anonfun$11.apply(TableReader.scala:229)
 at 
 org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:172)
 at 
 org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:172)
 at scala.Option.map(Option.scala:145)
 at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:172)
 at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:196)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:223)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:221)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:221)
 at 
 org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:223)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:221)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:221)
 at 
 org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:223)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:221)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:221)
 at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
 at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
 at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at 
 

[jira] [Commented] (SPARK-10319) ALS training using PySpark throws a StackOverflowError

2015-08-27 Thread Velu nambi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14717270#comment-14717270
 ] 

Velu nambi commented on SPARK-10319:


bq. do you see evidence of checkpointing in the logs? 

Yes, I see a few files created in the Checkpoint directory.

 ALS training using PySpark throws a StackOverflowError
 --

 Key: SPARK-10319
 URL: https://issues.apache.org/jira/browse/SPARK-10319
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.4.1
 Environment: Windows 10, spark - 1.4.1,
Reporter: Velu nambi

 When attempting to train a machine learning model using ALS in Spark's MLlib 
 (1.4) on Windows, PySpark always terminates with a StackOverflowError. I 
 tried adding checkpointing as described in 
 http://stackoverflow.com/a/31484461/36130 -- it doesn't seem to help.
 Here's the training code and stack trace:
 {code:none}
 ranks = [8, 12]
 lambdas = [0.1, 10.0]
 numIters = [10, 20]
 bestModel = None
 bestValidationRmse = float("inf")
 bestRank = 0
 bestLambda = -1.0
 bestNumIter = -1
 for rank, lmbda, numIter in itertools.product(ranks, lambdas, numIters):
     ALS.checkpointInterval = 2
     model = ALS.train(training, rank, numIter, lmbda)
     validationRmse = computeRmse(model, validation, numValidation)
     if validationRmse < bestValidationRmse:
         bestModel = model
         bestValidationRmse = validationRmse
         bestRank = rank
         bestLambda = lmbda
         bestNumIter = numIter
 testRmse = computeRmse(bestModel, test, numTest)
 {code}
 Stacktrace:
 15/08/27 02:02:58 ERROR Executor: Exception in task 3.0 in stage 56.0 (TID 
 127)
 java.lang.StackOverflowError
 at java.io.ObjectInputStream$BlockDataInputStream.readInt(Unknown Source)
 at java.io.ObjectInputStream.readHandle(Unknown Source)
 at java.io.ObjectInputStream.readClassDesc(Unknown Source)
 at java.io.ObjectInputStream.readOrdinaryObject(Unknown Source)
 at java.io.ObjectInputStream.readObject0(Unknown Source)
 at java.io.ObjectInputStream.defaultReadFields(Unknown Source)
 at java.io.ObjectInputStream.readSerialData(Unknown Source)
 at java.io.ObjectInputStream.readOrdinaryObject(Unknown Source)
 at java.io.ObjectInputStream.readObject0(Unknown Source)
 at java.io.ObjectInputStream.defaultReadFields(Unknown Source)
 at java.io.ObjectInputStream.readSerialData(Unknown Source)
 at java.io.ObjectInputStream.readOrdinaryObject(Unknown Source)
 at java.io.ObjectInputStream.readObject0(Unknown Source)
 at java.io.ObjectInputStream.defaultReadFields(Unknown Source)
 at java.io.ObjectInputStream.readSerialData(Unknown Source)
 at java.io.ObjectInputStream.readOrdinaryObject(Unknown Source)
 at java.io.ObjectInputStream.readObject0(Unknown Source)
 at java.io.ObjectInputStream.readObject(Unknown Source)
 at scala.collection.immutable.$colon$colon.readObject(List.scala:362)
 at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
 at java.lang.reflect.Method.invoke(Unknown Source)
 at java.io.ObjectStreamClass.invokeReadObject(Unknown Source)
 at java.io.ObjectInputStream.readSerialData(Unknown Source)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9148) User-facing documentation for NaN handling semantics

2015-08-27 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-9148.
-
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 8441
[https://github.com/apache/spark/pull/8441]

 User-facing documentation for NaN handling semantics
 

 Key: SPARK-9148
 URL: https://issues.apache.org/jira/browse/SPARK-9148
 Project: Spark
  Issue Type: Technical task
  Components: Documentation, SQL
Reporter: Josh Rosen
Priority: Critical
 Fix For: 1.5.0


 Once we've finalized our NaN changes for Spark 1.5, we need to create 
 user-facing documentation to explain our chosen semantics.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9316) Add support for filtering using `[` (synonym for filter / select)

2015-08-27 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14717297#comment-14717297
 ] 

Felix Cheung commented on SPARK-9316:
-

https://issues.apache.org/jira/browse/SPARK-10322

 Add support for filtering using `[` (synonym for filter / select)
 -

 Key: SPARK-9316
 URL: https://issues.apache.org/jira/browse/SPARK-9316
 Project: Spark
  Issue Type: Sub-task
  Components: SparkR
Reporter: Shivaram Venkataraman
Assignee: Felix Cheung
 Fix For: 1.6.0, 1.5.1


 Will help us support queries of the form 
 {code}
 air[air$UniqueCarrier %in% c("UA", "HA"), c(1,2,3,5:9)]
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10252) Update Spark SQL Programming Guide for Spark 1.5

2015-08-27 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-10252.
--
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 8441
[https://github.com/apache/spark/pull/8441]

 Update Spark SQL Programming Guide for Spark 1.5
 

 Key: SPARK-10252
 URL: https://issues.apache.org/jira/browse/SPARK-10252
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 1.5.0
Reporter: Michael Armbrust
Assignee: Michael Armbrust
Priority: Critical
 Fix For: 1.5.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10322) Column %in% function is not working

2015-08-27 Thread Felix Cheung (JIRA)
Felix Cheung created SPARK-10322:


 Summary: Column %in% function is not working
 Key: SPARK-10322
 URL: https://issues.apache.org/jira/browse/SPARK-10322
 Project: Spark
  Issue Type: Bug
  Components: R
Affects Versions: 1.5.0
Reporter: Felix Cheung


$ sparkR

...

> df$age
Column age
> filter(df$age %in% c(19, 30))
Error in filter(df$age %in% c(19, 30)) :
  error in evaluating the argument 'x' in selecting a method for function 
'filter': Error in match(x, table, nomatch = 0L) :
  'match' requires vector arguments
> filter(df$age %in% c(19, 30))
Error in filter(df$age %in% c(19, 30)) :
  error in evaluating the argument 'x' in selecting a method for function 
'filter': Error in match(x, table, nomatch = 0L) :
  'match' requires vector arguments
> filter(df$age %in% 30)
Error in filter(df$age %in% 30) :
  error in evaluating the argument 'x' in selecting a method for function 
'filter': Error in match(x, table, nomatch = 0L) :
  'match' requires vector arguments




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10320) Support new topic subscriptions without requiring restart of the streaming context

2015-08-27 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14717342#comment-14717342
 ] 

Cody Koeninger commented on SPARK-10320:


As I said on the list, the best way to deal with this currently is start a new 
app with your new code, before stopping the old app.

In terms of a potential feature addition, I think there are a number of 
questions that would need to be cleared up... e.g.

- when would you change topics?  During a StreamingListener onBatchCompleted 
handler?  From a separate thread?

- when adding a topic, what would the expectations around the starting offset be?  
As in the current API: provide explicit offsets per partition, start at the 
beginning, or start at the end?

- if you add partitions for topics that currently exist, and specify a starting 
offset that's different from where the job is currently, what would the 
expectation be?
- if you add, later remove, then later re-add a topic, what would the 
expectation regarding saved checkpoints be?

 Support new topic subscriptions without requiring restart of the streaming 
 context
 --

 Key: SPARK-10320
 URL: https://issues.apache.org/jira/browse/SPARK-10320
 Project: Spark
  Issue Type: New Feature
  Components: Streaming
Reporter: Sudarshan Kadambi

 Spark Streaming lacks the ability to subscribe to new topics or unsubscribe 
 from current ones once the streaming context has been started. Restarting the 
 streaming context increases the latency of update handling.
 Consider a streaming application subscribed to n topics. Let's say 1 of the 
 topics is no longer needed in streaming analytics and hence should be 
 dropped. We could do this by stopping the streaming context, removing that 
 topic from the topic list and restarting the streaming context. Since with 
 some DStreams such as DirectKafkaStream, the per-partition offsets are 
 maintained by Spark, we should be able to resume uninterrupted (I think?) 
 from where we left off with a minor delay. However, in instances where 
 expensive state initialization (from an external datastore) may be needed for 
 datasets published to all topics, before streaming updates can be applied to 
 it, it is more convenient to only subscribe or unsubscribe to the incremental 
 changes to the topic list. Without such a feature, updates go unprocessed for 
 longer than they need to be, thus affecting QoS.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4105) FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle

2015-08-27 Thread Marcel Mitsuto (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14717374#comment-14717374
 ] 

Marcel Mitsuto commented on SPARK-4105:
---

From the Spark UI stage table:

Stage: mapPartitions at Exchange.scala:60
org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:635)
org.apache.spark.sql.execution.Exchange$$anonfun$execute$1.apply(Exchange.scala:60)
org.apache.spark.sql.execution.Exchange$$anonfun$execute$1.apply(Exchange.scala:49)
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:46)
org.apache.spark.sql.execution.Exchange.execute(Exchange.scala:48)
org.apache.spark.sql.execution.joins.HashOuterJoin.execute(HashOuterJoin.scala:188)
org.apache.spark.sql.execution.Exchange$$anonfun$execute$1.apply(Exchange.scala:60)
org.apache.spark.sql.execution.Exchange$$anonfun$execute$1.apply(Exchange.scala:49)
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:46)
org.apache.spark.sql.execution.Exchange.execute(Exchange.scala:48)
org.apache.spark.sql.execution.joins.HashOuterJoin.execute(HashOuterJoin.scala:188)
org.apache.spark.sql.execution.Exchange$$anonfun$execute$1.apply(Exchange.scala:60)
org.apache.spark.sql.execution.Exchange$$anonfun$execute$1.apply(Exchange.scala:49)
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:46)
org.apache.spark.sql.execution.Exchange.execute(Exchange.scala:48)
org.apache.spark.sql.execution.joins.HashOuterJoin.execute(HashOuterJoin.scala:188)
org.apache.spark.sql.execution.Project.execute(basicOperators.scala:40)
org.apache.spark.sql.parquet.InsertIntoParquetTable.execute(ParquetTableOperations.scala:265)
org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:1099)
org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:1099)

Submitted 2015/08/27 19:32:12, duration 2 s, tasks 4/2000 (29 failed), 61.8 MB / 6.6 MB
Failure: org.apache.spark.shuffle.FetchFailedException: FAILED_TO_UNCOMPRESS(5)

 FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based 
 shuffle
 -

 Key: SPARK-4105
 URL: https://issues.apache.org/jira/browse/SPARK-4105
 Project: Spark
  Issue Type: Bug
  Components: Shuffle, Spark Core
Affects Versions: 1.2.0, 1.2.1, 1.3.0
Reporter: Josh Rosen
Assignee: Josh Rosen
Priority: Blocker
 Attachments: JavaObjectToSerialize.java, 
 SparkFailedToUncompressGenerator.scala


 We have seen non-deterministic {{FAILED_TO_UNCOMPRESS(5)}} errors during 
 shuffle read.  Here's a sample stacktrace from an executor:
 {code}
 14/10/23 18:34:11 ERROR Executor: Exception in task 1747.3 in stage 11.0 (TID 
 33053)
 java.io.IOException: FAILED_TO_UNCOMPRESS(5)
   at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:78)
   at org.xerial.snappy.SnappyNative.rawUncompress(Native Method)
   at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:391)
   at org.xerial.snappy.Snappy.uncompress(Snappy.java:427)
   at 
 org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:127)
   at 
 org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88)
   at org.xerial.snappy.SnappyInputStream.init(SnappyInputStream.java:58)
   at 
 org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:128)
   at 
 org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1090)
   at 
 org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:116)
   at 
 org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:115)
   at 
 org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:243)
   at 
 org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:52)
   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
   at 
 org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)
   at 
 org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
   at 
 org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:129)
   at 
 org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)
   at 
 org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158)
   at 
 scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
   at 
 

[jira] [Comment Edited] (SPARK-9316) Add support for filtering using `[` (synonym for filter / select)

2015-08-27 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14717297#comment-14717297
 ] 

Felix Cheung edited comment on SPARK-9316 at 8/27/15 6:58 PM:
--

(updated)
Hi,

There shouldn't be any change to filter().

The following filter worked previously for me: 
subsetdf <- filter(df, "age in (19, 30)")

I tried this just now and it worked (with Shivaram's fix).


was (Author: felixcheung):
https://issues.apache.org/jira/browse/SPARK-10322

 Add support for filtering using `[` (synonym for filter / select)
 -

 Key: SPARK-9316
 URL: https://issues.apache.org/jira/browse/SPARK-9316
 Project: Spark
  Issue Type: Sub-task
  Components: SparkR
Reporter: Shivaram Venkataraman
Assignee: Felix Cheung
 Fix For: 1.6.0, 1.5.1


 Will help us support queries of the form 
 {code}
 air[air$UniqueCarrier %in% c("UA", "HA"), c(1,2,3,5:9)]
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10315) remove document on spark.akka.failure-detector.threshold

2015-08-27 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-10315.
---
   Resolution: Fixed
Fix Version/s: 1.5.1
   1.6.0

Issue resolved by pull request 8483
[https://github.com/apache/spark/pull/8483]

 remove document on spark.akka.failure-detector.threshold
 

 Key: SPARK-10315
 URL: https://issues.apache.org/jira/browse/SPARK-10315
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Reporter: Nan Zhu
Priority: Minor
 Fix For: 1.6.0, 1.5.1


 This parameter is no longer used, and there is a mistake in the current 
 documentation: it should be 'akka.remote.watch-failure-detector.threshold'.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10322) Column %in% function is not exported

2015-08-27 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-10322:
-
Summary: Column %in% function is not exported  (was: Column %in% function 
is not working)

 Column %in% function is not exported
 

 Key: SPARK-10322
 URL: https://issues.apache.org/jira/browse/SPARK-10322
 Project: Spark
  Issue Type: Bug
  Components: R
Affects Versions: 1.5.0
Reporter: Felix Cheung

 $ sparkR
 ...
 > df$age
 Column age
 > filter(df$age %in% c(19, 30))
 Error in filter(df$age %in% c(19, 30)) :
   error in evaluating the argument 'x' in selecting a method for function 
 'filter': Error in match(x, table, nomatch = 0L) :
   'match' requires vector arguments
 > filter(df$age %in% c(19, 30))
 Error in filter(df$age %in% c(19, 30)) :
   error in evaluating the argument 'x' in selecting a method for function 
 'filter': Error in match(x, table, nomatch = 0L) :
   'match' requires vector arguments
 > filter(df$age %in% 30)
 Error in filter(df$age %in% 30) :
   error in evaluating the argument 'x' in selecting a method for function 
 'filter': Error in match(x, table, nomatch = 0L) :
   'match' requires vector arguments



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9316) Add support for filtering using `[` (synonym for filter / select)

2015-08-27 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14717328#comment-14717328
 ] 

Felix Cheung commented on SPARK-9316:
-

As for this,

> subsetdf <- df[age in (19, 30),1:2]
Error in df[age in (19, 30),1:2] : 
object of type 'S4' is not subsettable

I believe this should be `df["age in (19, 30)",1:2]` instead?
I had an iteration of the change to port this, but the code turns out to be 
convoluted, as the character vector can also have something like "age" 
that matches a column.

Please let us know if this is something we should support.

 

 Add support for filtering using `[` (synonym for filter / select)
 -

 Key: SPARK-9316
 URL: https://issues.apache.org/jira/browse/SPARK-9316
 Project: Spark
  Issue Type: Sub-task
  Components: SparkR
Reporter: Shivaram Venkataraman
Assignee: Felix Cheung
 Fix For: 1.6.0, 1.5.1


 Will help us support queries of the form 
 {code}
 air[air$UniqueCarrier %in% c("UA", "HA"), c(1,2,3,5:9)]
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10319) ALS training using PySpark throws a StackOverflowError

2015-08-27 Thread Velu nambi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14717263#comment-14717263
 ] 

Velu nambi edited comment on SPARK-10319 at 8/27/15 6:42 PM:
-

Yes, it does seem similar to SPARK-5955; it works when I reduce the iterations 
to [5,10] (currently set to [10,20]).

Here is a small stack trace from the top of the stack, let me know:

5/08/27 10:35:07 INFO DAGScheduler: Job 12 failed: count at ALS.scala:243, took 
3.083999 s
Traceback (most recent call last):
  File C:\Program Files (x86)\JetBrains\PyCharm Community Edition 
4.5.3\helpers\pydev\pydevd.py, line 2358, in module
globals = debugger.run(setup['file'], None, None, is_module)
  File C:\Program Files (x86)\JetBrains\PyCharm Community Edition 
4.5.3\helpers\pydev\pydevd.py, line 1778, in run
pydev_imports.execfile(file, globals, locals)  # execute the script
  File C:/Users/PycharmProjects/MovieLensALS/MovieLensALS.py, line 129, in 
module
model = ALS.train(training, rank, numIter, lmbda)
  File C:\spark-1.4.1\python\pyspark\mllib\recommendation.py, line 194, in 
train
lambda_, blocks, nonnegative, seed)
  File C:\spark-1.4.1\python\pyspark\mllib\common.py, line 128, in 
callMLlibFunc
return callJavaFunc(sc, api, *args)
  File C:\spark-1.4.1\python\pyspark\mllib\common.py, line 121, in 
callJavaFunc
return _java2py(sc, func(*args))
  File 
C:\Users\PyCharmVirtualEnv\MovieLensALSVirtEnv\lib\site-packages\py4j\java_gateway.py,
 line 813, in __call__
answer, self.gateway_client, self.target_id, self.name)
  File 
C:\Users\PyCharmVirtualEnv\MovieLensALSVirtEnv\lib\site-packages\py4j\protocol.py,
 line 308, in get_return_value
format(target_id, ., name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o145.trainALSModel.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 56.0 failed 1 times, most recent failure: Lost task 0.0 in stage 56.0 
(TID 124, localhost): java.lang.StackOverflowError
at java.io.ObjectInputStream$BlockDataInputStream.readInt(Unknown 
Source)
at java.io.ObjectInputStream.readHandle(Unknown Source)
at java.io.ObjectInputStream.readClassDesc(Unknown Source)
at java.io.ObjectInputStream.readOrdinaryObject(Unknown Source)
at java.io.ObjectInputStream.readObject0(Unknown Source)
at java.io.ObjectInputStream.defaultReadFields(Unknown Source)
at java.io.ObjectInputStream.readSerialData(Unknown Source)
at java.io.ObjectInputStream.readOrdinaryObject(Unknown Source)
at java.io.ObjectInputStream.readObject0(Unknown Source)
at java.io.ObjectInputStream.readObject(Unknown Source)
at scala.collection.immutable.$colon$colon.readObject(List.scala:362)
at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at java.io.ObjectStreamClass.invokeReadObject(Unknown Source)
at java.io.ObjectInputStream.readSerialData(Unknown Source)
at java.io.ObjectInputStream.readOrdinaryObject(Unknown Source)
at java.io.ObjectInputStream.readObject0(Unknown Source)
at java.io.ObjectInputStream.defaultReadFields(Unknown Source)
at java.io.ObjectInputStream.readSerialData(Unknown Source)
at java.io.ObjectInputStream.readOrdinaryObject(Unknown Source)
at java.io.ObjectInputStream.readObject0(Unknown Source)
at java.io.ObjectInputStream.defaultReadFields(Unknown Source)
at java.io.ObjectInputStream.readSerialData(Unknown Source)
at java.io.ObjectInputStream.readOrdinaryObject(Unknown Source)
at java.io.ObjectInputStream.readObject0(Unknown Source)
at java.io.ObjectInputStream.readObject(Unknown Source)
at scala.collection.immutable.$colon$colon.readObject(List.scala:366)
at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at java.io.ObjectStreamClass.invokeReadObject(Unknown Source)
at java.io.ObjectInputStream.readSerialData(Unknown Source)
at java.io.ObjectInputStream.readOrdinaryObject(Unknown Source)
at java.io.ObjectInputStream.readObject0(Unknown Source)
at java.io.ObjectInputStream.defaultReadFields(Unknown Source)
at java.io.ObjectInputStream.readSerialData(Unknown Source)
at java.io.ObjectInputStream.readOrdinaryObject(Unknown Source)
at java.io.ObjectInputStream.readObject0(Unknown Source)
at java.io.ObjectInputStream.defaultReadFields(Unknown Source)
at java.io.ObjectInputStream.readSerialData(Unknown Source)
at 

[jira] [Commented] (SPARK-10319) ALS training using PySpark throws a StackOverflowError

2015-08-27 Thread Velu nambi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14717263#comment-14717263
 ] 

Velu nambi commented on SPARK-10319:


Yes, it does seem similar to SPARK-5955; it works when I reduce the 
iterations to [5,10] (currently set to [10,20]).

Here is the small stack trace from the top of the stack, let me know 

15/08/27 10:35:07 INFO DAGScheduler: Job 12 failed: count at ALS.scala:243, took 
3.083999 s
Traceback (most recent call last):
  File "C:\Program Files (x86)\JetBrains\PyCharm Community Edition 4.5.3\helpers\pydev\pydevd.py", line 2358, in <module>
    globals = debugger.run(setup['file'], None, None, is_module)
  File "C:\Program Files (x86)\JetBrains\PyCharm Community Edition 4.5.3\helpers\pydev\pydevd.py", line 1778, in run
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "C:/Users/PycharmProjects/MovieLensALS/MovieLensALS.py", line 129, in <module>
    model = ALS.train(training, rank, numIter, lmbda)
  File "C:\spark-1.4.1\python\pyspark\mllib\recommendation.py", line 194, in train
    lambda_, blocks, nonnegative, seed)
  File "C:\spark-1.4.1\python\pyspark\mllib\common.py", line 128, in callMLlibFunc
    return callJavaFunc(sc, api, *args)
  File "C:\spark-1.4.1\python\pyspark\mllib\common.py", line 121, in callJavaFunc
    return _java2py(sc, func(*args))
  File "C:\Users\PyCharmVirtualEnv\MovieLensALSVirtEnv\lib\site-packages\py4j\java_gateway.py", line 813, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "C:\Users\PyCharmVirtualEnv\MovieLensALSVirtEnv\lib\site-packages\py4j\protocol.py", line 308, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o145.trainALSModel.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 56.0 failed 1 times, most recent failure: Lost task 0.0 in stage 56.0 
(TID 124, localhost): java.lang.StackOverflowError
at java.io.ObjectInputStream$BlockDataInputStream.readInt(Unknown Source)
at java.io.ObjectInputStream.readHandle(Unknown Source)
at java.io.ObjectInputStream.readClassDesc(Unknown Source)
at java.io.ObjectInputStream.readOrdinaryObject(Unknown Source)
at java.io.ObjectInputStream.readObject0(Unknown Source)
at java.io.ObjectInputStream.defaultReadFields(Unknown Source)
at java.io.ObjectInputStream.readSerialData(Unknown Source)
at java.io.ObjectInputStream.readOrdinaryObject(Unknown Source)
at java.io.ObjectInputStream.readObject0(Unknown Source)
at java.io.ObjectInputStream.readObject(Unknown Source)
at scala.collection.immutable.$colon$colon.readObject(List.scala:362)
at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at java.io.ObjectStreamClass.invokeReadObject(Unknown Source)
at java.io.ObjectInputStream.readSerialData(Unknown Source)
at java.io.ObjectInputStream.readOrdinaryObject(Unknown Source)
at java.io.ObjectInputStream.readObject0(Unknown Source)
at java.io.ObjectInputStream.defaultReadFields(Unknown Source)
at java.io.ObjectInputStream.readSerialData(Unknown Source)
at java.io.ObjectInputStream.readOrdinaryObject(Unknown Source)
at java.io.ObjectInputStream.readObject0(Unknown Source)
at java.io.ObjectInputStream.defaultReadFields(Unknown Source)
at java.io.ObjectInputStream.readSerialData(Unknown Source)
at java.io.ObjectInputStream.readOrdinaryObject(Unknown Source)
at java.io.ObjectInputStream.readObject0(Unknown Source)
at java.io.ObjectInputStream.readObject(Unknown Source)
at scala.collection.immutable.$colon$colon.readObject(List.scala:366)
at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at java.io.ObjectStreamClass.invokeReadObject(Unknown Source)
at java.io.ObjectInputStream.readSerialData(Unknown Source)
at java.io.ObjectInputStream.readOrdinaryObject(Unknown Source)
at java.io.ObjectInputStream.readObject0(Unknown Source)
at java.io.ObjectInputStream.defaultReadFields(Unknown Source)
at java.io.ObjectInputStream.readSerialData(Unknown Source)
at java.io.ObjectInputStream.readOrdinaryObject(Unknown Source)
at java.io.ObjectInputStream.readObject0(Unknown Source)
at java.io.ObjectInputStream.defaultReadFields(Unknown Source)
at java.io.ObjectInputStream.readSerialData(Unknown Source)
at java.io.ObjectInputStream.readOrdinaryObject(Unknown Source)
at 
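The StackOverflowError above is typically caused by deserializing the very deep RDD 
lineage that builds up over many ALS iterations. Besides lowering the iteration count, 
the usual mitigation is periodic checkpointing so the lineage stays shallow. A minimal 
sketch in Scala against the RDD-based spark.mllib ALS; the checkpoint directory and 
input path below are hypothetical:

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.recommendation.{ALS, Rating}

val sc = new SparkContext(new SparkConf().setAppName("ALSCheckpointSketch"))
// Truncating the lineage keeps task (de)serialization shallow enough to
// avoid StackOverflowError at higher iteration counts.
sc.setCheckpointDir("/tmp/spark-checkpoints")  // hypothetical directory

val ratings = sc.textFile("ratings.csv").map { line =>  // hypothetical input
  val Array(user, product, rating) = line.split(',')
  Rating(user.toInt, product.toInt, rating.toDouble)
}

val model = new ALS()
  .setRank(8)
  .setIterations(20)
  .setLambda(0.1)
  .setCheckpointInterval(5)  // checkpoint every 5 iterations
  .run(ratings)
{code}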

[jira] [Assigned] (SPARK-9148) User-facing documentation for NaN handling semantics

2015-08-27 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust reassigned SPARK-9148:
---

Assignee: Michael Armbrust

 User-facing documentation for NaN handling semantics
 

 Key: SPARK-9148
 URL: https://issues.apache.org/jira/browse/SPARK-9148
 Project: Spark
  Issue Type: Technical task
  Components: Documentation, SQL
Reporter: Josh Rosen
Assignee: Michael Armbrust
Priority: Critical
 Fix For: 1.5.0


 Once we've finalized our NaN changes for Spark 1.5, we need to create 
 user-facing documentation to explain our chosen semantics.
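 For reference, a short Scala snippet exercising the semantics the guide would need to
 spell out (NaN equals NaN, NaN values group together, and NaN sorts above every other
 value); treat this summary as an assumption until the final docs land:

 {code}
 import org.apache.spark.{SparkConf, SparkContext}
 import org.apache.spark.sql.SQLContext

 val sc = new SparkContext(new SparkConf().setAppName("NaNSemanticsSketch"))
 val sqlContext = new SQLContext(sc)
 import sqlContext.implicits._

 val df = Seq(Tuple1(1.0), Tuple1(Double.NaN), Tuple1(2.0), Tuple1(Double.NaN)).toDF("v")

 // NaN = NaN evaluates to true in Spark SQL (unlike Java/Scala comparison).
 df.selectExpr("v = cast('NaN' as double)").show()

 // NaN values are treated as the same key when grouping.
 df.groupBy("v").count().show()

 // NaN orders above any other double value.
 df.orderBy("v").show()
 {code}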






[jira] [Commented] (SPARK-10322) Column %in% function is not exported

2015-08-27 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14717305#comment-14717305
 ] 

Felix Cheung commented on SPARK-10322:
--

https://github.com/apache/spark/commit/ad7f0f160be096c0fdae6e6cf7e3b6ba4a606de7
SPARK-10308

 Column %in% function is not exported
 

 Key: SPARK-10322
 URL: https://issues.apache.org/jira/browse/SPARK-10322
 Project: Spark
  Issue Type: Bug
  Components: R
Affects Versions: 1.5.0
Reporter: Felix Cheung

 $ sparkR
 ...
 > df$age
 Column age
 > filter(df$age %in% c(19, 30))
 Error in filter(df$age %in% c(19, 30)) :
   error in evaluating the argument 'x' in selecting a method for function 
 'filter': Error in match(x, table, nomatch = 0L) :
   'match' requires vector arguments
 > filter(df$age %in% c(19, 30))
 Error in filter(df$age %in% c(19, 30)) :
   error in evaluating the argument 'x' in selecting a method for function 
 'filter': Error in match(x, table, nomatch = 0L) :
   'match' requires vector arguments
 > filter(df$age %in% 30)
 Error in filter(df$age %in% 30) :
   error in evaluating the argument 'x' in selecting a method for function 
 'filter': Error in match(x, table, nomatch = 0L) :
   'match' requires vector arguments
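 For comparison only, the Scala DataFrame spelling of the same predicate uses
 Column.isin; this just illustrates the intended semantics of %in% (the bug itself
 appears to be a missing export in the SparkR NAMESPACE), and df is assumed to be an
 existing DataFrame with an age column:

 {code}
 import org.apache.spark.sql.functions.col

 // df: an existing DataFrame with an "age" column (assumed).
 val filtered = df.filter(col("age").isin(19, 30))
 filtered.show()
 {code}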






[jira] [Resolved] (SPARK-10322) Column %in% function is not exported

2015-08-27 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-10322.
--
Resolution: Duplicate

Looks like this was fixed last night.

 Column %in% function is not exported
 

 Key: SPARK-10322
 URL: https://issues.apache.org/jira/browse/SPARK-10322
 Project: Spark
  Issue Type: Bug
  Components: R
Affects Versions: 1.5.0
Reporter: Felix Cheung

 $ sparkR
 ...
 > df$age
 Column age
 > filter(df$age %in% c(19, 30))
 Error in filter(df$age %in% c(19, 30)) :
   error in evaluating the argument 'x' in selecting a method for function 
 'filter': Error in match(x, table, nomatch = 0L) :
   'match' requires vector arguments
 > filter(df$age %in% c(19, 30))
 Error in filter(df$age %in% c(19, 30)) :
   error in evaluating the argument 'x' in selecting a method for function 
 'filter': Error in match(x, table, nomatch = 0L) :
   'match' requires vector arguments
 > filter(df$age %in% 30)
 Error in filter(df$age %in% 30) :
   error in evaluating the argument 'x' in selecting a method for function 
 'filter': Error in match(x, table, nomatch = 0L) :
   'match' requires vector arguments






[jira] [Commented] (SPARK-10295) Dynamic allocation in Mesos does not release when RDDs are cached

2015-08-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14717444#comment-14717444
 ] 

Apache Spark commented on SPARK-10295:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/8489

 Dynamic allocation in Mesos does not release when RDDs are cached
 -

 Key: SPARK-10295
 URL: https://issues.apache.org/jira/browse/SPARK-10295
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, Spark Core
Affects Versions: 1.5.0
 Environment: Spark 1.5.0 RC1
 Centos 6
 java 7 oracle
Reporter: Hans van den Bogert
Priority: Minor

 When running Spark in coarse-grained mode with the shuffle service and dynamic 
 allocation, the driver does not release executors if a dataset is cached.
 The console output, on the other hand, shows:
  15/08/26 17:29:58 WARN SparkContext: Dynamic allocation currently does not 
  support cached RDDs. Cached data for RDD 9 will be lost when executors are 
  removed.
 However, after the default idle timeout of 1m, the executors are not released. When I 
 perform the same initial setup, loading data, etc., but without caching, the executors 
 are released.
 Is this intended behaviour?
 If this is intended behaviour, the console warning is misleading. 
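 For what it's worth, this matches the separate idle timeout that dynamic allocation
 applies to executors holding cached blocks, which defaults to infinity; a hedged
 configuration sketch (property names and values should be checked against the docs
 for the Spark version in use):

 {code}
 import org.apache.spark.{SparkConf, SparkContext}

 val conf = new SparkConf()
   .setAppName("DynamicAllocationSketch")
   .set("spark.dynamicAllocation.enabled", "true")
   .set("spark.shuffle.service.enabled", "true")
   // Idle executors without cached data are released after this timeout (default 60s).
   .set("spark.dynamicAllocation.executorIdleTimeout", "60s")
   // Executors holding cached blocks use this separate timeout, which defaults
   // to infinity -- so they are never released unless it is set explicitly.
   .set("spark.dynamicAllocation.cachedExecutorIdleTimeout", "10min")

 val sc = new SparkContext(conf)
 {code}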






[jira] [Updated] (SPARK-10295) Dynamic allocation in Mesos does not release when RDDs are cached

2015-08-27 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-10295:
--
Component/s: (was: Mesos)
 Spark Core
 Documentation
 Issue Type: Improvement  (was: Question)

OK, let's think of this as a simple log message update. PR coming.

 Dynamic allocation in Mesos does not release when RDDs are cached
 -

 Key: SPARK-10295
 URL: https://issues.apache.org/jira/browse/SPARK-10295
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, Spark Core
Affects Versions: 1.5.0
 Environment: Spark 1.5.0 RC1
 Centos 6
 java 7 oracle
Reporter: Hans van den Bogert
Priority: Minor

 When running Spark in coarse-grained mode with the shuffle service and dynamic 
 allocation, the driver does not release executors if a dataset is cached.
 The console output, on the other hand, shows:
  15/08/26 17:29:58 WARN SparkContext: Dynamic allocation currently does not 
  support cached RDDs. Cached data for RDD 9 will be lost when executors are 
  removed.
 However, after the default idle timeout of 1m, the executors are not released. When I 
 perform the same initial setup, loading data, etc., but without caching, the executors 
 are released.
 Is this intended behaviour?
 If this is intended behaviour, the console warning is misleading. 






[jira] [Updated] (SPARK-10304) Partition discovery does not throw an exception if the dir structure is invalid

2015-08-27 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-10304:
-
Summary: Partition discovery does not throw an exception if the dir 
structure is invalid  (was: Need to add a null check in unwrapperFor in 
HiveInspectors)

 Partition discovery does not throw an exception if the dir structure is invalid
 -

 Key: SPARK-10304
 URL: https://issues.apache.org/jira/browse/SPARK-10304
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Yin Huai
Assignee: Zhan Zhang
Priority: Critical

 {code}
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 26 in 
 stage 57.0 failed 4 times, most recent failure: Lost task 26.3 in stage 57.0 
 (TID 3504, 10.0.195.227): java.lang.NullPointerException
 at 
 org.apache.spark.sql.hive.HiveInspectors$class.unwrapperFor(HiveInspectors.scala:466)
   at 
 org.apache.spark.sql.hive.orc.OrcTableScan.unwrapperFor(OrcRelation.scala:224)
   at 
 org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1$$anonfun$9.apply(OrcRelation.scala:261)
   at 
 org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1$$anonfun$9.apply(OrcRelation.scala:261)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at scala.collection.immutable.List.foreach(List.scala:318)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
   at 
 org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1.apply(OrcRelation.scala:261)
   at 
 org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1.apply(OrcRelation.scala:256)
   at scala.Option.map(Option.scala:145)
   at 
 org.apache.spark.sql.hive.orc.OrcTableScan.org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject(OrcRelation.scala:256)
   at 
 org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$execute$3.apply(OrcRelation.scala:318)
   at 
 org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$execute$3.apply(OrcRelation.scala:316)
   at 
 org.apache.spark.rdd.HadoopRDD$HadoopMapPartitionsWithSplitRDD.compute(HadoopRDD.scala:380)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
   at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
   at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
 {code}






[jira] [Assigned] (SPARK-10321) OrcRelation doesn't override sizeInBytes

2015-08-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10321:


Assignee: Apache Spark

 OrcRelation doesn't override sizeInBytes
 

 Key: SPARK-10321
 URL: https://issues.apache.org/jira/browse/SPARK-10321
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: Cheng Lian
Assignee: Apache Spark
Priority: Critical

 This hurts performance badly because broadcast join can never be enabled.
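 For context: the planner only broadcasts a relation whose sizeInBytes estimate falls
 below spark.sql.autoBroadcastJoinThreshold, and a relation that never overrides
 sizeInBytes inherits a very large default estimate. A rough sketch of the kind of
 override involved, written against the public BaseRelation API rather than the actual
 OrcRelation code:

 {code}
 import org.apache.hadoop.fs.Path
 import org.apache.spark.sql.SQLContext
 import org.apache.spark.sql.sources.BaseRelation
 import org.apache.spark.sql.types.StructType

 class SizedRelation(val sqlContext: SQLContext, path: String, val schema: StructType)
     extends BaseRelation {

   // Estimate the relation size from the files on disk so the planner can
   // compare it against spark.sql.autoBroadcastJoinThreshold.
   override def sizeInBytes: Long = {
     val fsPath = new Path(path)
     val fs = fsPath.getFileSystem(sqlContext.sparkContext.hadoopConfiguration)
     fs.getContentSummary(fsPath).getLength
   }
 }
 {code}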






[jira] [Assigned] (SPARK-10321) OrcRelation doesn't override sizeInBytes

2015-08-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10321:


Assignee: (was: Apache Spark)

 OrcRelation doesn't override sizeInBytes
 

 Key: SPARK-10321
 URL: https://issues.apache.org/jira/browse/SPARK-10321
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: Cheng Lian
Priority: Critical

 This hurts performance badly because broadcast join can never be enabled.






[jira] [Comment Edited] (SPARK-5741) Support the path contains comma in HiveContext

2015-08-27 Thread koert kuipers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14717450#comment-14717450
 ] 

koert kuipers edited comment on SPARK-5741 at 8/27/15 8:22 PM:
---

Given the requirement of source/binary compatibility, I do not think it can be 
done without some kind of string munging.

However, the string munging could be restricted to a separate variable, set with 
paths(path: *) so path does not get polluted. This variable would be 
exclusively for HadoopFsRelationProvider, and an error would be thrown in 
ResolvedDataSource if any other RelationProvider is used with this variable 
set. Also, an error would be thrown if path and paths are both set.

Does this sound reasonable? If not, I will keep looking for other solutions.



was (Author: koert):
Given the requirement of source/binary compatibility, I do not think it can be 
done without some kind of string munging.

However, the string munging could be restricted to a separate variable, set with 
paths(path: *) so path does not get polluted. This variable would be 
exclusively for HadoopFsRelationProvider, and an error would be thrown in 
ResolvedDataSource if any other RelationProvider is used. Also, an error would 
be thrown if path and paths are both set.

Does this sound reasonable? If not, I will keep looking for other solutions.


 Support the path contains comma in HiveContext
 --

 Key: SPARK-5741
 URL: https://issues.apache.org/jira/browse/SPARK-5741
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Yadong Qi
Assignee: Yadong Qi
 Fix For: 1.3.0


 When running ```select * from nzhang_part where hr = 'file,';```, it throws the 
 exception ```java.lang.IllegalArgumentException: Can not create a Path from 
 an empty string```, because the HDFS path contains a comma and 
 FileInputFormat.setInputPaths splits the path by comma.
 ###
 SQL
 ###
 set hive.merge.mapfiles=true; 
 set hive.merge.mapredfiles=true;
 set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
 set hive.exec.dynamic.partition=true;
 set hive.exec.dynamic.partition.mode=nonstrict;
 create table nzhang_part like srcpart;
 insert overwrite table nzhang_part partition (ds='2010-08-15', hr) select 
 key, value, hr from srcpart where ds='2008-04-08';
 insert overwrite table nzhang_part partition (ds='2010-08-15', hr=11) select 
 key, value from srcpart where ds='2008-04-08';
 insert overwrite table nzhang_part partition (ds='2010-08-15', hr)  
 select * from (
 select key, value, hr from srcpart where ds='2008-04-08'
 union all
 select '1' as key, '1' as value, 'file,' as hr from src limit 1) s;
 select * from nzhang_part where hr = 'file,';
 ###
 Error log
 ###
 15/02/10 14:33:16 ERROR SparkSQLDriver: Failed in [select * from nzhang_part 
 where hr = 'file,']
 java.lang.IllegalArgumentException: Can not create a Path from an empty string
 at org.apache.hadoop.fs.Path.checkPathArg(Path.java:127)
 at org.apache.hadoop.fs.Path.<init>(Path.java:135)
 at 
 org.apache.hadoop.util.StringUtils.stringToPath(StringUtils.java:241)
 at 
 org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:400)
 at 
 org.apache.spark.sql.hive.HadoopTableReader$.initializeLocalJobConfFunc(TableReader.scala:251)
 at 
 org.apache.spark.sql.hive.HadoopTableReader$$anonfun$11.apply(TableReader.scala:229)
 at 
 org.apache.spark.sql.hive.HadoopTableReader$$anonfun$11.apply(TableReader.scala:229)
 at 
 org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:172)
 at 
 org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:172)
 at scala.Option.map(Option.scala:145)
 at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:172)
 at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:196)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:223)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:221)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:221)
 at 
 org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:223)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:221)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:221)
 at 
 org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
 at 

[jira] [Commented] (SPARK-10321) OrcRelation doesn't override sizeInBytes

2015-08-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14717453#comment-14717453
 ] 

Apache Spark commented on SPARK-10321:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/8490

 OrcRelation doesn't override sizeInBytes
 

 Key: SPARK-10321
 URL: https://issues.apache.org/jira/browse/SPARK-10321
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: Cheng Lian
Priority: Critical

 This hurts performance badly because broadcast join can never be enabled.






[jira] [Assigned] (SPARK-10321) OrcRelation doesn't override sizeInBytes

2015-08-27 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu reassigned SPARK-10321:
--

Assignee: Davies Liu

 OrcRelation doesn't override sizeInBytes
 

 Key: SPARK-10321
 URL: https://issues.apache.org/jira/browse/SPARK-10321
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: Cheng Lian
Assignee: Davies Liu
Priority: Critical

 This hurts performance badly because broadcast join can never be enabled.






[jira] [Assigned] (SPARK-10295) Dynamic allocation in Mesos does not release when RDDs are cached

2015-08-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10295:


Assignee: (was: Apache Spark)

 Dynamic allocation in Mesos does not release when RDDs are cached
 -

 Key: SPARK-10295
 URL: https://issues.apache.org/jira/browse/SPARK-10295
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, Spark Core
Affects Versions: 1.5.0
 Environment: Spark 1.5.0 RC1
 Centos 6
 java 7 oracle
Reporter: Hans van den Bogert
Priority: Minor

 When running Spark in coarse-grained mode with the shuffle service and dynamic 
 allocation, the driver does not release executors if a dataset is cached.
 The console output, on the other hand, shows:
  15/08/26 17:29:58 WARN SparkContext: Dynamic allocation currently does not 
  support cached RDDs. Cached data for RDD 9 will be lost when executors are 
  removed.
 However, after the default idle timeout of 1m, the executors are not released. When I 
 perform the same initial setup, loading data, etc., but without caching, the executors 
 are released.
 Is this intended behaviour?
 If this is intended behaviour, the console warning is misleading. 






[jira] [Assigned] (SPARK-10295) Dynamic allocation in Mesos does not release when RDDs are cached

2015-08-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10295:


Assignee: Apache Spark

 Dynamic allocation in Mesos does not release when RDDs are cached
 -

 Key: SPARK-10295
 URL: https://issues.apache.org/jira/browse/SPARK-10295
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, Spark Core
Affects Versions: 1.5.0
 Environment: Spark 1.5.0 RC1
 Centos 6
 java 7 oracle
Reporter: Hans van den Bogert
Assignee: Apache Spark
Priority: Minor

 When running Spark in coarse-grained mode with the shuffle service and dynamic 
 allocation, the driver does not release executors if a dataset is cached.
 The console output, on the other hand, shows:
  15/08/26 17:29:58 WARN SparkContext: Dynamic allocation currently does not 
  support cached RDDs. Cached data for RDD 9 will be lost when executors are 
  removed.
 However, after the default idle timeout of 1m, the executors are not released. When I 
 perform the same initial setup, loading data, etc., but without caching, the executors 
 are released.
 Is this intended behaviour?
 If this is intended behaviour, the console warning is misleading. 






[jira] [Commented] (SPARK-9316) Add support for filtering using `[` (synonym for filter / select)

2015-08-27 Thread Deborah Siegel (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14717473#comment-14717473
 ] 

Deborah Siegel commented on SPARK-9316:
---

Now that %in% is exported in the namespace, both the filter and the '[' syntax work 
with it. Thanks [~shivaram].

[~felixcheung] Not apparent to me at the moment why one would need support for 
quoted syntax in the brackets with %in% working.  

btw, although filter works with "age in (19, 30)", the bracket notation with 
the quotes still gets an error both ways:
> subsetdf <- df["age in (19, 30)", 1:2]
Error in df["age in (19, 30)", 1:2] : 
  object of type 'S4' is not subsettable
> subsetdf <- df['age in (19, 30)', 1:2]
Error in df['age in (19, 30)', 1:2] : 
  object of type 'S4' is not subsettable


 Add support for filtering using `[` (synonym for filter / select)
 -

 Key: SPARK-9316
 URL: https://issues.apache.org/jira/browse/SPARK-9316
 Project: Spark
  Issue Type: Sub-task
  Components: SparkR
Reporter: Shivaram Venkataraman
Assignee: Felix Cheung
 Fix For: 1.6.0, 1.5.1


 Will help us support queries of the form 
 {code}
 air[air$UniqueCarrier %in% c("UA", "HA"), c(1,2,3,5:9)]
 {code}






[jira] [Commented] (SPARK-10304) Need to add a null check in unwrapperFor in HiveInspectors

2015-08-27 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14717475#comment-14717475
 ] 

Yin Huai commented on SPARK-10304:
--

Will field be null? I will try to get more info.

 Need to add a null check in unwrapperFor in HiveInspectors
 --

 Key: SPARK-10304
 URL: https://issues.apache.org/jira/browse/SPARK-10304
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Yin Huai
Assignee: Zhan Zhang
Priority: Critical

 {code}
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 26 in 
 stage 57.0 failed 4 times, most recent failure: Lost task 26.3 in stage 57.0 
 (TID 3504, 10.0.195.227): java.lang.NullPointerException
 at 
 org.apache.spark.sql.hive.HiveInspectors$class.unwrapperFor(HiveInspectors.scala:466)
   at 
 org.apache.spark.sql.hive.orc.OrcTableScan.unwrapperFor(OrcRelation.scala:224)
   at 
 org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1$$anonfun$9.apply(OrcRelation.scala:261)
   at 
 org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1$$anonfun$9.apply(OrcRelation.scala:261)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at scala.collection.immutable.List.foreach(List.scala:318)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
   at 
 org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1.apply(OrcRelation.scala:261)
   at 
 org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1.apply(OrcRelation.scala:256)
   at scala.Option.map(Option.scala:145)
   at 
 org.apache.spark.sql.hive.orc.OrcTableScan.org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject(OrcRelation.scala:256)
   at 
 org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$execute$3.apply(OrcRelation.scala:318)
   at 
 org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$execute$3.apply(OrcRelation.scala:316)
   at 
 org.apache.spark.rdd.HadoopRDD$HadoopMapPartitionsWithSplitRDD.compute(HadoopRDD.scala:380)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
   at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
   at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
 {code}






[jira] [Updated] (SPARK-10304) Partition discovery does not throw an exception if the dir structure is invalid

2015-08-27 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-10304:
-
Description: 
I have a dir structure like {{/path/table1/partition_column=1/}}. When I try to 
use {{load("/path/")}}, it works and I get a DF. When I query this DF, if it is 
stored as ORC, I get the following NPE. But if it is Parquet, we can even 
return rows. We should complain to users about the dir structure because 
{{table1}} does not match our expected format.

{code}
org.apache.spark.SparkException: Job aborted due to stage failure: Task 26 in 
stage 57.0 failed 4 times, most recent failure: Lost task 26.3 in stage 57.0 
(TID 3504, 10.0.195.227): java.lang.NullPointerException
at 
org.apache.spark.sql.hive.HiveInspectors$class.unwrapperFor(HiveInspectors.scala:466)
at 
org.apache.spark.sql.hive.orc.OrcTableScan.unwrapperFor(OrcRelation.scala:224)
at 
org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1$$anonfun$9.apply(OrcRelation.scala:261)
at 
org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1$$anonfun$9.apply(OrcRelation.scala:261)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at 
org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1.apply(OrcRelation.scala:261)
at 
org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1.apply(OrcRelation.scala:256)
at scala.Option.map(Option.scala:145)
at 
org.apache.spark.sql.hive.orc.OrcTableScan.org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject(OrcRelation.scala:256)
at 
org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$execute$3.apply(OrcRelation.scala:318)
at 
org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$execute$3.apply(OrcRelation.scala:316)
at 
org.apache.spark.rdd.HadoopRDD$HadoopMapPartitionsWithSplitRDD.compute(HadoopRDD.scala:380)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
{code}

  was:
{code}
org.apache.spark.SparkException: Job aborted due to stage failure: Task 26 in 
stage 57.0 failed 4 times, most recent failure: Lost task 26.3 in stage 57.0 
(TID 3504, 10.0.195.227): java.lang.NullPointerException
at 
org.apache.spark.sql.hive.HiveInspectors$class.unwrapperFor(HiveInspectors.scala:466)
at 
org.apache.spark.sql.hive.orc.OrcTableScan.unwrapperFor(OrcRelation.scala:224)
at 
org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1$$anonfun$9.apply(OrcRelation.scala:261)
at 
org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1$$anonfun$9.apply(OrcRelation.scala:261)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at 
org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1.apply(OrcRelation.scala:261)
at 
org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1.apply(OrcRelation.scala:256)
at scala.Option.map(Option.scala:145)
at 
org.apache.spark.sql.hive.orc.OrcTableScan.org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject(OrcRelation.scala:256)
at 
org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$execute$3.apply(OrcRelation.scala:318)
at 
org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$execute$3.apply(OrcRelation.scala:316)
at 
org.apache.spark.rdd.HadoopRDD$HadoopMapPartitionsWithSplitRDD.compute(HadoopRDD.scala:380)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
 

[jira] [Commented] (SPARK-5741) Support the path contains comma in HiveContext

2015-08-27 Thread koert kuipers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14717450#comment-14717450
 ] 

koert kuipers commented on SPARK-5741:
--

Given the requirement of source/binary compatibility, I do not think it can be 
done without some kind of string munging.

However, the string munging could be restricted to a separate variable, set with 
paths(path: *) so path does not get polluted. This variable would be 
exclusively for HadoopFsRelationProvider, and an error would be thrown in 
ResolvedDataSource if any other RelationProvider is used. Also, an error would 
be thrown if path and paths are both set.

Does this sound reasonable? If not, I will keep looking for other solutions.
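To make the underlying problem concrete: the String overload of 
FileInputFormat.setInputPaths treats its argument as a comma-separated list, while the 
Path-based overload does not split, which is why a partition value containing a comma 
breaks. A small illustration against the Hadoop mapred API (the paths are hypothetical):

{code}
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapred.{FileInputFormat, JobConf}

val conf = new JobConf()

// String overload: the value is split on ',' and the empty trailing segment
// triggers "Can not create a Path from an empty string", as in the error log.
FileInputFormat.setInputPaths(conf, "/warehouse/nzhang_part/ds=2010-08-15/hr=file,")

// Path varargs overload: the comma stays part of the single path.
FileInputFormat.setInputPaths(conf, new Path("/warehouse/nzhang_part/ds=2010-08-15/hr=file,"))
{code}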


 Support the path contains comma in HiveContext
 --

 Key: SPARK-5741
 URL: https://issues.apache.org/jira/browse/SPARK-5741
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Yadong Qi
Assignee: Yadong Qi
 Fix For: 1.3.0


 When running ```select * from nzhang_part where hr = 'file,';```, it throws the 
 exception ```java.lang.IllegalArgumentException: Can not create a Path from 
 an empty string```, because the HDFS path contains a comma and 
 FileInputFormat.setInputPaths splits the path by comma.
 ###
 SQL
 ###
 set hive.merge.mapfiles=true; 
 set hive.merge.mapredfiles=true;
 set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
 set hive.exec.dynamic.partition=true;
 set hive.exec.dynamic.partition.mode=nonstrict;
 create table nzhang_part like srcpart;
 insert overwrite table nzhang_part partition (ds='2010-08-15', hr) select 
 key, value, hr from srcpart where ds='2008-04-08';
 insert overwrite table nzhang_part partition (ds='2010-08-15', hr=11) select 
 key, value from srcpart where ds='2008-04-08';
 insert overwrite table nzhang_part partition (ds='2010-08-15', hr)  
 select * from (
 select key, value, hr from srcpart where ds='2008-04-08'
 union all
 select '1' as key, '1' as value, 'file,' as hr from src limit 1) s;
 select * from nzhang_part where hr = 'file,';
 ###
 Error log
 ###
 15/02/10 14:33:16 ERROR SparkSQLDriver: Failed in [select * from nzhang_part 
 where hr = 'file,']
 java.lang.IllegalArgumentException: Can not create a Path from an empty string
 at org.apache.hadoop.fs.Path.checkPathArg(Path.java:127)
 at org.apache.hadoop.fs.Path.<init>(Path.java:135)
 at 
 org.apache.hadoop.util.StringUtils.stringToPath(StringUtils.java:241)
 at 
 org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:400)
 at 
 org.apache.spark.sql.hive.HadoopTableReader$.initializeLocalJobConfFunc(TableReader.scala:251)
 at 
 org.apache.spark.sql.hive.HadoopTableReader$$anonfun$11.apply(TableReader.scala:229)
 at 
 org.apache.spark.sql.hive.HadoopTableReader$$anonfun$11.apply(TableReader.scala:229)
 at 
 org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:172)
 at 
 org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:172)
 at scala.Option.map(Option.scala:145)
 at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:172)
 at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:196)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:223)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:221)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:221)
 at 
 org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:223)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:221)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:221)
 at 
 org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:223)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:221)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:221)
 at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
 at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
 at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at 

[jira] [Resolved] (SPARK-9901) User guide for RowMatrix Tall-and-skinny QR

2015-08-27 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-9901.
--
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 8462
[https://github.com/apache/spark/pull/8462]

 User guide for RowMatrix Tall-and-skinny QR
 ---

 Key: SPARK-9901
 URL: https://issues.apache.org/jira/browse/SPARK-9901
 Project: Spark
  Issue Type: Documentation
  Components: MLlib
Reporter: Feynman Liang
Assignee: yuhao yang
 Fix For: 1.5.0


 SPARK-7368 adds Tall-and-Skinny QR factorization. 
 {{mllib-data-types#rowmatrix}} should be updated to document this feature.






[jira] [Commented] (SPARK-9316) Add support for filtering using `[` (synonym for filter / select)

2015-08-27 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14717531#comment-14717531
 ] 

Shivaram Venkataraman commented on SPARK-9316:
--

I don't think supporting the "age in (19, 30)" syntax with `[` is very important, as 
the %in% is more natural for R users. We should update the documentation to 
reflect this if it's misleading though.

 Add support for filtering using `[` (synonym for filter / select)
 -

 Key: SPARK-9316
 URL: https://issues.apache.org/jira/browse/SPARK-9316
 Project: Spark
  Issue Type: Sub-task
  Components: SparkR
Reporter: Shivaram Venkataraman
Assignee: Felix Cheung
 Fix For: 1.6.0, 1.5.1


 Will help us support queries of the form 
 {code}
 air[air$UniqueCarrier %in% c("UA", "HA"), c(1,2,3,5:9)]
 {code}






[jira] [Updated] (SPARK-10310) [Spark SQL] All result records will be populated into ONE line during the script transform due to missing the correct line/field delimiter

2015-08-27 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-10310:
-
Priority: Critical  (was: Blocker)

 [Spark SQL] All result records will be populated into ONE line during the 
 script transform due to missing the correct line/field delimiter
 --

 Key: SPARK-10310
 URL: https://issues.apache.org/jira/browse/SPARK-10310
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Yi Zhou
Priority: Critical

 There is a real case using a Python stream script in a Spark SQL query. We found 
 that all result records were written in ONE line as input from the select 
 pipeline for the Python script, so the script cannot identify each 
 record. Also, the field separator in Spark SQL is '^A' ('\001'), which is 
 inconsistent/incompatible with the '\t' used in the Hive implementation.
 #Key  Query:
 CREATE VIEW temp1 AS
 SELECT *
 FROM
 (
   FROM
   (
 SELECT
   c.wcs_user_sk,
   w.wp_type,
   (wcs_click_date_sk * 24 * 60 * 60 + wcs_click_time_sk) AS tstamp_inSec
 FROM web_clickstreams c, web_page w
 WHERE c.wcs_web_page_sk = w.wp_web_page_sk
 AND   c.wcs_web_page_sk IS NOT NULL
 AND   c.wcs_user_sk IS NOT NULL
 AND   c.wcs_sales_sk IS NULL --abandoned implies: no sale
 DISTRIBUTE BY wcs_user_sk SORT BY wcs_user_sk, tstamp_inSec
   ) clicksAnWebPageType
   REDUCE
 wcs_user_sk,
 tstamp_inSec,
 wp_type
   USING 'python sessionize.py 3600'
   AS (
 wp_type STRING,
 tstamp BIGINT, 
 sessionid STRING)
 ) sessionized
 #Key Python Script#
 for line in sys.stdin:
  user_sk, tstamp_str, value = line.strip().split("\t")
 Result Records example from 'select' ##
 ^V31^A3237764860^Afeedback^U31^A3237769106^Adynamic^T31^A3237779027^Areview
 Result Records example in format##
 31   3237764860   feedback
 31   3237769106   dynamic
 31   3237779027   review






[jira] [Commented] (SPARK-10304) Partition discovery does not throw an exception if the dir structure is invalid

2015-08-27 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14717575#comment-14717575
 ] 

Yin Huai commented on SPARK-10304:
--

[~zhazhan] just took a look, it is not an ORC issue. It is an issue related to 
partition discovery.

 Partition discovery does not throw an exception if the dir structure is invalid
 -

 Key: SPARK-10304
 URL: https://issues.apache.org/jira/browse/SPARK-10304
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Yin Huai
Assignee: Zhan Zhang
Priority: Critical

 I have a dir structure like {{/path/table1/partition_column=1/}}. When I try 
 to use {{load("/path/")}}, it works and I get a DF. When I query this DF, if 
 it is stored as ORC, I get the following NPE. But if it is Parquet, we can 
 even return rows. We should complain to users about the dir structure 
 because {{table1}} does not match our expected format.
 {code}
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 26 in 
 stage 57.0 failed 4 times, most recent failure: Lost task 26.3 in stage 57.0 
 (TID 3504, 10.0.195.227): java.lang.NullPointerException
 at 
 org.apache.spark.sql.hive.HiveInspectors$class.unwrapperFor(HiveInspectors.scala:466)
   at 
 org.apache.spark.sql.hive.orc.OrcTableScan.unwrapperFor(OrcRelation.scala:224)
   at 
 org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1$$anonfun$9.apply(OrcRelation.scala:261)
   at 
 org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1$$anonfun$9.apply(OrcRelation.scala:261)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at scala.collection.immutable.List.foreach(List.scala:318)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
   at 
 org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1.apply(OrcRelation.scala:261)
   at 
 org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1.apply(OrcRelation.scala:256)
   at scala.Option.map(Option.scala:145)
   at 
 org.apache.spark.sql.hive.orc.OrcTableScan.org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject(OrcRelation.scala:256)
   at 
 org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$execute$3.apply(OrcRelation.scala:318)
   at 
 org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$execute$3.apply(OrcRelation.scala:316)
   at 
 org.apache.spark.rdd.HadoopRDD$HadoopMapPartitionsWithSplitRDD.compute(HadoopRDD.scala:380)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
   at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
   at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
 {code}
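 To make the expected layout concrete, a hedged sketch (paths are hypothetical, and an
 existing SparkContext sc is assumed): partition discovery expects the path handed to
 load() to sit directly above the partition_column=value directories, so pointing it
 one level higher leaves a non-partition directory (table1) in between.

 {code}
 import org.apache.spark.sql.hive.HiveContext

 val sqlContext = new HiveContext(sc)  // sc: existing SparkContext (assumed)

 // Layout partition discovery understands: children of the load() path are
 // partition directories.
 //   /path/table1/partition_column=1/part-00000.orc
 //   /path/table1/partition_column=2/part-00000.orc
 val ok = sqlContext.read.format("orc").load("/path/table1/")

 // Layout from this report: load() is pointed one level too high, so "table1"
 // is not a partition_column=value directory. Today this silently returns a
 // DataFrame (and later an NPE for ORC) instead of failing fast.
 val broken = sqlContext.read.format("orc").load("/path/")
 {code}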






[jira] [Commented] (SPARK-10296) add preservesPartitioning parameter to RDD.map

2015-08-27 Thread Esteban Donato (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14717713#comment-14717713
 ] 

Esteban Donato commented on SPARK-10296:


Sean, thanks for your response. As per your comments, a couple of things. You're 
right that the parameter is to support mapping key-value pairs when the key 
doesn't change. My point is that when you are in such a situation and you don't 
want to lose the partitioner, you are forced to use the mapPartitions method 
instead of the map method just to get the preservesPartitioning parameter, even 
when the map method would be enough.

On the other hand, regarding the changes to the API, I think that shouldn't be 
an issue if preservesPartitioning is added as the last parameter with a 
default value of false, to keep it backwards compatible.

Let me know your thoughts.
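A short sketch of the workaround being discussed, for a pair RDD whose keys are left 
untouched; mapValues already preserves the partitioner, and mapPartitions exposes the 
flag that map currently lacks:

{code}
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("PreservePartitioningSketch").setMaster("local[2]"))

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3))).partitionBy(new HashPartitioner(4))

// map() drops the partitioner even though the keys are unchanged.
val viaMap = pairs.map { case (k, v) => (k, v * 10) }
assert(viaMap.partitioner.isEmpty)

// mapPartitions(..., preservesPartitioning = true) keeps it.
val viaMapPartitions = pairs.mapPartitions(
  iter => iter.map { case (k, v) => (k, v * 10) },
  preservesPartitioning = true)
assert(viaMapPartitions.partitioner.isDefined)

// mapValues is the idiomatic alternative when only the values change.
val viaMapValues = pairs.mapValues(_ * 10)
assert(viaMapValues.partitioner.isDefined)
{code}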

 add preservesPartitioning parameter to RDD.map
 -

 Key: SPARK-10296
 URL: https://issues.apache.org/jira/browse/SPARK-10296
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Esteban Donato
Priority: Minor

 It would be nice to add the Boolean parameter preservesPartitioning, with a 
 default of false, to the RDD.map method just as it exists on RDD.mapPartitions.
 If you agree I can submit a pull request with this enhancement.






[jira] [Updated] (SPARK-10323) NPE in code-gened In expression

2015-08-27 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-10323:
-
Assignee: Davies Liu

 NPE in code-gened In expression
 ---

 Key: SPARK-10323
 URL: https://issues.apache.org/jira/browse/SPARK-10323
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: Yin Huai
Assignee: Davies Liu
Priority: Critical

 To reproduce the problem, you can run {{null in ('str')}}. Let's also take a 
 look at InSet and other similar expressions.
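 A minimal way to hit this from Scala, assuming an existing SQLContext named
 sqlContext; the null literal on the left-hand side of IN is what the generated code
 mishandles:

 {code}
 // Reproduces the reported case (null on the left-hand side of IN).
 sqlContext.sql("SELECT null IN ('str')").show()

 // The DataFrame API spelling of the same predicate, for comparison.
 import org.apache.spark.sql.functions.lit
 sqlContext.range(1).select(lit(null).isin("str")).show()
 {code}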






[jira] [Comment Edited] (SPARK-8514) LU factorization on BlockMatrix

2015-08-27 Thread Jerome (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14717696#comment-14717696
 ] 

Jerome edited comment on SPARK-8514 at 8/27/15 11:03 PM:
-

I have a draft of the LU Decomposition in BlockMatrix.scala

https://github.com/nilmeier/spark/tree/SPARK-8514_LU_factorization
https://github.com/nilmeier/spark/tree/SPARK-8514_LU_factorization/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala

Only one unit test so far:
https://github.com/nilmeier/spark/tree/SPARK-8514_LU_factorization/mllib/src/test/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrixSuite.scala

The method here is slightly different from the previously proposed method in 
that it pre-forms large block matrices for large BlockMatrix.multiply 
operations. I'll be adding documentation to GitHub shortly to describe the 
method.

Cheers, J


was (Author: nilmeier):
I have a draft of the LU Decomposition in BlockMatrix.scala

https://github.com/nilmeier/spark/blob/SPARK-8514_LU_factorization
https://github.com/nilmeier/spark/blob/SPARK-8514_LU_factorization/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala

Only one unit test so far:
https://github.com/nilmeier/spark/blob/SPARK-8514_LU_factorization/mllib/src/test/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrixSuite.scala

The method here is slightly different from the previously proposed method in 
that it pre-forms large block matrices for large BlockMatrix.multiply 
operations. I'll be adding documentation to GitHub shortly to describe the 
method.

Cheers, J

 LU factorization on BlockMatrix
 ---

 Key: SPARK-8514
 URL: https://issues.apache.org/jira/browse/SPARK-8514
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xiangrui Meng
  Labels: advanced
 Attachments: BlockMatrixSolver.pdf, BlockPartitionMethods.py, 
 BlockPartitionMethods.scala, LUBlockDecompositionBasic.pdf, testScript.scala


 LU is the most common method to solve a general linear system or inverse a 
 general matrix. A distributed version could in implemented block-wise with 
 pipelining. A reference implementation is provided in ScaLAPACK:
 http://netlib.org/scalapack/slug/node178.html
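 For readers following the draft above: the local building block of a block-wise
 scheme is an ordinary dense LU factorization of each diagonal block, which is then
 used to update the blocks to its right and below. A self-contained Doolittle-style
 sketch of that local step only (no pivoting, illustration only, not the linked
 implementation):

 {code}
 // In-place LU factorization (Doolittle, no pivoting) of an n x n matrix
 // stored row-major in a flat array: A = L * U, with L unit lower triangular
 // kept below the diagonal and U kept on/above it.
 def localLU(a: Array[Double], n: Int): Unit = {
   for (k <- 0 until n; i <- k + 1 until n) {
     a(i * n + k) /= a(k * n + k)  // multiplier l(i,k)
     for (j <- k + 1 until n) {
       a(i * n + j) -= a(i * n + k) * a(k * n + j)  // update trailing block
     }
   }
 }

 // Tiny usage example on a 2x2 block: A = [[4, 3], [6, 3]].
 val block = Array(4.0, 3.0, 6.0, 3.0)
 localLU(block, 2)
 // block == [4.0, 3.0, 1.5, -1.5]: U = [[4, 3], [0, -1.5]], L = [[1, 0], [1.5, 1]].
 println(block.mkString(", "))
 {code}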






[jira] [Updated] (SPARK-10287) After processing a query using JSON data, Spark SQL continuously refreshes metadata of the table

2015-08-27 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-10287:
-
Labels: releasenotes  (was: )

 After processing a query using JSON data, Spark SQL continuously refreshes 
 metadata of the table
 

 Key: SPARK-10287
 URL: https://issues.apache.org/jira/browse/SPARK-10287
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: Yin Huai
Assignee: Yin Huai
Priority: Critical
  Labels: releasenotes
 Fix For: 1.5.1


 I have a partitioned json table with 1824 partitions.
 {code}
 val df = sqlContext.read.format("json").load("aPartitionedJsonData")
 val columnStr = df.schema.map(_.name).mkString(",")
 println(s"columns: $columnStr")
 val hash = df
   .selectExpr(s"hash($columnStr) as hashValue")
   .groupBy()
   .sum("hashValue")
   .head()
   .getLong(0)
 {code}
 Looks like for JSON, we refresh metadata when we call buildScan. For a 
 partitioned table, we call buildScan for every partition. So, looks like we 
 will refresh this table 1824 times.






[jira] [Updated] (SPARK-10287) After processing a query using JSON data, Spark SQL continuously refreshes metadata of the table

2015-08-27 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-10287:
-
Target Version/s:   (was: 1.5.0)

 After processing a query using JSON data, Spark SQL continuously refreshes 
 metadata of the table
 

 Key: SPARK-10287
 URL: https://issues.apache.org/jira/browse/SPARK-10287
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: Yin Huai
Assignee: Yin Huai
Priority: Critical
 Fix For: 1.5.1


 I have a partitioned json table with 1824 partitions.
 {code}
 val df = sqlContext.read.format("json").load("aPartitionedJsonData")
 val columnStr = df.schema.map(_.name).mkString(",")
 println(s"columns: $columnStr")
 val hash = df
   .selectExpr(s"hash($columnStr) as hashValue")
   .groupBy()
   .sum("hashValue")
   .head()
   .getLong(0)
 {code}
 Looks like for JSON, we refresh metadata when we call buildScan. For a 
 partitioned table, we call buildScan for every partition. So, looks like we 
 will refresh this table 1824 times.






[jira] [Resolved] (SPARK-10287) After processing a query using JSON data, Spark SQL continuously refreshes metadata of the table

2015-08-27 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-10287.
--
   Resolution: Fixed
Fix Version/s: 1.5.1

Issue resolved by pull request 8469
[https://github.com/apache/spark/pull/8469]

 After processing a query using JSON data, Spark SQL continuously refreshes 
 metadata of the table
 

 Key: SPARK-10287
 URL: https://issues.apache.org/jira/browse/SPARK-10287
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: Yin Huai
Priority: Critical
 Fix For: 1.5.1


 I have a partitioned json table with 1824 partitions.
 {code}
 val df = sqlContext.read.format("json").load("aPartitionedJsonData")
 val columnStr = df.schema.map(_.name).mkString(",")
 println(s"columns: $columnStr")
 val hash = df
   .selectExpr(s"hash($columnStr) as hashValue")
   .groupBy()
   .sum("hashValue")
   .head()
   .getLong(0)
 {code}
 Looks like for JSON, we refresh metadata when we call buildScan. For a 
 partitioned table, we call buildScan for every partition. So, looks like we 
 will refresh this table 1824 times.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10287) After processing a query using JSON data, Spark SQL continuously refreshes metadata of the table

2015-08-27 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14717737#comment-14717737
 ] 

Yin Huai commented on SPARK-10287:
--

We need to put in the following release note: "JSON data source will not 
automatically load new files that are created by other applications (i.e. files 
that are not inserted to the dataset through Spark SQL)." [SPARK-10287]
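
For users on 1.5.1 and later (where the release note above applies), a hedged 
sketch of how externally-added files can still be picked up; the path and table 
name below are made up:

{code}
// Sketch only: re-loading re-lists the directory, so files written by other
// applications are included in the new DataFrame.
val df1 = sqlContext.read.format("json").load("/data/events")  // snapshot of the files present at load time

// ...another application writes more JSON files under /data/events...

val df2 = sqlContext.read.format("json").load("/data/events")  // re-listing picks up the new files

// For a table registered in the catalog, refreshing it should have the same
// effect (refreshTable is available on HiveContext; check your context):
// sqlContext.refreshTable("events")
{code}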

 After processing a query using JSON data, Spark SQL continuously refreshes 
 metadata of the table
 

 Key: SPARK-10287
 URL: https://issues.apache.org/jira/browse/SPARK-10287
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: Yin Huai
Assignee: Yin Huai
Priority: Critical
  Labels: releasenotes
 Fix For: 1.5.1


 I have a partitioned json table with 1824 partitions.
 {code}
 val df = sqlContext.read.format("json").load(aPartitionedJsonData)
 val columnStr = df.schema.map(_.name).mkString(",")
 println(s"columns: $columnStr")
 val hash = df
   .selectExpr(s"hash($columnStr) as hashValue")
   .groupBy()
   .sum("hashValue")
   .head()
   .getLong(0)
 {code}
 Looks like for JSON, we refresh metadata when we call buildScan. For a 
 partitioned table, we call buildScan for every partition. So, looks like we 
 will refresh this table 1824 times.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8514) LU factorization on BlockMatrix

2015-08-27 Thread Jerome (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14717696#comment-14717696
 ] 

Jerome commented on SPARK-8514:
---

I have a draft of the LU Decomposition in BlockMatrix.scala

https://github.com/nilmeier/spark/blob/SPARK-8514_LU_factorization
https://github.com/nilmeier/spark/blob/SPARK-8514_LU_factorization/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala

Only one unit test so far:
https://github.com/nilmeier/spark/blob/SPARK-8514_LU_factorization/mllib/src/test/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrixSuite.scala

The method here is slightly different from the previously proposed method in 
that it pre-forms large block matrices for large BlockMatrix.multiply 
operations. I'll be adding documentation to GitHub shortly to describe the 
method.

Cheers, J
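
For context on the linear algebra being distributed here, a tiny single-node 
Doolittle LU sketch (no pivoting, plain Scala arrays); it only illustrates the 
recurrence and is not the BlockMatrix code in the branch linked above:

{code}
// L has a unit diagonal, U is upper triangular, and A = L * U.
def luDecompose(a: Array[Array[Double]]): (Array[Array[Double]], Array[Array[Double]]) = {
  val n = a.length
  val l = Array.tabulate(n, n)((i, j) => if (i == j) 1.0 else 0.0)
  val u = Array.ofDim[Double](n, n)
  for (i <- 0 until n) {
    for (j <- i until n)      // row i of U
      u(i)(j) = a(i)(j) - (0 until i).map(k => l(i)(k) * u(k)(j)).sum
    for (j <- i + 1 until n)  // column i of L
      l(j)(i) = (a(j)(i) - (0 until i).map(k => l(j)(k) * u(k)(i)).sum) / u(i)(i)
  }
  (l, u)
}

val (l, u) = luDecompose(Array(Array(4.0, 3.0), Array(6.0, 3.0)))
// l = [[1.0, 0.0], [1.5, 1.0]], u = [[4.0, 3.0], [0.0, -1.5]]
{code}

The block-wise, pipelined version applies the same recurrence at the granularity 
of matrix blocks, which is what the ScaLAPACK reference below describes.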

 LU factorization on BlockMatrix
 ---

 Key: SPARK-8514
 URL: https://issues.apache.org/jira/browse/SPARK-8514
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xiangrui Meng
  Labels: advanced
 Attachments: BlockMatrixSolver.pdf, BlockPartitionMethods.py, 
 BlockPartitionMethods.scala, LUBlockDecompositionBasic.pdf, testScript.scala


 LU is the most common method to solve a general linear system or inverse a 
 general matrix. A distributed version could in implemented block-wise with 
 pipelining. A reference implementation is provided in ScaLAPACK:
 http://netlib.org/scalapack/slug/node178.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10307) Fix regression in block matrix multiply (1.4-1.5 regression)

2015-08-27 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14717704#comment-14717704
 ] 

Joseph K. Bradley commented on SPARK-10307:
---

I tested this a number of times to try to reproduce the issue on branch-1.5.  
Weirdly, I reproduced it once, with running times:
{code}
"results":[{"time":79.313},{"time":82.344},{"time":77.169},{"time":63.269},{"time":86.671},{"time":79.732},{"time":76.208},{"time":91.78},{"time":73.738},{"time":56.931},{"time":75.267},{"time":75.316},{"time":63.639},{"time":66.429},{"time":67.172}]
{code}

But when I tried re-running on branch-1.5 a few times (on both RC1 and the most 
recent branch with updates post-RC21), I got times like this:
{code}
"results":[{"time":49.95},{"time":49.081},{"time":50.712},{"time":49.272},{"time":49.81},{"time":47.067},{"time":52.498},{"time":48.093},{"time":48.468},{"time":49.142},{"time":47.212},{"time":47.21},{"time":48.007},{"time":55.267},{"time":48.121}]
{code}

Note these were all on the same EC2 cluster.

So...I'd say there is no obvious regression.  If something is wrong, then it's 
pretty subtle.  I'll close this for now.
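
As a quick sanity check on the numbers quoted above, plain Scala arithmetic over 
the values copied from the two result sets:

{code}
val slowRun = Seq(79.313, 82.344, 77.169, 63.269, 86.671, 79.732, 76.208, 91.78,
  73.738, 56.931, 75.267, 75.316, 63.639, 66.429, 67.172)
val laterRuns = Seq(49.95, 49.081, 50.712, 49.272, 49.81, 47.067, 52.498, 48.093,
  48.468, 49.142, 47.212, 47.21, 48.007, 55.267, 48.121)
println(slowRun.sum / slowRun.size)      // ~74.3 s for the one slow run
println(laterRuns.sum / laterRuns.size)  // ~49.3 s for the later runs
{code}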

 Fix regression in block matrix multiply (1.4-1.5 regression)
 -

 Key: SPARK-10307
 URL: https://issues.apache.org/jira/browse/SPARK-10307
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.5.0
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley
Priority: Critical

 Running spark-perf on the block-matrix-mult test (BlockMatrix.multiply), I 
 found the running time increased from 50 sec to 80 sec.  This was on the 
 default test settings, on 16 r3.2xlarge workers on EC2, and with 15 trials, 
 dropping the first 2.
 The only relevant changes I found are:
 * 
 [https://github.com/apache/spark/commit/520ad44b17f72e6465bf990f64b4e289f8a83447]
 * 
 [https://github.com/apache/spark/commit/99c40cd0d8465525cac34dfa373b81532ef3d719]
 I'm testing reverting each of those now.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9680) Update programming guide section for ml.feature.StopWordsRemover

2015-08-27 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-9680.
--
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 8436
[https://github.com/apache/spark/pull/8436]

 Update programming guide section for ml.feature.StopWordsRemover
 

 Key: SPARK-9680
 URL: https://issues.apache.org/jira/browse/SPARK-9680
 Project: Spark
  Issue Type: Documentation
  Components: ML
Reporter: yuhao yang
Assignee: Feynman Liang
Priority: Minor
  Labels: document
 Fix For: 1.5.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9906) User guide for LogisticRegressionSummary

2015-08-27 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-9906.
--
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 8197
[https://github.com/apache/spark/pull/8197]

 User guide for LogisticRegressionSummary
 

 Key: SPARK-9906
 URL: https://issues.apache.org/jira/browse/SPARK-9906
 Project: Spark
  Issue Type: Documentation
  Components: ML
Reporter: Feynman Liang
Assignee: Manoj Kumar
 Fix For: 1.5.0


 SPARK-9112 introduces {{LogisticRegressionSummary}} to provide R-like model 
 statistics to ML pipeline logistic regression models. This feature is not 
 present in mllib and should be documented within {{ml-linear-methods}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10307) Fix regression in block matrix multiply (1.4-1.5 regression)

2015-08-27 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-10307.
---
Resolution: Cannot Reproduce

 Fix regression in block matrix multiply (1.4-1.5 regression)
 -

 Key: SPARK-10307
 URL: https://issues.apache.org/jira/browse/SPARK-10307
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.5.0
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley
Priority: Critical

 Running spark-perf on the block-matrix-mult test (BlockMatrix.multiply), I 
 found the running time increased from 50 sec to 80 sec.  This was on the 
 default test settings, on 16 r3.2xlarge workers on EC2, and with 15 trials, 
 dropping the first 2.
 The only relevant changes I found are:
 * 
 [https://github.com/apache/spark/commit/520ad44b17f72e6465bf990f64b4e289f8a83447]
 * 
 [https://github.com/apache/spark/commit/99c40cd0d8465525cac34dfa373b81532ef3d719]
 I'm testing reverting each of those now.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10323) NPE in code-gened In expression

2015-08-27 Thread Yin Huai (JIRA)
Yin Huai created SPARK-10323:


 Summary: NPE in code-gened In expression
 Key: SPARK-10323
 URL: https://issues.apache.org/jira/browse/SPARK-10323
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: Yin Huai
Priority: Critical


To reproduce the problem, you can run {{null in ('str')}}. Let's also take a 
look at InSet and other similar expressions.
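
A minimal reproduction sketch, assuming a 1.5.0 spark-shell with the default 
{{sqlContext}} (codegen is enabled by default there):

{code}
// IN with a null on the left-hand side hits the NPE in the generated code.
sqlContext.sql("SELECT null IN ('str')").show()
{code}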



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4066) Make whether maven builds fails on scalastyle violation configurable

2015-08-27 Thread Ted Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated SPARK-4066:
--
Description: 
Here is the thread Koert started:

http://search-hadoop.com/m/JW1q5j8z422/scalastyle+annoys+me+a+little+bit&subj=scalastyle+annoys+me+a+little+bit


It would be more flexible if whether the maven build fails on scalastyle 
violations were configurable.

  was:
Here is the thread Koert started:

http://search-hadoop.com/m/JW1q5j8z422/scalastyle+annoys+me+a+little+bit&subj=scalastyle+annoys+me+a+little+bit

It would be more flexible if whether the maven build fails on scalastyle 
violations were configurable.


 Make whether maven builds fails on scalastyle violation configurable
 

 Key: SPARK-4066
 URL: https://issues.apache.org/jira/browse/SPARK-4066
 Project: Spark
  Issue Type: Improvement
  Components: Build
Reporter: Ted Yu
Priority: Minor
  Labels: style
 Attachments: spark-4066-v1.txt


 Here is the thread Koert started:
 http://search-hadoop.com/m/JW1q5j8z422/scalastyle+annoys+me+a+little+bit&subj=scalastyle+annoys+me+a+little+bit
 It would be more flexible if whether the maven build fails on scalastyle 
 violations were configurable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10323) NPE in code-gened In expression

2015-08-27 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14717762#comment-14717762
 ] 

Yin Huai commented on SPARK-10323:
--

Seems {{array_contains}} does not have this NPE issue.

 NPE in code-gened In expression
 ---

 Key: SPARK-10323
 URL: https://issues.apache.org/jira/browse/SPARK-10323
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: Yin Huai
Assignee: Davies Liu
Priority: Critical

 To reproduce the problem, you can run {{null in ('str')}}. Let's also take a 
 look at InSet and other similar expressions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10324) MLlib 1.6 Roadmap

2015-08-27 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-10324:
-

 Summary: MLlib 1.6 Roadmap
 Key: SPARK-10324
 URL: https://issues.apache.org/jira/browse/SPARK-10324
 Project: Spark
  Issue Type: Umbrella
  Components: ML, MLlib
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
Priority: Critical






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10326) Cannot launch YARN job on Windows

2015-08-27 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-10326:
--

 Summary: Cannot launch YARN job on Windows 
 Key: SPARK-10326
 URL: https://issues.apache.org/jira/browse/SPARK-10326
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.5.0
Reporter: Marcelo Vanzin


The fix is already in master, and it's one line out of the patch for 
SPARK-5754; the bug is that a Windows file path cannot be used directly to 
create a URI, so {{File.toURI()}} needs to be called.
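
A small sketch of the underlying Java behaviour (plain JVM code, not the Spark 
patch itself); the path is made up:

{code}
import java.io.File

val winPath = "C:\\Users\\me\\app.jar"

// The raw Windows path is not a valid URI (backslashes are illegal URI
// characters), so new java.net.URI(winPath) would throw URISyntaxException:
// val broken = new java.net.URI(winPath)

// File.toURI() escapes and normalises the path into a proper file: URI.
val ok = new File(winPath).toURI
println(ok)  // prints something like file:/C:/Users/me/app.jar
{code}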



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10329) Cost RDD in k-means initialization is not storage-efficient

2015-08-27 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-10329:
--
Description: 
Currently we use `RDD[Vector]` to store point cost during k-means|| 
initialization, where each `Vector` has size `runs`. This is not 
storage-efficient because `runs` is usually 1 and then each record is a Vector 
of size 1. What we need is just the 8 bytes to store the cost, but we introduce 
two objects (DenseVector and its values array), which could cost 16 bytes. That 
is 200% overhead. Thanks [~Grace Huang] and Jiayin Hu from Intel for reporting 
this issue!

There are several solutions:

1. Use `RDD[Array[Double]]` instead of `RDD[Vector]`, which saves 8 bytes per 
record.
2. Use `RDD[Array[Double]]` but batch the values for storage, e.g. each 
`Array[Double]` object covers 1024 instances, which could remove most of the 
overhead.

Besides, using MEMORY_AND_DISK instead of MEMORY_ONLY could prevent cost RDDs 
kicking out the training dataset from memory.

  was:
Currently we use `RDD[Vector]` to store point cost during k-means|| 
initialization, where each `Vector` has size `runs`. This is not 
storage-efficient because `runs` is usually 1 and then each record is a Vector 
of size 1. What we need is just the 8 bytes to store the cost, but we introduce 
two objects (DenseVector and its values array), which could cost 16 bytes. That 
is 200% overhead. Thanks [~Grace Huang] and Jiayin Hu from Intel for reporting 
this issue!

There are several solutions:

1. Use `RDD[Array[Double]]` instead of `RDD[Vector]`, which saves 8 bytes per 
record.
2. Use `RDD[Array[Double]]`) but batch the values for storage, e.g. each 
`Array[Double]` object covers 1024 instances, which could remove most of the 
overhead.

Besides, using MEMORY_AND_DISK instead of MEMORY_ONLY could prevent cost RDDs 
kicking out the training dataset from memory.


 Cost RDD in k-means initialization is not storage-efficient
 ---

 Key: SPARK-10329
 URL: https://issues.apache.org/jira/browse/SPARK-10329
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.1, 1.4.1, 1.5.0
Reporter: Xiangrui Meng
  Labels: clustering

 Currently we use `RDD[Vector]` to store point cost during k-means|| 
 initialization, where each `Vector` has size `runs`. This is not 
 storage-efficient because `runs` is usually 1 and then each record is a 
 Vector of size 1. What we need is just the 8 bytes to store the cost, but we 
 introduce two objects (DenseVector and its values array), which could cost 16 
 bytes. That is 200% overhead. Thanks [~Grace Huang] and Jiayin Hu from Intel 
 for reporting this issue!
 There are several solutions:
 1. Use `RDD[Array[Double]]` instead of `RDD[Vector]`, which saves 8 bytes per 
 record.
 2. Use `RDD[Array[Double]]` but batch the values for storage, e.g. each 
 `Array[Double]` object covers 1024 instances, which could remove most of the 
 overhead.
 Besides, using MEMORY_AND_DISK instead of MEMORY_ONLY could prevent cost RDDs 
 kicking out the training dataset from memory.
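
A hedged sketch of option 2 above plus the storage-level suggestion; the 
function and RDD names are made up, and this is not the k-means|| code itself:

{code}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Batch the per-point costs into fixed-size arrays so the per-record object
// overhead (DenseVector plus its values array) is amortised over ~1024 values.
def batchCosts(costs: RDD[Double], batchSize: Int = 1024): RDD[Array[Double]] =
  costs.mapPartitions(_.grouped(batchSize).map(_.toArray))

// Usage (hypothetical RDD of per-point costs):
// val batched = batchCosts(pointCosts)
// batched.persist(StorageLevel.MEMORY_AND_DISK)  // avoid evicting the training data from memory
{code}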



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10329) Cost RDD in k-means initialization is not storage-efficient

2015-08-27 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-10329:
-

 Summary: Cost RDD in k-means initialization is not 
storage-efficient
 Key: SPARK-10329
 URL: https://issues.apache.org/jira/browse/SPARK-10329
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.4.1, 1.3.1, 1.5.0
Reporter: Xiangrui Meng


Currently we use `RDD[Vector]` to store point cost during k-means|| 
initialization, where each `Vector` has size `runs`. This is not 
storage-efficient because `runs` is usually 1 and then each record is a Vector 
of size 1. What we need is just the 8 bytes to store the cost, but we introduce 
two objects (DenseVector and its values array), which could cost 16 bytes. That 
is 200% overhead. Thanks [~Grace Huang] and Jiayin Hu from Intel for reporting 
this issue!

There are several solutions:

1. Use `RDD[Array[Double]]` instead of `RDD[Vector]`, which saves 8 bytes per 
record.
2. Use `RDD[Array[Double]]` but batch the values for storage, e.g. each 
`Array[Double]` object covers 1024 instances, which could remove most of the 
overhead.

Besides, using MEMORY_AND_DISK instead of MEMORY_ONLY could prevent cost RDDs 
kicking out the training dataset from memory.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


