[jira] [Commented] (SPARK-12199) Follow-up: Refine example code in ml-features.md

2015-12-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15046557#comment-15046557
 ] 

Apache Spark commented on SPARK-12199:
--

User 'yinxusen' has created a pull request for this issue:
https://github.com/apache/spark/pull/10193

> Follow-up: Refine example code in ml-features.md
> 
>
> Key: SPARK-12199
> URL: https://issues.apache.org/jira/browse/SPARK-12199
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Xusen Yin
>Assignee: somil deshmukh
>  Labels: starter
>







[jira] [Assigned] (SPARK-12199) Follow-up: Refine example code in ml-features.md

2015-12-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12199:


Assignee: somil deshmukh  (was: Apache Spark)

> Follow-up: Refine example code in ml-features.md
> 
>
> Key: SPARK-12199
> URL: https://issues.apache.org/jira/browse/SPARK-12199
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Xusen Yin
>Assignee: somil deshmukh
>  Labels: starter
>







[jira] [Assigned] (SPARK-12199) Follow-up: Refine example code in ml-features.md

2015-12-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12199:


Assignee: Apache Spark  (was: somil deshmukh)

> Follow-up: Refine example code in ml-features.md
> 
>
> Key: SPARK-12199
> URL: https://issues.apache.org/jira/browse/SPARK-12199
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Xusen Yin
>Assignee: Apache Spark
>  Labels: starter
>







[jira] [Commented] (SPARK-6982) Data Frame and Spark SQL should allow filtering on key portion of incremental parquet files

2015-12-08 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15046556#comment-15046556
 ] 

Hyukjin Kwon commented on SPARK-6982:
-

Would this be done now by partition key [partition 
discovery](http://spark.apache.org/docs/latest/sql-programming-guide.html#partition-discovery)?
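As an illustrative aside (not part of the original thread): with partition discovery, the key=... directory layout described in this issue surfaces `key` as a regular column, so the range filter from the report works and prunes partitions. A minimal PySpark sketch, assuming the Spark 1.4+ DataFrameReader API; paths and column names are illustrative:

{code}
# Sketch only: partition discovery turns a layout like
#   /data/events/key=2015-03-24/part-*.parquet
#   /data/events/key=2015-04-01/part-*.parquet
# into a DataFrame with a regular `key` column, so range filters on it
# prune whole partitions instead of scanning every file.
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="partition-discovery-sketch")
sql_context = SQLContext(sc)

df = sql_context.read.parquet("/data/events")  # root of the partitioned layout

# Filtering on the partition column only touches the matching key= directories.
count = df.filter(df.key >= "2015-03-24").filter(df.key < "2015-04-01").count()
print(count)
{code}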



> Data Frame and Spark SQL should allow filtering on key portion of incremental 
> parquet files
> ---
>
> Key: SPARK-6982
> URL: https://issues.apache.org/jira/browse/SPARK-6982
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 1.3.0
>Reporter: Brad Willard
>  Labels: dataframe, sql
>
> I'm working with a 2.4 billion dataset. I just converted it to use the 
> incremental schema features of parquet added in 1.3.0 where you save 
> incremental files with key=X.
> I'm currently saving files where the key is a timestamp like key=2015-01-01. 
> If I run a query, the key comes back as an attribute in the row. It would be 
> amazing to be able to do comparisons and filters on the key attribute to do 
> efficient queries between points in time and just skip the partitions of data 
> outside of a key range.
> df = sql_context.parquetFile('/super_large_dataset_over time')
> df.filter(df.key >= '2015-3-24').filter(df.key < '2015-04-01').count()
> That job could then skip large portions of the dataset very quickly even if 
> the entire parquet file contains years of data.
> Currently that will throw an error because key is not part of the parquet 
> schema even though it's returned in the rows.
> However, it does strangely work with the IN clause, which is my current workaround:
> df.where('key in ("2015-04-02", "2015-04-03")').count()
> Job aborted due to stage failure: Task 122 in stage 6.0 failed 100 times, 
> most recent failure: Lost task 122.99 in stage 6.0 (TID 39498, 
> ip-): java.lang.IllegalArgumentException: Column [key] was not 
> found in schema!
>   at parquet.Preconditions.checkArgument(Preconditions.java:47)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:172)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:160)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:142)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:76)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:41)
>   at parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:46)
>   at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:41)
>   at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:22)
>   at 
> parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:108)
>   at 
> parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:28)
>   at 
> parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:158)
>   at 
> parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:138)
>   at 
> org.apache.spark.rdd.NewHadoopRDD$$anon$1.(NewHadoopRDD.scala:133)
>   at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:104)
>   at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:66)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>   at 
> org.apache.spark.rdd.NewHadoopRDD$NewHadoopMapPartitionsWithSplitRDD.compute(NewHadoopRDD.scala:244)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>   

[jira] [Resolved] (SPARK-12193) Spark-Submit from node.js and call the process

2015-12-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-12193.
---
  Resolution: Invalid
Target Version/s:   (was: 1.3.1)

Questions go to u...@spark.apache.org, not JIRA.

Please read 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark for how 
to fill out JIRAs, too.

> Spark-Submit from node.js and call the process
> --
>
> Key: SPARK-12193
> URL: https://issues.apache.org/jira/browse/SPARK-12193
> Project: Spark
>  Issue Type: Question
>  Components: Spark Shell, Spark Submit
>Affects Versions: 1.3.1
> Environment: Ubuntu 14.4
>Reporter: himanshu singhal
>  Labels: features
>   Original Estimate: 1,344h
>  Remaining Estimate: 1,344h
>
> I have a Spark application which runs my Spark jobs with the spark-shell or 
> spark-submit command. I'd like to go further and I wonder how to use Spark as 
> the backend of a web application. Specifically, I want a frontend application 
> (built with Node.js) to communicate with Spark on the backend, so that every 
> query from the frontend is routed to Spark, and the results from Spark are 
> sent back to the frontend. 






[jira] [Updated] (SPARK-12200) pyspark.sql.types.Row should implement __contains__

2015-12-08 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-12200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-12200:
---
Component/s: PySpark

> pyspark.sql.types.Row should implement __contains__
> ---
>
> Key: SPARK-12200
> URL: https://issues.apache.org/jira/browse/SPARK-12200
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 1.6.0
>Reporter: Maciej Bryński
>Priority: Minor
>
> The Row type should implement __contains__ so when people do this
> {code}
> 'field' in row
> {code}
> we will check row keys instead of values






[jira] [Commented] (SPARK-11652) Remote code execution with InvokerTransformer

2015-12-08 Thread meiyoula (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15046619#comment-15046619
 ] 

meiyoula commented on SPARK-11652:
--

[~darabos] Could you take a look at the patch merged by Owen? I think the 
artifactId of the dependency is wrong.

> Remote code execution with InvokerTransformer
> -
>
> Key: SPARK-11652
> URL: https://issues.apache.org/jira/browse/SPARK-11652
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Daniel Darabos
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 1.4.2, 1.5.3, 1.6.0
>
>
> There is a remote code execution vulnerability in the Apache Commons 
> collections library (https://issues.apache.org/jira/browse/COLLECTIONS-580) 
> that can be exploited simply by causing malicious data to be deserialized 
> using Java serialization.
> As Spark is used in security-conscious environments I think it's worth taking 
> a closer look at how the vulnerability affects Spark. What are the points 
> where Spark deserializes external data? Which are affected by using Kryo 
> instead of Java serialization? What mitigation strategies are available?
> If the issue is serious enough but mitigation is possible, it may be useful 
> to post about it on the mailing list or blog.
> Thanks!






[jira] [Resolved] (SPARK-11928) Master retry deadlock

2015-12-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-11928.
---
Resolution: Cannot Reproduce

> Master retry deadlock
> -
>
> Key: SPARK-11928
> URL: https://issues.apache.org/jira/browse/SPARK-11928
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.2
>Reporter: Richard Marscher
>
> In our hardening testing during upgrade to Spark 1.5.2 we are noticing that 
> there is a deadlock issue in the master connection retry code in AppClient: 
> https://github.com/apache/spark/blob/branch-1.5/core/src/main/scala/org/apache/spark/deploy/client/AppClient.scala#L96
> On a cursory analysis, the Runnable blocks at line 102 waiting on the rpcEnv 
> for the masterRef. This Runnable is not checking for interruption on the 
> current thread so it seems like it will continue to block as long as the 
> rpcEnv is blocking and not respect the cancel(true) call in 
> registerWithMaster. The thread pool only has enough threads to run 
> tryRegisterAllMasters once at a time.
> It is being called from registerWithMaster which is going to retry every 20 
> seconds by default 
> (https://github.com/apache/spark/blob/branch-1.5/core/src/main/scala/org/apache/spark/deploy/client/AppClient.scala#L49).
>  Meanwhile the rpcEnv default timeout is 120 seconds by default 
> (https://github.com/apache/spark/blob/branch-1.5/core/src/main/scala/org/apache/spark/util/RpcUtils.scala#L60).
>  So once registerWithMaster recursively calls itself it will provoke a 
> deadlock in tryRegisterAllMasters.
> Exception
> {code}
> 15/11/23 15:42:19 INFO o.a.s.d.c.AppClient$ClientEndpoint: Connecting to 
> master spark://ip-172-22-121-44:7077...
> 15/11/23 15:42:39 ERROR o.a.s.u.SparkUncaughtExceptionHandler: Uncaught 
> exception in thread Thread[appclient-registration-retry-thread,5,main]
> java.util.concurrent.RejectedExecutionException: Task 
> java.util.concurrent.FutureTask@1d96b2c9 rejected from 
> java.util.concurrent.ThreadPoolExecutor@10b3b94c[Running, pool size = 1, 
> active threads = 0, queued tasks = 0, completed tasks = 1]
>   at 
> java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2048)
>  ~[na:1.7.0_91]
>   at 
> java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:821) 
> [na:1.7.0_91]
>   at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1372) 
> [na:1.7.0_91]
>   at 
> java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:110)
>  ~[na:1.7.0_91]
>   at 
> org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anonfun$tryRegisterAllMasters$1.apply(AppClient.scala:96)
>  ~[org.apache.spark.spark-core_2.10-1.5.2.jar:1.5.2]
>   at 
> org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anonfun$tryRegisterAllMasters$1.apply(AppClient.scala:95)
>  ~[org.apache.spark.spark-core_2.10-1.5.2.jar:1.5.2]
> {code}
> Thread dump excerpt
> {code}
> "appclient-registration-retry-thread" daemon prio=10 tid=0x7f92dc00f800 
> nid=0x6939 waiting on condition [0x7f927dada000]
>java.lang.Thread.State: TIMED_WAITING (parking)
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for  <0xf7e7aef8> (a 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>   at 
> java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2082)
>   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1090)
>   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:807)
>   at 
> java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1068)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> "appclient-register-master-threadpool-0" daemon prio=10 
> tid=0x7f92dc004000 nid=0x6938 waiting on condition [0x7f927dbdb000]
>java.lang.Thread.State: TIMED_WAITING (parking)
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for  <0xf7e7a078> (a 
> java.util.concurrent.SynchronousQueue$TransferStack)
>   at 
> java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
>   at 
> java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:460)
>   at 
> java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:359)
>   at 

[jira] [Resolved] (SPARK-12190) spark does not start cleanly windows 7 64 bit

2015-12-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-12190.
---
Resolution: Duplicate

Please search JIRA first; there are plenty of comments about /tmp/hive.

> spark does not start cleanly windows 7 64 bit
> -
>
> Key: SPARK-12190
> URL: https://issues.apache.org/jira/browse/SPARK-12190
> Project: Spark
>  Issue Type: Bug
>  Components: Windows
>Affects Versions: 1.5.2
> Environment: windows 7 64 bit
> PATH includes:
> C:\Users\Stefan\spark-1.5.2-bin-hadoop2.6\bin
> C:\ProgramData\Oracle\Java\javapath
> C:\Users\Stefan\scala\bin
> SYSTEM variables set are:
> JAVA_HOME=C:\Program Files\Java\jre1.8.0_65
> HADOOP_HOME=C:\Users\Stefan\hadoop-2.6.0
> (where the bin\winutils resides)
>Reporter: stefan
>  Labels: newbie
>
> See environment description above for all my PATH info and ENV variables.
> Hadoop is not compiled, nor is a distributed storage set up, but the hadoop 
> binary with winutils. exe was downloaded from here:
> https://www.barik.net/archive/2015/01/19/172716/
> and moved to the home directory
> Spark was not built on this machine but rather the precompiled binary was 
> downloaded. 
> Java is this version:
> java version "1.8.0_65"
> Java(TM) SE Runtime Environment (build 1.8.0_65-b17)
> Java HotSpot(TM) 64-Bit Server VM (build 25.65-b01, mixed mode)
> Spark-shell is invoked and the error is shown below:
> C:\Users\Stefan\spark-1.5.2-bin-hadoop2.6\bin>spark-shell
> log4j:WARN No appenders could be found for logger 
> (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
> log4j:WARN Please initialize the log4j system properly.
> log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more 
> info.
> Using Spark's repl log4j profile: 
> org/apache/spark/log4j-defaults-repl.properties
> To adjust logging level use sc.setLogLevel("INFO")
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 1.5.2
>       /_/
> Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_65)
> Type in expressions to have them evaluated.
> Type :help for more information.
> 15/12/07 21:18:40 WARN MetricsSystem: Using default name DAGScheduler for 
> source because spark.app.id is not set.
> Spark context available as sc.
> 15/12/07 21:18:42 WARN General: Plugin (Bundle) "org.datanucleus.store.rdbms" 
> is already registered. Ensure you dont have multiple JAR versions of the same 
> plugin in the classpath. The URL 
> "file:/C:/Users/Stefan/spark-1.5.2-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar"
>  is already registered, and you are trying to register an identical plugin 
> located at URL 
> "file:/C:/Users/Stefan/spark-1.5.2-bin-hadoop2.6/bin/../lib/datanucleus-rdbms-3.2.9.jar."
> 15/12/07 21:18:42 WARN General: Plugin (Bundle) "org.datanucleus" is already 
> registered. Ensure you dont have multiple JAR versions of the same plugin in 
> the classpath. The URL 
> "file:/C:/Users/Stefan/spark-1.5.2-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar"
>  is already registered, and you are trying to register an identical plugin 
> located at URL 
> "file:/C:/Users/Stefan/spark-1.5.2-bin-hadoop2.6/bin/../lib/datanucleus-core-3.2.10.jar."
> 15/12/07 21:18:42 WARN General: Plugin (Bundle) "org.datanucleus.api.jdo" is 
> already registered. Ensure you dont have multiple JAR versions of the same 
> plugin in the classpath. The URL 
> "file:/C:/Users/Stefan/spark-1.5.2-bin-hadoop2.6/bin/../lib/datanucleus-api-jdo-3.2.6.jar"
>  is already registered, and you are trying to register an identical plugin 
> located at URL 
> "file:/C:/Users/Stefan/spark-1.5.2-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar."
> 15/12/07 21:18:42 WARN Connection: BoneCP specified but not present in 
> CLASSPATH (or one of dependencies)
> 15/12/07 21:18:42 WARN Connection: BoneCP specified but not present in 
> CLASSPATH (or one of dependencies)
> 15/12/07 21:18:47 WARN ObjectStore: Version information not found in 
> metastore. hive.metastore.schema.verification is not enabled so recording the 
> schema version 1.2.0
> 15/12/07 21:18:47 WARN ObjectStore: Failed to get database default, returning 
> NoSuchObjectException
> 15/12/07 21:18:47 WARN : Your hostname, BloomBear-SSD resolves to a 
> loopback/non-reachable address: fe80:0:0:0:2424:cdcb:ecc1:c9cb%eth6, but we 
> couldn't find any external IP address!
> java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: 
> /tmp/hive on HDFS should be writable. Current permissions are: -
> at 
> org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.(ClientWrapper.scala:171)
> at 
> 

[jira] [Resolved] (SPARK-11487) Spark Master shutdown automatically after some applications execution

2015-12-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-11487.
---
Resolution: Not A Problem

> Spark Master shutdown automatically after some applications execution
> -
>
> Key: SPARK-11487
> URL: https://issues.apache.org/jira/browse/SPARK-11487
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.0
> Environment: Spark Standalone on CentOS 6.6, 
> One Master and 5 worker nodes cluster (Each Node Memory: > 150 GB each, 72 
> cores each)
>Reporter: Sandeep Pal
>  Labels: master
>
> The master logs are as follows after Spark shut down automatically:
> 15/11/02 20:50:01 INFO master.Master: Registering app PythonWordCount
> 15/11/02 20:50:01 INFO master.Master: Registered app PythonWordCount with ID 
> app-20151102205001-0025
> 15/11/02 20:50:01 INFO master.Master: Launching executor 
> app-20151102205001-0025/0 on worker worker-20151030135450-x.x.x.76-42502
> 15/11/02 20:50:01 INFO master.Master: Launching executor 
> app-20151102205001-0025/1 on worker worker-20151030135450-x.x.x.86-51916
> 15/11/02 20:50:01 INFO master.Master: Launching executor 
> app-20151102205001-0025/2 on worker worker-20151030135450-x.x.x.85-47388
> 15/11/02 20:50:01 INFO master.Master: Launching executor 
> app-20151102205001-0025/3 on worker worker-20151030125450-x.x.x.69-51604
> 15/11/02 20:50:01 INFO master.Master: Launching executor 
> app-20151102205001-0025/4 on worker worker-20151030135450-x.x.x.87-35705
> 15/11/02 20:57:35 INFO master.Master: Received unregister request from 
> application app-20151102205001-0025
> 15/11/02 20:57:35 INFO master.Master: Removing app app-20151102205001-0025
> 15/11/02 20:57:35 WARN master.Master: Application PythonWordCount is still in 
> progress, it may be terminated abnormally.
> 15/11/02 20:57:35 INFO spark.SecurityManager: Changing view acls to: root
> 15/11/02 20:57:35 INFO spark.SecurityManager: Changing modify acls to: root
> 15/11/02 20:57:35 INFO spark.SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: Set(root); users 
> with modify permissions: Set(root)
> 15/11/02 20:57:43 INFO master.Master: x.x.x.x:47502 got disassociated, 
> removing it.
> 15/11/02 20:57:43 WARN master.Master: Got status update for unknown executor 
> app-20151102205001-0025/4
> 15/11/02 20:57:43 WARN master.Master: Got status update for unknown executor 
> app-20151102205001-0025/3
> 15/11/02 20:57:43 WARN master.Master: Got status update for unknown executor 
> app-20151102205001-0025/0
> 15/11/02 20:57:43 WARN master.Master: Got status update for unknown executor 
> app-20151102205001-0025/2
> 15/11/02 20:57:43 WARN master.Master: Got status update for unknown executor 
> app-20151102205001-0025/1
> 15/11/02 20:58:28 INFO master.Master: Registering app App Test
> 15/11/02 20:58:28 INFO master.Master: Registered app App Test with ID 
> app-20151102205828-0026
> 15/11/02 20:58:28 INFO master.Master: Launching executor 
> app-20151102205828-0026/0 on worker worker-20151030135450-x.x.x.76-42502
> 15/11/02 20:58:28 INFO master.Master: Launching executor 
> app-20151102205828-0026/1 on worker worker-20151030135450-x.x.x.86-51916
> 15/11/02 20:58:28 INFO master.Master: Launching executor 
> app-20151102205828-0026/2 on worker worker-20151030135450-x.x.x.85-47388
> 15/11/02 20:58:28 INFO master.Master: Launching executor 
> app-20151102205828-0026/3 on worker worker-20151030125450-x.x.x.69-51604
> 15/11/02 20:58:28 INFO master.Master: Launching executor 
> app-20151102205828-0026/4 on worker worker-20151030135450-x.x.x.87-35705
> 15/11/02 20:59:35 INFO master.Master: Received unregister request from 
> application app-20151102205828-0026
> 15/11/02 20:59:35 INFO master.Master: Removing app app-20151102205828-0026
> 15/11/02 20:59:35 WARN master.Master: Application App Test is still in 
> progress, it may be terminated abnormally.
> 15/11/02 20:59:35 INFO spark.SecurityManager: Changing view acls to: root
> 15/11/02 20:59:35 INFO spark.SecurityManager: Changing modify acls to: root
> 15/11/02 20:59:35 INFO spark.SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: Set(root); users 
> with modify permissions: Set(root)
> 15/11/02 21:17:46 INFO master.Master: x.x.x.x:40954 got disassociated, 
> removing it.
> 15/11/02 21:17:46 WARN master.Master: Got status update for unknown executor 
> app-20151102205828-0026/3
> 15/11/02 21:17:46 WARN master.Master: Got status update for unknown executor 
> app-20151102205828-0026/1
> 15/11/02 21:17:46 WARN master.Master: Got status update for unknown executor 
> app-20151102205828-0026/0
> 15/11/02 21:17:46 WARN master.Master: Got status update for 

[jira] [Assigned] (SPARK-12200) pyspark.sql.types.Row should implement __contains__

2015-12-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12200:


Assignee: Apache Spark

> pyspark.sql.types.Row should implement __contains__
> ---
>
> Key: SPARK-12200
> URL: https://issues.apache.org/jira/browse/SPARK-12200
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Maciej Bryński
>Assignee: Apache Spark
>Priority: Minor
>
> The Row type should implement __contains__ so when people do this
> {code}
> 'field' in row
> {code}
> we will check row keys instead of values






[jira] [Assigned] (SPARK-12201) add type coercion rule for greatest/least

2015-12-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12201:


Assignee: (was: Apache Spark)

> add type coercion rule for greatest/least
> -
>
> Key: SPARK-12201
> URL: https://issues.apache.org/jira/browse/SPARK-12201
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>







[jira] [Commented] (SPARK-12201) add type coercion rule for greatest/least

2015-12-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15046611#comment-15046611
 ] 

Apache Spark commented on SPARK-12201:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/10196

> add type coercion rule for greatest/least
> -
>
> Key: SPARK-12201
> URL: https://issues.apache.org/jira/browse/SPARK-12201
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>







[jira] [Created] (SPARK-12202) Pass additional Scala REPL options to the underlying REPL

2015-12-08 Thread Iulian Dragos (JIRA)
Iulian Dragos created SPARK-12202:
-

 Summary: Pass additional Scala REPL options to the underlying REPL
 Key: SPARK-12202
 URL: https://issues.apache.org/jira/browse/SPARK-12202
 Project: Spark
  Issue Type: Improvement
  Components: Spark Shell
Reporter: Iulian Dragos


Sometimes it is useful to be able to pass Scala options to the underlying Spark 
REPL (like {{-target}} or {{-Xprint:parse}} when debugging). A simple way is to 
pass them through an additional environment variable {{SPARK_REPL_OPTS}}.







[jira] [Updated] (SPARK-12146) SparkR jsonFile should support multiple input files

2015-12-08 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-12146:

Issue Type: Sub-task  (was: Bug)
Parent: SPARK-12144

> SparkR jsonFile should support multiple input files
> ---
>
> Key: SPARK-12146
> URL: https://issues.apache.org/jira/browse/SPARK-12146
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Yanbo Liang
>
> This bug is easy to reproduce: jsonFile does not support a character vector as 
> its argument.
> {code}
> > path <- c("/path/to/dir1","/path/to/dir2")
> > raw.terror<-jsonFile(sqlContext,path)
> 15/12/03 15:59:55 ERROR RBackendHandler: jsonFile on 1 failed
> Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
>   java.io.IOException: No input paths specified in job
> at
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:201)
> at
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
> at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2
> {code}






[jira] [Commented] (SPARK-12045) Use joda's DateTime to replace Calendar

2015-12-08 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15046631#comment-15046631
 ] 

Sean Owen commented on SPARK-12045:
---

I'm generally pretty negative on punting on semantics questions with a flag. 
You end up having to support *both* semantics everywhere which sounds like a 
lot of potential bugs and questions. Or, one behavior becomes de facto 
unsupported and the flag is pointless.

> Use joda's DateTime to replace Calendar
> ---
>
> Key: SPARK-12045
> URL: https://issues.apache.org/jira/browse/SPARK-12045
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Jeff Zhang
>
> Currently Spark uses Calendar to build the Date when converting from string to 
> Date. But Calendar cannot detect invalid dates (e.g. 2011-02-29).
> Although we can use Calendar.setLenient(false) to make Calendar detect 
> invalid dates, the resulting error message is very confusing. So I 
> suggest using Joda's DateTime to replace Calendar. 
> Besides that, I found that there's already some format-checking logic when 
> casting string to date, and if the format is invalid, it returns None. I 
> don't think it makes sense to just return None without telling users. I think 
> it should throw an exception by default, and the user can set a property to 
> allow it to return None on an invalid format. 
> {code}
> if (i == 0 && j != 4) {
>   // year should have exact four digits
>   return None
> }
> {code}
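As an illustration of the behavior being asked for here (reject impossible dates such as 2011-02-29 instead of silently returning None), a small Python sketch; this is not Spark's cast logic, just the strict-versus-lenient contrast in plain Python:

{code}
from datetime import datetime

def parse_date_strict(s):
    """Raise a clear error on impossible dates such as 2011-02-29."""
    return datetime.strptime(s, "%Y-%m-%d").date()

def parse_date_lenient(s):
    """Mimic the 'just return None' behavior the reporter finds confusing."""
    try:
        return parse_date_strict(s)
    except ValueError:
        return None

print(parse_date_strict("2012-02-29"))   # valid leap day -> 2012-02-29
print(parse_date_lenient("2011-02-29"))  # None, with no hint about why
parse_date_strict("2011-02-29")          # ValueError: day is out of range for month
{code}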






[jira] [Commented] (SPARK-6918) Secure HBase with Kerberos does not work over YARN

2015-12-08 Thread mark vervuurt (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15046606#comment-15046606
 ] 

mark vervuurt commented on SPARK-6918:
--

I also doubt whether this issue has been resolved; we are getting a similar 
error when trying to read data from HBase using Spark 1.4.1 within 
a cluster secured with Kerberos.

> Secure HBase with Kerberos does not work over YARN
> --
>
> Key: SPARK-6918
> URL: https://issues.apache.org/jira/browse/SPARK-6918
> Project: Spark
>  Issue Type: New Feature
>  Components: YARN
>Affects Versions: 1.2.1, 1.3.0, 1.3.1
>Reporter: Dean Chen
>Assignee: Dean Chen
> Fix For: 1.4.0
>
>
> Attempts to access HBase from Spark executors will fail at the auth to the 
> metastore with: _GSSException: No valid credentials provided (Mechanism 
> level: Failed to find any Kerberos tgt)_
> This is because the HBase Kerberos auth token is not sent to the executors. We will 
> need something similar to obtainTokensForNamenodes (used for HDFS) in 
> yarn/Client.scala. Storm also needed something similar: 
> https://github.com/apache/storm/pull/226
> I've created a patch for this that required an HBase dependency in the YARN 
> module that we've been using successfully at eBay but am working on a version 
> that does not require the HBase dependency by calling the class loader. 
> Should be ready in a few days.






[jira] [Assigned] (SPARK-12201) add type coercion rule for greatest/least

2015-12-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12201:


Assignee: Apache Spark

> add type coercion rule for greatest/least
> -
>
> Key: SPARK-12201
> URL: https://issues.apache.org/jira/browse/SPARK-12201
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>







[jira] [Assigned] (SPARK-12200) pyspark.sql.types.Row should implement __contains__

2015-12-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12200:


Assignee: (was: Apache Spark)

> pyspark.sql.types.Row should implement __contains__
> ---
>
> Key: SPARK-12200
> URL: https://issues.apache.org/jira/browse/SPARK-12200
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Maciej Bryński
>Priority: Minor
>
> The Row type should implement __contains__ so when people do this
> {code}
> 'field' in row
> {code}
> we will check row keys instead of values






[jira] [Commented] (SPARK-12200) pyspark.sql.types.Row should implement __contains__

2015-12-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15046566#comment-15046566
 ] 

Apache Spark commented on SPARK-12200:
--

User 'maver1ck' has created a pull request for this issue:
https://github.com/apache/spark/pull/10194

> pyspark.sql.types.Row should implement __contains__
> ---
>
> Key: SPARK-12200
> URL: https://issues.apache.org/jira/browse/SPARK-12200
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Maciej Bryński
>Priority: Minor
>
> The Row type should implement __contains__ so when people do this
> {code}
> 'field' in row
> {code}
> we will check row keys instead of values






[jira] [Resolved] (SPARK-12192) recommendProductsForUsers(num: Int) can run OK when the dataset is small,but run wrong when large amount of data

2015-12-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-12192.
---
Resolution: Not A Problem

The error shows you that something in your code is attempting to write to a 
path that already exists. It's not a Spark issue.
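As an illustrative note (not from the original thread), the usual ways to avoid a FileAlreadyExistsException are to remove the output directory immediately before saving, or to write DataFrame output in overwrite mode. A hedged sketch; the path and the save_recommendations helper are made up for illustration:

{code}
import subprocess

def save_recommendations(rdd, output_path):
    """Remove any stale output directory, then save the RDD (sketch only)."""
    # -f keeps the command quiet if the directory is already gone.
    subprocess.call(["hdfs", "dfs", "-rm", "-r", "-f", output_path])
    rdd.saveAsTextFile(output_path)

# usage, with an illustrative path:
# save_recommendations(recommendations_rdd, "hdfs://namenode:9000/recommend/result/job/newb")

# Alternatively, if the results are written as a DataFrame:
# recommendations_df.write.mode("overwrite").parquet("hdfs://namenode:9000/recommend/result/job/newb")
{code}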

> recommendProductsForUsers(num: Int) can run OK when the dataset is small,but 
> run wrong when  large amount of data
> -
>
> Key: SPARK-12192
> URL: https://issues.apache.org/jira/browse/SPARK-12192
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.5.0
>Reporter: jinglei.chen
>
> recommendProductsForUsers(num: Int) runs OK when the dataset is small, but 
> fails on a large amount of data (about 1517459 users and 1190671 products). 
> Every time it fails, the error is: User class threw exception: 
> org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory 
> hdfs://recomm-30-1.liepin.inc:9000/recommend/result/job/newb already exists. 
> In fact, the output is deleted before I run ALS.
> my QQ: 604586220






[jira] [Comment Edited] (SPARK-6982) Data Frame and Spark SQL should allow filtering on key portion of incremental parquet files

2015-12-08 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15046556#comment-15046556
 ] 

Hyukjin Kwon edited comment on SPARK-6982 at 12/8/15 8:15 AM:
--

Would this be done now by partition key? 
http://spark.apache.org/docs/latest/sql-programming-guide.html#partition-discovery




was (Author: hyukjin.kwon):
Would this be done now by partition key [partition 
discovery](http://spark.apache.org/docs/latest/sql-programming-guide.html#partition-discovery)
 ?



> Data Frame and Spark SQL should allow filtering on key portion of incremental 
> parquet files
> ---
>
> Key: SPARK-6982
> URL: https://issues.apache.org/jira/browse/SPARK-6982
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 1.3.0
>Reporter: Brad Willard
>  Labels: dataframe, sql
>
> I'm working with a 2.4 billion dataset. I just converted it to use the 
> incremental schema features of parquet added in 1.3.0 where you save 
> incremental files with key=X.
> I'm currently saving files where the key is a timestamp like key=2015-01-01. 
> If I run a query, the key comes back as an attribute in the row. It would be 
> amazing to be able to do comparisons and filters on the key attribute to do 
> efficient queries between points in time and just skip the partitions of data 
> outside of a key range.
> df = sql_context.parquetFile('/super_large_dataset_over time')
> df.filter(df.key >= '2015-3-24').filter(df.key < '2015-04-01').count()
> That job could then skip large portions of the dataset very quickly even if 
> the entire parquet file contains years of data.
> Currently that will throw an error because key is not part of the parquet 
> schema even though it's returned in the rows.
> However, it does strangely work with the IN clause, which is my current workaround:
> df.where('key in ("2015-04-02", "2015-04-03")').count()
> Job aborted due to stage failure: Task 122 in stage 6.0 failed 100 times, 
> most recent failure: Lost task 122.99 in stage 6.0 (TID 39498, 
> ip-): java.lang.IllegalArgumentException: Column [key] was not 
> found in schema!
>   at parquet.Preconditions.checkArgument(Preconditions.java:47)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:172)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:160)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:142)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:76)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:41)
>   at parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:46)
>   at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:41)
>   at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:22)
>   at 
> parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:108)
>   at 
> parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:28)
>   at 
> parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:158)
>   at 
> parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:138)
>   at 
> org.apache.spark.rdd.NewHadoopRDD$$anon$1.(NewHadoopRDD.scala:133)
>   at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:104)
>   at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:66)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>   at 
> org.apache.spark.rdd.NewHadoopRDD$NewHadoopMapPartitionsWithSplitRDD.compute(NewHadoopRDD.scala:244)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>   at 

[jira] [Comment Edited] (SPARK-6982) Data Frame and Spark SQL should allow filtering on key portion of incremental parquet files

2015-12-08 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15046556#comment-15046556
 ] 

Hyukjin Kwon edited comment on SPARK-6982 at 12/8/15 8:15 AM:
--

Would this be done now by partition key [partition 
discovery](http://spark.apache.org/docs/latest/sql-programming-guide.html#partition-discovery)
 ?




was (Author: hyukjin.kwon):
Would this be done now by partition key [partition 
discovery](http://spark.apache.org/docs/latest/sql-programming-guide.html#partition-discovery)?



> Data Frame and Spark SQL should allow filtering on key portion of incremental 
> parquet files
> ---
>
> Key: SPARK-6982
> URL: https://issues.apache.org/jira/browse/SPARK-6982
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 1.3.0
>Reporter: Brad Willard
>  Labels: dataframe, sql
>
> I'm working with a 2.4 billion dataset. I just converted it to use the 
> incremental schema features of parquet added in 1.3.0 where you save 
> incremental files with key=X.
> I'm currently saving files where the key is a timestamp like key=2015-01-01. 
> If I run a query, the key comes back as an attribute in the row. It would be 
> amazing to be able to do comparisons and filters on the key attribute to do 
> efficient queries between points in time and just skip the partitions of data 
> outside of a key range.
> df = sql_context.parquetFile('/super_large_dataset_over time')
> df.filter(df.key >= '2015-3-24').filter(df.key < '2015-04-01').count()
> That job could then skip large portions of the dataset very quickly even if 
> the entire parquet file contains years of data.
> Currently that will throw an error because key is not part of the parquet 
> schema even though it's returned in the rows.
> However, it does strangely work with the IN clause, which is my current workaround:
> df.where('key in ("2015-04-02", "2015-04-03")').count()
> Job aborted due to stage failure: Task 122 in stage 6.0 failed 100 times, 
> most recent failure: Lost task 122.99 in stage 6.0 (TID 39498, 
> ip-): java.lang.IllegalArgumentException: Column [key] was not 
> found in schema!
>   at parquet.Preconditions.checkArgument(Preconditions.java:47)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:172)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:160)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:142)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:76)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:41)
>   at parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:46)
>   at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:41)
>   at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:22)
>   at 
> parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:108)
>   at 
> parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:28)
>   at 
> parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:158)
>   at 
> parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:138)
>   at 
> org.apache.spark.rdd.NewHadoopRDD$$anon$1.(NewHadoopRDD.scala:133)
>   at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:104)
>   at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:66)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>   at 
> org.apache.spark.rdd.NewHadoopRDD$NewHadoopMapPartitionsWithSplitRDD.compute(NewHadoopRDD.scala:244)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>   at 
> 

[jira] [Created] (SPARK-12200) pyspark.sql.types.Row should implement __contains__

2015-12-08 Thread JIRA
Maciej Bryński created SPARK-12200:
--

 Summary: pyspark.sql.types.Row should implement __contains__
 Key: SPARK-12200
 URL: https://issues.apache.org/jira/browse/SPARK-12200
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 1.6.0
Reporter: Maciej Bryński
Priority: Minor


The Row type should implement __contains__ so when people do this
{code}
'field' in row
{code}
we will check row keys instead of values
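As an illustrative sketch of the requested semantics (not the actual pyspark.sql.types.Row source): a tuple subclass whose `in` operator checks field names rather than values, while access by name still returns the value:

{code}
class Row(tuple):
    """Minimal Row-like sketch: keyword fields stored as a tuple, sorted by name."""

    def __new__(cls, **kwargs):
        names = sorted(kwargs)
        row = tuple.__new__(cls, [kwargs[n] for n in names])
        row.__fields__ = names
        return row

    def __contains__(self, item):
        # 'field' in row tests the column names ...
        return item in self.__fields__

    def __getitem__(self, item):
        # ... while row['field'] still returns the value.
        if isinstance(item, str):
            return tuple.__getitem__(self, self.__fields__.index(item))
        return tuple.__getitem__(self, item)

row = Row(name="Alice", age=11)
print("name" in row)   # True  -- key lookup, the requested behavior
print("Alice" in row)  # False -- values are no longer what `in` checks
print(row["age"])      # 11
{code}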






[jira] [Assigned] (SPARK-10123) Cannot set "--deploy-mode" in default configuration

2015-12-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10123:


Assignee: (was: Apache Spark)

> Cannot set "--deploy-mode" in default configuration
> ---
>
> Key: SPARK-10123
> URL: https://issues.apache.org/jira/browse/SPARK-10123
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> There's no configuration option that is the equivalent of "--deploy-mode". So 
> it's not possible, for example, to have applications be submitted in 
> standalone cluster mode by default - you have to always use the command line 
> argument for that.
> YARN is special because it has the (somewhat deprecated) "yarn-cluster" 
> master, but it would be nice to be consistent and have a proper config option 
> for this.






[jira] [Assigned] (SPARK-10123) Cannot set "--deploy-mode" in default configuration

2015-12-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10123:


Assignee: Apache Spark

> Cannot set "--deploy-mode" in default configuration
> ---
>
> Key: SPARK-10123
> URL: https://issues.apache.org/jira/browse/SPARK-10123
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Marcelo Vanzin
>Assignee: Apache Spark
>Priority: Minor
>
> There's no configuration option that is the equivalent of "--deploy-mode". So 
> it's not possible, for example, to have applications be submitted in 
> standalone cluster mode by default - you have to always use the command line 
> argument for that.
> YARN is special because it has the (somewhat deprecated) "yarn-cluster" 
> master, but it would be nice to be consistent and have a proper config option 
> for this.






[jira] [Commented] (SPARK-12178) Expose reporting of StreamInputInfo for custom made streams

2015-12-08 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15046602#comment-15046602
 ] 

Saisai Shao commented on SPARK-12178:
-

It is a good idea to make this generic if there are more direct streams other than 
Kafka. I thought about this when implementing the InputInfoTracker, but at 
that time there was only one special case (the Kafka direct stream).

> Expose reporting of StreamInputInfo for custom made streams
> ---
>
> Key: SPARK-12178
> URL: https://issues.apache.org/jira/browse/SPARK-12178
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Rodrigo Boavida
>Priority: Minor
>
> For custom made direct streams, the Spark Streaming context needs to be 
> informed of the RDD count per batch execution. This is not exposed by the 
> InputDStream abstract class. 
> The suggestion is to create a method in the InputDStream class that reports 
> to the streaming context and make that available to child classes of 
> InputDStream.
> Signature example:
> def reportInfo(validTime : org.apache.spark.streaming.Time, inputInfo : 
> org.apache.spark.streaming.scheduler.StreamInputInfo)
> I have already done this on my own private branch. I can merge that change in 
> if approval is given.






[jira] [Created] (SPARK-12201) add type coercion rule for greatest/least

2015-12-08 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-12201:
---

 Summary: add type coercion rule for greatest/least
 Key: SPARK-12201
 URL: https://issues.apache.org/jira/browse/SPARK-12201
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Wenchen Fan









[jira] [Created] (SPARK-12203) Add KafkaDirectInputDStream that directly pulls messages from Kafka Brokers using receivers

2015-12-08 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-12203:
---

 Summary: Add KafkaDirectInputDStream that directly pulls messages 
from Kafka Brokers using receivers
 Key: SPARK-12203
 URL: https://issues.apache.org/jira/browse/SPARK-12203
 Project: Spark
  Issue Type: New Feature
  Components: Streaming
Reporter: Liang-Chi Hsieh


Currently, we have DirectKafkaInputDStream, which directly pulls messages from 
Kafka brokers without any receivers, and KafkaInputDStream, which pulls 
messages from a Kafka broker using a receiver with ZooKeeper.

As we observed, because DirectKafkaInputDStream retrieves messages from Kafka 
after each batch finishes, it incurs extra latency compared with KafkaInputDStream, 
which continues to pull messages during each batch window.

So we want to add a KafkaDirectInputDStream that directly pulls messages from 
Kafka brokers like DirectKafkaInputDStream, but uses receivers like 
KafkaInputDStream and pulls messages during each batch window.






[jira] [Assigned] (SPARK-12202) Pass additional Scala REPL options to the underlying REPL (2.11)

2015-12-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12202:


Assignee: (was: Apache Spark)

> Pass additional Scala REPL options to the underlying REPL (2.11)
> 
>
> Key: SPARK-12202
> URL: https://issues.apache.org/jira/browse/SPARK-12202
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell
>Reporter: Iulian Dragos
>
> Sometimes it is useful to be able to pass Scala options to the underlying 
> Spark REPL (like {{-target}} or {{-Xprint:parse}} when debugging). 
> Currently, only the 2.10 version of the REPL allows that (as normal arguments 
> to the spark-shell command)






[jira] [Commented] (SPARK-12202) Pass additional Scala REPL options to the underlying REPL (2.11)

2015-12-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15046752#comment-15046752
 ] 

Apache Spark commented on SPARK-12202:
--

User 'dragos' has created a pull request for this issue:
https://github.com/apache/spark/pull/10199

> Pass additional Scala REPL options to the underlying REPL (2.11)
> 
>
> Key: SPARK-12202
> URL: https://issues.apache.org/jira/browse/SPARK-12202
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell
>Reporter: Iulian Dragos
>
> Sometimes it is useful to be able to pass Scala options to the underlying 
> Spark REPL (like {{-target}} or {{-Xprint:parse}} when debugging). 
> Currently, only the 2.10 version of the REPL allows that (as normal arguments 
> to the spark-shell command)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11439) Optimization of creating sparse feature without dense one

2015-12-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-11439.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 9756
[https://github.com/apache/spark/pull/9756]

> Optimization of creating sparse feature without dense one
> -
>
> Key: SPARK-11439
> URL: https://issues.apache.org/jira/browse/SPARK-11439
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Kai Sasaki
>Priority: Minor
> Fix For: 2.0.0
>
>
> Currently, generating sparse features in {{LinearDataGenerator}} requires 
> creating dense vectors first. It would be more efficient to avoid generating 
> the dense vectors when only sparse features are needed.
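
A minimal sketch of the idea (illustrative values only, not the generator's 
actual code): build the sparse vector directly from its non-zero entries 
instead of densifying first.

{code}
import org.apache.spark.mllib.linalg.Vectors

// Current pattern: materialize a dense vector even when most entries are zero.
val dense = Vectors.dense(0.0, 3.0, 0.0, 0.0, 7.0)

// Proposed pattern: build the sparse vector directly from the non-zero
// entries, skipping the intermediate dense allocation.
val sparse = Vectors.sparse(5, Array(1, 4), Array(3.0, 7.0))
{code}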



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12137) Spark Streaming State Recovery limitations

2015-12-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-12137.
---
Resolution: Not A Problem

> Spark Streaming State Recovery limitations
> --
>
> Key: SPARK-12137
> URL: https://issues.apache.org/jira/browse/SPARK-12137
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.4.1
>Reporter: Ravindar
>Priority: Critical
>
> There were multiple forum threads asking a similar question without a clear 
> answer, hence entering it here.
> We have a streaming application that goes through multi-step processing. In 
> some of these steps, stateful operations like *updateStateByKey* are used to 
> maintain an accumulated running state (and other state info) over incoming 
> RDD streams. As the streaming application is incremental, it is imperative 
> that we can recover/restore the previously known state in the following two 
> scenarios:
>   1. Spark driver/streaming application failure.
>  In this scenario the driver/streaming application is shut down and 
> restarted. The recommended approach is to enable *checkpoint(checkpointDir)* 
> and use *StreamingContext.getOrCreate* to restore the context from checkpoint 
> state.
>   2. Upgrading the driver/streaming application with additional processing 
> steps.
>  In this scenario, we introduced new downstream processing steps for new 
> functionality without changing the existing steps. Upgrading the streaming 
> application with the new code fails in *StreamingContext.getOrCreate* because 
> of a mismatch with the saved checkpoint.
> Both of the above scenarios need a unified approach where accumulated state 
> can be saved and restored. The first approach of restoring from the checkpoint 
> works for driver failure but not for a code upgrade. When the application code 
> changes, the recommendation is to delete the checkpoint data when the new code 
> is deployed. If so, how do you reconstitute all of the stateful (e.g. 
> updateStateByKey) information from the last run? Every streaming application 
> would have to save up-to-date state for each session, keyed by session, and 
> then initialize from it when a new session starts for the same key. Does every 
> application have to create its own mechanism, given that this is very similar 
> to the current state checkpointing to HDFS? 
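
For scenario 1 above, a minimal sketch of the checkpoint-based recovery pattern 
(checkpoint path, application name and batch interval are placeholders, not 
taken from this issue):

{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///checkpoints/my-app"   // placeholder path

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("stateful-app")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint(checkpointDir)
  // Build the DStream graph here, including stateful steps such as
  // updateStateByKey; this graph is what gets serialized into the checkpoint.
  ssc
}

// On a clean start this calls createContext(); after a driver failure it
// rebuilds the context (and the accumulated state) from the checkpoint.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()
{code}

As the description notes, this covers driver failure but not a code upgrade, 
because the upgraded DStream graph no longer matches the one serialized in the 
checkpoint.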



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11439) Optimization of creating sparse feature without dense one

2015-12-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11439:
--
Assignee: Nakul Jindal

> Optimization of creating sparse feature without dense one
> -
>
> Key: SPARK-11439
> URL: https://issues.apache.org/jira/browse/SPARK-11439
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Kai Sasaki
>Assignee: Nakul Jindal
>Priority: Minor
> Fix For: 2.0.0
>
>
> Currently, generating sparse features in {{LinearDataGenerator}} requires 
> creating dense vectors first. It would be more efficient to avoid generating 
> the dense vectors when only sparse features are needed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12203) Add KafkaDirectInputDStream that directly pulls messages from Kafka Brokers using receivers

2015-12-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12203:


Assignee: Apache Spark

> Add KafkaDirectInputDStream that directly pulls messages from Kafka Brokers 
> using receivers
> ---
>
> Key: SPARK-12203
> URL: https://issues.apache.org/jira/browse/SPARK-12203
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Liang-Chi Hsieh
>Assignee: Apache Spark
>
> Currently, we have DirectKafkaInputDStream, which pulls messages directly 
> from Kafka brokers without any receivers, and KafkaInputDStream, which pulls 
> messages from Kafka brokers through a receiver backed by ZooKeeper.
> As we observed, because DirectKafkaInputDStream only retrieves messages from 
> Kafka after each batch finishes, it incurs extra latency compared with 
> KafkaInputDStream, which keeps pulling messages during each batch window.
> So we propose to add KafkaDirectInputDStream, which pulls messages directly 
> from Kafka brokers like DirectKafkaInputDStream, but uses receivers like 
> KafkaInputDStream and keeps pulling messages during each batch window.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12202) Pass additional Scala REPL options to the underlying REPL (2.11)

2015-12-08 Thread Iulian Dragos (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Iulian Dragos updated SPARK-12202:
--
Description: 
Sometimes it is useful to be able to pass Scala options to the underlying Spark 
REPL (like {{-target}} or {{-Xprint:parse}} when debugging). 

Currently, only the 2.10 version of the REPL allows that (as normal arguments 
to the spark-shell command)

  was:
Sometimes it is useful to be able to pass Scala options to the underlying Spark 
REPL (like {{-target}} or {{-Xprint:parse}} when debugging). A simple way is to 
pass them through an additional environment variable {{SPARK_REPL_OPTS}}.


Summary: Pass additional Scala REPL options to the underlying REPL 
(2.11)  (was: Pass additional Scala REPL options to the underlying REPL)

> Pass additional Scala REPL options to the underlying REPL (2.11)
> 
>
> Key: SPARK-12202
> URL: https://issues.apache.org/jira/browse/SPARK-12202
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell
>Reporter: Iulian Dragos
>
> Sometimes it is useful to be able to pass Scala options to the underlying 
> Spark REPL (like {{-target}} or {{-Xprint:parse}} when debugging). 
> Currently, only the 2.10 version of the REPL allows that (as normal arguments 
> to the spark-shell command)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12204) Implement drop method for DataFrame in SparkR

2015-12-08 Thread Sun Rui (JIRA)
Sun Rui created SPARK-12204:
---

 Summary: Implement drop method for DataFrame in SparkR
 Key: SPARK-12204
 URL: https://issues.apache.org/jira/browse/SPARK-12204
 Project: Spark
  Issue Type: New Feature
  Components: SparkR
Affects Versions: 1.5.2
Reporter: Sun Rui






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12166) Unset hadoop related environment in testing

2015-12-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-12166.
---
   Resolution: Fixed
Fix Version/s: 1.6.1
   2.0.0

Issue resolved by pull request 10172
[https://github.com/apache/spark/pull/10172]

> Unset hadoop related environment in testing 
> 
>
> Key: SPARK-12166
> URL: https://issues.apache.org/jira/browse/SPARK-12166
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 1.5.2
>Reporter: Jeff Zhang
>Priority: Minor
> Fix For: 2.0.0, 1.6.1
>
>
> I tried to run HiveSparkSubmitSuite on my local box, but it fails. The cause 
> is that Spark is still using my local single-node Hadoop cluster when running 
> the unit test. I don't think it makes sense to do that. These environment 
> variables should be unset before testing, and I suspect dev/run-tests does 
> not do that either. 
> Here's the error message:
> {code}
> Cause: java.lang.RuntimeException: java.lang.RuntimeException: The root 
> scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: 
> rwxr-xr-x
> [info]   at 
> org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
> [info]   at 
> org.apache.spark.sql.hive.client.ClientWrapper.(ClientWrapper.scala:171)
> [info]   at 
> org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:162)
> [info]   at 
> org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:160)
> {code}
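
For illustration, one way to realize "these environment variables should be 
unset" when a suite forks spark-submit as a child process; this is a sketch 
with a hypothetical helper name, not the actual patch:

{code}
import scala.collection.JavaConverters._

// Hypothetical helper: run a command with Hadoop-related variables removed from
// its environment, so the forked process cannot pick up a local cluster config.
def runWithoutHadoopEnv(cmd: Seq[String]): Int = {
  val pb = new ProcessBuilder(cmd.asJava)
  Seq("HADOOP_HOME", "HADOOP_CONF_DIR", "YARN_CONF_DIR")
    .foreach(k => pb.environment().remove(k))
  pb.inheritIO().start().waitFor()
}
{code}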



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12166) Unset hadoop related environment in testing

2015-12-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12166:
--
Assignee: Jeff Zhang

> Unset hadoop related environment in testing 
> 
>
> Key: SPARK-12166
> URL: https://issues.apache.org/jira/browse/SPARK-12166
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 1.5.2
>Reporter: Jeff Zhang
>Assignee: Jeff Zhang
>Priority: Minor
> Fix For: 1.6.1, 2.0.0
>
>
> I tried to run HiveSparkSubmitSuite on my local box, but it fails. The cause 
> is that Spark is still using my local single-node Hadoop cluster when running 
> the unit test. I don't think it makes sense to do that. These environment 
> variables should be unset before testing, and I suspect dev/run-tests does 
> not do that either. 
> Here's the error message:
> {code}
> Cause: java.lang.RuntimeException: java.lang.RuntimeException: The root 
> scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: 
> rwxr-xr-x
> [info]   at 
> org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
> [info]   at 
> org.apache.spark.sql.hive.client.ClientWrapper.(ClientWrapper.scala:171)
> [info]   at 
> org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:162)
> [info]   at 
> org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:160)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12204) Implement drop method for DataFrame in SparkR

2015-12-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12204:


Assignee: Apache Spark

> Implement drop method for DataFrame in SparkR
> -
>
> Key: SPARK-12204
> URL: https://issues.apache.org/jira/browse/SPARK-12204
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Affects Versions: 1.5.2
>Reporter: Sun Rui
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12103) Clarify documentation of KafkaUtils createStream with multiple topics

2015-12-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12103:
--
Summary: Clarify documentation of KafkaUtils createStream with multiple 
topics  (was: KafkaUtils createStream with multiple topics -- does not work as 
expected)

> Clarify documentation of KafkaUtils createStream with multiple topics
> -
>
> Key: SPARK-12103
> URL: https://issues.apache.org/jira/browse/SPARK-12103
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Streaming
>Affects Versions: 1.4.1
>Reporter: Dan Dutrow
>Assignee: Cody Koeninger
>Priority: Minor
> Fix For: 1.6.1, 2.0.0
>
>
> (Note: yes, there is a Direct API that may be better, but it's not the 
> easiest thing to get started with. The Kafka Receiver API still needs to 
> work, especially for newcomers)
> When creating a receiver stream using KafkaUtils, there is a valid use case 
> where you would want to use one (or a few) Kafka streaming receivers to pool 
> resources. I have 10+ topics and don't want to dedicate 10 cores to 
> processing all of them. However, when reading the data produced by 
> KafkaUtils.createStream, the DStream[(String,String)] does not properly 
> insert the topic name into the tuple. The left key is always null, making it 
> impossible to know which topic the data came from other than stashing the 
> key into the value. Is there a way around that problem?
>  CODE
> val topics = Map("topicA" -> 1, "topicB" -> 1, "topicC" -> 1, "topicD" -> 1, 
> "topicE" -> 1, "topicF" -> 1, ...)
> val streams : IndexedSeq[ReceiverInputDStream[(String,String)]] = (1 to 3).map( 
> i =>
>   KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
> ssc, consumerProperties,
> topics,
> StorageLevel.MEMORY_ONLY_SER))
> val unioned :DStream[(String,String)] = ssc.union(streams)
> unioned.flatMap(x => {
>val (key, value) = x
>   // key is always null!
>   // value has data from any one of my topics
>   key match ... {
>   ..
>   }
> }
>  END CODE
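
On the "way around that problem" question: the receiver API does not expose the 
topic, but if the direct API is an option, a message handler can keep the topic 
from MessageAndMetadata. A sketch only (assuming an existing StreamingContext 
{{ssc}}; broker address, topics and starting offsets are placeholders):

{code}
import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// This overload requires explicit starting offsets; starting from 0 here is
// purely for illustration.
val fromOffsets: Map[TopicAndPartition, Long] =
  Map(TopicAndPartition("topicA", 0) -> 0L, TopicAndPartition("topicB", 0) -> 0L)

// Keep the topic name alongside the payload instead of discarding it.
val messageHandler =
  (mmd: MessageAndMetadata[String, String]) => (mmd.topic, mmd.message())

val withTopics = KafkaUtils.createDirectStream[
    String, String, StringDecoder, StringDecoder, (String, String)](
  ssc, Map("metadata.broker.list" -> "broker:9092"), fromOffsets, messageHandler)
{code}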



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12103) KafkaUtils createStream with multiple topics -- does not work as expected

2015-12-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12103:
--
Assignee: Cody Koeninger

> KafkaUtils createStream with multiple topics -- does not work as expected
> -
>
> Key: SPARK-12103
> URL: https://issues.apache.org/jira/browse/SPARK-12103
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Streaming
>Affects Versions: 1.4.1
>Reporter: Dan Dutrow
>Assignee: Cody Koeninger
>Priority: Minor
> Fix For: 1.6.1, 2.0.0
>
>
> (Note: yes, there is a Direct API that may be better, but it's not the 
> easiest thing to get started with. The Kafka Receiver API still needs to 
> work, especially for newcomers)
> When creating a receiver stream using KafkaUtils, there is a valid use case 
> where you would want to use one (or a few) Kafka streaming receivers to pool 
> resources. I have 10+ topics and don't want to dedicate 10 cores to 
> processing all of them. However, when reading the data produced by 
> KafkaUtils.createStream, the DStream[(String,String)] does not properly 
> insert the topic name into the tuple. The left key is always null, making it 
> impossible to know which topic the data came from other than stashing the 
> key into the value. Is there a way around that problem?
>  CODE
> val topics = Map("topicA" -> 1, "topicB" -> 1, "topicC" -> 1, "topicD" -> 1, 
> "topicE" -> 1, "topicF" -> 1, ...)
> val streams : IndexedSeq[ReceiverInputDStream[(String,String)]] = (1 to 3).map( 
> i =>
>   KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
> ssc, consumerProperties,
> topics,
> StorageLevel.MEMORY_ONLY_SER))
> val unioned :DStream[(String,String)] = ssc.union(streams)
> unioned.flatMap(x => {
>val (key, value) = x
>   // key is always null!
>   // value has data from any one of my topics
>   key match ... {
>   ..
>   }
> }
>  END CODE



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11551) Replace example code in ml-features.md using include_example

2015-12-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15046763#comment-15046763
 ] 

Apache Spark commented on SPARK-11551:
--

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/10200

> Replace example code in ml-features.md using include_example
> 
>
> Key: SPARK-11551
> URL: https://issues.apache.org/jira/browse/SPARK-11551
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Xusen Yin
>Assignee: somil deshmukh
>  Labels: starter
> Fix For: 1.6.1
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12156) SPARK_EXECUTOR_INSTANCES is not effective

2015-12-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-12156.
---
Resolution: Won't Fix

> SPARK_EXECUTOR_INSTANCES  is not effective
> --
>
> Key: SPARK-12156
> URL: https://issues.apache.org/jira/browse/SPARK-12156
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.2
>Reporter: KaiXinXIaoLei
>Priority: Minor
>
> I set SPARK_EXECUTOR_INSTANCES=3, but only two executors start.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12127) Spark sql Support Cross join with Mongo DB

2015-12-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-12127.
---
  Resolution: Duplicate
Target Version/s:   (was: 1.4.1)

> Spark sql Support Cross join with Mongo DB
> --
>
> Key: SPARK-12127
> URL: https://issues.apache.org/jira/browse/SPARK-12127
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core, SQL
>Affects Versions: 1.4.1, 1.5.2
> Environment: Linux Ubuntu 14.4
>Reporter: himanshu singhal
>   Original Estimate: 1,680h
>  Remaining Estimate: 1,680h
>
> I am using Spark SQL to perform various operations, but MongoDB is not 
> giving the correct result on version 1.4.0 and gives an error on 
> version 1.5.2.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12203) Add KafkaDirectInputDStream that directly pulls messages from Kafka Brokers using receivers

2015-12-08 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15046711#comment-15046711
 ] 

Sean Owen commented on SPARK-12203:
---

Can you just maintain this implementation for your own purposes to start?

> Add KafkaDirectInputDStream that directly pulls messages from Kafka Brokers 
> using receivers
> ---
>
> Key: SPARK-12203
> URL: https://issues.apache.org/jira/browse/SPARK-12203
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Liang-Chi Hsieh
>
> Currently, we have DirectKafkaInputDStream, which pulls messages directly 
> from Kafka brokers without any receivers, and KafkaInputDStream, which pulls 
> messages from Kafka brokers through a receiver backed by ZooKeeper.
> As we observed, because DirectKafkaInputDStream only retrieves messages from 
> Kafka after each batch finishes, it incurs extra latency compared with 
> KafkaInputDStream, which keeps pulling messages during each batch window.
> So we propose to add KafkaDirectInputDStream, which pulls messages directly 
> from Kafka brokers like DirectKafkaInputDStream, but uses receivers like 
> KafkaInputDStream and keeps pulling messages during each batch window.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12203) Add KafkaDirectInputDStream that directly pulls messages from Kafka Brokers using receivers

2015-12-08 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15046708#comment-15046708
 ] 

Liang-Chi Hsieh commented on SPARK-12203:
-

We need the exactly-once behavior of DirectKafkaInputDStream without its 
latency. Neither of the two current implementations satisfies that need.

> Add KafkaDirectInputDStream that directly pulls messages from Kafka Brokers 
> using receivers
> ---
>
> Key: SPARK-12203
> URL: https://issues.apache.org/jira/browse/SPARK-12203
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Liang-Chi Hsieh
>
> Currently, we have DirectKafkaInputDStream, which pulls messages directly 
> from Kafka brokers without any receivers, and KafkaInputDStream, which pulls 
> messages from Kafka brokers through a receiver backed by ZooKeeper.
> As we observed, because DirectKafkaInputDStream only retrieves messages from 
> Kafka after each batch finishes, it incurs extra latency compared with 
> KafkaInputDStream, which keeps pulling messages during each batch window.
> So we propose to add KafkaDirectInputDStream, which pulls messages directly 
> from Kafka brokers like DirectKafkaInputDStream, but uses receivers like 
> KafkaInputDStream and keeps pulling messages during each batch window.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11652) Remote code execution with InvokerTransformer

2015-12-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15046738#comment-15046738
 ] 

Apache Spark commented on SPARK-11652:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/10198

> Remote code execution with InvokerTransformer
> -
>
> Key: SPARK-11652
> URL: https://issues.apache.org/jira/browse/SPARK-11652
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Daniel Darabos
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 1.4.2, 1.5.3, 1.6.0
>
>
> There is a remote code execution vulnerability in the Apache Commons 
> collections library (https://issues.apache.org/jira/browse/COLLECTIONS-580) 
> that can be exploited simply by causing malicious data to be deserialized 
> using Java serialization.
> As Spark is used in security-conscious environments I think it's worth taking 
> a closer look at how the vulnerability affects Spark. What are the points 
> where Spark deserializes external data? Which are affected by using Kryo 
> instead of Java serialization? What mitigation strategies are available?
> If the issue is serious enough but mitigation is possible, it may be useful 
> to post about it on the mailing list or blog.
> Thanks!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12202) Pass additional Scala REPL options to the underlying REPL (2.11)

2015-12-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12202:


Assignee: Apache Spark

> Pass additional Scala REPL options to the underlying REPL (2.11)
> 
>
> Key: SPARK-12202
> URL: https://issues.apache.org/jira/browse/SPARK-12202
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell
>Reporter: Iulian Dragos
>Assignee: Apache Spark
>
> Sometimes it is useful to be able to pass Scala options to the underlying 
> Spark REPL (like {{-target}} or {{-Xprint:parse}} when debugging). 
> Currently, only the 2.10 version of the REPL allows that (as normal arguments 
> to the spark-shell command)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12203) Add KafkaDirectInputDStream that directly pulls messages from Kafka Brokers using receivers

2015-12-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15046678#comment-15046678
 ] 

Apache Spark commented on SPARK-12203:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/10197

> Add KafkaDirectInputDStream that directly pulls messages from Kafka Brokers 
> using receivers
> ---
>
> Key: SPARK-12203
> URL: https://issues.apache.org/jira/browse/SPARK-12203
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Liang-Chi Hsieh
>
> Currently, we have DirectKafkaInputDStream, which pulls messages directly 
> from Kafka brokers without any receivers, and KafkaInputDStream, which pulls 
> messages from Kafka brokers through a receiver backed by ZooKeeper.
> As we observed, because DirectKafkaInputDStream only retrieves messages from 
> Kafka after each batch finishes, it incurs extra latency compared with 
> KafkaInputDStream, which keeps pulling messages during each batch window.
> So we propose to add KafkaDirectInputDStream, which pulls messages directly 
> from Kafka brokers like DirectKafkaInputDStream, but uses receivers like 
> KafkaInputDStream and keeps pulling messages during each batch window.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12203) Add KafkaDirectInputDStream that directly pulls messages from Kafka Brokers using receivers

2015-12-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12203:


Assignee: (was: Apache Spark)

> Add KafkaDirectInputDStream that directly pulls messages from Kafka Brokers 
> using receivers
> ---
>
> Key: SPARK-12203
> URL: https://issues.apache.org/jira/browse/SPARK-12203
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Liang-Chi Hsieh
>
> Currently, we have DirectKafkaInputDStream, which pulls messages directly 
> from Kafka brokers without any receivers, and KafkaInputDStream, which pulls 
> messages from Kafka brokers through a receiver backed by ZooKeeper.
> As we observed, because DirectKafkaInputDStream only retrieves messages from 
> Kafka after each batch finishes, it incurs extra latency compared with 
> KafkaInputDStream, which keeps pulling messages during each batch window.
> So we propose to add KafkaDirectInputDStream, which pulls messages directly 
> from Kafka brokers like DirectKafkaInputDStream, but uses receivers like 
> KafkaInputDStream and keeps pulling messages during each batch window.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12203) Add KafkaDirectInputDStream that directly pulls messages from Kafka Brokers using receivers

2015-12-08 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15046679#comment-15046679
 ] 

Sean Owen commented on SPARK-12203:
---

Is this really worth a third implementation? It seems to have the downsides of 
both current implementations for not much gain.

> Add KafkaDirectInputDStream that directly pulls messages from Kafka Brokers 
> using receivers
> ---
>
> Key: SPARK-12203
> URL: https://issues.apache.org/jira/browse/SPARK-12203
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Liang-Chi Hsieh
>
> Currently, we have DirectKafkaInputDStream, which pulls messages directly 
> from Kafka brokers without any receivers, and KafkaInputDStream, which pulls 
> messages from Kafka brokers through a receiver backed by ZooKeeper.
> As we observed, because DirectKafkaInputDStream only retrieves messages from 
> Kafka after each batch finishes, it incurs extra latency compared with 
> KafkaInputDStream, which keeps pulling messages during each batch window.
> So we propose to add KafkaDirectInputDStream, which pulls messages directly 
> from Kafka brokers like DirectKafkaInputDStream, but uses receivers like 
> KafkaInputDStream and keeps pulling messages during each batch window.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12103) KafkaUtils createStream with multiple topics -- does not work as expected

2015-12-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-12103.
---
   Resolution: Fixed
Fix Version/s: 1.6.1
   2.0.0

Issue resolved by pull request 10132
[https://github.com/apache/spark/pull/10132]

> KafkaUtils createStream with multiple topics -- does not work as expected
> -
>
> Key: SPARK-12103
> URL: https://issues.apache.org/jira/browse/SPARK-12103
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Streaming
>Affects Versions: 1.4.1
>Reporter: Dan Dutrow
>Priority: Minor
> Fix For: 2.0.0, 1.6.1
>
>
> (Note: yes, there is a Direct API that may be better, but it's not the 
> easiest thing to get started with. The Kafka Receiver API still needs to 
> work, especially for newcomers)
> When creating a receiver stream using KafkaUtils, there is a valid use case 
> where you would want to use one (or a few) Kafka streaming receivers to pool 
> resources. I have 10+ topics and don't want to dedicate 10 cores to 
> processing all of them. However, when reading the data produced by 
> KafkaUtils.createStream, the DStream[(String,String)] does not properly 
> insert the topic name into the tuple. The left key is always null, making it 
> impossible to know which topic the data came from other than stashing the 
> key into the value. Is there a way around that problem?
>  CODE
> val topics = Map("topicA" -> 1, "topicB" -> 1, "topicC" -> 1, "topicD" -> 1, 
> "topicE" -> 1, "topicF" -> 1, ...)
> val streams : IndexedSeq[ReceiverInputDStream[(String,String)]] = (1 to 3).map( 
> i =>
>   KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
> ssc, consumerProperties,
> topics,
> StorageLevel.MEMORY_ONLY_SER))
> val unioned :DStream[(String,String)] = ssc.union(streams)
> unioned.flatMap(x => {
>val (key, value) = x
>   // key is always null!
>   // value has data from any one of my topics
>   key match ... {
>   ..
>   }
> }
>  END CODE



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12204) Implement drop method for DataFrame in SparkR

2015-12-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15046792#comment-15046792
 ] 

Apache Spark commented on SPARK-12204:
--

User 'sun-rui' has created a pull request for this issue:
https://github.com/apache/spark/pull/10201

> Implement drop method for DataFrame in SparkR
> -
>
> Key: SPARK-12204
> URL: https://issues.apache.org/jira/browse/SPARK-12204
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Affects Versions: 1.5.2
>Reporter: Sun Rui
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12204) Implement drop method for DataFrame in SparkR

2015-12-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12204:


Assignee: (was: Apache Spark)

> Implement drop method for DataFrame in SparkR
> -
>
> Key: SPARK-12204
> URL: https://issues.apache.org/jira/browse/SPARK-12204
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Affects Versions: 1.5.2
>Reporter: Sun Rui
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-11193) Spark 1.5+ Kinesis Streaming - ClassCastException when starting KinesisReceiver

2015-12-08 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-11193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15046079#comment-15046079
 ] 

Jean-Baptiste Onofré edited comment on SPARK-11193 at 12/8/15 4:06 PM:
---

To reproduce the issue, I added the following test in KryoSerializerSuite:

{code}
  test("Bug: SPARK-11193") {
val ser = new 
KryoSerializer(conf.clone.set("spark.kryo.registrationRequired", "false"))
  .newInstance()

val myMap: mutable.HashMap[String, String] = new mutable.HashMap[String, 
String]
  with mutable.SynchronizedMap[String, String]
myMap.put("foo", "bar")
val myMapBytes = ser.serialize(myMap)

val deserialized: mutable.HashMap[String, String]
  with mutable.SynchronizedMap[String, String] = ser.deserialize(myMapBytes)

deserialized.clear()
  }
{code}

When running this test, I got:

{code}
scala.collection.mutable.HashMap cannot be cast to 
scala.collection.mutable.SynchronizedMap
java.lang.ClassCastException: scala.collection.mutable.HashMap cannot be cast 
to scala.collection.mutable.SynchronizedMap
{code}

similar to what happens in KinesisReceiver.

I'm figuring out the fix to do in KryoSerializer.
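
One possible direction, stated as an assumption rather than the actual fix: 
avoid the anonymous {{HashMap with SynchronizedMap}} mixin, which Kryo brings 
back as a plain {{HashMap}}, and use a concrete thread-safe map instead, for 
example:

{code}
import java.util.concurrent.ConcurrentHashMap

// Hypothetical replacement for the mixin-based map (the name is a placeholder):
// a concrete class, so no cast back to SynchronizedMap is needed after
// deserialization.
val sequenceNumberRanges = new ConcurrentHashMap[String, String]()
sequenceNumberRanges.put("foo", "bar")
{code}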


was (Author: jbonofre):
I'm adding a test in KryoSerializerSuite about the support of SynchronizedMap.

> Spark 1.5+ Kinesis Streaming - ClassCastException when starting 
> KinesisReceiver
> ---
>
> Key: SPARK-11193
> URL: https://issues.apache.org/jira/browse/SPARK-11193
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.5.0, 1.5.1
>Reporter: Phil Kallos
> Attachments: screen.png
>
>
> After upgrading from Spark 1.4.x -> 1.5.x, I am now unable to start a Kinesis 
> Spark Streaming application, and am being consistently greeted with this 
> exception:
> java.lang.ClassCastException: scala.collection.mutable.HashMap cannot be cast 
> to scala.collection.mutable.SynchronizedMap
>   at 
> org.apache.spark.streaming.kinesis.KinesisReceiver.onStart(KinesisReceiver.scala:175)
>   at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:148)
>   at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:130)
>   at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:542)
>   at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:532)
>   at 
> org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:1982)
>   at 
> org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:1982)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Worth noting that I am able to reproduce this issue locally, and also on 
> Amazon EMR (using the latest emr-release 4.1.0 which packages Spark 1.5.0).
> Also, I am not able to run the included kinesis-asl example.
> Built locally using:
> git checkout v1.5.1
> mvn -Pyarn -Pkinesis-asl -Phadoop-2.6 -DskipTests clean package
> Example run command:
> bin/run-example streaming.KinesisWordCountASL phibit-test kinesis-connector 
> https://kinesis.us-east-1.amazonaws.com



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12205) Pivot fails Analysis when aggregate is UnresolvedFunction

2015-12-08 Thread Andrew Ray (JIRA)
Andrew Ray created SPARK-12205:
--

 Summary: Pivot fails Analysis when aggregate is UnresolvedFunction
 Key: SPARK-12205
 URL: https://issues.apache.org/jira/browse/SPARK-12205
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.0
Reporter: Andrew Ray






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11193) Spark 1.5+ Kinesis Streaming - ClassCastException when starting KinesisReceiver

2015-12-08 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-11193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15047036#comment-15047036
 ] 

Jean-Baptiste Onofré commented on SPARK-11193:
--

It could be possible. Let me check with the team.

> Spark 1.5+ Kinesis Streaming - ClassCastException when starting 
> KinesisReceiver
> ---
>
> Key: SPARK-11193
> URL: https://issues.apache.org/jira/browse/SPARK-11193
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.5.0, 1.5.1
>Reporter: Phil Kallos
> Attachments: screen.png
>
>
> After upgrading from Spark 1.4.x -> 1.5.x, I am now unable to start a Kinesis 
> Spark Streaming application, and am being consistently greeted with this 
> exception:
> java.lang.ClassCastException: scala.collection.mutable.HashMap cannot be cast 
> to scala.collection.mutable.SynchronizedMap
>   at 
> org.apache.spark.streaming.kinesis.KinesisReceiver.onStart(KinesisReceiver.scala:175)
>   at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:148)
>   at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:130)
>   at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:542)
>   at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:532)
>   at 
> org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:1982)
>   at 
> org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:1982)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Worth noting that I am able to reproduce this issue locally, and also on 
> Amazon EMR (using the latest emr-release 4.1.0 which packages Spark 1.5.0).
> Also, I am not able to run the included kinesis-asl example.
> Built locally using:
> git checkout v1.5.1
> mvn -Pyarn -Pkinesis-asl -Phadoop-2.6 -DskipTests clean package
> Example run command:
> bin/run-example streaming.KinesisWordCountASL phibit-test kinesis-connector 
> https://kinesis.us-east-1.amazonaws.com



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12205) Pivot fails Analysis when aggregate is UnresolvedFunction

2015-12-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12205:


Assignee: Apache Spark

> Pivot fails Analysis when aggregate is UnresolvedFunction
> -
>
> Key: SPARK-12205
> URL: https://issues.apache.org/jira/browse/SPARK-12205
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Andrew Ray
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.9 Consumer API

2015-12-08 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15046975#comment-15046975
 ] 

Cody Koeninger commented on SPARK-12177:


I really think this needs to be handled as a separate subproject, or otherwise 
in a fully backwards compatible way.  The new consumer api will require 
upgrading kafka brokers, which is a big ask just in order for people to upgrade 
spark versions.

> Update KafkaDStreams to new Kafka 0.9 Consumer API
> --
>
> Key: SPARK-12177
> URL: https://issues.apache.org/jira/browse/SPARK-12177
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Nikita Tarasenko
>  Labels: consumer, kafka
>
> Kafka 0.9 has already been released and it introduces a new consumer API that 
> is not compatible with the old one, so I added support for the new consumer 
> API. I made separate classes in the package 
> org.apache.spark.streaming.kafka.v09 with the changed API. I did not remove 
> the old classes, for backward compatibility: users will not need to change 
> their old Spark applications when they upgrade to the new Spark version.
> Please review my changes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11193) Spark 1.5+ Kinesis Streaming - ClassCastException when starting KinesisReceiver

2015-12-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11193:


Assignee: Apache Spark

> Spark 1.5+ Kinesis Streaming - ClassCastException when starting 
> KinesisReceiver
> ---
>
> Key: SPARK-11193
> URL: https://issues.apache.org/jira/browse/SPARK-11193
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.5.0, 1.5.1
>Reporter: Phil Kallos
>Assignee: Apache Spark
> Attachments: screen.png
>
>
> After upgrading from Spark 1.4.x -> 1.5.x, I am now unable to start a Kinesis 
> Spark Streaming application, and am being consistently greeted with this 
> exception:
> java.lang.ClassCastException: scala.collection.mutable.HashMap cannot be cast 
> to scala.collection.mutable.SynchronizedMap
>   at 
> org.apache.spark.streaming.kinesis.KinesisReceiver.onStart(KinesisReceiver.scala:175)
>   at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:148)
>   at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:130)
>   at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:542)
>   at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:532)
>   at 
> org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:1982)
>   at 
> org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:1982)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Worth noting that I am able to reproduce this issue locally, and also on 
> Amazon EMR (using the latest emr-release 4.1.0 which packages Spark 1.5.0).
> Also, I am not able to run the included kinesis-asl example.
> Built locally using:
> git checkout v1.5.1
> mvn -Pyarn -Pkinesis-asl -Phadoop-2.6 -DskipTests clean package
> Example run command:
> bin/run-example streaming.KinesisWordCountASL phibit-test kinesis-connector 
> https://kinesis.us-east-1.amazonaws.com



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12205) Pivot fails Analysis when aggregate is UnresolvedFunction

2015-12-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12205:


Assignee: (was: Apache Spark)

> Pivot fails Analysis when aggregate is UnresolvedFunction
> -
>
> Key: SPARK-12205
> URL: https://issues.apache.org/jira/browse/SPARK-12205
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Andrew Ray
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11193) Spark 1.5+ Kinesis Streaming - ClassCastException when starting KinesisReceiver

2015-12-08 Thread Fabien Comte (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15047027#comment-15047027
 ] 

Fabien Comte commented on SPARK-11193:
--

Thank you for working on this issue.
I guess it's too late for it to be included in the 1.6 release?

> Spark 1.5+ Kinesis Streaming - ClassCastException when starting 
> KinesisReceiver
> ---
>
> Key: SPARK-11193
> URL: https://issues.apache.org/jira/browse/SPARK-11193
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.5.0, 1.5.1
>Reporter: Phil Kallos
> Attachments: screen.png
>
>
> After upgrading from Spark 1.4.x -> 1.5.x, I am now unable to start a Kinesis 
> Spark Streaming application, and am being consistently greeted with this 
> exception:
> java.lang.ClassCastException: scala.collection.mutable.HashMap cannot be cast 
> to scala.collection.mutable.SynchronizedMap
>   at 
> org.apache.spark.streaming.kinesis.KinesisReceiver.onStart(KinesisReceiver.scala:175)
>   at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:148)
>   at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:130)
>   at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:542)
>   at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:532)
>   at 
> org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:1982)
>   at 
> org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:1982)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Worth noting that I am able to reproduce this issue locally, and also on 
> Amazon EMR (using the latest emr-release 4.1.0 which packages Spark 1.5.0).
> Also, I am not able to run the included kinesis-asl example.
> Built locally using:
> git checkout v1.5.1
> mvn -Pyarn -Pkinesis-asl -Phadoop-2.6 -DskipTests clean package
> Example run command:
> bin/run-example streaming.KinesisWordCountASL phibit-test kinesis-connector 
> https://kinesis.us-east-1.amazonaws.com



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11193) Spark 1.5+ Kinesis Streaming - ClassCastException when starting KinesisReceiver

2015-12-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11193:


Assignee: (was: Apache Spark)

> Spark 1.5+ Kinesis Streaming - ClassCastException when starting 
> KinesisReceiver
> ---
>
> Key: SPARK-11193
> URL: https://issues.apache.org/jira/browse/SPARK-11193
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.5.0, 1.5.1
>Reporter: Phil Kallos
> Attachments: screen.png
>
>
> After upgrading from Spark 1.4.x -> 1.5.x, I am now unable to start a Kinesis 
> Spark Streaming application, and am being consistently greeted with this 
> exception:
> java.lang.ClassCastException: scala.collection.mutable.HashMap cannot be cast 
> to scala.collection.mutable.SynchronizedMap
>   at 
> org.apache.spark.streaming.kinesis.KinesisReceiver.onStart(KinesisReceiver.scala:175)
>   at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:148)
>   at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:130)
>   at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:542)
>   at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:532)
>   at 
> org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:1982)
>   at 
> org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:1982)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Worth noting that I am able to reproduce this issue locally, and also on 
> Amazon EMR (using the latest emr-release 4.1.0 which packages Spark 1.5.0).
> Also, I am not able to run the included kinesis-asl example.
> Built locally using:
> git checkout v1.5.1
> mvn -Pyarn -Pkinesis-asl -Phadoop-2.6 -DskipTests clean package
> Example run command:
> bin/run-example streaming.KinesisWordCountASL phibit-test kinesis-connector 
> https://kinesis.us-east-1.amazonaws.com



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11193) Spark 1.5+ Kinesis Streaming - ClassCastException when starting KinesisReceiver

2015-12-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15047056#comment-15047056
 ] 

Apache Spark commented on SPARK-11193:
--

User 'jbonofre' has created a pull request for this issue:
https://github.com/apache/spark/pull/10203

> Spark 1.5+ Kinesis Streaming - ClassCastException when starting 
> KinesisReceiver
> ---
>
> Key: SPARK-11193
> URL: https://issues.apache.org/jira/browse/SPARK-11193
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.5.0, 1.5.1
>Reporter: Phil Kallos
> Attachments: screen.png
>
>
> After upgrading from Spark 1.4.x -> 1.5.x, I am now unable to start a Kinesis 
> Spark Streaming application, and am being consistently greeted with this 
> exception:
> java.lang.ClassCastException: scala.collection.mutable.HashMap cannot be cast 
> to scala.collection.mutable.SynchronizedMap
>   at 
> org.apache.spark.streaming.kinesis.KinesisReceiver.onStart(KinesisReceiver.scala:175)
>   at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:148)
>   at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:130)
>   at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:542)
>   at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:532)
>   at 
> org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:1982)
>   at 
> org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:1982)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Worth noting that I am able to reproduce this issue locally, and also on 
> Amazon EMR (using the latest emr-release 4.1.0 which packages Spark 1.5.0).
> Also, I am not able to run the included kinesis-asl example.
> Built locally using:
> git checkout v1.5.1
> mvn -Pyarn -Pkinesis-asl -Phadoop-2.6 -DskipTests clean package
> Example run command:
> bin/run-example streaming.KinesisWordCountASL phibit-test kinesis-connector 
> https://kinesis.us-east-1.amazonaws.com



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9059) Update Python Direct Kafka Word count examples to show the use of HasOffsetRanges

2015-12-08 Thread Neelesh Srinivas Salian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15047132#comment-15047132
 ] 

Neelesh Srinivas Salian commented on SPARK-9059:


Is this JIRA still active? The PR seems to be closed. 

Shall I go ahead and begin working on it?

Thank you.

> Update Python Direct Kafka Word count examples to show the use of 
> HasOffsetRanges
> -
>
> Key: SPARK-9059
> URL: https://issues.apache.org/jira/browse/SPARK-9059
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming
>Reporter: Tathagata Das
>Priority: Minor
>  Labels: starter
>
> Update Python examples of Direct Kafka word count to access the offset ranges 
> using HasOffsetRanges and print it. For example in Scala,
>  
> {code}
> var offsetRanges = Array.empty[OffsetRange]
> ...
> directKafkaDStream.foreachRDD { rdd =>
>   // HasOffsetRanges exposes the Kafka offset ranges backing this batch's RDD
>   offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
> }
> ...
> transformedDStream.foreachRDD { rdd =>
>   // some operation
>   println("Processed ranges: " + offsetRanges.mkString(", "))
> }
> {code}
> See https://spark.apache.org/docs/latest/streaming-kafka-integration.html for 
> more info, and the master source code for more updated information on python. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12207) "org.datanucleus" is already registered

2015-12-08 Thread stefan (JIRA)
stefan created SPARK-12207:
--

 Summary: "org.datanucleus" is already registered
 Key: SPARK-12207
 URL: https://issues.apache.org/jira/browse/SPARK-12207
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.5.2
 Environment: windows 7 64 bit

PATH includes:
C:\Users\Stefan\spark-1.5.2-bin-hadoop2.6\bin
C:\ProgramData\Oracle\Java\javapath
C:\Users\Stefan\scala\bin

SYSTEM variables set are:
JAVA_HOME=C:\Program Files\Java\jre1.8.0_65
HADOOP_HOME=C:\Users\Stefan\hadoop-2.6.0\bin
(where the bin\winutils resides)

winutils ls \tmp\hive ->
drwxrwxrwx 1 PC\Stefan BloomBear-SSD\None 0 Dec  8 2015 \tmp\hive
Reporter: stefan


I read the response to an identical issue:

https://issues.apache.org/jira/browse/SPARK-11142

and then searched the mailing list archives here:

http://apache-spark-user-list.1001560.n3.nabble.com/

but got no help

http://apache-spark-user-list.1001560.n3.nabble.com/template/NamlServlet.jtp?macro=search_page=1=%22org.datanucleus%22+is+already+registered=0

There were only 6 posts across all the dates, but no resolution.

My apache spark folder has 3 jar files in the lib folder
datanucleus-api-jdo-3.2.6.jar
datanucleus-core-3.2.10.jar
datanucleus-rdbms-3.2.9.jar

and these are compiled, so they are mostly unreadable.
There are no other files with the word 'datanucleus' on my PC.

Maven's pom.xml file has been implicated

http://stackoverflow.com/questions/877949/conflicting-versions-of-datanucleus-enhancer-in-a-maven-google-app-engine-projec

but I do not have maven on this windows box.

This post:

http://www.worldofdb2.com/profiles/blogs/using-spark-s-interactive-scala-shell-for-accessing-db2-data

suggests downloading the DB2 JDBC driver jar (db2jcc.jar or db2jcc4.jar) and 
setting SPARK_CLASSPATH=c:\db2jcc.jar

but the mailing list archives say SPARK_CLASSPATH is deprecated

https://mail-archives.apache.org/mod_mbox/spark-user/201503.mbox/%3C01a901d0547c$a23ba480$e6b2ed80$@innowireless.com%3E

Could I have a pointer to a resolution for this problem?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12208) Abstract the examples into a common place

2015-12-08 Thread Timothy Hunter (JIRA)
Timothy Hunter created SPARK-12208:
--

 Summary: Abstract the examples into a common place
 Key: SPARK-12208
 URL: https://issues.apache.org/jira/browse/SPARK-12208
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, MLlib
Affects Versions: 1.5.2
Reporter: Timothy Hunter


When we write examples in the code, we put the generation of the data along 
with the example itself. We typically have either:

{code}
val data = 
sqlContext.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
...
{code}

or some more esoteric stuff such as:
{code}
val data = Array(
  (0, 0.1),
  (1, 0.8),
  (2, 0.2)
)
val dataFrame: DataFrame = sqlContext.createDataFrame(data).toDF("label", 
"feature")
{code}

{code}
val data = Array(
  Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))),
  Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
  Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)
)
val df = sqlContext.createDataFrame(data.map(Tuple1.apply)).toDF("features")
{code}

I suggest we follow the example of sklearn and standardize all the generation 
of example data inside a few methods, for example in 
{{org.apache.spark.ml.examples.ExampleData}}. One reason is that just reading 
the code is sometimes not enough to figure out what the data is supposed to be. 
For example, when using {{libsvm_data}}, it is unclear what the dataframe 
columns are. This is something we should comment somewhere.
Also, it would help to explain in one place all the Scala idiosyncrasies, such 
as using {{Tuple1.apply}}.
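
As a rough sketch of what such a helper could look like (the object name comes 
from the proposal above; the method names and the tiny datasets are only 
illustrative, not existing code):

{code}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.{DataFrame, SQLContext}

// Hypothetical helper collecting the data-generation snippets used by the docs.
object ExampleData {
  /** A tiny labeled dataset: one "label" column and one numeric "feature" column. */
  def labeledPoints(sqlContext: SQLContext): DataFrame =
    sqlContext.createDataFrame(Seq((0, 0.1), (1, 0.8), (2, 0.2)))
      .toDF("label", "feature")

  /** A handful of 5-dimensional vectors in a single "features" column. */
  def vectors(sqlContext: SQLContext): DataFrame = {
    val data = Seq(
      Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))),
      Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
      Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)
    )
    sqlContext.createDataFrame(data.map(Tuple1.apply)).toDF("features")
  }
}
{code}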



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12074) Avoid memory copy involving ByteBuffer.wrap(ByteArrayOutputStream.toByteArray)

2015-12-08 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-12074.
--
   Resolution: Fixed
 Assignee: Ted Yu
Fix Version/s: 2.0.0

> Avoid memory copy involving ByteBuffer.wrap(ByteArrayOutputStream.toByteArray)
> --
>
> Key: SPARK-12074
> URL: https://issues.apache.org/jira/browse/SPARK-12074
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Ted Yu
>Assignee: Ted Yu
> Fix For: 2.0.0
>
>
> SPARK-12060 fixed JavaSerializerInstance.serialize
> This issue applies the same technique (via ByteBufferOutputStream) on two 
> other classes.
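
For illustration, a minimal sketch of the technique (not the actual Spark 
classes; it just shows where the extra copy comes from and how a small subclass 
can avoid it):

{code}
import java.io.ByteArrayOutputStream
import java.nio.ByteBuffer

// The pattern being avoided: toByteArray copies the stream's internal buffer,
// and ByteBuffer.wrap then wraps that fresh copy.
def withCopy(bos: ByteArrayOutputStream): ByteBuffer =
  ByteBuffer.wrap(bos.toByteArray)

// The copy-free idea: subclass ByteArrayOutputStream and wrap its internal
// buffer (the protected `buf`/`count` fields) directly.
class ByteBufferOutputStream extends ByteArrayOutputStream {
  def toByteBuffer: ByteBuffer = ByteBuffer.wrap(buf, 0, count)
}
{code}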



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12179) Spark SQL get different result with the same code

2015-12-08 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15047186#comment-15047186
 ] 

Davies Liu commented on SPARK-12179:


Could you also test 1.6-RC1?

I'm just wondering: the window function `row_number` has only existed since 
Spark 1.4, so how can you run this query against 1.3?

> Spark SQL get different result with the same code
> -
>
> Key: SPARK-12179
> URL: https://issues.apache.org/jira/browse/SPARK-12179
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.3.0, 1.3.1, 1.3.2, 1.4.0, 1.4.1, 1.4.2, 1.5.0, 1.5.1, 
> 1.5.2, 1.5.3
> Environment: hadoop version: 2.5.0-cdh5.3.2
> spark version: 1.5.3
> run mode: yarn-client
>Reporter: Tao Li
>Priority: Critical
>
> I run the SQL in yarn-client mode, but get a different result each time.
> As you can see in the example, I get a different shuffle write with the same 
> shuffle read in two jobs running the same code.
> Some of my Spark apps run well, but some always hit this problem. I have met 
> this problem on Spark 1.3, 1.4 and 1.5.
> Can you give me some suggestions about the possible causes, or how I can 
> figure out the problem?
> 1. First Run
> Details for Stage 9 (Attempt 0)
> Total Time Across All Tasks: 5.8 min
> Shuffle Read: 24.4 MB / 205399
> Shuffle Write: 6.8 MB / 54934
> 2. Second Run
> Details for Stage 9 (Attempt 0)
> Total Time Across All Tasks: 5.6 min
> Shuffle Read: 24.4 MB / 205399
> Shuffle Write: 6.8 MB / 54905



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10393) use ML pipeline in LDA example

2015-12-08 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-10393.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8551
[https://github.com/apache/spark/pull/8551]

> use ML pipeline in LDA example
> --
>
> Key: SPARK-10393
> URL: https://issues.apache.org/jira/browse/SPARK-10393
> Project: Spark
>  Issue Type: Improvement
>  Components: Examples, MLlib
>Reporter: yuhao yang
>Assignee: yuhao yang
>Priority: Minor
> Fix For: 1.6.0
>
>
> Since the logic of the text processing part has been moved to ML 
> estimators/transformers, replace the related code in LDA Example with the ML 
> pipeline. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-9858) Introduce an ExchangeCoordinator to estimate the number of post-shuffle partitions.

2015-12-08 Thread Adam Roberts (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15047194#comment-15047194
 ] 

Adam Roberts edited comment on SPARK-9858 at 12/8/15 6:33 PM:
--

Several potential issues here, though they may well not be with this code 
itself - I'm consistently encountering problems on two different big endian 
platforms while testing this.

1) Is this thread safe? I've noticed that if we print the rowBuffer when using 
more than one thread for our SQLContext, the ordering of elements is not 
consistent and we sometimes have two rows printed consecutively.

2) For the aggregate, join, and complex query 2 tests, I consistently receive 
more bytes per partition (or should that be per stage?), and instead of 
estimating (0, 2) for the indices we get (0, 2, 4). I know we're using the 
UnsafeRowSerializer, so I'm wary the issue lies there instead; I see it uses 
Google's ByteStreams class to read in the bytes. Specifically, I have 800, 800, 
800, 800, 720 bytes per partition instead of 600, 600, 600, 600, 600. Is there 
a way I can print what these bytes are?

3) Where do the values used in the assertions for the test suite come from?

If we print the rows, we see differences between the two platforms (the 63 and 
70 are on our BE platform, and this value differs each time we run the test).

Everything works perfectly on various LE architectures, hence the current 
endianness/serialization theory. This might be better suited to the dev mailing 
list, although I expect I'm one of the few testing this on BE. It occurs 
regardless of the Java vendor and of whether we're running in interpreted mode.


was (Author: aroberts):
Several potential issues here, may well not be with this code itself though - 
I'm consistently encountering problems for two different big endian platforms 
while testing this

1) is this thread safe? I've noticed if we print the rowBuffer when using more 
than one thread for our SQLContext, the ordering of elements is not consistent 
and we sometimes have two rows printed consecutively

2) For the aggregate, join, and complex query 2 tests, I consistently receive 
more bytes per partition (or should that be per stage?) and instead of 
estimating (0, 2) for the indices we get (0, 2, 4). I know we're using the 
UnsafeRowSerializer and so wary if the issue lies here instead, I see it's 
using Google's ByteStreams class to read in the bytes. Specifically I have 800, 
800, 800, 800, 720 bytes per partition instead of 600, 600, 600, 600, 600. Is 
there a way I can print what these bytes are?

3) Where do the values used in the assertions for the test suite come from?

If we print the rows we see differences between the two platforms: (the 63 and 
70 is on our BE platform and this value differs each time we run the test)

Works perfectly on various architectures that are LE and hence the current 
endianness/serialization theory. Apologies if this would be better suited to 
the dev mailing list, although I expect I'm one of the few to be testing this 
on BE...

> Introduce an ExchangeCoordinator to estimate the number of post-shuffle 
> partitions.
> ---
>
> Key: SPARK-9858
> URL: https://issues.apache.org/jira/browse/SPARK-9858
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12206) Streaming WebUI shows incorrect batch statistics when using Window operations

2015-12-08 Thread Anand Iyer (JIRA)
Anand Iyer created SPARK-12206:
--

 Summary: Streaming WebUI shows incorrect batch statistics when 
using Window operations
 Key: SPARK-12206
 URL: https://issues.apache.org/jira/browse/SPARK-12206
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.5.1
Reporter: Anand Iyer
Priority: Minor


I have a streaming app that uses the Window(...) function to create a sliding 
window, and performs transformations on the windowed DStream.

The Batch statistics section of the Streaming UI starts displaying stats for 
each Window, instead of each micro-batch. Is that expected behavior?

The "Input Size" column shows incorrect values. The streaming application is 
receiving about 1K events/sec. However, the "Input Size" column shows values in 
the single digits or low double digits. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11155) Stage summary json should include stage duration

2015-12-08 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid resolved SPARK-11155.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 10107
https://github.com/apache/spark/pull/10107

> Stage summary json should include stage duration 
> -
>
> Key: SPARK-11155
> URL: https://issues.apache.org/jira/browse/SPARK-11155
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Reporter: Imran Rashid
>Assignee: Xin Ren
>Priority: Minor
>  Labels: Starter
> Fix For: 2.0.0
>
> Attachments: Screen Shot 2015-12-02.png
>
>
> The json endpoint for stages doesn't include information on the stage 
> duration that is present in the UI.  This looks like a simple oversight; they 
> should be included.  E.g., the metrics should be included at 
> {{api/v1/applications//stages}}. The missing metrics are 
> {{submissionTime}} and {{completionTime}} (and whatever other metrics come 
> out of the discussion on SPARK-10930)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12201) add type coercion rule for greatest/least

2015-12-08 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-12201.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 10196
[https://github.com/apache/spark/pull/10196]

> add type coercion rule for greatest/least
> -
>
> Key: SPARK-12201
> URL: https://issues.apache.org/jira/browse/SPARK-12201
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12209) spark.streaming.concurrentJobs and spark.streaming.backpressure.enabled

2015-12-08 Thread Andrej Kazakov (JIRA)
Andrej Kazakov created SPARK-12209:
--

 Summary: spark.streaming.concurrentJobs and 
spark.streaming.backpressure.enabled
 Key: SPARK-12209
 URL: https://issues.apache.org/jira/browse/SPARK-12209
 Project: Spark
  Issue Type: Bug
  Components: Scheduler
Affects Versions: 1.5.2
Reporter: Andrej Kazakov


spark.streaming.backpressure.enabled doesn't take 
spark.streaming.concurrentJobs into account.

The backpressure mechanism reduces the input rate until only one job is being 
processed at a time. Having concurrent jobs is interesting when dealing with 
high-latency backends, where the ingestion time is large yet there is enough 
throughput capacity to complete multiple jobs in parallel.
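
For reference, this is the combination of settings being described (a minimal 
sketch; the values are arbitrary):

{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Several jobs may run concurrently, but per the report above the backpressure
// controller throttles the input rate as if only one job ran at a time, so
// these two settings currently interact poorly.
val conf = new SparkConf()
  .setAppName("backpressure-vs-concurrent-jobs")
  .set("spark.streaming.concurrentJobs", "4")
  .set("spark.streaming.backpressure.enabled", "true")

val ssc = new StreamingContext(conf, Seconds(1))
{code}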



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12188) [SQL] Code refactoring and comment correction in Dataset APIs

2015-12-08 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-12188.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 10184
[https://github.com/apache/spark/pull/10184]

> [SQL] Code refactoring and comment correction in Dataset APIs
> -
>
> Key: SPARK-12188
> URL: https://issues.apache.org/jira/browse/SPARK-12188
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Xiao Li
> Fix For: 1.6.0
>
>
> - Created a new private variable `boundTEncoder` that can be shared by 
> multiple functions, `RDD`, `select` and `collect`. 
> - Replaced all the `queryExecution.analyzed` by the function call 
> `logicalPlan`
> - A few API comments are using wrong class names (e.g., `DataFrame`) or 
> parameter names (e.g., `n`)
> - A few API descriptions are wrong. (e.g., `mapPartitions`)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12071) Programming guide should explain NULL in JVM translate to NA in R

2015-12-08 Thread Neelesh Srinivas Salian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neelesh Srinivas Salian updated SPARK-12071:

Labels: releasenotes starter  (was: releasenotes)

> Programming guide should explain NULL in JVM translate to NA in R
> -
>
> Key: SPARK-12071
> URL: https://issues.apache.org/jira/browse/SPARK-12071
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.6.0
>Reporter: Felix Cheung
>Priority: Minor
>  Labels: releasenotes, starter
>
> This behavior seems to be new for Spark 1.6.0



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12179) Spark SQL get different result with the same code

2015-12-08 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15047193#comment-15047193
 ] 

Davies Liu edited comment on SPARK-12179 at 12/8/15 6:24 PM:
-

There are two directions to narrow down the problem:

1) simplify the query step by step until the problem goes away
2) remove the customized configurations (for example, extraJavaOptions) one by 
one until the problem goes away

This could be a critical bug; hopefully we can find a way to fix it.


was (Author: davies):
There are two direction to narrow down the problem:

1) simplify the query until removing anything from it the problem will gone
2) remove the customized configurations (for example, extraJavaOptions), until 
remove anything of them the problem will gone.

This could be a critical bug, hopefully we could find a way to fix it.

> Spark SQL get different result with the same code
> -
>
> Key: SPARK-12179
> URL: https://issues.apache.org/jira/browse/SPARK-12179
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.3.0, 1.3.1, 1.3.2, 1.4.0, 1.4.1, 1.4.2, 1.5.0, 1.5.1, 
> 1.5.2, 1.5.3
> Environment: hadoop version: 2.5.0-cdh5.3.2
> spark version: 1.5.3
> run mode: yarn-client
>Reporter: Tao Li
>Priority: Critical
>
> I run the SQL in yarn-client mode, but get a different result each time.
> As you can see in the example, I get a different shuffle write with the same 
> shuffle read in two jobs running the same code.
> Some of my Spark apps run well, but some always hit this problem. I have met 
> this problem on Spark 1.3, 1.4 and 1.5.
> Can you give me some suggestions about the possible causes, or how I can 
> figure out the problem?
> 1. First Run
> Details for Stage 9 (Attempt 0)
> Total Time Across All Tasks: 5.8 min
> Shuffle Read: 24.4 MB / 205399
> Shuffle Write: 6.8 MB / 54934
> 2. Second Run
> Details for Stage 9 (Attempt 0)
> Total Time Across All Tasks: 5.6 min
> Shuffle Read: 24.4 MB / 205399
> Shuffle Write: 6.8 MB / 54905



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9858) Introduce an ExchangeCoordinator to estimate the number of post-shuffle partitions.

2015-12-08 Thread Adam Roberts (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15047194#comment-15047194
 ] 

Adam Roberts commented on SPARK-9858:
-

Several potential issues here, though they may well not be with this code 
itself - I'm consistently encountering problems on two different big endian 
platforms while testing this.

1) Is this thread safe? I've noticed that if we print the rowBuffer when using 
more than one thread for our SQLContext, the ordering of elements is not 
consistent and we sometimes have two rows printed consecutively.

2) For the aggregate, join, and complex query 2 tests, I consistently receive 
more bytes per partition, and instead of estimating (0, 2) for the indices we 
get (0, 2, 4). I know we're using the UnsafeRowSerializer, so I'm wary the 
issue lies there instead; I see it uses Google's ByteStreams class to read in 
the bytes. Specifically, I have 800, 800, 800, 800, 720 bytes per partition 
instead of 600, 600, 600, 600, 600.

3) Where do the values used in the assertions for the test suite come from?

If we print the rows, we see differences between the two platforms (the 63 and 
70 are on our BE platform, and this value differs each time we run the test).

Everything works perfectly on various LE architectures, hence the current 
endianness/serialization theory. Apologies if this would be better suited to 
the dev mailing list, although I expect I'm one of the few testing this on 
BE...

> Introduce an ExchangeCoordinator to estimate the number of post-shuffle 
> partitions.
> ---
>
> Key: SPARK-9858
> URL: https://issues.apache.org/jira/browse/SPARK-9858
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12206) Streaming WebUI shows incorrect batch statistics when using Window operations

2015-12-08 Thread Anand Iyer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Iyer updated SPARK-12206:
---
Attachment: streaming-webui.png

> Streaming WebUI shows incorrect batch statistics when using Window operations
> -
>
> Key: SPARK-12206
> URL: https://issues.apache.org/jira/browse/SPARK-12206
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.5.1
>Reporter: Anand Iyer
>Priority: Minor
> Attachments: streaming-webui.png
>
>
> I have a streaming app that uses the Window(...) function to create a sliding 
> window, and performs transformations on the windowed DStream.
> The Batch statistics section of the Streaming UI starts displaying stats for 
> each Window, instead of each micro-batch. Is that expected behavior?
> The "Input Size" column shows incorrect values. The streaming application is 
> receiving about 1K events/sec. However, the "Input Size" column shows values 
> in the single digits or low double digits. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12179) Spark SQL get different result with the same code

2015-12-08 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15047193#comment-15047193
 ] 

Davies Liu commented on SPARK-12179:


There are two directions to narrow down the problem:

1) simplify the query step by step until the problem goes away
2) remove the customized configurations (for example, extraJavaOptions) one by 
one until the problem goes away

This could be a critical bug; hopefully we can find a way to fix it.

> Spark SQL get different result with the same code
> -
>
> Key: SPARK-12179
> URL: https://issues.apache.org/jira/browse/SPARK-12179
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.3.0, 1.3.1, 1.3.2, 1.4.0, 1.4.1, 1.4.2, 1.5.0, 1.5.1, 
> 1.5.2, 1.5.3
> Environment: hadoop version: 2.5.0-cdh5.3.2
> spark version: 1.5.3
> run mode: yarn-client
>Reporter: Tao Li
>Priority: Critical
>
> I run the SQL in yarn-client mode, but get a different result each time.
> As you can see in the example, I get a different shuffle write with the same 
> shuffle read in two jobs running the same code.
> Some of my Spark apps run well, but some always hit this problem. I have met 
> this problem on Spark 1.3, 1.4 and 1.5.
> Can you give me some suggestions about the possible causes, or how I can 
> figure out the problem?
> 1. First Run
> Details for Stage 9 (Attempt 0)
> Total Time Across All Tasks: 5.8 min
> Shuffle Read: 24.4 MB / 205399
> Shuffle Write: 6.8 MB / 54934
> 2. Second Run
> Details for Stage 9 (Attempt 0)
> Total Time Across All Tasks: 5.6 min
> Shuffle Read: 24.4 MB / 205399
> Shuffle Write: 6.8 MB / 54905



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-9858) Introduce an ExchangeCoordinator to estimate the number of post-shuffle partitions.

2015-12-08 Thread Adam Roberts (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15047194#comment-15047194
 ] 

Adam Roberts edited comment on SPARK-9858 at 12/8/15 6:28 PM:
--

Several potential issues here, though they may well not be with this code 
itself - I'm consistently encountering problems on two different big endian 
platforms while testing this.

1) Is this thread safe? I've noticed that if we print the rowBuffer when using 
more than one thread for our SQLContext, the ordering of elements is not 
consistent and we sometimes have two rows printed consecutively.

2) For the aggregate, join, and complex query 2 tests, I consistently receive 
more bytes per partition (or should that be per stage?), and instead of 
estimating (0, 2) for the indices we get (0, 2, 4). I know we're using the 
UnsafeRowSerializer, so I'm wary the issue lies there instead; I see it uses 
Google's ByteStreams class to read in the bytes. Specifically, I have 800, 800, 
800, 800, 720 bytes per partition instead of 600, 600, 600, 600, 600. Is there 
a way I can print what these bytes are?

3) Where do the values used in the assertions for the test suite come from?

If we print the rows, we see differences between the two platforms (the 63 and 
70 are on our BE platform, and this value differs each time we run the test).

Everything works perfectly on various LE architectures, hence the current 
endianness/serialization theory. Apologies if this would be better suited to 
the dev mailing list, although I expect I'm one of the few testing this on 
BE...


was (Author: aroberts):
Several potential issues here, may well not be with this code itself though - 
I'm consistently encountering problems for two different big endian platforms 
while testing this

1) is this thread safe? I've noticed if we print the rowBuffer when using more 
than one thread for our SQLContext, the ordering of elements is not consistent 
and we sometimes have two rows printed consecutively

2) For the aggregate, join, and complex query 2 tests, I consistently receive 
more bytes per partition and instead of estimating (0, 2) for the indices we 
get (0, 2, 4). I know we're using the UnsafeRowSerializer and so wary if the 
issue lies here instead, I see it's using Google's ByteStreams class to read in 
the bytes. Specifically I have 800, 800, 800, 800, 720 bytes per partition 
instead of 600, 600, 600, 600, 600

3) Where do the values used in the assertions for the test suite come from?

If we print the rows we see differences between the two platforms: (the 63 and 
70 is on our BE platform and this value differs each time we run the test)

Works perfectly on various architectures that are LE and hence the current 
endianness/serialization theory. Apologies if this would be better suited to 
the dev mailing list, although I expect I'm one of the few to be testing this 
on BE...

> Introduce an ExchangeCoordinator to estimate the number of post-shuffle 
> partitions.
> ---
>
> Key: SPARK-9858
> URL: https://issues.apache.org/jira/browse/SPARK-9858
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12195) Adding BigDecimal, Date and Timestamp into Encoder

2015-12-08 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-12195.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 10188
[https://github.com/apache/spark/pull/10188]

> Adding BigDecimal, Date and Timestamp into Encoder
> --
>
> Key: SPARK-12195
> URL: https://issues.apache.org/jira/browse/SPARK-12195
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Xiao Li
> Fix For: 1.6.0
>
>
> Add three types (DecimalType, DateType and TimestampType) to Encoder for the 
> Dataset APIs.
> DecimalType -> java.math.BigDecimal
> DateType -> java.sql.Date
> TimestampType -> java.sql.Timestamp
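
A small sketch of how this is expected to be used from the Dataset API in 1.6 
(the case class and values here are made up, and a {{SQLContext}} named 
{{sqlContext}} is assumed to be in scope):

{code}
import java.math.BigDecimal
import java.sql.{Date, Timestamp}

case class Payment(amount: BigDecimal, day: Date, at: Timestamp)

// With encoders for these types, a Dataset of such case classes can be built
// directly from local data.
import sqlContext.implicits._
val ds = sqlContext.createDataset(Seq(
  Payment(new BigDecimal("9.99"), Date.valueOf("2015-12-08"),
    new Timestamp(System.currentTimeMillis()))
))
{code}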



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12118) SparkR: Documentation change for isNaN

2015-12-08 Thread Neelesh Srinivas Salian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15047180#comment-15047180
 ] 

Neelesh Srinivas Salian commented on SPARK-12118:
-

Hi [~yanboliang]

This looks similar to SPARK-12071. It involves updating the programming guide. 
The release notes would cover the fix for whichever release it is included in.

If it is the same, we can close this as a duplicate of the aforementioned JIRA.

Thank you.

> SparkR: Documentation change for isNaN
> --
>
> Key: SPARK-12118
> URL: https://issues.apache.org/jira/browse/SPARK-12118
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: Yanbo Liang
>Priority: Minor
>  Labels: releasenotes
>
> As discussed in pull request https://github.com/apache/spark/pull/10037, we 
> replaced DataFrame.isNaN with DataFrame.isnan on the SparkR side, because 
> DataFrame.isNaN has been deprecated and will be removed in Spark 2.0. We 
> should document the change in the release notes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12078) Fix ByteBuffer.limit misuse

2015-12-08 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-12078.
--
Resolution: Won't Fix

For now, we can assume position is 0

> Fix ByteBuffer.limit misuse
> ---
>
> Key: SPARK-12078
> URL: https://issues.apache.org/jira/browse/SPARK-12078
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Shixiong Zhu
>
> `ByteBuffer.limit` is not the remaining size of ByteBuffer. 
> `ByteBuffer.limit` is equal to `ByteBuffer.remaining` only if 
> `ByteBuffer.position` is 0.
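
A small illustration of the distinction described above, using standard 
java.nio behaviour:

{code}
import java.nio.ByteBuffer

val buf = ByteBuffer.allocate(16)
buf.putInt(42)                      // position = 4, limit = 16
assert(buf.limit == 16)             // limit is still the capacity here
assert(buf.remaining == 12)         // remaining = limit - position
buf.flip()                          // position = 0, limit = 4
assert(buf.limit == buf.remaining)  // equal only while position == 0
{code}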



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9858) Introduce an ExchangeCoordinator to estimate the number of post-shuffle partitions.

2015-12-08 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15047222#comment-15047222
 ] 

Yin Huai commented on SPARK-9858:
-

[~aroberts] Thanks for your comments.

For 1, can you provide more details? What is the rowBuffer you referred to?

For 2 and 3, I feel the size differences are caused by differences between 
platforms. In our tests, I got the numbers in the assertions from my machine, 
and those numbers work well on Jenkins. Do you have any suggestions on how we 
can make these tests robust across different platforms? 

btw, have you changed {{spark.sql.shuffle.partitions}}?

> Introduce an ExchangeCoordinator to estimate the number of post-shuffle 
> partitions.
> ---
>
> Key: SPARK-9858
> URL: https://issues.apache.org/jira/browse/SPARK-9858
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10872) Derby error (XSDB6) when creating new HiveContext after restarting SparkContext

2015-12-08 Thread Andrew King (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15047280#comment-15047280
 ] 

Andrew King commented on SPARK-10872:
-

I am running into the same issue. I am calling HiveContext through IPython. If 
I try to run my code in the same instance of IPython more than once, I get this 
error. A HiveContext.close() would solve this issue for me (I use sc.close() to 
get around a similar problem with SparkContext). Some way to kill derby / hive 
through python would be great. 

> Derby error (XSDB6) when creating new HiveContext after restarting 
> SparkContext
> ---
>
> Key: SPARK-10872
> URL: https://issues.apache.org/jira/browse/SPARK-10872
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.4.0, 1.4.1, 1.5.0
>Reporter: Dmytro Bielievtsov
>
> Starting from spark 1.4.0 (works well on 1.3.1), the following code fails 
> with "XSDB6: Another instance of Derby may have already booted the database 
> ~/metastore_db":
> {code:python}
> from pyspark import SparkContext, HiveContext
> sc = SparkContext("local[*]", "app1")
> sql = HiveContext(sc)
> sql.createDataFrame([[1]]).collect()
> sc.stop()
> sc = SparkContext("local[*]", "app2")
> sql = HiveContext(sc)
> sql.createDataFrame([[1]]).collect()  # Py4J error
> {code}
> This is related to [#SPARK-9539], and I intend to restart spark context 
> several times for isolated jobs to prevent cache cluttering and GC errors.
> Here's a larger part of the full error trace:
> {noformat}
> Failed to start database 'metastore_db' with class loader 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@13015ec0, see 
> the next exception for details.
> org.datanucleus.exceptions.NucleusDataStoreException: Failed to start 
> database 'metastore_db' with class loader 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@13015ec0, see 
> the next exception for details.
>   at 
> org.datanucleus.store.rdbms.ConnectionFactoryImpl$ManagedConnectionImpl.getConnection(ConnectionFactoryImpl.java:516)
>   at 
> org.datanucleus.store.rdbms.RDBMSStoreManager.(RDBMSStoreManager.java:298)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>   at 
> org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:631)
>   at 
> org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:301)
>   at 
> org.datanucleus.NucleusContext.createStoreManagerForProperties(NucleusContext.java:1187)
>   at org.datanucleus.NucleusContext.initialise(NucleusContext.java:356)
>   at 
> org.datanucleus.api.jdo.JDOPersistenceManagerFactory.freezeConfiguration(JDOPersistenceManagerFactory.java:775)
>   at 
> org.datanucleus.api.jdo.JDOPersistenceManagerFactory.createPersistenceManagerFactory(JDOPersistenceManagerFactory.java:333)
>   at 
> org.datanucleus.api.jdo.JDOPersistenceManagerFactory.getPersistenceManagerFactory(JDOPersistenceManagerFactory.java:202)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at javax.jdo.JDOHelper$16.run(JDOHelper.java:1965)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.jdo.JDOHelper.invoke(JDOHelper.java:1960)
>   at 
> javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1166)
>   at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:808)
>   at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:701)
>   at 
> org.apache.hadoop.hive.metastore.ObjectStore.getPMF(ObjectStore.java:365)
>   at 
> org.apache.hadoop.hive.metastore.ObjectStore.getPersistenceManager(ObjectStore.java:394)
>   at 
> org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:291)
>   at 
> org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:258)
>   at 
> org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73)
>   at 
> org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
>   at 
> org.apache.hadoop.hive.metastore.RawStoreProxy.(RawStoreProxy.java:57)
>   at 
> 

[jira] [Created] (SPARK-12210) Small example that shows how to integrate spark.mllib with spark.ml

2015-12-08 Thread Timothy Hunter (JIRA)
Timothy Hunter created SPARK-12210:
--

 Summary: Small example that shows how to integrate spark.mllib 
with spark.ml
 Key: SPARK-12210
 URL: https://issues.apache.org/jira/browse/SPARK-12210
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, MLlib
Affects Versions: 1.5.2
Reporter: Timothy Hunter


Since we are missing a number of algorithms in {{spark.ml}} such as clustering 
or LDA, we should have a small example that shows the recommended way to go 
back and forth between {{spark.ml}} and {{spark.mllib}}. It is mostly putting 
together existing pieces, but I feel it is important for new users to see how 
the interaction plays out in practice.
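
A rough sketch of the kind of round trip meant here, assuming a DataFrame 
{{df}} that already has a vector-valued "features" column (illustrative only, 
not the eventual example):

{code}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.sql.Row

// spark.ml side: a DataFrame with a "features" column of vectors.
// Drop down to spark.mllib by extracting an RDD[Vector]...
val features = df.select("features").map { case Row(v: Vector) => v }

// ...run an algorithm that currently only exists in spark.mllib (KMeans here)...
val model = KMeans.train(features, k = 3, maxIterations = 20)

// ...and use the model back against the DataFrame, e.g. to score each row.
val predictions = df.map(row => model.predict(row.getAs[Vector]("features")))
{code}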



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12211) Incorrect version number in graphx doc for migration from 1.1

2015-12-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12211:


Assignee: Apache Spark

> Incorrect version number in graphx doc for migration from 1.1
> -
>
> Key: SPARK-12211
> URL: https://issues.apache.org/jira/browse/SPARK-12211
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, GraphX
>Affects Versions: 1.2.0, 1.2.1, 1.2.2, 1.3.0, 1.3.1, 1.4.0, 1.4.1, 1.5.0, 
> 1.5.1, 1.5.2, 1.6.0
>Reporter: Andrew Ray
>Assignee: Apache Spark
>Priority: Minor
>
> Migration from 1.1 section added to the GraphX doc in 1.2.0 (see 
> https://spark.apache.org/docs/1.2.0/graphx-programming-guide.html#migrating-from-spark-11)
>  uses {{site.SPARK_VERSION}} as the version where changes were introduced; it 
> should be just 1.2.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8517) Improve the organization and style of MLlib's user guide

2015-12-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15047342#comment-15047342
 ] 

Apache Spark commented on SPARK-8517:
-

User 'thunterdb' has created a pull request for this issue:
https://github.com/apache/spark/pull/10207

> Improve the organization and style of MLlib's user guide
> 
>
> Key: SPARK-8517
> URL: https://issues.apache.org/jira/browse/SPARK-8517
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, ML, MLlib
>Reporter: Xiangrui Meng
>Assignee: Timothy Hunter
>
> The current MLlib user guide (and spark.ml's), especially the main page, 
> doesn't have a nice style. We could update it and re-organize the content to 
> make it easier to navigate.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12212) Clarify the distinction between spark.mllib and spark.ml

2015-12-08 Thread Timothy Hunter (JIRA)
Timothy Hunter created SPARK-12212:
--

 Summary: Clarify the distinction between spark.mllib and spark.ml
 Key: SPARK-12212
 URL: https://issues.apache.org/jira/browse/SPARK-12212
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation
Affects Versions: 1.5.2
Reporter: Timothy Hunter


There is some confusion in the documentation of MLlib as to what exactly MLlib 
is: is it the package, or the whole effort of ML on Spark, and how does it 
differ from spark.ml? Is MLlib going to be deprecated?

We should do the following:
 - refer to the mllib code package as spark.mllib across all the 
documentation. An alternative name is the "RDD API of MLlib".
 - refer to the project that encompasses spark.ml + spark.mllib as MLlib (this 
should be the default)
 - replace references to the "Pipeline API" with spark.ml or the "DataFrame API 
of MLlib". I would de-emphasize that this API is for building pipelines. Some 
users are led to believe from the documentation that spark.ml can only be used 
for building pipelines and that using a single algorithm can only be done with 
spark.mllib.

Most relevant places:
 - {{mllib-guide.md}}
 - {{mllib-linear-methods.md}}
 - {{mllib-dimensionality-reduction.md}}
 - {{mllib-pmml-model-export.md}}
 - {{mllib-statistics.md}}
In these files, most references to {{MLlib}} are meant to refer to 
{{spark.mllib}} instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10931) PySpark ML Models should contain Param values

2015-12-08 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15047324#comment-15047324
 ] 

Joseph K. Bradley commented on SPARK-10931:
---

OK great!  Btw, I just thought that this might be doable in a generic way which 
does not require modifying every Model:
* In the Model abstraction, override getattr to check whether the attribute 
name is the name of a Param in the parent Estimator.  If so, return that Param. 
 If not, call the default getattr, if any.


> PySpark ML Models should contain Param values
> -
>
> Key: SPARK-10931
> URL: https://issues.apache.org/jira/browse/SPARK-10931
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>
> PySpark spark.ml Models are generally wrappers around Java objects and do not 
> even contain Param values.  This JIRA is for copying the Param values from 
> the Estimator to the model.
> This can likely be solved by modifying Estimator.fit to copy Param values, 
> but should also include proper unit tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12211) Incorrect version number in graphx doc for migration from 1.1

2015-12-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15047336#comment-15047336
 ] 

Apache Spark commented on SPARK-12211:
--

User 'aray' has created a pull request for this issue:
https://github.com/apache/spark/pull/10206

> Incorrect version number in graphx doc for migration from 1.1
> -
>
> Key: SPARK-12211
> URL: https://issues.apache.org/jira/browse/SPARK-12211
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, GraphX
>Affects Versions: 1.2.0, 1.2.1, 1.2.2, 1.3.0, 1.3.1, 1.4.0, 1.4.1, 1.5.0, 
> 1.5.1, 1.5.2, 1.6.0
>Reporter: Andrew Ray
>Priority: Minor
>
> Migration from 1.1 section added to the GraphX doc in 1.2.0 (see 
> https://spark.apache.org/docs/1.2.0/graphx-programming-guide.html#migrating-from-spark-11)
>  uses {{site.SPARK_VERSION}} as the version where changes were introduced; it 
> should be just 1.2.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12211) Incorrect version number in graphx doc for migration from 1.1

2015-12-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12211:


Assignee: (was: Apache Spark)

> Incorrect version number in graphx doc for migration from 1.1
> -
>
> Key: SPARK-12211
> URL: https://issues.apache.org/jira/browse/SPARK-12211
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, GraphX
>Affects Versions: 1.2.0, 1.2.1, 1.2.2, 1.3.0, 1.3.1, 1.4.0, 1.4.1, 1.5.0, 
> 1.5.1, 1.5.2, 1.6.0
>Reporter: Andrew Ray
>Priority: Minor
>
> Migration from 1.1 section added to the GraphX doc in 1.2.0 (see 
> https://spark.apache.org/docs/1.2.0/graphx-programming-guide.html#migrating-from-spark-11)
>  uses {{site.SPARK_VERSION}} as the version where changes were introduced; it 
> should be just 1.2.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12040) Add toJson/fromJson to Vector/Vectors for PySpark

2015-12-08 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15047352#comment-15047352
 ] 

Joseph K. Bradley commented on SPARK-12040:
---

Let's hold off on this until there's a need though.  I'd prefer to close this 
for now and to reopen it if a need arises.

> Add toJson/fromJson to Vector/Vectors for PySpark
> -
>
> Key: SPARK-12040
> URL: https://issues.apache.org/jira/browse/SPARK-12040
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, PySpark
>Reporter: Yanbo Liang
>Priority: Trivial
>  Labels: starter
>
> Add toJson/fromJson to Vector/Vectors for PySpark, please refer the Scala one 
> SPARK-11766.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12205) Pivot fails Analysis when aggregate is UnresolvedFunction

2015-12-08 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-12205.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 10202
[https://github.com/apache/spark/pull/10202]

> Pivot fails Analysis when aggregate is UnresolvedFunction
> -
>
> Key: SPARK-12205
> URL: https://issues.apache.org/jira/browse/SPARK-12205
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Andrew Ray
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12205) Pivot fails Analysis when aggregate is UnresolvedFunction

2015-12-08 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-12205:
-
Assignee: Andrew Ray

> Pivot fails Analysis when aggregate is UnresolvedFunction
> -
>
> Key: SPARK-12205
> URL: https://issues.apache.org/jira/browse/SPARK-12205
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Andrew Ray
>Assignee: Andrew Ray
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11219) Make Parameter Description Format Consistent in PySpark.MLlib

2015-12-08 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15047262#comment-15047262
 ] 

Joseph K. Bradley commented on SPARK-11219:
---

Thanks for the careful assessment!  Creating subtasks sounds good.

> Make Parameter Description Format Consistent in PySpark.MLlib
> -
>
> Key: SPARK-11219
> URL: https://issues.apache.org/jira/browse/SPARK-11219
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, MLlib, PySpark
>Reporter: Bryan Cutler
>Priority: Trivial
>
> There are several different formats for describing params in PySpark.MLlib, 
> making it unclear what the preferred way to document is, i.e. vertical 
> alignment vs single line.
> This is to agree on a format and make it consistent across PySpark.MLlib.
> Following the discussion in SPARK-10560, using 2 lines with an indentation is 
> both readable and doesn't lead to changing many lines when adding/removing 
> parameters.  If the parameter uses a default value, put it in parentheses 
> on a new line under the description.
> Example:
> {noformat}
> :param stepSize:
>   Step size for each iteration of gradient descent.
>   (default: 0.1)
> :param numIterations:
>   Number of iterations run for each batch of data.
>   (default: 50)
> {noformat}
> h2. Current State of Parameter Description Formatting
> h4. Classification
>   * LogisticRegressionModel - single line descriptions, fix indentations
>   * LogisticRegressionWithSGD - vertical alignment, sporadic default values
>   * LogisticRegressionWithLBFGS - vertical alignment, sporadic default values
>   * SVMModel - single line
>   * SVMWithSGD - vertical alignment, sporadic default values
>   * NaiveBayesModel - single line
>   * NaiveBayes - single line
> h4. Clustering
>   * KMeansModel - missing param description
>   * KMeans - missing param description and defaults
>   * GaussianMixture - vertical align, incorrect default formatting
>   * PowerIterationClustering - single line with wrapped indentation, missing 
> defaults
>   * StreamingKMeansModel - single line wrapped
>   * StreamingKMeans - single line wrapped, missing defaults
>   * LDAModel - single line
>   * LDA - vertical align, missing some defaults
> h4. FPM  
>   * FPGrowth - single line
>   * PrefixSpan - single line, defaults values in backticks
> h4. Recommendation
>   * ALS - does not have param descriptions
> h4. Regression
>   * LabeledPoint - single line
>   * LinearModel - single line
>   * LinearRegressionWithSGD - vertical alignment
>   * RidgeRegressionWithSGD - vertical align
>   * IsotonicRegressionModel - single line
>   * IsotonicRegression - single line, missing default
> h4. Tree
>   * DecisionTree - single line with vertical indentation, missing defaults
>   * RandomForest - single line with wrapped indent, missing some defaults
>   * GradientBoostedTrees - single line with wrapped indent
> NOTE
> This issue will just focus on model/algorithm descriptions, which are the 
> largest source of inconsistent formatting
> evaluation.py, feature.py, random.py, utils.py - these supporting classes 
> have param descriptions as single line, but are consistent so don't need to 
> be changed



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12208) Abstract the examples into a common place

2015-12-08 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15047278#comment-15047278
 ] 

Joseph K. Bradley commented on SPARK-12208:
---

This seems reasonable, as long as it's easy for users to copy the example code 
and run it in the spark shell.

> Abstract the examples into a common place
> -
>
> Key: SPARK-12208
> URL: https://issues.apache.org/jira/browse/SPARK-12208
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, MLlib
>Affects Versions: 1.5.2
>Reporter: Timothy Hunter
>
> When we write examples in the code, we put the generation of the data along 
> with the example itself. We typically have either:
> {code}
> val data = 
> sqlContext.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
> ...
> {code}
> or some more esoteric stuff such as:
> {code}
> val data = Array(
>   (0, 0.1),
>   (1, 0.8),
>   (2, 0.2)
> )
> val dataFrame: DataFrame = sqlContext.createDataFrame(data).toDF("label", 
> "feature")
> {code}
> {code}
> val data = Array(
>   Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))),
>   Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
>   Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)
> )
> val df = sqlContext.createDataFrame(data.map(Tuple1.apply)).toDF("features")
> {code}
> I suggest we follow the example of sklearn and standardize all the generation 
> of example data inside a few methods, for example in 
> {{org.apache.spark.ml.examples.ExampleData}}. One reason is that just reading 
> the code is sometimes not enough to figure out what the data is supposed to 
> be. For example, when using {{libsvm_data}}, it is unclear what the dataframe 
> columns are. This is something we should comment somewhere.
> Also, it would help to explain in one place all the Scala idiosyncrasies, such 
> as using {{Tuple1.apply}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


