[jira] [Updated] (SPARK-7970) Optimize code for SQL queries fired on Union of RDDs (closure cleaner)

2015-10-29 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-7970:
-
Assignee: Nitin Goyal
Target Version/s: 1.6.0

> Optimize code for SQL queries fired on Union of RDDs (closure cleaner)
> --
>
> Key: SPARK-7970
> URL: https://issues.apache.org/jira/browse/SPARK-7970
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 1.2.0, 1.3.0
>Reporter: Nitin Goyal
>Assignee: Nitin Goyal
> Attachments: Screen Shot 2015-05-27 at 11.01.03 pm.png, Screen Shot 
> 2015-05-27 at 11.07.02 pm.png
>
>
> The closure cleaner slows down the execution of Spark SQL queries fired on a union 
> of RDDs. The time spent on the driver side increases linearly with the number of 
> RDDs unioned. See the following thread for more context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/ClosureCleaner-slowing-down-Spark-SQL-queries-tt12466.html
> As can be seen in the attached JProfiler screenshots, most of the time is spent in 
> ClosureCleaner's "getClassReader" method and the rest in "ensureSerializable" 
> (at least in my case).
> As far as I currently understand it, this can be fixed in two ways:
> 1. Fix at the Spark SQL level - as pointed out by yhuai, we can create a 
> MapPartitionsRDD directly instead of calling rdd.mapPartitions, which invokes 
> ClosureCleaner's clean method (see PR - 
> https://github.com/apache/spark/pull/6256).
> 2. Fix at the Spark core level -
>   (i) Make "checkSerializable" property-driven in SparkContext's clean method (see 
> the sketch below)
>   (ii) Cache the class reader for the last 'n' classes
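Below is a minimal sketch of option 2(i), assuming a hypothetical configuration key ("spark.closure.checkSerializable" is illustrative, not an existing Spark setting); ClosureCleaner.clean already accepts a checkSerializable flag, and `conf` stands for the SparkContext's SparkConf:

{code}
// Sketch only (as it might look inside SparkContext): make the serializability
// check performed by clean() configurable, so drivers that union many RDDs can
// skip the per-closure check.
private[spark] def clean[F <: AnyRef](f: F, checkSerializable: Boolean = true): F = {
  val doCheck = checkSerializable &&
    conf.getBoolean("spark.closure.checkSerializable", defaultValue = true) // hypothetical key
  ClosureCleaner.clean(f, checkSerializable = doCheck)
  f
}
{code}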






[jira] [Updated] (SPARK-11365) consolidate aggregates for summary statistics in weighted least squares

2015-10-29 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11365:
--
Component/s: MLlib

> consolidate aggregates for summary statistics in weighted least squares
> ---
>
> Key: SPARK-11365
> URL: https://issues.apache.org/jira/browse/SPARK-11365
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: holdenk
>Priority: Minor
>
> Right now we duplicate some aggregates in the aggregator; we could simplify 
> this a bit.






[jira] [Updated] (SPARK-11348) Replace addOnCompleteCallback with addTaskCompletionListener() in UnsafeExternalSorter

2015-10-29 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11348:
--
Component/s: Spark Core

> Replace addOnCompleteCallback with addTaskCompletionListener() in 
> UnsafeExternalSorter
> --
>
> Key: SPARK-11348
> URL: https://issues.apache.org/jira/browse/SPARK-11348
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Ted Yu
>Priority: Minor
> Attachments: spark-11348.txt
>
>
> When running the command from SPARK-11318, I got the following:
> {code}
> [WARNING] 
> /home/hbase/spark/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java:[141,15]
>  [deprecation]  
> addOnCompleteCallback(Function0) in TaskContext has been deprecated
> {code}
> addOnCompleteCallback should be replaced with addTaskCompletionListener()
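A minimal sketch of the replacement, written in Scala rather than the sorter's Java; {{cleanupResources()}} is a hypothetical stand-in for whatever the sorter frees when the task finishes:

{code}
import org.apache.spark.TaskContext

def cleanupResources(): Unit = { /* free memory pages, delete spill files, ... */ }

// Inside a running task:
val context = TaskContext.get()

// Deprecated API:
// context.addOnCompleteCallback(() => cleanupResources())

// Replacement:
context.addTaskCompletionListener { (ctx: TaskContext) => cleanupResources() }
{code}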






[jira] [Updated] (SPARK-11346) Spark EventLog for completed applications

2015-10-29 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11346:
--
Component/s: Spark Core

> Spark EventLog for completed applications
> -
>
> Key: SPARK-11346
> URL: https://issues.apache.org/jira/browse/SPARK-11346
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Affects Versions: 1.5.1
> Environment: Centos 6.7
>Reporter: Milan Brna
> Attachments: eventLogTest.scala
>
>
> Environment description: Spark 1.5.1 built the following way:
> ./dev/change-scala-version.sh 2.11
> export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
> ./make-distribution.sh --name custom-spark --tgz -Phadoop-2.6 -Pyarn 
> -Dscala-2.11 -Phive -Phive-thriftserver
> 4 node standalone cluster (node1-node4)
> Master configuration in spark-defaults.conf:
> spark.eventLog.enabled    true
> spark.eventLog.dir        hdfs://node1:38200/user/spark-events
> The same configuration was created during tests of event logging on all 4 
> nodes.
> The cluster is started from node1 (master) by ./start-all.sh; the thrift server 
> and history server are additionally started.
> A simple application (see the attached Scala file eventLogTest.scala) is executed 
> from a remote laptop, using the IntelliJ GUI.
> When conf.set("spark.eventLog.enabled","true") and 
> conf.set("spark.eventLog.dir","hdfs://node1:38200/user/spark-events")
> are un-commented, the application's event log directory in 
> hdfs://node1:38200/user/spark-events is created and contains data.
> The history server properly sees and presents the content. Everything is fine so 
> far.
> If, however, both parameters are turned off in the application (commented out in 
> the source), no event log directory is ever created for the application.
> I'd expect the spark.eventLog.enabled and spark.eventLog.dir parameters from 
> spark-defaults.conf, which is present on all four nodes, to be sufficient for 
> the application (even a remote one) to create an event log.
> Additionally, I have experimented with the following options on all four nodes in 
> spark-env.sh:
> SPARK_MASTER_OPTS="-Dspark.eventLog.enabled=true 
> -Dspark.eventLog.dir=hdfs://node1:38200/user/spark-events"
> SPARK_WORKER_OPTS="-Dspark.eventLog.enabled=true 
> -Dspark.eventLog.dir=hdfs://node1:38200/user/spark-events"
> SPARK_JAVA_OPTS="-Dspark.eventLog.enabled=true 
> -Dspark.eventLog.dir=hdfs://node1:38200/user/spark-events"
> JAVA_OPTS="-Dspark.eventLog.enabled=true 
> -Dspark.eventLog.dir=hdfs://node1:38200/user/spark-events"
> SPARK_CONF_DIR="/u01/com/app/spark-1.5.1-bin-cdma-spark/conf"
> SPARK_HISTORY_OPTS="-Dspark.eventLog.enabled=true 
> -Dspark.eventLog.dir=hdfs://node1:38200/user/spark-events"
> SPARK_SHUFFLE_OPTS="-Dspark.eventLog.enabled=true 
> -Dspark.eventLog.dir=hdfs://node1:38200/user/spark-events"
> SPARK_DAEMON_JAVA_OPTS="-Dspark.eventLog.enabled=true 
> -Dspark.eventLog.dir=hdfs://node1:38200/user/spark-events"
> and I have even tried to set the following option in the application's Spark 
> context configuration:
> conf.set("spark.submit.deployMode","cluster")
> but none of these settings caused an event log to appear for the completed 
> application.
> An event log is present for applications started from the cluster servers, i.e. 
> pyspark and the thrift server.
> Question: Is it correct behaviour that an application executed from remote 
> IntelliJ produces no event log unless these options are explicitly specified in 
> the application's Scala code, thereby ignoring the settings in the 
> spark-defaults.conf file?
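For reference, a minimal sketch of the in-application configuration that, per the report above, does produce event logs when set explicitly; the master URL and app name are placeholders, and the HDFS path is the one used in this report:

{code}
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("eventLogTest")                        // placeholder app name
  .setMaster("spark://node1:7077")                   // placeholder master URL
  .set("spark.eventLog.enabled", "true")
  .set("spark.eventLog.dir", "hdfs://node1:38200/user/spark-events")

val sc = new SparkContext(conf)
{code}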






[jira] [Assigned] (SPARK-11402) Allow to define a custom driver runner and executor runner

2015-10-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11402:


Assignee: (was: Apache Spark)

> Allow to define a custom driver runner and executor runner
> --
>
> Key: SPARK-11402
> URL: https://issues.apache.org/jira/browse/SPARK-11402
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Spark Core
>Reporter: Jacek Lewandowski
>Priority: Minor
>
> {{DriverRunner}} and {{ExecutorRunner}} are used by Spark Worker in 
> standalone mode to spawn driver and executor processes respectively. When 
> integrating Spark with some environments, it would be useful to allow 
> providing a custom implementation of those components.
> The idea is simple - provide factory class names for driver and executor 
> runners in Worker configuration. By default, the current implementations are 
> used. 
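A rough sketch of the factory idea described above; the trait, config key, and class names below are hypothetical illustrations, not an existing Spark API:

{code}
import org.apache.spark.SparkConf

// Hypothetical abstraction: the Worker asks a factory for something it can start
// instead of hard-coding ExecutorRunner / DriverRunner.
trait ExecutorRunnerFactory {
  def create(appId: String, execId: Int, conf: SparkConf): Runnable
}

class DefaultExecutorRunnerFactory extends ExecutorRunnerFactory {
  override def create(appId: String, execId: Int, conf: SparkConf): Runnable =
    new Runnable { def run(): Unit = println(s"launching executor $execId for $appId") }
}

// Worker-side lookup driven by a hypothetical configuration key.
def loadExecutorRunnerFactory(conf: SparkConf): ExecutorRunnerFactory = {
  val className = conf.get(
    "spark.worker.executorRunnerFactory",                      // hypothetical key
    classOf[DefaultExecutorRunnerFactory].getName)
  Class.forName(className).newInstance().asInstanceOf[ExecutorRunnerFactory]
}
{code}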






[jira] [Assigned] (SPARK-11402) Allow to define a custom driver runner and executor runner

2015-10-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11402:


Assignee: Apache Spark

> Allow to define a custom driver runner and executor runner
> --
>
> Key: SPARK-11402
> URL: https://issues.apache.org/jira/browse/SPARK-11402
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Spark Core
>Reporter: Jacek Lewandowski
>Assignee: Apache Spark
>Priority: Minor
>
> {{DriverRunner}} and {{ExecutorRunner}} are used by Spark Worker in 
> standalone mode to spawn driver and executor processes respectively. When 
> integrating Spark with some environments, it would be useful to allow 
> providing a custom implementation of those components.
> The idea is simple - provide factory class names for driver and executor 
> runners in Worker configuration. By default, the current implementations are 
> used. 






[jira] [Commented] (SPARK-11402) Allow to define a custom driver runner and executor runner

2015-10-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14980441#comment-14980441
 ] 

Apache Spark commented on SPARK-11402:
--

User 'jacek-lewandowski' has created a pull request for this issue:
https://github.com/apache/spark/pull/9354

> Allow to define a custom driver runner and executor runner
> --
>
> Key: SPARK-11402
> URL: https://issues.apache.org/jira/browse/SPARK-11402
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Spark Core
>Reporter: Jacek Lewandowski
>Priority: Minor
>
> {{DriverRunner}} and {{ExecutorRunner}} are used by Spark Worker in 
> standalone mode to spawn driver and executor processes respectively. When 
> integrating Spark with some environments, it would be useful to allow 
> providing a custom implementation of those components.
> The idea is simple - provide factory class names for driver and executor 
> runners in Worker configuration. By default, the current implementations are 
> used. 






[jira] [Commented] (SPARK-11383) Replace example code in mllib-naive-bayes.md/mllib-isotonic-regression.md using include_example

2015-10-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14980320#comment-14980320
 ] 

Apache Spark commented on SPARK-11383:
--

User 'rishabhbhardwaj' has created a pull request for this issue:
https://github.com/apache/spark/pull/9353

> Replace example code in mllib-naive-bayes.md/mllib-isotonic-regression.md 
> using include_example
> ---
>
> Key: SPARK-11383
> URL: https://issues.apache.org/jira/browse/SPARK-11383
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Xusen Yin
>  Labels: starter
>
> This is similar to SPARK-11289 but for the example code in 
> mllib-naive-bayes.md and mllib-isotonic-regression.md.






[jira] [Assigned] (SPARK-11383) Replace example code in mllib-naive-bayes.md/mllib-isotonic-regression.md using include_example

2015-10-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11383:


Assignee: Apache Spark

> Replace example code in mllib-naive-bayes.md/mllib-isotonic-regression.md 
> using include_example
> ---
>
> Key: SPARK-11383
> URL: https://issues.apache.org/jira/browse/SPARK-11383
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Xusen Yin
>Assignee: Apache Spark
>  Labels: starter
>
> This is similar to SPARK-11289 but for the example code in 
> mllib-naive-bayes.md and mllib-isotonic-regression.md.






[jira] [Commented] (SPARK-11330) Filter operation on StringType after groupBy PERSISTED brings no results

2015-10-29 Thread Saif Addin Ellafi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14980372#comment-14980372
 ] 

Saif Addin Ellafi commented on SPARK-11330:
---

Thank you. I cannot test master right now; shall I wait until the next release to 
close the issue? Or did you try the proposed dataset on master and it 
worked fine?

> Filter operation on StringType after groupBy PERSISTED brings no results
> 
>
> Key: SPARK-11330
> URL: https://issues.apache.org/jira/browse/SPARK-11330
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1
> Environment: Stand alone Cluster of five servers (happens as well in 
> local mode). sqlContext instance of HiveContext (happens as well with 
> SQLContext)
> No special options other than driver memory and executor memory.
> Parquet partitions are 512 where there are 160 cores. Happens as well with 
> other partitioning
> Data is nearly 2 billion rows.
>Reporter: Saif Addin Ellafi
>Priority: Blocker
> Attachments: bug_reproduce.zip, bug_reproduce_50k.zip
>
>
> ONLY HAPPENS WHEN PERSIST() IS CALLED
> val data = sqlContext.read.parquet("/var/data/Saif/ppnr_pqt")
> data.groupBy("vintages").count.select("vintages").filter("vintages = 
> '2007-01-01'").first
> >>> res9: org.apache.spark.sql.Row = [2007-01-01]
> data.groupBy("vintages").count.persist.select("vintages").filter("vintages = 
> '2007-01-01'").first
> >>> Exception on empty iterator stuff
> This does not happen if using another type of field, eg IntType
> data.groupBy("mm").count.persist.select("mm").filter("mm = 
> 200805").first >>> res13: org.apache.spark.sql.Row = [200805]
> NOTE: I have no idea whether I used the KRYO serializer when I stored this parquet.
> NOTE2: If setting the persist after the filtering, it works fine. But this is 
> not a good enough workaround since any filter operation afterwards will break 
> results.
> NOTE3: I have reproduced the issue with several different datasets.
> NOTE4: The only real workaround is to store the groupBy dataframe in a database 
> and reload it as a new dataframe.
> Query to raw-data works fine:
> data.select("vintages").filter("vintages = '2007-01-01'").first >>> res4: 
> org.apache.spark.sql.Row = [2007-01-01]
> Originally, the issue happened with a larger aggregation operation; the 
> result was that the data was inconsistent, returning different results on every call.
> Reducing the operation step by step, I narrowed it down to this issue.
> In any case, the original operation was:
> val data = sqlContext.read.parquet("/var/Saif/data_pqt")
> val res = data.groupBy("product", "band", "age", "vint", "mb", 
> "mm").agg(count($"account_id").as("N"), 
> sum($"balance").as("balance_eom"), sum($"balancec").as("balance_eoc"), 
> sum($"spend").as("spend"), sum($"payment").as("payment"), 
> sum($"feoc").as("feoc"), sum($"cfintbal").as("cfintbal"), count($"newacct" 
> === 1).as("newacct")).persist()
> val z = res.select("vint", "mm").filter("vint = 
> '2007-01-01'").select("mm").distinct.collect
> z.length
> >>> res0: Int = 102
> res.unpersist()
> val z = res.select("vint", "mm").filter("vint = 
> '2007-01-01'").select("mm").distinct.collect
> z.length
> >>> res1: Int = 103






[jira] [Updated] (SPARK-11326) Split networking in standalone mode

2015-10-29 Thread Jacek Lewandowski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jacek Lewandowski updated SPARK-11326:
--
Description: 
h3.The idea

Currently, in standalone mode, all components need to use the same secure token 
for all network connections if they want any security at all. 

This ticket is intended to split the communication in standalone mode and make it 
more like YARN mode: application-internal communication, scheduler-internal 
communication, and communication between the client and the scheduler. 

Such a refactoring will allow the scheduler (master, workers) to use a distinct 
secret, which will remain unknown to users. Similarly, it will allow better 
security in applications, because each application will be able to use a distinct 
secret as well. 

By providing Kerberos-based SASL authentication/encryption for connections between 
a client (Client or AppClient) and the Spark Master, it will be possible to 
introduce authentication, automatic generation of digest tokens, and safe sharing 
of those tokens among the application processes. 

h3.User facing changes when running application

h4.General principles:
- conf: {{spark.authenticate.secret}} is *never sent* over the wire
- env: {{SPARK_AUTH_SECRET}} is *never sent* over the wire
- In all situations the env variable, if present, overrides the conf variable. 
- Whenever a user has to pass a secret, it is better (safer) to do so through the 
env variable
- In work modes with multiple secrets we assume encrypted communication between 
client and master, between driver and master, and between master and workers


h4.Work modes and descriptions
h5.Client mode, single secret
h6.Configuration
- env: {{SPARK_AUTH_SECRET=secret}} or conf: 
{{spark.authenticate.secret=secret}}

h6.Description
- The driver is running locally
- The driver will neither send env: {{SPARK_AUTH_SECRET}} nor conf: 
{{spark.authenticate.secret}}
- The driver will use either env: {{SPARK_AUTH_SECRET}} or conf: 
{{spark.authenticate.secret}} for connection to the master
- _ExecutorRunner_ will not find any secret in _ApplicationDescription_ so it 
will look for it in the worker configuration and it will find it there (its 
presence is implied). 


h5.Client mode, multiple secrets
h6.Configuration
- env: {{SPARK_APP_AUTH_SECRET=app_secret}} or conf: 
{{spark.app.authenticate.secret=secret}}
- env: {{SPARK_SUBMISSION_AUTH_SECRET=scheduler_secret}} or conf: 
{{spark.submission.authenticate.secret=scheduler_secret}}

h6.Description
- The driver is running locally
- The driver will use either env: {{SPARK_SUBMISSION_AUTH_SECRET}} or conf: 
{{spark.submission.authenticate.secret}} to connect to the master
- The driver will neither send env: {{SPARK_SUBMISSION_AUTH_SECRET}} nor conf: 
{{spark.submission.authenticate.secret}}
- The driver will use either {{SPARK_APP_AUTH_SECRET}} or conf: 
{{spark.app.authenticate.secret}} for communication with the executors
- The driver will send {{spark.executorEnv.SPARK_AUTH_SECRET=app_secret}} so 
that the executors can use it to communicate with the driver
- _ExecutorRunner_ will find that secret in _ApplicationDescription_ and it 
will set it in env: {{SPARK_AUTH_SECRET}} which will be read by 
_ExecutorBackend_ afterwards and used for all the connections (with driver, 
other executors and external shuffle service).
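As a minimal sketch, the "client mode, multiple secrets" setup above could be expressed with the following SparkConf settings; the two secret keys are the ones proposed in this ticket, not released Spark configuration:

{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.authenticate", "true")
  .set("spark.app.authenticate.secret", "app_secret")              // driver <-> executors (proposed key)
  .set("spark.submission.authenticate.secret", "scheduler_secret") // client/driver <-> master (proposed key)
{code}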


h5.Cluster mode, single secret
h6.Configuration
- env: {{SPARK_AUTH_SECRET=secret}} or conf: 
{{spark.authenticate.secret=secret}}

h6.Description
- The driver is run by _DriverRunner_, which is a part of the worker
- The client will neither send env: {{SPARK_AUTH_SECRET}} nor conf: 
{{spark.authenticate.secret}}
- The client will use either env: {{SPARK_AUTH_SECRET}} or conf: 
{{spark.authenticate.secret}} for connection to the master and submit the driver
- _DriverRunner_ will not find any secret in _DriverDescription_ so it will 
look for it in the worker configuration and it will find it there (its presence 
is implied)
- _DriverRunner_ will set the secret it found in env: {{SPARK_AUTH_SECRET}} so 
that the driver will find it and use it for all the connections
- The driver will use either env: {{SPARK_AUTH_SECRET}} or conf: 
{{spark.authenticate.secret}} for connection to the master
- _ExecutorRunner_ will not find any secret in _ApplicationDescription_ so it 
will look for it in the worker configuration and it will find it there (its 
presence is implied). 


h5.Cluster mode, multiple secrets
h6.Configuration
- env: {{SPARK_APP_AUTH_SECRET=app_secret}} or conf: 
{{spark.app.authenticate.secret=secret}}
- env: {{SPARK_SUBMISSION_AUTH_SECRET=scheduler_secret}} or conf: 
{{spark.submission.authenticate.secret=scheduler_secret}}

h6.Description
- The driver is run by _DriverRunner_, which is a part of the worker
- The client will use either env: {{SPARK_SUBMISSION_AUTH_SECRET}} or conf: 

[jira] [Updated] (SPARK-11270) Add improved equality testing for TopicAndPartition from the Kafka Streaming API

2015-10-29 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11270:
--
Assignee: Nick Evans

> Add improved equality testing for TopicAndPartition from the Kafka Streaming 
> API
> 
>
> Key: SPARK-11270
> URL: https://issues.apache.org/jira/browse/SPARK-11270
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Streaming
>Affects Versions: 1.5.1
>Reporter: Nick Evans
>Assignee: Nick Evans
>Priority: Minor
> Fix For: 1.5.3, 1.6.0
>
>
> Hey, sorry, I'm new to contributing to Spark! Let me know if I'm doing anything 
> wrong.
> This issue concerns equality testing of TopicAndPartition objects. 
> The change allows you to test that the topics and partitions of two such objects 
> are equal, as opposed to checking that the two objects are the same instance.






[jira] [Commented] (SPARK-11402) Allow to define a custom driver runner and executor runner

2015-10-29 Thread Jacek Lewandowski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14980454#comment-14980454
 ] 

Jacek Lewandowski commented on SPARK-11402:
---

I'm not sure I get what you mean. Do you think that making 
{{ExecutorRunner}} and {{DriverRunner}} pluggable is too much of an abstraction?


> Allow to define a custom driver runner and executor runner
> --
>
> Key: SPARK-11402
> URL: https://issues.apache.org/jira/browse/SPARK-11402
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Spark Core
>Reporter: Jacek Lewandowski
>Priority: Minor
>
> {{DriverRunner}} and {{ExecutorRunner}} are used by Spark Worker in 
> standalone mode to spawn driver and executor processes respectively. When 
> integrating Spark with some environments, it would be useful to allow 
> providing a custom implementation of those components.
> The idea is simple - provide factory class names for driver and executor 
> runners in Worker configuration. By default, the current implementations are 
> used. 






[jira] [Assigned] (SPARK-11403) Log something when dying from OnOutOfMemoryError

2015-10-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11403:


Assignee: Apache Spark

> Log something when dying from OnOutOfMemoryError
> 
>
> Key: SPARK-11403
> URL: https://issues.apache.org/jira/browse/SPARK-11403
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 1.5.1
>Reporter: Daniel Darabos
>Assignee: Apache Spark
>Priority: Trivial
>
> Executors on YARN run with the {{-XX:OnOutOfMemoryError='kill %p'}} flag. The 
> motivation, I think, is to avoid getting into an unpredictable state where some 
> threads may have been lost.
> The problem is that when this happens nothing is logged. The executor does 
> not log anything; it just exits with exit code 143, which is only recorded in the 
> NodeManager log.
> I'd like to add a tiny log message to make debugging such issues a bit easier.






[jira] [Commented] (SPARK-11403) Log something when dying from OnOutOfMemoryError

2015-10-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14980456#comment-14980456
 ] 

Apache Spark commented on SPARK-11403:
--

User 'darabos' has created a pull request for this issue:
https://github.com/apache/spark/pull/9355

> Log something when dying from OnOutOfMemoryError
> 
>
> Key: SPARK-11403
> URL: https://issues.apache.org/jira/browse/SPARK-11403
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 1.5.1
>Reporter: Daniel Darabos
>Priority: Trivial
>
> Executors on YARN run with the {{-XX:OnOutOfMemoryError='kill %p'}} flag. The 
> motivation, I think, is to avoid getting into an unpredictable state where some 
> threads may have been lost.
> The problem is that when this happens nothing is logged. The executor does 
> not log anything; it just exits with exit code 143, which is only recorded in the 
> NodeManager log.
> I'd like to add a tiny log message to make debugging such issues a bit easier.






[jira] [Resolved] (SPARK-10958) Upgrade json4s version to 3.3.0 to eliminate common serialization issues with Formats.

2015-10-29 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-10958.
---
Resolution: Won't Fix

> Upgrade json4s version to 3.3.0 to eliminate common serialization issues with 
> Formats.
> --
>
> Key: SPARK-10958
> URL: https://issues.apache.org/jira/browse/SPARK-10958
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Tyler Prete
>Priority: Minor
>
> json4s recently released version 3.3.0.
> Feature-wise, it's mostly the same, but it has one change that is very 
> relevant to Spark.
> Formats are now serializable.
> https://github.com/json4s/json4s/pull/285
> This has been a common source of pain for engineers at Radius, and I'm sure 
> elsewhere.
> I've made a PR for the change on GitHub:
> https://github.com/apache/spark/pull/8992
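For context, a hedged illustration of the pain point: before json4s 3.3.0, Formats instances such as DefaultFormats were not Serializable, so capturing one in a closure shipped to executors fails at task serialization time. The RDD contents and JSON shape below are made up for illustration:

{code}
import org.apache.spark.SparkContext
import org.json4s._
import org.json4s.jackson.JsonMethods._

def parseNames(sc: SparkContext): Array[String] = {
  implicit val formats = DefaultFormats   // not Serializable before json4s 3.3.0

  val lines = sc.parallelize(Seq("""{"name": "a"}""", """{"name": "b"}"""))
  // The closure below captures `formats`; with json4s < 3.3.0 this fails with a
  // java.io.NotSerializableException when the task closure is serialized.
  lines.map(s => (parse(s) \ "name").extract[String]).collect()
}
{code}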






[jira] [Commented] (SPARK-5569) Checkpoints cannot reference classes defined outside of Spark's assembly

2015-10-29 Thread Deming Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14980492#comment-14980492
 ] 

Deming Zhu commented on SPARK-5569:
---

For the OffsetRange ClassNotFound issue, I'm confident this patch solves it, 
because the sample code on the Spark home page could not run without it.

Here's the sample code from the Spark home page:
```
// Hold a reference to the current offset ranges, so it can be used downstream
 var offsetRanges = Array[OffsetRange]()

 directKafkaStream.transform { rdd =>
   offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
   rdd
 }.map {
   ...
 }.foreachRDD { rdd =>
   for (o <- offsetRanges) {
 println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
   }
   ...
 }
```
We can see that offsetRanges is an array and needs to be deserialized by 
executors.

For the KafkaRDDPartition ClassNotFound issue, since I cannot see the source code,
I cannot be 100% sure that this patch solves it.

In my opinion, though, this kind of issue is caused by Spark trying to load 
an array object. 

Could any of you try this patch and tell us whether your issues are solved or 
not?

> Checkpoints cannot reference classes defined outside of Spark's assembly
> 
>
> Key: SPARK-5569
> URL: https://issues.apache.org/jira/browse/SPARK-5569
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Patrick Wendell
>
> Not sure if this is a bug or a feature, but it's not obvious, so wanted to 
> create a JIRA to make sure we document this behavior.
> First documented by Cody Koeninger:
> https://gist.github.com/koeninger/561a61482cd1b5b3600c
> {code}
> 15/01/12 16:07:07 INFO CheckpointReader: Attempting to load checkpoint from 
> file file:/var/tmp/cp/checkpoint-142110041.bk
> 15/01/12 16:07:07 WARN CheckpointReader: Error reading checkpoint from file 
> file:/var/tmp/cp/checkpoint-142110041.bk
> java.io.IOException: java.lang.ClassNotFoundException: 
> org.apache.spark.rdd.kafka.KafkaRDDPartition
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1043)
> at 
> org.apache.spark.streaming.dstream.DStreamCheckpointData.readObject(DStreamCheckpointData.scala:146)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
> at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
> at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1706)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344)
> at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at 
> java.io.ObjectInputStream.defaultReadObject(ObjectInputStream.java:500)
> at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$readObject$1.apply$mcV$sp(DStreamGraph.scala:180)
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1040)
> at 
> org.apache.spark.streaming.DStreamGraph.readObject(DStreamGraph.scala:176)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
> at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
> at 
> 

[jira] [Updated] (SPARK-11322) Keep full stack track in captured exception in PySpark

2015-10-29 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11322:
--
Assignee: Liang-Chi Hsieh

> Keep full stack track in captured exception in PySpark
> --
>
> Key: SPARK-11322
> URL: https://issues.apache.org/jira/browse/SPARK-11322
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
> Fix For: 1.6.0
>
>
> We should keep full stack trace in captured exception in PySpark.






[jira] [Updated] (SPARK-11370) fix a bug in GroupedIterator and create unit test for it

2015-10-29 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11370:
--
Assignee: Wenchen Fan

> fix a bug in GroupedIterator and create unit test for it
> 
>
> Key: SPARK-11370
> URL: https://issues.apache.org/jira/browse/SPARK-11370
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 1.6.0
>
>







[jira] [Created] (SPARK-11403) Log something when dying from OnOutOfMemoryError

2015-10-29 Thread Daniel Darabos (JIRA)
Daniel Darabos created SPARK-11403:
--

 Summary: Log something when dying from OnOutOfMemoryError
 Key: SPARK-11403
 URL: https://issues.apache.org/jira/browse/SPARK-11403
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 1.5.1
Reporter: Daniel Darabos
Priority: Trivial


Executors on YARN run with the {{-XX:OnOutOfMemoryError='kill %p'}} flag. The 
motivation, I think, is to avoid getting into an unpredictable state where some 
threads may have been lost.

The problem is that when this happens nothing is logged. The executor does not 
log anything; it just exits with exit code 143, which is only recorded in the 
NodeManager log.

I'd like to add a tiny log message to make debugging such issues a bit easier.
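A hedged sketch of one way the message could be surfaced, by extending the OOM hook passed to the JVM; this illustrates the idea rather than the actual patch, and `javaOpts` is a placeholder for the executor launch options being assembled:

{code}
// Sketch only: extend the kill hook so the executor's stderr records why the
// JVM exited with code 143.
val javaOpts = scala.collection.mutable.ListBuffer[String]()
javaOpts +=
  "-XX:OnOutOfMemoryError='echo \"Executor dying from OnOutOfMemoryError, killing JVM\" 1>&2; kill %p'"
{code}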






[jira] [Issue Comment Deleted] (SPARK-5569) Checkpoints cannot reference classes defined outside of Spark's assembly

2015-10-29 Thread Deming Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deming Zhu updated SPARK-5569:
--
Comment: was deleted

(was: For OffsetRange ClassNotFound issue, I'm sure this patch could solve the 
issue because actually the document on
Spark Home Page could not run without this patch

Here's the sample code on Spark Home Page:
```
// Hold a reference to the current offset ranges, so it can be used downstream
 var offsetRanges = Array[OffsetRange]()

 directKafkaStream.transform { rdd =>
   offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
   rdd
 }.map {
   ...
 }.foreachRDD { rdd =>
   for (o <- offsetRanges) {
 println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
   }
   ...
 }
```
we can see that offsetRanges is an array and need to be deserialized by 
executors.

For KafkaRDDPartition ClassNotFound issue, since I cannot see the source code,
I cannot 100% sure that this patch can solve the issue.

But, In my opinion, this kind of issue are all caused by Spark trying to load 
an array object. 

Can anyone of you try this patch can tell us whether your issues are solved or 
not?)

> Checkpoints cannot reference classes defined outside of Spark's assembly
> 
>
> Key: SPARK-5569
> URL: https://issues.apache.org/jira/browse/SPARK-5569
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Patrick Wendell
>
> Not sure if this is a bug or a feature, but it's not obvious, so wanted to 
> create a JIRA to make sure we document this behavior.
> First documented by Cody Koeninger:
> https://gist.github.com/koeninger/561a61482cd1b5b3600c
> {code}
> 15/01/12 16:07:07 INFO CheckpointReader: Attempting to load checkpoint from 
> file file:/var/tmp/cp/checkpoint-142110041.bk
> 15/01/12 16:07:07 WARN CheckpointReader: Error reading checkpoint from file 
> file:/var/tmp/cp/checkpoint-142110041.bk
> java.io.IOException: java.lang.ClassNotFoundException: 
> org.apache.spark.rdd.kafka.KafkaRDDPartition
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1043)
> at 
> org.apache.spark.streaming.dstream.DStreamCheckpointData.readObject(DStreamCheckpointData.scala:146)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
> at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
> at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1706)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344)
> at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at 
> java.io.ObjectInputStream.defaultReadObject(ObjectInputStream.java:500)
> at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$readObject$1.apply$mcV$sp(DStreamGraph.scala:180)
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1040)
> at 
> org.apache.spark.streaming.DStreamGraph.readObject(DStreamGraph.scala:176)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
> at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
> at 
> 

[jira] [Issue Comment Deleted] (SPARK-5569) Checkpoints cannot reference classes defined outside of Spark's assembly

2015-10-29 Thread Deming Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deming Zhu updated SPARK-5569:
--
Comment: was deleted

(was: For OffsetRange ClassNotFound issue, I'm sure this patch could solve the 
issue because actually the document on
Spark Home Page could not run without this patch

Here's the sample code on Spark Home Page:
```
// Hold a reference to the current offset ranges, so it can be used downstream
 var offsetRanges = Array[OffsetRange]()

 directKafkaStream.transform { rdd =>
   offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
   rdd
 }.map {
   ...
 }.foreachRDD { rdd =>
   for (o <- offsetRanges) {
 println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
   }
   ...
 }
```
we can see that offsetRanges is an array and need to be deserialized by 
executors.

For KafkaRDDPartition ClassNotFound issue, since I cannot see the source code,
I cannot 100% sure that this patch can solve the issue.

But, In my opinion, this kind of issue are all caused by Spark trying to load 
an array object. 

Can anyone of you try this patch can tell us whether your issues are solved or 
not?)

> Checkpoints cannot reference classes defined outside of Spark's assembly
> 
>
> Key: SPARK-5569
> URL: https://issues.apache.org/jira/browse/SPARK-5569
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Patrick Wendell
>
> Not sure if this is a bug or a feature, but it's not obvious, so wanted to 
> create a JIRA to make sure we document this behavior.
> First documented by Cody Koeninger:
> https://gist.github.com/koeninger/561a61482cd1b5b3600c
> {code}
> 15/01/12 16:07:07 INFO CheckpointReader: Attempting to load checkpoint from 
> file file:/var/tmp/cp/checkpoint-142110041.bk
> 15/01/12 16:07:07 WARN CheckpointReader: Error reading checkpoint from file 
> file:/var/tmp/cp/checkpoint-142110041.bk
> java.io.IOException: java.lang.ClassNotFoundException: 
> org.apache.spark.rdd.kafka.KafkaRDDPartition
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1043)
> at 
> org.apache.spark.streaming.dstream.DStreamCheckpointData.readObject(DStreamCheckpointData.scala:146)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
> at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
> at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1706)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344)
> at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at 
> java.io.ObjectInputStream.defaultReadObject(ObjectInputStream.java:500)
> at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$readObject$1.apply$mcV$sp(DStreamGraph.scala:180)
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1040)
> at 
> org.apache.spark.streaming.DStreamGraph.readObject(DStreamGraph.scala:176)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
> at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
> at 
> 

[jira] [Issue Comment Deleted] (SPARK-5569) Checkpoints cannot reference classes defined outside of Spark's assembly

2015-10-29 Thread Deming Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deming Zhu updated SPARK-5569:
--
Comment: was deleted

(was: For OffsetRange ClassNotFound issue, I'm sure this patch could solve the 
issue because actually the document on
Spark Home Page could not run without this patch

Here's the sample code on Spark Home Page:
```
// Hold a reference to the current offset ranges, so it can be used downstream
 var offsetRanges = Array[OffsetRange]()

 directKafkaStream.transform { rdd =>
   offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
   rdd
 }.map {
   ...
 }.foreachRDD { rdd =>
   for (o <- offsetRanges) {
 println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
   }
   ...
 }
```
we can see that offsetRanges is an array and need to be deserialized by 
executors.

For KafkaRDDPartition ClassNotFound issue, since I cannot see the source code,
I cannot 100% sure that this patch can solve the issue.

But, In my opinion, this kind of issue are all caused by Spark trying to load 
an array object. 

Can anyone of you try this patch can tell us whether your issues are solved or 
not?)

> Checkpoints cannot reference classes defined outside of Spark's assembly
> 
>
> Key: SPARK-5569
> URL: https://issues.apache.org/jira/browse/SPARK-5569
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Patrick Wendell
>
> Not sure if this is a bug or a feature, but it's not obvious, so wanted to 
> create a JIRA to make sure we document this behavior.
> First documented by Cody Koeninger:
> https://gist.github.com/koeninger/561a61482cd1b5b3600c
> {code}
> 15/01/12 16:07:07 INFO CheckpointReader: Attempting to load checkpoint from 
> file file:/var/tmp/cp/checkpoint-142110041.bk
> 15/01/12 16:07:07 WARN CheckpointReader: Error reading checkpoint from file 
> file:/var/tmp/cp/checkpoint-142110041.bk
> java.io.IOException: java.lang.ClassNotFoundException: 
> org.apache.spark.rdd.kafka.KafkaRDDPartition
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1043)
> at 
> org.apache.spark.streaming.dstream.DStreamCheckpointData.readObject(DStreamCheckpointData.scala:146)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
> at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
> at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1706)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344)
> at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at 
> java.io.ObjectInputStream.defaultReadObject(ObjectInputStream.java:500)
> at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$readObject$1.apply$mcV$sp(DStreamGraph.scala:180)
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1040)
> at 
> org.apache.spark.streaming.DStreamGraph.readObject(DStreamGraph.scala:176)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
> at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
> at 
> 

[jira] [Updated] (SPARK-11371) Make "mean" an alias for "avg" operator

2015-10-29 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11371:
--
Component/s: SQL

[~tedyu] again: 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
Add a component. We don't use patches.

I don't think this is a helpful change since you're helping people write 
nonstandard SQL.

> Make "mean" an alias for "avg" operator
> ---
>
> Key: SPARK-11371
> URL: https://issues.apache.org/jira/browse/SPARK-11371
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Ted Yu
>Priority: Minor
> Attachments: spark-11371-v1.patch
>
>
> From Reynold in the thread 'Exception when using some aggregate operators'  
> (http://search-hadoop.com/m/q3RTt0xFr22nXB4/):
> I don't think these are bugs. The SQL standard for average is "avg", not 
> "mean". Similarly, a distinct count is supposed to be written as 
> "count(distinct col)", not "countDistinct(col)".
> We can, however, make "mean" an alias for "avg" to improve compatibility 
> between DataFrame and SQL.
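A small illustration of the requested compatibility; {{mean}} already exists as an alias for {{avg}} in the DataFrame function API, and the ticket asks for the same on the SQL side. The DataFrame {{df}} and its column {{value}} are hypothetical:

{code}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{avg, mean}

// `df` is a hypothetical DataFrame with a numeric column "value".
def averages(df: DataFrame): Unit = {
  df.agg(avg("value")).show()   // standard SQL name
  df.agg(mean("value")).show()  // existing DataFrame alias for avg

  // SQL side (what the ticket proposes to also accept):
  //   SELECT avg(value)  FROM t   -- standard SQL
  //   SELECT mean(value) FROM t   -- proposed alias
}
{code}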






[jira] [Commented] (SPARK-11402) Allow to define a custom driver runner and executor runner

2015-10-29 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14980429#comment-14980429
 ] 

Sean Owen commented on SPARK-11402:
---

I personally think this is an abstraction too far. When you do this, it sounds 
strictly helpful ("why not just make it pluggable?"), but implicitly you are 
promising an entire API that it's not at all obvious you want to promise.

> Allow to define a custom driver runner and executor runner
> --
>
> Key: SPARK-11402
> URL: https://issues.apache.org/jira/browse/SPARK-11402
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Spark Core
>Reporter: Jacek Lewandowski
>Priority: Minor
>
> {{DriverRunner}} and {{ExecutorRunner}} are used by Spark Worker in 
> standalone mode to spawn driver and executor processes respectively. When 
> integrating Spark with some environments, it would be useful to allow 
> providing a custom implementation of those components.
> The idea is simple - provide factory class names for driver and executor 
> runners in Worker configuration. By default, the current implementations are 
> used. 






[jira] [Updated] (SPARK-10986) ClassNotFoundException when running on Client mode, with a Mesos master.

2015-10-29 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-10986:
--
Affects Version/s: 1.5.2
 Target Version/s: 1.6.0

> ClassNotFoundException when running on Client mode, with a Mesos master.
> 
>
> Key: SPARK-10986
> URL: https://issues.apache.org/jira/browse/SPARK-10986
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.5.2
> Environment: OSX, Java 8, Mesos 0.25.0
> HEAD of Spark (`f5d154bc731aedfc2eecdb4ed6af8cac820511c9`)
> Built from source:
> build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package
>Reporter: Joseph Wu
>Priority: Blocker
>  Labels: mesos, spark
>
> This works fine in coarse-grained mode and only fails in *fine-grained mode*.
> When running an example task on a Mesos cluster (local master, local agent), 
> any Spark task will stall with the following error (in the executor's 
> stderr):
> {code}
> 15/10/07 15:21:14 INFO Utils: Successfully started service 'sparkExecutor' on 
> port 53689.
> 15/10/07 15:21:14 WARN TransportChannelHandler: Exception in connection from 
> /10.0.79.8:53673
> java.lang.ClassNotFoundException: org/apache/spark/rpc/netty/AskResponse
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:348)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:68)
>   at 
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1613)
>   at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:109)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEnv$$anonfun$deserialize$1.apply(NettyRpcEnv.scala:227)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEnv.deserialize(NettyRpcEnv.scala:265)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEnv.deserialize(NettyRpcEnv.scala:226)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEnv$$anon$3$$anon$4.onSuccess(NettyRpcEnv.scala:196)
>   at 
> org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:152)
>   at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:103)
>   at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51)
>   at 
> io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
>   at 
> io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
>   at 
> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
>   at 
> io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:244)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
>   at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
>   at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
>   at 

[jira] [Created] (SPARK-11402) Allow to define a custom driver runner and executor runner

2015-10-29 Thread Jacek Lewandowski (JIRA)
Jacek Lewandowski created SPARK-11402:
-

 Summary: Allow to define a custom driver runner and executor runner
 Key: SPARK-11402
 URL: https://issues.apache.org/jira/browse/SPARK-11402
 Project: Spark
  Issue Type: Improvement
  Components: Deploy, Spark Core
Reporter: Jacek Lewandowski
Priority: Minor


{{DriverRunner}} and {{ExecutorRunner}} are used by Spark Worker in standalone 
mode to spawn driver and executor processes respectively. When integrating 
Spark with some environments, it would be useful to allow providing a custom 
implementation of those components.

The idea is simple - provide factory class names for driver and executor 
runners in Worker configuration. By default, the current implementations are 
used. 
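
For illustration, the Worker configuration might end up pointing at factory 
classes along these lines (the property names and the trait below are 
hypothetical, purely to sketch the idea, not an existing Spark API):

{code}
# hypothetical worker properties selecting custom runner factories
spark.worker.driverRunnerFactory=com.example.deploy.MyDriverRunnerFactory
spark.worker.executorRunnerFactory=com.example.deploy.MyExecutorRunnerFactory
{code}

where each factory implements a small SPI that the Worker loads via reflection, e.g.:

{code}
import java.io.File
import org.apache.spark.SparkConf
// DriverRunner/ExecutorRunner are Spark's internal worker classes; a real
// hook like this would live inside the deploy package where they are visible.
import org.apache.spark.deploy.worker.{DriverRunner, ExecutorRunner}

// Hypothetical SPI: the Worker would instantiate the configured class and ask
// it for runner instances instead of constructing the defaults directly.
trait RunnerFactory {
  def createDriverRunner(conf: SparkConf, driverId: String, workDir: File): DriverRunner
  def createExecutorRunner(conf: SparkConf, appId: String, execId: Int, workDir: File): ExecutorRunner
}
{code}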




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5569) Checkpoints cannot reference classes defined outside of Spark's assembly

2015-10-29 Thread Deming Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14980493#comment-14980493
 ] 

Deming Zhu commented on SPARK-5569:
---

For the OffsetRange ClassNotFoundException issue, I'm sure this patch solves it, 
because the example in the documentation on the Spark home page does not run 
without it.

Here's the sample code from the Spark home page:
```
// Hold a reference to the current offset ranges, so it can be used downstream
 var offsetRanges = Array[OffsetRange]()

 directKafkaStream.transform { rdd =>
   offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
   rdd
 }.map {
   ...
 }.foreachRDD { rdd =>
   for (o <- offsetRanges) {
 println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
   }
   ...
 }
```
We can see that offsetRanges is an array and needs to be deserialized by the 
executors.

For the KafkaRDDPartition ClassNotFoundException issue, since I cannot see the 
source code, I cannot be 100% sure that this patch solves it.

In my opinion, though, this kind of issue is caused by Spark trying to load 
an array object.

Can any of you try this patch and tell us whether your issue is solved?

> Checkpoints cannot reference classes defined outside of Spark's assembly
> 
>
> Key: SPARK-5569
> URL: https://issues.apache.org/jira/browse/SPARK-5569
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Patrick Wendell
>
> Not sure if this is a bug or a feature, but it's not obvious, so wanted to 
> create a JIRA to make sure we document this behavior.
> First documented by Cody Koeninger:
> https://gist.github.com/koeninger/561a61482cd1b5b3600c
> {code}
> 15/01/12 16:07:07 INFO CheckpointReader: Attempting to load checkpoint from 
> file file:/var/tmp/cp/checkpoint-142110041.bk
> 15/01/12 16:07:07 WARN CheckpointReader: Error reading checkpoint from file 
> file:/var/tmp/cp/checkpoint-142110041.bk
> java.io.IOException: java.lang.ClassNotFoundException: 
> org.apache.spark.rdd.kafka.KafkaRDDPartition
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1043)
> at 
> org.apache.spark.streaming.dstream.DStreamCheckpointData.readObject(DStreamCheckpointData.scala:146)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
> at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
> at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1706)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344)
> at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at 
> java.io.ObjectInputStream.defaultReadObject(ObjectInputStream.java:500)
> at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$readObject$1.apply$mcV$sp(DStreamGraph.scala:180)
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1040)
> at 
> org.apache.spark.streaming.DStreamGraph.readObject(DStreamGraph.scala:176)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
> at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
> at 
> 

[jira] [Updated] (SPARK-8582) Optimize checkpointing to avoid computing an RDD twice

2015-10-29 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-8582:
-
Assignee: Shixiong Zhu
Target Version/s: 1.6.0  (was: )

> Optimize checkpointing to avoid computing an RDD twice
> --
>
> Key: SPARK-8582
> URL: https://issues.apache.org/jira/browse/SPARK-8582
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Andrew Or
>Assignee: Shixiong Zhu
>
> In Spark, checkpointing allows the user to truncate the lineage of his RDD 
> and save the intermediate contents to HDFS for fault tolerance. However, this 
> is not currently implemented super efficiently:
> Every time we checkpoint an RDD, we actually compute it twice: once during 
> the action that triggered the checkpointing in the first place, and once 
> while we checkpoint (we iterate through an RDD's partitions and write them to 
> disk). See this line for more detail: 
> https://github.com/apache/spark/blob/0401cbaa8ee51c71f43604f338b65022a479da0a/core/src/main/scala/org/apache/spark/rdd/RDDCheckpointData.scala#L102.
> Instead, we should have a `CheckpointingIterator` that writes checkpoint 
> data to HDFS while we run the action. This will speed up many usages of 
> `RDD#checkpoint` by 2X.
> (Alternatively, the user can just cache the RDD before checkpointing it, but 
> this is not always viable for very large input data. It's also not a great 
> API to use in general.)
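
For context, the caching workaround mentioned in the last paragraph looks roughly 
like this today (a minimal sketch; the checkpoint directory and input path are 
placeholders):

{code}
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("checkpoint-sketch"))
sc.setCheckpointDir("hdfs:///tmp/checkpoints")             // placeholder path

val rdd = sc.textFile("hdfs:///data/input").map(_.length)  // placeholder input
rdd.cache()       // keeps partitions around so the checkpoint write reuses them
rdd.checkpoint()  // only marks the RDD; the write happens after the next action
rdd.count()       // without cache(), this would compute the RDD a second time for the write
{code}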



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5569) Checkpoints cannot reference classes defined outside of Spark's assembly

2015-10-29 Thread Deming Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14980493#comment-14980493
 ] 

Deming Zhu edited comment on SPARK-5569 at 10/29/15 2:20 PM:
-

For the OffsetRange ClassNotFoundException issue, I'm sure this patch solves it, 
because the example in the documentation on the Spark home page does not run 
without it.

Here's the sample code from the Spark home page:

{code:title=kafkaSample}

// Hold a reference to the current offset ranges, so it can be used downstream
 var offsetRanges = Array[OffsetRange]()

 directKafkaStream.transform { rdd =>
   offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
   rdd
 }.map {
   ...
 }.foreachRDD { rdd =>
   for (o <- offsetRanges) {
 println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
   }
   ...
 }
{code}
We can see that offsetRanges is an array and needs to be deserialized by the 
executors.

For the KafkaRDDPartition ClassNotFoundException issue, since I cannot see the 
source code, I cannot be 100% sure that this patch solves it.

In my opinion, though, this kind of issue is caused by Spark trying to load 
an array object.

Can any of you try this patch and tell us whether your issue is solved?


was (Author: maxwell):
For OffsetRange ClassNotFound issue, I'm sure this patch could solve the issue 
because actually the document on
Spark Home Page could not run without this patch

Here's the sample code on Spark Home Page:
```
// Hold a reference to the current offset ranges, so it can be used downstream
 var offsetRanges = Array[OffsetRange]()

 directKafkaStream.transform { rdd =>
   offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
   rdd
 }.map {
   ...
 }.foreachRDD { rdd =>
   for (o <- offsetRanges) {
 println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
   }
   ...
 }
```
we can see that offsetRanges is an array and need to be deserialized by 
executors.

For KafkaRDDPartition ClassNotFound issue, since I cannot see the source code,
I cannot 100% sure that this patch can solve the issue.

But, In my opinion, this kind of issue are all caused by Spark trying to load 
an array object. 

Can anyone of you try this patch can tell us whether your issues are solved or 
not?

> Checkpoints cannot reference classes defined outside of Spark's assembly
> 
>
> Key: SPARK-5569
> URL: https://issues.apache.org/jira/browse/SPARK-5569
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Patrick Wendell
>
> Not sure if this is a bug or a feature, but it's not obvious, so wanted to 
> create a JIRA to make sure we document this behavior.
> First documented by Cody Koeninger:
> https://gist.github.com/koeninger/561a61482cd1b5b3600c
> {code}
> 15/01/12 16:07:07 INFO CheckpointReader: Attempting to load checkpoint from 
> file file:/var/tmp/cp/checkpoint-142110041.bk
> 15/01/12 16:07:07 WARN CheckpointReader: Error reading checkpoint from file 
> file:/var/tmp/cp/checkpoint-142110041.bk
> java.io.IOException: java.lang.ClassNotFoundException: 
> org.apache.spark.rdd.kafka.KafkaRDDPartition
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1043)
> at 
> org.apache.spark.streaming.dstream.DStreamCheckpointData.readObject(DStreamCheckpointData.scala:146)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
> at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
> at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1706)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344)
> at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> at 
> 

[jira] [Updated] (SPARK-11318) Include hive profile in make-distribution.sh command

2015-10-29 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11318:
--
Assignee: Ted Yu

> Include hive profile in make-distribution.sh command
> 
>
> Key: SPARK-11318
> URL: https://issues.apache.org/jira/browse/SPARK-11318
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Trivial
> Fix For: 1.6.0
>
>
> The tgz I built using the current command shown in building-spark.html does 
> not produce the datanucleus jars which are included in the "boxed" spark 
> distributions.
> hive profile should be included so that the tar ball matches spark 
> distribution.
> See 'Problem with make-distribution.sh' thread on user@ for background.
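
For reference, the resulting command in the docs would presumably look something 
like the following (the flag list is illustrative, matching the profiles 
discussed here):

{code}
./make-distribution.sh --name custom-spark --tgz -Phadoop-2.4 -Pyarn -Phive -Phive-thriftserver
{code}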



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11403) Log something when dying from OnOutOfMemoryError

2015-10-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11403:


Assignee: (was: Apache Spark)

> Log something when dying from OnOutOfMemoryError
> 
>
> Key: SPARK-11403
> URL: https://issues.apache.org/jira/browse/SPARK-11403
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 1.5.1
>Reporter: Daniel Darabos
>Priority: Trivial
>
> Executors on YARN run with the {{-XX:OnOutOfMemoryError='kill %p'}} flag. The 
> motivation I think is to avoid getting to an unpredictable state where some 
> threads may have been lost.
> The problem is that when this happens nothing is logged. The executor does 
> not log anything, it just exits with exit code 143. This is logged in the 
> NodeManager log.
> I'd like to add a tiny log message to make debugging such issues a bit easier.
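
One possible shape of the change, sketched as the option the YARN executor launch 
command could carry instead (illustrative only; the message text and quoting are 
not taken from an actual patch):

{code}
-XX:OnOutOfMemoryError='echo "Executor is exiting due to OutOfMemoryError"; kill %p'
{code}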



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11318) Include hive profile in make-distribution.sh command

2015-10-29 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-11318.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9281
[https://github.com/apache/spark/pull/9281]

> Include hive profile in make-distribution.sh command
> 
>
> Key: SPARK-11318
> URL: https://issues.apache.org/jira/browse/SPARK-11318
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Ted Yu
>Priority: Trivial
> Fix For: 1.6.0
>
>
> The tgz I built using the current command shown in building-spark.html does 
> not produce the datanucleus jars which are included in the "boxed" spark 
> distributions.
> hive profile should be included so that the tar ball matches spark 
> distribution.
> See 'Problem with make-distribution.sh' thread on user@ for background.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11401) PMML export for Logistic Regression Multiclass Classification

2015-10-29 Thread Vincenzo Selvaggio (JIRA)
Vincenzo Selvaggio created SPARK-11401:
--

 Summary: PMML export for Logistic Regression Multiclass 
Classification
 Key: SPARK-11401
 URL: https://issues.apache.org/jira/browse/SPARK-11401
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Vincenzo Selvaggio
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11383) Replace example code in mllib-naive-bayes.md/mllib-isotonic-regression.md using include_example

2015-10-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11383:


Assignee: (was: Apache Spark)

> Replace example code in mllib-naive-bayes.md/mllib-isotonic-regression.md 
> using include_example
> ---
>
> Key: SPARK-11383
> URL: https://issues.apache.org/jira/browse/SPARK-11383
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Xusen Yin
>  Labels: starter
>
> This is similar to SPARK-11289 but for the example code in 
> mllib-naive-bayes.md and mllib-isotonic-regression.md.
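
For anyone picking this up, the replacement presumably uses the 
{{include_example}} Jekyll tag introduced under SPARK-11289, along these lines 
(the exact example path is an assumption about where the extracted snippet would 
live):

{code}
{% include_example scala/org/apache/spark/examples/mllib/NaiveBayesExample.scala %}
{code}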



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11326) Split networking in standalone mode

2015-10-29 Thread Jacek Lewandowski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14980383#comment-14980383
 ] 

Jacek Lewandowski commented on SPARK-11326:
---

[~vanzin] I'd really appreciate it if you could take a look.


> Split networking in standalone mode
> ---
>
> Key: SPARK-11326
> URL: https://issues.apache.org/jira/browse/SPARK-11326
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Jacek Lewandowski
>
> h3.The idea
> Currently, in standalone mode, all components, for all network connections 
> need to use the same secure token if they want to have any security ensured. 
> This ticket is intended to split the communication in standalone mode to make 
> it more like in Yarn mode - application internal communication, scheduler 
> internal communication and communication between the client and scheduler. 
> Such refactoring will allow for the scheduler (master, workers) to use a 
> distinct secret, which will remain unknown for the users. Similarly, it will 
> allow for better security in applications, because each application will be 
> able to use a distinct secret as well. 
> By providing Kerberos based SASL authentication/encryption for connections 
> between a client (Client or AppClient) and Spark Master, it will be possible 
> to introduce authentication and automatic generation of digest tokens and 
> safe sharing them among the application processes. 
> h3.User facing changes when running application
> h4.General principles:
> - conf: {{spark.authenticate.secret}} is *never sent* over the wire
> - env: {{SPARK_AUTH_SECRET}} is *never sent* over the wire
> - In all situations env variable will overwrite conf variable if present. 
> - In all situations when a user has to pass secret, it is better (safer) to 
> do this through env variable
> - In work modes with multiple secrets we assume encrypted communication 
> between client and master, between driver and master, between master and 
> workers
> 
> h4.Work modes and descriptions
> h5.Client mode, single secret
> h6.Configuration
> - env: {{SPARK_AUTH_SECRET=secret}} or conf: 
> {{spark.authenticate.secret=secret}}
> h6.Description
> - The driver is running locally
> - The driver will neither send env: {{SPARK_AUTH_SECRET}} nor conf: 
> {{spark.authenticate.secret}}
> - The driver will use either env: {{SPARK_AUTH_SECRET}} or conf: 
> {{spark.authenticate.secret}} for connection to the master
> - _ExecutorRunner_ will not find any secret in _ApplicationDescription_ so it 
> will look for it in the worker configuration and it will find it there (its 
> presence is implied). 
> 
> h5.Client mode, multiple secrets
> h6.Configuration
> - env: {{SPARK_APP_AUTH_SECRET=app_secret}} or conf: 
> {{spark.app.authenticate.secret=secret}}
> - env: {{SPARK_SUBMISSION_AUTH_SECRET=scheduler_secret}} or conf: 
> {{spark.submission.authenticate.secret=scheduler_secret}}
> h6.Description
> - The driver is running locally
> - The driver will use either env: {{SPARK_SUBMISSION_AUTH_SECRET}} or conf: 
> {{spark.submission.authenticate.secret}} to connect to the master
> - The driver will neither send env: {{SPARK_SUBMISSION_AUTH_SECRET}} nor 
> conf: {{spark.submission.authenticate.secret}}
> - The driver will use either {{SPARK_APP_AUTH_SECRET}} or conf: 
> {{spark.app.authenticate.secret}} for communication with the executors
> - The driver will send {{spark.executorEnv.SPARK_AUTH_SECRET=app_secret}} so 
> that the executors can use it to communicate with the driver
> - _ExecutorRunner_ will find that secret in _ApplicationDescription_ and it 
> will set it in env: {{SPARK_AUTH_SECRET}} which will be read by 
> _ExecutorBackend_ afterwards and used for all the connections (with driver, 
> other executors and external shuffle service).
> 
> h5.Cluster mode, single secret
> h6.Configuration
> - env: {{SPARK_AUTH_SECRET=secret}} or conf: 
> {{spark.authenticate.secret=secret}}
> h6.Description
> - The driver is run by _DriverRunner_ which is is a part of the worker
> - The client will neither send env: {{SPARK_AUTH_SECRET}} nor conf: 
> {{spark.authenticate.secret}}
> - The client will use either env: {{SPARK_AUTH_SECRET}} or conf: 
> {{spark.authenticate.secret}} for connection to the master and submit the 
> driver
> - _DriverRunner_ will not find any secret in _DriverDescription_ so it will 
> look for it in the worker configuration and it will find it there (its 
> presence is implied)
> - _DriverRunner_ will set the secret it found in env: {{SPARK_AUTH_SECRET}} 
> so that the driver will find it and use it for all the connections
> - The driver will use either env: {{SPARK_AUTH_SECRET}} or conf: 
> {{spark.authenticate.secret}} for connection to the master
> - _ExecutorRunner_ will not find any secret in _ApplicationDescription_ 
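
To make the simplest of the modes above concrete, the single-secret client mode 
boils down to configuration along these lines (a sketch based on the description 
above; {{spark.authenticate}} and {{spark.authenticate.secret}} are existing 
settings, the secret value is a placeholder):

{code}
# either export the secret in the environment of the process that runs the driver ...
export SPARK_AUTH_SECRET=my-shared-secret

# ... or set it in spark-defaults.conf (the proposal stresses the secret is never sent over the wire)
spark.authenticate        true
spark.authenticate.secret my-shared-secret
{code}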

[jira] [Resolved] (SPARK-11211) Kafka - offsetOutOfRange forces to largest

2015-10-29 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-11211.
---
Resolution: Not A Problem

> Kafka - offsetOutOfRange forces to largest
> --
>
> Key: SPARK-11211
> URL: https://issues.apache.org/jira/browse/SPARK-11211
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.3.1, 1.5.1
>Reporter: Daniel Strassler
>
> This problem relates to how DStreams using the Direct Approach of connecting 
> to a Kafka topic behave when they request an offset that does not exist on 
> the topic.  Currently it appears the "auto.offset.reset" configuration value 
> is being ignored and the default value of “largest” is always being used.
>  
> When using the Direct Approach of connecting to a Kafka topic with a 
> DStream, even if the Kafka configuration "auto.offset.reset" is set to 
> smallest, the behavior in the event of a 
> kafka.common.OffsetOutOfRangeException is to move the next offset to be 
> consumed to the largest value on the Kafka topic.  The exception also 
> appears to be swallowed rather than propagated up to the driver, so a 
> workaround triggered by propagation of the error cannot be implemented 
> either.
>  
> The current behavior of resetting to largest means that any data on the Kafka 
> topic at the time the exception is thrown is skipped (lost), and only data 
> produced to the topic after the exception will be consumed.  Two possible 
> fixes are listed below.
>  
> Fix 1:  When “auto.offset.reset" is set to “smallest”, the DStream should set 
> the next consumed offset to be the smallest offset value on the Kafka topic.
>  
> Fix 2:  Propagate the error to the Driver to allow it to react as it deems 
> appropriate.
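
For reference, the configuration in question is the one passed to the direct 
stream, roughly like this (a minimal sketch against the Spark 1.3-1.5 direct 
Kafka API; the broker list, topic and {{ssc}} are placeholders):

{code}
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map(
  "metadata.broker.list" -> "broker1:9092",  // placeholder broker
  "auto.offset.reset" -> "smallest")         // the setting this issue reports as ignored after OffsetOutOfRange

// ssc is an existing StreamingContext
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("myTopic"))
{code}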



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11388) Build breaks due to the use of tags in javadoc.

2015-10-29 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-11388.
---
   Resolution: Fixed
 Assignee: Herman van Hovell
Fix Version/s: 1.6.0

Resolved by https://github.com/apache/spark/pull/9339. I assume this was only 
affecting master/1.6, since I didn't see the class in question in 1.5.x?

> Build breaks due to the use of  tags in javadoc.
> 
>
> Key: SPARK-11388
> URL: https://issues.apache.org/jira/browse/SPARK-11388
> Project: Spark
>  Issue Type: Bug
>  Components: Build
> Environment: Java 8
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
>Priority: Minor
> Fix For: 1.6.0
>
>
> In Java 8, self-closing tags are illegal. The build fails when I issue a 
> publishLocal, for example:
> {noformat}
> launcher/publishLocal
> [info] Packaging 
> /media/hvanhovell/Data/QT/IT/Software/spark-pr/launcher/target/scala-2.10/spark-launcher_2.10-1.6.0-SNAPSHOT-sources.jar
>  ...
> [info] Updating 
> {file:/media/hvanhovell/Data/QT/IT/Software/spark-pr/}launcher...
> [info] Done packaging.
> [info] Wrote 
> /media/hvanhovell/Data/QT/IT/Software/spark-pr/launcher/target/scala-2.10/spark-launcher_2.10-1.6.0-SNAPSHOT.pom
> [info] Resolving org.fusesource.jansi#jansi;1.4 ...
> [info] Done updating.
> [info] :: delivering :: org.apache.spark#spark-launcher_2.10;1.6.0-SNAPSHOT 
> :: 1.6.0-SNAPSHOT :: integration :: Wed Oct 28 21:12:50 CET 2015
> [info]delivering ivy file to 
> /media/hvanhovell/Data/QT/IT/Software/spark-pr/launcher/target/scala-2.10/ivy-1.6.0-SNAPSHOT.xml
> [info] Main Java API documentation to 
> /media/hvanhovell/Data/QT/IT/Software/spark-pr/launcher/target/scala-2.10/api...
> [info] Packaging 
> /media/hvanhovell/Data/QT/IT/Software/spark-pr/launcher/target/scala-2.10/spark-launcher_2.10-1.6.0-SNAPSHOT.jar
>  ...
> [info] Done packaging.
> [info] Loading source file 
> /media/hvanhovell/Data/QT/IT/Software/spark-pr/launcher/src/main/java/org/apache/spark/launcher/AbstractCommandBuilder.java...
> [info] Loading source file 
> /media/hvanhovell/Data/QT/IT/Software/spark-pr/launcher/src/main/java/org/apache/spark/launcher/ChildProcAppHandle.java...
> [info] Loading source file 
> /media/hvanhovell/Data/QT/IT/Software/spark-pr/launcher/src/main/java/org/apache/spark/launcher/CommandBuilderUtils.java...
> [info] Loading source file 
> /media/hvanhovell/Data/QT/IT/Software/spark-pr/launcher/src/main/java/org/apache/spark/launcher/LauncherConnection.java...
> [info] Loading source file 
> /media/hvanhovell/Data/QT/IT/Software/spark-pr/launcher/src/main/java/org/apache/spark/launcher/LauncherProtocol.java...
> [info] Loading source file 
> /media/hvanhovell/Data/QT/IT/Software/spark-pr/launcher/src/main/java/org/apache/spark/launcher/LauncherServer.java...
> [info] Loading source file 
> /media/hvanhovell/Data/QT/IT/Software/spark-pr/launcher/src/main/java/org/apache/spark/launcher/Main.java...
> [info] Loading source file 
> /media/hvanhovell/Data/QT/IT/Software/spark-pr/launcher/src/main/java/org/apache/spark/launcher/NamedThreadFactory.java...
> [info] Loading source file 
> /media/hvanhovell/Data/QT/IT/Software/spark-pr/launcher/src/main/java/org/apache/spark/launcher/OutputRedirector.java...
> [info] Loading source file 
> /media/hvanhovell/Data/QT/IT/Software/spark-pr/launcher/src/main/java/org/apache/spark/launcher/SparkAppHandle.java...
> [info] Loading source file 
> /media/hvanhovell/Data/QT/IT/Software/spark-pr/launcher/src/main/java/org/apache/spark/launcher/SparkClassCommandBuilder.java...
> [info] Loading source file 
> /media/hvanhovell/Data/QT/IT/Software/spark-pr/launcher/src/main/java/org/apache/spark/launcher/SparkLauncher.java...
> [info] Loading source file 
> /media/hvanhovell/Data/QT/IT/Software/spark-pr/launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java...
> [info] Loading source file 
> /media/hvanhovell/Data/QT/IT/Software/spark-pr/launcher/src/main/java/org/apache/spark/launcher/SparkSubmitOptionParser.java...
> [info] Loading source file 
> /media/hvanhovell/Data/QT/IT/Software/spark-pr/launcher/src/main/java/org/apache/spark/launcher/package-info.java...
> [info] Constructing Javadoc information...
> [info] Standard Doclet version 1.8.0_60
> [info] Building tree for all the packages and classes...
> [info] Generating 
> /media/hvanhovell/Data/QT/IT/Software/spark-pr/launcher/target/scala-2.10/api/org/apache/spark/launcher/SparkAppHandle.html...
> [error] 
> /media/hvanhovell/Data/QT/IT/Software/spark-pr/launcher/src/main/java/org/apache/spark/launcher/SparkAppHandle.java:22:
>  error: self-closing element not allowed
> [error]  * 
> [error]^
> [info] Generating 
> 

[jira] [Updated] (SPARK-11401) PMML export for Logistic Regression Multiclass Classification

2015-10-29 Thread Vincenzo Selvaggio (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vincenzo Selvaggio updated SPARK-11401:
---
Description: 
tvmanikandan requested multiclass support for logistic regression on 
https://github.com/apache/spark/pull/3062.

At the moment the toPMML method for Logistic Regression only supports binary 
classification.

> PMML export for Logistic Regression Multiclass Classification
> -
>
> Key: SPARK-11401
> URL: https://issues.apache.org/jira/browse/SPARK-11401
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Vincenzo Selvaggio
>Priority: Minor
>
> tvmanikandan requested multiclass support for logistic regression on 
> https://github.com/apache/spark/pull/3062.
> At the moment the toPMML method for Logistic Regression only supports binary 
> classification.
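
For context, a minimal sketch of what already works for the binary case 
({{toPMML}} comes from {{PMMLExportable}}; {{sc}} and the input path are 
placeholders):

{code}
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.util.MLUtils

// sc is an existing SparkContext; the file is the stock example dataset
val training = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")

val model = new LogisticRegressionWithLBFGS().setNumClasses(2).run(training)

// Works today for two classes; this issue is about supporting numClasses > 2.
println(model.toPMML())
{code}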



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11303) sample (without replacement) + filter returns wrong results in DataFrame

2015-10-29 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11303:
--
Assignee: Yanbo Liang

> sample (without replacement) + filter returns wrong results in DataFrame
> 
>
> Key: SPARK-11303
> URL: https://issues.apache.org/jira/browse/SPARK-11303
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
> Environment: pyspark local mode, linux.
>Reporter: Yuval Tanny
>Assignee: Yanbo Liang
> Fix For: 1.6.0
>
>
> When sampling and then filtering DataFrame from python, we get inconsistent 
> result when not caching the sampled DataFrame. This bug  doesn't appear in 
> spark 1.4.1.
> {code}
> d = sqlContext.createDataFrame(sc.parallelize([[1]] * 50 + [[2]] * 50),['t'])
> d_sampled = d.sample(False, 0.1, 1)
> print d_sampled.count()
> print d_sampled.filter('t = 1').count()
> print d_sampled.filter('t != 1').count()
> d_sampled.cache()
> print d_sampled.count()
> print d_sampled.filter('t = 1').count()
> print d_sampled.filter('t != 1').count()
> {code}
> output:
> {code}
> 14
> 7
> 8
> 14
> 7
> 7
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11381) Replace example code in mllib-linear-methods.md using include_example

2015-10-29 Thread Pravin Gadakh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14980419#comment-14980419
 ] 

Pravin Gadakh commented on SPARK-11381:
---

We should make this jira dependent upon 
[SPARK-11399|https://issues.apache.org/jira/browse/SPARK-11399].

> Replace example code in mllib-linear-methods.md using include_example
> -
>
> Key: SPARK-11381
> URL: https://issues.apache.org/jira/browse/SPARK-11381
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Xusen Yin
>  Labels: starter
>
> This is similar to SPARK-11289 but for the example code in 
> mllib-linear-methods.md.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11004) MapReduce Hive-like join operations for RDDs

2015-10-29 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-11004.
---
Resolution: Won't Fix

For the moment I'd like to consider this concluded, but as in all things, it 
can be reopened to address a specific change.

> MapReduce Hive-like join operations for RDDs
> 
>
> Key: SPARK-11004
> URL: https://issues.apache.org/jira/browse/SPARK-11004
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle
>Reporter: Glenn Strycker
>
> Could a feature be added to Spark that would use disk-only MapReduce 
> operations for the very largest RDD joins?
> MapReduce is able to handle incredibly large table joins in a stable, 
> predictable way with graceful failures and recovery.  I have applications 
> that are able to join 2 tables without error in Hive, but these same tables, 
> when converted into RDDs, are unable to join in Spark (I am using the same 
> cluster, and have played around with all of the memory configurations, 
> persisting to disk, checkpointing, etc., and the RDDs are just too big for 
> Spark on my cluster)
> So, Spark is usually able to handle fairly large RDD joins, but occasionally 
> runs into problems when the tables are just too big (e.g. the notorious 2GB 
> shuffle limit issue, memory problems, etc.)  There are so many parameters to 
> adjust (number of partitions, number of cores, memory per core, etc.) that it 
> is difficult to guarantee stability on a shared cluster (say, running Yarn) 
> with other jobs.
> Could a feature be added to Spark that would use disk-only MapReduce commands 
> to do very large joins?
> That is, instead of myRDD1.join(myRDD2), we would have a special operation 
> myRDD1.mapReduceJoin(myRDD2) that would checkpoint both RDDs to disk, run 
> MapReduce, and then convert the results of the join back into a standard RDD.
> This might add stability for Spark jobs that deal with extremely large data, 
> and enable developers to mix-and-match some Spark and MapReduce operations in 
> the same program, rather than writing Hive scripts and stringing together 
> Spark and MapReduce programs, which has extremely large overhead to convert 
> RDDs to Hive tables and back again.
> Despite memory-level operations being where most of Spark's speed gains lie, 
> sometimes using disk-only may help with stability!
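
Purely to illustrate the shape of the proposal (this operation does not exist in 
Spark, and the issue is resolved as Won't Fix), the requested API might have 
looked something like:

{code}
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Hypothetical enrichment sketched from the proposal above - not a Spark API.
implicit class MapReduceJoinOps[K: ClassTag, V: ClassTag](self: RDD[(K, V)]) {
  def mapReduceJoin[W: ClassTag](other: RDD[(K, W)]): RDD[(K, (V, W))] = {
    // Proposal: checkpoint both inputs to disk, run the join as a disk-only
    // MapReduce job, then read the result back as an RDD. Sketch only, so it
    // simply falls back to the regular shuffle join here.
    self.join(other)
  }
}
{code}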



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11371) Make "mean" an alias for "avg" operator

2015-10-29 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14980484#comment-14980484
 ] 

Ted Yu commented on SPARK-11371:


Since I cannot assign the JIRA to myself, attaching a patch shows my intention 
to work on it.
The background is that I wanted to open 3 PRs as of yesterday, but I don't have 
that many email addresses (i.e., forked repos).

I am more than willing to learn from experts how multiple outstanding PRs are 
managed.

As for the mean alias, I quoted Reynold's response.
I am open to discussion on whether this would ultimately go through.

> Make "mean" an alias for "avg" operator
> ---
>
> Key: SPARK-11371
> URL: https://issues.apache.org/jira/browse/SPARK-11371
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Ted Yu
>Priority: Minor
> Attachments: spark-11371-v1.patch
>
>
> From Reynold in the thread 'Exception when using some aggregate operators'  
> (http://search-hadoop.com/m/q3RTt0xFr22nXB4/):
> I don't think these are bugs. The SQL standard for average is "avg", not 
> "mean". Similarly, a distinct count is supposed to be written as 
> "count(distinct col)", not "countDistinct(col)".
> We can, however, make "mean" an alias for "avg" to improve compatibility 
> between DataFrame and SQL.
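
For illustration, assuming the alias goes in, both spellings would resolve to the 
same aggregate. The DataFrame functions below already exist; the SQL {{mean}} 
spelling is the part this issue proposes ({{sqlContext}} is an existing 
SQLContext):

{code}
import org.apache.spark.sql.functions.{avg, mean}

val df = sqlContext.createDataFrame(Seq(("michael", 30), ("andy", 19))).toDF("name", "age")
df.agg(avg("age")).show()   // standard SQL spelling
df.agg(mean("age")).show()  // existing DataFrame alias

df.registerTempTable("people")
sqlContext.sql("SELECT avg(age) FROM people").show()   // works today
sqlContext.sql("SELECT mean(age) FROM people").show()  // would work once "mean" aliases "avg"
{code}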



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11388) Build breaks due to the use of tags in javadoc.

2015-10-29 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11388:
--
Affects Version/s: (was: 1.5.1)

> Build breaks due to the use of  tags in javadoc.
> 
>
> Key: SPARK-11388
> URL: https://issues.apache.org/jira/browse/SPARK-11388
> Project: Spark
>  Issue Type: Bug
>  Components: Build
> Environment: Java 8
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
>Priority: Minor
> Fix For: 1.6.0
>
>
> In Java 8, self-closing tags are illegal. The build fails when I issue a 
> publishLocal, for example:
> {noformat}
> launcher/publishLocal
> [info] Packaging 
> /media/hvanhovell/Data/QT/IT/Software/spark-pr/launcher/target/scala-2.10/spark-launcher_2.10-1.6.0-SNAPSHOT-sources.jar
>  ...
> [info] Updating 
> {file:/media/hvanhovell/Data/QT/IT/Software/spark-pr/}launcher...
> [info] Done packaging.
> [info] Wrote 
> /media/hvanhovell/Data/QT/IT/Software/spark-pr/launcher/target/scala-2.10/spark-launcher_2.10-1.6.0-SNAPSHOT.pom
> [info] Resolving org.fusesource.jansi#jansi;1.4 ...
> [info] Done updating.
> [info] :: delivering :: org.apache.spark#spark-launcher_2.10;1.6.0-SNAPSHOT 
> :: 1.6.0-SNAPSHOT :: integration :: Wed Oct 28 21:12:50 CET 2015
> [info]delivering ivy file to 
> /media/hvanhovell/Data/QT/IT/Software/spark-pr/launcher/target/scala-2.10/ivy-1.6.0-SNAPSHOT.xml
> [info] Main Java API documentation to 
> /media/hvanhovell/Data/QT/IT/Software/spark-pr/launcher/target/scala-2.10/api...
> [info] Packaging 
> /media/hvanhovell/Data/QT/IT/Software/spark-pr/launcher/target/scala-2.10/spark-launcher_2.10-1.6.0-SNAPSHOT.jar
>  ...
> [info] Done packaging.
> [info] Loading source file 
> /media/hvanhovell/Data/QT/IT/Software/spark-pr/launcher/src/main/java/org/apache/spark/launcher/AbstractCommandBuilder.java...
> [info] Loading source file 
> /media/hvanhovell/Data/QT/IT/Software/spark-pr/launcher/src/main/java/org/apache/spark/launcher/ChildProcAppHandle.java...
> [info] Loading source file 
> /media/hvanhovell/Data/QT/IT/Software/spark-pr/launcher/src/main/java/org/apache/spark/launcher/CommandBuilderUtils.java...
> [info] Loading source file 
> /media/hvanhovell/Data/QT/IT/Software/spark-pr/launcher/src/main/java/org/apache/spark/launcher/LauncherConnection.java...
> [info] Loading source file 
> /media/hvanhovell/Data/QT/IT/Software/spark-pr/launcher/src/main/java/org/apache/spark/launcher/LauncherProtocol.java...
> [info] Loading source file 
> /media/hvanhovell/Data/QT/IT/Software/spark-pr/launcher/src/main/java/org/apache/spark/launcher/LauncherServer.java...
> [info] Loading source file 
> /media/hvanhovell/Data/QT/IT/Software/spark-pr/launcher/src/main/java/org/apache/spark/launcher/Main.java...
> [info] Loading source file 
> /media/hvanhovell/Data/QT/IT/Software/spark-pr/launcher/src/main/java/org/apache/spark/launcher/NamedThreadFactory.java...
> [info] Loading source file 
> /media/hvanhovell/Data/QT/IT/Software/spark-pr/launcher/src/main/java/org/apache/spark/launcher/OutputRedirector.java...
> [info] Loading source file 
> /media/hvanhovell/Data/QT/IT/Software/spark-pr/launcher/src/main/java/org/apache/spark/launcher/SparkAppHandle.java...
> [info] Loading source file 
> /media/hvanhovell/Data/QT/IT/Software/spark-pr/launcher/src/main/java/org/apache/spark/launcher/SparkClassCommandBuilder.java...
> [info] Loading source file 
> /media/hvanhovell/Data/QT/IT/Software/spark-pr/launcher/src/main/java/org/apache/spark/launcher/SparkLauncher.java...
> [info] Loading source file 
> /media/hvanhovell/Data/QT/IT/Software/spark-pr/launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java...
> [info] Loading source file 
> /media/hvanhovell/Data/QT/IT/Software/spark-pr/launcher/src/main/java/org/apache/spark/launcher/SparkSubmitOptionParser.java...
> [info] Loading source file 
> /media/hvanhovell/Data/QT/IT/Software/spark-pr/launcher/src/main/java/org/apache/spark/launcher/package-info.java...
> [info] Constructing Javadoc information...
> [info] Standard Doclet version 1.8.0_60
> [info] Building tree for all the packages and classes...
> [info] Generating 
> /media/hvanhovell/Data/QT/IT/Software/spark-pr/launcher/target/scala-2.10/api/org/apache/spark/launcher/SparkAppHandle.html...
> [error] 
> /media/hvanhovell/Data/QT/IT/Software/spark-pr/launcher/src/main/java/org/apache/spark/launcher/SparkAppHandle.java:22:
>  error: self-closing element not allowed
> [error]  * 
> [error]^
> [info] Generating 
> /media/hvanhovell/Data/QT/IT/Software/spark-pr/launcher/target/scala-2.10/api/org/apache/spark/launcher/SparkAppHandle.Listener.html...
> [error] 
> 

[jira] [Commented] (SPARK-5569) Checkpoints cannot reference classes defined outside of Spark's assembly

2015-10-29 Thread Deming Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14980490#comment-14980490
 ] 

Deming Zhu commented on SPARK-5569:
---

For the OffsetRange ClassNotFoundException issue, I'm sure this patch solves it, 
because the example in the documentation on the Spark home page does not run 
without it.

Here's the sample code from the Spark home page:
```
// Hold a reference to the current offset ranges, so it can be used downstream
 var offsetRanges = Array[OffsetRange]()

 directKafkaStream.transform { rdd =>
   offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
   rdd
 }.map {
   ...
 }.foreachRDD { rdd =>
   for (o <- offsetRanges) {
 println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
   }
   ...
 }
```
We can see that offsetRanges is an array and needs to be deserialized by the 
executors.

For the KafkaRDDPartition ClassNotFoundException issue, since I cannot see the 
source code, I cannot be 100% sure that this patch solves it.

In my opinion, though, this kind of issue is caused by Spark trying to load 
an array object.

Can any of you try this patch and tell us whether your issue is solved?

> Checkpoints cannot reference classes defined outside of Spark's assembly
> 
>
> Key: SPARK-5569
> URL: https://issues.apache.org/jira/browse/SPARK-5569
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Patrick Wendell
>
> Not sure if this is a bug or a feature, but it's not obvious, so wanted to 
> create a JIRA to make sure we document this behavior.
> First documented by Cody Koeninger:
> https://gist.github.com/koeninger/561a61482cd1b5b3600c
> {code}
> 15/01/12 16:07:07 INFO CheckpointReader: Attempting to load checkpoint from 
> file file:/var/tmp/cp/checkpoint-142110041.bk
> 15/01/12 16:07:07 WARN CheckpointReader: Error reading checkpoint from file 
> file:/var/tmp/cp/checkpoint-142110041.bk
> java.io.IOException: java.lang.ClassNotFoundException: 
> org.apache.spark.rdd.kafka.KafkaRDDPartition
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1043)
> at 
> org.apache.spark.streaming.dstream.DStreamCheckpointData.readObject(DStreamCheckpointData.scala:146)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
> at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
> at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1706)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344)
> at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at 
> java.io.ObjectInputStream.defaultReadObject(ObjectInputStream.java:500)
> at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$readObject$1.apply$mcV$sp(DStreamGraph.scala:180)
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1040)
> at 
> org.apache.spark.streaming.DStreamGraph.readObject(DStreamGraph.scala:176)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
> at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
> at 
> 

[jira] [Commented] (SPARK-6373) Add SSL/TLS for the Netty based BlockTransferService

2015-10-29 Thread Jacek Lewandowski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14980508#comment-14980508
 ] 

Jacek Lewandowski commented on SPARK-6373:
--

[~turp1twin] so are you going to continue working on this ticket?

> Add SSL/TLS for the Netty based BlockTransferService 
> -
>
> Key: SPARK-6373
> URL: https://issues.apache.org/jira/browse/SPARK-6373
> Project: Spark
>  Issue Type: New Feature
>  Components: Block Manager, Shuffle
>Affects Versions: 1.2.1
>Reporter: Jeffrey Turpin
>
> Add the ability to allow for secure communications (SSL/TLS) for the Netty 
> based BlockTransferService and the ExternalShuffleClient. This ticket will 
> hopefully start the conversation around potential designs... Below is a 
> reference to a WIP prototype which implements this functionality 
> (prototype)... I have attempted to disrupt as little code as possible and 
> tried to follow the current code structure (for the most part) in the areas I 
> modified. I also studied how Hadoop achieves encrypted shuffle 
> (http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/EncryptedShuffle.html)
> https://github.com/turp1twin/spark/commit/024b559f27945eb63068d1badf7f82e4e7c3621c



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11140) Replace file server in driver with RPC-based alternative

2015-10-29 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14980537#comment-14980537
 ] 

Marcelo Vanzin commented on SPARK-11140:


SSL is really hard to configure correctly in a large cluster with multiple 
users, which is why SASL is a much more user-friendly choice. Once we get 
everything running on SASL we can look at alternatives to Digest-MD5 (such as 
GSSAPI if Kerberos is available).

> Replace file server in driver with RPC-based alternative
> 
>
> Key: SPARK-11140
> URL: https://issues.apache.org/jira/browse/SPARK-11140
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Marcelo Vanzin
>
> As part of making configuring encryption easy in Spark, it would be better to 
> use the existing RPC channel between driver and executors to transfer files 
> and jars added to the application.
> This would remove the need to start the HTTP server currently used for that 
> purpose, which needs to be configured to use SSL if encryption is wanted. SSL 
> is kinda hard to configure correctly in a multi-user, distributed environment.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11032) Failure to resolve having correctly

2015-10-29 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14980593#comment-14980593
 ] 

Yin Huai commented on SPARK-11032:
--

I added 1.5.3 as the fix version. If we re-cut 1.5.2, I will change the fix 
version.

> Failure to resolve having correctly
> ---
>
> Key: SPARK-11032
> URL: https://issues.apache.org/jira/browse/SPARK-11032
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1, 1.6.0
>Reporter: Michael Armbrust
>Assignee: Wenchen Fan
>Priority: Blocker
> Fix For: 1.5.3, 1.6.0
>
>
> This is a regression from Spark 1.4
> {code}
> Seq(("michael", 30)).toDF("name", "age").registerTempTable("people")
> sql("SELECT MIN(t0.age) FROM (SELECT * FROM PEOPLE WHERE age > 0) t0 
> HAVING(COUNT(1) > 0)").explain(true)
> == Parsed Logical Plan ==
> 'Filter cast(('COUNT(1) > 0) as boolean)
>  'Project [unresolvedalias('MIN('t0.age))]
>   'Subquery t0
>'Project [unresolvedalias(*)]
> 'Filter ('age > 0)
>  'UnresolvedRelation [PEOPLE], None
> == Analyzed Logical Plan ==
> _c0: int
> Filter cast((count(1) > cast(0 as bigint)) as boolean)
>  Aggregate [min(age#6) AS _c0#9]
>   Subquery t0
>Project [name#5,age#6]
> Filter (age#6 > 0)
>  Subquery people
>   Project [_1#3 AS name#5,_2#4 AS age#6]
>LocalRelation [_1#3,_2#4], [[michael,30]]
> == Optimized Logical Plan ==
> Filter (count(1) > 0)
>  Aggregate [min(age#6) AS _c0#9]
>   Project [_2#4 AS age#6]
>Filter (_2#4 > 0)
> LocalRelation [_1#3,_2#4], [[michael,30]]
> == Physical Plan ==
> Filter (count(1) > 0)
>  TungstenAggregate(key=[], 
> functions=[(min(age#6),mode=Final,isDistinct=false)], output=[_c0#9])
>   TungstenExchange SinglePartition
>TungstenAggregate(key=[], 
> functions=[(min(age#6),mode=Partial,isDistinct=false)], output=[min#12])
> TungstenProject [_2#4 AS age#6]
>  Filter (_2#4 > 0)
>   LocalTableScan [_1#3,_2#4], [[michael,30]]
> Code Generation: true
> {code}
> {code}
> Caused by: java.lang.UnsupportedOperationException: Cannot evaluate 
> expression: count(1)
>   at 
> org.apache.spark.sql.catalyst.expressions.Unevaluable$class.eval(Expression.scala:188)
>   at 
> org.apache.spark.sql.catalyst.expressions.Count.eval(aggregates.scala:156)
>   at 
> org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:327)
>   at 
> org.apache.spark.sql.catalyst.expressions.InterpretedPredicate$$anonfun$create$2.apply(predicates.scala:38)
>   at 
> org.apache.spark.sql.catalyst.expressions.InterpretedPredicate$$anonfun$create$2.apply(predicates.scala:38)
>   at 
> org.apache.spark.sql.execution.Filter$$anonfun$4$$anonfun$apply$4.apply(basicOperators.scala:117)
>   at 
> org.apache.spark.sql.execution.Filter$$anonfun$4$$anonfun$apply$4.apply(basicOperators.scala:115)
>   at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:390)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11032) Failure to resolve having correctly

2015-10-29 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-11032:
-
Fix Version/s: 1.5.3

> Failure to resolve having correctly
> ---
>
> Key: SPARK-11032
> URL: https://issues.apache.org/jira/browse/SPARK-11032
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1, 1.6.0
>Reporter: Michael Armbrust
>Assignee: Wenchen Fan
>Priority: Blocker
> Fix For: 1.5.3, 1.6.0
>
>
> This is a regression from Spark 1.4
> {code}
> Seq(("michael", 30)).toDF("name", "age").registerTempTable("people")
> sql("SELECT MIN(t0.age) FROM (SELECT * FROM PEOPLE WHERE age > 0) t0 
> HAVING(COUNT(1) > 0)").explain(true)
> == Parsed Logical Plan ==
> 'Filter cast(('COUNT(1) > 0) as boolean)
>  'Project [unresolvedalias('MIN('t0.age))]
>   'Subquery t0
>'Project [unresolvedalias(*)]
> 'Filter ('age > 0)
>  'UnresolvedRelation [PEOPLE], None
> == Analyzed Logical Plan ==
> _c0: int
> Filter cast((count(1) > cast(0 as bigint)) as boolean)
>  Aggregate [min(age#6) AS _c0#9]
>   Subquery t0
>Project [name#5,age#6]
> Filter (age#6 > 0)
>  Subquery people
>   Project [_1#3 AS name#5,_2#4 AS age#6]
>LocalRelation [_1#3,_2#4], [[michael,30]]
> == Optimized Logical Plan ==
> Filter (count(1) > 0)
>  Aggregate [min(age#6) AS _c0#9]
>   Project [_2#4 AS age#6]
>Filter (_2#4 > 0)
> LocalRelation [_1#3,_2#4], [[michael,30]]
> == Physical Plan ==
> Filter (count(1) > 0)
>  TungstenAggregate(key=[], 
> functions=[(min(age#6),mode=Final,isDistinct=false)], output=[_c0#9])
>   TungstenExchange SinglePartition
>TungstenAggregate(key=[], 
> functions=[(min(age#6),mode=Partial,isDistinct=false)], output=[min#12])
> TungstenProject [_2#4 AS age#6]
>  Filter (_2#4 > 0)
>   LocalTableScan [_1#3,_2#4], [[michael,30]]
> Code Generation: true
> {code}
> {code}
> Caused by: java.lang.UnsupportedOperationException: Cannot evaluate 
> expression: count(1)
>   at 
> org.apache.spark.sql.catalyst.expressions.Unevaluable$class.eval(Expression.scala:188)
>   at 
> org.apache.spark.sql.catalyst.expressions.Count.eval(aggregates.scala:156)
>   at 
> org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:327)
>   at 
> org.apache.spark.sql.catalyst.expressions.InterpretedPredicate$$anonfun$create$2.apply(predicates.scala:38)
>   at 
> org.apache.spark.sql.catalyst.expressions.InterpretedPredicate$$anonfun$create$2.apply(predicates.scala:38)
>   at 
> org.apache.spark.sql.execution.Filter$$anonfun$4$$anonfun$apply$4.apply(basicOperators.scala:117)
>   at 
> org.apache.spark.sql.execution.Filter$$anonfun$4$$anonfun$apply$4.apply(basicOperators.scala:115)
>   at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:390)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11246) [1.5] Table cache for Parquet broken in 1.5

2015-10-29 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14980590#comment-14980590
 ] 

Yin Huai commented on SPARK-11246:
--

OK, I added 1.5.3 as the fix version. If we re-cut 1.5.2, I will change the fix 
version.

> [1.5] Table cache for Parquet broken in 1.5
> ---
>
> Key: SPARK-11246
> URL: https://issues.apache.org/jira/browse/SPARK-11246
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: David Ross
>Assignee: Xin Wu
> Fix For: 1.5.3, 1.6.0
>
>
> Since upgrading to 1.5.1, using the {{CACHE TABLE}} works great for all 
> tables except for parquet tables, likely related to the parquet native reader.
> Here are steps for parquet table:
> {code}
> create table test_parquet stored as parquet as select 1;
> explain select * from test_parquet;
> {code}
> With output:
> {code}
> == Physical Plan ==
> Scan 
> ParquetRelation[hdfs://192.168.99.9/user/hive/warehouse/test_parquet][_c0#141]
> {code}
> And then caching:
> {code}
> cache table test_parquet;
> explain select * from test_parquet;
> {code}
> With output:
> {code}
> == Physical Plan ==
> Scan 
> ParquetRelation[hdfs://192.168.99.9/user/hive/warehouse/test_parquet][_c0#174]
> {code}
> Note it isn't cached. I have included spark log output for the {{cache 
> table}} and {{explain}} statements below.
> ---
> Here's the same for non-parquet table:
> {code}
> cache table test_no_parquet;
> explain select * from test_no_parquet;
> {code}
> With output:
> {code}
> == Physical Plan ==
> HiveTableScan [_c0#210], (MetastoreRelation default, test_no_parquet, None)
> {code}
> And then caching:
> {code}
> cache table test_no_parquet;
> explain select * from test_no_parquet;
> {code}
> With output:
> {code}
> == Physical Plan ==
> InMemoryColumnarTableScan [_c0#229], (InMemoryRelation [_c0#229], true, 
> 1, StorageLevel(true, true, false, true, 1), (HiveTableScan [_c0#211], 
> (MetastoreRelation default, test_no_parquet, None)), Some(test_no_parquet))
> {code}
> Note that the table seems to be cached.
> ---
> Note that if the flag {{spark.sql.hive.convertMetastoreParquet}} is set to 
> {{false}}, parquet tables work the same as non-parquet tables with caching. 
> This is a reasonable workaround for us, but ideally, we would like to benefit 
> from the native reading.
> ---
> Spark logs for {{cache table}} for {{test_parquet}}:
> {code}
> 15/10/21 21:22:05 INFO thriftserver.SparkExecuteStatementOperation: Running 
> query 'cache table test_parquet' with 20ee2ab9-5242-4783-81cf-46115ed72610
> 15/10/21 21:22:05 INFO metastore.HiveMetaStore: 49: get_table : db=default 
> tbl=test_parquet
> 15/10/21 21:22:05 INFO HiveMetaStore.audit: ugi=vagrant   
> ip=unknown-ip-addr  cmd=get_table : db=default tbl=test_parquet
> 15/10/21 21:22:05 INFO metastore.HiveMetaStore: 49: Opening raw store with 
> implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
> 15/10/21 21:22:05 INFO metastore.ObjectStore: ObjectStore, initialize called
> 15/10/21 21:22:05 INFO DataNucleus.Query: Reading in results for query 
> "org.datanucleus.store.rdbms.query.SQLQuery@0" since the connection used is 
> closing
> 15/10/21 21:22:05 INFO metastore.MetaStoreDirectSql: Using direct SQL, 
> underlying DB is MYSQL
> 15/10/21 21:22:05 INFO metastore.ObjectStore: Initialized ObjectStore
> 15/10/21 21:22:05 INFO storage.MemoryStore: ensureFreeSpace(215680) called 
> with curMem=4196713, maxMem=139009720
> 15/10/21 21:22:05 INFO storage.MemoryStore: Block broadcast_59 stored as 
> values in memory (estimated size 210.6 KB, free 128.4 MB)
> 15/10/21 21:22:05 INFO storage.MemoryStore: ensureFreeSpace(20265) called 
> with curMem=4412393, maxMem=139009720
> 15/10/21 21:22:05 INFO storage.MemoryStore: Block broadcast_59_piece0 stored 
> as bytes in memory (estimated size 19.8 KB, free 128.3 MB)
> 15/10/21 21:22:05 INFO storage.BlockManagerInfo: Added broadcast_59_piece0 in 
> memory on 192.168.99.9:50262 (size: 19.8 KB, free: 132.2 MB)
> 15/10/21 21:22:05 INFO spark.SparkContext: Created broadcast 59 from run at 
> AccessController.java:-2
> 15/10/21 21:22:05 INFO metastore.HiveMetaStore: 49: get_table : db=default 
> tbl=test_parquet
> 15/10/21 21:22:05 INFO HiveMetaStore.audit: ugi=vagrant   
> ip=unknown-ip-addr  cmd=get_table : db=default tbl=test_parquet
> 15/10/21 21:22:05 INFO storage.MemoryStore: ensureFreeSpace(215680) called 
> with curMem=4432658, maxMem=139009720
> 15/10/21 21:22:05 INFO storage.MemoryStore: Block broadcast_60 stored as 
> values in memory (estimated size 210.6 KB, free 128.1 MB)
> 15/10/21 21:22:05 INFO storage.BlockManagerInfo: Removed broadcast_58_piece0 
> on 192.168.99.9:50262 in memory (size: 19.8 KB, free: 132.2 MB)
> 15/10/21 21:22:05 INFO storage.BlockManagerInfo: 

[jira] [Assigned] (SPARK-11348) Replace addOnCompleteCallback with addTaskCompletionListener() in UnsafeExternalSorter

2015-10-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11348:


Assignee: Apache Spark

> Replace addOnCompleteCallback with addTaskCompletionListener() in 
> UnsafeExternalSorter
> --
>
> Key: SPARK-11348
> URL: https://issues.apache.org/jira/browse/SPARK-11348
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Ted Yu
>Assignee: Apache Spark
>Priority: Minor
> Attachments: spark-11348.txt
>
>
> When practicing the command from SPARK-11318, I got the following:
> {code}
> [WARNING] 
> /home/hbase/spark/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java:[141,15]
>  [deprecation]  
> addOnCompleteCallback(Function0) in TaskContext has been deprecated
> {code}
> addOnCompleteCallback should be replaced with addTaskCompletionListener()
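
For reference, a minimal Scala sketch of the replacement, assuming code that runs inside a task so {{TaskContext.get()}} is non-null (the actual change in UnsafeExternalSorter is Java, but the API shape is the same):
{code}
import org.apache.spark.TaskContext

def registerCleanup(cleanup: () => Unit): Unit = {
  val ctx = TaskContext.get()
  // Deprecated:
  // ctx.addOnCompleteCallback(cleanup)
  // Replacement: the listener is invoked with the TaskContext on task completion.
  ctx.addTaskCompletionListener { _ => cleanup() }
}
{code}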



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11348) Replace addOnCompleteCallback with addTaskCompletionListener() in UnsafeExternalSorter

2015-10-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14980534#comment-14980534
 ] 

Apache Spark commented on SPARK-11348:
--

User 'tedyu' has created a pull request for this issue:
https://github.com/apache/spark/pull/9356

> Replace addOnCompleteCallback with addTaskCompletionListener() in 
> UnsafeExternalSorter
> --
>
> Key: SPARK-11348
> URL: https://issues.apache.org/jira/browse/SPARK-11348
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Ted Yu
>Priority: Minor
> Attachments: spark-11348.txt
>
>
> When practicing the command from SPARK-11318, I got the following:
> {code}
> [WARNING] 
> /home/hbase/spark/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java:[141,15]
>  [deprecation]  
> addOnCompleteCallback(Function0) in TaskContext has been deprecated
> {code}
> addOnCompleteCallback should be replaced with addTaskCompletionListener()



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11269) Java API support & test cases

2015-10-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14980610#comment-14980610
 ] 

Apache Spark commented on SPARK-11269:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/9358

> Java API support & test cases
> -
>
> Key: SPARK-11269
> URL: https://issues.apache.org/jira/browse/SPARK-11269
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6373) Add SSL/TLS for the Netty based BlockTransferService

2015-10-29 Thread Jeffrey Turpin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14980657#comment-14980657
 ] 

Jeffrey Turpin commented on SPARK-6373:
---

[~jlewandowski], yes I am. I got very busy at work over the last few months and 
had to table this work. I wasn't getting much feedback, so any feedback you 
want to give would be appreciated. I have a couple of small refactorings to finish, 
and then I will push my latest changes to my branch and can create a PR after that... 
Sorry for the long delays...

Jeff


> Add SSL/TLS for the Netty based BlockTransferService 
> -
>
> Key: SPARK-6373
> URL: https://issues.apache.org/jira/browse/SPARK-6373
> Project: Spark
>  Issue Type: New Feature
>  Components: Block Manager, Shuffle
>Affects Versions: 1.2.1
>Reporter: Jeffrey Turpin
>
> Add the ability to allow for secure communications (SSL/TLS) for the Netty 
> based BlockTransferService and the ExternalShuffleClient. This ticket will 
> hopefully start the conversation around potential designs... Below is a 
> reference to a WIP prototype which implements this functionality 
> (prototype)... I have attempted to disrupt as little code as possible and 
> tried to follow the current code structure (for the most part) in the areas I 
> modified. I also studied how Hadoop achieves encrypted shuffle 
> (http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/EncryptedShuffle.html)
> https://github.com/turp1twin/spark/commit/024b559f27945eb63068d1badf7f82e4e7c3621c



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2089) With YARN, preferredNodeLocalityData isn't honored

2015-10-29 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14980717#comment-14980717
 ] 

Sandy Ryza commented on SPARK-2089:
---

My opinion is that we should be moving towards dynamic allocation as the norm, 
both for batch and long-running applications.  With dynamic allocation turned 
on, it's possible to attain close to the same behavior as static allocation if 
you set max executors and a really fast ramp-up time.
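
For illustration, the kind of settings this refers to might look like the sketch below; the keys are the standard dynamic-allocation properties, while the concrete values are only assumptions, not recommendations:
{code}
import org.apache.spark.SparkConf

// Approximate static allocation with dynamic allocation: cap the executor
// count and ramp up quickly when tasks are backlogged.
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")                 // required for dynamic allocation on YARN
  .set("spark.dynamicAllocation.maxExecutors", "200")           // cap comparable to a static allocation
  .set("spark.dynamicAllocation.schedulerBacklogTimeout", "1s") // fast ramp-up
{code}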

> With YARN, preferredNodeLocalityData isn't honored 
> ---
>
> Key: SPARK-2089
> URL: https://issues.apache.org/jira/browse/SPARK-2089
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.0.0
>Reporter: Sandy Ryza
>Assignee: Sandy Ryza
>Priority: Critical
>
> When running in YARN cluster mode, apps can pass preferred locality data when 
> constructing a Spark context that will dictate where to request executor 
> containers.
> This is currently broken because of a race condition.  The Spark-YARN code 
> runs the user class and waits for it to start up a SparkContext.  During its 
> initialization, the SparkContext will create a YarnClusterScheduler, which 
> notifies a monitor in the Spark-YARN code that the SparkContext is up.  The Spark-YARN code then 
> immediately fetches the preferredNodeLocationData from the SparkContext and 
> uses it to start requesting containers.
> But in the SparkContext constructor that takes the preferredNodeLocationData, 
> setting preferredNodeLocationData comes after the rest of the initialization, 
> so, if the Spark-YARN code comes around quickly enough after being notified, 
> the data that's fetched is the empty, unset version.  This occurred during all 
> of my runs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11330) Filter operation on StringType after groupBy PERSISTED brings no results

2015-10-29 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14980560#comment-14980560
 ] 

Yin Huai commented on SPARK-11330:
--

https://issues.apache.org/jira/browse/SPARK-10859 should fix this problem. Can 
you check out the 1.5 branch and try it locally?

> Filter operation on StringType after groupBy PERSISTED brings no results
> 
>
> Key: SPARK-11330
> URL: https://issues.apache.org/jira/browse/SPARK-11330
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1
> Environment: Stand alone Cluster of five servers (happens as well in 
> local mode). sqlContext instance of HiveContext (happens as well with 
> SQLContext)
> No special options other than driver memory and executor memory.
> Parquet partitions are 512 where there are 160 cores. Happens as well with 
> other partitioning
> Data is nearly 2 billion rows.
>Reporter: Saif Addin Ellafi
>Priority: Blocker
> Attachments: bug_reproduce.zip, bug_reproduce_50k.zip
>
>
> ONLY HAPPENS WHEN PERSIST() IS CALLED
> val data = sqlContext.read.parquet("/var/data/Saif/ppnr_pqt")
> data.groupBy("vintages").count.select("vintages").filter("vintages = 
> '2007-01-01'").first
> >>> res9: org.apache.spark.sql.Row = [2007-01-01]
> data.groupBy("vintages").count.persist.select("vintages").filter("vintages = 
> '2007-01-01'").first
> >>> Exception on empty iterator stuff
> This does not happen if using another type of field, eg IntType
> data.groupBy("mm").count.persist.select("mm").filter("mm = 
> 200805").first >>> res13: org.apache.spark.sql.Row = [200805]
> NOTE: I have no idea whether I used the KRYO serializer when I stored this parquet.
> NOTE2: If persist is applied after the filtering, it works fine. But this is 
> not a good enough workaround since any filter operation afterwards will break 
> results.
> NOTE3: I have reproduced the issue with several different datasets.
> NOTE4: The only real workaround is to store the groupBy dataframe in database 
> and reload it as a new dataframe.
> Query to raw-data works fine:
> data.select("vintages").filter("vintages = '2007-01-01'").first >>> res4: 
> org.apache.spark.sql.Row = [2007-01-01]
> Originally, the issue happened with a larger aggregation operation, the 
> result was that data was inconsistent bringing different results every call.
> Reducing the operation step by step, I got into this issue.
> In any case, the original operation was:
> val data = sqlContext.read.parquet("/var/Saif/data_pqt")
> val res = data.groupBy("product", "band", "age", "vint", "mb", 
> "mm").agg(count($"account_id").as("N"), 
> sum($"balance").as("balance_eom"), sum($"balancec").as("balance_eoc"), 
> sum($"spend").as("spend"), sum($"payment").as("payment"), 
> sum($"feoc").as("feoc"), sum($"cfintbal").as("cfintbal"), count($"newacct" 
> === 1).as("newacct")).persist()
> val z = res.select("vint", "mm").filter("vint = 
> '2007-01-01'").select("mm").distinct.collect
> z.length
> >>> res0: Int = 102
> res.unpersist()
> val z = res.select("vint", "mm").filter("vint = 
> '2007-01-01'").select("mm").distinct.collect
> z.length
> >>> res1: Int = 103
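
A short sketch of the ordering workaround from NOTE2, using the {{data}} DataFrame from the repro above (hedged: it only helps when no further filters follow the persist):
{code}
// Filter first, then persist the filtered result.
val counts = data.groupBy("vintages").count()
val filtered = counts.filter("vintages = '2007-01-01'").persist()
filtered.select("vintages").first()
{code}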



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11404) groupBy on column expressions

2015-10-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11404:


Assignee: Apache Spark  (was: Michael Armbrust)

> groupBy on column expressions
> -
>
> Key: SPARK-11404
> URL: https://issues.apache.org/jira/browse/SPARK-11404
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10859) Predicates pushed to InmemoryColumnarTableScan are not evaluated correctly

2015-10-29 Thread Saif Addin Ellafi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14980546#comment-14980546
 ] 

Saif Addin Ellafi commented on SPARK-10859:
---

Hi, I have been told that this issue: 
https://issues.apache.org/jira/browse/SPARK-11330 
was caused by this problem.

If there is any chance to confirm that the scenario in that issue is now solved, it 
would be appreciated.
Saif

> Predicates pushed to InmemoryColumnarTableScan are not evaluated correctly
> --
>
> Key: SPARK-10859
> URL: https://issues.apache.org/jira/browse/SPARK-10859
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Blocker
> Fix For: 1.5.2, 1.6.0
>
>
> {code}
> var data01 = sqlContext.sql("select 1 as id, \"{\\\"animal\\\":{\\\"type\\\": 
> \\\"cat\\\"}},{\\\"animal\\\":{\\\"type\\\": 
> \\\"dog\\\"}},{\\\"animal\\\":{\\\"type\\\": 
> \\\"donkey\\\"}},{\\\"animal\\\":{\\\"type\\\": 
> \\\"turkey\\\"}},{\\\"animal\\\":{\\\"type\\\": 
> \\\"cat\\\"}},{\\\"animal\\\":{\\\"NOTANIMAL\\\": \\\"measuring tape\\\"}}\" 
> as field")
> case class SubField(fieldling: String)
> var data02 = data01.explode(data01("field")){ case Row(field: String) => 
> field.split(",").map(SubField(_))}
>   .selectExpr("id","fieldling","get_json_object(fieldling,\"$.animal.type\") 
> as animal") 
> var data03 = data01.explode(data01("field")){ case Row(field: String) => 
> field.split(",").map(SubField(_))}
>   .selectExpr("id","fieldling","get_json_object(fieldling,\"$.animal.type\") 
> as animal")
> data02.cache()
> data02.select($"animal" === "cat").explain
> == Physical Plan ==
> Project [(animal#25 = cat) AS (animal = cat)#263]
>  InMemoryColumnarTableScan [animal#25], (InMemoryRelation 
> [id#20,fieldling#24,animal#25], true, 1, StorageLevel(true, true, false, 
> true, 1), (TungstenProject 
> [id#20,fieldling#24,get_json_object(fieldling#24,$.animal.type) AS 
> animal#25]), None)
> data02.select($"animal" === "cat").show
> +--+
> |(animal = cat)|
> +--+
> |  true|
> | false|
> | false|
> | false|
> |  true|
> |  null|
> +--+
> data02.filter($"animal" === "cat").explain
> == Physical Plan ==
> Filter (animal#25 = cat)
>  InMemoryColumnarTableScan [id#20,fieldling#24,animal#25], [(animal#25 = 
> cat)], (InMemoryRelation [id#20,fieldling#24,animal#25], true, 1, 
> StorageLevel(true, true, false, true, 1), (TungstenProject 
> [id#20,fieldling#24,get_json_object(fieldling#24,$.animal.type) AS 
> animal#25]), None)
> data02.filter($"animal" === "cat").show
> +---+-+--+
> | id|fieldling|animal|
> +---+-+--+
> +---+-+--+
> {code}
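
One mitigation that may apply while on 1.5.0/1.5.1 is to disable in-memory partition pruning, so cached batches are not skipped based on the incorrect column statistics; whether it helps a given workload is an assumption that should be verified against the repro above (same shell session assumed):
{code}
// Hedged mitigation sketch: turn off batch-level pruning for cached data.
sqlContext.setConf("spark.sql.inMemoryColumnarStorage.partitionPruning", "false")
data02.filter($"animal" === "cat").show()  // re-run the failing step above
{code}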



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11246) [1.5] Table cache for Parquet broken in 1.5

2015-10-29 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-11246:
-
Fix Version/s: 1.5.3

> [1.5] Table cache for Parquet broken in 1.5
> ---
>
> Key: SPARK-11246
> URL: https://issues.apache.org/jira/browse/SPARK-11246
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: David Ross
>Assignee: Xin Wu
> Fix For: 1.5.3, 1.6.0
>
>
> Since upgrading to 1.5.1, using the {{CACHE TABLE}} works great for all 
> tables except for parquet tables, likely related to the parquet native reader.
> Here are steps for parquet table:
> {code}
> create table test_parquet stored as parquet as select 1;
> explain select * from test_parquet;
> {code}
> With output:
> {code}
> == Physical Plan ==
> Scan 
> ParquetRelation[hdfs://192.168.99.9/user/hive/warehouse/test_parquet][_c0#141]
> {code}
> And then caching:
> {code}
> cache table test_parquet;
> explain select * from test_parquet;
> {code}
> With output:
> {code}
> == Physical Plan ==
> Scan 
> ParquetRelation[hdfs://192.168.99.9/user/hive/warehouse/test_parquet][_c0#174]
> {code}
> Note it isn't cached. I have included spark log output for the {{cache 
> table}} and {{explain}} statements below.
> ---
> Here's the same for non-parquet table:
> {code}
> cache table test_no_parquet;
> explain select * from test_no_parquet;
> {code}
> With output:
> {code}
> == Physical Plan ==
> HiveTableScan [_c0#210], (MetastoreRelation default, test_no_parquet, None)
> {code}
> And then caching:
> {code}
> cache table test_no_parquet;
> explain select * from test_no_parquet;
> {code}
> With output:
> {code}
> == Physical Plan ==
> InMemoryColumnarTableScan [_c0#229], (InMemoryRelation [_c0#229], true, 
> 1, StorageLevel(true, true, false, true, 1), (HiveTableScan [_c0#211], 
> (MetastoreRelation default, test_no_parquet, None)), Some(test_no_parquet))
> {code}
> Note that the table seems to be cached.
> ---
> Note that if the flag {{spark.sql.hive.convertMetastoreParquet}} is set to 
> {{false}}, parquet tables work the same as non-parquet tables with caching. 
> This is a reasonable workaround for us, but ideally, we would like to benefit 
> from the native reading.
> ---
> Spark logs for {{cache table}} for {{test_parquet}}:
> {code}
> 15/10/21 21:22:05 INFO thriftserver.SparkExecuteStatementOperation: Running 
> query 'cache table test_parquet' with 20ee2ab9-5242-4783-81cf-46115ed72610
> 15/10/21 21:22:05 INFO metastore.HiveMetaStore: 49: get_table : db=default 
> tbl=test_parquet
> 15/10/21 21:22:05 INFO HiveMetaStore.audit: ugi=vagrant   
> ip=unknown-ip-addr  cmd=get_table : db=default tbl=test_parquet
> 15/10/21 21:22:05 INFO metastore.HiveMetaStore: 49: Opening raw store with 
> implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
> 15/10/21 21:22:05 INFO metastore.ObjectStore: ObjectStore, initialize called
> 15/10/21 21:22:05 INFO DataNucleus.Query: Reading in results for query 
> "org.datanucleus.store.rdbms.query.SQLQuery@0" since the connection used is 
> closing
> 15/10/21 21:22:05 INFO metastore.MetaStoreDirectSql: Using direct SQL, 
> underlying DB is MYSQL
> 15/10/21 21:22:05 INFO metastore.ObjectStore: Initialized ObjectStore
> 15/10/21 21:22:05 INFO storage.MemoryStore: ensureFreeSpace(215680) called 
> with curMem=4196713, maxMem=139009720
> 15/10/21 21:22:05 INFO storage.MemoryStore: Block broadcast_59 stored as 
> values in memory (estimated size 210.6 KB, free 128.4 MB)
> 15/10/21 21:22:05 INFO storage.MemoryStore: ensureFreeSpace(20265) called 
> with curMem=4412393, maxMem=139009720
> 15/10/21 21:22:05 INFO storage.MemoryStore: Block broadcast_59_piece0 stored 
> as bytes in memory (estimated size 19.8 KB, free 128.3 MB)
> 15/10/21 21:22:05 INFO storage.BlockManagerInfo: Added broadcast_59_piece0 in 
> memory on 192.168.99.9:50262 (size: 19.8 KB, free: 132.2 MB)
> 15/10/21 21:22:05 INFO spark.SparkContext: Created broadcast 59 from run at 
> AccessController.java:-2
> 15/10/21 21:22:05 INFO metastore.HiveMetaStore: 49: get_table : db=default 
> tbl=test_parquet
> 15/10/21 21:22:05 INFO HiveMetaStore.audit: ugi=vagrant   
> ip=unknown-ip-addr  cmd=get_table : db=default tbl=test_parquet
> 15/10/21 21:22:05 INFO storage.MemoryStore: ensureFreeSpace(215680) called 
> with curMem=4432658, maxMem=139009720
> 15/10/21 21:22:05 INFO storage.MemoryStore: Block broadcast_60 stored as 
> values in memory (estimated size 210.6 KB, free 128.1 MB)
> 15/10/21 21:22:05 INFO storage.BlockManagerInfo: Removed broadcast_58_piece0 
> on 192.168.99.9:50262 in memory (size: 19.8 KB, free: 132.2 MB)
> 15/10/21 21:22:05 INFO storage.BlockManagerInfo: Removed broadcast_57_piece0 
> on 192.168.99.9:50262 in memory (size: 21.1 KB, free: 132.2 MB)
> 15/10/21 

[jira] [Updated] (SPARK-11246) [1.5] Table cache for Parquet broken in 1.5

2015-10-29 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-11246:
-
Fix Version/s: (was: 1.5.2)

> [1.5] Table cache for Parquet broken in 1.5
> ---
>
> Key: SPARK-11246
> URL: https://issues.apache.org/jira/browse/SPARK-11246
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: David Ross
>Assignee: Xin Wu
> Fix For: 1.6.0
>
>
> Since upgrading to 1.5.1, using the {{CACHE TABLE}} works great for all 
> tables except for parquet tables, likely related to the parquet native reader.
> Here are steps for parquet table:
> {code}
> create table test_parquet stored as parquet as select 1;
> explain select * from test_parquet;
> {code}
> With output:
> {code}
> == Physical Plan ==
> Scan 
> ParquetRelation[hdfs://192.168.99.9/user/hive/warehouse/test_parquet][_c0#141]
> {code}
> And then caching:
> {code}
> cache table test_parquet;
> explain select * from test_parquet;
> {code}
> With output:
> {code}
> == Physical Plan ==
> Scan 
> ParquetRelation[hdfs://192.168.99.9/user/hive/warehouse/test_parquet][_c0#174]
> {code}
> Note it isn't cached. I have included spark log output for the {{cache 
> table}} and {{explain}} statements below.
> ---
> Here's the same for non-parquet table:
> {code}
> cache table test_no_parquet;
> explain select * from test_no_parquet;
> {code}
> With output:
> {code}
> == Physical Plan ==
> HiveTableScan [_c0#210], (MetastoreRelation default, test_no_parquet, None)
> {code}
> And then caching:
> {code}
> cache table test_no_parquet;
> explain select * from test_no_parquet;
> {code}
> With output:
> {code}
> == Physical Plan ==
> InMemoryColumnarTableScan [_c0#229], (InMemoryRelation [_c0#229], true, 
> 1, StorageLevel(true, true, false, true, 1), (HiveTableScan [_c0#211], 
> (MetastoreRelation default, test_no_parquet, None)), Some(test_no_parquet))
> {code}
> Note that the table seems to be cached.
> ---
> Note that if the flag {{spark.sql.hive.convertMetastoreParquet}} is set to 
> {{false}}, parquet tables work the same as non-parquet tables with caching. 
> This is a reasonable workaround for us, but ideally, we would like to benefit 
> from the native reading.
> ---
> Spark logs for {{cache table}} for {{test_parquet}}:
> {code}
> 15/10/21 21:22:05 INFO thriftserver.SparkExecuteStatementOperation: Running 
> query 'cache table test_parquet' with 20ee2ab9-5242-4783-81cf-46115ed72610
> 15/10/21 21:22:05 INFO metastore.HiveMetaStore: 49: get_table : db=default 
> tbl=test_parquet
> 15/10/21 21:22:05 INFO HiveMetaStore.audit: ugi=vagrant   
> ip=unknown-ip-addr  cmd=get_table : db=default tbl=test_parquet
> 15/10/21 21:22:05 INFO metastore.HiveMetaStore: 49: Opening raw store with 
> implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
> 15/10/21 21:22:05 INFO metastore.ObjectStore: ObjectStore, initialize called
> 15/10/21 21:22:05 INFO DataNucleus.Query: Reading in results for query 
> "org.datanucleus.store.rdbms.query.SQLQuery@0" since the connection used is 
> closing
> 15/10/21 21:22:05 INFO metastore.MetaStoreDirectSql: Using direct SQL, 
> underlying DB is MYSQL
> 15/10/21 21:22:05 INFO metastore.ObjectStore: Initialized ObjectStore
> 15/10/21 21:22:05 INFO storage.MemoryStore: ensureFreeSpace(215680) called 
> with curMem=4196713, maxMem=139009720
> 15/10/21 21:22:05 INFO storage.MemoryStore: Block broadcast_59 stored as 
> values in memory (estimated size 210.6 KB, free 128.4 MB)
> 15/10/21 21:22:05 INFO storage.MemoryStore: ensureFreeSpace(20265) called 
> with curMem=4412393, maxMem=139009720
> 15/10/21 21:22:05 INFO storage.MemoryStore: Block broadcast_59_piece0 stored 
> as bytes in memory (estimated size 19.8 KB, free 128.3 MB)
> 15/10/21 21:22:05 INFO storage.BlockManagerInfo: Added broadcast_59_piece0 in 
> memory on 192.168.99.9:50262 (size: 19.8 KB, free: 132.2 MB)
> 15/10/21 21:22:05 INFO spark.SparkContext: Created broadcast 59 from run at 
> AccessController.java:-2
> 15/10/21 21:22:05 INFO metastore.HiveMetaStore: 49: get_table : db=default 
> tbl=test_parquet
> 15/10/21 21:22:05 INFO HiveMetaStore.audit: ugi=vagrant   
> ip=unknown-ip-addr  cmd=get_table : db=default tbl=test_parquet
> 15/10/21 21:22:05 INFO storage.MemoryStore: ensureFreeSpace(215680) called 
> with curMem=4432658, maxMem=139009720
> 15/10/21 21:22:05 INFO storage.MemoryStore: Block broadcast_60 stored as 
> values in memory (estimated size 210.6 KB, free 128.1 MB)
> 15/10/21 21:22:05 INFO storage.BlockManagerInfo: Removed broadcast_58_piece0 
> on 192.168.99.9:50262 in memory (size: 19.8 KB, free: 132.2 MB)
> 15/10/21 21:22:05 INFO storage.BlockManagerInfo: Removed broadcast_57_piece0 
> on 192.168.99.9:50262 in memory (size: 21.1 KB, free: 132.2 MB)
> 15/10/21 

[jira] [Commented] (SPARK-11246) [1.5] Table cache for Parquet broken in 1.5

2015-10-29 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14980570#comment-14980570
 ] 

Yin Huai commented on SPARK-11246:
--

I picked it into branch-1.5. Will update the fix version later.

> [1.5] Table cache for Parquet broken in 1.5
> ---
>
> Key: SPARK-11246
> URL: https://issues.apache.org/jira/browse/SPARK-11246
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: David Ross
>Assignee: Xin Wu
> Fix For: 1.6.0
>
>
> Since upgrading to 1.5.1, using the {{CACHE TABLE}} works great for all 
> tables except for parquet tables, likely related to the parquet native reader.
> Here are steps for parquet table:
> {code}
> create table test_parquet stored as parquet as select 1;
> explain select * from test_parquet;
> {code}
> With output:
> {code}
> == Physical Plan ==
> Scan 
> ParquetRelation[hdfs://192.168.99.9/user/hive/warehouse/test_parquet][_c0#141]
> {code}
> And then caching:
> {code}
> cache table test_parquet;
> explain select * from test_parquet;
> {code}
> With output:
> {code}
> == Physical Plan ==
> Scan 
> ParquetRelation[hdfs://192.168.99.9/user/hive/warehouse/test_parquet][_c0#174]
> {code}
> Note it isn't cached. I have included spark log output for the {{cache 
> table}} and {{explain}} statements below.
> ---
> Here's the same for non-parquet table:
> {code}
> cache table test_no_parquet;
> explain select * from test_no_parquet;
> {code}
> With output:
> {code}
> == Physical Plan ==
> HiveTableScan [_c0#210], (MetastoreRelation default, test_no_parquet, None)
> {code}
> And then caching:
> {code}
> cache table test_no_parquet;
> explain select * from test_no_parquet;
> {code}
> With output:
> {code}
> == Physical Plan ==
> InMemoryColumnarTableScan [_c0#229], (InMemoryRelation [_c0#229], true, 
> 1, StorageLevel(true, true, false, true, 1), (HiveTableScan [_c0#211], 
> (MetastoreRelation default, test_no_parquet, None)), Some(test_no_parquet))
> {code}
> Note that the table seems to be cached.
> ---
> Note that if the flag {{spark.sql.hive.convertMetastoreParquet}} is set to 
> {{false}}, parquet tables work the same as non-parquet tables with caching. 
> This is a reasonable workaround for us, but ideally, we would like to benefit 
> from the native reading.
> ---
> Spark logs for {{cache table}} for {{test_parquet}}:
> {code}
> 15/10/21 21:22:05 INFO thriftserver.SparkExecuteStatementOperation: Running 
> query 'cache table test_parquet' with 20ee2ab9-5242-4783-81cf-46115ed72610
> 15/10/21 21:22:05 INFO metastore.HiveMetaStore: 49: get_table : db=default 
> tbl=test_parquet
> 15/10/21 21:22:05 INFO HiveMetaStore.audit: ugi=vagrant   
> ip=unknown-ip-addr  cmd=get_table : db=default tbl=test_parquet
> 15/10/21 21:22:05 INFO metastore.HiveMetaStore: 49: Opening raw store with 
> implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
> 15/10/21 21:22:05 INFO metastore.ObjectStore: ObjectStore, initialize called
> 15/10/21 21:22:05 INFO DataNucleus.Query: Reading in results for query 
> "org.datanucleus.store.rdbms.query.SQLQuery@0" since the connection used is 
> closing
> 15/10/21 21:22:05 INFO metastore.MetaStoreDirectSql: Using direct SQL, 
> underlying DB is MYSQL
> 15/10/21 21:22:05 INFO metastore.ObjectStore: Initialized ObjectStore
> 15/10/21 21:22:05 INFO storage.MemoryStore: ensureFreeSpace(215680) called 
> with curMem=4196713, maxMem=139009720
> 15/10/21 21:22:05 INFO storage.MemoryStore: Block broadcast_59 stored as 
> values in memory (estimated size 210.6 KB, free 128.4 MB)
> 15/10/21 21:22:05 INFO storage.MemoryStore: ensureFreeSpace(20265) called 
> with curMem=4412393, maxMem=139009720
> 15/10/21 21:22:05 INFO storage.MemoryStore: Block broadcast_59_piece0 stored 
> as bytes in memory (estimated size 19.8 KB, free 128.3 MB)
> 15/10/21 21:22:05 INFO storage.BlockManagerInfo: Added broadcast_59_piece0 in 
> memory on 192.168.99.9:50262 (size: 19.8 KB, free: 132.2 MB)
> 15/10/21 21:22:05 INFO spark.SparkContext: Created broadcast 59 from run at 
> AccessController.java:-2
> 15/10/21 21:22:05 INFO metastore.HiveMetaStore: 49: get_table : db=default 
> tbl=test_parquet
> 15/10/21 21:22:05 INFO HiveMetaStore.audit: ugi=vagrant   
> ip=unknown-ip-addr  cmd=get_table : db=default tbl=test_parquet
> 15/10/21 21:22:05 INFO storage.MemoryStore: ensureFreeSpace(215680) called 
> with curMem=4432658, maxMem=139009720
> 15/10/21 21:22:05 INFO storage.MemoryStore: Block broadcast_60 stored as 
> values in memory (estimated size 210.6 KB, free 128.1 MB)
> 15/10/21 21:22:05 INFO storage.BlockManagerInfo: Removed broadcast_58_piece0 
> on 192.168.99.9:50262 in memory (size: 19.8 KB, free: 132.2 MB)
> 15/10/21 21:22:05 INFO storage.BlockManagerInfo: Removed broadcast_57_piece0 
> on 

[jira] [Resolved] (SPARK-11246) [1.5] Table cache for Parquet broken in 1.5

2015-10-29 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-11246.
--
   Resolution: Fixed
Fix Version/s: 1.5.2
   1.6.0

Issue resolved by pull request 9326
[https://github.com/apache/spark/pull/9326]
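
The workaround called out in the description below can be written as a short sketch (hedged: assumes a HiveContext-backed {{sqlContext}}, and it gives up the native Parquet reader):
{code}
// Fall back to the Hive SerDe path so CACHE TABLE behaves as it does for
// non-Parquet tables.
sqlContext.setConf("spark.sql.hive.convertMetastoreParquet", "false")
sqlContext.sql("CACHE TABLE test_parquet")
sqlContext.sql("SELECT * FROM test_parquet").explain()
{code}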

> [1.5] Table cache for Parquet broken in 1.5
> ---
>
> Key: SPARK-11246
> URL: https://issues.apache.org/jira/browse/SPARK-11246
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: David Ross
>Assignee: Xin Wu
> Fix For: 1.6.0, 1.5.2
>
>
> Since upgrading to 1.5.1, using the {{CACHE TABLE}} works great for all 
> tables except for parquet tables, likely related to the parquet native reader.
> Here are steps for parquet table:
> {code}
> create table test_parquet stored as parquet as select 1;
> explain select * from test_parquet;
> {code}
> With output:
> {code}
> == Physical Plan ==
> Scan 
> ParquetRelation[hdfs://192.168.99.9/user/hive/warehouse/test_parquet][_c0#141]
> {code}
> And then caching:
> {code}
> cache table test_parquet;
> explain select * from test_parquet;
> {code}
> With output:
> {code}
> == Physical Plan ==
> Scan 
> ParquetRelation[hdfs://192.168.99.9/user/hive/warehouse/test_parquet][_c0#174]
> {code}
> Note it isn't cached. I have included spark log output for the {{cache 
> table}} and {{explain}} statements below.
> ---
> Here's the same for non-parquet table:
> {code}
> cache table test_no_parquet;
> explain select * from test_no_parquet;
> {code}
> With output:
> {code}
> == Physical Plan ==
> HiveTableScan [_c0#210], (MetastoreRelation default, test_no_parquet, None)
> {code}
> And then caching:
> {code}
> cache table test_no_parquet;
> explain select * from test_no_parquet;
> {code}
> With output:
> {code}
> == Physical Plan ==
> InMemoryColumnarTableScan [_c0#229], (InMemoryRelation [_c0#229], true, 
> 1, StorageLevel(true, true, false, true, 1), (HiveTableScan [_c0#211], 
> (MetastoreRelation default, test_no_parquet, None)), Some(test_no_parquet))
> {code}
> Note that the table seems to be cached.
> ---
> Note that if the flag {{spark.sql.hive.convertMetastoreParquet}} is set to 
> {{false}}, parquet tables work the same as non-parquet tables with caching. 
> This is a reasonable workaround for us, but ideally, we would like to benefit 
> from the native reading.
> ---
> Spark logs for {{cache table}} for {{test_parquet}}:
> {code}
> 15/10/21 21:22:05 INFO thriftserver.SparkExecuteStatementOperation: Running 
> query 'cache table test_parquet' with 20ee2ab9-5242-4783-81cf-46115ed72610
> 15/10/21 21:22:05 INFO metastore.HiveMetaStore: 49: get_table : db=default 
> tbl=test_parquet
> 15/10/21 21:22:05 INFO HiveMetaStore.audit: ugi=vagrant   
> ip=unknown-ip-addr  cmd=get_table : db=default tbl=test_parquet
> 15/10/21 21:22:05 INFO metastore.HiveMetaStore: 49: Opening raw store with 
> implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
> 15/10/21 21:22:05 INFO metastore.ObjectStore: ObjectStore, initialize called
> 15/10/21 21:22:05 INFO DataNucleus.Query: Reading in results for query 
> "org.datanucleus.store.rdbms.query.SQLQuery@0" since the connection used is 
> closing
> 15/10/21 21:22:05 INFO metastore.MetaStoreDirectSql: Using direct SQL, 
> underlying DB is MYSQL
> 15/10/21 21:22:05 INFO metastore.ObjectStore: Initialized ObjectStore
> 15/10/21 21:22:05 INFO storage.MemoryStore: ensureFreeSpace(215680) called 
> with curMem=4196713, maxMem=139009720
> 15/10/21 21:22:05 INFO storage.MemoryStore: Block broadcast_59 stored as 
> values in memory (estimated size 210.6 KB, free 128.4 MB)
> 15/10/21 21:22:05 INFO storage.MemoryStore: ensureFreeSpace(20265) called 
> with curMem=4412393, maxMem=139009720
> 15/10/21 21:22:05 INFO storage.MemoryStore: Block broadcast_59_piece0 stored 
> as bytes in memory (estimated size 19.8 KB, free 128.3 MB)
> 15/10/21 21:22:05 INFO storage.BlockManagerInfo: Added broadcast_59_piece0 in 
> memory on 192.168.99.9:50262 (size: 19.8 KB, free: 132.2 MB)
> 15/10/21 21:22:05 INFO spark.SparkContext: Created broadcast 59 from run at 
> AccessController.java:-2
> 15/10/21 21:22:05 INFO metastore.HiveMetaStore: 49: get_table : db=default 
> tbl=test_parquet
> 15/10/21 21:22:05 INFO HiveMetaStore.audit: ugi=vagrant   
> ip=unknown-ip-addr  cmd=get_table : db=default tbl=test_parquet
> 15/10/21 21:22:05 INFO storage.MemoryStore: ensureFreeSpace(215680) called 
> with curMem=4432658, maxMem=139009720
> 15/10/21 21:22:05 INFO storage.MemoryStore: Block broadcast_60 stored as 
> values in memory (estimated size 210.6 KB, free 128.1 MB)
> 15/10/21 21:22:05 INFO storage.BlockManagerInfo: Removed broadcast_58_piece0 
> on 192.168.99.9:50262 in memory (size: 19.8 KB, free: 132.2 MB)
> 15/10/21 21:22:05 INFO 

[jira] [Updated] (SPARK-11383) Replace example code in mllib-naive-bayes.md/mllib-isotonic-regression.md using include_example

2015-10-29 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-11383:
--
Shepherd: Xusen Yin

> Replace example code in mllib-naive-bayes.md/mllib-isotonic-regression.md 
> using include_example
> ---
>
> Key: SPARK-11383
> URL: https://issues.apache.org/jira/browse/SPARK-11383
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Xusen Yin
>  Labels: starter
>
> This is similar to SPARK-11289 but for the example code in 
> mllib-naive-bayes.md and mllib-isotonic-regression.md.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11405) ROW_NUMBER function does not adhere to window ORDER BY, when joining

2015-10-29 Thread Jarno Seppanen (JIRA)
Jarno Seppanen created SPARK-11405:
--

 Summary: ROW_NUMBER function does not adhere to window ORDER BY, 
when joining
 Key: SPARK-11405
 URL: https://issues.apache.org/jira/browse/SPARK-11405
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
 Environment: YARN
Reporter: Jarno Seppanen


The following query produces incorrect results:
{code:sql}
  SELECT a.i, a.x,
ROW_NUMBER() OVER (
  PARTITION BY a.i ORDER BY a.x) AS row_num
  FROM a
  JOIN b ON b.i = a.i;
+---++---+
|  i|   x|row_num|
+---++---+
|  1|  0.8717439935587555|  1|
|  1|  0.6684483939068196|  2|
|  1|  0.3378351523586306|  3|
|  1|  0.2483285619632939|  4|
|  1|  0.4796752841655936|  5|
|  2|  0.2971739640384895|  1|
|  2|  0.2199359901600595|  2|
|  2|  0.4646004597998037|  3|
|  2| 0.24823688829578183|  4|
|  2|  0.5914212915574378|  5|
|  3|0.010912835935112164|  1|
|  3|  0.6520139509583123|  2|
|  3|  0.8571994559240592|  3|
|  3|  0.1122635843020473|  4|
|  3| 0.45913022936460457|  5|
+---++---+
{code}

The row number doesn't follow the correct order. The join seems to break the 
order; ROW_NUMBER() works correctly if the join results are saved to a 
temporary table and a second query is made.

Here's a small PySpark test case to reproduce the error:

{code}

from pyspark.sql import Row
import random

a = sc.parallelize([Row(i=i, x=random.random())
for i in range(5)
for j in range(5)])

b = sc.parallelize([Row(i=i) for i in [1, 2, 3]])

af = sqlContext.createDataFrame(a)
bf = sqlContext.createDataFrame(b)
af.registerTempTable('a')
bf.registerTempTable('b')

af.show()
# +---++
# |  i|   x|
# +---++
# |  0| 0.12978974167478896|
# |  0|  0.7105927498584452|
# |  0| 0.21225679077448045|
# |  0| 0.03849717391728036|
# |  0|  0.4976622146442401|
# |  1|  0.4796752841655936|
# |  1|  0.8717439935587555|
# |  1|  0.6684483939068196|
# |  1|  0.3378351523586306|
# |  1|  0.2483285619632939|
# |  2|  0.2971739640384895|
# |  2|  0.2199359901600595|
# |  2|  0.5914212915574378|
# |  2| 0.24823688829578183|
# |  2|  0.4646004597998037|
# |  3|  0.1122635843020473|
# |  3|  0.6520139509583123|
# |  3| 0.45913022936460457|
# |  3|0.010912835935112164|
# |  3|  0.8571994559240592|
# +---++
# only showing top 20 rows

bf.show()
# +---+
# |  i|
# +---+
# |  1|
# |  2|
# |  3|
# +---+

### WRONG

sqlContext.sql("""
  SELECT a.i, a.x,
ROW_NUMBER() OVER (
  PARTITION BY a.i ORDER BY a.x) AS row_num
  FROM a
  JOIN b ON b.i = a.i
""").show()
# +---++---+
# |  i|   x|row_num|
# +---++---+
# |  1|  0.8717439935587555|  1|
# |  1|  0.6684483939068196|  2|
# |  1|  0.3378351523586306|  3|
# |  1|  0.2483285619632939|  4|
# |  1|  0.4796752841655936|  5|
# |  2|  0.2971739640384895|  1|
# |  2|  0.2199359901600595|  2|
# |  2|  0.4646004597998037|  3|
# |  2| 0.24823688829578183|  4|
# |  2|  0.5914212915574378|  5|
# |  3|0.010912835935112164|  1|
# |  3|  0.6520139509583123|  2|
# |  3|  0.8571994559240592|  3|
# |  3|  0.1122635843020473|  4|
# |  3| 0.45913022936460457|  5|
# +---++---+

### WORKAROUND BY USING TEMP TABLE

t = sqlContext.sql("""
  SELECT a.i, a.x
  FROM a
  JOIN b ON b.i = a.i
""").cache()
# trigger computation
t.head()
t.registerTempTable('t')

sqlContext.sql("""
  SELECT i, x,
ROW_NUMBER() OVER (
  PARTITION BY i ORDER BY x) AS row_num
  FROM t
""").show()
# +---++---+
# |  i|   x|row_num|
# +---++---+
# |  1|  0.2483285619632939|  1|
# |  1|  0.3378351523586306|  2|
# |  1|  0.4796752841655936|  3|
# |  1|  0.6684483939068196|  4|
# |  1|  0.8717439935587555|  5|
# |  2|  0.2199359901600595|  1|
# |  2| 0.24823688829578183|  2|
# |  2|  0.2971739640384895|  3|
# |  2|  0.4646004597998037|  4|
# |  2|  0.5914212915574378|  5|
# |  3|0.010912835935112164|  1|
# |  3|  0.1122635843020473|  2|
# |  3| 0.45913022936460457|  3|
# |  3|  0.6520139509583123|  4|
# |  3|  0.8571994559240592|  5|
# +---++---+
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11140) Replace file server in driver with RPC-based alternative

2015-10-29 Thread Jacek Lewandowski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14980526#comment-14980526
 ] 

Jacek Lewandowski commented on SPARK-11140:
---

[~vanzin] it would be better to use an SSL key pair than DIGEST-MD5 for encryption 
- DIGEST-MD5 is pretty outdated and is currently marked as historic in RFC 6331 
(https://tools.ietf.org/html/rfc6331).

Anyway, I like the idea of moving the file server to RPC, and I know it will open 
more ways to configure authentication and encryption.


> Replace file server in driver with RPC-based alternative
> 
>
> Key: SPARK-11140
> URL: https://issues.apache.org/jira/browse/SPARK-11140
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Marcelo Vanzin
>
> As part of making configuring encryption easy in Spark, it would be better to 
> use the existing RPC channel between driver and executors to transfer files 
> and jars added to the application.
> This would remove the need to start the HTTP server currently used for that 
> purpose, which needs to be configured to use SSL if encryption is wanted. SSL 
> is kinda hard to configure correctly in a multi-user, distributed environment.
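
For context, the SSL setup referred to above is driven by the {{spark.ssl.*}} properties; a hedged sketch of the kind of configuration the RPC-based approach would make unnecessary (paths and passwords are placeholders):
{code}
import org.apache.spark.SparkConf

// Illustrative only: the current HTTP file server needs a keystore/truststore
// configured for SSL, which is what makes multi-user setups awkward.
val conf = new SparkConf()
  .set("spark.ssl.enabled", "true")
  .set("spark.ssl.keyStore", "/path/to/keystore.jks")       // placeholder path
  .set("spark.ssl.keyStorePassword", "changeit")            // placeholder
  .set("spark.ssl.trustStore", "/path/to/truststore.jks")   // placeholder path
  .set("spark.ssl.trustStorePassword", "changeit")          // placeholder
{code}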



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11402) Allow to define a custom driver runner and executor runner

2015-10-29 Thread Nan Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14980584#comment-14980584
 ] 

Nan Zhu commented on SPARK-11402:
-

I'm curious what kind of functionality you need from a customized 
ExecutorRunner/DriverRunner.



> Allow to define a custom driver runner and executor runner
> --
>
> Key: SPARK-11402
> URL: https://issues.apache.org/jira/browse/SPARK-11402
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Spark Core
>Reporter: Jacek Lewandowski
>Priority: Minor
>
> {{DriverRunner}} and {{ExecutorRunner}} are used by Spark Worker in 
> standalone mode to spawn driver and executor processes respectively. When 
> integrating Spark with some environments, it would be useful to allow 
> providing a custom implementation of those components.
> The idea is simple - provide factory class names for driver and executor 
> runners in Worker configuration. By default, the current implementations are 
> used. 
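
To make the proposal concrete, a purely hypothetical sketch of what such a hook could look like; none of these names exist in Spark, they only illustrate the "factory class name in Worker configuration" idea:
{code}
// Hypothetical interface (not a real Spark API): the Worker would instantiate
// the class named in its configuration instead of hard-coding ExecutorRunner.
trait ExecutorRunnerFactory {
  def createRunner(appId: String, execId: Int, command: Seq[String]): Runnable
}

// The default implementation would simply do what the current ExecutorRunner does.
class DefaultExecutorRunnerFactory extends ExecutorRunnerFactory {
  override def createRunner(appId: String, execId: Int, command: Seq[String]): Runnable =
    new Runnable {
      override def run(): Unit = {
        // spawn the executor process as today
      }
    }
}
{code}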



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11269) Java API support & test cases

2015-10-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11269:


Assignee: Apache Spark

> Java API support & test cases
> -
>
> Key: SPARK-11269
> URL: https://issues.apache.org/jira/browse/SPARK-11269
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11269) Java API support & test cases

2015-10-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11269:


Assignee: (was: Apache Spark)

> Java API support & test cases
> -
>
> Key: SPARK-11269
> URL: https://issues.apache.org/jira/browse/SPARK-11269
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11404) groupBy on column expressions

2015-10-29 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-11404:


 Summary: groupBy on column expressions
 Key: SPARK-11404
 URL: https://issues.apache.org/jira/browse/SPARK-11404
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Michael Armbrust
Assignee: Michael Armbrust






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5569) Checkpoints cannot reference classes defined outside of Spark's assembly

2015-10-29 Thread Deming Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14980488#comment-14980488
 ] 

Deming Zhu commented on SPARK-5569:
---

For the OffsetRange ClassNotFoundException, I'm confident this patch solves the issue, 
because the sample code on the Spark home page could not run without it.

Here's the sample code from the Spark home page:
{code}
// Hold a reference to the current offset ranges, so it can be used downstream
var offsetRanges = Array[OffsetRange]()

directKafkaStream.transform { rdd =>
  offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd
}.map {
  ...
}.foreachRDD { rdd =>
  for (o <- offsetRanges) {
    println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
  }
  ...
}
{code}
Here offsetRanges is an array and needs to be deserialized by the executors.

For the KafkaRDDPartition ClassNotFoundException, since I cannot see the source code, 
I cannot be 100% sure that this patch solves it.

In my opinion, though, issues of this kind are all caused by Spark trying to load 
an array object.

Could anyone try this patch and tell us whether your issue is solved?

> Checkpoints cannot reference classes defined outside of Spark's assembly
> 
>
> Key: SPARK-5569
> URL: https://issues.apache.org/jira/browse/SPARK-5569
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Patrick Wendell
>
> Not sure if this is a bug or a feature, but it's not obvious, so wanted to 
> create a JIRA to make sure we document this behavior.
> First documented by Cody Koeninger:
> https://gist.github.com/koeninger/561a61482cd1b5b3600c
> {code}
> 15/01/12 16:07:07 INFO CheckpointReader: Attempting to load checkpoint from 
> file file:/var/tmp/cp/checkpoint-142110041.bk
> 15/01/12 16:07:07 WARN CheckpointReader: Error reading checkpoint from file 
> file:/var/tmp/cp/checkpoint-142110041.bk
> java.io.IOException: java.lang.ClassNotFoundException: 
> org.apache.spark.rdd.kafka.KafkaRDDPartition
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1043)
> at 
> org.apache.spark.streaming.dstream.DStreamCheckpointData.readObject(DStreamCheckpointData.scala:146)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
> at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
> at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1706)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344)
> at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at 
> java.io.ObjectInputStream.defaultReadObject(ObjectInputStream.java:500)
> at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$readObject$1.apply$mcV$sp(DStreamGraph.scala:180)
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1040)
> at 
> org.apache.spark.streaming.DStreamGraph.readObject(DStreamGraph.scala:176)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
> at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
> at 
> 

[jira] [Assigned] (SPARK-11404) groupBy on column expressions

2015-10-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11404:


Assignee: Michael Armbrust  (was: Apache Spark)

> groupBy on column expressions
> -
>
> Key: SPARK-11404
> URL: https://issues.apache.org/jira/browse/SPARK-11404
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8655) DataFrameReader#option supports more than String as value

2015-10-29 Thread Tony Cebzanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14980672#comment-14980672
 ] 

Tony Cebzanov commented on SPARK-8655:
--

I'm running into this limitation as well.

> DataFrameReader#option supports more than String as value
> -
>
> Key: SPARK-8655
> URL: https://issues.apache.org/jira/browse/SPARK-8655
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Michael Nitschinger
>
> I'm working on a custom data source, porting it from 1.3 to 1.4.
> On 1.3 I could easily extend the SparkSQL imports and get access to the 
> SQLContext, which meant I could use custom options right away. One of those 
> options is a Filter that I pass down to my Relation for tighter schema inference 
> against a schemaless database.
> So I would have something like:
> n1ql(filter: Filter = null, userSchema: StructType = null, bucketName: String 
> = null)
> Since I want to move my API behind the DataFrameReader, the SQLContext is not 
> available anymore, only through the RelationProvider, which I've implemented 
> and it works nicely.
> The only problem I have now is that while I can pass in custom options, they 
> are all String typed. So I have no way to pass down my optional Filter 
> anymore (since parameters is a Map[String, String]).
> Would it be possible to extend the options so that more than just Strings can 
> be passed in? Right now I probably need to work around that by documenting 
> how people can pass in a string which I turn into a Filter, but that's 
> somewhat hacky.
> Note that built-in implementations like JSON or JDBC have no such problem: because 
> they can access the (private) SQLContext directly, they don't need to go through 
> the RelationProvider indirection and can take any custom arguments they want on 
> their methods.
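
A minimal sketch of the string-only workaround described above, with a hypothetical data-source name and option key (how the string is parsed back into a Filter is up to the data source):
{code}
// All options are Strings in 1.4, so a filter has to be encoded as text and
// decoded again inside the custom RelationProvider.
val df = sqlContext.read
  .format("com.example.n1ql")                   // hypothetical data source
  .option("schemaFilter", "type = 'airport'")   // hypothetical option key
  .load()
{code}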



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11348) Replace addOnCompleteCallback with addTaskCompletionListener() in UnsafeExternalSorter

2015-10-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11348:


Assignee: (was: Apache Spark)

> Replace addOnCompleteCallback with addTaskCompletionListener() in 
> UnsafeExternalSorter
> --
>
> Key: SPARK-11348
> URL: https://issues.apache.org/jira/browse/SPARK-11348
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Ted Yu
>Priority: Minor
> Attachments: spark-11348.txt
>
>
> When practicing the command from SPARK-11318, I got the following:
> {code}
> [WARNING] 
> /home/hbase/spark/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java:[141,15]
>  [deprecation]  
> addOnCompleteCallback(Function0) in TaskContext has been deprecated
> {code}
> addOnCompleteCallback should be replaced with addTaskCompletionListener()



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11246) [1.5] Table cache for Parquet broken in 1.5

2015-10-29 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14980652#comment-14980652
 ] 

Xin Wu commented on SPARK-11246:


[~yhuai] Thank you!

> [1.5] Table cache for Parquet broken in 1.5
> ---
>
> Key: SPARK-11246
> URL: https://issues.apache.org/jira/browse/SPARK-11246
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: David Ross
>Assignee: Xin Wu
> Fix For: 1.5.3, 1.6.0
>
>
> Since upgrading to 1.5.1, {{CACHE TABLE}} works great for all 
> tables except for parquet tables, likely related to the parquet native reader.
> Here are steps for parquet table:
> {code}
> create table test_parquet stored as parquet as select 1;
> explain select * from test_parquet;
> {code}
> With output:
> {code}
> == Physical Plan ==
> Scan 
> ParquetRelation[hdfs://192.168.99.9/user/hive/warehouse/test_parquet][_c0#141]
> {code}
> And then caching:
> {code}
> cache table test_parquet;
> explain select * from test_parquet;
> {code}
> With output:
> {code}
> == Physical Plan ==
> Scan 
> ParquetRelation[hdfs://192.168.99.9/user/hive/warehouse/test_parquet][_c0#174]
> {code}
> Note it isn't cached. I have included the Spark log output for the {{cache 
> table}} and {{explain}} statements below.
> ---
> Here's the same for non-parquet table:
> {code}
> cache table test_no_parquet;
> explain select * from test_no_parquet;
> {code}
> With output:
> {code}
> == Physical Plan ==
> HiveTableScan [_c0#210], (MetastoreRelation default, test_no_parquet, None)
> {code}
> And then caching:
> {code}
> cache table test_no_parquet;
> explain select * from test_no_parquet;
> {code}
> With output:
> {code}
> == Physical Plan ==
> InMemoryColumnarTableScan [_c0#229], (InMemoryRelation [_c0#229], true, 
> 1, StorageLevel(true, true, false, true, 1), (HiveTableScan [_c0#211], 
> (MetastoreRelation default, test_no_parquet, None)), Some(test_no_parquet))
> {code}
> Note that the table seems to be cached.
> ---
> Note that if the flag {{spark.sql.hive.convertMetastoreParquet}} is set to 
> {{false}}, parquet tables work the same as non-parquet tables with caching. 
> This is a reasonable workaround for us, but ideally, we would like to benefit 
> from the native reading.
> ---
> Spark logs for {{cache table}} for {{test_parquet}}:
> {code}
> 15/10/21 21:22:05 INFO thriftserver.SparkExecuteStatementOperation: Running 
> query 'cache table test_parquet' with 20ee2ab9-5242-4783-81cf-46115ed72610
> 15/10/21 21:22:05 INFO metastore.HiveMetaStore: 49: get_table : db=default 
> tbl=test_parquet
> 15/10/21 21:22:05 INFO HiveMetaStore.audit: ugi=vagrant   
> ip=unknown-ip-addr  cmd=get_table : db=default tbl=test_parquet
> 15/10/21 21:22:05 INFO metastore.HiveMetaStore: 49: Opening raw store with 
> implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
> 15/10/21 21:22:05 INFO metastore.ObjectStore: ObjectStore, initialize called
> 15/10/21 21:22:05 INFO DataNucleus.Query: Reading in results for query 
> "org.datanucleus.store.rdbms.query.SQLQuery@0" since the connection used is 
> closing
> 15/10/21 21:22:05 INFO metastore.MetaStoreDirectSql: Using direct SQL, 
> underlying DB is MYSQL
> 15/10/21 21:22:05 INFO metastore.ObjectStore: Initialized ObjectStore
> 15/10/21 21:22:05 INFO storage.MemoryStore: ensureFreeSpace(215680) called 
> with curMem=4196713, maxMem=139009720
> 15/10/21 21:22:05 INFO storage.MemoryStore: Block broadcast_59 stored as 
> values in memory (estimated size 210.6 KB, free 128.4 MB)
> 15/10/21 21:22:05 INFO storage.MemoryStore: ensureFreeSpace(20265) called 
> with curMem=4412393, maxMem=139009720
> 15/10/21 21:22:05 INFO storage.MemoryStore: Block broadcast_59_piece0 stored 
> as bytes in memory (estimated size 19.8 KB, free 128.3 MB)
> 15/10/21 21:22:05 INFO storage.BlockManagerInfo: Added broadcast_59_piece0 in 
> memory on 192.168.99.9:50262 (size: 19.8 KB, free: 132.2 MB)
> 15/10/21 21:22:05 INFO spark.SparkContext: Created broadcast 59 from run at 
> AccessController.java:-2
> 15/10/21 21:22:05 INFO metastore.HiveMetaStore: 49: get_table : db=default 
> tbl=test_parquet
> 15/10/21 21:22:05 INFO HiveMetaStore.audit: ugi=vagrant   
> ip=unknown-ip-addr  cmd=get_table : db=default tbl=test_parquet
> 15/10/21 21:22:05 INFO storage.MemoryStore: ensureFreeSpace(215680) called 
> with curMem=4432658, maxMem=139009720
> 15/10/21 21:22:05 INFO storage.MemoryStore: Block broadcast_60 stored as 
> values in memory (estimated size 210.6 KB, free 128.1 MB)
> 15/10/21 21:22:05 INFO storage.BlockManagerInfo: Removed broadcast_58_piece0 
> on 192.168.99.9:50262 in memory (size: 19.8 KB, free: 132.2 MB)
> 15/10/21 21:22:05 INFO storage.BlockManagerInfo: Removed broadcast_57_piece0 
> on 192.168.99.9:50262 in memory (size: 21.1 
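The flag-based workaround mentioned in the description above is a one-liner; a minimal sketch, assuming a HiveContext-backed {{sqlContext}} like the one in the logs:

{code}
// Hedged sketch of the workaround: disable the native Parquet conversion so
// CACHE TABLE behaves as it does for non-Parquet Hive tables, at the cost of
// the native reader.
sqlContext.setConf("spark.sql.hive.convertMetastoreParquet", "false")
sqlContext.sql("CACHE TABLE test_parquet")
{code}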

[jira] [Commented] (SPARK-11404) groupBy on column expressions

2015-10-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14980628#comment-14980628
 ] 

Apache Spark commented on SPARK-11404:
--

User 'marmbrus' has created a pull request for this issue:
https://github.com/apache/spark/pull/9359

> groupBy on column expressions
> -
>
> Key: SPARK-11404
> URL: https://issues.apache.org/jira/browse/SPARK-11404
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11155) Stage summary json should include stage duration

2015-10-29 Thread Xin Ren (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14980802#comment-14980802
 ] 

Xin Ren commented on SPARK-11155:
-

Hi [~imranr], thank you so much for your detailed explanation, really helps me 
out. I'm digging into it now.

I'll try to finish it asap. Thank you again! :)

> Stage summary json should include stage duration 
> -
>
> Key: SPARK-11155
> URL: https://issues.apache.org/jira/browse/SPARK-11155
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Reporter: Imran Rashid
>Assignee: Xin Ren
>Priority: Minor
>  Labels: Starter
>
> The json endpoint for stages doesn't include information on the stage 
> duration that is present in the UI.  This looks like a simple oversight; they 
> should be included, e.g. at 
> {{api/v1/applications//stages}}. The missing metrics are 
> {{submissionTime}} and {{completionTime}} (and whatever other metrics come 
> out of the discussion on SPARK-10930)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11188) Elide stacktraces in bin/spark-sql for AnalysisExceptions

2015-10-29 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-11188:
-
Assignee: Dilip Biswal  (was: Michael Armbrust)

> Elide stacktraces in bin/spark-sql for AnalysisExceptions
> -
>
> Key: SPARK-11188
> URL: https://issues.apache.org/jira/browse/SPARK-11188
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Dilip Biswal
> Fix For: 1.6.0
>
>
> For analysis exceptions in the sql-shell, we should only print the error 
> message to the screen.  The stacktrace will never have useful information 
> since this error is used to signify an error with the query.
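A minimal sketch of the behaviour being asked for, outside the actual spark-sql CLI code (the SQLContext and query are assumed to come from the shell):

{code}
import org.apache.spark.sql.{AnalysisException, SQLContext}

// Hedged sketch: print only the message for AnalysisException, since its stack
// trace points at the query rather than at a bug in Spark; anything else keeps
// its full trace.
def runQuery(sqlContext: SQLContext, query: String): Unit =
  try {
    sqlContext.sql(query).show()
  } catch {
    case e: AnalysisException => System.err.println(s"Error in query: ${e.getMessage}")
  }
{code}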



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6043) Error when trying to rename table with alter table after using INSERT OVERWRITE to populate the table

2015-10-29 Thread kevin yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14980899#comment-14980899
 ] 

kevin yu commented on SPARK-6043:
-

Hello Trystan: I tried your test case and it works on Spark 1.5; it seems the 
problem has been fixed. Can you verify and close this JIRA? Thanks.

> Error when trying to rename table with alter table after using INSERT 
> OVERWRITE to populate the table
> 
>
> Key: SPARK-6043
> URL: https://issues.apache.org/jira/browse/SPARK-6043
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.1
>Reporter: Trystan Leftwich
>Priority: Minor
>
> If you populate a table using INSERT OVERWRITE and then try to rename the 
> table using alter table it fails with:
> {noformat}
> Error: org.apache.spark.sql.execution.QueryExecutionException: FAILED: 
> Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. 
> Unable to alter table. (state=,code=0)
> {noformat}
> Using the following SQL statement creates the error:
> {code:sql}
> CREATE TABLE `tmp_table` (salesamount_c1 DOUBLE);
> INSERT OVERWRITE table tmp_table SELECT
>MIN(sales_customer.salesamount) salesamount_c1
> FROM
> (
>   SELECT
>  SUM(sales.salesamount) salesamount
>   FROM
>  internalsales sales
> ) sales_customer;
> ALTER TABLE tmp_table RENAME to not_tmp;
> {code}
> But if you change the 'OVERWRITE' to be 'INTO' the SQL statement works.
> This is happening on our CDH5.3 cluster with multiple workers, If we use the 
> CDH5.3 Quickstart VM the SQL does not produce an error. Both cases were spark 
> 1.2.1 built for hadoop2.4+



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10158) ALS should print better errors when given Long IDs

2015-10-29 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14980920#comment-14980920
 ] 

Bryan Cutler commented on SPARK-10158:
--

I made a quick fix for this.  When using Long values for Ratings user or 
product ids, it will raise an exception with a more informative message.  I'll 
post the PR soon.

{quote}
net.razorvine.pickle.PickleException: Ratings id 1205640308657491975 exceeds 
max value of 2147483647
at 
org.apache.spark.mllib.api.python.SerDe$RatingPickler.ratingsIdCheckLong(PythonMLLibAPI.scala:1454)
at 
org.apache.spark.mllib.api.python.SerDe$RatingPickler.construct(PythonMLLibAPI.scala:1445)
at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:707)
...
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassCastException: java.lang.Long cannot be cast to 
java.lang.Integer
at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:106)
at 
org.apache.spark.mllib.api.python.SerDe$RatingPickler.ratingsIdCheckLong(PythonMLLibAPI.scala:1451)
{quote}
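A minimal sketch of the kind of guard described above; the names are illustrative and not the actual PythonMLLibAPI code:

{code}
// Hedged sketch: reject ids that overflow Int before building a Rating, so the
// user gets a clear message instead of a ClassCastException from unboxing.
def checkedRatingId(id: Long): Int = {
  require(id >= Int.MinValue && id <= Int.MaxValue,
    s"Ratings id $id exceeds max value of ${Int.MaxValue}")
  id.toInt
}
{code}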

> ALS should print better errors when given Long IDs
> --
>
> Key: SPARK-10158
> URL: https://issues.apache.org/jira/browse/SPARK-10158
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib, PySpark
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> See [SPARK-10115] for the very confusing messages you get when you try to use 
> ALS with Long IDs.  We should catch and identify these errors and print 
> meaningful error messages.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11188) Elide stacktraces in bin/spark-sql for AnalysisExceptions

2015-10-29 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust reassigned SPARK-11188:


Assignee: Michael Armbrust

> Elide stacktraces in bin/spark-sql for AnalysisExceptions
> -
>
> Key: SPARK-11188
> URL: https://issues.apache.org/jira/browse/SPARK-11188
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
> Fix For: 1.6.0
>
>
> For analysis exceptions in the sql-shell, we should only print the error 
> message to the screen.  The stacktrace will never have useful information 
> since this error is used to signify an error with the query.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11188) Elide stacktraces in bin/spark-sql for AnalysisExceptions

2015-10-29 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-11188.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9194
[https://github.com/apache/spark/pull/9194]

> Elide stacktraces in bin/spark-sql for AnalysisExceptions
> -
>
> Key: SPARK-11188
> URL: https://issues.apache.org/jira/browse/SPARK-11188
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Michael Armbrust
> Fix For: 1.6.0
>
>
> For analysis exceptions in the sql-shell, we should only print the error 
> message to the screen.  The stacktrace will never have useful information 
> since this error is used to signify an error with the query.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11405) ROW_NUMBER function does not adhere to window ORDER BY, when joining

2015-10-29 Thread Jarno Seppanen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jarno Seppanen updated SPARK-11405:
---
   Priority: Critical  (was: Major)
Description: 
The following query produces incorrect results:
{code:sql}
sqlContext.sql("""
  SELECT a.i, a.x,
ROW_NUMBER() OVER (
  PARTITION BY a.i ORDER BY a.x) AS row_num
  FROM a
  JOIN b ON b.i = a.i
""").show()
+---++---+
|  i|   x|row_num|
+---++---+
|  1|  0.8717439935587555|  1|
|  1|  0.6684483939068196|  2|
|  1|  0.3378351523586306|  3|
|  1|  0.2483285619632939|  4|
|  1|  0.4796752841655936|  5|
|  2|  0.2971739640384895|  1|
|  2|  0.2199359901600595|  2|
|  2|  0.4646004597998037|  3|
|  2| 0.24823688829578183|  4|
|  2|  0.5914212915574378|  5|
|  3|0.010912835935112164|  1|
|  3|  0.6520139509583123|  2|
|  3|  0.8571994559240592|  3|
|  3|  0.1122635843020473|  4|
|  3| 0.45913022936460457|  5|
+---++---+
{code}

The row number doesn't follow the correct order. The join seems to break the 
ordering; ROW_NUMBER() works correctly if the join results are saved to a 
temporary table and a second query is run against it.

Here's a small PySpark test case to reproduce the error:

{code}

from pyspark.sql import Row
import random

a = sc.parallelize([Row(i=i, x=random.random())
for i in range(5)
for j in range(5)])

b = sc.parallelize([Row(i=i) for i in [1, 2, 3]])

af = sqlContext.createDataFrame(a)
bf = sqlContext.createDataFrame(b)
af.registerTempTable('a')
bf.registerTempTable('b')

af.show()
# +---++
# |  i|   x|
# +---++
# |  0| 0.12978974167478896|
# |  0|  0.7105927498584452|
# |  0| 0.21225679077448045|
# |  0| 0.03849717391728036|
# |  0|  0.4976622146442401|
# |  1|  0.4796752841655936|
# |  1|  0.8717439935587555|
# |  1|  0.6684483939068196|
# |  1|  0.3378351523586306|
# |  1|  0.2483285619632939|
# |  2|  0.2971739640384895|
# |  2|  0.2199359901600595|
# |  2|  0.5914212915574378|
# |  2| 0.24823688829578183|
# |  2|  0.4646004597998037|
# |  3|  0.1122635843020473|
# |  3|  0.6520139509583123|
# |  3| 0.45913022936460457|
# |  3|0.010912835935112164|
# |  3|  0.8571994559240592|
# +---++
# only showing top 20 rows

bf.show()
# +---+
# |  i|
# +---+
# |  1|
# |  2|
# |  3|
# +---+

### WRONG

sqlContext.sql("""
  SELECT a.i, a.x,
ROW_NUMBER() OVER (
  PARTITION BY a.i ORDER BY a.x) AS row_num
  FROM a
  JOIN b ON b.i = a.i
""").show()
# +---++---+
# |  i|   x|row_num|
# +---++---+
# |  1|  0.8717439935587555|  1|
# |  1|  0.6684483939068196|  2|
# |  1|  0.3378351523586306|  3|
# |  1|  0.2483285619632939|  4|
# |  1|  0.4796752841655936|  5|
# |  2|  0.2971739640384895|  1|
# |  2|  0.2199359901600595|  2|
# |  2|  0.4646004597998037|  3|
# |  2| 0.24823688829578183|  4|
# |  2|  0.5914212915574378|  5|
# |  3|0.010912835935112164|  1|
# |  3|  0.6520139509583123|  2|
# |  3|  0.8571994559240592|  3|
# |  3|  0.1122635843020473|  4|
# |  3| 0.45913022936460457|  5|
# +---++---+

### WORKAROUND BY USING TEMP TABLE

t = sqlContext.sql("""
  SELECT a.i, a.x
  FROM a
  JOIN b ON b.i = a.i
""").cache()
# trigger computation
t.head()
t.registerTempTable('t')

sqlContext.sql("""
  SELECT i, x,
ROW_NUMBER() OVER (
  PARTITION BY i ORDER BY x) AS row_num
  FROM t
""").show()
# +---++---+
# |  i|   x|row_num|
# +---++---+
# |  1|  0.2483285619632939|  1|
# |  1|  0.3378351523586306|  2|
# |  1|  0.4796752841655936|  3|
# |  1|  0.6684483939068196|  4|
# |  1|  0.8717439935587555|  5|
# |  2|  0.2199359901600595|  1|
# |  2| 0.24823688829578183|  2|
# |  2|  0.2971739640384895|  3|
# |  2|  0.4646004597998037|  4|
# |  2|  0.5914212915574378|  5|
# |  3|0.010912835935112164|  1|
# |  3|  0.1122635843020473|  2|
# |  3| 0.45913022936460457|  3|
# |  3|  0.6520139509583123|  4|
# |  3|  0.8571994559240592|  5|
# +---++---+
{code}

  was:
The following query produces incorrect results:
{code:sql}
  SELECT a.i, a.x,
ROW_NUMBER() OVER (
  PARTITION BY a.i ORDER BY a.x) AS row_num
  FROM a
  JOIN b ON b.i = a.i;
+---++---+
|  i|   x|row_num|
+---++---+
|  1|  0.8717439935587555|  1|
|  1|  0.6684483939068196|  2|
|  1|  0.3378351523586306|  3|
|  1|  0.2483285619632939|  4|
|  1|  0.4796752841655936|  5|
|  2|  0.2971739640384895|  1|
|  2|  0.2199359901600595|  2|
|  2|  0.4646004597998037|   

[jira] [Comment Edited] (SPARK-11155) Stage summary json should include stage duration

2015-10-29 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14980789#comment-14980789
 ] 

Imran Rashid edited comment on SPARK-11155 at 10/29/15 5:09 PM:


Hi [~iamshrek],

First start by reading
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-PreparingtoContributeCodeChanges
 (especially the section on "Pull Requests") and 
https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools.

Of course everyone has their own favorite development environment / techniques, 
but I'll tell you what I do.  I've found debugging from within IDEs to not 
work very well for Spark (or Scala codebases in general), so I just don't 
bother anymore.  (But I know others do debug from within IntelliJ, so it's not 
impossible.)  Instead, I just navigate through code in IntelliJ, and I use sbt, 
unit tests, & printlns to test out code.

1) run {{build/sbt -Pyarn -Phadoop-2.4 -Dhadoop.version=2.5.0 -Pscala-2.10}}.  
This will start an sbt shell
2) inside the sbt shell, run {{project core}} (since you only care about the 
"spark-core" project)
3) run {{\~test:compile}}.  This will compile all the main & test code for the 
"spark-core" project and its dependencies.  Also it'll watch all the src files 
-- anytime you change them it'll *incrementally* recompile.  (the "\~" prefix 
makes sbt run the command in the incremental watch mode.)
4) leave that sbt shell open, and then go to IntelliJ and do some coding.  E.g., 
try adding some garbage syntax, and then you'll see the sbt shell recompile and 
tell you about the compile error (and it should recompile quickly this time).
5) you can run some tests with {{\~test-only <test suite>}}.  In your case, 
you probably want {{\~test-only *.HistoryServerSuite}}.  As before, it'll run 
in watch mode, but this time it'll rerun the tests as you change your code.  
The tests you are most interested in are these: 
https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/deploy/history/HistoryServerSuite.scala#L102
 which reference expected results here: 
https://github.com/apache/spark/tree/master/core/src/test/resources/HistoryServerExpectations
While debugging, you can add printlns, which will show up in the sbt shell, or 
you can also add logging statements, which will appear in 
core/target/unit-tests.log (along with all the existing logging statements).
6) before submitting your PR, run {{scalastyle}} and {{test:scalastyle}} from 
within sbt (or just run {{dev/scalastyle}} from bash).  That'll help you track 
down style violations locally.  (Jenkins would do this for you, but it's a lot 
faster if you fix them locally -- that said, I often forget to do this myself.)
7) re-read the wiki guidelines, then submit your PR. Jenkins will then run the 
full set of tests for you, and reviewers will comment.

For the HistoryServer in particular, you can also just run it locally, navigate 
to some endpoints in your browser and see what happens.

hope this helps!



[jira] [Created] (SPARK-11407) Add documentation on using SparkR from RStudio

2015-10-29 Thread Felix Cheung (JIRA)
Felix Cheung created SPARK-11407:


 Summary: Add documentation on using SparkR from RStudio
 Key: SPARK-11407
 URL: https://issues.apache.org/jira/browse/SPARK-11407
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 1.5.1
Reporter: Felix Cheung
Priority: Minor


As per [~shivaram] we need to add a section in the programming guide on using 
SparkR from RStudio, in which we should talk about:

- how to load the SparkR package
- what configuration options are available for initializing SparkR



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-11406) utf-8 decode issue w/ kinesis

2015-10-29 Thread Brian ONeill (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brian ONeill updated SPARK-11406:
-
Comment: was deleted

(was: https://github.com/apache/spark/pull/9360)

> utf-8 decode issue w/ kinesis
> -
>
> Key: SPARK-11406
> URL: https://issues.apache.org/jira/browse/SPARK-11406
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.1
> Environment: osx, spark:master
>Reporter: Brian ONeill
>Priority: Minor
>
> If we get a bad message over Kinesis, the spark job blows up with:
> Caused by: org.apache.spark.api.python.PythonException: Traceback (most 
> recent call last):
>   File 
> "/Users/brianoneill/git/spark/python/lib/pyspark.zip/pyspark/worker.py", line 
> 111, in main
> process()
>   File 
> "/Users/brianoneill/git/spark/python/lib/pyspark.zip/pyspark/worker.py", line 
> 106, in process
> serializer.dump_stream(func(split_index, iterator), outfile)
>   File "/Users/brianoneill/git/spark/python/lib/pyspark.zip/pyspark/rdd.py", 
> line 2347, in pipeline_func
>   File "/Users/brianoneill/git/spark/python/lib/pyspark.zip/pyspark/rdd.py", 
> line 2347, in pipeline_func
>   File "/Users/brianoneill/git/spark/python/lib/pyspark.zip/pyspark/rdd.py", 
> line 317, in func
>   File "/Users/brianoneill/git/spark/python/lib/pyspark.zip/pyspark/rdd.py", 
> line 1777, in combineLocally
>   File 
> "/Users/brianoneill/git/spark/python/lib/pyspark.zip/pyspark/shuffle.py", 
> line 236, in mergeValues
> for k, v in iterator:
>   File 
> "/Users/brianoneill/git/spark/python/lib/pyspark.zip/pyspark/streaming/kinesis.py",
>  line 88, in 
>   File 
> "/Users/brianoneill/git/spark/python/lib/pyspark.zip/pyspark/streaming/kinesis.py",
>  line 31, in utf8_decoder
> return s.decode('utf-8')
>   File "/Users/brianoneill/venv/monetate/lib/python2.7/encodings/utf_8.py", 
> line 16, in decode
> return codecs.utf_8_decode(input, errors, True)
> UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-2: invalid 
> continuation byte
> This kills the job.  Should there be a more elegant way to handle this case?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11406) utf-8 decode issue w/ kinesis

2015-10-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11406:


Assignee: (was: Apache Spark)

> utf-8 decode issue w/ kinesis
> -
>
> Key: SPARK-11406
> URL: https://issues.apache.org/jira/browse/SPARK-11406
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.1
> Environment: osx, spark:master
>Reporter: Brian ONeill
>Priority: Minor
>
> If we get a bad message over Kinesis, the spark job blows up with:
> Caused by: org.apache.spark.api.python.PythonException: Traceback (most 
> recent call last):
>   File 
> "/Users/brianoneill/git/spark/python/lib/pyspark.zip/pyspark/worker.py", line 
> 111, in main
> process()
>   File 
> "/Users/brianoneill/git/spark/python/lib/pyspark.zip/pyspark/worker.py", line 
> 106, in process
> serializer.dump_stream(func(split_index, iterator), outfile)
>   File "/Users/brianoneill/git/spark/python/lib/pyspark.zip/pyspark/rdd.py", 
> line 2347, in pipeline_func
>   File "/Users/brianoneill/git/spark/python/lib/pyspark.zip/pyspark/rdd.py", 
> line 2347, in pipeline_func
>   File "/Users/brianoneill/git/spark/python/lib/pyspark.zip/pyspark/rdd.py", 
> line 317, in func
>   File "/Users/brianoneill/git/spark/python/lib/pyspark.zip/pyspark/rdd.py", 
> line 1777, in combineLocally
>   File 
> "/Users/brianoneill/git/spark/python/lib/pyspark.zip/pyspark/shuffle.py", 
> line 236, in mergeValues
> for k, v in iterator:
>   File 
> "/Users/brianoneill/git/spark/python/lib/pyspark.zip/pyspark/streaming/kinesis.py",
>  line 88, in 
>   File 
> "/Users/brianoneill/git/spark/python/lib/pyspark.zip/pyspark/streaming/kinesis.py",
>  line 31, in utf8_decoder
> return s.decode('utf-8')
>   File "/Users/brianoneill/venv/monetate/lib/python2.7/encodings/utf_8.py", 
> line 16, in decode
> return codecs.utf_8_decode(input, errors, True)
> UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-2: invalid 
> continuation byte
> This kills the job.  Should there be a more elegant way to handle this case?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11406) utf-8 decode issue w/ kinesis

2015-10-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14981038#comment-14981038
 ] 

Apache Spark commented on SPARK-11406:
--

User 'boneill42' has created a pull request for this issue:
https://github.com/apache/spark/pull/9360

> utf-8 decode issue w/ kinesis
> -
>
> Key: SPARK-11406
> URL: https://issues.apache.org/jira/browse/SPARK-11406
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.1
> Environment: osx, spark:master
>Reporter: Brian ONeill
>Priority: Minor
>
> If we get a bad message over Kinesis, the spark job blows up with:
> Caused by: org.apache.spark.api.python.PythonException: Traceback (most 
> recent call last):
>   File 
> "/Users/brianoneill/git/spark/python/lib/pyspark.zip/pyspark/worker.py", line 
> 111, in main
> process()
>   File 
> "/Users/brianoneill/git/spark/python/lib/pyspark.zip/pyspark/worker.py", line 
> 106, in process
> serializer.dump_stream(func(split_index, iterator), outfile)
>   File "/Users/brianoneill/git/spark/python/lib/pyspark.zip/pyspark/rdd.py", 
> line 2347, in pipeline_func
>   File "/Users/brianoneill/git/spark/python/lib/pyspark.zip/pyspark/rdd.py", 
> line 2347, in pipeline_func
>   File "/Users/brianoneill/git/spark/python/lib/pyspark.zip/pyspark/rdd.py", 
> line 317, in func
>   File "/Users/brianoneill/git/spark/python/lib/pyspark.zip/pyspark/rdd.py", 
> line 1777, in combineLocally
>   File 
> "/Users/brianoneill/git/spark/python/lib/pyspark.zip/pyspark/shuffle.py", 
> line 236, in mergeValues
> for k, v in iterator:
>   File 
> "/Users/brianoneill/git/spark/python/lib/pyspark.zip/pyspark/streaming/kinesis.py",
>  line 88, in 
>   File 
> "/Users/brianoneill/git/spark/python/lib/pyspark.zip/pyspark/streaming/kinesis.py",
>  line 31, in utf8_decoder
> return s.decode('utf-8')
>   File "/Users/brianoneill/venv/monetate/lib/python2.7/encodings/utf_8.py", 
> line 16, in decode
> return codecs.utf_8_decode(input, errors, True)
> UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-2: invalid 
> continuation byte
> This kills the job.  Should there be a more elegant way to handle this case?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10641) skewness and kurtosis support

2015-10-29 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-10641.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9003
[https://github.com/apache/spark/pull/9003]

> skewness and kurtosis support
> -
>
> Key: SPARK-10641
> URL: https://issues.apache.org/jira/browse/SPARK-10641
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SQL
>Reporter: Jihong MA
>Assignee: Seth Hendrickson
> Fix For: 1.6.0
>
> Attachments: simpler-moments.pdf
>
>
> Implementing skewness and kurtosis support based on following algorithm:
> https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Higher-order_statistics



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11406) utf-8 decode issue w/ kinesis

2015-10-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11406:


Assignee: Apache Spark

> utf-8 decode issue w/ kinesis
> -
>
> Key: SPARK-11406
> URL: https://issues.apache.org/jira/browse/SPARK-11406
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.1
> Environment: osx, spark:master
>Reporter: Brian ONeill
>Assignee: Apache Spark
>Priority: Minor
>
> If we get a bad message over Kinesis, the spark job blows up with:
> Caused by: org.apache.spark.api.python.PythonException: Traceback (most 
> recent call last):
>   File 
> "/Users/brianoneill/git/spark/python/lib/pyspark.zip/pyspark/worker.py", line 
> 111, in main
> process()
>   File 
> "/Users/brianoneill/git/spark/python/lib/pyspark.zip/pyspark/worker.py", line 
> 106, in process
> serializer.dump_stream(func(split_index, iterator), outfile)
>   File "/Users/brianoneill/git/spark/python/lib/pyspark.zip/pyspark/rdd.py", 
> line 2347, in pipeline_func
>   File "/Users/brianoneill/git/spark/python/lib/pyspark.zip/pyspark/rdd.py", 
> line 2347, in pipeline_func
>   File "/Users/brianoneill/git/spark/python/lib/pyspark.zip/pyspark/rdd.py", 
> line 317, in func
>   File "/Users/brianoneill/git/spark/python/lib/pyspark.zip/pyspark/rdd.py", 
> line 1777, in combineLocally
>   File 
> "/Users/brianoneill/git/spark/python/lib/pyspark.zip/pyspark/shuffle.py", 
> line 236, in mergeValues
> for k, v in iterator:
>   File 
> "/Users/brianoneill/git/spark/python/lib/pyspark.zip/pyspark/streaming/kinesis.py",
>  line 88, in 
>   File 
> "/Users/brianoneill/git/spark/python/lib/pyspark.zip/pyspark/streaming/kinesis.py",
>  line 31, in utf8_decoder
> return s.decode('utf-8')
>   File "/Users/brianoneill/venv/monetate/lib/python2.7/encodings/utf_8.py", 
> line 16, in decode
> return codecs.utf_8_decode(input, errors, True)
> UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-2: invalid 
> continuation byte
> This kills the job.  Should there be a more elegant way to handle this case?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11406) utf-8 decode issue w/ kinesis

2015-10-29 Thread Brian ONeill (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14981039#comment-14981039
 ] 

Brian ONeill commented on SPARK-11406:
--

https://github.com/apache/spark/pull/9360

> utf-8 decode issue w/ kinesis
> -
>
> Key: SPARK-11406
> URL: https://issues.apache.org/jira/browse/SPARK-11406
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.1
> Environment: osx, spark:master
>Reporter: Brian ONeill
>Priority: Minor
>
> If we get a bad message over Kinesis, the spark job blows up with:
> Caused by: org.apache.spark.api.python.PythonException: Traceback (most 
> recent call last):
>   File 
> "/Users/brianoneill/git/spark/python/lib/pyspark.zip/pyspark/worker.py", line 
> 111, in main
> process()
>   File 
> "/Users/brianoneill/git/spark/python/lib/pyspark.zip/pyspark/worker.py", line 
> 106, in process
> serializer.dump_stream(func(split_index, iterator), outfile)
>   File "/Users/brianoneill/git/spark/python/lib/pyspark.zip/pyspark/rdd.py", 
> line 2347, in pipeline_func
>   File "/Users/brianoneill/git/spark/python/lib/pyspark.zip/pyspark/rdd.py", 
> line 2347, in pipeline_func
>   File "/Users/brianoneill/git/spark/python/lib/pyspark.zip/pyspark/rdd.py", 
> line 317, in func
>   File "/Users/brianoneill/git/spark/python/lib/pyspark.zip/pyspark/rdd.py", 
> line 1777, in combineLocally
>   File 
> "/Users/brianoneill/git/spark/python/lib/pyspark.zip/pyspark/shuffle.py", 
> line 236, in mergeValues
> for k, v in iterator:
>   File 
> "/Users/brianoneill/git/spark/python/lib/pyspark.zip/pyspark/streaming/kinesis.py",
>  line 88, in 
>   File 
> "/Users/brianoneill/git/spark/python/lib/pyspark.zip/pyspark/streaming/kinesis.py",
>  line 31, in utf8_decoder
> return s.decode('utf-8')
>   File "/Users/brianoneill/venv/monetate/lib/python2.7/encodings/utf_8.py", 
> line 16, in decode
> return codecs.utf_8_decode(input, errors, True)
> UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-2: invalid 
> continuation byte
> This kills the job.  Should there be a more elegant way to handle this case?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11408) Display scientific notation correctly in DataFrame.show()

2015-10-29 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-11408:
-

 Summary: Display scientific notation correctly in DataFrame.show()
 Key: SPARK-11408
 URL: https://issues.apache.org/jira/browse/SPARK-11408
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.1, 1.6.0
Reporter: Xiangrui Meng


{code}
scala> sqlContext.range(2).select(col("id") / 3e5).show()
++
| (id / 30.0)|
++
| 0.0|
|3.333...|
++
{code}

The exponential term is omitted, which is error-prone. We should format the 
numbers properly based on column width.
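Until the formatting is improved, one possible workaround (a sketch, assuming the same shell session as above and that the {{show(numRows, truncate)}} overload is available) is to disable cell truncation so the exponent is not dropped:

{code}
import org.apache.spark.sql.functions.col

// Hedged workaround sketch: turn off the 20-character cell truncation so the
// full scientific-notation value is rendered.
sqlContext.range(2).select(col("id") / 3e5).show(2, false)
{code}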



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11406) utf-8 decode issue w/ kinesis

2015-10-29 Thread Brian ONeill (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brian ONeill updated SPARK-11406:
-
Priority: Minor  (was: Major)

> utf-8 decode issue w/ kinesis
> -
>
> Key: SPARK-11406
> URL: https://issues.apache.org/jira/browse/SPARK-11406
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.1
> Environment: osx, spark:master
>Reporter: Brian ONeill
>Priority: Minor
>
> If we get a bad message over Kinesis, the spark job blows up with:
> Caused by: org.apache.spark.api.python.PythonException: Traceback (most 
> recent call last):
>   File 
> "/Users/brianoneill/git/spark/python/lib/pyspark.zip/pyspark/worker.py", line 
> 111, in main
> process()
>   File 
> "/Users/brianoneill/git/spark/python/lib/pyspark.zip/pyspark/worker.py", line 
> 106, in process
> serializer.dump_stream(func(split_index, iterator), outfile)
>   File "/Users/brianoneill/git/spark/python/lib/pyspark.zip/pyspark/rdd.py", 
> line 2347, in pipeline_func
>   File "/Users/brianoneill/git/spark/python/lib/pyspark.zip/pyspark/rdd.py", 
> line 2347, in pipeline_func
>   File "/Users/brianoneill/git/spark/python/lib/pyspark.zip/pyspark/rdd.py", 
> line 317, in func
>   File "/Users/brianoneill/git/spark/python/lib/pyspark.zip/pyspark/rdd.py", 
> line 1777, in combineLocally
>   File 
> "/Users/brianoneill/git/spark/python/lib/pyspark.zip/pyspark/shuffle.py", 
> line 236, in mergeValues
> for k, v in iterator:
>   File 
> "/Users/brianoneill/git/spark/python/lib/pyspark.zip/pyspark/streaming/kinesis.py",
>  line 88, in 
>   File 
> "/Users/brianoneill/git/spark/python/lib/pyspark.zip/pyspark/streaming/kinesis.py",
>  line 31, in utf8_decoder
> return s.decode('utf-8')
>   File "/Users/brianoneill/venv/monetate/lib/python2.7/encodings/utf_8.py", 
> line 16, in decode
> return codecs.utf_8_decode(input, errors, True)
> UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-2: invalid 
> continuation byte
> This kills the job.  Should there be a more elegant way to handle this case?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11406) utf-8 decode issue w/ kinesis

2015-10-29 Thread Brian ONeill (JIRA)
Brian ONeill created SPARK-11406:


 Summary: utf-8 decode issue w/ kinesis
 Key: SPARK-11406
 URL: https://issues.apache.org/jira/browse/SPARK-11406
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.5.1
 Environment: osx, spark:master
Reporter: Brian ONeill


If we get a bad message over Kinesis, the spark job blows up with:
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent 
call last):
  File "/Users/brianoneill/git/spark/python/lib/pyspark.zip/pyspark/worker.py", 
line 111, in main
process()
  File "/Users/brianoneill/git/spark/python/lib/pyspark.zip/pyspark/worker.py", 
line 106, in process
serializer.dump_stream(func(split_index, iterator), outfile)
  File "/Users/brianoneill/git/spark/python/lib/pyspark.zip/pyspark/rdd.py", 
line 2347, in pipeline_func
  File "/Users/brianoneill/git/spark/python/lib/pyspark.zip/pyspark/rdd.py", 
line 2347, in pipeline_func
  File "/Users/brianoneill/git/spark/python/lib/pyspark.zip/pyspark/rdd.py", 
line 317, in func
  File "/Users/brianoneill/git/spark/python/lib/pyspark.zip/pyspark/rdd.py", 
line 1777, in combineLocally
  File 
"/Users/brianoneill/git/spark/python/lib/pyspark.zip/pyspark/shuffle.py", line 
236, in mergeValues
for k, v in iterator:
  File 
"/Users/brianoneill/git/spark/python/lib/pyspark.zip/pyspark/streaming/kinesis.py",
 line 88, in 
  File 
"/Users/brianoneill/git/spark/python/lib/pyspark.zip/pyspark/streaming/kinesis.py",
 line 31, in utf8_decoder
return s.decode('utf-8')
  File "/Users/brianoneill/venv/monetate/lib/python2.7/encodings/utf_8.py", 
line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-2: invalid 
continuation byte

This kills the job.  Should there be a more elegant way to handle this case?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11402) Allow to define a custom driver runner and executor runner

2015-10-29 Thread Jacek Lewandowski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14981170#comment-14981170
 ] 

Jacek Lewandowski commented on SPARK-11402:
---

[~CodingCat] for example: the process environment, command line arguments, using a task 
controller, running inside some special container, injecting files into the working 
directory, and so on.


> Allow to define a custom driver runner and executor runner
> --
>
> Key: SPARK-11402
> URL: https://issues.apache.org/jira/browse/SPARK-11402
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Spark Core
>Reporter: Jacek Lewandowski
>Priority: Minor
>
> {{DriverRunner}} and {{ExecutorRunner}} are used by Spark Worker in 
> standalone mode to spawn driver and executor processes respectively. When 
> integrating Spark with some environments, it would be useful to allow 
> providing a custom implementation of those components.
> The idea is simple - provide factory class names for driver and executor 
> runners in Worker configuration. By default, the current implementations are 
> used. 
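A minimal sketch of the proposed extension point; the configuration key, the reflective loading, and the factory type are assumptions for illustration, not an existing Spark API:

{code}
import org.apache.spark.SparkConf

// Hedged sketch: the Worker would resolve a factory class name from its conf and
// instantiate it reflectively, falling back to the current behaviour when the key
// is unset. The factory type T and the configuration key are hypothetical.
def loadRunnerFactory[T](conf: SparkConf, key: String, default: => T): T =
  conf.getOption(key) match {
    case Some(className) =>
      Class.forName(className).newInstance().asInstanceOf[T]
    case None => default
  }
{code}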



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11385) Add foreach API to MLLib's vector API

2015-10-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11385:


Assignee: (was: Apache Spark)

> Add foreach API to MLLib's vector API
> -
>
> Key: SPARK-11385
> URL: https://issues.apache.org/jira/browse/SPARK-11385
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: holdenk
>Priority: Minor
>
> Add a foreach API to MLLib's vector.
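A minimal sketch of what such an API could look like, written against the public Vector interface (not the merged implementation):

{code}
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Hedged sketch: visit every (index, value) pair of an MLlib Vector.
def foreachEntry(v: Vector)(f: (Int, Double) => Unit): Unit =
  (0 until v.size).foreach(i => f(i, v(i)))

// Example usage:
// foreachEntry(Vectors.dense(1.0, 0.0, 3.0)) { (i, x) => println(s"$i -> $x") }
{code}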



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11385) Add foreach API to MLLib's vector API

2015-10-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14981285#comment-14981285
 ] 

Apache Spark commented on SPARK-11385:
--

User 'nakul02' has created a pull request for this issue:
https://github.com/apache/spark/pull/9362

> Add foreach API to MLLib's vector API
> -
>
> Key: SPARK-11385
> URL: https://issues.apache.org/jira/browse/SPARK-11385
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: holdenk
>Priority: Minor
>
> Add a foreach API to MLLib's vector.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11385) Add foreach API to MLLib's vector API

2015-10-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11385:


Assignee: Apache Spark

> Add foreach API to MLLib's vector API
> -
>
> Key: SPARK-11385
> URL: https://issues.apache.org/jira/browse/SPARK-11385
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: holdenk
>Assignee: Apache Spark
>Priority: Minor
>
> Add a foreach API to MLLib's vector.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10046) Hive warehouse dir not set in current directory when not providing hive-site.xml

2015-10-29 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14981305#comment-14981305
 ] 

Yin Huai commented on SPARK-10046:
--

I feel it is better not to use the local dir as the parent dir of the warehouse, because 
a developer can have multiple Spark projects and he/she may want to use the same 
set of tables (without configuring this setting for every project). I do not 
have a very strong opinion though. We can first update the doc.

> Hive warehouse dir not set in current directory when not providing 
> hive-site.xml
> 
>
> Key: SPARK-10046
> URL: https://issues.apache.org/jira/browse/SPARK-10046
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.3.1
> Environment: OS X 10.10.4
> Java 1.7.0_79-b15
> Scala 2.10.5
> Spark 1.3.1
>Reporter: Antonio Murgia
>  Labels: hive, spark, sparksql
>
> When running Spark in a local environment (for unit-testing purposes) and 
> without providing any `hive-site.xml`, databases apart from the default one 
> are created in Hive's default hive.metastore.warehouse.dir and not in the 
> current directory (as stated in the [Spark 
> docs](http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables)).
>  This code snippet, tested with Spark 1.3.1, demonstrates the issue: 
> https://github.com/tmnd1991/spark-hive-bug/blob/master/src/main/scala/Main.scala
>  You can see that the exception is thrown when executing the CREATE DATABASE 
> statement, stating `Unable to create database path 
> file:/user/hive/warehouse/abc.db, failed to create database abc)`, where 
> `/user/hive/warehouse/abc.db` is not the current directory as stated in the 
> docs.
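A minimal sketch of the usual test-time workaround, setting the warehouse location explicitly instead of relying on the documented default (the path and app name are illustrative):

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// Hedged sketch: pin hive.metastore.warehouse.dir to a local directory so that
// CREATE DATABASE does not try to write under file:/user/hive/warehouse.
val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("warehouse-test"))
val hiveContext = new HiveContext(sc)
hiveContext.setConf("hive.metastore.warehouse.dir",
  new java.io.File("target/warehouse").getAbsolutePath)
hiveContext.sql("CREATE DATABASE IF NOT EXISTS abc")
{code}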



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10046) Hive warehouse dir not set in current directory when not providing hive-site.xml

2015-10-29 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14981331#comment-14981331
 ] 

Xin Wu commented on SPARK-10046:


Ok. Let me update the doc then.

> Hive warehouse dir not set in current directory when not providing 
> hive-site.xml
> 
>
> Key: SPARK-10046
> URL: https://issues.apache.org/jira/browse/SPARK-10046
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.3.1
> Environment: OS X 10.10.4
> Java 1.7.0_79-b15
> Scala 2.10.5
> Spark 1.3.1
>Reporter: Antonio Murgia
>  Labels: hive, spark, sparksql
>
> When running Spark in a local environment (for unit-testing purposes) and 
> without providing any `hive-site.xml`, databases apart from the default one 
> are created in Hive's default hive.metastore.warehouse.dir and not in the 
> current directory (as stated in the [Spark 
> docs](http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables)).
>  This code snippet, tested with Spark 1.3.1, demonstrates the issue: 
> https://github.com/tmnd1991/spark-hive-bug/blob/master/src/main/scala/Main.scala
>  You can see that the exception is thrown when executing the CREATE DATABASE 
> statement, stating `Unable to create database path 
> file:/user/hive/warehouse/abc.db, failed to create database abc)`, where 
> `/user/hive/warehouse/abc.db` is not the current directory as stated in the 
> docs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11275) [SQL] Regression in rollup/cube

2015-10-29 Thread Andrew Ray (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14981338#comment-14981338
 ] 

Andrew Ray commented on SPARK-11275:


I think I understand what is happening here. Any expression involving one 
of the group-by columns in a cube/rollup is evaluated incorrectly. This is missed 
in unit testing since the tests only check that the DataFrame operations give 
the same result as SQL, but both are handled by the same incorrect logic. 
I'll put a PR together to fix it and provide better unit tests.

> [SQL] Regression in rollup/cube 
> 
>
> Key: SPARK-11275
> URL: https://issues.apache.org/jira/browse/SPARK-11275
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Xiao Li
>
> Spark SQL is unable to generate a correct result when the following query 
> uses rollup: 
> "select a, b, sum(a + b) as sumAB, GROUPING__ID from mytable group by a, 
> b with rollup"
> Spark SQL generates a wrong result:
> [2,4,6,3]
> [2,null,null,1]
> [1,null,null,1]
> [null,null,null,0]
> [1,2,3,3]
> The table mytable is super simple, containing two rows and two columns:
> testData = Seq((1, 2), (2, 4)).toDF("a", "b")
> After turning off codegen, the query plan is like 
> == Parsed Logical Plan ==
> 'Rollup ['a,'b], 
> [unresolvedalias('a),unresolvedalias('b),unresolvedalias('sum(('a + 'b)) AS 
> sumAB#20),unresolvedalias('GROUPING__ID)]
>  'UnresolvedRelation `mytable`, None
> == Analyzed Logical Plan ==
> a: int, b: int, sumAB: bigint, GROUPING__ID: int
> Aggregate [a#2,b#3,grouping__id#23], [a#2,b#3,sum(cast((a#2 + b#3) as 
> bigint)) AS sumAB#20L,GROUPING__ID#23]
>  Expand [0,1,3], [a#2,b#3], grouping__id#23
>   Subquery mytable
>Project [_1#0 AS a#2,_2#1 AS b#3]
> LocalRelation [_1#0,_2#1], [[1,2],[2,4]]
> == Optimized Logical Plan ==
> Aggregate [a#2,b#3,grouping__id#23], [a#2,b#3,sum(cast((a#2 + b#3) as 
> bigint)) AS sumAB#20L,GROUPING__ID#23]
>  Expand [0,1,3], [a#2,b#3], grouping__id#23
>   LocalRelation [a#2,b#3], [[1,2],[2,4]]
> == Physical Plan ==
> Aggregate false, [a#2,b#3,grouping__id#23], [a#2,b#3,sum(PartialSum#24L) AS 
> sumAB#20L,grouping__id#23]
>  Exchange hashpartitioning(a#2,b#3,grouping__id#23,5)
>   Aggregate true, [a#2,b#3,grouping__id#23], 
> [a#2,b#3,grouping__id#23,sum(cast((a#2 + b#3) as bigint)) AS PartialSum#24L]
>Expand [List(null, null, 0),List(a#2, null, 1),List(a#2, b#3, 3)], 
> [a#2,b#3,grouping__id#23]
> LocalTableScan [a#2,b#3], [[1,2],[2,4]]
> Below are my observations:
> 1. Generation of GROUP__ID looks OK. 
> 2. The problem still exists no matter whether turning on/off CODEGEN
> 3. Rollup still works in a simple query when the group-by has only one 
> column. For example, "select b, sum(a), GROUPING__ID from mytable group by b 
> with rollup"
> 4. The tests in "HiveDataFrameAnalyticsSuite" are misleading. 
> Unfortunately, they hide the bugs. Although the tests pass, they just 
> compare the results of SQL and DataFrame operations, which cannot capture a 
> regression when both return the same wrong results.  
> 5. The same problem also exists in cube. I have not started the investigation 
> in cube, but I believe the root causes should be the same. 
> 6. It looks like all the logical plans are correct.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


