[jira] [Commented] (SPARK-11045) Contributing Receiver based Low Level Kafka Consumer from Spark-Packages to Apache Spark Project

2015-10-10 Thread Dibyendu Bhattacharya (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14951926#comment-14951926
 ] 

Dibyendu Bhattacharya commented on SPARK-11045:
---

I agree, Sean. This space is already getting complicated, and my intention was not 
at all to make it more confusing. 

What I see is that many customers are a little reluctant to use this consumer from 
spark-packages, thinking that it will get less support. Because it lives in 
spark-packages, many do not even consider using it in their use cases and instead 
use whatever Receiver based model is documented with Spark. For those who want to 
fall back to the Receiver based model, Spark's out-of-the-box receivers do not give 
them a better choice, and not many customers know that a better choice exists in 
spark-packages.


> Contributing Receiver based Low Level Kafka Consumer from Spark-Packages to 
> Apache Spark Project
> 
>
> Key: SPARK-11045
> URL: https://issues.apache.org/jira/browse/SPARK-11045
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Dibyendu Bhattacharya
>
> This JIRA is to track the progress of contributing the Receiver based Low Level 
> Kafka Consumer from spark-packages 
> (http://spark-packages.org/package/dibbhatt/kafka-spark-consumer) back to the 
> Apache Spark Project.
> This Kafka consumer has been around for more than a year and has matured over 
> time. I see many adoptions of this package, and I receive positive feedback that 
> this consumer gives better performance and fault tolerance. 
> The primary intent of this JIRA is to give the community a better alternative if 
> they want to use the Receiver based model. 
> If this consumer makes it into Spark Core, it will definitely see more adoption 
> and support from the community and help the many who still prefer the Receiver 
> based model of Kafka consumer. 
> I understand the Direct Stream is the consumer which can give Exactly Once 
> semantics and uses the Kafka Low Level API, which is good. But the Direct Stream 
> has concerns around recovering checkpoints after a driver code change, and 
> application developers need to manage their own offsets, which is complex. Even 
> if someone does manage their own offsets, it limits the parallelism Spark 
> Streaming can achieve: if someone wants more parallelism and sets 
> spark.streaming.concurrentJobs to more than 1, they can no longer rely on storing 
> offsets externally, as they have no control over which batch will run in which 
> sequence. 
> Furthermore, the Direct Stream has higher latency, as it fetches messages from 
> Kafka during the RDD action. Also, the number of RDD partitions is limited to the 
> number of topic partitions, so unless your Kafka topic has enough partitions, you 
> have limited parallelism while processing the RDD. 
> Due to the above concerns, many people who do not need Exactly Once semantics 
> still prefer the Receiver based model. Unfortunately, when customers fall back to 
> the KafkaUtils.createStream approach, which uses the Kafka High Level Consumer, 
> there are other issues around the reliability of the Kafka High Level API: it is 
> buggy and has a serious issue around consumer re-balance. Hence I do not think it 
> is correct to advise people to use KafkaUtils.createStream in production. 
> The better option at present is to use the consumer from spark-packages. It uses 
> the Kafka Low Level Consumer API, stores offsets in Zookeeper, and can recover 
> from any failure. Below are a few highlights of this consumer:
> 1. It has an inbuilt PID controller for dynamic rate limiting.
> 2. In this consumer, rate limiting is done by controlling the size of the blocks, 
> i.e. the total size of the messages pulled from Kafka, whereas in Spark rate 
> limiting is done by controlling the number of messages. The issue with throttling 
> by number of messages is that, if message sizes vary, block sizes will also vary. 
> Say your Kafka topic has messages of different sizes, from 10KB to 500KB: 
> throttling by number of messages can never give a deterministic block size, so 
> there is no guarantee that memory back-pressure can really take effect. 
> 3. This consumer uses the Kafka low level API, which gives better performance 
> than the KafkaUtils.createStream based High Level API.
> 4. This consumer can give an end-to-end no-data-loss channel if enabled with the WAL.
> By accepting this low level Kafka consumer from spark-packages into the Apache 
> Spark project, we will give the community better options for Kafka connectivity, 
> for both the Receiver-less and the Receiver based models. 
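
To illustrate the size-based throttling described in point 2 above, here is a 
minimal, hypothetical sketch (not the actual kafka-spark-consumer code) of filling 
a block against a byte budget instead of a message count:

{code}
// Hypothetical sketch only: accumulate messages until a byte budget is reached,
// so that block sizes stay roughly constant even when message sizes vary.
def fillBlock(messages: Iterator[Array[Byte]], maxBytesPerBlock: Long): Seq[Array[Byte]] = {
  val block = Seq.newBuilder[Array[Byte]]
  var bytes = 0L
  while (messages.hasNext && bytes < maxBytesPerBlock) {
    val msg = messages.next()
    block += msg
    bytes += msg.length
  }
  block.result()
}
{code}

Throttling by message count, by contrast, stops after a fixed number of messages 
regardless of their size, which is why the resulting block size is not deterministic.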



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-

[jira] [Commented] (SPARK-10759) Missing Python code example in ML Programming guide

2015-10-10 Thread Bhargav Mangipudi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14951985#comment-14951985
 ] 

Bhargav Mangipudi commented on SPARK-10759:
---

I can take this up if it's still open.

> Missing Python code example in ML Programming guide
> ---
>
> Key: SPARK-10759
> URL: https://issues.apache.org/jira/browse/SPARK-10759
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 1.5.0
>Reporter: Raela Wang
>Assignee: Lauren Moos
>Priority: Minor
>  Labels: starter
>
> http://spark.apache.org/docs/latest/ml-guide.html#example-model-selection-via-cross-validation
> http://spark.apache.org/docs/latest/ml-guide.html#example-model-selection-via-train-validation-split



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10977) SQL injection bugs in JdbcUtils and DataFrameWriter

2015-10-10 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14951880#comment-14951880
 ] 

Sean Owen commented on SPARK-10977:
---

If it's simple, quoting sounds good. This isn't really a SQL injection problem 
since it would be up to callers to sanitize inputs from an external source, and 
Spark is not something you would expose directly to external calls or input. 
Still this may avoid corner case problems like table names with a special 
character.
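
As a sketch of the kind of identifier quoting being discussed (a hypothetical 
helper, not Spark's or Derby's actual API, and without the case normalization 
shown in the issue's example output):

{code}
// Hypothetical sketch: split an optionally schema-qualified name and emit
// delimited identifiers, escaping any embedded double quotes.
def quoteTableName(table: String): String = {
  def quotePart(part: String): String =
    "\"" + part.replace("\"", "\"\"") + "\""
  table.split('.').map(quotePart).mkString(".")
}

// quoteTableName("trafficSchema.congestionTable") == "\"trafficSchema\".\"congestionTable\""
{code}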

> SQL injection bugs in JdbcUtils and DataFrameWriter
> ---
>
> Key: SPARK-10977
> URL: https://issues.apache.org/jira/browse/SPARK-10977
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Rick Hillegas
>Priority: Minor
>
> SPARK-10857 identifies a SQL injection bug in the JDBC dialect code. A 
> similar SQL injection bug can be found in 2 places in JdbcUtils and another 
> place in DataFrameWriter:
> {noformat}
> The DROP TABLE logic in JdbcUtils concatenates boilerplate with a 
> user-supplied string:
> def dropTable(conn: Connection, table: String): Unit = {
> conn.prepareStatement(s"DROP TABLE $table").executeUpdate()
>   }
> Same for the INSERT logic in JdbcUtils:
> def insertStatement(conn: Connection, table: String, rddSchema: StructType): 
> PreparedStatement = {
> val sql = new StringBuilder(s"INSERT INTO $table VALUES (")
> var fieldsLeft = rddSchema.fields.length
> while (fieldsLeft > 0) {
>   sql.append("?")
>   if (fieldsLeft > 1) sql.append(", ") else sql.append(")")
>   fieldsLeft = fieldsLeft - 1
> }
> conn.prepareStatement(sql.toString())
>   }
> Same for the CREATE TABLE logic in DataFrameWriter:
>   def jdbc(url: String, table: String, connectionProperties: Properties): 
> Unit = {
>...
>
> if (!tableExists) {
> val schema = JdbcUtils.schemaString(df, url)
> val sql = s"CREATE TABLE $table ($schema)"
> conn.prepareStatement(sql).executeUpdate()
>   }
>...
>   }
> {noformat}
> Maybe we can find a common solution to all of these SQL injection bugs. 
> Something like this:
> 1) Parse the user-supplied table name into a table identifier and an optional 
> schema identifier. We can borrow logic from org.apache.derby.iapi.util.IdUtil 
> in order to do this.
> 2) Double-quote (and escape as necessary) the schema and table identifiers so 
> that the database interprets them as delimited ids.
> That should prevent the SQL injection attacks.
> With this solution, if the user specifies table names like cityTable and 
> trafficSchema.congestionTable, then the generated DROP TABLE statements would 
> be
> {noformat}
> DROP TABLE "CITYTABLE"
> DROP TABLE "TRAFFICSCHEMA"."CONGESTIONTABLE"
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11049) If a single executor fails to allocate memory, entire job fails

2015-10-10 Thread Brian (JIRA)
Brian created SPARK-11049:
-

 Summary: If a single executor fails to allocate memory, entire job 
fails
 Key: SPARK-11049
 URL: https://issues.apache.org/jira/browse/SPARK-11049
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: Brian


To reproduce:

* Create a spark cluster using start-master.sh and start-slave.sh (I believe 
this is the "standalone cluster manager"?).
* Leave a process running on some nodes that takes up a significant amount of RAM.
* Leave some nodes with plenty of RAM to run Spark.
* Run a job against this cluster with spark.executor.memory asking for all or 
most of the memory available on each node (see the sketch below).
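
For illustration, the last step might look like the following (the master URL and 
memory value are hypothetical, chosen to exceed what the loaded nodes have free):

{code}
// Illustrative reproduction sketch only; values are hypothetical.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("spark://master-host:7077")   // standalone master from start-master.sh
  .setAppName("executor-memory-repro")
  .set("spark.executor.memory", "28g")     // most of the RAM on each node
val sc = new SparkContext(conf)
sc.parallelize(1 to 1000).count()          // any job that needs executors
{code}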

On the node that has insufficient memory, there will of course be an error like:
Error occurred during initialization of VM
Could not reserve enough space for object heap
Could not create the Java virtual machine.

On the driver node, and in the spark master UI, I see that _all_ executors exit 
or are killed, and the entire job fails.  It would be better if there was an 
indication of which individual node is actually at fault.  It would also be 
better if the cluster manager could handle failing-over to nodes that are still 
operating properly and have sufficient RAM.





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6725) Model export/import for Pipeline API

2015-10-10 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-6725:
-
Description: 
This is an umbrella JIRA for adding model export/import to the spark.ml API.  
This JIRA is for adding the internal Saveable/Loadable API and Parquet-based 
format, not for other formats like PMML.

This will require the following steps:
* Add export/import for all PipelineStages supported by spark.ml
** This will include some Transformers which are not Models.
** These can use almost the same format as the spark.mllib model save/load 
functions, but the model metadata must store a different class name (marking 
the class as a spark.ml class).
* After all PipelineStages support save/load, add an interface which forces 
future additions to support save/load.

*UPDATE*: In spark.ml, we could save feature metadata using DataFrames.  Other 
libraries and formats can support this, and it would be great if we could too.  
We could do either of the following:
* save() optionally takes a dataset (or schema), and load will return a (model, 
schema) pair.
* Models themselves save the input schema.

Both options would mean inheriting from new Saveable, Loadable types.
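
Purely as a hypothetical illustration of the first option above (names and 
signatures are assumptions, not the actual spark.ml API, which is specified in the 
design doc linked below):

{code}
// Hypothetical sketch only; see the linked design doc for the real API.
import org.apache.spark.sql.types.StructType

trait Saveable {
  // Persist this object, optionally together with the input schema.
  def save(path: String, inputSchema: Option[StructType] = None): Unit
}

trait Loadable[M] {
  // Load the object back along with any schema that was saved with it.
  def load(path: String): (M, Option[StructType])
}
{code}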

*UPDATE: DESIGN DOC*: Here's a design doc which I wrote.  If you have comments 
about the planned implementation, please comment in this JIRA.  Thanks!  
[https://docs.google.com/document/d/1RleM4QiKwdfZZHf0_G6FBNaF7_koc1Ui7qfMT1pf4IA/edit?usp=sharing]

  was:
This is an umbrella JIRA for adding model export/import to the spark.ml API.  
This JIRA is for adding the internal Saveable/Loadable API and Parquet-based 
format, not for other formats like PMML.

This will require the following steps:
* Add export/import for all PipelineStages supported by spark.ml
** This will include some Transformers which are not Models.
** These can use almost the same format as the spark.mllib model save/load 
functions, but the model metadata must store a different class name (marking 
the class as a spark.ml class).
* After all PipelineStages support save/load, add an interface which forces 
future additions to support save/load.

*UPDATE*: In spark.ml, we could save feature metadata using DataFrames.  Other 
libraries and formats can support this, and it would be great if we could too.  
We could do either of the following:
* save() optionally takes a dataset (or schema), and load will return a (model, 
schema) pair.
* Models themselves save the input schema.

Both options would mean inheriting from new Saveable, Loadable types.



> Model export/import for Pipeline API
> 
>
> Key: SPARK-6725
> URL: https://issues.apache.org/jira/browse/SPARK-6725
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Critical
>
> This is an umbrella JIRA for adding model export/import to the spark.ml API.  
> This JIRA is for adding the internal Saveable/Loadable API and Parquet-based 
> format, not for other formats like PMML.
> This will require the following steps:
> * Add export/import for all PipelineStages supported by spark.ml
> ** This will include some Transformers which are not Models.
> ** These can use almost the same format as the spark.mllib model save/load 
> functions, but the model metadata must store a different class name (marking 
> the class as a spark.ml class).
> * After all PipelineStages support save/load, add an interface which forces 
> future additions to support save/load.
> *UPDATE*: In spark.ml, we could save feature metadata using DataFrames.  
> Other libraries and formats can support this, and it would be great if we 
> could too.  We could do either of the following:
> * save() optionally takes a dataset (or schema), and load will return a 
> (model, schema) pair.
> * Models themselves save the input schema.
> Both options would mean inheriting from new Saveable, Loadable types.
> *UPDATE: DESIGN DOC*: Here's a design doc which I wrote.  If you have 
> comments about the planned implementation, please comment in this JIRA.  
> Thanks!  
> [https://docs.google.com/document/d/1RleM4QiKwdfZZHf0_G6FBNaF7_koc1Ui7qfMT1pf4IA/edit?usp=sharing]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10973) __getitem__ method throws IndexError exception when we try to access index after the last non-zero entry.

2015-10-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14951930#comment-14951930
 ] 

Apache Spark commented on SPARK-10973:
--

User 'zero323' has created a pull request for this issue:
https://github.com/apache/spark/pull/9063

> __getitem__ method throws IndexError exception when we try to access index 
> after the last non-zero entry.
> --
>
> Key: SPARK-10973
> URL: https://issues.apache.org/jira/browse/SPARK-10973
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 1.3.0, 1.4.0, 1.5.0, 1.6.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>  Labels: backport-needed
> Fix For: 1.6.0
>
>
> \_\_getitem\_\_ method throws IndexError exception when we try to access 
> index after the last non-zero entry.
> {code}
> from pyspark.mllib.linalg import Vectors
> sv = Vectors.sparse(5, {1: 3})
> sv[0]
> ## 0.0
> sv[1]
> ## 3.0
> sv[2]
> ## Traceback (most recent call last):
> ##   File "<stdin>", line 1, in <module>
> ##   File "/python/pyspark/mllib/linalg/__init__.py", line 734, in __getitem__
> ## row_ind = inds[insert_index]
> ## IndexError: index out of bounds
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10973) __getitem__ method throws IndexError exception when we try to access index after the last non-zero entry.

2015-10-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14951931#comment-14951931
 ] 

Apache Spark commented on SPARK-10973:
--

User 'zero323' has created a pull request for this issue:
https://github.com/apache/spark/pull/9064

> __getitem__ method throws IndexError exception when we try to access index 
> after the last non-zero entry.
> --
>
> Key: SPARK-10973
> URL: https://issues.apache.org/jira/browse/SPARK-10973
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 1.3.0, 1.4.0, 1.5.0, 1.6.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>  Labels: backport-needed
> Fix For: 1.6.0
>
>
> \_\_getitem\_\_ method throws IndexError exception when we try to access 
> index after the last non-zero entry.
> {code}
> from pyspark.mllib.linalg import Vectors
> sv = Vectors.sparse(5, {1: 3})
> sv[0]
> ## 0.0
> sv[1]
> ## 3.0
> sv[2]
> ## Traceback (most recent call last):
> ##   File "<stdin>", line 1, in <module>
> ##   File "/python/pyspark/mllib/linalg/__init__.py", line 734, in __getitem__
> ## row_ind = inds[insert_index]
> ## IndexError: index out of bounds
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10973) __getitem__ method throws IndexError exception when we try to access index after the last non-zero entry.

2015-10-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14951927#comment-14951927
 ] 

Apache Spark commented on SPARK-10973:
--

User 'zero323' has created a pull request for this issue:
https://github.com/apache/spark/pull/9062

> __getitem__ method throws IndexError exception when we try to access index 
> after the last non-zero entry.
> --
>
> Key: SPARK-10973
> URL: https://issues.apache.org/jira/browse/SPARK-10973
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 1.3.0, 1.4.0, 1.5.0, 1.6.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>  Labels: backport-needed
> Fix For: 1.6.0
>
>
> \_\_getitem\_\_ method throws IndexError exception when we try to access 
> index after the last non-zero entry.
> {code}
> from pyspark.mllib.linalg import Vectors
> sv = Vectors.sparse(5, {1: 3})
> sv[0]
> ## 0.0
> sv[1]
> ## 3.0
> sv[2]
> ## Traceback (most recent call last):
> ##   File "<stdin>", line 1, in <module>
> ##   File "/python/pyspark/mllib/linalg/__init__.py", line 734, in __getitem__
> ## row_ind = inds[insert_index]
> ## IndexError: index out of bounds
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11045) Contributing Receiver based Low Level Kafka Consumer from Spark-Packages to Apache Spark Project

2015-10-10 Thread Dibyendu Bhattacharya (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14951915#comment-14951915
 ] 

Dibyendu Bhattacharya commented on SPARK-11045:
---

Hi Cody, thanks for your comments. 

My point on parallelism is not about the receive parallelism from Kafka, which is 
the same for both the receiver and direct stream modes. My thought was about the 
parallelism while processing the RDD. In DirectStream the number of RDD partitions 
equals the number of topic partitions, so if your Kafka topic has 10 partitions, 
your RDD will have 10 partitions, and that is the maximum parallelism while 
processing the RDD (unless you re-partition, which comes at a cost). Whereas in the 
Receiver based model, the number of partitions is dictated by the block interval 
and the batch interval: if your block interval is 200 ms and your batch interval is 
10 seconds, your RDD will have 50 partitions!
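
As a quick sketch of that arithmetic (the values are just the illustrative ones 
above, not recommendations):

{code}
// Illustrative only: in the Receiver based model the number of partitions per
// batch is roughly the batch interval divided by the block interval.
val blockIntervalMs = 200L     // e.g. spark.streaming.blockInterval = 200ms
val batchIntervalMs = 10000L   // e.g. a 10 second batch duration
val partitionsPerBatch = batchIntervalMs / blockIntervalMs  // = 50
{code}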

I believe that gives much better parallelism while processing the RDD. 

Regarding the state of the spark-packages code, your comment is not in good taste. 
There are many companies who think otherwise and use the spark-packages consumer 
in production. 

As I said earlier, DirectStream is definitely the choice if one needs "Exactly 
Once", but there are many who do not want "Exactly Once" and do not want the 
overhead of using DirectStream. Unfortunately, for them the other alternative, 
which uses the Kafka high level API, is also not good enough. I am trying to offer 
a better alternative in the form of a much better Receiver based approach.





> Contributing Receiver based Low Level Kafka Consumer from Spark-Packages to 
> Apache Spark Project
> 
>
> Key: SPARK-11045
> URL: https://issues.apache.org/jira/browse/SPARK-11045
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Dibyendu Bhattacharya
>
> This JIRA is to track the progress of contributing the Receiver based Low Level 
> Kafka Consumer from spark-packages 
> (http://spark-packages.org/package/dibbhatt/kafka-spark-consumer) back to the 
> Apache Spark Project.
> This Kafka consumer has been around for more than a year and has matured over 
> time. I see many adoptions of this package, and I receive positive feedback that 
> this consumer gives better performance and fault tolerance. 
> The primary intent of this JIRA is to give the community a better alternative if 
> they want to use the Receiver based model. 
> If this consumer makes it into Spark Core, it will definitely see more adoption 
> and support from the community and help the many who still prefer the Receiver 
> based model of Kafka consumer. 
> I understand the Direct Stream is the consumer which can give Exactly Once 
> semantics and uses the Kafka Low Level API, which is good. But the Direct Stream 
> has concerns around recovering checkpoints after a driver code change, and 
> application developers need to manage their own offsets, which is complex. Even 
> if someone does manage their own offsets, it limits the parallelism Spark 
> Streaming can achieve: if someone wants more parallelism and sets 
> spark.streaming.concurrentJobs to more than 1, they can no longer rely on storing 
> offsets externally, as they have no control over which batch will run in which 
> sequence. 
> Furthermore, the Direct Stream has higher latency, as it fetches messages from 
> Kafka during the RDD action. Also, the number of RDD partitions is limited to the 
> number of topic partitions, so unless your Kafka topic has enough partitions, you 
> have limited parallelism while processing the RDD. 
> Due to the above concerns, many people who do not need Exactly Once semantics 
> still prefer the Receiver based model. Unfortunately, when customers fall back to 
> the KafkaUtils.createStream approach, which uses the Kafka High Level Consumer, 
> there are other issues around the reliability of the Kafka High Level API: it is 
> buggy and has a serious issue around consumer re-balance. Hence I do not think it 
> is correct to advise people to use KafkaUtils.createStream in production. 
> The better option at present is to use the consumer from spark-packages. It uses 
> the Kafka Low Level Consumer API, stores offsets in Zookeeper, and can recover 
> from any failure. Below are a few highlights of this consumer:
> 1. It has an inbuilt PID controller for dynamic rate limiting.
> 2. In this consumer, rate limiting is done by controlling the size of the blocks, 
> i.e. the total size of the messages pulled from Kafka, whereas in Spark rate 
> limiting is done by controlling the number of messages. The issue with throttling 
> by number of messages is that, if message sizes vary, block sizes will also vary. 
> Say your Kafka topic has messages of different sizes, from 

[jira] [Comment Edited] (SPARK-11045) Contributing Receiver based Low Level Kafka Consumer from Spark-Packages to Apache Spark Project

2015-10-10 Thread Dibyendu Bhattacharya (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14951926#comment-14951926
 ] 

Dibyendu Bhattacharya edited comment on SPARK-11045 at 10/10/15 4:58 PM:
-

I agree, Sean. This space is already getting complicated, and my intention was not 
at all to make it more confusing. 

What I see is that many customers are a little reluctant to use this consumer from 
spark-packages, thinking that it will get less support. Because it lives in 
spark-packages, many do not even consider using it in their use cases and instead 
use whatever Receiver based model is documented with Spark. For those who want to 
fall back to the Receiver based model, Spark's out-of-the-box receivers do not give 
them a better choice, and many customers do not know that a better choice exists 
in spark-packages.



was (Author: dibbhatt):
I agree Sean. this space is already getting complicated. My intention was not 
at all to make it more confusing. 

What I see is , many customer is little reluctant to use this consumer from 
spark-packages thinking that it will get less support . Being at 
spark-packages, many does not even consider it using in their use cases rather 
use the whatever Receiver Based model which is documented with Spark. I think 
those who wants to fall back to Receiver based model , Spark out of the box 
Receivers does not give them a better choice and not many customer knows that a 
better choice exists in spark-packages.


> Contributing Receiver based Low Level Kafka Consumer from Spark-Packages to 
> Apache Spark Project
> 
>
> Key: SPARK-11045
> URL: https://issues.apache.org/jira/browse/SPARK-11045
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Dibyendu Bhattacharya
>
> This JIRA is to track the progress of contributing the Receiver based Low Level 
> Kafka Consumer from spark-packages 
> (http://spark-packages.org/package/dibbhatt/kafka-spark-consumer) back to the 
> Apache Spark Project.
> This Kafka consumer has been around for more than a year and has matured over 
> time. I see many adoptions of this package, and I receive positive feedback that 
> this consumer gives better performance and fault tolerance. 
> The primary intent of this JIRA is to give the community a better alternative if 
> they want to use the Receiver based model. 
> If this consumer makes it into Spark Core, it will definitely see more adoption 
> and support from the community and help the many who still prefer the Receiver 
> based model of Kafka consumer. 
> I understand the Direct Stream is the consumer which can give Exactly Once 
> semantics and uses the Kafka Low Level API, which is good. But the Direct Stream 
> has concerns around recovering checkpoints after a driver code change, and 
> application developers need to manage their own offsets, which is complex. Even 
> if someone does manage their own offsets, it limits the parallelism Spark 
> Streaming can achieve: if someone wants more parallelism and sets 
> spark.streaming.concurrentJobs to more than 1, they can no longer rely on storing 
> offsets externally, as they have no control over which batch will run in which 
> sequence. 
> Furthermore, the Direct Stream has higher latency, as it fetches messages from 
> Kafka during the RDD action. Also, the number of RDD partitions is limited to the 
> number of topic partitions, so unless your Kafka topic has enough partitions, you 
> have limited parallelism while processing the RDD. 
> Due to the above concerns, many people who do not need Exactly Once semantics 
> still prefer the Receiver based model. Unfortunately, when customers fall back to 
> the KafkaUtils.createStream approach, which uses the Kafka High Level Consumer, 
> there are other issues around the reliability of the Kafka High Level API: it is 
> buggy and has a serious issue around consumer re-balance. Hence I do not think it 
> is correct to advise people to use KafkaUtils.createStream in production. 
> The better option at present is to use the consumer from spark-packages. It uses 
> the Kafka Low Level Consumer API, stores offsets in Zookeeper, and can recover 
> from any failure. Below are a few highlights of this consumer:
> 1. It has an inbuilt PID controller for dynamic rate limiting.
> 2. In this consumer, rate limiting is done by controlling the size of the blocks, 
> i.e. the total size of the messages pulled from Kafka, whereas in Spark rate 
> limiting is done by controlling the number of messages. The issue with throttling 
> by number of messages is that, if message sizes vary, block sizes will also vary. 
> Say your Kafka topic has messages of different sizes, from 10KB to 500KB: 
> throttling by number of 

[jira] [Commented] (SPARK-10977) SQL injection bugs in JdbcUtils and DataFrameWriter

2015-10-10 Thread Rick Hillegas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14951865#comment-14951865
 ] 

Rick Hillegas commented on SPARK-10977:
---

Hi Sean,

Yes, I hope to post some patches soon. Thanks.

> SQL injection bugs in JdbcUtils and DataFrameWriter
> ---
>
> Key: SPARK-10977
> URL: https://issues.apache.org/jira/browse/SPARK-10977
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Rick Hillegas
>Priority: Minor
>
> SPARK-10857 identifies a SQL injection bug in the JDBC dialect code. A 
> similar SQL injection bug can be found in 2 places in JdbcUtils and another 
> place in DataFrameWriter:
> {noformat}
> The DROP TABLE logic in JdbcUtils concatenates boilerplate with a 
> user-supplied string:
> def dropTable(conn: Connection, table: String): Unit = {
> conn.prepareStatement(s"DROP TABLE $table").executeUpdate()
>   }
> Same for the INSERT logic in JdbcUtils:
> def insertStatement(conn: Connection, table: String, rddSchema: StructType): 
> PreparedStatement = {
> val sql = new StringBuilder(s"INSERT INTO $table VALUES (")
> var fieldsLeft = rddSchema.fields.length
> while (fieldsLeft > 0) {
>   sql.append("?")
>   if (fieldsLeft > 1) sql.append(", ") else sql.append(")")
>   fieldsLeft = fieldsLeft - 1
> }
> conn.prepareStatement(sql.toString())
>   }
> Same for the CREATE TABLE logic in DataFrameWriter:
>   def jdbc(url: String, table: String, connectionProperties: Properties): 
> Unit = {
>...
>
> if (!tableExists) {
> val schema = JdbcUtils.schemaString(df, url)
> val sql = s"CREATE TABLE $table ($schema)"
> conn.prepareStatement(sql).executeUpdate()
>   }
>...
>   }
> {noformat}
> Maybe we can find a common solution to all of these SQL injection bugs. 
> Something like this:
> 1) Parse the user-supplied table name into a table identifier and an optional 
> schema identifier. We can borrow logic from org.apache.derby.iapi.util.IdUtil 
> in order to do this.
> 2) Double-quote (and escape as necessary) the schema and table identifiers so 
> that the database interprets them as delimited ids.
> That should prevent the SQL injection attacks.
> With this solution, if the user specifies table names like cityTable and 
> trafficSchema.congestionTable, then the generated DROP TABLE statements would 
> be
> {noformat}
> DROP TABLE "CITYTABLE"
> DROP TABLE "TRAFFICSCHEMA"."CONGESTIONTABLE"
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11045) Contributing Receiver based Low Level Kafka Consumer from Spark-Packages to Apache Spark Project

2015-10-10 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14951877#comment-14951877
 ] 

Cody Koeninger commented on SPARK-11045:


The comments regarding parallelism are not accurate.  Your read parallelism 
from Kafka is ultimately limited by number of Kafka partitions regardless of 
which consumer you use.

Spark checkpoint recovery is a problem, again regardless of what consumer you 
use.  Zookeeper as an offset store also has its own problems. At least the 
direct stream allows you to choose what works for you.

It seems like the main substantive complaint is rebalance behavior of the high 
level consumer. Frankly, given the state of the spark packages code the last 
time I looked at it, I'd rather effort be spent on addressing problems with the 
existing receiver rather than incorporating the spark packages code as is into 
spark.

> Contributing Receiver based Low Level Kafka Consumer from Spark-Packages to 
> Apache Spark Project
> 
>
> Key: SPARK-11045
> URL: https://issues.apache.org/jira/browse/SPARK-11045
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Dibyendu Bhattacharya
>
> This JIRA is to track the progress of contributing the Receiver based Low Level 
> Kafka Consumer from spark-packages 
> (http://spark-packages.org/package/dibbhatt/kafka-spark-consumer) back to the 
> Apache Spark Project.
> This Kafka consumer has been around for more than a year and has matured over 
> time. I see many adoptions of this package, and I receive positive feedback that 
> this consumer gives better performance and fault tolerance. 
> The primary intent of this JIRA is to give the community a better alternative if 
> they want to use the Receiver based model. 
> If this consumer makes it into Spark Core, it will definitely see more adoption 
> and support from the community and help the many who still prefer the Receiver 
> based model of Kafka consumer. 
> I understand the Direct Stream is the consumer which can give Exactly Once 
> semantics and uses the Kafka Low Level API, which is good. But the Direct Stream 
> has concerns around recovering checkpoints after a driver code change, and 
> application developers need to manage their own offsets, which is complex. Even 
> if someone does manage their own offsets, it limits the parallelism Spark 
> Streaming can achieve: if someone wants more parallelism and sets 
> spark.streaming.concurrentJobs to more than 1, they can no longer rely on storing 
> offsets externally, as they have no control over which batch will run in which 
> sequence. 
> Furthermore, the Direct Stream has higher latency, as it fetches messages from 
> Kafka during the RDD action. Also, the number of RDD partitions is limited to the 
> number of topic partitions, so unless your Kafka topic has enough partitions, you 
> have limited parallelism while processing the RDD. 
> Due to the above concerns, many people who do not need Exactly Once semantics 
> still prefer the Receiver based model. Unfortunately, when customers fall back to 
> the KafkaUtils.createStream approach, which uses the Kafka High Level Consumer, 
> there are other issues around the reliability of the Kafka High Level API: it is 
> buggy and has a serious issue around consumer re-balance. Hence I do not think it 
> is correct to advise people to use KafkaUtils.createStream in production. 
> The better option at present is to use the consumer from spark-packages. It uses 
> the Kafka Low Level Consumer API, stores offsets in Zookeeper, and can recover 
> from any failure. Below are a few highlights of this consumer:
> 1. It has an inbuilt PID controller for dynamic rate limiting.
> 2. In this consumer, rate limiting is done by controlling the size of the blocks, 
> i.e. the total size of the messages pulled from Kafka, whereas in Spark rate 
> limiting is done by controlling the number of messages. The issue with throttling 
> by number of messages is that, if message sizes vary, block sizes will also vary. 
> Say your Kafka topic has messages of different sizes, from 10KB to 500KB: 
> throttling by number of messages can never give a deterministic block size, so 
> there is no guarantee that memory back-pressure can really take effect. 
> 3. This consumer uses the Kafka low level API, which gives better performance 
> than the KafkaUtils.createStream based High Level API.
> 4. This consumer can give an end-to-end no-data-loss channel if enabled with the WAL.
> By accepting this low level Kafka consumer from spark-packages into the Apache 
> Spark project, we will give the community better options for Kafka connectivity, 
> for both the Receiver-less and the Receiver based models. 



--
This message was sent by Atlassian JIRA

[jira] [Commented] (SPARK-11045) Contributing Receiver based Low Level Kafka Consumer from Spark-Packages to Apache Spark Project

2015-10-10 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14951921#comment-14951921
 ] 

Sean Owen commented on SPARK-11045:
---

Without commenting on the code itself, why does it need to move into Spark? What's 
the problem with Spark packages? Keep in mind we already have a slightly confusing 
array of possibilities in the project.

> Contributing Receiver based Low Level Kafka Consumer from Spark-Packages to 
> Apache Spark Project
> 
>
> Key: SPARK-11045
> URL: https://issues.apache.org/jira/browse/SPARK-11045
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Dibyendu Bhattacharya
>
> This JIRA is to track the progress of contributing the Receiver based Low Level 
> Kafka Consumer from spark-packages 
> (http://spark-packages.org/package/dibbhatt/kafka-spark-consumer) back to the 
> Apache Spark Project.
> This Kafka consumer has been around for more than a year and has matured over 
> time. I see many adoptions of this package, and I receive positive feedback that 
> this consumer gives better performance and fault tolerance. 
> The primary intent of this JIRA is to give the community a better alternative if 
> they want to use the Receiver based model. 
> If this consumer makes it into Spark Core, it will definitely see more adoption 
> and support from the community and help the many who still prefer the Receiver 
> based model of Kafka consumer. 
> I understand the Direct Stream is the consumer which can give Exactly Once 
> semantics and uses the Kafka Low Level API, which is good. But the Direct Stream 
> has concerns around recovering checkpoints after a driver code change, and 
> application developers need to manage their own offsets, which is complex. Even 
> if someone does manage their own offsets, it limits the parallelism Spark 
> Streaming can achieve: if someone wants more parallelism and sets 
> spark.streaming.concurrentJobs to more than 1, they can no longer rely on storing 
> offsets externally, as they have no control over which batch will run in which 
> sequence. 
> Furthermore, the Direct Stream has higher latency, as it fetches messages from 
> Kafka during the RDD action. Also, the number of RDD partitions is limited to the 
> number of topic partitions, so unless your Kafka topic has enough partitions, you 
> have limited parallelism while processing the RDD. 
> Due to the above concerns, many people who do not need Exactly Once semantics 
> still prefer the Receiver based model. Unfortunately, when customers fall back to 
> the KafkaUtils.createStream approach, which uses the Kafka High Level Consumer, 
> there are other issues around the reliability of the Kafka High Level API: it is 
> buggy and has a serious issue around consumer re-balance. Hence I do not think it 
> is correct to advise people to use KafkaUtils.createStream in production. 
> The better option at present is to use the consumer from spark-packages. It uses 
> the Kafka Low Level Consumer API, stores offsets in Zookeeper, and can recover 
> from any failure. Below are a few highlights of this consumer:
> 1. It has an inbuilt PID controller for dynamic rate limiting.
> 2. In this consumer, rate limiting is done by controlling the size of the blocks, 
> i.e. the total size of the messages pulled from Kafka, whereas in Spark rate 
> limiting is done by controlling the number of messages. The issue with throttling 
> by number of messages is that, if message sizes vary, block sizes will also vary. 
> Say your Kafka topic has messages of different sizes, from 10KB to 500KB: 
> throttling by number of messages can never give a deterministic block size, so 
> there is no guarantee that memory back-pressure can really take effect. 
> 3. This consumer uses the Kafka low level API, which gives better performance 
> than the KafkaUtils.createStream based High Level API.
> 4. This consumer can give an end-to-end no-data-loss channel if enabled with the WAL.
> By accepting this low level Kafka consumer from spark-packages into the Apache 
> Spark project, we will give the community better options for Kafka connectivity, 
> for both the Receiver-less and the Receiver based models. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11049) If a single executor fails to allocate memory, entire job fails

2015-10-10 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14951923#comment-14951923
 ] 

Sean Owen commented on SPARK-11049:
---

Hm, what's the context here though: you have set up a worker that says it has X 
GB of RAM or something to hand out, but then, there isn't that much RAM 
available from the OS even? that seems like a problem in your worker config. 
I'd expect a fairly complete failure of the job since it is not evidently 
recoverable. Sure, you can say, keep trying other nodes in case their 
advertised amount of RAM isn't wrong, but isn't this a basic config problem?

> If a single executor fails to allocate memory, entire job fails
> ---
>
> Key: SPARK-11049
> URL: https://issues.apache.org/jira/browse/SPARK-11049
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.0
>Reporter: Brian
>
> To reproduce:
> * Create a spark cluster using start-master.sh and start-slave.sh (I believe 
> this is the "standalone cluster manager?").  
> * Leave a process running on some nodes that takes up a significant amount 
> of RAM.
> * Leave some nodes with plenty of RAM to run spark.
> * Run a job against this cluster with spark.executor.memory asking for all or 
> most of the memory available on each node.
> On the node that has insufficient memory, there will of course be an error 
> like:
> Error occurred during initialization of VM
> Could not reserve enough space for object heap
> Could not create the Java virtual machine.
> On the driver node, and in the spark master UI, I see that _all_ executors 
> exit or are killed, and the entire job fails.  It would be better if there 
> was an indication of which individual node is actually at fault.  It would 
> also be better if the cluster manager could handle failing-over to nodes that 
> are still operating properly and have sufficient RAM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6726) Model export/import for spark.ml: LogisticRegression

2015-10-10 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14951988#comment-14951988
 ] 

Joseph K. Bradley commented on SPARK-6726:
--

I just posted a design doc on the parent JIRA.  It describes the planned 
implementation and API, which would be great for you to review, give feedback 
on, and get familiar with.  Thanks!

> Model export/import for spark.ml: LogisticRegression
> 
>
> Key: SPARK-6726
> URL: https://issues.apache.org/jira/browse/SPARK-6726
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11050) PySpark SparseVector can return wrong index in error message

2015-10-10 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-11050:
-

 Summary: PySpark SparseVector can return wrong index in error 
message
 Key: SPARK-11050
 URL: https://issues.apache.org/jira/browse/SPARK-11050
 Project: Spark
  Issue Type: Improvement
  Components: MLlib, PySpark
Affects Versions: 1.6.0
Reporter: Joseph K. Bradley
Priority: Trivial


PySpark {{SparseVector.__getitem__}} returns an error message if given a bad 
index here:
[https://github.com/apache/spark/blob/a16396df76cc27099011bfb96b28cbdd7f964ca8/python/pyspark/mllib/linalg/__init__.py#L770]

But the index it complains about could have been modified (if negative), 
meaning the index in the error message could be wrong.  This should be 
corrected.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11052) Spaces in the build dir causes failures in the build/mvn script

2015-10-10 Thread Trystan Leftwich (JIRA)
Trystan Leftwich created SPARK-11052:


 Summary: Spaces in the build dir causes failures in the build/mvn 
script
 Key: SPARK-11052
 URL: https://issues.apache.org/jira/browse/SPARK-11052
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.5.1, 1.5.0
Reporter: Trystan Leftwich


If you are running make-distribution in a path that contains a space, the 
build/mvn script will fail:

{code}
mkdir /tmp/test\ spaces
cd /tmp/test\ spaces
git clone https://github.com/apache/spark.git
cd spark
./make-distribution.sh --name spark-1.5-test4 --tgz -Pyarn -Phive-thriftserver 
-Phive
{code}

You will get the following errors:
{code}
/tmp/test spaces/spark/build/mvn: line 107: cd: /../lib: No such file or 
directory
usage: dirname path
/tmp/test spaces/spark/build/mvn: line 108: cd: /../lib: No such file or 
directory
/tmp/test spaces/spark/build/mvn: line 138: /tmp/test: No such file or directory
/tmp/test spaces/spark/build/mvn: line 140: /tmp/test: No such file or directory
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11051) NullPointerException when action called on localCheckpointed RDD (that was checkpointed before)

2015-10-10 Thread Jacek Laskowski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jacek Laskowski updated SPARK-11051:

Environment: Spark version 1.6.0-SNAPSHOT built from the sources as of 
today - Oct, 10th  (was: Spark built from the sources as of today - Oct, 10th)

> NullPointerException when action called on localCheckpointed RDD (that was 
> checkpointed before)
> ---
>
> Key: SPARK-11051
> URL: https://issues.apache.org/jira/browse/SPARK-11051
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.0
> Environment: Spark version 1.6.0-SNAPSHOT built from the sources as 
> of today - Oct, 10th
>Reporter: Jacek Laskowski
>
> While toying with {{RDD.checkpoint}} and {{RDD.localCheckpoint}} methods, the 
> following NullPointerException was thrown:
> {code}
> scala> lines.count
> java.lang.NullPointerException
>   at org.apache.spark.rdd.RDD.firstParent(RDD.scala:1587)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1927)
>   at org.apache.spark.rdd.RDD.count(RDD.scala:1115)
>   ... 48 elided
> {code}
> To reproduce the issue do the following:
> {code}
> $ ./bin/spark-shell
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 1.6.0-SNAPSHOT
>       /_/
> Using Scala version 2.11.7 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_60)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> val lines = sc.textFile("README.md")
> lines: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[1] at textFile at 
> <console>:24
> scala> sc.setCheckpointDir("checkpoints")
> scala> lines.checkpoint
> scala> lines.count
> res2: Long = 98
> scala> lines.localCheckpoint
> 15/10/10 22:59:20 WARN MapPartitionsRDD: RDD was already marked for reliable 
> checkpointing: overriding with local checkpoint.
> res4: lines.type = MapPartitionsRDD[1] at textFile at <console>:24
> scala> lines.count
> java.lang.NullPointerException
>   at org.apache.spark.rdd.RDD.firstParent(RDD.scala:1587)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1927)
>   at org.apache.spark.rdd.RDD.count(RDD.scala:1115)
>   ... 48 elided
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11053) Remove use of KVIterator in SortBasedAggregationIterator

2015-10-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11053:


Assignee: Josh Rosen  (was: Apache Spark)

> Remove use of KVIterator in SortBasedAggregationIterator
> 
>
> Key: SPARK-11053
> URL: https://issues.apache.org/jira/browse/SPARK-11053
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> SortBasedAggregationIterator uses a KVIterator interface in order to process 
> input rows as key-value pairs, but this use of KVIterator is unnecessary, 
> slightly complicates the code, and might harm performance. We should refactor 
> this code to remove this use of KVIterator. I'll submit a PR to do this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11053) Remove use of KVIterator in SortBasedAggregationIterator

2015-10-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14952150#comment-14952150
 ] 

Apache Spark commented on SPARK-11053:
--

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/9066

> Remove use of KVIterator in SortBasedAggregationIterator
> 
>
> Key: SPARK-11053
> URL: https://issues.apache.org/jira/browse/SPARK-11053
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> SortBasedAggregationIterator uses a KVIterator interface in order to process 
> input rows as key-value pairs, but this use of KVIterator is unnecessary, 
> slightly complicates the code, and might harm performance. We should refactor 
> this code to remove this use of KVIterator. I'll submit a PR to do this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11053) Remove use of KVIterator in SortBasedAggregationIterator

2015-10-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11053:


Assignee: Apache Spark  (was: Josh Rosen)

> Remove use of KVIterator in SortBasedAggregationIterator
> 
>
> Key: SPARK-11053
> URL: https://issues.apache.org/jira/browse/SPARK-11053
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Josh Rosen
>Assignee: Apache Spark
>
> SortBasedAggregationIterator uses a KVIterator interface in order to process 
> input rows as key-value pairs, but this use of KVIterator is unnecessary, 
> slightly complicates the code, and might harm performance. We should refactor 
> this code to remove this use of KVIterator. I'll submit a PR to do this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-11045) Contributing Receiver based Low Level Kafka Consumer from Spark-Packages to Apache Spark Project

2015-10-10 Thread Dibyendu Bhattacharya (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14951915#comment-14951915
 ] 

Dibyendu Bhattacharya edited comment on SPARK-11045 at 10/11/15 4:45 AM:
-

Hi Cody, thanks for your comments. 

My point on parallelism is not about the receive parallelism from Kafka, which is 
the same for both the receiver and direct stream modes. My thought was about the 
parallelism while processing the RDD. In DirectStream the number of RDD partitions 
equals the number of topic partitions, so if your Kafka topic has 10 partitions, 
your RDD will have 10 partitions, and that is the maximum parallelism while 
processing the RDD (unless you re-partition, which comes at a cost). Whereas in the 
Receiver based model, the number of partitions is dictated by the block interval 
and the batch interval: if your block interval is 200 ms and your batch interval is 
10 seconds, your RDD will have 50 partitions!

I believe that gives much better parallelism while processing the RDD. 

Regarding the state of the spark-packages code, your comment is not in good taste. 
There are many companies who think otherwise and use the spark-packages consumer 
in production. 

As I said earlier, DirectStream is definitely the choice if one needs "Exactly 
Once", but there are many who do not want "Exactly Once" and do not want the 
overhead of using DirectStream. Unfortunately, for them the other alternative, 
which uses the Kafka high level API, is also not good enough. I am trying to offer 
a better alternative in the form of a much better Receiver based approach.






was (Author: dibbhatt):
Hi Cody, thanks for your comments. 

My opinion on parallelism is not around receiving parallelism from Kafka which 
is same for both receiver and direct stream mode. My thought was on the 
parallelism while processing the RDD. In DirectStream the partitions of the RDD 
is your number of topic partitions . So if you Kafka topic has 10 partition , 
your RDD will have 10 partition. And that's the max parallelism while 
processing the RDD (unless you do re-partition which comes at a cost)  Whereas 
, in Receiver based model, the number of partitions is dictated by Block 
Intervals and Batch Interval. If your block interval is 200 Ms, and Batch 
interval is 10 seconds , your RDD will have 50 partitions !

I believe that seems to give much better parallelism while processing the RDD. 

Regarding the state of spark-packages code, your comment is not at good taste. 
There are many companies who think otherwise and use the spark packages 
consumer in production. 

As I said earlier, DirectStream is definitely the choice if one need "Exactly 
Once", but there are many who does not want "Exactly Once" and does not want 
the overhead of using DirectStream. unfortunately , for them the other 
alternatives also good enough which uses Kafka high level API. I am here trying 
to give a better alternatives in terms of a much better Receiver based approach.





> Contributing Receiver based Low Level Kafka Consumer from Spark-Packages to 
> Apache Spark Project
> 
>
> Key: SPARK-11045
> URL: https://issues.apache.org/jira/browse/SPARK-11045
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Dibyendu Bhattacharya
>
> This JIRA is to track the progress of making the Receiver based Low Level 
> Kafka Consumer from spark-packages 
> (http://spark-packages.org/package/dibbhatt/kafka-spark-consumer) to be 
> contributed back to Apache Spark Project.
> This Kafka consumer has been around for more than year and has matured over 
> the time . I see there are many adoptions of this package . I receive 
> positive feedbacks that this consumer gives better performance and fault 
> tolerant capabilities. 
> This is the primary intent of this JIRA to give community a better 
> alternative if they want to use Receiver Base model. 
> If this consumer make it to Spark Core, it will definitely see more adoption 
> and support from community and help many who still prefer the Receiver Based 
> model of Kafka Consumer. 
> I understand the Direct Stream is the consumer which can give Exact Once 
> semantics and uses Kafka Low Level API  , which is good . But Direct Stream 
> has concerns around recovering checkpoint on driver code change . Application 
> developer need to manage their own offset which complex . Even if some one 
> does manages their own offset , it limits the parallelism Spark Streaming can 
> achieve. If someone wants more parallelism and want 
> spark.streaming.concurrentJobs more than 1 , you can no longer rely on 
> storing offset externally as you have no control which batch will run in 
> which sequence. 
> Furthermore , the Direct Stream has higher 

[jira] [Comment Edited] (SPARK-11045) Contributing Receiver based Low Level Kafka Consumer from Spark-Packages to Apache Spark Project

2015-10-10 Thread Dibyendu Bhattacharya (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14951926#comment-14951926
 ] 

Dibyendu Bhattacharya edited comment on SPARK-11045 at 10/11/15 4:47 AM:
-

I agree, Sean; this space is already getting complicated, and my intention is not at all to make it more confusing.

What I see is that many customers are a little reluctant to use this consumer from spark-packages, thinking it will get less support. Because it lives in spark-packages, many do not even consider it for their use cases and instead use whatever Receiver based model is documented with Spark. For those who want to fall back to the Receiver based model, the out-of-the-box Spark receivers do not give them a good choice, and many customers do not know that a better choice exists in spark-packages.



was (Author: dibbhatt):
I agree Sean. this space is already getting complicated. My intention was not 
at all to make it more confusing. 

What I see is , many customer is little reluctant to use this consumer from 
spark-packages thinking that it will get less support . Being at 
spark-packages, many does not even consider it using in their use cases rather 
use the whatever Receiver Based model which is documented with Spark. I think 
those who wants to fall back to Receiver based model , Spark out of the box 
Receivers does not give them a better choice and not many customer do not know 
that a better choice exists in spark-packages.


> Contributing Receiver based Low Level Kafka Consumer from Spark-Packages to 
> Apache Spark Project
> 
>
> Key: SPARK-11045
> URL: https://issues.apache.org/jira/browse/SPARK-11045
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Dibyendu Bhattacharya
>
> This JIRA is to track the progress of making the Receiver based Low Level 
> Kafka Consumer from spark-packages 
> (http://spark-packages.org/package/dibbhatt/kafka-spark-consumer) to be 
> contributed back to Apache Spark Project.
> This Kafka consumer has been around for more than year and has matured over 
> the time . I see there are many adoptions of this package . I receive 
> positive feedbacks that this consumer gives better performance and fault 
> tolerant capabilities. 
> This is the primary intent of this JIRA to give community a better 
> alternative if they want to use Receiver Base model. 
> If this consumer make it to Spark Core, it will definitely see more adoption 
> and support from community and help many who still prefer the Receiver Based 
> model of Kafka Consumer. 
> I understand the Direct Stream is the consumer which can give Exact Once 
> semantics and uses Kafka Low Level API  , which is good . But Direct Stream 
> has concerns around recovering checkpoint on driver code change . Application 
> developer need to manage their own offset which complex . Even if some one 
> does manages their own offset , it limits the parallelism Spark Streaming can 
> achieve. If someone wants more parallelism and want 
> spark.streaming.concurrentJobs more than 1 , you can no longer rely on 
> storing offset externally as you have no control which batch will run in 
> which sequence. 
> Furthermore , the Direct Stream has higher latency , as it fetch messages 
> form Kafka during RDD action . Also number of RDD partitions are limited to 
> topic partition . So unless your Kafka topic does not have enough partitions, 
> you have limited parallelism while RDD processing. 
> Due to above mentioned concerns , many people who does not want Exactly Once 
> semantics , still prefer Receiver based model. Unfortunately, when customer 
> fall back to KafkaUtil.CreateStream approach, which use Kafka High Level 
> Consumer, there are other issues around the reliability of Kafka High Level 
> API.  Kafka High Level API is buggy and has serious issue around Consumer 
> Re-balance. Hence I do not think this is correct to advice people to use 
> KafkaUtil.CreateStream in production . 
> The better option presently is there is to use the Consumer from 
> spark-packages . It is is using Kafka Low Level Consumer API , store offset 
> in Zookeeper, and can recover from any failure . Below are few highlights of 
> this consumer  ..
> 1. It has a inbuilt PID Controller for dynamic rate limiting.
> 2. In this consumer ,  The Rate Limiting is done by modifying the size blocks 
> by controlling the size of messages pulled from Kafka. Whereas , in Spark the 
> Rate Limiting is done by controlling number of  messages. The issue with 
> throttling by number of message is, if message size various, block size will 
> also vary . Let say your Kafka has messages for different sizes from 10KB to 
> 500 KB. Thus throttling by number of 

[jira] [Created] (SPARK-11051) NullPointerException when action called on localCheckpointed RDD (that was checkpointed before)

2015-10-10 Thread Jacek Laskowski (JIRA)
Jacek Laskowski created SPARK-11051:
---

 Summary: NullPointerException when action called on 
localCheckpointed RDD (that was checkpointed before)
 Key: SPARK-11051
 URL: https://issues.apache.org/jira/browse/SPARK-11051
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.6.0
 Environment: Spark built from the sources as of today - Oct, 10th
Reporter: Jacek Laskowski


While toying with {{RDD.checkpoint}} and {{RDD.localCheckpoint}} methods, the 
following NullPointerException was thrown:

{code}
scala> lines.count
java.lang.NullPointerException
  at org.apache.spark.rdd.RDD.firstParent(RDD.scala:1587)
  at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1927)
  at org.apache.spark.rdd.RDD.count(RDD.scala:1115)
  ... 48 elided
{code}

To reproduce the issue do the following:

{code}
$ ./bin/spark-shell
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.0-SNAPSHOT
      /_/

Using Scala version 2.11.7 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_60)
Type in expressions to have them evaluated.
Type :help for more information.

scala> val lines = sc.textFile("README.md")
lines: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[1] at textFile at <console>:24

scala> sc.setCheckpointDir("checkpoints")

scala> lines.checkpoint

scala> lines.count
res2: Long = 98

scala> lines.localCheckpoint
15/10/10 22:59:20 WARN MapPartitionsRDD: RDD was already marked for reliable 
checkpointing: overriding with local checkpoint.
res4: lines.type = MapPartitionsRDD[1] at textFile at <console>:24

scala> lines.count
java.lang.NullPointerException
  at org.apache.spark.rdd.RDD.firstParent(RDD.scala:1587)
  at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1927)
  at org.apache.spark.rdd.RDD.count(RDD.scala:1115)
  ... 48 elided
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11052) Spaces in the build dir causes failures in the build/mvn script

2015-10-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11052:


Assignee: Apache Spark

> Spaces in the build dir causes failures in the build/mvn script
> ---
>
> Key: SPARK-11052
> URL: https://issues.apache.org/jira/browse/SPARK-11052
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.5.0, 1.5.1
>Reporter: Trystan Leftwich
>Assignee: Apache Spark
>
> If you are running make-distribution in a path that contains a space in it 
> the build/mvn script will fail:
> {code}
> mkdir /tmp/test\ spaces
> cd /tmp/test\ spaces
> git clone https://github.com/apache/spark.git
> cd spark
> ./make-distribution.sh --name spark-1.5-test4 --tgz -Pyarn 
> -Phive-thriftserver -Phive
> {code}
> You will get the following errors
> {code}
> /tmp/test spaces/spark/build/mvn: line 107: cd: /../lib: No such file or 
> directory
> usage: dirname path
> /tmp/test spaces/spark/build/mvn: line 108: cd: /../lib: No such file or 
> directory
> /tmp/test spaces/spark/build/mvn: line 138: /tmp/test: No such file or 
> directory
> /tmp/test spaces/spark/build/mvn: line 140: /tmp/test: No such file or 
> directory
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11052) Spaces in the build dir causes failures in the build/mvn script

2015-10-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11052:


Assignee: (was: Apache Spark)

> Spaces in the build dir causes failures in the build/mvn script
> ---
>
> Key: SPARK-11052
> URL: https://issues.apache.org/jira/browse/SPARK-11052
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.5.0, 1.5.1
>Reporter: Trystan Leftwich
>
> If you are running make-distribution in a path that contains a space in it 
> the build/mvn script will fail:
> {code}
> mkdir /tmp/test\ spaces
> cd /tmp/test\ spaces
> git clone https://github.com/apache/spark.git
> cd spark
> ./make-distribution.sh --name spark-1.5-test4 --tgz -Pyarn 
> -Phive-thriftserver -Phive
> {code}
> You will get the following errors
> {code}
> /tmp/test spaces/spark/build/mvn: line 107: cd: /../lib: No such file or 
> directory
> usage: dirname path
> /tmp/test spaces/spark/build/mvn: line 108: cd: /../lib: No such file or 
> directory
> /tmp/test spaces/spark/build/mvn: line 138: /tmp/test: No such file or 
> directory
> /tmp/test spaces/spark/build/mvn: line 140: /tmp/test: No such file or 
> directory
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11052) Spaces in the build dir causes failures in the build/mvn script

2015-10-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14952147#comment-14952147
 ] 

Apache Spark commented on SPARK-11052:
--

User 'trystanleftwich' has created a pull request for this issue:
https://github.com/apache/spark/pull/9065

> Spaces in the build dir causes failures in the build/mvn script
> ---
>
> Key: SPARK-11052
> URL: https://issues.apache.org/jira/browse/SPARK-11052
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.5.0, 1.5.1
>Reporter: Trystan Leftwich
>
> If you are running make-distribution in a path that contains a space in it 
> the build/mvn script will fail:
> {code}
> mkdir /tmp/test\ spaces
> cd /tmp/test\ spaces
> git clone https://github.com/apache/spark.git
> cd spark
> ./make-distribution.sh --name spark-1.5-test4 --tgz -Pyarn 
> -Phive-thriftserver -Phive
> {code}
> You will get the following errors
> {code}
> /tmp/test spaces/spark/build/mvn: line 107: cd: /../lib: No such file or 
> directory
> usage: dirname path
> /tmp/test spaces/spark/build/mvn: line 108: cd: /../lib: No such file or 
> directory
> /tmp/test spaces/spark/build/mvn: line 138: /tmp/test: No such file or 
> directory
> /tmp/test spaces/spark/build/mvn: line 140: /tmp/test: No such file or 
> directory
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10971) sparkR: RRunner should allow setting path to Rscript

2015-10-10 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14951696#comment-14951696
 ] 

Sun Rui commented on SPARK-10971:
-

I agree that it is more flexible to allow configuring the location of Rscript in both client and cluster modes. But I am not sure it makes sense to distribute R itself to the worker nodes with each job instead of having it installed on the worker nodes: the R binary is platform specific (and may require platform-specific installation steps), and shipping R binaries also has a performance cost.

> sparkR: RRunner should allow setting path to Rscript
> 
>
> Key: SPARK-10971
> URL: https://issues.apache.org/jira/browse/SPARK-10971
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.1
>Reporter: Thomas Graves
>
> I'm running spark on yarn and trying to use R in cluster mode. RRunner seems 
> to just call Rscript and assumes its in the path. But on our YARN deployment 
> R isn't installed on the nodes so it needs to be distributed along with the 
> job and we need the ability to point to where it gets installed. sparkR in 
> client mode has the config spark.sparkr.r.command to point to Rscript. 
> RRunner should have something similar so it works in cluster mode



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11045) Contributing Receiver based Low Level Kafka Consumer from Spark-Packages to Apache Spark Project

2015-10-10 Thread Dibyendu Bhattacharya (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dibyendu Bhattacharya updated SPARK-11045:
--
Affects Version/s: 1.5.1

> Contributing Receiver based Low Level Kafka Consumer from Spark-Packages to 
> Apache Spark Project
> 
>
> Key: SPARK-11045
> URL: https://issues.apache.org/jira/browse/SPARK-11045
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Affects Versions: 1.5.1
>Reporter: Dibyendu Bhattacharya
>
> This JIRA is to track the progress of making the Receiver based Low Level 
> Kafka Consumer from spark-packages 
> (http://spark-packages.org/package/dibbhatt/kafka-spark-consumer) to be 
> contributed back to Apache Spark Project.
> This Kafka consumer has been around for more than year and has matured over 
> the time . I see there are many adoptions of this package . I receive 
> positive feedbacks that this consumer gives better performance and fault 
> tolerant capabilities. 
> This is the primary intent of this JIRA to give community a better 
> alternative if they want to use Receiver Base model. 
> If this consumer make it to Spark Core, it will definitely see more adoption 
> and support from community and help many who still prefer the Receiver Based 
> model of Kafka Consumer. 
> I understand the Direct Stream is the consumer which can give Exact Once 
> semantics and uses Kafka Low Level API  , which is good . But Direct Stream 
> has concerns around recovering checkpoint on driver code change . Application 
> developer need to manage their own offset which complex . Even if some one 
> does manages their own offset , it limits the parallelism Spark Streaming can 
> achieve. If someone wants more parallelism and want 
> spark.streaming.concurrentJobs more than 1 , you can no longer rely on 
> storing offset externally as you have no control which batch will run in 
> which sequence. 
> Furthermore , the Direct Stream has higher latency , as it fetch messages 
> form Kafka during RDD action . Also number of RDD partitions are limited to 
> topic partition . So unless your Kafka topic does not have enough partitions, 
> you have limited parallelism while RDD processing. 
> Due to above mentioned concerns , many people who does not want Exactly Once 
> semantics , still prefer Receiver based model. Unfortunately, when customer 
> fall back to KafkaUtil.CreateStream approach, which use Kafka High Level 
> Consumer, there are other issues around the reliability of Kafka High Level 
> API.  Kafka High Level API is buggy and has serious issue around Consumer 
> Re-balance. Hence I do not think this is correct to advice people to use 
> KafkaUtil.CreateStream in production . 
> The better option presently is there is to use the Consumer from 
> spark-packages . It is is using Kafka Low Level Consumer API , store offset 
> in Zookeeper, and can recover from any failure . Below are few highlights of 
> this consumer  ..
> 1. It has a inbuilt PID Controller for dynamic rate limiting.
> 2. In this consumer ,  The Rate Limiting is done by modifying the size blocks 
> by controlling the size of messages pulled from Kafka. Whereas , in Spark the 
> Rate Limiting is done by controlling number of  messages. The issue with 
> throttling by number of message is, if message size various, block size will 
> also vary . Let say your Kafka has messages for different sizes from 10KB to 
> 500 KB. Thus throttling by number of message can never give any deterministic 
> size of your block hence there is no guarantee that Memory Back-Pressure can 
> really take affect. 
> 3. This consumer is using Kafka low level API which gives better performance 
> than KafkaUtils.createStream based High Level API.
> 4. This consumer can give end to end no data loss channel if enabled with WAL.
> By accepting this low level kafka consumer from spark packages to apache 
> spark project , we will give community a better options for Kafka 
> connectivity both for Receiver less and Receiver based model. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11048) Use ForkJoinPool as executorService

2015-10-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14951721#comment-14951721
 ] 

Apache Spark commented on SPARK-11048:
--

User 'tedyu' has created a pull request for this issue:
https://github.com/apache/spark/pull/9054

> Use ForkJoinPool as executorService
> ---
>
> Key: SPARK-11048
> URL: https://issues.apache.org/jira/browse/SPARK-11048
> Project: Spark
>  Issue Type: Improvement
>Reporter: Ted Yu
>Priority: Minor
>
> ForkJoinPool: threads are created only if there are waiting tasks. They 
> expire after 2seconds (it's
> hardcoded in the jdk code).
> ForkJoinPool is better than ThreadPoolExecutor
> It's available in the JDK 1.7



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11048) Use ForkJoinPool as executorService

2015-10-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11048:


Assignee: (was: Apache Spark)

> Use ForkJoinPool as executorService
> ---
>
> Key: SPARK-11048
> URL: https://issues.apache.org/jira/browse/SPARK-11048
> Project: Spark
>  Issue Type: Improvement
>Reporter: Ted Yu
>Priority: Minor
>
> ForkJoinPool: threads are created only if there are waiting tasks. They 
> expire after 2seconds (it's
> hardcoded in the jdk code).
> ForkJoinPool is better than ThreadPoolExecutor
> It's available in the JDK 1.7



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11048) Use ForkJoinPool as executorService

2015-10-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11048:


Assignee: Apache Spark

> Use ForkJoinPool as executorService
> ---
>
> Key: SPARK-11048
> URL: https://issues.apache.org/jira/browse/SPARK-11048
> Project: Spark
>  Issue Type: Improvement
>Reporter: Ted Yu
>Assignee: Apache Spark
>Priority: Minor
>
> ForkJoinPool: threads are created only if there are waiting tasks. They 
> expire after 2seconds (it's
> hardcoded in the jdk code).
> ForkJoinPool is better than ThreadPoolExecutor
> It's available in the JDK 1.7



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11048) Use ForkJoinPool as executorService

2015-10-10 Thread Ted Yu (JIRA)
Ted Yu created SPARK-11048:
--

 Summary: Use ForkJoinPool as executorService
 Key: SPARK-11048
 URL: https://issues.apache.org/jira/browse/SPARK-11048
 Project: Spark
  Issue Type: Improvement
Reporter: Ted Yu
Priority: Minor


ForkJoinPool creates threads only when there are waiting tasks, and idle threads expire after 2 seconds (hardcoded in the JDK).
ForkJoinPool is a better fit than ThreadPoolExecutor, and it has been available since JDK 1.7.
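A minimal, self-contained sketch (plain JDK usage, not Spark code) showing that a ForkJoinPool can stand in for a fixed thread pool behind the same ExecutorService interface:

{code}
import java.util.concurrent.{ExecutorService, Executors, ForkJoinPool, TimeUnit}

// Both pools are driven through ExecutorService, so call sites do not change.
val forkJoin: ExecutorService = new ForkJoinPool(8)            // workers are spawned lazily and time out when idle
val fixed: ExecutorService = Executors.newFixedThreadPool(8)   // workers live until shutdown

forkJoin.submit(new Runnable {
  override def run(): Unit = println("ran on " + Thread.currentThread().getName)
})

forkJoin.shutdown()
forkJoin.awaitTermination(10, TimeUnit.SECONDS)
fixed.shutdown()
{code}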



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2309) Generalize the binary logistic regression into multinomial logistic regression

2015-10-10 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14951744#comment-14951744
 ] 

Sean Owen commented on SPARK-2309:
--

Yeah, I also don't get it. In multinomial LR you still have the same features for every output class. The slide you point to just shows a loss function summed over all N classes rather than one, but the features are the same. Implicitly, if an example is in class k then it is not in the other classes.
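For reference, a standard way to write that multinomial loss (notation is mine, not taken from the linked slides): the negative log-likelihood over N examples and K classes, with one weight vector w_k per class but the same feature vector x_i in every term:

{noformat}
L(w) = - \sum_{i=1}^{N} \left[ w_{y_i}^\top x_i - \log \sum_{k=1}^{K} \exp\left(w_k^\top x_i\right) \right]
{noformat}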

> Generalize the binary logistic regression into multinomial logistic regression
> --
>
> Key: SPARK-2309
> URL: https://issues.apache.org/jira/browse/SPARK-2309
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Critical
> Fix For: 1.3.0
>
>
> Currently, there is no multi-class classifier in mllib. Logistic regression 
> can be extended to multinomial one straightforwardly. 
> The following formula will be implemented. 
> http://www.slideshare.net/dbtsai/2014-0620-mlor-36132297/25



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11025) Exception key can't be empty at getSystemProperties function in utils

2015-10-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-11025.
---
Resolution: Not A Problem

Personally I would rather not change the behavior: it is a change, and it seems to allow a mistake that should be caught straight away. If I type "-D foo=bar" I'd rather figure out my mistake now than spend hours working out why foo is not set to bar.

> Exception key can't be empty at getSystemProperties function in utils 
> --
>
> Key: SPARK-11025
> URL: https://issues.apache.org/jira/browse/SPARK-11025
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.0, 1.3.1, 1.4.0, 1.4.1, 1.5.1
>Reporter: Stavros Kontopoulos
>Priority: Trivial
>  Labels: easyfix, easytest
>
> At file core/src/main/scala/org/apache/spark/util/Utils.scala
> getSystemProperties function fails when someone passes -D to the jvm and as a 
> result the key passed is "" (empty).
> Exception thrown: java.lang.IllegalArgumentException: key can't be empty
> Empty keys should be ignored or just passed them without filtering at that 
> level as in previous versions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11045) Contributing Receiver based Low Level Kafka Consumer from Spark-Packages to Apache Spark Project

2015-10-10 Thread Dibyendu Bhattacharya (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dibyendu Bhattacharya updated SPARK-11045:
--
Description: 
This JIRA is to track the progress of contributing the Receiver based Low Level Kafka Consumer from spark-packages (http://spark-packages.org/package/dibbhatt/kafka-spark-consumer) back to the Apache Spark project.

This Kafka consumer has been around for more than a year and has matured over that time. I see many adoptions of this package, and I receive positive feedback that it gives better performance and fault tolerance.

The primary intent of this JIRA is to give the community a better alternative if they want to use the Receiver based model.

If this consumer makes it into Spark Core, it will definitely see more adoption and support from the community and help the many users who still prefer the Receiver based model of Kafka consumer.

I understand that Direct Stream is the consumer that can give Exactly Once semantics and uses the Kafka Low Level API, which is good. But Direct Stream has concerns around recovering the checkpoint on a driver code change. Application developers need to manage their own offsets, which is complex, and even if someone does manage their own offsets, it limits the parallelism Spark Streaming can achieve: if you want more parallelism and set spark.streaming.concurrentJobs to more than 1, you can no longer rely on storing offsets externally, because you have no control over which batch will run in which sequence.

Furthermore, Direct Stream has higher latency, as it fetches messages from Kafka during the RDD action. Also, the number of RDD partitions is limited to the number of topic partitions, so unless your Kafka topic has enough partitions, you have limited parallelism while processing the RDD.

Due to the above concerns, many people who do not need Exactly Once semantics still prefer the Receiver based model. Unfortunately, when customers fall back to the KafkaUtils.createStream approach, which uses the Kafka High Level Consumer, there are other issues around the reliability of the Kafka High Level API: it is buggy and has a serious issue around consumer re-balance. Hence I do not think it is correct to advise people to use KafkaUtils.createStream in production.

The better option presently is to use the consumer from spark-packages. It uses the Kafka Low Level Consumer API, stores offsets in ZooKeeper, and can recover from any failure. Below are a few highlights of this consumer:

1. It has an inbuilt PID controller for dynamic rate limiting.

2. In this consumer, rate limiting is done by controlling the size of the blocks, i.e. the volume of data pulled from Kafka, whereas in Spark rate limiting is done by controlling the number of messages. The issue with throttling by message count is that if message sizes vary, block sizes will also vary. Say your Kafka topic has messages ranging from 10 KB to 500 KB: throttling by message count can never give a deterministic block size, so there is no guarantee that memory back-pressure can really take effect.

3. This consumer uses the Kafka low level API, which gives better performance than the KafkaUtils.createStream based High Level API.

4. This consumer can provide an end-to-end no-data-loss channel if enabled with the WAL.

By accepting this low level Kafka consumer from spark-packages into the Apache Spark project, we will give the community better options for Kafka connectivity, for both the receiver-less and the Receiver based models.
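To illustrate point 2, a minimal sketch (illustrative numbers only, not code from either consumer) of why throttling by message count gives non-deterministic block sizes when message sizes vary, while throttling by a byte budget keeps blocks bounded:

{code}
// Hypothetical per-message sizes (KB) arriving in one block interval; the 10 KB - 500 KB spread is an assumption.
val messageSizesKb = Seq(10, 480, 25, 500, 60, 15, 300, 40, 220, 90)

// Count-based throttling: admit a fixed number of messages; the resulting block size depends on whatever arrives.
val maxMessagesPerBlock = 5
val countThrottledBlockKb = messageSizesKb.take(maxMessagesPerBlock).sum  // 1075 KB here, but could be ~50 KB or ~2500 KB

// Size-based throttling: keep admitting messages until a byte budget is reached; the block size stays bounded.
val maxBlockKb = 600
val cumulativeKb = messageSizesKb.scanLeft(0)(_ + _).tail
val sizeThrottledBlockKb = cumulativeKb.takeWhile(_ <= maxBlockKb).lastOption.getOrElse(0)

println(s"count-throttled block: $countThrottledBlockKb KB, size-throttled block: $sizeThrottledBlockKb KB")
{code}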


  was:
This JIRA is to track the progress of making the Receiver based Low Level Kafka 
Consumer from spark-packages 
(http://spark-packages.org/package/dibbhatt/kafka-spark-consumer) to be 
contributed back to Apache Spark Project.

This Kafka consumer has been around for more than year and has matured over the 
time . I see there are many adoptions of this package . I receive positive 
feedbacks that this consumer gives better performance and fault tolerant 
capabilities. 

This is the primary intent of this JIRA to give community a better alternative 
if they want to use Receiver Base model. 

If this consumer make it to Spark Core, it will definitely see more adoption 
and support from community and help many who still prefer the Receiver Based 
model of Kafka Consumer. 

I understand the Direct Stream is the consumer which can give Exact Once 
semantics and uses Kafka Low Level API  , which is good . But Direct Stream has 
issues around recovering checkpoint on driver code change . Application 
developer need to manage their own offset which complex . Even if some one does 
manages their own offset , it limits the parallelism Spark Streaming can 
achieve. If someone wants more parallelism and want 
spark.streaming.concurrentJobs more than 1 , you can no longer rely on storing 
offset 

[jira] [Assigned] (SPARK-11044) Parquet writer version fixed as version1

2015-10-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11044:


Assignee: (was: Apache Spark)

> Parquet writer version fixed as version1
> 
>
> Key: SPARK-11044
> URL: https://issues.apache.org/jira/browse/SPARK-11044
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> Spark only writes the parquet files with writer version1 ignoring given 
> configuration.
> It should let users choose the writer version. (remaining the default as 
> version1).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11044) Parquet writer version fixed as version1

2015-10-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14951660#comment-14951660
 ] 

Apache Spark commented on SPARK-11044:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/9060

> Parquet writer version fixed as version1
> 
>
> Key: SPARK-11044
> URL: https://issues.apache.org/jira/browse/SPARK-11044
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> Spark only writes the parquet files with writer version1 ignoring given 
> configuration.
> It should let users choose the writer version. (remaining the default as 
> version1).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11044) Parquet writer version fixed as version1

2015-10-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11044:


Assignee: Apache Spark

> Parquet writer version fixed as version1
> 
>
> Key: SPARK-11044
> URL: https://issues.apache.org/jira/browse/SPARK-11044
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Minor
>
> Spark only writes the parquet files with writer version1 ignoring given 
> configuration.
> It should let users choose the writer version. (remaining the default as 
> version1).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11047) Internal accumulators miss the internal flag when replaying events in the history server

2015-10-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11047:


Assignee: (was: Apache Spark)

> Internal accumulators miss the internal flag when replaying events in the 
> history server
> 
>
> Key: SPARK-11047
> URL: https://issues.apache.org/jira/browse/SPARK-11047
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1
>Reporter: Carson Wang
>
> Internal accumulators don't write the internal flag to event log. So on the 
> history server Web UI, all accumulators are not internal. This causes 
> incorrect peak execution memory and unwanted accumulator table displayed on 
> the stage page.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11047) Internal accumulators miss the internal flag when replaying events in the history server

2015-10-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14951679#comment-14951679
 ] 

Apache Spark commented on SPARK-11047:
--

User 'carsonwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/9061

> Internal accumulators miss the internal flag when replaying events in the 
> history server
> 
>
> Key: SPARK-11047
> URL: https://issues.apache.org/jira/browse/SPARK-11047
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1
>Reporter: Carson Wang
>
> Internal accumulators don't write the internal flag to event log. So on the 
> history server Web UI, all accumulators are not internal. This causes 
> incorrect peak execution memory and unwanted accumulator table displayed on 
> the stage page.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11047) Internal accumulators miss the internal flag when replaying events in the history server

2015-10-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11047:


Assignee: Apache Spark

> Internal accumulators miss the internal flag when replaying events in the 
> history server
> 
>
> Key: SPARK-11047
> URL: https://issues.apache.org/jira/browse/SPARK-11047
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1
>Reporter: Carson Wang
>Assignee: Apache Spark
>
> Internal accumulators don't write the internal flag to event log. So on the 
> history server Web UI, all accumulators are not internal. This causes 
> incorrect peak execution memory and unwanted accumulator table displayed on 
> the stage page.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10968) Incorrect Join behavior in filter conditions

2015-10-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-10968.
---
Resolution: Not A Problem

FWIW I also agree that this does not look like a problem. The values being 
compared have different types. You're suggesting a higher level semantic 
interpretation that I don't think is what is expected of the analyzer.
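As a concrete illustration (a minimal spark-shell sketch, not taken from the attached reproduction), two structs with the same leaf names but a different field order are simply different types, which is why the analyzer rejects the comparison:

{code}
import org.apache.spark.sql.types._

val a = StructType(Seq(StructField("col1", StringType), StructField("col2", StringType)))
val b = StructType(Seq(StructField("col2", StringType), StructField("col1", StringType)))

println(a == b)  // false: field order is part of the struct type, so the analyzer sees two different types
{code}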

> Incorrect Join behavior in filter conditions
> 
>
> Key: SPARK-10968
> URL: https://issues.apache.org/jira/browse/SPARK-10968
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.4.1, 1.5.1
> Environment: RHEL, spark-shell
>Reporter: RaviShankar KS
>  Labels: DataFramejoin, sql,
> Attachments: CreateDF_sparkshell_jira.scala
>
>
> We notice that the join conditions are not working as expected in the case of 
> nested columns being compared.
> As long as leaf columns have the same name under a nested column, should 
> order matter ??
> Consider below example for two data frames d5 and d5_opp : 
> d5 and d5_opp have a nested field 'value', but their inner leaf columns do 
> not have the same ordering. 
> --   d5.printSchema
> root
>  |-- key: integer (nullable = false)
>  |-- value: array (nullable = true)
>  ||-- element: struct (containsNull = true)
>  |||-- col1: string (nullable = true)
>  |||-- col2: string (nullable = true)
>  |-- value1: struct (nullable = false)
>  ||-- col1: string (nullable = false)
>  ||-- col2: string (nullable = false)
> --d5_opp.printSchema
> root
>  |-- key: integer (nullable = false)
>  |-- value: array (nullable = true)
>  ||-- element: struct (containsNull = true)
>  |||-- col2: string (nullable = true)
>  |||-- col1: string (nullable = true)
>  |-- value1: struct (nullable = false)
>  ||-- col2: string (nullable = false)
>  ||-- col1: string (nullable = false)
> The below join statement do not work in spark 1.5, and raises exception. In 
> spark 1.4, no exception is raised, but join result is incorrect :
> --d5.as("d5").join( d5_opp.as("d5_opp"),  $"d5.value"  === 
> $"d5_opp.value",  "inner").show
> Exception raised is :  
> org.apache.spark.sql.AnalysisException: cannot resolve '(value = value)' due 
> to data type mismatch: differing types in '(value = value)' 
> (array<struct<col1:string,col2:string>> and 
> array<struct<col2:string,col1:string>>).;
> --d5.as("d5").join( d5_opp.as("d5_opp"),  $"d5.value1"  === 
> $"d5_opp.value1",  "inner").show
> Exception raised is :
> org.apache.spark.sql.AnalysisException: cannot resolve '(value1 = value1)' 
> due to data type mismatch: differing types in '(value1 = value1)' 
> (struct<col1:string,col2:string> and struct<col2:string,col1:string>).;
> // Code to be used in spark shell to create the data frames is attached.
> -
> The only work-around is to explode the conditions for every leaf field. 
> In our case, we are generating the conditions and dataframes 
> programmatically, and exploding the conditions for every leaf field is 
> additional overhead, and may not be always possible.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10977) SQL injection bugs in JdbcUtils and DataFrameWriter

2015-10-10 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14951752#comment-14951752
 ] 

Sean Owen commented on SPARK-10977:
---

Cool, are you working on a PR?

> SQL injection bugs in JdbcUtils and DataFrameWriter
> ---
>
> Key: SPARK-10977
> URL: https://issues.apache.org/jira/browse/SPARK-10977
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Rick Hillegas
>Priority: Minor
>
> SPARK-10857 identifies a SQL injection bug in the JDBC dialect code. A 
> similar SQL injection bug can be found in 2 places in JdbcUtils and another 
> place in DataFrameWriter:
> {noformat}
> The DROP TABLE logic in JdbcUtils concatenates boilerplate with a 
> user-supplied string:
> def dropTable(conn: Connection, table: String): Unit = {
> conn.prepareStatement(s"DROP TABLE $table").executeUpdate()
>   }
> Same for the INSERT logic in JdbcUtils:
> def insertStatement(conn: Connection, table: String, rddSchema: StructType): 
> PreparedStatement = {
> val sql = new StringBuilder(s"INSERT INTO $table VALUES (")
> var fieldsLeft = rddSchema.fields.length
> while (fieldsLeft > 0) {
>   sql.append("?")
>   if (fieldsLeft > 1) sql.append(", ") else sql.append(")")
>   fieldsLeft = fieldsLeft - 1
> }
> conn.prepareStatement(sql.toString())
>   }
> Same for the CREATE TABLE logic in DataFrameWriter:
>   def jdbc(url: String, table: String, connectionProperties: Properties): 
> Unit = {
>...
>
> if (!tableExists) {
> val schema = JdbcUtils.schemaString(df, url)
> val sql = s"CREATE TABLE $table ($schema)"
> conn.prepareStatement(sql).executeUpdate()
>   }
>...
>   }
> {noformat}
> Maybe we can find a common solution to all of these SQL injection bugs. 
> Something like this:
> 1) Parse the user-supplied table name into a table identifier and an optional 
> schema identifier. We can borrow logic from org.apache.derby.iapi.util.IdUtil 
> in order to do this.
> 2) Double-quote (and escape as necessary) the schema and table identifiers so 
> that the database interprets them as delimited ids.
> That should prevent the SQL injection attacks.
> With this solution, if the user specifies table names like cityTable and 
> trafficSchema.congestionTable, then the generated DROP TABLE statements would 
> be
> {noformat}
> DROP TABLE "CITYTABLE"
> DROP TABLE "TRAFFICSCHEMA"."CONGESTIONTABLE"
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11047) Internal accumulators miss the internal flag when replaying events in the history server

2015-10-10 Thread Carson Wang (JIRA)
Carson Wang created SPARK-11047:
---

 Summary: Internal accumulators miss the internal flag when 
replaying events in the history server
 Key: SPARK-11047
 URL: https://issues.apache.org/jira/browse/SPARK-11047
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.5.1
Reporter: Carson Wang


Internal accumulators do not write their internal flag to the event log, so on the history server Web UI none of the accumulators are treated as internal. This causes an incorrect peak execution memory value and an unwanted accumulator table on the stage page.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11045) Contributing Receiver based Low Level Kafka Consumer from Spark-Packages to Apache Spark Project

2015-10-10 Thread Dibyendu Bhattacharya (JIRA)
Dibyendu Bhattacharya created SPARK-11045:
-

 Summary: Contributing Receiver based Low Level Kafka Consumer from 
Spark-Packages to Apache Spark Project
 Key: SPARK-11045
 URL: https://issues.apache.org/jira/browse/SPARK-11045
 Project: Spark
  Issue Type: New Feature
  Components: Streaming
Reporter: Dibyendu Bhattacharya


This JIRA is to track the progress of contributing the Receiver based Low Level Kafka Consumer from spark-packages (http://spark-packages.org/package/dibbhatt/kafka-spark-consumer) back to the Apache Spark project.

This Kafka consumer has been around for more than a year and has matured over that time. I see many adoptions of this package, and I receive positive feedback that it gives better performance and fault tolerance.

The primary intent of this JIRA is to give the community a better alternative if they want to use the Receiver based model.

If this consumer makes it into Spark Core, it will definitely see more adoption and support from the community and help the many users who still prefer the Receiver based model of Kafka consumer.

I understand that Direct Stream is the consumer that can give Exactly Once semantics and uses the Kafka Low Level API, which is good. But Direct Stream has issues around recovering the checkpoint on a driver code change. Application developers need to manage their own offsets, which is complex, and even if someone does manage their own offsets, it limits the parallelism Spark Streaming can achieve: if you want more parallelism and set spark.streaming.concurrentJobs to more than 1, you can no longer rely on storing offsets externally, because you have no control over which batch will run in which sequence.

Furthermore, Direct Stream has higher latency, as it fetches messages from Kafka during the RDD action. Also, the number of RDD partitions is limited to the number of topic partitions, so unless your Kafka topic has enough partitions, you have limited parallelism while processing the RDD.

Due to the above concerns, many people who do not need Exactly Once semantics still prefer the Receiver based model. Unfortunately, when customers fall back to the KafkaUtils.createStream approach, which uses the Kafka High Level Consumer, there are other issues around the reliability of the Kafka High Level API: it is buggy and has a serious issue around consumer re-balance. Hence I do not think it is correct to advise people to use KafkaUtils.createStream in production.

The better option presently is to use the consumer from spark-packages. It uses the Kafka Low Level Consumer API, stores offsets in ZooKeeper, and can recover from any failure. Below are a few highlights of this consumer:

1. It has an inbuilt PID controller for dynamic rate limiting.

2. In this consumer, rate limiting is done by controlling the size of the blocks, i.e. the volume of data pulled from Kafka, whereas in Spark rate limiting is done by controlling the number of messages. The issue with throttling by message count is that if message sizes vary, block sizes will also vary. Say your Kafka topic has messages ranging from 10 KB to 500 KB: throttling by message count can never give a deterministic block size, so there is no guarantee that memory back-pressure can really take effect.

3. This consumer uses the Kafka low level API, which gives better performance than the KafkaUtils.createStream based High Level API.

4. This consumer can provide an end-to-end no-data-loss channel if enabled with the WAL.

By accepting this low level Kafka consumer from spark-packages into the Apache Spark project, we will give the community better options for Kafka connectivity, for both the receiver-less and the Receiver based models.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11046) Pass schema from R to JVM using JSON format

2015-10-10 Thread Sun Rui (JIRA)
Sun Rui created SPARK-11046:
---

 Summary: Pass schema from R to JVM using JSON format
 Key: SPARK-11046
 URL: https://issues.apache.org/jira/browse/SPARK-11046
 Project: Spark
  Issue Type: Improvement
  Components: SparkR
Affects Versions: 1.5.1
Reporter: Sun Rui
Priority: Minor


Currently, SparkR passes a DataFrame schema from R to the JVM backend using a regular-expression based string encoding. However, Spark now supports schemas in JSON format, so SparkR should be enhanced to pass the schema as JSON.
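On the JVM side, a minimal sketch of the JSON schema round-trip this would rely on (plain Spark SQL types API; the SparkR serialization path itself is not shown):

{code}
import org.apache.spark.sql.types._

// Build a schema and render it as JSON, the format the R side would send.
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = false)))
val json: String = schema.json

// Parse it back into a StructType on the JVM side.
val parsed = DataType.fromJson(json).asInstanceOf[StructType]
println(parsed.fieldNames.mkString(", "))  // name, age
{code}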



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4226) SparkSQL - Add support for subqueries in predicates

2015-10-10 Thread Naden Franciscus (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14951663#comment-14951663
 ] 

Naden Franciscus commented on SPARK-4226:
-

Is there a reason this pull request can't be merged in?

It has been open for a long time, and it is such a critical feature for us.

> SparkSQL - Add support for subqueries in predicates
> ---
>
> Key: SPARK-4226
> URL: https://issues.apache.org/jira/browse/SPARK-4226
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.2.0
> Environment: Spark 1.2 snapshot
>Reporter: Terry Siu
>
> I have a test table defined in Hive as follows:
> {code:sql}
> CREATE TABLE sparkbug (
>   id INT,
>   event STRING
> ) STORED AS PARQUET;
> {code}
> and insert some sample data with ids 1, 2, 3.
> In a Spark shell, I then create a HiveContext and then execute the following 
> HQL to test out subquery predicates:
> {code}
> val hc = HiveContext(hc)
> hc.hql("select customerid from sparkbug where customerid in (select 
> customerid from sparkbug where customerid in (2,3))")
> {code}
> I get the following error:
> {noformat}
> java.lang.RuntimeException: Unsupported language features in query: select 
> customerid from sparkbug where customerid in (select customerid from sparkbug 
> where customerid in (2,3))
> TOK_QUERY
>   TOK_FROM
> TOK_TABREF
>   TOK_TABNAME
> sparkbug
>   TOK_INSERT
> TOK_DESTINATION
>   TOK_DIR
> TOK_TMP_FILE
> TOK_SELECT
>   TOK_SELEXPR
> TOK_TABLE_OR_COL
>   customerid
> TOK_WHERE
>   TOK_SUBQUERY_EXPR
> TOK_SUBQUERY_OP
>   in
> TOK_QUERY
>   TOK_FROM
> TOK_TABREF
>   TOK_TABNAME
> sparkbug
>   TOK_INSERT
> TOK_DESTINATION
>   TOK_DIR
> TOK_TMP_FILE
> TOK_SELECT
>   TOK_SELEXPR
> TOK_TABLE_OR_COL
>   customerid
> TOK_WHERE
>   TOK_FUNCTION
> in
> TOK_TABLE_OR_COL
>   customerid
> 2
> 3
> TOK_TABLE_OR_COL
>   customerid
> scala.NotImplementedError: No parse rules for ASTNode type: 817, text: 
> TOK_SUBQUERY_EXPR :
> TOK_SUBQUERY_EXPR
>   TOK_SUBQUERY_OP
> in
>   TOK_QUERY
> TOK_FROM
>   TOK_TABREF
> TOK_TABNAME
>   sparkbug
> TOK_INSERT
>   TOK_DESTINATION
> TOK_DIR
>   TOK_TMP_FILE
>   TOK_SELECT
> TOK_SELEXPR
>   TOK_TABLE_OR_COL
> customerid
>   TOK_WHERE
> TOK_FUNCTION
>   in
>   TOK_TABLE_OR_COL
> customerid
>   2
>   3
>   TOK_TABLE_OR_COL
> customerid
> " +
>  
> org.apache.spark.sql.hive.HiveQl$.nodeToExpr(HiveQl.scala:1098)
> 
> at scala.sys.package$.error(package.scala:27)
> at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:252)
> at 
> org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:50)
> at 
> org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:49)
> at 
> scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
> {noformat}
> [This 
> thread|http://apache-spark-user-list.1001560.n3.nabble.com/Subquery-in-having-clause-Spark-1-1-0-td17401.html]
>  also brings up lack of subquery support in SparkSQL. It would be nice to 
> have subquery predicate support in a near, future release (1.3, maybe?).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7099) Floating point literals cannot be specified using exponent

2015-10-10 Thread kevin yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14951661#comment-14951661
 ] 

kevin yu commented on SPARK-7099:
-

Hello Ryan: I tried a similar query using exponent notation with spark-submit and spark-shell on 1.5.1, and both worked. Can you try it on 1.5 or the latest version?
scala> sqlContext.sql("LOAD DATA LOCAL INPATH 
'sql/hive/src/test/resources/data/files/kv1.txt' INTO TABLE src")
res2: org.apache.spark.sql.DataFrame = [result: string]

scala> sqlContext.sql("FROM src select key WHERE key = 1E6 
").collect().foreach(println)

scala> sqlContext.sql("FROM src select key WHERE key < 1E6 
").collect().foreach(println)
[238]
[86]
[311]
[27]
[165]
[409]
[255]
[278]
[98]
 

> Floating point literals cannot be specified using exponent
> --
>
> Key: SPARK-7099
> URL: https://issues.apache.org/jira/browse/SPARK-7099
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.1
> Environment: Windows, Linux, Mac OS X
>Reporter: Peter Hagelund
>Priority: Minor
>
> Floating point literals cannot be expressed in scientific notation using an 
> exponent, like e.g. 1.23E4.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10079) Make `column` and `col` functions be S4 functions

2015-10-10 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-10079.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8864
[https://github.com/apache/spark/pull/8864]

> Make `column` and `col` functions be S4 functions
> -
>
> Key: SPARK-10079
> URL: https://issues.apache.org/jira/browse/SPARK-10079
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Yu Ishikawa
> Fix For: 1.6.0
>
>
> {{column}} and {{col}} function at {{R/pkg/R/Column.R}} are currently defined 
> as S3 functions. I think it would be better to define them as S4 functions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10079) Make `column` and `col` functions be S4 functions

2015-10-10 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-10079:
--
Assignee: Sun Rui

> Make `column` and `col` functions be S4 functions
> -
>
> Key: SPARK-10079
> URL: https://issues.apache.org/jira/browse/SPARK-10079
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Yu Ishikawa
>Assignee: Sun Rui
> Fix For: 1.6.0
>
>
> {{column}} and {{col}} function at {{R/pkg/R/Column.R}} are currently defined 
> as S3 functions. I think it would be better to define them as S4 functions.






[jira] [Assigned] (SPARK-10876) display total application time in spark history UI

2015-10-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10876:


Assignee: (was: Apache Spark)

> display total application time in spark history UI
> --
>
> Key: SPARK-10876
> URL: https://issues.apache.org/jira/browse/SPARK-10876
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.5.1
>Reporter: Thomas Graves
>
> The history file has application start and application end events. It would 
> be nice if we could use these to display the total run time for the 
> application in the history UI.
> It could be displayed similarly to "Total Uptime" for a running application.






[jira] [Commented] (SPARK-10876) display total application time in spark history UI

2015-10-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14951650#comment-14951650
 ] 

Apache Spark commented on SPARK-10876:
--

User 'jbonofre' has created a pull request for this issue:
https://github.com/apache/spark/pull/9059
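
Independent of what the linked pull request does, the idea can be illustrated 
with a small listener that measures the gap between the start and end events 
(a hedged sketch only; {{AppDurationListener}} is a made-up name, and the real 
feature would more likely compute this in the history server from the replayed 
event log rather than from a live listener):

{code}
// Hedged sketch: derive total application time from the listener events.
// Assumes the standard SparkListener API; the class name is hypothetical.
import org.apache.spark.scheduler.{SparkListener, SparkListenerApplicationStart, SparkListenerApplicationEnd}

class AppDurationListener extends SparkListener {
  @volatile private var startTime: Long = -1L

  override def onApplicationStart(start: SparkListenerApplicationStart): Unit = {
    startTime = start.time
  }

  override def onApplicationEnd(end: SparkListenerApplicationEnd): Unit = {
    if (startTime > 0) {
      val durationMs = end.time - startTime
      println(s"Total application time: ${durationMs / 1000.0} s")
    }
  }
}

// Registration, e.g. right after creating the SparkContext:
// sc.addSparkListener(new AppDurationListener)
{code}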

> display total application time in spark history UI
> --
>
> Key: SPARK-10876
> URL: https://issues.apache.org/jira/browse/SPARK-10876
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.5.1
>Reporter: Thomas Graves
>
> The history file has application start and application end events. It would 
> be nice if we could use these to display the total run time for the 
> application in the history UI.
> It could be displayed similarly to "Total Uptime" for a running application.






[jira] [Assigned] (SPARK-10876) display total application time in spark history UI

2015-10-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10876:


Assignee: Apache Spark

> display total application time in spark history UI
> --
>
> Key: SPARK-10876
> URL: https://issues.apache.org/jira/browse/SPARK-10876
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.5.1
>Reporter: Thomas Graves
>Assignee: Apache Spark
>
> The history file has application start and application end events. It would 
> be nice if we could use these to display the total run time for the 
> application in the history UI.
> It could be displayed similarly to "Total Uptime" for a running application.






[jira] [Commented] (SPARK-6726) Model export/import for spark.ml: LogisticRegression

2015-10-10 Thread Wenjian Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14951769#comment-14951769
 ] 

Wenjian Huang commented on SPARK-6726:
--

I have a question: I noticed that a BinaryLogisticRegressionTrainingSummary is 
attached to the LogisticRegressionModel. When we export the model, should we 
also export this summary?
Also, do we have a general interface for model export/import? If I want to 
contribute to this issue, can I just follow the save/load interface of MLlib?
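
For reference, the spark.mllib save/load interface mentioned above looks 
roughly like this (a sketch only; whether spark.ml should mirror the 
{{Saveable}}/{{Loader}} pattern is exactly the open question, and the paths and 
data here are illustrative):

{code}
// Hedged sketch of the existing spark.mllib save/load pattern, for comparison.
import org.apache.spark.mllib.classification.{LogisticRegressionModel, LogisticRegressionWithLBFGS}
import org.apache.spark.mllib.util.MLUtils

val training = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
val model = new LogisticRegressionWithLBFGS().run(training)

// Saveable.save writes the model data (Parquet) plus JSON metadata under the path.
model.save(sc, "/tmp/lr-model")

// The companion object's load reads the metadata and reconstructs the model.
val sameModel = LogisticRegressionModel.load(sc, "/tmp/lr-model")
{code}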

> Model export/import for spark.ml: LogisticRegression
> 
>
> Key: SPARK-6726
> URL: https://issues.apache.org/jira/browse/SPARK-6726
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>







[jira] [Updated] (SPARK-10772) NullPointerException when transform function in DStream returns NULL

2015-10-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-10772:
--
Assignee: Jack Hu

> NullPointerException when transform function in DStream returns NULL 
> -
>
> Key: SPARK-10772
> URL: https://issues.apache.org/jira/browse/SPARK-10772
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Jack Hu
>Assignee: Jack Hu
>Priority: Minor
> Fix For: 1.6.0
>
>
> A NullPointerException is raised when the transform function returns NULL:
> {quote}
> java.lang.NullPointerException
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$clearMetadata$3.apply(DStream.scala:442)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$clearMetadata$3.apply(DStream.scala:441)
>   at 
> scala.collection.mutable.HashMap$$anon$2$$anonfun$foreach$3.apply(HashMap.scala:107)
>   at 
> scala.collection.mutable.HashMap$$anon$2$$anonfun$foreach$3.apply(HashMap.scala:107)
>   at 
> scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
>   at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
>   at scala.collection.mutable.HashMap$$anon$2.foreach(HashMap.scala:107)
>   at 
> org.apache.spark.streaming.dstream.DStream.clearMetadata(DStream.scala:441)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$clearMetadata$5.apply(DStream.scala:454)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$clearMetadata$5.apply(DStream.scala:454)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.streaming.dstream.DStream.clearMetadata(DStream.scala:454)
>   at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$clearMetadata$2.apply(DStreamGraph.scala:129)
>   at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$clearMetadata$2.apply(DStreamGraph.scala:129)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> org.apache.spark.streaming.DStreamGraph.clearMetadata(DStreamGraph.scala:129)
>   at 
> org.apache.spark.streaming.scheduler.JobGenerator.clearMetadata(JobGenerator.scala:257)
>   at 
> org.apache.spark.streaming.scheduler.JobGenerator.org$apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:178)
>   at 
> org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:83)
>   at 
> org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:82)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> {quote}
> The code is very simple: 
> {code}
> val sc = new SparkContext(conf)
> val sqlContext = new HiveContext(sc)
> import sqlContext.implicits._
> println(">>> create streamingContext.")
> val ssc = new StreamingContext(sc, Seconds(1))
> ssc.queueStream(
> Queue(
> sc.makeRDD(Seq(1)), 
> sc.makeRDD(Seq[Int]()), 
> sc.makeRDD(Seq(2))
> ), true).transform(rdd => if (rdd.isEmpty()) rdd else 
> null).print
> ssc.start()
> {code}
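
Until the fix is in a release, a workaround is to make sure the transform 
function never returns null; returning an empty RDD of the same element type 
works instead (a hedged sketch reusing the {{sc}}/{{ssc}} setup from the 
snippet above; the doubling in the else branch is just illustrative work):

{code}
import scala.collection.mutable.Queue

// Hedged workaround sketch: "skip" a batch by returning an empty RDD, never null.
ssc.queueStream(
  Queue(
    sc.makeRDD(Seq(1)),
    sc.makeRDD(Seq[Int]()),
    sc.makeRDD(Seq(2))
  ), oneAtATime = true
).transform { rdd =>
  if (rdd.isEmpty()) rdd.sparkContext.emptyRDD[Int]  // instead of null
  else rdd.map(_ * 2)                                // any real work goes here
}.print()

ssc.start()
{code}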






[jira] [Resolved] (SPARK-10772) NullPointerException when transform function in DStream returns NULL

2015-10-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-10772.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8881
[https://github.com/apache/spark/pull/8881]

> NullPointerException when transform function in DStream returns NULL 
> -
>
> Key: SPARK-10772
> URL: https://issues.apache.org/jira/browse/SPARK-10772
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Jack Hu
>Priority: Minor
> Fix For: 1.6.0
>
>
> A NullPointerException is raised when the transform function returns NULL:
> {quote}
> java.lang.NullPointerException
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$clearMetadata$3.apply(DStream.scala:442)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$clearMetadata$3.apply(DStream.scala:441)
>   at 
> scala.collection.mutable.HashMap$$anon$2$$anonfun$foreach$3.apply(HashMap.scala:107)
>   at 
> scala.collection.mutable.HashMap$$anon$2$$anonfun$foreach$3.apply(HashMap.scala:107)
>   at 
> scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
>   at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
>   at scala.collection.mutable.HashMap$$anon$2.foreach(HashMap.scala:107)
>   at 
> org.apache.spark.streaming.dstream.DStream.clearMetadata(DStream.scala:441)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$clearMetadata$5.apply(DStream.scala:454)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$clearMetadata$5.apply(DStream.scala:454)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.streaming.dstream.DStream.clearMetadata(DStream.scala:454)
>   at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$clearMetadata$2.apply(DStreamGraph.scala:129)
>   at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$clearMetadata$2.apply(DStreamGraph.scala:129)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> org.apache.spark.streaming.DStreamGraph.clearMetadata(DStreamGraph.scala:129)
>   at 
> org.apache.spark.streaming.scheduler.JobGenerator.clearMetadata(JobGenerator.scala:257)
>   at 
> org.apache.spark.streaming.scheduler.JobGenerator.org$apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:178)
>   at 
> org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:83)
>   at 
> org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:82)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> {quote}
> The code is very simple: 
> {code}
> val sc = new SparkContext(conf)
> val sqlContext = new HiveContext(sc)
> import sqlContext.implicits._
> println(">>> create streamingContext.")
> val ssc = new StreamingContext(sc, Seconds(1))
> ssc.queueStream(
> Queue(
> sc.makeRDD(Seq(1)), 
> sc.makeRDD(Seq[Int]()), 
> sc.makeRDD(Seq(2))
> ), true).transform(rdd => if (rdd.isEmpty()) rdd else 
> null).print
> ssc.start()
> {code}






[jira] [Updated] (SPARK-11045) Contributing Receiver based Low Level Kafka Consumer from Spark-Packages to Apache Spark Project

2015-10-10 Thread Dibyendu Bhattacharya (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dibyendu Bhattacharya updated SPARK-11045:
--
Affects Version/s: (was: 1.5.1)

> Contributing Receiver based Low Level Kafka Consumer from Spark-Packages to 
> Apache Spark Project
> 
>
> Key: SPARK-11045
> URL: https://issues.apache.org/jira/browse/SPARK-11045
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Dibyendu Bhattacharya
>
> This JIRA is to track the progress of making the Receiver based Low Level 
> Kafka Consumer from spark-packages 
> (http://spark-packages.org/package/dibbhatt/kafka-spark-consumer) to be 
> contributed back to Apache Spark Project.
> This Kafka consumer has been around for more than year and has matured over 
> the time . I see there are many adoptions of this package . I receive 
> positive feedbacks that this consumer gives better performance and fault 
> tolerant capabilities. 
> This is the primary intent of this JIRA to give community a better 
> alternative if they want to use Receiver Base model. 
> If this consumer make it to Spark Core, it will definitely see more adoption 
> and support from community and help many who still prefer the Receiver Based 
> model of Kafka Consumer. 
> I understand the Direct Stream is the consumer which can give Exact Once 
> semantics and uses Kafka Low Level API  , which is good . But Direct Stream 
> has concerns around recovering checkpoint on driver code change . Application 
> developer need to manage their own offset which complex . Even if some one 
> does manages their own offset , it limits the parallelism Spark Streaming can 
> achieve. If someone wants more parallelism and want 
> spark.streaming.concurrentJobs more than 1 , you can no longer rely on 
> storing offset externally as you have no control which batch will run in 
> which sequence. 
> Furthermore , the Direct Stream has higher latency , as it fetch messages 
> form Kafka during RDD action . Also number of RDD partitions are limited to 
> topic partition . So unless your Kafka topic does not have enough partitions, 
> you have limited parallelism while RDD processing. 
> Due to above mentioned concerns , many people who does not want Exactly Once 
> semantics , still prefer Receiver based model. Unfortunately, when customer 
> fall back to KafkaUtil.CreateStream approach, which use Kafka High Level 
> Consumer, there are other issues around the reliability of Kafka High Level 
> API.  Kafka High Level API is buggy and has serious issue around Consumer 
> Re-balance. Hence I do not think this is correct to advice people to use 
> KafkaUtil.CreateStream in production . 
> The better option presently is there is to use the Consumer from 
> spark-packages. It uses the Kafka Low Level Consumer API, stores offsets in 
> Zookeeper, and can recover from any failure. Below are a few highlights of 
> this consumer:
> 1. It has an inbuilt PID controller for dynamic rate limiting.
> 2. In this consumer, rate limiting is done by controlling the size of the 
> blocks, i.e. the volume of data pulled from Kafka, whereas in Spark rate 
> limiting is done by controlling the number of messages. The issue with 
> throttling by message count is that if message sizes vary, block sizes will 
> also vary. Say your Kafka topic has messages ranging from 10 KB to 500 KB: 
> throttling by message count can never give a deterministic block size, so 
> there is no guarantee that memory back-pressure can really take effect.
> 3. This consumer uses the Kafka low level API, which gives better performance 
> than the KafkaUtils.createStream based High Level API.
> 4. This consumer can provide an end-to-end no-data-loss channel when enabled 
> with the WAL.
> By accepting this low level Kafka consumer from spark-packages into the 
> Apache Spark project, we will give the community better options for Kafka 
> connectivity, for both the receiver-less and receiver-based models.






[jira] [Commented] (SPARK-11016) Spark fails when running with a task that requires a more recent version of RoaringBitmaps

2015-10-10 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14951772#comment-14951772
 ] 

Sean Owen commented on SPARK-11016:
---

Yes, I get that RoaringBitmap has a particular serialization mechanism that 
Kryo has to be taught to use. I think the answer to my dumb question was: yes, 
Spark still uses RoaringBitmap, so it has to ensure Kryo knows how to 
serialize it, including registering serializers. You're doing the right thing, 
then.

These classes implement Externalizable but not Serializable; otherwise 
KryoJavaSerializer could be registered to delegate to the correct, custom Java 
serialization these classes define. I wonder if we can build a 
KryoJavaExternalizableSerializer to do something similar automatically? That 
would be tidy. Then we would just need to register the RoaringBitmap classes 
to use it.

Otherwise, if it has to be bridged by hand, it seems possible, since 
DataOutputStream is a DataOutput that can wrap an OutputStream, and you can 
get an OutputStream from Kryo's Output.
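
A rough sketch of that idea, just to make it concrete (class and registration 
names are hypothetical and this is untested):

{code}
// Hedged sketch: a Kryo serializer that delegates to Externalizable's
// writeExternal/readExternal. This works because Kryo's Output/Input are
// java.io streams, so they can back Object{Output,Input}Stream.
import java.io.{Externalizable, ObjectInputStream, ObjectOutputStream}
import com.esotericsoftware.kryo.{Kryo, Serializer}
import com.esotericsoftware.kryo.io.{Input, Output}

class ExternalizableSerializer[T <: Externalizable] extends Serializer[T] {
  override def write(kryo: Kryo, output: Output, obj: T): Unit = {
    val oos = new ObjectOutputStream(output)
    obj.writeExternal(oos)
    oos.flush()
  }

  override def read(kryo: Kryo, input: Input, clazz: Class[T]): T = {
    val instance = clazz.newInstance()  // Externalizable requires a no-arg constructor
    instance.readExternal(new ObjectInputStream(input))
    instance
  }
}

// Registration would then look roughly like:
// kryo.register(classOf[org.roaringbitmap.RoaringBitmap],
//   new ExternalizableSerializer[org.roaringbitmap.RoaringBitmap])
{code}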

> Spark fails when running with a task that requires a more recent version of 
> RoaringBitmaps
> --
>
> Key: SPARK-11016
> URL: https://issues.apache.org/jira/browse/SPARK-11016
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.0
>Reporter: Charles Allen
>
> The following error appears during Kryo init whenever a more recent version 
> (>0.5.0) of RoaringBitmap is required by a job: 
> org/roaringbitmap/RoaringArray$Element was removed in 0.5.0.
> {code}
> A needed class was not found. This could be due to an error in your runpath. 
> Missing class: org/roaringbitmap/RoaringArray$Element
> java.lang.NoClassDefFoundError: org/roaringbitmap/RoaringArray$Element
>   at 
> org.apache.spark.serializer.KryoSerializer$.(KryoSerializer.scala:338)
>   at 
> org.apache.spark.serializer.KryoSerializer$.(KryoSerializer.scala)
>   at 
> org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:93)
>   at 
> org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:237)
>   at 
> org.apache.spark.serializer.KryoSerializerInstance.(KryoSerializer.scala:222)
>   at 
> org.apache.spark.serializer.KryoSerializer.newInstance(KryoSerializer.scala:138)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:201)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:102)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast.(TorrentBroadcast.scala:85)
>   at 
> org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
>   at 
> org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:63)
>   at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1318)
>   at 
> org.apache.spark.SparkContext$$anonfun$hadoopFile$1.apply(SparkContext.scala:1006)
>   at 
> org.apache.spark.SparkContext$$anonfun$hadoopFile$1.apply(SparkContext.scala:1003)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
>   at org.apache.spark.SparkContext.withScope(SparkContext.scala:700)
>   at org.apache.spark.SparkContext.hadoopFile(SparkContext.scala:1003)
>   at 
> org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:818)
>   at 
> org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:816)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
>   at org.apache.spark.SparkContext.withScope(SparkContext.scala:700)
>   at org.apache.spark.SparkContext.textFile(SparkContext.scala:816)
> {code}
> See https://issues.apache.org/jira/browse/SPARK-5949 for related info


