[jira] [Created] (SPARK-6802) User Defined Aggregate Function Refactoring
cynepia created SPARK-6802: -- Summary: User Defined Aggregate Function Refactoring Key: SPARK-6802 URL: https://issues.apache.org/jira/browse/SPARK-6802 Project: Spark Issue Type: Improvement Components: PySpark, SQL Environment: We use Spark DataFrame, SQL along with json, sql and pandas quite a bit Reporter: cynepia While trying to use custom aggregates in Spark (something which is common in pandas), we realized that custom aggregate functions aren't well supported across various features/functions in Spark beyond what is supported by Hive. There are further discussions on the topic vis-à-vis SPARK-3947, which points to similar improvement tickets opened earlier for refactoring the UDAF area. While refactoring the interface for aggregates, it would make sense to take into consideration the recently added DataFrame, GroupedData, and possibly also sql.dataframe.Column, which looks different from pandas.Series and doesn't currently support any aggregations. We would like to get feedback from the folks who are actively looking at this, and would be happy to participate and contribute if there are any discussions on the same. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
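For context, a minimal Scala sketch of what built-in DataFrame aggregation supports today versus the custom-aggregate hook this ticket asks for; the column names and the commented-out call are hypothetical, not an existing API:

{code}
import org.apache.spark.sql.DataFrame

// Assumes df has (hypothetical) columns "department" and "salary".
def builtInVsCustom(df: DataFrame): Unit = {
  // Works today: built-in aggregate functions on GroupedData.
  df.groupBy("department").agg(Map("salary" -> "avg"))

  // What the ticket asks for (hypothetical, does not exist): plugging an
  // arbitrary user-defined aggregate into the same API, the way pandas
  // allows df.groupby("department").agg(my_custom_aggregate).
  // df.groupBy("department").agg(myCustomAggregate(df("salary")))
}
{code}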
[jira] [Commented] (SPARK-6028) Provide an alternative RPC implementation based on the network transport module
[ https://issues.apache.org/jira/browse/SPARK-6028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487329#comment-14487329 ] Andrea Peruffo commented on SPARK-6028: --- The main blocking problem in the implementation appears to be org.apache.spark.deploy.client.AppClient, which needs to be heavily refactored in order to avoid the use of Akka; an abstract remote naming convention also needs to be made available outside the Akka one (maybe hidden in the RpcEnv?). Also, the RpcEnv object is private and there is no way to provide an alternative implementation of the protocol because rpcEnvNames are hard-coded; in this case I would suggest using an assignable implicit argument instead of loading the class through reflection. Provide an alternative RPC implementation based on the network transport module --- Key: SPARK-6028 URL: https://issues.apache.org/jira/browse/SPARK-6028 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Reynold Xin The network transport module implements a low-level RPC interface. We can build a new RPC implementation on top of that to replace Akka's. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6799) Add dataframe examples for SparkR
[ https://issues.apache.org/jira/browse/SPARK-6799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-6799: - Target Version/s: 1.4.0 Add dataframe examples for SparkR - Key: SPARK-6799 URL: https://issues.apache.org/jira/browse/SPARK-6799 Project: Spark Issue Type: Improvement Components: SparkR Reporter: Shivaram Venkataraman Priority: Critical We should add more data frame usage examples for SparkR . This can be similar to the python examples at https://github.com/apache/spark/blob/1b2aab8d5b9cc2ff702506038bd71aa8debe7ca0/examples/src/main/python/sql.py -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6772) spark sql error when running code on large number of records
[ https://issues.apache.org/jira/browse/SPARK-6772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14486853#comment-14486853 ] Aditya Parmar commented on SPARK-6772: -- Please find the code below:

{code}
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.api.java.DataType;
import org.apache.spark.sql.api.java.JavaSchemaRDD;
import org.apache.spark.sql.api.java.Row;
import org.apache.spark.sql.api.java.StructField;
import org.apache.spark.sql.api.java.StructType;
import org.apache.spark.sql.hive.api.java.JavaHiveContext;

public class engineshow {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("Engine");
    JavaSparkContext sc = new JavaSparkContext(conf);
    JavaHiveContext hContext = new JavaHiveContext(sc);

    String sch;
    List<StructField> fields;
    StructType schema;
    JavaRDD<Row> rowRDD;
    JavaRDD<String> input;
    JavaSchemaRDD[] inputs = new JavaSchemaRDD[2];

    sch = "a b c d e f g h i"; // input file schema
    input = sc.textFile("/home/aditya/stocks1.csv");
    fields = new ArrayList<StructField>();
    for (String fieldName : sch.split(" ")) {
      fields.add(DataType.createStructField(fieldName, DataType.StringType, true));
    }
    schema = DataType.createStructType(fields);
    rowRDD = input.map(new Function<String, Row>() {
      public Row call(String record) throws Exception {
        String[] fields = record.split(",");
        Object[] fields_converted = fields;
        return Row.create(fields_converted);
      }
    });
    inputs[0] = hContext.applySchema(rowRDD, schema);
    inputs[0].registerTempTable("comp1");

    sch = "a b";
    fields = new ArrayList<StructField>();
    for (String fieldName : sch.split(" ")) {
      fields.add(DataType.createStructField(fieldName, DataType.StringType, true));
    }
    schema = DataType.createStructType(fields);
    inputs[1] = hContext.sql("select a,b from comp1");
    inputs[1].saveAsTextFile("/home/aditya/outputog");
  }
}
{code}

spark sql error when running code on large number of records Key: SPARK-6772 URL: https://issues.apache.org/jira/browse/SPARK-6772 Project: Spark Issue Type: Test Components: SQL Affects Versions: 1.2.0 Reporter: Aditya Parmar Hi all, I am getting an ArrayIndexOutOfBoundsException when I try to run a simple column-filtering query on a file with 2.5 lakh (250,000) records. It runs fine on a file with 2k records.
{code} 15/04/08 16:54:06 INFO TaskSetManager: Starting task 0.1 in stage 0.0 (TID 3, blrwfl11189.igatecorp.com, PROCESS_LOCAL, 1350 bytes) 15/04/08 16:54:06 INFO TaskSetManager: Lost task 1.1 in stage 0.0 (TID 2) on executor blrwfl11189.igatecorp.com: java.lang.ArrayIndexOutOfBoundsException (null) [duplicate 1] 15/04/08 16:54:06 INFO TaskSetManager: Starting task 1.2 in stage 0.0 (TID 4, blrwfl11189.igatecorp.com, PROCESS_LOCAL, 1350 bytes) 15/04/08 16:54:06 INFO TaskSetManager: Lost task 1.2 in stage 0.0 (TID 4) on executor blrwfl11189.igatecorp.com: java.lang.ArrayIndexOutOfBoundsException (null) [duplicate 2] 15/04/08 16:54:06 INFO TaskSetManager: Starting task 1.3 in stage 0.0 (TID 5, blrwfl11189.igatecorp.com, PROCESS_LOCAL, 1350 bytes) 15/04/08 16:54:06 INFO TaskSetManager: Lost task 0.1 in stage 0.0 (TID 3) on executor blrwfl11189.igatecorp.com: java.lang.ArrayIndexOutOfBoundsException (null) [duplicate 3] 15/04/08 16:54:06 INFO TaskSetManager: Starting task 0.2 in stage 0.0 (TID 6, blrwfl11189.igatecorp.com, PROCESS_LOCAL, 1350 bytes) 15/04/08 16:54:06 INFO TaskSetManager: Lost task 1.3 in stage 0.0 (TID 5) on executor blrwfl11189.igatecorp.com: java.lang.ArrayIndexOutOfBoundsException (null) [duplicate 4] 15/04/08 16:54:06 ERROR TaskSetManager: Task 1 in stage 0.0 failed 4 times; aborting job 15/04/08 16:54:06 INFO TaskSchedulerImpl: Cancelling stage 0 15/04/08 16:54:06 INFO TaskSchedulerImpl: Stage 0 was cancelled 15/04/08 16:54:06 INFO DAGScheduler: Job 0 failed: saveAsTextFile at JavaSchemaRDD.scala:42, took 1.914477 s Exception in thread main org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in
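One plausible cause (an assumption on my part, not confirmed in the report): String.split(",") drops trailing empty strings, so rows in the larger file with missing trailing columns produce fewer fields than the 9-column schema, and Row access then throws ArrayIndexOutOfBoundsException. A minimal Scala sketch of a guard, with the column count assumed from the schema above:

{code}
import org.apache.spark.sql.Row

// Hypothetical sketch, not the reporter's code: always build rows of the
// expected width. split with limit -1 keeps trailing empty strings, and
// padTo/take force the field count to match the schema.
val numColumns = 9  // assumed width of the "a b c d e f g h i" schema

def toRow(record: String): Row = {
  val fields = record.split(",", -1)
  Row.fromSeq(fields.padTo(numColumns, "").take(numColumns))
}
{code}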
[jira] [Assigned] (SPARK-2352) [MLLIB] Add Artificial Neural Network (ANN) to Spark
[ https://issues.apache.org/jira/browse/SPARK-2352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-2352: --- Assignee: Apache Spark (was: Bert Greevenbosch) [MLLIB] Add Artificial Neural Network (ANN) to Spark Key: SPARK-2352 URL: https://issues.apache.org/jira/browse/SPARK-2352 Project: Spark Issue Type: New Feature Components: MLlib Environment: MLLIB code Reporter: Bert Greevenbosch Assignee: Apache Spark It would be good if the Machine Learning Library contained Artificial Neural Networks (ANNs). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-765) Test suite should run Spark example programs
[ https://issues.apache.org/jira/browse/SPARK-765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487060#comment-14487060 ] Yu Ishikawa commented on SPARK-765: --- I'm very sorry. I was able to run a test in spark.examples; the problem was that my IntelliJ setting was wrong. So we don't need to add the dependency. Thanks. Test suite should run Spark example programs Key: SPARK-765 URL: https://issues.apache.org/jira/browse/SPARK-765 Project: Spark Issue Type: New Feature Components: Examples Reporter: Josh Rosen The Spark test suite should also run each of the Spark example programs (the PySpark suite should do the same). This should be done through a shell script or other mechanism to simulate the environment setup used by end users that run those scripts. This would prevent problems like SPARK-764 from making it into releases. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6806) SparkR examples in programming guide
[ https://issues.apache.org/jira/browse/SPARK-6806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu reassigned SPARK-6806: - Assignee: Davies Liu SparkR examples in programming guide Key: SPARK-6806 URL: https://issues.apache.org/jira/browse/SPARK-6806 Project: Spark Issue Type: New Feature Components: Documentation, SparkR Reporter: Davies Liu Assignee: Davies Liu Priority: Blocker Add R examples for Spark Core and DataFrame programming guide -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6801) spark-submit is not able to identify Alive Master in case of multiple master
[ https://issues.apache.org/jira/browse/SPARK-6801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6801. -- Resolution: Duplicate spark-submit is not able to identify Alive Master in case of multiple master Key: SPARK-6801 URL: https://issues.apache.org/jira/browse/SPARK-6801 Project: Spark Issue Type: Bug Affects Versions: 1.3.0 Reporter: pankaj Hi, while submitting an application using the command /bin/spark-submit --class SparkAggregator.java --deploy-mode cluster --supervise --master spark://host1:7077 I get the error "Can only accept driver submissions in ALIVE state. Current state: STANDBY." If I try giving all possible masters in --master, as in /bin/spark-submit --class SparkAggregator.java --deploy-mode cluster --supervise --master spark://host1:port1,host2:port2, it doesn't accept that. Thanks, Pankaj -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3947) [Spark SQL] UDAF Support
[ https://issues.apache.org/jira/browse/SPARK-3947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487093#comment-14487093 ] cynepia commented on SPARK-3947: Can someone give an update on where we stand on this issue? Also, would this be supported beyond SQL, for DataFrames? [Spark SQL] UDAF Support Key: SPARK-3947 URL: https://issues.apache.org/jira/browse/SPARK-3947 Project: Spark Issue Type: Improvement Components: SQL Reporter: Pei-Lun Lee Assignee: Venkata Ramana G Right now only Hive UDAFs are supported. It would be nice to have UDAFs, similar to UDFs, registered through SQLContext.registerFunction. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
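For readers unfamiliar with the existing hook: scalar UDF registration already works roughly as below, and the request is for an analogous path for aggregates. A hedged Scala sketch; the registerAggregate call and the "people" table are purely hypothetical:

{code}
import org.apache.spark.sql.SQLContext

def example(sqlContext: SQLContext): Unit = {
  // Existing: scalar UDFs can be registered and then used from SQL text.
  sqlContext.udf.register("toUpper", (s: String) => s.toUpperCase)
  sqlContext.sql("SELECT toUpper(name) FROM people")  // assumes a registered "people" table

  // Requested (hypothetical API, does not exist): the same style of
  // registration for aggregates.
  // sqlContext.udf.registerAggregate("myAvg", new MyAverageUDAF)
}
{code}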
[jira] [Updated] (SPARK-6811) Building binary R packages for SparkR
[ https://issues.apache.org/jira/browse/SPARK-6811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-6811: - Target Version/s: 1.4.0 Building binary R packages for SparkR - Key: SPARK-6811 URL: https://issues.apache.org/jira/browse/SPARK-6811 Project: Spark Issue Type: New Feature Components: SparkR Reporter: Shivaram Venkataraman Priority: Blocker We should figure out how to distribute binary packages for SparkR as a part of the release process. R packages for Windows might need to be built separately and we could offer a separate download link for Windows users. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6804) System.exit(1) on error
[ https://issues.apache.org/jira/browse/SPARK-6804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6804. -- Resolution: Duplicate System.exit(1) on error --- Key: SPARK-6804 URL: https://issues.apache.org/jira/browse/SPARK-6804 Project: Spark Issue Type: Improvement Reporter: Alberto We are developing a web application that is using Spark under the hood. Testing our app, we have found out that when our Spark master is not up and running and we try to connect to it, Spark kills our app. We've been having a look at the code and have noticed that the TaskSchedulerImpl class just kills the JVM, so our web application is obviously also killed. See the following code snippet I am talking about:

{code}
else {
  // No task sets are active but we still got an error. Just exit since this
  // must mean the error is during registration.
  // It might be good to do something smarter here in the future.
  logError("Exiting due to error from cluster scheduler: " + message)
  System.exit(1)
}
{code}

IMHO this code should not invoke System.exit(1). Instead, it should throw an exception so applications will be able to handle the error. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
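A minimal sketch of the alternative the reporter suggests, assuming the surrounding scheduler code can let exceptions propagate; this is illustrative only, not the actual Spark patch:

{code}
import org.apache.spark.SparkException

// Illustrative: surface the scheduler error to the caller instead of
// terminating the JVM, so an embedding application (e.g. a web server)
// can catch it and stay alive.
def handleSchedulerError(message: String): Unit = {
  // logError("Error from cluster scheduler: " + message)
  throw new SparkException("Error from cluster scheduler: " + message)
}
{code}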
[jira] [Created] (SPARK-6811) Building binary R packages for SparkR
Shivaram Venkataraman created SPARK-6811: Summary: Building binary R packages for SparkR Key: SPARK-6811 URL: https://issues.apache.org/jira/browse/SPARK-6811 Project: Spark Issue Type: New Feature Components: SparkR Reporter: Shivaram Venkataraman Priority: Blocker We should figure out how to distribute binary packages for SparkR as a part of the release process. R packages for Windows might need to be built separately and we could offer a separate download link for Windows users. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6812) filter() on DataFrame does not work as expected
Davies Liu created SPARK-6812: - Summary: filter() on DataFrame does not work as expected Key: SPARK-6812 URL: https://issues.apache.org/jira/browse/SPARK-6812 Project: Spark Issue Type: Bug Components: SparkR Reporter: Davies Liu Priority: Blocker

{code}
filter(df, df$age > 21)
Error in filter(df, df$age > 21) : no method for coercing this S4 class to a vector
{code}

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6229) Support SASL encryption in network/common module
[ https://issues.apache.org/jira/browse/SPARK-6229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487934#comment-14487934 ] Marcelo Vanzin commented on SPARK-6229: --- Hi Jeff, Those kinds of differences are exactly why I think all this channel setup code should be handled by config options instead of having client code register bootstraps and other things explicitly. I went with the latter approach to keep the changes smaller (and even then they're still pretty intrusive). Aaron doesn't seem to be a fan of that (config-based) approach, though. For the SSL case, I haven't really looked at your code nor am I familiar with how to use SSL in netty. Do you have to set it up as soon as the server starts listening? (In which case you'd need some new `createServer()` method for SSL.) Or can it be set up after the connection happens? (In which case, the TransportServerBootstrap interface I added should suffice - just register the handler when doBootstrap() is called.) Support SASL encryption in network/common module Key: SPARK-6229 URL: https://issues.apache.org/jira/browse/SPARK-6229 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Marcelo Vanzin After SASL support has been added to network/common, supporting encryption should be rather simple. Encryption is supported for DIGEST-MD5 and GSSAPI. Since the latter requires a valid kerberos login to work (and so doesn't really work with executors), encryption would require the use of DIGEST-MD5. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3947) [Spark SQL] UDAF Support
[ https://issues.apache.org/jira/browse/SPARK-3947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487222#comment-14487222 ] cynepia commented on SPARK-3947: Hi Takeshi San, Thanks for the quick response. I would like to know if there are any active discussions on the topic. While we refactor the interface for aggregates, we should keep in mind the DataFrame, GroupedData, and possibly also sql.dataframe.Column, which looks different from pandas.Series and doesn't currently support any aggregations. We would be happy to participate and contribute if there are any discussions on the same. [Spark SQL] UDAF Support Key: SPARK-3947 URL: https://issues.apache.org/jira/browse/SPARK-3947 Project: Spark Issue Type: Improvement Components: SQL Reporter: Pei-Lun Lee Assignee: Venkata Ramana G Right now only Hive UDAFs are supported. It would be nice to have UDAFs, similar to UDFs, registered through SQLContext.registerFunction. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6691) Abstract and add a dynamic RateLimiter for Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-6691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487793#comment-14487793 ] Tathagata Das commented on SPARK-6691: -- Yes, I completely agree with the motivation. There is a need for backpressure. I have a design myself but had not written up a design doc. I think looking through your design will help me get more concrete ideas, and together we can come up with a good design. Thanks :) Abstract and add a dynamic RateLimiter for Spark Streaming -- Key: SPARK-6691 URL: https://issues.apache.org/jira/browse/SPARK-6691 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.3.0 Reporter: Saisai Shao Flow control (or rate control) for input data is very important in a streaming system, especially for Spark Streaming to stay stable and up-to-date. An unexpected flood of incoming data, or an ingestion rate beyond the computation power of the cluster, will make the system unstable and increase the delay. With Spark Streaming’s job generation and processing pattern, this delay accumulates and introduces unacceptable exceptions. Currently, in Spark Streaming’s receiver-based input stream, there is a RateLimiter in BlockGenerator which controls the ingestion rate of input data, but the current implementation has several limitations: # The max ingestion rate is set by the user through configuration beforehand; the user may lack the experience to set an appropriate value before the application is running. # This configuration is fixed through the lifetime of the application, which means you need to consider the worst scenario to set a reasonable configuration. # Input streams like DirectKafkaInputStream need to maintain another solution to achieve the same functionality. # The lack of slow-start control makes the whole system easily trapped into large processing and scheduling delays at the very beginning. So here we propose a new dynamic RateLimiter, as well as a new interface for the RateLimiter, to improve the whole system's stability. The targets are: * Dynamically adjust the ingestion rate according to the processing rate of previously finished jobs. * Offer a uniform solution not only for receiver-based input streams, but also for direct streams like DirectKafkaInputStream and new ones. * Slow-start the rate to control network congestion when the job is started. * A pluggable framework to make maintaining extensions easier. Here is the design doc (https://docs.google.com/document/d/1lqJDkOYDh_9hRLQRwqvBXcbLScWPmMa7MlG8J_TE93w/edit?usp=sharing) and working branch (https://github.com/jerryshao/apache-spark/tree/dynamic-rate-limiter). Any comments would be greatly appreciated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6816) Add SparkConf API to configure SparkR
Shivaram Venkataraman created SPARK-6816: Summary: Add SparkConf API to configure SparkR Key: SPARK-6816 URL: https://issues.apache.org/jira/browse/SPARK-6816 Project: Spark Issue Type: New Feature Components: SparkR Reporter: Shivaram Venkataraman Priority: Minor Right now the only way to configure SparkR is to pass in arguments to sparkR.init. The goal is to add an API similar to SparkConf on Scala/Python to make configuration easier -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6691) Abstract and add a dynamic RateLimiter for Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-6691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14486980#comment-14486980 ] Tathagata Das commented on SPARK-6691: -- [~jerryshao] This is a very good first attempt! However, from my experience in rate-control matters, it can be quite tricky to balance the stability of the system with high throughput. So any attempt needs to be made carefully, with some basic theoretical analysis of how the system will behave in different scenarios (e.g. when processing load increases with the same data rate or vice versa, or when cluster size decreases due to failures, etc.). I would like to see that sort of analysis in the design doc. Beyond that, I will spend time thinking about the pros and cons of this approach. Nonetheless, thank you for initiating this. Abstract and add a dynamic RateLimiter for Spark Streaming -- Key: SPARK-6691 URL: https://issues.apache.org/jira/browse/SPARK-6691 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.3.0 Reporter: Saisai Shao Flow control (or rate control) for input data is very important in a streaming system, especially for Spark Streaming to stay stable and up-to-date. An unexpected flood of incoming data, or an ingestion rate beyond the computation power of the cluster, will make the system unstable and increase the delay. With Spark Streaming’s job generation and processing pattern, this delay accumulates and introduces unacceptable exceptions. Currently, in Spark Streaming’s receiver-based input stream, there is a RateLimiter in BlockGenerator which controls the ingestion rate of input data, but the current implementation has several limitations: # The max ingestion rate is set by the user through configuration beforehand; the user may lack the experience to set an appropriate value before the application is running. # This configuration is fixed through the lifetime of the application, which means you need to consider the worst scenario to set a reasonable configuration. # Input streams like DirectKafkaInputStream need to maintain another solution to achieve the same functionality. # The lack of slow-start control makes the whole system easily trapped into large processing and scheduling delays at the very beginning. So here we propose a new dynamic RateLimiter, as well as a new interface for the RateLimiter, to improve the whole system's stability. The targets are: * Dynamically adjust the ingestion rate according to the processing rate of previously finished jobs. * Offer a uniform solution not only for receiver-based input streams, but also for direct streams like DirectKafkaInputStream and new ones. * Slow-start the rate to control network congestion when the job is started. * A pluggable framework to make maintaining extensions easier. Here is the design doc (https://docs.google.com/document/d/1lqJDkOYDh_9hRLQRwqvBXcbLScWPmMa7MlG8J_TE93w/edit?usp=sharing) and working branch (https://github.com/jerryshao/apache-spark/tree/dynamic-rate-limiter). Any comments would be greatly appreciated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
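A purely illustrative Scala sketch of the kind of pluggable, dynamically adjusted rate limiter the proposal describes; the interface below is my own simplification for discussion, not the one in the linked design doc:

{code}
// Illustrative only: a minimal pluggable rate limiter that adapts its
// maximum ingestion rate to the observed processing rate of finished batches.
trait RateLimiter {
  def currentRate: Double  // records per second currently allowed
  def updateRate(processedRecords: Long, processingTimeMs: Long): Unit
}

class DynamicRateLimiter(initialRate: Double, maxRate: Double) extends RateLimiter {
  private var rate = initialRate  // a small initial value gives slow-start behaviour

  override def currentRate: Double = rate

  override def updateRate(processedRecords: Long, processingTimeMs: Long): Unit = {
    val observed = processedRecords * 1000.0 / math.max(processingTimeMs, 1)
    // Move halfway toward the observed processing capacity, capped by a hard limit.
    rate = math.min(maxRate, (rate + observed) / 2.0)
  }
}
{code}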
[jira] [Created] (SPARK-6819) Support nested types in SparkR DataFrame
Shivaram Venkataraman created SPARK-6819: Summary: Support nested types in SparkR DataFrame Key: SPARK-6819 URL: https://issues.apache.org/jira/browse/SPARK-6819 Project: Spark Issue Type: New Feature Components: SparkR, SQL Reporter: Shivaram Venkataraman ArrayType, MapType and StructType: we can infer the correct schema for them, but the serialization cannot handle them well. From Davies in https://sparkr.atlassian.net/browse/SPARKR-230: an ArrayType could be c("ab", "ba"), which we will serialize as character. We do not support deserializing env in R (serialized from Map[] in Scala). So all three do not work. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6799) Add dataframe examples for SparkR
Shivaram Venkataraman created SPARK-6799: Summary: Add dataframe examples for SparkR Key: SPARK-6799 URL: https://issues.apache.org/jira/browse/SPARK-6799 Project: Spark Issue Type: Improvement Components: SparkR Reporter: Shivaram Venkataraman Priority: Critical We should add more data frame usage examples for SparkR . This can be similar to the python examples at https://github.com/apache/spark/blob/1b2aab8d5b9cc2ff702506038bd71aa8debe7ca0/examples/src/main/python/sql.py -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2673) Improve Enable to attach Debugger to Executors easily
[ https://issues.apache.org/jira/browse/SPARK-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-2673: -- Description: In the current implementation, it is difficult to attach a debugger to each Executor in the cluster. The reasons are as follows. 1) Multiple Executors can run on the same machine, so each executor needs to open an individual debug port. 2) Even if we can open a unique debug port for each Executor running on the same machine, it's a bother to check the debug port of each executor. To solve those problems, I think the following 2 improvements are needed. 1) Enable each executor to open a unique debug port on a machine. 2) Expand the WebUI to show the debug ports open in each executor. was: In the current implementation, it is difficult to attach a debugger to each Executor in the cluster. The reasons are as follows. 1) It's difficult for Executors running on the same machine to open debug ports because we can only pass the same JVM options to all executors. 2) Even if we can open a unique debug port for each Executor running on the same machine, it's a bother to check the debug port of each executor. To solve those problems, I think the following 2 improvements are needed. 1) Enable each executor to open a unique debug port on a machine. 2) Expand the WebUI to show the debug ports open in each executor. Improve Enable to attach Debugger to Executors easily - Key: SPARK-2673 URL: https://issues.apache.org/jira/browse/SPARK-2673 Project: Spark Issue Type: Improvement Components: Spark Core, Web UI Reporter: Kousuke Saruta In the current implementation, it is difficult to attach a debugger to each Executor in the cluster. The reasons are as follows. 1) Multiple Executors can run on the same machine, so each executor needs to open an individual debug port. 2) Even if we can open a unique debug port for each Executor running on the same machine, it's a bother to check the debug port of each executor. To solve those problems, I think the following 2 improvements are needed. 1) Enable each executor to open a unique debug port on a machine. 2) Expand the WebUI to show the debug ports open in each executor. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
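As a stop-gap under the current behaviour described above, it is worth noting (my assumption about the JDWP agent, not something stated in this issue) that address=0 lets each executor JVM pick a free debug port, which avoids port collisions even though you still have to dig the chosen port out of each executor's stderr:

{code}
import org.apache.spark.SparkConf

// Workaround sketch: let every executor JVM bind its debugger to an
// ephemeral port (JDWP address=0), so multiple executors on one machine
// do not clash. Finding the chosen port still requires reading executor
// logs, which is exactly what improvement 2) above would fix.
val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions",
       "-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=0")
{code}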
[jira] [Created] (SPARK-6803) Support SparkR Streaming
Hao created SPARK-6803: -- Summary: Support SparkR Streaming Key: SPARK-6803 URL: https://issues.apache.org/jira/browse/SPARK-6803 Project: Spark Issue Type: New Feature Components: SparkR, Streaming Reporter: Hao Fix For: 1.4.0 Adds an R API for Spark Streaming. An experimental version is presented in repo [1], which follows the PySpark streaming design. Also, this PR can be further broken down into sub-task issues. [1] https://github.com/hlin09/spark/tree/SparkR-streaming/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6803) [SparkR] Support SparkR Streaming
[ https://issues.apache.org/jira/browse/SPARK-6803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487872#comment-14487872 ] Tathagata Das commented on SPARK-6803: -- I know of many use cases where Python is the desired language for streaming, especially in the devops domain (devops love Python, as I have heard) for streaming machine data and processing it. I do not have any knowledge about the need for writing streaming applications in R. Nonetheless it is very cool. :D BTW, the main challenge in building the Python API for streaming was making the streaming scheduler in Java call back into Python to run an arbitrary Python function (an RDD-to-RDD transformation). Setting up this callback through Py4j was interesting. I am curious to know how this was solved with R in this prototype. [SparkR] Support SparkR Streaming - Key: SPARK-6803 URL: https://issues.apache.org/jira/browse/SPARK-6803 Project: Spark Issue Type: New Feature Components: SparkR, Streaming Reporter: Hao Fix For: 1.4.0 Adds an R API for Spark Streaming. An experimental version is presented in repo [1], which follows the PySpark streaming design. Also, this PR can be further broken down into sub-task issues. [1] https://github.com/hlin09/spark/tree/SparkR-streaming/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2673) Improve Enable to attach Debugger to Executors easily
[ https://issues.apache.org/jira/browse/SPARK-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-2673: -- Component/s: Web UI Improve Enable to attach Debugger to Executors easily - Key: SPARK-2673 URL: https://issues.apache.org/jira/browse/SPARK-2673 Project: Spark Issue Type: Improvement Components: Spark Core, Web UI Reporter: Kousuke Saruta In current implementation, we are difficult to attach debugger to each Executor in the cluster. There are reasons as follows. 1) It's difficult for Executors running on the same machine to open debug port because we can only pass same JVM options to all executors. 2) Even if we can open unique debug port to each Executors running on the same machine, it's a bother to check debug port of each executor. To solve those problem, I think following 2 improvement is needed. 1) Enable executor to open unique debug port on a machine. 2) Expand WebUI to be able to show debug ports opening in each executor. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6823) Add a model.matrix like capability to DataFrames (modelDataFrame)
Shivaram Venkataraman created SPARK-6823: Summary: Add a model.matrix like capability to DataFrames (modelDataFrame) Key: SPARK-6823 URL: https://issues.apache.org/jira/browse/SPARK-6823 Project: Spark Issue Type: New Feature Components: ML, SparkR Reporter: Shivaram Venkataraman Currently Mllib modeling tools work only with double data. However, data tables in practice often have a set of categorical fields (factors in R), that need to be converted to a set of 0/1 indicator variables (making the data actually used in a modeling algorithm completely numeric). In R, this is handled in modeling functions using the model.matrix function. Similar functionality needs to be available within Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-2673) Improve Enable to attach Debugger to Executors easily
[ https://issues.apache.org/jira/browse/SPARK-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-2673: --- Assignee: Apache Spark Improve Enable to attach Debugger to Executors easily - Key: SPARK-2673 URL: https://issues.apache.org/jira/browse/SPARK-2673 Project: Spark Issue Type: Improvement Components: Spark Core, Web UI Reporter: Kousuke Saruta Assignee: Apache Spark In current implementation, we are difficult to attach debugger to each Executor in the cluster. There are reasons as follows. 1) Multi Executors can run on the same machine so each executor should open individual debug ports. 2) Even if we can open unique debug port to each Executors running on the same machine, it's a bother to check debug port of each executor. To solve those problem, I think following 2 improvement is needed. 1) Enable executor to open unique debug port on a machine. 2) Expand WebUI to be able to show debug ports opening in each executor. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2673) Improve Enable to attach Debugger to Executors easily
[ https://issues.apache.org/jira/browse/SPARK-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14486897#comment-14486897 ] Apache Spark commented on SPARK-2673: - User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/5437 Improve Enable to attach Debugger to Executors easily - Key: SPARK-2673 URL: https://issues.apache.org/jira/browse/SPARK-2673 Project: Spark Issue Type: Improvement Components: Spark Core, Web UI Reporter: Kousuke Saruta In current implementation, we are difficult to attach debugger to each Executor in the cluster. There are reasons as follows. 1) Multi Executors can run on the same machine so each executor should open individual debug ports. 2) Even if we can open unique debug port to each Executors running on the same machine, it's a bother to check debug port of each executor. To solve those problem, I think following 2 improvement is needed. 1) Enable executor to open unique debug port on a machine. 2) Expand WebUI to be able to show debug ports opening in each executor. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-2673) Improve Enable to attach Debugger to Executors easily
[ https://issues.apache.org/jira/browse/SPARK-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-2673: --- Assignee: (was: Apache Spark) Improve Enable to attach Debugger to Executors easily - Key: SPARK-2673 URL: https://issues.apache.org/jira/browse/SPARK-2673 Project: Spark Issue Type: Improvement Components: Spark Core, Web UI Reporter: Kousuke Saruta In current implementation, we are difficult to attach debugger to each Executor in the cluster. There are reasons as follows. 1) Multi Executors can run on the same machine so each executor should open individual debug ports. 2) Even if we can open unique debug port to each Executors running on the same machine, it's a bother to check debug port of each executor. To solve those problem, I think following 2 improvement is needed. 1) Enable executor to open unique debug port on a machine. 2) Expand WebUI to be able to show debug ports opening in each executor. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6822) lapplyPartition passes empty list to function
Shivaram Venkataraman created SPARK-6822: Summary: lapplyPartition passes empty list to function Key: SPARK-6822 URL: https://issues.apache.org/jira/browse/SPARK-6822 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 1.4.0 Reporter: Shivaram Venkataraman I have an RDD containing two elements, as expected and as shown by a collect. When I call lapplyPartition on it with a function that prints its arguments to stderr, the function gets called three times: the first two with the expected arguments and the third with an empty list as the argument. I was wondering if that's a bug or if there are conditions under which that's possible. I apologize that I don't have a simple test case ready yet. I ran into this potential bug while developing a separate package, plyrmr. If you are willing to install it, the test case is very simple. The RDD that creates this problem is the result of a join, but I couldn't replicate the problem using a plain vanilla join. Example from Antonio on the SparkR JIRA: I don't have time to try any harder to repro this without plyrmr. For the record, this is the example

{code}
library(plyrmr)
plyrmr.options(backend = "spark")
df1 = mtcars[1:4,]
df2 = mtcars[3:6,]
w = as.data.frame(gapply(merge(input(df1), input(df2)), identity))
{code}

gapply is implemented with lapplyPartition in most cases, merge with a join, and as.data.frame with a collect. The join has an arbitrary argument of 4 partitions. If I turn that down to 2L, the problem disappears. I will check in a version with a workaround in place, but a debugging statement will leave a record in stderr whenever the workaround kicks in, so that we can track it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6805) ML Pipeline API in SparkR
Xiangrui Meng created SPARK-6805: Summary: ML Pipeline API in SparkR Key: SPARK-6805 URL: https://issues.apache.org/jira/browse/SPARK-6805 Project: Spark Issue Type: Umbrella Components: ML, SparkR Reporter: Xiangrui Meng Assignee: Xiangrui Meng SparkR was merged. So let's have this umbrella JIRA for the ML pipeline API in SparkR. The implementation should be similar to the pipeline API implementation in Python. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6817) DataFrame UDFs in R
Shivaram Venkataraman created SPARK-6817: Summary: DataFrame UDFs in R Key: SPARK-6817 URL: https://issues.apache.org/jira/browse/SPARK-6817 Project: Spark Issue Type: New Feature Components: SparkR, SQL Reporter: Shivaram Venkataraman This depends on some internal interface of Spark SQL, should be done after merging into Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6767) Documentation error in Spark SQL Readme file
[ https://issues.apache.org/jira/browse/SPARK-6767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6767. -- Resolution: Fixed Fix Version/s: 1.4.0 1.3.2 Assignee: Tijo Thomas Documentation error in Spark SQL Readme file Key: SPARK-6767 URL: https://issues.apache.org/jira/browse/SPARK-6767 Project: Spark Issue Type: Bug Components: Documentation, SQL Affects Versions: 1.3.0 Reporter: Tijo Thomas Assignee: Tijo Thomas Priority: Trivial Fix For: 1.3.2, 1.4.0 Error in the Spark SQL documentation file. The sample script for the SQL DSL throws the error below: scala> query.where('key > 30).select(avg('key)).collect() <console>:43: error: value > is not a member of Symbol query.where('key > 30).select(avg('key)).collect() -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-3276) Provide an API to specify MIN_REMEMBER_DURATION for files to consider as input in streaming
[ https://issues.apache.org/jira/browse/SPARK-3276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-3276: --- Assignee: (was: Apache Spark) Provide an API to specify MIN_REMEMBER_DURATION for files to consider as input in streaming -- Key: SPARK-3276 URL: https://issues.apache.org/jira/browse/SPARK-3276 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.2.0 Reporter: Jack Hu Priority: Minor Currently there is only one API, textFileStream in StreamingContext, to specify a text file DStream, and it always ignores old files. Sometimes the old files are still useful. We need an API to let the user choose whether old files should be ignored or not. The API currently in StreamingContext: def textFileStream(directory: String): DStream[String] = { fileStream[LongWritable, Text, TextInputFormat](directory).map(_._2.toString) } -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3276) Provide an API to specify MIN_REMEMBER_DURATION for files to consider as input in streaming
[ https://issues.apache.org/jira/browse/SPARK-3276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487073#comment-14487073 ] Apache Spark commented on SPARK-3276: - User 'emres' has created a pull request for this issue: https://github.com/apache/spark/pull/5438 Provide an API to specify MIN_REMEMBER_DURATION for files to consider as input in streaming -- Key: SPARK-3276 URL: https://issues.apache.org/jira/browse/SPARK-3276 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.2.0 Reporter: Jack Hu Priority: Minor Currently there is only one API, textFileStream in StreamingContext, to specify a text file DStream, and it always ignores old files. Sometimes the old files are still useful. We need an API to let the user choose whether old files should be ignored or not. The API currently in StreamingContext: def textFileStream(directory: String): DStream[String] = { fileStream[LongWritable, Text, TextInputFormat](directory).map(_._2.toString) } -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
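For context on the existing API surface: the lower-level fileStream overload already exposes a newFilesOnly flag, but how far back "old" files are picked up is still capped by the hard-coded MIN_REMEMBER_DURATION that this issue wants to make configurable. A hedged Scala sketch of that overload (assuming ssc is an active StreamingContext):

{code}
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream

def oldAndNewTextFiles(ssc: StreamingContext, directory: String): DStream[String] =
  ssc.fileStream[LongWritable, Text, TextInputFormat](
    directory,
    (path: Path) => true,   // accept every file name
    newFilesOnly = false    // also consider files that predate the stream's start
  ).map(_._2.toString)
{code}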
[jira] [Created] (SPARK-6804) System.exit(1) on error
Alberto created SPARK-6804: -- Summary: System.exit(1) on error Key: SPARK-6804 URL: https://issues.apache.org/jira/browse/SPARK-6804 Project: Spark Issue Type: Improvement Reporter: Alberto We are developing a web application that is using Spark under the hood. Testing our app, we have found out that when our Spark master is not up and running and we try to connect to it, Spark kills our app. We've been having a look at the code and have noticed that the TaskSchedulerImpl class just kills the JVM, so our web application is obviously also killed. See the following code snippet I am talking about:

{code}
else {
  // No task sets are active but we still got an error. Just exit since this
  // must mean the error is during registration.
  // It might be good to do something smarter here in the future.
  logError("Exiting due to error from cluster scheduler: " + message)
  System.exit(1)
}
{code}

IMHO this code should not invoke System.exit(1). Instead, it should throw an exception so applications will be able to handle the error. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6803) [SparkR] Support SparkR Streaming
[ https://issues.apache.org/jira/browse/SPARK-6803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hao updated SPARK-6803: --- Summary: [SparkR] Support SparkR Streaming (was: Support SparkR Streaming) [SparkR] Support SparkR Streaming - Key: SPARK-6803 URL: https://issues.apache.org/jira/browse/SPARK-6803 Project: Spark Issue Type: New Feature Components: SparkR, Streaming Reporter: Hao Fix For: 1.4.0 Adds R API for Spark Streaming. A experimental version is presented in repo [1]. which follows the PySpark streaming design. Also, this PR can be further broken down into sub task issues. [1] https://github.com/hlin09/spark/tree/SparkR-streaming/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6735) Provide options to make maximum executor failure count ( which kills the application ) relative to a window duration or disable it.
[ https://issues.apache.org/jira/browse/SPARK-6735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487670#comment-14487670 ] Sandy Ryza commented on SPARK-6735: --- Hi [~twinkle], can you submit the PR against the main Spark project. Provide options to make maximum executor failure count ( which kills the application ) relative to a window duration or disable it. --- Key: SPARK-6735 URL: https://issues.apache.org/jira/browse/SPARK-6735 Project: Spark Issue Type: Improvement Components: Spark Submit, YARN Affects Versions: 1.2.0, 1.2.1, 1.3.0 Reporter: Twinkle Sachdeva Currently there is a setting (spark.yarn.max.executor.failures) which specifies the maximum number of executor failures, after which the application fails. For long-running applications, the user may require the application not to be killed at all, or may want such a setting relative to a window duration. This improvement is to provide options to make the maximum executor failure count (which kills the application) relative to a window duration, or to disable it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-6735) Provide options to make maximum executor failure count ( which kills the application ) relative to a window duration or disable it.
[ https://issues.apache.org/jira/browse/SPARK-6735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487670#comment-14487670 ] Sandy Ryza edited comment on SPARK-6735 at 4/9/15 5:06 PM: --- Hi [~twinkle], can you submit the PR against the main Spark project? was (Author: sandyr): Hi [~twinkle], can you submit the PR against the main Spark project. Provide options to make maximum executor failure count ( which kills the application ) relative to a window duration or disable it. --- Key: SPARK-6735 URL: https://issues.apache.org/jira/browse/SPARK-6735 Project: Spark Issue Type: Improvement Components: Spark Submit, YARN Affects Versions: 1.2.0, 1.2.1, 1.3.0 Reporter: Twinkle Sachdeva Currently there is a setting (spark.yarn.max.executor.failures) which specifies the maximum number of executor failures, after which the application fails. For long-running applications, the user may require the application not to be killed at all, or may want such a setting relative to a window duration. This improvement is to provide options to make the maximum executor failure count (which kills the application) relative to a window duration, or to disable it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6821) Refactor SerDe API in SparkR to be more developer friendly
Shivaram Venkataraman created SPARK-6821: Summary: Refactor SerDe API in SparkR to be more developer friendly Key: SPARK-6821 URL: https://issues.apache.org/jira/browse/SPARK-6821 Project: Spark Issue Type: Improvement Components: SparkR Reporter: Shivaram Venkataraman The existing SerDe API we use in the SparkR JVM backend is limited and not very easy to use. We should refactor it to make it use more of Scala's type system and also allow extensions for user-defined S3 or S4 types in R -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6813) SparkR style guide
Shivaram Venkataraman created SPARK-6813: Summary: SparkR style guide Key: SPARK-6813 URL: https://issues.apache.org/jira/browse/SPARK-6813 Project: Spark Issue Type: New Feature Components: SparkR Reporter: Shivaram Venkataraman We should develop a SparkR style guide document based on some of the guidelines we use and some of the best practices in R. Some examples of R style guides are: http://r-pkgs.had.co.nz/r.html#style http://google-styleguide.googlecode.com/svn/trunk/google-r-style.html A related issue is to work on an automatic style-checking tool; https://github.com/jimhester/lintr seems promising. We could have an R style guide based on the one from Google [1], and adjust some of its rules based on the conventions in Spark: 1. Line length: maximum 100 characters 2. No limit on function name length (the API should be similar to other languages) 3. Allow S4 objects/methods -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6677) pyspark.sql nondeterministic issue with row fields
[ https://issues.apache.org/jira/browse/SPARK-6677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487010#comment-14487010 ] Stefano Parmesan commented on SPARK-6677: - Uhm, I don't know what to say. Let's try this: I've created a Docker image that reproduces the issue; it's available here: https://github.com/armisael/SPARK-6677 I tested it on three different machines, and the issue appeared on all of them. Can you give it a try? pyspark.sql nondeterministic issue with row fields -- Key: SPARK-6677 URL: https://issues.apache.org/jira/browse/SPARK-6677 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.3.0 Environment: spark version: spark-1.3.0-bin-hadoop2.4 python version: Python 2.7.6 operating system: MacOS, x86_64 x86_64 x86_64 GNU/Linux Reporter: Stefano Parmesan Assignee: Davies Liu Labels: pyspark, row, sql The following issue happens only when running pyspark in the Python interpreter; it works correctly with spark-submit. Reading two JSON files containing objects with different structures sometimes leads to the definition of wrong Rows, where the fields of one file are used for the other one. I was able to write sample code that reproduces this issue one out of three times; the code snippet is available at the following link, together with some (very simple) data samples: https://gist.github.com/armisael/e08bb4567d0a11efe2db -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6803) Support SparkR Streaming
[ https://issues.apache.org/jira/browse/SPARK-6803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487800#comment-14487800 ] Shivaram Venkataraman commented on SPARK-6803: -- Thanks [~hlin09]. The PySpark design is a good starting point for the design. However, I'm wondering if there are any R-specific paradigms or operations we should be supporting. Do you know of any existing R packages that do streaming / time-series analysis? [~tdas] [~davies] It'll also be great to know if we have any feedback from streaming users in Python about the API. Support SparkR Streaming Key: SPARK-6803 URL: https://issues.apache.org/jira/browse/SPARK-6803 Project: Spark Issue Type: New Feature Components: SparkR, Streaming Reporter: Hao Fix For: 1.4.0 Adds an R API for Spark Streaming. An experimental version is presented in repo [1], which follows the PySpark streaming design. Also, this PR can be further broken down into sub-task issues. [1] https://github.com/hlin09/spark/tree/SparkR-streaming/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2673) Improve Enable to attach Debugger to Executors easily
[ https://issues.apache.org/jira/browse/SPARK-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-2673: -- Summary: Improve Enable to attach Debugger to Executors easily (was: Improve Spark so that we can attach Debugger to Executors easily) Improve Enable to attach Debugger to Executors easily - Key: SPARK-2673 URL: https://issues.apache.org/jira/browse/SPARK-2673 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Kousuke Saruta In current implementation, we are difficult to attach debugger to each Executor in the cluster. There are reasons as follows. 1) It's difficult for Executors running on the same machine to open debug port because we can only pass same JVM options to all executors. 2) Even if we can open unique debug port to each Executors running on the same machine, it's a bother to check debug port of each executor. To solve those problem, I think following 2 improvement is needed. 1) Enable executor to open unique debug port on a machine. 2) Expand WebUI to be able to show debug ports opening in each executor. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6229) Support SASL encryption in network/common module
[ https://issues.apache.org/jira/browse/SPARK-6229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487878#comment-14487878 ] Jeffrey Turpin commented on SPARK-6229: --- Hey Marcelo, I have been working on SPARK-6373 and have reviewed your pull request and merged it into my WIP branch https://github.com/turp1twin/spark/tree/ssl-shuffle. I tried to follow the general design pattern I discussed with Aaron Davidson, having a single EncryptionHandler interface with implementations for both SSL and SASL encryption. One issue I faced is that the timing of adding the appropriate encryption handlers differs between SSL and SASL. For SSL, I need to add the SslHandler to the Netty pipeline before the connection is made, while for SASL it looks like you add it during the TransportClient/Server bootstrap process, which occurs after the initial connection. Anyway, I haven't created a pull request yet and am waiting on some more feedback... If you have some time, perhaps you can give me your thoughts. Some commits of interest: https://github.com/apache/spark/commit/ab8743f6ac707060cbae63bdf491723709fe32f3 https://github.com/apache/spark/commit/9527aef89b1bbc80a22337552dd54af936aa1094 Cheers, Jeff Support SASL encryption in network/common module Key: SPARK-6229 URL: https://issues.apache.org/jira/browse/SPARK-6229 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Marcelo Vanzin After SASL support has been added to network/common, supporting encryption should be rather simple. Encryption is supported for DIGEST-MD5 and GSSAPI. Since the latter requires a valid kerberos login to work (and so doesn't really work with executors), encryption would require the use of DIGEST-MD5. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
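To make the timing constraint concrete, a purely illustrative Netty sketch (not code from either pull request): TLS has to be in the pipeline when the channel is initialized, before any application bytes flow, whereas a SASL-style bootstrap can negotiate over the already-established connection:

{code}
import io.netty.channel.socket.SocketChannel
import io.netty.handler.ssl.SslHandler
import javax.net.ssl.SSLContext

// Illustrative: install TLS at channel-initialization time. A SASL bootstrap,
// by contrast, runs its handshake after the connection exists.
def installSsl(ch: SocketChannel, sslContext: SSLContext): Unit = {
  val engine = sslContext.createSSLEngine()
  engine.setUseClientMode(false)  // server side in this sketch
  ch.pipeline().addFirst("ssl", new SslHandler(engine))
}
{code}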
[jira] [Created] (SPARK-6810) Performance benchmarks for SparkR
Shivaram Venkataraman created SPARK-6810: Summary: Performance benchmarks for SparkR Key: SPARK-6810 URL: https://issues.apache.org/jira/browse/SPARK-6810 Project: Spark Issue Type: New Feature Components: SparkR Reporter: Shivaram Venkataraman Priority: Critical We should port some performance benchmarks from spark-perf to SparkR for tracking performance regressions / improvements. https://github.com/databricks/spark-perf/tree/master/pyspark-tests has a list of PySpark performance benchmarks -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6800) Reading from JDBC with SQLContext, using lower/upper bounds and numPartitions gives incorrect results.
Micael Capitão created SPARK-6800: - Summary: Reading from JDBC with SQLContext, using lower/upper bounds and numPartitions gives incorrect results. Key: SPARK-6800 URL: https://issues.apache.org/jira/browse/SPARK-6800 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Environment: Windows 8.1, Derby, Spark 1.3.0 CDH5.4.0, Scala 2.10 Reporter: Micael Capitão Having a Derby table with people info (id, name, age) defined like this: {code} val jdbcUrl = "jdbc:derby:memory:PeopleDB;create=true" val conn = DriverManager.getConnection(jdbcUrl) val stmt = conn.createStatement() stmt.execute("CREATE TABLE Person (person_id INT NOT NULL GENERATED ALWAYS AS IDENTITY CONSTRAINT person_pk PRIMARY KEY, name VARCHAR(50), age INT)") stmt.execute("INSERT INTO Person(name, age) VALUES('Armando Carvalho', 50)") stmt.execute("INSERT INTO Person(name, age) VALUES('Lurdes Pereira', 23)") stmt.execute("INSERT INTO Person(name, age) VALUES('Ana Rita Costa', 12)") stmt.execute("INSERT INTO Person(name, age) VALUES('Armando Pereira', 32)") stmt.execute("INSERT INTO Person(name, age) VALUES('Miguel Costa', 15)") stmt.execute("INSERT INTO Person(name, age) VALUES('Anabela Sintra', 13)") {code} If I try to read that table from Spark SQL with lower/upper bounds, like this: {code} val people = sqlContext.jdbc(url = jdbcUrl, table = "Person", columnName = "age", lowerBound = 0, upperBound = 40, numPartitions = 10) people.show() {code} I get this result: {noformat} PERSON_ID NAME AGE 3 Ana Rita Costa 12 5 Miguel Costa 15 6 Anabela Sintra 13 2 Lurdes Pereira 23 4 Armando Pereira 32 1 Armando Carvalho 50 {noformat} Which is wrong, considering the defined upper bound has been ignored (I get a person with age 50!). Digging the code, I've found that in {{JDBCRelation.columnPartition}} the WHERE clauses it generates are the following: {code} (0) age < 4,0 (1) age >= 4 AND age < 8,1 (2) age >= 8 AND age < 12,2 (3) age >= 12 AND age < 16,3 (4) age >= 16 AND age < 20,4 (5) age >= 20 AND age < 24,5 (6) age >= 24 AND age < 28,6 (7) age >= 28 AND age < 32,7 (8) age >= 32 AND age < 36,8 (9) age >= 36,9 {code} The last condition ignores the upper bound and the other ones may result in repeated rows being read. Using the JdbcRDD (and converting it to a DataFrame) I would have something like this: {code} val jdbcRdd = new JdbcRDD(sc, () => DriverManager.getConnection(jdbcUrl), "SELECT * FROM Person WHERE age >= ? and age <= ?", 0, 40, 10, rs => (rs.getInt(1), rs.getString(2), rs.getInt(3))) val people = jdbcRdd.toDF("PERSON_ID", "NAME", "AGE") people.show() {code} Resulting in: {noformat} PERSON_ID NAME AGE 3 Ana Rita Costa 12 5 Miguel Costa 15 6 Anabela Sintra 13 2 Lurdes Pereira 23 4 Armando Pereira 32 {noformat} Which is correct! Confirming the WHERE clauses generated by the JdbcRDD in the {{getPartitions}} I've found it generates the following: {code} (0) age >= 0 AND age <= 3 (1) age >= 4 AND age <= 7 (2) age >= 8 AND age <= 11 (3) age >= 12 AND age <= 15 (4) age >= 16 AND age <= 19 (5) age >= 20 AND age <= 23 (6) age >= 24 AND age <= 27 (7) age >= 28 AND age <= 31 (8) age >= 32 AND age <= 35 (9) age >= 36 AND age <= 40 {code} This is the behaviour I was expecting from the Spark SQL version. The Spark SQL version is buggy, as far as I can tell. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
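For comparison, a bound-respecting clause generator (the behaviour the reporter expected) could be sketched as below. This is a hypothetical helper for illustration only, not Spark's actual JDBCRelation.columnPartition code:
{code}
// Hypothetical helper: one WHERE clause per partition, clamped to [lower, upper].
def boundedClauses(column: String, lower: Long, upper: Long, numPartitions: Int): Seq[String] = {
  val stride = (upper - lower) / numPartitions
  (0 until numPartitions).map { i =>
    val lo = lower + i * stride
    val hi = if (i == numPartitions - 1) upper else lo + stride - 1
    s"$column >= $lo AND $column <= $hi"
  }
}
// boundedClauses("age", 0, 40, 10) gives "age >= 0 AND age <= 3", ..., "age >= 36 AND age <= 40"
{code}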
[jira] [Commented] (SPARK-6229) Support SASL encryption in network/common module
[ https://issues.apache.org/jira/browse/SPARK-6229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14488104#comment-14488104 ] Jeffrey Turpin commented on SPARK-6229: --- Hey Marcelo, So what I have done is to overload the TransportContext constructor, adding a constructor that takes an instance of the TransportEncryptionHandler interface: {code:title=TransportContext.java|linenumbers=false|language=java} public TransportContext( TransportConf conf, RpcHandler appRpcHandler, TransportEncryptionHandler encryptionHandler) { this.conf = conf; this.appRpcHandler = appRpcHandler; this.decoder = new MessageDecoder(); if (encryptionHandler != null) { this.encryptionHandler = encryptionHandler; } else { this.encryptionHandler = new NoEncryptionHandler(); } this.encoder = (this.encryptionHandler.isEnabled() ? new SslMessageEncoder() : new MessageEncoder()); } {code} This way the existing method signatures for createServer and createClientFactory don't change. To facilitate this I also added a constructor to the TransportClientFactory class and modified the constructor for the TransportServer class, to also take a TransportEncryptionHandler instance. In the TransportClientFactory case I need to add the Netty SslHandler before the connection occurs, which can be done by calling the _addToPipeline_ method of the TransportEncryptionHandler interface: {code:title=TransportClientFactory.java|linenumbers=false|language=java} private void initHandler( final Bootstrap bootstrap, final AtomicReference<TransportClient> clientRef, final AtomicReference<Channel> channelRef) { bootstrap.handler(new ChannelInitializer<SocketChannel>() { @Override protected void initChannel(SocketChannel ch) throws Exception { TransportChannelHandler clientHandler = context.initializePipeline(ch); encryptionHandler.addToPipeline(ch.pipeline(), true); clientRef.set(clientHandler.getClient()); channelRef.set(ch); } }); } {code} This _initHandler_ method is called just before the connection is made. In addition the TransportEncryptionHandler interface has an _onConnect_ method to allow post-connect initialization to occur, which in the SSL case is to allow the handshake process to complete, which is a blocking operation. This could possibly be done in a custom TransportClientBootstrap implementation, but the method signature of _doBootstrap_ would have to change to allow for this. As for the TransportServer, the Netty SslHandler must be added to the pipeline before the server binds to a port and starts listening for connections. Again, in this case, this could be done in a TransportServerBootstrap implementation, but the method signature of _doBootstrap_ would have to change (or we would need to add another method) to allow for this... Thoughts? Jeff Support SASL encryption in network/common module Key: SPARK-6229 URL: https://issues.apache.org/jira/browse/SPARK-6229 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Marcelo Vanzin After SASL support has been added to network/common, supporting encryption should be rather simple. Encryption is supported for DIGEST-MD5 and GSSAPI. Since the latter requires a valid kerberos login to work (and so doesn't really work with executors), encryption would require the use of DIGEST-MD5. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6770) DirectKafkaInputDStream has not been initialized when recovery from checkpoint
[ https://issues.apache.org/jira/browse/SPARK-6770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14486794#comment-14486794 ] yangping wu commented on SPARK-6770: Ok, Thank you very much for your reply. I will try to use a pure Spark Streaming program and plain Scala JDBC to write the data to MySQL. DirectKafkaInputDStream has not been initialized when recovery from checkpoint -- Key: SPARK-6770 URL: https://issues.apache.org/jira/browse/SPARK-6770 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0 Reporter: yangping wu I am reading data from Kafka using the createDirectStream method and saving the received log to MySQL; the code snippet is as follows {code} def functionToCreateContext(): StreamingContext = { val sparkConf = new SparkConf() val sc = new SparkContext(sparkConf) val ssc = new StreamingContext(sc, Seconds(10)) ssc.checkpoint("/tmp/kafka/channel/offset") // set checkpoint directory ssc } val struct = StructType(StructField("log", StringType) :: Nil) // Get StreamingContext from checkpoint data or create a new one val ssc = StreamingContext.getOrCreate("/tmp/kafka/channel/offset", functionToCreateContext) val SDB = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics) val sqlContext = new org.apache.spark.sql.SQLContext(ssc.sparkContext) SDB.foreachRDD(rdd => { val result = rdd.map(item => { println(item) val result = item._2 match { case e: String => Row.apply(e) case _ => Row.apply() } result }) println(result.count()) val df = sqlContext.createDataFrame(result, struct) df.insertIntoJDBC(url, "test", overwrite = false) }) ssc.start() ssc.awaitTermination() ssc.stop() {code} But when I recover the program from the checkpoint, I encounter an exception: {code} Exception in thread "main" org.apache.spark.SparkException: org.apache.spark.streaming.kafka.DirectKafkaInputDStream@41a80e5a has not been initialized at org.apache.spark.streaming.dstream.DStream.isTimeValid(DStream.scala:266) at org.apache.spark.streaming.dstream.InputDStream.isTimeValid(InputDStream.scala:51) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287) at scala.Option.orElse(Option.scala:257) at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:284) at org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:38) at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116) at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251) at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105) at org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:116) at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$restart$4.apply(JobGenerator.scala:223) at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$restart$4.apply(JobGenerator.scala:218) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at 
scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) at org.apache.spark.streaming.scheduler.JobGenerator.restart(JobGenerator.scala:218) at org.apache.spark.streaming.scheduler.JobGenerator.start(JobGenerator.scala:89) at org.apache.spark.streaming.scheduler.JobScheduler.start(JobScheduler.scala:67) at org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:512) at logstatstreaming.UserChannelTodb$.main(UserChannelTodb.scala:57) at logstatstreaming.UserChannelTodb.main(UserChannelTodb.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597)
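The pure Spark Streaming plus plain JDBC approach mentioned in the reply might look roughly like the sketch below, continuing from the SDB stream in the snippet above. It is only illustrative: the connection URL, table, and column are placeholders, and it sidesteps the SQLContext/DataFrame usage that complicates checkpoint recovery here.
{code}
// Illustrative sketch: write each micro-batch to MySQL with plain JDBC,
// one connection per partition, without SQLContext/DataFrames.
import java.sql.DriverManager

SDB.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    val conn = DriverManager.getConnection("jdbc:mysql://host:3306/db?user=u&password=p") // placeholder URL
    val stmt = conn.prepareStatement("INSERT INTO test (log) VALUES (?)")                 // placeholder table/column
    try {
      records.foreach { case (_, log) =>
        stmt.setString(1, log)
        stmt.executeUpdate()
      }
    } finally {
      stmt.close()
      conn.close()
    }
  }
}
{code}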
[jira] [Resolved] (SPARK-6751) Spark History Server support multiple application attempts
[ https://issues.apache.org/jira/browse/SPARK-6751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-6751. -- Resolution: Duplicate Dup of SPARK-4705 Spark History Server support multiple application attempts -- Key: SPARK-6751 URL: https://issues.apache.org/jira/browse/SPARK-6751 Project: Spark Issue Type: Improvement Components: Spark Core, YARN Affects Versions: 1.3.0 Reporter: Thomas Graves Spark on Yarn supports running multiple application attempts (configurable number) in case the first (or second..) attempts fail. The Spark History server only supports one history file though. Under the default configs it keeps the first attempts history file. You can set the undocumented config spark.eventLog.overwrite to allow the follow on attempts to overwrite the first attempts history file. Note that in spark 1.2 not having the overwrite config set causes any following attempts to actually fail to run, in spark 1.3 they run and you just see a warning at the end of the attempts. It would be really nice to have an option that keeps all the attempts history files. This way a user can go back and look at each one individually. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-5960) Allow AWS credentials to be passed to KinesisUtils.createStream()
[ https://issues.apache.org/jira/browse/SPARK-5960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-5960: --- Assignee: Apache Spark (was: Chris Fregly) Allow AWS credentials to be passed to KinesisUtils.createStream() - Key: SPARK-5960 URL: https://issues.apache.org/jira/browse/SPARK-5960 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.1.0 Reporter: Chris Fregly Assignee: Apache Spark While IAM roles are preferable, we're seeing a lot of cases where we need to pass AWS credentials when creating the KinesisReceiver. Notes: * Make sure we don't log the credentials anywhere * Maintain compatibility with existing KinesisReceiver-based code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
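For illustration, a call with explicit credentials might look something like the sketch below. The two credential parameters are an assumption about how such an overload could be shaped, not the signature that was eventually merged; the credentials are read from the environment rather than hard-coded so they do not end up in logs.
{code}
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kinesis.KinesisUtils

// Hypothetical overload: same arguments as today's createStream, plus explicit
// AWS credentials for cases where an IAM role is not available.
def buildStream(ssc: StreamingContext) =
  KinesisUtils.createStream(
    ssc,
    "myKinesisStream",
    "https://kinesis.us-east-1.amazonaws.com",
    Seconds(2),                    // Kinesis checkpoint interval
    InitialPositionInStream.LATEST,
    StorageLevel.MEMORY_AND_DISK_2,
    sys.env("AWS_ACCESS_KEY_ID"),  // hypothetical extra parameter
    sys.env("AWS_SECRET_KEY"))     // hypothetical extra parameter
{code}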
[jira] [Created] (SPARK-6820) Convert NAs to null type in SparkR DataFrames
Shivaram Venkataraman created SPARK-6820: Summary: Convert NAs to null type in SparkR DataFrames Key: SPARK-6820 URL: https://issues.apache.org/jira/browse/SPARK-6820 Project: Spark Issue Type: New Feature Components: SparkR, SQL Reporter: Shivaram Venkataraman While converting RDD or local R DataFrame to a SparkR DataFrame we need to handle missing values or NAs. We should convert NAs to SparkSQL's null type to handle the conversion correctly -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6806) SparkR examples in programming guide
Davies Liu created SPARK-6806: - Summary: SparkR examples in programming guide Key: SPARK-6806 URL: https://issues.apache.org/jira/browse/SPARK-6806 Project: Spark Issue Type: New Feature Components: Documentation, SparkR Reporter: Davies Liu Priority: Blocker Add R examples for Spark Core and DataFrame programming guide -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6807) Merge recent changes in SparkR-pkg into Spark
Davies Liu created SPARK-6807: - Summary: Merge recent changes in SparkR-pkg into Spark Key: SPARK-6807 URL: https://issues.apache.org/jira/browse/SPARK-6807 Project: Spark Issue Type: New Feature Components: SparkR Reporter: Davies Liu Assignee: Davies Liu Priority: Blocker A few new features landed in SparkR-pkg while the merge was in progress; we should pull them all in. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6636) Use public DNS hostname everywhere in spark_ec2.py
[ https://issues.apache.org/jira/browse/SPARK-6636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14488245#comment-14488245 ] Nicholas Chammas commented on SPARK-6636: - [~aasted] - Can you elaborate on this? I haven't used private-network-only security groups before. Why wouldn't the IP address work if that kind of security group is used? Just curious since, naively speaking, the public IP and public DNS name should always be interchangeable. Use public DNS hostname everywhere in spark_ec2.py -- Key: SPARK-6636 URL: https://issues.apache.org/jira/browse/SPARK-6636 Project: Spark Issue Type: Bug Components: EC2 Reporter: Matt Aasted Assignee: Matt Aasted Priority: Minor Fix For: 1.3.2, 1.4.0 The spark_ec2.py script uses public_dns_name everywhere in the script except for testing ssh availability, which is done using the public ip address of the instances. This breaks the script for users who are deploying the cluster with a private-network-only security group. The fix is to use public_dns_name in the remaining place. I am submitting a pull-request alongside this bug report. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3947) [Spark SQL] UDAF Support
[ https://issues.apache.org/jira/browse/SPARK-3947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487278#comment-14487278 ] Takeshi Yamamuro commented on SPARK-3947: - Sorry, but I'm not sure what the issue is there. SPARK-4233 just simplifies and bug-fixes the interface of Aggregate. If you'd like to discuss the topic, ISTM you need to open a new JIRA ticket for it. [Spark SQL] UDAF Support Key: SPARK-3947 URL: https://issues.apache.org/jira/browse/SPARK-3947 Project: Spark Issue Type: Improvement Components: SQL Reporter: Pei-Lun Lee Assignee: Venkata Ramana G Right now only Hive UDAFs are supported. It would be nice to have UDAF support similar to UDFs through SQLContext.registerFunction. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
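To make the request concrete, a registration API in the spirit of SQLContext.registerFunction might look something like the sketch below. Both the trait shape and the registration call are purely hypothetical names invented for illustration; they are not an existing or proposed Spark interface.
{code}
// Purely hypothetical sketch of a UDAF abstraction; none of these names exist
// in Spark, they only illustrate the kind of interface being asked for.
trait SimpleUDAF[IN, BUF, OUT] {
  def zero: BUF                      // initial aggregation buffer
  def reduce(buf: BUF, in: IN): BUF  // fold one input value into the buffer
  def merge(b1: BUF, b2: BUF): BUF   // combine partial buffers across partitions
  def finish(buf: BUF): OUT          // produce the final result
}

// Imagined registration call, mirroring sqlContext.registerFunction for UDFs:
// sqlContext.registerAggregateFunction("myAvg", new SimpleUDAF[Double, (Double, Long), Double] {
//   def zero = (0.0, 0L)
//   def reduce(b: (Double, Long), x: Double) = (b._1 + x, b._2 + 1)
//   def merge(a: (Double, Long), b: (Double, Long)) = (a._1 + b._1, a._2 + b._2)
//   def finish(b: (Double, Long)) = if (b._2 == 0) 0.0 else b._1 / b._2
// })
{code}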
[jira] [Commented] (SPARK-2451) Enable to load config file for Akka
[ https://issues.apache.org/jira/browse/SPARK-2451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487386#comment-14487386 ] Andrea Peruffo commented on SPARK-2451: --- As per: https://issues.apache.org/jira/browse/SPARK-4669 Enable to load config file for Akka --- Key: SPARK-2451 URL: https://issues.apache.org/jira/browse/SPARK-2451 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Kousuke Saruta Priority: Minor In current implementation, we cannot let Akka to load config file. Sometimes we want to use custom config file for Akka. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6804) System.exit(1) on error
[ https://issues.apache.org/jira/browse/SPARK-6804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alberto updated SPARK-6804: --- Description: We are developing a web application that is using Spark under the hood. Testing our app we have found out that when our spark master is not up and running and we try to connect with it, Spark is killing our app. We've been having a look at the code and we have noticed that the TaskSchedulerImpl class is just killing the JVM and our web application is obviously also killed. See the following code snippet I am talking about: {code} else { // No task sets are active but we still got an error. Just exit since this // must mean the error is during registration. // It might be good to do something smarter here in the future. logError("Exiting due to error from cluster scheduler: " + message) System.exit(1) } {code} IMHO this guy should not invoke System.exit(1). Instead, it should throw an exception so the applications will be able to handle the error. was: We are developing a web application that is using Spark under the hood. Testing our app we have found out that when our spark master is not up and running and we try to connect with it, Spark is killing our app. We've been having a look at the code and we have noticed that the TaskSchedulerImpl class is just killing the JVM and our web application is obviously also killed. See the following code snippet I am talking about: else { // No task sets are active but we still got an error. Just exit since this // must mean the error is during registration. // It might be good to do something smarter here in the future. logError("Exiting due to error from cluster scheduler: " + message) System.exit(1) } IMHO this guy should not invoke System.exit(1). Instead, it should throw an exception so the applications will be able to handle the error. System.exit(1) on error --- Key: SPARK-6804 URL: https://issues.apache.org/jira/browse/SPARK-6804 Project: Spark Issue Type: Improvement Reporter: Alberto We are developing a web application that is using Spark under the hood. Testing our app we have found out that when our spark master is not up and running and we try to connect with it, Spark is killing our app. We've been having a look at the code and we have noticed that the TaskSchedulerImpl class is just killing the JVM and our web application is obviously also killed. See the following code snippet I am talking about: {code} else { // No task sets are active but we still got an error. Just exit since this // must mean the error is during registration. // It might be good to do something smarter here in the future. logError("Exiting due to error from cluster scheduler: " + message) System.exit(1) } {code} IMHO this guy should not invoke System.exit(1). Instead, it should throw an exception so the applications will be able to handle the error. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
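An illustrative sketch of what the reporter's suggestion would amount to in that branch (this mirrors the snippet above and is not the change that was actually committed):
{code}
else {
  // No task sets are active but we still got an error. This must mean the
  // error happened during registration. Throwing instead of System.exit(1)
  // lets an embedding application (e.g. a web app driving Spark) handle it.
  logError("Exiting due to error from cluster scheduler: " + message)
  throw new SparkException("Error from cluster scheduler: " + message)
}
{code}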
[jira] [Assigned] (SPARK-2352) [MLLIB] Add Artificial Neural Network (ANN) to Spark
[ https://issues.apache.org/jira/browse/SPARK-2352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-2352: --- Assignee: Bert Greevenbosch (was: Apache Spark) [MLLIB] Add Artificial Neural Network (ANN) to Spark Key: SPARK-2352 URL: https://issues.apache.org/jira/browse/SPARK-2352 Project: Spark Issue Type: New Feature Components: MLlib Environment: MLLIB code Reporter: Bert Greevenbosch Assignee: Bert Greevenbosch It would be good if the Machine Learning Library contained Artificial Neural Networks (ANNs). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3937) Unsafe memory access inside of Snappy library
[ https://issues.apache.org/jira/browse/SPARK-3937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14486810#comment-14486810 ] Guoqiang Li commented on SPARK-3937: The bug seems to be caused by {{spark.storage.memoryFraction 0.2}}. {{spark.storage.memoryFraction 0.4}} won't appear the bug. These may be related with the size of the RDD. Unsafe memory access inside of Snappy library - Key: SPARK-3937 URL: https://issues.apache.org/jira/browse/SPARK-3937 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Patrick Wendell This was observed on master between Spark 1.1 and 1.2. Unfortunately I don't have much information about this other than the stack trace. However, it was concerning enough I figured I should post it. {code} java.lang.InternalError: a fault occurred in a recent unsafe memory access operation in compiled Java code org.xerial.snappy.SnappyNative.rawUncompress(Native Method) org.xerial.snappy.Snappy.rawUncompress(Snappy.java:444) org.xerial.snappy.Snappy.uncompress(Snappy.java:480) org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:355) org.xerial.snappy.SnappyInputStream.rawRead(SnappyInputStream.java:159) org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:142) java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2310) java.io.ObjectInputStream$BlockDataInputStream.read(ObjectInputStream.java:2712) java.io.ObjectInputStream$BlockDataInputStream.readFully(ObjectInputStream.java:2742) java.io.ObjectInputStream.readArray(ObjectInputStream.java:1687) java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344) java.io.ObjectInputStream.readArray(ObjectInputStream.java:1706) java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344) java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62) org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133) org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71) scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:350) org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388) scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308) scala.collection.Iterator$class.foreach(Iterator.scala:727) scala.collection.AbstractIterator.foreach(Iterator.scala:1157) scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) scala.collection.AbstractIterator.to(Iterator.scala:1157) scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) scala.collection.AbstractIterator.toArray(Iterator.scala:1157) 
org.apache.spark.sql.execution.Limit$$anonfun$4.apply(basicOperators.scala:140) org.apache.spark.sql.execution.Limit$$anonfun$4.apply(basicOperators.scala:140) org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1118) org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1118) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) org.apache.spark.scheduler.Task.run(Task.scala:56) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:182) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (SPARK-2205) Unnecessary exchange operators in a join on multiple tables with the same join key.
[ https://issues.apache.org/jira/browse/SPARK-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-2205: Priority: Critical (was: Minor) Unnecessary exchange operators in a join on multiple tables with the same join key. --- Key: SPARK-2205 URL: https://issues.apache.org/jira/browse/SPARK-2205 Project: Spark Issue Type: Bug Components: SQL Reporter: Yin Huai Assignee: Yin Huai Priority: Critical {code} hql("select * from src x join src y on (x.key=y.key) join src z on (y.key=z.key)") SchemaRDD[1] at RDD at SchemaRDD.scala:100 == Query Plan == Project [key#4:0,value#5:1,key#6:2,value#7:3,key#8:4,value#9:5] HashJoin [key#6], [key#8], BuildRight Exchange (HashPartitioning [key#6], 200) HashJoin [key#4], [key#6], BuildRight Exchange (HashPartitioning [key#4], 200) HiveTableScan [key#4,value#5], (MetastoreRelation default, src, Some(x)), None Exchange (HashPartitioning [key#6], 200) HiveTableScan [key#6,value#7], (MetastoreRelation default, src, Some(y)), None Exchange (HashPartitioning [key#8], 200) HiveTableScan [key#8,value#9], (MetastoreRelation default, src, Some(z)), None {code} However, this is fine... {code} hql("select * from src x join src y on (x.key=y.key) join src z on (x.key=z.key)") res5: org.apache.spark.sql.SchemaRDD = SchemaRDD[5] at RDD at SchemaRDD.scala:100 == Query Plan == Project [key#26:0,value#27:1,key#28:2,value#29:3,key#30:4,value#31:5] HashJoin [key#26], [key#30], BuildRight HashJoin [key#26], [key#28], BuildRight Exchange (HashPartitioning [key#26], 200) HiveTableScan [key#26,value#27], (MetastoreRelation default, src, Some(x)), None Exchange (HashPartitioning [key#28], 200) HiveTableScan [key#28,value#29], (MetastoreRelation default, src, Some(y)), None Exchange (HashPartitioning [key#30], 200) HiveTableScan [key#30,value#31], (MetastoreRelation default, src, Some(z)), None {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6816) Add SparkConf API to configure SparkR
[ https://issues.apache.org/jira/browse/SPARK-6816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14488237#comment-14488237 ] Shivaram Venkataraman commented on SPARK-6816: -- Comments from SparkR JIRA Shivaram Venkataraman added a comment - 14/Feb/15 10:32 AM I looked at this recently and I think the existing arguments to `sparkR.init` pretty much cover all the options that are exposed in SparkConf. We could split things out of the function arguments into a separate SparkConf object (something like PySpark https://github.com/apache/spark/blob/master/python/pyspark/conf.py) but the setter-methods don't translate very well to the style we use in SparkR. For example it would be something like setAppName(setMaster(conf, "local"), "SparkR") instead of conf.setMaster().setAppName() The other thing brought up by this JIRA is that we should parse arguments passed to spark-submit or set in spark-defaults.conf. I think this should automatically happen with SPARKR-178 Sun Rui Zongheng Yang Any thoughts on this? concretevitamin Zongheng Yang added a comment - 15/Feb/15 12:07 PM I'm +1 on not using the builder pattern in R. What about using a named list or an environment to simulate a SparkConf? For example, users can write something like: {code} conf <- list(spark.master = "local[2]", spark.executor.memory = "12g") conf $spark.master [1] "local[2]" $spark.executor.memory [1] "12g" {code} and pass the named list to `sparkR.init()`. shivaram Shivaram Venkataraman added a comment - 15/Feb/15 5:50 PM I think the named list might be okay (one thing is that we will have nested named lists for things like executorEnv). However I am not sure if named lists are better than just passing named arguments to `sparkR.init`. I guess the better way to ask my question is what functionality do we want to provide to the users – Right now users can pretty much set anything they want in the SparkConf using sparkR.init One functionality that is missing is printing the conf and, say, inspecting what config variables are set. We could add a getConf(sc) which returns a named list to provide this feature. Is there any other functionality we need? concretevitamin Zongheng Yang added a comment - 21/Feb/15 3:22 PM IMO using a named list provides more flexibility: it's ordinary data that users can operate/transform on. Using only parameter-passing in the constructor locks users into operating on code instead of data. It'd also be easier to just return the saved named list if we're going to implement getConf()? Some relevant discussions: https://aphyr.com/posts/321-builders-vs-option-maps shivaram Shivaram Venkataraman added a comment - 22/Feb/15 4:33 PM Hmm okay - named lists are not quite the same as option maps though. To move forward it'll be good to see how the new API functions we want on the R side should look. Let's keep this discussion open but I'm going to change the priority / description (we are already able to read in spark-defaults.conf now that SPARKR-178 has been merged). Add SparkConf API to configure SparkR - Key: SPARK-6816 URL: https://issues.apache.org/jira/browse/SPARK-6816 Project: Spark Issue Type: New Feature Components: SparkR Reporter: Shivaram Venkataraman Priority: Minor Right now the only way to configure SparkR is to pass in arguments to sparkR.init. 
The goal is to add an API similar to SparkConf on Scala/Python to make configuration easier -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6770) DirectKafkaInputDStream has not been initialized when recovery from checkpoint
[ https://issues.apache.org/jira/browse/SPARK-6770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14486782#comment-14486782 ] Saisai Shao commented on SPARK-6770: From my understanding, the basic scenario of your code is trying to put the Kafka data into a database using JDBC, and you want to leverage SparkSQL for an easy implementation. I think if you want to use the checkpoint file to recover from driver failure, it would be better to write a pure Spark Streaming program; Spark Streaming's checkpointing mechanism only guarantees that streaming-related metadata is written and recovered. The more third-party tools you use, the less the current mechanism can recover. DirectKafkaInputDStream has not been initialized when recovery from checkpoint -- Key: SPARK-6770 URL: https://issues.apache.org/jira/browse/SPARK-6770 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0 Reporter: yangping wu I am reading data from Kafka using the createDirectStream method and saving the received log to MySQL; the code snippet is as follows {code} def functionToCreateContext(): StreamingContext = { val sparkConf = new SparkConf() val sc = new SparkContext(sparkConf) val ssc = new StreamingContext(sc, Seconds(10)) ssc.checkpoint("/tmp/kafka/channel/offset") // set checkpoint directory ssc } val struct = StructType(StructField("log", StringType) :: Nil) // Get StreamingContext from checkpoint data or create a new one val ssc = StreamingContext.getOrCreate("/tmp/kafka/channel/offset", functionToCreateContext) val SDB = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics) val sqlContext = new org.apache.spark.sql.SQLContext(ssc.sparkContext) SDB.foreachRDD(rdd => { val result = rdd.map(item => { println(item) val result = item._2 match { case e: String => Row.apply(e) case _ => Row.apply() } result }) println(result.count()) val df = sqlContext.createDataFrame(result, struct) df.insertIntoJDBC(url, "test", overwrite = false) }) ssc.start() ssc.awaitTermination() ssc.stop() {code} But when I recover the program from the checkpoint, I encounter an exception: {code} Exception in thread "main" org.apache.spark.SparkException: org.apache.spark.streaming.kafka.DirectKafkaInputDStream@41a80e5a has not been initialized at org.apache.spark.streaming.dstream.DStream.isTimeValid(DStream.scala:266) at org.apache.spark.streaming.dstream.InputDStream.isTimeValid(InputDStream.scala:51) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287) at scala.Option.orElse(Option.scala:257) at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:284) at org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:38) at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116) at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251) at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105) at 
org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:116) at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$restart$4.apply(JobGenerator.scala:223) at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$restart$4.apply(JobGenerator.scala:218) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) at org.apache.spark.streaming.scheduler.JobGenerator.restart(JobGenerator.scala:218) at org.apache.spark.streaming.scheduler.JobGenerator.start(JobGenerator.scala:89) at org.apache.spark.streaming.scheduler.JobScheduler.start(JobScheduler.scala:67) at org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:512) at logstatstreaming.UserChannelTodb$.main(UserChannelTodb.scala:57) at
[jira] [Commented] (SPARK-2949) SparkContext does not fate-share with ActorSystem
[ https://issues.apache.org/jira/browse/SPARK-2949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487467#comment-14487467 ] Andrea Peruffo commented on SPARK-2949: --- Which version of Spark? ActorSystem has a method, registerOnTermination, that can trigger callbacks on proper shutdown. By the way, I've seen issues with Netty shutdown under Akka; the only way I've found to live with it is to poll until the OS port becomes free. SparkContext does not fate-share with ActorSystem - Key: SPARK-2949 URL: https://issues.apache.org/jira/browse/SPARK-2949 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Aaron Davidson It appears that an uncaught fatal error in Spark's Driver ActorSystem does not cause the SparkContext to terminate. We observed an issue in production that caused a PermGen error, but it just kept throwing this error: {code} 14/08/09 15:07:24 ERROR ActorSystemImpl: Uncaught fatal error from thread [spark-akka.actor.default-dispatcher-26] shutting down ActorSystem [spark] java.lang.OutOfMemoryError: PermGen space {code} We should probably do something similar to what we did in the DAGScheduler and ensure that we call SparkContext#stop() if the entire ActorSystem dies with a fatal error. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
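For illustration, wiring a fate-sharing callback through registerOnTermination might look like this minimal sketch, assuming both the driver ActorSystem and the SparkContext are in hand; this only shows the mechanism the comment refers to, not the fix that was adopted.
{code}
import akka.actor.ActorSystem
import org.apache.spark.SparkContext

// Minimal sketch: if the driver ActorSystem shuts down (including after an
// uncaught fatal error), stop the SparkContext as well so they fate-share.
def fateShare(actorSystem: ActorSystem, sc: SparkContext): Unit = {
  actorSystem.registerOnTermination {
    sc.stop()
  }
}
{code}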
[jira] [Resolved] (SPARK-6343) Make doc more explicit regarding network connectivity requirements
[ https://issues.apache.org/jira/browse/SPARK-6343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6343. -- Resolution: Fixed Fix Version/s: 1.4.0 1.3.2 Issue resolved by pull request 5382 [https://github.com/apache/spark/pull/5382] Make doc more explicit regarding network connectivity requirements -- Key: SPARK-6343 URL: https://issues.apache.org/jira/browse/SPARK-6343 Project: Spark Issue Type: Improvement Components: Documentation Reporter: Peter Parente Priority: Minor Fix For: 1.3.2, 1.4.0 As a new user of Spark, I read through the official documentation before attempting to stand-up my own cluster and write my own driver application. But only after attempting to run my app remotely against my cluster did I realize that full network connectivity (layer 3) is necessary between my driver program and worker nodes (i.e., my driver was *listening* for connections from my workers). I returned to the documentation to see how I had missed this requirement. On a second read-through, I saw that the doc hints at it in a few places (e.g., [driver config|http://spark.apache.org/docs/1.2.0/configuration.html#networking], [submitting applications suggestion|http://spark.apache.org/docs/1.2.0/submitting-applications.html], [cluster overview|http://spark.apache.org/docs/1.2.0/cluster-overview.html]) but never outright says it. I think it would help would-be users better understand how Spark works to state the network connectivity requirements right up-front in the overview section of the doc. I suggest revising the diagram and accompanying text found on the [overview page|http://spark.apache.org/docs/1.2.0/cluster-overview.html]: !http://spark.apache.org/docs/1.2.0/img/cluster-overview.png! so that it depicts at least the directionality of the network connections initiated (perhaps like so): !http://i.imgur.com/2dqGbCr.png! and states that the driver must listen for and accept connections from other Spark components on a variety of ports. Please treat my diagram and text as strawmen: I expect more experienced Spark users and developers will have better ideas on how to convey these requirements. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6808) Checkpointing after zipPartitions results in NODE_LOCAL execution
Xinghao Pan created SPARK-6808: -- Summary: Checkpointing after zipPartitions results in NODE_LOCAL execution Key: SPARK-6808 URL: https://issues.apache.org/jira/browse/SPARK-6808 Project: Spark Issue Type: Bug Components: GraphX, Spark Core Affects Versions: 1.3.0, 1.2.1 Environment: EC2 Ubuntu r3.8xlarge machines Reporter: Xinghao Pan I'm encountering a weird issue where a simple iterative zipPartitions is PROCESS_LOCAL before checkpointing, but turns NODE_LOCAL for all iterations after checkpointing. Here's an example snippet of code: var R : RDD[(Long,Int)] = sc.parallelize((0 until numPartitions), numPartitions) .mapPartitions(_ => new Array[(Long,Int)](1000).map(i => (0L,0)).toSeq.iterator).cache() sc.setCheckpointDir(checkpointDir) var iteration = 0 while (iteration < 50){ R = R.zipPartitions(R)((x,y) => x).cache() if ((iteration+1) % checkpointIter == 0) R.checkpoint() R.foreachPartition(_ => {}) iteration += 1 } I've also tried to unpersist the old RDDs and increased spark.locality.wait, but neither helps. Strangely, adding a simple identity map R = R.map(x => x).cache() after the zipPartitions appears to partially mitigate the issue. The problem was originally triggered when I attempted to checkpoint after doing joinVertices in GraphX, but the above example shows that the issue is in Spark core too. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-3276) Provide a API to specify MIN_REMEMBER_DURATION for files to consider as input in streaming
[ https://issues.apache.org/jira/browse/SPARK-3276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-3276: --- Assignee: Apache Spark Provide a API to specify MIN_REMEMBER_DURATION for files to consider as input in streaming -- Key: SPARK-3276 URL: https://issues.apache.org/jira/browse/SPARK-3276 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.2.0 Reporter: Jack Hu Assignee: Apache Spark Priority: Minor Currently there is only one API, textFileStream in StreamingContext, to specify a text file dstream, and it always ignores old files. Sometimes the old files are still useful. We need an API to let the user choose whether the old files should be ignored or not. The API currently in StreamingContext: def textFileStream(directory: String): DStream[String] = { fileStream[LongWritable, Text, TextInputFormat](directory).map(_._2.toString) } -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
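For reference, StreamingContext already exposes a lower-level fileStream overload with a newFilesOnly flag; what this ticket asks for, in addition, is control over how far back "old" files are remembered (MIN_REMEMBER_DURATION). A sketch of using the existing overload, assuming a directory of text files:
{code}
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.streaming.StreamingContext

// Sketch: the existing fileStream overload lets the caller opt in to
// pre-existing files (newFilesOnly = false), but how far back they are
// considered is still capped by the hard-coded remember duration that this
// ticket wants to make configurable.
def oldAndNewTextFiles(ssc: StreamingContext, directory: String) =
  ssc.fileStream[LongWritable, Text, TextInputFormat](
      directory,
      (path: Path) => !path.getName.startsWith("."),  // skip hidden/temp files
      newFilesOnly = false)
    .map(_._2.toString)
{code}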
[jira] [Commented] (SPARK-6803) [SparkR] Support SparkR Streaming
[ https://issues.apache.org/jira/browse/SPARK-6803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14488315#comment-14488315 ] Davies Liu commented on SPARK-6803: --- After a quick look over the prototype, the callback server sits in a different process than the driver, because R does not support multithreading. This approach will have some limitations, for example, accessing shared variables in callback functions. Also, we should have a way to collect the logging from the callback server; it's needed when you run the streaming job as a daemon process, with dstream.pprint(). This prototype is pretty cool, it shows that it's doable to have a Streaming API in R, even with some limitations. But the question is how many users want to do streaming jobs in R? It will take a lot of effort to make it production ready. Even with the Python API, there's lots of work to do, for example, supporting checkpointing and recovery with HDFS. [SparkR] Support SparkR Streaming - Key: SPARK-6803 URL: https://issues.apache.org/jira/browse/SPARK-6803 Project: Spark Issue Type: New Feature Components: SparkR, Streaming Reporter: Hao Fix For: 1.4.0 Adds an R API for Spark Streaming. An experimental version is presented in repo [1], which follows the PySpark streaming design. Also, this PR can be further broken down into sub-task issues. [1] https://github.com/hlin09/spark/tree/SparkR-streaming/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6343) Make doc more explicit regarding network connectivity requirements
[ https://issues.apache.org/jira/browse/SPARK-6343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6343: - Assignee: Peter Parente Make doc more explicit regarding network connectivity requirements -- Key: SPARK-6343 URL: https://issues.apache.org/jira/browse/SPARK-6343 Project: Spark Issue Type: Improvement Components: Documentation Reporter: Peter Parente Assignee: Peter Parente Priority: Minor Fix For: 1.3.2, 1.4.0 As a new user of Spark, I read through the official documentation before attempting to stand-up my own cluster and write my own driver application. But only after attempting to run my app remotely against my cluster did I realize that full network connectivity (layer 3) is necessary between my driver program and worker nodes (i.e., my driver was *listening* for connections from my workers). I returned to the documentation to see how I had missed this requirement. On a second read-through, I saw that the doc hints at it in a few places (e.g., [driver config|http://spark.apache.org/docs/1.2.0/configuration.html#networking], [submitting applications suggestion|http://spark.apache.org/docs/1.2.0/submitting-applications.html], [cluster overview|http://spark.apache.org/docs/1.2.0/cluster-overview.html]) but never outright says it. I think it would help would-be users better understand how Spark works to state the network connectivity requirements right up-front in the overview section of the doc. I suggest revising the diagram and accompanying text found on the [overview page|http://spark.apache.org/docs/1.2.0/cluster-overview.html]: !http://spark.apache.org/docs/1.2.0/img/cluster-overview.png! so that it depicts at least the directionality of the network connections initiated (perhaps like so): !http://i.imgur.com/2dqGbCr.png! and states that the driver must listen for and accept connections from other Spark components on a variety of ports. Please treat my diagram and text as strawmen: I expect more experienced Spark users and developers will have better ideas on how to convey these requirements. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6691) Abstract and add a dynamic RateLimiter for Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-6691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487323#comment-14487323 ] Saisai Shao commented on SPARK-6691: Hi [~tdas], thanks a lot for your comments. I think this proposal not only adds a new dynamic RateLimiter, but also refactors the previous interface to offer a uniform solution for receiver-based input streams as well as direct input streams. It also remains backward compatible; the default logic stays the same as in the previous code. So I think it makes sense to do such a refactor. Regarding the dynamic RateLimiter, I think your suggestion is quite meaningful: we need a good design to balance stability and throughput. Currently my design is simple and straightforward, and we still need to polish it. But from my point of view, stability is more important than throughput for in-production use, since production workloads seldom saturate the network bandwidth of each receiver, while instability is quite critical. Currently Spark Streaming is vulnerable to processing delay, and this delay accumulates and is hard to recover from once we meet an ingestion burst, which is quite normal in a production environment, especially for online services. So a dynamic RateLimiter could solve this problem well; from this point of view it is quite meaningful. My design of the dynamic RateLimiter may not be very sophisticated; I bring it here just to show a possible solution to these issues, so we can improve on it. I will continue to do benchmark and research work on different scenarios. Thanks a lot for your suggestions. Abstract and add a dynamic RateLimiter for Spark Streaming -- Key: SPARK-6691 URL: https://issues.apache.org/jira/browse/SPARK-6691 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.3.0 Reporter: Saisai Shao Flow control (or rate control) for input data is very important in a streaming system, especially for Spark Streaming to stay stable and up-to-date. An unexpected flood of incoming data, or an ingestion rate beyond the computation power of the cluster, will make the system unstable and increase the delay. With Spark Streaming's job generation and processing pattern, this delay accumulates and introduces unacceptable exceptions. Currently, in Spark Streaming's receiver-based input stream there is a RateLimiter in BlockGenerator which controls the ingestion rate of input data, but the current implementation has several limitations: # The max ingestion rate is set by the user through configuration beforehand; the user may lack the experience to choose an appropriate value before the application is running. # This configuration is fixed for the lifetime of the application, which means you need to consider the worst-case scenario to set a reasonable value. # Input streams like DirectKafkaInputStream need to maintain another solution to achieve the same functionality. # The lack of slow-start control makes the whole system easily trapped in large processing and scheduling delays at the very beginning. So here we propose a new dynamic RateLimiter, as well as a new interface for the RateLimiter, to improve the whole system's stability. The targets are: * Dynamically adjust the ingestion rate according to the processing rate of previously finished jobs. * Offer a uniform solution not only for receiver-based input streams, but also for direct streams like DirectKafkaInputStream and new ones. 
* A slow-start rate to control network congestion when the job is started. * A pluggable framework to make maintaining extensions easier. A sketch of the core feedback idea follows this description. Here is the design doc (https://docs.google.com/document/d/1lqJDkOYDh_9hRLQRwqvBXcbLScWPmMa7MlG8J_TE93w/edit?usp=sharing) and working branch (https://github.com/jerryshao/apache-spark/tree/dynamic-rate-limiter). Any comment would be greatly appreciated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
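The core feedback idea, adjusting the allowed ingestion rate from the measured processing rate of recently finished batches with a slow start, can be sketched as below. This is only an illustrative sketch of the idea, not the interface or implementation proposed in the design doc; all names are hypothetical.
{code}
// Hypothetical sketch of a dynamic rate limiter: slow-start up to a cap, then
// track the observed processing rate of finished batches.
class DynamicRateLimiter(initialRate: Double, maxRate: Double) {
  @volatile private var currentRate = initialRate

  // Called after each finished batch with how many records it processed and
  // how long that took; the next batch is throttled to roughly that rate.
  def onBatchCompleted(numRecords: Long, processingTimeMs: Long): Unit = {
    if (processingTimeMs > 0) {
      val processedPerSec = numRecords * 1000.0 / processingTimeMs
      // Slow start: never more than double the previous rate in one step.
      currentRate = math.min(maxRate, math.min(processedPerSec, currentRate * 2))
    }
  }

  def rate: Double = currentRate
}
{code}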
[jira] [Commented] (SPARK-6646) Spark 2.0: Rearchitecting Spark for Mobile Platforms
[ https://issues.apache.org/jira/browse/SPARK-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14488225#comment-14488225 ] Marko Bonaci commented on SPARK-6646: - Wait a minute, don't postpone this one just yet. Hardest problems often give the biggest yields. Other players in the space, spurred (and a bit frightened) by your announcement, already started acting. Nobody wants to be left behind, so strategies are being worked on: http://app.go.cloudera.com/e/es.aspx?s=1465054361e=177939 bq. Cloudera Wearables ^tm^ Spark 2.0: Rearchitecting Spark for Mobile Platforms Key: SPARK-6646 URL: https://issues.apache.org/jira/browse/SPARK-6646 Project: Spark Issue Type: Improvement Components: Project Infra Reporter: Reynold Xin Assignee: Reynold Xin Priority: Blocker Attachments: Spark on Mobile - Design Doc - v1.pdf Mobile computing is quickly rising to dominance, and by the end of 2017, it is estimated that 90% of CPU cycles will be devoted to mobile hardware. Spark’s project goal can be accomplished only when Spark runs efficiently for the growing population of mobile users. Designed and optimized for modern data centers and Big Data applications, Spark is unfortunately not a good fit for mobile computing today. In the past few months, we have been prototyping the feasibility of a mobile-first Spark architecture, and today we would like to share with you our findings. This ticket outlines the technical design of Spark’s mobile support, and shares results from several early prototypes. Mobile friendly version of the design doc: https://databricks.com/blog/2015/04/01/spark-2-rearchitecting-spark-for-mobile.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5960) Allow AWS credentials to be passed to KinesisUtils.createStream()
[ https://issues.apache.org/jira/browse/SPARK-5960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487177#comment-14487177 ] Apache Spark commented on SPARK-5960: - User 'mce' has created a pull request for this issue: https://github.com/apache/spark/pull/5439 Allow AWS credentials to be passed to KinesisUtils.createStream() - Key: SPARK-5960 URL: https://issues.apache.org/jira/browse/SPARK-5960 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.1.0 Reporter: Chris Fregly Assignee: Chris Fregly While IAM roles are preferable, we're seeing a lot of cases where we need to pass AWS credentials when creating the KinesisReceiver. Notes: * Make sure we don't log the credentials anywhere * Maintain compatibility with existing KinesisReceiver-based code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-5960) Allow AWS credentials to be passed to KinesisUtils.createStream()
[ https://issues.apache.org/jira/browse/SPARK-5960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-5960: --- Assignee: Chris Fregly (was: Apache Spark) Allow AWS credentials to be passed to KinesisUtils.createStream() - Key: SPARK-5960 URL: https://issues.apache.org/jira/browse/SPARK-5960 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.1.0 Reporter: Chris Fregly Assignee: Chris Fregly While IAM roles are preferable, we're seeing a lot of cases where we need to pass AWS credentials when creating the KinesisReceiver. Notes: * Make sure we don't log the credentials anywhere * Maintain compatibility with existing KinesisReceiver-based code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6814) Support sorting for any data type in SparkR
Shivaram Venkataraman created SPARK-6814: Summary: Support sorting for any data type in SparkR Key: SPARK-6814 URL: https://issues.apache.org/jira/browse/SPARK-6814 Project: Spark Issue Type: New Feature Components: SparkR Reporter: Shivaram Venkataraman Priority: Critical I get various "return status == 0 is false" and "unimplemented type" errors when trying to get data out of any RDD with top() or collect(). The errors are not consistent. I think Spark is installed properly because some operations do work. I apologize if I'm missing something easy or not providing the right diagnostic info; I'm new to SparkR, and this seems to be the only resource for SparkR issues. Some logs:
{code}
Browse[1]> top(estep.rdd, 1L)
Error in order(unlist(part, recursive = FALSE), decreasing = !ascending) :
  unimplemented type 'list' in 'orderVector1'
Calls: do.call ... Reduce -> Anonymous -> func -> FUN -> FUN -> order
Execution halted
15/02/13 19:11:57 ERROR Executor: Exception in task 0.0 in stage 14.0 (TID 14)
org.apache.spark.SparkException: R computation failed with
 Error in order(unlist(part, recursive = FALSE), decreasing = !ascending) :
  unimplemented type 'list' in 'orderVector1'
Calls: do.call ... Reduce -> Anonymous -> func -> FUN -> FUN -> order
Execution halted
	at edu.berkeley.cs.amplab.sparkr.BaseRRDD.compute(RRDD.scala:69)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
	at org.apache.spark.scheduler.Task.run(Task.scala:54)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
15/02/13 19:11:57 WARN TaskSetManager: Lost task 0.0 in stage 14.0 (TID 14, localhost): org.apache.spark.SparkException: R computation failed with
 Error in order(unlist(part, recursive = FALSE), decreasing = !ascending) :
  unimplemented type 'list' in 'orderVector1'
Calls: do.call ... Reduce -> Anonymous -> func -> FUN -> FUN -> order
Execution halted
    edu.berkeley.cs.amplab.sparkr.BaseRRDD.compute(RRDD.scala:69)
    org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
    org.apache.spark.scheduler.Task.run(Task.scala:54)
    org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
    java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    java.lang.Thread.run(Thread.java:745)
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6800) Reading from JDBC with SQLContext, using lower/upper bounds and numPartitions gives incorrect results.
[ https://issues.apache.org/jira/browse/SPARK-6800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Micael Capitão updated SPARK-6800: -- Description: Having a Derby table with people info (id, name, age) defined like this:
{code}
val jdbcUrl = "jdbc:derby:memory:PeopleDB;create=true"
val conn = DriverManager.getConnection(jdbcUrl)
val stmt = conn.createStatement()
stmt.execute("CREATE TABLE Person (person_id INT NOT NULL GENERATED ALWAYS AS IDENTITY CONSTRAINT person_pk PRIMARY KEY, name VARCHAR(50), age INT)")
stmt.execute("INSERT INTO Person(name, age) VALUES('Armando Carvalho', 50)")
stmt.execute("INSERT INTO Person(name, age) VALUES('Lurdes Pereira', 23)")
stmt.execute("INSERT INTO Person(name, age) VALUES('Ana Rita Costa', 12)")
stmt.execute("INSERT INTO Person(name, age) VALUES('Armando Pereira', 32)")
stmt.execute("INSERT INTO Person(name, age) VALUES('Miguel Costa', 15)")
stmt.execute("INSERT INTO Person(name, age) VALUES('Anabela Sintra', 13)")
{code}
If I try to read that table from Spark SQL with lower/upper bounds, like this:
{code}
val people = sqlContext.jdbc(url = jdbcUrl, table = "Person",
  columnName = "age", lowerBound = 0, upperBound = 40, numPartitions = 10)
people.show()
{code}
I get this result:
{noformat}
PERSON_ID  NAME              AGE
3          Ana Rita Costa    12
5          Miguel Costa      15
6          Anabela Sintra    13
2          Lurdes Pereira    23
4          Armando Pereira   32
1          Armando Carvalho  50
{noformat}
Which is wrong, considering the defined upper bound has been ignored (I get a person with age 50!). Digging into the code, I've found that the WHERE clauses {{JDBCRelation.columnPartition}} generates are the following:
{code}
(0) age < 4,0
(1) age >= 4 AND age < 8,1
(2) age >= 8 AND age < 12,2
(3) age >= 12 AND age < 16,3
(4) age >= 16 AND age < 20,4
(5) age >= 20 AND age < 24,5
(6) age >= 24 AND age < 28,6
(7) age >= 28 AND age < 32,7
(8) age >= 32 AND age < 36,8
(9) age >= 36,9
{code}
The last condition ignores the upper bound, and the other ones may result in repeated rows being read. Using the JdbcRDD (and converting it to a DataFrame) I would have something like this:
{code}
val jdbcRdd = new JdbcRDD(sc, () => DriverManager.getConnection(jdbcUrl),
  "SELECT * FROM Person WHERE age >= ? AND age <= ?", 0, 40, 10,
  rs => (rs.getInt(1), rs.getString(2), rs.getInt(3)))
val people = jdbcRdd.toDF("PERSON_ID", "NAME", "AGE")
people.show()
{code}
Resulting in:
{noformat}
PERSON_ID  NAME             AGE
3          Ana Rita Costa   12
5          Miguel Costa     15
6          Anabela Sintra   13
2          Lurdes Pereira   23
4          Armando Pereira  32
{noformat}
Which is correct! Confirming the WHERE clauses generated by the JdbcRDD in {{getPartitions}}, I've found it generates the following:
{code}
(0) age >= 0 AND age <= 3
(1) age >= 4 AND age <= 7
(2) age >= 8 AND age <= 11
(3) age >= 12 AND age <= 15
(4) age >= 16 AND age <= 19
(5) age >= 20 AND age <= 23
(6) age >= 24 AND age <= 27
(7) age >= 28 AND age <= 31
(8) age >= 32 AND age <= 35
(9) age >= 36 AND age <= 40
{code}
This is the behaviour I was expecting from the Spark SQL version. Is the Spark SQL version buggy, or is this some weird expected behaviour?
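For readers comparing the two strategies, here is a minimal sketch of stride-based predicate generation that, like JdbcRDD's {{getPartitions}}, respects both bounds and assigns each row to exactly one partition. This is illustrative only and is not the {{JDBCRelation.columnPartition}} implementation; the method name and the closed-interval choice are assumptions.
{code}
// Illustrative only, not JDBCRelation.columnPartition. Generates closed-interval
// predicates so rows outside [lowerBound, upperBound] are excluded and no row
// falls into two partitions (for an integer-valued column).
def partitionPredicates(column: String,
                        lowerBound: Long,
                        upperBound: Long,
                        numPartitions: Int): Seq[String] = {
  val stride = (upperBound - lowerBound) / numPartitions
  (0 until numPartitions).map { i =>
    val start = lowerBound + i * stride
    val end = if (i == numPartitions - 1) upperBound else start + stride - 1
    s"$column >= $start AND $column <= $end"
  }
}

// partitionPredicates("age", 0, 40, 10) yields
//   age >= 0 AND age <= 3, age >= 4 AND age <= 7, ..., age >= 36 AND age <= 40
{code}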
[jira] [Updated] (SPARK-6772) spark sql error when running code on large number of records
[ https://issues.apache.org/jira/browse/SPARK-6772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aditya Parmar updated SPARK-6772: - Description: Hi all, I am getting an ArrayIndexOutOfBoundsException when I try to run a simple column-filtering query on a file with 2.5 lakh (250,000) records. It runs fine on a file with 2k records.
{code}
15/04/09 12:19:01 WARN TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1): java.lang.ArrayIndexOutOfBoundsException: 1
	at org.apache.spark.sql.catalyst.expressions.GenericRow.apply(Row.scala:142)
	at org.apache.spark.sql.catalyst.expressions.BoundReference.eval(BoundAttribute.scala:37)
	at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:68)
	at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:52)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1060)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1047)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
	at org.apache.spark.scheduler.Task.run(Task.scala:56)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:744)
15/04/09 12:19:01 INFO TaskSetManager: Starting task 1.1 in stage 0.0 (TID 2, blrwfl11189.igatecorp.com, PROCESS_LOCAL, 1349 bytes)
15/04/09 12:19:01 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0): java.lang.ArrayIndexOutOfBoundsException
15/04/09 12:19:01 INFO TaskSetManager: Starting task 0.1 in stage 0.0 (TID 3, blrwfl11189.igatecorp.com, PROCESS_LOCAL, 1349 bytes)
15/04/09 12:19:01 INFO TaskSetManager: Lost task 1.1 in stage 0.0 (TID 2) on executor : java.lang.ArrayIndexOutOfBoundsException (null) [duplicate 1]
15/04/09 12:19:01 INFO TaskSetManager: Starting task 1.2 in stage 0.0 (TID 4, blrwfl11189.igatecorp.com, PROCESS_LOCAL, 1349 bytes)
15/04/09 12:19:01 INFO TaskSetManager: Lost task 1.2 in stage 0.0 (TID 4) on executor : java.lang.ArrayIndexOutOfBoundsException (null) [duplicate 2]
15/04/09 12:19:01 INFO TaskSetManager: Starting task 1.3 in stage 0.0 (TID 5, blrwfl11189.igatecorp.com, PROCESS_LOCAL, 1349 bytes)
15/04/09 12:19:01 INFO TaskSetManager: Lost task 0.1 in stage 0.0 (TID 3) on executor : java.lang.ArrayIndexOutOfBoundsException (null) [duplicate 3]
15/04/09 12:19:01 INFO TaskSetManager: Starting task 0.2 in stage 0.0 (TID 6, blrwfl11189.igatecorp.com, PROCESS_LOCAL, 1349 bytes)
15/04/09 12:19:02 INFO TaskSetManager: Lost task 1.3 in stage 0.0 (TID 5) on executor : java.lang.ArrayIndexOutOfBoundsException (null) [duplicate 4]
15/04/09 12:19:02 ERROR TaskSetManager: Task 1 in stage 0.0 failed 4 times; aborting job
15/04/09 12:19:02 INFO TaskSchedulerImpl: Cancelling stage 0
15/04/09 12:19:02 INFO TaskSchedulerImpl: Stage 0 was cancelled
15/04/09 12:19:02 INFO DAGScheduler: Job 0 failed: saveAsTextFile at JavaSchemaRDD.scala:42, took 1.958621 s
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 4 times, most recent failure: Lost task 1.3 in stage 0.0 (TID 5, ): java.lang.ArrayIndexOutOfBoundsException
Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1202)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
	at scala.Option.foreach(Option.scala:236)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:696)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1420)
	at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
{code}
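The failing frame ({{GenericRow.apply}} via {{BoundReference.eval}}) is typical of a malformed input line yielding fewer fields than the schema expects. Below is a hypothetical, minimal sketch of a defensive row-mapping step; the file name, delimiter, and schema are assumptions for illustration and are not taken from the reporter's job.
{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical schema standing in for the reporter's file layout.
val schema = StructType(Seq(
  StructField("col1", StringType),
  StructField("col2", StringType)))

val sc = new SparkContext(new SparkConf().setAppName("filter-columns-sketch"))

// Keep empty trailing fields (split limit -1) and drop rows that are too short,
// so a projection never indexes past the end of a Row.
val rows = sc.textFile("input.txt")
  .map(_.split(",", -1))
  .filter(_.length >= schema.length)
  .map(fields => Row(fields(0), fields(1)))
{code}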
[jira] [Assigned] (SPARK-6796) Add the batch list to StreamingPage
[ https://issues.apache.org/jira/browse/SPARK-6796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das reassigned SPARK-6796: Assignee: Shixiong Zhu Add the batch list to StreamingPage --- Key: SPARK-6796 URL: https://issues.apache.org/jira/browse/SPARK-6796 Project: Spark Issue Type: Sub-task Components: Streaming, Web UI Reporter: Shixiong Zhu Assignee: Shixiong Zhu Show the list of active and completed batches on the StreamingPage, as proposed in Task 1 of https://docs.google.com/document/d/1-ZjvQ_2thWEQkTxRMHrVdnEI57XTi3wZEBUoqrrDg5c/edit?usp=sharing -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6770) DirectKafkaInputDStream has not been initialized when recovery from checkpoint
[ https://issues.apache.org/jira/browse/SPARK-6770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487103#comment-14487103 ] Sean Owen commented on SPARK-6770: -- That may be so, but it's not obvious that you simply can't use Spark SQL with Streaming recovery. For example, the final error makes it sound like it very nearly works. Perhaps you just need to use a different constructor to specify the SQLConf? Maybe this value should be serialized with some object? It might be something that is hard to make work now, but I wonder if there is an easy fix to make the SQL objects recoverable. DirectKafkaInputDStream has not been initialized when recovery from checkpoint -- Key: SPARK-6770 URL: https://issues.apache.org/jira/browse/SPARK-6770 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0 Reporter: yangping wu I read data from Kafka using the createDirectStream method and save the received log to MySQL; the code snippets are as follows:
{code}
def functionToCreateContext(): StreamingContext = {
  val sparkConf = new SparkConf()
  val sc = new SparkContext(sparkConf)
  val ssc = new StreamingContext(sc, Seconds(10))
  ssc.checkpoint("/tmp/kafka/channel/offset")  // set checkpoint directory
  ssc
}

val struct = StructType(StructField("log", StringType) :: Nil)
// Get StreamingContext from checkpoint data or create a new one
val ssc = StreamingContext.getOrCreate("/tmp/kafka/channel/offset", functionToCreateContext)
val SDB = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)
val sqlContext = new org.apache.spark.sql.SQLContext(ssc.sparkContext)
SDB.foreachRDD(rdd => {
  val result = rdd.map(item => {
    println(item)
    val result = item._2 match {
      case e: String => Row.apply(e)
      case _ => Row.apply()
    }
    result
  })
  println(result.count())
  val df = sqlContext.createDataFrame(result, struct)
  df.insertIntoJDBC(url, "test", overwrite = false)
})
ssc.start()
ssc.awaitTermination()
ssc.stop()
{code}
But when I recover the program from the checkpoint, I encounter an exception:
{code}
Exception in thread "main" org.apache.spark.SparkException: org.apache.spark.streaming.kafka.DirectKafkaInputDStream@41a80e5a has not been initialized
	at org.apache.spark.streaming.dstream.DStream.isTimeValid(DStream.scala:266)
	at org.apache.spark.streaming.dstream.InputDStream.isTimeValid(InputDStream.scala:51)
	at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287)
	at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287)
	at scala.Option.orElse(Option.scala:257)
	at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:284)
	at org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:38)
	at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116)
	at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116)
	at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
	at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
	at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
	at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
	at org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:116)
	at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$restart$4.apply(JobGenerator.scala:223)
	at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$restart$4.apply(JobGenerator.scala:218)
	at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
	at org.apache.spark.streaming.scheduler.JobGenerator.restart(JobGenerator.scala:218)
	at org.apache.spark.streaming.scheduler.JobGenerator.start(JobGenerator.scala:89)
	at org.apache.spark.streaming.scheduler.JobScheduler.start(JobScheduler.scala:67)
	at org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:512)
	at logstatstreaming.UserChannelTodb$.main(UserChannelTodb.scala:57)
	at logstatstreaming.UserChannelTodb.main(UserChannelTodb.scala)
	at
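On the question of making the SQL objects recoverable: a common pattern, shown below as a sketch under the assumption that the captured SQLContext and the DStream built outside the factory are the problem (this is not the resolution recorded on this ticket), is to create every DStream inside the function passed to getOrCreate and to obtain the SQLContext lazily inside foreachRDD, so nothing SQL-related ends up in the checkpoint:
{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Lazily created, one SQLContext per JVM; it is never captured in checkpointed
// state, so recovery only has to restore the DStream graph.
object SQLContextSingleton {
  @transient private var instance: SQLContext = _
  def getInstance(sc: SparkContext): SQLContext = synchronized {
    if (instance == null) instance = new SQLContext(sc)
    instance
  }
}

def createContext(checkpointDir: String): StreamingContext = {
  val ssc = new StreamingContext(new SparkConf(), Seconds(10))
  ssc.checkpoint(checkpointDir)
  // The Kafka direct stream and all output operations must be set up here,
  // inside the factory, so that they are recreated on checkpoint recovery:
  //   val stream = KafkaUtils.createDirectStream[...](ssc, kafkaParams, topics)
  //   stream.foreachRDD { rdd =>
  //     val sqlContext = SQLContextSingleton.getInstance(rdd.sparkContext)
  //     // build the DataFrame and write it to MySQL here
  //   }
  ssc
}

val checkpointDir = "/tmp/kafka/channel/offset"
val ssc = StreamingContext.getOrCreate(checkpointDir, () => createContext(checkpointDir))
ssc.start()
ssc.awaitTermination()
{code}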
[jira] [Assigned] (SPARK-6807) Merge recent changes in SparkR-pkg into Spark
[ https://issues.apache.org/jira/browse/SPARK-6807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6807: --- Assignee: Davies Liu (was: Apache Spark) Merge recent changes in SparkR-pkg into Spark - Key: SPARK-6807 URL: https://issues.apache.org/jira/browse/SPARK-6807 Project: Spark Issue Type: New Feature Components: SparkR Reporter: Davies Liu Assignee: Davies Liu Priority: Blocker A few new features landed in SparkR-pkg while the merge was in progress; we should pull them all in. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6807) Merge recent changes in SparkR-pkg into Spark
[ https://issues.apache.org/jira/browse/SPARK-6807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487687#comment-14487687 ] Apache Spark commented on SPARK-6807: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/5436 Merge recent changes in SparkR-pkg into Spark - Key: SPARK-6807 URL: https://issues.apache.org/jira/browse/SPARK-6807 Project: Spark Issue Type: New Feature Components: SparkR Reporter: Davies Liu Assignee: Davies Liu Priority: Blocker A few new features landed in SparkR-pkg while the merge was in progress; we should pull them all in. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6807) Merge recent changes in SparkR-pkg into Spark
[ https://issues.apache.org/jira/browse/SPARK-6807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6807: --- Assignee: Apache Spark (was: Davies Liu) Merge recent changes in SparkR-pkg into Spark - Key: SPARK-6807 URL: https://issues.apache.org/jira/browse/SPARK-6807 Project: Spark Issue Type: New Feature Components: SparkR Reporter: Davies Liu Assignee: Apache Spark Priority: Blocker A few new features landed in SparkR-pkg while the merge was in progress; we should pull them all in. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6812) filter() on DataFrame does not work as expected
[ https://issues.apache.org/jira/browse/SPARK-6812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14488128#comment-14488128 ] Shivaram Venkataraman commented on SPARK-6812: -- Hmm, don't we have a unit test for this? I'm wondering if this is because of the generics not resolving correctly. filter() on DataFrame does not work as expected --- Key: SPARK-6812 URL: https://issues.apache.org/jira/browse/SPARK-6812 Project: Spark Issue Type: Bug Components: SparkR Reporter: Davies Liu Priority: Blocker
{code}
filter(df, df$age > 21)
Error in filter(df, df$age > 21) : no method for coercing this S4 class to a vector
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3947) [Spark SQL] UDAF Support
[ https://issues.apache.org/jira/browse/SPARK-3947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487175#comment-14487175 ] Takeshi Yamamuro commented on SPARK-3947: - See SPARK-4233; we are refactoring the interfaces of Aggregate before it supports UDAFs. https://issues.apache.org/jira/browse/SPARK-4233 [Spark SQL] UDAF Support Key: SPARK-3947 URL: https://issues.apache.org/jira/browse/SPARK-3947 Project: Spark Issue Type: Improvement Components: SQL Reporter: Pei-Lun Lee Assignee: Venkata Ramana G Right now only Hive UDAFs are supported. It would be nice to have UDAFs, similar to UDFs registered through SQLContext.registerFunction. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
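To make the request concrete, here is a purely illustrative sketch of what a non-Hive UDAF interface and registration could look like. The trait name, its methods, and registerAggregateFunction are hypothetical; they are not an existing SQLContext API at the time of this discussion.
{code}
// Purely illustrative sketch of a possible user-defined aggregate contract.
// Names are hypothetical and not part of SQLContext.
trait UserDefinedAggregate[IN, BUF, OUT] {
  def zero: BUF                       // initial buffer value
  def reduce(buf: BUF, in: IN): BUF   // fold one input row into the buffer
  def merge(b1: BUF, b2: BUF): BUF    // combine partial buffers across partitions
  def finish(buf: BUF): OUT           // produce the final aggregate value
}

// Example: a geometric-mean aggregate over Double columns.
object GeometricMean extends UserDefinedAggregate[Double, (Double, Long), Double] {
  def zero = (1.0, 0L)
  def reduce(buf: (Double, Long), in: Double) = (buf._1 * in, buf._2 + 1)
  def merge(b1: (Double, Long), b2: (Double, Long)) = (b1._1 * b2._1, b1._2 + b2._2)
  def finish(buf: (Double, Long)) = math.pow(buf._1, 1.0 / buf._2)
}

// Hypothetical registration, mirroring sqlContext.registerFunction for UDFs:
// sqlContext.registerAggregateFunction("geo_mean", GeometricMean)
// sqlContext.sql("SELECT dept, geo_mean(salary) FROM employees GROUP BY dept")
{code}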