[jira] [Closed] (SPARK-12040) Add toJson/fromJson to Vector/Vectors for PySpark

2015-12-08 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley closed SPARK-12040.
-
Resolution: Not A Problem

> Add toJson/fromJson to Vector/Vectors for PySpark
> -
>
> Key: SPARK-12040
> URL: https://issues.apache.org/jira/browse/SPARK-12040
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, PySpark
>Reporter: Yanbo Liang
>Priority: Trivial
>  Labels: starter
>
> Add toJson/fromJson to Vector/Vectors for PySpark; please refer to the Scala 
> version in SPARK-11766.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11605) ML 1.6 QA: API: Java compatibility, docs

2015-12-08 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-11605.
---
   Resolution: Fixed
Fix Version/s: 1.6.1
   2.0.0

Issue resolved by pull request 10102
[https://github.com/apache/spark/pull/10102]

> ML 1.6 QA: API: Java compatibility, docs
> 
>
> Key: SPARK-11605
> URL: https://issues.apache.org/jira/browse/SPARK-11605
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, Java API, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: yuhao yang
> Fix For: 2.0.0, 1.6.1
>
>
> Check Java compatibility for MLlib for this release.
> Checking compatibility means:
> * comparing with the Scala doc
> * verifying that Java docs are not messed up by Scala type incompatibilities. 
>  Some items to look out for are:
> ** Check for generic "Object" types where Java cannot understand complex 
> Scala types.
> *** *Note*: The Java docs do not always match the bytecode. If you find a 
> problem, please verify it using {{javap}}.
> ** Check Scala objects (especially with nesting!) carefully.
> ** Check for uses of Scala and Java enumerations, which can show up oddly in 
> the other language's doc.
> * If needed for complex issues, create small Java unit tests which execute 
> each method.  (The correctness can be checked in Scala.)
> If you find issues, please comment here, or for larger items, create separate 
> JIRAs and link here.
> Note that we should not break APIs from previous releases.  So if you find a 
> problem, check if it was introduced in this Spark release (in which case we 
> can fix it) or in a previous one (in which case we can create a java-friendly 
> version of the API).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12071) Programming guide should explain NULL in JVM translates to NA in R

2015-12-08 Thread Xin Ren (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15047229#comment-15047229
 ] 

Xin Ren commented on SPARK-12071:
-

Hi, I'd like to take up this ticket if no one is working on it.

By the way, what is the expected result of this ticket? As I understand it, the 
goal is to add an explanation to the README? 
https://github.com/apache/spark/blob/master/R/README.md

> Programming guide should explain NULL in JVM translates to NA in R
> -
>
> Key: SPARK-12071
> URL: https://issues.apache.org/jira/browse/SPARK-12071
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.6.0
>Reporter: Felix Cheung
>Priority: Minor
>  Labels: releasenotes, starter
>
> This behavior seems to be new for Spark 1.6.0



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8517) Improve the organization and style of MLlib's user guide

2015-12-08 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15047270#comment-15047270
 ] 

Joseph K. Bradley commented on SPARK-8517:
--

+1 on copying content

> Improve the organization and style of MLlib's user guide
> 
>
> Key: SPARK-8517
> URL: https://issues.apache.org/jira/browse/SPARK-8517
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, ML, MLlib
>Reporter: Xiangrui Meng
>Assignee: Timothy Hunter
>
> The current MLlib user guide (and the spark.ml one), especially the main page, 
> doesn't have a nice style. We could update it and reorganize the content to 
> make it easier to navigate.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12211) Incorrect version number in graphx doc for migration from 1.1

2015-12-08 Thread Andrew Ray (JIRA)
Andrew Ray created SPARK-12211:
--

 Summary: Incorrect version number in graphx doc for migration from 
1.1
 Key: SPARK-12211
 URL: https://issues.apache.org/jira/browse/SPARK-12211
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, GraphX
Affects Versions: 1.5.2, 1.5.1, 1.5.0, 1.4.1, 1.4.0, 1.3.1, 1.3.0, 1.2.2, 
1.2.1, 1.2.0, 1.6.0
Reporter: Andrew Ray
Priority: Minor


The "Migrating from Spark 1.1" section added to the GraphX doc in 1.2.0 (see 
https://spark.apache.org/docs/1.2.0/graphx-programming-guide.html#migrating-from-spark-11)
uses {{site.SPARK_VERSION}} as the version where the changes were introduced; it 
should be just 1.2.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12159) Add user guide section for IndexToString transformer

2015-12-08 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-12159.
---
   Resolution: Fixed
Fix Version/s: 1.6.1
   2.0.0

Issue resolved by pull request 10166
[https://github.com/apache/spark/pull/10166]

> Add user guide section for IndexToString transformer
> 
>
> Key: SPARK-12159
> URL: https://issues.apache.org/jira/browse/SPARK-12159
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Benjamin Fradet
>Priority: Minor
> Fix For: 2.0.0, 1.6.1
>
>
> Add a user guide section for the IndexToString transformer as reported on the 
> mailing list ( 
> https://www.mail-archive.com/dev@spark.apache.org/msg12263.html )
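For reference, a minimal usage sketch of the kind the guide section could show (the DataFrame {{df}} and column names are illustrative):

{code}
import org.apache.spark.ml.feature.{IndexToString, StringIndexer}

// Hypothetical DataFrame `df` with a string column "category".
val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .fit(df)
val indexed = indexer.transform(df)

// IndexToString maps the indices back to the original string labels, reading
// them from the column metadata that StringIndexer attached.
val converter = new IndexToString()
  .setInputCol("categoryIndex")
  .setOutputCol("originalCategory")
val converted = converter.transform(indexed)
{code}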



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11602) ML 1.6 QA: API: New Scala APIs, docs

2015-12-08 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-11602:
--
Shepherd: Joseph K. Bradley

> ML 1.6 QA: API: New Scala APIs, docs
> 
>
> Key: SPARK-11602
> URL: https://issues.apache.org/jira/browse/SPARK-11602
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: yuhao yang
>
> Audit new public Scala APIs added to MLlib.  Take note of:
> * Protected/public classes or methods.  If access can be more private, then 
> it should be.
> * Also look for non-sealed traits.
> * Documentation: Missing?  Bad links or formatting?
> *Make sure to check the object doc!*
> As you find issues, please comment here, or better yet create JIRAs and link 
> them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9670) Examples: Check for new APIs requiring example code

2015-12-08 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-9670.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

I'm going to mark this as resolved since some tasks are done and most of the 
work is actually happening as part of [SPARK-11606]

> Examples: Check for new APIs requiring example code
> ---
>
> Key: SPARK-9670
> URL: https://issues.apache.org/jira/browse/SPARK-9670
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Timothy Hunter
>Priority: Minor
> Fix For: 1.6.0
>
>
> Audit list of new features added to MLlib, and see which major items are 
> missing example code (in the examples folder).  We do not need examples for 
> everything, only for major items such as new ML algorithms.
> For any such items:
> * Create a JIRA for that feature, and assign it to the author of the feature 
> (or yourself if interested).
> * Link it to (a) the original JIRA which introduced that feature ("related 
> to") and (b) to this JIRA ("requires").



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8855) Python API for Association Rules

2015-12-08 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15047432#comment-15047432
 ] 

Joseph K. Bradley commented on SPARK-8855:
--

{quote}
1) Create the class "Association Rules" inside the "fpm.py" file
1.1) Method train(data, minConfidence), which will generate the association 
rules for a dataset with the specified minConfidence (default 0.6).  This method 
will call "trainAssociationRules" from the PythonMLLibAPI with the 
parameters data and minConfidence. Returns an FPGrowthModel.
{quote}

This won't be needed for now since the only public API is via FPGrowthModel.  
(Users can't construct an AssociationRules instance since the constructor is 
private.)

{quote}
1.2) Class Rule, which will be a namedtuple and represents an antecedent or 
consequent tuple.
{quote}

Sounds good.

{quote}
2) Add the method generateAssociationRules to FPGrowthModel class (inside 
fpm.py). This method will map the Rules generated (calling the method 
"getAssociationRule" from FPGrowthModelWrapper) to the namedtuple.
{quote}

Sounds good, but I'd use the same name for the method as in FPGrowthModel.

{quote}
Now comes my real problem: how do I make trainAssociationRules return an 
FPGrowthModel to the wrapper, so the wrapper can map the rules received to the 
antecedent/consequent? I can't make trainAssociationRules return an 
FPGrowthModel. The wrapper for association rules is in FPGrowthModelWrapper, 
right? Is something wrong with this idea?
{quote}

I hope the above answers simplify this problem.  You'll just need to be able to 
return an RDD of Rule objects, which you could do by either (a) writing custom 
serialization or (b) passing via a DataFrame, which could handle serialization. 
 You can find examples of both code paths in the Python-Scala interface, so I'd 
do whichever is simpler.
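
For option (b), a rough Scala-side sketch of what the wrapper could hand back (the helper name and the SQLContext plumbing are illustrative, not the actual PythonMLLibAPI code):

{code}
import org.apache.spark.mllib.fpm.FPGrowthModel
import org.apache.spark.sql.{DataFrame, SQLContext}

// Hypothetical helper: expose the model's rules as a DataFrame so the Python
// side can turn each row into an (antecedent, consequent, confidence) namedtuple.
def getAssociationRules(model: FPGrowthModel[String],
                        minConfidence: Double,
                        sqlContext: SQLContext): DataFrame = {
  import sqlContext.implicits._
  model.generateAssociationRules(minConfidence)
    .map(r => (r.antecedent.toSeq, r.consequent.toSeq, r.confidence))
    .toDF("antecedent", "consequent", "confidence")
}
{code}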

> Python API for Association Rules
> 
>
> Key: SPARK-8855
> URL: https://issues.apache.org/jira/browse/SPARK-8855
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Feynman Liang
>Priority: Minor
>
> A simple Python wrapper and doctests need to be written for Association 
> Rules. The relevant method is {{FPGrowthModel.generateAssociationRules}}. The 
> code will likely live in {{fpm.py}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11965) Update user guide for RFormula feature interactions

2015-12-08 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-11965:
--
Description: Update the user guide for RFormula to cover feature 
interactions  (was: [~ekhliang] Could you please update the user guide for 
RFormula to cover feature interactions?  Thanks!)

> Update user guide for RFormula feature interactions
> ---
>
> Key: SPARK-11965
> URL: https://issues.apache.org/jira/browse/SPARK-11965
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Reporter: Joseph K. Bradley
>
> Update the user guide for RFormula to cover feature interactions
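
For reference, a minimal sketch of what the guide could show (the dataset and column names are illustrative; ":" denotes an interaction term and "*" expands to main effects plus the interaction, following R's formula syntax):

{code}
import org.apache.spark.ml.feature.RFormula

// Hypothetical dataset with columns "clicked", "country", and "hour".
val formula = new RFormula()
  .setFormula("clicked ~ country:hour")
  .setFeaturesCol("features")
  .setLabelCol("label")

val output = formula.fit(dataset).transform(dataset)
{code}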



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11965) Update user guide for RFormula feature interactions

2015-12-08 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-11965:
--
Assignee: (was: Eric Liang)

> Update user guide for RFormula feature interactions
> ---
>
> Key: SPARK-11965
> URL: https://issues.apache.org/jira/browse/SPARK-11965
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Reporter: Joseph K. Bradley
>
> [~ekhliang] Could you please update the user guide for RFormula to cover 
> feature interactions?  Thanks!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12213) Query with only one distinct should not have an Expand

2015-12-08 Thread Davies Liu (JIRA)
Davies Liu created SPARK-12213:
--

 Summary: Query with only one distinct should not have an Expand
 Key: SPARK-12213
 URL: https://issues.apache.org/jira/browse/SPARK-12213
 Project: Spark
  Issue Type: Improvement
Reporter: Davies Liu


Expand will double the number of records and slow down projection and 
aggregation. It is better to generate a plan without Expand for a query with 
only one distinct aggregate (for example, ss_max in TPC-DS).
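
For illustration, a query of this shape (table and column names loosely follow the TPC-DS ss_max query and are not the actual benchmark text); explain() shows whether an Expand operator appears in the physical plan:

{code}
val df = sqlContext.sql(
  """SELECT count(*) AS total,
    |       count(ss_customer_sk) AS not_null,
    |       count(DISTINCT ss_customer_sk) AS distinct_customers,
    |       max(ss_quantity) AS max_qty
    |FROM store_sales""".stripMargin)

// With only one DISTINCT aggregate, a plan without Expand should be possible.
df.explain()
{code}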



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12159) Add user guide section for IndexToString transformer

2015-12-08 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-12159:
--
Target Version/s: 1.6.0

> Add user guide section for IndexToString transformer
> 
>
> Key: SPARK-12159
> URL: https://issues.apache.org/jira/browse/SPARK-12159
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Benjamin Fradet
>Priority: Minor
>
> Add a user guide section for the IndexToString transformer as reported on the 
> mailing list ( 
> https://www.mail-archive.com/dev@spark.apache.org/msg12263.html )



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11965) Update user guide for RFormula feature interactions

2015-12-08 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15047475#comment-15047475
 ] 

Joseph K. Bradley commented on SPARK-11965:
---

[~yanboliang] Would you be able to take this JIRA?  The feature author won't 
have time to.

> Update user guide for RFormula feature interactions
> ---
>
> Key: SPARK-11965
> URL: https://issues.apache.org/jira/browse/SPARK-11965
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Reporter: Joseph K. Bradley
>
> Update the user guide for RFormula to cover feature interactions



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12214) Spark to provide an API to save user-defined metadata as part of Sequence File header

2015-12-08 Thread Naveen Pishe (JIRA)
Naveen Pishe created SPARK-12214:


 Summary: Spark to provide an API to save user-defined metadata as 
part of Sequence File header
 Key: SPARK-12214
 URL: https://issues.apache.org/jira/browse/SPARK-12214
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: Naveen Pishe


There is a requirement to save user-defined metadata as part of the SequenceFile 
header using Spark.

To write user-defined metadata into a SequenceFile with the regular Hadoop APIs, 
I pass the metadata object to the SequenceFile.Writer constructor, which writes 
the metadata into the sequence file header when it creates the file.

Currently Spark's JavaPairRDD API provides methods to save an RDD in SequenceFile 
format, but I don't see any API that either exposes the SequenceFile.Writer or 
accepts user-defined metadata to be written into the sequence file header.

This enhancement requests an API for this. Creating a union of two RDDs 
(header + data) and saving it as a sequence file is not a solution, since that 
does not produce a real header. Is there any way I can achieve this?
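
For reference, the plain Hadoop API usage being described looks roughly like this (paths and key/value types are illustrative, and the exact Writer option methods depend on the Hadoop version). As the report notes, the RDD save APIs do not expose a way to pass such a Metadata object through to the writer.

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{IntWritable, SequenceFile, Text}

val conf = new Configuration()

// User-defined metadata that ends up in the sequence file header.
val metadata = new SequenceFile.Metadata()
metadata.set(new Text("source"), new Text("my-app"))

val writer = SequenceFile.createWriter(conf,
  SequenceFile.Writer.file(new Path("/tmp/with-metadata.seq")),
  SequenceFile.Writer.keyClass(classOf[IntWritable]),
  SequenceFile.Writer.valueClass(classOf[Text]),
  SequenceFile.Writer.metadata(metadata))

writer.append(new IntWritable(1), new Text("some value"))
writer.close()
{code}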




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12236) JDBC filter tests all pass if filters are not really pushed down

2015-12-08 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-12236:
-
Description: 
It is similar to https://issues.apache.org/jira/browse/SPARK-11676 and 
https://issues.apache.org/jira/browse/SPARK-11677.

Currently the JDBC predicate tests all pass even if filters are not pushed down.

This is because of Spark-side filtering.

Moreover, {{Not(Equal)}} is also being tested, even though it is actually not 
pushed down to the JDBC data source.

  was:
It is similar with https://issues.apache.org/jira/browse/SPARK-11676 and 
https://issues.apache.org/jira/browse/SPARK-11677.

Currently JDBC predicate tests all pass even if filters are not pushed down or 
this is disabled.

This is because of Spark-side filtering. 

Moreover, {{Not(Equal)}} is also being tested which is actually not pushed down 
to JDBC datasource.


> JDBC filter tests all pass if filters are not really pushed down
> 
>
> Key: SPARK-12236
> URL: https://issues.apache.org/jira/browse/SPARK-12236
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Reporter: Hyukjin Kwon
>
> It is similar to https://issues.apache.org/jira/browse/SPARK-11676 and 
> https://issues.apache.org/jira/browse/SPARK-11677.
> Currently the JDBC predicate tests all pass even if filters are not pushed down.
> This is because of Spark-side filtering.
> Moreover, {{Not(Equal)}} is also being tested, even though it is actually not 
> pushed down to the JDBC data source.
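
One way a test could actually assert pushdown (a rough sketch against the 1.x plan classes; the connection settings and the exact check used in the eventual fix are illustrative):

{code}
import java.util.Properties

// Illustrative connection settings, not the real test fixture.
val url = "jdbc:h2:mem:testdb0"
val properties = new Properties()

val df = sqlContext.read.jdbc(url, "PEOPLE", properties).filter("THEID = 1")

// If the predicate is really pushed to the JDBC source, the Spark-side plan
// should not need its own Filter node on top of the scan.
val sparkSideFilters = df.queryExecution.executedPlan.collect {
  case f: org.apache.spark.sql.execution.Filter => f
}
assert(sparkSideFilters.isEmpty,
  "filter was evaluated on the Spark side instead of being pushed down")
{code}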



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12233) Cannot specify a data frame column during join

2015-12-08 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048232#comment-15048232
 ] 

Xiao Li commented on SPARK-12233:
-

I am unable to reproduce your error. Could you try running my example and see 
whether you hit the same error?

{code}
sqlContext.udf.register("lowercase", (s: String) => {
  if (null == s) "" else s.toLowerCase
})

val df1 = Seq(1, 2, 3).map(i => (i, i.toString, i.toString)).toDF("int", "str1", "str2")
df1.registerTempTable("testTable")

val df3 = sqlContext.sql(
  """SELECT lowercase(str2) AS emailaddr, int, str1, str2
     FROM testTable""").distinct()

val df4 = df3.groupBy("int", "str1", "str2").count()

val res = df4.where("count > 1").drop(count("count"))
  .join(df3, Seq("int", "str1", "str2"))
  .select("int", "str1", "str2", "emailaddr")
  .collect()
{code}

> Cannot specify a data frame column during join
> --
>
> Key: SPARK-12233
> URL: https://issues.apache.org/jira/browse/SPARK-12233
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Fengdong Yu
>Priority: Minor
>
> {code}
> sqlContext.udf.register("lowercase", (s: String) =>{
>   if (null == s) "" else s.toLowerCase
> })
> 
> sqlContext.udf.register("substr", (s: String) =>{
>   if (null == s) ""
>   else {
> val index = s.indexOf("@")
> if (index < 0) s else s.toLowerCase.substring(index + 1)}
> })
> 
> sqlContext.read.orc("/data/test/test.data")
> .registerTempTable("testTable")
> 
> val extracted = 
> sqlContext.sql(""" SELECT lowercase(given_name) AS given_name, 
> lowercase(family_name) AS 
> family_name, 
> substr(email_address) AS domain, 
> lowercase(email_address) AS emailaddr,
> experience
>   
>  FROM testTable 
>  WHERE email_address != '' 
>  """)
>  .distinct
> 
> val count =
>  extracted.groupBy("given_name", "family_name", "domain")
>.count
> 
> count.where(count("count") > 1)
>  .drop(count("count"))
>  .join(extracted, Seq("given_name", "family_name", "domain"))
> {code}
> {color:red} .select(count("given_name"), count("family_name"), 
> extracted("emailaddr"))  {color}
> Red Font should be:
> {color:red} select("given_name", "family_name", "emailaddr") {color}
> 
> {code}
> org.apache.spark.sql.AnalysisException: resolved attribute(s) emailaddr#525 
> missing from 
> given_name#522,domain#524,url#517,family_name#523,emailaddr#532,experience#490
>  in operator !Project [given_name#522,family_name#523,emailaddr#525];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914)
>   at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:132)
>   at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154)
>   at org.apache.spark.sql.DataFrame.select(DataFrame.scala:691)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:38)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:43)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:45)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:47)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:49)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:51)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:53)
>   at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:55)
>   at $iwC$$iwC$$iwC$$iwC.<init>(<console>:57)
>   at $iwC$$iwC$$iwC.<init>(<console>:59)
>   at $iwC$$iwC.<init>(<console>:61)
>   at $iwC.<init>(<console>:63)
>   at <init>(<console>:65)
>   at .<init>(<console>:69)
>   at .<clinit>()
>  

[jira] [Updated] (SPARK-12236) JDBC filter tests all pass if filters are not really pushed down

2015-12-08 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-12236:
-
Affects Version/s: 1.6.0

> JDBC filter tests all pass if filters are not really pushed down
> 
>
> Key: SPARK-12236
> URL: https://issues.apache.org/jira/browse/SPARK-12236
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Hyukjin Kwon
>
> It is similar to https://issues.apache.org/jira/browse/SPARK-11676 and 
> https://issues.apache.org/jira/browse/SPARK-11677.
> Currently the JDBC predicate tests all pass even if filters are not pushed down.
> This is because of Spark-side filtering.
> Moreover, {{Not(Equal)}} is also being tested, even though it is actually not 
> pushed down to the JDBC data source.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12231) Failed to generate predicate Error when using dropna

2015-12-08 Thread kevin yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048237#comment-15048237
 ] 

kevin yu commented on SPARK-12231:
--

Hello Yahsuan: I am looking at this problem now and I can recreate it. 
When you say 'if write data without partitionBy, the error won't happen', 
did you try it like this?

df1.write.parquet('./data')

df2 = sqlc.read.parquet('./data')
df2.dropna()
df2.count()

I tried without partitionBy, and using 

df2 = sqlc.read.parquet('./data')
df2.dropna().count()

I still get the exception.

I will update with my progress. Thanks.


> Failed to generate predicate Error when using dropna
> 
>
> Key: SPARK-12231
> URL: https://issues.apache.org/jira/browse/SPARK-12231
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.5.2
> Environment: python version: 2.7.9
> os: ubuntu 14.04
>Reporter: yahsuan, chang
>
> code to reproduce error
> # write.py
> import pyspark
> sc = pyspark.SparkContext()
> sqlc = pyspark.SQLContext(sc)
> df = sqlc.range(10)
> df1 = df.withColumn('a', df['id'] * 2)
> df1.write.partitionBy('id').parquet('./data')
> # read.py
> import pyspark
> sc = pyspark.SparkContext()
> sqlc = pyspark.SQLContext(sc)
> df2 = sqlc.read.parquet('./data')
> df2.dropna().count()
> $ spark-submit write.py
> $ spark-submit read.py
> # error message
> 15/12/08 17:20:34 ERROR Filter: Failed to generate predicate, fallback to 
> interpreted org.apache.spark.sql.catalyst.errors.package$TreeNodeException: 
> Binding attribute, tree: a#0L
> ...
> If the data is written without partitionBy, the error does not happen.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12236) JDBC filter tests all pass if filters are not really pushed down

2015-12-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12236:


Assignee: (was: Apache Spark)

> JDBC filter tests all pass if filters are not really pushed down
> 
>
> Key: SPARK-12236
> URL: https://issues.apache.org/jira/browse/SPARK-12236
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Reporter: Hyukjin Kwon
>
> It is similar to https://issues.apache.org/jira/browse/SPARK-11676 and 
> https://issues.apache.org/jira/browse/SPARK-11677.
> Currently the JDBC predicate tests all pass even if filters are not pushed down.
> This is because of Spark-side filtering.
> Moreover, {{Not(Equal)}} is also being tested, even though it is actually not 
> pushed down to the JDBC data source.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12236) JDBC filter tests all pass if filters are not really pushed down

2015-12-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048218#comment-15048218
 ] 

Apache Spark commented on SPARK-12236:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/10221

> JDBC filter tests all pass if filters are not really pushed down
> 
>
> Key: SPARK-12236
> URL: https://issues.apache.org/jira/browse/SPARK-12236
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Reporter: Hyukjin Kwon
>
> It is similar to https://issues.apache.org/jira/browse/SPARK-11676 and 
> https://issues.apache.org/jira/browse/SPARK-11677.
> Currently the JDBC predicate tests all pass even if filters are not pushed down.
> This is because of Spark-side filtering.
> Moreover, {{Not(Equal)}} is also being tested, even though it is actually not 
> pushed down to the JDBC data source.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12235) Enhance mutate() to support replace existing columns

2015-12-08 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048217#comment-15048217
 ] 

Sun Rui commented on SPARK-12235:
-

[~felixcheung] I think so. Are there any more requirements in SPARK-10346 than in 
this JIRA?

> Enhance mutate() to support replace existing columns
> 
>
> Key: SPARK-12235
> URL: https://issues.apache.org/jira/browse/SPARK-12235
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 1.5.2
>Reporter: Sun Rui
>
> mutate() in the dplyr package supports adding new columns and replacing 
> existing columns. But currently the implementation of mutate() in SparkR 
> supports adding new columns only.
> Also make the behavior of mutate more consistent with that in dplyr:
> 1. Throw an error message when there are duplicated column names in the 
> DataFrame being mutated.
> 2. When there are duplicated column names among the columns specified by the 
> arguments, the last column with the same name takes effect.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12236) JDBC filter tests all pass if filters are not really pushed down

2015-12-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12236:


Assignee: Apache Spark

> JDBC filter tests all pass if filters are not really pushed down
> 
>
> Key: SPARK-12236
> URL: https://issues.apache.org/jira/browse/SPARK-12236
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>
> It is similar to https://issues.apache.org/jira/browse/SPARK-11676 and 
> https://issues.apache.org/jira/browse/SPARK-11677.
> Currently the JDBC predicate tests all pass even if filters are not pushed down.
> This is because of Spark-side filtering.
> Moreover, {{Not(Equal)}} is also being tested, even though it is actually not 
> pushed down to the JDBC data source.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12220) Make Utils.fetchFile support files that contain special characters

2015-12-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12220:


Assignee: (was: Apache Spark)

> Make Utils.fetchFile support files that contain special characters
> --
>
> Key: SPARK-12220
> URL: https://issues.apache.org/jira/browse/SPARK-12220
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Shixiong Zhu
>
> Now if a file name contains some illegal characters, such as " ", 
> Utils.fetchFile will fail because it doesn't handle this case.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12220) Make Utils.fetchFile support files that contain special characters

2015-12-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12220:


Assignee: Apache Spark

> Make Utils.fetchFile support files that contain special characters
> --
>
> Key: SPARK-12220
> URL: https://issues.apache.org/jira/browse/SPARK-12220
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>
> Now if a file name contains some illegal characters, such as " ", 
> Utils.fetchFile will fail because it doesn't handle this case.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12225) Support adding or replacing multiple columns at once in DataFrame API

2015-12-08 Thread Sun Rui (JIRA)
Sun Rui created SPARK-12225:
---

 Summary: Support adding or replacing multiple columns at once in 
DataFrame API
 Key: SPARK-12225
 URL: https://issues.apache.org/jira/browse/SPARK-12225
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 1.5.2
Reporter: Sun Rui


Currently, the withColumn() method of DataFrame supports adding or replacing only 
a single column. It would be convenient to support adding or replacing multiple 
columns at once.

Also, withColumnRenamed() supports renaming only a single column. It would also be 
convenient to support renaming multiple columns at once.
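
Until such an API exists, a common workaround is to fold over the columns (a sketch; {{df}} and the columns are illustrative, and withColumn already replaces an existing column of the same name, so folding gives "add or replace many" semantics):

{code}
import org.apache.spark.sql.functions.lit

// Hypothetical set of columns to add or replace in one pass.
val newCols = Seq("source" -> lit("batch"), "version" -> lit(1))

val result = newCols.foldLeft(df) { case (acc, (name, col)) =>
  acc.withColumn(name, col)
}
{code}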



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12187) *MemoryPool classes should not be public

2015-12-08 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-12187.

   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 10182
[https://github.com/apache/spark/pull/10182]

> *MemoryPool classes should not be public
> 
>
> Key: SPARK-12187
> URL: https://issues.apache.org/jira/browse/SPARK-12187
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Critical
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-9059) Update Python Direct Kafka Word count examples to show the use of HasOffsetRanges

2015-12-08 Thread Neelesh Srinivas Salian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15047132#comment-15047132
 ] 

Neelesh Srinivas Salian edited comment on SPARK-9059 at 12/8/15 11:15 PM:
--

Is this JIRA still active? 
[~BenFradet]
The PR seems to be closed. 

Shall I go ahead and begin working on it?

Thank you.


was (Author: neelesh77):
Is this JIRA still active? The PR seems to be closed. 

Shall I go ahead and begin working on it?

Thank you.

> Update Python Direct Kafka Word count examples to show the use of 
> HasOffsetRanges
> -
>
> Key: SPARK-9059
> URL: https://issues.apache.org/jira/browse/SPARK-9059
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming
>Reporter: Tathagata Das
>Priority: Minor
>  Labels: starter
>
> Update Python examples of Direct Kafka word count to access the offset ranges 
> using HasOffsetRanges and print it. For example in Scala,
>  
> {code}
> var offsetRanges: Array[OffsetRange] = _
> ...
> directKafkaDStream.foreachRDD { rdd =>
>   offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
> }
> ...
> transformedDStream.foreachRDD { rdd =>
>   // some operation
>   println("Processed ranges: " + offsetRanges)
> }
> {code}
> See https://spark.apache.org/docs/latest/streaming-kafka-integration.html for 
> more info, and the master source code for more updated information on python. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12220) Make Utils.fetchFile support files that contain special characters

2015-12-08 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-12220:


 Summary: Make Utils.fetchFile support files that contain special 
characters
 Key: SPARK-12220
 URL: https://issues.apache.org/jira/browse/SPARK-12220
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Shixiong Zhu


Now if a file name contains some illegal characters, such as " ", 
Utils.fetchFile will fail because it doesn't handle this case.
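
As an illustration of the underlying problem (not the actual Utils.fetchFile code path): a raw path containing a space is not a valid URI, while File.toURI percent-encodes it.

{code}
import java.io.File
import java.net.URI

// Throws java.net.URISyntaxException: a space is not a legal URI character.
// new URI("/tmp/file with space.jar")

// File.toURI escapes the space: file:/tmp/file%20with%20space.jar
val encoded = new File("/tmp/file with space.jar").toURI
{code}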



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9372) For a join operator, rows with null equal join key expression can be filtered out early

2015-12-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15047705#comment-15047705
 ] 

Apache Spark commented on SPARK-9372:
-

User 'nongli' has created a pull request for this issue:
https://github.com/apache/spark/pull/10209

> For a join operator, rows with null equal join key expression can be filtered 
> out early
> ---
>
> Key: SPARK-9372
> URL: https://issues.apache.org/jira/browse/SPARK-9372
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
>
> Taking {{select ... from A join B on (A.key = B.key)}} as an example, we can 
> filter out rows that have null values for column A.key/B.key because those 
> rows do not contribute to the result of the output.
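
In other words (hypothetical tables A and B registered as temp tables), the optimizer can rewrite the join as if the user had filtered the null keys up front:

{code}
val original = sqlContext.sql("SELECT * FROM A JOIN B ON (A.key = B.key)")

// Equivalent query: NULL keys can never satisfy the equality predicate,
// so filtering them out before the join does not change the result.
val withEarlyFilter = sqlContext.sql(
  """SELECT * FROM
    |  (SELECT * FROM A WHERE key IS NOT NULL) A
    |  JOIN (SELECT * FROM B WHERE key IS NOT NULL) B
    |  ON (A.key = B.key)""".stripMargin)
{code}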



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-9843) Catalyst: Allow adding custom optimizers

2015-12-08 Thread Robert Kruszewski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Kruszewski reopened SPARK-9843:
--

Made a new PR. Let me know if anyone has feedback.

> Catalyst: Allow adding custom optimizers
> 
>
> Key: SPARK-9843
> URL: https://issues.apache.org/jira/browse/SPARK-9843
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.4.1
>Reporter: Robert Kruszewski
>
> Currently there's only an option to plug into query planning, which provides 
> limited capability for optimizing queries.
> Allowing custom optimizers to be specified would let applications that use 
> Spark SQL tune their workflows. One example would be inserting 
> repartitions into the query plan, which isn't generally applicable, but if you 
> know the range of possible queries you can do that on the backend.
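
For reference, a custom optimization in Catalyst is just a Rule[LogicalPlan]; what is missing is a public hook to register it (the rule below is a no-op placeholder, not a real optimization):

{code}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Placeholder for an application-specific rewrite, e.g. inserting
// repartition operators for a known family of queries.
object MyCustomOptimization extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan
}
{code}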



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12222) deserialize RoaringBitmap using Kryo serializer throws Buffer underflow exception

2015-12-08 Thread Fei Wang (JIRA)
Fei Wang created SPARK-1:


 Summary: deserialize RoaringBitmap using Kryo serializer throws 
Buffer underflow exception
 Key: SPARK-1
 URL: https://issues.apache.org/jira/browse/SPARK-1
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.6.0
Reporter: Fei Wang


There are some problems when deserializing a RoaringBitmap. See the example below;
run this piece of code:
```
import com.esotericsoftware.kryo.io.{Input => KryoInput, Output => KryoOutput}
import java.io.{DataInput, DataOutput, FileInputStream, FileOutputStream}
import org.roaringbitmap.RoaringBitmap

class KryoInputDataInputBridge(input: KryoInput) extends DataInput {
  override def readLong(): Long = input.readLong()
  override def readChar(): Char = input.readChar()
  override def readFloat(): Float = input.readFloat()
  override def readByte(): Byte = input.readByte()
  override def readShort(): Short = input.readShort()
  override def readUTF(): String = input.readString() // readString in kryo does utf8
  override def readInt(): Int = input.readInt()
  override def readUnsignedShort(): Int = input.readShortUnsigned()
  override def skipBytes(n: Int): Int = input.skip(n.toLong).toInt
  override def readFully(b: Array[Byte]): Unit = input.read(b)
  override def readFully(b: Array[Byte], off: Int, len: Int): Unit = input.read(b, off, len)
  override def readLine(): String = throw new UnsupportedOperationException("readLine")
  override def readBoolean(): Boolean = input.readBoolean()
  override def readUnsignedByte(): Int = input.readByteUnsigned()
  override def readDouble(): Double = input.readDouble()
}

class KryoOutputDataOutputBridge(output: KryoOutput) extends DataOutput {
  override def writeFloat(v: Float): Unit = output.writeFloat(v)
  // There is no "readChars" counterpart, except maybe "readLine", which is not supported
  override def writeChars(s: String): Unit = throw new UnsupportedOperationException("writeChars")
  override def writeDouble(v: Double): Unit = output.writeDouble(v)
  override def writeUTF(s: String): Unit = output.writeString(s) // writeString in kryo does UTF8
  override def writeShort(v: Int): Unit = output.writeShort(v)
  override def writeInt(v: Int): Unit = output.writeInt(v)
  override def writeBoolean(v: Boolean): Unit = output.writeBoolean(v)
  override def write(b: Int): Unit = output.write(b)
  override def write(b: Array[Byte]): Unit = output.write(b)
  override def write(b: Array[Byte], off: Int, len: Int): Unit = output.write(b, off, len)
  override def writeBytes(s: String): Unit = output.writeString(s)
  override def writeChar(v: Int): Unit = output.writeChar(v.toChar)
  override def writeLong(v: Long): Unit = output.writeLong(v)
  override def writeByte(v: Int): Unit = output.writeByte(v)
}

val outStream = new FileOutputStream("D:\\wfserde")
val output = new KryoOutput(outStream)
val bitmap = new RoaringBitmap
bitmap.add(1)
bitmap.add(3)
bitmap.add(5)
bitmap.serialize(new KryoOutputDataOutputBridge(output))
output.flush()
output.close()

val inStream = new FileInputStream("D:\\wfserde")
val input = new KryoInput(inStream)
val ret = new RoaringBitmap
ret.deserialize(new KryoInputDataInputBridge(input))

println(ret)
```

This will throw a `Buffer underflow` error:
```
com.esotericsoftware.kryo.KryoException: Buffer underflow.
at com.esotericsoftware.kryo.io.Input.require(Input.java:156)
at com.esotericsoftware.kryo.io.Input.skip(Input.java:131)
at com.esotericsoftware.kryo.io.Input.skip(Input.java:264)
at org.apache.spark.sql.SQLQuerySuite$$anonfun$6$KryoInputDataInputBridge$1.skipBytes
```

After some investigation, I found this is caused by a bug in Kryo's 
`Input.skip(long count)` (https://github.com/EsotericSoftware/kryo/issues/119), 
and we call this method in `KryoInputDataInputBridge`.

So I think we can fix this issue in two ways:
1) Upgrade the Kryo version to 2.23.0 or 2.24.0, which fixes this bug in Kryo 
(I am not sure the upgrade is safe in Spark, can you check it? @davies)

2) Bypass Kryo's `Input.skip(long count)` by directly calling another `skip` 
method in Kryo's 
Input.java (https://github.com/EsotericSoftware/kryo/blob/kryo-2.21/src/com/esotericsoftware/kryo/io/Input.java#L124), 
i.e. write the bug-fixed version of `Input.skip(long count)` in 
KryoInputDataInputBridge's `skipBytes` method:
```
class KryoInputDataInputBridge(input: KryoInput) extends DataInput {
  ...
  override def skipBytes(n: Int): Int = {
    var remaining: Long = n
    while (remaining > 0) {
      val skip = Math.min(Integer.MAX_VALUE, remaining).asInstanceOf[Int]
      input.skip(skip)
      remaining -= skip
    }
    n
  }
  ...
}
```



--
This 

[jira] [Commented] (SPARK-11965) Update user guide for RFormula feature interactions

2015-12-08 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15047894#comment-15047894
 ] 

Yanbo Liang commented on SPARK-11965:
-

Sure, I can take it.

> Update user guide for RFormula feature interactions
> ---
>
> Key: SPARK-11965
> URL: https://issues.apache.org/jira/browse/SPARK-11965
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Reporter: Joseph K. Bradley
>
> Update the user guide for RFormula to cover feature interactions



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12224) R support for JDBC source

2015-12-08 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15047878#comment-15047878
 ] 

Sun Rui commented on SPARK-12224:
-

great

> R support for JDBC source
> -
>
> Key: SPARK-12224
> URL: https://issues.apache.org/jira/browse/SPARK-12224
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.2
>Reporter: Felix Cheung
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12223) Spark 1.5 pre-built releases don't work with the Java version shipped with Macs

2015-12-08 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-12223.

Resolution: Invalid

Spark 1.5 requires Java 7.

> Spark 1.5 pre-built releases don't work with the Java version shipped with 
> Macs
> ---
>
> Key: SPARK-12223
> URL: https://issues.apache.org/jira/browse/SPARK-12223
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.0, 1.5.1, 1.5.2
> Environment: $ uname -a
> Darwin grixis 14.5.0 Darwin Kernel Version 14.5.0: Tue Sep  1 21:23:09 PDT 
> 2015; root:xnu-2782.50.1~1/RELEASE_X86_64 x86_64
> $ java -version
> java version "1.6.0_65"
> Java(TM) SE Runtime Environment (build 1.6.0_65-b14-466.1-11M4716)
> Java HotSpot(TM) 64-Bit Server VM (build 20.65-b04-466.1, mixed mode)
>Reporter: Dan Adkins
>Priority: Blocker
>
> I downloaded the latest release (1.5.2) from 
> [http://spark.apache.org/downloads.html] and attempted to execute step 1 of 
> the Python quick start guide 
> [http://spark.apache.org/docs/latest/quick-start.html].
> $ ./bin/pyspark 
> Exception in thread "main" java.lang.NoClassDefFoundError: 
> org/apache/spark/launcher/Main
> Caused by: java.lang.ClassNotFoundException: org.apache.spark.launcher.Main
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
>   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
> This looks similar to SPARK-1703 which is caused by attempting to run a Java7 
> jar with JRE 6. I reproduced the problem with all of the 1.5.x releases. This 
> problem doesn't exist for me in version 1.4.1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12069) Documentation update for Datasets

2015-12-08 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-12069.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 10060
[https://github.com/apache/spark/pull/10060]

> Documentation update for Datasets
> -
>
> Key: SPARK-12069
> URL: https://issues.apache.org/jira/browse/SPARK-12069
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, SQL
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12221) Add CPU time metric to TaskMetrics

2015-12-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12221:


Assignee: Apache Spark

> Add CPU time metric to TaskMetrics
> --
>
> Key: SPARK-12221
> URL: https://issues.apache.org/jira/browse/SPARK-12221
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Web UI
>Affects Versions: 1.5.2
>Reporter: Jisoo Kim
>Assignee: Apache Spark
>
> Currently TaskMetrics doesn't support executor CPU time. I'd like to add one 
> so I can retrieve the metric from the History Server.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12224) R support for JDBC source

2015-12-08 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15047868#comment-15047868
 ] 

Felix Cheung commented on SPARK-12224:
--

I'm working on this.

> R support for JDBC source
> -
>
> Key: SPARK-12224
> URL: https://issues.apache.org/jira/browse/SPARK-12224
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.2
>Reporter: Felix Cheung
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12224) R support for JDBC source

2015-12-08 Thread Felix Cheung (JIRA)
Felix Cheung created SPARK-12224:


 Summary: R support for JDBC source
 Key: SPARK-12224
 URL: https://issues.apache.org/jira/browse/SPARK-12224
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 1.5.2
Reporter: Felix Cheung
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10643) Support HDFS application download in client mode spark submit

2015-12-08 Thread Alan Braithwaite (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Braithwaite updated SPARK-10643:
-
Summary: Support HDFS application download in client mode spark submit  
(was: Support HDFS urls in spark-submit)

> Support HDFS application download in client mode spark submit
> -
>
> Key: SPARK-10643
> URL: https://issues.apache.org/jira/browse/SPARK-10643
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Submit
>Reporter: Alan Braithwaite
>Priority: Minor
>
> When using Mesos with Docker and Marathon, it would be nice to be able to 
> make spark-submit deployable on Marathon and have it download a jar from 
> HDFS instead of having to package the jar into the Docker image.
> {code}
> $ docker run -it docker.example.com/spark:latest 
> /usr/local/spark/bin/spark-submit  --class 
> com.example.spark.streaming.EventHandler hdfs://hdfs/tmp/application.jar 
> Warning: Skip remote jar hdfs://hdfs/tmp/application.jar.
> java.lang.ClassNotFoundException: com.example.spark.streaming.EventHandler
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:348)
> at org.apache.spark.util.Utils$.classForName(Utils.scala:173)
> at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:639)
> at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {code}
> Although I'm aware that we can run in cluster mode with Mesos, we've already 
> built some nice tools surrounding Marathon for logging and monitoring.
> Code in question:
> https://github.com/apache/spark/blob/branch-1.5/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L685-L698



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2280) Java & Scala reference docs should describe function reference behavior.

2015-12-08 Thread Neelesh Srinivas Salian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15047674#comment-15047674
 ] 

Neelesh Srinivas Salian commented on SPARK-2280:


[~srowen] , checking to see if this is still in progress/needed?

> Java & Scala reference docs should describe function reference behavior.
> 
>
> Key: SPARK-2280
> URL: https://issues.apache.org/jira/browse/SPARK-2280
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 1.0.0
>Reporter: Hans Uhlig
>Priority: Minor
>  Labels: starter
>
> Example
>  JavaPairRDD<K, Iterable<T>> groupBy(Function<T, K> f)
> Return an RDD of grouped elements. Each group consists of a key and a 
> sequence of elements mapping to that key. 
> T and K are not described and there is no explanation of what the function's 
> inputs and outputs should be and how GroupBy uses this information.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12220) Make Utils.fetchFile support files that contain special characters

2015-12-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15047687#comment-15047687
 ] 

Apache Spark commented on SPARK-12220:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/10208

> Make Utils.fetchFile support files that contain special characters
> --
>
> Key: SPARK-12220
> URL: https://issues.apache.org/jira/browse/SPARK-12220
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Shixiong Zhu
>
> Now if a file name contains some illegal characters, such as " ", 
> Utils.fetchFile will fail because it doesn't handle this case.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9843) Catalyst: Allow adding custom optimizers

2015-12-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15047721#comment-15047721
 ] 

Apache Spark commented on SPARK-9843:
-

User 'robert3005' has created a pull request for this issue:
https://github.com/apache/spark/pull/10210

> Catalyst: Allow adding custom optimizers
> 
>
> Key: SPARK-9843
> URL: https://issues.apache.org/jira/browse/SPARK-9843
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.4.1
>Reporter: Robert Kruszewski
>
> Currently there's only an option to plug into query planning, which provides 
> limited capability for optimizing queries.
> Allowing custom optimizers to be specified would let applications that make use 
> of Spark SQL tune their workflows. One example would be inserting 
> repartitions into the query plan, which isn't generally applicable, but if you know 
> the range of possible queries you can do that on the backend.
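For context, a Catalyst optimizer rule is just a Rule[LogicalPlan]; the sketch below only shows the shape such a pluggable rule would take (the body is a deliberate no-op, and the registration hook is exactly what this ticket asks for, so it is not shown):

{code}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// A do-nothing rule that only demonstrates the required shape; a real rule would
// pattern-match on the plan and rewrite it (e.g. insert a repartition below a join).
object MyWorkloadSpecificRule extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan
}
{code}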



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12223) Spark 1.5 pre-built releases don't work with the Java version shipped with Macs

2015-12-08 Thread Dan Adkins (JIRA)
Dan Adkins created SPARK-12223:
--

 Summary: Spark 1.5 pre-built releases don't work with the Java 
version shipped with Macs
 Key: SPARK-12223
 URL: https://issues.apache.org/jira/browse/SPARK-12223
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.5.2, 1.5.1, 1.5.0
 Environment: $ uname -a
Darwin grixis 14.5.0 Darwin Kernel Version 14.5.0: Tue Sep  1 21:23:09 PDT 
2015; root:xnu-2782.50.1~1/RELEASE_X86_64 x86_64

$ java -version
java version "1.6.0_65"
Java(TM) SE Runtime Environment (build 1.6.0_65-b14-466.1-11M4716)
Java HotSpot(TM) 64-Bit Server VM (build 20.65-b04-466.1, mixed mode)

Reporter: Dan Adkins
Priority: Blocker


I downloaded the latest release (1.5.2) from 
[http://spark.apache.org/downloads.html] and attempted to execute step 1 of the 
Python quick start guide [http://spark.apache.org/docs/latest/quick-start.html].

$ ./bin/pyspark 
Exception in thread "main" java.lang.NoClassDefFoundError: 
org/apache/spark/launcher/Main
Caused by: java.lang.ClassNotFoundException: org.apache.spark.launcher.Main
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)

This looks similar to SPARK-1703 which is caused by attempting to run a Java7 
jar with JRE 6. I reproduced the problem with all of the 1.5.x releases. This 
problem doesn't exist for me in version 1.4.1.
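One way to confirm the Java 6 vs Java 7 class-format mismatch is to read the class-file major version directly (50 = Java 6, 51 = Java 7); a rough sketch, with the jar path assumed only for illustration:

{code}
import java.io.DataInputStream
import java.util.zip.ZipFile

val jar = new ZipFile("lib/spark-assembly-1.5.2-hadoop2.6.0.jar")   // assumed path inside the download
val entry = jar.getEntry("org/apache/spark/launcher/Main.class")
val in = new DataInputStream(jar.getInputStream(entry))
in.skipBytes(6)                       // class-file magic (4 bytes) + minor version (2 bytes)
val major = in.readUnsignedShort()    // 50 = Java 6, 51 = Java 7, 52 = Java 8
println(s"class file major version: $major")
in.close(); jar.close()
{code}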



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12221) Add CPU time metric to TaskMetrics

2015-12-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15047813#comment-15047813
 ] 

Apache Spark commented on SPARK-12221:
--

User 'jisookim0513' has created a pull request for this issue:
https://github.com/apache/spark/pull/10212

> Add CPU time metric to TaskMetrics
> --
>
> Key: SPARK-12221
> URL: https://issues.apache.org/jira/browse/SPARK-12221
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Web UI
>Affects Versions: 1.5.2
>Reporter: Jisoo Kim
>
> Currently TaskMetrics doesn't support executor CPU time. I'd like to have one 
> so I can retrieve the metric from History Server.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12221) Add CPU time metric to TaskMetrics

2015-12-08 Thread Jisoo Kim (JIRA)
Jisoo Kim created SPARK-12221:
-

 Summary: Add CPU time metric to TaskMetrics
 Key: SPARK-12221
 URL: https://issues.apache.org/jira/browse/SPARK-12221
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, Web UI
Affects Versions: 1.5.2
Reporter: Jisoo Kim


Currently TaskMetrics doesn't support executor CPU time. I'd like to have one 
so I can retrieve the metric from History Server.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12221) Add CPU time metric to TaskMetrics

2015-12-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12221:


Assignee: (was: Apache Spark)

> Add CPU time metric to TaskMetrics
> --
>
> Key: SPARK-12221
> URL: https://issues.apache.org/jira/browse/SPARK-12221
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Web UI
>Affects Versions: 1.5.2
>Reporter: Jisoo Kim
>
> Currently TaskMetrics doesn't support executor CPU time. I'd like to have one 
> so I can retrieve the metric from History Server.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9278) DataFrameWriter.insertInto inserts incorrect data

2015-12-08 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15047830#comment-15047830
 ] 

Hyukjin Kwon commented on SPARK-9278:
-

The result may well differ because I ran the code below against the master 
branch of Spark, in a local environment without S3, using the Scala API on Mac OS. Still, I 
will leave this comment describing what I tested, in case you want to try it 
without those environment constraints.

Here is the code I ran:

{code}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

  // Create data.
  val alphabets = Seq("a", "e", "i", "o", "u")
  val partA = (0 to 4).map(i => Seq(alphabets(i % 5), "a", i))
  val partB = (5 to 9).map(i => Seq(alphabets(i % 5), "b", i))
  val partC = (10 to 14).map(i => Seq(alphabets(i % 5), "c", i))
  val data = partA ++ partB ++ partC

  // Create RDD.
  val rowsRDD = sc.parallelize(data.map(Row.fromSeq))

  // Create Dataframe.
  val schema = StructType(List(
StructField("k", StringType, true),
StructField("pk", StringType, true),
StructField("v", IntegerType, true))
  )
  val sdf = sqlContext.createDataFrame(rowsRDD, schema)

  // create a empty table.
  sdf.filter("FALSE")
.write
.format("parquet")
.option("path", "foo")
.partitionBy("pk")
.saveAsTable("foo")

  // Save a partitioned table.
  sdf.filter("pk = 'a'")
.write
.partitionBy("pk")
.insertInto("foo")

  // Select all.
  val foo = sqlContext.table("foo")
  foo.show()
{code} 

And the result was correct as below.

{code}
+---+---+---+
|  k|  v| pk|
+---+---+---+
|  a|  0|  a|
|  e|  1|  a|
|  i|  2|  a|
|  o|  3|  a|
|  u|  4|  a|
+---+---+---+
{code}

> DataFrameWriter.insertInto inserts incorrect data
> -
>
> Key: SPARK-9278
> URL: https://issues.apache.org/jira/browse/SPARK-9278
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
> Environment: Linux, S3, Hive Metastore
>Reporter: Steve Lindemann
>Assignee: Cheng Lian
>Priority: Blocker
>
> After creating a partitioned Hive table (stored as Parquet) via the 
> DataFrameWriter.createTable command, subsequent attempts to insert additional 
> data into new partitions of this table result in inserting incorrect data 
> rows. Reordering the columns in the data to be written seems to avoid this 
> issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-9278) DataFrameWriter.insertInto inserts incorrect data

2015-12-08 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15047830#comment-15047830
 ] 

Hyukjin Kwon edited comment on SPARK-9278 at 12/9/15 1:21 AM:
--

The result may well differ because I ran the code below against the master 
branch of Spark, in a local environment without S3, using the Scala API on Mac OS. Still, I 
will leave this comment describing what I tested, in case you want to try it 
without those environment constraints.

Here is the code I ran:

{code}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

  // Create data.
  val alphabets = Seq("a", "e", "i", "o", "u")
  val partA = (0 to 4).map(i => Seq(alphabets(i % 5), "a", i))
  val partB = (5 to 9).map(i => Seq(alphabets(i % 5), "b", i))
  val partC = (10 to 14).map(i => Seq(alphabets(i % 5), "c", i))
  val data = partA ++ partB ++ partC

  // Create RDD.
  val rowsRDD = sc.parallelize(data.map(Row.fromSeq))

  // Create Dataframe.
  val schema = StructType(List(
StructField("k", StringType, true),
StructField("pk", StringType, true),
StructField("v", IntegerType, true))
  )
  val sdf = sqlContext.createDataFrame(rowsRDD, schema)

  // Create a empty table.
  sdf.filter("FALSE")
.write
.format("parquet")
.option("path", "foo")
.partitionBy("pk")
.saveAsTable("foo")

  // Save a partitioned table.
  sdf.filter("pk = 'a'")
.write
.partitionBy("pk")
.insertInto("foo")

  // Select all.
  val foo = sqlContext.table("foo")
  foo.show()
{code} 

And the result was correct as below.

{code}
+---+---+---+
|  k|  v| pk|
+---+---+---+
|  a|  0|  a|
|  e|  1|  a|
|  i|  2|  a|
|  o|  3|  a|
|  u|  4|  a|
+---+---+---+
{code}


was (Author: hyukjin.kwon):
The result may well differ because I ran the code below against the master 
branch of Spark, in a local environment without S3, using the Scala API on Mac OS. Still, I 
will leave this comment describing what I tested, in case you want to try it 
without those environment constraints.

Here is the code I ran:

{code}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

  // create data.
  val alphabets = Seq("a", "e", "i", "o", "u")
  val partA = (0 to 4).map(i => Seq(alphabets(i % 5), "a", i))
  val partB = (5 to 9).map(i => Seq(alphabets(i % 5), "b", i))
  val partC = (10 to 14).map(i => Seq(alphabets(i % 5), "c", i))
  val data = partA ++ partB ++ partC

  // Create RDD.
  val rowsRDD = sc.parallelize(data.map(Row.fromSeq))

  // Create Dataframe.
  val schema = StructType(List(
StructField("k", StringType, true),
StructField("pk", StringType, true),
StructField("v", IntegerType, true))
  )
  val sdf = sqlContext.createDataFrame(rowsRDD, schema)

  // create a empty table.
  sdf.filter("FALSE")
.write
.format("parquet")
.option("path", "foo")
.partitionBy("pk")
.saveAsTable("foo")

  // Save a partitioned table.
  sdf.filter("pk = 'a'")
.write
.partitionBy("pk")
.insertInto("foo")

  // Select all.
  val foo = sqlContext.table("foo")
  foo.show()
{code} 

And the result was correct as below.

{code}
+---+---+---+
|  k|  v| pk|
+---+---+---+
|  a|  0|  a|
|  e|  1|  a|
|  i|  2|  a|
|  o|  3|  a|
|  u|  4|  a|
+---+---+---+
{code}

> DataFrameWriter.insertInto inserts incorrect data
> -
>
> Key: SPARK-9278
> URL: https://issues.apache.org/jira/browse/SPARK-9278
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
> Environment: Linux, S3, Hive Metastore
>Reporter: Steve Lindemann
>Assignee: Cheng Lian
>Priority: Blocker
>
> After creating a partitioned Hive table (stored as Parquet) via the 
> DataFrameWriter.createTable command, subsequent attempts to insert additional 
> data into new partitions of this table result in inserting incorrect data 
> rows. Reordering the columns in the data to be written seems to avoid this 
> issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12182) Distributed binning for trees in spark.ml

2015-12-08 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15047660#comment-15047660
 ] 

Seth Hendrickson commented on SPARK-12182:
--

I'm working on this. Following mllib implementation: 
[PR-8246|https://github.com/apache/spark/pull/8246]

> Distributed binning for trees in spark.ml
> -
>
> Key: SPARK-12182
> URL: https://issues.apache.org/jira/browse/SPARK-12182
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> This is for porting [SPARK-10064] to spark.ml



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11678) Partition discovery fail if there is a _SUCCESS file in the table's root dir

2015-12-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15047791#comment-15047791
 ] 

Apache Spark commented on SPARK-11678:
--

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/10211

> Partition discovery fail if there is a _SUCCESS file in the table's root dir
> 
>
> Key: SPARK-11678
> URL: https://issues.apache.org/jira/browse/SPARK-11678
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Blocker
>  Labels: releasenotes
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12218) Boolean logic in sql does not work "not (A and B)" is not the same as "(not A) or (not B)"

2015-12-08 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15047928#comment-15047928
 ] 

Xiao Li commented on SPARK-12218:
-

Could you provide the plan by explain(true)? [~imachabeli] Thanks!

> Boolean logic in sql does not work  "not (A and B)" is not the same as  "(not 
> A) or (not B)"
> 
>
> Key: SPARK-12218
> URL: https://issues.apache.org/jira/browse/SPARK-12218
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Irakli Machabeli
>Priority: Blocker
>
> Two identical queries produce different results
> In [2]: sqlContext.read.parquet('prp_enh1').where(" LoanID=62231 and not( 
> PaymentsReceived=0 and ExplicitRoll in ('PreviouslyPaidOff', 
> 'PreviouslyChargedOff'))").count()
> Out[2]: 18
> In [3]: sqlContext.read.parquet('prp_enh1').where(" LoanID=62231 and ( 
> not(PaymentsReceived=0) or not (ExplicitRoll in ('PreviouslyPaidOff', 
> 'PreviouslyChargedOff')))").count()
> Out[3]: 28
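Setting NULL subtleties aside, the two WHERE clauses should be equivalent by De Morgan's laws, so the differing counts point at the predicate handling (e.g. filter translation or pushdown) rather than at the query logic itself. A quick check of the boolean identity, for illustration only:

{code}
// !(a && b) == (!a || !b) for every boolean combination.
for (a <- Seq(true, false); b <- Seq(true, false)) {
  assert(!(a && b) == (!a || !b), s"mismatch at a=$a, b=$b")
}
println("De Morgan identity holds for all boolean inputs")
{code}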



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12179) Spark SQL get different result with the same code

2015-12-08 Thread Tao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15047962#comment-15047962
 ] 

Tao Li commented on SPARK-12179:


Sorry, row_number is a UDF I wrote myself, not a Spark internal UDF. Do I still 
need to test it on 1.6-RC1 and 1.3?

> Spark SQL get different result with the same code
> -
>
> Key: SPARK-12179
> URL: https://issues.apache.org/jira/browse/SPARK-12179
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.3.0, 1.3.1, 1.3.2, 1.4.0, 1.4.1, 1.4.2, 1.5.0, 1.5.1, 
> 1.5.2, 1.5.3
> Environment: hadoop version: 2.5.0-cdh5.3.2
> spark version: 1.5.3
> run mode: yarn-client
>Reporter: Tao Li
>Priority: Critical
>
> I run the SQL in yarn-client mode but get a different result each time.
> As you can see in the example, I get different shuffle write sizes with the same 
> shuffle read in two jobs running the same code.
> Some of my Spark apps run well, but some always hit this problem, and I have 
> seen it on Spark 1.3, 1.4 and 1.5.
> Can you give me some suggestions about the possible causes, or about how to 
> figure out the problem?
> 1. First Run
> Details for Stage 9 (Attempt 0)
> Total Time Across All Tasks: 5.8 min
> Shuffle Read: 24.4 MB / 205399
> Shuffle Write: 6.8 MB / 54934
> 2. Second Run
> Details for Stage 9 (Attempt 0)
> Total Time Across All Tasks: 5.6 min
> Shuffle Read: 24.4 MB / 205399
> Shuffle Write: 6.8 MB / 54905



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12232) Consider exporting read.table in R

2015-12-08 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-12232:
-
Summary: Consider exporting read.table in R  (was: Consider exporting in R 
read.table)

> Consider exporting read.table in R
> --
>
> Key: SPARK-12232
> URL: https://issues.apache.org/jira/browse/SPARK-12232
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.2
>Reporter: Felix Cheung
>Priority: Minor
>
> Since we have read.df, read.json, read.parquet (some in pending PRs), we have 
> table() and we should consider having read.table() for consistency and 
> R-likeness.
> However, this conflicts with utils::read.table which returns a R data.frame.
> It seems neither table() or read.table() is desirable in this case.
> table: https://stat.ethz.ch/R-manual/R-devel/library/base/html/table.html
> read.table: 
> https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12232) Consider exporting read.table in R

2015-12-08 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048039#comment-15048039
 ] 

Felix Cheung edited comment on SPARK-12232 at 12/9/15 5:22 AM:
---

WIP here: https://github.com/felixcheung/spark/commits/readtable

It seems table() is an odd choice since it is about contingency tables.
read.table() matches our intent more closely, but exporting it from SparkR 
makes base::read.table() inaccessible when called without the package:: prefix 
(there are no S4 generics), which seems very bad to me because the user then can't 
create a data.frame.

Thoughts?

[~shivaram][~sunrui][~yanboliang]


was (Author: felixcheung):
WIP here: 
https://github.com/felixcheung/spark/commit/999607180fa1a30b14a6e182f23aeb322c977cf5

It seems to be table() is a odd choice since it is about contingency table.
read.table() matches closer to our intend but by exporting it from SparkR it 
makes base::read.table() inaccessible if calling without package:: prefix 
(there is no S4 generics), which seems very bad to me that the user can't 
create a data.frame.

Thought?

[~shivaram][~sunrui][~yanboliang]

> Consider exporting read.table in R
> --
>
> Key: SPARK-12232
> URL: https://issues.apache.org/jira/browse/SPARK-12232
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.2
>Reporter: Felix Cheung
>Priority: Minor
>
> Since we have read.df, read.json, read.parquet (some in pending PRs), we have 
> table() and we should consider having read.table() for consistency and 
> R-likeness.
> However, this conflicts with utils::read.table which returns a R data.frame.
> It seems neither table() or read.table() is desirable in this case.
> table: https://stat.ethz.ch/R-manual/R-devel/library/base/html/table.html
> read.table: 
> https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12233) Cannot specify a data frame column during join

2015-12-08 Thread Fengdong Yu (JIRA)
Fengdong Yu created SPARK-12233:
---

 Summary: Cannot specify a data frame column during join
 Key: SPARK-12233
 URL: https://issues.apache.org/jira/browse/SPARK-12233
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.2
Reporter: Fengdong Yu
Priority: Minor


Background:

two tables: 
tableA(id string, name string, gender string)
tableB(id string, name string)

{code}
val df1 = sqlContext.sql(select * from tableA)
val df2 = sqlContext.sql(select * from tableB)

//Wrong
df1.join(df2, Seq("id", "name").select(df2("id"), df2("name"), df1("gender"))

//Correct
df1.join(df2, Seq("id", "name").select("id", "name", "gender")

{code}


Cannot specify column of data frame for 'gender'




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12225) Support adding or replacing multiple columns at once in DataFrame API

2015-12-08 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048059#comment-15048059
 ] 

Xiao Li edited comment on SPARK-12225 at 12/9/15 5:21 AM:
--

This is related to changes in the external APIs, so I need to collect more ideas 
before starting it.

Do you like the following interfaces? [~marmbrus] [~rxin] [~sunrui]
{code}
  def withColumn(columns: Map[String, Column]): DataFrame
  def withColumnRenamed(columns: Map[String, String]): DataFrame
{code}

Then, what about multi-column support for withColumn when metadata is involved?
{code}
  def withColumn(colName: String, col: Column, metadata: Metadata): DataFrame
{code}



was (Author: smilegator):
This is related to the changes on the external APIs. Need to collect more ideas 
before starting it. 

Do you like the following interfaces? [~marmbrus] [~rxin] [~sunrui]
{code}
  def withColumn(columns: Map[String, Column]): DataFrame
  def withColumnRenamed(columns: Map[String, String]): DataFrame
{code}

Then, how about the multi-column support of withColumn if having metadata? 
{code}
  def withColumn(colName: String, col: Column, metadata: Metadata): 
{code}


> Support adding or replacing multiple columns at once in DataFrame API
> -
>
> Key: SPARK-12225
> URL: https://issues.apache.org/jira/browse/SPARK-12225
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Sun Rui
>
> Currently, withColumn() method of DataFrame supports adding or replacing only 
> single column. It would be convenient to support adding or replacing multiple 
> columns at once.
> Also withColumnRenamed() supports renaming only single column.It would also 
> be convenient to support renaming multiple columns at once.
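For illustration, today's single-column API can already be folded over a Map to get the behaviour the interfaces proposed in the comment above would provide; a runnable sketch (the column names and expressions in the usage note are made up):

{code}
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions._

def withColumns(df: DataFrame, cols: Map[String, Column]): DataFrame =
  cols.foldLeft(df) { case (acc, (name, c)) => acc.withColumn(name, c) }

// usage (hypothetical columns):
// withColumns(df, Map("total" -> (col("price") * col("qty")), "flag" -> lit(true)))
{code}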



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12233) Cannot specify a data frame column during join

2015-12-08 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048079#comment-15048079
 ] 

holdenk commented on SPARK-12233:
-

Could you maybe show what happens with the "wrong" example? Also it seems like 
some parts may have gotten lost (e.g. there are no quotes around the SQL statement 
and the brackets don't balance, etc.) - maybe double-check the repro example?

> Cannot specify a data frame column during join
> --
>
> Key: SPARK-12233
> URL: https://issues.apache.org/jira/browse/SPARK-12233
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Fengdong Yu
>Priority: Minor
>
> Background:
> two tables: 
> tableA(id string, name string, gender string)
> tableB(id string, name string)
> {code}
> val df1 = sqlContext.sql(select * from tableA)
> val df2 = sqlContext.sql(select * from tableB)
> //Wrong
> df1.join(df2, Seq("id", "name").select(df2("id"), df2("name"), df1("gender"))
> //Correct
> df1.join(df2, Seq("id", "name").select("id", "name", "gender")
> {code}
> 
> Cannot specify column of data frame for 'gender'



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6929) Alias for more complex expression causes attribute not been able to resolve

2015-12-08 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048098#comment-15048098
 ] 

Xiao Li commented on SPARK-6929:


[~srowen] Could you close this issue? This has been resolved, I think. 

> Alias for more complex expression causes attribute not been able to resolve
> ---
>
> Key: SPARK-6929
> URL: https://issues.apache.org/jira/browse/SPARK-6929
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Michał Warecki
>Priority: Critical
>
> I've extracted the minimal query that don't work with aliases. You can remove 
> tstudent expression ((tstudent((COUNT(g_0.test2_value) - 1)) from that query 
> and result will be the same. In exception you can see that c_0 is not 
> resolved but c_1 cause that problem.
> {code}
> SELECT g_0.test1 AS c_0, (AVG(g_0.test2) - ((tstudent((COUNT(g_0.test2_value) 
> - 1)) * stddev(g_0.test2_value)) / sqrt(convert(COUNT(g_0.test2), long AS 
> c_1 FROM sometable AS g_0 GROUP BY g_0.test1 ORDER BY c_0 LIMIT 502
> {code}
> cause exception:
> {code}
> Remote org.apache.spark.sql.AnalysisException: cannot resolve 'c_0' given 
> input columns c_0, c_1; line 1 pos 246
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$apply$3$$anonfun$apply$1.applyOrElse(CheckAnalysis.scala:48)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$apply$3$$anonfun$apply$1.applyOrElse(CheckAnalysis.scala:45)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:50)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:249)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:263)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenUp(TreeNode.scala:292)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:247)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionUp$1(QueryPlan.scala:103)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2$$anonfun$apply$2.apply(QueryPlan.scala:117)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:116)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>   at 
> 

[jira] [Updated] (SPARK-12234) SparkR subset throw error when only set "select" argument

2015-12-08 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-12234:

Description: 
SparkR subset function throws an error when only the "select" argument is set; it's easy 
to reproduce.
In SparkR:
{code}
> df <- suppressWarnings(createDataFrame(sqlContext, iris))
> subset(df, select=c("Sepal_Length", "Petal_Length", "Species"))
Error in x[subset, select, ...] : 
  error in evaluating the argument 'i' in selecting a method for function '[': 
Error: argument "subset" is missing, with no default
{code}
But in base R, the subset function works well with only specifying "select" 
argument:
{code}
> df <- iris
> subset(df, select=c("Sepal.Length", "Petal.Length", "Species"))
  Sepal.Length Petal.Length Species
1          5.1          1.4  setosa
2          4.9          1.4  setosa
3          4.7          1.3  setosa
4          4.6          1.5  setosa
5          5.0          1.4  setosa
..
{code}

  was:
SparkR subset throw error when only set "select" argument, it's easy to 
repreduce.
In SparkR:
{code}
> df <- suppressWarnings(createDataFrame(sqlContext, iris))
> subset(df, select=c("Sepal_Length", "Petal_Length", "Species"))
Error in x[subset, select, ...] : 
  error in evaluating the argument 'i' in selecting a method for function '[': 
Error: argument "subset" is missing, with no default
{code}
But in base R, the subset function works well with only specifying "select" 
argument:
{code}
> df <- iris
> subset(df, select=c("Sepal.Length", "Petal.Length", "Species"))
  Sepal.Length Petal.Length Species
1          5.1          1.4  setosa
2          4.9          1.4  setosa
3          4.7          1.3  setosa
4          4.6          1.5  setosa
5          5.0          1.4  setosa
..
{code}


> SparkR subset throw error when only set "select" argument
> -
>
> Key: SPARK-12234
> URL: https://issues.apache.org/jira/browse/SPARK-12234
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Yanbo Liang
>
> SparkR subset function throws an error when only the "select" argument is set; it's easy 
> to reproduce.
> In SparkR:
> {code}
> > df <- suppressWarnings(createDataFrame(sqlContext, iris))
> > subset(df, select=c("Sepal_Length", "Petal_Length", "Species"))
> Error in x[subset, select, ...] : 
>   error in evaluating the argument 'i' in selecting a method for function 
> '[': Error: argument "subset" is missing, with no default
> {code}
> But in base R, the subset function works well with only specifying "select" 
> argument:
> {code}
> > df <- iris
> > subset(df, select=c("Sepal.Length", "Petal.Length", "Species"))
>   Sepal.Length Petal.Length Species
> 1          5.1          1.4  setosa
> 2          4.9          1.4  setosa
> 3          4.7          1.3  setosa
> 4          4.6          1.5  setosa
> 5          5.0          1.4  setosa
> ..
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12234) SparkR subset throw error when only set "select" argument

2015-12-08 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-12234:

Description: 
SparkR subset throws an error when only the "select" argument is set; it's easy to 
reproduce.
In SparkR:
{code}
> df <- suppressWarnings(createDataFrame(sqlContext, iris))
> subset(df, select=c("Sepal_Length", "Petal_Length", "Species"))
Error in x[subset, select, ...] : 
  error in evaluating the argument 'i' in selecting a method for function '[': 
Error: argument "subset" is missing, with no default
{code}
But in base R, the subset function works well with only specifying "select" 
argument:
{code}
> df <- iris
> subset(df, select=c("Sepal.Length", "Petal.Length", "Species"))
  Sepal.Length Petal.Length Species
1          5.1          1.4  setosa
2          4.9          1.4  setosa
3          4.7          1.3  setosa
4          4.6          1.5  setosa
5          5.0          1.4  setosa
..
{code}

  was:SparkR subset throw error when only set "select" argument


> SparkR subset throw error when only set "select" argument
> -
>
> Key: SPARK-12234
> URL: https://issues.apache.org/jira/browse/SPARK-12234
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Yanbo Liang
>
> SparkR subset throws an error when only the "select" argument is set; it's easy to 
> reproduce.
> In SparkR:
> {code}
> > df <- suppressWarnings(createDataFrame(sqlContext, iris))
> > subset(df, select=c("Sepal_Length", "Petal_Length", "Species"))
> Error in x[subset, select, ...] : 
>   error in evaluating the argument 'i' in selecting a method for function 
> '[': Error: argument "subset" is missing, with no default
> {code}
> But in base R, the subset function works well with only specifying "select" 
> argument:
> {code}
> > df <- iris
> > subset(df, select=c("Sepal.Length", "Petal.Length", "Species"))
>   Sepal.Length Petal.Length Species
> 1          5.1          1.4  setosa
> 2          4.9          1.4  setosa
> 3          4.7          1.3  setosa
> 4          4.6          1.5  setosa
> 5          5.0          1.4  setosa
> ..
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9059) Update Python Direct Kafka Word count examples to show the use of HasOffsetRanges

2015-12-08 Thread Benjamin Fradet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048177#comment-15048177
 ] 

Benjamin Fradet commented on SPARK-9059:


Hi [~neelesh77],

I know the documentation has been updated, but I don't see any use of 
`HasOffsetRanges` in the Scala or Java examples.
Pinging [~tdas], to get more information.

> Update Python Direct Kafka Word count examples to show the use of 
> HasOffsetRanges
> -
>
> Key: SPARK-9059
> URL: https://issues.apache.org/jira/browse/SPARK-9059
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming
>Reporter: Tathagata Das
>Priority: Minor
>  Labels: starter
>
> Update Python examples of Direct Kafka word count to access the offset ranges 
> using HasOffsetRanges and print it. For example in Scala,
>  
> {code}
> var offsetRanges: Array[OffsetRange] = _
> ...
> directKafkaDStream.foreachRDD { rdd => 
> offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
> }
> ...
> transformedDStream.foreachRDD { rdd => 
> // some operation
> println("Processed ranges: " + offsetRanges)
> }
> {code}
> See https://spark.apache.org/docs/latest/streaming-kafka-integration.html for 
> more info, and the master source code for more updated information on python. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8517) Improve the organization and style of MLlib's user guide

2015-12-08 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15047905#comment-15047905
 ] 

Joseph K. Bradley commented on SPARK-8517:
--

I merged this PR, but am leaving the JIRA open since it has remaining subtasks.

> Improve the organization and style of MLlib's user guide
> 
>
> Key: SPARK-8517
> URL: https://issues.apache.org/jira/browse/SPARK-8517
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, ML, MLlib
>Reporter: Xiangrui Meng
>Assignee: Timothy Hunter
>
> The current MLlib's user guide (and spark.ml's), especially the main page, 
> doesn't have a nice style. We could update it and re-organize the content to 
> make it easier to navigate.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12225) Support adding or replacing multiple columns at once in DataFrame API

2015-12-08 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15047934#comment-15047934
 ] 

Xiao Li commented on SPARK-12225:
-

Ok, thank you! 

> Support adding or replacing multiple columns at once in DataFrame API
> -
>
> Key: SPARK-12225
> URL: https://issues.apache.org/jira/browse/SPARK-12225
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Sun Rui
>
> Currently, withColumn() method of DataFrame supports adding or replacing only 
> single column. It would be convenient to support adding or replacing multiple 
> columns at once.
> Also withColumnRenamed() supports renaming only single column.It would also 
> be convenient to support renaming multiple columns at once.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12225) Support adding or replacing multiple columns at once in DataFrame API

2015-12-08 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15047930#comment-15047930
 ] 

Sun Rui commented on SPARK-12225:
-

No. Go ahead if you are interested

> Support adding or replacing multiple columns at once in DataFrame API
> -
>
> Key: SPARK-12225
> URL: https://issues.apache.org/jira/browse/SPARK-12225
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Sun Rui
>
> Currently, withColumn() method of DataFrame supports adding or replacing only 
> single column. It would be convenient to support adding or replacing multiple 
> columns at once.
> Also withColumnRenamed() supports renaming only single column.It would also 
> be convenient to support renaming multiple columns at once.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12228) Use in-memory for execution hive's derby metastore

2015-12-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12228:


Assignee: Apache Spark  (was: Yin Huai)

> Use in-memory for execution hive's derby metastore
> --
>
> Key: SPARK-12228
> URL: https://issues.apache.org/jira/browse/SPARK-12228
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Apache Spark
>
> Starting from Hive 0.13, the Derby metastore can use an in-memory backend. 
> Since our execution Hive is a fake metastore, if we use the in-memory mode, we 
> can reduce the time spent creating the execution Hive.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12228) Use in-memory for execution hive's derby metastore

2015-12-08 Thread Yin Huai (JIRA)
Yin Huai created SPARK-12228:


 Summary: Use in-memory for execution hive's derby metastore
 Key: SPARK-12228
 URL: https://issues.apache.org/jira/browse/SPARK-12228
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Yin Huai
Assignee: Yin Huai


Starting from Hive 0.13, the Derby metastore can use an in-memory backend. Since 
our execution Hive is a fake metastore, if we use the in-memory mode, we can reduce 
the time spent creating the execution Hive.
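For reference, the change largely comes down to the Derby JDO connection URL that Hive is handed; a sketch of the two forms, assuming nothing about how the actual patch wires the property:

{code}
// On-disk Derby (current behaviour) vs. the in-memory Derby backend available since Hive 0.13.
// Hive reads this via the "javax.jdo.option.ConnectionURL" property.
val onDiskDerbyUrl   = "jdbc:derby:;databaseName=/tmp/spark-metastore;create=true"
val inMemoryDerbyUrl = "jdbc:derby:memory:spark-metastore;create=true"
{code}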



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12228) Use in-memory for execution hive's derby metastore

2015-12-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15047954#comment-15047954
 ] 

Apache Spark commented on SPARK-12228:
--

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/10204

> Use in-memory for execution hive's derby metastore
> --
>
> Key: SPARK-12228
> URL: https://issues.apache.org/jira/browse/SPARK-12228
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
>
> Starting from Hive 0.13, the Derby metastore can use an in-memory backend. 
> Since our execution Hive is a fake metastore, if we use the in-memory mode, we 
> can reduce the time spent creating the execution Hive.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12228) Use in-memory for execution hive's derby metastore

2015-12-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12228:


Assignee: Yin Huai  (was: Apache Spark)

> Use in-memory for execution hive's derby metastore
> --
>
> Key: SPARK-12228
> URL: https://issues.apache.org/jira/browse/SPARK-12228
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
>
> Starting from Hive 0.13, the Derby metastore can use an in-memory backend. 
> Since our execution Hive is a fake metastore, if we use the in-memory mode, we 
> can reduce the time spent creating the execution Hive.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12230) WeightedLeastSquares.fit() should handle division by zero properly if standard deviation of target variable is zero.

2015-12-08 Thread Imran Younus (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15047985#comment-15047985
 ] 

Imran Younus commented on SPARK-12230:
--

I can work on this.

> WeightedLeastSquares.fit() should handle division by zero properly if 
> standard deviation of target variable is zero.
> 
>
> Key: SPARK-12230
> URL: https://issues.apache.org/jira/browse/SPARK-12230
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Imran Younus
>Priority: Trivial
>
> This is a TODO in the WeightedLeastSquares.fit() method. If the standard 
> deviation of the target variable is zero, then the regression is 
> meaningless. I think the fit() method should inform the user and exit nicely.
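A minimal sketch of the kind of guard being suggested (names and message are illustrative, not the actual spark.ml code):

{code}
def requireNonConstantLabel(labelStd: Double): Unit = {
  require(labelStd != 0.0,
    "The standard deviation of the label is zero, so least-squares regression is not " +
      "meaningful. Check the input data or fit a constant/intercept-only model instead.")
}
{code}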



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12164) [SQL] Display the binary/encoded values

2015-12-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15047997#comment-15047997
 ] 

Apache Spark commented on SPARK-12164:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/10215

> [SQL] Display the binary/encoded values
> ---
>
> Key: SPARK-12164
> URL: https://issues.apache.org/jira/browse/SPARK-12164
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Xiao Li
>
> So far, we output the encoded contents in a comma-separated decimal format. 
> That representation is not very readable when the data is binary. This could be a common 
> issue when we use the Dataset API. 
> For example, 
> {code}
> implicit val kryoEncoder = Encoders.kryo[KryoClassData]
> val ds = Seq(KryoClassData("a", 1), KryoClassData("b", 2), 
> KryoClassData("c", 3)).toDS()
> ds.show(20, false);
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12222) deserialize RoaringBitmap using Kryo serializer throw Buffer underflow exception

2015-12-08 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-12222.

   Resolution: Fixed
Fix Version/s: 1.6.0
   2.0.0

Issue resolved by pull request 10213
[https://github.com/apache/spark/pull/10213]

> deserialize RoaringBitmap using Kryo serializer throw Buffer underflow 
> exception
> 
>
> Key: SPARK-12222
> URL: https://issues.apache.org/jira/browse/SPARK-12222
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Fei Wang
> Fix For: 2.0.0, 1.6.0
>
>
> here are some problems when deserialize RoaringBitmap. see the examples below:
> run this piece of code
> ```
> import com.esotericsoftware.kryo.io.{Input => KryoInput, Output => KryoOutput}
> import java.io.DataInput
> class KryoInputDataInputBridge(input: KryoInput) extends DataInput {
>   override def readLong(): Long = input.readLong()
>   override def readChar(): Char = input.readChar()
>   override def readFloat(): Float = input.readFloat()
>   override def readByte(): Byte = input.readByte()
>   override def readShort(): Short = input.readShort()
>   override def readUTF(): String = input.readString() // readString in 
> kryo does utf8
>   override def readInt(): Int = input.readInt()
>   override def readUnsignedShort(): Int = input.readShortUnsigned()
>   override def skipBytes(n: Int): Int = input.skip(n.toLong).toInt
>   override def readFully(b: Array[Byte]): Unit = input.read(b)
>   override def readFully(b: Array[Byte], off: Int, len: Int): Unit = 
> input.read(b, off, len)
>   override def readLine(): String = throw new 
> UnsupportedOperationException("readLine")
>   override def readBoolean(): Boolean = input.readBoolean()
>   override def readUnsignedByte(): Int = input.readByteUnsigned()
>   override def readDouble(): Double = input.readDouble()
> }
> class KryoOutputDataOutputBridge(output: KryoOutput) extends DataOutput {
>   override def writeFloat(v: Float): Unit = output.writeFloat(v)
>   // There is no "readChars" counterpart, except maybe "readLine", which 
> is not supported
>   override def writeChars(s: String): Unit = throw new 
> UnsupportedOperationException("writeChars")
>   override def writeDouble(v: Double): Unit = output.writeDouble(v)
>   override def writeUTF(s: String): Unit = output.writeString(s) // 
> writeString in kryo does UTF8
>   override def writeShort(v: Int): Unit = output.writeShort(v)
>   override def writeInt(v: Int): Unit = output.writeInt(v)
>   override def writeBoolean(v: Boolean): Unit = output.writeBoolean(v)
>   override def write(b: Int): Unit = output.write(b)
>   override def write(b: Array[Byte]): Unit = output.write(b)
>   override def write(b: Array[Byte], off: Int, len: Int): Unit = 
> output.write(b, off, len)
>   override def writeBytes(s: String): Unit = output.writeString(s)
>   override def writeChar(v: Int): Unit = output.writeChar(v.toChar)
>   override def writeLong(v: Long): Unit = output.writeLong(v)
>   override def writeByte(v: Int): Unit = output.writeByte(v)
> }
> val outStream = new FileOutputStream("D:\\wfserde")
> val output = new KryoOutput(outStream)
> val bitmap = new RoaringBitmap
> bitmap.add(1)
> bitmap.add(3)
> bitmap.add(5)
> bitmap.serialize(new KryoOutputDataOutputBridge(output))
> output.flush()
> output.close()
> val inStream = new FileInputStream("D:\\wfserde")
> val input = new KryoInput(inStream)
> val ret = new RoaringBitmap
> ret.deserialize(new KryoInputDataInputBridge(input))
> println(ret)
> ```
> this will throw `Buffer underflow` error:
> ```
> com.esotericsoftware.kryo.KryoException: Buffer underflow.
>   at com.esotericsoftware.kryo.io.Input.require(Input.java:156)
>   at com.esotericsoftware.kryo.io.Input.skip(Input.java:131)
>   at com.esotericsoftware.kryo.io.Input.skip(Input.java:264)
>   at 
> org.apache.spark.sql.SQLQuerySuite$$anonfun$6$KryoInputDataInputBridge$1.skipBytes
> ```
> after same investigation,  i found this is caused by a bug of kryo's 
> `Input.skip(long count)`(https://github.com/EsotericSoftware/kryo/issues/119) 
> and we call this method in `KryoInputDataInputBridge`.
> So i think we can fix this issue in this two ways:
> 1) upgrade the kryo version to 2.23.0 or 2.24.0, which has fix this bug in 
> kryo (i am not sure the upgrade is safe in spark, can you check it? @davies )
> 2) we can bypass the  kryo's `Input.skip(long count)` by directly call 
> another `skip` method in kryo's 
> 

[jira] [Comment Edited] (SPARK-12233) Cannot specify a data frame column during join

2015-12-08 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048096#comment-15048096
 ] 

Xiao Li edited comment on SPARK-12233 at 12/9/15 6:00 AM:
--

This is another self join issue. I will try to see if it is a known issue. 


was (Author: smilegator):
This is another self join issue. I will try to see if it is a well-known issue. 

> Cannot specify a data frame column during join
> --
>
> Key: SPARK-12233
> URL: https://issues.apache.org/jira/browse/SPARK-12233
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Fengdong Yu
>Priority: Minor
>
> {code}
> sqlContext.udf.register("lowercase", (s: String) =>{
>   if (null == s) "" else s.toLowerCase
> })
> 
> sqlContext.udf.register("substr", (s: String) =>{
>   if (null == s) ""
>   else {
> val index = s.indexOf("@")
> if (index < 0) s else s.toLowerCase.substring(index + 1)}
> })
> 
> sqlContext.read.orc("/data/test/test.data")
> .registerTempTable("testTable")
> 
> val extracted = 
> sqlContext.sql(""" SELECT lowercase(given_name) AS given_name, 
> lowercase(family_name) AS 
> family_name, 
> substr(email_address) AS domain, 
> lowercase(email_address) AS emailaddr,
> experience
>   
>  FROM testTable 
>  WHERE email_address != '' 
>  """)
>  .distinct
> 
> val count =
>  extracted.groupBy("given_name", "family_name", "domain")
>.count
> 
> count.where(count("count") > 1)
>  .drop(count("count"))
>  .join(extracted, Seq("given_name", "family_name", "domain"))
> {code}
> {color:red} .select(count("given_name"), count("family_name"), 
> extracted("emailaddr"))  {color}
> Red Font should be:
> {color:red} select("given_name", "family_name", "emailaddr") {color}
> 
> {code}
> org.apache.spark.sql.AnalysisException: resolved attribute(s) emailaddr#525 
> missing from 
> given_name#522,domain#524,url#517,family_name#523,emailaddr#532,experience#490
>  in operator !Project [given_name#522,family_name#523,emailaddr#525];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914)
>   at org.apache.spark.sql.DataFrame.(DataFrame.scala:132)
>   at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154)
>   at org.apache.spark.sql.DataFrame.select(DataFrame.scala:691)
>   at 
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:38)
>   at 
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:43)
>   at 
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:45)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:47)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:49)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:51)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:53)
>   at $iwC$$iwC$$iwC$$iwC$$iwC.(:55)
>   at $iwC$$iwC$$iwC$$iwC.(:57)
>   at $iwC$$iwC$$iwC.(:59)
>   at $iwC$$iwC.(:61)
>   at $iwC.(:63)
>   at (:65)
>   at .(:69)
>   at .()
>   at .(:7)
>   at .()
>   at $print()
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
>   

[jira] [Created] (SPARK-12234) SparkR subset throw error when only set "select" argument

2015-12-08 Thread Yanbo Liang (JIRA)
Yanbo Liang created SPARK-12234:
---

 Summary: SparkR subset throw error when only set "select" argument
 Key: SPARK-12234
 URL: https://issues.apache.org/jira/browse/SPARK-12234
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Reporter: Yanbo Liang


SparkR subset throw error when only set "select" argument



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12234) SparkR subset throw error when only set "select" argument

2015-12-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12234:


Assignee: (was: Apache Spark)

> SparkR subset throw error when only set "select" argument
> -
>
> Key: SPARK-12234
> URL: https://issues.apache.org/jira/browse/SPARK-12234
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Yanbo Liang
>
> SparkR subset throw error when only set "select" argument



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12234) SparkR subset throw error when only set "select" argument

2015-12-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048139#comment-15048139
 ] 

Apache Spark commented on SPARK-12234:
--

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/10217

> SparkR subset throw error when only set "select" argument
> -
>
> Key: SPARK-12234
> URL: https://issues.apache.org/jira/browse/SPARK-12234
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Yanbo Liang
>
> SparkR subset throw error when only set "select" argument



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12234) SparkR subset throw error when only set "select" argument

2015-12-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12234:


Assignee: Apache Spark

> SparkR subset throw error when only set "select" argument
> -
>
> Key: SPARK-12234
> URL: https://issues.apache.org/jira/browse/SPARK-12234
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Yanbo Liang
>Assignee: Apache Spark
>
> SparkR subset throw error when only set "select" argument



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12179) Spark SQL get different result with the same code

2015-12-08 Thread Tao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15047965#comment-15047965
 ] 

Tao Li edited comment on SPARK-12179 at 12/9/15 6:56 AM:
-

The row_number implementation is as follows:

package UDF;

import java.io.PrintStream;
import org.apache.hadoop.hive.ql.exec.UDF;

public class row_number extends UDF
{
  private static int MAX_VALUE = 50;
  private static String[] comparedColumn = new String[MAX_VALUE];
  private static int rowNum = 1;

  public int evaluate(Object[] args) {
String[] columnValue = new String[args.length];
for (int i = 0; i < args.length; i++) {
  columnValue[i] = (args[i] == null ? "" : args[i].toString());
}
if (rowNum == 1) {
  for (int i = 0; i < columnValue.length; i++) {
comparedColumn[i] = columnValue[i];
  }
}
for (int i = 0; i < columnValue.length; i++) {
  if (!comparedColumn[i].equals(columnValue[i])) {
for (int j = 0; j < columnValue.length; j++) {
  comparedColumn[j] = columnValue[j];
}
rowNum = 1;

return rowNum++;
  }
}
return rowNum++;
  }
}


was (Author: litao1990):
The row_number implementation is as follows:

package UDF;

import org.apache.hadoop.hive.ql.exec.UDF;

public class RowNumber extends UDF
{
  private static int MAX_VALUE = 50;
  private static String[] comparedColumn = new String[MAX_VALUE];
  private static int rowNum = 1;

  public int evaluate(Object[] args) {
String[] columnValue = new String[args.length];
for (int i = 0; i < args.length; i++)
{
  columnValue[i] = (args[i] == null ? "" : args[i].toString());
}
if (rowNum == 1) {
  for (int i = 0; i < columnValue.length; i++) {
comparedColumn[i] = columnValue[i];
  }
}
for (int i = 0; i < columnValue.length; i++) {
  if (!comparedColumn[i].equals(columnValue[i])) {
for (int j = 0; j < columnValue.length; j++) {
  comparedColumn[j] = columnValue[j];
}
rowNum = 1;
return rowNum++;
  }
}
return rowNum++;
  }
}

> Spark SQL get different result with the same code
> -
>
> Key: SPARK-12179
> URL: https://issues.apache.org/jira/browse/SPARK-12179
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.3.0, 1.3.1, 1.3.2, 1.4.0, 1.4.1, 1.4.2, 1.5.0, 1.5.1, 
> 1.5.2, 1.5.3
> Environment: hadoop version: 2.5.0-cdh5.3.2
> spark version: 1.5.3
> run mode: yarn-client
>Reporter: Tao Li
>Priority: Critical
>
> I run the SQL in yarn-client mode, but get a different result each time.
> As you can see in the example, I get a different shuffle write with the same 
> shuffle read in two jobs with the same code.
> Some of my Spark apps run well, but some always hit this problem. And I have 
> hit this problem on Spark 1.3, 1.4 and 1.5.
> Can you give me some suggestions about the possible causes, or how I can 
> figure out the problem?
> 1. First Run
> Details for Stage 9 (Attempt 0)
> Total Time Across All Tasks: 5.8 min
> Shuffle Read: 24.4 MB / 205399
> Shuffle Write: 6.8 MB / 54934
> 2. Second Run
> Details for Stage 9 (Attempt 0)
> Total Time Across All Tasks: 5.6 min
> Shuffle Read: 24.4 MB / 205399
> Shuffle Write: 6.8 MB / 54905



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12236) JDBC filter tests all pass if filters are not really pushed down

2015-12-08 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-12236:


 Summary: JDBC filter tests all pass if filters are not really 
pushed down
 Key: SPARK-12236
 URL: https://issues.apache.org/jira/browse/SPARK-12236
 Project: Spark
  Issue Type: Test
  Components: SQL
Reporter: Hyukjin Kwon


It is similar to https://issues.apache.org/jira/browse/SPARK-11676 and 
https://issues.apache.org/jira/browse/SPARK-11677.

Currently the JDBC predicate tests all pass even if the filters are not pushed 
down or push-down is disabled.

This is because of Spark-side filtering. 

Moreover, {{Not(Equal)}} is also being tested, which is actually not pushed down 
to the JDBC datasource.
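
A minimal sketch of the kind of check that could catch this (the URL, table name, 
and plan-string assumptions below are mine, not the existing suite):
{code}
import java.util.Properties

val url = "jdbc:h2:mem:testdb0"   // assumed in-memory test database

// Besides checking the returned rows, also inspect the physical plan so the
// test fails when the predicate is only evaluated on the Spark side instead
// of being handed to the JDBC source.
val df = sqlContext.read.jdbc(url, "TEST.PEOPLE", new Properties())
  .filter("THEID = 1")

// Assumption: a pushed-down predicate is reported on the scan node (e.g. as
// "PushedFilters" in the plan string) rather than as a separate Filter node.
val plan = df.queryExecution.executedPlan.toString
assert(plan.contains("PushedFilters"))

assert(df.collect().length == 1)
{code}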



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12225) Support adding or replacing multiple columns at once in DataFrame API

2015-12-08 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15047922#comment-15047922
 ] 

Xiao Li commented on SPARK-12225:
-

[~sunrui] Will you deliver the feature? Otherwise, I can work on it. Thanks!

> Support adding or replacing multiple columns at once in DataFrame API
> -
>
> Key: SPARK-12225
> URL: https://issues.apache.org/jira/browse/SPARK-12225
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Sun Rui
>
> Currently, the withColumn() method of DataFrame supports adding or replacing 
> only a single column. It would be convenient to support adding or replacing 
> multiple columns at once.
> Also, withColumnRenamed() supports renaming only a single column. It would 
> also be convenient to support renaming multiple columns at once.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12227) Support drop multiple columns specified by Column class in DataFrame API

2015-12-08 Thread Sun Rui (JIRA)
Sun Rui created SPARK-12227:
---

 Summary: Support drop multiple columns specified by Column class 
in DataFrame API
 Key: SPARK-12227
 URL: https://issues.apache.org/jira/browse/SPARK-12227
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 1.5.2
Reporter: Sun Rui


In SPARK-11884, dropping multiple columns specified by column names in the 
DataFrame API was supported.

However, there are two drop variants for single column:
{code}
def drop(colName: String)
def drop(col: Column)
{code}

From API parity's point of view, it would be better to also support dropping 
multiple columns specified by Column class.
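
Until such an overload exists, a minimal user-side sketch of the idea (the helper 
name is mine; only the existing single-column drop(col: Column) is assumed):
{code}
import org.apache.spark.sql.{Column, DataFrame}

// Fold the existing single-column drop over a varargs list of Columns.
def dropColumns(df: DataFrame, cols: Column*): DataFrame =
  cols.foldLeft(df)((d, c) => d.drop(c))

// e.g. dropColumns(people, people("id"), people("name"))
{code}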



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12179) Spark SQL get different result with the same code

2015-12-08 Thread Tao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15047938#comment-15047938
 ] 

Tao Li commented on SPARK-12179:


OK, I will try it out.

> Spark SQL get different result with the same code
> -
>
> Key: SPARK-12179
> URL: https://issues.apache.org/jira/browse/SPARK-12179
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.3.0, 1.3.1, 1.3.2, 1.4.0, 1.4.1, 1.4.2, 1.5.0, 1.5.1, 
> 1.5.2, 1.5.3
> Environment: hadoop version: 2.5.0-cdh5.3.2
> spark version: 1.5.3
> run mode: yarn-client
>Reporter: Tao Li
>Priority: Critical
>
> I run the SQL in yarn-client mode, but get a different result each time.
> As you can see in the example, I get a different shuffle write with the same 
> shuffle read in two jobs with the same code.
> Some of my Spark apps run well, but some always hit this problem. And I have 
> hit this problem on Spark 1.3, 1.4 and 1.5.
> Can you give me some suggestions about the possible causes, or how I can 
> figure out the problem?
> 1. First Run
> Details for Stage 9 (Attempt 0)
> Total Time Across All Tasks: 5.8 min
> Shuffle Read: 24.4 MB / 205399
> Shuffle Write: 6.8 MB / 54934
> 2. Second Run
> Details for Stage 9 (Attempt 0)
> Total Time Across All Tasks: 5.6 min
> Shuffle Read: 24.4 MB / 205399
> Shuffle Write: 6.8 MB / 54905



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12225) Support adding or replacing multiple columns at once in DataFrame API

2015-12-08 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048059#comment-15048059
 ] 

Xiao Li commented on SPARK-12225:
-

This is related to changes in the external APIs. We need to collect more ideas 
before starting it. 

Do you like the following interfaces? [~marmbrus] [~rxin] [~sunrui]
{code}
  def withColumn(columns: Map[String, Column]): DataFrame
  def withColumnRenamed(columns: Map[String, String]): DataFrame
{code}

Then, what about multi-column support for withColumn when metadata is involved? 
{code}
  def withColumn(colName: String, col: Column, metadata: Metadata): 
{code}
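
For what it's worth, a rough usage sketch of the Map-based signatures above 
(purely hypothetical; none of these overloads exists yet):
{code}
import org.apache.spark.sql.functions.upper

// Hypothetical usage if the Map-based overloads were adopted:
val withCols = df.withColumn(Map(
  "name_upper" -> upper(df("name")),
  "age_plus_1" -> (df("age") + 1)
))

val renamed = df.withColumnRenamed(Map(
  "name" -> "full_name",
  "age"  -> "age_years"
))
{code}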


> Support adding or replacing multiple columns at once in DataFrame API
> -
>
> Key: SPARK-12225
> URL: https://issues.apache.org/jira/browse/SPARK-12225
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Sun Rui
>
> Currently, the withColumn() method of DataFrame supports adding or replacing 
> only a single column. It would be convenient to support adding or replacing 
> multiple columns at once.
> Also, withColumnRenamed() supports renaming only a single column. It would 
> also be convenient to support renaming multiple columns at once.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12233) Cannot specify a data frame column during join

2015-12-08 Thread Fengdong Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fengdong Yu updated SPARK-12233:

Description: 
{code}
sqlContext.udf.register("lowercase", (s: String) =>{
  if (null == s) "" else s.toLowerCase
})

sqlContext.udf.register("substr", (s: String) =>{
  if (null == s) ""
  else {
val index = s.indexOf("@")
if (index < 0) s else s.toLowerCase.substring(index + 1)}
})

sqlContext.read.orc("/data/test/test.data")
.registerTempTable("testTable")

val extracted = 
sqlContext.sql(""" SELECT lowercase(given_name) AS given_name, 
  lowercase(family_name) AS 
family_name, 
  substr(email_address) AS domain, 
  lowercase(email_address) AS emailaddr,
  experience
  
   FROM testTable 
   WHERE email_address != '' 
   """)
   .distinct

val count =
 extracted.groupBy("given_name", "family_name", "domain")
   .count

count.where(count("count") > 1)
 .drop(count("count"))
 .join(extracted, Seq("given_name", "family_name", "domain"))
{code}
{color:red} .select(count("given_name"), count("family_name"), 
extracted("emailaddr"))  {color}
{code}
.show
{code}

Red Font should be:
{color:red} select("given_name", "family_name", "emailaddr") {color}

{code}
org.apache.spark.sql.AnalysisException: resolved attribute(s) emailaddr#525 
missing from 
given_name#522,domain#524,url#517,family_name#523,emailaddr#532,experience#490 
in operator !Project [given_name#522,family_name#523,emailaddr#525];
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
at 
org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914)
at org.apache.spark.sql.DataFrame.(DataFrame.scala:132)
at 
org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154)
at org.apache.spark.sql.DataFrame.select(DataFrame.scala:691)
at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:38)
at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:43)
at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:45)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:47)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:49)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:51)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:53)
at $iwC$$iwC$$iwC$$iwC$$iwC.(:55)
at $iwC$$iwC$$iwC$$iwC.(:57)
at $iwC$$iwC$$iwC.(:59)
at $iwC$$iwC.(:61)
at $iwC.(:63)
at (:65)
at .(:69)
at .()
at .(:7)
at .()
at $print()
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
at 
org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1340)
at 
org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
at 
org.apache.zeppelin.spark.SparkInterpreter.interpretInput(SparkInterpreter.java:675)
at 
org.apache.zeppelin.spark.SparkInterpreter.interpret(SparkInterpreter.java:640)
at 
org.apache.zeppelin.spark.SparkInterpreter.interpret(SparkInterpreter.java:633)
at 
org.apache.zeppelin.interpreter.ClassloaderInterpreter.interpret(ClassloaderInterpreter.java:57)
  

[jira] [Updated] (SPARK-12233) Cannot specify a data frame column during join

2015-12-08 Thread Fengdong Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fengdong Yu updated SPARK-12233:

Description: 
{code}
sqlContext.udf.register("lowercase", (s: String) =>{
  if (null == s) "" else s.toLowerCase
})

sqlContext.udf.register("substr", (s: String) =>{
  if (null == s) ""
  else {
val index = s.indexOf("@")
if (index < 0) s else s.toLowerCase.substring(index + 1)}
})

sqlContext.read.orc("/data/test/test.data")
.registerTempTable("testTable")

val extracted = 
sqlContext.sql(""" SELECT lowercase(given_name) AS given_name, 
  lowercase(family_name) AS 
family_name, 
  substr(email_address) AS domain, 
  lowercase(email_address) AS emailaddr,
  experience
  
   FROM testTable 
   WHERE email_address != '' 
   """)
   .distinct

val count =
 extracted.groupBy("given_name", "family_name", "domain")
   .count

count.where(count("count") > 1)
 .drop(count("count"))
 .join(extracted, Seq("given_name", "family_name", "domain"))
{code}
{color:red} .select(count("given_name"), count("family_name"), 
extracted("emailaddr"))  {color}



Red Font should be:
{color:red} select("given_name", "family_name", "emailaddr") {color}

{code}
org.apache.spark.sql.AnalysisException: resolved attribute(s) emailaddr#525 
missing from 
given_name#522,domain#524,url#517,family_name#523,emailaddr#532,experience#490 
in operator !Project [given_name#522,family_name#523,emailaddr#525];
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
at 
org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914)
at org.apache.spark.sql.DataFrame.(DataFrame.scala:132)
at 
org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154)
at org.apache.spark.sql.DataFrame.select(DataFrame.scala:691)
at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:38)
at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:43)
at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:45)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:47)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:49)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:51)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:53)
at $iwC$$iwC$$iwC$$iwC$$iwC.(:55)
at $iwC$$iwC$$iwC$$iwC.(:57)
at $iwC$$iwC$$iwC.(:59)
at $iwC$$iwC.(:61)
at $iwC.(:63)
at (:65)
at .(:69)
at .()
at .(:7)
at .()
at $print()
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
at 
org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1340)
at 
org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
at 
org.apache.zeppelin.spark.SparkInterpreter.interpretInput(SparkInterpreter.java:675)
at 
org.apache.zeppelin.spark.SparkInterpreter.interpret(SparkInterpreter.java:640)
at 
org.apache.zeppelin.spark.SparkInterpreter.interpret(SparkInterpreter.java:633)
at 
org.apache.zeppelin.interpreter.ClassloaderInterpreter.interpret(ClassloaderInterpreter.java:57)
at 

[jira] [Commented] (SPARK-12227) Support drop multiple columns specified by Column class in DataFrame API

2015-12-08 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048126#comment-15048126
 ] 

Sun Rui commented on SPARK-12227:
-

[~hyukjin.kwon] Sure.

> Support drop multiple columns specified by Column class in DataFrame API
> 
>
> Key: SPARK-12227
> URL: https://issues.apache.org/jira/browse/SPARK-12227
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Sun Rui
>
> In SPARK-11884, dropping multiple columns specified by column names in the 
> DataFrame API was supported.
> However, there are two drop variants for single column:
> {code}
> def drop(colName: String)
> def drop(col: Column)
> {code}
> From API parity's point of view, it would be better to also support dropping 
> multiple columns specified by Column class.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3200) Class defined with reference to external variables crashes in REPL.

2015-12-08 Thread Prashant Sharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Sharma updated SPARK-3200:
---
Description: 
Reproducer:
{noformat}
val a = sc.textFile("README.md").count
case class A(i: Int) { val j = a} 
sc.parallelize(1 to 10).map(A(_)).collect()
{noformat}
This will happen only in distributed mode, when one refers to something that 
itself refers to sc, and not otherwise. 
There are many ways to work around this, such as directly assigning a constant 
value instead of referring to the variable. 
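
A sketch of that workaround (the literal 95L below merely stands in for the 
precomputed count):
{noformat}
// Instead of `val j = a` (which drags the enclosing REPL wrapper, and with it
// sc, into the serialized class), assign a plain constant:
case class A(i: Int) { val j = 95L }
sc.parallelize(1 to 10).map(A(_)).collect()
{noformat}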

  was:
Reproducer:
{noformat}
val a = sc.textFile("README.md").count
case class A(i: Int) { val j = a} 
sc.parallelize(1 to 10).map(A(_)).collect()
{noformat}
This will happen, when one refers something that refers sc and not otherwise. 
There are many ways to work around this, like directly assign a constant value 
instead of referring the variable. 


> Class defined with reference to external variables crashes in REPL.
> ---
>
> Key: SPARK-3200
> URL: https://issues.apache.org/jira/browse/SPARK-3200
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.1.0
>Reporter: Prashant Sharma
>Assignee: Prashant Sharma
>
> Reproducer:
> {noformat}
> val a = sc.textFile("README.md").count
> case class A(i: Int) { val j = a} 
> sc.parallelize(1 to 10).map(A(_)).collect()
> {noformat}
> This will happen only in distributed mode, when one refers something that 
> refers sc and not otherwise. 
> There are many ways to work around this, like directly assign a constant 
> value instead of referring the variable. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12232) Consider exporting read.table in R

2015-12-08 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048189#comment-15048189
 ] 

Yanbo Liang edited comment on SPARK-12232 at 12/9/15 7:18 AM:
--

I vote for not exposing read.table, because it has different semantics compared 
with base R and the other read.* functions.
In "SQLContext.read.table(tableName: String)", users load a table as a 
DataFrame by specifying the tableName, but the table metadata must already 
exist in a catalog such as "HiveMetastoreCatalog". This means users cannot use 
"read.table()" to load an external data source as a DataFrame if it has no 
metadata stored in the Spark catalog; users must know the file format and use 
the corresponding function, such as "read.json".
The read.table interface is mainly used to access a table which has already 
been loaded into Spark as an RDD on the Spark SQL side; considering that RDDs 
will be deprecated in 2.0, I think it's unnecessary for SparkR. 
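
A small Scala-side contrast of the two access paths described above (table and 
file names are illustrative only):
{code}
// read.table resolves a name that must already be registered in the catalog;
// read.json points at external data directly, with no catalog entry required.
people.registerTempTable("people")                  // catalog entry created
val fromCatalog = sqlContext.read.table("people")   // works only after that
val fromFile = sqlContext.read.json("path/to/people.json")
{code}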



was (Author: yanboliang):
I vote for do not expose read.table because it has different semantics compared 
with base R and other read.*** functions.
In function "SQLContext.read.table(tableName: String)", users load a table as a 
DataFrame by specifying the tableName, but the table metadata must already 
exist in the catalog such as "HiveMetastoreCatalog". It means users can not use 
"read.table()" to load an external data source as a DataFrame if it does not 
have metadata stored at Spark catalog, user must know the file format and use 
corresponding function such as "read.json".
The read.table interface mainly used to access a table which has already loaded 
into Spark as RDD at Spark SQL side, so I think it's unnecessary for SparkR. 


> Consider exporting read.table in R
> --
>
> Key: SPARK-12232
> URL: https://issues.apache.org/jira/browse/SPARK-12232
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.2
>Reporter: Felix Cheung
>Priority: Minor
>
> Since we have read.df, read.json, read.parquet (some in pending PRs), we have 
> table() and we should consider having read.table() for consistency and 
> R-likeness.
> However, this conflicts with utils::read.table, which returns an R data.frame.
> It seems neither table() nor read.table() is desirable in this case.
> table: https://stat.ethz.ch/R-manual/R-devel/library/base/html/table.html
> read.table: 
> https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12232) Consider exporting read.table in R

2015-12-08 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048039#comment-15048039
 ] 

Felix Cheung commented on SPARK-12232:
--

WIP here: 
https://github.com/felixcheung/spark/commit/999607180fa1a30b14a6e182f23aeb322c977cf5

It seems table() is an odd choice, since in R it is about contingency tables.
read.table() matches our intent more closely, but by exporting it from SparkR 
we make base::read.table() inaccessible when called without the package:: 
prefix (there are no S4 generics for it), which seems very bad to me: the user 
then can't create a data.frame.

Thoughts?

[~shivaram][~sunrui][~yanboliang]

> Consider exporting read.table in R
> --
>
> Key: SPARK-12232
> URL: https://issues.apache.org/jira/browse/SPARK-12232
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.2
>Reporter: Felix Cheung
>Priority: Minor
>
> Since we have read.df, read.json, read.parquet (some in pending PRs), we have 
> table() and we should consider having read.table() for consistency and 
> R-likeness.
> However, this conflicts with utils::read.table, which returns an R data.frame.
> It seems neither table() nor read.table() is desirable in this case.
> table: https://stat.ethz.ch/R-manual/R-devel/library/base/html/table.html
> read.table: 
> https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12233) Cannot specify a data frame column during join

2015-12-08 Thread Fengdong Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fengdong Yu updated SPARK-12233:

Description: 
{code}
sqlContext.udf.register("lowercase", (s: String) =>{
  if (null == s) "" else s.toLowerCase
})

sqlContext.udf.register("substr", (s: String) =>{
  if (null == s) ""
  else {
val index = s.indexOf("@")
if (index < 0) s else s.toLowerCase.substring(index + 1)}
})

sqlContext.read.orc("/data/test/test.data")
.registerTempTable("testTable")

val extracted = 
sqlContext.sql(""" SELECT lowercase(given_name) AS given_name, 
  lowercase(family_name) AS 
family_name, 
  substr(email_address) AS domain, 
  lowercase(email_address) AS emailaddr,
  experience
  
   FROM testTable 
   WHERE email_address != '' 
   """)
   .distinct

val count =
 extracted.groupBy("given_name", "family_name", "domain")
   .count

count.where(count("count") > 1)
 .drop(count("count"))
 .join(extracted, Seq("given_name", "family_name", "domain"))
 {color:red}.select(count("given_name"), count("family_name"), 
extracted("emailaddr")) {color}
.show
{code}

Red Font should be:
{color:red} select("given_name", "family_name", "emailaddr") {color}
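
In other words, the tail of the chain that works is (a sketch based on the 
correction above):
{code}
count.where(count("count") > 1)
  .drop(count("count"))
  .join(extracted, Seq("given_name", "family_name", "domain"))
  .select("given_name", "family_name", "emailaddr")
  .show()
{code}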

{code}
org.apache.spark.sql.AnalysisException: resolved attribute(s) emailaddr#525 
missing from 
given_name#522,domain#524,url#517,family_name#523,emailaddr#532,experience#490 
in operator !Project [given_name#522,family_name#523,emailaddr#525];
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
at 
org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914)
at org.apache.spark.sql.DataFrame.(DataFrame.scala:132)
at 
org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154)
at org.apache.spark.sql.DataFrame.select(DataFrame.scala:691)
at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:38)
at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:43)
at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:45)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:47)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:49)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:51)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:53)
at $iwC$$iwC$$iwC$$iwC$$iwC.(:55)
at $iwC$$iwC$$iwC$$iwC.(:57)
at $iwC$$iwC$$iwC.(:59)
at $iwC$$iwC.(:61)
at $iwC.(:63)
at (:65)
at .(:69)
at .()
at .(:7)
at .()
at $print()
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
at 
org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1340)
at 
org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
at 
org.apache.zeppelin.spark.SparkInterpreter.interpretInput(SparkInterpreter.java:675)
at 
org.apache.zeppelin.spark.SparkInterpreter.interpret(SparkInterpreter.java:640)
at 
org.apache.zeppelin.spark.SparkInterpreter.interpret(SparkInterpreter.java:633)
at 
org.apache.zeppelin.interpreter.ClassloaderInterpreter.interpret(ClassloaderInterpreter.java:57)
at 

[jira] [Commented] (SPARK-12233) Cannot specify a data frame column during join

2015-12-08 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048087#comment-15048087
 ] 

Xiao Li commented on SPARK-12233:
-

Please post the error message you got. Thanks! 

> Cannot specify a data frame column during join
> --
>
> Key: SPARK-12233
> URL: https://issues.apache.org/jira/browse/SPARK-12233
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Fengdong Yu
>Priority: Minor
>
> Background:
> two tables: 
> tableA(id string, name string, gender string)
> tableB(id string, name string)
> {code}
> val df1 = sqlContext.sql("select * from tableA")
> val df2 = sqlContext.sql("select * from tableB")
> //Wrong
> df1.join(df2, Seq("id", "name")).select(df2("id"), df2("name"), df1("gender"))
> //Correct
> df1.join(df2, Seq("id", "name")).select("id", "name", "gender")
> {code}
> 
> Cannot specify a column of the data frame for 'gender'



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3200) Class defined with reference to external variables crashes in REPL.

2015-12-08 Thread Prashant Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048128#comment-15048128
 ] 

Prashant Sharma commented on SPARK-3200:


[~srowen] The reproducer is intended for distributed mode; I can still reproduce it.

> Class defined with reference to external variables crashes in REPL.
> ---
>
> Key: SPARK-3200
> URL: https://issues.apache.org/jira/browse/SPARK-3200
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.1.0
>Reporter: Prashant Sharma
>Assignee: Prashant Sharma
>
> Reproducer:
> {noformat}
> val a = sc.textFile("README.md").count
> case class A(i: Int) { val j = a} 
> sc.parallelize(1 to 10).map(A(_)).collect()
> {noformat}
> This will happen only in distributed mode, when one refers something that 
> refers sc and not otherwise. 
> There are many ways to work around this, like directly assign a constant 
> value instead of referring the variable. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-3200) Class defined with reference to external variables crashes in REPL.

2015-12-08 Thread Prashant Sharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Sharma closed SPARK-3200.
--
Resolution: Not A Problem

> Class defined with reference to external variables crashes in REPL.
> ---
>
> Key: SPARK-3200
> URL: https://issues.apache.org/jira/browse/SPARK-3200
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.1.0
>Reporter: Prashant Sharma
>Assignee: Prashant Sharma
>
> Reproducer:
> {noformat}
> val a = sc.textFile("README.md").count
> case class A(i: Int) { val j = a} 
> sc.parallelize(1 to 10).map(A(_)).collect()
> {noformat}
> This will happen only in distributed mode, when one refers something that 
> refers sc and not otherwise. 
> There are many ways to work around this, like directly assign a constant 
> value instead of referring the variable. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-3200) Class defined with reference to external variables crashes in REPL.

2015-12-08 Thread Prashant Sharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Sharma reopened SPARK-3200:


> Class defined with reference to external variables crashes in REPL.
> ---
>
> Key: SPARK-3200
> URL: https://issues.apache.org/jira/browse/SPARK-3200
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.1.0
>Reporter: Prashant Sharma
>Assignee: Prashant Sharma
>
> Reproducer:
> {noformat}
> val a = sc.textFile("README.md").count
> case class A(i: Int) { val j = a} 
> sc.parallelize(1 to 10).map(A(_)).collect()
> {noformat}
> This will happen only in distributed mode, when one refers something that 
> refers sc and not otherwise. 
> There are many ways to work around this, like directly assign a constant 
> value instead of referring the variable. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12227) Support drop multiple columns specified by Column class in DataFrame API

2015-12-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12227:


Assignee: (was: Apache Spark)

> Support drop multiple columns specified by Column class in DataFrame API
> 
>
> Key: SPARK-12227
> URL: https://issues.apache.org/jira/browse/SPARK-12227
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Sun Rui
>
> In SPARK-11884, dropping multiple columns specified by column names in the 
> DataFrame API was supported.
> However, there are two drop variants for single column:
> {code}
> def drop(colName: String)
> def drop(col: Column)
> {code}
> From API parity's point of view, it would be better to also support dropping 
> multiple columns specified by Column class.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3200) Class defined with reference to external variables crashes in REPL.

2015-12-08 Thread Prashant Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048144#comment-15048144
 ] 

Prashant Sharma commented on SPARK-3200:


However, the good part is that no one actually ran into this bug except me, who 
imagined it and filed it. This bug has existed since the beginning of the 
spark-repl, so we can close it with some other rationale.

> Class defined with reference to external variables crashes in REPL.
> ---
>
> Key: SPARK-3200
> URL: https://issues.apache.org/jira/browse/SPARK-3200
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.1.0
>Reporter: Prashant Sharma
>Assignee: Prashant Sharma
>
> Reproducer:
> {noformat}
> val a = sc.textFile("README.md").count
> case class A(i: Int) { val j = a} 
> sc.parallelize(1 to 10).map(A(_)).collect()
> {noformat}
> This will happen only in distributed mode, when one refers something that 
> refers sc and not otherwise. 
> There are many ways to work around this, like directly assign a constant 
> value instead of referring the variable. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12227) Support drop multiple columns specified by Column class in DataFrame API

2015-12-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12227:


Assignee: Apache Spark

> Support drop multiple columns specified by Column class in DataFrame API
> 
>
> Key: SPARK-12227
> URL: https://issues.apache.org/jira/browse/SPARK-12227
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Sun Rui
>Assignee: Apache Spark
>
> In SPARK-11884, dropping multiple columns specified by column names in the 
> DataFrame API was supported.
> However, there are two drop variants for single column:
> {code}
> def drop(colName: String)
> def drop(col: Column)
> {code}
> From API parity's point of view, it would be better to also support dropping 
> multiple columns specified by Column class.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12227) Support drop multiple columns specified by Column class in DataFrame API

2015-12-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048146#comment-15048146
 ] 

Apache Spark commented on SPARK-12227:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/10218

> Support drop multiple columns specified by Column class in DataFrame API
> 
>
> Key: SPARK-12227
> URL: https://issues.apache.org/jira/browse/SPARK-12227
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Sun Rui
>
> In SPARK-11884, dropping multiple columns specified by column names in the 
> DataFrame API was supported.
> However, there are two drop variants for single column:
> {code}
> def drop(colName: String)
> def drop(col: Column)
> {code}
> From API parity's point of view, it would be better to also support dropping 
> multiple columns specified by Column class.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12232) Consider exporting read.table in R

2015-12-08 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048195#comment-15048195
 ] 

Felix Cheung commented on SPARK-12232:
--

Right, but then table() is confusing as well.
R's notion of a `table` is more like a `data.frame`.


> Consider exporting read.table in R
> --
>
> Key: SPARK-12232
> URL: https://issues.apache.org/jira/browse/SPARK-12232
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.2
>Reporter: Felix Cheung
>Priority: Minor
>
> Since we have read.df, read.json, read.parquet (some in pending PRs), we have 
> table() and we should consider having read.table() for consistency and 
> R-likeness.
> However, this conflicts with utils::read.table, which returns an R data.frame.
> It seems neither table() nor read.table() is desirable in this case.
> table: https://stat.ethz.ch/R-manual/R-devel/library/base/html/table.html
> read.table: 
> https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12236) JDBC filter tests all pass if filters are not really pushed down

2015-12-08 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048199#comment-15048199
 ] 

Hyukjin Kwon commented on SPARK-12236:
--

I would like to work on this.

> JDBC filter tests all pass if filters are not really pushed down
> 
>
> Key: SPARK-12236
> URL: https://issues.apache.org/jira/browse/SPARK-12236
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Reporter: Hyukjin Kwon
>
> It is similar to https://issues.apache.org/jira/browse/SPARK-11676 and 
> https://issues.apache.org/jira/browse/SPARK-11677.
> Currently the JDBC predicate tests all pass even if the filters are not pushed 
> down or push-down is disabled.
> This is because of Spark-side filtering. 
> Moreover, {{Not(Equal)}} is also being tested, which is actually not pushed 
> down to the JDBC datasource.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


