[jira] [Commented] (SPARK-8621) crosstab exception when one of the value is empty

2015-06-29 Thread Animesh Baranawal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605273#comment-14605273
 ] 

Animesh Baranawal commented on SPARK-8621:
--

cc [~marmbrus] [~rxin] I think enclosing the column names and row names in 
backticks (``) is better
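
A minimal sketch of the idea, assuming the quoting would happen where 
crossTabulate builds the result (the helper below is hypothetical, not 
existing Spark API):
{code}
// Wrap a value in backticks so that an empty (or dotted) value survives
// LogicalPlan.resolveQuoted when na.fill later resolves it as a column name.
def quoteIdentifier(name: String): String =
  "`" + name.replace("`", "``") + "`"

quoteIdentifier("")    // returns "``" instead of an empty attribute name
quoteIdentifier("a.b") // returns "`a.b`", so the dot is not parsed as a path
{code}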

 crosstab exception when one of the value is empty
 -

 Key: SPARK-8621
 URL: https://issues.apache.org/jira/browse/SPARK-8621
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Priority: Critical

 I think this happened because some value is empty.
 {code}
 scala> df1.stat.crosstab("role", "lang")
 org.apache.spark.sql.AnalysisException: syntax error in attribute name: ;
   at 
 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.parseAttributeName(LogicalPlan.scala:145)
   at 
 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:135)
   at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:157)
   at org.apache.spark.sql.DataFrame.col(DataFrame.scala:603)
   at 
 org.apache.spark.sql.DataFrameNaFunctions.org$apache$spark$sql$DataFrameNaFunctions$$fillCol(DataFrameNaFunctions.scala:394)
   at 
 org.apache.spark.sql.DataFrameNaFunctions$$anonfun$2.apply(DataFrameNaFunctions.scala:160)
   at 
 org.apache.spark.sql.DataFrameNaFunctions$$anonfun$2.apply(DataFrameNaFunctions.scala:157)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
   at 
 org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:157)
   at 
 org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:147)
   at 
 org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:132)
   at 
 org.apache.spark.sql.execution.stat.StatFunctions$.crossTabulate(StatFunctions.scala:132)
   at 
 org.apache.spark.sql.DataFrameStatFunctions.crosstab(DataFrameStatFunctions.scala:91)
 {code}






[jira] [Commented] (SPARK-8636) CaseKeyWhen has incorrect NULL handling

2015-06-28 Thread Animesh Baranawal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605112#comment-14605112
 ] 

Animesh Baranawal commented on SPARK-8636:
--

No, a row with null will not be equal to another row with null if we follow the 
modification. What do you say, [~smolav]?
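
For reference, this matches plain SQL comparison semantics; a quick 
spark-shell check (a sketch, assuming a spark-shell with sqlContext in scope) 
shows that a null comparison never matches:
{code}
// NULL = NULL evaluates to NULL, not TRUE, so under the proposed change a
// null key never matches any branch.
sqlContext.range(0, 1)
  .selectExpr("CASE WHEN NULL = NULL THEN 'match' ELSE 'no match' END AS r")
  .show()
// prints: no match
{code}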

 CaseKeyWhen has incorrect NULL handling
 ---

 Key: SPARK-8636
 URL: https://issues.apache.org/jira/browse/SPARK-8636
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
Reporter: Santiago M. Mola
  Labels: starter

 CaseKeyWhen implementation in Spark uses the following equals implementation:
 {code}
   private def equalNullSafe(l: Any, r: Any) = {
 if (l == null && r == null) {
   true
 } else if (l == null || r == null) {
   false
 } else {
   l == r
 }
   }
 {code}
 Which is not correct, since in SQL, NULL is never equal to NULL (actually, it 
 is not unequal either). In this case, a NULL value in a CASE WHEN expression 
 should never match.
 For example, you can execute this in MySQL:
 {code}
 SELECT CASE NULL WHEN NULL THEN 'NULL MATCHES' ELSE 'NULL DOES NOT MATCH' END 
 FROM DUAL;
 {code}
 And the result will be 'NULL DOES NOT MATCH'.






[jira] [Commented] (SPARK-8636) CaseKeyWhen has incorrect NULL handling

2015-06-25 Thread Animesh Baranawal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14602411#comment-14602411
 ] 

Animesh Baranawal commented on SPARK-8636:
--

So the condition should be:
if (l == null || r == null) false
else l == r
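
Spelled out, the modified helper would look something like this (a sketch of 
the proposal above, not the committed fix):
{code}
// Proposed: a null key or null branch value never matches, per SQL semantics.
private def equalNullSafe(l: Any, r: Any) = {
  if (l == null || r == null) {
    false
  } else {
    l == r
  }
}
{code}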

 CaseKeyWhen has incorrect NULL handling
 ---

 Key: SPARK-8636
 URL: https://issues.apache.org/jira/browse/SPARK-8636
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
Reporter: Santiago M. Mola
  Labels: starter

 CaseKeyWhen implementation in Spark uses the following equals implementation:
 {code}
   private def equalNullSafe(l: Any, r: Any) = {
 if (l == null && r == null) {
   true
 } else if (l == null || r == null) {
   false
 } else {
   l == r
 }
   }
 {code}
 Which is not correct, since in SQL, NULL is never equal to NULL (actually, it 
 is not unequal either). In this case, a NULL value in a CASE WHEN expression 
 should never match.
 For example, you can execute this in MySQL:
 {code}
 SELECT CASE NULL WHEN NULL THEN 'NULL MATCHES' ELSE 'NULL DOES NOT MATCH' END 
 FROM DUAL;
 {code}
 And the result will be 'NULL DOES NOT MATCH'.






[jira] [Comment Edited] (SPARK-8636) CaseKeyWhen has incorrect NULL handling

2015-06-25 Thread Animesh Baranawal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14602411#comment-14602411
 ] 

Animesh Baranawal edited comment on SPARK-8636 at 6/26/15 5:13 AM:
---

So the condition should be:
if (l == null || r == null) false
else l == r ?


was (Author: animeshbaranawal):
So the condition should be:
if (l == null || r == null) false
else l == r

 CaseKeyWhen has incorrect NULL handling
 ---

 Key: SPARK-8636
 URL: https://issues.apache.org/jira/browse/SPARK-8636
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
Reporter: Santiago M. Mola
  Labels: starter

 CaseKeyWhen implementation in Spark uses the following equals implementation:
 {code}
   private def equalNullSafe(l: Any, r: Any) = {
 if (l == null && r == null) {
   true
 } else if (l == null || r == null) {
   false
 } else {
   l == r
 }
   }
 {code}
 Which is not correct, since in SQL, NULL is never equal to NULL (actually, it 
 is not unequal either). In this case, a NULL value in a CASE WHEN expression 
 should never match.
 For example, you can execute this in MySQL:
 {code}
 SELECT CASE NULL WHEN NULL THEN 'NULL MATCHES' ELSE 'NULL DOES NOT MATCH' END 
 FROM DUAL;
 {code}
 And the result will be 'NULL DOES NOT MATCH'.






[jira] [Commented] (SPARK-8621) crosstab exception when one of the value is empty

2015-06-25 Thread Animesh Baranawal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14601008#comment-14601008
 ] 

Animesh Baranawal commented on SPARK-8621:
--

How about enclosing the column names and row names in backticks (``)?

 crosstab exception when one of the value is empty
 -

 Key: SPARK-8621
 URL: https://issues.apache.org/jira/browse/SPARK-8621
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Priority: Critical

 I think this happened because some value is empty.
 {code}
 scala> df1.stat.crosstab("role", "lang")
 org.apache.spark.sql.AnalysisException: syntax error in attribute name: ;
   at 
 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.parseAttributeName(LogicalPlan.scala:145)
   at 
 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:135)
   at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:157)
   at org.apache.spark.sql.DataFrame.col(DataFrame.scala:603)
   at 
 org.apache.spark.sql.DataFrameNaFunctions.org$apache$spark$sql$DataFrameNaFunctions$$fillCol(DataFrameNaFunctions.scala:394)
   at 
 org.apache.spark.sql.DataFrameNaFunctions$$anonfun$2.apply(DataFrameNaFunctions.scala:160)
   at 
 org.apache.spark.sql.DataFrameNaFunctions$$anonfun$2.apply(DataFrameNaFunctions.scala:157)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
   at 
 org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:157)
   at 
 org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:147)
   at 
 org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:132)
   at 
 org.apache.spark.sql.execution.stat.StatFunctions$.crossTabulate(StatFunctions.scala:132)
   at 
 org.apache.spark.sql.DataFrameStatFunctions.crosstab(DataFrameStatFunctions.scala:91)
 {code}






[jira] [Comment Edited] (SPARK-8072) Better AnalysisException for writing DataFrame with identically named columns

2015-06-24 Thread Animesh Baranawal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598890#comment-14598890
 ] 

Animesh Baranawal edited comment on SPARK-8072 at 6/24/15 6:54 AM:
---

[~rxin]
If we want the rule to apply only on some save/output action, wouldn't it be 
much better to check the rule before calling the write function instead of 
adding the rule in CheckAnalysis.scala?


was (Author: animeshbaranawal):
[~rxin]
If we want the rule to apply only on some save/output action, wouldn't it be 
more intuitive to check the rule before calling the write function instead of 
adding the rule in CheckAnalysis.scala?
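
A sketch of what such a pre-write check could look like (the helper name and 
call site are hypothetical, not existing Spark code):
{code}
import org.apache.spark.sql.{AnalysisException, DataFrame}

// Invoked from DataFrameWriter before any save/output action, instead of
// hooking the rule into CheckAnalysis:
private def checkDuplicateColumns(df: DataFrame): Unit = {
  val names = df.schema.fieldNames.map(_.toLowerCase)
  val dupes = names.diff(names.distinct).distinct
  if (dupes.nonEmpty) {
    throw new AnalysisException(
      s"cannot save: found duplicate column(s) ${dupes.mkString(", ")}")
  }
}
{code}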

 Better AnalysisException for writing DataFrame with identically named columns
 -

 Key: SPARK-8072
 URL: https://issues.apache.org/jira/browse/SPARK-8072
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Priority: Blocker

 We should check if there are duplicate columns, and if yes, throw an explicit 
 error message saying there are duplicate columns. See current error message 
 below. 
 {code}
 In [3]: df.withColumn('age', df.age)
 Out[3]: DataFrame[age: bigint, name: string, age: bigint]
 In [4]: df.withColumn('age', df.age).write.parquet('test-parquet.out')
 ---
 Py4JJavaError Traceback (most recent call last)
 <ipython-input-4-eecb85256898> in <module>()
 ----> 1 df.withColumn('age', df.age).write.parquet('test-parquet.out')
 /scratch/rxin/spark/python/pyspark/sql/readwriter.py in parquet(self, path, 
 mode)
 350 >>> df.write.parquet(os.path.join(tempfile.mkdtemp(), 'data'))
 351 
 --> 352 self._jwrite.mode(mode).parquet(path)
 353 
 354 @since(1.4)
 /Users/rxin/anaconda/lib/python2.7/site-packages/py4j-0.8.1-py2.7.egg/py4j/java_gateway.pyc
  in __call__(self, *args)
 535 answer = self.gateway_client.send_command(command)
 536 return_value = get_return_value(answer, self.gateway_client,
 --> 537 self.target_id, self.name)
 538 
 539 for temp_arg in temp_args:
 /Users/rxin/anaconda/lib/python2.7/site-packages/py4j-0.8.1-py2.7.egg/py4j/protocol.pyc
  in get_return_value(answer, gateway_client, target_id, name)
 298 raise Py4JJavaError(
 299 'An error occurred while calling {0}{1}{2}.\n'.
 --> 300 format(target_id, '.', name), value)
 301 else:
 302 raise Py4JError(
 Py4JJavaError: An error occurred while calling o35.parquet.
 : org.apache.spark.sql.AnalysisException: Reference 'age' is ambiguous, could 
 be: age#0L, age#3L.;
   at 
 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:279)
   at 
 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:116)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4$$anonfun$16.apply(Analyzer.scala:350)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4$$anonfun$16.apply(Analyzer.scala:350)
   at 
 org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:350)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:341)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286)
   at 
 org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:285)
   at 
 org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionUp$1(QueryPlan.scala:108)
   at 
 org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2$$anonfun$apply$2.apply(QueryPlan.scala:123)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at scala.collection.immutable.List.foreach(List.scala:318)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
   at 
 

[jira] [Comment Edited] (SPARK-8072) Better AnalysisException for writing DataFrame with identically named columns

2015-06-23 Thread Animesh Baranawal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598890#comment-14598890
 ] 

Animesh Baranawal edited comment on SPARK-8072 at 6/24/15 5:34 AM:
---

[~rxin]
If we want the rule to apply only on some save/output action, wouldn't it be 
more intuitive to check the rule before calling the write function instead of 
adding the rule in CheckAnalysis.scala?


was (Author: animeshbaranawal):
[~rxin]
If we want the rule to apply only on some save/output action, wouldn't it be 
more intuitive to add the rule in DataFrameWriter.scala instead of in 
CheckAnalysis.scala?

 Better AnalysisException for writing DataFrame with identically named columns
 -

 Key: SPARK-8072
 URL: https://issues.apache.org/jira/browse/SPARK-8072
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Priority: Blocker

 We should check if there are duplicate columns, and if yes, throw an explicit 
 error message saying there are duplicate columns. See current error message 
 below. 
 {code}
 In [3]: df.withColumn('age', df.age)
 Out[3]: DataFrame[age: bigint, name: string, age: bigint]
 In [4]: df.withColumn('age', df.age).write.parquet('test-parquet.out')
 ---
 Py4JJavaError Traceback (most recent call last)
 <ipython-input-4-eecb85256898> in <module>()
 ----> 1 df.withColumn('age', df.age).write.parquet('test-parquet.out')
 /scratch/rxin/spark/python/pyspark/sql/readwriter.py in parquet(self, path, 
 mode)
 350 >>> df.write.parquet(os.path.join(tempfile.mkdtemp(), 'data'))
 351 
 --> 352 self._jwrite.mode(mode).parquet(path)
 353 
 354 @since(1.4)
 /Users/rxin/anaconda/lib/python2.7/site-packages/py4j-0.8.1-py2.7.egg/py4j/java_gateway.pyc
  in __call__(self, *args)
 535 answer = self.gateway_client.send_command(command)
 536 return_value = get_return_value(answer, self.gateway_client,
 --> 537 self.target_id, self.name)
 538 
 539 for temp_arg in temp_args:
 /Users/rxin/anaconda/lib/python2.7/site-packages/py4j-0.8.1-py2.7.egg/py4j/protocol.pyc
  in get_return_value(answer, gateway_client, target_id, name)
 298 raise Py4JJavaError(
 299 'An error occurred while calling {0}{1}{2}.\n'.
 --> 300 format(target_id, '.', name), value)
 301 else:
 302 raise Py4JError(
 Py4JJavaError: An error occurred while calling o35.parquet.
 : org.apache.spark.sql.AnalysisException: Reference 'age' is ambiguous, could 
 be: age#0L, age#3L.;
   at 
 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:279)
   at 
 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:116)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4$$anonfun$16.apply(Analyzer.scala:350)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4$$anonfun$16.apply(Analyzer.scala:350)
   at 
 org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:350)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:341)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286)
   at 
 org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:285)
   at 
 org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionUp$1(QueryPlan.scala:108)
   at 
 org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2$$anonfun$apply$2.apply(QueryPlan.scala:123)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at scala.collection.immutable.List.foreach(List.scala:318)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
   at 
 

[jira] [Comment Edited] (SPARK-8072) Better AnalysisException for writing DataFrame with identically named columns

2015-06-23 Thread Animesh Baranawal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598890#comment-14598890
 ] 

Animesh Baranawal edited comment on SPARK-8072 at 6/24/15 5:29 AM:
---

[rxin]
If we want the rule to apply only on some save/output action, wouldn't it be 
more intuitive to add the rule in DataFrameWriter.scala instead of in 
CheckAnalysis.scala?


was (Author: animeshbaranawal):
If we want the rule to apply only on some save/output action, wouldn't it be 
more intuitive to add the rule in DataFrameWriter.scala instead of in 
CheckAnalysis.scala?

 Better AnalysisException for writing DataFrame with identically named columns
 -

 Key: SPARK-8072
 URL: https://issues.apache.org/jira/browse/SPARK-8072
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Priority: Blocker

 We should check if there are duplicate columns, and if yes, throw an explicit 
 error message saying there are duplicate columns. See current error message 
 below. 
 {code}
 In [3]: df.withColumn('age', df.age)
 Out[3]: DataFrame[age: bigint, name: string, age: bigint]
 In [4]: df.withColumn('age', df.age).write.parquet('test-parquet.out')
 ---
 Py4JJavaError Traceback (most recent call last)
 <ipython-input-4-eecb85256898> in <module>()
 ----> 1 df.withColumn('age', df.age).write.parquet('test-parquet.out')
 /scratch/rxin/spark/python/pyspark/sql/readwriter.py in parquet(self, path, 
 mode)
 350 >>> df.write.parquet(os.path.join(tempfile.mkdtemp(), 'data'))
 351 
 --> 352 self._jwrite.mode(mode).parquet(path)
 353 
 354 @since(1.4)
 /Users/rxin/anaconda/lib/python2.7/site-packages/py4j-0.8.1-py2.7.egg/py4j/java_gateway.pyc
  in __call__(self, *args)
 535 answer = self.gateway_client.send_command(command)
 536 return_value = get_return_value(answer, self.gateway_client,
 --> 537 self.target_id, self.name)
 538 
 539 for temp_arg in temp_args:
 /Users/rxin/anaconda/lib/python2.7/site-packages/py4j-0.8.1-py2.7.egg/py4j/protocol.pyc
  in get_return_value(answer, gateway_client, target_id, name)
 298 raise Py4JJavaError(
 299 'An error occurred while calling {0}{1}{2}.\n'.
 --> 300 format(target_id, '.', name), value)
 301 else:
 302 raise Py4JError(
 Py4JJavaError: An error occurred while calling o35.parquet.
 : org.apache.spark.sql.AnalysisException: Reference 'age' is ambiguous, could 
 be: age#0L, age#3L.;
   at 
 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:279)
   at 
 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:116)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4$$anonfun$16.apply(Analyzer.scala:350)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4$$anonfun$16.apply(Analyzer.scala:350)
   at 
 org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:350)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:341)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286)
   at 
 org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:285)
   at 
 org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionUp$1(QueryPlan.scala:108)
   at 
 org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2$$anonfun$apply$2.apply(QueryPlan.scala:123)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at scala.collection.immutable.List.foreach(List.scala:318)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
   at 
 

[jira] [Comment Edited] (SPARK-8072) Better AnalysisException for writing DataFrame with identically named columns

2015-06-23 Thread Animesh Baranawal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598890#comment-14598890
 ] 

Animesh Baranawal edited comment on SPARK-8072 at 6/24/15 5:29 AM:
---

[~rxin]
If we want the rule to apply only on some save/output action, wouldn't it be 
more intuitive to add the rule in DataFrameWriter.scala instead of in 
CheckAnalysis.scala?


was (Author: animeshbaranawal):
[rxin]
If we want the rule to apply only on some save/output action, wouldn't it be 
more intuitive to add the rule in DataFrameWriter.scala instead of in 
CheckAnalysis.scala?

 Better AnalysisException for writing DataFrame with identically named columns
 -

 Key: SPARK-8072
 URL: https://issues.apache.org/jira/browse/SPARK-8072
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Priority: Blocker

 We should check if there are duplicate columns, and if yes, throw an explicit 
 error message saying there are duplicate columns. See current error message 
 below. 
 {code}
 In [3]: df.withColumn('age', df.age)
 Out[3]: DataFrame[age: bigint, name: string, age: bigint]
 In [4]: df.withColumn('age', df.age).write.parquet('test-parquet.out')
 ---
 Py4JJavaError Traceback (most recent call last)
 <ipython-input-4-eecb85256898> in <module>()
 ----> 1 df.withColumn('age', df.age).write.parquet('test-parquet.out')
 /scratch/rxin/spark/python/pyspark/sql/readwriter.py in parquet(self, path, 
 mode)
 350 >>> df.write.parquet(os.path.join(tempfile.mkdtemp(), 'data'))
 351 
 --> 352 self._jwrite.mode(mode).parquet(path)
 353 
 354 @since(1.4)
 /Users/rxin/anaconda/lib/python2.7/site-packages/py4j-0.8.1-py2.7.egg/py4j/java_gateway.pyc
  in __call__(self, *args)
 535 answer = self.gateway_client.send_command(command)
 536 return_value = get_return_value(answer, self.gateway_client,
 --> 537 self.target_id, self.name)
 538 
 539 for temp_arg in temp_args:
 /Users/rxin/anaconda/lib/python2.7/site-packages/py4j-0.8.1-py2.7.egg/py4j/protocol.pyc
  in get_return_value(answer, gateway_client, target_id, name)
 298 raise Py4JJavaError(
 299 'An error occurred while calling {0}{1}{2}.\n'.
 --> 300 format(target_id, '.', name), value)
 301 else:
 302 raise Py4JError(
 Py4JJavaError: An error occurred while calling o35.parquet.
 : org.apache.spark.sql.AnalysisException: Reference 'age' is ambiguous, could 
 be: age#0L, age#3L.;
   at 
 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:279)
   at 
 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:116)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4$$anonfun$16.apply(Analyzer.scala:350)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4$$anonfun$16.apply(Analyzer.scala:350)
   at 
 org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:350)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:341)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286)
   at 
 org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:285)
   at 
 org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionUp$1(QueryPlan.scala:108)
   at 
 org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2$$anonfun$apply$2.apply(QueryPlan.scala:123)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at scala.collection.immutable.List.foreach(List.scala:318)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
   at 
 

[jira] [Commented] (SPARK-8072) Better AnalysisException for writing DataFrame with identically named columns

2015-06-23 Thread Animesh Baranawal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598890#comment-14598890
 ] 

Animesh Baranawal commented on SPARK-8072:
--

If we want the rule to apply only on some save/output action, wouldn't it be 
more intuitive to add the rule in DataFrameWriter.scala instead of in 
CheckAnalysis.scala?

 Better AnalysisException for writing DataFrame with identically named columns
 -

 Key: SPARK-8072
 URL: https://issues.apache.org/jira/browse/SPARK-8072
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Priority: Blocker

 We should check if there are duplicate columns, and if yes, throw an explicit 
 error message saying there are duplicate columns. See current error message 
 below. 
 {code}
 In [3]: df.withColumn('age', df.age)
 Out[3]: DataFrame[age: bigint, name: string, age: bigint]
 In [4]: df.withColumn('age', df.age).write.parquet('test-parquet.out')
 ---
 Py4JJavaError Traceback (most recent call last)
 <ipython-input-4-eecb85256898> in <module>()
 ----> 1 df.withColumn('age', df.age).write.parquet('test-parquet.out')
 /scratch/rxin/spark/python/pyspark/sql/readwriter.py in parquet(self, path, 
 mode)
 350 >>> df.write.parquet(os.path.join(tempfile.mkdtemp(), 'data'))
 351 
 --> 352 self._jwrite.mode(mode).parquet(path)
 353 
 354 @since(1.4)
 /Users/rxin/anaconda/lib/python2.7/site-packages/py4j-0.8.1-py2.7.egg/py4j/java_gateway.pyc
  in __call__(self, *args)
 535 answer = self.gateway_client.send_command(command)
 536 return_value = get_return_value(answer, self.gateway_client,
 --> 537 self.target_id, self.name)
 538 
 539 for temp_arg in temp_args:
 /Users/rxin/anaconda/lib/python2.7/site-packages/py4j-0.8.1-py2.7.egg/py4j/protocol.pyc
  in get_return_value(answer, gateway_client, target_id, name)
 298 raise Py4JJavaError(
 299 'An error occurred while calling {0}{1}{2}.\n'.
 --> 300 format(target_id, '.', name), value)
 301 else:
 302 raise Py4JError(
 Py4JJavaError: An error occurred while calling o35.parquet.
 : org.apache.spark.sql.AnalysisException: Reference 'age' is ambiguous, could 
 be: age#0L, age#3L.;
   at 
 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:279)
   at 
 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:116)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4$$anonfun$16.apply(Analyzer.scala:350)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4$$anonfun$16.apply(Analyzer.scala:350)
   at 
 org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:350)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:341)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286)
   at 
 org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:285)
   at 
 org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionUp$1(QueryPlan.scala:108)
   at 
 org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2$$anonfun$apply$2.apply(QueryPlan.scala:123)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at scala.collection.immutable.List.foreach(List.scala:318)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
   at 
 org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:122)
   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
   at 
 

[jira] [Commented] (SPARK-8072) Better AnalysisException for writing DataFrame with identically named columns

2015-06-23 Thread Animesh Baranawal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597271#comment-14597271
 ] 

Animesh Baranawal commented on SPARK-8072:
--

[~rxin]
Creating a rule regarding duplicate columns causes three of the tests in 
DataFrameSuite to fail. The tests are:

1. drop column after join with duplicate columns using column reference
2. explode alias and star
3. describe 

Can you suggest how to resolve this?

 Better AnalysisException for writing DataFrame with identically named columns
 -

 Key: SPARK-8072
 URL: https://issues.apache.org/jira/browse/SPARK-8072
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Priority: Blocker

 We should check if there are duplicate columns, and if yes, throw an explicit 
 error message saying there are duplicate columns. See current error message 
 below. 
 {code}
 In [3]: df.withColumn('age', df.age)
 Out[3]: DataFrame[age: bigint, name: string, age: bigint]
 In [4]: df.withColumn('age', df.age).write.parquet('test-parquet.out')
 ---
 Py4JJavaError Traceback (most recent call last)
 <ipython-input-4-eecb85256898> in <module>()
 ----> 1 df.withColumn('age', df.age).write.parquet('test-parquet.out')
 /scratch/rxin/spark/python/pyspark/sql/readwriter.py in parquet(self, path, 
 mode)
 350 >>> df.write.parquet(os.path.join(tempfile.mkdtemp(), 'data'))
 351 
 --> 352 self._jwrite.mode(mode).parquet(path)
 353 
 354 @since(1.4)
 /Users/rxin/anaconda/lib/python2.7/site-packages/py4j-0.8.1-py2.7.egg/py4j/java_gateway.pyc
  in __call__(self, *args)
 535 answer = self.gateway_client.send_command(command)
 536 return_value = get_return_value(answer, self.gateway_client,
 --> 537 self.target_id, self.name)
 538 
 539 for temp_arg in temp_args:
 /Users/rxin/anaconda/lib/python2.7/site-packages/py4j-0.8.1-py2.7.egg/py4j/protocol.pyc
  in get_return_value(answer, gateway_client, target_id, name)
 298 raise Py4JJavaError(
 299 'An error occurred while calling {0}{1}{2}.\n'.
 --> 300 format(target_id, '.', name), value)
 301 else:
 302 raise Py4JError(
 Py4JJavaError: An error occurred while calling o35.parquet.
 : org.apache.spark.sql.AnalysisException: Reference 'age' is ambiguous, could 
 be: age#0L, age#3L.;
   at 
 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:279)
   at 
 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:116)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4$$anonfun$16.apply(Analyzer.scala:350)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4$$anonfun$16.apply(Analyzer.scala:350)
   at 
 org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:350)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:341)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286)
   at 
 org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:285)
   at 
 org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionUp$1(QueryPlan.scala:108)
   at 
 org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2$$anonfun$apply$2.apply(QueryPlan.scala:123)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at scala.collection.immutable.List.foreach(List.scala:318)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
   at 
 org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:122)
   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
   at 

[jira] [Comment Edited] (SPARK-8072) Better AnalysisException for writing DataFrame with identically named columns

2015-06-23 Thread Animesh Baranawal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597271#comment-14597271
 ] 

Animesh Baranawal edited comment on SPARK-8072 at 6/23/15 8:24 AM:
---

[~rxin]
Creating a rule regarding duplicate columns causes three of the tests in 
DataFrameSuite to fail. The tests are:

1. drop column after join with duplicate columns using column reference
2. explode alias and star
3. describe 

If we are supporting only distinct columns in a dataframe, then we don't need 
the first test and the second test can be made to pass easily.
However, in the third test, the describe function uses the agg function which 
results in a dataframe with multiple columns with identical names. How to 
resolve this?


was (Author: animeshbaranawal):
[~rxin]
Creating a rule regarding duplicate columns causes three of the tests in 
DataFrameSuite to fail. The tests are:

1. drop column after join with duplicate columns using column reference
2. explode alias and star
3. describe 

Can you suggest how to resolve this?

 Better AnalysisException for writing DataFrame with identically named columns
 -

 Key: SPARK-8072
 URL: https://issues.apache.org/jira/browse/SPARK-8072
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Priority: Blocker

 We should check if there are duplicate columns, and if yes, throw an explicit 
 error message saying there are duplicate columns. See current error message 
 below. 
 {code}
 In [3]: df.withColumn('age', df.age)
 Out[3]: DataFrame[age: bigint, name: string, age: bigint]
 In [4]: df.withColumn('age', df.age).write.parquet('test-parquet.out')
 ---
 Py4JJavaError Traceback (most recent call last)
 <ipython-input-4-eecb85256898> in <module>()
 ----> 1 df.withColumn('age', df.age).write.parquet('test-parquet.out')
 /scratch/rxin/spark/python/pyspark/sql/readwriter.py in parquet(self, path, 
 mode)
 350 >>> df.write.parquet(os.path.join(tempfile.mkdtemp(), 'data'))
 351 
 --> 352 self._jwrite.mode(mode).parquet(path)
 353 
 354 @since(1.4)
 /Users/rxin/anaconda/lib/python2.7/site-packages/py4j-0.8.1-py2.7.egg/py4j/java_gateway.pyc
  in __call__(self, *args)
 535 answer = self.gateway_client.send_command(command)
 536 return_value = get_return_value(answer, self.gateway_client,
 --> 537 self.target_id, self.name)
 538 
 539 for temp_arg in temp_args:
 /Users/rxin/anaconda/lib/python2.7/site-packages/py4j-0.8.1-py2.7.egg/py4j/protocol.pyc
  in get_return_value(answer, gateway_client, target_id, name)
 298 raise Py4JJavaError(
 299 'An error occurred while calling {0}{1}{2}.\n'.
 --> 300 format(target_id, '.', name), value)
 301 else:
 302 raise Py4JError(
 Py4JJavaError: An error occurred while calling o35.parquet.
 : org.apache.spark.sql.AnalysisException: Reference 'age' is ambiguous, could 
 be: age#0L, age#3L.;
   at 
 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:279)
   at 
 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:116)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4$$anonfun$16.apply(Analyzer.scala:350)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4$$anonfun$16.apply(Analyzer.scala:350)
   at 
 org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:350)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:341)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286)
   at 
 org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:285)
   at 
 org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionUp$1(QueryPlan.scala:108)
   at 
 org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2$$anonfun$apply$2.apply(QueryPlan.scala:123)
   at 
 

[jira] [Comment Edited] (SPARK-8072) Better AnalysisException for writing DataFrame with identically named columns

2015-06-23 Thread Animesh Baranawal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597271#comment-14597271
 ] 

Animesh Baranawal edited comment on SPARK-8072 at 6/23/15 9:28 AM:
---

[~rxin]
Creating a rule regarding duplicate columns causes three of the tests in 
DataFrameSuite to fail. The tests are:

1. drop column after join with duplicate columns using column reference
2. explode alias and star
3. describe 

If we are supporting only distinct columns in a dataframe, then we don't need 
the first test and the second test can be made to pass easily.
However, in the third test, the describe function uses the agg function which 
results in a dataframe with multiple columns with identical names. This can 
also be resolved with a minor tweak in the code of agg.

Should the first test then be removed?


was (Author: animeshbaranawal):
[~rxin]
Creating a rule regarding duplicate columns causes three of the tests in 
DataFrameSuite to fail. The tests are:

1. drop column after join with duplicate columns using column reference
2. explode alias and star
3. describe 

If we are supporting only distinct columns in a dataframe, then we don't need 
the first test and the second test can be made to pass easily.
However, in the third test, the describe function uses the agg function which 
results in a dataframe with multiple columns with identical names. How to 
resolve this?

 Better AnalysisException for writing DataFrame with identically named columns
 -

 Key: SPARK-8072
 URL: https://issues.apache.org/jira/browse/SPARK-8072
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Priority: Blocker

 We should check if there are duplicate columns, and if yes, throw an explicit 
 error message saying there are duplicate columns. See current error message 
 below. 
 {code}
 In [3]: df.withColumn('age', df.age)
 Out[3]: DataFrame[age: bigint, name: string, age: bigint]
 In [4]: df.withColumn('age', df.age).write.parquet('test-parquet.out')
 ---
 Py4JJavaError Traceback (most recent call last)
 <ipython-input-4-eecb85256898> in <module>()
 ----> 1 df.withColumn('age', df.age).write.parquet('test-parquet.out')
 /scratch/rxin/spark/python/pyspark/sql/readwriter.py in parquet(self, path, 
 mode)
 350 >>> df.write.parquet(os.path.join(tempfile.mkdtemp(), 'data'))
 351 
 --> 352 self._jwrite.mode(mode).parquet(path)
 353 
 354 @since(1.4)
 /Users/rxin/anaconda/lib/python2.7/site-packages/py4j-0.8.1-py2.7.egg/py4j/java_gateway.pyc
  in __call__(self, *args)
 535 answer = self.gateway_client.send_command(command)
 536 return_value = get_return_value(answer, self.gateway_client,
 --> 537 self.target_id, self.name)
 538 
 539 for temp_arg in temp_args:
 /Users/rxin/anaconda/lib/python2.7/site-packages/py4j-0.8.1-py2.7.egg/py4j/protocol.pyc
  in get_return_value(answer, gateway_client, target_id, name)
 298 raise Py4JJavaError(
 299 'An error occurred while calling {0}{1}{2}.\n'.
 --> 300 format(target_id, '.', name), value)
 301 else:
 302 raise Py4JError(
 Py4JJavaError: An error occurred while calling o35.parquet.
 : org.apache.spark.sql.AnalysisException: Reference 'age' is ambiguous, could 
 be: age#0L, age#3L.;
   at 
 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:279)
   at 
 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:116)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4$$anonfun$16.apply(Analyzer.scala:350)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4$$anonfun$16.apply(Analyzer.scala:350)
   at 
 org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:350)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:341)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286)
   at 
 org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
   at 
 

[jira] [Comment Edited] (SPARK-8072) Better AnalysisException for writing DataFrame with identically named columns

2015-06-23 Thread Animesh Baranawal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597271#comment-14597271
 ] 

Animesh Baranawal edited comment on SPARK-8072 at 6/23/15 9:28 AM:
---

[~rxin]
Creating a rule regarding duplicate columns causes three of the tests in 
DataFrameSuite to fail. The tests are:

1. drop column after join with duplicate columns using column reference
2. explode alias and star
3. describe 

If we are supporting only distinct columns in a dataframe, then we don't need 
the first test and the second test can be made to pass easily.
However, in the third test, the describe function uses the agg function which 
results in a dataframe with multiple columns with identical names. This can 
also be resolved with a minor tweak in the code of 'describe'.

Should the first test then be removed?


was (Author: animeshbaranawal):
[~rxin]
Creating a rule regarding duplicate columns causes three of the tests in 
DataFrameSuite to fail. The tests are:

1. drop column after join with duplicate columns using column reference
2. explode alias and star
3. describe 

If we are supporting only distinct columns in a dataframe, then we don't need 
the first test and the second test can be made to pass easily.
However, in the third test, the describe function uses the agg function which 
results in a dataframe with multiple columns with identical names. This can 
also be resolved with a minor tweak in the code of agg.

Should the first test then be removed?

 Better AnalysisException for writing DataFrame with identically named columns
 -

 Key: SPARK-8072
 URL: https://issues.apache.org/jira/browse/SPARK-8072
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Priority: Blocker

 We should check if there are duplicate columns, and if yes, throw an explicit 
 error message saying there are duplicate columns. See current error message 
 below. 
 {code}
 In [3]: df.withColumn('age', df.age)
 Out[3]: DataFrame[age: bigint, name: string, age: bigint]
 In [4]: df.withColumn('age', df.age).write.parquet('test-parquet.out')
 ---
 Py4JJavaError Traceback (most recent call last)
 <ipython-input-4-eecb85256898> in <module>()
 ----> 1 df.withColumn('age', df.age).write.parquet('test-parquet.out')
 /scratch/rxin/spark/python/pyspark/sql/readwriter.py in parquet(self, path, 
 mode)
 350 >>> df.write.parquet(os.path.join(tempfile.mkdtemp(), 'data'))
 351 
 --> 352 self._jwrite.mode(mode).parquet(path)
 353 
 354 @since(1.4)
 /Users/rxin/anaconda/lib/python2.7/site-packages/py4j-0.8.1-py2.7.egg/py4j/java_gateway.pyc
  in __call__(self, *args)
 535 answer = self.gateway_client.send_command(command)
 536 return_value = get_return_value(answer, self.gateway_client,
 --> 537 self.target_id, self.name)
 538 
 539 for temp_arg in temp_args:
 /Users/rxin/anaconda/lib/python2.7/site-packages/py4j-0.8.1-py2.7.egg/py4j/protocol.pyc
  in get_return_value(answer, gateway_client, target_id, name)
 298 raise Py4JJavaError(
 299 'An error occurred while calling {0}{1}{2}.\n'.
 --> 300 format(target_id, '.', name), value)
 301 else:
 302 raise Py4JError(
 Py4JJavaError: An error occurred while calling o35.parquet.
 : org.apache.spark.sql.AnalysisException: Reference 'age' is ambiguous, could 
 be: age#0L, age#3L.;
   at 
 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:279)
   at 
 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:116)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4$$anonfun$16.apply(Analyzer.scala:350)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4$$anonfun$16.apply(Analyzer.scala:350)
   at 
 org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:350)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:341)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286)
   at 
 

[jira] [Commented] (SPARK-8072) Better AnalysisException for writing DataFrame with identically named columns

2015-06-19 Thread Animesh Baranawal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14593180#comment-14593180
 ] 

Animesh Baranawal commented on SPARK-8072:
--

For the AnalysisException message, is it sufficient to state that duplicate 
columns are present, or does the duplicated column also need to be named?

 Better AnalysisException for writing DataFrame with identically named columns
 -

 Key: SPARK-8072
 URL: https://issues.apache.org/jira/browse/SPARK-8072
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Priority: Blocker

 We should check if there are duplicate columns, and if yes, throw an explicit 
 error message saying there are duplicate columns. See current error message 
 below. 
 {code}
 In [3]: df.withColumn('age', df.age)
 Out[3]: DataFrame[age: bigint, name: string, age: bigint]
 In [4]: df.withColumn('age', df.age).write.parquet('test-parquet.out')
 ---
 Py4JJavaError Traceback (most recent call last)
 <ipython-input-4-eecb85256898> in <module>()
 ----> 1 df.withColumn('age', df.age).write.parquet('test-parquet.out')
 /scratch/rxin/spark/python/pyspark/sql/readwriter.py in parquet(self, path, 
 mode)
 350 >>> df.write.parquet(os.path.join(tempfile.mkdtemp(), 'data'))
 351 
 --> 352 self._jwrite.mode(mode).parquet(path)
 353 
 354 @since(1.4)
 /Users/rxin/anaconda/lib/python2.7/site-packages/py4j-0.8.1-py2.7.egg/py4j/java_gateway.pyc
  in __call__(self, *args)
 535 answer = self.gateway_client.send_command(command)
 536 return_value = get_return_value(answer, self.gateway_client,
 --> 537 self.target_id, self.name)
 538 
 539 for temp_arg in temp_args:
 /Users/rxin/anaconda/lib/python2.7/site-packages/py4j-0.8.1-py2.7.egg/py4j/protocol.pyc
  in get_return_value(answer, gateway_client, target_id, name)
 298 raise Py4JJavaError(
 299 'An error occurred while calling {0}{1}{2}.\n'.
 --> 300 format(target_id, '.', name), value)
 301 else:
 302 raise Py4JError(
 Py4JJavaError: An error occurred while calling o35.parquet.
 : org.apache.spark.sql.AnalysisException: Reference 'age' is ambiguous, could 
 be: age#0L, age#3L.;
   at 
 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:279)
   at 
 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:116)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4$$anonfun$16.apply(Analyzer.scala:350)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4$$anonfun$16.apply(Analyzer.scala:350)
   at 
 org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:350)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:341)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286)
   at 
 org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:285)
   at 
 org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionUp$1(QueryPlan.scala:108)
   at 
 org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2$$anonfun$apply$2.apply(QueryPlan.scala:123)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at scala.collection.immutable.List.foreach(List.scala:318)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
   at 
 org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:122)
   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
   at 
 

[jira] [Commented] (SPARK-8120) Typos in warning message in sql/types.py

2015-06-09 Thread Animesh Baranawal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14578383#comment-14578383
 ] 

Animesh Baranawal commented on SPARK-8120:
--

Is it not working correctly?

 Typos in warning message in sql/types.py
 

 Key: SPARK-8120
 URL: https://issues.apache.org/jira/browse/SPARK-8120
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 1.4.0
Reporter: Joseph K. Bradley
Priority: Trivial
  Labels: starter

 See 
 [https://github.com/apache/spark/blob/3ba6fc515d6ea45c281bb81f648a38523be06383/python/pyspark/sql/types.py#L1093]
 Need to fix string concat + use of %






[jira] [Comment Edited] (SPARK-8072) Better AnalysisException for writing DataFrame with identically named columns

2015-06-07 Thread Animesh Baranawal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14572714#comment-14572714
 ] 

Animesh Baranawal edited comment on SPARK-8072 at 6/8/15 4:25 AM:
--

I think this can be done by checking for duplicates in the schema field name 
list of the created dataframe before returning it.
Any thoughts [~rxin]?
Can I do this?


was (Author: animeshbaranawal):
I think this can be done by checking for duplicates in the schema field name 
list of the created dataframe before returning it.
[~rxin]?
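
Something along these lines, as a sketch (assuming df is the freshly built 
DataFrame whose schema is being validated):
{code}
import org.apache.spark.sql.AnalysisException

// Collect every field name that occurs more than once in the schema,
// so the AnalysisException can name the offending columns.
val duplicated = df.schema.fieldNames
  .groupBy(identity)
  .collect { case (name, occurrences) if occurrences.length > 1 => name }
if (duplicated.nonEmpty) {
  throw new AnalysisException(
    s"duplicate column(s) in schema: ${duplicated.mkString(", ")}")
}
{code}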

 Better AnalysisException for writing DataFrame with identically named columns
 -

 Key: SPARK-8072
 URL: https://issues.apache.org/jira/browse/SPARK-8072
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Priority: Blocker

 We should check if there are duplicate columns, and if yes, throw an explicit 
 error message saying there are duplicate columns. See current error message 
 below. 
 {code}
 In [3]: df.withColumn('age', df.age)
 Out[3]: DataFrame[age: bigint, name: string, age: bigint]
 In [4]: df.withColumn('age', df.age).write.parquet('test-parquet.out')
 ---
 Py4JJavaError                             Traceback (most recent call last)
 <ipython-input-4-eecb85256898> in <module>()
 ----> 1 df.withColumn('age', df.age).write.parquet('test-parquet.out')
 /scratch/rxin/spark/python/pyspark/sql/readwriter.py in parquet(self, path, mode)
     350         >>> df.write.parquet(os.path.join(tempfile.mkdtemp(), 'data'))
     351 
 --> 352         self._jwrite.mode(mode).parquet(path)
     353 
     354     @since(1.4)
 /Users/rxin/anaconda/lib/python2.7/site-packages/py4j-0.8.1-py2.7.egg/py4j/java_gateway.pyc in __call__(self, *args)
     535         answer = self.gateway_client.send_command(command)
     536         return_value = get_return_value(answer, self.gateway_client,
 --> 537                 self.target_id, self.name)
     538 
     539         for temp_arg in temp_args:
 /Users/rxin/anaconda/lib/python2.7/site-packages/py4j-0.8.1-py2.7.egg/py4j/protocol.pyc in get_return_value(answer, gateway_client, target_id, name)
     298                 raise Py4JJavaError(
     299                     'An error occurred while calling {0}{1}{2}.\n'.
 --> 300                     format(target_id, '.', name), value)
     301             else:
     302                 raise Py4JError(
 Py4JJavaError: An error occurred while calling o35.parquet.
 : org.apache.spark.sql.AnalysisException: Reference 'age' is ambiguous, could 
 be: age#0L, age#3L.;
   at 
 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:279)
   at 
 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:116)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4$$anonfun$16.apply(Analyzer.scala:350)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4$$anonfun$16.apply(Analyzer.scala:350)
   at 
 org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:350)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:341)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286)
   at 
 org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:285)
   at 
 org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionUp$1(QueryPlan.scala:108)
   at 
 org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2$$anonfun$apply$2.apply(QueryPlan.scala:123)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at scala.collection.immutable.List.foreach(List.scala:318)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
   at 
 org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:122)
   at 

[jira] [Comment Edited] (SPARK-8072) Better AnalysisException for writing DataFrame with identically named columns

2015-06-05 Thread Animesh Baranawal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14572714#comment-14572714
 ] 

Animesh Baranawal edited comment on SPARK-8072 at 6/5/15 8:06 AM:
--

I think this can be done by checking for duplicates in the schema field-name 
list of the created DataFrame before returning it.
[~rxin]?


was (Author: animeshbaranawal):
I think this can be done by checking for duplicates in the schema field-name 
list of the created DataFrame before returning it.

 Better AnalysisException for writing DataFrame with identically named columns
 -

 Key: SPARK-8072
 URL: https://issues.apache.org/jira/browse/SPARK-8072
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Priority: Blocker

 We should check if there are duplicate columns, and if yes, throw an explicit 
 error message saying there are duplicate columns. See current error message 
 below. 
 {code}
 In [3]: df.withColumn('age', df.age)
 Out[3]: DataFrame[age: bigint, name: string, age: bigint]
 In [4]: df.withColumn('age', df.age).write.parquet('test-parquet.out')
 ---
 Py4JJavaError                             Traceback (most recent call last)
 <ipython-input-4-eecb85256898> in <module>()
 ----> 1 df.withColumn('age', df.age).write.parquet('test-parquet.out')
 /scratch/rxin/spark/python/pyspark/sql/readwriter.py in parquet(self, path, mode)
     350         >>> df.write.parquet(os.path.join(tempfile.mkdtemp(), 'data'))
     351 
 --> 352         self._jwrite.mode(mode).parquet(path)
     353 
     354     @since(1.4)
 /Users/rxin/anaconda/lib/python2.7/site-packages/py4j-0.8.1-py2.7.egg/py4j/java_gateway.pyc in __call__(self, *args)
     535         answer = self.gateway_client.send_command(command)
     536         return_value = get_return_value(answer, self.gateway_client,
 --> 537                 self.target_id, self.name)
     538 
     539         for temp_arg in temp_args:
 /Users/rxin/anaconda/lib/python2.7/site-packages/py4j-0.8.1-py2.7.egg/py4j/protocol.pyc in get_return_value(answer, gateway_client, target_id, name)
     298                 raise Py4JJavaError(
     299                     'An error occurred while calling {0}{1}{2}.\n'.
 --> 300                     format(target_id, '.', name), value)
     301             else:
     302                 raise Py4JError(
 Py4JJavaError: An error occurred while calling o35.parquet.
 : org.apache.spark.sql.AnalysisException: Reference 'age' is ambiguous, could 
 be: age#0L, age#3L.;
   at 
 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:279)
   at 
 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:116)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4$$anonfun$16.apply(Analyzer.scala:350)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4$$anonfun$16.apply(Analyzer.scala:350)
   at 
 org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:350)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:341)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286)
   at 
 org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:285)
   at 
 org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionUp$1(QueryPlan.scala:108)
   at 
 org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2$$anonfun$apply$2.apply(QueryPlan.scala:123)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at scala.collection.immutable.List.foreach(List.scala:318)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
   at 
 org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:122)
   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
   at 

[jira] [Issue Comment Deleted] (SPARK-8072) Better AnalysisException for writing DataFrame with identically named columns

2015-06-04 Thread Animesh Baranawal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Animesh Baranawal updated SPARK-8072:
-
Comment: was deleted

(was: I was thinking that maybe we could use __repr__ on the column list to 
extract their names and then check whether there are any duplicates in the 
list of names obtained.)

 Better AnalysisException for writing DataFrame with identically named columns
 -

 Key: SPARK-8072
 URL: https://issues.apache.org/jira/browse/SPARK-8072
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Priority: Blocker

 We should check if there are duplicate columns, and if yes, throw an explicit 
 error message saying there are duplicate columns. See current error message 
 below. 
 {code}
 In [3]: df.withColumn('age', df.age)
 Out[3]: DataFrame[age: bigint, name: string, age: bigint]
 In [4]: df.withColumn('age', df.age).write.parquet('test-parquet.out')
 ---
 Py4JJavaError                             Traceback (most recent call last)
 <ipython-input-4-eecb85256898> in <module>()
 ----> 1 df.withColumn('age', df.age).write.parquet('test-parquet.out')
 /scratch/rxin/spark/python/pyspark/sql/readwriter.py in parquet(self, path, mode)
     350         >>> df.write.parquet(os.path.join(tempfile.mkdtemp(), 'data'))
     351 
 --> 352         self._jwrite.mode(mode).parquet(path)
     353 
     354     @since(1.4)
 /Users/rxin/anaconda/lib/python2.7/site-packages/py4j-0.8.1-py2.7.egg/py4j/java_gateway.pyc in __call__(self, *args)
     535         answer = self.gateway_client.send_command(command)
     536         return_value = get_return_value(answer, self.gateway_client,
 --> 537                 self.target_id, self.name)
     538 
     539         for temp_arg in temp_args:
 /Users/rxin/anaconda/lib/python2.7/site-packages/py4j-0.8.1-py2.7.egg/py4j/protocol.pyc in get_return_value(answer, gateway_client, target_id, name)
     298                 raise Py4JJavaError(
     299                     'An error occurred while calling {0}{1}{2}.\n'.
 --> 300                     format(target_id, '.', name), value)
     301             else:
     302                 raise Py4JError(
 Py4JJavaError: An error occurred while calling o35.parquet.
 : org.apache.spark.sql.AnalysisException: Reference 'age' is ambiguous, could 
 be: age#0L, age#3L.;
   at 
 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:279)
   at 
 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:116)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4$$anonfun$16.apply(Analyzer.scala:350)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4$$anonfun$16.apply(Analyzer.scala:350)
   at 
 org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:350)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:341)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286)
   at 
 org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:285)
   at 
 org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionUp$1(QueryPlan.scala:108)
   at 
 org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2$$anonfun$apply$2.apply(QueryPlan.scala:123)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at scala.collection.immutable.List.foreach(List.scala:318)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
   at 
 org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:122)
   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
   at 
 

[jira] [Commented] (SPARK-8072) Better AnalysisException for writing DataFrame with identically named columns

2015-06-04 Thread Animesh Baranawal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14572627#comment-14572627
 ] 

Animesh Baranawal commented on SPARK-8072:
--

I was thinking that maybe we could use __repr__ on the column list to extract 
their names and then check whether there are any duplicates in the list of 
names obtained.
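
A rough illustration of that idea, assuming each Column's repr renders like 
"Column<age>" (the format varies across versions, so this is fragile compared 
to reading names from df.schema):

{code}
import re
from collections import Counter

def names_from_reprs(cols):
    # Hypothetical sketch: parse the column name out of each repr string
    return [re.match(r"Column<(.+)>", repr(c)).group(1) for c in cols]

def has_duplicates(cols):
    return any(count > 1 for count in Counter(names_from_reprs(cols)).values())
{code}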

 Better AnalysisException for writing DataFrame with identically named columns
 -

 Key: SPARK-8072
 URL: https://issues.apache.org/jira/browse/SPARK-8072
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Priority: Blocker

 We should check if there are duplicate columns, and if yes, throw an explicit 
 error message saying there are duplicate columns. See current error message 
 below. 
 {code}
 In [3]: df.withColumn('age', df.age)
 Out[3]: DataFrame[age: bigint, name: string, age: bigint]
 In [4]: df.withColumn('age', df.age).write.parquet('test-parquet.out')
 ---
 Py4JJavaError                             Traceback (most recent call last)
 <ipython-input-4-eecb85256898> in <module>()
 ----> 1 df.withColumn('age', df.age).write.parquet('test-parquet.out')
 /scratch/rxin/spark/python/pyspark/sql/readwriter.py in parquet(self, path, mode)
     350         >>> df.write.parquet(os.path.join(tempfile.mkdtemp(), 'data'))
     351 
 --> 352         self._jwrite.mode(mode).parquet(path)
     353 
     354     @since(1.4)
 /Users/rxin/anaconda/lib/python2.7/site-packages/py4j-0.8.1-py2.7.egg/py4j/java_gateway.pyc in __call__(self, *args)
     535         answer = self.gateway_client.send_command(command)
     536         return_value = get_return_value(answer, self.gateway_client,
 --> 537                 self.target_id, self.name)
     538 
     539         for temp_arg in temp_args:
 /Users/rxin/anaconda/lib/python2.7/site-packages/py4j-0.8.1-py2.7.egg/py4j/protocol.pyc in get_return_value(answer, gateway_client, target_id, name)
     298                 raise Py4JJavaError(
     299                     'An error occurred while calling {0}{1}{2}.\n'.
 --> 300                     format(target_id, '.', name), value)
     301             else:
     302                 raise Py4JError(
 Py4JJavaError: An error occurred while calling o35.parquet.
 : org.apache.spark.sql.AnalysisException: Reference 'age' is ambiguous, could 
 be: age#0L, age#3L.;
   at 
 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:279)
   at 
 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:116)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4$$anonfun$16.apply(Analyzer.scala:350)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4$$anonfun$16.apply(Analyzer.scala:350)
   at 
 org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:350)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:341)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286)
   at 
 org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:285)
   at 
 org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionUp$1(QueryPlan.scala:108)
   at 
 org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2$$anonfun$apply$2.apply(QueryPlan.scala:123)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at scala.collection.immutable.List.foreach(List.scala:318)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
   at 
 org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:122)
   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
   at 
 

[jira] [Commented] (SPARK-8072) Better AnalysisException for writing DataFrame with identically named columns

2015-06-04 Thread Animesh Baranawal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14572714#comment-14572714
 ] 

Animesh Baranawal commented on SPARK-8072:
--

I think this can be done by checking for duplicates in the schema field-name 
list of the created DataFrame before returning it.

 Better AnalysisException for writing DataFrame with identically named columns
 -

 Key: SPARK-8072
 URL: https://issues.apache.org/jira/browse/SPARK-8072
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Priority: Blocker

 We should check if there are duplicate columns, and if yes, throw an explicit 
 error message saying there are duplicate columns. See current error message 
 below. 
 {code}
 In [3]: df.withColumn('age', df.age)
 Out[3]: DataFrame[age: bigint, name: string, age: bigint]
 In [4]: df.withColumn('age', df.age).write.parquet('test-parquet.out')
 ---
 Py4JJavaError                             Traceback (most recent call last)
 <ipython-input-4-eecb85256898> in <module>()
 ----> 1 df.withColumn('age', df.age).write.parquet('test-parquet.out')
 /scratch/rxin/spark/python/pyspark/sql/readwriter.py in parquet(self, path, mode)
     350         >>> df.write.parquet(os.path.join(tempfile.mkdtemp(), 'data'))
     351 
 --> 352         self._jwrite.mode(mode).parquet(path)
     353 
     354     @since(1.4)
 /Users/rxin/anaconda/lib/python2.7/site-packages/py4j-0.8.1-py2.7.egg/py4j/java_gateway.pyc in __call__(self, *args)
     535         answer = self.gateway_client.send_command(command)
     536         return_value = get_return_value(answer, self.gateway_client,
 --> 537                 self.target_id, self.name)
     538 
     539         for temp_arg in temp_args:
 /Users/rxin/anaconda/lib/python2.7/site-packages/py4j-0.8.1-py2.7.egg/py4j/protocol.pyc in get_return_value(answer, gateway_client, target_id, name)
     298                 raise Py4JJavaError(
     299                     'An error occurred while calling {0}{1}{2}.\n'.
 --> 300                     format(target_id, '.', name), value)
     301             else:
     302                 raise Py4JError(
 Py4JJavaError: An error occurred while calling o35.parquet.
 : org.apache.spark.sql.AnalysisException: Reference 'age' is ambiguous, could 
 be: age#0L, age#3L.;
   at 
 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:279)
   at 
 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:116)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4$$anonfun$16.apply(Analyzer.scala:350)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4$$anonfun$16.apply(Analyzer.scala:350)
   at 
 org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:350)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:341)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286)
   at 
 org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:285)
   at 
 org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionUp$1(QueryPlan.scala:108)
   at 
 org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2$$anonfun$apply$2.apply(QueryPlan.scala:123)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at scala.collection.immutable.List.foreach(List.scala:318)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
   at 
 org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:122)
   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
   at 
 scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
   at 
 

[jira] [Issue Comment Deleted] (SPARK-7980) Support SQLContext.range(end)

2015-06-03 Thread Animesh Baranawal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Animesh Baranawal updated SPARK-7980:
-
Comment: was deleted

(was: Regarding the Python support for range, I am unable to verify the 
behavior in PySpark...
Even for the pre-defined range function in context.py, when I type the 
following in ./bin/pyspark:

sqlContext.range(1, 7, 2).collect() 

I get the error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: range() takes exactly 2 arguments (4 given))

 Support SQLContext.range(end)
 -

 Key: SPARK-7980
 URL: https://issues.apache.org/jira/browse/SPARK-7980
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin

 SQLContext.range should also allow only specifying the end position, similar 
 to Python's own range.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7980) Support SQLContext.range(end)

2015-06-02 Thread Animesh Baranawal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568720#comment-14568720
 ] 

Animesh Baranawal commented on SPARK-7980:
--

Regarding the Python support for range, I am unable to verify the behavior in 
PySpark...
Even for the pre-defined range function in context.py, when I type the 
following in ./bin/pyspark:

sqlContext.range(1, 7, 2).collect() 

I get the error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: range() takes exactly 2 arguments (4 given)
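
The TypeError suggests that build only exposed a two-argument (self, end) 
signature. For what it is worth, a minimal sketch of the end-only overload the 
issue asks for, written as plain Python rather than Spark's actual 
implementation: with a single argument, treat it as the exclusive end, as the 
builtin range does.

{code}
def sql_range(start, end=None, step=1):
    # single-argument form: interpret the argument as the end, start at 0
    if end is None:
        start, end = 0, start
    return list(range(start, end, step))

print(sql_range(5))        # [0, 1, 2, 3, 4]
print(sql_range(1, 7, 2))  # [1, 3, 5]
{code}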

 Support SQLContext.range(end)
 -

 Key: SPARK-7980
 URL: https://issues.apache.org/jira/browse/SPARK-7980
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin

 SQLContext.range should also allow only specifying the end position, similar 
 to Python's own range.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7980) Support SQLContext.range(end)

2015-06-02 Thread Animesh Baranawal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568722#comment-14568722
 ] 

Animesh Baranawal commented on SPARK-7980:
--

Regarding the Python support for range, I am unable to verify the behavior in 
PySpark...
Even for the pre-defined range function in context.py, when I type the 
following in ./bin/pyspark:

sqlContext.range(1, 7, 2).collect() 

I get the error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: range() takes exactly 2 arguments (4 given)

 Support SQLContext.range(end)
 -

 Key: SPARK-7980
 URL: https://issues.apache.org/jira/browse/SPARK-7980
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin

 SQLContext.range should also allow only specifying the end position, similar 
 to Python's own range.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-7980) Support SQLContext.range(end)

2015-06-02 Thread Animesh Baranawal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Animesh Baranawal updated SPARK-7980:
-
Comment: was deleted

(was: Regarding the Python support for range, I am unable to verify the 
behavior in PySpark...
Even for the pre-defined range function in context.py, when I type the 
following in ./bin/pyspark:

sqlContext.range(1, 7, 2).collect() 

I get the error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: range() takes exactly 2 arguments (4 given))

 Support SQLContext.range(end)
 -

 Key: SPARK-7980
 URL: https://issues.apache.org/jira/browse/SPARK-7980
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin

 SQLContext.range should also allow only specifying the end position, similar 
 to Python's own range.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7828) LACK OF EXAMPLES IN SPARK SQL

2015-05-22 Thread Animesh Baranawal (JIRA)
Animesh Baranawal created SPARK-7828:


 Summary: LACK OF EXAMPLES IN SPARK SQL
 Key: SPARK-7828
 URL: https://issues.apache.org/jira/browse/SPARK-7828
 Project: Spark
  Issue Type: Improvement
  Components: Examples, SQL
Reporter: Animesh Baranawal
Priority: Minor


There are insufficient examples in Spark SQL regarding JSON datasets, schema 
merging and JDBC.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-7828) LACK OF EXAMPLES IN SPARK SQL

2015-05-22 Thread Animesh Baranawal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Animesh Baranawal closed SPARK-7828.


 LACK OF EXAMPLES IN SPARK SQL
 -

 Key: SPARK-7828
 URL: https://issues.apache.org/jira/browse/SPARK-7828
 Project: Spark
  Issue Type: Improvement
  Components: Examples, SQL
Reporter: Animesh Baranawal
Priority: Minor

 There are insufficient examples in Spark SQL regarding JSON datasets, schema 
 merging and JDBC.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7273) The SQLContext.jsonFile() api has a problem when load a format json file?

2015-05-11 Thread Animesh Baranawal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14537691#comment-14537691
 ] 

Animesh Baranawal commented on SPARK-7273:
--

I get no error compiling your code... could you post some more details, or 
perhaps raise this on the mailing list?

 The SQLContext.jsonFile() api has a problem when load a format json file?
 -

 Key: SPARK-7273
 URL: https://issues.apache.org/jira/browse/SPARK-7273
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.1
Reporter: steven
Priority: Minor

 my code is as follows:
  val df = sqlContext.jsonFile("test.json");
 test.json content is:
  { "name": "steven",
 "age" : 20
 }
 the jsonFile invocation then throws the following exception:
   java.lang.RuntimeException: Failed to parse record "age" : 20}. 
 Please make sure that each line of the file (or each string in the RDD) is a 
 valid JSON object or an array of JSON objects.
   at scala.sys.package$.error(package.scala:27)
   at 
 org.apache.spark.sql.json.JsonRDD$$anonfun$parseJson$1$$anonfun$apply$2.apply(JsonRDD.scala:313)
   at 
 org.apache.spark.sql.json.JsonRDD$$anonfun$parseJson$1$$anonfun$apply$2.apply(JsonRDD.scala:307)
 is it a bug?
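
This looks like expected behavior rather than a bug: jsonFile in these 
releases reads line-delimited JSON, so each record has to sit on a single 
line, as the exception message says. A short sketch of the workaround in a 
pyspark shell (assuming the predefined sqlContext):

{code}
# test.json rewritten so the whole record is one line:
#   {"name": "steven", "age": 20}
df = sqlContext.jsonFile("test.json")
df.printSchema()
# root
#  |-- age: long (nullable = true)
#  |-- name: string (nullable = true)
{code}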



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org