[jira] [Updated] (SPARK-13821) TPC-DS Query 20 fails to compile

2016-03-11 Thread Ram Sriharsha (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ram Sriharsha updated SPARK-13821:
--
Description: 
TPC-DS Query 20 fails to compile with the following error message:
{noformat}
Parsing error: NoViableAltException(10@[127:1: selectItem : ( ( tableAllColumns 
)=> tableAllColumns -> ^( TOK_SELEXPR tableAllColumns ) | ( expression ( ( ( 
KW_AS )? identifier ) | ( KW_AS LPAREN identifier ( COMMA identifier )* RPAREN 
) )? ) -> ^( TOK_SELEXPR expression ( identifier )* ) );])
at 
org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser$DFA17.specialStateTransition(HiveParser_SelectClauseParser.java:11835)
at org.antlr.runtime.DFA.predict(DFA.java:80)
at 
org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectItem(HiveParser_SelectClauseParser.java:2853)
at 
org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectList(HiveParser_SelectClauseParser.java:1401)
at 
org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectClause(HiveParser_SelectClauseParser.java:1128)
Parsing error: NoViableAltException(10@[127:1: selectItem : ( ( tableAllColumns 
)=> tableAllColumns -> ^( TOK_SELEXPR tableAllColumns ) | ( expression ( ( ( 
KW_AS )? identifier ) | ( KW_AS LPAREN identifier ( COMMA identifier )* RPAREN 
) )? ) -> ^( TOK_SELEXPR expression ( identifier )* ) );])
at 
org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser$DFA17.specialStateTransition(HiveParser_SelectClauseParser.java:11835)
at org.antlr.runtime.DFA.predict(DFA.java:80)
at 
org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectItem(HiveParser_SelectClauseParser.java:2853)
at 
org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectList(HiveParser_SelectClauseParser.java:1401)
at 
org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectClause(HiveParser_SelectClauseParser.java:1128)

{noformat}
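For reference, a minimal reproduction sketch (not part of the original report; it assumes the
TPC-DS tables are already registered in the Hive metastore and that query20.sql is a placeholder
path to a local copy of the query text):

{code}
import scala.io.Source
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("tpcds-q20-repro"))
val hiveContext = new HiveContext(sc)

// query20.sql is a placeholder path to a local copy of TPC-DS query 20.
val query = Source.fromFile("query20.sql").mkString
// On 1.6.1 the HiveQL parser rejects the query with the NoViableAltException shown above.
hiveContext.sql(query).show()
{code}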

  was:
TPC-DS Query 20 fails to compile with the following error message:
{format}
Parsing error: NoViableAltException(10@[127:1: selectItem : ( ( tableAllColumns 
)=> tableAllColumns -> ^( TOK_SELEXPR tableAllColumns ) | ( expression ( ( ( 
KW_AS )? identifier ) | ( KW_AS LPAREN identifier ( COMMA identifier )* RPAREN 
) )? ) -> ^( TOK_SELEXPR expression ( identifier )* ) );])
at 
org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser$DFA17.specialStateTransition(HiveParser_SelectClauseParser.java:11835)
at org.antlr.runtime.DFA.predict(DFA.java:80)
at 
org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectItem(HiveParser_SelectClauseParser.java:2853)
at 
org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectList(HiveParser_SelectClauseParser.java:1401)
at 
org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectClause(HiveParser_SelectClauseParser.java:1128)
Parsing error: NoViableAltException(10@[127:1: selectItem : ( ( tableAllColumns 
)=> tableAllColumns -> ^( TOK_SELEXPR tableAllColumns ) | ( expression ( ( ( 
KW_AS )? identifier ) | ( KW_AS LPAREN identifier ( COMMA identifier )* RPAREN 
) )? ) -> ^( TOK_SELEXPR expression ( identifier )* ) );])
at 
org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser$DFA17.specialStateTransition(HiveParser_SelectClauseParser.java:11835)
at org.antlr.runtime.DFA.predict(DFA.java:80)
at 
org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectItem(HiveParser_SelectClauseParser.java:2853)
at 
org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectList(HiveParser_SelectClauseParser.java:1401)
at 
org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectClause(HiveParser_SelectClauseParser.java:1128)

{format}


> TPC-DS Query 20 fails to compile
> 
>
> Key: SPARK-13821
> URL: https://issues.apache.org/jira/browse/SPARK-13821
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
> Environment: Red Hat Enterprise Linux Server release 7.1 (Maipo)
> Linux bigaperf116.svl.ibm.com 3.10.0-229.el7.x86_64 #1 SMP Thu Jan 29 
> 18:37:38 EST 2015 x86_64 x86_64 x86_64 GNU/Linux
>Reporter: Roy Cecil
>
> TPC-DS Query 20 fails to compile with the following error message:
> {noformat}
> Parsing error: NoViableAltException(10@[127:1: selectItem : ( ( 
> tableAllColumns )=> tableAllColumns -> ^( TOK_SELEXPR tableAllColumns ) | ( 
> expression ( ( ( KW_AS )? identifier ) | ( KW_AS LPAREN identifier ( COMMA 
> identifier )* RPAREN ) )? ) -> ^( TOK_SELEXPR expression ( identifier )* ) 
> );])
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser$DFA17.specialStateTransition(HiveParser_SelectClauseParser.java:11835)
> at 

[jira] [Updated] (SPARK-13821) TPC-DS Query 20 fails to compile

2016-03-11 Thread Ram Sriharsha (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ram Sriharsha updated SPARK-13821:
--
Description: 
TPC-DS Query 20 fails to compile with the following error message:
{format}
Parsing error: NoViableAltException(10@[127:1: selectItem : ( ( tableAllColumns 
)=> tableAllColumns -> ^( TOK_SELEXPR tableAllColumns ) | ( expression ( ( ( 
KW_AS )? identifier ) | ( KW_AS LPAREN identifier ( COMMA identifier )* RPAREN 
) )? ) -> ^( TOK_SELEXPR expression ( identifier )* ) );])
at 
org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser$DFA17.specialStateTransition(HiveParser_SelectClauseParser.java:11835)
at org.antlr.runtime.DFA.predict(DFA.java:80)
at 
org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectItem(HiveParser_SelectClauseParser.java:2853)
at 
org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectList(HiveParser_SelectClauseParser.java:1401)
at 
org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectClause(HiveParser_SelectClauseParser.java:1128)
Parsing error: NoViableAltException(10@[127:1: selectItem : ( ( tableAllColumns 
)=> tableAllColumns -> ^( TOK_SELEXPR tableAllColumns ) | ( expression ( ( ( 
KW_AS )? identifier ) | ( KW_AS LPAREN identifier ( COMMA identifier )* RPAREN 
) )? ) -> ^( TOK_SELEXPR expression ( identifier )* ) );])
at 
org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser$DFA17.specialStateTransition(HiveParser_SelectClauseParser.java:11835)
at org.antlr.runtime.DFA.predict(DFA.java:80)
at 
org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectItem(HiveParser_SelectClauseParser.java:2853)
at 
org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectList(HiveParser_SelectClauseParser.java:1401)
at 
org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectClause(HiveParser_SelectClauseParser.java:1128)

{format}

  was:
TPC-DS Query 20 fails to compile with the following error message:

Parsing error: NoViableAltException(10@[127:1: selectItem : ( ( tableAllColumns 
)=> tableAllColumns -> ^( TOK_SELEXPR tableAllColumns ) | ( expression ( ( ( 
KW_AS )? identifier ) | ( KW_AS LPAREN identifier ( COMMA identifier )* RPAREN 
) )? ) -> ^( TOK_SELEXPR expression ( identifier )* ) );])
at 
org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser$DFA17.specialStateTransition(HiveParser_SelectClauseParser.java:11835)
at org.antlr.runtime.DFA.predict(DFA.java:80)
at 
org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectItem(HiveParser_SelectClauseParser.java:2853)
at 
org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectList(HiveParser_SelectClauseParser.java:1401)
at 
org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectClause(HiveParser_SelectClauseParser.java:1128)
Parsing error: NoViableAltException(10@[127:1: selectItem : ( ( tableAllColumns 
)=> tableAllColumns -> ^( TOK_SELEXPR tableAllColumns ) | ( expression ( ( ( 
KW_AS )? identifier ) | ( KW_AS LPAREN identifier ( COMMA identifier )* RPAREN 
) )? ) -> ^( TOK_SELEXPR expression ( identifier )* ) );])
at 
org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser$DFA17.specialStateTransition(HiveParser_SelectClauseParser.java:11835)
at org.antlr.runtime.DFA.predict(DFA.java:80)
at 
org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectItem(HiveParser_SelectClauseParser.java:2853)
at 
org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectList(HiveParser_SelectClauseParser.java:1401)
at 
org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectClause(HiveParser_SelectClauseParser.java:1128)




> TPC-DS Query 20 fails to compile
> 
>
> Key: SPARK-13821
> URL: https://issues.apache.org/jira/browse/SPARK-13821
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
> Environment: Red Hat Enterprise Linux Server release 7.1 (Maipo)
> Linux bigaperf116.svl.ibm.com 3.10.0-229.el7.x86_64 #1 SMP Thu Jan 29 
> 18:37:38 EST 2015 x86_64 x86_64 x86_64 GNU/Linux
>Reporter: Roy Cecil
>
> TPC-DS Query 20 fails to compile with the following error message:
> {format}
> Parsing error: NoViableAltException(10@[127:1: selectItem : ( ( 
> tableAllColumns )=> tableAllColumns -> ^( TOK_SELEXPR tableAllColumns ) | ( 
> expression ( ( ( KW_AS )? identifier ) | ( KW_AS LPAREN identifier ( COMMA 
> identifier )* RPAREN ) )? ) -> ^( TOK_SELEXPR expression ( identifier )* ) 
> );])
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser$DFA17.specialStateTransition(HiveParser_SelectClauseParser.java:11835)
> at 

[jira] [Resolved] (SPARK-13794) Rename DataFrameWriter.stream DataFrameWriter.startStream

2016-03-10 Thread Ram Sriharsha (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ram Sriharsha resolved SPARK-13794.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

> Rename DataFrameWriter.stream DataFrameWriter.startStream
> -
>
> Key: SPARK-13794
> URL: https://issues.apache.org/jira/browse/SPARK-13794
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.0.0
>
>
> This makes it more obvious with the verb "start" that we are actually 
> starting some execution.






[jira] [Updated] (SPARK-13795) ClassCast Exception while attempting to show() a DataFrame

2016-03-10 Thread Ram Sriharsha (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ram Sriharsha updated SPARK-13795:
--
Description: 
The DataFrame schema (from printSchema()) is as follows:

allDataJoined.printSchema() 

{noformat}

 |-- eventType: string (nullable = true)
 |-- itemId: string (nullable = true)
 |-- productId: string (nullable = true)
 |-- productVersion: string (nullable = true)
 |-- servicedBy: string (nullable = true)
 |-- ACCOUNT_NAME: string (nullable = true)
 |-- CONTENTGROUPID: string (nullable = true)
 |-- PRODUCT_ID: string (nullable = true)
 |-- PROFILE_ID: string (nullable = true)
 |-- SALESADVISEREMAIL: string (nullable = true)
 |-- businessName: string (nullable = true)
 |-- contentGroupId: string (nullable = true)
 |-- salesAdviserName: string (nullable = true)
 |-- salesAdviserPhone: string (nullable = true)

{noformat}

There is NO column with any datatype other than String. There was previously an 
inferred column of type long, which was dropped:

{code}
DataFrame allDataJoined = whiteEventJoinedWithReference
    .drop(rliDataFrame.col("occurredAtDate"));
allDataJoined.printSchema(); // prints the schema shown above
allDataJoined.show();        // throws the exception shown below
{code}

Calling show() then throws the following exception:

{noformat}

java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.Integer
at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:106)
at scala.math.Ordering$Int$.compare(Ordering.scala:256)
at scala.math.Ordering$class.gt(Ordering.scala:97)
at scala.math.Ordering$Int$.gt(Ordering.scala:256)
at 
org.apache.spark.sql.catalyst.expressions.GreaterThan.nullSafeEval(predicates.scala:457)
at 
org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:383)
at 
org.apache.spark.sql.catalyst.expressions.And.eval(predicates.scala:238)
at 
org.apache.spark.sql.catalyst.expressions.InterpretedPredicate$$anonfun$create$2.apply(predicates.scala:38)
at 
org.apache.spark.sql.catalyst.expressions.InterpretedPredicate$$anonfun$create$2.apply(predicates.scala:38)
at 
org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$prunePartitions$1.apply(DataSourceStrategy.scala:257)
at 
org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$prunePartitions$1.apply(DataSourceStrategy.scala:257)
at 
scala.collection.TraversableLike$$anonfun$filter$1.apply(TraversableLike.scala:264)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at 
scala.collection.TraversableLike$class.filter(TraversableLike.scala:263)
at scala.collection.AbstractTraversable.filter(Traversable.scala:105)
at 
org.apache.spark.sql.execution.datasources.DataSourceStrategy$.prunePartitions(DataSourceStrategy.scala:257)
at 
org.apache.spark.sql.execution.datasources.DataSourceStrategy$.apply(DataSourceStrategy.scala:82)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
at 
org.apache.spark.sql.execution.SparkStrategies$EquiJoinSelection$.makeBroadcastHashJoin(SparkStrategies.scala:88)
at 
org.apache.spark.sql.execution.SparkStrategies$EquiJoinSelection$.apply(SparkStrategies.scala:97)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
at 
org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:336)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
at 
org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:349)
  

[jira] [Updated] (SPARK-13795) ClassCast Exception while attempting to show() a DataFrame

2016-03-10 Thread Ram Sriharsha (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ram Sriharsha updated SPARK-13795:
--
Description: 
The DataFrame schema (from printSchema()) is as follows:

allDataJoined.printSchema() 

{noformat}

 |-- eventType: string (nullable = true)
 |-- itemId: string (nullable = true)
 |-- productId: string (nullable = true)
 |-- productVersion: string (nullable = true)
 |-- servicedBy: string (nullable = true)
 |-- ACCOUNT_NAME: string (nullable = true)
 |-- CONTENTGROUPID: string (nullable = true)
 |-- PRODUCT_ID: string (nullable = true)
 |-- PROFILE_ID: string (nullable = true)
 |-- SALESADVISEREMAIL: string (nullable = true)
 |-- businessName: string (nullable = true)
 |-- contentGroupId: string (nullable = true)
 |-- salesAdviserName: string (nullable = true)
 |-- salesAdviserPhone: string (nullable = true)

{noformat}

There is NO column with any datatype other than String. There was previously an 
inferred column of type long, which was dropped:

DataFrame allDataJoined = whiteEventJoinedWithReference
    .drop(rliDataFrame.col("occurredAtDate"));
allDataJoined.printSchema(); // prints the schema shown above

Calling allDataJoined.show() then throws the following exception:

java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.Integer
at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:106)
at scala.math.Ordering$Int$.compare(Ordering.scala:256)
at scala.math.Ordering$class.gt(Ordering.scala:97)
at scala.math.Ordering$Int$.gt(Ordering.scala:256)
at 
org.apache.spark.sql.catalyst.expressions.GreaterThan.nullSafeEval(predicates.scala:457)
at 
org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:383)
at 
org.apache.spark.sql.catalyst.expressions.And.eval(predicates.scala:238)
at 
org.apache.spark.sql.catalyst.expressions.InterpretedPredicate$$anonfun$create$2.apply(predicates.scala:38)
at 
org.apache.spark.sql.catalyst.expressions.InterpretedPredicate$$anonfun$create$2.apply(predicates.scala:38)
at 
org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$prunePartitions$1.apply(DataSourceStrategy.scala:257)
at 
org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$prunePartitions$1.apply(DataSourceStrategy.scala:257)
at 
scala.collection.TraversableLike$$anonfun$filter$1.apply(TraversableLike.scala:264)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at 
scala.collection.TraversableLike$class.filter(TraversableLike.scala:263)
at scala.collection.AbstractTraversable.filter(Traversable.scala:105)
at 
org.apache.spark.sql.execution.datasources.DataSourceStrategy$.prunePartitions(DataSourceStrategy.scala:257)
at 
org.apache.spark.sql.execution.datasources.DataSourceStrategy$.apply(DataSourceStrategy.scala:82)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
at 
org.apache.spark.sql.execution.SparkStrategies$EquiJoinSelection$.makeBroadcastHashJoin(SparkStrategies.scala:88)
at 
org.apache.spark.sql.execution.SparkStrategies$EquiJoinSelection$.apply(SparkStrategies.scala:97)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
at 
org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:336)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
at 
org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:349)
at 

[jira] [Created] (SPARK-10623) turning on predicate pushdown throws nonsuch element exception when RDD is empty

2015-09-15 Thread Ram Sriharsha (JIRA)
Ram Sriharsha created SPARK-10623:
-

 Summary: turning on predicate pushdown throws nonsuch element 
exception when RDD is empty 
 Key: SPARK-10623
 URL: https://issues.apache.org/jira/browse/SPARK-10623
 Project: Spark
  Issue Type: Bug
Reporter: Ram Sriharsha


Turning on predicate pushdown for ORC datasources results in a 
NoSuchElementException:

scala> val df = sqlContext.sql("SELECT name FROM people WHERE age < 15")
df: org.apache.spark.sql.DataFrame = [name: string]

scala> sqlContext.setConf("spark.sql.orc.filterPushdown", "true")

scala> df.explain
== Physical Plan ==
java.util.NoSuchElementException

Disabling the pushdown makes things work again:

scala> sqlContext.setConf("spark.sql.orc.filterPushdown", "false")

scala> df.explain
== Physical Plan ==
Project [name#6]
 Filter (age#7 < 15)
  Scan 
OrcRelation[file:/home/mydir/spark-1.5.0-SNAPSHOT/test/people][name#6,age#7]






[jira] [Updated] (SPARK-10623) turning on predicate pushdown throws nonsuch element exception when RDD is empty

2015-09-15 Thread Ram Sriharsha (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ram Sriharsha updated SPARK-10623:
--
Assignee: Zhan Zhang

> turning on predicate pushdown throws nonsuch element exception when RDD is 
> empty 
> -
>
> Key: SPARK-10623
> URL: https://issues.apache.org/jira/browse/SPARK-10623
> Project: Spark
>  Issue Type: Bug
>Reporter: Ram Sriharsha
>Assignee: Zhan Zhang
>
> Turning on predicate pushdown for ORC datasources results in a 
> NoSuchElementException:
> scala> val df = sqlContext.sql("SELECT name FROM people WHERE age < 15")
> df: org.apache.spark.sql.DataFrame = [name: string]
> scala> sqlContext.setConf("spark.sql.orc.filterPushdown", "true")
> scala> df.explain
> == Physical Plan ==
> java.util.NoSuchElementException
> Disabling the pushdown makes things work again:
> scala> sqlContext.setConf("spark.sql.orc.filterPushdown", "false")
> scala> df.explain
> == Physical Plan ==
> Project [name#6]
>  Filter (age#7 < 15)
>   Scan 
> OrcRelation[file:/home/mydir/spark-1.5.0-SNAPSHOT/test/people][name#6,age#7]






[jira] [Commented] (SPARK-9670) Examples: Check for new APIs requiring example code

2015-09-11 Thread Ram Sriharsha (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14741196#comment-14741196
 ] 

Ram Sriharsha commented on SPARK-9670:
--

Hi Joseph, I don't have any more items to add to this list, but SPARK-7546 is 
still open. I am not happy with the example I have there; I will work on it a 
bit and close it out for the next release. For now, can we keep this open, 
since it has SPARK-7546 as a dependency?

> Examples: Check for new APIs requiring example code
> ---
>
> Key: SPARK-9670
> URL: https://issues.apache.org/jira/browse/SPARK-9670
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Ram Sriharsha
>Priority: Minor
>
> Audit list of new features added to MLlib, and see which major items are 
> missing example code (in the examples folder).  We do not need examples for 
> everything, only for major items such as new ML algorithms.
> For any such items:
> * Create a JIRA for that feature, and assign it to the author of the feature 
> (or yourself if interested).
> * Link it to (a) the original JIRA which introduced that feature ("related 
> to") and (b) to this JIRA ("requires").






[jira] [Assigned] (SPARK-10251) Some internal spark classes are not registered with kryo

2015-08-26 Thread Ram Sriharsha (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ram Sriharsha reassigned SPARK-10251:
-

Assignee: Ram Sriharsha

 Some internal spark classes are not registered with kryo
 

 Key: SPARK-10251
 URL: https://issues.apache.org/jira/browse/SPARK-10251
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.1
Reporter: Soren Macbeth
Assignee: Ram Sriharsha

 When running a job using kryo serialization with 
 `spark.kryo.registrationRequired=true`, some internal classes are not 
 registered, causing the job to die. This is still a problem when the setting 
 is false (the default), because unregistered classes make serialized objects 
 much more expensive to store in memory or on disk, in both runtime and 
 storage space.
 {code}
 15/08/25 20:28:21 WARN spark.scheduler.TaskSetManager: Lost task 0.0 in stage 
 0.0 (TID 0, a.b.c.d): java.lang.IllegalArgumentException: Class is not 
 registered: scala.Tuple2[]
 Note: To register this class use: kryo.register(scala.Tuple2[].class);
 at com.esotericsoftware.kryo.Kryo.getRegistration(Kryo.java:442)
 at 
 com.esotericsoftware.kryo.util.DefaultClassResolver.writeClass(DefaultClassResolver.java:79)
 at com.esotericsoftware.kryo.Kryo.writeClass(Kryo.java:472)
 at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:565)
 at 
 org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:250)
 at 
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:236)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)
 {code}






[jira] [Commented] (SPARK-10251) Some internal spark classes are not registered with kryo

2015-08-26 Thread Ram Sriharsha (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14714161#comment-14714161
 ] 

Ram Sriharsha commented on SPARK-10251:
---

As far as I can see, this happens from Spark 1.2 onward; I haven't gone back yet 
to check whether it was present before Spark 1.2.
A temporary workaround is to register the necessary classes manually by setting 
the following conf property:
--conf spark.kryo.classesToRegister=[Lscala.Tuple2;

I'm looking into a better solution now.
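For illustration only, a rough sketch of doing the same registration programmatically
(assuming the standard SparkConf API; [Lscala.Tuple2; is the JVM name for scala.Tuple2[],
and further classes may also need registering when registrationRequired is true):

{code}
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("kryo-registration-workaround")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrationRequired", "true")
  // Equivalent to passing --conf spark.kryo.classesToRegister=[Lscala.Tuple2;
  // "[Lscala.Tuple2;" is the JVM class name for arrays of scala.Tuple2.
  .registerKryoClasses(Array(Class.forName("[Lscala.Tuple2;")))

val sc = new SparkContext(conf)
{code}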

 Some internal spark classes are not registered with kryo
 

 Key: SPARK-10251
 URL: https://issues.apache.org/jira/browse/SPARK-10251
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.1
Reporter: Soren Macbeth
Assignee: Ram Sriharsha

 When running a job using kryo serialization with 
 `spark.kryo.registrationRequired=true`, some internal classes are not 
 registered, causing the job to die. This is still a problem when the setting 
 is false (the default), because unregistered classes make serialized objects 
 much more expensive to store in memory or on disk, in both runtime and 
 storage space.
 {code}
 15/08/25 20:28:21 WARN spark.scheduler.TaskSetManager: Lost task 0.0 in stage 
 0.0 (TID 0, a.b.c.d): java.lang.IllegalArgumentException: Class is not 
 registered: scala.Tuple2[]
 Note: To register this class use: kryo.register(scala.Tuple2[].class);
 at com.esotericsoftware.kryo.Kryo.getRegistration(Kryo.java:442)
 at 
 com.esotericsoftware.kryo.util.DefaultClassResolver.writeClass(DefaultClassResolver.java:79)
 at com.esotericsoftware.kryo.Kryo.writeClass(Kryo.java:472)
 at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:565)
 at 
 org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:250)
 at 
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:236)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)
 {code}






[jira] [Comment Edited] (SPARK-10251) Some internal spark classes are not registered with kryo

2015-08-26 Thread Ram Sriharsha (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14714161#comment-14714161
 ] 

Ram Sriharsha edited comment on SPARK-10251 at 8/26/15 3:36 PM:


As far as I can see, this happens from Spark 1.2 onward; I haven't gone back yet 
to check whether it was present before Spark 1.2.
A temporary workaround is to register the necessary classes manually by setting 
the following conf property:
--conf spark.kryo.classesToRegister=[Lscala.Tuple2;

I'm looking into a better solution now: one option is to automatically register 
such classes (i.e. arrays of tuples, lists of tuples, None, etc.).


was (Author: rams):
As far as I can see, this happens from Spark 1.2 onward; I haven't gone back yet 
to check whether it was present before Spark 1.2.
A temporary workaround is to register the necessary classes manually by setting 
the following conf property:
--conf spark.kryo.classesToRegister=[Lscala.Tuple2;

I'm looking into a better solution now.

 Some internal spark classes are not registered with kryo
 

 Key: SPARK-10251
 URL: https://issues.apache.org/jira/browse/SPARK-10251
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.1
Reporter: Soren Macbeth
Assignee: Ram Sriharsha

 When running a job using kryo serialization with 
 `spark.kryo.registrationRequired=true`, some internal classes are not 
 registered, causing the job to die. This is still a problem when the setting 
 is false (the default), because unregistered classes make serialized objects 
 much more expensive to store in memory or on disk, in both runtime and 
 storage space.
 {code}
 15/08/25 20:28:21 WARN spark.scheduler.TaskSetManager: Lost task 0.0 in stage 
 0.0 (TID 0, a.b.c.d): java.lang.IllegalArgumentException: Class is not 
 registered: scala.Tuple2[]
 Note: To register this class use: kryo.register(scala.Tuple2[].class);
 at com.esotericsoftware.kryo.Kryo.getRegistration(Kryo.java:442)
 at 
 com.esotericsoftware.kryo.util.DefaultClassResolver.writeClass(DefaultClassResolver.java:79)
 at com.esotericsoftware.kryo.Kryo.writeClass(Kryo.java:472)
 at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:565)
 at 
 org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:250)
 at 
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:236)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)
 {code}






[jira] [Commented] (SPARK-9670) ML 1.5 QA: Examples: Check for new APIs requiring example code

2015-08-13 Thread Ram Sriharsha (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14694756#comment-14694756
 ] 

Ram Sriharsha commented on SPARK-9670:
--

Hey Yuhao

There is already a JIRA to add a complex pipeline example. You can use the same 
JIRA.
https://github.com/apache/spark/pull/6654
What do you have in mind?

 ML 1.5 QA: Examples: Check for new APIs requiring example code
 --

 Key: SPARK-9670
 URL: https://issues.apache.org/jira/browse/SPARK-9670
 Project: Spark
  Issue Type: Sub-task
  Components: ML, MLlib
Reporter: Joseph K. Bradley
Assignee: Ram Sriharsha
Priority: Minor

 Audit list of new features added to MLlib, and see which major items are 
 missing example code (in the examples folder).  We do not need examples for 
 everything, only for major items such as new ML algorithms.
 For any such items:
 * Create a JIRA for that feature, and assign it to the author of the feature 
 (or yourself if interested).
 * Link it to (a) the original JIRA which introduced that feature ("related 
 to") and (b) to this JIRA ("requires").






[jira] [Updated] (SPARK-7690) MulticlassClassificationEvaluator for tuning Multiclass Classifiers

2015-07-17 Thread Ram Sriharsha (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ram Sriharsha updated SPARK-7690:
-
Assignee: Eron Wright   (was: Ram Sriharsha)

 MulticlassClassificationEvaluator for tuning Multiclass Classifiers
 ---

 Key: SPARK-7690
 URL: https://issues.apache.org/jira/browse/SPARK-7690
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Ram Sriharsha
Assignee: Eron Wright 

 Provide a MulticlassClassificationEvaluator with weighted F1-score to tune 
 multiclass classifiers using the Pipeline API.
 MLlib already provides MulticlassMetrics, which a 
 MulticlassClassificationEvaluator can wrap to expose the weighted F1-score as 
 a metric.
 The functionality could be similar to scikit-learn 
 (http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html) 
 in that we can support micro, macro and weighted versions of the F1-score 
 (with weighted being the default).
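For context, a rough sketch (not from the issue) of the weighted F1-score that MLlib's
existing MulticlassMetrics already computes and that such an evaluator could expose;
the (prediction, label) RDD here is a placeholder:

{code}
import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.rdd.RDD

// predictionAndLabels is assumed to be (prediction, label) pairs from a fitted model.
def weightedF1(predictionAndLabels: RDD[(Double, Double)]): Double = {
  val metrics = new MulticlassMetrics(predictionAndLabels)
  // Analogous to scikit-learn's f1_score(..., average="weighted").
  metrics.weightedFMeasure
}
{code}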






[jira] [Updated] (SPARK-7690) MulticlassClassificationEvaluator for tuning Multiclass Classifiers

2015-07-17 Thread Ram Sriharsha (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ram Sriharsha updated SPARK-7690:
-
Shepherd: Ram Sriharsha

 MulticlassClassificationEvaluator for tuning Multiclass Classifiers
 ---

 Key: SPARK-7690
 URL: https://issues.apache.org/jira/browse/SPARK-7690
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Ram Sriharsha
Assignee: Eron Wright 

 Provide a MulticlassClassificationEvaluator with weighted F1-score to tune 
 multiclass classifiers using Pipeline API.
 MLLib already provides a MulticlassMetrics functionality which can be wrapped 
 around a MulticlassClassificationEvaluator to expose weighted F1-score as 
 metric.
 The functionality could be similar to 
 scikit(http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html)
   in that we can support micro, macro and weighted versions of the F1-score 
 (with weighted being default)






[jira] [Updated] (SPARK-7546) Example code for ML Pipelines feature transformations

2015-06-17 Thread Ram Sriharsha (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ram Sriharsha updated SPARK-7546:
-
Target Version/s: 1.5.0  (was: 1.4.0)

 Example code for ML Pipelines feature transformations
 -

 Key: SPARK-7546
 URL: https://issues.apache.org/jira/browse/SPARK-7546
 Project: Spark
  Issue Type: New Feature
  Components: ML
Reporter: Joseph K. Bradley
Assignee: Ram Sriharsha

 This should be added for Scala, Java, and Python.
 It should cover ML Pipelines using a complex series of feature 
 transformations.
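As a rough illustration of the kind of example being requested (a minimal sketch, assuming
the spark.ml Pipeline API; the "text"/"label" column names and the training/test DataFrames
are placeholders):

{code}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.DataFrame

// training and test are assumed DataFrames with "text" and "label" columns.
def runPipeline(training: DataFrame, test: DataFrame): DataFrame = {
  val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
  val hashingTF = new HashingTF().setInputCol(tokenizer.getOutputCol).setOutputCol("features")
  val lr = new LogisticRegression().setMaxIter(10)
  // Chain the feature transformations and the classifier into a single Pipeline.
  val model = new Pipeline().setStages(Array(tokenizer, hashingTF, lr)).fit(training)
  model.transform(test)
}
{code}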






[jira] [Updated] (SPARK-8092) OneVsRest doesn't allow flexibility in label/ feature column renaming

2015-06-03 Thread Ram Sriharsha (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ram Sriharsha updated SPARK-8092:
-
Fix Version/s: 1.4.1

 OneVsRest doesn't allow flexibility in label/ feature column renaming
 -

 Key: SPARK-8092
 URL: https://issues.apache.org/jira/browse/SPARK-8092
 Project: Spark
  Issue Type: Bug
  Components: ML
Reporter: Ram Sriharsha
Assignee: Ram Sriharsha
 Fix For: 1.4.1









[jira] [Updated] (SPARK-8092) OneVsRest doesn't allow flexibility in label/ feature column renaming

2015-06-03 Thread Ram Sriharsha (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ram Sriharsha updated SPARK-8092:
-
Component/s: ML

 OneVsRest doesn't allow flexibility in label/ feature column renaming
 -

 Key: SPARK-8092
 URL: https://issues.apache.org/jira/browse/SPARK-8092
 Project: Spark
  Issue Type: Bug
  Components: ML
Reporter: Ram Sriharsha
Assignee: Ram Sriharsha








[jira] [Assigned] (SPARK-7546) Example code for ML Pipelines feature transformations

2015-06-01 Thread Ram Sriharsha (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ram Sriharsha reassigned SPARK-7546:


Assignee: Ram Sriharsha

 Example code for ML Pipelines feature transformations
 -

 Key: SPARK-7546
 URL: https://issues.apache.org/jira/browse/SPARK-7546
 Project: Spark
  Issue Type: New Feature
  Components: ML
Reporter: Joseph K. Bradley
Assignee: Ram Sriharsha

 This should be added for Scala, Java, and Python.
 It should cover ML Pipelines using a complex series of feature 
 transformations.






[jira] [Commented] (SPARK-6013) Add more Python ML examples for spark.ml

2015-05-29 Thread Ram Sriharsha (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14565819#comment-14565819
 ] 

Ram Sriharsha commented on SPARK-6013:
--

Cross Validator Example is covered as part of this PR:
https://issues.apache.org/jira/browse/SPARK-7387

 Add more Python ML examples for spark.ml
 

 Key: SPARK-6013
 URL: https://issues.apache.org/jira/browse/SPARK-6013
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: Ram Sriharsha

 Now that the spark.ml Pipelines API is supported within Python, we should 
 duplicate the remaining Scala/Java spark.ml examples within Python.






[jira] [Created] (SPARK-7882) HBase Input Format Example does not allow passing ZK parent node

2015-05-26 Thread Ram Sriharsha (JIRA)
Ram Sriharsha created SPARK-7882:


 Summary: HBase Input Format Example does not allow passing ZK 
parent node
 Key: SPARK-7882
 URL: https://issues.apache.org/jira/browse/SPARK-7882
 Project: Spark
  Issue Type: Bug
Reporter: Ram Sriharsha
Assignee: Ram Sriharsha
Priority: Minor


The HBase input format example here:
https://github.com/apache/spark/blob/master/examples/src/main/python/hbase_inputformat.py#L52

precludes passing a fourth parameter (zk.node.parent), even though further down 
there is code that checks for a possible fourth parameter and interprets it as 
zk.node.parent here:
https://github.com/apache/spark/blob/master/examples/src/main/python/hbase_inputformat.py#L71







[jira] [Created] (SPARK-7861) Python wrapper for OneVsRest

2015-05-25 Thread Ram Sriharsha (JIRA)
Ram Sriharsha created SPARK-7861:


 Summary: Python wrapper for OneVsRest
 Key: SPARK-7861
 URL: https://issues.apache.org/jira/browse/SPARK-7861
 Project: Spark
  Issue Type: Improvement
Reporter: Ram Sriharsha
Assignee: Ram Sriharsha









[jira] [Created] (SPARK-7833) Add python wrapper for RegressionEvaluator

2015-05-22 Thread Ram Sriharsha (JIRA)
Ram Sriharsha created SPARK-7833:


 Summary: Add python wrapper for RegressionEvaluator
 Key: SPARK-7833
 URL: https://issues.apache.org/jira/browse/SPARK-7833
 Project: Spark
  Issue Type: Improvement
Reporter: Ram Sriharsha
Assignee: Ram Sriharsha


Add a python wrapper for RegressionEvaluator in the ML Pipeline






[jira] [Assigned] (SPARK-6013) Add more Python ML examples for spark.ml

2015-05-21 Thread Ram Sriharsha (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ram Sriharsha reassigned SPARK-6013:


Assignee: Ram Sriharsha

 Add more Python ML examples for spark.ml
 

 Key: SPARK-6013
 URL: https://issues.apache.org/jira/browse/SPARK-6013
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: Ram Sriharsha

 Now that the spark.ml Pipelines API is supported within Python, we should 
 duplicate the remaining Scala/Java spark.ml examples within Python.






[jira] [Commented] (SPARK-7404) Add RegressionEvaluator to spark.ml

2015-05-21 Thread Ram Sriharsha (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14555326#comment-14555326
 ] 

Ram Sriharsha commented on SPARK-7404:
--

ah perfect, didn't notice RegressionMetrics in codebase. that is great!

 Add RegressionEvaluator to spark.ml
 ---

 Key: SPARK-7404
 URL: https://issues.apache.org/jira/browse/SPARK-7404
 Project: Spark
  Issue Type: New Feature
  Components: ML
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Assignee: Ram Sriharsha

 This allows users to tune regression models using the pipeline API.






[jira] [Assigned] (SPARK-7404) Add RegressionEvaluator to spark.ml

2015-05-21 Thread Ram Sriharsha (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ram Sriharsha reassigned SPARK-7404:


Assignee: Ram Sriharsha

 Add RegressionEvaluator to spark.ml
 ---

 Key: SPARK-7404
 URL: https://issues.apache.org/jira/browse/SPARK-7404
 Project: Spark
  Issue Type: New Feature
  Components: ML
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Assignee: Ram Sriharsha

 This allows users to tune regression models using the pipeline API.






[jira] [Comment Edited] (SPARK-7404) Add RegressionEvaluator to spark.ml

2015-05-21 Thread Ram Sriharsha (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14555285#comment-14555285
 ] 

Ram Sriharsha edited comment on SPARK-7404 at 5/21/15 11:35 PM:


scikit learn and R provide a variety of regression metrics
http://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics

http://artax.karlin.mff.cuni.cz/r-help/library/rminer/html/mmetric.html
R2 score and RMSE seem like natural metrics to make available via the Evaluator.


was (Author: rams):
sickout learn and R provide a variety of regression metrics
http://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics

http://artax.karlin.mff.cuni.cz/r-help/library/rminer/html/mmetric.html
R2 score and RMSE seem like natural metrics to make available via the Evaluator.

 Add RegressionEvaluator to spark.ml
 ---

 Key: SPARK-7404
 URL: https://issues.apache.org/jira/browse/SPARK-7404
 Project: Spark
  Issue Type: New Feature
  Components: ML
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Assignee: Ram Sriharsha

 This allows users to tune regression models using the pipeline API.






[jira] [Comment Edited] (SPARK-7404) Add RegressionEvaluator to spark.ml

2015-05-21 Thread Ram Sriharsha (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14555285#comment-14555285
 ] 

Ram Sriharsha edited comment on SPARK-7404 at 5/21/15 11:36 PM:


scikit learn and R provide a variety of regression metrics
http://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics

http://artax.karlin.mff.cuni.cz/r-help/library/rminer/html/mmetric.html
R2 score and RMSE seem like natural first metrics to make available via the 
Evaluator.


was (Author: rams):
scikit learn and R provide a variety of regression metrics
http://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics

http://artax.karlin.mff.cuni.cz/r-help/library/rminer/html/mmetric.html
R2 score and RMSE seem like natural metrics to make available via the Evaluator.

 Add RegressionEvaluator to spark.ml
 ---

 Key: SPARK-7404
 URL: https://issues.apache.org/jira/browse/SPARK-7404
 Project: Spark
  Issue Type: New Feature
  Components: ML
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Assignee: Ram Sriharsha

 This allows users to tune regression models using the pipeline API.






[jira] [Commented] (SPARK-7404) Add RegressionEvaluator to spark.ml

2015-05-21 Thread Ram Sriharsha (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14555285#comment-14555285
 ] 

Ram Sriharsha commented on SPARK-7404:
--

sickout learn and R provide a variety of regression metrics
http://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics

http://artax.karlin.mff.cuni.cz/r-help/library/rminer/html/mmetric.html
R2 score and RMSE seem like natural metrics to make available via the Evaluator.
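A minimal sketch (not from the comment) of how those two metrics are already obtainable
from MLlib's RegressionMetrics, which a RegressionEvaluator could wrap; the input RDD is
a placeholder:

{code}
import org.apache.spark.mllib.evaluation.RegressionMetrics
import org.apache.spark.rdd.RDD

// predictionAndObservations is assumed to be (prediction, label) pairs from a regression model.
def rmseAndR2(predictionAndObservations: RDD[(Double, Double)]): (Double, Double) = {
  val metrics = new RegressionMetrics(predictionAndObservations)
  (metrics.rootMeanSquaredError, metrics.r2)
}
{code}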

 Add RegressionEvaluator to spark.ml
 ---

 Key: SPARK-7404
 URL: https://issues.apache.org/jira/browse/SPARK-7404
 Project: Spark
  Issue Type: New Feature
  Components: ML
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Assignee: Ram Sriharsha

 This allows users to tune regression models using the pipeline API.






[jira] [Assigned] (SPARK-7690) MulticlassClassificationEvaluator for tuning Multiclass Classifiers

2015-05-17 Thread Ram Sriharsha (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ram Sriharsha reassigned SPARK-7690:


Assignee: Ram Sriharsha

 MulticlassClassificationEvaluator for tuning Multiclass Classifiers
 ---

 Key: SPARK-7690
 URL: https://issues.apache.org/jira/browse/SPARK-7690
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Ram Sriharsha
Assignee: Ram Sriharsha

 Provide a MulticlassClassificationEvaluator with weighted F1-score to tune 
 multiclass classifiers using Pipeline API.
 MLLib already provides a MulticlassMetrics functionality which can be wrapped 
 around a MulticlassClassificationEvaluator to expose weighted F1-score as 
 metric.
 The functionality could be similar to 
 scikit(http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html)
   in that we can support micro, macro and weighted versions of the F1-score 
 (with weighted being default)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7690) MulticlassClassificationEvaluator for tuning Multiclass Classifiers

2015-05-17 Thread Ram Sriharsha (JIRA)
Ram Sriharsha created SPARK-7690:


 Summary: MulticlassClassificationEvaluator for tuning Multiclass 
Classifiers
 Key: SPARK-7690
 URL: https://issues.apache.org/jira/browse/SPARK-7690
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Ram Sriharsha


Provide a MulticlassClassificationEvaluator with weighted F1-score to tune 
multiclass classifiers using Pipeline API.
MLLib already provides a MulticlassMetrics functionality which can be wrapped 
around a MulticlassClassificationEvaluator to expose weighted F1-score as 
metric.
The functionality could be similar to 
scikit(http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html)
  in that we can support micro, macro and weighted versions of the F1-score 
(with weighted being default)






[jira] [Created] (SPARK-7460) Provide DataFrame.zip (analog of RDD.zip) to merge two data frames

2015-05-07 Thread Ram Sriharsha (JIRA)
Ram Sriharsha created SPARK-7460:


 Summary: Provide DataFrame.zip (analog of RDD.zip) to merge two 
data frames
 Key: SPARK-7460
 URL: https://issues.apache.org/jira/browse/SPARK-7460
 Project: Spark
  Issue Type: Sub-task
Reporter: Ram Sriharsha
Assignee: Ram Sriharsha
Priority: Minor


An analog of RDD1.zip(RDD2) for data frames would let us merge two data frames 
without stepping down to the RDD layer and back (syntactic sugar).
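For illustration, a rough sketch of the RDD-level round trip this sugar would replace
(assuming, as RDD.zip does, that both DataFrames have the same number of partitions and
the same number of rows per partition; names are placeholders):

{code}
import org.apache.spark.sql.{DataFrame, Row, SQLContext}
import org.apache.spark.sql.types.StructType

// Hypothetical helper: zip two DataFrames row by row via the RDD layer.
def zipDataFrames(sqlContext: SQLContext, left: DataFrame, right: DataFrame): DataFrame = {
  val zippedRows = left.rdd.zip(right.rdd).map { case (l, r) =>
    Row.fromSeq(l.toSeq ++ r.toSeq)
  }
  val mergedSchema = StructType(left.schema.fields ++ right.schema.fields)
  sqlContext.createDataFrame(zippedRows, mergedSchema)
}
{code}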






[jira] [Commented] (SPARK-5866) pyspark read from s3

2015-05-05 Thread Ram Sriharsha (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14528686#comment-14528686
 ] 

Ram Sriharsha commented on SPARK-5866:
--

I'm not sure how the Scala version is working... the exception suggests it's 
looking for a path with protocol s3, when it should be s3n:// (the 
NativeS3FileSystem scheme is s3n).
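A minimal Scala sketch of the point being made (illustrative only; the bucket/path and the
environment-variable credentials are placeholders):

{code}
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("s3n-read"))
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))

// NativeS3FileSystem registers the "s3n" scheme, so the path must start with s3n://, not s3://.
val rdd = sc.textFile("s3n://bucketName/pathS3/")
println(rdd.count())
{code}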

 pyspark read from s3
 

 Key: SPARK-5866
 URL: https://issues.apache.org/jira/browse/SPARK-5866
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.2.1
 Environment: mac OSx and ec2 ubuntu
Reporter: venu k tangirala

 I am trying to read data from s3 via pyspark. I gave the credentials with:
 sc = SparkContext()
 sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", key)
 sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", secret_key)
 I also tried setting the credentials in core-site.xml, placed in the 
 conf/ dir. 
 Interestingly, the same works with scala version of spark, both by setting 
 the s3 key and secret key in scala code and also by setting it in 
 core-site.xml
 The pySpark error is as follows :
 File /Users/myname/path/./spark_json.py, line 55, in module
 vals_table = sqlContext.inferSchema(values)
   File /Users/myname/spark-1.2.1/python/pyspark/sql.py, line 1332, in 
 inferSchema
 first = rdd.first()
   File /Users/myname/spark-1.2.1/python/pyspark/rdd.py, line 1139, in first
 rs = self.take(1)
   File /Users/myname/spark-1.2.1/python/pyspark/rdd.py, line 1091, in take
 totalParts = self._jrdd.partitions().size()
   File 
 /anaconda/lib/python2.7/site-packages/py4j-0.8.2.1-py2.7.egg/py4j/java_gateway.py,
  line 538, in __call__
 self.target_id, self.name)
   File 
 /anaconda/lib/python2.7/site-packages/py4j-0.8.2.1-py2.7.egg/py4j/protocol.py,
  line 300, in get_return_value
 format(target_id, '.', name), value)
 py4j.protocol.Py4JJavaError: An error occurred while calling o24.partitions.
 : org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path 
 does not exist: s3://bucketName/pathS3/_1417479684
   at 
 org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:235)
   at 
 org.apache.spark.input.WholeTextFileInputFormat.setMinPartitions(WholeTextFileInputFormat.scala:61)
   at 
 org.apache.spark.rdd.WholeTextFileRDD.getPartitions(NewHadoopRDD.scala:269)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
   at scala.Option.getOrElse(Option.scala:120)
   at org.apache.spark.rdd.RDD.partitions(RDD.scala:220)
   at 
 org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:57)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
   at scala.Option.getOrElse(Option.scala:120)
   at org.apache.spark.rdd.RDD.partitions(RDD.scala:220)
   at 
 org.apache.spark.api.java.JavaRDDLike$class.partitions(JavaRDDLike.scala:53)
   at org.apache.spark.api.java.JavaRDD.partitions(JavaRDD.scala:32)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
   at py4j.Gateway.invoke(Gateway.java:259)
   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
   at py4j.commands.CallCommand.execute(CallCommand.java:79)
   at py4j.GatewayConnection.run(GatewayConnection.java:207)
   at java.lang.Thread.run(Thread.java:724)






[jira] [Comment Edited] (SPARK-5866) pyspark read from s3

2015-05-05 Thread Ram Sriharsha (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14528686#comment-14528686
 ] 

Ram Sriharsha edited comment on SPARK-5866 at 5/5/15 3:44 PM:
--

I'm not sure how the Scala version is working... the exception suggests it's 
looking for a path with scheme s3, when it should be s3n (the 
NativeS3FileSystem scheme is s3n).


was (Author: rams):
i'm not sure how the scala version is working..the exception suggests its 
looking for a path with protocol s3, when it should be s3n:// (The 
NativeS3FileSystem scheme is s3n)

 pyspark read from s3
 

 Key: SPARK-5866
 URL: https://issues.apache.org/jira/browse/SPARK-5866
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.2.1
 Environment: Mac OS X and EC2 Ubuntu
Reporter: venu k tangirala

 I am trying to read data from S3 via PySpark. I set the credentials with:
 sc = SparkContext()
 sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", key)
 sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", secret_key)
 I also tried setting the credentials in a core-site.xml placed in the 
 conf/ dir. 
 Interestingly, the same works with the Scala version of Spark, both by setting 
 the S3 key and secret key in Scala code and by setting them in 
 core-site.xml.
 The PySpark error is as follows:
 File /Users/myname/path/./spark_json.py, line 55, in module
 vals_table = sqlContext.inferSchema(values)
   File /Users/myname/spark-1.2.1/python/pyspark/sql.py, line 1332, in 
 inferSchema
 first = rdd.first()
   File /Users/myname/spark-1.2.1/python/pyspark/rdd.py, line 1139, in first
 rs = self.take(1)
   File /Users/myname/spark-1.2.1/python/pyspark/rdd.py, line 1091, in take
 totalParts = self._jrdd.partitions().size()
   File 
 /anaconda/lib/python2.7/site-packages/py4j-0.8.2.1-py2.7.egg/py4j/java_gateway.py,
  line 538, in __call__
 self.target_id, self.name)
   File 
 /anaconda/lib/python2.7/site-packages/py4j-0.8.2.1-py2.7.egg/py4j/protocol.py,
  line 300, in get_return_value
 format(target_id, '.', name), value)
 py4j.protocol.Py4JJavaError: An error occurred while calling o24.partitions.
 : org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path 
 does not exist: s3://bucketName/pathS3/_1417479684
   at 
 org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:235)
   at 
 org.apache.spark.input.WholeTextFileInputFormat.setMinPartitions(WholeTextFileInputFormat.scala:61)
   at 
 org.apache.spark.rdd.WholeTextFileRDD.getPartitions(NewHadoopRDD.scala:269)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
   at scala.Option.getOrElse(Option.scala:120)
   at org.apache.spark.rdd.RDD.partitions(RDD.scala:220)
   at 
 org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:57)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
   at scala.Option.getOrElse(Option.scala:120)
   at org.apache.spark.rdd.RDD.partitions(RDD.scala:220)
   at 
 org.apache.spark.api.java.JavaRDDLike$class.partitions(JavaRDDLike.scala:53)
   at org.apache.spark.api.java.JavaRDD.partitions(JavaRDD.scala:32)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
   at py4j.Gateway.invoke(Gateway.java:259)
   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
   at py4j.commands.CallCommand.execute(CallCommand.java:79)
   at py4j.GatewayConnection.run(GatewayConnection.java:207)
   at java.lang.Thread.run(Thread.java:724)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5866) pyspark read from s3

2015-05-05 Thread Ram Sriharsha (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14528686#comment-14528686
 ] 

Ram Sriharsha edited comment on SPARK-5866 at 5/5/15 3:45 PM:
--

I'm not sure how the Scala version is working; the exception suggests it's 
looking for a path with scheme s3, when it should be s3n (the 
NativeS3FileSystem scheme is s3n).
As far as I can see, you have a typo in your path and this is not a bug.
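
A minimal PySpark sketch of the fix being suggested, assuming the Spark 1.2-era s3n (NativeS3FileSystem) connector; the access keys, bucket, and path below are placeholders, not values from the report:

{noformat}
from pyspark import SparkContext

sc = SparkContext(appName="s3n-read-example")

# Placeholder credentials; in practice these come from your environment.
access_key = "YOUR_ACCESS_KEY"
secret_key = "YOUR_SECRET_KEY"

# Configure credentials for the s3n (NativeS3FileSystem) scheme.
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", access_key)
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", secret_key)

# Read with the s3n:// scheme, matching the credentials above,
# rather than s3://, which resolves to a different FileSystem.
values = sc.wholeTextFiles("s3n://bucketName/pathS3/")
print(values.take(1))
{noformat}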


was (Author: rams):
I'm not sure how the Scala version is working; the exception suggests it's 
looking for a path with scheme s3, when it should be s3n (the 
NativeS3FileSystem scheme is s3n).

 pyspark read from s3
 

 Key: SPARK-5866
 URL: https://issues.apache.org/jira/browse/SPARK-5866
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.2.1
 Environment: Mac OS X and EC2 Ubuntu
Reporter: venu k tangirala

 I am trying to read data from S3 via PySpark. I set the credentials with:
 sc = SparkContext()
 sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", key)
 sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", secret_key)
 I also tried setting the credentials in a core-site.xml placed in the 
 conf/ dir. 
 Interestingly, the same works with the Scala version of Spark, both by setting 
 the S3 key and secret key in Scala code and by setting them in 
 core-site.xml.
 The PySpark error is as follows:
 File /Users/myname/path/./spark_json.py, line 55, in module
 vals_table = sqlContext.inferSchema(values)
   File /Users/myname/spark-1.2.1/python/pyspark/sql.py, line 1332, in 
 inferSchema
 first = rdd.first()
   File /Users/myname/spark-1.2.1/python/pyspark/rdd.py, line 1139, in first
 rs = self.take(1)
   File /Users/myname/spark-1.2.1/python/pyspark/rdd.py, line 1091, in take
 totalParts = self._jrdd.partitions().size()
   File 
 /anaconda/lib/python2.7/site-packages/py4j-0.8.2.1-py2.7.egg/py4j/java_gateway.py,
  line 538, in __call__
 self.target_id, self.name)
   File 
 /anaconda/lib/python2.7/site-packages/py4j-0.8.2.1-py2.7.egg/py4j/protocol.py,
  line 300, in get_return_value
 format(target_id, '.', name), value)
 py4j.protocol.Py4JJavaError: An error occurred while calling o24.partitions.
 : org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path 
 does not exist: s3://bucketName/pathS3/_1417479684
   at 
 org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:235)
   at 
 org.apache.spark.input.WholeTextFileInputFormat.setMinPartitions(WholeTextFileInputFormat.scala:61)
   at 
 org.apache.spark.rdd.WholeTextFileRDD.getPartitions(NewHadoopRDD.scala:269)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
   at scala.Option.getOrElse(Option.scala:120)
   at org.apache.spark.rdd.RDD.partitions(RDD.scala:220)
   at 
 org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:57)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
   at scala.Option.getOrElse(Option.scala:120)
   at org.apache.spark.rdd.RDD.partitions(RDD.scala:220)
   at 
 org.apache.spark.api.java.JavaRDDLike$class.partitions(JavaRDDLike.scala:53)
   at org.apache.spark.api.java.JavaRDD.partitions(JavaRDD.scala:32)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
   at py4j.Gateway.invoke(Gateway.java:259)
   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
   at py4j.commands.CallCommand.execute(CallCommand.java:79)
   at py4j.GatewayConnection.run(GatewayConnection.java:207)
   at java.lang.Thread.run(Thread.java:724)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7015) Multiclass to Binary Reduction

2015-04-20 Thread Ram Sriharsha (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14504197#comment-14504197
 ] 

Ram Sriharsha commented on SPARK-7015:
--

Sounds good. Let me know what reference you had in mind. I am familiar with 
Beygelzimer and Langford's error-correcting tournaments 
(http://hunch.net/~beygel/tournament.pdf), but if you have a better reference in 
mind, let me know and I can use that as the starting point.

 Multiclass to Binary Reduction
 --

 Key: SPARK-7015
 URL: https://issues.apache.org/jira/browse/SPARK-7015
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Ram Sriharsha
Assignee: Ram Sriharsha
   Original Estimate: 336h
  Remaining Estimate: 336h

 With the new Pipeline API, it is possible to seamlessly support machine 
 learning reductions as meta-algorithms.
 GBDT and SVM today are binary classifiers, and we can implement multiclass 
 classification as One vs All or All vs All (or an even more sophisticated 
 reduction) using binary classifiers as primitives.
 This JIRA is to track the creation of a reduction API for multiclass 
 classification.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7014) Support loading VW formatted data.

2015-04-20 Thread Ram Sriharsha (JIRA)
Ram Sriharsha created SPARK-7014:


 Summary: Support loading VW formatted data.
 Key: SPARK-7014
 URL: https://issues.apache.org/jira/browse/SPARK-7014
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Ram Sriharsha


Support loading data in VW format. The VW format is used fairly widely and 
supports namespaces, importance weighting, and multi-label and cost-sensitive 
multiclass classification formats.
We can support this just as we support Avro and CSV formats today.
It probably belongs in a new package, say spark-vw, but this JIRA is simply to 
track and discuss the issue of supporting the VW format.
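
As a rough illustration of what a loader would have to handle, here is a small Python sketch that parses a simplified subset of the VW input format (single numeric label, optional importance weight, optional 'tag, and namespaced features with optional values). It is not a spec-complete parser, and the returned field names are invented for the example:

{noformat}
def parse_vw_line(line):
    """Parse a simplified VW example:  label [importance] ['tag] |ns feat[:val] ..."""
    header, _, rest = line.partition("|")
    tokens = header.split()

    label = float(tokens[0])
    importance, tag = 1.0, None
    for tok in tokens[1:]:
        if tok.startswith("'"):
            tag = tok[1:]            # example tag
        else:
            importance = float(tok)  # importance weight

    namespaces = {}
    for block in ("|" + rest).split("|")[1:]:
        parts = block.split()
        if not parts:
            continue
        ns, feats = parts[0], {}
        for feat in parts[1:]:
            name, _, value = feat.partition(":")
            feats[name] = float(value) if value else 1.0
        namespaces[ns] = feats

    return {"label": label, "importance": importance, "tag": tag,
            "namespaces": namespaces}


print(parse_vw_line("1 2.0 'ex1 |height h:1.5 w:2.0 |color red"))
{noformat}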



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7015) Multiclass to Binary Reduction

2015-04-20 Thread Ram Sriharsha (JIRA)
Ram Sriharsha created SPARK-7015:


 Summary: Multiclass to Binary Reduction
 Key: SPARK-7015
 URL: https://issues.apache.org/jira/browse/SPARK-7015
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Ram Sriharsha


With the new Pipeline API, it is possible to seamlessly support machine 
learning reductions as meta-algorithms.
GBDT and SVM today are binary classifiers, and we can implement multiclass 
classification as One vs All or All vs All (or an even more sophisticated 
reduction) using binary classifiers as primitives.
This JIRA is to track the creation of a reduction API for multiclass 
classification.
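
To make the reduction concrete, here is a minimal, library-agnostic Python sketch (illustrative only: it is not the proposed Spark API, and the class and method names are invented). Any binary learner exposing fit and a confidence score can be lifted into a multiclass classifier by training one "class vs. rest" model per label and predicting with the most confident one:

{noformat}
class PrototypeBinaryClassifier:
    """Toy binary learner: scores a point by closeness to the positive-class mean."""
    def fit(self, X, y):
        pos = [x for x, label in zip(X, y) if label == 1]
        n = max(len(pos), 1)
        self.center = [sum(col) / n for col in zip(*pos)] if pos else [0.0] * len(X[0])
        return self

    def score(self, x):
        # Higher is more confident: negative squared distance to the positive mean.
        return -sum((a - b) ** 2 for a, b in zip(x, self.center))


class OneVsAll:
    """Lift any binary classifier with fit/score into a multiclass classifier."""
    def __init__(self, make_binary):
        self.make_binary = make_binary
        self.models = {}

    def fit(self, X, y):
        # One "this class vs. the rest" binary problem per distinct label.
        for cls in sorted(set(y)):
            binary_labels = [1 if label == cls else 0 for label in y]
            self.models[cls] = self.make_binary().fit(X, binary_labels)
        return self

    def predict(self, x):
        # Pick the class whose binary model is most confident.
        return max(self.models, key=lambda cls: self.models[cls].score(x))


if __name__ == "__main__":
    X = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 4.9], [10.0, 0.2]]
    y = ["a", "a", "b", "b", "c"]
    ova = OneVsAll(PrototypeBinaryClassifier).fit(X, y)
    print(ova.predict([5.05, 5.0]))  # expected: "b"
{noformat}

The same shape generalizes to All vs All (one binary problem per pair of classes) or to error-correcting reductions; the point is only that the multiclass logic never needs to know which binary learner sits underneath.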



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7014) Support loading VW formatted data.

2015-04-20 Thread Ram Sriharsha (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14503476#comment-14503476
 ] 

Ram Sriharsha commented on SPARK-7014:
--

Good point, the priority is minor.
The plan is to have it as separate library code, like spark-afro for example.
It would be good to know who else is using VW-formatted data; 
maybe there isn't enough usage to warrant a parser.

 Support loading VW formatted data.
 --

 Key: SPARK-7014
 URL: https://issues.apache.org/jira/browse/SPARK-7014
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Ram Sriharsha
Assignee: Ram Sriharsha
Priority: Minor
   Original Estimate: 96h
  Remaining Estimate: 96h

 Support loading data in VW format. The VW format is used fairly widely and 
 supports namespaces, importance weighting, and multi-label and cost-sensitive 
 multiclass classification formats.
 We can support this just as we support Avro and CSV formats today.
 It probably belongs in a new package, say spark-vw, but this JIRA is simply to 
 track and discuss the issue of supporting the VW format.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-7014) Support loading VW formatted data.

2015-04-20 Thread Ram Sriharsha (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14503476#comment-14503476
 ] 

Ram Sriharsha edited comment on SPARK-7014 at 4/20/15 7:36 PM:
---

Good point, the priority is minor.
The plan is to have it as separate library code, like spark-avro for example.
It would be good to know who else is using VW-formatted data; 
maybe there isn't enough usage to warrant a parser.


was (Author: rams):
good point, the priority is minor.
the plan is to have it as a separate library code like spark-afro for example.
it would be good to know who else is using VW formatted data for example., 
maybe there isn't enough usage to warrant a parser.

 Support loading VW formatted data.
 --

 Key: SPARK-7014
 URL: https://issues.apache.org/jira/browse/SPARK-7014
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Ram Sriharsha
Assignee: Ram Sriharsha
Priority: Minor
   Original Estimate: 96h
  Remaining Estimate: 96h

 Support loading data in VW format. The VW format is used fairly widely and 
 supports namespaces, importance weighting, and multi-label and cost-sensitive 
 multiclass classification formats.
 We can support this just as we support Avro and CSV formats today.
 It probably belongs in a new package, say spark-vw, but this JIRA is simply to 
 track and discuss the issue of supporting the VW format.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org