[jira] [Commented] (SPARK-12394) Support writing out pre-hash-partitioned data and exploit that in join optimizations to avoid shuffle (i.e. bucketing in Hive)

2016-08-25 Thread Alexander Tronchin-James (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15438461#comment-15438461
 ] 

Alexander Tronchin-James commented on SPARK-12394:
--

Awesome news Tejas!

The filter feature is secondary AFAIK, and I'd prioritize the sorted-merge
bucketed-map (SMB) join if I had the choice. There is a strong preference for
supporting inner and left/right/full outer joins between tables whose bucket
counts differ by an integer multiple, for selecting the number of executors
(a rational multiple or fraction of the number of buckets), and for selecting
the number of emitted buckets. Bonus points for an implementation that
automatically applies SMB joins and avoids re-sorts where possible. Maybe a
tall order, but we know it can be done. ;-)

If we don't get it all in the first pull request we can always iterate.
Thanks for pushing on this!






> Support writing out pre-hash-partitioned data and exploit that in join 
> optimizations to avoid shuffle (i.e. bucketing in Hive)
> --
>
> Key: SPARK-12394
> URL: https://issues.apache.org/jira/browse/SPARK-12394
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Nong Li
> Fix For: 2.0.0
>
> Attachments: BucketedTables.pdf
>
>
> In many cases users know ahead of time the columns that they will be joining 
> or aggregating on. Ideally they should be able to leverage this information 
> and pre-shuffle the data so that subsequent queries do not require a shuffle. 
> Hive supports this functionality by allowing the user to define buckets, 
> which are hash partitions of the data based on some key.
>  - Allow the user to specify a set of columns when caching or writing out data
>  - Allow the user to specify some parallelism
>  - Shuffle the data when writing / caching such that it is distributed by these 
> columns
>  - When planning/executing a query, use this distribution to avoid another 
> shuffle when reading, assuming the join or aggregation is compatible with the 
> columns specified
>  - Should work with existing save modes: append, overwrite, etc.
>  - Should work at least with all Hadoop FS data sources
>  - Should work with any data source when caching
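A minimal sketch of how this is exposed through DataFrameWriter's bucketBy/sortBy calls (the table and column names below are made up for illustration, and a Hive-enabled SparkSession named `spark` is assumed):

{code}
// Minimal sketch: write data pre-hashed into buckets and sorted within each
// bucket, so that later joins/aggregations on user_id can avoid a shuffle.
import spark.implicits._

val events = Seq((1L, "click"), (2L, "view")).toDF("user_id", "event")

events.write
  .bucketBy(8, "user_id")   // hash-partition into 8 buckets by user_id
  .sortBy("user_id")        // keep each bucket sorted, enabling sort-merge joins
  .saveAsTable("events_bucketed")
{code}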






[jira] [Comment Edited] (SPARK-14927) DataFrame. saveAsTable creates RDD partitions but not Hive partitions

2016-08-25 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15438457#comment-15438457
 ] 

Xin Wu edited comment on SPARK-14927 at 8/26/16 4:46 AM:
-

[~smilegator] Do you think what you are working on will also fix this issue? 
The goal is to allow Hive to see the partitions created by Spark SQL from a 
DataFrame. 


was (Author: xwu0226):
[~smilegator] Do you think what you are working on regarding will fix this 
issue? This is to allow hive to see the partitions created by SparkSQL from a 
data frame. 

> DataFrame. saveAsTable creates RDD partitions but not Hive partitions
> -
>
> Key: SPARK-14927
> URL: https://issues.apache.org/jira/browse/SPARK-14927
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.1
> Environment: Mac OS X 10.11.4 local
>Reporter: Sasha Ovsankin
>
> This is a follow-up to 
> http://stackoverflow.com/questions/31341498/save-spark-dataframe-as-dynamic-partitioned-table-in-hive
>  . I tried to use the suggestions in the answers but couldn't make it work in 
> Spark 1.6.1.
> I am trying to create partitions programmatically from a `DataFrame`. Here is 
> the relevant code (adapted from a Spark test):
> hc.setConf("hive.metastore.warehouse.dir", "tmp/tests")
> //hc.setConf("hive.exec.dynamic.partition", "true")
> //hc.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
> hc.sql("create database if not exists tmp")
> hc.sql("drop table if exists tmp.partitiontest1")
> Seq(2012 -> "a").toDF("year", "val")
>   .write
>   .partitionBy("year")
>   .mode(SaveMode.Append)
>   .saveAsTable("tmp.partitiontest1")
> hc.sql("show partitions tmp.partitiontest1").show
> Full file is here: 
> https://gist.github.com/SashaOv/7c65f03a51c7e8f9c9e018cd42aa4c4a
> I get the error that the table is not partitioned:
> ==
> HIVE FAILURE OUTPUT
> ==
> SET hive.support.sql11.reserved.keywords=false
> SET hive.metastore.warehouse.dir=tmp/tests
> OK
> OK
> FAILED: Execution Error, return code 1 from 
> org.apache.hadoop.hive.ql.exec.DDLTask. Table tmp.partitiontest1 is not a 
> partitioned table
> ==
> It looks like the root cause is that 
> `org.apache.spark.sql.hive.HiveMetastoreCatalog.newSparkSQLSpecificMetastoreTable`
>  always creates the table with empty partitions.
> Any help to move this forward is appreciated.
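One commonly suggested workaround for this class of problem (an assumption on my part, not verified against this report) is to declare the partitioned table with Hive DDL first, so the metastore knows about the partition column, and then append with insertInto instead of saveAsTable:

{code}
// Hypothetical workaround sketch (untested here): create the partitioned table
// via Hive DDL, then append by position with insertInto.
hc.sql("""CREATE TABLE IF NOT EXISTS tmp.partitiontest1 (val STRING)
          PARTITIONED BY (year INT) STORED AS PARQUET""")
hc.setConf("hive.exec.dynamic.partition.mode", "nonstrict")

Seq(2012 -> "a").toDF("year", "val")
  .select("val", "year")      // insertInto matches by position; partition column last
  .write
  .mode(SaveMode.Append)
  .insertInto("tmp.partitiontest1")
{code}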






[jira] [Commented] (SPARK-14927) DataFrame. saveAsTable creates RDD partitions but not Hive partitions

2016-08-25 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15438457#comment-15438457
 ] 

Xin Wu commented on SPARK-14927:


[~smilegator] Do you think what you are working on will fix this issue? The 
goal is to allow Hive to see the partitions created by Spark SQL from a 
DataFrame. 

> DataFrame. saveAsTable creates RDD partitions but not Hive partitions
> -
>
> Key: SPARK-14927
> URL: https://issues.apache.org/jira/browse/SPARK-14927
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.1
> Environment: Mac OS X 10.11.4 local
>Reporter: Sasha Ovsankin
>
> This is a follow-up to 
> http://stackoverflow.com/questions/31341498/save-spark-dataframe-as-dynamic-partitioned-table-in-hive
>  . I tried to use the suggestions in the answers but couldn't make it work in 
> Spark 1.6.1.
> I am trying to create partitions programmatically from a `DataFrame`. Here is 
> the relevant code (adapted from a Spark test):
> hc.setConf("hive.metastore.warehouse.dir", "tmp/tests")
> //hc.setConf("hive.exec.dynamic.partition", "true")
> //hc.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
> hc.sql("create database if not exists tmp")
> hc.sql("drop table if exists tmp.partitiontest1")
> Seq(2012 -> "a").toDF("year", "val")
>   .write
>   .partitionBy("year")
>   .mode(SaveMode.Append)
>   .saveAsTable("tmp.partitiontest1")
> hc.sql("show partitions tmp.partitiontest1").show
> Full file is here: 
> https://gist.github.com/SashaOv/7c65f03a51c7e8f9c9e018cd42aa4c4a
> I get the error that the table is not partitioned:
> ==
> HIVE FAILURE OUTPUT
> ==
> SET hive.support.sql11.reserved.keywords=false
> SET hive.metastore.warehouse.dir=tmp/tests
> OK
> OK
> FAILED: Execution Error, return code 1 from 
> org.apache.hadoop.hive.ql.exec.DDLTask. Table tmp.partitiontest1 is not a 
> partitioned table
> ==
> It looks like the root cause is that 
> `org.apache.spark.sql.hive.HiveMetastoreCatalog.newSparkSQLSpecificMetastoreTable`
>  always creates the table with empty partitions.
> Any help to move this forward is appreciated.






[jira] [Resolved] (SPARK-17242) Update links of external dstream projects

2016-08-25 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-17242.
-
   Resolution: Fixed
Fix Version/s: 2.1.0
   2.0.1

> Update links of external dstream projects
> -
>
> Key: SPARK-17242
> URL: https://issues.apache.org/jira/browse/SPARK-17242
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.0.1, 2.1.0
>
>







[jira] [Commented] (SPARK-16998) select($"column1", explode($"column2")) is extremely slow

2016-08-25 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15438429#comment-15438429
 ] 

Takeshi Yamamuro commented on SPARK-16998:
--

If there's no problem, I'll pick up the PR.

> select($"column1", explode($"column2")) is extremely slow
> -
>
> Key: SPARK-16998
> URL: https://issues.apache.org/jira/browse/SPARK-16998
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: TobiasP
>
> Using a Dataset containing 10,000 rows, each containing null and an array of 
> 5,000 Ints, I observe the following performance (in local mode):
> {noformat}
> scala> time(ds.select(explode($"value")).sample(false, 0.001, 1).collect)
> 1.219052 seconds  
>   
> res9: Array[org.apache.spark.sql.Row] = Array([3761], [3766], [3196])
> scala> time(ds.select($"dummy", explode($"value")).sample(false, 0.001, 
> 1).collect)
> 20.219447 seconds 
>   
> res5: Array[org.apache.spark.sql.Row] = Array([null,3761], [null,3766], 
> [null,3196])
> {noformat}
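A sketch of how the reported setup might be reproduced (the data construction and the time helper are assumptions, not taken from the report):

{code}
// Assumed reproduction sketch: 10,000 rows, each with a null "dummy" column and
// an array of 5,000 Ints, plus a crude timing helper.
import org.apache.spark.sql.functions.explode
import spark.implicits._

def time[T](body: => T): T = {
  val start = System.nanoTime()
  val result = body
  println(f"${(System.nanoTime() - start) / 1e9}%.6f seconds")
  result
}

val ds = Seq.fill(10000)((null: String, (0 until 5000).toArray))
  .toDF("dummy", "value")
  .cache()

time(ds.select(explode($"value")).sample(false, 0.001, 1).collect())
time(ds.select($"dummy", explode($"value")).sample(false, 0.001, 1).collect())
{code}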






[jira] [Commented] (SPARK-12394) Support writing out pre-hash-partitioned data and exploit that in join optimizations to avoid shuffle (i.e. bucketing in Hive)

2016-08-25 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15438423#comment-15438423
 ] 

Tejas Patil commented on SPARK-12394:
-

[~alex.n.ja...@gmail.com] [~kotim...@gmail.com]: The doc lists two things for 
the future at the end. I am working on the last one: 
https://issues.apache.org/jira/browse/SPARK-15453. I am not sure if the `Filter 
on sorted data` one is already being worked on, but I can work on that as well 
(I just created a jira for that: 
https://issues.apache.org/jira/browse/SPARK-17254).

> Support writing out pre-hash-partitioned data and exploit that in join 
> optimizations to avoid shuffle (i.e. bucketing in Hive)
> --
>
> Key: SPARK-12394
> URL: https://issues.apache.org/jira/browse/SPARK-12394
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Nong Li
> Fix For: 2.0.0
>
> Attachments: BucketedTables.pdf
>
>
> In many cases users know ahead of time the columns that they will be joining 
> or aggregating on. Ideally they should be able to leverage this information 
> and pre-shuffle the data so that subsequent queries do not require a shuffle. 
> Hive supports this functionality by allowing the user to define buckets, 
> which are hash partitions of the data based on some key.
>  - Allow the user to specify a set of columns when caching or writing out data
>  - Allow the user to specify some parallelism
>  - Shuffle the data when writing / caching such that it is distributed by these 
> columns
>  - When planning/executing a query, use this distribution to avoid another 
> shuffle when reading, assuming the join or aggregation is compatible with the 
> columns specified
>  - Should work with existing save modes: append, overwrite, etc.
>  - Should work at least with all Hadoop FS data sources
>  - Should work with any data source when caching






[jira] [Created] (SPARK-17254) Filter operator should have “stop if false” semantics for sorted data

2016-08-25 Thread Tejas Patil (JIRA)
Tejas Patil created SPARK-17254:
---

 Summary: Filter operator should have “stop if false” semantics for 
sorted data
 Key: SPARK-17254
 URL: https://issues.apache.org/jira/browse/SPARK-17254
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Tejas Patil
Priority: Minor


From 
https://issues.apache.org/jira/secure/attachment/12778890/BucketedTables.pdf:

Filter on sorted data

If the data is sorted by a key, filters on the key could stop as soon as the 
data is out of range. For example, WHERE ticker_id < “F” should stop as soon as 
the first row starting with “F” is seen. This can be done by adding a Filter 
operator that has “stop if false” semantics. This is generally useful.
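As a purely illustrative analogy in plain Scala collections (not Spark code), the difference between a full filter scan and a stop-if-false filter over sorted input:

{code}
// Illustration only: on sorted input, takeWhile can stop at the first failing
// element, while filter must scan every element. A "stop if false" Filter
// operator would give Spark SQL the takeWhile-style early exit.
val sortedTickers = Vector("AAPL", "AMZN", "BAC", "CSCO", "DIS", "FB", "GOOG")

val scansAll   = sortedTickers.filter(_ < "F")    // visits every element
val stopsEarly = sortedTickers.takeWhile(_ < "F") // stops at "FB"

assert(scansAll == stopsEarly)  // same result, different amount of work
{code}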






[jira] [Created] (SPARK-17253) Left join where ON clause does not reference the right table produces analysis error

2016-08-25 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-17253:
--

 Summary: Left join where ON clause does not reference the right 
table produces analysis error
 Key: SPARK-17253
 URL: https://issues.apache.org/jira/browse/SPARK-17253
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Josh Rosen
Priority: Minor


The following query produces an AnalysisException:

{code}
CREATE TABLE currency (
 cur CHAR(3)
);

CREATE TABLE exchange (
 cur1 CHAR(3),
 cur2 CHAR(3),
 rate double
);

INSERT INTO currency VALUES ('EUR');
INSERT INTO currency VALUES ('GBP');
INSERT INTO currency VALUES ('USD');

INSERT INTO exchange VALUES ('EUR', 'GBP', 0.85);
INSERT INTO exchange VALUES ('GBP', 'EUR', 1.0/0.85);

SELECT c1.cur cur1, c2.cur cur2, COALESCE(self.rate, x.rate) rate
FROM currency c1
CROSS JOIN currency c2
LEFT JOIN exchange x
   ON x.cur1=c1.cur
   AND x.cur2=c2.cur
LEFT JOIN (SELECT 1 rate) self
   ON c1.cur=c2.cur;
{code}

{code}
AnalysisException: cannot resolve '`c1.cur`' given input columns: [cur, cur1, 
cur2, rate]; line 5 pos 13
{code}

However, this query runs in SQLite3 and PostgreSQL. This example query was 
adapted from https://www.sqlite.org/src/tktview?name=ebdbadade5, a SQLite bug 
report in which this query gave a wrong answer.






[jira] [Created] (SPARK-17252) Performing arithmetic in VALUES can lead to ClassCastException / MatchErrors during query parsing

2016-08-25 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-17252:
--

 Summary: Performing arithmetic in VALUES can lead to 
ClassCastException / MatchErrors during query parsing
 Key: SPARK-17252
 URL: https://issues.apache.org/jira/browse/SPARK-17252
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Josh Rosen


The following example fails with a ClassCastException:

{code}
create table t(d double);
insert into t VALUES (1 * 1.0);
{code}

 Here's the error:

{code}
java.lang.ClassCastException: org.apache.spark.sql.types.Decimal cannot be cast 
to java.lang.Integer
at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:106)
at scala.math.Numeric$IntIsIntegral$.times(Numeric.scala:57)
at 
org.apache.spark.sql.catalyst.expressions.Multiply.nullSafeEval(arithmetic.scala:207)
at 
org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:416)
at 
org.apache.spark.sql.catalyst.expressions.CreateStruct$$anonfun$eval$2.apply(complexTypeCreator.scala:198)
at 
org.apache.spark.sql.catalyst.expressions.CreateStruct$$anonfun$eval$2.apply(complexTypeCreator.scala:198)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at 
org.apache.spark.sql.catalyst.expressions.CreateStruct.eval(complexTypeCreator.scala:198)
at 
org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:320)
at 
org.apache.spark.sql.catalyst.parser.AstBuilder$$anonfun$visitInlineTable$1$$anonfun$39.apply(AstBuilder.scala:677)
at 
org.apache.spark.sql.catalyst.parser.AstBuilder$$anonfun$visitInlineTable$1$$anonfun$39.apply(AstBuilder.scala:674)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at 
org.apache.spark.sql.catalyst.parser.AstBuilder$$anonfun$visitInlineTable$1.apply(AstBuilder.scala:674)
at 
org.apache.spark.sql.catalyst.parser.AstBuilder$$anonfun$visitInlineTable$1.apply(AstBuilder.scala:658)
at 
org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:96)
at 
org.apache.spark.sql.catalyst.parser.AstBuilder.visitInlineTable(AstBuilder.scala:658)
at 
org.apache.spark.sql.catalyst.parser.AstBuilder.visitInlineTable(AstBuilder.scala:43)
at 
org.apache.spark.sql.catalyst.parser.SqlBaseParser$InlineTableContext.accept(SqlBaseParser.java:9358)
at 
org.apache.spark.sql.catalyst.parser.AstBuilder.visitChildren(AstBuilder.scala:57)
at 
org.apache.spark.sql.catalyst.parser.SqlBaseBaseVisitor.visitInlineTableDefault1(SqlBaseBaseVisitor.java:608)
at 
org.apache.spark.sql.catalyst.parser.SqlBaseParser$InlineTableDefault1Context.accept(SqlBaseParser.java:7073)
at 
org.apache.spark.sql.catalyst.parser.AstBuilder.visitChildren(AstBuilder.scala:57)
at 
org.apache.spark.sql.catalyst.parser.SqlBaseBaseVisitor.visitQueryTermDefault(SqlBaseBaseVisitor.java:580)
at 
org.apache.spark.sql.catalyst.parser.SqlBaseParser$QueryTermDefaultContext.accept(SqlBaseParser.java:6895)
at 
org.apache.spark.sql.catalyst.parser.AstBuilder.typedVisit(AstBuilder.scala:47)
at 
org.apache.spark.sql.catalyst.parser.AstBuilder.plan(AstBuilder.scala:83)
at 
org.apache.spark.sql.catalyst.parser.AstBuilder$$anonfun$visitSingleInsertQuery$1.apply(AstBuilder.scala:158)
at 
org.apache.spark.sql.catalyst.parser.AstBuilder$$anonfun$visitSingleInsertQuery$1.apply(AstBuilder.scala:162)
at 
org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:96)
at 
org.apache.spark.sql.catalyst.parser.AstBuilder.visitSingleInsertQuery(AstBuilder.scala:157)
at 
org.apache.spark.sql.catalyst.parser.AstBuilder.visitSingleInsertQuery(AstBuilder.scala:43)
at 
org.apache.spark.sql.catalyst.parser.SqlBaseParser$SingleInsertQueryContext.accept(SqlBaseParser.java:6500)
at 
org.apache.spark.sql.catalyst.parser.AstBuilder.typedVisit(AstBuilder.scala:47)
at 
org.apache.spark.sql.catalyst.parser.AstBuilder.plan(AstBuilder.scala:83)
at 

[jira] [Created] (SPARK-17251) "ClassCastException: OuterReference cannot be cast to NamedExpression" for correlated subquery on the RHS of an IN operator

2016-08-25 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-17251:
--

 Summary: "ClassCastException: OuterReference cannot be cast to 
NamedExpression" for correlated subquery on the RHS of an IN operator
 Key: SPARK-17251
 URL: https://issues.apache.org/jira/browse/SPARK-17251
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Josh Rosen


The following test case produces a ClassCastException in the analyzer:

{code}
CREATE TABLE t1(a INTEGER);
INSERT INTO t1 VALUES(1),(2);
CREATE TABLE t2(b INTEGER);
INSERT INTO t2 VALUES(1);

SELECT a FROM t1 WHERE a NOT IN (SELECT a FROM t2);
{code}

Here's the exception:

{code}
java.lang.ClassCastException: 
org.apache.spark.sql.catalyst.expressions.OuterReference cannot be cast to 
org.apache.spark.sql.catalyst.expressions.NamedExpression
at 
org.apache.spark.sql.catalyst.plans.logical.Project$$anonfun$1.apply(basicLogicalOperators.scala:48)
at 
scala.collection.LinearSeqOptimized$class.exists(LinearSeqOptimized.scala:80)
at scala.collection.immutable.List.exists(List.scala:84)
at 
org.apache.spark.sql.catalyst.plans.logical.Project.resolved$lzycompute(basicLogicalOperators.scala:44)
at 
org.apache.spark.sql.catalyst.plans.logical.Project.resolved(basicLogicalOperators.scala:43)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveSubquery$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveSubquery$$resolveSubQuery(Analyzer.scala:1091)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveSubquery$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveSubquery$$resolveSubQueries$1.applyOrElse(Analyzer.scala:1130)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveSubquery$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveSubquery$$resolveSubQueries$1.applyOrElse(Analyzer.scala:1116)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:278)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionDown$1(QueryPlan.scala:156)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:166)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$4.apply(QueryPlan.scala:175)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:175)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:144)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveSubquery$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveSubquery$$resolveSubQueries(Analyzer.scala:1116)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveSubquery$$anonfun$apply$16.applyOrElse(Analyzer.scala:1148)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveSubquery$$anonfun$apply$16.applyOrElse(Analyzer.scala:1141)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$1.apply(LogicalPlan.scala:58)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$1.apply(LogicalPlan.scala:58)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
at 

[jira] [Commented] (SPARK-13525) SparkR: java.net.SocketTimeoutException: Accept timed out when running any dataframe function

2016-08-25 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15438331#comment-15438331
 ] 

Sun Rui commented on SPARK-13525:
-

Sorry, due to missing log information at that point, it is hard to determine 
the root cause. It seems that you are using an old SparkR version. I am 
wondering if it is possible for you to modify the R code to collect more 
information?

Also, you can try setting "spark.sparkr.use.daemon" to false to see whether this 
issue goes away.

> SparkR: java.net.SocketTimeoutException: Accept timed out when running any 
> dataframe function
> -
>
> Key: SPARK-13525
> URL: https://issues.apache.org/jira/browse/SPARK-13525
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Shubhanshu Mishra
>  Labels: sparkr
>
> I am following the code steps from this example:
> https://spark.apache.org/docs/1.6.0/sparkr.html
> There are multiple issues: 
> 1. The head, summary, and filter methods are not overridden by Spark. Hence 
> I need to call them using the `SparkR::` namespace.
> 2. When I try to execute the following, I get errors:
> {code}
> $> $R_HOME/bin/R
> R version 3.2.3 (2015-12-10) -- "Wooden Christmas-Tree"
> Copyright (C) 2015 The R Foundation for Statistical Computing
> Platform: x86_64-pc-linux-gnu (64-bit)
> R is free software and comes with ABSOLUTELY NO WARRANTY.
> You are welcome to redistribute it under certain conditions.
> Type 'license()' or 'licence()' for distribution details.
>   Natural language support but running in an English locale
> R is a collaborative project with many contributors.
> Type 'contributors()' for more information and
> 'citation()' on how to cite R or R packages in publications.
> Type 'demo()' for some demos, 'help()' for on-line help, or
> 'help.start()' for an HTML browser interface to help.
> Type 'q()' to quit R.
> Welcome at Fri Feb 26 16:19:35 2016 
> Attaching package: ‘SparkR’
> The following objects are masked from ‘package:base’:
> colnames, colnames<-, drop, intersect, rank, rbind, sample, subset,
> summary, transform
> Launching java with spark-submit command 
> /content/smishra8/SOFTWARE/spark/bin/spark-submit   --driver-memory "50g" 
> sparkr-shell /tmp/RtmpfBQRg6/backend_portc3bc16f09b1b 
> > df <- createDataFrame(sqlContext, iris)
> Warning messages:
> 1: In FUN(X[[i]], ...) :
>   Use Sepal_Length instead of Sepal.Length  as column name
> 2: In FUN(X[[i]], ...) :
>   Use Sepal_Width instead of Sepal.Width  as column name
> 3: In FUN(X[[i]], ...) :
>   Use Petal_Length instead of Petal.Length  as column name
> 4: In FUN(X[[i]], ...) :
>   Use Petal_Width instead of Petal.Width  as column name
> > training <- filter(df, df$Species != "setosa")
> Error in filter(df, df$Species != "setosa") : 
>   no method for coercing this S4 class to a vector
> > training <- SparkR::filter(df, df$Species != "setosa")
> > model <- SparkR::glm(Species ~ Sepal_Length + Sepal_Width, data = training, 
> > family = "binomial")
> 16/02/26 16:26:46 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.net.SocketTimeoutException: Accept timed out
> at java.net.PlainSocketImpl.socketAccept(Native Method)
> at 
> java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:398)
> at java.net.ServerSocket.implAccept(ServerSocket.java:530)
> at java.net.ServerSocket.accept(ServerSocket.java:498)
> at org.apache.spark.api.r.RRDD$.createRWorker(RRDD.scala:431)
> at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:62)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at 

[jira] [Updated] (SPARK-16283) Implement percentile_approx SQL function

2016-08-25 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-16283:
---
Assignee: (was: Sean Zhong)

> Implement percentile_approx SQL function
> 
>
> Key: SPARK-16283
> URL: https://issues.apache.org/jira/browse/SPARK-16283
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>







[jira] [Commented] (SPARK-12394) Support writing out pre-hash-partitioned data and exploit that in join optimizations to avoid shuffle (i.e. bucketing in Hive)

2016-08-25 Thread Darren Fu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15438317#comment-15438317
 ] 

Darren Fu commented on SPARK-12394:
---

Any luck getting this feature implemented in v2.0?

> Support writing out pre-hash-partitioned data and exploit that in join 
> optimizations to avoid shuffle (i.e. bucketing in Hive)
> --
>
> Key: SPARK-12394
> URL: https://issues.apache.org/jira/browse/SPARK-12394
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Nong Li
> Fix For: 2.0.0
>
> Attachments: BucketedTables.pdf
>
>
> In many cases users know ahead of time the columns that they will be joining 
> or aggregating on. Ideally they should be able to leverage this information 
> and pre-shuffle the data so that subsequent queries do not require a shuffle. 
> Hive supports this functionality by allowing the user to define buckets, 
> which are hash partitions of the data based on some key.
>  - Allow the user to specify a set of columns when caching or writing out data
>  - Allow the user to specify some parallelism
>  - Shuffle the data when writing / caching such that it is distributed by these 
> columns
>  - When planning/executing a query, use this distribution to avoid another 
> shuffle when reading, assuming the join or aggregation is compatible with the 
> columns specified
>  - Should work with existing save modes: append, overwrite, etc.
>  - Should work at least with all Hadoop FS data sources
>  - Should work with any data source when caching






[jira] [Commented] (SPARK-13525) SparkR: java.net.SocketTimeoutException: Accept timed out when running any dataframe function

2016-08-25 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15438316#comment-15438316
 ] 

Sun Rui commented on SPARK-13525:
-

I checked the code and realized that SocketTimeoutException means the R worker 
process should have been started. Otherwise, a different exception would be 
thrown from ProcessBuilder.start() if there were any problem starting the R 
worker process.

We don't know the root cause of this kind of issue; it may be due to a bug or 
just to problems in the system runtime environment. But at least we can:
1. Update the documentation for setting up SparkR: state clearly that R must be 
installed on each node, and that the R worker executable can be configured to a 
proper path.
2. Update RRunner to enlarge the scope of the try block to cover the 
connection establishment, so that the stderr output of the R process can be 
printed when SocketTimeoutException is thrown. That will help us find the 
root cause.

> SparkR: java.net.SocketTimeoutException: Accept timed out when running any 
> dataframe function
> -
>
> Key: SPARK-13525
> URL: https://issues.apache.org/jira/browse/SPARK-13525
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Shubhanshu Mishra
>  Labels: sparkr
>
> I am following the code steps from this example:
> https://spark.apache.org/docs/1.6.0/sparkr.html
> There are multiple issues: 
> 1. The head, summary, and filter methods are not overridden by Spark. Hence 
> I need to call them using the `SparkR::` namespace.
> 2. When I try to execute the following, I get errors:
> {code}
> $> $R_HOME/bin/R
> R version 3.2.3 (2015-12-10) -- "Wooden Christmas-Tree"
> Copyright (C) 2015 The R Foundation for Statistical Computing
> Platform: x86_64-pc-linux-gnu (64-bit)
> R is free software and comes with ABSOLUTELY NO WARRANTY.
> You are welcome to redistribute it under certain conditions.
> Type 'license()' or 'licence()' for distribution details.
>   Natural language support but running in an English locale
> R is a collaborative project with many contributors.
> Type 'contributors()' for more information and
> 'citation()' on how to cite R or R packages in publications.
> Type 'demo()' for some demos, 'help()' for on-line help, or
> 'help.start()' for an HTML browser interface to help.
> Type 'q()' to quit R.
> Welcome at Fri Feb 26 16:19:35 2016 
> Attaching package: ‘SparkR’
> The following objects are masked from ‘package:base’:
> colnames, colnames<-, drop, intersect, rank, rbind, sample, subset,
> summary, transform
> Launching java with spark-submit command 
> /content/smishra8/SOFTWARE/spark/bin/spark-submit   --driver-memory "50g" 
> sparkr-shell /tmp/RtmpfBQRg6/backend_portc3bc16f09b1b 
> > df <- createDataFrame(sqlContext, iris)
> Warning messages:
> 1: In FUN(X[[i]], ...) :
>   Use Sepal_Length instead of Sepal.Length  as column name
> 2: In FUN(X[[i]], ...) :
>   Use Sepal_Width instead of Sepal.Width  as column name
> 3: In FUN(X[[i]], ...) :
>   Use Petal_Length instead of Petal.Length  as column name
> 4: In FUN(X[[i]], ...) :
>   Use Petal_Width instead of Petal.Width  as column name
> > training <- filter(df, df$Species != "setosa")
> Error in filter(df, df$Species != "setosa") : 
>   no method for coercing this S4 class to a vector
> > training <- SparkR::filter(df, df$Species != "setosa")
> > model <- SparkR::glm(Species ~ Sepal_Length + Sepal_Width, data = training, 
> > family = "binomial")
> 16/02/26 16:26:46 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.net.SocketTimeoutException: Accept timed out
> at java.net.PlainSocketImpl.socketAccept(Native Method)
> at 
> java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:398)
> at java.net.ServerSocket.implAccept(ServerSocket.java:530)
> at java.net.ServerSocket.accept(ServerSocket.java:498)
> at org.apache.spark.api.r.RRDD$.createRWorker(RRDD.scala:431)
> at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:62)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at 

[jira] [Updated] (SPARK-16283) Implement percentile_approx SQL function

2016-08-25 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-16283:
---
Assignee: Sean Zhong

> Implement percentile_approx SQL function
> 
>
> Key: SPARK-16283
> URL: https://issues.apache.org/jira/browse/SPARK-16283
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Sean Zhong
>







[jira] [Updated] (SPARK-16948) Support empty orc table when converting hive serde table to data source table

2016-08-25 Thread Rajesh Balamohan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan updated SPARK-16948:
-
Summary: Support empty orc table when converting hive serde table to data 
source table  (was: Querying empty partitioned orc tables throws exception)

> Support empty orc table when converting hive serde table to data source table
> -
>
> Key: SPARK-16948
> URL: https://issues.apache.org/jira/browse/SPARK-16948
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Rajesh Balamohan
>Priority: Minor
>
> Querying empty partitioned ORC tables from spark-sql throws exception with 
> "spark.sql.hive.convertMetastoreOrc=true".
> {noformat}
> java.util.NoSuchElementException: None.get
> at scala.None$.get(Option.scala:347)
> at scala.None$.get(Option.scala:345)
> at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog$$anonfun$12.apply(HiveMetastoreCatalog.scala:297)
> at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog$$anonfun$12.apply(HiveMetastoreCatalog.scala:284)
> at scala.Option.getOrElse(Option.scala:121)
> at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog.org$apache$spark$sql$hive$HiveMetastoreCatalog$$convertToLogicalRelation(HiveMetastoreCatalog.scala:284)
> at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog$OrcConversions$.org$apache$spark$sql$hive$HiveMetastoreCatalog$OrcConversions$$convertToOrcRelation(HiveMetastoreCatalo)
> at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog$OrcConversions$$anonfun$apply$2.applyOrElse(HiveMetastoreCatalog.scala:423)
> at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog$OrcConversions$$anonfun$apply$2.applyOrElse(HiveMetastoreCatalog.scala:414)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:322)
> {noformat}
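A reproduction sketch along the lines of the summary (assumed, not copied from the report; requires a Hive-enabled SparkSession named `spark`):

{code}
// Assumed reproduction sketch: an empty partitioned ORC Hive table queried with
// metastore ORC conversion enabled.
spark.sql("SET spark.sql.hive.convertMetastoreOrc=true")
spark.sql("""CREATE TABLE IF NOT EXISTS empty_orc_part (id INT)
             PARTITIONED BY (dt STRING) STORED AS ORC""")
// No partitions or data are added, so the table has no files yet.
spark.sql("SELECT * FROM empty_orc_part").show()
{code}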






[jira] [Commented] (SPARK-17250) Remove HiveClient and setCurrentDatabase from HiveSessionCatalog

2016-08-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15438233#comment-15438233
 ] 

Apache Spark commented on SPARK-17250:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/14821

> Remove HiveClient and setCurrentDatabase from HiveSessionCatalog
> 
>
> Key: SPARK-17250
> URL: https://issues.apache.org/jira/browse/SPARK-17250
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>
> This is the first step toward removing `HiveClient` from `HiveSessionState`. In the 
> metastore interaction, we always set fully qualified names when 
> accessing or operating on a table. That means we always specify the database. 
> Thus, it is not necessary to use `HiveClient` to change the active database 
> in the Hive metastore. 
> In `HiveSessionCatalog`, `setCurrentDatabase` is the only function that uses 
> `HiveClient`. Thus, we can remove `HiveClient` after removing `setCurrentDatabase`.






[jira] [Assigned] (SPARK-17250) Remove HiveClient and setCurrentDatabase from HiveSessionCatalog

2016-08-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17250:


Assignee: Apache Spark

> Remove HiveClient and setCurrentDatabase from HiveSessionCatalog
> 
>
> Key: SPARK-17250
> URL: https://issues.apache.org/jira/browse/SPARK-17250
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> This is the first step toward removing `HiveClient` from `HiveSessionState`. In the 
> metastore interaction, we always set fully qualified names when 
> accessing or operating on a table. That means we always specify the database. 
> Thus, it is not necessary to use `HiveClient` to change the active database 
> in the Hive metastore. 
> In `HiveSessionCatalog`, `setCurrentDatabase` is the only function that uses 
> `HiveClient`. Thus, we can remove `HiveClient` after removing `setCurrentDatabase`.






[jira] [Created] (SPARK-17250) Remove HiveClient and setCurrentDatabase from HiveSessionCatalog

2016-08-25 Thread Xiao Li (JIRA)
Xiao Li created SPARK-17250:
---

 Summary: Remove HiveClient and setCurrentDatabase from 
HiveSessionCatalog
 Key: SPARK-17250
 URL: https://issues.apache.org/jira/browse/SPARK-17250
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.1.0
Reporter: Xiao Li


This is the first step toward removing `HiveClient` from `HiveSessionState`. In the 
metastore interaction, we always set fully qualified names when 
accessing or operating on a table. That means we always specify the database. Thus, 
it is not necessary to use `HiveClient` to change the active database in the Hive 
metastore. 

In `HiveSessionCatalog`, `setCurrentDatabase` is the only function that uses 
`HiveClient`. Thus, we can remove `HiveClient` after removing `setCurrentDatabase`.







[jira] [Updated] (SPARK-17212) TypeCoercion support widening conversion between DateType and TimestampType

2016-08-25 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-17212:

Assignee: Hyukjin Kwon

> TypeCoercion support widening conversion between DateType and TimestampType
> ---
>
> Key: SPARK-17212
> URL: https://issues.apache.org/jira/browse/SPARK-17212
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
> Fix For: 2.1.0
>
>
> Currently, type-widening does not work between {{TimestampType}} and 
> {{DateType}}.
> This applies to {{SetOperation}}, {{Union}}, {{In}}, {{CaseWhen}}, 
> {{Greatest}}, {{Least}}, {{CreateArray}}, {{CreateMap}} and {{Coalesce}}.
> For a simple example, 
> {code}
> Seq(Tuple2(new Timestamp(0), new Date(0))).toDF("a", 
> "b").selectExpr("greatest(a, b)").show()
> {code}
> {code}
> cannot resolve 'greatest(`a`, `b`)' due to data type mismatch: The 
> expressions should all have the same type, got GREATEST(timestamp, date)
> {code}
> or Union as below:
> {code}
> val a = Seq(Tuple1(new Timestamp(0))).toDF()
> val b = Seq(Tuple1(new Date(0))).toDF()
> a.union(b).show()
> {code}
> {code}
> Union can only be performed on tables with the compatible column types. 
> DateType <> TimestampType at the first column of the second table;
> {code}
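For versions without this fix, one possible workaround (my suggestion, not part of the ticket) is to widen the DateType side explicitly before combining:

{code}
// Possible workaround sketch for pre-fix versions: cast the DateType column to
// TimestampType yourself so both sides of the union have the same type.
import java.sql.{Date, Timestamp}
import spark.implicits._

val a = Seq(Tuple1(new Timestamp(0))).toDF("ts")
val b = Seq(Tuple1(new Date(0))).toDF("ts")

a.union(b.select($"ts".cast("timestamp").as("ts"))).show()
{code}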






[jira] [Resolved] (SPARK-17212) TypeCoercion support widening conversion between DateType and TimestampType

2016-08-25 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-17212.
-
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 14786
[https://github.com/apache/spark/pull/14786]

> TypeCoercion support widening conversion between DateType and TimestampType
> ---
>
> Key: SPARK-17212
> URL: https://issues.apache.org/jira/browse/SPARK-17212
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Hyukjin Kwon
> Fix For: 2.1.0
>
>
> Currently, type-widening does not work between {{TimestampType}} and 
> {{DateType}}.
> This applies to {{SetOperation}}, {{Union}}, {{In}}, {{CaseWhen}}, 
> {{Greatest}}, {{Least}}, {{CreateArray}}, {{CreateMap}} and {{Coalesce}}.
> For a simple example, 
> {code}
> Seq(Tuple2(new Timestamp(0), new Date(0))).toDF("a", 
> "b").selectExpr("greatest(a, b)").show()
> {code}
> {code}
> cannot resolve 'greatest(`a`, `b`)' due to data type mismatch: The 
> expressions should all have the same type, got GREATEST(timestamp, date)
> {code}
> or Union as below:
> {code}
> val a = Seq(Tuple1(new Timestamp(0))).toDF()
> val b = Seq(Tuple1(new Date(0))).toDF()
> a.union(b).show()
> {code}
> {code}
> Union can only be performed on tables with the compatible column types. 
> DateType <> TimestampType at the first column of the second table;
> {code}






[jira] [Created] (SPARK-17249) java.lang.IllegalStateException: Did not find registered driver with class org.apache.spark.sql.execution.datasources.jdbc.DriverWrapper

2016-08-25 Thread Graeme Edwards (JIRA)
Graeme Edwards created SPARK-17249:
--

 Summary: java.lang.IllegalStateException: Did not find registered 
driver with class org.apache.spark.sql.execution.datasources.jdbc.DriverWrapper 
 Key: SPARK-17249
 URL: https://issues.apache.org/jira/browse/SPARK-17249
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.0
Reporter: Graeme Edwards
Priority: Minor


This issue is a corner case relating to SPARK-14162 that isn't fixed by that 
change.

It occurs when:
- We are using Oracle's ojdbc driver.
- The driver is wrapped in a DriverWrapper because it is added via the 
Spark class loader.
- We don't specify an explicit "driver" property.

Then in /org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala 
(createConnectionFactory), the driverClass is determined as:

 val driverClass: String = userSpecifiedDriverClass.getOrElse {
  DriverManager.getDriver(url).getClass.getCanonicalName
}

Since the Driver is wrapped by a DriverWrapper, this will be 
"org.apache.spark.sql.execution.datasources.jdbc.DriverWrapper".

That gets passed to the Executor, which will attempt to find a matching driver 
with the name "org.apache.spark.sql.execution.datasources.jdbc.DriverWrapper". 
However, the Executor is aware of the wrapping and will compare against the 
wrapped class name instead:

  case d: DriverWrapper if d.wrapped.getClass.getCanonicalName == driverClass 
=> d

I think the fix is just to change the initialization of driverClass to also be 
aware that there might be a wrapper and, if so, to pass the wrapped class name.
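A sketch of what such a fix might look like, based on my reading of the suggestion above (not the actual patch; it reuses the names from the quoted createConnectionFactory snippet):

{code}
// Hypothetical sketch of the suggested fix: unwrap DriverWrapper before
// recording the driver class name, so the executor-side lookup matches the
// wrapped class.
val driverClass: String = userSpecifiedDriverClass.getOrElse {
  DriverManager.getDriver(url) match {
    case wrapper: DriverWrapper => wrapper.wrapped.getClass.getCanonicalName
    case driver => driver.getClass.getCanonicalName
  }
}
{code}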

The problem can be worked around by setting the driver property for the jdbc 
call:

val props = new java.util.Properties()
props.put("driver", "oracle.jdbc.OracleDriver")
val result = sqlContext.read.jdbc(connectionString, query, props)










[jira] [Created] (SPARK-17248) Add native Scala enum support to Dataset Encoders

2016-08-25 Thread Silvio Fiorito (JIRA)
Silvio Fiorito created SPARK-17248:
--

 Summary: Add native Scala enum support to Dataset Encoders
 Key: SPARK-17248
 URL: https://issues.apache.org/jira/browse/SPARK-17248
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.0.0
Reporter: Silvio Fiorito


Enable support for Scala enums in Encoders. Ideally, users should be able to 
use enums as part of case classes automatically.

Currently, this code...

{code}
object MyEnum extends Enumeration {
  type MyEnum = Value
  val EnumVal1, EnumVal2 = Value
}

case class MyClass(col: MyEnum.MyEnum)

val data = Seq(MyClass(MyEnum.EnumVal1), MyClass(MyEnum.EnumVal2)).toDS()
{code}

...results in this stacktrace:

{code}
java.lang.UnsupportedOperationException: No Encoder found for MyEnum.MyEnum
- field (class: "scala.Enumeration.Value", name: "col")
- root class: 
"line550c9f34c5144aa1a1e76bcac863244717.$read.$iwC.$iwC.$iwC.$iwC.MyClass"
at 
org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:598)
at 
org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$9.apply(ScalaReflection.scala:592)
at 
org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$9.apply(ScalaReflection.scala:583)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.immutable.List.foreach(List.scala:318)
at 
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
at 
org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:583)
at 
org.apache.spark.sql.catalyst.ScalaReflection$.serializerFor(ScalaReflection.scala:425)
at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:61)
at org.apache.spark.sql.Encoders$.product(Encoders.scala:274)
at 
org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:47)
{code}
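Until such support exists, one common workaround (an assumption, not taken from the report) is to persist the enum as its String name and convert at the boundaries; a sketch assuming `spark.implicits._` is available:

{code}
// Hypothetical workaround sketch: store the enum value as its String name,
// which already has an Encoder, and convert at the edges.
object MyEnum extends Enumeration {
  type MyEnum = Value
  val EnumVal1, EnumVal2 = Value
}

case class MyClass(col: MyEnum.MyEnum)
case class MyClassRow(col: String)   // the shape that actually gets encoded

import spark.implicits._

val data = Seq(MyClass(MyEnum.EnumVal1), MyClass(MyEnum.EnumVal2))
  .map(c => MyClassRow(c.col.toString))   // enum -> String before creating the Dataset
  .toDS()

// Convert back on the driver side after collecting.
val back = data.collect().map(r => MyClass(MyEnum.withName(r.col)))
{code}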






[jira] [Assigned] (SPARK-17246) Support BigDecimal literal parsing

2016-08-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17246:


Assignee: Herman van Hovell  (was: Apache Spark)

> Support BigDecimal literal parsing
> --
>
> Key: SPARK-17246
> URL: https://issues.apache.org/jira/browse/SPARK-17246
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
>Priority: Minor
>







[jira] [Assigned] (SPARK-17246) Support BigDecimal literal parsing

2016-08-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17246:


Assignee: Apache Spark  (was: Herman van Hovell)

> Support BigDecimal literal parsing
> --
>
> Key: SPARK-17246
> URL: https://issues.apache.org/jira/browse/SPARK-17246
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Herman van Hovell
>Assignee: Apache Spark
>Priority: Minor
>







[jira] [Commented] (SPARK-17246) Support BigDecimal literal parsing

2016-08-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15438152#comment-15438152
 ] 

Apache Spark commented on SPARK-17246:
--

User 'hvanhovell' has created a pull request for this issue:
https://github.com/apache/spark/pull/14819

> Support BigDecimal literal parsing
> --
>
> Key: SPARK-17246
> URL: https://issues.apache.org/jira/browse/SPARK-17246
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
>Priority: Minor
>







[jira] [Updated] (SPARK-17187) Support using arbitrary Java object as internal aggregation buffer object

2016-08-25 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-17187:
-
Assignee: Sean Zhong

> Support using arbitrary Java object as internal aggregation buffer object
> -
>
> Key: SPARK-17187
> URL: https://issues.apache.org/jira/browse/SPARK-17187
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Sean Zhong
>Assignee: Sean Zhong
> Fix For: 2.1.0
>
>
> *Background*
> For aggregation functions like sum and count, Spark SQL internally uses an 
> aggregation buffer to store the intermediate aggregation result for all 
> aggregation functions. Each aggregation function occupies a section of the 
> aggregation buffer.
> *Problem*
> Currently, Spark SQL only allows a small set of Spark SQL-supported storage 
> data types to be stored in the aggregation buffer, which is not very convenient 
> or performant. There are several typical cases:
> 1. If the aggregation has a complex model such as CountMinSketch, it is not easy 
> to convert the complex model so that it can be stored with the limited set of 
> Spark SQL-supported data types.
> 2. It is hard to reuse aggregation class definitions from existing 
> libraries like Algebird.
> 3. It may introduce heavy serialization/deserialization cost when converting 
> a domain model to a Spark SQL-supported data type. For example, the current 
> implementation of `TypedAggregateExpression` requires 
> serialization/deserialization for each call of update or merge.
> *Proposal*
> We propose to:
> 1. Introduce a TypedImperativeAggregate which allows using an arbitrary Java 
> object as the aggregation buffer, with requirements like:
>  - It is flexible enough that the API allows using any Java object as the 
> aggregation buffer, so that it is easier to integrate with existing monoid 
> libraries like Algebird.
>  - We don't need to call serialize/deserialize for each call of 
> update/merge. Instead, only a few serialization/deserialization operations 
> are needed. This is to guarantee performance.
> 2. Refactor `TypedAggregateExpression` to use this new interface, to get 
> higher performance.
> 3. Implement approximate percentile and other aggregation functions that have a 
> complex aggregation object using this new interface.
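As a conceptual illustration only (this uses the existing public Aggregator API, not the internal TypedImperativeAggregate interface proposed here; all names are illustrative), an arbitrary object serving as the aggregation state looks like:

{code}
// Conceptual illustration with the public typed Aggregator API: an arbitrary
// Scala object (SketchState) serves as the aggregation state.
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.{Encoder, Encoders}

case class SketchState(var count: Long, var sum: Long)

object MeanAgg extends Aggregator[Long, SketchState, Double] {
  def zero: SketchState = SketchState(0L, 0L)
  def reduce(b: SketchState, a: Long): SketchState = { b.count += 1; b.sum += a; b }
  def merge(b1: SketchState, b2: SketchState): SketchState = {
    b1.count += b2.count; b1.sum += b2.sum; b1
  }
  def finish(b: SketchState): Double = if (b.count == 0) 0.0 else b.sum.toDouble / b.count
  def bufferEncoder: Encoder[SketchState] = Encoders.product[SketchState]
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}
{code}

MeanAgg.toColumn can then be applied to a Dataset[Long]; the proposed TypedImperativeAggregate would give the same "arbitrary object as buffer" flexibility to internal imperative aggregates without per-row serialization.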






[jira] [Resolved] (SPARK-17187) Support using arbitrary Java object as internal aggregation buffer object

2016-08-25 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-17187.
--
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 14753
[https://github.com/apache/spark/pull/14753]

> Support using arbitrary Java object as internal aggregation buffer object
> -
>
> Key: SPARK-17187
> URL: https://issues.apache.org/jira/browse/SPARK-17187
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Sean Zhong
> Fix For: 2.1.0
>
>
> *Background*
> For aggregation functions like sum and count, Spark SQL internally uses an 
> aggregation buffer to store the intermediate aggregation result for all 
> aggregation functions. Each aggregation function occupies a section of the 
> aggregation buffer.
> *Problem*
> Currently, Spark SQL only allows a small set of its supported storage data 
> types to be stored in the aggregation buffer, which is neither convenient nor 
> performant. There are several typical cases:
> 1. If the aggregation has a complex model such as CountMinSketch, it is not 
> easy to convert that model so it can be stored using the limited set of 
> Spark SQL supported data types.
> 2. It is hard to reuse aggregation class definitions from existing libraries 
> like Algebird.
> 3. It may introduce heavy serialization/deserialization cost when converting 
> a domain model to a Spark SQL supported data type. For example, the current 
> implementation of `TypedAggregateExpression` requires 
> serialization/deserialization for each call of update or merge.
> *Proposal*
> We propose to:
> 1. Introduce a TypedImperativeAggregate which allows using an arbitrary Java 
> object as the aggregation buffer, with requirements like:
>  -  The API is flexible enough to allow any Java object as the aggregation 
> buffer, so that it is easier to integrate with existing monoid libraries 
> like Algebird.
>  -  We don't need to call serialize/deserialize for each call of 
> update/merge. Instead, only a few serialization/deserialization operations 
> are needed. This guarantees the performance.
> 2. Refactor `TypedAggregateExpression` to use this new interface, to get 
> higher performance.
> 3. Implement approximate percentile and other aggregation functions that 
> have a complex aggregation object with this new interface.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17231) Avoid building debug or trace log messages unless the respective log level is enabled

2016-08-25 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-17231.
--
   Resolution: Fixed
 Assignee: Michael Allman
Fix Version/s: 2.1.0
   2.0.1

> Avoid building debug or trace log messages unless the respective log level is 
> enabled
> -
>
> Key: SPARK-17231
> URL: https://issues.apache.org/jira/browse/SPARK-17231
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0
> Environment: Spark cluster with 8 r3.8xl EC2 worker instances
>Reporter: Michael Allman
>Assignee: Michael Allman
>Priority: Minor
> Fix For: 2.0.1, 2.1.0
>
> Attachments: logging_perf_improvements 2.jpg, 
> logging_perf_improvements.jpg, master 2.jpg, master.jpg
>
>
> While debugging the performance of a large GraphX connected components 
> computation, I found several places in the {{network-common}} and 
> {{network-shuffle}} code bases where trace or debug log messages are 
> constructed even if the respective log level is disabled. Refactoring the 
> respective code to avoid these constructions except where necessary led to a 
> modest but measurable reduction in task time, GC time and the ratio thereof.
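
For context, the general pattern involved is sketched below (assuming slf4j on the classpath, as the network modules use); this is an illustration of the technique, not the actual patch:

{code}
import org.slf4j.{Logger, LoggerFactory}

object GuardedLogging {
  private val log: Logger = LoggerFactory.getLogger(getClass)

  // Accept the message by name so the string is only built when TRACE is on.
  def logTrace(msg: => String): Unit = {
    if (log.isTraceEnabled) log.trace(msg)
  }

  // Hypothetical stand-in for an expensive-to-build message.
  def chunkSummary(chunks: Seq[Array[Byte]]): String =
    chunks.map(_.length).mkString("chunk sizes: [", ", ", "]")

  def main(args: Array[String]): Unit = {
    val chunks = Seq(Array.fill[Byte](16)(0), Array.fill[Byte](32)(1))
    // Without the guard and by-name parameter, chunkSummary would be
    // evaluated even when trace logging is disabled.
    logTrace(chunkSummary(chunks))
  }
}
{code}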



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-7007) Add metrics source for ExecutorAllocationManager to expose internal status

2016-08-25 Thread Erik Erlandson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437325#comment-15437325
 ] 

Erik Erlandson edited comment on SPARK-7007 at 8/25/16 11:25 PM:
-

Are there instructions for how to enable these metrics?   Is it an incantation 
in the `metrics.properties` file?

Update: my cluster had been mistakenly configured without dynamic executor 
allocation.  When that is turned on, these metrics are published under driver 
metrics, without any special configuration.


was (Author: eje):
Are there instructions for how to enable these metrics?   Is it an incantation 
in the `metrics.properties` file?


> Add metrics source for ExecutorAllocationManager to expose internal status
> --
>
> Key: SPARK-7007
> URL: https://issues.apache.org/jira/browse/SPARK-7007
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, YARN
>Affects Versions: 1.3.0
>Reporter: Saisai Shao
>Priority: Minor
>
> Add a metric source to expose the internal status of 
> ExecutorAllocationManager to better monitor executor allocation when 
> running on YARN.
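
Following up on the comment above, a minimal configuration sketch: the property names are standard Spark settings, while the CSV sink choice and output directory are arbitrary examples used here only to make the driver-side allocation metrics visible somewhere.

{code}
# spark-defaults.conf: dynamic allocation must be on for these metrics to exist
spark.dynamicAllocation.enabled   true
spark.shuffle.service.enabled     true

# conf/metrics.properties: export driver metrics (e.g. to CSV files)
driver.sink.csv.class       org.apache.spark.metrics.sink.CsvSink
driver.sink.csv.period      10
driver.sink.csv.unit        seconds
driver.sink.csv.directory   /tmp/spark-metrics
{code}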



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17157) Add multiclass logistic regression SparkR Wrapper

2016-08-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15438054#comment-15438054
 ] 

Apache Spark commented on SPARK-17157:
--

User 'wangmiao1981' has created a pull request for this issue:
https://github.com/apache/spark/pull/14818

> Add multiclass logistic regression SparkR Wrapper
> -
>
> Key: SPARK-17157
> URL: https://issues.apache.org/jira/browse/SPARK-17157
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Miao Wang
>
> [SPARK-7159][ML] Add multiclass logistic regression to Spark ML  has been 
> merged to Master. I open this JIRA for discussion of adding SparkR wrapper 
> for multiclass logistic regression.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17157) Add multiclass logistic regression SparkR Wrapper

2016-08-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17157:


Assignee: (was: Apache Spark)

> Add multiclass logistic regression SparkR Wrapper
> -
>
> Key: SPARK-17157
> URL: https://issues.apache.org/jira/browse/SPARK-17157
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Miao Wang
>
> [SPARK-7159][ML] Add multiclass logistic regression to Spark ML  has been 
> merged to Master. I open this JIRA for discussion of adding SparkR wrapper 
> for multiclass logistic regression.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17240) SparkConf is Serializable but contains a non-serializable field

2016-08-25 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-17240.

   Resolution: Fixed
 Assignee: Marcelo Vanzin
Fix Version/s: 2.1.0

> SparkConf is Serializable but contains a non-serializable field
> ---
>
> Key: SPARK-17240
> URL: https://issues.apache.org/jira/browse/SPARK-17240
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Michael Gummelt
>Assignee: Marcelo Vanzin
> Fix For: 2.1.0
>
>
> This commit: 
> https://github.com/apache/spark/commit/5da6c4b24f512b63cd4e6ba7dd8968066a9396f5
> Added ConfigReader to SparkConf.  SparkConf is Serializable, but ConfigReader 
> is not, which results in the following exception:
> {code}
> java.io.NotSerializableException: 
> org.apache.spark.internal.config.ConfigReader
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1184)
>   at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>   at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>   at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>   at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>   at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
>   at org.apache.spark.util.Utils$.serialize(Utils.scala:134)
>   at 
> org.apache.spark.scheduler.cluster.mesos.ZookeeperMesosClusterPersistenceEngine.persist(MesosClusterPersistenceEngine.scala:111)
>   at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.submitDriver(MesosClusterScheduler.scala:170)
>   at 
> org.apache.spark.deploy.rest.mesos.MesosSubmitRequestServlet.handleSubmit(MesosRestServer.scala:126)
>   at 
> org.apache.spark.deploy.rest.SubmitRequestServlet.doPost(RestSubmissionServer.scala:265)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
>   at 
> org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:812)
>   at 
> org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:587)
>   at 
> org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
>   at 
> org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
>   at 
> org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
>   at 
> org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>   at 
> org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
>   at org.spark_project.jetty.server.Server.handle(Server.java:499)
>   at 
> org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:311)
>   at 
> org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
>   at 
> org.spark_project.jetty.io.AbstractConnection$2.run(AbstractConnection.java:544)
>   at 
> org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
>   at 
> org.spark_project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
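
One common way to address this class of problem, sketched below with hypothetical names and not necessarily the fix that was merged: keep the non-serializable helper out of the serialized state by marking the field @transient and rebuilding it lazily after deserialization.

{code}
import scala.collection.mutable

// Hypothetical stand-in for a helper that is not Serializable.
class SettingsReader(settings: mutable.Map[String, String]) {
  def get(key: String): Option[String] = settings.get(key)
}

class MyConf extends Serializable {
  private val settings = mutable.HashMap[String, String]()

  // Excluded from Java serialization; rebuilt lazily on first use after
  // the object has been deserialized in another JVM or thread.
  @transient private lazy val reader = new SettingsReader(settings)

  def set(key: String, value: String): this.type = { settings(key) = value; this }
  def get(key: String): Option[String] = reader.get(key)
}
{code}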



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17157) Add multiclass logistic regression SparkR Wrapper

2016-08-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17157:


Assignee: Apache Spark

> Add multiclass logistic regression SparkR Wrapper
> -
>
> Key: SPARK-17157
> URL: https://issues.apache.org/jira/browse/SPARK-17157
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Miao Wang
>Assignee: Apache Spark
>
> [SPARK-7159][ML] Add multiclass logistic regression to Spark ML  has been 
> merged to Master. I open this JIRA for discussion of adding SparkR wrapper 
> for multiclass logistic regression.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-16627) --jars doesn't work in Mesos mode

2016-08-25 Thread Michael Gummelt (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Gummelt closed SPARK-16627.
---
Resolution: Won't Fix

> --jars doesn't work in Mesos mode
> -
>
> Key: SPARK-16627
> URL: https://issues.apache.org/jira/browse/SPARK-16627
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Reporter: Michael Gummelt
>
> Definitely doesn't work in cluster mode.  Might not work in client mode 
> either.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14927) DataFrame. saveAsTable creates RDD partitions but not Hive partitions

2016-08-25 Thread Corentin Kerisit (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437861#comment-15437861
 ] 

Corentin Kerisit commented on SPARK-14927:
--

Any guidelines on how we could help get this resolved?

> DataFrame. saveAsTable creates RDD partitions but not Hive partitions
> -
>
> Key: SPARK-14927
> URL: https://issues.apache.org/jira/browse/SPARK-14927
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.1
> Environment: Mac OS X 10.11.4 local
>Reporter: Sasha Ovsankin
>
> This is a followup to 
> http://stackoverflow.com/questions/31341498/save-spark-dataframe-as-dynamic-partitioned-table-in-hive
>  . I tried to use the suggestions in the answers but couldn't make it work in 
> Spark 1.6.1
> I am trying to create partitions programmatically from a `DataFrame`. Here is 
> the relevant code (adapted from a Spark test):
> hc.setConf("hive.metastore.warehouse.dir", "tmp/tests")
> //hc.setConf("hive.exec.dynamic.partition", "true")
> //hc.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
> hc.sql("create database if not exists tmp")
> hc.sql("drop table if exists tmp.partitiontest1")
> Seq(2012 -> "a").toDF("year", "val")
>   .write
>   .partitionBy("year")
>   .mode(SaveMode.Append)
>   .saveAsTable("tmp.partitiontest1")
> hc.sql("show partitions tmp.partitiontest1").show
> Full file is here: 
> https://gist.github.com/SashaOv/7c65f03a51c7e8f9c9e018cd42aa4c4a
> I get the error that the table is not partitioned:
> ==
> HIVE FAILURE OUTPUT
> ==
> SET hive.support.sql11.reserved.keywords=false
> SET hive.metastore.warehouse.dir=tmp/tests
> OK
> OK
> FAILED: Execution Error, return code 1 from 
> org.apache.hadoop.hive.ql.exec.DDLTask. Table tmp.partitiontest1 is not a 
> partitioned table
> ==
> It looks like the root cause is that 
> `org.apache.spark.sql.hive.HiveMetastoreCatalog.newSparkSQLSpecificMetastoreTable`
>  always creates table with empty partitions.
> Any help to move this forward is appreciated.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17247) when fall back to hdfs is enabled for stats calculation, the hdfs listing and size calcuation should be terminated as soon as total size > broadcast threshold

2016-08-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17247:


Assignee: Apache Spark

> when fall back to hdfs is enabled for stats calculation, the hdfs listing and 
> size calcuation should be terminated as soon as total size > broadcast 
> threshold
> --
>
> Key: SPARK-17247
> URL: https://issues.apache.org/jira/browse/SPARK-17247
> Project: Spark
>  Issue Type: Bug
>Reporter: Parth Brahmbhatt
>Assignee: Apache Spark
>
> Currently, when a user enables spark.sql.statistics.fallBackToHdfs and no 
> stats are available from the metastore, we fall back to HDFS. This is a 
> useful join optimization; however, it can slow things down. To speed up the 
> operation we could stop the size calculation as soon as we hit the broadcast 
> threshold, since the exact size is not important beyond that point.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17247) when fall back to hdfs is enabled for stats calculation, the hdfs listing and size calcuation should be terminated as soon as total size > broadcast threshold

2016-08-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17247:


Assignee: (was: Apache Spark)

> when fall back to hdfs is enabled for stats calculation, the hdfs listing and 
> size calcuation should be terminated as soon as total size > broadcast 
> threshold
> --
>
> Key: SPARK-17247
> URL: https://issues.apache.org/jira/browse/SPARK-17247
> Project: Spark
>  Issue Type: Bug
>Reporter: Parth Brahmbhatt
>
> Currently, when a user enables spark.sql.statistics.fallBackToHdfs and no 
> stats are available from the metastore, we fall back to HDFS. This is a 
> useful join optimization; however, it can slow things down. To speed up the 
> operation we could stop the size calculation as soon as we hit the broadcast 
> threshold, since the exact size is not important beyond that point.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17247) when fall back to hdfs is enabled for stats calculation, the hdfs listing and size calcuation should be terminated as soon as total size > broadcast threshold

2016-08-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437860#comment-15437860
 ] 

Apache Spark commented on SPARK-17247:
--

User 'Parth-Brahmbhatt' has created a pull request for this issue:
https://github.com/apache/spark/pull/14817

> when fall back to hdfs is enabled for stats calculation, the hdfs listing and 
> size calcuation should be terminated as soon as total size > broadcast 
> threshold
> --
>
> Key: SPARK-17247
> URL: https://issues.apache.org/jira/browse/SPARK-17247
> Project: Spark
>  Issue Type: Bug
>Reporter: Parth Brahmbhatt
>
> Currently, when a user enables spark.sql.statistics.fallBackToHdfs and no 
> stats are available from the metastore, we fall back to HDFS. This is a 
> useful join optimization; however, it can slow things down. To speed up the 
> operation we could stop the size calculation as soon as we hit the broadcast 
> threshold, since the exact size is not important beyond that point.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13525) SparkR: java.net.SocketTimeoutException: Accept timed out when running any dataframe function

2016-08-25 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437856#comment-15437856
 ] 

Shivaram Venkataraman commented on SPARK-13525:
---

Yeah, this is related but a slightly different error - this means that the R 
daemons were started but the workers they forked didn't connect back to the 
JVM. I think this could happen if the machine runs out of memory / file 
descriptors etc., causing a fork to fail?

> SparkR: java.net.SocketTimeoutException: Accept timed out when running any 
> dataframe function
> -
>
> Key: SPARK-13525
> URL: https://issues.apache.org/jira/browse/SPARK-13525
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Shubhanshu Mishra
>  Labels: sparkr
>
> I am following the code steps from this example:
> https://spark.apache.org/docs/1.6.0/sparkr.html
> There are multiple issues: 
> 1. The head, summary, and filter methods are not overridden by Spark, hence 
> I need to call them using the `SparkR::` namespace.
> 2. When I try to execute the following, I get errors:
> {code}
> $> $R_HOME/bin/R
> R version 3.2.3 (2015-12-10) -- "Wooden Christmas-Tree"
> Copyright (C) 2015 The R Foundation for Statistical Computing
> Platform: x86_64-pc-linux-gnu (64-bit)
> R is free software and comes with ABSOLUTELY NO WARRANTY.
> You are welcome to redistribute it under certain conditions.
> Type 'license()' or 'licence()' for distribution details.
>   Natural language support but running in an English locale
> R is a collaborative project with many contributors.
> Type 'contributors()' for more information and
> 'citation()' on how to cite R or R packages in publications.
> Type 'demo()' for some demos, 'help()' for on-line help, or
> 'help.start()' for an HTML browser interface to help.
> Type 'q()' to quit R.
> Welcome at Fri Feb 26 16:19:35 2016 
> Attaching package: ‘SparkR’
> The following objects are masked from ‘package:base’:
> colnames, colnames<-, drop, intersect, rank, rbind, sample, subset,
> summary, transform
> Launching java with spark-submit command 
> /content/smishra8/SOFTWARE/spark/bin/spark-submit   --driver-memory "50g" 
> sparkr-shell /tmp/RtmpfBQRg6/backend_portc3bc16f09b1b 
> > df <- createDataFrame(sqlContext, iris)
> Warning messages:
> 1: In FUN(X[[i]], ...) :
>   Use Sepal_Length instead of Sepal.Length  as column name
> 2: In FUN(X[[i]], ...) :
>   Use Sepal_Width instead of Sepal.Width  as column name
> 3: In FUN(X[[i]], ...) :
>   Use Petal_Length instead of Petal.Length  as column name
> 4: In FUN(X[[i]], ...) :
>   Use Petal_Width instead of Petal.Width  as column name
> > training <- filter(df, df$Species != "setosa")
> Error in filter(df, df$Species != "setosa") : 
>   no method for coercing this S4 class to a vector
> > training <- SparkR::filter(df, df$Species != "setosa")
> > model <- SparkR::glm(Species ~ Sepal_Length + Sepal_Width, data = training, 
> > family = "binomial")
> 16/02/26 16:26:46 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.net.SocketTimeoutException: Accept timed out
> at java.net.PlainSocketImpl.socketAccept(Native Method)
> at 
> java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:398)
> at java.net.ServerSocket.implAccept(ServerSocket.java:530)
> at java.net.ServerSocket.accept(ServerSocket.java:498)
> at org.apache.spark.api.r.RRDD$.createRWorker(RRDD.scala:431)
> at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:62)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
> at 

[jira] [Created] (SPARK-17247) when fall back to hdfs is enabled for stats calculation, the hdfs listing and size calcuation should be terminated as soon as total size > broadcast threshold

2016-08-25 Thread Parth Brahmbhatt (JIRA)
Parth Brahmbhatt created SPARK-17247:


 Summary: when fall back to hdfs is enabled for stats calculation, 
the hdfs listing and size calcuation should be terminated as soon as total size 
> broadcast threshold
 Key: SPARK-17247
 URL: https://issues.apache.org/jira/browse/SPARK-17247
 Project: Spark
  Issue Type: Bug
Reporter: Parth Brahmbhatt


Currently, when a user enables spark.sql.statistics.fallBackToHdfs and no stats 
are available from the metastore, we fall back to HDFS. This is a useful join 
optimization; however, it can slow things down. To speed up the operation we 
could stop the size calculation as soon as we hit the broadcast threshold, 
since the exact size is not important beyond that point.
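
A rough sketch of the proposed early termination, using the Hadoop FileSystem API; the object name and threshold value below are illustrative assumptions, not the actual patch:

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object SizeProbe {
  // Sum file sizes under dir, but stop as soon as the running total already
  // exceeds the threshold: past that point the exact size no longer matters.
  def sizeUpTo(fs: FileSystem, dir: Path, threshold: Long): Long = {
    var total = 0L
    val files = fs.listFiles(dir, true) // recursive listing
    while (files.hasNext && total <= threshold) {
      total += files.next().getLen
    }
    total
  }

  def main(args: Array[String]): Unit = {
    val fs = FileSystem.get(new Configuration())
    val threshold = 10L * 1024 * 1024 // e.g. spark.sql.autoBroadcastJoinThreshold
    val size = sizeUpTo(fs, new Path(args(0)), threshold)
    println(s"broadcastable = ${size <= threshold}")
  }
}
{code}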



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17205) Literal.sql does not properly convert NaN and Infinity literals

2016-08-25 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-17205.
---
   Resolution: Fixed
Fix Version/s: 2.1.0
   2.0.1

> Literal.sql does not properly convert NaN and Infinity literals
> ---
>
> Key: SPARK-17205
> URL: https://issues.apache.org/jira/browse/SPARK-17205
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Minor
> Fix For: 2.0.1, 2.1.0
>
>
> {{Literal.sql}} mishandles NaN and Infinity literals: the handling of these 
> needs to be special-cased instead of simply appending a suffix to the string 
> representation of the value
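
One possible shape of the special-casing, written as a standalone helper rather than the actual {{Literal.sql}} change; rendering NaN and the infinities through a CAST of a string is an assumption about acceptable SQL output, not a quote of the merged fix:

{code}
object LiteralSql {
  // NaN and the infinities have no plain numeric literal in SQL, so they need
  // an explicit expression instead of value.toString plus a type suffix.
  def doubleToSql(v: Double): String =
    if (v.isNaN) "CAST('NaN' AS DOUBLE)"
    else if (v == Double.PositiveInfinity) "CAST('Infinity' AS DOUBLE)"
    else if (v == Double.NegativeInfinity) "CAST('-Infinity' AS DOUBLE)"
    else v.toString + "D" // ordinary values keep the suffix form

  def main(args: Array[String]): Unit =
    Seq(1.5, Double.NaN, Double.PositiveInfinity).foreach(d => println(doubleToSql(d)))
}
{code}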



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17246) Support BigDecimal literal parsing

2016-08-25 Thread Herman van Hovell (JIRA)
Herman van Hovell created SPARK-17246:
-

 Summary: Support BigDecimal literal parsing
 Key: SPARK-17246
 URL: https://issues.apache.org/jira/browse/SPARK-17246
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Herman van Hovell
Assignee: Herman van Hovell
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17245) NPE thrown by ClientWrapper.conf

2016-08-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437809#comment-15437809
 ] 

Apache Spark commented on SPARK-17245:
--

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/14816

> NPE thrown by ClientWrapper.conf
> 
>
> Key: SPARK-17245
> URL: https://issues.apache.org/jira/browse/SPARK-17245
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2
>Reporter: Yin Huai
>
> This issue has been fixed in Spark 2.0. It seems ClientWrapper.conf is trying 
> to access the ThreadLocal SessionState, which has not been set.
> {code}
> java.lang.NullPointerException 
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.conf(ClientWrapper.scala:225) 
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.client(ClientWrapper.scala:279)
>  
> at 
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:291)
>  
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.liftedTree1$1(ClientWrapper.scala:246)
>  
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:245)
>  
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:288)
>  
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.runHive(ClientWrapper.scala:493)
>  
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.runSqlHive(ClientWrapper.scala:483)
>  
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.addJar(ClientWrapper.scala:603)
>  
> at org.apache.spark.sql.hive.HiveContext.addJar(HiveContext.scala:654) 
> at org.apache.spark.sql.hive.execution.AddJar.run(commands.scala:105)
> at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
>  
> at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
>  
> at 
> org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70) 
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
>  
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
>  
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>  
> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130) 
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)
>  
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55) 
> at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:145) 
> at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:130) 
> at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52) 
> at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:816) 
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17245) NPE thrown by ClientWrapper.conf

2016-08-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17245:


Assignee: (was: Apache Spark)

> NPE thrown by ClientWrapper.conf
> 
>
> Key: SPARK-17245
> URL: https://issues.apache.org/jira/browse/SPARK-17245
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2
>Reporter: Yin Huai
>
> This issue has been fixed in Spark 2.0. It seems ClientWrapper.conf is trying 
> to access the ThreadLocal SessionState, which has not been set.
> {code}
> java.lang.NullPointerException 
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.conf(ClientWrapper.scala:225) 
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.client(ClientWrapper.scala:279)
>  
> at 
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:291)
>  
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.liftedTree1$1(ClientWrapper.scala:246)
>  
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:245)
>  
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:288)
>  
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.runHive(ClientWrapper.scala:493)
>  
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.runSqlHive(ClientWrapper.scala:483)
>  
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.addJar(ClientWrapper.scala:603)
>  
> at org.apache.spark.sql.hive.HiveContext.addJar(HiveContext.scala:654) 
> at org.apache.spark.sql.hive.execution.AddJar.run(commands.scala:105)
> at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
>  
> at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
>  
> at 
> org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70) 
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
>  
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
>  
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>  
> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130) 
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)
>  
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55) 
> at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:145) 
> at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:130) 
> at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52) 
> at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:816) 
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17245) NPE thrown by ClientWrapper.conf

2016-08-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17245:


Assignee: Apache Spark

> NPE thrown by ClientWrapper.conf
> 
>
> Key: SPARK-17245
> URL: https://issues.apache.org/jira/browse/SPARK-17245
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2
>Reporter: Yin Huai
>Assignee: Apache Spark
>
> This issue has been fixed in Spark 2.0. It seems ClientWrapper.conf is trying 
> to access the ThreadLocal SessionState, which has not been set.
> {code}
> java.lang.NullPointerException 
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.conf(ClientWrapper.scala:225) 
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.client(ClientWrapper.scala:279)
>  
> at 
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:291)
>  
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.liftedTree1$1(ClientWrapper.scala:246)
>  
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:245)
>  
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:288)
>  
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.runHive(ClientWrapper.scala:493)
>  
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.runSqlHive(ClientWrapper.scala:483)
>  
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.addJar(ClientWrapper.scala:603)
>  
> at org.apache.spark.sql.hive.HiveContext.addJar(HiveContext.scala:654) 
> at org.apache.spark.sql.hive.execution.AddJar.run(commands.scala:105)
> at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
>  
> at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
>  
> at 
> org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70) 
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
>  
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
>  
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>  
> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130) 
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)
>  
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55) 
> at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:145) 
> at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:130) 
> at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52) 
> at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:816) 
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16700) StructType doesn't accept Python dicts anymore

2016-08-25 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-16700:
-
Fix Version/s: 2.0.1

> StructType doesn't accept Python dicts anymore
> --
>
> Key: SPARK-16700
> URL: https://issues.apache.org/jira/browse/SPARK-16700
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Sylvain Zimmer
>Assignee: Davies Liu
>  Labels: releasenotes
> Fix For: 2.0.1, 2.1.0
>
>
> Hello,
> I found this issue while testing my codebase with 2.0.0-rc5
> StructType in Spark 1.6.2 accepts the Python dict type, which is very 
> handy. 2.0.0-rc5 does not and throws an error.
> I don't know if this was intended but I'd advocate for this behaviour to 
> remain the same. MapType is probably wasteful when your key names never 
> change and switching to Python tuples would be cumbersome.
> Here is a minimal script to reproduce the issue: 
> {code}
> from pyspark import SparkContext
> from pyspark.sql import types as SparkTypes
> from pyspark.sql import SQLContext
> sc = SparkContext()
> sqlc = SQLContext(sc)
> struct_schema = SparkTypes.StructType([
> SparkTypes.StructField("id", SparkTypes.LongType())
> ])
> rdd = sc.parallelize([{"id": 0}, {"id": 1}])
> df = sqlc.createDataFrame(rdd, struct_schema)
> print df.collect()
> # 1.6.2 prints [Row(id=0), Row(id=1)]
> # 2.0.0-rc5 raises TypeError: StructType can not accept object {'id': 0} in 
> type <type 'dict'>
> {code}
> Thanks!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17243) Spark 2.0 history server summary page gets stuck at "loading history summary" with 10K+ application history

2016-08-25 Thread Alex Bozarth (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437764#comment-15437764
 ] 

Alex Bozarth commented on SPARK-17243:
--

Thanks, that'll help when I look into it

> Spark 2.0 history server summary page gets stuck at "loading history summary" 
> with 10K+ application history
> ---
>
> Key: SPARK-17243
> URL: https://issues.apache.org/jira/browse/SPARK-17243
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0
> Environment: Linux
>Reporter: Gang Wu
>
> The summary page of the Spark 2.0 history server web UI keeps displaying 
> "Loading history summary..." indefinitely and crashes the browser when there 
> are more than 10K application history event logs on HDFS. 
> I did some investigation: the "historypage.js" file sends a REST request to 
> the /api/v1/applications endpoint of the history server and gets back a JSON 
> response. When there are more than 10K applications inside the event log 
> directory it takes forever to parse them and render the page. With only 
> hundreds or thousands of application histories it runs fine.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17243) Spark 2.0 history server summary page gets stuck at "loading history summary" with 10K+ application history

2016-08-25 Thread Alex Bozarth (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437768#comment-15437768
 ] 

Alex Bozarth commented on SPARK-17243:
--

Sorry, I misunderstood your problem; I will make sure to keep this in mind 
once I start my work.


> Spark 2.0 history server summary page gets stuck at "loading history summary" 
> with 10K+ application history
> ---
>
> Key: SPARK-17243
> URL: https://issues.apache.org/jira/browse/SPARK-17243
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0
> Environment: Linux
>Reporter: Gang Wu
>
> The summary page of the Spark 2.0 history server web UI keeps displaying 
> "Loading history summary..." indefinitely and crashes the browser when there 
> are more than 10K application history event logs on HDFS. 
> I did some investigation: the "historypage.js" file sends a REST request to 
> the /api/v1/applications endpoint of the history server and gets back a JSON 
> response. When there are more than 10K applications inside the event log 
> directory it takes forever to parse them and render the page. With only 
> hundreds or thousands of application histories it runs fine.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11085) Add support for HTTP proxy

2016-08-25 Thread SURESH CHAGANTI (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437727#comment-15437727
 ] 

SURESH CHAGANTI commented on SPARK-11085:
-

Hi All,
I have made the code changes to accept the HTTP proxy as a run-time argument 
and use it for outbound calls.

Below is the pull request:

https://github.com/SureshChaganti/spark-ec2/commit/cfd4bf727bdf46b9456f8f4d89221d1377d9c221


> Add support for HTTP proxy 
> ---
>
> Key: SPARK-11085
> URL: https://issues.apache.org/jira/browse/SPARK-11085
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell, Spark Submit
>Reporter: Dustin Cote
>Priority: Minor
>
> Add a way to update ivysettings.xml for the spark-shell and spark-submit to 
> support proxy settings for clusters that need to access a remote repository 
> through an http proxy.  Typically this would be done like:
> JAVA_OPTS="$JAVA_OPTS -Dhttp.proxyHost=proxy.host -Dhttp.proxyPort=8080 
> -Dhttps.proxyHost=proxy.host.secure -Dhttps.proxyPort=8080"
> Directly in the ivysettings.xml would look like:
> <ivysettings>
>   <proxy host="proxy.host" proxyport="8080" nonproxyhosts="nonproxy.host"/> 
> </ivysettings>
> Even better would be a way to customize the ivysettings.xml with command 
> options.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17241) SparkR spark.glm should have configurable regularization parameter

2016-08-25 Thread Junyang Qian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437717#comment-15437717
 ] 

Junyang Qian commented on SPARK-17241:
--

I'll take a closer look and see if we can add it easily.

> SparkR spark.glm should have configurable regularization parameter
> --
>
> Key: SPARK-17241
> URL: https://issues.apache.org/jira/browse/SPARK-17241
> Project: Spark
>  Issue Type: Improvement
>Reporter: Junyang Qian
>
> Spark has a configurable L2 regularization parameter for generalized linear 
> regression. It is very important to have it in SparkR so that users can run 
> ridge regression.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17245) NPE thrown by ClientWrapper.conf

2016-08-25 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-17245:
-
Summary: NPE thrown by ClientWrapper.conf  (was: NPE thrown by 
ClientWrapper )

> NPE thrown by ClientWrapper.conf
> 
>
> Key: SPARK-17245
> URL: https://issues.apache.org/jira/browse/SPARK-17245
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2
>Reporter: Yin Huai
>
> This issue has been fixed in Spark 2.0. It seems ClientWrapper.conf is trying 
> to access the ThreadLocal SessionState, which has not been set.
> {code}
> java.lang.NullPointerException 
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.conf(ClientWrapper.scala:225) 
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.client(ClientWrapper.scala:279)
>  
> at 
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:291)
>  
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.liftedTree1$1(ClientWrapper.scala:246)
>  
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:245)
>  
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:288)
>  
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.runHive(ClientWrapper.scala:493)
>  
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.runSqlHive(ClientWrapper.scala:483)
>  
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.addJar(ClientWrapper.scala:603)
>  
> at org.apache.spark.sql.hive.HiveContext.addJar(HiveContext.scala:654) 
> at org.apache.spark.sql.hive.execution.AddJar.run(commands.scala:105)
> at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
>  
> at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
>  
> at 
> org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70) 
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
>  
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
>  
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>  
> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130) 
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)
>  
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55) 
> at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:145) 
> at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:130) 
> at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52) 
> at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:816) 
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17245) NPE thrown by ClientWrapper

2016-08-25 Thread Yin Huai (JIRA)
Yin Huai created SPARK-17245:


 Summary: NPE thrown by ClientWrapper 
 Key: SPARK-17245
 URL: https://issues.apache.org/jira/browse/SPARK-17245
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.2
Reporter: Yin Huai


This issue has been fixed in Spark 2.0. It seems ClientWrapper.conf is trying 
to access the ThreadLocal SessionState, which has not been set.
{code}
java.lang.NullPointerException 
at org.apache.spark.sql.hive.client.ClientWrapper.conf(ClientWrapper.scala:225) 
at 
org.apache.spark.sql.hive.client.ClientWrapper.client(ClientWrapper.scala:279) 
at 
org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:291)
 
at 
org.apache.spark.sql.hive.client.ClientWrapper.liftedTree1$1(ClientWrapper.scala:246)
 
at 
org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:245)
 
at 
org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:288)
 
at 
org.apache.spark.sql.hive.client.ClientWrapper.runHive(ClientWrapper.scala:493) 
at 
org.apache.spark.sql.hive.client.ClientWrapper.runSqlHive(ClientWrapper.scala:483)
 
at 
org.apache.spark.sql.hive.client.ClientWrapper.addJar(ClientWrapper.scala:603) 
at org.apache.spark.sql.hive.HiveContext.addJar(HiveContext.scala:654) 
at org.apache.spark.sql.hive.execution.AddJar.run(commands.scala:105)
at 
org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
 
at 
org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
 
at org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70) 
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
 
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
 
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) 
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130) 
at 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)
 
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55) 
at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:145) 
at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:130) 
at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52) 
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:816) 
{code}
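
To illustrate the failure mode in isolation (plain Scala, not Spark code): a ThreadLocal that was only initialized on the original thread comes back null on another thread, and a defensive check turns the eventual NullPointerException into a clearer error.

{code}
object SessionHolder {
  private val state = new ThreadLocal[String]()

  def set(s: String): Unit = state.set(s)

  // Defensive accessor: fail with a clear message instead of letting a later
  // dereference of the null value throw a bare NullPointerException.
  def conf: String = {
    val s = state.get()
    require(s != null, "SessionState has not been set on this thread")
    s
  }

  def main(args: Array[String]): Unit = {
    set("hive-session")                       // initialized on the main thread
    val t = new Thread(new Runnable {
      def run(): Unit = println(conf)         // fails: not set on this thread
    })
    t.start()
    t.join()
  }
}
{code}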



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17229) Postgres JDBC dialect should not widen float and short types during reads

2016-08-25 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-17229.
---
   Resolution: Fixed
Fix Version/s: 2.1.0

> Postgres JDBC dialect should not widen float and short types during reads
> -
>
> Key: SPARK-17229
> URL: https://issues.apache.org/jira/browse/SPARK-17229
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Minor
> Fix For: 2.1.0
>
>
> When reading {{float4}} and {{smallint}} columns from PostgreSQL, Spark's 
> Postgres dialect widens these types to Decimal and Integer rather than using 
> the narrower Float and Short types.
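
For reference, a sketch of the kind of type mapping involved, written as a separately registered dialect with assumed mappings rather than the actual change to the built-in Postgres dialect:

{code}
import java.sql.Types
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}
import org.apache.spark.sql.types._

// Keep Postgres float4 as FloatType and smallint (int2) as ShortType
// instead of widening them to Decimal / Integer.
object NarrowPostgresDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:postgresql")

  override def getCatalystType(
      sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] = {
    if (sqlType == Types.REAL || typeName == "float4") Some(FloatType)
    else if (sqlType == Types.SMALLINT || typeName == "int2") Some(ShortType)
    else None // defer to the default mapping
  }
}

// Usage: JdbcDialects.registerDialect(NarrowPostgresDialect)
{code}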



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17241) SparkR spark.glm should have configurable regularization parameter

2016-08-25 Thread Junyang Qian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437692#comment-15437692
 ] 

Junyang Qian commented on SPARK-17241:
--

[~shivaram] It seems that Spark has it for linear regression but not for GLM.

> SparkR spark.glm should have configurable regularization parameter
> --
>
> Key: SPARK-17241
> URL: https://issues.apache.org/jira/browse/SPARK-17241
> Project: Spark
>  Issue Type: Improvement
>Reporter: Junyang Qian
>
> Spark has a configurable L2 regularization parameter for generalized linear 
> regression. It is very important to have it in SparkR so that users can run 
> ridge regression.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17243) Spark 2.0 history server summary page gets stuck at "loading history summary" with 10K+ application history

2016-08-25 Thread Gang Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437691#comment-15437691
 ] 

Gang Wu commented on SPARK-17243:
-

This doesn't work. That setting controls the cache of web UIs, not the 
application metadata. The default value is 50, which is already small enough.

> Spark 2.0 history server summary page gets stuck at "loading history summary" 
> with 10K+ application history
> ---
>
> Key: SPARK-17243
> URL: https://issues.apache.org/jira/browse/SPARK-17243
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0
> Environment: Linux
>Reporter: Gang Wu
>
> The summary page of the Spark 2.0 history server web UI keeps displaying 
> "Loading history summary..." indefinitely and crashes the browser when there 
> are more than 10K application history event logs on HDFS. 
> I did some investigation: the "historypage.js" file sends a REST request to 
> the /api/v1/applications endpoint of the history server and gets back a JSON 
> response. When there are more than 10K applications inside the event log 
> directory it takes forever to parse them and render the page. With only 
> hundreds or thousands of application histories it runs fine.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17243) Spark 2.0 history server summary page gets stuck at "loading history summary" with 10K+ application history

2016-08-25 Thread Gang Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437688#comment-15437688
 ] 

Gang Wu commented on SPARK-17243:
-

Hi Alex, I think in Spark 1.5 the history server obtains all application 
summary metadata directly from the FsHistoryProvider class; you can check 
HistoryPage.scala. In Spark 2.0 it instead parses a JSON string (in 
historypage.js), which is MUCH slower than before. Would it make sense to go 
back to the old approach?

> Spark 2.0 history server summary page gets stuck at "loading history summary" 
> with 10K+ application history
> ---
>
> Key: SPARK-17243
> URL: https://issues.apache.org/jira/browse/SPARK-17243
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0
> Environment: Linux
>Reporter: Gang Wu
>
> The summary page of the Spark 2.0 history server web UI keeps displaying 
> "Loading history summary..." indefinitely and crashes the browser when there 
> are more than 10K application history event logs on HDFS. 
> I did some investigation: the "historypage.js" file sends a REST request to 
> the /api/v1/applications endpoint of the history server and gets back a JSON 
> response. When there are more than 10K applications inside the event log 
> directory it takes forever to parse them and render the page. With only 
> hundreds or thousands of application histories it runs fine.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17243) Spark 2.0 history server summary page gets stuck at "loading history summary" with 10K+ application history

2016-08-25 Thread Alex Bozarth (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437682#comment-15437682
 ] 

Alex Bozarth edited comment on SPARK-17243 at 8/25/16 9:13 PM:
---

[~wgtmac] until this is fixed you can limit the number of applications 
available by setting {{spark.history.retainedApplications}} It limits the apps 
the history server loads


was (Author: ajbozarth):
[~wgtmac] until this is fixed you can limit the number of applications 
available by setting {spark.history.retainedApplications} It limits the apps 
the history server loads

> Spark 2.0 history server summary page gets stuck at "loading history summary" 
> with 10K+ application history
> ---
>
> Key: SPARK-17243
> URL: https://issues.apache.org/jira/browse/SPARK-17243
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0
> Environment: Linux
>Reporter: Gang Wu
>
> The summary page of the Spark 2.0 history server web UI keeps displaying 
> "Loading history summary..." indefinitely and crashes the browser when there 
> are more than 10K application history event logs on HDFS. 
> I did some investigation: the "historypage.js" file sends a REST request to 
> the /api/v1/applications endpoint of the history server and gets back a JSON 
> response. When there are more than 10K applications inside the event log 
> directory it takes forever to parse them and render the page. With only 
> hundreds or thousands of application histories it runs fine.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17243) Spark 2.0 history server summary page gets stuck at "loading history summary" with 10K+ application history

2016-08-25 Thread Alex Bozarth (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437682#comment-15437682
 ] 

Alex Bozarth edited comment on SPARK-17243 at 8/25/16 9:13 PM:
---

[~wgtmac] until this is fixed you can limit the number of applications 
available by setting {spark.history.retainedApplications} It limits the apps 
the history server loads


was (Author: ajbozarth):
[~wgtmac] until this is fixed you can limit the number of applications 
available by setting `spark.history.retainedApplications` It limits the apps 
the history server loads

> Spark 2.0 history server summary page gets stuck at "loading history summary" 
> with 10K+ application history
> ---
>
> Key: SPARK-17243
> URL: https://issues.apache.org/jira/browse/SPARK-17243
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0
> Environment: Linux
>Reporter: Gang Wu
>
> The summary page of Spark 2.0 history server web UI keep displaying "Loading 
> history summary..." all the time and crashes the browser when there are more 
> than 10K application history event logs on HDFS. 
> I did some investigation, "historypage.js" file sends a REST request to 
> /api/v1/applications endpoint of history server REST endpoint and gets back 
> json response. When there are more than 10K applications inside the event log 
> directory it takes forever to parse them and render the page. When there are 
> only hundreds or thousands of application history it is running fine.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17243) Spark 2.0 history server summary page gets stuck at "loading history summary" with 10K+ application history

2016-08-25 Thread Alex Bozarth (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437682#comment-15437682
 ] 

Alex Bozarth commented on SPARK-17243:
--

[~wgtmac] until this is fixed you can limit the number of applications 
available by setting `spark.history.retainedApplications`. It limits the apps 
the history server loads.
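
For anyone hitting this before a proper fix, a minimal sketch of how that 
property is typically set for the history server (the value 50 is purely 
illustrative):

{noformat}
# conf/spark-defaults.conf read by the history server (illustrative value)
spark.history.retainedApplications   50
{noformat}

The same property can also be passed through SPARK_HISTORY_OPTS as 
-Dspark.history.retainedApplications=50.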

> Spark 2.0 history server summary page gets stuck at "loading history summary" 
> with 10K+ application history
> ---
>
> Key: SPARK-17243
> URL: https://issues.apache.org/jira/browse/SPARK-17243
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0
> Environment: Linux
>Reporter: Gang Wu
>
> The summary page of Spark 2.0 history server web UI keep displaying "Loading 
> history summary..." all the time and crashes the browser when there are more 
> than 10K application history event logs on HDFS. 
> I did some investigation, "historypage.js" file sends a REST request to 
> /api/v1/applications endpoint of history server REST endpoint and gets back 
> json response. When there are more than 10K applications inside the event log 
> directory it takes forever to parse them and render the page. When there are 
> only hundreds or thousands of application history it is running fine.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17147) Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets

2016-08-25 Thread Sean McKibben (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437672#comment-15437672
 ] 

Sean McKibben commented on SPARK-17147:
---

I tried Robert's changes, but the performance for any sizable number of reads 
is really bad. At least the way I understand it, whenever there is a 
discontiguous offset it forces the Kafka consumer to do a seek, which is extremely slow.

> Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets
> 
>
> Key: SPARK-17147
> URL: https://issues.apache.org/jira/browse/SPARK-17147
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 2.0.0
>Reporter: Robert Conrad
>
> When Kafka does log compaction offsets often end up with gaps, meaning the 
> next requested offset will be frequently not be offset+1. The logic in 
> KafkaRDD & CachedKafkaConsumer has a baked in assumption that the next offset 
> will always be just an increment of 1 above the previous offset. 
> I have worked around this problem by changing CachedKafkaConsumer to use the 
> returned record's offset, from:
> {{nextOffset = offset + 1}}
> to:
> {{nextOffset = record.offset + 1}}
> and changed KafkaRDD from:
> {{requestOffset += 1}}
> to:
> {{requestOffset = r.offset() + 1}}
> (I also had to change some assert logic in CachedKafkaConsumer).
> There's a strong possibility that I have misconstrued how to use the 
> streaming kafka consumer, and I'm happy to close this out if that's the case. 
> If, however, it is supposed to support non-consecutive offsets (e.g. due to 
> log compaction) I am also happy to contribute a PR.
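
For readers less familiar with the compaction behaviour described above, here is 
a small self-contained sketch against the plain kafka-clients 0.10 API (not 
Spark's CachedKafkaConsumer; the broker address, topic name and group id are 
made up) showing the idea of deriving the next expected offset from the returned 
record instead of assuming offset + 1:

{code}
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")  // assumed broker
props.put("group.id", "offset-gap-demo")          // illustrative group id
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

val consumer = new KafkaConsumer[String, String](props)
val tp = new TopicPartition("some-compacted-topic", 0)  // assumed topic/partition
consumer.assign(Collections.singletonList(tp))

var nextOffset = 0L
val it = consumer.poll(1000L).records(tp).iterator()
while (it.hasNext) {
  val r = it.next()
  // On a compacted topic r.offset() can jump past nextOffset; tracking the
  // returned record's offset tolerates the gaps instead of asserting offset + 1.
  nextOffset = r.offset() + 1
}
{code}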



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17243) Spark 2.0 history server summary page gets stuck at "loading history summary" with 10K+ application history

2016-08-25 Thread Alex Bozarth (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437668#comment-15437668
 ] 

Alex Bozarth edited comment on SPARK-17243 at 8/25/16 9:05 PM:
---

I'm not sure I agree that this should be a blocker, but I was actually planning 
on filing a JIRA and starting work on a PR next month (September) that will 
switch the history server to load full application data only when an 
application UI is opened, loading only application metadata on the initial load 
of the history server. This is just one of many problems that would be fixed by 
such a change. I won't have the bandwidth to start working on it for another 
week or two though.

tl;dr I plan to fix this, but not until next month


was (Author: ajbozarth):
I'm not sure I agree that this should be a blocker, but I was actually planning 
on filing a JIRA and starting work on a pr next month (September) that will 
switch the history server to only load application data when an application ui 
is opened and only loading application metadata on the initial load of the 
history server. This is just one of many problems that would be fixed by such a 
change. I won't have the bandwidth to start working on it for another week or 
two though.

> Spark 2.0 history server summary page gets stuck at "loading history summary" 
> with 10K+ application history
> ---
>
> Key: SPARK-17243
> URL: https://issues.apache.org/jira/browse/SPARK-17243
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0
> Environment: Linux
>Reporter: Gang Wu
>Priority: Blocker
>
> The summary page of Spark 2.0 history server web UI keep displaying "Loading 
> history summary..." all the time and crashes the browser when there are more 
> than 10K application history event logs on HDFS. 
> I did some investigation, "historypage.js" file sends a REST request to 
> /api/v1/applications endpoint of history server REST endpoint and gets back 
> json response. When there are more than 10K applications inside the event log 
> directory it takes forever to parse them and render the page. When there are 
> only hundreds or thousands of application history it is running fine.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17243) Spark 2.0 history server summary page gets stuck at "loading history summary" with 10K+ application history

2016-08-25 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-17243:
--
Priority: Major  (was: Blocker)

Yes, should not be assigned as a Blocker.

> Spark 2.0 history server summary page gets stuck at "loading history summary" 
> with 10K+ application history
> ---
>
> Key: SPARK-17243
> URL: https://issues.apache.org/jira/browse/SPARK-17243
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0
> Environment: Linux
>Reporter: Gang Wu
>
> The summary page of Spark 2.0 history server web UI keep displaying "Loading 
> history summary..." all the time and crashes the browser when there are more 
> than 10K application history event logs on HDFS. 
> I did some investigation, "historypage.js" file sends a REST request to 
> /api/v1/applications endpoint of history server REST endpoint and gets back 
> json response. When there are more than 10K applications inside the event log 
> directory it takes forever to parse them and render the page. When there are 
> only hundreds or thousands of application history it is running fine.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17243) Spark 2.0 history server summary page gets stuck at "loading history summary" with 10K+ application history

2016-08-25 Thread Alex Bozarth (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437668#comment-15437668
 ] 

Alex Bozarth commented on SPARK-17243:
--

I'm not sure I agree that this should be a blocker, but I was actually planning 
on filing a JIRA and starting work on a PR next month (September) that will 
switch the history server to load full application data only when an 
application UI is opened, loading only application metadata on the initial load 
of the history server. This is just one of many problems that would be fixed by 
such a change. I won't have the bandwidth to start working on it for another 
week or two though.

> Spark 2.0 history server summary page gets stuck at "loading history summary" 
> with 10K+ application history
> ---
>
> Key: SPARK-17243
> URL: https://issues.apache.org/jira/browse/SPARK-17243
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0
> Environment: Linux
>Reporter: Gang Wu
>Priority: Blocker
>
> The summary page of Spark 2.0 history server web UI keep displaying "Loading 
> history summary..." all the time and crashes the browser when there are more 
> than 10K application history event logs on HDFS. 
> I did some investigation, "historypage.js" file sends a REST request to 
> /api/v1/applications endpoint of history server REST endpoint and gets back 
> json response. When there are more than 10K applications inside the event log 
> directory it takes forever to parse them and render the page. When there are 
> only hundreds or thousands of application history it is running fine.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17243) Spark 2.0 history server summary page gets stuck at "loading history summary" with 10K+ application history

2016-08-25 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437653#comment-15437653
 ] 

Sean Owen commented on SPARK-17243:
---

Related, but not identical: https://issues.apache.org/jira/browse/SPARK-15083

> Spark 2.0 history server summary page gets stuck at "loading history summary" 
> with 10K+ application history
> ---
>
> Key: SPARK-17243
> URL: https://issues.apache.org/jira/browse/SPARK-17243
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0
> Environment: Linux
>Reporter: Gang Wu
>Priority: Blocker
>
> The summary page of Spark 2.0 history server web UI keep displaying "Loading 
> history summary..." all the time and crashes the browser when there are more 
> than 10K application history event logs on HDFS. 
> I did some investigation, "historypage.js" file sends a REST request to 
> /api/v1/applications endpoint of history server REST endpoint and gets back 
> json response. When there are more than 10K applications inside the event log 
> directory it takes forever to parse them and render the page. When there are 
> only hundreds or thousands of application history it is running fine.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17244) Joins should not pushdown non-deterministic conditions

2016-08-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17244:


Assignee: Apache Spark

> Joins should not pushdown non-deterministic conditions
> --
>
> Key: SPARK-17244
> URL: https://issues.apache.org/jira/browse/SPARK-17244
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Sameer Agarwal
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17244) Joins should not pushdown non-deterministic conditions

2016-08-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437649#comment-15437649
 ] 

Apache Spark commented on SPARK-17244:
--

User 'sameeragarwal' has created a pull request for this issue:
https://github.com/apache/spark/pull/14815
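
For context (illustrative only, not taken from the linked PR), a tiny example of 
the kind of condition this issue is about; if the rand() predicate were pushed 
below the join as a filter on one side, it would be evaluated over a different 
set of rows than the join output and the semantics would change:

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.rand

val spark = SparkSession.builder().master("local[*]").appName("nondeterministic-join").getOrCreate()
import spark.implicits._

val left = Seq(1, 2, 3).toDF("a")
val right = Seq(2, 3, 4).toDF("b")

// rand() is non-deterministic, so it must stay in the join condition rather than
// being pushed down as a filter on either side.
val joined = left.join(right, $"a" === $"b" && rand() > 0.5)
joined.explain(true)
{code}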

> Joins should not pushdown non-deterministic conditions
> --
>
> Key: SPARK-17244
> URL: https://issues.apache.org/jira/browse/SPARK-17244
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Sameer Agarwal
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17244) Joins should not pushdown non-deterministic conditions

2016-08-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17244:


Assignee: (was: Apache Spark)

> Joins should not pushdown non-deterministic conditions
> --
>
> Key: SPARK-17244
> URL: https://issues.apache.org/jira/browse/SPARK-17244
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Sameer Agarwal
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17163) Merge MLOR into a single LOR interface

2016-08-25 Thread DB Tsai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437642#comment-15437642
 ] 

DB Tsai commented on SPARK-17163:
-

Maybe we can store them in the same format and, for scoring, convert to the 
pivoted version for BLOR. That way at least the storage will be unified. 
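
A quick sanity check on why the pivoted form is equivalent in the two-class case 
(just the standard softmax-to-sigmoid reduction, not anything from the design doc):

{noformat}
P(y=1|x) = exp(b1 + w1.x) / (exp(b0 + w0.x) + exp(b1 + w1.x))
         = 1 / (1 + exp(-((b1 - b0) + (w1 - w0).x)))
=> binomial coefficients = w1 - w0, binomial intercept = b1 - b0
{noformat}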

> Merge MLOR into a single LOR interface
> --
>
> Key: SPARK-17163
> URL: https://issues.apache.org/jira/browse/SPARK-17163
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Seth Hendrickson
>
> Before the 2.1 release, we should finalize the API for logistic regression. 
> After SPARK-7159, we have both LogisticRegression and 
> MultinomialLogisticRegression models. This may be confusing to users and, is 
> a bit superfluous since MLOR can do basically all of what BLOR does. We 
> should decide if it needs to be changed and implement those changes before 2.1
> *Update*: Seems we have decided to merge the two estimators. I changed the 
> title to reflect that.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17244) Joins should not pushdown non-deterministic conditions

2016-08-25 Thread Sameer Agarwal (JIRA)
Sameer Agarwal created SPARK-17244:
--

 Summary: Joins should not pushdown non-deterministic conditions
 Key: SPARK-17244
 URL: https://issues.apache.org/jira/browse/SPARK-17244
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Sameer Agarwal






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17243) Spark 2.0 history server summary page gets stuck at "loading history summary" with 10K+ application history

2016-08-25 Thread Gang Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gang Wu updated SPARK-17243:

Summary: Spark 2.0 history server summary page gets stuck at "loading 
history summary" with 10K+ application history  (was: Spark history server 
summary page gets stuck at "loading history summary" with 10K+ application 
history)

> Spark 2.0 history server summary page gets stuck at "loading history summary" 
> with 10K+ application history
> ---
>
> Key: SPARK-17243
> URL: https://issues.apache.org/jira/browse/SPARK-17243
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0
> Environment: Linux
>Reporter: Gang Wu
>Priority: Blocker
>
> The summary page of Spark 2.0 history server web UI keep displaying "Loading 
> history summary..." all the time and crashes the browser when there are more 
> than 10K application history event logs on HDFS. 
> I did some investigation, "historypage.js" file sends a REST request to 
> /api/v1/applications endpoint of history server REST endpoint and gets back 
> json response. When there are more than 10K applications inside the event log 
> directory it takes forever to parse them and render the page. When there are 
> only hundreds or thousands of application history it is running fine.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17243) Spark history server summary page gets stuck at "loading history summary" with 10K+ application history

2016-08-25 Thread Gang Wu (JIRA)
Gang Wu created SPARK-17243:
---

 Summary: Spark history server summary page gets stuck at "loading 
history summary" with 10K+ application history
 Key: SPARK-17243
 URL: https://issues.apache.org/jira/browse/SPARK-17243
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 2.0.0
 Environment: Linux
Reporter: Gang Wu
Priority: Blocker


The summary page of the Spark history server web UI keeps displaying "Loading 
history summary..." indefinitely and crashes the browser when there are more 
than 10K application history event logs on HDFS. 

I did some investigation: "historypage.js" sends a REST request to the 
/api/v1/applications endpoint of the history server and gets back a JSON 
response. When there are more than 10K applications inside the event log 
directory it takes forever to parse the response and render the page. With only 
hundreds or thousands of applications it runs fine.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17243) Spark history server summary page gets stuck at "loading history summary" with 10K+ application history

2016-08-25 Thread Gang Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gang Wu updated SPARK-17243:

Description: 
The summary page of the Spark 2.0 history server web UI keeps displaying "Loading 
history summary..." indefinitely and crashes the browser when there are more 
than 10K application history event logs on HDFS. 

I did some investigation: "historypage.js" sends a REST request to the 
/api/v1/applications endpoint of the history server and gets back a JSON 
response. When there are more than 10K applications inside the event log 
directory it takes forever to parse the response and render the page. With only 
hundreds or thousands of applications it runs fine.

  was:
The summary page of Spark history server web UI keep displaying "Loading 
history summary..." all the time and crashes the browser when there are more 
than 10K application history event logs on HDFS. 

I did some investigation, "historypage.js" file sends a REST request to 
/api/v1/applications endpoint of history server REST endpoint and gets back 
json response. When there are more than 10K applications inside the event log 
directory it takes forever to parse them and render the page. When there are 
only hundreds or thousands of application history it is running fine.


> Spark history server summary page gets stuck at "loading history summary" 
> with 10K+ application history
> ---
>
> Key: SPARK-17243
> URL: https://issues.apache.org/jira/browse/SPARK-17243
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0
> Environment: Linux
>Reporter: Gang Wu
>Priority: Blocker
>
> The summary page of Spark 2.0 history server web UI keep displaying "Loading 
> history summary..." all the time and crashes the browser when there are more 
> than 10K application history event logs on HDFS. 
> I did some investigation, "historypage.js" file sends a REST request to 
> /api/v1/applications endpoint of history server REST endpoint and gets back 
> json response. When there are more than 10K applications inside the event log 
> directory it takes forever to parse them and render the page. When there are 
> only hundreds or thousands of application history it is running fine.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14241) Output of monotonically_increasing_id lacks stable relation with rows of DataFrame

2016-08-25 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437616#comment-15437616
 ] 

Nicholas Chammas commented on SPARK-14241:
--

[~marmbrus] - Would it be tough to make this function deterministic, or somehow 
"stable"? The linked Stack Overflow question shows some pretty surprising 
behavior from an end-user perspective.

If this would be tough to change, what are some alternatives you would 
recommend?

Do you think, for example, it would be possible to make a window function that 
_is_ deterministic and does effectively the same thing? Maybe something like 
{{row_number()}}, except the {{WindowSpec}} would not need to specify any 
partitioning or ordering. (Required ordering would be the main downside of 
using {{row_number()}} instead of {{monotonically_increasing_id()}}.)

> Output of monotonically_increasing_id lacks stable relation with rows of 
> DataFrame
> --
>
> Key: SPARK-14241
> URL: https://issues.apache.org/jira/browse/SPARK-14241
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 1.6.0, 1.6.1
>Reporter: Paul Shearer
>
> If you use monotonically_increasing_id() to append a column of IDs to a 
> DataFrame, the IDs do not have a stable, deterministic relationship to the 
> rows they are appended to. A given ID value can land on different rows 
> depending on what happens in the task graph:
> http://stackoverflow.com/questions/35705038/how-do-i-add-an-persistent-column-of-row-ids-to-spark-dataframe/35706321#35706321
> From a user perspective this behavior is very unexpected, and many things one 
> would normally like to do with an ID column are in fact only possible under 
> very narrow circumstances. The function should either be made deterministic, 
> or there should be a prominent warning note in the API docs regarding its 
> behavior.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17163) Merge MLOR into a single LOR interface

2016-08-25 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437606#comment-15437606
 ] 

Seth Hendrickson commented on SPARK-17163:
--

If we store the binomial case with 2 x numFeatures coefficients, I wonder how 
much it will affect prediction, since it doubles the number of operations per 
row. From a code perspective, unifying the representation is much nicer, but we 
may see a performance regression. Thoughts?

> Merge MLOR into a single LOR interface
> --
>
> Key: SPARK-17163
> URL: https://issues.apache.org/jira/browse/SPARK-17163
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Seth Hendrickson
>
> Before the 2.1 release, we should finalize the API for logistic regression. 
> After SPARK-7159, we have both LogisticRegression and 
> MultinomialLogisticRegression models. This may be confusing to users and, is 
> a bit superfluous since MLOR can do basically all of what BLOR does. We 
> should decide if it needs to be changed and implement those changes before 2.1
> *Update*: Seems we have decided to merge the two estimators. I changed the 
> title to reflect that.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13525) SparkR: java.net.SocketTimeoutException: Accept timed out when running any dataframe function

2016-08-25 Thread Arihanth Jain (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437605#comment-15437605
 ] 

Arihanth Jain commented on SPARK-13525:
---

[~shivaram] I have a similar trace, please see below:

ERROR RBackendHandler: fitRModelFormula on 
org.apache.spark.ml.api.r.SparkRWrappers failed
Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
  org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 
4, test.jiffybox.net): java.net.SocketTimeoutException: Accept timed out
at java.net.PlainSocketImpl.socketAccept(Native Method)
at 
java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:404)
at java.net.ServerSocket.implAccept(ServerSocket.java:545)
at java.net.ServerSocket.accept(ServerSocket.java:513)
at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:71)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadChec

-
the only difference compared to [~vmenda]'s trace is: 
"at org.apache.spark.api.r.RRDD$.createRWorker(RRDD.scala:426)"

Does this indicate that the R workers were actually started on the worker machines? 


> SparkR: java.net.SocketTimeoutException: Accept timed out when running any 
> dataframe function
> -
>
> Key: SPARK-13525
> URL: https://issues.apache.org/jira/browse/SPARK-13525
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Shubhanshu Mishra
>  Labels: sparkr
>
> I am following the code steps from this example:
> https://spark.apache.org/docs/1.6.0/sparkr.html
> There are multiple issues: 
> 1. The head and summary and filter methods are not overridden by spark. Hence 
> I need to call them using `SparkR::` namespace.
> 2. When I try to execute the following, I get errors:
> {code}
> $> $R_HOME/bin/R
> R version 3.2.3 (2015-12-10) -- "Wooden Christmas-Tree"
> Copyright (C) 2015 The R Foundation for Statistical Computing
> Platform: x86_64-pc-linux-gnu (64-bit)
> R is free software and comes with ABSOLUTELY NO WARRANTY.
> You are welcome to redistribute it under certain conditions.
> Type 'license()' or 'licence()' for distribution details.
>   Natural language support but running in an English locale
> R is a collaborative project with many contributors.
> Type 'contributors()' for more information and
> 'citation()' on how to cite R or R packages in publications.
> Type 'demo()' for some demos, 'help()' for on-line help, or
> 'help.start()' for an HTML browser interface to help.
> Type 'q()' to quit R.
> Welcome at Fri Feb 26 16:19:35 2016 
> Attaching package: ‘SparkR’
> The following objects are masked from ‘package:base’:
> colnames, colnames<-, drop, intersect, rank, rbind, sample, subset,
> summary, transform
> Launching java with spark-submit command 
> /content/smishra8/SOFTWARE/spark/bin/spark-submit   --driver-memory "50g" 
> sparkr-shell /tmp/RtmpfBQRg6/backend_portc3bc16f09b1b 
> > df <- createDataFrame(sqlContext, iris)
> Warning messages:
> 1: In FUN(X[[i]], ...) :
>   Use Sepal_Length instead of Sepal.Length  as column name
> 2: In FUN(X[[i]], ...) :
>   Use Sepal_Width instead of Sepal.Width  as column name
> 3: In FUN(X[[i]], ...) :
>   Use Petal_Length instead of Petal.Length  as column name
> 4: In FUN(X[[i]], ...) :
>   Use Petal_Width instead of Petal.Width  as column name
> > training <- filter(df, df$Species != "setosa")
> Error in filter(df, df$Species != "setosa") : 
>   no method for coercing this S4 class to a vector
> > training <- SparkR::filter(df, df$Species != "setosa")
> > model <- SparkR::glm(Species ~ Sepal_Length + Sepal_Width, data = training, 
> > family = "binomial")
> 16/02/26 16:26:46 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.net.SocketTimeoutException: Accept timed out
> at java.net.PlainSocketImpl.socketAccept(Native Method)
> at 
> java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:398)
> at java.net.ServerSocket.implAccept(ServerSocket.java:530)
> at java.net.ServerSocket.accept(ServerSocket.java:498)
> at org.apache.spark.api.r.RRDD$.createRWorker(RRDD.scala:431)
> at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:62)
> at 

[jira] [Comment Edited] (SPARK-16998) select($"column1", explode($"column2")) is extremely slow

2016-08-25 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437598#comment-15437598
 ] 

Herman van Hovell edited comment on SPARK-16998 at 8/25/16 8:27 PM:


I still have a code generation PR lying around: 
https://github.com/apache/spark/pull/13065 That should fix a lot of the 
performance issues.

I could bring it up to date, if there are any takers.


was (Author: hvanhovell):
I still have a code generation PR lying around: 
https://github.com/apache/spark/pull/13065

I could bring it up to date.

> select($"column1", explode($"column2")) is extremely slow
> -
>
> Key: SPARK-16998
> URL: https://issues.apache.org/jira/browse/SPARK-16998
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: TobiasP
>
> Using a Dataset containing 10.000 rows, each containing null and an array of 
> 5.000 Ints, I observe the following performance (in local mode):
> {noformat}
> scala> time(ds.select(explode($"value")).sample(false, 0.001, 1).collect)
> 1.219052 seconds  
>   
> res9: Array[org.apache.spark.sql.Row] = Array([3761], [3766], [3196])
> scala> time(ds.select($"dummy", explode($"value")).sample(false, 0.001, 
> 1).collect)
> 20.219447 seconds 
>   
> res5: Array[org.apache.spark.sql.Row] = Array([null,3761], [null,3766], 
> [null,3196])
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16998) select($"column1", explode($"column2")) is extremely slow

2016-08-25 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437598#comment-15437598
 ] 

Herman van Hovell commented on SPARK-16998:
---

I still have a code generation PR lying around: 
https://github.com/apache/spark/pull/13065

I could bring it up to date.

> select($"column1", explode($"column2")) is extremely slow
> -
>
> Key: SPARK-16998
> URL: https://issues.apache.org/jira/browse/SPARK-16998
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: TobiasP
>
> Using a Dataset containing 10.000 rows, each containing null and an array of 
> 5.000 Ints, I observe the following performance (in local mode):
> {noformat}
> scala> time(ds.select(explode($"value")).sample(false, 0.001, 1).collect)
> 1.219052 seconds  
>   
> res9: Array[org.apache.spark.sql.Row] = Array([3761], [3766], [3196])
> scala> time(ds.select($"dummy", explode($"value")).sample(false, 0.001, 
> 1).collect)
> 20.219447 seconds 
>   
> res5: Array[org.apache.spark.sql.Row] = Array([null,3761], [null,3766], 
> [null,3196])
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-11085) Add support for HTTP proxy

2016-08-25 Thread SURESH CHAGANTI (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SURESH CHAGANTI updated SPARK-11085:

Comment: was deleted

(was: The following Script accepts the  "--proxy_host_port" argument

from __future__ import division, print_function, with_statement

import codecs
import hashlib
import itertools
import logging
import os
import os.path
import pipes
import random
import shutil
import string
from stat import S_IRUSR
import subprocess
import sys
import tarfile
import tempfile
import textwrap
import time
import warnings
from datetime import datetime
from optparse import OptionParser
from sys import stderr

if sys.version < "3":
from urllib2 import urlopen, Request, HTTPError
else:
from urllib.request import urlopen, Request
from urllib.error import HTTPError
raw_input = input
xrange = range

SPARK_EC2_VERSION = "1.6.2"
SPARK_EC2_DIR = os.path.dirname(os.path.realpath(__file__))

VALID_SPARK_VERSIONS = set([
"0.7.3",
"0.8.0",
"0.8.1",
"0.9.0",
"0.9.1",
"0.9.2",
"1.0.0",
"1.0.1",
"1.0.2",
"1.1.0",
"1.1.1",
"1.2.0",
"1.2.1",
"1.3.0",
"1.3.1",
"1.4.0",
"1.4.1",
"1.5.0",
"1.5.1",
"1.5.2",
"1.6.0",
"1.6.1",
"1.6.2",
])

SPARK_TACHYON_MAP = {
"1.0.0": "0.4.1",
"1.0.1": "0.4.1",
"1.0.2": "0.4.1",
"1.1.0": "0.5.0",
"1.1.1": "0.5.0",
"1.2.0": "0.5.0",
"1.2.1": "0.5.0",
"1.3.0": "0.5.0",
"1.3.1": "0.5.0",
"1.4.0": "0.6.4",
"1.4.1": "0.6.4",
"1.5.0": "0.7.1",
"1.5.1": "0.7.1",
"1.5.2": "0.7.1",
"1.6.0": "0.8.2",
"1.6.1": "0.8.2",
"1.6.2": "0.8.2",
}

DEFAULT_SPARK_VERSION = SPARK_EC2_VERSION
DEFAULT_SPARK_GITHUB_REPO = "https://github.com/apache/spark"

# Default location to get the spark-ec2 scripts (and ami-list) from
DEFAULT_SPARK_EC2_GITHUB_REPO = "https://github.com/amplab/spark-ec2"
DEFAULT_SPARK_EC2_BRANCH = "branch-1.6"


def setup_external_libs(libs):
"""
Download external libraries from PyPI to SPARK_EC2_DIR/lib/ and prepend 
them to our PATH.
"""
PYPI_URL_PREFIX = "https://pypi.python.org/packages/source"
SPARK_EC2_LIB_DIR = os.path.join(SPARK_EC2_DIR, "lib")

if not os.path.exists(SPARK_EC2_LIB_DIR):
print("Downloading external libraries that spark-ec2 needs from PyPI to 
{path}...".format(
path=SPARK_EC2_LIB_DIR
))
print("This should be a one-time operation.")
os.mkdir(SPARK_EC2_LIB_DIR)

for lib in libs:
versioned_lib_name = "{n}-{v}".format(n=lib["name"], v=lib["version"])
lib_dir = os.path.join(SPARK_EC2_LIB_DIR, versioned_lib_name)

if not os.path.isdir(lib_dir):
tgz_file_path = os.path.join(SPARK_EC2_LIB_DIR, versioned_lib_name 
+ ".tar.gz")
print(" - Downloading {lib}...".format(lib=lib["name"]))
download_stream = urlopen(

"{prefix}/{first_letter}/{lib_name}/{lib_name}-{lib_version}.tar.gz".format(
prefix=PYPI_URL_PREFIX,
first_letter=lib["name"][:1],
lib_name=lib["name"],
lib_version=lib["version"]
)
)
with open(tgz_file_path, "wb") as tgz_file:
tgz_file.write(download_stream.read())
with open(tgz_file_path, "rb") as tar:
if hashlib.md5(tar.read()).hexdigest() != lib["md5"]:
print("ERROR: Got wrong md5sum for 
{lib}.".format(lib=lib["name"]), file=stderr)
sys.exit(1)
tar = tarfile.open(tgz_file_path)
tar.extractall(path=SPARK_EC2_LIB_DIR)
tar.close()
os.remove(tgz_file_path)
print(" - Finished downloading {lib}.".format(lib=lib["name"]))
sys.path.insert(1, lib_dir)


# Only PyPI libraries are supported.
external_libs = [
{
"name": "boto",
"version": "2.34.0",
"md5": "5556223d2d0cc4d06dd4829e671dcecd"
}
]

setup_external_libs(external_libs)


import boto
from boto.ec2.blockdevicemapping import BlockDeviceMapping, BlockDeviceType, 
EBSBlockDeviceType
from boto import ec2


class UsageError(Exception):
pass


# Configure and parse our command-line arguments
def parse_args():
parser = OptionParser(
prog="spark-ec2",
version="%prog {v}".format(v=SPARK_EC2_VERSION),
usage="%prog [options]  \n\n"
+ " can be: launch, destroy, login, stop, start, get-master, 
reboot-slaves")

parser.add_option(
"-s", "--slaves", type="int", default=1,
help="Number of slaves to launch (default: %default)")
parser.add_option(
"-w", "--wait", type="int",
help="DEPRECATED (no longer necessary) - Seconds to wait for nodes to 
start")
parser.add_option(
"-k", "--key-pair",
help="Key pair to use on 

[jira] [Commented] (SPARK-17241) SparkR spark.glm should have configurable regularization parameter

2016-08-25 Thread Xin Ren (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437588#comment-15437588
 ] 

Xin Ren commented on SPARK-17241:
-

I can work on this one :)

> SparkR spark.glm should have configurable regularization parameter
> --
>
> Key: SPARK-17241
> URL: https://issues.apache.org/jira/browse/SPARK-17241
> Project: Spark
>  Issue Type: Improvement
>Reporter: Junyang Qian
>
> Spark has configurable L2 regularization parameter for generalized linear 
> regression. It is very important to have them in SparkR so that users can run 
> ridge regression.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17242) Update links of external dstream projects

2016-08-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17242:


Assignee: Apache Spark  (was: Shixiong Zhu)

> Update links of external dstream projects
> -
>
> Key: SPARK-17242
> URL: https://issues.apache.org/jira/browse/SPARK-17242
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17242) Update links of external dstream projects

2016-08-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437584#comment-15437584
 ] 

Apache Spark commented on SPARK-17242:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/14814

> Update links of external dstream projects
> -
>
> Key: SPARK-17242
> URL: https://issues.apache.org/jira/browse/SPARK-17242
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17242) Update links of external dstream projects

2016-08-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17242:


Assignee: Shixiong Zhu  (was: Apache Spark)

> Update links of external dstream projects
> -
>
> Key: SPARK-17242
> URL: https://issues.apache.org/jira/browse/SPARK-17242
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17242) Update links of external dstream projects

2016-08-25 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-17242:


 Summary: Update links of external dstream projects
 Key: SPARK-17242
 URL: https://issues.apache.org/jira/browse/SPARK-17242
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14586) SparkSQL doesn't parse decimal like Hive

2016-08-25 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437548#comment-15437548
 ] 

Dongjoon Hyun commented on SPARK-14586:
---

Hi, [~stephane.maa...@gmail.com] and [~tsuresh].

FYI, Spark 2.0.0 now supports this, as shown below.
{code}
scala> sql("create table csv_t using csv options(path '/csv')")
scala> sql("select * from csv_t").show
+---+----+
|_c0| _c1|
+---+----+
|  a| 2.0|
|   | 3.0|
+---+----+

scala> spark.version
res2: String = 2.0.0
{code}

> SparkSQL doesn't parse decimal like Hive
> 
>
> Key: SPARK-14586
> URL: https://issues.apache.org/jira/browse/SPARK-14586
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Stephane Maarek
>
> create a test_data.csv with the following
> {code:none}
> a, 2.0
> ,3.0
> {code}
> (the space is intended before the 2)
> copy the test_data.csv to hdfs:///spark_testing_2
> go in hive, run the following statements
> {code:sql}
> CREATE SCHEMA IF NOT EXISTS spark_testing;
> DROP TABLE IF EXISTS spark_testing.test_csv_2;
> CREATE EXTERNAL TABLE `spark_testing.test_csv_2`(
>   column_1 varchar(10),
>   column_2 decimal(4,2))
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY ','
> STORED AS TEXTFILE LOCATION '/spark_testing_2'
> TBLPROPERTIES('serialization.null.format'='');
> select * from spark_testing.test_csv_2;
> OK
> a   2
> NULL    3
> {code}
> As you can see, the value " 2" gets parsed correctly to 2
> Now onto Spark-shell:
> {code:java}
> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
> sqlContext.sql("select * from spark_testing.test_csv_2").show()
> +--------+--------+
> |column_1|column_2|
> +--------+--------+
> |       a|    null|
> |    null|    3.00|
> +--------+--------+
> {code}
> As you can see, the " 2" got parsed to null. Therefore Hive and Spark don't 
> have a similar parsing behavior for decimals. I wouldn't say it is a bug per 
> se, but it looks like a necessary improvement for the two engines to 
> converge. Hive version is 1.5.1
> Not sure if relevant, but Scala does parse numbers with leading space 
> correctly
> {code}
> scala> "2.0".toDouble
> res21: Double = 2.0
> scala> " 2.0".toDouble
> res22: Double = 2.0
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16501) spark.mesos.secret exposed on UI and command line

2016-08-25 Thread Alex Bozarth (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437539#comment-15437539
 ] 

Alex Bozarth commented on SPARK-16501:
--

The first problem (web UI) was fixed by 
https://github.com/apache/spark/pull/14484 as a follow-up to SPARK-16796.

> spark.mesos.secret exposed on UI and command line
> -
>
> Key: SPARK-16501
> URL: https://issues.apache.org/jira/browse/SPARK-16501
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit, Web UI
>Affects Versions: 1.6.2
>Reporter: Eric Daniel
>  Labels: security
>
> There are two related problems with spark.mesos.secret:
> 1) The web UI shows its value in the "environment" tab
> 2) Passing it as a command-line option to spark-submit (or creating a 
> SparkContext from python, with the effect of launching spark-submit)  exposes 
> it to "ps"
> I'll be happy to submit a patch but I could use some advice first.
> The first problem is easy enough, just don't show that value in the UI
> For the second problem, I'm not sure what the best solution is. A 
> "spark.mesos.secret-file" parameter would let the user store the secret in a 
> non-world-readable file. Alternatively, the mesos secret could be obtained 
> from the environment, which other users don't have access to.  Either 
> solution would work in client mode, but I don't know if they're workable in 
> cluster mode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17163) Merge MLOR into a single LOR interface

2016-08-25 Thread DB Tsai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437535#comment-15437535
 ] 

DB Tsai edited comment on SPARK-17163 at 8/25/16 7:58 PM:
--

BTW, having the name `coefficientMatrix` doesn't look as good as 
`coefficients`, but for backward compatibility it seems we don't have a choice. 
Also, we may need to handle loading old BLOR models, and have the new model 
written as a matrix and a vector for both MLOR and BLOR.


was (Author: dbtsai):
BTW, having the name as `coefficientMatrix` doesn't look good as 
`coefficients`, but for backward compatibility, seems we don't have a choice. 

> Merge MLOR into a single LOR interface
> --
>
> Key: SPARK-17163
> URL: https://issues.apache.org/jira/browse/SPARK-17163
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Seth Hendrickson
>
> Before the 2.1 release, we should finalize the API for logistic regression. 
> After SPARK-7159, we have both LogisticRegression and 
> MultinomialLogisticRegression models. This may be confusing to users and, is 
> a bit superfluous since MLOR can do basically all of what BLOR does. We 
> should decide if it needs to be changed and implement those changes before 2.1
> *Update*: Seems we have decided to merge the two estimators. I changed the 
> title to reflect that.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17163) Merge MLOR into a single LOR interface

2016-08-25 Thread DB Tsai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437535#comment-15437535
 ] 

DB Tsai commented on SPARK-17163:
-

BTW, having the name `coefficientMatrix` doesn't look as good as 
`coefficients`, but for backward compatibility it seems we don't have a choice. 

> Merge MLOR into a single LOR interface
> --
>
> Key: SPARK-17163
> URL: https://issues.apache.org/jira/browse/SPARK-17163
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Seth Hendrickson
>
> Before the 2.1 release, we should finalize the API for logistic regression. 
> After SPARK-7159, we have both LogisticRegression and 
> MultinomialLogisticRegression models. This may be confusing to users and, is 
> a bit superfluous since MLOR can do basically all of what BLOR does. We 
> should decide if it needs to be changed and implement those changes before 2.1
> *Update*: Seems we have decided to merge the two estimators. I changed the 
> title to reflect that.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17241) SparkR spark.glm should have configurable regularization parameter

2016-08-25 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437536#comment-15437536
 ] 

Shivaram Venkataraman commented on SPARK-17241:
---

+1 - This would be good to have.

Also, on a related note, is it hard to get elasticnet working in spark.glm? We 
can create a new JIRA for it if all we need is a new wrapper.


> SparkR spark.glm should have configurable regularization parameter
> --
>
> Key: SPARK-17241
> URL: https://issues.apache.org/jira/browse/SPARK-17241
> Project: Spark
>  Issue Type: Improvement
>Reporter: Junyang Qian
>
> Spark has configurable L2 regularization parameter for generalized linear 
> regression. It is very important to have them in SparkR so that users can run 
> ridge regression.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17163) Merge MLOR into a single LOR interface

2016-08-25 Thread DB Tsai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437531#comment-15437531
 ] 

DB Tsai commented on SPARK-17163:
-

Why do we need `_intercept`? Also, for the intercept, we may need to do 
something like the following so that MLOR and BLOR share the same model format, 
and thus only a single implementation of scoring is required.

{code}
// Sketch: expose the pivoted binomial view on top of the unified matrix/vector storage.
def intercept: Double = {
  if (isMultinomial) {
    throw new UnsupportedOperationException("intercept is only defined for the binomial case")
  }
  intercepts(1) - intercepts(0)
}

def coefficients: Vector = {
  if (isMultinomial) {
    throw new UnsupportedOperationException("coefficients is only defined for the binomial case")
  }
  // Pivot: class-1 row minus class-0 row of the coefficient matrix.
  Vectors.dense(Array.tabulate(coefficientMatrix.numCols) { j =>
    coefficientMatrix(1, j) - coefficientMatrix(0, j)
  })
}
{code}

> Merge MLOR into a single LOR interface
> --
>
> Key: SPARK-17163
> URL: https://issues.apache.org/jira/browse/SPARK-17163
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Seth Hendrickson
>
> Before the 2.1 release, we should finalize the API for logistic regression. 
> After SPARK-7159, we have both LogisticRegression and 
> MultinomialLogisticRegression models. This may be confusing to users and, is 
> a bit superfluous since MLOR can do basically all of what BLOR does. We 
> should decide if it needs to be changed and implement those changes before 2.1
> *Update*: Seems we have decided to merge the two estimators. I changed the 
> title to reflect that.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17240) SparkConf is Serializable but contains a non-serializable field

2016-08-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437524#comment-15437524
 ] 

Apache Spark commented on SPARK-17240:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/14813
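
As background (a generic sketch of a common pattern for this kind of failure, 
not necessarily what the linked PR does), one way to keep a non-serializable 
helper out of the serialized state is a {{@transient lazy val}} rebuilt from 
serializable inputs after deserialization:

{code}
// Helper stands in for a non-serializable field such as ConfigReader.
class Helper(settings: Map[String, String]) {
  def get(key: String): Option[String] = settings.get(key)
}

class Conf(settings: Map[String, String]) extends Serializable {
  // Not written out during serialization; re-created lazily from `settings`
  // the first time it is used after deserialization.
  @transient private lazy val helper = new Helper(settings)
  def get(key: String): Option[String] = helper.get(key)
}
{code}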

> SparkConf is Serializable but contains a non-serializable field
> ---
>
> Key: SPARK-17240
> URL: https://issues.apache.org/jira/browse/SPARK-17240
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Michael Gummelt
>
> This commit: 
> https://github.com/apache/spark/commit/5da6c4b24f512b63cd4e6ba7dd8968066a9396f5
> Added ConfigReader to SparkConf.  SparkConf is Serializable, but ConfigReader 
> is not, which results in the following exception:
> {code}
> java.io.NotSerializableException: 
> org.apache.spark.internal.config.ConfigReader
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1184)
>   at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>   at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>   at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>   at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>   at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
>   at org.apache.spark.util.Utils$.serialize(Utils.scala:134)
>   at 
> org.apache.spark.scheduler.cluster.mesos.ZookeeperMesosClusterPersistenceEngine.persist(MesosClusterPersistenceEngine.scala:111)
>   at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.submitDriver(MesosClusterScheduler.scala:170)
>   at 
> org.apache.spark.deploy.rest.mesos.MesosSubmitRequestServlet.handleSubmit(MesosRestServer.scala:126)
>   at 
> org.apache.spark.deploy.rest.SubmitRequestServlet.doPost(RestSubmissionServer.scala:265)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
>   at 
> org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:812)
>   at 
> org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:587)
>   at 
> org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
>   at 
> org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
>   at 
> org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
>   at 
> org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>   at 
> org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
>   at org.spark_project.jetty.server.Server.handle(Server.java:499)
>   at 
> org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:311)
>   at 
> org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
>   at 
> org.spark_project.jetty.io.AbstractConnection$2.run(AbstractConnection.java:544)
>   at 
> org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
>   at 
> org.spark_project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
>   at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17240) SparkConf is Serializable but contains a non-serializable field

2016-08-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17240:


Assignee: Apache Spark

> SparkConf is Serializable but contains a non-serializable field
> ---
>
> Key: SPARK-17240
> URL: https://issues.apache.org/jira/browse/SPARK-17240
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Michael Gummelt
>Assignee: Apache Spark
>
> This commit: 
> https://github.com/apache/spark/commit/5da6c4b24f512b63cd4e6ba7dd8968066a9396f5
> Added ConfigReader to SparkConf.  SparkConf is Serializable, but ConfigReader 
> is not, which results in the following exception:
> {code}
> java.io.NotSerializableException: 
> org.apache.spark.internal.config.ConfigReader
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1184)
>   at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>   at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>   at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>   at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>   at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
>   at org.apache.spark.util.Utils$.serialize(Utils.scala:134)
>   at 
> org.apache.spark.scheduler.cluster.mesos.ZookeeperMesosClusterPersistenceEngine.persist(MesosClusterPersistenceEngine.scala:111)
>   at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.submitDriver(MesosClusterScheduler.scala:170)
>   at 
> org.apache.spark.deploy.rest.mesos.MesosSubmitRequestServlet.handleSubmit(MesosRestServer.scala:126)
>   at 
> org.apache.spark.deploy.rest.SubmitRequestServlet.doPost(RestSubmissionServer.scala:265)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
>   at 
> org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:812)
>   at 
> org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:587)
>   at 
> org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
>   at 
> org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
>   at 
> org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
>   at 
> org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>   at 
> org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
>   at org.spark_project.jetty.server.Server.handle(Server.java:499)
>   at 
> org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:311)
>   at 
> org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
>   at 
> org.spark_project.jetty.io.AbstractConnection$2.run(AbstractConnection.java:544)
>   at 
> org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
>   at 
> org.spark_project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
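A minimal sketch (not taken from this ticket) of the usual way to keep a non-serializable helper inside a Serializable class: mark the field @transient and rebuild it lazily after deserialization. ConfigHolder and ConfigReaderLike below are illustrative stand-ins, not the actual SparkConf internals.

{code}
import java.io._

// Stand-in for a helper that cannot be serialized (analogous to ConfigReader).
class ConfigReaderLike(val defaults: Map[String, String])

class ConfigHolder(settings: Map[String, String]) extends Serializable {
  // @transient keeps the field out of the serialized form;
  // the lazy val rebuilds it on first use after deserialization.
  @transient private lazy val reader = new ConfigReaderLike(settings)

  def get(key: String): Option[String] = reader.defaults.get(key)
}

object RoundTrip {
  def main(args: Array[String]): Unit = {
    val holder = new ConfigHolder(Map("spark.app.name" -> "demo"))

    // Serialize and deserialize to confirm no NotSerializableException is thrown.
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(holder)
    out.close()

    val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    val copy = in.readObject().asInstanceOf[ConfigHolder]
    println(copy.get("spark.app.name"))  // Some(demo)
  }
}
{code}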



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17240) SparkConf is Serializable but contains a non-serializable field

2016-08-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17240:


Assignee: (was: Apache Spark)

> SparkConf is Serializable but contains a non-serializable field
> ---
>
> Key: SPARK-17240
> URL: https://issues.apache.org/jira/browse/SPARK-17240
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Michael Gummelt
>
> This commit: 
> https://github.com/apache/spark/commit/5da6c4b24f512b63cd4e6ba7dd8968066a9396f5
> Added ConfigReader to SparkConf.  SparkConf is Serializable, but ConfigReader 
> is not, which results in the following exception:
> {code}
> java.io.NotSerializableException: 
> org.apache.spark.internal.config.ConfigReader
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1184)
>   at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>   at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>   at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>   at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>   at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
>   at org.apache.spark.util.Utils$.serialize(Utils.scala:134)
>   at 
> org.apache.spark.scheduler.cluster.mesos.ZookeeperMesosClusterPersistenceEngine.persist(MesosClusterPersistenceEngine.scala:111)
>   at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.submitDriver(MesosClusterScheduler.scala:170)
>   at 
> org.apache.spark.deploy.rest.mesos.MesosSubmitRequestServlet.handleSubmit(MesosRestServer.scala:126)
>   at 
> org.apache.spark.deploy.rest.SubmitRequestServlet.doPost(RestSubmissionServer.scala:265)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
>   at 
> org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:812)
>   at 
> org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:587)
>   at 
> org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
>   at 
> org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
>   at 
> org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
>   at 
> org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>   at 
> org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
>   at org.spark_project.jetty.server.Server.handle(Server.java:499)
>   at 
> org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:311)
>   at 
> org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
>   at 
> org.spark_project.jetty.io.AbstractConnection$2.run(AbstractConnection.java:544)
>   at 
> org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
>   at 
> org.spark_project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
>   at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17163) Merge MLOR into a single LOR interface

2016-08-25 Thread DB Tsai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437400#comment-15437400
 ] 

DB Tsai edited comment on SPARK-17163 at 8/25/16 7:43 PM:
--

Why would it break the API if we throw an exception when coefficients and 
intercept are called on MLOR models?

Also, for BLOR, how will you store the actual representation? Store the 
coefficients in coefficientsMatrix as a 2 x numFeatures matrix and the 
intercepts as an array of length 2? I think this is nice, since we would have a 
single representation for BLOR and MLOR.

Thanks.


was (Author: dbtsai):
Why would it break the API if we throw an exception when coefficients and 
intercept are called on BLOR models?

Also, for BLOR, how will you store the actual representation? Store the 
coefficients in coefficientsMatrix as a 2 x numFeatures matrix and the 
intercepts as an array of length 2? I think this is nice, since we would have a 
single representation for BLOR and MLOR.

Thanks.

> Merge MLOR into a single LOR interface
> --
>
> Key: SPARK-17163
> URL: https://issues.apache.org/jira/browse/SPARK-17163
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Seth Hendrickson
>
> Before the 2.1 release, we should finalize the API for logistic regression. 
> After SPARK-7159, we have both LogisticRegression and 
> MultinomialLogisticRegression models. This may be confusing to users and is 
> a bit superfluous, since MLOR can do basically all of what BLOR does. We 
> should decide if it needs to be changed and implement those changes before 2.1.
> *Update*: Seems we have decided to merge the two estimators. I changed the 
> title to reflect that.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17241) SparkR spark.glm should have configurable regularization parameter

2016-08-25 Thread Junyang Qian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junyang Qian updated SPARK-17241:
-
Summary: SparkR spark.glm should have configurable regularization parameter 
 (was: SparkR spark.glm should have configurable regularization parameter(s))

> SparkR spark.glm should have configurable regularization parameter
> --
>
> Key: SPARK-17241
> URL: https://issues.apache.org/jira/browse/SPARK-17241
> Project: Spark
>  Issue Type: Improvement
>Reporter: Junyang Qian
>
> Spark has a configurable L2 regularization parameter for linear regression and 
> an additional elastic-net parameter for the generalized linear model. It is very 
> important to have them in SparkR so that users can run ridge regression and 
> elastic-net.
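For context, the Scala estimator that spark.glm delegates to already exposes an L2 regularization parameter, so the SparkR change is mostly about plumbing the argument through. A small sketch against the existing ML API (the column names and the 0.3 value are illustrative):

{code}
import org.apache.spark.ml.regression.GeneralizedLinearRegression

// regParam is the L2 (ridge) penalty; exposing it from spark.glm would let
// SparkR users fit ridge regression without dropping down to Scala.
val glr = new GeneralizedLinearRegression()
  .setFamily("gaussian")
  .setLink("identity")
  .setRegParam(0.3)
  .setFeaturesCol("features")
  .setLabelCol("label")

// val model = glr.fit(trainingDF)  // trainingDF: a DataFrame with label/features columns
{code}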



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17241) SparkR spark.glm should have configurable regularization parameter

2016-08-25 Thread Junyang Qian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junyang Qian updated SPARK-17241:
-
Description: Spark has a configurable L2 regularization parameter for 
generalized linear regression. It is very important to have it in SparkR so 
that users can run ridge regression.  (was: Spark has a configurable L2 
regularization parameter for linear regression and an additional elastic-net 
parameter for the generalized linear model. It is very important to have them 
in SparkR so that users can run ridge regression and elastic-net.)

> SparkR spark.glm should have configurable regularization parameter
> --
>
> Key: SPARK-17241
> URL: https://issues.apache.org/jira/browse/SPARK-17241
> Project: Spark
>  Issue Type: Improvement
>Reporter: Junyang Qian
>
> Spark has a configurable L2 regularization parameter for generalized linear 
> regression. It is very important to have it in SparkR so that users can run 
> ridge regression.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11085) Add support for HTTP proxy

2016-08-25 Thread SURESH CHAGANTI (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437505#comment-15437505
 ] 

SURESH CHAGANTI commented on SPARK-11085:
-

The following script accepts a "--proxy_host_port" argument:

from __future__ import division, print_function, with_statement

import codecs
import hashlib
import itertools
import logging
import os
import os.path
import pipes
import random
import shutil
import string
from stat import S_IRUSR
import subprocess
import sys
import tarfile
import tempfile
import textwrap
import time
import warnings
from datetime import datetime
from optparse import OptionParser
from sys import stderr

if sys.version < "3":
    from urllib2 import urlopen, Request, HTTPError
else:
    from urllib.request import urlopen, Request
    from urllib.error import HTTPError
    raw_input = input
    xrange = range

SPARK_EC2_VERSION = "1.6.2"
SPARK_EC2_DIR = os.path.dirname(os.path.realpath(__file__))

VALID_SPARK_VERSIONS = set([
    "0.7.3",
    "0.8.0",
    "0.8.1",
    "0.9.0",
    "0.9.1",
    "0.9.2",
    "1.0.0",
    "1.0.1",
    "1.0.2",
    "1.1.0",
    "1.1.1",
    "1.2.0",
    "1.2.1",
    "1.3.0",
    "1.3.1",
    "1.4.0",
    "1.4.1",
    "1.5.0",
    "1.5.1",
    "1.5.2",
    "1.6.0",
    "1.6.1",
    "1.6.2",
])

SPARK_TACHYON_MAP = {
    "1.0.0": "0.4.1",
    "1.0.1": "0.4.1",
    "1.0.2": "0.4.1",
    "1.1.0": "0.5.0",
    "1.1.1": "0.5.0",
    "1.2.0": "0.5.0",
    "1.2.1": "0.5.0",
    "1.3.0": "0.5.0",
    "1.3.1": "0.5.0",
    "1.4.0": "0.6.4",
    "1.4.1": "0.6.4",
    "1.5.0": "0.7.1",
    "1.5.1": "0.7.1",
    "1.5.2": "0.7.1",
    "1.6.0": "0.8.2",
    "1.6.1": "0.8.2",
    "1.6.2": "0.8.2",
}

DEFAULT_SPARK_VERSION = SPARK_EC2_VERSION
DEFAULT_SPARK_GITHUB_REPO = "https://github.com/apache/spark"

# Default location to get the spark-ec2 scripts (and ami-list) from
DEFAULT_SPARK_EC2_GITHUB_REPO = "https://github.com/amplab/spark-ec2"
DEFAULT_SPARK_EC2_BRANCH = "branch-1.6"


def setup_external_libs(libs):
    """
    Download external libraries from PyPI to SPARK_EC2_DIR/lib/ and prepend them to our PATH.
    """
    PYPI_URL_PREFIX = "https://pypi.python.org/packages/source"
    SPARK_EC2_LIB_DIR = os.path.join(SPARK_EC2_DIR, "lib")

    if not os.path.exists(SPARK_EC2_LIB_DIR):
        print("Downloading external libraries that spark-ec2 needs from PyPI to {path}...".format(
            path=SPARK_EC2_LIB_DIR
        ))
        print("This should be a one-time operation.")
        os.mkdir(SPARK_EC2_LIB_DIR)

    for lib in libs:
        versioned_lib_name = "{n}-{v}".format(n=lib["name"], v=lib["version"])
        lib_dir = os.path.join(SPARK_EC2_LIB_DIR, versioned_lib_name)

        if not os.path.isdir(lib_dir):
            tgz_file_path = os.path.join(SPARK_EC2_LIB_DIR, versioned_lib_name + ".tar.gz")
            print(" - Downloading {lib}...".format(lib=lib["name"]))
            download_stream = urlopen(
                "{prefix}/{first_letter}/{lib_name}/{lib_name}-{lib_version}.tar.gz".format(
                    prefix=PYPI_URL_PREFIX,
                    first_letter=lib["name"][:1],
                    lib_name=lib["name"],
                    lib_version=lib["version"]
                )
            )
            with open(tgz_file_path, "wb") as tgz_file:
                tgz_file.write(download_stream.read())
            with open(tgz_file_path, "rb") as tar:
                if hashlib.md5(tar.read()).hexdigest() != lib["md5"]:
                    print("ERROR: Got wrong md5sum for {lib}.".format(lib=lib["name"]), file=stderr)
                    sys.exit(1)
            tar = tarfile.open(tgz_file_path)
            tar.extractall(path=SPARK_EC2_LIB_DIR)
            tar.close()
            os.remove(tgz_file_path)
            print(" - Finished downloading {lib}.".format(lib=lib["name"]))

        sys.path.insert(1, lib_dir)


# Only PyPI libraries are supported.
external_libs = [
    {
        "name": "boto",
        "version": "2.34.0",
        "md5": "5556223d2d0cc4d06dd4829e671dcecd"
    }
]

setup_external_libs(external_libs)


import boto
from boto.ec2.blockdevicemapping import BlockDeviceMapping, BlockDeviceType, EBSBlockDeviceType
from boto import ec2


class UsageError(Exception):
    pass


# Configure and parse our command-line arguments
def parse_args():
    parser = OptionParser(
        prog="spark-ec2",
        version="%prog {v}".format(v=SPARK_EC2_VERSION),
        usage="%prog [options] <action> <cluster_name>\n\n"
              + "<action> can be: launch, destroy, login, stop, start, get-master, reboot-slaves")

    parser.add_option(
        "-s", "--slaves", type="int", default=1,
        help="Number of slaves to launch (default: %default)")
    parser.add_option(
        "-w", "--wait", type="int",
        help="DEPRECATED (no longer necessary) - Seconds to wait for nodes to start")
    parser.add_option(
        "-k", "--key-pair",
        help="Key pair to use

[jira] [Commented] (SPARK-17163) Merge MLOR into a single LOR interface

2016-08-25 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437503#comment-15437503
 ] 

Seth Hendrickson commented on SPARK-17163:
--

I misunderstood your suggestion. I think that's best - to throw an error when 
those methods are called on a multinomial model, and to return the normal 
values in the binomial case.
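A rough sketch of the model shape being discussed, assuming (for illustration only) that a binomial model keeps a single row of coefficients; the class and field names are placeholders, not the final API:

{code}
import org.apache.spark.SparkException
import org.apache.spark.ml.linalg.{Matrix, Vector}

// Sketch of a merged binomial/multinomial logistic regression model.
class MergedLogisticRegressionModel(
    val coefficientMatrix: Matrix,   // numCoefficientSets x numFeatures
    val interceptVector: Vector,
    val numClasses: Int) {

  private def isBinomial: Boolean = numClasses == 2

  // Backwards-compatible accessors, defined only in the binomial case.
  def coefficients: Vector = {
    if (!isBinomial) {
      throw new SparkException(
        "coefficients is only defined for binomial models; use coefficientMatrix instead.")
    }
    coefficientMatrix.rowIter.next()
  }

  def intercept: Double = {
    if (!isBinomial) {
      throw new SparkException(
        "intercept is only defined for binomial models; use interceptVector instead.")
    }
    interceptVector(0)
  }
}
{code}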

> Merge MLOR into a single LOR interface
> --
>
> Key: SPARK-17163
> URL: https://issues.apache.org/jira/browse/SPARK-17163
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Seth Hendrickson
>
> Before the 2.1 release, we should finalize the API for logistic regression. 
> After SPARK-7159, we have both LogisticRegression and 
> MultinomialLogisticRegression models. This may be confusing to users and is 
> a bit superfluous, since MLOR can do basically all of what BLOR does. We 
> should decide if it needs to be changed and implement those changes before 2.1.
> *Update*: Seems we have decided to merge the two estimators. I changed the 
> title to reflect that.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17237) DataFrame fill after pivot causing org.apache.spark.sql.AnalysisException

2016-08-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17237:


Assignee: (was: Apache Spark)

> DataFrame fill after pivot causing org.apache.spark.sql.AnalysisException
> -
>
> Key: SPARK-17237
> URL: https://issues.apache.org/jira/browse/SPARK-17237
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Jiang Qiqi
>  Labels: newbie
>
> I am trying to run a pivot transformation that worked on a Spark 1.6 cluster, 
> namely:
> scala> sc.parallelize(Seq((2,3,4), (3,4,5))).toDF("a", "b", "c")
> res1: org.apache.spark.sql.DataFrame = [a: int, b: int, c: int]
> scala> res1.groupBy("a").pivot("b").agg(count("c"), avg("c")).na.fill(0)
> res2: org.apache.spark.sql.DataFrame = [a: int, 3_count(c): bigint, 3_avg(c): 
> double, 4_count(c): bigint, 4_avg(c): double]
> scala> res1.groupBy("a").pivot("b").agg(count("c"), avg("c")).na.fill(0).show
> +---+----------+--------+----------+--------+
> |  a|3_count(c)|3_avg(c)|4_count(c)|4_avg(c)|
> +---+----------+--------+----------+--------+
> |  2|         1|     4.0|         0|     0.0|
> |  3|         0|     0.0|         1|     5.0|
> +---+----------+--------+----------+--------+
> After upgrading the environment to Spark 2.0, I got an error while executing 
> the .na.fill method:
> scala> sc.parallelize(Seq((2,3,4), (3,4,5))).toDF("a", "b", "c")
> res3: org.apache.spark.sql.DataFrame = [a: int, b: int ... 1 more field]
> scala> res3.groupBy("a").pivot("b").agg(count("c"), avg("c")).na.fill(0)
> org.apache.spark.sql.AnalysisException: syntax error in attribute name: 
> `3_count(`c`)`;
>   at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.e$1(unresolved.scala:103)
>   at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.parseAttributeName(unresolved.scala:113)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:168)
>   at org.apache.spark.sql.Dataset.resolve(Dataset.scala:218)
>   at org.apache.spark.sql.Dataset.col(Dataset.scala:921)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions.org$apache$spark$sql$DataFrameNaFunctions$$fillCol(DataFrameNaFunctions.scala:411)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions$$anonfun$2.apply(DataFrameNaFunctions.scala:162)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions$$anonfun$2.apply(DataFrameNaFunctions.scala:159)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:159)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:149)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:134)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17237) DataFrame fill after pivot causing org.apache.spark.sql.AnalysisException

2016-08-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437500#comment-15437500
 ] 

Apache Spark commented on SPARK-17237:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/14812

> DataFrame fill after pivot causing org.apache.spark.sql.AnalysisException
> -
>
> Key: SPARK-17237
> URL: https://issues.apache.org/jira/browse/SPARK-17237
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Jiang Qiqi
>  Labels: newbie
>
> I am trying to run a pivot transformation that worked on a Spark 1.6 cluster, 
> namely:
> scala> sc.parallelize(Seq((2,3,4), (3,4,5))).toDF("a", "b", "c")
> res1: org.apache.spark.sql.DataFrame = [a: int, b: int, c: int]
> scala> res1.groupBy("a").pivot("b").agg(count("c"), avg("c")).na.fill(0)
> res2: org.apache.spark.sql.DataFrame = [a: int, 3_count(c): bigint, 3_avg(c): 
> double, 4_count(c): bigint, 4_avg(c): double]
> scala> res1.groupBy("a").pivot("b").agg(count("c"), avg("c")).na.fill(0).show
> +---+----------+--------+----------+--------+
> |  a|3_count(c)|3_avg(c)|4_count(c)|4_avg(c)|
> +---+----------+--------+----------+--------+
> |  2|         1|     4.0|         0|     0.0|
> |  3|         0|     0.0|         1|     5.0|
> +---+----------+--------+----------+--------+
> After upgrading the environment to Spark 2.0, I got an error while executing 
> the .na.fill method:
> scala> sc.parallelize(Seq((2,3,4), (3,4,5))).toDF("a", "b", "c")
> res3: org.apache.spark.sql.DataFrame = [a: int, b: int ... 1 more field]
> scala> res3.groupBy("a").pivot("b").agg(count("c"), avg("c")).na.fill(0)
> org.apache.spark.sql.AnalysisException: syntax error in attribute name: 
> `3_count(`c`)`;
>   at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.e$1(unresolved.scala:103)
>   at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.parseAttributeName(unresolved.scala:113)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:168)
>   at org.apache.spark.sql.Dataset.resolve(Dataset.scala:218)
>   at org.apache.spark.sql.Dataset.col(Dataset.scala:921)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions.org$apache$spark$sql$DataFrameNaFunctions$$fillCol(DataFrameNaFunctions.scala:411)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions$$anonfun$2.apply(DataFrameNaFunctions.scala:162)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions$$anonfun$2.apply(DataFrameNaFunctions.scala:159)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:159)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:149)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:134)
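A possible interim workaround (not from this ticket, and orthogonal to the fix in the pull request) is to rename the pivoted columns so they no longer contain backticks or parentheses before calling na.fill. A sketch in the spark-shell, mirroring the reproduction above:

{code}
import org.apache.spark.sql.functions.{avg, count}

val df = sc.parallelize(Seq((2, 3, 4), (3, 4, 5))).toDF("a", "b", "c")
val pivoted = df.groupBy("a").pivot("b").agg(count("c"), avg("c"))

// Strip the characters that trip up attribute-name parsing in na.fill.
val sanitized = pivoted.toDF(pivoted.columns.map(_.replaceAll("[`()]", "")): _*)

sanitized.na.fill(0).show()
{code}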



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


