[jira] [Commented] (SPARK-17557) SQL query on parquet table java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary

2016-09-15 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15495431#comment-15495431
 ] 

Hyukjin Kwon commented on SPARK-17557:
--

Do you mind if I ask for a sample file so that I can reproduce this error? If 
that is not possible, could you please let me know the steps to reproduce it? I 
would like to reproduce and test this.

If you are going to submit a PR soon, then I will just refer to your PR.
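
Not from the reporter - just a guess at the kind of mismatch the stack trace 
hints at (decodeToInt being called on a binary dictionary suggests the reader is 
told a column is INT while the Parquet file stores a string). Names and paths 
below are made up, and whether it reproduces the exact exception is unverified:

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().master("local[*]").appName("spark-17557-guess").getOrCreate()
import spark.implicits._

// Write a Parquet file whose `id` column is physically a (dictionary-encoded) string ...
Seq(("a", "1"), ("b", "2")).toDF("name", "id")
  .write.parquet("/tmp/spark17557/data")

// ... then read it back declaring `id` as INT, similar to a metastore/file type mismatch.
val schema = StructType(Seq(StructField("name", StringType), StructField("id", IntegerType)))
spark.read.schema(schema).parquet("/tmp/spark17557/data").show()
{code}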

> SQL query on parquet table java.lang.UnsupportedOperationException: 
> org.apache.parquet.column.values.dictionary.PlainValuesDictionary
> -
>
> Key: SPARK-17557
> URL: https://issues.apache.org/jira/browse/SPARK-17557
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Egor Pahomov
>
> Working on 1.6.2, broken on 2.0
> {code}
> select * from logs.a where year=2016 and month=9 and day=14 limit 100
> {code}
> {code}
> java.lang.UnsupportedOperationException: 
> org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary
>   at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:48)
>   at 
> org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getInt(OnHeapColumnVector.java:233)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> {code}






[jira] [Commented] (SPARK-17545) Spark SQL Catalyst doesn't handle ISO 8601 date without colon in offset

2016-09-15 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15495413#comment-15495413
 ] 

Hyukjin Kwon commented on SPARK-17545:
--

FYI - this is related to https://github.com/apache/spark/pull/14279 if you 
meant only 2.0 and not the master branch.

> Spark SQL Catalyst doesn't handle ISO 8601 date without colon in offset
> ---
>
> Key: SPARK-17545
> URL: https://issues.apache.org/jira/browse/SPARK-17545
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Nathan Beyer
>
> When parsing a CSV with a date/time column that contains a variant ISO 8601 
> that doesn't include a colon in the offset, casting to Timestamp fails.
> Here's a simple, example CSV content.
> {quote}
> time
> "2015-07-20T15:09:23.736-0500"
> "2015-07-20T15:10:51.687-0500"
> "2015-11-21T23:15:01.499-0600"
> {quote}
> Here's the stack trace that results from processing this data.
> {quote}
> 16/09/14 15:22:59 ERROR Utils: Aborting task
> java.lang.IllegalArgumentException: 2015-11-21T23:15:01.499-0600
>   at 
> org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.skip(Unknown 
> Source)
>   at 
> org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.parse(Unknown 
> Source)
>   at 
> org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl.(Unknown 
> Source)
>   at 
> org.apache.xerces.jaxp.datatype.DatatypeFactoryImpl.newXMLGregorianCalendar(Unknown
>  Source)
>   at 
> javax.xml.bind.DatatypeConverterImpl._parseDateTime(DatatypeConverterImpl.java:422)
>   at 
> javax.xml.bind.DatatypeConverterImpl.parseDateTime(DatatypeConverterImpl.java:417)
>   at 
> javax.xml.bind.DatatypeConverter.parseDateTime(DatatypeConverter.java:327)
>   at 
> org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:140)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:287)
> {quote}
> Somewhat related, I believe Python standard libraries can produce this form 
> of zone offset. The system I got the data from is written in Python.
> https://docs.python.org/2/library/datetime.html#strftime-strptime-behavior






[jira] [Comment Edited] (SPARK-17545) Spark SQL Catalyst doesn't handle ISO 8601 date without colon in offset

2016-09-15 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15495408#comment-15495408
 ] 

Hyukjin Kwon edited comment on SPARK-17545 at 9/16/16 5:20 AM:
---

Hi [~nbeyer], the ISO format Spark currently supports follows 
https://www.w3.org/TR/NOTE-datetime

That says

{quote}
1997-07-16T19:20:30.45+01:00
{quote}

is the right ISO format, where the time zone is

{quote}
TZD  = time zone designator (Z or +hh:mm or -hh:mm)
{quote}

To make sure, I double-checked the full ISO 8601:2004 specification at 
http://www.uai.cl/images/sitio/biblioteca/citas/ISO_8601_2004en.pdf

It says

{quote}
...
the expression shall either be completely in basic format, in which case the 
minimum number of separators necessary for the required expression is used, or 
completely in extended format, in which case additional separators shall be used
...
{quote}

where the basic format is {{20160707T211822+0300}} and the extended format is 
{{2016-07-07T21:18:22+03:00}}.

In addition, the basic format even seems to be discouraged in plain text:

{quote}
NOTE: The basic format should be avoided in plain text.
{quote}

Therefore, {{2016-07-07T21:18:22+03:00}} is valid ISO 8601:2004, whereas 
{{2016-07-07T21:18:22+0300}} is not, because the zone designator may not be in 
the basic format when the date and time of day are in the extended format.
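
For reference, a minimal sketch (not from the ticket) showing the difference 
with the JAXB parser that appears in the stack trace - the colon-less offset is 
the one that throws:

{code}
import javax.xml.bind.DatatypeConverter

// Extended-format zone designator (with colon) parses fine:
DatatypeConverter.parseDateTime("2015-11-21T23:15:01.499-06:00")

// The colon-less offset from the report is rejected with IllegalArgumentException:
DatatypeConverter.parseDateTime("2015-11-21T23:15:01.499-0600")
{code}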





was (Author: hyukjin.kwon):
Hi [~nbeyer], the basic ISO format currently follows 
https://www.w3.org/TR/NOTE-datetime

That says

{quote}
1997-07-16T19:20:30.45+01:00
{quote}

is the right ISO format where timezone is

{quote}
TZD  = time zone designator (Z or +hh:mm or -hh:mm)
{quote}

To make sure, I double-checked the ISO 8601 - 2004 full specification in 
http://www.uai.cl/images/sitio/biblioteca/citas/ISO_8601_2004en.pdf

That says,

{quote}
...
the expression shall either be completely in basic format, in which case the 
minimum number of
separators necessary for the required expression is used, or completely in 
extended format, in which case
additional separators shall be used
...
{quote}

where the basic format is {{20160707T211822+0300 }} whereas the extended format 
is {{2016-07-07T21:18:22+03:00}}.

In addition, basic format seems even discouraged in text format

{quote}
NOTE : The basic format should be avoided in plain text.
{quote}

Therefore, {{2016-07-07T21:18:22+03:00}} Is the right ISO 8601:2004.
whereas {{2016-07-07T21:18:22+0300}} Is not because the zone designator may not 
be in the basic format when the date and time of day is in the extended format.




> Spark SQL Catalyst doesn't handle ISO 8601 date without colon in offset
> ---
>
> Key: SPARK-17545
> URL: https://issues.apache.org/jira/browse/SPARK-17545
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Nathan Beyer
>
> When parsing a CSV with a date/time column that contains a variant ISO 8601 
> that doesn't include a colon in the offset, casting to Timestamp fails.
> Here's a simple, example CSV content.
> {quote}
> time
> "2015-07-20T15:09:23.736-0500"
> "2015-07-20T15:10:51.687-0500"
> "2015-11-21T23:15:01.499-0600"
> {quote}
> Here's the stack trace that results from processing this data.
> {quote}
> 16/09/14 15:22:59 ERROR Utils: Aborting task
> java.lang.IllegalArgumentException: 2015-11-21T23:15:01.499-0600
>   at 
> org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.skip(Unknown 
> Source)
>   at 
> org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.parse(Unknown 
> Source)
>   at 
> org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl.(Unknown 
> Source)
>   at 
> org.apache.xerces.jaxp.datatype.DatatypeFactoryImpl.newXMLGregorianCalendar(Unknown
>  Source)
>   at 
> javax.xml.bind.DatatypeConverterImpl._parseDateTime(DatatypeConverterImpl.java:422)
>   at 
> javax.xml.bind.DatatypeConverterImpl.parseDateTime(DatatypeConverterImpl.java:417)
>   at 
> javax.xml.bind.DatatypeConverter.parseDateTime(DatatypeConverter.java:327)
>   at 
> org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:140)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:287)
> {quote}
> Somewhat related, I believe Python standard libraries can produce this form 
> of zone offset. The system I got the data from is written in Python.
> https://docs.python.org/2/library/datetime.html#strftime-strptime-behavior






[jira] [Commented] (SPARK-17545) Spark SQL Catalyst doesn't handle ISO 8601 date without colon in offset

2016-09-15 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15495410#comment-15495410
 ] 

Hyukjin Kwon commented on SPARK-17545:
--

Therefore, IMHO, this is not an issue, as we can work around it via the 
`timestampFormat` and `dateFormat` options in the current master branch.
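
A rough sketch of that workaround (the pattern string is only an illustration; 
the CSV in the description has a header row):

{code}
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss.SSSXX")  // colon-less -0500/-0600 offsets
  .csv("times.csv")
{code}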

> Spark SQL Catalyst doesn't handle ISO 8601 date without colon in offset
> ---
>
> Key: SPARK-17545
> URL: https://issues.apache.org/jira/browse/SPARK-17545
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Nathan Beyer
>
> When parsing a CSV with a date/time column that contains a variant ISO 8601 
> that doesn't include a colon in the offset, casting to Timestamp fails.
> Here's a simple, example CSV content.
> {quote}
> time
> "2015-07-20T15:09:23.736-0500"
> "2015-07-20T15:10:51.687-0500"
> "2015-11-21T23:15:01.499-0600"
> {quote}
> Here's the stack trace that results from processing this data.
> {quote}
> 16/09/14 15:22:59 ERROR Utils: Aborting task
> java.lang.IllegalArgumentException: 2015-11-21T23:15:01.499-0600
>   at 
> org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.skip(Unknown 
> Source)
>   at 
> org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.parse(Unknown 
> Source)
>   at 
> org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl.(Unknown 
> Source)
>   at 
> org.apache.xerces.jaxp.datatype.DatatypeFactoryImpl.newXMLGregorianCalendar(Unknown
>  Source)
>   at 
> javax.xml.bind.DatatypeConverterImpl._parseDateTime(DatatypeConverterImpl.java:422)
>   at 
> javax.xml.bind.DatatypeConverterImpl.parseDateTime(DatatypeConverterImpl.java:417)
>   at 
> javax.xml.bind.DatatypeConverter.parseDateTime(DatatypeConverter.java:327)
>   at 
> org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:140)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:287)
> {quote}
> Somewhat related, I believe Python standard libraries can produce this form 
> of zone offset. The system I got the data from is written in Python.
> https://docs.python.org/2/library/datetime.html#strftime-strptime-behavior






[jira] [Commented] (SPARK-17545) Spark SQL Catalyst doesn't handle ISO 8601 date without colon in offset

2016-09-15 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15495408#comment-15495408
 ] 

Hyukjin Kwon commented on SPARK-17545:
--

Hi [~nbeyer], the ISO format Spark currently supports follows 
https://www.w3.org/TR/NOTE-datetime

That says

{quote}
1997-07-16T19:20:30.45+01:00
{quote}

is the right ISO format, where the time zone is

{quote}
TZD  = time zone designator (Z or +hh:mm or -hh:mm)
{quote}

To make sure, I double-checked the full ISO 8601:2004 specification at 
http://www.uai.cl/images/sitio/biblioteca/citas/ISO_8601_2004en.pdf

It says

{quote}
...
the expression shall either be completely in basic format, in which case the 
minimum number of separators necessary for the required expression is used, or 
completely in extended format, in which case additional separators shall be used
...
{quote}

where the basic format is {{20160707T211822+0300}} and the extended format is 
{{2016-07-07T21:18:22+03:00}}.

In addition, the basic format even seems to be discouraged in plain text:

{quote}
NOTE: The basic format should be avoided in plain text.
{quote}

Therefore, {{2016-07-07T21:18:22+03:00}} is valid ISO 8601:2004, whereas 
{{2016-07-07T21:18:22+0300}} is not, because the zone designator may not be in 
the basic format when the date and time of day are in the extended format.




> Spark SQL Catalyst doesn't handle ISO 8601 date without colon in offset
> ---
>
> Key: SPARK-17545
> URL: https://issues.apache.org/jira/browse/SPARK-17545
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Nathan Beyer
>
> When parsing a CSV with a date/time column that contains a variant ISO 8601 
> that doesn't include a colon in the offset, casting to Timestamp fails.
> Here's a simple, example CSV content.
> {quote}
> time
> "2015-07-20T15:09:23.736-0500"
> "2015-07-20T15:10:51.687-0500"
> "2015-11-21T23:15:01.499-0600"
> {quote}
> Here's the stack trace that results from processing this data.
> {quote}
> 16/09/14 15:22:59 ERROR Utils: Aborting task
> java.lang.IllegalArgumentException: 2015-11-21T23:15:01.499-0600
>   at 
> org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.skip(Unknown 
> Source)
>   at 
> org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.parse(Unknown 
> Source)
>   at 
> org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl.(Unknown 
> Source)
>   at 
> org.apache.xerces.jaxp.datatype.DatatypeFactoryImpl.newXMLGregorianCalendar(Unknown
>  Source)
>   at 
> javax.xml.bind.DatatypeConverterImpl._parseDateTime(DatatypeConverterImpl.java:422)
>   at 
> javax.xml.bind.DatatypeConverterImpl.parseDateTime(DatatypeConverterImpl.java:417)
>   at 
> javax.xml.bind.DatatypeConverter.parseDateTime(DatatypeConverter.java:327)
>   at 
> org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:140)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:287)
> {quote}
> Somewhat related, I believe Python standard libraries can produce this form 
> of zone offset. The system I got the data from is written in Python.
> https://docs.python.org/2/library/datetime.html#strftime-strptime-behavior






[jira] [Assigned] (SPARK-17559) PeriodicGraphCheckpointer did not persist edges as expected in some cases

2016-09-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17559:


Assignee: (was: Apache Spark)

> PeriodicGraphCheckpointer did not persist edges as expected in some cases
> 
>
> Key: SPARK-17559
> URL: https://issues.apache.org/jira/browse/SPARK-17559
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Reporter: ding
>Priority: Minor
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> When PeriodicGraphCheckpointer is used to persist a graph, sometimes the edges 
> aren't persisted. Currently the graph is persisted only when the vertices' 
> storage level is none; however, there is a chance that the vertices' storage 
> level is not none while the edges' is none, e.g. for a graph created by an 
> outerJoinVertices operation, where the vertices are automatically cached while 
> the edges are not. In that case, the edges will not be persisted when 
> PeriodicGraphCheckpointer is used to persist the graph.
> See the minimal example below:
>val graphCheckpointer = new PeriodicGraphCheckpointer[Array[String], 
> Int](2, sc)
> val users = sc.textFile("data/graphx/users.txt")
>   .map(line => line.split(",")).map(parts => (parts.head.toLong, 
> parts.tail))
> val followerGraph = GraphLoader.edgeListFile(sc, 
> "data/graphx/followers.txt")
> val graph = followerGraph.outerJoinVertices(users) {
>   case (uid, deg, Some(attrList)) => attrList
>   case (uid, deg, None) => Array.empty[String]
> }
> graphCheckpointer.update(graph)






[jira] [Commented] (SPARK-17559) PeriodicGraphCheckpointer did not persist edges as expected in some cases

2016-09-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15495370#comment-15495370
 ] 

Apache Spark commented on SPARK-17559:
--

User 'dding3' has created a pull request for this issue:
https://github.com/apache/spark/pull/15116

> PeriodicGraphCheckpointer did not persist edges as expected in some cases
> 
>
> Key: SPARK-17559
> URL: https://issues.apache.org/jira/browse/SPARK-17559
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Reporter: ding
>Priority: Minor
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> When PeriodicGraphCheckpointer is used to persist a graph, sometimes the edges 
> aren't persisted. Currently the graph is persisted only when the vertices' 
> storage level is none; however, there is a chance that the vertices' storage 
> level is not none while the edges' is none, e.g. for a graph created by an 
> outerJoinVertices operation, where the vertices are automatically cached while 
> the edges are not. In that case, the edges will not be persisted when 
> PeriodicGraphCheckpointer is used to persist the graph.
> See the minimal example below:
>val graphCheckpointer = new PeriodicGraphCheckpointer[Array[String], 
> Int](2, sc)
> val users = sc.textFile("data/graphx/users.txt")
>   .map(line => line.split(",")).map(parts => (parts.head.toLong, 
> parts.tail))
> val followerGraph = GraphLoader.edgeListFile(sc, 
> "data/graphx/followers.txt")
> val graph = followerGraph.outerJoinVertices(users) {
>   case (uid, deg, Some(attrList)) => attrList
>   case (uid, deg, None) => Array.empty[String]
> }
> graphCheckpointer.update(graph)






[jira] [Assigned] (SPARK-17559) PeriodicGraphCheckpointer did not persist edges as expected in some cases

2016-09-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17559:


Assignee: Apache Spark

> PeriodicGraphCheckpointer did not persist edges as expected in some cases
> 
>
> Key: SPARK-17559
> URL: https://issues.apache.org/jira/browse/SPARK-17559
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Reporter: ding
>Assignee: Apache Spark
>Priority: Minor
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> When PeriodicGraphCheckpointer is used to persist a graph, sometimes the edges 
> aren't persisted. Currently the graph is persisted only when the vertices' 
> storage level is none; however, there is a chance that the vertices' storage 
> level is not none while the edges' is none, e.g. for a graph created by an 
> outerJoinVertices operation, where the vertices are automatically cached while 
> the edges are not. In that case, the edges will not be persisted when 
> PeriodicGraphCheckpointer is used to persist the graph.
> See the minimal example below:
>val graphCheckpointer = new PeriodicGraphCheckpointer[Array[String], 
> Int](2, sc)
> val users = sc.textFile("data/graphx/users.txt")
>   .map(line => line.split(",")).map(parts => (parts.head.toLong, 
> parts.tail))
> val followerGraph = GraphLoader.edgeListFile(sc, 
> "data/graphx/followers.txt")
> val graph = followerGraph.outerJoinVertices(users) {
>   case (uid, deg, Some(attrList)) => attrList
>   case (uid, deg, None) => Array.empty[String]
> }
> graphCheckpointer.update(graph)






[jira] [Created] (SPARK-17559) PeriodicGraphCheckpointer did not persist edges as expected in some cases

2016-09-15 Thread ding (JIRA)
ding created SPARK-17559:


 Summary: PeriodicGraphCheckpointer did not persist edges as 
expected in some cases
 Key: SPARK-17559
 URL: https://issues.apache.org/jira/browse/SPARK-17559
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Reporter: ding
Priority: Minor


When PeriodicGraphCheckpointer is used to persist a graph, sometimes the edges 
aren't persisted. Currently the graph is persisted only when the vertices' 
storage level is none; however, there is a chance that the vertices' storage 
level is not none while the edges' is none, e.g. for a graph created by an 
outerJoinVertices operation, where the vertices are automatically cached while 
the edges are not. In that case, the edges will not be persisted when 
PeriodicGraphCheckpointer is used to persist the graph.

See the minimal example below:

val graphCheckpointer = new PeriodicGraphCheckpointer[Array[String], Int](2, sc)
val users = sc.textFile("data/graphx/users.txt")
  .map(line => line.split(",")).map(parts => (parts.head.toLong, parts.tail))
val followerGraph = GraphLoader.edgeListFile(sc, "data/graphx/followers.txt")

val graph = followerGraph.outerJoinVertices(users) {
  case (uid, deg, Some(attrList)) => attrList
  case (uid, deg, None) => Array.empty[String]
}
graphCheckpointer.update(graph)
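
For context, a rough sketch (from memory, not the actual source) of the check 
the description refers to, and the direction a fix could take - looking at 
vertices and edges independently:

{code}
import org.apache.spark.storage.StorageLevel

// Roughly what the description says happens today: persist only if the vertices
// are unpersisted, which skips the edges when the vertices are already cached.
if (graph.vertices.getStorageLevel == StorageLevel.NONE) {
  graph.persist()
}

// One possible shape of a fix: check vertices and edges separately.
if (graph.vertices.getStorageLevel == StorageLevel.NONE) {
  graph.vertices.persist()
}
if (graph.edges.getStorageLevel == StorageLevel.NONE) {
  graph.edges.persist()
}
{code}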






[jira] [Comment Edited] (SPARK-17522) [MESOS] More even distribution of executors on Mesos cluster

2016-09-15 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15495243#comment-15495243
 ] 

Saisai Shao edited comment on SPARK-17522 at 9/16/16 3:19 AM:
--

[~sunrui] I think the performance depends on the workload. For example, if your 
workloads are mainly ETL-like, spreading out will better leverage the network 
and IO bandwidth. But in some other cases like ML, where CPU plays the dominant 
role and the input data is not large, it is better to put executors together for 
fast data exchange and iteration. You could refer to Slider for affinity and 
anti-affinity resource allocation; this could be done either in the cluster 
manager or in upstream frameworks.


was (Author: jerryshao):
[~sunrui] I think the performance is depended on different workloads. For 
example if your workloads are mainly ETL like workloads, spreading out will 
better leverage the network and IO bandwidth. But in some other cases like ML, 
in which CPU plays a dominant role while input data is not large, it is better 
to put executors together for fast data exchange and iteration. You could refer 
to Slider for affinity and anti-affinity resource allocation, should could 
either be done in cluster manager or upstream frameworks.

> [MESOS] More even distribution of executors on Mesos cluster
> 
>
> Key: SPARK-17522
> URL: https://issues.apache.org/jira/browse/SPARK-17522
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 2.0.0
>Reporter: Sun Rui
>
> MesosCoarseGrainedSchedulerBackend launches executors in a round-robin way 
> among the accepted offers that are received at once, but it is observed that 
> executors are typically launched on a small number of slaves.
> It turns out that MesosCoarseGrainedSchedulerBackend mostly receives only one 
> offer at a time on a cluster composed of many nodes, so the round-robin 
> assignment of executors among offers does not have the expected result: 
> executors end up on a smaller number of slave nodes than expected, which hurts 
> data locality.
> A small experimental change to 
> MesosCoarseGrainedSchedulerBackend::buildMesosTasks() shows a better executor 
> distribution among nodes:
> {code}
> while (launchTasks) {
>   launchTasks = false
>   for (offer <- offers) {
> ...
>}
> +  if (conf.getBoolean("spark.deploy.spreadOut", true)) {
> +    launchTasks = false
> +  }
> }
> tasks.toMap
> {code}
> One of my spark programs can run 30% faster due to this change because of 
> better data locality.






[jira] [Commented] (SPARK-17522) [MESOS] More even distribution of executors on Mesos cluster

2016-09-15 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15495243#comment-15495243
 ] 

Saisai Shao commented on SPARK-17522:
-

[~sunrui] I think the performance depends on the workload. For example, if your 
workloads are mainly ETL-like, spreading out will better leverage the network 
and IO bandwidth. But in some other cases like ML, where CPU plays the dominant 
role and the input data is not large, it is better to put executors together for 
fast data exchange and iteration. You could refer to Slider for affinity and 
anti-affinity resource allocation; this could be done either in the cluster 
manager or in upstream frameworks.

> [MESOS] More even distribution of executors on Mesos cluster
> 
>
> Key: SPARK-17522
> URL: https://issues.apache.org/jira/browse/SPARK-17522
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 2.0.0
>Reporter: Sun Rui
>
> MesosCoarseGrainedSchedulerBackend launches executors in a round-robin way 
> among the accepted offers that are received at once, but it is observed that 
> executors are typically launched on a small number of slaves.
> It turns out that MesosCoarseGrainedSchedulerBackend mostly receives only one 
> offer at a time on a cluster composed of many nodes, so the round-robin 
> assignment of executors among offers does not have the expected result: 
> executors end up on a smaller number of slave nodes than expected, which hurts 
> data locality.
> A small experimental change to 
> MesosCoarseGrainedSchedulerBackend::buildMesosTasks() shows a better executor 
> distribution among nodes:
> {code}
> while (launchTasks) {
>   launchTasks = false
>   for (offer <- offers) {
> ...
>}
> +  if (conf.getBoolean("spark.deploy.spreadOut", true)) {
> +    launchTasks = false
> +  }
> }
> tasks.toMap
> {code}
> One of my spark programs can run 30% faster due to this change because of 
> better data locality.






[jira] [Commented] (SPARK-5484) Pregel should checkpoint periodically to avoid StackOverflowError

2016-09-15 Thread ding (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15495131#comment-15495131
 ] 

ding commented on SPARK-5484:
-

I will work on this issue if nobody has taken it.

> Pregel should checkpoint periodically to avoid StackOverflowError
> -
>
> Key: SPARK-5484
> URL: https://issues.apache.org/jira/browse/SPARK-5484
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Reporter: Ankur Dave
>Assignee: Ankur Dave
>
> Pregel-based iterative algorithms with more than ~50 iterations begin to slow 
> down and eventually fail with a StackOverflowError due to Spark's lack of 
> support for long lineage chains. Instead, Pregel should checkpoint the graph 
> periodically.






[jira] [Commented] (SPARK-17558) Bump Hadoop 2.7 version from 2.7.2 to 2.7.3

2016-09-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15495125#comment-15495125
 ] 

Apache Spark commented on SPARK-17558:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/15115

> Bump Hadoop 2.7 version from 2.7.2 to 2.7.3
> ---
>
> Key: SPARK-17558
> URL: https://issues.apache.org/jira/browse/SPARK-17558
> Project: Spark
>  Issue Type: New Feature
>  Components: Build
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> The new patch release fixes some bugs.






[jira] [Assigned] (SPARK-17558) Bump Hadoop 2.7 version from 2.7.2 to 2.7.3

2016-09-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17558:


Assignee: Apache Spark  (was: Reynold Xin)

> Bump Hadoop 2.7 version from 2.7.2 to 2.7.3
> ---
>
> Key: SPARK-17558
> URL: https://issues.apache.org/jira/browse/SPARK-17558
> Project: Spark
>  Issue Type: New Feature
>  Components: Build
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> The new patch release fixes some bugs.






[jira] [Assigned] (SPARK-17558) Bump Hadoop 2.7 version from 2.7.2 to 2.7.3

2016-09-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17558:


Assignee: Reynold Xin  (was: Apache Spark)

> Bump Hadoop 2.7 version from 2.7.2 to 2.7.3
> ---
>
> Key: SPARK-17558
> URL: https://issues.apache.org/jira/browse/SPARK-17558
> Project: Spark
>  Issue Type: New Feature
>  Components: Build
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> The new patch release fixes some bugs.






[jira] [Created] (SPARK-17558) Bump Hadoop 2.7 version from 2.7.2 to 2.7.3

2016-09-15 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-17558:
---

 Summary: Bump Hadoop 2.7 version from 2.7.2 to 2.7.3
 Key: SPARK-17558
 URL: https://issues.apache.org/jira/browse/SPARK-17558
 Project: Spark
  Issue Type: New Feature
  Components: Build
Reporter: Reynold Xin
Assignee: Reynold Xin


The new patch release fixes some bugs.







[jira] [Commented] (SPARK-10816) API design: window and session specification

2016-09-15 Thread Tathagata Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15495032#comment-15495032
 ] 

Tathagata Das commented on SPARK-10816:
---

Yeah, it's yet to be designed.

> API design: window and session specification
> 
>
> Key: SPARK-10816
> URL: https://issues.apache.org/jira/browse/SPARK-10816
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Reporter: Reynold Xin
>







[jira] [Commented] (SPARK-15472) Add support for writing in `csv`, `json`, `text` formats in Structured Streaming

2016-09-15 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15495015#comment-15495015
 ] 

Reynold Xin commented on SPARK-15472:
-

This is done in 2.0, isn't it?

cc [~zsxwing]
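
If it is, writing should look just like the parquet file sink; a sketch with 
placeholder paths and a streaming DataFrame `streamingDF`:

{code}
val query = streamingDF.writeStream
  .format("json")                                              // or "csv" / "text"
  .option("checkpointLocation", "/tmp/checkpoints/json-sink")
  .start("/tmp/output/json-sink")
{code}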

> Add support for writing in `csv`, `json`, `text` formats in Structured 
> Streaming
> 
>
> Key: SPARK-15472
> URL: https://issues.apache.org/jira/browse/SPARK-15472
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Liwei Lin
>
> Support for the partitioned `parquet` format in FileStreamSink was added in 
> SPARK-14716; now let's add support for the partitioned `csv`, `json`, and 
> `text` formats.






[jira] [Updated] (SPARK-10815) API design: data sources and sinks

2016-09-15 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-10815:

Description: 
The existing source/sink interface for structured streaming depends on RDDs. 
This dependency has two issues:

1. The RDD interface is wide and difficult to stabilize across versions. This 
is similar to point 1 in https://issues.apache.org/jira/browse/SPARK-15689. 
Ideally, a source/sink implementation created for Spark 2.x should work in 
Spark 10.x, assuming the JVM is still around.

2. It is difficult to swap in/out a different execution engine.

The purpose of this ticket is to create a stable interface that addresses the 
above two.


> API design: data sources and sinks
> --
>
> Key: SPARK-10815
> URL: https://issues.apache.org/jira/browse/SPARK-10815
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Reporter: Reynold Xin
>
> The existing source/sink interface for structured streaming depends on RDDs. 
> This dependency has two issues:
> 1. The RDD interface is wide and difficult to stabilize across versions. This 
> is similar to point 1 in https://issues.apache.org/jira/browse/SPARK-15689. 
> Ideally, a source/sink implementation created for Spark 2.x should work in 
> Spark 10.x, assuming the JVM is still around.
> 2. It is difficult to swap in/out a different execution engine.
> The purpose of this ticket is to create a stable interface that addresses the 
> above two.






[jira] [Updated] (SPARK-10815) API design: data sources and sinks

2016-09-15 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-10815:

Description: 
The existing (in 2.0) source/sink interface for structured streaming depends on 
RDDs. This dependency has two issues:

1. The RDD interface is wide and difficult to stabilize across versions. This 
is similar to point 1 in https://issues.apache.org/jira/browse/SPARK-15689. 
Ideally, a source/sink implementation created for Spark 2.x should work in 
Spark 10.x, assuming the JVM is still around.

2. It is difficult to swap in/out a different execution engine.

The purpose of this ticket is to create a stable interface that addresses the 
above two.


  was:
The existing source/sink interface for structured streaming depends on RDDs. 
This dependency has two issues:

1. The RDD interface is wide and difficult to stabilize across versions. This 
is similar to point 1 in https://issues.apache.org/jira/browse/SPARK-15689. 
Ideally, a source/sink implementation created for Spark 2.x should work in 
Spark 10.x, assuming the JVM is still around.

2. It is difficult to swap in/out a different execution engine.

The purpose of this ticket is to create a stable interface that addresses the 
above two.



> API design: data sources and sinks
> --
>
> Key: SPARK-10815
> URL: https://issues.apache.org/jira/browse/SPARK-10815
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Reporter: Reynold Xin
>
> The existing (in 2.0) source/sink interface for structured streaming depends 
> on RDDs. This dependency has two issues:
> 1. The RDD interface is wide and difficult to stabilize across versions. This 
> is similar to point 1 in https://issues.apache.org/jira/browse/SPARK-15689. 
> Ideally, a source/sink implementation created for Spark 2.x should work in 
> Spark 10.x, assuming the JVM is still around.
> 2. It is difficult to swap in/out a different execution engine.
> The purpose of this ticket is to create a stable interface that addresses the 
> above two.






[jira] [Commented] (SPARK-10816) API design: window and session specification

2016-09-15 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15495011#comment-15495011
 ] 

Reynold Xin commented on SPARK-10816:
-

I guess the window specification was done, but the session part remains to be done? 

cc [~marmbrus] [~tdas] [~zsxwing]
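
(For reference, the event-time window specification that already exists in 2.0 
looks roughly like the sketch below, where `events` is an assumed DataFrame with 
a `timestamp` column; a session-window equivalent is the missing piece.)

{code}
import org.apache.spark.sql.functions.{window, col}

// Tumbling 10-minute event-time windows over the `timestamp` column.
val counts = events
  .groupBy(window(col("timestamp"), "10 minutes"))
  .count()
{code}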

> API design: window and session specification
> 
>
> Key: SPARK-10816
> URL: https://issues.apache.org/jira/browse/SPARK-10816
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Reporter: Reynold Xin
>







[jira] [Commented] (SPARK-16407) Allow users to supply custom StreamSinkProviders

2016-09-15 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15495006#comment-15495006
 ] 

Reynold Xin commented on SPARK-16407:
-

The source/sink interface currently depends on RDDs, doesn't it? In that case, 
it has two issues:

1. The RDD interface is wide and difficult to stabilize across versions. This 
is similar to point 1 in https://issues.apache.org/jira/browse/SPARK-15689. 
Ideally, a source/sink implementation created for Spark 2.x should work in 
Spark 10.x, assuming the JVM is still around.

2. It is difficult to swap in/out a different execution engine.

Actually I'm going to move the above into SPARK-10815 and just continue the 
discussion there.


> Allow users to supply custom StreamSinkProviders
> 
>
> Key: SPARK-16407
> URL: https://issues.apache.org/jira/browse/SPARK-16407
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: holdenk
>
> The current DataStreamWriter allows users to specify a class name as the 
> format; however, it could be easier for people to directly pass in a specific 
> provider instance - e.g. for a user equivalent of ForeachSink or another sink 
> with non-string parameters.






[jira] [Updated] (SPARK-15689) Data source API v2

2016-09-15 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-15689:

Target Version/s: 2.2.0  (was: 2.1.0)

> Data source API v2
> --
>
> Key: SPARK-15689
> URL: https://issues.apache.org/jira/browse/SPARK-15689
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>
> This ticket tracks progress in creating the v2 of data source API. This new 
> API should focus on:
> 1. Have a small surface so it is easy to freeze and maintain compatibility 
> for a long time. Ideally, this API should survive architectural rewrites and 
> user-facing API revamps of Spark.
> 2. Have a well-defined column batch interface for high performance. 
> Convenience methods should exist to convert row-oriented formats into column 
> batches for data source developers.
> 3. Still support filter push down, similar to the existing API.
> Note that both 1 and 2 are problems that the current data source API (v1) 
> suffers from. The current data source API has a wide surface with a dependency 
> on DataFrame/SQLContext, making data source API compatibility depend on the 
> upper-level API. The current data source API is also row oriented only and has 
> to go through an expensive conversion from external data types to internal 
> data types.






[jira] [Commented] (SPARK-16534) Kafka 0.10 Python support

2016-09-15 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15494955#comment-15494955
 ] 

Reynold Xin commented on SPARK-16534:
-

[~maver1ck] thanks for the comment. That's a great point.

That said, there is a big difference between whether we can implement a feature 
and let users use it for demos/learning, and whether we can realistically do a 
good enough job that it is fit for 24/7 production use. It is very easy to 
promise the former and just ignore the latter.

In this context, it would be a lot simpler, in the long term, to have the 
structured streaming architecture working in production 24/7 than the dstream 
architecture, **for non-JVM languages**. To clarify, I wasn't suggesting that 
streaming in Python should be killed, but rather that dstream in Python isn't a 
good architecture for running 24/7 streaming jobs.


> Kafka 0.10 Python support
> -
>
> Key: SPARK-16534
> URL: https://issues.apache.org/jira/browse/SPARK-16534
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming
>Reporter: Tathagata Das
>







[jira] [Commented] (SPARK-16407) Allow users to supply custom StreamSinkProviders

2016-09-15 Thread Frederick Reiss (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15494795#comment-15494795
 ] 

Frederick Reiss commented on SPARK-16407:
-

With respect, I'm not seeing a whole lot of flux in those APIs, or in 
Structured Streaming in general.  SPARK-8360, the JIRA for source and sink API 
design, has no description, no comments, and no changes since November of 2015. 
 The Structured Streaming design doc hasn't changed appreciably since May.  
Even short-term tasks in Structured Streaming are very thin on the ground, and 
PR traffic is minimal. Is something going on behind closed doors that other 
members of the community are not aware of?

> Allow users to supply custom StreamSinkProviders
> 
>
> Key: SPARK-16407
> URL: https://issues.apache.org/jira/browse/SPARK-16407
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: holdenk
>
> The current DataStreamWriter allows users to specify a class name as the 
> format; however, it could be easier for people to directly pass in a specific 
> provider instance - e.g. for a user equivalent of ForeachSink or another sink 
> with non-string parameters.






[jira] [Commented] (SPARK-17544) Timeout waiting for connection from pool, DataFrame Reader's not closing S3 connections?

2016-09-15 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15494715#comment-15494715
 ] 

Josh Rosen commented on SPARK-17544:


The 3.0.1 release fixing this should be available now: 
https://github.com/databricks/spark-avro/releases/tag/v3.0.1
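
For anyone picking it up, the coordinates should be roughly as below (assuming 
the usual group/artifact naming for the Scala 2.11 build):

{code}
// sbt
libraryDependencies += "com.databricks" %% "spark-avro" % "3.0.1"

// or on the command line
// spark-shell --packages com.databricks:spark-avro_2.11:3.0.1
{code}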

> Timeout waiting for connection from pool, DataFrame Reader's not closing S3 
> connections?
> 
>
> Key: SPARK-17544
> URL: https://issues.apache.org/jira/browse/SPARK-17544
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.0
> Environment: Amazon EMR, S3, Scala
>Reporter: Brady Auen
>  Labels: newbie
>
> I have an application that loops through a text file to find files in S3 and 
> then reads them in, performs some ETL processes, and then writes them out. 
> This works for around 80 loops until I get this:  
> {noformat}
> 16/09/14 18:58:23 INFO S3NativeFileSystem: Opening 
> 's3://webpt-emr-us-west-2-hadoop/edw/master/fdbk_ideavote/20160907/part-m-0.avro'
>  for reading
> 16/09/14 18:59:13 INFO AmazonHttpClient: Unable to execute HTTP request: 
> Timeout waiting for connection from pool
> com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.conn.ConnectionPoolTimeoutException:
>  Timeout waiting for connection from pool
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.conn.PoolingClientConnectionManager.leaseConnection(PoolingClientConnectionManager.java:226)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.conn.PoolingClientConnectionManager$1.getConnection(PoolingClientConnectionManager.java:195)
>   at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.conn.ClientConnectionRequestFactory$Handler.invoke(ClientConnectionRequestFactory.java:70)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.conn.$Proxy37.getConnection(Unknown
>  Source)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:423)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.executeOneRequest(AmazonHttpClient.java:837)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:607)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.doExecute(AmazonHttpClient.java:376)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.executeWithTimer(AmazonHttpClient.java:338)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:287)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3826)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:1015)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:991)
>   at 
> com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.retrieveMetadata(Jets3tNativeFileSystemStore.java:212)
>   at sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>   at com.sun.proxy.$Proxy36.retrieveMetadata(Unknown Source)
>   at 
> com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:780)
>   at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1428)
>   at 
> com.amazon.ws.emr.hadoop.fs.EmrFileSystem.exists(EmrFileSystem.java:313)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.hasMetadata(DataSource.scala:289)
>   at 
> 

[jira] [Commented] (SPARK-17477) SparkSQL cannot handle schema evolution from Int -> Long when parquet files have Int as its type while hive metastore has Long as its type

2016-09-15 Thread Gang Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15494681#comment-15494681
 ] 

Gang Wu commented on SPARK-17477:
-

Just confirmed that this also doesn't work with the vectorized reader. What I 
did is as follows:

1. Created a flat Hive table with schema "name: String, id: Long", but the 
parquet file, which contains 100 rows, uses "name: String, id: Int".
2. Then ran the query "select * from table" and showed the result. It works 
fine with DataFrame.count and DataFrame.printSchema().
Got the following exception:

Driver stacktrace:
  at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1450)
  at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1438)
  at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1437)
  at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1437)
  at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
  at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
  at scala.Option.foreach(Option.scala:257)
  at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811)
  at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1659)
  at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1618)
  at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1607)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1871)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1884)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1897)
  at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:347)
  at 
org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:39)
  at 
org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2183)
  at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
  at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2532)
  at 
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2182)
  at 
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2189)
  at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:1925)
  at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:1924)
  at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2562)
  at org.apache.spark.sql.Dataset.head(Dataset.scala:1924)
  at org.apache.spark.sql.Dataset.take(Dataset.scala:2139)
  at org.apache.spark.sql.Dataset.showString(Dataset.scala:239)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:526)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:486)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:495)
  ... 48 elided
Caused by: java.lang.NullPointerException
  at 
org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getLong(OnHeapColumnVector.java:272)
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
  at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
  at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
  at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
  at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
  at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
  at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
  at org.apache.spark.scheduler.Task.run(Task.scala:85)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
  at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:745)

> SparkSQL cannot handle schema evolution from Int -> Long when parquet files 
> have Int 

[jira] [Commented] (SPARK-17545) Spark SQL Catalyst doesn't handle ISO 8601 date without colon in offset

2016-09-15 Thread Nathan Beyer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15494666#comment-15494666
 ] 

Nathan Beyer commented on SPARK-17545:
--

As a workaround, the following format can be set as an option for DataFrame 
reads:
{code}spark.read.option("dateFormat", "yyyy-MM-dd'T'HH:mm:ss.SSSXX").csv(path){code}

> Spark SQL Catalyst doesn't handle ISO 8601 date without colon in offset
> ---
>
> Key: SPARK-17545
> URL: https://issues.apache.org/jira/browse/SPARK-17545
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Nathan Beyer
>
> When parsing a CSV with a date/time column that contains a variant ISO 8601 
> that doesn't include a colon in the offset, casting to Timestamp fails.
> Here's a simple, example CSV content.
> {quote}
> time
> "2015-07-20T15:09:23.736-0500"
> "2015-07-20T15:10:51.687-0500"
> "2015-11-21T23:15:01.499-0600"
> {quote}
> Here's the stack trace that results from processing this data.
> {quote}
> 16/09/14 15:22:59 ERROR Utils: Aborting task
> java.lang.IllegalArgumentException: 2015-11-21T23:15:01.499-0600
>   at 
> org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.skip(Unknown 
> Source)
>   at 
> org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.parse(Unknown 
> Source)
>   at 
> org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl.<init>(Unknown 
> Source)
>   at 
> org.apache.xerces.jaxp.datatype.DatatypeFactoryImpl.newXMLGregorianCalendar(Unknown
>  Source)
>   at 
> javax.xml.bind.DatatypeConverterImpl._parseDateTime(DatatypeConverterImpl.java:422)
>   at 
> javax.xml.bind.DatatypeConverterImpl.parseDateTime(DatatypeConverterImpl.java:417)
>   at 
> javax.xml.bind.DatatypeConverter.parseDateTime(DatatypeConverter.java:327)
>   at 
> org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:140)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:287)
> {quote}
> Somewhat related, I believe Python standard libraries can produce this form 
> of zone offset. The system I got the data from is written in Python.
> https://docs.python.org/2/library/datetime.html#strftime-strptime-behavior



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17545) Spark SQL Catalyst doesn't handle ISO 8601 date without colon in offset

2016-09-15 Thread Nathan Beyer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nathan Beyer updated SPARK-17545:
-
Summary: Spark SQL Catalyst doesn't handle ISO 8601 date without colon in 
offset  (was: Spark SQL Catalyst doesn't handle ISO 8601 date with colon in 
offset)

> Spark SQL Catalyst doesn't handle ISO 8601 date without colon in offset
> ---
>
> Key: SPARK-17545
> URL: https://issues.apache.org/jira/browse/SPARK-17545
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Nathan Beyer
>
> When parsing a CSV with a date/time column that contains a variant ISO 8601 
> that doesn't include a colon in the offset, casting to Timestamp fails.
> Here's a simple, example CSV content.
> {quote}
> time
> "2015-07-20T15:09:23.736-0500"
> "2015-07-20T15:10:51.687-0500"
> "2015-11-21T23:15:01.499-0600"
> {quote}
> Here's the stack trace that results from processing this data.
> {quote}
> 16/09/14 15:22:59 ERROR Utils: Aborting task
> java.lang.IllegalArgumentException: 2015-11-21T23:15:01.499-0600
>   at 
> org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.skip(Unknown 
> Source)
>   at 
> org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.parse(Unknown 
> Source)
>   at 
> org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl.<init>(Unknown 
> Source)
>   at 
> org.apache.xerces.jaxp.datatype.DatatypeFactoryImpl.newXMLGregorianCalendar(Unknown
>  Source)
>   at 
> javax.xml.bind.DatatypeConverterImpl._parseDateTime(DatatypeConverterImpl.java:422)
>   at 
> javax.xml.bind.DatatypeConverterImpl.parseDateTime(DatatypeConverterImpl.java:417)
>   at 
> javax.xml.bind.DatatypeConverter.parseDateTime(DatatypeConverter.java:327)
>   at 
> org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:140)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:287)
> {quote}
> Somewhat related, I believe Python standard libraries can produce this form 
> of zone offset. The system I got the data from is written in Python.
> https://docs.python.org/2/library/datetime.html#strftime-strptime-behavior



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17458) Alias specified for aggregates in a pivot are not honored

2016-09-15 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell updated SPARK-17458:
--
Assignee: (was: Herman van Hovell)

> Alias specified for aggregates in a pivot are not honored
> -
>
> Key: SPARK-17458
> URL: https://issues.apache.org/jira/browse/SPARK-17458
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Ravi Somepalli
> Fix For: 2.1.0
>
>
> When using pivot and multiple aggregations we need to alias to avoid special 
> characters, but alias does not help because 
> df.groupBy("C").pivot("A").agg(avg("D").as("COLD"), max("B").as("COLB")).show
> ||C || bar_avg(`D`) AS `COLD` || bar_max(`B`) AS `COLB` || foo_avg(`D`) 
> AS `COLD` || foo_max(`B`) AS `COLB` ||
> |small|   5.5|   two|2.3335|  
>  two|
> |large|   5.5|   two|   2.0|  
>  one|
> Expected Output
> ||C || bar_COLD || bar_COLB || foo_COLD || foo_COLB ||
> |small|   5.5|   two|2.3335|  
>  two|
> |large|   5.5|   two|   2.0|  
>  one|
> One approach you can fix this issue is to change the class
> sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
>  and change the outputName method in 
> {code}
> object ResolvePivot extends Rule[LogicalPlan] {
> def apply(plan: LogicalPlan): LogicalPlan = plan transform {
> {code}
> {code}
> def outputName(value: Literal, aggregate: Expression): String = {
>   val suffix = aggregate match {
>  case n: NamedExpression => 
> aggregate.asInstanceOf[NamedExpression].name
>  case _ => aggregate.sql
>}
>   if (singleAgg) value.toString else value + "_" + suffix
> }
> {code}
> Version : 2.0.0
> {code}
> def outputName(value: Literal, aggregate: Expression): String = {
>   if (singleAgg) value.toString else value + "_" + aggregate.sql
> }
> {code}
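To make the report easier to reproduce, here is a minimal sketch; the rows are 
invented, and only the column names (A, B, C, D) and the aggregation come from 
the description above. A spark-shell style session is assumed so that 
{{spark.implicits}} is available:
{code}
import org.apache.spark.sql.functions.{avg, max}
import spark.implicits._   // assumes a SparkSession named `spark` is in scope

// Invented data; the point is only that the COLD/COLB aliases are dropped from
// the pivoted column names in 2.0.0.
val df = Seq(
  ("foo", "one", "small", 1.0),
  ("foo", "two", "small", 3.667),
  ("foo", "one", "large", 2.0),
  ("bar", "two", "small", 5.5),
  ("bar", "two", "large", 5.5)
).toDF("A", "B", "C", "D")

df.groupBy("C").pivot("A").agg(avg("D").as("COLD"), max("B").as("COLB")).show()
{code}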



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17473) jdbc docker tests are failing with java.lang.AbstractMethodError:

2016-09-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17473:


Assignee: Apache Spark

> jdbc docker tests are failing with java.lang.AbstractMethodError:
> -
>
> Key: SPARK-17473
> URL: https://issues.apache.org/jira/browse/SPARK-17473
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Suresh Thalamati
>Assignee: Apache Spark
>
> build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0  -Phive-thriftserver 
> -Phive -DskipTests clean install
> build/mvn -Pdocker-integration-tests -pl :spark-docker-integration-tests_2.11 
>  compile test
> Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512m; 
> support was removed in 8.0
> Discovery starting.
> Discovery completed in 200 milliseconds.
> Run starting. Expected test count is: 10
> MySQLIntegrationSuite:
> Error:
> 16/09/06 11:52:00 INFO BlockManagerMaster: Registered BlockManager 
> BlockManagerId(driver, 9.31.117.25, 51868)
> *** RUN ABORTED ***
>   java.lang.AbstractMethodError:
>   at 
> org.glassfish.jersey.model.internal.CommonConfig.configureAutoDiscoverableProviders(CommonConfig.java:622)
>   at 
> org.glassfish.jersey.client.ClientConfig$State.configureAutoDiscoverableProviders(ClientConfig.java:357)
>   at 
> org.glassfish.jersey.client.ClientConfig$State.initRuntime(ClientConfig.java:392)
>   at 
> org.glassfish.jersey.client.ClientConfig$State.access$000(ClientConfig.java:88)
>   at 
> org.glassfish.jersey.client.ClientConfig$State$3.get(ClientConfig.java:120)
>   at 
> org.glassfish.jersey.client.ClientConfig$State$3.get(ClientConfig.java:117)
>   at 
> org.glassfish.jersey.internal.util.collection.Values$LazyValueImpl.get(Values.java:340)
>   at 
> org.glassfish.jersey.client.ClientConfig.getRuntime(ClientConfig.java:726)
>   at 
> org.glassfish.jersey.client.ClientRequest.getConfiguration(ClientRequest.java:285)
>   at 
> org.glassfish.jersey.client.JerseyInvocation.validateHttpMethodAndEntity(JerseyInvocation.java:126)
>   ...
> 16/09/06 11:52:00 INFO SparkContext: Invoking stop() from shutdown hook
> 16/09/06 11:52:00 INFO MapOutputTrackerMasterEndpoint: 
> MapOutputTrackerMasterEndpoint stopped!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17473) jdbc docker tests are failing with java.lang.AbstractMethodError:

2016-09-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17473:


Assignee: (was: Apache Spark)

> jdbc docker tests are failing with java.lang.AbstractMethodError:
> -
>
> Key: SPARK-17473
> URL: https://issues.apache.org/jira/browse/SPARK-17473
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Suresh Thalamati
>
> build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0  -Phive-thriftserver 
> -Phive -DskipTests clean install
> build/mvn -Pdocker-integration-tests -pl :spark-docker-integration-tests_2.11 
>  compile test
> Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512m; 
> support was removed in 8.0
> Discovery starting.
> Discovery completed in 200 milliseconds.
> Run starting. Expected test count is: 10
> MySQLIntegrationSuite:
> Error:
> 16/09/06 11:52:00 INFO BlockManagerMaster: Registered BlockManager 
> BlockManagerId(driver, 9.31.117.25, 51868)
> *** RUN ABORTED ***
>   java.lang.AbstractMethodError:
>   at 
> org.glassfish.jersey.model.internal.CommonConfig.configureAutoDiscoverableProviders(CommonConfig.java:622)
>   at 
> org.glassfish.jersey.client.ClientConfig$State.configureAutoDiscoverableProviders(ClientConfig.java:357)
>   at 
> org.glassfish.jersey.client.ClientConfig$State.initRuntime(ClientConfig.java:392)
>   at 
> org.glassfish.jersey.client.ClientConfig$State.access$000(ClientConfig.java:88)
>   at 
> org.glassfish.jersey.client.ClientConfig$State$3.get(ClientConfig.java:120)
>   at 
> org.glassfish.jersey.client.ClientConfig$State$3.get(ClientConfig.java:117)
>   at 
> org.glassfish.jersey.internal.util.collection.Values$LazyValueImpl.get(Values.java:340)
>   at 
> org.glassfish.jersey.client.ClientConfig.getRuntime(ClientConfig.java:726)
>   at 
> org.glassfish.jersey.client.ClientRequest.getConfiguration(ClientRequest.java:285)
>   at 
> org.glassfish.jersey.client.JerseyInvocation.validateHttpMethodAndEntity(JerseyInvocation.java:126)
>   ...
> 16/09/06 11:52:00 INFO SparkContext: Invoking stop() from shutdown hook
> 16/09/06 11:52:00 INFO MapOutputTrackerMasterEndpoint: 
> MapOutputTrackerMasterEndpoint stopped!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17473) jdbc docker tests are failing with java.lang.AbstractMethodError:

2016-09-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15494597#comment-15494597
 ] 

Apache Spark commented on SPARK-17473:
--

User 'sureshthalamati' has created a pull request for this issue:
https://github.com/apache/spark/pull/15114

> jdbc docker tests are failing with java.lang.AbstractMethodError:
> -
>
> Key: SPARK-17473
> URL: https://issues.apache.org/jira/browse/SPARK-17473
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Suresh Thalamati
>
> build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0  -Phive-thriftserver 
> -Phive -DskipTests clean install
> build/mvn -Pdocker-integration-tests -pl :spark-docker-integration-tests_2.11 
>  compile test
> Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512m; 
> support was removed in 8.0
> Discovery starting.
> Discovery completed in 200 milliseconds.
> Run starting. Expected test count is: 10
> MySQLIntegrationSuite:
> Error:
> 16/09/06 11:52:00 INFO BlockManagerMaster: Registered BlockManager 
> BlockManagerId(driver, 9.31.117.25, 51868)
> *** RUN ABORTED ***
>   java.lang.AbstractMethodError:
>   at 
> org.glassfish.jersey.model.internal.CommonConfig.configureAutoDiscoverableProviders(CommonConfig.java:622)
>   at 
> org.glassfish.jersey.client.ClientConfig$State.configureAutoDiscoverableProviders(ClientConfig.java:357)
>   at 
> org.glassfish.jersey.client.ClientConfig$State.initRuntime(ClientConfig.java:392)
>   at 
> org.glassfish.jersey.client.ClientConfig$State.access$000(ClientConfig.java:88)
>   at 
> org.glassfish.jersey.client.ClientConfig$State$3.get(ClientConfig.java:120)
>   at 
> org.glassfish.jersey.client.ClientConfig$State$3.get(ClientConfig.java:117)
>   at 
> org.glassfish.jersey.internal.util.collection.Values$LazyValueImpl.get(Values.java:340)
>   at 
> org.glassfish.jersey.client.ClientConfig.getRuntime(ClientConfig.java:726)
>   at 
> org.glassfish.jersey.client.ClientRequest.getConfiguration(ClientRequest.java:285)
>   at 
> org.glassfish.jersey.client.JerseyInvocation.validateHttpMethodAndEntity(JerseyInvocation.java:126)
>   ...
> 16/09/06 11:52:00 INFO SparkContext: Invoking stop() from shutdown hook
> 16/09/06 11:52:00 INFO MapOutputTrackerMasterEndpoint: 
> MapOutputTrackerMasterEndpoint stopped!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17458) Alias specified for aggregates in a pivot are not honored

2016-09-15 Thread Andrew Ray (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15494591#comment-15494591
 ] 

Andrew Ray commented on SPARK-17458:


[~hvanhovell]: My JIRA username is a1ray.

> Alias specified for aggregates in a pivot are not honored
> -
>
> Key: SPARK-17458
> URL: https://issues.apache.org/jira/browse/SPARK-17458
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Ravi Somepalli
>Assignee: Herman van Hovell
> Fix For: 2.1.0
>
>
> When using pivot and multiple aggregations we need to alias to avoid special 
> characters, but alias does not help because 
> df.groupBy("C").pivot("A").agg(avg("D").as("COLD"), max("B").as("COLB")).show
> ||C || bar_avg(`D`) AS `COLD` || bar_max(`B`) AS `COLB` || foo_avg(`D`) 
> AS `COLD` || foo_max(`B`) AS `COLB` ||
> |small|   5.5|   two|2.3335|  
>  two|
> |large|   5.5|   two|   2.0|  
>  one|
> Expected Output
> ||C || bar_COLD || bar_COLB || foo_COLD || foo_COLB ||
> |small|   5.5|   two|2.3335|  
>  two|
> |large|   5.5|   two|   2.0|  
>  one|
> One approach you can fix this issue is to change the class
> sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
>  and change the outputName method in 
> {code}
> object ResolvePivot extends Rule[LogicalPlan] {
> def apply(plan: LogicalPlan): LogicalPlan = plan transform {
> {code}
> {code}
> def outputName(value: Literal, aggregate: Expression): String = {
>   val suffix = aggregate match {
>  case n: NamedExpression => 
> aggregate.asInstanceOf[NamedExpression].name
>  case _ => aggregate.sql
>}
>   if (singleAgg) value.toString else value + "_" + suffix
> }
> {code}
> Version : 2.0.0
> {code}
> def outputName(value: Literal, aggregate: Expression): String = {
>   if (singleAgg) value.toString else value + "_" + aggregate.sql
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-17458) Alias specified for aggregates in a pivot are not honored

2016-09-15 Thread Andrew Ray (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Ray updated SPARK-17458:
---
Comment: was deleted

(was: [~hvanhovell] It's a1ray)

> Alias specified for aggregates in a pivot are not honored
> -
>
> Key: SPARK-17458
> URL: https://issues.apache.org/jira/browse/SPARK-17458
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Ravi Somepalli
>Assignee: Herman van Hovell
> Fix For: 2.1.0
>
>
> When using pivot and multiple aggregations we need to alias to avoid special 
> characters, but alias does not help because 
> df.groupBy("C").pivot("A").agg(avg("D").as("COLD"), max("B").as("COLB")).show
> ||C || bar_avg(`D`) AS `COLD` || bar_max(`B`) AS `COLB` || foo_avg(`D`) 
> AS `COLD` || foo_max(`B`) AS `COLB` ||
> |small|   5.5|   two|2.3335|  
>  two|
> |large|   5.5|   two|   2.0|  
>  one|
> Expected Output
> ||C || bar_COLD || bar_COLB || foo_COLD || foo_COLB ||
> |small|   5.5|   two|2.3335|  
>  two|
> |large|   5.5|   two|   2.0|  
>  one|
> One approach you can fix this issue is to change the class
> sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
>  and change the outputName method in 
> {code}
> object ResolvePivot extends Rule[LogicalPlan] {
> def apply(plan: LogicalPlan): LogicalPlan = plan transform {
> {code}
> {code}
> def outputName(value: Literal, aggregate: Expression): String = {
>   val suffix = aggregate match {
>  case n: NamedExpression => 
> aggregate.asInstanceOf[NamedExpression].name
>  case _ => aggregate.sql
>}
>   if (singleAgg) value.toString else value + "_" + suffix
> }
> {code}
> Version : 2.0.0
> {code}
> def outputName(value: Literal, aggregate: Expression): String = {
>   if (singleAgg) value.toString else value + "_" + aggregate.sql
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16264) Allow the user to use operators on the received DataFrame

2016-09-15 Thread Jakob Odersky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15494587#comment-15494587
 ] 

Jakob Odersky commented on SPARK-16264:
---

I just came across this issue through a comment in ForeachSink. I understand 
why Sinks would be better off not knowing about the type of QueryExecution; 
however, I'm not quite sure what you mean by "having something similar to 
ForeachWriter". Is the idea to have only a single foreach sink and expose all 
custom user sinks as ForeachWriters?

> Allow the user to use operators on the received DataFrame
> -
>
> Key: SPARK-16264
> URL: https://issues.apache.org/jira/browse/SPARK-16264
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Shixiong Zhu
>
> Currently a Sink cannot apply any operators on the given DataFrame, because a 
> new DataFrame created by an operator will use QueryExecution rather than 
> IncrementalExecution.
> There are two options to fix this:
> 1. Merge IncrementalExecution into QueryExecution so that QueryExecution can 
> also deal with streaming operators.
> 2. Make Dataset operators inherit the QueryExecution (IncrementalExecution is 
> just a subclass of QueryExecution) from their parent.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17544) Timeout waiting for connection from pool, DataFrame Reader's not closing S3 connections?

2016-09-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-17544.

Resolution: Invalid

This is caused by a bug in the third-party {{spark-avro}} data source. I'm 
going to close this JIRA ticket, so let's continue discussion at 
https://github.com/databricks/spark-avro/issues/156.

I have submitted a pull request 
(https://github.com/databricks/spark-avro/pull/174) to fix this issue and will 
package a new release of the Avro connector library as soon as my patch is 
reviewed and merged.

> Timeout waiting for connection from pool, DataFrame Reader's not closing S3 
> connections?
> 
>
> Key: SPARK-17544
> URL: https://issues.apache.org/jira/browse/SPARK-17544
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.0
> Environment: Amazon EMR, S3, Scala
>Reporter: Brady Auen
>  Labels: newbie
>
> I have an application that loops through a text file to find files in S3 and 
> then reads them in, performs some ETL processes, and then writes them out. 
> This works for around 80 loops until I get this:  
> {noformat}
> 16/09/14 18:58:23 INFO S3NativeFileSystem: Opening 
> 's3://webpt-emr-us-west-2-hadoop/edw/master/fdbk_ideavote/20160907/part-m-0.avro'
>  for reading
> 16/09/14 18:59:13 INFO AmazonHttpClient: Unable to execute HTTP request: 
> Timeout waiting for connection from pool
> com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.conn.ConnectionPoolTimeoutException:
>  Timeout waiting for connection from pool
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.conn.PoolingClientConnectionManager.leaseConnection(PoolingClientConnectionManager.java:226)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.conn.PoolingClientConnectionManager$1.getConnection(PoolingClientConnectionManager.java:195)
>   at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.conn.ClientConnectionRequestFactory$Handler.invoke(ClientConnectionRequestFactory.java:70)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.conn.$Proxy37.getConnection(Unknown
>  Source)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:423)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.executeOneRequest(AmazonHttpClient.java:837)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:607)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.doExecute(AmazonHttpClient.java:376)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.executeWithTimer(AmazonHttpClient.java:338)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:287)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3826)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:1015)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:991)
>   at 
> com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.retrieveMetadata(Jets3tNativeFileSystemStore.java:212)
>   at sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>   at com.sun.proxy.$Proxy36.retrieveMetadata(Unknown Source)
>   at 
> com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:780)
>   at 

[jira] [Commented] (SPARK-16407) Allow users to supply custom StreamSinkProviders

2016-09-15 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15494475#comment-15494475
 ] 

Michael Armbrust commented on SPARK-16407:
--

Sure, but the bar for compatibility is different for things in explicit user 
facing APIs (DataFrame/StreamReader/Writer) than for stuff that involves 
implementing interfaces defined in packages that are documented as internal 
(org.apache.spark.sql.catalyst.*, org.apache.spark.sql.execution.*)

> Allow users to supply custom StreamSinkProviders
> 
>
> Key: SPARK-16407
> URL: https://issues.apache.org/jira/browse/SPARK-16407
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: holdenk
>
> The current DataStreamWriter allows users to specify a class name as the 
> format; however, it could be easier for people to directly pass in a specific 
> provider instance - e.g. a user-facing equivalent of ForeachSink, or another 
> sink with non-string parameters.
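For context, a sketch of the string-based flow being discussed; the provider 
class name, source, and paths here are hypothetical:
{code}
// The sink is chosen by a format name or class name string; there is no way to
// hand DataStreamWriter an already-constructed provider instance.
val streamingDf = spark.readStream
  .format("text")
  .load("/tmp/incoming")                        // hypothetical input directory

val query = streamingDf.writeStream
  .format("com.example.MyCustomSinkProvider")   // hypothetical custom provider
  .option("checkpointLocation", "/tmp/checkpoints")
  .start()
{code}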



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16407) Allow users to supply custom StreamSinkProviders

2016-09-15 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15494475#comment-15494475
 ] 

Michael Armbrust edited comment on SPARK-16407 at 9/15/16 8:42 PM:
---

Sure, but the bar for compatibility is different for things in explicit user 
facing APIs (DataFrame/StreamReader/Writer) than for stuff that involves 
implementing interfaces defined in packages that are documented as internal 
({{org.apache.spark.sql.catalyst.\*}}, {{org.apache.spark.sql.execution.\*}})


was (Author: marmbrus):
Sure, but the bar for compatibility is different for things in explicit user 
facing APIs (DataFrame/StreamReader/Writer) than for stuff that involves 
implementing interfaces defined in packages that are documented as internal 
(org.apache.spark.sql.catalyst.*, org.apache.spark.sql.execution.*)

> Allow users to supply custom StreamSinkProviders
> 
>
> Key: SPARK-16407
> URL: https://issues.apache.org/jira/browse/SPARK-16407
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: holdenk
>
> The current DataStreamWriter allows users to specify a class name as the 
> format; however, it could be easier for people to directly pass in a specific 
> provider instance - e.g. a user-facing equivalent of ForeachSink, or another 
> sink with non-string parameters.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17508) Setting weightCol to None in ML library causes an error

2016-09-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17508:


Assignee: Apache Spark

> Setting weightCol to None in ML library causes an error
> ---
>
> Key: SPARK-17508
> URL: https://issues.apache.org/jira/browse/SPARK-17508
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Evan Zamir
>Assignee: Apache Spark
>Priority: Minor
>
> The following code runs without error:
> {code}
> spark = SparkSession.builder.appName('WeightBug').getOrCreate()
> df = spark.createDataFrame(
> [
> (1.0, 1.0, Vectors.dense(1.0)),
> (0.0, 1.0, Vectors.dense(-1.0))
> ],
> ["label", "weight", "features"])
> lr = LogisticRegression(maxIter=5, regParam=0.0, weightCol="weight")
> model = lr.fit(df)
> {code}
> My expectation from reading the documentation is that setting weightCol=None 
> should treat all weights as 1.0 (regardless of whether a column exists). 
> However, the same code with weightCol set to None causes the following errors:
> Traceback (most recent call last):
>   File "/Users/evanzamir/ams/px-seed-model/scripts/bug.py", line 32, in 
> 
> model = lr.fit(df)
>   File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/base.py", line 
> 64, in fit
> return self._fit(dataset)
>   File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/wrapper.py", 
> line 213, in _fit
> java_model = self._fit_java(dataset)
>   File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/wrapper.py", 
> line 210, in _fit_java
> return self._java_obj.fit(dataset._jdf)
>   File 
> "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py",
>  line 933, in __call__
>   File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/utils.py", 
> line 63, in deco
> return f(*a, **kw)
>   File 
> "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py",
>  line 312, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o38.fit.
> : java.lang.NullPointerException
>   at 
> org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:264)
>   at 
> org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:259)
>   at 
> org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:159)
>   at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
>   at org.apache.spark.ml.Predictor.fit(Predictor.scala:71)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>   at py4j.Gateway.invoke(Gateway.java:280)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:211)
>   at java.lang.Thread.run(Thread.java:745)
> Process finished with exit code 1



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17508) Setting weightCol to None in ML library causes an error

2016-09-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15494418#comment-15494418
 ] 

Apache Spark commented on SPARK-17508:
--

User 'BryanCutler' has created a pull request for this issue:
https://github.com/apache/spark/pull/15113

> Setting weightCol to None in ML library causes an error
> ---
>
> Key: SPARK-17508
> URL: https://issues.apache.org/jira/browse/SPARK-17508
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Evan Zamir
>Priority: Minor
>
> The following code runs without error:
> {code}
> spark = SparkSession.builder.appName('WeightBug').getOrCreate()
> df = spark.createDataFrame(
> [
> (1.0, 1.0, Vectors.dense(1.0)),
> (0.0, 1.0, Vectors.dense(-1.0))
> ],
> ["label", "weight", "features"])
> lr = LogisticRegression(maxIter=5, regParam=0.0, weightCol="weight")
> model = lr.fit(df)
> {code}
> My expectation from reading the documentation is that setting weightCol=None 
> should treat all weights as 1.0 (regardless of whether a column exists). 
> However, the same code with weightCol set to None causes the following errors:
> Traceback (most recent call last):
>   File "/Users/evanzamir/ams/px-seed-model/scripts/bug.py", line 32, in 
> 
> model = lr.fit(df)
>   File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/base.py", line 
> 64, in fit
> return self._fit(dataset)
>   File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/wrapper.py", 
> line 213, in _fit
> java_model = self._fit_java(dataset)
>   File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/wrapper.py", 
> line 210, in _fit_java
> return self._java_obj.fit(dataset._jdf)
>   File 
> "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py",
>  line 933, in __call__
>   File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/utils.py", 
> line 63, in deco
> return f(*a, **kw)
>   File 
> "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py",
>  line 312, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o38.fit.
> : java.lang.NullPointerException
>   at 
> org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:264)
>   at 
> org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:259)
>   at 
> org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:159)
>   at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
>   at org.apache.spark.ml.Predictor.fit(Predictor.scala:71)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>   at py4j.Gateway.invoke(Gateway.java:280)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:211)
>   at java.lang.Thread.run(Thread.java:745)
> Process finished with exit code 1



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17508) Setting weightCol to None in ML library causes an error

2016-09-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17508:


Assignee: (was: Apache Spark)

> Setting weightCol to None in ML library causes an error
> ---
>
> Key: SPARK-17508
> URL: https://issues.apache.org/jira/browse/SPARK-17508
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Evan Zamir
>Priority: Minor
>
> The following code runs without error:
> {code}
> spark = SparkSession.builder.appName('WeightBug').getOrCreate()
> df = spark.createDataFrame(
> [
> (1.0, 1.0, Vectors.dense(1.0)),
> (0.0, 1.0, Vectors.dense(-1.0))
> ],
> ["label", "weight", "features"])
> lr = LogisticRegression(maxIter=5, regParam=0.0, weightCol="weight")
> model = lr.fit(df)
> {code}
> My expectation from reading the documentation is that setting weightCol=None 
> should treat all weights as 1.0 (regardless of whether a column exists). 
> However, the same code with weightCol set to None causes the following errors:
> Traceback (most recent call last):
>   File "/Users/evanzamir/ams/px-seed-model/scripts/bug.py", line 32, in 
> 
> model = lr.fit(df)
>   File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/base.py", line 
> 64, in fit
> return self._fit(dataset)
>   File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/wrapper.py", 
> line 213, in _fit
> java_model = self._fit_java(dataset)
>   File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/wrapper.py", 
> line 210, in _fit_java
> return self._java_obj.fit(dataset._jdf)
>   File 
> "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py",
>  line 933, in __call__
>   File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/utils.py", 
> line 63, in deco
> return f(*a, **kw)
>   File 
> "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py",
>  line 312, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o38.fit.
> : java.lang.NullPointerException
>   at 
> org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:264)
>   at 
> org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:259)
>   at 
> org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:159)
>   at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
>   at org.apache.spark.ml.Predictor.fit(Predictor.scala:71)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>   at py4j.Gateway.invoke(Gateway.java:280)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:211)
>   at java.lang.Thread.run(Thread.java:745)
> Process finished with exit code 1



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17458) Alias specified for aggregates in a pivot are not honored

2016-09-15 Thread Andrew Ray (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15494361#comment-15494361
 ] 

Andrew Ray edited comment on SPARK-17458 at 9/15/16 8:09 PM:
-

[~hvanhovell] It's a1ray


was (Author: a1ray):
It's a1ray

> Alias specified for aggregates in a pivot are not honored
> -
>
> Key: SPARK-17458
> URL: https://issues.apache.org/jira/browse/SPARK-17458
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Ravi Somepalli
>Assignee: Herman van Hovell
> Fix For: 2.1.0
>
>
> When using pivot and multiple aggregations we need to alias to avoid special 
> characters, but alias does not help because 
> df.groupBy("C").pivot("A").agg(avg("D").as("COLD"), max("B").as("COLB")).show
> ||C || bar_avg(`D`) AS `COLD` || bar_max(`B`) AS `COLB` || foo_avg(`D`) 
> AS `COLD` || foo_max(`B`) AS `COLB` ||
> |small|   5.5|   two|2.3335|  
>  two|
> |large|   5.5|   two|   2.0|  
>  one|
> Expected Output
> ||C || bar_COLD || bar_COLB || foo_COLD || foo_COLB ||
> |small|   5.5|   two|2.3335|  
>  two|
> |large|   5.5|   two|   2.0|  
>  one|
> One approach you can fix this issue is to change the class
> sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
>  and change the outputName method in 
> {code}
> object ResolvePivot extends Rule[LogicalPlan] {
> def apply(plan: LogicalPlan): LogicalPlan = plan transform {
> {code}
> {code}
> def outputName(value: Literal, aggregate: Expression): String = {
>   val suffix = aggregate match {
>  case n: NamedExpression => 
> aggregate.asInstanceOf[NamedExpression].name
>  case _ => aggregate.sql
>}
>   if (singleAgg) value.toString else value + "_" + suffix
> }
> {code}
> Version : 2.0.0
> {code}
> def outputName(value: Literal, aggregate: Expression): String = {
>   if (singleAgg) value.toString else value + "_" + aggregate.sql
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17458) Alias specified for aggregates in a pivot are not honored

2016-09-15 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell updated SPARK-17458:
--
Assignee: Herman van Hovell

> Alias specified for aggregates in a pivot are not honored
> -
>
> Key: SPARK-17458
> URL: https://issues.apache.org/jira/browse/SPARK-17458
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Ravi Somepalli
>Assignee: Herman van Hovell
> Fix For: 2.1.0
>
>
> When using pivot and multiple aggregations we need to alias to avoid special 
> characters, but alias does not help because 
> df.groupBy("C").pivot("A").agg(avg("D").as("COLD"), max("B").as("COLB")).show
> ||C || bar_avg(`D`) AS `COLD` || bar_max(`B`) AS `COLB` || foo_avg(`D`) 
> AS `COLD` || foo_max(`B`) AS `COLB` ||
> |small|   5.5|   two|2.3335|  
>  two|
> |large|   5.5|   two|   2.0|  
>  one|
> Expected Output
> ||C || bar_COLD || bar_COLB || foo_COLD || foo_COLB ||
> |small|   5.5|   two|2.3335|  
>  two|
> |large|   5.5|   two|   2.0|  
>  one|
> One approach you can fix this issue is to change the class
> sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
>  and change the outputName method in 
> {code}
> object ResolvePivot extends Rule[LogicalPlan] {
> def apply(plan: LogicalPlan): LogicalPlan = plan transform {
> {code}
> {code}
> def outputName(value: Literal, aggregate: Expression): String = {
>   val suffix = aggregate match {
>  case n: NamedExpression => 
> aggregate.asInstanceOf[NamedExpression].name
>  case _ => aggregate.sql
>}
>   if (singleAgg) value.toString else value + "_" + suffix
> }
> {code}
> Version : 2.0.0
> {code}
> def outputName(value: Literal, aggregate: Expression): String = {
>   if (singleAgg) value.toString else value + "_" + aggregate.sql
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17458) Alias specified for aggregates in a pivot are not honored

2016-09-15 Thread Andrew Ray (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15494361#comment-15494361
 ] 

Andrew Ray commented on SPARK-17458:


It's a1ray

> Alias specified for aggregates in a pivot are not honored
> -
>
> Key: SPARK-17458
> URL: https://issues.apache.org/jira/browse/SPARK-17458
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Ravi Somepalli
> Fix For: 2.1.0
>
>
> When using pivot and multiple aggregations we need to alias to avoid special 
> characters, but alias does not help because 
> df.groupBy("C").pivot("A").agg(avg("D").as("COLD"), max("B").as("COLB")).show
> ||C || bar_avg(`D`) AS `COLD` || bar_max(`B`) AS `COLB` || foo_avg(`D`) 
> AS `COLD` || foo_max(`B`) AS `COLB` ||
> |small|   5.5|   two|2.3335|  
>  two|
> |large|   5.5|   two|   2.0|  
>  one|
> Expected Output
> ||C || bar_COLD || bar_COLB || foo_COLD || foo_COLB ||
> |small|   5.5|   two|2.3335|  
>  two|
> |large|   5.5|   two|   2.0|  
>  one|
> One approach you can fix this issue is to change the class
> sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
>  and change the outputName method in 
> {code}
> object ResolvePivot extends Rule[LogicalPlan] {
> def apply(plan: LogicalPlan): LogicalPlan = plan transform {
> {code}
> {code}
> def outputName(value: Literal, aggregate: Expression): String = {
>   val suffix = aggregate match {
>  case n: NamedExpression => 
> aggregate.asInstanceOf[NamedExpression].name
>  case _ => aggregate.sql
>}
>   if (singleAgg) value.toString else value + "_" + suffix
> }
> {code}
> Version : 2.0.0
> {code}
> def outputName(value: Literal, aggregate: Expression): String = {
>   if (singleAgg) value.toString else value + "_" + aggregate.sql
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17458) Alias specified for aggregates in a pivot are not honored

2016-09-15 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-17458.
---
   Resolution: Fixed
Fix Version/s: 2.1.0

I cannot find the JIRA username of the assignee (Andrew Ray). It would be 
great if someone could provide it.


> Alias specified for aggregates in a pivot are not honored
> -
>
> Key: SPARK-17458
> URL: https://issues.apache.org/jira/browse/SPARK-17458
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Ravi Somepalli
> Fix For: 2.1.0
>
>
> When using pivot and multiple aggregations we need to alias to avoid special 
> characters, but alias does not help because 
> df.groupBy("C").pivot("A").agg(avg("D").as("COLD"), max("B").as("COLB")).show
> ||C || bar_avg(`D`) AS `COLD` || bar_max(`B`) AS `COLB` || foo_avg(`D`) 
> AS `COLD` || foo_max(`B`) AS `COLB` ||
> |small|   5.5|   two|2.3335|  
>  two|
> |large|   5.5|   two|   2.0|  
>  one|
> Expected Output
> ||C || bar_COLD || bar_COLB || foo_COLD || foo_COLB ||
> |small|   5.5|   two|2.3335|  
>  two|
> |large|   5.5|   two|   2.0|  
>  one|
> One approach you can fix this issue is to change the class
> sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
>  and change the outputName method in 
> {code}
> object ResolvePivot extends Rule[LogicalPlan] {
> def apply(plan: LogicalPlan): LogicalPlan = plan transform {
> {code}
> {code}
> def outputName(value: Literal, aggregate: Expression): String = {
>   val suffix = aggregate match {
>  case n: NamedExpression => 
> aggregate.asInstanceOf[NamedExpression].name
>  case _ => aggregate.sql
>}
>   if (singleAgg) value.toString else value + "_" + suffix
> }
> {code}
> Version : 2.0.0
> {code}
> def outputName(value: Literal, aggregate: Expression): String = {
>   if (singleAgg) value.toString else value + "_" + aggregate.sql
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17557) SQL query on parquet table java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary

2016-09-15 Thread Egor Pahomov (JIRA)
Egor Pahomov created SPARK-17557:


 Summary: SQL query on parquet table 
java.lang.UnsupportedOperationException: 
org.apache.parquet.column.values.dictionary.PlainValuesDictionary
 Key: SPARK-17557
 URL: https://issues.apache.org/jira/browse/SPARK-17557
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Egor Pahomov


Working on 1.6.2, broken on 2.0

{code}
select * from logs.a where year=2016 and month=9 and day=14 limit 100
{code}

{code}
java.lang.UnsupportedOperationException: 
org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary
at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:48)
at 
org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getInt(OnHeapColumnVector.java:233)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
{code}
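The failing frame is in the vectorized Parquet reader 
({{OnHeapColumnVector.getInt}}), so one thing that may be worth trying while 
this is investigated - an assumption, not a confirmed fix - is falling back to 
the non-vectorized reader:
{code}
// Disable the vectorized Parquet reader and re-run the query. The config key
// exists in 2.0; whether it avoids this dictionary decode error is untested.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
spark.sql("select * from logs.a where year=2016 and month=9 and day=14 limit 100").show()
{code}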



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17556) Executor side broadcast for broadcast joins

2016-09-15 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-17556:
---

 Summary: Executor side broadcast for broadcast joins
 Key: SPARK-17556
 URL: https://issues.apache.org/jira/browse/SPARK-17556
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core, SQL
Reporter: Reynold Xin


Currently in Spark SQL, in order to perform a broadcast join, the driver must 
collect the result of an RDD and then broadcast it. This introduces some extra 
latency. It might be possible to broadcast directly from executors.
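For context, a broadcast join is typically requested as in the sketch below 
(the table names are made up); with the current implementation the driver 
collects {{dim}} and then broadcasts it before the join runs:
{code}
import org.apache.spark.sql.functions.broadcast

// Hypothetical tables; broadcast() is the standard hint that forces a
// broadcast-hash join on the smaller side.
val fact = spark.table("fact")
val dim  = spark.table("dim")

val joined = fact.join(broadcast(dim), Seq("key"))
joined.explain()
{code}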




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17364) Can not query hive table starting with number

2016-09-15 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-17364.
---
   Resolution: Fixed
 Assignee: Sean Zhong
Fix Version/s: 2.1.0
   2.0.1

> Can not query hive table starting with number
> -
>
> Key: SPARK-17364
> URL: https://issues.apache.org/jira/browse/SPARK-17364
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Egor Pahomov
>Assignee: Sean Zhong
> Fix For: 2.0.1, 2.1.0
>
>
> I can do it with spark-1.6.2
> {code}
> SELECT * from  temp.20160826_ip_list limit 100
> {code}
> {code}
> Error: org.apache.spark.sql.catalyst.parser.ParseException: 
> extraneous input '.20160826' expecting {, ',', 'SELECT', 'FROM', 'ADD', 
> 'AS', 'ALL', 'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', 
> 'ROLLUP', 'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', 
> 'EXISTS', 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', 
> 'ASC', 'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', 
> 'JOIN', 'CROSS', 'OUTER', 'INNER', 'LEFT', 'RIGHT', 'FULL', 'NATURAL', 
> 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', 'UNBOUNDED', 
> 'PRECEDING', 'FOLLOWING', 'CURRENT', 'ROW', 'WITH', 'VALUES', 'CREATE', 
> 'TABLE', 'VIEW', 'REPLACE', 'INSERT', 'DELETE', 'INTO', 'DESCRIBE', 
> 'EXPLAIN', 'FORMAT', 'LOGICAL', 'CODEGEN', 'CAST', 'SHOW', 'TABLES', 
> 'COLUMNS', 'COLUMN', 'USE', 'PARTITIONS', 'FUNCTIONS', 'DROP', 'UNION', 
> 'EXCEPT', 'INTERSECT', 'TO', 'TABLESAMPLE', 'STRATIFY', 'ALTER', 'RENAME', 
> 'ARRAY', 'MAP', 'STRUCT', 'COMMENT', 'SET', 'RESET', 'DATA', 'START', 
> 'TRANSACTION', 'COMMIT', 'ROLLBACK', 'MACRO', 'IF', 'DIV', 'PERCENT', 
> 'BUCKET', 'OUT', 'OF', 'SORT', 'CLUSTER', 'DISTRIBUTE', 'OVERWRITE', 
> 'TRANSFORM', 'REDUCE', 'USING', 'SERDE', 'SERDEPROPERTIES', 'RECORDREADER', 
> 'RECORDWRITER', 'DELIMITED', 'FIELDS', 'TERMINATED', 'COLLECTION', 'ITEMS', 
> 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 'FUNCTION', 'EXTENDED', 'REFRESH', 
> 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 'FORMATTED', TEMPORARY, 'OPTIONS', 
> 'UNSET', 'TBLPROPERTIES', 'DBPROPERTIES', 'BUCKETS', 'SKEWED', 'STORED', 
> 'DIRECTORIES', 'LOCATION', 'EXCHANGE', 'ARCHIVE', 'UNARCHIVE', 'FILEFORMAT', 
> 'TOUCH', 'COMPACT', 'CONCATENATE', 'CHANGE', 'CASCADE', 'RESTRICT', 
> 'CLUSTERED', 'SORTED', 'PURGE', 'INPUTFORMAT', 'OUTPUTFORMAT', DATABASE, 
> DATABASES, 'DFS', 'TRUNCATE', 'ANALYZE', 'COMPUTE', 'LIST', 'STATISTICS', 
> 'PARTITIONED', 'EXTERNAL', 'DEFINED', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', 
> 'MSCK', 'REPAIR', 'RECOVER', 'EXPORT', 'IMPORT', 'LOAD', 'ROLE', 'ROLES', 
> 'COMPACTIONS', 'PRINCIPALS', 'TRANSACTIONS', 'INDEX', 'INDEXES', 'LOCKS', 
> 'OPTION', 'ANTI', 'LOCAL', 'INPATH', 'CURRENT_DATE', 'CURRENT_TIMESTAMP', 
> IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 19)
> == SQL ==
> SELECT * from  temp.20160826_ip_list limit 100
> ---^^^
> SQLState:  null
> ErrorCode: 0
> {code}
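A workaround that generally applies to identifiers starting with a digit (an 
assumption that it covers this table, but backtick quoting is standard Spark 
SQL) is to quote the name so the parser does not read the leading digits as a 
numeric literal:
{code}
// Backticks let the table name start with a digit.
spark.sql("SELECT * FROM temp.`20160826_ip_list` LIMIT 100").show(100)
{code}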



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17484) Race condition when cancelling a job during a cache write can lead to block fetch failures

2016-09-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-17484.

   Resolution: Fixed
Fix Version/s: 2.1.0
   2.0.1

Issue resolved by pull request 15085
[https://github.com/apache/spark/pull/15085]

> Race condition when cancelling a job during a cache write can lead to block 
> fetch failures
> --
>
> Key: SPARK-17484
> URL: https://issues.apache.org/jira/browse/SPARK-17484
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Affects Versions: 2.0.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 2.0.1, 2.1.0
>
>
> On a production cluster, I observed the following weird behavior where a 
> block manager cached a block, the store failed due to a task being killed / 
> cancelled, and then a subsequent task incorrectly attempted to read the 
> cached block from the machine where the write failed, eventually leading to a 
> complete job failure.
> Here's the executor log snippet from the machine performing the failed cache:
> {code}
> 16/09/06 16:10:31 INFO MemoryStore: Block rdd_25_1 stored as values in memory 
> (estimated size 976.8 MB, free 9.8 GB)
> 16/09/06 16:10:31 WARN BlockManager: Putting block rdd_25_1 failed
> 16/09/06 16:10:31 INFO Executor: Executor killed task 0.0 in stage 3.0 (TID 
> 127)
> {code}
> Here's the exception from the reader in the failed job:
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in 
> stage 46.0 failed 4 times, most recent failure: Lost task 4.3 in stage 46.0 
> (TID 1484, 10.69.255.197): org.apache.spark.storage.BlockFetchException: 
> Failed to fetch block after 1 fetch failures. Most recent failure cause:
>   at 
> org.apache.spark.storage.BlockManager.getRemoteBytes(BlockManager.scala:565)
>   at 
> org.apache.spark.storage.BlockManager.getRemoteValues(BlockManager.scala:522)
>   at org.apache.spark.storage.BlockManager.get(BlockManager.scala:609)
>   at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:661)
> {code}
> I believe that there's a race condition in how we handle cleanup after failed 
> cache stores. Here's an excerpt from {{BlockManager.doPut()}}
> {code}
> var blockWasSuccessfullyStored: Boolean = false
> val result: Option[T] = try {
>   val res = putBody(putBlockInfo)
>   blockWasSuccessfullyStored = res.isEmpty
>   res
> } finally {
>   if (blockWasSuccessfullyStored) {
> if (keepReadLock) {
>   blockInfoManager.downgradeLock(blockId)
> } else {
>   blockInfoManager.unlock(blockId)
> }
>   } else {
> blockInfoManager.removeBlock(blockId)
> logWarning(s"Putting block $blockId failed")
>   }
>   }
> {code}
> The only way that I think this "successfully stored followed by immediately 
> failed" case could appear in our logs is if the local memory store write 
> succeeds and then an exception (perhaps InterruptedException) causes us to 
> enter the {{finally}} block's error-cleanup path. The problem is that the 
> {{finally}} block only cleans up the block's metadata rather than performing 
> the full cleanup path which would also notify the master that the block is no 
> longer available at this host.
> The fact that the Spark task was not resilient to the failed remote block 
> fetches is a separate issue which I'll report and fix separately. The scope 
> of this JIRA, however, is the fact that Spark still attempted reads from a 
> machine which was missing the block.
> In order to fix this problem, I think that the {{finally}} block should 
> perform more thorough cleanup and should send a "block removed" status update 
> to the master following any error during the write. This is necessary because 
> the body of {{doPut()}} may have already notified the master of block 
> availability. In addition, we can extend the block serving code path to 
> automatically update the master with "block deleted" statuses whenever the 
> block manager receives invalid requests for blocks that it doesn't have.
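> A minimal, self-contained sketch of the cleanup pattern described above 
> (illustrative only; the helper names are placeholders, not the actual 
> BlockManager API):
> {code}
> object PutCleanupSketch {
>   // On failure, remove the local metadata AND tell the master the block is
>   // gone, so readers are never routed to a host that no longer has the data.
>   def doPutSketch[T](blockId: String)(putBody: => Option[T])(
>       removeLocal: String => Unit,
>       reportRemovedToMaster: String => Unit): Option[T] = {
>     var blockWasSuccessfullyStored = false
>     try {
>       val res = putBody
>       blockWasSuccessfullyStored = res.isEmpty
>       res
>     } finally {
>       if (!blockWasSuccessfullyStored) {
>         removeLocal(blockId)           // existing behavior: drop local metadata
>         reportRemovedToMaster(blockId) // proposed: also notify the master
>       }
>     }
>   }
> }
> {code}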



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17483) Minor refactoring and cleanup in BlockManager block status reporting and block removal

2016-09-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-17483:
---
Fix Version/s: 2.0.1

> Minor refactoring and cleanup in BlockManager block status reporting and 
> block removal
> --
>
> Key: SPARK-17483
> URL: https://issues.apache.org/jira/browse/SPARK-17483
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 2.0.1, 2.1.0
>
>
> As a precursor to fixing a block fetch bug, I'd like to split a few small 
> refactorings in BlockManager into their own patch (hence this JIRA).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17429) spark sql length(1) return error

2016-09-15 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-17429.
---
   Resolution: Fixed
 Assignee: cen yuhai
Fix Version/s: 2.1.0

> spark sql length(1) return error
> 
>
> Key: SPARK-17429
> URL: https://issues.apache.org/jira/browse/SPARK-17429
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2
>Reporter: cen yuhai
>Assignee: cen yuhai
> Fix For: 2.1.0
>
>
> select length(11);
> select length(2.0);
> These SQL statements return errors, but Hive accepts them.
> Error in query: cannot resolve 'length(11)' due to data type mismatch: 
> argument 1 requires (string or binary) type, however, '11' is of int type.; 
> line 1 pos 14
> Error in query: cannot resolve 'length(2.0)' due to data type mismatch: 
> argument 1 requires (string or binary) type, however, '2.0' is of double 
> type.; line 1 pos 14
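> A workaround sketch for the affected versions, assuming a spark-shell session 
> where {{sqlContext}} is available (illustrative only): cast the argument to a 
> string before calling length.
> {code}
> sqlContext.sql("SELECT length(CAST(11 AS STRING))").show()
> sqlContext.sql("SELECT length(CAST(2.0 AS STRING))").show()
> {code}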



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17114) Adding a 'GROUP BY 1' where first column is literal results in wrong answer

2016-09-15 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-17114.
---
   Resolution: Fixed
 Assignee: Herman van Hovell
Fix Version/s: 2.1.0
   2.0.1

> Adding a 'GROUP BY 1' where first column is literal results in wrong answer
> ---
>
> Key: SPARK-17114
> URL: https://issues.apache.org/jira/browse/SPARK-17114
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.2, 2.0.0
>Reporter: Josh Rosen
>Assignee: Herman van Hovell
>  Labels: correctness
> Fix For: 2.0.1, 2.1.0
>
>
> Consider the following example:
> {code}
> sc.parallelize(Seq(128, 256)).toDF("int_col").registerTempTable("mytable")
> // The following query should return an empty result set because the filter
> // condition `int_col == 0` is false for every row of this table.
> val withoutGroupBy = sqlContext.sql("""
>   SELECT 'foo'
>   FROM mytable
>   WHERE int_col == 0
> """)
> assert(withoutGroupBy.collect().isEmpty, "original query returned wrong 
> answer")
> // After adding a 'GROUP BY 1' the query result should still be empty because 
> we'd be grouping an empty table:
> val withGroupBy = sqlContext.sql("""
>   SELECT 'foo'
>   FROM mytable
>   WHERE int_col == 0
>   GROUP BY 1
> """)
> assert(withGroupBy.collect().isEmpty, "adding GROUP BY resulted in wrong 
> answer")
> {code}
> Here, this fails the second assertion by returning a single row. It appears 
> that running {{group by 1}} where column 1 is a constant causes filter 
> conditions to be ignored.
> Both PostgreSQL and SQLite return empty result sets for the query containing 
> the {{GROUP BY}}. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17547) Temporary shuffle data files may be leaked following exception in write

2016-09-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-17547.

   Resolution: Fixed
Fix Version/s: 2.1.0
   2.0.1
   1.6.3

Issue resolved by pull request 15104
[https://github.com/apache/spark/pull/15104]

> Temporary shuffle data files may be leaked following exception in write
> ---
>
> Key: SPARK-17547
> URL: https://issues.apache.org/jira/browse/SPARK-17547
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 1.5.3, 1.6.0, 2.0.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 1.6.3, 2.0.1, 2.1.0
>
>
> SPARK-8029 modified shuffle writers to first stage their data to a temporary 
> file in the same directory as the final destination file and then to 
> atomically rename the file at the end of the write job. However, this change 
> introduced the potential for the temporary output file to be leaked if an 
> exception occurs during the write because the shuffle writers' existing error 
> cleanup code doesn't handle this new temp file.
> This is easy to fix: we just need to add a {{finally}} block to ensure that 
> the temporary file is guaranteed to be either moved or deleted before 
> exiting the shuffle write method.
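> A minimal sketch of the {{finally}} pattern being described (illustrative 
> only, not the actual shuffle writer code; names are placeholders):
> {code}
> import java.io.{File, IOException}
> object TmpFileCleanupSketch {
>   // Write to a temp file, then either rename it into place on success or
>   // delete it on failure, so the temp file is never leaked.
>   def writeAtomically(tmp: File, dest: File)(write: File => Unit): Unit = {
>     var success = false
>     try {
>       write(tmp)
>       success = true
>     } finally {
>       if (success) {
>         if (!tmp.renameTo(dest)) {
>           throw new IOException(s"Failed to rename $tmp to $dest")
>         }
>       } else if (tmp.exists()) {
>         tmp.delete() // clean up instead of leaking the temp file
>       }
>     }
>   }
> }
> {code}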



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17379) Upgrade netty-all to 4.0.41.Final (4.1.5-Final not compatible)

2016-09-15 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-17379.
--
   Resolution: Fixed
Fix Version/s: 2.1.0

> Upgrade netty-all to 4.0.41.Final (4.1.5-Final not compatible)
> --
>
> Key: SPARK-17379
> URL: https://issues.apache.org/jira/browse/SPARK-17379
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 1.6.2, 2.0.0
>Reporter: Adam Roberts
>Assignee: Adam Roberts
>Priority: Trivial
> Fix For: 2.1.0
>
>
> We should use the newest version of netty based on the info here: 
> http://netty.io/news/2016/08/29/4-0-41-Final-4-1-5-Final.html; we are 
> especially interested in the static initialiser deadlock fix: 
> https://github.com/netty/netty/pull/5730
> Lots more fixes are mentioned, so I will create the pull request - again a 
> case of updating the pom and then the dependency files.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17451) CoarseGrainedExecutorBackend should inform driver before self-kill

2016-09-15 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-17451.
--
   Resolution: Fixed
 Assignee: Tejas Patil
Fix Version/s: 2.1.0

> CoarseGrainedExecutorBackend should inform driver before self-kill
> --
>
> Key: SPARK-17451
> URL: https://issues.apache.org/jira/browse/SPARK-17451
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Tejas Patil
>Assignee: Tejas Patil
>Priority: Minor
> Fix For: 2.1.0
>
>
> `CoarseGrainedExecutorBackend` exits the JVM in some failure cases. While 
> this is not a problem in itself, the driver UI captures no specific reason 
> for the exit. `exitExecutor` [0] already has the reason, which could be 
> passed on to the driver.
> [0] : 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala#L151
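> A generic sketch of the proposed behavior (illustrative only, not the actual 
> CoarseGrainedExecutorBackend code; {{notifyDriver}} is a placeholder):
> {code}
> object ExitExecutorSketch {
>   // Report the reason to the driver first, then terminate the JVM, so the
>   // driver UI can show why the executor went away.
>   def exitExecutor(code: Int, reason: String)(notifyDriver: String => Unit): Unit = {
>     try {
>       notifyDriver(reason) // proposed: send the reason before self-kill
>     } finally {
>       System.exit(code)    // existing behavior: exit the JVM
>     }
>   }
> }
> {code}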



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16407) Allow users to supply custom StreamSinkProviders

2016-09-15 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15493936#comment-15493936
 ] 

holdenk commented on SPARK-16407:
-

I think it's important to keep in mind that these APIs are already exposed - you 
can pass in a custom sink by full class name as a string. So while the current 
Sink API does indeed expose a micro-batch API, that seems somewhat orthogonal to 
allowing people to specify Sources/Sinks programmatically - the Source/Sink 
API/interface itself needs to be improved, but even once it is, it still makes 
sense to allow people to specify their own Sources/Sinks programmatically.

It might make sense to go back in and mark them as experimental - but if 
anything that is a reason to keep improving them.

> Allow users to supply custom StreamSinkProviders
> 
>
> Key: SPARK-16407
> URL: https://issues.apache.org/jira/browse/SPARK-16407
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: holdenk
>
> The current DataStreamWriter allows users to specify a class name as the 
> format; however, it could be easier for people to directly pass in a specific 
> provider instance - e.g. for a user equivalent of ForeachSink or another sink 
> with non-string parameters.
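> A sketch of the two styles being discussed, assuming a Spark 2.0 spark-shell 
> session with a streaming DataFrame {{df}}; the provider class name is made up 
> for illustration, and the programmatic overload is hypothetical (it is what 
> this ticket asks for):
> {code}
> // Today: a custom sink is selected by fully qualified class name, and can
> // only be configured through string options.
> val query = df.writeStream
>   .format("com.example.MyCustomSinkProvider")
>   .option("path", "/tmp/out")
>   .start()
>
> // Proposed (hypothetical API): pass a provider instance directly, so the
> // sink can receive non-string parameters such as callbacks.
> // df.writeStream.format(new MyCustomSinkProvider(myCallback)).start()
> {code}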



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17549) InMemoryRelation doesn't scale to large tables

2016-09-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17549:


Assignee: (was: Apache Spark)

> InMemoryRelation doesn't scale to large tables
> --
>
> Key: SPARK-17549
> URL: https://issues.apache.org/jira/browse/SPARK-17549
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Marcelo Vanzin
> Attachments: create_parquet.scala, example_1.6_post_patch.png, 
> example_1.6_pre_patch.png, spark-1.6-2.patch, spark-1.6.patch, spark-2.0.patch
>
>
> An {{InMemoryRelation}} is created when you cache a table; but if the table 
> is large, defined by either having a really large number of columns, or a 
> really large number of partitions (in the file split sense, not the "table 
> partition" sense), or both, it causes an immense amount of memory to be used 
> in the driver.
> The reason is that it uses an accumulator to collect statistics about each 
> partition, and instead of summarizing the data in the driver, it keeps *all* 
> entries in memory.
> I'm attaching a script I used to create a parquet file with 20,000 columns 
> and a single row, which I then copied 500 times so I'd have 500 partitions.
> When doing the following:
> {code}
> sqlContext.read.parquet(...).count()
> {code}
> Everything works fine, both in Spark 1.6 and 2.0. (It's super slow with the 
> settings I used, but it works.)
> I ran spark-shell like this:
> {code}
> ./bin/spark-shell --master 'local-cluster[4,1,4096]' --driver-memory 2g 
> --conf spark.executor.memory=2g
> {code}
> And ran:
> {code}
> sqlContext.read.parquet(...).cache().count()
> {code}
> You'll see the results in screenshot {{example_1.6_pre_patch.png}}. After 40 
> partitions were processed, there were 40 GenericInternalRow objects with
> 100,000 items each (5 stat info fields * 20,000 columns). So, memory usage 
> was:
> {code}
>   40 * 100000 * (4 * 20 + 24) = 416000000 =~ 400MB
> {code}
> (Note: Integer = 20 bytes, Long = 24 bytes.)
> If I waited until the end, there would be 500 partitions, so ~ 5GB of memory 
> to hold the stats.
> I'm also attaching a patch I made on top of 1.6 that uses just a long 
> accumulator to capture the table size; with that patch memory usage on the 
> driver doesn't keep growing. Also note in the patch that I'm multiplying the 
> column size by the row count, which I think is a different bug in the 
> existing code (those stats should be for the whole batch, not just a single 
> row, right?). I also added {{example_1.6_post_patch.png}} to show the 
> {{InMemoryRelation}} with the patch.
> I also applied a very similar patch on top of Spark 2.0. But there things 
> blow up even more spectacularly when I try to run the count on the cached 
> table. It starts with this error:
> {noformat}
> 14:19:43 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 1.0 (TID 2, 
> vanzin-st1-3.gce.cloudera.com): java.util.concurrent.ExecutionException: 
> java.lang.Exception: failed to compile: java.lang.IndexOutOfBoundsException: 
> Index: 63235, Size: 1
> (lots of generated code here...)
> Caused by: java.lang.IndexOutOfBoundsException: Index: 63235, Size: 1
>   at java.util.ArrayList.rangeCheck(ArrayList.java:635)
>   at java.util.ArrayList.get(ArrayList.java:411)
>   at 
> org.codehaus.janino.util.ClassFile.getConstantPoolInfo(ClassFile.java:556)
>   at 
> org.codehaus.janino.util.ClassFile.getConstantUtf8(ClassFile.java:572)
>   at org.codehaus.janino.util.ClassFile.loadAttribute(ClassFile.java:1513)
>   at org.codehaus.janino.util.ClassFile.loadAttributes(ClassFile.java:644)
>   at org.codehaus.janino.util.ClassFile.loadFields(ClassFile.java:623)
>   at org.codehaus.janino.util.ClassFile.<init>(ClassFile.java:280)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1.apply(CodeGenerator.scala:913)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1.apply(CodeGenerator.scala:911)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.recordCompilationStats(CodeGenerator.scala:911)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:883)
>   ... 54 more
> {noformat}
> And basically a lot of that going on making the output unreadable, so I just 
> killed the 

[jira] [Commented] (SPARK-17549) InMemoryRelation doesn't scale to large tables

2016-09-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15493929#comment-15493929
 ] 

Apache Spark commented on SPARK-17549:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/15112

> InMemoryRelation doesn't scale to large tables
> --
>
> Key: SPARK-17549
> URL: https://issues.apache.org/jira/browse/SPARK-17549
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Marcelo Vanzin
> Attachments: create_parquet.scala, example_1.6_post_patch.png, 
> example_1.6_pre_patch.png, spark-1.6-2.patch, spark-1.6.patch, spark-2.0.patch
>
>
> An {{InMemoryRelation}} is created when you cache a table; but if the table 
> is large, defined by either having a really large number of columns, or a 
> really large number of partitions (in the file split sense, not the "table 
> partition" sense), or both, it causes an immense amount of memory to be used 
> in the driver.
> The reason is that it uses an accumulator to collect statistics about each 
> partition, and instead of summarizing the data in the driver, it keeps *all* 
> entries in memory.
> I'm attaching a script I used to create a parquet file with 20,000 columns 
> and a single row, which I then copied 500 times so I'd have 500 partitions.
> When doing the following:
> {code}
> sqlContext.read.parquet(...).count()
> {code}
> Everything works fine, both in Spark 1.6 and 2.0. (It's super slow with the 
> settings I used, but it works.)
> I ran spark-shell like this:
> {code}
> ./bin/spark-shell --master 'local-cluster[4,1,4096]' --driver-memory 2g 
> --conf spark.executor.memory=2g
> {code}
> And ran:
> {code}
> sqlContext.read.parquet(...).cache().count()
> {code}
> You'll see the results in screenshot {{example_1.6_pre_patch.png}}. After 40 
> partitions were processed, there were 40 GenericInternalRow objects with
> 100,000 items each (5 stat info fields * 20,000 columns). So, memory usage 
> was:
> {code}
>   40 * 100000 * (4 * 20 + 24) = 416000000 =~ 400MB
> {code}
> (Note: Integer = 20 bytes, Long = 24 bytes.)
> If I waited until the end, there would be 500 partitions, so ~ 5GB of memory 
> to hold the stats.
> I'm also attaching a patch I made on top of 1.6 that uses just a long 
> accumulator to capture the table size; with that patch memory usage on the 
> driver doesn't keep growing. Also note in the patch that I'm multiplying the 
> column size by the row count, which I think is a different bug in the 
> existing code (those stats should be for the whole batch, not just a single 
> row, right?). I also added {{example_1.6_post_patch.png}} to show the 
> {{InMemoryRelation}} with the patch.
> I also applied a very similar patch on top of Spark 2.0. But there things 
> blow up even more spectacularly when I try to run the count on the cached 
> table. It starts with this error:
> {noformat}
> 14:19:43 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 1.0 (TID 2, 
> vanzin-st1-3.gce.cloudera.com): java.util.concurrent.ExecutionException: 
> java.lang.Exception: failed to compile: java.lang.IndexOutOfBoundsException: 
> Index: 63235, Size: 1
> (lots of generated code here...)
> Caused by: java.lang.IndexOutOfBoundsException: Index: 63235, Size: 1
>   at java.util.ArrayList.rangeCheck(ArrayList.java:635)
>   at java.util.ArrayList.get(ArrayList.java:411)
>   at 
> org.codehaus.janino.util.ClassFile.getConstantPoolInfo(ClassFile.java:556)
>   at 
> org.codehaus.janino.util.ClassFile.getConstantUtf8(ClassFile.java:572)
>   at org.codehaus.janino.util.ClassFile.loadAttribute(ClassFile.java:1513)
>   at org.codehaus.janino.util.ClassFile.loadAttributes(ClassFile.java:644)
>   at org.codehaus.janino.util.ClassFile.loadFields(ClassFile.java:623)
>   at org.codehaus.janino.util.ClassFile.<init>(ClassFile.java:280)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1.apply(CodeGenerator.scala:913)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1.apply(CodeGenerator.scala:911)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.recordCompilationStats(CodeGenerator.scala:911)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:883)
>   ... 54 more
> 

[jira] [Assigned] (SPARK-17549) InMemoryRelation doesn't scale to large tables

2016-09-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17549:


Assignee: Apache Spark

> InMemoryRelation doesn't scale to large tables
> --
>
> Key: SPARK-17549
> URL: https://issues.apache.org/jira/browse/SPARK-17549
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Marcelo Vanzin
>Assignee: Apache Spark
> Attachments: create_parquet.scala, example_1.6_post_patch.png, 
> example_1.6_pre_patch.png, spark-1.6-2.patch, spark-1.6.patch, spark-2.0.patch
>
>
> An {{InMemoryRelation}} is created when you cache a table; but if the table 
> is large, defined by either having a really large number of columns, or a 
> really large number of partitions (in the file split sense, not the "table 
> partition" sense), or both, it causes an immense amount of memory to be used 
> in the driver.
> The reason is that it uses an accumulator to collect statistics about each 
> partition, and instead of summarizing the data in the driver, it keeps *all* 
> entries in memory.
> I'm attaching a script I used to create a parquet file with 20,000 columns 
> and a single row, which I then copied 500 times so I'd have 500 partitions.
> When doing the following:
> {code}
> sqlContext.read.parquet(...).count()
> {code}
> Everything works fine, both in Spark 1.6 and 2.0. (It's super slow with the 
> settings I used, but it works.)
> I ran spark-shell like this:
> {code}
> ./bin/spark-shell --master 'local-cluster[4,1,4096]' --driver-memory 2g 
> --conf spark.executor.memory=2g
> {code}
> And ran:
> {code}
> sqlContext.read.parquet(...).cache().count()
> {code}
> You'll see the results in screenshot {{example_1.6_pre_patch.png}}. After 40 
> partitions were processed, there were 40 GenericInternalRow objects with
> 100,000 items each (5 stat info fields * 20,000 columns). So, memory usage 
> was:
> {code}
>   40 * 100000 * (4 * 20 + 24) = 416000000 =~ 400MB
> {code}
> (Note: Integer = 20 bytes, Long = 24 bytes.)
> If I waited until the end, there would be 500 partitions, so ~ 5GB of memory 
> to hold the stats.
> I'm also attaching a patch I made on top of 1.6 that uses just a long 
> accumulator to capture the table size; with that patch memory usage on the 
> driver doesn't keep growing. Also note in the patch that I'm multiplying the 
> column size by the row count, which I think is a different bug in the 
> existing code (those stats should be for the whole batch, not just a single 
> row, right?). I also added {{example_1.6_post_patch.png}} to show the 
> {{InMemoryRelation}} with the patch.
> I also applied a very similar patch on top of Spark 2.0. But there things 
> blow up even more spectacularly when I try to run the count on the cached 
> table. It starts with this error:
> {noformat}
> 14:19:43 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 1.0 (TID 2, 
> vanzin-st1-3.gce.cloudera.com): java.util.concurrent.ExecutionException: 
> java.lang.Exception: failed to compile: java.lang.IndexOutOfBoundsException: 
> Index: 63235, Size: 1
> (lots of generated code here...)
> Caused by: java.lang.IndexOutOfBoundsException: Index: 63235, Size: 1
>   at java.util.ArrayList.rangeCheck(ArrayList.java:635)
>   at java.util.ArrayList.get(ArrayList.java:411)
>   at 
> org.codehaus.janino.util.ClassFile.getConstantPoolInfo(ClassFile.java:556)
>   at 
> org.codehaus.janino.util.ClassFile.getConstantUtf8(ClassFile.java:572)
>   at org.codehaus.janino.util.ClassFile.loadAttribute(ClassFile.java:1513)
>   at org.codehaus.janino.util.ClassFile.loadAttributes(ClassFile.java:644)
>   at org.codehaus.janino.util.ClassFile.loadFields(ClassFile.java:623)
>   at org.codehaus.janino.util.ClassFile.<init>(ClassFile.java:280)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1.apply(CodeGenerator.scala:913)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1.apply(CodeGenerator.scala:911)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.recordCompilationStats(CodeGenerator.scala:911)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:883)
>   ... 54 more
> {noformat}
> And basically a lot of that going on making the output unreadable, 

[jira] [Commented] (SPARK-17458) Alias specified for aggregates in a pivot are not honored

2016-09-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15493777#comment-15493777
 ] 

Apache Spark commented on SPARK-17458:
--

User 'aray' has created a pull request for this issue:
https://github.com/apache/spark/pull/15111

> Alias specified for aggregates in a pivot are not honored
> -
>
> Key: SPARK-17458
> URL: https://issues.apache.org/jira/browse/SPARK-17458
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Ravi Somepalli
>
> When using pivot with multiple aggregations we need to alias the aggregates 
> to avoid special characters in the result column names, but the alias is not 
> honored:
> df.groupBy("C").pivot("A").agg(avg("D").as("COLD"), max("B").as("COLB")).show
> ||C || bar_avg(`D`) AS `COLD` || bar_max(`B`) AS `COLB` || foo_avg(`D`) AS `COLD` || foo_max(`B`) AS `COLB` ||
> |small| 5.5| two| 2.3335| two|
> |large| 5.5| two| 2.0| one|
> Expected output:
> ||C || bar_COLD || bar_COLB || foo_COLD || foo_COLB ||
> |small| 5.5| two| 2.3335| two|
> |large| 5.5| two| 2.0| one|
> One approach to fixing this issue is to change the outputName method in 
> sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
> {code}
> object ResolvePivot extends Rule[LogicalPlan] {
>   def apply(plan: LogicalPlan): LogicalPlan = plan transform {
> {code}
> {code}
> def outputName(value: Literal, aggregate: Expression): String = {
>   val suffix = aggregate match {
>     case n: NamedExpression => n.name
>     case _ => aggregate.sql
>   }
>   if (singleAgg) value.toString else value + "_" + suffix
> }
> {code}
> Version : 2.0.0
> {code}
> def outputName(value: Literal, aggregate: Expression): String = {
>   if (singleAgg) value.toString else value + "_" + aggregate.sql
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17458) Alias specified for aggregates in a pivot are not honored

2016-09-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17458:


Assignee: (was: Apache Spark)

> Alias specified for aggregates in a pivot are not honored
> -
>
> Key: SPARK-17458
> URL: https://issues.apache.org/jira/browse/SPARK-17458
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Ravi Somepalli
>
> When using pivot with multiple aggregations we need to alias the aggregates 
> to avoid special characters in the result column names, but the alias is not 
> honored:
> df.groupBy("C").pivot("A").agg(avg("D").as("COLD"), max("B").as("COLB")).show
> ||C || bar_avg(`D`) AS `COLD` || bar_max(`B`) AS `COLB` || foo_avg(`D`) AS `COLD` || foo_max(`B`) AS `COLB` ||
> |small| 5.5| two| 2.3335| two|
> |large| 5.5| two| 2.0| one|
> Expected output:
> ||C || bar_COLD || bar_COLB || foo_COLD || foo_COLB ||
> |small| 5.5| two| 2.3335| two|
> |large| 5.5| two| 2.0| one|
> One approach to fixing this issue is to change the outputName method in 
> sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
> {code}
> object ResolvePivot extends Rule[LogicalPlan] {
>   def apply(plan: LogicalPlan): LogicalPlan = plan transform {
> {code}
> {code}
> def outputName(value: Literal, aggregate: Expression): String = {
>   val suffix = aggregate match {
>     case n: NamedExpression => n.name
>     case _ => aggregate.sql
>   }
>   if (singleAgg) value.toString else value + "_" + suffix
> }
> {code}
> Version : 2.0.0
> {code}
> def outputName(value: Literal, aggregate: Expression): String = {
>   if (singleAgg) value.toString else value + "_" + aggregate.sql
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17458) Alias specified for aggregates in a pivot are not honored

2016-09-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17458:


Assignee: Apache Spark

> Alias specified for aggregates in a pivot are not honored
> -
>
> Key: SPARK-17458
> URL: https://issues.apache.org/jira/browse/SPARK-17458
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Ravi Somepalli
>Assignee: Apache Spark
>
> When using pivot with multiple aggregations we need to alias the aggregates 
> to avoid special characters in the result column names, but the alias is not 
> honored:
> df.groupBy("C").pivot("A").agg(avg("D").as("COLD"), max("B").as("COLB")).show
> ||C || bar_avg(`D`) AS `COLD` || bar_max(`B`) AS `COLB` || foo_avg(`D`) AS `COLD` || foo_max(`B`) AS `COLB` ||
> |small| 5.5| two| 2.3335| two|
> |large| 5.5| two| 2.0| one|
> Expected output:
> ||C || bar_COLD || bar_COLB || foo_COLD || foo_COLB ||
> |small| 5.5| two| 2.3335| two|
> |large| 5.5| two| 2.0| one|
> One approach to fixing this issue is to change the outputName method in 
> sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
> {code}
> object ResolvePivot extends Rule[LogicalPlan] {
>   def apply(plan: LogicalPlan): LogicalPlan = plan transform {
> {code}
> {code}
> def outputName(value: Literal, aggregate: Expression): String = {
>   val suffix = aggregate match {
>     case n: NamedExpression => n.name
>     case _ => aggregate.sql
>   }
>   if (singleAgg) value.toString else value + "_" + suffix
> }
> {code}
> Version : 2.0.0
> {code}
> def outputName(value: Literal, aggregate: Expression): String = {
>   if (singleAgg) value.toString else value + "_" + aggregate.sql
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-15917) Define the number of executors in standalone mode with an easy-to-use property

2016-09-15 Thread Jonathan Taws (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15493395#comment-15493395
 ] 

Jonathan Taws edited comment on SPARK-15917 at 9/15/16 4:05 PM:


Hi Andrew,

Your 2 suggestions make a lot of sense, I'll work on it straight away. Give me 
a few days to try them out. 

By the way, do you have any pointers to where I could look to produce a 
meaningful warning when there aren't enough resources to comply with all the 
parameters?


was (Author: jonathantaws):
Hi Andrew,

Your 2 suggestions make a lot of sense, I'll work on it straight away. Give me 
a few days to try them out. 

> Define the number of executors in standalone mode with an easy-to-use property
> --
>
> Key: SPARK-15917
> URL: https://issues.apache.org/jira/browse/SPARK-15917
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Spark Shell, Spark Submit
>Affects Versions: 1.6.1
>Reporter: Jonathan Taws
>Priority: Minor
>
> After stumbling across a few StackOverflow posts around the issue of using a 
> fixed number of executors in standalone mode (non-YARN), I was wondering if 
> we could not add an easier way to set this parameter than having to resort to 
> some calculations based on the number of cores and the memory you have 
> available on your worker. 
> For example, let's say I have 8 cores and 30GB of memory available :
>  - If no option is passed, one executor will be spawned with 8 cores and 1GB 
> of memory allocated.
>  - However, if I want to have only *2* executors, and to use 2 cores and 10GB 
> of memory per executor, I will end up with *3* executors (as the available 
> memory will limit the number of executors) instead of the 2 I was hoping for.
> Sure, I can set {{spark.cores.max}} as a workaround to get exactly what I 
> want, but would it not be easier to add a {{--num-executors}}-like option to 
> standalone mode to be able to really fine-tune the configuration ? This 
> option is already available in YARN mode.
> From my understanding, I don't see any other option lying around that can 
> help achieve this.  
> This seems to be slightly disturbing for newcomers, and standalone mode is 
> probably the first thing anyone will use to just try out Spark or test some 
> configuration.  
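> A sketch of the current workaround with {{spark.cores.max}} for the example 
> above (illustrative values only, assuming the 8-core / 30GB worker described):
> {code}
> import org.apache.spark.SparkConf
> // We want exactly 2 executors with 2 cores and 10GB each.
> val conf = new SparkConf()
>   .set("spark.cores.max", "4")         // cap the app at 4 cores in total
>   .set("spark.executor.cores", "2")    // 2 cores per executor
>   .set("spark.executor.memory", "10g") // 10GB per executor
> // 4 total cores / 2 cores per executor => 2 executors, even though memory
> // alone would have allowed 3.
> {code}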



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17544) Timeout waiting for connection from pool, DataFrame Reader's not closing S3 connections?

2016-09-15 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell updated SPARK-17544:
--
Description: 
I have an application that loops through a text file to find files in S3 and 
then reads them in, performs some ETL processes, and then writes them out. This 
works for around 80 loops until I get this:  
{noformat}
16/09/14 18:58:23 INFO S3NativeFileSystem: Opening 
's3://webpt-emr-us-west-2-hadoop/edw/master/fdbk_ideavote/20160907/part-m-0.avro'
 for reading
16/09/14 18:59:13 INFO AmazonHttpClient: Unable to execute HTTP request: 
Timeout waiting for connection from pool
com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.conn.ConnectionPoolTimeoutException:
 Timeout waiting for connection from pool
at 
com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.conn.PoolingClientConnectionManager.leaseConnection(PoolingClientConnectionManager.java:226)
at 
com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.conn.PoolingClientConnectionManager$1.getConnection(PoolingClientConnectionManager.java:195)
at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.conn.ClientConnectionRequestFactory$Handler.invoke(ClientConnectionRequestFactory.java:70)
at 
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.conn.$Proxy37.getConnection(Unknown
 Source)
at 
com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:423)
at 
com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863)
at 
com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
at 
com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
at 
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.executeOneRequest(AmazonHttpClient.java:837)
at 
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:607)
at 
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.doExecute(AmazonHttpClient.java:376)
at 
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.executeWithTimer(AmazonHttpClient.java:338)
at 
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:287)
at 
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3826)
at 
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:1015)
at 
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:991)
at 
com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.retrieveMetadata(Jets3tNativeFileSystemStore.java:212)
at sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy36.retrieveMetadata(Unknown Source)
at 
com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:780)
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1428)
at 
com.amazon.ws.emr.hadoop.fs.EmrFileSystem.exists(EmrFileSystem.java:313)
at 
org.apache.spark.sql.execution.datasources.DataSource.hasMetadata(DataSource.scala:289)
at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:324)
at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$18.apply(DataSource.scala:452)
at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$18.apply(DataSource.scala:458)
at scala.util.Try$.apply(Try.scala:192)
at 
org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:451)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:211)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:194)
at 
com.databricks.spark.avro.package$AvroDataFrameWriter$$anonfun$avro$1.apply(package.scala:26)
at 

[jira] [Created] (SPARK-17555) ExternalShuffleBlockResolver fails randomly with External Shuffle Service and Dynamic Resource Allocation on Mesos running under Marathon

2016-09-15 Thread Brad Willard (JIRA)
Brad Willard created SPARK-17555:


 Summary: ExternalShuffleBlockResolver fails randomly with External 
Shuffle Service  and Dynamic Resource Allocation on Mesos running under Marathon
 Key: SPARK-17555
 URL: https://issues.apache.org/jira/browse/SPARK-17555
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.0.0, 1.6.2, 1.6.1, 1.5.2
 Environment: Mesos using docker with external shuffle service running 
on Marathon. Running code from pyspark shell in client mode.
Reporter: Brad Willard


The External Shuffle Service throws these errors about 90% of the time. It 
seems to die between stages and to fail inconsistently with this style of 
error about missing files. I've tested this same behavior with all the Spark 
versions listed on the jira, using the pre-built Hadoop 2.6 distributions from 
the Apache Spark download page. I also want to mention that everything works 
successfully with dynamic resource allocation turned off.

I have read other related bugs and have tried some of the 
workarounds/suggestions. It seems some people have blamed the switch from akka 
to netty, which got me testing this in the 1.5* range with no luck. I'm 
currently running these config options (informed by reading other bugs on jira 
that seemed related to my problem). These settings have helped it work 
sometimes instead of never.

{code}
spark.shuffle.service.port  7338
spark.shuffle.io.numConnectionsPerPeer  4
spark.shuffle.io.connectionTimeout  18000s
spark.shuffle.service.enabled   true
spark.dynamicAllocation.enabled true
{code}

on the driver for pyspark submit I'm sending along this config

{code}
--conf 
spark.mesos.executor.docker.image=docker-registry.x.net/machine-learning/spark:spark-1-6-2v1
 \
--conf spark.shuffle.service.enabled=true \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.mesos.coarse=true \
--conf spark.cores.max=100 \
--conf spark.executor.uri=$SPARK_EXECUTOR_URI \
--conf spark.shuffle.service.port=7338 \
--executor-memory 15g
{code}

Under Marathon I'm pinning each external shuffle service to an agent and 
starting the service like this.

{code}
$SPARK_HOME/sbin/start-mesos-shuffle-service.sh && tail -f 
$SPARK_HOME/logs/spark--org.apache.spark.deploy.mesos.MesosExternalShuffleService*
{code}

On startup it seems like all is well
{code}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.1
      /_/

Using Python version 3.5.2 (default, Jul  2 2016 17:52:12)
SparkContext available as sc, HiveContext available as sqlContext.
>>> 16/09/15 11:35:53 INFO CoarseMesosSchedulerBackend: Mesos task 1 is now 
>>> TASK_RUNNING
16/09/15 11:35:53 INFO MesosExternalShuffleClient: Successfully registered app 
a7a50f09-0ce0-4417-91a9-fa694416e903-0091 with external shuffle service.
16/09/15 11:35:55 INFO CoarseMesosSchedulerBackend: Mesos task 0 is now 
TASK_RUNNING
16/09/15 11:35:55 INFO MesosExternalShuffleClient: Successfully registered app 
a7a50f09-0ce0-4417-91a9-fa694416e903-0091 with external shuffle service.
16/09/15 11:35:56 INFO CoarseMesosSchedulerBackend: Registered executor 
NettyRpcEndpointRef(null) (mesos-agent002.[redacted]:61281) with ID 
55300bb1-aca1-4dd1-8647-ce4a50f6d661-S6/1
16/09/15 11:35:56 INFO ExecutorAllocationManager: New executor 
55300bb1-aca1-4dd1-8647-ce4a50f6d661-S6/1 has registered (new total is 1)
16/09/15 11:35:56 INFO BlockManagerMasterEndpoint: Registering block manager 
mesos-agent002.[redacted]:46247 with 10.6 GB RAM, 
BlockManagerId(55300bb1-aca1-4dd1-8647-ce4a50f6d661-S6/1, 
mesos-agent002.[redacted], 46247)
16/09/15 11:35:59 INFO CoarseMesosSchedulerBackend: Registered executor 
NettyRpcEndpointRef(null) (mesos-agent004.[redacted]:42738) with ID 
55300bb1-aca1-4dd1-8647-ce4a50f6d661-S10/0
16/09/15 11:35:59 INFO ExecutorAllocationManager: New executor 
55300bb1-aca1-4dd1-8647-ce4a50f6d661-S10/0 has registered (new total is 2)
16/09/15 11:35:59 INFO BlockManagerMasterEndpoint: Registering block manager 
mesos-agent004.[redacted]:11262 with 10.6 GB RAM, 
BlockManagerId(55300bb1-aca1-4dd1-8647-ce4a50f6d661-S10/0, 
mesos-agent004.[redacted], 11262)

{code}


I'm running a simple sort command on a spark data frame loaded from hdfs from a 
parquet file. These are the errors. They happen shortly after startup as agents 
are being allocated by mesos. 

{code}
16/09/15 14:49:10 ERROR ShuffleBlockFetcherIterator: Failed to get block(s) 
from mesos-agent004:7338
java.lang.RuntimeException: java.lang.RuntimeException: Failed to open file: 
/tmp/blockmgr-29f93bea-9a87-41de-9046-535401e6d4fd/30/shuffle_0_0_0.index
at 
org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.getSortBasedShuffleBlockData(ExternalShuffleBlockResolver.java:286)
at 

[jira] [Commented] (SPARK-17544) Timeout waiting for connection from pool, DataFrame Reader's not closing S3 connections?

2016-09-15 Thread Brady Auen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15493555#comment-15493555
 ] 

Brady Auen commented on SPARK-17544:


That looks like my issue too, thanks Josh! 

> Timeout waiting for connection from pool, DataFrame Reader's not closing S3 
> connections?
> 
>
> Key: SPARK-17544
> URL: https://issues.apache.org/jira/browse/SPARK-17544
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.0
> Environment: Amazon EMR, S3, Scala
>Reporter: Brady Auen
>  Labels: newbie
>
> I have an application that loops through a text file to find files in S3 and 
> then reads them in, performs some ETL processes, and then writes them out. 
> This works for around 80 loops until I get this:  
>   
> 16/09/14 18:58:23 INFO S3NativeFileSystem: Opening 
> 's3://webpt-emr-us-west-2-hadoop/edw/master/fdbk_ideavote/20160907/part-m-0.avro'
>  for reading
> 16/09/14 18:59:13 INFO AmazonHttpClient: Unable to execute HTTP request: 
> Timeout waiting for connection from pool
> com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.conn.ConnectionPoolTimeoutException:
>  Timeout waiting for connection from pool
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.conn.PoolingClientConnectionManager.leaseConnection(PoolingClientConnectionManager.java:226)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.conn.PoolingClientConnectionManager$1.getConnection(PoolingClientConnectionManager.java:195)
>   at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.conn.ClientConnectionRequestFactory$Handler.invoke(ClientConnectionRequestFactory.java:70)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.conn.$Proxy37.getConnection(Unknown
>  Source)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:423)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.executeOneRequest(AmazonHttpClient.java:837)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:607)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.doExecute(AmazonHttpClient.java:376)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.executeWithTimer(AmazonHttpClient.java:338)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:287)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3826)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:1015)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:991)
>   at 
> com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.retrieveMetadata(Jets3tNativeFileSystemStore.java:212)
>   at sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>   at com.sun.proxy.$Proxy36.retrieveMetadata(Unknown Source)
>   at 
> com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:780)
>   at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1428)
>   at 
> com.amazon.ws.emr.hadoop.fs.EmrFileSystem.exists(EmrFileSystem.java:313)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.hasMetadata(DataSource.scala:289)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:324)
>   at 
> 

[jira] [Commented] (SPARK-15917) Define the number of executors in standalone mode with an easy-to-use property

2016-09-15 Thread Jonathan Taws (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15493399#comment-15493399
 ] 

Jonathan Taws commented on SPARK-15917:
---

I didn't create a specific pull request for the moment as I was waiting for 
your insight to clarify the expected behavior, but I can create a PR now and 
update it with the latest modifications as I work on it. 

> Define the number of executors in standalone mode with an easy-to-use property
> --
>
> Key: SPARK-15917
> URL: https://issues.apache.org/jira/browse/SPARK-15917
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Spark Shell, Spark Submit
>Affects Versions: 1.6.1
>Reporter: Jonathan Taws
>Priority: Minor
>
> After stumbling across a few StackOverflow posts around the issue of using a 
> fixed number of executors in standalone mode (non-YARN), I was wondering if 
> we could not add an easier way to set this parameter than having to resort to 
> some calculations based on the number of cores and the memory you have 
> available on your worker. 
> For example, let's say I have 8 cores and 30GB of memory available :
>  - If no option is passed, one executor will be spawned with 8 cores and 1GB 
> of memory allocated.
>  - However, if I want to have only *2* executors, and to use 2 cores and 10GB 
> of memory per executor, I will end up with *3* executors (as the available 
> memory will limit the number of executors) instead of the 2 I was hoping for.
> Sure, I can set {{spark.cores.max}} as a workaround to get exactly what I 
> want, but would it not be easier to add a {{--num-executors}}-like option to 
> standalone mode to be able to really fine-tune the configuration ? This 
> option is already available in YARN mode.
> From my understanding, I don't see any other option lying around that can 
> help achieve this.  
> This seems to be slightly disturbing for newcomers, and standalone mode is 
> probably the first thing anyone will use to just try out Spark or test some 
> configuration.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15917) Define the number of executors in standalone mode with an easy-to-use property

2016-09-15 Thread Jonathan Taws (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15493395#comment-15493395
 ] 

Jonathan Taws commented on SPARK-15917:
---

Hi Andrew,

Your 2 suggestions make a lot of sense, I'll work on it straight away. Give me 
a few days to try them out. 

> Define the number of executors in standalone mode with an easy-to-use property
> --
>
> Key: SPARK-15917
> URL: https://issues.apache.org/jira/browse/SPARK-15917
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Spark Shell, Spark Submit
>Affects Versions: 1.6.1
>Reporter: Jonathan Taws
>Priority: Minor
>
> After stumbling across a few StackOverflow posts around the issue of using a 
> fixed number of executors in standalone mode (non-YARN), I was wondering if 
> we could not add an easier way to set this parameter than having to resort to 
> some calculations based on the number of cores and the memory you have 
> available on your worker. 
> For example, let's say I have 8 cores and 30GB of memory available :
>  - If no option is passed, one executor will be spawned with 8 cores and 1GB 
> of memory allocated.
>  - However, if I want to have only *2* executors, and to use 2 cores and 10GB 
> of memory per executor, I will end up with *3* executors (as the available 
> memory will limit the number of executors) instead of the 2 I was hoping for.
> Sure, I can set {{spark.cores.max}} as a workaround to get exactly what I 
> want, but would it not be easier to add a {{--num-executors}}-like option to 
> standalone mode to be able to really fine-tune the configuration ? This 
> option is already available in YARN mode.
> From my understanding, I don't see any other option lying around that can 
> help achieve this.  
> This seems to be slightly disturbing for newcomers, and standalone mode is 
> probably the first thing anyone will use to just try out Spark or test some 
> configuration.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17538) sqlContext.registerDataFrameAsTable is not working sometimes in pyspark 2.0.0

2016-09-15 Thread Srinivas Rishindra Pothireddi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15493075#comment-15493075
 ] 

Srinivas Rishindra Pothireddi commented on SPARK-17538:
---

Hi [~srowen],

I updated the description of the jira as per your suggestion. Can you please 
let me know if you need any more information.

> sqlContext.registerDataFrameAsTable is not working sometimes in pyspark 2.0.0
> -
>
> Key: SPARK-17538
> URL: https://issues.apache.org/jira/browse/SPARK-17538
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
> Environment: os - linux
> cluster -> yarn and local
>Reporter: Srinivas Rishindra Pothireddi
>
> I have a production job in spark 1.6.2 that registers several dataframes as 
> tables. 
> After testing the job in spark 2.0.0, I found that one of the dataframes is 
> not getting registered as a table.
> Line 353 of my code --> self.sqlContext.registerDataFrameAsTable(anonymousDF, 
> "anonymousTable")
> line 354 of my code --> df = self.sqlContext.sql("select AnonymousFiled1, 
> AnonymousUDF( AnonymousFiled1 ) as AnonymousFiled3 from anonymousTable")
> my stacktrace
>  File "anonymousFile.py", line 354, in anonymousMethod
> df = self.sqlContext.sql("select AnonymousFiled1, AnonymousUDF( 
> AnonymousFiled1 ) as AnonymousFiled3 from anonymousTable")
>   File 
> "/home/anonymousUser/Downloads/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/context.py",
>  line 350, in sql
> return self.sparkSession.sql(sqlQuery)
>   File 
> "/home/anonymousUser/Downloads/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/session.py",
>  line 541, in sql
> return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
>   File 
> "/home/anonymousUser/Downloads/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py",
>  line 933, in __call__
> answer, self.gateway_client, self.target_id, self.name)
>   File 
> "/home/anonymousUser/Downloads/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/utils.py",
>  line 69, in deco
> raise AnalysisException(s.split(': ', 1)[1], stackTrace)
> AnalysisException: u'Table or view not found: anonymousTable; line 1 pos 61'
> The same code is working perfectly fine in spark-1.6.2 
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17538) sqlContext.registerDataFrameAsTable is not working sometimes in pyspark 2.0.0

2016-09-15 Thread Srinivas Rishindra Pothireddi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Srinivas Rishindra Pothireddi updated SPARK-17538:
--
Description: 
I have a production job in spark 1.6.2 that registers several dataframes as 
tables. 
After testing the job in spark 2.0.0, I found that one of the dataframes is not 
getting registered as a table.


Line 353 of my code --> self.sqlContext.registerDataFrameAsTable(anonymousDF, 
"anonymousTable")
line 354 of my code --> df = self.sqlContext.sql("select AnonymousFiled1, 
AnonymousUDF( AnonymousFiled1 ) as AnonymousFiled3 from anonymousTable")

my stacktrace

 File "anonymousFile.py", line 354, in anonymousMethod
df = self.sqlContext.sql("select AnonymousFiled1, AnonymousUDF( 
AnonymousFiled1 ) as AnonymousFiled3 from anonymousTable")
  File 
"/home/anonymousUser/Downloads/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/context.py",
 line 350, in sql
return self.sparkSession.sql(sqlQuery)
  File 
"/home/anonymousUser/Downloads/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/session.py",
 line 541, in sql
return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
  File 
"/home/anonymousUser/Downloads/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py",
 line 933, in __call__
answer, self.gateway_client, self.target_id, self.name)
  File 
"/home/anonymousUser/Downloads/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/utils.py",
 line 69, in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
AnalysisException: u'Table or view not found: anonymousTable; line 1 pos 61'


The same code is working perfectly fine in spark-1.6.2 

 

  was:
I have a production job in spark 1.6.2 that registers four dataframes as 
tables. After testing the job in spark 2.0.0 one of the dataframes is not 
getting registered as a table.

output of sqlContext.tableNames() just after registering the fourth dataframe 
in spark 1.6.2 is

temp1,temp2,temp3,temp4

output of sqlContext.tableNames() just after registering the fourth dataframe 
in spark 2.0.0 is
temp1,temp2,temp3

so when the table 'temp4' is used by the job at a later stage an 
AnalysisException is raised in spark 2.0.0

There are no changes in the code whatsoever. 


 

 


> sqlContext.registerDataFrameAsTable is not working sometimes in pyspark 2.0.0
> -
>
> Key: SPARK-17538
> URL: https://issues.apache.org/jira/browse/SPARK-17538
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
> Environment: os - linux
> cluster -> yarn and local
>Reporter: Srinivas Rishindra Pothireddi
>
> I have a production job in spark 1.6.2 that registers several dataframes as 
> tables. 
> After testing the job in spark 2.0.0, I found that one of the dataframes is 
> not getting registered as a table.
> Line 353 of my code --> self.sqlContext.registerDataFrameAsTable(anonymousDF, 
> "anonymousTable")
> line 354 of my code --> df = self.sqlContext.sql("select AnonymousFiled1, 
> AnonymousUDF( AnonymousFiled1 ) as AnonymousFiled3 from anonymousTable")
> my stacktrace
>  File "anonymousFile.py", line 354, in anonymousMethod
> df = self.sqlContext.sql("select AnonymousFiled1, AnonymousUDF( 
> AnonymousFiled1 ) as AnonymousFiled3 from anonymousTable")
>   File 
> "/home/anonymousUser/Downloads/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/context.py",
>  line 350, in sql
> return self.sparkSession.sql(sqlQuery)
>   File 
> "/home/anonymousUser/Downloads/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/session.py",
>  line 541, in sql
> return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
>   File 
> "/home/anonymousUser/Downloads/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py",
>  line 933, in __call__
> answer, self.gateway_client, self.target_id, self.name)
>   File 
> "/home/anonymousUser/Downloads/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/utils.py",
>  line 69, in deco
> raise AnalysisException(s.split(': ', 1)[1], stackTrace)
> AnalysisException: u'Table or view not found: anonymousTable; line 1 pos 61'
> The same code is working perfectly fine in spark-1.6.2 
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16297) Mapping Boolean and string to BIT and NVARCHAR(MAX) for SQL Server jdbc dialect

2016-09-15 Thread Oussama Mekni (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15492994#comment-15492994
 ] 

Oussama Mekni commented on SPARK-16297:
---

Any updates on this?

> Mapping Boolean and string  to BIT and NVARCHAR(MAX) for SQL Server jdbc 
> dialect
> 
>
> Key: SPARK-16297
> URL: https://issues.apache.org/jira/browse/SPARK-16297
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Oussama Mekni
>Priority: Minor
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> Tested with SQLServer 2012 and SQLServer Express:
> - Fix mapping of StringType to NVARCHAR(MAX)
> - Fix mapping of BooleanType to BIT
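
For reference, a hedged sketch of how such a mapping can be expressed through the 
public JdbcDialect API; the object name and URL prefix below are illustrative, not 
the proposed patch:

{code}
import java.sql.Types
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
import org.apache.spark.sql.types.{BooleanType, DataType, StringType}

// Custom dialect: map StringType to NVARCHAR(MAX) and BooleanType to BIT
// for SQL Server JDBC URLs.
object MsSqlServerDialectSketch extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:sqlserver")

  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case StringType  => Some(JdbcType("NVARCHAR(MAX)", Types.NVARCHAR))
    case BooleanType => Some(JdbcType("BIT", Types.BIT))
    case _           => None
  }
}

// Register it before writing a DataFrame through the JDBC data source:
JdbcDialects.registerDialect(MsSqlServerDialectSketch)
{code}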



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17538) sqlContext.registerDataFrameAsTable is not working sometimes in pyspark 2.0.0

2016-09-15 Thread Srinivas Rishindra Pothireddi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15492937#comment-15492937
 ] 

Srinivas Rishindra Pothireddi commented on SPARK-17538:
---

Hi [~srowen], I will fix this as soon as possible as you suggested.


> sqlContext.registerDataFrameAsTable is not working sometimes in pyspark 2.0.0
> -
>
> Key: SPARK-17538
> URL: https://issues.apache.org/jira/browse/SPARK-17538
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
> Environment: os - linux
> cluster -> yarn and local
>Reporter: Srinivas Rishindra Pothireddi
>
> I have a production job in spark 1.6.2 that registers four dataframes as 
> tables. After testing the job in spark 2.0.0 one of the dataframes is not 
> getting registered as a table.
> output of sqlContext.tableNames() just after registering the fourth dataframe 
> in spark 1.6.2 is
> temp1,temp2,temp3,temp4
> output of sqlContext.tableNames() just after registering the fourth dataframe 
> in spark 2.0.0 is
> temp1,temp2,temp3
> so when the table 'temp4' is used by the job at a later stage an 
> AnalysisException is raised in spark 2.0.0
> There are no changes in the code whatsoever. 
>  
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17281) Add treeAggregateDepth parameter for AFTSurvivalRegression

2016-09-15 Thread Weichen Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15492931#comment-15492931
 ] 

Weichen Xu commented on SPARK-17281:


Because currently AFTSurvivalRegression uses `treeAggregate`, and the default 
`depth` is 2. In many cases AFTSurvivalRegression only trains on low-dimensional 
data, and in such cases setting `treeAggregateDepth` to 1 gives better 
performance. So I think adding this parameter is useful.
Thanks!
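
For context, a minimal sketch of what the depth parameter changes, using a plain 
RDD rather than the actual AFTSurvivalRegression internals (data and numbers are 
illustrative):

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("tree-agg-depth").master("local[2]").getOrCreate()
val data = spark.sparkContext.parallelize(1L to 1000000L, numSlices = 100)

// depth = 1: partition results are combined directly on the driver; cheaper
// when the aggregation buffer is small (low-dimensional data).
val sum1 = data.treeAggregate(0L)((acc, x) => acc + x, _ + _, depth = 1)

// depth = 2 (the current hard-coded default): an extra intermediate reduction
// round; pays off with many partitions or large per-partition results, but
// adds overhead otherwise.
val sum2 = data.treeAggregate(0L)((acc, x) => acc + x, _ + _, depth = 2)

assert(sum1 == sum2)
spark.stop()
{code}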

> Add treeAggregateDepth parameter for AFTSurvivalRegression
> --
>
> Key: SPARK-17281
> URL: https://issues.apache.org/jira/browse/SPARK-17281
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Reporter: Weichen Xu
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Add treeAggregateDepth parameter for AFTSurvivalRegression.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17406) Event Timeline will be very slow when there are too many executor events

2016-09-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15492919#comment-15492919
 ] 

Apache Spark commented on SPARK-17406:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/15110

> Event Timeline will be very slow when there are too many executor events
> 
>
> Key: SPARK-17406
> URL: https://issues.apache.org/jira/browse/SPARK-17406
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.6.2, 2.0.0
>Reporter: cen yuhai
>Assignee: cen yuhai
>Priority: Minor
> Fix For: 2.1.0
>
> Attachments: timeline1.png, timeline2.png
>
>
> The job page will be too slow to open when there are thousands of executor 
> events (added or removed). I found that in the ExecutorsTab file, 
> executorIdToData never removes elements, so it grows over time. Before this 
> PR it looks like timeline1.png; after this PR it looks like timeline2.png 
> (we can set how many events will be displayed).
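
The general technique behind the change reads roughly like the sketch below; it is 
illustrative only, not the actual Web UI code, and the cap name is made up:

{code}
import scala.collection.mutable

// Keep only the most recent executor events so the timeline stays bounded.
val maxRetainedExecutorEvents = 250  // illustrative cap, not a real Spark setting
val executorEvents = mutable.LinkedHashMap[String, Long]()  // executorId -> event time

def recordExecutorEvent(executorId: String, time: Long): Unit = {
  executorEvents(executorId) = time
  if (executorEvents.size > maxRetainedExecutorEvents) {
    executorEvents -= executorEvents.head._1  // evict the oldest entry
  }
}
{code}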



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17501) Re-register BlockManager again and again

2016-09-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17501:


Assignee: (was: Apache Spark)

> Re-register BlockManager again and again
> 
>
> Key: SPARK-17501
> URL: https://issues.apache.org/jira/browse/SPARK-17501
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.2
>Reporter: cen yuhai
>Priority: Minor
>
> After many times re-register, executor will exit because of timeout 
> exception
> {code}
> 16/09/11 04:02:42 INFO executor.Executor: Told to re-register on heartbeat
> 16/09/11 04:02:42 INFO storage.BlockManager: BlockManager re-registering with 
> master
> 16/09/11 04:02:42 INFO storage.BlockManagerMaster: Trying to register 
> BlockManager
> 16/09/11 04:02:42 INFO storage.BlockManagerMaster: Registered BlockManager
> 16/09/11 04:02:42 INFO storage.BlockManager: Reporting 0 blocks to the master.
> 16/09/11 04:02:52 INFO executor.Executor: Told to re-register on heartbeat
> 16/09/11 04:02:52 INFO storage.BlockManager: BlockManager re-registering with 
> master
> 16/09/11 04:02:52 INFO storage.BlockManagerMaster: Trying to register 
> BlockManager
> 16/09/11 04:02:52 INFO storage.BlockManagerMaster: Registered BlockManager
> 16/09/11 04:02:52 INFO storage.BlockManager: Reporting 0 blocks to the master.
> 16/09/11 04:03:02 INFO executor.Executor: Told to re-register on heartbeat
> 16/09/11 04:03:02 INFO storage.BlockManager: BlockManager re-registering with 
> master
> 16/09/11 04:03:02 INFO storage.BlockManagerMaster: Trying to register 
> BlockManager
> 16/09/11 04:03:02 INFO storage.BlockManagerMaster: Registered BlockManager
> 16/09/11 04:03:02 INFO storage.BlockManager: Reporting 0 blocks to the master.
> 16/09/11 04:03:12 INFO executor.Executor: Told to re-register on heartbeat
> 16/09/11 04:03:12 INFO storage.BlockManager: BlockManager re-registering with 
> master
> 16/09/11 04:03:12 INFO storage.BlockManagerMaster: Trying to register 
> BlockManager
> 16/09/11 04:03:12 INFO storage.BlockManagerMaster: Registered BlockManager
> 16/09/11 04:03:12 INFO storage.BlockManager: Reporting 0 blocks to the master.
> 16/09/11 04:03:22 INFO executor.Executor: Told to re-register on heartbeat
> 16/09/11 04:03:22 INFO storage.BlockManager: BlockManager re-registering with 
> master
> 16/09/11 04:03:22 INFO storage.BlockManagerMaster: Trying to register 
> BlockManager
> 16/09/11 04:03:22 INFO storage.BlockManagerMaster: Registered BlockManager
> 16/09/11 04:03:22 INFO storage.BlockManager: Reporting 0 blocks to the master.
> 16/09/11 04:03:32 INFO executor.Executor: Told to re-register on heartbeat
> 16/09/11 04:03:32 INFO storage.BlockManager: BlockManager re-registering with 
> master
> 16/09/11 04:03:32 INFO storage.BlockManagerMaster: Trying to register 
> BlockManager
> 16/09/11 04:03:32 INFO storage.BlockManagerMaster: Registered BlockManager
> 16/09/11 04:03:32 INFO storage.BlockManager: Reporting 0 blocks to the master.
> 16/09/11 04:03:42 INFO executor.Executor: Told to re-register on heartbeat
> 16/09/11 04:03:42 INFO storage.BlockManager: BlockManager re-registering with 
> master
> 16/09/11 04:03:42 INFO storage.BlockManagerMaster: Trying to register 
> BlockManager
> 16/09/11 04:03:42 INFO storage.BlockManagerMaster: Registered BlockManager
> 16/09/11 04:03:42 INFO storage.BlockManager: Reporting 0 blocks to the master.
> 16/09/11 04:03:45 ERROR executor.CoarseGrainedExecutorBackend: Cannot 
> register with driver: 
> spark://coarsegrainedschedu...@bigdata-arch-jms05.xg01.diditaxi.com:22168
> org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply in 120 
> seconds. This timeout is controlled by spark.rpc.askTimeout
> at 
> org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
> at 
> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
> at scala.util.Failure$$anonfun$recover$1.apply(Try.scala:185)
> at scala.util.Try$.apply(Try.scala:161)
> at scala.util.Failure.recover(Try.scala:185)
> at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:324)
> at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:324)
> at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
> at 
> org.spark-project.guava.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293)
> at 
> 

[jira] [Assigned] (SPARK-17501) Re-register BlockManager again and again

2016-09-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17501:


Assignee: Apache Spark

> Re-register BlockManager again and again
> 
>
> Key: SPARK-17501
> URL: https://issues.apache.org/jira/browse/SPARK-17501
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.2
>Reporter: cen yuhai
>Assignee: Apache Spark
>Priority: Minor
>
> After many times re-register, executor will exit because of timeout 
> exception
> {code}
> 16/09/11 04:02:42 INFO executor.Executor: Told to re-register on heartbeat
> 16/09/11 04:02:42 INFO storage.BlockManager: BlockManager re-registering with 
> master
> 16/09/11 04:02:42 INFO storage.BlockManagerMaster: Trying to register 
> BlockManager
> 16/09/11 04:02:42 INFO storage.BlockManagerMaster: Registered BlockManager
> 16/09/11 04:02:42 INFO storage.BlockManager: Reporting 0 blocks to the master.
> 16/09/11 04:02:52 INFO executor.Executor: Told to re-register on heartbeat
> 16/09/11 04:02:52 INFO storage.BlockManager: BlockManager re-registering with 
> master
> 16/09/11 04:02:52 INFO storage.BlockManagerMaster: Trying to register 
> BlockManager
> 16/09/11 04:02:52 INFO storage.BlockManagerMaster: Registered BlockManager
> 16/09/11 04:02:52 INFO storage.BlockManager: Reporting 0 blocks to the master.
> 16/09/11 04:03:02 INFO executor.Executor: Told to re-register on heartbeat
> 16/09/11 04:03:02 INFO storage.BlockManager: BlockManager re-registering with 
> master
> 16/09/11 04:03:02 INFO storage.BlockManagerMaster: Trying to register 
> BlockManager
> 16/09/11 04:03:02 INFO storage.BlockManagerMaster: Registered BlockManager
> 16/09/11 04:03:02 INFO storage.BlockManager: Reporting 0 blocks to the master.
> 16/09/11 04:03:12 INFO executor.Executor: Told to re-register on heartbeat
> 16/09/11 04:03:12 INFO storage.BlockManager: BlockManager re-registering with 
> master
> 16/09/11 04:03:12 INFO storage.BlockManagerMaster: Trying to register 
> BlockManager
> 16/09/11 04:03:12 INFO storage.BlockManagerMaster: Registered BlockManager
> 16/09/11 04:03:12 INFO storage.BlockManager: Reporting 0 blocks to the master.
> 16/09/11 04:03:22 INFO executor.Executor: Told to re-register on heartbeat
> 16/09/11 04:03:22 INFO storage.BlockManager: BlockManager re-registering with 
> master
> 16/09/11 04:03:22 INFO storage.BlockManagerMaster: Trying to register 
> BlockManager
> 16/09/11 04:03:22 INFO storage.BlockManagerMaster: Registered BlockManager
> 16/09/11 04:03:22 INFO storage.BlockManager: Reporting 0 blocks to the master.
> 16/09/11 04:03:32 INFO executor.Executor: Told to re-register on heartbeat
> 16/09/11 04:03:32 INFO storage.BlockManager: BlockManager re-registering with 
> master
> 16/09/11 04:03:32 INFO storage.BlockManagerMaster: Trying to register 
> BlockManager
> 16/09/11 04:03:32 INFO storage.BlockManagerMaster: Registered BlockManager
> 16/09/11 04:03:32 INFO storage.BlockManager: Reporting 0 blocks to the master.
> 16/09/11 04:03:42 INFO executor.Executor: Told to re-register on heartbeat
> 16/09/11 04:03:42 INFO storage.BlockManager: BlockManager re-registering with 
> master
> 16/09/11 04:03:42 INFO storage.BlockManagerMaster: Trying to register 
> BlockManager
> 16/09/11 04:03:42 INFO storage.BlockManagerMaster: Registered BlockManager
> 16/09/11 04:03:42 INFO storage.BlockManager: Reporting 0 blocks to the master.
> 16/09/11 04:03:45 ERROR executor.CoarseGrainedExecutorBackend: Cannot 
> register with driver: 
> spark://coarsegrainedschedu...@bigdata-arch-jms05.xg01.diditaxi.com:22168
> org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply in 120 
> seconds. This timeout is controlled by spark.rpc.askTimeout
> at 
> org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
> at 
> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
> at scala.util.Failure$$anonfun$recover$1.apply(Try.scala:185)
> at scala.util.Try$.apply(Try.scala:161)
> at scala.util.Failure.recover(Try.scala:185)
> at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:324)
> at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:324)
> at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
> at 
> org.spark-project.guava.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293)
> at 
> 

[jira] [Commented] (SPARK-17501) Re-register BlockManager again and again

2016-09-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15492860#comment-15492860
 ] 

Apache Spark commented on SPARK-17501:
--

User 'cenyuhai' has created a pull request for this issue:
https://github.com/apache/spark/pull/15109

> Re-register BlockManager again and again
> 
>
> Key: SPARK-17501
> URL: https://issues.apache.org/jira/browse/SPARK-17501
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.2
>Reporter: cen yuhai
>Priority: Minor
>
> After many times re-register, executor will exit because of timeout 
> exception
> {code}
> 16/09/11 04:02:42 INFO executor.Executor: Told to re-register on heartbeat
> 16/09/11 04:02:42 INFO storage.BlockManager: BlockManager re-registering with 
> master
> 16/09/11 04:02:42 INFO storage.BlockManagerMaster: Trying to register 
> BlockManager
> 16/09/11 04:02:42 INFO storage.BlockManagerMaster: Registered BlockManager
> 16/09/11 04:02:42 INFO storage.BlockManager: Reporting 0 blocks to the master.
> 16/09/11 04:02:52 INFO executor.Executor: Told to re-register on heartbeat
> 16/09/11 04:02:52 INFO storage.BlockManager: BlockManager re-registering with 
> master
> 16/09/11 04:02:52 INFO storage.BlockManagerMaster: Trying to register 
> BlockManager
> 16/09/11 04:02:52 INFO storage.BlockManagerMaster: Registered BlockManager
> 16/09/11 04:02:52 INFO storage.BlockManager: Reporting 0 blocks to the master.
> 16/09/11 04:03:02 INFO executor.Executor: Told to re-register on heartbeat
> 16/09/11 04:03:02 INFO storage.BlockManager: BlockManager re-registering with 
> master
> 16/09/11 04:03:02 INFO storage.BlockManagerMaster: Trying to register 
> BlockManager
> 16/09/11 04:03:02 INFO storage.BlockManagerMaster: Registered BlockManager
> 16/09/11 04:03:02 INFO storage.BlockManager: Reporting 0 blocks to the master.
> 16/09/11 04:03:12 INFO executor.Executor: Told to re-register on heartbeat
> 16/09/11 04:03:12 INFO storage.BlockManager: BlockManager re-registering with 
> master
> 16/09/11 04:03:12 INFO storage.BlockManagerMaster: Trying to register 
> BlockManager
> 16/09/11 04:03:12 INFO storage.BlockManagerMaster: Registered BlockManager
> 16/09/11 04:03:12 INFO storage.BlockManager: Reporting 0 blocks to the master.
> 16/09/11 04:03:22 INFO executor.Executor: Told to re-register on heartbeat
> 16/09/11 04:03:22 INFO storage.BlockManager: BlockManager re-registering with 
> master
> 16/09/11 04:03:22 INFO storage.BlockManagerMaster: Trying to register 
> BlockManager
> 16/09/11 04:03:22 INFO storage.BlockManagerMaster: Registered BlockManager
> 16/09/11 04:03:22 INFO storage.BlockManager: Reporting 0 blocks to the master.
> 16/09/11 04:03:32 INFO executor.Executor: Told to re-register on heartbeat
> 16/09/11 04:03:32 INFO storage.BlockManager: BlockManager re-registering with 
> master
> 16/09/11 04:03:32 INFO storage.BlockManagerMaster: Trying to register 
> BlockManager
> 16/09/11 04:03:32 INFO storage.BlockManagerMaster: Registered BlockManager
> 16/09/11 04:03:32 INFO storage.BlockManager: Reporting 0 blocks to the master.
> 16/09/11 04:03:42 INFO executor.Executor: Told to re-register on heartbeat
> 16/09/11 04:03:42 INFO storage.BlockManager: BlockManager re-registering with 
> master
> 16/09/11 04:03:42 INFO storage.BlockManagerMaster: Trying to register 
> BlockManager
> 16/09/11 04:03:42 INFO storage.BlockManagerMaster: Registered BlockManager
> 16/09/11 04:03:42 INFO storage.BlockManager: Reporting 0 blocks to the master.
> 16/09/11 04:03:45 ERROR executor.CoarseGrainedExecutorBackend: Cannot 
> register with driver: 
> spark://coarsegrainedschedu...@bigdata-arch-jms05.xg01.diditaxi.com:22168
> org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply in 120 
> seconds. This timeout is controlled by spark.rpc.askTimeout
> at 
> org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
> at 
> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
> at scala.util.Failure$$anonfun$recover$1.apply(Try.scala:185)
> at scala.util.Try$.apply(Try.scala:161)
> at scala.util.Failure.recover(Try.scala:185)
> at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:324)
> at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:324)
> at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
> at 
> 

[jira] [Updated] (SPARK-17281) Add treeAggregateDepth parameter for AFTSurvivalRegression

2016-09-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-17281:
--
Priority: Minor  (was: Major)

What's the use case, just consistency? Why not on more jobs than just these? 

> Add treeAggregateDepth parameter for AFTSurvivalRegression
> --
>
> Key: SPARK-17281
> URL: https://issues.apache.org/jira/browse/SPARK-17281
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Reporter: Weichen Xu
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Add treeAggregateDepth parameter for AFTSurvivalRegression.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17501) Re-register BlockManager again and again

2016-09-15 Thread cen yuhai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15484185#comment-15484185
 ] 

cen yuhai edited comment on SPARK-17501 at 9/15/16 9:19 AM:


I can hardly reproduce this error, but I think I found the root cause. In 
HeartbeatReceiver, the executor is recorded in executorLastSeen, while the 
BlockManager is recorded in blockManagerInfo in BlockManagerMasterEndpoint. It 
should not re-register the BlockManager; I think just putting the executor back 
into executorLastSeen would resolve this problem.


was (Author: cenyuhai):
I can't hardly reproduce this error. But maybe I found the root cause. In 
HeatbeatReceiver, executor is record by executorLastSeen. But Blockmanager is 
record by blockManagerInfo in BlockManagerMasterEndpoint.It should not register 
BlockManager,Executor need to send RegisterExecutor.

> Re-register BlockManager again and again
> 
>
> Key: SPARK-17501
> URL: https://issues.apache.org/jira/browse/SPARK-17501
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.2
>Reporter: cen yuhai
>Priority: Minor
>
> After many times re-register, executor will exit because of timeout 
> exception
> {code}
> 16/09/11 04:02:42 INFO executor.Executor: Told to re-register on heartbeat
> 16/09/11 04:02:42 INFO storage.BlockManager: BlockManager re-registering with 
> master
> 16/09/11 04:02:42 INFO storage.BlockManagerMaster: Trying to register 
> BlockManager
> 16/09/11 04:02:42 INFO storage.BlockManagerMaster: Registered BlockManager
> 16/09/11 04:02:42 INFO storage.BlockManager: Reporting 0 blocks to the master.
> 16/09/11 04:02:52 INFO executor.Executor: Told to re-register on heartbeat
> 16/09/11 04:02:52 INFO storage.BlockManager: BlockManager re-registering with 
> master
> 16/09/11 04:02:52 INFO storage.BlockManagerMaster: Trying to register 
> BlockManager
> 16/09/11 04:02:52 INFO storage.BlockManagerMaster: Registered BlockManager
> 16/09/11 04:02:52 INFO storage.BlockManager: Reporting 0 blocks to the master.
> 16/09/11 04:03:02 INFO executor.Executor: Told to re-register on heartbeat
> 16/09/11 04:03:02 INFO storage.BlockManager: BlockManager re-registering with 
> master
> 16/09/11 04:03:02 INFO storage.BlockManagerMaster: Trying to register 
> BlockManager
> 16/09/11 04:03:02 INFO storage.BlockManagerMaster: Registered BlockManager
> 16/09/11 04:03:02 INFO storage.BlockManager: Reporting 0 blocks to the master.
> 16/09/11 04:03:12 INFO executor.Executor: Told to re-register on heartbeat
> 16/09/11 04:03:12 INFO storage.BlockManager: BlockManager re-registering with 
> master
> 16/09/11 04:03:12 INFO storage.BlockManagerMaster: Trying to register 
> BlockManager
> 16/09/11 04:03:12 INFO storage.BlockManagerMaster: Registered BlockManager
> 16/09/11 04:03:12 INFO storage.BlockManager: Reporting 0 blocks to the master.
> 16/09/11 04:03:22 INFO executor.Executor: Told to re-register on heartbeat
> 16/09/11 04:03:22 INFO storage.BlockManager: BlockManager re-registering with 
> master
> 16/09/11 04:03:22 INFO storage.BlockManagerMaster: Trying to register 
> BlockManager
> 16/09/11 04:03:22 INFO storage.BlockManagerMaster: Registered BlockManager
> 16/09/11 04:03:22 INFO storage.BlockManager: Reporting 0 blocks to the master.
> 16/09/11 04:03:32 INFO executor.Executor: Told to re-register on heartbeat
> 16/09/11 04:03:32 INFO storage.BlockManager: BlockManager re-registering with 
> master
> 16/09/11 04:03:32 INFO storage.BlockManagerMaster: Trying to register 
> BlockManager
> 16/09/11 04:03:32 INFO storage.BlockManagerMaster: Registered BlockManager
> 16/09/11 04:03:32 INFO storage.BlockManager: Reporting 0 blocks to the master.
> 16/09/11 04:03:42 INFO executor.Executor: Told to re-register on heartbeat
> 16/09/11 04:03:42 INFO storage.BlockManager: BlockManager re-registering with 
> master
> 16/09/11 04:03:42 INFO storage.BlockManagerMaster: Trying to register 
> BlockManager
> 16/09/11 04:03:42 INFO storage.BlockManagerMaster: Registered BlockManager
> 16/09/11 04:03:42 INFO storage.BlockManager: Reporting 0 blocks to the master.
> 16/09/11 04:03:45 ERROR executor.CoarseGrainedExecutorBackend: Cannot 
> register with driver: 
> spark://coarsegrainedschedu...@bigdata-arch-jms05.xg01.diditaxi.com:22168
> org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply in 120 
> seconds. This timeout is controlled by spark.rpc.askTimeout
> at 
> org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
> at 
> 

[jira] [Updated] (SPARK-17523) Cannot get Spark build info from spark-core package which built in Windows

2016-09-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-17523:
--
Fix Version/s: (was: 2.0.1)
  Component/s: (was: Spark Core)

> Cannot get Spark build info from spark-core package which built in Windows
> --
>
> Key: SPARK-17523
> URL: https://issues.apache.org/jira/browse/SPARK-17523
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.0.0
>Reporter: Yun Tang
>  Labels: windows
>
> Currently, building Spark generates a 'spark-version-info.properties' file 
> and merges it into spark-core_2.11-*.jar. However, the script 
> 'build/spark-build-info', which generates this file, can only be executed in 
> a bash environment.
> Without this file, errors like the one below happen when submitting a Spark 
> application, which breaks the whole submission phase at the beginning.
> {code:java}
> ERROR ApplicationMaster: Uncaught exception: 
> org.apache.spark.SparkException: Exception thrown in awaitResult: 
>   at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:194)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:394)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:247)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:759)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:67)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:66)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:66)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:757)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
> Caused by: java.util.concurrent.ExecutionException: Boxed Error
>   at scala.concurrent.impl.Promise$.resolver(Promise.scala:55)
>   at 
> scala.concurrent.impl.Promise$.scala$concurrent$impl$Promise$$resolveTry(Promise.scala:47)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:244)
>   at scala.concurrent.Promise$class.tryFailure(Promise.scala:112)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.tryFailure(Promise.scala:153)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:648)
> Caused by: java.lang.ExceptionInInitializerError
>   at org.apache.spark.package$.(package.scala:91)
>   at org.apache.spark.package$.(package.scala)
>   at 
> org.apache.spark.SparkContext$$anonfun$3.apply(SparkContext.scala:187)
>   at 
> org.apache.spark.SparkContext$$anonfun$3.apply(SparkContext.scala:187)
>   at org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54)
>   at org.apache.spark.SparkContext.logInfo(SparkContext.scala:76)
>   at org.apache.spark.SparkContext.(SparkContext.scala:187)
>   at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2287)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:822)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:814)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:814)
>   at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:31)
>   at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:630)
> Caused by: org.apache.spark.SparkException: Error while locating file 
> spark-version-info.properties
>   at 
> org.apache.spark.package$SparkBuildInfo$.liftedTree1$1(package.scala:75)
>   at org.apache.spark.package$SparkBuildInfo$.(package.scala:61)
>   at org.apache.spark.package$SparkBuildInfo$.(package.scala)
>   ... 19 more
> Caused by: java.lang.NullPointerException
>   at java.util.Properties$LineReader.readLine(Properties.java:434)
>   at java.util.Properties.load0(Properties.java:353)
>   at 
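
For reference, a minimal sketch of how that resource is looked up at runtime; if a 
Windows build never produced the file, a lookup along these lines fails much like 
the trace above (the "version" property key is an assumption):

{code}
import java.util.Properties

// spark-core expects spark-version-info.properties on the classpath.
val in = Thread.currentThread().getContextClassLoader
  .getResourceAsStream("spark-version-info.properties")
if (in == null) {
  sys.error("Error while locating file spark-version-info.properties")
} else {
  val props = new Properties()
  props.load(in)
  println(props.getProperty("version"))  // "version" key assumed for illustration
  in.close()
}
{code}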

[jira] [Updated] (SPARK-17501) Re-register BlockManager again and again

2016-09-15 Thread cen yuhai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

cen yuhai updated SPARK-17501:
--
Priority: Minor  (was: Major)

> Re-register BlockManager again and again
> 
>
> Key: SPARK-17501
> URL: https://issues.apache.org/jira/browse/SPARK-17501
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.2
>Reporter: cen yuhai
>Priority: Minor
>
> After many times re-register, executor will exit because of timeout 
> exception
> {code}
> 16/09/11 04:02:42 INFO executor.Executor: Told to re-register on heartbeat
> 16/09/11 04:02:42 INFO storage.BlockManager: BlockManager re-registering with 
> master
> 16/09/11 04:02:42 INFO storage.BlockManagerMaster: Trying to register 
> BlockManager
> 16/09/11 04:02:42 INFO storage.BlockManagerMaster: Registered BlockManager
> 16/09/11 04:02:42 INFO storage.BlockManager: Reporting 0 blocks to the master.
> 16/09/11 04:02:52 INFO executor.Executor: Told to re-register on heartbeat
> 16/09/11 04:02:52 INFO storage.BlockManager: BlockManager re-registering with 
> master
> 16/09/11 04:02:52 INFO storage.BlockManagerMaster: Trying to register 
> BlockManager
> 16/09/11 04:02:52 INFO storage.BlockManagerMaster: Registered BlockManager
> 16/09/11 04:02:52 INFO storage.BlockManager: Reporting 0 blocks to the master.
> 16/09/11 04:03:02 INFO executor.Executor: Told to re-register on heartbeat
> 16/09/11 04:03:02 INFO storage.BlockManager: BlockManager re-registering with 
> master
> 16/09/11 04:03:02 INFO storage.BlockManagerMaster: Trying to register 
> BlockManager
> 16/09/11 04:03:02 INFO storage.BlockManagerMaster: Registered BlockManager
> 16/09/11 04:03:02 INFO storage.BlockManager: Reporting 0 blocks to the master.
> 16/09/11 04:03:12 INFO executor.Executor: Told to re-register on heartbeat
> 16/09/11 04:03:12 INFO storage.BlockManager: BlockManager re-registering with 
> master
> 16/09/11 04:03:12 INFO storage.BlockManagerMaster: Trying to register 
> BlockManager
> 16/09/11 04:03:12 INFO storage.BlockManagerMaster: Registered BlockManager
> 16/09/11 04:03:12 INFO storage.BlockManager: Reporting 0 blocks to the master.
> 16/09/11 04:03:22 INFO executor.Executor: Told to re-register on heartbeat
> 16/09/11 04:03:22 INFO storage.BlockManager: BlockManager re-registering with 
> master
> 16/09/11 04:03:22 INFO storage.BlockManagerMaster: Trying to register 
> BlockManager
> 16/09/11 04:03:22 INFO storage.BlockManagerMaster: Registered BlockManager
> 16/09/11 04:03:22 INFO storage.BlockManager: Reporting 0 blocks to the master.
> 16/09/11 04:03:32 INFO executor.Executor: Told to re-register on heartbeat
> 16/09/11 04:03:32 INFO storage.BlockManager: BlockManager re-registering with 
> master
> 16/09/11 04:03:32 INFO storage.BlockManagerMaster: Trying to register 
> BlockManager
> 16/09/11 04:03:32 INFO storage.BlockManagerMaster: Registered BlockManager
> 16/09/11 04:03:32 INFO storage.BlockManager: Reporting 0 blocks to the master.
> 16/09/11 04:03:42 INFO executor.Executor: Told to re-register on heartbeat
> 16/09/11 04:03:42 INFO storage.BlockManager: BlockManager re-registering with 
> master
> 16/09/11 04:03:42 INFO storage.BlockManagerMaster: Trying to register 
> BlockManager
> 16/09/11 04:03:42 INFO storage.BlockManagerMaster: Registered BlockManager
> 16/09/11 04:03:42 INFO storage.BlockManager: Reporting 0 blocks to the master.
> 16/09/11 04:03:45 ERROR executor.CoarseGrainedExecutorBackend: Cannot 
> register with driver: 
> spark://coarsegrainedschedu...@bigdata-arch-jms05.xg01.diditaxi.com:22168
> org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply in 120 
> seconds. This timeout is controlled by spark.rpc.askTimeout
> at 
> org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
> at 
> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
> at scala.util.Failure$$anonfun$recover$1.apply(Try.scala:185)
> at scala.util.Try$.apply(Try.scala:161)
> at scala.util.Failure.recover(Try.scala:185)
> at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:324)
> at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:324)
> at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
> at 
> org.spark-project.guava.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293)
> at 
> scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:133)
> 

[jira] [Updated] (SPARK-17538) sqlContext.registerDataFrameAsTable is not working sometimes in pyspark 2.0.0

2016-09-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-17538:
--
 Shepherd:   (was: Matei Zaharia)
Flags:   (was: Important)
Affects Version/s: (was: 2.0.1)
   (was: 2.1.0)
 Target Version/s:   (was: 2.0.1, 2.1.0)
   Labels:   (was: pyspark)
 Priority: Major  (was: Critical)
Fix Version/s: (was: 2.0.1)
   (was: 2.1.0)

[~sririshindra] please read 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark  
There's a lot wrong with how you filled this out.

> sqlContext.registerDataFrameAsTable is not working sometimes in pyspark 2.0.0
> -
>
> Key: SPARK-17538
> URL: https://issues.apache.org/jira/browse/SPARK-17538
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
> Environment: os - linux
> cluster -> yarn and local
>Reporter: Srinivas Rishindra Pothireddi
>
> I have a production job in spark 1.6.2 that registers four dataframes as 
> tables. After testing the job in spark 2.0.0 one of the dataframes is not 
> getting registered as a table.
> output of sqlContext.tableNames() just after registering the fourth dataframe 
> in spark 1.6.2 is
> temp1,temp2,temp3,temp4
> output of sqlContext.tableNames() just after registering the fourth dataframe 
> in spark 2.0.0 is
> temp1,temp2,temp3
> so when the table 'temp4' is used by the job at a later stage an 
> AnalysisException is raised in spark 2.0.0
> There are no changes in the code whatsoever. 
>  
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17554) spark.executor.memory option not working

2016-09-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-17554.
---
Resolution: Invalid

Questions should go to user@. Without seeing how you're running the job or what 
you are looking at specifically in the UI it's hard to say. The parameter does 
work correctly in all of my usages.
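
For reference, a sketch of the usual ways this property is applied in standalone 
mode (the standard file name is conf/spark-defaults.conf; paths and values below 
are illustrative):

{code}
// 1) conf/spark-defaults.conf (whitespace-separated, no '='):
//      spark.executor.memory  2g
// 2) On submit:
//      spark-submit --conf spark.executor.memory=2g --class Main app.jar
// 3) Programmatically, before the SparkContext is created:
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("memory-demo")
  .set("spark.executor.memory", "2g")  // must be set before the context starts
val sc = new SparkContext(conf)
{code}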

> spark.executor.memory option not working
> 
>
> Key: SPARK-17554
> URL: https://issues.apache.org/jira/browse/SPARK-17554
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Sankar Mittapally
>
> Hi,
>  I am new to Spark. I have a Spark cluster with 5 slaves (each one has 2 
> cores and 4 GB RAM). In the Spark cluster dashboard I see that memory per 
> node is 1 GB. I tried to increase it to 2 GB by setting spark.executor.memory 
> 2g in defaults.conf, but it didn't work. I want to increase the memory. 
> Please let me know how to do that.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17505) Add setBins for BinaryClassificationMetrics in mlllb/evaluation

2016-09-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-17505.
---
Resolution: Won't Fix

> Add setBins for BinaryClassificationMetrics in mlllb/evaluation
> ---
>
> Key: SPARK-17505
> URL: https://issues.apache.org/jira/browse/SPARK-17505
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Peng Meng
>Priority: Minor
>   Original Estimate: 6h
>  Remaining Estimate: 6h
>
> Add a setBins method for BinaryClassificationMetrics.
> BinaryClassificationMetrics is a class in mllib/evaluation, and numBins is a 
> key attribute of it. If numBins is greater than 0, the curves (ROC curve, PR 
> curve) computed internally are down-sampled to this many "bins". 
> It would be useful to let the user set numBins.
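
For illustration, numBins can currently only be supplied through the constructor; 
the scores below are made up:

{code}
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("bins-demo").setMaster("local[2]"))
// (score, label) pairs; labels are 0.0 or 1.0
val scoreAndLabels = sc.parallelize(Seq((0.9, 1.0), (0.7, 1.0), (0.4, 0.0), (0.1, 0.0)))

// Curves are down-sampled to roughly numBins points; 0 means no down-sampling.
val metrics = new BinaryClassificationMetrics(scoreAndLabels, numBins = 100)
println(metrics.areaUnderROC())
sc.stop()
{code}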



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17554) spark.executor.memory option not working

2016-09-15 Thread Sankar Mittapally (JIRA)
Sankar Mittapally created SPARK-17554:
-

 Summary: spark.executor.memory option not working
 Key: SPARK-17554
 URL: https://issues.apache.org/jira/browse/SPARK-17554
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Sankar Mittapally


Hi,

 I am new to Spark. I have a Spark cluster with 5 slaves (each one has 2 cores 
and 4 GB RAM). In the Spark cluster dashboard I see that memory per node is 
1 GB. I tried to increase it to 2 GB by setting spark.executor.memory 2g in 
defaults.conf, but it didn't work. I want to increase the memory. Please let me 
know how to do that.





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17538) sqlContext.registerDataFrameAsTable is not working sometimes in pyspark 2.0.0

2016-09-15 Thread Srinivas Rishindra Pothireddi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Srinivas Rishindra Pothireddi updated SPARK-17538:
--
Labels: pyspark  (was: )

> sqlContext.registerDataFrameAsTable is not working sometimes in pyspark 2.0.0
> -
>
> Key: SPARK-17538
> URL: https://issues.apache.org/jira/browse/SPARK-17538
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0, 2.0.1, 2.1.0
> Environment: os - linux
> cluster -> yarn and local
>Reporter: Srinivas Rishindra Pothireddi
>Priority: Critical
>  Labels: pyspark
> Fix For: 2.0.1, 2.1.0
>
>
> I have a production job in spark 1.6.2 that registers four dataframes as 
> tables. After testing the job in spark 2.0.0 one of the dataframes is not 
> getting registered as a table.
> output of sqlContext.tableNames() just after registering the fourth dataframe 
> in spark 1.6.2 is
> temp1,temp2,temp3,temp4
> output of sqlContext.tableNames() just after registering the fourth dataframe 
> in spark 2.0.0 is
> temp1,temp2,temp3
> so when the table 'temp4' is used by the job at a later stage an 
> AnalysisException is raised in spark 2.0.0
> There are no changes in the code whatsoever. 
>  
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17553) On Master Change the running Apps goes to wait state even though resource are available

2016-09-15 Thread vimal dinakaran (JIRA)
vimal dinakaran created SPARK-17553:
---

 Summary: On Master Change the running Apps goes to wait state even 
though resource are available
 Key: SPARK-17553
 URL: https://issues.apache.org/jira/browse/SPARK-17553
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, Web UI
Affects Versions: 1.6.2
Reporter: vimal dinakaran
Priority: Minor


When a ZooKeeper node goes down, the master election changes and all the running 
apps are transferred to the new master. This works fine. But in the Spark UI the 
app status goes to the waiting state. The apps are actually running, which can 
be seen in the streaming UI, but on the master UI page they are shown as 
waiting. This is misleading.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17534) Increase timeouts for DirectKafkaStreamSuite tests

2016-09-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-17534:
--
Assignee: Adam Roberts

> Increase timeouts for DirectKafkaStreamSuite tests
> --
>
> Key: SPARK-17534
> URL: https://issues.apache.org/jira/browse/SPARK-17534
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 2.1.0
>Reporter: Adam Roberts
>Assignee: Adam Roberts
>Priority: Trivial
>
> The current tests in DirectKafkaStreamSuite rely on a certain number of 
> messages being received within a given timeframe
> On machines with four+ cores and better clock speeds, this doesn't pose a 
> problem, but on a two core x86 box I regularly see timeouts within two tests 
> within this suite.
> To avoid other users hitting the same problem and needlessly doing their own 
> investigations, let's increase the timeouts. By using 1 sec instead of 0.2 
> sec batch durations and increasing the timeout to be 100 seconds not 20, we 
> consistently see the tests passing even on less powerful hardware
> I see the problem consistently using a machine with 2x
> Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz and 16 GB of RAM, 1 TB HDD
> Pull request to follow
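
As a sketch of the shape of the change (ScalaTest patience only; the suite's 
actual setup and assertions are omitted):

{code}
import org.scalatest.concurrent.Eventually._
import org.scalatest.time.{Seconds, Span}

// Batch duration: 1 second instead of 0.2 seconds (streaming side, not shown).
// Test patience: allow up to 100 seconds instead of 20 for all messages to arrive.
eventually(timeout(Span(100, Seconds))) {
  // assert(collectedMessages.size === expectedMessages.size)  // illustrative
}
{code}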



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17534) Increase timeouts for DirectKafkaStreamSuite tests

2016-09-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-17534:
--
Priority: Trivial  (was: Minor)

> Increase timeouts for DirectKafkaStreamSuite tests
> --
>
> Key: SPARK-17534
> URL: https://issues.apache.org/jira/browse/SPARK-17534
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 2.1.0
>Reporter: Adam Roberts
>Priority: Trivial
>
> The current tests in DirectKafkaStreamSuite rely on a certain number of 
> messages being received within a given timeframe
> On machines with four+ cores and better clock speeds, this doesn't pose a 
> problem, but on a two core x86 box I regularly see timeouts within two tests 
> within this suite.
> To avoid other users hitting the same problem and needlessly doing their own 
> investigations, let's increase the timeouts. By using 1 sec instead of 0.2 
> sec batch durations and increasing the timeout to be 100 seconds not 20, we 
> consistently see the tests passing even on less powerful hardware
> I see the problem consistently using a machine with 2x
> Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz and 16 GB of RAM, 1 TB HDD
> Pull request to follow



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17536) Minor performance improvement to JDBC batch inserts

2016-09-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-17536.
---
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 15098
[https://github.com/apache/spark/pull/15098]

> Minor performance improvement to JDBC batch inserts
> ---
>
> Key: SPARK-17536
> URL: https://issues.apache.org/jira/browse/SPARK-17536
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: John Muller
>Priority: Trivial
>  Labels: perfomance
> Fix For: 2.1.0
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> JDBC batch inserts currently are set to repeatedly retrieve the number of 
> fields inside the row iterator:
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L598
> val numFields = rddSchema.fields.length
> This value does not change and can be set prior to the loop.
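
A hedged sketch of the hoisting being described; it is simplified, since the real 
code dispatches type-specific setters rather than setObject:

{code}
import java.sql.PreparedStatement
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.StructType

def insertBatch(stmt: PreparedStatement, rows: Iterator[Row], rddSchema: StructType): Unit = {
  val numFields = rddSchema.fields.length  // hoisted: computed once, not per row
  while (rows.hasNext) {
    val row = rows.next()
    var i = 0
    while (i < numFields) {
      stmt.setObject(i + 1, row.get(i).asInstanceOf[AnyRef])  // simplified setter
      i += 1
    }
    stmt.addBatch()
  }
  stmt.executeBatch()
}
{code}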



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17536) Minor performance improvement to JDBC batch inserts

2016-09-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-17536:
--
Assignee: John Muller

> Minor performance improvement to JDBC batch inserts
> ---
>
> Key: SPARK-17536
> URL: https://issues.apache.org/jira/browse/SPARK-17536
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: John Muller
>Assignee: John Muller
>Priority: Trivial
>  Labels: perfomance
> Fix For: 2.1.0
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> JDBC batch inserts currently are set to repeatedly retrieve the number of 
> fields inside the row iterator:
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L598
> val numFields = rddSchema.fields.length
> This value does not change and can be set prior to the loop.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17406) Event Timeline will be very slow when there are too many executor events

2016-09-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-17406:
--
Assignee: cen yuhai

> Event Timeline will be very slow when there are too many executor events
> 
>
> Key: SPARK-17406
> URL: https://issues.apache.org/jira/browse/SPARK-17406
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.6.2, 2.0.0
>Reporter: cen yuhai
>Assignee: cen yuhai
>Priority: Minor
> Fix For: 2.1.0
>
> Attachments: timeline1.png, timeline2.png
>
>
> The job page will be too slow to open when there are thousands of executor 
> events (added or removed). I found that in the ExecutorsTab file, 
> executorIdToData never removes elements, so it grows over time. Before this 
> PR it looks like timeline1.png; after this PR it looks like timeline2.png 
> (we can set how many events will be displayed).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17406) Event Timeline will be very slow when there are too many executor events

2016-09-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-17406.
---
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 14969
[https://github.com/apache/spark/pull/14969]

> Event Timeline will be very slow when there are too many executor events
> 
>
> Key: SPARK-17406
> URL: https://issues.apache.org/jira/browse/SPARK-17406
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.6.2, 2.0.0
>Reporter: cen yuhai
>Priority: Minor
> Fix For: 2.1.0
>
> Attachments: timeline1.png, timeline2.png
>
>
> The job page will be too slow to open when there are thousands of executor 
> events (added or removed). I found that in the ExecutorsTab file, 
> executorIdToData never removes elements, so it grows over time. Before this 
> PR it looks like timeline1.png; after this PR it looks like timeline2.png 
> (we can set how many events will be displayed).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17521) Error when I use sparkContext.makeRDD(Seq())

2016-09-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-17521:
--
Assignee: Jianfei Wang

> Error when I use sparkContext.makeRDD(Seq())
> 
>
> Key: SPARK-17521
> URL: https://issues.apache.org/jira/browse/SPARK-17521
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Jianfei Wang
>Assignee: Jianfei Wang
>Priority: Trivial
>  Labels: easyfix
> Fix For: 2.0.1, 2.1.0
>
>
> When I use sc.makeRDD as below:
> ```
> val data3 = sc.makeRDD(Seq())
> println(data3.partitions.length)
> ```
> I get an error:
> Exception in thread "main" java.lang.IllegalArgumentException: Positive 
> number of slices required
> We can fix this bug by modifying the last line of makeRDD to check seq.size:
> ```
>   def makeRDD[T: ClassTag](seq: Seq[(T, Seq[String])]): RDD[T] = withScope {
>     assertNotStopped()
>     val indexToPrefs = seq.zipWithIndex.map(t => (t._2, t._1._2)).toMap
>     new ParallelCollectionRDD[T](this, seq.map(_._1), seq.size, indexToPrefs)
>   }
> ```
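
A minimal reproduction and workaround sketch; the merged fix may differ from the 
guard suggested above:

{code}
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("makeRDD-demo").setMaster("local[2]"))

// Fails on 2.0.0 with "Positive number of slices required":
// val data3 = sc.makeRDD(Seq())

// Workarounds until the fix is released:
val empty = sc.emptyRDD[Int]
println(empty.partitions.length)               // 0

val alsoEmpty = sc.parallelize(Seq.empty[Int], numSlices = 2)
println(alsoEmpty.partitions.length)           // 2

sc.stop()
{code}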



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17521) Error when I use sparkContext.makeRDD(Seq())

2016-09-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-17521.
---
   Resolution: Fixed
Fix Version/s: 2.1.0
   2.0.1

Issue resolved by pull request 15077
[https://github.com/apache/spark/pull/15077]

> Error when I use sparkContext.makeRDD(Seq())
> 
>
> Key: SPARK-17521
> URL: https://issues.apache.org/jira/browse/SPARK-17521
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Jianfei Wang
>Priority: Trivial
>  Labels: easyfix
> Fix For: 2.0.1, 2.1.0
>
>
> When I use sc.makeRDD as below:
> ```
> val data3 = sc.makeRDD(Seq())
> println(data3.partitions.length)
> ```
> I get an error:
> Exception in thread "main" java.lang.IllegalArgumentException: Positive 
> number of slices required
> We can fix this bug by modifying the last line of makeRDD to check seq.size:
> ```
>   def makeRDD[T: ClassTag](seq: Seq[(T, Seq[String])]): RDD[T] = withScope {
>     assertNotStopped()
>     val indexToPrefs = seq.zipWithIndex.map(t => (t._2, t._1._2)).toMap
>     new ParallelCollectionRDD[T](this, seq.map(_._1), seq.size, indexToPrefs)
>   }
> ```



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17524) RowBasedKeyValueBatchSuite always uses 64 mb page size

2016-09-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-17524:
--
Assignee: Adam Roberts
Priority: Trivial  (was: Minor)

> RowBasedKeyValueBatchSuite always uses 64 mb page size
> --
>
> Key: SPARK-17524
> URL: https://issues.apache.org/jira/browse/SPARK-17524
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 2.1.0
>Reporter: Adam Roberts
>Assignee: Adam Roberts
>Priority: Trivial
> Fix For: 2.1.0
>
>
> The appendRowUntilExceedingPageSize test in 
> sql/catalyst/src/test/java/org/apache/spark/sql/catalyst/expressions/RowBasedKeyValueBatchSuite.java
> always runs with the default page size of 64 MB.
> Users with less powerful machines (e.g. those with two cores) may choose a 
> smaller spark.buffer.pageSize value to avoid problems acquiring memory.
> If the page size is reduced, say to 1 MB, the test fails. The problem 
> scenario is:
> We run with a 1,048,576-byte page size (1 MB); the default is 67,108,864 
> bytes (64 MB).
> Test failure: java.lang.AssertionError: expected:<14563> but was:<932067>
> 932,067 is 64x larger than 14,563, and the default page size is 64x larger 
> than the 1 MB setting under which the failure appears.
> The failing assertion is:
> Assert.assertEquals(batch.numRows(), numRows);
> This minor improvement makes the test use whatever page size the user has 
> specified (it reads spark.buffer.pageSize), so the problem no longer occurs 
> for anyone testing Apache Spark on a box with a reduced page size.
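
As a rough sketch of the idea, and not the actual test change, the page size could be read from the configuration rather than hard-coded; this assumes SparkConf.getSizeAsBytes, which parses values such as "1m" and falls back to the default when the key is unset.

{code}
import org.apache.spark.SparkConf

// Sketch only: read the configured page size instead of assuming 64 MB.
val conf = new SparkConf()
val pageSizeBytes = conf.getSizeAsBytes("spark.buffer.pageSize", "64m")
// Default "64m" => 67108864 bytes; with spark.buffer.pageSize=1m => 1048576,
// a factor of 64, which matches the 932067 vs 14563 mismatch in the report.
println(s"page size = $pageSizeBytes bytes")
{code}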



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17524) RowBasedKeyValueBatchSuite always uses 64 mb page size

2016-09-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-17524.
---
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 15079
[https://github.com/apache/spark/pull/15079]

> RowBasedKeyValueBatchSuite always uses 64 mb page size
> --
>
> Key: SPARK-17524
> URL: https://issues.apache.org/jira/browse/SPARK-17524
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 2.1.0
>Reporter: Adam Roberts
>Priority: Minor
> Fix For: 2.1.0
>
>
> The appendRowUntilExceedingPageSize test in 
> sql/catalyst/src/test/java/org/apache/spark/sql/catalyst/expressions/RowBasedKeyValueBatchSuite.java
> always runs with the default page size of 64 MB.
> Users with less powerful machines (e.g. those with two cores) may choose a 
> smaller spark.buffer.pageSize value to avoid problems acquiring memory.
> If the page size is reduced, say to 1 MB, the test fails. The problem 
> scenario is:
> We run with a 1,048,576-byte page size (1 MB); the default is 67,108,864 
> bytes (64 MB).
> Test failure: java.lang.AssertionError: expected:<14563> but was:<932067>
> 932,067 is 64x larger than 14,563, and the default page size is 64x larger 
> than the 1 MB setting under which the failure appears.
> The failing assertion is:
> Assert.assertEquals(batch.numRows(), numRows);
> This minor improvement makes the test use whatever page size the user has 
> specified (it reads spark.buffer.pageSize), so the problem no longer occurs 
> for anyone testing Apache Spark on a box with a reduced page size.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17507) check weight vector size in ANN

2016-09-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-17507.
---
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 15060
[https://github.com/apache/spark/pull/15060]

> check weight vector size in ANN
> ---
>
> Key: SPARK-17507
> URL: https://issues.apache.org/jira/browse/SPARK-17507
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Reporter: Weichen Xu
> Fix For: 2.1.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Check the size of the user-specified weight vector in ANN, and throw an 
> exception promptly if it is not correct.
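
As an illustrative sketch, not the MLlib implementation, and assuming a standard fully connected feed-forward topology with one bias term per non-input layer, the expected weight vector length and a corresponding check might look like the hypothetical helpers below.

{code}
// Hypothetical sketch: expected weight count for a fully connected
// feed-forward network where layers(i) is the number of units in layer i
// and each non-input layer adds a bias term.
def expectedWeightSize(layers: Array[Int]): Int =
  layers.sliding(2).map { case Array(in, out) => (in + 1) * out }.sum

def checkWeights(weights: Array[Double], layers: Array[Int]): Unit = {
  val expected = expectedWeightSize(layers)
  require(weights.length == expected,
    s"Expected a weight vector of length $expected for layers " +
      s"${layers.mkString("[", ",", "]")}, but got ${weights.length}.")
}

// Example: checkWeights(new Array[Double](43), Array(4, 5, 3)) passes,
// since (4+1)*5 + (5+1)*3 = 43.
{code}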



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17507) check weight vector size in ANN

2016-09-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-17507:
--
Assignee: Weichen Xu
Priority: Trivial  (was: Major)

> check weight vector size in ANN
> ---
>
> Key: SPARK-17507
> URL: https://issues.apache.org/jira/browse/SPARK-17507
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Trivial
> Fix For: 2.1.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Check the size of the user-specified weight vector in ANN, and throw an 
> exception promptly if it is not correct.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


