[jira] [Assigned] (SPARK-12860) speed up safe projection for primitive types

2016-01-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12860:


Assignee: Apache Spark

> speed up safe projection for primitive types
> 
>
> Key: SPARK-12860
> URL: https://issues.apache.org/jira/browse/SPARK-12860
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>







[jira] [Commented] (SPARK-10777) order by fails when column is aliased and projection includes windowed aggregate

2016-01-16 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15103604#comment-15103604
 ] 

Xiao Li commented on SPARK-10777:
-

[~kevinyu98] My PR https://github.com/apache/spark/pull/10678 also resolves 
this issue. Thanks!

> order by fails when column is aliased and projection includes windowed 
> aggregate
> 
>
> Key: SPARK-10777
> URL: https://issues.apache.org/jira/browse/SPARK-10777
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: N Campbell
>
> This statement fails in Spark (works fine in Oracle, DB2):
> select r as c1, min ( s ) over ()  as c2 from
>   ( select rnum r, sum ( cint ) s from certstring.tint group by rnum ) t
> order by r
> Error: org.apache.spark.sql.AnalysisException: cannot resolve 'r' given input 
> columns c1, c2; line 3 pos 9
> SQLState:  null
> ErrorCode: 0
> Forcing the aliased column name works around the defect
> select r as c1, min ( s ) over ()  as c2 from
>   ( select rnum r, sum ( cint ) s from certstring.tint group by rnum ) t
> order by c1
> These work fine
> select r as c1, min ( s ) over ()  as c2 from
>   ( select rnum r, sum ( cint ) s from certstring.tint group by rnum ) t
> order by c1
> select r as c1, s  as c2 from
>   ( select rnum r, sum ( cint ) s from certstring.tint group by rnum ) t
> order by r
> create table  if not exists TINT ( RNUM int , CINT int   )
>  ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n' 
>  STORED AS ORC  ;






[jira] [Assigned] (SPARK-12860) speed up safe projection for primitive types

2016-01-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12860:


Assignee: (was: Apache Spark)

> speed up safe projection for primitive types
> 
>
> Key: SPARK-12860
> URL: https://issues.apache.org/jira/browse/SPARK-12860
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>







[jira] [Commented] (SPARK-12860) speed up safe projection for primitive types

2016-01-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15103603#comment-15103603
 ] 

Apache Spark commented on SPARK-12860:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/10790

> speed up safe projection for primitive types
> 
>
> Key: SPARK-12860
> URL: https://issues.apache.org/jira/browse/SPARK-12860
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>







[jira] [Created] (SPARK-12860) speed up safe projection for primitive types

2016-01-16 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-12860:
---

 Summary: speed up safe projection for primitive types
 Key: SPARK-12860
 URL: https://issues.apache.org/jira/browse/SPARK-12860
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Wenchen Fan









[jira] [Assigned] (SPARK-11518) The script spark-submit.cmd can not handle spark directory with space.

2016-01-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11518:


Assignee: (was: Apache Spark)

> The script spark-submit.cmd can not handle spark directory with space.
> --
>
> Key: SPARK-11518
> URL: https://issues.apache.org/jira/browse/SPARK-11518
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Windows
>Affects Versions: 1.4.1
>Reporter: Cele Liu
>Priority: Minor
>
> Unzip Spark into D:\Program Files\Spark; when we submit the app, we get the 
> error:
> 'D:\Program' is not recognized as an internal or external command,
> operable program or batch file.
> In spark-submit.cmd, the script does not handle space:
> cmd /V /E /C %~dp0spark-submit2.cmd %*






[jira] [Assigned] (SPARK-11518) The script spark-submit.cmd can not handle spark directory with space.

2016-01-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11518:


Assignee: Apache Spark

> The script spark-submit.cmd can not handle spark directory with space.
> --
>
> Key: SPARK-11518
> URL: https://issues.apache.org/jira/browse/SPARK-11518
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Windows
>Affects Versions: 1.4.1
>Reporter: Cele Liu
>Assignee: Apache Spark
>Priority: Minor
>
> Unzip Spark into D:\Program Files\Spark; when we submit the app, we get the 
> error:
> 'D:\Program' is not recognized as an internal or external command,
> operable program or batch file.
> In spark-submit.cmd, the script does not handle space:
> cmd /V /E /C %~dp0spark-submit2.cmd %*






[jira] [Commented] (SPARK-11518) The script spark-submit.cmd can not handle spark directory with space.

2016-01-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15103589#comment-15103589
 ] 

Apache Spark commented on SPARK-11518:
--

User 'tritab' has created a pull request for this issue:
https://github.com/apache/spark/pull/10789

> The script spark-submit.cmd can not handle spark directory with space.
> --
>
> Key: SPARK-11518
> URL: https://issues.apache.org/jira/browse/SPARK-11518
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Windows
>Affects Versions: 1.4.1
>Reporter: Cele Liu
>Priority: Minor
>
> Unzip Spark into D:\Program Files\Spark; when we submit the app, we get the 
> error:
> 'D:\Program' is not recognized as an internal or external command,
> operable program or batch file.
> In spark-submit.cmd, the script does not handle space:
> cmd /V /E /C %~dp0spark-submit2.cmd %*






[jira] [Commented] (SPARK-12669) Organize options for default values

2016-01-16 Thread Hossein Falaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15103584#comment-15103584
 ] 

Hossein Falaki commented on SPARK-12669:


This looks like a good categorization. We have full coverage of the Formatting 
and Line parsing options. What option names do you recommend for the other 
parsing options?

> Organize options for default values
> ---
>
> Key: SPARK-12669
> URL: https://issues.apache.org/jira/browse/SPARK-12669
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hossein Falaki
>
> CSV data source in SparkSQL should be able to differentiate empty string, 
> null, NaN, “N/A” (maybe data type dependent).






[jira] [Commented] (SPARK-11507) Error thrown when using BlockMatrix.add

2016-01-16 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15103576#comment-15103576
 ] 

yuhao yang commented on SPARK-11507:


A fix has been merged into Breeze. 
https://github.com/scalanlp/breeze/commit/d255b66d7e7720f8447a49c78e762d21b18835c3.
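
For anyone following along, a minimal sketch of the BlockMatrix.add call under 
discussion (assuming a local SparkContext; the matrices below are illustrative 
and not taken from the reporter's code):

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Matrices
import org.apache.spark.mllib.linalg.distributed.BlockMatrix

val sc = new SparkContext(new SparkConf().setAppName("blockmatrix-add-sketch").setMaster("local[*]"))

// Two 2x2 matrices, each stored as a single 2x2 block.
val blocksA = sc.parallelize(Seq(((0, 0), Matrices.dense(2, 2, Array(1.0, 2.0, 3.0, 4.0)))))
val blocksB = sc.parallelize(Seq(((0, 0), Matrices.sparse(2, 2, Array(0, 1, 2), Array(0, 1), Array(5.0, 6.0)))))

val a = new BlockMatrix(blocksA, 2, 2)
val b = new BlockMatrix(blocksB, 2, 2)

// add() is where the reporter hit the colPtr error in certain situations.
println(a.add(b).toLocalMatrix())
{code}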
 

> Error thrown when using BlockMatrix.add
> ---
>
> Key: SPARK-11507
> URL: https://issues.apache.org/jira/browse/SPARK-11507
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.3.1, 1.5.0
> Environment: Mac/local machine, EC2
> Scala
>Reporter: Kareem Alhazred
>Priority: Minor
>
> In certain situations when adding two block matrices, I get an error 
> regarding colPtr and the operation fails.  The external issue URL includes the 
> full error and code for reproducing the problem.






[jira] [Commented] (SPARK-2243) Support multiple SparkContexts in the same JVM

2016-01-16 Thread Jason Hubbard (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15103476#comment-15103476
 ] 

Jason Hubbard commented on SPARK-2243:
--

It seems feasible to me to implement such functionality and only have one 
application master. My understanding of the application master for yarn-client 
may be incorrect, but I believe it only requests resources from the resource 
manager, so there shouldn't be a huge limitation there; whether it's desirable 
could be a different argument.  Either way, it still means N-1 JVMs on the 
driver side.

> Support multiple SparkContexts in the same JVM
> --
>
> Key: SPARK-2243
> URL: https://issues.apache.org/jira/browse/SPARK-2243
> Project: Spark
>  Issue Type: New Feature
>  Components: Block Manager, Spark Core
>Affects Versions: 0.7.0, 1.0.0, 1.1.0
>Reporter: Miguel Angel Fernandez Diaz
>
> We're developing a platform where we create several Spark contexts for 
> carrying out different calculations. Is there any restriction when using 
> several Spark contexts? We have two contexts, one for Spark calculations and 
> another one for Spark Streaming jobs. The next error arises when we first 
> execute a Spark calculation and, once the execution is finished, a Spark 
> Streaming job is launched:
> {code}
> 14/06/23 16:40:08 ERROR executor.Executor: Exception in task ID 0
> java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0
>   at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624)
>   at 
> org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156)
>   at 
> org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
>   at 
> org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:63)
>   at 
> org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:139)
>   at 
> java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:62)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:193)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:45)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Lost TID 0 (task 0.0:0)
> 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Loss was due to 
> java.io.FileNotFoundException
> java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0
>   at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624)
>   at 
> org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156)
>   at 
> org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java

[jira] [Comment Edited] (SPARK-11518) The script spark-submit.cmd can not handle spark directory with space.

2016-01-16 Thread Jon Maurer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15103460#comment-15103460
 ] 

Jon Maurer edited comment on SPARK-11518 at 1/16/16 9:57 PM:
-

Try adding quotes: 

cmd /V /E /C "%~dp0spark-submit2.cmd" %*


was (Author: tri...@gmail.com):
Try adding quotes: 

rem cmd /V /E /C "%~dp0spark-submit2.cmd" %*

> The script spark-submit.cmd can not handle spark directory with space.
> --
>
> Key: SPARK-11518
> URL: https://issues.apache.org/jira/browse/SPARK-11518
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Windows
>Affects Versions: 1.4.1
>Reporter: Cele Liu
>Priority: Minor
>
> Unzip Spark into D:\Program Files\Spark; when we submit the app, we get the 
> error:
> 'D:\Program' is not recognized as an internal or external command,
> operable program or batch file.
> In spark-submit.cmd, the script does not handle space:
> cmd /V /E /C %~dp0spark-submit2.cmd %*






[jira] [Commented] (SPARK-11518) The script spark-submit.cmd can not handle spark directory with space.

2016-01-16 Thread Jon Maurer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15103460#comment-15103460
 ] 

Jon Maurer commented on SPARK-11518:


Try adding quotes: 

rem cmd /V /E /C "%~dp0spark-submit2.cmd" %*

> The script spark-submit.cmd can not handle spark directory with space.
> --
>
> Key: SPARK-11518
> URL: https://issues.apache.org/jira/browse/SPARK-11518
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Windows
>Affects Versions: 1.4.1
>Reporter: Cele Liu
>Priority: Minor
>
> Unzip Spark into D:\Program Files\Spark; when we submit the app, we get the 
> error:
> 'D:\Program' is not recognized as an internal or external command,
> operable program or batch file.
> In spark-submit.cmd, the script does not handle space:
> cmd /V /E /C %~dp0spark-submit2.cmd %*






[jira] [Commented] (SPARK-2243) Support multiple SparkContexts in the same JVM

2016-01-16 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15103451#comment-15103451
 ] 

Sean Owen commented on SPARK-2243:
--

What do you mean? Each context operates independently no matter where it is. N 
contexts = N requests to the resource manager, even in 1 JVM. This is 
essentially why I'm not following this line of argument.

> Support multiple SparkContexts in the same JVM
> --
>
> Key: SPARK-2243
> URL: https://issues.apache.org/jira/browse/SPARK-2243
> Project: Spark
>  Issue Type: New Feature
>  Components: Block Manager, Spark Core
>Affects Versions: 0.7.0, 1.0.0, 1.1.0
>Reporter: Miguel Angel Fernandez Diaz
>
> We're developing a platform where we create several Spark contexts for 
> carrying out different calculations. Is there any restriction when using 
> several Spark contexts? We have two contexts, one for Spark calculations and 
> another one for Spark Streaming jobs. The next error arises when we first 
> execute a Spark calculation and, once the execution is finished, a Spark 
> Streaming job is launched:
> {code}
> 14/06/23 16:40:08 ERROR executor.Executor: Exception in task ID 0
> java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0
>   at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624)
>   at 
> org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156)
>   at 
> org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
>   at 
> org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:63)
>   at 
> org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:139)
>   at 
> java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:62)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:193)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:45)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Lost TID 0 (task 0.0:0)
> 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Loss was due to 
> java.io.FileNotFoundException
> java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0
>   at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624)
>   at 
> org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156)
>   at 
> org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> java.io.ObjectStreamClass.invokeR

[jira] [Commented] (SPARK-2243) Support multiple SparkContexts in the same JVM

2016-01-16 Thread Jason Hubbard (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15103446#comment-15103446
 ] 

Jason Hubbard commented on SPARK-2243:
--

The difference between N contexts in N JVMs and N contexts in 1 JVM is N-1 
JVMs, N-1 resource manager application requests, and N-1 application masters, 
plus all the overhead and complexity associated with this.  There are 
trade-offs with both approaches, but I think it is important to allow the 
designer to weigh them and decide on each option.

I guess I'm not familiar with the app-container-like things you are referring 
to, but typically most applications I see make these assumptions at the task 
level and not necessarily at the driver level.  Again, these are trade-offs 
that it would be great to allow the designer to weigh, IMHO.

> Support multiple SparkContexts in the same JVM
> --
>
> Key: SPARK-2243
> URL: https://issues.apache.org/jira/browse/SPARK-2243
> Project: Spark
>  Issue Type: New Feature
>  Components: Block Manager, Spark Core
>Affects Versions: 0.7.0, 1.0.0, 1.1.0
>Reporter: Miguel Angel Fernandez Diaz
>
> We're developing a platform where we create several Spark contexts for 
> carrying out different calculations. Is there any restriction when using 
> several Spark contexts? We have two contexts, one for Spark calculations and 
> another one for Spark Streaming jobs. The next error arises when we first 
> execute a Spark calculation and, once the execution is finished, a Spark 
> Streaming job is launched:
> {code}
> 14/06/23 16:40:08 ERROR executor.Executor: Exception in task ID 0
> java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0
>   at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624)
>   at 
> org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156)
>   at 
> org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
>   at 
> org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:63)
>   at 
> org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:139)
>   at 
> java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:62)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:193)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:45)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Lost TID 0 (task 0.0:0)
> 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Loss was due to 
> java.io.FileNotFoundException
> java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0
>   at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624)
>   at 
> org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156)
>   at 
> org.apache.spark.broa

[jira] [Commented] (SPARK-12669) Organize options for default values

2016-01-16 Thread Mohit Jaggi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15103441#comment-15103441
 ] 

Mohit Jaggi commented on SPARK-12669:
-

Based on my experience working with CSV files, I think the following set of 
options makes sense (a rough read-path sketch follows the list below). What do 
people think? Also, what is a good way to organize these options? I like 
https://github.com/typesafehub/config

refer: https://github.com/databricks/spark-csv/pull/124/files

Options by category:
1. Line parsing options
  a. Bad line handling: skip the line, fail completely, or repair the line
  b. Line repairing methods: fill with "filler value" which can be configured 
per data type 

2. Real Number parsing
 There are defaults that can be overridden or augmented
  a. NaN value: default "NaN", "Double.NaN"
  b. Infinity: default "Inf"
  c. -Infinity: default "-Inf"
  d. nulls: default "null"

3. Integer Parsing
  a. nulls: default "null"
 
4. String Parsing
  a. nulls: default "null"
  b. empty strings: default ""

5. Formatting
  a. field delimiter: default comma
  b. record delimiter: default newline; due to Hadoop InputFormat's behavior 
we probably can't allow arbitrary record delimiters
  c. escape character: default backslash
  d. quote character: default quote
  e. ignore leading white space: default true
  f. ignore trailing white space: default true
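
To make the mapping concrete, here is a rough sketch of how these categories 
could surface on the read path. The read-path shape and the line parsing / 
formatting option names (mode, nullValue, delimiter, quote, escape) follow the 
existing spark-csv options as far as I recall them; the number-parsing names 
are illustrative placeholders, not a settled proposal.

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("csv-options-sketch").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  // 1. Line parsing: skip bad lines (alternatives: FAILFAST, or a "repair" mode)
  .option("mode", "DROPMALFORMED")
  // 2/3/4. Number and string parsing; "nanValue", "positiveInf", "negativeInf"
  // are placeholder names for the proposed options above
  .option("nanValue", "NaN")
  .option("positiveInf", "Inf")
  .option("negativeInf", "-Inf")
  .option("nullValue", "null")
  // 5. Formatting
  .option("delimiter", ",")
  .option("quote", "\"")
  .option("escape", "\\")
  .load("/path/to/data.csv")   // placeholder path
{code}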
  


> Organize options for default values
> ---
>
> Key: SPARK-12669
> URL: https://issues.apache.org/jira/browse/SPARK-12669
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hossein Falaki
>
> CSV data source in SparkSQL should be able to differentiate empty string, 
> null, NaN, “N/A” (maybe data type dependent).






[jira] [Commented] (SPARK-2243) Support multiple SparkContexts in the same JVM

2016-01-16 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15103429#comment-15103429
 ] 

Sean Owen commented on SPARK-2243:
--

But the question isn't multiple JVMs or multiple contexts -- it's multiple 
contexts in one JVM. I don't see how N contexts is better or worse in this 
respect in 1 JVM vs N JVMs.

My opinion is based on the fact that, generally, other app-container-like 
things don't want you to assume sharing one JVM across unrelated apps, and as 
it happens Spark also never has.

> Support multiple SparkContexts in the same JVM
> --
>
> Key: SPARK-2243
> URL: https://issues.apache.org/jira/browse/SPARK-2243
> Project: Spark
>  Issue Type: New Feature
>  Components: Block Manager, Spark Core
>Affects Versions: 0.7.0, 1.0.0, 1.1.0
>Reporter: Miguel Angel Fernandez Diaz
>
> We're developing a platform where we create several Spark contexts for 
> carrying out different calculations. Is there any restriction when using 
> several Spark contexts? We have two contexts, one for Spark calculations and 
> another one for Spark Streaming jobs. The next error arises when we first 
> execute a Spark calculation and, once the execution is finished, a Spark 
> Streaming job is launched:
> {code}
> 14/06/23 16:40:08 ERROR executor.Executor: Exception in task ID 0
> java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0
>   at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624)
>   at 
> org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156)
>   at 
> org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
>   at 
> org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:63)
>   at 
> org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:139)
>   at 
> java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:62)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:193)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:45)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Lost TID 0 (task 0.0:0)
> 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Loss was due to 
> java.io.FileNotFoundException
> java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0
>   at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624)
>   at 
> org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156)
>   at 
> org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAcces

[jira] [Commented] (SPARK-2243) Support multiple SparkContexts in the same JVM

2016-01-16 Thread Jason Hubbard (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15103419#comment-15103419
 ] 

Jason Hubbard commented on SPARK-2243:
--

There is a limit to how small you can make the heap before causing OOM, and
making it too big causes GC issues and poor performance. You got it exactly
right: jobs have different task profiles.

Having multiple JVMs and multiple YARN resource requests will take longer
and have more overhead than a single JVM and a single application master.

It may be a non-goal to you, but that doesn't make it generally undesirable.




> Support multiple SparkContexts in the same JVM
> --
>
> Key: SPARK-2243
> URL: https://issues.apache.org/jira/browse/SPARK-2243
> Project: Spark
>  Issue Type: New Feature
>  Components: Block Manager, Spark Core
>Affects Versions: 0.7.0, 1.0.0, 1.1.0
>Reporter: Miguel Angel Fernandez Diaz
>
> We're developing a platform where we create several Spark contexts for 
> carrying out different calculations. Is there any restriction when using 
> several Spark contexts? We have two contexts, one for Spark calculations and 
> another one for Spark Streaming jobs. The next error arises when we first 
> execute a Spark calculation and, once the execution is finished, a Spark 
> Streaming job is launched:
> {code}
> 14/06/23 16:40:08 ERROR executor.Executor: Exception in task ID 0
> java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0
>   at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624)
>   at 
> org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156)
>   at 
> org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
>   at 
> org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:63)
>   at 
> org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:139)
>   at 
> java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:62)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:193)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:45)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Lost TID 0 (task 0.0:0)
> 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Loss was due to 
> java.io.FileNotFoundException
> java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0
>   at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624)
>   at 
> org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156)
>   at 
> org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccesso

[jira] [Updated] (SPARK-12859) Names of input streams with receivers don't fit in Streaming page

2016-01-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12859:
--
Priority: Trivial  (was: Major)

This can't be considered "Major". It's a mild cosmetic problem.

> Names of input streams with receivers don't fit in Streaming page
> -
>
> Key: SPARK-12859
> URL: https://issues.apache.org/jira/browse/SPARK-12859
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming, Web UI
>Affects Versions: 2.0.0
>Reporter: Jacek Laskowski
>Priority: Trivial
> Attachments: spark-streaming-webui-names-too-long-to-fit.png
>
>
> Since the column for the names of input streams with receivers (under Input 
> Rate) is fixed, the not-so-long names don't fit in the Streaming page. See the 
> attachment.






[jira] [Commented] (SPARK-2243) Support multiple SparkContexts in the same JVM

2016-01-16 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15103407#comment-15103407
 ] 

Sean Owen commented on SPARK-2243:
--

Executors have the same heap size, yes, but the total heap available becomes 
larger. You can cache more, run more tasks, etc. That's directionally the same 
purpose. 

Making a new context is coarser than making an executor. I understand you can 
make a context that makes smaller executors, but is this really better than 
simply having more, smaller executors? There is this general problem of jobs 
having potentially quite different tasks -- some I/O intensive, some 
memory-intensive, some CPU-intensive. Executors are one-size-fits-all in one 
application. 

How does having contexts in the same or many JVMs affect how long they take to 
start? This is why it seems orthogonal.

Sharing data across contexts in one JVM seems like a non-goal to me. Just like, 
say, my JavaEE web app isn't supposed to share static state with other web apps 
completely outside the container.

> Support multiple SparkContexts in the same JVM
> --
>
> Key: SPARK-2243
> URL: https://issues.apache.org/jira/browse/SPARK-2243
> Project: Spark
>  Issue Type: New Feature
>  Components: Block Manager, Spark Core
>Affects Versions: 0.7.0, 1.0.0, 1.1.0
>Reporter: Miguel Angel Fernandez Diaz
>
> We're developing a platform where we create several Spark contexts for 
> carrying out different calculations. Is there any restriction when using 
> several Spark contexts? We have two contexts, one for Spark calculations and 
> another one for Spark Streaming jobs. The next error arises when we first 
> execute a Spark calculation and, once the execution is finished, a Spark 
> Streaming job is launched:
> {code}
> 14/06/23 16:40:08 ERROR executor.Executor: Exception in task ID 0
> java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0
>   at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624)
>   at 
> org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156)
>   at 
> org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
>   at 
> org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:63)
>   at 
> org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:139)
>   at 
> java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:62)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:193)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:45)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Lost TID 0 (task 0.0:0)
> 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Loss was due to 
> java.io.FileNotFoundException
> java.io.FileNotFoundException: http://172.

[jira] [Commented] (SPARK-12747) Postgres JDBC ArrayType(DoubleType) 'Unable to find server array type'

2016-01-16 Thread Brandon Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15103393#comment-15103393
 ] 

Brandon Bradley commented on SPARK-12747:
-

Got a similar error for DecimalType.SYSTEM_DEFAULT.
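
For reference, a minimal sketch of the kind of write described in the issue 
(Spark 1.6 API; the JDBC URL, credentials, and table name are placeholders, and 
a running PostgreSQL instance plus the Postgres JDBC driver on the classpath 
are assumed):

{code}
import java.util.Properties

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("array-jdbc-sketch").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Column "values" gets type ArrayType(DoubleType).
val df = Seq((1, Array(1.0, 2.0)), (2, Array(3.0))).toDF("id", "values")

val props = new Properties()
props.setProperty("user", "postgres")
props.setProperty("password", "postgres")

// The issue reports this kind of write failing in 1.6.0 with:
//   org.postgresql.util.PSQLException: Unable to find server array type for provided name double precision
df.write.jdbc("jdbc:postgresql://localhost:5432/testdb", "array_sketch", props)
{code}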

> Postgres JDBC ArrayType(DoubleType) 'Unable to find server array type'
> --
>
> Key: SPARK-12747
> URL: https://issues.apache.org/jira/browse/SPARK-12747
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Brandon Bradley
>  Labels: JDBC
>
> Hello,
> I'm getting this exception when trying to use DataFrame.write.jdbc on a 
> DataFrame with a column of type ArrayType(DoubleType).
> {noformat}
> org.postgresql.util.PSQLException: Unable to find server array type for 
> provided name double precision
> {noformat}
> The JDBC driver is definitely on the driver and executor classpath, as I have 
> other code that works without ArrayType. I'm not sure how to proceed with 
> debugging.






[jira] [Commented] (SPARK-2243) Support multiple SparkContexts in the same JVM

2016-01-16 Thread Jason Hubbard (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15103348#comment-15103348
 ] 

Jason Hubbard commented on SPARK-2243:
--

Dynamic allocation allows executors to be started and stopped, but it doesn't 
change the size of the heap.  Like you mention, it's coarse-grained, and having 
a separate context is required to make this finer grained.  It seems possible 
to allow executors to have separate heap sizes, but this may be as challenging 
as multiple contexts in a single JVM, and it limits the different settings you 
can configure per job to only heap size.

I don't totally agree that the long-running context is orthogonal.  It's 
certainly possible to have dozens to hundreds of simultaneous jobs running; 
that's a significant number of separate JVMs spinning up and YARN allocation 
requests.  The mixed use case of wanting to limit the overhead of JVM and YARN 
allocation and allowing information to be shared between contexts, while also 
allowing some unique settings, would certainly be a reason to want multiple 
Spark contexts in the same JVM.

That's interesting, I didn't realize you couldn't share RDDs between contexts; 
I thought that was possible.  It does make it less interesting, but I believe 
there are still some desirable reasons to allow it.  Having said that, like I 
mentioned before, it does seem like this is too complex a solution with not 
enough benefit.

> Support multiple SparkContexts in the same JVM
> --
>
> Key: SPARK-2243
> URL: https://issues.apache.org/jira/browse/SPARK-2243
> Project: Spark
>  Issue Type: New Feature
>  Components: Block Manager, Spark Core
>Affects Versions: 0.7.0, 1.0.0, 1.1.0
>Reporter: Miguel Angel Fernandez Diaz
>
> We're developing a platform where we create several Spark contexts for 
> carrying out different calculations. Is there any restriction when using 
> several Spark contexts? We have two contexts, one for Spark calculations and 
> another one for Spark Streaming jobs. The next error arises when we first 
> execute a Spark calculation and, once the execution is finished, a Spark 
> Streaming job is launched:
> {code}
> 14/06/23 16:40:08 ERROR executor.Executor: Exception in task ID 0
> java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0
>   at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624)
>   at 
> org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156)
>   at 
> org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
>   at 
> org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:63)
>   at 
> org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:139)
>   at 
> java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:62)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:193)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:45)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolEx

[jira] [Resolved] (SPARK-12796) initial prototype: projection/filter/range

2016-01-16 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-12796.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 10735
[https://github.com/apache/spark/pull/10735]

> initial prototype: projection/filter/range
> --
>
> Key: SPARK-12796
> URL: https://issues.apache.org/jira/browse/SPARK-12796
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
> Fix For: 2.0.0
>
>







[jira] [Updated] (SPARK-12859) Names of input streams with receivers don't fit in Streaming page

2016-01-16 Thread Jacek Laskowski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jacek Laskowski updated SPARK-12859:

Attachment: spark-streaming-webui-names-too-long-to-fit.png

> Names of input streams with receivers don't fit in Streaming page
> -
>
> Key: SPARK-12859
> URL: https://issues.apache.org/jira/browse/SPARK-12859
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming, Web UI
>Affects Versions: 2.0.0
>Reporter: Jacek Laskowski
> Attachments: spark-streaming-webui-names-too-long-to-fit.png
>
>
> Since the column for the names of input streams with receivers (under Input 
> Rate) is fixed, the not-so-long names don't fit in the Streaming page. See the 
> attachment.






[jira] [Created] (SPARK-12859) Names of input streams with receivers don't fit in Streaming page

2016-01-16 Thread Jacek Laskowski (JIRA)
Jacek Laskowski created SPARK-12859:
---

 Summary: Names of input streams with receivers don't fit in 
Streaming page
 Key: SPARK-12859
 URL: https://issues.apache.org/jira/browse/SPARK-12859
 Project: Spark
  Issue Type: Bug
  Components: Streaming, Web UI
Affects Versions: 2.0.0
Reporter: Jacek Laskowski


Since the column for the names of input streams with receivers (under Input 
Rate) is fixed, the not-so-long names don't fit in the Streaming page. See the 
attachment.






[jira] [Commented] (SPARK-2243) Support multiple SparkContexts in the same JVM

2016-01-16 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15103298#comment-15103298
 ] 

Sean Owen commented on SPARK-2243:
--

Yes, dynamic allocation operates in units of executors. You can make these 
smallish, but they are inevitably fairly coarse units.

You can have multiple contexts; they just have to occupy their own JVMs. I 
think this is what much of the discussion comes down to. The intended usage is 
to manage resources via a cluster resource manager and spark-submit 
applications to that cluster. Multiple contexts in one JVM is a miniature 
duplicate of this system and just not really the intended usage.

A long-running context to avoid the overhead of starting a context is an 
orthogonal issue. This by itself is unrelated to whether several of those 
contexts should be in one JVM.

I misspoke above; having multiple contexts in one JVM doesn't let you share RDDs. It 
might let you share some singletons and static state. This makes it less 
interesting IMHO. Sharing RDDs isn't enabled by having multiple contexts in one 
JVM.

Yes, on the positive side, separating contexts for logically separate 
applications gives you isolation, which is probably a good thing.
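
To make the constraint concrete, a minimal sketch of the behaviour under 
discussion (Spark 1.x API; the exact exception text may vary by version): a 
second active SparkContext in the same JVM is rejected by default, and the 
escape hatch below only downgrades the check to a warning.

{code}
import org.apache.spark.{SparkConf, SparkContext}

val sc1 = new SparkContext(new SparkConf().setAppName("ctx-1").setMaster("local[*]"))

// Rejected by default with a SparkException along the lines of
// "Only one SparkContext may be running in this JVM":
// val sc2 = new SparkContext(new SparkConf().setAppName("ctx-2").setMaster("local[*]"))

// Relaxes the check to a warning; multiple active contexts remain unsupported.
val sc2 = new SparkContext(
  new SparkConf()
    .setAppName("ctx-2")
    .setMaster("local[*]")
    .set("spark.driver.allowMultipleContexts", "true"))

sc2.stop()
sc1.stop()
{code}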

> Support multiple SparkContexts in the same JVM
> --
>
> Key: SPARK-2243
> URL: https://issues.apache.org/jira/browse/SPARK-2243
> Project: Spark
>  Issue Type: New Feature
>  Components: Block Manager, Spark Core
>Affects Versions: 0.7.0, 1.0.0, 1.1.0
>Reporter: Miguel Angel Fernandez Diaz
>
> We're developing a platform where we create several Spark contexts for 
> carrying out different calculations. Is there any restriction when using 
> several Spark contexts? We have two contexts, one for Spark calculations and 
> another one for Spark Streaming jobs. The next error arises when we first 
> execute a Spark calculation and, once the execution is finished, a Spark 
> Streaming job is launched:
> {code}
> 14/06/23 16:40:08 ERROR executor.Executor: Exception in task ID 0
> java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0
>   at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624)
>   at 
> org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156)
>   at 
> org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
>   at 
> org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:63)
>   at 
> org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:139)
>   at 
> java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:62)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:193)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:45)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> 14/06/23 16:40:08 WARN scheduler.TaskSetManag

[jira] [Commented] (SPARK-2243) Support multiple SparkContexts in the same JVM

2016-01-16 Thread Jason Hubbard (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15103289#comment-15103289
 ] 

Jason Hubbard commented on SPARK-2243:
--

I think what Richard is referring to, and what we have seen as well, is that it 
would be nice to have multiple contexts so we can change settings like executor 
memory. Allowing different executor memory would allow better utilization of 
the cluster, as some tasks require more memory while other tasks require less. 
I haven't found that dynamic allocation allows the executor memory to be 
different. We like having the ability to share RDDs, but also embedding the 
driver in our application to easily pass information between the application 
and Spark. The second part is solvable by serializing the information and 
having a separate process, but this increases the complexity a bit. The 
overhead of starting multiple JVMs and Spark jobs is also a bit of a concern, 
since we are running on YARN and allocating resources takes a bit of time, but 
it's typically marginal.
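
As a sketch of the workaround being described, each workload is packaged as its 
own application (and therefore its own driver JVM), so executors can be sized 
per job. The application names and memory values below are illustrative only.

{code}
import org.apache.spark.SparkConf

// Two logically separate workloads, each submitted as its own application,
// so each can size its executors independently.
val memoryHeavyConf = new SparkConf()
  .setAppName("large-join-job")
  .set("spark.executor.memory", "8g")   // shuffle-heavy stages need big executors

val lightweightConf = new SparkConf()
  .setAppName("small-scoring-job")
  .set("spark.executor.memory", "1g")   // many cheap tasks, small executors
{code}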

Either way, this has been open for quite some time, and it seems like the 
complexity of the change outweighs the benefit of having multiple contexts in 
one JVM, especially when we start looking at the ability to store RDDs off 
heap, although that has some costs and complexity associated with it as well.

Is anyone familiar enough with Spark Job Server? It used to have the ability 
to run multiple Spark contexts in one JVM, but at one point someone was saying 
it was broken. From the documentation it does look like they abandoned the 
idea, citing a separate JVM per context for isolation.

> Support multiple SparkContexts in the same JVM
> --
>
> Key: SPARK-2243
> URL: https://issues.apache.org/jira/browse/SPARK-2243
> Project: Spark
>  Issue Type: New Feature
>  Components: Block Manager, Spark Core
>Affects Versions: 0.7.0, 1.0.0, 1.1.0
>Reporter: Miguel Angel Fernandez Diaz
>
> We're developing a platform where we create several Spark contexts for 
> carrying out different calculations. Is there any restriction when using 
> several Spark contexts? We have two contexts, one for Spark calculations and 
> another one for Spark Streaming jobs. The next error arises when we first 
> execute a Spark calculation and, once the execution is finished, a Spark 
> Streaming job is launched:
> {code}
> 14/06/23 16:40:08 ERROR executor.Executor: Exception in task ID 0
> java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0
>   at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624)
>   at 
> org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156)
>   at 
> org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
>   at 
> org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:63)
>   at 
> org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:139)
>   at 
> java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:62)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:193)
>   at 
> org.apache.spa

[jira] [Commented] (SPARK-2243) Support multiple SparkContexts in the same JVM

2016-01-16 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15103272#comment-15103272
 ] 

Sean Owen commented on SPARK-2243:
--

You say you're concerned with over-utilizing a cluster for steps that don't 
require much resource. This is what dynamic allocation is for: the number of 
executors increases and decreases with load. If one context is already using 
all cluster resources, yes, that doesn't do anything. But then, neither does a 
second context; the cluster is already fully used.
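
For reference, a minimal sketch of turning on dynamic allocation (YARN assumed; 
the external shuffle service must also be running on the node managers, and the 
executor bounds below are illustrative):

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Executors are requested and released with load instead of being fixed up front.
val conf = new SparkConf()
  .setAppName("dynamic-allocation-demo")
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "1")
  .set("spark.dynamicAllocation.maxExecutors", "20")
  .set("spark.shuffle.service.enabled", "true") // shuffle files must outlive executors

val sc = new SparkContext(conf)
{code}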

I don't know what overhead you're referring to, but certainly one context 
running N jobs is busier than each of N contexts running one job. Its overhead 
is higher, but the total overhead is lower. This is more an effect than a cause 
that would make you choose one architecture over another.

Generally, Spark has always assumed one context per JVM and I don't see that 
changing, which is why I finally closed this. I don't see any support for 
making this happen.

There's no reason you can't have multiple contexts, and in fact, you should 
have one context per application. The question is whether several should live 
in one JVM. The reasons I can see for that are: a little less overhead from N 
JVMs versus 1, and more importantly, being able to share RDDs and such across 
distinct jobs. But, your use case sounds like many unrelated jobs. It sounds 
like you simply want to run many JVMs to run your many contexts. Yes you pay 
some resource penalty, but on the upside you get better isolation.


> Support multiple SparkContexts in the same JVM
> --
>
> Key: SPARK-2243
> URL: https://issues.apache.org/jira/browse/SPARK-2243
> Project: Spark
>  Issue Type: New Feature
>  Components: Block Manager, Spark Core
>Affects Versions: 0.7.0, 1.0.0, 1.1.0
>Reporter: Miguel Angel Fernandez Diaz
>
> We're developing a platform where we create several Spark contexts for 
> carrying out different calculations. Is there any restriction when using 
> several Spark contexts? We have two contexts, one for Spark calculations and 
> another one for Spark Streaming jobs. The next error arises when we first 
> execute a Spark calculation and, once the execution is finished, a Spark 
> Streaming job is launched:
> {code}
> 14/06/23 16:40:08 ERROR executor.Executor: Exception in task ID 0
> java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0
>   at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624)
>   at 
> org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156)
>   at 
> org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
>   at 
> org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:63)
>   at 
> org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:139)
>   at 
> java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:62)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:193)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:45)
>   at org.apache.spark.executor.Executor$TaskRunner.

[jira] [Commented] (SPARK-2243) Support multiple SparkContexts in the same JVM

2016-01-16 Thread Richard Marscher (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15103268#comment-15103268
 ] 

Richard Marscher commented on SPARK-2243:
-

I fail to see how dynamic allocation would help; can you clarify?

We already are constantly using 100% of the cluster resources and have a fixed 
number of JVM driver hosts. If a given context has 32 cores available across 
executors and is constantly processing jobs whose stages have 32+ tasks, it will 
always be busy, so I don't see why it would scale down with dynamic allocation. 
Meanwhile, since we have to share this context, it will mix in jobs/stages/tasks 
for the separate DAG I mentioned. 

Another use case: after months of observation in production, there seems to be 
overhead and cost to sharing a SparkContext between jobs as opposed to running 
the same number of jobs fanned out across different contexts started on 
separate JVMs. And yes, this includes trying out different scheduler and pool 
settings (fair vs. FIFO). If this weren't the case, we could just run one big 
SparkContext on one JVM and share it for all our jobs. Since it's not the case, 
we need to have X many separate JVMs solely because each one can only have a 
single SparkContext.

Anyway, I don't mind this issue being closed as Won't Fix, but it feels like 
the entire comment chain is dancing around the underlying reason. The use cases 
are valid; it just seems like the conclusion is that they aren't critical enough 
in comparison to the changes to the Spark code needed to support them. That's 
fine, but can we just admit that?

> Support multiple SparkContexts in the same JVM
> --
>
> Key: SPARK-2243
> URL: https://issues.apache.org/jira/browse/SPARK-2243
> Project: Spark
>  Issue Type: New Feature
>  Components: Block Manager, Spark Core
>Affects Versions: 0.7.0, 1.0.0, 1.1.0
>Reporter: Miguel Angel Fernandez Diaz
>
> We're developing a platform where we create several Spark contexts for 
> carrying out different calculations. Is there any restriction when using 
> several Spark contexts? We have two contexts, one for Spark calculations and 
> another one for Spark Streaming jobs. The next error arises when we first 
> execute a Spark calculation and, once the execution is finished, a Spark 
> Streaming job is launched:
> {code}
> 14/06/23 16:40:08 ERROR executor.Executor: Exception in task ID 0
> java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0
>   at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624)
>   at 
> org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156)
>   at 
> org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
>   at 
> org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:63)
>   at 
> org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:139)
>   at 
> java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:62)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:193)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:45)
> 

[jira] [Comment Edited] (SPARK-2243) Support multiple SparkContexts in the same JVM

2016-01-16 Thread Richard Marscher (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15103268#comment-15103268
 ] 

Richard Marscher edited comment on SPARK-2243 at 1/16/16 5:25 PM:
--

I fail to see how dynamic allocation would help; can you clarify?

We already are constantly using 100% of the cluster resources and have a fixed 
number of JVM driver hosts. If a given context has 32 cores available across 
executors and is constantly processing jobs whose stages have 32+ tasks, it will 
always be busy, so I don't see why it would scale down with dynamic allocation. 
Meanwhile, since we have to share this context, it will mix in jobs/stages/tasks 
for the separate DAG I mentioned. 

Another use case: after months of observation in production, there seems to be 
overhead and cost to sharing a SparkContext between jobs as opposed to running 
the same number of jobs fanned out across different contexts started on 
separate JVMs. And yes, this includes trying out different scheduler and pool 
settings (fair vs. FIFO). If this weren't the case, we could just run one big 
SparkContext on one JVM and share it for all our jobs. Since it's not the case, 
we need to have X many separate JVMs solely because each one can only have a 
single SparkContext.

Anyway, I don't mind this issue being closed as Won't Fix, but it feels like 
the entire comment chain is dancing around the underlying reason. The use cases 
are valid; it just seems like the conclusion is that they aren't critical enough 
in comparison to the changes to the Spark code needed to support them. That's 
fine, but can we just admit that?


was (Author: rmarscher):
I fail to see how dynamic allocation would help, can you clarify?

We already are constantly using 100% of the cluster resources and have a fixed 
# of JVM driver hosts. If a given context has 32 cores available across 
executors and is constantly processing jobs with stages with 32+ tasks, it will 
always be busy so I don't see why it would scale down with dynamic allocation. 
Meanwhile, since we have to share this context it will mix in jobs/stages/tasks 
for the separate DAG I mentioned. 

Another use case. After observation of months in production, there seems to be 
overhead and cost to sharing a SparkContext between jobs as opposed to running 
the same number of jobs fanned out across different contexts started on 
separate JVMs. And yes this includes trying out different scheduler and pool 
settins (fair vs fifo). If this weren't the case, we could just run 1 big spark 
context on 1 JVM and share it for all our jobs. Since it's not the case we need 
to have X many separate JVMs solely because each one can only have a single 
SparkContext.

Anyway, I don't mind this issue being closed as Won't Fix, but if feels like 
the entire comment chain is dancing around the underlying reason. Use cases are 
valid it just seems like the conclusion is they aren't critical enough in 
comparison to the changes to the Spark code to support them. That's fine, but 
can we just admit that?

> Support multiple SparkContexts in the same JVM
> --
>
> Key: SPARK-2243
> URL: https://issues.apache.org/jira/browse/SPARK-2243
> Project: Spark
>  Issue Type: New Feature
>  Components: Block Manager, Spark Core
>Affects Versions: 0.7.0, 1.0.0, 1.1.0
>Reporter: Miguel Angel Fernandez Diaz
>
> We're developing a platform where we create several Spark contexts for 
> carrying out different calculations. Is there any restriction when using 
> several Spark contexts? We have two contexts, one for Spark calculations and 
> another one for Spark Streaming jobs. The next error arises when we first 
> execute a Spark calculation and, once the execution is finished, a Spark 
> Streaming job is launched:
> {code}
> 14/06/23 16:40:08 ERROR executor.Executor: Exception in task ID 0
> java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0
>   at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624)
>   at 
> org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156)
>   at 
> org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInput

[jira] [Commented] (SPARK-7780) The intercept in LogisticRegressionWithLBFGS should not be regularized

2016-01-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15103259#comment-15103259
 ] 

Apache Spark commented on SPARK-7780:
-

User 'holdenk' has created a pull request for this issue:
https://github.com/apache/spark/pull/10788

> The intercept in LogisticRegressionWithLBFGS should not be regularized
> --
>
> Key: SPARK-7780
> URL: https://issues.apache.org/jira/browse/SPARK-7780
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Reporter: DB Tsai
>
> The intercept in Logistic Regression represents a prior on the categories, 
> which should not be regularized. In MLlib, the regularization is handled 
> through `Updater`, and the `Updater` penalizes all the components without 
> excluding the intercept, which results in poor training accuracy when 
> regularization is used.
> The new implementation in the ML framework handles this properly, and we 
> should call the ML implementation from MLlib, since the majority of users are 
> still using the MLlib API. 
> Note that both of them do feature scaling to improve convergence, and the only 
> difference is that the ML version doesn't regularize the intercept. As a 
> result, when lambda is zero, they will converge to the same solution.
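
For comparison, a short sketch of the spark.ml estimator being referred to, 
which penalizes only the feature coefficients; `training` is an assumed 
DataFrame with "label" and "features" columns.

{code}
import org.apache.spark.ml.classification.LogisticRegression

// L2-regularized logistic regression in spark.ml; the intercept is fit
// but excluded from the penalty term.
val lr = new LogisticRegression()
  .setFitIntercept(true)
  .setRegParam(0.1)          // lambda, applied to the coefficients only
  .setElasticNetParam(0.0)   // pure L2

val model = lr.fit(training) // `training`: assumed DataFrame(label, features)
println(s"intercept = ${model.intercept}")
{code}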



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12858) Remove duplicated code in metrics

2016-01-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12858:


Assignee: Apache Spark

> Remove duplicated code in metrics
> -
>
> Key: SPARK-12858
> URL: https://issues.apache.org/jira/browse/SPARK-12858
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Benjamin Fradet
>Assignee: Apache Spark
>Priority: Minor
>
> I noticed there is some duplicated code in the sinks regarding the poll 
> period.
> Also, parts of the metrics.properties template are unclear.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12858) Remove duplicated code in metrics

2016-01-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15103225#comment-15103225
 ] 

Apache Spark commented on SPARK-12858:
--

User 'BenFradet' has created a pull request for this issue:
https://github.com/apache/spark/pull/10787

> Remove duplicated code in metrics
> -
>
> Key: SPARK-12858
> URL: https://issues.apache.org/jira/browse/SPARK-12858
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Benjamin Fradet
>Priority: Minor
>
> I noticed there is some duplicated code in the sinks regarding the poll 
> period.
> Also, parts of the metrics.properties template are unclear.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12858) Remove duplicated code in metrics

2016-01-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12858:


Assignee: (was: Apache Spark)

> Remove duplicated code in metrics
> -
>
> Key: SPARK-12858
> URL: https://issues.apache.org/jira/browse/SPARK-12858
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Benjamin Fradet
>Priority: Minor
>
> I noticed there is some duplicated code in the sinks regarding the poll 
> period.
> Also, parts of the metrics.properties template are unclear.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12858) Remove duplicated code in metrics

2016-01-16 Thread Benjamin Fradet (JIRA)
Benjamin Fradet created SPARK-12858:
---

 Summary: Remove duplicated code in metrics
 Key: SPARK-12858
 URL: https://issues.apache.org/jira/browse/SPARK-12858
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Benjamin Fradet
Priority: Minor


I noticed there is some duplicated code in the sinks regarding the poll period.
Also, parts of the metrics.properties template are unclear.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2698) RDD pages shows negative bytes remaining for some executors

2016-01-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-2698.
--
Resolution: Not A Problem

We fixed a number of negative values in the UI in about 1.5; reopen if this is 
still an issue

> RDD pages shows negative bytes remaining for some executors
> ---
>
> Key: SPARK-2698
> URL: https://issues.apache.org/jira/browse/SPARK-2698
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.0.0
>Reporter: Hossein Falaki
>Priority: Minor
> Attachments: spark ui.png
>
>
> The RDD page shows negative bytes remaining for some executors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2401) AdaBoost.MH, a multi-class multi-label classifier

2016-01-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-2401.
--
Resolution: Duplicate

> AdaBoost.MH, a multi-class multi-label classifier
> -
>
> Key: SPARK-2401
> URL: https://issues.apache.org/jira/browse/SPARK-2401
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Gang Bai
>Priority: Trivial
>
> Multi-class multi-label classifiers are very useful in web page profiling, 
> audience segmentation, etc. The goal of a multi-class multi-label classifier 
> is to tag a sample data point with a subset of labels from a finite, 
> pre-specified set, e.g. tagging a visitor with a set of interests. Given a 
> set of L labels, a data point can be tagged with one of the 2^L possible 
> subsets. The main challenge in training a multi-class multi-label classifier 
> is the exponentially large label space. 
> This JIRA is created to track the effort of solving the training problem of 
> multi-class, multi-label classifiers by implementing AdaBoost.MH on Apache 
> Spark. It will not be an easy task. I will start from a basic DecisionStump 
> weak learner, a simple Hamming tree assembling DecisionStumps into a meta 
> weak learner, and the iterative boosting procedure. I will be reusing modules 
> of Alexander Ulanov's multi-class and multi-label metrics evaluation and 
> Manish Amde's decision tree/boosting/ensemble implementations. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2389) globally shared SparkContext / shared Spark "application"

2016-01-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-2389.
--
Resolution: Won't Fix

> globally shared SparkContext / shared Spark "application"
> -
>
> Key: SPARK-2389
> URL: https://issues.apache.org/jira/browse/SPARK-2389
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Robert Stupp
>
> The documentation (in Cluster Mode Overview) cites:
> bq. Each application gets its own executor processes, which *stay up for the 
> duration of the whole application* and run tasks in multiple threads. This 
> has the benefit of isolating applications from each other, on both the 
> scheduling side (each driver schedules its own tasks) and executor side 
> (tasks from different applications run in different JVMs). However, it also 
> means that *data cannot be shared* across different Spark applications 
> (instances of SparkContext) without writing it to an external storage system.
> IMO this is a limitation that should be lifted to support any number of 
> --driver-- client processes to share executors and to share (persistent / 
> cached) data.
> This is especially useful if you have a bunch of frontend servers (dumb web 
> app servers) that want to use Spark as a _big computing machine_. Most 
> important is the fact that Spark is quite good at caching/persisting data in 
> memory / on disk, thus removing load from backend data stores.
> Means: it would be really great to let different --driver-- client JVMs 
> operate on the same RDDs and benefit from Spark's caching/persistence.
> It would however introduce some administration mechanisms to
> * start a shared context
> * update the executor configuration (# of worker nodes, # of cpus, etc) on 
> the fly
> * stop a shared context
> Even "conventional" batch MR applications would benefit if ran fequently 
> against the same data set.
> As an implicit requirement, RDD persistence could get a TTL for its 
> materialized state.
> With such a feature the overall performance of today's web applications could 
> then be increased by adding more web app servers, more spark nodes, more 
> nosql nodes etc



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2335) k-Nearest Neighbor classification and regression for MLLib

2016-01-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-2335.
--
Resolution: Duplicate

Sounds like this is subsumed by discussion of the approximate version

> k-Nearest Neighbor classification and regression for MLLib
> --
>
> Key: SPARK-2335
> URL: https://issues.apache.org/jira/browse/SPARK-2335
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Brian Gawalt
>Priority: Minor
>  Labels: clustering, features
>
> The k-Nearest Neighbor model for classification and regression problems is a 
> simple and intuitive approach, offering a straightforward path to creating 
> non-linear decision/estimation contours. Its downsides -- high variance 
> (sensitivity to the known training data set) and computational intensity for 
> estimating new point labels -- both play to Spark's big data strengths: lots 
> of data mitigates data concerns; lots of workers mitigate computational 
> latency. 
> We should include kNN models as options in MLLib.
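
A deliberately naive sketch of what a brute-force k-NN classifier could look 
like on RDDs; the types and the cartesian join are illustrative only, not a 
proposed MLlib API.

{code}
import org.apache.spark.rdd.RDD

// Brute-force k-NN: for each query point, keep the k nearest labeled points
// (squared Euclidean distance) and vote by majority label.
def knnClassify(
    train: RDD[(Array[Double], Double)],   // (features, label)
    queries: RDD[(Long, Array[Double])],   // (queryId, features)
    k: Int): RDD[(Long, Double)] = {

  def sqDist(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  queries.cartesian(train)
    .map { case ((qid, q), (x, label)) => (qid, (sqDist(q, x), label)) }
    .groupByKey()
    .mapValues { neighbours =>
      neighbours.toSeq.sortBy(_._1).take(k)   // k nearest by distance
        .groupBy(_._2).mapValues(_.size)      // votes per label
        .maxBy(_._2)._1                       // majority label wins
    }
}
{code}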



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2262) Extreme Learning Machines (ELM) for MLLib

2016-01-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-2262.
--
Resolution: Won't Fix

> Extreme Learning Machines (ELM) for MLLib
> -
>
> Key: SPARK-2262
> URL: https://issues.apache.org/jira/browse/SPARK-2262
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Erik Erlandson
>Assignee: Erik Erlandson
>  Labels: features
>
> MLLib has a gap in the NN space.   There's some good reason for this, as 
> batching gradient updates in traditional backprop training is known to not 
> perform well.
> However, Extreme Learning Machines (ELM) combine support for nonlinear 
> activation functions in a hidden layer with a batch-friendly linear training. 
>  There is also a body of ELM literature on various avenues for extension, 
> including multi-category classification, multiple hidden layers and adaptive 
> addition/deletion of hidden nodes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2243) Support multiple SparkContexts in the same JVM

2016-01-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-2243.
--
Resolution: Won't Fix

> Support multiple SparkContexts in the same JVM
> --
>
> Key: SPARK-2243
> URL: https://issues.apache.org/jira/browse/SPARK-2243
> Project: Spark
>  Issue Type: New Feature
>  Components: Block Manager, Spark Core
>Affects Versions: 0.7.0, 1.0.0, 1.1.0
>Reporter: Miguel Angel Fernandez Diaz
>
> We're developing a platform where we create several Spark contexts for 
> carrying out different calculations. Is there any restriction when using 
> several Spark contexts? We have two contexts, one for Spark calculations and 
> another one for Spark Streaming jobs. The next error arises when we first 
> execute a Spark calculation and, once the execution is finished, a Spark 
> Streaming job is launched:
> {code}
> 14/06/23 16:40:08 ERROR executor.Executor: Exception in task ID 0
> java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0
>   at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624)
>   at 
> org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156)
>   at 
> org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
>   at 
> org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:63)
>   at 
> org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:139)
>   at 
> java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:62)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:193)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:45)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Lost TID 0 (task 0.0:0)
> 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Loss was due to 
> java.io.FileNotFoundException
> java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0
>   at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624)
>   at 
> org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156)
>   at 
> org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInput

[jira] [Commented] (SPARK-2243) Support multiple SparkContexts in the same JVM

2016-01-16 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15103187#comment-15103187
 ] 

Sean Owen commented on SPARK-2243:
--

I don't think this itself is a good argument for lots of contexts. Dynamic 
allocation fits better.

> Support multiple SparkContexts in the same JVM
> --
>
> Key: SPARK-2243
> URL: https://issues.apache.org/jira/browse/SPARK-2243
> Project: Spark
>  Issue Type: New Feature
>  Components: Block Manager, Spark Core
>Affects Versions: 0.7.0, 1.0.0, 1.1.0
>Reporter: Miguel Angel Fernandez Diaz
>
> We're developing a platform where we create several Spark contexts for 
> carrying out different calculations. Is there any restriction when using 
> several Spark contexts? We have two contexts, one for Spark calculations and 
> another one for Spark Streaming jobs. The next error arises when we first 
> execute a Spark calculation and, once the execution is finished, a Spark 
> Streaming job is launched:
> {code}
> 14/06/23 16:40:08 ERROR executor.Executor: Exception in task ID 0
> java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0
>   at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624)
>   at 
> org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156)
>   at 
> org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
>   at 
> org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:63)
>   at 
> org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:139)
>   at 
> java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:62)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:193)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:45)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Lost TID 0 (task 0.0:0)
> 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Loss was due to 
> java.io.FileNotFoundException
> java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0
>   at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624)
>   at 
> org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156)
>   at 
> org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream

[jira] [Resolved] (SPARK-2095) sc.getExecutorCPUCounts()

2016-01-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-2095.
--
Resolution: Won't Fix

> sc.getExecutorCPUCounts()
> -
>
> Key: SPARK-2095
> URL: https://issues.apache.org/jira/browse/SPARK-2095
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Daniel Darabos
>Priority: Minor
>
> We can get the amount of total and free memory (via getExecutorMemoryStatus) 
> and blocks stored (via getExecutorStorageStatus) on the executors. I would 
> also like to be able to query the available CPU per executor. This would be 
> useful in dynamically deciding the number of partitions at the start of an 
> operation. What do you think?
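
As a sketch under the current APIs (which expose memory status but not CPU 
counts), one can at least derive an executor count and a rough partition 
target; `sc` is an existing SparkContext, and relying on spark.executor.cores 
being set is an assumption here.

{code}
// getExecutorMemoryStatus has one entry per executor plus the driver.
val executorCount = math.max(1, sc.getExecutorMemoryStatus.size - 1)
val coresPerExecutor = sc.getConf.getInt("spark.executor.cores", 1)

// Rule of thumb: a couple of tasks per available core.
val suggestedPartitions = executorCount * coresPerExecutor * 2
val data = sc.parallelize(1 to 1000000, suggestedPartitions)
println(s"executors=$executorCount, partitions=${data.partitions.length}")
{code}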



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2008) Enhance spark-ec2 to be able to add and remove slaves to an existing cluster

2016-01-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-2008.
--
Resolution: Won't Fix

> Enhance spark-ec2 to be able to add and remove slaves to an existing cluster
> 
>
> Key: SPARK-2008
> URL: https://issues.apache.org/jira/browse/SPARK-2008
> Project: Spark
>  Issue Type: New Feature
>  Components: EC2
>Affects Versions: 1.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Per [the discussion 
> here|http://apache-spark-user-list.1001560.n3.nabble.com/Having-spark-ec2-join-new-slaves-to-existing-cluster-td3783.html]:
> {quote}
> I would like to be able to use spark-ec2 to launch new slaves and add them to 
> an existing, running cluster. Similarly, I would also like to remove slaves 
> from an existing cluster.
> Use cases include:
> * Oh snap, I sized my cluster incorrectly. Let me add/remove some slaves.
> * During scheduled batch processing, I want to add some new slaves, perhaps 
> on spot instances. When that processing is done, I want to kill them. (Cruel, 
> I know.)
> I gather this is not possible at the moment. spark-ec2 appears to be able to 
> launch new slaves for an existing cluster only if the master is stopped. I 
> also do not see any ability to remove slaves from a cluster.
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1910) Add onBlockComplete API to receiver

2016-01-16 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15103185#comment-15103185
 ] 

Sean Owen commented on SPARK-1910:
--

[~hshreedharan] is this still active, do you think?

> Add onBlockComplete API to receiver
> ---
>
> Key: SPARK-1910
> URL: https://issues.apache.org/jira/browse/SPARK-1910
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Reporter: Hari Shreedharan
>
> This can allow the receiver to ACK all data that has already been 
> successfully stored by the block generator. This means the receiver's store 
> methods must now receive the block Id, so the receiver can recognize which 
> events are the ones that have been stored



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-1972) Add support for setting and visualizing custom task-related metrics

2016-01-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-1972.
--
Resolution: Won't Fix

> Add support for setting and visualizing custom task-related metrics
> ---
>
> Key: SPARK-1972
> URL: https://issues.apache.org/jira/browse/SPARK-1972
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 0.9.1
>Reporter: Kalpit Shah
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> Various RDDs may want to set/track custom metrics for improved monitoring and 
> performance tuning. For example:
> 1. A Task involving a JdbcRDD may want to track some metric related to JDBC 
> execution.
> 2. A Task involving a user-defined RDD may want to track some metric specific 
> to the user's application.
> We currently use TaskMetrics for tracking task-related metrics, which 
> provides no way of tracking custom task-related metrics. It is not good to 
> introduce a new field in TaskMetrics every time we want to track a custom 
> metric. That approach would be cumbersome and ugly. Besides, some of these 
> custom metrics may only make sense for a specific RDD subclass. Therefore, we 
> need TaskMetrics to provide a generic way to allow RDD subclasses to track 
> custom metrics when computing partitions.
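
For context, the closest existing mechanism is accumulators; a sketch of 
tracking JDBC-style counters from tasks without touching TaskMetrics follows. 
`sc` is an existing SparkContext, and the metric names are purely illustrative.

{code}
// Named accumulators aggregate per-task increments back on the driver.
val rowsFetched = sc.accumulator(0L, "jdbc.rowsFetched")
val fetchTimeMs = sc.accumulator(0L, "jdbc.fetchTimeMs")

sc.parallelize(1 to 100000, 8).foreachPartition { partition =>
  val start = System.currentTimeMillis()
  var rows = 0L
  partition.foreach(_ => rows += 1)   // stand-in for real per-row work
  rowsFetched += rows
  fetchTimeMs += (System.currentTimeMillis() - start)
}

println(s"rows=${rowsFetched.value}, fetch time=${fetchTimeMs.value} ms")
{code}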



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1747) check for Spark on Yarn ApplicationMaster split brain

2016-01-16 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15103184#comment-15103184
 ] 

Sean Owen commented on SPARK-1747:
--

[~tgraves] is there a chance this is going to result in a change?

> check for Spark on Yarn ApplicationMaster split brain
> -
>
> Key: SPARK-1747
> URL: https://issues.apache.org/jira/browse/SPARK-1747
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.0.0
>Reporter: Thomas Graves
>
> On YARN there is a possibility that applications can end up with an issue 
> referred to as "split brain".  The problem is that you have one Application 
> Master running, then something like a network split happens so that the AM can 
> no longer talk to the ResourceManager. After some time the ResourceManager 
> will start a new application attempt, assuming the old one failed, and you end 
> up with two Application Masters.  Note that the network split could prevent 
> the AM from talking to the RM while it is still running along, contacting 
> regular executors. 
> If the previous AM does not need any more resources from the RM, it could try 
> to commit. This could cause lots of problems where the second AM finishes and 
> tries to commit too. This could potentially result in data corruption.
> I believe this same issue can happen in Spark, since it uses the Hadoop 
> output formats.  One instance that has this issue is the FileOutputCommitter. 
>  It first writes to a temporary directory (task commit) and then moves the 
> file to the final directory (job commit).  The first AM could finish the job 
> commit and tell the user it's done; the user starts another downstream job, 
> but then the second AM comes in to do the job commit, and files the downstream 
> job is processing could disappear until the second AM finishes the job 
> commit. 
> This was fixed in MR by https://issues.apache.org/jira/browse/MAPREDUCE-4832



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-1270) An optimized gradient descent implementation

2016-01-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-1270.
--
Resolution: Won't Fix

> An optimized gradient descent implementation
> 
>
> Key: SPARK-1270
> URL: https://issues.apache.org/jira/browse/SPARK-1270
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.0.0
>Reporter: Xusen Yin
>  Labels: GradientDescent, MLLib,
>
> The current implementation of GradientDescent is inefficient in some aspects, 
> especially on high-latency networks. I propose a new implementation of 
> GradientDescent, which follows a parallelism model called 
> GradientDescentWithLocalUpdate, inspired by Jeff Dean's DistBelief and Eric 
> Xing's SSP. With a few modifications of runMiniBatchSGD, the 
> GradientDescentWithLocalUpdate can outperform the original sequential version 
> by about 4x without sacrificing accuracy, and can be easily adopted by most 
> classification and regression algorithms in MLlib.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-1200) Make it possible to use unmanaged AM in yarn-client mode

2016-01-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-1200.
--
Resolution: Won't Fix

I think Sandy (Ryza) reported this too but I doubt this is still on the radar

> Make it possible to use unmanaged AM in yarn-client mode
> 
>
> Key: SPARK-1200
> URL: https://issues.apache.org/jira/browse/SPARK-1200
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 0.9.0
>Reporter: Sandy Pérez González
>Assignee: Sandy Ryza
>
> Using an unmanaged AM in yarn-client mode would allow apps to start up 
> faster, but not requiring the container launcher AM to be launched on the 
> cluster.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-1153) Generalize VertexId in GraphX so that UUIDs can be used as vertex IDs.

2016-01-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-1153.
--
Resolution: Won't Fix

No activity in 2 years

> Generalize VertexId in GraphX so that UUIDs can be used as vertex IDs.
> --
>
> Key: SPARK-1153
> URL: https://issues.apache.org/jira/browse/SPARK-1153
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Affects Versions: 0.9.0
>Reporter: Deepak Nulu
>
> Currently, {{VertexId}} is a type synonym for {{Long}}. I would like to be 
> able to use {{UUID}} as the vertex ID type, because the data I want to process 
> with GraphX uses that type for its primary keys. Others might have a different 
> type for their primary keys. Generalizing {{VertexId}} (with a type class) 
> will help in such cases.
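
Until such a generalization exists, a workaround sketch is to assign each UUID 
a synthetic Long id and keep the UUID as the vertex attribute; `uuidEdges` is 
an assumed input of (source UUID, destination UUID) pairs.

{code}
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.rdd.RDD

// Build a GraphX graph from UUID-keyed edges while VertexId stays Long.
def graphFromUuidEdges(uuidEdges: RDD[(String, String)]): Graph[String, Int] = {
  val idByUuid: RDD[(String, Long)] =
    uuidEdges.flatMap { case (src, dst) => Seq(src, dst) }
      .distinct()
      .zipWithUniqueId()                       // ids are stable only within this run

  val vertices: RDD[(Long, String)] = idByUuid.map(_.swap)

  val edges: RDD[Edge[Int]] = uuidEdges
    .join(idByUuid)                            // resolve the source UUID
    .map { case (_, (dstUuid, srcId)) => (dstUuid, srcId) }
    .join(idByUuid)                            // resolve the destination UUID
    .map { case (_, (srcId, dstId)) => Edge(srcId, dstId, 1) }

  Graph(vertices, edges)
}
{code}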



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-967) start-slaves.sh uses local path from master on remote slave nodes

2016-01-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-967.
-
Resolution: Not A Problem

I think this is obsolete, or no longer a problem; these scripts always respect 
the local SPARK_HOME now.

> start-slaves.sh uses local path from master on remote slave nodes
> -
>
> Key: SPARK-967
> URL: https://issues.apache.org/jira/browse/SPARK-967
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 0.8.0, 0.8.1, 0.9.0
>Reporter: Evgeniy Tsvigun
>Priority: Trivial
>  Labels: script, starter
>
> If a slave node has a home path different from the master's, start-slave.sh 
> fails to start a worker instance; for other nodes it behaves as expected. In my case: 
> $ ./bin/start-slaves.sh 
> node05.dev.vega.ru: bash: line 0: cd: /usr/home/etsvigun/spark/bin/..: No 
> such file or directory
> node04.dev.vega.ru: org.apache.spark.deploy.worker.Worker running as 
> process 4796. Stop it first.
> node03.dev.vega.ru: org.apache.spark.deploy.worker.Worker running as 
> process 61348. Stop it first.
> I don't mention /usr/home anywhere; the only environment variable I set is 
> $SPARK_HOME, relative to $HOME on every node, which makes me think some 
> script takes `pwd` on the master and tries to use it on the slaves. 
> Spark version: fb6875dd5c9334802580155464cef9ac4d4cc1f0
> OS:  FreeBSD 8.4



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-1018) take and collect don't work on HadoopRDD

2016-01-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-1018.
--
Resolution: Not A Problem

> take and collect don't work on HadoopRDD
> 
>
> Key: SPARK-1018
> URL: https://issues.apache.org/jira/browse/SPARK-1018
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 0.8.1
>Reporter: Diana Carroll
>  Labels: hadoop
>
> I am reading a simple text file using hadoopFile as follows:
> var hrdd1 = 
> sc.hadoopFile("/home/training/testdata.txt",classOf[TextInputFormat], 
> classOf[LongWritable], classOf[Text])
> Testing using this simple text file:
> 001 this is line 1
> 002 this is line two
> 003 yet another line
> the data read is correct, as I can tell using println 
> scala> hrdd1.foreach(println):
> (0,001 this is line 1)
> (19,002 this is line two)
> (40,003 yet another line)
> But neither collect nor take works properly.  Take prints out the key (byte 
> offset) of the last (non-existent) line repeatedly:
> scala> hrdd1.take(4):
> res146: Array[(org.apache.hadoop.io.LongWritable, org.apache.hadoop.io.Text)] 
> = Array((61,), (61,), (61,))
> Collect is even worse: it complains:
> java.io.NotSerializableException: org.apache.hadoop.io.LongWritable at 
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183)
> The problem appears to be the LongWritable in both cases, because if I map to 
> a new RDD, converting the values from Text objects to strings, it works:
> scala> hrdd1.map(pair => (pair._1.toString,pair._2.toString)).take(4)
> res148: Array[(java.lang.String, java.lang.String)] = Array((0,001 this is 
> line 1), (19,002 this is line two), (40,003 yet another line))
> Seems to me either rdd.collect and rdd.take ought to handle non-serializable 
> types gracefully, or hadoopFile should return a mapped RDD that converts the 
> Hadoop types into the appropriate serializable Java objects.  (Or at the very 
> least, the docs for the API should indicate that the usual RDD methods don't 
> work on HadoopRDDs.)
> BTW, this behavior is the same for both the old and new API versions of 
> hadoopFile.  It also is the same whether the file is from HDFS or a plain old 
> text file.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-593) Send last few lines of failed standalone mode or Mesos task to master

2016-01-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-593.
-
Resolution: Won't Fix

> Send last few lines of failed standalone mode or Mesos task to master
> -
>
> Key: SPARK-593
> URL: https://issues.apache.org/jira/browse/SPARK-593
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 0.7.0
>Reporter: Denny Britz
>Assignee: Denny Britz
>Priority: Minor
>
> Send last few lines of failed standalone mode or Mesos task to master.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-925) Allow ec2 scripts to load default options from a json file

2016-01-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-925.
-
Resolution: Won't Fix

> Allow ec2 scripts to load default options from a json file
> --
>
> Key: SPARK-925
> URL: https://issues.apache.org/jira/browse/SPARK-925
> Project: Spark
>  Issue Type: Improvement
>  Components: EC2
>Affects Versions: 0.8.0
>Reporter: Shay Seng
>Priority: Minor
>
> The option list for the ec2 script can be a little irritating to type in, 
> especially things like the path to the identity file, region, zone, ami, etc.
> It would be nice if the ec2 script looked for an options.json file in the 
> following order: (1) CWD, (2) ~/spark-ec2, (3) the same dir as spark_ec2.py
> Something like:
> def get_defaults_from_options():
>   # Check to see if a options.json file exists, if so load it. 
>   # However, values in the options.json file can only override values in opts
>   # if the opts values are None or ""
>   # i.e. command-line options take precedence 
>   defaults = 
> {'aws-access-key-id':'','aws-secret-access-key':'','key-pair':'', 
> 'identity-file':'', 'region':'ap-southeast-1', 'zone':'', 
> 'ami':'','slaves':1, 'instance-type':'m1.large'}
>   # Look for options.json in directory cluster was called from
>   # Had to modify the spark_ec2 wrapper script since it mangles the pwd
>   startwd = os.environ['STARTWD']
>   if os.path.exists(os.path.join(startwd,"options.json")):
>   optionspath = os.path.join(startwd,"options.json")
>   else:
>   optionspath = os.path.join(os.getcwd(),"options.json")
>   
>   try:
> print "Loading options file: ", optionspath  
> with open (optionspath) as json_data:
> jdata = json.load(json_data)
> for k in jdata:
>   defaults[k]=jdata[k]
>   except IOError:
> print 'Warning: options.json file not loaded'
>   # Check permissions on identity-file, if defined, otherwise launch will 
> fail late and will be irritating
>   if defaults['identity-file']!='':
> st = os.stat(defaults['identity-file'])
> user_can_read = bool(st.st_mode & stat.S_IRUSR)
> grp_perms = bool(st.st_mode & stat.S_IRWXG)
> others_perm = bool(st.st_mode & stat.S_IRWXO)
> if (not user_can_read):
>   print "No read permission to read ", defaults['identify-file']
>   sys.exit(1)
> if (grp_perms or others_perm):
>   print "Permissions are too open, please chmod 600 file ", 
> defaults['identify-file']
>   sys.exit(1)
>   # if defaults contain AWS access id or private key, set it to environment. 
>   # required for use with boto to access the AWS console 
>   if defaults['aws-access-key-id'] != '':
> os.environ['AWS_ACCESS_KEY_ID']=defaults['aws-access-key-id'] 
>   if defaults['aws-secret-access-key'] != '':   
> os.environ['AWS_SECRET_ACCESS_KEY'] = defaults['aws-secret-access-key']
>   return defaults  
> 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12857) Streaming tab in web UI uses records and events interchangeably

2016-01-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12857:
--
Component/s: Documentation
 Issue Type: Improvement  (was: Bug)

It looks like "records" is used extensively throughout the streaming code, so 
this may be the best thing to standardize on. Indeed the event count comes from 
a field called "numRecords" for example. I'd look at the docs too to try to 
standardize all references in one go.

CC [~tdas] but not sure if he's around to look at this

> Streaming tab in web UI uses records and events interchangeably
> ---
>
> Key: SPARK-12857
> URL: https://issues.apache.org/jira/browse/SPARK-12857
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Streaming, Web UI
>Affects Versions: 2.0.0
>Reporter: Jacek Laskowski
>Priority: Minor
>
> In *Streaming* tab in web UI you can find (note _records_):
> {code}
> 708 completed batches, 6 records
> {code}
> However later in the Streaming tab I find no other uses of _records_ but 
> _events_, e.g. events/sec or in Input Size columns for Active and Completed 
> Batches. That can be confusing.
> But, in details of a batch, i.e. 
> http://localhost:4040/streaming/batch/?id=[id], you can find _records_ again, 
> i.e. "Input data size: 3 records".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9297) covar_pop and covar_samp aggregate functions

2016-01-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-9297:
-
Assignee: Liang-Chi Hsieh

> covar_pop and covar_samp aggregate functions
> 
>
> Key: SPARK-9297
> URL: https://issues.apache.org/jira/browse/SPARK-9297
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Liang-Chi Hsieh
> Fix For: 2.0.0
>
>
> A short introduction on how to build aggregate functions based on our new 
> interface can be found at 
> https://issues.apache.org/jira/browse/SPARK-4366?focusedCommentId=14639921&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14639921.
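
For readers who want the definitions, a small self-contained sketch of the statistics the two functions compute (just the math; this is not the new aggregate interface referenced above):

{code}
object Covariance {
  // covar_pop(x, y)  = sum((x_i - mean(x)) * (y_i - mean(y))) / n
  // covar_samp(x, y) = sum((x_i - mean(x)) * (y_i - mean(y))) / (n - 1)
  def covariance(xs: Seq[Double], ys: Seq[Double], population: Boolean): Double = {
    require(xs.size == ys.size && xs.size >= 2, "need at least two paired samples")
    val n = xs.size
    val (mx, my) = (xs.sum / n, ys.sum / n)
    val cross = xs.zip(ys).map { case (x, y) => (x - mx) * (y - my) }.sum
    if (population) cross / n else cross / (n - 1)
  }

  def main(args: Array[String]): Unit = {
    val (xs, ys) = (Seq(1.0, 2.0, 3.0), Seq(2.0, 4.0, 6.0))
    println(covariance(xs, ys, population = true))  // 1.333...
    println(covariance(xs, ys, population = false)) // 2.0
  }
}
{code}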



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12856) speed up hashCode of unsafe array

2016-01-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12856:
--
Assignee: Wenchen Fan

> speed up hashCode of unsafe array
> -
>
> Key: SPARK-12856
> URL: https://issues.apache.org/jira/browse/SPARK-12856
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7703) Task failure caused by block fetch failure in BlockManager.doGetRemote() when using TorrentBroadcast

2016-01-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-7703.
--
   Resolution: Duplicate
Fix Version/s: (was: 1.6.0)

Please resolve as duplicate then

> Task failure caused by block fetch failure in BlockManager.doGetRemote() when 
> using TorrentBroadcast
> 
>
> Key: SPARK-7703
> URL: https://issues.apache.org/jira/browse/SPARK-7703
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.1, 1.3.1
> Environment: Red Hat Enterprise Linux Server release 7.0 (Maipo)
> Spark 1.3.1 Release
>Reporter: Hailong Wen
>
> I am from the IBM Platform Symphony team, and we are working to integrate Spark 
> with our EGO to provide a fine-grained dynamic allocation resource manager. 
> We found a defect in the current implementation of BlockManager.doGetRemote():
> {noformat}
>   private def doGetRemote(blockId: BlockId, asBlockResult: Boolean): 
> Option[Any] = {
> require(blockId != null, "BlockId is null")
> val locations = Random.shuffle(master.getLocations(blockId)) 
> <--- Issue2: locations may be out of date
> for (loc <- locations) {
>   logDebug(s"Getting remote block $blockId from $loc")
>   val data = blockTransferService.fetchBlockSync(
> loc.host, loc.port, loc.executorId, blockId.toString).nioByteBuffer() 
>  <--- Issue1: This statement is not in try/catch
>   if (data != null) {
> if (asBlockResult) {
>   return Some(new BlockResult(
> dataDeserialize(blockId, data),
> DataReadMethod.Network,
> data.limit()))
> } else {
>   return Some(data)
> }
>   }
>   logDebug(s"The value of block $blockId is null")
> }
> logDebug(s"Block $blockId not found")
> None
>   }
> {noformat}
> * Issue 1: Although the block fetch uses "for" to try all available 
> locations, the fetch call is not guarded by a "Try" block. When an exception 
> occurs, this method throws the error directly instead of trying the other 
> block locations. The uncaught exception causes a task failure.
> * Issue 2: The "locations" value is acquired once, before fetching; however, in a 
> dynamic allocation environment the block locations may change.
> We hit both issues in our use case, where executors exit after all their 
> assigned tasks are done. We *occasionally* get the following error (issue 1):
> {noformat}
> 15/05/13 10:28:35 INFO Executor: Running task 27.0 in stage 0.0 (TID 27)
> 15/05/13 10:28:35 DEBUG Executor: Task 26's epoch is 0
> 15/05/13 10:28:35 DEBUG Executor: Task 28's epoch is 0
> 15/05/13 10:28:35 DEBUG Executor: Task 27's epoch is 0
> 15/05/13 10:28:35 DEBUG BlockManager: Getting local block broadcast_0
> 15/05/13 10:28:35 DEBUG BlockManager: Block broadcast_0 not registered locally
> 15/05/13 10:28:35 INFO TorrentBroadcast: Started reading broadcast variable 0
> 15/05/13 10:28:35 DEBUG TorrentBroadcast: Reading piece broadcast_0_piece0 of 
> broadcast_0
> 15/05/13 10:28:35 DEBUG BlockManager: Getting local block broadcast_0_piece0 
> as bytes
> 15/05/13 10:28:35 DEBUG BlockManager: Block broadcast_0_piece0 not registered 
> locally
> 15/05/13 10:28:35 DEBUG BlockManager: Getting remote block broadcast_0_piece0 
> as bytes
> 15/05/13 10:28:35 DEBUG BlockManager: Getting remote block broadcast_0_piece0 
> from BlockManagerId(c390c311-bd97-4a99-bcb9-b32fd3dede17, sparkbj01, 37599)
> 15/05/13 10:28:35 TRACE NettyBlockTransferService: Fetch blocks from 
> sparkbj01:37599 (executor id c390c311-bd97-4a99-bcb9-b32fd3dede17)
> 15/05/13 10:28:35 DEBUG TransportClientFactory: Creating new connection to 
> sparkbj01/9.111.254.195:37599
> 15/05/13 10:28:35 ERROR RetryingBlockFetcher: Exception while beginning fetch 
> of 1 outstanding blocks 
> java.io.IOException: Failed to connect to sparkbj01/9.111.254.195:37599
>   at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:191)
>   at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156)
>   at 
> org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:78)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:120)
>   at 
> org.apache.spark.network.netty.NettyBlockTransferService.fetchBlocks(NettyBlockTransferService.scala:87)
>   at 
> org.apache.spark.network.BlockTransferService.fetchBlockSync(BlockTransferService.scala:89)
>   at 
> org.apache.spark.storag
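
For reference, a minimal self-contained sketch of the guarded loop suggested for Issue 1 above; this is not the actual Spark fix, and the fetch function here is a stand-in for blockTransferService.fetchBlockSync:

{code}
import scala.util.{Success, Try}

object GuardedFetch {
  // Try each location in turn; a failed or null fetch falls through to the
  // next location instead of propagating the exception and failing the task.
  def fetchFirst[L, B](locations: Seq[L])(fetch: L => B): Option[B] =
    locations.iterator
      .map(loc => Try(fetch(loc)))
      .collectFirst { case Success(data) if data != null => data }

  def main(args: Array[String]): Unit = {
    val locations = Seq("hostA:37599", "hostB:37600")
    val result = fetchFirst(locations) { loc =>
      // Stand-in fetch: the first host is unreachable, as in the log above.
      if (loc.startsWith("hostA")) throw new java.io.IOException(s"Failed to connect to $loc")
      s"bytes of broadcast_0_piece0 from $loc"
    }
    println(result) // Some(bytes of broadcast_0_piece0 from hostB:37600)
  }
}
{code}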

[jira] [Reopened] (SPARK-7703) Task failure caused by block fetch failure in BlockManager.doGetRemote() when using TorrentBroadcast

2016-01-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reopened SPARK-7703:
--

> Task failure caused by block fetch failure in BlockManager.doGetRemote() when 
> using TorrentBroadcast
> 
>
> Key: SPARK-7703
> URL: https://issues.apache.org/jira/browse/SPARK-7703
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.1, 1.3.1
> Environment: Red Hat Enterprise Linux Server release 7.0 (Maipo)
> Spark 1.3.1 Release
>Reporter: Hailong Wen
>
> I am from the IBM Platform Symphony team, and we are working to integrate Spark 
> with our EGO to provide a fine-grained dynamic allocation resource manager. 
> We found a defect in the current implementation of BlockManager.doGetRemote():
> {noformat}
>   private def doGetRemote(blockId: BlockId, asBlockResult: Boolean): 
> Option[Any] = {
> require(blockId != null, "BlockId is null")
> val locations = Random.shuffle(master.getLocations(blockId)) 
> <--- Issue2: locations may be out of date
> for (loc <- locations) {
>   logDebug(s"Getting remote block $blockId from $loc")
>   val data = blockTransferService.fetchBlockSync(
> loc.host, loc.port, loc.executorId, blockId.toString).nioByteBuffer() 
>  <--- Issue1: This statement is not in try/catch
>   if (data != null) {
> if (asBlockResult) {
>   return Some(new BlockResult(
> dataDeserialize(blockId, data),
> DataReadMethod.Network,
> data.limit()))
> } else {
>   return Some(data)
> }
>   }
>   logDebug(s"The value of block $blockId is null")
> }
> logDebug(s"Block $blockId not found")
> None
>   }
> {noformat}
> * Issue 1: Although the block fetch uses "for" to try all available 
> locations, the fetch call is not guarded by a "Try" block. When an exception 
> occurs, this method throws the error directly instead of trying the other 
> block locations. The uncaught exception causes a task failure.
> * Issue 2: The "locations" value is acquired once, before fetching; however, in a 
> dynamic allocation environment the block locations may change.
> We hit both issues in our use case, where executors exit after all their 
> assigned tasks are done. We *occasionally* get the following error (issue 1):
> {noformat}
> 15/05/13 10:28:35 INFO Executor: Running task 27.0 in stage 0.0 (TID 27)
> 15/05/13 10:28:35 DEBUG Executor: Task 26's epoch is 0
> 15/05/13 10:28:35 DEBUG Executor: Task 28's epoch is 0
> 15/05/13 10:28:35 DEBUG Executor: Task 27's epoch is 0
> 15/05/13 10:28:35 DEBUG BlockManager: Getting local block broadcast_0
> 15/05/13 10:28:35 DEBUG BlockManager: Block broadcast_0 not registered locally
> 15/05/13 10:28:35 INFO TorrentBroadcast: Started reading broadcast variable 0
> 15/05/13 10:28:35 DEBUG TorrentBroadcast: Reading piece broadcast_0_piece0 of 
> broadcast_0
> 15/05/13 10:28:35 DEBUG BlockManager: Getting local block broadcast_0_piece0 
> as bytes
> 15/05/13 10:28:35 DEBUG BlockManager: Block broadcast_0_piece0 not registered 
> locally
> 15/05/13 10:28:35 DEBUG BlockManager: Getting remote block broadcast_0_piece0 
> as bytes
> 15/05/13 10:28:35 DEBUG BlockManager: Getting remote block broadcast_0_piece0 
> from BlockManagerId(c390c311-bd97-4a99-bcb9-b32fd3dede17, sparkbj01, 37599)
> 15/05/13 10:28:35 TRACE NettyBlockTransferService: Fetch blocks from 
> sparkbj01:37599 (executor id c390c311-bd97-4a99-bcb9-b32fd3dede17)
> 15/05/13 10:28:35 DEBUG TransportClientFactory: Creating new connection to 
> sparkbj01/9.111.254.195:37599
> 15/05/13 10:28:35 ERROR RetryingBlockFetcher: Exception while beginning fetch 
> of 1 outstanding blocks 
> java.io.IOException: Failed to connect to sparkbj01/9.111.254.195:37599
>   at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:191)
>   at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156)
>   at 
> org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:78)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:120)
>   at 
> org.apache.spark.network.netty.NettyBlockTransferService.fetchBlocks(NettyBlockTransferService.scala:87)
>   at 
> org.apache.spark.network.BlockTransferService.fetchBlockSync(BlockTransferService.scala:89)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:599)
>   at 
> org.apache.spark.

[jira] [Updated] (SPARK-12846) Follow up SPARK-12707, Update documentation and other related code

2016-01-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12846:
--
Target Version/s:   (was: 2.0.0)
 Component/s: Documentation

> Follow up SPARK-12707, Update documentation and other related code
> --
>
> Key: SPARK-12846
> URL: https://issues.apache.org/jira/browse/SPARK-12846
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Jeff Zhang
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12846) Follow up SPARK-12707, Update documentation and other related code

2016-01-16 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15103150#comment-15103150
 ] 

Sean Owen commented on SPARK-12846:
---

The title isn't descriptive [~jeffzhang] -- which docs? The issue is already linked to a 
JIRA, so there's no need to repeat the JIRA number in the title.

> Follow up SPARK-12707, Update documentation and other related code
> --
>
> Key: SPARK-12846
> URL: https://issues.apache.org/jira/browse/SPARK-12846
> Project: Spark
>  Issue Type: Sub-task
>Reporter: Jeff Zhang
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12834) Use type conversion instead of Ser/De of Pickle to transform JavaArray and JavaList

2016-01-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12834:
--
Component/s: PySpark

> Use type conversion instead of Ser/De of Pickle to transform JavaArray and 
> JavaList
> ---
>
> Key: SPARK-12834
> URL: https://issues.apache.org/jira/browse/SPARK-12834
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Xusen Yin
>
> According to the Ser/De code in Python side:
> {code:title=StringIndexerModel|theme=FadeToGrey|linenumbers=true|language=python|firstline=0001|collapse=false}
>   def _java2py(sc, r, encoding="bytes"):
> if isinstance(r, JavaObject):
> clsName = r.getClass().getSimpleName()
> # convert RDD into JavaRDD
> if clsName != 'JavaRDD' and clsName.endswith("RDD"):
> r = r.toJavaRDD()
> clsName = 'JavaRDD'
> if clsName == 'JavaRDD':
> jrdd = sc._jvm.SerDe.javaToPython(r)
> return RDD(jrdd, sc)
> if clsName == 'DataFrame':
> return DataFrame(r, SQLContext.getOrCreate(sc))
> if clsName in _picklable_classes:
> r = sc._jvm.SerDe.dumps(r)
> elif isinstance(r, (JavaArray, JavaList)):
> try:
> r = sc._jvm.SerDe.dumps(r)
> except Py4JJavaError:
> pass  # not pickable
> if isinstance(r, (bytearray, bytes)):
> r = PickleSerializer().loads(bytes(r), encoding=encoding)
> return r
> {code}
> We use SerDe.dumps to serialize JavaArray and JavaList in PythonMLLibAPI, 
> then deserialize them with PickleSerializer on the Python side. However, there is 
> no need to transform them in such an inefficient way. Instead, we can 
> use type conversion, e.g. list(JavaArray) or list(JavaList). 
> What's more, there is an issue with Ser/De of Scala Array, as I noted in 
> https://issues.apache.org/jira/browse/SPARK-12780



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12831) akka.remote.OversizedPayloadException on DirectTaskResult

2016-01-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12831:
--
Target Version/s:   (was: 1.5.2)

> akka.remote.OversizedPayloadException on DirectTaskResult
> -
>
> Key: SPARK-12831
> URL: https://issues.apache.org/jira/browse/SPARK-12831
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Brett Stime
>
> Getting the following error in my executor logs:
> ERROR akka.ErrorMonitor: Transient association error (association remains 
> live)
> akka.remote.OversizedPayloadException: Discarding oversized payload sent to 
> Actor[akka.tcp://sparkDriver@172.21.25.199:51562/user/CoarseGrainedScheduler#-2039547722]:
>  max allowed size 134217728 bytes, actual size of encoded class 
> org.apache.spark.rpc.akka.AkkaMessage was 134419636 bytes.
> Seems like the quick fix would be to make AkkaUtils.reservedSizeBytes a 
> little bigger--maybe proportional to spark.akka.frameSize and/or user 
> configurable.
> A more robust solution might be to catch OversizedPayloadException and retry 
> using the BlockManager.
> I should also mention that this has the effect of stalling the entire job (my 
> use case also requires fairly liberal timeouts). For now, I'll see if setting 
> spark.akka.frameSize a little smaller gives me more proportional overhead.
> Thanks.
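
For anyone hitting the same limit, a hedged example of how the setting is passed in; the value is in MB, and whether a larger or smaller value helps depends on the workload, as noted above:

{code}
import org.apache.spark.SparkConf

object FrameSizeExample {
  def main(args: Array[String]): Unit = {
    // The log above shows a payload of ~134 MB rejected against a 128 MB limit.
    // This only shows where the knob lives; pick the value for your own job.
    val conf = new SparkConf()
      .setAppName("frame-size-example")
      .set("spark.akka.frameSize", "256")
    println(conf.get("spark.akka.frameSize"))
  }
}
{code}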



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12831) akka.remote.OversizedPayloadException on DirectTaskResult

2016-01-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12831:
--
Component/s: Spark Core

> akka.remote.OversizedPayloadException on DirectTaskResult
> -
>
> Key: SPARK-12831
> URL: https://issues.apache.org/jira/browse/SPARK-12831
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Brett Stime
>
> Getting the following error in my executor logs:
> ERROR akka.ErrorMonitor: Transient association error (association remains 
> live)
> akka.remote.OversizedPayloadException: Discarding oversized payload sent to 
> Actor[akka.tcp://sparkDriver@172.21.25.199:51562/user/CoarseGrainedScheduler#-2039547722]:
>  max allowed size 134217728 bytes, actual size of encoded class 
> org.apache.spark.rpc.akka.AkkaMessage was 134419636 bytes.
> Seems like the quick fix would be to make AkkaUtils.reservedSizeBytes a 
> little bigger--maybe proportional to spark.akka.frameSize and/or user 
> configurable.
> A more robust solution might be to catch OversizedPayloadException and retry 
> using the BlockManager.
> I should also mention that this has the effect of stalling the entire job (my 
> use case also requires fairly liberal timeouts). For now, I'll see if setting 
> spark.akka.frameSize a little smaller gives me more proportional overhead.
> Thanks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12827) Configurable bind address for WebUI

2016-01-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12827:
--
Component/s: Web UI
 Issue Type: Improvement  (was: Bug)

> Configurable bind address for WebUI
> ---
>
> Key: SPARK-12827
> URL: https://issues.apache.org/jira/browse/SPARK-12827
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Reporter: Zee Chen
>Priority: Minor
>
> The WebUI is currently hard-coded to bind to all interfaces:
> {code}
> serverInfo = Some(startJettyServer("0.0.0.0", port, handlers, conf, name))
> {code}
> Make the bind address configurable.
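
A minimal sketch of what the change could look like; the config key used here ({{spark.ui.bindAddress}}) is hypothetical and only illustrates reading the bind host from SparkConf instead of hard-coding it:

{code}
import org.apache.spark.SparkConf

object UiBindAddressExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    // Hypothetical key; "0.0.0.0" (all interfaces) stays the default, so the
    // current behaviour is preserved unless a user overrides it.
    val bindHost = conf.get("spark.ui.bindAddress", "0.0.0.0")
    println(s"startJettyServer would be called with host $bindHost")
  }
}
{code}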



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12825) Spark-submit Jar URL loading fails on redirect

2016-01-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12825:
--
Component/s: Spark Submit

> Spark-submit Jar URL loading fails on redirect
> --
>
> Key: SPARK-12825
> URL: https://issues.apache.org/jira/browse/SPARK-12825
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.6.0
>Reporter: Alex Nederlof
>Priority: Minor
>
> When you use spark-submit and pass the jar as a URL, it fails when the URL 
> redirects. 
> The log prints: 
> {code}
> 16/01/14 14:26:43 INFO Utils: Fetching http://myUrl/my.jar to 
> /opt/spark/spark-1.6.0-bin-hadoop2.6/work/driver-20160114142642-0010/fetchFileTemp8495494631100918254.tmp
> {code}
> However, that file is never created; instead a file called "redirect" is created, 
> with the appropriate content. 
> After that, the driver fails with
> {code}
> 16/01/14 14:26:43 WARN Worker: Driver driver-20160114142642-0010 failed with 
> unrecoverable exception: java.lang.Exception: Did not see expected jar my.jar 
> in /opt/spark/spark-1.6.0-bin-hadoop2.6/work/driver-20160114142642-0010
> {code}
> Here's the related code:
> https://github.com/apache/spark/blob/56cdbd654d54bf07a063a03a5c34c4165818eeb2/core/src/main/scala/org/apache/spark/util/Utils.scala#L583-L603
> My Scala chops aren't up to this challenge, otherwise I would have made a 
> patch.
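
As a point of reference, a self-contained sketch (not Spark's Utils code) of resolving HTTP redirects before downloading, which is roughly what the fetch path would need to do:

{code}
import java.net.{HttpURLConnection, URL}

object ResolveRedirect {
  // Follow up to maxHops 3xx responses and return the final URL to download.
  def finalUrl(url: String, maxHops: Int = 5): String = {
    var current = url
    var hops = 0
    var done = false
    while (!done && hops < maxHops) {
      val conn = new URL(current).openConnection().asInstanceOf[HttpURLConnection]
      conn.setInstanceFollowRedirects(false)
      conn.setRequestMethod("HEAD")
      val code = conn.getResponseCode
      val location = conn.getHeaderField("Location")
      conn.disconnect()
      if (code >= 300 && code < 400 && location != null) {
        current = location
        hops += 1
      } else {
        done = true
      }
    }
    current
  }

  def main(args: Array[String]): Unit =
    // Placeholder URL from the report; any real redirecting URL works here.
    println(finalUrl("http://myUrl/my.jar"))
}
{code}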



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11293) Spillable collections leak shuffle memory

2016-01-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11293:
--
Fix Version/s: (was: 1.6.0)

> Spillable collections leak shuffle memory
> -
>
> Key: SPARK-11293
> URL: https://issues.apache.org/jira/browse/SPARK-11293
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.1, 1.4.1, 1.5.1, 1.6.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Critical
>
> I discovered multiple leaks of shuffle memory while working on my memory 
> manager consolidation patch, which added the ability to do strict memory leak 
> detection for the bookkeeping that used to be performed by the 
> ShuffleMemoryManager. This uncovered a handful of places where tasks can 
> acquire execution/shuffle memory but never release it, starving themselves of 
> memory.
> Problems that I found:
> * {{ExternalSorter.stop()}} should release the sorter's shuffle/execution 
> memory.
> * BlockStoreShuffleReader should call {{ExternalSorter.stop()}} using a 
> {{CompletionIterator}}.
> * {{ExternalAppendOnlyMap}} exposes no equivalent of {{stop()}} for freeing 
> its resources.
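
A minimal sketch of the CompletionIterator pattern mentioned above (not Spark's own implementation): wrap the rows iterator so that a cleanup callback such as {{ExternalSorter.stop()}} runs as soon as the iterator is exhausted:

{code}
// Runs `completion` exactly once, as soon as the wrapped iterator is exhausted.
class CompletionIterator[A](sub: Iterator[A], completion: () => Unit) extends Iterator[A] {
  private var completed = false
  override def hasNext: Boolean = {
    val more = sub.hasNext
    if (!more && !completed) { completed = true; completion() }
    more
  }
  override def next(): A = sub.next()
}

object CompletionIteratorExample {
  def main(args: Array[String]): Unit = {
    val rows = new CompletionIterator(Iterator(1, 2, 3), () => println("sorter.stop(): memory released"))
    println(rows.sum) // the cleanup runs right after the last row is consumed
  }
}
{code}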



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11293) Spillable collections leak shuffle memory

2016-01-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11293:
--
Target Version/s:   (was: 1.6.0)

> Spillable collections leak shuffle memory
> -
>
> Key: SPARK-11293
> URL: https://issues.apache.org/jira/browse/SPARK-11293
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.1, 1.4.1, 1.5.1, 1.6.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Critical
>
> I discovered multiple leaks of shuffle memory while working on my memory 
> manager consolidation patch, which added the ability to do strict memory leak 
> detection for the bookkeeping that used to be performed by the 
> ShuffleMemoryManager. This uncovered a handful of places where tasks can 
> acquire execution/shuffle memory but never release it, starving themselves of 
> memory.
> Problems that I found:
> * {{ExternalSorter.stop()}} should release the sorter's shuffle/execution 
> memory.
> * BlockStoreShuffleReader should call {{ExternalSorter.stop()}} using a 
> {{CompletionIterator}}.
> * {{ExternalAppendOnlyMap}} exposes no equivalent of {{stop()}} for freeing 
> its resources.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12857) Streaming tab in web UI uses records and events interchangeably

2016-01-16 Thread Jacek Laskowski (JIRA)
Jacek Laskowski created SPARK-12857:
---

 Summary: Streaming tab in web UI uses records and events 
interchangeably
 Key: SPARK-12857
 URL: https://issues.apache.org/jira/browse/SPARK-12857
 Project: Spark
  Issue Type: Bug
  Components: Streaming, Web UI
Affects Versions: 2.0.0
Reporter: Jacek Laskowski
Priority: Minor


In *Streaming* tab in web UI you can find (note _records_):

{code}
708 completed batches, 6 records
{code}

However later in the Streaming tab I find no other uses of _records_ but 
_events_, e.g. events/sec or in Input Size columns for Active and Completed 
Batches. That can be confusing.

But, in details of a batch, i.e. 
http://localhost:4040/streaming/batch/?id=[id], you can find _records_ again, 
i.e. "Input data size: 3 records".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12722) Typo in Spark Pipeline example

2016-01-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12722:
--
Assignee: Jeff Lam

> Typo in Spark Pipeline example
> --
>
> Key: SPARK-12722
> URL: https://issues.apache.org/jira/browse/SPARK-12722
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 1.6.0
>Reporter: Tom Chan
>Assignee: Jeff Lam
>Priority: Trivial
>  Labels: starter
> Fix For: 1.6.1, 2.0.0
>
>
> There is a typo in the Pipeline example,
> http://spark.apache.org/docs/latest/ml-guide.html#example-pipeline
> Namely, the line
> val sameModel = Pipeline.load("/tmp/spark-logistic-regression-model")
> should be
> val sameModel = PipelineModel.load("/tmp/spark-logistic-regression-model")
> I was trying to do a PR but somehow there is an error when I try to build the 
> documentation locally, so I hesitate to submit a PR. Someone who is already 
> contributing to documentation should be able to fix it in no time. Thanks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12722) Typo in Spark Pipeline example

2016-01-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-12722.
---
   Resolution: Fixed
Fix Version/s: 1.6.1
   2.0.0

Issue resolved by pull request 10769
[https://github.com/apache/spark/pull/10769]

> Typo in Spark Pipeline example
> --
>
> Key: SPARK-12722
> URL: https://issues.apache.org/jira/browse/SPARK-12722
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 1.6.0
>Reporter: Tom Chan
>Priority: Trivial
>  Labels: starter
> Fix For: 2.0.0, 1.6.1
>
>
> There is a typo in the Pipeline example,
> http://spark.apache.org/docs/latest/ml-guide.html#example-pipeline
> Namely, the line
> val sameModel = Pipeline.load("/tmp/spark-logistic-regression-model")
> should be
> val sameModel = PipelineModel.load("/tmp/spark-logistic-regression-model")
> I was trying to do a PR but somehow there is an error when I try to build the 
> documentation locally, so I hesitate to submit a PR. Someone who is already 
> contributing to documentation should be able to fix it in no time. Thanks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12748) Failed to create HiveContext in SparkSql

2016-01-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-12748.
---
Resolution: Not A Problem

> Failed to create HiveContext in SparkSql
> 
>
> Key: SPARK-12748
> URL: https://issues.apache.org/jira/browse/SPARK-12748
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 1.6.0
>Reporter: Ujjal Satpathy
>Priority: Critical
>
> Hi,
> I am trying to create HiveContext using Java API in Spark Sql (ver 1.6.0).
> HiveContext hiveContext = new org.apache.spark.sql.hive.HiveContext(sc);
> But it's not creating any HiveContext and it throws the exception below:
> java.sql.SQLException: Failed to start database 'metastore_db' with class 
> loader org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@17a1ba8d.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-12748) Failed to create HiveContext in SparkSql

2016-01-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reopened SPARK-12748:
---

> Failed to create HiveContext in SparkSql
> 
>
> Key: SPARK-12748
> URL: https://issues.apache.org/jira/browse/SPARK-12748
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 1.6.0
>Reporter: Ujjal Satpathy
>Priority: Critical
>
> Hi,
> I am trying to create HiveContext using Java API in Spark Sql (ver 1.6.0).
> HiveContext hiveContext = new org.apache.spark.sql.hive.HiveContext(sc);
> But it's not creating any HiveContext and it throws the exception below:
> java.sql.SQLException: Failed to start database 'metastore_db' with class 
> loader org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@17a1ba8d.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12748) Failed to create HiveContext in SparkSql

2016-01-16 Thread Ujjal Satpathy (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15103092#comment-15103092
 ] 

Ujjal Satpathy commented on SPARK-12748:


Issue is resolved

> Failed to create HiveContext in SparkSql
> 
>
> Key: SPARK-12748
> URL: https://issues.apache.org/jira/browse/SPARK-12748
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 1.6.0
>Reporter: Ujjal Satpathy
>Priority: Critical
>
> Hi,
> I am trying to create HiveContext using Java API in Spark Sql (ver 1.6.0).
> HiveContext hiveContext = new org.apache.spark.sql.hive.HiveContext(sc);
> But it's not creating any HiveContext and it throws the exception below:
> java.sql.SQLException: Failed to start database 'metastore_db' with class 
> loader org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@17a1ba8d.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-12748) Failed to create HiveContext in SparkSql

2016-01-16 Thread Ujjal Satpathy (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ujjal Satpathy closed SPARK-12748.
--
Resolution: Fixed

> Failed to create HiveContext in SparkSql
> 
>
> Key: SPARK-12748
> URL: https://issues.apache.org/jira/browse/SPARK-12748
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 1.6.0
>Reporter: Ujjal Satpathy
>Priority: Critical
>
> Hi,
> I am trying to create HiveContext using Java API in Spark Sql (ver 1.6.0).
> HiveContext hiveContext = new org.apache.spark.sql.hive.HiveContext(sc);
> But it's not creating any HiveContext and it throws the exception below:
> java.sql.SQLException: Failed to start database 'metastore_db' with class 
> loader org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@17a1ba8d.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10735) CatalystTypeConverters MatchError converting RDD with custom object to dataframe

2016-01-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15103076#comment-15103076
 ] 

Apache Spark commented on SPARK-10735:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/10786

> CatalystTypeConverters MatchError converting RDD with custom object to 
> dataframe
> 
>
> Key: SPARK-10735
> URL: https://issues.apache.org/jira/browse/SPARK-10735
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Thomas Graves
>Priority: Critical
>
> In Spark 1.5.0 we are now seeing an exception when converting an RDD with a 
> custom object to a DataFrame. Note this works with Spark 1.4.1.
> RDD<BasicData>, where the BasicData class has a field ArrayList<Beacon> and 
> Beacon is a user-defined class. Converting the RDD to a DataFrame causes the issue:
> {code}
> 15/09/21 18:53:16 ERROR executor.Executor: Managed memory leak detected; size 
> = 2097152 bytes, TID = 408
> 15/09/21 18:53:16 ERROR executor.Executor: Exception in task 0.0 in stage 4.0 
> (TID 408)
> scala.MatchError: foo.Beacon@5c289b39 (of class foo.Beacon)
> at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250)
>   
> at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:245)
>   
> at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
> at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toCatalystImpl(CatalystTypeConverters.scala:164)
>
> at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toCatalystImpl(CatalystTypeConverters.scala:148)
>
> at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
> at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:396)
> at 
> org.apache.spark.sql.SQLContext$$anonfun$9$$anonfun$apply$1$$anonfun$apply$2.apply(SQLContext.scala:494)
> at 
> org.apache.spark.sql.SQLContext$$anonfun$9$$anonfun$apply$1$$anonfun$apply$2.apply(SQLContext.scala:494)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
> at 
> org.apache.spark.sql.SQLContext$$anonfun$9$$anonfun$apply$1.apply(SQLContext.scala:494)
> at 
> org.apache.spark.sql.SQLContext$$anonfun$9$$anonfun$apply$1.apply(SQLContext.scala:492)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:372)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
>

[jira] [Updated] (SPARK-12845) During join Spark should pushdown predicates on joining column to both tables

2016-01-16 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-12845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-12845:
---
Description: 
I have the following issue.
I'm joining two tables with a where condition:
{code}
select * from t1 join t2 on t1.id1 = t2.id2 where t1.id = 1234
{code}
In this query the predicate is only pushed down to t1.
To get predicates on both tables I have to run the following query:
{code}
select * from t1 join t2 on t1.id1 = t2.id2 where t1.id = 1234 and t2.id2 = 1234
{code}

Spark should present the same behaviour for both queries.

During an inner join there is no problem with NULL values.
On LEFT / RIGHT joins, predicates can be pushed down only if they are on the 
correct side. 
On FULL OUTER joins we need to deal with null values.

  was:
I have the following issue.
I'm joining two tables with a where condition:
{code}
select * from t1 join t2 on t1.id1 = t2.id2 where t1.id = 1234
{code}
In this query the predicate is only pushed down to t1.
To get predicates on both tables I have to run the following query:
{code}
select * from t1 join t2 on t1.id1 = t2.id2 where t1.id = 1234 and t2.id2 = 1234
{code}

Spark should present the same behaviour for both queries.

During an inner join there is no problem with NULL values.
On LEFT / RIGHT joins, predicates can be pushed down only if they are on the 
right side. 
On OUTER joins we need to deal with null values.


> During join Spark should pushdown predicates on joining column to both tables
> -
>
> Key: SPARK-12845
> URL: https://issues.apache.org/jira/browse/SPARK-12845
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Maciej Bryński
>
> I have the following issue.
> I'm joining two tables with a where condition:
> {code}
> select * from t1 join t2 on t1.id1 = t2.id2 where t1.id = 1234
> {code}
> In this query the predicate is only pushed down to t1.
> To get predicates on both tables I have to run the following query:
> {code}
> select * from t1 join t2 on t1.id1 = t2.id2 where t1.id = 1234 and t2.id2 = 
> 1234
> {code}
> Spark should present the same behaviour for both queries.
> During an inner join there is no problem with NULL values.
> On LEFT / RIGHT joins, predicates can be pushed down only if they are on the 
> correct side. 
> On FULL OUTER joins we need to deal with null values.
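
Until the optimizer infers this itself, a minimal sketch of the manual workaround in the DataFrame API (written for spark-shell, where sqlContext is predefined; table and column names follow the example above):

{code}
// Apply the equivalent predicate to both sides of the join explicitly,
// mirroring the second SQL query above.
val t1 = sqlContext.table("t1")
val t2 = sqlContext.table("t2")

val result = t1.filter(t1("id1") === 1234)
  .join(t2.filter(t2("id2") === 1234), t1("id1") === t2("id2"))
{code}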



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12845) During join Spark should pushdown predicates on joining column to both tables

2016-01-16 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-12845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-12845:
---
Description: 
I have the following issue.
I'm joining two tables with a where condition:
{code}
select * from t1 join t2 on t1.id1 = t2.id2 where t1.id = 1234
{code}
In this query the predicate is only pushed down to t1.
To get predicates on both tables I have to run the following query:
{code}
select * from t1 join t2 on t1.id1 = t2.id2 where t1.id = 1234 and t2.id2 = 1234
{code}

Spark should present the same behaviour for both queries.

During an inner join there is no problem with NULL values.
On LEFT / RIGHT joins, predicates can be pushed down only if they are on the 
right side. 
On OUTER joins we need to deal with null values.

  was:
I have the following issue.
I'm joining two tables with a where condition:
{code}
select * from t1 join t2 on t1.id1 = t2.id2 where t1.id = 1234
{code}
In this query the predicate is only pushed down to t1.
To get predicates on both tables I have to run the following query:
{code}
select * from t1 join t2 on t1.id1 = t2.id2 where t1.id = 1234 and t2.id2 = 1234
{code}

Spark should present the same behaviour for both queries.


> During join Spark should pushdown predicates on joining column to both tables
> -
>
> Key: SPARK-12845
> URL: https://issues.apache.org/jira/browse/SPARK-12845
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Maciej Bryński
>
> I have the following issue.
> I'm joining two tables with a where condition:
> {code}
> select * from t1 join t2 on t1.id1 = t2.id2 where t1.id = 1234
> {code}
> In this query the predicate is only pushed down to t1.
> To get predicates on both tables I have to run the following query:
> {code}
> select * from t1 join t2 on t1.id1 = t2.id2 where t1.id = 1234 and t2.id2 = 
> 1234
> {code}
> Spark should present the same behaviour for both queries.
> During an inner join there is no problem with NULL values.
> On LEFT / RIGHT joins, predicates can be pushed down only if they are on the 
> right side. 
> On OUTER joins we need to deal with null values.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12845) During join Spark should pushdown predicates on joining column to both tables

2016-01-16 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-12845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-12845:
---
Summary: During join Spark should pushdown predicates on joining column to 
both tables  (was: During join Spark should pushdown predicates to both tables)

> During join Spark should pushdown predicates on joining column to both tables
> -
>
> Key: SPARK-12845
> URL: https://issues.apache.org/jira/browse/SPARK-12845
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Maciej Bryński
>
> I have the following issue.
> I'm joining two tables with a where condition:
> {code}
> select * from t1 join t2 on t1.id1 = t2.id2 where t1.id = 1234
> {code}
> In this query the predicate is only pushed down to t1.
> To get predicates on both tables I have to run the following query:
> {code}
> select * from t1 join t2 on t1.id1 = t2.id2 where t1.id = 1234 and t2.id2 = 
> 1234
> {code}
> Spark should present the same behaviour for both queries.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12856) speed up hashCode of unsafe array

2016-01-16 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-12856.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 10784
[https://github.com/apache/spark/pull/10784]

> speed up hashCode of unsafe array
> -
>
> Key: SPARK-12856
> URL: https://issues.apache.org/jira/browse/SPARK-12856
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6166) Add config to limit number of concurrent outbound connections for shuffle fetch

2016-01-16 Thread Mridul Muralidharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15103065#comment-15103065
 ] 

Mridul Muralidharan commented on SPARK-6166:


We actually don't care about the number of sockets or connections, but about the number of 
block requests going out.
That is what we want to put a limit on - think of it as a limit on the number of 
active requests, corresponding to the existing limit on the number of outstanding bytes in 
flight.

> Add config to limit number of concurrent outbound connections for shuffle 
> fetch
> ---
>
> Key: SPARK-6166
> URL: https://issues.apache.org/jira/browse/SPARK-6166
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.4.0
>Reporter: Mridul Muralidharan
>Assignee: Shixiong Zhu
>Priority: Minor
>
> spark.reducer.maxMbInFlight puts a bound on the in-flight data in terms of 
> size.
> But this is not always sufficient: when the number of hosts in the cluster 
> increases, this can lead to a very large number of inbound connections to one 
> or more nodes - causing workers to fail under the load.
> I propose we also add a spark.reducer.maxReqsInFlight - which puts a bound on 
> the number of outstanding outbound connections.
> This might still cause hotspots in the cluster, but in our tests it has 
> significantly reduced the occurrence of worker failures.
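
A rough, self-contained sketch of the bookkeeping being proposed; maxReqsInFlight comes from this proposal, everything else is illustrative rather than Spark's shuffle code. A fetch request is deferred when either the bytes-in-flight or the requests-in-flight cap would be exceeded, and the deferred queue is drained as requests complete:

{code}
import scala.collection.mutable

case class FetchRequest(blocks: Seq[String], sizeBytes: Long)

class FetchThrottle(maxBytesInFlight: Long, maxReqsInFlight: Int) {
  private val deferred = mutable.Queue.empty[FetchRequest]
  private var bytesInFlight = 0L
  private var reqsInFlight = 0

  // Send the request now if both caps allow it, otherwise queue it for later.
  def submit(req: FetchRequest)(send: FetchRequest => Unit): Unit = {
    if (bytesInFlight + req.sizeBytes <= maxBytesInFlight && reqsInFlight < maxReqsInFlight) {
      bytesInFlight += req.sizeBytes
      reqsInFlight += 1
      send(req)
    } else {
      deferred.enqueue(req)
    }
  }

  // Called when an outstanding request finishes: free its budget and try to
  // send one deferred request.
  def onComplete(req: FetchRequest)(send: FetchRequest => Unit): Unit = {
    bytesInFlight -= req.sizeBytes
    reqsInFlight -= 1
    if (deferred.nonEmpty) submit(deferred.dequeue())(send)
  }
}
{code}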



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org