[jira] [Commented] (SPARK-4475) PySpark failed to initialize if localhost can not be resolved

2015-02-22 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14332161#comment-14332161
 ] 

Sean Owen commented on SPARK-4475:
--

When does localhost fail to resolve? See the discussion on the PR. It is not 
clear that 127.0.0.1 is better, since even Python SocketServer examples don't 
use it.

 PySpark failed to initialize if localhost can not be resolved
 -

 Key: SPARK-4475
 URL: https://issues.apache.org/jira/browse/SPARK-4475
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.0.2, 1.1.0, 1.2.0
Reporter: Davies Liu

 {code}
 Traceback (most recent call last):
   File "/home/hduser/Downloads/spark-1.1.0/python/pyspark/shell.py", line 44, in <module>
     sc = SparkContext(appName="PySparkShell", pyFiles=add_files)
   File "/home/hduser/Downloads/spark-1.1.0/python/pyspark/context.py", line 107, in __init__
     conf)
   File "/home/hduser/Downloads/spark-1.1.0/python/pyspark/context.py", line 159, in _do_init
     self._accumulatorServer = accumulators._start_update_server()
   File "/home/hduser/Downloads/spark-1.1.0/python/pyspark/accumulators.py", line 251, in _start_update_server
     server = AccumulatorServer(("localhost", 0), _UpdateRequestHandler)
   File "/usr/lib/python2.7/SocketServer.py", line 408, in __init__
     self.server_bind()
   File "/usr/lib/python2.7/SocketServer.py", line 419, in server_bind
     self.socket.bind(self.server_address)
   File "/usr/lib/python2.7/socket.py", line 224, in meth
     return getattr(self._sock,name)(*args)
 socket.gaierror: [Errno -5] No address associated with hostname
 {code}
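
For context, the bind above fails because the resolver cannot map the name "localhost" to any address. A minimal JVM-side sketch (illustrative only, not the PySpark code in question) of probing whether "localhost" resolves and falling back to the numeric loopback address:

{code}
import java.net.InetAddress

// Illustrative sketch: probe whether "localhost" resolves on this machine and
// fall back to the numeric loopback address if it does not. Binding to
// "localhost" depends on the resolver / /etc/hosts; "127.0.0.1" does not.
object LoopbackCheck {
  def bindableLoopback(): String =
    try {
      InetAddress.getByName("localhost").getHostAddress
    } catch {
      case _: java.net.UnknownHostException => "127.0.0.1"
    }

  def main(args: Array[String]): Unit =
    println(s"would bind the update server to: ${bindableLoopback()}")
}
{code}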






[jira] [Updated] (SPARK-5710) Combines two adjacent `Cast` expressions into one

2015-02-22 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-5710:
-
 Priority: Minor  (was: Major)
Affects Version/s: (was: 1.3.0)
   1.2.1

The original suggested PR couldn't be merged, but I wonder if another is going 
to be proposed for the narrower change, to merge identical neighboring casts?

 Combines two adjacent `Cast` expressions into one
 -

 Key: SPARK-5710
 URL: https://issues.apache.org/jira/browse/SPARK-5710
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.2.1
Reporter: guowei
Priority: Minor

 A plan after `analyzer` with `typeCoercionRules` may produce many `cast` 
 expressions. We can combine the adjacent ones.
 For example: 
 create table test(a decimal(3,1));
 explain select * from test where a*2-1>1;
 == Physical Plan ==
 Filter (CAST(CAST((CAST(CAST((CAST(a#5, DecimalType()) * 2), 
 DecimalType(21,1)), DecimalType()) - 1), DecimalType(22,1)), DecimalType()) > 
 1)
  HiveTableScan [a#5], (MetastoreRelation default, test, None), None
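
To make the narrower rewrite discussed above concrete, here is an illustrative sketch over a toy expression tree (the case classes below are stand-ins, not Catalyst's actual classes): two adjacent casts to the same type collapse into one, while casts to different types are left alone.

{code}
// Toy expression tree for illustration only; not Catalyst's real Cast/Expression API.
sealed trait Expr
case class Attr(name: String) extends Expr
case class Cast(child: Expr, dataType: String) extends Expr

// Collapse Cast(Cast(e, t), t) into Cast(e, t); casts to *different* types are kept,
// since dropping one of them could change the value (e.g. truncation on the inner cast).
def combineCasts(e: Expr): Expr = e match {
  case Cast(inner @ Cast(_, t1), t2) if t1 == t2 => combineCasts(inner)
  case Cast(child, t)                            => Cast(combineCasts(child), t)
  case other                                     => other
}

// Example: combineCasts(Cast(Cast(Attr("a"), "decimal(22,1)"), "decimal(22,1)"))
//          == Cast(Attr("a"), "decimal(22,1)")
{code}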






[jira] [Commented] (SPARK-5751) Flaky test: o.a.s.sql.hive.thriftserver.HiveThriftServer2Suite sometimes times out

2015-02-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14332249#comment-14332249
 ] 

Apache Spark commented on SPARK-5751:
-

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/4720

 Flaky test: o.a.s.sql.hive.thriftserver.HiveThriftServer2Suite sometimes 
 times out
 --

 Key: SPARK-5751
 URL: https://issues.apache.org/jira/browse/SPARK-5751
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Cheng Lian
Assignee: Cheng Lian
Priority: Critical
  Labels: flaky-test

 The "Test JDBC query execution" test case times out occasionally, while all other 
 test cases are just fine. The failure output only contains the service startup 
 command line without any log output. Presumably the test case somehow misses the 
 log file path.






[jira] [Updated] (SPARK-4808) Spark fails to spill with small number of large objects

2015-02-22 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-4808:
-
Labels: backport-needed  (was: )

 Spark fails to spill with small number of large objects
 ---

 Key: SPARK-4808
 URL: https://issues.apache.org/jira/browse/SPARK-4808
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.2, 1.1.0, 1.2.0, 1.2.1
Reporter: Dennis Lawler
  Labels: backport-needed

 Spillable's maybeSpill does not allow a spill to occur until at least 1000 
 elements have been read, and then only evaluates spilling on every 32nd 
 element thereafter.  When a small number of very large items is being 
 tracked, out-of-memory conditions may occur.
 I suspect that this and the every-32nd-element behavior were meant to reduce the 
 impact of the estimateSize() call.  That method was extracted into 
 SizeTracker, which implements its own exponential backoff for size estimation, 
 so now we are merely avoiding the use of the resulting size estimate.
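
A simplified model of the behavior described above may help (this is a sketch, not the actual Spillable code; names and thresholds are illustrative): the spill check is skipped for the first batch of elements and then only evaluated on every 32nd insert, so a handful of very large records can overshoot the memory threshold between checks.

{code}
// Simplified model of the described maybeSpill behavior; illustrative only.
class SpillModel(memoryThresholdBytes: Long) {
  private var elementsRead = 0L
  private var currentMemoryBytes = 0L

  def insert(recordSizeBytes: Long): Unit = {
    elementsRead += 1
    currentMemoryBytes += recordSizeBytes
    // Spilling is only *considered* after 1000 elements, and then on every 32nd element,
    // so up to 31 large inserts (or the whole first 1000) can pass with no check at all.
    val shouldCheck = elementsRead > 1000 && elementsRead % 32 == 0
    if (shouldCheck && currentMemoryBytes >= memoryThresholdBytes) spill()
  }

  private def spill(): Unit = {
    currentMemoryBytes = 0
  }
}
{code}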






[jira] [Updated] (SPARK-4808) Spark fails to spill with small number of large objects

2015-02-22 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-4808:
-
Target Version/s: 1.2.2, 1.4.0, 1.3.1  (was: 1.3.0, 1.2.2, 1.4.0)

 Spark fails to spill with small number of large objects
 ---

 Key: SPARK-4808
 URL: https://issues.apache.org/jira/browse/SPARK-4808
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.2, 1.1.0, 1.2.0, 1.2.1
Reporter: Dennis Lawler
  Labels: backport-needed

 Spillable's maybeSpill does not allow a spill to occur until at least 1000 
 elements have been read, and then only evaluates spilling on every 32nd 
 element thereafter.  When a small number of very large items is being 
 tracked, out-of-memory conditions may occur.
 I suspect that this and the every-32nd-element behavior were meant to reduce the 
 impact of the estimateSize() call.  That method was extracted into 
 SizeTracker, which implements its own exponential backoff for size estimation, 
 so now we are merely avoiding the use of the resulting size estimate.






[jira] [Updated] (SPARK-4808) Spark fails to spill with small number of large objects

2015-02-22 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-4808:
-
Target Version/s: 1.3.0, 1.2.2, 1.4.0  (was: 1.3.0, 1.4.0)

 Spark fails to spill with small number of large objects
 ---

 Key: SPARK-4808
 URL: https://issues.apache.org/jira/browse/SPARK-4808
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.2, 1.1.0, 1.2.0, 1.2.1
Reporter: Dennis Lawler
  Labels: backport-needed

 Spillable's maybeSpill does not allow a spill to occur until at least 1000 
 elements have been read, and then only evaluates spilling on every 32nd 
 element thereafter.  When a small number of very large items is being 
 tracked, out-of-memory conditions may occur.
 I suspect that this and the every-32nd-element behavior were meant to reduce the 
 impact of the estimateSize() call.  That method was extracted into 
 SizeTracker, which implements its own exponential backoff for size estimation, 
 so now we are merely avoiding the use of the resulting size estimate.






[jira] [Resolved] (SPARK-4958) Bake common tools like ganglia into Spark AMI

2015-02-22 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas resolved SPARK-4958.
-
   Resolution: Duplicate
Fix Version/s: (was: 1.3.0)

Closing this as a duplicate of SPARK-3821 since we're covering the addition of 
stuff like Ganglia to the AMIs in that issue.

 Bake common tools like ganglia into Spark AMI
 -

 Key: SPARK-4958
 URL: https://issues.apache.org/jira/browse/SPARK-4958
 Project: Spark
  Issue Type: Sub-task
  Components: EC2
Reporter: Nicholas Chammas
Priority: Minor








[jira] [Created] (SPARK-5943) Update the API to remove several warns in building for Spark Streaming

2015-02-22 Thread Saisai Shao (JIRA)
Saisai Shao created SPARK-5943:
--

 Summary: Update the API to remove several warns in building for 
Spark Streaming
 Key: SPARK-5943
 URL: https://issues.apache.org/jira/browse/SPARK-5943
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 1.3.0
Reporter: Saisai Shao
Priority: Minor


The old {{awaitTermination(timeout: Long)}} is deprecated in favor of 
{{awaitTerminationOrTimeout(timeout: Long): Boolean}} in version 1.3; this 
change updates the related code to reduce the warnings emitted while compiling.
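
For illustration, a small sketch of the replacement call (assuming a started StreamingContext named {{ssc}} with output operations registered):

{code}
import org.apache.spark.streaming.StreamingContext

// Sketch of the API change described above (Spark Streaming 1.3).
def waitBriefly(ssc: StreamingContext): Unit = {
  // Deprecated form that produces the compile warning:
  //   ssc.awaitTermination(10000L)
  // Replacement; returns true if the context terminated within the timeout:
  val terminated: Boolean = ssc.awaitTerminationOrTimeout(10000L)
  if (!terminated) println("still running after 10 seconds")
}
{code}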






[jira] [Commented] (SPARK-4808) Spark fails to spill with small number of large objects

2015-02-22 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14332266#comment-14332266
 ] 

Andrew Or commented on SPARK-4808:
--

[~sowen] not fully. The PR you linked to is just a temporary fix. The real fix 
probably won't make it into 1.3

 Spark fails to spill with small number of large objects
 ---

 Key: SPARK-4808
 URL: https://issues.apache.org/jira/browse/SPARK-4808
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.2, 1.1.0, 1.2.0, 1.2.1
Reporter: Dennis Lawler

 Spillable's maybeSpill does not allow a spill to occur until at least 1000 
 elements have been read, and then only evaluates spilling on every 32nd 
 element thereafter.  When a small number of very large items is being 
 tracked, out-of-memory conditions may occur.
 I suspect that this and the every-32nd-element behavior were meant to reduce the 
 impact of the estimateSize() call.  That method was extracted into 
 SizeTracker, which implements its own exponential backoff for size estimation, 
 so now we are merely avoiding the use of the resulting size estimate.






[jira] [Commented] (SPARK-3821) Develop an automated way of creating Spark images (AMI, Docker, and others)

2015-02-22 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14332303#comment-14332303
 ] 

Nicholas Chammas commented on SPARK-3821:
-

For those wanting to use the work being done as part of this issue before it 
gets merged upstream, I posted some [instructions on Stack 
Overflow|http://stackoverflow.com/a/28639669/877069] in response to a related 
question.

 Develop an automated way of creating Spark images (AMI, Docker, and others)
 ---

 Key: SPARK-3821
 URL: https://issues.apache.org/jira/browse/SPARK-3821
 Project: Spark
  Issue Type: Improvement
  Components: Build, EC2
Reporter: Nicholas Chammas
Assignee: Nicholas Chammas
 Attachments: packer-proposal.html


 Right now the creation of Spark AMIs or Docker containers is done manually. 
 With tools like [Packer|http://www.packer.io/], we should be able to automate 
 this work, and do so in such a way that multiple types of machine images can 
 be created from a single template.






[jira] [Commented] (SPARK-5942) DataFrame should not do query optimization when dataFrameEagerAnalysis is on

2015-02-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14332074#comment-14332074
 ] 

Apache Spark commented on SPARK-5942:
-

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/4719

 DataFrame should not do query optimization when dataFrameEagerAnalysis is on
 

 Key: SPARK-5942
 URL: https://issues.apache.org/jira/browse/SPARK-5942
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Liang-Chi Hsieh
Priority: Minor

 DataFrame will force query optimization to happen right away for the commands 
 and queries with side effects.
 However, I think we should not do that when dataFrameEagerAnalysis is on.






[jira] [Updated] (SPARK-5942) DataFrame should not do query optimization when dataFrameEagerAnalysis is off

2015-02-22 Thread Liang-Chi Hsieh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh updated SPARK-5942:
---
Summary: DataFrame should not do query optimization when 
dataFrameEagerAnalysis is off  (was: DataFrame should not do query optimization 
when dataFrameEagerAnalysis is on)

 DataFrame should not do query optimization when dataFrameEagerAnalysis is off
 -

 Key: SPARK-5942
 URL: https://issues.apache.org/jira/browse/SPARK-5942
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Liang-Chi Hsieh
Priority: Minor

 DataFrame will force query optimization to happen right away for the commands 
 and queries with side effects.
 However, I think we should not do that when dataFrameEagerAnalysis is on.






[jira] [Resolved] (SPARK-5669) Spark assembly includes incompatibly licensed libgfortran, libgcc code via JBLAS

2015-02-22 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-5669.
--
  Resolution: Won't Fix
   Fix Version/s: (was: 1.3.0)
Target Version/s:   (was: 1.3.0, 1.1.2, 1.2.2)

I think this is basically 'WontFix' since the net result is that this change 
only went into master, on the grounds that there's an argument that the licensing is 
OK. However, this is made moot anyway by SPARK-5814.

 Spark assembly includes incompatibly licensed libgfortran, libgcc code via 
 JBLAS
 

 Key: SPARK-5669
 URL: https://issues.apache.org/jira/browse/SPARK-5669
 Project: Spark
  Issue Type: Bug
  Components: Build
Reporter: Sean Owen
Assignee: Sean Owen
Priority: Blocker

 Sorry for Blocker, but it's a license issue. The Spark assembly includes 
 the following, from JBLAS:
 {code}
 lib/
 lib/static/
 lib/static/Mac OS X/
 lib/static/Mac OS X/x86_64/
 lib/static/Mac OS X/x86_64/libjblas_arch_flavor.jnilib
 lib/static/Mac OS X/x86_64/sse3/
 lib/static/Mac OS X/x86_64/sse3/libjblas.jnilib
 lib/static/Windows/
 lib/static/Windows/x86/
 lib/static/Windows/x86/libgfortran-3.dll
 lib/static/Windows/x86/libgcc_s_dw2-1.dll
 lib/static/Windows/x86/jblas_arch_flavor.dll
 lib/static/Windows/x86/sse3/
 lib/static/Windows/x86/sse3/jblas.dll
 lib/static/Windows/amd64/
 lib/static/Windows/amd64/libgfortran-3.dll
 lib/static/Windows/amd64/jblas.dll
 lib/static/Windows/amd64/libgcc_s_sjlj-1.dll
 lib/static/Windows/amd64/jblas_arch_flavor.dll
 lib/static/Linux/
 lib/static/Linux/i386/
 lib/static/Linux/i386/sse3/
 lib/static/Linux/i386/sse3/libjblas.so
 lib/static/Linux/i386/libjblas_arch_flavor.so
 lib/static/Linux/amd64/
 lib/static/Linux/amd64/sse3/
 lib/static/Linux/amd64/sse3/libjblas.so
 lib/static/Linux/amd64/libjblas_arch_flavor.so
 {code}
 Unfortunately the libgfortran and libgcc libraries included for Windows are 
 not licensed in a way that's compatible with Spark and the AL2 -- LGPL at 
 least.
 It's easy to exclude them. I'm not clear what it does to running on Windows; 
 I assume it can still work but the libs would have to be made available 
 locally and put on the shared library path manually. I don't think there's a 
 package manager as in Linux that would make it easily available. I'm not able 
 to test on Windows.
 If it doesn't work, the follow-up question is whether that means JBLAS has to 
 be removed on the double, or treated as a known issue for 1.3.0.






[jira] [Updated] (SPARK-5942) DataFrame should not do query optimization when dataFrameEagerAnalysis is off

2015-02-22 Thread Liang-Chi Hsieh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh updated SPARK-5942:
---
Description: 
DataFrame will force query optimization to happen right away for the commands 
and queries with side effects.

However, I think we should not do that when dataFrameEagerAnalysis is off.

  was:
DataFrame will force query optimization to happen right away for the commands 
and queries with side effects.

However, I think we should not do that when dataFrameEagerAnalysis is on.


 DataFrame should not do query optimization when dataFrameEagerAnalysis is off
 -

 Key: SPARK-5942
 URL: https://issues.apache.org/jira/browse/SPARK-5942
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Liang-Chi Hsieh
Priority: Minor

 DataFrame will force query optimization to happen right away for the commands 
 and queries with side effects.
 However, I think we should not do that when dataFrameEagerAnalysis is off.






[jira] [Created] (SPARK-5942) DataFrame should not do query optimization when dataFrameEagerAnalysis is on

2015-02-22 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-5942:
--

 Summary: DataFrame should not do query optimization when 
dataFrameEagerAnalysis is on
 Key: SPARK-5942
 URL: https://issues.apache.org/jira/browse/SPARK-5942
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Liang-Chi Hsieh
Priority: Minor


DataFrame will force query optimization to happen right away for the commands 
and queries with side effects.

However, I think we should not do that when dataFrameEagerAnalysis is on.
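
As a rough illustration of the setting being discussed (the config key below, "spark.sql.eagerAnalysis", is my recollection of the flag backing {{dataFrameEagerAnalysis}} and should be treated as an assumption, not a confirmed name; the table name is a placeholder):

{code}
import org.apache.spark.sql.SQLContext

// Hedged sketch: "spark.sql.eagerAnalysis" is assumed to be the key behind
// dataFrameEagerAnalysis; "people" is a placeholder table name.
def demo(sqlContext: SQLContext): Unit = {
  // With eager analysis on (the default), analysis errors such as a bad column
  // name surface as soon as the DataFrame is defined. With it off, analysis --
  // and, per this issue, optimization -- should be deferred until an action runs.
  sqlContext.setConf("spark.sql.eagerAnalysis", "false")
  val df = sqlContext.table("people").select("name")
  df.collect()
}
{code}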






[jira] [Commented] (SPARK-4808) Spark fails to spill with small number of large objects

2015-02-22 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14332079#comment-14332079
 ] 

Sean Owen commented on SPARK-4808:
--

[~andrewor14] Is this resolved for 1.2.2 / 1.3.0?
https://github.com/apache/spark/pull/4420#issuecomment-75178396

 Spark fails to spill with small number of large objects
 ---

 Key: SPARK-4808
 URL: https://issues.apache.org/jira/browse/SPARK-4808
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.2, 1.1.0, 1.2.0, 1.2.1
Reporter: Dennis Lawler

 Spillable's maybeSpill does not allow a spill to occur until at least 1000 
 elements have been read, and then only evaluates spilling on every 32nd 
 element thereafter.  When a small number of very large items is being 
 tracked, out-of-memory conditions may occur.
 I suspect that this and the every-32nd-element behavior were meant to reduce the 
 impact of the estimateSize() call.  That method was extracted into 
 SizeTracker, which implements its own exponential backoff for size estimation, 
 so now we are merely avoiding the use of the resulting size estimate.






[jira] [Commented] (SPARK-4605) Proposed Contribution: Spark Kernel to enable interactive Spark applications

2015-02-22 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14332085#comment-14332085
 ] 

Sean Owen commented on SPARK-4605:
--

It's only my opinion, but I believe this (and other proposed new packages) is 
the kind of thing that can exist outside core Spark and doesn't net-net benefit 
everyone if pushed in the core distribution. Doing so has the benefit of making 
it more readily available, but also of un-blessing similar efforts in favor of 
an official one. It makes an already exceptionally complex build even bigger. 
On the one hand you get synchronized releases; on the other hand, you have to 
synchronize releases. Of course you can make this argument about a lot of 
packages, perhaps even ones already in Spark.

 Proposed Contribution: Spark Kernel to enable interactive Spark applications
 

 Key: SPARK-4605
 URL: https://issues.apache.org/jira/browse/SPARK-4605
 Project: Spark
  Issue Type: New Feature
  Components: Project Infra
Reporter: Chip Senkbeil
 Attachments: Kernel Architecture Widescreen.pdf, Kernel 
 Architecture.pdf


 Project available on Github: https://github.com/ibm-et/spark-kernel
 
 This architecture describes the running kernel code that was demonstrated at 
 StrataConf in Barcelona, Spain.
 
 Enables applications to interact with a Spark cluster using Scala in several 
 ways:
 * Defining and running core Spark Tasks
 * Collecting results from a cluster without needing to write to an external 
 data store
 ** Ability to stream results using well-defined protocol
 * Arbitrary Scala code definition and execution (without submitting 
 heavy-weight jars)
 Applications can be hosted and managed separately from the Spark cluster, using 
 the kernel as a proxy to communicate requests.
 The Spark Kernel implements the server side of the IPython Kernel protocol, 
 the rising “de-facto” protocol for language (Python, Haskell, etc.) execution.
 It inherits a suite of industry-adopted clients such as the IPython Notebook.






[jira] [Commented] (SPARK-5943) Update the API to remove several warns in building for Spark Streaming

2015-02-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14332328#comment-14332328
 ] 

Apache Spark commented on SPARK-5943:
-

User 'jerryshao' has created a pull request for this issue:
https://github.com/apache/spark/pull/4722

 Update the API to remove several warns in building for Spark Streaming
 --

 Key: SPARK-5943
 URL: https://issues.apache.org/jira/browse/SPARK-5943
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 1.3.0
Reporter: Saisai Shao
Priority: Minor

 The old {{awaitTermination(timeout: Long)}} is deprecated in favor of 
 {{awaitTerminationOrTimeout(timeout: Long): Boolean}} in version 1.3; this 
 change updates the related code to reduce the warnings emitted while compiling.






[jira] [Commented] (SPARK-5944) Python release docs say SNAPSHOT + Author is missing

2015-02-22 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14332394#comment-14332394
 ] 

Nicholas Chammas commented on SPARK-5944:
-

cc [~davies], [~joshrosen]

 Python release docs say SNAPSHOT + Author is missing
 

 Key: SPARK-5944
 URL: https://issues.apache.org/jira/browse/SPARK-5944
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 1.2.1
Reporter: Nicholas Chammas
Priority: Minor

 http://spark.apache.org/docs/latest/api/python/index.html
 As of Feb 2015, that link says "PySpark 1.2-SNAPSHOT". It should probably say 
 1.2.1.
 Furthermore, the footer says "Copyright 2014, Author". It should 
 probably say something else or be removed altogether.






[jira] [Created] (SPARK-5944) Python release docs say SNAPSHOT + Author is missing

2015-02-22 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-5944:
---

 Summary: Python release docs say SNAPSHOT + Author is missing
 Key: SPARK-5944
 URL: https://issues.apache.org/jira/browse/SPARK-5944
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 1.2.1
Reporter: Nicholas Chammas
Priority: Minor


http://spark.apache.org/docs/latest/api/python/index.html

As of Feb 2015, that link says "PySpark 1.2-SNAPSHOT". It should probably say 
1.2.1.

Furthermore, the footer says "Copyright 2014, Author". It should probably 
say something else or be removed altogether.






[jira] [Created] (SPARK-5946) Add Python API for Kafka direct stream

2015-02-22 Thread Saisai Shao (JIRA)
Saisai Shao created SPARK-5946:
--

 Summary: Add Python API for Kafka direct stream
 Key: SPARK-5946
 URL: https://issues.apache.org/jira/browse/SPARK-5946
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, Streaming
Affects Versions: 1.3.0
Reporter: Saisai Shao


Add a Python API for the Kafka direct stream. Currently this only adds the 
{{createDirectStream}} API, not the {{createRDD}} API, since the latter needs some 
Python wrapping of Java objects; this will be improved according to the comments.
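
For reference, a sketch of the existing Scala direct-stream call that the proposed Python {{createDirectStream}} would wrap (the broker address and topic name below are placeholders):

{code}
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.kafka.KafkaUtils

// Sketch of the Scala API (Spark 1.3) that the Python wrapper would delegate to.
// "broker1:9092" and "events" are placeholder values.
def directStream(ssc: StreamingContext) = {
  val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
  val topics = Set("events")
  KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, topics)
}
{code}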






[jira] [Commented] (SPARK-5946) Add Python API for Kafka direct stream

2015-02-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14333018#comment-14333018
 ] 

Apache Spark commented on SPARK-5946:
-

User 'jerryshao' has created a pull request for this issue:
https://github.com/apache/spark/pull/4723

 Add Python API for Kafka direct stream
 --

 Key: SPARK-5946
 URL: https://issues.apache.org/jira/browse/SPARK-5946
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, Streaming
Affects Versions: 1.3.0
Reporter: Saisai Shao

 Add a Python API for the Kafka direct stream. Currently this only adds the 
 {{createDirectStream}} API, not the {{createRDD}} API, since the latter needs some 
 Python wrapping of Java objects; this will be improved according to the comments.






[jira] [Updated] (SPARK-911) Support map pruning on sorted (K, V) RDD's

2015-02-22 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-911:
-
Affects Version/s: (was: 1.0.0)

 Support map pruning on sorted (K, V) RDD's
 --

 Key: SPARK-911
 URL: https://issues.apache.org/jira/browse/SPARK-911
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Patrick Wendell
 Fix For: 1.4.0


 If someone has sorted a (K, V) RDD, we should offer them a way to filter a 
 range of the partitions that employs map pruning. This would be simple using 
 a small range index within the RDD itself. A good example: I sort my 
 dataset by time and then want to serve queries that are restricted to a 
 certain time range.
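
An illustrative sketch of the idea (this builds the small range index by hand and is not an existing Spark API; true map pruning would additionally skip the unwanted partitions at the scheduler level rather than just returning empty iterators):

{code}
import org.apache.spark.rdd.RDD

// Illustrative only: per-partition (min, max) key index over a sorted RDD,
// used to evaluate a range query only on partitions that can contain it.
def rangeQuery(sorted: RDD[(Long, String)], lo: Long, hi: Long): RDD[(Long, String)] = {
  // Small index: one (partitionId, minKey, maxKey) triple per non-empty partition.
  val index: Array[(Int, Long, Long)] = sorted
    .mapPartitionsWithIndex { (pid, iter) =>
      val keys = iter.map(_._1).toArray
      if (keys.isEmpty) Iterator.empty else Iterator((pid, keys.min, keys.max))
    }
    .collect()

  val wanted: Set[Int] =
    index.collect { case (pid, mn, mx) if mx >= lo && mn <= hi => pid }.toSet

  // Only scan partitions whose key range overlaps [lo, hi]; tasks are still
  // launched for every partition, which is what real pruning would avoid.
  sorted.mapPartitionsWithIndex { (pid, iter) =>
    if (wanted.contains(pid)) iter.filter { case (k, _) => k >= lo && k <= hi }
    else Iterator.empty
  }
}
{code}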






[jira] [Commented] (SPARK-5928) Remote Shuffle Blocks cannot be more than 2 GB

2015-02-22 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14332490#comment-14332490
 ] 

Imran Rashid commented on SPARK-5928:
-

So I'm pretty sure the issue is not just large records; it's from large shuffle 
blocks.  I tried slight variations on my original failing program, with the 
same shuffle volume but smaller individual records, and both fail.

{code}
val rdd = sc.parallelize(1 to 1e6.toInt, 1).map{ i =>
  val n = 3e3.toInt
  val arr = new Array[Byte](n)
  //need to make sure the array doesn't compress to something small
  scala.util.Random.nextBytes(arr)
  (2 * i, arr)
}
rdd.partitionBy(new org.apache.spark.HashPartitioner(2)).count()
{code}

or
 
{code}
val rdd = sc.parallelize(1 to 2e6.toInt, 2).mapPartitionsWithIndex{ 
  case (part, itr) =>
    if (part == 0) {
      itr.map{ ignore =>
        val n = 3e3.toInt
        val arr = new Array[Byte](n)
        //need to make sure the array doesn't compress to something small
        scala.util.Random.nextBytes(arr)
        arr
      }
    } else {
      Iterator()
    }
}
rdd.map{ x => (scala.util.Random.nextInt % 1000 * 2, x) }.groupByKey(2).count()
{code}


However, the failure mode is slightly different, and I don't fully understand it.  
It does always have a {{FetchFailedException: Adjusted frame length exceeds 
2147483647}}.  But then during the retry, it decides to change from two tasks 
in the second stage to just one task.  I don't understand why the number of 
tasks would change.  I understand that one of my executors might get removed, 
and so both tasks might get executed locally -- but I don't see why it would 
convert them into one task.

In any case, these programs should have smaller records, but still have a 2 GB 
remote-fetch shuffle block.  Note that I'm trying to play some games with getting 
data from one partition to another to force a remote fetch -- but I can't 
actually guarantee that.  Sometimes the programs succeed when the fetches are 
local.  So they don't exhibit the issue every time, but they do fail 
regularly.  Sometimes they go on to succeed (I guess after one of the executors 
is removed completely, and so there are no remote fetches anymore), and 
sometimes they fail repeatedly.
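
For what it's worth, the numbers in these programs line up with the frame-length error; a rough back-of-the-envelope check (ignoring serialization overhead, which is why the reported figure is slightly larger):

{code}
// ~1e6 records of ~3e3 incompressible bytes hash into a single reduce partition,
// so one shuffle block holds roughly 1e6 * 3e3 = 3.0e9 bytes -- already past the
// 2 GB frame limit before any serialization overhead (the error reports 3021252889).
val approxBlockBytes = 1e6.toLong * 3e3.toLong   // 3000000000
val frameLimit       = Int.MaxValue.toLong       // 2147483647
println(approxBlockBytes > frameLimit)           // true
{code}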

 Remote Shuffle Blocks cannot be more than 2 GB
 --

 Key: SPARK-5928
 URL: https://issues.apache.org/jira/browse/SPARK-5928
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Imran Rashid

 If a shuffle block is over 2GB, the shuffle fails, with an uninformative 
 exception.  The tasks get retried a few times and then eventually the job 
 fails.
 Here is an example program which can cause the exception:
 {code}
 val rdd = sc.parallelize(1 to 1e6.toInt, 1).map{ ignore =>
   val n = 3e3.toInt
   val arr = new Array[Byte](n)
   //need to make sure the array doesn't compress to something small
   scala.util.Random.nextBytes(arr)
   arr
 }
 rdd.map { x => (1, x)}.groupByKey().count()
 {code}
 Note that you can't trigger this exception in local mode; it only happens on 
 remote fetches.   I triggered these exceptions running with 
 {{MASTER=yarn-client spark-shell --num-executors 2 --executor-memory 4000m}}
 {noformat}
 15/02/20 11:10:23 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 3, 
 imran-3.ent.cloudera.com): FetchFailed(BlockManagerId(1, 
 imran-2.ent.cloudera.com, 55028), shuffleId=1, mapId=0, reduceId=0, message=
 org.apache.spark.shuffle.FetchFailedException: Adjusted frame length exceeds 
 2147483647: 3021252889 - discarded
   at 
 org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67)
   at 
 org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
   at 
 org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
   at 
 org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
   at 
 org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
   at 
 org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:125)
   at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
   at 
 org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:46)
   at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
   at 

[jira] [Commented] (SPARK-765) Test suite should run Spark example programs

2015-02-22 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14332438#comment-14332438
 ] 

Nicholas Chammas commented on SPARK-765:


Seems like a good idea. [~joshrosen] I assume this is still to be done, right?

 Test suite should run Spark example programs
 

 Key: SPARK-765
 URL: https://issues.apache.org/jira/browse/SPARK-765
 Project: Spark
  Issue Type: New Feature
  Components: Examples
Reporter: Josh Rosen

 The Spark test suite should also run each of the Spark example programs (the 
 PySpark suite should do the same).  This should be done through a shell 
 script or other mechanism to simulate the environment setup used by end users 
 that run those scripts.
 This would prevent problems like SPARK-764 from making it into releases.






[jira] [Commented] (SPARK-1001) Memory leak when reading sequence file and then sorting

2015-02-22 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14332444#comment-14332444
 ] 

Nicholas Chammas commented on SPARK-1001:
-

It's probably tough given how long ago this was reported, but can anyone 
confirm whether this is still an issue on the latest release (1.2.1)?

 Memory leak when reading sequence file and then sorting
 ---

 Key: SPARK-1001
 URL: https://issues.apache.org/jira/browse/SPARK-1001
 Project: Spark
  Issue Type: Bug
  Components: Shuffle, Spark Core
Affects Versions: 0.8.0
Reporter: Matthew Cheah
  Labels: Hadoop, Memory

 Spark appears to build up a backlog of unreachable byte arrays when an RDD is 
 constructed from a sequence file, and then that RDD is sorted.
 I have a class that wraps a Java ArrayList and that can be serialized and 
 written to a Hadoop SequenceFile (i.e., it implements the Writable interface). 
 Let's call it WritableDataRow. It can take a Java List as the argument to 
 wrap around, and it also has a copy constructor.
 Setup: 10 slaves, launched via EC2, 65.9GB RAM each; the dataset is 100GB of 
 text, 120GB when in sequence file format (not using compression to compact 
 the bytes). CDH4.2.0-backed Hadoop cluster.
 First, building the RDD from a CSV and then sorting on index 1 works fine:
 {code}
 scala> import scala.collection.JavaConversions._ // Other imports here as well
 import scala.collection.JavaConversions._
 scala> val rddAsTextFile = sc.textFile("s3n://some-bucket/events-*.csv")
 rddAsTextFile: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at 
 <console>:14
 scala> val rddAsWritableDataRows = rddAsTextFile.map(x => new 
 WritableDataRow(x.split("\\|").toList))
 rddAsWritableDataRows: 
 org.apache.spark.rdd.RDD[com.palantir.finance.datatable.server.spark.WritableDataRow]
  = MappedRDD[2] at map at <console>:19
 scala> val rddAsKeyedWritableDataRows = rddAsWritableDataRows.map(x => 
 (x.getContents().get(1).toString(), x));
 rddAsKeyedWritableDataRows: org.apache.spark.rdd.RDD[(String, 
 com.palantir.finance.datatable.server.spark.WritableDataRow)] = MappedRDD[4] 
 at map at <console>:22
 scala> val orderedFunct = new 
 org.apache.spark.rdd.OrderedRDDFunctions[String, WritableDataRow, (String, 
 WritableDataRow)](rddAsKeyedWritableDataRows)
 orderedFunct: 
 org.apache.spark.rdd.OrderedRDDFunctions[String,com.palantir.finance.datatable.server.spark.WritableDataRow,(String,
  com.palantir.finance.datatable.server.spark.WritableDataRow)] = 
 org.apache.spark.rdd.OrderedRDDFunctions@587acb54
 scala> orderedFunct.sortByKey(true).count(); // Actually triggers the 
 computation, as stated in a different e-mail thread
 res0: org.apache.spark.rdd.RDD[(String, 
 com.palantir.finance.datatable.server.spark.WritableDataRow)] = 
 MapPartitionsRDD[8] at sortByKey at <console>:27
 {code}
 The above works without too many surprises. I then save it as a Sequence File 
 (using JavaPairRDD as a way to more easily call saveAsHadoopFile(), and this 
 is how it's done in our Java-based application):
 {code}
 scala> val pairRDD = new JavaPairRDD(rddAsWritableDataRows.map(x => 
 (NullWritable.get(), x)));
 pairRDD: 
 org.apache.spark.api.java.JavaPairRDD[org.apache.hadoop.io.NullWritable,com.palantir.finance.datatable.server.spark.WritableDataRow]
  = org.apache.spark.api.java.JavaPairRDD@8d2e9d9
 scala> pairRDD.saveAsHadoopFile("hdfs://hdfs-master-url:9010/blah", 
 classOf[NullWritable], classOf[WritableDataRow], 
 classOf[org.apache.hadoop.mapred.SequenceFileOutputFormat[NullWritable, 
 WritableDataRow]]);
 …
 2013-12-11 20:09:14,444 [main] INFO  org.apache.spark.SparkContext - Job 
 finished: saveAsHadoopFile at <console>:26, took 1052.116712748 s
 {code}
 And now I want to get the RDD from the sequence file and sort THAT, and this 
 is when I monitor Ganglia and ps aux and notice the memory usage climbing 
 ridiculously:
 {code}
 scala> val rddAsSequenceFile = 
 sc.sequenceFile("hdfs://hdfs-master-url:9010/blah", classOf[NullWritable], 
 classOf[WritableDataRow]).map(x => new WritableDataRow(x._2)); // Invokes 
 copy constructor to get around re-use of writable objects
 rddAsSequenceFile: 
 org.apache.spark.rdd.RDD[com.palantir.finance.datatable.server.spark.WritableDataRow]
  = MappedRDD[19] at map at <console>:19
 scala> val orderedFunct = new 
 org.apache.spark.rdd.OrderedRDDFunctions[String, WritableDataRow, (String, 
 WritableDataRow)](rddAsSequenceFile.map(x => 
 (x.getContents().get(1).toString(), x)))
 orderedFunct: 
 org.apache.spark.rdd.OrderedRDDFunctions[String,com.palantir.finance.datatable.server.spark.WritableDataRow,(String,
  com.palantir.finance.datatable.server.spark.WritableDataRow)] = 
 org.apache.spark.rdd.OrderedRDDFunctions@6262a9a6
 scala> orderedFunct.sortByKey().count();
 {code}
 (On the necessity to copy writables 

[jira] [Commented] (SPARK-5928) Remote Shuffle Blocks cannot be more than 2 GB

2015-02-22 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14332480#comment-14332480
 ] 

Imran Rashid commented on SPARK-5928:
-

Thanks Kay, that does make sense.  I think that gave me enough info to at least 
summarize in another issue -- I just opened SPARK-5945; please add / correct 
anything I have written there.

Good question about the record size -- I'm pretty sure that is not the issue, 
but I am hitting some other strange behavior which I'm trying to understand; 
I'll update this issue soon with more info.

 Remote Shuffle Blocks cannot be more than 2 GB
 --

 Key: SPARK-5928
 URL: https://issues.apache.org/jira/browse/SPARK-5928
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Imran Rashid

 If a shuffle block is over 2GB, the shuffle fails, with an uninformative 
 exception.  The tasks get retried a few times and then eventually the job 
 fails.
 Here is an example program which can cause the exception:
 {code}
 val rdd = sc.parallelize(1 to 1e6.toInt, 1).map{ ignore =>
   val n = 3e3.toInt
   val arr = new Array[Byte](n)
   //need to make sure the array doesn't compress to something small
   scala.util.Random.nextBytes(arr)
   arr
 }
 rdd.map { x => (1, x)}.groupByKey().count()
 {code}
 Note that you can't trigger this exception in local mode; it only happens on 
 remote fetches.   I triggered these exceptions running with 
 {{MASTER=yarn-client spark-shell --num-executors 2 --executor-memory 4000m}}
 {noformat}
 15/02/20 11:10:23 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 3, 
 imran-3.ent.cloudera.com): FetchFailed(BlockManagerId(1, 
 imran-2.ent.cloudera.com, 55028), shuffleId=1, mapId=0, reduceId=0, message=
 org.apache.spark.shuffle.FetchFailedException: Adjusted frame length exceeds 
 2147483647: 3021252889 - discarded
   at 
 org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67)
   at 
 org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
   at 
 org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
   at 
 org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
   at 
 org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
   at 
 org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:125)
   at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
   at 
 org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:46)
   at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
   at org.apache.spark.scheduler.Task.run(Task.scala:56)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
 Caused by: io.netty.handler.codec.TooLongFrameException: Adjusted frame 
 length exceeds 2147483647: 3021252889 - discarded
   at 
 io.netty.handler.codec.LengthFieldBasedFrameDecoder.fail(LengthFieldBasedFrameDecoder.java:501)
   at 
 io.netty.handler.codec.LengthFieldBasedFrameDecoder.failIfNecessary(LengthFieldBasedFrameDecoder.java:477)
   at 
 io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:403)
   at 
 io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:343)
   at 
 io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:249)
   at 
 io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:149)
   at 
 io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
   at 
 io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
   at 
 io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
   at 
 io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
   at 
 io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
   at 
 

[jira] [Updated] (SPARK-5945) Spark should not retry a stage infinitely on a FetchFailedException

2015-02-22 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid updated SPARK-5945:

Description: 
While investigating SPARK-5928, I noticed some very strange behavior in the way 
Spark retries stages after a FetchFailedException.  It seems that on a 
FetchFailedException, instead of simply killing the task and retrying, Spark 
aborts the stage and retries.  If it just retried the task, the task might fail 
4 times and then trigger the usual job killing mechanism.  But by killing the 
stage instead, the max retry logic is skipped (it looks to me like there is no 
limit for retries on a stage).

After a bit of discussion with Kay Ousterhout, it seems the idea is that if a 
fetch fails, we assume that the block manager we are fetching from has failed, 
and that it will succeed if we retry the stage w/out that block manager.  In 
that case, it wouldn't make any sense to retry the task, since it's doomed to 
fail every time, so we might as well kill the whole stage.  But this raises two 
questions:


1) Is it really safe to assume that a FetchFailedException means that the 
BlockManager has failed, and it will work if we just try another one?  
SPARK-5928 shows that there are at least some cases where that assumption is 
wrong.  Even if we fix that case, this logic seems brittle to the next case we 
find.  I guess the idea is that this behavior is what gives us the "R" in RDD 
... but it seems like it's not really that robust and maybe should be 
reconsidered.

2) Should stages only be retried a limited number of times?  It would be pretty 
easy to put in a limited number of retries per stage.  Though again, we 
encounter issues with keeping things resilient.  Theoretically one stage could 
have many retries, but only because of failures in different stages further downstream, 
so we might need to track the cause of each retry as well to still have the 
desired behavior.

In general it just seems there is some flakiness in the retry logic.  This is 
the only reproducible example I have at the moment, but I vaguely recall 
hitting other cases of strange behavior w/ retries when trying to run long 
pipelines.  E.g., if one executor is stuck in a GC during a fetch, the fetch 
fails, but the executor eventually comes back and the stage gets retried again, 
but the same GC issues happen the second time around, etc.

Copied from SPARK-5928, here's the example program that can regularly produce a 
loop of stage failures.  Note that it will only fail from a remote fetch, so it 
can't be run locally -- I ran with {{MASTER=yarn-client spark-shell 
--num-executors 2 --executor-memory 4000m}}

{code}
val rdd = sc.parallelize(1 to 1e6.toInt, 1).map{ ignore =>
  val n = 3e3.toInt
  val arr = new Array[Byte](n)
  //need to make sure the array doesn't compress to something small
  scala.util.Random.nextBytes(arr)
  arr
}
rdd.map { x => (1, x)}.groupByKey().count()
{code}

  was:
While investigating SPARK-5928, I noticed some very strange behavior in the way 
Spark retries stages after a FetchFailedException.  It seems that on a 
FetchFailedException, instead of simply killing the task and retrying, Spark 
aborts the stage and retries.  If it just retried the task, the task might fail 
4 times and then trigger the usual job killing mechanism.  But by killing the 
stage instead, the max retry logic is skipped (it looks to me like there is no 
limit for retries on a stage).

After a bit of discussion with Kay Ousterhout, it seems the idea is that if a 
fetch fails, we assume that the block manager we are fetching from has failed, 
and that it will succeed if we retry the stage w/out that block manager.  In 
that case, it wouldn't make any sense to retry the task, since it's doomed to 
fail every time, so we might as well kill the whole stage.  But this raises two 
questions:


1) Is it really safe to assume that a FetchFailedException means that the 
BlockManager has failed, and it will work if we just try another one?  
SPARK-5928 shows that there are at least some cases where that assumption is 
wrong.  Even if we fix that case, this logic seems brittle to the next case we 
find.  I guess the idea is that this behavior is what gives us the "R" in RDD 
... but it seems like it's not really that robust and maybe should be 
reconsidered.

2) Should stages only be retried a limited number of times?  It would be pretty 
easy to put in a limited number of retries per stage.  Though again, we 
encounter issues with keeping things resilient.  Theoretically one stage could 
have many retries, but due to failures in different stages further downstream, 
so we might need to track the logic.

In general it just seems there is some flakiness in the retry logic.  This is 
the only reproducible example I have at the moment, but I vaguely recall 
hitting other cases of strange behavior w/ retries when trying to run 

[jira] [Updated] (SPARK-5945) Spark should not retry a stage infinitely on a FetchFailedException

2015-02-22 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid updated SPARK-5945:

Description: 
While investigating SPARK-5928, I noticed some very strange behavior in the way 
Spark retries stages after a FetchFailedException.  It seems that on a 
FetchFailedException, instead of simply killing the task and retrying, Spark 
aborts the stage and retries.  If it just retried the task, the task might fail 
4 times and then trigger the usual job killing mechanism.  But by killing the 
stage instead, the max retry logic is skipped (it looks to me like there is no 
limit for retries on a stage).

After a bit of discussion with Kay Ousterhout, it seems the idea is that if a 
fetch fails, we assume that the block manager we are fetching from has failed, 
and that it will succeed if we retry the stage w/out that block manager.  In 
that case, it wouldn't make any sense to retry the task, since it's doomed to 
fail every time, so we might as well kill the whole stage.  But this raises two 
questions:


1) Is it really safe to assume that a FetchFailedException means that the 
BlockManager has failed, and it will work if we just try another one?  
SPARK-5928 shows that there are at least some cases where that assumption is 
wrong.  Even if we fix that case, this logic seems brittle to the next case we 
find.  I guess the idea is that this behavior is what gives us the "R" in RDD 
... but it seems like it's not really that robust and maybe should be 
reconsidered.

2) Should stages only be retried a limited number of times?  It would be pretty 
easy to put in a limited number of retries per stage.  Though again, we 
encounter issues with keeping things resilient.  Theoretically one stage could 
have many retries, but only because of failures in different stages further downstream, 
so we might need to track the cause of each retry as well to still have the 

In general it just seems there is some flakiness in the retry logic.  This is 
the only reproducible example I have at the moment, but I vaguely recall 
hitting other cases of strange behavior w/ retries when trying to run long 
pipelines.  E.g., if one executor is stuck in a GC during a fetch, the fetch 
fails, but the executor eventually comes back and the stage gets retried again, 
but the same GC issues happen the second time around, etc.

Copied from SPARK-5928, here's the example program that can regularly produce a 
loop of stage failures.  Note that it will only fail from a remote fetch, so it 
can't be run locally -- I ran with {{MASTER=yarn-client spark-shell 
--num-executors 2 --executor-memory 4000m}}

{code}
val rdd = sc.parallelize(1 to 1e6.toInt, 1).map{ ignore =>
  val n = 3e3.toInt
  val arr = new Array[Byte](n)
  //need to make sure the array doesn't compress to something small
  scala.util.Random.nextBytes(arr)
  arr
}
rdd.map { x => (1, x)}.groupByKey().count()
{code}

  was:
While investigating SPARK-5928, I noticed some very strange behavior in the way 
Spark retries stages after a FetchFailedException.  It seems that on a 
FetchFailedException, instead of simply killing the task and retrying, Spark 
aborts the stage and retries.  If it just retried the task, the task might fail 
4 times and then trigger the usual job killing mechanism.  But by killing the 
stage instead, the max retry logic is skipped (it looks to me like there is no 
limit for retries on a stage).

After a bit of discussion with Kay Ousterhout, it seems the idea is that if a 
fetch fails, we assume that the block manager we are fetching from has failed, 
and that it will succeed if we retry the stage w/out that block manager.  In 
that case, it wouldn't make any sense to retry the task, since it's doomed to 
fail every time, so we might as well kill the whole stage.  But this raises two 
questions:


1) Is it really safe to assume that a FetchFailedException means that the 
BlockManager has failed, and it will work if we just try another one?  
SPARK-5928 shows that there are at least some cases where that assumption is 
wrong.  Even if we fix that case, this logic seems brittle to the next case we 
find.  I guess the idea is that this behavior is what gives us the "R" in RDD 
... but it seems like it's not really that robust and maybe should be 
reconsidered.

2) Should stages only be retried a limited number of times?  It would be pretty 
easy to put in a limited number of retries per stage.  Though again, we 
encounter issues with also keeping things resilient.  Theoretically one stage 
could have many retries, but due to failures in different stages further 
downstream, so we might need to track the logic.

In general it just seems there is some flakiness in the retry logic.  This is 
the only reproducible example I have at the moment, but I vaguely recall 
hitting other cases of strange behavior w/ retries when trying to run long 
pipelines.  E.g., if one executor is stuck in a GC 

[jira] [Updated] (SPARK-5253) LinearRegression with L1/L2 (elastic net) using OWLQN in new ML package

2015-02-22 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-5253:
-
Summary: LinearRegression with L1/L2 (elastic net) using OWLQN in new ML 
package  (was: LinearRegression with L1/L2 (elastic net) using OWLQN in new ML 
pacakge)

 LinearRegression with L1/L2 (elastic net) using OWLQN in new ML package
 ---

 Key: SPARK-5253
 URL: https://issues.apache.org/jira/browse/SPARK-5253
 Project: Spark
  Issue Type: New Feature
  Components: ML
Reporter: DB Tsai








[jira] [Updated] (SPARK-1050) Investigate AnyRefMap

2015-02-22 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-1050:

Target Version/s: 1.4.0

Since Spark can now be built with Scala 2.11, I believe this issue can now be 
looked into. Tagging this with a 1.4.0 target so it can be reviewed by then.

 Investigate AnyRefMap
 -

 Key: SPARK-1050
 URL: https://issues.apache.org/jira/browse/SPARK-1050
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Mark Hamstra
Priority: Minor

 Once Spark is built using Scala 2.11, we should investigate (particularly on 
 heavily-used, performance-critical code paths) replacing usage of HashMap 
 with the new, higher-performance AnyRefMap.
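 To make the proposed investigation concrete, here's a minimal sketch (assuming 
 Scala 2.11's scala.collection.mutable.AnyRefMap) of the kind of drop-in swap 
 being suggested; note that AnyRefMap only applies where keys are reference 
 types, and this is purely illustrative, not a benchmark.

{code}
import scala.collection.mutable.{AnyRefMap, HashMap}

// The same counting loop written against HashMap and against AnyRefMap
// (Scala 2.11+). Illustrative only.
def countWithHashMap(words: Seq[String]): HashMap[String, Int] = {
  val counts = HashMap.empty[String, Int]
  for (w <- words) counts(w) = counts.getOrElse(w, 0) + 1
  counts
}

def countWithAnyRefMap(words: Seq[String]): AnyRefMap[String, Int] = {
  val counts = AnyRefMap.empty[String, Int]
  for (w <- words) counts(w) = counts.getOrElse(w, 0) + 1
  counts
}
{code}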



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5945) Spark should not retry a stage infinitely on a FetchFailedException

2015-02-22 Thread Imran Rashid (JIRA)
Imran Rashid created SPARK-5945:
---

 Summary: Spark should not retry a stage infinitely on a 
FetchFailedException
 Key: SPARK-5945
 URL: https://issues.apache.org/jira/browse/SPARK-5945
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Imran Rashid


While investigating SPARK-5928, I noticed some very strange behavior in the way 
Spark retries stages after a FetchFailedException.  It seems that on a 
FetchFailedException, instead of simply killing the task and retrying, Spark 
aborts the stage and retries.  If it just retried the task, the task might fail 
4 times and then trigger the usual job killing mechanism.  But by killing the 
stage instead, the max retry logic is skipped (it looks to me like there is no 
limit for retries on a stage).

After a bit of discussion with Kay Ousterhout, it seems the idea is that if a 
fetch fails, we assume that the block manager we are fetching from has failed, 
and that it will succeed if we retry the stage w/out that block manager.  In 
that case, it wouldn't make any sense to retry the task, since it's doomed to 
fail every time, so we might as well kill the whole stage.  But this raises two 
questions:


1) Is it really safe to assume that a FetchFailedException means that the 
BlockManager has failed, and it will work if we just try another one?  
SPARK-5928 shows that there are at least some cases where that assumption is 
wrong.  Even if we fix that case, this logic seems brittle to the next case we 
find.  I guess the idea is that this behavior is what gives us the "R" in 
"RDD" ... but it seems like it's not really that robust and maybe should be 
reconsidered.

2) Should stages only be retried a limited number of times?  It would be pretty 
easy to put in a limited number of retries per stage.  Though again, we 
encounter issues with keeping things resilient: theoretically one stage could 
be retried many times because of failures in different stages further 
downstream, so we might need to track the cause of each retry as well.
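To make the idea in (2) concrete, here's a minimal, hypothetical sketch of a 
per-stage retry cap. This is not Spark's actual DAGScheduler code or API; the 
class name, method names, and the default cap of 4 are assumptions for 
illustration only.

{code}
import scala.collection.mutable

// Hypothetical bookkeeping only -- not Spark's scheduler internals.
// Counts fetch-failure retries per stage and signals when to give up.
class StageRetryTracker(maxRetriesPerStage: Int = 4) {
  private val retries = mutable.Map.empty[Int, Int]

  /** Returns true if the stage may be resubmitted, false if the job should be aborted. */
  def onFetchFailure(stageId: Int): Boolean = {
    val attempts = retries.getOrElse(stageId, 0) + 1
    retries(stageId) = attempts
    attempts <= maxRetriesPerStage
  }

  /** A successful completion of the stage resets its counter. */
  def onStageSuccess(stageId: Int): Unit = retries.remove(stageId)
}
{code}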

In general it just seems there is some flakiness in the retry logic.  This is 
the only reproducible example I have at the moment, but I vaguely recall 
hitting other cases of strange behavior w/ retries when trying to run long 
pipelines.  E.g., if one executor is stuck in a GC during a fetch, the fetch 
fails, but the executor eventually comes back and the stage gets retried again, 
but the same GC issues happen the second time around, etc.

Copied from SPARK-5928, here's the example program that can regularly produce a 
loop of stage failures.  Note that it will only fail from a remote fetch, so it 
can't be run locally -- I ran with {{MASTER=yarn-client spark-shell 
--num-executors 2 --executor-memory 4000m}}

{code}
val rdd = sc.parallelize(1 to 1e6.toInt, 1).map { ignore =>
  val n = 3e3.toInt
  val arr = new Array[Byte](n)
  // need to make sure the array doesn't compress to something small
  scala.util.Random.nextBytes(arr)
  arr
}
rdd.map { x => (1, x) }.groupByKey().count()
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-1050) Investigate AnyRefMap

2015-02-22 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin closed SPARK-1050.
--
Resolution: Won't Fix

 Investigate AnyRefMap
 -

 Key: SPARK-1050
 URL: https://issues.apache.org/jira/browse/SPARK-1050
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Mark Hamstra
Priority: Minor

 Once Spark is built using Scala 2.11, we should investigate (particularly on 
 heavily-used, performance-critical code paths) replacing usage of HashMap 
 with the new, higher-performance AnyRefMap.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-1050) Investigate AnyRefMap

2015-02-22 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin closed SPARK-1050.
--
Resolution: Fixed

I'm going to close this for now. If somebody has time to review the new Scala 
hash map, we can reopen it. In general the performance characteristics of the 
Scala collection library are not very satisfying.

I have some thoughts on performance optimizations in general and I will post 
something soon.

 Investigate AnyRefMap
 -

 Key: SPARK-1050
 URL: https://issues.apache.org/jira/browse/SPARK-1050
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Mark Hamstra
Priority: Minor

 Once Spark is built using Scala 2.11, we should investigate (particularly on 
 heavily-used, performance-critical code paths) replacing usage of HashMap 
 with the new, higher-performance AnyRefMap.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-1050) Investigate AnyRefMap

2015-02-22 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin reopened SPARK-1050:


 Investigate AnyRefMap
 -

 Key: SPARK-1050
 URL: https://issues.apache.org/jira/browse/SPARK-1050
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Mark Hamstra
Priority: Minor

 Once Spark is built using Scala 2.11, we should investigate (particularly on 
 heavily-used, performance-critical code paths) replacing usage of HashMap 
 with the new, higher-performance AnyRefMap.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org