[jira] [Created] (SPARK-4446) MetadataCleaner schedules its task with a wrong param for delay time
Leo created SPARK-4446: -- Summary: MetadataCleaner schedules its task with a wrong param for delay time Key: SPARK-4446 URL: https://issues.apache.org/jira/browse/SPARK-4446 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Leo MetadataCleaner schedules its task with a wrong param for delay time.
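For context, java.util.Timer.schedule takes the initial delay and the repeat period as separate millisecond arguments, so passing the wrong value as the delay silently shifts when the first cleanup fires. A minimal sketch of that failure mode; the variable names and values are illustrative, not the actual MetadataCleaner code:
{noformat}
import java.util.{Timer, TimerTask}

val delaySeconds = 60    // when the first run should happen
val periodSeconds = 600  // how often it should repeat afterwards

val task = new TimerTask { override def run(): Unit = println("cleaning metadata") }
val timer = new Timer("metadata-cleaner", true) // daemon thread

// Correct: Timer.schedule(task, delay, period), both in milliseconds.
timer.schedule(task, delaySeconds * 1000L, periodSeconds * 1000L)

// The bug class this issue describes: handing the wrong value (e.g. the
// period) to the delay argument, which makes the first cleanup run far too late:
// timer.schedule(task, periodSeconds * 1000L, periodSeconds * 1000L)
{noformat}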
[jira] [Commented] (SPARK-4446) MetadataCleaner schedules its task with a wrong param for delay time
[ https://issues.apache.org/jira/browse/SPARK-4446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214400#comment-14214400 ] Apache Spark commented on SPARK-4446: - User 'Leolh' has created a pull request for this issue: https://github.com/apache/spark/pull/3306 MetadataCleaner schedules its task with a wrong param for delay time - Key: SPARK-4446 URL: https://issues.apache.org/jira/browse/SPARK-4446 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Leo MetadataCleaner schedules its task with a wrong param for delay time.
[jira] [Commented] (SPARK-4306) LogisticRegressionWithLBFGS support for PySpark MLlib
[ https://issues.apache.org/jira/browse/SPARK-4306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214406#comment-14214406 ] Apache Spark commented on SPARK-4306: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/3307 LogisticRegressionWithLBFGS support for PySpark MLlib -- Key: SPARK-4306 URL: https://issues.apache.org/jira/browse/SPARK-4306 Project: Spark Issue Type: New Feature Components: MLlib, PySpark Reporter: Varadharajan Assignee: Varadharajan Labels: newbie Original Estimate: 48h Remaining Estimate: 48h Currently we support LogisticRegressionWithSGD in the PySpark MLlib interface. This task is to add support for the LogisticRegressionWithLBFGS algorithm.
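For reference, the Scala API that the new Python wrapper would mirror already exists in MLlib; a minimal usage sketch (the app name and data path are illustrative):
{noformat}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.util.MLUtils

object LBFGSExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("lbfgs-example"))
    // Expects LIBSVM-formatted data; the path is made up.
    val training = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")
    // The L-BFGS-based trainer this ticket exposes to Python, as opposed to
    // the SGD-based one PySpark already has.
    val model = new LogisticRegressionWithLBFGS().run(training)
    println(model.predict(training.first().features))
    sc.stop()
  }
}
{noformat}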
[jira] [Commented] (SPARK-2208) local metrics tests can fail on fast machines
[ https://issues.apache.org/jira/browse/SPARK-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214407#comment-14214407 ] Apache Spark commented on SPARK-2208: - User 'XuefengWu' has created a pull request for this issue: https://github.com/apache/spark/pull/3300 local metrics tests can fail on fast machines - Key: SPARK-2208 URL: https://issues.apache.org/jira/browse/SPARK-2208 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Patrick Wendell Labels: starter I'm temporarily disabling this check. I think the issue is that on fast machines the fetch wait time can actually be zero, even across all tasks. We should see if we can write this in a different way to make sure there is a delay.
[jira] [Created] (SPARK-4447) Remove layers of abstraction in YARN code no longer needed after dropping yarn-alpha
Sandy Ryza created SPARK-4447: - Summary: Remove layers of abstraction in YARN code no longer needed after dropping yarn-alpha Key: SPARK-4447 URL: https://issues.apache.org/jira/browse/SPARK-4447 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.3.0 Reporter: Sandy Ryza
[jira] [Updated] (SPARK-4447) Remove layers of abstraction in YARN code no longer needed after dropping yarn-alpha
[ https://issues.apache.org/jira/browse/SPARK-4447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated SPARK-4447: -- Description: For example: YarnRMClient and YarnRMClientImpl can be merged; YarnAllocator and YarnAllocationHandler can be merged. Remove layers of abstraction in YARN code no longer needed after dropping yarn-alpha Key: SPARK-4447 URL: https://issues.apache.org/jira/browse/SPARK-4447 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.3.0 Reporter: Sandy Ryza For example: YarnRMClient and YarnRMClientImpl can be merged; YarnAllocator and YarnAllocationHandler can be merged.
[jira] [Commented] (SPARK-4447) Remove layers of abstraction in YARN code no longer needed after dropping yarn-alpha
[ https://issues.apache.org/jira/browse/SPARK-4447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214411#comment-14214411 ] Sandy Ryza commented on SPARK-4447: --- Planning to work on this. Remove layers of abstraction in YARN code no longer needed after dropping yarn-alpha Key: SPARK-4447 URL: https://issues.apache.org/jira/browse/SPARK-4447 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.3.0 Reporter: Sandy Ryza
[jira] [Commented] (SPARK-4306) LogisticRegressionWithLBFGS support for PySpark MLlib
[ https://issues.apache.org/jira/browse/SPARK-4306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214415#comment-14214415 ] Varadharajan commented on SPARK-4306: - [~matei] I'm really sorry. I'm quite occupied this week. I see Davies has already created a pull request. Sorry for the inconvenience :( LogisticRegressionWithLBFGS support for PySpark MLlib -- Key: SPARK-4306 URL: https://issues.apache.org/jira/browse/SPARK-4306 Project: Spark Issue Type: New Feature Components: MLlib, PySpark Reporter: Varadharajan Assignee: Varadharajan Labels: newbie Original Estimate: 48h Remaining Estimate: 48h Currently we support LogisticRegressionWithSGD in the PySpark MLlib interface. This task is to add support for the LogisticRegressionWithLBFGS algorithm.
[jira] [Created] (SPARK-4448) Support ConstantObjectInspector for unwrapping data
Cheng Hao created SPARK-4448: Summary: Support ConstantObjectInspector for unwrapping data Key: SPARK-4448 URL: https://issues.apache.org/jira/browse/SPARK-4448 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao Priority: Minor A ClassCastException is thrown when retrieving a primitive object from a ConstantObjectInspector by specifying parameters. This is probably a bug in Hive; we just provide a workaround in HiveInspectors.
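A rough sketch of what such a workaround can look like: instead of asking the inspector to re-extract a primitive from the row data (which is what triggers the cast error for constant inspectors), read the pre-evaluated constant directly. The helper names are assumptions, not the actual HiveInspectors code:
{noformat}
import org.apache.hadoop.hive.serde2.objectinspector.{ConstantObjectInspector, ObjectInspector, PrimitiveObjectInspector}

// Stand-in for whatever converts the constant's Writable into a Catalyst
// value; not a real Spark function.
def unwrapWritable(w: AnyRef): Any = w

def unwrap(data: Any, oi: ObjectInspector): Any = oi match {
  // Constant inspectors already carry their value; touching `data` is what
  // raises the ClassCastException this ticket describes.
  case constOI: ConstantObjectInspector =>
    unwrapWritable(constOI.getWritableConstantValue)
  case primOI: PrimitiveObjectInspector =>
    primOI.getPrimitiveJavaObject(data)
  case _ => data
}
{noformat}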
[jira] [Commented] (SPARK-4448) Support ConstantObjectInspector for unwrapping data
[ https://issues.apache.org/jira/browse/SPARK-4448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214429#comment-14214429 ] Apache Spark commented on SPARK-4448: - User 'chenghao-intel' has created a pull request for this issue: https://github.com/apache/spark/pull/3308 Support ConstantObjectInspector for unwrapping data --- Key: SPARK-4448 URL: https://issues.apache.org/jira/browse/SPARK-4448 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao Priority: Minor A ClassCastException is thrown when retrieving a primitive object from a ConstantObjectInspector by specifying parameters. This is probably a bug in Hive; we just provide a workaround in HiveInspectors.
[jira] [Commented] (SPARK-4445) Don't display storage level in toDebugString unless RDD is persisted
[ https://issues.apache.org/jira/browse/SPARK-4445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1421#comment-1421 ] Apache Spark commented on SPARK-4445: - User 'ScrapCodes' has created a pull request for this issue: https://github.com/apache/spark/pull/3310 Don't display storage level in toDebugString unless RDD is persisted Key: SPARK-4445 URL: https://issues.apache.org/jira/browse/SPARK-4445 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Patrick Wendell Assignee: Prashant Sharma Priority: Blocker The current approach lists the storage level all the time, even if the RDD is not persisted. The storage level should only be listed if the RDD is persisted. We just need to guard it with a check.
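A minimal sketch of such a guard, assuming the debug string is assembled per RDD (the helper itself is invented for illustration):
{noformat}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Only mention the storage level when the RDD is actually persisted.
def storageSuffix(rdd: RDD[_]): String =
  if (rdd.getStorageLevel != StorageLevel.NONE)
    s" [${rdd.getStorageLevel.description}]"
  else
    ""
{noformat}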
[jira] [Commented] (SPARK-4442) Move common unit test utilities into their own package / module
[ https://issues.apache.org/jira/browse/SPARK-4442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214461#comment-14214461 ] Sean Owen commented on SPARK-4442: -- You can already depend on just core's test code from other modules' test code. Is this more than that? Move common unit test utilities into their own package / module --- Key: SPARK-4442 URL: https://issues.apache.org/jira/browse/SPARK-4442 Project: Spark Issue Type: Improvement Reporter: Josh Rosen Priority: Minor We should move generally-useful unit test fixtures / utility methods into their own test-utilities package / module to make them easier to find / use. See https://github.com/apache/spark/pull/3121#discussion-diff-20413659 for one example of this.
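On Sean's point, a module can already pull in core's test classes through the published test-jar artifact; in sbt form (the version is illustrative):
{noformat}
// Depend on spark-core's test-jar, for test scope only.
libraryDependencies +=
  "org.apache.spark" %% "spark-core" % "1.2.0" % "test" classifier "tests"
{noformat}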
[jira] [Updated] (SPARK-3962) Mark spark dependency as provided in external libraries
[ https://issues.apache.org/jira/browse/SPARK-3962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3962: --- Issue Type: Bug (was: Improvement) Mark spark dependency as provided in external libraries - Key: SPARK-3962 URL: https://issues.apache.org/jira/browse/SPARK-3962 Project: Spark Issue Type: Bug Components: Streaming Reporter: Patrick Wendell Assignee: Prashant Sharma Priority: Blocker Right now there is not an easy way for users to link against the external streaming libraries without accidentally pulling Spark into their assembly jar. We should mark Spark as provided in the external connector POMs so that user applications can simply include those like any other dependency in the user's jar. This is also the best format for third-party libraries that depend on Spark (of which there will eventually be many), so it would be nice for our own build to conform to this.
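In sbt terms, the end state this ticket aims for looks something like the following for a user application (versions illustrative): Spark itself is provided by the cluster at runtime, while the connector travels inside the assembly jar:
{noformat}
libraryDependencies ++= Seq(
  // Provided by the Spark installation at runtime; kept out of the assembly.
  "org.apache.spark" %% "spark-streaming" % "1.2.0" % "provided",
  // Bundled into the user's assembly jar like any ordinary dependency.
  "org.apache.spark" %% "spark-streaming-kafka" % "1.2.0"
)
{noformat}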
[jira] [Commented] (SPARK-3962) Mark spark dependency as provided in external libraries
[ https://issues.apache.org/jira/browse/SPARK-3962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214480#comment-14214480 ] Patrick Wendell commented on SPARK-3962: I think this is causing the build to fail because of this bug: https://jira.codehaus.org/browse/MSHADE-148 Would it be possible to remove the test-jar dependency on streaming for these modules (/cc @tdas)? If that means a moderate amount of code duplication, I think it's fine, because as it stands now these aren't really usable for people. I looked, and I only found some minor dependencies that could be inlined into the modules. Mark spark dependency as provided in external libraries - Key: SPARK-3962 URL: https://issues.apache.org/jira/browse/SPARK-3962 Project: Spark Issue Type: Bug Components: Streaming Reporter: Patrick Wendell Assignee: Prashant Sharma Priority: Blocker Right now there is not an easy way for users to link against the external streaming libraries without accidentally pulling Spark into their assembly jar. We should mark Spark as provided in the external connector POMs so that user applications can simply include those like any other dependency in the user's jar. This is also the best format for third-party libraries that depend on Spark (of which there will eventually be many), so it would be nice for our own build to conform to this.
[jira] [Commented] (SPARK-4402) Output path validation of an action statement resulting in runtime exception
[ https://issues.apache.org/jira/browse/SPARK-4402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214486#comment-14214486 ] Sean Owen commented on SPARK-4402: -- Can the Spark code go back and check this before any of it is called, at the start of your program? No, that isn't possible. It wouldn't even know which RDDs may be executed at the outset, and it couldn't be sure that the output dir isn't cleared up by your code before output happens. Here it seems to happen before the output operation starts, which is about as early as possible. I suggest this is the correct behavior, and it is the current behavior. It's even configurable whether it overwrites or fails when the output dir exists. Of course you can and should check the output directory in your program. In fact, your program is in a better position to know whether it should be an error or a warning, or whether you should just overwrite the output. Output path validation of an action statement resulting in runtime exception Key: SPARK-4402 URL: https://issues.apache.org/jira/browse/SPARK-4402 Project: Spark Issue Type: Wish Reporter: Vijay Priority: Minor Output path validation happens at the time of statement execution, as part of the lazy evaluation of the action statement. But if the path already exists, it throws a runtime exception, so all the processing completed up to that point is lost, which wastes resources (processing time and CPU usage). If this I/O-related validation were done before the RDD action operations, this runtime exception could be avoided. I believe a similar validation feature is implemented in Hadoop as well. Example: SchemaRDD.saveAsTextFile() evaluates the path at runtime.
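Sean's suggestion, checking the output directory up front in the application itself, takes only a few lines with the Hadoop FileSystem API; a minimal sketch (the helper and path handling are illustrative):
{noformat}
import org.apache.hadoop.fs.Path
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Fail fast if the output directory exists, before any job is triggered.
def saveIfAbsent(sc: SparkContext, rdd: RDD[String], out: String): Unit = {
  val outPath = new Path(out)
  val fs = outPath.getFileSystem(sc.hadoopConfiguration)
  require(!fs.exists(outPath), s"Output path $out already exists")
  rdd.saveAsTextFile(out)
}
{noformat}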
[jira] [Created] (SPARK-4449) Specify port range in Spark
wangfei created SPARK-4449: -- Summary: Specify port range in Spark Key: SPARK-4449 URL: https://issues.apache.org/jira/browse/SPARK-4449 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: wangfei Fix For: 1.2.0 In some cases, we need to specify the range of ports used by Spark.
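For background, the closest existing knob is a fixed starting port per service plus spark.port.maxRetries, which retries successive ports on bind failure and so effectively yields the range [port, port + maxRetries]. A hedged example of those existing settings (values illustrative; the per-service range syntax this ticket proposes exists only in the PR and is not shown):
{noformat}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.driver.port", "40000")       // fixed starting port for the driver
  .set("spark.blockManager.port", "40100") // and for the block manager
  // Each service retries successive ports on bind failure, so it effectively
  // uses [port, port + spark.port.maxRetries].
  .set("spark.port.maxRetries", "32")
{noformat}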
[jira] [Commented] (SPARK-4449) Specify port range in Spark
[ https://issues.apache.org/jira/browse/SPARK-4449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214587#comment-14214587 ] Apache Spark commented on SPARK-4449: - User 'scwf' has created a pull request for this issue: https://github.com/apache/spark/pull/3314 Specify port range in Spark --- Key: SPARK-4449 URL: https://issues.apache.org/jira/browse/SPARK-4449 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: wangfei Fix For: 1.2.0 In some cases, we need to specify the range of ports used by Spark.
[jira] [Commented] (SPARK-4402) Output path validation of an action statement resulting in runtime exception
[ https://issues.apache.org/jira/browse/SPARK-4402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214635#comment-14214635 ] Vijay commented on SPARK-4402: -- Thanks for the explanation. It is clear now. Output path validation of an action statement resulting in runtime exception Key: SPARK-4402 URL: https://issues.apache.org/jira/browse/SPARK-4402 Project: Spark Issue Type: Wish Reporter: Vijay Priority: Minor Output path validation happens at the time of statement execution, as part of the lazy evaluation of the action statement. But if the path already exists, it throws a runtime exception, so all the processing completed up to that point is lost, which wastes resources (processing time and CPU usage). If this I/O-related validation were done before the RDD action operations, this runtime exception could be avoided. I believe a similar validation feature is implemented in Hadoop as well. Example: SchemaRDD.saveAsTextFile() evaluates the path at runtime.
[jira] [Resolved] (SPARK-4402) Output path validation of an action statement resulting in runtime exception
[ https://issues.apache.org/jira/browse/SPARK-4402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vijay resolved SPARK-4402. -- Resolution: Not a Problem Output path validation of an action statement resulting in runtime exception Key: SPARK-4402 URL: https://issues.apache.org/jira/browse/SPARK-4402 Project: Spark Issue Type: Wish Reporter: Vijay Priority: Minor Output path validation happens at the time of statement execution, as part of the lazy evaluation of the action statement. But if the path already exists, it throws a runtime exception, so all the processing completed up to that point is lost, which wastes resources (processing time and CPU usage). If this I/O-related validation were done before the RDD action operations, this runtime exception could be avoided. I believe a similar validation feature is implemented in Hadoop as well. Example: SchemaRDD.saveAsTextFile() evaluates the path at runtime.
[jira] [Commented] (SPARK-4411) Add kill link for jobs in the UI
[ https://issues.apache.org/jira/browse/SPARK-4411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214741#comment-14214741 ] Thomas Graves commented on SPARK-4411: -- Please make sure that the modify acls work apply to it. Add kill link for jobs in the UI -- Key: SPARK-4411 URL: https://issues.apache.org/jira/browse/SPARK-4411 Project: Spark Issue Type: New Feature Components: Web UI Affects Versions: 1.2.0 Reporter: Kay Ousterhout SPARK-4145 changes the default landing page for the UI to show jobs. We should have a kill link for each job, similar to what we have for each stage, so it's easier for users to kill slow jobs (and the semantics of killing a job are slightly different than killing a stage).
[jira] [Comment Edited] (SPARK-4411) Add kill link for jobs in the UI
[ https://issues.apache.org/jira/browse/SPARK-4411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214741#comment-14214741 ] Thomas Graves edited comment on SPARK-4411 at 11/17/14 3:34 PM: Please make sure that the modify acls are applied to it, in order to prevent someone killing someone else's job without permission. was (Author: tgraves): Please make sure that the modify acls work apply to it. Add kill link for jobs in the UI -- Key: SPARK-4411 URL: https://issues.apache.org/jira/browse/SPARK-4411 Project: Spark Issue Type: New Feature Components: Web UI Affects Versions: 1.2.0 Reporter: Kay Ousterhout SPARK-4145 changes the default landing page for the UI to show jobs. We should have a kill link for each job, similar to what we have for each stage, so it's easier for users to kill slow jobs (and the semantics of killing a job are slightly different than killing a stage).
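For reference, the modify-ACL settings the kill link should honor are ordinary Spark conf entries (the user names here are illustrative):
{noformat}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.acls.enable", "true")       // enforce ACL checks in the web UI
  .set("spark.modify.acls", "alice,bob")  // users allowed to kill jobs/stages
  .set("spark.ui.view.acls", "alice,bob") // users allowed to view the UI
{noformat}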
[jira] [Created] (SPARK-4450) SparkSQL producing incorrect answer when using --master yarn
Rick Bischoff created SPARK-4450: Summary: SparkSQL producing incorrect answer when using --master yarn Key: SPARK-4450 URL: https://issues.apache.org/jira/browse/SPARK-4450 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Environment: CDH 5.1 Reporter: Rick Bischoff A simple summary program run with spark-submit --master local MyJob.py vs. spark-submit --master yarn MyJob.py produces different answers: the output produced by local has been independently verified and is correct, but the output from yarn is incorrect. It does not appear to happen with smaller files, only large files. MyJob.py is:
{noformat}
from pyspark import SparkContext, SparkConf
from pyspark.sql import *

def maybeFloat(x):
    """Convert NULLs into 0s"""
    if x == '':
        return 0.
    else:
        return float(x)

def maybeInt(x):
    """Convert NULLs into 0s"""
    if x == '':
        return 0
    else:
        return int(x)

def mapColl(p):
    return {
        "f1": p[0], "f2": p[1], "f3": p[2], "f4": int(p[3]),
        "f5": int(p[4]), "f6": p[5], "f7": p[6], "f8": p[7],
        "f9": p[8], "f10": maybeInt(p[9]), "f11": p[10], "f12": p[11],
        "f13": p[12], "f14": p[13], "f15": maybeFloat(p[14]),
        "f16": maybeInt(p[15]), "f17": maybeFloat(p[16])
    }

sc = SparkContext()
sqlContext = SQLContext(sc)

lines = sc.textFile("sample.csv")
fields = lines.map(lambda l: mapColl(l.split(",")))

collTable = sqlContext.inferSchema(fields)
collTable.registerAsTable("sample")

test = sqlContext.sql("SELECT f9, COUNT(*) AS rows, SUM(f15) AS f15sum " \
                      + "FROM sample " \
                      + "GROUP BY f9")
foo = test.collect()
print foo

sc.stop()
{noformat}
[jira] [Created] (SPARK-4451) force to kill process after 5 seconds
WangTaoTheTonic created SPARK-4451: -- Summary: force to kill process after 5 seconds Key: SPARK-4451 URL: https://issues.apache.org/jira/browse/SPARK-4451 Project: Spark Issue Type: Improvement Components: Deploy Reporter: WangTaoTheTonic When we restart the history server or thrift server, sometimes the new processes will not launch because the old ones have not totally stopped. So I think we'd better wait a few seconds and then kill the process by force. I do this following Hadoop's approach.
[jira] [Commented] (SPARK-4451) force to kill process after 5 seconds
[ https://issues.apache.org/jira/browse/SPARK-4451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214880#comment-14214880 ] Apache Spark commented on SPARK-4451: - User 'WangTaoTheTonic' has created a pull request for this issue: https://github.com/apache/spark/pull/3316 force to kill process after 5 seconds - Key: SPARK-4451 URL: https://issues.apache.org/jira/browse/SPARK-4451 Project: Spark Issue Type: Improvement Components: Deploy Reporter: WangTaoTheTonic When we restart the history server or thrift server, sometimes the new processes will not launch because the old ones have not totally stopped. So I think we'd better wait a few seconds and then kill the process by force. I do this following Hadoop's approach.
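The change itself would live in the daemon control scripts, but the pattern is the usual terminate, wait, then kill-for-sure sequence that Hadoop's scripts use. A Scala analogue for illustration (Process.waitFor with a timeout and destroyForcibly require Java 8; the 5-second grace period mirrors the proposal):
{noformat}
import java.util.concurrent.TimeUnit

// Ask the daemon to stop, give it a grace period, then kill it for sure.
def stopDaemon(p: Process, graceSeconds: Long = 5): Unit = {
  p.destroy() // polite termination request (SIGTERM on Unix)
  if (!p.waitFor(graceSeconds, TimeUnit.SECONDS)) {
    p.destroyForcibly() // still alive after the grace period: SIGKILL
  }
}
{noformat}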
[jira] [Created] (SPARK-4452) Enhance Sort-based Shuffle to avoid spilling small files
tianshuo created SPARK-4452: --- Summary: Enhance Sort-based Shuffle to avoid spilling small files Key: SPARK-4452 URL: https://issues.apache.org/jira/browse/SPARK-4452 Project: Spark Issue Type: Bug Reporter: tianshuo When an Aggregator is used with ExternalSorter in a task, Spark will create many small files and can cause a "too many open files" error during merging. This happens when using the sort-based shuffle. The issue is caused by multiple factors: 1. There seems to be a bug in setting the elementsRead variable in ExternalSorter, which renders the trackMemoryThreshold (defined in Spillable) useless for triggering spilling; the PR to fix it is https://github.com/apache/spark/pull/3302 2. The current ShuffleMemoryManager does not work well when there are 2 spillable objects in a thread, which are ExternalSorter and ExternalAppendOnlyMap (used by Aggregator) in this case. Here is an example: due to the use of map-side aggregation, ExternalAppendOnlyMap is created first to read the RDD. It may ask for as much memory as it can, which is totalMem/numberOfThreads. Then later on, when ExternalSorter is created in the same thread, the ShuffleMemoryManager can refuse to allocate more memory to it, since the memory has already been given to the previously requesting object (ExternalAppendOnlyMap). That causes the ExternalSorter to keep spilling small files (due to the lack of memory). I'm currently working on a PR to address these two issues. It will include the following changes: 1. The ShuffleMemoryManager should not only track the memory usage for each thread, but also the object that holds the memory. 2. The ShuffleMemoryManager should be able to trigger the spilling of a spillable object. In this way, if a new object in a thread is requesting memory, the old occupant can be evicted/spilled. This prevents problem 2 from happening. Previously, spillable objects trigger spilling by themselves, so one may not trigger spilling even if another object in the same thread needs more memory; after this change, the ShuffleMemoryManager can trigger the spilling of an object when it needs to. 3. Make the iterator of ExternalAppendOnlyMap spillable. Previously, ExternalAppendOnlyMap returns a destructive iterator that cannot be spilled after the iterator is returned. This should be changed so that even after the iterator is returned, the ShuffleMemoryManager can still spill it. Currently, I have a working branch in progress: https://github.com/tsdeng/spark/tree/enhance_memory_manager Already made change 3 and have a prototype of changes 1 and 2 to evict spillables from the memory manager, still in progress. I will send a PR when it's done.
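To make changes 1 and 2 concrete, here is a very rough sketch of a manager that books memory per (thread, object) and can evict another holder in the same thread; every name here is hypothetical, derived from the description above rather than from the actual branch:
{noformat}
import scala.collection.mutable

// Hypothetical interface for anything that can release its memory to disk.
trait Spillable {
  def spill(): Unit
}

class ObjectTrackingMemoryManager(maxMemory: Long) {
  // Grants keyed by (thread, object) instead of by thread alone (change 1).
  private val grants = mutable.Map.empty[(Long, Spillable), Long]

  def tryToAcquire(requester: Spillable, numBytes: Long): Long = synchronized {
    val threadId = Thread.currentThread().getId
    var free = maxMemory - grants.values.sum
    if (free < numBytes) {
      // Change 2: evict other spillables owned by this thread instead of
      // letting the first allocator (the ExternalAppendOnlyMap above) starve
      // every later one (the ExternalSorter).
      for (((tid, holder), held) <- grants.toSeq
           if tid == threadId && holder != requester) {
        holder.spill()
        grants.remove((tid, holder))
        free += held
      }
    }
    val granted = math.min(numBytes, free)
    if (granted > 0) {
      val key = (threadId, requester)
      grants(key) = grants.getOrElse(key, 0L) + granted
    }
    granted
  }

  def release(holder: Spillable): Unit = synchronized {
    grants.retain { case ((_, h), _) => h != holder }
  }
}
{noformat}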
[jira] [Commented] (SPARK-4452) Enhance Sort-based Shuffle to avoid spilling small files
[ https://issues.apache.org/jira/browse/SPARK-4452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214903#comment-14214903 ] Apache Spark commented on SPARK-4452: - User 'tsdeng' has created a pull request for this issue: https://github.com/apache/spark/pull/3302 Enhance Sort-based Shuffle to avoid spilling small files Key: SPARK-4452 URL: https://issues.apache.org/jira/browse/SPARK-4452 Project: Spark Issue Type: Bug Reporter: tianshuo When an Aggregator is used with ExternalSorter in a task, Spark will create many small files and can cause a "too many open files" error during merging. This happens when using the sort-based shuffle. The issue is caused by multiple factors: 1. There seems to be a bug in setting the elementsRead variable in ExternalSorter, which renders the trackMemoryThreshold (defined in Spillable) useless for triggering spilling; the PR to fix it is https://github.com/apache/spark/pull/3302 2. The current ShuffleMemoryManager does not work well when there are 2 spillable objects in a thread, which are ExternalSorter and ExternalAppendOnlyMap (used by Aggregator) in this case. Here is an example: due to the use of map-side aggregation, ExternalAppendOnlyMap is created first to read the RDD. It may ask for as much memory as it can, which is totalMem/numberOfThreads. Then later on, when ExternalSorter is created in the same thread, the ShuffleMemoryManager can refuse to allocate more memory to it, since the memory has already been given to the previously requesting object (ExternalAppendOnlyMap). That causes the ExternalSorter to keep spilling small files (due to the lack of memory). I'm currently working on a PR to address these two issues. It will include the following changes: 1. The ShuffleMemoryManager should not only track the memory usage for each thread, but also the object that holds the memory. 2. The ShuffleMemoryManager should be able to trigger the spilling of a spillable object. In this way, if a new object in a thread is requesting memory, the old occupant can be evicted/spilled. This prevents problem 2 from happening. Previously, spillable objects trigger spilling by themselves, so one may not trigger spilling even if another object in the same thread needs more memory; after this change, the ShuffleMemoryManager can trigger the spilling of an object when it needs to. 3. Make the iterator of ExternalAppendOnlyMap spillable. Previously, ExternalAppendOnlyMap returns a destructive iterator that cannot be spilled after the iterator is returned. This should be changed so that even after the iterator is returned, the ShuffleMemoryManager can still spill it. Currently, I have a working branch in progress: https://github.com/tsdeng/spark/tree/enhance_memory_manager Already made change 3 and have a prototype of changes 1 and 2 to evict spillables from the memory manager, still in progress. I will send a PR when it's done.
[jira] [Updated] (SPARK-4452) Enhance Sort-based Shuffle to avoid spilling small files
[ https://issues.apache.org/jira/browse/SPARK-4452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tianshuo updated SPARK-4452: Description: When an Aggregator is used with ExternalSorter in a task, Spark will create many small files and can cause a "too many open files" error during merging. This happens when using the sort-based shuffle. The issue is caused by multiple factors: 1. There seems to be a bug in setting the elementsRead variable in ExternalSorter, which renders the trackMemoryThreshold (defined in Spillable) useless for triggering spilling; the PR to fix it is https://github.com/apache/spark/pull/3302 2. The current ShuffleMemoryManager does not work well when there are 2 spillable objects in a thread, which are ExternalSorter and ExternalAppendOnlyMap (used by Aggregator) in this case. Here is an example: due to the use of map-side aggregation, ExternalAppendOnlyMap is created first to read the RDD. It may ask for as much memory as it can, which is totalMem/numberOfThreads. Then later on, when ExternalSorter is created in the same thread, the ShuffleMemoryManager can refuse to allocate more memory to it, since the memory has already been given to the previously requesting object (ExternalAppendOnlyMap). That causes the ExternalSorter to keep spilling small files (due to the lack of memory). I'm currently working on a PR to address these two issues. It will include the following changes: 1. The ShuffleMemoryManager should not only track the memory usage for each thread, but also the object that holds the memory. 2. The ShuffleMemoryManager should be able to trigger the spilling of a spillable object. In this way, if a new object in a thread is requesting memory, the old occupant can be evicted/spilled. This prevents problem 2 from happening. Previously, spillable objects trigger spilling by themselves, so one may not trigger spilling even if another object in the same thread needs more memory; after this change, the ShuffleMemoryManager can trigger the spilling of an object when it needs to. 3. Make the iterator of ExternalAppendOnlyMap spillable. Previously, ExternalAppendOnlyMap returns a destructive iterator that cannot be spilled after the iterator is returned. This should be changed so that even after the iterator is returned, the ShuffleMemoryManager can still spill it. Currently, I have a working branch in progress: https://github.com/tsdeng/spark/tree/enhance_memory_manager Already made change 3 and have a prototype of changes 1 and 2 to evict spillables from the memory manager, still in progress. I will send a PR when it's done. was: When an Aggregator is used with ExternalSorter in a task, spark will create many small files and could cause too many files open error during merging. This happens when using the sort-based shuffle. The issue is caused by multiple factors: 1. There seems to be a bug in setting the elementsRead variable in ExternalSorter, which renders the trackMemoryThreshold(defined in Spillable) useless for triggering spilling, the pr to fix it is https://github.com/apache/spark/pull/3302 2. Current ShuffleMemoryManager does not work well when there are 2 spillable objects in a thread, which are ExternalSorter and ExternalAppendOnlyMap(used by Aggregator) in this case. Here is an example: Due to the usage of mapside aggregation, ExternalAppendOnlyMap is created first to read the RDD. It may ask as much memory as it can, which is totalMem/numberOfThreads. Then later on whe ExternalSorter is created in the same thread, the ShuffleMemoryManager could refuse to allocate more memory to it, since the memory is already given to the previous requested object(ExternalAppendOnlyMap). That causes the ExternalSorter keeps spilling small files(due to the lack of memory) I'm currently working on a PR to addresses these two issues. It will include following changes 1. The ShuffleMemoryManager should not only track the memory usage for each thread, but also the object who holds the memory 2. The ShuffleMemoryManager should be able to trigger the spilling of a spillable object. In this way, if a new object in a thread is requesting memory, the old occupant could be evicted/spilled. This avoids problem 2 from happening. Previously spillable object triggers spilling by themself. So one may not trigger spilling even if another object in the same thread needs more memory. After this change The ShuffleMemoryManager could trigger the spilling of an object if it needs to 3. Make the iterator of ExternalAppendOnlyMap spillable. Previously ExternalAppendOnlyMap returns an destructive iterator and can not be spilled after the iterator is returned. This should be changed so that even after the iterator is returned, the ShuffleMemoryManager can still spill it. Currently, I have a working branch in progress: https://github.com/tsdeng/spark/tree/enhance_memory_manager Already made change 3 and have a prototype of
[jira] [Updated] (SPARK-4452) Enhance Sort-based Shuffle to avoid spilling small files
[ https://issues.apache.org/jira/browse/SPARK-4452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tianshuo updated SPARK-4452: Description: When an Aggregator is used with ExternalSorter in a task, Spark will create many small files and can cause a "too many open files" error during merging. This happens when using the sort-based shuffle. The issue is caused by multiple factors: 1. There seems to be a bug in setting the elementsRead variable in ExternalSorter, which renders the trackMemoryThreshold (defined in Spillable) useless for triggering spilling; the PR to fix it is https://github.com/apache/spark/pull/3302 2. The current ShuffleMemoryManager does not work well when there are 2 spillable objects in a thread, which are ExternalSorter and ExternalAppendOnlyMap (used by Aggregator) in this case. Here is an example: due to the use of map-side aggregation, ExternalAppendOnlyMap is created first to read the RDD. It may ask for as much memory as it can, which is totalMem/numberOfThreads. Then later on, when ExternalSorter is created in the same thread, the ShuffleMemoryManager can refuse to allocate more memory to it, since the memory has already been given to the previously requesting object (ExternalAppendOnlyMap). That causes the ExternalSorter to keep spilling small files (due to the lack of memory). I'm currently working on a PR to address these two issues. It will include the following changes: 1. The ShuffleMemoryManager should not only track the memory usage for each thread, but also the object that holds the memory. 2. The ShuffleMemoryManager should be able to trigger the spilling of a spillable object. In this way, if a new object in a thread is requesting memory, the old occupant can be evicted/spilled. This prevents problem 2 from happening. Previously, spillable objects trigger spilling by themselves, so one may not trigger spilling even if another object in the same thread needs more memory; after this change, the ShuffleMemoryManager can trigger the spilling of an object when it needs to. 3. Make the iterator of ExternalAppendOnlyMap spillable. Previously, ExternalAppendOnlyMap returns a destructive iterator that cannot be spilled after the iterator is returned. This should be changed so that even after the iterator is returned, the ShuffleMemoryManager can still spill it. Currently, I have a working branch in progress: https://github.com/tsdeng/spark/tree/enhance_memory_manager Already made change 3 and have a prototype of changes 1 and 2 to evict spillables from the memory manager, still in progress. I will send a PR when it's done. Any feedback or thoughts on this change are highly appreciated! was: When an Aggregator is used with ExternalSorter in a task, spark will create many small files and could cause too many files open error during merging. This happens when using the sort-based shuffle. The issue is caused by multiple factors: 1. There seems to be a bug in setting the elementsRead variable in ExternalSorter, which renders the trackMemoryThreshold(defined in Spillable) useless for triggering spilling, the pr to fix it is https://github.com/apache/spark/pull/3302 2. Current ShuffleMemoryManager does not work well when there are 2 spillable objects in a thread, which are ExternalSorter and ExternalAppendOnlyMap(used by Aggregator) in this case. Here is an example: Due to the usage of mapside aggregation, ExternalAppendOnlyMap is created first to read the RDD. It may ask as much memory as it can, which is totalMem/numberOfThreads. Then later on whe ExternalSorter is created in the same thread, the ShuffleMemoryManager could refuse to allocate more memory to it, since the memory is already given to the previous requested object(ExternalAppendOnlyMap). That causes the ExternalSorter keeps spilling small files(due to the lack of memory) I'm currently working on a PR to address these two issues. It will include following changes 1. The ShuffleMemoryManager should not only track the memory usage for each thread, but also the object who holds the memory 2. The ShuffleMemoryManager should be able to trigger the spilling of a spillable object. In this way, if a new object in a thread is requesting memory, the old occupant could be evicted/spilled. This avoids problem 2 from happening. Previously spillable object triggers spilling by themself. So one may not trigger spilling even if another object in the same thread needs more memory. After this change The ShuffleMemoryManager could trigger the spilling of an object if it needs to 3. Make the iterator of ExternalAppendOnlyMap spillable. Previously ExternalAppendOnlyMap returns an destructive iterator and can not be spilled after the iterator is returned. This should be changed so that even after the iterator is returned, the ShuffleMemoryManager can still spill it. Currently, I have a working branch in progress:
[jira] [Created] (SPARK-4453) Simplify Parquet record filter generation
Cheng Lian created SPARK-4453: - Summary: Simplify Parquet record filter generation Key: SPARK-4453 URL: https://issues.apache.org/jira/browse/SPARK-4453 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.0, 1.0.0 Reporter: Cheng Lian The current Parquet record filter code uses {{CatalystFilter}} and its sub-classes to represent all Spark SQL Parquet filters. Essentially, these classes combine the original Catalyst predicate expression with the generated Parquet filter. {{ParquetFilters.findExpression}} then uses these classes to filter out all expressions that can be pushed down. However, this {{findExpression}} function is not necessary in the first place, since we already know whether a predicate can be pushed down or not while trying to generate its corresponding filter. With this in mind, the code size of Parquet record filter generation can be reduced significantly.
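Concretely, the simplification amounts to generating the Parquet filter directly and signalling pushability with Option, rather than wrapping predicates in {{CatalystFilter}} subclasses and rediscovering them with {{findExpression}}. A rough sketch of that shape, written against the 1.1-era Catalyst and parquet-mr APIs (the function itself is illustrative, not the actual patch):
{noformat}
import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.catalyst.types.IntegerType
import parquet.filter2.predicate.{FilterApi, FilterPredicate}

// Some(filter) when the predicate is pushable, None otherwise; no separate
// findExpression pass is needed.
def createFilter(predicate: Expression): Option[FilterPredicate] = predicate match {
  case LessThan(a: Attribute, Literal(v: Int, IntegerType)) =>
    Some(FilterApi.lt(FilterApi.intColumn(a.name), Int.box(v)))
  case And(left, right) =>
    for (l <- createFilter(left); r <- createFilter(right)) yield FilterApi.and(l, r)
  case _ => None // not pushable: evaluate in Spark after the scan
}
{noformat}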
[jira] [Commented] (SPARK-4213) SparkSQL - ParquetFilters - No support for LT, LTE, GT, GTE operators
[ https://issues.apache.org/jira/browse/SPARK-4213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214950#comment-14214950 ] Apache Spark commented on SPARK-4213: - User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/3317 SparkSQL - ParquetFilters - No support for LT, LTE, GT, GTE operators - Key: SPARK-4213 URL: https://issues.apache.org/jira/browse/SPARK-4213 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Environment: CDH5.2, Hive 0.13.1, Spark 1.2 snapshot (commit hash 76386e1a23c) Reporter: Terry Siu Priority: Blocker Fix For: 1.2.0 When I issue a hql query against a HiveContext where my predicate uses a column of string type with one of the LT, LTE, GT, or GTE operators, I get the following error:
{noformat}
scala.MatchError: StringType (of class org.apache.spark.sql.catalyst.types.StringType$)
{noformat}
Looking at the code in org.apache.spark.sql.parquet.ParquetFilters, StringType is absent from the corresponding functions for creating these filters. To reproduce, in a Hive 0.13.1 shell, I created the following table (at a specified DB):
{noformat}
create table sparkbug (
  id int,
  event string
) stored as parquet;
{noformat}
Insert some sample data:
{noformat}
insert into table sparkbug select 1, '2011-06-18' from <some table> limit 1;
insert into table sparkbug select 2, '2012-01-01' from <some table> limit 1;
{noformat}
Launch a spark shell and create a HiveContext to the metastore where the table above is located:
{noformat}
import org.apache.spark.sql._
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.hive.HiveContext

val hc = new HiveContext(sc)
hc.setConf("spark.sql.shuffle.partitions", "10")
hc.setConf("spark.sql.hive.convertMetastoreParquet", "true")
hc.setConf("spark.sql.parquet.compression.codec", "snappy")
import hc._
hc.hql("select * from db.sparkbug where event >= '2011-12-01'")
{noformat}
A scala.MatchError will appear in the output.
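The eventual fix is essentially to add the missing StringType cases, comparing strings as Parquet Binary values. A hedged sketch of the LT case using the parquet-mr filter2 API (the helper itself is illustrative, not the actual ParquetFilters code):
{noformat}
import parquet.filter2.predicate.{FilterApi, FilterPredicate}
import parquet.io.api.Binary

// LT over a string column; LTE/GT/GTE are analogous via ltEq/gt/gtEq.
def stringLessThan(columnName: String, value: String): FilterPredicate =
  FilterApi.lt(FilterApi.binaryColumn(columnName), Binary.fromString(value))
{noformat}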
[jira] [Commented] (SPARK-4453) Simplify Parquet record filter generation
[ https://issues.apache.org/jira/browse/SPARK-4453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214949#comment-14214949 ] Apache Spark commented on SPARK-4453: - User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/3317 Simplify Parquet record filter generation - Key: SPARK-4453 URL: https://issues.apache.org/jira/browse/SPARK-4453 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.0.0, 1.1.0 Reporter: Cheng Lian The current Parquet record filter code uses {{CatalystFilter}} and its sub-classes to represent all Spark SQL Parquet filters. Essentially, these classes combine the original Catalyst predicate expression with the generated Parquet filter. {{ParquetFilters.findExpression}} then uses these classes to filter out all expressions that can be pushed down. However, this {{findExpression}} function is not necessary in the first place, since we already know whether a predicate can be pushed down or not while trying to generate its corresponding filter. With this in mind, the code size of Parquet record filter generation can be reduced significantly.
[jira] [Commented] (SPARK-4452) Enhance Sort-based Shuffle to avoid spilling small files
[ https://issues.apache.org/jira/browse/SPARK-4452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214962#comment-14214962 ] tianshuo commented on SPARK-4452: - Originally, we found this problem by seeing a "too many open files" exception when running an SVD job on a big dataset, caused by ExternalSorter dumping too many small files. With the fixes mentioned above (with the prototype), the problem is resolved. I'm still working on finalizing the prototype. Enhance Sort-based Shuffle to avoid spilling small files Key: SPARK-4452 URL: https://issues.apache.org/jira/browse/SPARK-4452 Project: Spark Issue Type: Bug Reporter: tianshuo When an Aggregator is used with ExternalSorter in a task, Spark will create many small files and can cause a "too many open files" error during merging. This happens when using the sort-based shuffle. The issue is caused by multiple factors: 1. There seems to be a bug in setting the elementsRead variable in ExternalSorter, which renders the trackMemoryThreshold (defined in Spillable) useless for triggering spilling; the PR to fix it is https://github.com/apache/spark/pull/3302 2. The current ShuffleMemoryManager does not work well when there are 2 spillable objects in a thread, which are ExternalSorter and ExternalAppendOnlyMap (used by Aggregator) in this case. Here is an example: due to the use of map-side aggregation, ExternalAppendOnlyMap is created first to read the RDD. It may ask for as much memory as it can, which is totalMem/numberOfThreads. Then later on, when ExternalSorter is created in the same thread, the ShuffleMemoryManager can refuse to allocate more memory to it, since the memory has already been given to the previously requesting object (ExternalAppendOnlyMap). That causes the ExternalSorter to keep spilling small files (due to the lack of memory). I'm currently working on a PR to address these two issues. It will include the following changes: 1. The ShuffleMemoryManager should not only track the memory usage for each thread, but also the object that holds the memory. 2. The ShuffleMemoryManager should be able to trigger the spilling of a spillable object. In this way, if a new object in a thread is requesting memory, the old occupant can be evicted/spilled. This prevents problem 2 from happening. Previously, spillable objects trigger spilling by themselves, so one may not trigger spilling even if another object in the same thread needs more memory; after this change, the ShuffleMemoryManager can trigger the spilling of an object when it needs to. 3. Make the iterator of ExternalAppendOnlyMap spillable. Previously, ExternalAppendOnlyMap returns a destructive iterator that cannot be spilled after the iterator is returned. This should be changed so that even after the iterator is returned, the ShuffleMemoryManager can still spill it. Currently, I have a working branch in progress: https://github.com/tsdeng/spark/tree/enhance_memory_manager Already made change 3 and have a prototype of changes 1 and 2 to evict spillables from the memory manager, still in progress. I will send a PR when it's done. Any feedback or thoughts on this change are highly appreciated!
[jira] [Created] (SPARK-4454) Race condition in DAGScheduler
Rafal Kwasny created SPARK-4454: --- Summary: Race condition in DAGScheduler Key: SPARK-4454 URL: https://issues.apache.org/jira/browse/SPARK-4454 Project: Spark Issue Type: Bug Affects Versions: 1.1.0 Reporter: Rafal Kwasny Priority: Minor There seems to be a race condition in DAGScheduler that manifests on jobs with high concurrency:
{noformat}
Exception in thread "main" java.util.NoSuchElementException: key not found: 35
    at scala.collection.MapLike$class.default(MapLike.scala:228)
    at scala.collection.AbstractMap.default(Map.scala:58)
    at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
    at org.apache.spark.scheduler.DAGScheduler.getCacheLocs(DAGScheduler.scala:201)
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1292)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply$mcVI$sp(DAGScheduler.scala:1307)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306)
    at scala.collection.immutable.List.foreach(List.scala:318)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1306)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1304)
    at scala.collection.immutable.List.foreach(List.scala:318)
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1304)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply$mcVI$sp(DAGScheduler.scala:1307)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306)
    at scala.collection.immutable.List.foreach(List.scala:318)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1306)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1304)
    at scala.collection.immutable.List.foreach(List.scala:318)
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1304)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply$mcVI$sp(DAGScheduler.scala:1307)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306)
    at scala.collection.immutable.List.foreach(List.scala:318)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1306)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1304)
    at scala.collection.immutable.List.foreach(List.scala:318)
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1304)
    at org.apache.spark.scheduler.DAGScheduler.getPreferredLocs(DAGScheduler.scala:1275)
    at org.apache.spark.SparkContext.getPreferredLocs(SparkContext.scala:937)
    at org.apache.spark.rdd.PartitionCoalescer.currPrefLocs(CoalescedRDD.scala:175)
    at org.apache.spark.rdd.PartitionCoalescer$LocationIterator$$anonfun$4$$anonfun$apply$2.apply(CoalescedRDD.scala:192)
    at org.apache.spark.rdd.PartitionCoalescer$LocationIterator$$anonfun$4$$anonfun$apply$2.apply(CoalescedRDD.scala:191)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
[jira] [Updated] (SPARK-4454) Race condition in DAGScheduler
[ https://issues.apache.org/jira/browse/SPARK-4454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rafal Kwasny updated SPARK-4454: Description: It seems to be a race condition in DAGScheduler that manifests on jobs with high concurrency: {noformat} Exception in thread main java.util.NoSuchElementException: key not found: 35 at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:58) at scala.collection.mutable.HashMap.apply(HashMap.scala:64) at org.apache.spark.scheduler.DAGScheduler.getCacheLocs(DAGScheduler.scala:201) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1292) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply$mcVI$sp(DAGScheduler.scala:1307) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1306) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1304) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1304) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply$mcVI$sp(DAGScheduler.scala:1307) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1306) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1304) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1304) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply$mcVI$sp(DAGScheduler.scala:1307) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306) at scala.collection.immutable.List.foreach(List.scala:318) at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1306) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1304) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1304) at org.apache.spark.scheduler.DAGScheduler.getPreferredLocs(DAGScheduler.scala:1275) at org.apache.spark.SparkContext.getPreferredLocs(SparkContext.scala:937) at org.apache.spark.rdd.PartitionCoalescer.currPrefLocs(CoalescedRDD.scala:175) at org.apache.spark.rdd.PartitionCoalescer$LocationIterator$$anonfun$4$$anonfun$apply$2.apply(CoalescedRDD.scala:192) at org.apache.spark.rdd.PartitionCoalescer$LocationIterator$$anonfun$4$$anonfun$apply$2.apply(CoalescedRDD.scala:191) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:350) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:350) at
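The trace above bottoms out in DAGScheduler.getCacheLocs, where a shared mutable HashMap is read with a bare apply() while another thread may be mutating it. A minimal sketch of the usual remedy for this class of race, with illustrative names only (LocCache is not a Spark class):
{code}
import scala.collection.mutable

// Illustrative only: a shared rddId -> preferred-locations map that is safe to
// touch from both the scheduler event loop and job-submitting threads.
class LocCache {
  private val cacheLocs = mutable.HashMap[Int, Seq[String]]()

  // getOrElseUpdate under a lock avoids the bare apply() that throws
  // "key not found" when another thread removes the entry in between.
  def get(rddId: Int, compute: => Seq[String]): Seq[String] =
    cacheLocs.synchronized { cacheLocs.getOrElseUpdate(rddId, compute) }

  def invalidate(rddId: Int): Unit =
    cacheLocs.synchronized { cacheLocs.remove(rddId) }
}
{code}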
[jira] [Updated] (SPARK-2811) update algebird to 0.8.1
[ https://issues.apache.org/jira/browse/SPARK-2811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2811: --- Assignee: Adam Pingel update algebird to 0.8.1 Key: SPARK-2811 URL: https://issues.apache.org/jira/browse/SPARK-2811 Project: Spark Issue Type: Sub-task Components: Build, Spark Core Reporter: Anand Avati Assignee: Adam Pingel First algebird_2.11 0.8.1 has to be released -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2811) update algebird to 0.8.1
[ https://issues.apache.org/jira/browse/SPARK-2811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-2811. Resolution: Fixed Fix Version/s: 1.2.0 update algebird to 0.8.1 Key: SPARK-2811 URL: https://issues.apache.org/jira/browse/SPARK-2811 Project: Spark Issue Type: Sub-task Components: Build, Spark Core Reporter: Anand Avati Assignee: Adam Pingel Fix For: 1.2.0 First algebird_2.11 0.8.1 has to be released -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4452) Enhance Sort-based Shuffle to avoid spilling small files
[ https://issues.apache.org/jira/browse/SPARK-4452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tianshuo updated SPARK-4452: Description: When an Aggregator is used with ExternalSorter in a task, Spark will create many small files and can run into a too many open files error during merging. This happens when using the sort-based shuffle. The issue is caused by multiple factors: 1. There seems to be a bug in setting the elementsRead variable in ExternalSorter, which renders the trackMemoryThreshold (defined in Spillable) useless for triggering spilling; the PR to fix it is https://github.com/apache/spark/pull/3302 2. The current ShuffleMemoryManager does not work well when there are two spillable objects in a thread, which in this case are ExternalSorter and ExternalAppendOnlyMap (used by Aggregator). Here is an example: due to the use of map-side aggregation, ExternalAppendOnlyMap is created first to read the RDD. It may ask for as much memory as it can, which is totalMem/numberOfThreads. When ExternalSorter is later created in the same thread, the ShuffleMemoryManager may refuse to allocate more memory to it, since the memory has already been given to the previously requesting object (ExternalAppendOnlyMap). That causes the ExternalSorter to keep spilling small files (due to the lack of memory). I'm currently working on a PR to address these two issues. It will include the following changes: 1. The ShuffleMemoryManager should track not only the memory usage of each thread, but also which object holds the memory. 2. The ShuffleMemoryManager should be able to trigger the spilling of a spillable object. In this way, if a new object in a thread requests memory, the old occupant can be evicted/spilled, which prevents problem 2 from happening. Previously a spillable object triggered spilling by itself, so one object might not trigger spilling even if another object in the same thread needed more memory. After this change, the ShuffleMemoryManager can trigger the spilling of an object whenever it needs to. 3. Make the iterator of ExternalAppendOnlyMap spillable. Previously ExternalAppendOnlyMap returned a destructive iterator and could not be spilled after the iterator was returned. This should be changed so that even after the iterator is returned, the ShuffleMemoryManager can still spill it. Currently, I have a working branch in progress: https://github.com/tsdeng/spark/tree/enhance_memory_manager Change 3 is already made, and a prototype of changes 1 and 2 (evicting a spillable from the memory manager) is in progress. I will send a PR when it's done. Any feedback or thoughts on this change are highly appreciated!
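A rough sketch of what changes 1 and 2 could look like, under assumed, simplified names (the Spillable trait and TrackingMemoryManager below are stand-ins, not the real Spark classes): the manager records which object holds each grant and, before refusing a new request, may call back into an older occupant to make it spill.
{code}
// Sketch only: per-object tracking (change 1) plus manager-triggered
// spilling of older occupants (change 2).
trait Spillable { def spill(): Long } // returns bytes freed

class TrackingMemoryManager(maxMemory: Long) {
  private var granted = Map.empty[Spillable, Long]

  def tryAcquire(who: Spillable, bytes: Long): Boolean = synchronized {
    var free = maxMemory - granted.values.sum
    val others = granted.keys.filter(_ != who).iterator
    while (free < bytes && others.hasNext) {
      val victim = others.next() // evict/spill the old occupant
      free += victim.spill()
      granted -= victim
    }
    if (free >= bytes) { granted += who -> (granted.getOrElse(who, 0L) + bytes); true }
    else false
  }
}
{code}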
[jira] [Created] (SPARK-4455) Exclude dependency on hbase-annotations module
Ted Yu created SPARK-4455: - Summary: Exclude dependency on hbase-annotations module Key: SPARK-4455 URL: https://issues.apache.org/jira/browse/SPARK-4455 Project: Spark Issue Type: Bug Reporter: Ted Yu As Patrick mentioned in the thread 'Has anyone else observed this build break?': The error I've seen is this when building the examples project: {code} spark-examples_2.10: Could not resolve dependencies for project org.apache.spark:spark-examples_2.10:jar:1.2.0-SNAPSHOT: Could not find artifact jdk.tools:jdk.tools:jar:1.7 at specified path /System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home/../lib/tools.jar {code} The reason for this error is that hbase-annotations uses a system-scoped dependency in its pom, and this doesn't work with certain JDK layouts such as the one provided on Mac OS: http://central.maven.org/maven2/org/apache/hbase/hbase-annotations/0.98.7-hadoop2/hbase-annotations-0.98.7-hadoop2.pom Since the hbase-annotations module is transitively brought in through other HBase modules, we should exclude it from the related modules. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
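For illustration, an exclusion of the kind the ticket proposes looks like this in a module's pom.xml; the hbase-client coordinates and the ${hbase.version} property below are assumptions, since the actual set of affected modules is in the linked PR:
{code}
<dependency>
  <groupId>org.apache.hbase</groupId>
  <artifactId>hbase-client</artifactId>
  <version>${hbase.version}</version>
  <exclusions>
    <!-- Drop the transitive, system-scoped hbase-annotations dependency. -->
    <exclusion>
      <groupId>org.apache.hbase</groupId>
      <artifactId>hbase-annotations</artifactId>
    </exclusion>
  </exclusions>
</dependency>
{code}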
[jira] [Resolved] (SPARK-4444) Drop VD type parameter from EdgeRDD
[ https://issues.apache.org/jira/browse/SPARK-4444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-4444. Resolution: Fixed Fix Version/s: 1.2.0 Drop VD type parameter from EdgeRDD --- Key: SPARK-4444 URL: https://issues.apache.org/jira/browse/SPARK-4444 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Ankur Dave Assignee: Ankur Dave Priority: Blocker Fix For: 1.2.0 Due to vertex attribute caching, EdgeRDD previously took two type parameters: ED and VD. However, this is an implementation detail that should not be exposed in the interface, so this PR drops the VD type parameter. This requires removing the filter method from the EdgeRDD interface, because it depends on vertex attribute caching. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4409) Additional (but limited) Linear Algebra Utils
[ https://issues.apache.org/jira/browse/SPARK-4409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14214999#comment-14214999 ] Apache Spark commented on SPARK-4409: - User 'brkyvz' has created a pull request for this issue: https://github.com/apache/spark/pull/3319 Additional (but limited) Linear Algebra Utils - Key: SPARK-4409 URL: https://issues.apache.org/jira/browse/SPARK-4409 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Burak Yavuz Priority: Minor This ticket is to discuss the addition of a very limited number of local matrix manipulation and generation methods that would be helpful in the further development of algorithms on top of BlockMatrix (SPARK-3974), such as Randomized SVD, and Multi Model Training (SPARK-1486). The proposed methods for addition are: For `Matrix` - map: maps the values in the matrix with a given function. Produces a new matrix. - update: the values in the matrix are updated with a given function. Occurs in place. Factory methods for `DenseMatrix`: - *zeros: Generate a matrix consisting of zeros - *ones: Generate a matrix consisting of ones - *eye: Generate an identity matrix - *rand: Generate a matrix consisting of i.i.d. uniform random numbers - *randn: Generate a matrix consisting of i.i.d. Gaussian random numbers - *diag: Generate a diagonal matrix from a supplied vector *These methods already exist in the factory methods for `Matrices`; however, for cases where a `DenseMatrix` is required, you constantly have to add `.asInstanceOf[DenseMatrix]` everywhere, which makes the code dirtier. I propose moving these functions to factory methods for `DenseMatrix`, where the output will be a `DenseMatrix`, and having the factory methods for `Matrices` call these functions directly and output a generic `Matrix`. Factory methods for `SparseMatrix`: - speye: Identity matrix in sparse format. Saves a ton of memory when dimensions are large, especially in Multi Model Training, where each row requires being multiplied by a scalar. - sprand: Generate a sparse matrix with a given density consisting of i.i.d. uniform random numbers. - sprandn: Generate a sparse matrix with a given density consisting of i.i.d. Gaussian random numbers. - diag: Generate a diagonal matrix from a supplied vector, but memory efficient, because it just stores the diagonal. Again, very helpful in Multi Model Training. Factory methods for `Matrices`: - Include all the factory methods given above, but return a generic `Matrix` rather than `SparseMatrix` or `DenseMatrix`. - horzCat: Horizontally concatenate matrices to form one larger matrix. Very useful in both Multi Model Training, and for the repartitioning of BlockMatrix. - vertCat: Vertically concatenate matrices to form one larger matrix. Very useful for the repartitioning of BlockMatrix. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
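A minimal sketch of the proposed factory-method layout, assuming MLlib's DenseMatrix(numRows, numCols, values) constructor; only zeros and eye are shown, and the object name is illustrative:
{code}
import org.apache.spark.mllib.linalg.DenseMatrix

// Sketch of the proposal (not the merged API): factories that return a
// DenseMatrix directly, so callers need no .asInstanceOf[DenseMatrix].
object DenseMatrixFactories {
  def zeros(numRows: Int, numCols: Int): DenseMatrix =
    new DenseMatrix(numRows, numCols, new Array[Double](numRows * numCols))

  def eye(n: Int): DenseMatrix = {
    val values = new Array[Double](n * n)
    var i = 0
    while (i < n) { values(i * n + i) = 1.0; i += 1 } // diagonal of a column-major square matrix
    new DenseMatrix(n, n, values)
  }
}
{code}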
[jira] [Commented] (SPARK-4452) Enhance Sort-based Shuffle to avoid spilling small files
[ https://issues.apache.org/jira/browse/SPARK-4452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14215018#comment-14215018 ] Sandy Ryza commented on SPARK-4452: --- I haven't thought the implications out fully, but it worries me that data structures wouldn't be in charge of their own spilling. It seems like for different data structures, different spilling patterns and sizes could be more efficient, and placing control in the hands of the shuffle memory manager negates this. Another fix (simpler but maybe less flexible) worth considering would be to define a static fraction of memory that goes to each of the aggregator and the external sorter. One way that could be implemented is that each time the aggregator gets memory, it allocates some for the sorter. Also, for SPARK-2926, we are working on a tiered merger that would avoid the 'too many open files' problem by only merging a limited number of files at once. Enhance Sort-based Shuffle to avoid spilling small files Key: SPARK-4452 URL: https://issues.apache.org/jira/browse/SPARK-4452 Project: Spark Issue Type: Bug Reporter: tianshuo -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3630) Identify cause of Kryo+Snappy PARSING_ERROR
[ https://issues.apache.org/jira/browse/SPARK-3630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14215019#comment-14215019 ] Ryan Williams commented on SPARK-3630: -- [~aash] I've not seen this since my previous reports; it could be that I just wasn't actually running 1.2.0. I'll let you know here if I see it again, thanks! Identify cause of Kryo+Snappy PARSING_ERROR --- Key: SPARK-3630 URL: https://issues.apache.org/jira/browse/SPARK-3630 Project: Spark Issue Type: Task Components: Spark Core Affects Versions: 1.1.0, 1.2.0 Reporter: Andrew Ash Assignee: Josh Rosen A recent GraphX commit caused non-deterministic exceptions in unit tests so it was reverted (see SPARK-3400). Separately, [~aash] observed the same exception stacktrace in an application-specific Kryo registrator: {noformat} com.esotericsoftware.kryo.KryoException: java.io.IOException: failed to uncompress the chunk: PARSING_ERROR(2) com.esotericsoftware.kryo.io.Input.fill(Input.java:142) com.esotericsoftware.kryo.io.Input.require(Input.java:169) com.esotericsoftware.kryo.io.Input.readInt(Input.java:325) com.esotericsoftware.kryo.io.Input.readFloat(Input.java:624) com.esotericsoftware.kryo.serializers.DefaultSerializers$FloatSerializer.read(DefaultSerializers.java:127) com.esotericsoftware.kryo.serializers.DefaultSerializers$FloatSerializer.read(DefaultSerializers.java:117) com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:109) com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:18) com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) ... {noformat} This ticket is to identify the cause of the exception in the GraphX commit so the faulty commit can be fixed and merged back into master. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4434) spark-submit cluster deploy mode JAR URLs are broken in 1.1.1
[ https://issues.apache.org/jira/browse/SPARK-4434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14215022#comment-14215022 ] Apache Spark commented on SPARK-4434: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/3320 spark-submit cluster deploy mode JAR URLs are broken in 1.1.1 - Key: SPARK-4434 URL: https://issues.apache.org/jira/browse/SPARK-4434 Project: Spark Issue Type: Bug Components: Deploy, Spark Core Affects Versions: 1.1.1, 1.2.0 Reporter: Josh Rosen Assignee: Andrew Or Priority: Blocker When submitting a driver using {{spark-submit}} in cluster mode, Spark 1.1.0 allowed you to omit the {{file://}} or {{hdfs://}} prefix from the application JAR URL, e.g. {code} ./bin/spark-submit --deploy-mode cluster --master spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi /Users/joshrosen/Documents/old-spark-releases/spark-1.1.0-bin-hadoop1/lib/spark-examples-1.1.0-hadoop1.0.4.jar {code} In Spark 1.1.1 and 1.2.0, this same command now fails with an error: {code} ./bin/spark-submit --deploy-mode cluster --master spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi /Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar Jar url 'file:/Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar' is not in valid format. Must be a jar file path in URL format (e.g. hdfs://XX.jar, file://XX.jar) Usage: DriverClient [options] launch <active-master> <jar-url> <main-class> [driver options] Usage: DriverClient kill <active-master> <driver-id> {code} I tried changing my URL to conform to the new format, but this either resulted in an error or a job that failed: {code} ./bin/spark-submit --deploy-mode cluster --master spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi file:///Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar Jar url 'file:///Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar' is not in valid format. Must be a jar file path in URL format (e.g. hdfs://XX.jar, file://XX.jar) {code} If I omit the extra slash: {code} ./bin/spark-submit --deploy-mode cluster --master spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi file://Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar Sending launch command to spark://joshs-mbp.att.net:7077 Driver successfully submitted as driver-20141116143235-0002 ... waiting before polling master for driver state ... 
polling master for driver state State of driver-20141116143235-0002 is ERROR Exception from cluster was: java.lang.IllegalArgumentException: Wrong FS: file://Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar, expected: file:/// java.lang.IllegalArgumentException: Wrong FS: file://Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar, expected: file:/// at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:381) at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:55) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:393) at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251) at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:329) at org.apache.spark.deploy.worker.DriverRunner.org$apache$spark$deploy$worker$DriverRunner$$downloadUserJar(DriverRunner.scala:157) at org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:74) {code} This bug effectively prevents users from using {{spark-submit}} in cluster mode to run drivers whose JARs are stored on shared cluster filesystems. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
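The three spellings behave differently because of how java.net.URI treats the authority component; a quick check (paths illustrative) shows why "file://Users/..." produces the "Wrong FS ... expected: file:///" error:
{code}
// In "file://Users/foo/app.jar", "Users" parses as the *host*, not a path segment.
val good = new java.net.URI("file:///Users/foo/app.jar")
println(good.getHost) // null
println(good.getPath) // /Users/foo/app.jar

val bad = new java.net.URI("file://Users/foo/app.jar")
println(bad.getHost)  // Users
println(bad.getPath)  // /foo/app.jar
{code}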
[jira] [Updated] (SPARK-4452) Enhance Sort-based Shuffle to avoid spilling small files
[ https://issues.apache.org/jira/browse/SPARK-4452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated SPARK-4452: -- Affects Version/s: 1.1.0 Enhance Sort-based Shuffle to avoid spilling small files Key: SPARK-4452 URL: https://issues.apache.org/jira/browse/SPARK-4452 Project: Spark Issue Type: Bug Affects Versions: 1.1.0 Reporter: tianshuo -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4455) Exclude dependency on hbase-annotations module
[ https://issues.apache.org/jira/browse/SPARK-4455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14215028#comment-14215028 ] Apache Spark commented on SPARK-4455: - User 'tedyu' has created a pull request for this issue: https://github.com/apache/spark/pull/3286 Exclude dependency on hbase-annotations module -- Key: SPARK-4455 URL: https://issues.apache.org/jira/browse/SPARK-4455 Project: Spark Issue Type: Bug Reporter: Ted Yu -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1358) Continuous integrated test should be involved in Spark ecosystem
[ https://issues.apache.org/jira/browse/SPARK-1358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14215042#comment-14215042 ] shane knapp commented on SPARK-1358: [~aash] -- depending on the hardware reqs, we might be able to make this happen. we have some extra hardware, so a small (3-5 server) dedicated spark e2e test cluster could be deployed. we're still in the process of building out our new infrastructure and will have a better idea of what we will have available for this really soon. Continuous integrated test should be involved in Spark ecosystem - Key: SPARK-1358 URL: https://issues.apache.org/jira/browse/SPARK-1358 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: xiajunluan Currently, Spark only contains unit tests and performance tests, but I think that is not enough for customers to evaluate the status of their cluster and the Spark version they will use; it is necessary to build continuous integrated tests for Spark development. These could include: 1. complex application test cases for Spark/Spark Streaming/GraphX 2. stress test cases 3. fault tolerance test cases 4.. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4452) Enhance Sort-based Shuffle to avoid spilling small files
[ https://issues.apache.org/jira/browse/SPARK-4452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14215047#comment-14215047 ] Sandy Ryza commented on SPARK-4452: --- A third possible fix would be to have the shuffle memory manager consider fairness in terms of spillable data structures instead of threads. Enhance Sort-based Shuffle to avoid spilling small files Key: SPARK-4452 URL: https://issues.apache.org/jira/browse/SPARK-4452 Project: Spark Issue Type: Bug Affects Versions: 1.1.0 Reporter: tianshuo -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4456) Document why spilling depends on both elements read and memory used
Sandy Ryza created SPARK-4456: - Summary: Document why spilling depends on both elements read and memory used Key: SPARK-4456 URL: https://issues.apache.org/jira/browse/SPARK-4456 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.0 Reporter: Sandy Ryza Priority: Minor It's not clear to me why the number of records should have an effect on when a data structure spills. It would probably be useful to document this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
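For context, the check in Spillable at the time was roughly of the following shape (paraphrased from memory, not a verbatim quote); the element count both delays the first spill check and amortizes the cost of estimating memory usage, which is presumably what the documentation should spell out:
{code}
// Approximate shape of the spill decision (paraphrased, not verbatim):
def shouldSpill(elementsRead: Long, currentMemory: Long, myMemoryThreshold: Long): Boolean = {
  val trackMemoryThreshold = 1000L // don't even check for the first N elements
  elementsRead > trackMemoryThreshold &&
    elementsRead % 32 == 0 &&          // re-estimate memory only every 32 elements
    currentMemory >= myMemoryThreshold // spill only when over the current grant
}
{code}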
[jira] [Commented] (SPARK-4393) Memory leak in connection manager timeout thread
[ https://issues.apache.org/jira/browse/SPARK-4393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14215054#comment-14215054 ] Apache Spark commented on SPARK-4393: - User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/3321 Memory leak in connection manager timeout thread Key: SPARK-4393 URL: https://issues.apache.org/jira/browse/SPARK-4393 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.1.0 Reporter: Josh Rosen Assignee: Josh Rosen Priority: Blocker Fix For: 1.2.0 This JIRA tracks a fix for a memory leak in ConnectionManager's TimerTasks, originally reported in [a comment|https://issues.apache.org/jira/browse/SPARK-3633?focusedCommentId=14208318page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14208318] on SPARK-3633. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4457) Document how to build for Hadoop versions greater than 2.4
Sandy Ryza created SPARK-4457: - Summary: Document how to build for Hadoop versions greater than 2.4 Key: SPARK-4457 URL: https://issues.apache.org/jira/browse/SPARK-4457 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 1.2.0 Reporter: Sandy Ryza Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
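To my recollection, the convention the docs ended up describing is to reuse the newest Hadoop profile and override the version property; treat the exact flags below as an assumption to be verified against the merged documentation:
{code}
# Build against a Hadoop version newer than 2.4 by reusing the hadoop-2.4
# profile and overriding the version property (flags assumed, verify in docs):
mvn -Phadoop-2.4 -Dhadoop.version=2.5.0 -DskipTests clean package
{code}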
[jira] [Commented] (SPARK-3717) DecisionTree, RandomForest: Partition by feature
[ https://issues.apache.org/jira/browse/SPARK-3717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14215058#comment-14215058 ] Joseph K. Bradley commented on SPARK-3717: -- I took a look at the rowToColumnStore method. It may work, but I don't think it will scale well. The main problems are that it will create an enormous number of small objects, and that it does not take advantage of sparse vectors. It would be ideal to batch values and to maintain sparsity. I have an implementation almost done which I can push soon. DecisionTree, RandomForest: Partition by feature Key: SPARK-3717 URL: https://issues.apache.org/jira/browse/SPARK-3717 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley h1. Summary Currently, data are partitioned by row/instance for DecisionTree and RandomForest. This JIRA argues for partitioning by feature for training deep trees. This is especially relevant for random forests, which are often trained to be deeper than single decision trees. h1. Details Dataset dimensions and the depth of the tree to be trained are the main problem parameters determining whether it is better to partition features or instances. For random forests (training many deep trees), partitioning features could be much better. Notation: * P = # workers * N = # instances * M = # features * D = depth of tree h2. Partitioning Features Algorithm sketch: * Each worker stores: ** a subset of columns (i.e., a subset of features). If a worker stores feature j, then the worker stores the feature value for all instances (i.e., the whole column). ** all labels * Train one level at a time. * Invariants: ** Each worker stores a mapping: instance → node in current level * On each iteration: ** Each worker: For each node in level, compute (best feature to split, info gain). ** Reduce (P x M) values to M values to find best split for each node. ** Workers who have features used in best splits communicate left/right for relevant instances. Gather total of N bits to master, then broadcast. * Total communication: ** Depth D iterations ** On each iteration, reduce to M values (~8 bytes each), broadcast N values (1 bit each). ** Estimate: D * (M * 8 + N) h2. Partitioning Instances Algorithm sketch: * Train one group of nodes at a time. * Invariants: * Each worker stores a mapping: instance → node * On each iteration: ** Each worker: For each instance, add to aggregate statistics. ** Aggregate is of size (# nodes in group) x M x (# bins) x (# classes) *** (“# classes” is for classification. 3 for regression) ** Reduce aggregate. ** Master chooses best split for each node in group and broadcasts. * Local training: Once all instances for a node fit on one machine, it can be best to shuffle data and training subtrees locally. This can mean shuffling the entire dataset for each tree trained. * Summing over all iterations, reduce to total of: ** (# nodes in tree) x M x (# bins B) x (# classes C) values (~8 bytes each) ** Estimate: 2^D * M * B * C * 8 h2. 
Comparing Partitioning Methods Partitioning features costs less than partitioning instances when: * D * (M * 8 + N) < 2^D * M * B * C * 8 * D * N < 2^D * M * B * C * 8 (assuming D * M * 8 is small compared to the right hand side) * N < [ 2^D * M * B * C * 8 ] / D Example: many instances: * 2 million instances, 3500 features, 100 bins, 5 classes, 6 levels (depth = 5) * Partitioning features: 6 * ( 3500 * 8 + 2*10^6 ) =~ 1.2 * 10^7 * Partitioning instances: 32 * 3500 * 100 * 5 * 8 =~ 4.5 * 10^8 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
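A quick check of the example arithmetic above, with all values taken straight from the ticket (levels = 6, depth = 5, so 2^depth = 32):
{code}
val N = 2000000L // instances
val M = 3500L    // features
val B = 100L     // bins
val C = 5L       // classes
val byFeatures  = 6 * (M * 8 + N)    // = 12,168,000  ~ 1.2 * 10^7
val byInstances = 32 * M * B * C * 8 // = 448,000,000 ~ 4.5 * 10^8
{code}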
[jira] [Commented] (SPARK-4457) Document how to build for Hadoop versions greater than 2.4
[ https://issues.apache.org/jira/browse/SPARK-4457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14215060#comment-14215060 ] Apache Spark commented on SPARK-4457: - User 'sryza' has created a pull request for this issue: https://github.com/apache/spark/pull/3322 Document how to build for Hadoop versions greater than 2.4 -- Key: SPARK-4457 URL: https://issues.apache.org/jira/browse/SPARK-4457 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 1.2.0 Reporter: Sandy Ryza Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4434) spark-submit cluster deploy mode JAR URLs are broken in 1.1.1
[ https://issues.apache.org/jira/browse/SPARK-4434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14215061#comment-14215061 ] Apache Spark commented on SPARK-4434: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/3320 spark-submit cluster deploy mode JAR URLs are broken in 1.1.1 - Key: SPARK-4434 URL: https://issues.apache.org/jira/browse/SPARK-4434 Project: Spark Issue Type: Bug Components: Deploy, Spark Core Affects Versions: 1.1.1, 1.2.0 Reporter: Josh Rosen Assignee: Andrew Or Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4439) Expose RandomForest in Python
[ https://issues.apache.org/jira/browse/SPARK-4439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14215067#comment-14215067 ] Apache Spark commented on SPARK-4439: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/3320 Expose RandomForest in Python - Key: SPARK-4439 URL: https://issues.apache.org/jira/browse/SPARK-4439 Project: Spark Issue Type: New Feature Components: MLlib, PySpark Reporter: Matei Zaharia -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4458) Skip compilation of test classes when using make-distribution
Tathagata Das created SPARK-4458: Summary: Skip compilation of test classes when using make-distribution Key: SPARK-4458 URL: https://issues.apache.org/jira/browse/SPARK-4458 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 1.1.0, 1.0.0 Reporter: Tathagata Das Assignee: Tathagata Das Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4458) Skip compilation of test classes when using make-distribution
[ https://issues.apache.org/jira/browse/SPARK-4458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-4458: - The make-distribution script generates Spark distributions and therefore does not require building the test classes. Maven's -DskipTests only skips running the tests, but does not skip compiling them (see [Maven docs|http://maven.apache.org/surefire/maven-surefire-plugin/test-mojo.html#skipTests]). To skip compiling the tests as well (which reduces compilation time), we need to add -Dmaven.test.skip (again, see [Maven docs|http://maven.apache.org/surefire/maven-surefire-plugin/test-mojo.html#skip]). Skip compilation of test classes when using make-distribution -- Key: SPARK-4458 URL: https://issues.apache.org/jira/browse/SPARK-4458 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 1.0.0, 1.1.0 Reporter: Tathagata Das Assignee: Tathagata Das Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
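The distinction in one line each (these are standard Maven/Surefire flags):
{code}
mvn -DskipTests clean package             # compiles test classes, skips running them
mvn -Dmaven.test.skip=true clean package  # skips compiling *and* running tests
{code}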
[jira] [Commented] (SPARK-4458) Skip compilation of test classes when using make-distribution
[ https://issues.apache.org/jira/browse/SPARK-4458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14215087#comment-14215087 ] Apache Spark commented on SPARK-4458: - User 'tdas' has created a pull request for this issue: https://github.com/apache/spark/pull/3324 Skip compilation of test classes when using make-distribution -- Key: SPARK-4458 URL: https://issues.apache.org/jira/browse/SPARK-4458 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 1.0.0, 1.1.0 Reporter: Tathagata Das Assignee: Tathagata Das Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4266) Avoid expensive JavaScript for StagePages with huge numbers of tasks
[ https://issues.apache.org/jira/browse/SPARK-4266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout updated SPARK-4266: -- Assignee: Kay Ousterhout Avoid expensive JavaScript for StagePages with huge numbers of tasks Key: SPARK-4266 URL: https://issues.apache.org/jira/browse/SPARK-4266 Project: Spark Issue Type: Bug Components: Web UI Reporter: Kay Ousterhout Assignee: Kay Ousterhout Priority: Critical Some of the new JavaScript added to handle hiding metrics significantly slows the page load for stages with a lot of tasks (e.g., for a job with 10K tasks, it took over a minute for the page to finish loading in Chrome on my laptop). There are at least two issues here: (1) The new table-striping JavaScript is much slower than the old CSS. The fancier JavaScript is only needed for the stage summary table, so we should change the task table back to using CSS so that it doesn't slow the page load for jobs with lots of tasks. (2) The JavaScript associated with hiding metrics is expensive when jobs have lots of tasks, I think because the jQuery selectors have to traverse a much larger DOM. The ID selectors are much more efficient, so we should consider switching to these, and/or avoiding this code in additional-metrics.js: $("input:checkbox:not(:checked)").each(function() { var column = "table ." + $(this).attr("name"); $(column).hide(); }); by initially hiding the data when we generate the page in the render function instead, which should be easy to do. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4180) SparkContext constructor should throw exception if another SparkContext is already running
[ https://issues.apache.org/jira/browse/SPARK-4180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4180: --- Fix Version/s: 1.2.0 SparkContext constructor should throw exception if another SparkContext is already running -- Key: SPARK-4180 URL: https://issues.apache.org/jira/browse/SPARK-4180 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Josh Rosen Priority: Blocker Fix For: 1.2.0 Spark does not currently support multiple concurrently-running SparkContexts in the same JVM (see SPARK-2243). Therefore, SparkContext's constructor should throw an exception if there is an active SparkContext that has not been shut down via {{stop()}}. PySpark already does this, but the Scala SparkContext should do the same thing. The current behavior with multiple active contexts is unspecified / not understood and it may be the source of confusing errors (see the user error report in SPARK-4080, for example). This should be pretty easy to add: just add a {{activeSparkContext}} field to the SparkContext companion object and {{synchronize}} on it in the constructor and {{stop()}} methods; see PySpark's {{context.py}} file for an example of this approach. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
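A minimal sketch of the companion-object guard described above; the names (active, lock, SparkContextGuard) are illustrative, not the merged implementation:
{code}
// Sketch only: a JVM-wide guard consulted by the constructor and stop().
object SparkContextGuard {
  private val lock = new Object()
  private var active: Option[String] = None // stand-in for the running context

  def onConstruct(contextName: String): Unit = lock.synchronized {
    if (active.isDefined)
      throw new IllegalStateException(
        "Only one SparkContext may be running in this JVM; call stop() first")
    active = Some(contextName)
  }

  def onStop(): Unit = lock.synchronized { active = None }
}
{code}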
[jira] [Created] (SPARK-4459) JavaRDDLike.groupBy[K](f: JFunction[T, K]) may fail with typechecking errors
Alok Saldanha created SPARK-4459: Summary: JavaRDDLike.groupBy[K](f: JFunction[T, K]) may fail with typechecking errors Key: SPARK-4459 URL: https://issues.apache.org/jira/browse/SPARK-4459 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.1.0, 1.0.2 Reporter: Alok Saldanha I believe this issue is essentially the same as SPARK-668. Original error: {code} [ERROR] /Users/saldaal1/workspace/JavaSparkSimpleApp/src/main/java/SimpleApp.java:[29,105] no suitable method found for groupBy(org.apache.spark.api.java.function.Function<scala.Tuple2<java.lang.String,java.lang.Long>,java.lang.Long>) [ERROR] method org.apache.spark.api.java.JavaPairRDD.<K>groupBy(org.apache.spark.api.java.function.Function<scala.Tuple2<K,java.lang.Long>,K>) is not applicable [ERROR] (inferred type does not conform to equality constraint(s) {code} from core/src/main/scala/org/apache/spark/api/java/JavaRDDLike.scala {code} /** * Return an RDD of grouped elements. Each group consists of a key and a sequence of elements * mapping to that key. */ def groupBy[K](f: JFunction[T, K]): JavaPairRDD[K, JIterable[T]] = { implicit val ctagK: ClassTag[K] = fakeClassTag implicit val ctagV: ClassTag[JList[T]] = fakeClassTag JavaPairRDD.fromRDD(groupByResultToJava(rdd.groupBy(f)(fakeClassTag))) } {code} Then in core/src/main/scala/org/apache/spark/api/java/JavaPairRDD.scala: {code} class JavaPairRDD[K, V](val rdd: RDD[(K, V)]) (implicit val kClassTag: ClassTag[K], implicit val vClassTag: ClassTag[V]) extends JavaRDDLike[(K, V), JavaPairRDD[K, V]] { {code} The problem is that the type parameter T in JavaRDDLike is Tuple2[K,V], which means the combined signature for groupBy in the JavaPairRDD is {code} groupBy[K](f: JFunction[Tuple2[K,V], K]) {code} which imposes an unfortunate correlation between the Tuple2 and the return type of the grouping function, namely that the return type of the grouping function must be the same as the first type of the JavaPairRDD. If we compare the method signature to flatMap: {code} /** * Return a new RDD by first applying a function to all elements of this * RDD, and then flattening the results. */ def flatMap[U](f: FlatMapFunction[T, U]): JavaRDD[U] = { import scala.collection.JavaConverters._ def fn = (x: T) => f.call(x).asScala JavaRDD.fromRDD(rdd.flatMap(fn)(fakeClassTag[U]))(fakeClassTag[U]) } {code} we see there should be an easy fix by changing the type parameter of the groupBy function from K to U. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
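The suggested fix, sketched against the quoted JavaRDDLike code with a fresh type parameter U (mirroring the flatMap signature above); this is the reporter's proposal, not necessarily the change that was eventually merged:
{code}
// Decouple the grouping key from JavaPairRDD's K by introducing U:
def groupBy[U](f: JFunction[T, U]): JavaPairRDD[U, JIterable[T]] = {
  implicit val ctagU: ClassTag[U] = fakeClassTag
  implicit val ctagV: ClassTag[JList[T]] = fakeClassTag
  JavaPairRDD.fromRDD(groupByResultToJava(rdd.groupBy(f)(fakeClassTag)))
}
{code}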
[jira] [Commented] (SPARK-4459) JavaRDDLike.groupBy[K](f: JFunction[T, K]) may fail with typechecking errors
[ https://issues.apache.org/jira/browse/SPARK-4459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14215202#comment-14215202 ] Alok Saldanha commented on SPARK-4459: -- I created a standalone gist to demonstrate the problem, please see https://gist.github.com/alokito/40878fc25af21984463f JavaRDDLike.groupBy[K](f: JFunction[T, K]) may fail with typechecking errors Key: SPARK-4459 URL: https://issues.apache.org/jira/browse/SPARK-4459 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.0.2, 1.1.0 Reporter: Alok Saldanha -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4452) Enhance Sort-based Shuffle to avoid spilling small files
[ https://issues.apache.org/jira/browse/SPARK-4452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14215222#comment-14215222 ] tianshuo commented on SPARK-4452: - Hi, [~sandyr]: Your concern about data structures wouldn' be in charge of their spilling is legit. That's why I'm trying to make a incremental change: 1. The data structure still asks the ShuffleMemoryManager and decides if to spill itself. 2. But ShuffleMemoryManager can also trigger the spill of an object if the memory quota of a thread is used up. 2 happens as a last resort when memory is not enough for the requesting object. Also as you mentioned in the third solution, if the shuffle manager consider fairness among objs,it has to have a way to trigger the spilling of an object in a situation where current allocation is not fair. The memory manager has more of a global knowledge about memory allocation, so giving spilling ability to the manager could lead to more optimal memory allocation. The the spilling can only be triggered from the object itself, like currently, one obj may not be aware of the memory usage of other objs and keep holding the memory. My point is the data structure should be able to trigger spilling by itself, but it should also be able to handle when shuffleManager asks it to spill. I'm also considering the obj can reject to spill itself do address the concern you mentioned. Enhance Sort-based Shuffle to avoid spilling small files Key: SPARK-4452 URL: https://issues.apache.org/jira/browse/SPARK-4452 Project: Spark Issue Type: Bug Affects Versions: 1.1.0 Reporter: tianshuo When an Aggregator is used with ExternalSorter in a task, spark will create many small files and could cause too many files open error during merging. This happens when using the sort-based shuffle. The issue is caused by multiple factors: 1. There seems to be a bug in setting the elementsRead variable in ExternalSorter, which renders the trackMemoryThreshold(defined in Spillable) useless for triggering spilling, the pr to fix it is https://github.com/apache/spark/pull/3302 2. Current ShuffleMemoryManager does not work well when there are 2 spillable objects in a thread, which are ExternalSorter and ExternalAppendOnlyMap(used by Aggregator) in this case. Here is an example: Due to the usage of mapside aggregation, ExternalAppendOnlyMap is created first to read the RDD. It may ask as much memory as it can, which is totalMem/numberOfThreads. Then later on when ExternalSorter is created in the same thread, the ShuffleMemoryManager could refuse to allocate more memory to it, since the memory is already given to the previous requested object(ExternalAppendOnlyMap). That causes the ExternalSorter keeps spilling small files(due to the lack of memory) I'm currently working on a PR to address these two issues. It will include following changes 1. The ShuffleMemoryManager should not only track the memory usage for each thread, but also the object who holds the memory 2. The ShuffleMemoryManager should be able to trigger the spilling of a spillable object. In this way, if a new object in a thread is requesting memory, the old occupant could be evicted/spilled. This avoids problem 2 from happening. Previously spillable object triggers spilling by themself. So one may not trigger spilling even if another object in the same thread needs more memory. After this change The ShuffleMemoryManager could trigger the spilling of an object if it needs to 3. Make the iterator of ExternalAppendOnlyMap spillable. 
Previously, ExternalAppendOnlyMap returned a destructive iterator and could not be spilled after the iterator was returned. This should be changed so that even after the iterator is returned, the ShuffleMemoryManager can still spill it. Currently, I have a working branch in progress: https://github.com/tsdeng/spark/tree/enhance_memory_manager It already implements change 3 and has a prototype of changes 1 and 2 to evict spillables from the memory manager; work is still in progress. I will send a PR when it's done. Any feedback or thoughts on this change are highly appreciated! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
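To make the proposal concrete, here is a minimal sketch of changes 1 and 2: per-object bookkeeping plus manager-initiated spilling as a last resort. All names here (Spillable, PerObjectMemoryManager, tryToAcquire) are hypothetical and only illustrate the idea; they are not Spark's actual ShuffleMemoryManager API.
{code}
import scala.collection.mutable

// Hypothetical interface: an owner that can be asked to free memory.
// Returning 0 means the object declined to spill (the "reject" option
// discussed in the comment above).
trait Spillable {
  def spill(): Long
}

// Tracks memory per owning object (not just per thread) and, when a new
// request cannot be satisfied, asks earlier occupants to spill.
class PerObjectMemoryManager(maxMemory: Long) {
  private val held = mutable.LinkedHashMap.empty[Spillable, Long]

  def tryToAcquire(owner: Spillable, numBytes: Long): Long = synchronized {
    var free = maxMemory - held.values.sum
    // Last resort: ask older occupants to spill until the request fits.
    for (victim <- held.keys.filter(_ ne owner).toList if free < numBytes) {
      val released = victim.spill()
      if (released > 0) {
        held(victim) = math.max(0L, held(victim) - released)
        free += released
      }
    }
    val granted = math.max(0L, math.min(numBytes, free))
    if (granted > 0) held(owner) = held.getOrElse(owner, 0L) + granted
    granted
  }

  def releaseAll(owner: Spillable): Unit = synchronized { held.remove(owner) }
}
{code}
In this shape, an ExternalAppendOnlyMap that grabbed totalMem/numberOfThreads early could be forced to shrink when an ExternalSorter on the same thread later asks for memory, which is exactly the starvation case described above.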
[jira] [Commented] (SPARK-4452) Enhance Sort-based Shuffle to avoid spilling small files
[ https://issues.apache.org/jira/browse/SPARK-4452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14215233#comment-14215233 ] tianshuo commented on SPARK-4452: - Currently, the two instances of Spillable, ExternalSorter and ExternalAppendOnlyMap, do not seem to have complex logic for deciding when to spill; they both ask for as much memory as they can get. I also think this general strategy of asking for as much memory as possible applies to most cases, so I would guess it is better if the memory manager can trigger the spilling. Also, would making a spillable able to reject a spill request address your concern? Enhance Sort-based Shuffle to avoid spilling small files Key: SPARK-4452 URL: https://issues.apache.org/jira/browse/SPARK-4452 Project: Spark Issue Type: Bug Affects Versions: 1.1.0 Reporter: tianshuo When an Aggregator is used with ExternalSorter in a task, Spark will create many small files and can cause a "too many open files" error during merging. This happens when using the sort-based shuffle. The issue is caused by multiple factors: 1. There seems to be a bug in setting the elementsRead variable in ExternalSorter, which renders the trackMemoryThreshold (defined in Spillable) useless for triggering spilling; the PR to fix it is https://github.com/apache/spark/pull/3302 2. The current ShuffleMemoryManager does not work well when there are two spillable objects in a thread, which in this case are ExternalSorter and ExternalAppendOnlyMap (used by Aggregator). Here is an example: because of map-side aggregation, ExternalAppendOnlyMap is created first to read the RDD. It may ask for as much memory as it can get, which is totalMem/numberOfThreads. When ExternalSorter is later created in the same thread, the ShuffleMemoryManager can refuse to allocate more memory to it, since the memory has already been given to the previously requesting object (ExternalAppendOnlyMap). That causes the ExternalSorter to keep spilling small files (due to the lack of memory). I'm currently working on a PR to address these two issues. It will include the following changes: 1. The ShuffleMemoryManager should track not only the memory usage of each thread, but also which object holds the memory. 2. The ShuffleMemoryManager should be able to trigger the spilling of a spillable object. That way, if a new object in a thread requests memory, the old occupant can be evicted/spilled, which prevents problem 2 from happening. Previously, spillable objects triggered spilling by themselves, so one might not spill even when another object in the same thread needed more memory. After this change, the ShuffleMemoryManager can trigger the spilling of an object whenever it needs to. 3. Make the iterator of ExternalAppendOnlyMap spillable. Previously, ExternalAppendOnlyMap returned a destructive iterator and could not be spilled after the iterator was returned. This should be changed so that even after the iterator is returned, the ShuffleMemoryManager can still spill it. Currently, I have a working branch in progress: https://github.com/tsdeng/spark/tree/enhance_memory_manager It already implements change 3 and has a prototype of changes 1 and 2 to evict spillables from the memory manager; work is still in progress. I will send a PR when it's done. Any feedback or thoughts on this change are highly appreciated! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4460) RandomForest classification uses wrong threshold
Joseph K. Bradley created SPARK-4460: Summary: RandomForest classification uses wrong threshold Key: SPARK-4460 URL: https://issues.apache.org/jira/browse/SPARK-4460 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley RandomForest was modified to use WeightedEnsembleModel, but it needs to use a different threshold for prediction than GradientBoosting does. Fix: WeightedEnsembleModel.scala:70 should use threshold 0.0 for boosting and 0.5 for forests. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
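The distinction matters because the two ensembles aggregate differently. The sketch below is illustrative only (a hypothetical predict helper, not the actual WeightedEnsembleModel.scala code): boosting thresholds a weighted sum of real-valued tree outputs at 0.0, while a forest thresholds the average 0/1 vote at 0.5.
{code}
object EnsemblePredictSketch {
  // treePredictions: per-tree outputs; treeWeights: per-tree weights.
  def predict(treePredictions: Seq[Double],
              treeWeights: Seq[Double],
              isBoosting: Boolean): Double = {
    val weightedSum =
      treePredictions.zip(treeWeights).map { case (p, w) => p * w }.sum
    if (isBoosting) {
      if (weightedSum > 0.0) 1.0 else 0.0 // boosting: threshold at 0.0
    } else {
      val avgVote = weightedSum / treeWeights.sum // forest: mean vote in [0, 1]
      if (avgVote > 0.5) 1.0 else 0.0 // forest: threshold at 0.5
    }
  }
}
{code}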
[jira] [Comment Edited] (SPARK-4452) Enhance Sort-based Shuffle to avoid spilling small files
[ https://issues.apache.org/jira/browse/SPARK-4452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14215222#comment-14215222 ] tianshuo edited comment on SPARK-4452 at 11/17/14 10:00 PM: Hi, [~sandyr]: Your concern that data structures would no longer be in charge of their own spilling is legitimate. That's why I'm trying to make an incremental change: 1. The data structure still asks the ShuffleMemoryManager for memory and decides whether to spill itself. 2. But the ShuffleMemoryManager can also trigger the spill of an object when the memory quota of a thread is used up. Case 2 happens only as a last resort, when there is not enough memory for the requesting object. Also, as you mentioned in the third solution, if the shuffle manager considers fairness among objects, it has to have a way to trigger the spilling of an object when the current allocation is not fair. The memory manager has more global knowledge about memory allocation, so giving the manager the ability to spill could lead to better memory allocation. If spilling can only be triggered by the object itself, as it is today, one object may not be aware of the memory usage of other objects and may keep holding memory. My point is that a data structure should be able to trigger spilling by itself, but it should also be able to handle the ShuffleMemoryManager asking it to spill. I'm also considering letting an object reject a request to spill itself, to address the concern you mentioned. was (Author: tianshuo): Hi, [~sandyr]: Your concern about data structures wouldn' be in charge of their spilling is legit. That's why I'm trying to make a incremental change: 1. The data structure still asks the ShuffleMemoryManager and decides if to spill itself. 2. But ShuffleMemoryManager can also trigger the spill of an object if the memory quota of a thread is used up. 2 happens as a last resort when memory is not enough for the requesting object. Also as you mentioned in the third solution, if the shuffle manager consider fairness among objs,it has to have a way to trigger the spilling of an object in a situation where current allocation is not fair. The memory manager has more of a global knowledge about memory allocation, so giving spilling ability to the manager could lead to more optimal memory allocation. The the spilling can only be triggered from the object itself, like currently, one obj may not be aware of the memory usage of other objs and keep holding the memory. My point is the data structure should be able to trigger spilling by itself, but it should also be able to handle when shuffleManager asks it to spill. I'm also considering the obj can reject to spill itself do address the concern you mentioned. Enhance Sort-based Shuffle to avoid spilling small files Key: SPARK-4452 URL: https://issues.apache.org/jira/browse/SPARK-4452 Project: Spark Issue Type: Bug Affects Versions: 1.1.0 Reporter: tianshuo When an Aggregator is used with ExternalSorter in a task, Spark will create many small files and can cause a "too many open files" error during merging. This happens when using the sort-based shuffle. The issue is caused by multiple factors: 1. There seems to be a bug in setting the elementsRead variable in ExternalSorter, which renders the trackMemoryThreshold (defined in Spillable) useless for triggering spilling; the PR to fix it is https://github.com/apache/spark/pull/3302 2. The current ShuffleMemoryManager does not work well when there are two spillable objects in a thread, which in this case are ExternalSorter and ExternalAppendOnlyMap (used by Aggregator). Here is an example: because of map-side aggregation, ExternalAppendOnlyMap is created first to read the RDD. It may ask for as much memory as it can get, which is totalMem/numberOfThreads. When ExternalSorter is later created in the same thread, the ShuffleMemoryManager can refuse to allocate more memory to it, since the memory has already been given to the previously requesting object (ExternalAppendOnlyMap). That causes the ExternalSorter to keep spilling small files (due to the lack of memory). I'm currently working on a PR to address these two issues. It will include the following changes: 1. The ShuffleMemoryManager should track not only the memory usage of each thread, but also which object holds the memory. 2. The ShuffleMemoryManager should be able to trigger the spilling of a spillable object. That way, if a new object in a thread requests memory, the old occupant can be evicted/spilled, which prevents problem 2 from happening. Previously, spillable objects triggered spilling by themselves, so one might not spill even when another object in the same thread needed more memory. After this change, the ShuffleMemoryManager can trigger the spilling of an object whenever it needs to. 3. Make the iterator of ExternalAppendOnlyMap spillable. Previously, ExternalAppendOnlyMap returned a destructive iterator and could not be spilled after the iterator was returned. This should be changed so that even after the iterator is returned, the ShuffleMemoryManager can still spill it. Currently, I have a working branch in progress: https://github.com/tsdeng/spark/tree/enhance_memory_manager It already implements change 3 and has a prototype of changes 1 and 2 to evict spillables from the memory manager; work is still in progress. I will send a PR when it's done. Any feedback or thoughts on this change are highly appreciated! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4452) Shuffle data structures can starve others on the same thread for memory
[ https://issues.apache.org/jira/browse/SPARK-4452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated SPARK-4452: -- Summary: Shuffle data structures can starve others on the same thread for memory (was: Enhance Sort-based Shuffle to avoid spilling small files) Shuffle data structures can starve others on the same thread for memory Key: SPARK-4452 URL: https://issues.apache.org/jira/browse/SPARK-4452 Project: Spark Issue Type: Bug Affects Versions: 1.1.0 Reporter: tianshuo When an Aggregator is used with ExternalSorter in a task, Spark will create many small files and can cause a "too many open files" error during merging. This happens when using the sort-based shuffle. The issue is caused by multiple factors: 1. There seems to be a bug in setting the elementsRead variable in ExternalSorter, which renders the trackMemoryThreshold (defined in Spillable) useless for triggering spilling; the PR to fix it is https://github.com/apache/spark/pull/3302 2. The current ShuffleMemoryManager does not work well when there are two spillable objects in a thread, which in this case are ExternalSorter and ExternalAppendOnlyMap (used by Aggregator). Here is an example: because of map-side aggregation, ExternalAppendOnlyMap is created first to read the RDD. It may ask for as much memory as it can get, which is totalMem/numberOfThreads. When ExternalSorter is later created in the same thread, the ShuffleMemoryManager can refuse to allocate more memory to it, since the memory has already been given to the previously requesting object (ExternalAppendOnlyMap). That causes the ExternalSorter to keep spilling small files (due to the lack of memory). I'm currently working on a PR to address these two issues. It will include the following changes: 1. The ShuffleMemoryManager should track not only the memory usage of each thread, but also which object holds the memory. 2. The ShuffleMemoryManager should be able to trigger the spilling of a spillable object. That way, if a new object in a thread requests memory, the old occupant can be evicted/spilled, which prevents problem 2 from happening. Previously, spillable objects triggered spilling by themselves, so one might not spill even when another object in the same thread needed more memory. After this change, the ShuffleMemoryManager can trigger the spilling of an object whenever it needs to. 3. Make the iterator of ExternalAppendOnlyMap spillable. Previously, ExternalAppendOnlyMap returned a destructive iterator and could not be spilled after the iterator was returned. This should be changed so that even after the iterator is returned, the ShuffleMemoryManager can still spill it. Currently, I have a working branch in progress: https://github.com/tsdeng/spark/tree/enhance_memory_manager It already implements change 3 and has a prototype of changes 1 and 2 to evict spillables from the memory manager; work is still in progress. I will send a PR when it's done. Any feedback or thoughts on this change are highly appreciated! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4452) Shuffle data structures can starve others on the same thread for memory
[ https://issues.apache.org/jira/browse/SPARK-4452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14215269#comment-14215269 ] Sandy Ryza commented on SPARK-4452: --- Updated the title to reflect the specific problem. Shuffle data structures can starve others on the same thread for memory Key: SPARK-4452 URL: https://issues.apache.org/jira/browse/SPARK-4452 Project: Spark Issue Type: Bug Affects Versions: 1.1.0 Reporter: tianshuo When an Aggregator is used with ExternalSorter in a task, Spark will create many small files and can cause a "too many open files" error during merging. This happens when using the sort-based shuffle. The issue is caused by multiple factors: 1. There seems to be a bug in setting the elementsRead variable in ExternalSorter, which renders the trackMemoryThreshold (defined in Spillable) useless for triggering spilling; the PR to fix it is https://github.com/apache/spark/pull/3302 2. The current ShuffleMemoryManager does not work well when there are two spillable objects in a thread, which in this case are ExternalSorter and ExternalAppendOnlyMap (used by Aggregator). Here is an example: because of map-side aggregation, ExternalAppendOnlyMap is created first to read the RDD. It may ask for as much memory as it can get, which is totalMem/numberOfThreads. When ExternalSorter is later created in the same thread, the ShuffleMemoryManager can refuse to allocate more memory to it, since the memory has already been given to the previously requesting object (ExternalAppendOnlyMap). That causes the ExternalSorter to keep spilling small files (due to the lack of memory). I'm currently working on a PR to address these two issues. It will include the following changes: 1. The ShuffleMemoryManager should track not only the memory usage of each thread, but also which object holds the memory. 2. The ShuffleMemoryManager should be able to trigger the spilling of a spillable object. That way, if a new object in a thread requests memory, the old occupant can be evicted/spilled, which prevents problem 2 from happening. Previously, spillable objects triggered spilling by themselves, so one might not spill even when another object in the same thread needed more memory. After this change, the ShuffleMemoryManager can trigger the spilling of an object whenever it needs to. 3. Make the iterator of ExternalAppendOnlyMap spillable. Previously, ExternalAppendOnlyMap returned a destructive iterator and could not be spilled after the iterator was returned. This should be changed so that even after the iterator is returned, the ShuffleMemoryManager can still spill it. Currently, I have a working branch in progress: https://github.com/tsdeng/spark/tree/enhance_memory_manager It already implements change 3 and has a prototype of changes 1 and 2 to evict spillables from the memory manager; work is still in progress. I will send a PR when it's done. Any feedback or thoughts on this change are highly appreciated! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4434) spark-submit cluster deploy mode JAR URLs are broken in 1.1.1
[ https://issues.apache.org/jira/browse/SPARK-4434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14215279#comment-14215279 ] Apache Spark commented on SPARK-4434: - User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/3326 spark-submit cluster deploy mode JAR URLs are broken in 1.1.1 - Key: SPARK-4434 URL: https://issues.apache.org/jira/browse/SPARK-4434 Project: Spark Issue Type: Bug Components: Deploy, Spark Core Affects Versions: 1.1.1, 1.2.0 Reporter: Josh Rosen Assignee: Andrew Or Priority: Blocker When submitting a driver using {{spark-submit}} in cluster mode, Spark 1.1.0 allowed you to omit the {{file://}} or {{hdfs://}} prefix from the application JAR URL, e.g.
{code}
./bin/spark-submit --deploy-mode cluster --master spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi /Users/joshrosen/Documents/old-spark-releases/spark-1.1.0-bin-hadoop1/lib/spark-examples-1.1.0-hadoop1.0.4.jar
{code}
In Spark 1.1.1 and 1.2.0, this same command now fails with an error:
{code}
./bin/spark-submit --deploy-mode cluster --master spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi /Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar
Jar url 'file:/Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar' is not in valid format.
Must be a jar file path in URL format (e.g. hdfs://XX.jar, file://XX.jar)
Usage: DriverClient [options] launch <active-master> <jar-url> <main-class> [driver options]
Usage: DriverClient kill <active-master> <driver-id>
{code}
I tried changing my URL to conform to the new format, but this either resulted in an error or a job that failed:
{code}
./bin/spark-submit --deploy-mode cluster --master spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi file:///Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar
Jar url 'file:///Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar' is not in valid format.
Must be a jar file path in URL format (e.g. hdfs://XX.jar, file://XX.jar)
{code}
If I omit the extra slash:
{code}
./bin/spark-submit --deploy-mode cluster --master spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi file://Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar
Sending launch command to spark://joshs-mbp.att.net:7077
Driver successfully submitted as driver-20141116143235-0002
... waiting before polling master for driver state
... polling master for driver state
State of driver-20141116143235-0002 is ERROR
Exception from cluster was: java.lang.IllegalArgumentException: Wrong FS: file://Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar, expected: file:///
java.lang.IllegalArgumentException: Wrong FS: file://Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar, expected: file:///
	at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:381)
	at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:55)
	at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:393)
	at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251)
	at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:329)
	at org.apache.spark.deploy.worker.DriverRunner.org$apache$spark$deploy$worker$DriverRunner$$downloadUserJar(DriverRunner.scala:157)
	at org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:74)
{code}
This bug effectively prevents users from using {{spark-submit}} in cluster mode to run drivers whose JARs are stored on shared cluster filesystems. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
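One plausible shape for a fix, purely as a sketch under the assumption that all three local spellings should be accepted (JarUrlNormalizer is a hypothetical helper, not the actual patch in the PR above), is to canonicalize the slash count before validation:
{code}
object JarUrlNormalizer {
  // Accepts file:/path, file://path, and file:///path and rewrites them all
  // to the canonical file:///absolute/path form; other schemes pass through.
  def normalize(url: String): String = {
    val filePrefix = "^file:/{1,3}".r
    filePrefix.findFirstIn(url) match {
      case Some(_) => "file://" + filePrefix.replaceFirstIn(url, "/")
      case None    => url // e.g. hdfs:// URLs are left untouched
    }
  }
}
{code}
For example, JarUrlNormalizer.normalize("file:/a/b.jar") and normalize("file:///a/b.jar") both yield "file:///a/b.jar", which would avoid the "Wrong FS" failure shown above.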
[jira] [Commented] (SPARK-4459) JavaRDDLike.groupBy[K](f: JFunction[T, K]) may fail with typechecking errors
[ https://issues.apache.org/jira/browse/SPARK-4459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14215301#comment-14215301 ] Apache Spark commented on SPARK-4459: - User 'alokito' has created a pull request for this issue: https://github.com/apache/spark/pull/3327 JavaRDDLike.groupBy[K](f: JFunction[T, K]) may fail with typechecking errors Key: SPARK-4459 URL: https://issues.apache.org/jira/browse/SPARK-4459 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.0.2, 1.1.0 Reporter: Alok Saldanha I believe this issue is essentially the same as SPARK-668. Original error:
{code}
[ERROR] /Users/saldaal1/workspace/JavaSparkSimpleApp/src/main/java/SimpleApp.java:[29,105] no suitable method found for groupBy(org.apache.spark.api.java.function.Function<scala.Tuple2<java.lang.String,java.lang.Long>,java.lang.Long>)
[ERROR] method org.apache.spark.api.java.JavaPairRDD.<K>groupBy(org.apache.spark.api.java.function.Function<scala.Tuple2<K,java.lang.Long>,K>) is not applicable
[ERROR] (inferred type does not conform to equality constraint(s)
{code}
from core/src/main/scala/org/apache/spark/api/java/JavaRDDLike.scala
{code}
/**
 * Return an RDD of grouped elements. Each group consists of a key and a sequence of elements
 * mapping to that key.
 */
def groupBy[K](f: JFunction[T, K]): JavaPairRDD[K, JIterable[T]] = {
  implicit val ctagK: ClassTag[K] = fakeClassTag
  implicit val ctagV: ClassTag[JList[T]] = fakeClassTag
  JavaPairRDD.fromRDD(groupByResultToJava(rdd.groupBy(f)(fakeClassTag)))
}
{code}
Then in core/src/main/scala/org/apache/spark/api/java/JavaPairRDD.scala:
{code}
class JavaPairRDD[K, V](val rdd: RDD[(K, V)])
    (implicit val kClassTag: ClassTag[K], implicit val vClassTag: ClassTag[V])
  extends JavaRDDLike[(K, V), JavaPairRDD[K, V]] {
{code}
The problem is that the type parameter T in JavaRDDLike is Tuple2[K,V], which means the combined signature for groupBy in the JavaPairRDD is
{code}
groupBy[K](f: JFunction[Tuple2[K,V], K])
{code}
which imposes an unfortunate correlation between the Tuple2 and the return type of the grouping function, namely that the return type of the grouping function must be the same as the first type of the JavaPairRDD. If we compare the method signature to flatMap:
{code}
/**
 * Return a new RDD by first applying a function to all elements of this
 * RDD, and then flattening the results.
 */
def flatMap[U](f: FlatMapFunction[T, U]): JavaRDD[U] = {
  import scala.collection.JavaConverters._
  def fn = (x: T) => f.call(x).asScala
  JavaRDD.fromRDD(rdd.flatMap(fn)(fakeClassTag[U]))(fakeClassTag[U])
}
{code}
we see there should be an easy fix by changing the type parameter of the groupBy function from K to U. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
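The typing problem can be reproduced without Spark at all. In the standalone sketch below (a hypothetical PairColl class, chosen only to mirror JavaPairRDD's shape), reusing the container's key type K for groupBy forces the grouping function to return K, while a fresh parameter U, the fix proposed above, removes the constraint:
{code}
class PairColl[K, V](val data: Seq[(K, V)]) {
  // Broken shape, mirroring the current groupBy: the grouping function's
  // result type is forced to be the pair's first type K.
  def groupByBad(f: ((K, V)) => K): Map[K, Seq[(K, V)]] = data.groupBy(f)

  // Fixed shape, mirroring flatMap: any result key type U is allowed.
  def groupByGood[U](f: ((K, V)) => U): Map[U, Seq[(K, V)]] = data.groupBy(f)
}

object GroupByDemo extends App {
  val pairs = new PairColl(Seq(("a", 1L), ("b", 2L)))
  // pairs.groupByBad(p => p._2)  // does not compile: Long is not String
  println(pairs.groupByGood(p => p._2)) // fine: groups keyed by Long
}
{code}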
[jira] [Created] (SPARK-4461) Spark should not rely on mapred-site.xml for classpath
Zhan Zhang created SPARK-4461: - Summary: Spark should not rely on mapred-site.xml for classpath Key: SPARK-4461 URL: https://issues.apache.org/jira/browse/SPARK-4461 Project: Spark Issue Type: Bug Components: YARN Reporter: Zhan Zhang Currently Spark reads mapred-site.xml to get the classpath. As of Hadoop 2.6, the library is shipped to the cluster with the distributed cache at run-time and may not be available on every node manager. Instead of relying on mapred-site.xml, Spark should handle this on its own, for example through ADD_JARs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4395) Running a Spark SQL SELECT command from PySpark causes a hang for ~ 1 hour
[ https://issues.apache.org/jira/browse/SPARK-4395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14215327#comment-14215327 ] Davies Liu commented on SPARK-4395: --- Workaround: remove cache(), or cache() after inferSchema():
{code}
ratings_base_RDD = sc.textFile().map(xxx)
schemaRatings = sqlContext.inferSchema(ratings_base_RDD).cache()
schemaRatings.registerTempTable("RatingsTable")
sqlContext.sql("SELECT * FROM RatingsTable limit 5").collect()
{code}
Running a Spark SQL SELECT command from PySpark causes a hang for ~ 1 hour -- Key: SPARK-4395 URL: https://issues.apache.org/jira/browse/SPARK-4395 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.2.0 Environment: version 1.2.0-SNAPSHOT Reporter: Sameer Farooqui When I run this command it hangs for one to many hours and then finally returns with successful results: sqlContext.sql("SELECT * FROM RatingsTable limit 5").collect() Note, the lab environment below is still active, so let me know if you'd like to just access it directly. +++ My Environment +++ - 1-node cluster in Amazon - RedHat 6.5 64-bit - java version 1.7.0_67 - SBT version: sbt-0.13.5 - Scala version: scala-2.11.2 Ran: sudo yum -y update; git clone https://github.com/apache/spark; sudo sbt assembly +++ Data file used +++ http://blueplastic.com/databricks/movielens/ratings.dat
{code}
>>> import re
>>> import string
>>> from pyspark.sql import SQLContext, Row
>>> sqlContext = SQLContext(sc)
>>> RATINGS_PATTERN = '^(\d+)::(\d+)::(\d+)::(\d+)'
>>> def parse_ratings_line(line):
...     match = re.search(RATINGS_PATTERN, line)
...     if match is None:
...         # Optionally, you can change this to just ignore if each line of data is not critical.
...         raise ValueError("Invalid logline: %s" % line)
...     return Row(
...         UserID    = int(match.group(1)),
...         MovieID   = int(match.group(2)),
...         Rating    = int(match.group(3)),
...         Timestamp = int(match.group(4)))
...
>>> ratings_base_RDD = (sc.textFile("file:///home/ec2-user/movielens/ratings.dat")
...                     # Call the parse_ratings_line function on each line.
...                     .map(parse_ratings_line)
...                     # Caches the objects in memory since they will be queried multiple times.
...                     .cache())
>>> ratings_base_RDD.count()
1000209
>>> ratings_base_RDD.first()
Row(MovieID=1193, Rating=5, Timestamp=978300760, UserID=1)
>>> schemaRatings = sqlContext.inferSchema(ratings_base_RDD)
>>> schemaRatings.registerTempTable("RatingsTable")
>>> sqlContext.sql("SELECT * FROM RatingsTable limit 5").collect()
{code}
(Now the Python shell hangs...) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-1867) Spark Documentation Error causes java.lang.IllegalStateException: unread block data
[ https://issues.apache.org/jira/browse/SPARK-1867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212852#comment-14212852 ] Anson Abraham edited comment on SPARK-1867 at 11/17/14 10:40 PM: - Yes. I added 3 data nodes just for this, and I had 2 of them as my worker nodes and the other as my master. Still getting that issue. Also, the jar files were all supplied by Cloudera, and all of this was done through Cloudera Manager parcels. I installed Spark as standalone on CDH 5.2 as stated above, so the jars have to be the same. But just in case, I rsync'd them across the machines and am still hitting this issue. This is all occurring when running through spark-shell, of course. was (Author: ansonism): Yes. i added 3 data nodes just for this. And I had 2 of them as my worker node and the other as my master. Still getting that issue. Also the jar files were all supplied by Cloudera. Spark Documentation Error causes java.lang.IllegalStateException: unread block data --- Key: SPARK-1867 URL: https://issues.apache.org/jira/browse/SPARK-1867 Project: Spark Issue Type: Bug Reporter: sam I've employed two System Administrators on a contract basis (for quite a bit of money), and both contractors have independently hit the following exception. What we are doing is: 1. Installing Spark 0.9.1 according to the documentation on the website, along with CDH4 (and another cluster with CDH5) distros of hadoop/hdfs. 2. Building a fat jar with a Spark app with sbt, then trying to run it on the cluster. When I've Googled this, there seem to be two somewhat vague responses: a) Mismatching spark versions on nodes/user code b) Need to add more jars to the SparkConf Now I know that (b) is not the problem, having successfully run the same code on other clusters while only including one jar (it's a fat jar). But I have no idea how to check for (a) - it appears Spark doesn't have any version checks or anything - it would be nice if it checked versions and threw a mismatching-version exception: you have user code using version X and node Y has version Z. I would be very grateful for advice on this.
The exception:
Exception in thread "main" org.apache.spark.SparkException: Job aborted: Task 0.0:1 failed 32 times (most recent failure: Exception failure: java.lang.IllegalStateException: unread block data)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)
	at scala.Option.foreach(Option.scala:236)
	at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:604)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:190)
	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
	at akka.actor.ActorCell.invoke(ActorCell.scala:456)
	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
	at akka.dispatch.Mailbox.run(Mailbox.scala:219)
	at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
14/05/16 18:05:31 INFO scheduler.TaskSetManager: Loss was due to java.lang.IllegalStateException: unread block data [duplicate 59]
My code snippet:
val conf = new SparkConf()
  .setMaster(clusterMaster)
  .setAppName(appName)
  .setSparkHome(sparkHome)
  .setJars(SparkContext.jarOfClass(this.getClass))
println("count = " + new SparkContext(conf).textFile(someHdfsPath).count())
My SBT dependencies:
// relevant
"org.apache.spark" % "spark-core_2.10" % "0.9.1",
[jira] [Updated] (SPARK-4461) Spark should not rely on mapred-site.xml for classpath
[ https://issues.apache.org/jira/browse/SPARK-4461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated SPARK-4461: -- Description: Currently Spark reads mapred-site.xml to get the classpath. As of Hadoop 2.6, the library is shipped to the cluster with the distributed cache at run-time and may not be available on every node manager. Instead of relying on mapred-site.xml, Spark should handle this on its own, for example through ADD_JARs, SPARK_CLASSPATH, etc. was: Currently spark read mapred-site.xml to get the class path. From hadoop 2.6, the library is shipped to cluster with distributed cache at run-time, and may not be available at every node manager. Instead of relying on mapred-site.xml, spark should handle this by its own, for example, through ADD_JARs. Spark should not rely on mapred-site.xml for classpath Key: SPARK-4461 URL: https://issues.apache.org/jira/browse/SPARK-4461 Project: Spark Issue Type: Bug Components: YARN Reporter: Zhan Zhang Currently Spark reads mapred-site.xml to get the classpath. As of Hadoop 2.6, the library is shipped to the cluster with the distributed cache at run-time and may not be available on every node manager. Instead of relying on mapred-site.xml, Spark should handle this on its own, for example through ADD_JARs, SPARK_CLASSPATH, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
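A hedged sketch of the suggested direction follows: deriving the extra classpath from Spark-owned settings instead of parsing mapred-site.xml. The config key spark.executor.extraClassPath and the SPARK_CLASSPATH variable are real Spark settings; the helper itself is hypothetical.
{code}
import org.apache.spark.SparkConf

object ClasspathFromSparkSettings {
  // Collect classpath entries from Spark's own configuration rather than
  // from Hadoop's mapred-site.xml.
  def extraClasspath(conf: SparkConf): Seq[String] = {
    val fromConf = conf.getOption("spark.executor.extraClassPath").toSeq
    val fromEnv  = sys.env.get("SPARK_CLASSPATH").toSeq
    (fromConf ++ fromEnv)
      .flatMap(_.split(java.io.File.pathSeparator))
      .filter(_.nonEmpty)
      .distinct
  }
}
{code}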
[jira] [Updated] (SPARK-2087) Clean Multi-user semantics for thrift JDBC/ODBC server.
[ https://issues.apache.org/jira/browse/SPARK-2087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2087: Target Version/s: 1.3.0 (was: 1.2.0) Clean Multi-user semantics for thrift JDBC/ODBC server. --- Key: SPARK-2087 URL: https://issues.apache.org/jira/browse/SPARK-2087 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Zongheng Yang Priority: Minor Configuration and temporary tables should exist per-user. Cached tables should be shared across users. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4338) Remove yarn-alpha support
[ https://issues.apache.org/jira/browse/SPARK-4338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4338: - Assignee: Sandy Ryza Remove yarn-alpha support - Key: SPARK-4338 URL: https://issues.apache.org/jira/browse/SPARK-4338 Project: Spark Issue Type: Sub-task Components: YARN Reporter: Sandy Ryza Assignee: Sandy Ryza -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4462) flume-sink build broken in SBT
Michael Armbrust created SPARK-4462: --- Summary: flume-sink build broken in SBT Key: SPARK-4462 URL: https://issues.apache.org/jira/browse/SPARK-4462 Project: Spark Issue Type: Bug Components: Streaming Reporter: Michael Armbrust Assignee: Tathagata Das
{code}
$ sbt streaming-flume-sink/compile
Using /Library/Java/JavaVirtualMachines/jdk1.7.0_45.jdk/Contents/Home as default JAVA_HOME.
Note, this will be overridden by -java-home if it is set.
[info] Loading project definition from /Users/marmbrus/workspace/spark/project/project
[info] Loading project definition from /Users/marmbrus/.sbt/0.13/staging/ad8e8574a5bcb2d22d23/sbt-pom-reader/project
[warn] Multiple resolvers having different access mechanism configured with same name 'sbt-plugin-releases'. To avoid conflict, Remove duplicate project resolvers (`resolvers`) or rename publishing resolver (`publishTo`).
[warn] There may be incompatibilities among your library dependencies.
[warn] Here are some of the libraries that were evicted:
[warn] 	* com.typesafe.sbt:sbt-git:0.6.1 -> 0.6.2
[warn] 	* com.typesafe.sbt:sbt-site:0.7.0 -> 0.7.1
[warn] Run 'evicted' to see detailed eviction warnings
[info] Loading project definition from /Users/marmbrus/workspace/spark/project
[warn] There may be incompatibilities among your library dependencies.
[warn] Here are some of the libraries that were evicted:
[warn] 	* org.apache.maven.wagon:wagon-provider-api:1.0-beta-6 -> 2.2
[warn] Run 'evicted' to see detailed eviction warnings
NOTE: SPARK_HIVE is deprecated, please use -Phive and -Phive-thriftserver flags.
Enabled default scala profile
[info] Set current project to spark-parent (in build file:/Users/marmbrus/workspace/spark/)
[warn] There may be incompatibilities among your library dependencies.
[warn] Here are some of the libraries that were evicted:
[warn] 	* com.google.guava:guava:10.0.1 -> 14.0.1
[warn] Run 'evicted' to see detailed eviction warnings
[info] Compiling 5 Scala sources and 3 Java sources to /Users/marmbrus/workspace/spark/external/flume-sink/target/scala-2.10/classes...
[error] /Users/marmbrus/workspace/spark/external/flume-sink/src/main/scala/org/apache/spark/streaming/flume/sink/Logging.scala:19: object slf4j is not a member of package org
[error] import org.slf4j.{Logger, LoggerFactory}
[error]            ^
[error] /Users/marmbrus/workspace/spark/external/flume-sink/src/main/scala/org/apache/spark/streaming/flume/sink/Logging.scala:29: not found: type Logger
[error]   @transient private var log_ : Logger = null
[error]                                 ^
[error] /Users/marmbrus/workspace/spark/external/flume-sink/src/main/scala/org/apache/spark/streaming/flume/sink/Logging.scala:32: not found: type Logger
[error]   protected def log: Logger = {
[error]                      ^
[error] /Users/marmbrus/workspace/spark/external/flume-sink/src/main/scala/org/apache/spark/streaming/flume/sink/Logging.scala:40: not found: value LoggerFactory
[error]       log_ = LoggerFactory.getLogger(className)
[error]              ^
[error] /Users/marmbrus/workspace/spark/external/flume-sink/target/scala-2.10/src_managed/main/compiled_avro/org/apache/spark/streaming/flume/sink/EventBatch.java:9: object specific is not a member of package org.apache.avro
[error] public class EventBatch extends org.apache.avro.specific.SpecificRecordBase implements org.apache.avro.specific.SpecificRecord {
[error]                                                 ^
[error] /Users/marmbrus/workspace/spark/external/flume-sink/target/scala-2.10/src_managed/main/compiled_avro/org/apache/spark/streaming/flume/sink/EventBatch.java:9: object specific is not a member of package org.apache.avro
[error] public class EventBatch extends org.apache.avro.specific.SpecificRecordBase implements org.apache.avro.specific.SpecificRecord {
[error]                                                 ^
[error] /Users/marmbrus/workspace/spark/external/flume-sink/target/scala-2.10/src_managed/main/compiled_avro/org/apache/spark/streaming/flume/sink/SparkSinkEvent.java:9: object specific is not a member of package org.apache.avro
[error] public class SparkSinkEvent extends org.apache.avro.specific.SpecificRecordBase implements org.apache.avro.specific.SpecificRecord {
[error]                                                 ^
[error] /Users/marmbrus/workspace/spark/external/flume-sink/target/scala-2.10/src_managed/main/compiled_avro/org/apache/spark/streaming/flume/sink/SparkSinkEvent.java:9: object specific is not a member of package org.apache.avro
[error] public class SparkSinkEvent extends org.apache.avro.specific.SpecificRecordBase implements org.apache.avro.specific.SpecificRecord {
[error]
[jira] [Updated] (SPARK-3184) Allow user to specify num tasks to use for a table
[ https://issues.apache.org/jira/browse/SPARK-3184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-3184: Target Version/s: 1.3.0 (was: 1.2.0) Allow user to specify num tasks to use for a table -- Key: SPARK-3184 URL: https://issues.apache.org/jira/browse/SPARK-3184 Project: Spark Issue Type: Improvement Components: SQL Reporter: Andy Konwinski Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4443) Statistics bug for external table in spark sql hive
[ https://issues.apache.org/jira/browse/SPARK-4443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-4443: Priority: Critical (was: Major) Statistics bug for external table in spark sql hive --- Key: SPARK-4443 URL: https://issues.apache.org/jira/browse/SPARK-4443 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: wangfei Priority: Critical Fix For: 1.2.0 When a table is external, the `totalSize` is always zero, which influences the join strategy (a broadcast join is always used for external tables). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
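The planner decision this bug corrupts is easy to state: broadcast only when the size estimate is both known (non-zero) and under the threshold. A one-line illustrative sketch follows (a hypothetical helper, not the actual Spark SQL planner code):
{code}
object BroadcastDecisionSketch {
  // A zero totalSize should be treated as "unknown", not "tiny".
  def shouldBroadcast(totalSizeBytes: Long, thresholdBytes: Long): Boolean =
    totalSizeBytes > 0 && totalSizeBytes <= thresholdBytes
}
{code}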
[jira] [Updated] (SPARK-2873) OOM happens when group by and join operation with big data
[ https://issues.apache.org/jira/browse/SPARK-2873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2873: Target Version/s: 1.3.0 (was: 1.2.0) OOM happens when group by and join operation with big data --- Key: SPARK-2873 URL: https://issues.apache.org/jira/browse/SPARK-2873 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.0.0 Reporter: guowei -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4074) No exception for drop nonexistent table
[ https://issues.apache.org/jira/browse/SPARK-4074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-4074: Target Version/s: 1.3.0 (was: 1.2.0) No exception for drop nonexistent table --- Key: SPARK-4074 URL: https://issues.apache.org/jira/browse/SPARK-4074 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Michael Armbrust Assignee: Michael Armbrust -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3720) support ORC in spark sql
[ https://issues.apache.org/jira/browse/SPARK-3720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-3720. - Resolution: Duplicate support ORC in spark sql Key: SPARK-3720 URL: https://issues.apache.org/jira/browse/SPARK-3720 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 1.1.0 Reporter: wangfei Attachments: orc.diff The Optimized Row Columnar (ORC) file format provides a highly efficient way to store data on HDFS. The ORC file format has many advantages, such as:
1. a single file as the output of each task, which reduces the NameNode's load
2. Hive type support including datetime, decimal, and the complex types (struct, list, map, and union)
3. light-weight indexes stored within the file: skip row groups that don't pass predicate filtering; seek to a given row
4. block-mode compression based on data type: run-length encoding for integer columns, dictionary encoding for string columns
5. concurrent reads of the same file using separate RecordReaders
6. ability to split files without scanning for markers
7. a bounded amount of memory needed for reading or writing
8. metadata stored using Protocol Buffers, which allows addition and removal of fields
Spark SQL currently supports Parquet; supporting ORC would give people more options. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2472) Spark SQL Thrift server sometimes assigns wrong job group name
[ https://issues.apache.org/jira/browse/SPARK-2472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2472: Target Version/s: 1.3.0 (was: 1.2.0) Spark SQL Thrift server sometimes assigns wrong job group name -- Key: SPARK-2472 URL: https://issues.apache.org/jira/browse/SPARK-2472 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Reporter: Cheng Lian Priority: Minor Sample beeline session used to reproduce this issue:
{code}
0: jdbc:hive2://localhost:1> drop table test;
+---------+
| result  |
+---------+
+---------+
No rows selected (0.614 seconds)
0: jdbc:hive2://localhost:1> create table hive_table_copy as select * from hive_table;
+------+--------+
| key  | value  |
+------+--------+
+------+--------+
No rows selected (0.493 seconds)
0
{code}
The second statement results in two stages; the first stage is labeled with the first {{drop table}} statement rather than the CTAS statement. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3298) [SQL] registerAsTable / registerTempTable overwrites old tables
[ https://issues.apache.org/jira/browse/SPARK-3298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-3298: Assignee: (was: Michael Armbrust) [SQL] registerAsTable / registerTempTable overwrites old tables --- Key: SPARK-3298 URL: https://issues.apache.org/jira/browse/SPARK-3298 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.2 Reporter: Evan Chan Priority: Minor Labels: newbie At least in Spark 1.0.2, calling registerAsTable("a") when "a" has already been registered does not cause an error. However, there is no way to access the old table, even though it may be cached and taking up space. How about at least throwing an error? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3298) [SQL] registerAsTable / registerTempTable overwrites old tables
[ https://issues.apache.org/jira/browse/SPARK-3298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-3298: Target Version/s: 1.3.0 (was: 1.2.0) [SQL] registerAsTable / registerTempTable overwrites old tables --- Key: SPARK-3298 URL: https://issues.apache.org/jira/browse/SPARK-3298 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.2 Reporter: Evan Chan Priority: Minor Labels: newbie At least in Spark 1.0.2, calling registerAsTable("a") when "a" has already been registered does not cause an error. However, there is no way to access the old table, even though it may be cached and taking up space. How about at least throwing an error? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
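A minimal sketch of the behavior the reporter asks for, a registry that throws on duplicate registration instead of silently overwriting. TempTableRegistry is hypothetical and only illustrates the check; it is not SQLContext's actual catalog.
{code}
import scala.collection.mutable

class TempTableRegistry[T] {
  private val tables = mutable.Map.empty[String, T]

  // Reject duplicate names instead of silently replacing the old table,
  // which may still be cached and taking up space.
  def register(name: String, table: T): Unit = tables.synchronized {
    if (tables.contains(name))
      throw new IllegalArgumentException(s"Temp table '$name' already exists")
    tables(name) = table
  }

  def lookup(name: String): Option[T] = tables.synchronized(tables.get(name))
}
{code}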
[jira] [Updated] (SPARK-2554) CountDistinct and SumDistinct should do partial aggregation
[ https://issues.apache.org/jira/browse/SPARK-2554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2554: Target Version/s: 1.3.0 (was: 1.2.0) CountDistinct and SumDistinct should do partial aggregation --- Key: SPARK-2554 URL: https://issues.apache.org/jira/browse/SPARK-2554 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.0.1, 1.0.2 Reporter: Cheng Lian Priority: Minor {{CountDistinct}} and {{SumDistinct}} should first do a partial aggregation and return unique value sets in each partition as partial results. Shuffle IO can be greatly reduced in cases where there are only a few unique values. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
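The idea can be demonstrated with the plain RDD API. In this sketch (illustrative of the proposed partial aggregation, not the Catalyst implementation), each partition first reduces to its local set of distinct values, so the shuffle only carries those sets:
{code}
import org.apache.spark.{SparkConf, SparkContext}

object PartialCountDistinctSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("partial-count-distinct").setMaster("local[2]"))
    val values = sc.parallelize(Seq(1, 1, 2, 3, 3, 3), numSlices = 2)

    // Partial aggregation: one small set of distinct values per partition.
    val partials = values.mapPartitions(it => Iterator(it.toSet))

    // Merge phase: shuffle volume is bounded by the number of distinct
    // values, not by the number of rows.
    println(partials.reduce(_ union _).size) // prints 3
    sc.stop()
  }
}
{code}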
[jira] [Updated] (SPARK-3379) Implement 'POWER' for sql
[ https://issues.apache.org/jira/browse/SPARK-3379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-3379: Target Version/s: 1.3.0 (was: 1.2.0) Implement 'POWER' for sql - Key: SPARK-3379 URL: https://issues.apache.org/jira/browse/SPARK-3379 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.0.2, 1.1.0 Environment: All Reporter: Xinyun Huang Priority: Minor Fix For: 1.2.0 Original Estimate: 0h Remaining Estimate: 0h Add support for the mathematical function POWER within Spark SQL. Split from SPARK-3176. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4269) Make wait time in BroadcastHashJoin configurable
[ https://issues.apache.org/jira/browse/SPARK-4269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-4269: Target Version/s: 1.3.0 (was: 1.2.0) Make wait time in BroadcastHashJoin configurable Key: SPARK-4269 URL: https://issues.apache.org/jira/browse/SPARK-4269 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Jacky Li Fix For: 1.2.0 In BroadcastHashJoin, a hard-coded value (5 minutes) is currently used as the time to wait for the execution and broadcast of the small table. In my opinion, it should be a configurable value, since the broadcast may exceed 5 minutes in some cases, such as in a busy or congested network environment. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
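A sketch of the requested change: read the wait time from SparkConf rather than hard-coding five minutes. The key name spark.sql.broadcastTimeout is the one Spark later adopted for this setting, but treat the snippet as illustrative, not the actual BroadcastHashJoin code.
{code}
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import org.apache.spark.SparkConf

object BroadcastWaitSketch {
  // Wait for the broadcast future, with a configurable timeout in seconds
  // (defaulting to the current hard-coded 5 minutes).
  def awaitBroadcast[T](conf: SparkConf, broadcastWork: Future[T]): T = {
    val timeoutSec = conf.getInt("spark.sql.broadcastTimeout", 5 * 60)
    Await.result(broadcastWork, timeoutSec.seconds)
  }
}
{code}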
[jira] [Updated] (SPARK-3955) Different versions between jackson-mapper-asl and jackson-core-asl
[ https://issues.apache.org/jira/browse/SPARK-3955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-3955: Component/s: Build Different versions between jackson-mapper-asl and jackson-core-asl -- Key: SPARK-3955 URL: https://issues.apache.org/jira/browse/SPARK-3955 Project: Spark Issue Type: Bug Components: Build, Spark Core, SQL Affects Versions: 1.1.0 Reporter: Jongyoul Lee The parent pom.xml specifies a version of jackson-mapper-asl, which is used by sql/hive/pom.xml. When mvn assembly runs, however, the jackson-mapper-asl version is not the same as jackson-core-asl's. This is because other libraries use several versions of jackson, so another version of jackson-core-asl is assembled. Simply put, this problem is fixed if pom.xml carries specific version information for jackson-core-asl. If it's not set, version 1.9.11 is merged into assembly.jar and we cannot use the jackson library properly. {code} [INFO] Including org.codehaus.jackson:jackson-mapper-asl:jar:1.8.8 in the shaded jar. [INFO] Including org.codehaus.jackson:jackson-core-asl:jar:1.9.11 in the shaded jar. {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3955) Different versions between jackson-mapper-asl and jackson-core-asl
[ https://issues.apache.org/jira/browse/SPARK-3955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-3955: Target Version/s: 1.3.0 (was: 1.1.1, 1.2.0) Different versions between jackson-mapper-asl and jackson-core-asl -- Key: SPARK-3955 URL: https://issues.apache.org/jira/browse/SPARK-3955 Project: Spark Issue Type: Bug Components: Build, Spark Core, SQL Affects Versions: 1.1.0 Reporter: Jongyoul Lee The parent pom.xml specifies a version of jackson-mapper-asl, which is used by sql/hive/pom.xml. When mvn assembly runs, however, the jackson-mapper-asl version is not the same as jackson-core-asl's. This is because other libraries use several versions of jackson, so another version of jackson-core-asl is assembled. Simply put, this problem is fixed if pom.xml carries specific version information for jackson-core-asl. If it's not set, version 1.9.11 is merged into assembly.jar and we cannot use the jackson library properly. {code} [INFO] Including org.codehaus.jackson:jackson-mapper-asl:jar:1.8.8 in the shaded jar. [INFO] Including org.codehaus.jackson:jackson-core-asl:jar:1.9.11 in the shaded jar. {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2551) Cleanup FilteringParquetRowInputFormat
[ https://issues.apache.org/jira/browse/SPARK-2551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2551: Target Version/s: 1.3.0 (was: 1.2.0) Cleanup FilteringParquetRowInputFormat -- Key: SPARK-2551 URL: https://issues.apache.org/jira/browse/SPARK-2551 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.0.1, 1.0.2 Reporter: Cheng Lian Priority: Minor To work around [PARQUET-16|https://issues.apache.org/jira/browse/PARQUET-16] and fix [SPARK-2119|https://issues.apache.org/jira/browse/SPARK-2119], we did some reflection hacks in {{FilteringParquetRowInputFormat}}. This should be cleaned up once PARQUET-16 is fixed. A PR for PARQUET-16 is [here|https://github.com/apache/incubator-parquet-mr/pull/17]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2449) Spark sql reflection code requires a constructor taking all the fields for the table
[ https://issues.apache.org/jira/browse/SPARK-2449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2449: Target Version/s: 1.3.0 (was: 1.2.0) Spark sql reflection code requires a constructor taking all the fields for the table Key: SPARK-2449 URL: https://issues.apache.org/jira/browse/SPARK-2449 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.0.0 Reporter: Ian O Connell The reflection code looks up the fields passed to the constructor to derive the types for the table. Specifically, the code: val params = t.member(nme.CONSTRUCTOR).asMethod.paramss in ScalaReflection.scala. Simple repro case from the spark shell:
trait PersonTrait extends Product
case class Person(a: Int) extends PersonTrait
val l: List[PersonTrait] = List(1, 2, 3, 4).map(Person(_))
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._
sc.parallelize(l).registerAsTable("people")
scala> sc.parallelize(l).registerAsTable("people")
scala.ScalaReflectionException: none is not a method
	at scala.reflect.api.Symbols$SymbolApi$class.asMethod(Symbols.scala:279)
	at scala.reflect.internal.Symbols$SymbolContextApiImpl.asMethod(Symbols.scala:73)
	at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:52)
	at
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4453) Simplify Parquet record filter generation
[ https://issues.apache.org/jira/browse/SPARK-4453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-4453: Assignee: Cheng Lian Simplify Parquet record filter generation - Key: SPARK-4453 URL: https://issues.apache.org/jira/browse/SPARK-4453 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.0.0, 1.1.0 Reporter: Cheng Lian Assignee: Cheng Lian The current Parquet record filter code uses {{CatalystFilter}} and its sub-classes to represent all Spark SQL Parquet filters. Essentially, these classes combine the original Catalyst predicate expression with the generated Parquet filter. {{ParquetFilters.findExpression}} then uses these classes to filter out all expressions that can be pushed down. However, this {{findExpression}} function is not necessary in the first place, since we already know whether a predicate can be pushed down while generating its corresponding filter. With this in mind, the code size of Parquet record filter generation can be reduced significantly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2178) createSchemaRDD is not thread safe
[ https://issues.apache.org/jira/browse/SPARK-2178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2178: Target Version/s: 1.3.0 (was: 1.2.0) createSchemaRDD is not thread safe -- Key: SPARK-2178 URL: https://issues.apache.org/jira/browse/SPARK-2178 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust This is because implicit type tags are not thread safe. We could fix this with compile time macros (which could also make the conversion a lot faster). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
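A common 2.10-era workaround for this class of problem (a sketch of the general pattern, not necessarily the fix Spark shipped) is to serialize all reflection-based schema derivation behind one lock, since Scala 2.10 runtime reflection and the implicit TypeTags it produces are not thread-safe:
{code}
object ReflectionGuard {
  private val lock = new Object

  // Run any TypeTag / runtime-reflection dependent derivation under a
  // single global lock so concurrent callers cannot corrupt shared state.
  def withReflectionLock[T](body: => T): T = lock.synchronized(body)
}
{code}
A caller would wrap the conversion, e.g. ReflectionGuard.withReflectionLock { createSchemaRDD(rdd) }; the compile-time macro approach mentioned above would avoid both the lock and the runtime cost.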
[jira] [Updated] (SPARK-4453) Simplify Parquet record filter generation
[ https://issues.apache.org/jira/browse/SPARK-4453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-4453: Priority: Critical (was: Major) Simplify Parquet record filter generation - Key: SPARK-4453 URL: https://issues.apache.org/jira/browse/SPARK-4453 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.0.0, 1.1.0 Reporter: Cheng Lian Assignee: Cheng Lian Priority: Critical The current Parquet record filter code uses {{CatalystFilter}} and its subclasses to represent all Spark SQL Parquet filters. Essentially, these classes combine the original Catalyst predicate expression with the generated Parquet filter. {{ParquetFilters.findExpression}} then uses these classes to filter out all expressions that can be pushed down. However, the {{findExpression}} function is unnecessary in the first place, since we already know whether a predicate can be pushed down while generating its corresponding filter. With this in mind, the code size of Parquet record filter generation can be reduced significantly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4266) Avoid expensive JavaScript for StagePages with huge numbers of tasks
[ https://issues.apache.org/jira/browse/SPARK-4266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14215391#comment-14215391 ] Apache Spark commented on SPARK-4266: - User 'kayousterhout' has created a pull request for this issue: https://github.com/apache/spark/pull/3328 Avoid expensive JavaScript for StagePages with huge numbers of tasks Key: SPARK-4266 URL: https://issues.apache.org/jira/browse/SPARK-4266 Project: Spark Issue Type: Bug Components: Web UI Reporter: Kay Ousterhout Assignee: Kay Ousterhout Priority: Critical Some of the new JavaScript added to handle hiding metrics significantly slows the page load for stages with a lot of tasks (e.g., for a job with 10K tasks, it took over a minute for the page to finish loading in Chrome on my laptop). There are at least two issues here: (1) The new table-striping JavaScript is much slower than the old CSS. The fancier JavaScript is only needed for the stage summary table, so we should change the task table back to using CSS so that it doesn't slow the page load for jobs with lots of tasks. (2) The JavaScript associated with hiding metrics is expensive when jobs have lots of tasks, I think because the jQuery selectors have to traverse a much larger DOM. ID selectors are much more efficient, so we should consider switching to those, and/or avoiding this code in additional-metrics.js:
{code}
$("input:checkbox:not(:checked)").each(function() {
    var column = "table ." + $(this).attr("name");
    $(column).hide();
});
{code}
by initially hiding the data when we generate the page in the render function instead, which should be easy to do. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
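As one illustration of suggestion (2), a minimal sketch of render-time hiding (hypothetical names, not Spark's actual UI code, though Spark's stage page is indeed rendered server-side in Scala): emitting the off-by-default columns with display: none when the HTML is generated means the browser never runs a jQuery pass over the DOM on load.
{code}
// Hedged sketch: decide visibility while generating the HTML instead of
// hiding columns with JavaScript after the page loads.
object StagePageRenderSketch {
  case class Metric(name: String, shownByDefault: Boolean)

  // Render a header cell; hidden-by-default metrics arrive already hidden,
  // so no expensive selector pass is needed on page load.
  def headerCell(m: Metric): scala.xml.Elem =
    if (m.shownByDefault) {
      <th class={m.name}>{m.name}</th>
    } else {
      <th class={m.name} style="display: none;">{m.name}</th>
    }

  def main(args: Array[String]): Unit = {
    val metrics = Seq(Metric("Duration", true), Metric("GC Time", false))
    metrics.map(headerCell).foreach(println)
  }
}
{code}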
[jira] [Commented] (SPARK-4452) Shuffle data structures can starve others on the same thread for memory
[ https://issues.apache.org/jira/browse/SPARK-4452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14215395#comment-14215395 ] Andrew Or commented on SPARK-4452: -- Hey [~tianshuo] do you see this issue only for sort-based shuffle? Have you been able to reproduce it on hash-based shuffle? Shuffle data structures can starve others on the same thread for memory Key: SPARK-4452 URL: https://issues.apache.org/jira/browse/SPARK-4452 Project: Spark Issue Type: Bug Affects Versions: 1.1.0 Reporter: Tianshuo Deng When an Aggregator is used with ExternalSorter in a task, Spark creates many small spill files and can run into a "too many open files" error during merging. This happens when using the sort-based shuffle. The issue is caused by multiple factors:
1. There seems to be a bug in setting the elementsRead variable in ExternalSorter, which renders trackMemoryThreshold (defined in Spillable) useless for triggering spilling; the PR to fix it is https://github.com/apache/spark/pull/3302
2. The current ShuffleMemoryManager does not work well when there are two spillable objects in a thread, which in this case are ExternalSorter and ExternalAppendOnlyMap (used by Aggregator). Here is an example: due to the use of map-side aggregation, ExternalAppendOnlyMap is created first to read the RDD. It may ask for as much memory as it can get, which is totalMem/numberOfThreads. When ExternalSorter is later created in the same thread, the ShuffleMemoryManager can refuse to allocate more memory to it, since the memory has already been given to the previously requesting object (the ExternalAppendOnlyMap). That causes the ExternalSorter to keep spilling small files due to the lack of memory.
I'm currently working on a PR to address these two issues. It will include the following changes:
1. The ShuffleMemoryManager should track not only the memory usage of each thread, but also which object holds the memory.
2. The ShuffleMemoryManager should be able to trigger the spilling of a spillable object. This way, if a new object in a thread requests memory, the old occupant can be evicted/spilled, which prevents problem 2 from happening. Previously each spillable object triggered spilling by itself, so one might not spill even when another object in the same thread needed more memory; after this change the ShuffleMemoryManager can trigger the spilling of an object whenever it needs to.
3. Make the iterator of ExternalAppendOnlyMap spillable. Previously ExternalAppendOnlyMap returned a destructive iterator and could not be spilled once the iterator was handed out; this should be changed so that the ShuffleMemoryManager can still spill the map even after the iterator is returned.
Currently, I have a working branch in progress: https://github.com/tsdeng/spark/tree/enhance_memory_manager. Change 3 is already made, and a prototype of changes 1 and 2 to evict spillables from the memory manager is still in progress. I will send a PR when it's done. Any feedback or thoughts on this change are highly appreciated! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
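To make proposed changes 1 and 2 concrete, a minimal sketch (names and structure are illustrative assumptions, not Spark's actual ShuffleMemoryManager): the manager records the grant per consumer object, and when a new request cannot be satisfied it forces an existing consumer to spill rather than letting the newcomer starve.
{code}
// Hedged sketch of an object-tracking memory manager that can evict:
// grants are recorded per spillable consumer, and the manager itself may
// trigger spilling of an old occupant when a new request cannot be met.
import scala.collection.mutable

trait SpillableConsumer {
  def spill(): Long // spills to disk, returns bytes released
}

class ObjectTrackingMemoryManager(maxMemory: Long) {
  private val granted = mutable.Map.empty[SpillableConsumer, Long]
  private var used = 0L

  def acquire(consumer: SpillableConsumer, bytes: Long): Long = synchronized {
    if (used + bytes > maxMemory) {
      // Spill other consumers, largest grant first, until the request fits.
      val victims = granted.keys.filter(_ != consumer).toSeq.sortBy(k => -granted(k))
      for (victim <- victims if used + bytes > maxMemory) {
        val released = math.min(victim.spill(), granted(victim))
        granted(victim) -= released
        used -= released
      }
    }
    val grant = math.min(bytes, maxMemory - used)
    used += grant
    granted(consumer) = granted.getOrElse(consumer, 0L) + grant
    grant
  }
}
{code}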
[jira] [Created] (SPARK-4463) Add (de)select all button for additional metrics in webUI
Kay Ousterhout created SPARK-4463: - Summary: Add (de)select all button for additional metrics in webUI Key: SPARK-4463 URL: https://issues.apache.org/jira/browse/SPARK-4463 Project: Spark Issue Type: Improvement Components: Web UI Reporter: Kay Ousterhout Assignee: Kay Ousterhout Priority: Minor The current behavior of the "Show additional metrics" button is a little confusing and has been noted by some ([~andrewor14]) to be annoying. Instead, we should have an option at the top labeled "(De)select all" that selects or deselects all of the options. Clicking "Show additional metrics" should not change whether any of the metrics are shown/hidden. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-4460) RandomForest classification uses wrong threshold
[ https://issues.apache.org/jira/browse/SPARK-4460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley closed SPARK-4460. Resolution: Invalid Realized this was invalid. The current implementation is fine, except for a corner case at threshold 0.5. RandomForest classification uses wrong threshold Key: SPARK-4460 URL: https://issues.apache.org/jira/browse/SPARK-4460 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley RandomForest was modified to use WeightedEnsembleModel, but it needs to use a different threshold for prediction than GradientBoosting does. Fix: WeightedEnsembleModel.scala:70 should use threshold 0.0 for boosting and 0.5 for forests. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4452) Shuffle data structures can starve others on the same thread for memory
[ https://issues.apache.org/jira/browse/SPARK-4452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14215411#comment-14215411 ] Tianshuo Deng commented on SPARK-4452: -- Hi, [~andrewor14]: Actually, hash-based shuffle does not fare as badly as sort-based shuffle on this particular problem; we were able to bypass it by using hash-based shuffle. The problem was especially bad for me because of the elementsRead bug, which could be another reason why hash-based shuffle didn't break as badly. Shuffle data structures can starve others on the same thread for memory Key: SPARK-4452 URL: https://issues.apache.org/jira/browse/SPARK-4452 Project: Spark Issue Type: Bug Affects Versions: 1.1.0 Reporter: Tianshuo Deng When an Aggregator is used with ExternalSorter in a task, Spark creates many small spill files and can run into a "too many open files" error during merging. This happens when using the sort-based shuffle. The issue is caused by multiple factors:
1. There seems to be a bug in setting the elementsRead variable in ExternalSorter, which renders trackMemoryThreshold (defined in Spillable) useless for triggering spilling; the PR to fix it is https://github.com/apache/spark/pull/3302
2. The current ShuffleMemoryManager does not work well when there are two spillable objects in a thread, which in this case are ExternalSorter and ExternalAppendOnlyMap (used by Aggregator). Here is an example: due to the use of map-side aggregation, ExternalAppendOnlyMap is created first to read the RDD. It may ask for as much memory as it can get, which is totalMem/numberOfThreads. When ExternalSorter is later created in the same thread, the ShuffleMemoryManager can refuse to allocate more memory to it, since the memory has already been given to the previously requesting object (the ExternalAppendOnlyMap). That causes the ExternalSorter to keep spilling small files due to the lack of memory.
I'm currently working on a PR to address these two issues. It will include the following changes:
1. The ShuffleMemoryManager should track not only the memory usage of each thread, but also which object holds the memory.
2. The ShuffleMemoryManager should be able to trigger the spilling of a spillable object. This way, if a new object in a thread requests memory, the old occupant can be evicted/spilled, which prevents problem 2 from happening. Previously each spillable object triggered spilling by itself, so one might not spill even when another object in the same thread needed more memory; after this change the ShuffleMemoryManager can trigger the spilling of an object whenever it needs to.
3. Make the iterator of ExternalAppendOnlyMap spillable. Previously ExternalAppendOnlyMap returned a destructive iterator and could not be spilled once the iterator was handed out; this should be changed so that the ShuffleMemoryManager can still spill the map even after the iterator is returned.
Currently, I have a working branch in progress: https://github.com/tsdeng/spark/tree/enhance_memory_manager. Change 3 is already made, and a prototype of changes 1 and 2 to evict spillables from the memory manager is still in progress. I will send a PR when it's done. Any feedback or thoughts on this change are highly appreciated! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3633) Fetches failure observed after SPARK-2711
[ https://issues.apache.org/jira/browse/SPARK-3633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14215417#comment-14215417 ] Arun Ahuja commented on SPARK-3633: --- [~andrewor14] We were using hash-based shuffle when we ran into this, due to the Snappy+Kryo issue with sort-based shuffle. Fetches failure observed after SPARK-2711 - Key: SPARK-3633 URL: https://issues.apache.org/jira/browse/SPARK-3633 Project: Spark Issue Type: Bug Components: Block Manager Affects Versions: 1.1.0 Reporter: Nishkam Ravi Priority: Blocker Running a variant of PageRank on a 6-node cluster with a 30GB input dataset. Recently upgraded to Spark 1.1. The workload fails with the following error message(s):
{code}
14/09/19 12:10:38 WARN TaskSetManager: Lost task 51.0 in stage 2.1 (TID 552, c1705.halxg.cloudera.com): FetchFailed(BlockManagerId(1, c1706.halxg.cloudera.com, 49612, 0), shuffleId=3, mapId=75, reduceId=120)
14/09/19 12:10:38 INFO DAGScheduler: Resubmitting failed stages
{code}
In order to identify the problem, I carried out change-set analysis. As I go back in time, the error message changes to:
{code}
14/09/21 12:56:54 WARN TaskSetManager: Lost task 35.0 in stage 3.0 (TID 519, c1706.halxg.cloudera.com): java.io.FileNotFoundException: /var/lib/jenkins/workspace/tmp/spark-local-20140921123257-68ee/1c/temp_3a1ade13-b48a-437a-a466-673995304034 (Too many open files)
	java.io.FileOutputStream.open(Native Method)
	java.io.FileOutputStream.<init>(FileOutputStream.java:221)
	org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:117)
	org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:185)
	org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:197)
	org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:145)
	org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
	org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:51)
	org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
	org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
	org.apache.spark.scheduler.Task.run(Task.scala:54)
	org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199)
	java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	java.lang.Thread.run(Thread.java:745)
{code}
All the way until Aug 4th. Turns out the problem changeset is 4fde28c. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org