Re: Question about enabling some of missing rules.

2016-05-15 Thread Nicholas Chammas
Ah, I see; good references. Perhaps it's then really a committer judgement call on how many changes are "too many" for a single PR. On Sun, May 15, 2016 at 11:16 PM, Hyukjin Kwon wrote: > Thank you so much for the detailed explanation and the history. > > > I understood and it seems *ProcedureDeclarationCh

Re: Question about enabling some of missing rules.

2016-05-15 Thread Hyukjin Kwon
Thank you so much for the detailed explanation and the history. I understood, and it seems *ProcedureDeclarationChecker* should not be enabled. However, *RedundantIfChecker* seems okay because there are only two errors for it across the code base. I have seen some rules have been added time t

Re: Question about enabling some of missing rules.

2016-05-15 Thread Nicholas Chammas
Relevant discussion from some time ago: https://issues.apache.org/jira/browse/SPARK-3849?focusedCommentId=14168961&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14168961 In short, if enabling a new style rule requires sweeping changes throughout the code base, then

Question about enabling some of missing rules.

2016-05-15 Thread Hyukjin Kwon
Hi all, Lately, I made a list of the rules currently not applied to Spark from http://www.scalastyle.org/rules-dev.html and then tried to test them. I found two rules that I think might be helpful, but I am not too sure. Could I ask whether both can be added? *RedundantIfChecker* (See http://www.scalastyl

PySpark mixed with Jython

2016-05-15 Thread Holden Karau
I've been looking at EclairJS (Spark + JavaScript), which takes a really interesting approach. The driver program is run in Node and the workers are run in Nashorn. I was wondering if anyone has given much thought to optionally exposing an interface for PySpark in a similar fashion. For so

Spark shuffling OutOfMemoryError Java heap space

2016-05-15 Thread Renyi Xiong
Hi, I am consistently observing a driver OutOfMemoryError (Java heap space) during a shuffle operation, as indicated by the log: 16/05/14 21:57:03 INFO MapOutputTrackerMaster: Size of output statuses for shuffle 2 is 36060250 bytes -> the shuffle metadata size is big and the full metadata will be sent

Re: Shrinking the DataFrame lineage

2016-05-15 Thread Hamel Kothari
I don't know about the second one, but for question #1: When you convert from a cached DF to an RDD (via a map function or the "rdd" value), the types are converted from the off-heap types to on-heap types. If your rows are fairly large/complex, this can have a pretty big performance impact, so I would