Memory config issues

2015-01-18 Thread Alessandro Baretta
 All,

I'm getting out of memory exceptions in SparkSQL GROUP BY queries. I have
plenty of RAM, so I should be able to brute-force my way through, but I
can't quite figure out what memory option affects what process.

My current memory configuration is the following:
export SPARK_WORKER_MEMORY=83971m
export SPARK_DAEMON_MEMORY=15744m

What does each of these config options do exactly?

Also, how come the executors page of the web UI shows no memory usage:

0.0 B / 42.4 GB

And where does 42.4 GB come from?

Alex


GraphX doc: triangleCount() requirement overstatement?

2015-01-18 Thread Michael Malak
According to:
https://spark.apache.org/docs/1.2.0/graphx-programming-guide.html#triangle-counting
 

Note that TriangleCount requires the edges to be in canonical orientation
(srcId < dstId)

But isn't this overstating the requirement? Isn't the requirement really that 
IF there are duplicate edges between two vertices, THEN those edges must all be 
in the same direction (in order for the groupEdges() at the beginning of 
triangleCount() to produce the intermediate results that triangleCount() 
expects)?

If so, should I enter a JIRA ticket to clarify the documentation?

Or is it the case that https://issues.apache.org/jira/browse/SPARK-3650 will 
make it into Spark 1.3 anyway?
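
Concretely, a minimal, untested sketch of the workaround implied above (the
(Long, Long) edge list, object name and app name are made up):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, PartitionStrategy}

object CanonicalTriangles {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("triangle-count-sketch"))

    // Made-up edge list; note (1,2) and (2,1) duplicate the same undirected edge.
    val rawEdges = sc.parallelize(Seq((1L, 2L), (2L, 1L), (2L, 3L), (3L, 1L)))

    // Force canonical orientation (srcId < dstId) so any duplicate edges between
    // two vertices point the same way before groupEdges() collapses them.
    val canonical = rawEdges.map { case (a, b) =>
      if (a < b) Edge(a, b, 1) else Edge(b, a, 1)
    }

    val graph = Graph.fromEdges(canonical, defaultValue = 1)
      .partitionBy(PartitionStrategy.RandomVertexCut)

    graph.triangleCount().vertices.collect().foreach(println)

    sc.stop()
  }
}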




Re: Memory config issues

2015-01-18 Thread Alessandro Baretta
Akhil,

Ah, very good point. I guess SET spark.sql.shuffle.partitions=1024 should
do it.
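
Concretely, a rough sketch of that change (the "events" table, column names
and app name here are placeholders):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("groupby-oom"))
val sqlContext = new SQLContext(sc)

// Raise shuffle parallelism before the aggregation so each reduce task
// handles a smaller slice of the grouped data.
sqlContext.sql("SET spark.sql.shuffle.partitions=1024")

// "events" stands in for whatever table is registered in the real job.
val grouped = sqlContext.sql("SELECT key, COUNT(*) AS cnt FROM events GROUP BY key")
grouped.collect().foreach(println)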

Alex

On Sun, Jan 18, 2015 at 10:29 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:

 It's the executor memory (spark.executor.memory) which you can set while
 creating the Spark context. By default it uses 60% of the executor memory
 for storage (spark.storage.memoryFraction = 0.6). Now, to show some memory
 usage, you need to cache (persist) the RDD. Regarding the OOM exception, you
 can increase the level of parallelism (and you can also increase the number
 of partitions depending on your data size) and it should be fine.

 Thanks
 Best Regards

 On Mon, Jan 19, 2015 at 11:36 AM, Alessandro Baretta 
 alexbare...@gmail.com wrote:

  All,

 I'm getting out of memory exceptions in SparkSQL GROUP BY queries. I have
 plenty of RAM, so I should be able to brute-force my way through, but I
 can't quite figure out what memory option affects what process.

 My current memory configuration is the following:
 export SPARK_WORKER_MEMORY=83971m
 export SPARK_DAEMON_MEMORY=15744m

 What does each of these config options do exactly?

 Also, how come the executors page of the web UI shows no memory usage:

 0.0 B / 42.4 GB

 And where does 42.4 GB come from?

 Alex





Re: RDD order guarantees

2015-01-18 Thread Reynold Xin
Hi Ewan,

Not sure if there is a JIRA ticket (there are too many for me to keep track of).

I chatted briefly with Aaron on this. The way we can solve it is to create
a new FileSystem implementation that overrides the listStatus method, and
then set fs.file.impl to that in the Hadoop conf.

Shouldn't be too hard. Would you be interested in working on it?
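
A rough, untested sketch of that idea (the class name SortedLocalFileSystem
is made up):

import org.apache.hadoop.fs.{FileStatus, LocalFileSystem, Path}

// Untested sketch: a local FileSystem whose directory listings come back
// sorted by name, mimicking the ordering HDFS happens to give.
class SortedLocalFileSystem extends LocalFileSystem {
  override def listStatus(f: Path): Array[FileStatus] =
    super.listStatus(f).sortBy(_.getPath.getName)
}

// Registered before any RDDs are created, e.g.:
//   sc.hadoopConfiguration.set("fs.file.impl", classOf[SortedLocalFileSystem].getName)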




On Fri, Jan 16, 2015 at 3:36 PM, Ewan Higgs ewan.hi...@ugent.be wrote:

  Yes, I am running on a local file system.

 Is there a bug open for this? Mingyu Kim reported the problem last April:

 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-reads-partitions-in-a-wrong-order-td4818.html

 -Ewan


 On 01/16/2015 07:41 PM, Reynold Xin wrote:

 You are running on a local file system, right? HDFS orders the files based
 on names, but local file systems often don't. I think that's why you see the
 difference.

  We might be able to sort and order the partitions when we create an
 RDD to make this universal, though.

 On Fri, Jan 16, 2015 at 8:26 AM, Ewan Higgs ewan.hi...@ugent.be wrote:

 Hi all,
 Quick one: when reading files, is the order of partitions guaranteed to
 be preserved? I am finding some weird behaviour where I run sortByKey() on
 an RDD (which has 16 byte keys) and write it to disk. If I open a Python
 shell and run the following:

 for part in range(29):
     print map(ord,
               open('/home/ehiggs/data/terasort_out/part-r-000{0:02}'.format(part),
                    'r').read(16))

 Then each partition is in order based on the first value of each
 partition.

 I can also call TeraValidate.validate from TeraSort and it is happy with
 the results. It seems to be on loading the file that the reordering
 happens. If this is expected, is there a way to ask Spark nicely to give me
 the RDD in the order it was saved?

 This is based on trying to fix my TeraValidate code on this branch:
 https://github.com/ehiggs/spark/tree/terasort

 Thanks,
 Ewan







Re: GraphX doc: triangleCount() requirement overstatement?

2015-01-18 Thread Reynold Xin
We will merge https://issues.apache.org/jira/browse/SPARK-3650 for 1.3.
Thanks for the reminder!


On Sun, Jan 18, 2015 at 8:34 PM, Michael Malak 
michaelma...@yahoo.com.invalid wrote:

 According to:

 https://spark.apache.org/docs/1.2.0/graphx-programming-guide.html#triangle-counting

 Note that TriangleCount requires the edges to be in canonical orientation
 (srcId < dstId)

 But isn't this overstating the requirement? Isn't the requirement really
 that IF there are duplicate edges between two vertices, THEN those edges
 must all be in the same direction (in order for the groupEdges() at the
 beginning of triangleCount() to produce the intermediate results that
 triangleCount() expects)?

 If so, should I enter a JIRA ticket to clarify the documentation?

 Or is it the case that https://issues.apache.org/jira/browse/SPARK-3650
 will make it into Spark 1.3 anyway?





Re: Memory config issues

2015-01-18 Thread Akhil Das
It's the executor memory (spark.executor.memory) which you can set while
creating the Spark context. By default it uses 60% of the executor memory
for storage (spark.storage.memoryFraction = 0.6). Now, to show some memory
usage, you need to cache (persist) the RDD. Regarding the OOM exception, you
can increase the level of parallelism (and you can also increase the number
of partitions depending on your data size) and it should be fine.

Thanks
Best Regards

On Mon, Jan 19, 2015 at 11:36 AM, Alessandro Baretta alexbare...@gmail.com
wrote:

  All,

 I'm getting out of memory exceptions in SparkSQL GROUP BY queries. I have
 plenty of RAM, so I should be able to brute-force my way through, but I
 can't quite figure out what memory option affects what process.

 My current memory configuration is the following:
 export SPARK_WORKER_MEMORY=83971m
 export SPARK_DAEMON_MEMORY=15744m

 What does each of these config options do exactly?

 Also, how come the executors page of the web UI shows no memory usage:

 0.0 B / 42.4 GB

 And where does 42.4 GB come from?

 Alex
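
A minimal sketch illustrating the above (the memory value, app name and input
path are placeholders, not recommendations):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// spark.executor.memory sets the per-executor heap; roughly the storage
// fraction of it (spark.storage.memoryFraction, 0.6 by default) is what the
// executors page reports as the total available for caching.
val conf = new SparkConf()
  .setAppName("memory-usage-demo")
  .set("spark.executor.memory", "40g")   // example value, not a recommendation
val sc = new SparkContext(conf)

// "Memory Used" stays at 0.0 B until an RDD is actually cached:
val data = sc.textFile("hdfs:///path/to/large/input")   // placeholder path
data.persist(StorageLevel.MEMORY_ONLY)
data.count()   // materializes the cached blocks so the web UI shows usage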



Re: run time exceptions in Spark 1.2.0 manual build together with OpenStack hadoop driver

2015-01-18 Thread Ted Yu
Please take a look at SPARK-4048 and SPARK-5108.

Cheers

On Sat, Jan 17, 2015 at 10:26 PM, Gil Vernik g...@il.ibm.com wrote:

 Hi,

 I took the source code of Spark 1.2.0 and tried to build it together with
 hadoop-openstack.jar (to allow Spark access to OpenStack Swift).
 I used Hadoop 2.6.0.

 The build completed without problems; however, at run time, while trying to
 access the swift:// namespace, I got an exception:
 java.lang.NoClassDefFoundError: org/codehaus/jackson/annotate/JsonClass
  at

 org.codehaus.jackson.map.introspect.JacksonAnnotationIntrospector.findDeserializationType(JacksonAnnotationIntrospector.java:524)
  at

 org.codehaus.jackson.map.deser.BasicDeserializerFactory.modifyTypeByAnnotation(BasicDeserializerFactory.java:732)
 ...and the long stack trace goes here

 Digging into the problem, I saw the following:
 Jackson versions 1.9.X are not backward compatible; in particular, they
 removed the JsonClass annotation.
 Hadoop 2.6.0 uses jackson-asl version 1.9.13, while Spark references an
 older version of Jackson.

 This is the main pom.xml of Spark 1.2.0:

   <dependency>
     <!-- Matches the version of jackson-core-asl pulled in by avro -->
     <groupId>org.codehaus.jackson</groupId>
     <artifactId>jackson-mapper-asl</artifactId>
     <version>1.8.8</version>
   </dependency>

 It references version 1.8.8, which is not compatible with Hadoop 2.6.0.
 If we change the version to 1.9.13, then everything works fine and there are
 no run time exceptions while accessing Swift. The following change will
 solve the problem:

   <dependency>
     <!-- Matches the version of jackson-core-asl pulled in by avro -->
     <groupId>org.codehaus.jackson</groupId>
     <artifactId>jackson-mapper-asl</artifactId>
     <version>1.9.13</version>
   </dependency>

 I am trying to resolve this somehow so people will not run into this
 issue.
 Is there any particular need in Spark for Jackson 1.8.8 rather than 1.9.13?
 Can we remove 1.8.8 and use 1.9.13 for Avro?
 It looks to me like everything works fine when Spark is built with Jackson
 1.9.13, but I am not an expert and not sure what should be tested.

 Thanks,
 Gil Vernik.



Re: Semantics of LGTM

2015-01-18 Thread Reynold Xin
Maybe we should just avoid LGTM as a single token when it doesn't actually
match Patrick's definition, but anybody can still leave comments like:

"The direction of the PR looks good to me." or "+1 on the direction"

"The build part looks good to me"

...


On Sat, Jan 17, 2015 at 8:49 PM, Kay Ousterhout k...@eecs.berkeley.edu
wrote:

 +1 to Patrick's proposal of strong LGTM semantics.  On past projects, I've
 heard the semantics of LGTM expressed as "I've looked at this thoroughly
 and take as much ownership as if I wrote the patch myself."  My
 understanding is that this is the level of review we expect for all patches
 that ultimately go into Spark, so it's important to have a way to concisely
 describe when this has been done.

 Aaron / Sandy, when have you found the weaker LGTM to be useful?  In the
 cases I've seen, if someone else says "I looked at this very quickly and
 didn't see any glaring problems," it doesn't add any value for subsequent
 reviewers (someone still needs to take a thorough look).

 -Kay

 On Sat, Jan 17, 2015 at 8:04 PM, sandy.r...@cloudera.com wrote:

  Yeah, the ASF +1 has become partly overloaded to mean both "I would like
  to see this feature" and "this patch should be committed", although, at
  least in Hadoop, using +1 on JIRA (as opposed to, say, in a release vote)
  should unambiguously mean the latter unless qualified in some other way.
 
  I don't have any opinion on the specific characters, but I agree with
  Aaron that it would be nice to have some sort of abbreviation for both
 the
  strong and weak forms of approval.
 
  -Sandy
 
   On Jan 17, 2015, at 7:25 PM, Patrick Wendell pwend...@gmail.com
 wrote:
  
   I think the ASF +1 is *slightly* different than Google's LGTM, because
   it might convey wanting the patch/feature to be merged but not
   necessarily saying you did a thorough review and stand behind its
   technical contents. For instance, I've seen people pile on +1's to try
   and indicate support for a feature or patch in some projects, even
   though they didn't do a thorough technical review. This +1 is
   definitely a useful mechanism.
  
    There is definitely much overlap in the meaning, though, and
    it's largely because Spark had its own culture around reviews before
   it was donated to the ASF, so there is a mix of two styles.
  
   Nonetheless, I'd prefer to stick with the stronger LGTM semantics I
   proposed originally (unlike the one Sandy proposed, e.g.). This is
   what I've seen every project using the LGTM convention do (Google, and
   some open source projects such as Impala) to indicate technical
   sign-off.
  
   - Patrick
  
   On Sat, Jan 17, 2015 at 7:09 PM, Aaron Davidson ilike...@gmail.com
  wrote:
    I think I've seen something like +2 = "strong LGTM" and +1 = "weak LGTM;
    someone else should review before." It's nice to have a shortcut which isn't
    a sentence when talking about weaker forms of LGTM.
  
   On Sat, Jan 17, 2015 at 6:59 PM, sandy.r...@cloudera.com wrote:
  
    I think clarifying these semantics is definitely worthwhile. Maybe this
    complicates the process with additional terminology, but the way I've used
    these has been:
  
    +1 - I think this is safe to merge and, barring objections from others,
    would merge it immediately.
  
    LGTM - I have no concerns about this patch, but I don't necessarily feel
    qualified to make a final call about it. The TM part acknowledges the
    judgment as a little more subjective.
  
   I think having some concise way to express both of these is useful.
  
   -Sandy
  
   On Jan 17, 2015, at 5:40 PM, Patrick Wendell pwend...@gmail.com
  wrote:
  
   Hey All,
  
    Just wanted to ping about a minor issue - but one that ends up having
    consequences given Spark's volume of reviews and commits. As much as
    possible, I think that we should try and gear towards Google-style
    LGTM on reviews. What I mean by this is that LGTM has the following
   semantics:
  
    "I know this code well, or I've looked at it closely enough to feel
    confident it should be merged. If there are issues/bugs with this code
    later on, I feel confident I can help with them."
  
   Here is an alternative semantic:
  
    "Based on what I know about this part of the code, I don't see any
    show-stopper problems with this patch."
  
    The issue with the latter is that it ultimately erodes the
    significance of LGTM, since subsequent reviewers need to reason about
    what the person meant by saying LGTM. In contrast, having strong
    semantics around LGTM can help streamline reviews a lot, especially as
    reviewers get more experienced and gain trust from the committership.
  
   There are several easy ways to give a more limited endorsement of a
   patch:
    - "I'm not familiar with this code, but style, etc. look good" (general
    endorsement)
    - "The build changes in this code LGTM, but I haven't reviewed the
    rest" (limited LGTM)
  
   If people are okay with this, I might add a short 

Re: run time exceptions in Spark 1.2.0 manual build together with OpenStack hadoop driver

2015-01-18 Thread Sean Owen
Agree, I think this can / should be fixed with a slightly more
conservative version of https://github.com/apache/spark/pull/3938
related to SPARK-5108.

On Sun, Jan 18, 2015 at 3:41 PM, Ted Yu yuzhih...@gmail.com wrote:
 Please take a look at SPARK-4048 and SPARK-5108.

 Cheers

 On Sat, Jan 17, 2015 at 10:26 PM, Gil Vernik g...@il.ibm.com wrote:

 Hi,

 I took the source code of Spark 1.2.0 and tried to build it together with
 hadoop-openstack.jar (to allow Spark access to OpenStack Swift).
 I used Hadoop 2.6.0.

 The build completed without problems; however, at run time, while trying to
 access the swift:// namespace, I got an exception:
 java.lang.NoClassDefFoundError: org/codehaus/jackson/annotate/JsonClass
  at

 org.codehaus.jackson.map.introspect.JacksonAnnotationIntrospector.findDeserializationType(JacksonAnnotationIntrospector.java:524)
  at

 org.codehaus.jackson.map.deser.BasicDeserializerFactory.modifyTypeByAnnotation(BasicDeserializerFactory.java:732)
 ...and the long stack trace goes here

 Digging into the problem, I saw the following:
 Jackson versions 1.9.X are not backward compatible; in particular, they
 removed the JsonClass annotation.
 Hadoop 2.6.0 uses jackson-asl version 1.9.13, while Spark references an
 older version of Jackson.

 This is the main pom.xml of Spark 1.2.0:

   <dependency>
     <!-- Matches the version of jackson-core-asl pulled in by avro -->
     <groupId>org.codehaus.jackson</groupId>
     <artifactId>jackson-mapper-asl</artifactId>
     <version>1.8.8</version>
   </dependency>

 It references version 1.8.8, which is not compatible with Hadoop 2.6.0.
 If we change the version to 1.9.13, then everything works fine and there are
 no run time exceptions while accessing Swift. The following change will
 solve the problem:

   <dependency>
     <!-- Matches the version of jackson-core-asl pulled in by avro -->
     <groupId>org.codehaus.jackson</groupId>
     <artifactId>jackson-mapper-asl</artifactId>
     <version>1.9.13</version>
   </dependency>

 I am trying to resolve this somehow so people will not run into this
 issue.
 Is there any particular need in Spark for Jackson 1.8.8 rather than 1.9.13?
 Can we remove 1.8.8 and use 1.9.13 for Avro?
 It looks to me like everything works fine when Spark is built with Jackson
 1.9.13, but I am not an expert and not sure what should be tested.

 Thanks,
 Gil Vernik.

