Launching EC2 instances with Spark compiled for Scala 2.11

2015-10-08 Thread Theodore Vasiloudis
Hello, I was wondering if there is an easy way to launch EC2 instances which have Spark built for Scala 2.11. The only way I can think of is to prepare the sources for 2.11 as shown in the Spark build instructions (http://spark.apache.org/docs/latest/building-spark.html#building-for-scala-211), u…

Re: Disable stage logging to stdout

2015-04-01 Thread Theodore Vasiloudis
…carriage returns to achieve the animation, and this won't work via a logging framework. stderr is where log-like output goes, because stdout is for program output. On Wed, Apr 1, 2015 at 10:56 AM, Theodore Vasiloudis wrote: > Since switching to Spark 1.2.1 I'm seeing l…

Disable stage logging to stdout

2015-04-01 Thread Theodore Vasiloudis
Since switching to Spark 1.2.1 I'm seeing logging for the stage progress (e.g.): [error] [Stage 2154:> (14 + 8) / 48][Stage 2210:> (0 + 0) / 48] Any reason why these are error-level logs? Shouldn't they be info level? In any case, is there a way to disable them other than…
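
Assuming these lines come from Spark's console progress bar (added in 1.2) rather than from log4j, they can be turned off with the spark.ui.showConsoleProgress setting; a minimal sketch:

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch: disable the [Stage N:> (x + y) / z] console progress bar.
    // spark.ui.showConsoleProgress controls Spark's ConsoleProgressBar,
    // which writes to stderr and is separate from the log4j loggers.
    val conf = new SparkConf()
      .setAppName("NoProgressBar")
      .set("spark.ui.showConsoleProgress", "false")
    val sc = new SparkContext(conf)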

EC2 Having script run at startup

2015-03-24 Thread Theodore Vasiloudis
Hello, in the context of SPARK-2394 (Make it easier to read LZO-compressed files from EC2 clusters), I was wondering: is there an easy way to make a user-provided script run on every machine in a cluster launched on EC2? Regards, Theodore

Re: Efficient self-joins

2014-12-08 Thread Theodore Vasiloudis
…I'll check if increasing this value improves performance. Decreasing the number of partitions has a large negative effect on the runtime. On Mon, Dec 8, 2014 at 5:46 PM, Daniel Darabos <daniel.dara...@lynxanalytics.com> wrote: > On Mon, Dec 8, 2014 at 5:26 PM, Theodore Vasil…

Re: Efficient self-joins

2014-12-08 Thread Theodore Vasiloudis
…see how you hope to generate all incoming edge pairs without repartitioning the data by dstID. You need to perform this shuffle for joining too. Otherwise two incoming edges could be in separate partitions and never meet. Am I missing something? On Mon, Dec 8, 2…

Re: Efficient self-joins

2014-12-08 Thread Theodore Vasiloudis
…in case of self join I would recommend creating an RDD that has an explicit partitioner and has been cached. On Dec 8, 2014 8:52 AM, "Theodore Vasiloudis" <theodoros.vasilou...@gmail.com> wrote: > Hello all, …
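
A minimal sketch of that recommendation (illustrative names and toy data, not from the thread): key the edges by dstID, pre-partition with an explicit HashPartitioner, and cache, so the self-join reuses the partitioning instead of shuffling the data twice:

    import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

    val sc = new SparkContext(
      new SparkConf().setAppName("SelfJoin").setMaster("local[*]"))

    // Toy edge list: (srcID, dstID, weight).
    val edges = sc.parallelize(Seq((1L, 4L, 0.5), (2L, 4L, 1.0), (3L, 4L, 0.25)))

    // Key by dstID, partition explicitly, and cache: both sides of the
    // self-join below share one partitioning, so the join adds no shuffle.
    val byDst = edges
      .map { case (src, dst, w) => (dst, (src, w)) }
      .partitionBy(new HashPartitioner(8))
      .cache()

    // All pairs of incoming edges per destination vertex (includes
    // self-pairs and both orderings; filter as needed).
    val incomingPairs = byDst.join(byDst)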

Efficient self-joins

2014-12-08 Thread Theodore Vasiloudis
Hello all, I am working on a graph problem using vanilla Spark (not GraphX) and at some point I would like to do a self join on an edges RDD[(srcID, dstID, w)] on the dst key, in order to get all pairs of incoming edges. Since this is the performance bottleneck for my code, I was wondering if the…
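
One join-free way to express this, sketched under the assumption that each vertex's incoming edges fit in memory: group by dstID once and emit all pairs within each group, paying a single shuffle:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(
      new SparkConf().setAppName("IncomingPairs").setMaster("local[*]"))

    // Toy edge list: (srcID, dstID, weight).
    val edges = sc.parallelize(Seq((1L, 4L, 0.5), (2L, 4L, 1.0), (3L, 4L, 0.25)))

    // One shuffle: group incoming edges by destination, then emit every
    // pair of distinct incoming edges for that destination.
    val incomingPairs = edges
      .map { case (src, dst, w) => (dst, (src, w)) }
      .groupByKey()
      .flatMap { case (dst, in) =>
        for (a <- in; b <- in if a != b) yield (dst, (a, b))
      }

    incomingPairs.collect().foreach(println)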

Efficient way to get top K values per key in (key, value) RDD?

2014-12-04 Thread Theodore Vasiloudis
Hello everyone, I was wondering what is the most efficient way to retrieve the top K values per key in a (key, value) RDD. The simplest way I can think of is to do a groupByKey, sort the iterables, and then take the top K elements for every key. But reduceByKey is an operation that can be ver…
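
A common answer (a sketch, not taken from the thread): aggregateByKey with a bounded accumulator, so at most K values per key cross the shuffle instead of every value, which is where groupByKey hurts:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(
      new SparkConf().setAppName("TopKPerKey").setMaster("local[*]"))
    val k = 3

    val pairs = sc.parallelize(
      Seq(("a", 5), ("a", 9), ("a", 1), ("a", 7), ("b", 2), ("b", 8)))

    // Keep at most k values per key on each side of every merge, so the
    // shuffle carries O(k) values per key rather than all of them.
    val topK = pairs.aggregateByKey(List.empty[Int])(
      (acc, v) => (v :: acc).sorted(Ordering[Int].reverse).take(k),
      (l, r) => (l ++ r).sorted(Ordering[Int].reverse).take(k))

    topK.collect().foreach(println)  // e.g. (a,List(9, 7, 5)), (b,List(8, 2))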

Re: Spark and Stanford CoreNLP

2014-11-25 Thread Theodore Vasiloudis
Great, Ian's approach seems to work fine. Can anyone provide an explanation as to why this works, while marking the CoreNLP object itself as transient does not? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-and-Stanford-CoreNLP-tp19654p19739.html
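
The code isn't in the thread, but the approach under discussion is usually written as a @transient lazy val (a sketch; the annotator list is an assumption): a lazy val is rebuilt on first use in each executor JVM after deserialization, whereas a plain @transient field arrives as null and is never re-initialized:

    import java.util.Properties
    import edu.stanford.nlp.pipeline.StanfordCoreNLP

    object NLP {
      // Never serialized; constructed lazily, once per executor JVM.
      @transient lazy val pipeline: StanfordCoreNLP = {
        val props = new Properties()
        props.setProperty("annotators", "tokenize, ssplit, pos")
        new StanfordCoreNLP(props)
      }
    }

    // Usage inside a job: rdd.map(text => NLP.pipeline.process(text))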