Re: Streaming scheduling delay

2015-03-01 Thread Josh J
On Fri, Feb 13, 2015 at 2:21 AM, Gerard Maas  wrote:

> KafkaOutputServicePool


Could you please give an example of what KafkaOutputServicePool would look
like? When I tried object pooling I ended up with various not-serializable
exceptions.
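
(A rough sketch of the direction I have in mind, in case it clarifies the
question: keep the producer in a lazy singleton object so it is created once
per executor JVM and is never serialized with the closure. All names, the
topic, and the broker address below are placeholders, and dstream stands for
the DStream being written out.)

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

// Initialized lazily on each executor JVM, so the producer itself is never serialized.
object KafkaSink {
  lazy val producer: KafkaProducer[String, String] = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092") // placeholder broker list
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    new KafkaProducer[String, String](props)
  }
}

// Write per partition, reusing the executor-local producer.
dstream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    records.foreach(r => KafkaSink.producer.send(new ProducerRecord[String, String]("topic2", r)))
  }
}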

Thanks!
Josh


Re: throughput in the web console?

2015-02-25 Thread Josh J
On Wed, Feb 25, 2015 at 7:54 AM, Akhil Das 
wrote:

> For SparkStreaming applications, there is already a tab called "Streaming"
> which displays the basic statistics.


Would I just need to extend this tab to add the throughput?


Re: throughput in the web console?

2015-02-25 Thread Josh J
Let me ask it this way: what would be the easiest way to display the
throughput in the web console? Would I need to create a new tab and add the
metrics? Are there any good or simple examples showing how this can be done?
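
A rough sketch of the listener-based route, in case that turns out to be the
simplest option: compute records per second for each completed batch and log
it (this assumes BatchInfo exposes numRecords and processingDelay, and the
class name is just a placeholder):

import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

// Logs an approximate throughput figure for every completed batch.
class ThroughputListener extends StreamingListener {
  override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
    val info = batch.batchInfo
    for (processingMs <- info.processingDelay if processingMs > 0) {
      val recordsPerSec = info.numRecords * 1000.0 / processingMs
      println(s"batch ${info.batchTime}: ${info.numRecords} records, ~$recordsPerSec records/s")
    }
  }
}

// Registered on the streaming context before ssc.start():
// ssc.addStreamingListener(new ThroughputListener)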

On Wed, Feb 25, 2015 at 12:07 AM, Akhil Das 
wrote:

> Did you have a look at
>
>
> https://spark.apache.org/docs/1.0.2/api/scala/index.html#org.apache.spark.scheduler.SparkListener
>
> And for Streaming:
>
>
> https://spark.apache.org/docs/1.0.2/api/scala/index.html#org.apache.spark.streaming.scheduler.StreamingListener
>
>
>
> Thanks
> Best Regards
>
> On Tue, Feb 24, 2015 at 10:29 PM, Josh J  wrote:
>
>> Hi,
>>
>> I plan to run a parameter search varying the number of cores, epoch, and
>> parallelism. The web console provides a way to archive the previous runs,
>> but is there a way to view the throughput in the console, rather than
>> logging the throughput separately to the log files and correlating the log
>> files with the web console processing times?
>>
>> Thanks,
>> Josh
>>
>
>


throughput in the web console?

2015-02-24 Thread Josh J
Hi,

I plan to run a parameter search varying the number of cores, epoch, and
parallelism. The web console provides a way to archive the previous runs,
but is there a way to view the throughput in the console, rather than
logging the throughput separately to the log files and correlating the log
files with the web console processing times?

Thanks,
Josh


measuring time taken in map, reduceByKey, filter, flatMap

2015-01-30 Thread Josh J
Hi,

I have a stream pipeline which invokes map, reduceByKey, filter, and
flatMap. How can I measure the time taken in each stage?
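
A small sketch of one option, a SparkListener that reports per-stage
durations (the class name is a placeholder). One caveat: narrow
transformations such as map, filter, and flatMap are pipelined into a single
stage, so this only gives per-stage times, not per-operator times.

import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted}

// Prints how long each completed stage took.
class StageTimingListener extends SparkListener {
  override def onStageCompleted(stageCompleted: SparkListenerStageCompleted): Unit = {
    val info = stageCompleted.stageInfo
    for (start <- info.submissionTime; end <- info.completionTime) {
      println(s"stage ${info.stageId} (${info.name}) took ${end - start} ms")
    }
  }
}

// Registered before the streaming job starts:
// ssc.sparkContext.addSparkListener(new StageTimingListener)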

Thanks,
Josh


Re: dockerized spark executor on mesos?

2015-01-14 Thread Josh J
>
> We have dockerized Spark Master and worker(s) separately and are using it
> in
> our dev environment.


Is this setup available on github or dockerhub?

On Tue, Dec 9, 2014 at 3:50 PM, Venkat Subramanian 
wrote:

> We have dockerized Spark Master and worker(s) separately and are using it
> in
> our dev environment. We don't use Mesos though, running it in Standalone
> mode, but adding Mesos should not be that difficult I think.
>
> Regards
>
> Venkat
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/dockerized-spark-executor-on-mesos-tp20276p20603.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


spark standalone master with workers on two nodes

2015-01-13 Thread Josh J
Hi,

I'm trying to run Spark Streaming standalone on two nodes. I'm able to run
on a single node fine. I start both workers and they register in the Spark
UI. However, the application says

"SparkDeploySchedulerBackend: Asked to remove non-existent executor 2"

Any ideas?

Thanks,
Josh


sample is not a member of org.apache.spark.streaming.dstream.DStream

2014-12-28 Thread Josh J
Hi,

I'm trying to use sampling with Spark Streaming. I imported the following

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._


I then call sample


val streamtoread = KafkaUtils.createStream(ssc, zkQuorum, group,
topicMap,StorageLevel.MEMORY_AND_DISK).map(_._2)

streamtoread.sample(withReplacement = true, fraction = fraction)


How do I use the sample() method with Spark Streaming?
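
One workaround that looks plausible, since sample is defined on RDD rather
than on DStream, is to go through transform so the RDD-level sample is
applied to every batch (a sketch only, reusing the streamtoread and fraction
values above):

val sampledStream = streamtoread.transform { rdd =>
  rdd.sample(withReplacement = true, fraction = fraction)
}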


Thanks,

Josh


Re: Publishing a transformed DStream to Kafka

2014-11-30 Thread Josh J
Is there a way to do this that preserves exactly once semantics for the
write to Kafka?

On Tue, Sep 2, 2014 at 12:30 PM, Tim Smith  wrote:

> I'd be interested in finding the answer too. Right now, I do:
>
> val kafkaOutMsgs = kafkInMessages.map(x=>myFunc(x._2,someParam))
> kafkaOutMsgs.foreachRDD((rdd,time) => { rdd.foreach(rec => {
> writer.output(rec) }) } ) //where writer.ouput is a method that takes a
> string and writer is an instance of a producer class.
>
>
>
>
>
> On Tue, Sep 2, 2014 at 10:12 AM, Massimiliano Tomassi <
> max.toma...@gmail.com> wrote:
>
>> Hello all,
>> after having applied several transformations to a DStream I'd like to
>> publish all the elements in all the resulting RDDs to Kafka. What the best
>> way to do that would be? Just using DStream.foreach and then RDD.foreach ?
>> Is there any other built in utility for this use case?
>>
>> Thanks a lot,
>> Max
>>
>> --
>> 
>> Massimiliano Tomassi
>> 
>> e-mail: max.toma...@gmail.com
>> 
>>
>
>


kafka pipeline exactly once semantics

2014-11-30 Thread Josh J
Hi,

In the Spark docs it mentions: "However, output operations (like foreachRDD)
have *at-least once* semantics, that is, the transformed data may get
written to an external entity more than once in the event of a worker
failure."

I would like to set up a Kafka pipeline whereby I write my data to a single
topic (topic 1), then continue to process it using Spark Streaming and write
the transformed results to topic 2, and finally read the results from topic 2.
How do I configure Spark Streaming so that I can maintain exactly-once
semantics when writing to topic 2?

Thanks,
Josh


Re: Adaptive stream processing and dynamic batch sizing

2014-11-14 Thread Josh J
Referring to this paper <http://dl.acm.org/citation.cfm?id=2670995>.

On Fri, Nov 14, 2014 at 10:42 AM, Josh J  wrote:

> Hi,
>
> I was wondering whether the adaptive stream processing and dynamic batch
> sizing are available to use in Spark Streaming. Could someone help
> point me in the right direction?
>
> Thanks,
> Josh
>
>


Adaptive stream processing and dynamic batch sizing

2014-11-14 Thread Josh J
Hi,

I was wondering whether the adaptive stream processing and dynamic batch
sizing are available to use in Spark Streaming. Could someone help
point me in the right direction?

Thanks,
Josh


Re: concat two Dstreams

2014-11-11 Thread Josh J
I think it's just called "union"
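
e.g., assuming both DStreams have the same element type, something like:

val combined = incomingStream.union(utilityStream)  // placeholder stream names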

On Tue, Nov 11, 2014 at 2:41 PM, Josh J  wrote:

> Hi,
>
> Is it possible to concatenate or append two Dstreams together? I have an
> incoming stream that I wish to combine with data that's generated by a
> utility. I then need to process the combined Dstream.
>
> Thanks,
> Josh
>


concat two Dstreams

2014-11-11 Thread Josh J
Hi,

Is it possible to concatenate or append two DStreams? I have an
incoming stream that I wish to combine with data that's generated by a
utility. I then need to process the combined DStream.

Thanks,
Josh


convert List to dstream

2014-11-10 Thread Josh J
Hi,

I have some data generated by some utilities that return the results as
a List. I would like to join this with a DStream of strings. How
can I do this? I tried the following, though I get Scala compiler errors:

val list_scalaconverted = ssc.sparkContext.parallelize(listvalues.toArray())
   list_queue.add(list_scalaconverted)
val list_stream = ssc.queueStream(list_queue)

 found   : org.apache.spark.rdd.RDD[Object]

 required: org.apache.spark.rdd.RDD[String]

Note: Object >: String, but class RDD is invariant in type T.

You may wish to define T as -T instead. (SLS 4.5)


and


 found   : java.util.LinkedList[org.apache.spark.rdd.RDD[String]]

 required: scala.collection.mutable.Queue[org.apache.spark.rdd.RDD[?]]
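
For reference, the two errors seem to point at (1) toArray() on a Java List
returning Array[Object] rather than Array[String], and (2) queueStream
expecting a scala.collection.mutable.Queue of RDDs rather than a
java.util.LinkedList. A hedged sketch of a version that should type-check,
assuming listvalues is the java.util.List of strings mentioned above:

import scala.collection.JavaConverters._
import scala.collection.mutable
import org.apache.spark.rdd.RDD

// Convert the Java List[String] to a Scala Seq so the RDD element type stays String.
val list_scalaconverted: RDD[String] =
  ssc.sparkContext.parallelize(listvalues.asScala.toSeq)

// queueStream wants a Scala mutable Queue of RDDs, not a java.util.LinkedList.
val list_queue = mutable.Queue[RDD[String]]()
list_queue += list_scalaconverted
val list_stream = ssc.queueStream(list_queue)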


Thanks,

Josh


Re: scala RDD sortby compilation error

2014-11-04 Thread Josh J
Please find my code here
<https://gist.github.com/joshjdevl/b9af68b11398fd1823c4>.
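
For reference, the minimal form the error message is asking for only
supplies the key function f, leaving ascending and numPartitions at their
defaults; a tiny sketch separate from the gist above (sc being any
SparkContext):

val words = sc.parallelize(Seq("pear", "apple", "orange"))

// sortBy only needs the key function; the other parameters have defaults.
val sorted = words.sortBy(w => w)                              // lexicographic order
val byLength = words.sortBy(w => w.length, ascending = false)  // longest first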

On Tue, Nov 4, 2014 at 11:33 AM, Josh J  wrote:

> I'm using the same code
> <https://github.com/apache/spark/blob/83b7a1c6503adce1826fc537b4db47e534da5cae/core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala#L687>,
> though still receive
>
>  not enough arguments for method sortBy: (f: String => K, ascending:
> Boolean, numPartitions: Int)(implicit ord: Ordering[K], implicit ctag:
> scala.reflect.ClassTag[K])org.apache.spark.rdd.RDD[String].
>
> Unspecified value parameter f.
>
> On Tue, Nov 4, 2014 at 11:28 AM, Josh J  wrote:
>
>> Hi,
>>
>> Does anyone have any good examples of using sortby for RDDs and scala?
>>
>> I'm receiving
>>
>>  not enough arguments for method sortBy: (f: String => K, ascending:
>> Boolean, numPartitions: Int)(implicit ord: Ordering[K], implicit ctag:
>> scala.reflect.ClassTag[K])org.apache.spark.rdd.RDD[String].
>>
>> Unspecified value parameter f.
>>
>>
>> I tried to follow the example in the test case
>> <https://github.com/apache/spark/blob/83b7a1c6503adce1826fc537b4db47e534da5cae/core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala>
>>  by
>> using the same approach even same method names and parameters though no
>> luck.
>>
>>
>> Thanks,
>>
>> Josh
>>
>
>


Re: scala RDD sortby compilation error

2014-11-04 Thread Josh J
I'm using the same code
<https://github.com/apache/spark/blob/83b7a1c6503adce1826fc537b4db47e534da5cae/core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala#L687>,
though still receive

 not enough arguments for method sortBy: (f: String => K, ascending:
Boolean, numPartitions: Int)(implicit ord: Ordering[K], implicit ctag:
scala.reflect.ClassTag[K])org.apache.spark.rdd.RDD[String].

Unspecified value parameter f.

On Tue, Nov 4, 2014 at 11:28 AM, Josh J  wrote:

> Hi,
>
> Does anyone have any good examples of using sortby for RDDs and scala?
>
> I'm receiving
>
>  not enough arguments for method sortBy: (f: String => K, ascending:
> Boolean, numPartitions: Int)(implicit ord: Ordering[K], implicit ctag:
> scala.reflect.ClassTag[K])org.apache.spark.rdd.RDD[String].
>
> Unspecified value parameter f.
>
>
> I tried to follow the example in the test case
> <https://github.com/apache/spark/blob/83b7a1c6503adce1826fc537b4db47e534da5cae/core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala>
>  by
> using the same approach even same method names and parameters though no
> luck.
>
>
> Thanks,
>
> Josh
>


scala RDD sortby compilation error

2014-11-04 Thread Josh J
Hi,

Does anyone have any good examples of using sortby for RDDs and scala?

I'm receiving

 not enough arguments for method sortBy: (f: String => K, ascending:
Boolean, numPartitions: Int)(implicit ord: Ordering[K], implicit ctag:
scala.reflect.ClassTag[K])org.apache.spark.rdd.RDD[String].

Unspecified value parameter f.


I tried to follow the example in the test case by using the same approach,
even the same method names and parameters, though no luck.


Thanks,

Josh


Re: random shuffle streaming RDDs?

2014-11-03 Thread Josh J
>> If you want to permute an RDD, how about a sortBy() on a good hash
function of each value plus some salt? (Haven't thought this through
much but sounds about right.)

This sounds promising. Where can I read more about the space (memory and
network overhead) and time complexity of sortBy?
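
To make sure I follow, a sketch of my reading of the suggestion: salt each
element with a per-run random value and sort by the hash, so the resulting
order is effectively a random permutation (rdd here is a placeholder for the
batch RDD):

import scala.util.hashing.MurmurHash3

val salt = scala.util.Random.nextInt()  // new salt per run, so each run permutes differently

// Sorting by a salted hash of each element yields a pseudo-random permutation.
val shuffled = rdd.sortBy(x => MurmurHash3.stringHash(x.toString, salt))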



On Mon, Nov 3, 2014 at 10:38 AM, Sean Owen  wrote:

> If you iterated over an RDD's partitions, I'm not sure that in
> practice you would find the order matches the order they were
> received. The receiver is replicating data to another node or node as
> it goes and I don't know much is guaranteed about that.
>
> If you want to permute an RDD, how about a sortBy() on a good hash
> function of each value plus some salt? (Haven't thought this through
> much but sounds about right.)
>
> On Mon, Nov 3, 2014 at 4:59 PM, Josh J  wrote:
> > When I'm outputting the RDDs to an external source, I would like the
> RDDs to
> > be outputted in a random shuffle so that even the order is random. So far
> > what I understood is that the RDDs do have a type of order, in that the
> > order for spark streaming RDDs would be the order in which spark
> streaming
> > read the tuples from source (e.g. ordered by roughly when the producer
> sent
> > the tuple in addition to any latency)
> >
> > On Mon, Nov 3, 2014 at 8:48 AM, Sean Owen  wrote:
> >>
> >> I think the answer will be the same in streaming as in the core. You
> >> want a random permutation of an RDD? in general RDDs don't have
> >> ordering at all -- excepting when you sort for example -- so a
> >> permutation doesn't make sense. Do you just want a well-defined but
> >> random ordering of the data? Do you just want to (re-)assign elements
> >> randomly to partitions?
> >>
> >> On Mon, Nov 3, 2014 at 4:33 PM, Josh J  wrote:
> >> > Hi,
> >> >
> >> > Is there a nice or optimal method to randomly shuffle spark streaming
> >> > RDDs?
> >> >
> >> > Thanks,
> >> > Josh
> >
> >
>


Re: random shuffle streaming RDDs?

2014-11-03 Thread Josh J
When I'm outputting the RDDs to an external source, I would like the RDDs
to be output in a random shuffle so that even the order is random. So
far what I understand is that the RDDs do have a type of order, in that the
order for Spark Streaming RDDs would be the order in which Spark Streaming
read the tuples from the source (i.e. ordered roughly by when the producer
sent each tuple, plus any latency).

On Mon, Nov 3, 2014 at 8:48 AM, Sean Owen  wrote:

> I think the answer will be the same in streaming as in the core. You
> want a random permutation of an RDD? in general RDDs don't have
> ordering at all -- excepting when you sort for example -- so a
> permutation doesn't make sense. Do you just want a well-defined but
> random ordering of the data? Do you just want to (re-)assign elements
> randomly to partitions?
>
> On Mon, Nov 3, 2014 at 4:33 PM, Josh J  wrote:
> > Hi,
> >
> > Is there a nice or optimal method to randomly shuffle spark streaming
> RDDs?
> >
> > Thanks,
> > Josh
>


random shuffle streaming RDDs?

2014-11-03 Thread Josh J
Hi,

Is there a nice or optimal method to randomly shuffle spark streaming RDDs?

Thanks,
Josh


Re: run multiple spark applications in parallel

2014-10-28 Thread Josh J
Sorry, I should've included some stats with my email.

I execute each job in the following manner:

./bin/spark-submit --class CLASSNAME --master yarn-cluster --driver-memory
1g --executor-memory 1g --executor-cores 1 UBER.JAR
${ZK_PORT_2181_TCP_ADDR} my-consumer-group1 1


The box has

24 CPUs, Intel(R) Xeon(R) CPU E5-2420 v2 @ 2.20GHz

32 GB RAM


Thanks,

Josh

On Tue, Oct 28, 2014 at 4:15 PM, Soumya Simanta 
wrote:

> Try reducing the resources (cores and memory) of each application.
>
>
>
> > On Oct 28, 2014, at 7:05 PM, Josh J  wrote:
> >
> > Hi,
> >
> > How do I run multiple spark applications in parallel? I tried to run on
> yarn cluster, though the second application submitted does not run.
> >
> > Thanks,
> > Josh
>


run multiple spark applications in parallel

2014-10-28 Thread Josh J
Hi,

How do I run multiple Spark applications in parallel? I tried to run on a
YARN cluster, though the second application submitted does not run.

Thanks,
Josh


combine rdds?

2014-10-27 Thread Josh J
Hi,

How can I combine RDDs? I would like to combine two RDDs if the count in
an RDD is not above some threshold.
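
A sketch of the kind of thing I mean (rdd1, rdd2, and threshold are
placeholders):

// Union the two RDDs only when the first one is small enough.
val combined =
  if (rdd1.count() <= threshold) rdd1.union(rdd2)
  else rdd1

// Note: count() runs a full job, so caching rdd1 beforehand avoids recomputing it.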

Thanks,
Josh


exact count using rdd.count()?

2014-10-27 Thread Josh J
Hi,

Is the following guaranteed to always provide an exact count?

foreachRDD(foreachFunc = rdd => {
  rdd.count()
})

In the literature it mentions "However, output operations (like foreachRDD)
have *at-least once* semantics, that is, the transformed data may get
written to an external entity more than once in the event of a worker
failure."

http://spark.apache.org/docs/latest/streaming-programming-guide.html#failure-of-a-worker-node

Thanks,
Josh


docker spark 1.1.0 cluster

2014-10-24 Thread Josh J
Hi,

Are there Dockerfiles available which would allow setting up a dockerized
Spark 1.1.0 cluster?

Thanks,
Josh


streaming join sliding windows

2014-10-22 Thread Josh J
Hi,

How can I join neighboring sliding windows in Spark Streaming?

Thanks,
Josh


dynamic sliding window duration

2014-10-07 Thread Josh J
Hi,

I have a source whose streaming tuple rate fluctuates. I would like to
process batches defined by tuple count rather than by window duration. Is
it possible to either

1) define batch window sizes
or
2) dynamically adjust the duration of the sliding window?

Thanks,
Josh


Re: countByWindow save the count ?

2014-08-26 Thread Josh J
Thanks. I'm just confused about the syntax; I'm not sure in which variable
the value of the count is stored so that I can save it. Any examples
or tips?
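
A sketch of the shape I think is being suggested below, where stream is a
placeholder for the input DStream and saveCount stands in for whatever
writes to Kafka or Redis (each windowed RDD holds a single Long, the count
for that window):

import org.apache.spark.streaming.Seconds

val counts = stream.countByWindow(Seconds(30), Seconds(10))  // placeholder window/slide durations

counts.foreachRDD { rdd =>
  // The RDD contains one element: the count for the current window.
  rdd.collect().foreach(count => saveCount(count))
}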


On Mon, Aug 25, 2014 at 9:49 PM, Daniil Osipov 
wrote:

> You could try to use foreachRDD on the result of countByWindow with a
> function that performs the save operation.
>
>
> On Fri, Aug 22, 2014 at 1:58 AM, Josh J  wrote:
>
>> Hi,
>>
>> Hopefully a simple question. Though is there an example of where to save
>> the output of countByWindow ? I would like to save the results to external
>> storage (kafka or redis). The examples show only stream.print()
>>
>> Thanks,
>> Josh
>>
>
>


countByWindow save the count ?

2014-08-22 Thread Josh J
Hi,

Hopefully a simple question: is there an example of where to save the
output of countByWindow? I would like to save the results to external
storage (Kafka or Redis). The examples show only stream.print().

Thanks,
Josh


DStream start a separate DStream

2014-08-21 Thread Josh J
Hi,

I would like to have a sliding-window DStream perform a streaming
computation and store the results. Once these results are stored, I then
would like to process them. However, I must wait until the final
computation is done for all tuples in the sliding window before I begin the
new DStream. How can I accomplish this with Spark?

Sincerely,
Josh


multiple windows from the same DStream ?

2014-08-21 Thread Josh J
Hi,

Can I build two sliding windows in parallel from the same DStream? Will
these two window streams run in parallel and process the same data? I wish
to apply two different functions (aggregation on one window and storage for
the other window) to the same original DStream data, with the same
window sizes.

JavaPairReceiverInputDStream<String, String> messages =
    KafkaUtils.createStream(jssc, args[0], args[1], topicMap);

Duration windowLength = new Duration(3);
Duration slideInterval = new Duration(3);
JavaPairDStream<String, String> windowMessages1 =
    messages.window(windowLength, slideInterval);
JavaPairDStream<String, String> windowMessages2 =
    messages.window(windowLength, slideInterval);


Thanks,
Josh


Difference between amplab docker and spark docker?

2014-08-20 Thread Josh J
Hi,

What's the difference between the amplab docker and the spark docker?

Thanks,
Josh