StreamingKMeans does not update cluster centroid locations

2016-02-18 Thread ramach1776
I have a streaming application in which I train the model on a receiver
input stream in 4-second batches:

val stream = ssc.receiverStream(receiver) // the receiver gets new data every batch
model.trainOn(stream.map(Vectors.parse))
If I use
model.latestModel.clusterCenters.foreach(println)

the cluster centers remain unchanged from the initial values they were
given during the first iteration (when the streaming app started).
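If the println above sits in the driver setup code, it executes only once,
at startup, so it can only ever show the initial centers. A per-batch check
(a minimal sketch, assuming the same stream and model as above) re-reads
latestModel() on every interval:

// print the current centers once per batch; latestModel() is re-read
// inside the closure, so each batch reflects that batch's update
stream.map(Vectors.parse).foreachRDD { (_, time) =>
  println(s"centers at $time: " +
    model.latestModel().clusterCenters.mkString(", "))
}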

When I use the model to predict cluster assignments on labeled input, the
assignments change over time as expected:

testData.transform { rdd =>
  rdd.map(lp => (lp.label, model.latestModel().predict(lp.features)))
}.print()
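Incidentally, StreamingKMeans also provides predictOnValues for exactly
this keyed-prediction pattern; an equivalent form using the same testData
stream:

// re-evaluates the latest model for every batch of testData
model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()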


Re: adding a split and union to a streaming application causes a big performance hit

2016-02-18 Thread ramach1776
>> streamingContext.remember("duration") did not help
>
> Can you give a bit more detail on the above?
> Did you mean the job encountered an OOME later on?
>
> Which Spark release are you using?

I tried these two global settings (and restarted the app) after enabling
caching for stream1:

conf.set("spark.streaming.unpersist", "true")
streamingContext.remember(Seconds(batchDuration * 4))

The batch duration is 4 seconds.

Using Spark 1.4.1. The application runs for about 4-5 hours and then we see
an out-of-memory error.
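For reference, here is how those two settings fit together (a minimal
sketch; the app name is assumed, not from the actual application):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val batchDuration = 4 // seconds, as described above

val conf = new SparkConf()
  .setAppName("StreamingApp") // assumed name
  .set("spark.streaming.unpersist", "true") // auto-drop generated RDDs

val streamingContext = new StreamingContext(conf, Seconds(batchDuration))
// keep each batch's RDDs for 4 batch intervals before cleanup
streamingContext.remember(Seconds(batchDuration * 4))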


adding a split and union to a streaming application causes a big performance hit

2016-02-17 Thread ramach1776
We have a streaming application that runs approximately 12 jobs every batch
in streaming mode (4-second batches). Each job has several transformations
and one action (output to Cassandra), which triggers execution of the
job's DAG.

For example, the first job looks like this (the bracketed stages produce
the intermediate DStream referred to below):

job 1:
  [receive Stream A --> map --> filter --> (union with another stream B)
   --> map] --> groupByKey --> transform --> reduceByKey --> map

Likewise we go through a few more transforms and save to the database
(job 2, job 3, ...).
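For concreteness, here is a rough Scala sketch of job 1's shape; every
identifier below (streamA, streamB, the lambdas, the placeholder output) is
a hypothetical stand-in, not the actual application code:

// assumed types: streamA is a DStream[String], streamB a DStream[(String, Int)]
val intermediate = streamA
  .map(line => (line.split(",")(0), 1))  // map: key each record (stand-in)
  .filter { case (k, _) => k.nonEmpty }  // filter (stand-in predicate)
  .union(streamB)                        // union with another stream B
  .map(identity)                         // map

intermediate
  .groupByKey()                            // groupbykey
  .transform(rdd => rdd)                   // transform (placeholder)
  .map { case (k, vs) => (k, vs.sum) }     // stand-in for reducebykey --> map
  .foreachRDD(rdd => println(rdd.count())) // stand-in for the Cassandra write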

Recently we added a new transformation further downstream that unions the
intermediate DStream from job 1 (the bracketed portion above) with the
output of a new transformation (job 5). It appears the whole upstream
execution is repeated, which is redundant (I can see this in the execution
graph and in the batch processing time).

That is, with this additional transformation (a union with a stream already
processed upstream), each batch runs as much as 2.5 times slower than runs
without the union. If I cache the intermediate DStream from job 1,
performance improves substantially, but we hit out-of-memory errors within
a few hours.

What is the recommended way to cache/unpersist in such a scenario? There is
no DStream-level "unpersist". Setting "spark.streaming.unpersist" to true
and streamingContext.remember("duration") did not help.
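One workaround pattern worth trying (a sketch under assumptions, not a
confirmed fix): persist the shared DStream at a storage level that can
spill to disk, and explicitly unpersist each batch's RDD once the next
batch arrives, so cached blocks do not pile up. Here "intermediate" is the
hypothetical shared DStream from the sketch above:

import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// persist so downstream jobs (including the new union) reuse computed
// blocks instead of replaying the upstream lineage; the serialized
// memory-and-disk level trades CPU for a smaller, spillable footprint
intermediate.persist(StorageLevel.MEMORY_AND_DISK_SER)

// drop the previous batch's cached RDD when a new batch arrives, so the
// cache does not grow without bound
var previous: Option[RDD[(String, Int)]] = None
intermediate.foreachRDD { rdd =>
  previous.foreach(_.unpersist(blocking = false))
  previous = Some(rdd)
}

This assumes no later job (e.g. a window over intermediate) still reads the
previous batch's RDD at the moment it is dropped.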