Re: How to create Track per vehicle using spark RDD

2014-10-15 Thread Sean Owen
You say you use reduceByKey, but are you really collecting all the tuples
for a vehicle into a collection, like what groupByKey does already? Yes,
if one vehicle has a huge amount of data, that could fail.

Otherwise perhaps you are simply not increasing memory from the default.

Maybe you can consider using something like vehicle and *day* as a
key. This would make you process each day of data separately, but if
that's fine for you, it might drastically cut down the data associated
with a single key.
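A minimal sketch of that composite-key idea in plain Python (no Spark; the record layout and sample data are assumptions, not from the thread), grouping by (vehicle, day) instead of vehicle alone so no single key accumulates a whole vehicle's history:

```python
from collections import defaultdict
from datetime import datetime

# Records: (vehicle, timestamp, position) -- illustrative sample data.
records = [
    ("V1", "2014-10-14T09:00", (0.0, 0.0)),
    ("V1", "2014-10-14T10:00", (0.0, 1.0)),
    ("V1", "2014-10-15T09:00", (1.0, 1.0)),
    ("V2", "2014-10-14T09:30", (2.0, 2.0)),
]

# Key by (vehicle, day): each day's positions are grouped separately,
# keeping any single key's value list small.
groups = defaultdict(list)
for vehicle, ts, pos in records:
    day = datetime.fromisoformat(ts).date()
    groups[(vehicle, day)].append((ts, pos))
```

In Spark the same idea would just change the key in the keyBy/map step before grouping.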

Spark Streaming has a windowing function, and there is a window
function for an entire RDD, but I am not sure if there is support for
a 'window by key' anywhere. You can perhaps get your direct approach
of collecting events working with some of the changes above.

Otherwise I think you have to roll your own to some extent, creating
the overlapping buckets of data, which will mean mapping the data to
several copies of itself. This might still be quite feasible depending
on how big a lag you are thinking of.
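One way to picture the "roll your own" overlapping buckets, as a plain-Python sketch (bucket width and record layout are assumptions): each record is emitted into its own time bucket and the following one, so every bucket also contains the tail of its predecessor and a lag that crosses a bucket boundary still finds the previous point.

```python
BUCKET = 3600  # bucket width in seconds (assumed)

def to_buckets(vehicle, epoch_seconds, position):
    """Emit the record into its own bucket and the next one, so
    consecutive buckets overlap; this is the 'several copies of
    itself' mapping mentioned above."""
    b = epoch_seconds // BUCKET
    return [((vehicle, b), (epoch_seconds, position)),
            ((vehicle, b + 1), (epoch_seconds, position))]

pairs = to_buckets("V1", 7200, (1.0, 2.0))
```

In Spark this would be a flatMap before grouping by the (vehicle, bucket) key.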

PS for the interested, this is what LAG is:
http://www.oracle-base.com/articles/misc/lag-lead-analytic-functions.php#lag
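For the record, LAG over an in-memory sequence amounts to pairing each element with the one `offset` positions before it; a plain-Python sketch:

```python
def lag(values, offset=1):
    """Return (previous, current) pairs; previous is None when there
    is no element `offset` positions back (mirroring SQL LAG's NULL)."""
    return [(values[i - offset] if i >= offset else None, v)
            for i, v in enumerate(values)]
```

For example, `lag([10, 20, 30])` yields `[(None, 10), (10, 20), (20, 30)]`.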

On Wed, Oct 15, 2014 at 1:37 AM, Manas Kar manasdebashis...@gmail.com wrote:
 Hi,
  I have an RDD containing Vehicle Number, timestamp, Position.
  I want the equivalent of a lag function over my RDD to be able to create
 track segments for each Vehicle.

 Any help?

 PS: I have tried reduceByKey and then splitting the resulting list of
 positions into tuples. It runs out of memory every time because of the
 volume of data.

 ...Manas

 For some reason I have never got any reply to my emails to the user group. I
 am hoping to break that trend this time. :)

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: How to create Track per vehicle using spark RDD

2014-10-15 Thread manasdebashiskar
It is wonderful to see some ideas.
Now the questions:
1) What is a track segment?
 Ans) It is the line that connects two adjacent points when all points are
arranged by time. Say a vehicle moves (t1, p1) -> (t2, p2) -> (t3, p3).
Then the segments are (p1, p2), (p2, p3) when the time ordering is
(t1 < t2 < t3).
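That segment construction is straightforward once the points are in memory; a small Python sketch (sample timestamps/positions are placeholders): sort by time, then pair each position with the next.

```python
points = [("t3", "p3"), ("t1", "p1"), ("t2", "p2")]  # (timestamp, position)

ordered = [p for _, p in sorted(points)]       # arrange by time
segments = list(zip(ordered, ordered[1:]))     # adjacent pairs -> segments
```

The hard part in Spark is doing this per key without materializing one vehicle's entire point list, which is exactly the memory problem above.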
2) What is the lag function?
Ans) Sean's link explains it.

A little bit more about my requirement:
 What I need to calculate is a density map of vehicles in a certain area.
Because of a user-specific requirement I can't use just points; I will
have to use segments.
 I already have a gridRDD containing 1km polygons for the whole world.
My approach is
1) create a tracksegmentRDD of (vehicle, segment)
2) do a cartesian of tracksegmentRDD and gridRDD, and for each row check if
the segment intersects the polygon. If it does, count it as 1.
3) Group the result above by vehicle (probably reduceByKey(_ + _)) to get
the density map
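The three steps can be sketched in plain Python (no Spark). Note the heavy simplifications: grid cells are axis-aligned boxes rather than polygons, and the "intersects" test only checks the segment's endpoints against the cell, so it misses segments that merely pass through; a real implementation needs proper segment/polygon intersection.

```python
from itertools import product
from collections import defaultdict

# Step 1: (vehicle, segment) pairs -- illustrative sample data.
segments = [("V1", ((0.5, 0.5), (1.5, 0.5))),
            ("V2", ((0.2, 0.2), (0.4, 0.4)))]
# The grid: axis-aligned cells as ((xmin, ymin), (xmax, ymax)).
cells = [((0, 0), (1, 1)), ((1, 0), (2, 1))]

def touches(seg, cell):
    """Simplified intersection test: any endpoint inside the cell."""
    (xmin, ymin), (xmax, ymax) = cell
    return any(xmin <= x <= xmax and ymin <= y <= ymax for x, y in seg)

density = defaultdict(int)
for (vehicle, seg), cell in product(segments, cells):  # step 2: the cartesian
    if touches(seg, cell):
        density[(vehicle, cell)] += 1                  # step 3: reduceByKey(_ + _)
```

Be aware that a full cartesian of tracksegmentRDD and gridRDD is huge; in practice you would map each segment to its candidate cells first and join on cell id instead.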

I am also looking at this thread
http://apache-spark-user-list.1001560.n3.nabble.com/Finding-previous-and-next-element-in-a-sorted-RDD-td12621.html
which seems to have some potential. I will give it a try.

..Manas






-
Manas Kar
--

Re: How to create Track per vehicle using spark RDD

2014-10-14 Thread Mohit Singh
Perhaps it's just me, but the lag function isn't familiar to me.
But have you tried configuring Spark appropriately?
http://spark.apache.org/docs/latest/configuration.html


On Tue, Oct 14, 2014 at 5:37 PM, Manas Kar manasdebashis...@gmail.com
wrote:

 Hi,
  I have an RDD containing Vehicle Number, timestamp, Position.
  I want the equivalent of a lag function over my RDD to be able to
 create track segments for each Vehicle.

 Any help?

 PS: I have tried reduceByKey and then splitting the resulting list of
 positions into tuples. It runs out of memory every time because of the
 volume of data.

 ...Manas

 *For some reason I have never got any reply to my emails to the user
 group. I am hoping to break that trend this time. :)*




-- 
Mohit

When you want success as badly as you want the air, then you will get it.
There is no other secret of success.
-Socrates