Hi Mirko,
Thanks for the reply.
Let's assume I have one record per second for each entity.
entity_id | timestamp | data
1 , 2014-01-01 12:13:01 - I want this
... some more records for different entities
1 , 2014-01-01 12:13:02
1 , 2014-01-01 12:13:03
1 , 2014-01-01 12:13:04
1 , 2014-01-01 12:13:05
........
1 , 2014-01-01 12:23:01 - I want this
1 , 2014-01-01 12:23:02
The problem is that in reality the data does not arrive sorted by
entity_id, timestamp, so I can't filter in the mapper.
Each mapper gets a different mix of entity_ids, depending on its input split.
Georgi
On 19.09.2014 10:34, Mirko Kämpf wrote:
Hi Georgi,
I would already emit the new time stamp (with resolution 10 min) in
the mapper. This allows you to (pre)aggregate the data already in the
mapper and you have less traffic during the shuffle & sort stage.
Does changing the resolution mean you have to aggregate the individual
entities, or do you still need all individual entities and just want to
translate the timestamp to another resolution (5s => 10 min)?
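The pre-aggregation idea above can be sketched roughly as follows. This is plain Python standing in for a Hadoop mapper and reducer, not actual Hadoop API code, and the names bucket_start, map_split and reduce_partials are made up for illustration. It assumes that "10 min resolution" means keeping the earliest record per entity in each clock-aligned 10-minute bucket (12:10, 12:20, ...), which is not quite the same as "every 10 minutes counted from the entity's first record".

```python
from datetime import datetime

def bucket_start(ts: datetime) -> datetime:
    """Truncate a timestamp to the start of its clock-aligned 10-minute bucket."""
    return ts.replace(minute=ts.minute - ts.minute % 10, second=0, microsecond=0)

def _keep_earliest(acc, key, ts, data):
    # Keep only the record with the smallest timestamp for each key.
    if key not in acc or ts < acc[key][0]:
        acc[key] = (ts, data)

def map_split(records):
    """Mapper side (hypothetical): one pass over an unsorted input split.

    Emits at most one record per (entity_id, bucket) instead of every record,
    so far less data goes through shuffle and sort.
    """
    earliest = {}
    for entity_id, ts, data in records:
        _keep_earliest(earliest, (entity_id, bucket_start(ts)), ts, data)
    return earliest

def reduce_partials(partials):
    """Reducer side (hypothetical): merge the per-split results.

    Taking the minimum is associative and commutative, so the input order
    and the way records are spread over splits do not matter.
    """
    merged = {}
    for partial in partials:
        for key, (ts, data) in partial.items():
            _keep_earliest(merged, key, ts, data)
    return merged
```

Because the bucket is part of the key, no sorting is needed in the mapper, and the same minimum-keeping logic could also serve as a combiner.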
Cheers,
Mirko
2014-09-19 9:17 GMT+01:00 Georgi Ivanov <[email protected]>:
Hello,
I have time-related data like this:
entity_id, timestamp, data
The resolution of the data is about 5 seconds.
I want to extract the data at 10-minute resolution.
So what I can do is:
Just emit everything in the mapper, as the data is not sorted there.
Emit only one record every 10 minutes from the reducer. The reducer receives
data sorted by the (entity_id, timestamp) pair (secondary sorting).
This will work fine, but it will take forever, since I have to
process TBs of data.
Also, the data sent to the reducers will be huge (as I am not
filtering in the map phase at all), and the number of reducers is much
smaller than the number of mappers.
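For reference, the reducer-side filtering described above could look roughly like this. It is a plain Python sketch, not Hadoop code, and downsample_sorted is a made-up name. It assumes the reducer sees the records of one entity already sorted by timestamp, and that "every 10 minutes" is counted from the last emitted record.

```python
from datetime import datetime, timedelta

def downsample_sorted(records, step=timedelta(minutes=10)):
    """Yield the first record, then each record at least `step` after the last emitted one.

    `records` is an iterable of (timestamp, data) pairs for ONE entity,
    already sorted by timestamp (as secondary sorting would deliver them).
    """
    last_emitted = None
    for ts, data in records:
        if last_emitted is None or ts - last_emitted >= step:
            last_emitted = ts
            yield ts, data
```

This only needs the previous emitted timestamp as state, but it relies on the sort order, which is why it cannot run in the mapper as is.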
Are there any better ideas for how to do this?
Georgi