Hello,
I have time-series data in this format:
entity_id, timestamp, data
The resolution of the data is roughly 5 seconds.
I want to extract the data at 10-minute resolution.
So what I can do is:
1. Emit everything from the mapper, since the data is not sorted there.
2. Emit only one record per 10 minutes from the reducer. The reducer receives
the data sorted by the (entity_id, timestamp) pair (secondary sort).
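
To make the plan concrete, here is a minimal Python sketch of the two steps, run locally rather than on Hadoop. The names (mapper, reducer, run_job, BUCKET_SECONDS), the integer epoch-second timestamps, and the rule "keep the first record of each 10-minute bucket" are assumptions for illustration only, not the actual job.

```python
from itertools import groupby

BUCKET_SECONDS = 600  # 10 minutes (assumed bucket size)


def mapper(line):
    """Emit every record unchanged, keyed by (entity_id, timestamp)."""
    entity_id, timestamp, data = (f.strip() for f in line.split(",", 2))
    yield (entity_id, int(timestamp)), data


def reducer(sorted_records):
    """Expects records sorted by (entity_id, timestamp).

    Keeps the first record of each 10-minute bucket per entity
    (an assumed down-sampling rule).
    """
    for entity_id, group in groupby(sorted_records, key=lambda kv: kv[0][0]):
        last_bucket = None
        for (_, timestamp), data in group:
            bucket = timestamp // BUCKET_SECONDS
            if bucket != last_bucket:
                last_bucket = bucket
                yield entity_id, timestamp, data


def run_job(lines):
    """Local stand-in for map -> shuffle/sort -> reduce."""
    mapped = [kv for line in lines for kv in mapper(line)]
    mapped.sort(key=lambda kv: kv[0])  # stands in for the secondary sort
    return list(reducer(mapped))
```

Note that every input record passes through the sort step here, which is exactly the cost described below.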
This will work, but it will take forever, since I have to process
terabytes of data.
Also, the amount of data sent to the reducers will be huge (I am not
filtering in the map phase at all), and the number of reducers is much
smaller than the number of mappers.
Is there a better way to do this?
Georgi