Hi,

I’ve been working with a pipeline that was initially aimed at processing high-speed 
sensor data, but for a proof of concept I’m feeding it simulated data from a 
CSV file. Each row of the file is one sample across a number of time series, and 
I’ve been using the streaming environment to process each column of the file in 
parallel. The columns are each around 6 million samples long. The outputs 
are sunk into a Derby database table.
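
In case it helps, the row-splitting step looks roughly like this (simplified; the class and field names are illustrative):

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.util.Collector;

// Each CSV row is one sample across all series, so I fan it out into one
// (columnIndex, sampleIndex, value) record per column; downstream operators
// can then process the columns in parallel. The running counter assumes the
// file is read with parallelism 1, so the rows arrive in order.
public class RowSplitter implements FlatMapFunction<String, Tuple3<Integer, Long, Double>> {
    private long sampleIndex = 0;

    @Override
    public void flatMap(String line, Collector<Tuple3<Integer, Long, Double>> out) {
        String[] fields = line.split(",");
        for (int col = 0; col < fields.length; col++) {
            out.collect(new Tuple3<>(col, sampleIndex, Double.parseDouble(fields[col])));
        }
        sampleIndex++;
    }
}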

However, I’m not getting very high throughput, even running across two 32GB 
Dell laptops with the database on a 16GB MacBook. According to the metrics I’m 
only writing records to the database at a rate of around 20-30 per second (I 
assume per parallel pipe, of which I’m running 16 across the two laptops). The 
pipeline runs a couple of windowing operations, one of length 10 and one of 
length 100, with a small amount of computation, but it does produce a 
considerable amount of output: a 10GB CSV file yields a database of around 
100GB+.
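
For reference, the job is wired up roughly as below (heavily simplified, reusing the RowSplitter above; I’ve swapped the real computation and the Derby JDBC sink for sum() and print() stand-ins):

import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.timestamps.AscendingTimestampExtractor;

public class CsvPocJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        DataStream<Tuple3<Integer, Long, Double>> samples = env
            .readTextFile("samples.csv").setParallelism(1) // single reader keeps row order
            .flatMap(new RowSplitter()).setParallelism(1)
            // The sample index doubles as the event timestamp; the ascending
            // extractor stands in for the per-line watermarking I describe below.
            .assignTimestampsAndWatermarks(
                new AscendingTimestampExtractor<Tuple3<Integer, Long, Double>>() {
                    @Override
                    public long extractAscendingTimestamp(Tuple3<Integer, Long, Double> t) {
                        return t.f1;
                    }
                });

        samples
            .keyBy(0)        // one logical pipe per column
            .countWindow(10) // the length-10 window; the length-100 one is analogous
            .sum(2)          // stand-in for the real computation
            .print();        // stand-in for the JDBC sink into the Derby table

        env.execute("CSV proof of concept");
    }
}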

I’m thinking the slow rate is due to using a streaming process to do a batch 
job. So I’ve been looking at the batch support in Flink (1.8), intending to 
move the code over from stream to batch execution. However, I can’t make head 
nor tail of the batch DataSet documentation. For example, in the streaming 
environment I was setting a watermark after reading each line of the file to 
keep the time series in order, and using keyBy to split up the individual time 
series by column number. I can’t find the equivalent operations in the batch 
interface.
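
The closest I’ve been able to piece together on the DataSet side is the sketch below, but I can’t tell whether groupBy plus sortGroup is really the replacement for keyBy plus watermarks, and the GroupReduceFunction is just my guess at how the length-10 count window might be expressed:

import org.apache.flink.api.common.functions.GroupReduceFunction;
import org.apache.flink.api.common.operators.Order;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.util.Collector;

public class CsvBatchAttempt {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        DataSet<Tuple3<Integer, Long, Double>> samples = env
            .readTextFile("samples.csv").setParallelism(1)
            .flatMap(new RowSplitter()).setParallelism(1);

        samples
            .groupBy(0)                    // per-column grouping: the keyBy equivalent?
            .sortGroup(1, Order.ASCENDING) // does this replace the watermarks for ordering?
            .reduceGroup(new GroupReduceFunction<Tuple3<Integer, Long, Double>,
                                                 Tuple2<Integer, Double>>() {
                // My guess at the length-10 window: sum every 10 consecutive samples.
                @Override
                public void reduce(Iterable<Tuple3<Integer, Long, Double>> values,
                                   Collector<Tuple2<Integer, Double>> out) {
                    int n = 0;
                    double sum = 0.0;
                    for (Tuple3<Integer, Long, Double> v : values) {
                        sum += v.f2;
                        if (++n == 10) {
                            out.collect(new Tuple2<>(v.f0, sum));
                            n = 0;
                            sum = 0.0;
                        }
                    }
                }
            })
            .print(); // print() executes the job; the real sink would again be JDBC/Derby
    }
}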

Could someone point me to some relevant online documentation? And before anyone 
suggests it, I have chatted with Dr Google for a good while to no satisfactory 
outcome.

TIA

Nick Walton
