Correction: the HTTP POST path may or may not be faster than writing directly 
over SMB, but hopefully we can improve its speed in a more scalable manner 
than we could with SMB. 

--Ken

On May 16, 2014, at 11:17 AM, Carlile, Ken <carli...@janelia.hhmi.org> wrote:

> Hi all, 
> 
> Sorry for the possible repost--hadn't seen this in the list after 18 hours 
> and figured I'd try again....
> 
> We are experimenting with using Kafka as a midpoint between microscopes and a 
> Spark cluster for data analysis. Our microscopes almost universally use 
> Windows machines for acquisition (as do most scientific instruments), and our 
> compute cluster (which runs Spark among many other things) runs Linux. We use 
> Isilon for file storage primarily, although we also have a GPFS cluster for 
> HPC. 
> 
> We have a working HTTP POST system going into Kafka from the Windows 
> acquisition machine, which is performing more reliably and faster than an SMB 
> connection to the Isilon or GPFS clusters. Unfortunately, the Spark streaming 
> consumer is much slower than reading from disk (Isilon or GPFS) on the Spark 
> cluster. 
> 
> My proposal would be to not only improve the Spark streaming, but also to 
> have a consumer (or multiple consumers!) that writes to disk, either over NFS 
> or "locally" via a GPFS client. 
> 
> As I am a systems engineer, I'm not equipped to write this, so I'm wondering 
> if anyone has done this sort of thing with Kafka before. I know there are 
> HDFS consumers out there, and our Isilons can do HDFS, but the implementation 
> on the Isilon is very limited at this time, and the ability to write to local 
> filesystem or NFS would give us much more flexibility. 
> 
> Ideally, I would like to be able to use Kafka as a high speed transfer point 
> between acquisition instruments (usually running Windows) and several kinds 
> of storage, so that we could write virtually simultaneously to archive 
> storage for the raw data and to HPC scratch for data analysis, thereby 
> limiting the penalty incurred from data movement between storage tiers. 
> 
> Thanks for any input you have,
> 
> --Ken
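
To make the "consumer that writes to disk" idea in the quoted message concrete, here is a minimal Python sketch of the fan-out step only, written under stated assumptions rather than as a finished implementation. It assumes each message key is a UTF-8 file name and each message value is the complete file contents. The names Record, write_atomically and fan_out are hypothetical, and the destination directories stand in for an NFS mount of archive storage and a GPFS scratch path.

```python
import os
import tempfile
from collections import namedtuple

# Hypothetical record shape: any object with .key and .value (both bytes) works.
Record = namedtuple("Record", ["key", "value"])


def write_atomically(directory, filename, payload):
    """Write payload to directory/filename via a temp file plus rename."""
    os.makedirs(directory, exist_ok=True)
    # The temp file lives in the target directory so the rename stays on one filesystem.
    # Note: mkstemp creates the file with mode 0600; adjust permissions if needed.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".partial-")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(payload)
            handle.flush()
            os.fsync(handle.fileno())
        final_path = os.path.join(directory, filename)
        os.replace(tmp_path, final_path)
    except BaseException:
        # Clean up the partial file if the write or rename failed.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return final_path


def fan_out(records, destinations):
    """Write every record to every destination directory; return the paths written."""
    written = []
    for record in records:
        # Keep only the base name so a key cannot point outside the destination.
        filename = os.path.basename(record.key.decode("utf-8"))
        if not filename:
            raise ValueError("record key does not contain a file name")
        for directory in destinations:
            written.append(write_atomically(directory, filename, record.value))
    return written
```

The records argument can be any iterable. With the kafka-python client, for example, a KafkaConsumer can be iterated directly and each record exposes key and value attributes, so it could be passed in as is. Committing offsets only after fan_out has returned for a batch would be one way to avoid losing data if a write fails. Whether a single consumer like this keeps up with acquisition rates is an open question that would need measuring.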
