Chen,

Have a look at Pail (https://github.com/nathanmarz/dfs-datastores). I have no experience with it, but according to a book written by its creator it's a good library :D

I think it fits your model:

- when writing, all your distributed data providers (bolts in your case) write to the same "pail", e.g. /data/logs/ts-1234567
- behind the scenes, /data/logs/ts-1234567 is actually an HDFS folder, and Pail makes sure each source appends to a different, potentially small, file inside that folder
- when reading, you can ask Pail to "absorb" the pail /data/logs/ts-1234567 into one single stream of data that you can feed into Hive or wherever.

Does this make sense for your use case?

Cheers

S

On Wed, Jan 8, 2014 at 2:51 AM, Chen Wang <[email protected]> wrote:

> Dongchao,
> The problem is that I don't want to write each (very small) entry to
> HDFS separately; that would make Hive loading very inefficient (though I
> could merge the files in a separate job). So ideally, I would like to
> write all entries within the same 6 min to the same file.
> Right now I am actually thinking about adding a timer (say 6 min) in my
> bolt, collecting all input in memory, and writing it to a single file on
> timeout...
> Chen
>
> On Tue, Jan 7, 2014 at 5:00 PM, Ding,Dongchao <[email protected]> wrote:
>
>> Hi, some suggestions:
>>
>> You don't need to "direct data within the same hourly tenth to the
>> same bolt"; just write the entries within the same hourly tenth (6
>> min) to the same HDFS directory.
>>
>> That works because a Hive partition maps to one HDFS directory, not
>> one HDFS file.
>>
>> Thanks
>>
>> ding
>>
>> *From:* Chen Wang [mailto:[email protected]]
>> *Sent:* January 8, 2014 7:47
>> *To:* [email protected]
>> *Subject:* write to the same file in bolt?
>>
>> Hey Guys,
>>
>> I am using Storm to read data from our socket server, entry by entry.
>> Each entry has a timestamp. In my bolt, I need to write the entries
>> within the same hourly tenth (6 min) to the same HDFS file, so that
>> later I can load them into Hive (with the hourly tenth, 6 min, as the
>> partition).
>>
>> In order to achieve that, I will need to either
>>
>> 1. direct data within the same hourly tenth to the same bolt, or
>>
>> 2. share the same file writer among all bolts that deal with data
>> within the same hourly tenth.
>>
>> How can I achieve this? Or is there some other approach to this
>> problem?
>>
>> Thank you very much!
>>
>> Chen
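
For reference, the "one HDFS directory per hourly tenth" idea from the thread can be sketched in plain Java as below. This is only an illustration: the class name, the method names and the dt=/hour=/tenth= directory layout are made up for the example, timestamps are assumed to be epoch milliseconds interpreted in UTC, and nothing here calls Storm, Pail or HDFS.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.util.Locale;

/** Hypothetical helper: maps an entry timestamp to its six-minute partition. */
public final class TenthOfHourBucket {
    private static final int BUCKET_MINUTES = 6;

    private TenthOfHourBucket() {}

    /** Returns 0..9, the index of the six-minute window within the hour (UTC). */
    public static int tenthOfHour(long epochMillis) {
        ZonedDateTime t = Instant.ofEpochMilli(epochMillis).atZone(ZoneOffset.UTC);
        return t.getMinute() / BUCKET_MINUTES;
    }

    /**
     * Builds a partition directory such as
     * baseDir/dt=2014-01-08/hour=07/tenth=7 (layout is an assumption).
     * Every writer that sees an entry from the same window gets the same path,
     * so each one can write its own file inside that directory.
     */
    public static String partitionDir(String baseDir, long epochMillis) {
        ZonedDateTime t = Instant.ofEpochMilli(epochMillis).atZone(ZoneOffset.UTC);
        return String.format(Locale.ROOT, "%s/dt=%04d-%02d-%02d/hour=%02d/tenth=%d",
                baseDir, t.getYear(), t.getMonthValue(), t.getDayOfMonth(),
                t.getHour(), t.getMinute() / BUCKET_MINUTES);
    }
}
```

Each bolt would call partitionDir on the entry's timestamp and write its own file under the returned directory, so no file writer has to be shared between bolts.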
