Oh, just realized: chapter 3 of the book I was referring to is actually free on the publisher's web page; you'll find an illustrated explanation of Pail there:
http://manning.com/marz/

On Wed, Jan 8, 2014 at 12:03 PM, Svend Vanderveken <[email protected]> wrote:

> Chen,
>
> Have a look at Pail (https://github.com/nathanmarz/dfs-datastores). I've
> no experience with it, but according to a book written by its creator it's
> a good library :D
>
> I think it fits your model:
>
> - When writing, all your distributed data providers (bolts in your
>   case) write to the same "pail", e.g. /data/logs/ts-1234567.
> - Behind the scenes, /data/logs/ts-1234567 is actually an HDFS folder,
>   and Pail makes sure each source is appending to a different, potentially
>   small, file inside that folder.
> - When reading, you can ask Pail to "absorb" the pail
>   /data/logs/ts-1234567 into one single stream of data that you can feed
>   into Hive or wherever.
>
> Does this make sense for your use case?
>
> Cheers
>
> S
>
> On Wed, Jan 8, 2014 at 2:51 AM, Chen Wang <[email protected]> wrote:
>
>> Dongchao,
>> The problem is that I would not want to write each entry (very small) to
>> HDFS; this would make Hive loading very inefficient (though I could do
>> file merging in a separate job). So ideally, I would like to write all
>> entries within the same 6 min to the same file.
>> Right now I am actually thinking about adding a timer (say 6 min) in my
>> bolt, collecting all input in memory, and writing to a single file on
>> timeout...
>> Chen
>>
>> On Tue, Jan 7, 2014 at 5:00 PM, Ding,Dongchao <[email protected]> wrote:
>>
>>> Hi, some suggestions:
>>>
>>> You don't need to "instruct data within the same hourly tenth to the
>>> same bolt"; just write the entries within the same hourly tenth (6
>>> min) to the same HDFS directory.
>>>
>>> This is because a Hive partition maps to one HDFS directory, not one
>>> HDFS file.
>>>
>>> thks
>>>
>>> ding
>>>
>>> *From:* Chen Wang [mailto:[email protected]]
>>> *Sent:* January 8, 2014, 7:47
>>> *To:* [email protected]
>>> *Subject:* write to the same file in bolt?
>>>
>>> Hey guys,
>>>
>>> I am using Storm to read data from our socket server, entry by entry.
>>> Each entry has a timestamp. In my bolt, I need to write the entries
>>> within the same hourly tenth (6 min) to the same HDFS file, so that
>>> later I can load them into Hive (with the hourly tenth, 6 min, as the
>>> partition).
>>>
>>> In order to achieve that, I will either need to:
>>>
>>> 1. instruct data within the same hourly tenth to the same bolt, or
>>> 2. share the same file writer for all bolts that deal with data within
>>> the same hourly tenth.
>>>
>>> How can I achieve this? Or is there some other approach to this
>>> problem?
>>>
>>> Thank you very much!
>>>
>>> Chen
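For what it's worth, the timer-and-buffer approach Chen describes (floor each entry's timestamp to its 6-minute "hourly tenth", buffer entries per partition in memory, flush each partition's batch on timeout) could be sketched roughly as follows. This is plain Java with no Storm or HDFS dependencies; the class and method names (`EntryBuffer`, `partitionPath`, `flush`) are made up for illustration, and a real bolt would do the `add` in `execute()` and write the flushed batch to HDFS on a tick/timeout rather than returning it:

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of "buffer in memory, flush per hourly tenth".
public class EntryBuffer {
    private static final long TENTH_MILLIS = 6 * 60 * 1000L; // 6-minute bucket
    private static final DateTimeFormatter FMT =
        DateTimeFormatter.ofPattern("yyyyMMdd/HHmm").withZone(ZoneOffset.UTC);

    // One in-memory batch per partition path.
    private final Map<String, List<String>> buffers = new HashMap<>();

    // Map a timestamp (ms since epoch) to its hourly-tenth partition path
    // by flooring it to the start of its 6-minute bucket. 86400 s divides
    // evenly into 360 s buckets, so buckets align with midnight UTC.
    static String partitionPath(long timestampMillis) {
        long bucketStart = (timestampMillis / TENTH_MILLIS) * TENTH_MILLIS;
        return "ts=" + FMT.format(Instant.ofEpochMilli(bucketStart));
    }

    // Accumulate an entry under its partition's buffer.
    public void add(long timestampMillis, String entry) {
        buffers.computeIfAbsent(partitionPath(timestampMillis),
                                k -> new ArrayList<>()).add(entry);
    }

    // Drain one partition's batch, as a flush-on-timeout would;
    // returns an empty list if nothing was buffered for it.
    public List<String> flush(String partition) {
        List<String> batch = buffers.remove(partition);
        return batch == null ? List.of() : batch;
    }
}
```

Note that this matches Ding's point as well: since a Hive partition maps to a directory, each flushed batch only needs to land in the right partition directory, not necessarily in a single shared file.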
