Alan,
The reason I am trying to write to the same file is that I don't want to
persist each entry as a small file on HDFS. That would make Hive loading very
inefficient, right? (Although I could do the file merging in a separate job.)

My current thought is that I could set up a timer (say, 6 minutes) in my
bolt, collect all the data within that time frame in memory, and then write
it to HDFS as one file per partition. (E.g., if the data in memory spans
partition1 and partition2, I would create two folders (p1, p2) if they do not
already exist, and create only one file under each folder.) But even with this
approach I still risk generating small files whenever the timer window crosses
a partition boundary.
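
To make that concrete, here is a rough sketch of the bolt I have in mind,
using Storm's built-in tick tuples as the 6-minute timer. The class name and
HDFS path are placeholders, and I am assuming the epoch seconds arrive as the
first comma-separated field of each record:

import backtype.storm.Config;
import backtype.storm.Constants;
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Tuple;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PartitionBufferBolt extends BaseRichBolt {
    private OutputCollector collector;
    // one in-memory buffer per partition key, flushed on every tick
    private Map<String, List<String>> buffers;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        this.buffers = new HashMap<String, List<String>>();
    }

    @Override
    public void execute(Tuple tuple) {
        if (isTick(tuple)) {
            flushAll();  // one file per partition seen in this window
        } else {
            String record = tuple.getString(0);
            String partition = partitionOf(record);
            if (!buffers.containsKey(partition)) {
                buffers.put(partition, new ArrayList<String>());
            }
            buffers.get(partition).add(record);
        }
        collector.ack(tuple);
    }

    private boolean isTick(Tuple tuple) {
        return Constants.SYSTEM_COMPONENT_ID.equals(tuple.getSourceComponent())
            && Constants.SYSTEM_TICK_STREAM_ID.equals(tuple.getSourceStreamId());
    }

    // Hypothetical record format: epoch seconds as the first comma-separated
    // field, bucketed into ten-hour (36000 second) partitions.
    private String partitionOf(String record) {
        long epoch = Long.parseLong(record.split(",")[0]);
        return "p" + (epoch / 36000L);
    }

    private void flushAll() {
        try {
            // assumes fs.defaultFS in the classpath config points at the cluster
            FileSystem fs = FileSystem.get(new Configuration());
            for (Map.Entry<String, List<String>> e : buffers.entrySet()) {
                // create() also creates the partition folder if it is missing
                Path file = new Path("/warehouse/mytable/" + e.getKey()
                        + "/part-" + System.currentTimeMillis());
                FSDataOutputStream out = fs.create(file);
                for (String record : e.getValue()) {
                    out.writeBytes(record + "\n");
                }
                out.close();
            }
            buffers.clear();
        } catch (java.io.IOException ex) {
            collector.reportError(ex);
        }
    }

    @Override
    public Map<String, Object> getComponentConfiguration() {
        Map<String, Object> conf = new HashMap<String, Object>();
        conf.put(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, 360);  // the 6-minute timer
        return conf;
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // terminal bolt: emits nothing downstream
    }
}

Each tick still writes one file per partition seen in that window, so the
boundary problem I mentioned above is still there.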

Another Hive question: if I create a Hive table and then drop HDFS files
directly into the table's folder structure, will the table automatically pick
up those files (i.e., will a SELECT see the data)? I tried this with an
external table, and it seems that I still need to load the partition whenever
I create a new partition folder.
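
For reference, registering the new folder by hand looks roughly like this for
me (a sketch over the Hive JDBC driver against HiveServer2; the table name,
partition column, and location are hypothetical):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class AddPartition {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        Connection conn = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "", "");
        Statement stmt = conn.createStatement();
        // point the new partition at the folder the bolt just wrote
        stmt.execute("ALTER TABLE my_table ADD IF NOT EXISTS "
                + "PARTITION (part='p1') LOCATION '/warehouse/mytable/p1'");
        stmt.close();
        conn.close();
    }
}

I understand MSCK REPAIR TABLE my_table would scan for and register any
missing partition directories in one shot, but I was hoping the external
table would see new folders without any extra step.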

Thanks,
Chen



On Tue, Jan 7, 2014 at 11:39 AM, Alan Gates <ga...@hortonworks.com> wrote:

> I am not wise enough in the ways of Storm to tell you how you should
> partition data across bolts.  However, there is no need in Hive for all
> data for a partition to be in the same file, only in the same directory.
>  So if each bolt creates a file for each partition and then all those files
> are placed in one directory and loaded into Hive it will work.
>
> Alan.
>
> On Jan 6, 2014, at 6:26 PM, Chen Wang <chen.apache.s...@gmail.com> wrote:
>
> > Alan,
> > the problem is that the data is partitioned by epoch into ten-hour buckets,
> and I want all data belonging to a given partition to be written into one file
> named after that partition. How can I share the file writer across different
> bolts? Should I route data within the same partition to the same bolt?
> > Thanks,
> > Chen
> >
> >
> > On Fri, Jan 3, 2014 at 3:27 PM, Alan Gates <ga...@hortonworks.com>
> wrote:
> > You shouldn’t need to write each record to a separate file.  Each Storm
> bolt should be able to write to its own file, appending records as it
> goes.  As long as you only have one writer per file this should be fine.
>  You can then close the files every 15 minutes (or whatever works for you)
> and have a separate job that creates a new partition in your Hive table
> with the files created by your bolts.
> >
> > Alan.
> >
> > On Jan 2, 2014, at 11:58 AM, Chen Wang <chen.apache.s...@gmail.com>
> wrote:
> >
> >> Guys,
> >> I am using Storm to read a data stream from our socket server, entry by
> entry, and then write the entries to files: one entry per file. At some point, I
> need to import the data into my Hive table. There are several approaches I
> could think of:
> >> 1. Directly write to the Hive HDFS file whenever I get an entry (from our
> socket server). The problem is that this could be very inefficient, since
> we have a huge amount of streaming data, and I would not want to write to Hive
> HDFS one entry at a time.
> >> Or
> >> 2. I can write the entries to files (normal files or HDFS files) on
> disk, and then have a separate job to merge those small files into big ones,
> and then load them into the Hive table.
> >> The problem with this is: a) how can I merge small files into big files
> for Hive? b) What is the best file size to upload to Hive?
> >>
> >> I am seeking advice on both approaches and would appreciate your insight.
> >> Thanks,
> >> Chen
> >>
> >
> >
>
