I have had a similar issue where I wanted a single source of truth between
Search and HDFS. First, if you zoom out a little, eventually you are going
to have some compute engine(s) process the data. If you store it in a
compute-neutral tier like Kafka, then you will need to pull the data out at
runtime and stage it for the compute engine to use. So pick your poison:
process at ingest and store multiple copies of the data, one per compute
engine, OR store it in a neutral store and process at runtime. I am not
saying one is better than the other, but that's how I see the trade-off, so
depending on your use cases, YMMV.

What I do is:
- Store raw data in Kafka
- Use Spark Streaming to transform the data to JSON and post it back to Kafka
- Hang multiple data stores off Kafka that ingest the JSON
- Do no other transformations in the "consumer" stores, and store each
  copy as an immutable event

So I do have multiple copies (one per compute tier) but they all look the
same.
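
A rough sketch of that flow, in plain Python, with in-memory lists standing
in for the Kafka topics and for the downstream stores. The names Topic, Store
and to_json, and the "key=value;key=value" raw format, are made up for
illustration; none of this is Kafka or Spark API. It only shows the shape:
transform once, then every store ingests the same JSON untouched.

```python
import json


def to_json(raw_line):
    """Transform one raw 'key=value;key=value' line into a JSON string.

    The raw format is an assumption made for this sketch.
    """
    fields = dict(pair.split("=", 1) for pair in raw_line.split(";"))
    return json.dumps(fields, sort_keys=True)


class Topic:
    """Append-only list standing in for a Kafka topic (hypothetical)."""

    def __init__(self):
        self._log = []

    def append(self, message):
        self._log.append(message)

    def read_from(self, offset):
        # Return every message at or after the given offset.
        return self._log[offset:]


class Store:
    """A downstream store that keeps each event exactly as received."""

    def __init__(self, name):
        self.name = name
        self.offset = 0
        self.events = ()  # a tuple, so stored events are never edited in place

    def poll(self, topic):
        new_messages = topic.read_from(self.offset)
        self.events += tuple(new_messages)
        self.offset += len(new_messages)


# Step 1: raw data lands in the raw topic.
raw_topic = Topic()
for line in ["user=alice;action=login", "user=bob;action=search"]:
    raw_topic.append(line)

# Step 2: the transform job (Spark Streaming in my setup) writes JSON
# to a second topic.
json_topic = Topic()
for line in raw_topic.read_from(0):
    json_topic.append(to_json(line))

# Steps 3 and 4: each store ingests the JSON as-is, with no further transforms.
stores = [Store("search"), Store("hdfs")]
for store in stores:
    store.poll(json_topic)
```

Because the only transformation happens before the JSON topic, adding
another store is just another consumer reading from offset 0.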

Unless different compute engines natively start to use a common data
storage format, I don't see how one could get away from storing multiple
copies. Primarily, I see that Lucene-based products have their format, the
Hadoop ecosystem seems to be converging on Parquet, and the NoSQL
players have their own formats (one per product).

My 2 cents worth :)



On Mon, Jul 13, 2015 at 10:35 AM, Daniel Schierbeck <
daniel.schierb...@gmail.com> wrote:

> Am I correct in assuming that Kafka will only retain a file handle for the
> last segment of the log? If the number of handles grows unbounded, then it
> would be an issue. But I plan on writing to this topic continuously anyway,
> so not separating data into cold and hot storage is the entire point.
>
> Daniel Schierbeck
>
> > On 13. jul. 2015, at 15.41, Scott Thibault <
> scott.thiba...@multiscalehn.com> wrote:
> >
> > We've tried to use Kafka not as a persistent store, but as a long-term
> > archival store.  An outstanding issue we've had with that is that the
> > broker holds on to an open file handle on every file in the log!  The
> > other issue we've had is when you create a long-term archival log on
> > shared storage, you can't simply access that data from another cluster
> > b/c of metadata being stored in zookeeper rather than in the log.
> >
> > --Scott Thibault
> >
> >
> > On Mon, Jul 13, 2015 at 4:44 AM, Daniel Schierbeck <
> > daniel.schierb...@gmail.com> wrote:
> >
> >> Would it be possible to document how to configure Kafka to never delete
> >> messages in a topic? It took a good while to figure this out, and I see
> >> it as an important use case for Kafka.
> >>
> >> On Sun, Jul 12, 2015 at 3:02 PM Daniel Schierbeck <
> >> daniel.schierb...@gmail.com> wrote:
> >>
> >>>
> >>>> On 10. jul. 2015, at 23.03, Jay Kreps <j...@confluent.io> wrote:
> >>>>
> >>>> If I recall correctly, setting log.retention.ms and
> >>>> log.retention.bytes to -1 disables both.
> >>>
> >>> Thanks!
> >>>
> >>>>
> >>>> On Fri, Jul 10, 2015 at 1:55 PM, Daniel Schierbeck <
> >>>> daniel.schierb...@gmail.com> wrote:
> >>>>
> >>>>>
> >>>>>> On 10. jul. 2015, at 15.16, Shayne S <shaynest...@gmail.com> wrote:
> >>>>>>
> >>>>>> There are two ways you can configure your topics, log compaction and
> >>>>>> with no cleaning. The choice depends on your use case. Are the records
> >>>>>> uniquely identifiable and will they receive updates? Then log
> >>>>>> compaction is the way to go. If they are truly read only, you can go
> >>>>>> without log compaction.
> >>>>>
> >>>>> I'd rather be free to use the key for partitioning, and the records
> >>>>> are immutable (they're event records), so disabling compaction
> >>>>> altogether would be preferable. How is that accomplished?
> >>>>>>
> >>>>>> We have small processes which consume a topic and perform upserts to
> >>>>>> our various database engines. It's easy to change how it all works and
> >>>>>> simply consume the single source of truth again.
> >>>>>>
> >>>>>> I've written a bit about log compaction here:
> >>>>>> http://www.shayne.me/blog/2015/2015-06-25-everything-about-kafka-part-2/
> >>>>>>
> >>>>>> On Fri, Jul 10, 2015 at 3:46 AM, Daniel Schierbeck <
> >>>>>> daniel.schierb...@gmail.com> wrote:
> >>>>>>
> >>>>>>> I'd like to use Kafka as a persistent store, sort of as an
> >>>>>>> alternative to HDFS. The idea is that I'd load the data into various
> >>>>>>> other systems in order to solve specific needs such as full-text
> >>>>>>> search, analytics, indexing by various attributes, etc. I'd like to
> >>>>>>> keep a single source of truth, however.
> >>>>>>>
> >>>>>>> I'm struggling a bit to understand how I can configure a topic to
> >>>>>>> retain messages indefinitely. I want to make sure that my data isn't
> >>>>>>> deleted. Is there a guide to configuring Kafka like this?
