We could have some other topology or job kick off when an HDFS file is
closed.

So before we start a new file, we “queue” the finished log to some conversion
topology/job, or something along those lines.
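Roughly what I'm picturing, as a minimal sketch on top of the HDFS inotify
event stream (the NameNode URI, the output path, and the "queue" step are all
placeholders, and reading inotify events requires HDFS superuser access):

    import java.net.URI
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.hdfs.client.HdfsAdmin
    import org.apache.hadoop.hdfs.inotify.Event

    object CloseEventWatcher {
      def main(args: Array[String]): Unit = {
        // Tail the NameNode's inotify event stream (requires HDFS superuser).
        val admin  = new HdfsAdmin(URI.create("hdfs://namenode:8020"), new Configuration())
        val events = admin.getInotifyEventStream()

        while (true) {
          val batch = events.take() // blocks until the NameNode has new events
          batch.getEvents.foreach {
            case close: Event.CloseEvent if close.getPath.startsWith("/apps/metron/indexing/indexed") =>
              // "Queue" the finished file for conversion -- publish the path to a
              // Kafka topic, drop a marker file, whatever the conversion job reads.
              println(s"closed: ${close.getPath} (${close.getFileSize} bytes)")
            case _ => // ignore creates, appends, renames, etc.
          }
        }
      }
    }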

On July 15, 2019 at 10:04:08, Michael Miklavcic (michael.miklav...@gmail.com)
wrote:

Adding to what Ryan said (and I agree), there are a couple additional
consequences:

   1. There are questions around just how optimal an ORC file written in
   real time can actually be. To get columns of data striped effectively, you
   need many thousands of rows per file, and that's unlikely when writing in
   real time. Some of these storage formats do have "engines" running that
   manage compactions in the background (the way HBase does), but I haven't
   checked on this in a while. I think Kudu may do this, actually, but again,
   that's a whole new storage engine, not just a file format.
   2. More importantly - potential loss of data - HDFS is the source of
   truth. We guarantee at-least-once processing. To make a columnar format
   efficient, we'd likely have to write larger batches in indexing. That
   introduces lag in the system and means we'd have to worry more about Storm
   failures than we do today. With the current HDFS writing, partial files
   are still written even if there's a failure in the topology or elsewhere.
   It does take up more disk space, but we felt that was a reasonable
   tradeoff architecturally for something that should be feasible to rewrite
   ad-hoc.

That being said, you could certainly write conversion jobs that lag the
real-time processing just enough to keep the benefits of real-time while
still getting your data into a more efficient storage format, if you choose.
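For what it's worth, that "lagging" conversion job can be pretty small. A
rough sketch in Spark/Scala - the paths, the sensor name, and the row-count
check are placeholders, not anything Metron ships:

    import org.apache.spark.sql.SparkSession

    object JsonToOrcBackfill {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("metron-json-to-orc").getOrCreate()

        // Placeholder paths: the HDFS writer's JSON output for one sensor,
        // and a separate location for the columnar copy.
        val sourceDir = "hdfs:///apps/metron/indexing/indexed/bro"
        val targetDir = "hdfs:///apps/metron/archive/bro_orc"

        // In practice you'd scope this to files old enough that the real-time
        // topology is guaranteed to be done with them (e.g. the previous hour).
        val json = spark.read.json(sourceDir)
        json.write.mode("overwrite").orc(targetDir)

        // HDFS (the JSON) stays the source of truth; only think about dropping
        // it once the converted copy accounts for every row.
        val converted = spark.read.orc(targetDir)
        require(converted.count() == json.count(), "row counts differ; keep the JSON")

        spark.stop()
      }
    }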

Cheers,
Mike


On Mon, Jul 15, 2019 at 7:00 AM Ryan Merriman <merrim...@gmail.com> wrote:

> The short answer is no.  Offline conversion to other formats (as you
> describe) is a better approach anyway.  Writing to a Parquet/ORC file is
> more compute-intensive than just writing JSON data directly to HDFS, and
> it's not something you need to do in real time since you have the same data
> available in ES/Solr.  This would slow down the batch indexing topology for
> no real gain.
>
> On Jul 15, 2019, at 7:25 AM, <stephane.d...@orange.com> wrote:
>
> Hello all,
>
>
>
> I have a question regarding batch indexing. As far as I can see, data is
> stored in JSON format in HDFS. However, this uses a lot of storage because
> of JSON's verbosity, the enrichments, etc. Is there any way to use Parquet,
> for example? I guess it's possible to do it the day after - I mean, read
> the JSON and save it in another format with Spark - but is it possible to
> choose the format at the batch indexing configuration level?
>
>
>
> Thanks a lot
>
>
>
> Stéphane