One option would be to have some other topology or job that kicks off when an HDFS file is closed.
So before we start a new file, we "queue" a log to some conversion topology or job, something like that.

On July 15, 2019 at 10:04:08, Michael Miklavcic (michael.miklav...@gmail.com) wrote:

Adding to what Ryan said (and I agree), there are a couple of additional consequences:

1. There are questions around just how optimal an ORC file written in real time can actually be. To stripe columns of data effectively, you need a sizable number of rows, on the order of thousands. That's probably unlikely in real time, though some of these storage formats also have "engines" running that manage compactions (like HBase does), but I haven't checked on this in a while. I think Kudu may do this, actually, but again, that's a whole new storage engine, not just a format.

2. More importantly, loss of data: HDFS is the source of truth, and we guarantee at-least-once processing. To achieve the efficient columnar layout that makes a columnar format worthwhile, we would likely have to write larger batches in indexing. This creates a potential for lag in the system, where we would have to worry more about Storm failures than we do currently. With HDFS writing, our partial files are still written even if there's a failure in the topology or elsewhere. It does take up more disk space, but we felt this was a reasonable architectural tradeoff for something that should be feasible to write ad hoc.

That being said, you could certainly write conversion jobs that lag the real-time processing just enough to get the benefits of real time and still do a decent job of getting your data into a more efficient storage format, if you choose.

Cheers,
Mike

On Mon, Jul 15, 2019 at 7:00 AM Ryan Merriman <merrim...@gmail.com> wrote:

> The short answer is no. Offline conversion to other formats (as you
> describe) is a better approach anyway.
> Writing to a Parquet/ORC file is more compute intensive than just writing
> JSON data directly to HDFS, and it is not something you need to do in real
> time, since you have the same data available in ES/Solr. This would slow
> down the batch indexing topology for no real gain.
>
> On Jul 15, 2019, at 7:25 AM, <stephane.d...@orange.com> wrote:
>
> > Hello all,
> >
> > I have a question regarding batch indexing. As far as I can see, data
> > are stored in JSON format in HDFS. Nevertheless, this uses a lot of
> > storage because of JSON verbosity, enrichment, etc. Is there any way to
> > use Parquet, for example? I guess it's possible to do it the day after,
> > I mean you read the JSON and with Spark save it as another format, but
> > is it possible to choose the format at the batch indexing configuration
> > level?
> >
> > Thanks a lot
> >
> > Stéphane