The short answer is no. Offline conversion to another format (as you describe) is a better approach anyway. Writing a Parquet/ORC file is more compute-intensive than writing JSON directly to HDFS, and it is not something you need to do in real time since the same data is available in ES/Solr. Doing the conversion inline would slow down the batch indexing topology for no real gain.
> On Jul 15, 2019, at 7:25 AM, <[email protected]> wrote:
>
> Hello all,
>
> I have a question regarding batch indexing. As far as I can see, data are
> stored in JSON format in HDFS. Nevertheless, this uses a lot of storage
> because of JSON verbosity, enrichment, etc. Is there any way to use Parquet,
> for example? I guess it’s possible to do it the day after, i.e. read the JSON
> and save it as another format with Spark, but is it possible to choose the
> format at the batch indexing configuration level?
>
> Thanks a lot
>
> Stéphane
