The short answer is no. Offline conversion to another format (as you describe) is a better approach anyway. Writing a Parquet/ORC file is more compute-intensive than writing JSON directly to HDFS, and it is not something you need to do in real time since the same data is available in ES/Solr. Doing the conversion inline would slow down the batch indexing topology for no real gain.
> On Jul 15, 2019, at 7:25 AM, <[email protected]> wrote:
>
> Hello all,
>
> I have a question regarding batch indexing. As far as I can see, data are
> stored in JSON format in HDFS. Nevertheless, this uses a lot of storage
> because of JSON verbosity, enrichment, etc. Is there any way to use Parquet,
> for example? I guess it’s possible to do it the day after, i.e. read the JSON
> and save it as another format with Spark, but is it possible to choose the
> format at the batch indexing configuration level?
>
> Thanks a lot
>
> Stéphane
