RE: batch indexing in JSON format

2019-07-15 Thread stephane.davy
Thanks Simon, saving as a Hive table is also what I had in mind; so easy to do with Spark. Stéphane

Re: batch indexing in JSON format

2019-07-15 Thread Simon Elliston Ball
Most users will have a batch process converting the short-term JSON output into ORC or Parquet files, often adding them to Hive tables at the same time. I usually do this with a Spark job run every hour, or even every 15 minutes or less in some high-throughput environments. Anecdotally, I’…

RE: batch indexing in JSON format

2019-07-15 Thread stephane.davy
Hello all, Thanks for your useful answers, it all makes sense to me now. So we will probably go with post-processing file conversion. Have a good day, Stéphane

Re: batch indexing in JSON format

2019-07-15 Thread Otto Fowler
We could do something like having some other topology or job that kicks off when an HDFS file is closed. So before we start a new file, we “queue” a log to some conversion topology/job, or something like that.
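Otto's trigger-on-file-close idea can be approximated in a few lines. The stdlib-only sketch below watches a local directory and queues a file for conversion once it looks finished (unmodified past a grace period); a real deployment would hook HDFS file rotation or inotify instead, and the paths and threshold here are invented for illustration.

```python
# Sketch of queuing "closed" files for a downstream conversion job.
# "Closed" is approximated as "mtime older than a grace period".
import os
import tempfile
import time
from queue import Queue

GRACE_SECONDS = 1.0  # consider a file closed if untouched this long

def scan_for_closed_files(directory, already_queued, queue):
    """Enqueue files that look finished and haven't been queued yet."""
    now = time.time()
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if path in already_queued:
            continue
        if now - os.path.getmtime(path) >= GRACE_SECONDS:
            queue.put(path)          # hand off to the conversion job
            already_queued.add(path)

# Demo: drop one fake rotated file, wait past the grace period, scan.
landing = tempfile.mkdtemp()
with open(os.path.join(landing, "enrichment-0.json"), "w") as f:
    f.write('{"ip_src_addr": "10.0.0.1"}\n')

work, seen = Queue(), set()
time.sleep(GRACE_SECONDS + 0.2)
scan_for_closed_files(landing, seen, work)
print(work.qsize())
```

Polling mtimes is crude but dependency-free; the consumer draining `work` would be the JSON-to-ORC/Parquet conversion job discussed above.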

Re: batch indexing in JSON format

2019-07-15 Thread Michael Miklavcic
Adding to what Ryan said (and I agree), there are a couple of additional consequences: 1. There are questions around just how optimal an ORC file written in real time can actually be. In order to get columns of data striped effectively, you need a sizable number of rows (thousands). That's probably un…

Re: batch indexing in JSON format

2019-07-15 Thread Ryan Merriman
The short answer is no. Offline conversion to other formats (as you describe) is a better approach anyway. Writing to a Parquet/ORC file is more compute-intensive than just writing JSON data directly to HDFS, and it is not something you need to do in real time since you have the same data available…

batch indexing in JSON format

2019-07-15 Thread stephane.davy
Hello all, I have a question regarding batch indexing. As far as I can see, data are stored in JSON format in HDFS. Nevertheless, this uses a lot of storage because of JSON verbosity, enrichments, etc. Is there any way to use Parquet, for example? I guess it’s possible to do it the day after; I mean you…