Thanks Simon, saving as a Hive table is also what I had in mind; so easy to do
with Spark.
Stéphane
From: Simon Elliston Ball [mailto:si...@simonellistonball.com]
Sent: Monday, July 15, 2019 17:43
To: user@metron.apache.org
Subject: Re: batch indexing in JSON format
Most users will have a batch process converting the short-term JSON output into
ORC or Parquet files, often adding them to Hive tables at the same time. I
usually do this with a Spark job run every hour, or even every 15 minutes or
less in some cases for high-throughput environments. Anecdotally, I’
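The hourly compaction job described above could be sketched roughly as follows. This is a minimal illustration assuming PySpark, with a hypothetical HDFS path and table name (not Metron's actual job); the import is guarded so the sketch stays readable even where Spark is not installed.

```python
# Sketch of an hourly JSON -> ORC compaction job (hypothetical paths/names).
try:
    from pyspark.sql import SparkSession
    HAVE_SPARK = True
except ImportError:  # Spark not installed; keep the sketch importable anyway
    SparkSession = None
    HAVE_SPARK = False

def compact_hour(spark, json_dir, table_name):
    """Read one hour of short-term JSON output and append it to a Hive table
    stored as ORC, letting Spark handle the column striping in batch."""
    df = spark.read.json(json_dir)
    (df.write
       .mode("append")
       .format("orc")
       .saveAsTable(table_name))

if HAVE_SPARK:
    spark = (SparkSession.builder
             .appName("json-compaction")
             .enableHiveSupport()
             .getOrCreate())
    # Hypothetical sensor output directory and target table.
    compact_hour(spark, "hdfs:///apps/metron/indexing/indexed/bro/", "bro_events")
```

Scheduling this every hour (cron, Oozie, or any workflow tool) keeps the JSON footprint bounded while the long-term copy lives in a compact columnar format.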
Hello all,
Thanks for your useful answers, it all makes sense to me now. So we will
probably go with post-processing file conversion.
Have a good day,
Stéphane
From: Otto Fowler [mailto:ottobackwa...@gmail.com]
Sent: Monday, July 15, 2019 16:19
To: user@metron.apache.org
Subject: Re: batch indexing in JSON format
We could do something like have some other topology or job that kicks off
when an HDFS file is closed.
So before we start a new file, we “queue” a log to some conversion
topology/job whatever or something like that.
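The "queue a log on file close" idea above can be sketched with a plain in-process queue and a worker thread. This is purely illustrative, assuming a hypothetical `on_file_closed` hook and file path; a real deployment would use a Storm topology or a durable queue rather than `queue.Queue`.

```python
# Sketch: when the HDFS writer rolls (closes) a file, enqueue its path so a
# separate conversion worker can pick it up and convert it to ORC/Parquet.
import queue
import threading

conversion_queue = queue.Queue()
converted = []  # stands in for the actual conversion side effect

def on_file_closed(path):
    """Hypothetical hook the writer calls just before starting a new file."""
    conversion_queue.put(path)

def conversion_worker():
    while True:
        path = conversion_queue.get()
        if path is None:  # sentinel: shut the worker down
            break
        # A real job would convert the closed JSON file to ORC/Parquet here.
        converted.append(path)
        conversion_queue.task_done()

worker = threading.Thread(target=conversion_worker)
worker.start()
on_file_closed("/apps/metron/indexing/indexed/bro/enrichment-0001.json")
conversion_queue.put(None)
worker.join()
```

The key property is decoupling: the writer only enqueues a path, so file rolling is never blocked by the (slower) columnar conversion.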
On July 15, 2019 at 10:04:08, Michael Miklavcic (michael.miklav...@gmail.com) wrote:
Adding to what Ryan said (and I agree), there are a couple additional
consequences:
1. There are questions around just how optimal an ORC file written in
real-time can actually be. In order to stripe columns of data
effectively, you need a sizable number of rows (on the order of thousands). That's probably un
The short answer is no. Offline conversion to other formats (as you describe)
is a better approach anyway. Writing to a Parquet/ORC file is more
compute-intensive than just writing JSON data directly to HDFS, and it is not
something you need to do in real time since you have the same data available
Hello all,
I have a question regarding batch indexing. As far as I can see, data are stored
in JSON format in HDFS. Nevertheless, this uses a lot of storage because of
JSON verbosity, enrichments, etc. Is there any way to use Parquet, for example? I
guess it's possible to do it the day after, I mean you
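To illustrate the storage overhead the question refers to: row-oriented JSON repeats every field name in every record, whereas a columnar layout (which Parquet and ORC implement in a far more sophisticated, compressed form) stores each field name once. A stdlib-only toy comparison, with made-up records:

```python
import json

# 1,000 toy records with the kind of repeated keys enrichment adds.
records = [
    {"timestamp": 1563200000 + i,
     "ip_src_addr": "10.0.0." + str(i % 256),
     "protocol": "tcp",
     "is_alert": False}
    for i in range(1000)
]

# Row-oriented JSON lines: every record repeats all four field names.
row_bytes = "\n".join(json.dumps(r) for r in records).encode()

# Crude columnar layout: each field name appears once, values grouped together.
columns = {key: [r[key] for r in records] for key in records[0]}
col_bytes = json.dumps(columns).encode()

print(len(row_bytes), len(col_bytes))  # the columnar form is noticeably smaller
```

Real Parquet/ORC files do much better still, since grouped values of one type compress and encode far more efficiently than this naive JSON-of-lists.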