On Thu, Feb 23, 2017, at 03:19 PM, Ted Dunning wrote:
> On Tue, Feb 21, 2017 at 10:32 PM, John Omernik <[email protected]> wrote:
>
> > I guess, I am just looking for ideas, how would YOU get data from Parquet
> > files into Elastic Search? I have Drill and Spark at the ready, but want to
> > be able to handle it as efficiently as possible. Ideally, if we had a well
> > written ES plugin, I could write a query that inserted into an index and
> > streamed stuff in... but barring that, what other methods have people used?
>
> My traditional method has been to use Python's version of the ES batch load
> API. This runs ES pretty hard, but you would need more to saturate a really
> large ES cluster. Often I export a JSON file using whatever tool (Drill
> would work) and then use the python on that file. Avoids questions of
> Python reading obscure stuff. I think that Python is now able to read and
> write Parquet, but that is pretty new stuff, so I would stay old school
> there.
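
For the batch-load part, the usual shape is something like this minimal sketch (untested here; the host, index name, and file path are placeholders, and it assumes the official `elasticsearch` Python client plus a newline-delimited JSON export from Drill):

    import json

    from elasticsearch import Elasticsearch
    from elasticsearch.helpers import bulk

    # Host, index name, and file path are hypothetical placeholders.
    es = Elasticsearch(["http://localhost:9200"])

    def gen_actions(path, index="parquet_export"):
        # One bulk action per line of the newline-delimited JSON export.
        # (On ES 5.x-era clusters you would also set a "_type" key.)
        with open(path) as f:
            for line in f:
                yield {"_index": index, "_source": json.loads(line)}

    ok, errors = bulk(es, gen_actions("export.json"))
    print("indexed:", ok, "errors:", errors)

The `bulk` helper chunks the actions into batched requests for you, which is why a short script like this can already run ES pretty hard.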
If you want to try it, see https://pyarrow.readthedocs.io/en/latest/parquet.html. You can use `conda install pyarrow` to get it; probably next Monday it will also be pip-installable. It's based on Apache Arrow and Apache Parquet C++. We're happy about any feedback!
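
For reference, reading a Parquet file with pyarrow boils down to roughly this (a quick sketch based on the docs linked above; the file name is a placeholder, and pandas is only needed for the `to_pandas()` step):

    import pyarrow.parquet as pq

    # "data.parquet" is a hypothetical file name.
    table = pq.read_table("data.parquet")  # read the file into an Arrow Table
    print(table.schema)

    df = table.to_pandas()                 # convert to a pandas DataFrame if you want rows
    print(df.head())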
