Nice, I will check that out, Uwe (pyarrow). Ted, that was my first approach, but finding the right way to get the JSON file(s) out of 3+ GB of Parquet was challenging... I tried pushing from Drill's REST API to the ES REST API, but that was not performant at all...
I did stumble across this: https://github.com/moshe/elasticsearch_loader

It was interesting: it skips Drill entirely, reads the Parquet files directly, and streams them (in definable batches) into Elasticsearch. I ran into some issues (which I posted and the author is aware of).

First, you have to specify all the files, so if you have a directory of Parquet it's a bit clumsy. Some fancy BASHing works around that.

Second, if you specify a large number of files, the way the status bar is implemented forces a HUGE preread of the files. The author has promised the ability to disable this in the future. (When I say HUGE I mean YUGE, the BIGGEST preread.) Just to calculate the status bar on a 3 GB table of Parquet read from MapR FS, it required 36 GB of RAM. (Not reasonable.) Basically I used my hacky BASHing to run his tool on every file, one at a time, to avoid that; there's a rough sketch of that loop at the end of this mail. (I want to see if I can distribute it on Mesos too :)

Third, I did see a lot of Elasticsearch load during the import. I was running three nodes using MapR FS via NFS/FUSE as the storage location (a volume for each node), and between Elasticsearch, mfs, and the FUSE client there was a lot of CPU usage to do the load... I wish I were an FS expert so I could figure out how to tune things on that front... but alas, I'm just a Bash/Python happy scripter.

I am excited to see the pyarrow stuff!

John

On Thu, Feb 23, 2017 at 8:29 AM, Uwe L. Korn <[email protected]> wrote:

> On Thu, Feb 23, 2017, at 03:19 PM, Ted Dunning wrote:
> > On Tue, Feb 21, 2017 at 10:32 PM, John Omernik <[email protected]> wrote:
> > >
> > > I guess I am just looking for ideas: how would YOU get data from
> > > Parquet files into Elasticsearch? I have Drill and Spark at the ready,
> > > but want to be able to handle it as efficiently as possible. Ideally,
> > > if we had a well-written ES plugin, I could write a query that
> > > inserted into an index and streamed stuff in... but barring that, what
> > > other methods have people used?
> >
> > My traditional method has been to use Python's version of the ES batch
> > load API. This runs ES pretty hard, but you would need more to saturate
> > a really large ES cluster. Often I export a JSON file using whatever
> > tool (Drill would work) and then use Python on that file. That avoids
> > questions of Python reading obscure stuff. I think that Python is now
> > able to read and write Parquet, but that is pretty new stuff, so I would
> > stay old school there.
>
> If you want to try it, see
> https://pyarrow.readthedocs.io/en/latest/parquet.html
>
> You can use `conda install pyarrow` to get it; probably next Monday it
> will also be pip-installable. It's based on Apache Arrow and Apache
> Parquet C++, and we're happy about any feedback!
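P.S. Here is roughly what my per-file workaround looks like, as a minimal sketch. The elasticsearch_loader flags follow the project's README at the time of writing (double-check them against your installed version), and the index name, doc type, and MapR path are placeholders for my setup:

```python
import glob
import os
import subprocess

PARQUET_DIR = "/mapr/mycluster/data/mytable"  # placeholder path

# Run elasticsearch_loader on one Parquet file at a time so the
# progress-bar preread only ever has to scan a single file.
for path in sorted(glob.glob(os.path.join(PARQUET_DIR, "*.parquet"))):
    subprocess.check_call([
        "elasticsearch_loader",
        "--index", "mytable",   # placeholder index name
        "--type", "record",     # placeholder doc type
        "--bulk-size", "500",
        "parquet", path,
    ])
```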
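For anyone following along, my reading of Ted's "old school" route (export newline-delimited JSON, then push it with the Python client's bulk helper) is roughly this; the host, index, type, and file name are all placeholders:

```python
import json

from elasticsearch import Elasticsearch
from elasticsearch.helpers import streaming_bulk

es = Elasticsearch(["http://localhost:9200"])  # placeholder host

def actions(path, index, doc_type):
    # One bulk action per JSON line; a generator keeps memory flat
    # even for multi-GB exports.
    with open(path) as f:
        for line in f:
            yield {"_index": index, "_type": doc_type, "_source": json.loads(line)}

# Stream the Drill export into ES in chunks of 1000 documents.
for ok, result in streaming_bulk(es, actions("export.json", "mytable", "record"),
                                 chunk_size=1000):
    if not ok:
        print("failed:", result)
```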
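And the pyarrow side from Uwe's link, as far as I can tell from the docs: read a Parquet file into an Arrow table and hop over to pandas from there for row-wise work. The path is a placeholder, and whether you can point read_table at a whole directory may depend on the version you have:

```python
import pyarrow.parquet as pq

# Read a single Parquet file into an Arrow table, then convert to
# pandas (e.g. for emitting JSON lines to feed into ES).
table = pq.read_table("/mapr/mycluster/data/mytable/part-0.parquet")
df = table.to_pandas()
print(table.num_rows, "rows x", table.num_columns, "columns")
```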
