Nice, I will check that out, Uwe (pyarrow). Ted, that was my first approach, but finding the right way to get the JSON file(s) out of 3+ GB of Parquet was challenging... I tried pushing from Drill's REST API to the ES REST API, but that was not performant at all...
I did stumble across this: https://github.com/moshe/elasticsearch_loader

It was interesting: it skips Drill entirely, reads the Parquet files directly, and streams them (in definable batches) into Elasticsearch. I ran into some issues (which I posted and the author is aware of).

First, you have to specify all the files, so if you have a directory of Parquet it's a bit clumsy. Some fancy BASHing works around that.

Second, if you specify a large number of files, the way the status bar is implemented forces a HUGE preread of the files. The author has promised the ability to disable this in the future. (When I say HUGE I mean YUGE, the BIGGEST preread.) Just to calculate the status bar on a 3 GB table of Parquet read from MapR FS, it required 36 GB of RAM. (Not reasonable.) Basically I used my hacky BASHing to run his tool on every file, one at a time, to avoid that; there's a rough sketch of that loop at the end of this mail. (I want to see if I can distribute it on Mesos too :)

Third, I did see a lot of Elasticsearch load during the import. I was running three nodes using MapR FS via NFS/FUSE as the storage location (a volume for each node), and between Elasticsearch, mfs, and the FUSE client there was a lot of CPU usage to do the load... I wish I were an FS expert so I could figure out how to tune things on that front... but alas, I'm just a Bash/Python happy scripter.

I am excited to see the pyarrow stuff!

John

On Thu, Feb 23, 2017 at 8:29 AM, Uwe L. Korn <[email protected]> wrote:

> On Thu, Feb 23, 2017, at 03:19 PM, Ted Dunning wrote:
> > On Tue, Feb 21, 2017 at 10:32 PM, John Omernik <[email protected]> wrote:
> > >
> > > I guess I am just looking for ideas: how would YOU get data from
> > > Parquet files into Elasticsearch? I have Drill and Spark at the ready,
> > > but want to be able to handle it as efficiently as possible. Ideally,
> > > if we had a well-written ES plugin, I could write a query that
> > > inserted into an index and streamed stuff in... but barring that, what
> > > other methods have people used?
> >
> > My traditional method has been to use Python's version of the ES batch
> > load API. This runs ES pretty hard, but you would need more to saturate
> > a really large ES cluster. Often I export a JSON file using whatever
> > tool (Drill would work) and then use Python on that file. That avoids
> > questions of Python reading obscure stuff. I think that Python is now
> > able to read and write Parquet, but that is pretty new stuff, so I would
> > stay old school there.
>
> If you want to try it, see
> https://pyarrow.readthedocs.io/en/latest/parquet.html
>
> You can use `conda install pyarrow` to get it; probably next Monday it
> will also be pip-installable. It's based on Apache Arrow and Apache
> Parquet C++, and we're happy about any feedback!
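P.S. Here is roughly what my per-file workaround looks like, as a minimal sketch. The elasticsearch_loader flags follow the project's README at the time of writing (double-check them against your installed version), and the index name, doc type, and MapR path are placeholders for my setup:

```python
import glob
import os
import subprocess

PARQUET_DIR = "/mapr/mycluster/data/mytable"  # placeholder path

# Run elasticsearch_loader on one Parquet file at a time so the
# progress-bar preread only ever has to scan a single file.
for path in sorted(glob.glob(os.path.join(PARQUET_DIR, "*.parquet"))):
    subprocess.check_call([
        "elasticsearch_loader",
        "--index", "mytable",   # placeholder index name
        "--type", "record",     # placeholder doc type
        "--bulk-size", "500",
        "parquet", path,
    ])
```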
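For anyone following along, my reading of Ted's "old school" route (export newline-delimited JSON, then push it with the Python client's bulk helper) is roughly this; the host, index, type, and file name are all placeholders:

```python
import json

from elasticsearch import Elasticsearch
from elasticsearch.helpers import streaming_bulk

es = Elasticsearch(["http://localhost:9200"])  # placeholder host

def actions(path, index, doc_type):
    # One bulk action per JSON line; a generator keeps memory flat
    # even for multi-GB exports.
    with open(path) as f:
        for line in f:
            yield {"_index": index, "_type": doc_type, "_source": json.loads(line)}

# Stream the Drill export into ES in chunks of 1000 documents.
for ok, result in streaming_bulk(es, actions("export.json", "mytable", "record"),
                                 chunk_size=1000):
    if not ok:
        print("failed:", result)
```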
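And the pyarrow side from Uwe's link, as far as I can tell from the docs: read a Parquet file into an Arrow table and hop over to pandas from there for row-wise work. The path is a placeholder, and whether you can point read_table at a whole directory may depend on the version you have:

```python
import pyarrow.parquet as pq

# Read a single Parquet file into an Arrow table, then convert to
# pandas (e.g. for emitting JSON lines to feed into ES).
table = pq.read_table("/mapr/mycluster/data/mytable/part-0.parquet")
df = table.to_pandas()
print(table.num_rows, "rows x", table.num_columns, "columns")
```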
