I'm using Elasticsearch with elasticsearch-spark BUILD-SNAPSHOT and Spark/Spark SQL 1.2.0, on Costin Leau's advice.
I want to query Elasticsearch for a set of JSON documents from within Spark SQL, then run a SQL query that selects a single column (which is really a JSON key) -- the normal thing Spark SQL does with the SQLContext.jsonFile(filePath) facility. The only difference is that I am reading the documents through the Elasticsearch connector instead.

The big problem: when I run something like SELECT jsonKeyA FROM tempTable; I actually get the WRONG key's values out of the JSON documents! I discovered that if the JSON documents physically contain keys in the order D, C, B, A, the connector discovers those keys but then sorts them alphabetically as A, B, C, D -- so when I SELECT A FROM tempTable, I actually get column D (because the physical documents had key D in the first position). This only happens when reading from Elasticsearch through Spark SQL.

It gets much worse: when a key is missing from one of the documents (and that key's value should simply come back as NULL), the whole application crashes with a java.lang.IndexOutOfBoundsException -- the inferred schema is completely wrong. In the example above, with documents containing keys in the physical order D, C, B, A, if one document is missing the key/column I am querying for, I get that java.lang.IndexOutOfBoundsException.

I am using the BUILD-SNAPSHOT because, per Costin, the elasticsearch-spark project would not build for me otherwise.

Any clues here? Any fixes?
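To make the first symptom concrete, here is a small, self-contained Scala sketch of the mechanism I *suspect* is at play (this is my guess, not the connector's actual code): the discovered keys get sorted alphabetically for the schema while the row values stay in physical document order, so key A ends up paired with key D's value.

```scala
// Hypothetical illustration of the misalignment I'm seeing -- not the
// connector's real implementation.
object SchemaMismatch {
  def main(args: Array[String]): Unit = {
    // Physical key order and values as they appear in the JSON documents
    val docKeys   = Seq("D", "C", "B", "A")
    val docValues = Seq("dVal", "cVal", "bVal", "aVal")

    // Suspected buggy behavior: schema keys sorted alphabetically,
    // but row values left in document order
    val schemaKeys = docKeys.sorted              // A, B, C, D
    val buggyRow   = schemaKeys.zip(docValues).toMap

    // Correct behavior: each key paired with its own value
    val correctRow = docKeys.zip(docValues).toMap

    println(buggyRow("A"))   // prints dVal -- the value belonging to key D
    println(correctRow("A")) // prints aVal
  }
}
```

If a document is then missing one of the keys, the values list is shorter than the sorted schema, which would explain the IndexOutOfBoundsException rather than a NULL.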