Re: msgpack handling lists with elements of different types

2018-10-17 Thread Jean-Claude Cote
I see what you mean about a schema only getting you so far. Your Fred Flintstone
example shows how you almost need the ability to apply a transformation at
the reader level (instead of at the projection level) to properly read such
data files.

I think I agree with Charles Givre. I've always liked the Drill tag line about
using the data "in-situ". I like the fact that you can just write
files to disk without an ingestion process and start playing with them.




Re: msgpack handling lists with elements of different types

2018-10-17 Thread Paul Rogers
Hi JC,

Bingo, you just hit the core problem with schema-on-read: there is no "right" 
rule for how to handle ambiguous or inconsistent schemas. Take your 
string/binary example. You determined that the binary fields were actually 
strings (encoded in what, UTF-8? ASCII? Host's native codeset?) The answer 
could have been the opposite: maybe these are packet sniffs and the data is 
typically binary, except when the analyzer was able to pull out strings. The 
point is, there is no right answer: it depends.

The same is true with heterogeneous lists, inconsistent maps, etc. Do these 
represent a lazy script (writing numbers sometimes as strings, sometimes as 
numbers), or a deeper problem: that you are supposed to look at a "type" code in 
the object to determine the meaning of the other fields?

I wrestled with these issues myself when rewriting the JSON reader to use the 
new result set loader. My notes are in DRILL-4710. I ended up with a slogan: 
"Drill cannot predict the future."

Frankly, this issue has existed as long as Drill has existed. Somehow we've 
muddled through, which might be an indication that this issue is not worth 
fixing. (In the Drill book, for example, I document these issues and conclude 
by telling people to use Spark or Hive to ETL the data into Parquet.) Since 
Parquet is Drill's primary format, odd cases in JSON tend to not get much 
attention.

You are right: the only way I know of to resolve the issue is for the user to 
tell us their intention. We just suggested that one way to express intention is 
to do what Impala does, and what the book documents: have the user use Spark or 
Hive to ETL the data into a clean, unambiguous Parquet format. That is, 
delegate the solution of the problem to other tools in the big data stack.

We've also suggested that users solve the problem via very clever views and use 
of all-text mode and numbers-as-double mode, handling lots of cases. But this 
does not scale (and the options must be set manually prior to each query, then 
reset for the next). There are cases, documented in DRILL-4710, where even this 
does not work. (Column c is sometimes a map, sometimes a scalar, say.)

You've invented a mechanism for expressing schema, and the team appears to be 
working on a Drill metastore. So, that's a third solution.

The fourth solution is to build on what you've done with MsgPack: write a 
custom parser for each odd file format. This might be needed if the format is 
more odd than a schema can fix. Perhaps a custom "meta-parser" on top of JSON 
or MsgPack would be needed to convert data from the odd file format to the 
extended-relational format which Drill uses.

Here are two of the classics that fall into that category. The "tuple as an 
array" format:

{fields: ["fred", "flintstone", 123.45, true, null]}

The "metadata" format:

{ field1: { name: "first", type: "string", value: "fred"},
  field2: { name: "last", type: "string", value: "flintstone"},
  field3: { name: "balance", type: "money", value: 123.45},
  field4: { name: "is vip", type: "boolean", value: true}, ...
}

I'm not making these up, I've seen them used in practice. Unless the schema is 
very expressive, it probably can't handle these, which is why some code will be 
needed (in Spark/Hive or in a Drill plugin of some kind).
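
To make the "meta-parser" idea concrete, here is a minimal sketch of the sort of 
transform I mean, for the metadata format (plain Java, not actual Drill plugin 
code; the class and method names are invented, and it assumes each record has 
already been decoded into ordinary Java maps). The "tuple as an array" format 
would need a similar transform, driven by a user-supplied list of column names.

import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of a "meta-parser" step: flatten the metadata format
// { field1: {name, type, value}, field2: ... } into a plain record
// keyed by the embedded "name", before handing it to a row writer.
public class MetadataFlattener {

  // Input: one decoded record, e.g. field1 -> {name=..., type=..., value=...}
  public static Map<String, Object> flatten(Map<String, Map<String, Object>> record) {
    Map<String, Object> row = new LinkedHashMap<>();
    for (Map<String, Object> field : record.values()) {
      String name = (String) field.get("name");
      String type = (String) field.get("type");
      row.put(name, convert(type, field.get("value")));
    }
    return row;
  }

  // Use the embedded "type" code to decide how to materialize the value.
  private static Object convert(String type, Object value) {
    if (value == null) {
      return null;
    }
    switch (type) {
      case "money":   return new java.math.BigDecimal(value.toString());
      case "boolean": return Boolean.valueOf(value.toString());
      case "string":
      default:        return value.toString();     // best effort for unknown types
    }
  }
}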

Charles Givre makes a very good point: he suggests that Drill's unique 
opportunity is to handle such odd files clearly, avoiding the need for ETL. 
That is, rather than thinking of Drill as a junior version of Impala (read one 
format really, really well), think of it as the open source version of Splunk 
(read all formats via adapters.)

Thanks,
- Paul

 


msgpack handling lists with elements of different types

2018-10-17 Thread Jean-Claude Cote
I'm writing a msgpack reader and have encountered datasets where an array
contains different types, for example a VARCHAR and a BINARY. Turns out the
BINARY is actually a string. I know this is probably just not modeled
correctly in the first place, but I'm still going to modify the reading of
lists so that it takes note of the first element in the list and tries to
coerce subsequent elements that are not of the same type.

{
"column": [["name", \\0xAA\\0xBB],["surname", \\0xAA\\0xBB]]
}
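
Roughly what I have in mind for the list case, as a sketch in plain Java rather
than the actual Drill/msgpack reader code (the names are made up, and it assumes
a binary payload appearing in a string-typed list is UTF-8 text):

import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Sketch of the coercion rule: note the type of the first element of a
// list and coerce the remaining elements to it, e.g. BINARY -> VARCHAR
// when the list starts with a string.
public class ListCoercion {

  public static List<Object> coerceToFirstType(List<Object> values) {
    if (values.isEmpty() || values.get(0) == null) {
      return values;                               // nothing to coerce to
    }
    Class<?> target = values.get(0).getClass();
    List<Object> out = new ArrayList<>(values.size());
    for (Object v : values) {
      out.add(coerce(v, target));
    }
    return out;
  }

  private static Object coerce(Object v, Class<?> target) {
    if (v == null || target.isInstance(v)) {
      return v;                                    // already the expected type
    }
    if (target == String.class && v instanceof byte[]) {
      return new String((byte[]) v, StandardCharsets.UTF_8);  // assumes the bytes are UTF-8 text
    }
    if (target == String.class) {
      return v.toString();                         // numbers, booleans, etc.
    }
    throw new IllegalArgumentException(
        "Cannot coerce " + v.getClass().getSimpleName() + " to " + target.getSimpleName());
  }
}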

However, I have another scenario where it's actually the field of a map
that changes type:
{
  "column": [
    {
      "dataType": 1,
      "value": 19
    },
    {
      "dataType": 5,
      "value": "string data"
    }
  ]
}

When reading such a structure, a BigInt writer is used to write out the
value of the first map, but the same BigInt writer is used for the value field
of the second map. I understand that Drill will represent the "value" field
in a BigInt vector in memory.

My question is how best to address situations like this one. What
alternatives are there? Read the value type as ANY? This structure is deeply
nested; should I provide a means to ignore elements at a certain depth? Is it
even possible to handle these situations gracefully? Is this a situation
where a schema would be helpful in determining what to do with fields that
are problematic?
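
For concreteness, the kind of dispatch I'm picturing, sketched in plain Java
rather than against Drill's writer classes (the names are invented, and the
mapping of 1 to integer and 5 to string is only a guess from my sample): use the
element's own dataType code and materialize every value as text, so that one
writer serves the whole column.

import java.util.Map;

// Sketch: dispatch on the element's own "dataType" code and materialize
// every "value" as text, so that a single VARCHAR writer can handle the
// column even though the underlying type changes from record to record.
public class TypeCodeReader {

  public static String readValueAsText(Map<String, Object> element) {
    Object value = element.get("value");
    if (value == null) {
      return null;
    }
    int dataType = ((Number) element.get("dataType")).intValue();
    switch (dataType) {
      case 1:  return String.valueOf(((Number) value).longValue()); // guessed: integer payload
      case 5:  return (String) value;                                // guessed: string payload
      default: return value.toString();                              // unknown code: best effort
    }
  }
}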

Thank you
jc