Github user paul-rogers commented on the issue:
https://github.com/apache/drill/pull/518
Further testing revealed limitations in the underlying Jackson parser. That
parser will not recover from other errors such as:
```
{ a: }
```
See
Github user parthchandra commented on the issue:
https://github.com/apache/drill/pull/518
+1. Looks like there has been enough review and there is good enough reason
to merge this in
---
If your project is set up for it, you can reply to this email and have your
reply appear on
Github user kfaraaz commented on the issue:
https://github.com/apache/drill/pull/518
The below JSON is invalid, due to presence of duplicate key 'key'. Today
Drill returns a DATA_READ error, does your proposed fix handle this case too ?
[root@centos-01 ~]# cat f1.json
Github user ssriniva123 commented on the issue:
https://github.com/apache/drill/pull/518
I have a similar data set checked in. Hope that is good enough.
On Fri, Oct 7, 2016 at 10:00 AM, Paul Rogers
wrote:
> Base data set:
>
Github user paul-rogers commented on the issue:
https://github.com/apache/drill/pull/518
Base data set:
{ "p": 1, "a": { "x": 10, "y": 20, "z": 30 }, "b": 50, "c": 60 }
{ "p": 2, "a": { "x": 11, "y": 21, "z": 31 }, "b": 51, "c": 61 }
{ "p": 3, "a": { "x":
Github user zfong commented on the issue:
https://github.com/apache/drill/pull/518
@paul-rogers - since you had concerns about particular test cases, and it
looks like you've confirmed that those are non-issues, would it make sense for
you to share those with @ssriniva123 and he can
Github user paul-rogers commented on the issue:
https://github.com/apache/drill/pull/518
Ran some tests. The results look good. In particular, files with nested
structures produced the correct results. Since it was the nested structure case
that had me a bit worried, looks like the
Github user ssriniva123 commented on the issue:
https://github.com/apache/drill/pull/518
Paul,
The code you have listed is semantically equivalent to that of what I
already I have submitted for pull and will not solve handling of all malformed
json records. Also the code for
Github user paul-rogers commented on the issue:
https://github.com/apache/drill/pull/518
The open question was how we can discard a partly-built record during
recovery. As far as I can tell (veterans, please correct me), the
JSONRecordReader keeps track of the record count. So, all
Github user paul-rogers commented on the issue:
https://github.com/apache/drill/pull/518
As it turns out, the sample code shown was actually tested with a stock
Jackson JSON parser: it does work. No parser changes are needed.
The issue is not whether we can make the parser do
Github user ssriniva123 commented on the issue:
https://github.com/apache/drill/pull/518
There is not much to do except change the JSON parser to support this
functionality.
- Indicate to the parser that a current record terminates when it
encounters a \n (Of course
this
Github user paul-rogers commented on the issue:
https://github.com/apache/drill/pull/518
Looks like you are right; the JsonParser is more than a simple tokenizer.
We're not the first to try this:
Github user ssriniva123 commented on the issue:
https://github.com/apache/drill/pull/518
Apologize for getting back on this thread late, got tied up with some
issues@work.
Paul,
The json parser is not just a tokenizer, it keeps track of the JSON
structure and understands
Github user paul-rogers commented on the issue:
https://github.com/apache/drill/pull/518
Poking around in the code a bit more, it looks like we rely on the Jackson
JsonParser which delivers a stream of tokens. Here's the code in JsonReader:
public ReadState
Github user ssriniva123 commented on the issue:
https://github.com/apache/drill/pull/518
Paul,
Thanks for taking the time out in writing out a detailed email. Here are
some of my thoughts.
- Drill uses the com.fasterxml.jackson.core.json.UTF8StreamJsonParser for
parsing
Github user paul-rogers commented on the issue:
https://github.com/apache/drill/pull/518
Upon reflection, it seems that newline is not an adequate marker to
separate JSON records. Many of our samples have internal newlines. If a newline
appears inside the JSON record, then we are
Github user paul-rogers commented on the issue:
https://github.com/apache/drill/pull/518
Thanks much for your contribution! Sorry it is taking a while to review the
code.
I've read over the code changes a couple of times. The options part is
fine. However, the parser part
Github user ssriniva123 commented on the issue:
https://github.com/apache/drill/pull/518
Folks,
I have tests on my local drill setup with various combinations of invalid
json format. The only caveat is that some records may be skipped if JSON is not
properly delimitted by }. Can
Thanks Jason.
I am in the process of doing more unit testing, corner case testing etc
before submitting for final review.
Thanks
Subbu
On Mon, Aug 22, 2016 at 9:51 AM, jaltekruse wrote:
> Github user jaltekruse commented on the issue:
>
>
Github user jaltekruse commented on the issue:
https://github.com/apache/drill/pull/518
After this was discussed at the Hangout a few weeks back, I had been
thinking about it more.
My initial request was for warnings to be returned along with query
results. Some initial work
Hey Parth and Subbu,
Sorry for missing the last message on this thread, I will be able to attend
the hangout tomorrow to discuss my concern. As I had said previously, I am
mostly trying to make sure that there is agreement about the impact of this
change on user behavior and expectations before
Hi Subbu,
Yes we can discuss this on the next hangout. If Jason is able to attend
we can discuss some way to address his concern.
Parth
On Wed, Aug 3, 2016 at 10:24 AM, Subbu Srinivasan
wrote:
> Hi Folks,
> When can we discuss this feature? Would next hangout be
Hi Folks,
When can we discuss this feature? Would next hangout be appropriate?
Thanks
Subbu
On Mon, Jul 25, 2016 at 10:20 AM, Subbu Srinivasan
wrote:
> This mechanism falls in line with other JSON processing similar to serde's
> with Hive, UDF's enabled at global level
This mechanism falls in line with other JSON processing similar to serde's
with Hive, UDF's enabled at global level will apply to all users and is
outlined using documentation.
What is your stance if we move to the JSONFormatPlugin?
On Fri, Jul 15, 2016 at 2:08 PM, jaltekruse
Github user jaltekruse commented on the issue:
https://github.com/apache/drill/pull/518
I don't think we should merge this without a mechanism to return a warning
to the user to tell them at least that some data was ignored, and ideally some
indication of how much data was discarded.
Github user chunhui-shi commented on the issue:
https://github.com/apache/drill/pull/518
Thanks for providing a way to skip bad json records and addressing my
comments.
Besides the minor improvements I suggested above: comment format changes
that may be due to line width in your
Github user jinfengni commented on the issue:
https://github.com/apache/drill/pull/518
@chunhui-shi , I saw you made comments days ago. Can you pls take a look at
the new patch to see if it addressed your comment? thx.
---
If your project is set up for it, you can reply to
Github user ssriniva123 commented on the issue:
https://github.com/apache/drill/pull/518
I have made several modifications to get accurate line nos:
- The main change is to move the json parser to the end of the current
record being processed since
previously multiple
Github user ssriniva123 commented on the issue:
https://github.com/apache/drill/pull/518
What is the recommendation with this bug? Should we just remove the line
nos for now and just name the file name. I was not able to work on this last
couple of days, not sure if there is
Github user amansinha100 commented on the issue:
https://github.com/apache/drill/pull/518
hmm.. I still see 4 commits in the pull request. Can you squash them into
one ? (let me know if you need help with that). Also, the commit message needs
to be in the format: "DRILL-4653:
Github user amansinha100 commented on the issue:
https://github.com/apache/drill/pull/518
Looks much better. Sorry for the nitpick but I still have a couple more
related to the coding conventions. :)Also, could you squash the commits
into 1 and use the DRILL-: format for
Github user ssriniva123 commented on the issue:
https://github.com/apache/drill/pull/518
I have made changes are recommended by reviewers:
- Changed JSON_READER_SKIP_INVALID_RECORDS_FLAG constant
- Modified unit test to use builder framework
- Code indendation changes.
Github user ssriniva123 commented on the issue:
https://github.com/apache/drill/pull/518
I updated the pull request with more changes requested by the reviewers.
On Wed, Jun 15, 2016 at 5:28 PM, Aman Sinha
wrote:
> Yes, it does in fact
Github user ssriniva123 commented on the issue:
https://github.com/apache/drill/pull/518
For some reason the build terminated- What needs to be done to fire it
again?
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If
Github user ssriniva123 commented on the issue:
https://github.com/apache/drill/pull/518
I have added support so that it prints the offending file/line nos.
Error parsing JSON in DRILL-4653.json : line nos :13
Error parsing JSON in DRILL-4653.json : line nos :14
---
If
Github user ssriniva123 commented on the issue:
https://github.com/apache/drill/pull/518
Yes, I looked at DRILL-3764. The changes made to JSONProcessing do not
conflict with my changes.
The only conflict I see is these two constants, I have defined my own.
String
Github user amansinha100 commented on the issue:
https://github.com/apache/drill/pull/518
Yes, it does in fact have a conflict with DRILL-3764 which has changes to
the JsonRecordReader, although this issue is still in progress. I noticed
that @adeneche mentioned this in the JIRA.
Github user parthchandra commented on the issue:
https://github.com/apache/drill/pull/518
Does this conflict with Drill 3764 (see
https://github.com/apache/drill/pull/497)?. I think there was a broader
discussion around this as part of enabling the functionality of
skipping/logging
Github user ssriniva123 commented on the issue:
https://github.com/apache/drill/pull/518
Do u anticipate this to be logged using log4j?
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this
Github user amansinha100 commented on the issue:
https://github.com/apache/drill/pull/518
A high level comment I have about this PR is that keeping the number of
'malformed' records in the json file may not be sufficient..I would expect the
user would want to know the file name and
40 matches
Mail list logo