[jira] [Commented] (DRILL-4653) Malformed JSON should not stop the entire query from progressing

2017-11-11 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16248656#comment-16248656
 ] 

ASF GitHub Bot commented on DRILL-4653:
---

Github user paul-rogers commented on the issue:

https://github.com/apache/drill/pull/518
  
Further testing revealed limitations in the underlying Jackson parser. That 
parser will not recover from other errors such as:
```
{ a: }
```
See [DRILL-5953](https://issues.apache.org/jira/browse/DRILL-5953) for 
details.


> Malformed JSON should not stop the entire query from progressing
> ----------------------------------------------------------------
>
> Key: DRILL-4653
> URL: https://issues.apache.org/jira/browse/DRILL-4653
> Project: Apache Drill
> Issue Type: Improvement
> Components: Storage - JSON
> Affects Versions: 1.6.0
> Reporter: subbu srinivasan
>
> Currently, a Drill query terminates upon the first encounter of an invalid JSON line.
> Drill has to continue progressing after ignoring the bad records. Something 
> similar to a setting of (ignore.malformed.json) would help.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DRILL-4653) Malformed JSON should not stop the entire query from progressing

2016-11-10 Thread Kunal Khatua (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15655750#comment-15655750
 ] 

Kunal Khatua commented on DRILL-4653:
-

[~ssriniva123], while the feature is disabled by default, we should mark this issue as 
resolved only if the tests pass with the feature enabled.
 
[~khfaraaz] Please reopen this bug if the FAIL case would qualify as a blocker 
for closing this bug, so that we are tracking this correctly. 


> Malformed JSON should not stop the entire query from progressing
> 
>
> Key: DRILL-4653
> URL: https://issues.apache.org/jira/browse/DRILL-4653
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - JSON
>Affects Versions: 1.6.0
>Reporter: subbu srinivasan
> Fix For: 1.9.0
>
>
> Currently, a Drill query terminates upon the first encounter of an invalid JSON line.
> Drill has to continue progressing after ignoring the bad records. Something 
> similar to a setting of (ignore.malformed.json) would help.





[jira] [Commented] (DRILL-4653) Malformed JSON should not stop the entire query from progressing

2016-11-07 Thread Subbu Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15645298#comment-15645298
 ] 

Subbu Srinivasan commented on DRILL-4653:
-

Will look at those JSON issues shortly.


On Thu, Nov 3, 2016 at 11:51 AM, Khurram Faraaz (JIRA) 





[jira] [Commented] (DRILL-4653) Malformed JSON should not stop the entire query from progressing

2016-11-03 Thread Khurram Faraaz (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15633817#comment-15633817
 ] 

Khurram Faraaz commented on DRILL-4653:
---

What about the other cases that still FAIL even with 
`store.json.reader.skip_invalid_records` = true?



[jira] [Commented] (DRILL-4653) Malformed JSON should not stop the entire query from progressing

2016-11-02 Thread Subbu Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15630420#comment-15630420
 ] 

Subbu Srinivasan commented on DRILL-4653:
-

No. The default mode has to be off; this was the consensus of the community
during discussions.


On Wed, Nov 2, 2016 at 12:03 PM, Khurram Faraaz (JIRA) 





[jira] [Commented] (DRILL-4653) Malformed JSON should not stop the entire query from progressing

2016-11-02 Thread Khurram Faraaz (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15630075#comment-15630075
 ] 

Khurram Faraaz commented on DRILL-4653:
---

I don't think this is fixed; there are still some cases that need to be taken 
care of. Please see below.
Also, more importantly, this check for malformed JSON should be ON (enabled) by 
default in Drill. Users would rather ignore bad records than see an 
Exception/Error and then have our support suggest that they enable 
skip_invalid_records. I believe this should be ON by default in Drill.

[test@cent01 drill_4653]# cat badjson_01.json
{"key":"test string"}
{"key":"foo"}
{"key":"foobar"
{"key":"blah"}
{"key":"temp"}

[test@cent01 drill_4653]# cat badjson_02.json
{
"key":"foo",
"badarray":[1,3,4,5,6,7,8,,
"key":"test string",
"key":"foobar"
}
[test@cent01 drill_4653]#

[test@cent01 drill_4653]# cat badjson_03.json
{
"key":"foo",
"key":"foobar",
"key":"test string",
"key":"string",
"key":
}
[test@cent01 drill_4653]#

[test@cent01 drill_4653]# cat badjson_04.json
{"key":1}
{"key":2}
{"key":3}
{"key":
[test@cent01 drill_4653]

[test@cent01 drill_4653]# cat badjson_05.json
{
"key1":"foobar",
"key2":[1,3,4,5,6,7,8,9],
"key3":{ "key4":},
"key5":"foo"
}
[test@cent01 drill_4653]

[test@cent01 drill_4653]# cat badjson_06.json
{
"name":"John Doe",
"age":33,
"dept":"IT",
"address":{
  "street":"some street",
  "city":"some city",
  "zip":
  }
"isManager":"yes"
}
[test@cent01 drill_4653]

{noformat}
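0: jdbc:drill:schema=dfs.tmp> alter session set `store.json.reader.skip_invalid_records`=true;
+-------+--------------------------------------------------+
|  ok   |                     summary                      |
+-------+--------------------------------------------------+
| true  | store.json.reader.skip_invalid_records updated.  |
+-------+--------------------------------------------------+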
1 row selected (0.334 seconds)
{noformat}
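
For readers who drive Drill from code rather than sqlline, the same session option could be set over JDBC. The following is only a sketch: the class and method names are hypothetical, and obtaining the `Connection` (driver, URL) is assumed to be handled by the caller. Only the option name and the ALTER SESSION statement come from the transcript above.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

/** Sketch: toggling store.json.reader.skip_invalid_records for one JDBC session. */
final class SkipInvalidRecordsOption {

    /** Builds the ALTER SESSION statement used in the sqlline transcript. */
    static String alterSessionSql(boolean enabled) {
        return "alter session set `store.json.reader.skip_invalid_records` = " + enabled;
    }

    /**
     * Applies the option on an already-open connection. How the connection is
     * obtained is deployment-specific and deliberately left to the caller.
     */
    static void apply(Connection connection, boolean enabled) throws SQLException {
        try (Statement statement = connection.createStatement()) {
            statement.execute(alterSessionSql(enabled));
        }
    }
}
```

Because the option is session-scoped here, it would need to be applied on each new connection before querying files that may contain bad records.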

{noformat}
0: jdbc:drill:schema=dfs.tmp> select key from `badjson_01.json`;
+--------------+
|     key      |
+--------------+
| test string  |
| foo          |
| temp         |
+--------------+
3 rows selected (0.466 seconds)
{noformat}

{noformat}
0: jdbc:drill:schema=dfs.tmp> select * from `badjson_01.json`;
+--------------+
|     key      |
+--------------+
| test string  |
| foo          |
| temp         |
+--------------+
3 rows selected (0.222 seconds)
{noformat}

{noformat}
0: jdbc:drill:schema=dfs.tmp> select * from `badjson_02.json`;
Error: DATA_READ ERROR: Unexpected character (',' (code 44)): expected a valid 
value (number, String, array, object, 'true', 'false' or 'null')
 at [Source: org.apache.drill.exec.store.dfs.DrillFSDataInputStream@3e4c2712; 
line: 3, column: 32]

Line  3
Column  33
Field  badarray
Fragment 0:0

[Error Id: 6da211b5-a287-4239-82b4-26a35e47ed10 on centos-01.qa.lab:31010] 
(state=,code=0)
{noformat}

Stack trace from drillbit.log for above failure
{noformat}
org.apache.drill.common.exceptions.UserException: DATA_READ ERROR: Unexpected character (',' (code 44)): expected a valid value (number, String, array, object, 'true', 'false' or 'null')
 at [Source: org.apache.drill.exec.store.dfs.DrillFSDataInputStream@3e4c2712; line: 3, column: 32]

Line  3
Column  33
Field  badarray

[Error Id: 6da211b5-a287-4239-82b4-26a35e47ed10 ]
    at org.apache.drill.common.exceptions.UserException$Builder.build(UserException.java:543) ~[drill-common-1.9.0-SNAPSHOT.jar:1.9.0-SNAPSHOT]
    at org.apache.drill.exec.vector.complex.fn.JsonReader.writeData(JsonReader.java:586) [drill-java-exec-1.9.0-SNAPSHOT.jar:1.9.0-SNAPSHOT]
    at org.apache.drill.exec.vector.complex.fn.JsonReader.writeData(JsonReader.java:372) [drill-java-exec-1.9.0-SNAPSHOT.jar:1.9.0-SNAPSHOT]
    at org.apache.drill.exec.vector.complex.fn.JsonReader.writeDataSwitch(JsonReader.java:306) [drill-java-exec-1.9.0-SNAPSHOT.jar:1.9.0-SNAPSHOT]
    at org.apache.drill.exec.vector.complex.fn.JsonReader.writeToVector(JsonReader.java:247) [drill-java-exec-1.9.0-SNAPSHOT.jar:1.9.0-SNAPSHOT]
    at org.apache.drill.exec.vector.complex.fn.JsonReader.write(JsonReader.java:202) [drill-java-exec-1.9.0-SNAPSHOT.jar:1.9.0-SNAPSHOT]
    at org.apache.drill.exec.store.easy.json.JSONRecordReader.next(JSONRecordReader.java:206) [drill-java-exec-1.9.0-SNAPSHOT.jar:1.9.0-SNAPSHOT]
    at org.apache.drill.exec.physical.impl.ScanBatch.next(ScanBatch.java:178) [drill-java-exec-1.9.0-SNAPSHOT.jar:1.9.0-SNAPSHOT]
    at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:119) [drill-java-exec-1.9.0-SNAPSHOT.jar:1.9.0-SNAPSHOT]
    at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:109) [drill-java-exec-1.9.0-SNAPSHOT.jar:1.9.0-SNAPSHOT]

[jira] [Commented] (DRILL-4653) Malformed JSON should not stop the entire query from progressing

2016-11-02 Thread Subbu Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15629493#comment-15629493
 ] 

Subbu Srinivasan commented on DRILL-4653:
-

Did you set store.json.reader.skip_invalid_records to true before running
your tests?

On Wed, Nov 2, 2016 at 4:30 AM, Khurram Faraaz (JIRA) 





[jira] [Commented] (DRILL-4653) Malformed JSON should not stop the entire query from progressing

2016-11-02 Thread Khurram Faraaz (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15628664#comment-15628664
 ] 

Khurram Faraaz commented on DRILL-4653:
---

[~kkhatua] I tried these tests with malformed JSON, on Drill 1.9.0 git commit 
ID : 83513daf
[~ssriniva123] Is this the expected behavior ?

[test@cent01 drill_4653]# cat badjson_01.json
{"key":"test string"}
{"key":"foo"}
{"key":"foobar"
{"key":"blah"}
{"key":"temp"}

{noformat}
0: jdbc:drill:schema=dfs.tmp> select * from `badjson_01.json`;
Error: DATA_READ ERROR: Error parsing JSON - Unexpected character ('{' (code 
123)): was expecting comma to separate OBJECT entries

File  /tmp/badjson_01.json
Record  3
Column  2
Fragment 0:0

[Error Id: 76e0cc69-229b-40b7-93fd-9ca9f6a22473 on centos-01.qa.lab:31010] 
(state=,code=0)
{noformat}

{noformat}
0: jdbc:drill:schema=dfs.tmp> select key from `badjson_01.json`;
Error: DATA_READ ERROR: Error parsing JSON - Unexpected character ('{' (code 
123)): was expecting comma to separate OBJECT entries

File  /tmp/badjson_01.json
Record  3
Column  2
Fragment 0:0

[Error Id: 9918e669-1638-44f8-a4e1-ffa33b5ef830 on centos-01.qa.lab:31010] 
(state=,code=0)
{noformat}

case (2)

[test@cent01 drill_4653]# cat badjson_02.json
{
"key":"foo",
"badarray":[1,3,4,5,6,7,8,,
"key":"test string",
"key":"foobar"
}
[test@cent01 drill_4653]#

{noformat}
0: jdbc:drill:schema=dfs.tmp> select * from `badjson_02.json`;
Error: DATA_READ ERROR: Unexpected character (',' (code 44)): expected a valid 
value (number, String, array, object, 'true', 'false' or 'null')
 at [Source: org.apache.drill.exec.store.dfs.DrillFSDataInputStream@2380dfc9; 
line: 3, column: 32]

Line  3
Column  33
Field  badarray
Fragment 0:0

[Error Id: a25159c7-7770-4a1d-870c-dd479dd01a7d on centos-01.qa.lab:31010] 
(state=,code=0)
{noformat}

{noformat}
0: jdbc:drill:schema=dfs.tmp> select key from `badjson_02.json`;
Error: DATA_READ ERROR: Error parsing JSON - Unexpected character (',' (code 
44)): expected a valid value (number, String, array, object, 'true', 'false' or 
'null')

File  /tmp/badjson_02.json
Record  1
Column  32
Fragment 0:0

[Error Id: 2ae443eb-7fc2-4648-b9ea-1742d23932ae on centos-01.qa.lab:31010] 
(state=,code=0)
{noformat}

case (3)

[test@cent01 drill_4653]# cat badjson_03.json
{
"key":"foo",
"key":"foobar",
"key":"test string",
"key":"string",
"key":
}
[test@cent01 drill_4653]#

{noformat}
0: jdbc:drill:schema=dfs.tmp> select key from `badjson_03.json`;
Error: DATA_READ ERROR: Error parsing JSON - Unexpected character ('}' (code 
125)): expected a value

File  /tmp/badjson_03.json
Record  1
Column  2
Fragment 0:0

[Error Id: 39d94490-5186-46d9-9631-94ec32d3094e on centos-01.qa.lab:31010] 
(state=,code=0)
{noformat}

{noformat}
0: jdbc:drill:schema=dfs.tmp> select key from `badjson_03.json` where key 
='foobar';
Error: DATA_READ ERROR: Error parsing JSON - Unexpected character ('}' (code 
125)): expected a value

File  /tmp/badjson_03.json
Record  1
Column  2
Fragment 0:0

[Error Id: b20cd289-2c7b-41ec-b18c-6941205c4d1d on centos-01.qa.lab:31010] 
(state=,code=0)
{noformat}

case (4)

[test@cent01 drill_4653]# cat badjson_04.json
{"key":1}
{"key":2}
{"key":3}
{"key":
[test@cent01 drill_4653]

{noformat}
0: jdbc:drill:schema=dfs.tmp> select key from `badjson_04.json`;
Error: DATA_READ ERROR: Error parsing JSON - Unexpected end-of-input 
within/between OBJECT entries

File  /tmp/badjson_04.json
Record  4
Column  39
Fragment 0:0

[Error Id: 9cd4d9a8-5871-4eaa-a68e-c6eab3bf2e41 on centos-01.qa.lab:31010] 
(state=,code=0)
{noformat}

case (5)

[test@cent01 drill_4653]# cat badjson_05.json
{
"key1":"foobar",
"key2":[1,3,4,5,6,7,8,9],
"key3":{ "key4":},
"key5":"foo"
}
[test@cent01 drill_4653]

{noformat}
0: jdbc:drill:schema=dfs.tmp> select * from `badjson_05.json`;
Error: DATA_READ ERROR: Error parsing JSON - Unexpected character ('}' (code 
125)): expected a value

File  /tmp/badjson_05.json
Record  1
Column  22
Fragment 0:0

[Error Id: 18be71b9-bb58-4cd5-9e74-3c19ab282cfd on centos-01.qa.lab:31010] 
(state=,code=0)
{noformat}

case (6)

[test@cent01 drill_4653]# cat badjson_06.json
{
"name":"John Doe",
"age":33,
"dept":"IT",
"address":{
  "street":"some street",
  "city":"some city",
  "zip":
  }
"isManager":"yes"
}
[test@cent01 drill_4653]

{noformat}
0: jdbc:drill:schema=dfs.tmp> select * from `badjson_06.json`;
Error: DATA_READ ERROR: Error parsing JSON - Unexpected character ('}' (code 
125)): expected a value

File  /tmp/badjson_06.json
Record  1
Column  16
Fragment 0:0

[Error Id: bf83ea0e-d708-4c3b-b50c-6923fc17c6b6 on centos-01.qa.lab:31010] 
(state=,code=0)
{noformat}



[jira] [Commented] (DRILL-4653) Malformed JSON should not stop the entire query from progressing

2016-11-01 Thread Subbu Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15627009#comment-15627009
 ] 

Subbu Srinivasan commented on DRILL-4653:
-

Yes will do.






[jira] [Commented] (DRILL-4653) Malformed JSON should not stop the entire query from progressing

2016-10-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587057#comment-15587057
 ] 

ASF GitHub Bot commented on DRILL-4653:
---

Github user asfgit closed the pull request at:

https://github.com/apache/drill/pull/518


> Malformed JSON should not stop the entire query from progressing
> 
>
> Key: DRILL-4653
> URL: https://issues.apache.org/jira/browse/DRILL-4653
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - JSON
>Affects Versions: 1.6.0
>Reporter: subbu srinivasan
> Fix For: Future
>
>
> Currently, a Drill query terminates upon the first encounter of an invalid JSON line.
> Drill has to continue progressing after ignoring the bad records. Something 
> similar to a setting of (ignore.malformed.json) would help.





[jira] [Commented] (DRILL-4653) Malformed JSON should not stop the entire query from progressing

2016-10-14 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15577165#comment-15577165
 ] 

ASF GitHub Bot commented on DRILL-4653:
---

Github user parthchandra commented on the issue:

https://github.com/apache/drill/pull/518
  
+1. Looks like there has been enough review and there is good enough reason 
to merge this in




[jira] [Commented] (DRILL-4653) Malformed JSON should not stop the entire query from progressing

2016-10-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15557355#comment-15557355
 ] 

ASF GitHub Bot commented on DRILL-4653:
---

Github user kfaraaz commented on the issue:

https://github.com/apache/drill/pull/518
  
The JSON below is invalid due to the presence of the duplicate key 'key'. Today 
Drill returns a DATA_READ error; does your proposed fix handle this case too?
[root@centos-01 ~]# cat f1.json
{"key":"string", "key":123, "key":[1,2,3], "key":true, "key":false, 
"key":null, "key":{"key2":"b"}}

Error returned by Drill 1.9.0

0: jdbc:drill:schema=dfs.tmp> select * from `f1.json`;
Error: DATA_READ ERROR: Error parsing JSON - You tried to write a BigInt 
type when you are using a ValueWriter of type NullableVarCharWriterImpl.

File  /tmp/f1.json
Record  1
Fragment 0:0

[Error Id: 06411bc5-2d59-4681-a84f-3f49086e18c0 on centos-01.qa.lab:31010] 
(state=,code=0)




[jira] [Commented] (DRILL-4653) Malformed JSON should not stop the entire query from progressing

2016-10-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1939#comment-1939
 ] 

ASF GitHub Bot commented on DRILL-4653:
---

Github user ssriniva123 commented on the issue:

https://github.com/apache/drill/pull/518
  
I have a similar data set checked in. Hope that is good enough.


On Fri, Oct 7, 2016 at 10:00 AM, Paul Rogers 
wrote:

> Base data set:
>
> { "p": 1, "a": { "x": 10, "y": 20, "z": 30 }, "b": 50, "c": 60 }
> { "p": 2, "a": { "x": 11, "y": 21, "z": 31 }, "b": 51, "c": 61 }
> { "p": 3, "a": { "x": 12, "y": 22, "z": 32 }, "b": 52, "c": 62 }
> { "p": 4, "a": { "x": 13, "y": 23, "z": 33 }, "b": 53, "c": 63 }
> { "p": 5, "a": { "x": 14, "y": 24, "z": 34 }, "b": 54, "c": 64 }
> { "p": 6, "a": { "x": 15, "y": 25, "z": 35 }, "b": 55, "c": 65 }
>
> Create various errors:
>
> { "p": 2x, "a": { "x": 11, "y": 21, "z": 31 }, "b": 51, "c": 61 }
>
> and
>
> { "p": 2, "a": { "x": 11x, "y": 21, "z": 31 }, "b": 51, "c": 61 }
>
> And so on.
>
> In running the tests, I consistently saw that only the second (bad) row
> was omitted. Other rows properly appeared, and no partial row appeared.




[jira] [Commented] (DRILL-4653) Malformed JSON should not stop the entire query from progressing

2016-10-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1635#comment-1635
 ] 

ASF GitHub Bot commented on DRILL-4653:
---

Github user paul-rogers commented on the issue:

https://github.com/apache/drill/pull/518
  
Base data set:

{ "p": 1, "a": { "x": 10, "y": 20, "z": 30 }, "b": 50, "c": 60 }
{ "p": 2, "a": { "x": 11, "y": 21, "z": 31 }, "b": 51, "c": 61 }
{ "p": 3, "a": { "x": 12, "y": 22, "z": 32 }, "b": 52, "c": 62 }
{ "p": 4, "a": { "x": 13, "y": 23, "z": 33 }, "b": 53, "c": 63 }
{ "p": 5, "a": { "x": 14, "y": 24, "z": 34 }, "b": 54, "c": 64 }
{ "p": 6, "a": { "x": 15, "y": 25, "z": 35 }, "b": 55, "c": 65 }

Create various errors:

{ "p": 2x, "a": { "x": 11, "y": 21, "z": 31 }, "b": 51, "c": 61 }

and

{ "p": 2, "a": { "x": 11x, "y": 21, "z": 31 }, "b": 51, "c": 61 }

And so on.

In running the tests, I consistently saw that only the second (bad) row was 
omitted. Other rows properly appeared, and no partial row appeared.
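
The test data described above can be generated rather than typed by hand. The sketch below is an illustration only: `MalformedJsonFixture`, `baseRows`, and `corruptNumber` are hypothetical names, and the corruption it applies (appending a stray `x` to a numeric value) mirrors just the two error examples shown, not every error that was tried.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

/** Sketch: builds the six-row base data set and injects a bad number into one row. */
final class MalformedJsonFixture {

    /** Returns the six base rows, one JSON object per line. */
    static List<String> baseRows() {
        List<String> rows = new ArrayList<>();
        for (int p = 1; p <= 6; p++) {
            rows.add(String.format(Locale.ROOT,
                "{ \"p\": %d, \"a\": { \"x\": %d, \"y\": %d, \"z\": %d }, \"b\": %d, \"c\": %d }",
                p, 9 + p, 19 + p, 29 + p, 49 + p, 59 + p));
        }
        return rows;
    }

    /**
     * Returns a copy of rows in which the numeric value of the named field in
     * row rowIndex has a stray 'x' appended, for example 2 becomes 2x.
     */
    static List<String> corruptNumber(List<String> rows, int rowIndex, String field) {
        List<String> copy = new ArrayList<>(rows);
        String pattern = "(\"" + Pattern.quote(field) + "\": \\d+)";
        copy.set(rowIndex, copy.get(rowIndex).replaceFirst(pattern, "$1x"));
        return copy;
    }
}
```

Calling `corruptNumber(baseRows(), 1, "p")` yields the first error case, and passing `"x"` instead yields the nested one. Writing the resulting lines to a file gives an input on which only the second row should be dropped.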




[jira] [Commented] (DRILL-4653) Malformed JSON should not stop the entire query from progressing

2016-10-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1218#comment-1218
 ] 

ASF GitHub Bot commented on DRILL-4653:
---

Github user zfong commented on the issue:

https://github.com/apache/drill/pull/518
  
@paul-rogers - since you had concerns about particular test cases, and it 
looks like you've confirmed that those are non-issues, would it make sense for 
you to share those with @ssriniva123 and he can then include them as unit tests 
with this pull request?




[jira] [Commented] (DRILL-4653) Malformed JSON should not stop the entire query from progressing

2016-10-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15554155#comment-15554155
 ] 

ASF GitHub Bot commented on DRILL-4653:
---

Github user paul-rogers commented on the issue:

https://github.com/apache/drill/pull/518
  
Ran some tests. The results look good. In particular, files with nested 
structures produced the correct results. Since it was the nested structure case 
that had me a bit worried, looks like the code is good to go.

+1 (non-binding)




[jira] [Commented] (DRILL-4653) Malformed JSON should not stop the entire query from progressing

2016-09-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15525244#comment-15525244
 ] 

ASF GitHub Bot commented on DRILL-4653:
---

Github user ssriniva123 commented on the issue:

https://github.com/apache/drill/pull/518
  
Paul,
The code you have listed is semantically equivalent to what I have already 
submitted in the pull request, and it will not handle all malformed 
JSON records. Also, the code for reporting the 
error records works correctly as long as the error is reported correctly by the 
parser.

As I explained earlier, the JSON parser is not just a simple tokenizer; it 
keeps track of internal state,
hence the issue. SerDes in Hive and similar systems work because they are record-oriented, 
with clean record demarcations using a newline.

One solution is to submit a patch to the Jackson parser that exposes a method to 
skip to the next newline in the
event of a parsing exception. This can be parameterized so that the behavior can be 
customized.
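
The record-oriented idea (one record per line, resynchronize at the next newline) can be sketched in plain Java. This is not Jackson or Drill code: `LineOrientedSkipper` is a hypothetical name, and `looksBalanced` is a deliberately crude stand-in for a real per-line parser (it counts braces and ignores the possibility of braces inside string values).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Sketch: newline-delimited records, dropping every line a validator rejects. */
final class LineOrientedSkipper {

    /**
     * Returns the lines accepted by isValidRecord. Each line is checked
     * independently, so a bad line cannot leave parser state behind that
     * affects the lines after it.
     */
    static List<String> keepValid(String input, Predicate<String> isValidRecord) {
        List<String> kept = new ArrayList<>();
        for (String line : input.split("\n")) {
            if (line.trim().isEmpty()) {
                continue; // ignore blank lines
            }
            if (isValidRecord.test(line)) {
                kept.add(line);
            }
            // Otherwise: "skip to the next newline" simply means dropping this line.
        }
        return kept;
    }

    /** Hypothetical, crude validity check: braces on the line are balanced. */
    static boolean looksBalanced(String line) {
        int depth = 0;
        for (char c : line.toCharArray()) {
            if (c == '{') depth++;
            if (c == '}') depth--;
            if (depth < 0) return false;
        }
        return depth == 0;
    }
}
```

On input shaped like badjson_01.json, this approach drops only the truncated `foobar` line and keeps the `blah` line that follows it. The trade-off is the assumption that a record never spans more than one line, which multi-line files such as badjson_02.json do not satisfy.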





[jira] [Commented] (DRILL-4653) Malformed JSON should not stop the entire query from progressing

2016-09-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15525212#comment-15525212
 ] 

ASF GitHub Bot commented on DRILL-4653:
---

Github user paul-rogers commented on the issue:

https://github.com/apache/drill/pull/518
  
The open question was how we can discard a partly-built record during 
recovery. As far as I can tell (veterans, please correct me), the 
JSONRecordReader keeps track of the record count. So, all we have to do is not 
increment the count when we want to discard a record. Look in

JSONRecordReader.next( )
  ...
  outside: while(recordCount < DEFAULT_ROWS_PER_BATCH) {
    writer.setPosition(recordCount); // Sets the position for the next read.
    write = jsonReader.write(writer); // Write the record. We can catch errors
                                      // and recover here??
    ...
    recordCount++; // Don't do this on a bad record
  ...
  writer.setValueCount(recordCount); // The record reader controls the record count.

This seems to show the elements of a solution:

1. Try to read the record.
2. If a failure occurs, catch it here and clean up, as in the previous post.
3. Don't increment the record count. We reuse the current one on the next 
record read.

Now the only open question is how we clean up the in-flight record in case 
some columns are not present in the next record. Does anyone know how to set a 
vector position to null (for optional), to a default value (for required), or to 
zero-length (for repeated)?
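
The three steps above can be sketched outside Drill. Everything in this block is hypothetical: `DiscardOnErrorBatch` stands in for the record reader, two nullable `Integer[]` arrays stand in for value vectors, and a comma-separated pair stands in for a JSON record. The point is only the control flow: write at the current position, and on failure clear that position and leave the count unchanged.

```java
import java.util.List;

/** Sketch of "do not advance recordCount on a bad record". Not Drill code. */
final class DiscardOnErrorBatch {
    static final int ROWS_PER_BATCH = 4; // hypothetical batch size

    // Hypothetical stand-ins for two nullable (optional) columns.
    final Integer[] colA = new Integer[ROWS_PER_BATCH];
    final Integer[] colB = new Integer[ROWS_PER_BATCH];
    int recordCount = 0;

    /** Loads "a,b" records; returns how many were discarded as malformed. */
    int load(List<String> rawRecords) {
        int skipped = 0;
        for (String raw : rawRecords) {
            if (recordCount >= ROWS_PER_BATCH) {
                break; // batch is full
            }
            try {
                String[] parts = raw.split(",");
                // The first column may be written successfully...
                colA[recordCount] = Integer.valueOf(parts[0].trim());
                // ...before the second one fails, leaving a partial record.
                colB[recordCount] = Integer.valueOf(parts[1].trim());
                recordCount++; // only a complete record advances the position
            } catch (RuntimeException e) {
                // Clean up the in-flight record: reset the slot to null so a
                // later record that lacks a column does not inherit stale data.
                colA[recordCount] = null;
                colB[recordCount] = null;
                skipped++;
            }
        }
        return skipped;
    }
}
```

Loading `"1,10"`, `"2,2x"`, `"3,30"` leaves two records in the batch, with the third input reusing the position that the bad record briefly occupied. How the equivalent reset is done on real Drill vectors is exactly the open question raised above, and this sketch does not answer it.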




[jira] [Commented] (DRILL-4653) Malformed JSON should not stop the entire query from progressing

2016-09-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15523997#comment-15523997
 ] 

ASF GitHub Bot commented on DRILL-4653:
---

Github user paul-rogers commented on the issue:

https://github.com/apache/drill/pull/518
  
As it turns out, the sample code shown was actually tested with a stock 
Jackson JSON parser: it does work. No parser changes are needed.

The issue is not whether we can make the parser do what is needed: the code 
posted in the comment above demonstrated a solution.

The issue is how we incorporate that code into the JSON parser to clean up 
partial records and prevent schema changes. When I have time, I'll investigate 
that question in greater depth.

IMHO, without a proper fix, we should simply state that Drill does not 
support malformed JSON. If an input file might be incorrect, run it through a 
clean-up step before allowing Drill to scan it. Otherwise, we are opening the 
door to many hard-to-resolve bugs when people ask Drill to scan corrupt JSON: 
the result, without a proper fix, would be undefined, which is worse than the 
current behavior that simply fails the scan with an error.

Let's follow up again after I (or someone) has had a chance to figure out 
if we can undo a partially built record. If we can do that, then we've got a 
path to a clean solution: recover the parser (as shown earlier) and discard the 
in-flight record (as we need to research.)




[jira] [Commented] (DRILL-4653) Malformed JSON should not stop the entire query from progressing

2016-09-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15523852#comment-15523852
 ] 

ASF GitHub Bot commented on DRILL-4653:
---

Github user ssriniva123 commented on the issue:

https://github.com/apache/drill/pull/518
  
There is not much to do except change the JSON parser to support this 
functionality.

- Indicate to the parser that the current record terminates when it 
encounters a \n (of course, this assumes that Drill also aligns to record 
separators using a newline). This would be a change to the Jackson parser.

- Right now the current code works for all the standard cases except one: 
the case where the innermost sub-structure within a JSON record is malformed.

Given that this is a great recipe for approximation algorithms, I am 
requesting this change to be pulled in.

If need be, we can work on the change to the Jackson parser in a separate 
request.





[jira] [Commented] (DRILL-4653) Malformed JSON should not stop the entire query from progressing

2016-09-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15512183#comment-15512183
 ] 

ASF GitHub Bot commented on DRILL-4653:
---

Github user paul-rogers commented on the issue:

https://github.com/apache/drill/pull/518
  
Looks like you are right; the JsonParser is more than a simple tokenizer.

We're not the first to try this: 
http://stackoverflow.com/questions/37511496/recover-from-malformed-json-with-jackson
 (no answer)

I tried an experiment and found that you are on the right track: the way 
you are using the JsonParser can be extended to ignore input until the start of 
the next object. A quick demonstration:

private static void recover(JsonParser parser) throws IOException {
  for ( ; ; ) {
    JsonToken token;
    try {
      token = parser.nextToken();
    } catch( JsonParseException e ) { continue; }  // ignore errors, keep scanning
    if ( token == null ) return;                   // end of input
    if ( token != JsonToken.END_OBJECT ) { continue; }
    token = parser.nextToken();
    if ( token == null ) return;
    if ( token == JsonToken.START_OBJECT ) { return; }  // found the } { pair
  }
}

Basically, we keep reading tokens, and ignoring errors, until we 
successfully find the } { pair.

As we discussed before, to use the above in Drill, we have to discard the 
partly-built record, and start reading the next record assuming the parser is 
positioned **after** the START_OBJECT ("{") token, which we've already 
consumed. That should be simple.

Still, to do proper recovery, we have to discard the partly-built JSON 
record. I've not looked into how to do that. If we don't do that, we return the 
bogus partly-built record. Worse, if we recover by trying to build a new 
record, we create more partly-built records, but with a different schema, 
possibly triggering a schema change event when not really necessary.

Any ideas for how to solve that problem?





[jira] [Commented] (DRILL-4653) Malformed JSON should not stop the entire query from progressing

2016-09-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15505198#comment-15505198
 ] 

ASF GitHub Bot commented on DRILL-4653:
---

Github user ssriniva123 commented on the issue:

https://github.com/apache/drill/pull/518
  
Apologies for getting back to this thread late; I got tied up with some 
issues at work.

Paul,
The JSON parser is not just a tokenizer: it keeps track of the JSON 
structure and understands various aspects of it, such as the root and 
array/object context, and all parsing is done under that context.

- We cannot keep track of {} accurately. For example, the counting JSON 
processor calls parser.skipChildren(), which tries to skip to the end of the 
JSON record, but this can roll over to the next line when there is malformed 
JSON in the bottom-most JSON sub-object; see the example below (missing " in 
the last JSON structure). JsonReader behaves similarly.

{"balance": 1000.0,"num": 100,"is_vip": true,"name": 
"foo3","curr":{"denom":"pound","test":{"value  :false}}}

- One possible solution is to rewind the input source to reset the stream, 
which is not recommended, and there is no guarantee that all streams support 
mark/reset semantics.

Given where we are, I think the proposed solution works well for almost 
all malformed JSON.







[jira] [Commented] (DRILL-4653) Malformed JSON should not stop the entire query from progressing

2016-09-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15488657#comment-15488657
 ] 

ASF GitHub Bot commented on DRILL-4653:
---

Github user paul-rogers commented on the issue:

https://github.com/apache/drill/pull/518
  
Poking around in the code a bit more, it looks like we rely on the Jackson 
JsonParser which delivers a stream of tokens. Here's the code in JsonReader:

public ReadState write(ComplexWriter writer) throws IOException {
  JsonToken t = parser.nextToken();

As you pointed out, white space is consumed internally to the tokenizer. 
But we do get JSON tokens ({, }, identifier, number, etc.) Given this, you can 
scan ahead to find the close-open bracket pattern.

There seem to be two problems to solve. First, how do we handle tokens? The 
answer seems to be a few lines down:

ReadState readState = writeToVector(writer, t);

Given a start token, we write to vector. writeToVector is a 
recursive-descent parser that consumes tokens and does the right thing. Here, 
it accepts only a top-level object, rejecting all other tokens. For an object, 
we call writeDataSwitch (funny name, that) which recursively writes fields, 
objects, lists and the rest.

Since this is a recursive-descent parser, JSON structure is represented by 
the call stack. To bail out of an invalid parse, we have to unwind the stack, 
which is what an exception does. So, that part may be good.

The next step is how to get the tokenizer (JsonParser) pointed at the right 
point so we can try to read the next object. That is where we need to consume 
tokens until we find the end-start bracket pair.

But, note that we have now consumed the start bracket, so we can't read it 
again when we call writeToVector again. Checking the code, the JsonParser has 
no unget( ) method, unfortunately. To work around that, we need to "push" the 
token back onto the input. (Either actually doing so, or by having internal 
state that says that we've already read the open bracket.) 
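
As a rough illustration of the "push the token back" idea, here is a small 
wrapper that adds one token of pushback to any token stream. PushbackTokens 
is a hypothetical helper written against java.util.Iterator, not a Jackson or 
Drill class; a real version would wrap the calls to parser.nextToken().

```java
import java.util.Iterator;

/** Wraps a token stream and allows a single token to be pushed back. */
final class PushbackTokens<T> implements Iterator<T> {
  private final Iterator<T> source;
  private T pushedBack;       // the token handed to unget(), if any
  private boolean hasPushed;  // true while a pushed-back token is pending

  PushbackTokens(Iterator<T> source) { this.source = source; }

  @Override public boolean hasNext() { return hasPushed || source.hasNext(); }

  @Override public T next() {
    if (hasPushed) {
      // Serve the pushed-back token before touching the underlying stream.
      hasPushed = false;
      T t = pushedBack;
      pushedBack = null;
      return t;
    }
    return source.next();
  }

  /** Makes 'token' the next value returned by next(). */
  void unget(T token) {
    if (hasPushed) {
      throw new IllegalStateException("only one token of pushback is supported");
    }
    pushedBack = token;
    hasPushed = true;
  }
}
```

With something like this, recovery code could read the open bracket, push it 
back, and let the normal record reader see it as the first token of the next 
record.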

We also have to ask if the JSON parser keeps track of parse state. Looking 
at the code, it seems that JsonParser is really just a tokenizer: it has no 
state about the JSON structure. (Would be good to run a small test to verify 
this observation.)

By the way, the JSON parser class has lots of good stuff already there. For 
example, the parser itself will keep track of line numbers and file locations. 
Perhaps we can use that when reporting error positions.

The last bit is that we've been building up a record as we parse JSON. If 
we fail part way through, we've got a half-built record. Again, here I'm a bit 
hazy on what Drill can do. Can we "unwind" the current record? Can we mark the 
record as one to ignore (with a select vector?) Or, do we live with the 
half-built record? Throwing away the half-built record would be best, if we can 
do it.

All that said, how much of the above does your proposed code change handle? 
What other parts might still need to be added?

Thanks!




[jira] [Commented] (DRILL-4653) Malformed JSON should not stop the entire query from progressing

2016-09-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15486473#comment-15486473
 ] 

ASF GitHub Bot commented on DRILL-4653:
---

Github user ssriniva123 commented on the issue:

https://github.com/apache/drill/pull/518
  
Paul,
Thanks for taking the time out in writing out a detailed email. Here are 
some of my thoughts.

- Drill uses the com.fasterxml.jackson.core.json.UTF8StreamJsonParser for 
parsing of JSON records. This parser does not rely on line delimiters for 
record separators but instead uses
the JSON structure as a natural way to signal End of record (EOR). There 
are methods internal
to the parser which check for line feeds but is not exposed to callers.

- The CountingJsonReader uses the parser.skipChildren() method to skip the 
rest of the children for this record, hence it is not possible to accurately 
count and match the number of braces to cleanly skip that bad record.

- One thought is to tap the inputsource of the parser on an exception 
condition, but is not
encouraged.

My thought process was exactly along the lines you have been thinking. In 
an exception scenario, the code attempts to locate a closing bracket (}) 
followed by an opening bracket ({).
This is what is being done in the BaseJsonProcessor.processJSONException 
method. Please note that it works in all cases except when we do not have 
proper brackets to signify end of a JSON record. 

Hope this explanation helps clarify.











[jira] [Commented] (DRILL-4653) Malformed JSON should not stop the entire query from progressing

2016-09-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15486130#comment-15486130
 ] 

ASF GitHub Bot commented on DRILL-4653:
---

Github user paul-rogers commented on the issue:

https://github.com/apache/drill/pull/518
  
Upon reflection, it seems that newline is not an adequate marker to 
separate JSON records. Many of our samples have internal newlines. If a newline 
appears inside the JSON record, then we are subject to the same incorrect 
recovery as illustrated with the "a, x, bar, y" example in the earlier comment.

Further, if the JSON tokenizer is like most, it probably discards 
whitespace, not returning EOL as a token.

So, it seems that the best (or only) option is to scan for the "} {" pair. 
This requires two specific improvements:

* A "token discarder" that uses a state machine to look for the "} {" 
pairs, and
* An indirection around the get-token method so we can push the "{" token 
back onto the input.

These changes, along with the pseudo-code shown earlier may provide as good 
a solution as we can get. (Phrased that way because some errors will cause two 
records to be discarded, as explained earlier.) Combine that with the options 
and error reporting from the original pull request and we are probably pretty 
close.
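
A "token discarder" of the kind described above could be sketched as a 
two-state machine. The Tok enum and RecordBoundaryScanner class are 
hypothetical simplifications (a real reader would map the parser's tokens 
onto them, and would treat a parse error as just another token to discard). 
Note that a run of consecutive close brackets stays in the "saw close" state, 
so nested objects ending in "} } {" are still recognized.

```java
import java.util.Iterator;

final class RecordBoundaryScanner {
  /** Simplified token kinds; everything that is not a brace is OTHER. */
  enum Tok { START_OBJECT, END_OBJECT, OTHER }

  private enum State { SCANNING, SAW_CLOSE }

  /**
   * Discards tokens until an END_OBJECT is immediately followed by a
   * START_OBJECT. Returns true if the pair was found (the START_OBJECT has
   * been consumed), or false if the input ended first.
   */
  static boolean skipToNextRecord(Iterator<Tok> tokens) {
    State state = State.SCANNING;
    while (tokens.hasNext()) {
      Tok t = tokens.next();
      switch (state) {
        case SCANNING:
          if (t == Tok.END_OBJECT) { state = State.SAW_CLOSE; }
          break;
        case SAW_CLOSE:
          if (t == Tok.START_OBJECT) { return true; }
          // Another close bracket keeps us in SAW_CLOSE; anything else resets.
          if (t != Tok.END_OBJECT) { state = State.SCANNING; }
          break;
      }
    }
    return false;
  }
}
```

Because the open bracket is consumed when the method returns true, the caller 
still needs either a push-back mechanism or a flag saying the record has 
already been opened.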




[jira] [Commented] (DRILL-4653) Malformed JSON should not stop the entire query from progressing

2016-09-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15485850#comment-15485850
 ] 

ASF GitHub Bot commented on DRILL-4653:
---

Github user paul-rogers commented on the issue:

https://github.com/apache/drill/pull/518
  
Thanks much for your contribution! Sorry it is taking a while to review the 
code.

I've read over the code changes a couple of times. The options part is 
fine. However, the parser part seems a bit too complex and may not handle all 
cases of interest. I worry that the code may only handle the error shown in the 
test input file: a missing close quote:

{ "foo" : "bar }
{ "foo" : "mumble" }

To do a more general recovery, we have to have some known good recovery 
point. Pure JSON is not suited to recovery. Fortunately, the JSON we read is 
not true JSON, rather it is a file of JSON objects. The file-of-json-object 
structure offers two possible recovery points.

If we require that each JSON record start on a new line, then we can use 
newline as our recovery point:

{ "foo" :: 10 }
{ "foo" : 20 }

In the above, the first record is badly formed. But, we know where the next 
record begins because of the newline separator.

It seems that the code may be using the above approach. But, since the only 
test case is a missing close quote, the parser will "naturally" read all tokens 
up to EOL. But, the implementation seems to not handle the double-colon 
example: the parser will fail on the second token. When resuming, the parser 
will see another colon and fail again. Maybe the parser would eventually fail 
enough to find the EOL. But, we can construct cases where things would fail:

{ "foo" :: 10, "x": { "bar", 20, "mumble" } }

If we don't discard tokens to the newline, the parser may decide that { 
"bar" ... starts a new valid record, with the result of causing a schema change 
when not needed.

The message here is that, on error, we must discard tokens to EOL, we can't 
just try to resume parsing at the point of failure.

If, however, we can't count on the newline, then we need some other 
syntactic trick. Perhaps we can look for "}\s*{" (that is, close bracket, 
optional white space, open bracket):

{ "foo" :: 
{ "foo" : 20 }
{ "foo" : 30 }

Here, we'd actually discard both the first and second records because we'd 
only find the recovery pattern between the second and third records. (The "} 
{" pattern can never occur inside a JSON object, it would have to be "}, {" 
instead.)
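
For illustration only, the close-bracket / optional-white-space / open-bracket 
search can be tried on raw text with java.util.regex. RecoveryPoint is a 
hypothetical helper; scanning raw text like this would be fooled by a "} {" 
sequence inside a string value, which is one reason the discussion here 
prefers working on the parser's tokens.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RecoveryPoint {
  // Close brace, optional whitespace (including newlines), open brace.
  private static final Pattern BOUNDARY = Pattern.compile("\\}\\s*\\{");

  /**
   * Returns the index of the '{' that starts the next record found at or
   * after 'from', or -1 if no record boundary remains in the text.
   */
  static int nextRecordStart(String text, int from) {
    Matcher m = BOUNDARY.matcher(text);
    return m.find(from) ? m.end() - 1 : -1;
  }
}
```

Applied to the three-line example above, the first boundary found is the one 
between the second and third records, so parsing would resume at the third 
record and the first two would be lost.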

Using EOL as delimiter is easiest: we don't need lookahead. But, we require 
that a newline separate records. (Newlines within records can be handled, but 
we'll omit that for now...)

Using "}\s*{" as the recovery is a bit harder: we need to read ahead by one 
token, then push the token back on the input.

If we do the above, then the reader algorithm should look something like 
this:

while ( more records to read ) {
  try {
    read the record
  } catch ( parse exception ) {
    read and discard tokens until we get to an EOL (or the "}\s*{")
    reset the value vector pointer back one step to discard any 
    partially loaded values.
  }
}

The above can be shown to be correct by working through a simple state 
machine and set of examples.
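
A minimal runnable sketch of that loop, for the simpler newline-delimited 
case, might look like the following. SkippingReader and the parseRecord 
callback are hypothetical stand-ins for the real JSON reader; the callback is 
assumed to throw IllegalArgumentException on malformed input. Here a record is 
only stored after it parses completely, so there is nothing to roll back.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

final class SkippingReader {
  /** Records that parsed, plus the 1-based line numbers that were skipped. */
  static final class Result<R> {
    final List<R> records = new ArrayList<>();
    final List<Integer> skippedLines = new ArrayList<>();
  }

  /** Parses one record per line, skipping lines that fail to parse. */
  static <R> Result<R> readAll(List<String> lines, Function<String, R> parseRecord) {
    Result<R> result = new Result<>();
    for (int i = 0; i < lines.size(); i++) {
      try {
        result.records.add(parseRecord.apply(lines.get(i)));
      } catch (IllegalArgumentException e) {
        // The recovery point is the end of the line: drop the rest of this
        // record, remember where it was, and resume with the next line.
        result.skippedLines.add(i + 1);
      }
    }
    return result;
  }
}
```

Keeping the skipped line numbers alongside the results also gives the caller 
something to report back to the user.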

Alternatively, it might help to include here (or in comments in code) an 
explanation of the proposed recovery mechanism so it is easier for reviewers to 
verify correctness.





[jira] [Commented] (DRILL-4653) Malformed JSON should not stop the entire query from progressing

2016-09-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15485676#comment-15485676
 ] 

ASF GitHub Bot commented on DRILL-4653:
---

Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/518#discussion_r78475862
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/json/JsonProcessor.java
 ---
@@ -30,6 +30,8 @@
 
   public static enum ReadState {
 END_OF_STREAM,
+JSON_RECORD_PARSE_ERROR,
--- End diff --

Would be helpful to add a comment to describe the meaning of these new 
states.




[jira] [Commented] (DRILL-4653) Malformed JSON should not stop the entire query from progressing

2016-09-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15484810#comment-15484810
 ] 

ASF GitHub Bot commented on DRILL-4653:
---

Github user ssriniva123 commented on the issue:

https://github.com/apache/drill/pull/518
  
Folks,
I have tests on my local Drill setup with various combinations of invalid 
JSON formats. The only caveat is that some records may be skipped if the JSON 
is not properly delimited by }. Can this be reviewed for the next release? I 
can work on the documentation.





[jira] [Commented] (DRILL-4653) Malformed JSON should not stop the entire query from progressing

2016-08-22 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15431172#comment-15431172
 ] 

ASF GitHub Bot commented on DRILL-4653:
---

Github user jaltekruse commented on the issue:

https://github.com/apache/drill/pull/518
  
After this was discussed at the Hangout a few weeks back, I had been 
thinking about it more.

My initial request was for warnings to be returned along with query 
results. Some initial work was posted last fall to add warnings to the RPC 
protocol, but unfortunately it was not brought to completion. These kinds of 
warnings can be received by JDBC and ODBC sources and many tools show them to 
users.

https://github.com/apache/drill/pull/263

In the hangout we discussed that this changeset is currently logging when 
part of a file is ignored. I don't believe that there is currently an 
expectation that users should have to grep through a log file to find out any 
additional information about the execution of a successful query, the log files 
are there for admins to debug system issues.

I still think it would be good to have a way to remove the need to look 
through the log file to see if this behavior was used in a query, but as there 
wasn't a lot of concern expressed by others when we discussed it, I'm changing 
my vote to a +0.




[jira] [Commented] (DRILL-4653) Malformed JSON should not stop the entire query from progressing

2016-07-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15380119#comment-15380119
 ] 

ASF GitHub Bot commented on DRILL-4653:
---

Github user jaltekruse commented on the issue:

https://github.com/apache/drill/pull/518
  
I don't think we should merge this without a mechanism to return a warning 
to the user to tell them at least that some data was ignored, and ideally some 
indication of how much data was discarded. While I do understand this is not 
the default behavior, I think there is still too high of a risk that an admin 
could set this at a global level and users would be unaware of some of their 
data being discarded.

I am willing to discuss the benefits of merging this before such a system 
exists, but until this issue has been thoroughly evaluated I am -1 on the 
change.

One improvement you could make to the current implementation is moving the 
option to the format plugin instead of the system/session list. This enables 
users to include setting the option in their query with the "table with 
options" syntax that was added last fall. We already have a JIRA open for 
moving the all_text_mode and read_numbers_as_double options to this location, 
because it doesn't really make sense to change query results based on session 
state. Unfortunately this change does not completely remove my initial concern, 
because not all users can modify or see the storage plugins in the case when 
web UI security is enabled. Non-admin users in these cases could be surprised 
by this behavior.

For examples of how this is done, you can look at the text plugin config, 
you would just need to add these options as properties to the json config which 
is currently mostly empty.

https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/json/JSONFormatPlugin.java#L93


https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/text/TextFormatPlugin.java#L135

Select with options: https://issues.apache.org/jira/browse/DRILL-4047
Jira for moving the existing options: 
https://issues.apache.org/jira/browse/DRILL-4206




[jira] [Commented] (DRILL-4653) Malformed JSON should not stop the entire query from progressing

2016-07-14 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15378356#comment-15378356
 ] 

ASF GitHub Bot commented on DRILL-4653:
---

Github user chunhui-shi commented on the issue:

https://github.com/apache/drill/pull/518
  
Thanks for providing a way to skip bad json records and addressing my 
comments.
Besides the minor improvements I suggested above: comment format changes 
that may be due to line width in your env.,  and unit tests that should check 
the expected output as much as possible, could you please squash these multiple 
commits into one commit? 





[jira] [Commented] (DRILL-4653) Malformed JSON should not stop the entire query from progressing

2016-07-14 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15378334#comment-15378334
 ] 

ASF GitHub Bot commented on DRILL-4653:
---

Github user chunhui-shi commented on a diff in the pull request:

https://github.com/apache/drill/pull/518#discussion_r70879473
  
--- Diff: 
exec/java-exec/src/test/java/org/apache/drill/exec/store/json/TestJsonRecordReader.java
 ---
@@ -159,64 +164,91 @@ public void drill_3353() throws Exception {
   test("create table dfs_test.tmp.drill_3353 as select a from 
dfs.`${WORKING_PATH}/src/test/resources/jsoninput/drill_3353` where e = true");
   String query = "select t.a.d cnt from dfs_test.tmp.drill_3353 t 
where t.a.d is not null";
   test(query);
-  testBuilder()
-  .sqlQuery(query)
-  .unOrdered()
-  .baselineColumns("cnt")
-  .baselineValues("1")
-  .go();
+  testBuilder().sqlQuery(query).unOrdered().baselineColumns("cnt")
+  .baselineValues("1").go();
 } finally {
   testNoResult("alter session set `store.json.all_text_mode` = false");
 }
   }
 
-  @Test // See DRILL-3476
+  @Test
+  // See DRILL-3476
   public void testNestedFilter() throws Exception {
 String query = "select a from cp.`jsoninput/nestedFilter.json` t where 
t.a.b = 1";
 String baselineQuery = "select * from cp.`jsoninput/nestedFilter.json` 
t where t.a.b = 1";
-testBuilder()
-.sqlQuery(query)
-.unOrdered()
-.sqlBaselineQuery(baselineQuery)
+
testBuilder().sqlQuery(query).unOrdered().sqlBaselineQuery(baselineQuery)
 .go();
   }
 
- @Test // See DRILL-4653
-public void testSkippingInvalidJSONRecords() throws Exception {
-try
-{
-  String set = "alter session set `" + 
ExecConstants.JSON_READER_SKIP_INVALID_RECORDS_FLAG+ "` = true";
-  String query = "select count(*) from cp.`jsoninput/DRILL-4653.json`";
+  @Test
+  // See DRILL-4653
+  /* Test for CountingJSONReader */
+  public void testCountingQuerySkippingInvalidJSONRecords() throws 
Exception {
+try {
+  String set = "alter session set `"
+  + ExecConstants.JSON_READER_SKIP_INVALID_RECORDS_FLAG + "` = 
true";
+  String set1 = "alter session set `"
+  + ExecConstants.JSON_READER_PRINT_INVALID_RECORDS_LINE_NOS_FLAG
+  + "` = true";
+  String query = "select count(*) from 
cp.`jsoninput/drill4653/file.json`";
+  testNoResult(set);
+  testNoResult(set1);
+  
testBuilder().unOrdered().sqlQuery(query).sqlBaselineQuery(query).build()
--- End diff --

Should we verify the expected count against the result?




[jira] [Commented] (DRILL-4653) Malformed JSON should not stop the entire query from progressing

2016-07-14 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15378336#comment-15378336
 ] 

ASF GitHub Bot commented on DRILL-4653:
---

Github user chunhui-shi commented on a diff in the pull request:

https://github.com/apache/drill/pull/518#discussion_r70879554
  
--- Diff: 
exec/java-exec/src/test/java/org/apache/drill/exec/store/json/TestJsonRecordReader.java
 ---
@@ -159,64 +164,91 @@ public void drill_3353() throws Exception {
   test("create table dfs_test.tmp.drill_3353 as select a from 
dfs.`${WORKING_PATH}/src/test/resources/jsoninput/drill_3353` where e = true");
   String query = "select t.a.d cnt from dfs_test.tmp.drill_3353 t 
where t.a.d is not null";
   test(query);
-  testBuilder()
-  .sqlQuery(query)
-  .unOrdered()
-  .baselineColumns("cnt")
-  .baselineValues("1")
-  .go();
+  testBuilder().sqlQuery(query).unOrdered().baselineColumns("cnt")
+  .baselineValues("1").go();
 } finally {
   testNoResult("alter session set `store.json.all_text_mode` = false");
 }
   }
 
-  @Test // See DRILL-3476
+  @Test
+  // See DRILL-3476
   public void testNestedFilter() throws Exception {
 String query = "select a from cp.`jsoninput/nestedFilter.json` t where 
t.a.b = 1";
 String baselineQuery = "select * from cp.`jsoninput/nestedFilter.json` 
t where t.a.b = 1";
-testBuilder()
-.sqlQuery(query)
-.unOrdered()
-.sqlBaselineQuery(baselineQuery)
+
testBuilder().sqlQuery(query).unOrdered().sqlBaselineQuery(baselineQuery)
 .go();
   }
 
- @Test // See DRILL-4653
-public void testSkippingInvalidJSONRecords() throws Exception {
-try
-{
-  String set = "alter session set `" + 
ExecConstants.JSON_READER_SKIP_INVALID_RECORDS_FLAG+ "` = true";
-  String query = "select count(*) from cp.`jsoninput/DRILL-4653.json`";
+  @Test
+  // See DRILL-4653
+  /* Test for CountingJSONReader */
+  public void testCountingQuerySkippingInvalidJSONRecords() throws 
Exception {
+try {
+  String set = "alter session set `"
+  + ExecConstants.JSON_READER_SKIP_INVALID_RECORDS_FLAG + "` = 
true";
+  String set1 = "alter session set `"
+  + ExecConstants.JSON_READER_PRINT_INVALID_RECORDS_LINE_NOS_FLAG
+  + "` = true";
+  String query = "select count(*) from 
cp.`jsoninput/drill4653/file.json`";
+  testNoResult(set);
+  testNoResult(set1);
+  
testBuilder().unOrdered().sqlQuery(query).sqlBaselineQuery(query).build()
+  .run();
+} finally {
+  String set = "alter session set `"
+  + ExecConstants.JSON_READER_SKIP_INVALID_RECORDS_FLAG + "` = 
false";
   testNoResult(set);
-  testBuilder()
-  .unOrdered()
-  .sqlQuery(query)
-  .sqlBaselineQuery(query)
-  .build().run();
 }
-finally
-{
-  String set = "alter session set `" + 
ExecConstants.JSON_READER_SKIP_INVALID_RECORDS_FLAG+ "` = false";
+  }
+
+  @Test
+  // See DRILL-4653
+  /* Test for CountingJSONReader */
+  public void testCountingQueryNotSkippingInvalidJSONRecords() throws 
Exception {
+try {
+  String query = "select count(*) from 
cp.`jsoninput/drill4653/file.json`";
+  
testBuilder().unOrdered().sqlQuery(query).sqlBaselineQuery(query).build()
--- End diff --

Should we compare with expected count here?


> Malformed JSON should not stop the entire query from progressing
> 
>
> Key: DRILL-4653
> URL: https://issues.apache.org/jira/browse/DRILL-4653
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - JSON
>Affects Versions: 1.6.0
>Reporter: subbu srinivasan
> Fix For: Future
>
>
> Currently Drill query terminates upon first encounter of a invalid JSON line.
> Drill has to continue progressing after ignoring the bad records. Something 
> similar to a setting of (ignore.malformed.json) would help.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DRILL-4653) Malformed JSON should not stop the entire query from progressing

2016-07-14 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15378249#comment-15378249
 ] 

ASF GitHub Bot commented on DRILL-4653:
---

Github user chunhui-shi commented on a diff in the pull request:

https://github.com/apache/drill/pull/518#discussion_r70872218
  
--- Diff: 
exec/java-exec/src/test/java/org/apache/drill/exec/store/json/TestJsonRecordReader.java
 ---
@@ -56,11 +57,19 @@ public void trySimpleQueryWithLimit() throws Exception {
 test("select * from cp.`limit/test1.json` limit 10");
   }
 
-  @Test// DRILL-1634 : retrieve an element in a nested array in a repeated 
map.  RepeatedMap (Repeated List (Repeated varchar))
+  @Test
+  // DRILL-1634 : retrieve an element in a nested array in a repeated map.
+  // RepeatedMap (Repeated List (Repeated varchar))
   public void testNestedArrayInRepeatedMap() throws Exception {
 test("select a[0].b[0] from cp.`jsoninput/nestedArray.json`");
 test("select a[0].b[1] from cp.`jsoninput/nestedArray.json`");
-test("select a[1].b[1] from cp.`jsoninput/nestedArray.json`");  // 
index out of the range. Should return empty list.
+test("select a[1].b[1] from cp.`jsoninput/nestedArray.json`"); // 
index out
--- End diff --

comment format


> Malformed JSON should not stop the entire query from progressing
> 
>
> Key: DRILL-4653
> URL: https://issues.apache.org/jira/browse/DRILL-4653
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - JSON
>Affects Versions: 1.6.0
>Reporter: subbu srinivasan
> Fix For: Future
>
>
> Currently Drill query terminates upon first encounter of a invalid JSON line.
> Drill has to continue progressing after ignoring the bad records. Something 
> similar to a setting of (ignore.malformed.json) would help.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DRILL-4653) Malformed JSON should not stop the entire query from progressing

2016-07-14 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15378138#comment-15378138
 ] 

ASF GitHub Bot commented on DRILL-4653:
---

Github user chunhui-shi commented on a diff in the pull request:

https://github.com/apache/drill/pull/518#discussion_r70863798
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/vector/complex/fn/JsonReader.java
 ---
@@ -110,21 +118,29 @@ public void ensureAtLeastOneField(ComplexWriter 
writer) {
 emptyStatus.set(i, true);
   }
   if (i == 0 && !allTextMode) {
-// when allTextMode is false, there is not much benefit to 
producing all the empty
-// fields; just produce 1 field.  The reason is that the type of 
the fields is
-// unknown, so if we produce multiple Integer fields by default, a 
subsequent batch
-// that contains non-integer fields will error out in any case.  
Whereas, with
-// allTextMode true, we are sure that all fields are going to be 
treated as varchar,
-// so it makes sense to produce all the fields, and in fact is 
necessary in order to
+// when allTextMode is false, there is not much benefit to 
producing all
--- End diff --

It seems the line width changed here (lines 121-132). Please rewrap the 
text, keeping the original text unchanged if possible. The same applies to 
lines 140-143.


> Malformed JSON should not stop the entire query from progressing
> 
>
> Key: DRILL-4653
> URL: https://issues.apache.org/jira/browse/DRILL-4653
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - JSON
>Affects Versions: 1.6.0
>Reporter: subbu srinivasan
> Fix For: Future
>
>
> Currently Drill query terminates upon first encounter of a invalid JSON line.
> Drill has to continue progressing after ignoring the bad records. Something 
> similar to a setting of (ignore.malformed.json) would help.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DRILL-4653) Malformed JSON should not stop the entire query from progressing

2016-07-11 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15371113#comment-15371113
 ] 

ASF GitHub Bot commented on DRILL-4653:
---

Github user jinfengni commented on the issue:

https://github.com/apache/drill/pull/518
  
@chunhui-shi , I saw you made comments days ago. Can you pls take a look at 
the new patch to see if it addressed your comment? thx. 
 


> Malformed JSON should not stop the entire query from progressing
> 
>
> Key: DRILL-4653
> URL: https://issues.apache.org/jira/browse/DRILL-4653
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - JSON
>Affects Versions: 1.6.0
>Reporter: subbu srinivasan
> Fix For: Future
>
>
> Currently Drill query terminates upon first encounter of a invalid JSON line.
> Drill has to continue progressing after ignoring the bad records. Something 
> similar to a setting of (ignore.malformed.json) would help.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DRILL-4653) Malformed JSON should not stop the entire query from progressing

2016-06-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15348682#comment-15348682
 ] 

ASF GitHub Bot commented on DRILL-4653:
---

Github user ssriniva123 commented on the issue:

https://github.com/apache/drill/pull/518
  
I have made several modifications to get accurate line numbers:

- The main change is to move the JSON parser to the end of the current 
record being processed, since previously multiple exceptions were thrown.
- BaseJsonProcessor contains a new method, protected 
JsonExceptionProcessingState processJSONException().
- This method is called by both CountingJsonReader and JsonProcessor whenever 
the parser encounters a Jackson parsing exception.
- I have also added a new system setting, 
store.json.reader.print_skipped_invalid_record_number, so that we can suppress 
printing of line numbers.
- Added more unit test cases.
- System tested with various combinations on a local Drillbit.
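
The approach described in this comment (catch the parse error per record, move past the bad record, and optionally report its line number) can be sketched in isolation. This is a minimal illustration, not Drill's code. It assumes one JSON record per line and uses a deliberately naive validity check in place of the Jackson parser. Every name in it (`SkippingReaderSketch`, `parseRecord`, `read`, `Result`) is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SkippingReaderSketch {

  /** Outcome of a read: number of accepted records and the 1-based lines skipped. */
  public static final class Result {
    public final int validCount;
    public final List<Integer> skippedLines;

    Result(int validCount, List<Integer> skippedLines) {
      this.validCount = validCount;
      this.skippedLines = skippedLines;
    }
  }

  // Naive stand-in for a real parser: accepts only lines shaped like {...}.
  static void parseRecord(String line) {
    String trimmed = line.trim();
    if (!(trimmed.startsWith("{") && trimmed.endsWith("}"))) {
      throw new IllegalArgumentException("malformed record: " + line);
    }
  }

  // The try/catch sits inside the loop, so one bad record does not end the read.
  public static Result read(List<String> lines, boolean skipInvalid, boolean printLineNumbers) {
    int valid = 0;
    List<Integer> skipped = new ArrayList<>();
    for (int i = 0; i < lines.size(); i++) {
      try {
        parseRecord(lines.get(i));
        valid++;
      } catch (IllegalArgumentException ex) {
        if (!skipInvalid) {
          throw ex; // skipping disabled: the first bad record fails the whole read
        }
        skipped.add(i + 1);
        if (printLineNumbers) {
          System.err.println("Skipping invalid record at line " + (i + 1));
        }
      }
    }
    return new Result(valid, skipped);
  }

  public static void main(String[] args) {
    List<String> input = Arrays.asList("{\"a\": 1}", "{\"a\": ", "{\"a\": 3}");
    Result r = read(input, true, true);
    System.out.println(r.validCount + " valid, skipped lines " + r.skippedLines);
  }
}
```

The two boolean parameters play the role of the skip flag and the line-number printing flag mentioned in the comment. A real reader also has to resynchronize the parser at the end of the broken record, which this line-based sketch sidesteps.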



> Malformed JSON should not stop the entire query from progressing
> 
>
> Key: DRILL-4653
> URL: https://issues.apache.org/jira/browse/DRILL-4653
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - JSON
>Affects Versions: 1.6.0
>Reporter: subbu srinivasan
> Fix For: 1.7.0
>
>
> Currently Drill query terminates upon first encounter of a invalid JSON line.
> Drill has to continue progressing after ignoring the bad records. Something 
> similar to a setting of (ignore.malformed.json) would help.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DRILL-4653) Malformed JSON should not stop the entire query from progressing

2016-06-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15336180#comment-15336180
 ] 

ASF GitHub Bot commented on DRILL-4653:
---

Github user amansinha100 commented on the issue:

https://github.com/apache/drill/pull/518
  
Hmm, I still see 4 commits in the pull request. Can you squash them into 
one? (Let me know if you need help with that.) Also, the commit message needs 
to be in the format: "DRILL-4653: Malformed json"


> Malformed JSON should not stop the entire query from progressing
> 
>
> Key: DRILL-4653
> URL: https://issues.apache.org/jira/browse/DRILL-4653
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - JSON
>Affects Versions: 1.6.0
>Reporter: subbu srinivasan
> Fix For: 1.7.0
>
>
> Currently Drill query terminates upon first encounter of a invalid JSON line.
> Drill has to continue progressing after ignoring the bad records. Something 
> similar to a setting of (ignore.malformed.json) would help.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DRILL-4653) Malformed JSON should not stop the entire query from progressing

2016-06-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15335370#comment-15335370
 ] 

ASF GitHub Bot commented on DRILL-4653:
---

Github user ssriniva123 commented on the issue:

https://github.com/apache/drill/pull/518
  
I have also squashed my commits


> Malformed JSON should not stop the entire query from progressing
> 
>
> Key: DRILL-4653
> URL: https://issues.apache.org/jira/browse/DRILL-4653
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - JSON
>Affects Versions: 1.6.0
>Reporter: subbu srinivasan
> Fix For: 1.7.0
>
>
> Currently Drill query terminates upon first encounter of a invalid JSON line.
> Drill has to continue progressing after ignoring the bad records. Something 
> similar to a setting of (ignore.malformed.json) would help.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DRILL-4653) Malformed JSON should not stop the entire query from progressing

2016-06-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15334765#comment-15334765
 ] 

ASF GitHub Bot commented on DRILL-4653:
---

Github user ssriniva123 commented on a diff in the pull request:

https://github.com/apache/drill/pull/518#discussion_r67431369
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/json/JSONRecordReader.java
 ---
@@ -189,39 +191,33 @@ private long currentRecordNumberInFile() {
   public int next() {
 writer.allocate();
 writer.reset();
-
 recordCount = 0;
 ReadState write = null;
-//Stopwatch p = new Stopwatch().start();
-try{
-  outside: while(recordCount < DEFAULT_ROWS_PER_BATCH) {
-writer.setPosition(recordCount);
-write = jsonReader.write(writer);
-
-if(write == ReadState.WRITE_SUCCEED) {
-//  logger.debug("Wrote record.");
-  recordCount++;
-}else{
-//  logger.debug("Exiting.");
-  break outside;
-}
-
+outside: while(recordCount < DEFAULT_ROWS_PER_BATCH){
+try
+  {
+writer.setPosition(recordCount);
--- End diff --

Aman,
maven checkstyle:checkstyle did not report any errors before I did my last
check-in. I have changed the code to use 2 spaces for indentation.

On Thu, Jun 16, 2016 at 2:22 PM, Aman Sinha 
wrote:

> In
> 
exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/json/JSONRecordReader.java
> :
>
> > -  outside: while(recordCount < DEFAULT_ROWS_PER_BATCH) {
> > -writer.setPosition(recordCount);
> > -write = jsonReader.write(writer);
> > -
> > -if(write == ReadState.WRITE_SUCCEED) {
> > -//  logger.debug("Wrote record.");
> > -  recordCount++;
> > -}else{
> > -//  logger.debug("Exiting.");
> > -  break outside;
> > -}
> > -
> > +outside: while(recordCount < DEFAULT_ROWS_PER_BATCH){
> > +try
> > +  {
> > +writer.setPosition(recordCount);
>
> seems this is still doing indent of 4. We use 2 spaces (see
> https://drill.apache.org/docs/apache-drill-contribution-guidelines/
> scroll down to Step 2). Did it pass the mvn command line build without
> checkstyle violations ?
>



> Malformed JSON should not stop the entire query from progressing
> 
>
> Key: DRILL-4653
> URL: https://issues.apache.org/jira/browse/DRILL-4653
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - JSON
>Affects Versions: 1.6.0
>Reporter: subbu srinivasan
> Fix For: 1.7.0
>
>
> Currently Drill query terminates upon first encounter of a invalid JSON line.
> Drill has to continue progressing after ignoring the bad records. Something 
> similar to a setting of (ignore.malformed.json) would help.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DRILL-4653) Malformed JSON should not stop the entire query from progressing

2016-06-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15334716#comment-15334716
 ] 

ASF GitHub Bot commented on DRILL-4653:
---

Github user amansinha100 commented on a diff in the pull request:

https://github.com/apache/drill/pull/518#discussion_r67426795
  
--- Diff: 
exec/java-exec/src/test/java/org/apache/drill/exec/store/json/TestJsonRecordReader.java
 ---
@@ -179,4 +180,43 @@ public void testNestedFilter() throws Exception {
 .sqlBaselineQuery(baselineQuery)
 .go();
   }
+
+ @Test // See DRILL-4653
+public void testSkippingInvalidJSONRecords() throws Exception {
+try
+{
+String set = "alter session set `" + 
ExecConstants.JSON_READER_SKIP_INVALID_RECORDS_FLAG+ "` = true";
--- End diff --

These should be indented inside the try block with 2 spaces. It is best 
to set the indent level in your IDE (I can help with Eclipse if you are using 
it; if you are using IntelliJ, I can find out from other developers who use 
it).


> Malformed JSON should not stop the entire query from progressing
> 
>
> Key: DRILL-4653
> URL: https://issues.apache.org/jira/browse/DRILL-4653
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - JSON
>Affects Versions: 1.6.0
>Reporter: subbu srinivasan
>
> Currently Drill query terminates upon first encounter of a invalid JSON line.
> Drill has to continue progressing after ignoring the bad records. Something 
> similar to a setting of (ignore.malformed.json) would help.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DRILL-4653) Malformed JSON should not stop the entire query from progressing

2016-06-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15334713#comment-15334713
 ] 

ASF GitHub Bot commented on DRILL-4653:
---

Github user amansinha100 commented on a diff in the pull request:

https://github.com/apache/drill/pull/518#discussion_r67426381
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/json/JSONRecordReader.java
 ---
@@ -189,39 +191,33 @@ private long currentRecordNumberInFile() {
   public int next() {
 writer.allocate();
 writer.reset();
-
 recordCount = 0;
 ReadState write = null;
-//Stopwatch p = new Stopwatch().start();
-try{
-  outside: while(recordCount < DEFAULT_ROWS_PER_BATCH) {
-writer.setPosition(recordCount);
-write = jsonReader.write(writer);
-
-if(write == ReadState.WRITE_SUCCEED) {
-//  logger.debug("Wrote record.");
-  recordCount++;
-}else{
-//  logger.debug("Exiting.");
-  break outside;
-}
-
+outside: while(recordCount < DEFAULT_ROWS_PER_BATCH){
+try
+  {
+writer.setPosition(recordCount);
--- End diff --

It seems this still uses an indent of 4. We use 2 spaces (see 
https://drill.apache.org/docs/apache-drill-contribution-guidelines/ and scroll 
down to Step 2). Did it pass the mvn command line build without checkstyle 
violations?


> Malformed JSON should not stop the entire query from progressing
> 
>
> Key: DRILL-4653
> URL: https://issues.apache.org/jira/browse/DRILL-4653
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - JSON
>Affects Versions: 1.6.0
>Reporter: subbu srinivasan
>
> Currently Drill query terminates upon first encounter of a invalid JSON line.
> Drill has to continue progressing after ignoring the bad records. Something 
> similar to a setting of (ignore.malformed.json) would help.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DRILL-4653) Malformed JSON should not stop the entire query from progressing

2016-06-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15334706#comment-15334706
 ] 

ASF GitHub Bot commented on DRILL-4653:
---

Github user amansinha100 commented on the issue:

https://github.com/apache/drill/pull/518
  
Looks much better. Sorry for the nitpick, but I still have a couple more 
comments related to the coding conventions. :) Also, could you squash the 
commits into 1 and use the DRILL-: format for the commit? 
Thanks!


> Malformed JSON should not stop the entire query from progressing
> 
>
> Key: DRILL-4653
> URL: https://issues.apache.org/jira/browse/DRILL-4653
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - JSON
>Affects Versions: 1.6.0
>Reporter: subbu srinivasan
>
> Currently Drill query terminates upon first encounter of a invalid JSON line.
> Drill has to continue progressing after ignoring the bad records. Something 
> similar to a setting of (ignore.malformed.json) would help.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DRILL-4653) Malformed JSON should not stop the entire query from progressing

2016-06-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15334666#comment-15334666
 ] 

ASF GitHub Bot commented on DRILL-4653:
---

Github user ssriniva123 commented on the issue:

https://github.com/apache/drill/pull/518
  
I have made the changes recommended by reviewers:

- Changed JSON_READER_SKIP_INVALID_RECORDS_FLAG constant
- Modified unit test to use builder framework
- Code indentation changes.



> Malformed JSON should not stop the entire query from progressing
> 
>
> Key: DRILL-4653
> URL: https://issues.apache.org/jira/browse/DRILL-4653
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - JSON
>Affects Versions: 1.6.0
>Reporter: subbu srinivasan
>
> Currently Drill query terminates upon first encounter of a invalid JSON line.
> Drill has to continue progressing after ignoring the bad records. Something 
> similar to a setting of (ignore.malformed.json) would help.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DRILL-4653) Malformed JSON should not stop the entire query from progressing

2016-06-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15334270#comment-15334270
 ] 

ASF GitHub Bot commented on DRILL-4653:
---

Github user parthchandra commented on a diff in the pull request:

https://github.com/apache/drill/pull/518#discussion_r67390934
  
--- Diff: 
exec/java-exec/src/test/java/org/apache/drill/exec/store/json/TestJsonRecordReader.java
 ---
@@ -116,6 +117,7 @@ public void testMixedNumberTypes() throws Exception {
   .jsonBaselineFile("jsoninput/mixed_number_types.json")
   .build().run();
 } catch (Exception ex) {
+  ex.printStackTrace();
--- End diff --

Not a good idea to print stack trace in unit tests. The output of our unit 
tests is already too verbose.
Use junit.fail with the message from the exception instead?
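
A sketch of what this suggestion might look like. To stay self-contained it defines a local `fail` helper that mirrors the behavior of JUnit's `Assert.fail(String)`, which throws an `AssertionError` carrying the message. `runQuery`, `exampleTest`, and the class name are hypothetical and not taken from the Drill test.

```java
public class FailInsteadOfPrintSketch {

  // Local stand-in for org.junit.Assert.fail(String): throws AssertionError.
  static void fail(String message) {
    throw new AssertionError(message);
  }

  // Hypothetical operation under test; always fails so the catch path is exercised.
  static void runQuery() throws Exception {
    throw new Exception("simulated query failure");
  }

  // Instead of ex.printStackTrace(), report the exception message as the
  // failure reason. The test fails visibly and the output stays short.
  public static void exampleTest() {
    try {
      runQuery();
    } catch (Exception ex) {
      fail("Unexpected exception: " + ex.getMessage());
    }
  }

  public static void main(String[] args) {
    try {
      exampleTest();
    } catch (AssertionError e) {
      System.out.println("test failed with: " + e.getMessage());
    }
  }
}
```

Whether the original test expects the exception or treats it as a failure is not visible in the quoted diff, so the catch block here only illustrates the reporting style.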



> Malformed JSON should not stop the entire query from progressing
> 
>
> Key: DRILL-4653
> URL: https://issues.apache.org/jira/browse/DRILL-4653
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - JSON
>Affects Versions: 1.6.0
>Reporter: subbu srinivasan
>
> Currently Drill query terminates upon first encounter of a invalid JSON line.
> Drill has to continue progressing after ignoring the bad records. Something 
> similar to a setting of (ignore.malformed.json) would help.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DRILL-4653) Malformed JSON should not stop the entire query from progressing

2016-06-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15334261#comment-15334261
 ] 

ASF GitHub Bot commented on DRILL-4653:
---

Github user amansinha100 commented on a diff in the pull request:

https://github.com/apache/drill/pull/518#discussion_r67390506
  
--- Diff: 
exec/java-exec/src/test/java/org/apache/drill/exec/store/json/TestJsonRecordReader.java
 ---
@@ -179,4 +181,28 @@ public void testNestedFilter() throws Exception {
 .sqlBaselineQuery(baselineQuery)
 .go();
   }
+
+
+ @Test // See DRILL-4653
+  public void testSkippingInvalidJSONRecords() throws Exception {
--- End diff --

For both of these tests, could you pls use the testBuilder() framework? This 
is the recommended way to write the unit tests; you can see it in one of the 
other tests in this file.


> Malformed JSON should not stop the entire query from progressing
> 
>
> Key: DRILL-4653
> URL: https://issues.apache.org/jira/browse/DRILL-4653
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - JSON
>Affects Versions: 1.6.0
>Reporter: subbu srinivasan
>
> Currently Drill query terminates upon first encounter of a invalid JSON line.
> Drill has to continue progressing after ignoring the bad records. Something 
> similar to a setting of (ignore.malformed.json) would help.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DRILL-4653) Malformed JSON should not stop the entire query from progressing

2016-06-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15334254#comment-15334254
 ] 

ASF GitHub Bot commented on DRILL-4653:
---

Github user amansinha100 commented on a diff in the pull request:

https://github.com/apache/drill/pull/518#discussion_r67389956
  
--- Diff: 
exec/java-exec/src/test/java/org/apache/drill/exec/store/json/TestJsonRecordReader.java
 ---
@@ -116,6 +117,7 @@ public void testMixedNumberTypes() throws Exception {
   .jsonBaselineFile("jsoninput/mixed_number_types.json")
   .build().run();
 } catch (Exception ex) {
+  ex.printStackTrace();
--- End diff --

not sure why this printStackTrace was added in a different test from the 
ones that you added...


> Malformed JSON should not stop the entire query from progressing
> 
>
> Key: DRILL-4653
> URL: https://issues.apache.org/jira/browse/DRILL-4653
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - JSON
>Affects Versions: 1.6.0
>Reporter: subbu srinivasan
>
> Currently Drill query terminates upon first encounter of a invalid JSON line.
> Drill has to continue progressing after ignoring the bad records. Something 
> similar to a setting of (ignore.malformed.json) would help.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DRILL-4653) Malformed JSON should not stop the entire query from progressing

2016-06-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15334250#comment-15334250
 ] 

ASF GitHub Bot commented on DRILL-4653:
---

Github user amansinha100 commented on a diff in the pull request:

https://github.com/apache/drill/pull/518#discussion_r67389846
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/json/JSONRecordReader.java
 ---
@@ -189,39 +194,37 @@ private long currentRecordNumberInFile() {
   public int next() {
 writer.allocate();
 writer.reset();
-
 recordCount = 0;
 ReadState write = null;
 //Stopwatch p = new Stopwatch().start();
-try{
-  outside: while(recordCount < DEFAULT_ROWS_PER_BATCH) {
-writer.setPosition(recordCount);
-write = jsonReader.write(writer);
-
-if(write == ReadState.WRITE_SUCCEED) {
+   // try
+   // {
+  outside: while(recordCount < DEFAULT_ROWS_PER_BATCH){
+  try{
+writer.setPosition(recordCount);
+write = jsonReader.write(writer);
+if(write == ReadState.WRITE_SUCCEED) {
 //  logger.debug("Wrote record.");
-  recordCount++;
-}else{
+  recordCount++;
+}else{
 //  logger.debug("Exiting.");
-  break outside;
-}
-
+  break outside;
+}
   }
-
-  jsonReader.ensureAtLeastOneField(writer);
-
+  catch(Exception ex)
+  {
+   ++parseErrorCount;
--- End diff --

the indentations seem to be off here as well as other places.. can you make 
sure the indentations match the rest of the code ?


> Malformed JSON should not stop the entire query from progressing
> 
>
> Key: DRILL-4653
> URL: https://issues.apache.org/jira/browse/DRILL-4653
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - JSON
>Affects Versions: 1.6.0
>Reporter: subbu srinivasan
>
> Currently Drill query terminates upon first encounter of a invalid JSON line.
> Drill has to continue progressing after ignoring the bad records. Something 
> similar to a setting of (ignore.malformed.json) would help.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DRILL-4653) Malformed JSON should not stop the entire query from progressing

2016-06-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15334248#comment-15334248
 ] 

ASF GitHub Bot commented on DRILL-4653:
---

Github user amansinha100 commented on a diff in the pull request:

https://github.com/apache/drill/pull/518#discussion_r67389726
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/json/JSONRecordReader.java
 ---
@@ -189,39 +194,37 @@ private long currentRecordNumberInFile() {
   public int next() {
 writer.allocate();
 writer.reset();
-
 recordCount = 0;
 ReadState write = null;
 //Stopwatch p = new Stopwatch().start();
-try{
-  outside: while(recordCount < DEFAULT_ROWS_PER_BATCH) {
-writer.setPosition(recordCount);
-write = jsonReader.write(writer);
-
-if(write == ReadState.WRITE_SUCCEED) {
+   // try
+   // {
+  outside: while(recordCount < DEFAULT_ROWS_PER_BATCH){
+  try{
+writer.setPosition(recordCount);
+write = jsonReader.write(writer);
+if(write == ReadState.WRITE_SUCCEED) {
 //  logger.debug("Wrote record.");
-  recordCount++;
-}else{
+  recordCount++;
+}else{
 //  logger.debug("Exiting.");
-  break outside;
-}
-
+  break outside;
+}
   }
-
-  jsonReader.ensureAtLeastOneField(writer);
-
+  catch(Exception ex)
--- End diff --

Minor style convention: can you put the catch() on the previous line, after 
the closing brace of the try block?


> Malformed JSON should not stop the entire query from progressing
> 
>
> Key: DRILL-4653
> URL: https://issues.apache.org/jira/browse/DRILL-4653
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - JSON
>Affects Versions: 1.6.0
>Reporter: subbu srinivasan
>
> Currently Drill query terminates upon first encounter of a invalid JSON line.
> Drill has to continue progressing after ignoring the bad records. Something 
> similar to a setting of (ignore.malformed.json) would help.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DRILL-4653) Malformed JSON should not stop the entire query from progressing

2016-06-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15334242#comment-15334242
 ] 

ASF GitHub Bot commented on DRILL-4653:
---

Github user amansinha100 commented on a diff in the pull request:

https://github.com/apache/drill/pull/518#discussion_r67389362
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/json/JSONRecordReader.java
 ---
@@ -189,39 +194,37 @@ private long currentRecordNumberInFile() {
   public int next() {
 writer.allocate();
 writer.reset();
-
 recordCount = 0;
 ReadState write = null;
 //Stopwatch p = new Stopwatch().start();
-try{
-  outside: while(recordCount < DEFAULT_ROWS_PER_BATCH) {
-writer.setPosition(recordCount);
-write = jsonReader.write(writer);
-
-if(write == ReadState.WRITE_SUCCEED) {
+   // try
--- End diff --

remove these commented lines




[jira] [Commented] (DRILL-4653) Malformed JSON should not stop the entire query from progressing

2016-06-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15334236#comment-15334236
 ] 

ASF GitHub Bot commented on DRILL-4653:
---

Github user amansinha100 commented on a diff in the pull request:

https://github.com/apache/drill/pull/518#discussion_r67389073
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/ExecConstants.java ---
@@ -135,6 +135,9 @@
   BooleanValidator JSON_EXTENDED_TYPES = new BooleanValidator("store.json.extended_types", false);
   BooleanValidator JSON_WRITER_UGLIFY = new BooleanValidator("store.json.writer.uglify", false);
   BooleanValidator JSON_WRITER_SKIPNULLFIELDS = new BooleanValidator("store.json.writer.skip_null_fields", true);
+  String JSON_READER_SKIP_MALFORMED_RECORDS_FLAG = "store.json.reader.skip_malformed_records";
--- End diff --

Can you change this to 'skip_invalid_records' so that the name is 
consistent with the similar option planned in DRILL-3764? In the future, 
the JSON option would likely be subsumed by the new global option.




[jira] [Commented] (DRILL-4653) Malformed JSON should not stop the entire query from progressing

2016-06-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15334227#comment-15334227
 ] 

ASF GitHub Bot commented on DRILL-4653:
---

Github user amansinha100 commented on a diff in the pull request:

https://github.com/apache/drill/pull/518#discussion_r67388118
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/json/JSONRecordReader.java
 ---
@@ -39,6 +40,7 @@
 import org.apache.drill.exec.vector.complex.fn.JsonReader;
 import org.apache.drill.exec.vector.complex.impl.VectorContainerWriter;
 import org.apache.hadoop.fs.Path;
+import org.apache.parquet.Log;
--- End diff --

Not sure why org.apache.parquet.Log is imported in the JSON reader.




[jira] [Commented] (DRILL-4653) Malformed JSON should not stop the entire query from progressing

2016-06-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15333142#comment-15333142
 ] 

ASF GitHub Bot commented on DRILL-4653:
---

Github user ssriniva123 commented on the issue:

https://github.com/apache/drill/pull/518
  
I updated the pull request with more changes requested by the reviewers.

On Wed, Jun 15, 2016 at 5:28 PM, Aman Sinha wrote:

> Yes, it does in fact have a conflict with DRILL-3764 which has changes to
> the JsonRecordReader, although this issue is still in progress. I noticed
> that @adeneche mentioned this in the JIRA.
> @ssriniva123 did you get a chance to look at DRILL-3764?





[jira] [Commented] (DRILL-4653) Malformed JSON should not stop the entire query from progressing

2016-06-10 Thread Deneche A. Hakim (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15325482#comment-15325482
 ] 

Deneche A. Hakim commented on DRILL-4653:
-

Sorry for taking so long. Unfortunately, I don't know the answer to this question.



[jira] [Commented] (DRILL-4653) Malformed JSON should not stop the entire query from progressing

2016-06-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15325472#comment-15325472
 ] 

ASF GitHub Bot commented on DRILL-4653:
---

GitHub user ssriniva123 opened a pull request:

https://github.com/apache/drill/pull/518

DRILL-4653.json - Malformed JSON should not stop the entire query from 
progressing

https://issues.apache.org/jira/browse/DRILL-4653

- The default is to stop processing, as happens today, when the JSON parser 
encounters an exception.
- Setting store.json.reader.skip_malformed_records ensures that the query 
progresses after skipping the bad records.
- Added two unit tests.
- Also tested after deploying the new build: both positive and negative 
tests were done.
- Negative test result:
org.apache.drill.common.exceptions.UserRemoteException: DATA_READ ERROR: 
Error parsing JSON - Unexpected character ('{' (code 123)): was expecting comma 
to separate OBJECT entries
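
The behavior described in the list above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in the pull request: the class `SkipOptionSketch`, the `validate` check (a toy stand-in for a real JSON parser), the plain `Map` used to hold options, and the `IllegalStateException` are all assumptions made for the example. Only the option name `store.json.reader.skip_malformed_records` comes from the description.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: fail fast by default, skip bad records when the flag is set. */
final class SkipOptionSketch {
  static final String SKIP_FLAG = "store.json.reader.skip_malformed_records";

  /** Toy check standing in for a real JSON parser (assumption for this sketch). */
  static void validate(String record) throws IOException {
    String t = record.trim();
    if (!(t.startsWith("{") && t.endsWith("}"))) {
      throw new IOException("Error parsing JSON - " + record);
    }
  }

  /** Returns the well-formed records; throws on the first bad one unless the flag is true. */
  static List<String> read(List<String> records, Map<String, Boolean> options) {
    boolean skip = options.getOrDefault(SKIP_FLAG, false); // default: stop on error
    List<String> good = new ArrayList<>();
    for (String record : records) {
      try {
        validate(record);
        good.add(record);
      } catch (IOException ex) {
        if (!skip) {
          // Default behavior: the first malformed record ends the read.
          throw new IllegalStateException("DATA_READ ERROR: " + ex.getMessage(), ex);
        }
        // Flag enabled: drop the bad record and keep going.
      }
    }
    return good;
  }

  public static void main(String[] args) {
    List<String> input = Arrays.asList("{\"a\": 1}", "{\"a\": 2", "{\"a\": 3}");
    Map<String, Boolean> options = new HashMap<>();
    options.put(SKIP_FLAG, true);
    System.out.println("records read: " + read(input, options).size());
  }
}
```

With an empty options map the same input raises the exception, which corresponds to the negative test described above.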

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ssriniva123/drill master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/drill/pull/518.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #518


commit 4fc70faf0e5ad5d434b944d084bc0b0e90d41c39
Author: Subbu Srinivasan 
Date:   2016-06-10T22:58:49Z

Fixes for DRILL-4653.json






[jira] [Commented] (DRILL-4653) Malformed JSON should not stop the entire query from progressing

2016-06-05 Thread Subbu Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15315948#comment-15315948
 ] 

Subbu Srinivasan commented on DRILL-4653:
-

Hi Deneche,
Quick question. Do you know whether we have any documentation on how Drill
downloads and processes files from S3? It must be using a location on disk.
Where is that location, and how is it configured in a production environment?

On Wed, May 4, 2016 at 10:24 AM, Deneche A. Hakim (JIRA) 





[jira] [Commented] (DRILL-4653) Malformed JSON should not stop the entire query from progressing

2016-05-04 Thread Deneche A. Hakim (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15271026#comment-15271026
 ] 

Deneche A. Hakim commented on DRILL-4653:
-

You may also be interested in the following JIRA: DRILL-3764



[jira] [Commented] (DRILL-4653) Malformed JSON should not stop the entire query from progressing

2016-05-03 Thread subbu srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15269945#comment-15269945
 ] 

subbu srinivasan commented on DRILL-4653:
-

Folks,
I went through the code for JSON parsing. The main call for JSON 
deserialization is in JsonReader, which is called from JSONRecordReader. 
The issue is that handleAndRaise is called for every caught exception.
Would the proposal below be acceptable to the community?
The proposal is to catch the IOException for each record and not bail out.
try {
  outside: while (recordCount < BaseValueVector.INITIAL_VALUE_ALLOCATION) {
    try {
      writer.setPosition(recordCount);
      write = jsonReader.write(writer);
      if (write == ReadState.WRITE_SUCCEED) {
        // logger.debug("Wrote record.");
        recordCount++;
      } else {
        // logger.debug("Exiting.");
        break outside;
      }
    } catch (IOException ex) {
      logger.error("Ignoring record. Error parsing JSON: ", ex);
      ++parseErrorCount;
    }
  }
  // ... (remainder of the method omitted)
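
For readers who want to try the idea outside Drill, here is a small self-contained sketch of the same loop shape. It is only an illustration under stated assumptions: `SkipBadRecordsSketch`, `parseRecord`, and `nextBatch` are hypothetical names, the toy check in `parseRecord` stands in for `jsonReader.write(writer)`, and the input is assumed to hold one record per line so that a bad record can be dropped without resynchronizing a streaming parser.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for the batch loop; this is not Drill code. */
final class SkipBadRecordsSketch {
  /** Number of records dropped because they failed to parse. */
  int parseErrorCount = 0;

  /** Toy parser (assumption): accepts only lines shaped like a JSON object. */
  static String parseRecord(String line) throws IOException {
    String trimmed = line.trim();
    if (!trimmed.startsWith("{") || !trimmed.endsWith("}")) {
      throw new IOException("Error parsing JSON: " + line);
    }
    return trimmed;
  }

  /** Fills a batch of at most maxRows records, skipping and counting bad lines. */
  List<String> nextBatch(List<String> lines, int maxRows) {
    List<String> batch = new ArrayList<>();
    for (String line : lines) {
      if (batch.size() >= maxRows) {
        break; // batch is full
      }
      try {
        batch.add(parseRecord(line));
      } catch (IOException ex) {
        // The try/catch sits inside the loop, so one bad record
        // does not end the whole batch.
        ++parseErrorCount;
      }
    }
    return batch;
  }

  public static void main(String[] args) {
    SkipBadRecordsSketch reader = new SkipBadRecordsSketch();
    List<String> batch =
        reader.nextBatch(Arrays.asList("{\"a\": 1}", "{\"a\": ", "{\"a\": 3}"), 10);
    System.out.println(batch.size() + " read, " + reader.parseErrorCount + " skipped");
  }
}
```

The point of the sketch is the placement of the inner try/catch: an exception caught outside the loop would abandon every record after the first failure, while one caught inside only costs the failing record.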
