[jira] [Commented] (SOLR-12094) JsonRecordReader ignores root record fields after the split point
[ https://issues.apache.org/jira/browse/SOLR-12094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16428205#comment-16428205 ]

Andrzej Wislowski commented on SOLR-12094:
------------------------------------------

[~dweiss] I think it is a good idea. I will take a look at this code and try to create such a patch.

> JsonRecordReader ignores root record fields after the split point
> ------------------------------------------------------------------
>
> Key: SOLR-12094
> URL: https://issues.apache.org/jira/browse/SOLR-12094
> Project: Solr
> Issue Type: Bug
> Security Level: Public (Default Security Level. Issues are Public)
> Components: SolrJ
> Affects Versions: master (8.0)
> Reporter: Przemysław Szeremiota
> Priority: Major
> Attachments: SOLR-12094.patch, SOLR-12094.patch, json-record-reader-bug.patch
>
> JsonRecordReader, when configured with a split other than the top level, ignores
> all top-level JSON nodes after the split ends, for example:
> {code}
> {
>   "first": "John",
>   "last": "Doe",
>   "grade": 8,
>   "exams": [
>     {
>       "subject": "Maths",
>       "test": "term1",
>       "marks": 90
>     },
>     {
>       "subject": "Biology",
>       "test": "term1",
>       "marks": 86
>     }
>   ],
>   "after": "456"
> }
> {code}
> Node "after" won't be visible in the SolrInputDocument constructed from
> /update/json/docs.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-12094) JsonRecordReader ignores root record fields after the split point
[ https://issues.apache.org/jira/browse/SOLR-12094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16428202#comment-16428202 ]

Noble Paul commented on SOLR-12094:
-----------------------------------

I agree that we should be able to handle this use case as well. However, the primary objective is to handle streaming input well; non-streaming parsing should be optional.
[jira] [Commented] (SOLR-12094) JsonRecordReader ignores root record fields after the split point
[ https://issues.apache.org/jira/browse/SOLR-12094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16428179#comment-16428179 ]

Dawid Weiss commented on SOLR-12094:
------------------------------------

I understand the concept of "streaming" imports, but this just seems wrong to me here. An analogy would be XSLT or other technologies where the implementation permits an efficient "streaming" mode in certain cases, unless the input makes it impossible. I perceive a similar situation here: the parser should handle the input efficiently if possible, but it should also be able to process any type of input, even input that cannot be processed without keeping some history.

Sure, an abuse case of millions of split nodes awaiting a single attribute is possible, but even then it'd be simpler to just say "yeah, buffer up until you can emit the output" than to modify the structure of such a JSON document (write a converter so that the nested nodes are always placed at the end of the parent).

[~awislowski] do you think you'd be able to modify the patch so that it accepts an argument and switches between a 'strict streaming' mode and a 'relaxed' mode? In 'strict streaming' mode there should be no buffering, and the parser should complain with an exception if it encounters extra nodes after the split. In 'relaxed' mode the parser should buffer up the information until it's complete and can be emitted.
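The two modes proposed in the comment above could be sketched roughly as follows. This is a toy Python model, not the JsonRecordReader code and not the attached patch: the function name `split_records`, the `strict` flag, and the flat merging of parent and child fields are all assumptions made for illustration, and the root object is represented as an already-parsed sequence of (name, value) pairs in document order.

```python
def split_records(fields, split_key, strict=True):
    """Yield one flat record per child of split_key.

    fields: (name, value) pairs of the root object, in document order.
    strict=True  -> no buffering; a root field after the split raises.
    strict=False -> children are buffered until the whole root is read.
    Both the name and the flag are hypothetical.
    """
    parent = {}
    buffered = []
    seen_split = False
    for name, value in fields:
        if name == split_key:
            seen_split = True
            for child in value:
                if strict:
                    # Emit immediately using the parent fields seen so far.
                    yield {**parent, **child}
                else:
                    buffered.append(child)
        elif seen_split and strict:
            raise ValueError(
                f"field {name!r} appears after split point {split_key!r}"
            )
        else:
            parent[name] = value
    # Relaxed mode: the parent is now complete, so emit what was held back.
    for child in buffered:
        yield {**parent, **child}


# The example document from the issue description.
doc = {
    "first": "John",
    "last": "Doe",
    "grade": 8,
    "exams": [
        {"subject": "Maths", "test": "term1", "marks": 90},
        {"subject": "Biology", "test": "term1", "marks": 86},
    ],
    "after": "456",
}

relaxed = list(split_records(doc.items(), "exams", strict=False))
```

In this model the relaxed call returns two records that both carry `after`, at the cost of holding every `exams` child in memory until the root object ends. The strict call emits both records and then raises on `after`, so a caller would still need to decide what to do with records already emitted before the error.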
[jira] [Commented] (SOLR-12094) JsonRecordReader ignores root record fields after the split point
[ https://issues.apache.org/jira/browse/SOLR-12094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417008#comment-16417008 ]

Noble Paul commented on SOLR-12094:
-----------------------------------

You are right, it's not a good idea to ignore this. It should probably throw an exception if it encounters such JSON.

It's possible to implement a non-streaming solution. The user may pass an optional parameter to switch to that mode.
[jira] [Commented] (SOLR-12094) JsonRecordReader ignores root record fields after the split point
[ https://issues.apache.org/jira/browse/SOLR-12094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417003#comment-16417003 ]

Dawid Weiss commented on SOLR-12094:
------------------------------------

I understand, but I also believe it's really likely that people have such nested JSON documents and will want to use them. Right now it just quietly discards those trailing entries, and I don't think that's good either: it should either signal an exception (probably pointing at a non-streaming solution, if there is any) or work correctly. What do you think?
[jira] [Commented] (SOLR-12094) JsonRecordReader ignores root record fields after the split point
[ https://issues.apache.org/jira/browse/SOLR-12094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16416475#comment-16416475 ]

Noble Paul commented on SOLR-12094:
-----------------------------------

Before going into the patch: I can see that it is not designed to work like that. The reason is that {{JsonRecordReader}} is a streaming parser. To include {{'after'}} in the document, it must hold all the data in {{'exams'}} in memory. So it is going to seriously affect the performance of the parser for the normal use case.
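The streaming constraint described in the comment above can be illustrated with a small toy model. It is written in Python and is not the JsonRecordReader implementation: `top_level_fields` and `stream_records` are hypothetical names, the real field naming in Solr may differ, and the standard `json` module stands in for a pull parser by exposing the root object as (name, value) pairs in document order.

```python
import json

# The example document from the issue description.
DOC = """
{
  "first": "John",
  "last": "Doe",
  "grade": 8,
  "exams": [
    {"subject": "Maths", "test": "term1", "marks": 90},
    {"subject": "Biology", "test": "term1", "marks": 86}
  ],
  "after": "456"
}
"""


def top_level_fields(text):
    """Yield (name, value) pairs of the root object in document order.

    Stand-in for a pull parser: the text is parsed up front, but the
    consumer below only ever sees one root field at a time.
    """
    # object_pairs_hook=list keeps every JSON object as a list of pairs.
    yield from json.loads(text, object_pairs_hook=list)


def stream_records(fields, split_key):
    """Emit one flat record per child of split_key, as soon as it is read."""
    parent = {}
    for name, value in fields:
        if name == split_key:
            for child in value:
                # Only the parent fields read so far can be included here.
                yield {**parent, **dict(child)}
        else:
            # Fields after the split land here, too late for any record.
            parent[name] = value


records = list(stream_records(top_level_fields(DOC), "exams"))
```

Each record is produced while `after` is still unread, which is why it never appears in the output of this model. Including it would mean delaying every record until the root object ends, that is, holding all of `exams` in memory.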
[jira] [Commented] (SOLR-12094) JsonRecordReader ignores root record fields after the split point
[ https://issues.apache.org/jira/browse/SOLR-12094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16416136#comment-16416136 ]

Dawid Weiss commented on SOLR-12094:
------------------------------------

I looked at the code of that streaming parser and it is quite complex. It seems like all this node copying and record trickery could be avoided, but that would be a significantly more complex patch.

[~noble.paul] - you seem to be much more involved in the parser development, would you like to take a look before I commit it?