[jira] [Commented] (SPARK-18352) Parse normal, multi-line JSON files (not just JSON Lines)
[ https://issues.apache.org/jira/browse/SPARK-18352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15890907#comment-15890907 ]

Apache Spark commented on SPARK-18352:
--------------------------------------

User 'felixcheung' has created a pull request for this issue:
https://github.com/apache/spark/pull/17128

> Parse normal, multi-line JSON files (not just JSON Lines)
> ---------------------------------------------------------
>
>                 Key: SPARK-18352
>                 URL: https://issues.apache.org/jira/browse/SPARK-18352
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>            Reporter: Reynold Xin
>            Assignee: Nathan Howell
>              Labels: releasenotes
>             Fix For: 2.2.0
>
> Spark currently can only parse JSON files that are JSON lines, i.e. each
> record has an entire line and records are separated by new line. In reality,
> a lot of users want to use Spark to parse actual JSON files, and are
> surprised to learn that it doesn't do that.
> We can introduce a new mode (wholeJsonFile?) in which we don't split the
> files, and rather stream through them to parse the JSON files.

--
This message was sent by Atlassian JIRA (v6.3.15#6346)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
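The limitation described in the issue can be illustrated outside Spark. The following is a minimal sketch using only Python's standard `json` module, not Spark's reader; the sample data and the helper names `parse_per_line` and `per_line_fails` are hypothetical and exist only for this illustration. It shows that a reader which treats each line as one record accepts JSON Lines but rejects the same kind of record once it is pretty-printed across several lines.

```python
import json

# Hypothetical sample data: two records as JSON Lines, one record pretty-printed.
JSON_LINES = '{"id": 1, "name": "a"}\n{"id": 2, "name": "b"}\n'
MULTI_LINE = '{\n  "id": 1,\n  "name": "a"\n}\n'


def parse_per_line(text):
    """Parse one JSON value per non-empty line, as a JSON Lines reader would."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def per_line_fails(text):
    """Return True if per-line parsing rejects the text."""
    try:
        parse_per_line(text)
        return False
    except json.JSONDecodeError:
        return True
```

Here `parse_per_line(JSON_LINES)` yields two records, while `per_line_fails(MULTI_LINE)` is true because the first line, a lone `{`, is not a complete JSON value. Parsing `MULTI_LINE` as a whole with `json.loads` succeeds.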
[jira] [Commented] (SPARK-18352) Parse normal, multi-line JSON files (not just JSON Lines)
[ https://issues.apache.org/jira/browse/SPARK-18352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15771689#comment-15771689 ]

Apache Spark commented on SPARK-18352:
--------------------------------------

User 'NathanHowell' has created a pull request for this issue:
https://github.com/apache/spark/pull/16386
[jira] [Commented] (SPARK-18352) Parse normal, multi-line JSON files (not just JSON Lines)
[ https://issues.apache.org/jira/browse/SPARK-18352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15706213#comment-15706213 ]

Josh Rosen commented on SPARK-18352:
------------------------------------

Yeah, I'll update my patch to roll back my JSON changes so it shouldn't conflict.
[jira] [Commented] (SPARK-18352) Parse normal, multi-line JSON files (not just JSON Lines)
[ https://issues.apache.org/jira/browse/SPARK-18352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15705944#comment-15705944 ]

Reynold Xin commented on SPARK-18352:
-------------------------------------

I've asked [~joshrosen] to do that only for the text format, and not json.
[jira] [Commented] (SPARK-18352) Parse normal, multi-line JSON files (not just JSON Lines)
[ https://issues.apache.org/jira/browse/SPARK-18352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15705940#comment-15705940 ]

Nathan Howell commented on SPARK-18352:
---------------------------------------

Got hung up on some other stuff, haven't been able to get back to adding tests yet. WIP code is up here:
https://github.com/NathanHowell/spark/commits/SPARK-18352

Question though. https://github.com/apache/spark/pull/15813 touches a bunch of areas I was also working on. Do you think this patch will land soon? Should I rework mine on top?
[jira] [Commented] (SPARK-18352) Parse normal, multi-line JSON files (not just JSON Lines)
[ https://issues.apache.org/jira/browse/SPARK-18352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15675966#comment-15675966 ]

Nathan Howell commented on SPARK-18352:
---------------------------------------

Sounds good to me. I have an implementation that's passing basic tests but needs to be cleaned up a bit. I'll get a pull request up in the next few days.
[jira] [Commented] (SPARK-18352) Parse normal, multi-line JSON files (not just JSON Lines)
[ https://issues.apache.org/jira/browse/SPARK-18352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15675517#comment-15675517 ]

Reynold Xin commented on SPARK-18352:
-------------------------------------

Actually just talked to [~marmbrus] and now I understand more how JSON reader works. I'd say we always turn the top level array into multiple records, and then have only one option: wholeFile. This same option can be used in json and text.
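The rule proposed in this comment (a top-level array always becomes multiple records) can be sketched as follows. This is an illustration in plain Python with the standard `json` module, not Spark's reader; the function name `records_from_document` is hypothetical. For simplicity it loads the whole document into memory, so it does not address the streaming requirement discussed elsewhere in the thread.

```python
import json


def records_from_document(text):
    """Yield one record per element of a top-level array, else the value itself."""
    value = json.loads(text)
    if isinstance(value, list):
        # Top-level array: each element becomes its own record.
        yield from value
    else:
        # Any other top-level value (typically an object) is a single record.
        yield value
```

Under this sketch, a file containing `[{"a": 1}, {"a": 2}]` produces two records and a file containing `{"a": 1}` produces one, so no separate option is needed to choose between the two readings.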
[jira] [Commented] (SPARK-18352) Parse normal, multi-line JSON files (not just JSON Lines)
[ https://issues.apache.org/jira/browse/SPARK-18352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15675466#comment-15675466 ]

Hyukjin Kwon commented on SPARK-18352:
--------------------------------------

Ah, you meant producing each row while parsing the whole text in iteration. I see.
[jira] [Commented] (SPARK-18352) Parse normal, multi-line JSON files (not just JSON Lines)
[ https://issues.apache.org/jira/browse/SPARK-18352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15675446#comment-15675446 ]

Reynold Xin commented on SPARK-18352:
-------------------------------------

No that's not sufficient. It doesn't do streaming.
[jira] [Commented] (SPARK-18352) Parse normal, multi-line JSON files (not just JSON Lines)
[ https://issues.apache.org/jira/browse/SPARK-18352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15675441#comment-15675441 ]

Hyukjin Kwon commented on SPARK-18352:
--------------------------------------

Hi [~rxin], I think this can be done simply after https://github.com/apache/spark/pull/14151 and https://github.com/apache/spark/pull/15813 are merged. I guess we could just add another option in `JSONOptions` which sets `wholetext` internally. Is this what you already have in mind? If so, I can work on this if no one else is supposed to. (I am fine if someone is assigned to this internally.)
[jira] [Commented] (SPARK-18352) Parse normal, multi-line JSON files (not just JSON Lines)
[ https://issues.apache.org/jira/browse/SPARK-18352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15675437#comment-15675437 ]

Reynold Xin commented on SPARK-18352:
-------------------------------------

I guess maybe it should be a user-configurable option? Otherwise Spark on its own doesn't have enough information to disambiguate.
[jira] [Commented] (SPARK-18352) Parse normal, multi-line JSON files (not just JSON Lines)
[ https://issues.apache.org/jira/browse/SPARK-18352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15675421#comment-15675421 ]

Nathan Howell commented on SPARK-18352:
---------------------------------------

Do you have any ideas how to support this? {{DataFrameReader.schema}} currently takes a {{StructType}} and the existing row level json reader flattens arrays out to support this restriction.
[jira] [Commented] (SPARK-18352) Parse normal, multi-line JSON files (not just JSON Lines)
[ https://issues.apache.org/jira/browse/SPARK-18352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15675405#comment-15675405 ]

Reynold Xin commented on SPARK-18352:
-------------------------------------

Are these actually record delimiters? If the top level structure is an array, would we want to parse a single file as multiple records?
[jira] [Commented] (SPARK-18352) Parse normal, multi-line JSON files (not just JSON Lines)
[ https://issues.apache.org/jira/browse/SPARK-18352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15675386#comment-15675386 ]

Nathan Howell commented on SPARK-18352:
---------------------------------------

Any opinions on configuring this with an option instead of creating a new data source? It looks fairly straightforward to support this as an option. E.g.:

{code}
// parse one json value per line
// this would be the default behavior, for backwards compatibility
spark.read.option("recordDelimiter", "line").json(???)

// parse one json value per file
spark.read.option("recordDelimiter", "file").json(???)
{code}

The refactoring work would be the same in either case, but it would require less plumbing for Python/Java/etc to enable this with an option.

As an aside... it also is straightforward to extend this to support {{Text}} and {{UTF8String}} values directly, avoiding a string conversion of the entire column prior to parsing.
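The option-based design proposed in this comment can be sketched as a single entry point that switches behavior on one setting. The sketch below is plain Python using the standard `json` module, not the Spark API; the function `read_json` and its `record_delimiter` parameter are hypothetical and only mirror the `recordDelimiter` values `"line"` and `"file"` from the proposal.

```python
import json


def read_json(text, record_delimiter="line"):
    """Parse JSON text according to a hypothetical record_delimiter option."""
    if record_delimiter == "line":
        # Default, for backwards compatibility: one JSON value per non-empty line.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    if record_delimiter == "file":
        # One JSON value for the whole input.
        return [json.loads(text)]
    raise ValueError(f"unsupported record_delimiter: {record_delimiter!r}")
```

Keeping `"line"` as the default leaves existing callers unaffected, which is the backwards-compatibility argument made in the comment.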
[jira] [Commented] (SPARK-18352) Parse normal, multi-line JSON files (not just JSON Lines)
[ https://issues.apache.org/jira/browse/SPARK-18352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15650025#comment-15650025 ]

Reynold Xin commented on SPARK-18352:
-------------------------------------

Again, this has nothing to do with streaming. It should just be an option (e.g. multilineJson, or wholeFile) for JSON.
[jira] [Commented] (SPARK-18352) Parse normal, multi-line JSON files (not just JSON Lines)
[ https://issues.apache.org/jira/browse/SPARK-18352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649982#comment-15649982 ]

Thomas Sebastian commented on SPARK-18352:
------------------------------------------

Hi Reynold,
So, do you mean that the stream API need not be used, and there should be a new API which can read multiple json files?
-Thomas
[jira] [Commented] (SPARK-18352) Parse normal, multi-line JSON files (not just JSON Lines)
[ https://issues.apache.org/jira/browse/SPARK-18352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649835#comment-15649835 ]

Reynold Xin commented on SPARK-18352:
-------------------------------------

There is already a readStream.json. "Stream" here means not having to read the entire file in memory at once, but rather just "stream through" it, i.e. parse as we scan.
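"Parse as we scan" can be approximated with a small sketch in plain Python. It reads the input in chunks and emits each top-level JSON value as soon as it is complete, so only the current value and one chunk are buffered rather than the whole file. The function name `stream_values` and the `chunk_size` parameter are hypothetical. The sketch assumes every top-level value is an object or an array, since a bare number cut at a chunk boundary could be mistaken for a complete value. It also retries the buffer after every chunk, which is simple but wasteful; a real implementation would more likely use a token-level streaming parser.

```python
import json


def stream_values(fp, chunk_size=4096):
    """Yield top-level JSON objects/arrays from fp as soon as each is complete."""
    decoder = json.JSONDecoder()
    buffer = ""
    while True:
        chunk = fp.read(chunk_size)
        buffer += chunk
        while True:
            # raw_decode does not skip leading whitespace, so strip it first.
            buffer = buffer.lstrip()
            if not buffer:
                break
            try:
                value, end = decoder.raw_decode(buffer)
            except json.JSONDecodeError:
                # The value is incomplete so far; wait for more input.
                break
            yield value
            buffer = buffer[end:]
        if not chunk:
            # End of input reached.
            break
    if buffer.strip():
        raise ValueError("trailing data is not a complete JSON value")
```

For example, `list(stream_values(io.StringIO(text)))` (with `import io`) returns the values in order, including values that span several lines.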
[jira] [Commented] (SPARK-18352) Parse normal, multi-line JSON files (not just JSON Lines)
[ https://issues.apache.org/jira/browse/SPARK-18352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649803#comment-15649803 ]

Jayadevan M commented on SPARK-18352:
-------------------------------------

[~rxin] Are you looking for a new API like spark.readStream.json(path), similar to spark.read.json(path)?